A Matrix Algebra Approach to Artificial Intelligence [1 ed.] 9811527695, 9789811527692

Matrix algebra plays an important role in many core artificial intelligence (AI) areas, including machine learning, neural networks, support vector machines, and evolutionary computation.

English Pages 854 [844] Year 2020

Table of contents:
Preface
Structure and Contents
Features and Contributions
Audience
Acknowledgments
A Note from the Family of Dr. Zhang
Contents
List of Notations
List of Figures
List of Tables
List of Algorithms
Part I Introduction to Matrix Algebra
1 Basic Matrix Computation
1.1 Basic Concepts of Vectors and Matrices
1.1.1 Vectors and Matrices
1.1.2 Basic Vector Calculus
1.1.3 Basic Matrix Calculus
1.2 Sets and Linear Mapping
1.2.1 Sets
1.2.2 Linear Mapping
1.3 Norms
1.3.1 Vector Norms
1.3.2 Matrix Norms
1.4 Random Vectors
1.4.1 Statistical Interpretation of Random Vectors
1.4.2 Gaussian Random Vectors
1.5 Basic Performance of Matrices
1.5.1 Quadratic Forms
1.5.2 Determinants
1.5.3 Matrix Eigenvalues
1.5.4 Matrix Trace
1.5.5 Matrix Rank
1.6 Inverse Matrices and Moore–Penrose Inverse Matrices
1.6.1 Inverse Matrices
1.6.2 Left and Right Pseudo-Inverse Matrices
1.6.3 Moore–Penrose Inverse Matrices
1.7 Direct Sum and Hadamard Product
1.7.1 Direct Sum of Matrices
1.7.2 Hadamard Product
1.8 Kronecker Products
1.8.1 Definitions of Kronecker Products
1.8.2 Performance of Kronecker Products
1.9 Vectorization and Matricization
1.9.1 Vectorization and Commutation Matrix
1.9.2 Matricization of Vectors
Brief Summary of This Chapter
References
2 Matrix Differential
2.1 Jacobian Matrix and Gradient Matrix
2.1.1 Jacobian Matrix
2.1.2 Gradient Matrix
2.1.3 Calculation of Partial Derivative and Gradient
2.2 Real Matrix Differential
2.2.1 Calculation of Real Matrix Differential
2.2.2 Jacobian Matrix Identification
2.2.3 Jacobian Matrix of Real Matrix Functions
2.3 Complex Gradient Matrices
2.3.1 Holomorphic Function and Complex Partial Derivative
2.3.2 Complex Matrix Differential
2.3.3 Complex Gradient Matrix Identification
Brief Summary of This Chapter
References
3 Gradient and Optimization
3.1 Real Gradient
3.1.1 Stationary Points and Extreme Points
3.1.2 Real Gradient of f(X)
3.2 Gradient of Complex Variable Function
3.2.1 Extreme Point of Complex Variable Function
3.2.2 Complex Gradient
3.3 Convex Sets and Convex Function Identification
3.3.1 Standard Constrained Optimization Problems
3.3.2 Convex Sets and Convex Functions
3.3.3 Convex Function Identification
3.4 Gradient Methods for Smooth Convex Optimization
3.4.1 Gradient Method
3.4.2 Projected Gradient Method
3.4.3 Convergence Rates
3.5 Nesterov Optimal Gradient Method
3.5.1 Lipschitz Continuous Function
3.5.2 Nesterov Optimal Gradient Algorithms
3.6 Nonsmooth Convex Optimization
3.6.1 Subgradient and Subdifferential
3.6.2 Proximal Operator
3.6.3 Proximal Gradient Method
3.7 Constrained Convex Optimization
3.7.1 Penalty Function Method
3.7.2 Augmented Lagrange Multiplier Method
3.7.3 Lagrange Dual Method
3.7.4 Karush–Kuhn–Tucker Conditions
3.7.5 Alternating Direction Method of Multipliers
3.8 Newton Methods
3.8.1 Newton Method for Unconstrained Optimization
3.8.2 Newton Method for Constrained Optimization
Brief Summary of This Chapter
References
4 Solution of Linear Systems
4.1 Gauss Elimination
4.1.1 Elementary Row Operations
4.1.2 Gauss Elimination for Solving Matrix Equations
4.1.3 Gauss Elimination for Matrix Inversion
4.2 Conjugate Gradient Methods
4.2.1 Conjugate Gradient Algorithm
4.2.2 Biconjugate Gradient Algorithm
4.2.3 Preconditioned Conjugate Gradient Algorithm
4.3 Condition Number of Matrices
4.4 Singular Value Decomposition (SVD)
4.4.1 Singular Value Decomposition
4.4.2 Properties of Singular Values
4.4.3 Singular Value Thresholding
4.5 Least Squares Method
4.5.1 Least Squares Solution
4.5.2 Rank-Deficient Least Squares Solutions
4.6 Tikhonov Regularization and Gauss–Seidel Method
4.6.1 Tikhonov Regularization
4.6.2 Gauss–Seidel Method
4.7 Total Least Squares Method
4.7.1 Total Least Squares Solution
4.7.2 Performances of TLS Solution
4.7.3 Generalized Total Least Square
4.8 Solution of Under-Determined Systems
4.8.1 ℓ1-Norm Minimization
4.8.2 Lasso
4.8.3 LARS
Brief Summary of This Chapter
References
5 Eigenvalue Decomposition
5.1 Eigenvalue Problem and Characteristic Equation
5.1.1 Eigenvalue Problem
5.1.2 Characteristic Polynomial
5.2 Eigenvalues and Eigenvectors
5.2.1 Eigenvalues
5.2.2 Eigenvectors
5.3 Generalized Eigenvalue Decomposition (GEVD)
5.3.1 Generalized Eigenvalue Decomposition
5.3.2 Total Least Squares Method for GEVD
5.4 Rayleigh Quotient and Generalized Rayleigh Quotient
5.4.1 Rayleigh Quotient
5.4.2 Generalized Rayleigh Quotient
5.4.3 Effectiveness of Class Discrimination
Brief Summary of This Chapter
References
Part II Artificial Intelligence
6 Machine Learning
6.1 Machine Learning Tree
6.2 Optimization in Machine Learning
6.2.1 Single-Objective Composite Optimization
6.2.2 Gradient Aggregation Methods
6.2.3 Coordinate Descent Methods
6.2.4 Benchmark Functions for Single-Objective Optimization
6.3 Majorization-Minimization Algorithms
6.3.1 MM Algorithm Framework
6.3.2 Examples of Majorization-Minimization Algorithms
6.4 Boosting and Probably Approximately Correct Learning
6.4.1 Boosting for Weak Learners
6.4.2 Probably Approximately Correct Learning
6.5 Basic Theory of Machine Learning
6.5.1 Learning Machine
6.5.2 Machine Learning Methods
6.5.3 Expected Performance of Machine Learning Algorithms
6.6 Classification and Regression
6.6.1 Pattern Recognition and Classification
6.6.2 Regression
6.7 Feature Selection
6.7.1 Supervised Feature Selection
6.7.2 Unsupervised Feature Selection
6.7.3 Nonlinear Joint Unsupervised Feature Selection
6.8 Principal Component Analysis
6.8.1 Principal Component Analysis Basis
6.8.2 Minor Component Analysis
6.8.3 Principal Subspace Analysis
6.8.4 Robust Principal Component Analysis
6.8.5 Sparse Principal Component Analysis
6.9 Supervised Learning Regression
6.9.1 Principal Component Regression
6.9.2 Partial Least Squares Regression
6.9.3 Penalized Regression
6.9.4 Gradient Projection for Sparse Reconstruction
6.10 Supervised Learning Classification
6.10.1 Binary Linear Classifiers
6.10.2 Multiclass Linear Classifiers
6.11 Supervised Tensor Learning (STL)
6.11.1 Tensor Algebra Basics
6.11.2 Supervised Tensor Learning Problems
6.11.3 Tensor Fisher Discriminant Analysis
6.11.4 Tensor Learning for Regression
6.11.5 Tensor K-Means Clustering
6.12 Unsupervised Clustering
6.12.1 Similarity Measures
6.12.2 Hierarchical Clustering
6.12.3 Fisher Discriminant Analysis (FDA)
6.12.4 K-Means Clustering
6.13 Spectral Clustering
6.13.1 Spectral Clustering Algorithms
6.13.2 Constrained Spectral Clustering
6.13.3 Fast Spectral Clustering
6.14 Semi-Supervised Learning Algorithms
6.14.1 Semi-Supervised Inductive/Transductive Learning
6.14.2 Self-Training
6.14.3 Co-training
6.15 Canonical Correlation Analysis
6.15.1 Canonical Correlation Analysis Algorithm
6.15.2 Kernel Canonical Correlation Analysis
6.15.3 Penalized Canonical Correlation Analysis
6.16 Graph Machine Learning
6.16.1 Graphs
6.16.2 Graph Laplacian Matrices
6.16.3 Graph Spectrum
6.16.4 Graph Signal Processing
6.16.5 Semi-Supervised Graph Learning: Harmonic Function Method
6.16.6 Semi-Supervised Graph Learning: Min-Cut Method
6.16.7 Unsupervised Graph Learning: Sparse Coding Method
6.17 Active Learning
6.17.1 Active Learning: Background
6.17.2 Statistical Active Learning
6.17.3 Active Learning Algorithms
6.17.4 Active Learning Based Binary Linear Classifiers
6.17.5 Active Learning Using Extreme Learning Machine
6.18 Reinforcement Learning
6.18.1 Basic Concepts and Theory
6.18.2 Markov Decision Process (MDP)
6.19 Q-Learning
6.19.1 Basic Q-Learning
6.19.2 Double Q-Learning and Weighted Double Q-Learning
6.19.3 Online Connectionist Q-Learning Algorithm
6.19.4 Q-Learning with Experience Replay
6.20 Transfer Learning
6.20.1 Notations and Definitions
6.20.2 Categorization of Transfer Learning
6.20.3 Boosting for Transfer Learning
6.20.4 Multitask Learning
6.20.5 EigenTransfer
6.21 Domain Adaptation
6.21.1 Feature Augmentation Method
6.21.2 Cross-Domain Transform Method
6.21.3 Transfer Component Analysis Method
Brief Summary of This Chapter
References
7 Neural Networks
7.1 Neural Network Tree
7.2 From Modern Neural Networks to Deep Learning
7.3 Optimization of Neural Networks
7.3.1 Online Optimization Problems
7.3.2 Adaptive Gradient Algorithm
7.3.3 Adaptive Moment Estimation
7.4 Activation Functions
7.4.1 Logistic Regression and Sigmoid Function
7.4.2 Softmax Regression and Softmax Function
7.4.3 Other Activation Functions
7.5 Recurrent Neural Networks
7.5.1 Conventional Recurrent Neural Networks
7.5.2 Backpropagation Through Time (BPTT)
7.5.3 Jordan Network and Elman Network
7.5.4 Bidirectional Recurrent Neural Networks
7.5.5 Long Short-Term Memory (LSTM)
7.5.6 Improvement of Long Short-Term Memory
7.6 Boltzmann Machines
7.6.1 Hopfield Network and Boltzmann Machines
7.6.2 Restricted Boltzmann Machine
7.6.3 Contrastive Divergence Learning
7.6.4 Multiple Restricted Boltzmann Machines
7.7 Bayesian Neural Networks
7.7.1 Naive Bayesian Classification
7.7.2 Bayesian Classification Theory
7.7.3 Sparse Bayesian Learning
7.8 Convolutional Neural Networks
7.8.1 Hankel Matrix and Convolution
7.8.2 Pooling Layer
7.8.3 Activation Functions in CNNs
7.8.4 Loss Function
7.9 Dropout Learning
7.9.1 Dropout for Shallow and Deep Learning
7.9.2 Dropout Spherical K-Means
7.9.3 DropConnect
7.10 Autoencoders
7.10.1 Basic Autoencoder
7.10.2 Stacked Sparse Autoencoder
7.10.3 Stacked Denoising Autoencoders
7.10.4 Convolutional Autoencoders (CAE)
7.10.5 Stacked Convolutional Denoising Autoencoder
7.10.6 Nonnegative Sparse Autoencoder
7.11 Extreme Learning Machine
7.11.1 Single-Hidden Layer Feedforward Networks with Random Hidden Nodes
7.11.2 Extreme Learning Machine Algorithm for Regression and Binary Classification
7.11.3 Extreme Learning Machine Algorithm for Multiclass Classification
7.12 Graph Embedding
7.12.1 Proximity Measures and Graph Embedding
7.12.2 Multidimensional Scaling
7.12.3 Manifold Learning: Isometric Map
7.12.4 Manifold Learning: Locally Linear Embedding
7.12.5 Manifold Learning: Laplacian Eigenmap
7.13 Network Embedding
7.13.1 Structure and Property Preserving Network Embedding
7.13.2 Community Preserving Network Embedding
7.13.3 Higher-Order Proximity Preserved Network Embedding
7.14 Neural Networks on Graphs
7.14.1 Graph Neural Networks (GNNs)
7.14.2 DeepWalk and GraphSAGE
DeepWalk
GraphSAGE
7.14.3 Graph Convolutional Networks (GCNs)
7.15 Batch Normalization Networks
7.15.1 Batch Normalization
The Effect of BatchNorm on the Lipschitzness
The Effect of BatchNorm on Smoothness
BatchNorm Leads to a Favorable Initialization
7.15.2 Variants and Extensions of Batch Normalization
Batch Renormalization [Ioffe17]
Layer Normalization (LN) [Ba16]
Instance Normalization (IN) [Ulungu99]
Group Normalization (GN) [Wu18]
7.16 Generative Adversarial Networks (GANs)
7.16.1 Generative Adversarial Network Framework
7.16.2 Bidirectional Generative Adversarial Networks
7.16.3 Variational Autoencoders
Brief Summary of This Chapter
References
8 Support Vector Machines
8.1 Support Vector Machines: Basic Theory
8.1.1 Statistical Learning Theory
8.1.2 Linear Support Vector Machines
8.2 Kernel Regression Methods
8.2.1 Reproducing Kernel and Mercer Kernel
8.2.2 Representer Theorem and Kernel Regression
8.2.3 Semi-Supervised and Graph Regression
8.2.4 Kernel Partial Least Squares Regression
8.2.5 Laplacian Support Vector Machines
8.3 Support Vector Machine Regression
8.3.1 Support Vector Machine Regressor
8.3.2 ε-Support Vector Regression
8.3.3 ν-Support Vector Machine Regression
8.4 Support Vector Machine Binary Classification
8.4.1 Support Vector Machine Binary Classifier
8.4.2 ν-Support Vector Machine Binary Classifier
8.4.3 Least Squares SVM Binary Classifier
8.4.4 Proximal Support Vector Machine Binary Classifier
8.4.5 SVM-Recursive Feature Elimination
8.5 Support Vector Machine Multiclass Classification
8.5.1 Decomposition Methods for Multiclass Classification
8.5.2 Least Squares SVM Multiclass Classifier
8.5.3 Proximal Support Vector Machine Multiclass Classifier
8.6 Gaussian Process for Regression and Classification
8.6.1 Joint, Marginal, and Conditional Probabilities
8.6.2 Gaussian Process
8.6.3 Gaussian Process Regression
8.6.4 Gaussian Process Classification
8.7 Relevance Vector Machine
8.7.1 Sparse Bayesian Regression
8.7.2 Sparse Bayesian Classification
8.7.3 Fast Marginal Likelihood Maximization
Brief Summary of This Chapter
References
9 Evolutionary Computation
9.1 Evolutionary Computation Tree
9.2 Multiobjective Optimization
9.2.1 Multiobjective Combinatorial Optimization
9.2.2 Multiobjective Optimization Problems
9.3 Pareto Optimization Theory
9.3.1 Pareto Concepts
9.3.2 Fitness Selection Approach
9.3.3 Nondominated Sorting Approach
9.3.4 Crowding Distance Assignment Approach
9.3.5 Hierarchical Clustering Approach
9.3.6 Benchmark Functions for Multiobjective Optimization
9.4 Noisy Multiobjective Optimization
9.4.1 Pareto Concepts for Noisy Multiobjective Optimization
9.4.2 Performance Metrics for Approximation Sets
9.5 Multiobjective Simulated Annealing
9.5.1 Principle of Simulated Annealing
9.5.2 Multiobjective Simulated Annealing Algorithm
9.5.3 Archived Multiobjective Simulated Annealing
9.6 Genetic Algorithm
9.6.1 Basic Genetic Algorithm Operations
9.6.2 Genetic Algorithm with Gene Rearrangement Clustering
9.7 Nondominated Multiobjective Genetic Algorithms
9.7.1 Fitness Functions
9.7.2 Fitness Selection
9.7.3 Nondominated Sorting Genetic Algorithms
9.7.4 Elitist Nondominated Sorting Genetic Algorithm
9.8 Evolutionary Algorithms (EAs)
9.8.1 (1+1) Evolutionary Algorithm
9.8.2 Theoretical Analysis on Evolutionary Algorithms
9.9 Multiobjective Evolutionary Algorithms
9.9.1 Classical Methods for Solving Multiobjective Optimization Problems
9.9.2 MOEA Based on Decomposition (MOEA/D)
9.9.3 Strength Pareto Evolutionary Algorithm
9.9.4 Achievement Scalarizing Functions
9.10 Evolutionary Programming
9.10.1 Classical Evolutionary Programming
9.10.2 Fast Evolutionary Programming
9.10.3 Hybrid Evolutionary Programming
9.11 Differential Evolution
9.11.1 Classical Differential Evolution
9.11.2 Differential Evolution Variants
9.12 Ant Colony Optimization
9.12.1 Real Ants and Artificial Ants
9.12.2 Typical Ant Colony Optimization Problems
9.12.3 Ant System and Ant Colony System
9.13 Multiobjective Artificial Bee Colony Algorithms
9.13.1 Artificial Bee Colony Algorithms
9.13.2 Variants of ABC Algorithms
9.14 Particle Swarm Optimization
9.14.1 Basic Concepts
9.14.2 The Canonical Particle Swarm
9.14.3 Genetic Learning Particle Swarm Optimization
9.14.4 Particle Swarm Optimization for Feature Selection
9.15 Opposition-Based Evolutionary Computation
9.15.1 Opposition-Based Learning
9.15.2 Opposition-Based Differential Evolution
9.15.3 Two Variants of Opposition-Based Learning
Brief Summary of This Chapter
References
Index


Xian-Da Zhang

A Matrix Algebra Approach to Artificial Intelligence


Xian-Da Zhang (Deceased)
Department of Automation
Tsinghua University
Beijing, Beijing, China

ISBN 978-981-15-2769-2
ISBN 978-981-15-2770-8 (eBook)
https://doi.org/10.1007/978-981-15-2770-8

© Springer Nature Singapore Pte Ltd. 2020

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

Human intelligence is the intellectual prowess of humans, marked by four basic and important abilities: learning, cognition (acquiring and storing knowledge), generalization, and computation. Correspondingly, artificial intelligence (AI) also rests on four basic and important methods: machine learning (learning intelligence), neural networks (cognitive intelligence), support vector machines (generalization intelligence), and evolutionary computation (computational intelligence). The development of AI is built on mathematics. For example, multivariate calculus handles numerical optimization, which is the driving force behind most machine learning algorithms. The main mathematical tools in AI are matrix algebra, optimization, and mathematical statistics, but the latter two are usually described and applied in matrix form. Matrix algebra is therefore a mathematical tool of fundamental importance in most AI subjects. The aim of this book is to provide solid matrix algebra theory and methods for four basic and important AI fields: machine learning, neural networks, support vector machines, and evolutionary computation.
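The sketch below is a minimal illustration, not taken from the book, of what "optimization described and applied in matrix form" means in practice: plain gradient descent on the least-squares objective f(w) = (1/2)||Aw - b||^2, whose gradient A^T(Aw - b) is a pure matrix-algebra expression. The random data, step size, and iteration count are assumptions chosen only for the demonstration.

    # Minimal sketch: gradient descent in matrix form on f(w) = 0.5*||A w - b||^2.
    # The data, step size, and iteration count are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 5))                 # data matrix
    w_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
    b = A @ w_true + 0.01 * rng.standard_normal(100)  # noisy observations

    w = np.zeros(5)
    step = 1.0 / np.linalg.norm(A, 2) ** 2            # 1 / (largest singular value)^2
    for _ in range(500):
        w -= step * A.T @ (A @ w - b)                 # gradient step, all matrix algebra

    print(np.round(w, 3))                             # approaches w_true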

Structure and Contents

The book consists of two parts. Part I (Introduction to Matrix Algebra) provides the fundamentals of matrix algebra and contains Chaps. 1 through 5. Chapter 1 presents the basic operations and performances of matrices, followed by a description of the vectorization of matrices and the matricization of vectors. Chapter 2 is devoted to the matrix differential as an important and effective tool in gradient computation and optimization. Chapter 3 is concerned with convex optimization theory and methods, focusing on gradient/subgradient methods for smooth and nonsmooth convex optimization and on constrained convex optimization. Chapter 4 describes the singular value decomposition (SVD), together with Tikhonov regularization and total least squares for solving over-determined matrix equations, followed by the Lasso and LARS methods for solving under-determined matrix equations. Chapter 5 is devoted to the eigenvalue decomposition (EVD), the generalized eigenvalue decomposition, the Rayleigh quotient, and the generalized Rayleigh quotient.

Part II (Artificial Intelligence) focuses on machine learning, neural networks, support vector machines (SVMs), and evolutionary computation from the perspective of matrix algebra. This part is the main body of the book and consists of the following four chapters. Chapter 6 (Machine Learning) first presents the basic theory and methods of machine learning, including single-objective optimization, feature selection, principal component analysis, and canonical correlation analysis, together with supervised, unsupervised, semi-supervised, and active learning. It then highlights topics and advances in machine learning: graph machine learning, reinforcement learning, Q-learning, and transfer learning. Chapter 7 (Neural Networks) describes optimization problems, activation functions, and basic neural networks. The core of this chapter is its coverage of topics and advances in neural networks: convolutional neural networks (CNNs), dropout learning, autoencoders, the extreme learning machine (ELM), graph embedding, network embedding, graph neural networks (GNNs), batch normalization networks, and generative adversarial networks (GANs). Chapter 8 (Support Vector Machines) discusses support vector machine regression and classification, and the relevance vector machine. Chapter 9 (Evolutionary Computation) is concerned primarily with multiobjective optimization, multiobjective simulated annealing, multiobjective genetic algorithms, multiobjective evolutionary algorithms, evolutionary programming, and differential evolution, together with ant colony optimization, artificial bee colony algorithms, and particle swarm optimization. In particular, this chapter also highlights topics and advances in evolutionary computation: Pareto optimization theory, noisy multiobjective optimization, and opposition-based evolutionary computation.

Part I uses some revised content and materials from my book Matrix Analysis and Applications (Cambridge University Press, 2017), but there are considerable differences in content and objective: this book concentrates on the applications of matrix algebra approaches in AI, with Part II running to 561 pages of text compared with only 205 pages in Part I. This book also shares some ideas with my previous book Linear Algebra in Signal Processing (in Chinese, Science Press, Beijing, 1997; Japanese translation, Morikita Press, Tokyo, 2008).
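As a concrete taste of the Chapter 4 material summarized above, the sketch below (a minimal illustration, not the book's own code; the matrix sizes and the regularization parameter lam are arbitrary assumptions) solves an over-determined system Ax ≈ b by SVD-based least squares and by Tikhonov (ridge) regularization.

    # Minimal sketch: SVD-based least squares and Tikhonov regularization for A x ≈ b.
    # Sizes and the regularization parameter lam are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((50, 4))
    b = rng.standard_normal(50)

    # Least squares via the SVD: x = V diag(1/s) U^T b (the Moore-Penrose solution).
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    x_ls = Vt.T @ ((U.T @ b) / s)

    # Tikhonov (ridge) regularization: x = (A^T A + lam I)^(-1) A^T b.
    lam = 0.1
    x_tik = np.linalg.solve(A.T @ A + lam * np.eye(4), A.T @ b)

    print(np.allclose(x_ls, np.linalg.lstsq(A, b, rcond=None)[0]))  # True
    print(np.round(x_tik, 3))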

Features and Contributions

• The first book on matrix algebra methods and applications in artificial intelligence.
• Introduces the machine learning tree, the neural network tree, and the evolutionary computation tree.
• Presents solid matrix algebra theory and methods for four core AI areas: Machine Learning, Neural Networks, Support Vector Machines, and Evolutionary Computation.
• Highlights selected topics and advances in machine learning, neural networks, and evolutionary computation.
• Summarizes about 80 AI algorithms so that readers can further understand and implement AI methods.

Audience

This book is suitable for scientists, engineers, and graduate students in many disciplines, including but not limited to artificial intelligence, computer science, mathematics, and engineering.

Acknowledgments

I want to thank more than 40 of my former doctoral students and more than 20 graduate students for their cooperative research in intelligent signal and information processing and pattern recognition. I am grateful to the several anonymous internal and external experts invited by Springer for their comments and suggestions. I am very grateful to my editor, Dr. Celine Lanlan Chang, Executive Editor, Computer Science, for her patience, understanding, and help in the course of my writing this book. Finally, I would also like to thank my wife, Xiao-Ying Tang, for her consistent understanding and support of my work, teaching, and research over the past nearly 50 years.

Beijing, China
November 2019

Xian-Da Zhang

A Note from the Family of Dr. Zhang

It is with a heavy heart that we share with you that the author of this book, our beloved father, passed away before it was published. We want to share some inspirational thoughts about his life's journey and how passionate he was about learning and teaching.

Our father endured many challenges in his life. During his high school years, the Great Chinese Famine occurred, but hunger could not deter him from studying. During his college years, his family was facing poverty, so he helped alleviate it by undertaking his studies at a military institution, where free meals were provided. Not long after, his tenacity to learn was tested again when the Cultural Revolution started, closing all the universities in China. He was assigned to work in a remote factory, but his perseverance endured: he continued his education by secretly studying hard during off-hours. After the universities were reopened, our father left us to continue his studies in a different city. He obtained his PhD degree at the age of 41 and became a professor at Tsinghua University in 1992.

Our father taught and mentored many students, both professionally and personally, throughout his life. He was even more passionate about sharing his ideas and knowledge through writing, as he believed that books, with their greater reach, would benefit many more people. He had been working for more than 12 hours a day before he was admitted to the hospital. He planned to take a break after finishing three books this year. Unfortunately, the workload proved too heavy, and in our last conversation he asked for our help in editing this book.

Our father lived a life with purpose. He discovered his great passion when he was young: learning and teaching. For him, it was even something to die for. He once told our mom that he would rather live fewer years and produce a high-quality book. Despite the numerous challenges and hardships he faced throughout his life, he never stopped learning and teaching. He self-studied artificial intelligence in the past few years and completed this book before his final days.


We sincerely hope you will enjoy reading his final work; we believe he would undoubtedly have been very happy to inform, teach, and share with you his latest learning.

Fremont, CA, USA
Philadelphia, PA, USA
April 2020

Yewei Zhang and Wei Wei (Daughter and Son-in-Law)
Zhang and John Zhang (Son and Grandson)

Contents

Part I Introduction to Matrix Algebra 1

Basic Matrix Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1 Basic Concepts of Vectors and Matrices. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.1 Vectors and Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.2 Basic Vector Calculus. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.3 Basic Matrix Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Sets and Linear Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2.1 Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2.2 Linear Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3 Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3.1 Vector Norms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3.2 Matrix Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.4 Random Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.4.1 Statistical Interpretation of Random Vectors . . . . . . . . . . . . . . . 1.4.2 Gaussian Random Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.5 Basic Performance of Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.5.1 Quadratic Forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.5.2 Determinants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.5.3 Matrix Eigenvalues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.5.4 Matrix Trace. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.5.5 Matrix Rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.6 Inverse Matrices and Moore–Penrose Inverse Matrices . . . . . . . . . . . . . 1.6.1 Inverse Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.6.2 Left and Right Pseudo-Inverse Matrices . . . . . . . . . . . . . . . . . . . . 1.6.3 Moore–Penrose Inverse Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.7 Direct Sum and Hadamard Product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.7.1 Direct Sum of Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.7.2 Hadamard Product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

3 3 3 6 7 10 10 12 14 14 18 20 20 24 27 27 27 30 32 33 35 35 39 40 42 43 43

xi

xii

Contents

1.8

Kronecker Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.8.1 Definitions of Kronecker Products . . . . . . . . . . . . . . . . . . . . . . . . . . 1.8.2 Performance of Kronecker Products . . . . . . . . . . . . . . . . . . . . . . . . 1.9 Vectorization and Matricization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.9.1 Vectorization and Commutation Matrix . . . . . . . . . . . . . . . . . . . . 1.9.2 Matricization of Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

46 46 47 48 48 50 53

2

Matrix Differential . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1 Jacobian Matrix and Gradient Matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1.1 Jacobian Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1.2 Gradient Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1.3 Calculation of Partial Derivative and Gradient . . . . . . . . . . . . . 2.2 Real Matrix Differential . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.1 Calculation of Real Matrix Differential . . . . . . . . . . . . . . . . . . . . . 2.2.2 Jacobian Matrix Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.3 Jacobian Matrix of Real Matrix Functions. . . . . . . . . . . . . . . . . . 2.3 Complex Gradient Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.1 Holomorphic Function and Complex Partial Derivative . . . 2.3.2 Complex Matrix Differential . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.3 Complex Gradient Matrix Identification . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

55 55 55 57 59 61 62 64 68 71 72 75 82 87

3

Gradient and Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1 Real Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1.1 Stationary Points and Extreme Points . . . . . . . . . . . . . . . . . . . . . . . 3.1.2 Real Gradient of f (X) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 Gradient of Complex Variable Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.1 Extreme Point of Complex Variable Function . . . . . . . . . . . . . . 3.2.2 Complex Gradient. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3 Convex Sets and Convex Function Identification . . . . . . . . . . . . . . . . . . . . 3.3.1 Standard Constrained Optimization Problems . . . . . . . . . . . . . . 3.3.2 Convex Sets and Convex Functions . . . . . . . . . . . . . . . . . . . . . . . . . 3.3.3 Convex Function Identification. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4 Gradient Methods for Smooth Convex Optimization . . . . . . . . . . . . . . . . 3.4.1 Gradient Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4.2 Projected Gradient Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4.3 Convergence Rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.5 Nesterov Optimal Gradient Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.5.1 Lipschitz Continuous Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.5.2 Nesterov Optimal Gradient Algorithms. . . . . . . . . . . . . . . . . . . . . 3.6 Nonsmooth Convex Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.6.1 Subgradient and Subdifferential . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.6.2 Proximal Operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.6.3 Proximal Gradient Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

89 89 90 93 95 95 98 100 100 102 104 106 106 108 110 112 113 115 118 119 123 128

Contents

xiii

3.7

Constrained Convex Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.7.1 Penalty Function Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.7.2 Augmented Lagrange Multiplier Method . . . . . . . . . . . . . . . . . . . 3.7.3 Lagrange Dual Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.7.4 Karush–Kuhn–Tucker Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.7.5 Alternating Direction Method of Multipliers . . . . . . . . . . . . . . . 3.8 Newton Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.8.1 Newton Method for Unconstrained Optimization . . . . . . . . . . 3.8.2 Newton Method for Constrained Optimization . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

132 133 135 137 139 144 147 147 150 153

4

Solution of Linear Systems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.1 Gauss Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.1.1 Elementary Row Operations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.1.2 Gauss Elimination for Solving Matrix Equations . . . . . . . . . . 4.1.3 Gauss Elimination for Matrix Inversion . . . . . . . . . . . . . . . . . . . . 4.2 Conjugate Gradient Methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.1 Conjugate Gradient Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.2 Biconjugate Gradient Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.3 Preconditioned Conjugate Gradient Algorithm. . . . . . . . . . . . . 4.3 Condition Number of Matrices. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4 Singular Value Decomposition (SVD) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4.1 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4.2 Properties of Singular Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4.3 Singular Value Thresholding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.5 Least Squares Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.5.1 Least Squares Solution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.5.2 Rank-Deficient Least Squares Solutions . . . . . . . . . . . . . . . . . . . . 4.6 Tikhonov Regularization and Gauss–Seidel Method . . . . . . . . . . . . . . . . 4.6.1 Tikhonov Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.6.2 Gauss–Seidel Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.7 Total Least Squares Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.7.1 Total Least Squares Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.7.2 Performances of TLS Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.7.3 Generalized Total Least Square . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.8 Solution of Under-Determined Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.8.1 1 -Norm Minimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.8.2 Lasso. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.8.3 LARS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

157 157 158 159 160 162 163 164 165 167 171 171 173 174 175 175 178 179 179 181 184 184 189 191 193 193 196 197 199

5

Eigenvalue Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.1 Eigenvalue Problem and Characteristic Equation . . . . . . . . . . . . . . . . . . . . 5.1.1 Eigenvalue Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.1.2 Characteristic Polynomial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

203 203 203 205

xiv

Contents

5.2

Eigenvalues and Eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2.1 Eigenvalues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2.2 Eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.3 Generalized Eigenvalue Decomposition (GEVD). . . . . . . . . . . . . . . . . . . . 5.3.1 Generalized Eigenvalue Decomposition . . . . . . . . . . . . . . . . . . . . 5.3.2 Total Least Squares Method for GEVD . . . . . . . . . . . . . . . . . . . . . 5.4 Rayleigh Quotient and Generalized Rayleigh Quotient. . . . . . . . . . . . . . 5.4.1 Rayleigh Quotient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.4.2 Generalized Rayleigh Quotient. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.4.3 Effectiveness of Class Discrimination . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

206 206 208 210 210 214 214 215 216 217 220

Part II Artificial Intelligence 6

Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.1 Machine Learning Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2 Optimization in Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2.1 Single-Objective Composite Optimization . . . . . . . . . . . . . . . . . 6.2.2 Gradient Aggregation Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2.3 Coordinate Descent Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2.4 Benchmark Functions for Single-Objective Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.3 Majorization-Minimization Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.3.1 MM Algorithm Framework. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.3.2 Examples of Majorization-Minimization Algorithms . . . . . . 6.4 Boosting and Probably Approximately Correct Learning . . . . . . . . . . . 6.4.1 Boosting for Weak Learners . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.4.2 Probably Approximately Correct Learning . . . . . . . . . . . . . . . . . 6.5 Basic Theory of Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.5.1 Learning Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.5.2 Machine Learning Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.5.3 Expected Performance of Machine Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.6 Classification and Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.6.1 Pattern Recognition and Classification. . . . . . . . . . . . . . . . . . . . . . 6.6.2 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.7 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.7.1 Supervised Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.7.2 Unsupervised Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.7.3 Nonlinear Joint Unsupervised Feature Selection . . . . . . . . . . . 6.8 Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.8.1 Principal Component Analysis Basis . . . . . . . . . . . . . . . . . . . . . . . 6.8.2 Minor Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.8.3 Principal Subspace Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

223 223 226 226 230 232 236 241 242 244 247 248 250 253 253 254 256 256 257 259 260 261 264 266 269 269 270 271

Contents

6.9

6.10

6.11

6.12

6.13

6.14

6.15

6.16

6.17

xv

6.8.4 Robust Principal Component Analysis. . . . . . . . . . . . . . . . . . . . . . 6.8.5 Sparse Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . Supervised Learning Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.9.1 Principle Component Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.9.2 Partial Least Squares Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.9.3 Penalized Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.9.4 Gradient Projection for Sparse Reconstruction . . . . . . . . . . . . . Supervised Learning Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.10.1 Binary Linear Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.10.2 Multiclass Linear Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Supervised Tensor Learning (STL) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.11.1 Tensor Algebra Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.11.2 Supervised Tensor Learning Problems . . . . . . . . . . . . . . . . . . . . . . 6.11.3 Tensor Fisher Discriminant analysis . . . . . . . . . . . . . . . . . . . . . . . . 6.11.4 Tensor Learning for Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.11.5 Tensor K-Means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Unsupervised Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.12.1 Similarity Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.12.2 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.12.3 Fisher Discriminant Analysis (FDA). . . . . . . . . . . . . . . . . . . . . . . . 6.12.4 K-Means Clustering. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Spectral Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.13.1 Spectral Clustering Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.13.2 Constrained Spectral Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.13.3 Fast Spectral Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Semi-Supervised Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.14.1 Semi-Supervised Inductive/Transductive Learning . . . . . . . . 6.14.2 Self-Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.14.3 Co-training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Canonical Correlation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.15.1 Canonical Correlation Analysis Algorithm . . . . . . . . . . . . . . . . . 6.15.2 Kernel Canonical Correlation Analysis . . . . . . . . . . . . . . . . . . . . . 6.15.3 Penalized Canonical Correlation Analysis . . . . . . . . . . . . . . . . . . Graph Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.16.1 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.16.2 Graph Laplacian Matrices . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.16.3 Graph Spectrum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.16.4 Graph Signal Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.16.5 Semi-Supervised Graph Learning: Harmonic Function Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.16.6 Semi-Supervised Graph Learning: Min-Cut Method. . . . . . . 6.16.7 Unsupervised Graph Learning: Sparse Coding Method. . . . Active Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.17.1 Active Learning: Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.17.2 Statistical Active Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

276 278 281 282 285 289 293 296 296 298 301 301 307 309 311 314 314 316 320 324 327 330 330 333 335 337 338 340 341 343 344 347 352 354 354 358 360 363 368 371 377 378 379 380

xvi

7

Contents

6.17.3 Active Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.17.4 Active Learning Based Binary Linear Classifiers . . . . . . . . . . 6.17.5 Active Learning Using Extreme Learning Machine . . . . . . . . 6.18 Reinforcement Learning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.18.1 Basic Concepts and Theory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.18.2 Markov Decision Process (MDP) . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.19 Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.19.1 Basic Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.19.2 Double Q-Learning and Weighted Double Q-Learning . . . . 6.19.3 Online Connectionist Q-Learning Algorithm. . . . . . . . . . . . . . . 6.19.4 Q-Learning with Experience Replay . . . . . . . . . . . . . . . . . . . . . . . . 6.20 Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.20.1 Notations and Definitions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.20.2 Categorization of Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . 6.20.3 Boosting for Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.20.4 Multitask Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.20.5 EigenTransfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.21 Domain Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.21.1 Feature Augmentation Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.21.2 Cross-Domain Transform Method . . . . . . . . . . . . . . . . . . . . . . . . . . 6.21.3 Transfer Component Analysis Method . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

382 383 384 387 387 391 394 394 396 399 400 402 403 406 410 411 414 417 418 421 423 427

Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.1 Neural Network Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.2 From Modern Neural Networks to Deep Learning. . . . . . . . . . . . . . . . . . . 7.3 Optimization of Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.3.1 Online Optimization Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.3.2 Adaptive Gradient Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.3.3 Adaptive Moment Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.4 Activation Functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.4.1 Logistic Regression and Sigmoid Function . . . . . . . . . . . . . . . . . 7.4.2 Softmax Regression and Softmax Function . . . . . . . . . . . . . . . . 7.4.3 Other Activation Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.5 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.5.1 Conventional Recurrent Neural Networks . . . . . . . . . . . . . . . . . . 7.5.2 Backpropagation Through Time (BPTT) . . . . . . . . . . . . . . . . . . . 7.5.3 Jordan Network and Elman Network . . . . . . . . . . . . . . . . . . . . . . . 7.5.4 Bidirectional Recurrent Neural Networks . . . . . . . . . . . . . . . . . . 7.5.5 Long Short-Term Memory (LSTM). . . . . . . . . . . . . . . . . . . . . . . . . 7.5.6 Improvement of Long Short-Term Memory . . . . . . . . . . . . . . . . 7.6 Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.6.1 Hopfield Network and Boltzmann Machines . . . . . . . . . . . . . . . 7.6.2 Restricted Boltzmann Machine. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

441 441 444 445 446 447 449 452 452 454 456 460 460 463 467 468 471 474 477 477 481

Contents

7.7

7.8

7.9

7.10

7.11

7.12

7.13

7.14

xvii

7.6.3 Contrastive Divergence Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.6.4 Multiple Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . . Bayesian Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.7.1 Naive Bayesian Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.7.2 Bayesian Classification Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.7.3 Sparse Bayesian Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.8.1 Hankel Matrix and Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.8.2 Pooling Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.8.3 Activation Functions in CNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.8.4 Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dropout Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.9.1 Dropout for Shallow and Deep Learning . . . . . . . . . . . . . . . . . . . 7.9.2 Dropout Spherical K-Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.9.3 DropConnect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.10.1 Basic Autoencoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.10.2 Stacked Sparse Autoencoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.10.3 Stacked Denoising Autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.10.4 Convolutional Autoencoders (CAE) . . . . . . . . . . . . . . . . . . . . . . . . 7.10.5 Stacked Convolutional Denoising Autoencoder . . . . . . . . . . . . 7.10.6 Nonnegative Sparse Autoencoder . . . . . . . . . . . . . . . . . . . . . . . . . . . Extreme Learning Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.11.1 Single-Hidden Layer Feedforward Networks with Random Hidden Nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.11.2 Extreme Learning Machine Algorithm for Regression and Binary Classification . . . . . . . . . . . . . . . . . . . . . . . 7.11.3 Extreme Learning Machine Algorithm for Multiclass Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Graph Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.12.1 Proximity Measures and Graph Embedding . . . . . . . . . . . . . . . . 7.12.2 Multidimensional Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.12.3 Manifold Learning: Isometric Map . . . . . . . . . . . . . . . . . . . . . . . . . 7.12.4 Manifold Learning: Locally Linear Embedding . . . . . . . . . . . . 7.12.5 Manifold Learning: Laplacian Eigenmap . . . . . . . . . . . . . . . . . . . Network Embedding. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.13.1 Structure and Property Preserving Network Embedding. . . 7.13.2 Community Preserving Network Embedding . . . . . . . . . . . . . . 
7.13.3 Higher-Order Proximity Preserved Network Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Neural Networks on Graphs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.14.1 Graph Neural Networks (GNNs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.14.2 DeepWalk and GraphSAGE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.14.3 Graph Convolutional Networks (GCNs) . . . . . . . . . . . . . . . . . . . .

484 487 490 490 491 493 496 498 504 507 510 514 515 518 520 524 525 532 535 538 539 540 542 543 545 549 551 552 557 559 560 563 567 568 569 573 576 577 579 583

xviii

8

Contents

7.15

Batch Normalization Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.15.1 Batch Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.15.2 Variants and Extensions of Batch Normalization. . . . . . . . . . . 7.16 Generative Adversarial Networks (GANs) . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.16.1 Generative Adversarial Network Framework . . . . . . . . . . . . . . . 7.16.2 Bidirectional Generative Adversarial Networks . . . . . . . . . . . . 7.16.3 Variational Autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

588 588 593 598 598 602 604 607

8 Support Vector Machines
8.1 Support Vector Machines: Basic Theory
8.1.1 Statistical Learning Theory
8.1.2 Linear Support Vector Machines
8.2 Kernel Regression Methods
8.2.1 Reproducing Kernel and Mercer Kernel
8.2.2 Representer Theorem and Kernel Regression
8.2.3 Semi-Supervised and Graph Regression
8.2.4 Kernel Partial Least Squares Regression
8.2.5 Laplacian Support Vector Machines
8.3 Support Vector Machine Regression
8.3.1 Support Vector Machine Regressor
8.3.2 ε-Support Vector Regression
8.3.3 ν-Support Vector Machine Regression
8.4 Support Vector Machine Binary Classification
8.4.1 Support Vector Machine Binary Classifier
8.4.2 ν-Support Vector Machine Binary Classifier
8.4.3 Least Squares SVM Binary Classifier
8.4.4 Proximal Support Vector Machine Binary Classifier
8.4.5 SVM-Recursive Feature Elimination
8.5 Support Vector Machine Multiclass Classification
8.5.1 Decomposition Methods for Multiclass Classification
8.5.2 Least Squares SVM Multiclass Classifier
8.5.3 Proximal Support Vector Machine Multiclass Classifier
8.6 Gaussian Process for Regression and Classification
8.6.1 Joint, Marginal, and Conditional Probabilities
8.6.2 Gaussian Process
8.6.3 Gaussian Process Regression
8.6.4 Gaussian Process Classification
8.7 Relevance Vector Machine
8.7.1 Sparse Bayesian Regression
8.7.2 Sparse Bayesian Classification
8.7.3 Fast Marginal Likelihood Maximization
References

9 Evolutionary Computation
9.1 Evolutionary Computation Tree
9.2 Multiobjective Optimization
9.2.1 Multiobjective Combinatorial Optimization
9.2.2 Multiobjective Optimization Problems
9.3 Pareto Optimization Theory
9.3.1 Pareto Concepts
9.3.2 Fitness Selection Approach
9.3.3 Nondominated Sorting Approach
9.3.4 Crowding Distance Assignment Approach
9.3.5 Hierarchical Clustering Approach
9.3.6 Benchmark Functions for Multiobjective Optimization
9.4 Noisy Multiobjective Optimization
9.4.1 Pareto Concepts for Noisy Multiobjective Optimization
9.4.2 Performance Metrics for Approximation Sets
9.5 Multiobjective Simulated Annealing
9.5.1 Principle of Simulated Annealing
9.5.2 Multiobjective Simulated Annealing Algorithm
9.5.3 Archived Multiobjective Simulated Annealing
9.6 Genetic Algorithm
9.6.1 Basic Genetic Algorithm Operations
9.6.2 Genetic Algorithm with Gene Rearrangement Clustering
9.7 Nondominated Multiobjective Genetic Algorithms
9.7.1 Fitness Functions
9.7.2 Fitness Selection
9.7.3 Nondominated Sorting Genetic Algorithms
9.7.4 Elitist Nondominated Sorting Genetic Algorithm
9.8 Evolutionary Algorithms (EAs)
9.8.1 (1 + 1) Evolutionary Algorithm
9.8.2 Theoretical Analysis on Evolutionary Algorithms
9.9 Multiobjective Evolutionary Algorithms
9.9.1 Classical Methods for Solving Multiobjective Optimization Problems
9.9.2 MOEA Based on Decomposition (MOEA/D)
9.9.3 Strength Pareto Evolutionary Algorithm
9.9.4 Achievement Scalarizing Functions
9.10 Evolutionary Programming
9.10.1 Classical Evolutionary Programming
9.10.2 Fast Evolutionary Programming
9.10.3 Hybrid Evolutionary Programming
9.11 Differential Evolution
9.11.1 Classical Differential Evolution
9.11.2 Differential Evolution Variants

9.12 Ant Colony Optimization
9.12.1 Real Ants and Artificial Ants
9.12.2 Typical Ant Colony Optimization Problems
9.12.3 Ant System and Ant Colony System
9.13 Multiobjective Artificial Bee Colony Algorithms
9.13.1 Artificial Bee Colony Algorithms
9.13.2 Variants of ABC Algorithms
9.14 Particle Swarm Optimization
9.14.1 Basic Concepts
9.14.2 The Canonical Particle Swarm
9.14.3 Genetic Learning Particle Swarm Optimization
9.14.4 Particle Swarm Optimization for Feature Selection
9.15 Opposition-Based Evolutionary Computation
9.15.1 Opposition-Based Learning
9.15.2 Opposition-Based Differential Evolution
9.15.3 Two Variants of Opposition-Based Learning
References

Index

List of Notations

∀ |  ∃  ∧ ∨ |A| A⇒B A⊆B A⊂B A=B A∪B A∩B A∩B =∅ A+B A−B X\A A B AB A

B

AB AB

For all Such that Such that There exists There does not exist Logical AND Logical OR Cardinality of a set A “condition A results in B” or “A implies B” A is a subset of B A is a proper subset of B Sets A = B Union of sets A and B Intersection of sets A and B Sets A and B are disjoint Sum set of sets A and B The set of elements of A that are not in B Complement of the set A in the set X Set A dominates set B: every f(x2 ) ∈ B is dominated by at least one f(x1 ) ∈ A such that f(x1 ) I N f(x2 ) (for maximization) A weakly dominates B: for every f(x2 ) ∈ B and at least one f(x1 ) ∈ A f(x1 ) ≤I N f(x2 ) (for minimization) or f(x1 ) ≥I N f(x2 ) (for maximization) A strictly dominates B: every f(x2 ) ∈ B is strictly dominated by at least one f(x1 ) ∈ A such that fi (x1 ) I N fi (x2 ), ∀i = {1, . . . , m} (for maximization) Sets A and B are incomparable: neither A  B nor B  A A is better than B: every f(x2 ) ∈ B is weakly dominated by at least one f(x1 ) ∈ Aand A = B xxi


AGGmean (z) AGGLSTM (z) AGGpool (z) C Cn Cm×n C[x] C[x]m×n CI ×J ×K CI1 ×···×IN K Kn Km×n KI ×J ×K KI1 ×···×IN G(V , E, W) N (v) PReLU(z) ReLU(z) R Rn Rm×n R[x] R[x]m×n RI ×J ×K RI1 ×···×IN R+ R++ σ (z) softmax(z) softplus(z) softsign(z) tanh(z) T :V →W T −1 : W → V X1 × · · · × Xn {(xi , yi = +1)} {(xi , yi = −1)} 1n 0n ei x ∼ N (¯x,  x ) x0 x1 x2


Mean aggregate function of z LSTM aggregate function of z Pooling aggregate function of z Complex numbers Complex n-vector Complex m × n matrix Complex polynomial Complex m × n polynomial matrix Complex third-order tensors Complex N -order tensor Real or complex number Real or complex n-vector Real or complex m × n matrix Real or complex third-order tensor Real or complex N -order tensor Graph with vertex set V , edge set E and adjacency matrix W Neighbors of a vertex (node) v Parametric rectified linear unit activation function of z Rectified linear unit activation function of z Real number Real n-vectors (n × 1 real matrix) Real m × n matrix Real polynomial Real m × n polynomial matrix Real third-order tensors Real N -order tensor Nonnegative real numbers, nonnegative orthant Positive real number Sigmoid activation function of z Softmax activation function of z Softplus activation function of z Softsign activation function of z Tangent (tanh) hyperbolic activation function of z Mapping the vectors in V to corresponding vectors in W Inverse mapping of the one-to-one mapping T : V → W Cartesian product of n sets X1 , . . . , Xn Set of training data vectors xi belonging to the classes (+) Set of training data vectors xi belonging to the classes (−) n-dimensional summing vector with all entries 1 n-dimensional zero vector with all zero entries Base vector whose ith entry equal to 1 and others being zero Gaussian vector with mean vector x¯ and covariance matrix  x 0 -norm: number of nonzero entries of vector x 1 -norm of vector x Euclidean form of vector x


xp x∗ x∞ x, y d(x, y) N (x) ρ(x, y) x∈A x∈ /A x ◦ y = xyH x⊥y x>0 x≥0 x≥y x x x x x  x x  x x

x x

x x  x f(x) = f(x ) f(x) = f(x ) f(x) ≤ f(x ) f(x) < f(x ) f(x) ≥ f(x ) f(x) > f(x ) f(x1 ) I N f(x2 )

f(x1 ) ≤I N f(x2 ) f(x1 ) ≥I N f(x2 ) AT AH A−1 A†


p -norm or Hölder norm of vector x Nuclear norm of vector x ∞ -norm of vector x Inner product of vectors x and y Distance or dissimilarity between vectors x and y -neighborhood of vector x Correlation coefficient between two random vectors x and y x belongs to the set A, i.e., x is an element or member of A x is not an element of the set A Outer product of vectors x and y Vector orthogonal Positive vector with components xi > 0, ∀i Nonnegative vector with components xi ≥ 0, ∀i Vector elementwise inequality xi ≥ yi , ∀i x domains (or outperforms) x : f(x) < f(x ) for minimization x domains (or outperforms) x : f(x) > f(x ) for maximization x weakly dominates x : f(x) ≤ f(x ) for minimization x weakly dominates x : f(x) ≥ f(x ) for maximization x strictly dominates x : fi (x) < fi (x ), ∀ i for minimization x strictly dominates x : fi (x) > fi (x ), ∀ i for maximization x and x are incomparable, i.e., x  x ∧ x  x fi (x) = fi (x ), ∀i = 1, . . . , m fi (x) = fi (x ), for at least one i ∈ {1, . . . , m} fi (x) ≤ fi (x ), ∀i = 1, . . . , m ∀ i = 1, . . . , m : fi (x) ≤ fi (x ) ∧ ∃j ∈ {1, . . . , m} : fj (x) < fj (x ) fi (x) ≥ fi (x ), ∀i = 1, . . . , m ∀ i = 1, . . . , m : fi (x) ≥ fi (x ) ∧ ∃j ∈ {1, . . . , m} : fj (x) > fj (x ) Interval order relation: ∀i = 1, . . . , m : f i (x1 ) ≤ f i (x2 ) ∧ f i (x1 ) ≤ f i (x2 ) ∧ ∃j ∈ {1, . . . , m} : f j (x1 ) = f j (x2 ) ∨ f j (x1 ) = f j (x2 ) Interval order relation: ∀i = 1, . . . , m : f i (x1 ) ≥ f i (x2 ) ∧ f i (x1 ) ≥ f i (x2 ) ∧ ∃j ∈ {1, . . . , m} : f j (x1 ) = f j (x2 ) ∨ f j (x1 ) = f j (x2 ) Weak interval order relation: ∀i ∈ {1, . . . , m} : f i (x1 ) ≤ f i (x2 ) ∧ f i (x1 ) ≤ f i (x2 ) Weak interval order relation: ∀i ∈ {1, . . . , m} : f i (x1 ) ≥ f i (x2 ) ∧ f i (x1 ) ≥ f i (x2 ) Transpose of matrix A Complex conjugate transpose of matrix A Inverse of nonsingular matrix A Moore–Penrose inverse of matrix A


A∗ A 0 A0 A≺0 A0 A>0 A≥0 A≥B In On |A| A1 A2 = Aspec AF A∞ AG A∗ A⊕B A B A!B A⊗B A, B ρ(A) cond(A) diag(A) Diag(A) eig(A) Gr(n, r) rvec(A) off(A) tr(A) vec(A)


Conjugate of matrix A Positive definite matrix Positive semi-definite matrix Negative definite matrix Negative semi-definite matrix Positive (or elementwise positive) matrix Nonnegative (or elementwise nonnegative) matrix Matrix elementwise inequality aij ≥ bij , ∀i, j n × n Identity matrix n × n Null matrix Determinant of matrix A Maximum absolute column-sum norm of matrix A Spectrum norm of matrix A Frobenius norm of matrix A Max norm of A: absolute maximum of all entries of A Mahalanobis norm of matrix A Nuclear norm, called also the trace norm, of matrix A Direct sum of an m × m matrix A and an n × n matrix B Hadamard product (or elementwise product) of A and B Elementwise division of matrices A and B Kronecker product of matrices A and B Inner (or dot) product of A and B : A, B = vec(A), vec(B) Spectral radius of matrix A Condition number of matrix A  n 2 Diagonal function of A = [aij ] : i=1 |aii | Diagonal matrix consisting of diagonal entries of A Eigenvalues of the Hermitian matrix A Grassmann manifold Row vectorization of matrix  A n m 2 Off function of A = [aij ] : i=1,i=j j =1 |aij | Trace of matrix A Vectorization of matrix A

List of Figures

Fig. 1.1 Classification of vectors

Fig. 6.1 Fig. 6.2 Fig. 6.3

Machine learning tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Three-way array (left) and a tensor modeling a face (right) . . . . . . . . An MDP models the synchronous interaction between agent and environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . One neural network as a collection of interconnected units, where i, j , and k are neurons in output layer, hidden layer, and input layer, respectively . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Transfer learning setup: storing knowledge gained from solving one problem and applying it to a different but related problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . The transformation mapping. (a) The symmetric transformation (TS and TT ) of the source-domain feature set XS = {xSi } and target-domain feature set XT = {xTi } into a common latent feature set XC = {xCi }. (b) The asymmetric transformation TT of the source-domain feature set XS to the target-domain feature set XT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


Fig. 6.4

Fig. 6.5

Fig. 6.6

Fig. 7.1 Fig. 7.2

Fig. 7.3


Neural network tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444 General structure of a regular unidirectional three-layer RNN with hidden state recurrence, where z−1 denotes a delay line. Left is RNN with a delay line, and right is RNN unfolded in time for two time steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462 General structure of a regular unidirectional three-layer RNN with output recurrence, where z−1 denotes a delay line. Left is RNN with a delay line, and right is RNN unfolded in time for two time steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463


Fig. 7.4 Fig. 7.5 Fig. 7.6

Fig. 7.7

Fig. 7.8

Fig. 7.9

Fig. 7.10

Fig. 7.11 Fig. 7.12 Fig. 7.13


Illustration of Jordan network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Illustration of Elman network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Illustration of the Bidirectional Recurrent Neural Network (BRNN) unfolded in time for three time steps. Upper: output units, Middle: forward and backward hidden units, Lower: input units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A comparison between Hopfield network and Boltzmann machine. Left: is Hopfield network in which a fully connected network of six binary thresholding neural units. Every unit is connected with data. Right: is Boltzmann machine whose model splits into two parts: visible units and hidden units (shaded nodes). The dashed line is used to highlight the model separation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A comparison between Left Boltzmann machine and Right restricted Boltzmann machine. In the restricted Boltzmann machine there are no connections between hidden units (shaded nodes) and no connections between visible units (unshaded nodes) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A toy example using four different pooling techniques. Left: resulting activations within a given pooling region. Right: pooling results given by four different pooling techniques. If λ = 0.4 is taken then the mixed pooling result is 1.46. The mixed pooling and the stochastic pooling can represent multi-modal distributions of activations within a region . . . . . . . . . . . An example of a thinned network produced by applying dropout to a full-connection neural network. Crossed units have been dropped. The dropout probabilities of the first, second, and third layers are 0.4, 0.6, and 0.4, respectively, . . . . . . . . An example of the structure of DropConnect. The dotted lines show dropped connections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . MLP with one hidden layer for auto association. The shape of the whole network is similar to an hourglass . . . . . . . . . . . . . . . . . . . . Illustration of the architecture of the basic autoencoder with “encoder” and “decoder” networks for high-level feature learning. The leftmost layer of the whole network is called the input layer, and the rightmost layer the output layer. The middle layer of nodes is called the hidden layer, because its values are not observed in the training set. The circles labeled “+1” are called bias units, and correspond to the intercept term b. Ii denotes the index i, i = 1, . . . , n . . . . . . . . . . . . . .



Fig. 7.14 Autoencoder training flowchart. Encoder f encodes the input x to h = f (x), and the decoder g decodes h = f (x) to the output y = xˆ = g(f (x)) so that the reconstruction error L(x, y) is minimized . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fig. 7.15 Stacked autoencoder scheme with 2 hidden (d) + b(d) ) and layers, where h(d) = f (W(d) 1 x 1 (d) (d) y(d) = f (W2 h(d) + b2 ), d = 1, 2 with x(1) = x, x(2) = y(1) and xˆ = y(2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fig. 7.16 Comparison between the two network architectures for a batch data: (a) the network with no BatchNorm layer; (b) the same network as in (a) with a BatchNorm layer inserted after the fully connected layer W. All the layer parameters have exactly the same value in both networks, and the two networks have the same loss function L, i.e., Lˆ = L . . . . . . . . . . . . . . . Fig. 7.17 The basic structure of generative adversarial networks (GANs). The generator G uses a random noise z to generate a synthetic data xˆ = G(z), and the discriminator D tries to identify whether the synthesized data xˆ is a real data x, i.e., making a real or fake inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fig. 7.18 Bidirectional generative adversarial network (BiGAN) is a combination of a standard GAN and an Encoder. The generator G from the standard GAN framework maps latent samples z to generated data G(z). An encoder E maps data x to the output E(x). Both the generator tuple (G(z), z) and the encoder tuple (x, E(x)) act, in a bidirectional way, as inputs to the discriminator D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fig. 7.19 A variational autoencoder (VAEs) connected with a standard generative adversarial network (GAN). Unlike the BiGAN in which the encoder’s output E(x) is one of two bidirectional inputs, the autoencoder E here executes variation z ∼ E(x) that is used as the random noise in GAN . . . . . . Fig. 8.1 Fig. 8.2 Fig. 9.1


Three points in R2, shattered by oriented lines. The arrow points to the side with the points labeled black
The decision directed acyclic graph (DDAG) for finding the best class out of four classes
Evolutionary computation tree


Fig. 9.2

Fig. 9.3

Fig. 9.4 Fig. 9.5

Fig. 9.6


Pareto efficiency and Pareto improvement. Point A is an inefficient allocation between preference criterions f1 and f2 because it does not satisfy the constraint curve of f1 and f2 . Two decisions to move from Point A to Points C and D would be a Pareto improvement, respectively. They improve both f1 and f2 , without making anyone else worse off. Hence, these two moves would be a Pareto improvement and be Pareto optimal, respectively. A move from Point A to Point B would not be a Pareto improvement because it decreases the cost f1 by increasing another cost f2 , thus making one side better off by making another worse off. The move from any point that lies under the curve to any point on the curve cannot be a Pareto improvement due to making one of two criterions f1 and f2 worse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Illustration of degeneracy due to crossover (from [16]). Here, asterisk denotes three cluster centers (1.1, 1.0), (2.2, 2.0), (3.4, 1.2) for chromosome m1 , + denotes three cluster centers (3.2, 1.4), (1.8, 2.2), (0.5, 0.7) for the chromosome m2 , open triangle denotes the chromosome obtained by crossing m1 and m2 , and open circle denotes the chromosome obtained by crossing m1 and m2 with three cluster centers (0.5, 0.7), (1.8, 2.2), (3.2, 1.4) . . . . . Geometric representation of opposite number of a real number x . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . The opposite point and the quasi-opposite point. Given a solution xi = [xi1 , . . . , xiD ]T with xij ∈ [aj , bj ] and q Mij = (aj + bj )/2, then xijo and xij are the j th elements of q the opposite point xoi and the quasi-opposite point xi of xi , respectively. (a) When xij > Mij . (b) When xij < Mij . . . . . . . . . . . The opposite, quasi-opposite, and quasi-reflected points of a solution (point) xi . Given a solution xi = [xi1 , . . . , xiD ]T q with xij ∈ [aj , bj ] and Mij = (aj + bj )/2, then xijo , xij , qr and xij are the j th elements of the opposite point xoi , the q quasi-opposite point xi , and the quasi-reflected opposite qr point xi of xi , respectively. (a) When xij > Mij . (b) When xij < Mij . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


List of Tables

Table 1.1 Quadratic forms and positive definiteness of a Hermitian matrix A
Table 2.1 Symbols of real functions
Table 2.2 Differential matrices and Jacobian matrices of trace functions
Table 2.3 Differentials and Jacobian matrices of determinant functions
Table 2.4 Matrix differentials and Jacobian matrices of real functions
Table 2.5 Differentials and Jacobian matrices of matrix functions
Table 2.6 Forms of complex-valued functions
Table 2.7 Nonholomorphic and holomorphic functions
Table 2.8 Complex gradient matrices of trace functions
Table 2.9 Complex gradient matrices of determinant functions
Table 2.10 Complex matrix differential and complex Jacobian matrix
Table 3.1 Extreme-point conditions for the complex variable functions
Table 3.2 Proximal operators of several typical functions
Table 6.1 Similarity values and clustering results
Table 6.2 Distance matrix in Example 6.4
Table 6.3 Distance matrix after first clustering in Example 6.4
Table 6.4 Distance matrix after second clustering in Example 6.4
Table 9.1 Relation comparison between objective vectors and approximation sets

List of Algorithms

Algorithm 3.1 Algorithm 3.2 Algorithm 3.3 Algorithm 3.4 Algorithm 3.5 Algorithm 3.6

Gradient descent algorithm and its variants . . . . . . . . . . . . . . . . . Nesterov (first) optimal gradient algorithm . . . . . . . . . . . . . . . . . . Nesterov algorithm with adaptive convexity parameter . . . . . Nesterov (third) optimal gradient algorithm . . . . . . . . . . . . . . . . . FISTA algorithm with fixed step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Davidon–Fletcher–Powell (DFP) quasi-Newton method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Algorithm 3.7 Broyden–Fletcher–Goldfarb–Shanno (BFGS) quasi-Newton method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Algorithm 3.8 Newton algorithm via backtracking line search . . . . . . . . . . . . . Algorithm 3.9 Feasible-start Newton algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Algorithm 3.10 Infeasible-start Newton algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . .


Algorithm Algorithm Algorithm Algorithm Algorithm Algorithm


4.1 4.2 4.3 4.4 4.5 4.6

Conjugate gradient algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Biconjugate gradient algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . PCG algorithm with preprocessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . PCG algorithm without preprocessor . . . . . . . . . . . . . . . . . . . . . . . . . TLS algorithm for minimum norm solution . . . . . . . . . . . . . . . . . Least angle regressions (LARS) algorithm with Lasso modification. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


Algorithm 5.1 Algorithm 5.2 Algorithm 5.3

Lanczos algorithm for GEVD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212 Tangent algorithm for computing the GEVD . . . . . . . . . . . . . . . . 213 GEVD algorithm for singular matrix B . . . . . . . . . . . . . . . . . . . . . . 213

Algorithm 6.1 Algorithm 6.2 Algorithm 6.3 Algorithm 6.4 Algorithm 6.5 Algorithm 6.6 Algorithm 6.7

Stochastic gradient (SG) method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . SVRG method for minimizing an empirical risk Rn . . . . . . . . . SAGA method for minimizing an empirical risk Rn . . . . . . . . . APPROX coordinate descent method . . . . . . . . . . . . . . . . . . . . . . . . LogitBoost (two classes) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Adaptive boosting (AdaBoost) algorithm . . . . . . . . . . . . . . . . . . . . Gentle AdaBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


Algorithm 6.8 Algorithm 6.9 Algorithm 6.10 Algorithm 6.11 Algorithm 6.12 Algorithm 6.13 Algorithm 6.14 Algorithm 6.15 Algorithm 6.16 Algorithm 6.17 Algorithm 6.18 Algorithm 6.19 Algorithm 6.20 Algorithm 6.21 Algorithm 6.22 Algorithm 6.23 Algorithm 6.24 Algorithm 6.25 Algorithm 6.26 Algorithm 6.27 Algorithm 6.28 Algorithm 6.29 Algorithm 6.30 Algorithm 6.31 Algorithm 6.32 Algorithm Algorithm Algorithm Algorithm

6.33 6.34 6.35 6.36

Algorithm 6.37 Algorithm Algorithm Algorithm Algorithm Algorithm Algorithm

6.38 6.39 6.40 6.41 6.42 6.43

Feature clustering algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Spectral projected gradient (SPG) algorithm for nonlinear joint unsupervised feature selection . . . . . . . . . . . . . . . Robust PCA via accelerated proximal gradient . . . . . . . . . . . . . . Sparse principal component analysis (SPCA) algorithm . . . . Principal component regression algorithm . . . . . . . . . . . . . . . . . . . Nonlinear iterative partial least squares (NIPALS) algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simple nonlinear iterative partial least squares regression algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . GPSR-BB algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . k-nearest neighbor (kNN) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alternating projection algorithm for the supervised tensor learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tensor learning algorithm for regression . . . . . . . . . . . . . . . . . . . . . Random K-means algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Normalized spectral clustering algorithm of Ng et al. . . . . . . . Normalized spectral clustering algorithm of Shi and Malik . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Constrained spectral clustering for two-way partition . . . . . . . Constrained spectral clustering for K-way partition . . . . . . . . . Nyström method for matrix approximation . . . . . . . . . . . . . . . . . . Fast spectral clustering algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . Self-training algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Propagating 1-nearest neighbor clustering algorithm . . . . . . . . Co-training algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Canonical correlation analysis (CCA) algorithm . . . . . . . . . . . . Penalized (sparse) canonical component analysis (CCA) algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Manifold regularization algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . Harmonic function algorithm for semi-supervised graph learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 directed graph construction algorithm. . . . . . . . . . . . . . . . . . . . . Active learning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Co-active learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Active learning for finding opposite pair close to separating hyperplane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Active learning-extreme learning machine (AL-ELM) algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Q-learning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Double Q-learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Weighted double Q-learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
Modified connectionist Q-learning algorithm . . . . . . . . . . . . . . . . Deep Q-learning with experience replay . . . . . . . . . . . . . . . . . . . . . TrAdaBoost algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


Algorithm 6.44 Algorithm 6.45 Algorithm 6.46 Algorithm 6.47 Algorithm 6.48 Algorithm 6.49


SVD-based alternating structure optimization algorithm . . . . Structural correspondence learning algorithm . . . . . . . . . . . . . . . EigenCluster: a unified framework for transfer learning . . . . p -norm multiple kernel learning algorithm . . . . . . . . . . . . . . . . . Heterogeneous feature augmentation . . . . . . . . . . . . . . . . . . . . . . . . . Information-theoretic metric learning . . . . . . . . . . . . . . . . . . . . . . . .

ADAGRAD algorithm for minimizing   u, x + 12 x, Hx + λx2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Algorithm 7.2 ADAM algorithm for stochastic optimization. . . . . . . . . . . . . . . . Algorithm 7.3 CD1 fast RBM learning algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . Algorithm 7.4 Backpropagation algorithm for the basic autoencoder. . . . . . . Algorithm 7.5 Extreme learning machine (ELM) algorithm . . . . . . . . . . . . . . . . Algorithm 7.6 ELM algorithm for regression and binary classification . . . . . Algorithm 7.7 ELM multiclass classification algorithm . . . . . . . . . . . . . . . . . . . . . Algorithm 7.8 DeepWalk(G, w, d, γ , t) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Algorithm 7.9 SkipGram(, Wvi , w) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Algorithm 7.10 GraphSAGE embedding generation algorithm . . . . . . . . . . . . . . . Algorithm 7.11 Batch renormalization, applied to activation y over a mini-batch. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Algorithm 7.12 Minibatch stochastic gradient descent training of GANs . . . .


Algorithm 7.1

Algorithm 8.1 Algorithm 8.2 Algorithm 8.3 Algorithm 8.4 Algorithm 8.5 Algorithm Algorithm Algorithm Algorithm Algorithm Algorithm

9.1 9.2 9.3 9.4 9.5 9.6

Algorithm 9.7 Algorithm 9.8 Algorithm 9.9 Algorithm 9.10 Algorithm 9.11 Algorithm 9.12 Algorithm 9.13

PSVM binary classification algorithm . . . . . . . . . . . . . . . . . . . . . . . . SVM-recursive feature elimination (SVM-RFE) algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . LS-SVM multiclass classification algorithm . . . . . . . . . . . . . . . . . PSVM multiclass classification algorithm . . . . . . . . . . . . . . . . . . . Gaussian process regression algorithm . . . . . . . . . . . . . . . . . . . . . . . Roulette wheel fitness selection algorithm . . . . . . . . . . . . . . . . . . . Fast-nondominated-sort(P ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Crowding-distance-assignment (I ) . . . . . . . . . . . . . . . . . . . . . . . . . . . Single-objective simulated annealing. . . . . . . . . . . . . . . . . . . . . . . . . Multiobjective simulated annealing . . . . . . . . . . . . . . . . . . . . . . . . . . Archived multiobjective simulated annealing (AMOSA) algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Genetic algorithm with gene rearrangement (GAGR) clustering algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nondominated sorting genetic algorithm (NSGA) . . . . . . . . . . (1 + 1) evolutionary algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Multiobjective evolutionary algorithm based on decomposition (MOEA/D) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . MOEA/D-DE algorithm for Solving (9.9.5) . . . . . . . . . . . . . . . . . SPEA algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Improved strength Pareto evolutionary algorithm (SPEA2). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


Algorithm Algorithm Algorithm Algorithm Algorithm


9.14 9.15 9.16 9.17 9.18

Differential evolution (DE) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Framework of artificial bee colony (ABC) algorithm. . . . . . . . Particle swarm optimization (PSO) algorithm . . . . . . . . . . . . . . . Genetic learning PSO (GL-PSO) algorithm . . . . . . . . . . . . . . . . . . Hybrid particle swarm optimization with local search (HPSO-LS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Algorithm 9.19 Opposition-based differential evolution (ODE) . . . . . . . . . . . . . .


Part I

Introduction to Matrix Algebra

Chapter 1

Basic Matrix Computation

In science and engineering, we often encounter the problem of solving a system of linear equations. Matrices provide the most basic and useful mathematical tool for describing and solving such systems. As an introduction to matrix algebra, this chapter presents the basic operations and properties of matrices, followed by a description of the vectorization of matrices and the matricization of vectors.

1.1 Basic Concepts of Vectors and Matrices

First we introduce the basic concepts and notation for vectors and matrices.

1.1.1 Vectors and Matrices

A real system of linear equations

a_{11} x_1 + \cdots + a_{1n} x_n = b_1
\qquad\vdots
a_{m1} x_1 + \cdots + a_{mn} x_n = b_m      (1.1.1)

can be simply rewritten using real vector and matrix symbols as a real matrix equation

Ax = b,      (1.1.2)

where

A = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix} \in \mathbb{R}^{m \times n}, \quad
x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} \in \mathbb{R}^{n}, \quad
b = \begin{bmatrix} b_1 \\ \vdots \\ b_m \end{bmatrix} \in \mathbb{R}^{m}.      (1.1.3)
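As a minimal numerical illustration of the correspondence between the scalar form (1.1.1) and the matrix form (1.1.2)-(1.1.3), the following NumPy sketch (the particular numbers are arbitrary and not from the text) builds A, x, and b and checks that each entry of b is the row sum a_{i1}x_1 + ... + a_{in}x_n:

import numpy as np

# an arbitrary 3 x 2 system Ax = b (m = 3 equations, n = 2 unknowns)
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])      # A in R^{3x2}
x = np.array([1.0, -1.0])       # x in R^2
b = A @ x                       # b in R^3

# each b_i equals a_{i1} x_1 + a_{i2} x_2, as in (1.1.1)
assert np.allclose(b, [1*1 + 2*(-1), 3*1 + 4*(-1), 5*1 + 6*(-1)])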

If a system of linear equations is complex valued, then A ∈ C^{m×n}, x ∈ C^n, and b ∈ C^m. Here, R (or C) denotes the set of real (or complex) numbers, and R^m (or C^m) represents the set of all real (or complex) m-dimensional column vectors, while R^{m×n} (or C^{m×n}) is the set of all m × n real (or complex) matrices. An m-dimensional row vector x = [x_1, . . . , x_m] is represented as x ∈ R^{1×m} or x ∈ C^{1×m}. To save writing space, an m-dimensional column vector is usually written as the transpose of a row vector, denoted x = [x_1, . . . , x_m]^T, where T denotes "transpose."
There are three different types of vector in science and engineering [18]:
• Physical vector: Its elements are physical quantities with magnitude and direction, such as a displacement vector, a velocity vector, an acceleration vector, and so forth.
• Geometric vector: A directed line segment or arrow can visualize a physical vector. Such a representation is known as a geometric vector. For example, v = \vec{AB} represents the directed line segment with initial point A and terminal point B.
• Algebraic vector: A geometric vector needs to be represented in algebraic form in order to operate on it. For a geometric vector v = \vec{AB} in the plane, if its initial point is A = (a_1, a_2) and its terminal point is B = (b_1, b_2), then the geometric vector can be represented in the algebraic form v = \begin{bmatrix} b_1 - a_1 \\ b_2 - a_2 \end{bmatrix}. Such a geometric vector described in algebraic form is known as an algebraic vector.
The vectors most often encountered in practical applications are physical vectors, while geometric vectors and algebraic vectors are, respectively, the visual representation and the algebraic form of physical vectors. Algebraic vectors provide a computational tool for physical vectors.
Depending on the types of their element values, algebraic vectors can be divided into the following three types:
• Constant vector: All entries are real or complex constants, e.g., a = [1, 5, 2]^T.
• Function vector: Its entries take function values, e.g., x = [x_1, . . . , x_n]^T with function-valued entries.
• Random vector: Its entries are random variables or signals, e.g., x(n) = [x_1(n), . . . , x_m(n)]^T, where x_1(n), . . . , x_m(n) are m random variables or signals.

Fig. 1.1 Classification of vectors: vectors are divided into physical, geometric, and algebraic vectors, and algebraic vectors are further divided into constant, function, and random vectors

Figure 1.1 summarizes the classification of vectors.
A vector all of whose components are equal to zero is called a zero vector and is denoted 0 = [0, . . . , 0]^T. An n × 1 vector x = [x_1, . . . , x_n]^T with only one nonzero entry x_i = 1 constitutes a basis vector, denoted e_i; e.g.,

e_1 = [1, 0, 0, . . . , 0]^T, \quad e_2 = [0, 1, 0, . . . , 0]^T, \quad . . . , \quad e_n = [0, 0, . . . , 0, 1]^T.      (1.1.4)

In modeling physical problems, the matrix A is usually the symbolic representation of a physical system (e.g., a linear system, a filter, or a learning machine). An m × n matrix A is called a square matrix if m = n, a broad matrix for m < n, and a tall matrix for m > n.
The main diagonal of an n × n matrix A = [a_{ij}] is the segment connecting the top left to the bottom right corner. The entries located on the main diagonal, a_{11}, a_{22}, . . . , a_{nn}, are known as the (main) diagonal elements.
An n × n matrix A = [a_{ij}] is called a diagonal matrix if all entries off the main diagonal are zero; it is the identity matrix if, in addition, all diagonal entries equal 1, and the zero matrix if all diagonal entries equal 0 as well; i.e.,

D = Diag(d_{11}, . . . , d_{nn}),      (1.1.5)
I = Diag(1, . . . , 1),      (1.1.6)
O = Diag(0, . . . , 0).      (1.1.7)
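A short NumPy sketch (not part of the original text; the diagonal values are arbitrary) of the three special matrices in (1.1.5)-(1.1.7) and of the basis vectors e_1, . . . , e_n as the columns of the identity matrix:

import numpy as np

n = 4
D = np.diag([2.0, 3.0, 5.0, 7.0])   # diagonal matrix Diag(d11, ..., dnn)
I = np.eye(n)                       # identity matrix Diag(1, ..., 1)
O = np.zeros((n, n))                # zero matrix Diag(0, ..., 0)

basis = [I[:, j] for j in range(n)] # e_1, ..., e_n are the columns of I
assert np.array_equal(basis[0], np.array([1.0, 0.0, 0.0, 0.0]))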

Clearly, an n × n identity matrix I can be represented as I = [e_1, . . . , e_n] using basis vectors. In this book, we often use the following matrix symbols: A(i, :) denotes the ith row of A, and A(:, j) denotes the jth column of A.


A(p : q, r : s) denotes the (q − p + 1) × (s − r + 1) submatrix consisting of the pth row to the qth row and the rth column to the sth column of A. For example,

A(2 : 5, 1 : 3) = \begin{bmatrix} a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \\ a_{41} & a_{42} & a_{43} \\ a_{51} & a_{52} & a_{53} \end{bmatrix}.
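The same row, column, and submatrix extraction can be sketched in NumPy (an illustrative example with arbitrary entries, not from the text); note that the book notation is 1-based and end-inclusive, whereas NumPy indexing is 0-based and end-exclusive:

import numpy as np

A = np.arange(1, 26).reshape(5, 5)   # a 5 x 5 matrix with entries 1, ..., 25

row2 = A[1, :]       # A(2, :), the 2nd row
col3 = A[:, 2]       # A(:, 3), the 3rd column
sub  = A[1:5, 0:3]   # A(2:5, 1:3), a 4 x 3 submatrix
print(sub.shape)     # (4, 3)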

A matrix A is an m × n block matrix if it can be represented in the form

A = [A_{ij}] = \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1n} \\ A_{21} & A_{22} & \cdots & A_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ A_{m1} & A_{m2} & \cdots & A_{mn} \end{bmatrix}.
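As a small sketch of assembling such a block matrix (an assumed example, not from the text), NumPy's block constructor stacks the submatrices in the indicated pattern:

import numpy as np

A11 = np.ones((2, 2));  A12 = np.zeros((2, 3))
A21 = np.zeros((3, 2)); A22 = 2 * np.eye(3)

A = np.block([[A11, A12],
              [A21, A22]])   # a 2 x 2 block matrix of total size 5 x 5
print(A.shape)               # (5, 5)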

1.1.2 Basic Vector Calculus

The vector addition of u = [u_1, . . . , u_n]^T and v = [v_1, . . . , v_n]^T is defined as

u + v = [u_1 + v_1, . . . , u_n + v_n]^T.      (1.1.8)

Vector addition obeys the commutative law and the associative law:
• Commutative law: u + v = v + u.
• Associative law: (u + v) ± w = u + (v ± w) = (u ± w) + v.
The multiplication of an n × 1 vector u by a scalar α is given by

αu = [αu_1, . . . , αu_n]^T.      (1.1.9)

The basic property of vector multiplication by a scalar is that it obeys the distributive law: α(u + v) = αu + αv.

(1.1.10)

Given two real or complex n × 1 vectors u = [u_1, . . . , u_n]^T and v = [v_1, . . . , v_n]^T, their inner product (also called dot product or scalar product), denoted u · v or ⟨u, v⟩, is defined as the real number

u · v = ⟨u, v⟩ = u^T v = u_1 v_1 + \cdots + u_n v_n = \sum_{i=1}^{n} u_i v_i,      (1.1.11)

or the complex number

u · v = ⟨u, v⟩ = u^H v = u_1^* v_1 + \cdots + u_n^* v_n = \sum_{i=1}^{n} u_i^* v_i,      (1.1.12)

where u^H is the complex conjugate transpose (or Hermitian conjugate) of u. Note that if u and v are two row vectors, then u · v = uv^T for real vectors and u · v = uv^H for complex vectors.
The outer product (or cross product) of an m × 1 real vector and an n × 1 real vector, denoted u ◦ v, is defined as the m × n real matrix

u ◦ v = uv^T = \begin{bmatrix} u_1 v_1 & \cdots & u_1 v_n \\ \vdots & \ddots & \vdots \\ u_m v_1 & \cdots & u_m v_n \end{bmatrix};

(1.1.13)

if u and v are complex, then their outer product is the m × n complex matrix given by

u ◦ v = uv^H = \begin{bmatrix} u_1 v_1^* & \cdots & u_1 v_n^* \\ \vdots & \ddots & \vdots \\ u_m v_1^* & \cdots & u_m v_n^* \end{bmatrix}.

(1.1.14)
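A brief NumPy sketch of the inner and outer products just defined (the sample vectors are arbitrary assumptions, not from the text); np.vdot conjugates its first argument, which matches the convention u^H v of (1.1.12):

import numpy as np

u = np.array([1 + 2j, 3 - 1j])
v = np.array([2 - 1j, 1 + 4j])
inner = np.vdot(u, v)              # u^H v as in (1.1.12)
outer = np.outer(u, np.conj(v))    # u v^H as in (1.1.14)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
print(x @ y)                       # u^T v as in (1.1.11)
print(np.outer(x, y))              # u v^T as in (1.1.13)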

1.1.3 Basic Matrix Calculus

Definition 1.1 (Matrix Transpose) If A = [a_{ij}] is an m × n matrix, then the matrix transpose A^T is an n × m matrix with the (i, j)th entry [A^T]_{ij} = a_{ji}. The conjugate of A is represented as A^* and is an m × n matrix with (i, j)th entry [A^*]_{ij} = a_{ij}^*, while the conjugate or Hermitian transpose of A, denoted A^H ∈ C^{n×m}, is defined as

A^H = (A^*)^T = \begin{bmatrix} a_{11}^* & \cdots & a_{m1}^* \\ \vdots & \ddots & \vdots \\ a_{1n}^* & \cdots & a_{mn}^* \end{bmatrix}.      (1.1.15)

Definition 1.2 (Hermitian Matrix) An n × n real (or complex) matrix satisfying AT = A (or AH = A) is called a symmetric matrix (or Hermitian matrix). There are the following relationships between the transpose and conjugate transpose of a matrix: AH = (A∗ )T = (AT )∗ .

(1.1.16)


For an m × n block matrix A = [A_{ij}], its conjugate transpose A^H = [A_{ji}^H] is an n × m block matrix:

A^H = \begin{bmatrix} A_{11}^H & \cdots & A_{m1}^H \\ \vdots & \ddots & \vdots \\ A_{1n}^H & \cdots & A_{mn}^H \end{bmatrix}.

The simplest algebraic operations with matrices are the addition of two matrices and the multiplication of a matrix by a scalar.

Definition 1.3 (Matrix Addition) Given two m × n matrices A = [a_{ij}] and B = [b_{ij}], matrix addition A + B is defined by [A + B]_{ij} = a_{ij} + b_{ij}. Similarly, matrix subtraction A − B is defined as [A − B]_{ij} = a_{ij} − b_{ij}.

By using this definition, it is easy to verify that the addition and subtraction of two matrices obey the following rules:
• Commutative law: A + B = B + A.
• Associative law: (A + B) ± C = A + (B ± C) = (A ± C) + B.

Definition 1.4 (Matrix Product) The matrix product of an m × n matrix A = [a_{ij}] and an r × s matrix B = [b_{ij}], denoted AB, exists only when n = r and is an m × s matrix with entries

[AB]_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}, \quad i = 1, . . . , m; \; j = 1, . . . , s.

In particular, if B = αI, then [αA]_{ij} = α a_{ij}. If B = x = [x_1, . . . , x_n]^T, then [Ax]_i = \sum_{j=1}^{n} a_{ij} x_j for i = 1, . . . , m.
The matrix product obeys the following rules of operation:
1. Associative law of multiplication: If A ∈ C^{m×n}, B ∈ C^{n×p}, and C ∈ C^{p×q}, then A(BC) = (AB)C.
2. Left distributive law of multiplication: For two m × n matrices A and B, if C is an n × p matrix, then (A ± B)C = AC ± BC.
3. Right distributive law of multiplication: If A is an m × n matrix, while B and C are two n × p matrices, then A(B ± C) = AB ± AC.
4. If α is a scalar and A and B are two m × n matrices, then α(A + B) = αA + αB.
Note that the product of two matrices generally does not satisfy the commutative law, namely AB ≠ BA.
Another important operation on a square matrix is its inversion. Put x = [x_1, . . . , x_n]^T and y = [y_1, . . . , y_n]^T. The matrix-vector product Ax = y can be regarded as a linear transform of the vector x, where the n × n matrix A is called the linear transform matrix. Let A^{−1} denote the linear inverse transform of

the vector y onto x. If A−1 exists, then one has x = A−1 y.

(1.1.17)

This can be viewed as the result of using A−1 to premultiply the original linear transform Ax = y, giving A−1 Ax = A−1 y = x, which means that the linear inverse transform matrix A−1 must satisfy A−1 A = I. Similarly, we have x = A−1 y ⇒ Ax = AA−1 y ≡ y,

∀ y ≠ 0.

This implies that A−1 must also satisfy AA−1 = I. Then, the inverse matrix can be defined below. Definition 1.5 (Inverse Matrix) Let A be an n × n matrix. The matrix A is said to be invertible if there is an n × n matrix A−1 such that AA−1 = A−1 A = I, and A−1 is referred to as the inverse matrix of A. The following are properties of the conjugate, transpose, conjugate transpose, and inverse matrices. 1. The matrix conjugate, transpose, and conjugate transpose satisfy the distributive law: (A + B)∗ = A∗ + B∗ ,

(A + B)T = AT + BT ,

(A + B)H = AH + BH .

2. The transpose, conjugate transpose, and inverse matrix of product of two matrices satisfy the following relationship: (AB)T = BT AT ,

(AB)H = BH AH ,

(AB)−1 = B−1 A−1

in which both A and B are assumed to be invertible. 3. Each of the symbols for the conjugate, transpose, and conjugate transpose can be exchanged with the symbol for the inverse: (A∗ )−1 = (A−1 )∗ = A−∗ , (AT )−1 = (A−1 )T = A−T , (AH )−1 = (A−1 )H = A−H .

4. For any m × n matrix A, both the n × n matrix B = A^H A and the m × m matrix C = AA^H are Hermitian matrices.
An n × n matrix A is nonsingular if and only if the matrix equation Ax = 0 has only the zero solution x = 0. If Ax = 0 has any nonzero solution x ≠ 0, then the matrix A is singular. For an n × n matrix A = [a_1, . . . , a_n], the matrix equation Ax = 0 is equivalent to

a_1 x_1 + \cdots + a_n x_n = 0.      (1.1.18)

If the matrix equation Ax = 0 has only the zero solution vector, then the column vectors of A are linearly independent, and thus the matrix A is said to be nonsingular. To summarize the above discussion, the nonsingularity of an n × n matrix A can be determined in any of the following three ways:
• its column vectors are linearly independent;
• the matrix equation Ax = b has a unique solution;
• the matrix equation Ax = 0 has only the zero solution.
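The properties above can be checked numerically; the following NumPy sketch (with arbitrarily chosen small matrices, not taken from the text) verifies the product rules for the transpose and the inverse and tests nonsingularity via the matrix rank:

import numpy as np

A = np.array([[2.0, 1.0], [1.0, 3.0]])
B = np.array([[0.0, 1.0], [2.0, 3.0]])

assert np.allclose((A @ B).T, B.T @ A.T)                      # (AB)^T = B^T A^T
assert np.allclose(np.linalg.inv(A @ B),
                   np.linalg.inv(B) @ np.linalg.inv(A))       # (AB)^{-1} = B^{-1} A^{-1}

print(np.linalg.matrix_rank(A))      # 2: full rank, so the columns are independent
x = np.linalg.solve(A, np.array([1.0, 2.0]))   # the unique solution of Ax = b
assert np.allclose(A @ x, [1.0, 2.0])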

1.2 Sets and Linear Mapping

The set of all n-dimensional vectors with real (or complex) components is called a real (or complex) n-dimensional vector space, denoted R^n (or C^n). In real-world artificial intelligence problems, we are usually given N n-dimensional real data vectors {x_1, . . . , x_N} that belong to a subset X rather than the whole space R^n. Such a subset is known as a vector subspace of the n-dimensional vector space R^n, denoted {x_1, . . . , x_N} ∈ X ⊂ R^n. In this section, we present sets, vector subspaces, and the linear mapping of one vector subspace onto another.

1.2.1 Sets

As the name implies, a set is a collection of elements. A set is usually denoted by S = {·}; inside the braces are the elements of the set S. If there are only a few elements in the set S, these elements are written out within the braces, e.g., S = {a, b, c}. To describe the composition of a more complex set mathematically, the symbol "|" is used to mean "such that." For example, S = {x | P(x) = 0} reads "the element x in set S such that P(x) = 0." A set with only one element α is called a singleton, denoted {α}.
The following are several common notations for set operations:
• ∀ denotes "for all · · · ";
• x ∈ A reads "x belongs to the set A," i.e., x is an element or member of A;
• x ∉ A means that x is not an element of the set A;
•  denotes "such that";
• ∃ denotes "there exists";
• ∄ denotes "there does not exist";
• A ⇒ B reads "condition A results in B" or "A implies B."
For example, the expression

∃ θ ∈ V such that x + θ = x = θ + x, ∀ x ∈ V


denotes the trivial statement "there is a zero element θ in the set V such that the addition x + θ = x = θ + x holds for all elements x in V."
Let A and B be two sets; then the sets have the following basic relations. The notation A ⊆ B reads "the set A is contained in the set B" or "A is a subset of B," which implies that each element in A is an element in B, namely x ∈ A ⇒ x ∈ B. If A ⊂ B, then A is called a proper subset of B. The notation B ⊃ A reads "B contains A" or "B is a superset of A." The set with no elements is denoted by ∅ and is called the null set.
The notation A = B reads "the set A equals the set B," which means that A ⊆ B and B ⊆ A, or x ∈ A ⇔ x ∈ B (any element in A is an element in B, and vice versa). The negation of A = B is written as A ≠ B, implying that A is not contained in B or B is not contained in A.
The union of A and B is denoted as A ∪ B. If X = A ∪ B, then X is called the union set of A and B. The union set is defined as follows:

X = A ∪ B = {x ∈ X | x ∈ A or x ∈ B}.

(1.2.1)

In other words, the elements of the union set A ∪ B consist of the elements of A and the elements of B. The intersection of both sets A and B is represented by the notation A ∩ B and is defined as follows: X = A ∩ B = {x ∈ X| x ∈ A and x ∈ B}.

(1.2.2)

The set X = A ∩ B is called the intersection set of A and B. Each element of the intersection set A ∩ B consists of elements common to both A and B. Especially, if A ∩ B = ∅ (null set), then the sets A and B are known as the disjoint sets. The notation Z = A + B means the sum set of sets A and B and is defined as follows: Z = A + B = {z = x + y ∈ Z| x ∈ A, y ∈ B},

(1.2.3)

namely, an element z of the sum set Z = A + B consists of the sum of the element x in A and the element y in B. The set-theoretic difference of the sets A and B, denoted “A − B,” is also termed the difference set and is defined as follows: X = A − B = {x ∈ X| x ∈ A, but x ∈ / B}.

(1.2.4)

That is to say, the difference set A − B is the set of elements of A that are not in B. The difference set A − B is also sometimes denoted by the notation X = A \ B. For example, {Cn \ 0} denotes the set of nonzero vectors in complex n-dimensional vector space.

12

1 Basic Matrix Computation

The relative complement of A in X is defined as

A^c = X − A = X \ A = {x ∈ X | x ∉ A}.

(1.2.5)

Example 1.1 For the sets A = {1, 5, 3} and B = {3, 4, 5}, we have

A ∪ B = {1, 5, 3, 4},  A ∩ B = {5, 3},  A + B = {4, 9, 8},
A − B = A \ B = {1},  B − A = B \ A = {4}.
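The union, intersection, and difference operations of Example 1.1 can be reproduced directly with Python's built-in set type (a small sketch, not part of the text; printed set ordering is arbitrary):

A = {1, 5, 3}
B = {3, 4, 5}

print(A | B)   # union A ∪ B        -> {1, 3, 4, 5}
print(A & B)   # intersection A ∩ B -> {3, 5}
print(A - B)   # difference A \ B   -> {1}
print(B - A)   # difference B \ A   -> {4}

X = {1, 2, 3, 4, 5}
print(X - A)   # relative complement of A in X -> {2, 4}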

If x ∈ X and y ∈ Y, then the set of all ordered pairs (x, y) is denoted by X × Y and is termed the Cartesian product of the sets X and Y; it is defined as

X × Y = {(x, y) | x ∈ X, y ∈ Y}.

(1.2.6)

Similarly, X_1 × · · · × X_n denotes the Cartesian product of n sets X_1, . . . , X_n, and its elements are the ordered n-tuples (x_1, . . . , x_n):

X_1 × · · · × X_n = {(x_1, . . . , x_n) | x_1 ∈ X_1, . . . , x_n ∈ X_n}.

(1.2.7)
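A compact sketch of the Cartesian product in Python (illustrative sets assumed, not from the text) uses itertools.product to enumerate the ordered pairs and n-tuples:

from itertools import product

X = {0, 1}
Y = {'a', 'b'}
XY = set(product(X, Y))        # X × Y as a set of ordered pairs
print(XY)                      # {(0, 'a'), (0, 'b'), (1, 'a'), (1, 'b')}

X1, X2, X3 = {0, 1}, {0, 1}, {0, 1}
print(len(set(product(X1, X2, X3))))   # 8 ordered triples in X1 × X2 × X3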

1.2.2 Linear Mapping

Consider the transformation between vectors in two vector spaces. In mathematics, mapping is a synonym for mathematical function or for morphism. A mapping T : V → W represents a rule for transforming the vectors in V to corresponding vectors in W. The subspace V is said to be the initial set or domain of the mapping T, and W its final set or codomain. When v is some vector in the vector space V, T(v) is referred to as the image of the vector v under the mapping T, or the value of the mapping T at the point v, whereas v is known as the original image of T(v). If T(V) represents the collection of transformed outputs of all vectors v in V, i.e.,

T(V) = {T(v) | v ∈ V},

(1.2.8)

then T (V ) is said to be the range of the mapping T , denoted by Im(T ) = T (V ) = {T (v) | v ∈ V }.

(1.2.9)

1.2 Sets and Linear Mapping

13

Definition 1.6 (Linear Mapping) A mapping T is called a linear mapping or linear transformation in the vector space V if it satisfies the linear relationship T (c1 v1 + c2 v2 ) = c1 T (v1 ) + c2 T (v2 )

(1.2.10)

for all v_1, v_2 ∈ V and all scalars c_1, c_2.
A linear mapping has the following basic properties: if T : V → W is a linear mapping, then

T(0) = 0  and  T(−x) = −T(x).

(1.2.11)

If f (X, Y) is a scalar function with real matrices X ∈ Rn×n and Y ∈ Rn×n as variables, then in linear mapping notation, the function can be denoted by the Cartesian product form f : Rn×n × Rn×n → R. An interesting special application of linear mappings is the electronic amplifier A ∈ Cn×n with high fidelity (Hi-Fi). By Hi-Fi, it means that there is the following linear relationship between any input signal vector u and the corresponding output signal vector Au of the amplifier: Au = λu,

(1.2.12)

where λ > 1 is the amplification factor or gain. The equation above is a typical characteristic equation of a matrix.

Definition 1.7 (One-to-One Mapping) T : V → W is a one-to-one (injective) mapping if T(x) = T(y) implies x = y, i.e., distinct elements have distinct images.

A one-to-one mapping T : V → W of V onto W has an inverse mapping T^{−1} : W → V. The inverse mapping T^{−1} restores what the mapping T has done. Hence, if T(v) = w, then T^{−1}(w) = v, resulting in T^{−1}(T(v)) = v, ∀ v ∈ V and T(T^{−1}(w)) = w, ∀ w ∈ W.

If u_1, . . . , u_p are the input vectors of a system T in engineering, then T(u_1), . . . , T(u_p) can be viewed as the output vectors of the system. The criterion for identifying whether a system is linear is: if the system input is the linear combination y = c_1 u_1 + · · · + c_p u_p, then the system is said to be linear only if its output satisfies T(y) = T(c_1 u_1 + · · · + c_p u_p) = c_1 T(u_1) + · · · + c_p T(u_p). Otherwise, the system is nonlinear (a short numerical check of this criterion is sketched after Theorem 1.1 below).

The following are intrinsic relationships between a linear space and a linear mapping.

Theorem 1.1 ([4, p. 29]) Let V and W be two vector spaces, and let T : V → W be a linear mapping. Then the following relationships are true:
• if M is a linear subspace in V, then T(M) is a linear subspace in W;
• if N is a linear subspace in W, then the linear inverse transform T^{−1}(N) is a linear subspace in V.
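The linearity criterion above can be tested numerically; in this sketch (an assumed example, not from the text) a matrix map passes the test while an affine map with a constant shift fails it:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))

def T_linear(x):
    return A @ x            # a matrix map, expected to satisfy (1.2.10)

def T_affine(x):
    return A @ x + 1.0      # the constant shift destroys linearity

x1, x2 = rng.standard_normal(3), rng.standard_normal(3)
c1, c2 = 2.0, -0.5

print(np.allclose(T_linear(c1*x1 + c2*x2), c1*T_linear(x1) + c2*T_linear(x2)))  # True
print(np.allclose(T_affine(c1*x1 + c2*x2), c1*T_affine(x1) + c2*T_affine(x2)))  # False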


For a given linear transformation y = Ax, if our task is to obtain the output vector y from the input vector x given a transformation matrix A, then Ax = y is said to be a forward problem. Conversely, the problem of finding the input vector x from the output vector y is known as an inverse problem. Clearly, the essence of the forward problem is a matrix-vector multiplication, while the essence of the inverse problem is to solve a matrix equation.
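As a concrete illustration, the following NumPy sketch (the matrix and vectors are arbitrary illustrative choices) contrasts the forward problem, a plain matrix-vector product, with the inverse problem, which requires solving a linear system:

    import numpy as np

    A = np.array([[2.0, 1.0],
                  [1.0, 3.0]])        # known transformation matrix
    x = np.array([1.0, 2.0])          # known input

    # Forward problem: given A and x, compute the output y = Ax.
    y = A @ x

    # Inverse problem: given A and y, recover the input x by solving Ax = y.
    x_recovered = np.linalg.solve(A, y)

    print(np.allclose(x, x_recovered))    # True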

1.3 Norms

Many problems in artificial intelligence reduce to optimization problems in which vector and/or matrix norms appear as the cost function terms to be optimized.

1.3.1 Vector Norms

Definition 1.8 (Vector Norm) Let V be a real or complex vector space. Given a vector x ∈ V, the mapping function p(x) : V → R is called the vector norm of x ∈ V if, for all vectors x, y ∈ V and any scalar c ∈ K (here K denotes R or C), the following three norm axioms hold:
• Nonnegativity: p(x) ≥ 0, and p(x) = 0 ⇔ x = 0.
• Homogeneity: p(cx) = |c| · p(x) for any scalar c.
• Triangle inequality: p(x + y) ≤ p(x) + p(y).

In a real or complex inner product space V, the vector norms have the following properties [4].
1. ‖0‖ = 0 and ‖x‖ > 0, ∀ x ≠ 0.
2. ‖cx‖ = |c| ‖x‖ holds for all vectors x ∈ V and any scalar c ∈ K.
3. Polarization identity: For real inner product spaces we have

    ⟨x, y⟩ = (1/4)(‖x + y‖² − ‖x − y‖²),  ∀ x, y,    (1.3.1)

and for complex inner product spaces we have

    ⟨x, y⟩ = (1/4)(‖x + y‖² − ‖x − y‖² − j‖x + jy‖² + j‖x − jy‖²),  ∀ x, y.    (1.3.2)

4. Parallelogram law

    ‖x + y‖² + ‖x − y‖² = 2(‖x‖² + ‖y‖²),  ∀ x, y.    (1.3.3)


5. Triangle inequality

    ‖x + y‖ ≤ ‖x‖ + ‖y‖,  ∀ x, y ∈ V.    (1.3.4)

6. Cauchy–Schwartz inequality

    |⟨x, y⟩| ≤ ‖x‖ · ‖y‖.    (1.3.5)

The equality |⟨x, y⟩| = ‖x‖ · ‖y‖ holds if and only if y = c x, where c is some nonzero complex constant.

The following are several common norms of constant vectors.

• ℓ0-norm

    ‖x‖_0 = Σ_{i=1}^m x_i^0,  where x_i^0 = 1 if x_i ≠ 0, and x_i^0 = 0 if x_i = 0.

That is,

    ‖x‖_0 = number of nonzero entries of x.    (1.3.6)

• ℓ1-norm

    ‖x‖_1 = Σ_{i=1}^m |x_i| = |x_1| + · · · + |x_m|,    (1.3.7)

i.e., ‖x‖_1 is the sum of the absolute (or modulus) values of the nonzero entries of x.

• ℓ2-norm or the Euclidean norm

    ‖x‖_2 = ‖x‖_E = √(|x_1|² + · · · + |x_m|²).    (1.3.8)

• ℓ∞-norm

    ‖x‖_∞ = max{|x_1|, . . . , |x_m|}.    (1.3.9)

• ℓp-norm or Hölder norm [19]

    ‖x‖_p = ( Σ_{i=1}^m |x_i|^p )^{1/p},  p ≥ 1.    (1.3.10)


The ℓ0-norm does not satisfy the homogeneity condition ‖cx‖_0 = |c| · ‖x‖_0, and thus is only a quasi-norm; the ℓp-norm is likewise a quasi-norm if 0 < p < 1 but a norm if p ≥ 1. Clearly, when p = 1 or p = 2, the ℓp-norm reduces to the ℓ1-norm or the ℓ2-norm, respectively.

In artificial intelligence, the ℓ0-norm is usually used as a measure of sparse vectors, i.e.,

    sparse vector x = arg min_x ‖x‖_0.    (1.3.11)

However, the ℓ0-norm is difficult to optimize. Importantly, the ℓ1-norm can be regarded as a relaxed form of the ℓ0-norm and is easy to optimize:

    sparse vector x = arg min_x ‖x‖_1.    (1.3.12)

Here are several important applications of the Euclidean norm.

• Measuring the size or length of a vector:

    size(x) = ‖x‖_2 = √(x_1² + · · · + x_m²),    (1.3.13)

which is called the Euclidean length.

• Defining the ε-neighborhood of a vector x:

    N_ε(x) = {y | ‖y − x‖_2 ≤ ε},  ε > 0.    (1.3.14)

• Measuring the distance between vectors x and y:

    d(x, y) = ‖x − y‖_2 = √((x_1 − y_1)² + · · · + (x_m − y_m)²).    (1.3.15)

This is known as the Euclidean distance.

• Defining the angle θ (0 ≤ θ ≤ 2π) between vectors x and y:

    θ = arccos( ⟨x, y⟩ / (√⟨x, x⟩ √⟨y, y⟩) ) = arccos( x^H y / (‖x‖ ‖y‖) ).    (1.3.16)

A vector with unit Euclidean length is known as a normalized (or standardized) vector. For any nonzero vector x ∈ C^m, x/⟨x, x⟩^{1/2} is the normalized version of the vector and has the same direction as x.

The norm ‖x‖ is said to be a unitary invariant norm if ‖Ux‖ = ‖x‖ holds for all vectors x ∈ C^m and all unitary matrices U ∈ C^{m×m} such that U^H U = I.

Proposition 1.1 ([15]) The Euclidean norm ‖·‖_2 is unitary invariant.
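A short NumPy sketch (illustrative values only) that computes the common vector norms and the Euclidean-norm quantities above:

    import numpy as np

    x = np.array([3.0, 0.0, -4.0])
    y = np.array([1.0, 2.0, 2.0])

    l0 = np.count_nonzero(x)              # l0 "norm": number of nonzero entries
    l1 = np.linalg.norm(x, 1)             # l1 norm: 3 + 0 + 4 = 7
    l2 = np.linalg.norm(x)                # Euclidean norm: 5
    linf = np.linalg.norm(x, np.inf)      # max norm: 4

    dist = np.linalg.norm(x - y)          # Euclidean distance, Eq. (1.3.15)
    theta = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))   # angle, Eq. (1.3.16)
    x_normalized = x / np.linalg.norm(x)  # unit Euclidean length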


If the inner product ⟨x, y⟩ = x^H y = 0, then the angle between the vectors is θ = π/2, from which we have the following definition of the orthogonality of two vectors.

Definition 1.9 (Orthogonal) Two constant vectors x and y are said to be orthogonal, denoted by x ⊥ y, if their inner product ⟨x, y⟩ = x^H y = 0.

Definition 1.10 (Inner Product of Function Vectors) Let x(t), y(t) be two function vectors in the complex vector space C^n, and let the definition field of the function variable t be [a, b] with a < b. Then the inner product of the function vectors x(t) and y(t) is defined as

    ⟨x(t), y(t)⟩ = ∫_a^b x^H(t) y(t) dt.    (1.3.17)

The angle between function vectors is defined as follows:

    cos θ = ⟨x(t), y(t)⟩ / (√⟨x(t), x(t)⟩ √⟨y(t), y(t)⟩) = ( ∫_a^b x^H(t) y(t) dt ) / (‖x(t)‖ · ‖y(t)‖),    (1.3.18)

where ‖x(t)‖ is the norm of the function vector x(t) and is defined as follows:

    ‖x(t)‖ = ( ∫_a^b x^H(t) x(t) dt )^{1/2}.    (1.3.19)

Clearly, if the inner product of two function vectors is equal to zero, i.e.,

    ∫_a^b x^H(t) y(t) dt = 0,

then the angle θ = π/2. Hence, two function vectors are said to be orthogonal in [a, b], denoted x(t) ⊥ y(t), if their inner product is equal to zero.

The following proposition shows that, for any two orthogonal vectors, the square of the norm of their sum is equal to the sum of the squares of the respective vector norms.

Proposition 1.2 If x ⊥ y, then ‖x + y‖² = ‖x‖² + ‖y‖².

Proof From the axioms of vector norms it is known that

    ‖x + y‖² = ⟨x + y, x + y⟩ = ⟨x, x⟩ + ⟨x, y⟩ + ⟨y, x⟩ + ⟨y, y⟩.    (1.3.20)

Since x and y are orthogonal, we have ⟨x, y⟩ = 0. Moreover, from the axioms of inner products it is known that ⟨y, x⟩ = ⟨x, y⟩ = 0. Substituting this result


into Eq. (1.3.20), we immediately get ‖x + y‖² = ⟨x, x⟩ + ⟨y, y⟩ = ‖x‖² + ‖y‖². This completes the proof of the proposition.

This proposition is also referred to as the Pythagorean theorem.

On the orthogonality of two vectors, we have the following conclusive remarks [40].
• Mathematical definition: Two vectors x and y are orthogonal if their inner product is equal to zero, i.e., ⟨x, y⟩ = 0.
• Geometric interpretation: If two vectors are orthogonal, then their angle is π/2, and the projection of one vector onto the other vector is equal to zero.
• Physical significance: When two vectors are orthogonal, each vector contains no components of the other; that is, there exist no interactions or interference between these vectors.

1.3.2 Matrix Norms

The inner product and norms of vectors can be easily extended to the inner product and norms of matrices. For two m × n complex matrices A = [a_1, . . . , a_n] and B = [b_1, . . . , b_n], stack them, respectively, into the following mn × 1 vectors according to their columns:

    a = vec(A) = [a_1^T, . . . , a_n^T]^T,    b = vec(B) = [b_1^T, . . . , b_n^T]^T,

where the elongated vector vec(A) is the vectorization of the matrix A. We will discuss the vectorization of matrices in detail in Sect. 1.9.

The inner product of two m × n matrices A and B, denoted ⟨A, B⟩ : C^{m×n} × C^{m×n} → C, is defined as the inner product of the two elongated vectors:

    ⟨A, B⟩ = ⟨vec(A), vec(B)⟩ = Σ_{i=1}^n a_i^H b_i = Σ_{i=1}^n ⟨a_i, b_i⟩,    (1.3.21)

or equivalently written as

    ⟨A, B⟩ = (vec A)^H vec(B) = tr(A^H B),    (1.3.22)

where tr(C) represents the trace function of a square matrix C, defined as the sum of its diagonal entries.


Let a = [a_11, . . . , a_m1, a_12, . . . , a_m2, . . . , a_1n, . . . , a_mn]^T = vec(A) be the mn × 1 elongated vector of the m × n matrix A. The ℓp-norm of the matrix A uses the ℓp-norm definition of the elongated vector a as follows:

    ‖A‖_p = ‖a‖_p = ‖vec(A)‖_p = ( Σ_{i=1}^m Σ_{j=1}^n |a_{ij}|^p )^{1/p}.    (1.3.23)

Since this kind of matrix norm is represented by the matrix entries, it is named the entrywise norm. The following are three typical entrywise matrix norms:

1. ℓ1-norm (p = 1)

    ‖A‖_1 = Σ_{i=1}^m Σ_{j=1}^n |a_{ij}|.    (1.3.24)

2. Frobenius norm (p = 2)

    ‖A‖_F = ( Σ_{i=1}^m Σ_{j=1}^n |a_{ij}|² )^{1/2}    (1.3.25)

is the most common matrix norm. Clearly, the Frobenius norm is an extension of the Euclidean norm of a vector to the elongated vector a = [a_11, . . . , a_m1, . . . , a_1n, . . . , a_mn]^T.

3. Max norm or ℓ∞-norm (p = ∞)

    ‖A‖_∞ = max_{i=1,...,m; j=1,...,n} {|a_{ij}|}.    (1.3.26)

Another commonly used matrix norm is the induced matrix norm ‖A‖_2, defined as

    ‖A‖_2 = √(λ_max(A^H A)) = σ_max(A),    (1.3.27)

where λ_max(A^H A) and σ_max(A) are the maximum eigenvalue of A^H A and the maximum singular value of A, respectively.

Because Σ_{i=1}^m Σ_{j=1}^n |a_{ij}|² = tr(A^H A), the Frobenius norm can be also written in the form of the trace function as follows:

    ‖A‖_F = ⟨A, A⟩^{1/2} = √(tr(A^H A)) = √(σ_1² + · · · + σ_k²),    (1.3.28)

where k = rank(A) ≤ min{m, n} is the rank of A. Clearly, we have

    ‖A‖_2 ≤ ‖A‖_F.    (1.3.29)
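The following NumPy sketch (with an arbitrary test matrix) computes the entrywise and induced norms just defined and checks Eqs. (1.3.28) and (1.3.29):

    import numpy as np

    A = np.array([[1.0, -2.0],
                  [3.0,  4.0]])

    l1 = np.abs(A).sum()                      # entrywise l1 norm (1.3.24)
    fro = np.linalg.norm(A, 'fro')            # Frobenius norm (1.3.25)
    maxn = np.abs(A).max()                    # max norm (1.3.26)
    spec = np.linalg.norm(A, 2)               # induced 2-norm = largest singular value (1.3.27)

    # Frobenius norm via the trace and via the singular values, Eq. (1.3.28)
    fro_trace = np.sqrt(np.trace(A.conj().T @ A))
    fro_svd = np.sqrt((np.linalg.svd(A, compute_uv=False) ** 2).sum())

    print(np.allclose(fro, fro_trace), np.allclose(fro, fro_svd), spec <= fro)   # True True True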


Given an m × n matrix A, its Frobenius norm weighted by a positive definite matrix Ω, denoted ‖A‖_Ω, is defined by

    ‖A‖_Ω = √(tr(A^H Ω A)).    (1.3.30)

This norm is usually called the Mahalanobis norm.

The following are the relationships between the inner products and norms [15].

• Cauchy–Schwartz inequality

    |⟨A, B⟩|² ≤ ‖A‖² ‖B‖².    (1.3.31)

The equality sign holds if and only if A = c B, where c is a complex constant.

• Pythagoras' theorem

    ⟨A, B⟩ = 0  ⇔  ‖A + B‖² = ‖A‖² + ‖B‖².    (1.3.32)

• Polarization identity

    Re(⟨A, B⟩) = (1/4)(‖A + B‖² − ‖A − B‖²),    (1.3.33)
    Re(⟨A, B⟩) = (1/2)(‖A + B‖² − ‖A‖² − ‖B‖²),    (1.3.34)

where Re(⟨A, B⟩) represents the real part of the inner product ⟨A, B⟩.

1.4 Random Vectors

In science and engineering applications, the measured data are usually random variables. A vector with random variables as its entries is called a random vector. In this section, we discuss the statistics and properties of random vectors by focusing on Gaussian random vectors.

1.4.1 Statistical Interpretation of Random Vectors

In the statistical interpretation of random vectors, the first-order and second-order statistics of random vectors are the most important.

Definition 1.11 (Inner Product of Random Vectors) Let both x(ξ) and y(ξ) be n × 1 random vectors of the variable ξ. Then the inner product of the random vectors x(ξ) and y(ξ) is defined as follows:

    ⟨x(ξ), y(ξ)⟩ = E{x^H(ξ) y(ξ)},    (1.4.1)


where E is the expectation operator, E{x(ξ)} = [E{x_1(ξ)}, . . . , E{x_n(ξ)}]^T, and the function variable ξ may be time t, circular frequency f, angular frequency ω, or a space parameter s, and so on.

The square of the norm of a random vector x(ξ) is defined as

    ‖x(ξ)‖² = E{x^H(ξ) x(ξ)}.    (1.4.2)

Given a random vector x(ξ) = [x_1(ξ), . . . , x_m(ξ)]^T, its mean vector μ_x is defined as

    μ_x = E{x(ξ)} = [E{x_1(ξ)}, . . . , E{x_m(ξ)}]^T = [μ_1, . . . , μ_m]^T,    (1.4.3)

where E{x_i(ξ)} = μ_i represents the mean of the ith random variable x_i(ξ).

The autocorrelation matrix of the random vector x(ξ) is defined by

    R_x = E{x(ξ) x^H(ξ)} = [r_{ij}]_{i,j=1}^m,    (1.4.4)

where r_{ii} (i = 1, . . . , m) denotes the autocorrelation function of the random variable x_i(ξ):

    r_{ii} = E{x_i(ξ) x_i^*(ξ)} = E{|x_i(ξ)|²},  i = 1, . . . , m,    (1.4.5)

whereas r_{ij} represents the cross-correlation function of x_i(ξ) and x_j(ξ):

    r_{ij} = E{x_i(ξ) x_j^*(ξ)},  i, j = 1, . . . , m, i ≠ j.    (1.4.6)

Clearly, the autocorrelation matrix is complex conjugate symmetric (i.e., Hermitian).

The autocovariance matrix of the random vector x(ξ), denoted C_x, is defined as

    C_x = cov(x, x) = E{(x(ξ) − μ_x)(x(ξ) − μ_x)^H}    (1.4.7)
        = [c_{ij}]_{i,j=1}^m,    (1.4.8)


where the diagonal entries

    c_{ii} = E{|x_i(ξ) − μ_i|²} = σ_i²,  i = 1, . . . , m,    (1.4.9)

denote the variance σ_i² of the random variable x_i(ξ), whereas the other entries

    c_{ij} = E{[x_i(ξ) − μ_i][x_j(ξ) − μ_j]^*} = E{x_i(ξ) x_j^*(ξ)} − μ_i μ_j^* = c_{ji}^*    (1.4.10)

express the covariance of the random variables x_i(ξ) and x_j(ξ). The autocovariance matrix is also Hermitian.

There is the following relationship between the autocorrelation and autocovariance matrices:

    C_x = R_x − μ_x μ_x^H.    (1.4.11)

By generalizing the autocorrelation and autocovariance matrices, one obtains the cross-correlation matrix of the random vectors x(ξ) and y(ξ),

    R_xy = E{x(ξ) y^H(ξ)} = [r_{x_i, y_j}]_{i=1,...,m; j=1,...,n},    (1.4.12)

and the cross-covariance matrix

    C_xy = E{[x(ξ) − μ_x][y(ξ) − μ_y]^H} = [c_{x_i, y_j}]_{i=1,...,m; j=1,...,n},    (1.4.13)

where r_{x_i, y_j} = E{x_i(ξ) y_j^*(ξ)} is the cross-correlation of the random variables x_i(ξ) and y_j(ξ), and c_{x_i, y_j} = E{[x_i(ξ) − μ_{x_i}][y_j(ξ) − μ_{y_j}]^*} is the cross-covariance of x_i(ξ) and y_j(ξ). From Eqs. (1.4.12) and (1.4.13) it is easily known that

    C_xy = R_xy − μ_x μ_y^H.    (1.4.14)

In real-world applications, a data vector x = [x(0), x(1), . . . , x(N − 1)]^T with nonzero mean μ_x usually needs to undergo a zero-mean normalization:

    x ← x = [x(0) − μ_x, x(1) − μ_x, . . . , x(N − 1) − μ_x]^T,  where μ_x = (1/N) Σ_{n=0}^{N−1} x(n).


After zero-mean normalization, the correlation matrices and covariance matrices are equal, i.e., R_x = C_x and R_xy = C_xy.

Some properties of these matrices are as follows.
1. The autocorrelation matrix is Hermitian, i.e., R_x^H = R_x.
2. The autocorrelation matrix of the linear combination vector y = Ax + b satisfies R_y = A R_x A^H.
3. The cross-correlation matrix is not Hermitian but satisfies R_xy^H = R_yx.
4. R_{(x1+x2)y} = R_{x1 y} + R_{x2 y}.
5. If x and y have the same dimension, then R_{x+y} = R_x + R_xy + R_yx + R_y.
6. R_{Ax, By} = A R_xy B^H.

The degree of correlation of two random variables x(ξ) and y(ξ) can be measured by their correlation coefficient:

    ρ_xy = E{(x(ξ) − x̄)(y(ξ) − ȳ)^*} / √( E{|x(ξ) − x̄|²} E{|y(ξ) − ȳ|²} ) = c_xy / (σ_x σ_y).    (1.4.15)

Here c_xy = E{(x(ξ) − x̄)(y(ξ) − ȳ)^*} is the cross-covariance of the random variables x(ξ) and y(ξ), while σ_x² and σ_y² are, respectively, the variances of x(ξ) and y(ξ). Applying the Cauchy–Schwartz inequality to Eq. (1.4.15), we have

    0 ≤ |ρ_xy| ≤ 1.    (1.4.16)

The correlation coefficient ρ_xy measures the degree of similarity of two random variables x(ξ) and y(ξ).
• The closer ρ_xy is to zero, the weaker the similarity of the random variables x(ξ) and y(ξ) is.
• The closer ρ_xy is to 1, the more similar x(ξ) and y(ξ) are.
• The case ρ_xy = 0 means that there are no correlated components between the random variables x(ξ) and y(ξ). Thus, if ρ_xy = 0, the random variables x(ξ) and y(ξ) are said to be uncorrelated. Since this uncorrelation is defined in a statistical sense, it is usually said to be a statistical uncorrelation.
• It is easy to verify that if x(ξ) = c y(ξ), where c is a complex number, then |ρ_xy| = 1. Up to a fixed amplitude scaling factor |c| and a phase φ(c), the random variables x(ξ) and y(ξ) are the same, so that x(ξ) = c · y(ξ) = |c| e^{jφ(c)} y(ξ). Hence, if |ρ_xy| = 1, then the random variables x(ξ) and y(ξ) are said to be completely correlated or coherent.

Definition 1.12 (Uncorrelated) Two random vectors x(ξ) = [x_1(ξ), . . . , x_m(ξ)]^T and y(ξ) = [y_1(ξ), . . . , y_n(ξ)]^T are said to be statistically uncorrelated if their cross-covariance matrix C_xy = O_{m×n} or, equivalently, ρ_{x_i, y_j} = 0, ∀ i, j.

In contrast with the case for a constant vector or function vector, an m × 1 random vector x(ξ) and an n × 1 random vector y(ξ) are said to be orthogonal if any entry


of x(ξ) is orthogonal to each entry of y(ξ), namely

    r_{x_i, y_j} = E{x_i(ξ) y_j^*(ξ)} = 0,  ∀ i = 1, . . . , m, j = 1, . . . , n;    (1.4.17)

or R_xy = E{x(ξ) y^H(ξ)} = O_{m×n}.

Definition 1.13 (Orthogonality of Random Vectors) An m × 1 random vector x(ξ) and an n × 1 random vector y(ξ) are orthogonal to each other, denoted by x(ξ) ⊥ y(ξ), if their cross-correlation matrix is equal to the m × n null matrix O, i.e.,

    R_xy = E{x(ξ) y^H(ξ)} = O.    (1.4.18)

Note that for a zero-mean normalized m × 1 random vector x(ξ) and an n × 1 normalized random vector y(ξ), statistical uncorrelation and orthogonality are equivalent, as their cross-covariance and cross-correlation matrices are equal, i.e., C_xy = R_xy.
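As a numerical illustration (synthetic data and an arbitrary mean vector), the sketch below estimates the mean, autocorrelation, and autocovariance matrices from samples and checks the relationship C_x = R_x − μ_x μ_x^H of Eq. (1.4.11):

    import numpy as np

    rng = np.random.default_rng(0)
    N, m = 10000, 3
    X = rng.normal(loc=[1.0, -2.0, 0.5], size=(N, m))    # N samples of an m-dim random vector

    mu = X.mean(axis=0)                                  # sample mean vector
    R = (X[..., None] * X[:, None, :]).mean(axis=0)      # sample autocorrelation E{x x^H}
    C = np.cov(X, rowvar=False, bias=True)               # sample autocovariance

    print(np.allclose(C, R - np.outer(mu, mu)))          # True: C_x = R_x - mu_x mu_x^H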

1.4.2 Gaussian Random Vectors

Definition 1.14 (Gaussian Random Vector) If each entry x_i(ξ), i = 1, . . . , m, is a Gaussian random variable, then the random vector x = [x_1(ξ), . . . , x_m(ξ)]^T is called a Gaussian random vector.

Let x ∼ N(x̄, Σ_x) denote a real Gaussian or normal random vector with mean vector x̄ = [x̄_1, . . . , x̄_m]^T and covariance matrix Σ_x = E{(x − x̄)(x − x̄)^T}. If the entries of the Gaussian random vector are independent identically distributed (iid), then its covariance matrix is Σ_x = E{(x − x̄)(x − x̄)^T} = Diag(σ_1², . . . , σ_m²), where σ_i² = E{(x_i − x̄_i)²} is the variance of the Gaussian random variable x_i.

Under the condition that all entries are statistically independent of each other, the probability density function of a Gaussian random vector x ∼ N(x̄, Σ_x) is the joint probability density function of its m random variables, i.e.,

    f(x) = f(x_1, . . . , x_m) = f(x_1) · · · f(x_m)
         = (1/√(2πσ_1²)) exp(−(x_1 − x̄_1)²/(2σ_1²)) · · · (1/√(2πσ_m²)) exp(−(x_m − x̄_m)²/(2σ_m²))
         = (1/((2π)^{m/2} σ_1 · · · σ_m)) exp(−(x_1 − x̄_1)²/(2σ_1²) − · · · − (x_m − x̄_m)²/(2σ_m²))


or

    f(x) = (1/((2π)^{m/2} |Σ_x|^{1/2})) exp(−(1/2)(x − x̄)^T Σ_x^{−1} (x − x̄)).    (1.4.19)

If the entries are not statistically independent of each other, then the probability density function of the Gaussian random vector x ∼ N(x̄, Σ_x) is also given by Eq. (1.4.19), but the exponential term becomes [25, 29]

    (x − x̄)^T Σ_x^{−1} (x − x̄) = Σ_{i=1}^m Σ_{j=1}^m [Σ_x^{−1}]_{i,j} (x_i − μ_i)(x_j − μ_j),    (1.4.20)

where [Σ_x^{−1}]_{ij} represents the (i, j)th entry of the inverse matrix Σ_x^{−1} and μ_i = E{x_i} is the mean of the random variable x_i.

The characteristic function of a real Gaussian random vector is given by

    Φ_x(ω_1, . . . , ω_m) = exp( j ω^T μ_x − (1/2) ω^T Σ_x ω ),    (1.4.21)

where ω = [ω_1, . . . , ω_m]^T.

If x_i ∼ CN(μ_i, σ_i²), then x = [x_1, . . . , x_m]^T is called a complex Gaussian random vector, denoted x ∼ CN(μ_x, Σ_x), where μ_x = [μ_1, . . . , μ_m]^T and Σ_x are, respectively, the mean vector and the covariance matrix of the random vector x. If x_i = u_i + j v_i and the random vectors [u_1, v_1]^T, . . . , [u_m, v_m]^T are statistically independent of each other, then the probability density function of the complex Gaussian random vector x is given by Poularikas [29, p. 35-5]

    f(x) = f(x_1) · · · f(x_m)
         = ( Π_{i=1}^m π σ_i² )^{−1} exp( − Σ_{i=1}^m |x_i − μ_i|²/σ_i² )
         = (1/(π^m |Σ_x|)) exp( −(x − μ_x)^H Σ_x^{−1} (x − μ_x) ),    (1.4.22)

where Σ_x = Diag(σ_1², . . . , σ_m²). The characteristic function of the complex Gaussian random vector x is determined by

    Φ_x(ω) = exp( j Re(ω^H μ_x) − (1/4) ω^H Σ_x ω ).    (1.4.23)

A Gaussian random vector x has the following important properties.
• The probability density function of x is completely described by its mean vector and covariance matrix.


• If two Gaussian random vectors x and y are statistically uncorrelated, then they are also statistically independent.
• Given a Gaussian random vector x with mean vector μ_x and covariance matrix Σ_x, the random vector y obtained by the linear transformation y(ξ) = Ax(ξ) is also a Gaussian random vector, and its probability density function is given by

    f(y) = (1/((2π)^{m/2} |Σ_y|^{1/2})) exp( −(1/2)(y − μ_y)^T Σ_y^{−1} (y − μ_y) )    (1.4.24)

for real Gaussian random vectors and

    f(y) = (1/(π^m |Σ_y|)) exp( −(y − μ_y)^H Σ_y^{−1} (y − μ_y) )    (1.4.25)

for complex Gaussian random vectors.

The statistical expression of a real white Gaussian noise vector is given by

    E{x(t)} = 0  and  E{x(t) x^T(t)} = σ² I.    (1.4.26)

However, this expression does not carry over to a complex white Gaussian noise vector. Consider a complex Gaussian random vector x(t) = [x_1(t), . . . , x_m(t)]^T. If x_i(t), i = 1, . . . , m, have zero mean and the same variance σ², then the real part x_{R,i}(t) and the imaginary part x_{I,i}(t) are two real white Gaussian noises that are statistically independent and have the same variance. This implies that

    E{x_{R,i}(t)} = 0,    E{x_{I,i}(t)} = 0,
    E{x_{R,i}²(t)} = E{x_{I,i}²(t)} = σ²/2,
    E{x_{R,i}(t) x_{I,i}(t)} = 0,
    E{x_i(t) x_i^*(t)} = E{x_{R,i}²(t)} + E{x_{I,i}²(t)} = σ².

From the above conditions we know that

    E{x_i²(t)} = E{(x_{R,i}(t) + j x_{I,i}(t))²}
               = E{x_{R,i}²(t)} − E{x_{I,i}²(t)} + j 2 E{x_{R,i}(t) x_{I,i}(t)}
               = σ²/2 − σ²/2 + 0 = 0.

Since x_1(t), . . . , x_m(t) are statistically uncorrelated, we have

    E{x_i(t) x_k(t)} = 0,    E{x_i(t) x_k^*(t)} = 0,    i ≠ k.


Therefore, the statistical expression of the complex white Gaussian noise vector x(t) is given by

    E{x(t)} = 0,    E{x(t) x^H(t)} = σ² I,    E{x(t) x^T(t)} = O.    (1.4.27)

1.5 Basic Performance of Matrices

For a multivariate representation with mn components, we need some scalars to describe the basic performance of an m × n matrix.

1.5.1 Quadratic Forms

The quadratic form of an n × n matrix A is defined as x^H A x, where x may be any n × 1 nonzero vector. In order to ensure the uniqueness of the definition of the quadratic form, i.e., (x^H A x)^H = x^H A x, the matrix A is required to be Hermitian or complex conjugate symmetric, i.e., A^H = A. This assumption also ensures that any quadratic form function is real-valued. One of the basic advantages of a real-valued function is its suitability for comparison with a zero value.

If x^H A x > 0, then the quadratic form of A is said to be a positive definite function, and the Hermitian matrix A is known as a positive definite matrix. Similarly, one can define the positive semi-definiteness, negative definiteness, and negative semi-definiteness of a Hermitian matrix, as shown in Table 1.1. A Hermitian matrix A is said to be an indefinite matrix if x^H A x > 0 for some nonzero vectors x and x^H A x < 0 for other nonzero vectors x.

Table 1.1 Quadratic forms and positive definiteness of a Hermitian matrix A

  Quadratic form    Symbol    Positive definiteness
  x^H A x > 0       A ≻ 0     A is a positive definite matrix
  x^H A x ≥ 0       A ⪰ 0     A is a positive semi-definite matrix
  x^H A x < 0       A ≺ 0     A is a negative definite matrix
  x^H A x ≤ 0       A ⪯ 0     A is a negative semi-definite matrix

1.5.2 Determinants

The determinant of an n × n matrix A, denoted det(A) or |A|, is defined as

    det(A) = |A| = | a_11  a_12  · · ·  a_1n |
                   | a_21  a_22  · · ·  a_2n |
                   |  ⋮     ⋮    ⋱      ⋮   |
                   | a_n1  a_n2  · · ·  a_nn |.    (1.5.1)

Let A_{ij} be the (n − 1) × (n − 1) submatrix obtained by removing the ith row and the jth column from the n × n matrix A = [a_{ij}]. The determinant det(A_{ij}) of the remaining matrix A_{ij} is known as the minor of the entry a_{ij}; in particular, when j = i, det(A_{ii}) is known as a principal minor of A. The cofactor A_{ij} of the entry a_{ij} is related to the determinant of the submatrix A_{ij} as follows:

    A_{ij} = (−1)^{i+j} det(A_{ij}).    (1.5.2)

The determinant of an n × n matrix A can be computed by

    det(A) = a_{i1} A_{i1} + · · · + a_{in} A_{in} = Σ_{j=1}^n a_{ij} (−1)^{i+j} det(A_{ij})    (1.5.3)

or

    det(A) = a_{1j} A_{1j} + · · · + a_{nj} A_{nj} = Σ_{i=1}^n a_{ij} (−1)^{i+j} det(A_{ij}).    (1.5.4)

Hence the determinant of A can be calculated recursively: an nth-order determinant can be computed from (n − 1)th-order determinants, while each (n − 1)th-order determinant can be calculated from (n − 2)th-order determinants, and so forth.

For a 3 × 3 matrix A, its determinant is recursively given by

    det(A) = a_11 A_11 + a_12 A_12 + a_13 A_13
           = a_11 (−1)^{1+1} det[ a_22 a_23; a_32 a_33 ] + a_12 (−1)^{1+2} det[ a_21 a_23; a_31 a_33 ]
             + a_13 (−1)^{1+3} det[ a_21 a_22; a_31 a_32 ]
           = a_11 (a_22 a_33 − a_23 a_32) − a_12 (a_21 a_33 − a_23 a_31) + a_13 (a_21 a_32 − a_22 a_31).

This is the diagonal method for a third-order determinant. A matrix with nonzero determinant is known as a nonsingular matrix.

Determinant Equalities [20]
1. If two rows (or columns) of a matrix A are exchanged, then the absolute value of det(A) remains unchanged, but its sign changes.
2. If some row (or column) of a matrix A is a linear combination of other rows (or columns), then det(A) = 0. In particular, if some row (or column) is


proportional or equal to another row (or column), or there is a zero row (or column), then det(A) = 0.
3. The determinant of an identity matrix is equal to 1, i.e., det(I) = 1.
4. Any square matrix A and its transpose A^T have the same determinant, i.e., det(A) = det(A^T); however, det(A^H) = (det(A^T))^*.
5. The determinant of a Hermitian matrix is real-valued, since

    det(A) = det(A^H) = (det(A))^*.    (1.5.5)

6. The determinant of the product of two square matrices is equal to the product of their determinants, i.e.,

    det(AB) = det(A) det(B),  A, B ∈ C^{n×n}.    (1.5.6)

7. For any constant c and any n × n matrix A, det(c · A) = c^n · det(A).
8. If A is nonsingular, then det(A^{−1}) = 1/det(A).
9. For matrices A_{m×m}, B_{m×n}, C_{n×m}, D_{n×n}, the determinant of the block matrix for nonsingular A is given by

    det [ A  B ] = det(A) det(D − C A^{−1} B)    (1.5.7)
        [ C  D ]

and for nonsingular D we have

    det [ A  B ] = det(D) det(A − B D^{−1} C).    (1.5.8)
        [ C  D ]

10. The determinant of a triangular (upper or lower triangular) matrix A is equal to the product of its main diagonal entries:

    det(A) = Π_{i=1}^n a_{ii}.

The determinant of a diagonal matrix A = Diag(a_11, . . . , a_nn) is also equal to the product of its diagonal entries.

Here we give a proof of Eq. (1.5.7):

    det [ A  B ] = det ( [ A        O         ] [ I  A^{−1}B ] ) = det(A) det(D − C A^{−1} B).
        [ C  D ]         [ C  D − C A^{−1} B ] [ O     I    ]

We can prove Eq. (1.5.8) in a similar way.


The following are two determinant inequalities [20].
• The determinant of a positive definite matrix A is larger than 0, i.e., det(A) > 0.
• The determinant of a positive semi-definite matrix A is larger than or equal to 0, i.e., det(A) ≥ 0.

1.5.3 Matrix Eigenvalues

For an n × n matrix A, if the linear algebraic equation

    Au = λu    (1.5.9)

has a nonzero n × 1 solution vector u, then the scalar λ is called an eigenvalue of the matrix A, and u is its eigenvector corresponding to λ. An equivalent form of the matrix equation (1.5.9) is

    (A − λI)u = 0.    (1.5.10)

The only condition under which this equation has a nonzero solution vector u is that the determinant of the matrix A − λI is equal to zero, i.e.,

    det(A − λI) = 0.    (1.5.11)

This equation is known as the characteristic equation of the matrix A. The characteristic equation (1.5.11) reflects the following two facts:
• If (1.5.11) holds for λ = 0, then det(A) = 0. This implies that as long as a matrix A has a zero eigenvalue, this matrix must be singular.
• All the eigenvalues of a zero matrix are zero, and for any singular matrix there exists at least one zero eigenvalue.

Clearly, if we subtract from all n diagonal entries of an n × n singular matrix A the same scalar x ≠ 0 that is not an eigenvalue of A, then the matrix A − xI must be nonsingular, since |A − xI| ≠ 0.

Let eig(A) represent the eigenvalues of the matrix A. The basic properties of eigenvalues are listed below:
1. For A_{m×m} and B_{m×m}, eig(AB) = eig(BA), because ABu = λu ⇒ (BA)Bu = λBu ⇒ BAu′ = λu′, where u′ = Bu. Note that if the eigenvector of AB corresponding to λ is u, then the eigenvector of BA corresponding to the same λ is u′ = Bu.
2. If rank(A) = r, then the matrix A has at most r nonzero eigenvalues.
3. The eigenvalues of the inverse matrix satisfy eig(A^{−1}) = 1/eig(A).


4. Let I be the identity matrix; then

    eig(I + cA) = 1 + c · eig(A),    (1.5.12)
    eig(A − cI) = eig(A) − c.    (1.5.13)

Since u^H u > 0 for any nonzero vector u, from λ = u^H A u/(u^H u) it is directly known that positive definite and nonpositive definite matrices can be described by their eigenvalues as follows.
• Positive definite matrix: All its eigenvalues are positive real numbers.
• Positive semi-definite matrix: All its eigenvalues are nonnegative.
• Negative definite matrix: All its eigenvalues are negative.
• Negative semi-definite matrix: All its eigenvalues are nonpositive.
• Indefinite matrix: It has both positive and negative eigenvalues.

If A is a positive definite or positive semi-definite matrix, then

    det(A) ≤ Π_i a_{ii}.    (1.5.14)

This inequality is called the Hadamard inequality.

The characteristic equation (1.5.11) suggests two methods for improving the numerical stability and the accuracy of solutions of the matrix equation Ax = b, as described below [40].
• Method for improving the numerical stability: Consider the matrix equation Ax = b, where A is usually positive definite or nonsingular. However, owing to noise or errors, A may sometimes be close to singular. We can alleviate this difficulty as follows. If λ is a small positive number, then −λ cannot be an eigenvalue of A. This implies that the characteristic equation |A − xI| = |A − (−λ)I| = |A + λI| = 0 cannot hold for any λ > 0, and thus the matrix A + λI must be nonsingular. Therefore, if we solve (A + λI)x = b instead of the original matrix equation Ax = b, where λ takes a very small positive value, then we can overcome the singularity of A and greatly improve the numerical stability of solving Ax = b. This method of solving (A + λI)x = b, with λ > 0, instead of Ax = b is the well-known Tikhonov regularization method for solving nearly singular matrix equations (see the sketch after this list).
• Method for improving the accuracy: For a matrix equation Ax = b, with the data matrix A nonsingular but containing additive interference or observation noise, if we choose a very small positive scalar λ and solve (A − λI)x = b instead of Ax = b, then the influence of the noise in the data matrix A on the solution vector x will be greatly decreased. This is the basis of the well-known total least squares (TLS) method.
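A minimal NumPy sketch of the Tikhonov-regularization idea just described (the matrix, right-hand side, and value of λ are arbitrary illustrative choices):

    import numpy as np

    # A nearly singular data matrix and a right-hand side
    A = np.array([[1.0, 1.0],
                  [1.0, 1.0 + 1e-12]])
    b = np.array([2.0, 2.0])

    lam = 1e-6                                         # small positive regularization parameter
    x_reg = np.linalg.solve(A + lam * np.eye(2), b)    # Tikhonov: solve (A + lam*I) x = b

    print(x_reg)    # a numerically stable solution close to [1, 1]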


1.5.4 Matrix Trace

Definition 1.15 (Trace) The sum of the diagonal entries of an n × n matrix A is known as its trace, denoted tr(A):

    tr(A) = a_11 + · · · + a_nn = Σ_{i=1}^n a_{ii}.    (1.5.15)

Clearly, for a random signal x = [x_1, . . . , x_n]^T, the trace of its autocorrelation matrix R_x, tr(R_x) = Σ_{i=1}^n E{|x_i|²}, denotes the energy of the random signal.

The following are some properties of the matrix trace.

Trace Equalities [20]
1. If both A and B are n × n matrices, then tr(A ± B) = tr(A) ± tr(B).
2. If A and B are n × n matrices and c_1 and c_2 are constants, then tr(c_1 · A ± c_2 · B) = c_1 · tr(A) ± c_2 · tr(B). In particular, tr(cA) = c · tr(A).
3. tr(A^T) = tr(A), tr(A^*) = (tr(A))^*, and tr(A^H) = (tr(A))^*.
4. If A ∈ C^{m×n} and B ∈ C^{n×m}, then tr(AB) = tr(BA).
5. If A is an m × n matrix, then tr(A^H A) = 0 implies that A is an m × n zero matrix.
6. x^H A x = tr(A x x^H) and y^H x = tr(x y^H).
7. The trace of an n × n matrix is equal to the sum of its eigenvalues, namely tr(A) = λ_1 + · · · + λ_n.
8. The trace of a block matrix satisfies

    tr [ A  B ] = tr(A) + tr(D),
       [ C  D ]

where A ∈ C^{m×m}, B ∈ C^{m×n}, C ∈ C^{n×m}, and D ∈ C^{n×n}.
9. For any positive integer k, we have

    tr(A^k) = Σ_{i=1}^n λ_i^k.    (1.5.16)

By the trace equality tr(UV) = tr(VU), it is easy to see that

    tr(A^H A) = tr(AA^H) = Σ_{i=1}^n Σ_{j=1}^n a_{ij} a_{ij}^* = Σ_{i=1}^n Σ_{j=1}^n |a_{ij}|².    (1.5.17)

Interestingly, if we substitute U = A, V = BC and U = AB, V = C into the trace equality tr(UV) = tr(VU), we obtain

    tr(ABC) = tr(BCA) = tr(CAB).    (1.5.18)


Similarly, if we let U = A, V = BCD or U = AB, V = CD or U = ABC, V = D, respectively, we obtain

    tr(ABCD) = tr(BCDA) = tr(CDAB) = tr(DABC).    (1.5.19)

Moreover, if A and B are m × m matrices, and B is nonsingular, then

    tr(BAB^{−1}) = tr(B^{−1}AB) = tr(ABB^{−1}) = tr(A).    (1.5.20)

The Frobenius norm of an m × n matrix A can also be defined using the trace of the n × n matrix A^H A or that of the m × m matrix AA^H, as follows [22, p. 10]:

    ‖A‖_F = √(tr(A^H A)) = √(tr(AA^H)).    (1.5.21)
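A quick NumPy check (random matrices, purely illustrative) of the cyclic property (1.5.18) and of the trace form of the Frobenius norm (1.5.21):

    import numpy as np

    rng = np.random.default_rng(1)
    A, B, C = rng.normal(size=(3, 4)), rng.normal(size=(4, 5)), rng.normal(size=(5, 3))

    t1 = np.trace(A @ B @ C)
    t2 = np.trace(B @ C @ A)
    t3 = np.trace(C @ A @ B)
    print(np.allclose(t1, t2), np.allclose(t2, t3))          # True True

    fro = np.linalg.norm(A, 'fro')
    print(np.allclose(fro, np.sqrt(np.trace(A.T @ A))))      # True: ||A||_F = sqrt(tr(A^H A))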

1.5.5 Matrix Rank

Theorem 1.2 ([36]) Among a set of p-dimensional (row or column) vectors, there are at most p linearly independent (row or column) vectors.

Theorem 1.3 ([36]) For an m × n matrix A, the number of linearly independent rows and the number of linearly independent columns are the same.

From this theorem we have the following definition of the rank of a matrix.

Definition 1.16 (Rank) The rank of an m × n matrix A is defined as the number of its linearly independent rows or columns.

It needs to be pointed out that the matrix rank gives only the number of linearly independent rows or columns; it gives no information on the locations of these independent rows or columns.

According to the rank, we can classify matrices as follows.
• If rank(A_{m×n}) < min{m, n}, then A is known as a rank-deficient matrix.
• If rank(A_{m×n}) = m (< n), then the matrix A is a full row rank matrix.
• If rank(A_{m×n}) = n (< m), then the matrix A is called a full column rank matrix.
• If rank(A_{n×n}) = n, then A is said to be a full-rank matrix (or nonsingular matrix).

The matrix equation A_{m×n} x_{n×1} = b_{m×1} is said to be a consistent equation if it has at least one exact solution. A matrix equation with no exact solution is said to be an inconsistent equation.

A matrix A with rank(A) = r_A has r_A linearly independent column vectors. All linear combinations of the r_A linearly independent column vectors constitute a vector space, called the column space or the range or the manifold of A.


The column space Col(A) = Col(a_1, . . . , a_n) or the range Range(A) = Range(a_1, . . . , a_n) is r_A-dimensional. Hence the rank of a matrix can be defined by using the dimension of its column space or range, as described below.

Definition 1.17 (Dimension of Column Space) The dimension of the column space Col(A) or the range Range(A) of an m × n matrix A is defined to be the rank of the matrix, namely

    r_A = dim(Col(A)) = dim(Range(A)).    (1.5.22)

The following statements about the rank of the matrix A are equivalent:
• rank(A) = k;
• there are k, and not more than k, columns (or rows) of A that form a linearly independent set;
• there is a k × k submatrix of A with nonzero determinant, but all the (k + 1) × (k + 1) submatrices of A have zero determinant;
• the dimension of the column space Col(A) or the range Range(A) equals k;
• k = n − dim[Null(A)], where Null(A) denotes the null space of the matrix A.

Theorem 1.4 ([36]) The rank of the product matrix AB satisfies the inequality

    rank(AB) ≤ min{rank(A), rank(B)}.    (1.5.23)

Lemma 1.1 Premultiplying an m × n matrix A by an m × m nonsingular matrix P, or postmultiplying it by an n × n nonsingular matrix Q, does not change the rank of A, namely rank(PAQ) = rank(A).

Here are properties of the rank of a matrix.
• The rank is a nonnegative integer.
• The rank is equal to or less than the number of columns or rows of the matrix.
• Premultiplying a matrix A by a full column rank matrix, or postmultiplying it by a full row rank matrix, leaves the rank of A unchanged.

Rank Equalities
1. If A ∈ C^{m×n}, then rank(A^H) = rank(A^T) = rank(A^*) = rank(A).
2. If A ∈ C^{m×n} and c ≠ 0, then rank(cA) = rank(A).
3. If A ∈ C^{m×m} and C ∈ C^{n×n} are nonsingular and B ∈ C^{m×n}, then rank(AB) = rank(B) = rank(BC) = rank(ABC). That is, premultiplying and/or postmultiplying by a nonsingular matrix leaves the rank of B unchanged.
4. For A, B ∈ C^{m×n}, rank(A) = rank(B) if and only if there exist nonsingular matrices X ∈ C^{m×m} and Y ∈ C^{n×n} such that B = XAY.
5. rank(AA^H) = rank(A^H A) = rank(A).
6. If A ∈ C^{m×m}, then rank(A) = m ⇔ det(A) ≠ 0 ⇔ A is nonsingular.


The most commonly used rank inequality is rank(A) ≤ min{m, n} for any m × n matrix A.

1.6 Inverse Matrices and Moore–Penrose Inverse Matrices

Matrix inversion is an important aspect of matrix calculus. In particular, the matrix inversion lemma is often used in science and engineering. In this section we discuss the inverse of a full-rank square matrix, the pseudo-inverse of a non-square matrix with full row (or full column) rank, and the inversion of a rank-deficient matrix.

1.6.1 Inverse Matrices

A nonsingular matrix A is said to be invertible if its inverse A^{−1} exists so that A^{−1}A = AA^{−1} = I. On the nonsingularity or invertibility of an n × n matrix A, the following statements are equivalent [15].
• A is nonsingular.
• A^{−1} exists.
• rank(A) = n.
• All rows of A are linearly independent.
• All columns of A are linearly independent.
• det(A) ≠ 0.
• The dimension of the range of A is n.
• The dimension of the null space of A is equal to zero.
• Ax = b is a consistent equation for every b ∈ C^n.
• Ax = b has a unique solution for every b.
• Ax = 0 has only the trivial solution x = 0.

The inverse matrix A^{−1} has the following properties [2, 15].
1. A^{−1}A = AA^{−1} = I.
2. A^{−1} is unique.
3. The determinant of the inverse matrix is equal to the reciprocal of the determinant of the original matrix, i.e., |A^{−1}| = 1/|A|.
4. The inverse matrix A^{−1} is nonsingular.
5. The inverse matrix of an inverse matrix is the original matrix, i.e., (A^{−1})^{−1} = A.
6. The inverse matrix of a Hermitian matrix A = A^H satisfies (A^H)^{−1} = (A^{−1})^H = A^{−1}. That is to say, the inverse matrix of any Hermitian matrix is a Hermitian matrix as well.
7. (A^*)^{−1} = (A^{−1})^*.


8. If A and B are invertible, then (AB)^{−1} = B^{−1}A^{−1}.
9. If A = Diag(a_1, . . . , a_m) is a diagonal matrix, then its inverse matrix is A^{−1} = Diag(a_1^{−1}, . . . , a_m^{−1}).
10. Let A be nonsingular. If A is an orthogonal matrix, then A^{−1} = A^T, and if A is a unitary matrix, then A^{−1} = A^H.

Lemma 1.2 Let A be an n × n invertible matrix, and let x and y be two n × 1 vectors such that (A + xy^H) is invertible; then

    (A + xy^H)^{−1} = A^{−1} − A^{−1}xy^H A^{−1}/(1 + y^H A^{−1}x).    (1.6.1)

Lemma 1.2 is called the matrix inversion lemma and was presented by Sherman and Morrison [37] in 1950.

The matrix inversion lemma can be extended to an inversion formula for a sum of matrices:

    (A + UBV)^{−1} = A^{−1} − A^{−1}UB(B + BVA^{−1}UB)^{−1}BVA^{−1}
                   = A^{−1} − A^{−1}U(I + BVA^{−1}U)^{−1}BVA^{−1}    (1.6.2)

or

    (A − UV)^{−1} = A^{−1} + A^{−1}U(I − VA^{−1}U)^{−1}VA^{−1}.    (1.6.3)

The above formula is called the Woodbury formula due to the work of Woodbury in 1950 [38]. Taking U = u, B = β, and V = v^H, the Woodbury formula gives the result

    (A + βuv^H)^{−1} = A^{−1} − β A^{−1}uv^H A^{−1}/(1 + βv^H A^{−1}u).    (1.6.4)

In particular, letting β = 1, Eq. (1.6.4) reduces to formula (1.6.1), the matrix inversion lemma of Sherman and Morrison.

As a matter of fact, before Woodbury obtained the inversion formula (1.6.2), Duncan [8] in 1944 and Guttman [12] in 1946 had obtained the following inversion formula:

    (A − UD^{−1}V)^{−1} = A^{−1} + A^{−1}U(D − VA^{−1}U)^{−1}VA^{−1}.    (1.6.5)

This formula is usually called the Duncan–Guttman inversion formula [28].
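A small NumPy check (an arbitrary invertible test matrix and real vectors, so the conjugate transpose reduces to the transpose) of the Sherman–Morrison formula (1.6.1):

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.normal(size=(4, 4)) + 4 * np.eye(4)    # well-conditioned invertible matrix
    x = rng.normal(size=(4, 1))
    y = rng.normal(size=(4, 1))

    Ainv = np.linalg.inv(A)
    # Rank-1 update of the inverse via Eq. (1.6.1)
    update = (Ainv @ x @ y.T @ Ainv) / (1.0 + (y.T @ Ainv @ x).item())
    lhs = np.linalg.inv(A + x @ y.T)

    print(np.allclose(lhs, Ainv - update))    # True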


In addition to the Woodbury formula, the inverse matrix of a sum of matrices also has the following forms [14]:

    (A + UBV)^{−1} = A^{−1} − A^{−1}(I + UBVA^{−1})^{−1}UBVA^{−1}    (1.6.6)
                   = A^{−1} − A^{−1}UB(I + VA^{−1}UB)^{−1}VA^{−1}    (1.6.7)
                   = A^{−1} − A^{−1}UBV(I + A^{−1}UBV)^{−1}A^{−1}    (1.6.8)
                   = A^{−1} − A^{−1}UBVA^{−1}(I + UBVA^{−1})^{−1}.    (1.6.9)

The following are inversion formulas for block matrices. When the matrix A is invertible, one has [1]:

    [ A  U ]^{−1} = [ A^{−1} + A^{−1}U(D − VA^{−1}U)^{−1}VA^{−1}   −A^{−1}U(D − VA^{−1}U)^{−1} ]
    [ V  D ]        [ −(D − VA^{−1}U)^{−1}VA^{−1}                   (D − VA^{−1}U)^{−1}        ].    (1.6.10)

If the matrices A and D are invertible, then [16, 17]

    [ A  U ]^{−1} = [ (A − UD^{−1}V)^{−1}            −A^{−1}U(D − VA^{−1}U)^{−1} ]
    [ V  D ]        [ −D^{−1}V(A − UD^{−1}V)^{−1}     (D − VA^{−1}U)^{−1}        ]    (1.6.11)

or [8]

    [ A  U ]^{−1} = [ (A − UD^{−1}V)^{−1}             −(A − UD^{−1}V)^{−1}UD^{−1} ]
    [ V  D ]        [ −(D − VA^{−1}U)^{−1}VA^{−1}      (D − VA^{−1}U)^{−1}        ].    (1.6.12)

For an (m + 1) × (m + 1) nonsingular Hermitian matrix R_{m+1}, it can always be written in the block form

    R_{m+1} = [ R_m    r_m ]
              [ r_m^H  ρ_m ],    (1.6.13)

where ρ_m is the (m + 1, m + 1)th entry of R_{m+1}, and R_m is an m × m Hermitian matrix. In order to compute R_{m+1}^{−1} from R_m^{−1}, let

    Q_{m+1} = [ Q_m    q_m ]
              [ q_m^H  α_m ]    (1.6.14)

be the inverse matrix of R_{m+1}. Then we have

    R_{m+1} Q_{m+1} = [ R_m    r_m ] [ Q_m    q_m ] = [ I_m    0_m ]
                      [ r_m^H  ρ_m ] [ q_m^H  α_m ]   [ 0_m^H  1   ],    (1.6.15)


which gives the following four equations:

    R_m Q_m + r_m q_m^H = I_m,    (1.6.16)
    r_m^H Q_m + ρ_m q_m^H = 0_m^H,    (1.6.17)
    R_m q_m + r_m α_m = 0_m,    (1.6.18)
    r_m^H q_m + ρ_m α_m = 1.    (1.6.19)

If R_m is invertible, then from Eq. (1.6.18) it follows that

    q_m = −α_m R_m^{−1} r_m.    (1.6.20)

Substituting this result into Eq. (1.6.19) yields

    α_m = 1/(ρ_m − r_m^H R_m^{−1} r_m).    (1.6.21)

After substituting Eq. (1.6.21) into Eq. (1.6.20), we can obtain

    q_m = −R_m^{−1} r_m/(ρ_m − r_m^H R_m^{−1} r_m).    (1.6.22)

Then, substituting Eq. (1.6.22) into Eq. (1.6.16):

    Q_m = R_m^{−1} − R_m^{−1} r_m q_m^H = R_m^{−1} + R_m^{−1} r_m (R_m^{−1} r_m)^H/(ρ_m − r_m^H R_m^{−1} r_m).    (1.6.23)

In order to simplify (1.6.21)–(1.6.23), put

    b_m = [b_0^{(m)}, b_1^{(m)}, . . . , b_{m−1}^{(m)}]^T = −R_m^{−1} r_m,    (1.6.24)
    β_m = ρ_m − r_m^H R_m^{−1} r_m = ρ_m + r_m^H b_m.    (1.6.25)

Then (1.6.21)–(1.6.23) can be, respectively, simplified to

    α_m = 1/β_m,    q_m = (1/β_m) b_m,    Q_m = R_m^{−1} + (1/β_m) b_m b_m^H.

Substituting these results into (1.6.15), it is immediately seen that

    R_{m+1}^{−1} = Q_{m+1} = [ R_m^{−1}  0_m ] + (1/β_m) [ b_m b_m^H  b_m ]
                             [ 0_m^H     0   ]           [ b_m^H      1   ].    (1.6.26)


This formula for calculating the (m + 1) × (m + 1) inverse R_{m+1}^{−1} from the m × m inverse R_m^{−1} is called the block inversion lemma for Hermitian matrices [24].

1.6.2 Left and Right Pseudo-Inverse Matrices

From a broader perspective, any n × m matrix G may be called an inverse of a given m × n matrix A if the product of G and A is equal to an identity matrix.

Definition 1.18 (Left Inverse, Right Inverse [36]) A matrix L satisfying LA = I but not AL = I is called a left inverse of the matrix A. Similarly, a matrix R satisfying AR = I but not RA = I is said to be a right inverse of A.

A matrix A ∈ C^{m×n} can have a left inverse only when m ≥ n and a right inverse only when m ≤ n. It should be noted that the left or right inverse of a given matrix A is usually not unique. Let us consider the conditions for a unique solution of the left and right inverse matrices.

Let m > n and let the m × n matrix A have full column rank, i.e., rank(A) = n. In this case, the n × n matrix A^H A is invertible. It is easy to verify that

    L = (A^H A)^{−1} A^H    (1.6.27)

satisfies LA = I but not AL = I. This type of left inverse matrix is uniquely determined, and is usually called the left pseudo-inverse matrix of A.

On the other hand, if m < n and A has full row rank, i.e., rank(A) = m, then the m × m matrix AA^H is invertible. Define

    R = A^H (AA^H)^{−1}.    (1.6.28)

It is easy to see that R satisfies AR = I but RA ≠ I. The matrix R is also uniquely determined and is usually called the right pseudo-inverse matrix of A.

The left pseudo-inverse matrix is closely related to the least squares solution of over-determined equations, while the right pseudo-inverse matrix is closely related to the least squares minimum norm solution of under-determined equations.

Consider the recursion F_m = [F_{m−1}, f_m] of an n × m matrix F_m. Then the left pseudo-inverse F_m^left = (F_m^H F_m)^{−1} F_m^H can be recursively computed as [39]

    F_m^left = [ F_{m−1}^left − Δ_m^{−1} F_{m−1}^left f_m e_m^H ]
               [ Δ_m^{−1} e_m^H                                 ],    (1.6.29)

where e_m = (I_n − F_{m−1} F_{m−1}^left) f_m and Δ_m = f_m^H e_m; the initial recursion value is F_1^left = f_1^H/(f_1^H f_1).


Similarly, the right pseudo-inverse matrix F_m^right = F_m^H (F_m F_m^H)^{−1} has the following recursive formula [39]:

    F_m^right = [ F_{m−1}^right − Δ_m^{−1} F_{m−1}^right f_m c_m^H ]
                [ Δ_m^{−1} c_m^H                                   ].    (1.6.30)

Here c_m^H = f_m^H (I_n − F_{m−1} F_{m−1}^†) and Δ_m = c_m^H f_m. The initial recursion value is F_1^right = f_1^H/(f_1^H f_1).

1.6.3 Moore–Penrose Inverse Matrices

Consider an m × n rank-deficient matrix A, regardless of the relative size of m and n, with rank(A) = k < min{m, n}. The inverse of an m × n rank-deficient matrix is said to be its generalized inverse matrix, denoted A†, which is an n × m matrix. The generalized inverse of a rank-deficient matrix A should satisfy the following four conditions.

1. If A† is the generalized inverse of A, then Ax = y ⇒ x = A†y. Substituting x = A†y into Ax = y, we have AA†y = y, and thus AA†Ax = Ax. Since this equation should hold for any given nonzero vector x, A† must satisfy the condition

    AA†A = A.    (1.6.31)

2. Given any y ≠ 0, the solution of the original matrix equation, x = A†y, can be written as x = A†Ax, yielding A†y = A†AA†y. Since A†y = A†AA†y should hold for any given nonzero vector y, the second condition

    A†AA† = A†    (1.6.32)

must be satisfied as well.

3. If an m × n matrix A is of full column rank or full row rank, we certainly hope that the generalized inverse matrix A† will include the left and right pseudo-inverse matrices as two special cases. Because the left pseudo-inverse L = (A^H A)^{−1}A^H of a full column rank matrix A and the right pseudo-inverse R = A^H(AA^H)^{−1} of a full row rank matrix A satisfy, respectively, AL = A(A^H A)^{−1}A^H = (AL)^H and RA = A^H(AA^H)^{−1}A = (RA)^H, in order to guarantee that A† exists uniquely for any m × n matrix A, the following two conditions must be added:

    AA† = (AA†)^H,    A†A = (A†A)^H.    (1.6.33)


Definition 1.19 (Moore–Penrose Inverse [26]) Let A be any m × n matrix. An n × m matrix A† is said to be the Moore–Penrose inverse of A if A† meets the following four conditions (usually called the Moore–Penrose conditions):
(a) AA†A = A;
(b) A†AA† = A†;
(c) AA† is an m × m Hermitian matrix, i.e., AA† = (AA†)^H;
(d) A†A is an n × n Hermitian matrix, i.e., A†A = (A†A)^H.

Remarks From the projection viewpoint, Moore [23] showed in 1935 that the generalized inverse matrix A† of an m × n matrix A must meet two conditions, but those conditions are not convenient for practical use. Two decades later, Penrose [26] in 1955 presented the four conditions (a)–(d) stated above. In 1956, Rado [32] showed that the four conditions of Penrose are equivalent to the two conditions of Moore. Therefore the conditions (a)–(d) are called the Moore–Penrose conditions, and the generalized inverse matrix satisfying the Moore–Penrose conditions is referred to as the Moore–Penrose inverse of A. It is easy to see that the inverse A^{−1}, the left pseudo-inverse matrix (A^H A)^{−1}A^H, and the right pseudo-inverse matrix A^H(AA^H)^{−1} are special examples of the Moore–Penrose inverse A†.

The Moore–Penrose inverse of any m × n matrix A can be uniquely determined by Boot [5]

    A† = (A^H A)† A^H    (if m ≥ n)    (1.6.34)

or [10]

    A† = A^H (AA^H)†    (if m ≤ n).    (1.6.35)

From Definition 1.19 it is easily seen that A† = (A^H A)†A^H and A† = A^H(AA^H)† meet the four Moore–Penrose conditions.

For an m × n matrix A with rank r, where r ≤ min(m, n), there are the following three methods for computing the Moore–Penrose inverse matrix A†.

1. Equation-solving method [26]
   • Solve the matrix equations AA^H X^H = A and A^H A Y = A^H to yield the solutions X^H and Y, respectively.
   • Compute the generalized inverse matrix A† = XAY.
2. Full-rank decomposition method: If A = FG, where F_{m×r} is of full column rank and G_{r×n} is of full row rank, then A = FG is called the full-rank decomposition of the matrix A. By Searle [36], a matrix A ∈ C^{m×n} with rank(A) = r has the full-rank decomposition A = FG. Therefore, if A = FG is a full-rank decomposition of the m × n matrix A, then A† = G†F† = G^H(GG^H)^{−1}(F^H F)^{−1}F^H.


3. Recursive method: Partition the matrix A_{m×n} as A_k = [A_{k−1}, a_k], where A_{k−1} consists of the first k − 1 columns of A and a_k is the kth column of A. Then the Moore–Penrose inverse A_k† of the block matrix A_k can be recursively calculated from A_{k−1}†. When k = n, we get the Moore–Penrose inverse matrix A†. Such a recursive algorithm was presented by Greville in 1960 [11].

From [20, 22, 30, 31] and [33], the Moore–Penrose inverse A† has the following properties.
1. For an m × n matrix A, its Moore–Penrose inverse A† is uniquely determined.
2. The Moore–Penrose inverse of the complex conjugate transpose matrix A^H is given by (A^H)† = (A†)^H = A^{†H} = A^{H†}.
3. The generalized inverse of a Moore–Penrose inverse matrix is equal to the original matrix, namely (A†)† = A.
4. If c ≠ 0, then (cA)† = c^{−1}A†.
5. If D = Diag(d_11, . . . , d_nn), then D† = Diag(d_11†, . . . , d_nn†), where d_ii† = d_ii^{−1} if d_ii ≠ 0, and d_ii† = 0 if d_ii = 0.
6. The Moore–Penrose inverse of an m × n zero matrix O_{m×n} is an n × m zero matrix, i.e., O_{m×n}† = O_{n×m}.
7. If A^H = A and A² = A, then A† = A.
8. If A = BC, B is of full column rank, and C is of full row rank, then A† = C†B† = C^H(CC^H)^{−1}(B^H B)^{−1}B^H.
9. (AA^H)† = (A†)^H A† and (AA^H)†(AA^H) = AA†.
10. If the matrices A_i are mutually orthogonal, i.e., A_i^H A_j = O for i ≠ j, then (A_1 + · · · + A_m)† = A_1† + · · · + A_m†.
11. Regarding the ranks of generalized inverse matrices, one has rank(A†) = rank(A) = rank(A^H) = rank(A†A) = rank(AA†) = rank(AA†A) = rank(A†AA†).
12. The Moore–Penrose inverse of any matrix A_{m×n} can be determined by A† = (A^H A)†A^H or A† = A^H(AA^H)†.
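As a numerical sanity check (with an arbitrary rank-deficient test matrix), np.linalg.pinv returns a matrix satisfying the four Moore–Penrose conditions of Definition 1.19:

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.normal(size=(5, 2)) @ rng.normal(size=(2, 4))   # 5 x 4 matrix of rank 2
    Ap = np.linalg.pinv(A)                                   # Moore-Penrose inverse

    print(np.allclose(A @ Ap @ A, A))          # (a) A A† A = A
    print(np.allclose(Ap @ A @ Ap, Ap))        # (b) A† A A† = A†
    print(np.allclose(A @ Ap, (A @ Ap).T))     # (c) A A† is Hermitian (symmetric here)
    print(np.allclose(Ap @ A, (Ap @ A).T))     # (d) A† A is Hermitian (symmetric here)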

1.7 Direct Sum and Hadamard Product

This section discusses two special operations of matrices: the direct sum of two or more matrices and the Hadamard product of two matrices.


1.7.1 Direct Sum of Matrices

Definition 1.20 (Direct Sum [9]) The direct sum of an m × m matrix A and an n × n matrix B, denoted A ⊕ B, is an (m + n) × (m + n) matrix defined as follows:

    A ⊕ B = [ A        O_{m×n} ]
            [ O_{n×m}  B       ].    (1.7.1)

From the above definition, it is easily shown that the direct sum of matrices has the following properties [15, 27]:
1. If c is a constant, then c(A ⊕ B) = cA ⊕ cB.
2. The direct sum is not commutative: A ⊕ B ≠ B ⊕ A unless A = B.
3. If A, B are two m × m matrices and C, D are two n × n matrices, then

    (A ± B) ⊕ (C ± D) = (A ⊕ C) ± (B ⊕ D),
    (A ⊕ C)(B ⊕ D) = AB ⊕ CD.

4. If A, B, C are m × m, n × n, p × p matrices, respectively, then A ⊕ (B ⊕ C) = (A ⊕ B) ⊕ C = A ⊕ B ⊕ C.
5. If A_{m×m} and B_{n×n} are orthogonal matrices, then A ⊕ B is an (m + n) × (m + n) orthogonal matrix.
6. The complex conjugate, transpose, complex conjugate transpose, and inverse matrices of the direct sum of two matrices are given by

    (A ⊕ B)^* = A^* ⊕ B^*,    (A ⊕ B)^T = A^T ⊕ B^T,    (A ⊕ B)^H = A^H ⊕ B^H,
    (A ⊕ B)^{−1} = A^{−1} ⊕ B^{−1}  (if A^{−1} and B^{−1} exist).

7. The trace, rank, and determinant of the direct sum of N matrices are as follows:

    tr( ⊕_{i=0}^{N−1} A_i ) = Σ_{i=0}^{N−1} tr(A_i),
    rank( ⊕_{i=0}^{N−1} A_i ) = Σ_{i=0}^{N−1} rank(A_i),
    | ⊕_{i=0}^{N−1} A_i | = Π_{i=0}^{N−1} |A_i|.

1.7.2 Hadamard Product

Definition 1.21 (Hadamard Product) The Hadamard product of two m × n matrices A = [a_{ij}] and B = [b_{ij}] is denoted A ⊙ B and is also an m × n matrix, each entry of which is defined as the product of the corresponding entries of the two


matrices:

    [A ⊙ B]_{ij} = a_{ij} b_{ij}.    (1.7.2)

That is, the Hadamard product is a mapping R^{m×n} × R^{m×n} → R^{m×n}. The Hadamard product is also known as the Schur product or the elementwise product.

Similarly, the elementwise division of two m × n matrices A = [a_{ij}] and B = [b_{ij}], denoted A ⊘ B, is defined as

    [A ⊘ B]_{ij} = a_{ij}/b_{ij}.    (1.7.3)

The following theorem describes the positive definiteness of the Hadamard product and is usually known as the Hadamard product theorem [15].

Theorem 1.5 If two m × m matrices A and B are positive definite (positive semi-definite), then their Hadamard product A ⊙ B is positive definite (positive semi-definite) as well.

Corollary 1.1 (Fejer Theorem [15]) An m × m matrix A = [a_{ij}] is positive semi-definite if and only if

    Σ_{i=1}^m Σ_{j=1}^m a_{ij} b_{ij} ≥ 0

holds for all m × m positive semi-definite matrices B = [b_{ij}].

The following theorems describe the relationship between the Hadamard product and the matrix trace.

Theorem 1.6 ([22, p. 46]) Let A, B, C be m × n matrices, let 1 = [1, . . . , 1]^T be an n × 1 summing vector, and let D = Diag(d_1, . . . , d_m), where d_i = Σ_{j=1}^n a_{ij}; then

    tr(A^T (B ⊙ C)) = tr((A^T ⊙ B^T) C),    (1.7.4)
    1^T A^T (B ⊙ C) 1 = tr(B^T D C).    (1.7.5)

Theorem 1.7 ([22, p. 46]) Let A, B be two n × n positive definite square matrices, and let 1 = [1, . . . , 1]^T be an n × 1 summing vector. Suppose that M is an n × n diagonal matrix, i.e., M = Diag(μ_1, . . . , μ_n), while m = M1 is an n × 1 vector. Then one has

    tr(AMB^T M) = m^T (A ⊙ B) m,    (1.7.6)
    tr(AB^T) = 1^T (A ⊙ B) 1,    (1.7.7)
    MA ⊙ B^T M = M(A ⊙ B^T) M.    (1.7.8)


From the above definition, it is known that the Hadamard product obeys the commutative law, the associative law, and the distributive law of addition:

    A ⊙ B = B ⊙ A,    (1.7.9)
    A ⊙ (B ⊙ C) = (A ⊙ B) ⊙ C,    (1.7.10)
    A ⊙ (B ± C) = A ⊙ B ± A ⊙ C.    (1.7.11)

The properties of Hadamard products are summarized below [22].
1. If A, B are m × n matrices, then

    (A ⊙ B)^T = A^T ⊙ B^T,    (A ⊙ B)^H = A^H ⊙ B^H,    (A ⊙ B)^* = A^* ⊙ B^*.

2. The Hadamard product of a matrix A_{m×n} and a zero matrix O_{m×n} is given by A ⊙ O_{m×n} = O_{m×n} ⊙ A = O_{m×n}.
3. If c is a constant, then c(A ⊙ B) = (cA) ⊙ B = A ⊙ (cB).
4. The Hadamard product of two positive definite (positive semi-definite) matrices A, B is positive definite (positive semi-definite) as well.
5. The Hadamard product of the matrix A_{m×m} = [a_{ij}] and the identity matrix I_m is an m × m diagonal matrix, i.e.,

    A ⊙ I_m = I_m ⊙ A = Diag(A) = Diag(a_11, . . . , a_mm).

6. If A, B, D are three m × m matrices and D is a diagonal matrix, then

    (DA) ⊙ (BD) = D(A ⊙ B)D.

7. If A, C are two m × m matrices and B, D are two n × n matrices, then

    (A ⊕ B) ⊙ (C ⊕ D) = (A ⊙ C) ⊕ (B ⊙ D).

8. If A, B, C, D are all m × n matrices, then

    (A + B) ⊙ (C + D) = A ⊙ C + A ⊙ D + B ⊙ C + B ⊙ D.

9. If A, B, C are m × n matrices, then tr(A^T (B ⊙ C)) = tr((A^T ⊙ B^T) C).

The Hadamard (i.e., elementwise) products of two matrices are widely used in machine learning, neural networks, and evolutionary computations.
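In NumPy the Hadamard product is simply the elementwise * operator; the short sketch below (arbitrary test matrices) also checks property 6 above, (DA) ⊙ (BD) = D(A ⊙ B)D:

    import numpy as np

    rng = np.random.default_rng(5)
    A, B = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
    D = np.diag(rng.normal(size=3))

    hadamard = A * B                      # elementwise (Hadamard) product
    division = A / B                      # elementwise division

    lhs = (D @ A) * (B @ D)
    rhs = D @ (A * B) @ D
    print(np.allclose(lhs, rhs))          # True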


1.8 Kronecker Products

The Hadamard product described in the previous section is a special product of two matrices. In this section, we discuss another special product of two matrices: the Kronecker product. The Kronecker product is also known as the direct product or tensor product [20].

1.8.1 Definitions of Kronecker Products

Kronecker products are divided into right and left Kronecker products.

Definition 1.22 (Right Kronecker Product [3]) Given an m × n matrix A = [a_1, . . . , a_n] and a p × q matrix B, their right Kronecker product A ⊗ B is the mp × nq matrix defined by

    A ⊗ B = [a_{ij} B]_{i=1,j=1}^{m,n} = [ a_11 B  a_12 B  · · ·  a_1n B ]
                                         [ a_21 B  a_22 B  · · ·  a_2n B ]
                                         [   ⋮       ⋮      ⋱      ⋮    ]
                                         [ a_m1 B  a_m2 B  · · ·  a_mn B ].    (1.8.1)

Definition 1.23 (Left Kronecker Product [9, 34]) For an m × n matrix A and a p × q matrix B = [b_1, . . . , b_q], their left Kronecker product A ⊗ B is the mp × nq matrix defined by

    [A ⊗ B]_left = [A b_{ij}]_{i=1,j=1}^{p,q} = [ A b_11  A b_12  · · ·  A b_1q ]
                                                [ A b_21  A b_22  · · ·  A b_2q ]
                                                [   ⋮       ⋮      ⋱      ⋮    ]
                                                [ A b_p1  A b_p2  · · ·  A b_pq ].    (1.8.2)

Clearly, the left or right Kronecker product is a mapping R^{m×n} × R^{p×q} → R^{mp×nq}. It is easily seen that the left and right Kronecker products have the following relationship: [A ⊗ B]_left = B ⊗ A. Since the right Kronecker product is the form generally adopted, this book uses the right Kronecker product hereafter unless otherwise stated.

In particular, when n = 1 and q = 1, the Kronecker product of two matrices reduces to the Kronecker product of two column vectors a ∈ R^m and b ∈ R^p:

    a ⊗ b = [a_i b]_{i=1}^m = [ a_1 b ]
                              [   ⋮   ]
                              [ a_m b ].    (1.8.3)


The result is an mp × 1 vector. Evidently, the outer product of two vectors x ∘ y = xy^T can also be represented using the Kronecker product as x ∘ y = x ⊗ y^T.

1.8.2 Performance of Kronecker Products

Summarizing results from [3, 6] and other literature, Kronecker products have the following properties.
1. The Kronecker product of any matrix and a zero matrix is equal to the zero matrix, i.e., A ⊗ O = O ⊗ A = O.
2. If α and β are constants, then αA ⊗ βB = αβ(A ⊗ B).
3. The Kronecker product of an m × m identity matrix and an n × n identity matrix is equal to an mn × mn identity matrix, i.e., I_m ⊗ I_n = I_{mn}.
4. For matrices A_{m×n}, B_{n×k}, C_{l×p}, D_{p×q}, we have

    (AB) ⊗ (CD) = (A ⊗ C)(B ⊗ D).    (1.8.4)

5. For matrices A_{m×n}, B_{p×q}, C_{p×q}, we have

    A ⊗ (B ± C) = A ⊗ B ± A ⊗ C,    (1.8.5)
    (B ± C) ⊗ A = B ⊗ A ± C ⊗ A.    (1.8.6)

6. The inverse and generalized inverse matrices of Kronecker products satisfy

    (A ⊗ B)^{−1} = A^{−1} ⊗ B^{−1},    (A ⊗ B)† = A† ⊗ B†.    (1.8.7)

7. The transpose and the complex conjugate transpose of Kronecker products are given by

    (A ⊗ B)^T = A^T ⊗ B^T,    (A ⊗ B)^H = A^H ⊗ B^H.    (1.8.8)

8. The rank of the Kronecker product is

    rank(A ⊗ B) = rank(A) rank(B).    (1.8.9)

9. The determinant of the Kronecker product is

    det(A_{n×n} ⊗ B_{m×m}) = (det A)^m (det B)^n.    (1.8.10)

10. The trace of the Kronecker product is given by

    tr(A ⊗ B) = tr(A) tr(B).    (1.8.11)


11. For matrices Am×n , Bm×n , Cp×q , Dp×q , we have (A + B) ⊗ (C + D) = A ⊗ C + A ⊗ D + B ⊗ C + B ⊗ D.

(1.8.12)

12. For matrices Am×n , Bp×q , Ck×l , it is true that (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C).

(1.8.13)

13. For square matrices A ∈ Rm×m and B ∈ Rn×n , the matrix exponential satisfies exp(A ⊗ In + Im ⊗ B) = exp(A) ⊗ exp(B). 14. Let A ∈ Cm×n and B ∈ Cp×q ; then [22, p. 47] Kpm (A ⊗ B) = (B ⊗ A)Kqn ,

(1.8.14)

Kpm (A ⊗ B)Knq = B ⊗ A,

(1.8.15)

Kpm (A ⊗ B) = B ⊗ A,

(1.8.16)

Kmp (B ⊗ A) = A ⊗ B,

(1.8.17)

where K is a commutation matrix (see Sect. 1.9.1).
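As a quick numerical check of two of the properties above (not from the original text), the following NumPy sketch verifies the mixed-product rule (1.8.4) and the transpose rule (1.8.8) on random matrices of compatible, arbitrarily chosen sizes.

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((2, 3)); B = rng.standard_normal((3, 4))
    C = rng.standard_normal((5, 2)); D = rng.standard_normal((2, 6))

    # Property 4: (AB) (x) (CD) = (A (x) C)(B (x) D).
    assert np.allclose(np.kron(A @ B, C @ D), np.kron(A, C) @ np.kron(B, D))

    # Property 7: (A (x) B)^T = A^T (x) B^T.
    assert np.allclose(np.kron(A, B).T, np.kron(A.T, B.T))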

1.9 Vectorization and Matricization Consider the operators that transform a matrix into a vector or vice versa.

1.9.1 Vectorization and Commutation Matrix The function or operator that transforms a matrix into a vector is known as the vectorization of the matrix. The column vectorization of a matrix A ∈ Rm×n , denoted vec(A), is a linear transformation that arranges the entries of A = [aij ] as an mn × 1 vector via column stacking: vec(A) = [a11 , . . . , am1 , . . . , a1n , . . . , amn ]T .

(1.9.1)

A matrix A can also be arranged as a row vector by row stacking: rvec(A) = [a11 , . . . , a1n , . . . , am1 , . . . , amn ].        (1.9.2)
This is known as the row vectorization of the matrix.

For instance, given a matrix A = [a11  a12 ; a21  a22 ], then vec(A) = [a11 , a21 , a12 , a22 ]T and rvec(A) = [a11 , a12 , a21 , a22 ]. The column vectorization is usually simply called the vectorization. Clearly, there exist the following relationships between the vectorization and the row vectorization of a matrix: rvec(A) = (vec(AT ))T ,

vec(AT ) = (rvec A)T .

(1.9.3)
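In NumPy, column stacking corresponds to Fortran-order flattening and row stacking to C-order flattening. The following minimal sketch (not from the original text) illustrates vec, rvec, and the relationship (1.9.3) on the 2 × 2 example above.

    import numpy as np

    A = np.array([[1, 2],
                  [3, 4]])

    vec_A = A.flatten(order="F")     # column stacking: [a11, a21, a12, a22]
    rvec_A = A.flatten(order="C")    # row stacking:    [a11, a12, a21, a22]

    assert np.array_equal(vec_A, [1, 3, 2, 4])
    assert np.array_equal(rvec_A, [1, 2, 3, 4])

    # Relationship (1.9.3): rvec(A) = (vec(A^T))^T.
    assert np.array_equal(rvec_A, A.T.flatten(order="F"))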

One obvious fact is that, for a given m × n matrix A, the two vectors vec(A) and vec(AT ) contain the same entries but the orders of their entries are different. Interestingly, there is a unique mn × mn permutation matrix that can transform vec(A) into vec(AT ). This permutation matrix is called the commutation matrix, denoted as Kmn , and is defined by Kmn vec(A) = vec(AT ).

(1.9.4)

Similarly, there is an nm × nm permutation matrix transforming vec(AT ) into vec(A). Such a commutation matrix, denoted Knm , is defined by Knm vec(AT ) = vec(A).

(1.9.5)

From (1.9.4) and (1.9.5) it can be seen that Knm Kmn vec(A) = Knm vec(AT ) = vec(A). Since this formula holds for any m × n matrix A, we have Knm Kmn = Imn , or K−1mn = Knm .
The mn × mn commutation matrix Kmn has the following properties [21].

1. Kmn vec(A) = vec(AT ) and Knm vec(AT ) = vec(A), where A is an m × n matrix.
2. KTmn Kmn = Kmn KTmn = Imn , or K−1mn = Knm .
3. KTmn = Knm .
4. Kmn can be represented in terms of Kronecker products of the basic vectors:

   Kmn = Σ_{j=1}^{n} (eTj ⊗ Im ⊗ ej ).

5. K1n = Kn1 = In .
6. Knm Kmn vec(A) = Knm vec(AT ) = vec(A).
7. The eigenvalues of the commutation matrix Knn are 1 and −1, with multiplicities ½ n(n + 1) and ½ n(n − 1), respectively.
8. The rank of the commutation matrix is given by rank(Kmn ) = 1 + d(m − 1, n − 1), where d(m, n) is the greatest common divisor of m and n, with d(n, 0) = d(0, n) = n.


9. Kmn (A ⊗ B)Kpq = B ⊗ A, which can be equivalently written as Kmn (A ⊗ B) = (B ⊗ A)Kqp , where A is an n × p matrix and B is an m × q matrix. In particular, Kmn (An×n ⊗ Bm×m ) = (B ⊗ A)Kmn .
10. tr(Kmn (Am×n ⊗ Bm×n )) = tr(AT B) = (vec(BT ))T Kmn vec(A).

An mn × mn commutation matrix can be constructed as follows. First let

   Kmn = [K1 ; . . . ; Km ],   Ki ∈ Rn×mn , i = 1, . . . , m,        (1.9.6)

i.e., Kmn is obtained by stacking the m submatrices K1 , . . . , Km . The (i, j )th entry of the first submatrix K1 is given by

   K1 (i, j ) = 1, if j = (i − 1)m + 1, i = 1, . . . , n;   K1 (i, j ) = 0, otherwise.        (1.9.7)

Next, the ith submatrix Ki (i = 2, . . . , m) is constructed from the (i − 1)th submatrix Ki−1 as follows:

   Ki = [0, Ki−1 (1 : mn − 1)],   i = 2, . . . , m,        (1.9.8)

where Ki−1 (1 : mn − 1) denotes the submatrix consisting of the first mn − 1 columns of the n × mn submatrix Ki−1 .
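The following minimal sketch (not from the original text) builds Kmn directly from its defining property and verifies properties 1 and 2 numerically; the function name and the chosen sizes are illustrative assumptions.

    import numpy as np

    def commutation_matrix(m, n):
        """Kmn such that Kmn @ vec(A) = vec(A.T) for A of shape (m, n),
        with vec meaning column stacking (Fortran order)."""
        K = np.zeros((m * n, m * n))
        for r in range(m):
            for c in range(n):
                K[r * n + c, c * m + r] = 1.0
        return K

    m, n = 3, 4
    rng = np.random.default_rng(2)
    A = rng.standard_normal((m, n))
    K = commutation_matrix(m, n)

    vec = lambda M: M.flatten(order="F")
    assert np.allclose(K @ vec(A), vec(A.T))       # property 1
    assert np.allclose(K.T @ K, np.eye(m * n))     # property 2 (Kmn is a permutation matrix)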

1.9.2 Matricization of Vectors

Matricization refers to a mapping that transforms a vector into a matrix. The operation transforming an mn-dimensional column vector a = [a1 , . . . , amn ]T into an m × n matrix A is known as the matricization (or unfolding) of the column vector, denoted unvecm,n (a), and is defined as

   Am×n = unvecm,n (a) = [ a1   am+1   · · ·  am(n−1)+1
                           a2   am+2   · · ·  am(n−1)+2
                           ⋮      ⋮      ⋱        ⋮
                           am   a2m    · · ·  amn       ],        (1.9.9)

where Aij = ai+(j −1)m ,

i = 1, . . . , m, j = 1, . . . , n.

(1.9.10)

Similarly, the operation transforming a 1 × mn row vector b = [b1 , . . . , bmn ] into an m × n matrix B is called the matricization (or unfolding) of the row vector, denoted unrvecm,n (b), and is defined as

   Bm×n = unrvecm,n (b) = [ b1           b2           · · ·  bn
                            bn+1         bn+2         · · ·  b2n
                            ⋮             ⋮             ⋱      ⋮
                            b(m−1)n+1    b(m−1)n+2    · · ·  bmn ].        (1.9.11)

This is equivalently represented in element form as Bij = bj+(i−1)n ,

i = 1, . . . , m, j = 1, . . . , n.

(1.9.12)

It can be seen from the above definitions that there are the following relationships between matricization (unvec, unrvec) and column vectorization (vec) or row vectorization (rvec): the matrix A = [Aij ] is mapped by vec to [A11 , . . . , Am1 , . . . , A1n , . . . , Amn ]T and recovered by unvec, and is mapped by rvec to [A11 , . . . , A1n , . . . , Am1 , . . . , Amn ] and recovered by unrvec. In short,

   unvecm,n (a) = Am×n    ⇔    vec(Am×n ) = amn×1 ,        (1.9.13)
   unrvecm,n (b) = Bm×n   ⇔    rvec(Bm×n ) = b1×mn .        (1.9.14)

The vectorization operator has the following properties [7, 13, 22]. 1. The vectorization of a transposed matrix is given by vec(AT ) = Kmn vec(A) for A ∈ Cm×n . 2. The vectorization of a matrix sum is given by vec(A + B) = vec(A) + vec(B). 3. The vectorization of a Kronecker product is given by Magnus and Neudecker [22, p. 184]: vec(X ⊗ Y) = (Im ⊗ Kqp ⊗ In )(vec(X) ⊗ vec(Y)).

(1.9.15)

4. The trace of a matrix product is given by tr(AT B) = (vec(A))T vec(B),

(1.9.16)

tr(AH B) = (vec(A))H vec(B),

(1.9.17)

tr(ABC) = (vec(A))T (Ip ⊗ B) vec(C),

(1.9.18)


while the trace of the product of four matrices is determined by Magnus and Neudecker [22, p. 31]: tr(ABCD) = (vec(DT ))T (CT ⊗ A) vec(B) = (vec(D))T (A ⊗ CT ) vec(BT ). 5. The Kronecker product of two vectors a and b can be represented as the vectorization of their outer product baT as follows: a ⊗ b = vec(baT ) = vec(b ◦ a).

(1.9.19)

6. The vectorization of the Hadamard product is given by

   vec(A ⊙ B) = vec(A) ⊙ vec(B) = Diag(vec(A)) vec(B),        (1.9.20)

where Diag(vec(A)) is the diagonal matrix whose diagonal entries are the entries of vec(A). 7. The relation of the vectorization of the matrix product Am×p Bp×q Cq×n to the Kronecker product is given by [35, p. 263]: vec(ABC) = (CT ⊗ A) vec(B),

(1.9.21)

vec(ABC) = (Iq ⊗ AB) vec(C) = (CT BT ⊗ Im ) vec(A),

(1.9.22)

vec(AC) = (Ip ⊗ A) vec(C) = (CT ⊗ Im ) vec(A).

(1.9.23)

Example 1.2 Consider the solution of the matrix equation AXB = C, where A ∈ Rm×n , X ∈ Rn×p , B ∈ Rp×q , and C ∈ Rm×q . By using the vectorization function property vec(AXB) = (BT ⊗ A) vec(X), the vectorization vec(AXB) = vec(C) of the original matrix equation can be, in Kronecker product form, rewritten as [33]: (BT ⊗ A)vec(X) = vec(C), and thus vec(X) = (BT ⊗ A)† vec(C). Then by matricizing vec(X), we get the solution matrix X of the original matrix equation AXB = C.
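The following minimal NumPy sketch (not from the original text) carries out the procedure of Example 1.2: it vectorizes AXB = C via the Kronecker identity, solves for vec(X) with the Moore–Penrose inverse, and matricizes the result. The sizes and the random test data are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(3)
    m, n, p, q = 4, 3, 3, 2
    A = rng.standard_normal((m, n))
    B = rng.standard_normal((p, q))
    X_true = rng.standard_normal((n, p))
    C = A @ X_true @ B                      # guarantees a consistent equation

    vec = lambda M: M.flatten(order="F")

    # vec(AXB) = (B^T kron A) vec(X)  =>  vec(X) = (B^T kron A)^+ vec(C)
    x = np.linalg.pinv(np.kron(B.T, A)) @ vec(C)
    X = x.reshape(n, p, order="F")          # matricize (unvec) back to n x p

    assert np.allclose(A @ X @ B, C)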

Brief Summary of This Chapter • This chapter focuses on basic operations and performance of matrices, which constitute the foundation of matrix algebra. • Trace, rank, positive definiteness, Moore–Penrose inverse matrices, Kronecker product, Hadamard product (elementwise product), and vectorization of matrices are frequently used in artificial intelligence.


References 1. Banachiewicz, T.: Zur Berechungung der Determinanten, wie auch der Inverse, und zur darauf basierten Auflösung der Systeme linearer Gleichungen. Acta Astron. Sér. C 3, 41–67 (1937) 2. Barnett, S.: Matrices: Methods and Applications. Clarendon Press, Oxford (1990) 3. Bellman, R.: Introduction to Matrix Analysis, 2nd edn. McGraw-Hill, New York (1970) 4. Berberian, S.K.: Linear Algebra. Oxford University Press, New York (1992) 5. Boot, J.: Computation of the generalized inverse of singular or rectangular matrices. Am. Math Mon. 70, 302–303 (1963) 6. Brewer, J.W.: Kronecker products and matrix calculus in system theory. IEEE Trans. Circuits Syst. 25, 772–781 (1978) 7. Brookes, M.: Matrix Reference Manual (2011). https://www.ee.ic.ac.uk/hp/staff/dmb/matrix/ intro.html 8. Duncan, W.J.: Some devices for the solution of large sets of simultaneous linear equations. Lond. Edinb. Dublin Philos. Mag. J. Sci. Ser. 7 35, 660–670 (1944) 9. Graybill, F.A.: Matrices with Applications in Statistics. Wadsworth International Group, Balmont (1983) 10. Graybill, F.A., Meyer, C.D., Painter, R.J.: Note on the computation of the generalized inverse of a matrix. SIAM Rev. 8(4), 522–524 (1966) 11. Greville, T.N.E.: Some applications of the pseudoinverse of a matrix. SIAM Rev. 2, 15–22 (1960) 12. Guttman, L.: Enlargement methods for computing the inverse matrix. Ann. Math. Stat. 17, 336–343 (1946) 13. Henderson, H.V., Searle, S.R.: The vec-permutation matrix, the vec operator and Kronecker products: a review. Linear Multilinear Alg. 9, 271–288 (1981) 14. Hendeson, H.V., Searle, S.R.: On deriving the inverse of a sum of matrices. SIAM Rev. 23, 53–60 (1981) 15. Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, Cambridge (1985) 16. Hotelling, H.: Some new methods in matrix calculation. Ann. Math. Stat. 14, 1–34 (1943) 17. Hotelling, H.: Further points on matrix calculation and simultaneous equations. Ann. Math. Stat. 14, 440–441 (1943) 18. Johnson, L.W., Riess, R.D., Arnold, J.T.: Introduction to Linear Algebra, 5th edn. PrenticeHall, New York (2000) 19. Lancaster, P., Tismenetsky, M.: The Theory of Matrices with Applications, 2nd edn. Academic, New York (1985) 20. Lütkepohl, H.: Handbook of Matrices. Wiley, New York (1996) 21. Magnus, J.R., Neudecker, H.: The commutation matrix: some properties and applications. Ann. Stat. 7, 381–394 (1979) 22. Magnus, J.R., Neudecker, H.: Matrix Differential Calculus with Applications in Statistics and Econometrics, rev. edn. Wiley, Chichester (1999) 23. Moore, E.H.: General analysis, part 1. Mem. Am. Philos. Soc. 1, 1 (1935) 24. Noble, B., Danniel, J.W.: Applied Linear Algebra, 3rd edn. Prentice-Hall, Englewood Cliffs (1988) 25. Papoulis, A.: Probability, Random Variables and Stochastic Processes. McGraw-Hill, New York (1991) 26. Penrose, R.A.: A generalized inverse for matrices. Proc. Cambridge Philos. Soc. 51, 406–413 (1955) 27. Phillip, A., Regalia, P.A., Mitra, S.: Kronecker products, unitary matrices and signal processing applications. SIAM Rev. 31(4), 586–613 (1989) 28. Piegorsch, W.W., Casella, G.: The early use of matrix diagonal increments in statistical problems. SIAM Rev. 31, 428–434 (1989) 29. Poularikas, A.D.: The Handbook of Formulas and Tables for Signal Processing. CRC Press, New York (1999)


30. Price, C.: The matrix pseudoinverse and minimal variance estimates. SIAM Rev. 6, 115–120 (1964) 31. Pringle, R.M., Rayner, A.A.: Generalized Inverse of Matrices with Applications to Statistics. Griffin, London (1971) 32. Rado, R.: Note on generalized inverse of matrices. Proc. Cambridge Philos. Soc. 52, 600–601 (1956) 33. Rao, C.R., Mitra, S.K.: Generalized Inverse of Matrices and its Applications. Wiley, New York (1971) 34. Regalia, P.A., Mitra, S.K.: Kronecker products, unitary matrices and signal processing applications. SIAM Rev. 31(4), 586–613 (1989) 35. Schott, J.R.: Matrix Analysis for Statistics. Wiley, New York (1997) 36. Searle, S.R.: Matrix Algebra Useful for Statistics. Wiley, New York (1982) 37. Sherman, J., Morrison, W.J.: Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. Ann. Math. Stat. 21, 124–127 (1950) 38. Woodbury, M.A.: Inverting modified matrices. Memorandum Report 42, Statistical Research Group, Princeton (1950) 39. Zhang, X.D.: Numerical computations of left and right pseudo inverse matrices (in Chinese). Kexue Tongbao 7(2), 126 (1982) 40. Zhang, X.D.: Matrix Analysis and Applications. Cambridge University Press, Cambridge (2017)

Chapter 2

Matrix Differential

The matrix differential is a generalization of the multivariate function differential. The matrix differential (including the matrix partial derivative and gradient) is an important operational tool in matrix algebra and in optimization for machine learning, neural networks, support vector machines, and evolutionary computation. This chapter is concerned with the theory and methods of the matrix differential.

2.1 Jacobian Matrix and Gradient Matrix In this section we discuss the partial derivatives of real functions. Table 2.1 summarizes the symbols of the real functions.

2.1.1 Jacobian Matrix

First, we introduce definitions of partial derivatives and the Jacobian matrix.

1. The row partial derivative operator with respect to an m × 1 vector x is defined as

   ∇xT = ∂/∂xT = [∂/∂x1 , . . . , ∂/∂xm ],        (2.1.1)

and the row partial derivative vector of a real scalar function f (x) with respect to its m × 1 vector variable x is given by

   ∇xT f (x) = ∂f (x)/∂xT = [∂f (x)/∂x1 , . . . , ∂f (x)/∂xm ].        (2.1.2)

Table 2.1 Symbols of real functions

   Function type                Variable x ∈ Rm        Variable X ∈ Rm×n
   Scalar function f ∈ R        f (x) : Rm → R         f (X) : Rm×n → R
   Vector function f ∈ Rp       f(x) : Rm → Rp         f(X) : Rm×n → Rp
   Matrix function F ∈ Rp×q     F(x) : Rm → Rp×q       F(X) : Rm×n → Rp×q

2. The row partial derivative operator with respect to an m × n matrix X is defined as

   ∇(vec X)T = ∂/∂(vec X)T = [∂/∂x11 , . . . , ∂/∂xm1 , . . . , ∂/∂x1n , . . . , ∂/∂xmn ],        (2.1.3)

and the row partial derivative vector of a real scalar function f (X) with respect to its matrix variable X ∈ Rm×n is given by

   ∇(vec X)T f (X) = ∂f (X)/∂(vec X)T = [∂f (X)/∂x11 , . . . , ∂f (X)/∂xm1 , . . . , ∂f (X)/∂x1n , . . . , ∂f (X)/∂xmn ].        (2.1.4)

3. The Jacobian operator with respect to an m × n matrix X is defined as the n × m operator matrix

   ∇XT = ∂/∂XT = [ ∂/∂x11   · · ·  ∂/∂xm1
                    ⋮          ⋱      ⋮
                   ∂/∂x1n   · · ·  ∂/∂xmn ],        (2.1.5)

and the Jacobian matrix of the real scalar function f (X) with respect to its matrix variable X ∈ Rm×n is given by

   J = ∇XT f (X) = ∂f (X)/∂XT = [ ∂f (X)/∂x11   · · ·  ∂f (X)/∂xm1
                                    ⋮               ⋱       ⋮
                                   ∂f (X)/∂x1n   · · ·  ∂f (X)/∂xmn ] ∈ Rn×m .        (2.1.6)

There is the following relationship between the Jacobian matrix and the row partial derivative vector:

   ∇(vec X)T f (X) = rvec(J) = (vec(JT ))T .        (2.1.7)

This important relation is the basis of Jacobian matrix identification.


As a matter of fact, the Jacobian matrix is more useful than the row partial derivative vector. The following definition provides a specific expression for the Jacobian matrix of a p × q real-valued matrix function F(X) with m × n matrix variable X.

Definition 2.1 (Jacobian Matrix [6]) Let the vectorization of a p × q matrix function F(X) be given by

   vec F(X) = [f11 (X), . . . , fp1 (X), . . . , f1q (X), . . . , fpq (X)]T ∈ Rpq .        (2.1.8)

Then the pq × mn Jacobian matrix of F(X) is defined as

   J = ∇(vec X)T F(X) = ∂ vec F(X)/∂(vec X)T ∈ Rpq×mn ,        (2.1.9)

whose specific expression is

   J = [ ∂f11 /∂x11   · · ·  ∂f11 /∂xm1   · · ·  ∂f11 /∂x1n   · · ·  ∂f11 /∂xmn
          ⋮                    ⋮                   ⋮                   ⋮
         ∂fp1 /∂x11   · · ·  ∂fp1 /∂xm1   · · ·  ∂fp1 /∂x1n   · · ·  ∂fp1 /∂xmn
          ⋮                    ⋮                   ⋮                   ⋮
         ∂f1q /∂x11   · · ·  ∂f1q /∂xm1   · · ·  ∂f1q /∂x1n   · · ·  ∂f1q /∂xmn
          ⋮                    ⋮                   ⋮                   ⋮
         ∂fpq /∂x11   · · ·  ∂fpq /∂xm1   · · ·  ∂fpq /∂x1n   · · ·  ∂fpq /∂xmn ].        (2.1.10)

2.1.2 Gradient Matrix

The partial derivative operator in column form is referred to as the gradient vector operator.

Definition 2.2 (Gradient Vector Operators) The gradient vector operators with respect to an m × 1 vector x and to an m × n matrix X are, respectively, defined as

   ∇x = ∂/∂x = [∂/∂x1 , . . . , ∂/∂xm ]T        (2.1.11)

and

   ∇vec X = ∂/∂ vec X = [∂/∂x11 , . . . , ∂/∂xm1 , . . . , ∂/∂x1n , . . . , ∂/∂xmn ]T .        (2.1.12)

Definition 2.3 (Gradient Matrix Operator) The gradient matrix operator with respect to an m × n matrix X, denoted ∇X = ∂/∂X, is defined as

   ∇X = ∂/∂X = [ ∂/∂x11   · · ·  ∂/∂x1n
                  ⋮          ⋱      ⋮
                 ∂/∂xm1   · · ·  ∂/∂xmn ].        (2.1.13)

Definition 2.4 (Gradient Vectors) The gradient vectors of the functions f (x) and f (X) are, respectively, defined as

   ∇x f (x) = [∂f (x)/∂x1 , . . . , ∂f (x)/∂xm ]T = ∂f (x)/∂x,        (2.1.14)

   ∇vec X f (X) = [∂f (X)/∂x11 , . . . , ∂f (X)/∂xm1 , . . . , ∂f (X)/∂x1n , . . . , ∂f (X)/∂xmn ]T .        (2.1.15)

Definition 2.5 (Gradient Matrix) The gradient matrix of the function f (X) is defined as

   ∇X f (X) = ∂f (X)/∂X = [ ∂f (X)/∂x11   · · ·  ∂f (X)/∂x1n
                             ⋮                ⋱       ⋮
                            ∂f (X)/∂xm1   · · ·  ∂f (X)/∂xmn ].        (2.1.16)

For a real matrix function F(X) ∈ Rp×q with matrix variable X ∈ Rm×n , its gradient matrix is defined as

   ∇X F(X) = ∂(vec F(X))T /∂ vec X = (∂ vec F(X)/∂(vec X)T )T .        (2.1.17)

Comparing Eq. (2.1.16) with Eq. (2.1.6), we have ∇X f (X) = JT .

(2.1.18)

Similarly, comparing Eq. (2.1.17) with Eq. (2.1.9) gives ∇X F(X) = JT .

(2.1.19)


That is to say, the gradient matrix of a real scalar function f (X) or a real matrix function F(X) is equal to the transpose of the respective Jacobian matrix. An obvious fact is that, given a real scalar function f (x), its gradient vector is directly equal to the transpose of the partial derivative vector. In this sense, the partial derivative in row vector form is a covariant form of the gradient vector, so the row partial derivative vector is also known as the cogradient vector. Similarly, the Jacobian matrix is sometimes called the cogradient matrix. The cogradient is a covariant operator [3] that itself is not the gradient, but is related to the gradient. For this reason, the partial derivative operator ∂/∂xT and the Jacobian operator ∂/∂XT are known as the (row) partial derivative operator, the covariant form of the gradient operator, or the cogradient operator. The direction of the negative gradient −∇x f (x) is known as the gradient flow direction of the function f (x) at the point x, and is expressed as

   ẋ = −∇x f (x)   or   vec Ẋ = −∇vec X f (X).        (2.1.20)

From the definition formula of the gradient vector, we have the following conclusive remarks: • In the gradient flow direction, the function f (x) decreases at the maximum descent rate. On the contrary, in the opposite direction (i.e., the positive gradient direction), the function increases at the maximum ascent rate. • Each component of the gradient vector gives the rate of change of the scalar function f (x) in the component direction.
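The descent property of the gradient flow direction is easy to illustrate numerically. The following minimal sketch (not from the original text) takes repeated small steps along the negative gradient of a simple convex quadratic and confirms that the function value never increases; the quadratic, the step size, and the starting point are arbitrary assumptions.

    import numpy as np

    # f(x) = 1/2 x^T Q x - b^T x, with gradient Q x - b.
    Q = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, -1.0])
    f = lambda x: 0.5 * x @ Q @ x - b @ x
    grad = lambda x: Q @ x - b

    x = np.array([2.0, 2.0])
    for _ in range(50):
        x_new = x - 0.1 * grad(x)            # step along the negative gradient
        assert f(x_new) <= f(x) + 1e-12      # f decreases along the gradient flow direction
        x = x_new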

2.1.3 Calculation of Partial Derivative and Gradient

The gradient computation of a real function with respect to its matrix variable has the following properties and rules [5]:

1. If f (X) = c, where c is a real constant and X is an m × n real matrix, then the gradient ∂c/∂X = Om×n .
2. Linear rule: If f (X) and g(X) are two real-valued functions of the matrix variable X, and c1 and c2 are two real constants, then

   ∂[c1 f (X) + c2 g(X)]/∂X = c1 ∂f (X)/∂X + c2 ∂g(X)/∂X.        (2.1.21)

3. Product rule: If f (X), g(X), and h(X) are real-valued functions of the matrix variable X, then

   ∂(f (X)g(X))/∂X = g(X) ∂f (X)/∂X + f (X) ∂g(X)/∂X        (2.1.22)


and

   ∂(f (X)g(X)h(X))/∂X = g(X)h(X) ∂f (X)/∂X + f (X)h(X) ∂g(X)/∂X + f (X)g(X) ∂h(X)/∂X.        (2.1.23)

4. Quotient rule: If g(X) ≠ 0, then

   ∂(f (X)/g(X))/∂X = (1/g 2 (X)) [ g(X) ∂f (X)/∂X − f (X) ∂g(X)/∂X ].        (2.1.24)

5. Chain rule: If X is an m × n matrix, and y = f (X) and g(y) are, respectively, real-valued functions of the matrix variable X and of the scalar variable y, then

   ∂g(f (X))/∂X = (dg(y)/dy) · ∂f (X)/∂X.        (2.1.25)

As an extension, if g(F(X)) = g(F), where F = [fkl ] ∈ Rp×q and X = [xij ] ∈ Rm×n , then the chain rule is given by Petersen and Pedersen [7]

   [∂g(F)/∂X]ij = ∂g(F)/∂xij = Σ_{k=1}^{p} Σ_{l=1}^{q} (∂g(F)/∂fkl )(∂fkl /∂xij ).        (2.1.26)

When computing the partial derivatives of the functions f (x) and f (X), it is necessary to make the following basic assumption.

Independence Assumption Given a real-valued function f , we assume that the vector variable x = [xi ]m i=1 ∈ Rm and the matrix variable X = [xij ]m,n i=1,j=1 ∈ Rm×n do not themselves have any special structure; namely, the entries of x (and of X) are independent. The independence assumption can be expressed in mathematical form as follows:

   ∂xi /∂xj = δij = 1 if i = j, and 0 otherwise;        (2.1.27)
   ∂xkl /∂xij = δki δlj = 1 if k = i and l = j, and 0 otherwise.        (2.1.28)

These expressions on independence are the basic formulas for partial derivative computation.


Example 2.1 Consider the real function

   f (X) = aT XXT b = Σ_{k=1}^{m} Σ_{l=1}^{m} ak ( Σ_{p=1}^{n} xkp xlp ) bl ,   X ∈ Rm×n , a, b ∈ Rm×1 .

Using Eq. (2.1.28), we can see easily that

   [∂f (X)/∂XT ]ij = ∂f (X)/∂xji
                   = Σ_{k=1}^{m} Σ_{l=1}^{m} Σ_{p=1}^{n} ak bl ( (∂xkp /∂xji ) xlp + xkp (∂xlp /∂xji ) )
                   = Σ_{l=1}^{m} aj xli bl + Σ_{k=1}^{m} ak xki bj
                   = [XT b]i aj + [XT a]i bj ,

which yields, respectively, the Jacobian matrix and the gradient matrix as follows:

   J = XT (baT + abT )   and   ∇X f (X) = JT = (baT + abT )X.
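As a quick check of Example 2.1 (not part of the original text), the following minimal sketch compares the analytic gradient (baT + abT )X with a central finite-difference approximation; the sizes and random data are arbitrary assumptions.

    import numpy as np

    rng = np.random.default_rng(4)
    m, n = 4, 3
    X = rng.standard_normal((m, n))
    a, b = rng.standard_normal(m), rng.standard_normal(m)

    f = lambda X: a @ X @ X.T @ b
    analytic = (np.outer(b, a) + np.outer(a, b)) @ X       # (b a^T + a b^T) X

    eps = 1e-6
    numeric = np.zeros_like(X)
    for i in range(m):
        for j in range(n):
            E = np.zeros_like(X); E[i, j] = eps
            numeric[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

    assert np.allclose(numeric, analytic, atol=1e-5)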

Example 2.2 Let F(X) = AXB, where A ∈ Rp×m , X ∈ Rm×n , B ∈ Rn×q . We have

   ∂fkl /∂xij = ∂[AXB]kl /∂xij = ∂( Σ_{u=1}^{m} Σ_{v=1}^{n} aku xuv bvl )/∂xij = aki bjl
   ⇒ ∇X (AXB) = B ⊗ AT   ⇒   J = (∇X (AXB))T = BT ⊗ A.

That is, the pq × mn Jacobian matrix is J = BT ⊗ A, and the mn × pq gradient matrix is given by ∇X (AXB) = B ⊗ AT .
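The Jacobian of Example 2.2 can also be verified column by column with finite differences. The following minimal sketch (not from the original text) perturbs each entry of vec X and compares the result with BT ⊗ A; sizes and random data are arbitrary assumptions, and vec is the column-stacking (Fortran-order) flatten.

    import numpy as np

    rng = np.random.default_rng(5)
    p, m, n, q = 2, 3, 4, 2
    A = rng.standard_normal((p, m))
    B = rng.standard_normal((n, q))
    X = rng.standard_normal((m, n))

    vec = lambda M: M.flatten(order="F")
    F = lambda X: vec(A @ X @ B)

    J_analytic = np.kron(B.T, A)                  # pq x mn Jacobian of vec(AXB)

    eps = 1e-6
    J_numeric = np.zeros((p * q, m * n))
    for k in range(m * n):
        dx = np.zeros(m * n); dx[k] = eps
        dX = dx.reshape(m, n, order="F")
        J_numeric[:, k] = (F(X + dX) - F(X - dX)) / (2 * eps)

    assert np.allclose(J_numeric, J_analytic, atol=1e-5)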

2.2 Real Matrix Differential Although direct computation of partial derivatives ∂fkl /∂xj i or ∂fkl /∂xij can be used to find the Jacobian matrices or the gradient matrices of many matrix functions, for more complex functions (such as the inverse matrix, the Moore–Penrose inverse matrix and the exponential functions of a matrix), direct computation of their partial derivatives is more complicated and difficult. Hence, naturally we want to have


an easily remembered and effective mathematical tool for computing the Jacobian matrices and the gradient matrices of real scalar functions and real matrix functions. Such a mathematical tool is the matrix differential.

2.2.1 Calculation of Real Matrix Differential

The differential of an m × n matrix X = [xij ] is known as the matrix differential, denoted dX; it is still an m × n matrix and is defined as dX = [dxij ]m,n i=1,j=1 .

Consider the differential of a trace function tr(U). We have

   d(tr U) = d( Σ_{i=1}^{n} uii ) = Σ_{i=1}^{n} duii = tr(dU),

namely d(tr U) = tr(dU). The matrix differential of the matrix product UV is given in element form:

   [d(UV)]ij = d([UV]ij ) = d( Σ_k uik vkj ) = Σ_k d(uik vkj )
             = Σ_k (duik )vkj + Σ_k uik dvkj
             = [(dU)V]ij + [U(dV)]ij .

Then, we have the matrix differential d(UV) = (dU)V + U(dV). Common computation formulas for the matrix differential are given as follows [6, pp. 148–154]:

1. The differential of a constant matrix is a zero matrix, namely dA = O.
2. The matrix differential of the product αX is given by d(αX) = α dX.
3. The matrix differential of a transposed matrix is equal to the transpose of the original matrix differential, namely d(XT ) = (dX)T .
4. The matrix differential of the sum (or difference) of two matrices is given by d(U ± V) = dU ± dV. More generally, we have d(aU ± bV) = a · dU ± b · dV.
5. The matrix differentials of the functions UV and UVW, where U = F(X), V = G(X), W = H(X), are, respectively, given by

   d(UV) = (dU)V + U(dV),        (2.2.1)
   d(UVW) = (dU)VW + U(dV)W + UV(dW).        (2.2.2)

   If A and B are constant matrices, then d(AXB) = A(dX)B.


6. The differential of the matrix trace d(tr X) is equal to the trace of the matrix differential dX, namely d(tr X) = tr(dX).

(2.2.3)

In particular, the differential of the trace of the matrix function F(X) is given by d(tr F(X)) = tr(dF(X)). 7. The differential of the determinant of X is given by d|X| = |X|tr(X−1 dX).

(2.2.4)

In particular, the differential of the determinant of the matrix function F(X) is computed by d|F(X)| = |F(X)|tr(F−1 (X)dF(X)). 8. The matrix differential of the Kronecker product is given by d(U ⊗ V) = (dU) ⊗ V + U ⊗ dV.

(2.2.5)

9. The matrix differential of the Hadamard product is computed by

   d(U ⊙ V) = (dU) ⊙ V + U ⊙ dV.        (2.2.6)

10. The matrix differential of the inverse matrix is given by d(X−1 ) = −X−1 (dX)X−1 .

(2.2.7)

11. The differential of the vectorization function vec X is equal to the vectorization of the matrix differential, i.e., d vec X = vec(dX).

(2.2.8)

12. The differential of the matrix logarithm is given by d log X = X−1 dX.

(2.2.9)

In particular, d log F(X) = F−1 (X) dF(X). 13. The matrix differentials of X† , X† X, and XX† are given by d(X† ) = − X† (dX)X† + X† (X† )T (dXT )(I − XX† ) + (I − X† X)(dXT )(X† )T X† , ! "T d(X† X) = X† (dX)(I − X† X) + X† (dX)(I − X† X) ,

(2.2.11)

"T ! d(XX† ) = (I − XX† )(dX)X† + (I − XX† )(dX)X† .

(2.2.12)

(2.2.10)
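As a small numerical illustration of these rules (not from the original text), the following sketch checks the inverse-matrix differential (2.2.7) to first order: for a small perturbation dX, the actual change in X−1 should agree with −X−1 (dX)X−1 up to second-order terms. The matrix and perturbation size are arbitrary assumptions.

    import numpy as np

    rng = np.random.default_rng(6)
    n = 4
    X = rng.standard_normal((n, n)) + n * np.eye(n)    # well-conditioned test matrix
    dX = 1e-6 * rng.standard_normal((n, n))            # small perturbation

    lhs = np.linalg.inv(X + dX) - np.linalg.inv(X)     # actual change in X^{-1}
    rhs = -np.linalg.inv(X) @ dX @ np.linalg.inv(X)    # first-order prediction (2.2.7)

    assert np.allclose(lhs, rhs, atol=1e-9)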


2.2.2 Jacobian Matrix Identification

In multivariate calculus, the multivariate function f (x1 , . . . , xm ) is said to be differentiable at the point (x1 , . . . , xm ) if the change in f (x1 , . . . , xm ) can be expressed as

   Δf (x1 , . . . , xm ) = f (x1 + Δx1 , . . . , xm + Δxm ) − f (x1 , . . . , xm )
                        = A1 Δx1 + · · · + Am Δxm + O(Δx1 , . . . , Δxm ),

where A1 , . . . , Am are independent of Δx1 , . . . , Δxm , respectively, and O(Δx1 , . . . , Δxm ) denotes the second-order and higher-order terms in Δx1 , . . . , Δxm . In this case, the partial derivatives ∂f/∂x1 , . . . , ∂f/∂xm must exist, and

   ∂f/∂x1 = A1 ,   . . . ,   ∂f/∂xm = Am .

The linear part of the change Δf (x1 , . . . , xm ),

   A1 Δx1 + · · · + Am Δxm = (∂f/∂x1 ) dx1 + · · · + (∂f/∂xm ) dxm ,

is said to be the differential or first-order differential of the multivariate function f (x1 , . . . , xm ) and is denoted by

   df (x1 , . . . , xm ) = (∂f/∂x1 ) dx1 + · · · + (∂f/∂xm ) dxm .        (2.2.13)

The sufficient condition for a multivariate function f (x1 , . . . , xm ) to be differentiable at the point (x1 , . . . , xm ) is that the partial derivatives ∂f/∂x1 , . . . , ∂f/∂xm exist and are continuous. Equation (2.2.13) provides two Jacobian matrix identification methods.

• For a scalar function f (x) with variable x = [x1 , . . . , xm ]T ∈ Rm , regarding the elements x1 , . . . , xm as m variables and using Eq. (2.2.13), we can directly obtain the differential of the scalar function f (x) as follows:

   df (x) = (∂f (x)/∂x1 ) dx1 + · · · + (∂f (x)/∂xm ) dxm = [∂f (x)/∂x1 , . . . , ∂f (x)/∂xm ] [dx1 , . . . , dxm ]T

or

   df (x) = (∂f (x)/∂xT ) dx,        (2.2.14)

where ∂f (x)/∂xT = [∂f (x)/∂x1 , . . . , ∂f (x)/∂xm ] and dx = [dx1 , . . . , dxm ]T . If we denote the row vector A = ∂f (x)/∂xT , then the first-order differential in (2.2.14) can be represented as a trace:

   df (x) = (∂f (x)/∂xT ) dx = A dx = tr(A dx),

because A dx is a scalar, and for any scalar α we have α = tr(α). This shows that there is the following equivalence relationship between the Jacobian matrix of the scalar function f (x) and its matrix differential:

   df (x) = tr(A dx)   ⇔   J = ∂f (x)/∂xT = A.        (2.2.15)

In other words, if the differential of the function f (x) is denoted df (x) = tr(A dx), then the matrix A is just the Jacobian matrix of the function f (x).

• For a scalar function f (X) with variable X = [x1 , . . . , xn ] ∈ Rm×n , denoting xj = [x1j , . . . , xmj ]T , j = 1, . . . , n, Eq. (2.2.13) becomes

   df (X) = [∂f (X)/∂x11 , . . . , ∂f (X)/∂xm1 , . . . , ∂f (X)/∂x1n , . . . , ∂f (X)/∂xmn ] [dx11 , . . . , dxm1 , . . . , dx1n , . . . , dxmn ]T
          = (∂f (X)/∂(vec X)T ) d vec X = ∇(vec X)T f (X) d vec X.        (2.2.16)

By the relationship between the row partial derivative vector and the Jacobian matrix in Eq. (2.1.7), ∇(vec X)T f (X) = (vec(JT ))T , Eq. (2.2.16) can be written as

   df (X) = (vec AT )T d(vec X),

(2.2.17)

where

   A = J = ∂f (X)/∂XT = [ ∂f (X)/∂x11   · · ·  ∂f (X)/∂xm1
                           ⋮                ⋱       ⋮
                          ∂f (X)/∂x1n   · · ·  ∂f (X)/∂xmn ]        (2.2.18)


is the Jacobian matrix of the scalar function f (X). Using the relationship between the vectorization operator vec and the trace function tr(BT C) = (vec B)T vec C, and letting B = AT and C = dX, then Eq. (2.2.17) can be expressed in the trace form as df (X) = tr(AdX).

(2.2.19)

This can be regarded as the canonical form of the differential of a scalar function f (X). The above discussion shows that once the matrix differential df (X) of a scalar function is expressed in its canonical form, we can identify the Jacobian matrix and/or the gradient matrix of the scalar function f (X), as stated below.

Proposition 2.1 If a scalar function f (X) is differentiable at the point X, then the Jacobian matrix A can be directly identified as follows [6]:

   df (x) = tr(A dx)   ⇔   J = A,        (2.2.20)
   df (X) = tr(A dX)   ⇔   J = A.        (2.2.21)

Proposition 2.1 motivates the following effective approach to directly identifying the Jacobian matrix J = DX f (X) of the scalar function f (X): 1. Find the differential df (X) of the real function f (X), and denote it in the canonical form as df (X) = tr(A dX). 2. The Jacobian matrix is directly given by A. The following are two main points in applying Proposition 2.1: • Any scalar function f (X) can always be written in the form of a trace function, because f (X) = tr(f (X)). • No matter where dX appears initially in the trace function, we can place it in the rightmost position via the trace property tr(C(dX)B) = tr(BC dX), giving the canonical form df (X) = tr(A dX). It has been shown [6] that the Jacobian matrix A is uniquely determined: if there are A1 and A2 such that df (X) = Ai dX, i = 1, 2, then A1 = A2 . Since the gradient matrix is the transpose of the Jacobian matrix for a given real function f (X), Proposition 2.1 implies in addition that df (X) = tr(AdX) ⇔ ∇X f (X) = AT .

(2.2.22)

Because the Jacobian matrix A is uniquely determined, the gradient matrix is uniquely determined as well.


Jacobian Matrices of Trace Functions

Example 2.3 The differential of the trace tr(XT AX) is given by

   d tr(XT AX) = tr(d(XT AX)) = tr((dX)T AX + XT A dX)
               = tr((dX)T AX) + tr(XT A dX) = tr((AX)T dX) + tr(XT A dX)
               = tr(XT (AT + A) dX),

which yields the gradient matrix

   ∂ tr(XT AX)/∂X = (XT (AT + A))T = (A + AT )X.        (2.2.23)
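The result of Example 2.3 is easy to confirm numerically. The following minimal sketch (not from the original text) compares the gradient (A + AT )X with a central finite-difference approximation of tr(XT AX); the size and random data are arbitrary assumptions.

    import numpy as np

    rng = np.random.default_rng(7)
    n = 4
    A = rng.standard_normal((n, n))
    X = rng.standard_normal((n, n))

    f = lambda X: np.trace(X.T @ A @ X)
    analytic = (A + A.T) @ X

    eps = 1e-6
    numeric = np.zeros_like(X)
    for i in range(n):
        for j in range(n):
            E = np.zeros_like(X); E[i, j] = eps
            numeric[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

    assert np.allclose(numeric, analytic, atol=1e-5)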

Similarly, we can compute the differential matrices and Jacobian matrices of other typical trace functions. Table 2.2 summarizes the differential matrices and Jacobian matrices of several typical trace functions [6].

Table 2.2 Differential matrices and Jacobian matrices of trace functions

   f (X)              Differential df (X)                      Jacobian matrix J = ∂f (X)/∂XT
   tr(X)              tr(I dX)                                 I
   tr(X−1 )           −tr(X−2 dX)                              −X−2
   tr(AX)             tr(A dX)                                 A
   tr(X2 )            2 tr(X dX)                               2X
   tr(XT X)           2 tr(XT dX)                              2XT
   tr(XT AX)          tr(XT (A + AT ) dX)                      XT (A + AT )
   tr(XAXT )          tr((A + AT )XT dX)                       (A + AT )XT
   tr(XAX)            tr((AX + XA) dX)                         AX + XA
   tr(AX−1 )          −tr(X−1 AX−1 dX)                         −X−1 AX−1
   tr(AX−1 B)         −tr(X−1 BAX−1 dX)                        −X−1 BAX−1
   tr((X + A)−1 )     −tr((X + A)−2 dX)                        −(X + A)−2
   tr(XAXB)           tr((AXB + BXA) dX)                       AXB + BXA
   tr(XAXT B)         tr((AXT B + AT XT BT ) dX)               AXT B + AT XT BT
   tr(AXXT B)         tr(XT (BA + AT BT ) dX)                  XT (BA + AT BT )
   tr(AXT XB)         tr((BA + AT BT )XT dX)                   (BA + AT BT )XT

   Here A−2 = A−1 A−1 .


Jacobian Matrices of Determinant Functions

Consider the Jacobian matrix identification of typical determinant functions.

Example 2.4 For the nonsingular matrix XXT , we have

   d|XXT | = |XXT | tr((XXT )−1 d(XXT ))
           = |XXT | [ tr((XXT )−1 (dX)XT ) + tr((XXT )−1 X(dX)T ) ]
           = |XXT | [ tr(XT (XXT )−1 dX) + tr(XT (XXT )−1 dX) ]
           = tr(2|XXT | XT (XXT )−1 dX).

By Proposition 2.1, we get the gradient matrix

   ∂|XXT |/∂X = 2|XXT | (XXT )−1 X.        (2.2.24)

Similarly, set X ∈ Rm×n . If rank(X) = n, i.e., XT X is invertible, then

   d|XT X| = tr(2|XT X|(XT X)−1 XT dX),        (2.2.25)

and hence ∂|XT X|/∂ X = 2|XT X| X(XT X)−1 . Similarly, we can compute the differential matrices and Jacobian matrices of other typical determinant functions. Table 2.3 summarizes the real matrix differentials and the Jacobian matrices of several typical determinant functions.
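Before turning to Table 2.3, here is a quick numerical check of the gradient (2.2.24) (not part of the original text), comparing it with central finite differences of det(XXT ); the sizes and random data are arbitrary assumptions, with m ≤ n so that XXT is nonsingular.

    import numpy as np

    rng = np.random.default_rng(8)
    m, n = 3, 5
    X = rng.standard_normal((m, n))

    f = lambda X: np.linalg.det(X @ X.T)
    analytic = 2 * f(X) * np.linalg.inv(X @ X.T) @ X

    eps = 1e-6
    numeric = np.zeros_like(X)
    for i in range(m):
        for j in range(n):
            E = np.zeros_like(X); E[i, j] = eps
            numeric[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

    assert np.allclose(numeric, analytic, rtol=1e-4, atol=1e-6)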

2.2.3 Jacobian Matrix of Real Matrix Functions

Let fkl = fkl (X) be the entry in the kth row and lth column of the real matrix function F(X); then dfkl (X) = [dF(X)]kl represents the differential of the scalar function fkl (X) with respect to the variable matrix X. From Eq. (2.2.16) we have

   dfkl (X) = [∂fkl (X)/∂x11 , . . . , ∂fkl (X)/∂xm1 , . . . , ∂fkl (X)/∂x1n , . . . , ∂fkl (X)/∂xmn ] [dx11 , . . . , dxm1 , . . . , dx1n , . . . , dxmn ]T


Table 2.3 Differentials and Jacobian matrices of determinant functions f (X)

Differential df (X)

Jacobian matrix J = ∂f (X)/∂X

|X|

|X| tr(X−1 dX)

|X|X−1

log |X|

tr(X−1 dX)

X−1

|X−1 |

−|X−1 | tr(X−1 dX)   2|X|2 tr X−1 dX

−|X−1 |X−1

k|X|k tr(X−1 dX)   2|XXT | tr XT (XXT )−1 dX   2|XT X| tr (XT X)−1 XT dX   2tr (XT X)−1 XT dX   |AXB| tr B(AXB)−1 AdX ! |XAXT | tr AXT (XAXT )−1

k|X|k X−1

|X2 | |Xk | |XX | T

|XT X| log |XT X| |AXB| |XAXT |

|XT AX|

2|X|2 X−1 2|XXT |XT (XXT )−1 2|XT X|(XT X)−1 XT 2(XT X)−1 XT

 " +(XA)T (XAT XT )−1 dX ! |XT AX|tr (XT AX)−T (AX)T  " +(XT AX)−1 XT A dX

|AXB|B(AXB)−1 A ! |XAXT | AXT (XAXT )−1 +(XA)T (XAT XT )−1 ! |XT AX| (XT AX)−T (AX)T " +(XT AX)−1 XT A

"

or dvecF(X) = AdvecX,

(2.2.26)

where

   d vec F(X) = [df11 (X), . . . , dfp1 (X), . . . , df1q (X), . . . , dfpq (X)]T ,        (2.2.27)
   d vec X = [dx11 , . . . , dxm1 , . . . , dx1n , . . . , dxmn ]T ,        (2.2.28)

and A = [∂fkl (X)/∂xij ] ∈ Rpq×mn , with rows ordered by (k, l) = (1, 1), . . . , (p, 1), . . . , (1, q), . . . , (p, q) and columns ordered by (i, j ) = (1, 1), . . . , (m, 1), . . . , (1, n), . . . , (m, n); that is,

   A = ∂ vec F(X)/∂(vec X)T .        (2.2.29)


In other words, the matrix A is the Jacobian matrix J = ∂ vec F(X)/∂(vec X)T of the matrix function F(X). Let F(X) ∈ Rp×q be a matrix function including X and XT as variables, where X ∈ Rm×n .

Theorem 2.1 ([6]) Given a matrix function F(X) : Rm×n → Rp×q , its pq × mn Jacobian matrix can be identified as follows:

   dF(X) = A(dX)B + C(dXT )D   ⇔   J = ∂ vec F(X)/∂(vec X)T = (BT ⊗ A) + (DT ⊗ C)Kmn ,        (2.2.30)

and the mn × pq gradient matrix can be determined from

   ∇X F(X) = ∂(vec F(X))T /∂ vec X = (B ⊗ AT ) + Knm (D ⊗ CT ).        (2.2.31)

Table 2.4 summarizes the matrix differentials and Jacobian matrices of some real functions. Table 2.5 lists some matrix functions and their Jacobian matrices. Example 2.5 Let F(X, Y) = X ⊗ Y be the Kronecker product of two matrices X ∈ Rp×m and Y ∈ Rn×q . Consider the matrix differential dF(X, Y) = (dX) ⊗ Y + X ⊗ (dY). By the vectorization formula vec(X ⊗ Y) = (Im ⊗ Kqp ⊗ In )(vec X ⊗ vec Y), we have vec(dX ⊗ Y) = (Im ⊗ Kqp ⊗ In )(d vecX ⊗ vec Y) = (Im ⊗ Kqp ⊗ In )(Ipm ⊗ vec Y)d vecX,

(2.2.32)

vec(X ⊗ dY) = (Im ⊗ Kqp ⊗ In )(vec X ⊗ d vecY) = (Im ⊗ Kqp ⊗ In )(vec X ⊗ Inq )d vecY.

(2.2.33)

Table 2.4 Matrix differentials and Jacobian matrices of real functions

   Functions                     Matrix differential                  Jacobian matrix
   f (x) : R → R                 df (x) = A dx                        A ∈ R
   f (x) : Rm → R                df (x) = A dx                        A ∈ R1×m
   f (X) : Rm×n → R              df (X) = tr(A dX)                    A ∈ Rn×m
   f(x) : Rm → Rp                df(x) = A dx                         A ∈ Rp×m
   f(X) : Rm×n → Rp              df(X) = A d(vec X)                   A ∈ Rp×mn
   F(x) : Rm → Rp×q              d vec F(x) = A dx                    A ∈ Rpq×m
   F(X) : Rm×n → Rp×q            dF(X) = A(dX)B + C(dXT )D            (BT ⊗ A) + (DT ⊗ C)Kmn ∈ Rpq×mn


Table 2.5 Differentials and Jacobian matrices of matrix functions F(X)

dF(X)

Jacobian matrix

XT X

XT dX + (dXT )X

(In ⊗ XT ) + (XT ⊗ In )Kmn

T

X(dXT ) + (dX)XT

(Im ⊗ X)Kmn + (X ⊗ Im )

T

AX B

A(dXT )B

(BT ⊗ A)Kmn

XT BX

XT B dX + (dXT )BX

I ⊗ (XT B) + ((BX)T ⊗ I)Kmn

AXT BXC

A(dXT )BXC + AXT B(dX)C

((BXC)T ⊗ A)Kmn + CT ⊗ (AXT B)

AXBXT C

A(dX)BXT C + AXB(dXT )C

(BXT C)T ⊗ A + (CT ⊗ (AXB))Kmn

X−1

−X−1 (dX)X−1

−(X−T ⊗ X−1 )

k 

k 

XX

Xk log X exp(X)

Xj −1 (dX)Xk−j

(XT )k−j ⊗ Xj −1

j =1

j =1

X−1 dX

I ⊗ X−1

∞  k=0

1 (k+1)!

k 

Xj (dX)Xk−j

j =0

∞  k=0

1 (k+1)!

k 

(XT )k−j ⊗ Xj

j =0

Hence, the Jacobian matrices with respect to the variable matrices X and Y are, respectively, given by JX (X ⊗ Y) = (Im ⊗ Kqp ⊗ In )(Ipm ⊗ vec Y),

(2.2.34)

JY (X ⊗ Y) = (Im ⊗ Kqp ⊗ In )(vec X ⊗ Inq ).

(2.2.35)

The analysis and examples in this section show that the first-order real matrix differential is indeed an effective mathematical tool for identifying the Jacobian matrix and the gradient matrix of a real function. And this tool is simple and easy to master.
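As a final numerical check for this section (not part of the original text), the following sketch verifies the vectorization identity used in Example 2.5, vec(X ⊗ Y) = (Im ⊗ Kqp ⊗ In )(vec X ⊗ vec Y), for X ∈ Rp×m and Y ∈ Rn×q ; the commutation-matrix helper, sizes, and random data are illustrative assumptions.

    import numpy as np

    def commutation_matrix(m, n):
        # Kmn @ vec(A) = vec(A.T) for A of shape (m, n), with vec = column stacking.
        K = np.zeros((m * n, m * n))
        for r in range(m):
            for c in range(n):
                K[r * n + c, c * m + r] = 1.0
        return K

    vec = lambda M: M.flatten(order="F")

    p, m, n, q = 2, 3, 2, 4            # X is p x m, Y is n x q (as in Example 2.5)
    rng = np.random.default_rng(9)
    X = rng.standard_normal((p, m))
    Y = rng.standard_normal((n, q))

    lhs = vec(np.kron(X, Y))
    rhs = np.kron(np.eye(m), np.kron(commutation_matrix(q, p), np.eye(n))) @ np.kron(vec(X), vec(Y))
    assert np.allclose(lhs, rhs)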

2.3 Complex Gradient Matrices In many engineering applications, observed data are usually complex. In these cases, the objective function of an optimization problem is a real-valued function of a complex vector or matrix. Hence, the gradient of the real objective function with respect to the complex vector or matrix variable is a complex vector or complex matrix. This complex gradient has the following two forms: • Complex gradient: the gradient of the objective function with respect to the complex vector or matrix variable itself; • Conjugate gradient: the gradient of the objective function with respect to the complex conjugate vector or matrix variable.



2.3.1 Holomorphic Function and Complex Partial Derivative Before discussing the complex gradient and conjugate gradient, it is necessary to recall the relevant facts about complex functions. Definition 2.6 (Complex Analytic Function [4]) Let D ⊆ C be the definition domain of the function f : D → C. The function f (z) with complex variable z is said to be a complex analytic function in the domain D if f (z) is complex f (z + z) − f (z) differentiable, namely lim exists for all z ∈ D. z→0 z In the standard framework of complex functions, a complex function f (z) (where def

z = x + jy) is written in the real polar coordinates r = (x, y) as f (r) = f (x, y). The terminology “complex analytic function” is commonly replaced by the completely synonymous terminology “holomorphic function.” It is noted that a (real) analytic complex function is (real) in the real-variable x-domain and ydomain, but is not necessarily holomorphic in the complex variable domain z = x + jy, i.e., it may be complex nonanalytic. A complex function f (z) can always be expressed in terms of its real part u(x, y) and imaginary part v(x, y) as f (z) = u(x, y) + jv(x, y), where z = x + jy and both u(x, y) and v(x, y) are real functions. For a holomorphic scalar function, the following four statements are equivalent [2]: 1. The complex function f (z) is a holomorphic (i.e., complex analytic) function. 2. The derivative f  (z) of the complex function exists and is continuous. 3. The complex function f (z) satisfies the Cauchy–Riemann condition ∂v ∂u = ∂x ∂y

and

∂v ∂u =− . ∂x ∂y

(2.3.1)

4. All derivatives of the complex function f (z) exist, and f (z) has a convergent power series. The Cauchy–Riemann condition is also called the Cauchy–Riemann equations. The function f (z) = u(x, y) + jv(x, y) is a holomorphic function only when both the real functions u(x, y) and v(x, y) satisfy the Laplace equations at the same time: ∂ 2 u(x, y) ∂ 2 u(x, y) + =0 ∂x 2 ∂y 2

and

∂ 2 v(x, y) ∂ 2 v(x, y) + = 0. ∂x 2 ∂y 2

(2.3.2)



A real function g(x, y) is called a harmonic function, if it satisfies the Laplace equation ∂ 2 g(x, y) ∂ 2 g(x, y) + = 0. ∂x 2 ∂y 2

(2.3.3)

A complex function f (z) = u(x, y) + jv(x, y) is not a holomorphic function if any of two real functions u(x, y) and v(x, y) does not meet the Cauchy–Riemann condition or the Laplace equations. Although the power function zn , the exponential function ez , the logarithmic function ln z, the sine function sin z, and the cosine function cos z are holomorphic functions, i.e., analytic functions in the complex plane, many commonly used functions are not holomorphic. A natural question to ask is whether there is a general representation form f (z, ·) instead of f (z) such that f (z, ·) is always holomorphic. The key to solving this problem is to adopt f (z, z∗ ) instead of f (z), as shown in Table 2.6. In the framework of complex derivatives, the formal partial derivatives of complex numbers are defined as ∂ 1 = ∂z 2



∂ ∂ −j ∂x ∂y

 ,

∂ 1 = ∗ ∂z 2



∂ ∂ +j ∂x ∂y

 .

(2.3.4)

The formal partial derivatives above were presented by Wirtinger [8] in 1927, so they are sometimes called Wirtinger partial derivatives. On the partial derivatives of the complex variable z = x + jy, a basic assumption is the independence of its real and imaginary parts: ∂x = 0 and ∂y

∂y = 0. ∂x

(2.3.5)

Table 2.6 Forms of complex-valued functions Function f ∈C f ∈ Cp F ∈ Cp×q

Variables z, z∗ ∈ C f (z, z∗ ) f :C×C→C f(z, z∗ ) f : C × C → Cp F(z, z∗ ) F : C × C → Cp×q

Variables z, z∗ ∈ Cm f (z, z∗ ) f : Cm × Cm → C f(z, z∗ ) f : Cm × Cm → Cp F(z, z∗ ) F : Cm × Cm → Cp×q

Variables Z, Z∗ ∈ Cm×n f (Z, Z∗ ) f : Cm×n × Cm×n → C f(Z, Z∗ ) f : Cm×n × Cm×n → Cp F(Z, Z∗ ) F : Cm×n × Cm×n → Cp×q



From both the definition of the partial derivative and the above independence assumption, it is easy to conclude that ∂z ∂x ∂y 1 = ∗ +j ∗ = ∗ ∂z ∂z ∂z 2



∂x ∂x +j ∂x ∂y

 +j

1 2



∂y ∂y +j ∂x ∂y



1 1 (1 + 0) + j (0 + j), 2 2     ∂x ∂y 1 ∂x ∂x ∂y 1 1 ∂y 1 ∂z∗ = −j = −j −j −j = (1 − 0) − j (0 − j). ∂z ∂z ∂z 2 ∂x ∂y 2 ∂x ∂y 2 2 =

That is to say, ∂z =0 ∂z∗

and

∂z∗ = 0. ∂z

(2.3.6)

Equation (2.3.6) reveals a basic result in the theory of complex variables: the complex variable z and the complex conjugate variable z∗ are independent variables. Therefore, when finding the complex partial derivative ∇z f (z, z∗ ) and the complex conjugate partial derivative ∇z∗ f (z, z∗ ), the complex variable z and the complex conjugate variable z∗ can be regarded as two independent variables:  ∂f (z, z∗ )  , ∇z f (z, z ) = ∂z z∗ =const ∗

 ∂f (z, z∗ )  ∇z∗ f (z, z ) = . ∂z∗ z=const ∗

(2.3.7)

This implies that when any nonholomorphic function f (z) is written as f (z, z∗ ), it becomes holomorphic, because, for a fixed z∗ value, the function f (z, z∗ ) is analytic on the whole complex plane z = x +jy, and for a fixed z value, the function f (z, z∗ ) is analytic on the whole complex plane z∗ = x − jy, see, e.g., [2]. Table 2.7 is a comparison between the nonholomorphic and holomorphic representation forms of complex functions. The function f (z) = |z|2 itself is not a holomorphic function with respect to z, but f (z, z∗ ) = |z|2 = zz∗ is holomorphic, because its first-order partial derivatives ∂|z|2 /∂z = z∗ and ∂|z|2 /∂z∗ = z exist and are continuous. Table 2.7 Nonholomorphic and holomorphic functions Functions Coordinates Representation

Nonholomorphic ⎧ ⎨ r def = (x, y) ∈ R × R ⎩z = x + jy

Holomorphic ⎧ ⎨ r def = (z, z∗ ) ∈ C × C ⎩ z = x + j y, z∗ = x − j y

f (r) = f (x, y)

f (c) = f (z, z∗ )
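The Wirtinger partial derivatives (2.3.4) and the result ∂|z|2 /∂z = z∗ , ∂|z|2 /∂z∗ = z can be checked numerically. The following minimal sketch (not from the original text) approximates ∂/∂x and ∂/∂y with central differences and forms the two Wirtinger combinations; the test point and step size are arbitrary assumptions.

    import numpy as np

    # Wirtinger operators: d/dz = (d/dx - j d/dy)/2,  d/dz* = (d/dx + j d/dy)/2.
    def wirtinger(f, z, eps=1e-7):
        dfx = (f(z + eps) - f(z - eps)) / (2 * eps)             # d/dx
        dfy = (f(z + 1j * eps) - f(z - 1j * eps)) / (2 * eps)   # d/dy
        return 0.5 * (dfx - 1j * dfy), 0.5 * (dfx + 1j * dfy)

    z = 1.5 - 0.7j
    f = lambda z: abs(z) ** 2                                   # f(z, z*) = z z*

    d_dz, d_dzc = wirtinger(f, z)
    assert np.isclose(d_dz, np.conj(z))     # d|z|^2/dz  = z*
    assert np.isclose(d_dzc, z)             # d|z|^2/dz* = z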



The following are common formulas and rules for the complex partial derivatives: 1. The conjugate partial derivative of the complex conjugate function ∂f ∗ (z, z∗ ) = ∂z∗



∂f (z, z∗ ) ∂z

∗



∂f (z, z∗ ) ∂z∗

(2.3.8)

.

2. The partial derivative of the conjugate of the complex function ∂f ∗ (z, z∗ ) = ∂z

∂f ∗ (z,z∗ ) : ∂z∗

∂f ∗ (z,z∗ ) : ∂z

∗ .

(2.3.9)

3. Complex differential rule df (z, z∗ ) =

∂f (z, z∗ ) ∗ ∂f (z, z∗ ) dz + dz . ∂z ∂z∗

(2.3.10)

4. Complex chain rule ∂h(g(z, z∗ )) ∂h(g(z, z∗ )) ∂g(z, z∗ ) ∂h(g(z, z∗ )) ∂g ∗ (z, z∗ ) = + , ∂z ∂g(z, z∗ ) ∂z ∂g ∗ (z, z∗ ) ∂z

(2.3.11)

∂h(g(z, z∗ )) ∂h(g(z, z∗ )) ∂g(z, z∗ ) ∂h(g(z, z∗ )) ∂g ∗ (z, z∗ ) = + . ∂z∗ ∂g(z, z∗ ) ∂z∗ ∂g ∗ (z, z∗ ) ∂z∗

(2.3.12)

2.3.2 Complex Matrix Differential Consider the complex matrix function F(Z) and the holomorphic complex matrix function F(Z, Z∗ ). On holomorphic complex matrix functions, the following statements are equivalent [1]: • The matrix function F(Z) is a holomorphic function of the complex matrix variable Z. vecF(Z) • The complex matrix differential d vecF(Z) = ∂∂(vecZ) T d vecZ. • For all Z, • For all Z,

∂ vecF(Z) = O (zero matrix) holds. ∂(vec Z∗ )T ∂ vecF(Z) ∂ vecF(Z) + j ∂(vecImZ) T = O holds. ∂(vecReZ)T

The complex matrix function F(Z, Z∗ ) is obviously a holomorphic function, and its matrix differential is d vecF(Z, Z∗ ) =

∂ vecF(Z, Z∗ ) ∂ vecF(Z, Z∗ ) d vecZ + d vecZ∗ . T ∂(vecZ) ∂(vecZ∗ )T

(2.3.13)



The complex matrix differential d Z = [dZij ]m,n i=1,j =1 has the following properties [1]: 1. 2. 3. 4. 5.

Transpose: dZT = d(ZT ) = (dZ)T . Hermitian transpose: dZH = d(ZH ) = (dZ)H . Conjugate: dZ∗ = d(Z∗ ) = (dZ)∗ . Linearity (additive rule): d(Y + Z) = dY + dZ. Chain rule: If F is a function of Y, while Y is a function of Z, then d vecF =

∂ vecF ∂ vecF ∂ vecY d vecY = d vecZ, ∂(vecY)T ∂(vecY)T ∂(vecZ)T

∂ vecF ∂ vecY where ∂(vecY) T and ∂(vec Z)T are the normal complex partial derivative and the generalized complex partial derivative, respectively. 6. Multiplication rule:

d(UV) = (dU)V + U(dV) dvec(UV) = (VT ⊗ I)dvecU + (I ⊗ U)dvecV. 7. Kronecker product: d(Y ⊗ Z) = dY ⊗ Z + Y ⊗ dZ. 8. Hadamard product: d(Y Z) = dY Z + Y dZ. Let us consider the relationship between the complex matrix differential and the complex partial derivative. First, the complex differential rule for scalar variables, df (z, z∗ ) =

∂f (z, z∗ ) ∗ ∂f (z, z∗ ) dz + dz ∂z ∂z∗

(2.3.14)

is easily extended to a complex differential rule for the multivariate real scalar ∗ )): function f (·) = f ((z1 , z1∗ ), . . . , (zm , zm df (·) = =

∂f (·) ∂f (·) ∗ ∂f (·) ∗ ∂f (·) dz1 + · · · + dzm + dz1 + · · · + dzm ∗ ∗ ∂z1 ∂zm ∂z1 ∂zm ∂f (·) ∂f (·) ∗ dz + dz . ∂ zT ∂ zH

(2.3.15)

∗ ]T . This complex differential rule Here, dz = [dz1 , . . . , dzm ]T , dz∗ = [dz1∗ , . . . , dzm is the basis of the complex matrix differential. In particular, if f (·) = f (z, z∗ ), then

df (z, z∗ ) =

∂f (z, z∗ ) ∂f (z, z∗ ) ∗ dz + dz T ∂z ∂ zH



or df (z, z∗ ) = Dz f (z, z∗ ) dz + Dz∗ f (z, z∗ ) dz∗ ,

(2.3.16)

where    ∂f (z, z∗ ) ∂f (z, z∗ )  ∂f (z, z∗ ) , Dz f (z, z ) = = ,..., ∂ zT z∗ =const ∂z1 ∂zm    ∂f (z, z∗ ) ∂f (z, z∗ )  ∂f (z, z∗ ) ∗ Dz∗ f (z, z ) = = ,..., ∗ ∂ zH z=const ∂z1∗ ∂zm ∗

(2.3.17) (2.3.18)

are, respectively, the cogradient vector and the conjugate cogradient vector of the real scalar function f (z, z∗ ), while Dz =

  ∂ def ∂ ∂ , = , . . . , ∂ zT ∂z1 ∂zm

Dz∗ =

  ∂ def ∂ ∂ = , . . . , ∗ ∂ zH ∂z1∗ ∂zm

(2.3.19)

are termed the cogradient operator and the conjugate cogradient operator of complex vector variable z ∈ Cm , respectively. For z = x + jy = [z1 , . . . , zm ]T ∈ Cm with x = [x1 , . . . , xm ]T ∈ Rm , y = [y1 , . . . , ym ]T ∈ Rm , due to the independence between the real part xi and the imaginary part yi , if applying the complex partial derivative operators ∂ 1 Dz = = i ∂zi 2



∂ ∂ −j ∂xi ∂yi

 ,

Dz∗ i

∂ 1 = ∗ = ∂zi 2



∂ ∂ +j ∂xi ∂yi

 ,

(2.3.20)

to each element of the row vector zT = [z1 , . . . , zm ], then we obtain the following complex cogradient operator: Dz =

∂ 1 = T ∂z 2



∂ ∂ −j T T ∂x ∂y

 (2.3.21)

and the complex conjugate cogradient operator Dz∗

∂ 1 = = ∂ zH 2



∂ ∂ +j T ∂ xT ∂y

 .

(2.3.22)

Similarly, the complex gradient operator and the complex conjugate gradient operator are, respectively, defined as   ∂ def ∂ ∂ T ∇z = = ,..., , ∂z ∂z1 ∂zm

∇z ∗

  ∂ def ∂ ∂ T = = ,..., ∗ . ∂ z∗ ∂z1∗ ∂zm (2.3.23)



Hence, the complex gradient vector and the complex conjugate gradient vector of the real scalar function f (z, z∗ ) are, respectively, defined as  ∂f (z, z∗ )  = (Dz f (z, z∗ ))T , ∂ z z∗ = const vector  ∂f (z, z∗ )  ∗ = (Dz∗ f (z, z∗ ))T . ∇z∗ f (z, z ) = ∂ z∗ z = const vector ∇z f (z, z∗ ) =

(2.3.24) (2.3.25)

On the other hand, by mimicking the complex partial derivative operator to each element of the complex vector z = [z1 , . . . , zm ]T , the complex gradient operator and the complex conjugate gradient operator are obtained as follows: 1 ∂ = ∇z = ∂z 2



 ∂ ∂ −j , ∂x ∂y

∇z ∗

∂ 1 = = ∗ ∂z 2



 ∂ ∂ +j . ∂x ∂y

(2.3.26)

By the above definitions, it is easy to obtain ∂zT ∂xT ∂yT 1 = +j = ∂z ∂z ∂z 2



∂xT ∂xT −j ∂x ∂y



1 +j 2



∂yT ∂yT −j ∂x ∂y

 = Im×m ,

and ∂zT ∂xT ∂yT 1 = ∗ +j ∗ = ∗ ∂z ∂z ∂z 2



∂xT ∂xT +j ∂x ∂y

 +j

1 2



∂yT ∂yT +j ∂x ∂y

 = Om×m .

Summarizing the above results and their conjugate, transpose and complex conjugate transpose, we have the following important results: ∂zT = I, ∂z

∂zH = I, ∂z∗

∂z = I, ∂zT

∂z∗ = I, ∂zH

(2.3.27)

∂zT = O, ∂z∗

∂zH = O, ∂z

∂z = O, ∂zH

∂z∗ = O. ∂zT

(2.3.28)

The above results show that the complex vector variable z and its complex conjugate vector variable z∗ can be viewed as two independent variables. This important fact is not surprising because the angle between z and z∗ is π/2, i.e., they are orthogonal to each other. Hence, we can summarize the rules for using the cogradient operator and gradient operator below. • When using the complex cogradient operator ∂/∂zT or the complex gradient operator ∂/∂z, the complex conjugate vector variable z∗ can be handled as a constant vector. • When using the complex conjugate cogradient operator ∂/∂zH or the complex conjugate gradient operator ∂/∂z∗ , the vector variable z can be handled as a constant vector.



Now, we consider the real scalar function f (Z, Z∗ ) with variables Z, Z∗ ∈ Performing the vectorization of Z and Z∗ , respectively, from Eq. (2.3.16) we get the first-order complex differential rule for the real scalar function f (Z, Z∗ ): Cm×n .

df (Z, Z∗ ) = =

∂f (Z, Z∗ ) ∂f (Z, Z∗ ) dvecZ + dvecZ∗ T ∂(vecZ) ∂(vecZ∗ )T ∂f (Z, Z∗ ) ∂f (Z, Z∗ ) dvecZ + dvecZ∗ , ∂(vecZ)T ∂(vecZ∗ )T

(2.3.29)

where   ∂f (Z, Z∗ ) ∂f (Z, Z∗ ) ∂f (Z, Z∗ ) ∂f (Z, Z∗ ) ∂f (Z, Z∗ ) , = ,..., ,..., ,..., ∂(vec Z)T ∂Z11 ∂Zm1 ∂Z1n ∂Zmn   ∂f (Z, Z∗ ) ∂f (Z, Z∗ ) ∂f (Z, Z∗ ) ∂f (Z, Z∗ ) ∂f (Z, Z∗ ) . = , . . . , , . . . , , . . . , ∗ ∗ ∗ ∗ ∂(vec Z∗ )T ∂Z11 ∂Zm1 ∂Z1n ∂Zmn We define the complex cogradient vector and the complex conjugate cogradient vector as Dvec Z f (Z, Z∗ ) =

∂f (Z, Z∗ ) , ∂(vec Z)T

Dvec Z∗ f (Z, Z∗ ) =

∂f (Z, Z∗ ) , ∂(vec Z∗ )T

(2.3.30)

and the complex gradient vector and the complex conjugate gradient vector of the function f (Z, Z∗ ) as ∇vec Z f (Z, Z∗ ) =

∂f (Z, Z∗ ) , ∂ vec Z

∇vec Z∗ f (Z, Z∗ ) =

∂f (Z, Z∗ ) . ∂ vec Z∗

(2.3.31)

The conjugate gradient vector ∇vec Z∗ f (Z, Z∗ ) has the following properties [1]: 1. The conjugate gradient vector of the function f (Z, Z∗ ) at an extreme point is equal to the zero vector, i.e., ∇vec Z∗ f (Z, Z∗ ) = 0. 2. The conjugate gradient vector ∇vec Z∗ f (Z, Z∗ ) and the negative conjugate gradient vector −∇vec Z∗ f (Z, Z∗ ) point in the direction of the steepest ascent and steepest descent of the function f (Z, Z∗ ), respectively. 3. The step length of the steepest increase slope is ∇vec Z∗ f (Z, Z∗ )2 . 4. The conjugate gradient vector ∇vec Z∗ f (Z, Z∗ ) and the negative conjugate gradient vector −∇vec Z∗ f (Z, Z∗ ) can be used separately as update in gradient ascent algorithms and gradient descent algorithms.



Furthermore, the complex Jacobian matrix and the complex conjugate Jacobian matrix of the real scalar function f (Z, Z∗ ) are, respectively, as follows:  ∗ def ∂f (Z, Z )  JZ = ∂ ZT 

JZ∗

 ∗ def ∂f (Z, Z )  = ∂ ZH 

⎡ ∂f (Z,Z∗ ) Z∗ = const matrix

⎢ =⎢ ⎣

∂Z11

.. .

∂f (Z,Z∗ ) ∂Z1n

⎡ ∂f (Z,Z∗ ) Z = const matrix

⎢ =⎢ ⎣

∗ ∂Z11

.. .

∂f (Z,Z∗ ) ∗ ∂Z1n

··· .. . ··· ··· .. . ···

∂f (Z,Z∗ ) ∂Zm1

.. .

∂f (Z,Z∗ ) ∂Zmn

⎤ ⎥ ⎥, ⎦

(2.3.32)

∂f (Z,Z∗ ) ⎤ ∗ ∂Zm1

.. .

∂f (Z,Z∗ ) ∗ ∂Zmn

⎥ ⎥. ⎦

(2.3.33)

Similarly, the complex gradient matrix and the complex conjugate gradient matrix of the real scalar function f (Z, Z∗ ) are, respectively, given by  ∗  ∗ def ∂f (Z, Z )  ∇Z f (Z, Z ) = ∂Z 

⎡ ∂f (Z,Z∗ ) Z∗ = const matrix

⎢ =⎢ ⎣

∂Z11

.. .

∂f (Z,Z∗ ) ∂Zm1

··· .. . ···

∂f (Z,Z∗ ) ∂Z1n

.. .

∂f (Z,Z∗ ) ∂Zmn

⎤ ⎥ ⎥, ⎦ (2.3.34)

and

∇Z∗ f (Z, Z∗ ) =

def

  

⎡ ∂f (Z,Z∗ )

∂f (Z, Z∗ )  ∂

Z∗

Z = const matrix

⎢ =⎢ ⎣

∗ ∂Z11

.. .

∂f (Z,Z∗ ) ∗ ∂Zm1

··· .. . ···

∂f (Z,Z∗ ) ⎤ ∗ ∂Z1n

.. .

∂f (Z,Z∗ ) ∗ ∂Zmn

⎥ ⎥. ⎦ (2.3.35)

Summarizing the definitions above, there are the following relations among the various complex partial derivatives of the real scalar function f (Z, Z): • The conjugate gradient (or cogradient) vector is equal to the complex conjugate of the gradient (or cogradient) vector; and the conjugate Jacobian (or gradient) matrix is equal to the complex conjugate of the Jacobian (or gradient) matrix. • The gradient (or conjugate gradient) vector is equal to the transpose of the cogradient (or conjugate cogradient) vector, namely ∇vec Z f (Z, Z∗ ) = DTvec Z f (Z, Z∗ ),

∇vec Z∗ f (Z, Z∗ ) = DTvec Z∗ f (Z, Z∗ ). (2.3.36)



• The cogradient (or conjugate cogradient) vector is equal to the transpose of the vectorization of Jacobian (or conjugate Jacobian) matrix:  T Dvec Z f (Z, Z) = vec JZ ,

 T Dvec Z∗ f (Z, Z) = vec JZ∗ .

(2.3.37)

• The gradient (or conjugate gradient) matrix is equal to the transpose of the Jacobian (or conjugate Jacobian) matrix: ∇Z f (Z, Z∗ ) = JTZ

and

∇Z∗ f (Z, Z∗ ) = JTZ∗ .

(2.3.38)

Here are the rules of operation for the complex gradient. 1. If f (Z, Z∗ ) = c (a constant), then its gradient matrix and conjugate gradient matrix are equal to the zero matrix, namely ∂c/∂ Z = O and ∂c/∂ Z∗ = O. 2. Linear rule: If f (Z, Z∗ ) and g(Z, Z∗ ) are scalar functions, and c1 and c2 are complex numbers, then ∂(c1 f (Z, Z∗ ) + c2 g(Z, Z∗ )) ∂f (Z, Z∗ ) ∂g(Z, Z∗ ) = c + c . 1 2 ∂ Z∗ ∂ Z∗ ∂ Z∗ 3. Multiplication rule: ∗ ∗ ∂f (Z, Z∗ )g(Z, Z∗ ) ∗ ∂f (Z, Z ) ∗ ∂g(Z, Z ) = g(Z, Z ) + f (Z, Z ) . ∂Z∗ ∂ Z∗ ∂ Z∗

4. Quotient rule: If g(Z, Z∗ ) = 0, then  ∗ ∗  ∂f/g 1 ∗ ∂f (Z, Z ) ∗ ∂g(Z, Z ) . g(Z, Z = ) − f (Z, Z ) ∂ Z∗ ∂ Z∗ ∂ Z∗ g 2 (Z, Z∗ ) If h(Z, Z∗ ) = g(F(Z, Z∗ ), F∗ (Z, Z∗ )), then the quotient rule becomes ∂h(Z, Z∗ ) ∂g(F(Z, Z∗ ), F∗ (Z, Z∗ )) ∂ (vec F(Z, Z∗ ))T = ∂ vec Z ∂ vec Z ∂ (vec F(Z, Z∗ ))T +

∂g(F(Z, Z∗ ), F∗ (Z, Z∗ )) ∂ (vec F∗ (Z, Z∗ ))T , ∂ vec Z ∂ (vec F∗ (Z, Z∗ ))T

(2.3.39)

and ∂h(Z, Z∗ ) ∂g(F(Z, Z∗ ), F∗ (Z, Z∗ )) ∂ (vecF(Z, Z∗ ))T = ∂ vec Z∗ ∂ vec Z∗ ∂ (vec F(Z, Z∗ ))T +

∂g(F(Z, Z∗ ), F∗ (Z, Z∗ )) ∂ (vec F∗ (Z, Z∗ ))T . ∂ vec Z∗ ∂ (vecF∗ (Z, Z∗ ))T

(2.3.40)



2.3.3 Complex Gradient Matrix Identification If letting A = DZ f (Z, Z∗ ) and B = DZ∗ f (Z, Z∗ ), then ∂f (Z, Z∗ ) = rvecDZ f (Z, Z∗ ) = rvecA = (vec(AT ))T , ∂(vec Z)T

(2.3.41)

∂f (Z, Z∗ ) = rvecDZ∗ f (Z, Z∗ ) = rvecB = (vec(BT ))T . ∂(vec Z)H

(2.3.42)

Hence, Eq. (2.3.29) can be rewritten as df (Z, Z∗ ) = (vec(AT ))T dvec Z + (vec(BT ))T dvecZ∗ .

(2.3.43)

Using tr(CT D) = (vec C)T vecD, Eq. (2.3.43) can be written as df (Z, Z∗ ) = tr(AdZ + BdZ∗ ).

(2.3.44)

Proposition 2.2 Given a scalar function f (Z, Z∗ ) : Cm×n × Cm×n → C, its complex Jacobian and gradient matrices can be, respectively, identified by df (Z, Z∗ ) = tr(AdZ + BdZ∗ )



 J = A, Z *





df (Z, Z ) = tr(AdZ + BdZ )



JZ∗ = B; ∇Z f (Z, Z∗ ) = AT , ∇Z∗ f (Z, Z∗ ) = BT .

(2.3.45)

(2.3.46)

That is to say, the complex gradient matrix and the complex conjugate gradient matrix are identified as the transposes of the matrices A and B, respectively. Table 2.8 lists the complex gradient matrices of several trace functions. Table 2.9 lists the complex gradient matrices of several determinant functions.

Table 2.8 Complex gradient matrices of trace functions f (Z, Z∗ ) tr(AZ) tr(AZH ) tr(ZAZT B) tr(ZAZB) tr(ZAZ∗ B) tr(ZAZH B) tr(AZ−1 ) tr(Zk )

df tr(AdZ) tr(AT dZ∗ ) tr((AZT B + AT ZT BT )dZ) tr((AZB + BZA)dZ) tr(AZ∗ BddZ + BZAddZ∗ ) tr(AZH BdZ + AT ZT BT dZ∗ ) −tr(Z−1 AZ−1 dZ) k tr(Zk−1 dZ)

∂f/∂Z AT O BT ZAT + BZA (AZB + BZA)T BT ZH AT BT Z∗ AT −Z−T AT Z−T k (ZT )k−1

∂f/∂ Z∗ O A O O AT ZT BT BZA O O
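The first two rows of Table 2.8 can be checked numerically with element-wise Wirtinger derivatives, in the same way as the scalar case. The following minimal sketch (not from the original text) verifies ∂ tr(AZ)/∂Z = AT with ∂ tr(AZ)/∂Z∗ = O, and ∂ tr(AZH )/∂Z = O with ∂ tr(AZH )/∂Z∗ = A; the helper function, sizes, and random complex data are illustrative assumptions.

    import numpy as np

    def wirtinger_grads(f, Z, eps=1e-7):
        """Element-wise Wirtinger partials df/dZ and df/dZ* of a scalar function f(Z, Z*)."""
        dZ = np.zeros_like(Z); dZc = np.zeros_like(Z)
        for i in range(Z.shape[0]):
            for j in range(Z.shape[1]):
                E = np.zeros_like(Z); E[i, j] = eps
                dfx = (f(Z + E) - f(Z - E)) / (2 * eps)
                dfy = (f(Z + 1j * E) - f(Z - 1j * E)) / (2 * eps)
                dZ[i, j] = 0.5 * (dfx - 1j * dfy)
                dZc[i, j] = 0.5 * (dfx + 1j * dfy)
        return dZ, dZc

    rng = np.random.default_rng(10)
    n = 3
    A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    Z = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

    dZ1, dZc1 = wirtinger_grads(lambda Z: np.trace(A @ Z), Z)
    assert np.allclose(dZ1, A.T) and np.allclose(dZc1, 0, atol=1e-6)

    dZ2, dZc2 = wirtinger_grads(lambda Z: np.trace(A @ Z.conj().T), Z)
    assert np.allclose(dZ2, 0, atol=1e-6) and np.allclose(dZc2, A, atol=1e-6)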

2.3 Complex Gradient Matrices

83

Table 2.9 Complex gradient matrices of determinant functions f (Z, Z∗ )

df

|Z|

|Z|tr(Z−1 dZ)

|Z|Z−T

|tr(ZT (ZZT )−1 dZ)

|ZZ |

2|ZZ

|ZT Z|

2|ZT Z|tr((ZT Z)−1 ZT dZ)

T



|ZZ |

∂f/∂ Z∗

∂f/∂Z T





∗ −1

|ZZ |tr(Z (ZZ )

O

2|ZT Z|Z(ZT Z)−1

O

2|ZZ

|(ZH ZT )−1 ZH

|ZZ∗ |ZT (ZH ZT )−1

|Z∗ Z|ZH (ZT ZH )−1

|Z∗ Z|(ZT ZH )−1 ZT

|ZZH |(Z∗ ZT )−1 Z∗

|ZZH |(ZZH )−1 Z

|ZH Z|Z∗ (ZT Z∗ )−1

|ZH Z|Z(ZH Z)−1

k|Z|k Z−T

O

|ZZ

dZ

O

|(ZZT )−1 Z



T

+ (ZZ∗ )−1 ZdZ∗ ) |Z∗ Z|

|Z∗ Z|tr((Z∗ Z)−1 Z∗ dZ + Z(Z∗ Z)−1 dZ∗ )

|ZZH |

|ZZH |tr(ZH (ZZH )−1 dZ ∗ T −1

+ Z (Z Z ) T

|ZH Z|



dZ )

|ZH Z|tr((ZH Z)−1 ZH dZ + (ZT Z∗ )−1 ZT dZ∗ )

|Zk |

k|Z|k tr(Z−1 dZ)

If f(z, z∗ ) = [f1 (z, z∗ ), . . . , fn (z, z∗ )]T is an n × 1 complex vector function with m × 1 complex vector variable, then ⎡ ⎤ ⎡ ⎤ ⎤ ⎡ Dz f1 (z, z∗ ) Dz∗ f1 (z, z∗ ) df1 (z, z∗ ) ⎢ ⎥ ⎢ ⎥ ⎥ ∗ ⎢ .. .. .. ⎦=⎣ ⎦ dz + ⎣ ⎦ dz , ⎣ . . . dfn (z, z∗ )

Dz fn (z, z∗ )

Dz∗ fn (z, z∗ )

which can simply be written as df(z, z∗ ) = Jz dz + Jz∗ dz∗ ,

(2.3.47)

where df(z, z∗ ) = [df1 (z, z∗ ), . . . , dfn (z, z∗ )]T , while ⎡ ⎤ ∂f1 (z,z∗ ) ∂f1 (z,z∗ ) · · · ∂z1 ∂zm ⎥ ∂ f(z, z∗ ) ⎢ ⎢ .. .. ⎥ , . .. =⎢ . Jz = . ⎥ ⎣ ⎦ ∂ zT ∂fn (z,z∗ ) ∂fn (z,z∗ ) · · · ∂z ∂z m

1

and

⎡ Jz∗ =

∂f1 (z,z∗ ) ∗ ⎢ ∂z1

∂ f(z, z∗ ) ⎢ =⎢ ⎣ ∂ zH

.. .

∂fn (z,z∗ ) ∂z1∗

(2.3.48)

··· .. . ···



∂f1 (z,z∗ ) ∗ ∂zm ⎥

.. .

∂fn (z,z∗ ) ∗ ∂zm

⎥ ⎥ ⎦

(2.3.49)

84

2 Matrix Differential

are, respectively, the complex Jacobian matrix and the complex conjugate Jacobian matrix of the vector function f(z, z∗ ). For a p × q matrix function F(Z, Z∗ ) with m × n complex matrix variable Z, if F(Z, Z∗ ) = [f1 (Z, Z∗ ), . . . , fq (Z, Z∗ )], then dF(Z, Z∗ ) = [df1 (Z, Z∗ ), . . . , dfq (Z, Z∗ )], and (2.3.47) holds for the vector functions fi (Z, Z∗ ), i = 1, . . . , q. This implies that ⎡ ⎤ ⎡ ⎤ ⎤ Dvec Z f1 (Z, Z∗ ) Dvec Z∗ f1 (Z, Z∗ ) df1 (Z, Z∗ ) ⎢ ⎥ ⎢ ⎥ ⎥ ⎢ .. .. .. ∗ ⎦=⎣ ⎦ dvecZ + ⎣ ⎦ dvecZ , ⎣ . . . ⎡

dfq (Z, Z∗ )

Dvec Z fq (Z, Z∗ )

Dvec Z∗ fq (Z, Z∗ ) (2.3.50)

∂ f (Z,Z∗ )

p×mn and D ∗ i where Dvec Z fi (Z, Z∗ ) = ∂(vecZ) T ∈ C vec Z∗ fi (Z, Z ) = p×mn C . Equation (2.3.50) can be simply rewritten as

∂ fi (Z,Z∗ ) ∂(vecZ∗ )T

dvecF(Z, Z∗ ) = AdvecZ + BdvecZ∗ ∈ Cpq ,

(2.3.51)

where dvecZ = [dZ11 , . . . , dZm1 , . . . , dZ1n , . . . , dZmn ]T , ∗ ∗ ∗ ∗ T dvecZ∗ = [dZ11 , . . . , dZm1 , . . . , dZ1n , . . . , dZmn ] ,

dvecF(Z, Z∗ )) = [df11 (Z, Z∗ ), . . . , dfp1 (Z, Z∗ ), . . . , df1q (Z, Z∗ ), . . . , dfpq (Z, Z∗ )]T , ⎡

∂f11 (Z,Z∗ ) ⎢ ∂Z11

.. ⎢ ⎢ . ⎢ ∂fp1 (Z,Z∗ ) ⎢ ⎢ ∂Z11 ⎢ .. A=⎢ . ⎢ ⎢ ∂f1q (Z,Z∗ ) ⎢ ⎢ ∂Z11 ⎢ .. ⎢ . ⎣ ∗

∂fpq (Z,Z ) ∂Z11

=

··· .. . ··· .. . ··· .. . ···

∂f11 (Z,Z∗ ) ∂Zm1

.. .

(Z,Z∗ )

∂fp1 ∂Zm1

.. .

∂f1q (Z,Z∗ ) ∂Zm1

.. .

(Z,Z∗ )

∂fpq ∂Zm1

∂ vec F(Z, Z∗ ) = Jvec Z , ∂(vec Z)T

··· .. . ··· .. . ··· .. . ···

∂f11 (Z,Z∗ ) ∂Z1n

.. .

(Z,Z∗ )

∂fp1 ∂Z1n

.. .

∂f1q (Z,Z∗ ) ∂Z1n

.. .

∂fpq (Z,Z∗ ) ∂Z1n

··· .. . ··· .. . ··· .. . ···





∂f11 (Z,Z∗ ) ∂Zmn ⎥

⎥ ⎥ ⎥ ∂fp1 ⎥ ∂Zmn ⎥ ⎥ .. ⎥ . ⎥ ∂f1q (Z,Z∗ ) ⎥ ⎥ ∂Zmn ⎥ ⎥ .. ⎥ . ⎦ ∗ .. .

(Z,Z∗ )

∂fpq (Z,Z ) ∂Zmn

2.3 Complex Gradient Matrices

85

and ⎡

∂f11 (Z,Z∗ ) ∗ ∂Z11

⎢ .. ⎢ ⎢ . ⎢ ⎢ ∂fp1 (Z,Z∗ ) ⎢ ∂Z ∗ 11 ⎢ ⎢ .. B=⎢ . ⎢ ⎢ ∂f1q (Z,Z∗ ) ⎢ ∂Z ∗ 11 ⎢ ⎢ .. ⎢ . ⎣

∂fpq (Z,Z∗ ) ∗ ∂Z11

=

··· .. . ··· .. . ··· .. . ···

∂f11 (Z,Z∗ ) ∗ ∂Zm1

.. .

∂fp1 (Z,Z∗ ) ∗ ∂Zm1

.. .

∂f1q (Z,Z∗ ) ∗ ∂Zm1

.. .

∂fpq (Z,Z∗ ) ∗ ∂Zm1

··· .. . ··· .. . ··· .. . ···

∂f11 (Z,Z∗ ) ∗ ∂Z1n

.. .

∂fp1 (Z,Z∗ ) ∗ ∂Z1n

.. .

∂f1q (Z,Z∗ ) ∗ ∂Z1n

.. .

∂fpq (Z,Z∗ ) ∗ ∂Z1n

··· .. . ··· .. . ··· .. . ···

∂f11 (Z,Z∗ ) ∗ ∂Zmn



⎥ ⎥ ⎥ ⎥ ∂fp1 (Z,Z∗ ) ⎥ ∗ ∂Zmn ⎥ ⎥ ⎥ .. ⎥ . ⎥ ∂f1q (Z,Z∗ ) ⎥ ∗ ∂Zmn ⎥ ⎥ ⎥ .. ⎥ . ⎦ .. .

∂fpq (Z,Z∗ ) ∗ ∂Zmn

∂ vec F(Z, Z∗ ) = Jvec Z∗ . ∂(vec Z∗ )T

The complex gradient matrix and the complex conjugate gradient matrix of the matrix function F(Z, Z∗ ) are, respectively, defined as ∇vec Z F(Z, Z∗ ) =

∂(vec F(Z, Z∗ ))T = (Jvec Z )T , ∂ vec Z

(2.3.52)

∇vec Z∗ F(Z, Z∗ ) =

∂(vec F(Z, Z∗ ))T = (Jvec Z∗ )T . ∂ vec Z∗

(2.3.53)

Proposition 2.3 For a complex matrix function F(Z, Z∗ ) ∈ Cp×q with Z, Z∗ ∈ Cm×n , we have its complex Jacobian matrix and the complex gradient matrix: * ∗



dvecF(Z, Z ) = Advec Z + BdvecZ ⇔ * ∗



dvecF(Z, Z ) = AdvecZ + BdvecZ ⇔

JvecZ = A, JvecZ∗ = B, ∇vecZ F(Z, Z∗ ) = AT , ∇vecZ∗ F(Z, Z∗ ) = BT .

(2.3.54)

(2.3.55)

If d(F(Z, Z∗ )) = A(dZ)B + C(dZ∗ )D, then the vectorization result is given by dvec F(Z, Z∗ ) = (BT ⊗ A)dvec Z + (DT ⊗ C)dvecZ∗ . By Proposition 2.3 we have the following identification formula: ∗



dF(Z, Z ) = A(dZ)B + C(dZ )D ⇔

J

vec Z

= BT ⊗ A,

Jvec Z∗ = DT ⊗ C.

(2.3.56)

86

2 Matrix Differential

Table 2.10 Complex matrix differential and complex Jacobian matrix Function Matrix differential

Jacobian matrix

f (z, z∗ ) df (z, z∗ ) = adz + bdz∗

∂f ∂z

f (z, z∗ ) df (z, z∗ ) = aT dz + bT dz∗

∂f ∂zT

= aT ,

∂f ∂zH

= bT

f (Z, Z∗ ) df (Z, Z∗ ) = tr(AdZ + BdZ∗ )

∂f ∂ZT

= A,

∂f ∂ZH

=B

F(Z, Z∗ ) d vecF = Ad vecZ + Bd vecZ∗

∂vecF ∂(vecZ)T

= A,

dF = A(dZ)B + C(dZ∗ )D

∂vecF ∂(vecZ)T

= BT ⊗ A,

dF = A(dZ)T B + C(dZ∗ )T D

∂vecF ∂(vecZ)T

= (BT ⊗ A)Kmn ,

= a,

∂f ∂z∗

=b

∂vecF ∂(vecZ∗ )T

=B

∂vecF ∂(vecZ∗ )T

= DT ⊗ C

∂vecF ∂(vecZ∗ )T

= (DT ⊗ C)Kmn

Similarly, if dF(Z, Z∗ ) = A(dZ)T B + C(dZ∗ )T D, then we have the result dvecF(Z, Z∗ ) = (BT ⊗ A)dvecZT + (DT ⊗ C)dvecZH = (BT ⊗ A)Kmn dvecZ + (DT ⊗ C)Kmn dvecZ∗ , where we have used the vectorization property vecXTm×n = Kmn vecX. By Proposition 2.3, the following identification formula is obtained: * ∗

∗ T

dF(Z, Z ) = A(dZ) B + C(dZ ) D ⇔ T

Jvec Z = (BT ⊗ A)Kmn , Jvec Z∗ = (DT ⊗ C)Kmn . (2.3.57)

The above equation shows that, as in the vector case, the key to identifying the gradient matrix and conjugate gradient matrix of a matrix function F(Z, Z∗ ) is to write its matrix differential into the canonical form dF(Z, Z∗ ) = A(dZ)T B + C(dZ∗ )T D. Table 2.10 lists the corresponding relationships between the first-order complex matrix differential and the complex Jacobian matrix, where z ∈ Cm , Z ∈ Cm×n , F ∈ Cp×q .

Brief Summary of This Chapter This chapter presents the matrix differential for real and complex matrix functions. The matrix differential is a powerful tool for finding the gradient vectors/matrices that are key for update in optimization, as will see in the next chapter.

References

87

References 1. Brookes, M.: Matrix Reference Manual (2011). https://www.ee.ic.ac.uk/hp/staff/dmb/matrix/ intro.html 2. Flanigan, F.: Complex Variables: Harmonic and Analytic Functions, 2nd edn. Dover, New York (1983) 3. Frankel, T.: The Geometry of Physics: An Introduction (with Corrections and Additions). Cambridge University Press, Cambridge (2001) 4. Kreyszig, E.: Advanced Engineering Mathematics, 7th edn. Wiley, New York (1993) 5. Lütkepohl, H.: Handbook of Matrices. Wiley, New York (1996) 6. Magnus, J.R., Neudecker, H.: Matrix Differential Calculus with Applications in Statistics and Econometrics, rev. edn. Wiley, Chichester (1999) 7. Petersen, K.B., Petersen, M.S.: The Matrix Cookbook (2008). https://matrixcookbook.com 8. Wirtinger, W.: Zur formalen theorie der funktionen von mehr komplexen veranderlichen. ¨ Math. Ann. 97, 357–375 (1927)

Chapter 3

Gradient and Optimization

Many machine learning methods involve solving complex optimization problems, e.g., neural networks, support vector machines, and evolutionary computation. Optimization theory mainly considers (1) the existence conditions for an extremum value (gradient analysis); (2) the design of optimization algorithms and convergence analysis. This chapter focuses on convex optimization theory and methods by focusing on gradient/subgradient methods in smooth and nonsmooth convex optimizations and constrained convex optimization.

3.1 Real Gradient Consider an unconstrained minimization problem for f (x) : Rn → R described by min f (x), x ∈S

(3.1.1)

where S ∈ Rn is a subset of the n-dimensional vector space Rn ; the vector variable x ∈ S is called the optimization vector and is chosen so as to fulfill Eq. (3.1.1). The function f : Rn → R is known as the objective function; it represents the cost or price paid by selecting the optimization vector x, and thus is also called the cost function. Conversely, the negative cost function −f (x) can be understood as the value function or utility function of x. Hence, solving the optimization problem (3.1.1) corresponds to minimizing the cost function or maximizing the value function. That is to say, the minimization problem minx ∈S f (x) and the maximization problem maxx ∈S {−f (x)} of the negative objective function are equivalent. The above optimization problem is called an unconstrained optimization problem due to having no constraint conditions.

© Springer Nature Singapore Pte Ltd. 2020 X.-D. Zhang, A Matrix Algebra Approach to Artificial Intelligence, https://doi.org/10.1007/978-981-15-2770-8_3

89

90

3 Gradient and Optimization

Most nonlinear programming methods for solving unconstrained optimization problems are based on the idea of relaxation and approximation [25]. • Relaxation: The sequence {ak }∞ k=0 is known as a relaxation sequence, if ak+1 ≤ ak , ∀ k ≥ 0. Hence, in the process of solving iteratively the optimization problem (3.1.1) it is necessary to generate a relaxation sequence of the cost function f (xk+1 ) ≤ f (xk ), k = 0, 1, . . . • Approximation: Approximating an objective function means using a simpler objective function instead of the original objective function. Therefore, by relaxation and approximation, we can achieve the following. 1. If the objective function f (x) is lower bounded in the definition domain S ∈ Rn , then the sequence {f (xk )}∞ k=0 must converge. 2. In any case, we can improve upon the initial value of the objective function f (x). 3. The minimization of a nonlinear objective function f (x) can be implemented by numerical methods to a sufficiently high approximation accuracy.

3.1.1 Stationary Points and Extreme Points Definition 3.1 (Stationary Point) In optimization, a stationary point or critical point of a differentiable function with one variable is a point in the domain of function where its derivative is zero. In other words, at a stationary point the function stops increasing or decreasing (hence the name). A stationary point could be a saddle point, however, is usually expected to be a global minimum or maximum point of the objective function f (x). Definition 3.2 (Global Minimum Point) A point x∗ in the vector subspace S ∈ Rn is known as a global minimum point of the function f (x) if f (x∗ ) ≤ f (x),

∀ x ∈ S, x = x∗ .

(3.1.2)

A global minimum point is also called an absolute minimum point. The value of the function at this point, f (x∗ ), is called the global minimum or absolute minimum of the function f (x) in the vector subspace S. Let D denote the definition domain of a function f (x), where D may be the whole vector space Rn or its some subset. If f (x∗ ) < f (x),

∀ x ∈ D,

(3.1.3)

then x∗ is said to be a strict global minimum point or a strict absolute minimum point of the function f (x).

3.1 Real Gradient

91

The ideal goal of minimization is to find the global minimum of a given objective function. However, this desired object is often difficult to achieve due to the following two main reasons. • It is usually difficult to know the global or complete information about a function f (x) in the definition domain D. • It is usually impractical to design an algorithm for identifying a global extreme point, because it is almost impossible to compare the value f (x∗ ) with all values of the function f (x) in the definition domain D. In contrast, it is much easier to obtain local information about an objective function f (x) in the vicinity of some point c. At the same time, it is much simpler to design an algorithm for comparing the function value at some point c with other function values at the points near c. Hence, most minimization algorithms can find only a local minimum point c; the value of the objective function at c is the minimum of its values in the neighborhood of c. Definition 3.3 (Open Neighborhood, Closed Neighborhood) Given a point c ∈ D and a positive number r, the set of all points x satisfying x − c2 < r is said to be an open neighborhood with radius r of the point c, denoted by Bo (c; r) = {x| x ∈ D, x − c2 < r}.

(3.1.4)

Bc (c; r) = {x| x ∈ D, x − c2 ≤ r},

(3.1.5)

If

then Bc (c; r) is known as a closed neighborhood of c. D.

Consider a scalar r > 0, and let x = c + x be a point in the definition domain

• If f (c) ≤ f (c + x),

∀ 0 < x2 ≤ r,

(3.1.6)

then the point c and the function value f (c) are called a local minimum point and a local minimum (value) of the function f (x), respectively. • If f (c) < f (c + x),

∀ 0 < x2 ≤ r,

(3.1.7)

then the point c and the function value f (c) are known as a strict local minimum point and a strict local minimum of the function f (x), respectively. • If f (c) ≥ f (c + x),

∀ 0 < x2 ≤ r,

(3.1.8)

92

3 Gradient and Optimization

then the point c and the function value f (c) are called a local maximum point and a local maximum (value) of the function f (x), respectively. • If f (c) > f (c + x),

∀ 0 < x2 ≤ r,

(3.1.9)

then the point c and the function value f (c) are known as a strict local maximum point and a strict local maximum of the function f (x), respectively. • If f (c) ≤ f (x),

∀ x ∈ D,

(3.1.10)

then the point c and the function value f (c) are said to be a global minimum point and a global minimum of the function f (x) in the definition domain D, respectively. • If f (c) < f (x),

∀ x ∈ D, x = c,

(3.1.11)

then the point c and the function value f (c) are referred to as a strict global minimum point and a strict global minimum of the function f (x) in the definition domain D, respectively. • If f (c) ≥ f (x),

∀ x ∈ D,

(3.1.12)

then the point c and the function value f (c) are said to be a global maximum point and a global maximum of the function f (x) in the definition domain D, respectively. • If f (c) > f (x),

∀ x ∈ D, x = c,

(3.1.13)

then the point c and the function value f (c) are referred to as a strict global maximum point and a strict global maximum of the function f (x) in the definition domain D, respectively. The minimum points and the maximum points are collectively called the extreme points of the function f (x), and the minimal value and the maximal value are known as the extrema or extreme values of f (x). In particular, if some point x0 is a unique local extreme point of a function f (x) in the neighborhood B(c; r), then it is said to be an isolated local extreme point.

3.1 Real Gradient

93

3.1.2 Real Gradient of f (X) In practical applications, it is impossible to compare directly the value of an objective function f (x) at some point with all possible values in its neighborhood. Fortunately, the Taylor series expansion provides a simple method for coping with this difficulty. The second-order Taylor series approximation of f (x) at the point c is given by 1 f (c + x) = f (c) + (∇f (c))T x + (x)T H[f (c)]x, 2 where

 ∂f (x)  ∂f (c) = , ∂c ∂x x=c  ∂ 2 f (c) ∂ 2 f (x)  H[f (c)] = = , ∂c∂cT ∂x∂xT x=c ∇f (c) =

(3.1.14)

(3.1.15) (3.1.16)

are, respectively, the gradient vector and the Hessian matrix of f (x) at the point c. Definition 3.4 (Neighborhood) A neighborhood B(X∗ ; r) of a real function f (X) : Rm×n → R with center point vecX∗ and the radius r is defined as B(X∗ ; r) = {X|X ∈ Rm×n , vecX − vecX∗ 2 < r}.

(3.1.17)

The second-order Taylor series approximation formula of the function f (X) at the point X∗ is given by  f (X∗ + X) = f (X∗ ) +

∂f (X∗ ) ∂vecX∗

T vec(X)

∂ 2 f (X∗ ) 1 vec(X) + (vec(X))T 2 ∂vecX∗ ∂(vecX∗ )T  T = f (X∗ ) + ∇vecX∗ f (X∗ ) vec(X) 1 + (vec(X))T H[f (X∗ )]vec(X), 2

(3.1.18)

where  ∂f (X)  ∇vecX∗ f (X∗ ) = ∈ Rmn , ∂vecX X=X∗   ∂ 2 f (X)  ∈ Rmn×mn , H[f (X∗ )] = T ∂vecX∂(vec X) X=X∗ are the gradient vector and the Hessian matrix of f (X) at the point X∗ , respectively.

94

3 Gradient and Optimization

Theorem 3.1 (First-Order Necessary Condition [28]) If x∗ is a local extreme point of a function f (x) and f (x) is continuously differentiable in the neighborhood B(x∗ ; r) of the point x∗ , then ∇x∗ f (x∗ ) =

 ∂f (x)  = 0. ∂x x=x∗

(3.1.19)

Theorem 3.2 (Second-Order Necessary Conditions [21, 28]) If x∗ is a local minimum point of f (x) and the second-order gradient ∇x2 f (x) is continuous in an open neighborhood Bo (x∗ ; r) of x∗ , then ∇x∗ f (x∗ ) =

 ∂f (x)  = 0 and ∂x x=x∗

∇x2∗ f (x∗ ) =

 ∂ 2 f (x)   0, ∂x∂xT x=x∗

(3.1.20)

i.e., the gradient vector of f (x) at the point x∗ is a zero vector and the Hessian matrix ∇x2 f (x) at the point x∗ is a positive semi-definite matrix. Theorem 3.3 (Second-Order Sufficient Condition [21, 28]) Let ∇x2 f (x) be continuous in an open neighborhood of x∗ . If the conditions ∇x∗ f (x∗ ) =

 ∂f (x)  = 0 and ∂x x=x∗

∇x2∗ f (x∗ ) =

 ∂ 2 f (x) 

0 ∂x∂xT x=x∗

(3.1.21)

are satisfied, then x∗ is a strict local minimum point of the objection function f (x). Here ∇x2 f (x∗ ) 0 means that the Hessian matrix ∇x2∗ f (x∗ ) at the point x∗ is a positive definite matrix. Comparing Eq. (3.1.18) with Eq. (3.1.14), we have the following conditions similar to Theorems 3.1–3.3. 1. First-order necessary condition for a minimizer: If X∗ is a local extreme point of a function f (X), and f (X) is continuously differentiable in the neighborhood B(X∗ ; r) of the point X∗ , then  ∂f (X)  = 0. ∇vecX∗ f (X∗ ) = ∂vecX X=X∗

(3.1.22)

2. Second-order necessary conditions for a local minimum point: If X∗ is a local minimum point of f (X), and the Hessian matrix H[f (X)] is continuous in an open neighborhood Bo (X∗ ; r), then ∇vecX∗ f (X∗ ) = 0 and

  ∂ 2 f (X)   0. ∂vecX∂(vecX)T X=X∗

(3.1.23)

3.2 Gradient of Complex Variable Function

95

3. Second-order sufficient conditions for a strict local minimum point: It is assumed that H[f (x)] is continuous in an open neighborhood of X∗ . If the conditions   ∂ 2 f (X) 

0 ∂vecX∂(vecX)T X=X∗

∇vecX∗ f (X∗ ) = 0 and

(3.1.24)

are satisfied, then X∗ is a strict local minimum point of f (X). 4. Second-order necessary conditions for a local maximum point: If X∗ is a local maximum point of f (X), and the Hessian matrix H[f (X)] is continuous in an open neighborhood Bo (X∗ ; r), then   ∂ 2 f (X)   0. T ∂vecX∂(vecX) X=X∗

∇vecX∗ f (X∗ ) = 0 and

(3.1.25)

5. Second-order sufficient conditions for a strict local maximum point: Assume that H[f (X)] is continuous in an open neighborhood of X∗ . If the conditions   ∂ 2 f (X)  ≺0 T ∂vecX∂(vecX) X=X∗

∇vecX∗ f (X∗ ) = 0 and

(3.1.26)

are satisfied, then X∗ is a strict local maximum point of f (X). The point X∗ is only a saddle point of f (X), if ∇vecX∗ f (X∗ ) = 0 and

  ∂ 2 f (X)  is indefinite. T ∂vecX∂(vecX) X=X∗

3.2 Gradient of Complex Variable Function Consider gradient analysis of a real-valued function with complex vector variable.

3.2.1 Extreme Point of Complex Variable Function Consider the unconstrained optimization of a real-valued function f (Z, Z∗ ) : Cm×n × Cm×n → R. From the first-order differential  ∂f (Z, Z∗ ) T vec(dZ∗ ) ∂vec Z∗    ∂f (Z, Z∗ ) ∂f (Z, Z∗ ) vec(dZ) (3.2.1) , = ∂(vecZ)T ∂(vec Z∗ )T ) vec(dZ)∗

df (Z, Z∗ ) =



∂f (Z, Z∗ ) ∂vec Z

T



vec(dZ) +

96

3 Gradient and Optimization

and the second-order differential ⎡ ∂ 2 f (Z,Z∗ ) H ∂(vec Z∗ )∂(vec Z)T  ⎢ vec(dZ) ⎢ d2 f (Z, Z∗ ) = vec(dZ∗ ) ⎣ ∂ 2 f (Z,Z∗ ) ∂(vec Z)∂(vec Z)T



∂ 2 f (Z,Z∗ ) ∂(vec Z∗ )∂(vec Z∗ )T ∂ 2 f (Z,Z∗ ) ∂(vec Z)∂(vec Z∗ )T )

  ⎥ vec(dZ) ⎥ ⎦ vec(dZ∗ ) (3.2.2)

we get the second-order series approximation of f (Z, Z∗ ) at the point C given by ˜ f (Z, Z∗ ) = f (C, C∗ ) + (∇f (C, C∗ ))T vec(C) 1 ˜ H H[f (C, C∗ )]vec(C), ˜ + (vec(C)) 2

(3.2.3)

where ˜ = C



   C Z−C = ∈ C2m×n , C∗ Z∗ − C∗ ⎡ ∂f (Z,Z∗ ) ⎤

⎢ ∇f (C, C∗ ) = ⎣ ⎡

∂(vec Z) ∂f (Z,Z∗ ) ∂(vec Z∗ )

⎥ ⎦

∈ C2mn×1 , Z=C

∂ 2 f (Z,Z∗ ) ∂ 2 f (Z,Z∗ ) ⎢ ∂(vec Z∗ )∂(vec Z)T ∂(vec Z∗ )∂(vec Z∗ )T

H[f (C, C∗ )] = ⎢ ⎣

(3.2.4)

∂ 2 f (Z,Z∗ ) ∂(vec Z)∂(vec Z)T

∂ 2 f (Z,Z∗ ) ∂(vec Z)∂(vec Z∗ )T

  HZ∗ Z HZ∗ Z∗ = ∈ C2mn×2mn . HZZ HZZ∗ Z=C

(3.2.5) ⎤ ⎥ ⎥ ⎦ Z=C

(3.2.6)

Notice that ∗

∇f (C, C ) = 02mn×1  ∂f (Z, Z∗ )  = 0mn×1 ∂vec Z∗ Z∗ =C∗

 ∂f (Z, Z∗ )  ⇔ = 0mn×1 , ∂vec Z Z=C  ∂f (Z, Z∗ )  ⇔ = Om×n . ∂Z∗ Z∗ =C∗

By Eq. (3.2.3) we have the following conditions on extreme points, similar to Theorems 3.1–3.3. 1. First-order necessary condition for a local extreme point: If C is a local extreme point of a function f (Z, Z∗ ), and f (Z, Z∗ ) is continuously differentiable in the neighborhood B(C; r) of the point C, then ∗

∇f (C, C ) = 0 or

 ∂f (Z, Z∗ )  = O. ∂Z∗ Z=C

(3.2.7)

3.2 Gradient of Complex Variable Function

97

2. Second-order necessary conditions for a local minimum point: If C is a local minimum point of f (Z, Z∗ ), and the Hessian matrix H[f (Z, Z∗ )] is continuous in an open neighborhood Bo (C; r) of the point C, then  ∂f (Z, Z∗ )  = O and ∂Z∗ Z=C

H[f (C, C∗ )]  0.

(3.2.8)

3. Second-order sufficient conditions for a strict local minimum point: Suppose that H[f (Z, Z∗ )] is continuous in an open neighborhood of C. If the conditions  ∂f (Z, Z∗ )  = O and ∂Z∗ Z=C

H[f (C, C∗ )] 0

(3.2.9)

are satisfied, then C is a strict local minimum point of f (Z, Z∗ ). 4. Second-order necessary conditions for a local maximum point: If C is a local maximum point of f (Z, Z∗ ), and the Hessian matrix H[f (Z, Z∗ )] is continuous in an open neighborhood Bo (C; r) of the point C, then  ∂f (Z, Z∗ )  = O and ∂Z∗ Z=C

H[f (C, C∗ )]  0.

(3.2.10)

5. Second-order sufficient conditions for a strict local maximum point: Suppose that H[f (Z, Z∗ )] is continuous in an open neighborhood of C, if the conditions  ∂f (Z, Z∗ )  = O and ∂Z∗ Z=C

H[f (C, C∗ )] ≺ 0

(3.2.11)

are satisfied, then C is a strict local maximum point of f (Z, Z∗ ).  ∂f (Z, Z∗ )  If = O, but the Hessian matrix H[f (C, C∗ )] is indefinite, then C  ∗ ∂Z

Z=C

is only a saddle point of f (Z, Z∗ ). Table 3.1 summarizes the extreme-point conditions for three complex variable functions. The Hessian matrices in Table 3.1 are defined as follows: ⎡ 2 ⎤ ∗ 2 ∗ ∂ f (z, z ) ∂ f (z, z )

⎢ ∂z∗ ∂z ∂z∗ ∂z∗ ⎥ ⎢ ⎥ H[f (c, c∗ )] = ⎢ ⎥ ⎣ ∂ 2 f (z, z∗ ) ∂ 2 f (z, z∗ ) ⎦ ∂z∂z

∂z∂z∗

⎡ ∂ 2 f (z, z∗ ) ∂ 2 f (z, z∗ ) ⎤ ∂z∗ ∂zH ⎥ ⎢ ∂z∗ ∂zT ⎥ H[f (c, c∗ )] = ⎢ ⎦ ⎣ 2 ∗ 2 ∗

∈ C2×2 , z=c

∂ f (z, z ) ∂ f (z, z ) ∂z∂zT ∂z∂zH z=c

∈ C2n×2n ,

98

3 Gradient and Optimization

Table 3.1 Extreme-point conditions for the complex variable functions Functions Stationary point

f (z, z∗ ) : C×C → R f (z, z∗ ) : Cn ×Cn → R f (Z, Z∗ ) : Cm×n ×Cm×n → R    ∂f (z, z∗ )  ∂f (z, z∗ )  ∂f (Z, Z∗ )  = 0 = 0 =O ∂z∗ z=c ∂z∗ z=c ∂Z∗ Z=C H[f (c, c∗ )]  0

H[f (c, c∗ )]  0

H[f (C, C∗ )]  0

Strict local H[f (c, c∗ )] 0 minimum Local maximum H[f (c, c∗ )]  0

H[f (c, c∗ )] 0

H[f (C, C∗ )] 0

H[f (c, c∗ )]  0

H[f (C, C∗ )]  0

H[f (c, c∗ )]

H[f (C, C∗ )] ≺ 0

Local minimum

Strict local maximum Saddle point

H[f (c, c∗ )]

≺0

H[f (c, c∗ )] indef.

≺0

H[f (c, c∗ )] indef.

H[f (C, C∗ )] indef.





∂ 2 f (Z, Z∗ ) ∂ 2 f (Z, Z∗ ) ∗ T ⎢ ∂(vecZ )∂(vecZ) ∂(vecZ∗ )∂(vecZ∗ )T ⎥

⎢ H[f (C, C∗ )] = ⎢ ⎣

∂ 2 f (Z, Z∗ )

∂ 2 f (Z, Z∗ )

∂(vecZ)∂(vecZ)T

∂(vecZ)∂(vecZ∗ )T

⎥ ⎥ ⎦

∈ C2mn×2mn . Z=C

3.2.2 Complex Gradient Given a real-valued objective function f (w, w∗ ) or f (W, W∗ ), the gradient analysis of its unconstrained minimization problem can be summarized as follows. 1. The conjugate gradient matrix determines a closed solution of the minimization problem. 2. The sufficient conditions of local minimum points are determined by the conjugate gradient vector and the Hessian matrix of the objective function. 3. The negative direction of the conjugate gradient vector determines the steepestdescent iterative algorithm for solving the minimization problem. 4. The Hessian matrix gives the Newton algorithm for solving the minimization problem. In the following, we discuss the above gradient analysis. Closed Solution of the Unconstrained Minimization Problem By letting the conjugate gradient vector (or matrix) be equal to a zero vector (or matrix), we can find a closed solution of the given unconstrained minimization problem. Example 3.1 For an over-determined matrix equation Az = b, define the loglikelihood function l(ˆz) = C −

1 H 1 e e = C − 2 (b − Aˆz)H (b − Aˆz), σ2 σ

(3.2.12)

3.2 Gradient of Complex Variable Function

99

where C is a real constant. Then, the conjugate gradient of the log-likelihood function is given by ∇zˆ ∗ l(ˆz) =

∂l(ˆz) 1 1 = 2 AH b − 2 AH Aˆz. ∗ ∂z σ σ

By setting ∇zˆ ∗ l(ˆz) = 0, we obtain AH Azopt = AH b, where zopt is the solution for maximizing the log-likelihood function l(ˆz). Hence, if AH A is nonsingular, then zopt = (AH A)−1 AH b.

(3.2.13)

This is the ML solution of the matrix equation Az = b. Clearly, the ML solution and the LS solution are the same for the matrix equation Az = b. Steepest Descent Direction of Real Objective Function There are two choices in determining a stationary point C of an objective function f (Z, Z∗ ) with a complex matrix as variable:  ∂f (Z, Z∗ )  = Om×n  ∂Z Z=C

or

 ∂f (Z, Z∗ )  = Om×n . ∂Z∗ Z∗ =C∗

(3.2.14)

The question is which gradient to select when designing a learning algorithm for an optimization problem. To answer this question, it is necessary to introduce the definition of the curvature direction. Definition 3.5 (Direction of Curvature [14]) If H is the Hessian operator acting on a nonlinear function f (x), a vector p is said to be 1. the direction of positive curvature of the function f , if pH Hp > 0; 2. the direction of zero curvature of the function f , if pH Hp = 0; 3. the direction of negative curvature of the function f , if pH Hp < 0. The scalar pH Hp is called the curvature of the function f along the direction p. The curvature direction is the direction of the maximum rate of change of the objective function. Theorem 3.4 ([6]) Let f (z) be a real-valued function of the complex vector z. By regarding z and z∗ as independent variables, the curvature direction of the real objective function f (z, z∗ ) is given by the conjugate gradient vector ∇z∗ f (z, z∗ ). Theorem 3.4 shows that each component of ∇z∗ f (z, z∗ ), or ∇vecZ∗ f (Z, Z∗ ), gives the rate of change of the objective function f (z, z∗ ), or f (Z, Z∗ ), in this direction: • The conjugate gradient vector ∇z∗ f (z, z∗ ), or ∇vec Z∗ f (Z, Z∗ ), gives the fastest growing direction of the objective function; • The negative gradient vector −∇z∗ f (z, z∗ ), or −∇vec Z∗ f (Z, Z∗ ), provides the steepest decreasing direction of the objective function.

100

3 Gradient and Optimization

Hence, the negative direction of the conjugate gradient is used as the updating direction in minimization algorithm: zk = zk−1 − μ∇z∗ f (z),

μ > 0.

(3.2.15)

That is, the correction amount −μ∇z∗ f (z) of alternate solutions in the iterative process is proportional to the negative conjugate gradient of the objective function. Because the direction of the negative conjugate gradient vector always points in the decreasing direction of an objection function, this type of learning algorithm is called the gradient descent algorithm or steepest descent algorithm. The constant μ in a steepest descent algorithm is referred to as the learning step and determines the rate at which the alternate solution converges to the optimization solution.

3.3 Convex Sets and Convex Function Identification In the above section we have discussed unconstrained optimization problems. This section presents constrained optimization. The basic idea of solving a constrained optimization problem is to transform it into a unconstrained optimization problem.

3.3.1 Standard Constrained Optimization Problems Consider the standard form of constrained optimization problems min f0 (x) x

subject to fi (x) ≤ 0, i = 1, . . . , m, Ax = b

(3.3.1)

which can be written as min f0 (x) x

subject to fi (x) ≤ 0, i = 1, . . . , m, hi (x) = 0, i = 1, . . . , q. (3.3.2)

The variable x in constrained optimization problems is called the optimization variable or decision variable and the function f0 (x) is the objective function or cost function, while fi (x) ≤ 0, x ∈ I,

and

hi (x) = 0, x ∈ E,

(3.3.3)

are known as the inequality constraints and the equality constraints, respectively; I and E are, respectively, the definition domains of the inequality constraint function

3.3 Convex Sets and Convex Function Identification

101

and the equality constraint function, i.e., I=

m +

dom fi

and

i=1

E=

q +

dom hi .

(3.3.4)

i=1

The inequality constraints and the equality constraints are collectively called explicit constraints. An optimization problem without explicit constraints (i.e., m = q = 0) reduces to an unconstrained optimization problem. The m inequality constraints fi (x) ≤ 0, i = 1, . . . , m and the q equality constraints hj (x) = 0, j = 1, . . . , q, denote m + q strict requirements or stipulations restricting the possible choices of x. An objective function f0 (x) represents the costs paid by selecting x. Conversely, the negative objective function −f0 (x) can be understood as the values or benefits got by selecting x. Hence, solving the constrained optimization problem (3.3.2) amounts to choosing x given m + q strict requirements such that the cost is minimized or the value is maximized. The optimal solution of the constrained optimization problem (3.3.2), denoted p∗ , is defined as the infimum of the objective function f0 (x):   p∗ = inf f0 (x)|fi (x) ≤ 0, i = 1, . . . , m; hj (x) = 0, j = 1, . . . , q .

(3.3.5)

If p∗ = ∞, then the constrained optimization problem (3.3.2) is said to be infeasible, i.e., no point x can meet the m + q constraint conditions. If p∗ = −∞, then the constrained optimization problem (3.3.2) is lower unbounded. The following are the key steps in solving the constrained optimization (3.3.2). 1. Search for the feasible points of a given constrained optimization problem. 2. Search for the point such that the constrained optimization problem reaches the optimal value. 3. Avoid turning the original constrained optimization into a lower unbounded problem. The point x meeting all the inequality constraints and equality constraints is called a feasible point. The set of all feasible points is called the feasible domain or feasible set, denoted by F, and is defined as   def F = I ∩ E = x|fi (x) ≤ 0, i = 1, . . . , m; hj (x) = 0, j = 1, . . . , q .

(3.3.6)

Any point not in the feasible set is referred to as an infeasible point. The intersection set of the definition domain domf0 of the objective function f0 and its feasible domain F is defined as D = dom f0 ∩

m + i=1

dom fi ∩

q +

dom hj = dom f0 ∩ F

(3.3.7)

j =1

that is known as the definition domain of a constrained optimization problem.

102

3 Gradient and Optimization

A feasible point x is optimal if f0 (x) = p∗ . It is usually difficult to solve a constrained optimization problem. When the number of decision variables in x is large, it is particularly difficult. This is due to the following reasons [19]. 1. A constrained optimization problem may be occupied by a local optimal solution in its definition domain. 2. It is very difficult to find a feasible point. 3. The stopping criteria in general unconstrained optimization algorithms usually fail in a constrained optimization problem. 4. The convergence rate of a constrained optimization algorithm is usually poor. 5. The size of numerical problem can make a constrained minimization algorithm either completely stop or wander, so that these algorithms cannot converge in the normal way. The above difficulties in solving constrained optimization problems can be overcome by using the convex optimization technique. In essence, convex optimization is the minimization (or maximization) of a convex (or concave) function under the constraint of a convex set. Convex optimization is a fusion of optimization, convex analysis, and numerical computation.

3.3.2 Convex Sets and Convex Functions Definition 3.6 (Convex Set) A set S ∈ Rn is convex, if the line connecting any two points x, y ∈ S is also in the set S, namely x, y ∈ S, θ ∈ [0, 1]



θ x + (1 − θ )y ∈ S.

(3.3.8)

    Many familiar sets are  convex,e.g., the unit ball S = x x2 ≤ 1 . However, the unit sphere S = xx2 = 1 is not a convex set because the line connecting two points on the sphere is obviously not itself on the sphere. Letting S1 ⊆ Rn and S2 ⊆ Rm be two convex sets, and A(x) : Rn → Rm be the linear operator such that A(x) = Ax + b. A convex set has the following important properties [25, Theorem 2.2.4].    1. The intersection S1 ∩ S2 =  x ∈ Rn x ∈ S1 , x ∈ S2 (m  = n) is a convex set. 2. The sum of sets S1 + S2 =  z = x + yx ∈ S1 , y ∈ S2 (m = n) is a convex set. n+m x ∈ S , y ∈ S 3. The direct sum S1 ⊕ S2 = 1 ∈R  (x, y) 2 is a convex set. n βx, x ∈ S 4. The conic hull K(S1 ) = z∈ R z = , β ≥ 0 1   is a convex set. 5. The affine image A(S1 ) = y ∈ Rm y = A(x), x ∈ S1 is a convex set.   6. The inverse affine image A−1 (S2 ) = x ∈ Rn x = A−1 (y), y ∈ S2 is a convex set. 7. The following convex hull is a convex set:    conv(S1 , S2 ) = z ∈ Rn z = αx + (1 − α)y, x ∈ S1 , y ∈ S2 ; α ∈ [0, 1] .

3.3 Convex Sets and Convex Function Identification

103

Given a vector x ∈ Rn and a constant ρ > 0, then    Bo (x, ρ) = y ∈ Rn y − x2 < ρ ,    Bc (x, ρ) = y ∈ Rn y − x2 ≤ ρ ,

(3.3.9) (3.3.10)

are called the open ball and the closed ball with center x and radius ρ. A convex set S ⊆ Rn is known as a convex cone if all rays starting at the origin and all segments connecting any two points of these rays are still within the convex set, i.e., x, y ∈ S, λ, μ ≥ 0



λx + μy ∈ S.

(3.3.11)

The nonnegative orthant Rn+ = {x ∈ Rn |x ≥ 0} is a convex cone. The set of n = {X ∈ Rn×n |X  0}, is a convex positive semi-definite matrices X  0, S+ cone as well, since the positive combination of any number of positive semi-definite n is called the positive semi-definite matrices is still positive semi-definite. Hence, S+ cone. Definition 3.7 (Affine Function) The vector function f(x) : Rn → Rm is known as an affine function if it has the form f(x) = Ax + b.

(3.3.12)

Similarly, the matrix function F(x) : Rn → Rp×q is called an affine (matrix) function, if it has the form F(x) = A0 + x1 A1 + · · · + xn An ,

(3.3.13)

where Ai ∈ Rp×q . An affine function sometimes is roughly referred to as a linear function. Definition 3.8 (Strictly Convex Function [25]) Given a convex set S ∈ Rn and a function f : S → R, then we have the following. • f : Rn → R is a convex function if and only if S = domf is a convex set, and for all vectors x, y ∈ S and each scalar α ∈ (0, 1), the function satisfies the Jensen inequality f (αx + (1 − α)y) ≤ αf (x) + (1 − α)f (y).

(3.3.14)

• The function f (x) is known as a strictly convex function if and only if S = domf is a convex set and, for all vectors x, y ∈ S and each scalar α ∈ (0, 1), the function satisfies the inequality f (αx + (1 − α)y) < αf (x) + (1 − α)f (y).

(3.3.15)

104

3 Gradient and Optimization

A strongly convex function has the following three equivalent definitions. 1. The function f (x) is strongly convex if [25] f (αx + (1 − α)y) ≤ αf (x) + (1 − α)f (y) −

μ α(1 − α)x − y22 2

(3.3.16)

holds for all vectors x, y ∈ S and scalars α ∈ [0, 1] and μ > 0. 2. The function f (x) is strongly convex if [2] (∇f (x) − ∇f (y))T (x − y) ≥ μx − y22

(3.3.17)

holds for all vectors x, y ∈ S and some scalar μ > 0. 3. The function f (x) is strongly convex if [25] f (y) ≥ f (x) + [∇f (x)]T (y − x) +

μ y − x22 2

(3.3.18)

for some scalar μ > 0. In the above three definitions, the constant μ (> 0) is called the convexity parameter of the strongly convex function f (x). The following is the relationship between convex functions, strictly convex functions, and strongly convex functions: strongly convex function ⇒ strictly convex function ⇒ convex function. (3.3.19) Definition 3.9 (Quasi-Convex [33]) A function f (x) is quasi-convex if, for all vectors x, y ∈ S and a scalar α ∈ [0, 1], the inequality f (αx + (1 − α)y) ≤ max{f (x), f (y)}

(3.3.20)

holds. A function f (x) is strongly quasi-convex if, for all vectors x, y ∈ S, x = y and the scalar α ∈ (0, 1), the strict inequality f (αx + (1 − α)y) < max{f (x), f (y)}

(3.3.21)

holds. A function f (x) is referred to as strictly quasi-convex if the strict inequality (3.3.21) holds for all vectors x, y ∈ S, f (x) = f (y) and a scalar α ∈ (0, 1).

3.3.3 Convex Function Identification Consider an objective function defined in the convex set S, f (x) : S → R, a natural question to ask is how to determine whether the function is convex.

3.3 Convex Sets and Convex Function Identification

105

Convex-function identification methods are divided into first-order gradient and second-order gradient identification methods. The following are the first-order necessary and sufficient conditions for identifying the convex function. First-Order Necessary and Sufficient Conditions Theorem 3.5 ([32]) Let f : S → R be a first differentiable function defined in the convex set S in the n-dimensional vector space Rn , then for all vectors x, y ∈ S, we have f (x) convex



∇x f (x) − ∇x f (y), x − y ≥ 0,

f (x) strictly convex



∇x f (x) − ∇x f (y), x − y > 0, x = y,

f (x) strongly convex



∇x f (x) − ∇x f (y), x − y ≥ μx − y22 .

Theorem 3.6 ([3]) If the function f : S → R is differentiable in a convex definition domain, then f is convex if and only if f (y) ≥ f (x) + ∇x f (x), y − x.

(3.3.22)

Second-Order Necessary and Sufficient Conditions Theorem 3.7 ([21]) Let f : S → R be a function defined in the convex set S ∈ Rn , and second differentiable, then f (x) is a convex function if and only if the Hessian matrix is positive semi-definite, namely Hx [f (x)] =

∂ 2 f (x)  0, ∂x∂xT

∀ x ∈ S.

(3.3.23)

Remark Let f : S → R be a function defined in the convex set S ∈ Rn , and second differentiable; then f (x) is strictly convex if and only if the Hessian matrix is positive definite, namely Hx [f (x)] =

∂ 2 f (x)

0, ∂x∂xT

∀ x ∈ S,

(3.3.24)

whereas the sufficient condition for a strict minimum point requires that the Hessian matrix is positive definite at one point c, Eq. (3.3.24) requires that the Hessian matrix is positive definite at all points in the convex set S. It should be pointed out that, apart from the 0 -norm, the vector norms for all other p (p ≥ 1), xp =

 n  i=1

are convex functions.

1/p | xi |

p

, p ≥ 1;

x∞ = max | xi | i

(3.3.25)

106

3 Gradient and Optimization

3.4 Gradient Methods for Smooth Convex Optimization Convex optimization is divided into first-order algorithms and second-order algorithms. Focusing upon smoothing functions, this section presents first-order algorithms for smooth convex optimization: the gradient method, the gradient projection method, and the convergence rates.

3.4.1 Gradient Method Optimization method generates a relaxation sequence xk+1 = xk + μk xk ,

k = 1, 2, . . .

(3.4.1)

to search for the optimal point xopt . In the above equation, k = 1, 2, . . . represents the iterative number and μk ≥ 0 is the step size or step length of the kth iteration; the concatenated symbol of  and x, x, denotes a vector in Rn known as the step direction or search direction of the objective function f (x). From the first-order approximation expression to the objective function at the point xk f (xk+1 ) ≈ f (xk ) + (∇f (xk ))T xk ,

(3.4.2)

(∇f (xk ))T xk < 0,

(3.4.3)

it follows that if

then f (xk+1 ) < f (xk ) for minimization problems. Hence the search direction xk such that (∇f (xk ))T xk < 0 is called the descent step or descent direction of the objective function f (x) at the kth iteration. Obviously, in order to make (∇f (xk ))T xk < 0, we should take xk = −∇f (xk ) cos θ,

(3.4.4)

where 0 ≤ θ < π/2 is the acute angle between the descent direction and the negative gradient direction −∇f (xk ). If θ = 0, then xk = −∇f (xk ). That is to say, the negative gradient direction of the objective function f at the point x is directly taken as the search direction. In this case, the length of the descent step xk 2 = ∇f (xk )2 ≥ ∇f (xk )2 cos θ takes the maximum value, and thus the descent direction xk has the maximal descent step or rate, and the corresponding descent algorithm is called the steepest descent

3.4 Gradient Methods for Smooth Convex Optimization

107

method: xk+1 = xk − μk ∇f (xk ),

k = 1, 2, . . .

(3.4.5)

The steepest descent direction x = −∇f (x) uses only first-order gradient information about the objective function f (x). If second-order information, i.e., the Hessian matrix ∇ 2 f (xk ) of the objective function is used, then we may search for a better search direction. In this case, the optimal descent direction x is the solution minimizing the second-order Taylor approximation function of f (x):  , 1 min f (x + x) = f (x) + (∇f (x))T x + (x)T ∇ 2 f (x)x . x 2

(3.4.6)

At the optimal point, the gradient vector of the objective function with respect to the parameter vector x has to equal zero, i.e., ∂f (x + x) = ∇f (x) + ∇ 2 f (x)x = 0 ∂x ! "−1 ⇔ xnt = − ∇ 2 f (x) ∇f (x),

(3.4.7)

where xnt is called the Newton step or the Newton descent direction and the corresponding optimization method is the Newton or Newton–Raphson method. Algorithm 3.1 summarizes the gradient descent algorithm and its variants. Algorithm 3.1 Gradient descent algorithm and its variants 1. input: An initial point x1 ∈ dom f and the allowed error  > 0. Put k = 1. 2. repeat 3. Compute the gradient ∇f (xk ) (and the Hessian matrix ∇ 2 f (xk )). 4. Choose the descent direction ⎧ ⎪ (steepest descent method), ⎨ − ∇f (xk ) xk = "−1 ! ⎪ ⎩ − ∇ 2 f (xk ) ∇f (xk ) (Newton method). 5. Choose a step μk > 0, and update xk+1 = xk + μk xk . 6. exit if |f (xk+1 ) − f (xk )| ≤ . 7. return k ← k + 1. 8. output: x ← xk .

108

3 Gradient and Optimization

3.4.2 Projected Gradient Method The variable vector x is unconstrained in the gradient algorithms, i.e., x ∈ Rn . When choosing a constrained variable vector x ∈ C, where C ⊂ Rn , then the update formula in the gradient algorithm should be replaced by a formula containing the projected gradient: xk+1 = PC (xk − μk ∇f (xk )).

(3.4.8)

This algorithm is called the projected gradient method or the gradient projection method. In (3.4.8), PC (y) is known as the projection of point y onto the set C, and is defined as PC (y) = arg min x∈C

1 x − y22 . 2

(3.4.9)

The projection operator can be equivalently expressed as PC (y) = PC y,

(3.4.10)

where PC is the projection matrix onto the subspace C. If C is the column space of the matrix A, then PA = A(AT A)−1 AT .

(3.4.11)

Theorem 3.8 ([25]) If C is a convex set, then there exists a unique projection PC (y). In particular, if C = Rn , i.e., the variable vector x is unconstrained, then the projection operator is equal to the identity matrix, namely PC = I, and thus PRn (y) = Py = y,

∀ y ∈ Rn .

In this case, the projected gradient algorithm simplifies to the gradient algorithm. The following are projections of the vector x onto several typical sets [35]. 1. The projection onto the affine set C = {x|Ax = b} with A ∈ Rp×n , rank(A) = p is PC (x) = x + AT (AAT )−1 (b − Ax). If p / n or AAT = I, then the projection PC (x) gives a low cost result.

(3.4.12)

3.4 Gradient Methods for Smooth Convex Optimization

109

 2. The projection onto the hyperplane C = {xaT x = b} (where a = 0) is PC (x) = x +

b − aT x a. a22

(3.4.13)

3. The projection onto the nonnegative orthant C = Rn+ is PC (x) = (x)+

[(x)+ ]i = max{xi , 0}.



(3.4.14)

4. The projection onto the half-space C = {x|aT x ≤ b} (where a = 0):

PC (x) =

⎧ T ⎪ ⎨ x + b − a2 x a, ⎪ ⎩

a2

if aT x > b; (3.4.15) if a x ≤ b. T

x,

5. The projection onto the rectangular set C = [a, b] (where ai ≤ xi ≤ bi ) is ⎧ a , if xi ≤ ai ; ⎪ ⎪ ⎨ i PC (x) = xi , if ai ≤ xi ≤ bi ; ⎪ ⎪ ⎩ bi , if xi ≥ bi .

(3.4.16)

6. The projection onto the second-order cones C = {(x, t)| x2 ≤ t, x ∈ Rn } is ⎧ ⎪ (x, t), if x2 ≤ t; ⎪ ⎪ ⎪   ⎨ t + x2 x , if − t < x2 < t; PC (x) = 2x2 t ⎪ ⎪ ⎪ ⎪ ⎩ (0, 0), if x2 ≤ −t, x = 0.

(3.4.17)

7. The projection onto the Euclidean ball C = {x| x2 ≤ 1} is PC (x) =

⎧ 1 ⎨ x, ⎩

x2

x,

if x2 > 1;

(3.4.18)

if x2 ≤ 1.

8. The projection onto the 1 -norm ball C = {x| x1 ≤ 1} is ⎧ x − λ, if xi > λ; ⎪ ⎪ ⎨ i if − λ ≤ xi ≤ λ; PC (x)i = 0, ⎪ ⎪ ⎩ xi + λ, if xi < −λ.

(3.4.19)

110

3 Gradient and Optimization

Here λ = 0 if x1 ≤ 1; otherwise, λ is the solution of the equation n 

max{| xi | − λ, 0} = 1.

i=1

9. The projection onto the positive semi-definite cones C = Sn+ is PC (X) =

n 

max{0, λi }qi qTi ,

(3.4.20)

i=1

where X =

n

T i=1 λi qi qi

is the eigenvalue decomposition of X.

3.4.3 Convergence Rates By the convergence rate it refers to the number of iterations needed for an optimization algorithm to make the estimated error of the objective function achieve the required accuracy, or given an iteration number K, the accuracy that an optimization algorithm reaches. The inverse of the convergence rate is known as the complexity of an optimization algorithm. Let x∗ be a local or global minimum point. The estimation error of an optimization algorithm is defined as the difference between the value of an objective function at iteration point xk and its value at the global minimum point, i.e., δk = f (xk ) − f (x∗ ). We are naturally interested in the convergence problems of an optimization algorithm: • Given an iteration number K, what is the designed accuracy lim1≤k≤K δk ? • Given an allowed accuracy , how many iterations does the algorithm need to achieve the designed accuracy mink δk ≤ ? When analyzing the convergence problems of optimization algorithms, we often focus on the speed of the updating sequence {xk } at which the objective function argument converges to its ideal minimum point x∗ . In numerical analysis, the speed at which a sequence reaches its limit is called the convergence rate. Q-Convergence Rate Assume that a sequence {xk } converges to x∗ . If there are a real number α ≥ 1 and a positive constant μ independent of the iterations k such that μ = lim

k→∞

xk+1 − x∗ 2 xk − x∗ α2

,

(3.4.21)

3.4 Gradient Methods for Smooth Convex Optimization

111

then {xk } is said to have an α-order Q-convergence rate. The Q-convergence rate means the Quotient convergence rate. It has the following typical values [29]: 1. When α = 1, the Q-convergence rate is called the limit-convergence rate of {xk }: μ = lim

k→∞

xk+1 − x∗ 2 xk − x∗ 2

.

(3.4.22)

According to the value of μ, the limit-convergence rate of the sequence {xk } can be divided into three types: • sublinear convergence rate, α = 1, μ = 1; • linear convergence rate, α = 1, μ ∈ (0, 1); • superlinear convergence rate, α = 1, μ = 0 or 1 < α < 2, μ = 0. 2. When α = 2, we say that the sequence {xk } has a quadratic convergence rate. 3. When α = 3, the sequence {xk } is said to have a cubic convergence rate. If {xk } is sublinearly convergent, and lim

k→∞

xk+2 − xk+1 2 xk+1 − xk 2

= 1,

then the sequence {xk } is said to have a logarithmic convergence rate. Sublinear convergence comprises a class of slow convergence rates; the linear convergence comprises a class of fast convergence; and the superlinear convergence and quadratic convergence comprise, respectively, classes of very fast and extremely fast convergence. When designing an optimization algorithm, we often require it at least to be linearly convergent, preferably to be quadratically convergent. The ultrafast cubic convergence rate is in general difficult to achieve. Local Convergence Rate The Q-convergence rate is a limited convergence rate. When we evaluate an optimization algorithm, a practical question is: how many iterations does it need to achieve the desired accuracy? The answer depends on the local convergence rate of the output objective function sequence of a given optimization algorithm. The local convergence rate of the sequence {xk } is denoted rk and is defined by - xk+1 − x∗ - . rk = xk − x∗ -2

(3.4.23)

The complexity of an optimization algorithm is defined as the inverse of the local convergence rate of the updating variable. The following is a classification of local convergence rates [25].

112

3 Gradient and Optimization

• Sublinear rate: This convergence rate is given by a power function of the iteration number k and is usually denoted as f (xk ) − f (x∗ ) ≤  = O



1 √ k

 .

(3.4.24)

• Linear rate: This convergence rate is expressed as an exponential function of the iteration number k, and is usually defined as   1 . f (xk ) − f (x ) ≤  = O k ∗

(3.4.25)

• Quadratic rate: This convergence rate is a bi-exponential function of the iteration number k and is usually measured by ∗



f (xk ) − f (x ) ≤  = O

1 k2

 .

(3.4.26)

For example, in order to achieve the approximation accuracy f (xk ) − f (x∗ ) ≤  = 0.0001, optimization algorithms with sublinear, linear, and quadratic rates need to run about 108 , 104 , and 100 iterations, respectively. Theorem 3.9 ([25]) Let  = f (xk ) − f (x∗ ) denote the estimation error of the objective function given by the updating sequence of the gradient method xk+1 = xk − α∇f (xk ). Given a convex function f (x), the upper bound of the estimation error  = f (xk ) − f (x∗ ) is given by f (xk ) − f (x∗ ) ≤

2Lx0 − x∗ 22 . k+4

(3.4.27)

This theorem shows that the local convergence rate of the gradient method xk+1 = xk − α∇f (xk ) is the linear rate O(1/k). Although this is a fast convergence rate, it is not by far optimal, as will be seen in the next section.

3.5 Nesterov Optimal Gradient Method Let Q ⊂ Rn be a convex set in the vector space Rn . Consider the unconstrained optimization problem minx∈Q f (x).

3.5 Nesterov Optimal Gradient Method

113

3.5.1 Lipschitz Continuous Function Definition 3.10 (Lipschitz Continuous [25]) An objective function f (x) is said to be Lipschitz continuous in the definition domain Q if |f (x) − f (y)| ≤ Lx − y2 ,

∀ x, y ∈ Q

(3.5.1)

holds for some Lipschitz constant L > 0. Similarly, we say that the gradient vector ∇f (x) of a differentiable function f (x) is Lipschitz continuous in the definition domain Q if ∇f (x) − ∇f (y)2 ≤ Lx − y2 ,

∀ x, y ∈ Q

(3.5.2)

holds for some Lipschitz constant L > 0. Hereafter a Lipschitz continuous function with the Lipschitz constant L will be denoted as an L-Lipschitz continuous function. The function f (x) is said to be continuous at the point x0 , if limx→x0 f (x) = f (x0 ). When we say that f (x) is a continuous function, this means the f (x) is continuous at every point x ∈ Q. A Lipschitz continuous function f (x) must be a continuous function, but a continuous function is not necessarily a Lipschitz continuous function. An everywhere differentiable function is called a smooth function. A smooth function must be continuous, but a continuous function is not necessarily smooth. A typical example occurs when the continuous function has a “sharp” point at which it is non-differentiable. Hence, a Lipschitz continuous function is not necessarily differentiable. However, a function f (x) with Lipschitz continuous gradient in the definition domain Q must be smooth in Q, because the definition of a Lipschitz continuous gradient stipulates that f (x) is differentiable in the definition domain. k,p In convex optimization, the notation CL (Q) (where Q ⊆ Rn ) denotes the Lipschitz continuous function class with the following properties [25]: k,p

• The function f ∈ CL (Q) is k times continuously differentiable in Q. k,p • The pth-order derivative of the function f ∈ CL (Q) is L-Lipschitz continuous: f (p) (x) − f (p) (y) ≤ Lx − y2 , k,p

∀ x, y ∈ Q.

If k = 0, then f ∈ CL (Q) is said to be a differentiable function. Obviously, q,p k,p it always has p ≤ k. If q > k then CL (Q) ⊆ CL (Q). For example, CL2,1 (Q) ⊆ 1,1 CL (Q).

114

3 Gradient and Optimization k,p

The following are three common function classes CL (Q): 1. f (x) ∈ CL0,0 (Q) is L-Lipschitz continuous but non-differentiable in Q; 2. f (x) ∈ CL1,0 (Q) is L-Lipschitz continuously differentiable in Q, but its gradient is not; 3. f (x) ∈ CL1,1 (Q) is L-Lipschitz continuously differentiable in Q, and its gradient ∇f (x) is L-Lipschitz continuous in Q. k,p

k,p

The basic property of the CL (Q) function class is that if f1 ∈ CL1 (Q), f2 ∈

k,p

CL2 (Q) and α, β ∈ R, then k,p

αf1 + βf2 ∈ CL3 (Q), where L3 = |α| L1 + |β| L2 . Among all Lipschitz continuous functions, CL1,1 (Q), with the Lipschitz continuous gradient, is the most important function class, and is widely applied in convex optimization. On the CL1,1 (Q) function class, one has the following two lemmas [25]. Lemma 3.1 The function f (x) belongs to CL2,1 (Rn ) if and only if f  (x)F ≤ L,

∀ x ∈ Rn .

(3.5.3)

Lemma 3.2 If f (x) ∈ CL1,1 (Q), then |f (y) − f (x) − ∇f (x), y − x| ≤

L y − x22 , 2

∀ x, y ∈ Q.

(3.5.4)

Lemma 3.2 is a key inequality for analyzing the convergence rate of a CL1,1 function f (x) in gradient algorithms. By Lemma 3.2, it follows [37] that L x − xk 22 2 k+1 L L ≤ f (xk ) + ∇f (xk ), x − xk  + x − xk 22 − x − xk+1 22 2 2 L L ≤ f (x) + x − xk 22 − x − xk+1 22 . 2 2

f (xk+1 ) ≤ f (xk ) + ∇f (xk ), xk+1 − xk  +

Put x = x∗ and δk = f (xk ) − f (x∗ ); then 0≤

L ∗ L x − xk+1 22 ≤ −δk+1 + x∗ − xk 22 2 2

≤ ··· ≤ −

k+1  i=1

δi +

L ∗ x − x0 22 . 2

3.5 Nesterov Optimal Gradient Method

115

From the estimation errors of the projected gradient algorithm δ1 ≥ δ2 ≥ · · · ≥ δk+1 , it follows that −(δ1 +· · ·+δk+1 ) ≤ −(k+1)δk+1 , and thus the above inequality can be simplified to 0≤

L ∗ L x − xk+1 22 ≤ −(k + 1)δk + x∗ − x0 22 , 2 2

from which the upper bound of the convergence rate of the projected gradient algorithm is given by Yu et al. [37] δk = f (xk ) − f (x∗ ) ≤

Lx∗ − x0 22 . 2(k + 1)

(3.5.5)

This shows that, as for the (basic) gradient method, the local convergence rate of the projected gradient method is O(1/k).

3.5.2 Nesterov Optimal Gradient Algorithms Theorem 3.10 ([24]) Let f (x) be a convex function with L-Lipschitz gradient. If the updating sequence {xk } meets the condition xk ∈ x0 + Span{x0 , . . . , xk−1 }, then the lower bound of the estimation error  = f (xk ) − f (x∗ ) achieved by any first-order optimization method is given by f (xk ) − f (x∗ ) ≥

3Lx0 − x∗ 22 , 32(k + 1)2

(3.5.6)

where Span{u0 , . . . , uk−1 } denotes the linear subspace spanned by u0 , . . . , uk−1 , x0 is the initial value of the gradient method, and f (x∗ ) denotes the minimal value of the function f . Theorem 3.10 shows that the optimal convergence rate of any first-order optimization method is the quadratic rate O(1/k 2 ). Since the convergence rate of gradient methods is the linear rate O(1/k), and the convergence rate of the optimal first-order optimization methods is the quadratic rate O(1/k 2 ), the gradient methods are far from optimal. The heavy ball method (HBM) can efficiently improve the convergence rate of the gradient methods. The HBM is a two-step method: let y0 and x0 be two initial vectors and let αk and βk be two positive valued sequences; then the first-order method for solving an

116

3 Gradient and Optimization

unconstrained minimization problem minx∈Rn f (x) can use two-step updates [31]: . yk = βk yk−1 − ∇f (xk ), xk+1 = xk + αk yk .

(3.5.7)

In particular, letting y0 = 0, then the above two-step updates can be rewritten as the one-step update xk+1 = xk − αk ∇f (xk ) + βk (xk − xk−1 ),

(3.5.8)

where xk − xk−1 is called the momentum. As seen in (3.5.7), the HBM treats the iterations as a point mass with momentum xk − xk−1 , which thus tends to continue moving in the direction xk − xk−1 . Let yk = xk + βk (xk − xk−1 ), and use ∇f (yk ) instead of ∇f (xk ). Then the update Eq. (3.5.7) becomes yk = xk + βk (xk − xk−1 ),

.

xk+1 = yk − αk ∇f (yk ).

(3.5.9)

This is the basic form of the optimal gradient method proposed by Nesterov in 1983 [24], usually called the Nesterov (first) optimal gradient algorithm and shown in Algorithm 3.2. Algorithm 3.2 Nesterov (first) optimal gradient algorithm [24] 1. input: Lipschitz constant L and convexity parameter guess μ. 2. initialization: Choose x−1 = 0, x0 ∈ Rn and α0 ∈ (0, 1). Set k = 0, q = μ/L. 3. repeat 2 4. Compute αk+1 ∈ (0, 1) from the equation αk+1 = (1 − αk+1 )αk2 + qαk+1 . 5.

Set βk =

αk (1−αk ) . αk2 +αk+1

6. Compute yk = xk + βk (xk − xk−1 ). 7. Compute xk+1 = yk − αk+1 ∇f (yk ). 8. exit if xk converges. 9. return k ← k + 1. 10. output: x ← xk .

It is easily seen that the Nesterov optimal gradient algorithm is intuitively like the heavy ball formula (3.5.7) but it is not identical; i.e., it uses ∇f (y) instead of ∇f (x). In the Nesterov optimal gradient algorithm, the estimation sequence {xk } is an approximation solution sequence, and {yk } is a searching point sequence.

3.5 Nesterov Optimal Gradient Method

117

Theorem 3.11 ([24]) Let f be a convex function with L-Lipschitz gradient. The Nesterov optimal gradient algorithm achieves f (xk ) − f (x∗ ) ≤

CLxk − x∗ 22 . (k + 1)2

(3.5.10)

Clearly, the convergence rate of the Nesterov optimal gradient method is the optimal rate for the first-order gradient methods up to constants. Hence, the Nesterov gradient method is indeed an optimal first-order minimization method. It has been shown [18] that the use of a strong convexity constant μ∗ is very effective in reducing the number of iterations of the Nesterov algorithm. A Nesterov algorithm with adaptive convexity parameter, using a decreasing sequence μk → μ∗ , was proposed in [18]. It is shown in Algorithm 3.3. Algorithm 3.3 Nesterov algorithm with adaptive convexity parameter [18] 1. input: Lipschitz constant L, and the convexity parameter guess μ∗ ≤ L. 2. initialization: x0 , v0 = x0 , γ0 > 0, β > 1, μ0 ∈ [μ∗ , γ0 ). Set k = 0, θ ∈ [0, 1], μ+ = μ0 . 3. repeat 4. dk = vk − xk . 5. yk = xk + θk dk . 6. exit if ∇f (yk ) = 0. 7. Steepest descent step: xk+1 = yk − ν∇f (yk ). If L is known, then ν ≥ 1/L. 8. If (γk − μ∗ ) < β(μ+ − μ∗ ), then choose μ+ ∈ [μ∗ , γ /β]. 9.

∇f (yk )2 2[f (yk )−f (xk+1 )] . If μ+ ≥ μ, ˜ then μ+ = max{μ∗ , μ/10}. ˜ Compute αk as the largest root of Aα 2 + Bα + C = 0. γk+1 = (1 − αk )γk + αk μ+ .   vk+1 = γ 1 (1 − αk )γk vk + αk (μ+ yk − ∇f (yk )) . k+1

Compute μ˜ =

10. 11. 12. 13. 14. return k ← k + 1. 15. output: yk as an optimal solution.

The coefficients A, B, C in Step 11 are given by G = γk



" vk − yk 2 + (∇f (yk ))T (vk − yk ) ,

2 1 A = G + ∇f (yk )2 + (μ − γk )(f (xk ) − f (yk )), 2     B = (μ − γk ) f (xk+1 ) − f (xk ) − γk f (yk ) − f (xk ) − G,   C = γk f (xk+1 ) − f (xk ) , with μ = μ+ .

118

3 Gradient and Optimization

It is suggested that γ0 = L if L is known, μ0 = max{μ∗ , γ0 /100} and β = 1.02. The Nesterov (first) optimal gradient algorithm has two limitations: • yk may be out of the definition domain Q, and thus the objective function f (x) is required to be well defined at every point in Q; • this algorithm is available only for the minimization of the Euclidean norm x2 . To cope with these limitations, Nesterov developed two other optimal gradient algorithms [26, 27]. Definite the ith element of the n × 1 mapping vector VQ (u, g) as [26] VQ (u, g) = ui e−gi (i)

⎛ ⎞−1 n  ⎝ uj e−gi ⎠ ,

i = 1, · · · , n.

(3.5.11)

j =1

Algorithm 3.4 shows the Nesterov third optimal gradient algorithm. Algorithm 3.4 Nesterov (third) optimal gradient algorithm [26] 1. input: Choose x0 ∈ Rn and α ∈ (0, 1). / 0   2. initialization: y0 = argminx Lα d(x) + 12 f (x0 ) + f  (x0 ), x − x0  : x ∈ Q . Set k = 0. 3. repeat / 0    4. Find zk = argminx Lα d(x) + ki=0 i+1 f (x ) + ∇f (x ), x − x  : x ∈ Q . i i i 2 5. 6. 7. 8. 9. 10.

Set τk =

and xk+1 = τk zk + (1 − τk )yk .  (i)  Use (3.5.11) to compute xˆ k+1 (i) = VQ zk , Lα τk ∇f (xk+1 ) , i = 1, . . . , n. Set yk+1 = τk xˆ k+1 + (1 − τk )yk . exit if xk converges. return k ← k + 1. output: x ← xk . 2 k+3

In Algorithm 3.4, d(x) is a prox-function of the domain Q. This function has two different choices [26]: d(x) = ln n +

n 

xi ln |xi |,

for x1 =

i=1

 n  1 1 2 xi − d(x) = , 2 n

n 

|xi |,

(3.5.12)

i=1

for x2 =

i=1

 n

1/2 xi2

.

(3.5.13)

i=1

3.6 Nonsmooth Convex Optimization Gradient methods require that the gradient ∇f (x) exists at the point x, while the Nesterov optimal gradient method requires further that the objective function has L-Lipschitz continuous gradient. Therefore, gradient methods in general and

3.6 Nonsmooth Convex Optimization

119

the Nesterov optimal gradient method in particular are available only for smooth convex optimization. As an important extension of gradient methods, this section focuses upon the proximal gradient method which is available for nonsmooth convex optimization.

3.6.1 Subgradient and Subdifferential The following are two typical examples of nonsmooth convex minimization. 1. Basis pursuit denoising seeks a sparse near-solution to an under-determined matrix equation [8]  , λ 2 min x1 + Ax − b2 . x∈Rn 2

(3.6.1)

This problem is essentially equivalent to the Lasso (sparse least squares) problem [34] minn

x∈R

λ Ax − b22 2

subject to x1 ≤ K.

(3.6.2)

2. Robust principal component analysis approximates an input data matrix D to a sum of a low-rank matrix L and a sparse matrix S: 0 / γ min L∗ + λS1 + L + S − D2F . 2

(3.6.3)

The key challenge of the above minimization problems is the non-smoothness induced by the non-differentiable 1 norm  · 1 and nuclear norm  · ∗ . Consider the following combinatorial optimization problem: min {F (x) = f (x) + h(x)} ,

(3.6.4)

x∈E

where • E ⊂ Rn is a finite-dimensional real vector space; • h : E → R is a convex function, but is non-differentiable or nonsmooth in E; • f : Rn → R is a continuous smooth convex function, and its gradient is LLipschitz continuous: ∇f (x) − ∇f (y)2 ≤ L x − y2 ,

∀ x, y ∈ Rn .

120

3 Gradient and Optimization

To address the challenge of non-smoothness, there are two problems to be solved: 1. how to cope with the non-smoothness; 2. how to design a Nesterov-like optimal gradient method for nonsmooth convex optimization. Because the gradient vector of a nonsmooth function h(x) does not exist everywhere, neither a gradient method nor the Nesterov optimal gradient method is available. A natural question to ask is whether a nonsmooth function has some class of “generalized gradient” similar to gradient. For a twice continuous differentiable function f (x), its second-order approximation is given by f (x + x) ≈ f (x) + (∇f (x))T x + (x)T Hx. If the Hessian matrix H is positive semi-definite or positive definite, then we have the inequality f (x + x) ≥ f (x) + (∇f (x))T x, or f (y) ≥ f (x) + (∇f (x))T (y − x),

∀ x, y ∈ dom f (x).

(3.6.5)

Although a nonsmooth function h(x) does not have a gradient vector ∇h(x), it is possible to find another vector g instead of the gradient vector ∇f (x) such that the inequality (3.6.5) still holds. Definition 3.11 (Subgradient, Subdifferential [25]) A vector g ∈ Rn is said to be a subgradient vector of the function h : Rn → R at the point x ∈ Rn if h(y) ≥ h(x) + gT (y − x),

∀ x, y ∈ dom h.

(3.6.6)

The set of all subgradient vectors of the function h at the point x is known as the subdifferential of the function h at the point x, denoted ∂h(x), and is defined as /  0 def ∂h(x) = gh(y) ≥ h(x) + gT (y − x), ∀ y ∈ dom h .

(3.6.7)

Theorem 3.12 ([25]) The optimal solution of a nonsmooth function h(x) given by h(x∗ ) = minx∈dom h h(x) if and only if ∂h(x∗ ) is subdifferential of h(x), i.e., 0 ∈ ∂h(x∗ ). When h(x) is differentiable, we have ∂h(x) = {∇h(x)}, as the gradient vector of a smooth function is unique, so we can view the gradient operator ∇h of a convex and differentiable function as a point-to-point mapping, i.e., ∇h maps each point

3.6 Nonsmooth Convex Optimization

121

x ∈ dom h to the point ∇h(x). In contrast, the subdifferential operator ∂h, defined in Eq. (3.6.7), of a closed proper convex function h can be viewed as a point-to-set mapping, i.e., ∂h maps each point x ∈ dom h to the set ∂h(x).1 Any point g ∈ ∂h(x) is called a subgradient of h at x. Generally speaking, a function h(x) may have one or several subgradient vectors at some point x. The function h(x) is said to be subdifferentiable at the point x if it at least has one subgradient vector. More generally, the function h(x) is said to be subdifferentiable in the definition domain dom h, if it is subdifferentiable at all points x ∈ dom h. The following are several examples of subdifferentials [25]. 1. Let f (x) = |x|, x ∈ R. Then ∂f (0) = [−1, 1] since f (x) = max−1≤g≤1 g · x. 2. For function f (x) = ni=1 |aTi x − bi |, if denoting I− (x) = {i|ai , x − bi < 0}, I+ (x) = {i|ai , x − bi > 0}, I0 (x) = {i|ai , x − bi = 0}, then ∂h(x) =





ai −

i∈I+ (x)

ai +

i∈I− (x)



[−ai , ai ].

(3.6.8)

i∈I0 (x)

3. For function f (x) = max1≤i≤n xi , denoting I (x) = {i : xi = f (x)}, then  ∂f (x) =

Conv{ei |i ∈ I (x)}, x = 0; Conv{e1 , . . . , en }, x = 0.

(3.6.9)

/ I (x), and Here ei = 1 for i ∈ I (x) and ei = 0 for i ∈ * Conv{e1 , . . . , en } = x =

n  i=1

.  n   αi ei αi ≥ 0, αi = 1

(3.6.10)

i=1

is a convex set. 4. The subdifferentials of Euclidean norm f (x) = x2 are given by  ∂x2 =

x = 0; {x/x2 }, {y ∈ Rn | y2 ≤ 1}, x = 0, y = 0.

(3.6.11)

1 A proper convex function f (x) takes values in the extended real number line such that f (x) < +∞ for at least one x and f (x) > −∞ for every x.

122

3 Gradient and Optimization

5. The subdifferentials of 1 -norm f (x) = x1 =  ∂x1 =

n

i=1 |xi |

are given by

x = 0; {x ∈ Rn | max1≤i≤n |xi | < 1},     0. i∈I+ (x) ei − i∈I− (x) ei + i∈I0 (x) [−ei , ei ], x = (3.6.12)

Here I+ (x) = {i|xi > 0}, I− (x) = {i|xi < 0} and I0 (x) = {i|xi = 0}. The basic properties of the subdifferential are as follows [4, 25]. • Convexity: ∂h(x) is always a closed convex set, even if h(x) is not convex. • Nonempty and boundedness: If x ∈ int(dom h), i.e., x is an interior point in the domain of h, then the subdifferential ∂h(x) is nonempty and bounded. • Nonnegative factor: If α > 0, then ∂(αh(x)) = α∂h(x). • Subdifferential: If h is convex and differentiable at the point x, then the subdifferential is a singleton ∂h(x) = {∇h(x)}, namely its gradient is its unique subgradient. Conversely, if h is a convex function and ∂h(x) = {g}, then h is differentiable at the point x and g = ∇h(x). • Minimum point of non-differentiable function: The point x∗ is a minimum point of the convex function h if and only if h is subdifferentiable at x∗ and 0 ∈ ∂h(x∗ ).

(3.6.13)

This condition is known as the first-order optimality condition of the nonsmooth convex function h(x). If h is differentiable, then the first-order optimality condition 0 ∈ ∂h(x) simplifies to ∇h(x) = 0. • Subdifferential of a sum of functions: If h1 , . . . , hm are convex functions, then the subdifferential of h(x) = h1 (x) + · · · + hm (x) is given by ∂h(x) = ∂h1 (x) + · · · + ∂hm (x). • Subdifferential of affine transform: If φ(x) = h(Ax + b), then the subdifferential ∂φ(x) = AT ∂h(Ax + b). • Subdifferential of pointwise maximal function: Let h be a pointwise maximal function of the convex functions h1 , . . . , hm , i.e., h(x) = max hi (x); then i=1,...,m

  ∂h(x) = conv ∪ {∂hi (x)|hi (x) = h(x)} . That is to say, the subdifferential of a pointwise maximal function h(x) is the convex hull of the union set of subdifferentials of the “active function” hi (x) at the point x. ˜ The subgradient vector g of the function h(x) is denoted as g = ∇h(x).

3.6 Nonsmooth Convex Optimization

123

3.6.2 Proximal Operator Consider the combinatorial optimization problem min x∈C

P 

(3.6.14)

fi (x),

i=1

where Ci = dom fi (x) for i = 1, . . . , P is a closed convex set of the m-dimensional 1 Euclidean space Rm and C = Pi=1 Ci is the intersection of these closed convex sets. The types of intersection C can be divided into the following three cases [7]: • The intersection C is nonempty and “small” (all members of C are quite similar). • The intersection C is nonempty and “large” (the differences between the members of C are large). • The intersection C is empty, which means that the imposed constraints of intersecting sets are mutually contradictory. It is difficult to solve the combinatorial optimization problem (3.6.14) directly. But, if  f1 (x) = x − x0 ,

fi (x) = ICi (x) =

0, x ∈ Ci ; +∞, x ∈ / Ci ,

then the original combinatorial optimization problem can be divided into separate problems min 1 x∈ Pi=2 Ci

x − x0 .

(3.6.15)

Different from the combinatorial optimization problem (3.6.14), the separated optimization problems (3.6.15) can be solved using the projection method. In particular, when the Ci are convex sets, the projection of a convex objective function onto the intersection of these convex sets is closely related to the proximal operator of the objective function. Definition 3.12 (Proximal Operator) The proximal operator of a convex function h(x) is defined as [23]  , 1 proxh (u) = arg min h(x) + x − u22 2 x

(3.6.16)

 , 1 2 proxμh (u) = arg min h(x) + x − u2 2μ x

(3.6.17)

or

124

3 Gradient and Optimization

with scale parameter μ > 0. In particular, for a convex function h(X), its proximal operator is defined as  , 1 2 proxμh (U) = arg min h(X) + X − UF . 2μ X

(3.6.18)

The proximal operator has the following important properties [35]. 1. Existence and uniqueness: The proximal operator proxh (u) always exists, and is unique for all x. 2. Subgradient characterization: There is the following correspondence between the proximal mapping proxh (u) and the subgradient ∂h(x): x = proxh (u)



x − u ∈ ∂h(x).

(3.6.19)

3. Nonexpansive mapping: The proximal operator proxh (u) is a nonexpansive ˆ then mapping with constant 1: if x = proxh (u) and xˆ = proxh (u), ˆ ≥ x − xˆ 22 . (x − xˆ )T (u − u) 4. Proximal operator of a separable sum function: If h : Rn1 × Rn2 → R is a separable sum function, i.e., h(x1 , x2 ) = h1 (x1 ) + h2 (x2 ), then proxh (x1 , x2 ) = (proxh (x1 ), proxh (x2 )). 1

2

5. Scaling and translation of argument: If h(x) = f (αx + b), where α = 0, then proxh (x) =

" 1! proxα 2 f (αx + b) − b . α

6. Proximal operator of the conjugate function: If h∗ (x) is the conjugate function of the function h(x), then, for all μ > 0, the proximal operator of the conjugate function is given by proxμh∗ (x) = x − μproxh/μ (x/μ). If μ = 1, then the above equation simplifies to x = proxh (x) + proxh∗ (x).

(3.6.20)

This decomposition is called the Moreau decomposition. An operator closely related to the proximal operator is the soft thresholding operator of one real variable.

3.6 Nonsmooth Convex Optimization

125

Definition 3.13 (Soft Thresholding Operator) The action of the soft thresholding operator on a real variable x ∈ R, denoted Sτ [x] or soft(x, τ ), is defined as ⎧ ⎨ x − τ, soft(x, τ ) = Sτ [x] = 0, ⎩ x + τ,

x > τ; |x| ≤ τ ; x < −τ.

(3.6.21)

Here τ > 0 is called the soft threshold value of the real variable x. The action of the soft thresholding operator can be equivalently written as soft(x, τ ) = (x − τ )+ − (−x − τ )+ = max{x − τ, 0} − max{−x − τ, 0} = (x − τ )+ + (x + τ )− = max{x − τ, 0} + min{x + τ, 0}.

(3.6.22)

This operator is also known as the shrinkage operator, because it can reduce the variable x, the elements of the vector x, and the matrix X to zero, thereby shrinking the range of elements. Hence, the action of the soft thresholding operator is sometimes written as [1, 5]: soft(x, τ ) = (|x| − τ )+ sign(x) = (1 − τ/|x|)+ x.

(3.6.23)

The soft thresholding operation on a real vector x ∈ Rn , denoted soft(x, τ ), is defined as a vector with entries soft(x, τ )i = max{xi − τ, 0} + min{xi + τ, 0} ⎧ ⎨ xi − τ, xi > τ ; = 0, |xi | ≤ τ ; ⎩ xi + τ, xi < −τ.

(3.6.24)

The soft thresholding operation on a real matrix X ∈ Rm×n , denoted soft(X), is defined as an m × n real matrix with entries soft(X, τ )ij = max{xij − τ, 0} + min{xij + τ, 0} ⎧ ⎪ ⎨ xij − τ, xij > τ ; |xij | ≤ τ ; = 0, ⎪ ⎩ x + τ, x < −τ. ij ij Given a function h(x), our goal is to find an explicit expression for  , 1 x − u22 . proxμh (u) = arg min h(x) + 2μ x

(3.6.25)

126

3 Gradient and Optimization

Theorem 3.13 ([25, p. 129]) Let x∗ be an optimal solution of the minimization problem minx∈dom φ(x). If the function φ(x) is subdifferentiable, then x∗ = arg minx∈domφ φ(x) or φ(x∗ ) = minx∈domφ φ(x) if and only if 0 ∈ ∂φ(x∗ ). By the above theorem, the first-order optimality condition of the function φ(x) = h(x) +

1 x − u22 2μ

(3.6.26)

0 ∈ ∂h(x∗ ) +

1 ∗ (x − u), μ

(3.6.27)

is given by

because ∂

1 1 x − u22 = (x − u). Hence, if and only if 0 ∈ ∂h(x∗ ) + μ−1 (x∗ − u), 2μ μ

we have  , 1 x − u22 . x∗ = proxμh (u) = arg min h(x) + 2μ x

(3.6.28)

From Eq. (3.6.27) we have 0 ∈ μ∂h(x∗ ) + (x∗ − u) ⇔ u ∈ (I + μ∂h)x∗ ⇔ (I + μ∂h)−1 u ∈ x∗ , where I is the identical operator such that I x = x. Since x∗ is only one point, (I +μ∂h)−1 u ∈ x∗ should read as (I +μ∂h)−1 u = x∗ and thus 0 ∈ μ∂h(x∗ ) + (x∗ − u)



x∗ = (I + μ∂h)−1 u



proxμh (u) = (I + μ∂h)−1 u.

This shows that the proximal operator proxμh and the subdifferential operator ∂h are related as follows: proxμh = (I + μ∂h)−1 .

(3.6.29)

The (point-to-point) mapping (I + μ∂h)−1 is known as the resolvent of the subdifferential operator ∂h with parameter μ > 0, so the proximal operator proxμh is the resolvent of the subdifferential operator ∂h. Notice that the subdifferential ∂h(x) is a point-to-set mapping, for which neither direction is unique, whereas the proximal operation proxμh (u) is a point-to-point mapping: proxμh (u) maps any point u to a unique point x.

3.6 Nonsmooth Convex Optimization

127

Table 3.2 Proximal operators of several typical functions Functions

Proximal operators ⎧ ⎪ ⎨ ui − μ, ui > μ; (proxμh (u))i = 0, |ui | ≤ μ; ⎪ ⎩ ui + μ, ui < −μ. * (1 − μ/u2 )u, u2 ≥ μ; proxμh (u) = 0, u2 < μ.

h(x) = x1

h(x) = x2 h(x) = aT x + b h(x) =

proxμh (u) = u − μ a

1 T T 2 x Ax + b x

h(x) = −

n  i=1

log xi

h(x) = IC (x) h(x) = supy∈C

proxμh (x) = u + (A + μ−1 I)−1 (b − Au)    (proxμh (u))i = 12 ui + u2i + 4μ proxμh (u) = PC (u) = arg minx∈C x − u2 proxμh (u) = u − μPC (u/μ)

yT x

h(x) = φ(x − z)

proxμh (u) = z + proxμφ (u − z)

h(x) = φ(x/ρ)

proxh (u) = ρproxφ/ρ 2 (u/ρ)

h(x) = φ(−x)

proxμh (u) = −proxμφ (−u)

h(x) =

φ ∗ (x)

proxμh (u) = u − proxμφ (u)

Example 3.2 For the quadratic function h(x) = 12 xT Ax−bT x+c, where A ∈ Rn×n is positive definite. The first-order optimality condition of proxμh (u) is ∂ ∂x



1 T 1 x Ax − bT x + c + x − u22 2 2μ

 = Ax − b +

1 (x − u) = 0, μ

with x = x∗ = proxμh (u). Hence, we have Ax∗ − b + μ−1 (x∗ − u) = 0 ⇔ (A + μ−1 I )x∗ = μ−1 u + b ⇔ x∗ = (A + μ−1 I )−1 [(A + μ−1 I )u + b − Au] which gives directly the result x∗ = proxμh (u) = u + (A + μ−1 I)−1 (b − Au). Table 3.2 lists the proximal operators of several typical functions [10, 30].

128

3 Gradient and Optimization

3.6.3 Proximal Gradient Method Consider a typical form of nonsmooth convex optimization, min J (x) = f (x) + h(x), x

(3.6.30)

where f (x) is a convex, smooth (i.e., differentiable), and L-Lipschitz function, and h(x) is a convex but nonsmooth function (such as x1 , X∗ , and so on). Quadratic Approximation Consider the quadratic approximation of an L-Lipschitz smooth function f (x) around the point xk : 1 f (x) = f (xk ) + (x − xk )T ∇f (xk ) + (x − xk )T ∇ 2 f (xk )(x − xk ) 2 L ≈ f (xk ) + (x − xk )T ∇f (xk ) + x − xk 22 , 2 where ∇ 2 f (xk ) is approximated by a diagonal matrix LI. Minimize J (x) = f (x) + h(x) via iteration to yield xk+1 = arg min {f (x) + h(x)} x

 , L T 2 ≈ arg min f (xk ) + (x − xk ) ∇f (xk ) + x − xk 2 + h(x) 2 x *  -2 . 1 Lx − xk − ∇f (xk ) = arg min h(x) + 2 L x 2   1 = proxL−1 h xk − ∇f (xk ) . L In practical applications, the Lipschitz constant L of f (x) is usually unknown. Hence, a question is how to choose L in order to accelerate the convergence of xk arises. To this end, let μ = 1/L and consider the fixed point iteration   xk+1 = proxμh xk − μ∇f (xk ) .

(3.6.31)

This iteration is called the proximal gradient method for solving the nonsmooth convex optimization problem, where the step μ is chosen to equal some constant or is determined by linear searching. Forward-Backward Splitting To derive a proximal gradient algorithm for nonsmooth convex optimization, let A = ∂h and B = ∇f denote the subdifferential operator and the gradient operator,

3.6 Nonsmooth Convex Optimization

129

respectively. Then, the first-order optimality condition for the objective function h(x) + f (x), 0 ∈ (∂h + ∇f )x∗ , can be written in operator form as 0 ∈ (A + B)x∗ , and thus we have 0 ∈ (A + B)x∗



(I − μB)x∗ ∈ (I + μA)x∗



(I + μA)−1 (I − μB)x∗ = x∗ .

(3.6.32)

Here (I − μB) is a forward operator and (I + μA)−1 is a backward operator. The backward operator (I + μ∂h)−1 is sometimes known as the resolvent of ∂h with parameter μ. By proxμh = (I + μA)−1 and (I − μB)xk = xk − μ∇f (xk ), Eq. (3.6.32) can be written as the fixed point iteration   xk+1 = (I + μA)−1 (I − μB)xk = proxμh xk − μ∇f (xk ) .

(3.6.33)

If we let yk = (I − μB)xk = xk − μ∇f (xk ),

(3.6.34)

then Eq. (3.6.33) reduces to xk+1 = (I + μA)−1 yk = proxμh (yk ).

(3.6.35)

In other words, the proximal gradient algorithm can be split two iterations: • forward iteration: yk = (I − μB)xk = xk − μ∇f (xk ); • backward iteration: xk+1 = (I + μA)−1 yk = proxμh (yk ). The forward iteration is an explicit iteration which is easily computed, whereas the backward iteration is an implicit iteration. Example 3.3 For the 1 -norm function h(x) = x1 , the backward iteration is xk+1 = proxμx (yk ) = soft·1 (yk , μ) 1

= [soft·1 (yk (1), μ), . . . , soft·1 (yk (n), μ)]T

(3.6.36)

can be obtained by the soft thresholding operation [15]: xk+1 (i) = (soft·1 (yk , μ))i = sign(yk (i)) max{|yk (i)| − μ, 0},

(3.6.37)

for i = 1, . . . , n. Here xk+1 (i) and yk (i) are the ith entries of the vectors xk+1 and yk , respectively.

130

3 Gradient and Optimization

Example 3.4 For the 2 -norm function h(x) = x2 , the backward iteration is xk+1 = proxμx (yk ) = soft·2 (yk , μ) 2

= [soft·2 (yk (1), μ), . . . , soft·2 (yk (n), μ)]T ,

(3.6.38)

where soft·2 (yk (i), μ) = max{|yk (i)| − μ, 0}

yk (i) , yk 2

i = 1, . . . , n.

Example 3.5 For the nuclear norm of matrix X∗ = corresponding proximal gradient method becomes

min{m,n} i=1

(3.6.39) σi (X), the

  Xk = proxμ·∗ Xk−1 − μ∇f (Xk−1 ) .

(3.6.40)

If W = UVT is the SVD of the matrix W = Xk−1 − μ∇f (Xk−1 ), then proxμ·∗ (W) = UDμ ()VT ,

(3.6.41)

where the singular value thresholding (SVT) operation Dμ () is defined as * [Dμ ()]i =

σi (X) − μ,

if σi (X) > μ,

0,

otherwise.

(3.6.42)

Example 3.6 If the nonsmooth function h(x) = IC (x) is an indicator function, then the proximal gradient iteration xk+1 = proxμh (xk − μ∇f (xk )) reduces to the gradient projection iteration   xk+1 = PC xk − μ∇f (xk ) .

(3.6.43)

A comparison between other gradient methods and the proximal gradient method is summarized below [38]. • The update formulas are as follows. general gradient method: xk+1 = xk − μ∇f (xk ); Newton method: xk+1 = xk − μH−1 (xk )∇f (xk );   proximal gradient method: xk+1 = proxμh xk − μ∇f (xk ) . General gradient methods and the Newton method use a low-level (explicit) update, whereas the proximal gradient method uses a high-level (implicit) operation proxμh .

3.6 Nonsmooth Convex Optimization

131

• General gradient methods and the Newton method are available only for smooth and unconstrained optimization problems, while the proximal gradient method is available for smooth or nonsmooth and/or constrained or unconstrained optimization problems. • The Newton method allows modest-sized problems to be addressed; the general gradient methods are applicable for large-size problems, and, sometimes, distributed implementations, and the proximal gradient method is available for all large-size problems and distributed implementations. In particular, the proximal gradient method includes the general gradient method, the projected gradient method, and the iterative soft thresholding method as special cases. 1. Gradient method: If h(x) = 0, due to proxh (x) = x, the proximal gradient algorithm (3.6.31) reduces to the general gradient algorithm xk = xk−1 − μk ∇f (xk−1 ). That is to say, the gradient algorithm is a special case of the proximal gradient algorithm when the nonsmooth convex function h(x) = 0. 2. Projected gradient method: For h(x) = IC (x), an indicator function, the minimization problem (3.6.4) becomes the unconstrained minimization minx∈C f (x). Because proxh (x) = PC (x), the proximal gradient algorithm reduces to   xk = PC xk−1 − μk ∇f (xk−1 ) -2 = arg min -u − xk−1 + μk ∇f (xk−1 )-2 .

(3.6.44) (3.6.45)

u∈C

This is just the projected gradient method. 3. Iterative soft thresholding method: When h(x) = x1 , the minimization  problem (3.6.4) becomes the unconstrained minimization problem  min f (x) + x1 . In this case, the proximal gradient algorithm becomes xk = proxμ

kh

  xk−1 − μk ∇f (xk−1 ) .

(3.6.46)

This is called the iterative soft thresholding method, for which ⎧ u − μ, ⎪ ⎪ ⎨ i proxμh (u)i = 0, ⎪ ⎪ ⎩ ui + μ,

ui > μ, − μ ≤ ui ≤ μ, ui < −μ.

From the viewpoint of convergence, the proximal gradient method is suboptimal. A Nesterov-like proximal gradient algorithm developed by Beck and Teboulle [1] is called the fast iterative soft thresholding algorithm (FISTA), as shown in Algorithm 3.5.

132

3 Gradient and Optimization

Algorithm 3.5 FISTA algorithm with fixed step [1] 1. input: The Lipschitz constant L of ∇f (x). 2. initialization: y1 = x0 ∈ Rn , t1 = 1. 3. repeat ! " 4. Compute xk = proxL−1 h yk − L1 ∇f (yk ) .  5. Compute tk+1 = 12 (1 + 1 + 4tk2 ). " ! t −1 6. Compute yk+1 = xk + tk (xk − xk−1 ). k+1 7. exit if xk is converged. 8. return k ← k + 1. 9. output: x ← xk .

Theorem 3.14 ([1]) Let {xk } and {yk } be two sequences generated by the FISTA algorithm; then, for any iteration k ≥ 1, F (xk ) − F (x∗ ) ≤

2Lxk − x∗ 22 , (k + 1)2

∀ x∗ ∈ X∗ ,

where x∗ and X∗ represent, respectively, the optimal solution and the optimal solution set of min (F (x) = f (x) + h(x)). ∗ Theorem 3.14 shows that to achieve the -optimal solution F (¯x)−F  (x ) ≤ , the √ FISTA algorithm needs at most 0C/  −11 iterations, where C = 2Lx0 − x∗ 22 . Since it has the same fast convergence rate as the optimal first-order algorithm, the FISTA is indeed an optimal algorithm.

3.7 Constrained Convex Optimization Consider the constrained minimization problem fi (x) ≤ 0, i = 1, . . . , m; hj (x) = 0, j = 1, . . . , q. (3.7.1) If the objective function f0 (x) and the inequality constraint functions fi (x), i = 1, . . . , m are convex, and the equality constraint functions hj (x) have the affine form h(x) = Ax − b, then Eq. (3.7.1) is called a constrained convex optimization problem. The basic idea in solving a constrained optimization problem is to transform it into an unconstrained optimization problem. The transformation methods have the following three types: min f0 (x) x

subject to

• the Lagrange multiplier method for equality or inequality constrains; • the penalty function method for equality constraints; and • the augmented Lagrange multiplier method for both equality and inequality constrains.

3.7 Constrained Convex Optimization

133

Because the Lagrange multiplier method is just a simpler example of the augmented Lagrange multiplier method, we mainly focus on the penalty function method and the augmented Lagrange multiplier method in this section.

3.7.1 Penalty Function Method The penalty function method is a widely used constrained optimization method, and its basic idea is simple: by using the penalty function and/or the barrier function, a constrained optimization becomes an unconstrained optimization of the composite function consisting of the original objective function and the constraint conditions. The penalty function method transforms the original constrained optimization problem into the following unconstrained optimization problem:   min Lρ (x) = f0 (x) + ρp(x) , x∈S

(3.7.2)

where the coefficient ρ is a penalty parameter that reflects the intensity of “punishment” via the weighting of the penalty function p(x); the greater ρ is, the greater the value of the penalty term. The main property of the penalty function is as follows: if p1 (x) is the penalty for the closed set F1 , and p2 (x) is the penalty for the closed set F2 , then p1 (x) + p2 (x) is the penalty for the intersection F1 ∩ F2 . The following are two common penalty functions. • Exterior penalty function: p(x) = ρ1

m   i=1

max{0, fi (x)}

r

+ ρ2

q 

|hj (x)|2 ,

(3.7.3)

j =1

where r is usually 1 or 2. • Interior penalty function: p(x) = ρ1

m  i=1

 1 log(−fi (x)) + ρ2 |hj (x)|2 . fi (x) q



(3.7.4)

j =1

If fi (x) ≤ 0, ∀ i = 1, . . . , m and hj (x) = 0, ∀ j = 1, . . . , q, then p(x) = 0, i.e., the exterior penalty function has no effect on the interior points in the feasible set int(F). On the contrary, if, for some iteration point xk , the inequality constraint fi (xk ) > 0, i ∈ {1, . . . , m}, and/or hj (xk ) = 0, j ∈ {1, . . . , q}, then the penalty term p(xk ) = 0, that is, any point outside the feasible set F is “punished” in exterior penalty function defined in (3.7.3). The role of interior penalty function in (3.7.4) is equivalent to building a fence on the feasible set boundary bnd(F), and thus blocks any point on bnd(F), whereas

134

3 Gradient and Optimization

the points in the relative feasible interior set relint(F) are slightly punished. The interior penalty function is also known as the barrier function. The following are three typical barrier functions for a closed set F [25]: • Power-function barrier function: φ(x) =

m  i=1



1 , p ≥ 1; (fi (x))p

1  log(−fi (x)); fi (x) i=1   m  1 exp − . • Exponential barrier function: φ(x) = fi (x)

• Logarithmic barrier function: φ(x) = −

m

i=1

In other words, the logarithmic barrier function in (3.7.4) can be replaced by the power-function barrier function or the exponential barrier function.  1 When p = 1, the power-function barrier function φ(x) = m i=1 − fi (x) is called the inverse barrier function and was presented by Carroll in 1961 [9]; while φ(x) = μ

m  i=1

1 log(−fi (x))

(3.7.5)

is known as the classical Fiacco–McCormick logarithmic barrier function [12], where μ is the barrier parameter. A comparison of the features of the external and interior penalty functions are as follows. • The exterior penalty function method is usually known as the penalty function method, and the interior penalty function method is customarily called the barrier method. • The exterior penalty function method punishes all points outside of the feasible set, and its solution satisfies all the inequality constraints fi (x) ≤ 0, i = 1, . . . , m, and all the equality constraints hj (x) = 0, j = 1, . . . , q, and is thus an exact solution to the constrained optimization problem. In other words, the exterior penalty function method is an optimal design scheme. In contrast, the interior penalty function (or barrier) method blocks all points on the boundary of the feasible set, and the found solution satisfies only the strict inequality fi (x) < 0, i = 1, . . . , m, and the equality constraints hj (x) = 0, j = 1, . . . , q, and hence is only an approximate solution. That is to say, the interior penalty function method is a suboptimal design scheme. • The exterior penalty function method can be started using an unfeasible point, and its convergence is slow, whereas the interior penalty function method requires the initial point to be a feasible interior point, so its selection is difficult; but it has a good convergence and approximation performance. In evolutionary computations the exterior penalty function method is normally used, while the interior function method is NP-hard owing to the feasible initial point search.

3.7 Constrained Convex Optimization

135

Engineering designers, especially process controllers, prefer to use the interior penalty function method, because this method allows the designers to observe the changes in the objective function value corresponding to the design points in the feasible set in the optimization process. However, this facility cannot be provided by any exterior penalty function method. In the strict sense of the penalty function classification, all the above kinds of penalty function belong to the “death penalty”: the infeasible solution points x ∈ S \ F (the difference set of the search space S and the feasible set F ) are completely ruled out by the penalty function p(x) = +∞. If the feasible search space is convex, or it is the rational part of the whole search space, then this death penalty works very well [22]. However, for genetic algorithms and evolutionary computations, the boundary between the feasible set and the infeasible set is unknown and thus it is difficult to determine the precise position of the feasible set. In these cases, other penalty functions should be used [36]: static, dynamic, annealing, adaptive, or coevolutionary penalties.

3.7.2 Augmented Lagrange Multiplier Method The augmented Lagrange multiplier method transforms the constrained optimization problem (3.7.1) into an unconstrained optimization problem with Lagrange function L(x, λ, ν) = f0 (x) +

m 

λi fi (x) +

q 

νj hj (x)

j =1

i=1

= f0 (x) + λ f(x) + ν h(x), T

T

(3.7.6)

where L(x, λ, ν) is known as the augmented Lagrange function, λ = [λ1 , . . . , λm ]T and ν = [ν1 , . . . , νq ]T are, respectively, called the Lagrange multiplier vector and the penalty parameter vector, whereas f(x) = [f1 (x), . . . , fm (x)]T and h(x) = [h1 (x), . . . , hq (x)]T are the inequality constraint vector and the equality constraint vector, respectively. The unconstrained optimization problem ⎫ ⎧ q m ⎬ ⎨   min L(x, λ, ν) = f0 (x) + λi fi (x) + νj hj (x) x ⎩ ⎭ i=1

(3.7.7)

j =1

is known as the primal problem for constrained optimization in (3.7.1). The primal problem is difficult to solve, because the general solution vector x may violate one of the inequality constraints f(x) < 0 and/or the equality constraints h(x) = Ax = 0.

136

3 Gradient and Optimization

The augmented Lagrange function includes the Lagrange function and the penalty function as two special examples. • If ν = 0, then the augmented Lagrange function simplifies to the Lagrange function: L(x, λ, 0) = L(x, λ) = f0 (x) +

m 

λi fi (x).

i=1

• If λ = 0 and ν = ρh(x) with ρ > 0, then the augmented Lagrange function reduces to the penalty function: L(x, 0, ν) = L(x, ν) = f0 (x) + ρ

q 

|hi (x)|2 .

i=1

The above two facts show that the augmented Lagrange multiplier method combines both the Lagrange multiplier method and the penalty function method. For the standard constrained optimization problem (3.7.1), the feasible set is defined as the set of points satisfying all the inequality and equality constraints, namely    F = xfi (x) ≤ 0, i = 1, . . . , m, hj (x) = 0, j = 1, . . . , q .

(3.7.8)

Therefore, the augmented Lagrange multiplier method becomes to solve the dual (optimization) problem ⎧ ⎨

min LD (λ, ν) = f0 (x) + x∈F ⎩

m 

λi fi (x) +

q  j =1

i=1

⎫ ⎬ νj hj (x) , ⎭

(3.7.9)

where LD (λ, ν) is called the dual function. Proposition 3.1 Let p∗ = f0 (x∗ ) = minx L(x, λ, ν) be the optimal solution of the primal problem and d ∗ = LD (λ∗ , ν ∗ ) = minλ≤0,ν L(x, λ, ν) be the optimal solution of the dual problem. Then d ∗ = LD (λ∗ , ν ∗ ) ≤ p∗ = f0 (x∗ ) = minx L(x, λ, ν). Proof In Lagrange function L(x, λ, ν), for any feasible point x ∈ F, due to λ ≥ 0, fi (x) < 0 and hj (x) = 0 for all i, j , we have m  i=1

λi fi (x) +

q  j =1

νj hj (x) ≤ 0.

3.7 Constrained Convex Optimization

137

From this inequality it follows that L(x, λ, ν) = f0 (x) +

m 

λi fi (x) +

q 

νj hj (x) ≤ f0 (x∗ ) = min L(x, λ, ν). x

j =1

i=1

Hence, for any feasible point x ∈ F, we have the following result: LD (λ∗ , ν ∗ ) = min L(x, λ, ν) ≤ min L(x, λ, ν) = f0 (x∗ ). λ≤0,ν

x

That is to say, d ∗ = LD (λ∗ , ν ∗ ) ≤ p∗ = f0 (x∗ ).



3.7.3 Lagrange Dual Method Consider how to solve the dual minimization problem min LD (λ, ν) = min

λ≥0,ν

 , q m   f0 (x) + λi fi (x) + νj hj (x) .

(3.7.10)

j =1

i=1

Owing to the nonnegativity of the Lagrange multiplier vector λ, the augmented Lagrange function L(x, λ, ν) may tend to negative infinity when some λi equals a very large positive number. Hence, we first need to maximize the original augmented Lagrange function to get  , q m   J1 (x) = max f0 (x) + λi fi (x) + νj hj (x) . λ≥0,ν

(3.7.11)

j =1

i=1

The problem with using the unconstrained maximization (3.7.11) is that it is impossible to avoid violation of the constraint fi (x) > 0. This may result in J1 (x) equalling positive infinity; namely,  J1 (x) =

if x meets all original constraints; f0 (x), (f0 (x), +∞), otherwise.

(3.7.12)

From the above equation it follows that in order to minimize f0 (x) subject to all inequality and equality constraints, we should minimize J1 (x) to get the primal cost function JP (x) = min J1 (x) = min max L(x, λ, ν). x

x λ≥0, ν

(3.7.13)

138

3 Gradient and Optimization

This is a minimax problem, whose solution is the supremum of the Lagrange function L(x, λ, ν), namely   q m   JP (x) = sup f0 (x) + λi fi (x) + νi hi (x) . i=1

(3.7.14)

i=1

From (3.7.14) and (3.7.12) it follows that the optimal value of the original constrained minimization problem is given by p∗ = JP (x∗ ) = min f0 (x) = f0 (x∗ ), x

(3.7.15)

which is simply known as the optimal primal value. But, the minimization of a nonconvex objective function cannot be converted into the minimization of another convex function. Hence, if f0 (x) is a convex function, then even if we designed an optimization algorithm that can find a local extremum x˜ of the original cost function, there is no guarantee that x˜ is a global extremum point. Fortunately, the minimization of a convex function f (x) and the maximization of the concave function −f (x) are equivalent. On the basis of this dual relation, it is easy to obtain a dual method for solving the optimization problem of a nonconvex objective function: convert the minimization of a nonconvex objective function into the maximization of a concave objective function. For this purpose, construct another objective function from the Lagrange function L(x, λ, ν): J2 (λ, ν) = min L(x, λ, ν) x

 , q m   = min f0 (x) + λi fi (x) + νi hi (x) . x

i=1

(3.7.16)

i=1

From the above equation it is known that * min L(x, λ, ν) = x

min f0 (x), x

if x meets all the original constraints,

(−∞, min f0 (x)), otherwise. x

Its maximization function JD (λ, ν) = max J2 (λ, ν) = max min L(x, λ, ν) λ≥0, ν

λ≥0, ν

x

(3.7.17)

is called the dual objective function for the original problem. This is a maximin problem for the Lagrange function L(x, λ, ν).

3.7 Constrained Convex Optimization

139

Since the maximin of the Lagrange function L(x, λ, ν) is its infimum, we have  JD (λ, ν) = inf f0 (x) +

m 

λi fi (x) +

i=1

q 

 νi hi (x) .

(3.7.18)

i=1

The dual objective function defined by Eq. (3.7.18) has the following characteristics. • The dual function JD (λ, ν) is an infimum of the augmented Lagrange function L(x, λ, ν). • The dual function JD (λ, ν) is a maximizing objective function, and thus it is a value or utility function rather than a cost function. • The dual function JD (λ, ν) is lower unbounded: its lower bound is −∞. Hence, JD (λ, ν) is a concave function of the variable x even if f0 (x) is not a convex function. Theorem 3.15 ([28, p. 16]) Any local minimum point x∗ of an unconstrained convex optimization function f (x) is a global minimum point. If the convex function f (x) is differentiable, then the stationary point x∗ such that ∂f (x)/∂x = 0 is a global minimum point of f (x). Theorem 3.15 shows that any extreme point of the concave function is a global extreme point. Therefore, the algorithm design of the standard constrained minimization problem (3.7.1) becomes the design of an unconstrained maximization algorithm for the dual objective function JD (λ, ν). Such a method is known as the Lagrange dual method.

3.7.4 Karush–Kuhn–Tucker Conditions Proposition 3.1 shows that d ∗ ≤ min f0 (x) = p∗ . x

(3.7.19)

The difference between the optimal primal value p∗ and the optimal dual value d ∗ , denoted p∗ − d ∗ , is known as the duality gap between the original minimization problem and the dual maximization problem. Equation (3.7.19) gives the relationship between the maximin and the minimax of the augmented Lagrange function L(x, λ, ν). In fact, for any nonnegative realvalued function f (x, y), there is the following inequality relation between the maximin and minimax: max min f (x, y) ≤ min max f (x, y). x

y

y

x

(3.7.20)

140

3 Gradient and Optimization

If d ∗ ≤ p∗ , then the Lagrange dual method is said to have weak duality, while when d ∗ = p∗ , we say that the Lagrange dual method satisfies strong duality. Let x∗ and (λ∗ , ν ∗ ) represent any original optimal point and dual optimal points with zero dual gap  = 0. Since x∗ among all original feasible points x minimizes the augmented Lagrange objective function L(x, λ∗ , ν ∗ ), the gradient vector of L(x, λ∗ , ν ∗ ) at the point x∗ must be equal to the zero vector, namely ∇f0 (x∗ ) +

m 

λ∗i ∇fi (x∗ ) +

q 

νj∗ ∇hj (x∗ ) = 0.

j =1

i=1

Therefore, the Karush–Kuhn–Tucker (KKT) conditions (i.e., the first-order necessary conditions) of the Lagrange dual unconstrained optimization problem are given by Nocedal and Wright [28] below: fi (x∗ ) ≤ 0,

i = 1, . . . , m

hj (x∗ ) = 0,

j = 1, . . . , q

λ∗i ≥ 0,

i = 1, . . . , m

λ∗i fi (x∗ )

⎫ (original inequality constraints), ⎪ ⎪ ⎪ ⎪ ⎪ (original equality constraints), ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎬ (nonnegativity),

⎪ ⎪ ⎪ ⎪ ⎪ ⎪ q m ⎪   ⎪ ∗ ∗ ∗ ∗ ∗ ⎪ ∇f0 (x )+ λi ∇fi (x ) + νj ∇hj (x ) = 0 (stationary point).⎪ ⎪ ⎭ = 0,

i=1

i = 1, . . . , m

(complementary slackness),

j =1

(3.7.21) A point x satisfying the KKT conditions is called a KKT point. Remark 1 The first KKT condition and the second KKT condition are the original inequality and equality constraint conditions, respectively. Remark 2 The third KKT condition is the nonnegative condition of the Lagrange multiplier λi , which is a key constraint of the Lagrange dual method. Remark 3 The fourth KKT condition (complementary slackness) is also called dual complementary and is another key constraint of the Lagrange dual method. This condition implies that, for a violated constraint fi (x) > 0, the corresponding Lagrange multiplier λi must equal zero, hence we can avoid completely any violated constraint. Thus the role of this condition is to establish a barrier fi (x) = 0, i = 1, . . . , m on the boundary of inequality constraints that prevents the occurrence of constraint violation fi (x) > 0. Remark 4 The fifth KKT condition is the condition for there to be a stationary point at minx L(x, λ, ν). Remark 5 If the inequality constraint fi (x) ≤ 0, i = 1, . . . , m, in the constrained optimization (3.7.1) becomes ci (x) ≥ 0, i = 1, . . . , m, then the Lagrange function

3.7 Constrained Convex Optimization

141

should be modified to L(x, λ, ν) = f0 (x) −

m 

λi ci (x) +

q 

νj hj (x),

j =1

i=1

and all inequality constrained functions fi (x) in the KKT condition formula (3.7.21) should be replaced by −ci (x). In the following, we discuss the necessary modifications of the KKT conditions under some assumptions. Definition 3.14 (Violated Constraint) For inequality constraints fi (x) ≤ 0, i = 1, . . . , m, if fi (¯x) = 0 at the point x¯ , then the ith constraint is said to be an active constraint at the point x¯ . If fi (¯x) < 0, then the ith constraint is called an inactive constraint at the point x¯ . If fi (¯x) > 0, then the ith constraint is known as a violated constraint at the point x¯ . The index set of all active constraints at the point x¯ is denoted A(¯x) = {i|fi (¯x) = 0} and is referred to as the active set of the point x¯ . Let m inequality constraints fi (x), i = 1, . . . , m, have k active constraints fA1 (x∗ ), . . . , fAk (x∗ ) and m − k inactive constraints at some KKT point x∗ . In order to satisfy the complementarity in the KKT conditions λi fi (x∗ ) = 0, the Lagrange multipliers λ∗i corresponding to the inactive constraints fi (x∗ ) < 0 must be equal to zero. This implies that the last KKT condition in (3.7.21) becomes ∇f0 (x∗ ) +



λ∗i ∇fi (x∗ ) +

i∈A

or ⎡



∂f0 (x∗ ) ⎢ ∂x1∗ ⎥

⎢ ⎢ ⎢ ⎣ ∂f

0



∂xn∗

νj∗ ∇hj (x∗ ) = 0

j =1



∂hq (x∗ ) ∂h1 (x∗ ) ⎡ ∗⎤ · · · ⎢ ∂x1∗ ∂x1∗ ⎥ ν1

⎥ ⎢ ⎥+⎢ ⎥ ⎢ ∗ (x ) ⎦ ⎣ ∂h .. .

q 





∂fAk (x∗ ) ⎡ ∂fA1 (x∗ ) ⎤ · · · ∗ ⎢ ∂x1∗ ∂x1∗ ⎥ λA1

⎢ ⎥⎢ . ⎥ ⎥⎢ . ⎥ .. .. .. ⎥⎣ . ⎦ = −⎢ ⎥⎣ . ⎦ . . . . ⎢ ⎥ . ⎥ ⎣ ⎦ ∗ ∗ ∗ ∗ ∂hq (x ) ∂fAk (x ) ⎦ λ∗ ∂fA1 (x∗ ) ν 1 (x ) q ··· Ak ··· ∗ ∗ ∗ ∗ .. .

∂xn

..

.

.. .

∂xn

∂xn

∂xn

namely ∇f0 (x∗ ) + (Jh (x∗ ))T ν ∗ = −(JA (x∗ ))T λ∗A ,

(3.7.22)

where Jh (x∗ ) is the Jacobian matrix of the equality constraint function hj (x) = 0, j = 1, . . . , q at the point x∗ and λ∗A = [λ∗A1 , . . . , λ∗Ak ] ∈ Rk ,

(3.7.23)

142

3 Gradient and Optimization

and ⎡



∂f (x∗ ) ∂fA1 (x∗ ) · · · A1 ∗ ∗ ∂xn ⎥ ⎢ ∂x1

⎢ ⎥ .. .. .. ⎥ ∈ Rk×n , JA (x∗ ) = ⎢ . . . ⎢ ⎥ ⎣ ∂fAk (x∗ ) ∂fAk (x∗ ) ⎦ ··· ∗ ∗ ∂x1

(3.7.24)

∂xn

are the Lagrange multiplier vector of the active constraint function and the Jacobian matrix, respectively. Equation (3.7.22) shows that if the Jacobian matrix JA (¯x) of the active constraint function at the feasible point x¯ is of full row rank, then the actively constrained Lagrange multiplier vector can be uniquely determined: " ! λ∗A = −(JA (¯x)JA (¯x)T )−1 JA (¯x) ∇f0 (¯x) + (Jh (¯x))T ν ∗ .

(3.7.25)

In optimization algorithm design we always want strong duality to be set up. A simple method for determining whether strong duality holds is Slater’s theorem. The set of points satisfying only the strict inequality constraints fi (x) < 0 and the equality constraints hi (x) = 0 is denoted by    relint(F) = xfi (x) < 0, i = 1, . . . , m, hj (x) = 0, j = 1, . . . , q

(3.7.26)

and is known as the relative feasible interior set or the relative strictly feasible set. The points in the feasible interior set are called relative interior points. In an optimization process, the constraint restriction that iterative points should be in the interior of the feasible domain is known as the Slater condition. Slater’s theorem says that if the Slater condition is satisfied, and the original inequality optimization problem (3.7.1) is a convex optimization problem, then the optimal value d ∗ of the dual unconstrained optimization problem (3.7.17) is equal to the optimal value p∗ of the original optimization problem, i.e., strong duality holds. The following summarizes the relationships between the original constrained optimization problem and the Lagrange dual unconstrained convex optimization problem. • Only when the inequality constraint functions fi (x), i = 1, . . . , m, are convex, and the equality constraint functions hj (x), j = 1, . . . , q, are affine, can an original constrained optimization problem be converted into a dual unconstrained maximization problem by the Lagrange dual method. • The maximization of a concave function is equivalent to the minimization of the corresponding convex function. • If the original constrained optimization is a convex problem, then the points of ˜ ν˜ ) satisfying the KKT conditions are the Lagrange objective function x˜ and (λ, the original and dual optimal points, respectively. In other words, the optimal

3.7 Constrained Convex Optimization

143

solution d∗ of the Lagrange dual unconstrained optimization is the optimal solution p∗ of the original constrained convex optimization problem. • In general, the optimal solution of the Lagrange dual unconstrained optimization problem is not the optimal solution of the original constrained optimization problem but only a -suboptimal solution, where  = f0 (x∗ ) − JD (λ∗ , ν ∗ ). Consider the constrained optimization problem min f (x) x

subject to Ax ≤ h, Bx = b.

(3.7.27)

Let the nonnegative vector s ≥ 0 be a slack variable vector such that Ax + s = h. Hence, the inequality constraint Ax ≤ h becomes the equality constraint Ax + s − h = 0. Taking the penalty function φ(g(x)) = 12 g(x)22 , the augmented Lagrange objective function is given by Lρ (x, s, λ, ν) = f (x) + λT (Ax + s − h) + ν T (Bx − b)  ρ + Bx − b22 + Ax + s − h22 , 2

(3.7.28)

where the two Lagrange multiplier vectors λ ≥ 0 and ν ≥ 0 are nonnegative vectors and the penalty parameter ρ > 0. The dual gradient ascent method for solving the original optimization problem (3.7.27) is given by xk+1 = arg min Lρ (x, sk , λk , ν k ),

(3.7.29)

sk+1 = arg min Lρ (xk+1 , s, λk , ν k ),

(3.7.30)

λk+1 = λk + ρk (Axk+1 + sk+1 − h),

(3.7.31)

ν k+1 = ν k + ρk (Bxk+1 − b),

(3.7.32)

x

s≥0

where the gradient vectors are ∂Lρ (xk+1 , sk+1 , λk , ν k ) ∂λk ∂Lρ (xk+1 , sk+1 , λk , ν k ) ∂ν k

= Axk+1 + sk+1 − h, = Bxk+1 − b.

Equations (3.7.29) and (3.7.30) are, respectively, the updates of the original variable x and the intermediate variable s, whereas Eqs. (3.7.31) and (3.7.32) are the dual gradient ascent iterations of the Lagrange multiplier vectors λ and ν, taking into account the inequality constraint Ax ≥ h and the equality constraint Bx = b, respectively.

144

3 Gradient and Optimization

3.7.5 Alternating Direction Method of Multipliers In applied statistics and machine learning we often encounter large-scale equality constrained optimization problems, where the dimension of x ∈ Rn is very large. If the vector x may be decomposed into several subvectors, i.e., x = (x1 , . . . , xr ), and the objective function may also be decomposed into f (x) =

r 

fi (xi ),

i=1

 where xi ∈ Rni and ri=1 ni = n, then a large-scale optimization problem can be transformed into a few distributed optimization problems. The alternating direction multiplier method (ADMM) is a simple and effective method for solving distributed optimization problems. The ADMM method decomposes an optimization problem into smaller subproblems; then, their local solutions are restored or reconstructed into a large-scale optimization solution of the original problem. The ADMM was proposed by Gabay and Mercier [16] and Glowinski and Marrocco [17] independently in the mid-1970s. Corresponding to the composition of the objective function f (x), the equality constraint matrix is blocked as follows: A = [A1 , . . . , Ar ],

Ax =

r 

Ai xi .

i=1

Hence the augmented Lagrange objective function can be written as [5] Lρ (x, λ) =

r 

Li (xi , λ)

i=1

- r -2 r ! - "  ρ fi (xi ) + λT Ai xi − λT b + - (Ai xi ) − b- . = 2 i=1

i=1

2

Applying the dual ascent method to the augmented Lagrange objective function yields a decentralized algorithm for parallel computing [5]: xk+1 = arg min Li (xi , λk ), i xi ∈Rni

λk+1 = λk + ρk

 r  i=1

i = 1, . . . , r,

(3.7.33)

 Ai xk+1 i

−b .

(3.7.34)

3.7 Constrained Convex Optimization

145

Here the updates xi (i = 1, . . . , r) can be run independently in parallel. Because xi , i = 1, . . . , r are updated in an alternating or sequential manner, this augmented multiplier method is known as the “alternating direction” method of multipliers. In practical applications, the simplest decomposition is the objective function decomposition of r = 2: min {f (x) + h(z)}

subject to Ax + Bz = b,

(3.7.35)

where x ∈ Rn , z ∈ Rm , A ∈ Rp×n , B ∈ Rp×m , b ∈ Rp . The augmented Lagrange cost function of the optimization problem (3.7.35) is given by Lρ (x, z, λ) = f (x) + h(z) + λT (Ax + Bz − b) +

ρ Ax + Bz − b22 . 2

(3.7.36)

It is easily seen that the optimality conditions are divided into the original feasibility condition Ax + Bz − b = 0

(3.7.37)

and two dual feasibility conditions, 0 ∈ ∂f (x) + AT λ + ρAT (Ax + Bz − b) = ∂f (x) + AT λ,

(3.7.38)

0 ∈ ∂h(z) + BT λ + ρBT (Ax + Bz − b) = ∂h(z) + BT λ,

(3.7.39)

where ∂f (x) and ∂h(z) are the subdifferentials of the subobjective functions f (x) and h(z), respectively. The updates of the ADMM for the optimization problem min Lρ (x, z, λ) are as follows: xk+1 = arg min Lρ (x, zk , λk ),

(3.7.40)

zk+1 = arg min Lρ (xk+1 , z, λk ),

(3.7.41)

λk+1 = λk + ρk (Axk+1 + Bzk+1 − b).

(3.7.42)

x∈Rn

z∈Rm

The original feasibility cannot be strictly satisfied; its error rk = Axk + Bzk − b

(3.7.43)

146

3 Gradient and Optimization

is known as the original residual (vector) in the kth iteration. Hence the update of the Lagrange multiplier vector can be simply written as λk+1 = λk + ρk rk+1 .

(3.7.44)

Likewise, neither can dual feasibility be strictly satisfied. Because xk+1 is the minimization variable of Lρ (x, zk , λk ), we have 0 ∈ ∂f (xk+1 ) + AT λk + ρAT (Axk+1 + Bzk − b) = ∂f (xk+1 ) + AT (λk + ρrk+1 + ρB(zk − zk+1 )) = ∂f (xk+1 ) + AT λk+1 + ρAT B(zk − zk+1 ). Comparing this result with the dual feasibility formula (3.7.38), it is easy to see that sk+1 = ρAT B(zk − zk+1 )

(3.7.45)

is the error vector of the dual feasibility, and hence is known as the dual residual vector in the (k + 1)th iteration. The stopping criterion of the ADMM is as follows: the original residual and the dual residual in the (k + 1)th iteration should be very small, namely [5] rk+1 2 ≤ pri ,

sk+1 2 ≤ dual ,

(3.7.46)

where pri and dual are, respectively, the allowed disturbances of the primal feasibility and the dual feasibility. If letting ν = (1/ρ)λ be the Lagrange multiplier vector scaled by the ratio 1/ρ, called the scaled dual vector, then Eqs. (3.7.40)–(3.7.42) become [5]   xk+1 = arg min f (x) + (ρ/2)Ax + Bzk − b + ν k 22 ,

(3.7.47)

  zk+1 = arg min h(z) + (ρ/2)Axk+1 + Bz − b + ν k 22 ,

(3.7.48)

ν k+1 = ν k + Axk+1 + Bzk+1 − b = ν k + rk+1 .

(3.7.49)

x∈Rn

z∈Rm

The scaled dual vector has an interesting interpretation [5]: from the residual at the kth iteration, rk = Axk + Bzk − b, it is easily seen that νk = ν0 +

k 

ri .

(3.7.50)

i=1

That is, the scaled dual vector is the running sum of the original residuals of all k iterations.

3.8 Newton Methods

147

Equations (3.7.47)–(3.7.49) are referred to as the scaled ADMM, while Eqs. (3.7.40)–(3.7.42) are the ADMM with no scaling.

3.8 Newton Methods The first-order optimization algorithms use only the zero-order information f (x) and the first-order information ∇f (x) about an objective function. It is known [20] that if the objective function is twice differentiable then the Newton method based on the Hessian matrix is of quadratic or more rapid convergence.

3.8.1 Newton Method for Unconstrained Optimization For an unconstrained optimization minx∈Rn f (x), if the Hessian matrix H = ∇ 2 f (x) is positive definite, then from the Newton equation ∇ 2 f (x)x = −∇f (x),

(3.8.1)

we get the Newton step x = −(∇ 2 f (x))−1 ∇f (x) which results in the gradient descent algorithm xk+1 = xk − μk (∇ 2 f (xk ))−1 ∇f (xk ) = xk − μk H−1 ∇f (xk ).

(3.8.2)

This is the well-known Newton method. The Newton method may encounter the following two thorny issues. • The Hessian matrix H = ∇ 2 f (x) is difficult to find. • Even if H = ∇ 2 f (x) can be found, its inverse H−1 = (∇ 2 f (x))−1 may be numerically unstable. There are the following three methods for resolving the above two thorny issues. 1. Truncated Newton method: Instead of using directly the inverse of the Hessian matrix, an iteration method for solving the Newton matrix equation ∇ 2 f (x)xnt = −∇f (x) is to find an approximate solution for the Newton step xnt . The iteration method for solving Eq. (3.8.1) approximately is known as the truncated Newton method [11], where the conjugate gradient algorithm and the preconditioned conjugate gradient algorithm are two popular algorithms for such a case. The truncated Newton method is especially useful for largescale unconstrained and constrained optimization problems and interior-point methods.

148

3 Gradient and Optimization

2. Modified Newton method: When the Hessian matrix is not positive definite, the Newton matrix equation (3.8.1) can be modified to [14] (∇ 2 f (x) + E)xnt = −∇f (x),

(3.8.3)

where E is a positive semi-definite matrix that is usually taken as a diagonal matrix such that ∇ 2 f (x) + E is symmetric positive definite. Such a method is called the modified Newton method. The typical modified Newton method takes E = δI, where δ > 0 is small. 3. Quasi-Newton method: Using a symmetric positive definite matrix to approximate the Hessian matrix or its inverse, the Newton method is known as the quasi-Newton method. Consider the second-order Taylor series expansion of objective function f (x) at point xk+1 : 1 f (x) ≈ f (xk+1 ) + ∇f (xk+1 ) · (x − xk+1 ) + (x − xk+1 )T ∇ 2 f (xk+1 )(x − xk+1 ). 2 Hence, we have ∇f (x) ≈ ∇f (xk+1 ) + Hk+1 · (x − xk+1 ). If taking x = xk , then the quasi-Newton condition is given by gk+1 − gk ≈ Hk+1 · (xk+1 − xk ).

(3.8.4)

Letting sk = xk+1 − xk denote the Newton step and yk = gk+1 − gk = ∇f (xk+1 ) − ∇f (xk ), then the quasi-Newton condition can be rewritten as yk ≈ Hk+1 · sk

or

sk = H−1 k+1 · yk .

(3.8.5)

• DFP (Davidon–Fletcher–Powell) method: Use an n × n matrix Dk = D(xk ) to approximate the inverse of Hessian matrix, i.e., Dk ≈ H−1 . In this setting, the update of Dk is given by Nesterov [25]: Dk+1 = Dk +

sk sTk yTk sk



Dk yk yTk Dk yTk Dk yk

.

(3.8.6)

• BFGS (Broyden–Fletcher–Goldfarb–Shanno) method: Use a symmetric positive definite matrix to approximate the Hessian matrix, i.e., B = H. Then, the update of Bk is given by Nesterov [25]: Bk+1 = Bk +

yk yTk yTk sk



Bk sk sTk Bk sTk Bk sk

.

(3.8.7)

An important property of the DFP and BFGS methods is that they preserve the positive definiteness of the matrices. In the iterative process of the various Newton methods, it is usually required that a search is made of the optimal point along the direction of the straight line {x +

3.8 Newton Methods

149

Algorithm 3.6 Davidon–Fletcher–Powell (DFP) quasi-Newton method input: Objective function f (x), gradient vector g(x) = ∇f (x), accuracy threshold . initialization: Take the initial value x0 and D0 = I. Take k = 0. Determine the search direction pk = −Dk gk . Determine the step length μk = arg minμ≥0 f (xk + μpk ). Update the solution vector xk+1 = xk + μk pk . Calculate the Newton step sk = xk+1 − xk . Calculate gk+1 = g(xk+1 ). If gk+1 2 <  then stop iterations and output x∗ = xk . Otherwise, continue the next step. 8. Let yk = gk+1 − gk .

1. 2. 3. 4. 5. 6. 7.

9. Calculate Dk+1 = Dk +

sk sTk yTk sk



Dk yk yTk Dk . yTk Dk yk

10. Update k ← k + 1 and return to Step 3. 11. output: Minimum point x∗ = xk of f (x).

Algorithm 3.7 Broyden–Fletcher–Goldfarb–Shanno (BFGS) quasi-Newton method input: Objective function f (x), gradient vector g(x) = ∇f (x), accuracy threshold . initialization: Take the initial value x0 and B0 = I. Take k = 0. Solve Bk pk = −gk for determining the search direction pk . Determine the step length μk = arg minμ≥0 f (xk + μpk ). Update the solution vector xk+1 = xk + μk pk . Calculate the Newton step sk = xk+1 − xk . Calculate gk+1 = g(xk+1 ). If gk+1 2 <  then stop iterations and output x∗ = xk . Otherwise, continue the next step. 8. Let yk = gk+1 − gk .

1. 2. 3. 4. 5. 6. 7.

9. Calculate Bk+1 = Bk +

yk yTk yTk sk



Bk sk sTk Bk . sTk Bk sk

10. Update k ← k + 1 and return to Step 3. 11. output: Minimum point x∗ = xk of f (x).

μx | μ ≥ 0}. This step is called linear search. However, this choice of procedure can only minimize the objective function approximately, i.e., make it sufficiently small. Such a search for approximate minimization is called an inexact line search. In this search the step size μk on the search line {x + μx | μ ≥ 0} must decrease the objective function f (xk ) sufficiently to ensure that f (xk + μxk ) < f (xk ) + αμ(∇f (xk ))T xk ,

α ∈ (0, 1).

(3.8.8)

The inequality condition (3.8.8) is sometimes called the Armijo condition [13, 20, 28]. In general, for a larger step size μ, the Armijo condition is usually not met. Hence, Newton algorithms can start the search from μ = 1; if the Armijo condition is not met, then the step size μ needs to be reduced to βμ, with the correction factor β ∈ (0, 1). If for step size βμ the Armijo condition is still not met, then the step size needs to be reduced again until the Aemijo condition is met. Such a search method is known as backtracking line search.

150

3 Gradient and Optimization

Algorithm 3.8 shows the Newton algorithm via backtracking line search. Algorithm 3.8 Newton algorithm via backtracking line search 1. 2. 3. 4. 5. 6. 7. 8. 9.

input: initialization: x1 ∈ dom f (x), and parameters α ∈ (0, 0.5), β ∈ (0, 1). Put k = 1. repeat Compute bk = ∇f (xk ) and Hk = ∇ 2 f (xk ). Solve the Newton equation Hk xk = −bk . Update xk+1 = xk + μxk . exit if |f (xk+1 ) − f (xk )| < . return k ← k + 1. output: x ← xk .

The backtracking line search method can ensure that the objective function satisfies f (xk+1 ) < f (xk ) and the step size μ may not need to be too small. If the Newton equation is solved by other methods in Step 2, then Algorithm 3.8 becomes the truncated Newton method, the modified Newton method, or the quasiNewton method, respectively.

3.8.2 Newton Method for Constrained Optimization First, consider the equality constrained optimization problem with a real variable: min f (x) x

subject to Ax = b,

(3.8.9)

where f : Rn → R is a convex function and is twice differentiable, while A ∈ Rp×n and rank(A) = p with p < n. Let xnt denote the Newton search direction. Then the second-order Taylor approximation of the objective function f (x) is given by 1 f (x + xnt ) = f (x) + (∇f (x))T xnt + (xnt )T ∇ 2 f (x)xnt , 2 subject to A(x + xnt ) = b and Axnt = 0. In other words, the Newton search direction can be determined by the equality constrained optimization problem:  , 1 T T 2 min f (x) + (∇f (x)) xnt + (xnt ) ∇ f (x)xnt xnt 2 subject to Axnt = 0.

3.8 Newton Methods

151

Letting λ be the Lagrange multiplier vector multiplying the equality constraint Axnt = 0, we have the Lagrange objective function 1 L(xnt , λ) = f (x) + (∇f (x))T xnt + (xnt )T ∇ 2 f (x)xnt + λT Axnt . 2 (3.8.10) ∂ L(xnt ,λ) = 0 and the constraint condition ∂xnt 2 0, it is easily obtained that ∇f (x)+∇ f (x)xnt +AT λ = 0 and Axnt =

By the first-order optimal condition

Axnt = 0. These two equations can be merged into

 2     ∇ f (x) AT xnt −∇f (x) = . A O λ O

(3.8.11)

As a stopping criterion for the Newton algorithm, one can use [3] λ2 (x) = (xnt )T ∇ 2 f (x)xnt . Algorithm 3.9 shows feasible-start Newton algorithm. Algorithm 3.9 Feasible-start Newton algorithm [3] 1. input: A ∈ Rp×n , b ∈ Rp , tolerance  > 0, parameters α ∈ (0, 0.5), β ∈ (0, 1). 2. initialization: A feasible initial point x1 ∈ dom f with Ax1 = b. Put k = 1. 3. repeat 4. Compute the gradient vector ∇f (xk ) and the Hessian matrix ∇ 2 f (xk ). 4. Use the preconditioned conjugate gradient  to solve  2    algorithm ∇ f (xk ) AT xnt,k −∇f (xk ) . = O A O λnt   T 5. Compute λ2 (xk ) = xnt,k ∇ 2 f (xk )xnt,k . 6. exit if λ2 (xk ) < . 7. Let μ = 1. 8. while not converged do 9. Update μ ← βμ. 10. break if f (xk + μxnt,k ) < f (xk ) + αμ(∇f (xk ))T xnt,k . 11. end while 12. Update xk+1 ← xk + μxnt,k . 13. return k ← k + 1. 14. output: x ← xk .

(3.8.12)

152

3 Gradient and Optimization

However, in many cases it is not easy to find a feasible point as an initial point. If xk is an infeasible point, consider the equality constrained optimization problem  , 1 min f (xk + xk ) = f (xk ) + (∇f (xk ))T xk + (xk )T ∇ 2 f (xk )xk xk 2 subject to A(xk + xk ) = b. Let λk+1 (= λk + λk ) be the Lagrange multiplier vector corresponding to the equality constraint A(xk + xk ) = b. Then the Lagrange objective function is 1 L(xk , λk+1 ) = f (xk ) + (∇f (xk ))T xk + (xk )T ∇ 2 f (xk )xk 2 + λTk+1 (A(xk + xk ) − b). From the optimal conditions ∂L(xk , λk+1 ) ∂xk

= 0 and

∂L(xk , λk+1 ) ∂λk+1

= 0,

we have ∇f (xk ) + ∇ 2 f (xk )xk + AT λk+1 = 0,

Axk = −(Axk − b),

which can be merged into    2   xk ∇f (xk ) ∇ f (xk ) AT =− . λk+1 A O Axk − b

(3.8.13)

Substituting λk+1 = λk + λk into the above equation, we have    2   ∇f (xk ) + AT λk ∇ f (xk ) AT xk =− . A O λk Axk − b

(3.8.14)

Algorithm 3.10 shows an infeasible-start Newton algorithm based on (3.8.14).

References

153

Algorithm 3.10 Infeasible-start Newton algorithm [3] 1. input: Any allowed tolerance  > 0, α ∈ (0, 0.5), and β ∈ (0, 1). 2. initialization: An initial point x1 ∈ Rn , any initial Lagrange multiplier λ1 ∈ Rp . Put k = 1. 3. repeat 4. Compute the gradient vector ∇f (xk ) and the Hessian matrix ∇ 2 f (xk ).   ∇f (xk ) + AT λk 5. Compute r(xk , λk ) = . Axk − b 6. exit if Axk = b and r(xk , λk )2 < . 7. Adopt the preconditioned conjugate gradient algorithm to solve the KKT equation (3.8.14), yielding the Newton step (xk , λk ). 8. Let μ = 1. 9. while not converged do 10. Update μ ← βμ. 11. break if f (xk + μxnt,k ) < f (xk ) + αμ(∇f (xk ))T xnt,k . 12. end while 13. Newton update       xk+1 x xk = k +μ . λk λk λk+1 14. return k ← k + 1. 15. output: x ← xk , λ ← λk .

Brief Summary of This Chapter • This chapter mainly introduces the convex optimization theory and methods for single-objective minimization/maximization. • Subgradient optimization algorithms have important applications in artificial intelligence. • Multiobjective optimization involves with Pareto optimization theory that is the topic in Chap. 9 (Evolutionary Computation).

References 1. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imag. Sci. 2(1), 183–202 (2009) 2. Bertsekas, D.P., Nedich, A., Ozdaglar, A.: Convex Analysis and Optimization. Athena Scientific, Nashua (2003) 3. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004) 4. Boyd, S., Vandenberghe, L.: Subgradients. Notes for EE364b, Stanford University, Winter 2006–2007 (2008) 5. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2010) 6. Brandwood, D.H.: A complex gradient operator and its application in adaptive array theory. IEE Proc. F Commun. Radar Signal Process. 130, 11–16 (1983)

154

3 Gradient and Optimization

7. Byrne, C., Censor, Y.: Proximity function minimization using multiple Bregman projections, with applications to split feasibility and Kullback–Leibler distance minimization. Ann. Oper. Res. 105, 77–98 (2001) 8. Chen, S.S., Donoho, D.L., Saunders, M.A.: Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20(1), 33–61 (1998) 9. Carroll, C.W.: The created response surface technique for optimizing nonlinear restrained systems. Oper. Res. 9, 169–184 (1961) 10. Combettes, P.L., Pesquet, J.C.: Proximal splitting methods in signal processing. In: Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pp. 185–212. Springer, New York (2011) 11. Dembo, R.S., Steihaug, T.: Truncated-Newton algorithms for large-scale unconstrained optimization. Math Program. 26, 190–212 (1983) 12. Fiacco, A.V., McCormick, G.P.: Nonlinear Programming: Sequential Unconstrained minimization Techniques. Wiley, New York (1968); or Classics Appl. Math. 4, SIAM, PA: Philadelphia, Reprint of the 1968 original, (1990) 13. Fletcher, R.: Practical Methods of Optimization, 2nd edn. Wiley, New York (1987) 14. Forsgren, A., Gill, P.E., Wright, M.H.: Interior methods for nonlinear optimization. SIAM Rev. 44, 525–597 (2002) 15. Friedman, J., Hastie, T., Höeling, H., Tibshirani, R.: Pathwise coordinate optimization. Ann. Appl. Stat. 1(2), 302–332 (2007) 16. Gabay, D., Mercier, B.: A dual algorithm for the solution of nonlinear variational problems via finite element approximations. Comput. Math. Appl. 2, 17–40 (1976) 17. Glowinski, R., Marrocco, A.: Sur l’approximation, par elements finis d’ordre un, et la resolution, par penalisation-dualité, d’une classe de problems de Dirichlet non lineares. Revue Française d’Automatique, Informatique, et Recherche Opérationelle 9, 41–76 (1975) 18. Gonzaga, C.C., Karas, E.W.: Fine tuning Nesterov’s steepest descent algorithm for differentiable convex programming. Math. Program. Ser. A 138, 141–166 (2013) 19. Hindi, H.: A tutorial on convex optimization. In: Proceedings of the 2004 American Control Conference, Boston, June 30-July 2, pp. 3252–3265 (2004) 20. Luenberger, D.: An Introduction to Linear and Nonlinear Programming, 2nd edn. AddisonWesley, Boston (1989) 21. Magnus, J.R., Neudecker, H.: Matrix Differential Calculus with Applications in Statistics and Econometrics, rev. edn. Wiley, Chichester (1999) 22. Michalewicz, Z., Dasgupta, D., Le Riche R., Schoenauer M.: Evolutionary algorithms for constrained engineering problems. Comput. Ind. Eng. J. 30, 851–870 (1996) 23. Moreau, J.J.: Fonctions convexes duales et points proximaux dans un espace Hilbertien. Rep. Paris Acad. Sci. Ser. A 255, 2897–2899 (1962) 24. Nesterov, Y.: A method for solving a convex programming problem with rate of convergence O( k12 ). Soviet Math. Dokl. 269(3), 543–547 (1983) 25. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic, Boston (2004) 26. Nesterov, Y.: Smooth minimization of nonsmooth functions (CORE Discussion Paper 2003/12, CORE 2003). Math. Program. 103(1), 127–152 (2005) 27. Nesterov, Y., Nemirovsky, A.: A general approach to polynomial-time algorithms design for convex programming. Report, Central Economical and Mathematical Institute, USSR Academy of Sciences, Moscow (1988) 28. Nocedal, J., Wright, S.J.: Numerical Optimization. Springer, New York (1999) 29. Ortega, J.M., Rheinboldt, W.C.: Iterative Solutions of Nonlinear Equations in Several Variables, pp. 253–255. Academic, New York (1970) 30. Parikh, N., Bord, S.: Proximal algorithms. 
Found. Trends Optim. 1(3), 123–231 (2013) 31. Polyak, B.T.: Introduction to Optimization. Optimization Software, New York (1987) 32. Scutari, G., Palomar, D.P., Facchinei, F., Pang, J.S.: Convex optimization, game theory, and variational inequality theory. IEEE Signal Process. Mag. 27(3), 35–49 (2010) 33. Syau, Y.R.: A note on convex functions. Int. J. Math. Math. Sci. 22(3), 525–534 (1999)

References

155

34. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B 58, 267–288 (1996) 35. Vandenberghe, L.: Lecture Notes for EE236C (Spring 2019), Section 6 The Proximal Mapping, UCLA (2019) 36. Yeniay, O., Ankara, B.: Penalty function methods for constrained optimization with genetic algorithms. Math. Comput. Appl. 10(1), 45–56 (2005) 37. Yu, K., Zhang, T., Gong, Y.: Nonlinear learning using local coordinate coding. In: Advances in Neural Information Processing Systems, 22, 2223–2231 (2009) 38. Zhang, X.D.: Matrix Analysis and Applications. Cambridge University Press, Cambridge (2017)

Chapter 4

Solution of Linear Systems

One of the most common problems in science and engineering is to solve the matrix equation Ax = b for finding x when the data matrix A and the data vector b are given. Matrix equations can be divided into three types. 1. Well-determined equation: If m = n and rank(A) = n, i.e., the matrix A is nonsingular, then the matrix equation Ax = b is said to be well-determined. 2. Under-determined equation: The matrix equation Ax = b is under-determined if the number of linearly independent equations is less than the number of independent unknowns. 3. Over-determined equation: The matrix equation Ax = b is known as overdetermined if the number of linearly independent equations is larger than the number of independent unknowns. This chapter is devoted to describe singular value decomposition (SVD) together with Gauss elimination and conjugate gradient methods for solving well-determined matrix equations followed by Tikhonov regularization and total least squares for solving over-determined matrix equations, and the Lasso and LARS methods for solving under-determined matrix equations.

4.1 Gauss Elimination First consider solution of well-determined equations. In well-determined equation, the number of independent equations and the number of independent unknowns are the same so that the solution of this system of equations is uniquely determined. The exact solution of a well-determined matrix equation Ax = b is given by x = A−1 b. A well-determined equation is a consistent equation.

© Springer Nature Singapore Pte Ltd. 2020 X.-D. Zhang, A Matrix Algebra Approach to Artificial Intelligence, https://doi.org/10.1007/978-981-15-2770-8_4

157

158

4 Solution of Linear Systems

In this section, we deal with Gauss elimination method and conjugate gradient methods for solving an over-determined matrix equation Ax = b and finding the inverse of a singular matrix. This method performs only elementary row operations to the augmented matrix B = [A, b].

4.1.1 Elementary Row Operations When solving an m × n system of linear equations, it is useful to reduce equations. The principle of reduction processes is to keep the solutions of the linear systems unchanged. Definition 4.1 (Equivalent Systems) Two linear systems of equations in n unknowns are said to be equivalent systems if they have the same sets of solutions. To transform a given m × n matrix equation Am×n xn = bm into an equivalent system, a simple and efficient way is to apply a sequence of elementary operations on the given matrix equation. Definition 4.2 (Elementary Row Operations) The following three types of operation on the rows of a system of linear equations are called elementary row operations. • Type I: Interchange any two equations, say the pth and qth equations; this is denoted by Rp ↔ Rq . • Type II: Multiply the pth equation by a nonzero number α; this is denoted by αRp → Rp . • Type III: Add β times the pth equation to the qth equation; this is denoted by βRp + Rq → Rq . Clearly, any above type of elementary row operation does not change the solution of a system of linear equations, so after elementary row operations, the reduced system of linear equations and the original system of linear equations are equivalent. Definition 4.3 (Leading Entry) The leftmost nonzero entry of a nonzero row is called the leading entry of the row. If the leading entry is equal to 1 then it is said to be the leading-1 entry. As a matter of fact, any elementary operation on an m × n system of equations Ax = b is equivalent to the same type of elementary operation on the augmented matrix B = [A, b], where the column vector b is written alongside A. Therefore, performing elementary row operations on a system of linear equations Ax = b can be in practice implemented by using the same elementary row operations on the augmented matrix B = [A, b]. The discussions above show that if, after a sequence of elementary row operations, the augmented matrix Bm×(n+1) becomes another simpler matrix Cm×(n+1) then two matrices are row equivalent for the same solution of linear system.

4.1 Gauss Elimination

159

For the convenience of solving a system of linear equations, the final row equivalent matrix should be of echelon form. Definition 4.4 (Echelon Matrix) A matrix is said to be an echelon matrix if it has the following forms: • all rows with all entries zero are located at the bottom of the matrix; • the leading entry of each nonzero row appears always to the right of the leading entry of the nonzero row above; • all entries below the leading entry of the same column are equal to zero. Definition 4.5 (Reduced Row-Echelon Form Matrix [22]) An echelon matrix B is said to be a reduced row-echelon form (RREF) matrix if the leading entry of each nonzero row is a leading-1 entry, and each leading-1 entry is the only nonzero entry in the column in which it is located. Theorem 4.1 Any m × n matrix A is row equivalent to one and only one matrix in reduced row-echelon form. Proof See [27, Appendix A]. Definition 4.6 (Pivot Column [27, p. 15]) A pivot position of an m × n matrix A is the position of some leading entry of its echelon form. Each column containing a pivot position is called a pivot column of the matrix A.

4.1.2 Gauss Elimination for Solving Matrix Equations Elementary row operations can be used to solve matrix equations and perform matrix inversion. Consider how to solve an n × n matrix equation Ax = b, where the inverse matrix A−1 of the matrix A exists. We hope that the solution vector x = A−1 b can be obtained by using elementary row operations. First use the matrix A and the vector b to construct the n × (n + 1) augmented matrix B = [A, b]. Since the solution x = A−1 b can be written as a new matrix equation Ix = A−1 b, we get a new augmented matrix C = [I, A−1 b] associated with the solution x = A−1 b. Hence, we can write the solution process for the two matrix equations Ax = b and x = A−1 b, respectively, as follows: matrix equations augmented matrices

Elementary row

Ax = b −−−−−−−−→ operations

[A, b]

Elementary row

−−−−−−−−→ operations

x = A−1 b, [I, A−1 b].

This implies that, after suitable elementary row operations on the augmented matrix [A, b], if the left-hand part of the new augmented matrix is an n×n identity matrix I,

160

4 Solution of Linear Systems

then the (n+1)th column gives the solution x = A−1 b of the original equation Ax = b directly. This method is called the Gauss or Gauss–Jordan elimination method. Any m × n complex matrix equation Ax = b can be written as (Ar + j Ai )(xr + j xi ) = br + j bi ,

(4.1.1)

where Ar , xr , br and Ai , xi , bi are the real and imaginary parts of A, x, b, respectively. Expanding the above equation to yield Ar xr − Ai xi = br ,

(4.1.2)

Ai xr + Ar xi = bi .

(4.1.3)

The above equations can be combined into 

Ar −Ai Ai Ar

    xr b = r . xi bi

(4.1.4)

Thus, m complex-valued equations with n complex unknowns become 2m realvalued equations with 2n real unknowns. In particular, if m = n, then we have complex matrix equation Ax = b  augmented matrix

Ar −Ai br Ai Ar bi



Elementary row

−−−−−−−−→ operations

Elementary row

−−−−−−−−→ operations

x = A−1 b, 

 In On xr . On In xi

This shows that if we write the n × (n + 1) complex augmented matrix [A, b] as a 2n × (2n + 1) real augmented matrix and perform elementary row operations to make its left-hand side become a 2n × 2n identity matrix, then the upper and lower halves of the (2n + 1)th column give, respectively, the real and imaginary parts of the complex solution vector x of the original complex matrix equation Ax = b.

4.1.3 Gauss Elimination for Matrix Inversion Consider the inversion operation on an n × n nonsingular matrix A. This problem can be modeled as an n × n matrix equation AX = I whose solution X is the inverse matrix of A. It is easily seen that the augmented matrix of the matrix equation AX = I is [A, I], whereas the augmented matrix of the solution equation IX = A−1 is

4.1 Gauss Elimination

161

[I, A−1 ]. Hence, we have the following relations: matrix equation augmented matrix

Elementary row

AX = I

−−−−−−−−→

[A, I]

−−−−−−−−→

operations

Elementary row operations

X = A−1 , [I, A−1 ].

This result tells us that if we use elementary row operations on the n×2n augmented matrix [A, I] so that its left-hand part becomes an n×n identity matrix then the righthand part yields the inverse A−1 of the given n × n matrix A directly. This operation is called the Gauss elimination method for matrix inversion. Example 4.1 Use the Gauss elimination method to find the inverse of the matrix ⎤ 1 1 2 A = ⎣ 3 4 −1⎦ . −1 1 1 ⎡

Perform elementary row operations on the augmented matrix of [A, I] to yield the following results: ⎤ ⎤ ⎡ 1 1 2 1 0 0 (−3)R +R2 →R2 1 1 2 1 0 0 R +R3 →R3 1 ⎣ 3 4 −1 0 1 0⎦ −−−−1−−−−−→ ⎣ 0 1 −7 −3 1 0⎦ −− −−−−→ −1 1 1 0 0 1 −1 1 1 0 0 1 ⎤ ⎡ ⎤ ⎡ 1 0 9 4 −1 0 (−2)R +R →R 1 1 2 1 0 0 (−1)R2 +R →R 3 1 1 ⎣0 1 −7 −3 1 0⎦ −−−−−−−−−→ ⎣0 1 −7 −3 1 0⎦ −−−−2−−−3−−→ 02 3 1 0 1 02 3 1 01 ⎡ ⎤ ⎤ ⎡ 1 0 9 4 −1 0 1 R3 →R3 1 0 9 4 −1 0 (−9)R3 +R →R 17 1 1 ⎣0 1 −7 −3 1 0⎦ − −−−−→ ⎣0 1 −7 −3 1 0 ⎦ −−−−−−−−−→ 7 −2 1 0 0 1 17 0 0 17 7 −2 1 17 17 ⎡ ⎤ 1 −9 5 ⎡ 1 0 0 17 17 1 −9 ⎤ 5 17 1 0 0 17 17 17 ⎢ ⎥ ⎥ ⎥ 7R3 +R2 →R2 ⎢ ⎢ −2 3 7 ⎥ . ⎥ −−−−−−−→ ⎢ ⎢ 0 1 0 ⎢ 17 17 17 ⎥ ⎣0 1 −7 −3 1 0 ⎦ ⎢ ⎥ ⎣ ⎦ 7 −2 1 0 0 1 17 7 −2 1 17 17 0 0 1 17 17 17 ⎡

That is to say, ⎤−1 ⎤ ⎡ 1 1 2 5 1 −9 ⎣ 3 4 −1⎦ = 1 ⎣−2 3 7 ⎦ . 17 −1 1 1 7 −2 1 ⎡

162

4 Solution of Linear Systems

For the matrix equation x1 + x2 + 2x3 = 6, 3x1 + 4x2 − x3 = 5, −x1 + x2 + x3 = 2, its solution ⎤−1 ⎡ ⎤ ⎤⎡ ⎤ ⎡ ⎤ ⎡ 1 1 2 6 1 6 5 1 −9 1 ⎣−2 3 7 ⎦ ⎣5⎦ = ⎣1⎦ . x = A−1 b = ⎣ 3 4 −1⎦ ⎣5⎦ = 17 −1 1 1 2 2 2 7 −2 1 ⎡

Now, we consider the inversion of an n × n nonsingular complex matrix A. Then, its inversion can be modeled as the complex matrix equation (Ar + jAi )(Xr + jXi ) = I. This complex equation can be rewritten as 

Ar −Ai Ai Ar



   Xr I = n , Xi On

(4.1.5)

from which we get the following relation: Elementary row

complex matrix equation AX = I −−−−−−−−→ operations

 augmented matrix

Ar −Ai In Ai Ar On



Elementary row

−−−−−−−−→ operations

X = A−1 , 

 In On Xr . On In Xi

That is to say, after making elementary row operations on the 2n × 3n augmented matrix to transform its left-hand side to a 2n × 2n identity matrix, then the upper and lower halves of the 2n × n matrix on the right give, respectively, the real and imaginary parts of the inverse matrix A−1 of the complex matrix A.

4.2 Conjugate Gradient Methods Given a matrix equation Ax = b, where A ∈ Rn×n is a nonsingular matrix. This matrix equation can be equivalently written as x = (I − A)x + b,

(4.2.1)

which inspires the following iteration algorithm: xk+1 = (I − A)xk + b.

(4.2.2)

4.2 Conjugate Gradient Methods

163

This algorithm is known as the Richardson iteration and can be rewritten in the more general form xk+1 = Mxk + c,

(4.2.3)

where M is an n × n matrix, called the iteration matrix. An iteration in the form of Eq. (4.2.3) is termed as a stationary iterative method and is not as effective as a nonstationary iterative method. Nonstationary iteration methods comprise a class of iteration methods in which xk+1 is related to the former iteration solutions xk , xk−1 , . . . , x0 . The most typical nonstationary iteration method is the Krylov subspace method, described by xk+1 = x0 + Kk ,

(4.2.4)

Kk = Span(r0 , Ar0 , . . . , Ak−1 r0 )

(4.2.5)

where

is called the kth iteration Krylov subspace; x0 is the initial iteration value and r0 denotes the initial residual vector. Krylov subspace methods have various forms, of which the three most common are the conjugate gradient method, the biconjugate gradient method, and the preconditioned conjugate gradient method.

4.2.1 Conjugate Gradient Algorithm The conjugate gradient method uses r0 = Ax0 − b as the initial residual vector. The applicable object of the conjugate gradient method is limited to the symmetric positive definite equation, Ax = b, where A is an n × n symmetric positive definite matrix. The nonzero vector combination {p0 , p1 , . . . , pk } is said to be A-orthogonal or A-conjugate if pTi Apj = 0,

∀ i = j.

(4.2.6)

This property is known as A-orthogonality or A-conjugacy. Obviously, if A = I, then the A-conjugacy reduces to general orthogonality. All algorithms adopting the conjugate vector as the update direction are known as conjugate direction algorithms. If the conjugate vectors p0 , p1 , . . . , pn−1 are not predetermined, but are updated by the gradient descent method in the updating process, then we say that the minimization algorithm for the objective function f (x) is a conjugate gradient algorithm. Algorithm 4.1 gives a conjugate gradient algorithm.

164

4 Solution of Linear Systems

Algorithm 4.1 Conjugate gradient algorithm [16] 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12.

input: A = AT ∈ Rn×n , b ∈ Rn , the largest iteration step number kmax , the allowed error . initialization: Choose x0 ∈ Rn , and let r = Ax0 − b and ρ0 = r22 . repeat If k = 1 then p = r. Otherwise, let β = ρk−1 /ρk−2 and p = r + βp. w = Ap. α = ρk−1 /(pT w). x = x + αp. r = r − αw. ρk = r22 . √ exit if ρk < b2 or k = kmax . return k ← k + 1. output: x ← xk .

From Algorithm 4.1 it can be seen that in the iteration process of the conjugate gradient algorithm, the solution of the matrix equation Ax = b is given by xk =

k  i=1

αi pi =

k  ri−1 , ri−1  i=1

pi , Api 

pi ,

(4.2.7)

that is, xk belongs to the kth Krylov subspace xk ∈ Span{p1 , p2 , . . . , pk } = Span{r0 , Ar0 , . . . , Ak−1 r0 }. The fixed-iteration method requires the updating of the iteration matrix M, but no matrix needs to be updated in Algorithm 4.1. Hence, the Krylov subspace method is also called the matrix-free method [24].

4.2.2 Biconjugate Gradient Algorithm If the matrix A is not a real symmetric matrix, then we can use the biconjugate gradient method of Fletcher [13] for solving the matrix equation Ax = b. As the name implies, there are two search directions, pi and p¯ j , that are A-conjugate in this method: p¯ Ti Apj = pTi Ap¯ j = 0,

i = j ;

r¯ Ti rj = rTi r¯ j = 0,

i = j ;

r¯ Ti pj = rTi p¯ j = 0,

j < i.

(4.2.8)

Here, ri and r¯ j are the residual vectors associated with search directions pi and p¯ j , respectively. Algorithm 4.2 shows a biconjugate gradient algorithm.

4.2 Conjugate Gradient Methods

165

Algorithm 4.2 Biconjugate gradient algorithm [13] 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12.

input: The matrix A. initialization: p1 = r1 , p¯ 1 = r¯ 1 . repeat αk = r¯ Tk rk /(p¯ Tk Apk ). rk+1 = rk − αk Apk . r¯ k+1 = r¯ k − αk AT p¯ k . βk = r¯ Tk+1 rk+1 /(¯rTk rk ). pk+1 = rk+1 + βk pk . p¯ k+1 = r¯ k+1 + βk p¯ k . exit if k = kmax . return k ← k + 1. output: x ← xk + αk pk , x¯ k+1 ← x¯ k + αk p¯ k .

4.2.3 Preconditioned Conjugate Gradient Algorithm Consider the symmetric indefinite saddle point problem      A BT x f = , B O q g where A is an n×n real symmetric positive definite matrix, B is an m×n real matrix with full row rank m (≤ n), and O is an m × m zero matrix. Bramble and Pasciak [4] developed a preconditioned conjugate gradient iteration method for solving the above problem. The basic idea of the preconditioned conjugate gradient iteration is as follows: through a clever choice of the scalar product form, the preconditioned saddle matrix becomes a symmetric positive definite matrix. To simplify the discussion, we assume that the matrix equation with large condition number Ax = b needs to be converted into a new symmetric positive definite equation. To this end, let M be a symmetric positive definite matrix that can approximate the matrix A and for which it is easier to find the inverse matrix. Hence, the original matrix equation Ax = b is converted into M−1 Ax = M−1 b such that the new and original matrix equations have the same solution. However, in the new matrix equation M−1 Ax = M−1 b there is a hidden danger: M−1 A is generally not either symmetric or positive definite even if both M and A are symmetric positive definite. Therefore, it is unreliable to use the matrix M−1 directly as the preprocessor of the matrix equation Ax = b. Let S be the square root of a symmetric matrix M, i.e., M = SST , where S is symmetric positive definite. Now, use S−1 instead of M−1 as the preprocessor to convert the original matrix equation Ax = b to S−1 Ax = S−1 b. If x = S−T xˆ , then the preconditioned matrix equation is given by S−1 AS−T xˆ = S−1 b.

(4.2.9)

166

4 Solution of Linear Systems

Compared with the matrix M−1 A, which is not symmetric positive definite, must be symmetric positive definite if A is symmetric positive definite. The symmetry of S−1 AS−T is easily seen, its positive definiteness can be verified as follows: by checking the quadratic function it easily follows that yT (S−1 AS−T )y = zT Az, where z = S−T y. Because A is symmetric positive definite, we have zT Az > 0, ∀ z = 0, and thus yT (S−1 AS−T )y > 0, ∀ y = 0. That is to say, S−1 AS−T must be positive definite. At this point, the conjugate gradient method can be applied to solve the matrix equation (4.2.9) in order to get xˆ , and then we can recover x via x = S−T xˆ . Algorithm 4.3 shows a preconditioned conjugate gradient (PCG) algorithm with a preprocessor, developed in [37]. S−1 AS−T

Algorithm 4.3 PCG algorithm with preprocessor [37] 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14.

input: A, b, preprocessor S−1 , maximal number of iterations kmax , and allowed error  < 1. initialization: k = 0, r0 = Ax − b, d0 = S−1 r0 , δnew = rT0 d0 , δ0 = δnew . qk+1 = Adk . α = δnew /(dTk qk+1 ). xk+1 = xk + αdk . If k can be divided exactly by 50, then rk+1 = b − Axk . Otherwise, update rk+1 = rk − αqk+1 . sk+1 = S−1 rk+1 . δold = δnew . δnew = rTk+1 sk+1 . exit if k = kmax or δnew <  2 δ0 . β = δnew /δold . dk+1 = sk+1 + βdk . return k ← k + 1. output: x ← xk .

The use of a preprocessor can be avoided, because there are correspondences between the variables of the matrix equations Ax = b and S−1 AS−T xˆ = S−1 b, as follows [24]: xk = S−1 xˆ k ,

rk = Sˆrk ,

pk = S−1 pˆ k ,

zk = S−1 rˆ k .

On the basis of these correspondences, it is easy to develop a PCG algorithm without preprocessor [24], as shown in Algorithm 4.4. A wonderful introduction to the conjugate gradient method is presented in [37]. Regarding the complex matrix equation Ax = b, where A ∈ Cn×n , x ∈ Cn , b ∈ Cn , we can write it as the following real matrix equation:      AR −AI xR b = R . AI AR xI bI

(4.2.10)

Clearly, if A = AR +jAI is a Hermitian positive definite matrix, then Eq. (4.2.10) is a symmetric positive definite matrix equation. Hence, (xR , xI ) can be solved by

4.3 Condition Number of Matrices

167

Algorithm 4.4 PCG algorithm without preprocessor [24] 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14.

input: A = AT ∈ Rn×n , b ∈ Rn , maximal iteration number kmax , and allowed error . initialization: x0 ∈ Rn , r0 = Ax0 − b, ρ0 = r22 and M = SST . Put k = 1. repeat zk = Mrk−1 . τk−1 = zTk rk−1 . If k = 1, then β = 0, p1 = z1 . Otherwise, β = τk−1 /τk−2 , pk ← zk + βpk . wk = Apk . α = τk−1 /(pTk wk ). xk+1 = xk + αpk . rk = rk−1 − αwk . ρk = rTk rk . √ exit if ρk < b2 or k = kmax . return k ← k + 1. output: x ← xk .

adopting the conjugate gradient algorithm or the preconditioned conjugate gradient algorithm. The main advantage of gradient methods is that the computation of each iteration is very simple; however, their convergence is generally slow.

4.3 Condition Number of Matrices In many applications in science and engineering, it is often necessary to consider an important problem: there exist some uncertainties or errors in the actual observation data, and, furthermore, numerical calculation of the data is always accompanied by error. What is the impact of these errors? Is a particular algorithm numerically stable for data processing? In order to answer these questions, the following two concepts are extremely important: • the numerical stability of various kinds of algorithms; • the condition number or perturbation analysis of the problem of interest. Given some application problem f where d ∗ ∈ D is the data without noise or disturbance in a data group D. Let f (d ∗ ) ∈ F denote the solution of f , where F is a solution set. For observed data d ∈ D, we want to evaluate f (d). Owing to background noise and/or observation error, f (d) is usually different from f (d ∗ ). If f (d) is “close" to f (d ∗ ), then the problem f is said to be a “well-conditioned problem." On the contrary, if f (d) is obviously different from f (d ∗ ) even when d is very close to d ∗ , then we say that the problem f is an “ill-conditioned problem." If there is no further information about the problem f , the term “approximation" cannot describe the situation accurately.

168

4 Solution of Linear Systems

In perturbation theory, a method or algorithm for solving a problem f is said to be numerically stable if its sensitivity to a perturbation is not larger than the sensitivity inherent in the original problem. More precisely, f is stable if the approximate solution f (d) is close to the solution f (d ∗ ) without perturbation for all d ∈ D close to d ∗ . To mathematically describe numerical stability of a method or algorithm, we first consider the well-determined linear equation Ax = b, where the n × n coefficient matrix A has known entries and the n×1 data vector b is also known, while the n×1 vector x is an unknown parameter vector to be solved. Naturally, we are interested in the numerical stability of this solution: if the coefficient matrix A and/or the data vector b are perturbed, how will the solution vector x be changed? Does it maintain a certain stability? By studying the influence of perturbations of the coefficient matrix A and/or the data vector b, we can obtain a numerical value, called the condition number, describing an important characteristic of the coefficient matrix A. For the convenience of analysis, we first assume that there is only a perturbation δb of the data vector b, while the matrix A is stable. In this case, the exact solution vector x will be perturbed to x + δx, namely A(x + δx) = b + δb.

(4.3.1)

δx = A−1 δb,

(4.3.2)

This implies that

since Ax = b. By the property of the matrix norm, Eq. (4.3.2) gives δx ≤ A−1  · δb.

(4.3.3)

Similarly, from the linear equation Ax = b we get b ≤ A · x.

(4.3.4)

From Eqs. (4.3.3) and (4.3.4) it is immediately clear that " δb δx ! ≤ A · A−1  . x b

(4.3.5)

Next, consider the influence of simultaneous perturbations δx and δA. In this case, the linear equation becomes (A + δA)(x + δx) = b.

4.3 Condition Number of Matrices

169

From the above equation it can be derived that   δx = (A + δA)−1 − A−1 b 2   3 = A−1 A − (A + δA) (A + δA)−1 b = −A−1 δA(A + δA)−1 b = −A−1 δA(x + δx).

(4.3.6)

Then we have δx ≤ A−1  · δA · x + δx, namely   δA δx ≤ A · A−1  . x + δx A

(4.3.7)

Equations (4.3.5) and (4.3.7) show that the relative error of the solution vector x is proportional to the numerical value cond(A) = A · A−1 ,

(4.3.8)

where cond(A) is called the condition number of the matrix A with respect to its inverse and is sometimes denoted κ(A). Condition numbers have the following properties. • • • •

cond(A) = cond(A−1 ). cond(cA) = cond(A). cond(A) ≥ 1. cond(AB) ≤ cond(A)cond(B). Here, we give the proofs of cond(A) ≥ 1 and cond(AB) ≤ cond(A)cond(B): cond(A) = A · A−1  ≥ AA−1  = I = 1;

cond(AB) = AB · (AB)−1  ≤ A · B · (B−1  · A−1 ) = cond(A)cond(B). An orthogonal matrix A is perfectly conditioned in the sense that cond(A) = 1. The following are four common condition numbers. • 1 condition number, denoted as cond1 (A), is defined as cond1 (A) = A1 A−1 1 .

(4.3.9)

170

4 Solution of Linear Systems

• 2 condition number, denoted as cond2 (A), is defined as 4 −1

cond2 (A) = A2 A

2 =

σmax (A) λmax (AH A) = , λmin (AH A) σmin (A)

(4.3.10)

where λmax (AH A) and λmin (AH A) are the maximum and minimum eigenvalues of AH A, respectively; and σmax (A) and σmin (A) are the maximum and minimum singular values of A, respectively. • ∞ condition number, denoted as cond∞ (A), is defined as cond∞ (A) = A∞ · A−1 ∞ .

(4.3.11)

• Frobenius norm condition number, denoted as condF (A), is defined as −1

condF (A) = AF · A

F =

4 i

σi2



σj−2 .

(4.3.12)

j

Consider an over-determined linear equation Ax = b, where A is an m × n matrix with m > n. An over-determined equation has a unique linear least squares (LS) solution given by AH Ax = AH b,

(4.3.13)

i.e., xLS = (AH A)−1 AH b. It should be noticed that when using the LS method to solve the matrix equation Ax = b, a matrix A with a larger condition number may result into a worse solution due to cond2 (AH A) = (cond2 (A))2 . As a comparison, we consider using the QR factorization A = QR (where Q is orthogonal and R is upper triangular) to solve an over-determined equation Ax = b then cond(Q) = 1,

cond(A) = cond(QH A) = cond(R),

(4.3.14)

since QH Q = I. In this case we have QH Ax = QH b ⇒ Rx = QH b. Since cond(A) = cond(R), the QR factorization method Rx = QH b has better numerical stability (i.e., a smaller condition number) than the least squares method xLS = (AH A)−1 AH b. On the more effective methods than QR factorization for solving over-determined matrix equations, we will discuss them after introducing the singular value decomposition.

4.4 Singular Value Decomposition (SVD)

171

4.4 Singular Value Decomposition (SVD) Beltrami (1835–1899) and Jordan (1838–1921) are recognized as the founders of singular value decomposition (SVD): Beltrami in 1873 published the first paper on SVD [2]. One year later, Jordan published his own independent derivation of SVD [23].

4.4.1 Singular Value Decomposition Theorem 4.2 (SVD) Let A ∈ Rm×n (or Cm×n ), then there exist orthogonal (or unitary) matrices U ∈ Rm×m (or U ∈ Cm×m ) and V ∈ Rn×n (or V ∈ Cn×n ) such that   1 O A = UVT (or UVH ),  = , (4.4.1) O O where U = [u1 , . . . , um ] ∈ Cm×m , V = [v1 , . . . , vn ] ∈ Cn×n and  1 = Diag(σ1 , . . . , σl ) with diagonal entries σ1 ≥ · · · ≥ σr > σr+1 = . . . = σl = 0,

(4.4.2)

in which l = min{m, n} and r = rank of A. This theorem was first shown by Eckart and Young [11] in 1939, but the proof by Klema and Laub [26] is simpler. The nonzero values σ1 , . . . , σr together with the zero values σr+1 = · · · = σl = 0 are called the singular values of the matrix A, and u1 , . . . , um and v1 , . . . , vn are known as the left- and right singular vectors, respectively, while U ∈ Cm×m and V ∈ Cn×n are called the left- and right singular vector matrices, respectively. Here are several useful explanations and remarks on singular values and SVD [46]. 1. The SVD of a matrix A can be rewritten as the vector form A=

r 

σi ui vH i .

(4.4.3)

i=1

This expression is sometimes said to be the dyadic decomposition of A [16]. 2. Suppose that the n × n matrix V is unitary. Postmultiply (4.4.1) by V to get AV = U, whose column vectors are given by  Avi =

σi ui , i = 1, 2, . . . , r; 0, i = r + 1, r + 2, . . . , n.

(4.4.4)

172

4 Solution of Linear Systems

3. Suppose that the m × m matrix U is unitary. Premultiply (4.4.1) by UH to yield UH A = V, whose column vectors are given by  uH i A

=

σi vTi , i = 1, 2, . . . , r; 0, i = r + 1, r + 2, . . . , n.

(4.4.5)

4. When the matrix rank r = rank(A) < min{m, n}, because σr+1 = · · · = σh = 0 with h = min{m, n}, the SVD formula (4.4.1) can be simplified to A = Ur  r VH r ,

(4.4.6)

where Ur = [u1 , . . . , ur ], Vr = [v1 , . . . , vr ], and  r = Diag(σ1 , . . . , σr ). Equation (4.4.6) is called the truncated singular value decomposition or the thin singular value decomposition of the matrix A. In contrast, Eq. (4.4.1) is known as the full singular value decomposition. H 5. Premultiplying (4.4.4) by uH i , and noting that ui ui = 1, it is easy to obtain uH i Avi = σi ,

i = 1, 2, . . . , min{m, n},

which can be written in matrix form as   1 O H U AV =  = ,  1 = Diag(σ1 , . . . , σr ). O O

(4.4.7)

(4.4.8)

Equations (4.4.1) and (4.4.8) are two definitive forms of SVD. 6. From (4.4.1) it follows that AAH = U 2 UH . This shows that the singular value σi of an m × n matrix A is the positive square root of the corresponding nonnegative eigenvalue of the matrix product AAH . 7. If the matrix Am×n has rank r, then • the leftmost r columns of the m × m unitary matrix U constitute an orthonormal basis of the column space of the matrix A, i.e., Col(A) = Span{u1 , . . . , ur }; • the leftmost r columns of the n×n unitary matrix V constitute an orthonormal basis of the row space of A or the column space of AH , i.e., Row(A) = Span{v1 , . . . , vr }; • the rightmost n − r columns of V constitute an orthonormal basis of the null space of the matrix A, i.e., Null(A) = Span{vr+1 , . . . , vn }; • the rightmost m − r columns of U constitute an orthonormal basis of the null space of the matrix AH , i.e., Null(AH ) = Span{ur+1 , . . . , um }.

4.4 Singular Value Decomposition (SVD)

173

4.4.2 Properties of Singular Values Theorem 4.3 (Eckart–Young Theorem [11]) If the singular values of A ∈ Cm×n are given by σ1 ≥ σ2 ≥ · · · ≥ σr ≥ 0,

r = rank(A),

then σk = min

E∈Cm×n

  Espec |rank(A + E) ≤ k − 1 ,

k = 1, . . . , r,

(4.4.9)

and there is an error matrix Ek with Ek spec = σk such that rank(A + Ek ) = k − 1,

k = 1, . . . , r.

(4.4.10)

The Eckart–Young theorem shows that the singular value σk is equal to the minimum spectral norm of the error matrix Ek such that the rank of A + Ek is k − 1. An important application of the Eckart–Young theorem is to the best rank-k approximation of the matrix A, where k < r = rank(A). Define Ak =

k 

σi ui vH i ,

k < r;

(4.4.11)

i=1

then Ak is the solution of the optimization problem: Ak = arg min A − X2F ,

k < r,

(4.4.12)

rank(X)=k

and the squared approximation error is given by 2 A − Ak 2F = σk+1 + · · · + σr2 .

(4.4.13)

Theorem 4.4 ([20, 21]) Let A be an m × n matrix with singular values σ1 ≥ · · · ≥ σr , where r = min{m, n}. If the p × q matrix B is a submatrix of A with singular values γ1 ≥ · · · ≥ γmin{p,q} , then σi ≥ γi ,

i = 1, . . . , min{p, q}

(4.4.14)

and γi ≥ σi+(m−p)+(n−q) ,

i ≤ min{p + q − m, p + q − n}.

(4.4.15)

174

4 Solution of Linear Systems

This is the interlacing theorem for singular values. The singular values of a matrix are closely related to its norm, determinant, and condition number. 1. Relationship between Singular Values and Norms: A2 = σ1 = σmax , 1/2  n m   AF = |aij |2 = -UH AV-F = F = σ12 + · · · + σr2 . i=1 j =1

2. Relationship between Singular Values and Determinant: | det(A)| = | det(UVH )| = | det | = σ1 · · · σn .

(4.4.16)

If all the σi are nonzero, then | det(A)| = 0, which shows that A is nonsingular. If there is at least one singular value σi = 0 (i > r), then det(A) = 0, namely A is singular. This is the reason why the σi , i = 1, . . . , min{m, n} are known as the singular values. 3. Relationship between Singular Values and Condition Number: cond2 (A) = σ1 /σp ,

p = min{m, n}.

(4.4.17)

4.4.3 Singular Value Thresholding Consider the truncated SVD of a low-rank matrix W ∈ Rm×n :  = Diag(σ1 , · · · , σr )

W = UVT ,

(4.4.18)

where r = rank(W) / min{m, n}, U ∈ Rm×r , V ∈ Rn×r . Let the threshold value τ ≥ 0, then Dτ (W) = UDτ ()VT

(4.4.19)

is called the singular value thresholding (SVT) of the matrix W, where   Dτ () = soft(, τ ) = Diag (σ1 − τ )+ , · · · , (σr − τ )+ is known as the soft thresholding, and * (σi − τ )+ =

σi − τ,

if σi > τ ;

0,

otherwise

(4.4.20)

4.5 Least Squares Method

175

is the soft thresholding operation. The relationships between the SVT and the SVD are as follows. • If the soft threshold value τ = 0, then the SVT is reduced to the truncated SVD (4.4.18). • All of singular values are soft thresholded by the soft threshold value τ > 0, which just changes the magnitudes of singular values, does not change the left and right singular vector matrices U and V. Theorem 4.5 ([6]) For each soft threshold value μ > 0 and the matrix W ∈ Rm×n , the SVT operator obeys   1 Usoft(, μ)VT = arg min μX∗ + X − W2F , 2 X   1 soft(W, μ) = arg min μX1 + X − W2F , 2 X

(4.4.21) (4.4.22)

where UVT is the SVD of W, and the soft thresholding (shrinkage) operator ⎧ wij − μ, ⎪ ⎪ ⎨ [soft(W, μ)]ij = wij + μ, ⎪ ⎪ ⎩ 0,

wij > μ; wij < −μ;

(4.4.23)

otherwise.

Here wij ∈ R is the (i, j )th entry of W ∈ Rm×n .

4.5 Least Squares Method In over-determined equation, the number of independent equations is larger than the number of independent unknowns, the number of independent equations appears surplus for determining the unique solution. An over-determined matrix equation Ax = b has no exact solution and thus is an inconsistent equation that may in some cases have an approximate solution. There are four commonly used methods for solutions of matrix equations: Least squares method, Tikhonov regularization method, Gauss–Seidel method, and total least squares method. From this section we will introduce these four methods in turn.

4.5.1 Least Squares Solution Consider an over-determined matrix equation Ax = b, where b is an m × 1 data vector, A is an m × n data matrix, and m > n.

176

4 Solution of Linear Systems

Suppose the data vector and the additive observation error or noise exist, i.e., b = b0 + e, where b0 and e are the errorless data vector and the additive error vector, respectively. In order to resist the influence of the error on matrix equation solution, we use a correction vector Δb to “disturb” the data vector b for forcing Ax = b + Δb to compensate the uncertainty existing in the data vector b (noise or error). The idea for solving the matrix equation can be described by the optimization problem   min Δb2 = Ax − b22 = (Ax − b)T (Ax − b) . x

(4.5.1)

Such a method is known as ordinary least squares (OLS) method, simply called the least squares (LS) method. As a matter of fact, the correction vector Δb = Ax − b is just the error vector of both sides of the matrix equation Ax = b. Hence the central idea of the LS method is to find the solution x such that the sum of the squared error Ax − b22 is minimized, namely xˆ LS = arg min Ax − b22 . x

(4.5.2)

In order to derive the analytic solution of x, expand Eq. (4.5.1) to get φ = xT AT Ax − xT AT b − bT Ax + bT b. Find the derivative of φ with respect to x, and let the result equal zero, then dφ = 2AT Ax − 2AT b = 0. dx That is to say, the solution x must meet AT Ax = AT b.

(4.5.3)

The above equation is identifiable or unidentifiable depending on the rank of the m × n matrix A. • Identifiable: When Ax = b is the over-determined equation, it has the unique solution xLS = (AT A)−1 AT b if rank(A) = n,

(4.5.4)

xLS = (AT A)† AT b if rank(A) < n,

(4.5.5)

or

4.5 Least Squares Method

177

where B† denotes the Moore–Penrose inverse matrix of B. In parameter estimation theory, the unknown parameter vector x is said to be uniquely identifiable, if it is uniquely determined. • Unidentifiable: For an under-determined equation Ax = b, if rank(A) = m < n, then different solutions of x give the same value of Ax. Clearly, although the data vector b can provide some information about Ax, we cannot distinguish different parameter vectors x corresponding to the same Ax. Such an unknown vector x is said to be unidentifiable. In parameter estimation, an estimate θˆ of the parameter vector θ is known as an unbiased estimator, if its mathematical expectation is equal to the true unknown parameter vector, i.e., E{θˆ } = θ . Further, an unbiased estimator is called the optimal unbiased estimator if it has the minimum variance. Similarly, for an over-determined matrix equation Aθ = b+e with noisy data vector b, if the mathematical expectation of the LS solution θˆ LS is equal to the true parameter vector θ , i.e., E{θˆ LS } = θ , then θˆ LS is called the optimal unbiased estimator. Theorem 4.6 (Gauss–Markov Theorem) Consider the set of linear equations Ax = b + e,

(4.5.6)

where the m × n matrix A and the n × 1 vector x are, respectively, the constant matrix and the parameter vector; b is the m × 1 vector with random error vector e = [e1 , · · · , em ]T , and its mean vector and covariance matrix are, respectively, E{e} = 0,

cov(e) = E{eeH } = σ 2 I.

ˆ if and Then the n × 1 parameter vector x exists the optimal unbiased solution x, only if rank(A) = n. In this case, the optimal unbiased solution is given by the LS solution xˆ LS = (AH A)−1 AH b,

(4.5.7)

var(ˆxLS ) ≤ var(˜x),

(4.5.8)

and its covariance

where x˜ is any other solution of the matrix equation Ax = b + e. Proof See e.g., [46]. In the Gauss–Markov theorem cov(e) = σ 2 I implies that all components of the additive noise vector e are mutually uncorrelated, and have the same variance σ 2 . Only in this case, the LS solution is unbiased and optimal.

178

4 Solution of Linear Systems

4.5.2 Rank-Deficient Least Squares Solutions In many science and engineering applications, it is usually necessary to use a lowrank matrix to approximate a noisy or disturbed matrix. The following theorem gives an evaluation of the quality of approximation. p T Theorem 4.7 (Low-Rank Approximation) Let A = i ui vi be the SVD i=1 σ  k T m×n , where p = rank(A). If k < p and Ak = of A ∈ R i=1 σi ui vi is the rank-k approximation of A, then the approximation quality can be measured by the Frobenius norm  min

rank(B)=k

A−BF = A−Ak F =

q 

1/2 σi2

,

q = min{m, n}.

(4.5.9)

i=k+1

Proof See, e.g., [10, 21, 30]. For an over-determined and rank-deficient system of linear equations Ax = b, let the SVD of A be given by A = UVH , where  = Diag(σ1 , . . . , σr , 0, . . . , 0). The LS solution xˆ = A† b = V † UH b,

(4.5.10)

where  † = Diag(1/σ1 , . . . , 1/σr , 0, . . . , 0). Equation (4.5.10) can be represented as xLS =

r  (uH i b/σi )vi .

(4.5.11)

i=1

The corresponding minimum residual is given by rLS = AxLS − b2 = -[ur+1 , . . . , um ]H b-2 .

(4.5.12)

Although, in theory, when i > r the true singular values σi = 0, the computed singular values σˆi , i > r, are not usually equal to zero and sometimes even have quite a large perturbation. In these cases, an estimate of the matrix rank r is required. In signal processing and system theory, the rank estimate rˆ is usually called the effective rank. Effective rank can be estimated by one of the following two common methods. 1. Normalized Singular Value Method. Compute the normalized singular values σ¯ i = σˆ i /σˆ 1 , and select the largest integer i satisfying the criterion σ¯ i ≥  as an estimate of the effective rank rˆ . Obviously, this criterion is equivalent to choosing the maximum integer i satisfying σˆ i ≥  σˆ 1

(4.5.13)

4.6 Tikhonov Regularization and Gauss–Seidel Method

179

as rˆ ; here  is a very small positive number, e.g.,  = 0.1 or  = 0.05. 2. Norm Ratio Method. Let an m × n matrix Ak be the rank-k approximation to the original m × n matrix A. Define the Frobenius norm ratio as ν(k) =

Ak F = AF

σ12 + · · · + σk2 σ12 + · · · + σh2

,

h = min{m, n},

(4.5.14)

and choose the minimum integer k satisfying ν(k) ≥ α

(4.5.15)

as the effective rank estimate rˆ , where α is close to 1, e.g., α = 0.997 or 0.998. After the effective rank rˆ is determined via the above two criteria, xˆ LS =

rˆ  (uˆ H ˆ i )ˆvi i b/σ

(4.5.16)

i=1

can be regarded as a reasonable approximation to the LS solution xLS .

4.6 Tikhonov Regularization and Gauss–Seidel Method The LS method is widely used for solving matrix equations and is applicable for many real-world applications in machine learning, neural networks, support vector machines, and evolutionary computation. But, due to its sensitive to the perturbation of the data matrix, the LS methods must be improved in these applications. A simple and efficient way is the well-known Tikhonov regularization.

4.6.1 Tikhonov Regularization When m = n, and A is nonsingular, the solution of matrix equation Ax = b is given by xˆ = A−1 b; and when m > n and Am×n is of full column rank, the solution of the matrix equation is xˆ LS = A† b = (AH A)−1 AH b. The problem is: the data matrix A is often rank deficient in engineering applications. In these cases, the solution xˆ = A−1 b or xˆ LS = (AH A)−1 AH b either diverges, or even exists, it is the bad approximation to the unknown vector x. Even if we happen to find a reasonable approximation of x, the error estimate x − xˆ  ≤ A−1  Aˆx − b or x − xˆ  ≤ A†  Aˆx − b is very disappointing [32]. By observation, it is easily known that the problem lies in the inversion of the covariance matrix AH A of the rank-deficient data matrix A.

180

4 Solution of Linear Systems

As an improvement of the LS cost function 12 Ax − b22 , Tikhonov [39] in 1963 proposed the regularized least squares cost function J (x) =

" 1! Ax − b22 + λx22 , 2

(4.6.1)

where λ ≥ 0 is called the regularization parameter. The conjugate gradient of the cost function J (x) with respect to the argument x is given by " ∂J (x) ∂ ! H H (Ax − b) = (Ax − b) + λx x = AH Ax − AH b + λx. ∂xH ∂xH Let

∂J (x) ∂xH

= 0, then the Tikhonov regularization solution xˆ Tik = (AH A + λI)−1 AH b.

(4.6.2)

This method using (AH A + λI)−1 instead of the direct inversion of the covariance matrix (AH A)−1 is called the Tikhonov regularization method (or simply regularized method). In signal processing and image processing, the regularization method is sometimes known as the relaxation method. The nature of Tikhonov regularization method is: by adding a very small disturbance λ to each diagonal entry of the covariance matrix AH A of rankdeficient matrix A, the inversion of the singular covariance matrix AH A becomes the inversion of a nonsingular matrix AH A + λI, thereby greatly improving the numerical stability of solving the rank-deficient matrix equation Ax = b. Obviously, if the data matrix A is of full column rank, but exists the error or noise, we must adopt the opposite method to the Tikhonov regularization: adding a very small negative disturbance −λ to each diagonal entry of the covariance matrix AH A for suppressing the influence of noise. Such a method using a very small negative disturbance matrix −λI is called the anti-Tikhonov regularization method or anti-regularized method, and its solution is given by xˆ = (AH A − λI)−1 AH b.

(4.6.3)

The total least square method is a typical anti-regularized method and will be discussed later. When the regularization parameter λ varies in the definition interval [0, ∞), the family of solutions for a regularized LS problem is known as its regularization path. Tikhonov regularization solution has the following important properties [25]: 1. Linearity: The Tikhonov regularization LS solution xˆ Tik = (AH A + λI)−1 AH b is the linear function of the observed data vector b. 2. Limit characteristic when λ → 0: When the regularization parameter λ → 0, the Tikhonov regularization LS solution converges to the ordinary LS solution

4.6 Tikhonov Regularization and Gauss–Seidel Method

181

or the Moore–Penrose solution limλ→0 xˆ Tik = xˆ LS = A† b = (AH A)−1 AH b. The solution point xˆ Tik has the minimum 2 -norm among all the feasible points meeting AH (Ax − b) = 0: xˆ Tik =

arg min x2 .

(4.6.4)

AT (b−Ax)=0

3. Limit characteristic when λ → ∞: When λ → ∞, the optimal Tikhonov regularization LS solution converges to a zero vector, i.e., limλ→∞ xˆ Tik = 0. 4. Regularization path: When the regularization parameter λ varies in [0, ∞), the optimal solution of the Tikhonov regularization LS problem is the smooth function of the regularization parameter, i.e., when λ decreases to zero, the optimal solution converges to the Moore–Penrose solution; and when λ increases, the optimal solution converges to zero vector. The Tikhonov regularization can effectively prevent the divergence of the LS solution xˆ LS = (AT A)−1 AT b when A is rank deficient, thereby improves obviously the convergence property of the LS algorithm and the alternative LS algorithm, and is widely applied.

4.6.2 Gauss–Seidel Method A matrix equation Ax = b can be rearranged as a11 x1 a21 x1

= b1 = b2 .. .

+ a22 x2 ..

− a12 x2 − · · · − a1n xn − · · · − a2n xn

· . ·· a(n−1)1 x1 + a(n−2) x2 + · · · + a(n−1)n xn = bn−1 − ann xn an1 x1 + an2 x2 + · · · + ann xn = bn .

Based on the above rearrangement, the Gauss–Seidel method begins with an initial approximation to the solution, x^(0), and then computes an update for the first element of x:

x_1^(1) = (1/a_11) ( b_1 − Σ_{j=2}^{n} a_1j x_j^(0) ).    (4.6.5)

Continuing in this way for the other elements of x, the Gauss–Seidel method gives

x_i^(1) = (1/a_ii) ( b_i − Σ_{j=1}^{i−1} a_ij x_j^(1) − Σ_{j=i+1}^{n} a_ij x_j^(0) ),   for i = 1, . . . , n.    (4.6.6)
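A minimal numerical sketch of the sweep (4.6.6) is given below; it assumes A has nonzero diagonal entries and, for guaranteed convergence, something like strict diagonal dominance. The function name gauss_seidel, the tolerance, and the test matrix are illustrative.

import numpy as np

def gauss_seidel(A, b, x0=None, tol=1e-10, max_iter=500):
    """Solve Ax = b by Gauss-Seidel sweeps (4.6.6); the newest values are used immediately."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float).copy()
    for _ in range(max_iter):
        x_old = x.copy()
        for i in range(n):
            s1 = A[i, :i] @ x[:i]          # already-updated entries x_j^(k+1), j < i
            s2 = A[i, i+1:] @ x_old[i+1:]  # old entries x_j^(k), j > i
            x[i] = (b[i] - s1 - s2) / A[i, i]
        if np.linalg.norm(x - x_old, np.inf) < tol:
            break
    return x

A = np.array([[4.0, 1.0, 0.0], [1.0, 5.0, 2.0], [0.0, 2.0, 6.0]])  # diagonally dominant
b = np.array([1.0, 2.0, 3.0])
print(gauss_seidel(A, b), np.linalg.solve(A, b))  # the two agree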


After obtaining the approximation x^(1) = [x_1^(1), . . . , x_n^(1)]^T, we continue the same kind of iteration for x^(2), x^(3), . . . until x converges.
The Gauss–Seidel method can not only solve a matrix equation but is also applicable to nonlinear optimization problems. Let X_i ⊆ R^{n_i} be the feasible set of the n_i × 1 vector x_i. Consider the nonlinear minimization problem

min_{x∈X} { f(x) = f(x_1, · · · , x_m) },    (4.6.7)

where x ∈ X = X_1 × · · · × X_m ⊆ R^n is the Cartesian product of closed nonempty convex sets X_i ⊆ R^{n_i}, i = 1, · · · , m, and Σ_{i=1}^{m} n_i = n. Equation (4.6.7) is an unconstrained optimization problem with m coupled variable vectors. An efficient approach for solving this class of coupled optimization problems is the block nonlinear Gauss–Seidel method, called simply the GS method [3, 18]. In every iteration of the GS method, m − 1 variable vectors are held fixed and the one remaining variable vector is minimized over. This idea constitutes the basic framework of the GS method for solving the nonlinear unconstrained optimization problem (4.6.7):
1. Initialize the m − 1 variable vectors x_i, i = 2, · · · , m, and let k = 0.
2. Solve the separated sub-optimization problems

   x_i^{k+1} = arg min_{y∈X_i} f(x_1^{k+1}, · · · , x_{i−1}^{k+1}, y, x_{i+1}^{k}, · · · , x_m^{k}),   i = 1, · · · , m.    (4.6.8)

   At the (k + 1)th update of x_i, all of x_1, · · · , x_{i−1} have already been updated to x_1^{k+1}, · · · , x_{i−1}^{k+1}, so these sub-vectors, together with the not-yet-updated x_{i+1}^{k}, · · · , x_m^{k}, are fixed as known vectors.
3. Test whether all m variable vectors have converged. If so, output the optimization results (x_1^{k+1}, · · · , x_m^{k+1}); otherwise, let k ← k + 1, return to Eq. (4.6.8), and continue iterating until the convergence criterion is met.
If the objective function f(x) of the optimization (4.6.7) is an LS error function (for example, ||Ax − b||_2^2), then the GS method is customarily called the alternating least squares (ALS) method.

Example 4.2 Consider the full-rank decomposition of an m × n known data matrix X = AB, where the m × r matrix A has full column rank and the r × n matrix B has full row rank. Let the cost function of the matrix full-rank decomposition be

f(A, B) = (1/2) ||X − AB||_F^2.    (4.6.9)


Then the ALS algorithm first initializes the matrix A. At the (k + 1)th iteration, from the fixed matrix A_k we update the LS solution of B as

B_{k+1} = (A_k^T A_k)^{-1} A_k^T X.    (4.6.10)

Next, from the transpose of the matrix decomposition, X^T = B^T A^T, we can immediately update the LS solution of A^T as

A_{k+1}^T = (B_{k+1} B_{k+1}^T)^{-1} B_{k+1} X^T.    (4.6.11)

These two LS updates are performed alternately. Once the whole ALS algorithm has converged, the optimization result of the matrix decomposition is obtained.
The fact that the GS algorithm may not converge was observed by Powell in 1973 [34], who called it the "circle phenomenon" of the GS method. Many simulation experiments have also shown [28, 31] that, even when it converges, the iterative process of the ALS method easily falls into a "swamp": an unusually large number of iterations leads to a very slow convergence rate. A simple and effective way to avoid the circle and swamp phenomena of the GS method is to apply Tikhonov regularization to the objective function of the optimization problem (4.6.7), namely to regularize the separated sub-optimization step (4.6.8) as

x_i^{k+1} = arg min_{y∈X_i} { f(x_1^{k+1}, · · · , x_{i−1}^{k+1}, y, x_{i+1}^{k}, · · · , x_m^{k}) + (1/2) τ_i ||y − x_i^{k}||_2^2 },    (4.6.12)

where i = 1, · · · , m. The above algorithm is called the proximal point version of the GS method [1, 3], abbreviated as the PGS method. The role of the regularization term ||y − x_i^{k}||_2^2 is to force the updated vector x_i^{k+1} = y to stay close to x_i^{k} rather than deviate too much, thus avoiding violent oscillation of the iterative process and preventing divergence of the algorithm. The GS or PGS method is said to be well defined if each sub-optimization problem has an optimal solution [18].

Theorem 4.8 ([18]) If the PGS method is well defined and the sequence {x^k} has limit points, then every limit point x̄ of {x^k} is a critical point of the optimization problem (4.6.7).

This theorem shows that the convergence performance of the PGS method is better than that of the GS method.


Many simulation experiments show [28] that, to achieve the same error, the number of iterations of the GS method in a swamp is unusually large, whereas the PGS method tends to converge quickly. The PGS method is also called the regularized Gauss–Seidel method in some literature. The ALS method and the regularized ALS method have important applications in nonnegative matrix factorization and tensor analysis.
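As a concrete illustration of Example 4.2 and of the proximal regularization in (4.6.12), the following sketch alternates the two LS updates (4.6.10)–(4.6.11); setting tau > 0 gives the regularized (PGS-style) variant. The function name als_factorize, the value of tau, and the random test data are illustrative assumptions, not from the text.

import numpy as np

def als_factorize(X, r, tau=0.0, n_iter=200, seed=0):
    """Alternating LS for X ~= A B, with optional proximal (Tikhonov) term tau."""
    m, n = X.shape
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, r))
    B = rng.standard_normal((r, n))
    I = np.eye(r)
    for _ in range(n_iter):
        # B-update (4.6.10), with proximal term: (A^T A + tau I) B = A^T X + tau B_old
        B = np.linalg.solve(A.T @ A + tau * I, A.T @ X + tau * B)
        # A-update (4.6.11), with proximal term: (B B^T + tau I) A^T = B X^T + tau A_old^T
        A = np.linalg.solve(B @ B.T + tau * I, B @ X.T + tau * A.T).T
    return A, B

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 4)) @ rng.standard_normal((4, 20))  # exact rank-4 data
A, B = als_factorize(X, r=4, tau=1e-3)
print(np.linalg.norm(X - A @ B, 'fro'))  # close to zero after convergence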

4.7 Total Least Squares Method

For a matrix equation A_{m×n} x_n = b_m, the LS method, the Tikhonov regularization method, and the Gauss–Seidel method all return a solution with n parameters. However, when the matrix A is rank deficient, the unknown parameter vector x contains only r independent parameters; the remaining parameters depend linearly on these r independent parameters. In many engineering applications we naturally want to find the r independent parameters rather than all n parameters containing redundant components. In other words, we want to estimate only the principal parameters and to eliminate the minor components. This problem can be solved via the low-rank total least squares (TLS) method.
The earliest ideas about TLS can be traced back to the 1901 paper of Pearson [33], who considered an approximate method for solving the matrix equation Ax = b when both A and b contain errors. It was only in 1980 that Golub and Van Loan [15] first gave a thorough analysis from the viewpoint of numerical analysis and formally named the method total least squares. In mathematical statistics, this method is called orthogonal regression or errors-in-variables regression [14]. In system identification, the TLS method is called the characteristic vector method or the Koopmans–Levin method [42]. The TLS method is now widely used in statistics, physics, economics, biology and medicine, signal processing, automatic control, system science, artificial intelligence, and many other disciplines and fields.

4.7.1 Total Least Squares Solution

Let A_0 and b_0 denote the unobservable error-free data matrix and error-free data vector, respectively. The actually observed data matrix and data vector are, respectively,

A = A_0 + E,   b = b_0 + e,    (4.7.1)

where E and e denote the error data matrix and the error data vector, respectively. The basic idea of the TLS is to use not only a correction vector Δb to perturb the data vector b, but also a correction matrix ΔA to perturb the data matrix A,


making a joint compensation for the errors or noise in both A and b: b + Δb = b_0 + e + Δb → b_0 and A + ΔA = A_0 + E + ΔA → A_0, so that the solution of the noisy matrix equation is transformed into the solution of the error-free matrix equation:

(A + ΔA)x = b + Δb  ⇒  A_0 x = b_0.    (4.7.2)

Naturally, we want the correction data matrix and the correction data vector to be as small as possible. Hence the TLS problem can be expressed as the constrained optimization problem

TLS:   min_{ΔA, Δb, x} ||[ΔA, Δb]||_F^2   subject to  (A + ΔA)x = b + Δb    (4.7.3)

or

TLS:   min_z ||D||_F^2   subject to  Dz = −Bz,    (4.7.4)

where D = [ΔA, Δb], B = [A, b], and z = [x^T, −1]^T is an (n + 1) × 1 vector.

Under the assumption ||z||_2 = 1, from ||D||_2 ≤ ||D||_F and ||D||_2 = sup_{||z||_2=1} ||Dz||_2, we have min_z ||D||_F^2 = min_z ||Dz||_2^2, and thus (4.7.4) can be rewritten as

TLS:   min_z ||Bz||_2^2   subject to  ||z||_2 = 1    (4.7.5)

due to Dz = −Bz.

Case 1: Single Smallest Singular Value

The singular value σ_n of B is significantly larger than σ_{n+1}, i.e., there is only one smallest singular value. In this case, the TLS problem (4.7.5) is easily solved via the Lagrange multiplier method. To this end, define the objective function

J(z) = ||Bz||_2^2 + λ(1 − z^H z),    (4.7.6)

where λ is the Lagrange multiplier. Noticing that ||Bz||_2^2 = z^H B^H B z, from ∂J(z)/∂z* = 0 it follows that

B^H B z = λ z.    (4.7.7)

This shows that we should select as the Lagrange multiplier the smallest eigenvalue λ_min of the matrix B^H B = [A, b]^H [A, b] (i.e., the square of the smallest singular value of B), while the TLS solution vector z is the eigenvector corresponding to the smallest eigenvalue λ_min of [A, b]^H [A, b]. In other words, the TLS solution vector [x^T, −1]^T solves the Rayleigh quotient minimization problem

min_x J(x) = ( [x^T, −1]^* [A, b]^H [A, b] [x^T, −1]^T ) / ( [x^T, −1]^* [x^T, −1]^T ) = ||Ax − b||_2^2 / (||x||_2^2 + 1).    (4.7.8)

Let the SVD of the m × (n + 1) augmented matrix B be B = UΣV^H, where the singular values are arranged in the order σ_1 ≥ · · · ≥ σ_{n+1} and their corresponding right singular vectors are v_1, · · · , v_{n+1}. According to the above analysis, the TLS solution is z = v_{n+1}, namely

x_TLS = − (1 / v(n + 1, n + 1)) [v(1, n + 1), · · · , v(n, n + 1)]^T,    (4.7.9)

where v(i, n + 1) is the ith entry of the (n + 1)th column of V.

Remark If the augmented data matrix is given by B = [−b, A], then the TLS solution is provided by

x_TLS = (1 / v(1, n + 1)) [v(2, n + 1), · · · , v(n + 1, n + 1)]^T.    (4.7.10)

Case 2: Multiple Smallest Singular Values

In this case B has multiple smallest singular values, i.e., several small singular values are repeated or very close. Letting

σ_1 ≥ σ_2 ≥ · · · ≥ σ_p > σ_{p+1} ≈ · · · ≈ σ_{n+1},    (4.7.11)

the right singular vector matrix V_1 = [v_{p+1}, v_{p+2}, · · · , v_{n+1}] gives n − p + 1 possible TLS solutions x_i = −y_{p+i}/α_{p+i}, i = 1, · · · , n + 1 − p. To find a unique TLS solution, one can apply a Householder transformation Q to V_1 such that

V_1 Q = [ y   ×
          α   0 · · · 0 ].    (4.7.12)


The unique TLS solution is the minimum norm solution given by xˆ TLS = y/α. This algorithm was proposed by Golub and Van Loan [15], as shown in Algorithm 4.5.

Algorithm 4.5 TLS algorithm for minimum norm solution
1. input: A ∈ C^{m×n}, b ∈ C^m, α > 0.
2. repeat
3.   Compute B = [A, b] = UΣV^H, and save V and all singular values.
4.   Determine the number p of principal singular values.
5.   Put V_1 = [v_{p+1}, · · · , v_{n+1}], and compute the Householder transformation
         V_1 Q = [ y   ×
                   α   0 · · · 0 ],
     where α is a scalar and × denotes a block that is not of interest.
6.   exit if α = 0.
7.   return p ← p − 1
8. output: x_TLS = −y/α.

The above minimum norm solution x_TLS = y/α contains n parameters rather than only the p independent principal parameters, because rank(A) = p < n. In signal processing, system theory, and artificial intelligence, a unique TLS solution with no redundant parameters is of more interest; this is the optimal LS approximate solution.
To derive the optimal LS approximate solution, let the m × (n + 1) matrix B̂ = U Σ_p V^H be the optimal rank-p approximation of the augmented matrix B, where Σ_p = Diag(σ_1, · · · , σ_p, 0, · · · , 0). Denote by B̂_j^{(p)} the m × (p + 1) submatrix of the m × (n + 1) optimal approximate matrix B̂:

B̂_j^{(p)}: submatrix formed by the jth to the (j + p)th columns of B̂.    (4.7.13)

Clearly, there are (n + 1 − p) such submatrices B̂_1^{(p)}, B̂_2^{(p)}, · · · , B̂_{n+1−p}^{(p)}. As stated before, the fact that the efficient rank of B equals p means that p components of the parameter vector x are linearly independent. Let the (p + 1) × 1 vector a = [x^{(p)T}, −1]^T, where x^{(p)} is the column vector consisting of the p linearly independent unknown parameters of the vector x. Then the original TLS problem becomes the solution of the following (n + 1 − p) TLS problems:

B̂_j^{(p)} a = 0,   j = 1, 2, · · · , n + 1 − p,    (4.7.14)


or, equivalently, the solution of one synthetic TLS problem

[ B̂(1 : p + 1)
  ⋮
  B̂(n + 1 − p : n + 1) ] a = 0,    (4.7.15)

where B̂(i : p + i) = B̂_i^{(p)} is defined in (4.7.13). It is not difficult to show that

B̂(i : p + i) = Σ_{k=1}^{p} σ_k u_k (v_i^k)^H,    (4.7.16)

where v_i^k is a windowed segment of the kth column vector of V, defined as

v_i^k = [v(i, k), v(i + 1, k), · · · , v(i + p, k)]^T.    (4.7.17)

Here v(i, k) is the (i, k)th entry of V. According to the least squares principle, finding the LS solution of Eq. (4.7.15) is equivalent to minimizing the measure (or cost) function

f(a) = [B̂(1 : p + 1)a]^H B̂(1 : p + 1)a + [B̂(2 : p + 2)a]^H B̂(2 : p + 2)a + · · · + [B̂(n + 1 − p : n + 1)a]^H B̂(n + 1 − p : n + 1)a
     = a^H [ Σ_{i=1}^{n+1−p} [B̂(i : p + i)]^H B̂(i : p + i) ] a.    (4.7.18)

Define the (p + 1) × (p + 1) matrix

S^{(p)} = Σ_{i=1}^{n+1−p} [B̂(i : p + i)]^H B̂(i : p + i);    (4.7.19)

then the measure function can be written simply as

f(a) = a^H S^{(p)} a.    (4.7.20)

The minimizing vector a of the measure function f(a) is determined from ∂f(a)/∂a* = 0 as

S^{(p)} a = α e_1,    (4.7.21)


where e_1 = [1, 0, · · · , 0]^T, and the constant α > 0 represents the error energy. From (4.7.19) and (4.7.16) we have

S^{(p)} = Σ_{j=1}^{p} Σ_{i=1}^{n+1−p} σ_j^2 v_i^j (v_i^j)^H.    (4.7.22)

Solving the matrix equation (4.7.21) is simple. Let S^{−(p)} denote the inverse of S^{(p)}. Then the solution vector a depends only on the first column of the inverse matrix S^{−(p)}. It follows easily that the ith entry of x^{(p)} = [x_TLS(1), · · · , x_TLS(p)]^T in the TLS solution vector a = [x^{(p)T}, −1]^T is given by

x_TLS(i) = −S^{−(p)}(i, 1) / S^{−(p)}(p + 1, 1),   i = 1, · · · , p.    (4.7.23)

This solution is known as the optimal least squares approximate solution. Because the number of parameters of this solution equals the efficient rank, it is also called the low-rank TLS solution [5]. Notice that if the augmented matrix is B = [−b, A], then

x_TLS(i) = S^{−(p)}(i + 1, 1) / S^{−(p)}(1, 1),   i = 1, 2, · · · , p.    (4.7.24)

In summary, given A ∈ C^{m×n} and b ∈ C^m, the low-rank SVD-TLS algorithm consists of the following steps [5]:
1. Compute the SVD B = [A, b] = UΣV^H, and save V.
2. Determine the efficient rank p of B.
3. Use (4.7.22) and (4.7.17) to calculate the (p + 1) × (p + 1) matrix S^{(p)}.
4. Compute the inverse S^{−(p)} and the total least squares solution x_TLS(i) = −S^{−(p)}(i, 1)/S^{−(p)}(p + 1, 1), i = 1, · · · , p.
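For the single-smallest-singular-value case, the basic TLS solution (4.7.9) can be obtained directly from the SVD of the augmented matrix, and the closed form (4.7.27) given later provides a cross-check. The following minimal NumPy sketch illustrates this; the function name tls_solve, the noise level, and the test data are illustrative assumptions.

import numpy as np

def tls_solve(A, b):
    """Basic TLS solution (4.7.9): scale the last right singular vector of B = [A, b]."""
    B = np.column_stack([A, b])
    _, s, Vh = np.linalg.svd(B)
    v = Vh[-1, :]                     # right singular vector of the smallest singular value
    return -v[:-1] / v[-1], s[-1]

rng = np.random.default_rng(0)
A0 = rng.standard_normal((50, 3))
x_true = np.array([1.0, -2.0, 0.5])
A = A0 + 0.01 * rng.standard_normal(A0.shape)      # noisy data matrix
b = A0 @ x_true + 0.01 * rng.standard_normal(50)   # noisy data vector

x_tls, sigma = tls_solve(A, b)
x_check = np.linalg.solve(A.T @ A - sigma**2 * np.eye(3), A.T @ b)  # closed form (4.7.27)
print(x_tls, x_check)  # the two agree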

4.7.2 Performance of the TLS Solution

The TLS solution has two interesting interpretations: one is its geometric interpretation [15], and the other is its closed-form solution [43].

Geometric Interpretation of the TLS Solution Let a_i^T be the ith row of the matrix A and b_i be the ith entry of the vector b. Then the TLS solution x_TLS is the vector minimizing

||Ax − b||_2^2 / (||x||_2^2 + 1) = Σ_{i=1}^{m} |a_i^T x − b_i|^2 / (x^T x + 1),    (4.7.25)


where |a_i^T x − b_i| / (x^T x + 1) is the distance from the point [a_i^T, b_i]^T ∈ C^{n+1} to the nearest point in the subspace P_x, and the subspace P_x is defined as

P_x = { [a^T, b]^T : a ∈ C^{n×1}, b ∈ C, b = x^T a }.    (4.7.26)

Hence the TLS solution can be interpreted via the subspace P_x [15]: the sum of squared distances from the points [a_i^T, b_i]^T to the subspace P_x is minimized.

Closed-Form Solution of the TLS Problem If the singular values of the augmented matrix B are σ_1 ≥ · · · ≥ σ_{n+1}, then the TLS solution can be expressed as [43]

x_TLS = (A^H A − σ_{n+1}^2 I)^{-1} A^H b.    (4.7.27)

Compared with the Tikhonov regularization method, the TLS is a kind of anti-regularization method and can be interpreted as a least squares method with noise removal: it first removes the noise-contribution term σ_{n+1}^2 I from the covariance matrix A^H A, and then inverts A^H A − σ_{n+1}^2 I to obtain the LS solution.
Letting the noisy data matrix be A = A_0 + E, its covariance matrix is A^H A = A_0^H A_0 + E^H A_0 + A_0^H E + E^H E. Obviously, when the error matrix E has zero mean, the mathematical expectation of the covariance matrix is given by

E{A^H A} = E{A_0^H A_0} + E{E^H E} = A_0^H A_0 + E{E^H E}.

If the column vectors of the error matrix are statistically independent and have the same variance, i.e., E{E^T E} = σ^2 I, then the smallest eigenvalue λ_{n+1} = σ_{n+1}^2 of the (n + 1) × (n + 1) covariance matrix [A, b]^H [A, b] is the square of the smallest singular value of the error matrix. Because the square of the singular value σ_{n+1} happens to reflect the common variance σ^2 of the column vectors of the error matrix, the covariance matrix A_0^H A_0 of the error-free data matrix can be retrieved from A^H A − σ_{n+1}^2 I, namely A^H A − σ_{n+1}^2 I = A_0^H A_0. In other words, the TLS method can effectively restrain the influence of the unknown error matrix.
It should be pointed out that the main difference between the TLS method and the Tikhonov regularization method for solving the matrix equation A_{m×n} x_n = b_m is this: the TLS solution can contain only the p = rank([A, b]) principal parameters and exclude the redundant parameters, whereas the Tikhonov regularization method can only provide all n parameters, including the redundant ones.


4.7.3 Generalized Total Least Squares

The ordinary LS, the Tikhonov regularization, and the TLS methods can be derived and explained within a unified theoretical framework [46]. Consider the minimization problem

min_{ΔA, Δb, x} { ||[ΔA, Δb]||_F^2 + λ||x||_2^2 }   subject to  (A + αΔA)x = b + βΔb,    (4.7.28)

where α and β are, respectively, the weighting coefficients of the perturbation ΔA of the data matrix A and of the perturbation Δb of the data vector b, and λ is the Tikhonov regularization parameter. The above minimization problem is called the generalized total least squares (GTLS) problem [46].
The constraint condition (A + αΔA)x = b + βΔb can be represented as

( [α^{-1} A, β^{-1} b] + [ΔA, Δb] ) [αx^T, −β]^T = 0.

If we let D = [ΔA, Δb] and z = [αx^T, −β]^T, then the above equation becomes

Dz = −[α^{-1} A, β^{-1} b] z.    (4.7.29)

Under the assumption z^H z = 1, we have min ||D||_F^2 = min ||Dz||_2^2 = min ||[α^{-1} A, β^{-1} b] z||_2^2, and thus the solution to the GTLS problem (4.7.28) can be written as

x̂_GTLS = arg min_z { ( ||[α^{-1} A, β^{-1} b] z||_2^2 + λ||x||_2^2 ) / (z^H z) }.    (4.7.30)

Noticing that z^H z = α^2 x^H x + β^2 and [α^{-1} A, β^{-1} b] z = [α^{-1} A, β^{-1} b] [αx^T, −β]^T = Ax − b,


the GTLS solution in (4.7.30) can be expressed as

x̂_GTLS = arg min_x { ( ||Ax − b||_2^2 + λ||x||_2^2 ) / ( α^2 ||x||_2^2 + β^2 ) }.    (4.7.31)

The following is a comparison of the ordinary LS method, the Tikhonov regularization, and the TLS method.
1. Comparison of optimization problems
   • The ordinary least squares: α = 0, β = 1, and λ = 0, which gives

     min_{ΔA, Δb, x} ||Δb||_2^2   subject to  Ax = b + Δb.    (4.7.32)

   • The Tikhonov regularization: α = 0, β = 1, and λ > 0, which gives

     min_{ΔA, Δb, x} { ||Δb||_2^2 + λ||x||_2^2 }   subject to  Ax = b + Δb.    (4.7.33)

   • The total least squares: α = β = 1 and λ = 0, which yields

     min_{ΔA, Δb, x} ||[ΔA, Δb]||_F^2   subject to  (A + ΔA)x = b + Δb.    (4.7.34)

2. Comparison of solution vectors: When the weighting coefficients α, β and the Tikhonov regularization parameter λ take suitable values, the GTLS solution (4.7.31) gives the following results:
   • x̂_LS = (A^H A)^{-1} A^H b   (α = 0, β = 1, λ = 0),
   • x̂_Tik = (A^H A + λI)^{-1} A^H b   (α = 0, β = 1, λ > 0),
   • x̂_TLS = (A^H A − λI)^{-1} A^H b   (α = 1, β = 1, λ = 0).

3. Comparison of perturbation methods
   • Ordinary LS method: it uses a possibly small correction term Δb = Ax − b to perturb the data vector b so that b + Δb ≈ b_0.
   • Tikhonov regularization method: it adds the same perturbation term λ > 0 to every diagonal entry of the matrix A^H A to avoid the numerical instability of the LS solution (A^H A)^{-1} A^H b.
   • TLS method: by subtracting the perturbation matrix λI, the noise or disturbance in the covariance matrix of the original data matrix is restrained.
4. Comparison of application ranges
   • The LS method is applicable when the data matrix A has full column rank and the data vector b contains i.i.d. Gaussian errors.


   • The Tikhonov regularization method is applicable when the data matrix A is column-rank deficient.
   • The TLS method is applicable when the data matrix A has full column rank and both A and the data vector b contain i.i.d. Gaussian errors.

4.8 Solution of Under-Determined Systems

A full row rank under-determined matrix equation A_{m×n} x_{n×1} = b_{m×1} with m < n has infinitely many solutions. In this case, letting x = A^H y gives AA^H y = b ⇒ y = (AA^H)^{-1} b ⇒ x = A^H (AA^H)^{-1} b. This unique solution is known as the minimum norm solution. However, such a solution has little application in engineering. We are instead interested in a sparse solution x with a small number of nonzero entries.
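A quick numerical illustration of the minimum norm solution x = A^H (AA^H)^{-1} b (the variable names and random data are illustrative); note that it is generally dense, which motivates the sparse formulations below.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 20))          # m < n, full row rank
b = rng.standard_normal(5)

x_mn = A.T @ np.linalg.solve(A @ A.T, b)  # minimum norm solution
print(np.allclose(A @ x_mn, b))           # True: it satisfies Ax = b
print(np.count_nonzero(np.abs(x_mn) > 1e-12))  # typically all 20 entries are nonzero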

4.8.1 ℓ1-Norm Minimization

To seek the sparsest solution with the fewest nonzero elements, we would ask the following two questions.
• Can it ever be unique?
• How can the sparsest solution be found?
For any positive number p > 0, the ℓp-norm of the vector x is defined as

||x||_p = ( Σ_{i∈support(x)} |x_i|^p )^{1/p}.    (4.8.1)

Then the ℓ0-norm of an n × 1 vector x can be defined as

||x||_0 = lim_{p→0} ||x||_p^p = lim_{p→0} Σ_{i=1}^{n} |x_i|^p = Σ_{i=1}^{n} 1(x_i ≠ 0) = #{i | x_i ≠ 0},    (4.8.2)

where #{i | x_i ≠ 0} denotes the number of nonzero elements of x. Thus if ||x||_0 ≪ n, then x is sparse. The core problem of sparse representation is the ℓ0-norm minimization

(P0)   min_x ||x||_0   subject to  b = Ax,    (4.8.3)

where A ∈ R^{m×n}, x ∈ R^n, b ∈ R^m.


As the observation signal is usually contaminated by noise, the equality constraint in the above optimization problem should be relaxed to the ℓ0-norm minimization with an inequality constraint that allows a certain error perturbation ε ≥ 0:

min_x ||x||_0   subject to  ||Ax − b||_2 ≤ ε.    (4.8.4)

A key term coined and defined in [9] that is crucial for the study of uniqueness is the sparsity of the matrix A.

Definition 4.7 (Sparsity [9]) Given a matrix A, its sparsity, denoted σ = spark(A), is defined as the smallest possible number σ such that there exists a subgroup of σ columns from A that are linearly dependent.

The sparsity gives a simple criterion for uniqueness of the sparse solution of an under-determined system of linear equations Ax = b, as stated below.

Theorem 4.9 ([9, 17]) If a system of linear equations Ax = b has a solution x obeying ||x||_0 < spark(A)/2, this solution is necessarily the sparsest solution.

The index set of nonzero elements of the vector x = [x_1, · · · , x_n]^T is called its support, denoted support(x) = {i | x_i ≠ 0}. The length of the support (i.e., the number of nonzero elements) is measured by the ℓ0-norm:

||x||_0 = |support(x)|.    (4.8.5)

A vector x ∈ R^n is said to be K-sparse if ||x||_0 ≤ K, where K ∈ {1, · · · , n}. The set of K-sparse vectors is denoted by

Σ_K = { x ∈ R^{n×1} : ||x||_0 ≤ K }.    (4.8.6)

If x̂ ∈ Σ_K, then the vector x̂ ∈ R^n is known as a K-sparse approximation. Clearly, there is a close relationship between the ℓ0-norm definition (4.8.5) and the ℓp-norm definition (4.8.1): when p → 0, ||x||_0 = lim_{p→0} ||x||_p^p. Because ||x||_p is a convex function if and only if p ≥ 1, the ℓ1-norm is the convex objective function closest to the ℓ0-norm. From the viewpoint of optimization, the ℓ1-norm is therefore said to be the convex relaxation of the ℓ0-norm, and the ℓ0-norm minimization problem (P0) in (4.8.3) can be transformed into the following convexly relaxed ℓ1-norm minimization problem:

(P1)   min_x ||x||_1   subject to  b = Ax.    (4.8.7)

This is a convex optimization problem, because the ℓ1-norm ||x||_1, as the objective function, is itself a convex function, and the equality constraint b = Ax is an affine function.


Due to observation noise, the equality constrained optimization problem (P1) should be relaxed to the inequality constrained optimization problem below:

(P10)   min_x ||x||_1   subject to  ||b − Ax||_2 ≤ ε.    (4.8.8)

The ℓ1-norm minimization above is also called basis pursuit (BP). It is a quadratically constrained linear programming (QCLP) problem. If x_1 is the solution to (P1) and x_0 is the solution to (P0), then [8]

||x_1||_1 ≤ ||x_0||_1,    (4.8.9)

because x_0 is only a feasible solution to (P1), while x_1 is the optimal solution to (P1). The direct relationship between x_0 and x_1 is given by Ax_1 = Ax_0.
Similar to the inequality ℓ0-norm minimization (4.8.4), the inequality ℓ1-norm minimization (4.8.8) also has two variants:
• If x is constrained to be a K-sparse vector, the inequality ℓ1-norm minimization becomes the ℓ1-constrained ℓ2-norm minimization

(P11)   min_x (1/2) ||b − Ax||_2^2   subject to  ||x||_1 ≤ q.    (4.8.10)

This is a quadratic programming (QP) problem.
• Using the Lagrange multiplier method, the inequality constrained ℓ1-norm minimization problem (P11) becomes

(P12)   min_{λ,x} (1/2) ||b − Ax||_2^2 + λ||x||_1.    (4.8.11)

This optimization is called basis pursuit denoising (BPDN) [7]. The optimization problems (P10) and (P12) are, respectively, called the error-constrained ℓ1-norm minimization and the ℓ1-penalty minimization [41]. The ℓ1-penalty minimization is also known as regularized ℓ1 linear programming or ℓ1-norm regularized least squares. The Lagrange multiplier λ acts as a regularization parameter controlling the sparseness of the solution: the greater λ is, the sparser the solution x is. When the regularization parameter λ is large enough, the solution x is the zero vector. As λ gradually decreases, the sparsity of the solution vector x gradually decreases as well, and as λ is reduced to 0 the solution vector x becomes the vector minimizing ||b − Ax||_2^2. That is to say, λ > 0 balances the squared-error cost (1/2)||b − Ax||_2^2 and the ℓ1-norm cost ||x||_1 in the twin-objective function

J(λ, x) = (1/2) ||b − Ax||_2^2 + λ||x||_1.    (4.8.12)
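A standard way to solve the ℓ1-penalized problem (4.8.11)/(4.8.12) is the iterative soft-thresholding algorithm (ISTA), a proximal gradient method (cf. Sect. 3.6). The short sketch below is only an illustration: the step size 1/L with L the squared spectral norm of A, and the names ista and soft_threshold, are assumptions, not from the text.

import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1 (entrywise soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam, n_iter=500):
    """Minimize (1/2)||b - Ax||_2^2 + lam*||x||_1 by iterative soft thresholding."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient A^T(Ax - b)
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - (A.T @ (A @ x - b)) / L, lam / L)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 100))
x_true = np.zeros(100); x_true[[3, 40, 77]] = [1.5, -2.0, 1.0]   # 3-sparse ground truth
b = A @ x_true + 0.01 * rng.standard_normal(30)
x_hat = ista(A, b, lam=0.1)
print(np.flatnonzero(np.abs(x_hat) > 0.1))   # recovered support, ideally [3, 40, 77]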


4.8.2 Lasso

The solution of a system of sparse equations is closely related to regression analysis. Minimizing the least squared error alone may lead to sensitive solutions, and many regularization methods have been proposed to decrease this sensitivity. Among them, Tikhonov regularization [40] and the Lasso [12, 38] are two widely known and cited algorithms, as pointed out in [44]. The regression analysis problem is one of the fundamental problems in fields such as statistics, supervised machine learning, and optimization.
In order to reduce the computational complexity of solving the optimization problem (P1) directly, consider a linear regression problem: given an observed data vector b ∈ R^m and an observed data matrix A ∈ R^{m×n}, find a fitting coefficient vector x ∈ R^n such that

b̂_i = x_1 a_i1 + x_2 a_i2 + · · · + x_n a_in,   i = 1, . . . , m,    (4.8.13)

or

b̂ = Σ_{i=1}^{n} x_i a_i = Ax,    (4.8.14)

where x = [x_1, . . . , x_n]^T, b = [b_1, . . . , b_m]^T, and

A = [a_1, . . . , a_n] = [ a_11 · · · a_1n
                           ⋮   ⋱   ⋮
                          a_m1 · · · a_mn ].

As a preprocessing for the linear regression, it is assumed that

Σ_{i=1}^{m} b_i = 0,   Σ_{i=1}^{m} a_ij = 0,   Σ_{i=1}^{m} a_ij^2 = 1,   j = 1, . . . , n,    (4.8.15)

and that the column vectors of the input matrix A are linearly independent. The preprocessed input matrix A is called an orthonormal input matrix, its column vector a_i is known as a predictor, and the vector x is simply called the coefficient vector. Tibshirani [38] proposed in 1996 the least absolute shrinkage and selection operator (Lasso) algorithm for solving the linear regression. The basic idea of the Lasso is that, under the constraint that the ℓ1-norm of the coefficient vector does not


exceed an upper bound q, the sum of squared prediction errors is minimized, namely

Lasso:   min_x ||b − Ax||_2^2   subject to  ||x||_1 = Σ_{i=1}^{n} |x_i| ≤ q.    (4.8.16)

Obviously, the Lasso model and the QP problem (4.8.10) have exactly the same form. The bound q is a tuning parameter. When q is large enough, the constraint has no effect on x, and the solution is just the usual multiple linear least squares regression of x on a_i1, . . . , a_in and b_i, i = 1, . . . , m. For smaller values of q (q ≥ 0), however, some of the coefficients x_j become exactly zero, so choosing a suitable q leads to a sparse coefficient vector x. The regularized Lasso problem is given by

Regularized Lasso:   min_x ||b − Ax||_2^2 + λ||x||_1.    (4.8.17)

The Lasso problem involves ℓ1-norm constrained fitting for both statistics and data mining. With the aid of two basic functions, the Lasso performs shrinkage and selection for linear regression.
• Contraction function: the Lasso shrinks the range of parameters to be estimated; only a small number of selected parameters are estimated at each step.
• Selection function: the Lasso automatically selects a very small subset of variables for the linear regression, yielding a sparse solution.
The Lasso method achieves better prediction accuracy by shrinkage and variable selection. The Lasso has many generalizations; here are a few examples. In machine learning, sparse kernel regression is referred to as the generalized Lasso [36]. A multidimensional shrinkage-thresholding method is known as the group Lasso [35]. A sparse multiview feature selection method via low-rank analysis is called the MRM-Lasso (multiview rank minimization-based Lasso) [45]. The distributed Lasso solves distributed sparse linear regression problems [29].

4.8.3 LARS

LARS is the abbreviation of least angle regression and is a stepwise regression method. Stepwise regression is also known as "forward stepwise regression." Given a collection of possible predictors X = {x_1, . . . , x_m}, the aim of LARS is to identify


the fitting vector most correlated with the response vector, i.e., the one for which the angle between the two vectors is minimized.
The LARS algorithm contains two basic steps.
1. The first step selects the predictor having the largest absolute correlation with the response vector y, say x_{j1} = arg max_i |corr(y, x_i)|, and performs a simple linear regression of y on x_{j1}: y = β_1 x_{j1} + r, where r is the residual vector after removing x_{j1} from X. The residual vector r is orthogonal to x_{j1}. When some predictors happen to be highly correlated with x_{j1}, and hence are approximately orthogonal to r, these predictors may be eliminated, i.e., the residual vector r will carry no information about them.
2. In the second step, r is regarded as the new response. LARS selects the predictor having the largest absolute correlation with the new response r, say x_{j2}, and performs the simple linear regression r ← r − β_2 x_{j2}. This selection process is repeated until no predictor has any correlation with r. After k selection steps, one gets a set of predictors x_{j1}, . . . , x_{jk} that are then used in the usual way to construct a k-parameter linear model.
This selection regression is an aggressive fitting technique that can be overly greedy [12]. Stepwise regression may be said to under-regress, because it can eliminate useful predictors that happen to be correlated with the current predictor. To avoid this under-regression in forward stepwise regression, the LARS algorithm allows "stagewise regression" to be implemented using fairly large steps. Algorithm 4.6 shows the LARS algorithm developed in [12].
Stagewise regression, also called "forward stagewise regression," is a much more cautious version of stepwise regression and creates a coefficient profile as follows: at each step it increments the coefficient of the variable most correlated with the current residual by an amount ±ε, with the sign determined by the sign of the correlation, as shown below [19].
1. Start with r = y and β_1 = β_2 = · · · = β_p = 0.
2. Find the predictor x_j most correlated with r.
3. Update β_j ← β_j + δ_j, where δ_j = ε · sign[corr(r, x_j)].
4. Update r ← r − δ_j x_j, and repeat Steps 2 and 3 until no predictor has any correlation with r.
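A minimal sketch of the forward stagewise iteration just described, assuming standardized predictors as in (4.8.15); the step size eps, the stopping threshold, and the function name forward_stagewise are illustrative choices.

import numpy as np

def forward_stagewise(A, y, eps=0.01, n_steps=5000, tol=1e-3):
    """Forward stagewise regression: tiny +/- eps coefficient updates along the
    predictor most correlated with the current residual."""
    beta = np.zeros(A.shape[1])
    r = y.copy()
    for _ in range(n_steps):
        c = A.T @ r                      # correlations with the current residual
        j = np.argmax(np.abs(c))
        if np.abs(c[j]) < tol:           # no predictor correlates with r any more
            break
        delta = eps * np.sign(c[j])
        beta[j] += delta
        r -= delta * A[:, j]
    return beta

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 10))
A = (A - A.mean(0)) / np.linalg.norm(A - A.mean(0), axis=0)   # centered, unit-norm columns
y = A[:, 2] * 3.0 - A[:, 7] * 1.5 + 0.05 * rng.standard_normal(100)
y -= y.mean()
print(np.round(forward_stagewise(A, y), 2))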

Brief Summary of This Chapter
• This chapter mainly introduces the singular value decomposition and several typical methods for solving well-determined, over-determined, and under-determined matrix equations.
• The Tikhonov regularization method and ℓ1-norm minimization have wide applications in artificial intelligence.
• The generalized total least squares method includes the LS method, the Tikhonov regularization method, and the total least squares method as special examples.

Algorithm 4.6 Least angle regression (LARS) algorithm with Lasso modification [12]
1. input: The data vector b ∈ R^m and the input matrix A ∈ R^{m×n}.
2. initialization: Ω_0 = ∅, b̂_0 = 0 and A_{Ω_0} = A. Put k = 1.
3. repeat
4.   Compute the correlation vector ĉ_k = A_{Ω_{k−1}}^T (b − b̂_{k−1}).
5.   Update the active set Ω_k = Ω_{k−1} ∪ {j(k) : |ĉ_k(j)| = C} with C = max{|ĉ_k(1)|, . . . , |ĉ_k(n)|}.
6.   Update the input matrix A_{Ω_k} = [s_j a_j, j ∈ Ω_k], where s_j = sign(ĉ_k(j)).
7.   Find the direction of the current minimum angle:
        G_{Ω_k} = A_{Ω_k}^T A_{Ω_k} ∈ R^{k×k},
        α_{Ω_k} = (1_k^T G_{Ω_k} 1_k)^{−1/2},
        w_{Ω_k} = α_{Ω_k} G_{Ω_k}^{−1} 1_k ∈ R^k,
        μ_k = A_{Ω_k} w_{Ω_k} ∈ R^k.
8.   Compute b̆ = A_{Ω_k}^T μ_k = [b̆_1, . . . , b̆_m]^T and estimate the coefficient vector
        x̂_k = (A_{Ω_k}^T A_{Ω_k})^{−1} A_{Ω_k}^T = G_{Ω_k}^{−1} A_{Ω_k}^T.
9.   Compute γ̂ = min^+_{j∈Ω_k^c} { (C − ĉ_k(j))/(α_{Ω_k} − b̆_j), (C + ĉ_k(j))/(α_{Ω_k} + b̆_j) } and γ̃ = min^+_{j∈Ω_k} { −x_j / w_j }, where w_j is the jth entry of w_{Ω_k} = [w_1, . . . , w_n]^T and min{·}^+ denotes the positive minimum term. If there is no positive term, then min{·}^+ = ∞.
10.  If γ̃ < γ̂, then the fitted vector and the active set are modified as b̂_k = b̂_{k−1} + γ̃ μ_k and Ω_k = Ω_k − {j̃}, where the removed index j̃ is the index j ∈ Ω_k at which γ̃ attains its minimum. Conversely, if γ̂ < γ̃, then b̂_k and Ω_k are modified as b̂_k = b̂_{k−1} + γ̂ μ_k and Ω_k = Ω_k ∪ {ĵ}, where the added index ĵ is the index j at which γ̂ attains its minimum.
11.  exit if some stopping criterion is satisfied.
12.  return k ← k + 1.
13. output: coefficient vector x = x̂_k.

References 1. Auslender, A.: Asymptotic properties of the Fenchel dual functional and applications to decomposition problems. J. Optim. Theory Appl. 73(3), 427–449 (1992) 2. Beltrami, E.: Sulle funzioni bilineari, Giomale di Mathematiche ad Uso Studenti Delle Universita. 11, 98–106 (1873). An English translation by D Boley is available as Technical Report 90–37, University of Minnesota, Department of Computer Science (1990)


3. Bertsekas, D.P.: Nonlinear Programming, 2nd edn. Athena Scientific, Nashua (1999) 4. Bramble, J, Pasciak, J.: A preconditioning technique for indefinite systems resulting from mixed approximations of elliptic problems. Math. Comput. 50(181), 1–17 (1988) 5. Cadzow, J.A.: Spectral estimation: an overdetermined rational model equation approach. Proc. IEEE 70, 907–938 (1982) 6. Cai, D., Zhang, C., He, S.: Unsupervised feature selection for multi-cluster data. In: Proceedings of the 16th ACM SIGKDD, July 25–28, Washington, pp. 333–342 (2010) 7. Chen, S.S., Donoho, D.L., Saunders, M.A.: Atomic decomposition by basis pursuit. SIAM Rev. 43(1), 129–159 (2001) 8. Donoho, D.L.: For most large underdetermined systems of linear equations, the minimal 1 solution is also the sparsest solution. Commun. Pure Appl. Math. 59, 797–829 (2006) 9. Donoho, D.L., Grimes, C.: Hessian eigenmaps: locally linear embedding techniques for highdimensional data. Proc. Natl. Acad. Sci. 100(10), 5591–5596 (2003) 10. Eckart, C., Young, G.: The approximation of one matrix by another of lower rank. Psychometrica 1, 211–218 (1936) 11. Eckart, C., Young, G.: A Principal axis transformation for non-Hermitian matrices. Null Am. Math. Soc. 45, 118–121 (1939) 12. Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression. Ann. Stat. 32, 407– 499 (2004) 13. Fletcher, R.: Conjugate gradient methods for indefinite systems. In: Watson, G.A. (ed.) Proceedings of Dundee Conference on Numerical Analysis, pp. 73–89. Springer, New York (1975) 14. Gleser, L.J.: Estimation in a multivariate “errors in variables" regression model: large sample results. Ann. Stat. 9, 24–44 (1981) 15. Golub, G.H., Van Loan, C.F.: An analysis of the total least squares problem. SIAM J. Numer. Anal. 17, 883–893 (1980) 16. Golub, G.H., Van Loan, C.F.: Matrix Computation, 2nd edn. The John Hopkins University Press, Baltimore (1989) 17. Gorodnitsky, I.F., Rao, B.D.: Sparse signal reconstruction from limited data using FOCUSS: a re-weighted norm minimization algorithm. IEEE Trans. Signal Process. 45, 600–616 (1997) 18. Grippo, L., Sciandrone, M.: On the convergence of the block nonlinear Gauss-Seidel method under convex constraints. Oper. Res. Lett. 26, 127–136 (1999) 19. Hastie, T., Taylor, J., Tibshiran, R., Walther, G.: Forward stagewise regression and the monotone lasso. Electron. J. Stat. 1, 1–29 (2007) 20. Horn, R.A., Johnson, C.R.: Topics in Matrix Analysis. Cambridge University Press, Cambridge (1991) 21. Huffel, S.V., Vandewalle, J.: The Total Least Squares Problems: Computational Aspects and Analysis. Frontiers in Applied Mathematics, vol. 9. SIAM, Philadelphia (1991) 22. Johnson, L.W., Riess, R.D., Arnold, J.T.: Introduction to Linear Algebra, 5th edn. PrenticeHall, New York (2000) 23. Jordan, C.: Memoire sur les formes bilineaires. J. Math. Pures Appl. Deuxieme Ser. 19, 35–54 (1874) 24. Kelley, C.T.: Iterative Methods for Linear and Nonlinear Equations. Frontiers in Applied Mathematics, vol. 16. SIAM, Philadelphia (1995) 25. Kim, S.J., Koh, K., Lustig, M., Boyd, S., Gorinevsky, D.: An interior-point method for largescale 1 -regularized least squares. IEEE J. Sel. Topics Signal Process. 1(4), 606–617 (2007) 26. Klema, V.C., Laub, A.J.: The singular value decomposition: its computation and some applications. IEEE Trans. Autom. Control 25, 164–176 (1980) 27. Lay, D.C.: Linear Algebra and Its Applications, 2nd edn. Addison-Wesley, New York (2000) 28. 
Li, N., Kindermannb, S., Navasca, C.: Some convergence results on the regularized alternating least-squares method for tensor decomposition. Linear Algebra Appl. 438(2), 796–812 (2013) 29. Mateos, G., Bazerque, J.A., Giannakis, G.B.: Distributed sparse linear regression. IEEE Trans. Signal Process. 58(10), 5262–5276 (2010)


30. Mirsky, L.: Symmetric gauge functions and unitarily invariant norms. Quart. J. Math. Oxford 11, 50–59 (1960) 31. Navasca, C., Lathauwer, L.D., Kindermann, S.: Swamp reducing technique for tensor decomposition. In: The 16th Proceedings of the European Signal Processing Conference, Lausanne, August 25–29 (2008) 32. Neumaier, A.: Solving ill-conditioned and singular linear systems: a tutorial on regularization. SIAM Rev. 40(3), 636–666 (1998) 33. Pearson, K.: On lines and planes of closest fit to points in space. Philos. Mag. 2, 559–572 (1901) 34. Powell, M.J.D.: On search directions for minimization algorithms. Math. Program. 4, 193– 201 (1973) 35. Puig, A.T., Wiesel, A., Fleury, G., Hero, A.O.: Multidimensional shrinkage-thresholding operator and group LASSO penalties. IEEE Signal Process. Lett. 18(6), 343–346 (2011) 36. Roth, V.: The generalized LASSO. IEEE Trans. Neural Netw. 15(1), 16–28 (2004) 37. Shewchuk, J.R.: An introduction to the conjugate gradient method without the agonizing pain. http://quake-papers/painless-conjugate-gradient-pics.ps 38. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B 58, 267–288 (1996) 39. Tikhonov, A.: Solution of incorrectly formulated problems and the regularization method. Soviet Math. Dokl. 4, 1035–1038 (1963) 40. Tikhonov, A.N., Arsenin, V.Y.: Solutions of Ill-Posed Problems. Wiley, New York (1977) 41. Tropp, J.A.: Just relax: convex programming methods for identifying sparse signals in noise. IEEE Trans. Inform. Theory 52(3), 1030–1051 (2006) 42. Van Huffel, S., Vandewalle, J.: Analysis and properties of the generalized total least squares problem Ax = b when some or all columns in A are subject to error. SIAM J. Matrix Anal. Appl. 10, 294–315 (1989) 43. Wilkinson, J.H.: The Algebraic Eigenvalue Problem. Clarendon Press, Oxford (1965) 44. Xu, H., Caramanis, C., Mannor, S.: Robust regression and Lasso. IEEE Trans. Inform. Theory 56(7), 3561–3574 (2010) 45. Yang, W., Gao, Y., Shi, Y., Cao, L.: MRM-Lasso: a sparse multiview feature selection method via low-rank analysis. IEEE Trans. Neural Netw. Learn. Syst. 26(11), 2801–2815 (2015) 46. Zhang, X.D.: Matrix Analysis and Applications. Cambridge University Press, Cambridge (2017)

Chapter 5

Eigenvalue Decomposition

This chapter is devoted to another core subject of matrix algebra: the eigenvalue decomposition (EVD) of matrices, including various generalizations of EVD such as the generalized eigenvalue decomposition, the Rayleigh quotient, and the generalized Rayleigh quotient.

5.1 Eigenvalue Problem and Characteristic Equation The eigenvalue problem not only is a very interesting theoretical problem but also has a wide range of applications in artificial intelligence.

5.1.1 Eigenvalue Problem If L[w] = w holds for any nonzero vector w, then L is called an identity transformation. More generally, if, when a linear operator acts on a vector, the output is a multiple of this vector, then the linear operator is said to have an input reproducing characteristic. Definition 5.1 (Eigenvalue, Eigenvector) Suppose that a nonzero vector u is the input of the linear operator L. If the output vector is the same as the input vector u up to a constant factor λ, i.e., L[u] = λu,

u ≠ 0,

(5.1.1)

then the vector u is known as an eigenvector of the linear operator L and the scalar λ is the corresponding eigenvalue of L.


From the above definition, it is known that if each eigenvector u is regarded as an input of a linear time-invariant system L, then the eigenvalue λ associated with eigenvector u is equivalent to the gain of the linear system L with u as the input. Only when the input of the system L is an eigenvector u, its output is the same as the input up to a constant factor. Thus an eigenvector can be viewed as description of the characteristics of the system, and hence is also called a characteristic vector. This is a physical explanation of eigenvectors in relation to linear systems. If a linear transformation L[x] = Ax, then its eigenvalue problem (5.1.1) can be written as Au = λu,

u ≠ 0.

(5.1.2)

The scalar λ is known as an eigenvalue of the matrix A, and the vector u as an eigenvector associated with the eigenvalue λ. Equation (5.1.2) is sometimes called the eigenvalue–eigenvector equation. As an example, we linear time-invariant system h(k) with transfer  consider a −jωk function H (ejω ) = ∞ h(k)e . When inputting a complex exponential or k=−∞ harmonic signal ejωn , the system output is given by ∞ 

L[e jωn ] =

∞ 

h(n − k)e jωk =

k=−∞

h(k)e jω(n−k) = H (e jω )e jωn .

k=−∞

If the harmonic signal vector u(ω) = [1, e jω , . . . , e jω(N −1) ]T is the system input, then its output is given by ⎡ ⎢ ⎢ L⎢ ⎣

1 e jω .. . e jω(N −1)





⎥ ⎢ ⎥ ⎢ ⎥ = H (e jω ) ⎢ ⎦ ⎣

1 e jω .. .

⎤ ⎥ ⎥ ⎥ ⎦



L[u(ω)] = H (e jω )u(ω).

e jω(N −1)

That is to say, the harmonic signal vector u(ω) = [1, e jω , . . . , e jω(N −1) ]T is an eigenvector of the linear time-invariant system, and the system transfer function H (e jω ) is the eigenvalue associated with u(ω). From Eq. (5.1.2) it is easily known that if A ∈ Cn×n is a Hermitian matrix then its eigenvalue λ must be a real number, and A = UUH ,

(5.1.3)

where U = [u1 , . . . , un ]T is a unitary matrix, and  = Diag(λ1 , . . . , λn ). Equation (5.1.3) is called the eigenvalue decomposition (EVD) of A. Since an eigenvalue λ and its associated eigenvector u often appear in pairs, the two-tuple (λ, u) is called an eigenpair of the matrix A. An eigenvalue may take a zero value, but an eigenvector cannot be a zero vector.

5.1 Eigenvalue Problem and Characteristic Equation

205

Equation (5.1.2) means that the linear transformation Au does not “change the direction” of the input vector u. Hence the linear transformation Au is a mapping “keeping the direction unchanged.” In order to determine the vector u, we can rewrite (5.1.2) as (A − λI)u = 0.

(5.1.4)

If the above equation is assumed to hold for certain nonzero vectors u, then the only condition is that, for those vectors u, the determinant of the matrix A − λI is equal to zero: |A − λI| = 0.

(5.1.5)

Hence the eigenvalue problem solving consists of the following two steps: • Find all scalars λ (eigenvalues) such that the matrix |A − λI| = 0; • Given an eigenvalue λ, find all nonzero vectors u satisfying (A − λI)u = 0; they are the eigenvector(s) corresponding to the eigenvalue λ.

5.1.2 Characteristic Polynomial As discussed above, the matrix (A − λI) is singular if and only if its determinant det(A − λI) = 0, i.e., (A − λI) singular



det(A − λI) = 0,

(5.1.6)

where the matrix A − λI is called the characteristic matrix of A. The determinant   a11 − x a12 · · · a1n     a   21 a22 − x · · · a2n  p(x) = det(A − xI) =  . .. ..  ..  .. . . .    a a · · · a − x n1

= pn x + pn−1 x n

n−1

n2

+ · · · + p1 x + p0

nn

(5.1.7)

is known as the characteristic polynomial of A, and p(x) = det(A − xI) = 0 is said to be the characteristic equation of A.

(5.1.8)

206

5 Eigenvalue Decomposition

The roots of the characteristic equation det(A − xI) = 0 are known as the eigenvalues, characteristic values, latent values, the characteristic roots, or latent roots. Obviously, computing the n eigenvalues λi of an n × n matrix A and finding the n roots of the nth-order characteristic polynomial p(x) = det(A − xI) = 0 are two equivalent problems. An n × n matrix A generates an nth-order characteristic polynomial. Likewise, each nth-order polynomial can also be written as the characteristic polynomial of an n × n matrix. Theorem 5.1 ([1]) Any polynomial p(λ) = λn + a1 λn−1 + · · · + an−1 λ + an can be written as the characteristic polynomial of the n × n matrix ⎡ −a −a2 · · · −an−1 ⎢ 1 ⎢ −1 0 · · · 0 ⎢ 0 −1 · · · 0 A=⎢ ⎢ . .. . . .. ⎢ . . ⎣ . . . 0 0 · · · −1

⎤ −an ⎥ 0 ⎥ ⎥ 0 ⎥, .. ⎥ ⎥ . ⎦ 0

namely p(λ) = det(λI − A).

5.2 Eigenvalues and Eigenvectors Given an n × n matrix A, we consider the computation and properties of the eigenvalues and eigenvectors of A.

5.2.1 Eigenvalues Even if an n × n matrix A is real, some roots of its characteristic equation may be complex, and the root multiplicity can be arbitrary or even equal to n. These roots are collectively referred to as the eigenvalues of the matrix A. An eigenvalue λ of a matrix A is said to have algebraic multiplicity μ, if λ is a μ-multiple root of the characteristic polynomial det(A − xI) = 0. If the algebraic multiplicity of an eigenvalue λ is 1, then it is called a single eigenvalue. A nonsingle eigenvalue is said to be a multiple eigenvalue. It is well known that any nth-order polynomial p(x) can be written in the factorized form p(x) = a(x − x1 )(x − x2 ) · · · (x − xn ).

(5.2.1)

5.2 Eigenvalues and Eigenvectors

207

The n roots of the characteristic polynomial p(x), denoted x1 , x2 , . . . , xn , are not necessarily different from each other, and also are not necessarily real for a real matrix. For example, for the Givens rotation matrix A=

  cos θ − sin θ , sin θ cos θ

its characteristic equation is   cos θ − λ − sin θ   = (cos θ − λ)2 + sin2 θ = 0.  det(A − λI) =  sin θ cos θ − λ However, if θ is not an integer multiple of π , then sin2 θ > 0. In this case, the characteristic equation cannot give a real value for λ, i.e., the two eigenvalues of the rotation matrix are complex, and the two corresponding eigenvectors are complex vectors. The eigenvalues of an n × n matrix (not necessarily Hermitian) A have the following properties [8]. 1. An n × n matrix A has a total of n eigenvalues, where multiple eigenvalues count according to their multiplicity. 2. If a nonsymmetric real matrix A has complex eigenvalues and/or complex eigenvectors, then they must appear in the form of a complex conjugate pair. 3. If A is a real symmetric matrix or a Hermitian matrix, then its all eigenvalues are real numbers. 4. The eigenvalues of a diagonal matrix and a triangular matrix satisfy the following: • if A = Diag(a11 , . . . , ann ), then its eigenvalues are given by a11 , . . . , ann ; • if A is a triangular matrix, then all its diagonal entries are eigenvalues. 5. Given an n × n matrix A: • • • •

if λ is an eigenvalue of A, then λ is also an eigenvalue of AT ; if λ is an eigenvalue of A, then λ∗ is an eigenvalue of AH ; if λ is an eigenvalue of A, then λ + σ 2 is an eigenvalue of A + σ 2 I; if λ is an eigenvalue of A, then 1/λ is an eigenvalue of its inverse matrix A−1 .

6. All eigenvalues of an idempotent matrix A2 = A take 0 or 1. 7. If A is an n × n real orthogonal matrix, then all its eigenvalues are located on the unit circle, i.e., |λi (A)| = 1 for all i = 1, . . . , n. 8. The relationship between eigenvalues and the matrix singularity is as follows: • if A is singular, then it has at least one zero eigenvalue; • if A is nonsingular, then its all eigenvalues are nonzero.

208

5 Eigenvalue Decomposition

9. The relationship between eigenvalues and trace: the sum of all eigenvalues of  A is equal to its trace, namely ni=1 λi = tr(A). 10. A Hermitian matrix A is positive definite (semidefinite) if and only if its all eigenvalues are positive (nonnegative). 11. The relationship between eigenvalues and determinant: the product of all 5 eigenvalues of A is equal to its determinant, namely ni=1 λi = det(A) = |A|. 12. The Cayley–Hamilton theorem: if λ1 , λ2 , . . . , λn are the eigenvalues of an n×n matrix A, then n #

(A − λi I) = 0.

i=1

13. If the n × n matrix B is nonsingular, then λ(B−1 AB) = λ(A). If B is unitary, then λ(BH AB) = λ(A). 14. The matrix products Am×n Bn×m and Bn×m Am×n have the same nonzero eigenvalues. 15. If an eigenvalue of a matrix A is λ, then the corresponding eigenvalue of the matrix polynomial f (A) = An + c1 An−1 + · · · + cn−1 A + cn I is given by f (λ) = λn + c1 λn−1 + · · · + cn−1 λ + cn .

(5.2.2)

16. If λ is an eigenvalue of a matrix A, then eλ is an eigenvalue of the matrix exponential function eA .

5.2.2 Eigenvectors If a matrix An×n is a complex matrix, and λ is one of its eigenvalues, then the vector v satisfying (A − λI)v = 0

or

Av = λv

(5.2.3)

is called the right eigenvector of the matrix A associated with the eigenvalue λ, while the vector u satisfying uH (A − λI) = 0T

or

uH A = λuH

(5.2.4)

is known as the left eigenvector of A associated with the eigenvalue λ. If a matrix A is Hermitian, then its all eigenvalues are real, and hence from Eq. (5.2.3) it is immediately known that ((A − λI)v)T = vT (A − λI) = 0T , yielding v = u; namely, the right and left eigenvectors of any Hermitian matrix are the same.

5.2 Eigenvalues and Eigenvectors

209

It is useful to compare the similarities and differences between the SVD and the EVD of a matrix. • The SVD is available for any m × n (where m ≥ n or m < n) matrix, while the EVD is available only for square matrices. • For an n × n non-Hermitian matrix A, its kth singular value is defined as the spectral norm of the error matrix Ek making the rank of the original matrix A decreased by 1: σk = min

E∈Cm×n

/

0 Espec : rank(A + E) ≤ k − 1 ,

k = 1, . . . , min{m, n}, (5.2.5)

while its eigenvalues are defined as the roots of the characteristic polynomial det(A − λI) = 0. There is no inherent relationship between the singular values and the eigenvalues of the same square matrix, but each nonzero singular value of an m × n matrix A is the positive square root of some nonzero eigenvalue of the n × n Hermitian matrix AH A or the m × m Hermitian matrix AAH . • The left singular vector ui and the right singular vector vi of an m × n matrix A associated with the singular value σi are defined as the two vectors satisfying H H uH i Avi = σi , while the left and right eigenvectors are defined by u A = λi u and Avi = λi vi , respectively. Hence, for the same n × n non-Hermitian matrix A, there is no inherent relationship between its (left and right) singular vectors and its (left and right) eigenvectors. However, the left singular vector ui and right singular vector vi of a matrix A ∈ Cm×n are, respectively, the eigenvectors of the m × m Hermitian matrix AAH and of the n × n matrix AH A. From Eq. (5.1.2) it is easily seen that after multiplying an eigenvector u of a matrix A by any nonzero scalar μ, then μu is still an eigenvector of A. In general it is assumed that eigenvectors have unit norm, i.e., u2 = 1. Using eigenvectors we can introduce the condition number of any single eigenvalue. Definition 5.2 (Condition Number of Eigenvalue [17, p. 93]) The condition number of a single eigenvalue λ of any matrix A is defined as cond(λ) =

1 , cos θ (u, v)

(5.2.6)

where θ (u, v) represents the acute angle between the left and right eigenvectors associated with the eigenvalue λ. Definition 5.3 (Matrix Spectrum) The set of all eigenvalues λ ∈ C of a matrix A ∈ Cn×n is called the spectrum of the matrix A, denoted λ(A). The spectral radius of a matrix A, denoted ρ(A), is a nonnegative real number and is defined as ρ(A) = max |λ| : λ ∈ λ(A).

(5.2.7)

210

5 Eigenvalue Decomposition

Definition 5.4 (Inertia) The inertia of a symmetric matrix A ∈ Rn×n , denoted In(A), is defined as the triplet In(A) = (i+ (A), i− (A), i0 (A)), where i+ (A), i− (A) and i0 (A) are, respectively, the numbers of the positive, negative, and zero eigenvalues of A (each multiple eigenvalue is counted according to its multiplicity). Moreover, the quality i+ (A) − i− (A) is known as the signature of A.

5.3 Generalized Eigenvalue Decomposition (GEVD) The EVD of an n × n single matrix can be extended to the EVD of a matrix pair or pencil consisting of two matrices. The EVD of a matrix pencil is called the generalized eigenvalue decomposition (GEVD). As a matter of fact, the EVD is a special example of the GEVD.

5.3.1 Generalized Eigenvalue Decomposition The basis of the EVD is the eigensystem expressed by the linear transformation L[u] = λu: taking the linear transformation L[u] = Au, one gets the EVD Au = λu. Now consider the generalization of the eigensystem composed of two linear systems La and Lb both of whose inputs are the vector u, but the output La [u] of the first system La is some constant times λ the output Lb [u] of the second system Lb . Hence the eigensystem is generalized as [11] La [u] = λLb [u] (u = 0);

(5.3.1)

this is called a generalized eigensystem, denoted (La , Lb ). The constant λ and the nonzero vector u are known as the generalized eigenvalue and the generalized eigenvector of generalized eigensystem (La , Lb ), respectively. In particular, if two linear transformations are, respectively, La [u] = Au and Lb [u] = Bu, then the generalized eigensystem becomes the generalized eigenvalue decomposition (GEVD): Au = λBu.

(5.3.2)

The two n × n matrices A and B compose a matrix pencil (or matrix pair), denoted (A, B). The constant λ and the nonzero vector u are, respectively, referred to as the generalized eigenvalue and the generalized eigenvector of the matrix pencil (A, B).

5.3 Generalized Eigenvalue Decomposition (GEVD)

211

A generalized eigenvalue and its corresponding generalized eigenvector are collectively called a generalized eigenpair, denoted (λ, u). Equation (5.3.2) shows that the ordinary eigenvalue problem is a special example of the generalized eigenvalue problem when the matrix pencil is (A, I). Although the generalized eigenvalue and the generalized eigenvector always appear in pairs, the generalized eigenvalue can be found independently. To do this, we rewrite the generalized characteristic equation (5.3.2) as (A − λB)u = 0. In order to ensure the existence of a nonzero solution u, the matrix A − λB cannot be nonsingular, i.e., the determinant must be equal to zero: (A − λB) singular



det(A − λB) = 0.

(5.3.3)

The polynomial det(A − λB) is called the generalized characteristic polynomial of the matrix pencil (A, B), and det(A − λB) = 0 is known as the generalized characteristic equation of the matrix pencil (A, B). Hence, the matrix pencil (A, B) is often expressed as A − λB. The generalized eigenvalues of the matrix pencil (A, B) are all solutions z of the generalized characteristic equation det(A − zB) = 0, including the generalized eigenvalues equal to zero. If the matrix B = I, then the generalized characteristic polynomial reduces to the characteristic polynomial det(A − λI) = 0 of the matrix A. In other words, the generalized characteristic polynomial is a generalization of the characteristic polynomial, whereas the characteristic polynomial is a special example of the generalized characteristic polynomial of the matrix pencil (A, B) when B = I. The generalized eigenvalues λ(A, B) are defined as    λ(A, B) = z ∈ C det(A − zB) = 0 .

(5.3.4)

Theorem 5.2 ([9]) The pair λ ∈ C and u ∈ Cn are, respectively, the generalized eigenvalue and the generalized eigenvector of the n × n matrix pencil (A, B) if and only if |A − λB| = 0 and u ∈ Null(A − λB) with u = 0. The following are the properties of the GEVD Au = λBu [10, pp. 176–177]. 1. If A and B are exchanged, then every generalized eigenvalue will become its reciprocal, but the generalized eigenvector is unchanged, namely Au = λBu



Bu =

1 Au. λ

2. If B is nonsingular, then the GEVD becomes the standard EVD of B−1 A: Au = λBu



(B−1 A)u = λu.

3. If A and B are real positive definite matrices, then the generalized eigenvalues must be positive.

212

5 Eigenvalue Decomposition

4. If A is singular then λ = 0 must be a generalized eigenvalue. 5. If both A and B are positive definite Hermitian matrices then the generalized eigenvalues must be real, and H uH i Auj = ui Buj = 0,

i = j.

Strictly speaking, the generalized eigenvector u described above is called the right generalized eigenvector of the matrix pencil (A, B). The left generalized eigenvector corresponding to the generalized eigenvalue λ is defined as the column vector v such that vH A = λvH B.

(5.3.5)

Let both X and Y be nonsingular matrices; then from (5.3.2) and (5.3.5) one has XAu = λXBu,

vH AY = λvH BY.

(5.3.6)

This shows that premultiplying the matrix pencil (A, B) by a nonsingular matrix X does not change the right generalized eigenvector u of the matrix pencil (A, B), while postmultiplying (A, B) by any nonsingular matrix Y does not change the left generalized eigenvector v. Algorithm 5.1 uses a contraction mapping to compute the generalized eigenpairs (λ, u) of a given n × n real symmetric matrix pencil (A, B). Algorithm 5.1 Lanczos algorithm for GEVD [17, p. 298] 1. input: n × n real symmetric matrices A, B. 2. initialization: Choose u1 such that u1 2 = 1, and set α1 = 0, z0 = u0 = 0, z1 = Bu1 , i = 1. 3. repeat 4. Compute u = Aui − αi zi−1 . βi = u, ui . u = u − β i zi . w = B−1√u. αi+1 = w, u. ui+1 = w/αi+1 . zi+1 = u/αi+1 . λi = βi+1 /αi+1 . 5. exit if i = n. 6. return i ← i + 1. 7. output: (λi , ui ), i = 1, . . . , n.

Algorithm 5.2 is a tangent algorithm for computing the GEVD of a symmetric positive definite matrix pencil. When the matrix B is singular, Algorithms 5.1 and 5.2 will be unstable. The GEVD algorithm for the matrix pencil (A, B) with singular matrix B was proposed

5.3 Generalized Eigenvalue Decomposition (GEVD)

213

Algorithm 5.2 Tangent algorithm for computing the GEVD [4] 1. 2. 3. 4. 5. 6. 7. 8. 9. 10.

input: n × n real symmetric matrices A, B. repeat Compute A = Diag(A11 , . . . , Ann )−1/2 , As = A A A , B1 = A B A . Calculate the Cholesky decomposition RTA RA = As , RTB RB = T B1 . Solve the matrix equation FRB = A for F = A R−1 B . Compute the SVD  = VFUT of F. Compute X = A R−1 B U. exit if AX = BX 2 . repeat output: X, .

by Nour-Omid et al. [12], this is shown in Algorithm 5.3. The main idea of this algorithm is to introduce a shift factor σ such that A − σ B is nonsingular. Algorithm 5.3 GEVD algorithm for singular matrix B [12, 17] 1. 2. 3. 4.

input: initialization: Choose a basis w, z1 = Bw, α1 = repeat Compute ui = w/αi . zi = (A − σ B)−1 w. w = w − αi ui−1 . βi = w, zi . zi+1 = Bw. αi+1 = zi+1 , w. λi = βi /αi+1 . 5. exit if i = n. 6. return i ← i + 1. 7. output: (λi , ui ), i = 1, . . . , n.

w, z1 . Set u0 = 0, i = 1.

Definition 5.5 (Equivalent Matrix Pencils) Two matrix pencils with the same generalized eigenvalues are known as equivalent matrix pencils. From the definition of the generalized eigenvalue det(A − λB) = 0 and the properties of a determinant it is easy to see that det(XAY − λXBY) = 0



det(A − λB) = 0.

Therefore, premultiplying and/or postmultiplying one or more nonsingular matrices, the generalized eigenvalues of the matrix pencil are unchanged. This result can be described as the following proposition. Proposition 5.1 If X and Y are two nonsingular matrices, then the matrix pencils (XAY, XBY) and (A, B) are equivalent.

214

5 Eigenvalue Decomposition

The GEVD can be equivalently written as αAu = βBu as well. In this case, the generalized eigenvalue is defined as λ = β/α.

5.3.2 Total Least Squares Method for GEVD As shown by Roy and Kailath [16], use of a least squares estimator can lead to some potential numerical difficulties in solving the generalized eigenvalue problem. To overcome this difficulty, a higher-dimensional ill-conditioned LS problem needs to be transformed into a lower-dimensional well-conditioned LS problem. Without changing its nonzero generalized eigenvalues, a higher-dimensional matrix pencil can be transformed into a lower-dimensional matrix pencil by using a truncated SVD. Consider the GEVD of the matrix pencil (A, B). Let the SVD of A be given by  A = UVH = [U1 , U2 ]

1 O O 2

 H V1 , VH 2

(5.3.7)

where  1 is composed of p principal singular values. Under the condition that the generalized eigenvalues remain unchanged, we can premultiply the matrix pencil A − γ B by UH 1 and postmultiply it by V1 to get [18]: H UH 1 (A − γ B)V1 =  1 − γ U1 BV1 .

(5.3.8)

Hence the original n × n matrix pencil (A, B) becomes a p × p matrix pencil H ( 1 , UH 1 BV1 ). The GEVD of the lower-dimensional matrix pencil ( 1 , U1 BV1 ) is called the total least squares (TLS) method for the GEVD of the higher-dimensional matrix pencil (A, B), which was developed in [18].

5.4 Rayleigh Quotient and Generalized Rayleigh Quotient In physics and artificial intelligence, one often meets the maximum or minimum of the quotient of the quadratic function of a Hermitian matrix. This quotient has two forms: the Rayleigh quotient (also called the Rayleigh–Ritz ratio) of a Hermitian matrix and the generalized Rayleigh quotient (or generalized Rayleigh–Ritz ratio) of two Hermitian matrices.

5.4 Rayleigh Quotient and Generalized Rayleigh Quotient

215

5.4.1 Rayleigh Quotient In his study of the small oscillations of a vibrational system, in order to find the appropriate generalized coordinates, Rayleigh [15] in the 1930s proposed a special form of quotient, later called the Rayleigh quotient. Definition 5.6 (Rayleigh Quotient) The Rayleigh quotient or Rayleigh–Ritz ratio R(x) of a Hermitian matrix A ∈ Cn×n is a scalar and is defined as R(x) = R(x, A) =

xH Ax , xH x

(5.4.1)

where x is a vector to be selected such that the Rayleigh quotient is maximized or minimized. The Rayleigh quotient has the following important properties [2, 13, 14]. Homogeneity: If α and β are scalars, then R(αx, βA) = βR(x, A). Shift invariance: R(x,  A − αI) =  R(x, A) − α. Orthogonality: x ⊥ A − R(x)I x. Boundedness: When the vector x lies in the range of all nonzero vectors, the Rayleigh quotient R(x) falls in a complex plane region (called the range of A). This region is closed, bounded, and convex. If A is a Hermitian matrix such that A = AH , then this region is a closed interval [λ1 , λn ]. 5. Minimum residual: For all vectors x = 0 and all scalars μ, we have (A − R(x)I)x ≤ (A − μI)x. 1. 2. 3. 4.

Theorem 5.3 (Rayleigh–Ritz Theorem) Let A ∈ Cn×n be a Hermitian matrix whose eigenvalues are arranged in increasing order: λmin = λ1 ≤ λ2 ≤ · · · ≤ λn−1 ≤ λn = λmax ;

(5.4.2)

then max

xH Ax xH Ax = max = λmax , H xH x xH x=1 x x

if Ax = λmax x,

(5.4.3)

min

xH Ax xH Ax = min = λmin , xH x xH x=1 xH x

if Ax = λmin x.

(5.4.4)

x=0

x=0

More generally, the eigenvectors and eigenvalues of A correspond, respectively, to the critical points and the critical values of the Rayleigh quotient R(x). This theorem has various proofs, see, e.g., [6, 7] or [3]. As a matter of fact, from the eigenvalue decomposition Ax = λx we have Ax = λx



xH Ax = λxH x



xH Ax = λ. xH x

(5.4.5)

216

5 Eigenvalue Decomposition

That is to say, a Hermitian matrix A ∈ Cn×n has n Rayleigh quotients that are equal to n eigenvalues of A, namely Ri (x, A) = λi (A), i = 1, . . . , n.

5.4.2 Generalized Rayleigh Quotient Definition 5.7 (Generalized Rayleigh Quotient) Let A and B be two n × n Hermitian matrices, and let B be positive definite. The generalized Rayleigh quotient or generalized Rayleigh–Ritz ratio R(x) of the matrix pencil (A, B) is a scalar, and is defined as R(x) =

xH Ax , xH Bx

(5.4.6)

where x is a vector to be selected such that the generalized Rayleigh quotient is maximized or minimized. In order to find the generalized Rayleigh quotient, we define a new vector x˜ = B1/2 x, where B1/2 denotes the square root of the positive definite matrix B. Substitute x = B−1/2 x˜ into the definition formula of the generalized Rayleigh quotient (5.4.6); then we have  H   x˜ H B−1/2 A B−1/2 x˜ R(˜x) = . x˜ H x˜

(5.4.7)

This implies that the generalized Rayleigh quotient of the matrix pencil (A, B) is equivalent to the Rayleigh quotient of the matrix product C = (B−1/2 )H A(B−1/2 ). By the Rayleigh–Ritz theorem, it is known that, when the vector x˜ is selected as the eigenvector corresponding to the minimum eigenvalue λmin of C, the generalized Rayleigh quotient takes a minimum value λmin , while when the vector x˜ is selected as the eigenvector corresponding to the maximum eigenvalue λmax of C, the generalized Rayleigh quotient takes the maximum value λmax . H    Consider the EVD of the matrix product B−1/2 A B−1/2 x˜ = λ˜x. If B = n H i=1 βi vi vi is the EVD of the matrix B, then 1/2

B

=

n   i=1

βi vi vH i

and

−1/2

B

n  1 = vi vH i .. β i i=1

This shows that B−1/2 is also a Hermitian matrix, so that (B−1/2 )H = B−1/2 .  H   Premultiplying B−1/2 A B−1/2 x˜ = λ˜x by B−1/2 and substituting (B−1/2 )H = B−1/2 , it follows that B−1 AB−1/2 x˜ = λB−1/2 x˜ or B−1 Ax = λx. Hence, the EVD of the matrix product (B−1/2 )H A(B−1/2 ) is equivalent to the EVD of B−1 A.

5.4 Rayleigh Quotient and Generalized Rayleigh Quotient

217

Because the EVD of B−1 A is the GEVD of the matrix pencil (A, B), the above analysis can be summarized as follows: the conditions for the maximum and the minimum of the generalized Rayleigh quotient are: R(x) =

xH Ax = λmax , xH Bx

if we choose Ax = λmax Bx,

(5.4.8)

R(x) =

xH Ax = λmin , xH Bx

if we choose Ax = λmin Bx.

(5.4.9)

That is to say, in order to maximize the generalized Rayleigh quotient, the vector x must be taken as the generalized eigenvector corresponding to the maximum generalized eigenvalue of the matrix pencil (A, B). In contrast, for the generalized Rayleigh quotient to be minimized, the vector x should be taken as the generalized eigenvector corresponding to the minimum generalized eigenvalue of the matrix pencil (A, B).

5.4.3 Effectiveness of Class Discrimination Pattern recognition is widely applied in recognitions of human characteristics (such as a person’s face, fingerprint, iris) and various radar targets (such as aircraft, ships). In these applications, the extraction of signal features is crucial. For example, when a target is considered as a linear system, the target parameter is a feature of the target signal. The divergence is a measure of the distance or dissimilarity between two signals and is often used in feature discrimination and evaluation of the effectiveness of class discrimination. Let Q denote the common dimension of the signal-feature vectors extracted by various methods. Assume that there are c classes of signals; the Fisher class separability measure, called simply the Fisher measure, between the ith and the j th classes is used to determine the ranking of feature vectors. Consider the class (l) discrimination of c classes of signals. Let vk denote the kth sample feature vector of the lth class of signals, where l = 1, . . . , c and k = 1, · · · , lK , with lK the number of sample feature vectors in the lth signal class. Under the assumption that the prior (l) probabilities of the random vectors v(l) = vk are the same for k = 1, . . . , lK (i.e., equal probabilities), the Fisher measure is defined as  m(i,j ) =

l=i,j

! "2 (l) (l) meank (vk ) − meanl (mean(vk )) ,  (l) l=i,j vark (vk )

(5.4.10)

218

5 Eigenvalue Decomposition

where: (l)

• meank (vk ) is the mean (centroid) of all signal-feature vectors of the lth class; (l) • var(vk ) is the variance of all signal-feature vectors of the lth class; (l) • meanl (meank (vk )) is the total sample centroid of all classes. As an extension of the Fisher measure, consider the projection of all Q×1 feature vectors onto the (c − 1)th dimension of class discrimination space. Put N = N1 + · · · + Nc , where Ni represents the number of feature vectors of the ith class signal extracted in the training phase. Suppose that si,k = [si,k (1), . . . , si,k (Q)]T denotes the Q × 1 feature vector obtained by the kth group of observation data of the ith signal in the training phase, while mi = [mi (1), . . . , mi (Q)]T is the sample mean of the ith signal-feature vector, where Ni 1  si,k (q), mi (q) = Ni

i = 1, . . . , c, q = 1, . . . , Q.

k=1

Similarly, let m = [m(1), . . . , m(Q)]T denote the ensemble mean of all the signal-feature vectors obtained from all the observation data, where 1 mi (q), c c

m(q) =

q = 1, . . . , Q.

i=1

Then, we have within-class scatter matrix and between-class scatter matrix [5]:

Sw

⎛ ⎞ Ni c 1 ⎝ 1  = (si,k − mi )(si,k − mi )T ⎠ , c Ni

def

i=1

def

Sb =

(5.4.11)

k=1

1 (mi − m)(mi − m)T . c c

(5.4.12)

i=1

Define a criterion function #

UT Sb U

def diag

J (U) = #

,

(5.4.13)

UT Sw U

diag

where

#

A denotes the product of diagonal entries of the matrix A. As it is a measure

diag

of class discrimination ability, J should be maximized. We refer to Span(U) as the

5.4 Rayleigh Quotient and Generalized Rayleigh Quotient

219

class discrimination space, if #

UT Sb U

diag

U = arg max J (U) = # U∈RQ×Q

.

(5.4.14)

UT Sw U

diag

This optimization problem can be equivalently written as ⎧ Q # ⎪ ⎪ T ⎪ ⎪ ⎨ ui Sb ui

⎫ ⎪ ⎪ ⎪ ⎪ Q T # ui Sb ui ⎬ i=1 [u1 , . . . , uQ ] = arg max Q = # uT S u ⎪ ⎪ ui ∈RQ ⎪ TS u i=1 i w i ⎪ ⎪ ⎪ u ⎪ ⎪ ⎩ ⎭ i w i

(5.4.15)

i=1

whose solution is given by ui = arg max ui ∈RQ

uTi Sb ui uTi Sw ui

,

i = 1, . . . , Q.

(5.4.16)

This is just the maximization of the generalized Rayleigh quotient. The above equation has a clear physical meaning: the column vectors of the matrix U constituting the optimal class discrimination subspace which should maximize the between-class scatter and minimize the within-class scatter, namely which should maximize the generalized Rayleigh quotient. For c classes of signals, the optimal class discrimination subspace is (c − 1)dimensional. Therefore Eq. (5.4.16) is maximized only for c − 1 generalized Rayleigh quotients. In other words, it is necessary to solve the following generalized eigenvalue problem: Sb ui = λi Sw ui ,

i = 1, 2, . . . , c − 1,

(5.4.17)

for c − 1 generalized eigenvectors u1 , . . . , uc−1 . These generalized eigenvectors constitute the Q × (c − 1) matrix Uc−1 = [u1 , . . . , uc−1 ]

(5.4.18)

whose columns span the optimal class discrimination subspace. After obtaining the Q × (c − 1) matrix Uc−1 , we can find its projection onto the optimal class discrimination subspace, yi,k = UTc−1 si,k ,

i = 1, . . . , c, k = 1, . . . , Ni ,

for every signal-feature vector obtained in the training phase.

(5.4.19)

220

5 Eigenvalue Decomposition

When there are only three classes of signals (c = 3), the optimal class discrimination subspace is a plane, and the projection of every feature vector onto the optimal class discrimination subspace is a point. These projections directly reflect the discrimination ability of different feature vectors in signal classification.

Brief Summary of This Chapter This chapter focuses on the matrix algebra used widely in artificial intelligence: eigenvalue decomposition, general eigenvalue decomposition, Rayleigh quotients, generalized Rayleigh quotients, and Fisher measure.

References 1. Bellman, R.: Introduction to Matrix Analysis, 2nd edn. McGraw-Hill, New York (1970) 2. Chatelin, F.: Eigenvalues of Matrices. Wiley, New York (1993) 3. Cirrincione, G., Cirrincione, M., Herault J., et al.: The MCA EXIN neuron for the minor component analysis. IEEE Trans. Neural Netw. 13(1), 160–187 (2002) 4. Drmac, Z.: A tangent algorithm for computing the generalized singular value decomposition. SIAM J. Numer. Anal. 35(5), 1804–1832 (1998) 5. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification and Scene Analysis. Wiley, New York (1973) 6. Golub, G.H., Van Loan, C.F.: Matrix Computation, 2nd edn. The John Hopkins University Press, Baltimore (1989) 7. Helmke, U., Moore, J.B.: Optimization and Dynamical Systems. Springer, London (1994) 8. Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, Cambridge (1985) 9. Huang, L.: Linear Algebra in System and Control Theory (in Chinese). Science Press, Beijing (1984) 10. Jennings, A., McKeown, J.J.: Matrix Computations. Wiley, New York (1992) 11. Johnson, D.H., Dudgeon, D.E.: Array Signal Processing: Concepts and Techniques. Prentice Hall, Englewood Cliffs (1993) 12. Nour-Omid, B., Parlett, B.N., Ericsson, T., Jensen, P.S.: How to implement the spectral transformation. Math. Comput. 48, 663–673 (1987) 13. Parlett, B.N.: The Rayleigh quotient iteration and some generalizations for nonnormal matrices. Math. Comput. 28(127), 679–693 (1974) 14. Parlett, B.N.: The Symmetric Eigenvalue Problem. Prentice-Hall, Englewood Cliffs (1980) 15. Rayleigh, L.: The Theory of Sound, 2nd edn. Macmillan, New York (1937) 16. Roy, R., Kailath, T.: ESPRIT - estimation of signal parameters via rotational invariance techniques. IEEE Trans. Acoust. Speech Signal Process. 37, 297–301 (1989) 17. Saad, Y.: Numerical Methods for Large Eigenvalue Problems. Manchester University Press, New York (1992) 18. Zhang, X.D., Liang, Y.C.: Prefiltering-based ESPRIT for estimating parameters of sinusoids in non-Gaussian ARMA noise. IEEE Trans. Signal Process. 43, 349–353 (1995)

Part II

Artificial Intelligence

Chapter 6

Machine Learning

Machine learning is a subset of artificial intelligence. This chapter presents first a machine learning tree, and then focuses on the matrix algebra methods in machine learning including single-objective optimization, feature selection, principal component analysis, and canonical correlation analysis together with supervised, unsupervised, and semi-supervised learning and active learning. More importantly, this chapter highlights selected topics and advances in machine learning: graph machine learning, reinforcement learning, Q-learning, and transfer learning.

6.1 Machine Learning Tree Machine learning (ML) is a subset of artificial intelligence, which build a mathematical model based on sample data, known as “training data,” in order to make predictions or decisions without being explicitly programmed to perform the task. Regarding learning, a good definition given by Mitchell [181] is A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T , as measured by P , improves with experience E.

In machine learning, neural networks, support vector machines, and evolutionary computation, we are usually given a training set and a test set. By the training set, it will mean the union of the labeled set and the unlabeled set of examples available to machine learners. In comparison, test set consists of examples never seen before. Let (Xl , Yl ) = {(x1 , y1 ), . . . , (xl , yl )} denote the labeled set, where xi ∈ RD is the ith D-dimensional data vector and yi ∈ R or yi ∈ {1, . . . , M} is the corresponding label of the data vector xi . In regression problems, yi is the regression or fitting of xi . In classification problems, yi is the corresponding class label of xi among the M classes of targets. The labeled data xi , i = 1, . . . , l are observed by the

© Springer Nature Singapore Pte Ltd. 2020 X.-D. Zhang, A Matrix Algebra Approach to Artificial Intelligence, https://doi.org/10.1007/978-981-15-2770-8_6

223

224

6 Machine Learning

user, while yi , i = 1, . . . , l are labeled by data labeling experts or supervisors. The unlabeled set is comprised of the data vectors and is denoted by Xu = {x1 , . . . , xu }. Machine learning aims to establish a regressor or classifier through learning the training set, and then to evaluate the performance of the regressor or classifier through the test set. According to the nature of training data, we can classify machine learning as follows. 1. Regular or Euclidean structured data learning • Supervised learning: Given a training set consisting of labeled data (i.e., example inputs and their desired outputs) (Xtrain , Ytrain ) = {(x1 , y1 ), . . . , (xl , yl )}, supervised learning learns a general rule that maps inputs to outputs. This is like a “teacher” or a supervisor (data labeling expert) giving a student a problem (finding the mapping relationship between inputs and outputs) and its solutions (labeled output data) and telling that student to figure out how to solve other, similar problems: finding the mapping from the features of unseen samples to their correct labels or target values in the future. • Unsupervised learning: In unsupervised learning, the training set consists of the unlabeled set only Xtrain = Xu = {x1 , . . . , xu }. The main task of the machine learner is to find the solutions on its own (i.e., patterns, structures, or knowledge in unlabeled data). This is like giving a student a set of patterns and asking him or her to figure out the underlying motifs that generated the patterns. • Semi-supervised learning: Given a training set (Xtrain , Ytrain ) = {(x1 , y1 ), . . . , (xl yl )} ∪ {xl+1 , . . . , xl+u } with l / u, i.e., we are given a small amount of labeled data together with a large amount of unlabeled data. Semisupervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data). Depending on how the data is labeled, semi-supervised learning can be divided into the following categories: – Self-training is a semi-supervised learning using its own predictions to teach itself. – Co-training is a weakly semi-supervised learning for multi-view data using the co-training setting, and use their own predictions to teach themselves. – Active learning is a semi-supervised learning where the learner has some active or participatory role in determining which data points it will ask to be labeled by an expert or teacher. • Reinforcement learning: Training data (in the form of rewards and punishments) is given only as feedback to an artificial intelligence agent in a dynamic environment. This feedback between the learning system and the interaction experience is useful to improve performance in the task being learned. The machine learning based on data feedback is called reinforcement learning. Qlearning is a popular model-free reinforcement learning and learns a reward or punishment function (action-value functions, simply called Q-function).

6.1 Machine Learning Tree

225

• Transfer learning: In many real-world applications, the data distribution changes or data are outdated, and thus it is necessary to apply transfer learning for considering transfer of knowledge from the source domain to the target domain. Transfer learning includes but not limited to inductive transfer learning, transductive transfer learning, unsupervised transfer learning, multitask learning, self-taught transfer learning, domain adaptation, and EigenTransfer. 2. Graph machine learning is an irregular or non-Euclidean structured data learning, and learns the structure of a graph, called also graph construction, from training samples in semi-supervised and unsupervised learning cases. The above classification of machine learning can be vividly represented by a machine learning tree, as shown in Fig. 6.1. The machine learning tree is so-called because Fig. 6.1 looks like a tree after rotating it 90◦ to the left. There are two basic tasks of machine learning: classification (for discrete data) and prediction (for continuous data). Deep learning is to learn the internal law and representation level of sample data. The information at different hierarchy levels obtained in the learning process is very helpful to the interpretation of data such as text, image, and voice. Deep learning is a complex machine learning algorithm, which has achieved much better results in speech and image recognition than previous related technologies. Limited to space, this chapter will not discuss deep learning, but focuses only on supervised learning, unsupervised learning, reinforcement learning, and transfer learning. Before dealing with machine learning in detail, it is necessary to start with preparation knowledge of machine learning: its optimization problems, majorizationminimization algorithms, and how to boost a weak learning algorithm to a strong learning algorithm. ⎧ ⎧ ⎪ Supervised learning ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ Unsupervised learning ⎧ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨Self-training ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ Semi-supervised learning Co-traing ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎩Active laerning ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨Reinforcement learning → Q-learning ⎪ ⎪ ⎧ ⎪ ⎨Euclidean data learning inductive transfer learning ⎪ ⎪ ⎪ ⎪ Machine Learning ⎪ ⎪ ⎪ ⎪transductive transfer learning ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪Unsupervised transfer learning ⎪ ⎪ ⎨ ⎪ ⎪ ⎪ ⎪ ⎪Transfer learning ⎪ ⎪ ⎪ Multitask learning ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪Self-taught transfer learning ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪domain adaptation ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎩ ⎩ ⎪ ⎪ EigenTransfer ⎪ ⎪ ⎪ ⎪ ⎩ Non-Euclidean data learning → Graph machine learning

Fig. 6.1 Machine learning tree

226

6 Machine Learning

6.2 Optimization in Machine Learning Two of the pillars of machine learning are matrix algebra and optimization. Matrix algebra involves the matrix-vector representation and modeling of currently available data, and optimization involves the numerical computation of parameters for a system designed to make decisions on yet unseen data. Optimization problems arise throughout machine learning. On optimization in machine learning, the following questions are naturally asked and important [31]: 1. How do optimization problems arise in machine learning applications, and what makes them challenging? 2. What have been the most successful optimization methods for large-scale machine learning, and why? 3. What recent advances have been made in the design of algorithms, and what are open questions in this research area? This section focuses upon the most successful optimization methods for largescale machine learning that involves very large data sets and for which the number of model parameters to be optimized is also large.

6.2.1 Single-Objective Composite Optimization In supervised machine learning, we are given the collection of a data set {(x1 , y1 ), . . . , (xN , yN )}, where xi ∈ Rd for each i ∈ {1, . . . , N } represents the feature vector of a target, and the scalar yi is a label indicating whether a corresponding feature vector belongs (yi = 1) or not (yi = −1) to a particular class (i.e., topic of interest). With a set of examples {(xi , yi )}N i=1 , define the prediction function with respect to the ith sample as h(w; xi ) = w · xi = w, xi  = wT xi ,

(6.2.1)

where the predictor w = [w1 , . . . , wd ]T ∈ Rd is a d-dimension parameter vector of machine learning model that should have the good “generalization ability.” Construct a machine learning program in which the performance of a prediction function yˆi = h(w; xi ) is measured by counting how often the program prediction f (xi ; w) differs from the correct prediction yi , i.e., yˆi = yi . Therefore, we need to search for a predictor function that minimizes the frequency of observed mispredictions, called the empirical risk: N 1  Rn (h) = 11[hi (xi ) = yi ], N i=1

(6.2.2)

6.2 Optimization in Machine Learning

227

where 11[A] =

⎧ ⎨ 1, if A is true; ⎩

(6.2.3) 0, otherwise.

The optimization in machine learning seeks to design a classifier/predictor w so that yˆ = w, x can provide the correct classification/prediction for any unknown input vector x. The objective f (w; x) in machine learning can be represented as the following common form: f (w; x) = l(w; x) + h(w; x),

(6.2.4)

where l(w; x) is the loss function and h(w; x) denotes the regularization function. In this manner, the expected risk for a given w is the expected value of the loss function l(w; x) taken with respect to the distribution of x as follows [31]: (Expected Risk)

R(w) = E{l(w; x)}

(6.2.5)

since the true distribution of x is often unknown, the expected risk is commonly estimated by the empirical risk (Empirical Risk)

Rn (w) =

N 1  li (w), N

(6.2.6)

i=1

where li (w) = l(w; xi ). If we were to try to minimize the empirical risk only, the method would be referred to as ERM (Empricial Risk Minimization). However, the empirical risk usually underestimates the true/expected risk by the estimation error, and thus we add a regularization term h(w; x) to the composite function. The corresponding prediction error is given by ei (w) = e(w; xi , yi ) = wT xi − yi .

(6.2.7)

Then one can write the empirical error as the average of the prediction errors for all N samples {(xi , yi )}N i=1 : (Empirical Error)

Remp (w) =

N 1  ei (w). N i=1

(6.2.8)

228

6 Machine Learning

The loss function incurred by the parameter vector w can be represented as l(w) =

N N 1  1  T |ei (w)|2 = (w xi − yi )2 . 2N 2N i=1

(6.2.9)

i=1

The above machine learning tasks such as classification and regression can be represented as solving a “composite” (additive) optimization problems . N "2 1 ! T w xi − yi + h(w) , min f (w) = l(w) + h(w) = w 2N *

(6.2.10)

i=1

where the objective function f (x) is given by the sum of N component loss functions fi (w) = |ei (w)|2 = (wT xi −yi )2 and a possibly nonsmooth regularization function h(w). It is assumed that each component loss function fi (w; xi ) : Rd × Rd → R is continuously differentiable and the sum of N component loss functions f (x) is strongly convex, while the regularization function h : Rd → R is proper, closed, and convex but not necessarily differentiable, and is used for preventing overfitting. The composite optimization tasks exist widely in various artificial intelligence applications, e.g., image recognition, speech recognition, task allocation, text classification/processing, and so on. To simplify notation, denote gj (wk ) = g(wk ; xj , yj ) and fj (wk ) = f (wk ; xj , yj ) hereafter. Optimization methods for machine learning fall into two broad categories [31]: • Stochastic method: The prototypical stochastic method is the stochastic gradient (SG) method [214] defined by wk+1 ← wk − αk ∇fik (wk )

(6.2.11)

for all k ∈ {1, 2, . . .}, the index ik , corresponding to the sample pair (xik , yik ), is chosen randomly from {1, . . . , N } and αk is a positive stepsize. Each iteration of this method involves only the computation of the gradient ∇fik (wk ) corresponding to one sample. • Batch method: The simplest form of batch method is the steepest descent algorithm: wk+1 ← wk − αk ∇Rn (wk ) = wk −

N αk  ∇fi (wk ). N

(6.2.12)

i=1

The steepest descent algorithm is also referred to as the gradient, batch gradient, or full gradient method.

6.2 Optimization in Machine Learning

229

The gradient algorithms are usually represented as wk+1 = wk − αk g(wk ).

(6.2.13)

Traditional gradient-based methods are effective for solving small-scale learning problems, but in the context of large-scale machine learning one of the core strategies of interest is the SG method proposed by Robbins and Monro [214]: g(wk ) ←

1  ∇fi (wk ), |Sk |

(6.2.14)

i∈Sk

wk+1 ← wk − αk g(wk ),

(6.2.15)

for all k ∈ {1, 2, . . .} and Sk ⊆ {1, . . . , N } is a small subset of sample pairs (xi , yi ) chosen randomly from {1, . . . , N } and αk is a positive stepsize. Algorithm 6.1 shows a generalized SG (mini-batch) method. Algorithm 6.1 Stochastic gradient (SG) method [31] 1. input: Sample pairs (xj , yj ), j = 1, . . . , N . 2. initialization: Choose an initial iterate w1 . 3. for k = 1, 2, . . . do (k) (k) 4. Generate a realization of the random variables (xi , yi ), where i ∈ Sk and Sk ⊆ {1, . . . , N } is a small subset of samples.  (k) (k) 5. Compute a stochastic vector g(wk ) ← |S1k | i∈Sk ∇f (wk ; xi , yi ). 6. Choose a stepsize αk > 0. 7. Update the new iterate as wk+1 ← wk − αk g(wk ). 8. exit: if wk+1 is converged. 9. end for 1 k+1 10. output: w = k+1 i=1 wi .

The SG iteration is very inexpensive, requiring only the evaluation of g(wk , ξk ). However, since SG generates noisy iterate sequences that tend to oscillate around minimizers during the optimization process, a natural idea is to compute a corresponding sequence of iterate averages 1  wi , k+1 k+1

˜ k+1 ← w

(6.2.16)

i=1

˜ k } has no effect on the computation of the SG iterate where the averaged sequence {w sequence {wk }. Iterate averages would automatically possess less noisy behavior.

230

6 Machine Learning

6.2.2 Gradient Aggregation Methods Consider a common single-objective convex optimization problem min {F (x) = f (x) + h(x)} ,

x∈Rd

(6.2.17)

where the first term f (x) is smooth and the second term h(x) is possibly nonsmooth, which allows for the modeling of constraints. In many applications in optimization, pattern recognition, signal processing, and machine learning, f (x) has an additional structure: f (x) =

N 1  fi (x) N

(6.2.18)

i=1

which is the average of a number of convex functions fi . When N is large and when a solution with low to medium accuracy is sufficient, it is a common choice to perform classical stochastic methods, especially stochastic gradient descent (SGD) method for solving the optimization problem (6.2.17). The SGD method can be dated back to the 1951 seminal work of Robbins and Monro [214]. SGD selects an index i ∈ {1, . . . , N } uniformly at random, and then updates x using a stochastic estimate ∇fi (x) of ∇f (x). Thanks to the computation of ∇fi (x) that is N times cheaper than the computation of the full gradient ∇f (x), for optimization problems where N is very large, the per-iteration savings can be extremely large, spanning several orders of magnitude [151]. SG algorithms only use the current gradient information, while gradient aggregation is a method to reuse and/or revise previously computed gradient information. The goal of gradient aggregation is to achieve an improved convergence rate and a lower variance. In addition, if one maintains indexed gradient estimates in storage, then one can revise specific estimates as new information to be collected. Three famous gradient aggregation methods are SAG, SVRG, and SAGA. • Stochastic average gradient (SAG) method: If stochastic average gradient is used to replace iterate averages: SAG:

g(wk ) =

  N  1 ∇fj (wk ) − ∇fj (wk−1 ) + ∇fi (w[i] ) , N

(6.2.19)

i=1

one gets the well-known stochastic average gradient (SAG) method [156, 222]. In iteration k of SAG, ∇fj (wk ) = ∇f (wk ; xj , yj ), j ∈ {1, . . . , N} is chosen from the current gradient at random, and ∇fj (wk−1 ) = ∇f (wk−1 ; x[j ] , y[j ] ), j ∈ {1, . . . , N } is chosen from the past gradient at iteration k − 1 randomly, while ∇fi (w[i] ) = ∇f (w[i] ; x[i] , y[i] ) for all i ∈ {1, . . . , N}, where w[i] represents the latest iterate at which ∇fi was evaluated.

6.2 Optimization in Machine Learning

231

• Stochastic variance reduced gradient (SVRG) method: Stochastic gradient descent (SGD) can be improved with the variance reduction (VR) technique. The SVRG method [133] adopts a constant learning rate to train parameters:

SVRG:

  N 1  ˜ j ) − ∇fij (wk ) − g˜ j ← ∇fij (w ∇fi (wk ) . N

(6.2.20)

i=1

• Stochastic average gradient aggregation (SAGA) method: This method [72] is inspired both from SAG and SVRG. The stochastic gradient vector in SAGA is given by

SAGA:

N 1  gk ← ∇fj (wk ) − ∇fj (w[j ] ) + ∇fi (w[i] ), N

(6.2.21)

i=1

where j ∈ {1, . . . , N } is chosen at random and the stochastic vectors are set by ∇fj (wk ) =

∂f (wk ; xj , yj ) , ∂wk

(6.2.22)

and  ∂f (w; xj , yj )  , ∇fj (w[j ] ) =  ∂w w=w[j ]  ∂f (w; xi , yi )  ∇fi (w[i] ) = ,  ∂w w=w[i]

(6.2.23) (6.2.24)

for all i ∈ {1, . . . , N }, and an integer j ∈ {1, . . . , N } is chosen at random. w[i] and w[j ] represent the latest iterates at which ∇fi and ∇fj were, respectively, evaluated. A formal description of a few variants of SVRG is presented as Algorithm 6.2. SAGA method for minimizing an empirical risk En is shown in Algorithm 6.3. It has been shown [72] that • Theoretical convergence rates for SAGA in the strongly convex case are better than those for SAG and SVRG, • SAGA is applicable to non-strongly convex problems without modification.

232

6 Machine Learning

Algorithm 6.2 SVRG method for minimizing an empirical risk Rn [31, 133] 1. input: Sample pairs (xj , yj ), j = 1, . . . , N , update frequency m and learning rate α. 2. initialization: Choose an initial iterate w1 ∈ Rd . 3. for k = 1, 2, . . . do k ;xi ,yi ) 4. Compute ∇i (wk ) = ∂f (w∂w . k  5. Compute the batch gradient Rn (wk ) = N1 N i=1 ∇fi (wk ). ˜ 1 ← wk . 6. Initialize w 7. for j = 1, . . . , m do 8. Chose ij uniformly from {1, . . . , N }. ˜ j ) − (∇fij (wk ) − Rn (wk )). 9. Set g˜ j ← ∇fij (w ˜ j +1 ← w ˜ j − α g˜ j . 10. Set w 11. end for ˜ m+1 12. Option (a): Set wk+1 = w . ˜ j +1 . 13. Option (b): Set wk+1 = m1 m j =1 w ˜ j +1 . 14. Option (c): Choose j uniformly from {1, . . . , m} and set wk+1 = w 15. exit: If wk+1 is converged. 16. end for 17. output: w = wk+1 .

Algorithm 6.3 SAGA method for minimizing an empirical risk Rn [31, 72] 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15.

input: Sample pairs (xj , yj ), j = 1, . . . , N , stepsize α > 0. initialization: Choose an initial iterate w1 ∈ Rd . for i = 1, . . . , N do 1 ;xi ,yi ) Compute ∇fi (w1 ) = ∂f (w∂w . 1 Store ∇fi (w[i] ) ← ∇fi (w1 ). end for for k = 1, 2, . . . do Choose j uniformly in {1, . . . , N }. ∂f (wk ;xj ,yj ) Compute ∇fj (wk ) = . ∂wk  Set gk ← ∇fj (wk ) − ∇fj (w[j ] ) + n1 ni=1 ∇fi (w[i] ). Store ∇fj (w[i] ) ← ∇fj (wk ). Set wk+1 ← wk − αgk . exit: If wk+1 is converged. end for output: w = wk+1 .

6.2.3 Coordinate Descent Methods Coordinate descent methods (CDMs) are among the first optimization schemes suggested for solving smooth unconstrained minimization problems and large-scale regression problems (see, e.g., [12, 25, 188]). As the name suggests, the basic operation of coordinate descent methods takes steps along coordinate directions: the objective is minimized with respect to a single variable while all others are kept fixed, then other variables are updated similarly in an iterative manner.

6.2 Optimization in Machine Learning

233

The coordinate descent method for minimizing f (w) : Rd → R is given by the iteration wk+1 ← wk − αk ∇ik f (wk )eik

(6.2.25)

with ∇fik f (wk ) =

∂f (wk ) , ∂wik

(6.2.26)

where wik denotes the ik th element of the parameter vector wk = [w1 , . . . , wd ]T ∈ Rd and eik represents the ik th unit coordinate vector (also called natural basis vector) for some ik ∈ {1, . . . , d}, i.e., eik is a d × 1 vector whose ik th element is equal to one and others are equal to zero. Example 6.1 Given a vector wk = [wk1 , . . . , wkd ]T ∈ Rd . If the function f (wk ) = 1 1 2 2 2 2 wk 2 = 2 (wk1 + · · · + wid ), then ∇ik f (wk ) = wik ,

ik ∈ {1, . . . , d},

and the ith element of wk+1 is given by wk+1,i =

⎧ ⎨ wk,i − αk wi , i = ik ; ⎩

wk,i ,

otherwise;

for i = 1, . . . , d; ik ∈ {1, . . . , k}. That is to say, the solution estimates wk+1 and wk differ only in their ik th element as a result of a move in the ik th coordinate from wk . According to the choice of ik , the coordinate descent methods can be divided into two categories. 1. Cyclic coordinate descent: Cyclic coordinate search through {1, . . . , d}; 2. Stochastic coordinate descent: Random coordinate descent search [31]: • Cyclic random coordinate search through a random reordering of these indices (with the indices reordered after each set of d steps); • Select simply an index randomly with replacement in each iteration. Stochastic coordinate descent algorithms are almost identical to cyclic coordinate descent algorithms except for selecting coordinates in a random manner. Coordinate descent methods for solving minx f (x) consist of random choice step (R) and update step (U) at each iteration [166]: R:

Randomly choose an index ik ∈ {1, . . . , n}, read x, and evaluate ∇ik f (x);

U:

Update component ik of the shared xk as xk+1 ← xk − αk ∇ik f (xk )eik .

234

6 Machine Learning

In the basic coordinate descent methods, a single coordinate is only updated at each iteration. In the following we consider how to update a random subset of coordinates at each iteration, Consider the unconstrained minimization problem x = [x1 , . . . , xN ]T ∈ RN ,

min f (x),

x∈RN

(6.2.27)

where the objective function f (x) is convex and differentiable on RN . Under the assumption that decision variables are separable, to decompose the decision vector x ∈ RN into n non-overlapping blocks in which each block contains Ni decision variables: ⎡ (1) ⎤ x ⎢ .. ⎥ (6.2.28) x = ⎣ . ⎦ , where x(i) = [xi1 , . . . , xiN ]T ∈ RNi i

x(n) such that RN = RN1 × · · · × RNn ,

N=

n 

Ni .

(6.2.29)

i=1

Define the corresponding partition of the unit matrix IN ×N as I = [U1 , . . . , Un ] ∈ RN ×N ,

Ui ∈ RN ×Ni , i = 1, . . . , n.

(6.2.30)

If letting fi (x) be differentiable convex functions such that fi (x) depends on blocks x(i) for i ∈ Si = {i1 , . . . , iNi } only, then the partial derivative of function f (x) with respect to the block-vector x(i) is defined as ∇i f (x) =

∂f (x) = UTi ∇f (x) ∈ RNi . ∂x(i)

(6.2.31)

Proposition 6.1 (Block Decomposition [212]) Any vector x ∈ RN can be written uniquely as x=

n  i=1

where x(i) ∈ RNi . Moreover, x(i) = UTi x.

Ui x(i) ,

(6.2.32)

6.2 Optimization in Machine Learning

235

Example 6.2 When S1 = {1, 2}, U1 = [e1 , e2 ], where ei is the ith unit coordinate vector. Then we have x(1) = [x1 , x2 ]T and fi (x) = UTi f (x) = [f (x1 ), f (x2 )]T , which gives  ∇1 f (x) = ∇f1 (x) = ∇[f (x1 ), f (x2 )]T =

 f  (x1 ) , f  (x2 )

or ⎤ f  (x1 )   ⎢  ⎥   f (x1 ) 1 0 0 · · · 0 ⎢ f (x2 ) ⎥ T ∇1 f (x) = [e1 , e2 ] ∇f (x) = , ⎢ . ⎥= f  (x2 ) 0 1 0 · · · 0 ⎣ .. ⎦ ⎡

f  (xN ) where f  (xj ) =

∂f (x) ∂xj , j

= 1, 2.

The randomized coordinate descent method (RCDM) [188] consists of the following two basic steps: • Choose at random ik ∈ Si = {i1 , . . . , iNi }; • Update xk+1 = xk − αk UTik ∇ik f (xk ). To accelerate the randomized coordinate descent method, Nesterov [188] in 2012 proposed a computationally inefficient but theoretically interesting accelerated variant of RCDM, called accelerated coordinate descent method (ACDM), for convex minimization without constraints. Consider the regularized optimization problem with separable decision variables: min {F (x) = f (x) + h(x)}

x∈RN

subject to x =

n 

Ui x(i) ∈ RN1 × · · · × RNn = RN ,

(6.2.33)

i=1

where f (x) is a smooth convex objective function that is separable in the blocks x(i) : f (x) =

n 

fi (x),

where

fi (x) = UTi f (x),

(6.2.34)

i=1

and h(x) is a (possibly nonsmooth) convex regularizer that is also separable in the blocks x(i) : h(x) =

n  i=1

hi (x),

where

hi (x) = UTi h(x).

(6.2.35)

236

6 Machine Learning

Fercoq and Richtárk [91] in 2015 proposed the first randomized block coordinate descent method which is simultaneously accelerated, parallel, and proxima, called simply APPROX, for solving the optimization problem (6.2.33). Let Sˆ be a random subset of N = {1, . . . , N}. Algorithm 6.4 shows the APPROX coordinate descent method in a form facilitating efficient implementation. APPROX is a remarkably versatile method. Algorithm 6.4 APPROX coordinate descent method [91] 1. 2. 3. 4. 5. 6. 7.

input: initialization: Pick z˜ 0 ∈ RN and set θ0 = τn , u0 = 0. for k = 0, 1, . . . do ˆ Generate a random set of blocks Sk ∼ S. uk+1 ← uk , z˜ k+1 ← z˜ k . for i ∈ Sk do / 0 (i) (i) tk = arg max t∈RNi ∇i f (θ 2 uk + z˜ k ), t + nθ2τk vi t2(i) + hi (˜zk ) .

8.

(i) (i) (i) z˜ k+1 ← z˜ k + tk .

9.

uk+1 ← uk −

10.

(i)

end for 

(i)

1− τn θk (i) tk . θk2

θ 4 +4θ 2 −θ 2

11. θk+1 = k 2 k k . 12. end for 13. output: x = θk2 uk+1 + z˜ k+1 .

6.2.4 Benchmark Functions for Single-Objective Optimization The test of reliability, efficiency, and validation of an optimization algorithm is frequently carried out by using a chosen set of common benchmark or test functions [129]. There have been many benchmark or test functions reported in the literature. Ideally, benchmark functions should have diverse properties when testing new algorithms in an unbiased way. On the other hand, there is a serious phenomenon called “curse of dimensionality” in many optimization methods [21]. This implies that the optimization performance deteriorates quickly as the dimensionality of the search space increases. The reasons for this phenomenon appear to be two fold [240]: • The solution space of a problem often increases exponentially with the problem dimension [21] and more efficient search strategies are required to explore all promising regions within a given time budget. • The characteristics of a problem may change with the scale. Because an increase in scale may result in worsening of the features of an optimization problem, a previously successful search strategy may no longer be capable of finding the optimal solution. To solve large-scale problems, it is necessary to provide a suite of benchmark functions for large-scale numerical optimization.

6.2 Optimization in Machine Learning

237

Definition 6.1 (Separable Function) Given x = [x1 , . . . , xn ]T . A function f (x) is a separable function if and only if   arg min f (x1 , . . . , xn ) = arg min f (x1 , . . .), . . . , arg min f (. . . , xn ) , x1

(x1 ,...,xn )

xn

(6.2.36) where f (. . . , xi , . . .) denotes that its decision variable is just xi and other variables are constants with respect to xi . That is to say, a function of n decision variables is separable if it can be rewritten as a sum of n functions of just one variable, i.e., f (x1 , . . . , xn ) = f (x1 ) + · · · + f (xn ). If a function f (x) is separable, its parameters xi are said to be independent. A separable function can be easily optimized, as it can be decomposed into a number of subproblems, each of which involving only one decision variable while treating all others as constants. Definition 6.2 (Nonseparable Function) A nonseparable function f (x) is called m-nonseparable function if at most m (where m < n) of its decision variables xi are not independent. A nonseparable function f (x) is known as a fully nonseparable function if any two of its parameters xi are not independent. A nonseparable function has at least partly coupled or dependent decision variables. The definitions of separability provide us a measure of the difficulty of different optimization problems based on which a spectrum of benchmark problems can be designed. In general, separable problems are considered to be easiest, while the fully nonseparable ones usually are most difficult. A benchmark problem for large-scale optimization is to provide test functions to researchers or users for testing the optimization algorithms’ performance when applied to test problems. These test functions are as same as efficient in practical scenarios. Jamil and Yang [129] presented a collection of 175 unconstrained optimization test problems which can be used to validate the performance of optimization algorithms. Some benchmark functions are described by classification as follows. 1. Continuous, Differentiable, Separable, Multimodal Functions • Giunta Function [179]     2   16 2 16 f (x) = 0.6 + sin xi − 1 + sin xi − 1 15 15 i=1    16 1 sin 4 xi − 1 + 50 15

(6.2.37)

238

6 Machine Learning

subject to −1 ≤ xi ≤ 1. The global minimum is f (x∗ ) = 0.060447 located at x∗ = (0.45834282, 0.45834282). • Parsopoulos Function f (x) = cos(x1 )2 + sin(x2 )2

(6.2.38)

subject to −5 ≤ xi ≤ 5, where (x1 , x2 ) ∈ R2 . This function has an infinite number of global minima in R2 , at points (κ 2π 2 , λπ ), where κ = ±1, ±3, . . . and λ = 0, ±1, ±2, . . .. In the given domain problem, the function has 12 global minima all equal to zero. 2. Continuous, Differentiable, Nonseparable, Multimodal Functions • El-Attar-Vidyasagar-Dutta Function [85] ! "2 ! "2 ! "2 f (x) = x12 + x2 − 10 + x1 + x22 − 7 + x12 + x23 − 1

(6.2.39)

subject to −500 ≤ xi ≤ 500. The global minimum is f (x∗ ) = 0.470427 located at x∗ = (2.842503, 1.920175). • Egg Holder Function f (x) =

n−1 %

−(xi+1 + 47) sin |xi+1 + xi /2 + 47| − xi sin |xi − (xi+1 + 47)|

&

i=1

subject to −512 ≤ xi ≤ 512. The global minimum for n = 2 is f (x∗ ) ≈ −959.64 located at x∗ = (512, 404.2319). • Exponential Function [209]  f (x) = − exp −0.5

D 

 xi2

(6.2.40)

,

i=1

subject to −1 ≤ xi ≤ 1. The global minima is f (x∗ ) = 1 located at x∗ = (0, . . . , 0). • Langerman-5 Function [23] f (x) = −

m  i=1



⎛ ⎞ ⎞ D D   1 ci exp ⎝− (xj − aij )2 ⎠ cos ⎝π (xj − aij )2 ⎠ π j =1

j =1

(6.2.41) subject to 0 ≤ xj ≤ 10, where j ∈ [1, D] and m = 5. It has a global minimum value of f (x∗ ) = −1.4. The matrix A = [aij ] and column vector c = [ci ] are

6.2 Optimization in Machine Learning

239

given as ⎡ 9.681 0.667 ⎢9.400 2.041 ⎢ ⎢ A = ⎢8.025 9.152 ⎢ ⎣2.196 0.415 8.074 8.777

4.783 9.095 3.517 3.788 7.931 2.882 5.114 7.621 4.564 5.649 6.979 9.510 3.467 1.863 6.708

9.325 6.544 2.672 3.568 4.711 2.996 9.166 6.304 6.349 4.534

⎤ 0.211 5.122 2.020 1.284 7.033 7.374⎥ ⎥ ⎥ 6.126 0.734 4.982⎥ ⎥ 6.054 9.377 1.426⎦ 0.276 7.633 1.567

and c = [0.806, 0.517, 1.500, 0.908, 0.965]T . • Mishra Function 1 [180]  f (x) = 1 + D −

n−1 

n−1 n−i=1 0.5(xi +xi+1 )

0.5(xi + xi+1 )

(6.2.42)

i=1

subject to 0 ≤ xi ≤ 1. The global minimum is f (x∗ ) = 2. • Mishra Function 2 [180] ! " ! " &2 % f (x) = − ln sin2 (cos(x1 ) + cos(x2 ))2 + cos2 (sin(x1 ) + sin(x2 ))2 + x1 + 0.01((x1 − 1)2 + (x2 − 1)2 ).

(6.2.43)

The global minimum f (x∗ )=−2.28395 is located at x∗ =(2.88631, 1.82326). 3. Continuous, Non-Differentiable, Separable, Multimodal Functions • Price Function [208] f (x) = (|x1 | − 5)2 + (|x2 | − 5)2

(6.2.44)

subject to −500 ≤ xi ≤ 500. The global minima f (x∗ ) = 0 are located at x∗ = (−5, −5), (−5, 5), (5, −5), (5, 5). • Schwefel Function [223] f (x) = −

n 

|xi |

(6.2.45)

i=1

subject to −100 ≤ xi ≤ 100. The global minimum f (x∗ ) = 0 is located at x∗ = (0, . . . , 0). 4. Continuous, Non-Differentiable, Nonseparable, Multimodal Functions • Bartels Conn Function     f (x) = x12 + x22 + x1 x2  + |sin(x1 )| + |cos(x2 )|

(6.2.46)

240

6 Machine Learning

subject to −500 ≤ xi ≤ 500. The global minimum f (x∗ ) = 1 is located at x∗ = (0, 0). • Bukin Function  f (x) = 100 |x2 − 0.01x12 | + 0.01|x1 + 10| (6.2.47) subject to −15 ≤ x1 ≤ −5 and −13 ≤ x2 ≤ −3. The global minimum F (x∗ ) = 0 is located at x∗ = (−10, 1). 5. Continuous, Differentiable, Partially Separable, Unimodal Functions • Chung–Reynolds Function [54] f (x) =

D 

2 xi2

(6.2.48)

i=1

subject to −100 ≤ xi ≤ 100. The global minimum f (x∗ ) = 0 is located at x∗ = (0, . . . , 0). • Schwefel Function [223]  f (x) =

D 

α xi2

(6.2.49)

i=1

subject to −100 ≤ xi ≤ 100, where α ≥ 0. The global minimum f (x∗ ) = 0 is located at x∗ = (0, . . . , 0). 6. Discontinuous, Non-Differentiable Functions • Corana Function [61]

f (x) =

  ⎧ ⎨ 0.15 zi − 0.05 sign(zi )2 di , if |vi | < A, ⎩

(6.2.50) di xi2 ,

otherwise,

where vi = |xi − zi |, A = 0.05, 7 6 x   i  zi = 0.2   + 0.49999 sign(xi ) 0.2 di = (1, 1000, 10, 100) subject to −500 ≤ xi ≤ 500. The global minimum f (x∗ ) = 0 is located at x∗ = (0, 0, 0, 0).

6.3 Majorization-Minimization Algorithms

241

• Cosine Mixture Function [4] f (x) = −0.1

n  i=1

cos(5π xi ) −

n 

xi2

(6.2.51)

i=1

subject to −1 ≤ xi ≤ 1. The global minimum is located at x∗ = (0, 0) or x∗ = (0, 0, 0, 0), f (x∗ ) = 0.2 or 0.4 for n = 2 and 4, respectively. In this section we have focused on the single-objective optimization problems in machine learning. In a lot of applications in artificial intelligence, we will be faced with multi-objective optimization problems. Because these problems are closely related to evolutionary computation, we will deal with multi-objective optimization theory and methods in Chap. 9.

6.3 Majorization-Minimization Algorithms In the era of big data, a fast development in data acquisition techniques and a growth of computing power, from an optimization perspective, can result in largescale problems due to the tremendous amount of data and variables, which cause challenges to traditional algorithms [87]. By Hunter and Lange [125], the general principle behind majorization-minimization (MM) algorithms was first enunciated by the numerical analysts in 1970s [197] in the context of line search methods. MM algorithms are a set of analytic procedures for tackling difficult optimization problems. The basic idea of MM algorithms is to modify their objective functions so that solution spaces of the modified objective functions are easier to explore. A successful MM algorithm substitutes a simple optimization problem for a difficult optimization problem. By [125], simplicity can be attained by • • • • •

avoiding large matrix inversions, linearizing an optimization problem, separating the parameters of an optimization problem, dealing with equality and inequality constraints gracefully, or turning a non-differentiable problem into a smooth problem.

MM has a long history that dates back to the 1970s [197], and is closely related to the famous expectation-maximization (EM) algorithm [278] intensively used in computational statistics.

242

6 Machine Learning

6.3.1 MM Algorithm Framework Consider the minimization problem θˆ = arg min g(θ) θ ∈Θ

(6.3.1)

for some difficult to manipulate objective function g(θ ). For example, g(θ ) with θ ∈ Θ is non-differential and/or non-convex, where Θ is a subset of some Euclidean space. Let θ (m) represent a fixed value of the parameter θ (where m is the mth iteration), and let g(θ |θ (m) ) denote a real-valued function of θ whose form depends on θ (m) . Instead of operating on g(θ), we can consider an easier surrogate function h(θ|θ (m) ) at some point θ (m) instead. Definition 6.3 (Majorizer [125, 191]) The surrogate function h(θ |θ (m) ) is said to be a majorizer of objective g(θ) provided h(θ|θ (m) ) ≥ g(θ)

for all θ ,

h(θ (m) |θ (m) ) = g(θ (m) ).

(6.3.2) (6.3.3)

To find the majorizer h(θ|θ (m) ) of objective g(θ ) constitutes the first step of MM algorithms, called majorization step. The function h(θ|θ (m) ) satisfying the conditions (6.3.2) and (6.3.3) is said to majorize a real-valued function g(θ ) at the point θ (m) . In other words, the surface θ → h(θ|θ (m) ) lies above the surface g(θ ) and is tangent to it at the point θ = θ (m) . In this sense, h(θ|θ (m) ) is called the majorization function. Moreover, if θ (m+1) is a minimizer of h(θ|θ (m) ), then both (6.3.2) and (6.3.3) further imply [288] that g(θ (m) ) = h(θ (m) |θ (m) ) ≥ h(θ (m+1) |θ (m) ) ≥ g(θ (m+1) ).

(6.3.4)

This is an important property of the MM algorithm, which means that the sequence {g(θ (m) )}, m = {1, 2, . . .} is non-increasing, so the iteration procedure θ (m) pushes g(θ ) toward its minimum. The function h(θ|θ (m) ) is said to minorize g(θ ) at θ (m) if −h(θ|θ (m) ) majorizes −g(θ ) at θ (m) . The second step, known as minimization step, is to minimize the surrogate function, namely update θ as θ (m+1) = arg min h(θ|θ (m) ). θ

(6.3.5)

The descent property (6.3.2) lends an MM algorithm remarkable numerical stability. When minimizing a difficult loss function g(θ), we relax this loss function to a surrogate function h(θ|θ (m) ) that is easily minimized.

6.3 Majorization-Minimization Algorithms

243

Example 6.3 Given a matrix pencil (A, B), consider its generalized eigenvalue (GEV) problem Ax = λBx. From λ = x^T A x/(x^T B x) it is known that the variational formulation for the GEV problem Ax = λBx is given by

λ_max(A, B) = max_x {x^T A x}   subject to   x^T B x = 1,                      (6.3.6)

where λ_max(A, B) is the maximum generalized eigenvalue associated with the matrix pencil (A, B). Then, the sparse GEV problem can be written as

max_x {x^T A x}   subject to   x^T B x = 1,  ‖x‖_0 ≤ k.                        (6.3.7)

Consider the regularized (penalized) version of (6.3.7) given by [237]:

max_x {x^T A x − ρ‖x‖_0}   subject to   x^T B x ≤ 1,                           (6.3.8)

where ρ > 0 is the regularization (penalization) parameter. The equality constraint x^T B x = 1 in (6.3.7) has been relaxed to the inequality constraint x^T B x ≤ 1. To relax further the non-convex ℓ_0-norm ‖x‖_0, consider using

‖x‖_0 = Σ_{i=1}^n 1{|x_i| ≠ 0} = lim_{ε→0} Σ_{i=1}^n log(1 + |x_i|/ε) / log(1 + 1/ε)        (6.3.9)

in (6.3.8) to rewrite it in the equivalent form

max_x { x^T A x − ρ lim_{ε→0} Σ_{i=1}^n log(1 + |x_i|/ε) / log(1 + 1/ε) }   subject to   x^T B x ≤ 1.        (6.3.10)

The above program is approximated by the following approximate sparse GEV program, obtained by neglecting the limit in (6.3.10) and choosing ε > 0:

max_x { x^T A x − ρ Σ_{i=1}^n log(1 + |x_i|/ε) / log(1 + 1/ε) }   subject to   x^T B x ≤ 1,        (6.3.11)

which is finally equivalent to [237]

max_x { x^T A x − ρ_ε Σ_{i=1}^n log(1 + |x_i|/ε) }   subject to   x^T B x ≤ 1,        (6.3.12)

where ρ_ε = ρ/log(1 + 1/ε). Hence, instead of the non-convex sparse optimization problem (6.3.7), we can manipulate the easier surrogate optimization problem (6.3.12) by using the MM algorithm.


In addition to the typical examples above, there are many applications of MM algorithms to machine learning, statistical estimation, and signal processing problems; see [191] for a list of 26 applications, which include fully visible Boltzmann machine estimation, linear mixed model estimation, Markov random field estimation, matrix completion and imputation, quantile regression estimation, SVM estimation, and so on. As pointed out by Nguyen [191], the MM algorithm framework is a popular tool for deriving useful algorithms for problems in machine learning, statistical estimation, and signal processing.

6.3.2 Examples of Majorization-Minimization Algorithms

Let θ^(0) be some initial value and θ^(m) be a sequence of iterates (in m) for the minimization problem (6.3.1). Definition 6.3 suggests the following scheme, referred to as an MM algorithm.

Definition 6.4 (Majorization-Minimization (MM) Algorithm [191]) Let θ^(0) be some initial value and θ^(m) be the mth iterate. Then, θ^(m+1) is said to be the (m+1)th iterate of an MM algorithm if it satisfies

θ^(m+1) = arg min_{θ∈Θ} h(θ|θ^(m)).                                            (6.3.13)

From Definitions 6.3 and 6.4 it can be deduced that all MM algorithms have the monotonicity property. That is, if θ^(m) is a sequence of MM algorithm iterates, the objective sequence g(θ^(m)) is monotonically non-increasing in m.

Proposition 6.2 (Monotonic Decrease [125, 191]) If h(θ|θ^(m)) is a majorizer of the objective g(θ) and θ^(m) is a sequence of MM algorithm iterates, then the MM iterates satisfy the monotonic decrease property

g(θ^(m+1)) ≤ h(θ^(m+1)|θ^(m)) ≤ h(θ^(m)|θ^(m)) ≤ g(θ^(m)).                     (6.3.14)

Remarks It is notable that an algorithm need not be an MM algorithm in the strict sense of Definition 6.4 in order for (6.3.14) to hold. In fact, any algorithm whose (m+1)th iterate satisfies

θ^(m+1) ∈ {θ ∈ Θ : h(θ|θ^(m)) ≤ h(θ^(m)|θ^(m))}                                (6.3.15)

will generate a monotonically decreasing sequence of objective evaluates. Such an algorithm can be thought of as a generalized MM algorithm.


Proposition 6.3 (Stationary Point [191]) Starting from some initial value θ^(0), if θ^(∞) is the limit point of an MM algorithm sequence of iterates θ^(m) (i.e., satisfying Definition 6.4), then θ^(∞) is a stationary point of the minimization problem (6.3.1).

Remarks Proposition 6.3 only guarantees the convergence of MM algorithm iterates to a stationary point and not to a global, or even a local, minimum. As such, for problems with difficult objective functions, multiple or good initial values are required in order to ensure that the obtained solution is of high quality. Furthermore, Proposition 6.3 only guarantees convergence to a stationary point of (6.3.1) if a limit point exists for the chosen starting value. If a limit point does not exist, then the MM algorithm objective sequence may diverge.

The following result is useful when constructing the majorization function.

Lemma 6.1 ([236]) Let L be an Hermitian matrix and M be another Hermitian matrix such that M ⪰ L (i.e., M − L is positive semidefinite). Then for any point θ_0 ∈ C^n, the quadratic function θ^H L θ is majorized by θ^H M θ + 2 Re{θ^H (L − M) θ_0} + θ_0^H (M − L) θ_0 at θ_0.

To minimize g(θ), the main steps of the majorization-minimization scheme are as follows [236]:

1. Find a feasible point θ^(0) and set m = 0.
2. Construct a majorization function h(θ|θ^(m)) of g(θ) at the point θ^(m).
3. Solve θ^(m+1) = arg min_{θ∈Θ} h(θ|θ^(m)).
4. If some convergence criterion is met then exit; otherwise, set m = m + 1 and go to Step 2.

Consider the "weighted" ℓ_0 minimization problem [43]:

min_{x∈R^n} ‖Wx‖_0   subject to   y = Φx,                                      (6.3.16)

where Φ denotes the measurement (sensing) matrix,

and the "weighted" ℓ_1 minimization problem

min_{x∈R^n} Σ_{i=1}^p w_i |x_i|   subject to   y = Φx.                         (6.3.17)

The weighted ℓ_1 minimization can be viewed as a relaxation of a weighted ℓ_0 minimization problem. To establish the connection between the weighted ℓ_1 minimization and the weighted ℓ_0 minimization, we consider the surrogate function

|x_i| / (|x_i^(m)| + ε)   ≈ 1, if x_i = x_i^(m);   = 0, if x_i^(m) = x_i = 0.  (6.3.18)


Here ε is a small positive value. This shows that if we update

x^(m+1) = arg min_{x∈R^n} Σ_{i=1}^n |x_i| / (|x_i^(m)| + ε),                   (6.3.19)

then Σ_{i=1}^n |x_i| / (|x_i^(m)| + ε) gives approximately ‖x‖_0 when x_i^(m) → x_i for all i = 1, ..., n, or x^(m) → x.

Interestingly, (6.3.19) can be viewed as a form of the weighted ℓ_1 minimization (6.3.17), where w_i^(m+1) = 1/(|x_i^(m)| + ε). Therefore, the MM algorithm for solving the weighted ℓ_1 minimization problem (6.3.17) consists of the following iterative steps [43]:

1. Set the iteration count m to zero and w_i^(0) = 1, i = 1, ..., n.
2. Solve the weighted ℓ_1 minimization problem

   x^(m) = arg min_{x∈R^n} Σ_{i=1}^n w_i^(m) |x_i|   subject to   y = Φx.      (6.3.20)

3. Update the weights: for each i = 1, ..., n,

   w_i^(m+1) = 1 / (|x_i^(m)| + ε).                                            (6.3.21)

4. Terminate on convergence or when m attains a specified maximum number of iterations m_max. Otherwise, increment m by 1 and go to Step 2.
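A small Python sketch of these iterative steps follows. It assumes a measurement matrix named Phi and solves each weighted ℓ_1 subproblem as a linear program with SciPy; both the naming and the LP formulation are illustrative choices, not prescribed by the book.

```python
# Reweighted l1 sketch: each outer iteration solves the weighted l1 problem
# (6.3.20) and then updates the weights as in (6.3.21).
import numpy as np
from scipy.optimize import linprog

def weighted_l1(Phi, y, w):
    """min sum_i w_i*t_i  s.t.  Phi x = y,  -t <= x <= t  (so t_i >= |x_i|)."""
    m, n = Phi.shape
    c = np.concatenate([np.zeros(n), w])
    A_eq = np.hstack([Phi, np.zeros((m, n))])
    A_ub = np.block([[np.eye(n), -np.eye(n)], [-np.eye(n), -np.eye(n)]])
    b_ub = np.zeros(2 * n)
    bounds = [(None, None)] * n + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y, bounds=bounds)
    return res.x[:n]

def reweighted_l1(Phi, y, eps=1e-3, n_iter=5):
    n = Phi.shape[1]
    w = np.ones(n)                      # step 1: initial weights
    for _ in range(n_iter):
        x = weighted_l1(Phi, y, w)      # step 2: weighted l1 minimization
        w = 1.0 / (np.abs(x) + eps)     # step 3: weight update (6.3.21)
    return x
```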

For a two-dimensional array (x_{i,j}), 1 ≤ i, j ≤ n, let (Dx)_{i,j} be the two-dimensional vector of forward differences (Dx)_{i,j} = [x_{i+1,j} − x_{i,j}, x_{i,j+1} − x_{i,j}]. The total-variation (TV) norm of x is defined in terms of (Dx)_{i,j} as

‖x‖_TV = Σ_{1≤i,j≤n−1} ‖(Dx)_{i,j}‖.                                           (6.3.22)

Because many natural images have a sparse or nearly sparse gradient, it is interesting to search for the reconstruction with minimal weighted TV norm, i.e.,

min_x Σ_{1≤i,j≤n−1} w_{i,j}^(m) ‖(Dx)_{i,j}‖   subject to   y = Φx,            (6.3.23)

where w_{i,j}^(m) = 1/(‖(Dx)_{i,j}‖ + ε), by mimicking the derivation of (6.3.21).


The MM algorithm for minimizing a sequence of weighted TV norms is as follows [43]:

1. Set m = 0 and choose initial weights w_{i,j}^(0), 1 ≤ i, j ≤ n − 1.
2. Solve the weighted TV minimization problem

   x^(m) = arg min_{x∈R^n} Σ_{1≤i,j≤n−1} w_{i,j}^(m) ‖(Dx)_{i,j}‖   subject to   y = Φx.      (6.3.24)

3. Update the weights

   w_{i,j}^(m+1) = 1 / (‖(Dx^(m))_{i,j}‖ + ε)                                  (6.3.25)

   for each (i, j), 1 ≤ i, j ≤ n − 1.
4. Terminate on convergence or when m attains a specified maximum number of iterations m_max. Otherwise, increment m by 1 and go to Step 2.

6.4 Boosting and Probably Approximately Correct Learning

Valiant and Kearns [140, 254] put forward the concepts of weak learning and strong learning. A learning algorithm whose error rate is less than 1/2 (i.e., whose accuracy is only slightly higher than random guessing) is called a weak learning algorithm, while a learning algorithm whose accuracy is very high and which can be completed in polynomial time is called a strong learning algorithm. At the same time, Valiant and Kearns posed the equivalence problem of weak and strong learning algorithms in the probably approximately correct (PAC) learning model: if a given weak learning algorithm is only slightly better than random guessing, can it be upgraded to a strong learning algorithm?

In 1990, Schapire [221] constructed the first polynomial-time algorithm proving that a weak learning algorithm which is slightly better than random guessing can be upgraded to a strong learning algorithm, instead of having to find a strong learning algorithm directly, which is much harder. This is the original boosting algorithm. In 1995, Freund and Schapire [96] improved the boosting algorithm and proposed the AdaBoost (Adaptive Boosting) algorithm.

"Boosting" is a way of combining the performance of many "weak" classifiers to produce a powerful "committee." Boosting was proposed in the computational learning theory literature [96, 97, 221] and has since received much attention [100].


6.4.1 Boosting for Weak Learners

Boosting is a framework algorithm that operates on the sample set to obtain sample subsets, and then trains a weak classification algorithm on the sample subsets to generate a series of base classifiers.

Let x_i be an instance and X = {x_1, ..., x_N} be a set called a domain (also referred to as the instance space). If D_s is any fixed probability distribution over the instance space X, and D_t is a distribution over the set of training examples (labeled instances) X × Y, then D_s and D_t will be referred to as the instance distribution (or source distribution) and the target distribution, respectively. In most of machine learning, it is assumed that the instance distribution and the target distribution agree over the instance space, i.e., D_s coincides with the marginal of D_t on X.

A concept c over X is some subset of the instance space X with unknown labels. A concept can be equivalently defined as a Boolean mapping c : X → {0, 1}, with c(x) = 1 indicating that x is a positive example of c and c(x) = 0 indicating that x is a negative example. The Boolean mapping or concept of an instance or domain point x ∈ X is sometimes known as a representation h of x.

Definition 6.5 (Agree, Consistent [140]) A representation h and an example (x, y) agree if h(x) = y; otherwise, they disagree. Let S = {(x_1, y_1), ..., (x_N, y_N)} be any labeled set of instances, where each x_i ∈ X and y_i ∈ {0, 1}. If c is a concept over X, and c(x_i) = y_i for all 1 ≤ i ≤ N, then the concept or representation c is consistent with the sample S (or equivalently, S is consistent with c). Otherwise, they are inconsistent.

For M target classes, where each class is represented by C_i, i ∈ {1, ..., M}, the task of a classifier is to assign the input sample x to one of (M + 1) classes, with the (M + 1)th class denoting that the classifier rejects x. Let h_1, ..., h_K be K classifiers, each giving its own classification result h_i(x). The problem is how to produce a combined result H(x) = j, j ∈ {1, ..., M, M + 1}, from all K predictions h_1(x), ..., h_K(x). The most common method to combine the outputs is majority voting, which consists of the following three steps [281].

1. Define a binary function to represent the number of votes:

   V_k(x ∈ C_i) = 1, if h_k(x) = i, i ∈ {1, ..., M};  V_k(x ∈ C_i) = 0, otherwise.        (6.4.1)

2. Sum the votes from all K classifiers for each C_i:

   V_H(x ∈ C_i) = Σ_{k=1}^K V_k(x ∈ C_i),   i = 1, ..., M.                     (6.4.2)


3. The combined result H(x) is determined by

   H(x) = i,      if V_H(x ∈ C_i) = max_{j∈{1,...,M}} V_H(x ∈ C_j) and V_H(x ∈ C_i) ≥ α · K;
   H(x) = M + 1,  otherwise.                                                   (6.4.3)

Here α is a user-defined threshold that controls the confidence in the final decision.
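The following short Python sketch illustrates the three voting steps (6.4.1)-(6.4.3); the example classifier outputs and the value of α are illustrative only.

```python
# Majority voting over K classifier outputs with a confidence threshold alpha.
import numpy as np

def majority_vote(predictions, M, alpha=0.5):
    """predictions: list of class labels in {1,...,M}, one per classifier."""
    K = len(predictions)
    votes = np.zeros(M + 1)              # votes[i] = V_H(x in C_i), 1-based
    for h in predictions:                # (6.4.1)-(6.4.2): accumulate votes
        votes[h] += 1
    i_best = int(np.argmax(votes[1:]) + 1)
    if votes[i_best] >= alpha * K:       # (6.4.3): confident enough to decide
        return i_best
    return M + 1                         # otherwise reject

print(majority_vote([2, 2, 3, 2, 1], M=3))   # 3 votes >= 0.5*5, so class 2
```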

For a learning algorithm, we are primarily interested in its computational efficiency.

Definition 6.6 (Polynomially Evaluatable [140]) Let C be a representation class over X. C is said to be polynomially evaluatable if there is a polynomial-time (possibly randomized) evaluation algorithm A that, on input a representation c ∈ C and a domain point x ∈ X, runs in time polynomial in |c| and |x| and decides if x ∈ c(x).

A machine learning algorithm is expected to be a strong learning algorithm that is polynomially evaluatable and produces a representation h that is consistent with a given sample S. However, it is difficult, even impractical, to design a strong learning algorithm directly. A simple and effective approach is to apply a boosting framework to a weak machine learning algorithm. A machine learner for producing a two-class classifier is a weak learner if it is a coin-flip algorithm based completely on random guessing. Schapire [221] showed that a weak learner could always improve its performance by training two additional classifiers on filtered versions of the input data stream. After learning an initial classifier h_1 on the first N training points, if two additional classifiers are produced as follows [100, 221]:

1. h_2 is learned on a new sample of N points, half of which are misclassified by h_1,
2. h_3 is learned on N points for which h_1 and h_2 disagree, and
3. the boosted classifier is h_B = Majority Vote(h_1, h_2, h_3),

then a boosted weak learner can produce a two-class classifier with performance guaranteed (with high probability) to be significantly better than a coin flip.

An additive logistic regression model is defined as

log [P(Y = 1|x) / P(Y = −1|x)] = β_0 + β_1 x_1 + · · · + β_p x_p,              (6.4.4)

where P(x) = 1/(1 + e^{−x}) is called the logistic function, commonly called the sigmoid function. Let F(x) be an additive function of the form

F(x) = Σ_{m=1}^M f_m(x),                                                       (6.4.5)


and consider the problem of minimizing the exponential criterion

J(F) = E{e^{−yF(x)}}                                                           (6.4.6)

for the estimation of F(x).

Lemma 6.2 ([100]) E{e^{−yF(x)}} is minimized at

F(x) = (1/2) log [P(y = 1|x) / P(y = −1|x)].                                   (6.4.7)

Hence

P(y = 1|x) = e^{F(x)} / (e^{−F(x)} + e^{F(x)}),                                (6.4.8)
P(y = −1|x) = e^{−F(x)} / (e^{F(x)} + e^{−F(x)}).                              (6.4.9)

Lemma 6.2 shows that for an additive function F(x) of the form (6.4.5), minimizing the exponential criterion J(F) in (6.4.6) is an additive logistic regression problem in two classes. Friedman, Hastie, and Tibshirani [100] developed a boosting algorithm for additive logistic regression, called simply LogitBoost, as shown in Algorithm 6.5.

Algorithm 6.5 LogitBoost (two classes) [100]
1. input: weights w_i = 1/N, i = 1, ..., N, F(x) = 0 and probability estimates p(x_i) = 1/2.
2. for m = 1 to M do
3.   Compute the working response and weights
       z_i = (y_i^* − p(x_i)) / (p(x_i)(1 − p(x_i))),   w_i = p(x_i)(1 − p(x_i)).
4.   Fit the function f_m(x) by a weighted least squares regression of z_i to x_i using weights w_i.
5.   Update F(x) ← F(x) + (1/2) f_m(x) and p(x) ← e^{F(x)} / (e^{F(x)} + e^{−F(x)}).
6. end for
7. output: the classifier sign[F(x)] = sign[Σ_{m=1}^M f_m(x)].
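A compact Python sketch of Algorithm 6.5 is given below. The base fitter f_m(x) is taken to be a weighted linear regression with an intercept, which is an illustrative assumption; any weighted least squares regressor could be plugged in. Here y01 denotes the 0/1 response y*.

```python
# LogitBoost sketch (two classes) following Algorithm 6.5.
import numpy as np

def logitboost(X, y01, M=20):
    N = X.shape[0]
    Xb = np.hstack([X, np.ones((N, 1))])          # add an intercept column
    F = np.zeros(N)
    p = np.full(N, 0.5)
    coefs = []
    for _ in range(M):
        w = p * (1.0 - p)                         # working weights
        z = (y01 - p) / np.clip(w, 1e-10, None)   # working response
        sw = np.sqrt(w)
        c, *_ = np.linalg.lstsq(Xb * sw[:, None], z * sw, rcond=None)
        F = np.clip(F + 0.5 * (Xb @ c), -15, 15)  # F <- F + f_m/2 (clipped)
        p = np.exp(F) / (np.exp(F) + np.exp(-F))  # update probabilities
        coefs.append(c)
    return coefs

# Prediction: sign of F(x) = sum_m 0.5 * [x, 1] @ c_m over the fitted coefs.
```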

6.4.2 Probably Approximately Correct Learning

We are given a set of N training examples X = {(x_1, y_1), ..., (x_N, y_N)} drawn randomly from X × Y according to a distribution D_s, where Y is the label set consisting of just two possible labels Y = {0, 1} in two-class classification or C possible labels Y ∈ {1, ..., C} in multi-class classification. The machine learner is to find a hypothesis or representation h_f which is consistent with most of the sample (i.e., h_f(x_i) = y_i for most 1 ≤ i ≤ N). In general, a hypothesis which is accurate on the training set might not be accurate on examples outside the training set; this problem is sometimes referred to as "overfitting" [97]. Avoiding overfitting is one of the important problems that must be considered in machine learning.

Definition 6.7 (Probably Approximately Correct (PAC) Model [141]) Let C be a concept class over X. C is said to be probably approximately correct (PAC) learnable if there exists an algorithm L with the property: given an error parameter ε and a confidence parameter δ, for every concept c ∈ C, for every distribution D_s on X, and for all 0 ≤ ε ≤ 1/2 and 0 ≤ δ ≤ 1/2, L, after some amount of time, outputs a hypothesis concept h ∈ C satisfying error(h) ≤ ε with probability 1 − δ. If L runs in time polynomial in 1/ε and 1/δ, then c is said to be efficiently PAC learnable.

The hypothesis h ∈ C of the PAC learning algorithm is "approximately correct" with high probability, hence the name "Probably Approximately Correct" (PAC) learning [141]. In machine learning, it is usually difficult and even impracticable to find a learner whose concept over X satisfies c(x_i) = y_i for all labels 1 ≤ i ≤ N. Hence, the goal of PAC learning is to find a hypothesis h : X → {0, 1} which is consistent with most (rather than all) of the sample (i.e., h(x_i) = y_i for most 1 ≤ i ≤ N). After some amount of time, the learner must output a hypothesis h : X → [0, 1]. The value h(x) can be interpreted as a randomized prediction of the label of x that is 1 with probability h(x) and 0 with probability 1 − h(x).

Definition 6.8 (Strong PAC-Learning Algorithm [97]) A strong PAC-learning algorithm is an algorithm that, given an accuracy ε > 0 and a reliability parameter δ > 0 and access to random examples, outputs with probability 1 − δ a hypothesis with error at most ε.

Definition 6.9 (Weak PAC-Learning Algorithm [97]) A weak PAC-learning algorithm is an algorithm that, given an accuracy ε > 0 and access to random examples, outputs with probability 1 − δ a hypothesis with error at most 1/2 − γ, where γ > 0 is either a constant or decreases as 1/p, where p is a polynomial in the relevant parameters.

In strong PAC learning, the learner is required to generate a hypothesis whose error is smaller than the required accuracy ε. On the other hand, in weak PAC learning, the accuracy of the hypothesis is required to be just slightly better than 1/2, which is the accuracy of a completely random guess. When learning with respect to a given distribution over the instances, weak and strong learning are not equivalent. For a great assortment of learning problems, a "boosting algorithm" can convert a "weak" PAC learning algorithm that performs just slightly better than random guessing into one with arbitrarily high accuracy [97].

There are two frameworks in which boosting can be applied: boosting by filtering and boosting by sampling [96]. Adaptive boosting (AdaBoost) of Freund and Schapire [97] is a popular boosting-by-sampling algorithm, which has been used in conjunction with a wide range of other machine learning algorithms to enhance their performance.


Finding multiple weak classification algorithms with low recognition rates is much easier than finding a strong classification algorithm with a high recognition rate. AdaBoost aims at boosting the accuracy of a weak learner by carefully adjusting the weights of training instances and learning a classifier accordingly. After T such iterations, the final hypothesis h_f is output. The hypothesis h_f combines the outputs of the T weak hypotheses using a weighted majority vote. Algorithm 6.6 gives the adaptive boosting (AdaBoost) algorithm in [97].

Algorithm 6.6 Adaptive boosting (AdaBoost) algorithm [97]
1. input:
   1.1 sequence of N labeled examples {(x_1, y_1), ..., (x_N, y_N)},
   1.2 distribution D over the N examples,
   1.3 weak learning algorithm WeakLearn,
   1.4 integer T specifying number of iterations.
2. initialization: the weight vector w_i^1 = D(i) for i = 1, ..., N.
3. for t = 1 to T do
4.   Set p^t = w^t / Σ_{i=1}^N w_i^t.
5.   Call WeakLearn, providing it with the distribution p^t; get back a hypothesis h_t : X → {0, 1}.
6.   Calculate the error of h_t:  ε_t = Σ_{i=1}^N p_i^t |h_t(x_i) − y_i|.
7.   Set β_t = ε_t / (1 − ε_t).
8.   Set the new weights vector to be w_i^{t+1} = w_i^t β_t^{1−|h_t(x_i)−y_i|}.
9. output: the hypothesis
   h_f(x) = 1, if Σ_{t=1}^T (log 1/β_t) h_t(x) ≥ (1/2) Σ_{t=1}^T log(1/β_t);  h_f(x) = 0, otherwise.
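The following Python sketch follows Algorithm 6.6 directly. The weak learner used is a one-dimensional threshold ("decision stump"); it stands in for WeakLearn and is an illustrative assumption, not part of the algorithm itself. Labels are taken in {0, 1}.

```python
# AdaBoost sketch following Algorithm 6.6 with a decision-stump WeakLearn.
import numpy as np

def stump_learn(X, y, p):
    """Return (feature, threshold, polarity) minimizing the weighted error."""
    best, err_best = None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (0, 1):
                h = (X[:, j] > thr).astype(float) if pol else (X[:, j] <= thr).astype(float)
                err = np.sum(p * np.abs(h - y))
                if err < err_best:
                    best, err_best = (j, thr, pol), err
    return best

def stump_predict(stump, X):
    j, thr, pol = stump
    return (X[:, j] > thr).astype(float) if pol else (X[:, j] <= thr).astype(float)

def adaboost(X, y, T=10):
    N = len(y)
    w = np.full(N, 1.0 / N)                      # w_i^1 = D(i), uniform here
    stumps, betas = [], []
    for _ in range(T):
        p = w / w.sum()                          # step 4
        stump = stump_learn(X, y, p)             # step 5: call WeakLearn
        h = stump_predict(stump, X)
        eps = np.sum(p * np.abs(h - y))          # step 6: weighted error
        beta = eps / (1.0 - eps + 1e-12)         # step 7
        w = w * beta ** (1.0 - np.abs(h - y))    # step 8: reweight examples
        stumps.append(stump)
        betas.append(beta)
    return stumps, betas

def adaboost_predict(stumps, betas, X):
    a = np.array([np.log(1.0 / (b + 1e-12)) for b in betas])
    H = np.array([stump_predict(s, X) for s in stumps])    # T x n predictions
    return (a @ H >= 0.5 * a.sum()).astype(int)            # final hypothesis h_f
```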

Friedman et al. [100] analyze the AdaBoost procedures from a statistical perspective: AdaBoost can be rederived as a method for fitting an additive model Σ_m f_m(x) in a forward stagewise manner, which largely explains why it tends to outperform a single base learner. By fitting an additive model of different and potentially simple functions, it expands the class of functions that can be approximated. Using Newton stepping rather than exact optimization at each step, Friedman et al. [100] proposed a modified version of the AdaBoost algorithm, called the "Gentle AdaBoost" procedure, which instead takes adaptive Newton steps much like the LogitBoost algorithm; see Algorithm 6.7.

Algorithm 6.7 Gentle AdaBoost [100]
1. input: weights w_i = 1/N, i = 1, ..., N, F(x) = 0.
2. for m = 1 to M do
3.   Fit the regression function f_m(x) by weighted least squares of y_i to x_i with weights w_i.
4.   Update F(x) ← F(x) + f_m(x).
5.   Update w_i ← w_i exp(−y_i f_m(x_i)) and renormalize.
6. end for
7. output: the classifier sign[F(x)] = sign[Σ_{m=1}^M f_m(x)].


The application of adaptive boosting in transfer learning will be discussed in Sect. 6.20.

6.5 Basic Theory of Machine Learning

The pioneer of machine learning, Arthur Samuel, defined machine learning as a "field of study that gives computers the ability to learn without being explicitly programmed."

6.5.1 Learning Machine

Consider a learning machine whose task is to learn a mapping x_i ↦ y_i. The machine is actually defined by a set of possible mappings x ↦ f(x, α), with the functions f(x, α) labeled by the adjustable parameter vector α. The machine is usually assumed to be deterministic: for a given input vector x and choice of α, it will always give the same output f(x, α). A particular choice of α generates a "trained machine"; hence a neural network with fixed architecture, with α corresponding to the weights and biases, is a learning machine in this sense [39].

The expectation of the test error for a trained machine is defined as

R(α) = ∫ (1/2) |y − f(x, α)| dP(x, y),                                         (6.5.1)

where P(x, y) is some unknown cumulative probability distribution from which the data x and y are drawn, i.e., the data are assumed to be independently and identically distributed (i.i.d.). When a density of the cumulative probability distribution P(x, y) exists, dP(x, y) may be written as p(x, y)dxdy. The quantity R(α) is called the expected risk, or just the risk, for the choice of α.

The "empirical risk" R_emp(α) is defined to be just the measured mean error rate on the given training set:

R_emp(α) = (1/(2N)) Σ_{i=1}^N |y_i − f(x_i, α)|,                               (6.5.2)

where no probability distribution appears. The quantity (1/2)|y_i − f(x_i, α)| is called the loss. R_emp(α) is a fixed number for a particular choice of α and for a particular training set {x_i, y_i}.

Another name for a family of functions f(x, α) is a "learning machine." Designing a good learning machine depends on the choice of α. A popular method for choosing α is empirical risk minimization (ERM): minimizing R_emp(α) over all possible choices of α results in a risk R(α*) which is close to its minimum. The minimization of different loss functions results in different machine learning methods.

In reality, most machine learning methods should have three phases rather than two: training, validation, and testing. After the training is complete, there may be several models (e.g., artificial neural networks) available, and one needs to decide which one to select and to obtain a good estimate of the error it will achieve on a test set. To this end, there should be a third separate data set: the validation data set.
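As a tiny illustration of (6.5.2), the snippet below evaluates the empirical risk of a fixed decision function; the function f and the labels (taken in {−1, +1}, so that (1/2)|y − f| is the 0-1 loss) are stand-ins chosen for the example.

```python
# Empirical risk R_emp(alpha) = (1/2N) sum_i |y_i - f(x_i, alpha)|.
import numpy as np

def empirical_risk(f, X, y, alpha):
    preds = np.array([f(x, alpha) for x in X])
    return np.abs(y - preds).sum() / (2 * len(y))

f = lambda x, alpha: 1.0 if x @ alpha > 0 else -1.0   # a simple linear rule
X = np.array([[1.0, 2.0], [-1.0, 0.5], [2.0, -1.0]])
y = np.array([1.0, -1.0, -1.0])
print(empirical_risk(f, X, y, alpha=np.array([1.0, 0.0])))   # fraction misclassified
```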

6.5.2 Machine Learning Methods There are many different machine learning methods for modeling the data to the underlying problem. These machine learning methods can be divided to the following types [38]. 1. Machine Learning Methods based on Network Structure. • Artificial Neural Networks (ANNs): ANNs are inspired by the brain and composed of interconnected artificial neurons capable of certain computations on their inputs [120]. The input data activate the neurons in the first layer of the network whose output is the input to the second layer of neurons in the network. Similarly, each layer passes its output to the next layer and the last layer outputs the result. Layers in between the input and output layers are referred to as hidden layers. When an ANN is used as a classifier, the output layer generates the final classification category. • Bayesian Network: A Bayesian network is a probabilistic graphical model that represents the variables and the relationships between them [98, 120, 130]. The network is constructed with nodes as the discrete or continuous random variables and directed edges as the relationships between them, establishing a directed acyclic graph. Each node maintains the states of the random variable and the conditional probability form. Bayesian networks are built by using expert knowledge or efficient algorithms that perform inference. 2. Machine Learning Methods based on Statistical Analysis. • Association Rules: The concept of association rules was popularized particularly due to the article of Agrawal et al. in 1993 [3]. The goal of association rule mining is to discover previously unknown association rules from the data. An association rule describes a relationship X ⇒ Y between an itemset X and a single item Y . Association rules have two metrics that tell how often a given relationship occurs in the data: the support is an indication of how frequently the itemset appears in the data set, and the confidence is an indication of how often the rule has been found to be true.


• Clustering: Clustering [126] is a set of techniques for finding patterns in highdimensional unlabeled data. It is an unsupervised pattern discovery approach where the data are grouped together based on some similarity measure. • Ensemble Learning: Supervised learning algorithms, in general, search the hypothesis space to determine the right hypothesis that will make good predictions for a given problem. Although good hypotheses might exist, it may be hard to find one. Ensemble methods combine multiple learning algorithms to obtain the better predictive performance than the constituent learning algorithms alone. Often, ensemble methods use multiple weak learners to build a strong learner [206]. • Hidden Markov Models (HMMs): An HMM is a statistical Markov model where the system being modeled is assumed to be a Markov process with unobserved (i.e., hidden) states [17]. The main challenge is to determine the hidden parameters from the observable parameters. The states of an HMM represent unobservable conditions being modeled. By having different output probability distributions in each state and allowing the system to change states over time, the model is capable of representing non-stationary sequences. • Inductive Learning: Deduction and induction are two basic techniques of inferring information from data. Inductive learning is traditional supervised learning, and aims at learning a model from labeled examples, and trying to predict the labels of examples we have not seen or known about. In inductive learning, from specific observations one begins to detect patterns and regularities, formulates some tentative hypotheses to be explored, and lastly ends up developing some general conclusions or theories. Several ML algorithms are inductive, but by inductive learning, we usually mean “repeated incremental pruning to produce error reduction” (RIPPER) [57] and the quasioptimal (AQ) algorithm [177]. • Naive Bayes: It is well known [98] that a surprisingly simple Bayesian classifier with strong assumptions of independence among features, called naive Bayes, is competitive with state-of-the-art classifiers such as C4.5. In general, the input features are assumed to be independent, whereas in practice this is seldom true. Naive Bayes classifiers [98, 270] can handle an arbitrary number of independent features whether continuous or categorical by reducing a high-dimensional density estimation task to a one-dimensional kernel density estimation, under the assumption that the features are independent. Although the Naive Bayes classifier has some limitations, it is an optimal classifier if the features are conditionally independent given the true class. One of the biggest advantages of the Naive Bayes classifier is that it is an online algorithm and its training can be completed in linear time. 3. Machine Learning Methods based on Evolution. • Evolutionary Computation: In computer science, evolutionary computation is a family of algorithms for global optimization inspired by biological evolution, and the subfield of artificial intelligence and soft computing studying these algorithms (by Wikipedia). The term evolutionary computation


encompasses genetic algorithms (GA) [107], genetic programming (GP) [152], evolution strategies [26], particle Swarm optimization [142], ant colony optimization [78], and artificial immune systems [89], which is the topics in Chap. 9 Evolutionary Computations.

6.5.3 Expected Performance of Machine Learning Algorithms A machine learning algorithm is expected to have the following expected performance [145]. • Scalability: This parameter can be defined as the ability for an algorithm to be able to handle an increase in its scale, such as feeding more data to the system, adding more features to the input data or adding more layers in a neural network, without it limitlessly increasing its complexity [5]. • Training Time: This is the amount of time that a machine learning algorithm takes to be fully trained and form the ability to make its predictions. • Response Time: This parameter is related to the agility of a machine learning system, and represents the time that an algorithm takes, after it has been trained, to make a prediction for the desired self-organizing networks (SON) function. • Training Data: This metric of machine learning algorithm is the amount and type of training data an algorithm needs. Algorithms supported by more training data usually have better accuracy, but they also take more time to be trained. • Complexity: Complexity of a system can be defined as the amount of mathematical operations that it performs in order to achieve a desired solution. This parameter can determine if certain algorithms are more suitable to be deployed at the user side. • Accuracy: Future networks are expected to be much more intelligent and quicker, enabling highly different types of applications and user requirements. Deploying algorithms that have high accuracy is critical to guarantee a good operability of certain self-organizing network functions. • Convergence Time: This metric of an algorithm, different from the response time, relates to how fast it agrees that the solution found for that particular problem is the optimal solution at that time. • Convergence Reliability: This parameter represents the susceptibility of some algorithm to be stuck at local minima and how initial conditions can affect its performance.

6.6 Classification and Regression

In essence, machine learning learns from the given training samples in order to solve one of two basic problems: regression (for continuous outputs) or classification (for discrete outputs). Classification is closely related to pattern recognition; its goal is to design a classifier, through learning a "training" set of input data, that performs recognition or classification for unknown samples. By regression, we mean designing a regressor or predictor, based on the machine learning results for a set of training data, that makes predictions for unknown continuous samples.

6.6.1 Pattern Recognition and Classification

Pattern recognition is the field devoted to the study of methods designed to categorize data into distinct classes. By Watanabe [262], a pattern is defined "as opposite of a chaos; it is an entity, vaguely defined, that could be given a name." Pattern recognition is widely applied in the recognition of human biometrics (such as a person's face, fingerprint, or iris) and of various targets (such as aircraft and ships). In these applications, the extraction of object features is crucial. For example, when a target is considered as a linear system, the target parameter is a feature of the target signal.

Given a pattern, its recognition/classification may consist of one of the following two tasks [262]:

• Supervised classification (e.g., discriminant analysis), in which the input pattern is identified as a member of a predefined class;
• Unsupervised classification (e.g., clustering), in which the pattern is assigned to a hitherto unknown class.

The tasks of pattern recognition/classification are customarily divided into four distinct blocks:

1. Data representation (acquisition and pre-processing).
2. Feature selection or extraction.
3. Clustering.
4. Classification.

The four best known approaches for pattern recognition are [128]: (1) template matching, (2) statistical classification, (3) syntactic or structural matching, and (4) neural networks.

Data Representation
Data representation is mostly problem-specific. In matrix algebra, signal processing, pattern recognition, etc., data pre-processing most commonly involves the zero-mean normalization (i.e., data centering): for each data vector x_i,

x_{i,j}^cent = x_{i,j} − x̄,   x̄ = (1/N) Σ_{j=1}^N x_{i,j},   for i = 1, ..., N,        (6.6.1)


to remove the useless direct current (DC) components in the data, where x_{i,j} is the jth entry of the vector x_i. The data centering is also referred to as data zero-meaning. In order to prevent drastic changes in data amplitude, the input data are also required to be scaled to similar ranges:

x_{i,j}^scal = x_{i,j} / s_i,   s_i = sqrt(Σ_{k=1}^n x_{i,k}^2),   i = 1, ..., N; j = 1, ..., n.        (6.6.2)

This pre-processing is called data scaling.

Feature Selection
In many applications, data vectors must become low-dimensional vectors via some transformation/processing method. These low-dimensional vectors are called pattern vectors or feature vectors owing to the fact that they extract the features of the original data vectors, and they are directly used for pattern clustering and classification. For example, colors of clouds and parameters of voice tones are pattern or feature vectors in weather forecasting and voice classification, respectively. Due to high dimensionality, the original data vectors are not directly available for pattern recognition. Hence, one should try to find invariant features in the data that describe the differences between classes as well as possible.

According to the given training set, feature selection can be divided into supervised and unsupervised methods.

• Supervised feature selection: Let X_l = {(x_1, y_1), ..., (x_N, y_N)} be the labeled set of data, where x_i ∈ R^n is the ith data vector and y_i = k indicates the kth class to which the data vector x_i belongs, with k ∈ {1, ..., M} the label of M target classes. The pattern recognition uses supervised machine learning. Let l_p be the number of samples with k = p, where l_1 + · · · + l_M = N; then we can construct the data matrix of the pth class of targets:

  X^(p) = [x_1^(p), ..., x_{l_p}^(p)]^T ∈ R^{l_p × n},   p = 1, ..., M.        (6.6.3)

Thus we can select feature vectors in the data by using matrix algebra methods such as principal component analysis, etc. This selection, as supervised machine learning, is also known as feature selection or dimensionality reduction, since the dimension of the feature vector is much less than the dimension of the original data vectors.
• Unsupervised feature selection: When data are unlabeled, i.e., only the data vectors x_1, ..., x_N are given, the pattern recognition is unsupervised machine learning, which is more difficult than supervised pattern recognition. In this case, we need to use mappings or signal processing for transforming the data into invariant features, such as the short-time Fourier transform, bispectrum, wavelets, etc.


Feature selection is to select the most significant features of the data, and attempt to reduce the dimensionality (i.e., the number of features) for the remaining steps of the task. We will discuss supervised and unsupervised feature extractions later in details. Clustering Let w be the weight vector of a cluster which aims at patterns within a cluster being more similar to each other rather than are patterns belonging to other clusters. Clustering methods are used in order to find the actual mapping between patterns and labels (or targets). This categorization has two machine learning methods: supervised learning (which makes distinct labeling of the data) and unsupervised learning (division of the data into classes), or a combination of more than one of these tasks. The above three stages (data representation, feature selection, and clustering) belong to the training phase. Having such a cluster, one may make a classification in the next testing phase. Classification Classification is one of the most frequently encountered decision making tasks of human activity. A classification problem occurs when an object needs to be assigned into a predefined group or classes related to that object after clustering a number of groups or classes based on a set of observed attributes. In other words, one needs to decide which class among a number of classes a given testing sample belongs to. This step is called the testing phase, compared with the previous training stage. Clearly, this is a form of supervised learning.

6.6.2 Regression

In statistical modeling and machine learning, regression fits a statistical process for estimating the relationships among variables. It focuses on the relationship between a dependent variable and one or more independent variables (named "predictors"). More specifically, regression analyzes how the typical value of the dependent variable changes when any one of the independent variables is varied while the other independent variables are held fixed. Machine learning is widely used for prediction and forecasting, where its use has substantial overlap with the field of statistical regression analysis.

Consider the more general multiple regression model with p independent variables:

y_i = β_1 x_{i1} + β_2 x_{i2} + · · · + β_p x_{ip} + ε_i,                      (6.6.4)

where xij is the ith observation on the j th independent variable xj = [x1j , . . . , xpj ]T . If the first independent variable takes the value 1 for all i, xi1 = 1,


then the regression model reduces to yi = β1 + β2 xi2 + · · · + βp xip + εi and β1 is called the regression intercept. The regression residual is defined as εi = yi − βˆ1 xi1 − · · · − βˆp xip .

(6.6.5)

Hence, the normal equations of the regression are given by

Σ_{i=1}^n Σ_{k=1}^p x_{ij} x_{ik} β̂_k = Σ_{i=1}^n x_{ij} y_i,   j = 1, ..., p.          (6.6.6)

In matrix notation, the above normal equations can be written as

(X^T X) β̂ = X^T y,                                                            (6.6.7)

where X = [x_{ij}]_{i=1,...,n; j=1,...,p} is an n × p matrix, y = [y_i]_{i=1}^n is an n × 1 vector, and β̂ = [β̂_j]_{j=1}^p is a p × 1 vector. Finally, the solution of the regression problem is given by

β̂ = (X^T X)^{−1} X^T y.                                                       (6.6.8)
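A minimal numerical check of (6.6.7)-(6.6.8) follows; the synthetic data and coefficient values are illustrative only (in practice np.linalg.lstsq is numerically preferable to forming X^T X explicitly).

```python
# Solve the normal equations (X^T X) beta = X^T y for the regression coefficients.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p - 1))])  # x_i1 = 1 -> intercept
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # normal-equation solution (6.6.8)
print(beta_hat)                                 # close to beta_true
```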

6.7 Feature Selection

The performance of machine learning methods generally depends on the choice of data representation. The representation choice heavily depends on representation learning. Representation learning is also known as feature learning. In machine learning, representation learning [22] is a set of learning techniques that allows a system to automatically discover the representations needed for feature detection or classification from raw data.

When mining large data sets, both in dimension and size, we are given a set of original variables (e.g., in gene expression) or a set of extracted features (such as pre-processing via principal or minor component analysis) in data representation. An important problem is how to select a subset of the original variables or a subset of the original features. The former is called variable selection, and the latter is known as feature selection or representation choice.

Feature selection is a fundamental research topic in data mining and machine learning with a long history since the 1970s. With the advance of science and technology, more and more real-world data sets in data mining and machine learning often involve a large number of features. But not all features are essential, since


many of them are redundant or even irrelevant, which may significantly degrade the accuracy of learned models as well as reduce the learning speed and ability of the models. By removing irrelevant and redundant features, feature selection can reduce the dimensionality of the data, speed up the learning process, simplify the learned model, and/or improve the performance [69, 112, 284]. A smaller set of representative variables or features, retaining the optimal salient characteristics of the data, not only decreases the processing complexity and time, but also overcomes the risk of “overfitting” and leads to more compactness of the model learned and better generalization. When the class labels of the data are given, we are faced with supervised variable or feature selection, otherwise we should use unsupervised variable or feature selection. Since the learning methods used in both the variable selection and the feature selection are essentially the same, we take the feature selection as the subject of this section.

6.7.1 Supervised Feature Selection An important problem in machine learning is to reduce the dimensionality D of the feature space F to overcome the risk of “overfitting.” Data overfitting arises when the number n of features is large and the number m of training patterns is comparatively small. In such a situation, one may easily find a decision function that separates the training data but gives a poor result for test data. Feature selection has become the focus of much research in areas of application (e.g., text processing of internet documents, gene expression array analysis, and combinatorial chemistry) for high-dimensional data. The objective of feature selection is three-fold [112]: • improving the prediction performance of the predictors, • providing faster and more cost-effective predictors, and • providing a better understanding of the underlying process that generated the data. Many feature selection algorithms are based on variable ranking. As a preprocessing step of feature selection, variable ranking is a filter method: it is independent of the choice of the predictor. For example, in the gene selection problem, the variables are gene expression coefficients corresponding to the abundance of mRNA in a sample for a number of patients. A typical classification task is to separate healthy patients from cancer patients, based on their gene expression “profile” [112]. The feature selection methods can be divided into two categories: the exhaustive enumeration method and the greedy method.


• Exhaustive enumeration method: selects the best subset of features satisfying a given "model selection" criterion by exhaustive enumeration of all subsets of features. However, this method is impractical for large numbers of features.
• Greedy method: available particularly for feature selection in large-dimensional input spaces. Among the various possible methods, feature ranking techniques are very attractive. A number of top-ranked features may be selected for further analysis or to design a classifier. Alternatively, a threshold can be set on the ranking criterion: only the features whose criterion exceeds the threshold are retained.

Consider a set of m examples {x_k, y_k} (k = 1, ..., m), where x_k = [x_{k,1}, ..., x_{k,n}]^T, and one output variable y_k. Let x̄_i = (1/m) Σ_{k=1}^m x_{k,i} and ȳ = (1/m) Σ_{k=1}^m y_k stand for averages over the index k. Variable ranking makes use of a scoring function S(i) computed from the values x_{k,i} and y_k, k = 1, ..., m. By convention, a high score is indicative of a valuable variable (e.g., gene), and variables are sorted in decreasing order of S(i). A good feature selection method first selects a few features that individually classify the training data best, and then increases the number of features.

Classical feature selection methods include correlation methods and expression ratio methods. A typical correlation method is based on the Pearson correlation coefficient defined as

R(i) = Σ_{k=1}^m (x_{k,i} − x̄_i)(y_k − ȳ) / sqrt( Σ_{k=1}^m (x_{k,i} − x̄_i)² Σ_{k=1}^m (y_k − ȳ)² ),        (6.7.1)

where the numerator denotes the between-class variance and the denominator represents the within-class variance. In linear regression, using the square of R(i) as a variable ranking criterion enforces a ranking according to the goodness of linear fit of individual variables.

For two-class data {x_k, y_k} (k = 1, ..., m), where x_k = [x_{k,1}, ..., x_{k,n}]^T consists of n input variables x_{k,i} (i = 1, ..., n) and y_k ∈ {+1, −1} (e.g., certain disease vs. normal), let

μ_i^+ = (1/m^+) Σ_{k: y_k=+1} x_{k,i},   (σ_i^+)² = Σ_{k: y_k=+1} (x_{k,i} − μ_i^+)²,        (6.7.2)
μ_i^− = (1/m^−) Σ_{k: y_k=−1} x_{k,i},   (σ_i^−)² = Σ_{k: y_k=−1} (x_{k,i} − μ_i^−)²,        (6.7.3)

where m+ and m− are the numbers of the ith variable xk,i corresponding to yi = +1 − and yi = −1, respectively; μ+ i and μi are the means of xk,i over the index k associated with yi = +1 and yi = −1, respectively; and σi+ and σi− are, respectively, the standard deviations of xk,i over the index k associated with yi = +1 and yi = −1.


A well-known expression ratio method uses the ratio [103, 109]:

F(x_i) = |μ_i^+ − μ_i^−| / (σ_i^+ + σ_i^−).                                    (6.7.4)

This method selects as top features those with the highest ratios F(x_i), i.e., features differing most on average between the two classes and having small deviations in the scores within the respective classes. The expression ratio criterion (6.7.4) is similar to Fisher's discriminant criterion [83].

It should be noted that variable dependencies cannot be ignored in feature selection, as follows [112].

• Noise reduction and consequently better class separation may be obtained by adding variables that are presumably redundant.
• Perfectly correlated variables are truly redundant in the sense that no additional information is gained by adding them.
• Very high variable correlation (or anti-correlation) does not mean absence of variable complementarity.
• A variable that is completely useless by itself can provide a significant performance improvement when taken with others.
• Two variables that are useless by themselves can be useful together.

One possible use of feature ranking is the design of a class predictor (or classifier) based on a pre-selected subset of features. The weighted voting scheme yields a particular linear discriminant classifier:

D(x) = ⟨w, x − μ⟩ = w^T (x − μ),                                               (6.7.5)

where μ = (1/2)(μ^+ + μ^−) with

μ^+ = (1/n^+) Σ_{x_i ∈ X^+} x_i,                                               (6.7.6)
μ^− = (1/n^−) Σ_{x_i ∈ X^−} x_i,                                               (6.7.7)

where X^+ = {(x_i, y_i = +1)}, X^− = {(x_i, y_i = −1)}, and n^+ and n^− are the numbers of the training data vectors x_i belonging to classes (+) and (−), respectively. Define the n × n within-class scatter matrix

S_w = Σ_{x_i ∈ X^+} (x_i − μ^+)(x_i − μ^+)^T + Σ_{x_i ∈ X^−} (x_i − μ^−)(x_i − μ^−)^T.      (6.7.8)


By Fisher's linear discriminant, the classifier is given by [113]

w = S_w^{−1} (μ^+ − μ^−).                                                      (6.7.9)

Once the classifier w is designed using the training data (x_i, y_i), i = 1, ..., n (where y_i ∈ {+1, −1}), for any given data vector x, the new pattern can be classified according to the sign of the decision function [113]:

D(x) = w^T x > 0  ⇒  x ∈ class (+),
D(x) = w^T x < 0  ⇒  x ∈ class (−),
D(x) = w^T x = 0:  decision boundary.
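A short Python sketch of (6.7.6)-(6.7.9) follows; it computes the class means and within-class scatter, solves for w, and classifies by the sign of the decision function of (6.7.5). The two-class data layout (rows as samples) is an illustrative convention.

```python
# Fisher's linear discriminant: w = Sw^{-1}(mu+ - mu-), classify by sign of D(x).
import numpy as np

def fisher_lda(Xp, Xm):
    """Xp: samples of class (+), Xm: samples of class (-), rows are samples."""
    mu_p, mu_m = Xp.mean(axis=0), Xm.mean(axis=0)
    Sw = (Xp - mu_p).T @ (Xp - mu_p) + (Xm - mu_m).T @ (Xm - mu_m)   # (6.7.8)
    w = np.linalg.solve(Sw, mu_p - mu_m)                             # (6.7.9)
    mu = 0.5 * (mu_p + mu_m)
    return w, mu

def classify(w, mu, x):
    d = w @ (x - mu)                    # decision function D(x) = w^T (x - mu)
    return '+' if d > 0 else ('-' if d < 0 else 'boundary')
```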

6.7.2 Unsupervised Feature Selection

Consider unsupervised feature selection for given data vectors x_i, i = 1, ..., n, without labeled class indices y_i, i = 1, ..., n. In this case, there are two common similarity measures between two random vectors x and y, as stated below [182].

1. Correlation coefficient: The correlation coefficient ρ(x, y) between two random vectors x and y is defined as

   ρ(x, y) = cov(x, y) / sqrt(var(x) var(y)),                                  (6.7.10)

   where var(x) denotes the variance of the random vector x and cov(x, y) denotes the covariance between x and y. The correlation coefficient has the following properties.

   (a) 0 ≤ 1 − |ρ(x, y)| ≤ 1.
   (b) 1 − |ρ(x, y)| = 0 if and only if x and y are linearly dependent.
   (c) Symmetric: 1 − |ρ(x, y)| = 1 − |ρ(y, x)|.
   (d) Invariant to scaling and translation: if u = (x − a)/c and v = (y − b)/d for some constant vectors a, b and constants c, d, then 1 − |ρ(x, y)| = 1 − |ρ(u, v)|.
   (e) Sensitive to rotation: if (u, v) is some rotation of (x, y), then |ρ(x, y)| ≠ |ρ(u, v)|.

2. Least square regression error: Let y = a1 + bx be the linear regression of y on x, with regression coefficients a and b. The least square regression error is defined as

   e²(x, y) = (1/n) Σ_{i=1}^n (e(x, y)_i)²,                                    (6.7.11)

   where e(x, y)_i = y_i − a − b x_i. The regression coefficients are given by b = cov(x, y)/var(x) and a = (1/n) Σ_{i=1}^n (y_i − b x_i), and the mean square error is e²(x, y) = var(y)(1 − ρ(x, y)²). If x and y are linearly correlated, then e(x, y) = 0, and if x and y are completely uncorrelated, then e²(x, y) = var(y). The measure e² is also called the residual variance. The least square regression error measure has the following properties.

   (a) 0 ≤ e(x, y) ≤ var(y).
   (b) e(x, y) = 0 if and only if x and y are linearly correlated.
   (c) Unsymmetric: e(x, y) ≠ e(y, x).
   (d) Sensitive to scaling: if u = x/c and v = y/d for some constants c and d, then e(x, y) = d² e(u, v). However, e is invariant to translation of the variables.
   (e) The measure e is sensitive to the rotation of the scatter diagram in the x–y plane.

A feature similarity measure is expected to be symmetric, sensitive to scaling and translation of the variables, and invariant to their rotation. Hence, the above two similarity measures are not what we expect, because the correlation coefficient has the undesirable properties (d) and (e), and the least square regression error is not symmetric (property (c)) and is sensitive to rotation (property (e)). In order to have all the desired properties, Mitra et al. [182] proposed a maximal information compression index λ_2 defined as

2λ_2(x, y) = var(x) + var(y) − sqrt( (var(x) + var(y))² − 4 var(x) var(y)(1 − ρ(x, y)²) ).        (6.7.12)

Here, λ_2(x, y) denotes the smallest eigenvalue of the covariance matrix of the random variables x and y. The feature similarity measure λ_2(x, y) has the following properties [182].

1. 0 ≤ λ_2(x, y) ≤ 0.5 (var(x) + var(y)).
2. λ_2(x, y) = 0 if and only if x and y are linearly correlated.
3. Symmetric: λ_2(x, y) = λ_2(y, x).
4. Sensitive to scaling: if u = x/c and v = y/d for some nonzero constants c and d, then λ_2(x, y) ≠ λ_2(u, v). Since the expression of λ_2(x, y) does not contain the mean but only the variance and covariance terms, it is invariant to translation of x and y.
5. λ_2 is invariant to rotation of the variables about the origin.

Let the original number of features be D, and the original feature set be O = {F_i, i = 1, ..., D}. Apply the measures of linear dependency, e.g., ρ(F_i, F_j), e(F_i, F_j), and/or λ_2(F_i, F_j), to compute the dissimilarity between F_i and F_j, denoted by d(F_i, F_j). The smaller the value of d(F_i, F_j), the more similar are the features F_i and F_j. Let r_i^k denote the dissimilarity between feature F_i and its kth nearest neighbor feature in R, and inf_{F_i∈R} r_i^k represent the infimum of r_i^k for F_i ∈ R.


Algorithm 6.8 is the feature clustering algorithm of Mitra et al. [182].

Algorithm 6.8 Feature clustering algorithm [182]
1. input: the original feature set O = {F_1, ..., F_D}.
2. initialization: Choose an initial value of k ≤ D − 1 and set R ← O.
3. For each F_i ∈ R compute r_i^k.
4. Find the feature F_i' for which r_{i'}^k is minimum.
5. Retain this feature in R and discard the k nearest features of F_i'.
6. Let ε = r_{i'}^k.
7. If k > |R| − 1 then k = |R| − 1.
8. If k = 1 then goto Step 14.
9. While r_i^k > ε do
10.   (a) k = k − 1. r_i^k = inf_{F_i∈R} r_i^k. (k is decremented by 1 until the kth nearest neighbor of at least one of the features in R is less than ε-dissimilar with that feature.)
11.   (b) If k = 1 then goto Step 14. (If no feature in R has a nearest neighbor less than ε-dissimilar, select all the remaining features in R.)
12. end while
13. Goto Step 3.
14. output: return the feature set R as the reduced feature set.

6.7.3 Nonlinear Joint Unsupervised Feature Selection

Let X = [x_1, ..., x_n] be the original feature matrix of n data samples, where x_i ∈ R^D and x_{ip} denotes the value of the pth (p = 1, ..., D) feature of x_i. Consider selecting d (d ≤ D) high-quality features. To this end, use s ∈ {0, 1}^D as the selection indicator vector, where s_p = 1 indicates that the pth feature is selected and s_p = 0 indicates that it is not selected.

Given a feature mapping φ(x) : X → F, φ(x) − E{φ(x)} is called the centered feature mapping, where E{φ(x)} is its expectation.

Definition 6.10 (Centered Kernel Function [267]) The kernel function after centering is given by

K_c(x_i, x_j) = (φ(x_i) − E{φ(x_i)})^T (φ(x_j) − E{φ(x_j)}).                    (6.7.13)

For a finite number of samples, a feature vector is centered by subtracting its empirical expectation, i.e., φ(x_i) − φ̄, where φ̄ = (1/n) Σ_{j=1}^n φ(x_j).

Definition 6.11 (Centered Kernel Matrix [267]) The kernel matrix K ∈ R^{n×n} after centering is defined entrywise as

(K_c)_{ij} = K_{ij} − (1/n) Σ_{i=1}^n K_{ij} − (1/n) Σ_{j=1}^n K_{ij} + (1/n²) Σ_{i=1}^n Σ_{j=1}^n K_{ij}.        (6.7.14)


Denoting H = I − (1/n) 1 1^T, the kernel matrix after centering can be expressed as K_c = HKH. It is easy to verify that H is an idempotent matrix, i.e., HH = H.

Definition 6.12 (Kernel Alignment [65, 267]) For two kernel matrices K_1 ∈ R^{n×n} and K_2 ∈ R^{n×n} (assume ‖K_1‖_F > 0 and ‖K_2‖_F > 0), the alignment between K_1 and K_2 is defined in the form of a matrix trace: ρ(K_1, K_2) = tr(K_1 K_2).

Assume L ∈ R^{n×n} is a kernel matrix computed from the original feature matrix X, and K ∈ R^{n×n} is a kernel matrix computed from the selected features. Denoting

Φ(X) = [φ(x_1), ..., φ(x_n)],                                                  (6.7.15)
Φ(X_s) = [φ(Diag(s) x_1), ..., φ(Diag(s) x_n)],                                (6.7.16)

then

L = (Φ(X))^T Φ(X) ∈ R^{n×n},                                                   (6.7.17)
K = (Φ(X_s))^T Φ(X_s) ∈ R^{n×n}.                                               (6.7.18)

By Definition 6.12 and using tr(AB) = tr(BA) and HH = H, the kernel alignment between L_c and K_c can be represented as

ρ(L_c, K_c) = tr(HLHHKH) = tr(HHLHHK) = tr(HLHK).                              (6.7.19)

The core issue in nonlinear joint unsupervised feature selection is to search for an optimal selection vector s so that the selected d high-quality features out of the D original features solve the constrained optimization problem

s = arg min_s {f(K) = −tr(HLHK)}                                               (6.7.20)
subject to Σ_{p=1}^D s_p = d,  s_p ∈ {1, 0}, ∀ p = 1, ..., D,                  (6.7.21)

or

s = arg min_s {f(K) = −tr(HLHK) + λ‖s‖_1}                                      (6.7.22)
subject to s_p ∈ {1, 0}, ∀ p = 1, ..., D.                                      (6.7.23)


For a Gaussian kernel with parameter σ², the gradient of K_{ij} with respect to the selection variable s_p is given by

∂K_{ij}/∂s_p = −K_{ij} · 2 (x_{ip} − x_{jp})² s_p / σ².                        (6.7.24)

Hence, the gradient of the objective f(K) with respect to the selection variable s_p is given by [267]

∂f(K)/∂s_p = − Σ_{i=1}^n Σ_{j=1}^n (HLH)_{ij} · ∂K_{ij}/∂s_p + λ
           = Σ_{i=1}^n Σ_{j=1}^n (HLH)_{ij} · (2 s_p / σ²) K_{ij} (x_{ip} − x_{jp})² + λ.      (6.7.25)

This problem can be solved by using the spectral projected gradient (SPG) method; see Algorithm 6.9.

Algorithm 6.9 Spectral projected gradient (SPG) algorithm for nonlinear joint unsupervised feature selection [267]
1. input: s_0 = 1.
2. initialization: Set step length α_0, step length bounds α_min = 10^{−10}, α_max = 10^{10}, history h = 10, t = 0.
3. while not converged do
4.   α_t = min{α_max, max(α_min, α_0)};
5.   d_t = Proj_{[0,1]}(s_t − α_t ∇f(s_t)) − s_t;
6.   bound f_b = max{f(s_t), f(s_{t−1}), ..., f(s_{t−h})};
7.   set α = 1;
8.   if f(s_t + α d_t) > f_b + η α (∇f(s_t))^T d_t then
9.     Choose α ∈ (0, α);
10.  end if
11.  s_{t+1} = s_t + α d_t;
12.  u_t = s_{t+1} − s_t;
13.  y_t = ∇f(s_{t+1}) − ∇f(s_t);
14.  α_0 = y_t^T y_t / u_t^T y_t;
15.  t = t + 1;
16. end while
17. output: the selected features with corresponding entry in s equal to 1.


6.8 Principal Component Analysis

Principal component analysis (PCA) is a powerful data processing and dimension reduction technique for extracting structure from a possibly high-dimensional data set [134]. PCA has several important variants or generalizations: robust principal component analysis, kernel principal component analysis, sparse principal component analysis, etc. The PCA technique is widely applied in engineering, biology, and social science, such as in face recognition, handwritten zip code classification, gene expression data analysis, etc.

6.8.1 Principal Component Analysis Basis

Given a high-dimensional data vector x = [x_1, ..., x_N]^T ∈ R^N, where N is large, we use an N-dimensional feature extraction vector a_i = [a_{i1}, ..., a_{iN}]^T to extract the ith feature of the data vector x as x̃_i = a_i^T x = Σ_{j=1}^N a_{ij} x_j. If we hope to use an N × K feature extraction matrix A = [a_1, ..., a_K] ∈ R^{N×K} for extracting the total K features of x, then the K-dimensional feature vector x̃ ∈ R^K is given by

x̃ = A^T x = [a_1^T x, ..., a_K^T x]^T = [x̃_1, ..., x̃_K]^T ∈ R^K.             (6.8.1)

For a data matrix X = [x_1, ..., x_P] ∈ R^{N×P}, its feature matrix X̃ is given by

X̃ = A^T X = A^T [x_1, ..., x_P] = [a_i^T x_j]_{i=1,...,K; j=1,...,P} = [x̃_1, ..., x̃_P] ∈ R^{K×P},        (6.8.2)

where

x̃_i = A^T x_i = [x̃_{i1}, ..., x̃_{iK}]^T ∈ R^K,   i = 1, ..., P.              (6.8.3)

The feature vector x˜ i , i = 1, . . . , P should satisfy the following requirements. 1. Dimensionality reduction: Due to correlation of data vectors, there is redundant information among the N components of each data vector xi = [xi1 , . . . , xiN ]T for i = 1, . . . , P . We hope to construct a new set of lower-dimensional feature vectors x˜ i = [x˜i1 , . . . , x˜iK ]T ∈ RK , where K / N . Such a process is known as feature extraction or dimension reduction. 2. Orthogonalization: In order to avoid redundant information, P feature vectors should be orthogonal to each other, i.e., x˜ Ti x˜ j = δij for all i, j ∈ {1, . . . , P }.


3. Power maximization: Under the assumption P < N, let σ_1² ≥ ··· ≥ σ_K² be the K leading eigenvalues of the P × P covariance matrix X^T X. Then we have E_{x̃_i} = E{|x̃_i|²} = v_i^T (X^T X) v_i = σ_i², where σ_i is the ith leading singular value of X = UΣV^T, or λ_i = σ_i² is the ith leading eigenvalue of X^T X = VΛV^T with Σ = Diag(σ_1, ..., σ_P) and Λ = Diag(λ_1, ..., λ_P). Because the eigenvalues σ_i² are arranged in nonascending order, we have E_{x̃_1} ≥ E_{x̃_2} ≥ ··· ≥ E_{x̃_P}. Therefore, x̃_1 is often referred to as the first principal component feature of X, x̃_2 as the second principal component feature of X, and so on. Hence, we have

E{x̃_1} + ··· + E{x̃_P} = σ_1² + ··· + σ_K² ≈ E{‖x_1‖_2²} + ··· + E{‖x_P‖_2²}.    (6.8.4)

A direct choice satisfying the above three requirements is to take A = U_1 in (6.8.2), giving the principal component feature matrix

X̃_1 = [x̃_1, ..., x̃_P] = U_1^T X = Σ_1 V_1^T = [σ_1 v_1, ..., σ_K v_K]^T ∈ R^{K×P},    (6.8.5)

because

X = UΣV^T = [U_1, U_2] [Σ_1, O_{K×(P−K)}; O_{(P−K)×K}, Σ_2] [V_1^T; V_2^T] = U_1 Σ_1 V_1^T + U_2 Σ_2 V_2^T

with U_1 = [u_1, ..., u_K] ∈ R^{N×K}, U_2 = [u_{K+1}, ..., u_P] ∈ R^{N×(P−K)}, V_1 = [v_1, ..., v_K] ∈ R^{P×K}, V_2 = [v_{K+1}, ..., v_P] ∈ R^{P×(P−K)}, while Σ_1 = Diag(σ_1, ..., σ_K) and Σ_2 = Diag(σ_{K+1}, ..., σ_P). The feature extraction method with the choice A = U_1 is known as principal component analysis (PCA). PCA has the following important properties:

• Principal components sequentially capture the maximum variability among the columns of X, thus guaranteeing minimal information loss;
• Principal components are uncorrelated, so we can talk about one principal component without referring to the others.
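As an illustration of the construction above, the following is a minimal sketch (assuming centered data and using NumPy's SVD; not taken from the book) of extracting the K principal component features X̃_1 = U_1^T X of (6.8.5).

```python
# Minimal PCA feature-extraction sketch: take A = U_1, the K leading left
# singular vectors, and form the K x P feature matrix U_1^T X.
import numpy as np

def pca_features(X, K):
    """X is N x P (columns are the P data vectors); returns the K x P feature matrix."""
    X = X - X.mean(axis=1, keepdims=True)        # center each variable (row)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U1 = U[:, :K]                                # N x K principal directions
    return U1.T @ X                              # equals Sigma_1 V_1^T, the PC features

X = np.random.randn(8, 100)                      # N = 8 variables, P = 100 samples
X_tilde = pca_features(X, K=3)
print(X_tilde.shape)                             # (3, 100)
```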

6.8.2 Minor Component Analysis

Definition 6.13 (Minor Components [169, 282]) Let X^T X be the covariance matrix of an N × P data matrix X, with K principal eigenvalues and P − K minor eigenvalues (i.e., small eigenvalues). The P − K eigenvectors v_{K+1}, ..., v_P corresponding to these minor eigenvalues are called the minor components of the data matrix X.


Data or signal analysis based on the P − K minor components is known as minor component analysis (MCA). Unlike PCA, the choice for MCA is to take A = U_2 in (6.8.2), so that the (P − K) × P minor component feature matrix is determined by

X̃_2 = U_2^T X = Σ_2 V_2^T = [σ_{K+1} v_{K+1}, ..., σ_P v_P]^T ∈ R^{(P−K)×P}.    (6.8.6)

Interestingly, the contribution ratio of the ith principal singular vector v_i in PCA is σ_i/(σ_1² + ··· + σ_K²)^{1/2}, i = 1, ..., K, while this ratio of the jth minor singular vector v_j in MCA is σ_j/(σ_{K+1}² + ··· + σ_P²)^{1/2}, j = K + 1, ..., P.

In signal/data processing, pattern recognition (such as face, fingerprint, and iris recognition), and image processing, the principal components correspond to the key body of a signal or the image background, since they describe the global energy of a signal or an image. In contrast, the minor components correspond to details of the signal or image, which are usually dictated by the moving objects against the image background.

6.8.3 Principal Subspace Analysis

Principal subspace analysis (PSA) and minor subspace analysis (MSA), of which PCA and MCA are special cases, are important for many applications. The goal of PSA or MSA is to extract, from a stationary random process x(k) ∈ R^N with covariance C = E{x(k)x^T(k)}, the subspace spanned by the m principal or minor eigenvectors, respectively [79].

For the matrix X ∈ R^{N×P} with K principal component vectors V_s = [v_1, ..., v_K], the principal subspace (also known as the signal subspace) of X is defined as

S = span(V_s) = span{v_1, ..., v_K},    (6.8.7)

and the orthogonal projector (i.e., the signal projection matrix) is given by

P_S = V_s(V_s^T V_s)^{-1} V_s^T = V_s V_s^T,    (6.8.8)

where V_s = [v_1, ..., v_K] is a P × K semi-orthogonal matrix such that V_s^T V_s = I_{K×K}. Similarly, the minor subspace (also known as the noise subspace) of X is defined as

N = span(V_n) = span{v_{K+1}, ..., v_P},    (6.8.9)


and the orthogonal projector (i.e., the noise projection matrix) is given by

P_N = V_n(V_n^T V_n)^{-1} V_n^T = V_n V_n^T,    (6.8.10)

where V_n = [v_{K+1}, ..., v_P] is a P × (P − K) semi-orthogonal matrix such that V_n^T V_n = I_{(P−K)×(P−K)}.

It is noticed that the principal subspace and the minor subspace are orthogonal, i.e., S ⊥ N, because

P_S^T P_N = V_s V_s^T V_n V_n^T = O_{P×P}  (null matrix).

Moreover, as

P_S + P_N = V_s V_s^T + V_n V_n^T = [V_s, V_n][V_s^T; V_n^T] = V_{P×P} V_{P×P}^T = I_{P×P},

there is the following relationship between the principal and minor subspaces:

P_S = I − P_N   or   V_s V_s^T = I − V_n V_n^T.    (6.8.11)

Therefore, PSA seeks to find the principal subspace V_s V_s^T instead of the principal components V_s = [v_1, ..., v_K] in PCA, while MSA finds the minor subspace V_n V_n^T instead of the minor components V_n = [v_{K+1}, ..., v_P] in MCA.

In order to deduce the PSA and MSA updating algorithms, we consider the minimization of an objective function J(W), where W is an n × r matrix. The common constraints on W are of two types, as follows.

• Orthogonality constraint: W is required to satisfy an orthogonality condition, either W^H W = I_r (n ≥ r) or WW^H = I_n (n < r). A matrix W satisfying such a condition is called a semi-orthogonal matrix.
• Homogeneity constraint: It is required that J(W) = J(WQ), where Q is an r × r orthogonal matrix.

For an unconstrained optimization problem min J(W_{n×r}) with n > r, the solution is a single matrix W. However, for an optimization problem subject to the semi-orthogonality constraint W^H W = I_r and the homogeneity constraint J(W) = J(WQ), where Q is an r × r orthogonal matrix, i.e.,

min J(W)   subject to   W^H W = I_r,  J(W) = J(WQ),    (6.8.12)

the solution is any member of the set of semi-orthogonal matrices defined by

Gr(n, r) = {W ∈ C^{n×r} | W^H W = I_r,  W_1 W_1^H = W_2 W_2^H}.    (6.8.13)


This set is known as the Grassmann manifold of the semi-orthogonal matrices W_{n×r} such that W^T W = I but WW^T is an n × n singular matrix. For two given semi-orthogonal matrices W_1 and W_2, their subspaces are called equivalent subspaces if W_1 W_1^T = W_2 W_2^T. Feature extraction methods based on equivalent subspaces are subspace analysis methods. Therefore, the goals of PSA and MSA are to find the equivalent principal and minor subspaces, respectively.

Subspace analysis methods have the following characteristics [287].

• The signal (or principal) subspace method and the noise (or minor) subspace method need only a few singular vectors or eigenvectors.
• In many applications, it is not necessary to know the singular values or eigenvalues; it is sufficient to know the rank of the matrix and its singular vectors or eigenvectors.
• In most cases, it is not necessary to know the singular vectors or eigenvectors exactly; it is sufficient to know the basis vectors spanning the principal subspace or minor subspace.
• Conversion between the principal subspace V_s V_s^H and the minor subspace V_n V_n^H can be made by using V_n V_n^H = I − V_s V_s^H.

For more on the basic theory and applications of subspace analysis, the reader may refer to [294].

In the following we deduce the PSA algorithm. Let C = E{xx^T} denote the covariance matrix of an N × 1 random vector x ∈ R^N, and consider the following constrained optimization problem:

min_W E{‖x − P_W x‖_2²}   subject to   W^T W = I,    (6.8.14)

where P_W = W(W^T W)^{-1}W^T is the projector onto the column space of the weighting matrix W_{N×P} with N > P, while P_W x denotes the feature vector of x extracted by W. Clearly, x − P_W x is the error between the original input data vector and its extracted feature vector, and thus (6.8.14) is a minimum squared error (MSE) criterion. Interestingly, the constrained optimization problem (6.8.14) can be equivalently written as the following unconstrained optimization problem:

min_W E{‖x − WW^T x‖_2²}    (6.8.15)

whose objective function is given by [287]

J(W) = E{‖x − WW^T x‖_2²}
     = E{(x − WW^T x)^T (x − WW^T x)}
     = E{x^T x} − 2E{x^T WW^T x} + E{x^T WW^T WW^T x}.    (6.8.16)


Because

E{x^T x} = E{Σ_{i=1}^n |x_i|²} = tr(E{xx^T}) = tr(C),
E{x^T WW^T x} = tr(E{W^T xx^T W}) = tr(W^T CW),
E{x^T WW^T WW^T x} = tr(E{W^T xx^T WW^T W}) = tr(W^T CWW^T W),

the objective function in (6.8.16) can be represented in the trace form

J(W) = tr(C) − 2tr(W^T CW) + tr(W^T CWW^T W),    (6.8.17)

where W is an N × P matrix whose rank is assumed to be K.

From Eq. (6.8.17) it can be seen that, in the time-varying case, the matrix differential of the objective function J(W(t)) is given by

dJ(W(t)) = −2tr(W^T(t)C(t)dW(t) + (C(t)W(t))^T dW^*(t))
           + tr((W^T(t)W(t)W^T(t)C(t) + W^T(t)C(t)W(t)W^T(t))dW(t)
           + (C(t)W(t)W^T(t)W(t) + W(t)W^T(t)C(t)W(t))^T dW^*(t)),

which yields the gradient matrix

∇_W J(W(t)) = −2C(t)W(t) + C(t)W(t)W^T(t)W(t) + W(t)W^T(t)C(t)W(t)
            = W(t)W^T(t)C(t)W(t) − C(t)W(t),

where the semi-orthogonality constraint W^T(t)W(t) = I has been used. Substituting C(t) = x(t)x^T(t) into the above gradient matrix formula, one obtains a gradient descent algorithm W(t) = W(t − 1) − μ∇_W J(W(t)) for solving the minimization problem, as follows:

y(t) = W^T(t)x(t),    (6.8.18)
W(t + 1) = W(t) + β(t)(x(t)y^T(t) − W(t)y(t)y^T(t)).    (6.8.19)

Here, β(t) > 0 is a learning parameter. This algorithm is called the projection approximation subspace tracking (PAST) algorithm; it was developed by Yang in 1995 [287] from the subspace viewpoint. For the PAST algorithm, the following two theorems hold.


Theorem 6.1 ([287]) The matrix W is a stationary point of J(W) if and only if W = V_r Q, where V_r ∈ R^{P×r} consists of r eigenvectors of the covariance matrix C = V_r D_r V_r^T and Q ∈ R^{r×r} is any orthogonal matrix. At every stationary point, the value of the objective function J(W) is equal to the sum of the eigenvalues whose corresponding eigenvectors are not in V_r.

Theorem 6.2 ([287]) All stationary points of the objective function J(W) are saddle points unless r = K (the rank of C), i.e., unless V_r = V_s consists of the K principal eigenvectors of the covariance matrix C. In this particular case J(W) reaches a global minimum.

Theorems 6.1 and 6.2 show the following facts.

1. In the unconstrained minimization of the objective function J(W) in (6.8.17), it is not required that the columns of W are orthogonal, but the two theorems show that minimization of J(W) in (6.8.17) automatically results in a semi-orthogonal matrix W such that W^T W = I.
2. Theorem 6.2 shows that when the column space of W equals the signal or principal subspace, Col(W) = Span(V_s), the objective function J(W) reaches a global minimum, and it has no other local minimum.
3. From the defining formula (6.8.17) it is easy to see that J(W) = J(WQ) holds for all K × K unitary matrices Q, i.e., the objective function automatically satisfies the homogeneity constraint.
4. Because the objective function defined by (6.8.17) automatically satisfies the homogeneity constraint, and its minimization automatically results in W satisfying the orthogonality constraint W^H W = I, the solution W is not uniquely determined but is a point on the Grassmann manifold. This greatly relaxes the solution of the optimization problem (6.8.17).
5. The projection matrix P = W(W^T W)^{-1}W^T = WW^T = V_s V_s^T is uniquely determined. That is to say, the different solutions W span the same column space.
6. When r = 1, i.e., the objective function is a real function of a vector w, the solution w of the minimization of J(w) is the eigenvector corresponding to the largest eigenvalue of the covariance matrix C.

Therefore, Eq. (6.8.19) gives the global solution to the optimization problem (6.8.16). It is worth emphasizing that (6.8.19) is just the PSA rule of Oja [194]. On the other hand, two MSA algorithms are given by [49]

W(t + 1) = W(t) + β(t)[W(t)W^T(t)y(t)x^T(t) − y(t)y^T(t)W(t)],    (6.8.20)

and by [79]

W(t + 1) = W(t) − β(t)[W(t)W^T(t)W(t)W^T(t)y(t)x^T(t) − y(t)y^T(t)W(t)].    (6.8.21)


The Douglas algorithm (6.8.21) is self-stabilizing, in that the weight vectors do not need to be periodically normalized to unit modulus.
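The following is a minimal sketch (illustrative only, with an assumed toy data stream; not the book's code) of the stochastic PSA update (6.8.18)–(6.8.19).

```python
# Minimal PSA/PAST-style stochastic update for tracking the principal subspace
# of streaming vectors x(t); learning rate and data model are assumptions.
import numpy as np

def psa_update(W, x, beta=1e-3):
    """One stochastic update of the N x P subspace weight matrix W."""
    y = W.T @ x                                            # (6.8.18): compressed vector
    return W + beta * (np.outer(x, y) - W @ np.outer(y, y))  # (6.8.19)

rng = np.random.default_rng(0)
N, P = 10, 2
C_half = rng.standard_normal((N, N)) * np.linspace(1.5, 0.1, N)  # anisotropic data model
W = np.linalg.qr(rng.standard_normal((N, P)))[0]
for _ in range(5000):
    x = C_half @ rng.standard_normal(N)
    W = psa_update(W, x)
print(np.round(W.T @ W, 2))     # should approach the identity (semi-orthogonality)
```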

6.8.4 Robust Principal Component Analysis

Given an observation or data matrix D = A + E, where A and E are unknown but A is known to be a low-rank true data matrix and E is known to be an additive sparse noise matrix, the aim is to recover A. Because the low-rank matrix A can be regarded as the principal component of the data matrix D, and the sparse matrix E may contain a few gross errors or outlying observations, the SVD-based principal component analysis (PCA) usually breaks down. Hence, this problem is a robust principal component analysis problem [164, 272]. So-called robust PCA is able to correctly recover the underlying low-rank structure in the data matrix, even in the presence of gross errors or outlying observations. This problem is also called the principal component pursuit (PCP) problem [46], since it tracks down the principal components of the data matrix D by minimizing a combination of the nuclear norm ‖A‖_* and the weighted ℓ_1-norm μ‖E‖_1:

min_{A,E} {‖A‖_* + μ‖E‖_1}   subject to   D = A + E.    (6.8.22)

Here, ‖A‖_* = Σ_{i=1}^{min{m,n}} σ_i(A) denotes the nuclear norm of A, i.e., the sum of all its singular values, which reflects the cost of the low-rank matrix A, while ‖E‖_1 = Σ_{i=1}^m Σ_{j=1}^n |E_ij| is the sum of the absolute values of all entries of the additive error matrix E, some of which may be arbitrarily large. The role of the constant μ > 0 is to balance the contradictory requirements of low rank and sparsity. By the minimization in (6.8.22), an initially unknown low-rank matrix A and an unknown sparse matrix E can be recovered from the data matrix D ∈ R^{m×n}. That is to say, the unconstrained minimization form of the robust PCA or PCP problem can be expressed as

min_{A,E} {‖A‖_* + μ‖E‖_1 + (1/2)‖A + E − D‖_F²}.    (6.8.23)

It can be shown [44] that, under rather weak assumptions, the PCP estimate can exactly recover the low-rank matrix A from the data matrix D = A + E with a gross but sparse error matrix E. In order to solve the robust PCA or PCP problem (6.8.23), consider a more general family of optimization problems of the form

F(X) = f(X) + μh(X),    (6.8.24)


where f(X) is a convex, smooth (i.e., differentiable), L-Lipschitz function and h(X) is a convex but nonsmooth function (such as ‖X‖_1, ‖X‖_*, and so on). Instead of directly minimizing the composite function F(X), we minimize its separable quadratic approximation Q(X, Y), formed at specially chosen points Y:

Q(X, Y) = f(Y) + ⟨∇f(Y), X − Y⟩ + (1/2)(X − Y)^T ∇²f(Y)(X − Y) + μh(X)
        = f(Y) + ⟨∇f(Y), X − Y⟩ + (L/2)‖X − Y‖_F² + μh(X),    (6.8.25)

where ∇²f(Y) is approximated by LI. When minimizing Q(X, Y) with respect to X, the term f(Y) may be regarded as a constant and neglected. Hence, we have

X_{k+1} = arg min_X {μh(X) + (L/2)‖X − (Y_k − (1/L)∇f(Y_k))‖_F²}
        = prox_{μL^{-1}h}(Y_k − (1/L)∇f(Y_k)).    (6.8.26)

If we let f(Y) = (1/2)‖A + E − D‖_F² and h(X) = ‖A‖_* + λ‖E‖_1, then the Lipschitz constant is L = 2 and ∇f(Y) = A + E − D. Hence

Q(X, Y) = μh(X) + f(Y) = (μ‖A‖_* + μλ‖E‖_1) + (1/2)‖A + E − D‖_F²

reduces to the quadratic approximation of the robust PCA problem. By Theorem 4.5, we have

A_{k+1} = prox_{(μ/2)‖·‖_*}(Y_k^A − (1/2)(A_k + E_k − D)) = U soft(Σ, μ/2) V^T,    (6.8.27)
E_{k+1} = prox_{(μλ/2)‖·‖_1}(Y_k^E − (1/2)(A_k + E_k − D)) = soft(W_k^E, μλ/2),    (6.8.28)

where UΣV^T is the SVD of W_k^A and

W_k^A = Y_k^A − (1/2)(A_k + E_k − D),
W_k^E = Y_k^E − (1/2)(A_k + E_k − D).

The robust PCA method shown in Algorithm 6.10 has the following convergence property.


Algorithm 6.10 Robust PCA via accelerated proximal gradient [164, 272]
1. input: Data matrix D ∈ R^{m×n}, λ, allowed tolerance ε.
2. initialization: A_0, A_{−1} ← O; E_0, E_{−1} ← O; t_0, t_{−1} ← 1; μ̄ ← δμ_0.
3. repeat
4.   Y_k^A ← A_k + ((t_{k−1} − 1)/t_k)(A_k − A_{k−1}).
5.   Y_k^E ← E_k + ((t_{k−1} − 1)/t_k)(E_k − E_{k−1}).
6.   W_k^A ← Y_k^A − (1/2)(A_k + E_k − D).
7.   (U, Σ, V) = svd(W_k^A).
8.   r = max{j : σ_j > μ_k/2}.
9.   A_{k+1} = Σ_{i=1}^r (σ_i − μ_k/2) u_i v_i^T.
10.  W_k^E ← Y_k^E − (1/2)(A_k + E_k − D).
11.  E_{k+1} = soft(W_k^E, λμ_k/2).
12.  t_{k+1} ← (1 + sqrt(4t_k² + 1))/2.
13.  μ_{k+1} ← max(ημ_k, μ̄).
14.  S_{k+1}^A = 2(Y_k^A − A_{k+1}) + (A_{k+1} + E_{k+1} − Y_k^A − Y_k^E).
15.  S_{k+1}^E = 2(Y_k^E − E_{k+1}) + (A_{k+1} + E_{k+1} − Y_k^A − Y_k^E).
16.  exit if ‖S_{k+1}‖_F² = ‖S_{k+1}^A‖_F² + ‖S_{k+1}^E‖_F² ≤ ε.
17.  return k ← k + 1.
18. output: A ← A_k, E ← E_k.

Theorem 6.3 ([164]) Let F(X) = F(A, E) = μ‖A‖_* + μλ‖E‖_1 + (1/2)‖A + E − D‖_F². Then, for all k > k_0 = C_1/log(1/η) with C_1 = log(μ_0/μ̄), one has

F(X_k) − F(X^*) ≤ 4‖X_{k_0} − X^*‖_F² / (k − k_0 + 1)²,    (6.8.29)

where X^* is a solution to the robust PCA problem min_X F(X).
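For concreteness, the following is a minimal sketch (not the implementation of [164, 272]) of the two proximal steps (6.8.27)–(6.8.28): singular-value thresholding for the low-rank part and entrywise soft-thresholding for the sparse part, iterated with step size 1/L = 1/2 on assumed toy data.

```python
# Minimal proximal-gradient sweep for D = A + E (sketch; parameters are assumptions).
import numpy as np

def soft(X, tau):
    """Entrywise soft-thresholding operator."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    """Singular-value thresholding: prox of tau*||.||_* at X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(1)
D = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 40))   # low-rank part
D[rng.random(D.shape) < 0.05] += 10.0                             # sparse outliers
A, E, mu, lam = np.zeros_like(D), np.zeros_like(D), 1.0, 0.1
for _ in range(100):
    G = 0.5 * (A + E - D)                 # (1/L) * gradient of (1/2)||A+E-D||_F^2, L = 2
    A, E = svt(A - G, mu / 2), soft(E - G, mu * lam / 2)
print(np.linalg.matrix_rank(A, tol=1e-6))
```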

6.8.5 Sparse Principal Component Analysis

Consider a linear regression problem: given an observed response vector y = [y_1, ..., y_n]^T ∈ R^n and an observed data matrix (of predictors) X = [x_1, ..., x_p] ∈ R^{n×p}, find a fitting coefficient vector β = [β_1, ..., β_p]^T ∈ R^p such that

β̂ = arg min_β ‖y − Xβ‖_2² = arg min_β ‖y − Σ_{i=1}^p x_i β_i‖_2²,    (6.8.30)


where y and x_i, i = 1, ..., p, are assumed to be centered. Minimizing the above squared error usually leads to sensitive solutions. Many regularization methods have been proposed to decrease this sensitivity. Among them, Tikhonov regularization [247] and the Lasso [84, 244] are two widely known and cited methods, as pointed out in [283]. The Lasso (least absolute shrinkage and selection operator) is a penalized least squares method:

β̂_Lasso = arg min_β ‖y − Σ_{i=1}^p x_i β_i‖_2² + λ‖β‖_1,    (6.8.31)

where ‖β‖_1 = Σ_{i=1}^p |β_i| is the ℓ_1-norm and λ is nonnegative. The Lasso can provide a sparse coefficient vector, owing to the nature of the ℓ_1-norm penalty, by continuously shrinking the prediction coefficients toward zero, and it achieves its prediction accuracy via the bias–variance trade-off. Although the Lasso has shown success in many situations, it has some limitations [307].

• If p > n, then the Lasso selects at most n variables before it saturates, due to the nature of the convex optimization problem. This seems to be a limiting feature for a variable selection method. Moreover, the Lasso is not well defined unless the bound on the ℓ_1-norm of β is smaller than a certain value.
• If there is a group of variables among which the pairwise correlations are very high, then the Lasso tends to select only one variable from the group and does not care which one is selected.
• For usual n > p cases, if there are high correlations between the predictors x_i, it has been empirically observed that the prediction performance of the Lasso is dominated by ridge regression [244].

To overcome these drawbacks, the elastic net was proposed by Zou and Hastie [307] by generalizing the Lasso:

β̂_en = (1 + λ_2)(arg min_β ‖y − Σ_{i=1}^p x_i β_i‖_2² + λ_1 Σ_{i=1}^p |β_i| + λ_2 Σ_{i=1}^p |β_i|²)    (6.8.32)

for any nonnegative λ_1 and λ_2. The elastic net penalty is a convex combination of the ridge and Lasso penalties. If λ_2 = 0, then the elastic net reduces to the Lasso.

Theorem 6.4 ([308]) Let X = UDV^T be the rank-K SVD of X. For each i, denote by z_i = d_ii u_i the ith principal component. Consider a positive λ and the ridge


estimates β̂_ridge given by

β̂_ridge = arg min_β ‖z_i − Xβ‖_2² + λ‖β‖_2².    (6.8.33)

Letting v̂ = β̂_ridge/‖β̂_ridge‖, then v̂ = v_i.

This theorem establishes the connection between PCA and a regression method.

Theorem 6.5 ([308]) Let A_{p×k} = [α_1, ..., α_k] and B_{p×k} = [β_1, ..., β_k]. For any λ > 0, consider the constrained optimization problem

(Â, B̂) = arg min_{A,B} ‖X − XBA^T‖_F² + λ Σ_{j=1}^k ‖β_j‖_2²    (6.8.34)
subject to  A^T A = I_{k×k}.

Then β̂_j ∝ v_j for j = 1, ..., k.

If the Lasso penalty is added, then the criterion (6.8.34) becomes the following optimization problem:

(Â, B̂) = arg min_{A,B} ‖X − XBA^T‖_F² + λ Σ_{j=1}^k ‖β_j‖_2² + Σ_{j=1}^k λ_{1,j}‖β_j‖_1    (6.8.35)
subject to  A^T A = I_{k×k}.

Since A^T A = I_{k×k}, one has ‖X − XBA^T‖_F² = ‖XA − XB‖_F², whose jth column term equals (α_j − β_j)^T X^T X(α_j − β_j). Substituting this result into (6.8.35) gives

(α̂_j, β̂_j) = arg min_{α_j, β_j} (α_j − β_j)^T X^T X(α_j − β_j) + λ‖β_j‖_2² + λ_{1,j}‖β_j‖_1.    (6.8.36)

This coupled optimization problem can be solved by using the well-known alternating method. Summarizing the above discussion, the sparse principal component analysis (SPCA) algorithm developed by Zou et al. [308] can be described as Algorithm 6.11.


Algorithm 6.11 Sparse principal component analysis (SPCA) algorithm [308]
1. input: The data matrix X ∈ R^{n×p} and the response vector y ∈ R^n.
2. initialization:
   2.1 Compute the truncated SVD X = Σ_{i=1}^k d_i u_i v_i^T.
   2.2 Let α_j = v_j, j = 1, ..., k.
   2.3 Select λ > 0 and λ_{1,i} > 0, i = 1, ..., k.
3. Given a fixed A = [α_1, ..., α_k], solve the following elastic net problem for j = 1, ..., k:
   β_j = arg min_β (α_j − β)^T X^T X(α_j − β) + λ‖β‖_2² + λ_{1,j}‖β‖_1.
4. For a fixed B = [β_1, ..., β_k], compute the SVD of X^T XB = UDV^T, then update A = UV^T.
5. Repeat Steps 3–4 until convergence.
6. Normalization: β_j ← β_j/‖β_j‖_2 for j = 1, ..., k.
7. output: normalized fitting coefficient vectors β_j, j = 1, ..., k.

6.9 Supervised Learning Regression

Supervised learning in the form of regression (for continuous outputs) and classification (for discrete outputs) is an important field of statistics and machine learning. The aim of machine learning is to establish a mapping relationship between inputs and outputs by learning the correspondence between samples and labels. As the name implies, supervised learning is based on a supervisor (a data labeling expert): it is a type of machine learning on input (or training) data with labels.

In supervised learning, we are given a "training" set of input vectors {x_n}_{n=1}^N along with corresponding targets {y_n}_{n=1}^N. The data set {x_n, y_n}_{n=1}^N is known as the labeled data set. Let the domain of input instances be X and the domain of labels be Y, and let P(x, y) be an (unknown) joint probability distribution on instances and labels X × Y. Given independent identically distributed (i.i.d.) training data {x_i}_{i=1}^N sampled from P(x), denoted x_i ~ P(x) i.i.d., supervised learning trains a function f : X → Y in some function family F in order to make f(x) predict the true label y on future data x, where x ~ P(x) as well.

Depending on the target y, supervised learning problems are further divided into the two most common supervised learning forms: classification and regression.

• Classification is the supervised learning problem with discrete class labels y, for example, y ∈ {+1, −1} for two classes. The function f is called a classifier.
• Regression is the supervised learning problem with real-valued y. The function f is called a regression function.

From this training set we wish to learn a model of the dependency of the targets on the inputs, with the objective of making accurate predictions of y for previously unseen values of x. Goals in model selection include [116]:

• accurate predictions,
• interpretable models: determining which predictors are meaningful,


• stability: small changes in the data should not result in large changes in either the subset of predictors used, the associated coefficients, or the predictions,
• avoiding bias in hypothesis tests during or after variable selection.

6.9.1 Principal Component Regression

A task to approximate real-valued functions from a training sample is variously referred to as prediction, regression, interpolation, or function approximation. In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (also called the "criterion variable") and one or more independent variables (or predictors). More specifically, regression analysis helps us understand how the typical value of the dependent variable changes when any one of the independent variables is varied while the other independent variables are held fixed. Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables. In all cases, a function of the independent variables, called the regression function, is to be estimated.

Given a training set X = {x_1, ..., x_n} with x_i = [x_i1, ..., x_ip]^T for all i ∈ [n] := {1, ..., n}, denote X = [x_1, ..., x_n]^T, and consider the supervised learning problem: learn a function f(X) : X → y so that f(X) is a good predictor for the corresponding value of y, namely minimize (1/2)‖y − f(X)‖_2². For historical reasons, this function f is called a hypothesis.

Regression models involve the following parameters and variables:

• The unknown parameter vector, denoted β = [β_1, ..., β_p]^T.
• The independent variable matrix X = [x_ij] ∈ R^{n×p}.
• The dependent variable vector, denoted y = [y_1, ..., y_n]^T.

A regression model uses the function of X and β,

y ≈ f(X, β),    (6.9.1)

as an approximation of E(y|X) = f(X, β).

Standard Linear Regression Method
Consider the more general multiple regression

y_i = β_1 x_i1 + β_2 x_i2 + ··· + β_p x_ip + ε_i.    (6.9.2)

As data pre-processing, it is usually assumed that y and each of the p columns of X have already been centered so that all of them have zero empirical means.


After centering, the standard Gauss–Markov linear regression model for y on X can be represented as

y = Xβ + ε,    (6.9.3)

where β ∈ R^p denotes the unknown parameter vector of regression coefficients and ε denotes the vector of random errors with E(ε) = 0 and var(ε) = σ²I_{n×n} for some unknown variance parameter σ² > 0.

The LS estimate of the regression parameter vector β is given by

β̂_LS = (X^T X)^{-1} X^T y    (6.9.4)

for a data matrix X with full column rank. With the regularized cost function

L(β) = (1/2)‖y − Xβ‖_2² + λ‖β‖_2²,    (6.9.5)

the Tikhonov regularized LS solution is given by

β̂_Tik = (X^H X + λI)^{-1} X^H y    (6.9.6)

for a data matrix X with deficient column rank. The total least squares (TLS) estimate of the regression parameter vector β is given by

β̂_TLS = (X^H X − λI)^{-1} X^H y    (6.9.7)

for a data matrix X with full column rank and observation errors.

for the data matrix X with full column rank and observation errors. Principle Component Regression Method Principle component regression (PCR) [175] is another technique for estimating β, based on principle component analysis (PCA). Let R = XT X be the sample autocorrelation matrix and its SVD be R = VVT , where V = [v1 , . . . , vp ] is the principal loading matrix. Then, XV = [Xv1 , . . . , Xvp ] is the principal components of X. The idea of PCR is simple [175]: use principal components W = XV of X instead of the original data matrix X in standard linear regression y = Xβ to form the regression equation: y = Wγ = XVγ .

(6.9.8)

Comparing the PCR y = XVγ with the linear regression y = Xβ, we immediately have β = Vγ . Then, the procedure of PCR consists of the following steps. • Principal Components Extraction: PCR starts by performing an SVD on the autocorrelation matrix of centered data matrix X: R = XT X = VVT , where

284

6 Machine Learning

" ! p×p = Diag λ21 , . . . , λ2p with λ1 ≥ · · · ≥ λp ≥ 0 denoting the nonnegative eigenvalues (also called the principal eigenvalues) of XT X. Then, vj denotes the j th principal component direction (or known as PCA loading) corresponding to the j th largest principal value λj for each j ∈ {1, . . . , p}. Hence, V is called the PCA loading matrix of XT X. • PCR model: For any k ∈ {1, . . . , p}, let Vk denote the p × k matrix with orthonormal columns consisting of the first k columns of V. Let Wk = XVk = [Xv1 , . . . , Xvk ] denote the n × k matrix having the first k principal components as its columns. Then, W may be viewed as the data matrix obtained by using the transformed covariants xki = VTk xi ∈ Rk instead of using the original covariants xi ∈ Rp , ∀i = 1, . . . , n. Hence, if the standard Gauss–Markov linear regression model y = Xβ is viewed as a full component regression, then by using the principal component matrix Wk = XVk instead of the original data matrix X, we have the PCR model yk = Wk γ k .

(6.9.9)

• PCR Estimator γˆ k : The solution of the above PCR equation is given by γˆ k =  T −1 T Wk Wk Wk yk ∈ Rk , which denotes the vector of estimated regression coefficients obtained by ordinary least squares regression of the response vector y on the data matrix Wk . • PCR Estimator βˆ k : Substitute Wk = XVk into Eq. (6.9.9) to get yk = XVk γ k . By comparing yk = XVk γ k with yk = Xβ k , it is easily known that for any k ∈ {1, . . . , p}, the final PCR estimator of β based on using the first k principal components is given by βˆ k = Vk γˆ k ∈ Rp , k ∈ {1, . . . , p}. Algorithm 6.12 shows the principle component regression algorithm. Algorithm 6.12 Principal component regression algorithm [175] 1. input: 1.1 The data matrix X ∈ Rn×p of observed covariants; 1.2 The vector y ∈ Rn of observed outcomes, where n ≥ p. 2. initialization: xij ← xij − x¯j with x¯j = n1 ni=1 xij , j = 1, . . . , p, and yi ← yi − y, ¯  y¯ = n1 ni=1 yi . 3. Compute the SVD: X = UVT , and save the singular values σ1 , . . . , σp and the right singular-vector matrix V. 4. Determine k largest principal eigenvalues λi = σi2 , i = 1, . . . , k with k ∈ {1, . . . , p}, and denote Vk = [v1 , . . . , vk ]. 5. Construct the transformed data matrix Wk = [Xv1 , . . . , Xvk ], k ∈ {1, . . . , p}. −1 T  6. Compute the PCR estimate γˆ k = WTk Wk Wk y ∈ Rk , k ∈ {1, . . . , p}. 7. The regression parameter estimate based on k principal components is given by βˆ k = Vk γˆ k ∈ Rp , k ∈ {1, . . . , p}. 8. output: βˆ k , k ∈ {1, . . . , p}.


6.9.2 Partial Least Squares Regression

Let X ⊂ R^N be an N-dimensional space of variables representing the first block and Y ⊂ R^M be a space representing the second block of variables. The general setting of a linear partial least squares (PLS) algorithm is to model the relations between these two data sets (blocks of variables) by means of score vectors. The PLS model can be considered as consisting of outer relations (the X- and Y-blocks individually) and an inner relation (linking both blocks). Let p denote the number of components in the model and N the number of objects. The main symbols in the PLS model are as follows:

X : N × K matrix of predictor variables,
Y : N × M matrix of response variables,
E : N × K residual matrix for X,
F* : N × M residual matrix for Y,
B : K × M matrix of PLS regression coefficients,
W : K × p matrix of the PLS weights for X,
C : M × p matrix of the PLS weights for Y,
P : K × p matrix of the PLS loadings for X,
Q : M × p matrix of the PLS loadings for Y,
T : N × p matrix of the PLS scores for X,
U : N × p matrix of the PLS scores for Y.

The outer relations for the X- and Y-blocks are given by [105]

X = TP^T + E = Σ_{h=1}^p t_h p_h^T + E,    (6.9.10)
Y = UQ^T + F* = Σ_{h=1}^p u_h q_h^T + F*,    (6.9.11)

where T = [t_1, ..., t_p] ∈ R^{N×p} and U = [u_1, ..., u_p] ∈ R^{N×p} contain the p score vectors (components, latent vectors) extracted from X and Y, respectively; the K × p matrix P = [p_1, ..., p_p] and the M × p matrix Q = [q_1, ..., q_p] are the matrices of loadings; and both E ∈ R^{N×K} and F* ∈ R^{N×M} are matrices of residuals.

Let t = Xw be the X-space score vector (simply the X-score) and u = Yc be the Y-space score vector (simply the Y-score). For the user of PLS, it is necessary to know the following main properties of the PLS factors [105, 121].

1. The PLS weight vectors w_i for X are mutually orthogonal:

⟨w_i, w_j⟩ = w_i^T w_j = 0,  for i ≠ j.    (6.9.12)


2. The PLS score vectors t_i are mutually orthogonal:

⟨t_i, t_j⟩ = t_i^T t_j = 0,  for i ≠ j.    (6.9.13)

3. The PLS weight vectors w_i are orthogonal to the PLS loading vectors p_j for X:

⟨w_i, p_j⟩ = w_i^T p_j = 0,  for i < j.    (6.9.14)

4. The PLS loading vectors p_i are orthogonal in the kernel space of X:

⟨p_i, p_j⟩_X = p_i^T (X^T X)^† p_j = 0,  for i ≠ j.    (6.9.15)

5. Both the PLS loading vectors p_i for X and the PLS loading vectors q_i for Y have unit length for each i: ‖p_i‖ = ‖q_i‖ = 1.
6. Both the PLS score vector t_h for X and the PLS score vector u_h for Y have zero means for each h: Σ_i t_hi = 0 and Σ_i u_hi = 0.

The classical form of PLS is to find weight vectors w and c such that the inner relation linking the X-block and the Y-block satisfies [121]

[cov(t, u)]² = [cov(Xw, Yc)]² = max_{‖r‖=‖s‖=1} [cov(Xr, Ys)]²,    (6.9.16)

where cov(t, u) = t^T u/N denotes the sample covariance between the score vectors t and u. This classical PLS algorithm is called the nonlinear iterative partial least squares (NIPALS) algorithm, proposed by Wold in 1975 [274] and shown as Algorithm 6.13.

Algorithm 6.13 Nonlinear iterative partial least squares (NIPALS) algorithm [274]
1. input: Data blocks X = [x_1, ..., x_K] and Y = [y_1, ..., y_M].
2. initialization: randomly generate the Y-space score vector u.
3. repeat
   // X-space score vector update:
4.   w = X^T u/(u^T u),
5.   normalize w: ‖w‖ → 1,
6.   t = Xw.
   // Y-space score vector update:
7.   c = Y^T t/(t^T t),
8.   normalize c: ‖c‖ → 1,
9.   u = Yc.
10. exit if both t and u converge.
11. return
12. output: X-space score vector t and Y-space score vector u.
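The following is a minimal sketch (illustrative, with assumed toy data) of one NIPALS component extraction in the spirit of Algorithm 6.13: alternate between the X- and Y-space score updates until the scores stabilize.

```python
# Minimal NIPALS component-extraction sketch.
import numpy as np

def nipals_component(X, Y, tol=1e-10, max_iter=500):
    u = Y[:, [0]]                               # start from a column of Y
    for _ in range(max_iter):
        w = X.T @ u / (u.T @ u)                 # X-weights
        w /= np.linalg.norm(w)
        t = X @ w                               # X-scores
        c = Y.T @ t / (t.T @ t)                 # Y-weights
        c /= np.linalg.norm(c)
        u_new = Y @ c                           # Y-scores
        if np.linalg.norm(u_new - u) < tol:
            u = u_new
            break
        u = u_new
    return t, u, w, c

X = np.random.randn(30, 5)
Y = X @ np.random.randn(5, 3) + 0.1 * np.random.randn(30, 3)
t, u, w, c = nipals_component(X - X.mean(0), Y - Y.mean(0))
print(t.shape, u.shape)
```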

Here are the characteristics of PLS [105].

• There are outer relations of the form X = TP^T + E and Y = UQ^T + F*.


• There is an inner relation u_h = b_h t_h.
• The mixed relation is Y = TBQ^T + F*, where ‖F*‖ is to be minimized.
• In the iterative algorithm, the blocks exchange each other's scores; this gives a better inner relation.
• In order to obtain orthogonal X-scores, as in PCA, it is necessary to introduce weights.

There are interesting connections between PLS, principal component analysis (PCA), and canonical correlation analysis (CCA), as follows [215].

1. The optimization criterion of PCA can be written as

max_{‖r‖=1} {var(Xr)},    (6.9.17)

where var(t) = t^T t/N denotes the sample variance.

2. CCA finds the direction of maximal correlation by solving the optimization problem

max_{‖r‖=‖s‖=1} {[corr(Xr, Ys)]²},    (6.9.18)

where [corr(t, u)]² = [cov(t, u)]²/(var(t)var(u)) denotes the sample squared correlation.

3. The PLS optimization criterion (6.9.16) can be rewritten as

max_{‖r‖=‖s‖=1} {[cov(Xr, Ys)]²} = max_{‖r‖=‖s‖=1} {var(Xr)[corr(Xr, Ys)]² var(Ys)}.    (6.9.19)

This shows that PLS is a form of CCA in which the criterion of maximal correlation is balanced by the requirement to explain as much variance as possible in both the X- and Y-spaces. Note that only the X-space variance is involved in the case of a one-dimensional Y-space.

The regression based on the PLS-model equations (6.9.10) and (6.9.11) is called partial least squares (PLS) regression. To combine (6.9.10) and (6.9.11) into the standard model of PLS regression, two assumptions are made:

• the score vectors {t_i}_{i=1}^p are good predictors of Y, where p denotes the number of extracted X-score vectors;
• a linear inner relation between the score vectors t and u exists, that is, u_i = α_i t_i + h_i, where h_i denotes a residual vector.

Therefore, the Y-score matrix U = [u_1, ..., u_p] and the X-score matrix T = [t_1, ..., t_p] are related by

U = TD + H,    (6.9.20)


where D = Diag(α_1, ..., α_p) is a p × p diagonal matrix and H = [h_1, ..., h_p] is the matrix of residuals.

Letting W be a weighting matrix such that EW = O (the zero matrix), and postmultiplying (6.9.10) by W, we get

XW = TP^T W   ⟹   T = XW(P^T W)^{-1}.    (6.9.21)

On the other hand, substituting (6.9.20) into (6.9.11) gives

Y = TDQ^T + HQ^T + F*   ⟹   Y = TC^T + F,    (6.9.22)

where C = QD denotes the M × p matrix of regression weights and F = HQ^T + F* is the residual matrix. Then, substituting (6.9.21) into (6.9.22), we have Y = XW(P^T W)^{-1}C^T + F, or equivalently the PLS regression model Y = XB_PLS + F, where

B_PLS = W(P^T W)^{-1}C^T    (6.9.23)

represents the matrix of PLS regression coefficients.

In the NIPALS algorithm, w = X^T u/(u^T u) and c = Y^T t/(t^T t) are generated, which imply that

W = X^T U   and   C = Y^T T.    (6.9.24)

Moreover, premultiplying (6.9.21) by T^T and using T^T T = I gives

(P^T W)^{-1} = (T^T XX^T U)^{-1}.    (6.9.25)

Substituting (6.9.24) and (6.9.25) into (6.9.23) gives

B_PLS = X^T U(T^T XX^T U)^{-1} T^T Y.    (6.9.26)

6.9 Supervised Learning Regression

289

Algorithm 6.14 Simple nonlinear iterative partial least squares regression algorithm [275] 1. input: Optionally transformed, scaled, and centered data, X = X0 and Y = Y0 . 2. initialization: A starting vector of u, usually one of columns of Y. With a single y, take u = y. Put k = 0 3. repeat 4. Compute the X -weights: w = XT u/(uT u). 5. Calculate the X -scores t = Xw. 6. Calculate the Y -weights c = YT t/(tT t). 7. Update a set of Y -scores u as u = Yc/(cT c). 8. exit if the convergence told − tnew /tnew  < ε has been reached, where ε is “small” (e.g., 10−6 or 10−8 ). 9. return 10. Remove (deflate, peel off) the present component from X and Y, and use these deflated matrices as X and Y in the next component. Here the deflation of Y is optional; the results are equivalent whether Y is deflated or not. 11. Compute X -loadings: p = XT t/(tT t). 12. Compute Y -loadings: q = YT u/(uT u). 13. Regression (u on t): b = uT t/(tT t). 14. Deflate X, Y matrices: X ← X − tpT and Y ← Y − btcT . 15. k ← k + 1. 16. uk = u and tk = t. 17. The next set of iterations starts with the new X and Y matrices as the residual matrices from the previous iteration. The iterations continue until a stopping criteria is used or X becomes the zero matrix. 18. U = [u1 , . . . , up ] and T = [t1 , . . . , tp ]. 19. BPLS = XT U(TT XXT U)−1 TT Y. 20. output: the matrix of PLS regression coefficients BPLS .

• Find the second X -score t2 and Y-score u2 from the residual matrices X = X − tpT and Y = Y − btcT . • Construct the new residual matrices X and Y and find the third X - and Y-scores t3 and u3 . Repeat this process until the residual matrix X becomes a null matrix. • Use (6.9.26) to compute the matrix of regression coefficients B, where U = [u1 , . . . , up ] and T = [t1 , . . . , tp ] consist of p X - and Y-scores, respectively. Once the matrix of PLS regression coefficients BPLS is found by using Algorithm 6.14, for a given new data block Xnew , then unknown Y-values can be predicted by Eq. (6.9.23) as ˆ new = Xnew BPLS . Y

(6.9.27)

6.9.3 Penalized Regression For linear regression problems y = Xβ ∈ Rn with X ∈ Rn×p , β ∈ Rp , ordinary least squares (OLS) regression minimizes squared regression error y − Xβ22 =


(y − Xβ)^T(y − Xβ), and yields the p × 1 unbiased estimator β̂_OLS = (X^T X)^{-1} X^T y. Although the OLS estimator is simple and unbiased, if the design matrix X is not of full rank, then β̂ is not unique, and its variance var(β̂) = (X^T X)^{-1}σ² (where σ² is the variance of the i.i.d. regression error) may be very large. To achieve better prediction, Hoerl and Kennard [117, 118] introduced ridge regression

β̂ = arg min_β {(1/2)‖y − Xβ‖_2²}   subject to   Σ_{j=1}^p |β_j|^γ ≤ t,    (6.9.28)

where γ ≥ 1 and t ≥ 0. The equivalent form of ridge regression is the following penalized regression [101]:

β̂ = arg min_β {(1/2)‖y − Xβ‖_2² + λ Σ_{j=1}^p |β_j|^γ},    (6.9.29)

where γ ≥ 1 and λ ≥ 0 is a nonnegative scalar; λ = 0 corresponds to ordinary least squares regression. When γ takes different values, the penalized regression has different forms. The most well-known penalized regressions are the ℓ_2 penalized regression with γ = 2 (often called ridge regression),

β̂_ridge = arg min_β {(1/2)‖y − Xβ‖_2² + λ‖β‖_2²},    (6.9.30)

and the ℓ_1 penalized regression with γ = 1, called the least absolute shrinkage and selection operator (Lasso), given by

β̂_Lasso = arg min_β {(1/2)‖y − Xβ‖_2² + λ‖β‖_1},    (6.9.31)

where ‖β‖_1 = Σ_{i=1}^p |β_i| is the ℓ_1-norm.

The solution of the ridge regression is given by

β̂_ridge = (X^T X + λI)^{-1} X^T y,    (6.9.32)

which is just the Tikhonov regularized solution [246, 247].


The Lasso solution, in a soft-thresholded version [77], is given by [99]

β̂_i^Lasso(γ) = S(β̂_i, γ) = sign(β̂_i)(|β̂_i| − γ)_+    (6.9.33)
             = { β̂_i − γ,  if β̂_i > 0 and γ < |β̂_i|;
                 β̂_i + γ,  if β̂_i < 0 and γ < |β̂_i|;
                 0,         if γ ≥ |β̂_i|;    (6.9.34)

with

x_+ = { x,  if x > 0,
        0,  if x ≤ 0.    (6.9.35)

Important differences between ridge regression and Lasso regression are twofold.

• The ℓ_1 penalty in the Lasso results in variable selection, since variables with coefficients of zero are effectively omitted from the model, whereas the ℓ_2 penalty in ridge regression does not perform variable selection.
• An ℓ_2 penalty λΣ_{j=1}^n β_j² pushes β_j toward zero with a force proportional to the value of the coefficient, whereas an ℓ_1 penalty λΣ_{j=1}^n |β_j| exerts the same force on all nonzero coefficients. Hence, for variables that are most valuable in the model and for which shrinkage toward zero is less desirable, an ℓ_1 penalty shrinks less [116].

Typically, optimization problems with inequality constraints are solved using a standard quadratic programming algorithm. An alternative is to explore "one-at-a-time" coordinate-wise descent algorithms for these problems [99]. For large optimization problems, coordinate descent can deliver a path of solutions efficiently, and can be applied to many other convex statistical problems such as the large Lasso, the garotte, the elastic net, and so on [99]. The following are a few typical variants of the Lasso and their respective solutions [99].

• Nonnegative garotte: This method, developed by Breiman [35], is a precursor to the Lasso, and solves

min_c {(1/2) Σ_{i=1}^n (y_i − Σ_{j=1}^p x_ij c_j β̂_j)² + γ Σ_{j=1}^p c_j}   subject to   c_j ≥ 0,    (6.9.36)


where the β̂_j are the usual least squares estimates (p ≤ n is assumed). The coordinate-wise update is given by

c_j ← ((β̃_j β̂_j − λ)/β̂_j²)_+,    (6.9.37)

where β̃_j = Σ_{i=1}^n x_ij (y_i − ỹ_i^(j)) and ỹ_i^(j) = Σ_{k≠j} x_ik c_k β̂_k.

• Elastic net: This method [307] adds another penalty λ_2 Σ_{j=1}^p β_j²/2 to the Lasso, to solve

min_β {(1/2) Σ_{i=1}^n (y_i − Σ_{j=1}^p x_ij β_j)² + λ_1 Σ_{j=1}^p |β_j| + λ_2 Σ_{j=1}^p β_j²/2}.    (6.9.38)

The coordinate-wise update has the form

β̃_j ← S(Σ_{i=1}^n x_ij (y_i − ỹ_i^(j)), λ_1) / (1 + λ_2).    (6.9.39)

• Grouped Lasso: Let X_j be an N × p_j orthonormal matrix that represents the jth group of p_j variables, j = 1, ..., m, and β_j the corresponding coefficient vector. The grouped Lasso [292] solves

min_β {‖y − Σ_{j=1}^m X_j β_j‖_2² + Σ_{j=1}^m λ_j ‖β_j‖_2},    (6.9.40)

where λ_j = λ√p_j. The coordinate-wise update is

β̃_j ← (‖s_j‖_2 − λ_j)_+ s_j/‖s_j‖_2,    (6.9.41)

where s_j = X_j^T(y − ỹ^(j)) with ỹ^(j) = Σ_{k≠j} X_k β̃_k.

• "Berhu" penalty: This method [198] is a robust hybrid of the Lasso and ridge regression, and its Lagrange form is

min_β {(1/2) Σ_{i=1}^n (y_i − Σ_{j=1}^p x_ij β_j)² + λ Σ_{j=1}^p [|β_j| · I(|β_j| < δ) + ((β_j² + δ²)/(2δ)) · I(|β_j| ≥ δ)]}.    (6.9.42)


The Berhu penalty is the reverse of a "Huber" function. The coordinate-wise update has the form

β̃_j ← { S(Σ_{i=1}^n x_ij (y_i − ỹ_i^(j)), λ),            if |β_j| < δ;
         (Σ_{i=1}^n x_ij (y_i − ỹ_i^(j)))/(1 + λ/δ),      if |β_j| ≥ δ.    (6.9.43)

This is a robust hybrid consisting of Lasso-style soft-thresholding for values less than δ and ridge-style shrinkage beyond δ.
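The following is a minimal sketch (assuming standardized predictors with unit column norms and a centered response; data and parameter values are illustrative) of the "one-at-a-time" coordinate descent idea applied to the Lasso (6.9.31), using the soft-thresholding operator of (6.9.33).

```python
# Minimal coordinate-descent Lasso sketch.
import numpy as np

def soft_threshold(z, gamma):
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]        # partial residual y - y_tilde^(j)
            beta[j] = soft_threshold(X[:, j] @ r_j, lam)  # unit column norm assumed
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((80, 10))
X /= np.linalg.norm(X, axis=0)                            # standardize columns
y = X[:, :3] @ np.array([3.0, -2.0, 1.5]) + 0.05 * rng.standard_normal(80)
y -= y.mean()
print(np.round(lasso_cd(X, y, lam=0.1), 2))
```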

6.9.4 Gradient Projection for Sparse Reconstruction

As the general formulation of the Lasso problem, we consider the unconstrained convex optimization problem

min_x {(1/2)‖y − Ax‖_2² + τ‖x‖_1},    (6.9.44)

where x ∈ R^n, y ∈ R^k, A ∈ R^{k×n}, and τ is a nonnegative parameter. When the variable x is a sparse vector, the above optimization is also called the sparse reconstruction of x. The optimization problem (6.9.44) is closely related to the following constrained convex optimization problems: the quadratically constrained linear program (QCLP)

min_x ‖x‖_1   subject to   ‖y − Ax‖_2² ≤ ε,    (6.9.45)

and the quadratic program (QP)

min_x ‖y − Ax‖_2²   subject to   ‖x‖_1 ≤ α,    (6.9.46)

where ε and α are nonnegative real parameters.

By various enhancements to the basic gradient projection (GP) algorithm applied to a quadratic programming formulation of (6.9.44), Figueiredo et al. [92] proposed GPSR (gradient projection for sparse reconstruction). The GPSR approach, as in [102], splits the variable x into its positive and negative parts:

x = u − v,   u ≥ 0, v ≥ 0,    (6.9.47)

where u_i = (x_i)_+ and v_i = (−x_i)_+ for all i = 1, ..., n with (x)_+ = max{0, x}. Then, (6.9.44) can be rewritten as the following bound-constrained quadratic


program (BCQP):

min_{u,v} {(1/2)‖y − A(u − v)‖_2² + τ1_n^T u + τ1_n^T v}   subject to   u ≥ 0, v ≥ 0.    (6.9.48)

Equation (6.9.48) can be written in the more standard BCQP form

min_z {F(z) = c^T z + (1/2)z^T Bz}   subject to   z ≥ 0,    (6.9.49)

where

z = [u; v],   b = A^T y,   c = τ1_{2n} + [−b; b],    (6.9.50)

and

B = [A^T A, −A^T A; −A^T A, A^T A].    (6.9.51)

When the solution of (6.9.44) is known in advance to be nonnegative, we have

(1/2)‖y − Ax‖_2² = (1/2)y^T y − y^T Ax + (1/2)x^T A^T Ax,

since x^T A^T y = (x^T A^T y)^T = y^T Ax. Noting that y^T y is a constant independent of the variable x, and that τ‖x‖_1 = τ1_n^T x under the assumption that x is nonnegative, Eq. (6.9.44) can be equivalently written as

min_x {(τ1_n − A^T y)^T x + (1/2)x^T A^T Ax}   subject to   x ≥ 0,    (6.9.52)

which has the same form as (6.9.49).

Gradient projection is an efficient method for solving general minimization problems over a convex set C in the unconstrained minimization form

min_x {f(x) + h(x)},    (6.9.53)

where f(x) is continuously differentiable on C and h(x) is nonsmooth on C. From the first-order optimality condition, we have

x* ∈ arg min_{x∈C} {f(x) + h(x)}  ⇔  0 ∈ ∇f(x*) + ∂h(x*),    (6.9.54)

295

where ∇f (x) is the gradient of the continuously differentiable function f and ∂h(x) is the subdifferential of the non-smoothing function h. Especially, when f (x) = 12 x − z22 , we have ∇f (x) = x − z, and thus the above first-order condition becomes 0 ∈ (x − z) + ∂h(x) ⇔ z ∈ x + ∂h(x).

(6.9.55)

If taking ∂h(x) = −∇f (x) in Eq. (6.9.54), then Eq. (6.9.55) gives the following result: z = x − ∇f (x).

(6.9.56)

This is just the gradient projection of x onto the convex set C. Then, the basic gradient projection updating formula is given by xk+1 = xk − μk ∇f (xk ),

(6.9.57)

where μk is the step length at the kth update. The basic gradient projection approach for solving (6.9.49) consists of the following two steps [92]. • First, choose some scalar parameter αk > 0 and set wk = (zk − αk ∇F (zk ))+ .

(6.9.58)

• Then, choose the second scalar λk ∈ [0, 1] and set zk+1 = zk + λk (wk − zk ).

(6.9.59)

In the basic approach, each iterate z_k is moved along the negative gradient −∇F(z_k), projected onto the nonnegative orthant, and a backtracking line search is performed until a sufficient decrease in F is attained [92]. Figueiredo et al. [92] proposed two gradient projection algorithms for sparse reconstruction: a basic gradient projection (GPSR-Basic) algorithm and a Barzilai–Borwein gradient projection (GPSR-BB) algorithm. The GPSR-BB algorithm, shown as Algorithm 6.15, is based on the Barzilai–Borwein (BB) method, using H_k = η^(k) I as an approximation of the Hessian matrix (MATLAB implementations are available at http://www.lx.it.pt/~mtf/GPSR).
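The following is a minimal sketch (illustrative; not the authors' MATLAB code) of the basic gradient projection steps (6.9.58)–(6.9.59) applied to the BCQP (6.9.49) built from (6.9.50)–(6.9.51), with a fixed step size α_k and λ_k = 1 assumed instead of a line search.

```python
# Minimal gradient-projection sketch for the BCQP formulation of (6.9.44).
import numpy as np

def gpsr_basic(A, y, tau, alpha=0.05, n_iter=1000):
    k, n = A.shape
    b = A.T @ y
    c = tau * np.ones(2 * n) + np.concatenate([-b, b])          # (6.9.50)
    AtA = A.T @ A
    B = np.block([[AtA, -AtA], [-AtA, AtA]])                    # (6.9.51)
    z = np.zeros(2 * n)
    for _ in range(n_iter):
        grad = c + B @ z                                        # gradient of F(z)
        z = np.maximum(z - alpha * grad, 0.0)                   # project onto z >= 0
    u, v = z[:n], z[n:]
    return u - v                                                # recover x = u - v

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100)) / np.sqrt(40)
x_true = np.zeros(100); x_true[[3, 27, 81]] = [2.0, -1.5, 1.0]
y = A @ x_true + 0.01 * rng.standard_normal(40)
print(np.round(gpsr_basic(A, y, tau=0.05)[[3, 27, 81]], 2))
```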


Algorithm 6.15 GPSR-BB algorithm [92]
1. input: The data vector y ∈ R^m and the input matrix A ∈ R^{m×n}.
2. initialization: Randomly generate a nonnegative vector z^(0) ∈ R_+^{2n}.
3. Choose nonnegative parameters τ, β ∈ (0, 1), μ ∈ (0, 1/2), and α_min, α_max, α_0 ∈ [α_min, α_max].
4. Compute b = A^T y, c = τ1_{2n} + [−b; b], and B = [A^T A, −A^T A; −A^T A, A^T A].
5. Set k = 0.
6. repeat
7.   Calculate the gradient ∇F(z^(k)) = c + Bz^(k).
8.   Compute the step: δ^(k) = (z^(k) − α^(k)∇F(z^(k)))_+ − z^(k).
9.   (line search): Find the scalar λ^(k) ∈ [0, 1] that minimizes F(z^(k) + λ^(k)δ^(k)).
10.  Set z^(k+1) = z^(k) + λ^(k)δ^(k).
11.  (update α): compute γ^(k) = (δ^(k))^T Bδ^(k); if γ^(k) = 0 then let α^(k+1) = α_max, otherwise α^(k+1) = mid{α_min, ‖δ^(k)‖_2²/γ^(k), α_max}.
12.  If ‖z^(k) − (z^(k) − ᾱ∇F(z^(k)))_+‖ ≤ tolP, where tolP is a small parameter and ᾱ is a positive constant, then terminate; otherwise set k ← k + 1 and return to Step 6.
13. output: z^(k+1).

6.10 Supervised Learning Classification

Linear classification is a useful tool in machine learning and data mining. Unlike nonlinear classifiers (such as kernel methods), which map data to a higher-dimensional space, linear classifiers work directly on the data in the original input space. For data in a rich, high-dimensional feature space, the performance (i.e., testing accuracy) of linear classifiers has been shown to be close to that of nonlinear classifiers [293].

6.10.1 Binary Linear Classifiers

Definition 6.14 (Instance [305]) An instance x represents a specific object. The instance is often represented by a D-dimensional feature vector x = [x_1, ..., x_D]^T ∈ R^D, where each dimension is called a feature. The length D of the feature vector is known as the dimensionality of the feature vector.

It is assumed that these instances are sampled independently from an underlying distribution P(x), which is unknown to us; this is denoted by {x_i}_{i=1}^N ~ P(x) i.i.d., where i.i.d. stands for independent and identically distributed.

Definition 6.15 (Training Sample [305]) A training sample is a collection of instances {x_i}_{i=1}^N = {x_1, ..., x_N}, which acts as the input to the learning process.


Definition 6.16 (Label [305]) The desired prediction y on an instance x is known as its label.

Definition 6.17 (Supervised Learning Classification) Let X = {x_i} be the domain of instances and Y = {y_i} be the domain of labels. Let P(x, y) be an (unknown) joint probability distribution on instances and labels X × Y. Given a set of training samples {(x_i, y_i)}_{i=1}^N drawn i.i.d. from P(x, y), supervised learning classification trains a function f : X → Y in some function family F in order to make f(x) predict the true label y on future testing data x, where (x, y) ~ P(x, y) as well.

In binary classification, we are given training data {x_i, y_i} ∈ R^D × {−1, +1} for i = 1, ..., N. Linear classification methods construct the following decision function:

f(x) = ⟨w, x⟩ + b = w · x + b = w^T x + b,    (6.10.1)

where w is called the weight vector and b is an intercept, also called the bias.

To generate a decision function f(x) = w^T x, linear classification involves the following risk minimization problem [293]:

min_w {f(w) = R(w) + C Σ_{i=1}^N ξ(w^T x_i, y_i)},    (6.10.2)

where

• w is the weight vector consisting of the linear classifier parameters;
• R(w) is a regularization function that prevents overfitting of the parameters;
• ξ(w^T x_i, y_i) is a loss function measuring the discrepancy between the classifier's prediction and the true output y_i for the ith training example;
• C > 0 is a scalar constant (pre-specified by the user of the learning algorithm) that controls the balance between the regularization R(w) and the loss function ξ(w^T x_i, y_i).

The output of the linear classifier, d(w) = w^T x, is called the decision function of the classifier, since the classification decision (i.e., the label y) is made by ŷ = sign(w^T x). The aim of the regularization function R(w) is to prevent the model from overfitting the observations found in the training set. The following regularization terms are commonly applied:

R_1(w) = ‖w‖_1 = Σ_{k=1}^N |w_k|,    (6.10.3)

or

R_2(w) = (1/2)‖w‖_2² = (1/2) Σ_{k=1}^N w_k².    (6.10.4)


The regularization R_1(w) = ‖w‖_1 has the following limitations:

• It is not strictly convex, so the solution may not be unique.
• For two highly correlated features, the solution obtained by R_1 regularization may select only one of these features. Consequently, R_1 regularization may discard the group effect of variables with high correlation [307].

To overcome the above two limitations of R_1 regularization, a convex combination of the R_1 and R_2 regularizations forms the following elastic net [307]:

R_e(w) = λ‖w‖_2² + (1 − λ)‖w‖_1,    (6.10.5)

where λ ∈ (0, 1).

Popular loss functions ξ(w^T x_i, y_i) include the hinge losses (i.e., interior penalty functions)

ξ_1(w^T x_i, y_i) = max{0, 1 − y_i w^T x_i},    (6.10.6)
ξ_2(w^T x_i, y_i) = (max{0, 1 − y_i w^T x_i})²,    (6.10.7)

for linear support vector machines, and the log loss (or log penalty term)

ξ_3(w^T x_i, y_i) = log(1 + e^{−y_i w^T x_i})    (6.10.8)

for linear logistic regression [293]. Clearly, if all of the classifier's outputs w^T x_i and the corresponding labels y_i have the same sign (i.e., correct decisions are made), then y_i w^T x_i > 0 for all i = 1, ..., N, and the above loss functions evaluate to (nearly) zero; otherwise, ξ(w^T x_i, y_i) > 0 for at least one i, and the loss function penalizes the classifier.
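The following is a minimal sketch (illustrative, with assumed toy values) of evaluating the regularized risk (6.10.2) with the R_2 regularizer (6.10.4) and either the hinge loss (6.10.6) or the log loss (6.10.8).

```python
# Minimal sketch of the regularized classification risk.
import numpy as np

def hinge_loss(w, X, y):
    return np.maximum(0.0, 1.0 - y * (X @ w))        # (6.10.6), elementwise over examples

def log_loss(w, X, y):
    return np.log1p(np.exp(-y * (X @ w)))            # (6.10.8)

def regularized_risk(w, X, y, C=1.0, loss=hinge_loss):
    return 0.5 * np.dot(w, w) + C * np.sum(loss(w, X, y))   # R_2(w) + C * sum of losses

X = np.array([[1.0, 2.0], [-1.0, -0.5], [0.5, -2.0]])
y = np.array([+1, -1, -1])
w = np.array([0.3, 0.7])
print(regularized_risk(w, X, y), regularized_risk(w, X, y, loss=log_loss))
```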

6.10.2 Multiclass Linear Classifiers

In multiclass classification it is necessary to determine which of a finite set of classes the test input x belongs to. Let S = {(x_1, y_1), ..., (x_N, y_N)} be a set of N training examples, where each example x_i is drawn from a domain X ⊆ R^D and each label y_i is an integer from the set Y = {1, ..., k}. As a concrete example of a supervised learning method, k-nearest neighbor (kNN) is a simple classification algorithm, shown as Algorithm 6.16.

Being a D-dimensional feature vector, the test instance x can be viewed as a point in the D-dimensional feature space. A classifier assigns a label to each point in


Algorithm 6.16 k-nearest neighbor (kNN)
1. input: Training data (x_1, y_1), ..., (x_N, y_N), distance function d(a, b), number of neighbors k, testing instance x.
2. initialization: let the ith class's data subset be L_i = {x_1^(i), ..., x_{n_i}^(i)}, consisting of the training instances with label i, where n_1 + ··· + n_k = N.
3. Compute the centers of the k classes m^(i) = (1/n_i) Σ_{j=1}^{n_i} x_j^(i) for i = 1, ..., k.
4. Find the distances between x and the center m^(i) of the ith class, denoted d(x, m^(i)), for i = 1, ..., k.
5. Find the nearest class of x using y ← arg min_i {d(x, m^(1)), ..., d(x, m^(k))}.
6. output: the label y as the majority class of y_{i_1}, ..., y_{i_k}.

the feature space. This divides the feature space into decision regions within which points have the same label. The boundary separating these regions is called the decision boundary induced by the classifier. A multiclass classifier is a function H : X → Y that maps an instance x ∈ X to an element y of Y. Consider k-class classifiers of the form [64] HW (x) = arg max wTm x,

(6.10.9)

m=1,...,k

where W = [w₁, . . . , w_k] is a D × k weight matrix, while the value of the inner product of the mth column of W with the instance x is called the confidence or the similarity score for the mth class. By this definition, the predicted label of the new instance x is the index of the column attaining the highest similarity score with x. For the case of k = 2, linear binary classifiers predict that the label of an instance x is 1 if w₁ᵀx > 0 and −1 (or 2) if w₂ᵀx ≤ 0. In this case, the weight matrix W = [w, −w] is a D × 2 matrix.
By using the parsimonious model (6.10.9) when k ≥ 3, Crammer and Singer [64] set the label of a new input instance by choosing the index of the most similar column of W. The multiclass classifier of Crammer and Singer [64] solves the following primal optimization problem with "soft" constraints:

min_{W,ξ}  (β/2)‖W‖₂² + Σ_{i=1}^N ξᵢ    (6.10.10)
subject to  w_{yᵢ}ᵀxᵢ + δ_{yᵢ,m} − w_mᵀxᵢ ≥ 1 − ξᵢ,  ∀ i = 1, . . . , N;  m = 1, . . . , k,

where β > 0 is a regularization constant and for m = yi the above inequality constraints become ξi ≥ 0, while δp,q is equal to 1 if p = q and 0 otherwise.
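A minimal sketch of the decision rule (6.10.9) is shown below; training the weight matrix W (e.g., by solving (6.10.10)) is not shown, and the function name and shapes are assumptions for illustration.

```python
import numpy as np

def multiclass_predict(W, x):
    """Prediction rule (6.10.9): pick the class whose weight column
    w_m gives the largest similarity score w_m^T x.
    W has shape (D, k); x has shape (D,)."""
    scores = W.T @ x                   # similarity score for each of the k classes
    return int(np.argmax(scores)) + 1  # class labels are 1, ..., k
```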


By the Lagrange multiplier method, the dual optimization problem of (6.10.10) is given by

min_{W,ξ,η}  L(W, ξ, η) = (β/2) Σ_{m=1}^k ‖w_m‖₂² + Σ_{i=1}^N ξᵢ + Σ_{i=1}^N Σ_{m=1}^k η_{i,m} [ (w_m − w_{yᵢ})ᵀxᵢ − δ_{yᵢ,m} + 1 − ξᵢ ]    (6.10.11)

subject to η_{i,m} ≥ 0,  ∀ i = 1, . . . , N;  m = 1, . . . , k,

where η_{i,m} are the Lagrange multipliers.
From ∂L/∂ξᵢ = 0 and ∂L/∂w_m = 0 one has, respectively,

Σ_{m=1}^k η_{i,m} = 1,  ∀ i = 1, . . . , N    (6.10.12)

and

w_m = β⁻¹ Σ_{i=1}^N (δ_{yᵢ,m} − η_{i,m}) xᵢ,  m = 1, . . . , k,    (6.10.13)

which can be rewritten as

w_m = β⁻¹ [ Σ_{i: yᵢ=m} (1 − η_{i,m}) xᵢ + Σ_{i: yᵢ≠m} (−η_{i,m}) xᵢ ]    (6.10.14)

for m = 1, . . . , k. Equations (6.10.12) and (6.10.14) reveal the following important properties of the linear multiclass classifier of Crammer and Singer [64]:
• Since the set of Lagrange multipliers {η_{i,1}, . . . , η_{i,k}} satisfies the constraints η_{i,1}, . . . , η_{i,k} ≥ 0 and Σ_m η_{i,m} = 1 in (6.10.12) for each pattern xᵢ, each set can be viewed as a probability distribution over the labels {1, . . . , k}. Under this probabilistic interpretation an example xᵢ is a support pattern if and only if its corresponding distribution is not concentrated on the correct label yᵢ. Therefore, the classifier is constructed using patterns whose labels are uncertain; the rest of the input patterns are ignored.
• The first sum in (6.10.14) is over all patterns that belong to the mth class. This shows that an example xᵢ labeled yᵢ = m is a support pattern only if η_{i,m} = η_{i,yᵢ} < 1.
• The second sum in (6.10.14) is over the rest of the patterns, whose labels are different from m. In this case, an example xᵢ is a support pattern only if η_{i,m} > 0.


6.11 Supervised Tensor Learning (STL)

In many disciplines, more and more problems need three or more subscripts for describing the data. Data with three or more subscripts are referred to as multi-channel data, whose natural representation is the tensor. With the development of modern applications, multi-way higher-order data models have been successfully applied across many fields, including, among others, social network analysis/Web mining (e.g., [1, 150, 238]), computer vision (e.g., [111, 242, 258]), hyperspectral imaging [47, 157, 204], medicine and neuroscience [86], multilinear image analysis for face recognition (TensorFaces) [256], epilepsy tensors [2], chemistry [36], and so on. In this section, we focus on supervised tensor learning and tensor learning for regression.

6.11.1 Tensor Algebra Basics

Tensors will be represented by calligraphic symbols, such as T, A, X, and so on. A tensor represented by an N-way array is called an Nth-order tensor, and is defined as a multilinear function on an N-dimensional Cartesian product vector space, denoted T ∈ K^{I₁×I₂×···×I_N}, where K denotes either the real field R or the complex field C, and where Iₙ is the number of entries, or the dimension, in the nth "direction" or mode. For example, for a third-order tensor T ∈ K^{4×2×5}, the first, second, and third modes have dimensions equal to 4, 2, and 5, respectively. A scalar is a zeroth-order tensor, a vector is a first-order tensor, and a matrix is a second-order tensor. An Nth-order tensor is a multilinear mapping

T : K^{I₁} × K^{I₂} × · · · × K^{I_N} → K^{I₁×I₂×···×I_N}.

(6.11.1)

Because T ∈ K^{I₁×···×I_N} can be regarded as an Nth-order matrix, tensor algebra is essentially higher-order matrix algebra. Figure 6.2 shows an example of a third-order tensor and of a fifth-order tensor modeling a face.

Fig. 6.2 Three-way array (left), with modes such as space (channel, subject, condition), time, and frequency, and a tensor modeling a face (right), with modes people, viewpoint, illumination, expression, and pixel


In linear algebra (i.e., matrix algebra in a finite-dimensional vector space) a linear operator is defined in a finite-dimensional vector space. Because a matrix expresses a second-order tensor, linear algebra can also be regarded as the algebra of second-order tensors. In multilinear algebra (i.e., higher-order tensor algebra) a multilinear operator is also defined in a finite-dimensional vector space.
A matrix A ∈ K^{m×n} is represented by its entries and the symbol [·] as A = [a_{ij}]_{i,j=1}^{m,n}. Similarly, an nth-order tensor A ∈ K^{I₁×···×I_n} is, using a dual symbol [[·]], represented as A = [[a_{i₁···i_n}]]_{i₁,...,i_n=1}^{I₁,...,I_n}, where a_{i₁···i_n} is the (i₁, . . . , i_n)th entry of the tensor. An nth-order tensor is sometimes known as an n-dimensional hypermatrix [60]. The set of all (I₁ × · · · × I_n)-dimensional tensors is denoted T(I₁, . . . , I_n).
The most common tensor is a third-order tensor, A = [[a_{ijk}]]_{i,j,k=1}^{I,J,K} ∈ K^{I×J×K}. A third-order tensor is sometimes called a three-dimensional matrix. A square third-order tensor X ∈ K^{I×I×I} is said to be cubical. In particular, a cubical tensor is called a supersymmetric tensor [60] if its entries have the following symmetry:

x_{ijk} = x_{ikj} = x_{jik} = x_{jki} = x_{kij} = x_{kji},  ∀ i, j, k = 1, . . . , I.

For a tensor A = [[a_{i₁···i_m}]] ∈ R^{N×···×N}, the line connecting a_{11···1} to a_{NN···N} is called the superdiagonal of the tensor A. A tensor A is known as the unit tensor if all its entries on the superdiagonal are equal to 1 and all other entries are equal to zero, i.e.,

a_{i₁···i_m} = δ_{i₁···i_m} = { 1, if i₁ = · · · = i_m;  0, otherwise. }    (6.11.2)

A third-order unit tensor is denoted I ∈ R^{N×N×N}.
In tensor algebra, it is convenient to view a third-order tensor as a set of vectors or matrices. Suppose that the ith entry of a vector a is denoted by aᵢ and that a matrix A = [a_{ij}] ∈ K^{I×J} has I row vectors a_{i:}, i = 1, . . . , I, and J column vectors a_{:j}, j = 1, . . . , J; the ith row vector is a_{i:} = [a_{i1}, . . . , a_{iJ}], and the jth column vector is a_{:j} = [a_{1j}, . . . , a_{Ij}]ᵀ. However, the concepts of row vectors and column vectors are no longer directly applicable to a higher-order tensor. The one-way arrays of a third-order tensor are not called row vectors and column vectors, but tensor fibers. A fiber is a one-way array with one subscript variable and the other subscripts fixed. The fibers of a third-order tensor are vertical fibers, horizontal fibers, and "depth" fibers. The vertical fibers of the third-order tensor A ∈ K^{I×J×K} are also called column fibers, denoted a_{:jk}; the horizontal fibers are also known as row fibers, denoted a_{i:k}; the depth fibers are also called tube fibers, denoted a_{ij:}.


Definition 6.18 (Tensor Vectorization) The tensor vectorization of an Nth-order tensor A = [[A_{i₁,...,i_N}]] ∈ K^{I₁×···×I_N} is denoted a = vec(A), whose entries are given by

a_l = A_{i₁,i₂,...,i_N},    l = i₁ + Σ_{n=2}^N (i_n − 1) Π_{k=1}^{n−1} I_k.    (6.11.3)

For example, a₁ = A_{1,1,...,1}, a_{I₁} = A_{I₁,1,...,1}, a_{I₁I₂} = A_{I₁,I₂,1,...,1}, and a_{I₁I₂···I_N} = A_{I₁,I₂,...,I_N}.
The operation transforming a tensor into a matrix is known as tensor matricization or tensor unfolding.

Definition 6.19 (Tensor Unfolding) Given an Nth-order tensor A ∈ K^{I₁×···×I_N}, the mappings A → A_{(n)} and A → A^{(n)} are, respectively, said to be the horizontal unfolding and longitudinal unfolding of the tensor A, where the matrix A_{(n)} ∈ K^{I_n×(I₁···I_{n−1}I_{n+1}···I_N)} is known as the (mode-n) horizontal unfolding matrix of the tensor A, and the matrix A^{(n)} ∈ K^{(I₁···I_{n−1}I_{n+1}···I_N)×I_n} is called the (mode-n) longitudinal unfolding matrix of A.

There are three horizontal unfolding methods.
1. Kiers horizontal unfolding method: Given an Nth-order tensor A ∈ K^{I₁×···×I_N}, the Kiers horizontal unfolding, presented by Kiers [143] in 2000, maps its entry a_{i₁i₂···i_N} to the (i_n, j)th entry of the matrix A_{(n)}^{Kiers} ∈ K^{I_n×(I₁···I_{n−1}I_{n+1}···I_N)}, i.e.,

A_{(n)}^{Kiers}(i_n, j) = a_{i_n,j}^{Kiers} = a_{i₁,i₂,...,i_N},    (6.11.4)

where i_n = 1, . . . , I_n and

j = Σ_{p=1}^{N−2} (i_{N+n−p} − 1) Π_{q=n+1}^{N+n−p−1} I_q + i_{n+1},    n = 1, . . . , N,    (6.11.5)

with I_{N+m} = I_m and i_{N+m} = i_m (m > 0).
2. LMV horizontal unfolding method: This method, introduced by De Lathauwer, De Moor, and Vandewalle [154] in 2000, maps the entry a_{i₁,...,i_N} of an Nth-order tensor A ∈ K^{I₁×···×I_N} to the entry a_{i_n,j}^{LMV} of the matrix A_{(n)}^{LMV} ∈ K^{I_n×(I₁···I_{n−1}I_{n+1}···I_N)}, where

j = i_{n−1} + Σ_{k=1}^{n−2} (i_k − 1) Π_{m=k+1}^{n−1} I_m + Σ_{k=n+1}^{N} (i_k − 1) ( Π_{p=1}^{n−1} I_p ) ( Π_{q=k}^{N} I_{q+1} ),    (6.11.6)

where I_q = 1 if q > N.


3. Kolda horizontal unfolding method: This unfolding, introduced by Kolda [149] in 2006, maps the entry a_{i₁,...,i_N} of an Nth-order tensor to the entry a_{i_n,j}^{Kolda} of the matrix A_{(n)}^{Kolda} ∈ K^{I_n×(I₁···I_{n−1}I_{n+1}···I_N)}, where

j = 1 + Σ_{k=1, k≠n}^{N} (i_k − 1) Π_{m=1, m≠n}^{k−1} I_m.    (6.11.7)

Similar to the above three horizontal unfolding methods, there are three longitudinal unfolding methods for A^{(n)} ∈ K^{(I₁···I_{n−1}I_{n+1}···I_N)×I_n}. The relationships between the longitudinal unfolding and the horizontal unfolding of a third-order tensor are

A^{(n)}_{Kiers} = (A_{(n)}^{Kiers})ᵀ,   A^{(n)}_{LMV} = (A_{(n)}^{LMV})ᵀ,   A^{(n)}_{Kolda} = (A_{(n)}^{Kolda})ᵀ.    (6.11.8)
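The index maps (6.11.3) and (6.11.7) correspond to column-major (Fortran-order) flattening of the tensor entries. This suggests the following minimal NumPy sketch of tensor vectorization and Kolda-style mode-n horizontal unfolding; it is an illustrative reading of the formulas under this index convention, with assumed function names, not the book's own code.

```python
import numpy as np

def tensor_vec(A):
    # Eq. (6.11.3): l = i1 + sum_n (i_n - 1) * prod_{k<n} I_k
    # corresponds to column-major (Fortran-order) flattening.
    return A.flatten(order="F")

def kolda_unfold(A, n):
    # Eq. (6.11.7): mode-n horizontal unfolding of size I_n x (prod of other I_k);
    # move mode n to the front, then flatten the remaining modes in Fortran order.
    return np.reshape(np.moveaxis(A, n, 0), (A.shape[n], -1), order="F")

# Small usage example
A = np.arange(24).reshape(2, 3, 4)
print(tensor_vec(A).shape)       # (24,)
print(kolda_unfold(A, 1).shape)  # (3, 8)
```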

Definition 6.20 (Tensor Inner Product) For two tensors A, B ∈ K^{I₁×···×I_N}, their tensor inner product ⟨A, B⟩ is a scalar defined as the inner product of the column vectorizations of the two tensors, i.e.,

⟨A, B⟩ ≝ ⟨vec(A), vec(B)⟩ = (vec(A))ᴴ vec(B) = Σ_{i₁=1}^{I₁} Σ_{i₂=1}^{I₂} · · · Σ_{i_N=1}^{I_N} a*_{i₁i₂···i_N} b_{i₁i₂···i_N}.    (6.11.9)

The tensor inner product is also called the tensor dot product, denoted A · B, namely A · B = ⟨A, B⟩.

Definition 6.21 (Tensor Outer Product) Given two tensors A ∈ K^{I₁×···×I_P} and B ∈ K^{J₁×···×J_Q}, their tensor outer product is denoted A ∘ B ∈ K^{I₁×···×I_P×J₁×···×J_Q} and is defined as

(A ∘ B)_{i₁···i_P j₁···j_Q} = a_{i₁···i_P} b_{j₁···j_Q},  ∀ i₁, . . . , i_P, j₁, . . . , j_Q.    (6.11.10)

Definition 6.22 (Tensor Frobenius Norm) The Frobenius norm of a tensor A is defined as

‖A‖_F ≝ √⟨A, A⟩ = ( Σ_{i₁=1}^{I₁} Σ_{i₂=1}^{I₂} · · · Σ_{i_N=1}^{I_N} |a_{i₁i₂···i_N}|² )^{1/2}.    (6.11.11)


Definition 6.23 (Tucker Operator) Given an Nth-order tensor G ∈ K^{J₁×J₂×···×J_N} and matrices U⁽ⁿ⁾ ∈ K^{I_n×J_n}, where n ∈ {1, . . . , N}, the Tucker operator is defined as [149]:

[[G; U⁽¹⁾, U⁽²⁾, . . . , U⁽ᴺ⁾]] ≝ G ×₁ U⁽¹⁾ ×₂ U⁽²⁾ ×₃ · · · ×_N U⁽ᴺ⁾.    (6.11.12)

The result is an Nth-order I₁ × I₂ × · · · × I_N tensor.
The mode-n product of a tensor X ∈ R^{I₁×···×I_N} with a matrix U ∈ R^{J×I_n}, denoted X ×ₙ U, is a tensor of size I₁ × · · · × I_{n−1} × J × I_{n+1} × · · · × I_N whose elements are defined as

(X ×ₙ U)_{i₁,...,i_{n−1},j,i_{n+1},...,i_N} = Σ_{i_n=1}^{I_n} x_{i₁,...,i_N} u_{j i_n}.    (6.11.13)

The equivalent unfolded expression is

Y = X ×ₙ U  ⇔  Y_(n) = U X_(n).    (6.11.14)
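Equation (6.11.14) gives a direct recipe for computing the mode-n product: unfold, multiply, and fold back. A minimal NumPy sketch under the Kolda unfolding convention used above is given below; the helper name is illustrative.

```python
import numpy as np

def mode_n_product(X, U, n):
    """Mode-n product X x_n U via Eq. (6.11.14): Y_(n) = U @ X_(n)."""
    Xn = np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1), order="F")  # unfold
    Yn = U @ Xn                                                          # multiply
    new_shape = (U.shape[0],) + X.shape[:n] + X.shape[n+1:]
    return np.moveaxis(np.reshape(Yn, new_shape, order="F"), 0, n)       # fold back

X = np.random.randn(2, 3, 4)
U = np.random.randn(5, 3)
print(mode_n_product(X, U, 1).shape)  # (2, 5, 4)
```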

In particular, for a third-order tensor X ∈ K^{I₁×I₂×I₃} and matrices A ∈ K^{J₁×I₁}, B ∈ K^{J₂×I₂}, C ∈ K^{J₃×I₃}, the Tucker mode-1 product X ×₁ A, the Tucker mode-2 product X ×₂ B, and the Tucker mode-3 product X ×₃ C are, respectively, defined as [155]:

(X ×₁ A)_{j₁i₂i₃} = Σ_{i₁=1}^{I₁} x_{i₁i₂i₃} a_{j₁i₁},  ∀ j₁, i₂, i₃,    (6.11.15)
(X ×₂ B)_{i₁j₂i₃} = Σ_{i₂=1}^{I₂} x_{i₁i₂i₃} b_{j₂i₂},  ∀ i₁, j₂, i₃,    (6.11.16)
(X ×₃ C)_{i₁i₂j₃} = Σ_{i₃=1}^{I₃} x_{i₁i₂i₃} c_{j₃i₃},  ∀ i₁, i₂, j₃.    (6.11.17)

Decompositions of a third-order tensor have two basic forms.
1. Tucker decomposition:

X = G ×₁ A ×₂ B ×₃ C  ⇔  x_{ijk} = Σ_{p=1}^P Σ_{q=1}^Q Σ_{r=1}^R g_{pqr} a_{ip} b_{jq} c_{kr}.    (6.11.18)

Here, G is called the core tensor of the tensor decomposition.


2. Canonical decomposition (CANDECOMP)/parallel factors decomposition (PARAFAC), simply known as the CP decomposition:

X = I ×₁ A ×₂ B ×₃ C  ⇔  x_{ijk} = Σ_{p=1}^P Σ_{q=1}^Q Σ_{r=1}^R δ_{pqr} a_{ip} b_{jq} c_{kr} = Σ_{r=1}^R a_{ir} b_{jr} c_{kr},    (6.11.19)

where I is a unit tensor.
It can be easily seen that there are the following differences between the Tucker and CP decompositions:
• In the Tucker decomposition, the entry g_{pqr} of the core tensor G shows that there are interactions involving the entry a_{ip} of the mode-A vector aᵢ = [a_{i1}, . . . , a_{iP}]ᵀ, the entry b_{jq} of the mode-B vector bⱼ = [b_{j1}, . . . , b_{jQ}]ᵀ, and the entry c_{kr} of the mode-C vector c_k = [c_{k1}, . . . , c_{kR}]ᵀ.
• In the CP decomposition, the core tensor is a unit tensor. Because the entries on the superdiagonal p = q = r ∈ {1, . . . , R} are equal to 1 and all the other entries are zero, there are interactions only between the rth factor a_{ir} of the mode-A vector aᵢ, the rth factor b_{jr} of the mode-B vector bⱼ, and the rth factor c_{kr} of the mode-C vector c_k. This means that the mode-A, mode-B, and mode-C vectors have the same number R of factors, i.e., all modes extract the same number of factors. Therefore, the CP decomposition can be understood as a canonical Tucker decomposition.
The generalization of the Tucker decomposition to higher-order tensors is called the higher-order singular value decomposition.

Theorem 6.6 (Higher-Order SVD [154]) Every I₁ × I₂ × · · · × I_N real tensor X can be decomposed into the mode-n product

X = G ×₁ U⁽¹⁾ ×₂ U⁽²⁾ ×₃ · · · ×_N U⁽ᴺ⁾ = [[G; U⁽¹⁾, U⁽²⁾, . . . , U⁽ᴺ⁾]]    (6.11.20)

with entries

x_{i₁i₂···i_N} = Σ_{j₁=1}^{J₁} Σ_{j₂=1}^{J₂} · · · Σ_{j_N=1}^{J_N} g_{j₁j₂···j_N} u⁽¹⁾_{i₁j₁} u⁽²⁾_{i₂j₂} · · · u⁽ᴺ⁾_{i_Nj_N},    (6.11.21)

where U⁽ⁿ⁾ = [u₁⁽ⁿ⁾, . . . , u_{J_n}⁽ⁿ⁾] is an I_n × J_n semi-orthogonal matrix, (U⁽ⁿ⁾)ᵀU⁽ⁿ⁾ = I_{J_n} with J_n ≤ I_n, the core tensor G is a J₁ × J₂ × · · · × J_N tensor, and the subtensor G_{j_n=α} is the tensor G with the index of the nth mode fixed to j_n = α. The subtensors have the following properties.


• All-orthogonality: two subtensors G_{j_n=α} and G_{j_n=β}, for α ≠ β, are orthogonal:

⟨G_{j_n=α}, G_{j_n=β}⟩ = 0,  ∀ α ≠ β, n = 1, . . . , N.    (6.11.22)

• Ordering:

‖G_{j_n=1}‖_F ≥ ‖G_{j_n=2}‖_F ≥ · · · ≥ ‖G_{j_n=J_n}‖_F.    (6.11.23)

The mode-n products are simply denoted as

X ×₁ U⁽¹⁾ ×₂ U⁽²⁾ ×₃ · · · ×_N U⁽ᴺ⁾ = X Π_{n=1}^N ×ₙ U⁽ⁿ⁾.    (6.11.24)

Note that the typical CP decomposition factorizes an Nth-order tensor X ∈ R^{I₁×I₂×···×I_N} into a linear combination of R rank-one tensors:

X ≈ Σ_{r=1}^R u_r⁽¹⁾ ∘ u_r⁽²⁾ ∘ · · · ∘ u_r⁽ᴺ⁾ = [[U⁽¹⁾, U⁽²⁾, . . . , U⁽ᴺ⁾]].    (6.11.25)

This operation is called the rank-one decomposition of a tensor. Here, the operator "∘" denotes the outer product of the vectors of the factor matrices U⁽ⁿ⁾ = [u₁⁽ⁿ⁾, . . . , u_R⁽ⁿ⁾] ∈ R^{I_n×R}, n = 1, . . . , N. For more on tensor analysis, readers may refer to [294].
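As an illustration of (6.11.25), the following minimal NumPy sketch rebuilds a third-order tensor from CP factor matrices by summing R rank-one outer products; the function name and shapes are assumptions for the example only.

```python
import numpy as np

def cp_reconstruct(U1, U2, U3):
    """Rebuild X from CP factors per Eq. (6.11.25):
    X = sum_r u_r^(1) o u_r^(2) o u_r^(3).
    U1, U2, U3 have shapes (I1, R), (I2, R), (I3, R)."""
    R = U1.shape[1]
    X = np.zeros((U1.shape[0], U2.shape[0], U3.shape[0]))
    for r in range(R):
        # outer product of the r-th columns of the three factor matrices
        X += np.einsum("i,j,k->ijk", U1[:, r], U2[:, r], U3[:, r])
    return X
```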

6.11.2 Supervised Tensor Learning Problems

Given N training tensor–scalar pairs (Xᵢ, yᵢ), i = 1, . . . , N, where the tensor Xᵢ ∈ R^{I₁×···×I_M}, supervised tensor learning trains a parameter/weight tensor W with rank-one decomposition W = w₁ ∘ w₂ ∘ · · · ∘ w_M such that

(w₁, . . . , w_M; b, x) = arg min_{w₁,...,w_M; b, x}  f(w₁, . . . , w_M; b, x)
subject to  yᵢ cᵢ( Xᵢ Π_{k=1}^M ×_k w_k + b ) ≥ ξᵢ,  i = 1, . . . , N.    (6.11.26)

If yi ∈ {+1, −1}, then the weight tensor W is called the tensor classifier and if yi ∈ R, then W is known as the tensor predictor.
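To make the role of the rank-one weight tensor concrete, the following minimal sketch (names assumed for illustration) evaluates the decision value Xᵢ Π_k ×_k w_k + b for a multi-way input; the weight tensor W = w₁ ∘ · · · ∘ w_M is never formed explicitly.

```python
import numpy as np

def stl_decision_value(X, ws, b):
    """Compute X x_1 w_1 x_2 w_2 ... x_M w_M + b for a rank-one weight tensor.
    X is an M-way array; ws is a list of M weight vectors with ws[k].shape == (X.shape[k],)."""
    z = X
    for w in ws:                                 # contract one mode at a time
        z = np.tensordot(z, w, axes=([0], [0]))
    return float(z) + b

X = np.random.randn(3, 4, 5)
ws = [np.random.randn(3), np.random.randn(4), np.random.randn(5)]
print(stl_decision_value(X, ws, b=0.1))
```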


The Lagrangian for the supervised tensor learning problem (6.11.26) is given by

L(w₁, . . . , w_M; b, x) = f(w₁, . . . , w_M; b, x) − Σ_{i=1}^N λᵢ [ yᵢ cᵢ( Xᵢ Π_{k=1}^M ×_k w_k + b ) − ξᵢ ]
 = f(w₁, . . . , w_M; b, x) − Σ_{i=1}^N λᵢ yᵢ cᵢ( Xᵢ Π_{k=1}^M ×_k w_k + b ) + λᵀx,    (6.11.27)

where λ = [λ₁, . . . , λ_N]ᵀ ≥ 0 is a nonnegative Lagrange multiplier vector and x = [ξ₁, . . . , ξ_N]ᵀ is a slack variable vector.
The partial derivative of L(w₁, . . . , w_M; b, x) with respect to wⱼ is given by

∂L/∂wⱼ = ∂f/∂wⱼ − Σ_{i=1}^N λᵢ yᵢ ∂/∂wⱼ [ cᵢ( Xᵢ Π_{k=1}^M ×_k w_k + b ) ]
       = ∂f/∂wⱼ − Σ_{i=1}^N λᵢ yᵢ (dcᵢ/dz) (Xᵢ ×̄ⱼ wⱼ),    (6.11.28)

where z = Xᵢ Π_{k=1}^M ×_k w_k + b and

Xᵢ ×̄ⱼ wⱼ = Xᵢ ×₁ w₁ · · · ×_{j−1} w_{j−1} ×_{j+1} w_{j+1} · · · ×_M w_M.    (6.11.29)

Similarly, the partial derivative of L(w₁, . . . , w_M; b, x) with respect to the bias b is

∂L/∂b = ∂f/∂b − Σ_{i=1}^N λᵢ yᵢ ∂/∂b [ cᵢ( Xᵢ Π_{k=1}^M ×_k w_k + b ) ]
      = ∂f/∂b − Σ_{i=1}^N λᵢ yᵢ (dcᵢ/dz)(∂z/∂b)
      = ∂f/∂b − Σ_{i=1}^N λᵢ yᵢ dcᵢ/dz.    (6.11.30)

Setting ∂L/∂wⱼ = 0 and ∂L/∂b = 0, it follows from (6.11.28) and (6.11.30) that

∂f/∂wⱼ = Σ_{i=1}^N λᵢ yᵢ (dcᵢ/dz)(Xᵢ ×̄ⱼ wⱼ),    (6.11.31)
∂f/∂b = Σ_{i=1}^N λᵢ yᵢ dcᵢ/dz.    (6.11.32)

Hence, one has the following update formulae [242]:

wⱼ,ₜ ← wⱼ,ₜ₋₁ − η₁,ₜ Σ_{i=1}^N λᵢ yᵢ (dcᵢ/dz)(Xᵢ ×̄ⱼ wⱼ,ₜ₋₁),    (6.11.33)
bₜ ← bₜ₋₁ − η₂,ₜ Σ_{i=1}^N λᵢ yᵢ dcᵢ/dz.    (6.11.34)

Algorithm 6.17 shows an alternating projection algorithm for supervised tensor learning [242].

Algorithm 6.17 Alternating projection algorithm for supervised tensor learning [242]
1. input: Training data (Xᵢ, yᵢ), where Xᵢ ∈ R^{I₁×···×I_M} and yᵢ ∈ R for i = 1, . . . , N.
2. initialization: Set w_k equal to a random unit vector in R^{I_k} for k = 1, . . . , M; set t ← 1.
3. repeat
4.  for j = 1 to M
5.    wⱼ⁽ᵗ⁾ ← wⱼ⁽ᵗ⁻¹⁾ − η₁ Σ_{i=1}^N λᵢ yᵢ (dcᵢ/dz)(Xᵢ ×̄ⱼ wⱼ⁽ᵗ⁻¹⁾)
6.    b⁽ᵗ⁾ ← b⁽ᵗ⁻¹⁾ − η₂ Σ_{i=1}^N λᵢ yᵢ dcᵢ/dz
7.  end for
8.  if Σ_{k=1}^M ( |w_{k,t}ᵀ w_{k,t−1}| / ‖w_{k,t}‖_F² − 1 ) ≤ ε, then go to Step 11.
9.  t ← t + 1
10. return (to Step 3)
11. output: the parameters of the classification tensorplane, w₁, . . . , w_M and b.

6.11.3 Tensor Fisher Discriminant Analysis

Fisher discriminant analysis (FDA) [94] is a widely applied method for classification. Suppose there are N training data (xᵢ ∈ R^I, 1 ≤ i ≤ N) associated with the class labels yᵢ ∈ {+1, −1}. Let N₊ and N₋ be, respectively, the numbers of the positive training measurements (xᵢ, yᵢ = +1) and the negative training measurements (xᵢ, yᵢ = −1). Denote

1(yᵢ = +1) = { 1, if yᵢ = +1;  0, otherwise; }    (6.11.35)


1(yᵢ = −1) = { 1, if yᵢ = −1;  0, otherwise. }    (6.11.36)

For the N₊ positive training measurements (xᵢ, yᵢ = +1), the mean vector is m₊ = (1/N₊) Σ_{i=1}^N 1(yᵢ = +1)xᵢ; for the N₋ negative training measurements (xᵢ, yᵢ = −1), the mean vector is m₋ = (1/N₋) Σ_{i=1}^N 1(yᵢ = −1)xᵢ; while the mean vector and the covariance matrix of all training measurements are m = (1/N) Σ_{i=1}^N xᵢ and Σ = (1/N) Σ_{i=1}^N (xᵢ − m)(xᵢ − m)ᵀ.

N 

(6.11.37)

(xi − m)(xi − m)T = N .

(6.11.38)

i=1

The FDA criterion is to design the classifier w such that  , wT Sb w , w = arg max JFDA = T w Sw w w

(6.11.39)

or equivalently  , m+ − m−  w = arg max JFDA = √ . w wT w

(6.11.40)

The tensor extension of FDA, called tensor Fisher discriminant analysis (TFDA), was proposed by Tao et al. [242], and is a combination of Fisher discriminant analysis and supervised tensor learning. Given the training tensor measurements Xi ∈ RI1 ×···×IM (1 = 1, . . . , N ) and their corresponding class labels yi ∈ {+1, −1}. The mean tensor of the training positive measurements is M+ = (1/N+ ) N mean tensor i=1 11(yi = +1)Xi ; the of the training negative measurements is given by M− = (1/N− ) N 11(yi =  i=1 −1)Xi ; the mean tensor of all training measurements is M = (1/N) N X i=1 i . The TFDA criterion is to design a tensor classifier W such that ⎧ -2 ⎫ 5M - ⎬ ⎪ ⎨ -(M+ − M− ) k=1 ×k wk - ⎪ 2 (w1 , . . . , wM ) = arg max JTFDA = -2 . N 5M - ⎪ w1 ,...,wM ⎪ ⎩ ×k wk - ⎭ -(Xi − M) i=1

k=1

2

(6.11.41) On more applications of supervised tensor learning, see [242].

6.11 Supervised Tensor Learning (STL)

311

6.11.4 Tensor Learning for Regression I1 ×···×IM is an M-mode Given a set of labeled training set {Xi , yi }N i=1 , where Xi ∈ R tensor and yi are the associated scalar targets. A classic linear predictor in the vector space, y = x, w + b, can be extended from the vector space to the tensor space as

yˆi = Xi , W + b,

(6.11.42)

where W ∈ R^{I₁×···×I_M} is the weight tensor and the scalar b is the bias. Tensor regression seeks to design a weight tensor W that gives the regression output ŷᵢ = ⟨Xᵢ, W⟩ + b. To this end, let the weight tensor W be a sum of R rank-one tensors:

W = Σ_{r=1}^R u_r⁽¹⁾ ∘ u_r⁽²⁾ ∘ · · · ∘ u_r⁽ᴹ⁾ ≜ [[U⁽¹⁾, U⁽²⁾, . . . , U⁽ᴹ⁾]],    (6.11.43)

where U⁽ᵐ⁾ = [u₁⁽ᵐ⁾, . . . , u_R⁽ᵐ⁾], m = 1, . . . , M. Then the tensor regression can be rewritten as

ŷᵢ = ⟨Xᵢ, W⟩ + b = ⟨Xᵢ, [[U⁽¹⁾, U⁽²⁾, . . . , U⁽ᴹ⁾]]⟩ + b.

(6.11.44)

Therefore, the higher-rank tensor ridge regression (hrTRR) minimizes the loss function [111]:

L(U⁽¹⁾, . . . , U⁽ᴹ⁾; b) = (1/2) Σ_{i=1}^N ( yᵢ − ⟨Xᵢ, [[U⁽¹⁾, . . . , U⁽ᴹ⁾]]⟩ − b )² + (λ/2) ‖[[U⁽¹⁾, . . . , U⁽ᴹ⁾]]‖_F².    (6.11.45)

The closed-form solution of the optimization problem arg min_{U⁽¹⁾,...,U⁽ᴹ⁾; b} L(U⁽¹⁾, . . . , U⁽ᴹ⁾; b) can be derived; see [111].

Here cᵢ₊₁ > cᵢ means that every cluster in cᵢ₊₁ is the merging (or union) of clusters in cᵢ. This general arrangement is referred to as a hierarchical clustering scheme (HCS) [132].
The steps of the hierarchical clustering methodology are given below [185]:
1. Define decision variables, feasible set, and objective functions.
2. Choose and apply a Pareto optimization algorithm, e.g., NSGA-II.
3. Clustering analysis:
  • Clustering tendency: By visual inspection or data projections, verify that a hierarchical cluster structure is a reasonable model for the data.
  • Data scaling: Remove implicit variable weightings due to relative scales using range scaling.
  • Proximity: Select and apply an appropriate similarity measure for the data, here, Euclidean distance.
  • Choice of algorithm(s): Consider the assumptions and characteristics of clustering algorithms and select the most suitable algorithm for the application, here, group average linkage.
  • Application of algorithm: Apply the selected algorithm and obtain a dendrogram.
  • Validation: Examine the results based on application subject matter knowledge, assess the fit to the input data and the stability of the cluster structure, and compare the results of multiple algorithms, if used.
4. Represent and use the clusters and structure: if the clustering is reasonable and valid, then examine the divisions in the hierarchy for trade-offs and other information to aid decision making.

Example 6.4 Six objects have the following similarity values:

0.04 : (3, 6)    (6.12.20)
0.08 : (3, 5), (5, 6)    (6.12.21)
0.22 : (1, 3), (1, 5), (1, 6), (2, 4)    (6.12.22)
0.35 : (1, 2), (1, 4)    (6.12.23)
> 0.35 : (2, 3), (2, 5), (2, 6), (3, 4), (4, 5), (4, 6).    (6.12.24)

The hierarchical clustering results are

0.00 : [1], [3], [5], [6], [4], [2]    (6.12.25)
0.04 : [1], [3, 6], [2], [4], [5]    (6.12.26)
0.08 : [3, 5, 6], [1], [2], [4]    (6.12.27)
0.22 : [1, 3, 5, 6], [2, 4]    (6.12.28)
0.35 : [1, 2, 3, 4, 5, 6].    (6.12.29)


Table 6.1 Similarity values and clustering results (object order: 1, 3, 6, 5, 2, 4)

Value   Clusters of objects
0.00    [1] [3] [6] [5] [2] [4]
0.04    [1] [3 6] [5] [2] [4]
0.08    [1] [3 6 5] [2] [4]
0.22    [1 3 6 5] [2 4]
0.35    [1 3 6 5 2 4]

Table 6.1 shows the similarity values and the corresponding hierarchical clustering results.
In hierarchical clustering, similarity measures have the following properties [132]:
• D(x, y) = 0 if and only if x = y.
• D(x, y) = D(y, x) for all objects x and y.
• The ultrametric inequality:
  D(x, z) ≤ max{D(x, y), D(y, z)}.    (6.12.30)
• The triangle inequality:
  D(x, z) ≤ D(x, y) + D(y, z).    (6.12.31)

Due to the symmetry D(x, y) = D(y, x), for N objects the hierarchical clustering scheme method needs to compute only N(N − 1)/2 dissimilarities. The (i, j)th entry of the similarity (or distance) matrix S is defined by the similarity between the ith and jth objects as follows:

s_{ij} ≝ D(xᵢ, xⱼ).    (6.12.32)

It should be noted that if all objects are clustered into one class at the mth hierarchical clustering with similarity α⁽ᵐ⁾, then the (i, j)th entries of the similarity matrix are given by

s_{ij} = { s_{ij},  if s_{ij} ≤ α⁽ᵐ⁾;   α⁽ᵐ⁾,  if s_{ij} > α⁽ᵐ⁾. }    (6.12.33)

In hierarchical clustering, there are different distance matrices corresponding to different clusterings. Table 6.2 gives the distance matrix before clustering in Example 6.4. The distance matrix after the first clustering [1], [2], [3, 6], [4], [5] is given in Table 6.3.

Table 6.2 Distance matrix in Example 6.4

d_{ij}   1      2      3      4      5      6
1        0.00   0.35   0.22   0.35   0.22   0.22
2        0.35   0.00   0.35   0.22   0.35   0.35
3        0.22   0.35   0.00   0.35   0.08   0.04
4        0.35   0.22   0.35   0.00   0.35   0.35
5        0.22   0.35   0.08   0.35   0.00   0.08
6        0.22   0.35   0.04   0.35   0.08   0.00

Table 6.3 Distance matrix after the first clustering in Example 6.4

d_{ij}    1      2      [3, 6]   4      5
1         0.00   0.35   0.22     0.35   0.22
2         0.35   0.00   0.35     0.22   0.35
[3, 6]    0.22   0.35   0.00     0.35   0.08
4         0.35   0.22   0.35     0.00   0.35
5         0.22   0.35   0.08     0.35   0.00

Table 6.4 Distance matrix after the second clustering in Example 6.4

d_{ij}       1      2      [3, 5, 6]   4
1            0.00   0.35   0.22        0.35
2            0.35   0.00   0.35        0.22
[3, 5, 6]    0.22   0.35   0.00        0.35
4            0.35   0.22   0.35        0.00

Table 6.4 shows the distance matrix after the second clustering [1], [2], [3, 5, 6], [4] in Example 6.4. Mimicking the above processes, we can find the distance matrices after the third clustering, and so on.
Given N objects and a similarity metric d = D(x, y) satisfying the ultrametric inequality (6.12.30), the minimum method for hierarchical clustering is as follows [132].
1. Make the weak clustering C₀ with d = 0.
2. For the given clustering Cᵢ₋₁, with the distance matrix defined for all objects or clusters in Cᵢ₋₁, let αᵢ be the smallest nonzero entry in the distance matrix. Merge the pair of objects and/or clusters with distance αᵢ to create Cᵢ of value αᵢ.
3. Create a new similarity function for Cᵢ in the following manner: if x and y are clustered in Cᵢ and not in Cᵢ₋₁ (i.e., d(x, y) = αᵢ), then define the distance from the cluster [x, y] to any third object or cluster z by d([x, y], z) = min{d(x, z), d(y, z)}. If x and y are objects and/or clusters in Cᵢ₋₁ not clustered in Cᵢ, then d(x, y) remains the same. Thus, we obtain a new similarity function d for Cᵢ.
4. Repeat Steps 2 and 3 above until the strong clustering is finally obtained.


If d([x, y], z) = min{d(x, z), d(y, z)} in Step 3 above is replaced by d([x, y], z) = max{d(x, z), d(y, z)}, the resulting method is called the maximum method for hierarchical clustering.
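A minimal sketch of the minimum (single-linkage) method on a precomputed distance matrix is given below, using a naive merge loop and illustrative names; fed the distances of Table 6.2, it reproduces the merge order of Example 6.4.

```python
import numpy as np

def minimum_method(D):
    """Single-linkage (minimum method) hierarchical clustering.
    D is a symmetric N x N distance matrix; returns the merge level and the
    clusters produced at each step."""
    clusters = [[i] for i in range(D.shape[0])]
    D = D.astype(float).copy()
    history = []
    while len(clusters) > 1:
        n = len(clusters)
        # find the closest pair of current clusters
        i, j = min(((a, b) for a in range(n) for b in range(a + 1, n)),
                   key=lambda ab: D[ab[0], ab[1]])
        alpha = D[i, j]
        # distances to the merged cluster are the minimum of the old ones (Step 3)
        new_row = np.minimum(D[i], D[j])
        D = np.delete(np.delete(D, (i, j), axis=0), (i, j), axis=1)
        rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters = rest + [clusters[i] + clusters[j]]
        new_d = np.concatenate([np.delete(new_row, (i, j)), [0.0]])
        D = np.pad(D, ((0, 1), (0, 1)))
        D[-1, :], D[:, -1] = new_d, new_d
        history.append((alpha, [list(c) for c in clusters]))
    return history
```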

6.12.3 Fisher Discriminant Analysis (FDA)

In Sect. 6.11.3 we presented Fisher discriminant analysis (FDA) for supervised binary classification and its extension to tensor data. In this subsection, we deal with FDA for unsupervised multiclass clustering.
The divergence is a measure of the distance or dissimilarity between two signals and is often used in feature discrimination and in evaluating the effectiveness of class discrimination. Let Q denote the common dimension of the signal-feature vectors extracted by various methods. Assume that there are c classes of signals; the Fisher class separability measure, called simply the Fisher measure, between the ith and the jth classes is used to determine the ranking of feature vectors. Consider the class discrimination of c classes of signals. Let v_k⁽ˡ⁾ denote the kth sample feature vector of the lth class of signals, where l = 1, . . . , c and k = 1, . . . , l_K, where l_K is the number of sample feature vectors in the lth signal class. Under the assumption that the prior probabilities of the random vectors v⁽ˡ⁾ = v_k⁽ˡ⁾ are the same for k = 1, . . . , l_K (i.e., equal probabilities), the Fisher measure is defined as

m(i, j) = Σ_{l=i,j} ( mean_k(v_k⁽ˡ⁾) − mean_l(mean_k(v_k⁽ˡ⁾)) )² / Σ_{l=i,j} var_k(v_k⁽ˡ⁾),    (6.12.34)

where
• mean_k(v_k⁽ˡ⁾) is the mean (centroid) of all signal-feature vectors of the lth class;
• var_k(v_k⁽ˡ⁾) is the variance of all signal-feature vectors of the lth class;
• mean_l(mean_k(v_k⁽ˡ⁾)) is the total sample centroid of all classes.
As an extension of the Fisher measure, consider the projection of all Q × 1 feature vectors onto the (c − 1)-dimensional class discrimination space. Put N = N₁ + · · · + N_c, where Nᵢ represents the number of feature vectors of the ith class signal extracted in the training phase. Suppose that

s_{i,k} = [s_{i,k}(1), . . . , s_{i,k}(Q)]ᵀ


denotes the ith (Q × 1) feature vector obtained from the kth group of observation data of the ith signal in the training phase, while mᵢ = [mᵢ(1), . . . , mᵢ(Q)]ᵀ is the sample mean of the ith signal-feature vector, where

mᵢ(q) = (1/Nᵢ) Σ_{k=1}^{Nᵢ} s_{i,k}(q),  i = 1, . . . , c,  q = 1, . . . , Q.

Similarly, let m = [m(1), . . . , m(Q)]ᵀ denote the ensemble mean of all the signal-feature vectors obtained from all the observation data, where

m(q) = (1/c) Σ_{i=1}^c mᵢ(q),  q = 1, . . . , Q.

Using the above vectors, one can define a Q × Q within-class scatter matrix [83]

S_w ≝ (1/c) Σ_{i=1}^c ( (1/Nᵢ) Σ_{k=1}^{Nᵢ} (s_{i,k} − mᵢ)(s_{i,k} − mᵢ)ᵀ )    (6.12.35)

and a Q × Q between-class scatter matrix [83]

S_b ≝ (1/c) Σ_{i=1}^c (mᵢ − m)(mᵢ − m)ᵀ.    (6.12.36)

For the optimal class discrimination matrix U to be determined, define the criterion function

J(U) ≝ Π_diag(UᵀS_bU) / Π_diag(UᵀS_wU),    (6.12.37)

where Π_diag(A) denotes the product of the diagonal entries of the matrix A. As it is a measure of class discrimination ability, J should be maximized. We refer to Span(U) as the


class discrimination space if

U = arg max_{U∈R^{Q×Q}} J(U) = Π_diag(UᵀS_bU) / Π_diag(UᵀS_wU).    (6.12.38)

This optimization problem can be equivalently written as

[u₁, . . . , u_Q] = arg max_{uᵢ∈R^Q} { Π_{i=1}^Q uᵢᵀS_buᵢ / Π_{i=1}^Q uᵢᵀS_wuᵢ },    (6.12.39)

whose solution is given by

uᵢ = arg max_{uᵢ∈R^Q} uᵢᵀS_buᵢ / uᵢᵀS_wuᵢ,  i = 1, . . . , Q.    (6.12.40)

This is just the maximization of the generalized Rayleigh quotient. The above equation has a clear physical meaning: the column vectors of the matrix U constituting the optimal class discrimination subspace should maximize the between-class scatter and minimize the within-class scatter; namely, these vectors should maximize the generalized Rayleigh quotient.
For c classes of signals, the optimal class discrimination subspace is (c − 1)-dimensional. Therefore Eq. (6.12.40) is maximized only for c − 1 generalized Rayleigh quotients. In other words, it is necessary to solve the generalized eigenvalue problem

S_b uᵢ = λᵢ S_w uᵢ,  i = 1, 2, . . . , c − 1,    (6.12.41)

for the c − 1 generalized eigenvectors u₁, . . . , u_{c−1}. These generalized eigenvectors constitute the Q × (c − 1) matrix

U_{c−1} = [u₁, . . . , u_{c−1}],    (6.12.42)

whose columns span the optimal class discrimination subspace. After obtaining the Q × (c − 1) matrix U_{c−1}, we can find the projection onto the optimal class discrimination subspace,

y_{i,k} = U_{c−1}ᵀ s_{i,k},  i = 1, . . . , c,  k = 1, . . . , Nᵢ,    (6.12.43)

for every signal-feature vector obtained in the training phase.


When there are only three classes of signals (c = 3), the optimal class discrimination subspace is a plane, and the projection of every feature vector onto the optimal class discrimination subspace is a point. These projection points directly reflect the discrimination ability of different feature vectors in signal classification.
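A minimal sketch of computing the discrimination subspace via the generalized eigenproblem (6.12.41), using SciPy's symmetric-definite solver, is given below; the function name and the small ridge regularizer are assumptions for illustration.

```python
import numpy as np
from scipy.linalg import eigh

def discrimination_subspace(Sb, Sw, c):
    """Solve S_b u = lambda S_w u, Eq. (6.12.41), and keep the c-1 leading
    generalized eigenvectors as the columns of U_{c-1}, Eq. (6.12.42)."""
    Q = Sw.shape[0]
    # small ridge keeps S_w positive definite for the solver
    evals, evecs = eigh(Sb, Sw + 1e-10 * np.eye(Q))
    order = np.argsort(evals)[::-1]     # largest generalized eigenvalues first
    return evecs[:, order[:c - 1]]      # Q x (c-1) matrix U_{c-1}
```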

6.12.4 K-Means Clustering

Clustering techniques may broadly be divided into two categories: hierarchical and non-hierarchical [6]. As one of the more widely used non-hierarchical clustering techniques, K-means [251] aims to minimize the sum of squared Euclidean distances between patterns and cluster centers via an iterative hill-climbing algorithm.
Given a set of N data points in the d-dimensional vector space R^d, consider partitioning these points into a number (say K ≤ N) of groups (or clusters) based on some similarity/dissimilarity metric which establishes a rule for assigning patterns to the domain of a particular cluster center. Let the set of N data be S = {x₁, . . . , x_N} and the K clusters be C₁, . . . , C_K. Then the clustering is to make the K clusters satisfy the following conditions [186]:

Cᵢ ≠ ∅,  for i = 1, . . . , K;
Cᵢ ∩ Cⱼ = ∅,  for i = 1, . . . , K; j = 1, . . . , K and i ≠ j;
∪_{i=1}^K Cᵢ = S.

Here each observation vector belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
Given an open set S ⊆ R^d, the set {Vᵢ}_{i=1}^K is called a tessellation of S if Vᵢ ∩ Vⱼ = ∅ for i ≠ j and ∪_{i=1}^K Vᵢ = S. When a set of points {zᵢ}_{i=1}^K is given, the sets V̂ᵢ corresponding to the point zᵢ are defined by

V̂ᵢ = { x ∈ S : ‖x − zᵢ‖ ≤ ‖x − zⱼ‖ for j = 1, . . . , K, j ≠ i }.

(6.12.44)

The points {zᵢ}_{i=1}^K are called generators. The set {V̂ᵢ}_{i=1}^K is a Voronoi tessellation or Voronoi diagram of S, and each V̂ᵢ is known as the Voronoi set corresponding to the point zᵢ.
The K-means algorithm requires three user-specified parameters: the number of clusters K, the cluster initialization, and the distance metric [127, 245].
• Number of clusters K: The most critical parameter is K. As there is no perfect mathematical criterion available, the K-means algorithm is usually run independently for different values of K. The value of K is then selected using the partition that appears the most meaningful to the domain expert.


• Cluster initialization: To overcome the local minima of K-means clustering, the K-means algorithm selects K initial cluster centers m₁, . . . , m_K randomly from the N data points {x₁, . . . , x_N}.
• Distance metric: Let d_{ik} denote the distance between data points xᵢ = [x_{i1}, . . . , x_{id}]ᵀ and x_k = [x_{k1}, . . . , x_{kd}]ᵀ. K-means is typically used with the Euclidean distance d_{ik} = ( Σⱼ (x_{ij} − x_{kj})² )^{1/2} or the Mahalanobis distance metric for computing the distance between points xᵢ, x_k and cluster centers.
The most common algorithm uses an iterative refinement technique for finding a locally minimal solution without computing the Voronoi sets V̂ᵢ. Due to its ubiquity it is often called the K-means algorithm. However, the K-means algorithm, as a greedy algorithm, can only converge to a local minimum.
Given any set Z of K centers, for each center z ∈ Z, let V(z) denote its neighborhood, namely the set of data points for which z is the nearest neighbor. Based on an elegant random sequential sampling method, the random K-means algorithm, or simply the K-means algorithm, does not require the calculation of Voronoi sets. Algorithm 6.19 shows the random K-means algorithm [15, 81].

Algorithm 6.19 Random K-means algorithm [15, 81]
1. input: Dataset S = {x₁, . . . , x_N} with xᵢ ∈ R^d, a positive integer K.
2. initialization:
  2.1 Select an initial set of K centers Z = {z₁, . . . , z_K}, e.g., by using a Monte Carlo method;
  2.2 Set jᵢ = 1 for i = 1, . . . , K.
3. repeat
4. Select a data point x ∈ S at random, according to the probability density function ρ(x).
5. Find the zᵢ that is closest to x, e.g., zᵢ = arg min_{z₁,...,z_K} {‖x − z₁‖₂, . . . , ‖x − z_K‖₂}; denote the index of that zᵢ by i*. Ties are resolved arbitrarily.
6. Set z_{i*} ← (j_{i*} z_{i*} + x)/(j_{i*} + 1) and j_{i*} ← j_{i*} + 1; this new z_{i*}, along with the unchanged zᵢ, i ≠ i*, forms the new set of cluster centers z₁, . . . , z_K.
7. Assign point xᵢ, i = 1, . . . , N, to cluster Cⱼ, j ∈ {1, . . . , K}, if and only if ‖xᵢ − zⱼ‖ ≤ ‖xᵢ − z_p‖, p = 1, . . . , K and j ≠ p.
8. exit if the K cluster centers z₁, . . . , z_K have converged.
9. return
10. output: K cluster centroids z₁, . . . , z_K.

Once the K cluster centers z₁, . . . , z_K are obtained, we can classify a new testing data point y ∈ R^d by

class of y = the index k of the center z_k with d(y, z_k) = min{d(y, z₁), . . . , d(y, z_K)},

where d(y, z) is the distance metric.

(6.12.45)
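A compact sketch of the sequential center update in Algorithm 6.19 is shown below; uniform sampling is assumed for ρ(x), and the names are illustrative.

```python
import numpy as np

def random_kmeans(X, K, n_iter=10000, rng=np.random.default_rng(0)):
    """Random (sequential) K-means: repeatedly sample a point, find its
    nearest center i*, and move that center toward the point (Step 6)."""
    N, d = X.shape
    Z = X[rng.choice(N, size=K, replace=False)].astype(float)  # initial centers
    counts = np.ones(K)
    for _ in range(n_iter):
        x = X[rng.integers(N)]                                  # sample a data point
        i_star = np.argmin(np.linalg.norm(Z - x, axis=1))
        Z[i_star] = (counts[i_star] * Z[i_star] + x) / (counts[i_star] + 1)
        counts[i_star] += 1
    labels = np.argmin(np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2), axis=1)
    return Z, labels
```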


The classic K-means clustering algorithm finds cluster centers that minimize the distance between data points and the nearest center. K-means can be viewed as a way of constructing a "dictionary" D ∈ R^{n×k} of k vectors so that a data vector xᵢ ∈ R^n, i = 1, . . . , m, can be mapped to a code vector sᵢ that minimizes the error in reconstruction [56].
It is known [56] that initialization and pre-processing of the inputs are useful and necessary for K-means clustering.
1. Though it is common to initialize K-means to randomly chosen examples drawn from the data, this has been found to be a poor choice. Instead, it is better to randomly initialize the centroids from a normal distribution and then normalize them to unit length.
2. The centroids tend to be biased by correlated data; by whitening the input data, the obtained centroids become more nearly orthogonal.
After initializing and whitening the inputs, a modified version of K-means, called gain shape vector quantization [290] or spherical K-means [75], is applied to find the dictionary D. Including the initialization and pre-processing, the full K-means training is as follows.
1. Normalize the inputs:

xᵢ := (xᵢ − mean(xᵢ)) / √(var(xᵢ) + ε_norm),  ∀ i,    (6.12.46)

where mean(xᵢ) and var(xᵢ) are the mean and variance of the elements of xᵢ, respectively.
2. Whiten the inputs:

cov(x) = VDVᵀ,    (6.12.47)
xᵢ := V(D + ε_zca I)^{−1/2} Vᵀ xᵢ,  ∀ i.    (6.12.48)

3. Loop until convergence (typically 10 iterations is enough):

sᵢ(j) = { dⱼᵀxᵢ, if j = arg max_l |d_lᵀxᵢ|;  0, otherwise, }  (∀ j, i)    (6.12.49)
D := XSᵀ + D,    (6.12.50)
dⱼ := dⱼ/‖dⱼ‖₂,  ∀ j.    (6.12.51)

330

6 Machine Learning

This procedure of normalization, whitening, and K-means clustering is an effective “off the shelf” unsupervised learning module that can serve in many feature-learning roles [56].

6.13 Spectral Clustering Spectral clustering is one of the most popular modern clustering algorithms, and has been extensively studied in the image processing, data mining, and machine learning communities, see, e.g., [190, 230, 257]. In multivariate statistics and the clustering of data, spectral clustering techniques make use of the spectrum (eigenvalues) of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions. The similarity matrix is provided as an input and consists of a quantitative assessment of the relative similarity of each pair of points in the data set.

6.13.1 Spectral Clustering Algorithms It is assumed that data consists of N “points” x1 , . . . , xN which can be arbitrary objects. Let their pairwise similarities sij = s(xi , xj ) be measured by using some similarity function which is symmetric and nonnegative, and denote the corresponding similarity matrix by S = [sij ]N,N i,j =1 . Two popular spectral clustering algorithms [190, 230] are based on spectral graph partitioning. A promising alternative is to use matrix algebra approach. Let hk be a cluster that partitions the mapped data  = [φ(x1 ), . . . , φ(xN )], producing the output hk . Maximize the output energy to get max

0 / hk 22 = hTk T hk = hTk Whk ,

where 2 T 3N,N W = T  = [K(xi , xj )]N,N i,j =1 = φ (xi )φ(xj ) i,j =1 .

(6.13.1)

In spectral clustering, due to K(xi , xj ) = s(xi , xj ), the affinity matrix W becomes the similarity matrix S, i.e., W = S. For K clusters we maximize the total output energy max

*K 

h1 ,...,hK

subject to

. hTk Whk

(6.13.2)

k=1

hTk hk = 1, ∀k = 1, . . . , K,

(6.13.3)

6.13 Spectral Clustering

331

to get the primal constrained optimization problem: max

H=[h1 ,...,hK ]

subject to

tr(HT WH)

(6.13.4)

HT H = I.

(6.13.5)

By using the Lagrange multiplier method, the above constrained optimization can be rewritten as the dual unconstrained optimization: max H

/ "0 ! L(H) = tr(HT WH) + λ 1 − H22 .

(6.13.6)

From the first-order optimality condition we have ∂L =O ∂H



2WH − 2λH = O



WH = λH

(6.13.7)

due to the symmetry of the affinity matrix W = S. That is, H = [h1 , . . . , hK ] consists of the K leading eigenvectors (with the largest eigenvalues) of W as columns. If the affinity matrix W is replaced by the normalized Laplacian matrix Lsys = I − D−1/2 WD−1/2 , then the K spectral clusters h1 , . . . , hK are given directly by the K leading eigenvectors. This is just the key point of the normalized spectral clustering algorithm developed by Ng et al. [190], see Algorithm 6.20. Algorithm 6.20 Normalized spectral clustering algorithm of Ng et al. [190] 1. input: Dataset V = {x1 , . . . , xN } with xi ∈ Rn , number k of clusters, kernel function K(xi , xj ) : Rn × Rn → R. 2. Construct the N × N weighted adjacency matrix W = [δij K(xi , xj )]N,N i,j =1 . N 3. Compute the N × N diagonal matrix [D]ij = δij j =1 wij .

Compute the N × N normalized Laplacian matrix Lsys = I − D−1/2 WD−1/2 . Compute the first k eigenvectors u1 , . . . , uk of the eigenvalue problem Lsys u = λu. Let the matrix U = [u1 , . . . , uk ] ∈ RN ×k . k 2 1/2 . Compute the (i, j )th entries of the N × k matrix T = [tij ]N,k i,j =1 as tij = uij /( m=1 uim ) Let yi = [ti1 , . . . , tik ]T ∈ Rk , i = 1, . . . , N . 8. Cluster the points y1 , . . . , yN via the K-means algorithm into clusters C1 , . . . , Ck . 9. output: clusters A1 , . . . , Ak with Ai = {j |yj ∈ Ci }. 4. 5. 6. 7.

˜ = D1/2 H and W ˜ = D−1/2 WD−1/2 , then the solution Interestingly, if letting H to the optimization problem max H

/ "0 ! ˜ = tr(H ˜TW ˜ H) ˜ + λ 1 − H ˜ 2 L(H) 2

(6.13.8)

332

6 Machine Learning

˜H ˜ = λH ˜ which, by using H ˜ = D1/2 H and W ˜ = is given by the eigenproblem W D−1/2 WD−1/2 , becomes the generalized eigenproblem WH = λDH. Therefore, if the affinity matrix W is replaced by the normalized Laplacian matrix L = D − W, then the K spectral clusters h1 , . . . , hK consist of the K leading generalized eigenvectors (corresponding to the largest generalized eigenvalues) of the matrix pencil (Lsys , D). Therefore, we get the normalized spectral clustering algorithm of Shi and Malik [230], as shown in Algorithm 6.21. Algorithm 6.21 Normalized spectral clustering algorithm of Shi and Malik [230] 1. input: Dataset V = {x1 , . . . , xN } with xi ∈ Rn , number k of clusters, kernel function K(xi , xj ) : Rn × Rn → R. 2. Construct the N × N weighted adjacency matrix W = [δij K(xi , xj )]N,N i,j =1 . N 3. Compute the N × N diagonal matrix [D]ij = δij j =1 wij . 4. Compute the N × N normalized Laplacian matrix Lsys = D−1/2 (D − W)D−1/2 . 5. Compute the first k generalized eigenvectors u1 , . . . , uk of the generalized eigenproblem Lsys u = λDu. 6. Let the matrix U = [u1 , . . . , uk ] ∈ RN ×k . k 2 1/2 . 7. Compute the (i, j )th entries of the N × k matrix T = [tij ]N,k i,j =1 as tij = uij /( m=1 uim ) T k Let yi = [ti1 , . . . , tik ] ∈ R , i = 1, . . . , N . 8. Cluster the points y1 , . . . , yN via the K-means algorithm into clusters C1 , . . . , Ck . 9. output: clusters A1 , . . . , Ak with Ai = {j |yj ∈ Ci }.

Remark 1 Both Algorithms 6.20 and 6.21 are normalized spectral clustering algorithms, but different normalized Laplacian matrices are used: Lsys = D−1/2 (D − W)D−1/2 in Algorithm 6.21, while Lsys = I − D−1/2 WD−1/2 in Algorithm 6.20. Remark 2 If using the unnormalized Laplacian matrix L = D − W instead of the normalized Laplacian matrices Lsys in Step 3 in the algorithms above described, then we get the unnormalized spectral clustering algorithms. Remark 3 In practical computation of the “ideal” weighted adjacency matrix W, there is a perturbation to A that results in estimation errors of either generalized eigenvectors in Algorithm 6.21 or eigenvectors in Algorithm 6.20, i.e., there is a perturbed version Tˆ = T + E, where E is a perturbation matrix to the “ideal” principal (generalized) eigenvector matrix T. Spectral clustering techniques are widely used, due to their simplicity and empirical performance advantages compared to other clustering methods, such as hierarchical clustering and K-means clustering and so on. In recent years, spectral clustering has gained new promotion and development. In the next two subsections, we introduce the constrained spectral clustering and the fast spectral clustering, respectively.

6.13 Spectral Clustering

333

6.13.2 Constrained Spectral Clustering Both K-means and hierarchical clustering are constrained clustering, but an intractable problem is how to satisfy many constraints in these algorithmic settings. Alternative to encode many constraints in clustering is to use spectral clustering. To help spectral clustering recover from an undesirable partition, side information in various forms can be introduced, in either small or large amounts [259]: • Pairwise constraints: In some cases a pair of instances must be in the same cluster (Must-Link), while a pair of instances cannot be in the same cluster (CannotLink). • Partial labeling: There can be labels on some of the instances, which are neither complete nor exhaustive. • Alternative weak distance metrics: In some situations there may be more than one distance metrics available. • Transfer of knowledge: When the graph Laplacian is treated as the target domain, it can be viewed as the source domain to transfer knowledge from a different but related graph. Consider encoding side information, in k-class clustering, with an N × N constraint matrix Q defined as [259]:

Qij = Qj i

⎧ +1, if nodes xi , xj belong to same cluster; ⎪ ⎪ ⎪ ⎪ ⎪ ⎨ = −1, if nodes xi , xj belong to different clusters; ⎪ ⎪ ⎪ ⎪ ⎪ ⎩ 0, no side information available.

(6.13.9)

Define the cluster indicator vector u = [u1 , . . . , uN ]T as ui =

⎧ ⎨ +1, if xi belongs to cluster i; ⎩

(6.13.10) −1, if xi does not belong to cluster i;

for i = 1, . . . , N . Therefore, uT Qu =

N N  

ui uj Qij

(6.13.11)

i=1 j =1

becomes a real-valued measure of how well the constraints in Q are satisfied in the relaxed sense. The larger uT Qu is, the better the cluster assignment u conforms to the given constraints in Q.

334

6 Machine Learning

Let constant α ∈ R be a lower-bound of uT Qu, i.e., uT Qu ≥ α. If substituting u with D−1/2 v (where D is a degree matrix), then this inequality constraint becomes ¯ ≥ α, vT Qv

(6.13.12)

¯ = D−1/2 QD−1/2 is the normalized constraint matrix. where Q Therefore, the constrained spectral clustering can be written as a constrained optimization problem [259]: ¯ min vT Lv

v∈RN

¯ ≥ α, vT v = vol, v = D1/2 1, subject to vT Qv

(6.13.13)

 where L¯ = D−1/2 LD−1/2 and vol = N i=1 Dii is the volume of graph G(V , E). The second constraint vT v = vol normalizes v; the third constraint v = D1/2 1 rules out the trivial solution D1/2 1. By the Lagrange multiplier method, one gets the dual spectral clustering problem:  ,     T ¯ T ¯ T min LD = v Lv + λ α − v Qv + μ vol − v v .

v∈RN

(6.13.14)

From the stationary point condition, we have ∂LD =0 ∂v



¯ − λQv ¯ − μv = 0. Lv

(6.13.15)

Letting the auxiliary variable β = − μλ vol then the stationary point condition is ¯ − λQv ¯ + λβ v = 0 Lv vol

or

  β ¯ ¯ Lv = λ Q − I v. vol

(6.13.16)

  ¯ Q ¯ − λβ I with the That is, v is the generalized eigenvector of the matrix pencil L, vol largest generalized eigenvalue λmax . If there is one largest generalized eigenvalue λmax , then the associated generalized eigenvector u is the unique solution of the generalized eigenproblem. However, if there are multiple (say, p) largest generalized eigenvalues, then p associated generalized eigenvectors v1 , . . . , vp are solutions of the generalized eigenproblem. In order to find the unique solution, let V1 = [v1 , . . . , vp ] ∈ RN ×p . If using the Household transform matrix H ∈ Rp×p such that ⎤ ⎡ .. ⎢ γ . 0 ··· 0 ⎥ ⎥ ⎢ (6.13.17) V1 H = ⎢ - - ...- - - - - - ⎥ , ⎦ ⎣ .. z. ×

6.13 Spectral Clustering

335

where γ = 0 and × may be any values that are not interesting, then the unique 2 3 solution of the generalized eigenproblem is given by v = γz . Then, make v ← √ v v vol. Algorithm 6.22 shows the constrained spectral clustering method. Algorithm 6.22 Constrained spectral clustering for two-way partition [259] 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18.

input: Affinity matrix A, constraint matrix β.  N Q, constant   N N vol ← N i=1 j =1 Aij , D ← Diag j =1 A1j , . . . , j =1 ANj . ¯ ← D−1/2 LD−1/2 , Q ¯ ← D−1/2 QD−1/2 . L ¯ λmax ← the largest eigenvalue of Q. if β ≥ λmax · vol, then return u = 0; else ! " ¯ =λ Q ¯ − β I v for (λi , vi ) Solve the generalized eigenproblem Lv vol if λmax = λ1 is unique, √ then v return v ← v vol. else V1 ← [v1 , . . . , vp ] for λmax = λ1 ≈ · · · ≈ λp .   γ . Perform Household transform V1 H in (6.13.17) to get v = z √ v return v ← v vol. ∗ T ¯ v ← arg minv v Lv, where v is among the feasible eigenvectors generated in the previous step; end if end if output: the optimal (relaxed) cluster indicator u∗ ← D−1/2 v∗ .

For K-way partition, given all the feasible eigenvectors, the top K − 1 eigenvec¯ Instead of only using the optimal tors are picked in terms of minimizing vT Lv. ∗ feasible eigenvector u in Algorithm 6.22, one needs to preserve top (K − 1) eigenvectors associated with positive eigenvalues, and perform K-means algorithm based on that embedding. Letting the K − 1 eigenvectors form the columns of V ∈ RN ×(K−1) , then K-means clustering is performed on the rows of V to get the final clustering. See Algorithm 6.23.

6.13.3 Fast Spectral Clustering For large data sets, a serious computational challenge of spectral clustering is computation of an affinity matrix between pairs of data points for large data sets. By combining the spectral clustering algorithm with the Nyström approximation to the graph Laplacian, a fast spectral clustering was developed by Choromanska et al. in 2013 [52]. The Nyström r-rank approximation for any symmetric positive semi-definite matrix L ∈ RN ×N with large N is below.

336

6 Machine Learning

Algorithm 6.23 Constrained spectral clustering for K-way partition [259] 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12.

input: Affinity matrix A, constraint matrix  N Q, constantβ,N number of classes K.  N vol ← N i=1 j =1 Aij , D ← Diag j =1 A1j , . . . , j =1 ANj . ¯ ← D−1/2 LD−1/2 , Q ¯ ← D−1/2 QD−1/2 . L ¯ λmax ← the largest eigenvalue of Q. if β ≥ λK−1 vol, then return u = 0 else ! " ¯ L ¯ − β I v for v1 , . . . , vK−1 . Solve Lv vol √ v v ← v vol. ∗ ¯ V ← arg minV∈RN×(K−1) tr(VT LV), where the columns of V are a subset of the feasible eigenvectors generated in the previous step;. return u∗ ← kmeans(D−1/2 V∗ , K). end if

1. Perform uniform sampling of L (without replacement schemes) to create matrix C ∈ RN ×l from the sampled columns, and let L be indices of l columns sampled. 2. Form matrix W ∈ Rl×l from C by sampling its l rows via the indices in L. 3. Compute the eigenvalue decomposition (EVD) W = UUT , where U is orthonormal and  = Diag(σ1 , . . . , σl ) is an l × l real diagonal matrix with the diagonal sorted in decreasing order.  4. Calculate the best rank-r approximation to W, denoted as Wr = rp=1 σp up up ,  and its Moore–Penrose inverse W†r = rp=1 σp−1 up up , where up and up are the pth column and row of U, respectively. 5. The Nyström r-rank approximation Lr of L can be obtained as Lr = CW†r C. 6. Perform the EVD Wr = UWr  Wr UWr . 7. The EVD of the Nyström r-rank approximation Lr is given by Lr = ULr  Lr UTLr , where  Lr = Nl  Wr and ULr = N −1/2 CUWr  −1 Wr . Algorithm 6.24 shows the Nyström method for matrix approximation. Algorithm 6.24 Nyström method for matrix approximation [52] 1. input: N × N symmetric matrix L, number l of columns sampled, rank r of matrix approximation (r ≤ l / N ). 2. L ← indices of l columns sampled. 3. C ← L(:, L). 4. W ← C(L, :).  5. W = UUT and Wr = rp=1 σp up up , where up and up are the pth column and row of U. 6. Wr = UWr  Wr UWr .  7.  Lr =

N l  Wr

and ULr =

−1 l N CUWr  Wr .

8. output: Nyström approximation of L is given by Lr = ULr  Lr UTLr .

6.14 Semi-Supervised Learning Algorithms

337

Once the Nyström approximation of L is obtained, one can get the fast spectral clustering algorithm of Choromanska et al. [52]. See Algorithm 6.25. Algorithm 6.25 Fast spectral clustering algorithm [52] 1. input: N × d dataset S = {x1 , . . . , xN } ∈ Rd , number k of clusters, number l of columns sampled, rank r of approximation (k ≤ r ≤ l / N ). 2. initialization: A ∈ RN ×N with aij = [A]ij = δij K(xi , xj ) if i = j and 0 otherwise. 3. L ← indices of l columns sampled (uniformly without replacement). ˆ = [aˆ ij ]N,l ← A(:, L). 4. A i,j =1  l 5. D ∈ RN ×N with dij = δij 1/ aˆ ij .  j =1 N 6. ∈ Rl×l with ij = δij 1/ ˆ ij . i=1 a  l ˆ ˆ ˆ 7. C ← I − N DA , where I is an N × l matrix consisting of columns of the identity matrix IN ×N indexed by L. 8. W ← C(L, :).  9. W = UUT and Wr = rp=1 σp up up , where up and up are the pth column and row of U. 10. Wr = UWr  Wr UWr . 

11.  Lr =

N l  Wr

and ULr =

RN ×k

−1 l N CUWr  Wr .

12. Z = [zim ] ∈ ←k eigenvectors of ULr with k smallest eigenvalues. 2 )1/2 for i = 1, . . . , N ; j = 1, . . . , k. 13. [Y]im = yim = zim /( m zim 14. Use any k-clustering algorithm (e.g., K-means) to N rows of Y ∈ RN ×k to get k clusters Si = {yri1 , . . . , yirm }, i = 1, . . . , k with m1 + · · · + mk = N , where ypr is the pth row of Y. i 15. output: k spectral clustering results of S are given by Si = {xi1 , . . . , ximi }, i = 1, . . . , k.

6.14 Semi-Supervised Learning Algorithms The significance of semi-supervised learning (SSL) is mainly reflected in the following three aspects [20]. • From an engineering standpoint, due to the difficulty of collecting labeled data, an approach to pattern recognition that is able to make better use of unlabeled data to improve recognition performance is of potentially great practical significance. • Arguably, most natural (human or animal) learning occurs in the semi-supervised regime. In many cases, a small amount of feedback is sufficient to allow the child to master the acoustic-to-phonetic mapping of any language. • In most pattern recognition tasks, humans have access only to a small number of labeled examples. The ability of humans to learn unsupervised concepts (e.g., learning clusters and categories of objects) suggests that the success of human learning in this small sample regime is plausibly due to effective utilization of the large amounts of unlabeled data to extract information that is useful for generalization.

338

6 Machine Learning

The semi-supervised learning problem has recently drawn a large amount of attention in the machine learning community, mainly due to considerable improvement in learning accuracy when unlabeled data is used in conjunction with a small amount of labeled data.

6.14.1 Semi-Supervised Inductive/Transductive Learning Typically, semi-supervised learning uses a small amount of labeled data with a large amount of unlabeled data. Hence, as the name suggests, semi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data). In supervised classification, one is always interested in classifying future test data by using fully labeled training data. In semi-supervised classification, however, the training samples contain both labeled and unlabeled data. Given a training set Dtrain = (Xtrain , Ytrain ) that consists of the labeled set (Xlabeled , Ylabeled ) = {(x1 , y1 ), . . . (xl , yl )} and the disjoint unlabeled set Xunlabeled = {xl+1 , . . . , xl+u }. Depending on different training sets and goals, there are two slightly different types of semi-supervised learning: semi-supervised inductive learning and semi-supervised transductive learning. Definition 6.24 (Inductive Learning) If training set is Dtrain = (Xtrain , Ytrain ) = (X1abeled , Ylabeled ) = {(x1 , y1 ), . . . , (xl , yl )}, and test set is Dtest = Xtest (unlabeled) that is not appeared in training set, then this setting is called inductive learning. Definition 6.25 (Semi-Supervised Inductive Learning [110, 234]) Let training set consist of a large amount of unlabeled data Xunlabeled = {xl+1 , . . . , xl+u } and a small amount of the auxiliary labeled samples Xauxiliary = (Xlabeled , Ylabeled ) = {(x1 , y1 ), . . ., (xl , yl )} with l / u. Semi-supervised inductive learning builds a function f so that f is expected to be a good predictor on future data set Xtest , beyond {xj }jl+u =l+1 . If we do not care about Xtest , but only want to know the effect of machine learning on Xunlabeled , this setting is known as semi-supervised transductive learning, because the unlabeled set Xunlabeled was seen in training. Definition 6.26 (Semi-Supervised Transductive Learning) Let training set consist of a large amount of unlabeled data Xunlabeled = {xl+1 , . . . , xl+u } and a small amount of the auxiliary labeled samples (Xlabeled , Ylabeled ) = {(x1 , y1 ), . . . , (xl , yl )} with l / u. Semi-supervised transductive learning uses all labeled data (Xlabeled , Ylabeled ) and unlabeled data Xunlabeled to give the prediction or classification labels of Xunlabeled or its some interesting subset without building the function f .


To put it simply, there are the following differences between inductive learning, semi-supervised learning, semi-supervised inductive learning, and semi-supervised transductive learning.
• Differences in training set: Inductive learning uses only labeled data (Xlabeled, Ylabeled), while semi-supervised learning, semi-supervised inductive learning, and semi-supervised transductive learning use the same training set including both labeled data (Xlabeled, Ylabeled) = {(x1, y1), . . . , (xl, yl)} and unlabeled data Xunlabeled = {xl+1, . . . , xl+u}.
• Differences in testing data: The unlabeled data Xunlabeled are not the testing data in inductive learning, semi-supervised learning, and semi-supervised inductive learning, i.e., the samples we want to predict are not seen (or used) in training; whereas the unlabeled data Xunlabeled are the testing data in transductive learning, i.e., the samples we want to predict have been seen (or used) in training: Xtest = Xunlabeled.
• Differences in prediction functions: In inductive learning, semi-supervised learning, and semi-supervised inductive learning, one needs to build a prediction function on unseen (or unused) data (i.e., outside the training data), f : Dtrain → Dtest. This function does not need to be built in semi-supervised transductive learning, which only needs to give the predictions for some subset, rather than all, of the unlabeled data seen in training.
Semi-supervised transductive learning is often simply called transductive learning.
In semi-supervised learning, semi-supervised inductive learning, and semi-supervised transductive learning, the small labeled data set {(x1, y1), . . . , (xl, yl)} is usually viewed as an auxiliary set, denoted Dauxiliary = (Xlabeled, Ylabeled). It has been noted that a small amount of auxiliary labeled data typically helps boost the performance of the classifier or predictor significantly.
In fact, most semi-supervised learning strategies are based on extending either unsupervised or supervised learning to include additional information typical of the other learning paradigm. Specifically, semi-supervised learning encompasses the following different settings [305]:
• Semi-supervised classification: This is an extension of supervised classification, and its goal is to train a classifier f from both the labeled and unlabeled data, such that it is better than the supervised classifier trained on the small amount of labeled data alone.
• Constrained clustering: This is an extension of unsupervised clustering. Some "supervised information" can be so-called must-link constraints (two instances xi, xj must be in the same cluster) and cannot-link constraints (xi, xj cannot be in the same cluster). One can also constrain the size of the clusters. The goal of constrained clustering is to obtain better clustering than the clustering obtained from unlabeled data alone.
The acquisition of labeled data for a learning problem is often expensive since it requires a skilled human agent or a physical experiment. The cost associated with


the labeling process may thus render a fully labeled training set infeasible, whereas acquisition of unlabeled data is relatively inexpensive. In such situations, semi-supervised learning can be particularly useful. For example, in the process industry, it is a difficult task to assign the type of detected faulty data samples: this may require process knowledge and the experience of process engineers, and it can be costly and time-consuming. As a result, only a small number of faulty samples are assigned to their corresponding types, while most faulty samples are unlabeled. Since these unlabeled samples still contain important information about the process, they can be used to greatly improve the efficiency of the fault classification system. In this case, as compared to a model that depends only on the small set of labeled data samples, semi-supervised learning, using both labeled and unlabeled samples, will provide an improved fault classification model [104].
To achieve higher accuracy of machine learning with fewer training labels and a large pool of unlabeled data, several approaches have been proposed for semi-supervised learning. In the following, we summarize the common semi-supervised learning algorithms: self-training, cluster-then-label, and co-training.

6.14.2 Self-Training

Self-training is a commonly used technique for semi-supervised learning; it is a learning process that uses its own predictions to teach itself. For this reason, it is also known as self-teaching or bootstrapping. In self-training, a classifier is first trained with the small amount of labeled data. Then, the classifier is used to classify the unlabeled data. Typically the most confident unlabeled points, together with their predicted labels, are added to the training set. The classifier is retrained and the procedure is repeated until all of the unlabeled data are classified. See Algorithm 6.26 for self-training.

Algorithm 6.26 Self-training algorithm [305]
1. input: labeled data {(xi, yi)}_{i=1}^l, unlabeled data {xj}_{j=l+1}^{l+u}.
2. initialization: L = {(xi, yi)}_{i=1}^l and U = {xj}_{j=l+1}^{l+u}. Let k = l + 1.
3. repeat
4.   Train f from L using supervised learning.
5.   Apply f to the unlabeled instance xk in U.
6.   Remove a subset S of learning successes from U, and add {(xk, f(xk)) | xk ∈ S} to L.
7.   exit if k = l + u.
8.   k ← k + 1.
9. return
10. output: f.
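As a concrete illustration, the following is a minimal, self-contained Python/NumPy sketch of such a self-training loop (not the book's reference implementation). The base learner is a simple nearest-centroid classifier standing in for the wrapped learner f, and the confidence threshold tau is an illustrative choice.

```python
import numpy as np

def nearest_centroid_fit(X, y):
    # Train a trivial base learner: one centroid per class.
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def nearest_centroid_predict(model, X):
    classes, centroids = model
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    conf = 1.0 / (1e-12 + d.min(axis=1))            # crude confidence score
    return classes[d.argmin(axis=1)], conf

def self_training(X_l, y_l, X_u, tau=2.0, max_iter=20):
    """Wrap a base learner in the self-training loop of Algorithm 6.26 (sketch)."""
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
    for _ in range(max_iter):
        if len(X_u) == 0:
            break
        model = nearest_centroid_fit(X_l, y_l)               # train f on L
        y_hat, conf = nearest_centroid_predict(model, X_u)   # apply f to U
        S = conf >= tau                                      # most confident subset S
        if not S.any():                                      # fall back: take the single best
            S = conf == conf.max()
        X_l = np.vstack([X_l, X_u[S]])                       # move (x, f(x)) into L
        y_l = np.concatenate([y_l, y_hat[S]])
        X_u = X_u[~S]
    return nearest_centroid_fit(X_l, y_l)
```

Because self-training is a wrapper method, the nearest-centroid learner here can be replaced by any classifier that exposes a confidence score.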

On self-training, one has the following comments and considerations [305].


Remark 1 For self-training, the algorithm's own predictions, assuming high confidence, tend to be correct. This is likely to be the case when the classes form well-separated clusters.

Remark 2 Implementation of self-training is simple. Importantly, it is also a wrapper method, which means that the choice of learner for f in Step 4 is left completely open: the learner can be a simple kNN algorithm or a very complicated classifier. The self-training procedure "wraps" around the learner without changing its inner workings. This is important for many real applications like natural language processing, where the learners can be complicated black boxes not amenable to changes.

Self-training works best when the labeled data set L is sufficiently large. If L is small, an early mistake made by f can reinforce itself by generating incorrectly labeled data; retraining with these data will lead to an even worse f in the next iteration. The heuristics of semi-supervised clustering can alleviate this problem, see Algorithm 6.27.

Algorithm 6.27 Propagating 1-nearest neighbor clustering algorithm [305]
1. input: labeled data {(xi, yi)}_{i=1}^l, unlabeled data {xj}_{j=l+1}^{l+u}, distance function d(·, ·).
2. initialization: L = {(xi, yi)}_{i=1}^l and U = {xj}_{j=l+1}^{l+u} = {zj}_{j=l+1}^{l+u}.
3. repeat
4.   Select z = arg min_{z∈U, x∈L} d(z, x).
5.   Let yk be the label of the x ∈ L nearest to z, and add the new labeled data (z, yk) to L.
6.   Remove z from U.
7.   exit if U is empty.
8. return
9. output: semi-supervised learning dataset L = {(xi, yi)}_{i=1}^{l+u}.

6.14.3 Co-training

Traditional machine learning algorithms, including self-training and propagating 1-nearest neighbor clustering, are based on the assumption that each instance has one attribute set or one kind of information. However, in many real-world applications there are two-view data and co-occurring data. These data can be described using two different attribute sets or two different "kinds" of information. For example, a web page contains two different kinds of information [30]:
• One kind of information about a web page is the text appearing on the document itself.
• Another kind of information is the anchor text attached to hyperlinks pointing to this page from other pages on the web.


Therefore, analyzing two-view and co-occurring data is an important challenge in the machine learning, pattern recognition, and signal processing communities, e.g., automatic annotation of text, audio, music, image, and video (see, e.g., [30, 80, 144, 193]) and sensor data mining [62].
More generally, the co-training setting [30] applies when a data set has a natural division of its features. The co-training setting usually assumes that:
• the features can be split into two sets;
• each sub-feature set is sufficient to train a good classifier;
• the two sets are conditionally independent given the class.
Traditional machine learning algorithms that learn over these domains ignore this division and pool all features together.

Definition 6.27 (Co-training Algorithm [30, 193]) A co-training algorithm is a multi-view weakly supervised algorithm using the co-training setting; it may learn separate classifiers over each of the feature sets and combine their predictions to decrease the classification error.

Given a small set of labeled two-view data L = {((xi, yi), zi)}_{i=1}^l and a large set of unlabeled two-view data U = {(xj, yj)}_{j=l+1}^{l+u}, where (xi, yi) is called the data bag and zi is the class label associated with the labeled two-view data bag (xi, yi), i = 1, . . . , l, two classifiers h(1) and h(2) work in co-training fashion: h(1) is a view-1 classifier that is based only on the first view x(1) (say xi) and ignores the second view x(2) (say yi), while h(2) is a view-2 classifier based on the second view x(2) that ignores the first view x(1).
The working process of co-training is as follows [189, 304]: two separate classifiers are initially trained with the labeled data on the two sub-feature sets, respectively. Then, each classifier uses one view of the data, selects its most confident predictions from the unlabeled pool, and adds the corresponding instances with their predicted labels to the labeled data while maintaining the class distribution in the labeled data. In other words, each classifier "teaches" the other classifier with a few unlabeled examples by providing its most confident unlabeled-data predictions as additional training data for the other. Each classifier is then retrained with the additional training examples given by the other classifier, and the process repeats. In the co-training process, the unlabeled data are eventually exhausted.
Co-training is a wrapper method. That is, it does not matter what the learning algorithms are for the two classifiers h(1) and h(2). The only requirement is that the two classifiers can assign a confidence score to their predictions. The confidence score is used to select which unlabeled instances to turn into additional training data for the other view. Being a wrapper method, co-training is widely applicable to many tasks.
For semi-supervised learning with two-view data and co-occurring data, the most popular algorithm is the standard co-training developed by Blum and Mitchell [30], as shown in Algorithm 6.28. The following are two variants of co-training [304].


Algorithm 6.28 Co-training algorithm [30]
1. input: labeled data set L = {(xi, yi)}_{i=1}^l, unlabeled data set U = {xj}_{j=l+1}^{l+u}. Each instance has two views xi = [xi^(1), xi^(2)].
2. initialization: create a pool U′ of examples by choosing u examples at random from U.
3. repeat until unlabeled data are used up:
4.   Train a view-1 classifier h(1) from {(xi^(1), yi)} in L.
5.   Train a view-2 classifier h(2) from {(xi^(2), yi)} in L.
6.   Allow h(1) to label p positive and n negative examples from U′.
7.   Allow h(2) to label p positive and n negative examples from U′.
8.   Add these self-labeled examples to L, and remove them from the unlabeled pool U′.
9.   Randomly choose 2p + 2n examples from U to replenish U′.
10. return
11. output: {(xi, yi)}_{i=1}^{l+u}.

1. Democratic co-learning [299]. Let L be the set of labeled data, U the set of unlabeled data, and A1, . . . , An (for n ≥ 3) the provided supervised learning algorithms. Democratic co-learning begins by training all learners on the original labeled data set L. For every example xu ∈ U in the unlabeled data set, each learner predicts a label. If a majority of learners confidently agree on the class of an unlabeled point xu, that classification is used as the label of xu; then xu and its label are added to the training data, and all learners are retrained on the updated training set. The final prediction is made with a variant of a weighted majority vote among the n learners.
2. Tri-training [300]. If two of the tri-training learners agree on the classification of an unlabeled point, the classification is used to teach the third classifier. This approach thus avoids the need to explicitly measure the label confidence of any learner. It can be applied to data sets without different views, or different types of classifiers.
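Returning to the standard co-training loop of Algorithm 6.28, the following is a minimal NumPy sketch under simplifying assumptions: the two views are given as separate feature matrices, the per-view learner is again a simple nearest-centroid classifier with a crude confidence score, and a single most-confident example is moved per view and per round. It is a sketch of the loop structure, not the reference implementation of [30].

```python
import numpy as np

def fit(X, y):
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def predict(model, X):
    classes, centroids = model
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[d.argmin(axis=1)], -d.min(axis=1)      # label, confidence

def co_training(X1_l, X2_l, y_l, X1_u, X2_u, rounds=10):
    """Two view-specific classifiers teach each other with confident pseudo-labels."""
    for _ in range(rounds):
        if len(X1_u) == 0:
            break
        h1, h2 = fit(X1_l, y_l), fit(X2_l, y_l)            # one classifier per view
        picked = {}
        for h, X_u_view in ((h1, X1_u), (h2, X2_u)):
            y_hat, conf = predict(h, X_u_view)
            j = int(conf.argmax())                         # most confident unlabeled example
            picked[j] = y_hat[j]                           # if both views pick the same point, keep one label
        for j, lab in picked.items():                      # each view teaches the other
            X1_l = np.vstack([X1_l, X1_u[j:j + 1]])
            X2_l = np.vstack([X2_l, X2_u[j:j + 1]])
            y_l = np.append(y_l, lab)
        keep = np.setdiff1d(np.arange(len(X1_u)), list(picked))
        X1_u, X2_u = X1_u[keep], X2_u[keep]
    return fit(X1_l, y_l), fit(X2_l, y_l)
```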

6.15 Canonical Correlation Analysis

As the previous section illustrated, for a process described by two sets of variables corresponding to two different aspects or views, analyzing the relations between these two views helps improve the understanding of the underlying system. An alternative approach to two-view data analysis is canonical correlation analysis. Canonical correlation analysis (CCA) [7, 122] is a standard two-view multivariate statistical method that finds linear transformations of two random vectors such that the transformed variables are maximally correlated, and hence it is a powerful tool for analyzing multi-dimensional paired sample data. Paired samples are chosen in a random manner, and thus are random samples.


6.15.1 Canonical Correlation Analysis Algorithm

In the case of CCA, the variables of an observation can be partitioned into two sets that can be seen as the two views of the data. Let the two-view matrices X = [x1, . . . , xN] ∈ R^{N×p} and Y = [y1, . . . , yN] ∈ R^{N×q} have covariance matrices (Cxx, Cyy) and cross-covariance matrices (Cxy, Cyx), respectively. The row vectors xn ∈ R^{1×p} and yn ∈ R^{1×q} (n = 1, . . . , N) denote the sets of empirical multivariate observations in X and Y, respectively. The aim of CCA is to extract the linear relations between the variables of X and Y. To this end, consider the following transformations:

  X wx = zx  and  Y wy = zy,    (6.15.1)

where wx ∈ R^p and wy ∈ R^q are, respectively, the projection vectors associated with the data matrices X and Y, while zx ∈ R^N and zy ∈ R^N are, respectively, the images of the positions wx and wy. The positions wx and wy are often referred to as canonical weight vectors, and the images zx and zy are termed canonical variants or score variants [253, 265]. Hence, the data matrices X and Y represent linear transformations of the positions wx and wy onto the images zx and zy in the space R^N, respectively.
Define the cosine of the angle between the images zx and zy:

  cos(zx, zy) = ⟨zx, zy⟩ / (‖zx‖ ‖zy‖),    (6.15.2)

which is referred to as the canonical correlation. The constraints of CCA on the mappings are given below:
• the position vectors of the images zx and zy are unit-norm vectors;
• the enclosing angle θ ∈ [0, π/2] [108] between zx and zy is minimized.
The principle behind CCA is to find two positions in the two data spaces, respectively, that have images on the unit ball such that the angle between them is minimized and consequently the canonical correlation is maximized:

  cos θ = max_{zx, zy ∈ R^N} ⟨zx, zy⟩    (6.15.3)

subject to ‖zx‖2 = 1 and ‖zy‖2 = 1.
For solving the above optimization problem, define the sample cross-covariance matrices

  Cxy = X^T Y,  Cyx = Y^T X = Cxy^T    (6.15.4)

and the empirical variance matrices

  Cxx = X^T X,  Cyy = Y^T Y.    (6.15.5)


Hence, the constraints ‖zx‖2 = 1 and ‖zy‖2 = 1 can be rewritten as follows:

  ‖zx‖2^2 = zx^T zx = wx^T Cxx wx = 1,    (6.15.6)
  ‖zy‖2^2 = zy^T zy = wy^T Cyy wy = 1.    (6.15.7)

In this case, the covariance between zx and zy can also be written as

  zx^T zy = wx^T X^T Y wy = wx^T Cxy wy.    (6.15.8)

Then, the optimization problem (6.15.3) becomes

  cos θ = max_{zx, zy ∈ R^N} ⟨zx, zy⟩ = max_{wx ∈ R^p, wy ∈ R^q} wx^T Cxy wy    (6.15.9)

subject to ‖zx‖2^2 = wx^T Cxx wx = 1 and ‖zy‖2^2 = wy^T Cyy wy = 1. In order to solve this constrained optimization problem, define the Lagrange objective function

  L(wx, wy) = wx^T Cxy wy − λ1 (wx^T Cxx wx − 1) − λ2 (wy^T Cyy wy − 1),    (6.15.10)

where λ1 and λ2 are the Lagrange multipliers. We then have

  ∂L/∂wx = Cxy wy − λ1 Cxx wx = 0  ⇒  wx^T Cxy wy − λ1 wx^T Cxx wx = 0,    (6.15.11)
  ∂L/∂wy = Cyx wx − λ2 Cyy wy = 0  ⇒  wy^T Cyx wx − λ2 wy^T Cyy wy = 0.    (6.15.12)

Since wx^T Cxx wx = 1, wy^T Cyy wy = 1 and cos(zx, zy) = wx^T Cxy wy = cos(zy, zx) = wy^T Cyx wx, we get

  λ1 = λ2 = λ.    (6.15.13)

Substituting (6.15.13) into (6.15.11) and (6.15.12), respectively, we obtain

  Cxy wy = λ Cxx wx,    (6.15.14)
  Cyx wx = λ Cyy wy,    (6.15.15)


which can be combined into the generalized eigenvalue decomposition problem [13, 115]:

  [ O     Cxy ] [ wx ]       [ Cxx   O   ] [ wx ]
  [ Cyx   O   ] [ wy ]  = λ  [ O     Cyy ] [ wy ],    (6.15.16)

where O denotes the null matrix.
Equation (6.15.16) shows that λ and [wx; wy] are, respectively, the generalized eigenvalue and the corresponding generalized eigenvector of the matrix pencil

  ( [ O     Cxy ]   [ Cxx   O   ] )
  ( [ Cyx   O   ] , [ O     Cyy ] ).    (6.15.17)

The generalized eigenvalue decomposition problem (6.15.16) can also be solved by using the SVDs of the data matrices X and Y. Let X = Ux Dx Vx^T and Y = Uy Dy Vy^T be the SVDs of X and Y. Then, we have

  Cxx = X^T X = Vx Dx^2 Vx^T,    (6.15.18)
  Cyy = Y^T Y = Vy Dy^2 Vy^T,    (6.15.19)
  Cxy = X^T Y = Vx Dx Ux^T Uy Dy Vy^T = Cyx^T.    (6.15.20)

From (6.15.14) it is known that wx = (1/λ) Cxx^{-1} Cxy wy. Substitute this result into (6.15.15) to get

  ( Cyx Cxx^{-1} Cxy − λ^2 Cyy ) wy = 0.    (6.15.21)

Substituting Cxx^{-1} = Vx Dx^{-2} Vx^T into the above equation, we obtain

  ( Vy Dy Uy^T Ux Dx Vx^T Vx Dx^{-2} Vx^T Vx Dx Ux^T Uy Dy Vy^T − λ^2 Vy Dy^2 Vy^T ) wy = 0.

By using Vx^T Vx = I, the above equation can be rewritten as

  ( Vy Dy Uy^T Ux Ux^T Uy Dy Vy^T − λ^2 Vy Dy^2 Vy^T ) wy = 0.    (6.15.22)

Premultiplying by Dy^{-1} Vy^T, the above equation reduces to

  ( Uy^T Ux Ux^T Uy Dy Vy^T − λ^2 Dy Vy^T ) wy = 0.    (6.15.23)


Let Ux^T Uy = U D V^T be the SVD of Ux^T Uy; then Uy^T Ux Ux^T Uy = V D^2 V^T, and (6.15.23) becomes

  ( V D^2 V^T − λ^2 I ) Dy Vy^T wy = 0,    (6.15.24)

or

  V ( D^2 − λ^2 I ) V^T Dy Vy^T wy = 0.    (6.15.25)

Premultiplying by V^T and noting that V^T V = I, the above equation can be simplified to [265]:

  ( D^2 − λ^2 I ) V^T Dy Vy^T wy = 0.    (6.15.26)

Clearly, when

  wy = Vy Dy^{-1} V,    (6.15.27)

Equation (6.15.26) yields the result

  D^2 − λ^2 I = O  or  |D^2 − λ^2 I| = 0.    (6.15.28)

By mimicking this process, we immediately have that when

  wx = Vx Dx^{-1} U,    (6.15.29)

then

  ( D^2 − λ^2 I ) U^T Dx Vx^T wx = 0.    (6.15.30)

The above analysis shows that the generalized eigenvalues λ in the generalized eigenvalue problem (6.15.16) are given by the singular value matrix D of the matrix Ux^T Uy, as shown in (6.15.28), and the corresponding generalized eigenvectors (i.e., canonical variants) are given by wx = Vx Dx^{-1} U in (6.15.29) and wy = Vy Dy^{-1} V in (6.15.27), respectively. Algorithm 6.29 summarizes the CCA algorithm.
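A minimal NumPy sketch of this SVD-based procedure (summarized in Algorithm 6.29 below) is given here. Column centering is used as the assumed preprocessing in place of the row normalization in the algorithm statement, and the toy data are only for illustration.

```python
import numpy as np

def cca_svd(X, Y):
    """SVD-based CCA: X is N x p, Y is N x q (rows are paired observations)."""
    X = X - X.mean(axis=0)                     # assumed preprocessing: center each variable
    Y = Y - Y.mean(axis=0)
    Ux, dx, Vxt = np.linalg.svd(X, full_matrices=False)
    Uy, dy, Vyt = np.linalg.svd(Y, full_matrices=False)
    U, d, Vt = np.linalg.svd(Ux.T @ Uy)        # canonical correlations = singular values d
    Wx = Vxt.T @ np.diag(1.0 / dx) @ U         # canonical weight vectors for X (columns)
    Wy = Vyt.T @ np.diag(1.0 / dy) @ Vt.T      # canonical weight vectors for Y (columns)
    return Wx, Wy, d

# toy usage: X and Y share a common latent signal Z
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 2))
X = np.hstack([Z, rng.normal(size=(200, 2))])
Y = np.hstack([Z @ rng.normal(size=(2, 2)), rng.normal(size=(200, 3))])
Wx, Wy, rho = cca_svd(X, Y)
print(rho[:2])   # leading canonical correlations should be close to 1
```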

Algorithm 6.29 Canonical correlation analysis (CCA) algorithm [13, 115]
1. input: the two-view data vectors {(xn, yn)}_{n=1}^N with xn ∈ R^{1×p}, yn ∈ R^{1×q}.
2. initialization: normalize xn ← xn − (1/p) Σ_{i=1}^p xn(i) 1_{p×1}^T and yn ← yn − (1/q) Σ_{i=1}^q yn(i) 1_{q×1}^T.
3. Let X = [x1, . . . , xN] and Y = [y1, . . . , yN].
4. Compute the SVDs X = Ux Dx Vx^T and Y = Uy Dy Vy^T.
5. Compute the SVD Ux^T Uy = U D V^T.
6. Calculate wx = Vx Dx^{-1} U and wy = Vy Dy^{-1} V.
7. output: the canonical variants (wx, wy) of (X, Y).

6.15.2 Kernel Canonical Correlation Analysis

Let X and Y denote two views (i.e., two attribute sets describing the data), and let (x, y, c) be a labeled example, where x ∈ X and y ∈ Y are the two portions of the example and c is their label. For simplicity, assume that c ∈ {0, 1}, where 0 and 1 denote the negative and positive classes, respectively. Assume that there exist

two decision functions fX and fY over X and Y, respectively, such that fX(x) = fY(y) = c. Intuitively, this means that every example is associated with two views, each of which contains sufficient information for determining the label of the example.
Given one labeled example ((x0, y0), 1) and a large number of unlabeled examples U = {(xi, yi)}_{i=1}^{N−1}, the task of semi-supervised learning is to train a classifier for classifying the unlabeled examples, i.e., determining the unknown labels ci, i = 1, . . . , N − 1. For data described by two sufficient views, some projections in these two views should have strong correlation. Let X = [x0, x1, . . . , x_{N−1}] and Y = [y0, y1, . . . , y_{N−1}] be two-view data matrices consisting of the labeled data ((x0, y0), 1) and the N − 1 unlabeled data {(xi, yi)}_{i=1}^{N−1}. CCA finds two projection vectors wx and wy such that the projections X^T wx and Y^T wy are as strongly correlated as possible, namely their correlation coefficient is maximized:

  (wx, wy) = arg max_{wx, wy} ⟨X^T wx, Y^T wy⟩ / (‖X^T wx‖ · ‖Y^T wy‖)
           = arg max_{wx, wy} wx^T Cxy wy / sqrt( wx^T Cxx wx · wy^T Cyy wy )    (6.15.31)

subject to

  wx^T Cxx wx = 1,  wy^T Cyy wy = 1,    (6.15.32)

where Cxy = (1/N) X Y^T is the between-sets covariance matrix of X and Y, while Cxx = (1/N) X X^T and Cyy = (1/N) Y Y^T are the within-sets covariance matrices of X and Y, respectively.


By the Lagrange multiplier method, the objective of the above primal constrained optimization can be rewritten as the objective of the following dual unconstrained optimization:

  LD(wx, wy) = wx^T Cxy wy − (λx/2)(wx^T Cxx wx − 1) − (λy/2)(wy^T Cyy wy − 1).    (6.15.33)

From the first-order optimality conditions we have

  ∂LD/∂wx = 0  ⇒  Cxy wy = λx Cxx wx,    (6.15.34)
  ∂LD/∂wy = 0  ⇒  Cyx wx = λy Cyy wy,    (6.15.35)

which can be combined into

  0 = wx^T Cxy wy − λx wx^T Cxx wx − wy^T Cyx wx + λy wy^T Cyy wy
    = λy wy^T Cyy wy − λx wx^T Cxx wx
    = λy − λx.

Letting λ = λx = λy and assuming that Cyy is invertible, (6.15.35) becomes

  wy = (1/λ) Cyy^{-1} Cyx wx.    (6.15.36)

Substituting (6.15.36) into (6.15.34) yields

  Cxy Cyy^{-1} Cyx wx = λ^2 Cxx wx.    (6.15.37)

This implies that the projection vector wx is the generalized eigenvector corresponding to the largest generalized eigenvalue λ²max of the matrix pencil (Cxy Cyy^{-1} Cyx, Cxx).
In order to identify nonlinearly correlated projections between the two views, we can apply the kernel extension of canonical correlation analysis, called simply kernel canonical correlation analysis (kernel CCA) [115]. Kernel CCA maps the instances x and y to the higher-dimensional kernel instances φx(x) and φy(y), respectively. Letting

  Sx = [φx(x0), φx(x1), . . . , φx(x_{N−1})],    (6.15.38)
  Sy = [φy(y0), φy(y1), . . . , φy(y_{N−1})],    (6.15.39)


the projection vectors wx^φ and wy^φ in the higher-dimensional kernel space can be rewritten as wx^φ = Sx α and wy^φ = Sy β, where α, β ∈ R^N. Therefore, the objective function (6.15.33) for linearly correlated projections becomes the following objective function for nonlinearly correlated projections [303]:

  L(α, β) = ⟨Sx^T wx^φ, Sy^T wy^φ⟩ / (‖Sx^T wx^φ‖ · ‖Sy^T wy^φ‖)
          = α^T Sx^T Sx Sy^T Sy β / (‖α^T Sx^T Sx‖ · ‖β^T Sy^T Sy‖).    (6.15.40)

Let

  Kx(xi, xj) = ⟨φx(xi), φx(xj)⟩ = φx^T(xi) φx(xj),  i, j = 0, 1, . . . , N − 1,    (6.15.41)
  Ky(yi, yj) = ⟨φy(yi), φy(yj)⟩ = φy^T(yi) φy(yj),  i, j = 0, 1, . . . , N − 1    (6.15.42)

be the kernel functions on the two views, respectively. Defining the kernel matrices

  Kx = Sx^T Sx = [φx^T(x0); φx^T(x1); . . . ; φx^T(x_{N−1})] [φx(x0), φx(x1), . . . , φx(x_{N−1})] = [Kx(xi, xj)]_{i,j=0}^{N−1, N−1}    (6.15.43)

and

  Ky = Sy^T Sy = [φy^T(y0); φy^T(y1); . . . ; φy^T(y_{N−1})] [φy(y0), φy(y1), . . . , φy(y_{N−1})] = [Ky(yi, yj)]_{i,j=0}^{N−1, N−1},    (6.15.44)

the objective function in (6.15.40) can be simplified to

  L(α, β) = ⟨Kx α, Ky β⟩ / (‖Kx α‖ · ‖Ky β‖) = α^T Kx^T Ky β / sqrt( α^T Kx^T Kx α · β^T Ky^T Ky β )    (6.15.45)


subject to

  α^T Kx^T Kx α = 1,  β^T Ky^T Ky β = 1.    (6.15.46)

By the Lagrange multiplier method and the regularization method, we have the following objective function for the dual optimization:

  LD(α, β) = α^T Kx^T Ky β − (λx/2)(α^T Kx^T Kx α − 1) − (νx/2)‖Sx α‖^2
                            − (λy/2)(β^T Ky^T Ky β − 1) − (νy/2)‖Sy β‖^2.    (6.15.47)

The first-order optimality conditions give the results:

  ∂LD/∂α = 0  ⇒  Kx^T Ky β = λx Kx^T Kx α + νx Kx^T α  ⇒  Ky β = λx Kx α + νx α,    (6.15.48)
  ∂LD/∂β = 0  ⇒  Ky^T Kx α = λy Ky^T Ky β + νy Ky^T β  ⇒  Kx α = λy Ky β + νy β.    (6.15.49)

For simplicity, let λ = λx = λy, ν = νx = νy, and κ = ν/λ. Then, from (6.15.49) we get

  β = (1/λ) (Ky + κI)^{-1} Kx α.    (6.15.50)

Substituting this result into (6.15.48), we obtain

  (Kx + κI)^{-1} Ky (Ky + κI)^{-1} Kx α = λ^2 α,    (6.15.51)

or

  Ky (Ky + κI)^{-1} Kx α = λ^2 (Kx + κI) α.    (6.15.52)

This implies that a number of vectors α (and corresponding values λ) can be found by solving the eigenvalue problem (6.15.51) or the generalized eigenvalue problem (6.15.52). Once α is found, Eq. (6.15.50) can be used to find the unique β for each α. That is to say, in addition to the most strongly correlated pair of projections, one can also identify the correlations between other pairs of projections, and the strength of these correlations can be measured by the values of λ.
In the jth projection, the similarity between an original unlabeled instance (xi, yi), i = 1, 2, . . . , N − 1, and the original labeled instance (x0, y0) can be


measured by [303]

  sim_{i,j} = exp(−d^2(Pj(xi), Pj(x0))) + exp(−d^2(Pj(yi), Pj(y0))),  i = 1, . . . , N − 1;  j = 1, . . . , m,    (6.15.53)

where d(a, b) is the Euclidean distance between a and b, m is the number of unlabeled instances (xi, yi) correlating with the labeled instance (x0, y0), and (Pj(xi), Pj(yi)) is the projection of the instance (xi, yi) onto the higher-dimensional kernel instances Kx α and Ky β.
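A minimal NumPy sketch of this regularized kernel CCA computation is given below, using Gaussian kernels for both views; the kernel width sigma, the regularization constant kappa, and the toy data are illustrative assumptions, and the generalized eigenvalue problem (6.15.52) is solved in the equivalent form (6.15.51).

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def kernel_cca(Kx, Ky, kappa=0.1):
    """Solve (Kx + kI)^-1 Ky (Ky + kI)^-1 Kx a = lambda^2 a, then recover b via (6.15.50)."""
    N = Kx.shape[0]
    Rx = np.linalg.inv(Kx + kappa * np.eye(N))
    Ry = np.linalg.inv(Ky + kappa * np.eye(N))
    M = Rx @ Ky @ Ry @ Kx
    evals, evecs = np.linalg.eig(M)                 # M is not symmetric in general
    order = np.argsort(-evals.real)
    lam = np.sqrt(np.clip(evals.real[order], 0.0, None))   # lambda: kernel canonical correlations
    alpha = evecs.real[:, order]
    beta = Ry @ Kx @ alpha / np.where(lam > 1e-12, lam, 1.0)
    return lam, alpha, beta

# toy usage with two nonlinear views of the same latent signal
rng = np.random.default_rng(1)
z = rng.normal(size=(80, 1))
X = np.hstack([z, 0.1 * rng.normal(size=(80, 2))])
Y = np.hstack([np.tanh(z), 0.1 * rng.normal(size=(80, 2))])
lam, alpha, beta = kernel_cca(gaussian_kernel(X), gaussian_kernel(Y))
print(lam[:3])
```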

6.15.3 Penalized Canonical Correlation Analysis

Given two-view data matrices (X, Y), the standard CCA seeks to find (u, v) that maximize cor(Xu, Yv) = u^T X^T Y v:

  max_{u,v} u^T X^T Y v  subject to  ‖Xu‖2^2 ≤ 1 and ‖Yv‖2^2 ≤ 1.    (6.15.54)

If penalties p1(u) ≤ c1 and p2(v) ≤ c2 are included in the CCA, then one obtains the penalized canonical correlation analysis (penalized CCA) of [271]:

  max_{u,v} u^T X^T Y v  subject to  ‖Xu‖2^2 ≤ 1, ‖Yv‖2^2 ≤ 1, p1(u) ≤ c1, p2(v) ≤ c2.    (6.15.55)

If the column vectors of the data matrices X, Y ∈ R^{N×P} are pre-normalized such that they become semi-orthogonal, i.e., X^T X = I_{P×P} and Y^T Y = I_{P×P}, then the penalized CCA simplifies to [271]

  max_{u,v} u^T X^T Y v  subject to  ‖u‖2^2 ≤ 1, ‖v‖2^2 ≤ 1, p1(u) ≤ c1, p2(v) ≤ c2    (6.15.56)

because ‖Xu‖2^2 = u^T X^T X u = ‖u‖2^2 and ‖Yv‖2^2 = ‖v‖2^2. Equation (6.15.56) is called "diagonal penalized CCA." If the penalties are p1(u) = ‖u‖1 and p2(v) = ‖v‖1, then the diagonal penalized CCA becomes the sparse CCA, which gives sparse canonical variants u and v.
Let soft denote the soft-thresholding operator defined by

  soft(a, c) = sign(a)(|a| − c)+,    (6.15.57)


where c > 0 is a constant and

  x+ = x if x > 0, and x+ = 0 if x ≤ 0.    (6.15.58)

Lemma 6.4 ([271]) For the optimization problem

  max_u u^T a  subject to  ‖u‖2^2 ≤ 1, ‖u‖1 ≤ c,    (6.15.59)

the solution is given by

  u = s(a, Δ) / ‖s(a, Δ)‖2,    (6.15.60)

where s(a, Δ) = [soft(a1, Δ), . . . , soft(aP, Δ)]^T, and Δ = 0 if this results in ‖u‖1 ≤ c; otherwise Δ is chosen to be a positive constant such that ‖u‖1 = c.
For the penalized (sparse) CCA problem

  (u, v) = arg max_{u,v} u^T X v  subject to  ‖u‖2^2 ≤ 1, ‖v‖2^2 ≤ 1, ‖u‖1 ≤ c1, ‖v‖1 ≤ c2,    (6.15.61)

Algorithm 6.30 [271] gives the sparse canonical variants ui, vi, i = 1, . . . , K.

Algorithm 6.30 Penalized (sparse) canonical correlation analysis (CCA) algorithm
1. input: the matrix X.
2. initialization: make the truncated SVD X = Σ_{i=1}^K di ui vi^T. Put X1 = X.
3. for k = 1 to K do
4.   uk ← s(Xk vk, Δ1) / ‖s(Xk vk, Δ1)‖2, where Δ1 = 0 if this results in ‖uk‖1 ≤ c1; otherwise Δ1 is chosen to be a positive constant such that ‖uk‖1 = c1.
5.   uk ← uk / ‖uk‖2.
6.   vk ← s(Xk^T uk, Δ2) / ‖s(Xk^T uk, Δ2)‖2, where Δ2 = 0 if this results in ‖vk‖1 ≤ c2; otherwise Δ2 is chosen to be a positive constant such that ‖vk‖1 = c2.
7.   vk ← vk / ‖vk‖2.
8.   if uk, vk have converged then
9.     dk = uk^T Xk vk,
10.  else goto Step 4.
11.  end if
12.  exit if k = K.
13.  Xk+1 ← Xk − dk uk vk^T.
14. end for
15. output: uk, vk, k = 1, . . . , K.
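The core update in Algorithm 6.30 is the soft-thresholding step of Lemma 6.4. The following is a minimal NumPy sketch of one penalized rank-one factor; the bisection search used to pick the threshold Δ so that the l1 constraint is met is one simple choice, not necessarily the procedure of [271], and it assumes c ≥ 1.

```python
import numpy as np

def soft(a, delta):
    # soft(a, delta) = sign(a) * (|a| - delta)_+   (Eq. (6.15.57))
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def l1_ball_unit(a, c, iters=50):
    """Solve max u^T a s.t. ||u||_2 <= 1, ||u||_1 <= c via Lemma 6.4 (bisection on delta)."""
    u = a / (np.linalg.norm(a) + 1e-12)
    if np.abs(u).sum() <= c:
        return u                           # delta = 0 already satisfies the l1 constraint
    lo, hi = 0.0, np.abs(a).max()
    for _ in range(iters):
        delta = 0.5 * (lo + hi)
        s = soft(a, delta)
        u = s / (np.linalg.norm(s) + 1e-12)
        if np.abs(u).sum() > c:
            lo = delta                     # threshold too small: still too dense
        else:
            hi = delta
    return u

def sparse_cca_factor(X, c1, c2, iters=100):
    """One penalized factor (u, v) for max u^T X v with unit-norm and l1 constraints."""
    _, s, vt = np.linalg.svd(X, full_matrices=False)
    v = vt[0]                              # initialize v from the leading singular vector
    for _ in range(iters):
        u = l1_ball_unit(X @ v, c1)
        v = l1_ball_unit(X.T @ u, c2)
    d = u @ X @ v
    return u, v, d
```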


6.16 Graph Machine Learning

Many practical applications in machine learning build on irregular or non-Euclidean structures rather than regular or Euclidean structures. Non-Euclidean structures are also called graph structures, because such a structure is a topological map in the abstract sense of graph theory. Prominent examples of underlying graph-structured data are social networks, information networks, gene data on biological regulatory networks, text documents on word embeddings [73], log data on telecommunication networks, genetic regulatory networks, functional networks of the brain, 3D shapes represented as discrete manifolds, and so on [158]. Moreover, the signal values associated with the vertices of the graph carry the information of interest in observations or physical measurements. Numerous examples can be found in real-world applications, such as temperatures within a geographical area, transportation capacities at hubs in a transportation network, or human behaviors in a social network [76].
When signal values are defined on the vertex set of a weighted and undirected graph, the structured data are referred to as graph signals, where the vertices of the graph represent the entities and the edge weights reflect the pairwise relationships or similarities between these entities.
Graphs can encode complex geometric structures and can be studied with strong mathematical tools such as spectral graph theory [53]. One of the main goals in spectral graph theory is to deduce the principal properties and structure of a graph from its graph spectrum. Spectral graph theory involves three basic aspects: similarity graphs, graph Laplacian matrices, and the graph spectrum.
This section focuses on graph-based machine learning and contains the following two parts:
• Spectral graph theory: similarity graphs, graph Laplacian matrices, and the graph spectrum.
• Graph-based learning: learning the structure of a graph (also called graph construction) from training samples in semi-supervised and unsupervised learning cases.

6.16.1 Graphs

The standard Euclidean structure refers to a space structure on the m-dimensional vector space R^m equipped with the following four Euclidean measures:
1. the standard inner product (also known as the dot product) on R^m,
   ⟨x, y⟩ = x · y = x^T y = x1 y1 + · · · + xm ym;


2. the Euclidean length of a vector x on R^m,
   ‖x‖ = sqrt(⟨x, x⟩) = sqrt(x1^2 + · · · + xm^2);
3. the Euclidean distance between x and y on R^m,
   d(x, y) = ‖x − y‖ = sqrt((x1 − y1)^2 + · · · + (xm − ym)^2);
4. the (nonreflex) angle θ (0° ≤ θ ≤ 180°) between vectors x and y on R^m,
   θ = arccos( ⟨x, y⟩ / (‖x‖ ‖y‖) ) = arccos( x^T y / (‖x‖ ‖y‖) ).

A space structure with none of the Euclidean measures above is known as a non-Euclidean structure.
Consider a set of data points x1, . . . , xN with xi ∈ R^d and some kernel function K(xi, xj) between all pairs of data points xi and xj. The kernel function K(xi, xj) must be symmetric and nonnegative:

  K(xi, xj) = K(xj, xi)  and  K(xi, xj) ≥ 0,    (6.16.1)

with K(xi, xi) = 0 for i = 1, . . . , N. The kernel functions satisfying the above conditions may be any distance measure d(xi, xj), any similarity measure s(xi, xj), or the Gaussian similarity function s(xi, xj) = exp(−‖xi − xj‖^2 / (2σ^2)) (where the parameter σ controls the width of the neighborhoods), depending on the application. If we do not have more information than similarities between data points, a nice way of representing the data is in the form of a weighted undirected similarity graph (called simply a graph) G = (V, E, W) or G = (V, E).

Definition 6.28 (Graph [241, 261]) A graph G(V, E, W), or simply G(V, E), is a collection of a vertex set (or node set) V = {v1, . . . , vN} and an edge set E = {eij}_{i,j=1}^N. A graph G contains nonnegative edge weights associated with each edge: wij ≥ 0. If vi and vj are not connected to each other, then wij = 0. For an undirected graph G, eij = eji and wij = wji; if G is a directed graph, then eij ≠ eji and wij ≠ wji.

356

6 Machine Learning

the graph. The adjacency matrix W can be either binary or weighted. For undirected graphs with N nodes, the adjacency matrix is an N × N real symmetric matrix. The adjacency matrix is also called the affinity matrix of the graph. Definition 6.30 (Degree Matrix) The degree (or degree function) of node j in the graph G(V , E, W), denoted as dj : V → R, represents the number of edges connected to node j : dj =



(6.16.2)

wij ,

i∼j

where i ∼ j denotes all vertices i connected to j by the edges (i, j ) ∈ E. The degree matrix of a graph is a diagonal matrix D = Diag(D11 , . . . , DN N ) = Diag(d1 , . . . , dN ) whose ith diagonal element Dii = di is used to describe the degree of node i in the graph. That is, the degree function di is defined by the sum of all entries of the ith row of the affinity matrix W. Each vertex (or node) of the graph corresponds to a datum, and the edges encode the pairwise relationships or similarities among the data. For example, the vertices of the web are just the web pages, and the edges denote the hyperlinks; in market basket analysis, the items also form a graph by connecting any two items which have appeared in the same shopping basket [301]. A signal x : V → R defined on the nodes of the graph may be regarded as a vector x ∈ Rd . Let the vertex vi ∈ V = {v1 , . . . , vN } represent a data point xi or a document doci . Then, each edge e(i, j ) ∈ E is assigned an affinity score wij forming matrix W which reflects the similarity between vertex vi and vertex vj . The nodes of the graph are the points in the feature space, and an edge is formed between every pair of nodes. The nonnegative weight on each edge, wij ≥ 0, is a function of the similarity between nodes i and j . If the edge weighting functions wij = 0, then the vertices vi and vj are not connected by an edge. Clustering seeks to partition the vertex set V into disjoint sets V1 , . . . , Vm , where by some measure the similarity among the vertices in a set Vi is high, and across different sets (Vi , Vj ) is low, where Vi = {xi1 , . . . , xi,mi } is the ith clustered vertex subset such that m1 + · · · + mk = N . Disjoint subsets V1 , . . . , Vm of the vertex set V imply that V1 ∪ V2 ∪ · · · ∪ Vm = V and Vi ∩ Vj = ∅ for all i = j . Due to the assumption of undirected graph G = (V , E), we require wij = wj i . The affinity matrix or weighted adjacency matrix of the graph is the matrix W = [wij ]N,N i,j =1 with edge weights wij as entries

wij =

⎧ ⎪ ⎨ K(xi , xj ), i = j ; ⎪ ⎩ 0,

(6.16.3) i = j.

6.16 Graph Machine Learning

357

Definition 6.31 (Weighed Graph [301]) A graph is weighted if it is associated with a function w : V × V → R satisfying wij > 0,

if (i, j ) ∈ E,

(6.16.4)

and wij = wj i .

(6.16.5)

Consider a graph with N vertices where each vertex corresponds to a data point. For two data points xi and xj , the weight wij of an edge connecting vertices i and j has four most commonly used choices as follows [41, 232]. 1. 0–1 weighting: wij = 1 if and only if nodes i and j are connected by an edge; otherwise wij = 0. This is the simplest weighting method and is very easy to compute. 2. Heat kernel weighting: If nodes i and j are connected, then wij = e−

xi −xj 2 σ

(6.16.6)

is called the heat kernel weighting. Here σ is some pre-selected parameter. 3. Thresholded Gaussian kernel weighting: The weight of an edge connecting vertices i and j is defined, via a thresholded Gaussian kernel weighting function, as " ! ⎧ [dist(i,j )]2 ⎪ , dist(i, j ) ≤ κ, ⎨ exp − 2θ 2 (6.16.7) wij = ⎪ ⎩ 0, otherwise, for some parameters θ and κ. Here, dist(i, j ) may represent a physical distance between vertices i and j or the Euclidean distance between two feature vectors describing i and j , the latter of which is especially common in graph-based semisupervised learning methods. 4. Dot-product weighting: If nodes i and j are connected, then wij = xi · xj = xTi xj

(6.16.8)

is known as the dot-product weighting. Note that if x is normalized to have unit norm, then the dot product of two vectors is equivalent to the cosine similarity of the two vectors. There are different popular similarity graphs, depending on transforming a given set x1 , . . . , xN of data points with pairwise similarities sij or pairwise distances dij into a graph. The following are three popular methods for constructing similarity graphs [257].

358

6 Machine Learning

1. -neighborhood graph: All points are connected with pairwise distances smaller than . As the distances between all connected points are roughly of the same scale (at most ), weighting the edges would not incorporate more information about the data to the graph. Hence, the -neighborhood graph is usually considered as an unweighted graph. 2. K-nearest neighbor graph: Connect vertex vi with vertex vj if vj is among the K-nearest neighbors of vi . Due to the nonsymmetric neighborhood relationship, this definition leads to a directed graph. There are two methods for making this graph undirected: • One can simply ignore the directions of the edges, that is, vi and vj are connected with an undirected edge if vi is among the K-nearest neighbors of vj or if vj is among the K-nearest neighbors of vi . The resulting graph is called the K-nearest neighbor graph. • Vertices vi and vj are connected if vi is among the K-nearest neighbors of vj and vj is among the K-nearest neighbors of vi , which results in a mutual K-nearest neighbor graph. In these cases, after connecting the appropriate vertices, the edges are weighted by the similarity of their endpoints. 3. Fully connected graph: Connect simply all points with positive similarity with each other, and weight all edges by sij . Since the graph should represent the local neighborhood relationships, this construction is only useful if using the similarity function itself to model local neighborhoods. The Gaussian similarity function  s(xi , xj ) = exp − xi − xj 2 /(2σ 2 ) is an example of such a positive similarity function. This parameter plays a similar role as the parameter  in case of the -neighborhood graph. The main differences between the three graphs above-mentioned are as follows. • The -neighborhood graph uses the distance measure smaller than a threshold , and is an unweighted graph. • The K-nearest neighbor graph considers K-nearest neighbors of vertices, and the edges are weighted by the similarity of their endpoints. • The fully connected graph uses the positive (rather than nonnegative) similarity.

6.16.2 Graph Laplacian Matrices The key problem of learning graph topology can be casted as a problem of learning the so-called graph Laplacian matrices as it uniquely characterizes a graph. In other words, graph Laplacian matrices are an essential operator in spectral graph theory [53].

6.16 Graph Machine Learning

359

Definition 6.32 (Graph Laplacian Matrix) The Laplacian matrix of graph G(V , E, W) is defined as L = D − W with elements ⎧ ⎨ dv , if u = v; L(u, v) = −1, if (u, v) ∈ E; ⎩ 0, otherwise.

(6.16.9)

Here, dv denotes the degree of node i. The corresponding normalized Laplacian matrix is given by [53] ⎧ ⎪ if u = v and dv = 0; ⎨ 1, 1√ √ L(u, v) = − du dv , (u, v) ∈ E; ⎪ ⎩ 0, otherwise.

(6.16.10)

The unnormalized graph Laplacian L is also called combinatorial graph Laplacian. There are two common normalized graph Laplacian matrices [257]: • Symmetric normalized graph Laplacian: Lsym = D−1/2 LD−1/2 = I − D−1/2 WD−1/2 .

(6.16.11)

• Random walk normalized graph Laplacian: Lrw = D−1 L = D−1 W.

(6.16.12)

The quadratic form of Laplacian matrices is given by u Lu = u Du − u Wu = T

T

T

N 

di u2i



N N   j =1

i=1

=

N 

di u2i −

N N  

 ui wij uj

i=1

ui uj wij

i=1 j =1

i=1

⎞ N N N  N   1 ⎝ = di u2i − 2 ui uj wij + di u2j ⎠ 2 ⎛

i=1 j =1

i=1

j =1

⎞ N N N  N N N    1 ⎝ 2  = ui wij − 2 ui uj wij + u2j wij ⎠ , 2 ⎛

i=1

j =1

i=1 j =1

j =1

i=1

360

6 Machine Learning

namely 1  wij (ui − uj )2 ≥ 0, 2 N

uT Lu =

N

(6.16.13)

i=1 j =1

the equality holds if and only if ui = uj , ∀ i, j ∈ {1, . . . , N }, i.e., u = 1N = [1, . . . , 1]T ∈ RN . If u = 1N , then [Lu]i = [(D − W)u]i =

N 

dij uj −

j =1

because uj ≡ 1,

N

j =1 dij

= di , and Lu=λu

N 

wij uj = di − di = 0

j =1

N

j =1 wij

= di . Therefore, we have

Lu = 0 ⇐⇒ λ = 0 for u = 1N .

(6.16.14)

Equations (6.16.13) and (6.16.14) give the following important properties of Laplacian matrices. • The Laplacian matrix L = D − W is a positive semi-definite matrix. • The minimum eigenvalue of the Laplacian matrix L is 0, and the eigenvector corresponding to the minimum eigenvalue is a vector with all values of 1.

6.16.3 Graph Spectrum As the combinatorial graph Laplacian and the two normalized graph Laplacians are all real symmetric positive semi-definite matrices, they admit an eigendecomposition L = UUT with U = [u0 , u1 , . . . , un−1 ] ∈ Rn×n and  = Diag(λ0 , λ1 , . . . , λn−1 ) ∈ Rn×n . A complete set of orthonormal eigenvectors {u0 , u1 , . . . , un−1 } is known as the graph Fourier basis. The multiplicity of a zero Laplacian eigenvalue is equal to the number of connected components of the graph, and hence the real nonnegative eigenvalues are ordered as 0 = λ0 ≤ λ1 ≤ λ2 ≤ · · · ≤ λn−1 = λmax . The set of the graph Laplacian eigenvalues, σ (L) = {λ0 , λ1 , . . . , λn−1 }, is usually referred to as the graph spectrum of L (or the spectrum of the associated graph G). Therefore, in graph signal processing and machine learning, eigenvalues of the Laplacian L are usually ordered increasingly, respecting multiplicities. By “the first k eigenvectors” we refer to the eigenvectors corresponding to the k smallest eigenvalues [257].

6.16 Graph Machine Learning

361

Proposition 6.4 ([53, 184]) The graph Laplacian L satisfies the following properties: (1) For every vector f ∈ Rn and any graph Laplacian among L, Lsys , Lrw , we have 1  f Lf = wij (fi − fj )2 . 2 n

n

T

i=1 j =1

(2) Any graph Laplacian among L, Lsys , Lrw is symmetric and positive semidefinite. (3) The smallest eigenvalue of either L or Lrw is 0, and 1 is an eigenvector associated with eigenvalue 0 (here 1 is a constant vector whose entries are all 1). Although 0 is also an eigenvalue of Lnorm , its associated eigenvector is D1/2 1. (4) If a graph Laplacian has k eigenvalues 0, then it has N −k positive, real-valued eigenvalues. The irrelevant eigenvectors are the ones that correspond to the several smallest eigenvalues of the Laplacian matrix L except for the smallest eigenvalue equal to 0. For computational efficiency, we thus often focus on calculating the eigenvectors corresponding to the largest several eigenvalues of the Laplacian matrix L. These results imply the following graph-based methods. 1. Graph principal component analysis: For given data vectors x1 , . . . , xN and a graph G(V , E, W) with Laplacian L, if p is the number of eigenvalues equal to 0 and q is the number of the smallest eigenvalues not equal to zero, then the Laplacian L has N − (p + q) dominant eigenvalues. The corresponding eigenvectors of L give the graph principal component analysis (GPCA) after principal components in the standard PCA (see Sect. 6.8) are replaced by principal components of the Laplacian L. 2. Graph minor component analysis: If the minor components in standard minor component analysis (MCA) (see Sect. 6.8) are replaced by the relevant eigenvectors corresponding to the q smallest eigenvalues of L not equal to zero, then the standard MCA is extended to the graph minor component analysis (GMCA). 3. Graph K-means clustering: If there are K distinct clustered regions within the N data samples, then there are K = N − (p + q) dominant nonnegative eigenvalues that provide a means of estimating the possible number of clusters within the data samples in graph K-means clustering. Definition 6.33 (Edge Derivative [301]) Let e = (i, j ) denote the edge between vertices i and j . The edge derivative of function f with respect to e at the vertex i is defined as  ? ? wij wij ∂f  = fi − fj . (6.16.15)  ∂e i di dj

362

6 Machine Learning

The local variation of f at each vertex j is then defined to be 8 9 9   2 ∂f  9 ∇j f  = : , ∂e j

(6.16.16)

e 5j

where e 5 j denotes the set of the edges incident with vertex j . The smoothness of f is then naturally measured by the sum of the local variations at each vertex: S(f ) =

1 ∇j f 2 . 2

(6.16.17)

j

The graph Laplacian L = D − W is a difference operator, as, for any signal f ∈ Rn , it satisfies  wij [fi − fj ], (6.16.18) (Lf)(i) = j ∈Ni

where the neighborhood Ni is the set of vertices connected to vertex i by an edge. The classical Fourier transform  f (t)e−j 2π ξ t dt (6.16.19) fˆ(ξ ) = f, ej 2π ξ t  = R

is the expansion of a function f in terms of the complex exponentials, which are the eigenfunctions of the one-dimensional (1-D) Laplace operator − L(ej 2π ξ t ) = −

∂ 2 j 2π ξ t e = (2π ξ )2 ej 2π ξ t . ∂t 2

(6.16.20)

Similarly, the graph Fourier transform fˆ(λ) of any function f = [f (0), f (1), . . . , f (n − 1)]T ∈ Rn on the vertices of G is defined as the expansion of f in terms of the eigenvectors ul = [ul (0), ul (1), . . . , ul (n − 1)]T of the graph Laplacian: fˆ(λl ) = f, ul  =

n−1 

fi u∗l (i), l = 0, 1, . . . , n − 1,

(6.16.21)

i=0

or written as matrix-vector form ˆf = UH f ∈ Rn , where U = [u , u1 , . . . , un−1 ]. 0

(6.16.22)

6.16 Graph Machine Learning

363

The inverse graph Fourier transform is then given by

  fi = Σ_{l=0}^{n−1} f̂(λl) ul(i),  i = 0, 1, . . . , n − 1,  or  f = U f̂.    (6.16.23)
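A minimal NumPy sketch of the graph Fourier transform pair (6.16.22)–(6.16.23); it assumes a symmetric Laplacian L such as the one built in the earlier sketch.

```python
import numpy as np

def graph_fourier_basis(L):
    # For a real symmetric Laplacian, eigh returns real eigenvalues sorted increasingly;
    # the columns of U are the graph Fourier basis {u_0, ..., u_{n-1}}.
    lam, U = np.linalg.eigh(L)
    return lam, U

def gft(U, f):
    return U.T @ f      # f_hat = U^H f  (U is real here)

def igft(U, f_hat):
    return U @ f_hat    # f = U f_hat
```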

In classical Fourier analysis, the eigenvalues {(2π ξ )2 }ξ ∈R carry a specific notion of frequency: for ξ close to zero (low frequencies), the associated complex exponential eigenfunctions are smooth, slowly oscillating functions. On the contrary, if ξ is far from zero (high frequencies), then the associated complex exponential eigenfunctions oscillate much more rapidly. In the graph setting, the graph Laplacian eigenvalues and eigenvectors provide a similar notion of frequency [232]: the graph Laplacian eigenvectors associated with low frequencies λl , vary slowly across the graph, i.e., if two vertices are connected by an edge with a large weight, then the eigenvectors at those locations are likely to be similar. The eigenvectors associated with larger eigenvalues oscillate more rapidly and are more likely to have dissimilar values on vertices connected by an edge with high weight.

6.16.4 Graph Signal Processing

In this subsection, we focus on some basic operations in graph signal processing.

Graph Filtering
In classical signal processing, given an input time signal f(t) and a time-domain filter h(t), frequency filtering is defined as

  f̂out(ξ) = f̂in(ξ) ĥ(ξ),    (6.16.24)

where f̂in(ξ) and f̂out(ξ) are the spectra of the input and output signals, respectively, while ĥ(ξ) is the transfer function of the filter. Taking the inverse Fourier transform of (6.16.24), we obtain the time-domain filtering

  fout(t) = ∫_R f̂in(ξ) ĥ(ξ) e^{j2πξt} dξ = ∫_R fin(τ) h(t − τ) dτ = (f ∗ h)(t),    (6.16.25)

where

  (f ∗ h)(t) = ∫_R fin(τ) h(t − τ) dτ    (6.16.26)

denotes the convolution product of the continuous signal f(t) and the continuous filter h(t).


For a discrete signal f = [f1, . . . , fN]^T, (6.16.25) gives the filtering result of discrete signals

  fout(i) = (f ∗ h)(i) = Σ_{k=0}^{N−1} f(k) h(i − k) = Σ_{k=0}^{N−1} h(k) f(i − k).    (6.16.27)

Similar to (6.16.24), graph spectral filtering can be defined as

  f̂out(λℓ) = f̂in(λℓ) ĥ(λℓ),    (6.16.28)

where λℓ is the ℓth eigenvalue of the Laplacian matrix L. Taking the inverse graph Fourier transform of the above equation, we obtain the discrete graph filtering

  fout(i) = Σ_{ℓ=0}^{N−1} f̂in(λℓ) ĥ(λℓ) uℓ(i),    (6.16.29)

where uℓ(i) is the ith element of the N × 1 eigenvector uℓ = [uℓ(1), . . . , uℓ(N)]^T corresponding to the eigenvalue λℓ.

Graph Convolution
Because the graph signal is discrete, the graph filtering (6.16.29) can be represented using the same convolution notation as discrete signals, i.e.,

  fout(i) = (fin ∗ h)G(i).    (6.16.30)

However, the graph convolution does not have the standard form of the classical convolution (f ∗ h)(i) = Σ_{k=0}^{N−1} f(k) h(i − k) = Σ_{k=0}^{N−1} h(k) f(i − k), because there is no delayed form f(i − k) for a graph signal f(i). By comparing (6.16.30) with (6.16.25), we get the convolution product of graph signals as follows:

  (f ∗ h)G(i) = Σ_{ℓ=0}^{N−1} f̂in(λℓ) ĥ(λℓ) uℓ(i).    (6.16.31)

When using the Laplacian matrix L to represent the graph convolution, we denote

  fin = [fin(1), . . . , fin(N)]^T,    (6.16.32)
  fout = [fout(1), . . . , fout(N)]^T,    (6.16.33)
  ĥ(L) = U Diag( ĥ(λ0), . . . , ĥ(λ_{N−1}) ) U^H.    (6.16.34)


Therefore, the graph convolution (6.16.31) can be rewritten in matrix-vector form as [232]

  (f ∗ h)G = ĥ(L) fin.    (6.16.35)

That is to say, graph filtering can be represented, in Laplacian form, as

  fout = ĥ(L) fin.    (6.16.36)
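A minimal NumPy sketch of spectral graph filtering (6.16.36); the low-pass transfer function ĥ(λ) = 1/(1 + γλ) used in the usage comment is an illustrative choice.

```python
import numpy as np

def graph_filter(L, f_in, h_hat):
    """Compute f_out = h_hat(L) f_in = U diag(h_hat(lambda)) U^T f_in."""
    lam, U = np.linalg.eigh(L)
    return U @ (h_hat(lam) * (U.T @ f_in))

# usage example (smoothing / low-pass filter with illustrative gamma = 2):
# f_out = graph_filter(L, f_in, lambda lam: 1.0 / (1.0 + 2.0 * lam))
```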

p-Dirichlet Norm
Consider the approximation problem of discrete functions on a graph G. Many learning algorithms operate on input spaces X other than R^n, specifically discrete input spaces such as strings, graphs, trees, automata, etc. Such an input space is known as a manifold. In Euclidean space, the smoothness of a vector function f is usually represented via the Euclidean norm as ‖f‖_2^2 = Σ_{i=1}^N Σ_{j=1}^N (fj − fi)^2. However, this representation is not directly applicable to graph signals on a manifold.
Consider an undirected unweighted graph G(V, E, W) consisting of a set of vertices V numbered 1 to n and a set of edges E (i.e., pairs (i, j)), where i, j ∈ V and (i, j) ∈ E ⇔ (j, i) ∈ E. For convenience, we will sometimes use i ∼ j to denote that i and j are neighbors, i.e., (i, j) ∈ E. The adjacency matrix of G is an n × n real matrix W whose entries wij = 1 if i ∼ j, and 0 otherwise. By construction, W is symmetric and its diagonal entries are zero.
To approximate a function on a graph G with weight matrix W, we need to find a "good" function that does not make too many "jumps" between two connected nodes i and j. This requirement can be formalized by the smoothness functional S(f) = Σ_{i∼j} wij (fi − fj)^2 that should be minimized. More generally, we have the following definition of a smoothness function on graphs.

Definition 6.34 ([232]) The discrete p-Dirichlet norm of the graph signal f = [f1, . . . , fN]^T is defined by

  Sp(f) = (1/p) Σ_{i∈V} [ Σ_{j∈Ni} wij (fj − fi)^2 ]^{p/2}.    (6.16.37)

The following are two common p-Dirichlet norms of the graph signal f.
• When p = 1, the 1-Dirichlet norm

  S1(f) = Σ_{i∈V} [ Σ_{j∈Ni} wij (fj − fi)^2 ]^{1/2}    (6.16.38)

is the total variation of the signal with respect to the graph.


• When p = 2, the 2-Dirichlet norm

  S2(f) = (1/2) Σ_{i∈V} Σ_{j∈Ni} wij (fj − fi)^2 = Σ_{(i,j)∈E} wij (fj − fi)^2 = f^T L f    (6.16.39)

is a vector norm of the graph signal vector f weighted by the Laplacian matrix L. As compared with the Euclidean norm, the 2-Dirichlet norm can be viewed as the Euclidean norm weighted by the adjacency wij of two nodes.

Graph Tikhonov Regularization
Suppose we are given a noisy graph signal y = f0 + e, where f0 is a graph signal vector and e is uncorrelated additive Gaussian noise. In order to recover f0, if we use the 2-Dirichlet form S2(f) = f^T L f of the graph signal f instead of the regularization term ‖f‖_2^2 in the Tikhonov regularization method, then the Tikhonov regularization on the graph signal is given by

  arg min_f { ‖f − y‖_2^2 + γ f^T L f }.    (6.16.40)

Proposition 6.5 ([231]) The Tikhonov solution f* of (6.16.40) is given by

  f*(i) = Σ_{ℓ=0}^{N−1} [ 1/(1 + γλℓ) ] ŷ(λℓ) uℓ(i),    (6.16.41)

for all i ∈ {1, 2, . . . , N}.
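In matrix form, setting the gradient of (6.16.40) to zero gives (I + γL) f* = y, i.e., f* = (I + γL)^{-1} y, which is exactly the spectral expression of Proposition 6.5. A minimal NumPy sketch of both forms:

```python
import numpy as np

def graph_tikhonov_denoise(L, y, gamma=1.0):
    """Solve argmin_f ||f - y||^2 + gamma * f^T L f  (Eq. (6.16.40))."""
    n = L.shape[0]
    return np.linalg.solve(np.eye(n) + gamma * L, y)

def graph_tikhonov_denoise_spectral(L, y, gamma=1.0):
    """Equivalent spectral form of Proposition 6.5: shrink each graph frequency by 1/(1 + gamma*lam)."""
    lam, U = np.linalg.eigh(L)
    y_hat = U.T @ y
    return U @ (y_hat / (1.0 + gamma * lam))
```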

Graph Kernels
Kernel-based algorithms capture the structure of X via a kernel K : X × X → R. One of the most general representations of discrete metric spaces are graphs. Regularization on graphs is an important step for manifold learning. For a Mercer kernel K : X × X → R, there is an associated reproducing kernel Hilbert space (RKHS) HK of functions X → R with the corresponding norm ‖ · ‖K. Given a set of labeled examples (xi, yi), i = 1, . . . , l, and a set of unlabeled examples {xj}_{j=l+1}^{l+u}, we consider the following optimization problem:

  f* = arg min_{f ∈ HK} { (1/l) Σ_{i=1}^l V(xi, yi, f) + γA ‖f‖_K^2 + (γI/(l+u)^2) f̂^T L f̂ },    (6.16.42)

where f̂ = [f(x1), . . . , f(x_{l+u})]^T with n = l + u and fi = f(xi) for i = 1, . . . , n, and L is the Laplacian matrix L = D − W, where wij are the edge weights in the data adjacency graph. Here, the diagonal matrix D is given by

Dii = Σ_{j=1}^{l+u} wij. The normalizing coefficient 1/(l+u)^2 is the natural scale factor for the empirical estimate of the Laplace operator; on a sparse adjacency graph it may be replaced by Σ_{i=1}^{l+u} Σ_{j=1}^{l+u} wij.

Corollary 6.1 ([235]) Denote by P = r(L̃) a regularization matrix; then the corresponding kernel matrix is given by the inverse K = r^{-1}(L̃) or the pseudo-inverse K = r^†(L̃) wherever necessary. More specifically, if {(λi, vi)} constitute the eigensystem of L̃, we have

  K = Σ_{i=1}^m r^{-1}(λi) vi vi^T,    (6.16.43)

where 0^{-1} = 0.

In the context of spectral graph theory and segmentation, the following graph kernel matrices are of particular interest [235]:

  K = (I + σ^2 L̃)^{-1}        (Regularized Laplacian),
  K = exp(σ^2/2 L̃)            (Diffusion Process),
  K = (aI − L̃)^p with a ≥ 2   (One-Step Random Walk),
  K = cos(L̃ π/4)              (Inverse Cosine).

Theorem 6.7 ([20]) The minimizer of the optimization problem (6.16.42) admits an expansion

  f*(x) = Σ_{i=1}^{l+u} αi K(xi, x)    (6.16.44)

in terms of the labeled examples {(xi, yi)}_{i=1}^l and unlabeled examples {xj}_{j=l+1}^{l+u}.
If the squared loss function is taken in (6.16.42), i.e.,

  V(xi, yi, f) = (yi − f(xi))^2,    (6.16.45)

then the minimizer of the (l + u)-dimensional expansion coefficient vector α = [α1, . . . , α_{l+u}]^T in (6.16.44) is given by [20]

  α* = ( JK + γA l I + (γI l / (l + u)^2) LK )^{-1} y,    (6.16.46)

where K is the (l + u) × (l + u) Gram matrix over labeled and unlabeled points; y is an (l + u)-dimensional label vector given by y = [y1, . . . , yl, 0, . . . , 0]^T; and


J = Diag(1, . . . , 1, 0, . . . , 0) is an (l + u) × (l + u) diagonal matrix with the first l diagonal entries equal to 1 and the others equal to 0. Algorithm 6.31 shows the manifold regularization algorithm mentioned above.

Algorithm 6.31 Manifold regularization algorithm [20]
1. input: l labeled examples {(xi, yi)}_{i=1}^l and u unlabeled examples {xj}_{j=l+1}^{l+u}.
2. Construct a data adjacency graph with (l + u) nodes using, e.g., k nearest neighbors. Choose edge weights wij = exp(−‖xi − xj‖^2 / (4l)).
3. Choose a kernel function K(x, y) and compute the Gram matrix Kij = K(xi, xj).
4. Compute the graph Laplacian matrix L = D − W, where D is the diagonal matrix with Dii = Σ_{j=1}^{l+u} Wij.
5. Choose γA and γI.
6. Compute α* = ( JK + γA l I + (γI l / (l + u)^2) LK )^{-1} y.
7. output: f*(x) = Σ_{i=1}^{l+u} αi* K(xi, x).

When γI = 0, Eq. (6.16.42) gives zero coefficients over unlabeled data, so the coefficients over labeled data are exactly those for standard RLS.
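A minimal NumPy sketch of Algorithm 6.31 with the squared loss (Laplacian regularized least squares); the Gaussian kernel, its width, and the choice of adjacency weights are illustrative assumptions rather than the settings of [20].

```python
import numpy as np

def gaussian_gram(A, B=None, sigma=1.0):
    B = A if B is None else B
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def laplacian_rls(X_l, y_l, X_u, gamma_A=1e-2, gamma_I=1e-1, sigma=1.0):
    """Laplacian RLS: squared-loss instance of the manifold regularization problem (6.16.42)."""
    l, u = len(X_l), len(X_u)
    X = np.vstack([X_l, X_u])
    K = gaussian_gram(X, sigma=sigma)              # Gram matrix over labeled + unlabeled points
    W = K.copy()                                   # adjacency weights (assumed: same Gaussian)
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W                 # graph Laplacian L = D - W
    J = np.diag(np.r_[np.ones(l), np.zeros(u)])    # first l diagonal entries are 1
    y = np.r_[y_l, np.zeros(u)]
    A = J @ K + gamma_A * l * np.eye(l + u) + (gamma_I * l / (l + u) ** 2) * (L @ K)
    alpha = np.linalg.solve(A, y)                  # expansion coefficients, Eq. (6.16.46)
    predict = lambda Z: gaussian_gram(Z, X, sigma=sigma) @ alpha   # f*(x) = sum_i alpha_i K(x_i, x)
    return alpha, predict
```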

6.16.5 Semi-Supervised Graph Learning: Harmonic Function Method

In practical applications of graph signal processing, graph pattern recognition, graph machine learning (spectral clustering, graph principal component analysis, graph K-means clustering, etc.), and so on, even graph-structured data (such as social networks, information networks, text documents, etc.) are usually given as conveniently sampled data rather than in graph form. To apply spectral graph theory to the sampled data, we must learn these data and transform them into graph form, i.e., into the weighted adjacency matrix W or the graph Laplacian matrix L.
Graph learning problems can generally be thought of as looking for a function f which is smooth and simultaneously close to another given function y. This view can be formalized as the following optimization problem [301]:

  f = arg min_{f ∈ ℓ2(V)} { S(f) + (μ/2) ‖f − y‖_2^2 }.    (6.16.47)

The first term S(f) measures the smoothness of the function f, and the second term (μ/2)‖f − y‖_2^2 measures its closeness to the given function y. The trade-off between these two terms is captured by a nonnegative parameter μ.
We are given l labeled points {(x1, y1), . . . , (xl, yl)} and u unlabeled points {xl+1, . . . , xl+u} with xi ∈ R^d, typically l ≪ u. Let n = l + u be the total number of data points. Consider a connected graph G(V, E, W) with node set V corresponding to the n data points, with nodes L = {1, . . . , l} corresponding to the labeled points with

6.16 Graph Machine Learning

369

class labels y1 , . . . , yl , and nodes U = {l + 1, . . . , l + u} corresponding to the unlabeled points. The task of semi-supervised graph labeling is to first estimate a real-valued function f : V → R on the graph G such that f satisfies following two conditions at the same time [304]: • fi should be close to the given labels yi on the labeled nodes, • f should be smooth on the whole graph. The function f is usually constrained to take values (l)

fi = fi

= f (l) (xi ) ≡ yi

(6.16.48)

on the labeled data {(x1 , y1 ), . . . , (xl , yl )}. At the same time, the function fu = [fl+1 , . . . , fl+u ]T assigns the labels yl+i to the unlabeled instances xl+1 , . . . , xl+u via yˆl+i = sign(fl+i ) in binary classification yi ∈ {−1, +1}. An approach to graph-based semi-supervised labeling was proposed by Zhu et al. [306]. In this approach, labeled and unlabeled data are represented as vertices in a weighted graph, with edge weights encoding the similarity between instances, and the semi-supervised labeling problem is then formulated in terms of a Gaussian random field on this graph, where the mean of the field is characterized in terms of harmonic functions, and is efficiently obtained using matrix algebra methods. It is assumed that an n × n symmetric weight matrix W on the edges of the graph is given. For example, when xi ∈ Rd , the (i, j )th entries of the weight matrix can be  d   (xik − xj k )2 wij = exp − , (6.16.49) σk k=1

where xik is the kth component of instance xi ∈ Rd and σ1 , . . . , σd are length scale hyperparameters for each dimension. Thus, the weight matrix can be decomposed into the block form [306]:  W=

 Wll Wlu , Wul Wuu

(6.16.50)

where Wll (i, j ) = wij , i, j = 1, . . . , l; Wlu (i, j ) = wi,l+j , i = 1, . . . , l; j = 1, . . . , u; Wul (i, j ) = wl+i,j , i = 1, . . . , u; j = 1, . . . , l; Wuu (i, j ) = wl+i,l+j , i, j = 1, . . . , u. The Gaussian function (6.16.49) shows that nearby points in Euclidean space are assigned large edge weight or other more appropriate weight values.

370

6 Machine Learning

Let the real-valued function vector   f f = l = [f (1), . . . , f (l), f (l + 1), . . . , f (l + u)]T , fu

(6.16.51)

where fl = [f1 (l), . . . , fl (l)]T fu = [fu (1), . . . , fu (u)]T

or

fl (i) = f (i) ≡ yi , i = 1, . . . , l;

or

fu (i) = f (l + i), i = 1, . . . , u.

(6.16.52) (6.16.53)

Since fl (i) ≡ yi , i = 1, . . . , l are known, fu is the u × 1 vector to be found. Once fu can be found by using the graph, then the labels to unlabeled data xl+1 , . . . , xl+u can be assigned: yˆl+i = sign(fu (i)), i = 1, . . . , u. Intuitively, unlabeled points that are nearby in the graph should have similar labels. This results in the quadratic energy function E(f ) =

1 wij (fi − fj )2 . 2

(6.16.54)

i,j

By [306], the minimum energy function f = arg minf|L =f(l) E(f ) is harmonic; namely, it satisfies f = 0 on unlabeled data points U , and is equal to fl on the labeled data points L, where = D − W is the Laplacian of the graph G. From f = 0 or         Wll Wlu fl 0l×1 Dll − = Duu Wul Wuu fu 0u×1 it follows that the harmonic solution is given by [306]: fu = (Duu − Wuu )−1 Wul fl = (I − Puu )−1 Pul fl ,

(6.16.55)

where 

−1 D−1 ll Wll Dll Wlu P=D W= −1 −1 Duu Wul Duu Wuu   Pll Plu . = Pul Puu −1



(6.16.56)

The above discussions can be summarized into the harmonic function algorithm for semi-supervised graph learning, as shown in Algorithm 6.32. Semi-supervised labeling is closely related to important applications. For example, in electric networks the edges of a graph G can be imagined to be resistors with conductance W , where nodes labeled 1 are connected to a positive voltage source,

6.16 Graph Machine Learning

371

Algorithm 6.32 Harmonic function algorithm for semi-supervised graph learning [306] 1. input: Labeled data (x1 , y1 ), . . . , (xl , yl )} and unlabeled data {xl+1 , . . . , xl+u } with l / u. 2. Choice a Gaussian weight function wij on the graph G(E, V ), where i, j = 1, . . . , n and n = l + u.  3. Compute the degree dj = i=1 wij , j = 1, . . . , n. u,l 4. Compute Duu = Diag(dl+1 , . . . , dl+u ), Wuu = [wl+i,l+j ]u,u i,j =1 and Wul = [wl+i,j ]i,j =1 . −1 −1 5. Puu ← Duu Wuu , Pul ← Duu Wul . 6. fu ← (I − Puu )−1 Pul fl . 7. output: the labels yˆl+i = sign(fu (i)), i = 1, . . . , u.

and points labeled 0 to ground [306]. Then fu (i) is the voltage in the resulting electric network on the ith unlabeled node. Once the labels yˆl+1 are estimated by applying Algorithm 6.32, the semisupervised graph classification or regression problem becomes a supervised one.

6.16.6 Semi-Supervised Graph Learning: Min-Cut Method Semi-supervised graph learning aims at constructing a graph from the labeled data {(x1 , y1 ), . . . , (xl , yl )} and unlabeled data {xl+1 , . . . , xl+u } with xi ∈ Rd , i = 1, . . . , l + u. These data are drawn independently at random from a probability density function p(x) on a domain M ⊆ Rd . The question is how to partition the nodes in a graph into disjoint subsets S and T such that the source s is in S and the sink t is in T . This problem is called s/t cut problem. In combinatorial optimization, the cost of an s/t cut, C = {S, T }, is defined as the sum of the costs of “boundary” edges (p, q), where p ∈ S and q ∈ T . The minimum cut (min-cut) problem on a graph is to find a cut that has the minimum cost among all cuts. Min-cut Let X = {x1 , . . . , xl+u } be a set of l + u points, and S be a smooth hypersurface that is separated into two parts S1 and S2 such that X1 = X ∩ S1 = {x1 , . . . , xl } and X2 = X ∩ S2 = {xl+1 , . . . , xl+u } are the data subsets which land in S1 and S2 , respectively. We are given two disjoint subsets of nodes L = {1, . . . , l}, S = L+ , and T = L− , corresponding to the l labeled points with labels y1 , . . . , yl , i.e., L = S ∪ T and S ∩ T = ∅. The degree of dissimilarity between these two pieces S and T can be computed as total weight of the edges that have been removed. In graph theoretic language, it is called the cut that is denoted by cut(S, T ) and is defined as [230]: cut(S, T ) =

 i∈S,j ∈T

wij .

(6.16.57)

372

6 Machine Learning

The optimal binary classification of a graph is to minimize this cut value, resulting in a most popular method for finding the minimum cut of a graph. Normalized Cut However, the minimum cut criteria favors cutting small sets of isolated nodes in the graph [279]. To avoid this unnatural bias for partitioning out small sets of points, Shi and Malik [230] proposed the normalized cut (Ncut) as the disassociation measure to compute the cut cost as a fraction of the total edge connections to all the nodes in the graph, instead of looking at the value of total edge weight connecting the two partitions (S, T ). Definition 6.35 (Normalized Cut [230]) The normalized cut (Ncut) of two disjointed partitions S ∪ T = L and S ∩ T = ∅ is denoted by Ncut(S, T ) and is defined as Ncut(S, T ) =

cut(S, T ) cut(S, T ) + , assoc(S, L) assoc(T , L)

(6.16.58)

where assoc(S, L) =



wij

and

assoc(T , L) =

i∈S,j ∈L



wij

(6.16.59)

i∈T ,j ∈L

are, respectively, the total connections from nodes in S and T to all nodes in the graph. In the same spirit, a measure for total normalized association within groups for a given partition can be defined as [230] Nassoc(S, T ) =

assoc(T , T ) assoc(S, S) + , assoc(S, L) assoc(T , L)

(6.16.60)

where assoc(S, S) and assoc(T , T ) are total weights of edges connecting nodes within S and T , respectively. The association and disassociation of a partition are naturally related: Ncut(S, T ) =

cut(S, T ) cut(S, T ) + assoc(S, L) assoc(T , L)

assoc(S, L) − assoc(S, S) assoc(T , L) − assoc(T , T ) + assoc(S, L) assoc(T , L)   assoc(S, S) assoc(T , T ) =2− + assoc(S, L) assoc(T , L) =

= 2 − Nassoc(S, T ).

(6.16.61)

6.16 Graph Machine Learning

373

This means that the two partition criteria, attempting to minimize Ncut(S, T ) (disassociation between the groups (S, T )) and attempting to maximize Nassoc(S, T ) (association within the groups), are in fact identical. Graph Mincut Learning Consider the complete graph whose vertices are associated with the points xi , xj , and where the weight of the edge between xi and xj , i = j is given. Let f = [f1 , . . . , fN ]T be the indicator vector for X1 : fi =

⎧ ⎨ 1, if xi ∈ X1 , ⎩

(6.16.62) 0, otherwise.

There are two quantities of interest [187]: @ • S p(s)ds, which measures the quality of the partition S in accordance with the weighted volume of the boundary. • fT Lf, which measures the quality of the empirical partition in terms of its cut size. When making graph constructions, it is necessary to consider choosing which graph to use from the following three graph types [19, 173]. 1. k-nearest neighbor (kNN) graph: The samples xi and xj are considered as neighbors if xi is among the k-nearest neighbors of xj or xj is among the knearest neighbors of xi , where k is a positive integer and the k-nearest neighbors are measured by the usual Euclidean distance. 2. r-ball neighborhood graph: The samples xi and xj are considered as neighbors if and only if xi − xj  < r, where the norm is the usual Euclidean norm, r is a fixed radius and two points are connected if their distance does not exceed the threshold radius r. Note that due to the symmetry of the distance we do not have to distinguish between directed and undirected graphs. The r-ball neighborhood graph is simply called the “r-neighborhood” as well. 3. Complete weighted graph: There is an edge e = (i, j ) between each pair of distinct nodes xi and xj (but no loops). This graph is not considered as a neighborhood graph in general, but if the weight function is chosen in such a way that the weights of edges between nearby nodes are high and the weights between points far away from each other are almost negligible, then the behavior of this graph should be similar to that of a neighborhood graph. One such weight function is the Gaussian weight function wij = exp(−xi − xj 2 /(2σ 2 )). The k-nearest neighbor graph is a directed graph, while the r-ball neighborhood graph is an undirected graph. The weights used on neighborhood graphs usually depend on the distance of the vertices of the edge and are non-increasing. In other words, the weight wij of an edge (xi , xj ) is given by wij = f (dist(xi , xj )) with a non-increasing weight function f . The weight functions to be considered here are the unit weight function

374

6 Machine Learning

f ≡ 1, which results in the unweighted graph, and the Gaussian weight function   1 u2 1 , exp − f (u) = 2 σ2 (2π σ 2 )d/2

(6.16.63)

where the parameter σ > 0 defines the bandwidth of the Gaussian function. It is usually assumed that for a measurable set S ⊆ Rd , the surface integral @ along a decision boundary S, μ(S) = S p(s)ds, is the measure on Rd induced by probability distribution p. This measure is called “weighted boundary volume.” Narayanan et al. [187] prove that the weighted boundary volume is approximated √ by N √πσ fT Lf: 

√ π p(s)ds ≈ √ fT Lf, N σ S

(6.16.64)

where L is the normalized graph Laplacian, f = [f1 , . . . , fl+u ]T is the weight vector function, and σ is the bandwidth of the edge weight Gaussian function. For the classification problem, we are given the training sample data as X = [x1 , . . . , xN ], where xi ∈ Rd and N is the total number of training samples. Traditional graph construction methods typically decompose the graph construction process into two steps: graph adjacency construction and graph weight calculation [286]. 1. Graph adjacency construction: There are two main construction methods [19]: -ball neighborhood and k-nearest neighbors. 2. Three Graph weight calculation approaches [286] • Heat Kernel [19]

wij =

⎧ −x −x 2 ⎪ ⎨ e it j , if xi and xj are neighbors; ⎪ ⎩

0,

(6.16.65)

otherwise,

where t is the heat kernel parameter. Note when t → ∞, the heat kernel will produce binary weight and the graph constructed will be a binary graph, i.e.,

wij =

⎧ ⎪ ⎨ 1, if xi and xj are neighbors; ⎪ ⎩ 0, otherwise.

(6.16.66)

• Inverse Euclidean Distance [63] wij = xi − xj −1 2 , where xi − xj 2 is the Euclidean distance between xi and xj .

(6.16.67)

6.16 Graph Machine Learning

375

• Local linear reconstruction coefficient [216] -2  wij xj ξ(W) = -xi − - , i

j

subject to



wij = 1, ∀ i,

(6.16.68)

j

where wij = 0 if sample xi and xj are not neighbors. If there are N samples in the original data, then leave-one-out cross-validation (LOOCV) uses each sample as a testing set separately, and the other N − 1 samples are used as a training set. Therefore, LOOCV will get N models, and the average of the classification accuracy of the final testing set of N models will be used as the performance index of LOOCV classifier under this condition. LOOCV error is defined as the count of x ∈ L ∪ U with label of x different from label of its nearest neighbor. This is also the value of the cut. Let nnuv be the indicator of “v is the nearest neighbor of u” for K-nearest neighborhood graph, and wuv be the weight of the label of v when classifying u for K-averaged nearest neighborhood. Theorem 6.8 ([29]) Suppose edge weights between example nodes are defined as follows: for each pair of nodes u and v, define nnuv = 1 if v is the nearest neighbor of u, and nnuv = 0 otherwise. If defining w(u, v) = nnuv + nnvu for K-nearest neighborhood graph or w(u, v) = nnuv + nnvu for K-averaged nearest neighborhood graph, then for any binary labeling of the examples u ∈ U , the cost of the associated cut is equal to the number of LOOCV mistakes made by 1-nearest neighbor on L ∪ U . The above theorem implies that min-cut for K-nearest neighborhood graph or Kaveraged nearest neighborhood graph is identical to minimizing the LOOCV error on L ∪ U . The following summarize the steps in Graph Mincut Learning Algorithm [29]. 1. Construct a weighted graph G = (V , E, W), where V = L ∪ U ∪ {v+ , v− }, and E ⊆ V × V . Associated with each edge e ∈ E is a weight w(e). The vertices v+ and v− are called the classification vertices, and all other vertices the example vertices. 2. w(v, v+ ) = ∞ for all v ∈ L+ and w(v, v− ) = ∞ for all v ∈ L− . 3. The function assigning weights to edges between example nodes xi , xj are referred to as the edge weighting function wij . 4. v+ is viewed as the source, v− is viewed as the sink, and the edge weights are treated as capacities. Removing the edges in the cut partitions the graph into two sets of vertices: V+ with v+ ∈ V+ and V− with v− ∈ V− . For concreteness, if there are multiple minimum cuts, then the algorithm chooses the one such that V+ is smallest (this is always well defined and easy to obtain from the flow). 5. Assign a positive label to all unlabeled examples in the set V+ and a negative label to all unlabeled examples in the set V− .

376

6 Machine Learning

Minimum Normalized Cut If a node in S = L+ is viewed as a source s ∈ S and a node in T = L− is viewed as a sink t ∈ T , then minimizing Ncut(S, T ) is a minimum s/t cut problem. One of the fundamental results in combinatorial optimization is [34] that the minimum s/t cut problem can be solved by finding a maximum flow from the source s to the sink t. Loosely speaking, maximum flow is the maximum “amount of water” that can be sent from the source to the sink by interpreting graph edges as directed “pipes” with capacities equal to edge weights. The theorem of Ford and Fulkerson [95] states that a maximum flow from s to t saturates a set of edges in the graph dividing the nodes into two disjoint parts {S, T } corresponding to a minimum cut. Thus, mincut and max-flow problems are equivalent. In this sense, min-cut, max-flow, and min-cut/max-flow methods refer to as the same method. Let f be an N = |L| dimensional indicator vector,  fi = 1 if node i is in L+ and −1, otherwise. Under the assumption that di = j =1 wij is the total connection from node i to all other nodes, Ncut(S, T ) can be rewritten as cut(S, T ) cut(S, T ) + assoc(S, L) assoc(T , L)   (fi >0,fj 0 di fi 0.

(6.18.5)

For each control action at , a reward or a punishment is presumed. In other words, if this action improves the system error, then a reward will be anticipated for this action; otherwise a punishment will be applied. The selection of reward or punishment is specified by comparison between errors in t and t + 1. Thus, the immediate reward function at time t for each action at is given by R(st , at ) =

∞ 

γ t r(st , at ),

(6.18.6)

t=0

where 0 < γ < 1 and r(st , at ) ≤ B in which B is the maximum value of the reward and is selected through trial-and-error. The state value function is defined as the maximum value of the immediate reward functions: V (s) = max R(st , at ). at ∈A

(6.18.7)

The state value function is described by the Bellman equation [21] V (s) = max {r(s, a) + γ V (f(s, a))} . a∈A

(6.18.8)

A scalar value called a payoff is received by the control system for transitions from one state to another. The aim of the system is to find a control policy which maximizes the expected future discount of the payoff received, known as the “return” [218]. The state value function V (st ) represents the discounted sum of payoff received from time step t onwards, and this sum will depend on the sequence of actions taken. The taken sequence of actions can be determined by the policy. Hence, the optimal control policy can be expressed as a∗ (s) = arg max {r(s, a) + γ V (f(s, a))} ,

(6.18.9)

a∈A

from which a reward or punishment function, called action-value function or simply Q-function, can be obtained as Q(s, a) = r(s, a) + γ V (f(s, a)).

(6.18.10)

394

6 Machine Learning

In terms of Q-function, the state value function V (s) can be reexpressed as V (s) = max Q(s, a), a∈A

(6.18.11)

and thus the optimal control policy is given by a∗ (s) = arg max Q(s, a).

(6.18.12)

a∈A

The state value function V (s) or action-value function Q(s, a) is updated by a certain policy π, where π is a function that maps states s ∈ S to actions a ∈ A, i.e., π : S → A ⇒ a = π (s).

6.19 Q-Learning Q-learning, proposed by Watkins in 1989 [263], is a popular model-free reinforcement learning algorithm, and can be used to optimally solve Markov decision processes (MDPs). Moreover, Q-learning is a technology based on action-value function to evaluate which action to take, and action-value function determines the value of taking an action in a certain state.

6.19.1 Basic Q-Learning As an important model-free asynchronous dynamic programming (DP) technique, Q-learning provides agents with the capability of learning to act optimally in Markovian domains by experiencing the consequences of actions, without requiring them to build maps of the domains. As the name implies, Q-learning is how to learn a value (i.e., Q) function. There are two methods for learning the Q-function [248]. • On-policy methods are namely Sarsa for on-policy control [218]: 2 3 Sarsa ← rt+1 + γ Q(st+1 , at+1 ) − Q(st , at ) , Q(st , at ) ← Q(st , at ) + αSarsa .

(6.19.1) (6.19.2)

• Off-policy methods are Q-learning for off-policy control [263]: b∗ ← arg maxb∈A(st+1 ) Q(st+1 , b), 2 3 Qlearning ← rt+1 + γ Q(st+1 , b∗ ) − Q(st , at ) ,

(6.19.4)

Q(st , at ) ← Q(st , at ) + αQlearning ,

(6.19.5)

where α is a stepsize parameter [106].

(6.19.3)

6.19 Q-Learning

395

The basic framework of Q-learning is as follows. Under the assumption that the number of system states and the number of agent actions must be finite, the learning system consists of a finite set of environment states S = {s1 , . . . , sn } and a finite set of agent actions A = {a1 , . . . , an }. At each step of interaction with the environment, the agent observes the environment state st and issues an action at based on the system state. By performing the action at , the system moves from one state st to another st+1 . The new state st+1 gives the agent a penalty pt = p(st ) which indicates the value of the state transition st → st+1 . The agent keeps a value function (named Q-function) Qπ (s, a) for each state-action pair (s, a), which represents the expected long-term penalty or reward if the system starts from state s, taking action a, and then following policy π . Based on this Q-function, the agent decides which action should be taken in current state to achieve the minimum long-term penalties or the maximum rewards. Therefore, the core of the Q-learning algorithm is a value iteration update of the Q-value function. The Q-value for each state-action pair is initially chosen by the designer and then it will be updated each time an action is issued and a penalty is received based on the following expression [229]: Q(st+1 , at+1 )



⎞ expected discount pernalty

⎜C pt+1 + = Q(st , at ) + ηt (st , at ) × ⎜ A BC D A BC D ⎝ ABCD old value

learning rate

pernalty

old value

DA B C DA B⎟ γ min Q(st+1 , a) − Q(st , at )⎟ ⎠. a ABCD BC D A discount factor min future value

(6.19.6) Here, st , at , and pt are the state visited, action taken, and penalty received at time t, respectively; and ηt (s, a) ∈ (0, 1) is the learning rate. The discount factor γ is a value between 0 and 1 which gives more weight to the penalties in the near future than the far future. Interestingly, if we let Q(st , at ) = xi , ηt (st , at ) = α, γ mina Q(st+1 , a) = Fi (x) and pt+1 = ni , then Q-learning (6.19.6) has the same general structure as stochastic approximation algorithms [252]:   xi = xi + α Fi (x) − xi + ni ,

i = 1, . . . , n,

(6.19.7)

where x = [x1 , . . . , xn ]T ∈ Rn , F1 , . . . , Fn are mappings from Rn into R, i.e., Fi : Rn → R, ni is a random noise term and α is a small, usually decreasing, stepsize. On the mapping vector function f(x) = [F1 (x), . . . , Fn (x)]T ∈ Rn , one makes the following assumptions [252]. (a) The mapping f is monotone; that is, if x ≤ y, then f(x) ≤ f(y). (b) The mapping f is continuous. (c) The mapping f(x) has a unique fixed point x∗ .

396

6 Machine Learning

(d) If 1 ∈ Rn is the summing vector with all components equal to 1, and r is a positive scalar, then f(x) − r1 ≤ f(x − r1) ≤ f(x + r1) ≤ f(x) + r1.

(6.19.8)

In basic Q-learning, the agent’s experience consists of a sequence of distinct stages or episodes. In the tth episode, the agent [264]: • • • • •

observes its current state st , selects and performs an action at , observes the subsequent state st+1 , receives an immediate payoff rt , and adjust its Q-function value using a learning factor ηt (st , at ), according to Eq. (6.19.6). Algorithm 6.38 shows a basic Q-learning algorithm.

Algorithm 6.38 Q-learning [296] 1. 2. 3. 4. 5. 6. 7. 8. 9. 10.

initialization: Q, s Repeat (for each step of episode) Choose action a from state s based on Q and some exploration strategy (e.g., -greedy) Take action a, observe r, s a∗ ← arg maxa Q(s , a) δ ← r + γ Q(s , a∗ ) − Q(s, a) Q(s, a) ← Q(s, a) + α(s, a)δ s ← s until end output: s.

Q-learning has four important variants: double Q-learning, weighted double Qlearning, online connectionist Q-learning, and Q-learning with experience replay. They are the topics in the next three subsections.

6.19.2 Double Q-Learning and Weighted Double Q-Learning Although Q-learning can be used to optimally solve Markov decision processes (MDPs), its performance may be poor in stochastic MDPs because of large overestimation of the action-values [255]. To cope with overestimation caused by Q-learning, double Q-learning that was developed in [255] uses the double estimator method due to using two sets A B = {μB , . . . , μB } to estimate the of estimators: μA = {μA M 1 , . . . , μM } and μ 1 maximum expected value of the Q-function for the next state, maxa E{Q(s , a )}. Both sets of estimators are updated with a subset of the samples drawn such that B S = S A ∪ S B and SA ∩ SB = ∅. Like the single estimator μi , both μA i and μi are

6.19 Q-Learning

397

unbiased if one assumes that the samples are split in a proper manner, for instance, randomly, over the two sets of estimators. Suppose we are given a set of M random variables X = {x1 , . . . , xM }. In many problems, one is interested in finding the maximum expected value maxi E{xi }. However, it is impossible to determine maxi E{xi } exactly if there is no knowledge of the functional form and parameters of the underlying distributions of the variables in X. Therefore, this maximum expected value maxi {xi } is usually approximated by constructing approximations for E{xi } for all i. GM Let S = S be a set of samples, where Si is the subset containing i i=1 samples for the variable xi . It is assumed that the samples in Si are independent and identically distributed (i.i.d.). Then, unbiased estimates for the expected values can be obtained by computing the sample average for each variable: E{xi } =  def E{μi } ≈ μi (S) = |S1i | s∈Si s, where μi is a mean estimator for variables xi . This approximation is unbiased since every sample s ∈ Si is an unbiased estimate for the value of E{xi }. As shown in Algorithm 6.39, the double Q-learning stores two Q-functions, QA and QB , and uses two separate subsets of experience samples to learn them. The action selected is calculated based on the average of the two Q-values for each action and the -greedy exploration strategy. With the same probability, each Q-function is updated using a value from the other Q-function for the next state. Then, the action a∗ is the maximizing action in state s based on the Q1 value. Algorithm 6.39 Double Q-learning [255] 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20.

input: X = {x1 , . . . , xM } initialize QA , QB , s repeat Choose a from s based on QA (s, ·) and QB (s, ·) (e.g., Q-greedy in QA + QB ) Take action a, observe r, s With 0.5 probability to Choose whether to update QA or QB if update QA then a∗ ← arg maxa QA (s , a) α A (s, a) ← r + γ QB (s , a∗ ) − QA (s, a) δ A ← r + γ QB (s , a∗ ) − QA (s, a) QA (s, a) ← QA (s, a) + α A (s, a)δ A else if update QB then a∗ ← arg maxa QB (s , a) α B (s, a) ← r + γ QA (s , a∗ ) − QB (s, a) δ B ← r + γ QA (s , a∗ ) − QB (s, a) QB (s, a) ← QB (s, a) + α B (s, a)δ B s ← s end if until end  output: xˆ i ≈ μi (S) = |S1i | s∈Si s.

As opposed to Q-learning, the double Q-learning uses the value QB (s , a∗ ) rather than the value QA (s , a∗ ) to update QA .

398

6 Machine Learning

Since QB is updated with a different set of experience samples, QB (s , a∗ ) is an unbiased estimate for the value of the action a∗ in the sense that E{QB (s , a∗ )} = E{Q(s , a∗ )}. Similarly, the value of QA (s , a∗ ) is also unbiased. However, since E{QB (s , a∗ )} ≤ maxa E{QB (s , a)} and E{QA (s , a∗ )} ≤ maxa E{QA (s , a)}, both estimators QB (s , a∗ ) and QA (s , a∗ ) sometime have negative biases. This results in double Q-learning underestimating action values in some stochastic environments. Weighted double Q-learning [296] combines Q-learning and double Q-learning for avoiding the overestimation in Q-learning and the underestimation in double Q-learning, as shown in Algorithm 6.40. Algorithm 6.40 Weighted double Q-learning [296] 1. input: X = {x1 , . . . , xM } 2. initialize QA , QB , s 3. repeat 4. Choose a from s based on QA and QB (e.g., -greedy in QA + QB ) 5. Take action a, observe r, s 6. With 0.5 probability to Choose whether to update QA or QB 7. if chose to update QA then 8. a∗ ← arg maxa QA (s , a) 9. aL ← arg mina QA (s , a) |QB (s ,a∗ )−QB (s ,aL )| c+|QB (s ,a∗ )−QB (s ,aL )| r + γ [β A QA (s , a∗ ) + (1 − β A )Q

10.

βA ←

11. 12. 13. 14. 15. 16.

 ∗ A δA ← B (s , a )] − Q (s, a) A B  ∗ A α (s, a) ← r + γ Q (s , a ) − Q (s, a) QA (s, a) ← QA (s, a) + α A (s, a)δ A else if chose to update QB then a∗ ← arg maxa QB (s , a) aL ← arg mina QB (s , a)

17.

βB ←

|QA (s ,a∗ )−QA (s ,aL )| c+|QA (s ,a∗ )−QA (s ,aL )| r + γ [β B QB (s , a∗ ) + (1 − β B )QA (s , a∗ )] − QB (s, a)

18. δB ← 19. α B (s, a) ← r + γ QA (s , a∗ ) − QB (s, a) 20. QB (s, a) ← QB (s, a) + α B (s, a)δ B 21. s ← s 22. end if 23. until end  24. output: xˆ i ≈ μi (S) = |S1i | s∈Si s.

The following are the comparison among the Q-learning, the double Q-learning, and the weighted double Q-learning. 1. The Q-learning is a single estimator for Q(s, a). 2. The double Q-learning is a double estimator for both QA (s, a) and QB (s, a). 3. The weighted double Q-learning is a weighted double estimator for both QA (s, a) and QB (s, a). The main differences between the double Q-learning

6.19 Q-Learning

399

Algorithm 6.39 and the weighted double Q-learning Algorithm 6.40 are: • Algorithm 6.40 uses the weighted term β A QA (s , a∗ ) + (1 − β A )QB (s , a∗ ) to replace QB (s , a∗ ) when updating δ A in Algorithm 6.39; • Algorithm 6.40 uses the weighted term β B QB (s , a∗ ) + (1 − β B )QA (s , a∗ ) to replace QA (s , a∗ ) when updating δ B in Algorithm 6.39. These changes result in the name “weighted double Q-learning.”

6.19.3 Online Connectionist Q-Learning Algorithm In order to represent the Q-function, Lin [163] chose to use one neural network for each action, which avoids the hidden nodes receiving conflicting error signals from different outputs. The neural network weights are updated by [163]   wt+1 = wt + η rt + γ max Qt+1 − Qt ∇w Qt ,

(6.19.9)

a∈A

where η is the learning constant, rt is the payoff received for the transition from state xt to xt+1 , Qt = Q(xt , at ), and ∇w Qt is a vector of the output gradients ∂Qt /∂wt calculated by backpropagation. Rummery and Niranjan [218] proposed an alternative update algorithm, called modified connectionist Q-learning. This algorithm is based more strongly on temporal difference, and uses the following update rule: wt = η (rt + γ Qt+1 − Qt )

t  (γ λ)t−k ∇w Qk .

(6.19.10)

k=0

This update rule is different from the normal Q-learning in the use of the Qt+1 associated with the action selected instead of the greedy maxa∈A Qt+1 . In the following, we discuss how to calculate the gradient ∇w Qt . For this end, define a backpropagation neural network as a collection of interconnected units arranged in layers. Let i, j , and k refer to the different neurons in the network; with error signal propagating from right to left, as shown in Fig. 6.4. Fig. 6.4 One neural network as a collection of interconnected units, where i, j , and k are neurons in output layer, hidden layer, and input layer, respectively .

k

wk j w jk

j

w ji

i

wi j Function signals Error signals

oi

400

6 Machine Learning

Neuron j lies in a layer to the left of neuron i, and neuron k lies in a layer to the left of the neuron j when neuron j is a hidden unit. Letting wij be a weight on a connection from layer i to j , then the output from layer i is given by [218]: oi = f (σi ),

(6.19.11)

 where σi = j wij oj and f (σ ) is a sigmoid function. The output gradient with respect to the output layer weights wij is defined as ∂f (σi ) ∂f (σi ) ∂σi ∂oi = = · = f  (σi )oj , ∂wij ∂wij ∂σi ∂wij

(6.19.12)

where f  (σi ) = ∂f∂σ(σi i ) is the first-order differential of the sigmoid function f (σi ). The gradient of the output oi with respect to the hidden layer weights wj k is given by ∂oi ∂f (σi ) ∂f (σi ) ∂σi ∂oj ∂σj = = · · · ∂wj k ∂wj k ∂σi ∂oj ∂σj ∂wj k = f  (σi ) · wij · f  (σj ) · ok .

(6.19.13)

Modified connectionist Q-learning of Rummery and Niranjan [218] is shown in Algorithm 6.41. Algorithm 6.41 Modified connectionist Q-learning algorithm [218] 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11.

initialization: Reset all eligibilities e0 = 0 and put t = 0. repeat Select action at ; If t > 0, wt = wt−1 + η(rt−1 + Qt − Qt−1 )et−1 . Calculate ∇w Qt with respect to selected action at only. et = ∇w Qt + γ λet−1 . Perform action at , and receive the payoff rt . If trial has not ended, t ← t + 1 and go to Step 3. until end output: Qt .

6.19.4 Q-Learning with Experience Replay Reinforcement learning is known to be unstable and can even diverge when a nonlinear function approximator such as a neural network is used to represent the

6.19 Q-Learning

401

action-value, i.e., Q function. This instability has a few causes [183]: • The correlations present in the sequence of observations. • Small updates to Q may significantly change the policy and therefore change the data distribution. • The correlation between the action-value Q and the target value r+max Q(s , a ). a

To deal with these instabilities, a novel variant of Q-learning was developed by Mnih et al. [183], which uses two key ideas. 1. A biologically inspired mechanism termed experience replay is used, which randomizes over the data, thereby removing correlations in the observation sequence and smoothing over changes in the data distribution. 2. An iterative update is used for adjusting the action-values Q towards target values that are only periodically updated, thereby reducing correlations with the target. To generate the experience replay, an approximate value function Q(s, a; θ i ) is first parameterized, where θ i are the parameters (that is, weights) of the Q-network at iteration i. Then, the agent’s experiences et = (st , at , rt , st+1 ) are stored at each time-step t in a data set Dt = {e1 , . . . , et } to perform experience replay. Algorithm 6.42 shows the deep Q-learning with experience replay developed by Mnih et al. [183]. Algorithm 6.42 Deep Q-learning with experience replay [183] 1. initialization: 1.1 replay memory D to capacity N , 1.2 action-value function Q with random weights θ, ˆ with weights θ − = θ. 1.3 target action-value function Q 2. for episode = 1, M do 3. Initialize sequence s1 = {x1 } and preprocessed sequence φ1 = φ(s1 ). 4. for t = 1, T do 5. With probability  select a random action at , 6. otherwise select at = arg maxa Q(φ(st ), a; θ). 7. Execute action at in emulator and observe reward rt and image xt+1 . 8. Set st+1 = st , at , xt+1 and preprocess φt+1 = φ(st+1 ). 9. Store transition (φt , at , rt , φt+1 ) in D. 10. Sample random mini batch of transitions (φj , aj , rj , φj +1 ) from D.  rj , if episode terminates at step j + 1; 11. Set yj =  − ˆ rj + maxa Q(φj +1 , a ; θ ), otherwise. 2  12. Perform a gradient descent step on yj − Q(φj , aj ; θ) with respect to the network parameters θ. ˆ = Q. 13. Every C steps reset Q 14. end for 15. end for

During learning, Q-learning updates are applied on samples (or mini-batches) of experience (s, a, r, s ) ∼ U (D), drawn uniformly at random from the pool of stored

402

6 Machine Learning

samples. The Q-learning update at iteration i uses the following loss function: / 0 Li (θi ) = E(s,a,r,s )∼U (D) r + γ max Q(s , a ; θi− ) − Q(s, a; θi ) , a

(6.19.14)

where γ is the discount factor determining the agent’s horizon, θi are the parameters of the Q-network at iteration i and θi− are the network parameters used to compute the target at iteration i. The target network parameters θi− are only updated with the Q-network parameters θi every C steps and are held fixed between individual updates.

6.20 Transfer Learning Traditional machine learning/data mining works well only when both training set and test set come from the same feature space and have the same distribution. This implies that whenever the data is changed, the model needs to be retrained, which may be too troublesome because • it is very expensive and difficult to obtain new training data for new data sets, • some data sets are easily outdated, that is, the data distribution in different periods will be different. Therefore, in many real-world applications, when the distribution changes or data are outdated, most statistical models are no longer applicable, and need to be rebuilt using newly collected training data. It is expensive or impossible to recollect the needed training data and rebuild the models. In such cases, knowledge transfer or transfer learning between tasks and domains would be desirable. Unlike traditional machine learning, transfer learning allows the domains, tasks, and distributions used in training and testing to be different. On the other hand, we get easily a large number of unlabeled images (or audio samples, or text documents) randomly downloaded from the Internet, and it is desirable to use such unlabeled images for improving performance on a given image (or audio, or text) classification task. Clearly, in such cases it is not assumed that the unlabeled data follows the same class labels or generative distribution as the labeled data. Using unlabeled data in supervised classification tasks is known as self-taught learning (transfer learning from unlabeled data) [210]. Therefore, transfer learning or self-taught learning is widely applicable for typical semi-supervised or transfer learning settings in many practical learning problems. Transfer learning can transfer previously knowledge to a new task, helping the learning of a new task. Transfer learning has attracted more and more attention since 1995 in different names: learning to learn, life-long learning, knowledge transfer, inductive learning, multitask learning, knowledge consolidation, context-sensitive learning, knowledge-based inductive bias, metalearning, incremental/cumulative learning, and self-taught learning [199, 210, 243].

6.20 Transfer Learning

403

6.20.1 Notations and Definitions Definition 6.44 (Domain [199]) A domain D consists of two parts: a feature space X and a marginal probability distribution P (X), denoted as D = {X , P (X)}, where X = {x1 , . . . , xn } ∈ X is a particular learning sample (set). For document classification with a bag-of-words representation, X is the space of all document representations, xi is the ith term vector corresponding to some document and X is the sample set of documents used for training. In supervised learning, unsupervised learning, semi-supervised learning, inductive learning and semi-supervised transductive learning discussed before, the source domain and the target domain are assumed to have the same feature space or the same marginal probability distribution. In these cases, we denote the labeled set D (l) , the unlabeled set D (u) and the test set D (t) as / 0 (6.20.1) D (l) = (xl1 , yl1 ), . . . , (xln , yln ) , l

D

(u)

l

= {xu1 , . . . , xunu },

(6.20.2)

D (t) = {xt1 , . . . , xtnt },

(6.20.3)

respectively. There are two domains in machine learning: source domain DS = {XS , P (XS )} and target domain DT = {XT , P (XT )}. These two domains are generally different: they may have different feature spaces XS = XT or different marginal probability distributions P (XS ) = P (XT ). Let XS = {xS1 , . . . , xSn } and YS = {yS1 , . . . , ySn } be, respectively, the sets of s s all labeled instances of source-domain DS and all corresponding labels of sourcedomain. Similarly, denote XT = {xT1 , . . . , xTn } and YT = {yT1 , . . . , yTn }. The t t source-domain data and the target-domain data are, respectively, represented as / 0 DS = (XS , YS ) = XS ∪ YS = (xS1 , yS1 ), . . . , (xSns , ySns ) ∈ DS , / 0 DT = (XT , YT ) = XT ∪ YT = (xT1 , yT1 ), . . . , (xTn , yTn ) ∈ DT , t

t

(6.20.4) (6.20.5)

where • xSi ∈ XS is the ith data instance of source-domain DS and ySi ∈ YS is the corresponding class label for xSi . For example, in document classification, the label set is a binary set containing true and false, i.e., ySi ∈ {true, false} takes on a value of true or false, and f (x) is the learner that predicts the label value for x. • xTi ∈ XT is the ith data instance of target-domain DT and yTi ∈ YT is the corresponding class label for xTi . • ns and nt are the numbers of source-domain data and target-domain data, respectively. In most cases, 0 ≤ nt / ns .

404

6 Machine Learning (l)

(u)

Let DS and DS be the labeled set and unlabeled set of data drawn from (l) (u) source domain DS , respectively. Similarly, DT and DT denote the labeled set and unlabeled set drawn from DT , respectively. Then we have the following notations: DS(l) (l)

DT

  0 / (l) (l) (l) (l) = xS1 , yS1 , . . . , xSn , ySn

and

0 / (u) DS(u) = x(u) S1 , . . . , xSn ,

  0 / (l) (l) (l) (l) = xT1 , yT1 , . . . , xTn , yTn

and

0 / (u) (u) (u) DT = xT1 , . . . , xTn ,

l

l

1

l

u

u

where (u) • x(l) Si , xSi ∈ DS are the Si -th labeled and unlabeled data vectors drawn from (l)

(l)

source domain, respectively; ySi ∈ YS is the class label corresponding to xSi drawn from source domain; (l) (u) • xTi , xTi ∈ DT are the Ti -th labeled and unlabeled data vectors drawn from target domain, respectively; yT(l)i ∈ YT is the class label corresponding to x(l) Ti drawn from target domain; • nl and nu are the numbers of labeled and unlabeled data in source (or target) domain. (l)

(u)

(l)

(u)

It should be noted that DS and DS are disjoint: DS ∩ DS = ∅. Similarly, we have also D (l) ∩ D (u) = ∅, as required in supervised learning, unsupervised learning, semi-supervised learning, inductive learning and semi-supervised transductive learning discussed before. When discussing machine learning in two different domains DS and DT , we have two possible training sets DStrain drawn from source domain DS and DTtrain drawn from target domain DT as follows: (l)

(u)

∈ DS ,

(6.20.6)

(l)

(u)

∈ DT ,

(6.20.7)

DStrain = DS ∪ DS

DTtrain = DT ∪ DT

where one of DS(l) and DS(u) may be empty, while DT(l) or DT(u) may be empty also, which corresponds to unsupervised and supervised learning cases, respectively. Definition 6.45 (Task [199]) For a given (source or target) domain D = {X , P (X)}, a task T consists of two parts: a label space Y and an objective predictive function f (·), denoted by T = {Y, f (·)}. The objective prediction function f (·) is not observed but can be typically learned from the feature vector and label pairs (xi , yi ), where xi ∈ X and yi ∈ Y. The function f (·) can be used to predict the corresponding label f (xi ) of a new instance xi . From a probabilistic viewpoint, f (xi ) is usually written as P (yi |xi ). If the given specific domain is the source domain, i.e., D = DS , then learning task is called the source learning task, denoted as TS = {YS , fS (·)}. Similarly,

6.20 Transfer Learning

405

learning task in target domain is known as the target learning task, denoted as TT = {YT , fT (·)}. For two different but relevant domains or tasks, if we transfer the model parameters trained in one domain (or task) to the new model in another domain (or task), it helps clearly to train the new data set because this transfer can share the learned parameters with the new model so as to speed up and optimize the learning of the new model without learning from zero as before. Such a machine learning is called transfer learning. The following is a unified definition of transfer learning. Definition 6.46 (Transfer Learning [199]) Given a source domain DS and source learning task TS , a target domain DT and target learning task TT . Transfer learning emphasizes the transfer of knowledge across domains, tasks, and distributions that are similar but not the same and aims to help improve the learning of the target predictive function fT (·) in DT using the knowledge in DS and TS , where DS = DT or TS = TT . This definition shows that transfer learning consists of three parts [302]: • define source and target domains; • learn on the source domain; and • predict or generalize on the target domain. In transfer learning the knowledge gained in solving the source (or target) task in the source (or target) domain is stored and then applied to solving another problem of interest. Figure 6.5 shows this transfer learning setup. In most cases, there is only a single-loop transfer learning (shown by arrows with solid line). However, in machine translation of two languages there are two-loop transfer learning setups, as shown in Fig. 6.5. In transfer learning, there are the following three main research issues [199, 268]. • What to transfer: Which part of knowledge can be transferred across domains or tasks? If some knowledge is specific for individual domains or tasks, then there is no need to transfer it. Conversely, when some knowledge may be common Fig. 6.5 Transfer learning setup: storing knowledge gained from solving one problem and applying it to a different but related problem

Source task/domain

Target task/domain Δ

Model 1

Δ Δ

Model 2

Knowledge

406

6 Machine Learning

between different domains, it is necessary to transfer this part of knowledge because they may help improve performance for the target domain or task. • How to transfer: How to train a suitable model? After discovering which knowledge can be transferred, learning algorithms need to be developed to transfer the knowledge. • When to transfer: The high-level concept of transfer learning is to improve a target learner by using data from a related source domain. But if the source domain is not well-related to the target, then the target learner can be negatively impacted by this weak relation, which is referred to as negative transfer. In Definition 6.46 on transfer learning, a domain is defined as a pair D = {X , P (X)}, and a task is a pair T = {Y, P (Y |X)}. Hence, the condition on two different domains DS = DT has two possible scenarios: XS = XT and/or P (XS ) = P (XT ), while the condition on two different tasks TS = TT has also two possible scenarios: YS = YT and/or P (YS |XS ) = P (YT |XT ). In other words, there are the following four transfer learning scenarios: 1. XS = XT means that the feature spaces between the source and target domains are different. For example, in the context of natural language processing, the documents written in two different languages belong to this scenario, which is generally referred to as cross-lingual adaptation. 2. P (XS ) = P (XT ) implies that the marginal probability distributions between source and target domains are different, for example, when the documents focus on different topics. This scenario is generally known as domain adaptation. 3. YS = YT means that the label spaces between the source and target domains are different. The case of YS = YT refers to a mismatch in the class space. An example of this case is when the source software project has a binary label space of true for defect prone and false for not defect prone, and the target domain has a label space that defines five levels of fault prone modules [268]. 4. P (YS |XS ) = P (YT |XT ) implies that conditional probability distributions between the source and target domains are different. For example, source and target documents are unbalanced with regard to their classes. This scenario is quite common in practice.

6.20.2 Categorization of Transfer Learning Feature-based transfer learning approaches are categorized in two ways [268], as shown in Fig. 6.6. • Asymmetric transformation transforms the features of the source via reweighting to more closely match the target domain [201], as depicted in Fig. 6.6b. • Symmetric transformation discovers underlying meaningful structures between the domains, shown in Fig. 6.6a.

6.20 Transfer Learning

407

Source feature

Target feature

XS

XT Source feature

Target feature

T

T

S

T

XS

TS

XT

XC (a)

(b)

Fig. 6.6 The transformation mapping [268]. (a) The symmetric transformation (TS and TT ) of the source-domain feature set XS = {xSi } and target-domain feature set XT = {xTi } into a common latent feature set XC = {xCi }. (b) The asymmetric transformation TT of the source-domain feature set XS to the target-domain feature set XT

According to different perspectives, there are different categorization methods in transfer learning. 1. According to feature spaces, transfer learning can be categorized to two classes [268]. • Homogeneous transfer learning: Source domain and target domain have the same feature space, i.e., XS = XT , including instance-based transfer learning [48], asymmetric feature-based transfer learning [70], domain adaptation [203], parameter-based transfer learning [249], relational-based transfer learning [161], hybrid-based (instance and parameter) transfer learning [280] and so on. • Heterogeneous transfer learning: Source domain and target domain have different feature spaces, i.e., XS = XT , including symmetric feature-based transfer learning in [207], asymmetric feature-based transfer learning in [153], and so on. 2. According to computational intelligence, transfer learning can be categorized as follows [167]. • Transfer learning using neural network including transfer learning using deep neural network [55], transfer learning using convolutional neural networks [196], transfer learning using multiple task neural network [45, 233], and transfer learning using radial basis function neural networks [285]. • Transfer learning using Bayes including transfer learning using naive Bayes [170, 217], transfer learning using Bayesian network [168, 192], and transfer learning using hierarchical Bayesian model [93]. • Transfer learning using fuzzy system and genetic algorithm including fuzzybased transductive transfer learning [18], framework of fuzzy transfer learning to form a prediction model in intelligent environments [227, 228], generalized hidden-mapping ridge regression [74], genetic transfer learning [148], and so on.

408

6 Machine Learning

3. According to tasks, transfer learning can be categorized into the following three classes [199]: • inductive transfer learning, • transductive transfer learning [10], and • unsupervised transfer learning. Definition 6.47 (Inductive Transfer Learning [199]) Let the target task TT be different from the source task TS , i.e., TS = TT , no matter when the source and target domains are the same or not. Inductive transfer learning uses some labeled data in the target domain to induce an objective predictive model fT (·) for use in the target domain. Definition 6.48 (Transductive Transfer Learning [10, 199]) Given a source domain DS and a corresponding learning task TS , a target domain DT and a corresponding learning task TT , transductive transfer learning aims to improve the learning of the target predictive function fT (·) using the knowledge in DS and TS , where DS = DT and TS = TT . In addition, some unlabeled target-domain data must be available at training time. Definition 6.49 (Unsupervised Transfer Learning [199]) When DS = DT and TS = TT , and labeled data in source domain and target domain are not available, unsupervised transfer learning is to help to improve the learning of the target predictive function fT (·) in DT using the knowledge in DS and TS . From the above three definitions, we can compare the three transfer learning settings as follows. 1. Common point: The inductive transfer learning, the transductive transfer learning, and the unsupervised transfer learning do not assume that the unlabeled (u) data xj was drawn from the same distribution as, nor that it can be associated with the same class labels as, the labeled data. This is the common point in any type of transfer learning, and is the basic difference from the traditional machine learning. 2. Comparison of assumptions: The transductive transfer learning requires the source task and the target task must be the same, i.e., TS = TT , but both the inductive transfer learning and the unsupervised transfer learning require TS = TT . 3. Comparison of data used: Different from the unsupervised transfer learning that uses no labeled data, the inductive transfer learning uses some labeled data in the target domain, both transductive transfer learning and self-taught learning use some unlabeled data available at training time, while unsupervised transfer learning does not use any labeled data. 4. Comparison of tasks: The inductive transfer learning induces the target predictive model fT (·), while both the transductive transfer learning and the unsupervised transfer learning aim to improve the learning of the target predictive function fT (·) using the knowledge in DS and TS , but the inductive and unsupervised

6.20 Transfer Learning

409

transfer learning require TS = TT , while the transductive transfer learning requires TS = TT . According to different situations of labeled and unlabeled data in the source domain, the inductive transfer learning setting can be categorized further into two cases [199]. • Multitask learning: a lot of labeled data in the source domain are available. In this case, the inductive transfer learning setting is similar to the multitask learning setting. However, the inductive transfer learning setting only aims at achieving high performance in the target task by transferring knowledge from the source task, while the multitask learning tries to learn the target and source tasks simultaneously. • Self-taught transfer learning: no labeled data in the source domain are available. In this case, the inductive transfer learning setting is similar to the self-taught transfer learning setting proposed by Raina et al. [210]. Definition 6.50 (Multitask Learning) For multiple related tasks, multitask learning is defined as an inductive transfer learning method based on shared representation, in which multiple related tasks are learned together by using labeled data for source domain. What is learned for each task can help other tasks to learn better. This definition shows that multitask learning consists of three parts [302]: • model the task relatedness; • learn all tasks simultaneously; • tasks may have different data/features. Definition 6.51 (Self-Taught Transfer Learning [210]) Given a labeled training (l) (l) (l) (l) set of m examples {(x1 , y1 ), . . . , (xm , ym )} drawn i.i.d. from some distribution (l) D, where each xi ∈ Rn denotes an input feature vector (the subscript “l” indicates that it is a labeled example), and yi(l) ∈ {1, . . . , C} is the corresponding class label in the classification problems. In addition, we are given a set of k unlabeled examples (u) (u) (l) x1 , . . . , xk ∈ Rn . Self-taught transfer learning aims to determine if xi and (u) xj come from the same input “type” or “modality” through transfer learning from unlabeled data. In the self-taught learning setting, the label spaces between the source and target domains may be different, which implies the side information of the source domain cannot be used directly. Thus, it is similar to the inductive transfer learning setting where the labeled data in the source domain are unavailable. In the traditional machine learning setting, the transductive learning [131] refers to the situation where all test data are required to be seen at training time, and that the learned model cannot be reused for future data. In contrast, the term “transductive” in Definition 6.48 emphasizes the concept in this type of transfer learning, the tasks must be the same and there must be some unlabeled data available in the target domain. Thus, when some new test data arrive, they must be classified together with all existing data.

410

6 Machine Learning

In the transductive transfer learning setting, the source and target tasks are required to be the same, while the source and target domains are different. In this situation, no labeled data in the target domain are available, while a lot of labeled data in the source domain are available. According to different situations between the source and target domains, the transductive transfer learning can be further categorized into two cases. • The feature spaces between the source and target domains are different, namely XS = XT . This is a so-called heterogeneous transfer learning. • The feature spaces between domains are the same, i.e., XS = XT , but the marginal probability distributions of the input data are different: P (XS ) = P (XT ). This special case is closely related to “domain adaptation”. Definition 6.52 (Domain Adaptation) Given two different domains DS = DT and a single task TS = TT , domain adaptation aims to improve the learning of the target predictive function fT (·) using labeled source-domain data and unlabeled or few labeled target-domain data. There are two main existing approaches to transfer learning. 1. Instance-based approach learns different weights to rank training examples in a source domain for better learning in a target domain. For example, boosting for transfer learning of Dai et al. [66]. 2. Feature-based approach tries to learn a common feature structure from different domains that can bridge the two domains for knowledge transfer. For example, multitask learning of Ando and Zhang [8]; multi-domain learning of Blitzer et al. [27]; self-taught learning of Raina et al. [210]; and transfer learning via dimensionality reduction of Pan et al. [201].

6.20.3 Boosting for Transfer Learning
Consider transfer learning in which only labeled data from a similar old domain are available when a task from a new domain arrives. In such cases, labeling the new data can be costly, and it would also be a waste to throw away all the old data. A natural question is whether it is possible to construct a high-quality classification model using only a tiny amount of new data and a large amount of old data, even when the new data alone are not sufficient to train a model. To this end, as an extension of adaptive boosting (AdaBoost) [97], an AdaBoost tailored to transfer learning, called TrAdaBoost, was proposed by Dai et al. [66]. Even for more or less outdated training data, certain parts of the data can still be reused; in other words, the knowledge learned from this part of the data can still be used in training a classifier for the new data. The aim of TrAdaBoost is to iteratively down-weight the low-quality source-domain data and to retain the reusable training data. Assume that there are two types of training data [66].


• Same-distribution training data: a part of the labeled training data has the same distribution as the test data. The quantity of the same-distribution training data is often inadequate to train a good classifier for the test data, but access to these data is helpful for voting on the usefulness of each of the old data instances.
• Diff-distribution training data: the training data have a different distribution from the test data, perhaps because they are outdated. These data are assumed to be abundant, and the classifiers learned from them cannot classify the test data well due to the different data distributions.
The TrAdaBoost model uses boosting to filter out the diff-distribution training data which are very different from the same-distribution data by automatically adjusting the weights of training instances, so that the remaining diff-distribution data are treated as additional training data which greatly boost the confidence of the learned model even when the same-distribution training data are scarce.
Let X_s be the same-distribution instance space, X_d be the diff-distribution instance space, and Y = {0, 1} be the set of category labels. A concept is a Boolean function c mapping from X to Y, where X = X_s ∪ X_d. The test data set is denoted by S = {x_i^t}, where x_i^t ∈ X_s, i = 1, ..., k. Here, k is the size of the unlabeled test set S. The training data set T ⊆ {X × Y} is partitioned into two labeled sets T_d and T_s, where T_d = {(x_i^d, c(x_i^d))} with x_i^d ∈ X_d, i = 1, ..., n, represents the diff-distribution training data, and T_s = {(x_j^s, c(x_j^s))} with x_j^s ∈ X_s, j = 1, ..., m, represents the same-distribution training data; n and m are the sizes of T_d and T_s, respectively. The combined training set T = {(x_i, c(x_i))} is defined by

x_i = { x_i^d,  if i = 1, ..., n;
        x_i^s,  if i = n + 1, ..., n + m.          (6.20.8)

Algorithm 6.43 gives a transfer AdaBoost learning framework (TrAdaBoost) developed by Dai et al. [66].

6.20.4 Multitask Learning
The standard supervised learning task is to seek a predictor that maps an input vector x ∈ X to the corresponding output y ∈ Y. Consider m learning problems indexed by ℓ ∈ {1, ..., m}. Suppose we are given a finite set of training examples {(x_1^ℓ, y_1^ℓ), ..., (x_{n_ℓ}^ℓ, y_{n_ℓ}^ℓ)} that are independently generated according to some unknown probability distribution D_ℓ, where ℓ = 1, ..., m. Based on this finite set of training examples, the predictor is selected from a set H of functions. The set H, called the hypothesis space, consists of functions mapping from X to Y that can be used to predict the output in Y of an input datum in X.


Algorithm 6.43 TrAdaBoost algorithm [66]
1. input: two labeled data sets T_d and T_s, the unlabeled data set S, a base learning algorithm Learner, and the maximum number of iterations N.
2. initialization: the initial weight vector w^1 = [w_1^1, ..., w_{n+m}^1]^T, or the initial values specified for w^1.
3. for t = 1 to N do
4.   Set p^t = w^t / (Σ_{i=1}^{n+m} w_i^t).
5.   Call Learner, providing it the combined training set T with the distribution p^t over T and the unlabeled data set S. Then, get back a hypothesis h_t : X → Y (or [0, 1] by confidence).
6.   Calculate the error of h_t on T_s:
       ε_t = Σ_{i=n+1}^{n+m} w_i^t |h_t(x_i) − c(x_i)| / Σ_{i=n+1}^{n+m} w_i^t.
7.   Set β_t = ε_t/(1 − ε_t) and β = 1/(1 + √(2 ln n / N)). Note that ε_t ≤ 1/2 is required.
8.   Update the new weight vector:
       w_i^{t+1} = w_i^t β^{|h_t(x_i) − c(x_i)|},       for 1 ≤ i ≤ n;
       w_i^{t+1} = w_i^t β_t^{−|h_t(x_i) − c(x_i)|},    for n + 1 ≤ i ≤ n + m.
9. end for
10. output: the hypothesis
       h_f(x) = 1, if ∏_{t=⌈N/2⌉}^{N} β_t^{−h_t(x)} ≥ ∏_{t=⌈N/2⌉}^{N} β_t^{−1/2};
       h_f(x) = 0, otherwise.
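To make the weight dynamics of Algorithm 6.43 concrete, the following is a minimal Python sketch of the TrAdaBoost loop, assuming binary labels in {0, 1}, NumPy and scikit-learn availability, and a decision stump as the base learner Learner; the function and variable names are illustrative, not part of [66].

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tradaboost(Xd, yd, Xs, ys, X_test, n_iter=20):
    """Xd, yd: diff-distribution data; Xs, ys: same-distribution data (labels in {0,1})."""
    n, m = len(Xd), len(Xs)
    X = np.vstack([Xd, Xs])
    y = np.concatenate([yd, ys]).astype(float)
    w = np.ones(n + m)                                   # step 2: initial weights
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n) / n_iter))
    hyps, betas_t = [], []
    for t in range(n_iter):
        p = w / w.sum()                                  # step 4: normalized weights
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=p)
        pred = h.predict(X)                              # step 5: hypothesis h_t
        err = np.abs(pred - y)
        eps = np.sum(w[n:] * err[n:]) / np.sum(w[n:])    # step 6: error on T_s
        eps = min(max(eps, 1e-10), 0.499)                # keep eps_t < 1/2 as required
        beta_t = eps / (1.0 - eps)                       # step 7
        w[:n] *= beta ** err[:n]                         # step 8: down-weight bad old data
        w[n:] *= beta_t ** (-err[n:])                    #         up-weight misclassified new data
        hyps.append(h)
        betas_t.append(beta_t)
    # step 10: weighted vote over the second half of the hypotheses
    half = n_iter // 2
    score = sum(-np.log(b) * h.predict(X_test) for h, b in zip(hyps[half:], betas_t[half:]))
    thresh = sum(-0.5 * np.log(b) for b in betas_t[half:])
    return (score >= thresh).astype(int)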

In the multitask scenario the goal of a learning system is to solve multiple supervised learning tasks (such as recognizing objects or predicting attributes) by sharing information between them.
Definition 6.53 (Structural Learning [8]) Structural learning aims to learn some underlying predictive functional structures (smooth function classes) that can characterize what good predictors are like. In other words, its goal is to find a predictor f so that its error with respect to D is as small as possible.
Given the input space X, a linear predictor is not necessarily linear on the original space X, but rather can be regarded as a linear functional on a high-dimensional feature space F. Assume there is a known feature map Φ : X → F, so that a known high-dimensional predictor is given by w^T φ(x), where w is a weight vector on the high-dimensional feature map φ(x). In order to apply the structural learning framework, consider a parameterized low-dimensional feature map given by v^T ψ_θ(x), where v is a weight vector on the low-dimensional feature map ψ_θ(x), and θ denotes the common structure parameter shared by all m learning problems. That is, the linear predictor f(x) in the structural learning framework has the form [8]:

f(x) = w^T φ(x) + v^T ψ_θ(x).          (6.20.9)

In order to simplify numerical computation, Ando and Zhang [8] propose using a simple linear form of feature map, ψ_θ(x) = Θψ(x), where Θ is an h × p matrix and ψ is a known p-dimensional vector function. Then, the linear predictor can be reformulated as [8]:

f_Θ(w, v; x) = w^T φ(x) + v^T Θψ(x).          (6.20.10)


Letting φ(x) = ψ(x) = x_i ∈ R^p, then

f_ℓ(w_ℓ, v_ℓ; x_i) = (w_ℓ + Θ^T v_ℓ)^T x_i,   ℓ = 1, ..., m,          (6.20.11)

and the empirical error L(f_ℓ(w_ℓ, v_ℓ; x_i^ℓ), y_i^ℓ) on the training data {(x_1^ℓ, y_1^ℓ), ..., (x_{n_ℓ}^ℓ, y_{n_ℓ}^ℓ)} can be taken as the modified Huber loss function [8]:

L(p, y) = { max(0, 1 − py)^2,   if py ≥ −1;
            −4py,               otherwise.          (6.20.12)
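The modified Huber loss (6.20.12) is simple to evaluate; a small vectorized NumPy sketch, assuming labels y in {−1, +1} and raw predictor outputs p, is:

import numpy as np

def modified_huber_loss(p, y):
    """Elementwise modified Huber loss of (6.20.12)."""
    p, y = np.asarray(p, dtype=float), np.asarray(y, dtype=float)
    py = p * y
    return np.where(py >= -1.0, np.maximum(0.0, 1.0 - py) ** 2, -4.0 * py)

# Example: a confident correct prediction incurs zero loss, a confident
# wrong one grows linearly.
print(modified_huber_loss([2.0, -2.0], [1, 1]))   # -> [0. 8.]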

Hence, the optimization problem for multitask learning can be written as

[{ŵ_ℓ, v̂_ℓ}; Θ̂] = arg min_{{w_ℓ, v_ℓ}; Θ} Σ_{ℓ=1}^{m} [ (1/n_ℓ) Σ_{i=1}^{n_ℓ} L( f_ℓ(w_ℓ, v_ℓ; x_i^ℓ), y_i^ℓ ) + λ_ℓ ||w_ℓ||^2 ].          (6.20.13)

Algorithm 6.44 shows the SVD-based alternating LS solution for w_ℓ and Θ.
Algorithm 6.44 SVD-based alternating structure optimization algorithm [8]
1. input: training data {(x_i^ℓ, y_i^ℓ)}, ℓ = 1, ..., m; i = 1, ..., n_ℓ; parameters: h and λ_1, ..., λ_m.
2. initialization: u_ℓ = 0 (ℓ = 1, ..., m) and an arbitrary matrix Θ.
3. iterate
4.   for ℓ = 1 to m do
5.     With Θ and v_ℓ = Θu_ℓ fixed, approximately solve for ŵ_ℓ:
         ŵ_ℓ = arg min_{w_ℓ} { (1/n_ℓ) Σ_{i=1}^{n_ℓ} L(w_ℓ^T x_i^ℓ + (v_ℓ^T Θ)x_i^ℓ, y_i^ℓ) + λ_ℓ ||w_ℓ||_2^2 }.
6.     Let u_ℓ = ŵ_ℓ + Θ^T v_ℓ.
7.   end for
8.   Compute the SVD of U = [√λ_1 u_1, ..., √λ_m u_m]: U = V_1 D V_2^T (with the diagonal elements of D in descending order).
9.   Let the rows of Θ be the first h rows of V_1^T.
10. until convergence
11. output: h × p dimensional matrix Θ.
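As a hedged illustration of steps 8 and 9 of Algorithm 6.44, the following NumPy sketch forms the matrix U from given per-task weight vectors u_ℓ and extracts Θ from its SVD; the inner per-task minimization in step 5 is assumed to be solved elsewhere, and the names below are illustrative.

import numpy as np

def update_theta(u_list, lambdas, h):
    """u_list: list of m weight vectors u_l in R^p; returns Theta of size h x p."""
    # Form U = [sqrt(lambda_1) u_1, ..., sqrt(lambda_m) u_m] of size p x m
    U = np.column_stack([np.sqrt(lam) * u for u, lam in zip(u_list, lambdas)])
    # Thin SVD U = V1 D V2^T; the rows of Theta are the first h rows of V1^T,
    # i.e., the h leading left singular vectors transposed.
    V1, D, V2t = np.linalg.svd(U, full_matrices=False)
    return V1[:, :h].T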

For many natural language processing (NLP) tasks, labeled data in new domains are scarce or nonexistent, so one seeks to adapt existing models from a resource-rich source domain to a resource-poor target domain. To this end, structural correspondence learning was introduced by Blitzer et al. [27] to automatically induce correspondences among features from different domains.
Definition 6.54 (Structural Correspondence Learning [27]) Structural correspondence learning (SCL) identifies correspondences among features from different domains by modeling their correlations with pivot features. Pivot features are features which behave in the same way for discriminative learning in both


domains. Non-pivot features from different domains which are correlated with many of the same pivot features are assumed to correspond, and they are treated similarly in a discriminative learner.
The structural correspondence learning method consists of the following three steps.
• Find pivot features, e.g., using one of the feature selection methods presented in Sect. 6.7. The pivot predictors are the key element in structural correspondence learning.
• Use the weight vectors ŵ_ℓ to encode the covariance of the non-pivot features with the pivot features in order to learn a mapping matrix Θ from the original feature spaces of both domains to a shared, low-dimensional, real-valued feature space. This is the core of structural correspondence learning.
• For each source D_ℓ, ℓ = 1, ..., m, construct its linear predictor on the target domain:

f_ℓ(x) = sign(ŵ_ℓ^T x),   ℓ = 1, ..., m.          (6.20.14)

Algorithm 6.45 summarizes the structural correspondence learning algorithm developed by Blitzer et al. [27].
Algorithm 6.45 Structural correspondence learning algorithm [27]
1. input: labeled source data {(x_t, y_t)}_{t=1}^{T} and unlabeled data {x_j} from both domains.
2. Choose m pivot features x̃_i, i = 1, ..., m, using a feature selection method.
3. for ℓ = 1 to m do
4.   ŵ_ℓ = arg min_w { Σ_j L(w^T x_j, p_ℓ(x_j)) + λ||w||^2 },
     where L(p, q) is the modified Huber loss function given in (6.20.12).
5. end for
6. Construct the matrix W = [ŵ_1, ..., ŵ_m].
7. Compute the SVD [U, D, V^T] = SVD(W).
8. Put Θ = U^T_{1:h,:}.
9. output: predictor f_ℓ : X → Y_T given by f_ℓ(x) = sign(ŵ_ℓ^T x), where ℓ = 1, ..., m.
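A schematic Python sketch of Algorithm 6.45 is given below. It assumes 0/1 pivot indicators p_ℓ(x) as prediction targets and uses scikit-learn's SGDClassifier with the modified Huber loss as a stand-in for the regularized minimization in step 4; this solver is an implementation choice for illustration, not the exact procedure of [27].

import numpy as np
from sklearn.linear_model import SGDClassifier

def scl_projection(X_unlabeled, pivot_indicators, h):
    """X_unlabeled: (N, d) features from both domains;
    pivot_indicators: (N, m) 0/1 matrix, column l gives p_l(x_j); returns Theta (h, d)."""
    m = pivot_indicators.shape[1]
    W = []
    for l in range(m):
        # Step 4: train a pivot predictor with the modified Huber loss.
        clf = SGDClassifier(loss="modified_huber", alpha=1e-4)
        clf.fit(X_unlabeled, pivot_indicators[:, l])
        W.append(clf.coef_.ravel())
    W = np.column_stack(W)                 # d x m matrix of pivot-predictor weights
    U, D, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :h].T                      # Theta = first h rows of U^T (step 8)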

6.20.5 EigenTransfer
In transfer learning, we are given two target data sets: a target training data set X_t = {x_i^(t)}_{i=1}^{n} with labels, and a target test data set X_u = {x_i^(u)}_{i=1}^{k} to be predicted. Different from traditional machine learning, we are also given an auxiliary data set X_a = {x_i^(a)}_{i=1}^{m} to help the target learning. In spectral learning [53], the input is a weighted graph G = (V, E) representing the target task, where V = {v_i}_{i=1}^{n} and E = {e_ij}_{i,j=1}^{n,n} represent the node set and the edge set of the graph, respectively.


To apply spectral learning to transfer learning, construct a task graph G(V, E) in the following way [68].
• Node construction: the nodes V in the graph G represent the features {f^(i)}_{i=1}^{s}, the instances {x_i^(t)}_{i=1}^{n}, {x_i^(a)}_{i=1}^{m}, {x_i^(u)}_{i=1}^{k}, or the class labels c(x_i).
• Edge construction: the edges E represent the relations between these nodes, where the weights of the edges are based on the number of co-occurrences between the end nodes in the target and auxiliary data.
Consequently, the task graph G(V, E) contains almost all the information for the transfer learning task, including the target data and the auxiliary data. Usually, the task graph G is sparse, symmetric, real, and positive semi-definite. Thus, the spectra of the graph G can be calculated very efficiently. The following are two steps for a unified framework for graph transfer learning [68].
1. Construct the weight graph G(V, E) for three types of transfer learning.
• Cross-domain learning: a graph G(V, E) for representing the problem of cross-domain learning is given by

V = X_t ∪ X_a ∪ X_u ∪ F ∪ C,          (6.20.15)

e_ij = { φ_{v_i,v_j},  if v_i ∈ X_t ∪ X_a ∪ X_u and v_j ∈ F;
         φ_{v_j,v_i},  if v_i ∈ F and v_j ∈ X_t ∪ X_a ∪ X_u;
         1,            if v_i ∈ X_t and v_j ∈ C and C(v_i) = v_j;
         1,            if v_i ∈ C and v_j ∈ X_t and C(v_j) = v_i;
         1,            if v_i ∈ X_a and v_j ∈ C and C(v_i) = v_j;
         1,            if v_i ∈ C and v_j ∈ X_a and C(v_j) = v_i;
         0,            otherwise.          (6.20.16)

• Cross-category learning: a graph G(V, E) for representing the problem of cross-category learning is given by

V = X_t ∪ X_a ∪ X_u ∪ F ∪ C_t ∪ C_a,          (6.20.17)

e_ij = { φ_{v_i,v_j},  if v_i ∈ X_t ∪ X_a ∪ X_u and v_j ∈ F;
         φ_{v_j,v_i},  if v_i ∈ F and v_j ∈ X_t ∪ X_a ∪ X_u;
         1,            if v_i ∈ X_t and v_j ∈ C_t and C(v_i) = v_j;
         1,            if v_i ∈ C_t and v_j ∈ X_t and C(v_j) = v_i;
         1,            if v_i ∈ X_a and v_j ∈ C_a and C(v_i) = v_j;
         1,            if v_i ∈ C_a and v_j ∈ X_a and C(v_j) = v_i;
         0,            otherwise.          (6.20.18)


• Self-taught learning: G(V, E) is given by

V = X_t ∪ X_a ∪ X_u ∪ F ∪ C_t,          (6.20.19)

e_ij = { φ_{v_i,v_j},  if v_i ∈ X_t ∪ X_a ∪ X_u and v_j ∈ F;
         φ_{v_j,v_i},  if v_i ∈ F and v_j ∈ X_t ∪ X_a ∪ X_u;
         1,            if v_i ∈ X_t and v_j ∈ C_t and C(v_i) = v_j;
         1,            if v_i ∈ C_t and v_j ∈ X_t and C(v_j) = v_i;
         0,            otherwise.          (6.20.20)

In the above equations, φ_{x,f} denotes the importance of the feature f ∈ F appearing in the instance x ∈ X_t ∪ X_a ∪ X_u, and C(x) means the true label of the instance x.
2. Learning graph spectra.
• Construct an adjacency matrix W ∈ R^{n×n} with respect to the graph G(V, E):

W = [w_ij]_{i,j=1}^{n,n}   with w_ij = e_ij.          (6.20.21)

• Construct the diagonal matrix D = [d_ij]_{i,j=1}^{n,n}:

d_ij = { Σ_{t=1}^{n} w_it,  if i = j,
         0,                 if i ≠ j.          (6.20.22)

• Calculate the Laplacian matrix of G: L = D − W.
• Use the normalized cut technique [230] to learn the new feature representation enabling transfer learning: (a) calculate the first m generalized eigenvectors v_1, ..., v_m of the generalized eigenproblem Lv = λDv; (b) use these m generalized eigenvectors to construct the feature representation for transfer learning: U = [v_1, ..., v_m].
The above graph transfer learning based on spectral features is called EigenTransfer [68]. EigenTransfer can model a variety of existing transfer learning problems and solutions. In this framework, a task graph is first constructed to represent the transfer learning task. Then, an eigen-feature representation is learned from the task graph based on spectral learning theory.


Under the new feature representation, the knowledge from the auxiliary data tends to be transferred to help the target learning. Algorithm 6.46 gives an EigenCluster algorithm.
Algorithm 6.46 EigenCluster: a unified framework for transfer learning [68]
1. input: the target data set X_t = {x_i^(t)}_{i=1}^{n}, the auxiliary data set X_a = {x_i^(a)}_{i=1}^{m}, and the test data set X_u = {x_i^(u)}_{i=1}^{k}.
2. Construct the task graph G(V, E) based on the target clustering task (cf. (6.20.15)–(6.20.19)).
3. Construct the n × n adjacency matrix W whose entries w_ij = e_ij are based on the task graph G(V, E).
4. Use (6.20.22) to calculate the diagonal matrix D.
5. Construct the Laplacian matrix L = D − W.
6. Calculate the first N generalized eigenvectors v_1, ..., v_N of Lv = λDv.
7. Let U = [v_1, ..., v_N].
8. for each x_i^(t) ∈ X_t do
9.   Let y_i^(t) be the corresponding row in U with respect to x_i^(t).
10. end for
11. Train a classifier based on Y^(t) = {y_i^(t)}_{i=1}^{n} instead of X_t = {x_i^(t)}_{i=1}^{n} using a traditional classification algorithm, and then classify the test data X_u = {x_i^(u)}_{i=1}^{k}.
12. output: classification result on X_u.
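Assuming the adjacency matrix W has already been built from (6.20.15)–(6.20.20), steps 4–7 of Algorithm 6.46 amount to a single generalized eigendecomposition. A compact SciPy sketch is given below; it requires every node to have nonzero degree so that D is positive definite.

import numpy as np
from scipy.linalg import eigh

def eigen_feature_representation(W, n_components):
    """W: symmetric adjacency matrix of the task graph; returns the n x n_components embedding U."""
    d = W.sum(axis=1)
    D = np.diag(d)
    L = D - W                         # graph Laplacian
    # Generalized eigenproblem L v = lambda D v; take the eigenvectors with the
    # smallest eigenvalues, as in the normalized cut technique.
    eigvals, eigvecs = eigh(L, D)
    return eigvecs[:, :n_components]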

In the next section, we focus on domain adaptation.

6.21 Domain Adaptation
The task of domain adaptation is to develop learning algorithms that can be easily ported from one domain to another. This problem is particularly interesting and important when we have a large collection of labeled data in one "source" domain but truly desire a model that performs well in a second "target" domain.
Let S = {x_i^(s), y_i^(s)}_{i=1}^{N_s} denote the labeled data from the source domain, where x_i^(s) ∈ R^N is referred to as an observation and y_i^(s) is the corresponding class label. Labeled data from the target domain are denoted by T_l = {x_i^(t), y_i^(t)}_{i=1}^{N_tl}, where x_i^(t) ∈ R^M. Similarly, unlabeled data in the target domain are denoted by T_u = {x_i^(u)}_{i=1}^{N_tu}, where x_i^(u) ∈ R^M. It is usually assumed that N = M. Denoting T = T_l ∪ T_u, then N_t = N_tl + N_tu denotes the total number of samples in the target domain.
Let S = [x_1^(s), ..., x_{N_s}^(s)] be the matrix of the N_s data points from the source domain S. Similarly, for the target domain T, let T_l = [x_1^(t), ..., x_{N_tl}^(t)] be the matrix of the N_tl data from T_l, T_u = [x_1^(u), ..., x_{N_tu}^(u)] be the matrix of the N_tu data from T_u, and T = [T_l, T_u] = [x_1^(t), ..., x_{N_t}^(t)] be the matrix of the N_t data from T.


The goal of domain adaptation is to learn a function f(·) that predicts the class label of a new test sample from the target domain. Depending on the availability of the source and target-domain data, the domain adaptation problem can be defined in many different ways [205].
• In semi-supervised domain adaptation, the function f(·) is learned using the knowledge in both S and T_l.
• In unsupervised domain adaptation, the function f(·) is learned using the knowledge in both S and T_u.
• In multi-source domain adaptation, f(·) is learned from more than one domain in S accompanying each of the first two cases.
• In heterogeneous domain adaptation, the dimensions of the features in the source and target domains are assumed to be different, i.e., N ≠ M.
In what follows, we present a number of domain adaptation approaches.

6.21.1 Feature Augmentation Method
One of the simplest domain adaptation approaches is the feature augmentation of Daumé [70]. For each given feature vector x ∈ R^F in the original problem, make three versions of it: a general version, a source-specific version, and a target-specific version. The augmented source data φ_s(x) : R^F → R^{3F} contain only the general and source-specific versions, and the augmented target data φ_t(x) : R^F → R^{3F} contain the general and target-specific versions:

φ_s(x) = [x; x; 0],    φ_t(x) = [x; 0; x],          (6.21.1)

where 0 is an F × 1 zero vector.
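The augmentation in (6.21.1) requires only a few lines of code; the following NumPy sketch assumes both domains share the feature dimension F, with rows of X as samples.

import numpy as np

def augment_source(X):
    """Map each source row x to [x, x, 0] as in (6.21.1)."""
    zeros = np.zeros_like(X)
    return np.hstack([X, X, zeros])     # [general, source-specific, 0]

def augment_target(X):
    """Map each target row x to [x, 0, x] as in (6.21.1)."""
    zeros = np.zeros_like(X)
    return np.hstack([X, zeros, X])     # [general, 0, target-specific]

# Any standard classifier can then be trained on the stacked augmented data
# from both domains.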

Consider the heterogeneous domain adaptation (HDA) problem where the dimensions of the data from the source and target domains are different. By introducing a common subspace for the source and target data, the source and target data of dimensions N and M are, respectively, projected onto a latent domain of dimension l using two projection matrices W_1 ∈ R^{l×N} and W_2 ∈ R^{l×M}. The augmented feature maps for the source and target domains in the common space are then defined as [162]:

φ_s(x_i^(s)) = [W_1 x_i^(s); x_i^(s); 0_M] ∈ R^{l+N+M},          (6.21.2)




φ_t(x_i^(t)) = [W_2 x_i^(t); 0_N; x_i^(t)] ∈ R^{l+N+M},          (6.21.3)

where x_i^(s) ∈ S, x_i^(t) ∈ T_l, and 0_M is an M × 1 zero vector (similarly for 0_N). After introducing W_1 and W_2, the data from the two domains can be readily compared in the common subspace. Once the data from both domains are transformed onto a common space, heterogeneous feature augmentation (HFA) [162] can be readily formulated as follows:

min_{W_1, W_2} min_{w, b, ξ^(s), ξ^(t)}  (1/2)||w||^2 + C ( Σ_{i=1}^{N_s} ξ_i^(s) + Σ_{i=1}^{N_t} ξ_i^(t) ),          (6.21.4)

subject to  y_i^(s) ( w^T φ_s(x_i^(s)) + b ) ≥ 1 − ξ_i^(s),   ξ_i^(s) ≥ 0,          (6.21.5)
            y_i^(t) ( w^T φ_t(x_i^(t)) + b ) ≥ 1 − ξ_i^(t),   ξ_i^(t) ≥ 0,          (6.21.6)
            ||W_1||_F^2 ≤ λ_{w1},   ||W_2||_F^2 ≤ λ_{w2},          (6.21.7)

where C > 0 is a tradeoff parameter which balances the model complexity and the empirical losses on the training samples from the two domains, and λ_{w1}, λ_{w2} > 0 are predefined parameters that control the complexities of W_1 and W_2, respectively.
Denote the kernel on the source-domain samples as K_s = Φ_s^T Φ_s ∈ R^{n_s×n_s}, where Φ_s = [φ_s(x_1^(s)), ..., φ_s(x_{n_s}^(s))] and φ_s(·) is the nonlinear feature mapping function induced by K_s. Similarly, the kernel on the target-domain samples is denoted as K_t = Φ_t^T Φ_t ∈ R^{n_t×n_t}, where Φ_t = [φ_t(x_1^(t)), ..., φ_t(x_{n_t}^(t))] and φ_t(·) is the nonlinear feature mapping function induced by K_t.
Define the feature weight vector w = [w_c; w_s; w_t] ∈ R^{d_c+d_s+d_t} for the augmented feature space, where w_c ∈ R^{d_c}, w_s ∈ R^{d_s}, and w_t ∈ R^{d_t} are the weight vectors defined for the common subspace, the source domain, and the target domain, respectively. Introduce a transformation metric

H = [W_1, W_2]^T [W_1, W_2] ∈ R^{(d_s+d_t)×(d_s+d_t)},          (6.21.8)

which is positive semi-definite, i.e., H ⪰ 0. With H, the optimization problem can be reformulated as follows [162]:

min_{H⪰0} max_{α}  J(α) = 1^T α − (1/2)(α ⊙ y)^T K_H (α ⊙ y),          (6.21.9)
subject to  y^T α = 0,   0 ≤ α ≤ C1,   tr(H) ≤ λ,          (6.21.10)


where a ⊙ b denotes the elementwise product of two vectors a and b, and
• α = [α_1^(s), ..., α_{n_s}^(s), α_1^(t), ..., α_{n_t}^(t)]^T is the vector of dual variables;
• y = [y_1^(s), ..., y_{n_s}^(s), y_1^(t), ..., y_{n_t}^(t)]^T is the label vector of all training instances;
• K_H = Φ^T (H + I) Φ is the derived kernel matrix for the samples from the source and target domains, where Φ = [Φ_s, O_{d_s×n_t}; O_{d_t×n_s}, Φ_t] ∈ R^{(d_s+d_t)×(n_s+n_t)};
• λ = λ_{w1} + λ_{w2}.

Let the transformation metric H consist of a linear combination of rank-one normalized positive semi-definite (PSD) matrices:

H_θ = Σ_{r=1}^{∞} θ_r M_r,   M_r ∈ M,          (6.21.11)

where the θ_r are the linear combination coefficients, and M = {M_r}_{r=1}^{∞} is the set of rank-one normalized PSD matrices M_r = h_r h_r^T with h_r ∈ R^{n_s+n_t} and h_r^T h_r = 1. Denoting K = Φ^T Φ = K^{1/2} K^{1/2}, the optimization problem of heterogeneous feature augmentation in Eqs. (6.21.9)–(6.21.10) can be reformulated as [162]:

min_θ max_{α∈A}  1^T α − (1/2)(α ⊙ y)^T K^{1/2} (H_θ + I) K^{1/2} (α ⊙ y),          (6.21.12)
subject to  H_θ = Σ_{r=1}^{∞} θ_r M_r,   M_r ∈ M,   1^T θ ≤ λ,   θ ≥ 0.          (6.21.13)

By setting θ ← (1/λ)θ, the above heterogeneous feature augmentation can be further rewritten as [162]:

min_{θ∈D_θ} max_{α∈A}  1^T α − (1/2)(α ⊙ y)^T ( Σ_{r=1}^{∞} θ_r K_r ) (α ⊙ y),          (6.21.14)

where A = {α | α^T y = 0, 0_n ≤ α ≤ C1_n} is the feasible set of the dual variables α, K_r = [k_r(x_i, x_j)]_{i,j=1}^{n,n} = [φ_r^T(x_i) φ_r(x_j)]_{i,j=1}^{n,n} ∈ R^{n×n} is the rth base kernel matrix of the labeled patterns, and the linear combination coefficient vector θ = [θ_1, ..., θ_∞]^T is nonnegative, i.e., θ ≥ 0. The optimization problem in (6.21.14) is an infinite kernel learning (IKL) problem with base kernels K_r, which can be readily solved with a multiple kernel learning solver.
Ordinary machine learning methods such as the support vector machine operate on a single kernel matrix, representing information from a single data source. However, many machine learning problems are better characterized by incorporating data from more than one source. By combining multiple kernel matrices into a single classifier, multiple kernel learning realizes the representation of multiple feature sets, which is conducive to the processing of multi-source data.


Definition 6.55 (Multiple Kernel Learning) Multiple kernel learning (MKL) is a machine learning approach that aims at the unified representation of data from multiple domains with multiple kernels, and it consists of the following three parts.
• Use kernel matrices to serve as the unified representation of multi-source data.
• Transform the data under each feature representation into a kernel matrix K.
• M kinds of features lead to M base kernels {k_m(x_i, x_j)}_{m=1}^{M}. That is, the ensemble kernel for the unified representation is the linear combination of the base ones given by

K_ij = K(x_i, x_j) = Σ_{m=1}^{M} θ_m k_m(x_i, x_j).          (6.21.15)

The (i, j)th entry of the kernel matrix K may be the linear kernel function K_ij = K(x_i, x_j) = x_i^T x_j, the polynomial kernel function K_ij = K(x_i, x_j) = (γ x_i^T x_j + b)^d with γ > 0, the Gaussian kernel function K_ij = K(x_i, x_j) = exp(−γ d^2(x_i, x_j)), and so on. Here, d(x_i, x_j) is the dissimilarity or distance between x_i and x_j.
With the training data {x_i, y_i}_{i=1}^{N}, where y_i ∈ {−1, +1}, and the base kernels {k_m}_{m=1}^{M}, the learned model is of the form

h(x) = Σ_{i=1}^{N} α_i y_i k(x_i, x) + b = Σ_{i=1}^{N} α_i y_i Σ_{m=1}^{M} θ_m k_m(x_i, x) + b,          (6.21.16)

where the linear combination coefficient θ_m is the importance of the mth base kernel k_m(x_i, x_j), and b is a learning bias. The task of multiple kernel learning is then to optimize both {α_i}_{i=1}^{N} and {θ_m}_{m=1}^{M} by using all the training data {(x_n, y_n)}_{n=1}^{N} = {(x_i^(s), y_i^(s))}_{i=1}^{N_s} ∪ {(x_j^(t), y_j^(t))}_{j=1}^{N_t} with N = N_s + N_t.
Algorithm 6.47 shows the ℓ_p-norm multiple kernel learning algorithm for the interleaved optimization of θ and α in (6.21.9). Algorithm 6.48 is the heterogeneous feature augmentation algorithm [162].
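A small NumPy sketch of the ensemble kernel (6.21.15) is given below; the linear and Gaussian base kernels are standard choices, and the combination coefficients θ_m are assumed to be supplied, e.g., by an MKL solver such as Algorithm 6.47.

import numpy as np

def linear_kernel(X, Y):
    return X @ Y.T

def gaussian_kernel(X, Y, gamma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def ensemble_kernel(X, Y, thetas, kernels):
    """K = sum_m theta_m * K_m with theta_m >= 0, as in (6.21.15)."""
    return sum(th * k(X, Y) for th, k in zip(thetas, kernels))

# Example: equal weights over a linear and a Gaussian base kernel.
X = np.random.randn(5, 3)
K = ensemble_kernel(X, X, [0.5, 0.5], [linear_kernel, gaussian_kernel])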

6.21.2 Cross-Domain Transform Method
Let W : x^(t) → x^(s) denote a linear transformation matrix from the target domain x^(t) ∈ D_T to the source domain x^(s) ∈ D_S. Then, the inner product similarity function between x^(s) and the transformed Wx^(t) can be described as

sim(W) = ⟨x^(s), Wx^(t)⟩ = (x^(s))^T W x^(t).          (6.21.17)


Algorithm 6.47 ℓ_p-norm multiple kernel learning algorithm [146]
1. input: the accuracy parameter ε and the subproblem size Q.
2. initialization: g_{m,i} = ĝ_i = α_i = 0, ∀ i = 1, ..., N; L = S = −∞; θ_m = (1/M)^{1/p}, ∀ m = 1, ..., M.
3. iterate
4.   Calculate the gradient ĝ of J(α) in (6.21.9) w.r.t. α:
       ĝ = ∂J(α)/∂α = 1 − (Σ_{m=1}^{M} θ_m K_m)(α ⊙ y).
5.   Select Q variables α_{i_1}, ..., α_{i_Q} based on the gradient ĝ.
6.   Store α^old = α and then update α ← α − μĝ.
7.   Update the gradient g_{m,i} ← g_{m,i} + Σ_{q=1}^{Q} (α_{i_q} − α_{i_q}^old) k_m(x_{i_q}, x_i), ∀ m = 1, ..., M; i = 1, ..., N.
8.   Compute the quadratic terms S_m = (1/2) Σ_i g_{m,i} α_i and q_m = 2 θ_m^2 S_m, ∀ m = 1, ..., M.
9.   L^old = L, L = Σ_i y_i α_i, S^old = S, S = Σ_m θ_m S_m.
10.  if |1 − (L − S)/(L^old − S^old)| ≥ ε then
11.    θ_m = (q_m)^{1/(p+1)} / ( Σ_{m'=1}^{M} (q_{m'})^{p/(p+1)} )^{1/p}, ∀ m = 1, ..., M.
12.  else
13.    break
14.  end if
15.  ĝ_i = Σ_m θ_m g_{m,i}, ∀ i = 1, ..., N.
16. output: θ = [θ_1, ..., θ_M]^T and α = [α_1, ..., α_N]^T.

Algorithm 6.48 Heterogeneous feature augmentation [162]
1. input: labeled source samples {(x_i^(s), y_i^(s))}_{i=1}^{N_s} and labeled target samples {(x_j^(t), y_j^(t))}_{j=1}^{N_t}.
2. initialization: M_1 = {M_1} with M_1 = h_1 h_1^T and h_1 = (1/√(N_s + N_t)) 1_{N_s+N_t}.
3. repeat
4.   Use Algorithm 6.47 to solve for θ and α in (6.21.9) based on M_r.
5.   Obtain a rank-one PSD matrix M_{r+1} = hh^T ∈ R^{(N_s+N_t)×(N_s+N_t)} with h = K^{1/2}(α ⊙ y) / ||K^{1/2}(α ⊙ y)||.
6.   Set M_{r+1} = M_r ∪ {M_{r+1}}, and r = r + 1.
7.   exit if the objective converges.
8. output: α and H = λ Σ_{r=1}^{Q} θ_r M_r.

The goal of cross-domain transform-based domain adaptation [219] is to learn the linear transformation W given some form of supervision, and then to utilize the learned similarity function in a classification or clustering algorithm.
Given a set of N points {x_1, ..., x_N} ∈ R^d, we seek a positive definite Mahalanobis matrix W which parameterizes the (squared) Mahalanobis distance:

d_W(x_i, x_j) = (x_i − x_j)^T W (x_i − x_j).          (6.21.18)

Sample a random pair consisting of a labeled source-domain sample (x_i^(s), y_i^(s)) and a labeled target-domain sample (x_j^(t), y_j^(t)), and then create two constraints described by

d_W(x_i^(s), x_j^(t)) ≤ u,   if y_i^(s) = y_j^(t);          (6.21.19)


d_W(x_i^(s), x_j^(t)) ≥ ℓ,   if y_i^(s) ≠ y_j^(t),          (6.21.20)

where u is a relatively small value and ℓ is a sufficiently large value. The above two constraints can be rewritten as

x_i^(s) similar to x_j^(t),      if d_W(x_i^(s), x_j^(t)) ≤ u;          (6.21.21)
x_i^(s) dissimilar to x_j^(t),   if d_W(x_i^(s), x_j^(t)) ≥ ℓ.          (6.21.22)

Learning a Mahalanobis distance can be stated as follows.
1. Given N source feature points {x_1^(s), ..., x_N^(s)} ∈ R^d and N target feature points {x_1^(t), ..., x_N^(t)} ∈ R^d.
2. Given inequality constraints relating pairs of feature points.
   • Similarity constraints: d_W(x_i^(s), x_j^(t)) ≤ u;
   • Dissimilarity constraints: d_W(x_i^(s), x_j^(t)) ≥ ℓ.
3. Problem: Learn a positive semi-definite Mahalanobis matrix W such that the Mahalanobis distance d_W(x_i^(s), x_j^(t)) = (x_i^(s) − x_j^(t))^T W (x_i^(s) − x_j^(t)) satisfies the above similarity and dissimilarity constraints.
This problem is known as a Mahalanobis metric learning problem which can be formulated as the constrained optimization problem [219]:

min_{W⪰0}  tr(W) − log det(W),          (6.21.23)
subject to  d_W(x_i^(s), x_j^(t)) ≤ u,   (i, j) ∈ S,          (6.21.24)
            d_W(x_i^(s), x_j^(t)) ≥ ℓ,   (i, j) ∈ D.          (6.21.25)

Here, the regularizer tr(W) − log det(W) is defined only between positive semidefinite matrices, and S and D are the set of similar pairs and the set of dissimilar pairs, respectively. The constrained optimization (6.21.23) provides a Mahalanobis metric learning method and is called information theoretic metric learning (ITML), as shown in Algorithm 6.49.

6.21.3 Transfer Component Analysis Method
Let P(X_S) and Q(X_T) be the marginal distributions of X_S = {x_{S_i}} and X_T = {x_{T_i}} from the source and target domains, respectively, where P and Q are different.


Algorithm 6.49 Information-theoretic metric learning [71]
1. input: d × n matrix X = [x_1, ..., x_n], set of similar pairs S, set of dissimilar pairs D, distance threshold u for the similarity constraints, distance threshold ℓ for the dissimilarity constraints, initial Mahalanobis matrix W_0, slack parameter γ, constraint index function c(·).
2. initialization: W ← W_0, λ_ij ← 0, ∀ i, j.
3. Let ξ_{c(i,j)} = u if (i, j) ∈ S; ξ_{c(i,j)} = ℓ otherwise.
4. repeat
5.   Pick a constraint (i, j) ∈ S or (i, j) ∈ D.
6.   Update p ← (x_i − x_j)^T W (x_i − x_j).
7.   δ = 1 if (i, j) ∈ S; δ = −1 otherwise.
8.   α ← min( λ_ij, (δ/2)(1/p − γ/ξ_{c(i,j)}) ).
9.   β ← δα/(1 − δαp).
10.  ξ_{c(i,j)} ← γ ξ_{c(i,j)}/(γ + δα ξ_{c(i,j)}).
11.  λ_ij ← λ_ij − α.
12.  W ← W + βW(x_i − x_j)(x_i − x_j)^T W.
13. until convergence
14. output: Mahalanobis matrix W.
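The core of Algorithm 6.49 is the rank-one Bregman update in steps 5–12. A minimal NumPy sketch of a single update is given below; the slack variable ξ and dual variable λ for the chosen constraint are assumed to be tracked by the caller, and the names are illustrative.

import numpy as np

def itml_update(W, xi, lam, x_i, x_j, is_similar, gamma):
    """One ITML update for the constraint on the pair (x_i, x_j)."""
    delta = 1.0 if is_similar else -1.0
    diff = x_i - x_j
    p = diff @ W @ diff                               # current Mahalanobis distance (step 6)
    alpha = min(lam, 0.5 * delta * (1.0 / p - gamma / xi))   # step 8
    beta = delta * alpha / (1.0 - delta * alpha * p)         # step 9
    xi_new = gamma * xi / (gamma + delta * alpha * xi)       # step 10
    lam_new = lam - alpha                                    # step 11
    W_new = W + beta * (W @ np.outer(diff, diff) @ W)        # step 12: rank-one projection
    return W_new, xi_new, lam_new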

Most domain adaptation methods are based on a key assumption: P ≠ Q, but P(Y_S | X_S) = P(Y_T | X_T). However, in many real-world applications, the conditional probability P(Y | X) may also change across domains due to noisy or dynamic factors underlying the observed data.
Let φ be a transformation vector such that φ(X_S) = [φ(x_{S_1}), ..., φ(x_{S_{n_s}})] ∈ R^{1×n_s} and φ(X_T) = [φ(x_{T_1}), ..., φ(x_{T_{n_t}})] ∈ R^{1×n_t}, where n_s and n_t are the numbers of source and target features. Consider domain adaptation under the weak assumption that P ≠ Q but there exists a transformation vector φ such that P(φ(X_S)) ≈ P(φ(X_T)) and P(Y_S | φ(X_S)) ≈ P(Y_T | φ(X_T)). A key problem is how to find this transformation φ. Since there are no labeled data in the target domain, φ cannot be learned by directly minimizing the distance between P(Y_S | φ(X_S)) and P(Y_T | φ(X_T)). Pan et al. [203] propose to learn φ such that:
• the distance between the marginal distributions P(φ(X_S)) and P(φ(X_T)) is small,
• φ(X_S) and φ(X_T) preserve important properties of X_S and X_T, respectively.
For finding a transformation φ satisfying P(Y_S | φ(X_S)) ≈ P(Y_T | φ(X_T)), assume that φ is the feature mapping vector induced by a universal kernel. Maximum mean discrepancy embedding (MMDE) [201] embeds both the source and target-domain data into a shared low-dimensional common latent space using a nonlinear mapping φ, which is a symmetric feature transformation with φ = T_S = T_T shown in Fig. 6.6b.


Define the Gram matrices on the source domain (S, S), the target domain (T, T), and the cross-domain (S, T) in the embedded space as follows:

K_{S,S} = φ^T(X_S) φ(X_S) = [φ(x_{S_i}) φ(x_{S_j})]_{i,j=1}^{n_s,n_s} ∈ R^{n_s×n_s},          (6.21.26)
K_{S,T} = φ^T(X_S) φ(X_T) = [φ(x_{S_i}) φ(x_{T_j})]_{i,j=1}^{n_s,n_t} ∈ R^{n_s×n_t},          (6.21.27)
K_{T,S} = φ^T(X_T) φ(X_S) = K_{S,T}^T ∈ R^{n_t×n_s},          (6.21.28)
K_{T,T} = φ^T(X_T) φ(X_T) = [φ(x_{T_i}) φ(x_{T_j})]_{i,j=1}^{n_t,n_t} ∈ R^{n_t×n_t}.          (6.21.29)

To learn the symmetric kernel matrix

K = [ K_{S,S}  K_{S,T} ;  K_{T,S}  K_{T,T} ] ∈ R^{(n_s+n_t)×(n_s+n_t)},          (6.21.30)

construct a matrix L with entries

L_ij = { 1/n_s^2,        if x_i, x_j ∈ X_S;
         1/n_t^2,        if x_i, x_j ∈ X_T;
         −1/(n_s n_t),   otherwise.          (6.21.31)

Then, the objective function of MMDE can be written as [203]:

max_{K⪰0}  tr(KL) − λ tr(K)   subject to constraints on K,          (6.21.32)

where the first term in the objective minimizes the distance between distributions, the second term maximizes the variance in the feature space, and λ ≥ 0 is a trade-off parameter. However, K is a high-dimensional kernel matrix because n_s and/or n_t are large. As such, it is necessary to use a semi-orthogonal transformation matrix W̄ ∈ R^{(n_s+n_t)×m} such that W̄ W̄^T = I_m to transform K^{−1/2}


to an m-dimensional matrix W = K^{−1/2} W̄ (where m ≪ n_s + n_t). Then, the distance between distributions, dist(X_S, X_T) = tr(KL), becomes

tr(KL) = tr(K K^{−1/2} W̄ (K^{−1/2} W̄)^T KL) = tr(K W W^T KL) = tr(W^T KLK W).

Therefore, the objective function of MMDE in (6.21.32) can be rewritten as [203]:

min_W  { tr(W^T KLK W) + μ tr(W^T W) },          (6.21.33)
subject to  W^T KHK W = I_m,          (6.21.34)

where H is the centering matrix given by

H = I_{n_s+n_t} − (1/(n_s + n_t)) 1 1^T.          (6.21.35)

The constrained optimization problem (6.21.33)–(6.21.34) can be reformulated as the unconstrained optimization problem [203]:

max_W  J(W) = tr( (W^T (KLK + μI) W)^{−1} W^T KHK W ).          (6.21.36)

Under the condition that the column vectors of W = [w_1, ..., w_m] are orthogonal to each other, the above optimization problem can be divided into m subproblems:

w_i = arg max_{w_i}  tr( (w_i^T (KLK + μI) w_i)^{−1} w_i^T KHK w_i )
    = arg max_{w_i}  (w_i^T KHK w_i) / (w_i^T (KLK + μI) w_i),   i = 1, ..., m.          (6.21.37)

This is a typical generalized Rayleigh quotient, and thus w_i is the ith generalized eigenvector corresponding to the ith largest generalized eigenvalue of the matrix pencil (KHK, KLK + μI), or equivalently the ith eigenvector corresponding to the ith largest eigenvalue of the matrix (KLK + μI)^{−1} KHK. Moreover, m is the number of leading generalized eigenvalues of (KHK, KLK + μI) or leading eigenvalues of (KLK + μI)^{−1} KHK. This is the unsupervised transfer component analysis (TCA) for domain adaptation, which was proposed by Pan et al. [203]. A compact numerical sketch of this computation is given after the application list below.
Transfer learning has been applied to many real-world applications [199, 268].
• Natural language processing [27], including sentiment classification [28], cross-language classification [165], text classification [67], etc.
• Image classification [277], WiFi localization classification [200, 202, 289, 297, 298], video concept detection [82], and so on.


• Pattern recognition including face recognition [139], head pose classification [211], disease prediction [195], atmospheric dust aerosol particle classification [171], sign language recognition [88], image recognition [205], etc.
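Putting the pieces of Sect. 6.21.3 together, the following SciPy sketch computes the TCA embedding of (6.21.30)–(6.21.37) on the union of source and target samples; a Gaussian kernel is assumed for illustration, and the names are not part of [203].

import numpy as np
from scipy.linalg import eigh

def tca(Xs, Xt, m=2, mu=1.0, gamma=1.0):
    """Xs: (n_s, d) source samples; Xt: (n_t, d) target samples; returns the (n_s+n_t, m) embedding."""
    ns, nt = len(Xs), len(Xt)
    X = np.vstack([Xs, Xt])
    n = ns + nt
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * d2)                      # kernel matrix on all samples
    L = np.zeros((n, n))                         # coefficient matrix (6.21.31)
    L[:ns, :ns] = 1.0 / ns ** 2
    L[ns:, ns:] = 1.0 / nt ** 2
    L[:ns, ns:] = -1.0 / (ns * nt)
    L[ns:, :ns] = -1.0 / (ns * nt)
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix (6.21.35)
    A = K @ L @ K + mu * np.eye(n)
    B = K @ H @ K
    eigvals, eigvecs = eigh(B, A)                # solves B v = lambda A v
    W = eigvecs[:, -m:]                          # m leading generalized eigenvectors
    return K @ W                                 # embedded source and target samples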

Brief Summary of This Chapter
• This chapter presents the machine learning tree.
• Machine learning aims at using learning machines to establish the mapping from input data to output data.
• The main tasks of machine learning are classification (for discrete data) and prediction (for continuous data).
• This chapter highlights selected topics and advances in machine learning: graph machine learning, reinforcement learning, Q-learning, transfer learning, etc.

References 1. Acar, E., Camtepe, S.A., Krishnamoorthy, M., Yener, B.: Modeling and multiway analysis of chatroom tensors. In: Proceedings of the IEEE International Conference on Intelligence and Security Informatics, pp. 256–268. Springer, Berlin (2005) 2. Acar, E., Aykut-Bingo, C., Bingo, H., Bro, R., Yener, B.: Multiway analysis of epilepsy tensors. Bioinformatics 23, i10–i18 (2007) 3. Agrawal, R., Imielinski, T., Swami, A.N.: Mining association rules between sets of items in large databases. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 207–216 (1992) 4. Ali, M.M., Khompatraporn, C., Zabinsky, Z.B.: A numerical evaluation of several stochastic algorithms on selected continuous global optimization on test problems. J. Global Optim. 31, 635–672 (2005) 5. Aliu, O.G., Imran, A., Imran, M.A., Evans, B.: A survey of self organisation in future cellular networks. IEEE Commun. Surveys Tutorials. 15(1), 336–361 (2013) 6. Anderberg, M.R.: Cluster Analysis for Application. Academic, New York (1973) 7. Anderson, T.W.: An Introduction to Multivariate Statistical Analysis, 2nd edn. Wiley, New York (1984) 8. Ando, R.K., Zhang, T.: A framework for learning predictive structures from multiple tasks and unlabeled data. J. Mach. Learn. Res. 6, 1817–1853 (2005) 9. Angluin D.: Queries and concept learning. Mach. Learn. 2(4), 319–342 (1988) 10. Arnold, A., Nallapati, R., Cohen, W.W.: A comparative study of methods for transductive transfer learning. In: Proceedings of the Seventh IEEE International Conference on Data Mining Workshops, pp. 77–82 (2007) 11. Atlas, L., Cohn, D., Ladner, R., El-Sharkawi, M.A., Marks II, R.J.: Training connectionist networks with queries and selective sampling. In: Advances in Neural Information Processing Systems 2, Morgan Kaufmann, pp. 566–573 (1990) 12. Auslender, A.: Optimisation Méthodes Numériques. Masson, Paris (1976) 13. Bach, F.R., Jordan, M.I.: Kernel independent component analysis. J. Mach. Learn. Res. 3, 1–48 (2002) 14. Bagheri, M., Nurmanova, V., Abedinia, O., Naderi, M.S.: Enhancing power quality in microgrids with a new online control Strategy for DSTATCOM using reinforcement learning algorithm. IEEE Access 6, 38986–38996 (2018)


15. Bandyopdhyay, S., Maulik, U.: An evolutionary technique based on K-means algorithm for optimal clustering in RN . Inform. Sci. 146(1–4), 221–237 (2002) 16. Bartlett, P.L.: The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Trans. Inf. Theory. 44(2), 525–536 (1998) 17. Baum, L.E., Eagon, J.A.: An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. Bull. Amer. Math. Soc. 73(3), 360 (1967) 18. Behbood, V., Lu, J., Zhang, G.: Fuzzy bridged refinement domain adaptation: long-term bank failure prediction. Int. J. Comput Intell. Appl. 12(1), Art. no. 1350003 (2013) 19. Belkin, M., Niyogi, P.: Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 15(6), 1373–1396 (2003) 20. Belkin, M., Niyogi, P., Sindhwani, V.: Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J. Mach. Learn. Res. 7, 2399–2434 (2006) 21. Bellman, R.: Dynamic Programming. Princeton University Press, Princeton (1957) 22. Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013) 23. Bersini, H., Dorigo, M., Langerman, S.: Results of the first international contest on evolutionary optimization. In: Proceedings of IEEE International Conference on Evolutionary Computation, Nagoya, pp. 611–615 (1996) 24. Bertsekas, D.P.: Dynamic Programming and Optimal Sequence of States of the Markov Decision Process. Control, vol. 11. Athena Scientific, Nashua (1995) 25. Bertsekas, D.P.: Nonlinear Programming, 2nd edn. Athena Scientific, Nashua (1999) 26. Beyer, H.G., Schwefel, H.P.: Evolution strategies: a comprehensive introduction. J. Nat. Comput. 1(1), 3–52 (2002) 27. Blitzer, J., McDonald, R., Pereira, F.: Domain adaptation with structural correspondence learning. In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pp. 120–128 (2006) 28. Blitzer, J., Dredze, M., Pereira, F.: Biographies, Bollywood, Boom-Boxes and Blenders: Domain adaptation for sentiment classification. In: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 432–439 (2007) 29. Blum, A., Chawla, S.: Learning from labeled and unlabeled data using graph mincuts. In: Proceedings of the 18th International Conference on Machine Learning (2001) 30. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: Proceedings of the 11th Annual Conference on Computational Learning Theorem (COLT 98), pp. 92–100 (1998) 31. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018) 32. Bouneffouf, D.: Exponentiated gradient exploration for active learning. Computers 5(1), 1–12 (2016) 33. Bouneffouf, D., Laroche, R., Urvoy, T., Fèraud, R., Allesiardo, R.: Contextual bandit for active learning: Active Thompson sampling. In: Proceedings of the 21st International Conference on Neural Information Processing, ICONIP (2014) 34. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans. Pattern Analy. Mach. Intell. 26(9), 1124–1137 (2004) 35. Breiman, L.: Better subset selection using the nonnegative garrote. Technometrics 37, 738– 754 (1995) 36. Bro, R.: PARAFAC: tutorial and applications. Chemome. Intell. 
Lab. Syst. 38, 149–171 (1997) 37. Bu, F.: A high-order clustering algorithm based on dropout deep learning for heterogeneous data in Cyber-Physical-Social systems. IEEE Access 6, 11687–11693 (2018) 38. Buczak, A.L., Guven, E.: A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Commun. Surv. Tut. 18(2), 1153–1176 (2016)


39. Burges, C.J.C.: A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Disc. 2, 121–167 (1998) 40. Burr, S.: Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison, Retrieved 2014-11-18 (2010) 41. Cai, D., Zhang, C., He, S.: Unsupervised feature selection for multi-cluster data. In: Proceedings of the 16th ACM SIGKDD, July 25–28, Washington, pp. 333–342 (2010) 42. Campbell, C., Cristianini, N., Smola, A.: Query learning with large margin classifiers. In: Proceedings of the International Conference on Machine Learning (ICML) (2000) 43. Candès, E.J., Wakin, M.B., Boyd, S.P.: Enhancing sparsity by reweighted 1 minimization. J. Fourier Analy. Appl. 14(5–6), 877–905 (2008) 44. Candès, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? J. ACM 58(3), 1–37 (2011) 45. Caruana, R.A.: Multitask learning. Mach. Learn. 28, 41–75 (1997) 46. Chandrasekaran, V., Sanghavi, S., Parrilo, P.A., Wilisky, A.S.: Rank-sparsity incoherence for matrix decomposition. SIAM J. Optim. 21(2), 572–596 (2011) 47. Chang, C.I., Du, Q.: Estimation of number of spectrally distinct signal sources in hyperspectral imagery. IEEE Trans. Geosci. Remote Sens. 42(3), 608–619 (2004) 48. Chattopadhyay, R., Sun, Q., Fan, W., Davidson, I., Panchanathan, S., Ye, J.: Multisource domain adaptation and its application to early detection of fatigue. ACM Trans. Knowl. Discov. From Data 6(4), 1–26 (2012) 49. Chen, T., Amari, S., Lin, Q.: A unified algorithm for principal and minor components extraction. Neural Netw. 11, 385–390 (1998) 50. Chen, Y., Lasko, T.A., Mei, Q., Denny, J.C, Xu, H.: A study of active learning methods for named entity recognition in clinical text. J. Biomed. Inform. 58, 11–18 (2015) 51. Chernoff, H.: Sequential analysis and optimal design. In: CBMS-NSF Regional Conference Series in Applied Mathematics, vol. 8. SIAM, Philadelphia (1972) 52. Choromanska, A., Jebara, T., Kim, H., Mohan, M., Monteleoni, C.: Fast spectral clustering via the Nyström method. In: International Conference on Algorithmic Learning Theory ALT 2013, pp. 367–381 (2013) 53. Chung, F.R.K.: Spectral graph theory. In: CBMS Regional Conference Series, vol.92. Conference Board of the Mathematical Sciences, Washington (1997) 54. Chung, C.J., Reynolds, R.G.: CAEP: An evolution-based tool for real-valued function optimization using cultural algorithms. Int. J. Artif. Intell. Tool 7(3), 239–291 (1998) 55. Ciresan, D.C., Meier, U., Schmidhuber, J.: Transfer learning for Latin and Chinese characters with deep neural networks. In: Proceedings of the International Joint Conference on Neural Networks (IJCNN), Brisbane, pp. 1–6 (2012) 56. Coates, A., Ng, A.Y.: Learning feature representations with K-means. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade, 2nd edn., pp. 561–580. Springer, Berlin (2012) 57. Cohen, W.W.: Fast effective rule induction. In: Proceedings of the 12th International Conference on International Conference on Machine Learning, Lake Tahoe, pp. 115–123 (1995) 58. Cohn, D.: Active learning. In: Sammut, C., Webb, G.I. (eds.) Encyclopedia of Machine Learning, pp. 10–14 (2011) 59. Cohn, D., Ghahramani, Z., Jordan, M.I.: Active learning with statistical models. J. Artific. Intell. Res. 4, 129–145 (1996) 60. Comon, P., Golub, G., Lim, L.H., Mourrain, B.: Symmetric tensors and symmetric tensor rank. SIAM J. Matrix Anal. Appl. 30(3), 1254–1279 (2008) 61. 
Corana, A., Marchesi, M., Martini, C., Ridella, S.: Minimizing multimodal functions of continuous variables with simulated annealing algorithms. ACM Trans. Math. Softw. 13(3), 262–280 (1987) 62. Correa, N.M., Adali, T., Li, Y.Q., Calhoun, V.D.: Canonical correlation analysis for data fusion and group inferences. IEEE Signal Proc. Mag. 27(4), 39–50 (2010)


63. Cortes, C., Mohri, M.: On transductive regression. In: Proceedings of the Neural Information Processing Systems (NIPS), pp. 305–312 (2006) 64. Crammer, K., Singer, Y.: On the algorithmic implementation of multiclass kernel-based vector machines. J. Mach. Learn. Res. 2, 265–292 (2001) 65. Cristianini, N., Shawe-Taylor, J., Elisseeff, A., Kandola, J.S.: On kernel-target alignment. In: NIPS’01 Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, pp. 367–373 (2001) 66. Dai, W., Yang, Q., Xue, G.R., Yu, Y.: Boosting for transfer learning. In: Proceedings of the 24th International Conference on Machine Learning, pp. 193–200 (2007) 67. Dai, W., Xue, G., Yang, Q., Yu, Y.: Transferring naive Bayes classifiers for text classification. In: Proc. 22nd Association for the Advancement of Artificial Intelligence (AAAI) Conference on Artificial Intelligence, pp. 540–545 (2007) 68. Dai, W., Jin, O., Xue, G.-R., Yang, Q., Yu, Y.: EigenTransfer: A unified framework for transfer learning. In: Proceedings of the the 26th International Conference on Machine Learning, Montreal, pp. 193–200 (2009) 69. Dash, M., Liu, H.: Feature selection for classification. Intell. Data Anal. 1(1–4), 131–156 (1997) 70. Daumé III, H.: Frustratingly easy domain adaptation. In: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 256–263 (2007) 71. Davis, J.V., Kulis, B., Jain, P., Sra, S., Dhillon, I.S.: Information-theoretic metric learning. In: Proceedings of the international Conference on Machine Learning, pp. 209–216 (2007) 72. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems, vol. 27, pp. 1646–1654 (2014) 73. Defferrard, M., Bresson, X., Vandergheynst, P.: Convolutional neural networks on graphs with fast localized spectral filtering. In: Proceedings of the 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, pp. 3837–3845 (2016) 74. Deng, Z., Choi, K., Jiang, Y.: Generalized hidden-mapping ridge regression, knowledgeleveraged inductive transfer learning for neural networks, fuzzy systems and kernel method. IEEE Trans. Cybern. 44(12), 2585–2599 (2014) 75. Dhillon, I.S., Modha, D.M.: Concept decompositions for large sparse text data using clustering. Mach. Learn. 42(1), 143–175 (2001) 76. Dong, X., Thanou, D., Frossard, P., Vandergheynst, P.: Learning Laplacian matrix in smooth graph signal representations. IEEE Trans. Sign. Proc. 64(23), 6160–6173 (2016) 77. Donoho, D.L., Johnstone, I.: Adapting to unknown smoothness via wavelet shrinkage. J. Amer. Statist. Assoc. 90, 1200–1224 (1995) 78. Dorigo, M., Gambardella, L.M.: Ant colony system: A cooperative learning approach to the traveling salesman problem. IEEE Trans. Evol. Comput. 1(1), 53–66 (1997) 79. Douglas, S.C., Kung, S.-Y., Amari, S.: A self-stabilized minor subspace rule. IEEE Sign. Proc. Lett. 5(12), 328–330 (1998) 80. Downie, J.S.: A window into music information retrieval research. Acoust. Sci. Technol. 29(4), 247–255 (2008) 81. Du, Q., Faber, V., Gunzburger, M.: Centroidal Voronoi tessellations: applications and algorithms. SIAM Rev. 41, 637–676 (1999) 82. Duan, L., Tsang, I.W., Xu, D., Maybank, S.J.: Domain transfer SVM for video concept detection. In: Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1375–1381 (2009)


83. Duda, R.O., Hart, P.E.: Pattern Classification and Scene Analysis. Wiley, New York (1973) 84. Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression. Ann. Statist. 32, 407–499 (2004) 85. El-Attar, R.A., Vidyasagar, M., Dutta, S.R.K.: An algorithm for II-norm minimization with application to nonlinear II-approximation. SIAM J. Numer. Anal. 16(1), 70–86 (1979) 86. Estienne, F., Matthijs, N., Massart, D.L., Ricoux, P., Leibovici, D.: Multi-way modeling of high-dimensionality electroencephalographic data. Chemometr. Intell. Lab. Syst. 58(1), 59– 72 (2001) 87. Fan, J., Han, F., Liu, H.: Challenges of big data analysis. Nat. Sci. Rev. 1(2), 293–314 (2014) 88. Farhadi, A., Forsyth, D., White, R.: Transfer learning in sign language. In: Proceedings of the IEEE 2007 Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007) 89. Farmer, J., Packard, N., Perelson, A.: The immune system, adaptation and machine learning. Phys. D: Nonlinear Phenom. 2, 187–204 (1986) 90. Fedorov, V.V.: Theory of Optimal Experiments. (Trans. by Studden, W.J., Klimko, E.M.). Academic, New York (1972) 91. Fercoq, O., Richtárk, P.: Accelerated, parallel, and proximal coordinate descent. SIAM J. Optim. 25(4), 1997–2023 (2015) 92. Figueiredo, M.A.T., Nowak, R.D., Wright, S.J.: Gradient projection for sparse reconstruction: application to compressed sensing and other inverse problems. IEEE J. Sel. Top. Signa. Proc. 1(4), 586–597 (2007) 93. Finkel, J.R., Manning, C.D.: Hierarchical Bayesian domain adaptation. In: Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics, Los Angeles, pp. 602–610 (2009) 94. Fisher, R.A.: The statistical utilization of multiple measurements. Ann. Eugenic. 8, 376–386 (1938) 95. Ford, L., Fulkerson, D.: Flows in Networks. Princeton University Press, Princeton (1962) 96. Freund, Y.: Boosting a weak learning algorithm by majority. Inform. Comput. 12(2), 256–285 (1995) 97. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 119–139 (1997) 98. Friedman, N., Geiger, D., Goldszmidt, M.: Bayesian network classifiers. Mach. Learn. 29, 131–163 (1997) 99. Friedman, J., Hastie, T., Höeling, H., Tibshirani, R.: Pathwise coordinate optimization. Ann. Appl. Stat. 1(2), 302–332 (2007) 100. Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view of boosting. Ann. Stat. 28(2), 337–407 (2000) 101. Fu, W.J.: Penalized regressions: the bridge versus the Lasso. J. Comput. Graph. Stat. 7(3), 397–416 (1998) 102. Fuchs, J.J.: Multipath time-delay detection and estimation. IEEE Trans. Signal Process. 47(1), 237–243 (1999) 103. Furey, T.S., Cristianini, N., Duffy, N., Bednarski, D.W., Schummer, M., Haussler, D.: Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16(10), 906–914 (2000) 104. Ge, Z., Song, Z., Ding, S.X., Huang, B.: Data mining and analytics in the process industry: the role of machine learning. IEEE Access 5, 20590–20616 (2017) 105. Geladi, P., Kowalski, B.R.: Partial least squares regression: a tutorial. Anal. Chim. Acta 186, l–17 (1986) 106. George, A.P., Powell, W.B.: Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming. Mach. Learn. 65(1), 167–198 (2006) 107. Goldberg, D.E., Holland, J.H.: Genetic algorithms and machine learning. Mach. Learn. 3(2), 95–99 (1988) 108. 
Golub, G.H., Zha, H.: The canonical correlations of matrix pairs and their numerical computation. In: Linear Algebra for Signal Processing, pp. 27–49. Springer, Berlin (1995)


109. Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531–537 (1999) 110. Grandvalet, Y., Bengio, Y.: Semi-supervised learning by entropy minimization. In: Advances in Neural Information Processing Systems, vol. 17, pp. 529–536 (2005) 111. Guo, W., Kotsia, I., Ioannis, P.: Tensor learning for regression. IEEE Trans. Image Process. 21(2), 816–827 (2012) 112. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003) 113. Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using support vector machines. Mach. Learn. 46, 389–422 (2002) 114. Handl, J., Knowles, J., Kell, D.B.: Computational cluster validation in post-genomic data analysis. Bioinformatics 21(15), 3201–3212 (2005) 115. Hardoon, D.R., Szedmak, S., Shawe-Taylor, J.: Canonical correlation analysis: an overview with application to learning methods. Neural Comput. 16(12), 2639–2664 (2004) 116. Hesterberg, T., Choi, N.H., Meier, L., Fraley, C.: Least angle and 1 penalized regression: a review. Stat. Surv. 2, 61–93 (2008) 117. Hoerl, A.E., Kennard, R.W.: Ridge regression: biased estimates for non-orthogonal problems. Technometrics 12, 55–67 (1970) 118. Hoerl, A.E., Kennard, R.W.: Ridge regression: applications to nonorthogonal problems. Technometrics 12, 69–82 (1970) 119. Hoi, S.C.H., Jin, R., Lyu, M.R.: Batch mode active learning with applications to text categorization and image retrieval. IEEE Trans. Knowl. Data Eng. 21(9), 1233–1247 (2009) 120. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw. 2(5), 359–366 (1989) 121. Höskuldsson, A.: PLS regression methods. J. Chemometr. 2, 211–228 (1988) 122. Hotelling, H.: Relations between two sets of variants. Biometrika 28(3/4), 321–377 (1936) 123. Huang, G.-B., Zhu, Q.-Y., Siew, C.-K.: Extreme learning machine: theory and applications. Neurocomputing 70(1–3), 489–501 (2006) 124. Huang, G.-B., Zhou, H., Ding, X., Zhang, R.: Extreme learning machine for regression and multiclass classification. IEEE Trans. Syst. Man Cybern. B Cybern. 42(2), 513–529 (2012) 125. Hunter, D.R., Lange, K.: A tutorial on MM algorithms. Amer. Statist. 58, 30–37 (2004) 126. Jain, K., Dubes, R.C.: Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs (1988) 127. Jain, A.K.: Data clustering: 50 years beyond K-means. Pattern Recogn. Lett. 31, 651–666 (2010) 128. Jain, A.K., Duin, R.P.W., Mao, J.: Statistical pattern recognition: a review. IEEE Trans. Pattern Anal. Mach. Intell. 22(1), 4–37 (2000) 129. Jamil, M., Yang, X.-S.: A literature survey of benchmark functions for global optimization problems. Int. J. Math. Modell. Numer. Optim. 4(2), 150–194 (2013) 130. Jensen, F.V.: Bayesian Networks and Decision Graphs. Springer, New York (2001) 131. Joachims, T.: Transductive inference for text classification using support vector machines. In: Proceedings of the 16th International Conference on Machine Learning, pp. 200–209 (1999) 132. Johnson, S.C.: Hierarchical clustering schemes. Psycioietrika 32(3), 241–254 (1967) 133. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, vol. 26, pp. 315–323 (2013) 134. 
Jolliffe, I.: Principal Component Analysis. Springer, New York (1986) 135. Jonesb, S., Shaoa, L., Dub, K.: Active learning for human action retrieval using query pool selection. Neurocomputing 124, 89–96 (2014) 136. Jouffe, L.: Fuzzy inference system learning by reinforcement methods. IEEE Trans. Syst. Man Cybern. Part C 28(3), 338–355 (1998)


137. Kaelbling, L.P., Littman, M.L., Moore, A.W.: Reinforcement learning: a survey. J. Artif. Intell. Res. 4, 237–285 (1996) 138. Kaelbling, L.P., Littman, M.L., Cassandra, A.R.: Planning and acting in partially observable stochastic domains. Artif. Intell. 101(1), 99–134 (1998) 139. Kan, M., Wu, J., Shan, S., Chen, X.: Domain adaptation for face recognition: targetize source domain bridged by common subspace. Int. J. Comput. Vis. 109(1–2), 94–109 (2014) 140. Kearns, M., Valiant, L.: Crytographic limitations on learning Boolean formulae and finite automata. In: Proceedings of the Twenty-first Annual ACM Symposium on Theory of Computing, pp. 433–444 (1989); See J. ACM 41(1), 67–95 (1994) 141. Kearns, M.J., Vazirani, U.V.: An Introduction to Computational Learning Theory. MIT Press, Cambridge (1994) 142. Kennedy, J., Eberhart, R.: Particle swarm optimization. In: Proceedings of the IEEE International Conference on Neural Networks (ICNN), vol. IV, pp. 1942–1948 (1995) 143. Kiers, H.A.L.: Towards a standardized notation and terminology in multiway analysis. J. Chemometr. 14, 105–122 (2000) 144. Kimura, A., Kameoka, H., Sugiyama, M., Nakano, T., Maeda, E., Sakano, H., Ishiguro, K.: SemiCCA: Efficient semi-supervised learning of canonical correlations. Inform. Media Technol. 8(2), 311–318 (2013) 145. Klaine, P.V., Imran, M.A., Souza, R.D., Onireti, O.: A survey of machine learning techniques applied to self-organizing cellular networks. IEEE Commun. Surv. Tut. 19(4), 2392–2431 (2017) 146. Kloft, M., Brefeld, U., Sonnenburg, S., and Zien, A.: p -norm multiple kernel learning. J. Mach. Learn. Res. 12, 953–997 (2011) 147. Kober, J., Bangell, J., Peters, J.: Reinforcement learning in robotics: a survey. Int. J. Robustics Res. 32(11), 1238–1274 (2013) 148. Kocer, B., Arslan, A.: Genetic transfer learning. Expert Syst. Appl. 37(10), 6997–7002 (2010) 149. Kolda, T.G.: Multilinear operators for higher-order decompositions. Sandia Report SAND2006-2081, California (2006) 150. Kolda, T.G., Bader, B.W., Kenny, J.P.: Higher-order web link analysis using multilinear algebra. In: Proceedings of the 5th IEEE International Conference on Data Mining, pp. 242– 249 (2005) 151. Koneˇcný J., Liu, J., Richtárik, P., Takáˇc, M.: Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE J. Sel. Top. Signa. Process. 10(2), 242–255 (2016) 152. Koza, J.R.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge (1992) 153. Kulis, B., Saenko, K., Darrell, T.: What you saw is not what you get: domain adaptation using asymmetric kernel transforms. In: Proceedings of the IEEE 2011 Conference on Computer Vision and Pattern Recognition, pp. 1785–1292 (2011) 154. Lathauwer, L.D., Moor, B.D., Vandewalle, J.: A multilinear singular value decomposition. SIAM J. Matrix Anal. Appl. 21, 1253–1278 (2000) 155. Lathauwer, L.D., Nion, D.: Decompositions of a higher-order tensor in block terms—part III: alternating least squares algorithms. SIAM J. Matrix Anal. Appl. 30(3), 1067–1083 (2008) 156. Le Roux, N., Schmidt, M., Bach, F.R.: A stochastic gradient method with an exponential convergence rate for finite training sets. In: Advances in Neural Information Processing Systems, vol. 25, pp. 2663–2671 (2012) 157. Letexier, D., Bourennane, S., Blanc-Talon, J.: Nonorthogonal tensor matricization for hyperspectral image filtering. IEEE Geosci. Remote Sensing. Lett. 5(1), 3–7 (2008) 158. 
Levie, R., Monti, F., Bresson, X., Bronstein, M.M.: CayleyNets: Graph convolutional neural networks with complex rational spectral filters (2018). Available at: https://arXiv:1705. 07664v2 159. Lewis, D., Gale, W.: A sequential algorithm for training text classifiers. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 3–12. ACM/Springer, New York/Berlin (1994)

434

6 Machine Learning

160. Li, X., Guo, Y.: Adaptive active learning for image classification. In: Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2013) 161. Li, F., Pan, S.J., Jin, O., Yang, Q., Zhu, X.: Cross-domain co-extraction of sentiment and topic lexicons. In: Proceedings of the 50th annual meeting of the association for computational linguistics long papers, vol. 1, pp. 410–419 (2012) 162. Li, W., Duan, L., Xu, D., Tsang, I.: Learning with augmented features for supervised and semi-supervised heterogeneous domain adaptation. IEEE Trans. Pattern Anal. Mach. Intell. 36(6), 1134–1148 (2014) 163. Lin, L.: Self-improving reactive agents based on reinforcement learning, planning and teaching. Mach. Learn. 8, 293–321 (1992) 164. Lin, Z., Chen, M., Ma, Y.: The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. Technical Report UILU-ENG-09-2215 (2009) 165. Ling, X., G.-R. Xue, G. -R., Dai, W., Jiang, Y., Yang, Q., Yu, Y.: Can Chinese Web pages be classified with English data source? In: Proceedings of the 17th International Conference on World Wide Web, pp. 969–978 (2008) 166. Liu, J., Wright, S.J., Re, C., Bittorf, V., Sridhar, S.: An asynchronous parallel stochastic coordinate descent algorithm. J. Mach. Learn. Res., 16, 285–322 (2015) 167. Lu, J., Behbood, V., Hao, P., Zuo, H., Xue, S., Zhang, G.: Transfer learning using computational intelligence: a survey. Knowl. Based Syst. 80, 14–23 (2015) 168. Luis, R., Sucar, L.E., Morales, E.F.: Inductive transfer for learning Bayesian networks. Mach. Learn. 79(1–2), 227–255 (2010) 169. Luo, F.L., Unbehauen, R., Cichock, R.: A minor component analysis algorithm. Neural Netw. 10(2), 291–297 (1997) 170. Ma, Y., Luo, G., Zeng, X., Chen, A.: Transfer learning for cross-company software defect prediction. Inform. Softw. Technol. 54(3), 248–256 (2012) 171. Ma, Y., Gong, W., Mao, F.: Transfer learning used to analyze the dynamic evolution of the dust aerosol. J. Quant. Spectrosc. Radiat. Transf. 153, 119–130 (2015) 172. Mahalanobis, P.C.: On the generalised distance in statistics. Proc. Natl. Inst. Sci. India 2(1), 49–55 (1936) 173. Maier, M., von Luxburg, U., Hein, M.: How the result of graph clustering methods depends on the construction of the graph. ESAIM: Probab. Stat. 17, 370–418 (2013) 174. Masci, J., Meier, U., Ciresan, D., Schmidhuber, J.: Stacked convolutional auto-encoders for hierarchical feature extraction. In: Proceedings of the 21st International Conference on Artificial Neural Networks, Part I, Espoo, pp. 52–59 (2011) 175. Massy, W.F.: Principal components regression in exploratory statistical research. J. Am. Stat. Assoc. 60(309), 234–256 (1965) 176. McCallum, A., Nigam, K.: Employing EM and pool-based active learning for text classification. In: ICML ’98: Proceedings of the Fifteenth International Conference on Machine Learning, pp. 359–367 (1998) 177. Michalski, R.: A theory and methodology of inductive learning. Mach. Learn. 1, 83–134 (1983) 178. Miller, G.A., Nicely, P.E.: An analysis of perceptual confusions among some English consonants. J. Acoust. Soc. Am. 27, 338–352 (1955) 179. Mishra, S.K.: Global optimization by differential evolution and particle swarm methods: Evaluation on some benchmark functions. Munich Research Papers in Economics (2006). Available at: https://mpra.ub.uni-muenchen.de/1005/ 180. Mishra, S.K.: Performance of differential evolution and particle swarm methods on some relatively Harder multi-modal benchmark functions (2006). 
Available at: https://mpra.ub.unimuenchen.de/449/ 181. Mitchell, T.M.: Machine Learning, vol. 45. McGraw Hill, Burr Ridge (1997) 182. Mitra, P., Murthu, C.A., Pal, S.K.: Unsupervised feature selection using feature similarity. IEEE Trans. Pattern Anal. Mach. Intell. 24(3), 301–312 (2002) 183. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518, 529– 533 (2015)

References

435

184. Mohar, B.: Some applications of Laplace eigenvalues of graphs. In: Hahn, G., Sabidussi, G. (eds.) Graph Symmetry: Algebraic Methods and Applications. NATO Science Series C, vol.497, pp. 225–275. Kluwer, Dordrecht (1997) 185. Moulton, C.M., Roberts, S.A., Calatn, P.H.: Hierarchical clustering of multiobjective optimization results to inform land-use decision making. URISA J. 21(2), 25–38 (2009) 186. Murthy, C.A., Chowdhury, N.: In search of optimal clusters using genetic algorithms. Pattern Recog. Lett. 17, 825–832 (1996) 187. Narayanan, H., Belkin, M., Niyogi, P.: On the relation between low density separation, spectral clustering and graph cuts. In: Schölkopf, B., Platt, J., Hoffman, T. (eds.) Advances in Neural Information Processing Systems, vol. 19, pp. 1025–1032. MIT Press, Cambridge (2007) 188. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012) 189. Ng, V., Vardie, C.: Weakly supervised natural language learning without redundant views. In: Proceedings of the Human Language Technology/Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), Main Papers, pp. 94–101 (2003) 190. Ng, A., Jordan, M., Weiss, Y.: On spectral clustering: analysis and an algorithm. In: Dietterich, T., Becker, S., Ghahramani, Z. (eds.) Advances in Neural Information Processing Systems, vol. 14, pp. 849–856. MIT Press, Cambridge (2002) 191. Nguyen, H.D.: An introduction to Majorization-minimization algorithms for machine learning and statistical estimation. WIREs Data Min. Knowl. Discovery 7(2), e1198 (2017) 192. Niculescu-Mizil, A., Caruana, R.: Inductive transfer for Bayesian network structure learning. In: Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS), San Juan (2007) 193. Nigam, K., Ghani, R.: Analyzing the effectiveness and applicability of co-training. In: Proceedings of the International Conference on Information and Knowledge Management (CIKM), pp. 86–93 (2000) 194. Oja, E., Karhunen, J.: On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. J. Math Anal. Appl. 106, 69–84 (1985) 195. Ogoe, H.A., Visweswaran, S., Lu, X., Gopalakrishnan, V.: Knowledge transfer via classification rules using functional mapping for integrative modeling of gene expression data. BMC Bioinform. 16, 1–15 (2015) 196. Oquab, M., Bottou, L., Laptev, I.: Learning and transferring mid-level image representations using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1717–1724 (2014) 197. Ortega, J.M., Rheinboldt, W.C.: Iterative Solutions of Nonlinear Equations in Several Variables, pp. 253–255. Academic, New York (1970) 198. Owen, A.B.: A robust hybrid of lasso and ridge regression. Prediction and Discovery (Contemp. Math.), 443, 59–71 (2007) 199. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22(10), 1345–1359 (2010) 200. Pan, S.J., Kwok, J.T., Yang, Q., Pan, J.J.: Adaptive localization in a dynamic WiFi environment through multi-view learning. In: Proceedings of the 22nd Association for the Advancement of Artificial Intelligence (AAAI) Conference Artificial Intelligence, pp. 1108– 1113 (2007) 201. Pan, S.J., Kwok, J.T., Yang, Q.: Transfer learning via dimensionality reduction. In: Proceedings of the 23rd National Conference on Artificial Intelligence, vol. 2, pp. 677–682 (2008) 202. 
Pan, S.J., Shen, D., Yang, Q., Kwok, J.T.: Transferring localization models across space. In: Proceedings of the 23rd Association for the Advancement of Artificial Intelligence (AAAI) Conference on Artificial Intelligence, pp. 1383–1388 (2008) 203. Pan, S.J., Tsang, I.W., Kwok, J.T, Yang, Q.: Domain adaptation via transfer component analysis. IEEE Trans. Neural Netw. 22(2), 199–210 (2011)

436

6 Machine Learning

204. Parra, L., Spence, C., Sajda, P., Ziehe, A., Muller, K.: Unmixing hyperspectral data. In: Advances in Neural Information Processing Systems, vol. 12, pp. 942–948. MIT Press, Cambridge (2000) 205. Patel, V.M, Gopalan, R., Li, R., Chellappa, R.: Visual domain adaptation: a survey of recent advances. IEEE Signal Process. Mag. 32(3), 53–69 (2015) 206. Polikar, R.: Ensemble based systems in decision making. IEEE Circ. Syst. Mag. 6(3), 21–45 (2006) 207. Prettenhofer, P., Stein, B.: Cross-language text classification using structural correspondence learning. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 1118–1127 (2010) 208. Price, W.L.: A controlled random search procedure for global optimisation. Comput. J. 20(4), 367–370 (1977) 209. Rahnamayan, S., Tizhoosh, H.R., Salama, N.M.M.: Opposition-based differential evolution. IEEE Trans. Evol. Comput. 12(1), 64–79 (2008) 210. Raina, R., Battle, A., Lee, H., Packer, B., Ng, A.Y.: Self-taught learning: Transfer learning from unlabeled data. In: Proceedings of the 24th International Conference on Machine Learning, Corvallis, pp. 759–766 (2007) 211. Rajagopal, A.N., Subramanian, R., Ricci, E., Vieriu, R.L., Lanz, O., Ramak-rishnan, K.R., Sebe, N.: Exploring transfer learning approaches for head pose classification from multi-view surveillance images. Int. J. Comput. Vis. 109(1–2), 146–167 (2014) 212. Richtárik, P., Takáˇc M.: Parallel coordinate descent methods for big data optimization. Math. Program. Ser. A 156, 433–484 (2016) 213. Rivli, J.: An Introduction to the Approximation of Functions. Courier Dover Publications, New York (1969) 214. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951) 215. Rosipal, R., Krämer, N.: Overview and recent advances in partial least squares. In: Proceedings of the Workshop on Subspace, Latent Structure and Feature Selection (SLSFS) 2005, pp. 34–51 (2006) 216. Roweis, S., Saul, L.: Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500), 2323–2326 (2000) 217. Roy, D.M., Kaelbling, L.P.: Efficient Bayesian task-level transfer learning. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, pp. 2599–2604 (2007) 218. Rummery, G.A., Niranjan, M.: On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University (1994) 219. Saenko, K., Kulis, B., Fritz, M., Darrell, T.: Adapting visual category models to new domains. In: Proceedings of the European Conference on Computer Vision, vol. 6314, pp. 213–226 (2010) 220. Schaal, S.: Is imitation learning the route to humanoid robots? Trends Cogn. Sci. 3(6), 233– 242 (1999) 221. Schapire, R.E.: The strength of weak learnability. Mach. Learn. 5, 197–227 (1990) 222. Schmidt, M., Le Roux, N., Bach, F.: Minimizing finite sums with the stochastic average gradient. Technical Report, INRIA, hal-0086005 (2013). See also Math. Program. 162, 83– 112 (2017) 223. Schwefel, H.P.: Numerical Optimization of Computer Models. Wiley, Hoboken (1981) 224. Settles, B., Craven, M., Friedland, L.: Active learning with real annotation costs. In: Proceedings of the NIPS Workshop on Cost-Sensitive Learning, pp. 1–10 (2008) 225. Settles, B., Craven, M., Ray, S.: Multiple-instance active learning. In: Advances in Neural Information Processing Systems (NIPS), vol.20, pp. 1289–1296, MIT Press, Cambridge (2008) 226. Seung, H.S., Opper, M., Sompolinsky, H.: Query by committee. 
In: Proceedings of the ACM Workshop on Computational Learning Theory, pp. 287–294 (1992)

References

437

227. Shell, J., Coupland, S.: Towards fuzzy transfer learning for intelligent environments. Ambient. Intell. Lect. Notes Comput. Sci. 7683, 145–160 (2012) 228. Shell, J., Coupland, S.: Fuzzy transfer learning: Methodology and application. Inform. Sci. 293, 59–79 (2015) 229. Shen, H., Tan, Y., Lu, J., Wu, Q., Qiu, Q.: Achieving autonomous power management using reinforcement learning. ACM Trans. Des. Autom. Electron. Syst. 18(2), 24:1–24:32 (2013) 230. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 888–905 (2000) 231. Shuman, D.I., Vandergheynst, P., Frossard, P.: Chebyshev polynomial approximation for distributed signal processing. In: Proceedings of the International Conference on Distributed Computing in Sensor Systems, Barcelona, pp. 1–8 (2011) 232. Shuman, D.I., Narang, S.K, Frossard, P., Ortega, A., Vandergheynst, P.: Extending highdimensional data analysis to networks and other irregular domains. IEEE Signal Process. Mag. 30(3), 83–98 (2013) 233. Silver, D.L., Mercer, R.E.: The parallel transfer of task knowledge using dynamic learning rates based on a measure of relatedness. In: Thrun, S., Pratt, L.Y. (eds.) Learning to Learn, pp. 213–233. Kluwer Academic, Boston (1997) 234. Sindhwani, V., Niyogi, P., Belkin, M.: Beyond the point cloud: From transductive to semisupervised learning. In: Proceedings of the 22nd International Conference on Machine Learning (ICML), pp. 824–831. ACM, New York (2005) 235. Smola, J., Kondor, R.: Kernels and regularization on graphs. In: Learning Theory and Kernel Machines, pp. 144–158. Springer, Berlin (2003) 236. Song, J., Babu, P., Palomar, D.P.: Optimization methods for designing sequences with low autocorrelation sidelobes. IEEE Trans. Signal Process. 63(15), 3998–4009 (2015) 237. Sriperumbudur, B.K., Torres, D.A., Lanckriet, G.R.G.: A majorization-minimization approach to the sparse generalized eigenvalue problem. Mach. Learn. 85, 3–39 (2011) 238. Sun, J., Zeng, H., Liu, H., Lu, Y., Chen, Z.: CubeSVD: a novel approach to personalized web search. In: Proceedings of the 14th International Conference on World Wide Web, pp. 652–662 (2005) 239. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning Series. MIT Press, Cambridge (1998) 240. Tang, K., Li, X., Suganthan, P.N., Yang, Z., Weise, T.: Benchmark functions for the CEC’2010 special session and competition on large-scale global optimization. Technical Report, 2009. Available at: https://www.researchgate.net/publication/228932005 241. Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J.„ Mei, Q.: LINE: Large-scale information network embedding. In: Proceedings of the International World Wide Web Conference Committee (IW3C2), Florence, pp. 1067–1077 (2015) 242. Tao, D., Li, X., Wu, X., Hu, W., Maybank, S.J.: Supervised tensor learning. Knowl. Inform. Syst. 13, 1–42 (2007) 243. Thrun, S., Pratt, L. (eds.): Learning to Learn. Kluwer Academic, Dordrecht (1998) 244. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B 58, 267– 288 (1996) 245. Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a data set via the gap statistic. J. Roy. Statist. Soc. B 63(2), 411–423 (2001) 246. Tikhonov, A.: Solution of incorrectly formulated problems and the regularization method. Soviet Math. Dokl., 4, 1035–1038 (1963) 247. Tikhonov, A.N., Arsenin, V.Y.: Solutions of Ill-Posed Problems. Wiley, New York (1977) 248. 
Tokic, M., Palm, G.: Value-difference based exploration: Adaptive control between epsilongreedy and softmax. In: KI 2011: Advances in Artificial Intelligence, pp. 335–346 (2011) 249. Tommasi, T., Orabona, F., Caputo, B.: Safety in numbers: learning categories from few examples with multi model knowledge transfer. In: Proceedings of the IEEE Conference on Computer Vision Pattern Recognition 2010, pp. 3081–3088 (2010) 250. Tong, S., Koller, D.: Support vector machine active learning with applications to text classification. J. Mach. Learn. Res. 3, 45–66 (2001)

438

6 Machine Learning

251. Tou, J.T., Gonzalez, R.C.: Pattern Recognition Principles. Addison-Wesley, London (1974) 252. Tsitsiklis, J.N.: Asynchronous stochastic approximation and Q-Learning. Mach. Learn. 16, 185–202 (1994) 253. Uurtio, V., Monteiro, J.M., Kandola, J., Shawe-Taylor, J., Fernandez-Reyes, D., Rousu, J.: A tutorial on canonical correlation methods. ACM Comput. Surv. 50(95), 14–38 (2017) 254. Valiant, L.G.: A theory of the learnable. Commun. ACM 27, 1134–1142 (1984) 255. van Hasselt, H.: Double Q-learning. In: Proceedings of the Advances in Neural Information Processing Systems (NIPS), pp. 2613–2621 (2010) 256. Vasilescu, M.A.O., Terzopoulos, D.: Multilinear analysis of image ensembles: TensorFaces. In: Proceedings of the European Conference on Computer Vision, Copenhagen, pp. 447–460 (2002) 257. von Luxburg, U.: A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416 (2007) 258. Wang, H., Ahuja, N.: Compact representation of multidimensional data using tensor rank-one decomposition. In: Proceedings of the International Conference on Pattern Recognition, vol. 1, pp. 44–47 (2004) 259. Wang, X., Qian, B., Davidson, I.: On constrained spectral clustering and its applications. Data Min. Knowl. Disc. 28, 1–30 (2014) 260. Wang, L., Hua, X., Yuan, B., Lu, J.: Active learning via query synthesis and nearest neighbour search. Neurocomputing 147, 426–434 (2015) 261. Wang, D., Cui, P., Zhu, W.: Structural deep network embedding. In: Proceedings of the 22nd International Conference on Knowledge Discovery and Data Mining, pp. 1225–1234. ACM, New York (2016) 262. Watanabe, S.: Pattern Recognition: Human and Mechanical. Wiley, New York (1985) 263. Watldns, C.J.C.H.: Learning from delayed rewards. PhD Thesis, University of Cambridge, England (1989) 264. Watkins, C.J.C.H., Dayan, R.: Q-learning. Mach. Learn. 8, 279–292 (1992) 265. Weenink, D.: Canonical correlation analysis. IFA Proc. 25, 81–99 (2003) 266. Wei, X.-Y., Yang, Z.-Q.: Coached active learning for interactive video search. In: Proceedings of the ACM International Conference on Multimedia, pp. 443–452 (2011) 267. Wei, X., Cao, B. Yu, P.S.: Nonlinear joint unsupervised feature selection. In: Proceedings of the 2016 SIAM International Conference on Data Mining, pp. 414–422 (2016) 268. Weiss, K., Khoshgoftaar, T.M., Wang, D.D.: A survey of transfer learning. J. Big Data 3(9), 1–40 (2016) 269. Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8, 229–256 (1992) 270. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 3rd edn. Morgan Kaufmann, San Mateo (2011) 271. Witten, D.M., Tibshirani, R., Hastie, T.: A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10(3), 515– 534 (2009) 272. Wright, J., Ganesh, A., Rao, S., Peng, Y., Ma, Y.: Robust principal component analysis: exact recovery of corrupted low-rank matrices via convex optimization. In: Proceedings of the Advances in Neural Information Processing Systems, vol. 87, pp. 20:3–20:56 (2009) 273. Wright, J., Ganesh, A., Yang, A.Y., Ganesh, A., Sastry, S., Ma, Y.: Robust face recognition via sparse representation. IEEE Trans. Pattern Reconginit. Mach. Intell. 31(2), 210–227 (2009) 274. Wold, H.: Path models with latent variables: The NIPALS approach. In: Blalock, H.M., et al. (eds.) Quantitative Sociology: International Perspectives on Mathematical and Statistical Model Building, pp. 307–357. 
Academic, Cambridge (1975) 275. Wold, S., Sjöström, M., Eriksson, L.: PLS-regression: a basic tool of chemometrics. Chemom. Intell. Lab. Syst. 58(2), 109–130 (2001) 276. Wooldridge, M.J., Jennings, N.R.: Intelligent agent: theory and practice. Knowl. Eng. Rev. 10(2), 115–152 (1995) 277. Wu, P., Dietterich, T.G.: Improving SVM accuracy by training on auxiliary data sources. In: Proceedings of the Twenty-First International Conference on Machine Learning, pp. 871–878 (2004)

References

439

278. Wu, T.T., Lange, K.: The MM alternative to EM. Statist. Sci. 25(4), 492–505 (2010) 279. Wu, Z., Leahy, R.: An optimal graph theoretic approach to data clustering: Theory and its application to image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 15(11), 1101– 1113 (1993) 280. Xia, R., Zong, C., Hu, X., Cambria, E.: Feature ensemble plus sample selection: domain adaptation for sentiment classification. IEEE Intell. Syst. 28(3), 10–18 (2013) 281. Xu, L., Krzyzak, A., Suen, C.Y.: Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE Trans. Syst. Man Cybern. 22, 418–435 (1992) 282. Xu, L., Oja, E., Suen, C.: Modified Hebbian learning for curve and surface fitting. Neural Netw. 5, 441–457 (1992) 283. Xu, H., Caramanis, C., Mannor, S.: Robust regression and Lasso. IEEE Trans. Inform. Theory 56(7), 3561–3574 (2010) 284. Xue, B., Zhang, M., Browne, W.N., Yao, X.: A survey on evolutionary computation approaches to feature selection. IEEE Trans. Evol. Comput. 20(4), 606–626 (2016) 285. Yamauchi, K.: Covariate shift and incremental learning. In: Advances in Neuro-Information Processing, pp. 1154–1162. Springer, Berlin (2009) 286. Yan, S., Wang, H.: Semi-supervised Learning by sparse representation. In: Proceedings of the SIAM International Conference on Data Mining, Philadelphia, pp. 792–801 (2009) 287. Yang, B.: Projection approximation subspace tracking. IEEE Trans. Signal Process. 43, 95– 107 (1995) 288. Yen, T.-J.: A majorization-minimization approach to variable selection using spike and slab priors. Ann. Stat. 39(3), 1748–1775 (2011) 289. Yin, J., Yang, Q., Ni, L.M.: Adaptive temporal radio maps for indoor location estimation. In: Proceedings of the Third IEEE International Conference on Pervasive Computing and Communications (2005) 290. Yu, K., Zhang, T., Gong, Y.: Nonlinear learning using local coordinate coding. In: Advances in Neural Information Processing Systems, vol. 22, pp. 2223–2231 (2009) 291. Yu, H., Sun, C., Yang, W., Yang, X., Zuo, X.: AL-ELM: One uncertainty-based active learning algorithm using extreme learning machine. Neurocomputing 166(20), 140–150 (2015) 292. Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J. Roy. Stat. Soc. Ser. B 68, 49–67 (2006) 293. Yuan, G.-X., Ho, C.-H., Lin, C.-J.: Recent advances of large-scale linear classification. Proc. IEEE 100(9), 2584–2603 (2012) 294. Zhang, X.D.: Matrix Analysis and Applications. Cambridge University Press, Cambridge (2017) 295. Zhang, Z., Coutinho, E., Deng, J., Schuller, B.: Cooperative learning and its application to emotion recognition from speech. IEEE Trans. Audio Speech Lang. Process. 23(1), 115–126 (2015) 296. Zhang, Z., Pan, Z., Kochenderfer, M.J.: Weighted Double Q-learning. In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17), pp. 3455– 3461 (2017) 297. Zheng, V.W., Pan, S.J., Yang, Q., Pan, J.J.: Transferring multi-device localization models using latent multi-task learning. In: Proceedings of the 23rd Association for the Advancement of Artificial Intelligence (AAAI) Conference on Artificial Intelligence, pp. 1427–1432 (2008) 298. Zheng, V.W., Yang, Q., Xiang, W., Shen, D.: Transferring localization models over time. In: Proceedings of the 23rd Association for the Advancement of Artificial Intelligence (AAAI) Conference on Artificial Intelligence, pp. 1421–1426 (2008) 299. Zhou, Y., Goldman, S.: Democratic co-learning. 
In: Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), pp. 594–602 (2004) 300. Zhou, Z.-H., Li, M.: Tri-training: exploiting unlabeled data using three classifiers. IEEE Trans. Knowl. Data Eng. 17, 1529–1541 (2005) 301. Zhou, D., Schölkopf, B.: A regularization framework for learning from graph data. In: Proceedings of the ICML Workshop on Statistical Relational Learning, pp. 132–137 (2004)

440

6 Machine Learning

302. Zhou, J., Chen, J., Ye, J.: Multi-task learning: Theory, algorithms, and applications (2012). Available at: https://archive.siam.org/meetings/sdm12/zhou_-chen_-ye.pdf 303. Zhou, Z.-H., Zhan, D.-C., Yang, Q.: Semi-supervised learning with very few labeled training examples. In: Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence (AAAI-07) (2007) 304. Zhu, X.: Semi-Supervised Learning Literature Survey. Computer Sciences TR 1530, University of Wisconsin, Madison, (2005) 305. Zhu, X., Goldberg, A.B.: Introduction to Semi-Supervised Learning. In: Brachman, R.J., Dietterich, T. (eds.) Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypoo, San Rafael (2009) 306. Zhu, X., Ghahramani, Z., Laffer, J.: Semi-supervised learning using Gaussian fields and harmonic functions. In: Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington (2003) 307. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. Roy. Stat. Soc. B, 67(2), 301–320 (2005) 308. Zou, H., Hastie„ T., Tibshirani, R.: Sparse principal component analysis. J. Comput. Graph. Stat. 15(2), 265–286 (2006)

Chapter 7

Neural Networks

Neural networks can be viewed as a kind of cognitive intelligence. This chapter presents a neural network tree and then deals with the optimization problems in neural networks, activation functions, and basic neural networks from the perspective of matrix algebra. The remaining subjects of this chapter are selected topics and advances in neural networks: convolutional neural networks (CNNs), dropout learning, autoencoders, extreme learning machines (ELMs), graph embedding, manifold learning, network embedding, graph neural networks (GNNs), batch normalization networks, and generative adversarial networks (GANs).

7.1 Neural Network Tree

A neural network is viewed as an adaptive machine, and a good definition of it, given by Haykin [52], is:
A neural network is a massively parallel distributed processor that has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects:
1. Knowledge is acquired by the network through a learning process.
2. Interneuron connection strengths known as synaptic weights are used to store the knowledge.

Machine learning learns how to associate an input with some output, given a training set of examples of inputs and outputs. In Chap. 6 we focused on using algorithmic forms to implement machine learning. In this chapter we will discuss how to use neural networks with layered structures to implement machine learning. For a neural network, the most important task is to learn the model parameters of each layer from the data provided in the training set, so as to minimize the loss. Because of their strong nonlinear fitting ability, neural networks have important applications in various fields.


As described in Chap. 6, according to data structure, machine learning can be divided into two broad categories: "regular or Euclidean structured data learning" and "irregular or non-Euclidean structured data learning." Similarly, neural networks (NNs) can be divided into two broad categories: NNs for regular or Euclidean structured model learning and NNs for irregular or non-Euclidean structured model learning.
1. Neural networks for Euclidean structured model learning
• Neural networks for stochastic (probabilistic) model learning
– Bayesian neural networks: Their core idea is to minimize the KL divergence. Bayesian neural networks can be easily learned from small data sets. Bayesian methods provide further uncertainty estimation in the form of a probability distribution over the parameters. At the same time, a prior probability distribution over the parameters provides a regularization effect for the network, so as to prevent overfitting.
– Boltzmann machine: The basic idea of the popular Boltzmann machine and the restricted Boltzmann machine is that, by introducing hidden units into the Hopfield network, the model can be upgraded beyond the initial ideas of the Hopfield network and eventually replace it. The Boltzmann machine model splits into two parts: visible units and hidden units.
– Generative adversarial networks (GANs): GANs can be used for unsupervised learning of rich feature representations for arbitrary data distributions. GANs learn reusable feature representations from large unlabeled data sets and have emerged as a powerful framework for learning generative models of arbitrarily complex data distributions. A GAN plays an adversarial game with two linked models: a generative model and a discriminative model.
• Neural networks for deterministic model learning
– Multilayer perceptrons (MLPs), also known as feedforward neural networks, aim at working as an approximator of the mapping function from inputs to outputs. The approximated function is usually built by stacking together several hidden layers that are activated in a chain to obtain the desired output.
– Recurrent neural networks (RNNs), in contrast to MLPs, are models in which the output is a function not only of the current inputs but also of the previous outputs, which are encoded into a hidden state h. Because they have memory of the previous outputs, RNNs can encode the information present in the sequence itself, something that MLPs cannot do. As a consequence, this type of model can be very useful for learning from sequential data.
– Convolutional neural networks (CNNs) are a specific type of model conceived to accept two-dimensional input data, such as images or time series data. The output of the convolution operation is usually run through a nonlinear activation function and then further modified by means of a pooling function, which replaces the output at a certain location with a value obtained from nearby outputs.
– Dropout networks are a form of regularization for fully connected neural network layers and a very efficient form of model combination, obtained by setting the output of each hidden neuron to zero with probability q during training. Moreover, overfitting is a serious problem in deep neural networks with a large number of parameters to be learned; this overfitting can be greatly reduced by using dropout.
– Autoencoder: an MLP with a symmetric structure, used to learn an efficient representation (encoding) of a set of data in an unsupervised manner. The aim of an autoencoder is to make the output as similar to the input as possible. Autoencoders are unsupervised learning models.
2. Neural networks for non-Euclidean structured model learning
• Graph embedding: Embedding a graph or an information network into a low-dimensional space is useful in a variety of applications. Given as inputs a graph G = (V, E) and a predefined dimensionality of the embedding d (d ≪ |V|), graph embedding converts G into a d-dimensional space R^d. In this space, the graph properties (such as the first-order, second-order, and higher-order proximities) are preserved as much as possible. The graph is represented as either a d-dimensional vector (for a whole graph) or a set of d-dimensional vectors, with each vector representing the embedding of part of the graph (e.g., a node, an edge, or a substructure).
• Network embedding: Given a graph denoted as G = (V, E), network embedding aims to learn a mapping function f : v_i → y_i ∈ R^d, where d ≪ |V|. The objective of the function is to make the similarity between y_i and y_j explicitly preserve the first-order, second-order, and higher-order proximities of v_i and v_j.
• Neural networks on graphs: Neural networks in graph domains are divided into semi-supervised networks (graph neural networks (GNNs) and graph convolutional networks (GCNs)) and unsupervised networks (graph autoencoders).
– Graph neural networks (GNNs) are a kind of neural network that processes directly the data represented in the graph domain. GNNs can capture the dependence structure of graphs and rich relational information among elements via message passing between the nodes of a graph.
– Graph convolutional networks (GCNs), introduced by Kipf and Welling [82], are a machine learning method for graph-structured data based on extracting spatial features. Because the standard convolution for images or text cannot be applied directly to graphs without a grid structure, it is necessary to transform graphs into another, spectral domain with a grid structure.
– Graph autoencoders (GAEs): unsupervised networks based on the (variational) graph autoencoder model [81], for which a TensorFlow implementation is available.


Neural networks
• NNs with Euclidean structured model
  – NNs for stochastic model learning: Bayesian neural networks; Boltzmann machine; generative adversarial networks
  – NNs for deterministic model learning: multilayer perceptrons; recurrent neural networks; convolutional neural networks; dropout networks; autoencoders
• NNs with non-Euclidean structured model
  – Graph embedding
  – Network embedding
  – Neural networks on graph: graph neural networks; graph convolutional networks; graph autoencoders

Fig. 7.1 Neural network tree

The above classification of neural networks can be vividly represented by a neural network tree, as shown in Fig. 7.1.

7.2 From Modern Neural Networks to Deep Learning

A standard neural network (NN) consists of many simple, connected neurons, each producing a sequence of real-valued activations. Modern artificial neural networks are divided into shallow and deep neural networks. A shallow neural network typically consists of an input layer, one or a few hidden layers, and an output layer, while a deep neural network contains many hidden layers. Shallow neural networks can capture only shallow features in data, while deep neural networks can provide a deep understanding of the features in data. The following are major milestones of modern neural network and deep learning research [130, 155]:
• McCulloch and Pitts in 1943 [105] introduced the MCP model, which is considered to be the ancestor of the artificial neural model.
• Donald Hebb in 1949 [57] is considered to be the father of neural networks, as he introduced the Hebbian learning rule, which lays the foundation for modern neural networks.
• Frank Rosenblatt in 1958 [121] introduced the first perceptron, which highly resembles the modern perceptron.


• Paul Werbos in 1975 [158] mentioned the possibility of applying backpropagation to artificial neural networks.
• Kunihiko Fukushima in 1980 [32] introduced the Neocognitron, which inspired the convolutional neural network, and Teuvo Kohonen in 1981 [83] introduced the self-organizing map principle.
• John Hopfield in 1982 [65] introduced the Hopfield network.
• Ackley et al. in 1985 [1] invented the Boltzmann machine.
• In 1986 Paul Smolensky [139] introduced the Harmonium, which was later known as the restricted Boltzmann machine, and Michael I. Jordan [75] defined and introduced the recurrent neural network.
• Le Cun in 1990 introduced LeNet, showing the possibility of deep neural networks in practice.
• In 1991 Hochreiter's diploma thesis represented a milestone of explicit deep learning research.
• In 1997 Schuster and Paliwal [132] introduced the bidirectional recurrent neural network, and Hochreiter and Schmidhuber [63] introduced the LSTM, which solved the problem of vanishing gradients in recurrent neural networks.
• Geoffrey Hinton in 2006 [61] introduced deep belief networks and the layer-wise pretraining technique, opening the path to the current deep learning era.
• Salakhutdinov and Hinton in 2009 [124] introduced deep Boltzmann machines.
• In 2012 Geoffrey Hinton [62] introduced dropout, an efficient way of training neural networks, and Krizhevsky et al. [84] invented AlexNet, starting the era of CNNs used for ImageNet classification.
• In 2016 Goodfellow, Bengio, and Courville published the comprehensive book on deep learning [39].
The universal approximation properties of shallow neural networks come at the price of exponentially many neurons and are therefore not realistic.

7.3 Optimization of Neural Networks

Consider a (time) sequence of input data vectors X = {x_1, . . . , x_N} and a sequence of corresponding output data vectors Y = {y_1, . . . , y_N} with neighboring data pairs (in time) being somehow statistically dependent. Given the time sequences X and Y as training data, the aim of a neural network is to learn the rules to predict the output data y given the input data x. Inputs and outputs can, in general, be continuous and/or categorical variables. When the outputs are continuous, the problem is known as a regression problem, and when they are categorical (class labels), the problem is known as a classification problem. In the context of neural networks, the term prediction is usually used as a general term that includes both regression and classification.
The convex optimization problems in machine learning are offline convex optimizations: given a set of labeled data (x_i, y_i), i = 1, . . . , N, or unlabeled data x_1, . . . , x_N, solve the optimization problem min_{x∈F} f(x) through a fixed regularization function such as the squared ℓ2-norm, and modify it only via a single time-dependent parameter. However, in many practical applications in science and engineering, we are faced with online convex optimization problems that require optimization algorithms to adaptively choose their regularization function based on the loss functions observed so far.

7.3.1 Online Optimization Problems

Given a convex set F ⊆ R^n and a convex cost function f : F → R, convex optimization involves finding a point x* in F which minimizes f. In online convex optimization, the convex set F is known in advance, but in each step of some repeated optimization problem, one must select a point in F before seeing the cost function for that step. Let x ∈ F be the feasible decision vector lying in the convex feasible set F.

Definition 7.1 (Convex Optimization Problem [177]) A convex optimization problem consists of a convex feasible set F and a convex cost function f(x) : F → R. The optimal solution is the solution that minimizes the cost.

Definition 7.2 (Online Convex Optimization Problem [177]) Online convex optimization consists of a feasible set F ⊆ R^n and an infinite sequence {f_1, f_2, . . .}, where each f_t : F → R is a convex function at time step t. At each time step t, an online convex optimization algorithm selects a vector x_t ∈ F. After the vector is selected, it receives the cost function f_t.

Definition 7.3 (Projection) For the distance function d(x, y) = ‖x − y‖ between x ∈ F and y, the projection of a point y onto F is defined as [177]:

    P(y) = arg min_{x∈F} d(x, y) = arg min_{x∈F} ‖x − y‖.    (7.3.1)

In particular, standard subgradient algorithms move the predictor x_t in the opposite direction of the subgradient vector g_t, while maintaining x_{t+1} ∈ F via the projected gradient update [27, 177]:

    x_{t+1} = P(x_t − η g_t) = arg min_{x∈F} ‖x − (x_t − η g_t)‖_2^2.    (7.3.2)

If taking the Mahalanobis norm ‖x‖_A = ⟨x, Ax⟩^{1/2} and denoting by P_A(y) the projection of a point y onto F according to A, then the greedy projection algorithm becomes

    P_A(y) = arg min_{x∈F} ⟨x − y, A(x − y)⟩.    (7.3.3)

Select an arbitrary x_1 ∈ F as an initial value and a sequence of learning rates η_1, η_2, . . . ∈ R_+ (the nonnegative quadrant). In time step t, after receiving a cost function f_t(x_t), the greedy projection algorithm selects the next vector x_{t+1} as follows [177]:

    x_{t+1} = P(x_t − η_t ∂f_t(x_t)) = arg min_{x∈F} ‖x − (x_t − η_t ∂f_t(x_t))‖,    (7.3.4)

or

    x_{t+1} = P_A(x_t − η_t ∂f_t(x_t)) = arg min_{x∈F} ⟨x − (x_t − η_t ∂f_t(x_t)), A(x − (x_t − η_t ∂f_t(x_t)))⟩.    (7.3.5)

Definition 7.4 (Cost, Regret [177]) Given an algorithm A and a convex optimization problem (F, {f_1, f_2, . . .}), if {x_1, x_2, . . .} are the vectors selected by A, then the cost of A until time T is

    C_A(T) = Σ_{t=1}^T f_t(x_t).    (7.3.6)

The cost of a static feasible solution x ∈ F until time T is given by

    C_x(T) = Σ_{t=1}^T f_t(x).    (7.3.7)

The regret of algorithm A until time T is denoted by R_A(T) and is defined as

    R_A(T) = C_A(T) − min_{x∈F} C_x(T) = Σ_{t=1}^T f_t(x_t) − min_{x∈F} Σ_{t=1}^T f_t(x).    (7.3.8)

The goal of online optimization is to minimize the regret via adaptively choosing the regularization function based on the loss functions observed so far.
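To make the greedy projection update (7.3.4) and the regret (7.3.8) concrete, the following is a minimal Python/NumPy sketch; the quadratic losses, the box feasible set, and all names here are illustrative assumptions, not taken from [177].

```python
import numpy as np

def project_box(y, lo=-1.0, hi=1.0):
    """Euclidean projection P(y) onto the box F = [lo, hi]^n, cf. (7.3.1)."""
    return np.clip(y, lo, hi)

# Illustrative stream of convex costs f_t(x) = ||x - a_t||_2^2 with gradient 2(x - a_t).
rng = np.random.default_rng(0)
n, T = 5, 200
targets = rng.normal(size=(T, n))

x = np.zeros(n)                                  # x_1 in F
cost_alg = 0.0                                   # C_A(T), Eq. (7.3.6)
for t in range(T):
    a_t = targets[t]
    cost_alg += np.sum((x - a_t) ** 2)           # pay f_t(x_t) before seeing the next cost
    g_t = 2.0 * (x - a_t)                        # (sub)gradient of f_t at x_t
    eta_t = 1.0 / np.sqrt(t + 1)                 # decaying learning rate eta_t
    x = project_box(x - eta_t * g_t)             # greedy projection step, Eq. (7.3.4)

# Best fixed point: per coordinate, the box-constrained minimizer of sum_t (x - a_t)^2
# is the clipped mean, so the static cost C_x*(T) of (7.3.7) is easy to evaluate here.
x_star = project_box(targets.mean(axis=0))
cost_star = np.sum((targets - x_star) ** 2)
print("regret R_A(T):", cost_alg - cost_star)    # Eq. (7.3.8)
```

With the decaying learning rate η_t = 1/√t, the regret printed above grows only sublinearly in T, which is the behavior the greedy projection analysis guarantees.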

7.3.2 Adaptive Gradient Algorithm

In many applications of online and stochastic learning, the input data are of very high dimension, yet within any particular high-dimensional data sample only a few features are nonzero. Standard stochastic subgradient methods largely follow a predetermined procedural scheme that is oblivious to the characteristics of the data being observed.


Consider the minimization of composite functions [27]:

    min_{x∈F} φ_t(x) = f_t(x) + ϕ(x),    (7.3.9)

where f_t and ϕ are (closed) convex functions. In the machine learning setting, f_t is either an instantaneous loss or a stochastic estimate of the objective function in an optimization task. The function ϕ serves as a fixed regularization function and is typically used to control the complexity of x. At each round the algorithm makes a prediction x_t ∈ F and then receives the function f_t. Define the regret with respect to the fixed (optimal) predictor x* as [27]

    R_φ(T) ≜ Σ_{t=1}^T [φ_t(x_t) − φ_t(x*)] = Σ_{t=1}^T [f_t(x_t) + ϕ(x_t) − f_t(x*) − ϕ(x*)].    (7.3.10)

The goal of online optimization is to minimize the regret R_φ(T). The following are two methods for minimizing the regret (7.3.10):
1. Primal-Dual Subgradient Method [27, 107, 162]: In this method, the algorithm uses the average gradient ḡ_t = (1/t) Σ_{τ=1}^t g_τ to make a prediction x_t on round t through a trade-off between a gradient-dependent linear term ⟨ḡ_t, x⟩, the regularizer ϕ(x), and a proximal term h_t(x) for well-conditioned predictions. The update amounts to solving [27]:

    x_{t+1} = arg min_{x∈F} { η⟨ḡ_t, x⟩ + ηϕ(x) + (1/t) h_t(x) },    (7.3.11)

where η is a fixed stepsize and x_1 = arg min_{x∈F} ϕ(x).
2. Composite Mirror Descent Method: This method is also known as proximal gradient or forward-backward splitting [26, 147]. The update formula is given by

    x_{t+1} = arg min_{x∈F} { η⟨g_t, x⟩ + ηϕ(x) + B_{h_t}(x, x_t) },    (7.3.12)

where B_h(x, y) is the Bregman divergence associated with a strongly convex and differentiable function h(x), and is defined as

    B_h(x, y) = h(x) − h(y) − ⟨∇h(y), x − y⟩.    (7.3.13)
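As a concrete instance of the composite mirror descent update (7.3.12), take h(x) = (1/2)‖x‖_2^2, so that B_h(x, y) = (1/2)‖x − y‖_2^2, and ϕ(x) = λ‖x‖_1; the update then has the closed-form soft-thresholding solution. The following is a minimal sketch under these assumptions (the quadratic loss and all names are illustrative, not from [26, 27, 147]).

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau*||.||_1: elementwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def composite_mirror_descent_l1(grad, x0, eta, lam=0.05, iters=200):
    """Composite mirror descent (7.3.12) with h(x) = 0.5*||x||_2^2, i.e. proximal gradient:
    x_{t+1} = argmin_x eta*<g_t, x> + eta*lam*||x||_1 + 0.5*||x - x_t||_2^2
            = soft_threshold(x_t - eta*g_t, eta*lam)."""
    x = x0.copy()
    for _ in range(iters):
        x = soft_threshold(x - eta * grad(x), eta * lam)
    return x

# Illustrative smooth part f(x) = 0.5*||A x - b||^2, giving a lasso-type composite objective.
rng = np.random.default_rng(1)
A, b = rng.normal(size=(30, 10)), rng.normal(size=30)
grad = lambda x: A.T @ (A @ x - b)
x_hat = composite_mirror_descent_l1(grad, np.zeros(10), eta=1.0 / np.linalg.norm(A, 2) ** 2)
print("nonzeros in solution:", np.count_nonzero(np.round(x_hat, 6)))
```

The ℓ1 regularizer drives many coordinates of the iterate exactly to zero, which is why this composite update is the workhorse of sparse learning problems.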


Two typical regularization functions are ϕ(x) = ‖x‖_1 and ϕ(x) = ‖x‖_2, and the proximal function often takes the form of a squared Mahalanobis norm h_t(x) = ⟨x, H_t x⟩ = x^T H_t x for a symmetric nonnegative definite matrix H_t ≥ 0. For example,

    H_t = δI + diag(G_t)^{1/2}    (Diagonal)    (7.3.14)

and

    H_t = δI + (G_t)^{1/2}    (Full),    (7.3.15)

where G_t = Σ_{τ=1}^t g_τ g_τ^T is the outer product matrix of the subgradient vectors g_τ, and (G_t)^{1/2} is the square root of G_t.
If taking u = ηḡ_t, ηϕ(x) = λ‖x‖_2, and h_t(x_t) = (t/2)‖x_t‖_{H_t}^2 = (t/2)⟨x_t, H_t x_t⟩, then the primal-dual subgradient optimization (7.3.11) becomes the following online optimization of composite functions:

    x_{t+1} = arg min_{x_t∈F} { ⟨u, x_t⟩ + λ‖x_t‖_2 + (1/2)⟨x_t, H_t x_t⟩ },    (7.3.16)

where the subgradient is g_t = u + λx_t + H_t x_t and G_t = Σ_{τ=1}^t g_τ g_τ^T. Introducing a variable z = x, the problem (7.3.16) becomes the equivalent constrained optimization problem

    min { ⟨u, x⟩ + (1/2)⟨x, Hx⟩ + λ‖z‖_2 }  subject to  x = z.

With the Lagrange multiplier vector α for the equality constraint, the Lagrangian is given by

    L(x, z, α) = ⟨u, x⟩ + (1/2)⟨x, Hx⟩ + λ‖z‖_2 + ⟨α, x − z⟩.    (7.3.17)

From the first-order optimality condition ∂L(x, z, α)/∂x = u + Hx + α = 0, the infimum of L(x, z, α) over x is attained at x = −H^{−1}(u + α).
Based on the above analysis, the adaptive gradient (ADAGRAD) algorithm was developed in [27] for minimizing the composite function ⟨u, x⟩ + (1/2)⟨x, Hx⟩ + λ‖x‖_2, see Algorithm 7.1.
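Algorithm 7.1 below states the full constrained, composite version. As a complement, the following is a minimal Python/NumPy sketch of the widely used unconstrained diagonal variant suggested by (7.3.14), in which each coordinate's step is scaled by the accumulated squared (sub)gradients; the objective and all names here are illustrative assumptions, not taken from [27].

```python
import numpy as np

def adagrad(grad, x0, eta=0.5, delta=1e-8, iters=500):
    """Diagonal AdaGrad sketch: accumulate diag(G_t) = sum_tau g_tau * g_tau (elementwise)
    and step with x_{t+1} = x_t - eta * g_t / (delta + sqrt(diag(G_t))), cf. (7.3.14)."""
    x = x0.copy()
    G_diag = np.zeros_like(x)                    # running diag(G_t)
    for _ in range(iters):
        g = grad(x)
        G_diag += g * g
        x -= eta * g / (delta + np.sqrt(G_diag))
    return x

# Illustrative strongly convex objective f(x) = 0.5 * x^T D x with an ill-conditioned diagonal D.
D = np.array([100.0, 1.0, 0.01])
grad = lambda x: D * x
print(adagrad(grad, np.array([1.0, 1.0, 1.0])))  # iterates move toward the minimizer 0
```

The point of the per-coordinate scaling is that rarely active (small-gradient) coordinates keep a large effective step size, while frequently active coordinates are damped, which matches the sparse-feature motivation given at the start of this subsection.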

Algorithm 7.1 ADAGRAD algorithm for minimizing ⟨u, x⟩ + (1/2)⟨x, Hx⟩ + λ‖x‖_2 [27]
1. input: u ∈ R^d, λ > 0.
2. initialization: Generate randomly a nonnegative matrix H_0 and put x_0 = 0.
3. for t = 1, 2, . . . do
4.   g_t = u + (H_{t−1} + λI) x_{t−1}.
5.   G_t = Σ_{τ=1}^t g_τ g_τ^T.
6.   H_t = δI + diag(G_t)^{1/2} or H_t = δI + (G_t)^{1/2}.
7.   Make the SVD H_t = UΣV^T and denote the minimal and maximal singular values by σ_min(H_t) and σ_max(H_t), respectively.
8.   v_t = H_t^{−1} u = VΣ^{−1}U^T u.
9.   θ_max = ‖v_t‖_2/λ − 1/σ_min(H_t).
10.  θ_min = ‖v_t‖_2/λ − 1/σ_max(H_t).
11.  while θ_max − θ_min > ε do
12.    θ = (θ_max + θ_min)/2.
13.    α_t = −(H_t^{−1} + θI)^{−1} v_t.
14.    if ‖α_t‖_2 > λ then
15.      θ_min = θ,
16.    else
17.      θ_max = θ.
18.    end if
19.    x_t = −H_t^{−1}(u + α_t(θ)).
20.  end while
21.  exit: if x_t converges.
22. end for
23. output: x_t = −H_t^{−1}(u + α_t(θ)).

7.3.3 Adaptive Moment Estimation

In many fields of science and engineering, many problems can be cast as the optimization (minimization or maximization) of some scalar parameterized objective function with respect to its parameters. Objective functions are often stochastic. In this case optimization can be made more efficient by taking gradient steps with respect to individual subfunctions, i.e., stochastic gradient descent (SGD) or ascent.
Let f(θ) be a noisy objective function, i.e., a stochastic scalar function that is differentiable with respect to the parameters θ. We consider how to minimize the expected value of this function, E{f(θ)}, with respect to its parameters θ. Let f_1(θ), . . . , f_T(θ) be the realizations of the stochastic function at subsequent time steps 1, . . . , T. The stochasticity might come from the evaluation at random subsamples (mini-batches) of data points, or arise from inherent function noise. Use g_t = ∇_θ f_t(θ) to denote the gradient, i.e., the vector of partial derivatives of f_t with respect to θ, evaluated at time step t.
Given an arbitrary, unknown sequence of convex cost functions f_1(θ), . . . , f_T(θ), at each time t the goal of an online optimization is to predict the parameter θ_t and evaluate it on a previously unknown cost function f_t. Since the nature of the sequence is unknown in advance, we design an online optimization algorithm using the regret.


The regret is the sum, over all the previous steps, of the differences between the online prediction f_t(θ_t) and the cost f_t(θ*) of the best fixed parameter θ* from a feasible set F. Concretely, the regret is defined as

    R(T) = Σ_{t=1}^T [f_t(θ_t) − f_t(θ*)],    (7.3.18)

where θ* = arg min_{θ∈F} Σ_{t=1}^T f_t(θ).
ADAM (adaptive moment estimation) of Kingma and Ba [80] is a first-order gradient-based optimization algorithm, which uses estimates of the first moment vector m and the second moment vector v to optimize a stochastic objective function. The first and second moment vectors are updated as

    m_{t+1} = β_1 m_t + (1 − β_1) g_{t+1},    (7.3.19)
    v_{t+1} = β_2 v_t + (1 − β_2) g_{t+1}^2,    (7.3.20)

where g_t is the gradient of the loss function and g_t^2 denotes its elementwise square. The exponential decay rates for the moment estimates are recommended to be β_1 = 0.9 and β_2 = 0.999. The bias-corrected first and second moment estimates are

    m̂_{t+1} = m_{t+1}/(1 − β_1^{t+1}),    (7.3.21)
    v̂_{t+1} = v_{t+1}/(1 − β_2^{t+1}).    (7.3.22)

Algorithm 7.2 gives the ADAM algorithm for stochastic optimization.

Algorithm 7.2 ADAM algorithm for stochastic optimization [80]
1. input: stepsize α (e.g., α = 0.001), exponential decay rates β_1 = 0.9 and β_2 = 0.999, stochastic objective f(θ), and put ε = 10^{−8}.
2. initialization:
   2.1 initial parameter vector θ_0.
   2.2 m_0 ← 0 (initialize 1st moment vector).
   2.3 v_0 ← 0 (initialize 2nd moment vector).
   2.4 t ← 0 (initialize timestep).
3. while θ_t not converged do
4.   t ← t + 1,
5.   g_t ← ∇_θ f_t(θ_{t−1}),
6.   m_t ← β_1 m_{t−1} + (1 − β_1) · g_t,
7.   v_t ← β_2 v_{t−1} + (1 − β_2) · g_t^2,
8.   m̂_t ← m_t/(1 − β_1^t),
9.   v̂_t ← v_t/(1 − β_2^t),
10.  θ_t ← θ_{t−1} − α · m̂_t/(√v̂_t + ε).
11. end while
12. output: θ_t.
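A minimal Python/NumPy sketch of Algorithm 7.2, following (7.3.19)–(7.3.22); the noisy quadratic objective and all names here are illustrative assumptions, not taken from [80].

```python
import numpy as np

def adam(grad, theta0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
    """ADAM (Algorithm 7.2): moment updates (7.3.19)-(7.3.20), bias correction
    (7.3.21)-(7.3.22), then theta <- theta - alpha * m_hat / (sqrt(v_hat) + eps)."""
    theta = theta0.astype(float).copy()
    m = np.zeros_like(theta)                          # 1st moment vector
    v = np.zeros_like(theta)                          # 2nd moment vector
    for t in range(1, steps + 1):
        g = grad(theta)                               # g_t = grad f_t(theta_{t-1})
        m = beta1 * m + (1.0 - beta1) * g             # (7.3.19)
        v = beta2 * v + (1.0 - beta2) * g * g         # (7.3.20)
        m_hat = m / (1.0 - beta1 ** t)                # (7.3.21)
        v_hat = v / (1.0 - beta2 ** t)                # (7.3.22)
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Illustrative noisy quadratic objective: f_t(theta) = 0.5*||theta - c||^2 with noisy gradients.
rng = np.random.default_rng(2)
c = np.array([1.0, -2.0, 3.0])
noisy_grad = lambda th: (th - c) + 0.01 * rng.normal(size=th.shape)
print(adam(noisy_grad, np.zeros(3), alpha=0.01))      # approaches c = [1, -2, 3]
```

Because the update magnitude is roughly bounded by the stepsize α regardless of the gradient scale, Adam is relatively insensitive to how the stochastic objective is scaled, which is one reason it is the default optimizer in much deep learning practice.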


7.4 Activation Functions

In neural networks, a unit's or node's expected value is referred to as its activation, i.e., the activation function of a unit (or node, or neuron) defines the output of that unit given an input or set of inputs.

7.4.1 Logistic Regression and Sigmoid Function

Neural networks typically employ a sigmoidal nonlinearity as the activation function. Given T training samples {(x_1, y_1), . . . , (x_T, y_T)}, where the feature vector x_t ∈ R^n, consider the binary classification problem in which y_t ∈ {0, 1}. A logistic regression problem can be described as follows: for the given training samples, design a weight vector w ∈ R^n to minimize the logistic loss function (also called the cross-entropy loss function)

    J(w) = (1/T) Σ_{t=1}^T H(p_t, q_t) = −(1/T) Σ_{t=1}^T [ y_t log ŷ_t + (1 − y_t) log(1 − ŷ_t) ],    (7.4.1)

where H(p_t, q_t), the cross-entropy between p_t and q_t, will be defined later. Here ŷ_t ≡ g(w^T x_t) is the logistic regressor with an attached sigmoid function (also known as a logistic function or soft-step function) [64]:

    g(z) = 1/(1 + e^{−z}) ∈ (0, 1),    (7.4.2)

whose derivative is given by [11]

    g'(z) = ∂/∂z [1/(1 + e^{−z})] = e^{−z}/(1 + e^{−z})^2 = [1/(1 + e^{−z})] [1 − 1/(1 + e^{−z})],

i.e.,

    g'(z) = g(z)(1 − g(z)).    (7.4.3)

In information theory, for two discrete distributions p and q with K states, the Kullback–Leibler (KL) divergence of q from p is defined as

    KL(p‖q) ≜ Σ_{k=1}^K p_k log(p_k/q_k)    (7.4.4)


or equivalently is expressed as

    KL(p‖q) = Σ_{k=1}^K p_k log p_k − Σ_{k=1}^K p_k log q_k = −H(p) + H(p, q)    (7.4.5)

or

    H(p, q) = H(p) + KL(p‖q),    (7.4.6)

where

    H(p) = −Σ_{k=1}^K p_k log p_k  and  H(p, q) = −Σ_{k=1}^K p_k log q_k    (7.4.7)

are, respectively, the entropy of p and the cross-entropy between the two probability distributions p and q over a given set. The KL divergence KL(p‖q) is also known as the relative entropy of p with respect to q.
In logistic regression there are only two states, y_t = 0 and y_t = 1, i.e., K = 2. The probability of finding the output y = 0 (corresponding to k = 1) is given by

    p_{y=0} = 1 − y  and  q_{y=0} = 1 − ŷ,    (7.4.8)

while the probability of finding the output y = 1 (corresponding to k = 2) is given by

    p_{y=1} = y  and  q_{y=1} = ŷ.    (7.4.9)

Substituting (7.4.8) and (7.4.9) into (7.4.4), the KL divergence in logistic regression is given by

    KL(p‖q) = (1 − y) log((1 − y)/(1 − ŷ)) + y log(y/ŷ).    (7.4.10)

Similarly, substituting (7.4.8) and (7.4.9) into (7.4.7) yields the cross-entropy

    H(p, q) = −(1 − y) log(1 − ŷ) − y log ŷ.    (7.4.11)

The cross-entropy H(p, q) denotes the information loss produced when an "unnatural" probability distribution q is used to fit the "true" distribution p. Because the entropy H(p) of the true distribution p is fixed, it is known from (7.4.6) that the cross-entropy, and thus the KL divergence, describes the similarity between the predicted results ŷ_t and the real results y_t, and so it can be used as a loss function to ensure that the predicted value ŷ_t conforms to the true value y_t.
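A short numerical check of the identity (7.4.6), H(p, q) = H(p) + KL(p‖q), for the two-state distributions (7.4.8)–(7.4.9); a soft label y is used here (an illustrative assumption) to avoid log 0.

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_k p_k log p_k, Eq. (7.4.7)."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k log q_k, Eq. (7.4.7)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

def kl(p, q):
    """KL(p||q) = sum_k p_k log(p_k / q_k), Eq. (7.4.4)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

y, y_hat = 0.8, 0.6                        # "true" probability and prediction (illustrative)
p, q = [1 - y, y], [1 - y_hat, y_hat]      # two-state distributions as in (7.4.8)-(7.4.9)
print(cross_entropy(p, q), entropy(p) + kl(p, q))   # the two numbers agree, verifying (7.4.6)
```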


Therefore, the logistic loss is given by

    J(w) = (1/T) Σ_{t=1}^T H(p_t, q_t) = −(1/T) Σ_{t=1}^T [ y_t log ŷ_t + (1 − y_t) log(1 − ŷ_t) ].    (7.4.12)

From ŷ_t = g(w^T x_t) = 1/(1 + e^{−w^T x_t}) and (7.4.11), the above logistic loss becomes

    J(w) = (1/T) Σ_{t=1}^T [ log(1 + e^{−w^T x_t}) + (1 − y_t) w^T x_t ].    (7.4.13)

It is easily shown that the gradient of J(w) with respect to w is given by

    ∂J(w)/∂w = (1/T) Σ_{t=1}^T ( 1/(1 + e^{−w^T x_t}) − y_t ) x_t = (1/T) Σ_{t=1}^T (ŷ_t − y_t) x_t.    (7.4.14)

In logistic regression, the weight vector w can be updated by the gradient descent algorithm

    w_{k+1} = w_k − (η/T) Σ_{t=1}^T ( 1/(1 + e^{−w_k^T x_t}) − y_t ) x_t    (7.4.15)

until w_k converges.
For a layer with a sigmoid activation function g(z), z = Wx + b, where x is the layer input and the weight matrix W and the bias vector b are the layer parameters to be learned, we have g(z_i) = 1/(1 + exp(−z_i)). As |z_i| increases, g'(z_i) = g(z_i)(1 − g(z_i)) tends to zero. This means that the gradient flowing through the unit will vanish and thus the model will train slowly. For a deep neural network (DNN) with multiple hidden layers, a layer that trains slowly will affect the convergence of all the layers below it. Because this effect is amplified as the network depth increases, the sigmoid function is not a suitable activation function for the hidden layers of a DNN.
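A minimal sketch of logistic regression trained with the gradient descent update (7.4.15); the synthetic data and all names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1/(1 + e^{-z}), Eq. (7.4.2)."""
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, eta=0.5, iters=2000):
    """Gradient descent (7.4.15): w <- w - (eta/T) * sum_t (sigmoid(w^T x_t) - y_t) x_t."""
    T, n = X.shape
    w = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ w) - y) / T        # gradient (7.4.14)
        w -= eta * grad
    return w

# Illustrative linearly separable data in R^2, with a constant bias feature appended.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
Xb = np.hstack([X, np.ones((200, 1))])
w = train_logistic(Xb, y)
acc = np.mean((sigmoid(Xb @ w) > 0.5) == y.astype(bool))
print("training accuracy:", acc)
```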

7.4.2 Softmax Regression and Softmax Function

Logistic regression is only available for binary classification problems, but there are many multiclass classification problems in science and engineering. Let the class labels take k different values that are assumed, without loss of generality, to be {1, . . . , k}.


Suppose we are given a set of training samples {(x_1, y_1), . . . , (x_T, y_T)}, where x_t ∈ R^n and y_t ∈ {1, . . . , k}. In this case, we design a weight matrix W = [w_1, . . . , w_k]^T ∈ R^{k×n} such that the hypothesis function has the following vector form:

    h_W(x_t) = [p(y_t = 1|x_t; W), p(y_t = 2|x_t; W), . . . , p(y_t = k|x_t; W)]^T
             = (1/Σ_{j=1}^k e^{w_j^T x_t}) [e^{w_1^T x_t}, e^{w_2^T x_t}, . . . , e^{w_k^T x_t}]^T.    (7.4.16)

On the other hand, for an RNN network, its output vector g(Wx_t) is directly used as the hypothesis function, i.e., h_W(x_t) = g(Wx_t). Letting z = Wx_t, or z_i = w_i^T x_t, i = 1, . . . , k, we naturally apply the softmax function

    softmax_i(z) = e^{z_i} / Σ_{j=1}^k e^{z_j} = e^{w_i^T x_t} / Σ_{j=1}^k e^{w_j^T x_t},  i = 1, . . . , k,    (7.4.17)

or

    softmax(z) = (1/Σ_{j=1}^k e^{w_j^T x_t}) [e^{w_1^T x_t}, . . . , e^{w_k^T x_t}]^T ∈ (0, 1)^k    (7.4.18)

to express the hypothesis function h_W(x_t) = softmax(Wx_t). Due to

    ∂z_i/∂z_j = δ_ij,    (7.4.19)

the derivative of the softmax function is given by

    ∂softmax_i(z)/∂z_j = ∂/∂z_j [e^{z_i} / Σ_{l=1}^k e^{z_l}] = e^{z_i} δ_ij / Σ_{l=1}^k e^{z_l} + e^{z_i} · (−e^{z_j}) / (Σ_{l=1}^k e^{z_l})^2,

namely

    ∂softmax_i(z)/∂z_j = softmax_i(z)(δ_ij − softmax_j(z)),    (7.4.20)

which lies in (0, 0.25) for i = j, because the range of softmax(z) is (0, 1).
The common (hard) max function max{z_1, . . . , z_k} takes only the maximum value among z_1, . . . , z_k. Unlike the max function, the softmax function returns k values in probability form. Different from a normalized vector, the softmax output has no negative entries.
Example 7.1 If z = [3, 5, −4]^T, then the max function gives max{3, 5, −4} = 5, normalization gives the normalized vector [0.4243, 0.7071, −0.5657]^T, and softmax(z) = [0.1192, 0.8807, 0.0001]^T.
It is interesting to compare the softmax function with the max function and the normalization:
• The max function is single-valued and gives only one result while rejecting all others.
• The normalization is multivalued and may retain negative value(s), so it cannot be used as a probability representation.
• The softmax function is also multivalued but provides the probability of z_i belonging to class i, and thus it can transform a vector into a nonnegative, often nearly sparse, probability vector.
Therefore, the softmax function is especially suitable for multiclass classification problems.
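The following snippet reproduces Example 7.1 and checks the softmax Jacobian (7.4.20) numerically; it is a sketch, and the helper names are not from the text.

```python
import numpy as np

def softmax(z):
    """softmax_i(z) = exp(z_i) / sum_j exp(z_j), computed with the usual max-shift for stability."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([3.0, 5.0, -4.0])
s = softmax(z)
print("max        :", np.max(z))                     # 5
print("normalized :", z / np.linalg.norm(z))         # [0.4243, 0.7071, -0.5657]
print("softmax    :", s)                             # [0.1192, 0.8807, 0.0001]

# Jacobian (7.4.20): d softmax_i / d z_j = softmax_i * (delta_ij - softmax_j).
J = np.diag(s) - np.outer(s, s)
print("diagonal of the Jacobian lies in (0, 0.25):", J.diagonal())
```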

7.4.3 Other Activation Functions

Recently, there is increasing evidence that other types of nonlinearities can improve the performance of DNNs.
The hyperbolic tangent (tanh) function serves as a pointwise nonlinearity applied to all hidden units of a DNN, and is defined by

    f(z) = tanh(z) = (e^z − e^{−z})/(e^z + e^{−z}) ∈ (−1, +1),    (7.4.21)
    f'(z) = ∂tanh(z)/∂z = 1 − tanh²(z).    (7.4.22)

As an alternative to the tanh function, the softsign function is given by [11]

    softsign(z) = z/(1 + |z|) ∈ (−1, 1),    (7.4.23)
    softsign'(z) = ∂softsign(z)/∂z = 1/(1 + |z|)².    (7.4.24)

Clearly, when tanh(z) → −1 or +1, the tanh function has a vanishing gradient, as shown in (7.4.22), but it is seen from (7.4.24) that the softsign function has no such trouble. As a matter of fact, the ratio of their derivatives has the asymptotic limit [11]

    lim_{z→∞} (∂tanh(z)/∂z) / (∂softsign(z)/∂z) = 0,

i.e., the tanh derivative decays to zero exponentially fast, much faster than the polynomially decaying softsign derivative.

The Rectified Linear Unit (ReLU) function [106] is defined as

$$f(z) = \mathrm{ReLU}(z) = \max\{z, 0\} = \begin{cases} 0, & z < 0;\\ z, & z \ge 0, \end{cases} \qquad (7.4.25)$$

with range [0, ∞). The derivative is given by

$$f'(z) = \frac{\partial\, \mathrm{ReLU}(z)}{\partial z} = \begin{cases} 0, & z < 0;\\ 1, & z \ge 0. \end{cases} \qquad (7.4.26)$$

The softplus function is defined as

$$\mathrm{softplus}(z) = \log(1 + e^{z}). \qquad (7.4.27)$$

This function can be seen as a smoothed version of the ReLU. According to neuroscientists' research, softplus and ReLU are similar to the activation frequency function of brain neurons. That is to say, compared with the early activation functions, softplus and ReLU are closer to the activation model of brain neurons, while neural networks themselves are based on the development of brain neuroscience. It could be said that the application of these two activation functions has contributed to a new wave of neural network research. Compared with the sigmoid and tanh functions, the ReLU has the following two advantages:
• When the input is positive, there is no gradient saturation problem.
• Its computation is much faster than that of sigmoid and tanh.

However, the ReLU has the following two disadvantages:
– When the input is negative, the ReLU is completely inactive. This is not a problem for forward propagation, but in backpropagation a negative input makes the gradient exactly zero, which is the same problem as with the sigmoid and tanh functions.
– The output of the ReLU function is either 0 or positive, which means that the ReLU function is not a zero-centered function.
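To make the shapes and gradient behavior of these activations concrete, here is a small NumPy sketch (the helper names and the numerically stable softplus form are our own choices) implementing tanh, softsign, ReLU, and softplus together with the derivatives in (7.4.22), (7.4.24), and (7.4.26):

```python
import numpy as np

def tanh(z):      return np.tanh(z)
def dtanh(z):     return 1.0 - np.tanh(z) ** 2          # Eq. (7.4.22)

def softsign(z):  return z / (1.0 + np.abs(z))
def dsoftsign(z): return 1.0 / (1.0 + np.abs(z)) ** 2   # Eq. (7.4.24)

def relu(z):      return np.maximum(z, 0.0)
def drelu(z):     return (z >= 0).astype(float)          # Eq. (7.4.26)

def softplus(z):  # stable evaluation of log(1 + e^z)
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0.0)

z = np.linspace(-8, 8, 5)
print(dtanh(z))      # vanishes exponentially fast for large |z|
print(dsoftsign(z))  # vanishes only polynomially, as 1/(1+|z|)^2
print(drelu(z))      # no saturation for z > 0
```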


In order to improve the ReLU function, the following variants were developed:

1. Leaky rectified linear unit (Leaky ReLU) [102]:

$$f(z) = \mathrm{LReLU}(z) = \begin{cases} 0.01z, & z < 0;\\ z, & z \ge 0, \end{cases} \in (-\infty, +\infty), \qquad (7.4.28)$$
$$f'(z) = \frac{\partial\, \mathrm{LReLU}(z)}{\partial z} = \begin{cases} 0.01, & z < 0;\\ 1, & z \ge 0. \end{cases} \qquad (7.4.29)$$

2. Parametric rectified linear unit (PReLU) [55]:

$$f(z) = \mathrm{PReLU}(z) = \begin{cases} \alpha z, & z < 0;\\ z, & z \ge 0, \end{cases} \in (-\infty, +\infty), \qquad (7.4.30)$$
$$f'(z) = \frac{\partial\, \mathrm{PReLU}(z)}{\partial z} = \begin{cases} \alpha, & z < 0;\\ 1, & z \ge 0. \end{cases} \qquad (7.4.31)$$

Clearly, when α = 0.01 is taken, the PReLU reduces to the Leaky ReLU. Note that the randomized leaky rectified linear unit (RLReLU) was independently proposed in [164], and the RLReLU and the PReLU have the same form.

Deep learning methods aim at learning feature hierarchies, with features from higher levels of the hierarchy formed by composing lower-level features. Training deep neural networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change [74]. Given values of z over a mini-batch B = {z_1, ..., z_m},

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} z_i \quad\text{and}\quad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (z_i - \mu_B)^2 \qquad (7.4.32)$$

are, respectively, the mini-batch mean and the mini-batch variance. Then, the mini-batch normalization is given by

$$\hat z_i = \frac{z_i - E\{z_i\}}{\sqrt{\mathrm{var}(z_i)}} = \frac{z_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \quad\text{or}\quad \hat z = \frac{z - \mu_B \mathbf 1}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad (7.4.33)$$

where ε is a small constant added to the mini-batch variance for numerical stability.


Batch normalization is defined as [74]:

$$\mathrm{BN}_{\gamma_i,\beta_i}(z_i) = \gamma_i \hat z_i + \beta_i \quad\text{or}\quad \mathrm{BN}_{\gamma,\beta}(z) = \gamma \odot \hat z + \beta \qquad (7.4.34)$$

with γ = [γ_1, ..., γ_m]^T and β = [β_1, ..., β_m]^T, where γ_i and β_i are two batch normalization parameters to be learned, and ⊙ denotes the elementwise product of two vectors, i.e., [γ ⊙ ẑ]_i = γ_i ẑ_i. As shown in [90], such a normalization speeds up convergence, even when the features are not decorrelated.

Batch normalization can be applied to any set of activations in the network. For example, consider an affine transformation followed by an elementwise nonlinearity:

$$z = g(Wu + b), \qquad (7.4.35)$$

where W and b are learned parameters of the model. If the nonlinear activation function g(·), such as sigmoid or ReLU, is instead applied as

$$z = g(\mathrm{BN}(Wu)), \qquad (7.4.36)$$

then the bias b can be ignored, since its effect will be canceled by the subsequent mean subtraction. It is noted that z of a neuron in some layer (say the lth layer) is not the original input, that is, not the output of each neuron in the (l − 1)th layer, but the linear activation z = Wu + b of the neuron in the lth layer, where u is the output of the neurons in the (l − 1)th layer. Therefore, the original activation z of a neuron is converted by subtracting the mini-batch mean E{z} and dividing by the mini-batch standard deviation √var(z). By normalizing activations throughout the network, batch normalization prevents small changes to the parameters from amplifying into larger and suboptimal changes in activations and gradients. Hence, batch normalization helps to address the following issues in traditional deep networks: a too-high learning rate may result in gradients that explode or vanish, as well as getting stuck in poor local minima [74]. Another advantage of batch normalization is that backpropagation through a layer is unaffected by the scale of its parameters, because for a scaled batch normalization BN(Wu) = BN((αW)u) with a scalar α, its derivatives with respect to u and W are, respectively, given by [74]

$$\frac{\partial\, \mathrm{BN}((\alpha W)u)}{\partial u} = \frac{\partial\, \mathrm{BN}(Wu)}{\partial u}, \qquad (7.4.37)$$
$$\frac{\partial\, \mathrm{BN}((\alpha W)u)}{\partial (\alpha W)} = \frac{1}{\alpha}\cdot\frac{\partial\, \mathrm{BN}(Wu)}{\partial W}. \qquad (7.4.38)$$


That is to say, the scale does not affect the layer Jacobian nor, consequently, the gradient propagation. Moreover, larger weights lead to smaller gradients, and batch normalization stabilizes the parameter growth. We will discuss batch normalization networks further in Sect. 7.15.
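As an illustration of (7.4.32)–(7.4.34), the following sketch implements the batch-normalization forward pass for a mini-batch of activations; the function name and the choice of per-feature statistics over the batch dimension are our own assumptions, not prescriptions from the text:

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-5):
    """Z: (m, d) mini-batch of activations; gamma, beta: (d,) learned parameters."""
    mu = Z.mean(axis=0)                       # mini-batch mean, Eq. (7.4.32)
    var = Z.var(axis=0)                       # mini-batch variance, Eq. (7.4.32)
    Z_hat = (Z - mu) / np.sqrt(var + eps)     # normalization, Eq. (7.4.33)
    return gamma * Z_hat + beta               # scale and shift, Eq. (7.4.34)

rng = np.random.default_rng(0)
Z = rng.normal(loc=3.0, scale=2.0, size=(64, 10))
out = batch_norm_forward(Z, gamma=np.ones(10), beta=np.zeros(10))
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))  # ~0 mean, ~1 std per feature
```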

7.5 Recurrent Neural Networks

A recurrent neural network (RNN), proposed by Rumelhart et al. in 1986 [123], is a neural network model for modeling time series. RNNs provide a very powerful way of dealing with (time) sequential data. A traditional neural network operates from the input layer through the hidden layer(s) to the output layer; these layers are fully connected, and there are no connections between the nodes within a layer. For tasks that involve sequential inputs, such as speech and language, it is often better to use RNNs. RNNs are "fuzzy" in the sense that they do not use exact templates from the training data to make predictions but, like other neural networks, use their internal representation to perform a high-dimensional interpolation between training examples [44].

7.5.1 Conventional Recurrent Neural Networks The modern definition of “recurrent” was initially introduced by Jordan [75]: If a network has one or more cycles, that is, if it is possible to follow a path from a unit back to itself, then the network is referred to as recurrent. A nonrecurrent network has no cycles.

Let x_i(t) denote the activation of input unit i at time t, h_j(t) denote the activation of hidden unit j at time t, and y_k(t) denote the activation of output unit k at time t. The variables and parameters that need to be used and learned during training are denoted as follows:

• x_t ∈ R^n: input vector to the hidden layer,
• h_t ∈ R^l: output of the hidden layer, denoting the hidden state,
• ŷ_t ∈ R^m: output produced by the RNN,
• y ∈ R^m: desired output,
• U_hx = [U_ji] ∈ R^{l×n}: input-hidden weight matrix from the ith input node to the jth hidden state,
• W_hy = [W_jk] ∈ R^{l×m}: output-hidden weight matrix from the kth output node to the jth hidden state,
• V_yh = [V_kj] ∈ R^{m×l}: hidden-output weight matrix from the jth hidden state to the kth output node,
• b_t ∈ R^l: bias vector (also called intercept terms) of the hidden layer,
• c_t ∈ R^m: bias vector (also called intercept terms) of the output layer,
• f(h_i) ∈ R: elementwise activation function of the hidden layer,
• g(y_i) ∈ R: elementwise activation function of the output layer.

The dynamical system of a feedforward network (FFN) is given by:

– Hidden states for FFN:

$$h_j(t) = f(\mathrm{net}_j), \qquad j = 1, \ldots, l, \qquad (7.5.1)$$
$$\mathrm{net}_j = \sum_{i=1}^{n} U_{ji}\, x_i(t) + b_j(t), \qquad j = 1, \ldots, l. \qquad (7.5.2)$$

– Output nodes for FFN:

$$y_k(t) = g(\mathrm{net}_k), \qquad k = 1, \ldots, m, \qquad (7.5.3)$$
$$\mathrm{net}_k = \sum_{j=1}^{l} V_{kj}\, h_j(t) + c_k(t), \qquad k = 1, \ldots, m. \qquad (7.5.4)$$

The dynamical system for FFNs can be written in matrix-vector form as

$$h_t = f(U_{hx} x_t + b_t), \qquad (7.5.5)$$
$$y_t = g(V_{yh} h_t + c_t). \qquad (7.5.6)$$

An RNN is a neural network that simulates a discrete-time dynamical system with an input x_t = [x_1(t), ..., x_n(t)]^T, an output y_t = [y_1(t), ..., y_m(t)]^T, and a hidden state vector h_t = [h_1(t), ..., h_l(t)]^T, where the subscript t represents the time step. RNNs have two recurrence forms. Figure 7.2 shows the general structure of a three-layer RNN with hidden state recurrence. The RNN shown in Fig. 7.2 can be described by the following dynamical system:

• Hidden states for hidden state recurrence:

$$h_j(t) = f(\mathrm{net}_j), \qquad j = 1, \ldots, l, \qquad (7.5.7)$$
$$\mathrm{net}_j = \sum_{i=1}^{n} U_{ji}\, x_i(t) + \sum_{\tilde j=1}^{l} W_{j\tilde j}\, h_{\tilde j}(t-1) + b_j, \qquad j = 1, \ldots, l. \qquad (7.5.8)$$


Fig. 7.2 General structure of a regular unidirectional three-layer RNN with hidden state recurrence, where z^{−1} denotes a delay line. Left: the RNN with a delay line; right: the RNN unfolded in time for two time steps.

• Output nodes for hidden state recurrence:

$$\hat y_k(t) = g(\mathrm{net}_k), \qquad k = 1, \ldots, m, \qquad (7.5.9)$$
$$\mathrm{net}_k = \sum_{j=1}^{l} V_{kj}\, h_j(t) + c_k(t), \qquad k = 1, \ldots, m. \qquad (7.5.10)$$

The dynamical system of the RNN with hidden state recurrence can be written in matrix-vector form as

$$\left.\begin{aligned} h_t &= f(U_{hx} x_t + W_{hh} h_{t-1} + b_t),\\ \hat y_t &= g(V_{yh} h_t + c_t),\\ L_t(y_t, \hat y_t) &= y_t - \hat y_t. \end{aligned}\right\} \qquad (7.5.11)$$

Figure 7.3 shows the general structure of a three-layer RNN with output recurrence, which can be described by the following equations:

• Hidden states for output recurrence:

$$h_j(t) = f(\mathrm{net}_j), \qquad j = 1, \ldots, l, \qquad (7.5.12)$$
$$\mathrm{net}_j = \sum_{i=1}^{n} U_{ji}\, x_i(t) + \sum_{k=1}^{m} W_{jk}\, \hat y_k(t-1) + b_j, \qquad j = 1, \ldots, l. \qquad (7.5.13)$$

Fig. 7.3 General structure of a regular unidirectional three-layer RNN with output recurrence, where z^{−1} denotes a delay line. Left: the RNN with a delay line; right: the RNN unfolded in time for two time steps.

• Output nodes for output recurrence:

$$\hat y_k(t) = g(\mathrm{net}_k), \qquad k = 1, \ldots, m, \qquad (7.5.14)$$
$$\mathrm{net}_k = \sum_{j=1}^{l} V_{kj}\, h_j(t) + c_k(t), \qquad k = 1, \ldots, m. \qquad (7.5.15)$$

The dynamical system for an RNN with output recurrence can be written in the following matrix-vector form:

$$\left.\begin{aligned} h_t &= f(U_{hx} x_t + W_{hy} \hat y_{t-1} + b_t),\\ \hat y_t &= g(V_{yh} h_t + c_t),\\ L_t(y_t, \hat y_t) &= y_t - \hat y_t. \end{aligned}\right\} \qquad (7.5.16)$$

In the following we focus on RNNs with output recurrence.
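The following sketch runs the forward pass of (7.5.11) and (7.5.16) side by side; the parameter shapes follow the notation above, while the random initialization and the choices f = tanh and g = identity are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, l, m, T = 4, 8, 3, 10                      # input, hidden, output sizes; sequence length
U   = rng.normal(scale=0.3, size=(l, n))      # U_hx
Whh = rng.normal(scale=0.3, size=(l, l))      # W_hh (hidden state recurrence)
Why = rng.normal(scale=0.3, size=(l, m))      # W_hy (output recurrence)
V   = rng.normal(scale=0.3, size=(m, l))      # V_yh
b, c = np.zeros(l), np.zeros(m)

xs = rng.normal(size=(T, n))
h1 = h2 = np.zeros(l)
y2 = np.zeros(m)
for x_t in xs:
    # hidden state recurrence, Eq. (7.5.11)
    h1 = np.tanh(U @ x_t + Whh @ h1 + b)
    y1 = V @ h1 + c                           # g taken as the identity here
    # output recurrence, Eq. (7.5.16): the previous output is fed back instead
    h2 = np.tanh(U @ x_t + Why @ y2 + b)
    y2 = V @ h2 + c
```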

7.5.2 Backpropagation Through Time (BPTT)

An input sequence of length T with entries in R^m is denoted by x ∈ (R^m)^T, and an output sequence of length T with entries in R^n is denoted by y ∈ (R^n)^T. Then, a recurrent neural network with m inputs, n outputs, and weight matrix W such that y = Wx can be viewed as a continuous map (R^m)^T → (R^n)^T.


Suppose we are given a set of N training sequences D = {(x_t^{(n)}, y_t^{(n)})_{t=1}^{T}}_{n=1}^{N}, where (x_t^{(n)}, y_t^{(n)}) denotes the tth pair of samples in the nth subset of the training data set. Then, the parameters θ = (W, U, V) of an RNN can be estimated by minimizing the cost function

$$J(\theta) = \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T} d\big(y_t^{(n)}, \hat y_t^{(n)}\big), \qquad (7.5.17)$$

where d(a, b) denotes a predefined divergence measure between a and b, such as the Euclidean distance d(a, b) = ‖a − b‖_2 = ⟨a − b, a − b⟩^{1/2}, the Mahalanobis distance d_H(a, b) = ‖a − b‖_H = ⟨a − b, H(a − b)⟩^{1/2} (where H is a nonnegative definite matrix), or the KL divergence KL(a‖b). The most frequently used cost function is the summed squared error (SSE). Suppose that there are P patterns or presentations in the training set and m output units in the RNN. Then, the loss or error function at time step t is defined as

$$E_t\big(y_{pk}(t), \hat y_{pk}(t)\big) = \frac{1}{2}\sum_{p=1}^{P}\sum_{k=1}^{m}\big(y_{pk}(t) - \hat y_{pk}(t)\big)^2, \qquad (7.5.18)$$

where y_{pk}(t) is the desired output (or given sample value) associated with the pth pattern at the kth output unit at time step t, and ŷ_{pk}(t) is the estimated or predicted output at time step t. Define the total loss or error as

$$E(y, \hat y) = \sum_{t=1}^{T} E_t\big(y_{pk}(t), \hat y_{pk}(t)\big). \qquad (7.5.19)$$

By gradient descent, each weight change ΔW in the network should be proportional to the negative gradient of the cost with respect to the specific weight matrix W:

$$\Delta W = -\eta\,\frac{\partial E}{\partial W}, \qquad (7.5.20)$$

where η is a learning step size or rate.


The design goal of an RNN is to calculate the gradient of the error with respect to the parameters U, V, and W, and then learn good parameters using stochastic gradient descent (SGD). Mirroring the sum over time steps in the error, the gradients at each time step are also summed up for one training example to get

$$\frac{\partial E}{\partial W} = \sum_{t=1}^{T}\frac{\partial E_t}{\partial W}, \qquad (7.5.21)$$
$$\frac{\partial E}{\partial V} = \sum_{t=1}^{T}\frac{\partial E_t}{\partial V}, \qquad (7.5.22)$$
$$\frac{\partial E}{\partial U} = \sum_{t=1}^{T}\frac{\partial E_t}{\partial U}. \qquad (7.5.23)$$

The backpropagation through time (BPTT) algorithm [159] plays an important role in designing RNNs. The basic idea of this algorithm is to apply the chain rule of differentiation to calculate the above gradients backwards, starting from the error.

1. Calculate the gradient ∂E_t/∂V_kj(t):

$$\frac{\partial E_t}{\partial V_{kj}(t)} = \frac{\partial E_t}{\partial \hat y_{pk}(t)}\cdot\frac{\partial \hat y_{pk}(t)}{\partial \mathrm{net}_{pk}(t)}\cdot\frac{\partial \mathrm{net}_{pk}(t)}{\partial V_{kj}(t)} = -\big(y_{pk}(t) - \hat y_{pk}(t)\big)\, g'\big(\hat y_{pk}(t)\big)\sum_{j=1}^{l} h_{pj}(t) = -\delta_{pk}(t)\cdot\sum_{j=1}^{l} h_{pj}(t), \qquad (7.5.24)$$

where δ_pk(t) is called the error for output nodes at time step t and is defined as

$$\delta_{pk}(t) = -\frac{\partial E_t}{\partial \hat y_{pk}(t)}\cdot\frac{\partial \hat y_{pk}(t)}{\partial \mathrm{net}_{pk}(t)} = \big(y_{pk}(t) - \hat y_{pk}(t)\big)\, g'\big(\hat y_{pk}(t)\big). \qquad (7.5.25)$$

2. Calculate the gradient ∂E_t/∂W_jk(t):

$$\frac{\partial E_t}{\partial W_{jk}(t)} = \left[\sum_{k=1}^{m}\frac{\partial E_t}{\partial \hat y_{pk}(t)}\cdot\frac{\partial \hat y_{pk}(t)}{\partial \mathrm{net}_{pk}(t)}\cdot\frac{\partial \mathrm{net}_{pk}(t)}{\partial h_{pj}(t)}\right]\cdot\frac{\partial h_{pj}(t)}{\partial \mathrm{net}_{pj}(t)}\cdot\frac{\partial \mathrm{net}_{pj}(t)}{\partial W_{jk}(t)} = -\delta_{pj}(t)\cdot\frac{\partial \mathrm{net}_{pj}(t)}{\partial W_{jk}(t)} = -\delta_{pj}(t)\cdot\sum_{k=1}^{m}\hat y_{pk}(t-1), \qquad (7.5.26)$$

where δ_pj(t) is known as the error for hidden nodes, given by

$$\delta_{pj}(t) = -\left[\sum_{k=1}^{m}\frac{\partial E_t}{\partial \hat y_{pk}(t)}\cdot\frac{\partial \hat y_{pk}(t)}{\partial \mathrm{net}_{pk}}\cdot\frac{\partial \mathrm{net}_{pk}}{\partial h_{pj}(t)}\right]\cdot\frac{\partial h_{pj}(t)}{\partial \mathrm{net}_{pj}} = \left[\sum_{k=1}^{m}\delta_{pk}\cdot V_{kj}\right] f'(h_{pj}). \qquad (7.5.27)$$

3. Calculate the gradient ∂E_t/∂U_ji(t):

$$\frac{\partial E_t}{\partial U_{ji}(t)} = \left[\sum_{k=1}^{m}\frac{\partial E_t}{\partial \hat y_{pk}(t)}\cdot\frac{\partial \hat y_{pk}(t)}{\partial \mathrm{net}_{pk}(t)}\cdot\frac{\partial \mathrm{net}_{pk}(t)}{\partial h_{pj}(t)}\right]\cdot\frac{\partial h_{pj}(t)}{\partial \mathrm{net}_{pj}(t)}\cdot\frac{\partial \mathrm{net}_{pj}(t)}{\partial U_{ji}(t)} = -\delta_{pj}(t)\cdot\sum_{i=1}^{n} x_{pi}(t). \qquad (7.5.28)$$

The above discussion can be summarized into the following formulae:

$$\delta_{pk}(t) = \big(y_{pk}(t) - \hat y_{pk}(t)\big)\, g'\big(\hat y_{pk}(t)\big), \qquad (7.5.29)$$
$$\delta_{pj}(t) = \left[\sum_{k=1}^{m}\delta_{pk}(t)\cdot V_{kj}(t)\right] f'\big(h_{pj}(t)\big), \qquad (7.5.30)$$
$$V_{kj}(t+1) = V_{kj}(t) + \Delta V_{kj}(t) = V_{kj}(t) + \eta\sum_{p=1}^{P}\delta_{pk}\left(\sum_{j=1}^{l} h_{pj}(t)\right), \qquad (7.5.31)$$
$$W_{jk}(t+1) = W_{jk}(t) + \Delta W_{jk}(t) = W_{jk}(t) + \eta\sum_{p=1}^{P}\delta_{pj}\left(\sum_{k=1}^{m}\hat y_{pk}(t-1)\right), \qquad (7.5.32)$$
$$U_{ji}(t+1) = U_{ji}(t) + \Delta U_{ji}(t) = U_{ji}(t) + \eta\sum_{p=1}^{P}\delta_{pj}\left(\sum_{i=1}^{n} x_{pi}(t)\right). \qquad (7.5.33)$$
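For a concrete implementation, the following sketch applies BPTT to the hidden-state-recurrence RNN of (7.5.11), assuming f = tanh, g the identity, and the squared-error loss; it follows the standard chain rule in matrix-vector form rather than the per-pattern indexing above, and the gradients are summed over time as in (7.5.21)–(7.5.23):

```python
import numpy as np

def rnn_bptt(xs, ys, U, W, V, b, c):
    """BPTT for h_t = tanh(U x_t + W h_{t-1} + b), yhat_t = V h_t + c,
    with loss E = 0.5 * sum_t ||y_t - yhat_t||^2.  Returns gradients summed over time."""
    T, l = len(xs), b.shape[0]
    hs = [np.zeros(l)]
    for x_t in xs:                                # forward pass: store hidden states
        hs.append(np.tanh(U @ x_t + W @ hs[-1] + b))
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    db, dc = np.zeros_like(b), np.zeros_like(c)
    dh_next = np.zeros(l)
    for t in reversed(range(T)):                  # backward pass: accumulate gradients
        dy = (V @ hs[t + 1] + c) - ys[t]          # dE_t / d(yhat_t)
        dV += np.outer(dy, hs[t + 1]); dc += dy
        dh = V.T @ dy + dh_next                   # gradient flowing into h_t
        dz = dh * (1.0 - hs[t + 1] ** 2)          # back through tanh
        dU += np.outer(dz, xs[t]); dW += np.outer(dz, hs[t]); db += dz
        dh_next = W.T @ dz                        # pass gradient to the previous step
    return dU, dW, dV, db, dc
```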


7.5.3 Jordan Network and Elman Network

After discussing RNNs with output recurrence, we now focus on RNNs with the forget gate. The key equations for this type of RNN are given by [34, 47]

$$i_t = \sigma\big(W_{ix} x_t + W_{ih} h_{t-1} + W_{ic} c_{t-1} + b_i\big), \qquad (7.5.34)$$
$$f_t = \sigma\big(W_{fx} x_t + W_{fh} h_{t-1} + W_{fc} c_{t-1} + b_f\big), \qquad (7.5.35)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh\big(W_{cx} x_t + W_{ch} h_{t-1} + b_c\big), \qquad (7.5.36)$$
$$y_t = \sigma\big(W_{ox} x_t + W_{oh} h_{t-1} + W_{oc} c_t + b_o\big), \qquad (7.5.37)$$
$$h_t = y_t \odot \tanh(c_t), \qquad (7.5.38)$$

where ⊙ denotes the Hadamard product or elementwise product of two vectors, and the variables that need to be used and learned during training are as follows:

• x_t ∈ R^d: input vector to the network unit,
• f_t ∈ R^h: forget gate's activation vector,
• i_t ∈ R^h: input gate's activation vector,
• y_t ∈ R^h: output gate's activation vector,
• h_t ∈ R^h: output vector of the neural unit,
• c_t ∈ R^h: cell state vector,
• W_ix, W_fx, W_ox, W_cx ∈ R^{h×d}: weight matrices from the input vector to the input gate, forget gate, output gate, and cell, respectively,
• W_ih, W_fh, W_oh, W_ch ∈ R^{h×h}: weight matrices from the hidden state to the input gate, forget gate, output gate, and cell, respectively,
• b ∈ R^h: bias vector.

Here the superscripts d and h refer to the number of input features and the number of hidden units, respectively, and σ(·) is the logistic sigmoid function.

At each step t, a standard RNN computes the hidden vector h_t from the current input vector x_t and the previous hidden vector h_{t-1} using a nonlinear transformation function φ: h_t = φ(x_t, h_{t-1}). This computation is known as recurrent computation and can be done via three different directional mechanisms [110]:

1. Forward mechanism: the RNN recurs from 1 to n and generates the forward hidden vector sequence $R(x_1, \ldots, x_n) = \overrightarrow{h}_1, \ldots, \overrightarrow{h}_n$;
2. Backward mechanism: the RNN runs from n to 1 and provides the backward hidden vector sequence $R(x_n, \ldots, x_1) = \overleftarrow{h}_n, \ldots, \overleftarrow{h}_1$;
3. Bidirectional mechanism: the RNN runs in both directions to produce the forward and backward hidden vector sequences, and then concatenates them at each position to generate the new hidden vector sequence $h_1^b, \ldots, h_n^b$ with $h_i^b = [\overrightarrow{h}_i, \overleftarrow{h}_i]$.

The RNN dynamics can be described using deterministic transitions from previous to current hidden states. The deterministic state transition is a function

$$\mathrm{RNN}:\; h_t^{l-1},\, h_{t-1}^{l} \;\to\; h_t^{l}. \qquad (7.5.39)$$

For classical RNNs, this function is given by

$$h_t^{l} = f\big(T_{n,n}\, h_t^{l-1} + T_{n,n}\, h_{t-1}^{l}\big), \quad\text{where } f \in \{\mathrm{sigm}, \tanh\}. \qquad (7.5.40)$$

The RNN also has a special initial bias b_h^{init} ∈ R^m which replaces the formally undefined expression W_hh h_0 at time t = 1. There are the following two classical recurrent networks:

• Jordan network: For a simple neural network with one hidden layer, if the input matrix is denoted by X = [x_1, ..., x_T], the hidden-layer weights by W_h, the output-layer weights by W_y, the recurrent weights by W_r, the hidden representation by h, and the output by y, then the Jordan network can be formulated as

$$h_t = \sigma(W_h X + W_r y_{t-1}), \qquad y = \sigma(W_y h_t).$$

• Elman network: It was introduced by Elman in 1990 [29] and is described by

$$h_t = \sigma(W_h X + W_r h_{t-1}), \qquad y = \sigma(W_y h_t).$$

Figures 7.4 and 7.5 show the difference between the recurrent structures of the Jordan network and the Elman network.
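The structural difference lies only in what is fed back. A minimal sketch, assuming sigmoid activations, per-time-step inputs x_t, and our own function and variable names:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def jordan_step(x_t, y_prev, Wh, Wr, Wy):
    """Jordan network: the previous *output* y_{t-1} is fed back to the hidden layer."""
    h_t = sigmoid(Wh @ x_t + Wr @ y_prev)
    return h_t, sigmoid(Wy @ h_t)

def elman_step(x_t, h_prev, Wh, Wr, Wy):
    """Elman network: the previous *hidden state* h_{t-1} is fed back instead."""
    h_t = sigmoid(Wh @ x_t + Wr @ h_prev)
    return h_t, sigmoid(Wy @ h_t)
```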

7.5.4 Bidirectional Recurrent Neural Networks

Unfolding an RNN yields the structure of a feedforward neural network with infinite depth. A key question is: what recurrent structure corresponds to the infinite-layer version of bidirectional models? The answer lies in the bidirectional recurrent neural network (BRNN) [132]. The BRNN was invented by Schuster and Paliwal in 1997 [132] with the goal of introducing a structure that unfolds into a bidirectional neural network and can be trained using all available input information in the past and future of a specific time frame.


Fig. 7.4 Illustration of the Jordan network.

Fig. 7.5 Illustration of the Elman network [97].

As shown in Fig. 7.6, a BRNN connects two hidden layers running in opposite directions to a single output, allowing them to receive information from both past and future states. Unlike standard RNNs, the idea of BRNN is to split the state neurons of a regular RNN into one part that is responsible for the positive time direction (forward states) and another part for the negative time direction (backward states). Neither of these output states are connected to inputs of the opposite directions. With these two time directions considered simultaneously, input data from the past and future of the current time frame can be directly used to minimize the objective function.


Fig. 7.6 Illustration of the Bidirectional Recurrent Neural Network (BRNN) unfolded in time for three time steps. Upper: output units; middle: forward and backward hidden units; lower: input units.

By Graves [42], bidirectional RNNs are preferred because each output vector depends on the whole input sequence (rather than on the previous inputs only, as is the case with normal RNNs). In general training, forward and backward states are processed first in the “forward” pass, then output neurons are passed. For the backward pass, the opposite takes place: output neurons are processed first, then forward and backward states are passed next. Weights are updated only after the forward and backward passes are complete. For BRNN, the training procedure for the unfolded bidirectional network over time can be summarized as follows [42, 132]: 1. Forward pass: Run all input data for one time slice 1 ≤ t ≤ T through the BRNN and determine all predicted outputs. • Do forward pass just for forward states (from t = 1 to t = T ) and backward states (from t = T to t = 1). • Do forward pass for output neurons. 2. Backward pass: Calculate the part of the objective function derivative for the time slice 1 ≤ t ≤ T used in the forward pass. • Do backward pass for output neurons. • Do backward pass just for forward states (from t = 1 to t = T ) and backward states (from t = T to t = 1).


Given an input sequence (x_1, ..., x_T) of length T, a BRNN computes the forward hidden sequence $(\overrightarrow{h}_1, \ldots, \overrightarrow{h}_T)$, the backward hidden sequence $(\overleftarrow{h}_1, \ldots, \overleftarrow{h}_T)$, and the transcription sequence (f_1, ..., f_T) by first iterating the backward layer from t = T to 1 to compute the backward hidden sequence [42]:

$$\overleftarrow{h}_t = f_h\big(W_{\overleftarrow{h}i}\, x_t + W_{\overleftarrow{h}\overleftarrow{h}}\, \overleftarrow{h}_{t-1} + b_{\overleftarrow{h}}\big), \qquad (7.5.41)$$

and then iterating the forward and output layers from t = 1 to T to compute the forward hidden sequence and the output sequence:

$$\overrightarrow{h}_t = f_h\big(W_{\overrightarrow{h}i}\, x_t + W_{\overrightarrow{h}\overrightarrow{h}}\, \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}}\big), \qquad (7.5.42)$$
$$y_t = W_{o\overrightarrow{h}}\, \overrightarrow{h}_t + W_{o\overleftarrow{h}}\, \overleftarrow{h}_t + b_o. \qquad (7.5.43)$$
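A compact sketch of this two-sweep forward pass follows; the function and parameter names are our own, and f_h is taken as tanh:

```python
import numpy as np

def brnn_forward(xs, Wf, Uf, bf, Wb, Ub, bb, Vf, Vb, bo):
    """BRNN forward pass in the spirit of (7.5.41)-(7.5.43): a backward sweep over t = T..1,
    a forward sweep over t = 1..T, then outputs combining both hidden sequences."""
    T, l = len(xs), bf.shape[0]
    h_fwd, h_bwd = np.zeros((T, l)), np.zeros((T, l))
    h = np.zeros(l)
    for t in reversed(range(T)):                  # backward hidden sequence
        h = np.tanh(Wb @ xs[t] + Ub @ h + bb)
        h_bwd[t] = h
    h = np.zeros(l)
    for t in range(T):                            # forward hidden sequence
        h = np.tanh(Wf @ xs[t] + Uf @ h + bf)
        h_fwd[t] = h
    return [Vf @ h_fwd[t] + Vb @ h_bwd[t] + bo for t in range(T)]   # Eq. (7.5.43)
```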

7.5.5 Long Short-Term Memory (LSTM)

Standard RNNs [113] have been plagued by two major practical problems.
• RNNs have short-term memory problems: recent inputs have a large impact while long-past inputs have a small one, so the gradient of the total output error with respect to previous inputs quickly vanishes as the time lag between relevant inputs and errors increases. Therefore, long time lags are inaccessible to existing architectures [33].
• Training an RNN is computationally costly.

Long Short-Term Memory (LSTM) [63] is an RNN architecture designed to be better at storing and accessing information than standard RNNs. LSTM is essentially a recurrent network architecture in conjunction with an appropriate gradient-based learning algorithm. For general-purpose sequence modeling, it is widely recognized (see, e.g., [43, 44, 63, 112, 135, 143]) that LSTM, as a special RNN structure, is stable and powerful for modeling long-range dependencies. The major innovation of the LSTM is its memory cell c_t, which essentially acts as an accumulator of the state information. The cell is accessed, written, and cleared by several self-parameterized controlling gates. Every time a new input arrives, its information is accumulated into the cell if the input gate is activated. Moreover, the past cell status c_{t-1} can be "forgotten" in this process if the forget gate f_t is on [135]. An RNN composed of LSTM units is often called an LSTM network. A common LSTM unit is composed of a cell, an input gate, an output gate, and a forget gate.


An LSTM layer consists of a set of recurrently connected blocks, known as memory blocks. These blocks can be thought of as a differentiable version of the memory chips in a digital computer. Each block contains one or more recurrently connected memory cells and three multiplicative units: the input gate (denoted as i), output gate (denoted as y), and forget gate (denoted as f ). Each of these gates can be viewed as a “standard” neuron in a feedforward (or multilayer) neural network that compute an activation (using an activation function) of a weighted sum. The vectors i t , y t , and f t represent the activations of the input, output, and forget gates, at time step t, respectively. The cell output yc is calculated via the current cell state and four sources of inputs: netc is input to the cell itself, while neti , netf , and neto are inputs to the input gates, forget gates, and output gates, respectively. Let j denote the indexes of memory blocks; v be the indexes of memory cells in block j (with Sj cells), such that cjv denotes the vth cell of the j th memory block; and wlm be the weight on the connection from unit m to unit l. For the gates fl , l ∈ {i, o, f } is a logistic sigmoid with range [0, 1]. In standard LSTM, a typical LSTM cell is made of input, forget, and output gates and a cell activation component. These units receive the activation signals from different sources and control the activation of the cell by the designed multipliers, as described below [126]. Introduce the following notation. • S denotes the input sequence over which the training takes place, and it runs from time τ0 to τ1 . • i, f, o denote, respectively, the input gate, forget gate, and output gate. • c denotes cell, and refers to an element of the set of cells C. • xk (τ ) is the network input to unit k at time τ , and yk (τ ) denotes its activation. • sc (τ ) is the state value of cell c at time τ , i.e., its value after the input and forget gates have been applied. • tk (τ ) denotes the training target for output unit k at time τ . • E(τ ) refers to the (scalar) output error of the net at time τ . • g(xc ) and h(sc ) are, respectively, the cell input and state squashing functions. • σ (xi ), σ (xf ), and σ (xo ): the squashing function of the input gate, the forget gate, and the output gate, respectively, • wij is the weight from unit j to unit i. The following are pseudocode for full gradient LSTM layer in a multilayer net [45]: 1. Forward pass • Reset all activations to 0. • Running forwards from time τ0 to time τ1 , feed in the inputs and update the activations. Store all hidden layer and output activations at every time step.


• For each LSTM block, the activations are updated as follows:

  Input gates:
  $$x_i(\tau) = \sum_{j\in N} w_{ij}\, y_j(\tau-1) + \sum_{c\in C} w_{ic}\, s_c(\tau-1), \qquad y_i(\tau) = \sigma(x_i(\tau)).$$

  Forget gates:
  $$x_f(\tau) = \sum_{j\in N} w_{fj}\, y_j(\tau-1) + \sum_{c\in C} w_{fc}\, s_c(\tau-1), \qquad y_f(\tau) = \sigma(x_f(\tau)).$$

  Cells:
  $$x_c(\tau) = \sum_{j\in N} w_{cj}\, y_j(\tau-1), \quad \forall c \in C, \qquad s_c(\tau) = y_f\, s_c(\tau-1) + y_i\, g(x_c(\tau)).$$

  Output gates:
  $$x_o(\tau) = \sum_{j\in N} w_{oj}\, y_j(\tau-1) + \sum_{c\in C} w_{oc}\, s_c(\tau), \qquad y_o(\tau) = \sigma(x_o(\tau)).$$

  Cell outputs:
  $$y_c(\tau) = y_o(\tau)\, h(s_c(\tau)), \quad \forall c \in C.$$

2. Backward pass

• Reset all partial derivatives to 0.
• Starting at time τ_1, propagate the output errors backwards through the unfolded net, using the standard BPTT equations for a softmax output layer and the cross-entropy error function:
  $$\delta_k(\tau) = \frac{\partial E(\tau)}{\partial x_k(\tau)} = y_k(\tau) - t_k(\tau), \qquad k \in \text{output units}.$$
• For each LSTM block, from time τ_1 − 1 to time τ_0 the δ's are calculated as follows:

  Cell outputs:
  $$\epsilon_c(\tau) = \sum_{j\in N} w_{jc}\, \delta_j(\tau+1), \quad \forall c \in C.$$

  Output gates:
  $$\delta_o(\tau) = \sigma'(x_o(\tau)) \sum_{c\in C} \epsilon_c(\tau)\, h(s_c(\tau)).$$

  States:
  $$\frac{\partial E(\tau)}{\partial s_c(\tau)} = \epsilon_c(\tau)\, y_o(\tau)\, h'(s_c(\tau)) + y_f(\tau+1)\,\frac{\partial E(\tau+1)}{\partial s_c(\tau+1)} + \delta_i(\tau+1)\, w_{ic} + \delta_f(\tau+1)\, w_{fc} + \delta_o(\tau+1)\, w_{oc}.$$

  Cells:
  $$\delta_c(\tau) = y_i(\tau)\, g'(x_c(\tau))\, \frac{\partial E(\tau)}{\partial s_c(\tau)}.$$

  Forget gates:
  $$\delta_f(\tau) = \sigma'(x_f(\tau)) \sum_{c\in C} \frac{\partial E(\tau)}{\partial s_c(\tau)}\, s_c(\tau-1).$$

  Input gates:
  $$\delta_i(\tau) = \sigma'(x_i(\tau)) \sum_{c\in C} \frac{\partial E(\tau)}{\partial s_c(\tau)}\, g(x_c(\tau)).$$

• Using the standard BPTT equation, accumulate the δ's to get the partial derivatives of the cumulative sequence error:
  $$E_{\mathrm{total}}(S) = \sum_{\tau=\tau_0}^{\tau_1} E(\tau), \qquad \nabla_{ij}(S) = \frac{\partial E_{\mathrm{total}}(S)}{\partial w_{ij}} \;\Rightarrow\; \nabla_{ij}(S) = \sum_{\tau=\tau_0+1}^{\tau_1} \delta_i(\tau)\, y_j(\tau-1).$$

3. Update weights: After the presentation of sequence S, with learning rate α and momentum m, update all weights with the standard equation for gradient descent with momentum:
  $$\Delta w_{ij}(S) = -\alpha\, \nabla_{ij}(S) + m\, \Delta w_{ij}(S-1).$$

Activation functions may be taken as follows:
• σ_f: sigmoid function.
• σ_i: hyperbolic tangent function.
• σ_o: hyperbolic tangent function or, as the peephole LSTM suggests, σ_h(x) = x.
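As a vectorized sketch of the forward-pass updates above, the following function computes one step of an LSTM block over all cells of the block; the function name, the vector treatment of the source activations y(τ−1), and the choice g = h = tanh are our own assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_block_forward(y_prev, s_prev, Wi, Wf, Wc, Wo, wic, wfc, woc):
    """One forward step of the LSTM block. y_prev: source activations y(tau-1);
    s_prev: cell states s_c(tau-1); W*: input weights; w*c: peephole weights (elementwise)."""
    y_i = sigmoid(Wi @ y_prev + wic * s_prev)        # input gate
    y_f = sigmoid(Wf @ y_prev + wfc * s_prev)        # forget gate
    s   = y_f * s_prev + y_i * np.tanh(Wc @ y_prev)  # cell state
    y_o = sigmoid(Wo @ y_prev + woc * s)             # output gate
    return y_o * np.tanh(s), s                       # cell outputs y_c(tau) and new states
```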

7.5.6 Improvement of Long Short-Term Memory

RNNs are powerful sequence learners, but they have the following limitations:
• They require pre-segmented training data and post-processing to transform their outputs into label sequences, so their applicability to labeling unsegmented sequence data is limited [46].
• They may fail to understand input structures more complicated than a sequence [175].

To avoid pre-segmented training data and post-processing into label sequences, Graves et al. [46] in 2006 proposed a method for training RNNs to label unsegmented sequences directly, thereby solving both problems. This method is called Connectionist Temporal Classification (CTC).


Let S be a set of training examples drawn from a fixed distribution D_{X×Z}. Each example in S consists of a pair of sequences (x, z), where the target sequence z = (z_1, ..., z_U) is at most as long as the input sequence x = (x_1, ..., x_T), i.e., U ≤ T. These input and target sequences cannot be aligned a priori because they generally do not have the same length. The CTC method uses a softmax layer to define a separate output distribution P(k|t) at every step t along the input sequence. This distribution covers the K phonemes plus an extra blank symbol Ø which represents a non-output (the softmax layer is therefore of size K + 1). RNNs trained with CTC are generally bidirectional, to ensure that every P(k|t) depends on the entire input sequence, and not just the inputs up to t. For example, for deep bidirectional networks, P(k|t) is defined by [47]

$$y_t = W_{\overrightarrow{h}^N y}\, \overrightarrow{h}_t^N + W_{\overleftarrow{h}^N y}\, \overleftarrow{h}_t^N + b_y, \qquad (7.5.44)$$
$$P(k|t) = \frac{\exp(y_t[k])}{\sum_{k'=1}^{K} \exp(y_t[k'])}, \qquad (7.5.45)$$

where y_t[k] is the kth element of the length-(K+1) unnormalized output vector y_t, N denotes the number of bidirectional levels, and $W_{\overrightarrow{h}^N y}$ and $W_{\overleftarrow{h}^N y}$ denote the weight matrices from the forward and backward hidden states to the output layer, respectively.

(7.5.46)

where σ is the elementwise logistic function used to confine the gating signals to be in the range of [0, 1].

476

7 Neural Networks

2. Forget gate: The above four sources of information are also used to form the L and right forget gate f R via different gating signals for the left forget gate ft−1 t−1 weight matrices:   L L R R L L R R fL t = σ Whfl h t−1 + Whfl h t−1 + Wcfl c t−1 + Wcfl c t−1 + bfl ,   L L R R L L R R fR t = σ Whfr h t−1 + Whfr h t−1 + Wcfr c t−1 + Wcfr c t−1 + bfr .

(7.5.47) (7.5.48)

3. Cell gate: The cell here considers the copies from both children’s cell vectors R (cL t−1 , ct−1 ), gated with separated forget gates. The left and right forget gates can be controlled independently, allowing the pass-through of information from children’s cell vectors: L R R x t = WL hx h t−1 + Whx h t−1 + bx ,

c t = fLt

cLt−1 + fRt

cRt−1 + i t

(7.5.49) tanh(x t ).

(7.5.50)

4. Output gate: The output gate o t considers the hidden vectors from the children and the current cell vector:   L R R o t = σ WL ho h t−1 + Who h t−1 + Wco c t + bo .

(7.5.51)

5. Hidden state: The hidden vector h t and the cell vector c t of the current block are passed to the parent and are used depending on if the current block is a left or right child of its parent: ht = ot

tanh(c t ).

(7.5.52)

The backward computation of a S-LSTM memory block uses backpropagation over structures [175]:  ht =

∂o , ∂h t

(7.5.53)

∂ot =  ht ∂x

tanh(c t )

∂ftl =  ct ∂x

cLt−1

σ  (ot ),

  σ  ftL .

(7.5.54) (7.5.55)

In convenient LSTM each gate receives connections from the input units and the outputs of all cells, but there is no direct connection from the “Constant Error Carousel” (CEC) that provides short-term memory storage for extended time periods. A simple but effective remedy is to add weighted “peephole” connections from the CEC to the gates of the same memory block [34]. The peephole connections allow all input gate i, output gate y, and forget gate f to inspect the current cell state

7.6 Boltzmann Machines

477

even when the output gate is closed. This information can be essential for finding well-working network solutions. The peephole connections from the memory cell c to the three gates i, y, and f actually denote the contributions of the activation of the memory cell c at time step t − 1, i.e., the contribution of ct−1 rather than ct . LSTM with peephole connections is called the peephole LSTM that is proposed by Gers et al. [33, 34]. The key update equations of peephole LSTM are as follows: ft = σg (Wf xt + Uf ct−1 + bf ),

(7.5.56)

it = σg (Wi xt + Ui ct−1 + bi ),

(7.5.57)

yt = σg (Wo xt + Uo ct−1 + bo ),

(7.5.58)

ct = ft

ct−1 + it

(7.5.59)

ht = yt

σh (ct ).

σc (Wc xt + Uc ct−1 + bc ),

(7.5.60)

7.6 Boltzmann Machines The Boltzmann machine, invented by Ackley et al. in 1985 [1], is another important type of neural networks that relies on a stochastic (probabilistic) form of learning. The Boltzmann machine is closely related to the Hopfield network.

7.6.1 Hopfield Network and Boltzmann Machines Energy-based models are a special class of neural networks. The simplest energy model is the Hopfield Network, first introduced by Hopfield in 1982 [65], which is usually viewed as a form of recurrent neural network. The Hopfield network is a fully connected neural network with binary thresholding neural units whose values are either 0 or 1. These units are fully connected in a “recurrent” way in which the connection between weights and neurons are bidirectional. With this setting, the energy of a Hopfield network is defined as E=−

 i

si bi −



si sj wij ,

(7.6.1)

i,j

where si is the state of unit i, bi denotes its bias; wij denotes the bidirectional weights connecting units i and j .

478

7 Neural Networks

There are two inference procedures of Hopfield network [155]. (a) Asynchronous update: For a state of data, the network tests if inverting the state of one units, will cause the energy to decrease. If so, the network will invert the state and proceed to test the next unit. (b) Synchronous update: The network first tests for all the units and then inverts all the units-to-invert simultaneously. The main disadvantages of Hopfield network are twofold [155]. • Both the asynchronous and synchronous update methods may lead to a local optimum. The synchronous update may even result in an increasing of energy and may converge to an oscillation or loop of states. • The Hopfield network cannot keep the memory very efficient because a network of N units can only store memory up to 0.15N 2 bits, but the number of bits required to store N units are N 2 log(2M + 1) (M is instances of data) [65]. Interestingly and importantly, with the introduction of hidden units in the Hopfield network, the model can be upgraded from the initial ideas of the Hopfield network and evolve to replace it. This is the basic idea of the popular Boltzmann machine and restricted Boltzmann machine. The stochastic neurons of a Boltzmann machine partition into two functional groups: hidden units and visible units. Figure 7.7 shows a comparison between the Hopfield network and the Boltzmann machine [155]. In a Boltzmann machine, only visible units are connected with data and hidden units are used to assist visible units to describe the distribution of data, but it still maintains a fully connected network among these units.

Fig. 7.7 A comparison between Hopfield network and Boltzmann machine. Left: is Hopfield network in which a fully connected network of six binary thresholding neural units. Every unit is connected with data. Right: is Boltzmann machine whose model splits into two parts: visible units and hidden units (shaded nodes). The dashed line is used to highlight the model separation

7.6 Boltzmann Machines

479

The Boltzmann machine and the Hopfield network share the following common features [52]: • • • •

their processing units have binary values (say +1 and −1) for their states; all the synaptic connections between their units are symmetric; the units are picked at random and one at a time for updating; they have no self-feedback.

There are three important differences between the Boltzmann machine and the Hopfield network [52]. (a) The Boltzmann machine permits the use of hidden neurons, while no such neurons exist in the Hopfield network. (b) The Boltzmann machine uses stochastic neurons with a probabilistic firing mechanism, whereas the standard Hopfield network uses neurons with a deterministic firing mechanism. (c) The Boltzmann machine operates in a supervised manner, while the Hopfield network operates in an unsupervised manner. The above common features and important differences help in providing a better understanding of the following discussion on the Boltzmann machine. The Boltzmann machine has the same definition of energy functions as that of the Hopfield network, except the Boltzmann machine splits the energy function according to hidden units and visible units [65]: E(x, h) = −bT x − cT h − hT Wx − xT Ux − hT Vh,

(7.6.2)

where b and c are the offsets associated with the input x (visible vector) and the hidden output h (hidden vector), respectively, while W, U, and V are the weight matrices of hidden-visible units, visible-visible units, and hidden-hidden units, respectively. If considering only one observed part (denoted by x) and a hidden part h then the energy-based probability distribution can be defined as [8, 65]: P (x, h) =

e−E(x,h) , Z

(7.6.3)

where the normalizing factor with a sum running over the visible and hidden spaces given by Z=



e−E(x,h)

x,h

is called the partition function by analogy with physical systems.

(7.6.4)

480

7 Neural Networks

Because only x is observed, it only needs to care about the marginal p(x) =

1  −E(x,h) e . Z

(7.6.5)

h

Substituting (7.6.4) into (7.6.5) yields  −E(x,h) e p(x) =  h −E(¯x,h) x¯ ,h e

(7.6.6)

whose log-likelihood form is given by log p(x) = log





e−E(x,h) − log

e−E(¯x,h) .

(7.6.7)

x¯ ,h

h

Then, by letting θ = (b, c, W, U, V) denote the parameters of the Boltzmann model, one has [8] ∂ log p(x) ∂ log = ∂θ = −



−E(x,h) he

∂θ 1



−E(x,h) he

h



∂ log



−E(¯x,h) x¯ ,h e

∂θ ∂E(x, h) e−E(x,h) ∂θ

−E(¯x,h)  1 −E(¯x,h) ∂e e −E(¯x,h) ∂θ x¯ ,h e

+

x¯ ,h

=−

 h

Let s =

p(h|x)

∂E(x, h)  ∂E(¯x, h) + . p(¯x, h) ∂θ ∂θ

(7.6.8)

x¯ ,h

  x denote all the units in the Boltzmann machine. Then (7.6.2) can be h

rewritten as E(s) = −[bT , cT ]

     x U O x − [xT , hT ] = −dT s − sT As, h WV h

(7.6.9)

where   b d= , c



 U O A= . WV

(7.6.10)

7.6 Boltzmann Machines

481

Fig. 7.8 A comparison between Left Boltzmann machine and Right restricted Boltzmann machine. In the restricted Boltzmann machine there are no connections between hidden units (shaded nodes) and no connections between visible units (unshaded nodes)

7.6.2 Restricted Boltzmann Machine A Restricted Boltzmann machine (RBM) is a version of the Boltzmann machine with an added restriction: there should be no connections either between visible units or between hidden units. The RBM was originally known as Harmonium when invented by Smolensky in 1986 [139]. Figure 7.8 shows a comparison between the restricted Boltzmann machine and the Boltzmann machine [155]. The RBM is a two layer, bipartite, undirected graphical model with a set of binary hidden units h, a set of (binary or real-valued) visible units x, and symmetric connections between these two layers represented by a weight matrix W. The probabilistic semantics for an RBM is denoted by p(x; h), and is defined by its energy function: p(x, h) =

e−E(x,h) , Z(h)

(7.6.11)

 ¯ where Z(h) = h¯ e−E(x,h) is known as the partition function for hidden units. Consider a training set of binary vectors which are assumed to be binary images. The training set can be modeled using a two-layer network called a RBM in which stochastic, binary pixels are connected to stochastic, binary feature detectors using symmetrically weighted connections. Because states of the pixels are observed, these pixels correspond to “visible” units of the RBM, while the feature detectors correspond to “hidden” units. Without connections either between visible units or between hidden units, U and V in (7.6.2) are two null weight matrices. Let xi , i = 1, . . . , m and hj , j = 1, . . . , F be the observed visible variables and the binary values of hidden (latent) variables, respectively. Then the energy function (7.6.2) of the Boltzmann machine reduces to that of the restricted Boltzmann

482

7 Neural Networks

machine [65]: E(x, h) = −

m 

bi xi −

F 

cj hj −

j =1

i=1



xi Wij hj

i,j

= −xT Wh − bT x − cT h,

(7.6.12)

where xi and hj are the binary states of visible unit i and hidden unit j , bi and cj are their biases, and Wij is the weight between them, while b = [bi ] is visible unit bias vector and c = [cj ] is hidden unit bias vector. In regression problems, the visible units x are real-valued, and the energy function is defined as   1  2  xi − xi Wij hj − bi xi − cj hj 2 m

E(x, h) =

m

i=1 j =1

i=1

=

F

m

F

i=1

j =1

1 T x x − xT Wh − bT x − cT h. 2

(7.6.13)

Clearly, the hidden units are conditionally independent of one another given the visible layer, and vice versa. In particular, the units of a binary layer (conditioned on the other layer) are independent Bernoulli random variables, and if the visible layer is real-valued then the visible units (conditioned on the hidden layer) are Gaussian with diagonal covariance [92]. Therefore, a tractable expression for the conditional probability for RBM can be readily obtained as [8]: p(x) p(x, h)   exp bT x + cT h + xT Wh =   T T ¯ T ¯ h¯ exp b x + c h + x Wh   5F T j =1 exp cj hj + hj wj x = 5 F F   T j =1 j  =1 exp cj hj  + hj  wj x     T exp hj cj + wj x F # =   F T j  =1 exp hj  (cj + wj x) j =1

p(h|x) =

=

F #

P (hj |x),

(7.6.14)

j =1

where wj is the j th column of the weight matrix W = [w1 , . . . , wF ] ∈ Rm×F .

7.6 Boltzmann Machines

483

By the special structure of RBM (that is, there is connection between layers and no connection within layers) it is known that the activation states of hidden units are conditionally independent given the state of visible units. At this time, the activation probability of the j th hidden unit is given by ecj +wj x T

p(hj = 1|x) =

cj +wTj x

1+e

! " = σ cj + wTj x ,

(7.6.15)

where σ (z) is the logistic sigmoid function 1/(1 + exp(z)). Since x and h play a symmetric role in the energy function, a similar derivation gives the conditional probability of x given h: p(x|h) =

m #

p(xi |h).

(7.6.16)

i=1

Because the structure of RBM is symmetrical, when the state of the hidden unit is given, the activation state of each visible unit is conditionally independent, that is, the activation probability of the ith visible unit is given by   ˜ ih , p(xi = 1|h) = σ bi + w

(7.6.17)

˜ i is the ith row of W ∈ Rm×F . where w Consider a probability distribution over a vector x and with parameters W [15]: p(x; W) =

1 e−E(x;W) , Z(W)

(7.6.18)

 where Z(W) = x e−(x;W) is a normalization constant and E(x; W) is an energy function. Maximum-likelihood (ML) learning of the parameters W given an i.i.d. sample X = {xn }N n=1 can be updated by gradient ascent: W(t+1) = W(t) + η

 ∂L(W; X )   (t) , ∂W W

(7.6.19)

where the learning rate η needs not be constant, and the average log-likelihood is L(W; X ) =

N 1  log p(xn ; W) N n=1

= log p(x; W)0 = −E(x; W)0 − log Z(W),

(7.6.20)

484

7 Neural Networks

where ·0 denotes an average with respect to the data distribution, i.e., P0 (x) = E(x; W)0 = N1 N n=1 δ(x−xn ). A well-known difficulty arises in the computation of the gradient = > > = ∂E(x; W) ∂E(x; W) ∂L(W; X ) =− + , ∂W ∂W ∂W 0 ∞

(7.6.21)

where ·∞ denotes an average with respect to the model distribution p∞ (x; W) = p(x; W). The average ·0 is readily computed using the sample data X = {xn }N n=1 , but the average ·∞ involves the normalization constant Z(W), which cannot generally be computed efficiently (being a sum of an exponential number of terms).

7.6.3 Contrastive Divergence Learning In the RBM framework, there are no direct interactions between the hidden units and no direct interactions between the visible units that represent the pixels. In this case, there is a simple and efficient method called “contrastive divergence” (CD) [58] to learn a good set of feature detectors from a set of training images. The RBM parameters are optimized by performing stochastic gradient ascent on the log-likelihood of training data. Unfortunately, computing the exact gradient of the log-likelihood is usually intractable. Instead of computing the exact gradient of the log-likelihood, one can use the contrastive divergence method to approximate the exact gradient, which has been shown to work well in practice. Consider a probability distribution over a vector x weighted by a parameter matrix W: p(x; W) =

1 e−E(x;W) , Z(W)

(7.6.22)

where E(x; W) is an energy function and Z(W) =



e−E(x;W)

(7.6.23)

x

is a normalization constant. The log-likelihood of p(x; W) is given by L(x; W) = log p(x; W) = log e−E(x;W) − log Z(W)  E(x; W). = −E(x; W) + x

(7.6.24)

7.6 Boltzmann Machines

485

Given an independent identically distributed (i.i.d.) sample set X = {xN n=1 }, the N average log-likelihood of p({xn }n=1 ; W), denoted by L(X; W), is defined as L(X; W) =

N N N  1  1  1  L(xn ; W) = − E(xn ; W) + E(xn ; W) N N N x n=1

n=1

n=1

= −E(x; W)0 + E(x; W)∞ ,

(7.6.25)

where N 1  E(x; W)0 = E(x; W)|x=xn , N

(7.6.26)

n=1

E(x; W)∞ =

N  1  E(x; W)|x=xn N x

(7.6.27)

n=1

 denote an average with respect to the data distribution p0 (x) = N n=1 δ(x − xn ) and an average with respect to the model distribution p∞ (x; W) = P (x; W), respectively. Therefore, maximum likelihood (ML) learning of the weight matrix W at the time t can be updated by (t+1)

W

=W

(t)

 ∂L(W; X)  −η  (t) , ∂W W

(7.6.28)

where η is a learning rate that may allow to be nonconstant, and the gradient is given by = > > = ∂E(x; W) ∂L(W; X) ∂E(x; W) =− + . ∂W ∂W ∂W 0 ∞

(7.6.29)

Although the first term ·0 is readily computed, the second term ·∞ involves a sum of an exponential number of terms, which cannot generally be computed efficiently. To avoid the difficulty in computing the log-likelihood gradient, Hinton [58] proposed the contrastive divergence (CD) method which approximately follows the gradient of a different function. ML learning minimizes the Kullback–Leibler divergence KL(p0 p∞ ) =

 x

p0 (x) log

p0 (x) . p(x; W)

(7.6.30)

486

7 Neural Networks

CD learning approximately follows the gradient of the difference of two divergences [58]: CDn = KL(p0 p∞ ) − KL(pn p∞ ).

(7.6.31)

For fully visible Boltzmann machines (VBMs) without hidden units (i.e., h = 0) with the input x = [x1 , . . . , xd ]T ∈ Rd×d , the energy function E(x; W) = − 12 xT Wx, where W = [wij ] is a symmetric d × d matrix of real-valued weights. In VBMs, the log-likelihood has a unique optimum because its Hessian is negative definite. Since ∂E/∂wij = −xi xj , ML learning takes the form [15]:   wij (t + 1) = wij (t) + η xi xj 0 − xi xj ∞ ,

(7.6.32)

while CDn learning takes the form   wij (t + 1) = wij (t) + η xi xj 0 − xi xj n ,   bi (t + 1) = bi (t) + η xi 0 − xi n ,   cj (t + 1) = cj (t) + η hj 0 − hj n .

(7.6.33) (7.6.34) (7.6.35)

Based on the symmetrical structure of RBM model and the conditional independence of neuron states, we can use Gibbs sampling to obtain random samples of distribution defined by RBM. The specific algorithm of k-step Gibbs sampling in RBM is to use a training sample (or any randomized state of the visible layer) for initializing the state x0 of the visible layer and alternately sample as follows: h0 ∼ P (h|x0 ),

x1 ∼ P (x|h0 ),

h1 ∼ P (h|x1 ),

x2 ∼ P (x|h1 ),

.. . h2 ∼ P (h|xk ),

.. . xk+1 ∼ P (x|hk ).

Unlike Gibbs sampling, when using contrastive divergence method to train data, we only need to use k = 1 step Gibbs sampling to get good enough approximation (x1 , h1 ) and (x2 , h2 ). k = 1 Gibbs sampling based contrastive divergence (CD) learning is called CD1 fast RBM learning algorithm. In this case, CD1 learning forms (7.6.33)–(7.6.35) can be rewritten as the matrix-vector forms:   W ← W + η P (h1 = 1|x1 )xT1 − P (h2 = 1|x2 )xT2 , b ← b + η(x1 − x2 ),   c ← c + η P (h1 = 1|x1 ) − P (h2 = 1|x2 ) .

(7.6.36) (7.6.37) (7.6.38)

7.6 Boltzmann Machines

487

Algorithm 7.3 summarizes the steps of the CD1 fast RBM learning algorithm.

Algorithm 7.3 CD1 fast RBM learning algorithm
1. input: a training sample x0, the number of hidden layer units m, learning rate η, maximum training period T.
2. initialization: initial state of visible layer x1 = x0; select W, b, c randomly.
3. for t = 1 to T do
4.   for j = 1 to m do (for all hidden units)
5.     calculate P(h_{1j} = 1 | x1) = sigmoid(c_j + Σ_i W_{ij} x_{1i})
6.   end for
7.   construct P(h1 = 1 | x1) = [P(h_{11} = 1 | x1), ..., P(h_{1m} = 1 | x1)]^T
8.   for i = 1 to n do (for all visible units)
9.     calculate P(x_{2i} = 1 | h1) = sigmoid(b_i + Σ_j W_{ij} h_{1j})
10.  end for
11.  for j = 1 to m do
12.    calculate P(h_{2j} = 1 | x2) = sigmoid(c_j + Σ_i W_{ij} x_{2i})
13.  end for
14.  construct P(h2 = 1 | x2) = [P(h_{21} = 1 | x2), ..., P(h_{2m} = 1 | x2)]^T
15.  W ← W + η (P(h1 = 1 | x1) x1^T − P(h2 = 1 | x2) x2^T)
16.  b ← b + η (x1 − x2)
17.  c ← c + η (P(h1 = 1 | x1) − P(h2 = 1 | x2))
18. end for
19. output: W, b, c.
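A compact NumPy sketch of one CD-1 update in the spirit of (7.6.36)–(7.6.38) follows; the mini-batch formulation, the shape convention W ∈ R^{m×F} (visible × hidden), and the sampling choices are our own assumptions rather than the book's exact algorithm:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_update(X, W, b, c, eta=0.1, rng=np.random.default_rng(0)):
    """One CD-1 update for a binary RBM on a mini-batch X (rows are visible vectors x1).
    W: (m, F) visible-hidden weights; b: visible bias (m,); c: hidden bias (F,)."""
    ph1 = sigmoid(X @ W + c)                        # P(h1 = 1 | x1)
    h1 = (rng.random(ph1.shape) < ph1).astype(float)
    px2 = sigmoid(h1 @ W.T + b)                     # P(x2 = 1 | h1)
    x2 = (rng.random(px2.shape) < px2).astype(float)
    ph2 = sigmoid(x2 @ W + c)                       # P(h2 = 1 | x2)
    W += eta * (X.T @ ph1 - x2.T @ ph2) / len(X)    # cf. Eq. (7.6.36)
    b += eta * (X - x2).mean(axis=0)                # cf. Eq. (7.6.37)
    c += eta * (ph1 - ph2).mean(axis=0)             # cf. Eq. (7.6.38)
    return W, b, c
```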

7.6.4 Multiple Restricted Boltzmann Machines In the above discussion, the observed “visible” unit is denoted by the binary vector x = [x1 , . . . , xm ]T . Now, consider the case where we have a set of m user rated movies or commodities. Support the user rated movie i as k, where k ∈ {1, . . . , K}. Then the observed “visible” binary rating matrix is defined as ⎡

1 x11 x21 · · · xm

⎢ ⎢ 2 k 3K,m ⎢ 2 X = xi k=1,i=1 = [x1 , . . . , xm ] = ⎢ x1 ⎢ . ⎣ ..

x1K

x22 · · · .. . . . . x2K · · ·



⎥ ⎥ 2⎥ xm ⎥ ∈ RK×m , .. ⎥ . ⎦

(7.6.39)

K xm

where the (k, i)th entry is given by

xik =

⎧ ⎨ 1, if the user rated movie i as k; ⎩

(7.6.40) 0, otherwise.

488

7 Neural Networks

Therefore, we have K restricted Boltzmann machines with binary hidden units and softmax visible units. For each user, the RBM only includes softmax units for the movies that user has rated [125]. All of the K RBMs have the same binary values of hidden (latent) variables h. If using a conditional multinomial distribution (a “softmax”) for modeling each column of the observed visible binary rating matrix X and a conditional Bernoulli distribution for modeling hidden user features h, then [125] ! "    exp bik + j =1 hj Wijk ", ! p xik = 1|h =   l l l=1 exp bi + j =1 hj Wij  p(hj = 1|X) = σ cj +

K m  

(7.6.41)

 xik Wijk

(7.6.42)

,

i=1 k=1

where Wijk is a symmetric interaction parameter between feature j and rating k of movie i, bik is the bias of rating k for movie i, and cj is the bias of feature j . Note that the bik can be initialized to the logs of their respective base rates over all users. The marginal distribution over the visible ratings X is [125] p(X) =

 h

exp(−E(X, h))     X h exp(−E(X , h ))

(7.6.43)

with an “energy” term given by E(X, h) = −

F  K m  

Wijk hj xik

+ log(Zi ) −

i=1 j =1 k=1

K m  

xik bik



F 

hj cj ,

j =1

i=1 k=1

(7.6.44)  F l l where Zi = K l=1 exp(bi + j =1 hj Wij ) is the normalization term that ensures K that l=1 p(xil = 1|h) = 1. The movies with missing ratings do not make any contribution to the energy function. The symmetric interaction matrix W is updated in gradient ascent form as Wij ← Wij − ηWij , where Wij =

∂ log p(X) ∂Wijk

>

= =

>

= −

xik hj data

xik hj

,

(7.6.45)

model

where the expectation xik hj data defines the frequency with which movie i with rating k and feature j are on together when the feature detectors are being driven by the observed user-rating data from the training set using Eq. (7.6.42), and

7.6 Boltzmann Machines

489

xik hj model is the corresponding frequency when the hidden units are being driven by reconstructed images. To avoid computing ·model , Salakhutdinov et al. [125] proposed to follow an approximation to the gradient of a different objective function called “contrastive divergence” (CD) [58]: = Wijk

=

=

> −

xik hj

> xik hj

data

(7.6.46)

. T

The expectation ·T represents a distribution of samples from running the Gibbs sampler (Eqs. (7.6.41) and (7.6.42)), initialized at the data, for T full steps. T is typically set to one at the beginning of learning and increased as the learning converges. Restricted Boltzmann machines discussed above are assumed to use binary visible and hidden units, but many other types of unit can also be used. The main use of other types of unit is for dealing with data that is not well-modeled by binary (or logistic) visible units. The following are two typical units that can be used in restricted Boltzmann machines [59, 157]: 1. Softmax and multinomial units: For a binary unit, the probability of turning on is given by the logistic sigmoid function of its total input x: p = σ (x) =

1 ex = . 1 + ex ex + e0

(7.6.47)

The energy contributed by the unit is −x if it is on and 0 if it is off. This logistic sigmoid function of two states can be generalized to K alternative states, i.e., pj = 

exj xi i=1 e

(7.6.48)

,

which is often called a “softmax” unit. A further generalization of the softmax unit is to sample N times (with replacement) from the probability distribution instead of just sampling once. The K different states can then have integer values bigger than 1, but the values must add to N . This is called a multinomial unit and the learning rule is again unchanged. 2. Gaussian visible units: The binary visible units are replaced by linear units with independent Gaussian noise. In this case, the energy function becomes E(x, h) =



(xj − bj )2

j ∈visible

2σj2



 i∈hidden

ci hi −

 i,j

hi

xj wij , σj

(7.6.49)

where σi is the standard deviation of the Gaussian noise for visible unit i.

490

7 Neural Networks

7.7 Bayesian Neural Networks Consider the task of classification in a Bayesian learning framework. Assume that the data was generated by a parametric model, and use training data to calculate Bayesian optimal estimates of the model parameters. Then, equipped with these estimates, the classifier classifies new test data using Bayes’ rule for turning the generative model around and calculate the posterior probability that a class would have generated the test data in question. Classification then becomes a simple matter of selecting the most probable class.

7.7.1 Naive Bayesian Classification A widely used framework for classification is provided by a simple theorem of probability known as Bayes’ rule (called also Bayes’ theorem or Bayes’ formula) [94, 104]: p(ck |x) = p(ck ) ×

p(x|ck ) p(x)

(7.7.1)

with p(x) =

M 

p(x|ci )p(ci ),

(7.7.2)

i=1

where M is the number of classes or groups, x = [x1 , . . . , xd ]T is a vector of feature values, and p(ck |x) is the conditional probability that a given vector x belongs to the class ck . It is assumed that all possible events fall into exactly one of M classes or groups {c1 , . . . , cM }. Naive Bayesian classification is to assign the given feature vector x to the class ck with the highest conditional probability p(ck |x). Of course, the conditional probabilities p(c1 |x), . . . , p(cM |x) are not known; they must be estimated from data. This estimation is difficult to do directly. Bayes’ rule suggests first estimating p(x|ck ), p(ck ), and p(x) and then using these estimates to get an estimate of p(ck |x), but estimating p(x|ck ) is still a problem, as x contains d components x1 , . . . , xd . A common strategy is to assume that the distribution of x conditional on ck can be decomposed in the following fashion for all ck [94]: p(x|ck ) =

d # i=1

p(xi |ck ).

(7.7.3)

7.7 Bayesian Neural Networks

491

This assumption implies that the occurrence of a particular value of xi is statistically independent of the occurrence of any other xj , j = i, given a data of class ck . Under the assumption in (7.7.3), Eq. (7.7.1) becomes 5d

i=1 p(xi |ck )

p(ck |x) = p(ck ) ×

p(x)

(7.7.4)

from which it follows that p(x) =

M d     #  p cl × p xi |cl . l=1

(7.7.5)

i=1

Then, the estimation formula is given by [94] p(c ˆ k |x) =

p(c ˆ k) ×

5d

ˆ i |ck ) i=1 p(x

p(x) ˆ

.

(7.7.6)

This estimate can then be used for classification.

7.7.2 Bayesian Classification Theory Bayesian decision theory provides the fundamental probability model for wellknown classification procedures. Consider a general M-group classification problem in which each object has an associated attribute vector of dimension d. Let x ∈ R d be an associated attribute vector, and wj denote the membership variable that takes value 1 if an object belongs to group j . Define p(ωj ) as the prior probability of group j and f (x|ωj ) as the probability density function. According to the Bayes rule, we have p(ωj |x) =

f (x|ωj )p(ωj ) , f (x)

(7.7.7)

where p(ωj |x) is the posterior probability of group j and f (x) is the probability  density function: f (x) = M j =1 f (x|ωj )p(ωj ). Suppose that an object with a particular feature vector x is observed. If the group membership of x is decided to be j , then the probability of classification error is given by [170] p(Error|x) =

 i=j

p(ωi |x) = 1 − p(ωj |x).

(7.7.8)

492

7 Neural Networks

The purpose of Bayesian classification rule is to minimize the probability of total classification error (misclassification rate): Decide ωk if p(ωk |x) = max{p(ω1 |x), . . . , p(ωM |x)}.

(7.7.9)

Let cij be the cost of misclassifying to group i when it actually belongs to group j . The expected cost associated with assigning to group i is Ci (x) =

M 

cij p(ωj |x),

i = 1, . . . , M.

(7.7.10)

j =1

Ci is also known as the conditional risk function. The optimal Bayesian decision rule that minimizes the overall expected cost can be represented as Decide ωk for x if Ck (x) = min{C1 (x), . . . , CM (x)}.

(7.7.11)

For binary classification with the two classes of ω1 and ω2 , we should assign to class +1 if c12 (x)p(ω2 )f (x|ω2 ) < c21 (x)p(ω1 )f (x|ω1 )

(7.7.12)

or c21 (x)p(ω2 ) f (x|ω1 ) > . f (x|ω2 ) c12 (x)p(ω1 )

(7.7.13)

Otherwise, we should assign to class −1. At this point, we discuss the relationship between Bayes classification and neural network classification. To this end, let C be a random variable which specifies the class membership, i.e., C = ck denotes membership in the kth class. The goal of classification is then to determine C given x ∈ X. To minimize the probability of classification error, the optimal decision rule corresponds to choosing the class ck which maximizes the posterior probability p(ck |x). Define y = [y1 , . . . , ym ]T ∈ RM , where M is the number of classes. Let ek = [0, . . . , 0, 1, 0, . . . , 0]T ∈ RM be the M × 1 basic vector whose kth element equals 1 and others equal zero. Hence, if y = ek then x belongs to class k. This implies that y and ck relate the same information if and only if y = ek . The advantage of this representation in [151] is: the kth component of the least squares estimate yˆ = E{y|x} can be written as yˆk = E{yk |x} =

 yk ∈{0,1}

yk p(yk |x) = p(yk = 1|x) = p(ck |x),

7.7 Bayesian Neural Networks

493

namely E{yk |x} = p(ck |x).

(7.7.14)

This result provides a theoretical interpretation for neural network classifiers that are also trained by minimizing the mean squared errors. A neural network can be considered as a non-parametric technique for estimating posterior probability distributions.

7.7.3 Sparse Bayesian Learning Suppose we are given a sample of N “training” pairs {xn , tn }N n=1 , where t = [t1 , . . . , tN ]T is a target vector expressed as the sum of an approximation vector y = [y(x1 ) , . . . , y(xN )]T and an “error” vector  = [1 , . . . , N ]T : t = y +  = w + ,

(7.7.15)

where w is the parameter vector and  = [φ 1 , . . . , φ M ] is the N × M “design” matrix whose columns comprise the complete set of M “basis vectors.” In the sparse Bayesian framework, the errors are assumed to be modeled probabilistically as independent zero-mean Gaussians, with variance σ 2 , i.e., p() =

N #

  N n |0, σ 2 .

(7.7.16)

n=1

The above error model implies a multivariate Gaussian likelihood for the target vector t: ,    (2π )−N/2 t − y2 2 . (7.7.17) p t|w, σ = exp − σN 2σ 2 By [146], given hyperparameters α = [α1 , . . . , αM ]T , the posterior parameter distribution conditioned on the data is given by combining the likelihood and prior within Bayes’ rule:       p w|t, α, σ 2 = p t|w, σ 2 p(w|α)/p t|α, σ 2 ,

(7.7.18)

494

7 Neural Networks

and is Gaussian N (μ, ) with  −1 −2 T  = A+σ −  ,

(7.7.19)

μ = σ −2 T ˆt,

(7.7.20)

where A = Diag(α1 , . . . , αM ) is an M × M diagonal matrix. To find the solution vector α = α MP , Tipping et al. [146] proposed to use sparse Bayesian learning for formulating the (local) maximization with respect to α of its logarithm   L(α) = log p t|α, σ 2  ∞   p t|w, σ 2 p(w|α)dw = log −∞

" 1! = − N log 2π + log |C| + tT C−1 t , 2

(7.7.21)

with C = σ 2 I + A−1 T .

(7.7.22)

Considering the dependence of L(α) on a single hyperparameter αi , i ∈ {1, . . . , M}, then C in (7.7.22) can be decomposed as C = σ 2I +



−1 αm φ m φ Tm + αi−1 φ i φ Ti ,

m=i

= C−i + αi−1 φ i φ Ti ,

(7.7.23)

where C−i is C with the contribution of basis vector i removed. Then, the matrix determinant |C| and the inverse C−1 in the loss L(α) can be obtained as [146]:     (7.7.24) φ |C| = |C−i | · 1 + αi−1 φ Ti C−1 −i i  and C−1 = C−1 −i −

T −1 C−1 −i φ i φ i C−i

αi + φ Ti C−1 −i φ i

.

(7.7.25)

7.7 Bayesian Neural Networks

495

Hence, we have L(α) = −

 1 N log(2π ) + log |C−i | + tT C−1 −i t 2

  − log αi + log αi + φ Ti C−1 −i φ i −

2 (φ Ti C−1 −i t)



αi + φ Ti C−1 −i φ i   qi2 1 , log αi − log(αi + si ) + = L(α −i ) + 2 αi + si = L(α −i ) + (αi ),

(7.7.26)

where si = φ Ti C−1 −i φ i

and

qi = φ Ti C−1 −i t.

(7.7.27)

• Sparsity factor si can be seen to be a measure of the extent that basis vector φ i “overlaps” those already present in the model. • Quality factor qi can be written as qi = σ −2 φ Ti (t − y−i ), and is thus a measure of the alignment of φ i with the error of the model with that vector excluded. Analysis of sparse Bayesian learning [30] shows that L(α) has a unique maximum with respect to αi : αi =

⎧ 2 2 ⎨ si qi − si , if qi2 > si ; ⎩

(7.7.28) ∞,

if

qi2

≤ si .

This result implies the following: • If φ i is “in the model” (i.e., αi < ∞) yet qi2 ≤ si , then φ i may be deleted (i.e., set αi = ∞), • If φ i is excluded from the model (i.e., αi = ∞) and qi2 > si , φ i may be added (i.e., set αi = si2 qi2 − si ). Through the above analysis, Tipping et al. [146] proposed the following sequential sparse Bayesian learning algorithm: 1. If regression initialize σ 2 to some sensible value (e.g., σ 2 = var(t) × 0.1). 2. Initialize with a single basis vector φ i , setting αi =

φ i 2 φ Ti t2 /φ i 2 − σ 2

All other αm are notionally set to infinity.

.

(7.7.29)

496

7 Neural Networks

3. Explicitly compute  and μ (which are scalars initially), along with initial values of sm and qm for all M bases φ m . 4. Recompute/update , μ: −1   = T B + A ,   ˆt = I − T B −1 (t − y), μMP = T Bˆt,

5. 6. 7. 8. 9. 10. 11.

(7.7.30) (7.7.31) (7.7.32)

where A = Diag(α1 , . . . , αM ) and B = σ −2 I. Select a candidate basis vector φ i from the set of all M. Compute θi = qi2 − si . If θi > 0 and αi < ∞ (i.e., φ i is in the model), re-estimate αi . If θi > 0 and αi = ∞, add φ i to the model with updated αi . If θi ≤ 0 and αi < ∞, then delete φ i from the model and set αi = ∞. In regression and estimating the noise level, update σ 2 = t − y2 /(N − M +  m αm mm ), where y ≈ μMP and mm is the (m, m)th diagonal element. Compute Sm = φ Tm Bφ m − φ Tm BT Bφ m , Qm = φ Tm Bˆt − φ Tm BT Bˆt,

(7.7.33) (7.7.34)

where ˆt = t in the regression case, and tˆ = μMP + B−1 (t − y) in the classification case. Here quantities  and  contain only those basis functions that are currently included in the model, and computation thus scales in the cube of that measure which is typically only a very small fraction of the full M. 12. Compute all sm and qm : sm =

αm Sm , αm − Sm

qm =

αm Qm . αm − Sm

(7.7.35)

13. If converged terminate, otherwise go to Step 5. Once a maximum posterior estimate μMP is obtained by evaluating (7.7.20) with α = α MP , then a final (posterior mean) approximator is given by y = w ≈ μMP for sparse Bayesian regression.

7.8 Convolutional Neural Networks A few classic neural networks were introduced in the previous sections. Starting from this section, we will focus on selected topics and advances in neural networks. Here, we first deal with convolutional neural networks.

7.8 Convolutional Neural Networks

497

Convolutional Neural Networks (CNNs) have been extremely successful in images, speech, audio, and video recognition tasks, where the coordinates of the underlying data representation have a grid structure (in 1, 2, and 3 dimensions), thanks to their ability to exploit translational equivariance/invariance with respect to this grid structure [13]. Gabor filter banks are a powerful method for extracting features. In this method, a bank of N Gabor filters are created. Then, each filter is convolved with the input image to produces N different images. Next, pixels of each image are pooled to extract information from each image. There are mainly two steps in the Gabor filter method including convolution and pooling. The Neocognitron, proposed by Fukushima [32], is generally seen as the model that inspires CNNs on the computation side. The first convolutional neural network was LeNet, invented by Le Cun et al. [88] for handwritten digit recognition and further made popular with LeCun et al. [89]. As compared with traditional fully connected neural networks, the main benefit of using CNNs is the reduced amount of parameters to be learned. A CNN has the following layers [31]: 1. Input layer: It feeds data to the network. Inputs can be either raw data (e.g., image pixels) or their transformations, whichever better emphasize some specific aspects of the data. 2. Convolutional layers: They contain a series of filters with fixed size used to perform convolutions on the data, for generating feature maps. 3. Pooling layers: These layers make the network focus only on the most important patterns, reducing the dimensionality of the feature maps used by the following layers. Pooling layers are known as downsampling layers as well. 4. Rectified Linear Unit (ReLU): ReLU layers are responsible for applying a nonlinear function to the output x of the previous layer, such as f (x) = max(0, x). By [84], they can be used for fast convergence in the training of CNNs, speeding-up the training. 5. Fully connected layers: They are used for the understanding of patterns generated by the previous layers. Neurons in this layer have full connections to all activations in the previous layer. They are also called the inner product layers. After trained, transfer learning approaches can extract features in these layers to train another classifier. 6. Loss layers: These layers specify how the network training penalizes the deviation between the predicted and true labels. Various loss functions appropriate for different tasks can be used: Softmax, Sigmoid, Cross-entropy, Euclidean loss, among others. Convolutional layers are the most essential components of the CNN, and we will discuss them first.

498

7 Neural Networks

7.8.1 Hankel Matrix and Convolution Given an input signal x = [x(1), . . . , x(n)]T ∈ Rn , a convolution of filter (or kernel) f = [f (1), . . . , f (d)]T ∈ Rd with input x has the following three forms: yi = (x ∗ f)i =

n 

x(j )f (i − j + 1),

i = 1, . . . , 2d − 1,

(7.8.1)

x(i − j + 1)f (j ),

i = 1, . . . , 2d − 1,

(7.8.2)

i = 1, . . . , n − d + 1.

(7.8.3)

j =1

or yi = (x ∗ f)i =

d  j =1

or yi = (x ∗ f)i =

d 

x(i + j − 1)f (j ),

j =1

As it provides the most outputs when n 6 d, Eq. (7.8.3) is commonly used as the convolution operation in CNNs. The Matrix-vector form of (7.8.3) is given by ⎡

⎤⎡ ⎤ · · · x(d) f (1) ⎢ ⎢ ⎥ · · · x(d + 1)⎥ ⎢ ⎥ ⎢ f (2) ⎥ y = (x ∗ f) = ⎢ ⎥⎢ . ⎥ . . .. .. ⎣ ⎦ ⎣ .. ⎦ x(n − d + 1) x(n − d + 2) · · · x(n) f (d) x(1) x(2) .. .

x(2) x(3) .. .

(7.8.4)

whose compact form is y = (x ∗ f) = H(x)f,

(7.8.5)

where ⎡

⎤ · · · x(d) ⎢ · · · x(d + 1)⎥ ⎢ ⎥ H(x) = ⎢ ⎥ ∈ R(n−d+1)×d .. .. ⎣ ⎦ . . x(n − d + 1) x(n − d + 2) · · · x(n) x(1) x(2) .. .

x(2) x(3) .. .

(7.8.6)

is called the Hankel structured matrix generated from an n-dimensional vector x = [x(1), . . . , x(n)]T ∈ Rn . Here, d is called a matrix pencil parameter. The space of these types of Hankel structured matrices is denoted as H(n, d). A Hankel structured matrix sometimes is simply called the Hankel matrix.

7.8 Convolutional Neural Networks

499

An n × d wrap-around Hankel matrix generated from an n-dimensional vector x = [x(1), . . . , x(n)]T ∈ Rn is defined as [165]: ⎡

⎤ · · · x(d) ⎢ · · · x(d + 1)⎥ ⎢ ⎥ ⎢ ⎥ .. .. ⎢ ⎥ . . ⎢ ⎥ ⎢ ⎥ ⎢x(n − d + 1) x(n − d + 2) · · · x(n) ⎥ Hd (x) = ⎢ ⎥ ∈ Rn×d . ⎢ ⎥ ⎢ ⎥ ⎢x(n − d + 2) x(n − d + 3) · · · x(1) ⎥ ⎢ ⎥ ⎢ ⎥ .. .. .. .. ⎣ ⎦ . . . . x(n) x(1) · · · x(d − 1) x(1) x(2) .. .

x(2) x(3) .. .

(7.8.7)

Clearly, an n×d wrap-around Hankel matrix generated by x ∈ Rn can be considered as a Hankel matrix of the following elongated vector % &T x¯ = xT , x(1), x(2), . . . , x(d − 1) ∈ Rn+d−1 .

(7.8.8)

Theorem 7.1 ([165]) Let r + 1 denote the minimum length of the annihilating filters that annihilate the signal x = [x(1), . . . , x(n)]T . Then, for a given Hankel structured matrix Hd (x) ∈ H(n, d) with d > r, its rank is given by rank(Hd (x)) = r.

(7.8.9)

This theorem implies the following two results: • If d is sufficiently large then the resulting Hankel matrix is low-rank. • The rank of the Hankel matrix can be explicitly calculated as shown in the above theorem. The convolutional operation can be extended to two-dimensional data. Given an input I (m, n) and a kernel K(a, b), their convolution operation is then given by [77] s(t) = I (a, b) ∗ K(a, b) =

 a

I (a, b) · K(m − a, n − b)

(7.8.10)

I (m − a, n − b) · K(a, b).

(7.8.11)

b

or equivalently by s(t) = I (a, b) ∗ K(a, b) =

 a

b

The 2-D convolutional operation can also be written as the cross-correlation form: s(t) = I (a, b) ∗ K(a, b) =

 a

b

I (m + a, n + b) · K(a, b).

(7.8.12)

500

7 Neural Networks

Suppose we are given a 2-D image X = [x1 , . . . , xp ] ∈ Rn×p and a 2-D filter  = [φ 1 , . . . , φ q ] ∈ Rd×q , where xj = [xj (1), . . . , xj (n)]T ∈ Rn , j = 1, . . . , p and φ i = [φi (1), . . . , φi (d)]T ∈ Rd , i = 1, . . . , q. Then, the 2-D convolution of the filter  with the image X is given by (X ∗ )m,k =

q d  

xm+i−1,k+j −1 φi,j ,

i=1 j =1

m = 1, 2, . . . , n − d + 1; k = 1, 2, . . . , p − q + 1,

(7.8.13)

or written as ⎤ H(x1 )φ 1 · · · H(x1 )φ q ⎥ ⎢ .. .. .. Y = (X ∗ ) = ⎣ ⎦ = Hd,q (X), . . . H(xp )φ 1 · · · H(xp )φ q ⎡

(7.8.14)

where ⎤ ··· yp−q+1 (1) ⎥ ⎢ .. .. Y = [y1 , . . . , yp ] = ⎣ ⎦, . . y1 (n − d + 1) · · · yp−q+1 (n − d + 1) ⎤ ⎡ H(x1 ) ⎥ ⎢ Hd,q (X) = ⎣ ... ⎦ , ⎡

y1 (1) .. .

(7.8.15)

(7.8.16)

H(xp ) ⎡

xj (1) xj (2) .. .

⎢ ⎢ H(xj ) = ⎢ ⎣

xj (2) xj (3) .. .

⎤ · · · xj (d) · · · xj (d + 1)⎥ ⎥ ⎥, .. .. ⎦ . .

xj (n − d + 1) xj (n − d + 2) · · ·

(7.8.17)

xj (n)

for j = 1, . . . , p. Here xj (i) is the ith element of the j th input channel xj . Since 2-D convolution (7.8.13) is calculated for every m and k of X, we say that the stride of convolution is equal to one. In some cases, it might be interesting to compute the convolution with a larger stride. For example, suppose we want to compute the convolution of alternating pixels. In this case, the stride of convolution is taken as two, leading to the equation [2, p.95]: (X ∗ )m,k =

q d  

xm+i−1,k+j −1 φi,j ,

i=1 j =1

m = 1, 3, 5, . . . , n − d + 1; k = 1, 3, 5, . . . , p − q + 1.

(7.8.18)

7.8 Convolutional Neural Networks

501

Given a vector v = [v(1), . . . , v(n)]T ∈ Rn , the vector with reversed indices, v = [v(n), . . . , v(1)]T ∈ Rn , is referred to as the flipped vector of v. Similarly, for a matrix ⎡

⎤ φ1 (1) φ2 (1) · · · φq (1) ⎢ φ1 (2) φ2 (2) · · · φq (2) ⎥ ⎢ ⎥ d×q  = [φ 1 , . . . , φ q ] = ⎢ . , .. ⎥ ∈ R .. . ⎣ . . ··· . ⎦ φ1 (d) φ2 (d) · · · φq (d)

(7.8.19)

its flipped matrix is given by ⎡

⎤ φ1 (d) φ2 (d) · · · φq (d) ⎥ 3 ⎢ 2 ⎢φ1 (d − 1) φ2 (d − 1) · · · φq (d − 1)⎥  = φ1, . . . , φq = ⎢ ⎥ ∈ Rd×q . .. .. .. ⎣ ⎦ . . ··· . φ1 (1)

φ2 (1)

···

φq (1) (7.8.20)

For a pd × q block-structured matrix 2 3T  = T1 , . . . , Tp ∈ Rpd×q ,

(7.8.21)

its flipped block-structured matrix has the form: ⎡

⎤ 1 ⎢ 2 ⎥ ⎢ ⎥  = ⎢ . ⎥, ⎣ .. ⎦

(7.8.22)

p where the sub-flipped matrix is defined as ⎡

⎤ j j j φ1 (d) φ2 (d) · · · φq (d) ⎢ j ⎥ j j ⎢φ1 (d − 1) φ2 (d − 1) · · · φq (d − 1)⎥ 2 j j3 ⎥ ∈ Rd×q , j = φ 1 , . . . , φ q = ⎢ .. .. .. ⎢ ⎥ ⎣ ⎦ . . ··· . j j j φ2 (1) · · · φq (1) φ1 (1) (7.8.23) for j = 1, . . . , p. The following are four convolutions in different cases [166]: 1. Single-input single-output (SISO) convolution: For the single-input channel x = [x(1), . . . , x(n)]T ∈ Rn and the single-filter vector φ = [φ(1), . . . , φ(d)]T ∈ Rd , the SISO convolution of the input vector x with the filter vector φ can be

502

7 Neural Networks

represented as matrix form: y = x ∗ φ = Hd (x)φ,

(7.8.24)

where y = [y1 , . . . , yd ]T and Hd (x) is a wrap-around Hankel matrix given by ⎡

⎤ x(1) x(2) · · · x(d) ⎢x(2) x(3) · · · x(d + 1)⎥ ⎢ ⎥ Hd (x) = ⎢ . ⎥ ∈ Rn×d . .. . . .. ⎣ .. ⎦ . . . x(n) x(1) · · · x(d − 1)

(7.8.25)

2. Single-input multi-output (SIMO) convolution: The convolution of the singleinput vector x with q-filter vectors φ 1 , . . . , φ q can be represented by Y = x ∗  = Hd (x),

(7.8.26)

where 3 2 Y = y1 , . . . , yq ∈ Rn×q

and

3 2  = φ 1 , . . . , φ q ∈ Rd×q .

(7.8.27)

3. Multi-input multi-output (MIMO) convolution: The convolution of p-channel input vectors Z = [z1 , . . . , zp ] ∈ Rn×p with q-channel filters φ 1 , . . . , φ q can be represented as yi =

p 

j

zj ∗ φ i ,

i = 1, . . . , q,

(7.8.28)

j =1 j

where φ i ∈ Rd denotes the filter of length d with respect to the j th input channel and the ith output channel. If defining the MIMO filter kernel as ⎡

⎤ 1 ⎢ ⎥  = ⎣ ... ⎦ ,

where

2 j j3 j = φ 1 , . . . , φ q ∈ Rd×q ,

(7.8.29)

p then the corresponding matrix representation of the MIMO convolution is given by Y=

p  j =1

Hd (zj )j = Hd|p (Z),

(7.8.30)

7.8 Convolutional Neural Networks

503

where Hd|p (Z) = [Hd (z1 ), . . . , Hd (zp )]

(7.8.31)

is an extended Hankel matrix by stacking p Hankel matrices side by side. 4. Multi-input single-output (MISO) convolution: If q = 1, then the MIMO convolution reduces to the MISO convolution y = Hd|p (Z)φ,

(7.8.32)

where φ = [φ1 , . . . , φp ]T . The above SISO, SIMO, MIMO, and MISO convolutional operations are easily extended to the multichannel 2-D convolution operation for an image domain convolutional neural network (CNN). For this end, the (extended) Hankel matrix is needed to be defined as a block Hankel matrix. For a 2-D input X = [x1 , . . . , xn2 ] ∈ Rn1 ×n2 with xi ∈ Rn1 , i = 1, . . . , n2 , the block Hankel matrix associated with a d1 × d2 filter is defined as [166]: ⎡

⎤ Hd1 (x1 ) Hd1 (x2 ) · · · Hd1 (xd2 ) ⎢ Hd (x2 ) Hd (x3 ) · · · Hd (xd +1 )⎥ 1 1 2 ⎢ 1 ⎥ Hd1 ,d2 (X) = ⎢ ⎥ ∈ Rn1 n2 ×d1 d2 . .. .. .. .. ⎣ ⎦ . . . . Hd1 (xn2 ) Hd1 (x1 ) · · · Hd1 (xd2 −1 )

(7.8.33)

Then, the output Y = Hd1 ,d2 (X)K ∈ Rn1 ×n2 from the 2-D SISO convolution for a given image X ∈ Rn1 ×n2 with the 2-D filter K ∈ Rd1 ×d2 can be represented in a matrix-vector form as follows: vec(Y) = Hd1 ,d2 (X)vec(K),

(7.8.34)

where vec(Y) denotes the vectorization of the matrix Y by stacking its column vectors one by one. (j ) (j ) Similarly, for a p-channel n1 × n2 input image X(j ) = [x1 , . . . , xn2 ], j = 1, . . . , p, its extended block Hankel matrix is defined as Hd1 ,d2 |p

!2 3" %    & X(1) , . . . , X(p) = Hd1 ,d2 X(1) , . . . , Hd1 ,d2 X(p) ∈ Rn1 n2 ×d1 d2 p . (7.8.35)

Then 2-D MIMO convolution for given p-input images X(j ) ∈ Rn1 ×n2 , j = (j ) 1, . . . , p with 2-D filter K(i) ∈ Rd1 ×d2 associated with j th input channel and the ith output channel can be represented in a matrix-vector form as " !      (j ) vec Y(i) = Hd1 ,d2 X(j ) vec K(i) , p

j =1

i = 1, . . . , q.

(7.8.36)

504

7 Neural Networks

If defining       Y = vec Y(1) , . . . , vec Y(p) ,   (1) ⎤  vec K(1) (1) · · · vec K(q) ⎥ ⎢ .. .. .. ⎥, K=⎢ . . . ⎦ ⎣  (p)   (p)  vec K(1) · · · vec K(q)

(7.8.37)



then 2-D MIMO convolution can be rewritten as !2 3" Y = Hd1 ,d2 |p X(1) , . . . , X(p) K.

(7.8.38)

(7.8.39)

7.8.2 Pooling Layer We now deal with other layers of a CNN after discussing convolutional layers. The pooling layer is used to conduct a form of downsampling, i.e., to transform the raw data so that some specific aspects of the data are emphasized. In computer vision for example, sometimes the image needs to be converted to grayscale, removing color information from the analysis. Given an input RGB image I , with one pixel represented by I (i, j ) = (R(i, j ), G(i, j ), B(i, j )), its conversion to graylevels happens using the following equation [31]: C(i, j ) = 0.2989R(i, j ) + 0.5870G(i, j ) + 0.1140B(i, j ),

(7.8.40)

where C is the image converted to grayscale and used as input to the network and R, G, and B are the color channels from the initial image. Assume an 180×180 image which is connected to a convolution layer containing 50 filters of size 7 × 7. The output of the convolution layer will be an 174 × 174 × 50 = 1,563,300 dimensional vector. Clearly, this number of dimensions is very high. To this end, a pooling layer is needed to reduce the dimensionality of feature maps, i.e., downsample the feature vector. The pooling stride is also called downsampling factor. Let s denote the pooling stride. For example, for the 12-dimensional vector x = [1, 10, 8, 2, 3, 6, 7, 0, 5, 4, 9, 2], downsampling x with stride s = 2 means we have to pick every alternate pixel starting from the element at index 0 which will generate the vector [1, 8, 3, 7, 5, 9]. By doing this, the dimensionality of x is divided by s = 2 and it becomes a six-dimensional vector. Pooling is one of the most important operations in CNN. A pooling operator operates on an individual feature channel, aggregating data of a local region (e.g., a rectangle) and transforming them into one single value. Common choices include max pooling (using the maximum operator) and average pooling (using the average operator), both of which are hand-crafted. Traditional convolution with sliding

7.8 Convolutional Neural Networks

505

strides larger than one pixel can also be regarded as a pooling operation. However, this kind of pooling operation does not deal with each input feature channel independently. Instead, it uses all the input feature channels to generate each output feature channel. Therefore, the convolutional pooling operation consumes many extra parameters. On the pooling operation, the following points are to be noted [77]: 1. Pooling operates over a portion of the input and applies a function f over this input to produce the output. 2. The function f is commonly the max operation (leading to max pooling), but other variants such as average or 2 norm can be used as an alternative. 3. For a two-dimensional input, this is a rectangular portion. 4. The output produced as a result of pooling is much smaller in dimensionality as compared to the input. Let wlk and bkl be the weight vector and bias term of the kth filter of the lth layer, respectively, and xli,j be the input patch centered at location (i, j ) of the lth layer. l , is Then, the feature value at location (i, j ) in the kth feature map of lth layer, zi,j,k calculated by  T l zi,j,k = wlk xli,j + bkl .

(7.8.41)

l l The activation value ai,j,k of convolutional feature zi,j,k can be computed as

 l  l , = a zi,j,k ai,j,k

(7.8.42)

where a(·) denotes an activation function. The pooling layer aims to achieve shift-invariance by reducing the resolution of the feature maps. It is usually placed between two convolutional layers. Each feature map of a pooling layer is connected to its corresponding feature map of the preceding convolutional layer. Denoting the pooling function as pool(·) for each l feature map ai,j,k , then   l l , yi,j,k = pool am,n,k

∀ m, n ∈ Rij ,

(7.8.43)

where Rij is a local neighborhood around location (i, j ). The following are some pooling methods used in CNNs [48]. 1. p pooling: This pooling is a biologically inspired pooling process modeled on complex cells [72]. The summary statistic in p pooling is the p norm of the inputs into the pool. That is, if nodes (i, j ) are in a pool k, then the output of the pool is ⎡ yi,j,k = ⎣



(m,n)∈Rij

⎤1/p (am,n,k )p ⎦

,

(7.8.44)

506

7 Neural Networks

where am,n,k is the feature value at location (m, n) within the pooling region Rij in the kth feature map. The p pooling contains the following two conventional choices: • When p = 1, the 1 pooling gives the sum pooling: yi,j,k =



(7.8.45)

am,n,k .

(m,n)∈Rij

In sum pooling, all elements in a pooling region are considered. When combined with linear rectification nonlinearities, strong activations may be down-weighted since many zero elements are included in the sum. Even worse, with tanh(·) nonlinearities, strong positive and negative activations can cancel each other out, leading to small pooled responses. • When p → ∞, p pooling corresponds to the max pooling yi,j,k = max |am,n,k | · sign(argmax|am,n,k |),

for all i, j.

(7.8.46)

A common form of downsampling is max pooling where the stride size of the pooling layer equals the pooling size. Max pooling serves to reduce the sizes of the feature vectors and the parameters of CNNs, which decreases training time and memory requirements, but this pooling easily overfits the training set in practice, making it hard to generalize well to test examples. 2. Mixed pooling: This pooling method is the combination of a max pooling and an average pooling, namely [167]: yi,j,k = λ

max am,n,k + (1 − λ)

(m,n)∈Rij

1 |Rij |



am,n,k ,

(7.8.47)

(m,n)∈Rij

where λ is a random value being either 0 or 1 which indicates the choice of either using average pooling or max pooling. During forward propagation process, λ is recorded and will be used for the backpropagation operation. 3. Stochastic pooling: It first computes the probabilities p for each region (i, j ) by normalizing the activations within the region as [168]: pj = 

aj

i∈Rj

ai

,

(7.8.48)

then samples from the multinomial distribution based on p to pick a location j within the region. The pooled activation is then simply yj = al

  where l ∼ P p1 , . . . , p|Rj | .

(7.8.49)

7.8 Convolutional Neural Networks

507

Max pooling only captures the strongest activation of the filter template with the input for each region. However, there may be additional activations in the same pooling region that should be taken into account when passing information up the network and stochastic pooling ensures that these non-maximal activations will also be utilized. Unlike the average pooling and the max pooling representing only one-modal distribution of activations within a region, the stochastic pooling, via selecting different l, can represent multi-modal distributions of activations within a region. Compared with max pooling, stochastic pooling can avoid overfitting due to the stochastic component. 4. Spectral pooling: The idea behind spectral pooling, presented by Rippel et al. in 2015 [120], stems from the observation that the frequency domain provides an ideal basis for inputs with spatial structure. Suppose we are given an input x ∈ RM×N , and some desired output map dimensionality H × W . First, compute the discrete Fourier transform (DFT) of the input into the frequency domain as y = F(x) ∈ CM×N , and assume that the direct current component has been shifted to the center of the domain as is standard practice. Then, truncate y to yˆ ∈ CH ×W . Finally, return its inverse DFT as xˆ = F −1 (ˆy) ∈ RH ×W . The main advantages of spectral pooling are [138]: fast convolution computations, high parameter information retention (compression), flexibility in pooling output dimensions, and further computation savings by implementing spectral parameterization. 5. Spatial pyramid pooling (SPP): The last pooling layer (e.g., after the last convolutional layer) is replaced with a spatial pyramid pooling layer [56]. In each spatial bin with the number M of bins, the SPP using some pooling method (e.g., max pooling) generates a fixed-length representation regardless of the input sizes, resulting in the kM-dimensional output vector of the spatial pyramid pooling (k is the number of filters in the last convolutional layer). Then, fixed-dimensional vectors are the input to the fully connected layer. Figure 7.9 shows a toy example using four different pooling techniques.

7.8.3 Activation Functions in CNNs The following are conveniently used activation functions in CNNs [48]. 1. Rectified Linear Unit (ReLU): The standard way to model a neuron’s output f as a function of its input x is to use f (x) = tanh(x) or f (x) = (1 + e−x )−1 . However, these saturating nonlinear activation functions are much slower than the non-saturating nonlinear activation function f (x) = max(0, x) in gradient descent algorithm. Following Nair and Hinton [106], the neurons with the nonlinearity f (x) = max(0, x) are known as rectified linear units (ReLUs). The ReLU is one of the most notable non-saturated activation functions and is defined as   ai,j,k = max zi,j,k , 0 ,

(7.8.50)

508

7 Neural Networks

0.56

Average Pooling

2.8

Max Pooling

activations 2.8

0

1.6

0

0

0

0

0.6

0

1.46

Mixed Pooling (λ = 0.4)

0.56

0

0.32

0

0

0

0

0.12

0

Stochastic Pooling l=3

1.6

probability map Fig. 7.9 A toy example using four different pooling techniques. Left: resulting activations within a given pooling region. Right: pooling results given by four different pooling techniques. If λ = 0.4 is taken then the mixed pooling result is 1.46. The mixed pooling and the stochastic pooling can represent multi-modal distributions of activations within a region

where zi,j,k is the input of the activation function at location (i, j ) on the kth channel. A potential disadvantage of the ReLU unit is: it has zero gradient whenever the unit is not active, which may cause units that are not active initially to never become active since the gradient-based optimization will not adjust their weights. Moreover, it may slow down the training process due to the constant zero gradients. 2. Noisy Rectified Linear Unit (NReLU) [106] ai,j,k

,    2 , = max zi,j,k + N 0, σ

(7.8.51)

where N (0, σ 2 ) is a Gaussian noise with zero mean and variance σ 2 . It was shown [106] that NReLUs work better than binary hidden units for recognizing objects and comparing faces, and can deal with large intensity variations much more naturally than binary units. 3. Leaky Rectified Linear Unit (Leaky ReLU) [102]     ai,j,k = max zi,j,k , 0 + λ min zi,j,k , 0 ,

(7.8.52)

7.8 Convolutional Neural Networks

509

where λ is a predefined parameter in range (0, 1). As compared with ReLU, Leaky ReLU compresses the negative part to a small, nonzero gradient when the unit is not active. 4. Parametric Rectified Linear Unit (PReLU) [55]     ai,j,k = max zi,j,k , 0 + λ min zi,j,k , 0 .

(7.8.53)

5. Randomized Rectified Linear Unit (RReLU) [164] / 0 / 0 (n) (n) (n) (n) ai,j,k = max zi,j,k , 0 + λk min zi,j,k , 0 ,

(7.8.54)

(n) where zi,j,k denotes the input of activation function at location (i, j ) on the kth (n)

channel of the nth example, λk denotes its corresponding sampled parameter, (n) and ai,j,k denotes its corresponding output. 6. Exponential Linear Unit (ELU) [16] ai,j,k

 ,    z  i,j,k = max zi,j,k , 0 + min λ e − 1 ,0 ,

(7.8.55)

where λ is a predefined parameter for controlling the value to which an ELU saturates for negative inputs. 7. Maxout [37] ai,j,k = max zi,j,k , k∈[1,K]

(7.8.56)

where zi,j,k is the kth channel of the feature map. Maxout enjoys all the benefits of ReLU since ReLU is actually a special case of maxout, e.g., max{wT1 x + b1 , wT2 + b2 }, where w1 is a zero vector and b1 is zero. Besides, maxout is particularly well suited for training with Dropout. 8. Probout [141] is a probabilistic variant of maxout, and first defines a probability for each of k linear units as: pˆ 0 = 0.5,

(7.8.57)

eλzi ". pˆ i = !  2 kj =1 eλzj

(7.8.58)

The activation function is then sampled as

ai =

⎧ ⎨ 0, ⎩

if i = 0; (7.8.59)

zi , otherwise,

510

7 Neural Networks

where i ∼ multinomial{pˆ 0 , . . . , pˆ k }. Probout can achieve the balance between preserving the desirable properties of maxout units and improving their invariance properties, but in testing process, probout is computationally expensive than maxout due to the additional probability calculations.

7.8.4 Loss Function A key of CNNs is to learn the low-dimensional mapping: given a set of input vectors I = {x1 , . . . , xN }, where xi ∈ RD , ∀ i = 1, . . . , N, find a parametric function w : RD → Rd with d / D, such that it has the following properties [49]: • It only needs neighborhood relationships between training samples. These relationships could come from prior knowledge, or manual labeling, and be independent of any distance metric. • It may learn functions that are invariant to complicated nonlinear transformations of the inputs such as lighting changes and geometric distortions. • The learned function can be used to map new samples not seen during training, with no prior knowledge. In order to find the parametric function w having the above three properties, the choice of a suitable loss function is an important problem. The commonly used loss functions in neural network designs are introduced below [48]. 1. Hinge loss: It is usually used to train large-margin classifiers. Let w be the weight vector of a classifier and y(i) ∈ {1, . . . , K} indicate its correct class label among the K classes. Hinge loss of the classifier w is defined as Lhinge

 p N K   (i)  T 1  max 0, 1 − δ y , j w xi = , N

(7.8.60)

i=1 j =1

where ⎧ if y (i) = j ;  (i)  ⎨ 1, δ y ,j = ⎩ −1, otherwise.

(7.8.61)

Two special examples of hinge loss are as follows: • If p = 1, then the hinge loss is called the 1 -loss: L1 =

N K " !   1  max 0, 1 − δ y (i) , j wT xi . N i=1 j =1

(7.8.62)

7.8 Convolutional Neural Networks

511

• If p = 2, then the hinge loss is known as the squared hinge loss or 2 -loss [171]: L2 =

 2 N K    1  max 0, 1 − δ y (i) , j wT xi . N

(7.8.63)

i=1 j =1

2. Softmax loss: Due to its simplicity and probabilistic interpretation, the softmax function is widely adopted in many CNNs (see, e.g., [55, 84]). Given a training set (x(i) , y (i) ) for i ∈ {1, . . . , N }, y (i) ∈ {1, . . . , K}, where x(i) is the ith input image patch, and y (i) is its target class label among the K classes. The prediction of the j th class for ith input is transformed with the softmax function: (i)

(i) pj

ezj

= K

(i)

zl l=1 e

,

(7.8.64)

where zj(i) is usually the activations of a densely connected layer, so zj(i) can be (i)

written as zj = wTj a (i) + bj . Softmax turns the predictions into nonnegative values and normalizes them to get a probability distribution over classes. These probabilistic predictions are used to compute the multinomial logistic loss, i.e., the softmax loss is defined by Lsoftmax = −

N K  1    (i) 1 y = j log p(i) , N

(7.8.65)

i=1 j =1

where 1{y (i) = j } denotes the indicator function returning 1 if y (i) = j , and 0 otherwise. Despite its popularity, current softmax loss does not explicitly enjoy both within-class compactness and between-class-separability. In order to reinforce CNNs with more discriminative information, the learned features are expected to maximize simultaneously their within-class compactness and between-class separability. Inspired by such an idea, some improvements to softmax loss were developed to enforce extra within-class compactness and between-class separability: the contrastive loss [49], the triplet loss [131], and large-margin softmax loss [99]. 3. Contrastive loss: Consider a weakly supervised scheme for learning a similarity measure from pairs of data instances labeled as matching or non-matching. (i) (i) Suppose we are given the ith pair of input vectors (x1 , x2 ) and a binary label y assigned to this pair: y = 0 if x1 and x2 are deemed similar, and y = 1 if they (i,l) (i,l) are deemed dissimilar. Let (z1 , z2 ) denotes its corresponding output pair of the lth (l ∈ [1, . . . , L]) layer. The contrastive loss function is defined by [49] Lcontrastive

N 2    1  = (1 − y) · d (i,L) + y · max m − d (i,L) , 0 , 2N i=1

(7.8.66)

512

7 Neural Networks (i,L)

(i,L)

where d (i,L) = z1 − z2 22 , and m is a margin parameter affecting nonmatching pairs. However, Lin et al. [96] found from experiment results that the contrastive loss function would cause a dramatic drop in retrieval results when fine-turning the network on all pairs. As a solution, Lin et al. [96] propose a double-contrastive loss with an additional parameter affecting matching pairs: Ld−contrastive =

N   1  y · max d (i,L) − m1 , 0 2N i=1   + (1 − y) · max m2 − d (i,L) , 0 . (i)

(i)

(i)

(7.8.67) (i)

4. Triplet loss: Consider the triplet units (xa , xp , xn ), where xa is an anchor (i) (i) (i) instance, xp is a positive instance from the same class of xa , and xn is a (i) (i) (i) negative instance. Let (za , zp , zn ) denote the feature representation of the triplet units, the triplet loss is defined as [131]: Ltriplet =

N / 0 1  (i) (i) max d(a,p) − d(a,n) + m, 0 , N

(7.8.68)

i=1

(i)

(i)

(i)

(i)

(i)

(i)

where d(a,p) = za − zp 22 and d(a,n) = za − zn 22 . The objective of triplet loss is to minimize the distance between the anchor and positive, and maximize the distance between the negative and the anchor. The coupled clusters loss function is defined as [98]: Lcc =

Np /0 -2 - (∗) -2 1 1 z max -z(i) − c − − c + m, 0 , p 2 p 2 p n Np 2

(7.8.69)

i=1

(∗)

where Np is the number of samples per set, zn is the feature representation (∗) of xn which is the nearest negative sample to the estimated center point cp = Np (i) ( i=1 zp )/Np . 5. Large-margin softmax loss: Consider a binary classification and a sample x given from class 1. The original softmax is to force WT1 x > WT2 x (i.e., W1 x cos(θ1 ) > W2 x cos(θ2 )) in order to classify x correctly. However, we want to make the classification more rigorous in order to produce a decision margin. To this end, consider instead requiring W1 x cos(mθ1 ) > W2 x cos(θ2 ) (0 ≤ θ1 ≤ π/m), where m is a positive integer. Because the inequality holds W1 x cos(θ1 ) ≥ W1 x cos(mθ1 ) > W2 x cos(θ2 ),

7.8 Convolutional Neural Networks

513

the new classification criteria is a stronger requirement to correctly classify x, producing a more rigorous decision boundary for class 1. By introducing an angular margin to the angle θj between input feature vector a (i) and the j th column wj of weight matrix W, Liu et al. [99] proposed the large-margin softmax (i) (L-Softmax) loss function. The prediction pj for L-Softmax loss is defined as ewj a ψ(θj ) ,  (i) (i) ewj a ψ(θj ) + l=j ewl a  cos(θl ) (i)

(i)

pj =

(7.8.70)

where   ψ(θj ) = (−1)k cos mθj − 2k ,

θj ∈ [kπ/m, (k + 1)π/m].

(7.8.71)

Here k ∈ [0, m − 1] is an integer, m controls the margin among classes. As compared with the Softmax loss, the L-Softmax has the following performance [99]: • In the training stage, the original softmax loss requires θ1 < θ2 to classify the sample x as class 1, while the L-Softmax loss requires mθ1 < θ2 to make the same decision. Hence, with bigger m (under the same training loss), the ideal margin between classes becomes larger and the learning difficulty is also increased. With m = 1, the L-Softmax loss becomes identical to the original softmax loss. • The L-Softmax loss defines a relatively difficult learning objective with adjustable margin (difficulty). A difficult learning objective can effectively avoid overfitting and take full advantage of the strong learning ability from deep and wide architectures. • The L-Softmax loss can be easily used as a replacement for standard loss, as well as used in tandem with other performance-boosting approaches and modules, including learning activation functions, data augmentation, pooling functions, or other modified network architectures. 6. Kullback–Leibler (KL) divergence: Given a discrete variable x and its two probability distributions p(x) and q(x). The KL divergence from q(x) to p(x) is defined as KL(qp) = −H (p(x)) − Ep {log q(x)}   p(x) log p(x) − p(x) log q(x) = x

=

 x

x

p(x) , p(x) log q(x)

(7.8.72)

514

7 Neural Networks

where H (p(x)) is the Shannon entropy of p(x), Ep (log q(x)) is the crossentropy between p(x) and q(x). The KL divergence does not have symmetry, i.e., KL(qp) = KL(pq). 7. Jensen-Shannon (JS) divergence: DJS (pq) =

 - p(x) + q(x)  1 KL p(x)2 2   1 - p(x) + q(x) + KL q(x)2 2

(7.8.73)

is a symmetrical form of KL divergence. It measures the similarity between p(x) and q(x). By minimizing the JS divergence, the two distributions p(x) and q(x) are made as close as possible.

7.9 Dropout Learning Combining the predictions of many different models is a very successful way to reduce test errors, but it appears to be too expensive for big neural networks. Dropout is a very efficient version of model combination through setting to zero the output of each hidden neuron with probability q. Moreover, overfitting is a serious problem in deep neural networks with a large number of parameter to be learned. This overfitting can be greatly reduced by using dropout. Dropout, introduced by Hinton et al. in 2012 [62], is a form of regularization for fully connected neural network layers. Each element of a layer’s output is kept with probability p, otherwise is randomly dropped with probability (1 − p) out some of the units while training. This dropout procedure not only has good performance but also its implementation is simple. The dropout procedure can be explained from two different perspectives. • Randomly omitting: Each hidden unit is randomly omitted from the network with a probability q = 1 − p (say q = 0.5) on each training case, the dropout can prevent complex co-adaptations on the training data. • Model averaging: It is computationally expensive to train all neurons or nodes in many separate layers and then to apply each of these networks to the test data in the standard way. However, the dropout randomly performs model averaging with neural networks, and makes it possible for all of these networks to share the same weights for the hidden units that are present. Dropout is effective for preventing the co-adaptation of feature detectors [5], and hence can prevent overfitting associated with their co-adaptation.

7.9 Dropout Learning

515

7.9.1 Dropout for Shallow and Deep Learning Dropout is an intriguing algorithm for shallow and deep learning, which seems to be more effective than full-connection neural networks, but comes with little formal understanding and raises several interesting questions [5]: 1. What kind of model averaging is dropout implementing, exactly or in approximation, when applied to multiple layers? 2. How crucial are its parameters? For instance, is q = 0.5 necessary and what happens when other values are used? What happens when other transfer functions are used? 3. What are the effects of different values of q for different layers? What happens if dropout is applied to connections rather than units? 4. What are precisely the regularization and averaging properties of dropout? 5. What are the convergence properties of dropout? The motivation for dropout may be better understood by thinking about common inference problems in machine learning or neural networks. Consider a scheme where an inference needs to be made by one big conspiracy of one hundred people M. With less time, a smaller conspiracy involving fifty people can make a first inference y(1) . If another conspiracy of a new group of fifty people make the second inference y(2) based on y(l) , and so on, then a good inference probably is made by a few smaller conspiracies faster than a big conspiracy of one hundred people. To model the dropout problem, we consider a neural network with L hidden layers. Let l ∈ {1, . . . , L} be the index of the lth hidden layer of the network, and z(l) denote the input vector of layer l, y(l) denote the output vector of layer l (where y(0) = x is the input). The feedforward operation of a standard full-connection neural network can be described as z(l+1) = W(l) y(l) + b(l) ,   y(l+1) = f z(l+1) ,

(7.9.1) (7.9.2)

for l ∈ {0, . . . , L−1} and any hidden unit i. Here W(l) and b(l) are, respectively, the weight matrix and bias vector at layer l, while f (·) is any elementwise activation function, such as f (z) = 1/(1 + exp(−z)) or f (z) = tanh(z). If a dropout is applied to the outputs of a fully connected layer, then the above two equations can be rewritten as (l+1)

z

 = M(l)

 W y(l) + b(l) ,

  y(l+1) = f z(l+1) ,

(7.9.3) (7.9.4)

516

7 Neural Networks

Fig. 7.10 An example of a thinned network produced by applying dropout to a full-connection neural network. Crossed units have been dropped. The dropout probabilities of the first, second, and third layers are 0.4, 0.6, and 0.4, respectively,

where M W denotes the elementwise product of two d ×n matrices M and W, and M(l) ∈ Rd×n is a binary matrix encoding the connection information with elements Mij(l) ∼ Bernoulli(p).

(7.9.5)

Here, Bernoulli(p) denotes a Bernoulli distribution of binary random variable v: P (v = 1) = p = 1 − P (v = 0) = 1 − q.

(7.9.6)

Figure 7.10 shows an example of dropout applying to a full-connection neural network. Although dropout is available for hidden layers with different probability q, i.e., the number of dropped out units in each hidden layer may be different, this number is fixed to be same for each hidden layer in practical applications. 2 (l) 2 (l) (l) 3T (l) 3T with y(0) = x ∈ Rn×1 , m(l) = m1 , . . . , mn be Let y(l) = y1 , . . . , yn binary Bernoulli vector with elements 1 and 0, d be the number of all units in each hidden layer, and wi ∈ Rn×1 be the ith row of the weight matrix W ∈ Rd×n . With dropout, the feedforward operations constitute the dropout neural network model [142]: (l)

mj ∼ Bernoulli(p),

(7.9.7)

y˜ (l) = m(l)

(7.9.8)

y(l) ,

z(l+1) = W(l+1) y˜ (l) + b(l+1) , y(l+1) = f (z(l+1) ),

(7.9.9) (7.9.10)

7.9 Dropout Learning

517

for l ∈ {0, 1, . . . , L − 1} and the hidden unit i. The vector m(l) is sampled and multiplied elementwise with the outputs of the lth layer, y(l) , to create the thinned outputs y˜ (l) . The thinned outputs y˜ (l) are then used as inputs to the next layer. Consider a standard CNN composed of alternating convolutional and pooling layers, with fully connected layers on top. On each presentation of a training example, if layer l is followed by a pooling layer, the forward propagation without dropout can be described as (l+1) aj

  (l) (l) (l) = pool a1 , . . . , an , i ∈ Rj ,

(7.9.11)

(l) where ai(l) with i ∈ R(l) j that is pooling region j at layer l and ai is the activity of  (l)  (l) each neuron within it, while n = Rj  is the number of units in Rj . With dropout, the forward propagation of max-pooling dropout at training time becomes [160]

aˆ (l) = m(l) (l+1)

aj

a(l) ,

  (l) (l) = pool aˆ 1 , . . . , aˆ n(l) , i ∈ Rj ,

(7.9.12) (7.9.13)

  where the activations aˆ 1(l) , . . . , aˆ n(l) in each pooling region j are reordered in nondecreasing order, i.e., 0 ≤ aˆ 1(l) ≤ . . . ≤ aˆ n(l) . If layer l is followed by a convolutional layer, the forward propagation with dropout is formulated as [160]: m(l) k (i) ∼ Bernoulli(p), (l) (l) aˆ k = ak (l)

(l+1) zj

=

n 

(l)

mk ,   (l+1) (l) conv Wj , aˆ k ,

(7.9.14) (7.9.15) (7.9.16)

k=1

(l+1) aj (l)

  (l+1) , = f zj

(7.9.17)

where ak denotes the activations of feature map k (k = 1, . . . , n(l) ) at layer l. (l) (l) The mask vector mk consists of independent Bernoulli variables mk (i). Then, this mask is sampled and multiplied with activations in kth feature map at layer l in order to produce dropout-modified activations aˆ (l) k . These modified activations are (l+1) convolved with filter Wk to produce convolved features z(l) j . The function f (·) is an elementwise function applied to the convolved features to get the activations (l+1) of convolutional layers, aj .

518

7 Neural Networks

7.9.2 Dropout Spherical K-Means A major goal in machine learning or neural networks is to learn deep hierarchies of features for other tasks. Many algorithms are available to learn deep hierarchies of features from unlabeled data, especially text documents and images. It has been found [17, 18, 21] that using K-means clustering as the unsupervised learning module in these types of “feature learning” pipelines can lead to excellent results, often rivaling the state-of-the-art systems. The classic K-means clustering algorithm finds cluster centroids that minimize the distance between data points and the nearest centroid. K-means is also called “vector quantization” (VQ) in the sense that K-means can be viewed as a way of constructing a “dictionary” D ∈ Rn×K of K column vectors d(j ) , j = 1, . . . , K so that a data vector x(i) ∈ Rn (i = 1, . . . , m) can be mapped to a code vector s(i) that minimizes the error in reconstruction. A popular modified version of K-means is known as “gain shape vector quantization” [169] or “spherical K-means” [21]. Given m data vectors x(i) ∈ Rn , i = 1, . . . , m. The VQ of the data vector x(i) can be described by the model x(i) = Cs(i) , i = 1, . . . , m,

(7.9.18)

where s(i) is a quantization coefficient vector or “code vector” associated with the input x(i) , D ∈ RK×n denotes the codebook matrix or “dictionary” whose j th column is denoted by d(j ) . The goal of spherical K-means is equivalently to find D and s(i) from the input x(i) according to [18]: *

. m  - (i) -2 (i) -Ds − x - , min 2 D,s

(7.9.19)

i=1

subject to s(i) 0 ≤ 1, ∀ i = 1, . . . , m and d(j ) 2 = 1, ∀ j = 1, . . . , K. (7.9.20) In the above equation, the code vector s(i) can be thought of as a “feature representation” of each example x(i) that satisfies several criteria: • Given s(i) and D, the original example x(i) should be able to be reconstructed well. • The first constraint s(i) 0 ≤ 1 means that each s(i) should have at most one nonzero entry (associated with vector quantization). Thus a new representation of x(i) should preserve it as well as possible, but also be a very simple or parsimonious representation. • The second constraint d(j ) 2 = 1 requires that each dictionary column have unit length, preventing them from becoming arbitrarily large or small.

7.9 Dropout Learning

519

The full K-means algorithm for learning feature representation can be summarized below [18]: 1. Normalize inputs: (i)

x

  x(i) − mean x(i) =   , ∀ i = 1, . . . , m. var x(i) + norm

(7.9.21)

2. Eigenvalue decomposition and estimate inputs:   cov x(i) = VDVT , i = 1, . . . , m,

(7.9.22)

x(i) = V(D + norm I)−1/2 VT x(i) , i = 1, . . . , m.

(7.9.23)

3. Loop until convergence (typically 10 iterations is enough):

(i)

sj =

⎧ (j )T (i) x , if j = arg max D(l) x(i) 1 , ⎪ ⎨d l ⎪ ⎩

0,

(7.9.24)

otherwise,

d(j ) = Xsj + d(j ) ,

(7.9.25)

d(j ) = d(j ) /d(j ) 2 ,

(7.9.26)

where j = 1, . . . , K, i = 1, . . . , m; while X = [x(1) , . . . , x(m) ] ∈ Rn×m and (1) (m) sj = [sj , . . . , sj ]T ∈ Rm×1 are the data (or example) matrix and the code vectors, respectively. When dropout is applied to the outputs of a dictionary, the spherical K-means is extended to the dropout spherical K-means [173]. The optimization problem of the dropout spherical K-means can be described as *

m  -(M min D,s

D)s

(i)

-

2 − x(i) -2

. ,

(7.9.27)

i=1

subject to s(i) 0 ≤ 1, ∀ i = 1, . . . , m and d(j ) 2 = 1, ∀ j = 1, . . . , K, (7.9.28) where M is a binary mask matrix with the same size as D, and each column is drawn independently from Mj ∼ Bernoulli(p) trials.

520

7 Neural Networks

Given a dictionary D and a dropout mask matrix M, then M D can be viewed as “thinned dictionary” Dthin = M D, where Dthin ∈ D. Hence, Eq. (7.9.24) becomes

(i)

sj =

⎧ (j )T (l) ⎪ ⎪ dthin x(i) , if j = arg max Dthin x(i) 1 , ⎨ l

⎪ ⎪ ⎩ 0,

(7.9.29)

otherwise,

for all j = 1, . . . , K; i = 1, . . . , m. After the code vector s(i) is obtained, new centroids can be computed according to / 0 Dnew = arg min Dthin S − X22 + Dthin − Dold 22 , Dthin

  −1  T = SST + I XS + Dthin , ∝ XST + Dthin .

(7.9.30)

The full dropout K-means algorithm [173] for learning feature representation can be summarized below. 1. Normalize inputs: use (7.9.21) to compute the normalized input vectors x(i) . 2. Eigenvalue decomposition and estimate inputs: use (7.9.22) to compute the EVD of the variance matrix cov(x(i) ), then estimate the inputs x(i) via (7.9.23). 3. Loop until convergence: calculate the code vectors s(i) by using (7.9.29), and update Dnew = XST + Dthin . Finally, make the normalization d(j ) = d(j ) /d(j ) 2 .

7.9.3 DropConnect As pointed out in [5], an important and interesting question is: what happens if dropout is applied to connections rather than units? Dropout applied to connections is called DropConnect which was developed by Wan et al. [152]. More specifically, DropConnect is the generalization of Dropout in which each connection, rather than each output unit, can be dropped out with probability 1 − p [152]. The following is a comparison between Dropout and DropConnect from the perspective of training networks: • Training network with Dropout: Each element of a layer’s output r is kept with probability p, otherwise set to 0 with probability 1 − p: r=m

 a(Wv) = a m

 (Wv) ,

(7.9.31)

7.9 Dropout Learning

521

Fig. 7.11 An example of the structure of DropConnect. The dotted lines show dropped connections

v1

× r1

×

v2 × r2 v3

v4

× × ×

r3

where v is the fully connected layer inputs (i.e., output feature vector of the input data vector x), W is the fully connected layer weights, denotes elementwise multiplication, and m is a binary mask vector of size d with each element j drawn independently from mj ∼ Bernoulli(p). The activation function a(·) may be tanh and reLU functions with a(0) = 0. • Training network with DropConnect: Generalization of Dropout in which each connection, rather than each output unit, can be dropped with probability 1 − p:  r = a (M

 W)v ,

(7.9.32)

where M is weight mask matrix with Mij ∼ Bernoulli(p). In DropConnect, dropout is applied at the inputs to the activation function. Figure 7.11 shows an example of DropConnect for a full-connection neural network. From Figs. 7.10 and 7.11 we can see further the similarity and difference between Dropout and DropConnect. • Both Dropout and DropConnect are suitable for fully connected layers only, and can introduce dynamic sparsity within the model. • Dropout has sparsity on output units by dropping activations with probability 1 − p, while DropConnect makes the weight matrix W become a sparse matrix by introducing zero-elements with probability 1 − p, rather than modifying the output vectors of a layer. That is to say, the fully connected layer with DropConnect becomes a sparsely connected layer. However, this is not equivalent to setting W to be a fixed sparse matrix during training. • Dropout is to remove the outputs of randomly dropped hidden units, while DropConnect is to set weight values connecting the input units and hidden units to be zero with a probability of 1 − p. In other words, any dropped hidden unit has no outputs in all output directions, but a DropConnected layer has no outputs in DropConnect directions only.

522

7 Neural Networks

Definition 7.5 (DropConnect Network [153]) Let {x1 , . . . , xl } with labels {y1 , . . . , yl } be the data set S of l entries. A DropConnect network is defined as a mixture model:  p(M)f (x; θ , M) = Em {f (x; θ , M)}, (7.9.33) o= m

where M is the DropConnect layer mask and θ = {Ws, W, Wg} are the network parameters: Ws are the softmax layer parameters, W are the DropConnect layer parameters, and Wg are the feature extractor parameters. Each network f(x; θ, M) has weight p(M) such that Mij ∼ Bernoulli(p). When each element of M has equal probability of being on and off (p = 0.5), the mixture model has equal weights for all sub-models f(x; θ, M); otherwise the mixture model weights some sub-models more heavily than others.
A standard DropConnect model architecture comprises the following basic steps [152].
1. Feature Extractor:

    v = g(x; Wg),        (7.9.34)

where v ∈ R^{n×1} is the output feature vector, x ∈ R^{I×1} is the input data vector to the overall model, and Wg ∈ R^{n×I} is the parameter matrix for the feature extractor. g(·) is chosen to be a multilayered convolutional neural network (CNN), with Wg being the convolutional filters (and biases) of the CNN.
2. DropConnect Layer:

    r = f(u) = f((M ⊙ W)v) ∈ R^{d×1},        (7.9.35)

where v is the output of the feature extractor, W ∈ R^{d×n} is a fully connected weight matrix, f is a nonlinear activation function, and M ∈ R^{d×n} is the binary mask matrix.
3. Softmax Classification Layer:

    o_i = softmax(r; Ws) = exp(−w_{s,i}^T r) / (1 + Σ_j exp(−w_{s,j}^T r)),        (7.9.36)

where Ws ∈ R^{k×d} is the weight matrix of the softmax classification layer, which takes r as its input and maps it to a k-dimensional output vector o (k is the number of classes).


Definition 7.6 (Logistic Loss [153]) The following loss function defined on k-class classification is called the logistic loss function:

    A(o) = −Σ_i y_i ln( exp(o_i) / Σ_j exp(o_j) ) = −o_i + ln Σ_j exp(o_j),        (7.9.37)

where y = [y1, . . . , yk]^T is a binary vector with the ith bit set to on.
Lemma 7.1 The logistic loss function A has the following properties [153]:
1. A(0) = ln k, i.e., A(0) is a constant that depends only on the number of labels.
2. −1 ≤ A′(0) ≤ 1 with A′(0) = ∂A(o)/∂o |_{o=0}. That is to say, A is a Lipschitz function with L = 1.
3. A″(o) ≥ 0, i.e., A is a convex function with respect to o.
With DropConnect, the feedforward operation becomes the DropConnect network [152]:

    M_ij^(l) ∼ Bernoulli(p),        (7.9.38)
    W̃^(l) = W^(l) ⊙ M,        (7.9.39)
    z_i^(l+1) = Σ_j W̃_ij^(l) y_j^(l) + b_i^(l),        (7.9.40)
    y^(l+1) = f(z^(l+1)).        (7.9.41)

Because each unit u_i = Σ_j (W_ij v_j) M_ij is a weighted sum of Bernoulli variables M_ij, it can be approximated by a Gaussian, u ∼ N(μ_u, σ_u^2), with mean vector μ_u = E{u} = pWv and variance var(u) = p(1 − p)(W ⊙ W)(v ⊙ v). A comparison between Dropout and DropConnect network inferences is as follows [152]:
• Dropout network inference (mean-inference):

    E_m{a(m ⊙ Wv)} ≈ a(E_m{m ⊙ Wv}) = a(pWv).        (7.9.42)

• DropConnect network inference (sampling):

    E_M{a((M ⊙ W)v)} ≈ E_u{a(u)},        (7.9.43)

where u = (M ⊙ W)v with entries u_i = Σ_j (W_ij v_j) M_ij, each a weighted sum of Bernoulli variables M_ij. Each neuron u_i can be approximated by a Gaussian via moment matching to give

    u ∼ N( pWv, p(1 − p)(W ⊙ W)(v ⊙ v) ),        (7.9.44)


where the mean vector is E_M{u} = pWv and the variance is var_M(u) = p(1 − p)(W ⊙ W)(v ⊙ v).
The overall model f(x; θ, M) therefore maps the input data x to an output o through a sequence of the above operations, given the parameters θ = {Wg, W, Ws} and a randomly drawn mask M. The correct value of o is obtained by summing out over all possible masks M:

    o = Σ_M p(M) f(x; θ, M).        (7.9.45)

If p(M), the probability of a mask, is uniform over all masks M, then

    o = (1/|M|) Σ_M f(x; θ, M) = (1/|M|) Σ_M softmax( f((M ⊙ W)v); Ws ).        (7.9.46)

The model parameters θ can then be updated via stochastic gradient descent (SGD) by back-propagating the gradients of the loss function A with respect to the parameters, ∂A/∂θ; see [152].
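As a concrete illustration of the two inference rules (7.9.42) and (7.9.43)-(7.9.44), the following NumPy sketch compares mean-inference with Gaussian moment-matching sampling for a single DropConnect layer. The layer sizes, sample count, and activation function are arbitrary choices for illustration and are not values prescribed in [152].

import numpy as np

rng = np.random.default_rng(1)
n, d, p, n_samples = 8, 5, 0.5, 1000
W = rng.standard_normal((d, n))
v = rng.standard_normal(n)
relu = lambda z: np.maximum(z, 0.0)

# Dropout-style mean-inference (7.9.42): push the mean through the activation
r_mean = relu(p * (W @ v))

# DropConnect inference (7.9.43)-(7.9.44): moment-matched Gaussian sampling
mu = p * (W @ v)                                 # E_M{u}
var = p * (1 - p) * (W * W) @ (v * v)            # var_M(u), elementwise products
u_samples = mu + np.sqrt(var) * rng.standard_normal((n_samples, d))
r_sampled = relu(u_samples).mean(axis=0)         # E_u{a(u)} by Monte Carlo averaging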

7.10 Autoencoders

Multilayer perceptrons (MLPs) form a class of neural networks with a feedforward layered structure. Consider a general MLP with one hidden layer. It consists of an input layer containing ni units and two layers with computational units: the hidden layer with p units and the output layer with no units. In the standard MLP, in order to achieve dimension reduction by autoassociation, the input units must communicate their values to the output units through a hidden layer acting as a limited-capacity bottleneck which must optimally encode the input vectors. Thus, in this particular application, ni = no = n and p < n. Figure 7.12 shows an MLP with one hidden layer for autoassociation. When an n-dimensional real input vector xk (k = 1, . . . , N) is entered, the output values of the hidden units form a p-vector given by

    h_k = F(W^(1) x_k + b^(1)),        (7.10.1)

where W(1) is the (input-to-hidden) p × n weight matrix and b(1) is a p-vector of biases.

Fig. 7.12 MLP with one hidden layer for autoassociation (ni input units, p hidden units, no output units). The shape of the whole network is similar to an hourglass

7.10.1 Basic Autoencoder

In the context of object recognition, a particularly interesting and challenging question is whether unsupervised learning can be used to learn hierarchies of invariant feature extractors [117]. Autoencoders provide an effective way to address this problem. The basic idea of the autoencoder, proposed by Rumelhart et al. [123] in 1986, is layer-by-layer training. Features that are invariant to small distortions are called invariant features. The key idea of invariant feature learning is to represent an input patch with two components [117]: the invariant feature vector (which represents what is in the image) and the transformation parameters (which encode where each feature appears in the image).
The autoencoder is an MLP with a symmetric structure, and it is used to learn an efficient representation (encoding) for a set of data in an unsupervised manner. The aim of an autoencoder is to make the output as similar to the input as possible. Architecturally, the simplest form of an autoencoder is a feedforward, nonrecurrent neural network very similar to an MLP: it has an input layer, an output layer (also known as the reconstruction layer), and one or more hidden layers connecting them, but with the output layer having the same number of nodes as the input layer. The purpose of an autoencoder is to reconstruct its own inputs instead of predicting a target value Y given inputs X. Therefore, autoencoders are unsupervised learning models.
An autoencoder is composed of two networks: an encoder network and a decoder network, as shown in Fig. 7.13. The basic idea of an autoencoder is to encode information automatically (as compressed, not encrypted), and hence it gets its name. The hidden layer in the middle is smaller, and the input layer and output layer on both sides are larger. The autoencoder is always symmetrical about its middle layer (comprising one or two layers, depending on whether the total number of layers is odd or even). The layers in front of the middle layer form the encoding part, and the layers behind the middle layer form the decoding part.



Fig. 7.13 Illustration of the architecture of the basic autoencoder with “encoder” and “decoder” networks for high-level feature learning. The leftmost layer of the whole network is called the input layer, and the rightmost layer the output layer. The middle layer of nodes is called the hidden layer, because its values are not observed in the training set. The circles labeled “+1” are called bias units, and correspond to the intercept term b [109]. Ii denotes the index i, i = 1, . . . , n

• Encoder network: the encoder f(x) is a feature-extracting vector function in a specific parameterized closed form that maps the original high-dimensional input (n-vector) x ∈ R^n to a low-dimensional feature or intermediate (p-vector) representation h ∈ R^p:

    h = f(x) = s_f(W^(1) x + b_h): R^n → R^p  with n > p,        (7.10.2)

where h is the feature vector (representation or code) computed from x, W^(1) ∈ R^{p×n} is the encoding matrix, and s_f is an elementwise nonlinear activation function, typically the logistic function s_f(z_i) = sigmoid(z_i) = 1/(1 + e^{−z_i}) applied to the argument vector z = [z_i]_{i=1}^p. The encoder's remaining parameter is the bias vector b_h ∈ R^p.
• Decoder network: the symmetric decoder g(h) is another closed-form parameterized vector function that maps h back to an n-dimensional output vector, producing a reconstruction y = g(h) that is desired to be as similar as possible to the original input vector x, i.e., y = x̂:

    y = g(h) = s_g(W^(2) h + b_y): R^p → R^n,        (7.10.3)

where W^(2) ∈ R^{n×p} is the decoding matrix and s_g is the decoder's elementwise nonlinear activation function, typically either the identity (yielding linear reconstruction) or a sigmoid. The decoder's remaining parameter is the bias vector b_y ∈ R^n. In general, one often explores only the tied-weights case, in which W^(2) = (W^(1))^T = W^T.
Because the input layer and the output layer have the same number of neurons n, which is larger than the number p of neurons in the hidden layer, the middle hidden layer is the bottleneck of the whole autoencoder. The autoencoder shown in Fig. 7.13 is a typical feedforward neural network, since its connectivity graph does not have any directed loops or cycles.
Autoencoder training is to find parameters θ = {W, b_h, b_y} that minimize the reconstruction error on a training set of examples D_n:

    min_θ { J_AE(θ) = Σ_{x∈D_n} L(x, g(f(x))) },        (7.10.4)

where L(x, y) is the reconstruction error. Its typical choices include [119]:
• the squared error L(x, y) = ||x − y||^2, and
• the cross-entropy loss L(x, y) = −Σ_{i=1}^n [ x_i log(y_i) + (1 − x_i) log(1 − y_i) ].
The autoencoder training flowchart is shown in Fig. 7.14.
Fig. 7.14 Autoencoder training flowchart. The encoder f encodes the input x to h = f(x), and the decoder g decodes h = f(x) to the output y = x̂ = g(f(x)) so that the reconstruction error L(x, y) is minimized
In the autoencoder framework [10, 12, 60], one starts by explicitly defining a feature-extracting function in a specific parameterized closed form. This function, denoted by f_θ, is called the encoder and allows the straightforward and efficient computation of a feature vector h = f_θ1(x) from an input x. For each example x_t from a data set {x_1, . . . , x_T}, define the hidden neurons and activations of the


second layer (hidden layer) as

    net_j^(2) = Σ_{i=1}^n w_{ji}^(1) x_i + b_j^(1),    j = 1, . . . , m,        (7.10.5)
    a_j^(2) = f(net_j^(2)),    j = 1, . . . , m.        (7.10.6)

Similarly, the hidden neurons and activations of the third layer (output layer) are, respectively, defined as

    net_i^(3) = Σ_{j=1}^m w_{ij}^(2) a_j^(2) + b_i^(2),    i = 1, . . . , n,        (7.10.7)
    a_i^(3) = f(net_i^(3)),    i = 1, . . . , n,        (7.10.8)

where f is a nonlinear (typically sigmoid) function, i.e., f(z) = 1/(1 + e^{−z}).
The problem of the basic autoencoder is to find the optimal weight matrices W^(1), W^(2) and bias vectors b^(1), b^(2) that minimize the total error energy of the system. Denoting θ = (θ_1; θ_2) = (W^(1), b^(1); W^(2), b^(2)), the energy of the system is the sum of two terms [115]:

    J_AE(θ) = (1/(2m)) Σ_{j=1}^m E_e(a_j^(2), W^(1)) + (1/(2n)) Σ_{i=1}^n E_d(a_i^(3), W^(2)),        (7.10.9)

where
• E_e(a_j^(2), W^(1)) is the code prediction energy, which measures the discrepancy between the output of the encoder and the code vector a^(2) = [a_1^(2), . . . , a_m^(2)]^T. A typical definition is

    E_e(a_j^(2), W^(1)) = (1/2) ( a_j^(2) − Encode(x_i, W^(1)) )^2 = (1/2) ( a_j^(2) − w_{ji}^(1) x_i )^2.        (7.10.10)

• E_d(a_i^(3), W^(2)) is the reconstruction energy, which measures the discrepancy between the reconstructed image patch produced by the decoder and the input image patch X. A common choice is

    E_d(a_i^(3), W^(2)) = (1/2) ( a_i^(3) − Decode(ȳ_t, W^(2)) )^2 = (1/2) ( a_i^(3) − x_i )^2.        (7.10.11)


Therefore, the loss function (7.10.9) of the basic autoencoder optimization can be represented as

    J_AE(θ) = (1/(2m)) Σ_{j=1}^m ( a_j^(2) − w_{ji}^(1) x_i )^2 + (1/(2n)) Σ_{i=1}^n ( a_i^(3) − x_i )^2.        (7.10.12)

Usually, the backpropagation algorithm is an effective method to train the parameters of an autoencoder with n input units and m hidden units. By the backpropagation algorithm, the partial derivatives of the loss function J_AE(θ) with respect to the weights w_{ij}^(2) and biases b_i^(2) are given by applying the chain rule twice:

    ∂J_AE/∂w_{ij}^(2) = (∂J_AE/∂a_i^(3)) · (∂a_i^(3)/∂net_i^(3)) · (∂net_i^(3)/∂w_{ij}^(2)),        (7.10.13)
    ∂J_AE/∂b_i^(2) = (∂J_AE/∂a_i^(3)) · (∂a_i^(3)/∂net_i^(3)) · (∂net_i^(3)/∂b_i^(2)),        (7.10.14)

where

    ∂net_i^(3)/∂w_{ij}^(2) = ∂/∂w_{ij}^(2) ( Σ_{j=1}^m w_{ij}^(2) a_j^(2) + b_i^(2) ) = a_j^(2),        (7.10.15)
    ∂net_i^(3)/∂b_i^(2) = ∂/∂b_i^(2) ( Σ_{j=1}^m w_{ij}^(2) a_j^(2) + b_i^(2) ) = 1,        (7.10.16)
    ∂a_i^(3)/∂net_i^(3) = f′(net_i^(3)) = f(net_i^(3)) (1 − f(net_i^(3))) = a_i^(3) (1 − a_i^(3)),        (7.10.17)
    ∂J_AE/∂a_i^(3) = ∂/∂a_i^(3) [ (1/2) (a_i^(3) − x_i)^2 ] = a_i^(3) − x_i.        (7.10.18)

Hence, defining the output-layer delta

    σ_i^(3) = (∂J_AE/∂a_i^(3)) (∂a_i^(3)/∂net_i^(3)) = (a_i^(3) − x_i) a_i^(3) (1 − a_i^(3)),        (7.10.19)

we have the updates

    b_i^(2) = b_i^(2) + σ_i^(3),        (7.10.20)
    w_{ij}^(2) = w_{ij}^(2) + a_j^(2) σ_i^(3),        (7.10.21)

for i = 1, . . . , n; j = 1, . . . , m.


Similarly, the partial derivatives of the loss J_AE(θ) with respect to the weights w_{ji}^(1) and biases b_j^(1) are given by applying the chain rule twice:

    ∂J_AE/∂w_{ji}^(1) = (∂J_AE/∂a_j^(2)) · (∂a_j^(2)/∂net_j^(2)) · (∂net_j^(2)/∂w_{ji}^(1)),        (7.10.22)
    ∂J_AE/∂b_j^(1) = (∂J_AE/∂a_j^(2)) · (∂a_j^(2)/∂net_j^(2)) · (∂net_j^(2)/∂b_j^(1)),        (7.10.23)

where

    ∂net_j^(2)/∂w_{ji}^(1) = ∂/∂w_{ji}^(1) ( Σ_{i=1}^n w_{ji}^(1) x_i + b_j^(1) ) = x_i,

and

    ∂net_j^(2)/∂b_j^(1) = ∂/∂b_j^(1) ( Σ_{i=1}^n w_{ji}^(1) x_i + b_j^(1) ) = 1,
    ∂a_j^(2)/∂net_j^(2) = f′(net_j^(2)) = f(net_j^(2)) (1 − f(net_j^(2))) = a_j^(2) (1 − a_j^(2)),
    ∂J_AE/∂a_j^(2) = Σ_{i=1}^n (∂J_AE/∂a_i^(3)) (∂a_i^(3)/∂net_i^(3)) (∂net_i^(3)/∂a_j^(2)) = Σ_{i=1}^n (∂J_AE/∂a_i^(3)) (∂a_i^(3)/∂net_i^(3)) w_{ij}^(2) = Σ_{i=1}^n σ_i^(3) w_{ij}^(2).

Thus we have [174] (2) σj (2)

=

(2) ∂JAE ∂aj

∂aj(2) ∂net(2) j (2)

=

(2)

bj = bj + σj , (1) (2) wj(1) i = wj i + xi σj ,

 n 

 (3) (2) σi wij

(2) 

aj



(2) 

1 − aj

,

(7.10.24)

i=1

(7.10.25) (7.10.26)

for i = 1, . . . , n; j = 1, . . . , m. The Backpropagation algorithm for the basic autoencoder is summarized in Algorithm 7.4.


Algorithm 7.4 Backpropagation algorithm for the basic autoencoder [174]
1. input: {(x^(j), y^(j))}_{j=1}^m with x^(j) = [x_1^(j), . . . , x_n^(j)]^T, learning rate η and threshold.
2. initialization:
3. for iteration = 1, 2, . . . , iter_max do
4.   for example = 1, 2, . . . , N do
5.     for j = 1, 2, . . . , m do
6.       net_j^(2) = Σ_{i=1}^n w_{ji}^(1) x_i + b_j^(1);
7.       a_j^(2) = f(net_j^(2));
8.     end for
9.     for i = 1, 2, . . . , n do
10.      net_i^(3) = Σ_{j=1}^m w_{ij}^(2) a_j^(2) + b_i^(2);
11.      a_i^(3) = f(net_i^(3));
12.    end for
13.    if J_AE(θ) > threshold then
14.      compute σ_i^(3) = a_i^(3) · (1 − a_i^(3)) · (a_i^(3) − x_i), i = 1, . . . , n;
15.      compute σ_j^(2) = ( Σ_{i=1}^n w_{ij}^(2) σ_i^(3) ) a_j^(2) (1 − a_j^(2)), j = 1, . . . , m;
16.      b_i^(2) = b_i^(2) + σ_i^(3), i = 1, . . . , n;
17.      w_{ij}^(2) = w_{ij}^(2) + a_j^(2) · σ_i^(3), i = 1, . . . , n; j = 1, . . . , m;
18.      b_j^(1) = b_j^(1) + σ_j^(2), j = 1, . . . , m;
19.      w_{ji}^(1) = w_{ji}^(1) + x_i · σ_j^(2), i = 1, . . . , n; j = 1, . . . , m;
20.      W^(1) = W^(1) − η( (1/N) ΔW^(1) );
21.      W^(2) = W^(2) − η( (1/N) ΔW^(2) );
22.      b^(1) = b^(1) − η( (1/N) Δb^(1) );
23.      b^(2) = b^(2) − η( (1/N) Δb^(2) );
24.    end if
25.  end for
26. end for
27. output: θ = {W^(1), b^(1); W^(2), b^(2)}.
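The following NumPy sketch illustrates the deltas (7.10.19) and (7.10.24) and the resulting gradient steps for a sigmoid autoencoder. It applies the update per example rather than accumulating a batch average as in Algorithm 7.4, and the layer sizes, learning rate, and toy data are arbitrary illustrative choices rather than values from the text.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, m_hidden, N, eta = 6, 3, 20, 0.1           # visible units, hidden units, examples, step size
X = rng.random((N, n))                        # toy training data in [0, 1]
W1 = 0.1 * rng.standard_normal((m_hidden, n)); b1 = np.zeros(m_hidden)
W2 = 0.1 * rng.standard_normal((n, m_hidden)); b2 = np.zeros(n)

for x in X:                                   # one pass over the data
    a2 = sigmoid(W1 @ x + b1)                 # hidden activations, (7.10.5)-(7.10.6)
    a3 = sigmoid(W2 @ a2 + b2)                # reconstruction, (7.10.7)-(7.10.8)
    sigma3 = (a3 - x) * a3 * (1 - a3)         # output-layer delta, (7.10.19)
    sigma2 = (W2.T @ sigma3) * a2 * (1 - a2)  # hidden-layer delta, (7.10.24)
    W2 -= eta * np.outer(sigma3, a2)          # gradient steps on the reconstruction error
    b2 -= eta * sigma3
    W1 -= eta * np.outer(sigma2, x)
    b1 -= eta * sigma2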

For regularized autoencoders, the simplest form of regularization is weight decay (wd), which favors small weights by optimizing instead the following regularized objective:

    J_{AE+wd}(θ) = Σ_{x∈D_n} L(x, g(f(x))) + λ Σ_{i,j} W_{ij}^2,        (7.10.27)

where λ is the hyperparameter controlling the strength of the regularization. If one takes the penalization term

    ||J_f(x)||_F^2 = Σ_{i,j} ( ∂h_j(x)/∂x_i )^2        (7.10.28)


as the regularization term, then the obtained regularized autoencoder is called the contractive autoencoder (CAE), which was proposed by Rifai et al. [119]. Hence, the objective function of the contractive autoencoder is given by

    J_CAE(θ) = Σ_{x∈D_n} L(x, g(f(x))) + λ ||J_f(x)||_F^2,        (7.10.29)

where λ is a hyperparameter controlling the strength of the regularization. Penalizing Jf 2F encourages the mapping to the feature space to be contractive in the neighborhood of the training data, which gives the name “contractive autoencoder.”
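For a sigmoid encoder h = s_f(Wx + b_h), the Jacobian entries have the closed form ∂h_j/∂x_i = h_j(1 − h_j) W_{ji}, so the penalty (7.10.28) can be evaluated directly. The sketch below illustrates this computation in NumPy; the sizes and random data are illustrative assumptions only.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(W, b, x):
    # For h = sigmoid(W x + b), dh_j/dx_i = h_j (1 - h_j) W_{ji}, so
    # ||J_f(x)||_F^2 = sum_j (h_j (1 - h_j))^2 * sum_i W_{ji}^2, as in (7.10.28).
    h = sigmoid(W @ x + b)
    return np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1))

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8)); b = np.zeros(4); x = rng.random(8)
penalty = contractive_penalty(W, b, x)   # weighted by lambda when added to J_CAE in (7.10.29)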

7.10.2 Stacked Sparse Autoencoder

Autoencoders have several major extensions: the stacked autoencoder, the sparse autoencoder, the stacked sparse autoencoder, the stacked denoising autoencoder, the stacked convolutional autoencoder, and the stacked denoising convolutional autoencoder.
The stacked autoencoder of Bengio et al. [9] is a multilayer neural network in which each layer is the hidden layer of the former autoencoder and the input layer of the next autoencoder. Hence, a stacked autoencoder can be viewed as a neural network composed of multiple autoencoders connected successively in a greedy layer-wise fashion. Figure 7.15 gives an illustration of a stacked autoencoder with 2 hidden layers.
Letting the input of the dth stacked autoencoder be x_t^(d), with x_t^(1) = x_t, the equations of the dth autoencoder are given by

    h_t^(d) = f(W_1^(d) x_t^(d) + b_1^(d)),        (7.10.30)
    y_t^(d) = f(W_2^(d) h_t^(d) + b_2^(d)),        (7.10.31)

where h_t^(d) and y_t^(d) denote, respectively, the feature representation at the hidden layer and the output at the output layer of the dth autoencoder, W_1^(d) and W_2^(d) stand, respectively, for the weight matrices of the encoding step and the decoding step, and b_1^(d) and b_2^(d) are the biases of the hidden layer and the output layer. The activation function f(·) is usually the sigmoid function f(x) = 1/(1 + exp(−x)).


(d)

Fig. 7.15 Stacked autoencoder scheme with 2 hidden layers, where h^(d) = f(W_1^(d) x^(d) + b_1^(d)) and y^(d) = f(W_2^(d) h^(d) + b_2^(d)), d = 1, 2, with x^(1) = x, x^(2) = y^(1) and x̂ = y^(2)

By taking the feature representation y^(d−1) of the (d − 1)th layer as the input x^(d) of the dth layer, i.e., x^(d) = y^(d−1), the objective function is given by

    J(W_1^(d), b_1^(d); W_2^(d), b_2^(d)) = (1/(2T)) Σ_{t=1}^T ||y_t^(d) − x_t^(d)||_2^2 + (β/2) ( ||W_1^(d)||_2^2 + ||W_2^(d)||_2^2 ).        (7.10.32)

Sparse autoencoder [116, 117] is an autoencoder that introduces a sparsity penalty on its hidden layer, so that the autoencoder can obtain more concise and efficient low-dimensional data features under sparsity constraints and thus better express the input data. Suppose that the average activation of the jth neuron in the hidden layer is

    ρ̂_j = (1/T) Σ_{t=1}^T x_{t,j},    j = 1, . . . , n,        (7.10.33)

where xt,j is the j th element of training data x t ∈ Rn . To enforce sparsity, it is hoped that the average activation ρˆj approaches a constant ρ which is the sparsity parameter chosen to be a small positive number


near 0. To this end, the Kullback-Leibler (KL) divergence between ρ̂_j and ρ, i.e.,

    KL(ρ||ρ̂) = Σ_{j=1}^n [ ρ log(ρ/ρ̂_j) + (1 − ρ) log((1 − ρ)/(1 − ρ̂_j)) ],        (7.10.34)

as a regularization term, is added to the error function J_AE(θ) of the basic autoencoder. Here ρ̂ = [ρ̂_1, . . . , ρ̂_n]^T is the vector of average hidden activities. Denote W = (W^(1), W^(2)) and b = (b^(1), b^(2)). To prevent overfitting, a weight decay term is also added to the cost function J_AE(θ) = J_AE(W, b), giving the final cost function of the sparse autoencoder, denoted J_SAE(W, b), as follows [67]:

    J_SAE(W, b) = J_AE(W, b) + β J_KL(ρ||ρ̂) + (λ/2) Σ_{l=1}^2 Σ_{i=1}^{s_l} Σ_{j=1}^{s_{l+1}} (w_{ij}^(l))^2,        (7.10.35)

where β controls the sparsity penalty term, J_KL(ρ||ρ̂) = KL(ρ||ρ̂), λ controls the weight-decay penalty term, and s_l and s_{l+1} are the sizes of adjacent layers.
The stacked sparse autoencoder (SSAE) is a neural network composed of multiple sparse autoencoders connected end to end. The output of the previous sparse autoencoder is used as the input of the next autoencoder, so that higher-level feature representations of the input data can be obtained. Hence, a stacked sparse autoencoder is a combination of the stacked and sparse autoencoders: it has a structure similar to the stacked autoencoder with an added sparsity constraint. The objective function of the dth layer of a stacked sparse autoencoder is described by [68]

    J_SSAE(W_1^(d), b_1^(d), W_2^(d), b_2^(d)) = (1/(2T)) Σ_{t=1}^T ||y_t^(d) − x_t^(d)||_2^2 + β Σ_{j=1}^{K_d} KL(ρ||ρ̂_j) + (α/2) ( ||W_1^(d)||_2^2 + ||W_2^(d)||_2^2 ),        (7.10.36)

where
• K_d denotes the number of hidden units in the dth layer, d = 2, . . . , D, and D is the number of layers;
• T is the number of inputs;
• α is the weight decay parameter;
• β denotes the weight of the sparsity penalty;
• ρ is the sparsity parameter, usually set to a small value;
• ρ̂_j is the average activation of the jth hidden unit in the dth layer over the training set;
• KL(ρ||ρ̂_j) denotes the Kullback-Leibler divergence between ρ and ρ̂_j.
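The following NumPy sketch evaluates the KL sparsity penalty that enters (7.10.35) and (7.10.36). It takes the average activation over the hidden units (the usual reading of (7.10.33)); the layer sizes, sparsity target, and random data are illustrative assumptions, not values from the text.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def kl_sparsity_penalty(W1, b1, X, rho=0.05):
    # Average activation of each hidden unit over the batch, then the KL penalty of (7.10.34).
    H = sigmoid(X @ W1.T + b1)                # (T, hidden) activations
    rho_hat = H.mean(axis=0)
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

rng = np.random.default_rng(0)
X = rng.random((100, 20))                     # T = 100 samples of dimension 20
W1 = 0.1 * rng.standard_normal((10, 20)); b1 = np.zeros(10)
penalty = kl_sparsity_penalty(W1, b1, X)      # enters the cost weighted by beta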


The parameters W_1^(d), W_2^(d), b_1^(d), b_2^(d) are updated by the following stochastic gradient descent algorithm [68]:

    W_1^(d) ← W_1^(d) − η [ (1/T) Σ_{t=1}^T h_i^(d) δ_i^1 + α W_1^(d) ],        (7.10.37)
    W_2^(d) ← W_2^(d) − η [ (1/T) Σ_{t=1}^T h_i^(d) δ_i^2 + α W_2^(d) ],        (7.10.38)
    b_1^(d) ← b_1^(d) − (η/T) Σ_{i=1}^T δ_i^1,        (7.10.39)
    b_2^(d) ← b_2^(d) − (η/T) Σ_{i=1}^T δ_i^2,        (7.10.40)

i=1

where η is a learning rate, and

    δ_i^1 = [ W_2^(d) δ_i^2 + β( −ρ/ρ̂_t + (1 − ρ)/(1 − ρ̂_t) ) ] f′(W_1^(d) x_i + b_1^(d)),        (7.10.41)
    δ_i^2 = ( y_i^(d) − x_i^(d) ) f′(W_2^(d) h_i^(d) + b_2^(d)).        (7.10.42)

(7.10.41) (7.10.42)

7.10.3 Stacked Denoising Autoencoders Without any additional constraints, conventional autoencoders learn only the identity mapping. This problem can be circumvented by using denoising autoencoder (DAE) [149] which tries to train an autoencoder which could automatically denoise the input data from a corrupted version that has been manually added with random noise, and thus generates better feature representations for the subsequent classification tasks. Let the original input data be x ∈ Rd , where d denotes the dimension of data. DAE firstly produces a vector x˜ by adding the Gaussian noise to x or a variable amount v of noise distributed according to the characteristics of the input image, where the parameter v represents the percentage of permissible corruption. DAE reconstructs the denoising input x from the corrupted input x˜ . The number of units in the input layer is d, which is equal to the dimension of the input data x˜ . The DAE is trained to denoise the inputs by first finding the latent representation h = fe (W˜x + b),

(7.10.43)

536

7 Neural Networks

where h ∈ Rh is the output of the hidden layer and can also be called the feature representation or code, h is the number of units in the hidden layer, W ∈ Rh×d is the input-to-hidden weights, b denotes the bias, W˜x + b stands for the input of the hidden layer, and fe (·) is called activation function of the hidden layer. When choosing the ReLU function as the activation function, then h = fe (W˜x + b) = max(0, W˜x + b).

(7.10.44)

Clearly, if the value of W˜x+b is smaller than zero, the output of the hidden layer will be zero, and hence the ReLU activation function is able to produce a sparse feature representation, which may have better separation capability. Moreover, ReLU can train the neural network for large-scale data faster and more effectively than the other activation functions. The decoding or reconstruction of the original input is obtained by using a mapping function fd (·): z = fd (W y + b ),

(7.10.45)

where z ∈ Rd is the output of DAE, which is also the reconstruction of original data x. The output layer has the same number of nodes as the input layer. W = WT is referred to as the tied weight matrix. If x ranges from 0 to 1, the softplus function is selected as the decoding function fd (zi ) = log(1+ezi ); otherwise, we preprocess x by zero-phase component analysis (ZCA) whitening and use a linear function as the decoding function:  ⎧   ⎨ log 1 + eW y+b , x ∈ [0, 1]d ;    fd W y + b = (7.10.46) ⎩   Wy+b, otherwise. DAE aims to train the network by requiring the output data z to reconstruct the input data x, which is also called reconstruction-oriented training. Therefore, the reconstruction error should be used as the objective function or cost function, which is defined as follows [163]: J (W) ⎧ " ! "& ! " ! d % m   ⎪ (i) (i) (i) (i) 1 ⎪ ⎪ x + 1 − x log 1 − z + λ2 W22 , if x ∈ [0, 1]d ; log z − ⎨ m j j j j i=1 j =1 = m  ⎪ ⎪ 1 -x(i) − z(i) -2 + λ W2 , ⎪m otherwise. ⎩ 2 2 F i=1

(7.10.47) (i)

Here x_j^(i) is the jth element of the ith sample, ||W||_2^2 is the ℓ2-regularization term, also called the weight decay term, and the parameter λ controls the importance of the regularization term.
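The following NumPy sketch illustrates one forward pass of a DAE with Gaussian corruption, the ReLU encoding of (7.10.44), a linear tied-weight decoder, and the squared-error case of the cost (7.10.47). The sizes, noise level, and weight decay value are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(0)
m_samples, d, h_units, lam = 50, 20, 8, 1e-3
X = rng.standard_normal((m_samples, d))            # original inputs x

W = 0.1 * rng.standard_normal((h_units, d))        # input-to-hidden weights
b = np.zeros(h_units)
b_prime = np.zeros(d)

X_tilde = X + 0.1 * rng.standard_normal(X.shape)   # corrupted inputs (Gaussian noise)
H = np.maximum(0.0, X_tilde @ W.T + b)             # ReLU encoding, (7.10.44)
Z = H @ W + b_prime                                # linear decoding with tied weights W' = W^T

# Reconstruction-oriented cost, second case of (7.10.47)
J = np.mean(np.sum((X - Z) ** 2, axis=1)) + (lam / 2) * np.sum(W ** 2)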


DAEs can be stacked to build a deep network which has more than one hidden layer. This type of stacked DAE is simply called the SDAE. A typical SDAE structure includes two encoding layers and two decoding layers. In the encoding part, the output of the first encoding layer acts as the input data of the second encoding layer. Supposing there are L hidden layers in the encoding part, we have the activation of the kth encoding layer:

    y^(k+1) = f_e(W^(k) y^(k) + b^(k)),    k = 0, . . . , L − 1,        (7.10.48)

where the input y^(0) is the original data x. The output y^(L) of the last encoding layer contains the high-level features extracted by the SDAE network. In the decoding part, the output of the first decoding layer is regarded as the input of the second decoding layer. The decoding function of the kth decoding layer is given by

    z^(k+1) = f_e( (W^(L−k))^T z^(k) + b′^(k) ),    k = 0, . . . , L − 1,        (7.10.49)

where the input z^(0) of the first decoding layer is the output y^(L) of the last encoding layer. The output z^(L) of the last decoding layer is the reconstruction of the original data x.
The training process of the SDAE is stated as follows [163]:
1. Choose the input data, which can be randomly selected from the hyperspectral images.
2. Train the first DAE, which includes the first encoding layer and the last decoding layer. Obtain the network weights W^(1), b^(1) and the features y^(1), which are the output of the first encoding layer.
3. Use y^(k) as the input data of the (k + 1)th encoding layer. Train the (k + 1)th DAE and obtain W^(k+1), b^(k+1) and the features y^(k+1), where k = 1, . . . , L − 1 and L is the number of hidden layers in the network.
This training of the SDAE is called layer-wise training, since each DAE is trained independently. Unlike the general SDAE network stacked from two DAE structures, the SDAE of Xing et al. [163] retains the encoding part for feature extraction to produce the initial features and adds a logistic regression (LR) part, instead of the decoding part, for fine-tuning and classification. The output layer of the whole network is also called the LR layer, which uses the following sigmoid function as the activation function:

    h(x) = 1 / (1 + exp(−Wx − b)),        (7.10.50)

where x is the output y^(L) of the last encoding layer.


The steps of SDAE-LR network training are as follows [163]:
1. The SDAE is used to train the initial network weights.
2. The initial weights of the LR layer are randomly set.
3. The training data are used as input data, and their predicted classification results are produced with the initial weights of the whole network.
4. The network weights are iteratively tuned by minimizing the cross-entropy function

    Cost = −(1/m) Σ_{i=1}^m [ l^(i) log h(x^(i)) + (1 − l^(i)) log(1 − h(x^(i))) ]        (7.10.51)

using the mini-batch stochastic gradient descent (MSGD) algorithm. Here l^(i) denotes the label of the sample x^(i).

7.10.4 Convolutional Autoencoders (CAE)

Fully connected AEs and DAEs both ignore the 2D image structure, which introduces redundancy in the parameters and forces each feature to be global. To discover localized features, the convolutional autoencoder (CAE) is an alternative. CAEs differ from conventional AEs in that their weights are shared among all locations in the input, preserving spatial locality [103]. The CAE architecture is intuitively similar to the DAE except that the weights are shared. For a mono-channel input x, the latent representation of the kth feature map is given by [103]

    h^(k) = σ( x ∗ W^(k) + b^(k) ),        (7.10.52)

where the bias is broadcast to the whole map, σ(·) is an activation function, and ∗ denotes the 2D convolution. A single bias per latent map is used so that each filter specializes on features of the whole input (one bias per pixel would introduce too many degrees of freedom). The reconstruction of the input x is given by

    y = Σ_{k∈H} h^(k) ∗ W̄^(k) + c,        (7.10.53)

where c is one bias per input channel, H identifies the group of latent feature maps, and W̄ denotes the weights flipped over both dimensions.


The cost function of the CAE is defined as the mean squared error (MSE):

    E(θ) = (1/(2n)) Σ_{i=1}^n (x_i − y_i)^2.        (7.10.54)

Using the convolution operation, it is easy to show that

    ∂E(θ)/∂W^(k) = x ∗ δh^(k) + h̄^(k) ∗ δy,        (7.10.55)

where δh and δy are the deltas of the hidden states and of the reconstruction, respectively, and h̄ is the flip of h. Hence, the weight matrix W can be updated as

    W^(k+1) = W^(k) − η ∂E(θ)/∂W^(k) = W^(k) − η ( x ∗ δh^(k) + h̄^(k) ∗ δy ).        (7.10.56)

The network architecture of CAE consists of three basic building blocks: the convolutional layer, the max-pooling layer, and the classification layer.
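As an illustration of the encoding (7.10.52) and reconstruction (7.10.53), the following sketch evaluates the latent maps of a tiny CAE with SciPy's 2D convolution. The image size, number of filters, and sigmoid activation are illustrative assumptions; the reconstruction bias c is omitted for brevity.

import numpy as np
from scipy.signal import convolve2d

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.random((16, 16))                          # mono-channel input image
filters = 0.1 * rng.standard_normal((3, 5, 5))    # three 5x5 filters W^(k)
biases = np.zeros(3)                              # one bias per latent map, as in (7.10.52)

# Latent feature maps h^(k) = sigmoid(x * W^(k) + b^(k))
h = np.array([sigmoid(convolve2d(x, W_k, mode='valid') + b_k)
              for W_k, b_k in zip(filters, biases)])

# Reconstruction in the spirit of (7.10.53): each map convolved with the flipped filter
recon = sum(convolve2d(h_k, W_k[::-1, ::-1], mode='full')
            for h_k, W_k in zip(h, filters))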

7.10.5 Stacked Convolutional Denoising Autoencoder

The stacked convolutional denoising autoencoder (SCDAE), proposed by Du et al. [25], is an unsupervised deep network that stacks well-designed DAEs in a convolutional way to generate high-level feature representations. The overall architecture is optimized by layer-wise training. The structure of the SCDAE with dropout is as follows [25]:
1. Latent vector mapping: A DAE corrupts the input vector x into a vector x̃ with a certain probability λ by means of a stochastic mapping

    x̃ ∼ D(x̃ | x, λ),        (7.10.57)

where D is a type of distribution determined by the original distribution of x and the type of random noise added to x. Then, x̃ is mapped to a latent vector representation h using a deterministic function f:

    h = f_e(W x̃ + b).        (7.10.58)

2. Dropout: By using the dropout technique, the network optimizes the training process; neural units in the hidden layers are randomly omitted with a probability


q. The representation h is then transformed into a dropped representation h̃ by an elementwise product with a masking vector m ∼ Bernoulli(1 − q):

    h̃ = m ⊙ h,        (7.10.59)

where (m ⊙ h)_i = m_i · h_i. Since a large neural network is updated iteratively, a unique network is trained in each iteration as a result of randomly dropping the neurons in the hidden layer. When the training process converges, the network obtains an average representation over 2^{|m|} networks.
3. Decoding: The dropped hidden feature vector h̃ is then reversely mapped to a final feature z to reconstruct the original input x by another mapping function:

    z = f_d(W′ h̃ + b′).        (7.10.60)

4. Optimization: The optimization problem is given by

    (W, W′, b, b′) = arg min_{W,W′,b,b′} ||z − x||_2^2 + sparse(h̃),        (7.10.61)

where sparse(h̃) denotes a type of sparsity constraint, such as the KL distance.
5. Convolutional layer: The input of each convolutional layer is a 3D feature map, a mid-level representation of the input image. In each convolutional layer, the convolutional DAE transforms the input features into more robust and abstract feature maps through the learned denoising filters.

7.10.6 Nonnegative Sparse Autoencoder

Nonnegative sparse autoencoders are inspired by the ideas of nonnegative matrix factorization (NMF) [91] and sparse coding.
1. Nonnegative constraints: The factorization X = AS of a nonnegative matrix X ∈ R_{+,0}^{M×N} is called the nonnegative matrix factorization if both the M × p factor matrix A and the p × N factor matrix S are nonnegative. Inspired by NMF, nonnegative constraints are added to the weights of a neural network. The NMF has the following key features [172]:
• Distributed nonnegative encoding: NMF does not allow negative elements in the factor matrices A and S. Unlike the single constraint of vector quantization (VQ), the nonnegative constraint allows the use of a combination of basis parts to represent a whole signal. Unlike principal component analysis (PCA), the NMF allows only additive combinations; because the nonzero elements of A and S are all positive, any subtraction between basis images or signals is avoided. In terms of optimization criteria, VQ adopts the "winner-take-all" constraint and PCA is based


on the “all-sharing” constraint, but the NMF subjects to a “group-sharing” constraint together with a nonnegative constraint. From the encoding point of view, the NMF is a distributed nonnegative encoding, which often leads to sparse encoding. • Parts combination: The NMF gives the intuitive impression that it is not a combination of all features; rather, just some of the features, (simply called the parts), are combined into a (target) whole. From the perspective of machine learning, the NMF is a kind of machine learning method based on the combination of parts which has the ability to extract the main features. • Multilinear data analysis capability: The PCA uses a linear combination of all the basis vectors to represent the data and can extract only the linear structure of the data. In contrast, the NMF uses combinations of different numbers of and different labels of the basis vectors (parts) to represent the data and can extract its multilinear structure; thereby it has certain nonlinear data analysis capabilities. 2. Sparse constraints: Inspired by sparse coding, the sparse constraints imposed on the connecting weights can play the “dropout” role and optimize the whole autoencoder. Dropout is a technique applied to the fully connected layers to prevent overfitting. Many signals in practice have the nonnegative values. The following are four important actual examples of nonnegative matrices [87]: • Text documents are stored as vectors. Each element of a document vector is a count (possibly weighted) of the number of times a corresponding term appears in that document. Stacking document vectors one after the other creates a nonnegative term-by-document matrix that represents the entire document collection numerically. • In image collections, each image is represented by a vector, and each element of the vector corresponds to a pixel. The intensity and color of a pixel is given by a nonnegative value, thereby creating a nonnegative pixel-by-image matrix. • For itemsets or recommendation systems, the information for a purchase history of customers or ratings on a subset of items is stored in a nonnegative sparse matrix. • In gene expression, gene-by-experiment matrices are formed from observations of the gene sequences produced under various experimental conditions. To encourage nonnegativity in the connecting weights W, the weight decay term (l) 2 (wij ) in the sparse autoencoder (7.10.35) is replaced by a quadratic function (l)

f (wij ), resulting in the cost function for nonnegativity constrained autoencoder (NCAE):  sl+1  2 sl  α  (l) ˆ + JNCAE (W, b) = JAE (W, b) + βJKL (ρρ) f wij , 2 l=1 i=1 j =1

(7.10.62)


where α ≥ 0 and

    f(w_ij) = { w_ij^2,   w_ij < 0;
                0,        w_ij ≥ 0.        (7.10.63)

Then, the backpropagation algorithm is used to compute the gradient updates:

    w_{ij}^(l) = w_{ij}^(l) − η ∂J_NCAE(W, b)/∂w_{ij}^(l),        (7.10.64)
    b_i^(l) = b_i^(l) − η ∂J_NCAE(W, b)/∂b_i^(l),        (7.10.65)

where η > 0 is the learning rate, and

    ∂J_NCAE(W, b)/∂w_{ij}^(l) = ∂J_AE(W, b)/∂w_{ij}^(l) + β ∂J_KL(ρ||ρ̂)/∂w_{ij}^(l) + α g(w_{ij}^(l)),        (7.10.66)
    ∂J_NCAE(W, b)/∂b_i^(l) = ∂J_AE(W, b)/∂b_i^(l) + β ∂J_KL(ρ||ρ̂)/∂b_i^(l),        (7.10.67)

where

    g(w_ij) = { w_ij,   w_ij < 0;
                0,      w_ij ≥ 0.        (7.10.68)

Autoencoders with regularization constraints are known as regularized autoencoders [10]. The stacked sparse autoencoder and the nonnegative sparse autoencoder are two typical regularized autoencoders due to their sparsity regularization [115]. Importantly, regularized autoencoders can capture local structure of the density of signals [10].

7.11 Extreme Learning Machine

Extreme learning machine (ELM) is a machine learning algorithm for single-hidden layer feedforward neural networks (SLFNs) which randomly chooses hidden nodes and analytically determines the output weights of the SLFN. In theory, this algorithm tends to provide good generalization performance while retaining an extremely fast learning speed [70].

For function approximation in a finite training set, SLFNs with at most N hidden nodes and almost any nonlinear activation function can exactly learn N distinct observations [69]. ELM is an effective solution for SLFNs. Because ELM has a fast speed for classification, it is widely applied in data stream classification tasks.

7.11.1 Single-Hidden Layer Feedforward Networks with Random Hidden Nodes

First, we introduce the concept of random hidden nodes.
Definition 7.7 (Piecewise Continuous [150, p.334]) A function is said to be piecewise continuous if it has only a finite number of discontinuities in any interval and its left and right limits are defined (not necessarily equal) at each discontinuity.
Definition 7.8 (Randomly Generated) The function sequence {g_n = g(⟨w_n, x⟩ + b_n)} or {g_n = g(||w_n − x||/b_n)} is said to be randomly generated if the corresponding parameters are randomly generated from or based on a continuous sampling probability distribution.
Definition 7.9 (Random Node) A node is called a random node if its parameters (w, b) are randomly generated based on a continuous sampling probability distribution.
SLFNs have two main network architectures:
• the SLFN with additive hidden nodes, and
• the SLFN with radial basis function (RBF) nodes in the hidden layer.
The network function of an SLFN with d hidden nodes can be represented by

    f_d(x) = Σ_{i=1}^d β_i g_i(x),    x ∈ R^n, β_i ∈ R,        (7.11.1)

where gi denotes the ith hidden node output function. Two commonly used ith hidden node output functions gi are defined as 

 gi (x) = g wi , x + bi

= g(wTi x + bi ),

wi ∈ Rn , bi ∈ R

(7.11.2)

for additive nodes, and  gi (x) = g

 x − ai  , bi

ai ∈ Rn , bi ∈ R+

(7.11.3)

544

7 Neural Networks

for RBF nodes, where ai and bi are the center and impact factor of the ith RBF node and are the weights connecting the ith RBF hidden node to the output node, while R+ indicates the set of all positive real value. In other words, the output function of an SLFN with d additive nodes and d RBF nodes can be, respectively, given by fd (x) =

d 

" ! βi g wTi x + bi ∈ R

(7.11.4)

i=1

and fd (x) =

d 

 βi g

i=1

 x − ai  , bi

ai ∈ Rn ∈ R.

(7.11.5)

It was proved by Hornik [66] that if the activation function is continuous, bounded, and nonconstant, then continuous mappings can be approximated by SLFNs with additive hidden nodes over compact input sets. It is the more important [93] that SLFNs with additive hidden nodes and with a nonpolynomial activation function can approximate any continuous target functions. Consider the SLFN w with the weights β 1 , . . . , β d . If the SLFN produces N distinct samples (xi , yi ), i = 1, . . . , N , where xi = [xi1 , . . . , xin ]T ∈ Rn and yi = [yi1 , . . . , yim ]T ∈ Rm , then standard SLFNs with d hidden nodes and activation function g(x) are mathematically modeled as ⎧   d ⎪  ⎪ T ⎪ ⎪ β i g wi x1 + bi = y1 , ⎪ ⎪ ⎪ ⎪ ⎨ i=1 .. . ⎪ ⎪   ⎪ d ⎪  ⎪ ⎪ ⎪ β i g wTi xN + bi = yN , ⎪ ⎩

(7.11.6)

i=1

where wi = [wi1 , . . . , win ]T ∈ Rn is the weight vector connecting the ith hidden node and the input nodes, β i = [βi1 , . . . , βiN ]T ∈ RN is the weight vector connecting the ith hidden node and the output nodes, and bi is the threshold of the ith hidden node. The N equations in (7.11.6) can be expressed in the following matrix form [70]: HB = Y,

(7.11.7)

7.11 Extreme Learning Machine

545

where  ⎡  T ⎢g w1 x1 + b1 ⎢ ⎢ .. H=⎢ . ⎢ ⎣ T g(w1 xN + b1 )

··· ..

.

···

 ⎤ T g wd x1 + bd ⎥ ⎥ ⎥ .. ⎥ ∈ RN ×d , . ⎥  ⎦ g wTd xN + bd

⎤ ⎡ ⎤ β11 · · · β1m β T1 ⎢ ⎥ ⎢ ⎥ B = ⎣ ... ⎦ = ⎣ ... . . . ... ⎦ ∈ Rd×m , β Td βd1 · · · βdm ⎡ T⎤ ⎡ ⎤ y1 y11 · · · y1m ⎢ ⎥ ⎢ ⎥ Y = ⎣ ... ⎦ = ⎣ ... . . . ... ⎦ ∈ RN ×m .

(7.11.8)



yTN

(7.11.9)

(7.11.10)

yN 1 · · · yN m

H is known as the hidden layer output matrix of the neural network [69], and the ith column of H is the ith hidden node output when inputting x1 , . . . , xN . Theorem 7.2 ([70]) Suppose we are given a standard SLFN with N hidden nodes and activation function g : R → R which is infinitely differentiable in any interval, for N arbitrary distinct samples (xi , yi ), i = 1, . . . , N , where xi ∈ Rn and yi ∈ Rm . Then, for any wi and bi randomly chosen, respectively, from any intervals of Rn and R according to any continuous probability distribution, the hidden layer output matrix H of the SLFN, with probability one, is invertible and HB − Y = 0. Theorem 7.3 ([70]) If we are given any small positive value  > 0 and activation function g : R → R which is infinitely differentiable in any interval, and there exists d ≤ N such that for N arbitrary distinct samples x1 , . . . , xN with xi ∈ Rn and yi ∈ Rm , for any wi and bi randomly chosen, respectively, from any intervals of Rn and R according to any continuous probability distribution, then with probability one HN ×d Bd×m − YN ×m  < . Infinitely differentiable activation functions include the sigmoidal functions as well as the radial basis, sine, cosine, exponential, and many other nonregular functions, as shown by Huang and Babri [69].

7.11.2 Extreme Learning Machine Algorithm for Regression and Binary Classification

The offline solution of (7.11.7) is given by

    B = H† Y,        (7.11.11)


where H† is the Moore–Penrose generalized inverse of the hidden layer output matrix H.
Based on the minimum norm LS solution B = H† Y, Huang et al. [70] proposed the extreme learning machine (ELM) as a simple learning algorithm for SLFNs, as shown in Algorithm 7.5.

Algorithm 7.5 Extreme learning machine (ELM) algorithm
1. input: A training set {(xi, yi) | xi ∈ R^n, yi ∈ R^m, i = 1, . . . , N}, activation function g(x), and hidden node number d.
2. initialization: Randomly assign input weights wi and biases bi, i = 1, . . . , d.
3. learning step:
   3.1. Use (7.11.8) to calculate the hidden layer output matrix H.
   3.2. Use (7.11.10) to construct the training output matrix Y.
   3.3. Calculate the minimum norm least squares weight matrix B = H† Y, and get β_i from B^T = [β_1, . . . , β_d].
4. testing step: for a given testing sample x ∈ R^n, the output of the SLFN is given by y = Σ_{i=1}^d β_i g(w_i^T x + b_i).
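The following NumPy sketch follows the structure of Algorithm 7.5 on a toy regression problem: the hidden weights are drawn at random and never tuned, and the output weights are obtained with the Moore–Penrose pseudoinverse. The data, sigmoid activation, and sizes are illustrative assumptions only.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
N, n, d = 200, 3, 40                        # samples, input dimension, hidden nodes
X = rng.uniform(-1, 1, (N, n))
Y = np.sin(X.sum(axis=1, keepdims=True))    # toy target with m = 1 output

# Step 2: randomly assign input weights and biases (never tuned)
W = rng.standard_normal((d, n))
b = rng.standard_normal(d)

# Step 3: hidden layer output matrix H and output weights B = pinv(H) Y
H = sigmoid(X @ W.T + b)                    # N x d
B = np.linalg.pinv(H) @ Y                   # d x m, minimum-norm least squares

# Testing step: outputs for new samples
X_test = rng.uniform(-1, 1, (5, n))
Y_hat = sigmoid(X_test @ W.T + b) @ B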

It is shown [71] that the ELM can use a wide type of feature mappings (hidden layer output functions), including random hidden nodes and kernels. With this extension, the unified ELM solution can be obtained for feedforward neural networks, RBF network, LS-SVM, and PSVM. In ELM, the hidden layer need not be tuned. For one output node case, the output function of ELM for generalized SLFNs is given by fd (x) =

d 

βj hj (x) = hT (x)β,

(7.11.12)

j =1

where h(x) = [h1 (x), . . . , hd (x)]T is the output vector of the hidden layer with respect to the input x, and β = [β1 , . . . , βd ]T is the vector of the output weights between the hidden layer of d nodes. The output vector h(x) is a feature mapping: it actually maps the data from the n-dimensional input space to the d-dimensional hidden layer feature space (ELM feature space) H. For the binary classification problem, the decision function of ELM is ! " fd (x) = sign hT (x)β . (7.11.13) An ELM with a single-output node is used to generate N output samples: d  j =1

βj hj (xi ) = yi ,

i = 1, . . . , N

(7.11.14)

7.11 Extreme Learning Machine

547

which can be rewritten as the form hT (xi )β = yi , i = 1, . . . , N

or

Hβ = y,

(7.11.15)

where ⎤ h1 (x1 ) · · · hd (x1 ) ⎢ .. ⎥ , .. H = ⎣ ... . . ⎦ h1 (xN ) · · · hd (xN ) ⎡ ⎤ β1 ⎢ .. ⎥ β = ⎣ . ⎦, ⎡



(7.11.16)

(7.11.17)

βd

⎤ ⎤ ⎡ T y1 h (x1 ) ⎢ ⎥ ⎢ ⎥ y = ⎣ ... ⎦ = ⎣ ... ⎦ .

(7.11.18)

hT (xN )

yN

The constrained optimization problem for ELM regression and binary classification with a single-output node can be formulated as [71]: min β,ξi

 , N 1 C 2 LPELM = β22 + ξi , 2 2

(7.11.19)

i=1

subject to hT (xi )β = yi − ξi ,

i = 1, . . . , N.

(7.11.20)

The dual unconstrained optimization problem is min

β,ξi ,αi

 N 1 C 2 LDELM (β, ξi , αi ) = β22 + ξi 2 2 i=1



N 

 , αi hT (xi )β − yi + ξi

(7.11.21)

i=1

with the Lagrange multipliers αi ≥ 0, i = 1, . . . , N. From the optimality conditions one has ∂LDELM =0 ∂β



where α = [α1 , . . . , αN ]T .

β=

N  i=1

αi h(xi ) = HT α = HT α,

(7.11.22)

548

7 Neural Networks

Substituting (7.11.22) into (7.11.15) and using (7.11.18), we get the equation HT Hα = y,

(7.11.23)

where ⎤ hT (x1 ) ⎥ ⎢ HT H = ⎣ ... ⎦ [h(x1 ), . . . , h(xN )] ⎡

hT (xN ) ⎡ T ⎤ h (x1 )h(x1 ) · · · hT (x1 )h(xN ) ⎢ ⎥ .. .. .. =⎣ ⎦. . . . T T h (xN )h(x1 ) · · · h (xN )h(xN )

(7.11.24)

The LS solution of Eq. (7.11.23) is given by α = (HT H)† y.

(7.11.25)

If a feature mapping h(x) is unknown to users, one can apply Mercer’s conditions on ELM, and hence a kernel matrix for ELM can be defined as follows [71]: K(x, xi ) = hT (x)h(xi ).

(7.11.26)

If using (7.11.26), then (7.11.24) can be rewritten as in the following kernel form: ⎤ K(x1 , x1 ) · · · K(x1 , xN ) ⎥ ⎢ .. .. .. HT H = ⎣ ⎦ . . . K(xN , x1 ) · · · K(xN , xN ) ⎡

(7.11.27)

2 3 whose (i, j )th entry is HT H ij = K(xi , xj ) for i = 1, . . . , N ; j = 1, . . . , N . This shows that the feature mapping h(x) need not be used; instead, it is enough to use the corresponding kernel K(x, xi ), e.g., K(x, xi ) = exp(−γ x − xi ). Finally, from (7.11.22) it follows that the EML regression function is yˆ = hT (x)β = hT (x)

N 

αi h(xi ) =

i=1

N 

αi K(x, xi ),

(7.11.28)

i=1

while the decision function for EML binary classification is class of x = sign

N  i=1

 αi K(x, xi ) .

(7.11.29)

7.11 Extreme Learning Machine

549

Algorithm 7.6 shows the ELM algorithm for regression and binary classification.

Algorithm 7.6 ELM algorithm for regression and binary classification [71]
1. input: A training set {(xi, yi) | xi ∈ R^n, yi ∈ R, i = 1, . . . , N}, hidden node number d, and the kernel function K(u, v) = exp(−γ||u − v||).
2. initialization: y = [y1, . . . , yN]^T.
3. learning step:
   3.1. Use the kernel function to construct the matrix [H^T H]_{ij} = K(xi, xj), i, j = 1, . . . , N.
   3.2. Calculate the minimum norm least squares solution α = (H^T H)† y.
4. testing step: for a given testing sample x ∈ R^n, the ELM regression output is Σ_{i=1}^N αi K(x, xi), while the ELM binary classification is: class of x = sign( Σ_{i=1}^N αi K(x, xi) ).
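The following NumPy sketch mirrors Algorithm 7.6 for binary classification with the kernel K(u, v) = exp(−γ||u − v||). The toy labels, γ value, and data are illustrative assumptions, not values from the text.

import numpy as np

rng = np.random.default_rng(0)
N, n, gamma = 100, 2, 1.0
X = rng.standard_normal((N, n))
y = np.sign(X[:, 0] + X[:, 1])                      # toy binary labels in {-1, +1}

def kernel(A, B):
    # K(u, v) = exp(-gamma * ||u - v||), as used in Algorithm 7.6
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * np.sqrt(d2))

K = kernel(X, X)                                    # plays the role of H^T H
alpha = np.linalg.pinv(K) @ y                       # minimum norm least squares solution

x_new = np.array([[0.5, -0.2]])
label = np.sign(kernel(x_new, X) @ alpha)           # decision rule of the testing step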

7.11.3 Extreme Learning Machine Algorithm for Multiclass Classification

For multiclass applications, let the ELM have multiple output nodes instead of a single output node. For a k-class classifier, we are given the N training samples {xi, yi | xi ∈ R^n, yi ∈ R^d}, i = 1, . . . , N. If the original class label is p, then the jth entry of the expected output vector of the d output nodes, yi = [yi1, . . . , yid]^T ∈ R^d, is

    yij = { +1,  j = p;
            −1,  j ≠ p,        (7.11.30)

where p ∈ {1, . . . , k}. That is, only the pth entry of yi is +1, while the rest of the entries are set to −1. The primal classification problem of the mth class for the ELM with multiple output nodes can be formulated as [71]:

  N C 2 1 2 2 β m 2 + bm + = ξm,i 2 2 i=1 ⎧ ⎪ hT (x )β + bm = 1 − ξm,1 , ⎪ ⎨ m 1 m .. . ⎪ ⎪ ⎩ T hm (xN )β m + bm = 1 − ξm,N .

(m) LPELM

subject to

(7.11.31)

(7.11.32)

550

7 Neural Networks

The corresponding dual unconstrained optimization problem is given by    N C 2 1 (m) 2 + β m 22 + bm min LDELM = ξm,i 2 2 i=1



N 

   , . αm,i yi(m) hTm (xi )β m + bm − 1 + ξm,i

(7.11.33)

i=1

From the optimality conditions one has (m)

∂LDELM ∂β m

= 0 ⇒ βm =

∂bm

(m)

hm (xi ),

(7.11.34)

,

(7.11.35)

αm,i yi

i=1

(m)

∂LDELM

N 

= 0 ⇒ bm =

N 

(m)

αm,i yi

i=1

(m)

∂LDELM ∂ξm,i

= 0 ⇒ ξm,i = C −1 αm,i ,

(m)

∂LDELM ∂αm,i

(m)

= 0 ⇒ yi

(7.11.36)

  hTm (xi )β m + bm − 1 + ξm,i = 0,

(7.11.37)

for i = 1, . . . , N . Eliminating β m and ξm,i in the above equations yields the KKT equation bm =

N 

(m)

αm,i yi

= α Tm ym

(7.11.38)

i=1

and   −1 T T C I + Hm Hm + ym ym α m = 1,

(7.11.39)

where I is an N × N identity matrix, 1 is an N × 1 summing vector with all entries equal to 1, and 2 (m) 3 (m) Hm = y1 hm (x1 ), . . . , yN hm (xN ) , 2 (m) (m) 3T ym = y1 , . . . , yN ,

(7.11.41)

α m = [αm,1 , . . . , αm,N ]T .

(7.11.42)

(7.11.40)

7.12 Graph Embedding

551

It easily follows that the (i, j )th entry of the matrix Hm HTm can be represented as   T Hm Hm = yi yj hTm (xi )hm (xj ) = yi yj Km (xi , xj ),

(7.11.43)

ij

where Km (xi , xj ) = hTm (xi )hm (xj ) is the kernel function for the mth ELM classifier. Algorithm 7.7 shows the ELM multiclass classification algorithm. Algorithm 7.7 ELM multiclass classification algorithm [71] 1. input: A training set {(xi , yi )|xi ∈ Rn , yi ∈ Rd , i = 1, . . . , N }, hidden node number d, the number k of classes and the kernel function Km (u, v) for the mth classifier m = 1, . . . , k, such as Km (u, v) = exp(−γ u − v2 ). % & (m) T 2. initialization: Reconstruct the mth class’s output vector y(m) = y1(m) , . . . , yN with  + 1, y = m; i (m) for m = 1, . . . , k; i = 1, . . . , N . yi = − 1, yi = m; 3. learning step: 4. while m = 1, . . . , k 5. Use the kernel function Km (xi , xj ) and the mth class’s output vector y(m) to construct 2 3 (m) (m) the matrix HTm Hm ij = yi yj Km (xi , xj ), i, j = 1, . . . , N .  † 6. Calculate the minimum norm least square solution α m = HTm Hm + ym yTm + C −1 I 1. 7. Calculate bm = α Tm ym . 8. endwhile 9. testing step: For given testing sample x ∈ Rn , the decision function of the ELM multiclass classifier is given by ! " (m) N class of x = arg max i=1 αm,i yi Km (x, xi ) + bm . m=1,...,k

7.12 Graph Embedding

Graphs, such as social graphs/diffusion graphs in social media networks, word co-occurrence networks, user interest graphs in electronic commerce, knowledge graphs, and communication networks, exist naturally in a wide diversity of real-world scenarios. Analyzing or learning these graphs yields insight into the structure of society, language, and different patterns of communication. For example, in social network research, classification of users into meaningful social groups based on social graphs can lead to many useful practical applications such as user search, targeted advertising, and recommendations.

552

7 Neural Networks

Embedding a graph or an information network into a low-dimensional space is useful in a variety of applications. To conduct the embedding, the graph structures must be preserved. The first intuition is that the local graph structure, i.e., the local pairwise proximity between the vertices, must be preserved. Therefore, the two main tasks of graph embedding are dimension reduction and local graph structure preservation. The problem of dimensionality reduction arises in many fields of information processing: machine learning, data compression, neural computation, pattern recognition, scientific visualization, and so on.

7.12.1 Proximity Measures and Graph Embedding We first introduce the definition of the basic concepts in graph embedding. Suppose we are given a graph G = (V , E), where v ∈ V is a vertex or node and e ∈ E is an edge. G is associated with a node type mapping function fv : V → T v and an edge type mapping function fe : E → T e , where T v and T e denote the set of node types and edge types, respectively. Each node vi ∈ V belongs to one particular type, i.e., fv (vi ) ∈ T v . Similarly, for eij ∈ E, fe (eij ) ∈ T e . Graph learning is closely related to graph proximities and the graph embedding. Graph learning tasks can be broadly abstracted into the following four categories [41]: • Node classification aims at determining the label of nodes (a.k.a. vertices) based on other labeled nodes and the topology of the network. • Link prediction refers to the task of predicting missing links or links that are likely to occur in the future. • Clustering is used to find subsets of similar nodes and group them together. • Visualization helps in providing insights into the structure of the network. The most basic measure for both dimension reduction and structure preservation of a graph is the graph proximity. Proximity measures are usually adopted to quantify the graph property to be preserved in the embedded space. The microscopic structures of a graph can be described by its first-order proximity and second-order proximity. The first-order proximity between the vertices is their local pairwise similarity between only the nodes connected by edges. Definition 7.10 (First-Order Proximity [144]) The first-order proximity is the (1) observed pairwise proximity between two nodes vi and vj , denoted as Sij = sij , where sij is the edge weight between the two nodes. If no edge is observed between (1) nodes i and j , then their first-order proximity Sij = 0. The first-order proximity is the first and foremost measure of similarity between two nodes. The first-order proximity implies that two nodes in real-world networks are always similar if they are linked by an observed edge. For example, if a paper cites another paper, they should contain some common topic or keywords. However,

7.12 Graph Embedding

553

it is not sufficient to only capture the first-order proximity, and it is also necessary to introduce the second-order proximity to capture the global network structure. (1)

(1)

= si = [Si,1 , . . . ,

Definition 7.11 (Second-Order Proximity [144]) Let si (1)

(2)

(2)

(2)

Si,n ]T and si = [Si,1 , . . . , Si,n ]T be the first-order and the second-order proximity vectors between node i and other nodes, respectively. Then the second-order proximity Sij(2) is determined by the similarity of si and sj . If no vertex is linked from/to both i and j , then the second-order proximity between vi and vj is zero, (2)

i.e., Sij = 0. (2)

As a matter of fact, the second-order proximity Sij between a pair of vertices (1)

(i, j ) in a graph is the similarity between vi ’s neighborhood si (1) hood sj .

and vj ’s neighbor-

• The first-order proximity compares the similarity between the nodes i and j . The more similar two nodes are, the larger the first-order proximity value between them. • The second-order proximity compares the similarity between the nodes’ neighborhood structures. The more similar two nodes’ neighborhoods are, the larger the second-order proximity value between them. Similarly, we can define the higher-order proximity Sij(k) (where k ≥ 3) between a pair of vertices (i, j ) in a graph. (k)

Definition 7.12 (k-Order Proximity) Let si

(k)

(k)

= [Si,1 , . . . , Si,n ]T be the kth-order (k)

proximity vector between node i and other nodes. Then the kth-order proximity Sij (k−1)

is determined by the similarity of si

(k−1)

and sj

.

In particular, when k ≥ 3, the kth-order proximity is generally referred to as the (k) higher-order proximity. The matrix S(k) = [Sij ] is known as the k-order proximity matrix. When k ≥ 3, S(k) is called the higher-order proximity matrix. The higherorder proximity matrices are also defined using some other metrics, e.g., Katz Index, Rooted PageRank, Adamic-Adar, etc. that will be discussed in Sect. 7.13.3. By Definitions 7.10, 7.11, and 7.12, the first-order, second-order, and third-order proximity matrices S(k) = [Sij(k) ] ∈ Rn×n (where k = 1, 2, 3) are nonnegative matrices, respectively. If considering the cosine similarity as the k-order proximity, then for nodes vi and vj , we have the following results: (1)

1. The first-order proximity Sij = sij , where sij is the edge weight between the two nodes.


2. The second-order proximity

    S_ij^(2) = ⟨s_i, s_j⟩ / (||s_i|| · ||s_j||)
             = Σ_{l=1}^n S_il^(1) S_jl^(1) / ( sqrt(Σ_{l=1}^n (S_il^(1))^2) · sqrt(Σ_{l=1}^n (S_jl^(1))^2) )
             = Σ_{l=1}^n s_il s_jl / ( sqrt(Σ_{l=1}^n |s_il|^2) · sqrt(Σ_{l=1}^n |s_jl|^2) ).        (7.12.1)

In this way, the second-order proximity lies in [0, 1].
3. The third-order proximity

    S_ij^(3) = Σ_{l=1}^n S_il^(2) S_jl^(2) / ( sqrt(Σ_{l=1}^n (S_il^(2))^2) · sqrt(Σ_{l=1}^n (S_jl^(2))^2) ).        (7.12.2)
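The following NumPy sketch computes the first-, second-, and third-order proximity matrices of (7.12.1)-(7.12.2) from a small weighted adjacency matrix, using cosine similarity between rows. The example graph is an arbitrary illustration.

import numpy as np

def cosine_rows(S):
    # Pairwise cosine similarity between the rows of S
    norms = np.linalg.norm(S, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                 # avoid division by zero for isolated nodes
    Sn = S / norms
    return Sn @ Sn.T

# Weighted adjacency matrix = first-order proximity matrix S^(1)
S1 = np.array([[0, 1, 1, 0],
               [1, 0, 1, 0],
               [1, 1, 0, 2],
               [0, 0, 2, 0]], dtype=float)

S2 = cosine_rows(S1)    # second-order proximity, (7.12.1)
S3 = cosine_rows(S2)    # third-order proximity, (7.12.2)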

Definition 7.13 (Graph Embedding [14, 41]) Given the inputs of a graph G = (V , E), and a predefined dimensionality of the embedding d (d / |V |), the graph embedding is to convert G into a d-dimensional space Rd . In this space, the graph properties (such as the first-order, second-order, and higher-order proximities) are preserved as much as possible. The graph is represented as either a d-dimensional vector (for a whole graph) or a set of d-dimensional vectors with each vector representing the embedding of part of the graph (e.g., node, edge, substructure). Therefore, a graph embedding maps each node of graph G(E, V ) to a lowdimensional feature vector yi and tries to preserve the connection strengths between vertices. For example, a graph  embedding preserving first-order proximity might be obtained by minimizing i,j sij yi − yj 22 . Graph embedding is an important method for learning low-dimensional representations of vertices in networks, aiming to capture and preserve the network structure. Learning network representations faces the following great challenges [154]: 1. High nonlinearity: The underlying structure of a graph or network is highly nonlinear. Therefore, designing a model to capture the highly nonlinear structure is rather difficult. 2. Topology structure-preserving: To support applications in analyzing networks, network embedding is required to preserve the network structure. However, the underlying structure of the network is very complex. The similarity of vertices is dependent on both the local and global network structures. Therefore, how to simultaneously preserve the local and global structure is a tough problem. 3. Sparsity: Many real-world networks are often so sparse that only utilizing the very limited observed links is not enough to reach a satisfactory performance.

7.12 Graph Embedding

555

Definition 7.14 (Local Topology Preserving [101]) Suppose we are given a symmetric undirected graph G with edge weights sij = wij , and a corresponding embedding (y1 , . . . , yn ) for the n nodes of the graph. The embedding is said to be local topology preserving if the following condition holds: if wij ≥ wpq then yi − yj 22 ≤ yp − yq 22 , ∀ i, j, p, q.

(7.12.3)

Roughly speaking, this definition says that if two node pairs (vi , vj ) and (vp , vq ) are associated with connections strengths such that sij ≥ spq , then vi and vj will be mapped to points in the embedding space that will be closer to each other than the mapping of vp and vq . That is, for any pair of nodes (vi , vj ), the more similar they (1) are (the bigger the edge weight wij or the first-order proximity Sij is), the closer they should be embedded together (the smaller yi − yj  should be). To describe topology structure of the graph or network, it is necessary to introduce the metric distance between two nodes. If n points xi ∈ Rd , i = 1, . . . , n have (orthogonal) coordinates (xil , . . . , xid ), then the Euclidean (or Pythagorean) metric distance from xi to xj is given by  d 1/2  dij = (xil − xj l )2 .

(7.12.4)

l=1

Another commonly used metric distance is the Minkowski p-metric distance (or called p -metric distance) given by  d 1/p  p dij = (xil − xj l ) ,

p ≥ 1.

(7.12.5)

l=1

Clearly, the Euclidean metric distance is a special example of the Minkowski pmetric distance when p = 2. For p = 1, the Minkowski metric becomes the socalled city-block distance or Manhattan metric: dij =

d 

|xil − xj l |.

(7.12.6)

l=1

For p = ∞, the Minkowski metric becomes a familiar metric dij = max |xil − xj l |. l

(7.12.7)

It is well known that a local subspace of a nonlinear space can be linear. Similarly, a local structured subspace of a non-Euclidean space can be a locally Euclidean structured space as well. It should be noted that the Euclidean distance is available only for Euclidean structured space or locally Euclidean structured space in a graph

556

7 Neural Networks

or network, but the Minkowski p-metric distance (p = 2) can be used to described the distance between two nodes in non-Euclidean structured spaces. The generic problem of linear dimensionality reduction is: given a set of points x1 , . . . , xn in RD , find an n × d transformation matrix W that maps these n points to a set of points y1 , . . . , yn in Rd (d / D), such that yi = WT xi “represents” xi . The two popular linear dimensionality reduction techniques are the principal component analysis  (PCA) and linear discriminant analysis (LDA). Let Cx = n1 ni=1 xi xTi be the covariance matrix of the n data vectors x1 , . . . , xn and u1 , . . . , up be p eigenvectors associated with principal eigenvalues of C, where p / D. Then, the transformation matrix is given by W = [u1 , . . . , up ] ∈ Rn×p and the low-dimensional vectors yi = WT xi ∈ Rp are the good representation of the high-dimensional data vectors xi . While PCA seeks directions that are efficient for data representation, LDA seeks directions that are efficient for data discrimination. Suppose the data points belong to c classes. To ensure the between-class separability and within-class compactness, the objective function of LDA is given by wopt = arg max w

wT Sb w , wT Sw w

(7.12.8)

where the between-class scatter matrix Sb and the within-class scatter matrix Sw are computed as Sb =

c 

  T ni m(i) − m m(i) − m ,

(7.12.9)

i=1

Sw =

ni ! c  "! "T   (i) (i) (i) , x(i) x − m − m j j i=1

(7.12.10)

j =1

where m is the total sample mean vector, ni is the number of samples in the ith (i) class, m(i) is the mean vector of the ith class, and xj is the j th sample vector in the ith class. The LDA solution w is the generalized eigenvector associated with the maximum generalized eigenvalue of the matrix pencil (Sb , Sw ). The transformation matrix W consists of p principal generalized eigenvectors associated with p principal generalized eigenvalues of (Sb , Sw ). In the following, we focus on four graph embedding techniques: classical multidimensional scaling and three manifold learning techniques: isometric map, locally linear embedding, and Laplacian eigenmap.

7.12 Graph Embedding

557

7.12.2 Multidimensional Scaling We are given n points x1 , . . . , xn with xi ∈ RD . While a configuration is considered as n points in the model space, one may with equal validity consider it as a single point 

 x11 , . . . , x1D , . . . , xn1 , . . . , xnD

(7.12.11)

in the configuration space. Let dij denote the distance between nodes i and j , and sij be a measure of the similarity between them. For example, a measure of the psychological similarity of two stimuli in a paired-associate experiment is given by [133] 4 sij =

pij pj i pii pjj

,

(7.12.12)

where pij is the empirical estimate of the conditional probability of the response assigned to stimulus j when stimulus i is presented. Another example of the similarity data sij is [134] sij =

K 

wk pik pj k ,

(7.12.13)

k=1

where wk are the positive psychological weights of the discrete properties that both stimuli have in common, and pik =

⎧ ⎨ 1, if object i has property k; ⎩

(7.12.14) 0, otherwise.

Define the normalized stress [85] by S∗ =

 (dij − dˆij )2 ,

(7.12.15)

i 0,

i = 1, . . . , N,

(8.3.9)

i=1

where K(xi , xj ) = φ(xi ), φ(xj ) = φ T (xi )φ(xj ). The support vector machine regression aims at designing the following machine learning algorithms: • solve the maximization problem (8.3.8) and (8.3.9) for the Lagrangian multiplies αi , i = 1, . . . , N ;  • update the bias via b ← b − η N i=1 αi yi ; • use (8.3.5) to calculate the support vector regressor w.

8.3.2 -Support Vector Regression The basic idea of -support vector (SV) regression is to find a function f (x) = wT φ(x) + b that has at most  deviation from the actually obtained targets yi for all the training data x1 , . . . , xN , namely |f (xi ) − yi | ≤  for i = 1, . . . , N, and at the same time w is as flat as possible. In other words, we do not count errors as long as they are less than , but will not accept any deviation larger than this [36].

8.3 Support Vector Machine Regression

637

One way to ensure the flatness of w is to minimize its norm w22 = w, w. Hence, the basic form of -support vector regression can be written as a convex optimization problem [36, 43]: 1 w22 , 2 * yk − w, φ(xk ) − b ≤ , subject to . w, φ(xk ) + b − yk ≤ ; min w

(8.3.10) (8.3.11)

where  > 0 is a regression error. In order to avoid a violation of constrained conditions in (8.3.11), one can introduce slackness parameters (ξk , ξk∗ ), for given parameters C > 0 and  > 0. Hence we have the standard form of support vector regression given by Vapnik [44] . * N  1 w, w + C (ξk + ξk∗ ) (8.3.12) min w,b,ξk ,ξk∗ 2 k=1

subject to yk − (w, φ(xk ) + b) ≤  + ξk ,

(8.3.13)

+ ξk∗ ,

(8.3.14)

(w, φ(xk ) + b) − yk ≤  ξk , ξk∗ ≥ 0,

k = 1, . . . , N.

(8.3.15)

Here C > 0 is the regularization parameter that denotes SVM misclassification tolerance and is a pre-specified value. ξk represents the upper training error, and ξk∗ is the lower training error subject to -insensitive tube |yk − (w, φ(xk ) + b)| ≤ . To solve the above constrained optimization problem, one can define the Lagrange function (or Lagrangian) as follows [36]: L = L(w, b, ξk , ξk∗ )   1 = w22 + C (ξk + ξk∗ ) − (ηk ξk + ηk∗ ξk∗ ) 2 −

N 

N

N

k=1

k=1

  αk  + ξk − yk + w, φ(xk ) + b

k=1



N 

  αk∗  + ξk∗ + yk − w, φ(xk ) − b ,

(8.3.16)

k=1

where ηk , ηk∗ , αk , αk∗ are Lagrange multipliers which have to satisfy the positivity constraints: ηk(∗) , αk(∗) ≥ 0, where ηk = {ηk , ηk∗ } and αk = {αk , αk∗ }. (∗)

(∗)

(8.3.17)

638

8 Support Vector Machines

From the first-order optimality conditions it follows that N  (αk − αk∗ )φ(xk ),

∂L =0 ∂w



∂L =0 ∂b



∂L =0 ∂ξk



ηk + αk = C,

(8.3.20)

∂L =0 ∂ξk∗



ηk∗ + αk∗ = C.

(8.3.21)

w=

(8.3.18)

k=1 N 

(αk − αk∗ ) = 0,

(8.3.19)

k=1

Equation (8.3.18) is the so-called support vector expansion, i.e., w can be completely described as a linear combination of the training patterns xi . Substituting (8.3.18) and (8.3.19) into (8.3.16) can eliminate the dual variables ηi , ηi∗ and gives the dual function LD (α, α ∗ ) = −

1  (αi − αi∗ )(αj − αj∗ )φ(xi ), φ(xj ) 2 N

N

i=1 j =1

−

N N   (αi + αi∗ ) + yi (αi − αj∗ ). i=1

(8.3.22)

i=1

On the other side, from (8.3.17), (8.3.20), and (8.3.21) it is known that 0 ≤ αi , αi∗ ≤ C,

i = 1, . . . , N.

(8.3.23)

Equations (8.3.22) together with constraints (8.3.19) and (8.3.23) constitute the Wolfe dual -support vector machine regression problem: . N N   1 max∗ − (α − α ∗ )T Q(α − α ∗ ) −  (αi + αi∗ ) + yi (αi − αj∗ ) , α,α 2 *

i=1

i=1

(8.3.24) subject to

N 

(αk − αk∗ ) = 0,

(8.3.25)

k=1

0 ≤ αi , αi∗ ≤ C,

i = 1, . . . , N,

(8.3.26)

8.3 Support Vector Machine Regression

639

where Q = [K(xi , xj )]N,N i,j =1 is an N × N positive semi-definite matrix with K(xi , xj ) = φ(xi ), φ(xj ) being the kernel function of SVM. Then the regression function is given by f (x) = wT φ(x) + b =

N 

(αi − αi∗ )K(xi , x) + b,

(8.3.27)

i=1

where w is described by the support vector expansion Eq. (8.3.18). The KKT conditions for the dual constrained optimization problem (8.3.24) are given by [36] αi ( + ξi − yi + w, xi  + b) = 0, αi∗ (

+ ξi∗

(8.3.28)

− yi + w, xi  + b) = 0,

(8.3.29)

(C − αi )ξi = 0.

(8.3.30)

(C

− αi∗ )ξi∗

= 0.

(8.3.31)

From the above conditions it follows that  − yi + w, xi  + b ≥ 0

and

ξi = 0 if αi < C,

and

ξi∗

 − yi + w, xi  + b ≤ 0  − yi + w, xi  + b ≥ 0

=0

 − yi + w, xi  + b ≤ 0

(8.3.32)

if αi > 0,

(8.3.33)

if αi∗ if αi∗

< C,

(8.3.34)

> 0.

(8.3.35)

Then the computation of b is given by Smola and Schölkopf [36]: max{− + yi − w, xi |αi < C or αi∗ > 0} ≤ b ≤ min{− + yi − w, xi |αi > 0 or αi∗ < C}.

(8.3.36)

If some αi or αi∗ ∈ (0, C), then the inequalities become equalities.

8.3.3 ν-Support Vector Machine Regression In the standard support vector machine regression described above, an error of  is allowed at each point xi . By introducing a constant ν ≥ 0, the size of  can be traded off against model complexity and slack variables. This is the basic idea of the following ν-support vector machine regression developed by Schölkopf et al. [34]: *  . N 1 1  2 ∗ (ξi + ξi ) min w2 + C ν + (8.3.37) N w,ξi ,ξ ∗i , 2 i=1

640

8 Support Vector Machines

subject to (wT xi + b) − yi ≤  + ξi ,

(8.3.38)

yi − (wT xi + b) ≤  + ξi∗ ,

(8.3.39)

ξi ≥ 0, ξi∗ ≥ 0,

(8.3.40)

i = 1, . . . , N.

By introducing the Lagrange multipliers αi∗ .ηi∗ , β ≥ 0, one has the Lagrange function [34]:   L w, b, ξ ∗ , , α ∗ , β, η∗ =

N N  1 C w22 + Cν + (ξi + ξi∗ ) − β − (ηi ξi + ηi∗ ξi∗ ) 2 N i=1



N 

i=1

! " αi ξi + yi − wT xi − b + 

i=1



N 

! " αi∗ ξi∗ + wT xi + b − yi +  .

(8.3.41)

i=1

The first-order optimality conditions give the results ∂L =0 ∂w



∂L =0 ∂



∂L =0 ∂b



∂L =0 ∂ξi∗



N  w= (αi∗ − αi )xi ,

(8.3.42)

i=1

Cν −

N  (αi + αi∗ ) − β = 0,

(8.3.43)

i=1 N  (αi − αi∗ ) = 0,

(8.3.44)

i=1

C − αi∗ − ηi∗ = 0, N

i = 1, . . . , N;

(8.3.45)

for minimizing over the primal variables w, , b, ξi∗ , while maximizing over the dual variables αi∗ , ηi∗ , β yields ∂L =0 ∂αi∗



ξi∗ + wT xi + b − yi +  = 0,

and ∂L =0 ∂ηi∗



 = 0,

∂L =0 ∂β



ξi∗ = 0.

8.4 Support Vector Machine Binary Classification

641

Hence, the regression result is given by yi = wT xi + b.

(8.3.46)

Substituting the four constraints in (8.3.42)–(8.3.45) into the Lagrange function L defined in (8.3.41) leads to the optimization problem called the Wolfe dual. The Wolfe dual ν-support vector machine regression problem can be written as [34] ⎧ ⎨

⎫ N N N  ⎬   1 max W(α ∗ ) = (αi∗ − αi )yi − (αi∗ − αi )(αj∗ − αj )K(xi , xj ) ⎩ ⎭ 2 i=1 j =1

i=1

(8.3.47) subject to

   N N  C , (αi − αi∗ ) = 0, αi∗ ∈ 0, (αi + αi∗ ) ≤ Cν. N i=1

(8.3.48)

i=1

After solving the Wolfe dual problem, the regression value of the new instance x is given by f (x) =

N  (αi∗ − αi )K(x, xi ) + b.

(8.3.49)

i=1

Proposition 8.1 ([34]) Suppose ν-SVR is applied to some data set, and the resulting  is nonzero. The following statements hold: • ν is an upper bound on the fraction of errors. • ν is a lower bound on the fraction of SVs. The following function is a popular choice to illustrate SVR for regression in the literature: * sin(x)/x, x = 0; y(x) = (8.3.50) 0, x = 0.

8.4 Support Vector Machine Binary Classification A basic task in data analysis and pattern recognition is classification that requires the construction of a classifier, namely a function that assigns a class label to instances described by a set of attributes. In this section we discuss SVM binary classification problems with two classes.

642

8 Support Vector Machines

8.4.1 Support Vector Machine Binary Classifier Given a training set of N data points {xk , yk }, k = 1, . . . , N, where xk is the kth input pattern and yk is the kth output pattern. Denote the set of data   (8.4.1) D = (x1 , y1 ), . . . , (xN , yN ) , xk ∈ Rn , yk ∈ {−1, +1} A SVM classifier is a classifier which finds the hyperplane that separates the data with the largest distance between the hyperplane and the closest data point (called margin). A linear separating hyperplane is a decision function in the form f (x) = sign(wT x + b),

(8.4.2)

where x is the input pattern, w is the weight vector, and b is a bias term. Hence, we have ⎧ > 0, then x ∈ class S+ ; ⎪ ⎪ ⎨ T w x + b < 0, then x ∈ class S− ; (8.4.3) ⎪ ⎪ ⎩ = 0, then x ∈ class S+ or x ∈ class S− . Similarly, for a nonlinear classifier w with the nonlinear mapping φ : x → φ(x), its testing output y = f (x) = wT φ(x) + b, where b denotes a bias term. Then, we have ⎧ > 0, then x ∈ class S+ ; ⎪ ⎪ ⎨ wT φ(x) + b < 0, then x ∈ class S− ; (8.4.4) ⎪ ⎪ ⎩ = 0, then x ∈ class S+ or x ∈ class S− . Therefore the decision function of a nonlinear classifier is given by ! " class of x = sign wT φ(x) + b .

(8.4.5)

For a SVM with the kernel function K(x, xi ) = φ T (x)φ(xi ), the SVM classifier w is usually designed as w=

N 

αi yi φ(xi ).

(8.4.6)

i=1

Substituting (8.4.6) into (8.4.5), we directly get the decision function of any SVM classifier as follows:  N  αi yi K(x, xi ) + b . (8.4.7) class of x = sign i=1

8.4 Support Vector Machine Binary Classification

643

Hence, when designing any SVM classifier, its weighting vector w must have the form in (8.4.6). The set of vectors {x1 , . . . , xN } is said to be optimally separated by the hyperplane if it is separated without error and the distance between the closest vector to the hyperplane is maximal. For designing the classifier w, we assume that wT φ(xk ) + b ≥ 1,

if yk = +1,

(8.4.8)

wT φ(xk ) + b ≤ −1, if yk = −1.

(8.4.9)

That is to say, if the data is to be classified correctly, this hyperplane should ensure that ! " (8.4.10) yk wT φ(xk ) + b > 0, for all k = 1, . . . , N, assuming that y ∈ {−1, +1}. Here φ(xk ) is a nonlinear vector function mapping the input space Rn into a higher-dimensional space, but this function is not explicitly constructed. The distance of a point xk from the hyperplane, denoted as d(w, b; xk ), is defined as d(w, b; xk ) =

|w, φ(xk ) + b| . w

(8.4.11)

By the constraint condition yk (w, φ(xk ) + b) ≥ 1 in (8.4.10), the margin of a classifier is given by ρ(w, b) =

min

xk :yk =−1

d(w, b; xk ) + min d(w, b; xk ) xk :yk =1

|w, φ(xk ) + b| |w, φ(xk ) + b| + min xk :yk =−1 xk :yk =1 w w   1 min |w, φ(xk ) + b| + min |w, φ(xk ) + b| = xk :yk =1 w xk :yk =−1

=

min

=

2 . w

(8.4.12)

The optimal hyperplane wopt is given by maximizing the above margin, namely  w = arg max ρ(w, b) = w

2 w

, (8.4.13)

644

8 Support Vector Machines

which is equivalent to w = arg min w

1 w. 2

(8.4.14)

In order to avoid the possibility to violate (8.4.10), variables ξk need to be introduced such that ! " (8.4.15) yk wT φ(xk ) + b ≥ 1 − ξk , k = 1, . . . , N, ξk ≥ 0,

k = 1, . . . , N.

(8.4.16)

According to the structural risk minimization principle, the risk bound is minimized by formulating the optimization problem *

 1 min ξk w2 + C 2 N

. (8.4.17)

k=1

subject to (8.4.15), where C is a user-specified parameter for providing a trade-off between the distance of the separating margin and the training error. Therefore the primary problem for SVM binary classifier is a constrained optimization problem: *

. N  1 2 ξi min LPSVM = w2 + C 2 i=1 ! " subject to yi wT φ(xi ) + b ≥ 1 − ξi , ξi ≥ 0 (i = 1, . . . , N ).

(8.4.18) (8.4.19)

The dual form of the above primary optimization problem is given by ,  N N N  1  min LDSVM = yi yj αi αj φ(xi ), φ(xj ) − αi 2 i=1 j =1

subject to

N 

(8.4.20)

i=1

αi yi = 0, 0 ≤ αi ≤ C (i = 1, . . . , N ),

(8.4.21)

i=1

where αi is the Lagrange multiplier corresponding to ith training sample (xi , yi ), and vectors xi satisfying yi (wT φ(xi ) + b) = 1 are termed support vectors.

8.4 Support Vector Machine Binary Classification

645

In SVM learning algorithms, kernel functions K(u, v) = φ(u), φ(v) are usually used, and the dual optimization problem for SVM binary classifier is represented as  , N N N  1  min LDSVM = yi yj αi αj K(xi , xj ) − αi 2 i=1 j =1

subject to

N 

(8.4.22)

i=1

αi yi = 0, 0 ≤ αi ≤ C (i = 1, . . . , N ).

(8.4.23)

i=1

By Bousquet et al. [5, pp. 14–15], several important practical points should be taken into account when designing classifiers. 1. In order to reduce the likelihood of overfitting the classifier to the training data, the ratio of the number of training examples to the number of features should be at least 10:1. For the same reason the ratio of the number of training examples to the number of unknown parameters should be at least 10:1. 2. Importantly, proper error-estimation methods should be used, especially when selecting parameters for the classifier. 3. Some algorithms require the input features to be scaled to similar ranges, such as some kind of a weighted average of the inputs. 4. There is no single best classification algorithm!

8.4.2 ν-Support Vector Machine Binary Classifier Similar to ν-SVR, by introducing a new parameter ν ∈ (0, 1] in SVM classification, Schölkopf et al. [34] presented ν-support vector classifier (ν-SVC). The primal optimization problem of ν-SVC is described by  min

w,ξ ,ρ

 1 w22 − νρ + ξi 2 N

, (8.4.24)

i=1

subject to yi (wT φ(xi ) + b) ≥  − ξi , ξi ≥ 0 (i = 1, . . . , N ), ρ ≥ 0. (8.4.25) Let Lagrange multipliers αi , βi , δ ≥ 0, and consider the Lagrange function L(w, ξ , b, ρ, α, β, δ) =

N 1 1  w22 − νρ + ξi − δρ 2 N i=1



N !  i=1

% & " αi yi (wT xi + b) − ρ + ξi + βi ξi .

(8.4.26)

646

8 Support Vector Machines

This function needs to be minimized with respect to the primal variables w, ξ , b, ρ and maximized over the dual variables α, β, δ. The first-order optimization conditions for minimization give the following results: ∂L =0 ∂w



∂L =0 ∂ξi



∂L =0 ∂b



∂L =0 ∂ρ



w=

N 

αi yi xi ,

(8.4.27)

i=1

αi + βi = 1/N, N 

i = 1, . . . , N,

(8.4.28)

αi yi = 0,

(8.4.29)

αi − δ = ν.

(8.4.30)

i=1 N  i=1

If substituting (8.4.27)–(8.4.30) into the Lagrange function L, using αi , βi , δ ≥ 0 and incorporating the kernel function K(x, xi ) instead of xi in dot product, then one obtains the Wolfe dual optimization problem for ν-SVC as follows: ⎧ ⎨

N N 1 

max W(α) = − α ⎩ 2

⎫ ⎬ αi αj yi yj K(xi , xj )

i=1 j =1

subject to 0 ≤ αi ≤ 1/N (i = 1, . . . , N);

(8.4.31)



N 

αi yi = 0;

i=1

N 

αi ≥ ν.

i=1

(8.4.32) To compute parameters b and ρ in the primal ν-SVC optimization problem, consider two sets S+ and S− , containing support vectors xi with 0 ≤ αi < 1 and yi = ±1. Let s1 = |S+ | = |{i|0 < αi < 1, yi = 1}|,

(8.4.33)

s2 = |S− | = |{i|0 < αi < 1, yi = −1}|

(8.4.34)

be the sizes of the sets S+ and S− , respectively. If defining r1 =

N 1  αj yj K(x, xj ), s1 x∈S+ j =1

(8.4.35)

8.4 Support Vector Machine Binary Classification

r2 = −

647

N 1  αj yj K(x, xj ), s2

(8.4.36)

x∈S− j =1

then one has [7, 34] b=−

r1 − r 2 2

and

ρ=

r1 + r2 . 2

(8.4.37)

8.4.3 Least Squares SVM Binary Classifier Standard SVMs are powerful tools for data classification by assigning them to one of two disjoint halfspaces in either the original input space for linear classifiers, or in a higher-dimensional feature space for nonlinear classifiers. The least squares SVM (LS-SVMs) developed in [37] and the proximal SVMs (PSVMs) presented in [14, 15] are two much simpler classifiers, in which each class of points is assigned to the closest of two parallel planes (in input or feature space) such that they are pushed apart as far as possible. The LS-SVMs formulate the binary classification problem as 

, N 1 C 2 ek w22 + w,b,e 2 2 k=1 ! " subject to yk wT φ(xk ) + b = 1 − ek , min

(8.4.38) k = 1, . . . , N,

(8.4.39)

for the binary classification yk ∈ {−1, 1}. Define the Lagrange function (Lagrangian) L = L(w, b, e; α) =

N N & " 1 C 2  % ! T w22 + ek − αk yk w φ(xk ) + b − 1 + ek , 2 2 k=1

(8.4.40)

k=1

where αk are Lagrange multipliers. Different from Lagrange multipliers in SVM with inequality constraints, in LS-SVM, Lagrange multipliers αk can be either positive or negative due to the equality constraints used. Based on the KKT conditions, one has the optimality conditions of (8.4.40) as follows [11]:  ∂L =0 ⇒ w= αk yk φ(xk ) ⇒ w = Zα, ∂w N

∇w L =

k=1

(8.4.41)

648

8 Support Vector Machines

∇b L =

N  ∂L =0 ⇒ αk yk = 0 ⇒ yT α = 0, ∂b

(8.4.42)

k=1

∂L = 0 ⇒ αk = Cek , k = 1, . . . , N ⇒ α = Ce, (8.4.43) ∂ek ! " ∂L ∇αk L = = 0 ⇒ yk wT φ(xk ) + b − 1 + ek = 0, k = 1, . . . , N, ∂αk ∇e L =

⇒ ZT w + by + e = 1,

(8.4.44)

where Z = [y1 φ(x1 ), . . . , yN φ(xN )] ∈ Rm×N , y = [y1 , . . . , yN ]T , 1 = [1, . . . , 1]T , e = [e1 , . . . , eN ]T and α = [α1 , . . . , αN ]T . The KKT equations (8.4.41)–(8.4.44) can be written as the following matrix equation form [11]: ⎡

I ⎢0 ⎢ ⎣0 ZT

0 0 0 y

0 0 CI I

⎤⎡ ⎤ ⎡ ⎤ −Z w 0 ⎢ b ⎥ ⎢0⎥ −yT ⎥ ⎥⎢ ⎥ = ⎢ ⎥. −I ⎦ ⎣ e ⎦ ⎣0⎦ 0 α 1

(8.4.45)

By eliminating w and e, the above KKT equations are simplified as      b 0 yT 0 = . y ZT Z + C −1 I α 1

(8.4.46)

Mercer’s condition can be applied again to the matrix ZT Z to get [37] [ZT Z]ij = yi yj φ T (xi )φ(xj ) = yi yj K(xi , xj ).

(8.4.47)

Given a training set {(xi , yi )|xi ∈ Rn , yi ∈ {−1, 1}, i = 1, . . . , N }, constant C > 0, and the kernel function K(xi , xj ), the LS-SVM binary classification algorithm performs the following learning step: • Construct the N × N matrix [ZT Z]ij = yi yj φ T (xi )φ(xj ) = yi yj K(xi , xj ). • Solve the KKT matrix equation (8.4.46) for α = [α1 , . . . , αN ]T and b. In testing step, for given testing sample x ∈ Rn , its decision is given by ⎛ ⎞ N  class of x = sign ⎝ αj yj K(x, xj ) + b⎠ . j =1

(8.4.48)

8.4 Support Vector Machine Binary Classification

649

8.4.4 Proximal Support Vector Machine Binary Classifier Due to the simplicity of their implementations, least square SVM (LS-SVM) and proximal support vector machine (PSVM) have been widely used in binary classification applications. In the standard SVM binary classifier a plane midway between two parallel bounding planes is used to bound two disjoint halfspaces each of which contains points mostly of class 1 or 2. Similar to the LS-SVM, the key idea of PSVM [14] is that the separation hyperplanes are “proximal” planes rather than bounded planes anymore. Proximal planes classify data points depending on proximity to either one of the two separation planes. We aim to push away proximal planes as far apart as possible. Different from LS-SVM, PSVM uses (w22 + b2 ) instead of w22 as the objective function, making the optimization problem strongly convex and has little or no effect on the original optimization problem [22]. The primal optimization problem of linear PSVM is described by ,  N 1 C 2 2 2 min LPPSVM (w, b, ξi ) = (w2 + b ) + ξi , 2 2

(8.4.49)

subject to yi (wT xi + b) = 1 − ξi ,

(8.4.50)

i=1

i = 1, . . . , N.

By Fung and Mangasarian [15], the PSVM formulation (8.4.49) can be also interpreted as a regularized least squares solution [38] of the system of linear equations yi (xTi w + b) = 1, i = 1, . . . , N, that is, finding an approximate solution - -2 - -

w (w, b) to yi (xTi w + b) = 1 with least 2-norm -- -- = w22 + b2 . b

2

The corresponding dual unconstrained optimization problem is given by min LDPSVM (w, b, ξi , αi ) =

N N " 1 C 2  ! (w22 + b2 ) + ξi − αi yi (wT xi + b) − 1 + ξi . 2 2 i=1

i=1

(8.4.51)

∂L ∂αi

L ∂L Using the first-order optimality conditions ∂∂w = 0, ∂∂bL = 0, ∂ξ = 0, and i

= 0, and eliminating w and ξi , one gets the KKT equation of linear PSVM classifier in matrix form: ! " C −1 I + ZZT + yyT α = 1, (8.4.52)

650

8 Support Vector Machines

and b=

N 

αi yi .

(8.4.53)

i=1

Here 2 3 Z = y1 x1 , . . . , yN xN ,

(8.4.54)

y = [y1 , . . . , yN ] ,

(8.4.55)

α = [α1 , . . . , αN ] .

(8.4.56)

T

T

Similar to LS-SVM, the training data x can be mapped from the input space Rn into the feature space φ : x → φ(x). Hence, the nonlinear PSVM classifier has still the KKT equation (8.4.52) with Z = [y1 φ(x1 ), . . . , yN φ(xN )] in Eq. (8.4.54). Algorithm 8.1 shows the PSVM binary classification algorithm. Algorithm 8.1 PSVM binary classification algorithm input: Training set {(xi , yi )|xi ∈ Rn , yi ∈ {−1, 1}, i = 1, . . . , N }, constant C > 0 and the kernel function K(x, xi ). initialization: y = [y1 , . . . , yN ]T . learning step: 1. Construct the N × N matrix [ZT Z]ij = yi yj φ T (xi )φ(xj ) = yi yj K(xi , xj ). 2. Solve the KKT matrix equation (8.4.52) for α = [α1 , . . . , αN ]T .  3. Compute b = N i=1 αi yi . n testing step: for given testing is given by ! sample x ∈ R , its decision " N class of x = sign α y K(x, x ) + b . i=1 i i i

The decision functions of binary SVM, LS-SVM, and PSVM classifiers have the same form  N  αi yi K(x, xi ) + b , (8.4.57) f (x) = sign i=1

where yi is the corresponding target class label of the training data xi , αi is the Lagrange multiplier to be computed by the learning machines, and K(x, xi ) is a suitable kernel function to be given by users. The followings are the comparisons of SVM, LS-SVM, and PSVM. • SVM, LS-SVM, and PSVM are originally proposed for binary classification. • LS-SVM and PSVM provide fast implementations of the traditional SVM. Both LS-SVM and PSVM use equality optimization constraints instead of inequalities from the traditional SVM, which results in a direct least square solution by avoiding quadratic programming.

8.4 Support Vector Machine Binary Classification

651

It should be noted [22] that the Lagrange multipliers αi are proportional to the training errors ξi in LS-SVM, while in the conventional SVM, many Lagrange multipliers αi are typically equal to zero. As compared to the conventional SVM, sparsity is lost in LS-SVM; this is true to PSVM as well.

8.4.5 SVM-Recursive Feature Elimination Decision functions that are simple weighted sums of the training patterns plus a bias are called linear discriminant functions [9]: D(x) = w, x + b,

(8.4.58)

where w is the weight vector and b is a bias value. Construct the optimal hyperplane (w, b) such that wTopt x + bopt = 0

(8.4.59)

which separates a set of training data (x1 , y1 ), . . . , (xn , yn ). Vectors xi such that yi (wT xi + b) = 1 is termed support vectors. Hence, the constrained optimization problem for finding the optimal weight vector w can be described as  , 1 min f (w, b) = w22 , w,b 2 subject to yi (xTi w + b) ≥ 1,

(8.4.60) i = 1, . . . , n,

(8.4.61)

where the inequality constraint is for ensuring x to be support vectors. The above constrained optimization can be rewritten as the unconstrained optimization in Lagrangian form:  , n  1 T αi [yi (xi w + b) − 1] , min L(w, b, α) = w2 − w,b,α 2

(8.4.62)

i=1

where α = [α1 , . . . , αn ]T is the vector of nonnegative Lagrangian multiplies αi ≥ 0, i = 1, . . . , n. The optimization conditions with respect to w and b are given by   n n   ∂L(w, b, α) = w− αi yi xi = 0 ⇒ w = αi yi xi , ∂w i=1

∂L(x, b, α) = ∂b

n  i=1

αi yi = 0,

(8.4.63)

i=1

(8.4.64)

652

8 Support Vector Machines

Substituting the above two equations into (8.4.62) to yield [8] 

 1  αi αj yi yj xTi xj − αi min J (α) = α 2 n

n

i=1 j =1

,

i=1

subject to 0 ≤ α ≤ CI and

α T y = 0.

(8.4.65)

In conclusion, when used for classification, SVMs separate a given set of binary labeled training data (xi , yi ) with hyperplane (w, b) that is maximally distant from them. Such a hyperplane is known as “the maximal margin hyperplane.” In nonlinear binary classification cases, Eq. (8.4.65) becomes   n n   , 1 T min αi αj yi yj φ (xi )φ(xj ) − αi α 2 i=1 j =1

subject to 0 ≤ α ≤ CI

i=1

and

α y = 0. T

(8.4.66)

The goal of the SVM-recursive feature elimination (SVM-RFE) algorithm, proposed by Guyon et al. [20], is to find a subset of size r among d variables (r < d) which maximizes the performance of the predictor, based on a backward sequential selection. Starting all the features, one feature is removed at a time. The removed ith feature is the one whose removal minimizes the variation of w22 [31]:   w2 − w(i) 2  2

2

 d d  d d    1   (i) (i) =  αj αk yj yk K(xj , xk ) − αj αk yj yk K (i) (xj , xk ), 2 j =1 k=1

j =1 k=1

(8.4.67) (i) (i) where [K(i) ]j k = Kj(i) k = φ(xj , φ(xk  is the (j, k)th entry of the Gram matrix (i)

K(i) of the training data when variable i is removed, and αj is the corresponding (i)

solution of (8.4.66). For the sake of simplicity, αj = αj is usually taken even if a variable has been removed in order to reduce the computational complexity of the SVM-RFE algorithm. Algorithm 8.2 shows the SVM-RFE algorithm that is an application of RFE using the weight magnitude as ranking criterion.

8.5 Support Vector Machine Multiclass Classification

653

Algorithm 8.2 SVM-recursive feature elimination (SVM-RFE) algorithm [20] 1. input: Training data X = {x1 , . . . , xd }, class labels Y = {y1 , . . . , yd } and expected feature number r. 2. initialization: Index subset of surviving features S = {1, . . . , d}. 3. repeat 4. Restrict training examples to good feature indices (X, Y ). 5. Solve (8.4.65) or (8.4.66) for the classifier α.  6. Compute the weight vector of dimension m = length(S) as w = m k=1 αk yk xk . 7. Compute the ranking criteria ci = (wi )2 for all i = 1, . . . , m. 8. Find the feature index with smallest ranking criterion using index i = arg min{c1 , . . . , cm }. 9. Eliminate the variable i with smallest ranking criterion and update X ← X \ xi , Y ← Y \ yi and S ← S \ i. 10. until length(S) = r. 11. output: feature ranked list X.

8.5 Support Vector Machine Multiclass Classification A multiclass classifier is a function H : X → Y that maps an instance x ∈ X (for example, X = Rn ) into an element y of Y (for example, y ∈ {1, . . . , k}).

8.5.1 Decomposition Methods for Multiclass Classification A popular way to solve a k-class problem is to decompose it to a set of L binary classification problems. One-versus-one, one-versus-rest, and directed acyclic graph SVM methods are three of the most common decomposition approaches [21, 41, 48]. One-Against-All Method The standard method for k-class SVM classification is to construct L = k binary classifiers Cm , m = 1, . . . , k. The ith SVM will be trained with all of the examples in the ith class with positive labels, and all other examples with negative labels. Then, the weight vector wm for the mth model can be generated by any linear classifier. Such an SVM classification method is referred to as one-against-all (OAA) or one-versus-rest (OVR) method [4, 27]. Let S = {(x1 , y1 ), . . . , (xN , yN )} be a set of N training examples, and each example xi be drawn from a domain X ⊆ Rn and yi ∈ {1, . . . , k} is the class of xi . The mth one-against-all classifier solves the following constrained optimization problem:  min

wm ,bm ,ξm

, N  1 2 wm 2 + C ξm,i , 2 i=1

(8.5.1)

654

8 Support Vector Machines

subject to wTm φ(xi ) + bm ≥ 1 − ξm,i , if yi = m, wTm φ(xi ) + bm ≤ −1 + ξm,i , if yi = m, ξm,i ≥ 0,

i = 1, . . . , N,

(8.5.2)

where wm = [wm,1 , · · · , wm,n ]T , m = 1, . . . , k is the weight vector of the mth classifier, C is the penalty parameter, and ξm,i is the training error corresponding to the mth class and the ith training data xi that is mapped to a higher-dimensional space by the function φ(xi ). Minimizing 12 wm 22 means maximizing the margin 2/wm  between two groups  of data. When data are not linear separable, there is a penalty term C N i=1 ξm,i which can reduce the training errors in the mth classifier. The basic concept behind one-against-all SVM is to searchfor a balance between the regularization term 12 wm 22 and the training error C N i=1 ξm,i for the m classifiers. By “decision function” it means a function f (x) whose sign represents the class assigned to data point x. The solution of (8.5.1) gives k decision functions wTm φ(x) + bm , m = 1, . . . , k. Hence, a new testing instance x is said to be in the class which has the largest value among k decision functions: / 0 class of x = arg max wTm φ(x) + bm .

(8.5.3)

m=1,...,k

One-Against-One Method The one-against-one (OAO) method [13, 25] is also known as one-versus-one (OVO) method. This method constructs L = k(k −1)/2 binary classifiers by solving k(k − 1)/2 binary classification problems [25]. Each binary classifier constructs a model with data from one class as positive and another class as negative. Since there is k(k − 1)/2 combinations of two classes, one needs to construct k(k − 1)/2 weight vectors: w1,2 , . . . , w1,k ; w2,3 , . . . , w2,k ; . . . ; wk−1,k . For training data from the ith and the j th classes, one solves the following binary classification problem: * min

wij ,bij ,ξ ij

. N  1 ij T ij ij ij T (w ) w + C ξn (w ) φ(xn ) , 2

(8.5.4)

n=1

ij

subject to (wij )T φ(xn ) + bij ≥ 1 − ξn , ij

(wij )T φ(xn ) + bij ≤ −1 + ξn , ij

ξn ≥ 0,

n = 1, . . . , N,

if yn = i, if yn = j,

(8.5.5) (8.5.6) (8.5.7)

where wij is the binary classifier for the ith and the j th classes, bij is a real parameter, and ξij is the training error corresponding to the ith and the j th classes.

8.5 Support Vector Machine Multiclass Classification

655

After all k(k −1)/2 binary classifiers are constructed by solving the above binary problems for i, j = 1, . . . , k, the following voting strategy suggested in [25] can be used: if the decision function sign((wij )T φ(x) + bij ) determines that x is in ith class, then vote for the ith class is added by one. Otherwise, the vote for j th is added by one. Finally, x is predicted to be in the class with the largest vote. This voting approach is also called the “Max Wins” strategy. In case that two classes have identical votes, thought it may not be a good strategy, we simply select the one with the smaller index [21]. DAGSVM Method Directed acyclic graph SVM method [27] is simply called DAGSVM method whose training phase is the same as the one-against-one method by solving binary SVMs. However, in the testing phase, it uses a rooted binary directed acyclic graph which has internal nodes and leaves. The training phase of the DAGSVM method is the same as the one-against-one method by solving k(k − 1)/2 binary SVM classification problems, but uses one different voting strategy, called a rooted binary directed acyclic graph which has k(k − 1)/2 internal nodes and k leaves, in the testing phase. Each node is a binary SVM classifier of ith and j th classes. A directed acyclic graph (DAG) is a graph whose edges have an orientation and no cycles. Definition 8.7 (Decision Directed Acyclic Graph [27]) Given a space X and a set of Boolean functions F = {f : X → {0, 1}}, the decision directed acyclic graphs (DDAGs) on k classes over F are functions which can be implemented using a rooted binary DAG with k leaves labeled by the classes where each of the L = k(k − 1)/2 internal nodes is labeled with an element of F. The nodes are arranged in a triangle with the single root node at the top, two nodes in the second layer, and so on until the final layer of k leaves. The ith node in layer j < k is connected to the ith and (i + 1)th node in the (j + 1)st layer. For k-class classification, a rooted binary directed acyclic graph has k layers: the top layer has a single root node, the second layer has two nodes, and so on until the kth (i.e., final) layer has k nodes, so a rooted binary DAG has k leaves and 1 + 2 + · · · + k = k(k − 1)/2 internal nodes. Each node is a binary SVM of ith and j th classes. The DDAG is equivalent to operating on a list, where each node eliminates one class from the list. The list is initialized with a list of all classes. A test point is evaluated against the decision node that corresponds to the first and last elements of the list. If the node prefers one of the two classes, the other class is eliminated from the list, and the DDAG proceeds to test the first and last elements of the new list. The DDAG terminates when only one class remains in the list. Thus, for a problem with k classes, k − 1 decision nodes will be evaluated in order to derive an answer. Example 8.1 Given N test samples {xi , yi }, i = 1, . . . , N with xi ∈ Rn and yi ∈ {1, 2, 3, 4}. The DDAG is equivalent to operating on a list {1, 2, 3, 4}, where each

656

8 Support Vector Machines

Fig. 8.2 The decision directed acyclic graphs (DDAG) for finding the best class out of four classes

1 2 3 4

1 vs 4

not 1 2 3 4 no t 2

3 4

2 vs 4

1 2 3

no t 4

2 3

3 vs 4

4

not 4

3

1 vs 3

not 1

not 3

1 2

2 vs 3

2

1 vs 2

1

node eliminates one class from the list. The list is starting at the root node 1 versus 4 at the top layer, the binary decision function is evaluated. If the output value is not Class 1, then it is eliminated from the list to get a new list {2, 3, 4} and thus we make the binary decision of two classes 2 versus 4. If the root node prefers Class 1, then Class 4 at the root node is removed from the list to yield a new list {1, 2, 3} and the binary decision is made for two classes 1 versus 3. Then, the second layer has two nodes (2 vs 4) and (1 vs 3). Therefore, we go through a path until DDAG terminates when only one class remains in the list. Figure 8.2 shows the decision directed acyclic graphs (DDAG) of the above four classes [27]. DDAGs naturally generalize the class of Decision Trees, allowing for a more efficient representation of redundancies and repetitions that can occur in different branches of the tree, by allowing the merging of different decision paths [27].

8.5.2 Least Squares SVM Multiclass Classifier The previous subsection discussed the support vector machine binary classifier. In this subsection we discuss multi-class classification cases. For the multiclass case with k labels, LS-SVM uses k output nodes in order to encode multiclasses, where yi,j denotes the output value of the j th output node for the training data xi [41]. The k outputs can be used to encode up to 2k different

8.5 Support Vector Machine Multiclass Classification

657

classes. For multiclass case, the primal optimization problem of LS-SVM can be represented as [41] ,  k k N 1 C  2 min LLS−SVM = wm 2 + ξm,i 2 2 m=1 i=1 m=1 ⎧ (1) T ⎪ ⎪yi (w1 φ 1 (xi ) + b1 ) = 1 − ξ1,i ⎨ .. subject to . ⎪ ⎪ ⎩ (k) T yi (wk φ k (xi ) + bk ) = 1 − ξk,i

(8.5.8)

for i = 1, . . . , N . The Lagrange function in dual LS-SVM multiclass classifier is given by LD =

k k N 1 C  2 wm 2 + ξm,i 2 2 m=1



(8.5.9)

i=1 m=1

k N  

% ! " & αm,i yi wTm φ m (xi ) + bm − 1 + ξm,i

(8.5.10)

i=1 m=1

Similar to the LS-SVM solution to the binary classification, the conditions for optimality are given by  ∂LD (m) = 0 ⇒ wm = αm,i yi φ m (xi ) ⇒ wm = Zm α m , ∂wm

(8.5.11)

N  ∂LD =0 ⇒ αm,i yi(m) = 0 ⇒ α Tm y(m) = 0, ∂bm

(8.5.12)

N

i=1

i=1

∂LD = 0 ⇒ αm,i = Cξm,i ⇒ α m = Cξ m , ∂ξm,i ! " ∂LD = 0 ⇒ yi(m) wTm φ m (xi ) + bm − 1 + ξm,i = 0, ∂αm,i ⇒ ZTm wm + bm y(m) + ξ m = 1.

(8.5.13)

(8.5.14)

In the above equations: (m)

(m)

Zm = [y1 φ m (x1 ), . . . , yN φ m (xN )],

(8.5.15)

wm = [wm,1 , . . . , wm,N ]T ,

(8.5.16)

(m)

(m)

y(m) = [y1 , . . . , yN ]T ,

(8.5.17)

658

8 Support Vector Machines

α m = [αm,1 , . . . , αm,N ]T ,

(8.5.18)

ξ m = [ξm,1 , . . . , ξm,N ]T .

(8.5.19)

Due to each classifier being binary, for given yi ∈ {1, . . . .k}, one has (m) yi

=

 +1, yi = m;

(m = 1, . . . , k; i = 1, . . . , N )

(8.5.20)

−1, yi = m.

Equations (8.5.15)–(8.5.19) can be rewritten as the KKT equation in matrix form: 

0 (y(m) )T (m) y Ω (m)



   bm 0 = , αm 1

m = 1, . . . , k,

(8.5.21)

  where Ω (m) = ZTm Zm + C −1 I with the (i, j )th entries (m)

(m) (m) yj Km (xi , xj ) + C −1 δij ,

Ω ij = yi

i, j = 1, . . . , N,

(8.5.22)

where δij = 1 (if i = j ) or 0 (otherwise); and Km (xi , xj ) = φ Tm (xi )φ m (xj ) is the kernel function of the mth SVM for multiclass classification. Algorithm 8.3 shows the LS-SVM multiclass classification algorithm. Algorithm 8.3 LS-SVM multiclass classification algorithm [41] input: Training set {(xi , yi )|xi ∈ Rn , yi ∈ {1, . . . , k}, i = 1, . . . , N }, constant C > 0 and the kernel function Km (x, xi ), m = 1, . . . , k. * +1, yi = m; (m) initialization: yi = for m = 1, . . . , k; i = 1, . . . , N . −1, yi = m; learning step: while m = 1, . . . , k

% & (m) T 1. Construct the N × 1 vector y(m) = y1(m) , . . . , yN .

2. Use (8.5.22) to construct all (i, j )th entries of the N × N matrix Ω (m) . 3. Solve the KKT matrix equation (8.5.21) for bm and α m = [αm,i ]N i=1 . end while testing step: for given testing!sample x ∈ Rn , its decision is given " by N (m) class of x = arg max α y K (x, x ) + b m m . i=1 m,i i i m=1,...,k

8.5 Support Vector Machine Multiclass Classification

659

8.5.3 Proximal Support Vector Machine Multiclass Classifier The primal constrained optimization problem of PSVM multiclass classifier is described by  , N 1 C 2 2 2 min LPPSVM (m)(wm , bm , ξm,i ) = (wm 2 + bm ) + ξm,i 2 2 i=1 ⎧ (1) ⎪ y (wT1 φ m (x1 ) + b1 ) = 1 − ξ1,i , ⎪ ⎨ i .. subject to . ⎪ ⎪ ⎩ (k) T yi (wk φ m (xi ) + bk ) = 1 − ξk,i .

(8.5.23)

(8.5.24)

The corresponding dual unconstrained optimization problem is given by min

(m)

LDPSVM =

N 1 C 2 2 (wm 22 + bm )+ ξm,i 2 2 i=1



N 

! " (m) αm,i yi (wTm φ m (xi ) + bm ) − 1 + ξm,i .

(8.5.25)

i=1

From the optimality conditions we have ∂L(m) DPSVM ∂wm

=0



wm =

∂bm

(m)

αm,i yi

φ m (xi ),

(8.5.26)

,

(8.5.27)

i=1

(m)

∂LDPSVM

N 

=0



bm =

N 

(m)

αm,i yi

i=1

(m)

∂LDPSVM ∂ξm,i ∂L(m) DPSVM ∂αm,i

=0



ξm,i = C −1 αm,i ,

=0



yi

(m)

(8.5.28)

(wTm φ m (xi ) + bm ) − 1 + ξm,i = 0,

(8.5.29)

for m = 1, . . . , k. Eliminating wm and ξm,i in the above equations yields the KKT equations bm =

N  i=1

(m)

αm,i yi

= α Tm ym ,

(8.5.30)

660

8 Support Vector Machines

and (C −1 I + ZTm Zm + ym yTm )α m = 1,

(8.5.31)

% & (m) φ m (xN ) , Zm = y1(m) φ m (x1 ), . . . , yN

(8.5.32)

where

(m)

(m)

ym = [y1 , . . . , yN ]T ,

(8.5.33)

α m = [αm,1 , . . . , αm,N ]T .

(8.5.34)

Algorithm 8.4 shows the PSVM multiclass classification algorithm. Algorithm 8.4 PSVM multiclass classification algorithm 1. input: Training set {(xi , yi )|xi ∈ Rn , yi ∈ {1, . . . , k}, i = 1, . . . , N }, constant C > 0 and the kernel function Km (x, xi ), m = 1, . . . , k. * +1, yi = m; (m) 2. initialization: yi = for m = 1, . . . , k; i = 1, . . . , N . −1, yi = m; 3. learning step: while m = 1, . . . , k (m) (m) 3.1. Construct the N × N matrix [ZTm Zm ]ij = yi yj Km (xi , xj ). 3.2. Solve the KKT matrix equation (8.5.31) for α m = (C −1 I + ZTm Zm + ym yTm )† 1. 3.3. Compute bm = α Tm ym . endwhile n 4. testing step: for given testing !sample x ∈ R , its decision is" given by (m) N class of x = arg max i=1 αm,i yi Km (x, xi ) + bm . m=1,...,k

8.6 Gaussian Process for Regression and Classification In Chap. 6 we discussed traditionally parametric models based machine learning. The parametric models have a possible advantage in ease of interpretability, but for complex data sets, simple parametric models may lack expressive power, and their more complex counterparts (such as feedforward neural networks) may not be easy to work with in practice [29, 30]. The advent of kernel machines, such as Gaussian processes, sparse Bayesian learning, and relevance vector machine, has opened the possibility of flexible models which are practical to work with. In this section we deal with Gaussian process methods for regression and classification problems.

8.6 Gaussian Process for Regression and Classification

661

8.6.1 Joint, Marginal, and Conditional Probabilities Let the n (discrete or continuous) random variables y1 , . . . , yn have a joint probability p(y1 , . . . , yn ), or p(y) for short. Technically, one ought to distinguish between probabilities (for discrete variables) and probability densities for continuous variables. Throughout the book we commonly use the term “probability” to refer to both. Let us partition the variables in y into two groups, yA and yB , where A ∪ B = {1, . . . , n} and A ∩ B = ∅, so that p(y) = p(yA , yB ). Each group may contain one or more variables. The marginal probability of yA , denoted by p(yA ), is given by marginal probability  p(yA ) =

p(yA , yB )dyB .

(8.6.1)

The integral is replaced by a sum if the variables are discrete valued. Notice that if the set A contains more than one variable, then the marginal probability is itself a joint probability of these variables. If the joint distribution is equal to the product of the marginals, then the variables are said to be independent, otherwise they are dependent. The conditional probability function is defined as conditional probability p(yA |yB ) =

p(yA , yB ) p(yB )

(8.6.2)

defined for p(yB ) > 0, as it is not meaningful to condition on an impossible event. If yA and yB are independent, then the marginal p(yA ) and the conditional p(yA |yB ) are equal. Using the definitions of both p(yA |yB ) and p(yB |yA ) we obtain Bayes theorem p(yA |yB ) =

p(yA )p(yB |yA ) . p(yB )

(8.6.3)

8.6.2 Gaussian Process !

"

|x−μ|2 is 2σ 2 2 N (μ, σ ), where

A random variable x with the normal distribution p(x) = (2π σ 2 )−1 exp

called the univariate Gaussian distribution, and is denoted as x ∼ μ = E{x} and σ 2 = var(x) are its mean and variance, respectively. A multivariate Gaussian (or normal) distribution has a joint probability density given by −N/2

p(x|μ, ) = (2π )

||

−1/2



 1 T −1 exp − (x − μ)  (x − μ) , 2

(8.6.4)

662

8 Support Vector Machines

where μ = [μ1 , . . . , μN ]T with μi = E{xi } is the mean vector (of length N ) of x and  is the N × N (symmetric, positive definite) covariance matrix of x. The multivariate Gaussian distribution is denoted simply as x ∼ N (μ, ). If letting x and y be jointly Gaussian random vectors            ¯ C ¯ −1 μx μx x A C A , , T , ¯T ¯ ∼ N =N μy μy y C B C B

(8.6.5)

then the marginal distribution of x and the conditional distribution of x given marginalizing y are       ! " μx x A C , T ⇒ y|x ∼ N μy + CT A−1 (x − μx ), B−CT A−1 C ∼N μy y C B (8.6.6) or      ¯ μx x A , ¯T ∼ N μy y C

  " ! ¯ −1 C −1 ¯ T −1 ¯ ¯ . − B (x − μ ), B ⇒ y|x ∼ N μ C y x ¯ B (8.6.7)

A Gaussian process is a natural generalization of multivariate Gaussian distribution. Definition 8.8 (Gaussian Process [29, 30]) A Gaussian Process f (x) is a collection of random variables x, any finite number of which have (consistent) joint Gaussian distributions. Clearly, the Gaussian distribution is over vectors, whereas the Gaussian process is over functions f (x), where x is a Gaussian vector. A scalar Gaussian process f (x) can be represented as f ∼ GP(μ, K),

(8.6.8)

where μ = E{x} is the mean of x and K(x, x  ) is the covariance function of x. Equation (8.6.8) means [29]: “the function f is distributed as a GP with mean function μ and covariance function K.” A vector-valued Gaussian process f(x) is completely specified by its mean function and covariance function μ(x) = E{f(x)}, K(xi , xj ) = E{(f(xi ) − μ(xi ))T (f(xj ) − μ(xj ))},

(8.6.9) (8.6.10)

8.6 Gaussian Process for Regression and Classification

663

and the Gaussian process is expressed as f(x) ∼ GP(μ, ),

(8.6.11)

where the (i, j )th elements of the covariance matrix  are defined as ij = K(xi , xj ). The covariance function K(xi , xj ) (also known as kernel, kernel function, or covariance kernel) is the driving factor in Gaussian processes for regression and/or classification. Actually, the kernel represents the particular structure present in the data x1 , . . . , xN being modeled. One of the main difficulties in applying Gaussian processes is to construct such a kernel. According to Mercer’s theorem (Theorem 8.3), the nonlinear kernel function is usually constructed by K(xi , xj ) = φ T (xi )φ(xj ).

(8.6.12)

8.6.3 Gaussian Process Regression Let {(xn , fn )| n = 1, . . . , N } be the training samples, where fn , n = 1, . . . , N are the noise-free training outputs of xn , typically continuous functions for regression or discrete function for classification. Denote the noise-free training output vector f = [f1 , . . . , fN ]T and the test output vector f∗ = [f1∗ , . . . , fN∗∗ ]T given the test samples x∗1 , . . . , x∗N∗ . Under the assumption of Gaussian distributions f ∼ N (0, K(X, X)) and f∗ ∼ N(0, K(X∗ , X∗ )), the joint distribution of f and f∗ is given by      f K(X, X) K(X, X∗ ) ∼ N 0, , K(X∗ , X) K(X∗ , X∗ ) f∗

(8.6.13)

where X = [x1 , . . . , xN ] and X∗ = [x∗1 , . . . , x∗N∗ ] are the N × N training sample matrix and the N∗ × N∗ test sample matrix, respectively; and K(X, X) = cov(X) ∈ RN ×N , K(X, X∗ ) = cov(X, X∗ ) ∈ RN ×N∗ , K(X∗ , X) = cov(X∗ , X) = KT (X, X∗ ) ∈ RN∗ ×N , and K(X∗ , X∗ ) = cov(X∗ ) ∈ RN∗ ×N∗ are the covariance function matrices, respectively. From Eq. (8.6.6) it is known that the conditional distribution of f∗ given f can be expressed as   f∗ |f ∼ N K(X∗ , X)K−1 (X, X)f, K(X∗ , X∗ ) − K(X∗ , X)K−1 (X, X)K(X, X∗ ) . (8.6.14) This is just the posterior distribution for a specific set of test cases.

664

8 Support Vector Machines

The goal of Gaussian processes regression is to find f∗ via maximizing the posterior probability f∗ |f. However, f is unobservable. In practical applications, the noisy training output vector y = f + e is observed in the additive Gaussian white noise e ∼ N (0, σn2 I). In this case, y ∼ N(0, cov(y)) with cov(yi , yj ) = K(yi , yj ) + σn2 δij

or

  y ∼ N 0, K(X, X) + σn2 I .

(8.6.15)

Then, the joint distribution of y and f∗ according to the prior is given by      y K(X, X) + σn2 I K(X, X∗ ) ∼ N 0, , K(X∗ , X) K(X∗ , X∗ ) f∗

(8.6.16)

From Eq. (8.6.6) we get the conditional distribution of f∗ given y below:   f∗ |y ∼ N f¯∗ , cov(f∗ ) ,

(8.6.17)

where 2 3 ¯f∗ = E{f∗ |y} = K(X∗ , X) K(X, X) + σn2 I −1 y, 2 3−1 cov(f∗ ) = K(X∗ , X∗ ) − K(X∗ , X) K(X, X) + σn2 I K(X, X∗ ).

(8.6.18) (8.6.19)

Here, ¯f∗ is the mean vector over N∗ test points, i.e., ¯f∗ = ¯f(x∗1 , . . . , xN∗ ). If there is only one test point x∗ , i.e., X∗ = x∗ , then K(X, X∗ ) reduce to k∗ = k(x∗ ) = k(x, x∗ ) which denotes the vector of covariances between the test point x∗ and the n training points x1 , . . . , xN in X. In this case, Eqs. (8.6.18) and (8.6.19) reduce to ¯f∗ = kT∗ (K + σn2 I)−1 y, cov(f∗ ) = K(x∗ , x∗ ) − kT∗ (K + σn∗ I)−1 k∗ .

(8.6.20) (8.6.21)

Letting α = (K + σn2 I)−1 y, then the mean vector centered on a test point x∗ , denoted by ¯f∗ = ¯f(x∗ ), is given by ¯f(x∗ ) =

N 

αi K(xi , x∗ ) = kT∗ α.

(8.6.22)

i=1

When using Eqs. (8.6.20) and (8.6.21) to compute directly ¯f∗ and var(f∗ ), it needs to invert the matrix K + σn2 I. A faster and numerically more stable computation method is to use Cholesky decomposition.

8.6 Gaussian Process for Regression and Classification

665

Let the Cholesky decomposition of K+σn2 I be given by (K+σn2 I) = LLT , where L is a lower triangular matrix. Then, α = (K+σn2 I)−1 y becomes α = (LT )−1 L−1 y. Denoting z = L−1 y and α = (LT )−1 z, then z = L−1 y ⇔ Lz = y, α = (LT )−1 z ⇔ LT α = z.

(8.6.23) (8.6.24)

The above two equations suggest an efficient algorithm for finding α given K and y as follows: 1. Make Cholesky decomposition: (K + σn2 I) = LLT . 2. Solve the triangular system Lz = y for z by forward substitution. 3. Solve the triangular system LT α = z for α by back substitution. Similarly, we have kT∗ (K + σn∗ I)−1 k∗ = kT∗ (LT )−1 L−1 k∗ = vT v, where v = ⇔ Lv = k∗ . This implies v is the solution for the triangular system Lv = k∗ . Once α and v are found, one can use Eqs. (8.6.20) and (8.6.21) to get ¯f(x∗ ) = kT∗ α and var(f∗ ) = k(x∗ , x∗ ) − vT v, respectively. The marginal likelihood is the integral marginal of the likelihood times the prior L−1 k∗

 p(y|X) =

p(y|f, X)p(f|X)df.

(8.6.25)

By the term “marginal likelihood,” it refers to the marginalization over the function values f. Under the Gaussian process model, the prior is Gaussian, f|X ∼ N(0, K(X, X), or N 1 1 log p(f|X) = − fT K−1 f − log |K| − log(2π ). 2 2 2

(8.6.26)

Because of y = f + e with e ∼ N(0, σn2 I), the marginal likelihood   N 3−1 1 2 1 log p(y|X) = − yT K + σn2 I y − log K + σn2 I − log(2π ). 2 2 2

(8.6.27)

Here, the first term is a data fitting term as it is the only one involving the observed targets y. The second term log |K + σn2 I|/2 is the model complexity depending only on the covariance function and the inputs. The last term log(2π )/2 is just a normalization constant. Algorithm 8.5 shows the Gaussian process regression (GPR) in [30].

666

8 Support Vector Machines

Algorithm 8.5 Gaussian process regression algorithm [30] 1. 2. 3. 4. 5. 6. 7. 8. 9. 10.

input: X (inputs), y (targets), k (covariance function), σn2 (noise level), x∗ (test input). Construct the matrix K whose (i, j ) elements Kij = k(xi , xj ), i, j = 1, . . . , N . Make Cholesky decomposition K + σn2 I = LLT . Solve the triangular system Lz = y for z by forward substitution. Solve the triangular system LT α = z for α by back substitution. ¯f∗ = kT∗ α. Solve the triangular system Lv = k∗ for v by forward substitution. var(f∗ ) = k(x∗ , x∗ ) − vT  v. N log p(y|X) = − 12 yT α − N i=1 Lii − 2 log(2π ). output: ¯f∗ (mean), var(f∗ ) (variance), log(y|X) (log marginal likelihood).

8.6.4 Gaussian Process Classification Both regression and classification can be viewed as function approximation problems, but their solutions are rather different. This is because the targets in regression are continuous functions in which the likelihood function is Gaussian; a Gaussian process prior combined with a Gaussian likelihood gives rise to a posterior Gaussian process over functions, and everything remains analytically tractable [30]. But, in classification models, the targets are discrete class labels, so the Gaussian likelihood is no longer appropriate. Hence, approximate inference is only available for classification, where exact inference is not feasible. Inference is naturally divided into two steps [30]: 1. Computing the distribution of the latent variable corresponding to a test case:  p(f∗ |X, y, x∗ ) = p(f∗ |X, x∗ , f)p(f|X, y)df, (8.6.28) where p(f|X, y) = p(y|f)p(f|X)/p(y|X) is the posterior over the latent variables. 2. Using this distribution over the latent f∗ to produce a probabilistic prediction  π¯ ∗ = p(y∗ = +1|X, y, x∗ ) = σ (f∗ )p(f∗ |X, y, x∗ )df∗ . (8.6.29) In classification due to discrete class labels the posterior p(f|X, y) in (8.6.28) is non-Gaussian, which makes the likelihood in (8.6.28) is non-Gaussian, which makes its integral analytically intractable. Similarly, Eq. (8.6.29) can be intractable analytically for certain sigmoid functions. The non-Gaussian joint posterior p(f|X, y) can be approximated analytically with a Gaussian one. To this end, consider Laplace’s method that utilizes a Gaussian approximation   1 −1 T ˆ ˆ ˆ q(f|X, y) = N(f|f, A ) ∝ exp (f − f) A(f − f) (8.6.30) 2

8.7 Relevance Vector Machine

667

to the non-Gaussian posterior p(f|X, y) in the integral (8.6.28). In the above equation, ˆf = arg minf p(f|X, y), and A = −∇ 2 log p(f|X, y)|f=ˆf is the Hessian matrix of the negative log posterior at that point. By Bayes’ rule the posterior over the latent variables is given by p(f|X, y) = p(y|f)p(f|X)/p(y|X), but as p(y|X) is independent of f, we need to consider only the unnormalized posterior when maximizing with respect to f, namely consider only p(f|X, y) = p(y|f)p(f|X). Taking its logarithm and using Eq. (8.6.26), then #(f) = log p(y|f) + log p(f|X) N 1 1 = log p(y|f) − fT K−1 f − log |K| − log(2π ) 2 2 2

(8.6.31)

whose first and second derivatives with respect to f are, respectively, given by ∇#(f) = ∇ log p(y|f) − K−1 f, ∇ 2 #(f) = ∇ 2 log p(y|f) − K−1 . From ∇#(f) = 0 it follows that f = Ka,

a = ∇ log p(y|f).

(8.6.32)

W = −∇ 2 log p(y|X).

(8.6.33)

The Hessian matrix can be written as ∇ 2 #(f) = −W − K−1 ,

Here, W is a diagonal matrix, since the distribution for yi depends only on fi , not on fj =i . Gaussian processes are a main mathematical tool for sparse Bayesian learning and hence the relevance vector machine that will be discussed in the next section.

8.7 Relevance Vector Machine

In real-world data, the presence of noise (in regression) and class overlap (in classification) implies that the principal modeling challenge is to avoid "overfitting" of the training set. In order to avoid the overfitting of SVMs (caused by too many support vectors), Tipping [39] proposed a learning machine based on relevance vectors, called the relevance vector machine (RVM). "Sparse Bayesian learning" is the basis of the RVM.


8.7.1 Sparse Bayesian Regression

We are given a set of data {x_n, t_n}_{n=1}^N, where the "target" samples t_n = y(x_n) + ε_n are conventionally considered to be realizations of a deterministic function y that is corrupted by some additive noise process {ε_n}. This function y is usually modeled by a linearly weighted sum of M fixed basis functions {φ_m(x)}_{m=1}^M:

ŷ(x) = ∑_{m=1}^M w_m φ_m(x).    (8.7.1)

Our objective is to infer values of the parameters/weights w_m, m = 1, . . . , M, such that ŷ(x) is a "good" approximation of y(x). The function approximation has two important benefits: accuracy and "sparsity." By sparsity, we mean that a sparse learning algorithm can set significant numbers of the parameters w_m to zero. The support vector machine (SVM) makes predictions based on the function

y(x; w) = ∑_{i=1}^N w_i K(x; x_i) + w_0,    (8.7.2)

where K(x; x_i) is a kernel function, effectively defining one basis function for each example in the training set. There are two key features of the SVM:
• It can avoid overfitting, which leads to good generalization.
• It furthermore results in a sparse model dependent only on a subset of kernel functions.
However, there are a number of significant and practical disadvantages of the support vector learning methodology [39]:
1. Although relatively sparse, SVMs make unnecessarily liberal use of basis functions, since the number of support vectors required typically grows linearly with the size of the training set. In order to reduce computational complexity, some form of post-processing is often required.
2. Predictions are not probabilistic. The SVM outputs a point estimate in regression and a "hard" binary decision in classification. Ideally, one would like to estimate the conditional distribution p(t|x) in order to capture the uncertainty in prediction. Although posterior probability estimates can be coerced from SVMs via post-processing, these estimates are unreliable.
3. In order to satisfy Mercer's condition, the kernel function K(x; x_i) must be the continuous symmetric kernel of a positive integral operator.


To avoid the above limitations, Tipping [39] proposed to use the "relevance vector machine" (RVM) instead of the SVM as a Bayesian learning framework for (8.7.2):

y(x; w) = ∑_{i=1}^M w_i φ_i(x) = w^T φ(x),    (8.7.3)

where the output is a linearly weighted sum of M generally nonlinear and fixed basis functions φ(x) = [φ_1(x), . . . , φ_M(x)]^T. The regression process is to determine (or learn) the adjustable parameters (or "weights") w = [w_1, . . . , w_M]^T by using a Bayesian learning framework. The key feature of this learning approach is that, in order to offer good generalization performance, the inferred predictors are exceedingly sparse in that they contain relatively few nonzero w_i parameters. Since the majority of parameters are automatically set to zero during the learning process, this learning approach is known as sparse Bayesian learning for regression [39]. The following is the framework of sparse Bayesian learning for regression.

Model Specification
Given a data set of input-target pairs {x_n, t_n}_{n=1}^N, where x_n ∈ R^d and t_n ∈ R, it is usually assumed that the target values are samples from the model with additive noise ε_n. Under this assumption, the standard linear regression model with additive Gaussian noise is given by

t_n = y_n + ε_n = y(x_n; w) + ε_n,    (8.7.4)

where y_n = y(x_n; w) are approximations of the target signal t_n, and ε_n are independent samples from a zero-mean Gaussian noise process N(0, σ²) with variance σ². Letting t = [t_1, . . . , t_N]^T be the target vector and y = [y_1, . . . , y_N]^T be the approximation vector, then under the independence assumption on the targets (equivalently, on the Gaussian noise e = t − y), the likelihood of the complete data set can be written as

p(t|w, σ²) = (2π)^{-N/2} σ^{-N} exp( −(1/(2σ²)) ‖t − y‖₂² ).    (8.7.5)

In different regression cases, the above likelihood takes different forms.
• The linear neural network makes predictions based on the function

y_n = y(x_n; w) = x_n^T w   or   y = X^T w,    (8.7.6)


where w = [w_1, . . . , w_d]^T is the weight vector and X = [x_1, . . . , x_N] is the data matrix, and thus the likelihood (because of the independence assumption) is given by

p(t|w, σ²) = ∏_{i=1}^N p(t_i|w, σ²)
           = ∏_{i=1}^N (1/(√(2π) σ)) exp( −(t_i − x_i^T w)²/(2σ²) )
           = (2π)^{-N/2} σ^{-N} exp( −(1/(2σ²)) ‖t − X^T w‖₂² ).    (8.7.7)

• The SVM makes predictions

y_n = ∑_{i=1}^N w_i K(x_n; x_i) + w_0   or   y = Kw,    (8.7.8)

where K = [k(x_1), . . . , k(x_N)]^T is the N × (N + 1) kernel matrix with k(x_n) = [1, K(x_n, x_1), . . . , K(x_n, x_N)]^T, and w = [w_0, w_1, . . . , w_N]^T. So under the independence assumption, the likelihood is given by

p(t|w, σ²) = (2π)^{-N/2} σ^{-N} exp( −(1/(2σ²)) ‖t − Kw‖₂² ).    (8.7.9)

• The RVM makes predictions

y_n = ∑_{i=1}^N w_i φ_i(x_n) + w_0 = φ^T(x_n) w   or   y = Φw,    (8.7.10)

where Φ = [φ(x_1), . . . , φ(x_N)]^T is the N × (N + 1) "design" matrix with φ(x_n) = [1, φ_1(x_n), . . . , φ_N(x_n)]^T, and w = [w_0, w_1, . . . , w_N]^T. For example, φ(x_n) = [1, K(x_n, x_1), K(x_n, x_2), . . . , K(x_n, x_N)]^T. Therefore, the likelihood (because of the independence assumption) is given by

p(t|w, σ²) = (2π)^{-N/2} σ^{-N} exp( −(1/(2σ²)) ‖t − Φw‖₂² ).    (8.7.11)

Parameter Estimation
Inference in the Bayesian linear model is based on the posterior distribution over the weights, computed by Bayes' rule [30]:

posterior = (likelihood × prior) / (marginal likelihood).    (8.7.12)


Hence, the posterior distribution over the weight vector w is given by Tipping [39]:

p(w|t, α, σ²) = p(t|w, σ²) p(w|α) / p(t|α, σ²)
             = (2π)^{-(N+1)/2} |Σ|^{-1/2} exp( −(1/2)(w − μ)^T Σ^{-1} (w − μ) ),    (8.7.13)

where the posterior covariance and mean are, respectively, as follows [26]:

Σ = (σ^{-2} Φ^T Φ + A)^{-1},    (8.7.14)
μ = σ^{-2} Σ Φ^T t,    (8.7.15)

with A = Diag(α_0, α_1, . . . , α_N).

Sparse Bayesian "Learning"
The procedure which chooses the model with the maximum marginal likelihood is called the type II maximum likelihood procedure by Good [18]. Under this procedure, sparse Bayesian learning is formulated as the (local) maximization with respect to α of the marginal likelihood, or equivalently of its logarithm, given by Tipping [39]:

L(α) = log p(t|α, σ²) = log ∫_{-∞}^{∞} p(t|w, σ²) p(w|α) dw
     = −(1/2)[ N log(2π) + log|C| + t^T C^{-1} t ]    (8.7.16)

with

C = σ² I + Φ A^{-1} Φ^T.    (8.7.17)
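The following is a small Python sketch of Eqs. (8.7.14)–(8.7.17): it computes the posterior covariance and mean of the weights and evaluates the log marginal likelihood L(α) for given hyperparameters. The function name and interface are illustrative only.

```python
import numpy as np

def rvm_posterior_and_evidence(Phi, t, alpha, sigma2):
    """Sketch of Eqs. (8.7.14)-(8.7.17): posterior covariance/mean of the RVM
    weights and the log marginal likelihood L(alpha).
    Phi: N x M design matrix, t: targets, alpha: (M,) hyperparameters."""
    N, M = Phi.shape
    A = np.diag(alpha)
    # Posterior covariance and mean, Eqs. (8.7.14)-(8.7.15)
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + A)
    mu = Sigma @ Phi.T @ t / sigma2
    # Log marginal likelihood, Eqs. (8.7.16)-(8.7.17)
    C = sigma2 * np.eye(N) + Phi @ np.diag(1.0 / alpha) @ Phi.T
    sign, logdetC = np.linalg.slogdet(C)
    L = -0.5 * (N * np.log(2 * np.pi) + logdetC + t @ np.linalg.solve(C, t))
    return Sigma, mu, L
```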

Here, the term "marginal" emphasizes a non-parametric model. The Bayesian learning problem, in the context of the RVM, thus becomes the search for the hyperparameters α. In the RVM, these hyperparameters are estimated by maximizing L(α). This estimation problem will be discussed in Sect. 8.7.3.
As a typical application of RVMs, consider the compressive sensing (CS) measurements g represented as [23]

g = Φw,    (8.7.18)

where Φ = [r_1, . . . , r_K]^T is a K × N matrix, assuming random CS measurements are made.


A typical means of solving the CS problem is ℓ₁ regularization:

w* = arg min_w { ‖g − Φw‖₂² + γ ‖w‖₁ }.    (8.7.19)

Under the assumption that the additive noise ε_n has a zero-mean Gaussian distribution N(0, σ²I), the Gaussian likelihood model is described as

p(g|w, σ²) = (2π)^{-K/2} σ^{-K} exp( −‖g − Φw‖₂²/(2σ²) ).    (8.7.20)

Assuming the hyperparameters α and α_0 are known, given the CS measurements g and the projection matrix Φ, the posterior for w can be expressed analytically as a multivariate Gaussian distribution with mean and covariance

μ = α_0 Σ Φ^T g   and   Σ = (α_0 Φ^T Φ + A)^{-1}.    (8.7.21)

Then the Bayesian compressive sensing problem becomes a sparse Bayesian regression problem.

8.7.2 Sparse Bayesian Classification

Sparse Bayesian classification follows an essentially identical framework to the regression case above, but uses a Bernoulli likelihood and a sigmoidal link function to account for the change in the target quantities [39]. For an input vector x, an RVM classifier models the probability distribution of its class label using the logistic sigmoid link function σ(y) = 1/(1 + e^{-y}) as

p(d = 1|x) = 1/(1 + exp(−f_RVM(x))),    (8.7.22)

where f_RVM(x), called the RVM classifier function, is given by

f_RVM(x) = ∑_{i=1}^N α_i K(x, x_i),    (8.7.23)

where K(x, x_i) is a kernel function and x_i, i = 1, . . . , N are the training samples. Adopting the Bernoulli distribution for P(t|x), the likelihood can be written as

P(t|w) = ∏_{n=1}^N σ(y(x_n, w))^{t_n} [1 − σ(y(x_n, w))]^{1−t_n},    (8.7.24)

where the target t_n ∈ {0, 1}.


In sparse Bayesian regression, the weights w can be integrated out analytically. Unlike the regression case, in sparse Bayesian classification the weights cannot be integrated out analytically, precluding closed-form expressions for either the weight posterior p(w|t, α) or the marginal likelihood P(t|α). Thus, the Laplace approximation procedure is utilized. In the statistics and machine learning literature, the Laplace approximation refers to the evaluation of the marginal likelihood or free energy using Laplace's method. This is equivalent to a local Gaussian approximation of P(t|w) around a maximum a posteriori (MAP) estimate [40].

8.7.3 Fast Marginal Likelihood Maximization

To consider the dependence of L(α) on a single hyperparameter α_m, m ∈ {1, . . . , M}, the matrix C in Eq. (8.7.16) can be decomposed as

C = σ² I + ∑_{i≠m} α_i^{-1} φ_i φ_i^T + α_m^{-1} φ_m φ_m^T
  = C_{-m} + α_m^{-1} φ_m φ_m^T,    (8.7.25)

where C_{-m} is C with the contribution of the mth basis vector φ_m removed. Thus, by applying the determinant identity and the matrix inversion lemma, the terms of interest in L can be written as

|C| = |C_{-m}| · |1 + α_m^{-1} φ_m^T C_{-m}^{-1} φ_m|,    (8.7.26)

C^{-1} = C_{-m}^{-1} − (C_{-m}^{-1} φ_m φ_m^T C_{-m}^{-1}) / (α_m + φ_m^T C_{-m}^{-1} φ_m).    (8.7.27)

Then, L(α) can be rewritten as

L(α) = −(1/2)[ N log(2π) + log|C_{-m}| + t^T C_{-m}^{-1} t − log α_m
       + log(α_m + φ_m^T C_{-m}^{-1} φ_m) − (φ_m^T C_{-m}^{-1} t)² / (α_m + φ_m^T C_{-m}^{-1} φ_m) ]
     = L(α_{-m}) + (1/2)[ log α_m − log(α_m + s_m) + q_m²/(α_m + s_m) ]
     = L(α_{-m}) + ℓ(α_m),    (8.7.28)


where

s_m = φ_m^T C_{-m}^{-1} φ_m   and   q_m = φ_m^T C_{-m}^{-1} t,   for m = 1, . . . , M.    (8.7.29)

The "sparsity factor" s_m can be seen to be a measure of the extent to which basis vector φ_m "overlaps" those already present in the model, while the "quality factor" q_m is a measure of the alignment of φ_m with the error of the model with that vector excluded.
From (8.7.28) and the optimization condition ∂L(α)/∂α_m = ∂ℓ(α_m)/∂α_m = 0 we have the result

1/α_m − 1/(α_m + s_m) − q_m²/(α_m + s_m)² = 0   or   α_m = s_m²/(q_m² − s_m).    (8.7.30)

Hence, L(α) has a unique maximum with respect to α_m [10]:

α_m = s_m²/(q_m² − s_m),  if q_m² > s_m;
α_m = ∞,                  otherwise,    (8.7.31)

for m = 1, . . . , M. Equation (8.7.31) shows that:
• If φ_m is "in the model" (i.e., α_m < ∞) yet q_m² ≤ s_m, then φ_m may be deleted (i.e., α_m set to ∞).
• If φ_m is excluded from the model (α_m = ∞) and q_m² > s_m, then φ_m may be added (i.e., α_m is set to some optimal finite value).

It is obviously easier to maintain and compute C^{-1} than the matrices C_{-m}^{-1}, m = 1, . . . , M. Let

S_m = φ_m^T C^{-1} φ_m   and   Q_m = φ_m^T C^{-1} t,   m = 1, . . . , M.    (8.7.32)

By premultiplying by φ_m^T, postmultiplying by φ_m, and using (8.7.29), Eq. (8.7.27) yields

S_m = φ_m^T C^{-1} φ_m = φ_m^T [ C_{-m}^{-1} − (C_{-m}^{-1} φ_m φ_m^T C_{-m}^{-1})/(α_m + φ_m^T C_{-m}^{-1} φ_m) ] φ_m
    = s_m − s_m²/(α_m + s_m),   m = 1, . . . , M.    (8.7.33)


Similarly, we have

Q_m = φ_m^T C^{-1} t = φ_m^T [ C_{-m}^{-1} − (C_{-m}^{-1} φ_m φ_m^T C_{-m}^{-1})/(α_m + φ_m^T C_{-m}^{-1} φ_m) ] t
    = q_m − s_m q_m/(α_m + s_m),   m = 1, . . . , M.    (8.7.34)

From (8.7.33) and (8.7.34) it follows that

s_m = α_m S_m/(α_m − S_m)   and   q_m = α_m Q_m/(α_m − S_m)    (8.7.35)

for m = 1, . . . , M. Note that when α_m = ∞, s_m = S_m and q_m = Q_m.
By the Duncan–Guttman inversion formula (1.6.5),

(A + UD^{-1}V)^{-1} = A^{-1} − A^{-1} U (D + V A^{-1} U)^{-1} V A^{-1},    (8.7.36)

we have

C^{-1} = (σ² I + Φ A^{-1} Φ^T)^{-1} = B − BΦ(A + Φ^T BΦ)^{-1} Φ^T B = B − BΦΣΦ^T B,    (8.7.37)

where B = σ^{-2} I and

Σ = (A + Φ^T BΦ)^{-1}.    (8.7.38)

Substituting (8.7.37) into (8.7.32) yields

S_m = φ_m^T Bφ_m − φ_m^T BΦΣΦ^T Bφ_m,    (8.7.39)
Q_m = φ_m^T B t̂ − φ_m^T BΦΣΦ^T B t̂.    (8.7.40)
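Equations (8.7.39)–(8.7.40) can be evaluated for all candidate basis vectors at once. The sketch below assumes the regression case (so t̂ = t) and that the currently active basis vectors and their hyperparameters are given; the function name and interface are illustrative.

```python
import numpy as np

def compute_S_Q(Phi_full, Phi_model, alpha_model, t, sigma2):
    """Sketch of Eqs. (8.7.39)-(8.7.40): compute S_m and Q_m for every candidate
    basis vector (columns of Phi_full), given the bases currently in the model
    (columns of Phi_model) with hyperparameters alpha_model. Here t_hat = t
    (regression case), and B = sigma^{-2} I."""
    beta = 1.0 / sigma2
    A = np.diag(alpha_model)
    Sigma = np.linalg.inv(A + beta * Phi_model.T @ Phi_model)   # Eq. (8.7.38)
    BPhi = beta * Phi_model                                      # B Phi
    proj = BPhi @ Sigma @ BPhi.T                                 # B Phi Sigma Phi^T B
    # S_m = phi_m^T B phi_m - phi_m^T proj phi_m, for all m at once
    S = beta * np.sum(Phi_full * Phi_full, axis=0) - np.einsum(
        'nm,nk,km->m', Phi_full, proj, Phi_full)                 # Eq. (8.7.39)
    # Q_m = phi_m^T B t - phi_m^T proj t
    Q = beta * (Phi_full.T @ t) - Phi_full.T @ (proj @ t)        # Eq. (8.7.40)
    return S, Q
```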

The following is the marginal likelihood maximization algorithm for sequential sparse Bayesian learning proposed by Tipping and Faul [40]:
1. Initialize σ² to some sensible value (e.g., var[t] × 0.1).
2. Initialize with a single basis vector φ_i, setting, from (8.7.31),

α_i = ‖φ_i‖₂² / ( ‖φ_i^T t‖₂²/‖φ_i‖₂² − σ² ).    (8.7.41)

All other α_m are notionally set to infinity.


3. Explicitly compute Σ and μ (which are scalars initially), along with initial values of s_m and q_m for all M bases φ_m.
4. Select a candidate basis vector φ_i from the set of all M.
5. Compute θ_i = q_i² − s_i.
6. If θ_i > 0 and α_i < ∞ (i.e., φ_i is in the model), re-estimate α_i. Defining κ_j = (Σ_jj + (ᾱ_i − α_i)^{-1})^{-1} and Σ_j as the jth column of Σ:

2ΔL = Q_i² / (S_i + (ᾱ_i^{-1} − α_i^{-1})^{-1}) − log[ 1 + S_i (ᾱ_i^{-1} − α_i^{-1}) ],    (8.7.42)
Σ̄ = Σ − κ_j Σ_j Σ_j^T,    (8.7.43)
μ̄ = μ − κ_j μ_j Σ_j,    (8.7.44)
S̄_m = S_m + κ_j (β Σ_j^T Φ^T φ_m)²,    (8.7.45)
Q̄_m = Q_m + κ_j μ_j (β Σ_j^T Φ^T φ_m).    (8.7.46)

7. If θ_i > 0 and α_i = ∞, add φ_i to the model with updated α_i:

2ΔL = (Q_i² − S_i)/S_i + log(S_i/Q_i²),    (8.7.47)
Σ̄ = [ Σ + β² Σ_ii ΣΦ^T φ_i φ_i^T ΦΣ ,  −β² Σ_ii ΣΦ^T φ_i ;  −β² Σ_ii (ΣΦ^T φ_i)^T ,  Σ_ii ],    (8.7.48)
μ̄ = [ μ − μ_i β ΣΦ^T φ_i ;  μ_i ],    (8.7.49)
S̄_m = S_m − Σ_ii (β φ_m^T e_i)²,    (8.7.50)
Q̄_m = Q_m − μ_i (β φ_m^T e_i),    (8.7.51)

where Σ_ii = (α_i + S_i)^{-1}, μ_i = Σ_ii Q_i, and e_i = φ_i − β ΦΣΦ^T φ_i.
8. If θ_i ≤ 0 and α_i < ∞, then delete φ_i from the model and set α_i = ∞:

2ΔL = Q_i²/(S_i − α_i) − log(1 − S_i/α_i),    (8.7.52)
Σ̄ = Σ − (1/Σ_jj) Σ_j Σ_j^T,    (8.7.53)
μ̄ = μ − (μ_j/Σ_jj) Σ_j,    (8.7.54)


S̄_m = S_m + (1/Σ_jj)(β Σ_j^T Φ^T φ_m)²,    (8.7.55)
Q̄_m = Q_m + (μ_j/Σ_jj)(β Σ_j^T Φ^T φ_m).    (8.7.56)

Following updates (8.7.53) and (8.7.54), the appropriate row and/or column j is removed from Σ̄ and μ̄.
9. If it is a regression model and the noise level is being estimated, update σ² = ‖t − y‖₂² / (N − M + ∑_m α_m Σ_mm).
10. Recompute/update Σ, μ (using the Laplace approximation procedure in classification) and all s_m and q_m using Eqs. (8.7.35)–(8.7.40).
11. If converged, terminate; otherwise go to step 4.
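The following sketch illustrates the decision logic of the sequential algorithm (add, delete, or re-estimate a basis according to θ_i = q_i² − s_i). For clarity it recomputes Σ, μ, S_m, and Q_m from scratch in every pass using Eqs. (8.7.32) and (8.7.35) instead of the rank-one updates (8.7.42)–(8.7.56), and it picks the candidate basis at random; both simplifications are for illustration only.

```python
import numpy as np

def sequential_sbl(Phi, t, sigma2, n_iters=200):
    """Sketch of the Tipping-Faul sequential loop: each pass picks a candidate
    basis and adds / deletes / re-estimates it according to theta_i = q_i^2 - s_i."""
    N, M = Phi.shape
    beta = 1.0 / sigma2
    alpha = np.full(M, np.inf)
    # Step 2: initialize with the single best-aligned basis vector, Eq. (8.7.41)
    i0 = np.argmax((Phi.T @ t) ** 2 / np.sum(Phi ** 2, axis=0))
    p2 = np.sum(Phi[:, i0] ** 2)
    alpha[i0] = p2 / ((Phi[:, i0] @ t) ** 2 / p2 - sigma2)
    for _ in range(n_iters):
        active = np.isfinite(alpha)
        Phi_a = Phi[:, active]
        A = np.diag(alpha[active])
        Sigma = np.linalg.inv(A + beta * Phi_a.T @ Phi_a)
        proj = (beta * Phi_a) @ Sigma @ (beta * Phi_a).T        # B Phi Sigma Phi^T B
        S = beta * np.sum(Phi ** 2, axis=0) - np.einsum('nm,nk,km->m', Phi, proj, Phi)
        Q = beta * (Phi.T @ t) - Phi.T @ (proj @ t)
        # Convert (S_m, Q_m) to (s_m, q_m) for in-model bases, Eq. (8.7.35)
        s, q = S.copy(), Q.copy()
        s[active] = alpha[active] * S[active] / (alpha[active] - S[active])
        q[active] = alpha[active] * Q[active] / (alpha[active] - S[active])
        i = np.random.randint(M)               # Step 4: pick a candidate basis
        theta = q[i] ** 2 - s[i]               # Step 5
        if theta > 0:
            alpha[i] = s[i] ** 2 / theta       # Steps 6-7: re-estimate or add, Eq. (8.7.31)
        elif np.isfinite(alpha[i]) and np.sum(active) > 1:
            alpha[i] = np.inf                  # Step 8: delete
    return alpha
```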

Brief Summary of This Chapter
• SVMs achieve high classification accuracy by introducing the maximum margin.
• Even when the sample size is small, SVMs can classify accurately and have good generalization ability.
• The kernel function can be used to solve nonlinear problems.
• SVMs can solve classification and regression problems with high-dimensional features.

References 1. Aronszajn, N.: Theory of reproducing kernels. Trans. Am. Math. Soc. 68, 337–404 (1950) 2. Belkin, M., Niyogi, P., Sindhwani, V.: Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J. Mach. Learn. Res. 7, 2399–2434 (2006) 3. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: Haussler D. (ed.) Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pp. 144–152. ACM Press, Pittsburgh (1992) 4. Bottou, L., Cortes, C., Denker, J., Drucker, H., Guyon, I., Jackel, L., LeCun, Y., Muller, U., Sackinger, E., Simard, P., Vapnik, V.: Comparison of classifier methods: a case study in handwriting digit recognition. In: Proceedings of International Conference on Pattern Recognition, pp. 77–87 (1994) 5. Bousquet, O., von Luxburg, U., Rätsch, G. (eds.): Advanced Lectures on Machine Learning. Springer, Berlin (2004) 6. Burges, C.J.C.: A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Disc. 2, 121–167 (1998) 7. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 27 (2011) 8. Cortes, C., Vapnik, V.: Support vector networks. Mach. Learn. 20(3), 273–297 (1995) 9. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification and Scene Analysis. Wiley, New York (1973)


10. Faul, A.C., Tipping, M.E.: Analysis of sparse Bayesian learning. In: Dietterich, T.G., Becker, S., Ghahramani, Z. (eds.) Advances in Neural Information Processing, vol. 14, pp. 383–389 (2002). http://citeseer.ist.psu.edu/faul01analysis.html 11. Fletcher, R.: Practical Methods of Optimization. Wiley, Chichester (1987) 12. Fontana, R., Pistone, G., Rogantin, M.P.: Classification of two-level factorial fractions. J. Stat. Plann. Inference 87, 149–172 (2000) 13. Friedman, J.: Another approach to polychotomous classification. Department of Statistics, Stanford University, Stanford (1996). https://www-stat.stanford.edu/reports/friedman/poly.ps. Z 14. Fung, G.M., Mangasarian, O.L.: Proximal support vector machine classifiers. In: Proceedings of International Conference on Knowledge Discovery and Data Mining, San Francisco, pp. 77–86 (2001) 15. Fung, G.M., Mangasarian, O.L.: Multicategory proximal support vector machine classifiers. Mach. Learn. 59(1–2), 77–97 (2005) 16. Girolami, M.: Mercer kernel-based clustering in feature space. IEEE Trans. Neural Netw. 13(3), 780–784 (2002) 17. Girosi, F.: An equivalence between sparse approximation and support vector machines. Neural Comput. 20, 1455–1480 (1998) 18. Good, I.J.: The Estimation of Probabilities. MIT Press, Cambridge (1965) 19. Gunn, S.R.: Support vector machines for classification and regression. Technical Report, Faculty of Engineering, Science and Mathematics School of Electronics and Computer Science, University of Southampton (1998) 20. Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using support vector machines. Mach. Learn. 46, 389–422 (2002) 21. Hsu, C.-W., Lin C.-J.: A comparison of methods for multiclass support vector machines. IEEE Trans. Neural Netw. 13(2), 415–425 (2002) 22. Huang, G.-B., Zhou, H., Ding, X., Zhang, R.: Extreme learning machine for regression and multiclass classification. IEEE Trans. Syst. Man Cybern. B Cybern. 42(2), 513–529 (2012) 23. Ji, S., Xue, Y., Carin, L.: Bayesian compressive sensing. IEEE Trans. Signal Process. 56(6), 2346–2356 (2008) 24. Kimeldorf, G., Wahba, G.: Some results on Tchebycheffian spline functions. J. Math. Anal. Appl. 33, 82–95 (1971) 25. Knerr, S., Personnaz, L., Dreyfus, G.: Single-layer learning revisited: a stepwise procedure for building and training a neural network. In: Fogelman, J. (ed.) Neurocomputing: Algorithms, Architectures and Applications. Springer, New York (1990) 26. MacKay, D.J.C.: Bayesian interpolation. Neural Comput. 4(3), 415–447 (1992) 27. Platt, J.C., Cristianini, N., Shawe-Taylor, J.: Large margin DAGs for multiclass classification. In: Advances in Neural Information Processing Systems, vol.12, pp. 547–553. MIT Press, Cambridge (2000) 28. Rännar, S., Lindegren, F., Geladi, P., Wold, S.: A PLS kernel algorithm for data sets with many variables and fewer objects. Part 1: theory and algorithm. J. Chemometr. 8, 111–125 (1994) 29. Rasmussen, C.E.: Gaussian processes in machine learning. In: Bousquet, O. et al. (ed.) Machine Learning 2003. Lecture Notes in Artificial Intelligence, vol. 3176, pp. 63–71. Springer, Berlin (2004) 30. Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning. MIT Press, Cambridge (2005) 31. Rokotomamonjy, A.: Variable selection using SVM-based criteria. J. Mach. Learn. Res. 3, 1357–1370 (2003) 32. Rosipal, R., Trejo, L.J.: Kernel partial least squares regression in reproducing kernel Hilbert space. J. Mach. Learn. Res. 2, 97–123 (2001) 33. 
Schölkopf, B., Smola, A., Müller, K.-R.: Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput. 10, 1299–1319 (1998) 34. Schölkopf, B., Smola, A.J., Williamson, R.C., Bartlett, P.L.: New support vector algorithms. Neural Comput. 12, 1207–1245 (2000)


35. Schölkopf, B., Herbrich, R., Smola, A.J.: A generalized representer theorem. In: Proceedings of the International Conference on Computational Learning Theory (COLT2001), pp. 416–426 (2001) 36. Smola, A.J., Schölkopf, B.: A tutorial on support vector regression. Stat. Comput. 14, 199–222 (2004) 37. Suykens, J.A.K., Vandewalle, J.: Least squares support vector machine classifiers. Neural Process. Lett. 9, 293–300 (1999) 38. Tikhonov, A.N., Arsenin, V.Y.: Solutions of Ill-Posed Problems. Wiley, New York (1977) 39. Tipping, M.E.: Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res. 1, 211–244 (2001) 40. Tipping, M.E., Faul, A.C.: Fast marginal likelihood maximization for sparse Bayesian models. In: Bishop, C.M., Frey, B.J. (eds.) Proceedings of the 9th International Workshop Artificial Intelligence and Statistics (2003). http://www.miketipping.com/papers.htm 41. Van Gestel, T., Suykens, J.A.K., Lanckriet, G., Lambrechts, A., De Moor, B., Vandewalle, J.: Multiclass LS-SVMs: moderated outputs and coding-decoding schemes. Neural Process. Lett. 15(1), 48–58 (2002) 42. Vapnik, V.N.: Principles of risk minimization for learning theory. Adv. Neural Inf. Proces. Syst. 4, 831–838 (1992) 43. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, New York (1995) 44. Vapnik, V.N.: Statistical Learning Theory, vol. 1. Wiley, New York (1998) 45. Vapnik, V.N.: An overview of statistical learning theory. IEEE Trans. Neural Netw. 10(5), 988– 999 (1999) 46. Vapnik, V.N., Chervonenkis, A.: Theory of Pattern Recognition (in Russian). Moscow, Nauka (1974). (German Translation: Wapnik W., Tscherwonenkis A.: Theorie der Zeichenerkennung), Akademie-Verlag, Berlin (1979) 47. Wahba, G.: Support vector machines, reproducing Kernel Hilbert spaces, and randomized GACV. In: Schölkopf, B., Burges, C.J.C., Smola, A.J. (eds.) Advances in Kernel Methods – Support Vector Learning, pp. 69–88. The MIT Press, Cambridge (1999) 48. Yuan, G.-X., Ho, C.-H., Lin C.-J.: Recent advances of large-scale linear classification. Proc. IEEE 100(9), 2584–2603 (2012)

Chapter 9

Evolutionary Computation

From the perspective of artificial intelligence, evolutionary computation belongs to computational intelligence. The origins of evolutionary computation can be traced back to the late 1950s (see, e.g., the influential works [12, 15, 46, 47]), and the field started to receive significant attention during the 1970s (see, e.g., [41, 68, 133]). The first issue of the journal Evolutionary Computation in 1993 and the first issue of the IEEE Transactions on Evolutionary Computation in 1997 mark two important milestones in the history of this rapidly growing field. This chapter focuses primarily on the basic theory and methods of evolutionary computation, including multiobjective optimization, multiobjective simulated annealing, multiobjective genetic algorithms, multiobjective evolutionary algorithms, evolutionary programming, and differential evolution, together with ant colony optimization, artificial bee colony algorithms, and particle swarm optimization. In particular, this chapter also highlights selected topics and advances in evolutionary computation: Pareto optimization theory, noisy multiobjective optimization, and opposition-based evolutionary computation.

9.1 Evolutionary Computation Tree

Evolutionary computation can be broadly classified into three major categories.
1. Physics-inspired evolutionary computation: Its typical representative is simulated annealing [100], inspired by the heating and cooling of materials.
2. Darwin's evolution theory inspired evolutionary computation:
• Genetic Algorithm (GA) is inspired by the process of natural selection. A GA is commonly used to generate high-quality solutions to optimization and search problems by relying on bio-inspired operators such as mutation, crossover, and selection.



• Evolutionary Algorithms (EAs) are inspired by Darwin's evolution theory, which is based on the survival of the fittest candidate for a given environment. These algorithms begin with a population (a set of solutions) which tries to survive in an environment (defined by a fitness evaluation). The parent population passes its properties of adaptation to the environment on to the children through various mechanisms of evolution such as genetic crossover and mutation.
• Evolutionary Strategy (ES) was developed by Schwefel in 1981 [143]. Similar to the GA, the ES uses evolutionary theory to optimize; that is, genetic information is inherited and mutated, and the fittest individuals survive generation by generation, so as to obtain the optimal solution. The differences lie in: (a) the DNA sequence in the ES is encoded by real numbers instead of the 0-1 binary code in the GA; (b) in the ES a mutation intensity is attached to each real value on the DNA sequence in order to perform mutation.
• Evolutionary Programming (EP) [43]: The EP and the ES adopt the same coding (digital string) for the optimization problem and the same mutation operation mode, but they adopt different mutation expressions and survival selections. Moreover, the crossover operator in the ES is optional, while there is no crossover operator in the EP. In the aspect of parent selection, the ES adopts a probabilistic selection method to form the parents, and each individual can be selected as a parent with the same probability, while the EP adopts a deterministic approach, that is, each parent in the current population must undergo mutation to generate offspring.
• Differential Evolution (DE) was developed by Storn and Price in 1997 [147]. Similar to the GA, the main process includes three steps: mutation, crossover, and selection. The difference is that the GA controls the parents' crossover according to the fitness value, whereas in the DE the mutation vector is generated from difference vectors of the parent generation and is crossed with an individual vector of the parent generation to generate a new individual vector, which is then selected directly against the corresponding individual of the parent generation. Therefore, the approximation effect of the DE is more significant than that of the GA.
3. Swarm intelligence inspired evolutionary computation: Swarm intelligence includes two important concepts: (a) The concept of a swarm implies multiplicity, stochasticity, randomness, and messiness. (b) The concept of intelligence implies a kind of problem-solving ability achieved through the interactions of simple information-processing units. This "collective intelligence" is built up through a population of homogeneous agents interacting with each other and with their environment. The major swarm intelligence inspired evolutionary computation methods include:
• Ant Colony Optimization (ACO), proposed by Dorigo in 1992 [28]: a swarm of ants can solve very complex problems such as finding the shortest path between their nest and a food source. If finding the shortest path is looked upon as an optimization problem, then each path between the starting point (the nest) and the terminal (the food source) can be viewed as a feasible solution.
• Artificial Bee Colony (ABC) algorithm, proposed by Karaboga in 2005 [81], is inspired by the intelligent foraging behavior of the honey bee swarm.


Fig. 9.1 Evolutionary computation tree: physics-inspired methods (e.g., simulated annealing); Darwin's evolution theory inspired methods (Genetic Algorithm, Evolutionary Algorithms, Evolutionary Strategy, Evolutionary Programming, Differential Evolution); and swarm intelligence inspired methods (Ant Colony Optimization, Artificial Bee Colony, Particle Swarm Optimization)

In the ABC algorithm, the position of a food source represents a possible solution of the optimization problem, and the nectar amount of a food source corresponds to the quality (fitness) of the associated solution.
• Particle Swarm Optimization (PSO), developed by Kennedy and Eberhart in 1995 [86], is essentially an optimization technique based on swarm intelligence; it adopts a population-based stochastic algorithm for finding optimal regions of complex search spaces through the interaction of individuals in a population of particles. Different from evolutionary algorithms, the particle swarm does not use a selection operation; instead, the interactions of all population members result in iterative improvement of the quality of problem solutions from the beginning of a trial until the end.
The above classification of evolutionary computation can be represented by the evolutionary computation tree shown in Fig. 9.1.

9.2 Multiobjective Optimization

In science and engineering applications, many optimization problems involve multiple objectives. Different from single-objective problems, the objectives in multiobjective problems are usually in conflict. For example, in the design of a complex hardware/software system, an optimal design might be an architecture that minimizes cost and power consumption while maximizing the overall performance. This structural contradiction leads to multiobjective optimization theories and methods different from single-objective optimization. The need for solving multiobjective optimization problems has spawned various evolutionary computation methods. This section discusses two classes of problems: multiobjective combinatorial optimization problems and multiobjective optimization problems.


9.2.1 Multiobjective Combinatorial Optimization

Combinatorial optimization is a topic that consists of finding an optimal object from a finite set of objects. The general framework of multiobjective combinatorial optimization (MOCO) is described as [152]

min_{x∈D} { c_1^T x = z_1(x), . . . , c_K^T x = z_K(x) },    (9.2.1)
subject to D = {x : x ∈ LD, x ∈ B^n},   LD = {x : Ax ≤ b},    (9.2.2)

where c_k ∈ R^{n×1}; the solution x = [x_1, . . . , x_n]^T ∈ R^n is a vector of discrete decision variables; A ∈ R^{m×n}, b ∈ R^m, and B = {0, 1}; and Ax ≤ b denotes the elementwise inequality (Ax)_i ≤ b_i for all i = 1, . . . , m. Most multiobjective combinatorial optimization problems are related to assignment problems (APs), transportation problems (TPs), and network flow problems (NFPs).
1. Multiobjective assignment problems [17]:

min z_k(x) = ∑_{i=1}^n ∑_{j=1}^n c_{ij}^k x_{ij},   k = 1, . . . , K,    (9.2.3)
∑_{j=1}^n x_{ij} = 1,   i = 1, . . . , n,    (9.2.4)
∑_{i=1}^n x_{ij} = 1,   j = 1, . . . , n,    (9.2.5)
x_{ij} ∈ {0, 1},   i, j = 1, . . . , n.    (9.2.6)

2. Multiobjective transportation problems [152]:

min z_k(x) = ∑_{i=1}^r ∑_{j=1}^s c_{ij}^k x_{ij},   k = 1, . . . , K,    (9.2.7)
∑_{j=1}^s x_{ij} = a_i,   i = 1, . . . , r,    (9.2.8)
∑_{i=1}^r x_{ij} = b_j,   j = 1, . . . , s,    (9.2.9)
x_{ij} ≥ 0 and integer,   i = 1, . . . , r; j = 1, . . . , s.    (9.2.10)


Here it is not restrictive to suppose equality constraints, with ∑_{i=1}^r a_i = ∑_{j=1}^s b_j.
3. Multiobjective network flow or transhipment problems [152]: Let G(N, A) be a network with a node set N and an arc set A; the model can be stated as

min z_k(x) = ∑_{(i,j)∈A} c_{ij}^k x_{ij},   k = 1, . . . , K,    (9.2.11)
∑_{j∈A\i} x_{ij} − ∑_{j∈A\i} x_{ji} = 0,   ∀ i ∈ N,    (9.2.12)
l_{ij} ≤ x_{ij} ≤ u_{ij},   ∀ (i, j) ∈ A,    (9.2.13)
x_{ij} ≥ 0 and integer,    (9.2.14)

where x_{ij} is the flow through arc (i, j), c_{ij}^k is the linear transhipment "cost" for arc (i, j) in objective k, and l_{ij} and u_{ij} are lower and upper bounds on x_{ij}, respectively.
4. Multiobjective Min Sum Location Problems [134]:

min ∑_{i=1}^m f_i^k z_i + ∑_{i=1}^m ∑_{j=1}^n c_{ij}^k x_{ij},   k = 1, . . . , K,    (9.2.15)
∑_{i=1}^m x_{ij} = 1,   j = 1, . . . , n,    (9.2.16)
a_i z_i ≤ ∑_{j=1}^n d_{ij} x_{ij} ≤ b_i z_i,   i = 1, . . . , m,    (9.2.17)
x_{ij} ∈ {0, 1},   i = 1, . . . , m; j = 1, . . . , n,    (9.2.18)
z_i ∈ {0, 1},   i = 1, . . . , m,    (9.2.19)

where z_i = 1 if a facility is established at site i; x_{ij} = 1 if customer j is assigned to the facility at site i; d_{ij} is a certain usage of the facility at site i by customer j if he/she is assigned to that facility; a_i and b_i are possible limitations on the total customer usage permitted at facility i; c_{ij}^k is, for objective k, a variable cost (or distance, etc.) if customer j is assigned to facility i; and f_i^k is, for objective k, a fixed cost associated with the facility at site i.
An unconstrained combinatorial optimization problem P = (S, f) is an optimization problem [120], where S is a finite set of solutions (called the search space), and f : S → R+ is an objective function that assigns a positive cost value to each solution vector s ∈ S. The goal is either to find a solution of minimum cost value or, in the case of approximate solution techniques, a good enough solution in a reasonable amount of time.


The difficulty of multiobjective combinatorial optimization problems comes from the following two factors [22].
• Solving a multiobjective combinatorial optimization problem requires intensive co-operation with the decision maker; this results in especially high requirements for effective tools used to generate efficient solutions.
• Many combinatorial problems are hard even in their single-objective versions; their multiple-objective versions are frequently more difficult.
A multiobjective decision making (MODM) problem is ill-posed from the mathematical point of view because, except in trivial cases, it has no optimal solution. The goal of MODM methods is to find the solution most consistent with the decision maker's preferences, i.e., the best compromise. Under very weak assumptions about the decision maker's preferences, the best compromise solution belongs to the set of efficient solutions [136].

9.2.2 Multiobjective Optimization Problems

Consider optimization problems with multiple objectives that may be mutually incompatible.

Definition 9.1 (Multiobjective Optimization Problem [78, 150]) A multiobjective optimization problem (MOP) consists of a set of n parameters (i.e., decision variables), a set of m objective functions, and a set of p inequality constraints and q equality constraints, and can be mathematically formulated as

minimize/maximize y = f(x) = [f_1(x), . . . , f_m(x)]^T,    (9.2.20)
subject to g_i(x) ≤ 0 (i = 1, . . . , p) or g(x) ≤ 0,    (9.2.21)
h_j(x) = 0 (j = 1, . . . , q) or h(x) = 0.    (9.2.22)

Here the integer m ≥ 2 is the number of objectives; x = [x_1, . . . , x_n]^T ∈ X is a decision vector and y = [y_1, . . . , y_m]^T is the objective vector; f_i : R^n → R, i = 1, . . . , m are the objective functions, and g_i, h_j : R^n → R, i = 1, . . . , p; j = 1, . . . , q are the p inequality constraints and q equality constraints of the problem, respectively; X = {x ∈ R^n} is called the decision space, and Y = {y ∈ R^m | y = f(x), x ∈ X} is known as the objective space.
The decision space and the objective space are two Euclidean spaces in MOPs:
• The n-dimensional decision space X ∈ R^n, in which each coordinate axis corresponds to a component (called a decision variable) of the decision vector x.


• The m-dimensional objective space Y ∈ R^m, in which each coordinate axis corresponds to a component (i.e., a scalar objective function) of the objective function vector y = f(x) = [f_1(x), . . . , f_m(x)]^T.
The MOP's evaluation function f : x → y maps the decision vector x = [x_1, . . . , x_n]^T to the objective vector y = [y_1, . . . , y_m]^T = [f_1(x), . . . , f_m(x)]^T. If some objective function f_i is to be maximized (or minimized), the original problem can be written as the minimization (or maximization) of the negative objective −f_i. Hereafter, we consider only multiobjective minimization (or maximization) problems, rather than optimization problems mixing both minimization and maximization. The constraints g(x) ≤ 0 and h(x) = 0 determine the set of feasible solutions.

Definition 9.2 (Feasible Set) The feasible set X_f is defined as the set of decision vectors x that satisfy the inequality constraints g(x) ≤ 0 and the equality constraints h(x) = 0, i.e.,

X_f = {x ∈ X | g(x) ≤ 0, h(x) = 0}.    (9.2.23)

Example 9.1 The dynamic economic emission dispatch problem is a typical multiobjective optimization problem which attempts to minimize both a cost function and an emission function simultaneously while satisfying equality and inequality constraints [8].
1. Objectives:
• Cost function: The fuel cost function of each thermal generator, considering the valve-point effect, is expressed as the sum of a quadratic and a sinusoidal function. The total fuel cost in terms of real power can be expressed as

f_1 = ∑_{m=1}^M ∑_{i=1}^N [ a_i + b_i P_{im} + c_i P_{im}² + | d_i sin( e_i (P_i^min − P_{im}) ) | ],    (9.2.24)

where N is the number of generating units, M is the number of hours in the time horizon; a_i, b_i, c_i, d_i, e_i are cost coefficients of the ith unit; P_{im} is the power output of the ith unit at time m, and P_i^min is the lower generation limit for the ith unit.
• Emission function: The atmospheric pollutants such as sulfur oxides (SOx) and nitrogen oxides (NOx) caused by fossil-fueled generating units can be modeled separately. However, for comparison purposes, the total emission of these pollutants, which is the sum of a quadratic and an exponential function, can be expressed as

f_2 = ∑_{m=1}^M ∑_{i=1}^N [ α_i + β_i P_{im} + γ_i P_{im}² + η_i exp(δ_i P_{im}) ],    (9.2.25)

where α_i, β_i, γ_i, η_i, δ_i are emission coefficients of the ith unit.


2. Constraints:
• Real power balance constraints: The total real power generation must balance the predicted power demand plus the real power losses in the transmission lines, at each time interval over the scheduling horizon, i.e.,

∑_{i=1}^N P_{im} − P_{Dm} − P_{Lm} = 0,   m = 1, . . . , M,    (9.2.26)

where P_{Dm} is the load demand at time m and P_{Lm} is the transmission line loss at time m.
• Real power operating limits:

P_i^min ≤ P_{im} ≤ P_i^max,   i = 1, . . . , N, m = 1, . . . , M,    (9.2.27)

where P_i^max is the upper generation limit for the ith unit.
• Generating unit ramp rate limits:

P_{im} − P_{i(m−1)} ≤ UR_i,   i = 1, . . . , N, m = 1, . . . , M,    (9.2.28)
P_{i(m−1)} − P_{im} ≤ DR_i,   i = 1, . . . , N, m = 1, . . . , M,    (9.2.29)

where UR_i and DR_i are the ramp-up and ramp-down rate limits of the ith unit, respectively.

Definition 9.3 (Ideal Vector [20, p. 8]) Let x^{opt,k} = [x_1^{opt,k}, . . . , x_n^{opt,k}]^T be a vector of variables which optimizes (either minimizes or maximizes) the kth objective function f_k(x). In other words, the vector x^{opt,k} ∈ X_f is such that

f_k^{opt} = opt_{x∈X} f_k(x),   k ∈ {1, . . . , m}.    (9.2.30)

Then the objective vector f^{opt} = [f_1^{opt}, . . . , f_m^{opt}]^T (where f_k^{opt} denotes the optimum of the kth objective function) is ideal for an MOP, and the point in R^n which determines this vector is the ideal solution and is consequently called the ideal vector.
However, due to objective conflict and/or interdependence among the n decision variables of x, such an ideal solution vector is impossible for MOPs: none of the feasible solutions allows simultaneous optimal solutions for all objectives [63]. In other words, the individual optimal solutions for each objective are usually different. Thus, a mathematically most favorable solution should offer the least objective conflict.
Many real-life problems can be described as multiobjective optimization problems. The most typical multiobjective optimization problem is the traveling salesman problem.


The traveling salesman problem (TSP) can be stated as (see, e.g., [152])

min ∑_{i} c_{i,ρ(i)}^{(k)},   k = 1, . . . , K,    (9.2.31)

where ρ(i) is a tour of {1, . . . , n} and k denotes trip k. Let C = {c_1, . . . , c_{N_c}} be a set of cities, where c_i is the ith city and N_c is the number of all cities; let A = {(r, s) : r, s ∈ C} be the edge set, and let d(r, s) be a cost measure associated with edge (r, s) ∈ A. For a set of N_c cities, the TSP involves finding the shortest closed tour visiting each city only once. In other words, the TSP is to find the shortest route that visits each city once, ending up back at the starting city. The following are different forms of the TSP [105]:
• Euclidean TSP: If the cities r ∈ C are given by their coordinates (x_r, y_r) and d(r, s) is the Euclidean distance between cities r and s, then the TSP is a Euclidean TSP.
• Symmetric TSP: If d(r, s) = d(s, r) for all (r, s), then the TSP is a symmetric TSP (STSP).
• Asymmetric TSP: If d(r, s) ≠ d(s, r) for at least some (r, s), then the TSP is an asymmetric TSP (ATSP).
• Dynamic TSP: The dynamic TSP (DTSP) is a TSP in which cities can be added or removed at run time. The goal of the TSP is to find the shortest closed tour which visits all the cities in a given set as early as possible after each and every iteration.
Another typical example of multiobjective optimization is multiobjective data mining. One of the most important questions in data mining problems is how to evaluate a candidate model, which depends on the type of data mining task at hand. Data mining involves discovering interesting and potentially useful patterns of different types, such as associations, summaries, rules, changes, outliers, and significant structures [110]. Data mining tasks can broadly be classified into two categories [6, 62]:
• Predictive or supervised techniques learn from the current data in order to make predictions about the behavior of new data sets.
• Descriptive or unsupervised techniques provide a summary of the data.
The most commonly used data mining tasks include feature selection, classification, regression, clustering, association rule mining, deviation detection, and so on [110].
1. Feature Selection: It deals with the selection of an optimal relevant set of features or attributes that are necessary for the recognition process (classification or clustering). In general, the feature selection problem (Ω, P) can formally be defined as an optimization problem: determine the feature set F* for which

P(F*) = min_{F∈Ω} P(F, X),    (9.2.32)


where Ω is the set of possible feature subsets, F refers to a feature subset, and P : Ω × Ψ → R denotes a criterion to measure the quality of a feature subset with respect to its utility in classifying/clustering the set of points X ∈ Ψ.
2. Classification: The problem of supervised classification can formally be stated as follows. Given an unknown function g : X → Y (the ground truth) that maps input instances x ∈ X to output class labels y ∈ Y, and a training data set D = {(x_1, y_1), . . . , (x_n, y_n)} which is assumed to represent accurate examples of the mapping g, produce a function h : X → Y that approximates the correct mapping g as closely as possible.
3. Clustering: Clustering is an important unsupervised classification technique in which a set of n patterns x_i ∈ R^d are grouped into clusters {C_1, . . . , C_K} in such a way that patterns in the same cluster are similar in some sense and patterns in different clusters are dissimilar in the same sense. The main objective of any clustering technique is to produce a K × n partition matrix U = [u_{kj}], k = 1, . . . , K; j = 1, . . . , n, where u_{kj} is the membership of pattern x_j to cluster C_k:

u_{kj} = 1, if x_j ∈ C_k;   u_{kj} = 0, if x_j ∉ C_k,    (9.2.33)

for hard or crisp partitioning of the data, and ∀ k ∈ {1, . . . , K}: 0 < ∑_{j=1}^n u_{kj} < n.
f(x) > f(x′) ⇔ f(x) ≥ f(x′) ∧ f(x) ≠ f(x′).    (9.3.7)

Here the symbol ∧ may be interpreted as the logical conjunction operator ("AND"). It goes without saying that the above relations hold for f(x) = x and f(x′) = x′ as well. However, for two decision vectors x and x′ themselves, these relations are meaningless. In multiobjective optimization, the relation between decision vectors x and x′ is called Pareto dominance, which is based on the relations between their vector objective functions.

Definition 9.5 (Pareto Dominance [150, 171]) For any two decision vectors x and x′ in minimization problems, x is said to dominate (or outperform) x′, denoted by x ≺ x′, if and only if their objectives or outcomes jointly satisfy the following two conditions:
1. x is no worse than x′ in all objective functions, i.e., f_i(x) ≤ f_i(x′) for all i ∈ {1, . . . , m};
2. x is strictly better than x′ in at least one objective function, i.e., f_j(x) < f_j(x′) for at least one j ∈ {1, . . . , m};
or, equivalently,

x ≺ x′ ⇔ ∀ i ∈ {1, . . . , m} : f_i(x) ≤ f_i(x′) ∧ ∃ j ∈ {1, . . . , m} : f_j(x) < f_j(x′),    (9.3.8)

or simply written as

x ≺ x′ ⇔ f(x) < f(x′).    (9.3.9)

Similarly, for maximization problems,

x ≺ x′ ⇔ ∀ i ∈ {1, . . . , m} : f_i(x) ≥ f_i(x′) ∧ ∃ j ∈ {1, . . . , m} : f_j(x) > f_j(x′),    (9.3.10)

or simply written as

x ≺ x′ ⇔ f(x) > f(x′).    (9.3.11)

The Pareto dominance can be either weak or strict.

Definition 9.6 (Weak Pareto Dominance [150, 171]) A decision vector x is said to weakly dominate another decision vector x′, denoted by x ⪯ x′, if and only if

x ⪯ x′ ⇔ f(x) ≤ f(x′)   (for minimization),    (9.3.12)
x ⪯ x′ ⇔ f(x) ≥ f(x′)   (for maximization).    (9.3.13)

Definition 9.7 (Strict Pareto Dominance [171]) A decision vector x is said to strictly dominate x′, denoted by x ≺≺ x′, if and only if f_i(x) is better than f_i(x′) in all objectives, i.e.,

x ≺≺ x′ ⇔ ∀ i ∈ {1, . . . , m} : f_i(x) < f_i(x′)   (for minimization),    (9.3.14)
x ≺≺ x′ ⇔ ∀ i ∈ {1, . . . , m} : f_i(x) > f_i(x′)   (for maximization).    (9.3.15)
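The dominance relations of Definitions 9.5–9.7 are easy to express in code. The following Python sketch assumes minimization and operates directly on objective vectors; the function names are illustrative.

```python
import numpy as np

def weakly_dominates(fx, fy):
    """x weakly dominates x' (Definition 9.6, minimization): f(x) <= f(x') componentwise."""
    return np.all(fx <= fy)

def dominates(fx, fy):
    """x dominates x' (Definition 9.5, minimization): no worse in all objectives
    and strictly better in at least one."""
    return np.all(fx <= fy) and np.any(fx < fy)

def strictly_dominates(fx, fy):
    """x strictly dominates x' (Definition 9.7, minimization): strictly better in all objectives."""
    return np.all(fx < fy)

# Example: f(x) = (1, 3) dominates f(x') = (2, 3) but does not strictly dominate it
print(dominates(np.array([1, 3]), np.array([2, 3])),
      strictly_dominates(np.array([1, 3]), np.array([2, 3])))
```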

Definition 9.8 (Incomparable [150, 171]) Two decision vectors x and x′ are said to be incomparable, denoted by x ∥ x′, if neither x weakly dominates x′ nor x′ weakly dominates x, i.e.,

x ∥ x′ ⇔ ¬(x ⪯ x′) ∧ ¬(x′ ⪯ x).    (9.3.16)

Two incomparable solutions x and x′ are also known as mutually nondominating.

Definition 9.9 (Nondominance [144, 171]) A decision vector x ∈ X_f is said to be nondominated regarding a set A ⊆ X_f if and only if there is no vector a in A which dominates x; formally,

∄ a ∈ A : a ≺ x.    (9.3.17)

In other words, a solution x is said to be a nondominated solution if there is no a in the subset A that dominates x. A Pareto improvement is a change to a different allocation that makes at least one objective better off without making any other objective worse off. A trial solution is known as "Pareto efficient" or "Pareto-optimal" or "globally nondominated" when no further Pareto improvements can be made, in which case we are assumed to have reached Pareto optimality.

Definition 9.10 (Pareto-Optimality [144, 171]) A decision vector x ∈ X_f is said to be Pareto-optimal if and only if x is nondominated regarding the feasible region X_f, namely

∄ a ∈ X_f : a ≺ x.    (9.3.18)

In other words, all decision vectors which are not dominated by any other decision vector are known as "nondominated" or Pareto-optimal. The phrase "Pareto optimal" means that the solution is optimal with respect to the entire decision variable space unless otherwise specified. Figure 9.2, which follows the Economics Wiki "Pareto Efficient and Pareto Optimal Tutorial," shows the relation between Pareto efficiency and Pareto improvement for the two-objective minimization problem min_x f(x) = [f_1(x), f_2(x)]^T.
A nondominated decision vector in A ⊆ X_f is only Pareto-optimal in the local decision space A, while a Pareto-optimal decision vector is Pareto-optimal in the entire feasible decision space X_f. Therefore, a Pareto-optimal decision vector is definitely nondominated, but a nondominated decision vector is not necessarily Pareto-optimal. However, finding Pareto-optimal solutions is still not the ultimate aim of multiobjective optimization, because:
• There may be a large number of Pareto-optimal solutions. To gain deeper insight into the multiobjective optimization problem and knowledge about alternative solutions, there is often a special interest in finding or approximating the Pareto-optimal set.
• Due to conflicting objectives, it is only possible to obtain a set of trade-off solutions (referred to as the Pareto-optimal set in decision space or the Pareto-optimal front in objective space), instead of a single optimal solution.
Therefore our aim is to find the decision vectors that are nondominated within the entire search space. These decision vectors constitute the so-called Pareto-optimal set, as defined below.


Fig. 9.2 Pareto efficiency and Pareto improvement (the axes are the preference criteria f_1 and f_2; points A–D mark candidate allocations, with B and C Pareto-efficient points on the constraint curve and A an inefficient point). Point A is an inefficient allocation between the preference criteria f_1 and f_2 because it does not satisfy the constraint curve of f_1 and f_2. The two decisions to move from Point A to Points C and D would each be a Pareto improvement: they improve both f_1 and f_2 without making anything else worse off, and the resulting points are Pareto optimal. A move from Point A to Point B would not be a Pareto improvement because it decreases the cost f_1 by increasing the other cost f_2, thus making one side better off by making another worse off. The move from any point that lies under the curve to any point on the curve cannot be a Pareto improvement, since it makes one of the two criteria f_1 and f_2 worse

Definition 9.11 (Nondominated Sets [150]) Let A ⊆ X_f, and let the function g(A) give the set of nondominated decision vectors in A:

g(A) = {a ∈ A | a is nondominated regarding A}.    (9.3.19)

The set g(A) is said to be the nondominated set (NS) with respect to A. When A = X_f, the corresponding set

NS_min = { x ∈ X_f : {f(x′) ∈ Y : f(x′) ≤ f(x) ∧ f(x′) ≠ f(x)} = ∅ },    (9.3.20)
NS_max = { x ∈ X_f : {f(x′) ∈ Y : f(x′) ≥ f(x) ∧ f(x′) ≠ f(x)} = ∅ }    (9.3.21)

is known as the nondominated set in the entire decision space for minimization and maximization, respectively.
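A direct, if inefficient, way to obtain the nondominated set g(A) of Eq. (9.3.19) is to test every solution against all others with the dominance check above. The following sketch assumes minimization and that the objective vectors are stored as rows of an array; names are illustrative.

```python
import numpy as np

def nondominated_set(F):
    """Sketch of Eq. (9.3.19): return the indices of nondominated objective vectors
    (minimization) in F, an (n_solutions x m_objectives) array."""
    n = F.shape[0]
    keep = []
    for i in range(n):
        dominated = any(
            np.all(F[j] <= F[i]) and np.any(F[j] < F[i])   # F[j] dominates F[i]
            for j in range(n) if j != i
        )
        if not dominated:
            keep.append(i)
    return keep

# Example: the second point (2, 2) is dominated by (1, 2)
F = np.array([[1.0, 2.0], [2.0, 2.0], [3.0, 1.0]])
print(nondominated_set(F))   # -> [0, 2]
```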


Definition 9.12 (Pareto-Optimal Set [150]) The Pareto-optimal set (PS) is the set of all objective vectors given by the nondominated decision vectors x in the nondominated set NS, namely

PS_min = { f(x) ∈ Y : {f(x′) ∈ Y : f(x′) ≤ f(x) ∧ f(x′) ≠ f(x)} = ∅ },    (9.3.22)
PS_max = { f(x) ∈ Y : {f(x′) ∈ Y : f(x′) ≥ f(x) ∧ f(x′) ≠ f(x)} = ∅ }.    (9.3.23)

Here PS_min and PS_max are the Pareto-optimal sets for minimization and maximization, respectively. Pareto-optimal decision vectors cannot be improved in any objective without causing a degradation in at least one other objective; they represent globally optimal solutions. Analogous to single-objective optimization problems, Pareto-optimal sets are divided into local and global Pareto-optimal sets.

Definition 9.13 (Local Pareto-Optimal Set [23, 171]) Consider a set of decision vectors X′ ⊆ X_f. The set X′ is said to be a local Pareto-optimal set if and only if

∀ x′ ∈ X′ : ∄ x ∈ X_f : x ≺ x′ ∧ ‖x − x′‖ < ε ∧ ∀ i ∈ {1, . . . , m} : |f_i(x) − f_i(x′)| < δ,    (9.3.24)

where ‖·‖ is a corresponding distance metric and ε > 0, δ > 0.

Definition 9.14 (Global Pareto-Optimal Set [23, 171]) The set X′ is called a global Pareto-optimal set if and only if

∀ x′ ∈ X′ : ∄ x ∈ X_f : x ≺ x′.    (9.3.25)

The set of all objective vectors given by the Pareto-optimal set of decision vectors is known as the Pareto-optimal front (or simply the Pareto front).

Definition 9.15 (Pareto Front) Given an objective vector f(x) = [f_1(x), . . . , f_m(x)]^T and a Pareto-optimal set PS, the Pareto front PF is defined as the set of all objective vectors given by decision vectors x ∈ PS, i.e.,

PF ≜ { f(x) ∈ R^m | x ∈ PS }.    (9.3.26)

Solutions in the Pareto front represent the possible optimal trade-offs between competing objectives. For a given system, the Pareto frontier is the set of parameterizations (allocations) that are all Pareto efficient or Pareto optimal. Finding Pareto frontiers is particularly useful in engineering: once all of the potentially optimal solutions are available, a designer needs only to focus on trade-offs within this constrained set of parameters, without considering the full range of parameters.


It should be noted that a global Pareto-optimal set does not necessarily contain all Pareto-optimal solutions. Pareto-optimal solutions are those solutions within the genotype search space (i.e., decision space) whose corresponding phenotype objective vector components cannot all be simultaneously improved. These solutions are also called non-inferior, admissible, or efficient solutions [73] in the sense that they are nondominated with respect to all other comparison solution vectors and may have no clearly apparent relationship besides their membership in the Pareto-optimal set. A "current" set of Pareto-optimal solutions is denoted by P_current(t), where t represents the generation number. A secondary population storing nondominated solutions found through the generations is denoted by P_known, while the "true" Pareto-optimal set, denoted by P_true, is not explicitly known for MOP problems of any difficulty. The associated Pareto fronts for P_current(t), P_known, and P_true are termed PF_current(t), PF_known, and PF_true, respectively.
The decision maker often selects solutions via a choice of acceptable objective performance, represented by the (known) Pareto front. Choosing an MOP solution that optimizes only one objective may well ignore "better" solutions for other objectives. We wish to determine the Pareto-optimal set from the set X of all the decision variable vectors that satisfy the constraints (9.2.21) and (9.2.22). However, not all solutions in the Pareto-optimal set are normally desirable or achievable in practice. For example, we may not wish to have different solutions that map to the same values in objective function space. Lower and upper bounds on the objective values of all Pareto-optimal solutions are given by the ideal objective vector and the nadir objective vector, respectively. In other words, the Pareto front of a multiobjective optimization problem is bounded by a nadir objective vector z^nad and an ideal objective vector z^ideal, if these are finite.

Definition 9.16 (Nadir Objective Vector, Ideal Objective Vector) The nadir objective vector is defined as

z_i^nad = sup { f_i(x) : x ∈ F is Pareto optimal },   for all i = 1, . . . , K,    (9.3.27)

and the ideal objective vector as

z_i^ideal = inf_{x∈F} f_i(x),   for all i = 1, . . . , K.    (9.3.28)

In other words, the components of the nadir and the ideal objective vectors define upper and lower bounds for the objective function values of Pareto-optimal solutions, respectively. In practice, the nadir objective vector can only be approximated, as the whole Pareto-optimal set is typically unknown. In addition, for numerical reasons a utopian objective vector z^utopian with elements

z_i^utopian = z_i^ideal − ε,   for all i = 1, . . . , K,    (9.3.29)

or in vector form

z^utopian = z^ideal − ε 1,    (9.3.30)

is often defined, where ε > 0 is a small constant and 1 is a K-dimensional vector of all ones.

Definition 9.17 (Archive) The nondominated set produced by a heuristic algorithm is referred to as the archive (denoted by A) of the estimated Pareto front. The archive of estimated Pareto fronts will only be an approximation to the true Pareto front.

When solving multiobjective optimization problems, since there are many possible Pareto-optimal solutions rather than a single optimal solution for the possibly contradictory multiple objective functions, we need to organize or partition the found solutions in order to select appropriate solution(s). The selection methods for Pareto-optimal solutions are divided into the following four categories:
• Fitness selection approach,
• Nondominated sorting approach,
• Crowding distance assignment approach, and
• Hierarchical clustering approach.

9.3.2 Fitness Selection Approach

To fairly evaluate the quality of each solution and to make the solution set search in the direction of the Pareto-optimal solutions, it is important how fitness assignment is implemented. There are benchmark metrics or indicators that play an important role in evaluating Pareto fronts. These metrics or indicators are: hypervolume, spacing, maximum spread, and coverage [139].

Definition 9.18 (Hypervolume [170]) For a nondominated vector x_i ∈ P_a^*, its hypervolume (HV) is defined as

HV = ⋃_{i=1}^{K} { a_i | x_i ∈ P_a^* },    (9.3.31)

where P_a^* is the area in the objective search space covered by the obtained Pareto front, and a_i is the hypervolume determined by the components of x_i and the origin.


Definition 9.19 (Spacing) Spacing is a measure for the spread (distribution) of nondominated solutions throughout the Pareto front. Actually, it measures the variance of the distance between adjacent nondominated solutions and can be evaluated by Santana et al. [139]
$$
S = \sqrt{ \frac{1}{m-1} \sum_{i=1}^{m} \left( \bar{d} - d_i \right)^2 },
\tag{9.3.32}
$$
where d̄ is the mean distance between all the adjacent solutions, m is the number of nondominated solutions in the Pareto front, and
$$
d_i = \min_{j} \left\{ \left| f_1^i(\mathbf{x}) - f_1^j(\mathbf{x}) \right| + \left| f_2^i(\mathbf{x}) - f_2^j(\mathbf{x}) \right| \right\}, \quad i, j = 1, \ldots, m.
\tag{9.3.33}
$$

A value of spacing equal to zero means that all the solutions are equidistantly spaced in the Pareto front.

Definition 9.20 (Maximum Spread) This metric, proposed by Zitzler et al. [171], evaluates the maximum extension covered by the nondominated solutions in the Pareto front, and can be calculated by
$$
\mathrm{MS} = \sqrt{ \sum_{i=1}^{K} \left[ \max\left\{ f_i^1, \ldots, f_i^m \right\} - \min\left\{ f_i^1, \ldots, f_i^m \right\} \right]^2 },
\tag{9.3.34}
$$
where m is the number of solutions in the Pareto front and K is the number of objectives in a given problem. It should be noted that higher values of MS indicate better performance.

Definition 9.21 (Coverage) The coverage metric of two sets A and B, proposed by Zitzler and Thiele [170], is denoted by C(A, B), and maps the ordered pair (A, B) to the interval [0, 1] using
$$
C(A, B) = \frac{ \left| \{ \mathbf{b} \in B ;\ \exists\, \mathbf{a} \in A : \mathbf{a} \succeq \mathbf{b} \} \right| }{ |B| },
\tag{9.3.35}
$$
where |B| is the number of elements of the set B. The value C(A, B) = 1 means that all solutions in B are weakly dominated by ones in A. On the other hand, C(A, B) = 0 means that none of the solutions in B is weakly dominated by A. It should be noted that both C(A, B) and C(B, A) have to be evaluated, respectively, since C(A, B) is not necessarily equal to 1 − C(B, A). If 0 < C(A, B) < 1 and 0 < C(B, A) < 1, then neither A weakly dominates B nor B weakly dominates A. Thus, the sets A and B are incomparable, which means that A is not worse than B and vice versa.
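As a small illustration, the coverage metric of Eq. (9.3.35) can be computed directly from its set definition. The sketch below assumes a minimization problem and represents each nondominated set as a list of objective vectors; the names weakly_dominates and coverage, and the toy data, are illustrative and not taken from the cited references.

```python
def weakly_dominates(a, b):
    """True if objective vector a weakly dominates b (minimization):
    a is no worse than b in every objective."""
    return all(ai <= bi for ai, bi in zip(a, b))

def coverage(A, B):
    """Coverage metric C(A, B) of Eq. (9.3.35): the fraction of points in B
    that are weakly dominated by at least one point in A."""
    covered = sum(1 for b in B if any(weakly_dominates(a, b) for a in A))
    return covered / len(B)

# Both directions must be reported, since C(A, B) != 1 - C(B, A) in general.
A = [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0)]
B = [(2.0, 4.5), (3.0, 3.0), (5.0, 1.0)]
print(coverage(A, B), coverage(B, A))   # here: 1.0 and 0.0
```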


Two popular fitness selections are the binary tournament selection and the roulette wheel selection [140].

1. Binary tournament selection: k individuals are drawn from the entire population and allowed to compete (a tournament), and the best individual among them is extracted. The number k is the tournament size, and is often taken to be 2; in this special case, we call it binary tournament selection. Tournament selection is just the broader term where k can be any number ≥ 2.
2. Roulette wheel selection: It is also known as fitness proportionate selection, which is a genetic operator for selecting potentially useful solutions for recombination. In roulette wheel selection, as in all selection methods, the fitness function assigns a fitness to possible solutions or chromosomes. This fitness level is used to associate a probability of selection with each individual chromosome. If Fi is the fitness of individual i in the population, its probability of being selected is
$$
p_i = \frac{F_i}{\sum_{j=1}^{N} F_j},
\tag{9.3.36}
$$

where N is the number of individuals in the population. Algorithm 9.1 gives a roulette wheel fitness selection algorithm.

Algorithm 9.1 Roulette wheel fitness selection algorithm [140]
1. input: The population size k.
2. initialization: Generate randomly the initial population of candidate solutions.
3. Evaluate the fitness Fi of each individual in the population.
4. Compute the probability (slot size) of selecting each member of the population: pi = Fi / Σ_{j=1}^{k} Fj, where k is the population size.
5. Calculate the cumulative probability qi for each individual: qi = Σ_{j=1}^{i} pj.
6. Generate a uniform random number r ∈ (0, 1].
7. If r < q1, then select the first chromosome x1; else select the individual xi such that qi−1 < r < qi.
8. Repeat steps 4–7 k times to create k candidates in the mating pool.
9. output: k candidate solutions in the mating pool.
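A compact Python rendering of this procedure is sketched below; it follows the steps of Algorithm 9.1, assumes all fitness values are positive (as fitness-proportionate selection requires), and uses illustrative names such as roulette_wheel_select.

```python
import random

def roulette_wheel_select(population, fitness, k):
    """Select k individuals by roulette wheel (fitness-proportionate) selection,
    following the steps of Algorithm 9.1. Assumes positive fitness values."""
    total = sum(fitness)
    probs = [f / total for f in fitness]          # slot sizes p_i = F_i / sum_j F_j
    # cumulative probabilities q_i = sum_{j<=i} p_j
    cum, q = 0.0, []
    for p in probs:
        cum += p
        q.append(cum)
    mating_pool = []
    for _ in range(k):
        r = random.uniform(0.0, 1.0)              # uniform random number in (0, 1]
        # pick the first individual whose cumulative probability reaches r
        for individual, qi in zip(population, q):
            if r <= qi:
                mating_pool.append(individual)
                break
    return mating_pool

# Usage: fitter individuals are proportionally more likely to enter the pool.
pop = ["x1", "x2", "x3", "x4"]
fit = [1.0, 2.0, 3.0, 4.0]
print(roulette_wheel_select(pop, fit, k=4))
```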

9.3.3 Nondominated Sorting Approach The nondominated selection is based on nondominated rank: dominated individuals are penalized by subtracting a small fixed penalty term from their expected number of copies during selection. But this algorithm failed when the population had very few nondominated individuals, which may result in a large fitness value for those few nondominated points, eventually leading to a high selection pressure. The idea behind the nondominated sorting procedure is twofold.


• Use a ranking selection method to emphasize good points.
• Use a niche method to maintain stable subpopulations of good points.

If an individual (or solution) is not dominated by any other individual, then it is assigned to the first nondominated level (or rank). An individual is said to have the second nondominated level or rank if it is dominated only by the individual(s) in the first nondominated level. Similarly, one can define individuals in the third or higher nondominated levels or ranks. It is noted that several individuals may have the same nondominated level or rank. Let irank represent the nondominated rank of the ith individual. For two solutions with different nondomination ranks, if irank < jrank then the ith solution with the lower (better) rank is preferred. Algorithm 9.2 shows a fast nondominated sorting approach.

Algorithm 9.2 Fast-nondominated-sort(P) [25]
1. for each p ∈ P
2.   Sp = ∅
3.   np = 0
4.   for each q ∈ P
5.     if (xp ≺ xq) then   % if p dominates q
6.       Sp = Sp ∪ {q}
7.     else if (xq ≺ xp) then
8.       np = np + 1
9.     end if
10.  if np = 0 then
11.    prank = 1
12.    F1 = F1 ∪ {p}
13.  end if
14.  end for
15.  i = 1
16.  while Fi ≠ ∅
17.    Q = ∅
18.    for each p ∈ Fi
19.      for each q ∈ Sp
20.        nq = nq − 1
21.        if nq = 0 then
22.          qrank = i + 1
23.          Q = Q ∪ {q}
24.        end if
25.      end for
26.    end for
27.    i = i + 1
28.    Fi = Q
29.  end while
30. end for
31. output: F = (F1, F2, . . .) = fast-nondominated-sort(P), nondominated fronts of P.
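The same procedure can be written compactly in Python. The sketch below assumes a minimization problem and stores each front as a list of indices into the population; the names dominates and fast_nondominated_sort, and the toy objective vectors, are illustrative rather than taken from [25].

```python
def dominates(a, b):
    """True if objective vector a dominates b (minimization): a is no worse
    in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def fast_nondominated_sort(objs):
    """Return the nondominated fronts F = (F_1, F_2, ...) as lists of indices
    into objs, following the logic of Algorithm 9.2."""
    n = len(objs)
    S = [[] for _ in range(n)]          # S_p: solutions dominated by p
    n_dom = [0] * n                     # n_p: number of solutions dominating p
    fronts = [[]]
    for p in range(n):
        for q in range(n):
            if dominates(objs[p], objs[q]):
                S[p].append(q)
            elif dominates(objs[q], objs[p]):
                n_dom[p] += 1
        if n_dom[p] == 0:               # p belongs to the first front
            fronts[0].append(p)
    i = 0
    while fronts[i]:
        next_front = []
        for p in fronts[i]:
            for q in S[p]:
                n_dom[q] -= 1
                if n_dom[q] == 0:       # q moves to the next front
                    next_front.append(q)
        i += 1
        fronts.append(next_front)
    return fronts[:-1]                  # drop the trailing empty front

# Example: three fronts for a small two-objective population.
objs = [(1, 5), (2, 3), (4, 1), (3, 4), (5, 5)]
print(fast_nondominated_sort(objs))     # [[0, 1, 2], [3], [4]]
```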

In order to identify solutions of the first nondominated front in a population of size NP , each solution can be compared with every other solution in the population to find if it is dominated. At this stage, all individuals in the first nondominated


level in the population are found. In order to find the individuals in the second nondominated level, the solutions of the first level are temporarily set aside and the above procedure is repeated; continuing in this way yields the third and higher nondomination levels (ranks).

9.3.4 Crowding Distance Assignment Approach

If two solutions have the same nondominated rank (say irank = jrank), which one should we choose? In this case, we need another quality measure for selecting the better solution. A natural choice is to estimate the density of solutions surrounding a particular solution in the population. To this end, the average distance of the two points on either side of this point along each of the objectives needs to be calculated.

Definition 9.22 (Crowding Distance [132]) Let f1 and f2 be two objective functions in a multiobjective optimization problem, and let x, xi, and xj be members of a nondominated list of solutions. Furthermore, xi and xj are the nearest neighbors of x in the objective space. The crowding distance (CD) of a trial solution x in a nondominated set depicts the perimeter of a hypercube formed by its nearest neighbors (i.e., xi and xj) at the vertices in the fitness landscapes.

The crowding distance computation requires sorting the population according to each objective function value in ascending order of magnitude. Let the population consist of |I| individuals which are ranked according to their fitness values from small to large, and let fki = fki(x) denote the kth objective function of the ith individual, where i = 1, . . . , |I| and k = 1, . . . , K. Except for the boundary individuals 1 and |I|, the ith individual in the intermediate population set {2, . . . , |I| − 1} is assigned a finite crowding distance (CD) value given by Luo et al. [98]:
$$
\mathrm{CD}_i = \frac{1}{K} \sum_{k=1}^{K} \left| f_k^{i+1} - f_k^{i-1} \right|, \quad i = 2, \ldots, |I| - 1,
\tag{9.3.37}
$$
or [25, 163]
$$
\mathrm{CD}_i = \sum_{k=1}^{K} \frac{ \left| f_k^{i+1} - f_k^{i-1} \right| }{ f_k^{\max} - f_k^{\min} }, \quad i = 2, \ldots, |I| - 1,
\tag{9.3.38}
$$
while the distance values of the boundary individuals are assigned as CD1 = CD|I| = ∞. Here, fkmax = max{fk1, . . . , fk|I|} and fkmin = min{fk1, . . . , fk|I|} are the maximum and minimum values of the kth objective over all individuals, respectively. The crowded-comparison operator guides the selection process at the various stages of the algorithm toward a uniformly spread-out Pareto-optimal front. Every individual has two attributes: nondomination rank and crowding distance.


By combining the nondominated rank and the crowding distance, the crowded-comparison operator is defined by the partial order (≺n) as [25]:
$$
i \prec_n j, \quad \text{if } (i_{\mathrm{rank}} < j_{\mathrm{rank}}) \ \text{or} \ \left( (i_{\mathrm{rank}} = j_{\mathrm{rank}}) \ \text{and} \ \mathrm{CD}_i > \mathrm{CD}_j \right).
\tag{9.3.39}
$$

That is, between two solutions with differing nondomination ranks, the solution with the lower (better) rank is preferred. Otherwise, if both solutions belong to the same front, then the solution that is located in a less crowded region is preferred. Therefore, the crowded-comparison operator (≺n) guides the selection process at the various stages of the algorithm toward a uniformly spread-out Pareto-optimal front. Algorithm 9.3 shows the crowding-distance-assignment(I).

Algorithm 9.3 Crowding-distance-assignment(I) [25]
l = |I|
for each i, set CDi = 0
for each objective j
  I = sort(I, j)
  CD1 = CDl = ∞
  for i = 2 to (l − 1)
    CDi = CDi + (fji+1 − fji−1)/(fjmax − fjmin)
  end for
end for
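A Python sketch of this crowding-distance assignment, in the normalized form of Eq. (9.3.38), is given below. The function name crowding_distance and the small example front are illustrative; the sketch assumes the front is passed as a list of K-dimensional objective vectors.

```python
def crowding_distance(objs):
    """Assign a crowding distance to each solution of a nondominated front,
    following Algorithm 9.3 / Eq. (9.3.38). objs is a list of K-dimensional
    objective vectors; returns one distance per solution."""
    n = len(objs)
    K = len(objs[0])
    cd = [0.0] * n
    for k in range(K):
        order = sorted(range(n), key=lambda i: objs[i][k])   # sort by objective k
        f_min, f_max = objs[order[0]][k], objs[order[-1]][k]
        cd[order[0]] = cd[order[-1]] = float("inf")          # boundary solutions
        if f_max == f_min:
            continue                                         # degenerate objective
        for j in range(1, n - 1):
            i = order[j]
            cd[i] += (objs[order[j + 1]][k] - objs[order[j - 1]][k]) / (f_max - f_min)
    return cd

# Extreme points get infinite distance; interior points get finite values.
front = [(1.0, 5.0), (2.0, 3.0), (3.0, 2.5), (5.0, 1.0)]
print(crowding_distance(front))
```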

Between two individuals with differing nondomination ranks, the individual with the lower (better) rank is preferred. If both individuals belong to the same front, then the individual with the larger crowding distance is preferred.

9.3.5 Hierarchical Clustering Approach

Cluster analysis can be applied to the results of a multiobjective optimization algorithm to organize or partition solutions based on their objective function values. The steps of the hierarchical clustering methodology are given below [108]:
1. Define decision variables, feasible set, and objective functions.
2. Choose and apply a Pareto optimization algorithm.
3. Clustering analysis:
   • Clustering tendency: By visual inspection or data projections verify that a hierarchical cluster structure is a reasonable model for the data.
   • Data scaling: Remove implicit variable weightings due to relative scales using range scaling.


   • Proximity: Select and apply an appropriate similarity measure for the data, here, Euclidean distance.
   • Choice of algorithm(s): Consider the assumptions and characteristics of clustering algorithms and select the most suitable algorithm for the application, here, group average linkage.
   • Application of algorithm: Apply the selected algorithm to obtain a dendrogram.
   • Validation: Examine the results based on application subject matter knowledge, assess the fit to the input data and stability of the cluster structure, and compare the results of multiple algorithms, if used.
4. Represent and use the clusters and structure: If the clustering is reasonable and valid, examine the divisions in the hierarchy for trade-offs and other information to aid decision making.

9.3.6 Benchmark Functions for Multiobjective Optimization

When designing an artificial intelligence algorithm or system, we need to select benchmark or test functions to examine its effectiveness in converging to and maintaining a diverse set of nondominated solutions. A very simple multiobjective test function is the well-known one-variable two-objective function used by Schaffer [141]. It is defined as follows:
$$
\min\ \mathbf{f}_2(x) = [g(x), h(x)] \quad \text{with } g(x) = x^2, \ h(x) = (x - 2)^2,
\tag{9.3.40}
$$

where x ∈ [0, 2]. The following are several commonly used multiobjective benchmark functions.

1. Benchmark functions for min f(x) = [g(x), h(x)]
• ZDT1 functions [171] (a Python sketch of this pair appears after this list):
$$
\left.
\begin{aligned}
f_1(x_1) &= x_1, \\
g(\mathbf{x}) &= g(x_2, \ldots, x_m) = 1 + 9 \sum_{i=2}^{m} x_i / (m - 1), \\
h(\mathbf{x}) &= h(f_1(x_1), g(x_2, \ldots, x_m)) = 1 - \sqrt{f_1 / g},
\end{aligned}
\right\}
\tag{9.3.41}
$$
where m = 30, xi ∈ [0, 1].


• ZDT4 functions [171]:
$$
\left.
\begin{aligned}
f_1(x_1) &= x_1, \\
g(\mathbf{x}) &= g(x_2, \ldots, x_m) = 1 + 10(m - 1) + \sum_{i=2}^{m} \left[ x_i^2 - 10 \cos(4\pi x_i) \right], \\
h(\mathbf{x}) &= h(f_1(x_1), g(x_2, \ldots, x_m)) = 1 - \sqrt{f_1 / g},
\end{aligned}
\right\}
\tag{9.3.42}
$$
where m = 10 and xi ∈ [0, 1].
• ZDT6 functions [171]:
$$
\left.
\begin{aligned}
f_1(x_1) &= 1 - \exp(-4 x_1) \sin^6(6\pi x_1), \\
g(\mathbf{x}) &= g(x_2, \ldots, x_m) = 1 + 9 \left[ \sum_{i=2}^{m} x_i / (m - 1) \right]^{0.25}, \\
h(\mathbf{x}) &= 1 - (f_1 / g)^2,
\end{aligned}
\right\}
\tag{9.3.43}
$$

where m = 10 and xi ∈ [0, 1].

2. Benchmark functions for min f(x) = [f1(x), f2(x)]
• QV functions [126]:
$$
\left.
\begin{aligned}
f_1(\mathbf{x}) &= \left( \frac{1}{n} \sum_{i=1}^{n} \left[ x_i^2 - 10 \cos(2\pi x_i) + 10 \right] \right)^{1/4}, \\
f_2(\mathbf{x}) &= \left( \frac{1}{n} \sum_{i=1}^{n} \left[ (x_i - 1.5)^2 - 10 \cos\left( 2\pi (x_i - 1.5) \right) + 10 \right] \right)^{1/4},
\end{aligned}
\right\}
\tag{9.3.44}
$$

where xi ∈ [−5, 5].
• KUR functions [90]:
$$
\left.
\begin{aligned}
f_1(\mathbf{x}) &= \sum_{i=1}^{m-1} \left[ -10 \exp\left( -0.2 \sqrt{x_i^2 + x_{i+1}^2} \right) \right], \\
f_2(\mathbf{x}) &= \sum_{i=1}^{m} \left[ |x_i|^{0.8} + \sin^3(x_i) \right],
\end{aligned}
\right\}
\tag{9.3.45}
$$
where xi ∈ [−10^3, 10^3].


3. Constrained benchmark functions
• Partially separable benchmark functions [23, 171]:
$$
\left.
\begin{aligned}
&\min\ \mathbf{f}(\mathbf{x}) = \left( f_1(x_1), f_2(\mathbf{x}) \right), \\
&\text{subject to } f_2(\mathbf{x}) = g(x_2, \ldots, x_n)\, h\!\left( f_1(x_1), g(x_2, \ldots, x_n) \right),
\end{aligned}
\right\}
\tag{9.3.46}
$$
where x = [x1, . . . , xn]^T, the function f1 is a function of the first decision variable x1 only, g is a function of the remaining n − 1 variables x2, . . . , xn, and the parameters of h are the function values of f1 and g. For example,
$$
\left.
\begin{aligned}
f_1(x_1) &= x_1, \quad g(x_2, \ldots, x_n) = 1 + \frac{9}{n - 1} \sum_{i=2}^{n} x_i, \\
h(f_1, g) &= 1 - \sqrt{f_1 / g} \quad \text{or} \quad h(f_1, g) = 1 - (f_1 / g)^2,
\end{aligned}
\right\}
\tag{9.3.47}
$$

where n = 30 and xi ∈ [0, 1]. The Pareto-optimal front is formed with g(x) = 1.
• Separable benchmark functions [170]:
$$
\left.
\begin{aligned}
&\max\ \mathbf{f}(\mathbf{x}) = [f_1(\mathbf{x}), f_2(\mathbf{x})], \quad f_i(\mathbf{x}) = \sum_{j=1}^{500} a_{ij} x_j, \ i = 1, 2, \\
&\text{subject to } \sum_{j=1}^{500} b_{ij} x_j \leq c_i, \ i = 1, 2; \quad x_j = 0 \text{ or } 1, \ j = 1, 2, \ldots, 500.
\end{aligned}
\right\}
\tag{9.3.48}
$$

In this test, x is a 500-dimensional binary vector, aij is the profit of item j according to knapsack i, bij is the weight of item j according to knapsack i, and ci is the capacity of knapsack i (i = 1, 2 and j = 1, 2, . . . , 500).

4. Multiobjective benchmark functions for min f(x) = [f1(x), . . . , fm(x)]: SPH-m functions [92, 141]
$$
f_j(\mathbf{x}) = \sum_{i=1,\, i \neq j}^{n} (x_i)^2 + (x_j - 1)^2, \quad j = 1, \ldots, m; \ m = 2, 3,
\tag{9.3.49}
$$
where xi ∈ [−10^3, 10^3].

5. Multiobjective benchmark functions for min f(x) = [f1(x), f2(x), g3(x), . . . , gm(x)]


• ISH1 functions [79]:
$$
\left.
\begin{aligned}
g_i(\mathbf{x}) &= f_i(\mathbf{x}), && i = 1, 2, \\
g_i(\mathbf{x}) &= \alpha f_1(\mathbf{x}) + (1 - \alpha) f_i(\mathbf{x}), && i = 3, 5, 7, 9, \\
g_i(\mathbf{x}) &= \alpha f_2(\mathbf{x}) + (1 - \alpha) f_i(\mathbf{x}), && i = 4, 6, 8, 10,
\end{aligned}
\right\}
\tag{9.3.50}
$$

where the value of α ∈ [0, 1] can be viewed as the correlation strength. α = 0 corresponds to the minimum correlation strength 0, where gi(x) is the same as the randomly generated objective fi(x). α = 1 corresponds to the maximum correlation strength 1, where gi(x) is the same as f1(x) or f2(x).
• ISH2 functions [79]:
$$
\left.
\begin{aligned}
g_i(\mathbf{x}) &= f_i(\mathbf{x}), && i = 1, 2, \\
g_i(\mathbf{x}) &= \alpha_i^k f_1(\mathbf{x}) + (1 - \alpha_i^k) f_2(\mathbf{x}), && i = 3, 4, \ldots, k,
\end{aligned}
\right\}
\tag{9.3.51}
$$
where αik is specified as follows:
$$
\alpha_i^k = (i - 2)/(k - 1), \quad i = 3, 4, \ldots, k; \ k = 4, 6, 8, 10.
\tag{9.3.52}
$$
For example, the four-objective problem is specified as
$$
\left.
\begin{aligned}
g_1(\mathbf{x}) &= f_1(\mathbf{x}), \quad g_2(\mathbf{x}) = f_2(\mathbf{x}), \\
g_3(\mathbf{x}) &= \tfrac{1}{3} f_1(\mathbf{x}) + \tfrac{2}{3} f_2(\mathbf{x}), \\
g_4(\mathbf{x}) &= \tfrac{2}{3} f_1(\mathbf{x}) + \tfrac{1}{3} f_2(\mathbf{x}).
\end{aligned}
\right\}
\tag{9.3.53}
$$

More benchmark test functions can be found in [149, 165, 172]. Yao et al. [165] summarized 23 benchmark test functions.
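As an illustration of how such benchmarks are used in practice (and as referenced in the ZDT1 item above), the sketch below implements the ZDT1 pair of Eq. (9.3.41) and evaluates it on sample decision vectors. The function name zdt1 is illustrative, and the standard setting m = 30 with xi ∈ [0, 1] is assumed.

```python
import numpy as np

def zdt1(x):
    """ZDT1 benchmark of Eq. (9.3.41): returns (f1, f2) with f2 = g * h for a
    decision vector x whose components lie in [0, 1] (m = 30 by default)."""
    x = np.asarray(x, dtype=float)
    m = x.size
    f1 = x[0]
    g = 1.0 + 9.0 * np.sum(x[1:]) / (m - 1)      # g(x_2, ..., x_m)
    h = 1.0 - np.sqrt(f1 / g)                    # h(f1, g)
    return f1, g * h

# The Pareto-optimal front is obtained for g(x) = 1, i.e. x_2 = ... = x_m = 0.
rng = np.random.default_rng(0)
print(zdt1(rng.random(30)))                      # a random (dominated) point
print(zdt1(np.r_[0.25, np.zeros(29)]))           # a Pareto-optimal point: (0.25, 0.5)
```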

9.4 Noisy Multiobjective Optimization Optimization problems in various real-world applications are often characterized by multiple conflicting objectives and a wide range of uncertainty. Optimization problems containing uncertainty and multiobjectives are termed as uncertain multiobjective optimization problems. In evolutionary optimization community, uncertainty in the objective functions is generally stochastic noise, and the corresponding multiobjective optimization problems are termed as noisy (or imprecise) multiobjective optimization problems. Noisy multiobjective optimization problems are also known as interval multiobjective optimization problems, since objectives fi (x, ci ) contaminated


by noise vector ci can be reformulated, in interval-valued form, as $f_i(\mathbf{x}, \mathbf{c}_i) = [\underline{f}_i(\mathbf{x}, \mathbf{c}_i), \overline{f}_i(\mathbf{x}, \mathbf{c}_i)]$.

9.4.1 Pareto Concepts for Noisy Multiobjective Optimization

There are at least two different types of uncertainty that are important for real-world modeling [96].
1. Noise, also referred to as aleatory uncertainty. Noise is an inherent property of the system modeled (or is introduced into the model to simulate this behavior) and therefore cannot be reduced. By Oberkampf et al. [117], aleatory uncertainty is defined as the "inherent variation associated with the physical system or the environment under consideration."
2. Imprecision, also known as epistemic uncertainty, describes not uncertainty due to system variance but the uncertainty of the outcome due to "any lack of knowledge or information in any phase or activity of the modeling process" [117].

Uncertainty in objectives can be divided into two kinds [58].
• The decision vector x ∈ R^d corresponding to different objectives is no longer a fixed point, but is characterized by (x, c) with c = [c1, . . . , cd]^T, which is a local neighborhood of x, where $c_i = [\underline{c}_i, \overline{c}_i]$ is an interval with lower limit $\underline{c}_i$ and upper limit $\overline{c}_i$ for i = 1, . . . , d.
• The objective fi is no longer a fixed value fi(x), but is an interval of objective values, denoted by $f_i(\mathbf{x}, \mathbf{c}) = [\underline{f}_i(\mathbf{x}, \mathbf{c}), \overline{f}_i(\mathbf{x}, \mathbf{c})]$.

Therefore, a noisy multiobjective optimization problem can be described as follows [58]:
$$
\max/\min\ \mathbf{f} = [f_1(\mathbf{x}, \mathbf{c}_1), \ldots, f_m(\mathbf{x}, \mathbf{c}_m)]^T,
\tag{9.4.1}
$$
$$
\text{subject to } \mathbf{c}_i = [c_{i1}, \ldots, c_{il}]^T, \quad c_{ij} = [\underline{c}_{ij}, \overline{c}_{ij}], \quad i = 1, \ldots, m, \ j = 1, \ldots, l,
\tag{9.4.2}
$$

where x is a d-dimensional decision variable; X is the decision space of x; $f_i(\mathbf{x}, \mathbf{c}_i) = [\underline{f}_i(\mathbf{x}, \mathbf{c}_i), \overline{f}_i(\mathbf{x}, \mathbf{c}_i)]$ is the ith objective function with interval $[\underline{f}_i, \overline{f}_i]$, i = 1, . . . , m (m is the number of objectives, and m ≥ 3); ci is a fixed interval vector independent of the decision vector x (namely, ci remains unchanged along with x), while $c_{ij} = [\underline{c}_{ij}, \overline{c}_{ij}]$ is the jth interval parameter component of ci. For any two solutions x1 and x2 ∈ X of problem (9.4.1), the corresponding ith objectives are fi(x1, ci) and fi(x2, ci), i = 1, . . . , m, respectively.

Definition 9.23 (Interval Order Relation) Consider two objective intervals $\mathbf{f}(\mathbf{x}_1) = [\underline{\mathbf{f}}(\mathbf{x}_1), \overline{\mathbf{f}}(\mathbf{x}_1)]$ and $\mathbf{f}(\mathbf{x}_2) = [\underline{\mathbf{f}}(\mathbf{x}_2), \overline{\mathbf{f}}(\mathbf{x}_2)]$. Their interval order relations


are defined as
$$
\mathbf{f}(\mathbf{x}_1) \leq_{IN} \mathbf{f}(\mathbf{x}_2) \Leftrightarrow \underline{\mathbf{f}}(\mathbf{x}_1) \leq \underline{\mathbf{f}}(\mathbf{x}_2) \wedge \overline{\mathbf{f}}(\mathbf{x}_1) \leq \overline{\mathbf{f}}(\mathbf{x}_2),
\tag{9.4.3}
$$
$$
\mathbf{f}(\mathbf{x}_1) \geq_{IN} \mathbf{f}(\mathbf{x}_2) \Leftrightarrow \underline{\mathbf{f}}(\mathbf{x}_1) \geq \underline{\mathbf{f}}(\mathbf{x}_2) \wedge \overline{\mathbf{f}}(\mathbf{x}_1) \geq \overline{\mathbf{f}}(\mathbf{x}_2),
\tag{9.4.4}
$$
$$
\mathbf{f}(\mathbf{x}_1) >_{IN} \mathbf{f}(\mathbf{x}_2) \Leftrightarrow \underline{\mathbf{f}}(\mathbf{x}_1) \geq \underline{\mathbf{f}}(\mathbf{x}_2) \wedge \overline{\mathbf{f}}(\mathbf{x}_1) \geq \overline{\mathbf{f}}(\mathbf{x}_2) \wedge \mathbf{f}(\mathbf{x}_1) \neq \mathbf{f}(\mathbf{x}_2),
\tag{9.4.6}
$$
$$
\mathbf{f}(\mathbf{x}_1) \parallel \mathbf{f}(\mathbf{x}_2) \Leftrightarrow \text{neither } \mathbf{f}(\mathbf{x}_1) \leq_{IN} \mathbf{f}(\mathbf{x}_2) \text{ nor } \mathbf{f}(\mathbf{x}_2) \leq_{IN} \mathbf{f}(\mathbf{x}_1).
\tag{9.4.7}
$$

In the absence of other factors (e.g., preference for certain objectives, or for a particular region of the trade-off surface), the task of an evolutionary multiobjective optimization (EMO) algorithm is to provide as good an approximation as possible to the true Pareto front. To compare two evolutionary multiobjective optimization algorithms, we need to compare the nondominated sets they produce. Since an interval is also a set, consisting of the components larger than its lower limit and smaller than its upper limit, x1 = (x, c1) and x2 = (x, c2) can be regarded as two components in sets A and B, respectively. Hence, the decision solutions x1 and x2 are not two fixed points but two approximation sets, denoted A and B, respectively. Different from the Pareto concepts defined for decision vectors in (precise) multiobjective optimization, the Pareto concepts in noisy (or imprecise) multiobjective optimization are called Pareto concepts for approximation sets, since the concepts are defined for approximation sets. To evaluate approximations to the true Pareto front, Hansen and Jaszkiewicz [64] define a number of outperformance relations that express the relationship between two sets of internally nondominated objective vectors, A and B. The outperformance is habitually called the dominance.

Definition 9.24 (Dominance for Sets) An approximation set A is said to dominate another approximation set B, denoted by A ≻ B, if every x2 ∈ B is dominated by at least one x1 ∈ A in all objectives [173], i.e.,
$$
A \succ B \Leftrightarrow \exists\, \mathbf{x}_1 \in A,\ \mathbf{f}(\mathbf{x}_1) >_{IN} \mathbf{f}(\mathbf{x}_2),\ \forall\, \mathbf{x}_2 \in B \quad \text{(for maximization)},
\tag{9.4.9}
$$
or written as [64, 89]:
$$
A \succ B \Leftrightarrow ND(A \cup B) = A \ \text{ and } \ B \setminus ND(A \cup B) \neq \emptyset,
\tag{9.4.10}
$$

where N D(S) denotes the set of nondominated points in S. Definition 9.25 (Weak Dominance for Sets) An approximation set A is said to weakly dominate (or weakly outperform) another approximation set B, denoted by


A ⪰ B, if every x2 ∈ B is weakly dominated by at least one x1 ∈ A in all objectives [173]:
$$
A \succeq B \Leftrightarrow \exists\, \mathbf{x}_1 \in A,\ \mathbf{f}(\mathbf{x}_1) \leq_{IN} \mathbf{f}(\mathbf{x}_2),\ \forall\, \mathbf{x}_2 \in B \quad \text{(for minimization)},
\tag{9.4.11}
$$
$$
A \succeq B \Leftrightarrow \exists\, \mathbf{x}_1 \in A,\ \mathbf{f}(\mathbf{x}_1) \geq_{IN} \mathbf{f}(\mathbf{x}_2),\ \forall\, \mathbf{x}_2 \in B \quad \text{(for maximization)},
\tag{9.4.12}
$$
or written as [64, 89]:
$$
A \succeq B \Leftrightarrow ND(A \cup B) = A \ \text{ and } \ A \neq B.
\tag{9.4.13}
$$

That is to say, A weakly dominates B if all points in B are "covered" by those in A (here "covered" means is equal to or dominates) and there is at least one point in A that is not contained in B.

Definition 9.26 (Strict Dominance for Sets) An approximation set A is said to strictly dominate (or completely outperform) another approximation set B, denoted by A ≻≻ B, if every x2 ∈ B is strictly dominated by at least one x1 ∈ A in all objectives, namely [173]:
$$
A \succ\succ B \Leftrightarrow \exists\, \mathbf{x}_1 \in A,\ f_i(\mathbf{x}_1) >_{IN} f_i(\mathbf{x}_2),\ \forall\, \mathbf{x}_2 \in B \quad \text{(for maximization)},
\tag{9.4.15}
$$
for all i ∈ {1, . . . , m}, or written as [64, 89]:
$$
A \succ\succ B \Leftrightarrow ND(A \cup B) = A \ \text{ and } \ B \cap ND(A \cup B) = \emptyset.
\tag{9.4.16}
$$

Definition 9.27 (Better) An approximation set A is said to be better than another approximation set B, denoted as A ▷ B, and is defined as [173]:
$$
A \rhd B \Leftrightarrow \exists\, \mathbf{x}_1 \in A,\ \mathbf{f}(\mathbf{x}_1) \leq_{IN} \mathbf{f}(\mathbf{x}_2),\ \forall\, \mathbf{x}_2 \in B \wedge A \neq B \quad \text{(for minimization)},
\tag{9.4.17}
$$
$$
A \rhd B \Leftrightarrow \exists\, \mathbf{x}_1 \in A,\ \mathbf{f}(\mathbf{x}_1) \geq_{IN} \mathbf{f}(\mathbf{x}_2),\ \forall\, \mathbf{x}_2 \in B \wedge A \neq B \quad \text{(for maximization)}.
\tag{9.4.18}
$$
From the above definition of the relation ▷, one can conclude that A ⪰ B ⇒ A ▷ B ∨ A = B. In other words, if A weakly dominates B, then either A is better than B or they are equal. Notice that
$$
A \succ\succ B \Rightarrow A \succ B \Rightarrow A \rhd B \Rightarrow A \succeq B.
\tag{9.4.19}
$$


Table 9.1 Relation comparison between objective vectors and approximation sets

• Weakly dominates. Objective vectors: x ⪰ x′ (x is not worse than x′). Approximation sets: A ⪰ B (every f(x2) ∈ B is weakly dominated by at least one f(x1) in A).
• Dominates. Objective vectors: x ≻ x′ (x is not worse than x′ in all objectives and better in at least one objective). Approximation sets: A ≻ B (every f(x2) ∈ B is dominated by at least one f(x1) in A).
• Strictly dominates. Objective vectors: x ≻≻ x′ (x is better than x′ in all objectives). Approximation sets: A ≻≻ B (every f(x2) ∈ B is strictly dominated by at least one f(x1)).
• Better. Approximation sets: A ▷ B (every f(x2) ∈ B is weakly dominated by at least one f(x1) and A ≠ B).
• Incomparable. Objective vectors: x ∥ x′ (neither x ⪰ x′ nor x′ ⪰ x). Approximation sets: A ∥ B (neither A ⪰ B nor B ⪰ A).

In other words, strict dominance (complete outperformance) is the strongest and weak dominance (weak outperformance) is the weakest among the set dominance relations.

Definition 9.28 (Incomparable for Sets [173]) An approximation set A is said to be incomparable to another approximation set B, denoted by A ∥ B, if neither A weakly dominates B nor B weakly dominates A, i.e.,
$$
A \parallel B \Leftrightarrow A \not\succeq B \wedge B \not\succeq A.
\tag{9.4.20}
$$

Table 9.1 compares the relations between objective vectors and approximation sets. For dominance and nondominance, we have the following probability representations.
• A dominates B: the corresponding probabilities are P(A ≻ B) = 1, P(A ≺ B) = 0, and P(A ≡ B) = 0.
• A is dominated by B: the corresponding probabilities are P(A ≺ B) = 1, P(A ≻ B) = 0, and P(A ≡ B) = 0.
• A and B are nondominated with respect to each other: the corresponding probabilities are P(A ≻ B) = 0, P(A ≺ B) = 0, and P(A ≡ B) = 1.

The fitness of an individual in one population refers to the direct competition (capability) with some individual(s) from another population. Hence, the fitness plays a crucial role in evaluating individuals in the evolutionary process. When considering two fitness values A and B with multiple objectives, in the case without noise, there are three possible outcomes from comparing the two fitness values.


In noisy multiobjective optimization, in addition to the convergence, diversity, and spread of a Pareto front, there are two indicators: hypervolume and imprecision [96]. For convenience, we focus on multiobjective maximization hereafter. To determine whether a nondominated set is a Pareto-optimal set, Zitzler et al. [171] suggest three goals that can be identified and measured (see also [89]):
1. The distance of the obtained nondominated front to the Pareto-optimal front should be minimized.
2. A good (in most cases uniform) distribution of the solutions found, in objective space, is desirable.
3. The extent of the obtained nondominated front should be maximized, i.e., for each objective, a wide range of values should be present.

The performance metrics or indicators play an important role in evaluating noisy multiobjective optimization algorithms [53, 89, 173].

Definition 9.29 (Hypervolume for Set [96]) For an approximate Pareto-optimal solution set X of problem (9.4.1), the hypervolume for set is defined as
$$
H(X) = \left[ \underline{H}(X), \overline{H}(X) \right] = \wedge \left( \bigcup_{\mathbf{x} \in X} \left\{ \mathbf{y} \in \mathbb{R}^n \mid \mathbf{x} \succeq_{IN} \mathbf{y} \succeq_{IN} \mathbf{x}_{\mathrm{ref}} \right\} \right),
\tag{9.4.21}
$$

where xref is a reference point; ∧ is the Lebesgue measure; $\underline{H}(X)$ and $\overline{H}(X)$ are the worst-case and the best-case hypervolume, respectively.

Definition 9.30 (Imprecision for Set) For an approximate Pareto-optimal solution set X of problem (9.4.1), the imprecision of the front corresponding to X is defined as follows [58]:
$$
I(X) = \sum_{\mathbf{x} \in X} \sum_{i=1}^{k} \left[ \overline{f}_i(\mathbf{x}, \mathbf{c}_i) - \underline{f}_i(\mathbf{x}, \mathbf{c}_i) \right],
\tag{9.4.22}
$$
where x is a solution in X, and $f_i(\mathbf{x}, \mathbf{c}_i) = [\underline{f}_i(\mathbf{x}, \mathbf{c}_i), \overline{f}_i(\mathbf{x}, \mathbf{c}_i)]$ is the ith objective function with interval parameters $\mathbf{c}_i = [\underline{\mathbf{c}}_i, \overline{\mathbf{c}}_i]$, i = 1, . . . , m. The smaller the value of the imprecision, the smaller the uncertainty of the front.

9.4.2 Performance Metrics for Approximation Sets The performance metrics or indicators play an important role in noisy multiobjective optimization.


It is usually assumed (e.g., see [2, 14, 53, 75, 76, 89]) that noise has a disruptive influence on the value of each individual in the objective space, namely
$$
\tilde{f}_i(\mathbf{x}) = f_i(\mathbf{x}) + N(0, \sigma^2),
\tag{9.4.23}
$$
where N(0, σ²) denotes the normal distribution with zero mean and variance σ², representing the level of noise present; and f̃i and fi denote the ith objective functions with and without the additive noise, respectively. σ² is represented as a percentage of fimax, where fimax is the maximum of the ith objective in the true Pareto front. The following are the performance metrics or indicators pertinent to the optimization goals: proximity, diversity, and distribution [53].

1. Proximity indicator: The metric of generational distance (GD) gives a good indication of the gap between the evolved Pareto front PFknown and the true Pareto front PFtrue, and is defined as
$$
\mathrm{GD} = \left( \frac{1}{n_{PF}} \sum_{i=1}^{n_{PF}} d_i^2 \right)^{1/2},
\tag{9.4.24}
$$

where nPF is the number of members in PFknown, and di is the Euclidean distance (in objective space) between member i of PFknown and its nearest member of PFtrue. Intuitively, a low value of GD is desirable because it reflects a small deviation between the evolved and the true Pareto front. However, the metric of GD gives no information about the diversity of the algorithm under evaluation.
2. Diversity indicator: To evaluate the diversity of an algorithm, the following modified maximum spread is used as the diversity indicator:
$$
\mathrm{MS} = \sqrt{ \frac{1}{m} \sum_{i=1}^{m} \left[ \left( \min\{ f_i^{\max}, F_i^{\max} \} - \min\{ f_i^{\min}, F_i^{\min} \} \right) \big/ \left( F_i^{\max} - F_i^{\min} \right) \right]^2 },
\tag{9.4.25}
$$
where m is the number of objectives, fimax and fimin are, respectively, the maximum and minimum of the ith objective in PFknown; and Fimax and Fimin are the maximum and minimum of the ith objective in PFtrue, respectively. This modified metric takes into account the proximity to PFtrue.
3. Distribution indicator: To evaluate how evenly the nondominated solutions are distributed along the discovered Pareto front, the modified metric of spacing is defined as
$$
S = \frac{1}{n_{PF}} \left( \frac{1}{n_{PF}} \sum_{i=1}^{n_{PF}} \left( d_i - \bar{d} \right)^2 \right)^{1/2},
\tag{9.4.26}
$$


where $\bar{d} = \frac{1}{n_{PF}} \sum_{i=1}^{n_{PF}} d_i$ is the average Euclidean distance between all members of PFknown and their nearest members of PFtrue.
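A minimal sketch of the generational distance of Eq. (9.4.24) is given below, assuming both fronts are available as arrays of objective vectors. The name generational_distance and the ZDT1-shaped example front are illustrative choices.

```python
import numpy as np

def generational_distance(pf_known, pf_true):
    """Generational distance, Eq. (9.4.24): root-mean-square of the Euclidean
    distances from each member of PF_known to its nearest member of PF_true."""
    pf_known = np.asarray(pf_known, dtype=float)
    pf_true = np.asarray(pf_true, dtype=float)
    # pairwise distances between evolved-front and true-front members
    diff = pf_known[:, None, :] - pf_true[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)    # nearest true-front member
    return np.sqrt((d ** 2).mean())

# Example: an evolved front slightly offset from a sampled true front.
f1 = np.linspace(0.0, 1.0, 50)
pf_true = np.c_[f1, 1.0 - np.sqrt(f1)]          # e.g., a ZDT1-like Pareto front
pf_known = pf_true + 0.02                       # uniformly offset approximation
print(generational_distance(pf_known, pf_true)) # small value => close to PF_true
```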

9.5 Multiobjective Simulated Annealing Metropolis algorithm, proposed by Metropolis et al. in 1953 [100], is a simple algorithm that can be used to provide an efficient simulation of a collection of atoms in equilibrium at a given temperature. This algorithm was extended by Hastings in 1970 [65] to the more general case, and hence is also called Metropolis-Hastings algorithm. The Metropolis algorithm is the earliest simulated annealing algorithm that is widely used for solving multiobjective combinatorial optimization problems.

9.5.1 Principle of Simulated Annealing The idea of simulated annealing originates from thermodynamics and metallurgy [153]: when molten iron is cooled slowly enough it tends to solidify in a structure of minimal energy. This annealing process is mimicked by a local search strategy; at the start, almost any move is accepted, which allows one to explore the solution space. Then, the temperature is gradually decreased such that one becomes more and more selective in accepting new solutions. By the end, only improving moves are accepted in practice. Tools used for generation of efficient solutions in multiobjective combinatorial optimization (MOCO), like single-objective optimization methods, may be classified into one of the following categories [22]: 1. Exact procedures; 2. Specialized heuristic procedures; 3. Metaheuristic procedures. The main disadvantage of exact algorithms is their high computational complexity and inflexibility. A metaheuristic can be defined as an iterative generation process which guides a subordinate heuristic by combining intelligently different concepts for exploring and exploiting the search space [118]. Metaheuristic is divided into two categories: single-solution metaheuristic and population metaheuristic [51]. Single-solution metaheuristic considers a single solution (and search trajectory) at a time. Its typical examples include simulated annealing (SA), tabu search (TS), etc. Population metaheuristic evolves concurrently a population of solutions rather than a single solution. Most evolutionary algorithms are based on population metaheuristic.


There are two basic strategies for heuristics: "divide-and-conquer" and iterative improvement [88].
• The divide-and-conquer strategy divides the problem into subproblems of manageable size, then solves the subproblems. The solutions to the subproblems must be patched back together, and the subproblems must be naturally disjoint, while the division made must be appropriate, so that errors made in patching do not offset the gains obtained in applying more powerful methods to the subproblems.
• Iterative improvement starts with the system in a known configuration; then a standard rearrangement operation is applied to all parts of the system in turn, until a rearranged configuration that improves the cost function is discovered. The rearranged configuration then becomes the new configuration of the system, and the process is continued until no further improvements can be found. Iterative improvement makes a search in this coordinate space for rearrangement steps which lead downhill. Since this search usually gets stuck in a local but not a global optimum, it is customary to carry out the process several times, starting from different randomly generated configurations, and save the best result.

By metaheuristic procedures, it is meant that they define only a "skeleton" of the optimization procedure that has to be customized for particular applications. The earliest metaheuristic method is simulated annealing [100]. Other metaheuristic methods include tabu search [52], genetic algorithms [55], and so on. The goal of multiple-objective metaheuristic procedures is to find a sample of feasible solutions that is a good approximation to the efficient solution set.

In physical annealing, particles are free to move around at high temperatures, while as the temperature is lowered they are increasingly confined due to the high energy cost of movement. From the point of view of optimization, the energy E(x) of the state x in physical annealing is regarded as the function to be minimized, and by introducing a parameter T, the computational temperature is lowered throughout the simulation according to an annealing schedule. Sampling from the equilibrium distribution is usually achieved by Metropolis sampling, which involves generating new solutions x′ that are accepted with probability [145]
$$
p = \min\left\{ 1, \exp\left( -\delta E(\mathbf{x}', \mathbf{x}_i)/T \right) \right\},
\tag{9.5.1}
$$

where δE(x′, xi) is the cost criterion for the transition from the current state xi to a new state x′, and T is the annealing temperature. For a single-objective framework, depending on the value of δE(x′, xi), we have the following different acceptances (or transitions) from the current state (or solution) xi to the new state x′ [145]:
• Any move leading to a negative value δE(x′, xi) < 0 is an improving one and will always be accepted (p = 1).


• If δE(x′, xi) > 0 is a positive value, then the probability to accept x′ as the new current solution xi+1 is given by p = exp(−δE(x′, xi)/T). Clearly, the higher the difference δE, the lower the probability to accept x′ instead of xi.
• δE(x′, xi) = 0 implies that the new state x′ has the same level of value as the current state xi; there may exist two schemes: move to the new state x′ or stay in the current state xi. The analysis of this problem shows that the move scheme is better than the stay scheme. In the stay scheme, the search will end on both edges of the Pareto frontier without entering the middle of the frontier. However, if the move scheme is used, then the search will continue into the middle part of the frontier, move freely between nondominated states like a random walk when the temperature is low, and eventually will be distributed uniformly over the Pareto frontier as time goes to infinity [114]. This result is in accordance with p = 1 given by (9.5.1).

The Boltzmann probability factor is defined as P(E(x)) = exp(−E(x)/kB T), where E(x) is the energy of the solution x, and kB is Boltzmann's constant. At each T the simulated annealing algorithm aims at drawing samples from the equilibrium distribution P(E(x)) = exp(−E(x)/T). As T → 0, any sample from P(E(x)) will almost surely lie at the minimum of E. Simulated annealing consists of two processes: first "melting" the system being optimized at a high effective temperature, then lowering the temperature in slow stages until the system "freezes" and no further changes occur. At each temperature, the simulation must proceed long enough so that the system reaches a steady state. The sequence of temperatures and the number of rearrangements of {xi} attempted to reach equilibrium at each temperature can be thought of as an annealing schedule. Intuitively, in addition to perturbations which decrease the energy, when T is high, perturbations from x to x′ which increase the energy are likely to be accepted and the samples can explore the state space. However, as T is reduced, only perturbations leading to small increases in E are accepted, so that only limited exploration is possible as the system settles on (hopefully) the global minimum.

Let E(xi) and E(x′) be the cost criteria of the current solution (or state) xi and the new solution x′, respectively. The implementation of simulated annealing requires three different kinds of options [153].
1. Decision rule: In this rule, when moving from a current solution (or state) xi to the new solution x′, the cost criterion is defined as
$$
\delta E(\mathbf{x}', \mathbf{x}_i) = E(\mathbf{x}') - E(\mathbf{x}_i).
\tag{9.5.2}
$$

2. Neighborhood V (x) is defined as a set of feasible solution close to x such that any solution satisfying the constraints of MOCO problems, D = {x : x ∈ LD, x ∈ B n } and LD = {x : Ax = b}, can be obtained after a finite number of moves. 3. Typical parameters: Some simulated annealing-typical parameters must be fixed as follows:


• The initial temperature parameter T0 or, alternatively, an initial acceptance probability p0;
• The cooling factor α (α < 1) and the temperature length step Nstep in the cooling schedule;
• The stopping rule(s): final temperature Tstop and/or the maximum number of iterations without improvement Nstop.

Algorithm 9.4 shows a single-objective simulated annealing algorithm.

Algorithm 9.4 Single-objective simulated annealing [153]
1. input: initial temperature parameter T0, cooling factor α, temperature length step Nstep, final temperature Tstop, and the maximum number of iterations Nstop.
2. initialization: draw at random an initial solution x0.
   2.1 Evaluate z(x0) = f(x0).
   2.2 X = {x0}.
   2.3 Ncount = n = 0.
3. while n = 1, . . . , Nstop
4. Draw at random a solution x′ in the neighborhood V(xn) of xn.
5. Evaluate z(x′) = f(x′).
6. Compute δz = z(x′) − z(xn).
7. The acceptance probability is p = exp(−δz/Tn).
8. If δz ≤ 0 then accept the new solution: xn+1 ← x′.
9. If δz > 0 then accept the new solution with probability p: xn+1 ← x′ with probability p, or xn+1 ← xn with probability 1 − p.
10. n ← n + 1.
11. Update the temperature Tn: Tn = αTn−1 if n (mod Nstep) = 0, and Tn = Tn−1 otherwise.
12. If n = Nstop or T < Tstop then exit.
13. If n < Nstop or T > Tstop then goto step 4.
14. end while
15. output: the optimal solution xn.

Simulated annealing has the following two important features [88]:
1. Annealing differs from iterative improvement in that the procedure need not get stuck, since transitions out of a local optimum are always possible at nonzero temperature.
2. Annealing is an adaptive form of the divide-and-conquer approach. Gross features of the eventual state of the system appear at higher temperatures, while fine details develop at lower temperatures.


9.5.2 Multiobjective Simulated Annealing Algorithm

Common ideas behind simulated annealing are [88, 91]:
• the concept of neighborhood;
• acceptance of new solutions with some probability;
• dependence of the probability on a parameter called the temperature; and
• the scheme of the temperature changes.

In comparison with single-objective simulated annealing, population-based simulated annealing (PSA) also uses the following ideas:
1. Apply the concept of a sample (population) of interacting solutions from genetic algorithms [55] at each iteration in simulated annealing. The solutions are called generate solutions.
2. In order to assure dispersion of the generated solutions over the whole set of efficient solutions, one must control the objective weights used in the multiple objective rules for the acceptance probability, in order to increase or decrease the probability of improving values of the particular objectives.

Different from the single-objective framework, in the multiple-objective framework, when comparing a new solution x′ with the current solution xi according to K criteria zk(x′, xi), k = 1, . . . , K, three cases can obviously occur [153].
• If zk(x′) ≤ zk(xn), ∀ k = 1, . . . , K, then the move from xi to x′ is an improvement with respect to all the objectives: δzk(x′, xn) = zk(x′) − zk(xn) ≤ 0, ∀ k = 1, . . . , K. Therefore, x′ is always accepted (p = 1).
• An improvement and a deterioration can be simultaneously observed on different cost criteria: δzk < 0 and δzk′ > 0 for some k ≠ k′ ∈ {1, . . . , K}. The first crucial point is how to define the acceptance probability p.
• When all cost criteria are deteriorated, ∀ k, δzk ≥ 0 with at least one strict inequality, a probability p to accept x′ instead of xi must be calculated.

Let
$$
\mathbf{z}(\mathbf{x}') = [z_1(\mathbf{x}'), \ldots, z_K(\mathbf{x}')]^T \quad \text{and} \quad \mathbf{z}(\mathbf{x}_i) = [z_1(\mathbf{x}_i), \ldots, z_K(\mathbf{x}_i)]^T.
\tag{9.5.3}
$$

For the computation of the probability p, the second crucial point is how to compute the "distance" between z(x′) and z(xi). To overcome the above two difficulties, Ulungu et al. [153] proposed a criterion scalarizing approach in 1999. Define the probability of the move from xi to x′ with respect to the kth criterion as follows:
$$
\pi_k =
\begin{cases}
\exp\left( -\dfrac{\delta z_k}{T_i} \right), & \delta z_k > 0, \\
1, & \delta z_k \leq 0.
\end{cases}
\tag{9.5.4}
$$


Then, the global acceptance probability p can be determined by
$$
p = t(\Pi, \boldsymbol{\lambda}), \quad \Pi = (\pi_1, \ldots, \pi_K),
\tag{9.5.5}
$$
where Π = (π1, . . . , πK) is an aggregation of the K criteria πk. The most intuitive criterion scalarizing functions t(Π, λ) are the product function
$$
t(\Pi, \boldsymbol{\lambda}) = \prod_{k=1}^{K} (\pi_k)^{\lambda_k}
\tag{9.5.6}
$$
and the minimum function
$$
t_{\min}(\Pi, \boldsymbol{\lambda}) = \min_{k=1,\ldots,K} (\pi_k)^{\lambda_k}.
\tag{9.5.7}
$$

In order to compute the distance between two solutions for each individual criterion and then to evaluate the acceptance probability of the move from xi to x′ with respect to this particular criterion, an efficient criterion scalarizing approach is to choose the acceptance probability given by
$$
p =
\begin{cases}
1, & \text{if } \delta s \leq 0; \\
\exp\left( -\dfrac{\delta s}{T_i} \right), & \text{if } \delta s > 0.
\end{cases}
\tag{9.5.8}
$$

Here δs = s(z(x′, λ)) − s(z(xi, λ)) may take one of the following two forms:
• Weighted sum:
$$
s(\mathbf{z}(\mathbf{x}, \boldsymbol{\lambda})) = \sum_{k=1}^{K} \lambda_k z_k(\mathbf{x}),
\tag{9.5.9}
$$
where
$$
\sum_{k=1}^{K} \lambda_k = 1, \quad \lambda_k > 0, \ \forall\, k = 1, \ldots, K.
\tag{9.5.10}
$$

An equivalent alternative is given by Engrand [39]:
$$
s(\mathbf{z}(\mathbf{x}, \boldsymbol{\lambda})) = \sum_{k=1}^{K} \log z_k(\mathbf{x}).
\tag{9.5.11}
$$

• Weighted Chebyshev norm L∞:
$$
s(\mathbf{z}(\mathbf{x}, \boldsymbol{\lambda})) = \max_{1 \leq k \leq K} \left\{ \lambda_k \left| \tilde{z}_k - z_k \right| \right\}, \quad \lambda_k > 0, \ \forall\, k = 1, \ldots, K,
\tag{9.5.12}
$$


where z˜ k are the cost (criterion) values of the ideal point z˜ k = minx∈D zk (x). It is easily shown [153] that the global acceptance probability p in (9.5.8) is just a most intuitive scalarizing function, i.e., p = t (Π, λ). Due to conflicting multiple objectives, a good algorithm for solving multiobjective optimization problems needs to have the following four important properties [114]. • Searching precision: Because of problem complexity of multiobjective optimization, the algorithm finds hardly the Pareto-optimal solutions, and hence it must find the possible near solutions to the optimal solutions set. • Searching-time: The algorithm must find the optimal set efficiently during searching-time. • Uniform probability distribution over the optimal set: The found solutions must be widely spread, or uniformly distributed over the real Pareto-optimal set rather than converging to one point because every solution is important in multiobjective optimization. • Information about Pareto frontier: The algorithm must give as much information as possible about the Pareto frontier. A multiobjective simulated annealing algorithm is shown in Algorithm 9.5. It is recognized (see, e.g., [114]) that simulated annealing for multiobjective optimization has the following properties. 1. By using the concepts of Pareto optimality and domination, high searching precision can be achieved by simulated annealing. 2. The main drawback of simulated annealing is searching-time, as it is generally known that simulated annealing takes long time to find the optimum. 3. An interesting advantage of simulated annealing is its uniform probability distribution property as it is mathematically proved [50, 104] that it can find each of the global optima with the same probability in a scalar finite-state problem. 4. For multiobjective optimization, as all the Pareto solutions have different cost vectors that have a trade-off relationship, a decision maker must select a proper solution from the found Pareto solution set or sometimes by interpolating the found solutions. Therefore, in order to apply simulated annealing in multiobjective optimization, one must reduce searching-time and effectively search Pareto optimal solutions. It should be noted that any solution in the set PE of Pareto-optimal solutions should not be dominated by other(s), otherwise any dominated solution should be removed from PE. In this sense, Pareto solution set should be the best nondominated set.
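Before turning to the full multiobjective algorithm, note that the acceptance rule of Eq. (9.5.8), combined with the weighted sum of Eq. (9.5.9) or the weighted Chebyshev norm of Eq. (9.5.12), translates directly into code. The sketch below is a minimal illustration; the function names and the numerical values are illustrative choices, not taken from [153].

```python
import math

def weighted_sum(z, lam):
    """s(z(x, lambda)) of Eq. (9.5.9): weighted sum of the K criteria."""
    return sum(l * zk for l, zk in zip(lam, z))

def chebyshev(z, lam, z_ideal):
    """s(z(x, lambda)) of Eq. (9.5.12): weighted Chebyshev distance to the ideal point."""
    return max(l * abs(zi - zk) for l, zk, zi in zip(lam, z, z_ideal))

def acceptance_probability(s_new, s_cur, T):
    """Acceptance probability of Eq. (9.5.8) for the move x_i -> x'."""
    ds = s_new - s_cur
    return 1.0 if ds <= 0 else math.exp(-ds / T)

# Two criteria, equal weights, temperature T = 0.5 (illustrative values).
lam = (0.5, 0.5)
z_cur, z_new = (2.0, 3.0), (2.5, 2.9)
p1 = acceptance_probability(weighted_sum(z_new, lam), weighted_sum(z_cur, lam), T=0.5)
print(p1)   # < 1: the weighted sum deteriorates slightly

z_ideal = (0.0, 0.0)    # assumed ideal point for the example
p2 = acceptance_probability(chebyshev(z_new, lam, z_ideal),
                            chebyshev(z_cur, lam, z_ideal), T=0.5)
print(p2)   # = 1: the Chebyshev measure improves, so the move is accepted
```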


Algorithm 9.5 Multiobjective simulated annealing [153]
1. input: initial temperature parameter T0, cooling factor α, temperature length step Nstep, final temperature Tstop, and the maximum number of iterations Nstop.
2. initialization: draw at random an initial solution x0.
   2.1 Evaluate zk(x0) = fk(x0), ∀ k = 1, . . . , K.
   2.2 The set of potentially optimal solutions PE = {x0}.
   2.3 Ncount = n = 0.
3. while n = 1, . . . , Nstop
4. Draw at random a solution x′ in the neighborhood V(xn) of xn.
5. Evaluate zk(x′) = fk(x′), ∀ k = 1, . . . , K.
6. Compute δs = Σ_{k=1}^{K} λk [zk(x′) − zk(xn)].
7. The acceptance probability is p = exp(−δs/Tn).
8. If δs ≤ 0 then accept the new solution: xn+1 ← x′.
9. If δs > 0 then accept the new solution with probability p: xn+1 ← x′ with probability p, or xn+1 ← xn with probability 1 − p.
10. Remove all the solutions dominated by xn+1 from PE. Add xn+1 to PE if no solutions in PE dominate xn+1.
11. n ← n + 1.
12. Update the temperature Tn: Tn = αTn−1 if n (mod Nstep) = 0, and Tn = Tn−1 otherwise.
13. If n = Nstop or T < Tstop then exit.
14. If n < Nstop or T > Tstop then goto step 4.
15. end while
16. output: the Pareto-optimal solutions PE = {xi ∈ D}.

9.5.3 Archived Multiobjective Simulated Annealing In multiobjective simulated annealing (MOSA) algorithm developed by Smith et al. [145], the acceptance of a new solution x is determined by its energy function. If the true Pareto front is available, then the energy of a particular solution x is calculated as the total energy of all solutions that dominates x. In practical applications, as the true Pareto front is not available all the time, one must estimate first the Pareto front F  which is the set of mutually nondominating solutions found thus far in the process. Then, the energy of the current solution x is the total energy of nondominating solutions. These nondominating solutions are called the archival nondominating solutions or nondominating solutions in Archive. In order to estimate the energy of the Pareto front F  , the number of archival nondominated solutions in the Pareto front should be taken into consideration in MOSA. But, this is not done in MOSA. By incorporating the nondominating solutions in the Archive, one can determine the acceptance of a new solution. Such an MOSA is known as archived multiobjective simulated annealing (AMOSA), which was proposed by Bandyopadhyay et al. in 2008 [7], see Algorithm 9.6.


Algorithm 9.6 Archived multiobjective simulated annealing (AMOSA) algorithm [7] 1. input: Tmax , Tmin , H L, SL, iter, α, temp = Tmax . 2. initialization: 2.1 Draw at random an initial Archive. 2.2 curren pt = random(Archive). 3. while temp > Tmin 4. for (i = 0; i < iter, i + +) 5. new-pt = pertub(current-pt). 6. Check the domination status of new-pt and current-pt. 7. if (current-pt dominates !!new-pt) then " " k 1 8. domavg = k+1 i=1 domi,new−pt + domcurrent,new−pt . 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. 42.

% k ≥ 0 is total number of points in the Archive which dominates new-pt. 1 prob = 1+exp(dom . avg ·temp) if (current-pt and new-pt are nondominated each other) then Check the domination status of new-pt and current-pt. if (new-pt is dominated by k (k ≥ 1) points in the Archive) then 1 prob = 1+exp(dom . ! avg ·temp) " 1 k domavg = k i=1 domi,new−pt . Set new-pt as current-pt with probability = prob. if (new-pt is nondominated w.r.t all the points in the Archive) then Set new-pt as current-pt and add it to the Archive. if Archive size > SL then Cluster Archive into H L number of clusters. if (new-pt dominates k (k ≥ 1) points in the Archive) then Set new-pt as current-pt and add it to the Archive. Remove all the k dominated points from the Archive. if (new-pt dominates current-pt) then Check the domination status of new-pt and points in the Archive. if (new-pt is dominated by k (k ≥ 1) points in the Archive) then dommin = minimum of the difference of domination amount between the new-pt and the k points. 1 prob = 1+exp(− . min ) Set point in the Archive which corresponds to dommin as current-pt with prob, else set new-pt as current-pt. if (new-pt is nondominated w.r.t all the points in the Archive) then Set new-pt as current-pt and add it to the Archive. if current-pt is in the Archive, remove it from the Archive. else if Archive size > SL, then cluster Archive into H L number of clusters. if (new-pt dominates k other points in the Archive) then Set new-pt as current-pt and add it to the Archive. Remove all the k dominated points from the Archive. end for temp = α · temp. end while if Archive-size > SL then Cluster Archive into H L number of clusters. end if


The following parameters need to be set a priori in Algorithm 9.6.
• HL: The maximum size of the Archive on termination. This is set equal to the maximum number of nondominated solutions required by the user;
• SL: The maximum size to which the Archive may be filled before clustering is used to reduce its size to HL;
• Tmax: Maximum (initial) temperature; Tmin: Minimal (final) temperature;
• iter: Number of iterations at each temperature;
• α: The cooling rate in SA.

Let x̄* = [x̄1*, . . . , x̄n*]^T denote the decision variable vector that simultaneously maximizes the objective values {f1(x̄), . . . , fK(x̄)}, while satisfying the constraints, if any. In maximization problems, a solution x̄i is said to dominate x̄j if and only if
$$
f_k(\bar{\mathbf{x}}_i) \geq f_k(\bar{\mathbf{x}}_j), \quad \forall\, k = 1, \ldots, m,
\tag{9.5.13}
$$
$$
f_k(\bar{\mathbf{x}}_i) \neq f_k(\bar{\mathbf{x}}_j), \quad \text{for some } k \in \{1, \ldots, m\}.
\tag{9.5.14}
$$

Among a set of solutions P, the nondominated set P′ contains those solutions that are not dominated by any member of the set P. The nondominated set of the entire search space S is the globally Pareto-optimal set. At a given temperature T, a new state s is selected with a probability
$$
P_{qs} = \frac{1}{1 + \exp\left( \dfrac{-(E(q, T) - E(s, T))}{T} \right)},
\tag{9.5.15}
$$

where q is the current state, whereas E(q, T ) and E(s, T ) are the corresponding energy values of q and s, respectively. One of the points, called current-pt, is randomly selected from Archive as the initial solution at temperature temp = Tmax . The current-pt is perturbed to generate a new solution called new-pt. The domination status of new-pt is checked with respect to the current-pt and solutions in Archive. Depending on the domination status between current-pt and new-pt, the new-pt is selected as the current-pt with different probabilities. This process constitutes the core of Algorithm 9.6, see Step 4 to Step 37.
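A small sketch of the probabilistic acceptance rule used in this core loop is given below: when new-pt is dominated, it is accepted with a probability governed by the average amount of domination and the current temperature, as in Steps 9 and 14 of Algorithm 9.6. The helper name and the example values are illustrative.

```python
import math

def acceptance_prob(dom_avg, temp):
    """Probability of accepting a dominated new-pt in AMOSA:
    prob = 1 / (1 + exp(dom_avg * temp))."""
    return 1.0 / (1.0 + math.exp(dom_avg * temp))

# The larger the product dom_avg * temp, the smaller the acceptance probability.
for dom_avg in (0.1, 1.0, 5.0):
    print(dom_avg, acceptance_prob(dom_avg, temp=1.0))
```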

9.6 Genetic Algorithm A genetic algorithm (GA) is a metaheuristic algorithm inspired by the process of natural selection. Genetic algorithms are commonly used to generate high-quality solutions to optimization and search problems by relying on bio-inspired operators such as mutation, crossover, and selection. In a genetic algorithm, a population of candidate solutions (called individuals, creatures, or phenotypes) to an optimization problem is evolved toward better


solutions. Each candidate solution has a set of properties (its chromosomes or genotype) which can be mutated and altered; traditionally, solutions are represented in binary as strings of 0s and 1s, but other encodings are also possible. The evolution is an iterative process:
(a) It usually starts from a population of randomly generated individuals. The population in each iteration is known as a generation.
(b) In each generation, the fitness of every individual in the population is evaluated. The fitness is usually the value of the objective function in the optimization problem being solved.
(c) The fitter individuals are stochastically selected from the current population, and each individual's genome is modified (recombined and possibly randomly mutated) to form a new generation.
(d) This new generation of candidate solutions is then used in the next iteration of the algorithm.
(e) The algorithm terminates when either a maximum number of generations has been produced, or a satisfactory fitness level has been reached for the population.

9.6.1 Basic Genetic Algorithm Operations The genetic algorithm consists of encoded chromosome, fitness function, reproduction, crossover, and mutation operations generally. A typical genetic algorithm requires three important concepts: • a genetic representation of the solution domain, • a fitness function to evaluate the solution domain, and • a notion of population. Unlike traditional search methods, genetic algorithms rely on a population of candidate solutions. A genetic algorithm (GA) is a search technique used for finding true or approximate solutions to optimization and search problems. GAs are a particular class of evolutionary algorithms (EAs) that use inheritance, mutation, selection, and crossover (also called recombination) inspired by evolutionary biology. GAs encode the solutions (or decision variables) of a search problem into finitelength strings of alphabets of certain cardinality. Candidate solutions are called individuals, creatures, or phenotypes. An abstract representation of individuals is called chromosomes, the genotype or the genome, the alphabets are known as genes and the values of genes are referred to as alleles. In contrast to traditional optimization techniques, GAs work with coding of parameters, rather than the parameters themselves. One can think of a population of individuals as one “searcher” sent into the optimization phase space. Each searcher is defined by his genes, namely his position inside the phase space is coded in his genes. Every searcher has the duty to find a value of the quality of his position in the phase space. To evolve good solutions and to implement natural selection, a measure is necessary for distinguishing good solutions from bad solutions. This measure is called fitness.


Once the genetic representation and the fitness function are defined, at the beginning of a run of a genetic algorithm a large population of random chromosomes is created. Each decoded chromosome will represent a different solution to the problem at hand. When two organisms mate they share their genes. The resultant offspring may end up having half the genes from one parent and half from the other. This process is called recombination. Very occasionally a gene may be mutated. Genetic algorithms are a way of solving problems by mimicking the same processes mother nature uses. They use the same combination of selection, recombination (i.e., crossover), and mutation to evolve a solution to a problem.

Chromosome Code
Each chromosome is made up of a sequence of genes from a certain alphabet, which can consist of binary digits (0 and 1), floating-point numbers, integers, symbols (i.e., A, B, C, D), etc. In binary gene representation, a possible solution needs to be encoded as a string of bits (i.e., a chromosome), and these bits need to represent all the different characters available to the solution. For example, four bits are required to represent the range of characters used:

0: 0000    1: 0001    2: 0010    3: 0011
4: 0100    5: 0101    6: 0110    7: 0111
8: 1000    9: 1001    +: 1010    −: 1011
∗: 1100    /: 1101

This shows all the different genes required to encode the problem as described. The possible genes 1110 and 1111 will remain unused and will be ignored by the algorithm if encountered. The expression "7 + 6 ∗ 4 / 2 + 1" would be represented by nine four-bit genes as follows:

0111 1010 0110 1100 0100 1101 0010 1010 0001
 7    +    6    ∗    4    /    2    +    1


These genes are all strung together to form the chromosome:

0111 1010 0110 1100 0100 1101 0010 1010 0001

Because the algorithm deals with random arrangements of bits, it is often going to come across a string of bits like this:

0010 0010 1010 1110 1011 0111 0010

Decoded, these bits represent:

0010 0010 1010 1110 1011 0111 0010
 2    2    +   n/a   −    7    2
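A short sketch of this decoding step is given below: it splits a bit string into 4-bit genes, maps them back to characters via the table above, skips the unused patterns 1110 and 1111, and evaluates the surviving expression from left to right with a simple repair rule that ignores malformed runs. The helper names and the repair rule are illustrative choices, not part of the original description.

```python
GENE_TABLE = {format(i, "04b"): c for i, c in enumerate("0123456789+-*/")}

def decode(chromosome):
    """Split the bit string into 4-bit genes and map each to a character,
    ignoring the unused patterns 1110 and 1111."""
    genes = [chromosome[i:i + 4] for i in range(0, len(chromosome), 4)]
    return [GENE_TABLE[g] for g in genes if g in GENE_TABLE]

def evaluate(tokens):
    """Evaluate digit/operator tokens strictly left to right, skipping
    malformed runs (two operators or two digits in a row) as a repair rule."""
    value, op = None, None
    for t in tokens:
        if t.isdigit():
            d = int(t)
            if value is None:
                value = d
            elif op is not None:
                value = {"+": value + d, "-": value - d,
                         "*": value * d, "/": value / d if d else value}[op]
                op = None
        elif value is not None and op is None:
            op = t
    return value

bits = "011110100110110001001101001010100001"   # encodes 7 + 6 * 4 / 2 + 1
tokens = decode(bits)
print(tokens, "=", evaluate(tokens))              # evaluated left to right: 27.0
```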

Fitness Selection
To evolve good solutions and to implement natural selection, we need a measure for distinguishing good solutions from bad solutions. In essence, the fitness measure must determine a candidate solution's relative fitness, which will subsequently be used by the GA to guide the evolution of good solutions. This is one of the characteristics of the GA: it only uses the fitness of individuals to get the relevant information for the next search step. To assign fitness, Hajela and Lin [61] proposed a weighted-sum method: each objective is assigned a weight wi ∈ (0, 1) such that Σi wi = 1, and the scalar fitness value is then calculated by summing up the weighted objective values wi fi(x). To search multiple solutions in parallel, the weights are not fixed but coded in the genotype, so that the diversity of the weight combinations is promoted by phenotypic fitness sharing [61, 169]. As a consequence, the fitness assignment evolves the solutions and weight combinations simultaneously.

Reproduction [164]
Individuals are selected through a fitness-based process, where fitter individuals with lower inferior value are typically more likely to be selected. On the one hand, excellent individuals must be preserved to avoid an overly random and inefficient search. On the other hand, these high-fitness individuals are not expected to over-breed, to avoid premature convergence of the algorithm. For simplicity and efficiency, an elitist selection strategy is employed. First, the individuals in the population p(t) are sorted from small to large according to the inferior value. Then the first 1/10 of the sorted individuals are directly selected into the crossover pool. The remaining 9/10 of individuals are obtained by random competitions among all individuals in the population.

Crossover and Mutation
Crossover is a unique and original feature of the GA among evolutionary algorithms. Genetic crossover is a process of genetic recombination that mimics sexual reproduction in nature. Its role is to inherit the original good genes to the next generation


of individuals and to generate new individuals with more complex genetic structures and higher fitness value. The improved crossover probability Pci of the ith individual is tuned adaptively as  Pci =

Pc × ri /rmax , ri < ravg , ri ≥ ravg , Pc ,

(9.6.1)

where Pc is the crossover probability of the GA, ri is the inferior value of the individual, rmax is the maximum inferior value, and ravg is the average inferior value of the population. As for the crossover rate itself, it is simply the chance that two chromosomes will swap their bits; a good value for this is around 0.7. Crossover is performed by selecting a random gene along the length of the chromosomes and swapping all the genes after that point. For example, given two chromosomes

1000 1001 1100 0010 1010
0101 0001 1010 0001 0111

choosing a random bit position along the length, say position 9, and swapping all the bits after that point gives

1000 1001 1010 0001 0111
0101 0001 1100 0010 1010

It can be seen from Eq. (9.6.1) for the crossover probability that when the fitness value of an individual participating in crossover is lower than the average fitness value of the population, i.e., the individual performs poorly, a large crossover probability is adopted for it. When the individual's fitness value is higher than the average fitness value, i.e., the individual performs well, a small crossover probability is used to reduce the damage of the crossover operator to the better performing individual.

When the fitness values of the offspring generated by the crossover operation are no longer better than those of their parents, but the global optimal solution has not been reached, the GA suffers premature convergence. At this point, introducing the mutation operator into the GA tends to produce good results. Mutation in the GA simulates the mutation of a gene on the chromosome in the evolution of natural organisms, changing the structure and physical properties of the chromosome. On the one hand, the mutation operator can restore genetic information lost during population evolution, maintaining individual differences in the population and preventing premature convergence. On the other hand, when the population size is large, introducing moderate mutation after the crossover operation can


also improve the local search efficiency of the GA, thereby increasing the diversity of the population so that the search can reach the global domain. The improved mutation probability Pmi of the ith individual is tuned adaptively as

$$
P_{mi} =
\begin{cases}
P_m \cdot r_i / r_{\max}, & r_i < r_{\mathrm{avg}},\\
P_m, & r_i \ge r_{\mathrm{avg}}.
\end{cases}
\tag{9.6.2}
$$

The mutation probability Pm of the population is adaptively changed according to the diversity of the current population as

$$
P_m = P_{m0}\left[1 - \frac{2(r_{\max} - r_{\min})}{3(\mathrm{PopSize} - 1)}\right],
\tag{9.6.3}
$$

where rmax and rmin are, respectively, the maximum and minimum inferior values of the individuals, PopSize is the number of individuals, and Pm0 is the base mutation probability of the program. The mutation rate is the chance that one or more genes of a selected chromosome will be flipped (0 becomes 1, 1 becomes 0). The mutation probability is usually obtained using an adaptive selection:

$$
p_m =
\begin{cases}
k_2 \cdot \dfrac{f_{\max} - f}{f_{\max} - \bar{f}}, & \text{if } f > \bar{f};\\[1ex]
k_4, & \text{if } f \le \bar{f}.
\end{cases}
\tag{9.6.4}
$$

Here k2 and k4 are both equal to 0.5, and f is the fitness of the chromosome under mutation. It can be seen from the expressions of Pci in (9.6.1) and Pmi in (9.6.2) that:

• When the individual fitness value is lower than the average fitness value of the population, i.e., the individual performs poorly, a large crossover probability and a large mutation probability are adopted.
• If the individual fitness value is higher than the average fitness value, i.e., the individual performs well, then the corresponding crossover probability and mutation probability are taken adaptively according to the fitness value.
• The closer the individual fitness value is to the maximum fitness value, the smaller the crossover probability and the mutation probability.
• If the individual fitness equals the maximum fitness value, then the crossover probability and the mutation probability are zero.

Therefore, individuals with optimal fitness values will remain intact in the next generation. However, this allows individuals with the greatest fitness to multiply rapidly in the population, leading to premature convergence of the algorithm. Hence, this adjustment method is applicable to the population in the late stage of evolution, but it is not conducive to the early stage, because the dominant individuals in the early stage of evolution are almost in a state of no


change, and the dominant individuals at this time are not necessarily close to the global optimal solution of the problem, which increases the likelihood that evolution converges to a locally optimal solution. To overcome this problem, we can use a default crossover probability pc = 0.8 and a default mutation probability pm = 0.001 for each individual, so that the crossover probability and the mutation probability of well-performing individuals in the population are no longer zero.
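As a rough sketch of how the adaptive probabilities (9.6.1)–(9.6.3) might be computed in practice (the function name and the way the population is represented are our own illustrative assumptions, not the book's):

# Minimal sketch of the adaptive crossover/mutation probabilities (9.6.1)-(9.6.3).
# Here r[i] is the inferior value of individual i (smaller means better).

def adaptive_probabilities(r, Pc=0.7, Pm0=0.001):
    r_max, r_min = max(r), min(r)
    r_avg = sum(r) / len(r)
    pop_size = len(r)

    # Eq. (9.6.3): population-level mutation probability scaled by diversity.
    Pm = Pm0 * (1.0 - 2.0 * (r_max - r_min) / (3.0 * (pop_size - 1)))

    Pci, Pmi = [], []
    for ri in r:
        if ri < r_avg:                      # better-than-average individual
            Pci.append(Pc * ri / r_max)     # Eq. (9.6.1), first branch
            Pmi.append(Pm * ri / r_max)     # Eq. (9.6.2), first branch
        else:                               # worse-than-average individual
            Pci.append(Pc)
            Pmi.append(Pm)
    return Pci, Pmi

# Example: five individuals with inferior values 1..5.
print(adaptive_probabilities([1, 2, 3, 4, 5]))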

9.6.2 Genetic Algorithm with Gene Rearrangement Clustering

Genetic clustering algorithms have a certain degree of degeneracy in the evolution process. The degeneracy problem is mainly caused by the non-one-to-one correspondence between the solution space of the clustering problem and the genetic individual coding in the evolution process [127]. The degeneracy of the evolutionary process usually causes the algorithm to search the local solution space repeatedly. Therefore, the search efficiency of the genetic clustering algorithm can be improved by avoiding the degeneracy of the evolution process.

To illustrate how the degeneracy occurs, consider a clustering problem in which a set of patterns is grouped into K clusters so that patterns in the same cluster are similar and differ from those of other clusters in some sense. Denote two chromosomes as

$$\mathbf{x} = [x_{11}, \ldots, x_{1N}, \ldots, x_{K1}, \ldots, x_{KN}] = [\mathbf{x}_1, \ldots, \mathbf{x}_K], \tag{9.6.5}$$
$$\mathbf{y} = [y_{11}, \ldots, y_{1N}, \ldots, y_{K1}, \ldots, y_{KN}] = [\mathbf{y}_1, \ldots, \mathbf{y}_K], \tag{9.6.6}$$

with $\mathbf{x}_i = [x_{i1}, \ldots, x_{iN}] \in \mathbb{R}^{1\times N}$ and $\mathbf{y}_i = [y_{i1}, \ldots, y_{iN}] \in \mathbb{R}^{1\times N}$, where x is called a referenced vector. Let P1 = {1, . . . , K} be the set of indexes of {x1, . . . , xK} and P2 = ∅. Consider the rearrangement of P1 to P2:

for i = 1 to K do
    $k = \arg\min_{j \in P_1,\, j \notin P_2} \|\mathbf{y}_i - \mathbf{x}_j\|^2$,
    $P_1 = P_1 \setminus \{k\}$ and $P_2(i) = k$.
end for

Then, the elements of P2 will be a permutation of those of P1, and y will be rearranged according to P2, i.e.,

$$\mathbf{y}_k = \mathbf{y}_{P_2(k)}. \tag{9.6.7}$$

Definition 9.31 (Gene Rearrangement [16]) If x and y are two chromosomes in gene representation, and the elements of y are rearranged according to the indexes


in P2 to get a new vector whose elements are given by

$$\tilde{y}_i = y_{P_2(i)}, \quad i = 1, \ldots, N, \tag{9.6.8}$$

then the new chromosome $\tilde{\mathbf{y}} = [\tilde{y}_i]_{i=1}^N = [y_{P_2(i)}]_{i=1}^N$ is called the gene rearrangement of the original chromosome y.
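The rearrangement of P1 into P2 and the resulting reordering of y (Definition 9.31) can be sketched as follows; this is a minimal illustration under our own naming assumptions, treating each chromosome as an array of K cluster-center rows:

import numpy as np

def gene_rearrangement(x, y):
    """Greedy index rearrangement P1 -> P2 against the referenced chromosome x,
    followed by the reordering of Eq. (9.6.8). x, y: arrays of shape (K, N)."""
    K = x.shape[0]
    available = set(range(K))   # indexes of x still unmatched (P1 \ P2)
    P2 = []
    for i in range(K):
        # nearest remaining reference center to y_i
        k = min(available, key=lambda j: np.sum((y[i] - x[j]) ** 2))
        available.remove(k)
        P2.append(k)
    # Definition 9.31: the i-th gene of the rearranged chromosome is y[P2[i]].
    return y[P2]

# Example from the text: two 2-D chromosomes with three centers each.
m1 = np.array([[1.1, 1.0], [2.2, 2.0], [3.4, 1.2]])
m2 = np.array([[3.2, 1.4], [1.8, 2.2], [0.5, 0.7]])
print(gene_rearrangement(m1, m2))   # rows reordered to [[0.5,0.7],[1.8,2.2],[3.2,1.4]]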

The genetic algorithm with gene rearrangement (GAGR) clustering algorithm is given in Algorithm 9.7.

Algorithm 9.7 Genetic algorithm with gene rearrangement (GAGR) clustering algorithm [16]
1. initialization:
   1.1 Generate a group of cluster centers with size NP.
   1.2 Consider only valid chromosomes (those that have at least one data point in each cluster).
   1.3 Assign each data point of the set to the cluster with the closest cluster center using the Euclidean distance.
2. Evaluate each chromosome and copy the best chromosome pbest of the initial population to a separate location.
3. If the termination condition is not reached, go to Step 4; else select the best individual from the population as the best clustering result.
4. Select individuals from the population for crossover and mutation.
5. Apply the crossover operator to the selected individuals based on the crossover probability.
6. Apply the mutation operator to the selected individuals based on the mutation probability.
7. Evaluate the newly generated candidates.
8. Compare the worst chromosome in the new population with pbest in terms of their fitness values. If the former is worse than the latter, then replace it by pbest.
9. Find the best chromosome in the new population and use it to replace pbest.
10. For the new population, select the best chromosome as the reference, against which the other chromosomes undergo gene rearrangement if needed.
11. Go back to Step 3.

The GAGR steps are illustrated as follows [16].

1. Chromosome representation: Extensive experiments comparing real-valued and binary GAs indicate [102] that the real-valued GA is more efficient in terms of CPU time. In real-valued gene representation, each chromosome constitutes a sequence of cluster centers, i.e., each chromosome is described by M = N · K real-valued numbers as follows:

$$\mathbf{m} = [m_{11}, \ldots, m_{1N}, \ldots, m_{K1}, \ldots, m_{KN}] = [\mathbf{m}_1, \ldots, \mathbf{m}_K], \tag{9.6.9}$$

where N is the dimension of the feature space and K is the number of clusters; the first N elements denote the first cluster center, the following N elements denote the second cluster center, and so forth.

2. Population initialization: In the GAGR clustering algorithm, an initial population of size NP can be randomly generated: K data points are chosen at random from


the data set, on the condition that there are no identical points, to form a chromosome representing the K cluster centers. This process is repeated until NP chromosomes are generated. Only valid strings (i.e., those that have at least one data point in each cluster) are included in the initial population. After the population initialization, each data point is assigned to the cluster with the closest cluster center using the following rule:

$$\mathbf{x}_i \in C_j \;\Leftrightarrow\; \|\mathbf{x}_i - \mathbf{m}_j\| = \min_{k=1,\ldots,K} \|\mathbf{x}_i - \mathbf{m}_k\|, \tag{9.6.10}$$

where mk is the center of the kth cluster.

3. Fitness function: The fitness function is used to assign a fitness value to each candidate solution. A common clustering criterion or quality indicator is the sum of squared error (SSE) measure, defined as

$$\mathrm{SSE} = \sum_{C_i} \sum_{\mathbf{x} \in C_i} (\mathbf{x} - \mathbf{m}_i)^T (\mathbf{x} - \mathbf{m}_i) = \sum_{C_i} \sum_{\mathbf{x} \in C_i} \|\mathbf{x} - \mathbf{m}_i\|^2, \tag{9.6.11}$$

where x ∈ Ci is a data point assigned to cluster Ci. This measure computes the cumulative distance of each pattern from the center of its own cluster and then sums those measures over all clusters. If this measure is small, then the distances from patterns to cluster centers are all small and the clustering is regarded favorably. It is interesting to note that SSE has a theoretical minimum of zero, which corresponds to all clusters containing only a single data point. The fitness function of the chromosome is then defined as the inverse of the SSE, i.e.,

$$f = \frac{1}{\mathrm{SSE}}. \tag{9.6.12}$$

This fitness function is maximized during the evolutionary process, which leads to minimization of the SSE.

4. Evolutionary operators:
• Crossover: Let x and y be two chromosomes to be crossed; the heuristic crossover is then given by

$$\mathbf{x}' = \mathbf{x} + r(\mathbf{x} - \mathbf{y}), \tag{9.6.13}$$
$$\mathbf{y}' = \mathbf{x}, \tag{9.6.14}$$

where r ~ U(0, 1) with U(0, 1) being the uniform distribution on the interval [0, 1], and x is assumed to be better than y in terms of fitness.
• Mutation: Let fmin and fmax be the minimum and maximum fitness values in the current population, respectively. For an individual with fitness value f,


a number δ ∈ [−R, +R] is generated with uniform distribution, where R is given by Bandyopadhyay and Maulik [5]:

$$R = \begin{cases} \dfrac{f - f_{\min}}{f_{\max} - f_{\min}}, & \text{if } f_{\max} > f,\\[1ex] 1, & \text{if } f_{\max} = f. \end{cases} \tag{9.6.15}$$

If the minimum and maximum values of the data set along the ith dimension (i = 1, . . . , N) are $m_i^{\min}$ and $m_i^{\max}$, respectively, then after mutation the ith element of the individual is given by [5]:

$$m_i' = \begin{cases} m_i + \delta \cdot (m_i^{\max} - m_i), & \text{if } \delta \ge 0,\\ m_i + \delta \cdot (m_i - m_i^{\min}), & \text{otherwise}. \end{cases} \tag{9.6.16}$$
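A compact sketch of the fitness evaluation (9.6.11)–(9.6.12) and the two evolutionary operators (9.6.13)–(9.6.16) might look as follows; the function names and array conventions are our own illustrative assumptions:

import numpy as np

def sse_fitness(data, centers):
    """Eqs. (9.6.11)-(9.6.12): assign each point to its nearest center,
    accumulate the squared errors, and return 1/SSE as the fitness."""
    d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (n_points, K)
    sse = d2.min(axis=1).sum()
    return 1.0 / sse

def heuristic_crossover(x, y, rng):
    """Eqs. (9.6.13)-(9.6.14); x is assumed fitter than y."""
    r = rng.uniform(0.0, 1.0)
    return x + r * (x - y), x.copy()

def mutate(m, f, f_min, f_max, lo, hi, rng):
    """Eqs. (9.6.15)-(9.6.16): perturb each element within the data range [lo, hi]."""
    R = (f - f_min) / (f_max - f_min) if f_max > f else 1.0
    delta = rng.uniform(-R, R, size=m.shape)
    up = m + delta * (hi - m)       # used where delta >= 0
    down = m + delta * (m - lo)     # used where delta < 0
    return np.where(delta >= 0, up, down)

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 2))
centers = data[rng.choice(50, size=3, replace=False)]
print(sse_fitness(data, centers))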

One key feature of gene rearrangement is to avoid the degeneracy caused by the representations used in many clustering problems. The degeneracy mainly arises from a non-one-to-one correspondence between the representation and the clustering result. To illustrate this degeneracy, let m1 = [1.1, 1.0, 2.2, 2.0, 3.4, 1.2] and m2 = [3.2, 1.4, 1.8, 2.2, 0.5, 0.7] represent, respectively, two clustering results of a two-dimensional data set with three cluster centers in one generation. If m1 is selected as the referenced vector, then m2 becomes m2′ = [0.5, 0.7, 1.8, 2.2, 3.2, 1.4] after the gene rearrangement described by Definition 9.31. Figure 9.3 shows the crossover results of m1 with m2 and with m2′.

[Fig. 9.3 Illustration of degeneracy due to crossover (from [16]). The asterisks denote the three cluster centers (1.1, 1.0), (2.2, 2.0), (3.4, 1.2) of chromosome m1; the + signs denote the three cluster centers (3.2, 1.4), (1.8, 2.2), (0.5, 0.7) of chromosome m2; the open triangles denote the chromosome obtained by crossing m1 and m2; and the open circles denote the chromosome obtained by crossing m1 and m2′, with the three cluster centers (0.5, 0.7), (1.8, 2.2), (3.2, 1.4).]

From Fig. 9.3 we can see that the performance of the chromosome marked by triangles, obtained directly from crossing m1 and m2, is poor, as two of


the three clusters are very close, producing some confusion, while the chromosome marked by circles, obtained from crossing m1 and m2′ (i.e., m2 after gene rearrangement), is more effective because there is no confusion between any pair of clusters. Since the gene rearrangement restores a one-to-one correspondence between the representation and the clustering result, there is no degeneracy, which makes the GAGR algorithm more efficient than the classical GA.

9.7 Nondominated Multiobjective Genetic Algorithms

In generation-based evolutionary algorithms (EAs) or genetic algorithms (GAs), the selected individuals are recombined (e.g., by crossover) and mutated to constitute the new population. A designer/user may instead prefer the more incremental, steady-state population update, which selects (and possibly deletes) only one or two individuals from the current population and adds the newly recombined and mutated individuals to it. In either case, the selection of one or two individuals is a key step in generation-based EAs or GAs.

9.7.1 Fitness Functions

In the fields of genetic programming and genetic algorithms, each design solution is commonly represented as a string of numbers (called a chromosome). After each round of testing or simulation, the designer's goal is to delete the n worst design solutions and to breed n new ones from the best design solutions. To this end, each design solution needs to be evaluated by a figure of merit that determines how close it is to meeting the overall specification; this figure of merit is usually computed by applying the fitness function to the result of the test or simulation. A fitness function, as a single figure of merit, is a particular type of objective function and is used in genetic programming and genetic algorithms to guide simulations towards optimal design solutions.

According to the "survival of the fittest" mechanism, a GA evaluates a solution directly by its fitness value. The GA requires only that the fitness function be nonnegative, which gives the genetic algorithm wide applicability. Since a feasible solution x1 ∈ X is said to (Pareto) dominate another solution x2 ∈ X if fi(x1) ≤ fi(x2) for all indices i ∈ {1, . . . , k} and fj(x1) < fj(x2) for at least one index j ∈ {1, . . . , k}, the objective functions fi(x) can naturally be taken as the fitness functions of a candidate solution x. However, in practical problems, the fitness function is not completely consistent with the objective function of the problem.

In GAs, fitness is the key indicator describing an individual's performance. Because the fittest are selected on the basis of fitness, it becomes the driving force of the genetic algorithm. From a biological point of view, the fitness is equivalent to the


competitiveness for survival; this biological viability to "survive" is of great significance in the genetic process. By mapping the objective function of an optimization problem to the individual's fitness, the objective function can be optimized through the process of population evolution. The fitness function is also known as the evaluation function; it is a criterion for distinguishing good individuals from bad ones in a population according to the objective function. The fitness is always nonnegative, and in all cases a larger fitness value is considered better. In the selection operation, two deceptive problems can arise in a genetic algorithm:

• In the early stage of the GA, some supernormal individuals are usually generated. These supernormal individuals will dominate the selection process because of their outstanding competitiveness, which affects the global optimization performance of the algorithm.
• In the later stage of the GA, if the algorithm tends to converge, then the potential for continued optimization is reduced, and a certain local optimum may be obtained because of the small differences in individual fitness within the population.

Therefore, an improperly chosen fitness function will result in these and other deception problems. The choice of the fitness function is thus significant for genetic algorithms.

9.7.2 Fitness Selection

EAs/GAs are capable of solving complicated optimization tasks in which an objective function f : I → R is to be maximized, where i ∈ I is an individual from the set of feasible solutions. A population is a multi-set of individuals that is maintained and updated as follows: one or more individuals are first selected according to some selection strategy. In generation-based EAs, the selected individuals are then recombined (e.g., via crossover) and mutated in order to constitute the new population. In the steady-state alternative, only one or two individuals are selected from (and possibly deleted in) the current population, and the newly recombined and mutated individuals are added to it. For difficult multimodal and deceptive problems we are interested in finding a single individual of maximal objective value. The following are two common selection schemes [77]:

1. Standard Selection Scheme: This scheme favors individuals of higher fitness and has the following variants:
• Linear Proportionate Selection: The probability of selecting an individual depends linearly on its fitness [69].
• Truncation Selection: The fittest individuals are selected, usually with multiplicity, to keep the population size fixed [109].
• Ranking Selection: The individuals are ordered according to their fitness; the selection probability is then a (linear) function of the rank [157].


• Tournament Selection: This method selects the best individual out of a randomly chosen group of individuals. It was primarily developed for steady-state EAs, but can be adapted to generation-based EAs [4].

2. Fitness Uniform Selection Scheme (FUSS): FUSS is based on the insight that one is not primarily interested in a population converging to maximal fitness, but only in a single individual of maximal fitness. The scheme automatically creates a suitable selection pressure, preserves a range of fitness values, and should help to preserve population diversity.

All four standard selection schemes above have the property (and goal) of increasing the average fitness of a population, i.e., of evolving the population toward higher fitness [77]. Define the difference or distance between two individuals i and j as [77]

$$d(i, j) = |f(i) - f(j)|. \tag{9.7.1}$$

The distance is based solely on the fitness function; it is independent of the coding/representation, of other problem details, and of the optimization algorithm (e.g., the genetic mutation and recombination operators), and can trivially be computed from the fitness values. Under the natural assumption that functionally similar individuals have similar fitness, such individuals are also similar with respect to the distance d. On the other hand, individuals with very different codings, and even functionally dissimilar individuals, may be d-similar. Two individuals i and j are said to be ε-similar if d(i, j) = |f(i) − f(j)| ≤ ε.

Let the lowest and highest fitness values in the current population be fmin and fmax, respectively. The fitness uniform selection scheme is a two-stage uniform selection process [77]:
• Randomly select a fitness value f uniformly from the fitness range F = [fmin, fmax].
• Select an individual i ∈ P with fitness nearest to f, and add a copy of it to the population P, possibly after mutation and recombination.

The probability of selecting a specific individual is proportional to the distance to its nearest fitness neighbor. In a population with a high density of unfit and a low density of fit individuals, the fitter ones are effectively favored.

To distinguish good solutions (individuals) from bad ones, we need to perform fitness selection of individuals. The fitness selection may be a mathematical model, a computer simulation, or even a subjective function for choosing better solutions over worse ones. To guide the evolution of good solutions in GAs, the fitness of every individual in the population is evaluated, and multiple individuals are stochastically selected from the current population (based on their fitness) and modified (recombined and possibly mutated) to form a new population. The new population is then used in the next iteration of the GA. This iteration process is repeated until either a maximum


number of generations has been produced or a satisfactory fitness level has been reached for the population. However, if the algorithm has terminated because the maximum number of generations was reached, a satisfactory solution may or may not have been found. Alternatively, diversity may also be preserved through deletion rather than selection, by deleting those individuals with "commonly occurring" fitness values. This method is called the fitness uniform deletion scheme (FUDS); its role is to actively cover different parts of the solution space rather than to move the population as a whole toward higher fitness [77]. Thus, FUDS is at least a partial solution to the problem of having to set a selection intensity parameter correctly.

A common fitness selection method is Goldberg's non-inferior sorting method for selecting the superior individuals [54]. Let xi(t) be an arbitrary individual in the population p(t) of the tth generation, and let ri(t) denote the number of individuals non-inferior to xi(t) in this population. Then ri(t) is defined as the inferior value of the individual xi(t) in the population. Obviously, individuals with a small inferior value are superior and are close to the Pareto solutions. In the process of evolution, the whole population keeps approaching the Pareto boundary.
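A minimal sketch of fitness-uniform selection, under our own naming assumptions (parallel lists of individuals and fitness values), might look like this:

import random

def fitness_uniform_select(population, fitness):
    """FUSS: draw a target fitness uniformly from [f_min, f_max] and
    return the individual whose fitness is nearest to that target."""
    f_min, f_max = min(fitness), max(fitness)
    target = random.uniform(f_min, f_max)
    best_idx = min(range(len(population)), key=lambda i: abs(fitness[i] - target))
    return population[best_idx]

# Example: individuals are labels, fitness values are crowded near 1.0.
pop = ["a", "b", "c", "d", "e"]
fit = [0.9, 0.95, 1.0, 1.05, 5.0]
# "e" (the lone high-fitness individual) is selected with high probability.
print(fitness_uniform_select(pop, fit))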

9.7.3 Nondominated Sorting Genetic Algorithms

Unlike in single-objective optimization problems, in multiobjective optimization problems improving one objective usually comes at the cost of degrading at least one other objective. In other words, there is typically no unique optimal solution but rather a family of optimal solutions called the Pareto-optimal front or nondominated set. Multiobjective evolutionary algorithms (MOEAs) are among the most popular approaches to solving multiobjective optimization problems because MOEAs are able to find multiple Pareto-optimal solutions in a single run. Since it is not possible for a multiobjective optimization problem to have a single solution that simultaneously optimizes all objectives, an algorithm is of great practical value if it delivers a large number of alternative solutions lying on or near the Pareto-optimal front.

An early practical GA, called the Vector Evaluated Genetic Algorithm (VEGA), was developed by Schaffer in 1985 [141]. One of the problems with VEGA is its bias toward some Pareto-optimal solutions; a decision maker may want to find as many nondominated points as possible in order to avoid any bias toward such middling individuals. The number of dominating points d(s, P(t)) is the number of points y from the set P(t) that dominate the point s, namely

$$d(s, P(t)) = |\{y \in P(t) \mid y \prec s\}|. \tag{9.7.2}$$


The d(s, P(t)) measure favors solutions located in those areas where the better fronts are sparsely populated. However, when the population consists of nondominated solutions only, the d(s, P(t)) selection scheme is no longer applicable.

Definition 9.32 (Niched Pareto Genetic Algorithm [74]) A niched Pareto genetic algorithm combines tournament selection and the concept of Pareto dominance. For two competing individuals and a comparison set consisting of other individuals picked at random from the population, if one of the competing individuals is dominated by some member of the set and the other is not, then the latter is chosen as the winner of the tournament. If both individuals are dominated or both are nondominated, then the result of the tournament is decided by sharing: the individual which has the fewest individuals in its niche is selected for reproduction.

The niched Pareto genetic algorithm was proposed by Horn et al. in 1994 [74]. The nondominated sorting genetic algorithm (NSGA) differs from a simple genetic algorithm only in how the selection operator works; the crossover and mutation operators remain as usual. The population is ranked on the basis of an individual's nondomination before the selection is performed. The nondominated individuals present in the current population are first identified; all these individuals are then assumed to constitute the first nondominated front in the population and are assigned a large dummy fitness value. Since these nondominated individuals are assigned the same fitness value, they all have equal reproductive potential. To maintain diversity in the population, these classified individuals are then shared with respect to their dummy fitness values.

Let the parameter d(xi, xj) be the phenotypic distance between two individuals xi and xj in the current front, and let σshare be the maximum phenotypic distance allowed between any two individuals for them to become members of a niche.

Definition 9.33 (Sharing Function Value [146]) For two individuals xi and xj in the same front, their sharing function value, denoted sh(xi, xj), is defined as

$$\mathrm{sh}(\mathbf{x}_i, \mathbf{x}_j) = \begin{cases} 1 - \left(\dfrac{d(\mathbf{x}_i, \mathbf{x}_j)}{\sigma_{\mathrm{share}}}\right)^2, & d(\mathbf{x}_i, \mathbf{x}_j) < \sigma_{\mathrm{share}};\\[1ex] 0, & \text{otherwise}. \end{cases} \tag{9.7.3}$$

For example, $d(\mathbf{x}_1, \mathbf{x}_2) = \sqrt{\sum_{i=1}^n (x_{1i} - x_{2i})^2}$ may be taken.

Definition 9.34 (Niche Count [56]) Given X = {x1, . . . , xL} (the union of the parent and offspring populations) and σ (the niche radius), the niche count of x ∈ X is defined by

$$\mathrm{nc}(\mathbf{x} \mid X, \sigma) = \sum_{i=1,\, \mathbf{x}_i \neq \mathbf{x}}^{L} \mathrm{sh}(\mathbf{x}, \mathbf{x}_i). \tag{9.7.4}$$
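A small sketch of the sharing function (9.7.3) and the niche count (9.7.4), with our own helper names and a Euclidean phenotypic distance as the example above suggests:

import numpy as np

def sharing_value(xi, xj, sigma_share):
    """Eq. (9.7.3): sharing function value of two individuals in the same front."""
    d = np.linalg.norm(xi - xj)                 # Euclidean phenotypic distance
    return 1.0 - (d / sigma_share) ** 2 if d < sigma_share else 0.0

def niche_count(x, X, sigma_share):
    """Eq. (9.7.4): sum of sharing values of x against all other members of X."""
    return sum(sharing_value(x, xi, sigma_share)
               for xi in X if not np.array_equal(xi, x))

X = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 2.0]])
print(niche_count(X[0], X, sigma_share=0.5))    # the close neighbor contributes, the far one does not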


Sharing is achieved by performing the selection operation with degraded fitness values, obtained by dividing the original fitness value of an individual by a quantity proportional to the number of individuals around it. This causes multiple optimal points to co-exist in the population. The shared nondominated individuals are then ignored temporarily and the rest of the population is classified in the same way; the individuals of the next front are assigned a new dummy fitness value that is kept smaller than the minimum shared dummy fitness of the previous front. This process continues until the entire population is classified into several fronts [146].

In a multiobjective optimization genetic algorithm, the whole population is checked and all nondominated individuals are assigned rank 1. Other individuals are ranked according to their nondominance with respect to the rest of the population as follows. For an individual point, the number of points that strictly dominate it in the population is first found; the rank of that individual is then assigned to be one more than that number. At the end of this ranking procedure, several points may have the same rank. The selection procedure then uses these ranks to select or delete blocks of points to form the mating pool. Algorithm 9.8 shows the nondominated sorting genetic algorithm (NSGA) [146].

Algorithm 9.8 Nondominated sorting genetic algorithm (NSGA) [146]
1. input: maxgen (maximal generation number).
2. initialization: population, gen = 0.
3. while gen < maxgen
4.   front = 1;
5.   justify population classification;
6.   if population is classified do
7.     flag_classified = 1;
8.   else
9.     flag_classified = 0;
10.  end if
11.  while flag_classified = 0
12.    identify nondominated individuals;
13.    assign dummy fitness;
14.    sharing in current front;
15.    front = front + 1;
16.    return step 2;
17.  end while
18.  while flag_classified = 1
19.    reproduction according to dummy fitness;
20.    crossover;
21.    mutation;
22.    if gen < maxgen do
23.      gen = gen + 1;
24.      return step 1;
25.    else
26.    end if
27.  end while
28. end while
29. output: nondominated sorting population.


9.7.4 Elitist Nondominated Sorting Genetic Algorithm

The nondominated sorting genetic algorithm (NSGA) proposed by Srinivas and Deb [146] uses nondominated sorting and sharing, and its main criticisms are as follows [25]:

• High computational complexity of nondominated sorting: the computational complexity of the ordinary nondominated sorting algorithm is O(kN³), where k is the number of objectives and N is the population size.
• Lack of elitism: elitism can speed up the performance of the GA significantly, and it can also help to prevent the loss of good solutions once they have been found.
• Need for specifying the sharing parameter σshare: a parameterless diversity preservation mechanism is desirable.

To overcome the three shortcomings mentioned above, a fast and elitist multiobjective genetic algorithm (NSGA-II) was proposed by Deb et al. in 2002 [25]. Evolutionary multiobjective algorithms apply biologically inspired evolutionary processes as heuristics to generate nondominated sets of solutions. It should be noted that the solutions returned by evolutionary multiobjective algorithms may not be Pareto optimal (i.e., globally nondominated); rather, the algorithms are designed to evolve solutions that approach the Pareto front and spread out to capture the diversity existing on it, so as to obtain a good approximation of the Pareto front [108].

NSGA-II features nondominated sorting and a (μ + μ) selection scheme (μ is the number of solution vectors). The secondary ranking criterion for solutions on the same front is called the crowding distance. For a solution, the crowding distance is defined as the sum of the side lengths of the cuboid through the neighboring solutions on the front; for extremal solutions it is defined as infinity. The crowding distance value of a solution depends on its neighbors and not directly on the position of the point itself. At the end of one generation of NSGA-II, the population is twice its original size; this bigger population is pruned based on the nondominated sorting [80]. The basic NSGA-II steps are as follows [108] (a small sketch of the crowding-distance computation is given after this list):

1. Randomly generate the first generation.
2. Partition the parents and offspring of the current generation into k fronts F1, . . . , Fk, so that the members of each front are dominated only by members of better fronts and by no members of worse fronts.
3. Calculate the crowding distance for each potential solution: for each front, sort the members of the front according to each objective function from the lowest value to the highest value. Compute the difference between the solution and the closest lesser solution and the closest greater solution, ignoring the other objective functions. Repeat for each objective function. The crowding distance for that solution is the average difference over all the objective functions.


Note that the extreme solutions for each objective function are always included.
4. Select the next generation based on nondomination and diversity:
• Add the best fronts F1, . . . , Fj to the next generation until it reaches the target population size.
• If part of a front Fj+1 must be added to the generation to reach the target population size, sort its members from the least crowded to the most crowded, and then add as many members as necessary, starting with the least crowded.
5. Generate offspring:
• Apply binary tournament selection by randomly choosing two solutions and including the higher-ranked solution with a fixed probability between 0.5 and 1.
• Apply single-point crossover and sitewise mutation to generate the offspring.
6. Repeat steps 2 through 5 for the desired number of generations.
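The crowding-distance computation of Step 3 might be sketched as follows (our own function name and conventions; the sum-of-gaps form below differs from the per-objective average described above only by a constant factor 1/m):

import numpy as np

def crowding_distance(front):
    """front: array of shape (n, m) with the m objective values of n solutions
    on the same nondominated front. Returns one distance per solution."""
    n, m = front.shape
    dist = np.zeros(n)
    for j in range(m):
        order = np.argsort(front[:, j])            # sort by the j-th objective
        dist[order[0]] = dist[order[-1]] = np.inf  # extreme solutions are always kept
        for k in range(1, n - 1):
            lo, me, hi = order[k - 1], order[k], order[k + 1]
            dist[me] += front[hi, j] - front[lo, j]  # gap between the two neighbors
    return dist

front = np.array([[1.0, 5.0], [2.0, 3.0], [4.0, 1.0]])
print(crowding_distance(front))   # the middle solution gets a finite distance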

9.8 Evolutionary Algorithms (EAs)

Optimization plays a very important role in engineering, operational research, information science, and related areas. Optimization techniques can be divided into two categories: derivative-based methods and derivative-free methods. As an important branch of derivative-free methods, evolutionary algorithms (EAs) have shown considerable success in solving optimization problems and have attracted more and more attention.

Single-objective evolutionary algorithms are general-purpose optimization heuristics inspired by Darwin's principle of natural evolution. Starting with a set of candidate solutions (the population), in each iteration (generation) the better solutions (parents) are selected according to their fitness (i.e., their values under the objective function) and used to generate new solutions (offspring), either by recombining the information of two parents in a new way (crossover) or by randomly modifying a solution (mutation). These offspring are then inserted into the population, replacing some of the weaker solutions (individuals) [13]. By iteratively selecting the better solutions and using them to create new candidates, the population "evolves," and the obtained solutions become better and better adapted to the optimization problem at hand, just as in nature individuals become better and better adapted to their environment through evolution. EAs have been successfully employed on a wide variety of complex optimization problems because they can deal with almost arbitrarily complex objective functions and constraints, make very few assumptions, and do not even require a mathematical description of the problem.


9.8.1 (1 + 1) Evolutionary Algorithm

The simplest single-objective evolutionary algorithm is the (1 + 1) evolutionary algorithm, or (1 + 1) EA. The basic idea and steps of the (1 + 1) EA are as follows [35]:

1. The size of the population is restricted to just one individual, and no crossover is used.
2. The current individual is represented as a bit string.
3. A bitwise mutation operator is used that flips each bit independently of the others with some probability pm.
4. The current bit string is replaced by the new one if the fitness of the current bit string is not superior to the fitness of the new string. This replacement strategy selects the better individual of the one parent and the one child as the new generation.

The (1 + 1) EA is shown in Algorithm 9.9.

Algorithm 9.9 (1 + 1) evolutionary algorithm [35]
1. input: pm = 1/n.
2. initialization: Choose randomly an initial bit string x ∈ {0, 1}^n.
3. while x is not optimal do
4.   Compute x′ by flipping independently each bit xi with probability pm.
5.   Replace x by x′ if and only if f(x′) ≥ f(x).
6. end while
7. output: x.
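A runnable sketch of Algorithm 9.9 is given below; the choice of OneMax as the test problem and the function names are our own illustrative assumptions:

import random

def one_plus_one_ea(n, fitness, max_iters=10_000):
    """(1 + 1) EA (Algorithm 9.9): keep a single bit string, flip each bit
    with probability pm = 1/n, and accept the offspring if it is at least as fit."""
    pm = 1.0 / n
    x = [random.randint(0, 1) for _ in range(n)]
    fx = fitness(x)
    for _ in range(max_iters):
        y = [1 - b if random.random() < pm else b for b in x]  # bitwise mutation
        fy = fitness(y)
        if fy >= fx:                                           # elitist replacement
            x, fx = y, fy
    return x, fx

# Example: OneMax, whose unique optimum is the all-ones string.
best, best_fit = one_plus_one_ea(20, sum)
print(best_fit)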

The different choices of selection, crossover, mutation, replacement, and concrete representation of the individuals offer a great variety of different EAs.

The combinatorial optimization problem can be described as follows: given a finite state space S and a function f(x), x ∈ S, find the solution of

$$\max_{x \in S} f(x). \tag{9.8.1}$$

Assume x∗ is one state with the maximum function value, and let fmax = f(x∗). The EA for solving the combinatorial optimization problem can be described as follows:

1. Initialization: generate, either randomly or heuristically, an initial population of 2N individuals, denoted by ξ0 = (x1, . . . , x2N) (where N > 0 is an integer), and let k ← 0. For any population ξk, define f(ξk) = max{f(xi) : xi ∈ ξk}.
2. Generation: generate a new (intermediate) population by crossover and mutation (or any other operators for generating offspring), and denote it by ξk+1/2.
3. Selection: select and reproduce 2N individuals from the populations ξk+1/2 and ξk, obtaining another (new intermediate) population ξk+S.


4. If f (ξk+S ) = fmax , then stop; otherwise let ξk+1 = ξk+S and k ← k + 1, and go to step 2 to continue.

9.8.2 Theoretical Analysis on Evolutionary Algorithms

Several different methods are widely used in the theoretical analysis of EAs.

Fitness Partition The fitness partition method is the most basic method, in which the objective fitness domain is partitioned into several levels.

Theorem 9.1 ([94]) Consider a maximization problem with fitness function f : D → R. Suppose D can be partitioned into L subsets such that for any solution xi in subset i and xj in subset j, we have f(xi) < f(xj) if i < j. In addition, subset L contains only optimal solution(s) of f. If an algorithm can jump from subset k to subset k + 1 or higher with probability at least pk, then the total expected time for the algorithm to find the optimum satisfies

$$E(T) \le \sum_{k=1}^{L-1} \frac{1}{p_k}. \tag{9.8.2}$$
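As a standard worked example of Theorem 9.1 (not from the book, but the classical fitness-level argument for the (1 + 1) EA on OneMax, where f(x) is the number of ones), partition the search space by the number of ones in the bit string:

% Fitness-level (fitness partition) argument for the (1+1) EA on OneMax.
% Level A_k = { x : f(x) = k }, k = 0, 1, ..., n; level A_n contains the optimum.
% From a string with k ones it suffices to flip one of the n-k zero bits and keep
% all other bits unchanged, so with mutation probability p_m = 1/n,
\[
  p_k \;\ge\; (n-k)\,\frac{1}{n}\Bigl(1-\frac{1}{n}\Bigr)^{n-1} \;\ge\; \frac{n-k}{e\,n},
\]
% and Theorem 9.1 gives
\[
  E(T) \;\le\; \sum_{k=0}^{n-1} \frac{1}{p_k}
        \;\le\; \sum_{k=0}^{n-1} \frac{e\,n}{\,n-k\,}
        \;=\; e\,n \sum_{j=1}^{n} \frac{1}{j}
        \;=\; O(n \log n),
\]
% which is of the same order as the bound stated in Theorem 9.2 below (up to the choice of p_m).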

Drift Analysis By considering the expected progress achieved in a single step at each search point, drift analysis [60, 66] can be used to analyze problems that are hard to analyze by fitness partition. Assume x∗ is an optimal point, and let d(x, x∗) be the distance between the points x and x∗. If there is more than one optimal point (that is, a set S∗), then d(x, S∗) = min{d(x, x∗) : x∗ ∈ S∗} is used as the distance between the individual x and the optimal set S∗. Denote this distance by d(x). Usually d(x) satisfies d(x∗) = 0 and d(x) > 0 for any x ∉ S∗. Given a population X = {x1, . . . , x2N}, let

$$d(X) = \min\{d(\mathbf{x}) : \mathbf{x} \in X\}, \tag{9.8.3}$$

which is used to measure the distance of the population to the optimal solution. The drift of the random sequence {d(ξk); k = 0, 1, . . .} at time k is defined by

$$\Delta(d(\xi_k)) = d(\xi_{k+1}) - d(\xi_k). \tag{9.8.4}$$

Define the stopping time of an EA as τ = min{k : d(ξk) = 0}, which is the first hitting time of the optimal solution. Drift analysis focuses on the following question [66]: under what conditions on the drift Δ(d(ξk)) can we estimate the expected first hitting time E[τ]?


A fitness function f : {0, 1}^n → R can be written as a polynomial [35]:

$$f(s_1, \ldots, s_n) = \sum_{I \subseteq \{1,\ldots,n\}} c_f(I) \prod_{i \in I} s_i, \tag{9.8.5}$$

where the coefficients cf(I) ∈ R. Let the class of fitness functions f(s1, . . . , sn) satisfy the condition

$$f(s_1, \ldots, s_{k-1}, 0, s_{k+1}, \ldots, s_n) < f(s_1, \ldots, s_{k-1}, 1, s_{k+1}, \ldots, s_n) \tag{9.8.6}$$

for any k = 1, . . . , n and fixed s1, . . . , sk−1, sk+1, . . . , sn. This condition says that if a "0" bit at any position flips into "1," the fitness increases, and thus (1, . . . , 1) is the unique maximum point.

Theorem 9.2 ([67]) For any fitness function (9.8.5) satisfying (9.8.6), the EA with mutation probability pm = 1/(2n) needs on average O(n log n) steps to reach the optimal solution.

9.9 Multiobjective Evolutionary Algorithms

Multiobjective evolutionary algorithms (MOEAs) have been shown to be well-suited for solving multiobjective optimization problems (MOPs) with conflicting objectives, as they can approximate the Pareto-optimal front with a population in a single run.

9.9.1 Classical Methods for Solving Multiobjective Optimization Problems

All classical methods for solving MOPs scalarize the objective vector into one scalar objective. The following are three commonly used classical methods [146]:

1. Method of objective weighting: multiple objective functions are combined into one overall objective function:

$$Z(\mathbf{x}) = \sum_{i=1}^{m} w_i f_i(\mathbf{x}), \quad \mathbf{x} \in \Omega, \tag{9.9.1}$$

where the wi are fractional numbers (0 ≤ wi ≤ 1) and all weights sum up to 1, i.e., $\sum_{i=1}^k w_i = 1$. In this method, the optimal solution is controlled by the


weight vector w = [w1, . . . , wk]T: the preference for each objective fi(x) can be changed by modifying its corresponding weight wi.

2. Method of distance functions: the scalarization of the objective vector f(x) = [f1(x), . . . , fk(x)]T is achieved by using a demand-level vector ȳ, which has to be specified by the decision maker, as follows:

$$Z(\mathbf{x}) = \left[\sum_{i=1}^{m} \bigl(f_i(\mathbf{x}) - \bar{y}_i\bigr)^r\right]^{1/r}, \quad 1 \le r < \infty, \tag{9.9.2}$$

where a Euclidean metric r = 2 is usually chosen, and the demand-level vector ȳ consists of the individual optima of the multiple objectives. The solution given by this method depends on the chosen demand-level vector; an arbitrary selection of a demand level may be highly undesirable, and a wrong demand level will lead to a non-Pareto-optimal solution.

3. Min-max formulation: this method attempts to minimize the relative deviations of the single objective functions from their individual optima, i.e., it tries to minimize the objective conflict F(x):

$$\min F(\mathbf{x}) = \max_j Z_j(\mathbf{x}), \quad j = 1, \ldots, m, \tag{9.9.3}$$

where Zj(x) is calculated for a nonnegative target optimal value f̄j > 0 as follows:

$$Z_j(\mathbf{x}) = \frac{f_j - \bar{f}_j}{\bar{f}_j}, \quad j = 1, \ldots, m. \tag{9.9.4}$$

The drawbacks of the above classical methods are as follows: optimizing the single scalarized objective may guarantee a Pareto-optimal solution, but it results in a single-point solution, whereas in real-world situations decision makers often need several alternatives for decision making. Moreover, if some of the objectives are noisy or have a discontinuous variable space, then these methods may not work effectively. Traditional search and optimization methods (such as gradient-based methods) are not easily extended to multiobjective optimization because their basic design precludes the consideration of multiple solutions. In contrast, population-based methods (such as single-objective evolutionary algorithms) are well-suited for handling such situations because of their ability to approximate the entire Pareto front in a single run [20, 24, 110].

The basic difference between multiobjective EAs and single-objective EAs is how they rank and select individuals when conflicting performance measures or objectives, as in many real-life multiobjective problems, must be optimized simultaneously to achieve a trade-off. In the single-objective case, individuals are naturally ranked according to this objective, and it is clear which individuals are best and should be selected as parents. However, if there are multiple objectives, ranking the individuals is no


longer obvious. Most people probably agree that a good approximation to the Pareto front is characterized by the following metrics [13]:

• a small distance of the solutions to the true Pareto frontier (e.g., dominated-sorting solutions),
• a wide range of solutions, i.e., an approximation of the extreme values (e.g., extreme solutions),
• a good distribution of solutions, i.e., an even spread along the Pareto frontier (e.g., solutions with a larger distance to other solutions).

MOEAs then rank individuals according to how much they contribute to the above goals. In MOEAs, the ability to balance convergence and diversity depends on the selection strategy, and MOEAs can accordingly be broadly classified into Pareto dominance-based MOEAs (PDMOEAs), indicator-based MOEAs, and decomposition-based MOEAs [119].

Difficulties in handling multiobjective problems can be roughly classified into the following five categories [87]:
1. Difficulties in searching for Pareto-optimal solutions.
2. Difficulties in approximating the entire Pareto front.
3. Difficulties in the presentation of obtained solutions.
4. Difficulties in selecting a single final solution.
5. Difficulties in the evaluation of search algorithms.

9.9.2 MOEA Based on Decomposition (MOEA/D)

Consider the multiobjective optimization problem (MOP)

$$\min \mathbf{f}(\mathbf{x}) = [f_1(\mathbf{x}), \ldots, f_m(\mathbf{x})]^T, \quad \text{subject to } \mathbf{x} \in \Omega, \tag{9.9.5}$$

where x = [x1, . . . , xn]T is the decision (variable) vector, Ω is the decision (variable) space, Rm is the objective space, and f : Ω → Rm consists of m real-valued objective functions. If Ω is a closed and connected region in Rn and all the objectives are continuous in x, then the above problem is known as a continuous MOP.


A point x∗ ∈ Ω is called (globally) Pareto optimal if there is no x ∈ Ω such that f(x) dominates f(x∗), i.e., fi(x) ≤ fi(x∗) for each i = 1, . . . , m with strict inequality for at least one index. The set of all the Pareto-optimal points is called the Pareto set (PS). The set of all the Pareto objective vectors, PF = {f(x) ∈ Rm | x ∈ PS}, is called the Pareto front.

Decomposition is a basic strategy in traditional multiobjective optimization. However, general MOEAs treat an MOP as a whole and do not associate each individual solution with any particular scalar optimization problem. A multiobjective evolutionary algorithm based on decomposition (MOEA/D) [167] decomposes an MOP by scalarizing functions into a number of subproblems, each of which is associated with a search direction (or weight vector) and assigned a candidate solution. The following are three approaches for converting the problem of approximating the Pareto front of problem (9.9.5) into a number of scalar optimization problems [167] (a small code sketch of these scalarizations follows this list).

1. Weighted sum approach: Let λ = [λ1, . . . , λm]T be a weight vector with λi ≥ 0 for all i = 1, . . . , m and Σ_{i=1}^m λi = 1. Then, the optimal solution of the scalar optimization problem

$$\max\; g^{ws}(\mathbf{x} \mid \boldsymbol{\lambda}) = \sum_{i=1}^{m} \lambda_i f_i(\mathbf{x}), \quad \text{subject to } \mathbf{x} \in \Omega, \tag{9.9.6}$$

is a Pareto-optimal point of the MOP (9.9.5).

2. Tchebycheff approach: This approach [103] converts the problem of finding a Pareto-optimal solution of (9.9.5) into the scalar optimization problem

$$\min\; g^{tc}(\mathbf{x} \mid \boldsymbol{\lambda}, \mathbf{z}^*) = \max_{1 \le i \le m} \{\lambda_i |f_i(\mathbf{x}) - z_i^*|\}, \quad \text{subject to } \mathbf{x} \in \Omega, \tag{9.9.7}$$

where z∗ = [z1∗, . . . , zm∗]T is the reference point vector, i.e., zi∗ = max{fi(x) | x ∈ Ω} for each i = 1, . . . , m. Each optimal solution of (9.9.7) is a Pareto-optimal solution of (9.9.5), and thus one can obtain different Pareto-optimal solutions by altering the weight vector λ. Therefore, the Tchebycheff scalar optimization subproblem can be defined as [167]

$$\min \max_{1 \le i \le m} \{\lambda_i |f_i(\mathbf{x}) - z_i^*|\}, \quad \text{subject to } \mathbf{x} \in \Omega, \tag{9.9.8}$$

where λi is a weight parameter for the ith objective, and zi∗ is set to the current best fitness of the ith objective. The Tchebycheff approach used in MOEA/D can decompose the original MOP into several scalar optimization subproblems of the form (9.9.8), which can be solved with any optimization method for single-objective optimization.


3. Boundary intersection (BI) approach: This approach solves the following scalar optimization subproblem:

$$\min\; g^{bi}(\mathbf{x} \mid \boldsymbol{\lambda}, \mathbf{z}^*) = d, \quad \text{subject to } \mathbf{z}^* - \mathbf{f}(\mathbf{x}) = d\boldsymbol{\lambda},\; \mathbf{x} \in \Omega. \tag{9.9.9}$$

The goal of the constraint z∗ − f(x) = dλ is to push f(x) as high as possible so that it reaches the boundary of the attainable objective set. In order to avoid the equality constraint in (9.9.9), one can use a penalty method to deal with the constraint [167]:

$$\min\; g^{bip}(\mathbf{x} \mid \boldsymbol{\lambda}, \mathbf{z}^*) = d_1 + \theta d_2, \quad \text{subject to } \mathbf{x} \in \Omega, \tag{9.9.10}$$

where θ > 0 is a preset penalty parameter, and

$$d_1 = \frac{\|(\mathbf{z}^* - \mathbf{f}(\mathbf{x}))^T \boldsymbol{\lambda}\|}{\|\boldsymbol{\lambda}\|}, \tag{9.9.11}$$
$$d_2 = \|\mathbf{f}(\mathbf{x}) - (\mathbf{z}^* - d_1 \boldsymbol{\lambda})\|. \tag{9.9.12}$$
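A small sketch of the three scalarizing functions (9.9.6)–(9.9.12); the function names, the default penalty value, and the vector conventions are our own illustrative assumptions:

import numpy as np

def g_weighted_sum(f_x, lam):
    """Eq. (9.9.6): weighted sum of the objective values."""
    return float(lam @ f_x)

def g_tchebycheff(f_x, lam, z_star):
    """Eqs. (9.9.7)-(9.9.8): weighted Tchebycheff distance to the reference point."""
    return float(np.max(lam * np.abs(f_x - z_star)))

def g_pbi(f_x, lam, z_star, theta=5.0):
    """Eqs. (9.9.10)-(9.9.12): penalty-based boundary intersection."""
    d1 = abs((z_star - f_x) @ lam) / np.linalg.norm(lam)
    d2 = np.linalg.norm(f_x - (z_star - d1 * lam))
    return float(d1 + theta * d2)

f_x = np.array([0.4, 0.7])          # objective vector of a candidate solution
lam = np.array([0.5, 0.5])          # weight vector of one subproblem
z_star = np.array([0.0, 0.0])       # reference point
print(g_weighted_sum(f_x, lam), g_tchebycheff(f_x, lam, z_star), g_pbi(f_x, lam, z_star))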

The multiobjective evolutionary algorithm based on decomposition (MOEA/D), proposed by Zhang and Li [167], uses the Tchebycheff approach to convert the problem of finding a Pareto-optimal solution of (9.9.5) into the scalar optimization problem (9.9.8). Let $\boldsymbol{\lambda}^j = [\lambda_1^j, \ldots, \lambda_m^j]^T$, $j = 1, \ldots, N$, be a set of weight vectors and z∗ be a reference point. The problem of approximating the PF of (9.9.5) can be decomposed into scalar optimization subproblems by using the Tchebycheff approach, and the objective function of the jth subproblem is given by

$$\min\; g^{tc}(\mathbf{x} \mid \boldsymbol{\lambda}^j, \mathbf{z}^*) = \max_{1 \le i \le m} \bigl\{\lambda_i^j |f_i(\mathbf{x}) - z_i^*|\bigr\}, \quad \text{subject to } \mathbf{x} \in \Omega, \tag{9.9.13}$$

where $\boldsymbol{\lambda}^j = [\lambda_1^j, \ldots, \lambda_m^j]^T$. Since g^{tc} is continuous in λ, the optimal solution of g^{tc}(x|λi, z∗) should be close to that of g^{tc}(x|λj, z∗) if λi and λj are close to each other. Hence any information about the g^{tc}'s with weight vectors close to λi should be helpful for optimizing g^{tc}(x|λi, z∗). This is a basic idea behind MOEA/D.

Let N be the number of subproblems considered in MOEA/D, i.e., consider a population of N points x1, . . . , xN ∈ Ω, where xi is the current solution to the ith subproblem. In MOEA/D, a neighborhood of the weight vector λi is defined as the set of its closest weight vectors in {λ1, . . . , λN}. The neighborhood of the ith subproblem consists of all the subproblems whose weight vectors belong to the neighborhood of λi. The population is composed of the best solution found so far for each subproblem, and only the current solutions to its neighboring subproblems are exploited for optimizing a subproblem in MOEA/D. The MOEA/D algorithm is shown in Algorithm 9.10.


Algorithm 9.10 Multiobjective evolutionary algorithm based on decomposition (MOEA/D) [167]
1. input:
   1.1 N: the number of the subproblems considered;
   1.2 λ1, . . . , λN: a set of N weight vectors;
   1.3 T: the number of the weight vectors in the neighborhood of each weight vector.
2. initialization:
   2.1 Let the external Pareto-optimal set P = ∅.
   2.2 Compute d(i, j) = ||λi − λj||2 for i, j ∈ {1, . . . , N} with i ≠ j, and then work out the T closest weight vectors to each weight vector. For each i ∈ {1, . . . , N}, set B(i) = {i1, . . . , iT}, where λi1, . . . , λiT are the T closest weight vectors to λi.
   2.3 Generate an initial population x1, . . . , xN randomly or by a problem-specific method. Let fvi = f(xi) and P = {x1, . . . , xN}.
   2.4 Initialize z = [z1, . . . , zk]T by a problem-specific method.
3. for i = 1, . . . , N do
4.   Reproduction: Randomly select two indexes j and l from B(i), and then generate a new solution y from xj and xl by using genetic operators.
5.   Improvement: Apply a problem-specific repair/improvement heuristic on y to produce y′.
6.   Update z: For each j = 1, . . . , k, if zj < fj(y′), then set zj = fj(y′).
7.   Generation: For each index j ∈ B(i), if g^{tc}(y′|λj, z) ≤ g^{tc}(xj|λj, z), then let xj = y′ and update the neighboring solution fvj = f(y′).
8.   Selection: Remove from P all the vectors dominated by f(y′). Add f(y′) to P if no vector in P dominates f(y′).
9.   Stopping criterion: If the stopping criterion is satisfied, then stop and output P. Otherwise, go to step 3.
10. end for
11. output: external Pareto-optimal set P.

However, MOEA/D with simulated binary crossover has the following two shortcomings [93]:

• The population in MOEA/D with simulated binary crossover may lose the diversity needed for exploring the search space effectively, particularly at the early stage of the search, when applied to MOPs with complicated Pareto sets.
• The simulated binary crossover operator often generates inferior solutions in MOEAs.

To overcome these shortcomings, Li and Zhang [93] developed a version of MOEA/D based on differential evolution (DE), called MOEA/D-DE, which uses a differential evolution operator [124] and a polynomial mutation operator [24] to produce a new solution y = [y1, . . . , yn]T.

1. Differential evolution operator: Generate a solution ȳ = [ȳ1, . . . , ȳn]T from three solutions xr1, xr2, and xr3 selected randomly from P by a DE operator. The elements of ȳ are given by

$$\bar{y}_j = \begin{cases} x_j^{r_1} + F \cdot (x_j^{r_2} - x_j^{r_3}), & \text{with probability } CR,\\ x_j^{r_1}, & \text{with probability } 1 - CR, \end{cases} \tag{9.9.14}$$


for j = 1, . . . , n, where CR and F are two control parameters.

2. Polynomial mutation operator: Generate y = [y1, . . . , yn]T from ȳ in the following way:

$$y_j = \begin{cases} \bar{y}_j + \sigma_j (b_j - a_j), & \text{with probability } p_m,\\ \bar{y}_j, & \text{with probability } 1 - p_m, \end{cases} \tag{9.9.15}$$

with

$$\sigma_j = \begin{cases} (2 \cdot \mathrm{rand})^{1/(\eta+1)} - 1, & \text{if } \mathrm{rand} < 0.5,\\ 1 - (2 - 2 \cdot \mathrm{rand})^{1/(\eta+1)}, & \text{otherwise}, \end{cases} \tag{9.9.16}$$

for j = 1, . . . , n, where rand is a uniform random number from [0, 1], the distribution index η and the mutation rate pm are two control parameters, and aj and bj are the lower and upper bounds of the jth decision variable, respectively.

At each generation, MOEA/D-DE maintains the following:
• a population of points x1, . . . , xN ∈ Ω, where xi is the current solution to the ith subproblem;
• fv1, . . . , fvN, where fvi is the f-value of xi, i.e., fvi = [f1(xi), . . . , fm(xi)]T for each i = 1, . . . , N;
• z = [z1, . . . , zm]T, where zi is the best value found so far for objective fi.

The MOEA/D-DE algorithm in [93] is shown in Algorithm 9.11.
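Before the full algorithm listing, a minimal sketch of the two variation operators (9.9.14)–(9.9.16) used in MOEA/D-DE (function names, default parameter values, and the random-number handling are our own assumptions):

import numpy as np

rng = np.random.default_rng(0)

def de_operator(x_r1, x_r2, x_r3, F=0.5, CR=1.0):
    """Eq. (9.9.14): differential-evolution recombination of three parents."""
    mask = rng.random(x_r1.shape) < CR
    return np.where(mask, x_r1 + F * (x_r2 - x_r3), x_r1)

def polynomial_mutation(y_bar, a, b, pm=0.1, eta=20.0):
    """Eqs. (9.9.15)-(9.9.16): perturb each coordinate within its bounds [a_j, b_j]."""
    y = y_bar.copy()
    for j in range(len(y)):
        if rng.random() < pm:
            r = rng.random()
            sigma = (2 * r) ** (1 / (eta + 1)) - 1 if r < 0.5 \
                    else 1 - (2 - 2 * r) ** (1 / (eta + 1))
            y[j] = y_bar[j] + sigma * (b[j] - a[j])
    return y

a, b = np.zeros(3), np.ones(3)
x1, x2, x3 = rng.random(3), rng.random(3), rng.random(3)
y_bar = de_operator(x1, x2, x3)
print(polynomial_mutation(y_bar, a, b))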

Algorithm 9.11 MOEA/D-DE algorithm for solving (9.9.5) [93]
1. input:
   1.1 a stopping criterion;
   1.2 N: the number of the subproblems considered;
   1.3 λ1, . . . , λN: a set of N weight vectors;
   1.4 T: the number of the weight vectors in the neighborhood of each weight vector;
   1.5 δ: the probability that parent solutions are selected from the neighborhood;
   1.6 nr: the maximal number of solutions replaced by each child solution.
2. initialization:
   2.1 Compute the Euclidean distances between any two weight vectors and then work out the T closest weight vectors to each weight vector. For each i = 1, . . . , N, set B(i) = {i1, . . . , iT}, where λi1, . . . , λiT are the T closest weight vectors to λi.
   2.2 Generate an initial population x1, . . . , xN by uniformly randomly sampling from Ω. Set fvi = [f1(xi), . . . , fm(xi)]T.
   2.3 Initialize z = [z1, . . . , zm]T by setting zj = min_{1≤i≤N} fj(xi).
3. for i = 1, . . . , N do
4.   Selection of mating/update range: Uniformly randomly generate a number rand from [0, 1]. Then set P = B(i) if rand < δ, and P = {1, . . . , N} otherwise.
5.   Reproduction: Set r1 = i and randomly select two indexes r2 and r3 from P; generate a solution ȳ from xr1, xr2, and xr3 by the differential evolution operator (9.9.14), and then perform the mutation operator (9.9.15) on ȳ with probability pm to produce a new solution y.
6.   Repair: If an element of y is outside the boundary of Ω, its value is reset to a randomly selected value inside the boundary.
7.   Update of z: For each j = 1, . . . , m, if zj > fj(y), then set zj = fj(y).
8.   Update of solutions: Set c = 0 and then do the following:
9.   If c = nr or P is empty, go to Step 3. Otherwise, randomly pick an index j from P.
10.  If g(y, λj, z) ≤ g(xj, λj, z), then set xj = y, fvj = [f1(y), . . . , fm(y)]T, and c = c + 1.
11.  Remove j from P and go to Step 9.
12.  Stopping criterion: If the stopping criterion is satisfied, then stop and output {x1, . . . , xN} and {[f1(x1), . . . , fm(x1)], . . . , [f1(xN), . . . , fm(xN)]}. Otherwise go to Step 3.
13. end for
14. output:
    Approximation to the Pareto set: {x1, . . . , xN};
    Approximation to the Pareto front: {[f1(x1), . . . , fm(x1)], . . . , [f1(xN), . . . , fm(xN)]}.

9.9.3 Strength Pareto Evolutionary Algorithm

It is widely recognized (e.g., see [45, 155]) that multiobjective evolutionary algorithms (MOEAs) can stochastically solve multiobjective optimization problems in an acceptable timeframe. As their name implies, MOEAs are evolutionary-computation-based multiobjective optimization techniques. General MOPs are mathematically defined as follows.

Definition 9.35 (General MOP [155]) In general, a multiobjective optimization problem (MOP) minimizes f(x) = [f1(x), . . . , fk(x)] subject to gi(x) ≤ 0, i = 1, . . . , m, x ∈ Ω. An MOP solution minimizes the components of the vector f(x), where x is an n-dimensional decision variable vector x = [x1, . . . , xn]T from some universe Ω.

The solution of an MOP usually consists of both a search (i.e., optimization) process and a decision process. There are three decision-making preferences [78]:

• A priori preference articulation (Decide → Search): Decision making combines the differing objectives into a scalar cost function. This effectively makes the MOP single-objective prior to optimization.
• Progressive preference articulation (Search ↔ Decide): Decision making and optimization are intertwined. Partial preference information is provided, optimization occurs on that basis, and an "updated" set of solutions is provided for the decision maker to consider.
• A posteriori preference articulation (Search → Decide): The decision maker is presented with a set of Pareto-optimal candidate solutions and chooses from that set. Note, however, that most MOEA researchers search for and present a set of nondominated vectors (PFknown) to the decision maker.



As pointed out in [73] and [155], any practical MOEA implementation must include a secondary population composed of all Pareto-optimal solutions P_known(t) found so far during search. This is because the MOEA is a stochastic optimizer, which does not guarantee that desirable solutions, once found, remain in the generational population until MOEA termination. To this end, one needs a fitness assignment. Let P denote the population and P′ be an external nondominated set. The fitness assignment procedure is a two-stage process [170].
1. Each solution xi ∈ P′ is assigned a real value si ∈ [0, 1), called strength; si is proportional to the number of population members xj for which xi ⪰ xj. Let n denote the number of individuals in P that are covered by xi and let N be the size of P. Then si is defined as si = n/(N + 1). The fitness Fi of xi is equal to its strength: Fi = si.
2. The fitness of an individual j ∈ P is calculated by summing the strengths of all external nondominated solutions xi ∈ P′ that cover xj. A total of 1 is added to this sum in order to guarantee that members of P′ have better fitness than members of P (note that fitness is to be minimized, i.e., small fitness values correspond to high reproduction probabilities):

F_j = 1 + ∑_{i: xi ⪰ xj} si,  where Fj ∈ [1, N).    (9.9.17)
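To make the two-stage assignment concrete, the following is a minimal Python sketch (illustrative only, not from the source; the dominates predicate and the array layout are assumptions) that computes the strengths si of the external set P′ and the fitness values Fj of the population P:

    import numpy as np

    def dominates(a, b):
        # a weakly dominates (covers) b in objective space (assumed covering relation)
        return np.all(a <= b) and np.any(a < b)

    def spea_fitness(P_objs, Pext_objs):
        """Two-stage SPEA fitness assignment, Eq. (9.9.17)."""
        N = len(P_objs)
        # Stage 1: strength s_i = n_i / (N + 1); fitness of external members F_i = s_i
        strengths = []
        for ext in Pext_objs:
            n = sum(dominates(ext, p) for p in P_objs)  # members of P covered by x_i
            strengths.append(n / (N + 1))
        # Stage 2: F_j = 1 + sum of strengths of external solutions covering x_j
        pop_fitness = []
        for p in P_objs:
            F = 1.0 + sum(s for s, ext in zip(strengths, Pext_objs) if dominates(ext, p))
            pop_fitness.append(F)
        return strengths, pop_fitness  # fitness is to be minimized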

The following are the three main techniques used in MOEAs [170].
• Store the nondominated solutions found so far externally in P′.
• Use the concept of Pareto dominance in order to assign scalar fitness values to individuals in P.
• Perform clustering to reduce the number of nondominated solutions stored in P′ without destroying the characteristics of the trade-off front.
The strength Pareto EA (SPEA) uses the following techniques for finding multiple Pareto-optimal solutions in parallel [170].
(a) It combines the above three techniques in a single algorithm.
(b) The fitness of an individual in P is determined only from the solutions stored in the external nondominated set P′; whether members of the population dominate each other is irrelevant.
(c) All solutions in the external nondominated set P′ participate in the selection.
(d) A niching method is provided in order to preserve diversity in the population P; this method is Pareto-based and does not require any distance parameter (like the niche radius for sharing).
Since SPEA combines the three basic MOEA techniques in (a) with the three techniques (b)–(d), it is a strength Pareto variant of MOEAs and hence is named SPEA. Algorithm 9.12 shows the flow of the SPEA algorithm.



Algorithm 9.12 SPEA algorithm [170]
1. input: The population size N.
2. initialization: Generate an initial population P and create the empty external nondominated set P′.
3. Copy nondominated members of P to P′.
4. Remove solutions within P′ which are covered by any other member of P′.
5. If the number of externally stored nondominated solutions exceeds a given maximum N′, prune P′ by means of clustering.
6. Calculate the fitness of each individual in P as well as in P′.
7. Select individuals from P ∪ P′ (e.g., using binary tournament selection), until the mating pool is filled.
8. Apply problem-specific crossover and mutation operators as usual.
9. If the maximum number of generations is reached, then stop, else go to Step 3.
10. output: nondominated population P′.

The average linkage method [107] can be used for the pruning by clustering in Step 5 of Algorithm 9.12. The average linkage method contains the following steps.
(1) Initialize the cluster set C: each external nondominated point xi ∈ P′ constitutes a distinct cluster, i.e., C = ∪_i {{xi}}.
(2) If |C| ≤ N′, go to Step (5), else go to Step (3).
(3) Calculate the distance of all possible pairs of clusters. The distance d(c1, c2) of two clusters c1, c2 ∈ C is given as the average distance between pairs of individuals across the two clusters:

d(c1, c2) = (1/(|c1| · |c2|)) ∑_{xi1 ∈ c1, xi2 ∈ c2} ‖xi1 − xi2‖₂²,    (9.9.18)

where |ci| denotes the number of individuals in cluster ci, i = 1, 2, and the metric ‖·‖₂² reflects the Euclidean distance between two individuals xi1 and xi2 and is called the Euclidean metric on the objective space.
(4) Determine the two clusters c1 and c2 with minimal distance d(c1, c2); the chosen clusters amalgamate into a larger cluster: C = C \ {c1, c2} ∪ {c1 ∪ c2}. Go to Step (2).
(5) Compute the reduced nondominated set by selecting a representative individual per cluster. Use the centroid (the point with minimal average distance to all other points in the cluster) as the representative solution.

Consider an improvement of SPEA, called SPEA2. In multiobjective evolutionary optimization, the approximation of the Pareto-optimal set itself involves two (possibly conflicting) objectives: the distance to the optimal front is to be minimized, and the diversity of the generated solutions is to be maximized (in terms of objective or parameter values). In this context, there are three fundamental issues when designing a multiobjective evolutionary algorithm [172]: fitness assignment, environmental selection, and mating selection.



Fitness Assignment
To avoid the situation that individuals dominated by the same archive members have identical fitness values, for each individual both dominating and dominated solutions are taken into account in SPEA2. Each individual i in the archive P̄t and the population Pt is assigned a strength value S(i), representing the number of solutions it dominates:

S(i) = |{j | j ∈ Pt ∪ P̄t ∧ i ≻ j}|,    (9.9.19)

where |·| denotes the cardinality of a set, and the symbol ≻ corresponds to the Pareto dominance relation. Use the S(i) values to calculate the raw fitness value R(i) of an individual i as follows:

R(i) = ∑_{j ∈ Pt ∪ P̄t, j ≻ i} S(j).    (9.9.20)

Remark 1 The raw fitness of an individual is determined by the strengths of its dominators in both archive and population in SPEA2, while SPEA considers only archive members in this context.
Remark 2 Fitness is to be minimized in SPEA2, i.e., R(i) = 0 corresponds to a nondominated individual, while a high R(i) value means that i is dominated by many individuals (which in turn dominate many individuals).
The raw fitness assignment provides a sort of niching mechanism based on the concept of Pareto dominance, but it may fail when most individuals do not dominate each other. To avoid this drawback, additional density information is incorporated to discriminate between individuals having identical raw fitness values. The density D(i) corresponding to i is defined by

D(i) = 1/(σᵢᵏ + 2),    (9.9.21)

where σᵢᵏ denotes the distance of i to its kth nearest neighbor in P̄t+1, and k = √(N + N̄). Finally, adding D(i) to the raw fitness value R(i) of an individual i yields its fitness F(i):

F(i) = R(i) + D(i).    (9.9.22)
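A minimal NumPy sketch of this fitness assignment (an illustration, not the book's implementation; the array layout and the dominance test are assumptions) is:

    import numpy as np

    def spea2_fitness(objs):
        """SPEA2 fitness for the union of population and archive,
        given an (M x k) array of objective vectors (one row per individual)."""
        M = len(objs)
        dom = lambda a, b: np.all(objs[a] <= objs[b]) and np.any(objs[a] < objs[b])
        # Strength S(i): number of individuals that i dominates, Eq. (9.9.19)
        S = [sum(dom(i, j) for j in range(M) if j != i) for i in range(M)]
        # Raw fitness R(i): sum of strengths of i's dominators, Eq. (9.9.20)
        R = [sum(S[j] for j in range(M) if j != i and dom(j, i)) for i in range(M)]
        # Density D(i) = 1/(sigma_i^k + 2) with k ~ sqrt(M), Eq. (9.9.21)
        k = max(1, min(int(np.sqrt(M)), M - 1))
        D = []
        for i in range(M):
            dists = np.sort(np.linalg.norm(objs - objs[i], axis=1))
            D.append(1.0 / (dists[k] + 2.0))  # dists[0] = 0 is the point itself
        # Final fitness F(i) = R(i) + D(i), Eq. (9.9.22); nondominated iff F(i) < 1
        return [R[i] + D[i] for i in range(M)]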

Environmental Selection
This selection decides which individuals in the population to keep during the evolution process. To this end, besides the population, an archive is necessary to contain a representation of the nondominated Pareto front among all solutions considered so far. During environmental selection, the first step is to copy all nondominated



individuals having a fitness lower than one from archive and population to the archive of the next generation:

P̄t+1 = {i | i ∈ Pt ∪ P̄t ∧ F(i) < 1}.    (9.9.23)

If the nondominated front fits exactly into the archive (|P̄t+1| = N̄), the environmental selection step is completed. Otherwise, if the archive is too small (|P̄t+1| < N̄), then the best N̄ − |P̄t+1| dominated individuals in the previous archive P̄t and population are copied to the new archive P̄t+1. This can be implemented by sorting the multi-set Pt + P̄t according to the fitness values and copying the first N̄ − |P̄t+1| individuals i with F(i) ≥ 1 from the resulting ordered list to P̄t+1. Conversely, if the archive is too large (|P̄t+1| > N̄), then an archive truncation procedure is invoked which iteratively removes individuals from P̄t+1 until |P̄t+1| = N̄. At each stage, the individual which has the minimum distance to another individual is chosen for removal; if there are several individuals with minimum distance, the tie is broken by considering the second smallest distances, and so forth. A minimal sketch of this selection and truncation step is given after the list of SPEA2 differences below.

Mating Selection
The pool of individuals at each generation is evaluated in a two-stage process. First, all individuals are compared on the basis of the Pareto dominance relation, which defines a partial order on this multi-set. Basically, the information on which individuals each individual dominates, is dominated by, or is indifferent to is used to define a ranking on the generation pool. Afterwards, this ranking is refined by the incorporation of density information. Various density estimation techniques are used to measure the size of the niche in which a specific individual is located.

Algorithm 9.13 summarizes the improved strength Pareto evolutionary algorithm (SPEA2). The main differences of SPEA2 in comparison to SPEA are as follows [172].
• An improved fitness assignment scheme is used, which takes into account, for each individual, how many individuals it dominates and by how many it is dominated.
• A nearest-neighbor density estimation technique is incorporated, which allows a more precise guidance of the search process.
• A new archive truncation method guarantees the preservation of boundary solutions.
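The sketch below (in Python, with illustrative names; the distance is computed in objective space, as an assumption consistent with the density definition above) shows one way the environmental selection and truncation could be realized:

    import numpy as np

    def environmental_selection(objs, fitness, archive_size):
        """Keep nondominated individuals (F < 1, Eq. (9.9.23)); pad with the best
        dominated ones if too few; truncate by nearest-neighbor distance if too many."""
        idx = [i for i in range(len(objs)) if fitness[i] < 1.0]
        if len(idx) < archive_size:  # archive too small
            rest = sorted((i for i in range(len(objs)) if fitness[i] >= 1.0),
                          key=lambda i: fitness[i])
            idx += rest[:archive_size - len(idx)]
        while len(idx) > archive_size:  # archive too large: truncation
            pts = np.asarray(objs)[idx]
            # sorted distances of each member to all other archive members
            profiles = [np.sort(np.linalg.norm(pts - pts[i], axis=1))[1:].tolist()
                        for i in range(len(idx))]
            # remove the member with the smallest distance to another member,
            # ties broken by the second smallest distance, and so forth
            worst = min(range(len(idx)), key=lambda i: profiles[i])
            idx.pop(worst)
        return idx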

9.9.4 Achievement Scalarizing Functions

Achievement scalarizing functions (ASFs), introduced by Wierzbicki [158], are denoted by sR(f(x)) : Rᵏ → R; they map (or scalarize) the k objective functions to a scalar. The achievement scalarizing problem is given by

min_{x∈X} sR(f(x)).    (9.9.24)



Algorithm 9.13 Improved strength Pareto evolutionary algorithm (SPEA2) [172]
1. input: N: population size, N̄: archive size, T: maximum number of generations.
2. initialization:
   2.1 Generate an initial population P0 and create the empty archive (external set) P̄0 = ∅;
   2.2 Set t = 0.
3. while t = 0, 1, . . .
   % Fitness assignment: Calculate fitness values of individuals in Pt and P̄t
4.   Calculate the strength value S(i) = |{j | j ∈ Pt ∪ P̄t ∧ i ≻ j}| for i = 1, . . . , N;
5.   Calculate the raw fitness R(i) = ∑_{j ∈ Pt ∪ P̄t, j ≻ i} S(j) for i = 1, . . . , N;
6.   Compute the density D(i) = 1/(σᵢᵏ + 2) for i = 1, . . . , N;
7.   Compute the fitness value F(i) = R(i) + D(i) for i = 1, . . . , N.
   % Environmental selection:
8.   Copy all nondominated individuals in Pt and P̄t to P̄t+1: P̄t+1 = {i | i ∈ Pt ∪ P̄t ∧ F(i) < 1};
9.   If |P̄t+1| = N̄, then the environmental selection step is completed;
10.  If |P̄t+1| < N̄, then sort the multi-set Pt + P̄t according to the fitness values and copy the first N̄ − |P̄t+1| individuals i with F(i) ≥ 1 from the resulting ordered list to P̄t+1.
11.  If |P̄t+1| > N̄, then an archive truncation procedure is invoked which iteratively removes individuals from P̄t+1 until |P̄t+1| = N̄.
   % Termination:
12.  If t ≥ T or another stopping criterion is satisfied, then set A to the set of decision vectors represented by the nondominated individuals in P̄t+1. Stop and output.
   % Mating selection:
13.  Perform binary tournament selection with replacement on P̄t+1 in order to fill the mating pool.
14. end while
15. output: nondominated set A.

Certain properties of ASFs guarantee that the scalarizing problem (9.9.24) yields Pareto-optimal solutions.

Definition 9.36 (Increasing Function [158]) An achievement scalarizing function sR(f(x)) : Rᵏ → R is said to be
1. Increasing: if for any y¹, y² ∈ Rᵏ, yᵢ¹ ≤ yᵢ² for all i ∈ {1, . . . , k}, then sR(y¹) ≤ sR(y²).
2. Strictly increasing: if for any y¹, y² ∈ Rᵏ, yᵢ¹ < yᵢ² for all i ∈ {1, . . . , k}, then sR(y¹) < sR(y²).
3. Strongly increasing: if for any y¹, y² ∈ Rᵏ, yᵢ¹ ≤ yᵢ² for all i ∈ {1, . . . , k} and y¹ ≠ y², then sR(y¹) < sR(y²).

Clearly, any strongly increasing ASF is also strictly increasing, and any strictly increasing ASF is also increasing. The following theorems give conditions for an optimal solution of (9.9.24) to be (weakly) Pareto optimal.

Theorem 9.3 ([159, 160]) The necessary and sufficient conditions for an optimal solution being Pareto optimal are as follows.



1. Let sR be strongly (strictly) increasing. If x* ∈ X is an optimal solution of problem (9.9.24), then x* is (weakly) Pareto optimal.
2. If sR is increasing and the solution x* ∈ X of (9.9.24) is unique, then x* is Pareto optimal.

Theorem 9.4 ([103]) If sR is strictly increasing and x* ∈ X is weakly Pareto optimal, then it is a solution of (9.9.24) with the reference point f(x*), and the optimal value of sR is zero.

One example of the achievement scalarizing problems can be formulated as

min  max_{i=1,...,k} [ (fi(x) − z̄i)/(z_i^nad − z_i^utopia) ] + ρ ∑_{i=1}^{k} fi(x)/(z_i^nad − z_i^utopia)    (9.9.25)
subject to x ∈ S,    (9.9.26)

where the term ρ ∑_{i=1}^{k} fi(x)/(z_i^nad − z_i^utopia) is called the augmentation term, in which ρ > 0 is a small constant, and z^nad and z^utopian are the nadir and utopian vectors, respectively. In the above problem, the parameter z̄i is the so-called reference point, which represents the objective function values preferred by the decision maker. The following are several of the most well-known ASFs [116].
1. The strictly increasing ASF of Chebyshev type:

s_R^∞(f(x), λ) = max_{i∈{1,...,k}} λi(fi(x) − fi(x*)),    (9.9.27)

where x* is a reference point, and λ is a k-vector of nonnegative coefficients used for scaling purposes, that is, for normalizing objective functions of different magnitudes.
2. The strongly increasing ASF of augmented Chebyshev type:

s_R^{∞+1}(f(x), λ) = ρ ∑_{i∈{1,...,k}} λi(fi(x) − fi(x*)) + max_{i∈{1,...,k}} λi(fi(x) − fi(x*)),    (9.9.28)

where ρ > 0 is a small parameter.
3. The additive ASF based on the L1 metric [137]:

s_R^1(f(x), λ) = ∑_{i∈{1,...,k}} max{λi(fi(x) − fi(x*)), 0}.    (9.9.29)



4. The parameterized ASF: Let Iq be a subset of Nk = {1, . . . , k} of cardinality q. The parameterized ASF is defined as follows [116]:

s̃_R^q(f(x), λ) = max_{Iq ⊆ Nk : |Iq| = q} { ∑_{i∈Iq} max[λi(fi(x) − fi(x*)), 0] },    (9.9.30)

where q ∈ Nk and λ = [λ1, . . . , λk]^T, λi > 0, i ∈ Nk. Notice that
• for q ∈ Nk: s̃_R^q(f(x), λ) ≥ 0;
• for q = 1: s̃_R^1(f(x), λ) = max_{i∈Nk} max[λi(fi(x) − fi(x*)), 0] ≅ s_R^∞(f(x), λ);
• for q = k: s̃_R^k(f(x), λ) = ∑_{i∈Nk} max[λi(fi(x) − fi(x*)), 0] = s_R^1(f(x), λ).
Here, “≅” means equality in the case where there exist no feasible solutions x ∈ X which strictly dominate the reference point, i.e., such that fi(x) < fi(x*) for all i ∈ Nk.
The performance of the additive ASF can be described by the following two theorems.

Theorem 9.5 ([137]) Given problem (9.9.24) with the ASF defined by (9.9.29), let f(x*) be a reference point such that f(x*) is not dominated by the objective vector of any feasible solution of problem (9.9.24). Also assume λi > 0 for all i ∈ {1, . . . , k}. Then any optimal solution of problem (9.9.24) is a weakly Pareto-optimal solution.

Theorem 9.6 ([137]) Given problem (9.9.24) with the ASF defined by (9.9.29) and any reference point f(x*), and assume λi > 0 for all i ∈ {1, . . . , k}. Then among the optimal solutions of problem (9.9.24) there exists at least one Pareto-optimal solution. If the optimal solution of problem (9.9.24) is unique, then it is Pareto optimal.

When adopting the parameterized ASF, the corresponding parameterized achievement scalarizing problem is given by Nikulin et al. [116]:

min_{x∈X} s̃_R^q(f(x), λ).    (9.9.31)

For any x ∈ X, denote Ix = {i ∈ Nk : fi(x*) ≤ fi(x)}; then the following two results are true.

Theorem 9.7 ([116]) Given problem (9.9.31), let f(x*) be a reference point vector such that there exists no feasible solution whose image strictly dominates f(x*). Also assume λi > 0 for all i ∈ Nk. Then, any optimal solution of problem (9.9.31) is a weakly Pareto-optimal solution.

Theorem 9.8 ([116]) Given problem (9.9.31), let f(x*) be any reference point. Also assume λi > 0 for all i ∈ Nk. Then, among the optimal solutions of problem (9.9.31) there exists at least one Pareto-optimal solution.



Theorem 9.8 implies that the uniqueness of the optimal solution guarantees its Pareto optimality.
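The three ASF families above are easy to evaluate numerically. The following NumPy sketch (illustrative; f is the objective vector, ref the reference point f(x*), lam the positive weight vector, and the value of rho is an arbitrary small choice) shows one possible implementation:

    import numpy as np

    def chebyshev_asf(f, ref, lam):
        """Strictly increasing Chebyshev-type ASF, Eq. (9.9.27)."""
        return np.max(lam * (f - ref))

    def augmented_chebyshev_asf(f, ref, lam, rho=1e-4):
        """Strongly increasing augmented Chebyshev ASF, Eq. (9.9.28)."""
        d = lam * (f - ref)
        return rho * np.sum(d) + np.max(d)

    def parameterized_asf(f, ref, lam, q):
        """Parameterized ASF of Eq. (9.9.30): the maximum over all index sets of
        size q reduces to the sum of the q largest nonnegative weighted deviations."""
        d = np.maximum(lam * (f - ref), 0.0)
        return np.sum(np.sort(d)[-q:])

With q = 1 this reproduces the Chebyshev behavior, and with q = k the additive ASF of Eq. (9.9.29), matching the properties listed above.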

9.10 Evolutionary Programming

It is widely recognized [44] that evolutionary programming (EP) was first proposed as an approach to artificial intelligence. EP has since been applied with success to many numerical and combinatorial optimization problems.

9.10.1 Classical Evolutionary Programming

Consider a global minimization problem min f(x) formalized as a pair (S, f), where S is a bounded set on Rⁿ and f : S → R is a real-valued function of n variables. Our aim is to find a point xmin ∈ S such that f(xmin) is a global minimum on S, namely

f(xmin) ≤ f(x),  ∀ x ∈ S,    (9.10.1)

where f is a bounded function which does not need to be continuous. Optimization by EP can be summarized into two major steps:
• mutate the solutions in the current population;
• select the next generation from the mutated solutions and the current solutions.
There are the classical evolutionary programming (CEP) with self-adaptive mutation and the CEP without self-adaptive mutation. It is well known (see, e.g., [3, 42, 165]) that the former usually performs better than the latter. Following Bäck and Schwefel [3], the CEP is implemented as follows (see also [165]).
1. Initialization: Generate the initial population of μ individuals, and set k = 1. Each individual is taken as a pair of real-valued vectors (xi, ηi), i = 1, . . . , μ, where xi are the objective variables and ηi are the standard deviations for Gaussian mutations (also called strategy parameters in self-adaptive evolutionary algorithms).
2. Evaluation of fitness: Use the objective function f(xi) to evaluate the fitness score for each individual (xi, ηi), ∀ i = 1, . . . , μ.
3. Self-adaptive mutation: Each parent (xi(t − 1), ηi(t − 1)), i = 1, . . . , μ, at the (t − 1)th generation creates a single offspring (xi(t), ηi(t)) at the tth generation:

x_i^j(t) = x_i^j(t − 1) + η_i^j(t − 1) Nj(0, 1),    (9.10.2)
η_i^j(t) = η_i^j(t − 1) exp(τ′ N(0, 1) + τ Nj(0, 1)),    (9.10.3)


for i = 1, . . . , μ and j = 1, . . . , n. Here x_i^j(t − 1) and η_i^j(t − 1) denote the jth components of the parent vectors xi(t − 1) and ηi(t − 1) at the (t − 1)th generation, respectively; and x_i^j(t) and η_i^j(t) are, respectively, the jth components of the offspring xi(t) and ηi(t) at the tth generation; while N(0, 1) is a standard Gaussian random variable used for all j = 1, . . . , n, and Nj(0, 1) represents a standard Gaussian random variable generated anew for each given j. The factors are commonly set to τ = (√(2√n))⁻¹ and τ′ = (√(2n))⁻¹.
4. Fitness for each offspring: Calculate the fitness of each offspring (xi(t), ηi(t)), i = 1, . . . , μ.
5. Pairwise comparison: Conduct pairwise comparison over the union of parents (xi(t − 1), ηi(t − 1)) and offspring (xi(t), ηi(t)) for all i = 1, . . . , μ. For each individual, opponents are chosen uniformly at random from all the parents and offspring. For each comparison, if the individual's fitness is no smaller than the opponent's, it receives a "win."
6. Selection: Select the μ individuals out of (xi(t − 1), ηi(t − 1)) and (xi(t), ηi(t)) that have the most wins to be parents of the next generation.
7. Stopping criteria: Stop if the halting criterion is satisfied; otherwise, set k = k + 1, go to Step 3, and repeat the above steps.

The defining feature of CEP is its use of the standard Gaussian mutation operator Nj(0, 1) in the self-adaptive mutation (9.10.2).
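A compact NumPy sketch of the CEP self-adaptive mutation (illustrative only; the vectorized layout of x and eta as (mu x n) arrays is an assumption) is:

    import numpy as np

    def cep_mutation(x, eta, rng=np.random.default_rng()):
        """One CEP self-adaptive Gaussian mutation step, Eqs. (9.10.2)-(9.10.3)."""
        mu, n = x.shape
        tau = 1.0 / np.sqrt(2.0 * np.sqrt(n))
        tau_prime = 1.0 / np.sqrt(2.0 * n)
        # one global N(0,1) per individual, one N_j(0,1) per component
        global_n = rng.standard_normal((mu, 1))
        local_n = rng.standard_normal((mu, n))
        eta_child = eta * np.exp(tau_prime * global_n + tau * local_n)
        x_child = x + eta * rng.standard_normal((mu, n))  # parent eta, as in (9.10.2)
        return x_child, eta_child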

9.10.2 Fast Evolutionary Programming

To speed up the convergence of the CEP, a fast evolutionary programming (FEP) was suggested by Yao et al. [165]. The one-dimensional Cauchy density function centered at the origin is defined by

f_t(x) = (1/π) · t/(t² + x²),  −∞ < x < +∞,    (9.10.4)

where t > 0 is a scale parameter. The corresponding distribution function is

F_t(x) = 1/2 + (1/π) arctan(x/t).    (9.10.5)

The variance of the Cauchy distribution is infinite. The FEP is exactly the same as the CEP except that the Gaussian mutation operator Nj(0, 1) in (9.10.2) is replaced by a Cauchy mutation operator as follows:

x_i′(j) = xi(j) + ηi(j) Cj(0, 1),    (9.10.6)



where Cj(0, 1) is a standard Cauchy random variable with the scale parameter t = 1, generated anew for each value of j. The main difference between the FEP and CEP is that the CEP mutation equation (9.10.2) is a self-adaptation controlled by the Gaussian random variable Nj(0, 1), while the FEP mutation equation (9.10.6) is a self-adaptation controlled by the Cauchy random variable Cj(0, 1). It is known [165] that Cauchy mutation is more likely to generate an offspring further away from its parent than Gaussian mutation due to its long flat tails. It is therefore expected to have a higher probability of escaping from a local optimum or moving away from a plateau, especially when the "basin of attraction" of the local optimum or the plateau is large relative to the mean step size. Therefore, the FEP is expected to converge faster than the CEP. The basic steps of the FEP algorithm are described as follows [18].
1. Generate an initial population of μ solutions (xi, ηi), i ∈ {1, . . . , μ}, and evaluate their fitness f(xi).
2. For every parent solution (xi, ηi) in the population, create an offspring (xi′, ηi′) using

x_i′(j) = xi(j) + ηi(j) Cj(0, 1),
η_i′(j) = ηi(j) exp(τ′ N(0, 1) + τ Nj(0, 1)),   i = 1, . . . , μ; j = 1, . . . , n,

where xi(j), ηi(j) denote the jth components of the parent vectors xi, ηi, and x_i′(j), η_i′(j) are the jth components of the offspring xi′, ηi′, respectively; N(0, 1) is sampled from a standard Gaussian distribution once for all j = 1, . . . , n, Nj(0, 1) is sampled from a standard Gaussian distribution anew for each given j, and Cj(0, 1) is sampled from a standard Cauchy distribution anew for each i and j. The factors τ and τ′ are commonly set to τ = (√(2√n))⁻¹ and τ′ = (√(2n))⁻¹.
3. For every parent solution (xi, ηi) in the population, create another offspring (xi″, ηi″) using

x_i″(j) = xi(j) + ηi(j) Cj(0, 1),
η_i″(j) = ηi(j) exp(τ′ N(0, 1) + τ Nj(0, 1)),   i = 1, . . . , μ; j = 1, . . . , n.

4. Evaluate the fitness of the offspring solutions (xi′, ηi′) and (xi″, ηi″).
5. Perform pairwise tournaments over the union of parents (xi, ηi) and offspring (i.e., (xi′, ηi′) and (xi″, ηi″)). For each solution in the union, q opponents are chosen uniformly at random from the union. For every tournament, if the fitness of the solution is no smaller than the fitness of the opponent, it is awarded a "win." The total number of wins awarded to a solution (xi, ηi) is denoted by win(xi, ηi).
6. Choose the μ individual solutions with the most wins from the union of parents and offspring at Step 5. The chosen solutions will form the population of the next generation.



7. Stop if the number of generations exceeds a predetermined number.
Experiments on a number of benchmark problems show [165] that FEP with the Cauchy mutation operator Cj(0, 1) performs much better than CEP with the Gaussian mutation operator for multimodal functions with many local minima, while being comparable to CEP in performance for unimodal functions and for multimodal functions with only a few local minima. A comparative study of Cauchy and Gaussian mutation operators in electromagnetics, reported in [70] and [72], also observed that the Cauchy mutation operator performs better than its Gaussian counterpart for a number of unconstrained or weakly constrained antenna optimization problems, but performs relatively poorly for highly constrained problems.
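The FEP mutation differs from the CEP sketch above only in the perturbation distribution; a minimal illustrative NumPy version (same assumed (mu x n) layout) is:

    import numpy as np

    def fep_mutation(x, eta, rng=np.random.default_rng()):
        """One FEP mutation step: Cauchy perturbation of x, Eq. (9.10.6),
        with the same lognormal self-adaptation of eta as in CEP."""
        mu, n = x.shape
        tau = 1.0 / np.sqrt(2.0 * np.sqrt(n))
        tau_prime = 1.0 / np.sqrt(2.0 * n)
        eta_child = eta * np.exp(tau_prime * rng.standard_normal((mu, 1))
                                 + tau * rng.standard_normal((mu, n)))
        cauchy = rng.standard_cauchy((mu, n))  # scale parameter t = 1
        x_child = x + eta * cauchy
        return x_child, eta_child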

9.10.3 Hybrid Evolutionary Programming

Consider evolutionary programming for the constrained optimization problem

(P):  min f(x)  subject to  gi(x) ≤ 0, i = 1, . . . , r,  hj(x) = 0, j = 1, . . . , m,    (9.10.7)

where f and g1, . . . , gr are functions on Rⁿ, h1, . . . , hm are functions on Rⁿ with m ≤ n, x = [x1, . . . , xn]^T ∈ Rⁿ, and x ∈ F ⊆ S. A vector x is called a feasible solution to the optimization problem (P) if and only if x satisfies the r inequality constraints gi(x) ≤ 0, i = 1, . . . , r, and the m equality constraints hj(x) = 0, j = 1, . . . , m; otherwise, x is said to be infeasible. The set S ⊆ Rⁿ defines the search space for x and the set F ⊆ S defines the feasible part of the search space S. Use the cost function to evaluate a feasible solution, i.e.,

Φf(x) = f(x),  x ∈ F,    (9.10.8)

and define the constraint violation measure for the constraints as follows:

Φu(x) = ∑_{i=1}^{r} [gi(x)]⁺ + ∑_{j=1}^{m} |hj(x)|    (9.10.9)

or

Φu(x) = (1/2) ( ∑_{i=1}^{r} (gi⁺(x))² + ∑_{j=1}^{m} (hj(x))² ),    (9.10.10)



where |·| denotes the absolute value of the argument, and

gi⁺(x) = max{0, gi(x)}    (9.10.11)

is the magnitude of the violation of the ith constraint in (P), 1 ≤ i ≤ r. Then,

Φ(x) = Φf(x) + s Φu(x)    (9.10.12)

defines the total evaluation of an individual x, where s is a penalty parameter: a positive constant for minimization or a negative constant for maximization. Clearly, Φ(x) can be interpreted as the error (for a minimization problem) or the fitness (for a maximization problem) of an individual x with respect to the optimization problem (P). Therefore, the original constrained optimization problem (P) becomes the following unconstrained optimization problem:

min {f(x) + s Φu(x)},    (9.10.13)

where Φu(x) is defined by (9.10.9) or (9.10.10).
Different from CEP and FEP, in which each individual is a pair of real-valued vectors (xi, ηi), Myung et al. [112, 113] proposed an EP that updates a triplet of real-valued vectors (x̄i, σ̄i, η̄i), ∀ i ∈ {1, . . . , μ}, and named it the hybrid evolutionary programming (HEP). In HEP, x̄i = [xi(1), . . . , xi(n)]^T is the n-dimensional solution vector, while σ̄i and η̄i are its two corresponding strategy parameter vectors. Here σ̄i and η̄i are initialized based on the specified search domain x̄i ∈ [xmin, xmax]ⁿ, which may be imposed at the initialization stage. HEP is so named because it applies hybrid (Cauchy + Gaussian) mutation operators as follows [71]:

x_i′(j) = xi(j) + σi(j) [Nj(0, 1) + βi(j) Cj(0, 1)],    (9.10.14)
βi(j) = ηi(j)/σi(j),    (9.10.15)
σ_i′(j) = σi(j) exp(τ′ N(0, 1) + τ Nj(0, 1)),    (9.10.16)
η_i′(j) = ηi(j) exp(τ′ N(0, 1) + τ Nj(0, 1)),    (9.10.17)

for j = 1, . . . , n, where xi(j), σi(j), and ηi(j) are the jth components of the vectors xi, σi, and ηi, respectively. The notations Cj(0, 1) in (9.10.14) and Nj(0, 1) in (9.10.14), (9.10.16), and (9.10.17) indicate that these random variables are generated anew for each value of j in each equation.
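A small illustrative sketch of one HEP mutation step for a single individual (assumptions: NumPy vectors x, sigma, eta of length n, and a single shared global N(0, 1) draw per strategy-parameter update) could look as follows:

    import numpy as np

    def hep_mutation(x, sigma, eta, rng=np.random.default_rng()):
        """One hybrid (Gaussian + Cauchy) HEP mutation step, Eqs. (9.10.14)-(9.10.17)."""
        n = x.size
        tau = 1.0 / np.sqrt(2.0 * np.sqrt(n))
        tau_prime = 1.0 / np.sqrt(2.0 * n)
        beta = eta / sigma                                     # Eq. (9.10.15)
        x_child = x + sigma * (rng.standard_normal(n)
                               + beta * rng.standard_cauchy(n))  # Eq. (9.10.14)
        g = rng.standard_normal()                              # global N(0,1) draw (assumption)
        sigma_child = sigma * np.exp(tau_prime * g + tau * rng.standard_normal(n))
        eta_child = eta * np.exp(tau_prime * g + tau * rng.standard_normal(n))
        return x_child, sigma_child, eta_child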



9.11 Differential Evolution

Problems involving global optimization over continuous spaces are ubiquitous throughout the scientific community [147]. When the cost function is nonlinear and non-differentiable, one usually uses direct search approaches. Most standard direct search methods use the greedy criterion, under which a new parameter vector is accepted if and only if it reduces the value of the cost function. Although the greedy decision process converges fairly fast, it is easily trapped in a local minimum. Users generally demand that a practical minimization technique fulfill the following requirements [147].
1. Ability to handle non-differentiable, nonlinear, and multimodal cost functions.
2. Parallelizability to cope with computationally intensive cost functions.
3. Ease of use, i.e., few control variables to steer the minimization. These variables should also be robust and easy to choose.
4. Good convergence properties, i.e., consistent convergence to the global minimum in consecutive independent trials.
A minimization method called "differential evolution" (DE) was designed by Storn and Price in 1997 [147] to fulfill all of the above requirements.

9.11.1 Classical Differential Evolution

Differential evolution (DE) of Storn and Price [124, 147] is a simple yet effective algorithm for global optimization. DE is a parallel direct search method which utilizes D-dimensional parameter vectors

p_i^G = [P_{1i}^G, . . . , P_{Di}^G]^T,  ∀ i ∈ {1, . . . , NP},    (9.11.1)

as a population for each generation G, where NP is the population size, which does not change during the minimization process. The initial vector population is chosen randomly, with a uniform probability distribution assumed for all random decisions. Solutions in DE are represented by D-dimensional vectors pi, ∀ i ∈ {1, . . . , NP}. For each solution pi, select three parents pi1, pi2, pi3, where i ≠ i1 ≠ i2 ≠ i3. Algorithm 9.14 gives the details of the classical DE algorithm, in which BFV is the best fitness value so far, NFC is the number of function calls, VTR is the value to reach, and MAXNFC is the maximum number of function calls. The main operations of the classical DE algorithm can be summarized as follows [147].
1. Mutation: For each target vector p_i^G, i = 1, 2, . . . , NP, create the mutant individual v_i^{G+1} by adding a weighted difference vector between two individuals



Algorithm 9.14 Differential evolution (DE) [128]
1. input: Population size Np, problem dimension D, mutation constant F, crossover rate Cr.
2. initialization: Generate a uniformly distributed random population of Np vectors pi = [P1i, . . . , PDi]^T.
3. while (BFV > VTR) and (NFC < MAXNFC) do
4.   for i = 1 to Np do
5.     Select three parents pi1, pi2, pi3 randomly from the current population, where i ≠ i1 ≠ i2 ≠ i3.
       // Mutation
6.     vi ← pi1 + F × (pi3 − pi2), and denote vi = [V1i, . . . , VDi]^T.
       // Crossover
7.     for j = 1 to D do
8.       if rand(0, 1) < Cr then
9.         Uji ← Vji,
10.      else
11.        Uji ← Pji.
12.      end if
13.    end for
       // Selection
14.    Denote ui = [U1i, . . . , UDi]^T.
15.    if f(ui) ≤ f(pi) then
16.      pi ← ui,
17.    else
18.      pi ← pi.
19.    end if
20.  end for
21.  The updated vectors pi form the population of the next generation.
22. end while
23. output: solution vector p.

p_{i2}^G and p_{i3}^G according to

v_i^{G+1} = p_{i1}^G + F · (p_{i3}^G − p_{i2}^G),    (9.11.2)

where F > 0 is a constant coefficient used to control the differential variation d_i^G = p_{i3}^G − p_{i2}^G. This evolutionary optimization based on the difference vector between two solutions is what gives differential evolution its name.
2. Crossover: Denote u_i^G = [U_{1i}^G, . . . , U_{Di}^G]^T. In order to increase the diversity of the perturbed parameter vectors, introduce the crossover

U_{ji}^{G+1} = { V_{ji}^{G+1}, if rand_j(0, 1) ≤ CR or j = j_rand;  P_{ji}^G, otherwise, }    (9.11.3)

for j = 1, 2, . . . , D, where rand_j(0, 1) is a uniformly distributed random number between 0 and 1, and j_rand is a randomly chosen index which ensures that the trial vector u_i^{G+1} does not simply duplicate p_i^G; CR ∈ (0, 1) is the crossover rate.
3. Selection: To decide whether or not it should become a member of the next generation G + 1, the trial vector u_i^{G+1} is compared to the target vector p_i^G according to the fitness values of the parent individual p_i^G and the trial individual



u_i^{G+1}, i.e.,

p_i^{G+1} = { u_i^{G+1}, if f(u_i^{G+1}) < f(p_i^G);  p_i^G, otherwise. }    (9.11.4)

Here piG+1 is the offspring of pG i for the next generation.
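Putting the three operations together, a minimal Python sketch of DE/rand/1/bin (illustrative, not the book's code; function and parameter names are assumptions) is:

    import numpy as np

    def de_rand_1_bin(fobj, bounds, NP=20, F=0.5, CR=0.9, max_gen=200,
                      rng=np.random.default_rng()):
        """DE/rand/1/bin following Eqs. (9.11.2)-(9.11.4).
        bounds is a (D x 2) array of [min, max] per dimension."""
        bounds = np.asarray(bounds, dtype=float)
        D = len(bounds)
        pop = bounds[:, 0] + rng.random((NP, D)) * (bounds[:, 1] - bounds[:, 0])
        fit = np.array([fobj(p) for p in pop])
        for _ in range(max_gen):
            for i in range(NP):
                r1, r2, r3 = rng.choice([j for j in range(NP) if j != i], 3, replace=False)
                v = pop[r1] + F * (pop[r3] - pop[r2])        # mutation, Eq. (9.11.2)
                jrand = rng.integers(D)
                mask = rng.random(D) <= CR
                mask[jrand] = True
                u = np.where(mask, v, pop[i])                # binomial crossover, Eq. (9.11.3)
                fu = fobj(u)
                if fu < fit[i]:                              # greedy selection, Eq. (9.11.4)
                    pop[i], fit[i] = u, fu
        return pop[np.argmin(fit)], fit.min()

For instance, de_rand_1_bin(lambda x: np.sum(x**2), [[-5, 5]] * 10) would minimize a 10-dimensional sphere function under these assumed settings.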

9.11.2 Differential Evolution Variants

There are several schemes of DE based on different mutation strategies [124]:

v_i^G = p_{r1}^G + F · (p_{r2}^G − p_{r3}^G),    (9.11.5)
v_i^G = p_{best}^G + F · (p_{r1}^G − p_{r2}^G),    (9.11.6)
v_i^G = p_i^G + F · (p_{best}^G − p_i^G) + F · (p_{r1}^G − p_{r2}^G),    (9.11.7)
v_i^G = p_{best}^G + F · (p_{r1}^G − p_{r2}^G) + F · (p_{r3}^G − p_{r4}^G),    (9.11.8)
v_i^G = p_{r1}^G + F · (p_{r2}^G − p_{r3}^G) + F · (p_{r4}^G − p_{r5}^G),    (9.11.9)

where p_{best}^G is the best vector in the population, and p_{r1}^G, p_{r2}^G, p_{r3}^G, and p_{r4}^G are different randomly selected vectors from the current population (with i ≠ r1 ≠ r2 ≠ r3 ≠ r4). These schemes can be unified by the notation DE/a/b/c [101]. The meaning of a, b, c in this notation is as follows [101, 128].

• “a” specifies the vector which should be mutated; it can be the best vector (a = “best”) of the current population or a randomly selected one (a = “rand”).
• “b” denotes the number of difference vectors which participate in the mutation (b = 1 or b = 2).
• “c” denotes the applied crossover scheme, binary (c = “bin”) or exponential (c = “exp”).
The classical DE (9.11.5) with notation DE/rand/1/bin and the scheme (9.11.7) with notation DE/best/2/exp and exponential crossover are the most often used in practice due to their good performance [124, 166]. The following are some DE variants.
1. Neighborhood search differential evolution (NSDE): Neighborhood search has been shown to play a crucial role in improving the performance of evolutionary programming (EP) algorithms [165]. The neighborhood search differential evolution (NSDE) developed in [162] is the same as the above classical DE except for mutation (9.11.2)



replaced by

v_i^{G+1} = p_{r1}^G + { d_i^G · N(0.5, 0.5), if rand(0, 1) < 0.5;  δ · d_i^G, otherwise. }    (9.11.10)

Here d_i^G = p_{r2}^G − p_{r3}^G, N(0.5, 0.5) denotes a Gaussian random number with mean 0.5 and standard deviation 0.5, and δ denotes a Cauchy random variable with scale parameter 1.
2. Self-adaptive differential evolution (SaDE): As compared with the classical DE of Storn and Price [147], the SaDE of Qin and Suganthan [125] has the following main differences: (a) SaDE adopts two different mutation strategies in a single DE variant and introduces a probability p to control which mutation strategy to use, where p is gradually self-adapted according to the learning experience. (b) SaDE utilizes two methods to adapt and self-adapt DE's parameters F and CR. The detailed operations of SaDE are summarized as follows:

• Mutation strategies self-adaptation: SaDE selects mutation strategies (9.11.5) and (9.11.7) as candidates to create the mutant individual:

v_i^{G+1} = { generated by Eq. (9.11.5), if rand_i(0, 1) < p;  generated by Eq. (9.11.7), otherwise. }    (9.11.11)

Here, the probability p is updated as

p = ns1 · (ns2 + nf2) / (ns2 · (ns1 + nf1) + ns1 · (ns2 + nf2)),    (9.11.12)

where ns1 and ns2 are, respectively, the numbers of offspring generated by Eqs. (9.11.5) and (9.11.7) that successfully enter the next generation (counted after evaluation of all offspring), while nf1 and nf2 are the numbers of offspring generated by Eqs. (9.11.5) and (9.11.7), respectively, that are discarded.
• Scale factor F setting:

Fi = Ni(0.5, 0.3),    (9.11.13)

where Ni(0.5, 0.3) denotes a Gaussian random number with mean 0.5 and standard deviation 0.3.
• Crossover rate CR self-adaptation: SaDE allocates a CRi to each individual:

CRi = Ni(CRm, 0.1),    (9.11.14)



with the updating formula

CRm = (1/|CRrec|) ∑_{k=1}^{|CRrec|} CRrec(k),    (9.11.15)

where CRm is set to 0.5 initially, and CRrec records the CR values associated with offspring that successfully enter the next generation. CRrec is reset once CRm is updated. This self-adaptation scheme for CR is denoted as SaCR (a minimal sketch of these parameter updates is given after this list).
3. Self-adaptive NSDE (SaNSDE): This is an algorithm incorporating self-adaptive mechanisms into NSDE; see [166] for details.
Opposition-based differential evolution, an important extension of DE, will be presented in Sect. 9.15.
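The SaDE parameter updates can be sketched as follows (illustrative Python, under the assumption that the success/failure counts and the CR record are maintained by the caller across generations):

    import numpy as np

    def sade_strategy_probability(ns1, nf1, ns2, nf2):
        """Update of the strategy-selection probability p, Eq. (9.11.12)."""
        return ns1 * (ns2 + nf2) / (ns2 * (ns1 + nf1) + ns1 * (ns2 + nf2))

    def sade_parameters(CRm, CR_record, NP, rng=np.random.default_rng()):
        """Sample per-individual F and CR values and refresh CRm,
        Eqs. (9.11.13)-(9.11.15)."""
        F = rng.normal(0.5, 0.3, NP)            # Eq. (9.11.13)
        CR = rng.normal(CRm, 0.1, NP)           # Eq. (9.11.14)
        if len(CR_record) > 0:                  # Eq. (9.11.15)
            CRm = float(np.mean(CR_record))
            CR_record.clear()                   # reset record once CRm is updated
        return F, CR, CRm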

9.12 Ant Colony Optimization

In addition to evolutionary algorithms, another group of nature-inspired metaheuristic algorithms is based on swarm intelligence in biological societies. Swarm intelligence refers to a kind of problem-solving ability that emerges through the interactions of simple information-processing units. Swarm intelligence includes two important concepts [85]:
1. The concept of a swarm implies multiplicity, stochasticity, randomness, and messiness.
2. The concept of intelligence suggests that the problem-solving method is somehow successful.
The information-processing units that compose a swarm can be animate (such as ants, bees, fishes, etc.), mechanical, computational, or mathematical. Well-known swarm intelligence algorithms include ant colony algorithms and artificial bee colony algorithms. In this section, we focus on ant colony algorithms. The original optimization idea behind ant algorithms was inspired by observations of the foraging behavior of ants in the wild and, moreover, by the phenomenon called stigmergy [111]. Stigmergy means the indirect communication in a self-organizing emergent system via individuals modifying their local environment [59]. The stigmergic nature of ant colonies was explored by Deneubourg et al. in 1990 [27]: ants communicate indirectly by laying down pheromone trails, which other ants then tend to follow. Depending on the criteria being considered, modern heuristic algorithms for solving combinatorial and numeric optimization problems can be classified into the following four main groups [82]:



• Population-based algorithm: An algorithm working with a set of solutions and trying to improve them is called a population-based algorithm.
• Iterative-based algorithm: An algorithm using multiple iterations to approach the solution sought is called an iterative-based algorithm.
• Stochastic algorithm: An algorithm employing a probabilistic rule for improving a solution is called a probabilistic or stochastic algorithm.
• Deterministic algorithm: An algorithm employing a deterministic rule for improving a solution is known as a deterministic algorithm.
The most popular population-based algorithms are evolutionary algorithms, and the most typical stochastic algorithms are swarm intelligence based algorithms. Both evolutionary algorithms and swarm intelligence based algorithms depend on the natural phenomenon simulated by the algorithm. As the most popular evolutionary algorithm, the genetic algorithm attempts to simulate the phenomenon of natural evolution. Two important swarm optimization algorithms are the ant colony optimization (ACO) algorithms and the artificial bee colony (ABC) algorithms. An ant colony can be thought of as a swarm whose individual agents are ants. Similarly, an artificial bee colony can be thought of as a swarm whose individual agents are bees. Recently, ABC algorithms have been applied to many difficult optimization problems in different fields such as constrained optimization, machine learning, and bioinformatics. This section discusses ACO algorithms, and the next section discusses ABC algorithms.

9.12.1 Real Ants and Artificial Ants

In nature, the behavioral rules of ants are very simple, and their communication rules via "pheromones" are also very simple. However, a swarm of ants can solve very complex problems, such as finding the shortest path between their nest and a food source. If finding the shortest path is looked upon as an optimization problem, then each path between the starting point (the nest) and the terminal (the food source) can be viewed as a feasible solution. Each real ant in the ant system has the following characteristics [111]:
• An ant decides which food source to go to by using a transition rule that is a function of the distance to the source and the amount of pheromone present along the connecting path.
• Transitions to already visited food sources are added to a tabu list and are not allowed.
• Once a tour is complete, the ant lays a pheromone trail along each path visited on the tour.



Beckers, Deneubourg, and Goss [9] demonstrated experimentally that ants are able to find the shortest path. Artificial ants of the colony have the following properties [21, 31]:
• An ant searches for minimum cost feasible solutions Ĵψ = minψ J(L, t).
• An ant k has an internal memory Mk which is used to store the path information followed by the ant k so far (i.e., the previously visited states). Memory can be used to build feasible solutions, to evaluate the solution found, and to retrace the path backward.
• An ant k in state Sr,i can move from node i to any node j in its feasible neighborhood Nik.
• An ant k can be assigned a start state Ssk and one or more termination conditions ek. Usually, the start state is expressed as a unit length sequence, that is, a single component.
• Starting from an initial state Sinitial, each ant tries to build a feasible solution to the given problem in an incremental way, moving to feasible neighbor states in an iterative fashion through its search space/environment. The construction procedure stops when, for at least one ant k, at least one of the termination conditions ek is satisfied.
• The guidance factors involved in an ant's movement take the form of a transition rule which is applied before every move from state Si to state Sj. The transition rule may also include additional problem-specific constraints and may utilize the ant's internal memory.
• The ant's probabilistic decision rule is a function of (a) the values stored in a node-local data structure Ai = [aij], called the ant-routing table, obtained by a functional composition of the locally available pheromone trails and heuristic values, (b) the ant's private memory storing its past history, and (c) the problem constraints.
• When moving from node i to neighbor node j, the ant can update the pheromone trail τij on the arc (i, j). This is called online step-by-step pheromone update.
• Once it has built a solution, an ant can retrace the same path backward and update the pheromone trails on the traversed arcs. This is called online delayed pheromone update.
• The amount of pheromone each ant deposits is governed by a problem-specific pheromone update rule.
• Ants may deposit pheromone associated with states, or alternatively, with state transitions.
• Once it has built a solution, and, if applicable, after it has retraced the path back to the source node, the ant dies, freeing all the allocated resources.
Artificial ants have several characteristics similar to real ants, namely [122]:
(a) Artificial ants have a probabilistic preference for paths with a larger amount of pheromone.
(b) Shorter paths tend to have larger rates of growth in their amount of pheromone.
(c) Artificial ants use an indirect communication system based on the amount of pheromone deposited on each path.



By mimicking bionic swarm intelligence that simulates the behavior of ants and their ways of communicating while searching for food, ACO algorithms, as heuristic positive-feedback search algorithms, are especially suitable for solving optimization problems comprising two parts: a feasible solution space and an objective function. The key component of an ACO algorithm is a parameterized probabilistic model, which is called the pheromone model [30]. For a constrained combinatorial optimization problem, the central component of an ACO algorithm is also the pheromone model, which is used to probabilistically sample the search space S.

Definition 9.37 (Pheromone Model [30]) A pheromone model P = (S, Ω, f) of a combinatorial optimization problem consists of two parts: (1) a search (or solution) space S defined over a finite set of discrete decision variables and a set Ω of constraints among the variables and (2) an objective function f : S → R⁺ to be minimized, which assigns a positive cost value to each solution s ∈ S.

The problem representation of a combinatorial optimization problem which is exploited by the ants can be characterized as follows [33].
• A finite set C = {c1, . . . , cNc} of components is given.
• The states of the problem are defined in terms of sequences x = {ci, cj, . . . , ck, . . .} over the elements of C. The set of all possible sequences is denoted by X. The length of a sequence x, that is, the number of components in the sequence, is expressed by |x|. The maximum length of a sequence is bounded by a positive constant n < +∞.
• The set of (candidate) solutions S is a subset of X (i.e., S ⊆ X).
• The finite set Ω of constraints defines the set of feasible states X̄, with X̄ ⊆ X.
• A non-empty set S* of feasible solutions is given, with S* ⊆ X̄ and S* ⊆ S.
• A cost f(s, t) is associated with each candidate solution s ∈ S.
• In some cases a cost, or the estimate of a cost, J(xi, t), can be associated with states other than solutions. If xi can be obtained by adding solution components to a state xj, then J(xi, t) ≤ J(xj, t). Note that J(s, t) ≡ J(t, s).
In combinatorial optimization a system may occur in many different configurations, and any configuration has an associated cost value. Similar to the simulated annealing of solids, one can statistically model the evolution of the system to be optimized into a state that corresponds to the minimum cost value. In order to probabilistically construct solutions, ACO algorithms for combinatorial optimization problems use a pheromone model that consists of a set of pheromone values, i.e., a function of the search experience of the algorithm. The pheromone model is used to bias the solution construction toward regions of the search space containing high-quality solutions.



Definition 9.38 (Mixed-Variable Optimization Problem [95]) A pheromone model R = (S, Ω, f) of a mixed-variable optimization problem (MVOP) consists of the following parts.
• A search space S defined over a finite set of both discrete and continuous decision variables and a set Ω of constraints among the variables;
• An objective function f : S → R₀⁺ to be minimized.
The search space S is defined by a set of n = d + r variables xi, i = 1, . . . , n, of which d is the number of discrete decision variables and r is the number of continuous decision variables. A solution s ∈ S is a complete value assignment, that is, each decision variable is assigned a value. A feasible solution is a solution that satisfies all constraints in the set Ω. A global optimum s* ∈ S is a feasible solution satisfying f(s*) ≤ f(s), ∀ s ∈ S. The set of all globally optimal solutions is denoted by S* ⊆ S. Solving an MVOP requires finding at least one s* ∈ S*.

9.12.2 Typical Ant Colony Optimization Problems

Ant colony optimization is so called because of its original inspiration: some ant species forage between their nests and food sources by collectively exploiting the pheromone they deposit on the ground while walking [29]. Similar to real ants, artificial ants in ant colony optimization deposit artificial pheromone on the graph of the problem they are solving. In ant colony optimization, each individual of the population is an artificial ant that builds a solution to the considered problem incrementally and stochastically. At each step the moves of the ants define which solution components are added to the solution under construction. A probabilistic model is associated with the graph G(C, L) and is used to bias the agents' choices. The probabilistic model is updated online by the agents in order to increase the probability that future agents will build good solutions. In order to build good solutions, there are three main construction functions (ACO construction functions for short) [111]:
• Ant solution construction: In the solution construction process, artificial ants move through adjacent states of a problem according to a transition rule, iteratively building solutions.
• Pheromone update: This performs pheromone trail updates and may involve updating the pheromone trails once complete solutions have been built, or updating after each iteration.
• Daemon actions: This is an optional step in ant colony optimization and involves applying additional updates from a global perspective (there exists no natural counterpart). An example could be applying additional pheromone reinforcement to the best solution generated (known as offline pheromone trail update).



The following are three typical optimization problems whose solutions are amenable to ant colony optimization.
1. Job Scheduling Problems (JSP) [105] have played a vital role in recent years due to the growing consumer demand for variety, reduced product life cycles, changing markets with global competition, and rapid development of new technologies. The job shop scheduling problem (JSSP) is one of the most popular scheduling models existing in practice and is among the hardest combinatorial optimization problems. The job scheduling problem consists of the following components:
   • a number of independent (user/application) jobs to be the elements in C,
   • a number of heterogeneous machine candidates to participate in the planning,
   • the workload of each job (in millions of instructions),
   • the computing capacity of each machine,
   • the ready time, which indicates when a machine will have finished the previously assigned jobs, and
   • the expected time to compute (ETC) matrix ("nb" jobs × "nb" machines), in which ETC[i][j] is the expected execution time of job "i" on machine "j."

2. Data mining [122]: In an ant colony optimization algorithm, each ant incrementally constructs/modifies a solution for the target problem.
   • In the context of the classification task of data mining, the target problem is the discovery of classification rules, which are often expressed in the form of IF-THEN rules as follows: IF < conditions > THEN < class >. The rule antecedent (IF part) contains a set of conditions, usually connected by a logical conjunction operator (AND). The rule consequent (THEN part) specifies the class predicted for cases whose predictor attributes satisfy all the terms specified in the rule antecedent. As a promising research area, ACO algorithms involve simple agents (ants) that cooperate with one another to achieve an emergent unified behavior for the system as a whole, producing a robust system capable of finding high-quality solutions for problems with a large search space.
   • In the context of rule discovery, an ACO algorithm is able to perform a flexible, robust search for a good combination of terms (logical conditions) involving values of the predictor attributes.
3. Network coding resource minimization (NCRM) problem [156]: This is a resource optimization problem emerging from the field of network coding. Although all nodes that have the potential to perform coding would perform coding by default, only a subset of coding-possible nodes suffices to realize network coding-based multicast (NCM) with an expected data rate. Hence, it is worthwhile



to study the problem of minimizing coding operations within NCM. EAs are the mainstream solutions for NCRM in the field of computational intelligence, but EAs are not good at integrating local information of the search space or domain knowledge of the problem, which can seriously deteriorate their optimization performance. Different from EAs, ACO algorithms are a class of reactive search optimization methods adopting the principle of "learning while optimizing." ACO may therefore be a good candidate for solving the NCRM problem.

9.12.3 Ant System and Ant Colony System

As the first ant algorithm, the ant system was developed by Dorigo et al. in 1996 [34] and applied to the well-known benchmark traveling salesman problem. Let τij be the pheromone trail between components (nodes) i and j that determines the path taken by the ants, and ηij = 1/dij be the heuristic values, where the τij constitute the pheromone matrix T = [τij]. In other words, the shorter the distance dij between two cities i and j, the higher the heuristic value ηij that constitutes the heuristic matrix. The values from multiple pheromone (or heuristic) matrices need to be aggregated into a single pheromone (or heuristic) value. The following are three alternatives for the aggregation [97]:
1. Weighted sum: for example,

τij = (1 − λ) τij¹ + λ τij²  and  ηij = (1 − λ) ηij¹ + λ ηij².    (9.12.1)

2. Weighted product: for example,

τij = (τij¹)^(1−λ) · (τij²)^λ  and  ηij = (ηij¹)^(1−λ) · (ηij²)^λ.    (9.12.2)

3. Random: at each construction step, given a uniform random number U(0, 1), an ant selects the first of the two matrices if U(0, 1) < 1 − λ; otherwise, it selects the other matrix.

The ants exist in an environment represented mathematically as a construction graph G(C, L):
1. C = {c1, . . . , cNc} denotes a set of components, where ci is the ith city and Nc is the number of all cities in the tour.
2. L is a set of connections fully connecting the components.
By Dorigo et al. [34] and Neto and Filho [115], the ant model has the following characteristics:
• The ant always occupies a node in a graph which represents a search space. This node is called nf.



• It has an initial state.
• Although it cannot sense the whole graph, it can collect two kinds of information about the neighborhood: (a) the weight of each trail linked to nf and (b) the characteristics of the pheromone deposited on this trail by other ants of the same colony.
• It moves along a trail Cij that connects nodes i and j of the graph.
• Also, it can alter the pheromones of the trail Cij, in an operation called "deposit of pheromone levels."
• It can sense the pheromone levels of all Cij trails that connect to a node i.
• It can determine a set of "prohibited" trails.
• It exhibits a pseudo-random behavior enabling the choice among the various possible trails.
• This choice can be (and usually is) influenced by the level of pheromone.
• It can move from node i to node j.
An ant system utilizes the graph representation G(C, L): for each cost measure δ(r, s) (the distance for moving from city r to city s), each edge (r, s) also has a desirability measure (called pheromone) τrs = τ(r, s), which is updated at run time by the ants. An ant system works according to the following rules [32]:
1. State transition rule: Each ant generates a complete tour by choosing the cities according to a probabilistic state transition rule; ants prefer to move to cities which are connected by short edges with a high amount of pheromone.
2. Global (pheromone) updating rule: Once all ants have completed their tours, a global updating rule is used; a fraction of the pheromone evaporates on all edges, and then each ant deposits an amount of pheromone on the edges which belong to its tour, in proportion to how short its tour was. The process is then iterated.
3. Random-proportional rule: This gives the probability with which the kth ant in city r chooses to move to city s:

pk(r, s) = { [τ(r, s)] · [η(r, s)]^β / ∑_{u∈Jk(r)} [τ(r, u)] · [η(r, u)]^β,  if s ∈ Jk(r);  0, otherwise. }    (9.12.3)

Here τ is the pheromone, η = 1/δ is the inverse of the distance δ(r, s), Jk(r) is the set of cities that remain to be visited by ant k positioned on city r (to make the solution feasible), and β > 0 is a parameter determining the relative importance of pheromone versus distance. In the rule (9.12.3), the ants favor the choice of edges which are shorter and have a greater amount of pheromone.
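As an illustration, the following Python sketch builds one ant's tour for a symmetric TSP using the random-proportional rule (9.12.3) (names and parameter values are illustrative assumptions; the completed tours would then feed the pheromone update rules described next):

    import numpy as np

    def construct_tour(dist, tau, beta=2.0, rng=np.random.default_rng()):
        """Build one ant's tour with the random-proportional rule (9.12.3).
        dist and tau are n x n matrices of distances and pheromone levels."""
        n = len(dist)
        eta = 1.0 / (dist + np.eye(n))              # heuristic values eta = 1/delta
        tour = [int(rng.integers(n))]               # random start city
        while len(tour) < n:
            r = tour[-1]
            J = [s for s in range(n) if s not in tour]   # cities still to visit
            w = np.array([tau[r, s] * eta[r, s] ** beta for s in J])
            s = J[int(rng.choice(len(J), p=w / w.sum()))]
            tour.append(s)
        length = sum(dist[tour[k], tour[(k + 1) % n]] for k in range(n))
        return tour, length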



In an ant system, once all ants have built their tours, pheromone is updated on all edges according to the following global updating rule:

τ(r, s) ← (1 − α) · τ(r, s) + ∑_{k=1}^{m} Δτk(r, s),    (9.12.4)

where

Δτk(r, s) = { 1/Lk, if (r, s) ∈ tour done by ant k;  0, otherwise. }    (9.12.5)

Here 0 < α < 1 is a pheromone decay parameter, Lk is the length of the tour performed by ant k, and m is the number of ants.
The ant colony system (ACS) differs from the ant system in three main aspects [32]:
• The state transition rule provides a direct way to balance between exploration of new edges and exploitation of a priori and accumulated knowledge about the problem.
• The global updating rule is applied only to edges which belong to the best ant tour.
• A local (pheromone) updating rule is applied while ants construct a solution.
Different from the ant system, the ant colony system works according to the following rules [32]:
1. ACS state transition rule: In ACS, an ant positioned on node r chooses the city s to move to by applying the ACS state transition rule given by

s = { arg max_{u∈Jk(r)} [τ(r, u)] · [η(r, u)]^β, if q ≤ q0 (exploitation);  S, otherwise. }    (9.12.6)

Here q is a random number uniformly distributed in [0, 1], 0 ≤ q0 ≤ 1 is a parameter, and S is a random variable selected according to the probability distribution given in (9.12.3). The state transition rule given by (9.12.6) and (9.12.3) is known as the pseudo-random-proportional rule. This rule favors transitions toward nodes connected by short edges and with a large amount of pheromone.
2. ACS global updating rule: In ACS only the globally best ant, i.e., the ant which constructed the shortest tour from the beginning of the trial, is allowed to deposit pheromone. After all ants have completed their tours, the pheromone level is updated by the following ACS global updating rule:

τ(r, s) ← (1 − α) · τ(r, s) + α · Δτ(r, s),    (9.12.7)


where

Δτ(r, s) = (Lgb)^−1, if (r, s) ∈ global-best-tour;  Δτ(r, s) = 0, otherwise.     (9.12.8)

Here 0 < α < 1 is the pheromone decay parameter, and Lgb is the length of the globally best tour from the beginning of the trial.
3. ACS local updating rule: While building a solution (i.e., a tour) of the traveling salesman problem (TSP), ants visit edges and change their pheromone level by applying the following local updating rule:

τ(r, s) ← (1 − ρ) · τ(r, s) + ρ · Δτ(r, s),     (9.12.9)

where 0 < ρ < 1 is a parameter.
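To make the three ACS rules concrete, the following is a minimal sketch in Python for a symmetric TSP. All parameter values, the choice Δτ(r, s) = τ0 in the local rule, and helper names such as ant_colony_system are illustrative assumptions rather than the book's reference implementation.

```python
# A minimal sketch of the ACS rules (9.12.3)-(9.12.9) for a symmetric TSP.
import random

def ant_colony_system(dist, n_ants=10, n_iters=100,
                      alpha=0.1, rho=0.1, beta=2.0, q0=0.9, tau0=1e-3):
    n = len(dist)
    eta = [[0.0 if i == j else 1.0 / dist[i][j] for j in range(n)] for i in range(n)]
    tau = [[tau0] * n for _ in range(n)]
    best_tour, best_len = None, float("inf")

    def tour_length(tour):
        return sum(dist[tour[k]][tour[(k + 1) % n]] for k in range(n))

    def next_city(r, allowed):
        # Pseudo-random-proportional rule (9.12.6): exploit with probability q0,
        # otherwise sample from the random-proportional rule (9.12.3).
        scores = {s: tau[r][s] * (eta[r][s] ** beta) for s in allowed}
        if random.random() <= q0:
            return max(scores, key=scores.get)
        total = sum(scores.values())
        x, acc = random.random() * total, 0.0
        for s, w in scores.items():
            acc += w
            if acc >= x:
                return s
        return s

    for _ in range(n_iters):
        for _ in range(n_ants):
            start = random.randrange(n)
            tour, allowed = [start], set(range(n)) - {start}
            while allowed:
                r = tour[-1]
                s = next_city(r, allowed)
                tour.append(s)
                allowed.remove(s)
                # Local updating rule (9.12.9); using delta_tau = tau0 is a common
                # choice assumed here, since the text leaves it unspecified.
                tau[r][s] = (1 - rho) * tau[r][s] + rho * tau0
                tau[s][r] = tau[r][s]
            length = tour_length(tour)
            if length < best_len:
                best_tour, best_len = tour, length
        # Global updating rule (9.12.7)-(9.12.8): only edges of the globally best tour.
        for k in range(n):
            r, s = best_tour[k], best_tour[(k + 1) % n]
            tau[r][s] = (1 - alpha) * tau[r][s] + alpha / best_len
            tau[s][r] = tau[r][s]
    return best_tour, best_len
```

Called with a distance matrix dist, the sketch returns the best tour found and its length.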

9.13 Multiobjective Artificial Bee Colony Algorithms

The artificial bee colony (ABC) algorithm, proposed by Karaboga in 2005 [81], is inspired by the intelligent foraging behavior of the honey bee swarm. Numerical comparisons show that the performance of ABC is competitive with that of other population-based algorithms, with the advantage of employing fewer control parameters [82–84]. The main advantages of ABC are its simplicity and ease of implementation, but ABC also suffers from the poor convergence behavior commonly observed in evolutionary algorithms.

9.13.1 Artificial Bee Colony Algorithms

The design of an ant colony optimization (ACO) algorithm implies the specification of the following aspects [11, 99, 122]:
• An appropriate representation of the problem, which allows the ants to incrementally construct/modify solutions through the use of a probabilistic transition rule, based on the amount of pheromone in the trail and on a local problem-dependent heuristic.
• A method to enforce the construction of valid solutions, i.e., solutions that are legal in the real-world situation corresponding to the problem definition.
• A problem-dependent heuristic function η that provides a quality measurement of items that can be added to the current partial solution.
• A rule for pheromone updating, which specifies how to modify the pheromone trail τ.


• A probabilistic transition rule based on the value of the heuristic function η and on the contents of the pheromone trail τ that is used to iteratively construct a solution.
• A clear specification of when the algorithm converges to a solution.

Unlike ant swarms, the collective intelligence of honey bee swarms consists of three essential components: food sources, employed bees, and unemployed (onlooker) bees. In the ABC algorithm, the position of a food source represents a possible solution of the optimization problem, and the nectar amount of a food source corresponds to the quality (fitness) of the associated solution. The artificial bee colony is divided into three groups of bees: employed bees, onlooker bees, and scout bees. Half of the colony consists of the employed bees, and the other half consists of the onlookers. Each cycle of the search consists of three steps.
• Employed bees are responsible for exploring the nectar sources and sharing their food information with onlookers within the hive.
• The onlooker bees select good food sources from those found by the employed bees and search further around them.
• If the quality of some food source is not improved through a predetermined number of cycles, then the food source is abandoned by its employed bee, and the employed bee becomes a scout and starts to search for a new food source randomly in the vicinity of the hive.

Each food source xi = [xi,1, . . . , xi,D]T (i = 1, . . . , SN) is a D-dimensional vector and stands for a potential solution of the problem to be optimized. Let P = {x1, . . . , xSN} denote the population of SN food solutions (individuals). The ABC algorithm is iterative. At the beginning, ABC generates the population with randomly generated food solutions; the newly generated population P contains SN individuals xi ∈ RD, where the population size SN equals the number of employed bees (and of onlookers). ABC generates a randomly distributed initial population P of SN solutions (food source positions) x1, . . . , xSN whose elements are

xi,j = xmin,j + U(0, 1)(xmax,j − xmin,j),     (9.13.1)

where i = 1, . . . , SN; j = 1, . . . , D, and D is the number of optimization parameters; xmin,j and xmax,j are the lower and upper bounds for dimension j, respectively; and U(0, 1) is a random number uniformly distributed in (0, 1). After the initialization, the population of solutions (i.e., food sources) undergoes repeated cycles of the search processes of the employed bees, onlooker bees, and scout bees. Each employed bee remembers its previous best position and produces a new position within the neighborhood of the position in its memory. After all employed bees finish their search process, they share the information about nectar amounts and positions of food sources with the onlookers. Each onlooker chooses a food source depending on the probability value pi associated with the food source:

pi = fiti / Σ_{j=1}^{SN} fitj,     (9.13.2)

where fiti represents the fitness value of the ith solution f(xi). If the new food source has equal or better quality than the old source, then the old source is replaced by the new source; otherwise, the old source is retained. ABC uses

vi,j = xi,j + φi,j (xi,j − xl,j),  l ≠ i,     (9.13.3)

to produce a candidate food position vi = [vi1 , . . . , viD ]T from the old solution xi in memory. Here, l ∈ {1, . . . , SN} and j ∈ {1, . . . , D} are randomly chosen indices, and φij is a random number in the range [−1, 1]. Then, ABC searches the area within its neighborhood to generate a new candidate solution. If a position cannot be improved further through a predetermined number of cycles, then that food source is said to be abandoned. The value of the predetermined number of cycles is an important control parameter of ABC, and is called the limit for abandonment. When the nectar of a food source is abandoned, the corresponding employed bee becomes a scout. The scout will randomly generate a new food source to replace the abandoned position. The above framework of artificial bee colony (ABC) algorithm is summarized as Algorithm 9.15.
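The following is a minimal runnable sketch of this ABC cycle built directly from (9.13.1)-(9.13.3). The fitness mapping fit_i = 1/(1 + f(x_i)) for non-negative objectives, and helper names such as abc_minimize, are common conventions assumed here rather than prescribed by the text.

```python
# A minimal sketch of the ABC cycle (employed, onlooker, and scout phases).
import random

def abc_minimize(f, lower, upper, n_sources=20, limit=50, max_cycles=200):
    D = len(lower)

    def new_source():
        # Random initialization (9.13.1).
        return [lower[j] + random.random() * (upper[j] - lower[j]) for j in range(D)]

    def neighbor(i, X):
        # Candidate solution (9.13.3): perturb one coordinate toward a random peer.
        l = random.choice([k for k in range(n_sources) if k != i])
        j = random.randrange(D)
        v = X[i][:]
        v[j] = X[i][j] + random.uniform(-1, 1) * (X[i][j] - X[l][j])
        return v

    X = [new_source() for _ in range(n_sources)]
    fX = [f(x) for x in X]
    trial = [0] * n_sources

    def greedy(i, v):
        # Keep the better of the old and new food source.
        fv = f(v)
        if fv < fX[i]:
            X[i], fX[i] = v, fv
            trial[i] = 0
        else:
            trial[i] += 1

    for _ in range(max_cycles):
        for i in range(n_sources):                         # employed bee phase
            greedy(i, neighbor(i, X))
        # Assumed fitness mapping fit_i = 1 / (1 + f_i); selection probabilities (9.13.2).
        fit = [1.0 / (1.0 + fv) if fv >= 0 else 1.0 + abs(fv) for fv in fX]
        total = sum(fit)
        p = [w / total for w in fit]
        for _ in range(n_sources):                          # onlooker bee phase
            i = random.choices(range(n_sources), weights=p)[0]
            greedy(i, neighbor(i, X))
        worst = max(range(n_sources), key=lambda k: trial[k])   # scout bee phase
        if trial[worst] > limit:
            X[worst] = new_source()
            fX[worst] = f(X[worst])
            trial[worst] = 0
    best = min(range(n_sources), key=lambda k: fX[k])
    return X[best], fX[best]
```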

9.13.2 Variants of ABC Algorithms

The solution search equation of ABC, for generating new candidate solutions based on the information of previous solutions, is good at exploration but poor at exploitation. To achieve good performance on optimization problems, exploration and exploitation should be well balanced. To improve the performance of ABC, several variants of ABC have been proposed.
1. GABC: This is a gbest-guided artificial bee colony algorithm [168]. It incorporates the information of the gbest solution into the solution search equation to improve the exploitation of ABC. The solution search equation in GABC is given by

vij = xij + φij (xij − xkj) + ψij (gj − xij),     (9.13.4)


Algorithm 9.15 Framework of artificial bee colony (ABC) algorithm [49]
1. initialization:
   1.1 Randomly generate SN points in the search space to form an initial population;
   1.2 Evaluate the objective value of the population;
   1.3 FES = SN.
// The employed bee phase:
2. for i = 1, . . . , SN do
3.   Generate a candidate solution Vi using (9.13.3).
4.   Evaluate f(Vi) and set FES = FES + 1.
5.   If f(Vi) < f(Xi) then set Xi = Vi and triali = 1. Otherwise triali = triali + 1.
6.   Calculate the probability pi using (9.13.2), and set t = 0 and i = 1.
// The onlooker bee phase:
7.   while t ≤ SN do
8.     if rand(0, 1) < pi do
9.       Generate a candidate solution Vi using (9.13.3).
10.      Evaluate f(Vi) and set FES = FES + 1.
11.      If f(Vi) < f(Xi) then set Xi = Vi and triali = 1. Otherwise triali = triali + 1.
12.      Set t = t + 1.
13.    end if
14.  end while
15.  Set i = i + 1 and if i = SN then set i = 1.
// The scout bee phase:
16.  If max{triali} > limit then replace Xi with a new randomly generated solution by (9.13.1).
17.  If FES ≥ FESmax then stop and output the best solution achieved so far. Otherwise, go to step 1.
18. end for
19. output: SN solutions (food source positions) x1, . . . , xSN.

where the third term on the right-hand side is a newly added term called the gbest term, gj is the jth element of the gbest solution vector gbest, φij is a random number in the range [−1, 1], and ψij is a uniform random number in [0, 1.5]. As demonstrated by experimental results, GABC can outperform ABC in most experiments.
2. IABC: This is an improved artificial bee colony (IABC) algorithm [48]. In this algorithm, two improved solution search equations are given by

vij = gbest,j + φij (xij − xr1,j),     (9.13.5)
vij = xr1,j + φij (xij − xr2,j),     (9.13.6)

where the indices r1 and r2 are mutually exclusive integers randomly chosen from {1, 2, . . . , SN }, and both of them are different from the base index i; gbest,j is the j th element of the best individual vector with the best fitness in the current population, and j ∈ {1, . . . , D} is a randomly chosen index. Since (9.13.5) generates a candidate solution around gbest , it focuses on the exploitation. On the other hand, as (9.13.6) is based on three vectors xi , xr1 , and xr2 , and the two latter ones are two different individuals randomly selected from the population, it emphasizes the exploration.


3. CABC: The solution search equation given by Gao et al. [49] is

vij = xr1,j + φij (xr1,j − xr2,j),     (9.13.7)

where the indices r1 and r2 are distinct integers uniformly chosen from the range [1, SN] and are also different from i, and φij is a random number in the range [−1, 1]. Since (9.13.7) resembles the crossover operator of GA, the ABC with this search equation is named crossover-like artificial bee colony (CABC) [49].
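For illustration, the gbest-guided (9.13.4) and crossover-like (9.13.7) search equations can be written as drop-in candidate generators compatible with the ABC sketch given earlier; the names X, i, and gbest follow that sketch and are assumptions, not the book's notation.

```python
# Alternative solution search equations from the ABC variants above.
import random

def gabc_candidate(X, i, gbest):            # GABC, Eq. (9.13.4)
    j = random.randrange(len(X[i]))
    k = random.choice([m for m in range(len(X)) if m != i])
    v = X[i][:]
    v[j] = (X[i][j] + random.uniform(-1, 1) * (X[i][j] - X[k][j])
            + random.uniform(0, 1.5) * (gbest[j] - X[i][j]))
    return v

def cabc_candidate(X, i):                   # CABC, Eq. (9.13.7)
    j = random.randrange(len(X[i]))
    r1, r2 = random.sample([m for m in range(len(X)) if m != i], 2)
    v = X[i][:]
    v[j] = X[r1][j] + random.uniform(-1, 1) * (X[r1][j] - X[r2][j])
    return v
```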

9.14 Particle Swarm Optimization

Particle swarm optimization (PSO) is a global optimization technique originally developed by Eberhart and Kennedy [36, 86] in 1995. PSO is essentially an optimization technique based on swarm intelligence; it adopts a population-based stochastic algorithm for finding optimal regions of complex search spaces through the interaction of individuals in a population of particles. Different from evolutionary algorithms, the particle swarm does not use a selection operation; instead, interactions of all population members result in iterative improvement of the quality of problem solutions from the beginning of a trial until the end. PSO is rooted in artificial life and social psychology, as well as in engineering and computer science. It is widely recognized (see, e.g., [19, 37, 85]) that swarm intelligence (SI) is an innovative distributed intelligent paradigm for solving optimization problems, and PSO was originally inspired by the swarming behavior observed in flocks of birds, swarms of bees, or schools of fish, and even human social behavior, from which the idea emerged. As a computational intelligence technique, PSO is not greatly affected by the size and nonlinearity of the optimization problem, and it can converge to the optimal solution in many problems where most analytical methods fail to converge. In PSO, the term "particles" refers to population members with an arbitrarily small mass or volume, which are subject to velocities and accelerations towards a better mode of behavior.

9.14.1 Basic Concepts

PSO draws on two fundamental disciplines, social science and computer science, and is closely based on swarm intelligence concepts and principles [26].
1. Social concepts: It is known [37] that "human intelligence results from social interaction." Evaluation, comparison, and imitation of others, as well as learning from experience, allow humans to adapt to the environment and determine optimal patterns of behavior, attitudes, and so on.


2. Swarm intelligence principles [37, 85, 86]: Swarm intelligence can be described by the following five fundamental principles.
• Proximity principle: the population should be able to carry out simple space and time computations.
• Quality principle: the population should be able to respond to quality factors in the environment.
• Diverse response principle: the population should not commit its activity along excessively narrow channels.
• Stability principle: the population should not change its mode of behavior whenever the environment changes.
• Adaptability principle: the population should be able to change its behavior mode when it is worth the computational price.
3. Computational characteristics [37]: As a useful paradigm for implementing adaptive systems, swarm intelligence computation is an extension of evolutionary computation and includes the softening parameterization of logical operators like AND, OR, and NOT. In particular, PSO is an extension, and a potentially important incarnation, of cellular automata (CA). The particle swarm can be conceptualized as cells in CA, whose states change in many dimensions simultaneously. Both PSO and CA share the following computational attributes.
• Individual particles (cells) are updated in parallel.
• Each new value depends only on the previous value of the particle (cell) and its neighbors.
• All updates are performed according to the same rules.

9.14.2 The Canonical Particle Swarm

Given an unlabeled data set Z = {z1, . . . , zN} representing N patterns zi ∈ RD, i = 1, . . . , N, each having D features, the partitional clustering approach aims at clustering the data set into K groups (K ≤ N) such that

Ck ≠ ∅, ∀ k = 1, . . . , K;  Ck ∩ Cl = ∅, ∀ k, l = 1, . . . , K, k ≠ l;     (9.14.1)
∪_{k=1}^{K} Ck = Z.     (9.14.2)

The clustering operation depends on the similarity between elements present in the data set. If f denotes the fitness function, then the clustering task can be viewed as an optimization problem: optimize_{Ck} f(Zk, Ck), ∀ k = 1, . . . , K. That is to say, the optimization-based clustering task is carried out by single-objective nature-inspired metaheuristic algorithms.


PSO is inspired by social interaction (e.g., the foraging process of a bird flock). The original concept of PSO is to simulate the behavior of flying birds and their means of information exchange to solve problems. In PSO, each single solution is like a "bird." Imagine the following scenario: a group of birds are randomly searching for food in an area in which only one piece of food is assumed to exist, and the birds do not know where the food is. An efficient strategy for finding the food is to follow the bird nearest to the food. In a particle swarm optimizer, instead of using genetic operators, each single solution is regarded as a "bird" (particle or individual) in the search space. These individuals are "evolved" by cooperation and competition among the individuals themselves through generations. Each particle adjusts its flying according to its own flying experience and its companions' flying experience. Each individual or particle represents a potential solution to an optimization problem, and is treated as a point in a D-dimensional space. Define the notations and parameters as follows.
• Np: the swarm (or population) size;
• xi(t): the position vector of the ith particle at time t in the search space;
• vi(t): the velocity vector of the ith particle at time t in the search space;
• pi,best(t) = [pi1(t), . . . , piD(t)]T: the previous best position (the position giving the best fitness value, simply named pbest) of the ith particle (up to time t) in the search space;
• gbest = [g1, . . . , gD]T: the best particle found so far among all the particles in the population, named the global best position (gbest) vector of the swarm;
• U(0, β): a uniform random number generator;
• α: inertia weight or constriction coefficient;
• β: acceleration constant.

Position and velocity of each particle are adjusted, and the fitness function with the new position is evaluated at each time step. A population of particles at time t is initialized with random positions xi(t) = [xi1(t), . . . , xiD(t)]T ∈ RD and velocities vi(t) = [vi1(t), . . . , viD(t)]T ∈ RD for the ith particle. The most common type of implementation defines the particles' behaviors by the following two formulas [19, 85]. First adjust the step size of the particle:

vi(t + 1) ← α vi(t) + U(0, β)(pi,best(t) − xi(t)) + U(0, β)(gbest − xi(t)),     (9.14.3)

and then move the particle by adding the velocity to its previous position:

xi(t + 1) ← xi(t) + vi(t + 1),  i = 1, . . . , Np.     (9.14.4)


According to Clerc and Kennedy [19], though there is variety in the implementations of the particle swarm, the most standard version uses α = 0.7298 and β = ψ/2, where ψ = 2.9922. The personal best position of each particle is updated using

pi,best(t + 1) = pi,best(t), if f(xi(t + 1)) ≥ f(pi,best(t));  pi,best(t + 1) = xi(t + 1), if f(xi(t + 1)) < f(pi,best(t)),     (9.14.5)

for i = 1, . . . , Np, and the global best position gbest found by any particle during all previous steps is defined as

gbest = arg min_{pi,best} f(pi,best(t + 1)),  1 ≤ i ≤ Np.     (9.14.6)

The basic particle swarm optimization algorithm is shown in Algorithm 9.16.

Algorithm 9.16 Particle swarm optimization (PSO) algorithm
1. input: cost function f, number of particles Np, maximum cycles Nc, the dimension D of position and velocity vectors, the inertia weight α, acceleration constant β.
2. initialization:
   2.1 Create particle positions x1(1), . . . , xNp(1) and particle velocities v1(1), . . . , vNp(1).
   2.2 pi,best(1) = xi(1), i = 1, . . . , Np.
   2.3 gbest = xi(1), where i = arg min{f(x1(1)), . . . , f(xNp(1))}.
3. repeat
4.   for t = 1, . . . , Nc do
5.     for i = 1, . . . , Np do
6.       vi(t + 1) ← α vi(t) + U(0, β)(pi,best(t) − xi(t)) + U(0, β)(gbest − xi(t));
7.       xi(t + 1) = xi(t) + vi(t + 1).
8.       pi,best(t + 1) = pi,best(t) if f(xi(t + 1)) ≥ f(pi,best(t)); otherwise pi,best(t + 1) = xi(t + 1).
9.       gbest = arg min_{pj,best} {f(pj,best(t + 1))}, j = 1, . . . , i.
10.    end for
11.  end for
12. until gbest is sufficiently good or the maximum cycle is achieved.
13. output: global best position gbest.
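A minimal runnable sketch of Algorithm 9.16 follows, using the update rules (9.14.3)-(9.14.6) and the constants α and β quoted above; the function name pso_minimize, the default swarm size, and the stopping rule are illustrative assumptions.

```python
# A minimal sketch of the canonical particle swarm (9.14.3)-(9.14.6).
import random

def pso_minimize(f, lower, upper, n_particles=30, n_cycles=200,
                 alpha=0.7298, beta=2.9922 / 2):
    D = len(lower)
    x = [[random.uniform(lower[j], upper[j]) for j in range(D)] for _ in range(n_particles)]
    v = [[0.0] * D for _ in range(n_particles)]
    pbest = [xi[:] for xi in x]
    pbest_val = [f(xi) for xi in x]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(n_cycles):
        for i in range(n_particles):
            for d in range(D):
                # Velocity update (9.14.3) and position update (9.14.4).
                v[i][d] = (alpha * v[i][d]
                           + random.uniform(0, beta) * (pbest[i][d] - x[i][d])
                           + random.uniform(0, beta) * (gbest[d] - x[i][d]))
                x[i][d] += v[i][d]
            fx = f(x[i])
            if fx < pbest_val[i]:                 # pbest update (9.14.5)
                pbest[i], pbest_val[i] = x[i][:], fx
                if fx < gbest_val:                # gbest update (9.14.6)
                    gbest, gbest_val = x[i][:], fx
    return gbest, gbest_val

# Example: minimize the 5-dimensional sphere function.
# best, val = pso_minimize(lambda z: sum(t * t for t in z), [-5.0] * 5, [5.0] * 5)
```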

9.14.3 Genetic Learning Particle Swarm Optimization

Advantages of PSO over other similar optimization techniques such as GA can be summarized as follows [26]:
• PSO is easier to implement and there are fewer parameters to adjust.


• In PSO, every particle remembers its own previous best value as well as the neighborhood best; therefore, it has a more effective memory capability than GA.
• PSO is more efficient in maintaining the diversity of the swarm [38] (more similar to the ideal social interaction in a community), since all the particles use the information related to the most successful particle in order to improve themselves, whereas in GA the worse solutions are discarded and only the good ones are saved; therefore, in GA the population evolves around a subset of the best individuals.

The following are the two main drawbacks of PSO [123].
1. The swarm may prematurely converge.
• For the global best PSO, particles converge to a single point, which is on the line between the global best and the personal best positions. This point is not guaranteed to be a local optimum [154].
• The fast rate of information flow between particles results in the creation of similar particles, with a loss in diversity that increases the possibility of being trapped in local optima.
2. Stochastic approaches have problem-dependent performance. This dependency usually results from the parameter settings of each algorithm. Increasing the inertia weight will increase the speed of the particles, resulting in more exploration (global search) and less exploitation (local search), whereas decreasing the inertia weight will decrease the speed of the particles, resulting in more exploitation and less exploration. The problem-dependent performance can be addressed via a hybrid mechanism combining different approaches in order to benefit from the advantages of each approach.

The following is a comparison between PSO and GA.
• In general PSO, particles are guided by their previous best positions (pbests) and the global best position (gbest) found by the swarm, so the search is more directional than that of GA. Hence, it is expected that GA possesses a better exploration ability than PSO, whereas the latter facilitates faster convergence.
• By applying the crossover operation of GA, information can be swapped between two particles so that they gain the ability to fly to a new search area. Applying the mutation operator of GA to PSO is expected to increase the diversity of the population and the ability of PSO to avoid local optima.

This hybrid of genetic algorithm and particle swarm optimization is simply called GA-PSO, which combines the advantages of swarm intelligence and the natural selection mechanism of GA, and thus increases the number of highly evaluated agents while decreasing the number of lowly evaluated agents at each iteration step. An alternative to GA-PSO is the genetic learning particle swarm optimization (GL-PSO) in [57], as shown in Algorithm 9.17.


Algorithm 9.17 Genetic learning PSO (GL-PSO) algorithm [57]
1. initialize:
2. for i = 1 to M do
3.   Randomly initialize vi and xi;
4.   Evaluate f(xi);
5.   pi,best = xi.
6. end for
7. Set gbest to the current best position of particles;
8. repeat
9.   for i = 1 to M do
10.    for d = 1 to D do
11.      Randomly select a particle k ∈ {1, . . . , M};
12.      if f(pi,best) < f(pk,best) then
13.        oi,d = rd · pi,d + (1 − rd) · gd;
14.      else
15.        oi,d = pk,d.
16.      end if
17.    end for
18.    for d = 1 to D do
19.      if rand(0, 1) < pm then
20.        oi,d = rand(lbd, ubd).
21.      end if
22.    end for
23.    Evaluate f(oi).
24.    if f(oi) < f(ei) then
25.      ei = oi.
26.    end if
27.    if f(ei) ceases improving for sg generations then
28.      Select ej by 20%M tournament;
29.      ei = ej.
30.    end if
31.    for d = 1 to D do
32.      vi,d = ω · vi,d + c · rd · (ei,d − xi,d);
33.      xi,d = xi,d + vi,d.
34.    end for
35.    Evaluate f(xi);
36.    Update Pi and G.
37.  end for
38. until terminal condition.

GL-PSO hybridizes genetic learning and particle swarm optimization in a highly cohesive way.

9.14.4 Particle Swarm Optimization for Feature Selection

To find the optimal feature subset one needs to enumerate and evaluate all the possible subsets of features in the entire search space, which implies that the search space size is 2^n, where n is the number of original features. Therefore, evaluating the entire set of feature subsets is computationally expensive and also impractical even for a moderate-sized feature set. Although many feature selection algorithms involve heuristic or random search strategies to find the optimal or near-optimal subset of features in order to reduce the computational time, metaheuristic algorithms may clearly be more efficient for feature subset selection.

Consider feature selection in supervised classification. Given the data set S = {(x1, y1), . . . , (xn, yn)}, where xi = [xi1, . . . , xid]T is a multi-dimensional vector sample, d denotes the number of features, n is the number of samples, and yi ∈ I = {1, . . . , d} denotes the label of the sample xi. Let X = {x1, . . . , xn}. The main goal of the supervised learning is to approximate a function f : X → I to predict the class label for a new input vector x. In learning problems, the choice of the optimization method is very important to reduce the dimensionality of the feature space and to improve the convergence speed of the learning model. The PSO method would be preferred because it can handle a large number of features. Potential advantages of PSO for feature selection can be summarized as follows [138]:
• PSO has a powerful exploration ability until the optimal solution is found, because different particles can explore different parts of the solution space.
• PSO is particularly attractive for feature selection since the particle swarm has memory, and knowledge of the solution is retained by all particles as they fly within the problem space.
• The attractiveness of PSO is also due to its computationally inexpensive implementation that still gives decent performance.
• PSO works with a population of potential solutions rather than with a single solution.
• PSO can address binary and discrete data.
• PSO has better performance compared to other feature selection techniques in terms of memory and runtime, and does not need complex mathematical operators.
• PSO is easy to implement with few parameters, is easy to realize, and gives promising results.
• The performance of PSO is almost unaffected by the dimension of the problem.

Although PSO provides a global search strategy for finding a better solution in the feature selection task, it suffers from two shortcomings: premature convergence and weakness in fine-tuning near local optimum points. To overcome these weaknesses, a hybrid feature selection method based on PSO with local search, called HPSO-LS, was developed in [106].


Define the relevant symbols as follows: nf denotes the number of original features in a given data set, ns = |Xs| is the number of selected features, cij is the Pearson correlation coefficient between two features xi and xj, cori is the correlation value for feature i, xi^best is the best previously visited position (up to time t) of the ith particle, xg^best is the global best position of the swarm, xid(t) is the position of the dth dimension of the ith particle, vi(t) is the velocity of the ith particle, and xi(k) and xj(k) denote the values of feature vectors i and j for the kth sample. The pseudocode of the HPSO-LS algorithm is shown in Algorithm 9.18.

Algorithm 9.18 Hybrid particle swarm optimization with local search (HPSO-LS) [106]
1. input: Feature = {f1, . . . , fD}, NCmax: the maximum number of cycles, K ≪ D, NP: number of particles.
2. begin initialize:
3. for i = 1 to NP do
4.   Xij = create random particle ∈ {0, 1} ⇒ xi = [Xi1, . . . , XiD]T
5.   Vij = create random velocity ∼ U[0, 1] ⇒ vi = [Vi1, . . . , ViD]T
6. end for
7. for all features i do
8.   cij = Σ_{k=1}^{m} (xi(k) − x̄i)(xj(k) − x̄j) / ( sqrt(Σ_{k=1}^{m} (xi(k) − x̄i)^2) · sqrt(Σ_{k=1}^{m} (xj(k) − x̄j)^2) )
9.   cori = (1/(nf − 1)) Σ_{j=1, j≠i}^{nf} |cij|
10. end for
11. Similar feature set ← fi if cori ≥ cormid
12. Dissimilar feature set ← fi if cori < cormid
13. end initialize
14. for i = 1 to NCmax do
15.   v̂i(t + 1) = vi(t) + c1 · r1 · (xi^best − xi(t)) + c2 · r2 · (xg^best − xi(t))
16.   for d = 1 to D do
17.     vid(t + 1) = sign(v̂id(t + 1)) · min{|v̂id(t + 1)|, vi,max}
18.     s(vid(t + 1)) = 1/(1 + exp(−vid(t + 1)))
19.     xid(t + 1) = 1 if rand < s(vid(t + 1)); xid(t + 1) = 0 otherwise.
20.   end for
21.   Xs = Similar feature set
22.   Xd = Dissimilar feature set
23.   x′i = xi
24.   Remove all features in Xd that are 0 in particle xi
25.   Remove all features in Xs that are 0 in particle xi
26.   ns = |Xs| and nd = |Xd|
27.   Perform "particle movement" on each position of the particle and replace it
28.   f̄i = fitness(xi)
29.   fi^best = fitness(xi^best) if f̄i > fitness(xi^best); fi^best = fitness(xg^best) if f̄i > fitness(xg^best); fi^best = f̄i otherwise.
30. end for
31. output: feature = {f1, . . . , fk}
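Steps 15-19 above are the standard binary-PSO transfer step: the real-valued velocity is squashed through a sigmoid and used as the probability of setting a feature bit. A minimal sketch of that step is shown below; the helper name binary_position_update and its arguments are assumptions for illustration.

```python
# Sigmoid-based binary position update (cf. steps 15-19 of Algorithm 9.18).
import math
import random

def binary_position_update(x, v, pbest, gbest, c1=2.0, c2=2.0, v_max=4.0):
    """x, pbest, gbest are 0/1 lists; v is the real-valued velocity list."""
    D = len(x)
    new_x, new_v = [0] * D, [0.0] * D
    for d in range(D):
        vd = (v[d]
              + c1 * random.random() * (pbest[d] - x[d])
              + c2 * random.random() * (gbest[d] - x[d]))
        vd = max(-v_max, min(v_max, vd))          # velocity clamping
        s = 1.0 / (1.0 + math.exp(-vd))           # sigmoid transfer function
        new_x[d] = 1 if random.random() < s else 0
        new_v[d] = vd
    return new_x, new_v
```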


9.15 Opposition-Based Evolutionary Computation

It is widely recognized that hybridization with different algorithms is another direction for improving machine intelligence algorithms. Learning, search, and optimization are fundamental tasks in machine intelligence: artificial intelligence algorithms learn from past data or instructions, optimize estimated solutions, and search for an existing solution in large spaces. In many cases, machine learning starts at a random point, such as the weights of a neural network, the initial population of soft computing algorithms, or the action policy of reinforcement agents. If the starting point is close to the optimal solution, then convergence is faster. On the other hand, if it is very far from the optimal solution, such as the opposite location in the worst case, convergence will take much more time or the solution may even be intractable. In these bad cases, looking simultaneously for a better candidate solution in both the current and the opposite direction may help solve the problem quickly and efficiently. Opposition-based evolutionary computation is inspired in part by the observation that opposites permeate everything around us, in some form or another.

9.15.1 Opposition-Based Learning

There are several definitions of the opposition of an existing solution x. Focusing on metaheuristics and optimization, opposition is presented as a relationship between a pair of candidate solutions. In combinatorial and continuous optimization, each candidate solution has its own particularly defined opposite candidate.

Definition 9.39 (Opposite Number [151]) Let x ∈ R be a real number defined on a certain interval: x ∈ [a, b]. The opposite number x̆ is defined as

x̆ = (a + b) − x.     (9.15.1)

The opposite number in the multidimensional case can be defined analogously. Figure 9.4 gives a geometric representation of the opposite number of a real number x.

Fig. 9.4 Geometric representation of the opposite number of a real number x, which mirrors x about the interval midpoint (a + b)/2

Definition 9.40 (Opposite Point [131, 151]) Let x = [x1, . . . , xD]T be a point (vector) in D-dimensional space, where x1, . . . , xD ∈ R and xi ∈ [ai, bi], i = 1, . . . , D. The opposite point x̆ = [x̆1, . . . , x̆D]T is completely defined by its components

x̆i = ai + bi − xi,  i = 1, . . . , D.     (9.15.2)

Theorem 9.9 (Uniqueness [130]) Every point x = [x1, . . . , xD]T ∈ RD of real numbers with xi ∈ [ai, bi] has a unique opposite point x̆ = [x̆1, . . . , x̆D]T defined by x̆i = ai + bi − xi, i = 1, . . . , D.

Definition 9.41 (Opposite Solution) Given a candidate solution x = [x1, . . . , xD]T in a D-dimensional vector space, its opposition solution x̄ = [x̄1, . . . , x̄D]T is defined in element form as

x̄i = (ai + bi) − xi,  i = 1, . . . , D;     (9.15.3)

and the generalized opposition solution x̃g = [x̃1, . . . , x̃D]T is defined as

x̃g = [x̃i]_{i=1}^{D} = [k · (ai + bi) − xi]_{i=1}^{D},     (9.15.4)

where k is a random number in [0, 1], while ai and bi are the minimum and maximum values for the ith element of x.

Center-based sampling (CBS) [129] is a variant of the opposition solution similar to the generalized opposition solution.

Definition 9.42 (Center-Based Sampling [129, 135]) Let x = [x1, . . . , xD]T be a solution in a D-dimensional vector space. An opposite candidate solution x̂c = [x̂1, . . . , x̂D]T is defined by a random point between x and its opposite point x̄ as follows:

x̂i = randi · (ai + bi − 2xi) + xi,  i = 1, . . . , D,     (9.15.5)

where randi is a uniformly distributed random number in [0, 1], and ai and bi are the minimum and maximum values of the ith component of x. The objective of center-based sampling is to obtain opposite candidate solutions closer to the center of the domain of each variable.

Definition 9.43 (Opposition-Based Learning [151]) Let x = [x1, . . . , xD]T and x̄ = [x̄1, . . . , x̄D]T be a solution point and its opposition solution point in D-dimensional vector space, respectively. If the opposition solution has the better fitness, i.e., f(x̄) < f(x), then the original solution x should be replaced with the new candidate solution x̄; otherwise, x continues to be used as a candidate solution. Machine learning based on opposition is called opposition-based learning (OBL). Optimization that uses the candidate solution and its opposition solution simultaneously to search for the better one is referred to as opposition-based optimization (OBO).
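The following is a minimal sketch of these constructions, computing the opposite point (9.15.2)/(9.15.3), the generalized opposition solution (9.15.4), the center-based sample (9.15.5), and the OBL selection of Definition 9.43; the function names are illustrative assumptions.

```python
# Opposite, generalized-opposite, and center-based points for a box [a, b]^D,
# plus the OBL replacement rule (keep whichever of x and its opposite is fitter).
import random

def opposite_point(x, a, b):
    return [a[i] + b[i] - x[i] for i in range(len(x))]            # (9.15.2)/(9.15.3)

def generalized_opposite(x, a, b, k=None):
    k = random.random() if k is None else k                        # k ~ U[0, 1]
    return [k * (a[i] + b[i]) - x[i] for i in range(len(x))]       # (9.15.4)

def center_based_sample(x, a, b):
    return [x[i] + random.random() * (a[i] + b[i] - 2 * x[i])      # (9.15.5)
            for i in range(len(x))]

def obl_select(f, x, a, b):
    """Opposition-based learning: return the fitter of x and its opposite."""
    x_opp = opposite_point(x, a, b)
    return x_opp if f(x_opp) < f(x) else x
```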


The basic concept of OBL was originally introduced by Tizhoosh in 2005 [151]. The OBL has been applied in evolutionary algorithms, differential evolution, particle swarm optimization, harmony search, among others (see, e.g., Survey Papers [135, 161]).

9.15.2 Opposition-Based Differential Evolution

Evolutionary optimization methods start with some initial solutions (initial population) and try to improve them toward some optimal solution(s). In the absence of a priori information about the solution, the initial solutions have to be randomly guessed, and the performance of evolutionary optimization is related to the distance of these initial guesses from the optimal solution. An important fact is [131] that evolutionary optimization can be improved if it starts with a closer (fitter) solution obtained by simultaneously checking the opposite solution. Let p = [p1, . . . , pD]T be a candidate solution in D-dimensional space, and p̆ = [p̆1, . . . , p̆D]T be the opposite solution in D-dimensional space. Assume that f(p) is a fitness function for point p, which is used to measure the candidate's fitness. If f(p̆) ≥ f(p), then point p can be replaced with p̆; otherwise, continue with p. Hence, the point p and its opposite point p̆ are evaluated simultaneously in order to continue with the fitter one.

=

(p1 , . . . , pNp ) randomly, here pi

=

[Pi1 , . . . , PiD • Calculate opposite population by OPij = aj +bj −Pij for i = 1, . . . , Np ; j = 1, . . . , D. Here, Pij and OPij denote j th variable of the ith vector of the population and the opposite-population, respectively. • Select the Np fittest individuals from {P ∪ OP} as initial population. ]T .

2. Opposition-based generation jumping: This part forces the evolutionary process to jump to a new solution candidate which ideally is fitter than the current population. Unlike opposition-based initialization that selects OPij = aj + bj − Pij , opposition-based generation jumping selects the new population from the p p current population as OPij = MINj + MAXj − Pij for i = 1, . . . , Np ; j = p p 1, . . . , D. Here, MINj and MAXj are minimum and maximum values of the j th variable in the current population. In essence, opposition-based differential evolution is an evolution optimization based on both difference vector between two populations or solutions for mutation and opposition-based learning for initialization and generation jumping.


2. Opposition-based generation jumping: This part forces the evolutionary process to jump to a new solution candidate which ideally is fitter than the current population. Unlike opposition-based initialization, which uses OPij = aj + bj − Pij, opposition-based generation jumping selects the new population from the current population as OPij = MINj^p + MAXj^p − Pij for i = 1, . . . , Np; j = 1, . . . , D. Here, MINj^p and MAXj^p are the minimum and maximum values of the jth variable in the current population.

In essence, opposition-based differential evolution is an evolutionary optimization based on both the difference vector between two solutions (for mutation) and opposition-based learning (for initialization and generation jumping). Opposition-based differential evolution is shown in Algorithm 9.19. As compared with the classical differential evolution Algorithm 9.14, the opposition-based differential evolution Algorithm 9.19 adds one pre-processing step (opposition-based population initialization) and one post-processing step (opposition-based generation jumping).

Algorithm 9.19 Opposition-based differential evolution (ODE) [130]
1. input: Population size Np, problem dimension D, mutation constant F, crossover rate Cr.
2. Generate a uniformly distributed random population P0.
// Begin of Opposition-Based Population Initialization
3. for i = 0 to Np do
4.   for j = 0 to D do
5.     OP0i,j ← aj + bj − P0i,j.
6.   end for
7. end for
8. Select the Np fittest individuals from the set {P0, OP0} as the initial population P0.
// End of Opposition-Based Population Initialization
// Begin of DE's Evolution Steps
9. Call Algorithm 9.14 Differential evolution (DE).
// End of DE's Evolution Steps
// Begin of Opposition-Based Generation Jumping
10. if rand(0, 1) < Jr then
11.   for i = 0 to Np do
12.     for j = 0 to D do
13.       OPi,j ← MINj^p + MAXj^p − Pi,j.
14.     end for
15.   end for
16.   Select the Np fittest individuals from the set {P, OP} as the current population P.
17. end if
// End of Opposition-Based Generation Jumping
18. end while
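A minimal sketch of the two OBL parts of ODE follows (opposition-based initialization and generation jumping); the classical DE step itself is omitted, and names such as ode_init and generation_jump are assumptions.

```python
# Opposition-based population initialization and generation jumping (ODE).
import random

def ode_init(f, a, b, n_pop):
    """Start from n_pop random points plus their opposites; keep the fittest n_pop."""
    D = len(a)
    P = [[random.uniform(a[j], b[j]) for j in range(D)] for _ in range(n_pop)]
    OP = [[a[j] + b[j] - p[j] for j in range(D)] for p in P]
    return sorted(P + OP, key=f)[:n_pop]

def generation_jump(f, P, jump_rate=0.3):
    """With probability Jr, reflect each individual within the current population bounds."""
    if random.random() >= jump_rate:
        return P
    D = len(P[0])
    lo = [min(p[j] for p in P) for j in range(D)]   # MIN_j^p
    hi = [max(p[j] for p in P) for j in range(D)]   # MAX_j^p
    OP = [[lo[j] + hi[j] - p[j] for j in range(D)] for p in P]
    return sorted(P + OP, key=f)[:len(P)]
```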

9.15.3 Two Variants of Opposition-Based Learning

In opposition-based learning, an opposite point is deterministically calculated with the reference point at the center of the fixed search range. There are two variants of opposite point selection. The first variant is called quasi-opposition-based learning (quasi-OBL), proposed in [130].

Definition 9.44 (Quasi-Opposite Number [130]) Let x be any real number in [a, b] and x^o = x̆ be its opposite number. The quasi-opposite number of x, denoted by x^q, is defined as

x^q = rand(m, x^o),     (9.15.6)


where m = (a + b)/2 is the middle of the interval [a, b] and rand(m, x^o) is a random number uniformly distributed between m and the opposite point x^o.

Definition 9.45 (Quasi-Opposite Point [130]) Let xi = [xi1, . . . , xiD]T ∈ RD be any vector (point) in D-dimensional real space, and xij ∈ [aj, bj] for i = 1, . . . , Np; j = 1, . . . , D. The quasi-opposite point of xi, denoted by xi^q = [xi1^q, . . . , xiD^q]T, is defined in element form as

xij^q = rand(Mij, xij^o),     (9.15.7)

where i = 1, . . . , Np; j = 1, . . . , D, Mij = (aj + bj)/2 is the middle of the interval [aj, bj], and rand(Mij, xij^o) is a random number uniformly distributed between the middle Mij and the opposite number xij^o = OPij. The calculation of the quasi-opposite point in (9.15.7) depends on the comparison between the opposite point and the original point:

xij^q = Mij + rand(0, 1) · (xij^o − Mij), if xij < Mij;  xij^q = xij^o + rand(0, 1) · (Mij − xij^o), otherwise.     (9.15.8)

Figure 9.5 shows the opposite and quasi-opposite points of a solution (point) x.

Theorem 9.10 ([130]) Given a guess point x ∈ RD, its opposite point x^o ∈ RD and quasi-opposite point x^q ∈ RD, and given the distance from the solution d(·) and probability function Pr(·), then

Pr[d(x^q) < d(x^o)] > 1/2.     (9.15.9)

Fig. 9.5 The opposite point and the quasi-opposite point. Given a solution xi = [xi1, . . . , xiD]T with xij ∈ [aj, bj] and Mij = (aj + bj)/2, xij^o and xij^q are the jth elements of the opposite point xi^o and the quasi-opposite point xi^q of xi, respectively. (a) When xij > Mij. (b) When xij < Mij


The distance in the above theorem can be taken as the Euclidean distance

d(x^o) = ‖x^o − x‖2  and  d(x^q) = ‖x^q − x‖2,     (9.15.10)

or other distances. Theorem 9.10 implies that for a black-box optimization problem (i.e., the solution can appear anywhere over the search space), the quasi-opposite point x^q has a higher chance than the opposite point x^o of being closer to the solution.

Quasi-opposition-based learning consists of quasi-oppositional population initialization and quasi-oppositional generation jumping [130].
1. Quasi-oppositional population initialization: Generate uniformly distributed random individuals pi^0 = [P0i1, . . . , P0iD]T for i = 1, . . . , Np, where P0ij ∈ [aj, bj] for j = 1, . . . , D.
• Compute the opposite individuals pi^o = [OP0i1, . . . , OP0iD]T associated with pi^0 in element form:

OP0ij = aj + bj − P0ij,  i = 1, . . . , Np; j = 1, . . . , D.     (9.15.11)

• Denote the middle point between aj and bj as

Mij = (aj + bj)/2,  i = 1, . . . , Np; j = 1, . . . , D.     (9.15.12)

• Letting the quasi-opposition of pi^0 be pi^q = [QOP0i1, . . . , QOP0iD]T, its elements are given by

QOP0ij = Mij + rand(0, 1) · (OP0ij − Mij), if P0ij < Mij;  QOP0ij = OP0ij + rand(0, 1) · (Mij − OP0ij), otherwise.     (9.15.13)

Here, i = 1, . . . , Np; j = 1, . . . , D.
• Select the Np fittest individuals from the set {pi^o, pi^q} as the initial population p0.
2. Quasi-oppositional generation jumping: Let Jr be the jumping rate, MINj^p the minimum value of the jth variable in the current population, and MAXj^p the maximum value of the jth variable in the current population.
• If rand(0, 1) < Jr then calculate

OPij = MINj^p + MAXj^p − Pij,     (9.15.14)
Mij = (MINj^p + MAXj^p)/2,     (9.15.15)
QOPij = Mij + rand(0, 1) · (OPij − Mij), if Pij < Mij;  QOPij = OPij + rand(0, 1) · (Mij − OPij), otherwise.     (9.15.16)


Here, i = 0, . . . , Np − 1; j = 0, . . . , D − 1.
• Select the Np fittest individuals from the set {Pij, QOPij} as the current population p.

If the oppositional population initialization and the oppositional generation jumping in Algorithm 9.19 are, respectively, replaced by the above quasi-oppositional population initialization and quasi-oppositional generation jumping, then the opposition-based differential evolution (ODE) described in Algorithm 9.19 becomes the quasi-oppositional differential evolution (QODE) in [130].

The second variant of the OBL is quasi-reflection-point-based OBL, which was proposed in [40].

Definition 9.46 (Quasi-Reflected Point [40]) Let xi = [xi1, . . . , xiD]T be any point in D-dimensional real space, and xij ∈ [aj, bj]. Then the quasi-reflected point of xi, denoted by xi^qr = [xi1^qr, . . . , xiD^qr]T, is defined in element form as

xij^qr = rand(Mij, xij),     (9.15.17)

where Mij = (aj + bj)/2, and rand(Mij, xij) is a random number uniformly distributed between Mij and xij. Figure 9.6 shows the opposite, quasi-opposite, and quasi-reflected points of a solution (point) xi. The following relations hold between the quasi-opposite point and the quasi-reflected opposite point.
• The quasi-opposite point is generated by xij^q = rand(Mij, xij^o), whereas the quasi-reflected opposite point is defined by xij^qr = rand(Mij, xij).
• The quasi-reflected opposite point is the mirror image of the quasi-opposite point with respect to the middle point Mij, and hence is named the quasi-reflected opposite point.
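To illustrate the two variants, the following sketch draws the quasi-opposite point of (9.15.8) and the quasi-reflected point of (9.15.17) for a single coordinate; the function names are assumptions.

```python
# Quasi-opposite (9.15.8) and quasi-reflected (9.15.17) sampling for one coordinate
# x in [a, b]; both return a random point between the interval middle M and a reference.
import random

def quasi_opposite(x, a, b):
    m = 0.5 * (a + b)
    x_opp = a + b - x                       # opposite number (9.15.1)
    return random.uniform(min(m, x_opp), max(m, x_opp))

def quasi_reflected(x, a, b):
    m = 0.5 * (a + b)
    return random.uniform(min(m, x), max(m, x))
```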

Fig. 9.6 The opposite, quasi-opposite, and quasi-reflected points of a solution (point) xi. Given a solution xi = [xi1, . . . , xiD]T with xij ∈ [aj, bj] and Mij = (aj + bj)/2, xij^o, xij^q, and xij^qr are the jth elements of the opposite point xi^o, the quasi-opposite point xi^q, and the quasi-reflected opposite point xi^qr of xi, respectively. (a) When xij > Mij. (b) When xij < Mij


After calculating the fitness of the opposite population generated by quasi-reflected opposition, the fittest Np individuals in P and OP constitute the updated population. The application of the quasi-opposite points or the quasi-reflected opposite points leads to various variants and extensions of the basic OBL algorithm. For example, a stochastic opposition-based learning using a Beta distribution in differential evolution can be found in [121].

Brief Summary of This Chapter

• This chapter presents the evolutionary computation tree.
• Evolutionary computation is a subset of artificial intelligence; it is essentially a form of computational intelligence for solving combinatorial optimization problems.
• Evolutionary computation, starting from a group of randomly generated individuals, imitates the biological genetic mode to update the next generation of individuals by replication, crossover, and mutation. Then, according to fitness, survival of the fittest improves the quality of the new generation of individuals; after repeated iterations, the population gradually approaches the optimal solution. From the mathematical point of view, evolutionary computation is essentially a search-based optimization method.
• This chapter focuses on the Pareto optimization theory for multiobjective optimization problems arising in coevolutionary computation.
• Evolutionary computation has been widely used in artificial intelligence, pattern recognition, image processing, biology, electrical engineering, communication, economic management, mechanical engineering, and many other fields.

References

1. Agrawal, R., Imielinski, T., Swami, A.N.: Mining association rules between sets of items in large databases. In: Proceedings of ACM SIGMOD International Conference on the Management of Data, pp. 207–216 (1992) 2. Bäck, T., Hammel, U.: Evolution strategies applied to perturbed objective functions. In: Proceedings of 1st IEEE Conference Evolutionary Computation, vol. 1, pp. 40–45 (1994) 3. Bäck, T., Schwefel H.-P.: An overview of evolutionary algorithms for parameter optimization. Evol. Comput. 1(1), 1–23 (1993) 4. Baker, J.E.: Adaptive selection methods for genetic algorithms. In: Proceedings of the First International Conference on Genetic Algorithms and their Applications, Pittsburgh, pp. 101–111 (1985) 5. Bandyopdhyay, S., Maulik, U.: An evolutionary technique based on K-means algorithm for optimal clustering in RN. Inform. Sci. 146(1–4), 221–237 (2002) 6. Bandyopadhyay, S., Maulik, U., Holder, L.B., Cook, D.J.: Advanced Methods for Knowledge Discovery From Complex Data (Advanced Information and Knowledge Processing). Springer, London (2005)


7. Bandyopadhyay, S., Saha, S., Maulik, U., Deb, K.: A simulated annealing-based multiobjective optimization algorithm: AMOSA. IEEE Trans. Evol. Comput. 12(3), 269–283 (2008) 8. Basu, M.: Dynamic economic emission dispatch using non-dominated sorting genetic algorithm-II. Electr. Power Energy Syst. 30, 140–149 (2008) 9. Beckers, R., Deneubourg, J.L., Goss, S.: Trails and U-turns in the selection of the shortest path by the ant Lasius Niger. J. Theor. Biol. 159, 397–415 (1992) 10. Benson, H.P., Sayin, S.: Towards finding global representations of the efficient set in multiple objective mathematical programming. Naval Res. Logist. 44, 47–67 (1997) 11. Bonabeau, E., Dorigo, M., Theraulaz, G.: Swarm Intelligence: From Natural to Artificial Systems. Oxford University Press, New York (1999) 12. Box, G.E.P.: Evolutionary operation: a method for increasing industrial productivity. Appl. Stat. VI(2), 81–101 (1957) 13. Branke, J.: Multi-objective evolutionary algorithms and MCDA. European Working Group “Multiple Criteria Decision Aiding”, ser. 3, vol. 25, pp. 1–3 (2012) 14. Branke, J., Schmidt, C., Schmeck, H.: Efficient fitness estimation in noisy environments. In: Proceedings of the Genetic and Evolutionary Computation, pp. 243–250 (2001) 15. Bremermann, H.J.: Optimization through evolution and recombination. In: Yovits M.C., et al. (eds.) Self-Organizing Systems. Spartan, Washington (1962) 16. Chang, D.X., Zhang, X.D., Zheng, C.W.: A genetic algorithm with gene rearrangement for K-means clustering. Pattern Recognit. 42, 1210–1222 (2009) 17. Charnes, A., Cooper, W., Niehaus, R., Stredry, A.: Static and dynamic model with multiple objectives and some remarks on organisational design. Manag. Sci. 15B, 365–375 (1969) 18. Chen, G., Low, C.P., Yang, Z.: Preserving and exploiting genetic diversity in evolutionary programming algorithms. IEEE Trans. Evol. Comput. 13(3), 661–673 (2009) 19. Clerc, M., Kennedy, J.: The particle swarm - explosion, stability, and convergence in a multidimensional complex space. IEEE Trans. Evol. Comput. 6(1), 58–73 (2002) 20. Coello Coello, C.A., Lamont, G.B., van Veldhuizen, D.A.: Evolutionary Algorithms for Solving Multi-Objective Problems (Genetic and Evolutionary Computation), 2nd edn. Springer, Berlin (2007) 21. Cordon, O., Herrera, F., Stutzle, T.: A review on the ant colony optimization metaheuristic: basis, models and new trends. Mathware Soft Comput. 9(2–3), 141–175 (2002) 22. Czyzak, P., Jaszkiewicz, A.: Pareto simulated annealing—a metaheuristic technique for multiple-objective combinatorial optimization. J. Multi-Crit. Decis. Anal. 7, 34–47 (1998) 23. Deb, K.: Multi-objective genetic algorithms: problem difficulties and construction of test problems. Evol. Comput. 7(3), 205–230 (1999) 24. Deb, K.: Multi-Objective Optimization Using Evolutionary Algorithms. Wiley, London (2001) 25. Deb, K., Agarwal, S., Pratap, A., Meyarivan, T.: A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 6(2), 182–197 (2002) 26. del Valle, Y., Venayagamoorthy, G.K., Mohagheghi, S., Hernandez, J.-C., Harley, R.G.: Particle swarm optimization: basic concepts, variants and applications in power systems. IEEE Trans. Evol. Comput. 12(2), 171–195 (2008) 27. Deneubourg, J.L., Aron, S., Goss, S., Pasteels, J.M.: The self-organizing exploratory pattern of the argentine ant. J. Insect Behav. 3, 159 (1990) 28. Dorigo, M.: Optimization, learning and natural algorithms, Ph.D.Thesis, Politecnico diMilano (1992) 29. Dorigo, M., Birattari, M.: Ant colony optimization. 
In: Sammut, C., Webb, G.I. (eds.) Encyclopedia of Machine Learning, pp. 37–40. Springer, Berlin (2011) 30. Dorigo, M., Blumb, C.: Ant colony optimization theory: a survey. Theor. Comput. Sci. 344, 243–278 (2005) 31. Dorigo, M., Caro, G.D.: The ant colony optimization meta-heuristic. In: Corne, D., Dorigo, M, Glover, F. (eds.) New Ideas in Optimization, chap. 2. McGraw-Hill, New York (1999) 32. Dorigo, M., Gambardella, L.M.: Ant colony system: a cooperative learning approach to the traveling salesman problem. IEEE Trans. Evol. Comput. 1(1), 53–66 (1997)


33. Dorigo, M., Stützle, T.: The ant colony optimization metaheuristic: algorithms, applications, and advances. In: Glover, F., Kochenberger, G.A. (eds.) Handbook of Metaheuristics, chap. 9. Kluwer Academic, New York (2003) 34. Dorigo, M., Maniezzo, V., Colorni, A.: The ant system: optimization by a colony of cooperating agents. IEEE Trans. Syst. Man Cybern. B 26(2), 29–41 (1996) 35. Droste, S., Jansen, T., Wegener, I.: On the analysis of the (1 + 1) evolutionary algorithms. Theor. Comput. Sci. 276, 51–81 (2002) 36. Eberhart, R., Kennedy, J.: A new optimizer using particle swarm theory. In: Proceedings of the 6th International Symposium on Micro Machine and Human Science (MHS), pp. 39–43 (1995) 37. Eberhart, R., Shi, Y., Kennedy, J.: Swarm Intelligence. Morgan Kaufmann, San Francisco (2001) 38. Engelbrecht, A.P.: Particle swarm optimization: where does it belong? In: Proceedings of the 2003 IEEE Swarm Intelligence Symposium, pp. 48–54 (2006) 39. Engrand, P.: A multi-objective approach based on simulated annealing and its application to nuclear fuel management. In: 5th International Conference on Nuclear Engineering, Nice, pp. 416–423 (1997) 40. Ergezer, M., Simon, D., Du, D.: Oppositional biogeography-based optimization. In: Proceedings of IEEE International Conference on Systems, Man and Cybernetics (SMC), San Antonio, pp. 1009–1014 (2009) 41. Fogel, L.J.: Autonomous automata. Ind. Res. 4, 14–19 (1962) 42. Fogel, D.B.: An introduction to simulated evolutionary optimization. IEEE Trans. Neural Netw. 5, 3–14 (1994) 43. Fogel, D.B.: Evolutionary Computation: Toward a New Philosophy of Machine Intelligence. IEEE Press, Piscataway (1995) 44. Fogel, L.J., Owens, A.J., Walsh, M.J.: Artificial Intelligence Through Simulated Evolution. Wiley, New York (1966) 45. Fonseca, C.M., Fleming, P.J.: An overview of evolutionary algorithms in multiobjective optimization. Evol. Comput. 3(1), 1–16 (1995) 46. Friedberg, R.M.: A learning machine: part I. IBM J. 2(1), 2–13 (1958) 47. Friedberg, R.M., Dunham, B., North, J.H.: A learning machine: part II. IBM J. 3(7), 282–287 (1959) 48. Gao, W.F., Liu, S.Y.: Improved artificial bee colony algorithm for global optimization. Inf. Process. Lett. 111(17), 871–882 (2011) 49. Gao, W.F., Liu, S.Y., Huang, L.L.: A novel artificial bee colony algorithm based on modified search equation and orthogonal learning. IEEE Trans. Cybern. 43(3), 1011–1024 (2013) 50. Geman, A., Geman, D.: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6(6), 721–741 (1984) 51. Gendreau, M., Potvin, J.-Y.: Metaheuristics in combinatorial optimization. Ann. Oper. Res. 140(1), 189–213 (2005) 52. Glover, F.: Tabu search — Part I. ORSA J. Comput. 1, 190–206 (1989) 53. Goh, C.K., Tan, K.C.: An investigation on noisy environments in evolutionary multiobjective optimization. IEEE Trans. Evol. Comput. 11(3), 354–381 (2007) 54. Goldberg, D.E.: Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading (1989) 55. Goldberg, D.E., Holland, J.H.: Genetic algorithms and machine learning. Mach. Learn. 3(2), 95–99 (1988) 56. Goldberg, D.E., Richardson, J.: Genetic algorithms with sharing for multimodal function optimization. In: Proceedings of the Second International Conference on Genetic Algorithms on Genetic Algorithms and Their Application, Cambridge, pp. 41–49 (1987) 57. Gong, Y.-J., Li, J.-J., Zhou, Y., Li, Y.„ Chung, H.S., Shi, Y.-H., Zhang, J.: Genetic learning particle swarm optimization. IEEE Trans. 
Cybern. 46(10), 2277–2290 (2016) 58. Gong, D., Sun, J., Miao, Z.: A set-based genetic algorithm for interval many-objective optimization problems. IEEE Trans. Evol. Comput. 22(1), 47–60 (2018)


59. Grasse, P.P.: La reconstruction du nid et les coordinations interindividuelles chez bellicositermes natalensis et cubitermes sp. la theorie de la stigmergie: Essai dinterpretation du comportement des termites constructeurs. Insectes Sociaux 6, 41–81 (1959) 60. Hajek, B.: Hitting-time and occupation-time bounds implied by drift analysis with applications. Adv. Appl. Probab. 14(3), 502–525 (1982) 61. Hajela, P., Lin, C.-Y.: Genetic search strategies in multicriterion optimal design. Struct. Optim. 4, 99–107 (1992) 62. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco (2000) 63. Hans, A.E.: Multicriteria optimization for highly accurate systems. In: Stadler, W. (ed.) Multicriteria Optimization in Engineering and Sciences, Mathematical concepts and methods in science and engineering, vol. 19, pp. 309–352. Plenum Press, New York (1988) 64. Hansen, M.P., Jaszkiewicz, A.: Evaluating the quality of approximations to the non-dominated set. Technical Report IMM-REP-1998-7, Technical University of Denmark (1998) 65. Hastings, W.K.: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57(1), 97–109 (1970) 66. He, J., Yao, X.: Drift analysis and average time complexity of evolutionary algorithms. Artif. Intell. 127(1), 57–85 (2001) 67. He, J., Yao, X.: Erratum to: drift analysis and average time complexity of evolutionary algorithms. Artif. Intell. 140, 245–248 (2002) 68. Holland, J.H.: Outline for a logical theory of adaptive systems. J. Assoc. Comput. Mach. 3, 297–314 (1962) 69. Holland, J.H.: Adaption in Natural and Artificial Systems. University of Michigan Press, Ann Arbor (1975) 70. Hoorfar, A.: Mutation-based evolutionary algorithms and their applications to optimization of antennas in layered media. In: Proceedings of IEEE Antennas and Propagation Society International Symposium, Orlando, pp. 2876–2879 (1999) 71. Hoorfar, A.: Evolutionary programming in electromagnetic optimization: a review. IEEE Trans. Antennas Propag. 55(3), 523–537 (2007) 72. Hoorfar, A., Liu, Y.: A study of Cauchy and Gaussian mutation operators in evolutionary programming optimization of antenna structures. In: Proceedings of 16th Annual Applied Computational Electromagnetics Conference, Monterey, pp. 63–69 (2000) 73. Horn, J.: Multicriterion decision making. In: Bäck, T., Fogel, D., Michalewicz, Z. (eds.) Handbook of Evolutionary Computation, vol. 1, pp. F1.9:1–F1.9:15. Oxford University Press, Oxford (1997) 74. Horn, J., Nafpliotis, N., Goldberg, D.E.: A niched Pareto genetic algorithm for multiobjective optimization. In: Proceedings of the First IEEE Conference on Evolutionary Computation, IEEE World Congress on Computational Intelligence, vol. 1, pp. 82–87. IEEE Press, Piscataway (1994) 75. Hughes, E.J.: Evolutionary multi-objective ranking with uncertainty and noise. In: Proceedings of first International Conference on Evolutionary Multi-Criterion Optimization, Zürich, pp. 329–343 (2001) 76. Hughes, E.J.: Constraint handling with uncertain and noisy multi-objective evolution. In: Proceedings of 2001 Congress on Evolutionary Computation, vol. 2, pp. 963–970 (2001) 77. Hutter, M., Legg, S.: Fitness uniform optimization. IEEE Trans. Evol. Comput. 10(5), 568– 589 (2006) 78. Hwang, C.-L., Masud, A.S.M.: Multiple Objective Decision Making-Methods and Applications. Springer, Berlin (1979) 79. Ishibuchi, H., Akedo, N., Nojima, Y.: Behavior of multiobjective evolutionary algorithms on many-objective knapsack problems. IEEE Trans. Evol. Comput. 
19(2), 264–283 (2015) 80. Jensen, M.T.: Reducing the run-time complexity of multiobjective EAs: the NSGA-II and other algorithms. IEEE Trans. Evol. Comput. 7(5), 503–515 (2003)


81. Karaboga, D.: An idea based on honey bee swarm for numerical optimization. Erciyes University, Kayseri, Tech. Rep.-TR06 (2005) 82. Karaboga, D., Basturk, B.: A powerful and efficient algorithm for numerical function optimization: artificial bee colony (ABC) algorithm. J. Global Optim. 39, 459–471 (2007) 83. Karaboga, D., Basturk, B.: On the performance of artificial bee colony (ABC) algorithm. Appl. Soft Comput. 8(1), 687–697 (2008) 84. Karaboga, D., Basturk, B.: A comparative study of artificial bee colony algorithm. Appl. Math. Comput. 214(1), 108–132 (2009) 85. Kennedy, J.: Swarm intelligence. In: Zomaya, A.Y. (ed.) Handbook of Nature-Inspired and Innovative Computing, pp. 187–219. Springer, New York (2006) 86. Kennedy, J., Eberhart, R.: Particle swarm optimization. In: Proceedings of IEEE International Conference on Neural Networks (ICNN), vol. IV, pp. 1942–1948 (1995) 87. Kim, J.-H., Han, J.-H., Kim, Y.-H., Choi, S.-H., Kim, E.-S.: Preference-based solution selection algorithm for evolutionary multiobjective optimization. IEEE Trans. Evol. Comput. 16(1), 20–34 (2012) 88. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by simulated annealing. Science 220, 671–680 (1983) 89. Knowles, J.D., Corne, D.W.: On metrics for comparing nondominated sets. In: Proceedings of the Congress on Evolutionary Computation, vol. 1, pp. 711–716 (2002) 90. Kursawe, F.: A variant of evolution strategies for vector optimization. In: Schwefel, H.-P., Manner, R. Parallel Problem Solving from Nature, 193–197. Springer, Berlin (1991) 91. Laarhoven, P.J.M., Aarts, E.H.L.: Simulated Annealing: Theory and Applications. Reidel, Dordrecht (1987) 92. Laumanns, M., Rudolph, G., Schwefel, H.-P.: Mutation control and convergence in evolutionary multi-objective optimization. In: Proceedings of the 7th International Mendel Conference on Soft Computing (MENDEL 2001), Brno (2001) 93. Li, H., Zhang, Q.: Multiobjective optimization problems with complicated Pareto sets, MOEA/D and NSGA-II. IEEE Trans. Evol. Comput. 13(2), 284–302 (2009) 94. Li, Y.-L., Zhou, Y.-R., Zhan, Z.-H., Zhang, J.: A primary theoretical study on decompositionbased multiobjective evolutionary algorithms. IEEE Trans. Evol. Comput. 20(4), 563–576 (2016) 95. Liao, T., Socha, K., Montes, M.A., Stützle, T., Dorigo, M.: Ant colony optimization for mixedvariable optimization problems. 18(4), 503–518 (2014) 96. Limbourg, P., Aponte, D.E.S.: An optimization algorithm for imprecise multi-objective problem function. In: Proceedings of IEEE Congress on Evolutionary Computation, Edinburgh, pp. 459–466 (2005) 97. López-Ioá¯nez, M., Stützle, T.: The automatic design of multiobjective ant colony optimization algorithms. IEEE Trans. Evol. Comput. 16(6), 861–875 (2012) 98. Luo, B., Zheng, J., Xie, J., Wu, J.: Dynamic crowding distance - a new diversity maintenance strategy for MOEAs. In: Fourth International Conference on Natural Computation, pp. 580– 585 (2008) 99. Martens, D., Backer, M.D., Haesen, R., Vanthienen, J., Snoeck, M., Baesens, B.: Classification with ant colony optimization. IEEE Trans. Evol. Comput. 11(5), 651–665 (2007) 100. Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E.: Equations of state calculations by fast computing machines. J. Chem. Phys. 21(6), 1087–1092 (1953) 101. Mezura-Montes, E., Velázquez-Reyes, J., Coello, C.A.C.: A comparative study of differential evolution variants for global optimization. In: Proceedings of the 2006 Conference on Genetic and Evolutionary Computation (GECCO-2006), Seattle, pp. 
485–492 (2006) 102. Michalewicz, Z.: Genetic Algorithms+Data Structures=Evolution Programs. AI Series. Springer, New York (1994) 103. Miettinen, K.: Nonlinear Multiobjective Optimization. Kluwer, Norwell (1999) 104. Mitra, D., Romeo, F., Sangiovanni-Vincentelli, A.: Convergence and finite-time behavior of simulated annealing. Adv. Appl. Probab. 18, 747–771 (1986)

800

9 Evolutionary Computation

105. Mohan, B.C., Baskaran, R.: A survey: ant colony optimization based recent research and implementation on several engineering domain. Exp. Syst. Appl. 39, 4618–4627 (2012) 106. Moradi, P., Gholampour, M.: A hybrid particle swarm optimization for feature subset selection by integrating a novel local search strategy. Appl. Soft Comput. 43, 117–130 (2016) 107. Morse, J.N.: Reducing the size of the nondominated set: pruning by clustering. Comput. Oper. Res. 7(1–2), 55–66 (1980) 108. Moulton, C.M., Roberts, S.A., Calatn, P.H.: Hierarchical clustering of multiobjective optimization results to inform land-use decision making. URISA J. 21(2), 25–38 (2009) 109. Mühlenbein, H., Schlierkamp-Voosen, D.: The science of breeding and its application to the breeder genetic algorithm (BGA). Evol. Comput. 1(4), 335–360 (1994) 110. Mukhopadhyay, A., Maulik, U., Bandyopadhyay, S., Coello Coello, C.A.: A survey of multiobjective evolutionary algorithms for data mining: part I. IEEE Trans. Evol. Comput. 18(1), 4–19 (2014) 111. Mullen, R.J., Monekosso, D., Barman, S., Remagnino, P.: A review of ant algorithms. Exp. Syst. Appl. 36, 9608–9617 (2009) 112. Myung, H., Kim, J.-H.: Hybrid evolutionary programming for heavily constrained problems. BioSystems 38, 29–43 (1996) 113. Myung, H., Kim, J.-H., Fogel, D.B.: Preliminary investigations into a two-stage method of evolutionary optimization on constrained problems. In: McDonnell, J.R., Reynolds, R.G., Fogel, D.B. (eds.) Proceedings of the Fourth Annual Conference Evolutionary Programming, pp. 449–463. MIT Press, Cambridge (1995) 114. Nam, D.K., Park, C.H.: Multiobjective simulated annealing: a comparative study to evolutionary algorithms. Inf. J. Fuzzy Syst. 2(2), 87–97 (2000) 115. Neto, R.F.T., Filho, M.G.: A software model to prototype ant colony optimization algorithms. Exp. Syst. Appl. 38, 249–259 (2011) 116. Nikulin, Y., Miettinen, K., Mäkelä, M.M.: A new achievement scalarizing function based on parameterization in multiobjective optimization. OR Spectr. 34, 69–87 (2012) 117. Oberkampf, W.L., Helton, J.C., Joslyn, C.A., Wojtkiewicz, S.F., Ferson, S.: Challenge problems: uncertainty in system response given uncertain parameters. Reliab. Eng. Syst. Saf. 85, 11–19 (2004) 118. Osman, I.H., Laporte, G.: Metaheuristics: a bibliography. Ann. Oper. Res. 63(5), 511–623 (1996) 119. Palakonda, V., Mallipeddi, R.: Pareto dominance-based algorithms with ranking methods for many-objective optimization. IEEE Access 5, 11043–11053 (2017) 120. Papadimitriou, C.H., Steiglitz, K.: Combinatorial Optimization: Algorithms and Complexity. Dover, New York (1982) 121. Park, S.-Y., Lee, J.-J.: Stochastic opposition-based learning using a Beta distribution in differential evolution. IEEE Trans. Cybern. 46(10), 2184–2194 (2016) 122. Parpinelli, R.S., Lopes, H.S., Freitas, A.A.: Data mining with an ant colony optimization algorithm. IEEE Trans. Evol. Comput. 6(4), 321–332 (2002) 123. Premalatha, K., Natarajan, A.M.: Hybrid PSO and GA for global maximization. Int. J. Open Problems Compt. Math. 2(4), 597–608 (2009) 124. Price, K., Storn, R., Lampinen, J.: Differential Evolution: A Practical Approach to Global Optimization. Springer, Berlin (2005) 125. Qin, A.K., Suganthan, P.N.: Self-adaptive differential evolution algorithm for numerical optimization. In: Proceedings of the 2005 IEEE Congress on Evolutionary Computation, vol. 2, pp. 1785–1791 (2005) 126. Quagliarella, D., Vicini, A.: Coupling genetic algorithms and gradient based optimization techniques. 
In: Quagliarella, D., Periaux, J., Poloni, C., Winter, G. (eds.) Genetic Algorithms and Evolution Strategy in Engineering and Computer Science – Recent Advances and Industrial Applications. Wiley, Chichester (1997) 127. Radcliffe, N., Surry, P.: Fitness variance of formae and performance prediction. In: Foundations of Genetic Algorithms 3, pp. 51–72. Morgan Kaufmann, San Mateo (1995)

References

801

128. Rahnamayan, S.: Opposition-based differential evolution. Thesis for Doctor of Philosophy, University of Waterloo (2007) 129. Rahnamayan, S., Wang, G.G.: Center-based sampling for population-based algorithms. In: 2009 IEEE Congress on Evolutionary Computation, pp. 933–938 (2009) 130. Rahnamayan, S., Tizhoosh, H.R., Salama, M.: Quasi-oppositional differential evolution. In: Proceedings of the IEEE Congress on Evolutionary Computation (CEC), Singapore, pp. 2229–2236 (2007) 131. Rahnamayan, S., Tizhoosh, H.R., Salama, N.M.M.: Opposition-based differential evolution. IEEE Trans. Evol. Comput. 12(1), 64–79 (2008) 132. Rakshit, P., Konar, A.: Differential evolution for noisy multiobjective optimization. Artif. Intell. 227, 165–189 (2015) 133. Rechenberg, I.: Cybernetic solution path of an experimental problem. Royal Aircraft Establishment, Library translation No. 1122, Farnborough, Hants (1965) 134. Revelle, C., Cohon, J.L., Shobys, D.: Multiple objectives in facility location: a review. In: Beckmann, M., Kunzi, A.P. (eds.) Lecture Notes in Economics and Mathematical Systems, vol. 190, pp. 321–337. Springer, Berlin (1981) 135. Rojas-Morales, N., Riff Rojas, M.-C., Ureta, E.M.: A survey and classification of oppositionbased metaheuristics. Comput. Ind. Eng. 110, 424–435 (2017) 136. Rosenthal, R.E.: Principles of multiobjective optimization. Decis. Sci. 16, 133–152 (1985) 137. Ruiz, F., Luque, M., Miguel, F., del Mar Muñoz, M.: An additive achievement scalarizing function for multiobjective programming problems. Eur. J. Oper. Res. 188(3), 683–694 (2008) 138. Sakri, S., Rashid, N.A., Zain, Z.M.: Particle swarm optimization feature selection for breast cancer recurrence prediction. IEEE Access 6, 29637–29647 (2018) 139. Santana, R.A., Pontes, M.R., Bastos-Filho, C.J.A.: A multiple objective particle Swarm optimization approach using crowding distance and roulette wheel. In: Ninth International Conference on Intelligent Systems Design and Applications, pp. 237–242 (2009) 140. Sastry, K., Goldberg, D., Kendall, G.: Genetic algorithms. In: Burke, E.K., Kendall, G. (eds.) Search Methodologies: Introductory Tutorials in Optimization and Decision Support Techniques. Springer, New York (2005) 141. Schaffer, J.D.: Multiple objective optimization with vector evaluated genetic algorithms. In: Proceedings of the First International Conference on Genetic Algorithms (ICGA’85), pp. 93– 100 (1985) 142. Schapire, R.E.: The strength of weak learnability. Mach. Learn. 5, 197–227 (1990) 143. Schwefel, H.P.: Numerical Optimization of Computer Models. Wiley, Hoboken (1981) 144. Slater, M.: Lagrange multipliers (revisited). Cowles Commission Discussion Paper: Mathematics 403 (1950) 145. Smith, K., Everson, R., Fieldsend, J.: Dominance measures for multi-objective simulated annealing. In: Proceedings of the 2004 IEEE Congress on Evolutionary Computation, pp. 23–30 (2004) 146. Srinivas, N., Deb, K.: Multiobjective optimization using nondominated sorting in genetic algorithms. Evol. Comput. 2(3), 221–248 (1995) 147. Storn, R., Price, K.: Differential evolution - a simple and efficient heuristic for global optimization over continuous spaces. J. Global Optim. 11, 341–359 (1997) 148. Tan, P.-N., Kumar, V., Srivastava, J.: Selecting the right interestingness measure for association patterns. In: Proceedings of the 8th ACM SIGKDD International Conference on KDD, pp. 32–41 (2002) 149. 
Tang, K., Li, X., Suganthan, P.N., Yang, Z., Weise, T.: Benchmark functions for the CEC’2010 special session and competition on large-scale global optimization. Tech. Rep. (2009) 150. Teich, J.: Pareto-front exploration with uncertain objectives. In: Zitzler, E. et al. (eds.) Evolutionary Multi-Criterion Optimization (EMO) 2001. Lecture Notes in Computer Science, vol. 1993, pp. 314–328 (2001)

802

9 Evolutionary Computation

151. Tizhoosh, H.R.: Opposition-based learning: a new scheme for machine intelligence. In: Proceedings of the International Conference on Computational Intelligence for Modelling, Control and Automation, and International Conference on Intelligent Agents, Web Technologies and Internet Commerce, 28–30 November, Vienna, vol. 1, pp. 695–701 (2005) 152. Ulungu, E.L., Teghem, J.: Multi-objective combinatorial optimization problems: a survey. J. MultiCrit. Decis. Anal. 3, 83–101 (1994) 153. Ulungu, E.L., Teghem, J., Fortemps, P., Tuyttens, D.: MOSA method: a tool for solving multiobjective combinatorial optimization problems. J. MultiCrit. Decis. Anal. 8, 221–236 (1999) 154. van den Bergh, F., Engelbrecht, A.P.: A cooperative approach to particle swarm optimization. IEEE Trans. Evol. Comput. 8(3), 225–239 (2004) 155. Van Veldhuizen, D.A., Lamont, G.B.: Multiobjective evolutionary algorithms: analyzing the state-of-the-art. Evol. Comput. 8(2), 125–147 (2000) 156. Wang, R., Zhang, Q., Zhang, T.: Decomposition-based algorithms using Pareto adaptive scalarizing methods. IEEE Trans. Evol. Comput. 20(6), 821–837 (2016) 157. Whitley, D.: The GENITOR algorithm and selection pressure: why rank-based allocation of reproductive trials is best. In: Proceedings of the Third International Conference on Genetic Algorithms, San Mateo, pp. 116–123 (1989) 158. Wierzbicki, A.P.: The use of reference objectives in multiobjective optimization. In: Fandel, G., Gal, T. (eds.) Multiple Criteria Decision Making Theory and Applications. MCDM Theory and Applications Proceedings. Lecture Notes in Economics and Mathematical Systems, vol. 177. Springer, Berlin, pp. 468–486 (1980) 159. Wierzbicki, A.P.: A methodological approach to comparing parametric characterizations of efficient solutions. In: Fandel, G. et al. (eds.) Large-Scale Modeling and Interactive Decision Analysis. Lecture Notes in Economics and Mathematical Systems, vol. 273, pp. 27–45. Springer, Berlin (1986) 160. Wierzbicki, A.P.: On the completeness and constructiveness of parametric characterizations to vector optimization problems. OR Spectr. 8, 73–87 (1986) 161. Xu, Q., Wang, L., Wang, N., Hei, X., Zhao, L.: A review of opposition-based learning from 2005 to 2012. Eng. Appl. Artif. Intell. 29, 1–12 (2014) 162. Yang, Z., He, J., Yao, X.: Making a difference to differential evolution. In: Michalewicz, Z., Siarry, P. (eds.) Advances in Metaheuristics for Hard Optimization, pp. 397–414. Springer, Berlin (2008) 163. Yang, L., Guan, Y., Sheng, W.: A novel dynamic crowding distance based diversity maintenance strategy for MOEAs. In: Proceedings of the 2017 International Conference on Machine Learning and Cybernetics, Ningbo, pp. 211–216 (2017) 164. Yang, D., Liu, Z., Shu, T., Yang, L., Ouyang, J., Shen, Z.: An improved genetic algorithm for multiobjective optimization of helical coil electromagnetic launchers. IEEE Trans. Plasma Sci. 46(1), 127–133 (2018) 165. Yao, X., Liu, Y., Lin, G.: Evolutionary programming made faster. IEEE Trans. Evol. Comput. 3(2), 82–102 (1999) 166. Yao, X., Liu, Y., Lin, G.: Self-adaptive differential evolution with neighborhood search. In: Proceedings of the 2008 Congress on Evolutionary Computation (CEC2008), pp. 1110–1116 (2008) 167. Zhang, Q., Li, H.: MOEA/D: a multiobjective evolutionary algorithm based on decomposition. IEEE Trans. Evol. Comput. 11(6), 712–731 (2007) 168. Zhu, G.P., Kwong, S.: Gbest-guided artificial bee colony algorithm for numerical function optimization. Appl. Math. Comput. 217(7), 3166–3173 (2010) 169. 
Zitzler, E., Thiele, L.: Multiobjective optimization using evolutionary algorithms - a comparative case study. In: Eiben, V.A.E. et al. (eds.) Parallel Problem Solving From Nature. Springer, Berlin, 292–301 (1998) 170. Zitzler, E., Thiele, L.: Multiobjective evolutionary algorithms: a comparative case study and the strength Pareto approach. IEEE Trans. Evol. Comput. 3(4), 257–271 (1999)

References

803

171. Zitzler, E., Deb, K., Thiele, L.: Comparison of multiobjective evolutionary algorithms: empirical results. Evol. Comput. 8(2), 173–195 (2000) 172. Zitzler, E., Laumanns, M., Thiele, L.: SPEA2: improving the strength Pareto evolutionary algorithm for multiobjective optimization. In: Proceedings of the Evolutionary Methods for Design, Optimization and Control with Applications to Industrial Problems (EUROGEN), pp. 95–100 (2002) 173. Zitzler, E., Thiele, L., Laumanns, M., Fonseca, C.M., Fonseca, V.G.: Performance assessment of multiobjective optimizers: an analysis and review. IEEE Trans. Evol. Comput. 7(2), 117– 132 (2003)

Index

Symbols K-sparse approximation, 194 L-Lipschitz continuous function, 113 L-Lipschitz continuously differentiable, 114 ℓ0-norm, 15 ℓ0-norm minimization, 193 ℓ1-norm minimization, 194 ℓ1-penalty minimization, 195 ℓp pooling, 505 ℓp-metric distance, 555 ε-neighborhood, 16, 559 k-nearest neighbor (kNN), 298 k-nearest neighborhood, 559 n-dimensional hypermatrix, 302 p-Dirichlet norm, 365

A Acceleration constant, 782 Achievement scalarizing functions (ASFs), 754 additive ASF, 756 parameterized ASF, 757 strictly increasing ASF, 756 strongly increasing ASF, 756 Achievement scalarizing problem, 754 ACO construction functions, 771 ant solution construct, 771 daemon actions, 771 pheromone update, 771 A-conjugacy, 163 Action space, 391 Action-value function, 393 Activation function, 452 exponential linear unit (ELU), 509

Kullback-Leibler (KL) divergence, 452 logistic function, 452 maxout, 509 nonlinear activation function, 543 probout, 509 rectified linear unit (ReLU), 457, 507 leaky rectified linear unit (Leaky ReLU), 458, 508 noisy rectified linear unit (NReLU), 508 parametric rectified linear unit (PReLU), 458, 509 randomized rectified linear unit (RReLU), 509 sigmoid function, 452 softmax function, 455 Softplus function, 457 Softsign function, 456 Soft-step function, 452 tangent (tanh) function, 456 Active constraint, 141 Active learning (AL), 378, 380 co-active learning, 382 membership query learning, 380 pool-based active learning, 381 selective sampling, 381 sequential learning, 381 stream-based active learning, 381 Active learning-extreme learning machine (AL-ELM) algorithm, 386 Active set, 141 Adaptive boosting (AdaBoost), 252 Adaptive convexity parameter, 117 Additive logistic regression model, 249 Adversarial discriminator, 599 Affine function, 103


806 Aggregate functions, 581 LSTM aggregate function, 582 mean aggregate function, 582 pooling aggregate function, 582 Agree, 248 Aleatory uncertainty, 708 Alleles, 724 All-sharing, 541 Alternating direction multiplier method (ADMM), 144 Alternating least squares (ALS) method, 182 Alternative weak distance metrics, 333 Ant colony system (ACS), 775 global updating rule, 775 local updating rule, 775, 776 state transition rule, 775 Anti-Tikhonov regularization method, 180 Ant system, 773 global updating rule, 774 random-proportional rule, 774 state transition rule, 774 APPROX coordinate descent method, 236 Approximation, 90 Approximation error, 618 Archival nondominated solutions, 721 Archived multiobjective simulated annealing (AMOSA), 721 Archive truncation procedure, 754 Artificial ants ant-routing table, 769 feasible neighborhood, 769 online delayed pheromone update, 769 start state, 769 step-by-step pheromone update, 769 termination conditions, 769 Artificial bee colony (ABC), 777 abandoned position, 778 ABC algorithms, 778 crossover-like artificial bee colony (CABC), 780 employed bees, 777 food sources, 777 gbest-guided artificial bee colony (GABC), 778 improved artificial bee colony (IABC), 779 nectar amounts, 778 onlooker bees, 777 scout bees, 777 Augmented Lagrange multiplier method, 135 Augmented matrix, 158 Autocorrelation matrix, 21 Autocovariance matrix, 21 Autoencoder, 525 code prediction energy, 528

contractive autoencoder (CAE), 532 convolutional autoencoder (CAE), 538 decoder network, 527 denoising autoencoder (DAE), 535 encoder network, 526 nonnegativity constrained autoencoder (NCAE), 541 reconstruction energy, 528 sparse autoencoder (SAE), 533 stacked autoencoder, 532 stacked convolutional denoising autoencoder (SCDAE), 539 stacked sparse autoencoder (SSAE), 534 Auxiliary set, 339 Average pooling, 506

B Backpropagation through time (BPTT), 465 Backtracking line search, 295 Backward iteration, 129 Barrier function, 134 exponential barrier function, 134 Fiacco-McCormick logarithmic barrier function, 134 inverse barrier function, 134 logarithmic barrier function, 134 power-function barrier function, 134 Barrier method, 134 Basis pursuit (BP), 195 Basis pursuit denoising (BPDN), 195 Batch gradient, see Barch method Batch method, 228 Batch normalization (BatchNorm), 459, 589 mini-batch mean, 458 mini-batch normalization, 458 mini-batch variance, 458 parameters, 459 Batch normalizing transform, 589 Bayesian classification rule, 492 Bayesian classification theory, 491 Bayes’ rule, 490 Bellman equation, 393 Benchmark function, 236 Bernoulli distribution, 516 Bernoulli vector, 516 Between-class scatter matrix, 310 Between-class variance, 262 Between-sets covariance matrix, 348 Bidirectional recurrent neural network (BRNN), 468 Binary mask vector, 521 Boltzmann machine, 477, 478 Boltzmann probability factor, 716

Boltzmann's constant, 716 Boosting, 247 Boosting algorithm, 251 Boundary intersection (BI) approach, 747 Broyden-Fletcher-Goldfarb-Shanno (BFGS) method, 148

C Cannot-Link, 333 Canonical correlation analysis (CCA), 343 canonical correlation, 344 canonical variants, 344 canonical weight vectors, 344 CCA algorithm, 347 diagonal penalized CCA, 352 Kernel canonical correlation analysis, 347, 349 penalized canonical correlation analysis, 352 penalized (sparse) CCA algorithm, 353 score variants, 344 sparse canonical variants, 352 sparse CCA, 352 Canonical decomposition (CANDECOM), 306 Cartesian product, 12 Cauchy graph embedding, 566 Cauchy mutation operator, 759 Cauchy-Riemann condition, 72 Cauchy-Riemann equations, 72 Center-based sampling (CBS), 789 Centered kernel function, 266 Centered kernel matrix, 266 Chain rule, 60 Characteristic equation, 205 Characteristic function, 25 Characteristic matrix, 205 Characteristic polynomial, 205 Chromosomes, 724 City-block distance, 555 Class discrimination space, 219 Classification, 257, 259 supervised classification, 257 unsupervised classification, 257 Classification problem, 445 Classification vertices, 375 Closed ball, 103 Closed neighborhood, 91 Cluster analysis, 314 Cluster indicator vector, 333 Cofactor, 28 Cogradient matrix, 59 Cogradient operator, 77 Cogradient vector, 59, 77

807 Coherent, 23 Column space, 33 Community membership, 569 Community membership vector, 570 Community preserving network embedding, 571 Community representation matrix, 572 Commutation matrix, 49 Complex analytic function, 72 Complex conjugate partial derivative, 74 Complex conjugate transpose, 7 Complex differentiable, 72 Complex Gaussian random vector, 25 Complex gradient, 71 Composite mirror descent method, 448 Composite optimization, 228 Concept, 248 Conditional risk function, 492 Condition number, 169 1 condition number, 169 2 condition number, 170 ∞ condition number, 170 Frobenius-norm condition number, 170 Confidence, 254, 299 Confidence parameter, 251 Conic hull, 102 Conjugate cogradient operator, 77 Conjugate cogradient vector, 77 Conjugate gradient, 71 Conjugate gradient algorithm, 163 Connectionist temporal classification (CTC), 474 Consistent, 248 Consistent equation, 33 Constant Error Carousel (CEC), 477 Constrained clustering, 339 Constrained convex optimization problem, 132 Constrained minimization problem, 132 Constrained spectral clustering, 334 Constraint matrix, 333 normalized constraint matrix, 334 Constriction coefficient, see Inertia weight Contrastive divergence (CD), 484, 485 Contribution ratio, 271 Convergence rate, 110 local convergence rate, 111 linear convergence rate, 112 quadratic convergence rate, 112 sublinear convergence rate, 112 logarithmic convergence rate, 111 quotient-convergence rate, 110, 111 cubic convergence rate, 111 linear convergence rate, 111 quadraticr convergence rate, 111

sublinear convergence rate, 111 superlinear convergence rate, 111 Convex hull, 102 Convex cone, 103 Convex function, 103 Convexity parameter, 104 Convex programming problem, 446 Convex relaxation, 194 Convolutional neural networks (CNNs), 497 convolutional layers, 497 downsampling layers, 497 full-connected layers, 497 input layers, 497 loss layers, 497 pooling layers, 497 rectified linear unit (ReLU), 497 Convolutions, 498, 501 multi-input multi-output (MIMO) convolution, 502 multi-input single-output (MISO) convolution, 503 single-input multi-output (SIMO) convolution, 502 single-input single-output (SISO) convolution, 502 2-D convolution, 500 Convolution theorem in frequency domain, 584 Convolution theorem in time domain, 584 Co-occurring data, 341 Coordinate descent methods (CDMs), 232 accelerated coordinate descent method (ACDM), 235 cyclic coordinate descent, 233 randomized coordinate descent method (RCDM), 235 stochastic coordinate descent, 233 Coordinate-wise descent algorithms, 291 Core tensor, 305 Correlation coefficient, 23, 264 Correlation method, 262 Cost, 447 Cost function, 89 Co-training, 342 Covariant operator, 59 Coverage, 699 Criterion scalarizing approach, 718 Criterion scalarizing functions, 719 Cross-correlation matrix, 22 Cross-covariance matrix, 22 Cross-entropy, 453 Crossover, 727 Crossover probability, 727 Crossover rate, 727 Crowded-comparison operator, 703

Crowding distance assignment approach, 702 Crowding distance (CD), 702 Cumulative probability distribution, 253 Cut, 371

D Darwin’s principle of natural evolution, 740 Data bag, 342 Data centering, 258 Data scaling, 258 Data zero-meaning, see Data centering Davidon-Fletcher-Powell (DFP) method, 148 Decision boundary, 299 Decision functions, 297, 650 Decision making approach, 691 Definition domain, 101 Degeneracy, 729 Democratic co-learning, 343 Density, 753 Descent step, 106 Determinant, 27 Differential evolution, 764 Differential evolution (DE) operator, 748 Dimensionality reduction, 258, 269 Direct acyclic graph (DAG), 655 directed direct acyclic graphs (DDAGs), 655 rooted binary DAGs, 655 Directed acyclic graph SVM (DAGSVM) method, 655 Direct sum, 43 Disagree, 248 Discount factor, 392 Discrepancy, see Loss Discriminative model, 598 Discriminator, 600 Distributed nonnegative encoding, 540 Distributed optimization problems, 144 Distribution indicator, 713 Divergence, 324 Diversity indicator, 713 Divide-and-conquer strategy, 715 Domain, 248, 403 Domain adaptation, 410 cross-domain transform method, 421 feature augmentation method, 418 transfer component analysis method, 423 Double Q-learning, 397 Downsampling factor, 504 Drift analysis, 742 DropConnect, 520 binary mask matrix, 522 feature extractor, 522

layer, 522 softmax classification layer, 522 DropConnect Network, 522 Dropout, 514 Dropout neural network model, 516 Dropout spherical K-means, 519 Duality gap, 139 Dual residual, 146 Dyadic decomposition, 171

E Echelon matrix, 159 Eckart-Young theorem, 173 Edge derivative, 361 Effective rank, 178 EigenCluster, 417 Eigenpair, 204 Eigensystem, 210 EigenTransfer, 416 Eigenvalue, 203 algebraic multiplicity, 206 multiple eigenvalue, 206 single eigenvalue, 206 Eigenvalue decomposition (EVD), 204 Eigenvalue-eigenvector equation, 204 Eigenvector, 203 left eigenvector, 208 right eigenvector, 208 Elastic net, 279 Elementary row operations, 158 Type I elementary row operation, 158 Type II elementary row operation, 158 Type III elementary row operation, 158 Elementwise division, 44 Elementwise product, 44 Empirical risk, 226, 253 Empirical risk minimization (ERM), 254, 619 Environmental selection, 753, 754 Episode, 388 Epistemic uncertainty, 708 Equality constraints, 101 Equivalent subspaces, 273 Equivalent systems, 158 Error constrained ℓ1-norm minimization, 195 Error for hidden nodes, 466 Error for output nodes, 465 Error function, 464 Error parameter, 251 Estimation error, 618 Euclidean distance, 16 Euclidean length, 16 Euclidean measures, 354 Euclidean metric, 752

809 Euclidean metric distance, 555 Euclidean structure, 354 Evalustion functional, 624 Evolutionary algorithms (EAs), 740 crossover, 740 1 + 1 evolutionary algorithm, 741 fitness, 740 generation, 740 individual, 740 mutation, 740, 741 offspring, 740 parents, 740 population, 740 replacement, 741 Example vertices, 375 Exhaustive enumeration method, 262 Expected risk, 227, 253 Explicit constraints, 101 Expression ratio criterion, 263 Expression ratio method, 263 Extrema/extreme, 92 global maximum, 92 global minimum, 90, 92 global minimum point, 90 local maximum, 92 local minimum, 91 strict global maximum, 92 strict global minimum, 92 strict global minimum point, 90 strict local maximum, 92 strict local minimum, 91 strict local minimum point, 91 Extreme learning machine (ELM), 542 algorithm, 546 binary classification, 547 multiclass classification, 551 regression, 547 Extreme point, 92 global maximum point, 92 global minimum point, 92 isolated local extreme point, 92 local maximum point, 92 local minimum point, 91 strict global maximum point, 92 strict local maximum point, 92

F Fast evolutionary programming (FEP), 759 Feasible point, 101 Feasible set, 101, 136 Feature augmentation, 418 Feature extraction, 269 Feature learning, 260

810 Feature ranking, 262 Feature selection, 258, 260, 261 Feature vector, 258 FEP algorithm, 760 Finiteness, 620 First-order differential, 64 First-order necessary condition, 94 Fisher discriminant analysis (FDA), 310, 324 Fisher measure, 217, 324 Fisher’s criterion, 263 Fitness, 711 Fitness assignment, 726, 751, 753 Fitness selection, 734 fitness uniform selection scheme, 735 standard selection scheme, 734 linear proportionate selection, 734 ranking selection, 734 tournament selection, 735 truncation selection, 734 Fitness selection approach, 698 Flipped block-structured matrix, 501 Flipped matrix, 501 Flipped vector, 501 Formal partial derivatives, 73 Forward iteration, 129 Forward problem, 14 Frobenius norm ratio, 179 Fully-nonseparable function, 237 Function approximation, 390 Function vector, 4

G Gain shape vector quantization, 329 Gated graph neural networks (GG-NN), 577 Gauss elimination method, 160 Gauss elimination method for matrix inversion, 161 Gaussian approximation, 666 Gaussian mutation operator, 759 Gaussian process classification, 666 Gaussian process regression, 663 Gaussian process regression algorithm, 666 Gaussian random vector, 24 Gauss-Markov theorem, 177 Gauss-Seidel method, 182 circle phenomenon, 183 swamp, 183 Generalized characteristic equation, 211 Generalized characteristic polynomial, 211 Generalized eigenpair, 211 Generalized eigenvalue, 210 Generalized eigenvalue decomposition (GEVD), 210

Index Generalized eigenvector, 210 Generalized opposition solution, 789 Generalized total least squares (GTLS), 191 Generational distance, 713 Generative Adversarial Networks (GANs), 598 Generative model, 598 Generators, 327, 599 Gene rearrangement, 729 Genes, 724 Genetic algorithm (GA), 723 Genetic algorithm with gene rearrangement (GAGR), 730 Global acceptance probability, 719 Global best position (gbest), 782 Global output function, 578 Global transition function, 578 Gradient aggregation, 230 Gradient computation, 59 Gradient descent algorithm, 100 Gradient flow direction, 59 Gradient matrix, 58 Gradient matrix operator, 58 Gradient projection, 295 Gradient projection for sparse reconstruction (GPSR), 293 Gradient-projection method, 108 Gradient vector operators, 57 Gradient vectors, 58 Gram matrix, 625 Graph, 355 adjacency matrix, 355, 356 affinity matrix, 356 r-ball neighborhood graph, 373 complete weighted graph, 373 degree, 356 degree matrix, 356 directed graph, 355 edge set, 355 edge weighting functions, 356 edge weights, 356 dot-product weighting, 357 heat kernel weighting, 357 thresholded Gaussian kernel weighting, 357 0-1 weighting, 357 first-order proximity, 552 fully connected graph, 358 1 graph, 378 higher-order proximity, 553 higher-order proximity matrix, 553 mutual K-nearest neighbor graph, 358 k-nearest neighbor graph, 358, 373 -neighborhood graph, 358 k-order proximity, 553

Index proximity matrix, 553 second-order proximity, 553 undirected graphs, 355 vertex set, 355 weighted adjacency matrix, 356 Graph convolution, 364 Graph convolution theorem, 585 Graph convolutional networks (GCNs), 577, 583 Graph embedding, 554 Graph filtering, 363 Graph Fourier basis, 360 Graph Fourier transform, 362, 585 Graph inverse Fourier transform, 363 Graph kerners, 366 Graph K-means clustering, 361 Graph Laplacian matrices, 358 combinatorial graph Laplacian, 359 normalized graph Laplacian matrices, 359 random walk normalized graph Laplacian, 359 symmetric normalized graph Laplacian, 359 Graph mincut learning algorithm, 375 Graph minor component analysis (GMCA), 361 Graph principal component analysis (GPCA), 361 GraphSAGE, 581 aggregator architectur, 581 embedding generation, 581 parameter learning, 581 Graph signal processing, 363 Graph signals, 354 Graph spectrum, 360 Graph structure, 354 Graph-structured data, 354 Graph Tikhonov regularization, 366 Grassmann manifold, 273 Greedy method, 262 Greedy projection algorithm, 447 Grouped Lasso, 292 Group normalization (GN), 597 Group-sharing, 541

H Hadamard inequality, 31 Hadamard product, 44 Half-space, 109 Hankel matrix, 498 extended Hankel matrix, 503 wrap-around Hankel matrix, 499

811 Hankel structured matrix, 498 Harmonic function, 73 Heat kernel, 374, 565 Heavy ball method (HBM), 115 Hermitian conjugate, see Complex conjugate transpose Heterogeneous domain adaptation (HDA), 418 Heterogeneous feature augmentation (HFA), 419 Heterogeneous transfer learning, 407 Heuristic algorithms, 767 deterministic algorithm, 768 iterative based algorithm, 768 population-based algorithm, 768 stochastic algorithm, 768 Heuristic procedures, 714 Hidden layer output matrix, 545, 546 Hidden nodes, 543 Hierarchical clustering, 320 array of similarity measures, 320 distance matrix, 322 hierarchical clustering scheme (HCS), 320 hierarchical system of clustering representations, 320 maximum method, 324 minimum method, 323 ultrametric inequality, 322 Hierarchical clustering approach, 703 Higher-order matrix algebra, 301 Higher-order proximity, 574 Higher-order proximity matrix, 574 Adamic-Adar (AA), 575 common neighbors, 575 Katz index, 574 Rooted PageRank, 575 Higher-order singular value decomposition, 306 Higher rank tensor ridge regression (hrTRR), 311 Holomorphic complex matrix functions, 75 Holomorphic function, 72 Homogeneity constraint, 272 Homogeneous transfer learning, 407 Hopfield network, 477 Horizontal unfolding, 303 Kiers horizontal unfolding method, 303 Kolda horizontal unfolding method, 304 LMV horizontal unfolding method, 303 Human intelligence, 780 Hybrid evolutionary programming (HEP), 762 Hybrid mutation operators, 762 Hyperplane, 109, 618 Hypervolume for set, 712

Hypervolume (HV), 698 Hypothesis, 619 Hypothesis space, 411, 619

I Ideal objective vector, 697 Idempotent matrix, 207 Ill-conditioned problem, 167 Immediate reward, 392 Imprecision, 708 Imprecision for set, 712 Inactive constraint, 141 Inconsistent, 248 Inconsistent equation, 33 Increasing, 755 Indefinite matrix, 27 Independence assumption, 60 Independent and identically distributed (i.i.d.), 296 Independent identically distributed (iid), 24 Individuals, 724 Inductive learning, 338 Inductive transfer learning, 408 Inequality constraints, 101 Inertia weight, 782 Inexact line search, 149 Infeasible point, 101 Infimum, 139 Infinite kernel learning (IKL), 420 Information network, 355, 568 Information theoretic metric learning (ITML), 423 Inner relation, 285 Instance, 296 Instance distribution, 248 Instance normalization, 596 Instance space, 248 Intercept, 297 Interlacing theorem for singular values, 174 Interpolation, see Regression Interval multiobjective optimization problems, 708 Interval order relation, 709 Interval vector, 708 Invariant feature learning, 525 Invariant features, 525 Inverse affine image, 102 Inverse Euclidean distance, 374 Inverse graph Fourier transform, 585 Inverse matrix, 9 Inverse problem, 14 Iterate averages, 229 Iteration matrix, 163

Iterative improvement, 715 Iterative soft thresholding method, 131

J Jacobian matrix, 56 Jacobian operator, 56 Jensen inequality, 103 Jensen-Shannon (JS) divergence, 514 Job scheduling problems (JSP), 772 Joint probability distribution, 618

K Karush-Kuhn-Tucker (KKT) conditions, 140 Kernel alignment, 267 Kernel functions, 628, 663 additive kernel function, 629 B-splines kernel function, 628 exponential radial basis function, 628 Gaussian radial basis function, 628 linear kernel function, 628 multi-layer perceptron kernel function, 628 polynomial kernel function, 628 Kernel K-means clustering, 627 Kernel partial least squares regression, 632 Kernel PCA, 627 Kernel reproducing property, 626 K-means algorithm, 328 random K-means algorithm, 328 Krylov subspace, 163 Krylov subspace method, 163 Kullback-Leibler (KL) divergence, 513

L Label, 297 Labeled set, 223 Lagrange dual method, 139 Lagrange multiplier vector, 135 Laplace equations, 72 Laplace’s method, 666 Laplacian embedding, 563 Laplacian regularized least squares (LapRLS) solution, 631 Laplacian support vector machines, 633 LAR algorithm, 199 Lasso distributed Lasso, 197 generalized Lasso, 197 group Lasso, 197 MRM-Lasso, 197 Layer normalization (LN), 596 Leading-1 entry, 158

Index Leading entry, 158 Learning machine, 253, 254 Learning step, 100 Least absolute shrinkage and selection operator (Lasso), 279, 290 Least angle regressions (LARS), 198 Least square regression error, 264 Leave-one-out cross-validation (LOOCV), 375 Left generalized eigenvector, 212 Left inverse, 39 Left Kronecker product, 46 Left pseudo-inverse matrix, 39 Linear discriminant analysis, 556 Linear inverse transform, 9 Linear regression, 278 Linear rule, 59 Linear transform matrix, 9 Lipschitz constant, 113 Lipschitz continuous, 113 Lipschitz continuous function, 113 Locality preserving projections (LPP), 565 Local output function, 577 Local topology preserving, 555 Local transition function, 577 Logistic function, 249, 452 Logistic regression, 537 LogitBoost, 250 Longitudinal unfolding, 303 Long short-term memory (LSTM), 471 backward pass, 473 forget gate, 472 forward pass, 472 input gate, 472 memory blocks, 472 output gate, 472 peephole LSTM, 477 Loss, 618 Loss function, 227, 464, 510 contrastive loss, 511 coupled clusters loss, 512 cross-entropy loss function, 452, 527 double-contrastive loss, 512 hinge loss, 510 1 -loss, 510 2 -loss, 511 square hinge loss, 511 large-margin softmax (L-Softmax) loss function, 513 logistic loss function, 452, 523 softmax loss, 511 Lower unbounded, 101, 139 Low-rank approximation, 178

813 M Machine learning (ML), 223 expected performance, 256 accuracy, 256 complexity, 256 convergence reliability, 256 convergence time, 256 response time, 256 scalability, 256 training data, 256 training time, 256 Mahalanobis metric learning, 423 Majority voting, 248 Majorization-minimization (MM), 241 majorization step, 242 majorizer, 242 minimization step, 242 Majorization-minimization (MM) algorithm, 244 monotonicity decreasing, 244 stationary point, 245 Manhattan metric, 555 Manifold learning, 559 isometric map (Isomap), 559 -Isomap, 559 k-Isomap, 559 conformal Isomap, 560 Laplacian eigenmap method, 564 Laplacian eigenmaps, 563 locally linear embedding (LLE), 560 Mapping, 12 codomain, 12 domain, 12 image, 12 injective mapping, 13 inverse mapping, 13 linear mapping, 13 one-to-one mapping, 13 range, 12 surjective mapping, 13 Margin, 618, 621 Markov decision process (MDP, 391 Mating selection, 754 Matricization, 50 matricization of column vector, 50, 51 Matrix, 3 block matrix, 6 broad matrix, 5 diagonal matrix, 5 full column rank matrix, 33 full rank matrix, 33 full row rank matrix, 33

814 Hermitian matrix, 7 identity matrix, 5 nonsingular matrix, 28 rank-deficient matrix, 33 square matrix, 5 submatrix, 6 symmetric matrix, 7 tall matrix, 5 zero matrix, 5 Matrix diffferential, 62 Matrix equation, 3 Matrix inversion lemma, 36 Matrix norm entrywise norm p-norm, 19 Frobenius norm, 19 Mahalanobis norm, 20 max norm, 19 1 -norm, 19 Matrix pair, 210 Matrix pencil, 210 Matrix transpose, 7 Max-flow method, 376 Maximal margin hyperplane, 652 Maximin problem, 138 Maximum ascent rate, 59 Maximum descent rate, 59 Maximum margin, 622 Maximum mean discrepancy embedding (MMDE), 424 Maximum spread, 699 Max pooling, 506 Max Wins strategy, 655 Mean tensor, 310 Measure, 316 nonnegativity, 316 symmetry, 316 triangle inequality, 316 Mercer kernel, 627 exponential radial basis function kernel, 627 Gaussian radial basis function (GRBF) kernel, 627 Mercer’s theorem, 626 Mesoscopic structure, 571 Message passing neural networks (MPNNs), 577 Metaheuristic, 714 metaheuristic procedures, 714 population metaheuristic, 714 single-solution metaheuristic, 714 Method of distance functions, 744 Method of objective weighting, 744 Metropolis algorithm, 714

Index Metropolis-Hastings algorithm, see Metropolis algorithm Microscopic structure, 571 Minimax problem, 138 Minimum cut (min-cut), 371 Minkowski p-metric distance, 555 Min-max formulation, 744 Minor component analysis (MCA), 271 Minor components, 270 Minor subspace, 271 Minor subspace analysis (MSA), 271 Misclassification rate, 492 Misclassification tolerance, 637 Mixed pooling, 506 Mixed-variable optimization problem (MVOP), 771 Model mismatch, 618 Modified connectionist Q-learning, 399 Modularity, 570 Modularity matrix, 570 Momentum, 116 Moore-Aronszajn Theorem, 626 Moore-Penrose conditions, 41 Moore-Penrose inverse, 41 Moreau decomposition, 124 Multiclass classifier, 299 Multidimensional scaling (MDS), 558 Multilinear data analysis, 541 Multiobjective combinatorial optimization (MOCO), 684 multiobjective assignment problems, 684 multiobjective min sum location problem, 685 multiobjective network flow, 685 multiobjective transportation problem, 684 unconstrained combinatorial optimization problem, 685 Multiobjective evolutionary algorithm based on decomposition (MOEA/D), 746 Multiobjective optimization problem (MOP), 686 decision space, 686 decision vector, 686 feasible set, 687 ideal solution, 688 ideal vector, 688 objective space, 686 objective vector, 686 Multi-output nodes, 549 Multiple kernel learning (MKL), 421 Must-Link, 333 Mutation, 727 Mutation probability, 728 Mutation rate, 728

Index N Nadir objective vector, 697 Naive Bayesian classification, 490 Nash equilibrium, 600 Natural basis vector, 233 Nearest neighbor, 317 Nearest neighbor classification, 317 Negative definite matrix, 27 Negative semi-definite matrix, 27 Negative transfer, 406 Neighborhood, 93 Neighborhood preserving embedding (NPE), 562 Network ambedding, 568 Network coding resource minimization (NCRM), 773 Network embedding large-scale information network embedding, 568 network inference, 568 network reconstruction, 568 Neural network tree, 444 Newton method, 107 modified Newton method, 148 truncated Newton method, 147 Niche count, 737 Niched Pareto Algorithm, 737 Niched Pareto genetic algorithm, 737 Niche radius, 737 Node embedding, 577 Node vector, 577 Noise projection matrix, 272 Noise subspace, 271 Noisy multiobjective optimization problems, 707 Nondominated sorting approach, 700 Nondomination level, 703 Nondomination rank, 703 Non-Euclidean structure, 354, 355 Nonlinear iterative partial least squares (NIPALS), 632 Nonlinear iterative partial least squares (NIPALS) algorithm, 286 Nonnegative constraints, 540 Nonnegative matrix factorization (NMF), 540 Nonnegative orthant, 103 Non-pivot features, 414 Nonseparable case, 623 Nonseparable function, 237 Nonsmooth convex optimization, 128 Nonstationary iterative method, 163 Normalized singular values, 178 Numerically stable, 168

Numerical stability, 168 Nyström r-rank approximation, 335

O Objective function, 89 One-against-all (OAA) method, 653 One-against-one (OAO) method, 654 One-versus-one (OVO) method, see One-against-one (OAO) method One-versus-rest (OVR) method, see One-against-all (OAA) method Online convex optimization, 446 Online convex programming problem, 446 Open ball, 103 Open neighborhood, 91 Opposite number, 788 Opposite pair, 383 Opposite point, 789 Opposition-Based Generation Jumping, 791 Opposition-based learning (OBL), 789 Opposition-based optimization (OBO), 789 Opposition-Based Population Initialization, 791 Opposition number, 788 Opposition solution, 789 Optimal behavior models, 390 average-reward model, 391 finite-horizon model, 390 infinite-horizon discounted model, 390 Optimal class discrimination matrix, 325 Optimal experimental design, see Active learning Optimal primal value, 138 Optimal rank tensor ridge regression (orTRR), 312 Optimal solution, 101 Optimal unbiased estimator, 177 Optimization vector, 89 Ordered n-ples, 12 Original residual, 146 Orthogonality constraint, 272 Outer relations, 285 Overfitting, 251, 617

P Pairwise constraints, 333 Parallel factors decomposition (PARFAC), 306 Parameterized achievement scalarizing problem, 757 Pareto concepts, 691, 692 archive, 698

816 global Pareto-optimal set, 696 globally nondominated, 694 incomparable, 693 local Pareto-optimal set, 696 mutually nondominating, 693 nondominated sets, 695 nondominated solution, 694 outcomes, 692 Pareto dominance, 692 Pareto efficient, 694 Pareto front, 696 Pareto frontier, 696 Pareto improvement, 694 Pareto-optimal, 694 Pareto-optimal front, 696 Pareto-optimality, 694 Pareto-optimal solutions, 691 strict Pareto dominance, 693 weak Pareto dominance, 693 Pareto concepts for approximation sets, 709 incomparable dominance for approximation sets, 711 strict dominance for approximation sets, 710 Pareto concepts for sets dominance for approximation sets, 709 weak dominance for approximation sets, 710 Pareto optimization approach, 691 Pareto optimization theory, 691 Partial labeling, 333 Partial least squares (PLS), 285 Partial least squares (PLS) regression, 287 Partial order, 703 Particle swarm canonical particle swarm, 781 Particle swarm optimization (PSO), 780 genetic learning particle swarm optimization (GL-PSO), 784 Particle swarm optimization (PSO) algorithm [19], 783 Partition function, 479, 481 Partitional clustering, 781 Parts combination, 541 Passive learning, 378 Pattern, 257 Pattern recognition, 257 Pattern vectors, 258 Pearson correlation coefficients, 262 Penalized regression, 290 Penalty function exterior penalty function, 133 interior penalty function, 133 Penalty function method, 133, 134

Index Penalty parameter vector, 135 Pheromone model, 770 Pheromone values, 770 Piecewise continuous, 543 Pivot column, 159 Pivot features, 414 Pivot position, 159 Policy, 393 Polynomial mutation operator, 749 Polynomially evaluatble, 249 Pool-based active learning (PAL), 382 Pooling stride, 504 Positive definite matrix, 27 Positive definition kernel, 625 Positive semi-definite cones, 110 Positive semi-definite matrix, 27 Preconditioned conjugate gradient (PCG), 166 Prediction, see Regression Prediction error, 227 Prediction function, 226, 227 Predictors, 259, 278 Primal cost function, 137 Primal-dual subgradient method, 448 Principal component analysis (PCA), 270 Principal component pursuit (PCP), 276 Principal subspace, 271 Principal subspace analysis (PSA), 271 Probably approximately correct (PAC) learnable, 251 efficiently PAC learnable, 251 Probably approximately correct (PAC) learning, 251 Probably approximately correct (PAC) model, 251 Product rule, 59 Projected gradient update, 446 Projection approximation subspace tracking (PAST), 274 Projection operator, 108 Proximal operator, 123 Proximity indicator, 713 Pythagorean theorem, 18

Q Q-function, 393, 394 Q-learning, 394 QR factorization, 170 Quadratically constrained linear program (QCLP), 293 Quadratic form, 27 Quadratic program (QP), 293 bound-constrained quadratic program (BCQP), 294

Quadratic programming (QP) problem, 195 Quasi-convex, 104 Quasi-opposite number, 791 Quasi-opposite point, 792 Quasi-oppositional differential evolution (QODE), 794 Quasi opposition-based learning (quasi-OBL), 791 Query learning, see Active learning Query strategies, 381 balance exploration and exploitation, 381 expected model change, 381 exponentiated gradient exploration, 381 least certainty, 381 middle certainty, 381 query by committee, 381 uncertainty sampling, 381 variance reduction, 381 Quotient rule, 60

R Random vector, 4 Random walks, 580 Range, 33 Rank, 33 Rank-one decomposition, 307 Raw fitness value, 753 Rayleigh quotient, 215 Rayleigh-Ritz theorem, 215 Recombination, 725 Rectangular set, 109 Recurrent neural networks (RNNs), 460 backward mechanism, 467 bidirectional mechanism, 468 forward mechanism, 467 recurrent computation, 467 Recursive feature elimination (RFE), 652 Reduced row-echelon form (RREF) matrix, 159 Reference point vector, 746 Reformulation approach, 691 Regression, 259, 282 Regression intercept, 260 Regression problem, 445 Regression residual, 260 Regret, 447 Regularization parameter, 180, 637 Regularization path, 180 Regularized Gauss-Seidel method, 184 Regularized least squares cost function, 180 Reinforcement learning, 387 action, 388 agent, 388

817 autonomy, 388 pro-activeness, 388 reactivity, 388 social ability, 388 control policy, 389 deterministic policy, 389 probabilistic policy, 389 improvement, 392 payoff, 389 return, 389 reward function, 388 state, 388 state-action space, 388 Relative entropy, see Kullback-Leibler (KL) divergence Relative feasible set, 142 Relative interior points, 142 Relaxation, 90 Relevance feedback, 380 Relevance vector machine (RVM), 667 Representation choice, 260 Representation learning, 260 Representer theorem for semi-supervised learning, 630 Representer theorem for supervised learning, 629 Reproducing kernel, 624, 625 Reproducing kernel Hilbert space (RKHS), 624 Reproducing property, 624, 626 Residual variance, 265 Restricted Boltzmann machine (RBM), 481 Return, 393 Reward function, 392 Rewards, 389 Richardson iteration, 163 Ridge regression, 290 Right generalized eigenvector, 212 Right Kronecker product, 46 Right pseudo-inverse matrix, 39 Risk function, 618 Robust principal component analysis, 276 Roulette wheel selection, 700 roulette wheel fitness selection algorithm, 700 Row equivalent, 158 Row partial derivative operator, 55 Row partial derivative vector, 55

S Second-order necessary condition, 94 Second-order sufficient condition, 94 Self-taught transfer learning, 409

818 Semi-orthogonal matrix, 272 Semi-supervised classification, 339 Semi-supervised clustering, 341 propagating 1-nearest neighbor clustering, 341 Semi-supervised learning, 337 bootstrapping, 340 co-training, 342 self-teaching, 340 self-training, 340 semi-supervised inductive learning, 338 semi-supervised transductive learning, 338 Separable case, 621 Separable function, 237 Set, 10 difference set, 11 intersection set, 11 null set, 11 proper subset, 11 singleton, 10 subset, 11 sum set, 11 superset, 11 union set, 11 Sharing function value, 737 Shrinkage operator, 125 Signal projection matrix, 271 Signal subspace, 271 Similarity, 316 combined similarity, 320 dissimilarity, 316 distance matrix, 319 Euclidean distance, 316 Mahalanobis distance, 317 normalized Euclidean distance, 317 regularized Euclidean distance, 316 semantic similarity, 320 similarity matrix, 319 similarity strength, 319 syntax similarity, 320 Tanimoto measure, 318 word similarity, 320 Similarity score, 299 SinC function, 641 Single-hidden layer feedforward networks (SLFNs), 542 Single-output node, 546 Singular value decomposition (SVD) full singular value decomposition, 172 left-singular vector matrix, 171 left-singular vectors, 171 right-singular vector matrix, 171 right-singular vectors, 171 singular values, 171

Index truncated singular value decomposition, 172 Singular value thresholding (SVT), 130, 174 Slackness parameters, 637 Slack variable vector, 143 Slater condition, 142 Smooth function, 113 Soft thresholding, 174 Soft thresholding operation, 129, 175 Soft thresholding operator, 125 Soft threshold value, 125 Source domain, 403 Source-domain data, 403 Source embedding vectors, 574 Source learning task, 405 Spacing, 699 Sparse Bayesian classification, 672 Sparse Bayesian learning, 667, 669 Sparse constraints, 541 Sparse principal component analysis (SPCA) algorithm, 280 Sparse restruction, 293 Sparsest solution, 194 Sparsity, 194, 668 Spatial pyramid pooling, 507 Spectral convolution, 583 Spectral graph theory, 354 Spectral pooling, 507 Spherical K-means, 329, 518 code vector, 518 codebook matrix, 518 dictionary, 518 Stagewise regression, 198 Staionary iterative method, 163 State embedding, 577 State transition function, 391 State value function, 393 State vector, 577 Stationary point, 90 Statistically uncorrelated, 23 Statistical uncorrelation, 23 Steepest descent direction, 107 Steepest descent method, 107 Step direction, 106 Stepwise regression, 198 Stochastic average gradient aggregation (SAGA) method, 232 Stochastic average gradient (SAG), 230 Stochastic gradient (SG) method, 228 Stochastic pooling, 506 Stochastic variance reduced gradient (SVRG) method, 231 Strength value, 753 Stress, 557

Index Strictly convex function, 103 Strictly increasing, 755 Strictly quasi-convex, 104 Stride of convolution, 500 Strong duality, 140 Strong learning algorithm, 247 Strong PAC-learning algorithm, 251 Strongly convex, 104 Strongly increasing, 755 Strongly quasi-convex, 104 Structural correspondence learning (SCL), 414 Structural learning, 412 Subdifferentiable, 121 Subdifferential, 120 Subgradient vector, 120 Subsec-mil, 35 Subspace analysis methods, 273 Superdiagonal line, 302 Supersymmetric tensor, 302 Supervised feature selection, 258 Supervised learning, 224 Supervised learning classification, 297 Supervised tensor learning, 307 alternating projection algorithm, 309 Support, 254 Support vector expansion, 638 Support vector machine binary classification, 641 Support vector machine binary classifier, 642 least squares support vector machine (LS-SVM) classifier, 647 proximal support vector machine binary classifier, 649 ν-support vector binary classifier, 645 Support vector machine multiclass classification, 653 least-squares support machine multiclass classifier, 656 proximal support machine multiclass classifier, 659 Support vector machine regression, 635 -support vector machine regression, 636 -insensitive tube, 637 Wolfe dual -support vector machine regression, 638 ν-support vector machine regression, 639 Wolfe dual ν-support vector machine regression, 641 Support vector machines (SVMs), 617 B-splines SVM, 628 least-squares support vector machine, 647 linear SVM, 628 multi-layer SVM, 628 polynomial SVM, 628

proximal support vector machine, 649 radial basis function (RBF) SVM, 628 summing SVM, 629 Supremum, 138 Surrogate function, 242 Swarm intelligence, 781 System of linear equations, 3

T Target distribution, 248 Target domain, 403 Target-domain data, 403 Target embedding vectors, 574 Target learning task, 405 Task, 404 Tchebycheff approach, 746 Tchebycheff scalar optimization subproblem, 746 Tensor, 301 Frobenius norm, 304 Tensor algebra, 301 Tensor classifier, 307 Tensor distance (TD), 314 Tensor dot product, 304 Tensor fibers, 302 column fibers, 302 row fibers, 302 tube fiber, 302 Tensor Fisher discriminant analysis (TFDA), 310 Tensor inner product, 304 Tensor K-means clustering, 314 Tensor matrixing, 303 Tensor outer product, 304 Tensor predictor, 307 Tensor regression, 311 Tensor unfolding, 303 Tensor vectorization, 303 Tessellation, 327 Testing phase, 259 Test set, 223 Thinned dictionary, 520 Third-order tensor, 302 Tied weight matrix, 536 Tied weights, 527 Tikhonov regularization method, 180 Tikhonov regularization solution, 180 Tournament selection, 700 binary tournament selection, 700 tournament size, 700 TrAdaBoost, 410 Trained machine, 253 Training output matrix, 546

Training phase, 259 Training sample, 296 Training set, 223 Transductive transfer learning, 408 Transfer AdaBoost learning framework, 411 Transfer component analysis (TCA), 426 Transfer learning, 405 Transfer of knowledge, 333 Transformation metric, 419 Transitivity, 573 Travelling salesman problem (TSP), 689, 773 asymmetric TSP, 689 dynamic TSP (DTSP), 689 Euclidean TSP, 689 symmetric TSP, 689 Triplet loss, 512 Tri-training, 343 Trivial solution, 35 Tucker decomposition, 305 Tucker mode-1 product, 305 Tucker mode-2 product, 305 Tucker mode-3 product, 305 Tucker operator, 305 Two-view data, 341 Type II maximum likelihood, 671

U Unconstrained optimization problem, 89 Under-regression, 198 Unique solution, 35 Unit ball, 102 Unit coordinate vector, 233 Unit tensor, 302 Unlabeled set, 223 Unsupervised feature selection, 258 Unsupervised learning, 224 Unsupervised transfer learning, 408

V Value function, 89 Variable ranking, 261 Variable selection, 260 Variance reduction, 231 VC-dimension, 619, 620 VC-theory, 617 Vector, 3 algebraic vector, 4 constant vector, 4 basic vector, 5 dot product, 6
geometric vector, 4 inner product, 6 normalized vector, 16 outer product, 7 physical vector, 4 vector addition, 6 Vector cross product outer product, 7 Vectorization, 48 column vectorization, 48 row vectorization, 48 Vector multiplication, 6 Vector norm, 14 Euclidean norm, 15 ℓ1-norm, 15 ℓ2-norm, 15 ℓ∞-norm, 15 ℓp-norm, 15 unitary invariant norm, 16 Vector quantization (VQ), 540 Violated constraint, 141 Visible binary rating matrix, 487 Voronoi set, 327 Voronoi tessellation, 327 Voting strategy, 655

W Weak duality, 140 Weak learning algorithm, 247 Weak PAC learning algorithm, 251 Web-page, 341 Weighted Chebyshev norm, 719 Weighted sum, 719 Weighted-sum method, 726 Weight-decay (wd), 531 Weighted boundary volume, 374 Weighted sum approach, 746 Weight mask matrix, 521 Well-conditioned problem, 167 Winner-take-all, 541 Wirtinger partial derivatives, 73 Within-class scatter matrix, 218, 310 Within-class variance, 262 Within-sets covariance matrix, 348 Wolfe dual optimization problem, 622 Woodbury formula, 36

Z Zero-phase component analysis (ZCA), 536 Zero vector, 5