
Current Natural Sciences

Qingna LI

Modern Optimization Methods

Printed in France

EDP Sciences – ISBN(print): 978-2-7598-3174-6 – ISBN(ebook): 978-2-7598-3175-3
DOI: 10.1051/978-2-7598-3174-6
All rights relative to translation, adaptation and reproduction by any means whatsoever are reserved, worldwide. In accordance with the terms of paragraphs 2 and 3 of Article 41 of the French Act dated March 11, 1957, "copies or reproductions reserved strictly for private use and not intended for collective use" and, on the other hand, analyses and short quotations for example or illustrative purposes, are allowed. Otherwise, "any representation or reproduction – whether in full or in part – without the consent of the author or of his successors or assigns, is unlawful" (Article 40, paragraph 1). Any representation or reproduction, by any means whatsoever, will therefore be deemed an infringement of copyright punishable under Articles 425 and following of the French Penal Code. The printed edition is not for sale in Chinese mainland. Customers in Chinese mainland please order the print book from Science Press. ISBN of the China edition: Science Press 978-7-03-074785-3
© Science Press, EDP Sciences, 2023

Preface

This book is based on the author's lecture "Modern Optimization Methods", given to graduate students at the Beijing Institute of Technology since 2015. It aims at presenting complete and systematic theories of numerical optimization together with their latest applications in different areas, especially in machine learning, statistics, and computer science. The book introduces the basic definitions and theory of numerical optimization, including optimality conditions for unconstrained and constrained optimization, as well as algorithms for unconstrained and constrained problems. Moreover, it also includes the nonsmooth Newton's method, which plays an important role in large-scale numerical optimization. Finally, based on the author's research experience, several recent applications of optimization are introduced, including optimization algorithms for hypergraph matching, support vector machines, and a bilevel optimization approach for hyperparameter selection in machine learning.

The book is organized into three parts. In the first part (chapters 1–6), after the introduction in chapter 1, we start with the fundamentals of optimization, followed by the typical methods for unconstrained optimization problems. Such methods are classified into line search methods and trust region methods. In particular, we include the semismooth Newton's method as an independent chapter to show how it can be applied to a data-based problem, namely the support vector machine (SVM) in machine learning. In the second part (chapters 7 and 8), we introduce the theory and typical methods for constrained optimization problems. Moreover, to show how the methods are applied to problems arising from applications, we demonstrate the quadratic penalty method by applying it to hypergraph matching, presenting an interesting exact recovery property based on the author's research, and we show how the augmented Lagrangian method is used for the L1-regularized SVM. The final part (chapter 9) shows how optimization can be used for hyperparameter selection in machine learning.


It involves bilevel optimization and mathematical programs with equilibrium constraints (MPEC), which are two active branches of optimization. We sincerely hope that the reader will learn and understand the main ideas and the essence of the basic optimality theories and numerical algorithms. We also hope that the reader will deeply understand how to implement fast algorithms and will undertake related research after studying the references in this book. We also want to take this opportunity to thank all the people who supported us, including my parents, husband, kids, and my teachers, colleagues, and collaborators. The author is supported in part by the National Natural Science Foundation of China under grant 12071032.

Qingna LI
March, 2023

Contents

Preface . . . III

Chapter 1 Introduction . . . 1
1.1 About Optimization . . . 1
1.2 Classification of Optimization . . . 4
1.3 Preliminaries in Convex Analysis . . . 10
1.4 Exercises . . . 15

Chapter 2 Fundamentals of Optimization . . . 17
2.1 Unconstrained Optimization Problem . . . 17
2.2 What is a Solution? . . . 18
2.2.1 Definitions of Different Solutions . . . 18
2.2.2 Recognizing a Local Minimum . . . 20
2.2.3 Nonsmooth Problems . . . 23
2.3 Overview of Algorithms . . . 25
2.3.1 Line Search Strategy . . . 26
2.3.2 Trust Region Strategy . . . 30
2.4 Convergence . . . 31
2.5 Scaling . . . 32
2.6 Exercises . . . 33

Chapter 3 Line Search Methods . . . 35
3.1 Step Length . . . 35
3.1.1 The Wolfe Conditions . . . 37
3.1.2 The Goldstein Conditions . . . 40
3.1.3 Sufficient Decrease and Backtracking . . . 41
3.2 Convergence of Line Search Methods . . . 42
3.3 Rate of Convergence . . . 44
3.3.1 Steepest Descent Method . . . 44
3.3.2 Newton's Method . . . 46
3.3.3 Quasi-Newton Methods . . . 48
3.4 Exercises . . . 50

Chapter 4 Trust Region Methods . . . 51
4.1 Outline of the Trust Region Approach . . . 52
4.2 Algorithms Based on the Cauchy Point . . . 54
4.2.1 The Cauchy Point . . . 54
4.2.2 The Dogleg Method . . . 56
4.2.3 Two-Dimensional Subspace Minimization . . . 58
4.3 Global Convergence . . . 59
4.3.1 Reduction Obtained by the Cauchy Point . . . 59
4.3.2 Convergence to Stationary Points . . . 61
4.4 Local Convergence . . . 65
4.5 Other Enhancements . . . 65
4.6 Exercises . . . 68

Chapter 5 Conjugate Gradient Methods . . . 69
5.1 Linear Conjugate Gradient Method . . . 69
5.1.1 Conjugate Direction Method . . . 69
5.1.2 Conjugate Gradient Method . . . 72
5.1.3 A Practical Form of the Conjugate Gradient Method . . . 75
5.1.4 Rate of Convergence . . . 76
5.1.5 Preconditioning . . . 77
5.2 Nonlinear Conjugate Gradient Methods . . . 78
5.2.1 The Polak-Ribière Method and Variants . . . 80
5.2.2 Global Convergence . . . 81
5.3 Exercises . . . 83

Chapter 6 Semismooth Newton's Method . . . 85
6.1 Semismoothness . . . 85
6.2 Nonsmooth Version of Newton's Method . . . 87
6.3 Support Vector Machine . . . 89
6.4 Semismooth Newton's Method for SVM . . . 91
6.5 Exercises . . . 96

Chapter 7 Theory of Constrained Optimization . . . 97
7.1 Local and Global Solutions . . . 97
7.1.1 Smoothness . . . 98
7.2 Examples . . . 99
7.3 Tangent Cone and Constraint Qualifications . . . 103
7.4 First-Order Optimality Conditions . . . 105
7.5 Second-Order Conditions . . . 106
7.6 Duality . . . 109
7.7 KKT Condition . . . 112
7.8 Dual Problem . . . 114
7.9 Exercises . . . 118

Chapter 8 Penalty and Augmented Lagrangian Methods . . . 119
8.1 The Quadratic Penalty Method . . . 119
8.2 Exact Penalty Method . . . 122
8.3 Augmented Lagrangian Method . . . 123
8.4 Quadratic Penalty Method for Hypergraph Matching . . . 125
8.4.1 Hypergraph Matching . . . 126
8.4.2 Mathematical Formulation . . . 126
8.4.3 Relaxation Problem . . . 128
8.4.4 Quadratic Penalty Method for (8.21) . . . 129
8.4.5 Numerical Results . . . 130
8.5 Augmented Lagrangian Method for SVM . . . 132
8.5.1 Support Vector Machine . . . 132
8.5.2 Mathematical Formulation . . . 133
8.5.3 Augmented Lagrangian Method (ALM) . . . 133
8.5.4 Semismooth Newton's Method for the Subproblem . . . 136
8.5.5 Reducing the Computational Cost . . . 137
8.5.6 Convergence Result of ALM . . . 138
8.5.7 Numerical Results on LIBLINEAR . . . 139
8.6 Exercises . . . 141

Chapter 9 Bilevel Optimization and Its Applications . . . 143
9.1 Introduction . . . 143
9.2 Bilevel Model for a Case of Hyperparameter Selection in SVC . . . 145
9.2.1 An MPEC Formulation . . . 147
9.3 The Global Relaxation Method (GRM) . . . 148
9.4 MPEC-MFCQ: A Hidden Property . . . 149
9.5 Numerical Results . . . 150

Bibliography . . . 153

Chapter 1
Introduction

In this first chapter, we introduce what optimization is, the classification of optimization problems, as well as some preliminaries in convex analysis.

1.1 About Optimization

Optimization exists everywhere. People optimize: as long as people have choices, they do optimization. In finance, people select portfolios to maximize the rate of return while avoiding risk. In engineering, people try to maximize the efficiency of a system and to control it optimally. Nature optimizes as well: physical systems tend to a state of minimum total energy; for isolated chemical systems, reactions proceed until the minimum total potential energy is reached; light travels along the path that minimizes travel time. See figure 1.1, where S represents the incident ray and S′ represents the mirror image of S with respect to the reflective surface.

FIG. 1.1 – Light travels following the shortest path.


One can conclude that as long as there are choices, optimization happens. There are three key elements in optimization.
(i) Objective: a quantitative measure of the performance under study, for example profit, time, or potential energy.
(ii) Variables: the unknowns in the problem, which need to be determined.
(iii) Constraints: the restrictions that the variables must obey, such as nonnegativity.

The optimization process can be divided into the following three steps.
(i) Modelling: identify the key elements above. On the one hand, the model cannot be too simple, or it will not represent the real application problem; on the other hand, it cannot be too complicated, or it will bring challenges in solving the problem.
(ii) Apply an optimization algorithm to solve the model. This is the main focus of the book; we will discuss different ways to design optimization algorithms.
(iii) Check whether some stopping criterion is satisfied. If the optimality conditions are satisfied, a solution is obtained; otherwise, we perform a sensitivity analysis.

Mathematical modelling of optimization

Optimization is the minimization or maximization of a function subject to constraints on its variables. To build the mathematical model of an application problem, one needs to identify the three key elements above, whose mathematical notations are as follows.
- $x$: usually a vector of variables in $\mathbb{R}^n$, representing the decision to be made.
- $f$: the objective function, usually a scalar function of $x$ to be maximized or minimized, i.e., $f: \mathbb{R}^n \to \mathbb{R}$.
- $c_i: \mathbb{R}^n \to \mathbb{R}$, $i = 1,\dots,m$: scalar functions that define the equalities and inequalities that $x$ must satisfy.

With these notations, the optimization problem can be written in the following form
$$\min_{x\in\mathbb{R}^n} f(x) \quad \text{s.t.} \quad c_i(x) = 0,\ i \in \mathcal{E};\quad c_i(x) \ge 0,\ i \in \mathcal{I}. \tag{1.1}$$
Here, we use "s.t." to denote "subject to". A maximization problem $\max f(x)$ can be transformed into $\min\, (-f(x))$. $\mathcal{E}$ denotes the set of indices of the equality constraints, and $\mathcal{I}$ denotes the set of indices of the inequality constraints. Below we give an example.

Example 1.1.1.
$$\min_{x_1, x_2}\ (x_1 - 2)^2 + (x_2 - 1)^2 \quad \text{s.t.} \quad x_1^2 - x_2 \le 0,\quad x_1 + x_2 \le 2. \tag{1.2}$$

Define
$$f(x) = (x_1 - 2)^2 + (x_2 - 1)^2, \qquad x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix},$$
$$c(x) = \begin{pmatrix} c_1(x) \\ c_2(x) \end{pmatrix} = \begin{pmatrix} -x_1^2 + x_2 \\ -x_1 - x_2 + 2 \end{pmatrix}.$$
Let $\mathcal{I} = \{1, 2\}$ and $\mathcal{E} = \emptyset$. Then (1.2) can be reformulated into the form of (1.1). See figure 1.2 for example 1.1.1; the dotted lines denote the contours of $f$ (the sets of points at which $f(x)$ has a constant value).
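For illustration only, here is a minimal numerical check of example 1.1.1 using scipy.optimize.minimize; the solver choice, starting point, and tolerances are our own assumptions, not part of the text.

```python
import numpy as np
from scipy.optimize import minimize

# Objective of example 1.1.1: f(x) = (x1 - 2)^2 + (x2 - 1)^2.
def f(x):
    return (x[0] - 2.0) ** 2 + (x[1] - 1.0) ** 2

# Constraints written as c_i(x) >= 0, matching the form (1.1):
# c1(x) = -x1^2 + x2 >= 0 and c2(x) = -x1 - x2 + 2 >= 0.
constraints = [
    {"type": "ineq", "fun": lambda x: -x[0] ** 2 + x[1]},
    {"type": "ineq", "fun": lambda x: -x[0] - x[1] + 2.0},
]

res = minimize(f, x0=np.zeros(2), constraints=constraints)
print(res.x)  # roughly (1, 1): both constraints are active at the solution
```

Both constraints hold with equality at the minimizer, which is the point where the parabola $x_2 = x_1^2$ meets the line $x_1 + x_2 = 2$.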

FIG. 1.2 – Example 1.1.1.

Example 1.1.2. [20] An example from finance. Assume that an insurer is exposed to four risks (for example, equity market falls, credit spread widening, life catastrophe risk, and lapses of its policies) and that the capital held for each risk is £10m, £20m, £7m and £5m, respectively. We know only the correlation between the two market risks and the correlation between the two non-market risks, which are 75% and 25%, respectively (see figure 1.3). We obtain the following problem
$$\min\ V^T C V \quad \text{s.t.} \quad C \succeq 0,\ \text{some } C_{ij} \text{ known}, \tag{1.3}$$
where $C \succeq 0$ means that $C$ is positive semidefinite.

FIG. 1.3 – Capital vector and correlation matrix.

Problem (1.3) is a (linear) semidefinite programming (SDP) problem [52, 62, 63]. The objective function is linear, and there are linear constraints in (1.3) as well as a positive semidefinite constraint. If the objective in (1.3) is a quadratic function, the problem becomes a quadratic semidefinite programming (QSDP) problem [60, 76, 77].

1.2 Classification of Optimization

There are different ways to classify optimization problems. Below we list some of them, from different aspects: (i) continuous optimization and discrete optimization; (ii) constrained optimization and unconstrained optimization; (iii) global optimization and local optimization; (iv) stochastic optimization and deterministic optimization; (v) convex optimization and nonconvex optimization. We briefly discuss them one by one.

Continuous optimization and discrete optimization

The feasible set of a continuous optimization problem is uncountably infinite; see figure 1.4 for an example. For continuous optimization, the benefit is that local information can be used when searching for the minimizer of the function within the feasible set.

FIG. 1.4 – An uncountably infinite feasible set $\{(x, y) \mid y = x - 8,\ x \ge 0,\ y \ge 0\}$ of continuous optimization.

For discrete optimization, the feasible set is finite; see figure 1.5 for an example. The potential difficulty of discrete optimization is that the objective may change greatly even between feasible points that are close in some sense.

FIG. 1.5 – A finite feasible region $\{(x, y) \mid y = x - 8,\ x \ge 0,\ y \ge 0,\ x \text{ and } y \text{ integers}\}$ of discrete optimization.

Two typical types of discrete optimization are as follows.
- Integer programming (IP), where the variables are integer or binary ($\{0, 1\}$). For example, whether or not a particular factory should be located in a particular city can be represented by a binary variable.
- Mixed integer programming (MIP), where not all variables are restricted to be integer or binary.

Example 1.2.1. Linear assignment problem [4]. Assume that there are $n$ jobs and $m$ individuals. We try to assign each job to one person so that the total profit is maximized; each job can be assigned to only one person. The mathematical model is as follows
$$\max_{x\in\mathbb{R}^{nm}} \sum_{j=1}^{m} \sum_{i=1}^{n} c_{ij} x_{ij} \quad \text{s.t.} \quad \sum_{j=1}^{m} x_{ij} = 1,\ i = 1,\dots,n;\qquad x_{ij} \in \{0,1\},\ i = 1,\dots,n,\ j = 1,\dots,m.$$
Here, $c_{ij} \ge 0$ is the profit if job $i$ is assigned to individual $j$.
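As an illustration of example 1.2.1, a minimal sketch with synthetic profits; the data and sizes are hypothetical. The model as printed constrains only the jobs, so each job simply picks its most profitable person; the classical one-to-one variant is solved by scipy.optimize.linear_sum_assignment.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n, m = 4, 4                          # jobs and individuals (illustrative sizes)
c = rng.uniform(0.0, 10.0, (n, m))   # c[i, j]: profit of giving job i to person j

# Model as printed (one constraint per job): each job picks its best person.
best = c.argmax(axis=1)
print("per-job choice:", best, "profit:", c.max(axis=1).sum())

# One-to-one variant (each person also takes at most one job): the classical
# linear assignment problem, solved exactly in polynomial time.
rows, cols = linear_sum_assignment(c, maximize=True)
print("one-to-one:", cols, "profit:", c[rows, cols].sum())
```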


Constrained optimization and unconstrained optimization

If there are constraints, the problem is a constrained optimization problem (see figure 1.6); otherwise, it is an unconstrained optimization problem (see figure 1.7).

FIG. 1.6 – Constrained optimization with feasible region $x \in [a, b]$.

FIG. 1.7 – Unconstrained optimization.

Global optimization and local optimization

Local optimization seeks a local minimizer, while global optimization seeks a global minimizer. Roughly speaking, at a local minimizer $x$, $f(x)$ reaches the lowest value among all nearby points; in contrast, at a global minimizer $x$, $f(x)$ reaches the lowest value among all feasible points. We will discuss the mathematical definitions of local and global minimizers in chapter 2. See figure 1.8 for an example.

Convex optimization and nonconvex optimization

To introduce convex optimization, we first give the definition of convexity, for both sets and functions.

Definition 1.2.2. A set $S$ is convex if the straight line segment connecting any two points in $S$ lies entirely inside $S$, i.e., for any $x, y \in S$, we have
$$\alpha x + (1 - \alpha) y \in S, \quad \forall\, \alpha \in [0, 1]. \tag{1.4}$$

FIG. 1.8 – A difficult case for global minimization.

Equivalently,
$$y + \alpha (x - y) \in S, \quad \forall\, \alpha \in [0, 1]. \tag{1.5}$$
Note that $\alpha = 0$ and $\alpha = 1$ correspond to the points $y$ and $x$, respectively. See figure 1.9 for some examples of convex sets and nonconvex sets.

FIG. 1.9 – Demonstration of convex sets and nonconvex sets.

Example 1.2.3. Examples of convex sets are as follows.
- Unit ball: $\{y \in \mathbb{R}^n : \|y\|_2 \le 1\}$.
- Hyperplane: $\{x \in \mathbb{R}^n \mid a^T x = b\}$, with $a \in \mathbb{R}^n$, $b \in \mathbb{R}$.
- Polyhedron: a set defined by linear equalities and inequalities, for example $\{x \in \mathbb{R}^n \mid Ax = b,\ Cx \le d\}$, with $A \in \mathbb{R}^{m\times n}$, $C \in \mathbb{R}^{p\times n}$, $b \in \mathbb{R}^m$, $d \in \mathbb{R}^p$.

Some properties of convex sets are as follows.

Proposition 1.2.4.
- If $D_1, \dots, D_k$ are convex sets, then $D = D_1 \cap D_2 \cap \cdots \cap D_k$ is convex.
- If $D_1, D_2$ are convex sets, then $D_1 + D_2 := \{y : y = x + z,\ x \in D_1,\ z \in D_2\}$ is convex.
- If $D$ is convex and $\beta \in \mathbb{R}$, then $\beta D := \{y : y = \beta x,\ x \in D\}$ is convex.

Definition 1.2.5. $f$ is a convex real-valued function if its domain $S \subseteq \mathbb{R}^n$ is a convex set and if for any $x, y \in S$, the following holds
$$f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y), \quad \forall\, \alpha \in [0, 1]. \tag{1.6}$$

See figure 1.10 for the illustration of convex functions.


FIG. 1.10 – Convex function.

Definition 1.2.6. $f$ is strictly convex if its domain $S \subseteq \mathbb{R}^n$ is a convex set and if for any $x, y \in S$ with $x \ne y$,
$$f(\alpha x + (1 - \alpha) y) < \alpha f(x) + (1 - \alpha) f(y), \quad \forall\, \alpha \in (0, 1). \tag{1.7}$$

Example 1.2.7. Examples of convex functions are as follows.
- Linear function: $f(x) = c^T x$, $x \in \mathbb{R}^n$.
- Convex quadratic function: $f(x) = x^T H x$, where $x \in \mathbb{R}^n$ and $H$ is a symmetric positive semidefinite matrix, i.e., $H \succeq 0$.

Remark 1.2.8. $f$ is a concave function if $-f$ is convex.


Some properties of convex functions are given below.

Proposition 1.2.9. If $f_1, f_2$ are convex functions defined on a convex set $D$ and $k > 0$, then $k f_1$ and $f_1 + f_2$ are both convex functions on $D$.

Proposition 1.2.10. Let $D \subseteq \mathbb{R}^n$ be a nonempty open convex set and let $f$ be continuously differentiable on $D$. Then $f$ is convex if and only if for any $x, y \in D$,
$$f(y) \ge f(x) + \nabla f(x)^T (y - x).$$
$f$ is strictly convex on $D$ if and only if the above inequality holds strictly for all $x, y \in D$ with $x \ne y$.

Proposition 1.2.11. Let $D$ be a nonempty open convex set and let $f$ be twice continuously differentiable on $D$. Then $f$ is convex if and only if $\nabla^2 f(x) \succeq 0$ for any $x \in D$. If $\nabla^2 f(x) \succ 0$ (positive definite) for any $x \in D$, then $f$ is strictly convex on $D$.

Convex optimization refers to the problem
$$\min_{x\in\mathbb{R}^n} f(x) \quad \text{s.t.} \quad c_i(x) = 0,\ i \in \mathcal{E};\quad c_i(x) \ge 0,\ i \in \mathcal{I},$$
where $f$ is convex, the equality constraints $c_i$, $i \in \mathcal{E}$, are linear, and the inequality constraints $c_i$, $i \in \mathcal{I}$, are concave.

Remark 1.2.12. For convex optimization, a local minimizer is also a global minimizer.

Stochastic optimization and deterministic optimization

In many cases, the optimization model cannot be fully specified: it may depend on quantities that are unknown at the time of formulation, as in chance-constrained optimization and robust optimization. If the model is completely known, the problem is referred to as deterministic optimization.

Remark 1.2.13. In this book, we focus on convex continuous optimization, both constrained and unconstrained. We only consider the deterministic case and focus on local minimizers. Recent progress in optimization includes sparse optimization (such as compressive sensing [42]), matrix optimization (such as matrix completion [39]), nonsmooth optimization [71] (such as $\ell_1$ optimization and $\ell_p$ optimization, $p \in (0,1)$), and optimization in statistics [41].
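As a quick numerical companion to proposition 1.2.11, convexity of a quadratic $f(x) = x^T H x$ can be tested by checking that the eigenvalues of its constant Hessian are nonnegative; a minimal sketch (the matrices are illustrative):

```python
import numpy as np

def is_convex_quadratic(H, tol=1e-10):
    """f(x) = x^T H x is convex iff its Hessian 2H is positive
    semidefinite (proposition 1.2.11), i.e. the symmetric part of H
    has no negative eigenvalues."""
    Hs = 0.5 * (H + H.T)               # x^T H x only depends on this part
    return bool(np.linalg.eigvalsh(Hs).min() >= -tol)

print(is_convex_quadratic(np.array([[2.0, 0.0], [0.0, 1.0]])))   # True
print(is_convex_quadratic(np.array([[1.0, 0.0], [0.0, -1.0]])))  # False: a saddle
```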


Optimization algorithms work as follows: they generate a sequence
$$x_0, x_1, x_2, \dots, x_k, x_{k+1}, \dots$$
The key point is how to obtain $x_{k+1}$ from $x_k$; this is the main focus of this book. To evaluate the performance of optimization algorithms, the most common measures are robustness, efficiency, and accuracy. Robustness means that the algorithm performs well on a wide variety of problems in its class, for all reasonable values of the starting point. Efficiency means that the algorithm requires little CPU time and storage. Accuracy means that the output of the algorithm is close enough to a solution, without being overly sensitive to errors in the data or to rounding errors in computer arithmetic.

1.3 Preliminaries in Convex Analysis

Let $X$ be a finite-dimensional linear space, such as $\mathbb{R}^n$. We introduce some more details about cones; for further preliminaries in convex analysis, please see [40, 57].

Definition 1.3.1. Let $K$ be a subset of $X$. If $K$ is closed under nonnegative scalar multiplication, i.e., for any $x \in K$ and any $\lambda \ge 0$ we have $\lambda x \in K$, then $K$ is a cone.

Example 1.3.2. $C = \{(x, y) \in \mathbb{R}^2 \mid y = |x|\}$. See figure 1.11.

FIG. 1.11 – A cone C as shown in example 1.3.2.

Definition 1.3.3. If a cone is a convex set, then it is called a convex cone. Example 1.3.4. See figure 1.12 for a convex cone.


FIG. 1.12 – Convex cone.

Example 1.3.5. Five typical classes of cones are as follows.
- The second-order cone, defined by
$$\left\{x = (x_1, \dots, x_n) \in \mathbb{R}^n \;\middle|\; x_n \ge \sqrt{\textstyle\sum_{i=1}^{n-1} x_i^2}\right\};$$
see figure 1.13.
- The symmetric positive semidefinite cone, $S^n_+ = \{A \in S^n \mid A \succeq 0\}$.
- The conditionally positive semidefinite cone [54]: $K^n_+ = \{X \in S^n \mid x^T X x \ge 0 \text{ for all } x \text{ with } x^T e = 0\}$, where $e = (1, \dots, 1)^T \in \mathbb{R}^n$.
- The nonnegative quadrant cone, $\mathbb{R}^n_+ := \{x \in \mathbb{R}^n \mid x_i \ge 0,\ i = 1,\dots,n\}$.
- The positive quadrant of $\mathbb{R}^n$, $\{x \in \mathbb{R}^n \mid x_i > 0,\ i = 1,\dots,n\}$.

Definition 1.3.6. For any nonzero $b \in \mathbb{R}^n$ and any $\beta \in \mathbb{R}$, the sets
$$\{x \in \mathbb{R}^n \mid \langle x, b\rangle \le \beta\}, \qquad \{x \in \mathbb{R}^n \mid \langle x, b\rangle \ge \beta\}$$
are called closed half-spaces.


FIG. 1.13 – Second-order cone.

Definition 1.3.7. The sets
$$\{x \in \mathbb{R}^n \mid \langle x, b\rangle < \beta\}, \qquad \{x \in \mathbb{R}^n \mid \langle x, b\rangle > \beta\}$$
are called open half-spaces.

Example 1.3.8. Every subspace of $\mathbb{R}^n$, and every open or closed half-space through the origin, is a convex cone.

Definition 1.3.9. A vector $x^*$ is said to be normal to a convex set $C$ at a point $a \in C$ if $x^*$ does not make an acute angle with any segment in $C$ having $a$ as an endpoint, i.e., if
$$\langle x - a, x^* \rangle \le 0, \quad \forall\, x \in C.$$
For instance, if $C$ is the half-space $\{x \mid \langle x, b\rangle \le \beta\}$ and $a$ satisfies $\langle a, b\rangle = \beta$, then $b$ is normal to $C$ at $a$, because for any $x \in C$,
$$\langle x - a, b\rangle = \langle x, b\rangle - \langle a, b\rangle \le 0.$$

Definition 1.3.10. The set of all vectors $x^*$ normal to $C$ at $a$ is called the normal cone to $C$ at $a$, denoted $N_C(a)$, i.e.,

$$N_C(a) = \begin{cases} \{d \mid \langle d, z - a\rangle \le 0,\ \forall\, z \in C\}, & a \in C, \\ \emptyset, & \text{otherwise}. \end{cases}$$

Example 1.3.11. Let $C$ be the segment between $A$ and $B$; see figure 1.14 for an illustration. Calculate $N_C(A)$ and $N_C(B)$.

FIG. 1.14 – Example 1.3.11.

Example 1.3.12. Let $C = \mathbb{R}^n_+ = \{x \in \mathbb{R}^n \mid x_i \ge 0,\ i = 1,\dots,n\}$. Consider $N_C(a)$, where (i) $a = 0$; (ii) $a > 0$; (iii) $a_1 = 0$ and $a_i > 0$, $i = 2,\dots,n$.

Definition 1.3.13. The polar $C^\circ$ of a cone $C$ in $\mathbb{R}^n$ is defined as
$$C^\circ = \{y \in \mathbb{R}^n \mid \langle y, x\rangle \le 0,\ \forall\, x \in C\}.$$

Example 1.3.14. Let $K \subseteq \mathbb{R}^2$ be defined as $K = \{(x, y) \mid y = 2x,\ x \ge 0\}$. There is
$$K^\circ = \left\{(x, y) \;\middle|\; y \le -\frac{x}{2}\right\}.$$
See figure 1.15.

Definition 1.3.15. The dual $C^*$ of a cone $C$ in $\mathbb{R}^n$ is defined as
$$C^* = \{y \in \mathbb{R}^n \mid \langle y, x\rangle \ge 0,\ \forall\, x \in C\}.$$
If $C = C^*$, then $C$ is self-dual.

Example 1.3.16. See figure 1.16 for a cone $K$, its polar $K^\circ$ and dual $K^*$.

Remark 1.3.17. It is not difficult to verify that the polar cone and the dual cone satisfy the relationship
$$C^\circ = -C^*.$$

FIG. 1.15 – Cone $K$ and its polar in example 1.3.14.

FIG. 1.16 – $K^\circ$ and $K^*$ in example 1.3.16.

1.4 Exercises

Exercise 1.4.1. Prove proposition 1.2.4.

Exercise 1.4.2. Prove propositions 1.2.9–1.2.11.

Exercise 1.4.3. Find the polar of the following sets.
- $\mathbb{R}^n$;
- $\{0\}$, where $0 \in \mathbb{R}^n$;
- $C = \{(x_1, x_2) \mid x_2 = 2x_1,\ x_1 \ge 0\}$;
- $S^n_+ = \{X \in S^n \mid X \succeq 0\}$.

Exercise 1.4.4. Find the dual of the sets in exercise 1.4.3.

Exercise 1.4.5. Are the cones in exercise 1.4.3 self-dual?

Chapter 2
Fundamentals of Optimization

In this chapter, we introduce the fundamentals of optimization, including different types of solutions and optimality conditions. In particular, we briefly discuss algorithms, such as the line search strategy and the trust region strategy. We also address the convergence of algorithms.

2.1 Unconstrained Optimization Problem

Consider the unconstrained optimization problem
$$\min_{x\in\mathbb{R}^n} f(x),$$
where $f: \mathbb{R}^n \to \mathbb{R}$ is a smooth function, i.e., $f$ is twice continuously differentiable.

Example 2.1.1. (Least squares problem) Given sample data $a_i \in \mathbb{R}^n$ and observations $b_i$, $i = 1,\dots,m$, we look for a parameter $x \in \mathbb{R}^n$ such that $x^T a_i$ matches the observation $b_i$, $i = 1,\dots,m$, as closely as possible. In other words, assume that
$$b_i = a_i^T x + \epsilon_i, \quad i = 1,\dots,m,$$
where $\epsilon_i$, $i = 1,\dots,m$, are random noise. If the noise $\epsilon_i$ is assumed to have a standard Gaussian distribution, maximum likelihood estimation leads to the following least squares model
$$\min_{x\in\mathbb{R}^n} \frac{1}{2}\|Ax - b\|_2^2,$$
where $\|\cdot\|_2$ is the $\ell_2$ norm for vectors, and
$$A = \begin{bmatrix} a_1^T \\ \vdots \\ a_m^T \end{bmatrix} \in \mathbb{R}^{m\times n}, \qquad b = \begin{bmatrix} b_1 \\ \vdots \\ b_m \end{bmatrix} \in \mathbb{R}^m.$$
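A minimal numerical sketch of example 2.1.1 on synthetic data; the sizes and noise level are our own choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 100, 5                       # illustrative sizes with m >> n
A = rng.normal(size=(m, n))         # rows are the samples a_i^T
x_true = rng.normal(size=n)
b = A @ x_true + 0.01 * rng.normal(size=m)   # b_i = a_i^T x + noise

# Minimize (1/2)||Ax - b||_2^2; lstsq solves the least squares problem stably.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(x_ls - x_true))         # small: x recovered up to the noise
```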


Example 2.1.2. In many least squares problems as in example 2.1.1, $m$ is greater than $n$, i.e., $m \ge n$. However, in recent compressive sensing [5, 42], $m \ll n$, i.e., $m$ is far smaller than $n$. In this case, one usually assumes that $x$ is sparse: the number of nonzero components in $x$ is small compared with the large $n$. This leads to the following typical least absolute shrinkage and selection operator (LASSO) model
$$\min_{x\in\mathbb{R}^n} \frac{1}{2}\|Ax - b\|_2^2 + \lambda \|x\|_1,$$
where $\|\cdot\|_1$ is the $\ell_1$ norm for vectors and $\lambda > 0$ is a regularization parameter. The above two examples lead to a natural question: how do we verify that a point $x^*$ is indeed a minimizer of $f$? In many cases, this question is difficult to answer; see figure 2.1 for an example. We address it in the next section.
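Purely as an illustration (the method below, ISTA, is not part of this chapter's text), a minimal proximal-gradient sketch for the LASSO model; the step size, $\lambda$, and data are assumptions:

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1: shrink each component toward zero.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam, iters=500):
    """Minimize (1/2)||Ax - b||_2^2 + lam * ||x||_1 by proximal gradient."""
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)             # gradient of the smooth part
        x = soft_threshold(x - grad / L, lam / L)
    return x

rng = np.random.default_rng(2)
m, n = 30, 100                               # m << n, as in compressive sensing
A = rng.normal(size=(m, n))
x_true = np.zeros(n)
x_true[:4] = [3.0, -2.0, 1.5, 1.0]           # sparse ground truth
x_hat = ista(A, b=A @ x_true, lam=0.1)
print(np.argsort(-np.abs(x_hat))[:4])        # largest entries sit on the true support
```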

FIG. 2.1 – A difficult case for global minimization.

2.2 What is a Solution?

2.2.1 Definitions of Different Solutions

In this part, we introduce the definitions of different solutions in mathematical form, including the global minimizer and different types of local minimizers.

Definition 2.2.1. A point $x^*$ is a global minimizer if $f(x^*) \le f(x)$ for all $x$.

Definition 2.2.2. A point $x^*$ is a local minimizer (or weak local minimizer) if there is a neighborhood $\mathcal{N}$ of $x^*$ such that $f(x^*) \le f(x)$ for all $x \in \mathcal{N}$.


Definition 2.2.3. A point $x^*$ is a strict local minimizer (or strong local minimizer) if there is a neighborhood $\mathcal{N}$ of $x^*$ such that $f(x^*) < f(x)$ for all $x \in \mathcal{N}$ with $x \ne x^*$.

Example 2.2.4. (i) $f(x) = 2$: each point is a weak local minimizer. (ii) $f(x) = (x - 2)^4$: the strict local minimizer is $x^* = 2$. See figure 2.2.

FIG. 2.2 – Graphic representation of $f(x) = (x - 2)^4$.

Definition 2.2.5. A point $x^*$ is an isolated local minimizer if there is a neighborhood $\mathcal{N}$ of $x^*$ such that $x^*$ is the only local minimizer in $\mathcal{N}$.

Remark 2.2.6. Isolated local minimizers are strict local minimizers. However, strict local minimizers are not necessarily isolated local minimizers. We give an example below.

Example 2.2.7. $f(x) = x^4 \cos\frac{1}{x} + 2x^4$, with $f(0) = 0$. See figure 2.3. One can verify that $f$ is twice continuously differentiable and has a strict local minimizer at $x^* = 0$. However, there are strict local minimizers at many nearby points $x_j$; in fact, we can find a sequence $\{x_j\} \to 0$ as $j \to \infty$ such that each $x_j$ is also a strict local minimizer. Therefore, $x^* = 0$ is not an isolated local minimizer.

FIG. 2.3 – Graphic representation of example 2.2.7.

2.2.2 Recognizing a Local Minimum

How can we verify that a point is a local minimum? One way is to check all the points in its neighborhood, which is not practical. The other way is to check the gradient $\nabla f(x^*)$ and the Hessian $\nabla^2 f(x^*)$ if they are available. To that end, we need the following Taylor's theorem.

Theorem 2.2.8. (Taylor's Theorem) Suppose that $f: \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable and that $p \in \mathbb{R}^n$. Then
$$f(x + p) = f(x) + \nabla f(x + tp)^T p, \quad \text{for some } t \in (0, 1).$$
Moreover, if $f$ is twice continuously differentiable, we have that
$$\nabla f(x + p) = \nabla f(x) + \int_0^1 \nabla^2 f(x + tp)\, p \, dt,$$
and that
$$f(x + p) = f(x) + \nabla f(x)^T p + \frac{1}{2} p^T \nabla^2 f(x + tp)\, p, \quad \text{for some } t \in (0, 1).$$

Below we discuss the necessary and sufficient conditions for optimality. Necessary conditions are derived by assuming that $x^*$ is a local minimizer.

Theorem 2.2.9. (First-Order Necessary Conditions) If $x^*$ is a local minimizer and $f$ is continuously differentiable in an open neighborhood of $x^*$, then
$$\nabla f(x^*) = 0.$$


Proof. For contradiction, suppose that $\nabla f(x^*) \ne 0$. Let $p = -\nabla f(x^*)$. Then
$$p^T \nabla f(x^*) = -\|\nabla f(x^*)\|^2 < 0.$$
Since $\nabla f$ is continuous in an open neighborhood of $x^*$, there exists $T > 0$ such that
$$p^T \nabla f(x^* + tp) < 0, \quad \text{for all } t \in (0, T].$$
For any $\bar{t} \in (0, T]$, by Taylor's theorem there is
$$f(x^* + \bar{t} p) = f(x^*) + \bar{t}\, \nabla f(x^* + tp)^T p, \quad \text{for some } t \in (0, \bar{t}).$$
Therefore,
$$f(x^* + \bar{t} p) < f(x^*), \quad \text{for all } \bar{t} \in (0, T].$$
We have found a direction leading away from $x^*$ along which $f$ decreases, which is a contradiction. The proof is finished. □

Definition 2.2.10. We call $x^*$ a stationary point if $\nabla f(x^*) = 0$.

Remark 2.2.11. Any local minimizer of a smooth optimization problem must be a stationary point.

Let $S^n$ be the set of symmetric $n \times n$ matrices. A matrix $B \in S^n$ is positive definite if $p^T B p > 0$ for all $p \ne 0$, denoted $B \succ 0$; $B$ is positive semidefinite if $p^T B p \ge 0$ for all $p$, denoted $B \succeq 0$.

Theorem 2.2.12. (Second-Order Necessary Conditions) If $x^*$ is a local minimizer and $\nabla^2 f$ exists and is continuous in an open neighborhood of $x^*$, then $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succeq 0$.

Proof. For contradiction, assume that $\nabla^2 f(x^*)$ is not positive semidefinite. Then there exists a vector $p$ such that
$$p^T \nabla^2 f(x^*)\, p < 0.$$
Since $\nabla^2 f$ is continuous in an open neighborhood of $x^*$, there exists $T > 0$ such that
$$p^T \nabla^2 f(x^* + tp)\, p < 0, \quad \text{for all } t \in (0, T].$$
By the Taylor expansion around $x^*$, we have, for all $\bar{t} \in (0, T]$ and some $t \in (0, \bar{t})$,
$$f(x^* + \bar{t} p) = f(x^*) + \bar{t}\, \nabla f(x^*)^T p + \frac{1}{2} \bar{t}^2\, p^T \nabla^2 f(x^* + tp)\, p < f(x^*),$$
which is a contradiction. The proof is finished. □


Theorem 2.2.13. (Second-Order Sufficient Conditions) Suppose that $\nabla^2 f$ is continuous in an open neighborhood of $x^*$ and that $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succ 0$. Then $x^*$ is a strict local minimizer of $f$.

Proof. Because the Hessian $\nabla^2 f$ is continuous in an open neighborhood of $x^*$ and $\nabla^2 f(x^*) \succ 0$, we can choose $r > 0$ such that $\nabla^2 f(x) \succ 0$ for all $x \in D$, where
$$D = \{z \mid \|z - x^*\| < r\}.$$
Taking any nonzero vector $p$ with $\|p\| < r$, we have $x^* + p \in D$. Consequently,
$$f(x^* + p) = f(x^*) + \nabla f(x^*)^T p + \frac{1}{2} p^T \nabla^2 f(z)\, p = f(x^*) + \frac{1}{2} p^T \nabla^2 f(z)\, p,$$
where $z = x^* + tp$ for some $t \in (0, 1)$. Since $z \in D$, there is $p^T \nabla^2 f(z)\, p > 0$. Therefore $f(x^* + p) > f(x^*)$ for any $p$ with $0 < \|p\| < r$, i.e., $x^*$ is a strict local minimizer. □

Remark 2.2.14. Sufficient conditions are those that guarantee that $x^*$ is a local minimizer. The second-order sufficient conditions are stronger than the necessary conditions, since they guarantee a strict local minimizer. They are, however, not necessary for a point $x^*$ to be a strict local minimizer: for example, for $f(x) = x^4$, $x^* = 0$ is a strict local minimizer, but $\nabla^2 f(x^*) = 0$.

In terms of convex optimization, we have the following result.

Theorem 2.2.15. When $f$ is convex, any local minimizer $x^*$ is a global minimizer of $f$. If in addition $f$ is differentiable, then any stationary point $x^*$ is a global minimizer of $f$.

Proof. Part I. For contradiction, suppose $x^*$ is a local minimizer but not a global minimizer. Then we can find a point $z \in \mathbb{R}^n$ with $f(z) < f(x^*)$. Consider the line segment that joins $x^*$ to $z$, i.e.,
$$x = \lambda z + (1 - \lambda) x^*, \quad \text{for some } \lambda \in (0, 1].$$
By the convexity of $f$, we have
$$f(x) \le \lambda f(z) + (1 - \lambda) f(x^*) < f(x^*).$$
Any neighborhood $\mathcal{N}$ of $x^*$ contains a piece of the line segment defined above, so there will always be points $x \in \mathcal{N}$ satisfying $f(x) < f(x^*)$, contradicting that $x^*$ is a local minimizer.


Part II. For contradiction, suppose $x^*$ is a stationary point but not a global minimizer. Then we can find a point $z \in \mathbb{R}^n$ with $f(z) < f(x^*)$. By the convexity of $f$ and proposition 1.2.10, we have
$$\nabla f(x^*)^T (z - x^*) \le f(z) - f(x^*) < 0.$$
Therefore $\nabla f(x^*) \ne 0$, contradicting that $x^*$ is a stationary point. □

2.2.3 Nonsmooth Problems

The optimality conditions in section 2.2.2 are suitable for smooth problems. Recall that $f$ is smooth if its second derivatives exist and are continuous. A nonsmooth optimization problem is one with at least one nonsmooth function in the objective or constraints. Some typical nonsmooth functions are given in example 2.2.16.

Example 2.2.16.
- $f(x) = \|r(x)\|_1 = \sum_{i=1}^n |r_i(x)|$, where $r(x) = [r_1(x), \dots, r_n(x)]^T \in \mathbb{R}^n$ is a vector function.
- $f(x) = \|r(x)\|_\infty := \max_{1\le i\le n} |r_i(x)|$.
- $f(x) = \max(f_1(x), f_2(x))$, where $f_1, f_2$ are both smooth functions.

In some cases, nonsmooth problems can be reformulated as smooth optimization problems; for example, the piecewise smooth case can be reformulated as a smooth one.

Example 2.2.17. Consider a problem whose objective is nonsmooth, for example
$$\min_{x\in\mathbb{R}^n} \|x\|_1 \quad \text{s.t.} \quad c_i(x) = 0,\ i \in \mathcal{E};\quad c_i(x) \ge 0,\ i \in \mathcal{I}, \tag{2.1}$$
where $c_i(x)$, $i \in \mathcal{E} \cup \mathcal{I}$, are smooth functions. To deal with the nonsmooth term in the objective, a popular way is to introduce variables $x^+ \in \mathbb{R}^n$ and $x^- \in \mathbb{R}^n$ such that
$$x = x^+ - x^-, \qquad x^+ \ge 0,\ x^- \ge 0.$$
Then (2.1) can be equivalently reformulated as
$$\min_{x^+\in\mathbb{R}^n,\ x^-\in\mathbb{R}^n} \sum_{i=1}^n (x^+)_i + \sum_{i=1}^n (x^-)_i \quad \text{s.t.} \quad c_i(x^+ - x^-) = 0,\ i \in \mathcal{E};\quad c_i(x^+ - x^-) \ge 0,\ i \in \mathcal{I};\quad x^+, x^- \ge 0. \tag{2.2}$$
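A minimal sketch of the splitting (2.2) for the special case $\min \|x\|_1$ s.t. $Ax = b$, using scipy.optimize.linprog; the data are illustrative:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
m, n = 3, 6
A = rng.normal(size=(m, n))
b = rng.normal(size=m)

# Variables z = (x+, x-) >= 0; objective sum(x+) + sum(x-);
# equality constraint A (x+ - x-) = b, exactly as in (2.2).
c = np.ones(2 * n)
res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=b,
              bounds=[(0, None)] * (2 * n))
x = res.x[:n] - res.x[n:]            # recover x = x+ - x-
print(np.allclose(A @ x, b), np.abs(x).sum())
```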


Problem (2.2) is then a smooth optimization problem with variables $x^+$ and $x^-$. Compared with problem (2.1), the number of variables in (2.2) is doubled, and there are extra lower bound constraints $x^+ \ge 0$, $x^- \ge 0$; the benefit of transforming (2.1) into (2.2) is that (2.2) is smooth.

Another way to deal with a nonsmooth term in the objective is to transfer it into a cone constraint by introducing a one-dimensional variable. That is,
$$\min_{x\in\mathbb{R}^n,\ t\in\mathbb{R}} t \quad \text{s.t.} \quad c_i(x) = 0,\ i \in \mathcal{E};\quad c_i(x) \ge 0,\ i \in \mathcal{I};\quad t \ge \|x\|_1. \tag{2.3}$$

The set $\{(t, x) \mid t \ge \|x\|_1,\ x \in \mathbb{R}^n\}$ is a cone; furthermore, it is a convex cone as defined in definition 1.3.3. Therefore, through the cone constraint one can also transform the nonsmooth problem (2.1) into the smooth problem (2.3). This is another technique for dealing with nonsmooth functions; the difficulty in solving (2.3) is how to handle the cone constraint. We refer to [13, 14, 66] for more examples and the latest progress on formulations of type (2.3). We demonstrate with one example the popular way of transforming nonsmooth constraints into smooth ones.

Example 2.2.18. Consider the following problem
$$\min_{x\in\mathbb{R}^2} f(x) \quad \text{s.t.} \quad \|x\|_1 \le 1, \tag{2.4}$$
where $f$ is smooth. We can represent the nonsmooth constraint by four smooth constraints and obtain the following equivalent smooth problem
$$\min_{x\in\mathbb{R}^2} f(x) \quad \text{s.t.} \quad x_1 + x_2 \le 1,\quad x_1 - x_2 \le 1,\quad -x_1 - x_2 \le 1,\quad -x_1 + x_2 \le 1. \tag{2.5}$$
The feasible region of problem (2.4) is shown in figure 2.4. In fact, $\|x\|_1$ is a piecewise smooth function, which is why it can be reformulated into a smooth case.


FIG. 2.4 – The feasible region of problem (2.4).

However, in general it is difficult to transform a nonsmooth problem into a smooth one, and it is therefore difficult to characterize minimizers. See figure 2.5, where the function attains its minimum at the point $x^*$. In such cases, a subgradient or generalized gradient may be used. We refer to the classical book [64] for a discussion of nonsmooth optimization.

FIG. 2.5 – Nonsmooth function with minimum at a kink.

2.3 Overview of Algorithms

Recall that the key point in designing an algorithm is how to move from $x_k$ to $x_{k+1}$. We can use available information $f(x_k)$, or even $f(x_{k-1}), \dots, f(x_0)$. Our aim is to find $x_{k+1}$ such that $f(x_{k+1}) < f(x_k)$. Two strategies for moving from $x_k$ to $x_{k+1}$ are line search and trust region. Below we discuss them one by one.


2.3.1 Line Search Strategy

The main idea of the line search strategy is as follows. At the current iterate $x_k$, we first choose a direction $p_k$, then move along this direction with a step length $\alpha_k$. That is,
$$x_{k+1} = x_k + \alpha_k p_k.$$
The search direction $p_k$ has to be a descent direction, meaning that the function value decreases along this direction within a small step length. For a continuously differentiable function $f$, any direction $p_k$ satisfying $p_k^T \nabla f_k < 0$ is a descent direction, due to the relation
$$f(x_k + \epsilon p_k) = f(x_k) + \epsilon\, p_k^T \nabla f_k + o(\epsilon).$$
The step length $\alpha_k$ can be obtained by exact or inexact line search. Specifically, the step length is determined by
$$\min_{\alpha > 0} \phi(\alpha) := f(x_k + \alpha p_k). \tag{2.6}$$

With $x_k$ and $p_k$ fixed, $\phi$ is a one-dimensional function of $\alpha$; therefore, the line search is also referred to as a one-dimensional search. Exact line search means that $\alpha_k$ is a global minimizer of (2.6): we reach the smallest function value along the search direction $p_k$. In general, it is too expensive, since too many evaluations of $f$ and probably $\nabla f$ are needed. Inexact line search means that $\alpha_k$ achieves an adequate reduction in $f$ at minimal cost. We will discuss line search in more detail in the next chapter.

Search directions for line search methods include the steepest descent direction, Newton's direction, quasi-Newton directions, and conjugate gradient directions, leading to the steepest descent method, Newton's method, quasi-Newton methods, and conjugate gradient methods, respectively. We detail them one by one below.

Steepest descent direction

The steepest descent direction is given by
$$p_k = -\nabla f_k.$$

Theorem 2.3.1. Among all directions, the steepest descent direction is the one along which $f$ decreases most rapidly.

Proof. Note that
$$f(x_k + \alpha p) = f(x_k) + \alpha \nabla f_k^T p + \frac{1}{2} \alpha^2\, p^T \nabla^2 f(x_k + tp)\, p, \quad \text{for some } t \in (0, \alpha).$$
The unit direction $p$ of most rapid decrease is the solution of the problem
$$\min_p \nabla f_k^T p \quad \text{s.t.} \quad \|p\| = 1.$$


Note that
$$\nabla f_k^T p = \|\nabla f_k\| \|p\| \cos\theta = \|\nabla f_k\| \cos\theta,$$
where $\theta$ is the angle between $\nabla f_k$ and $p$. The optimum is achieved when $\cos\theta = -1$, i.e., $\theta = \pi$, which leads to
$$p = -\nabla f_k / \|\nabla f_k\|. \qquad \square$$
Geometrically, the steepest descent direction $p_k$ is orthogonal to the contours of $f$ at $x_k$, as shown in figure 2.6, where $x^*$ is a local minimizer.

FIG. 2.6 – Steepest descent direction for a function of two variables. The advantage of the steepest descent method is the low computational cost since it only requires the calculation of rf . The disadvantage is that it may be excruciatingly slow on difficult problems. Newton’s direction Newton’s direction is given by 2 1 pN k ¼ r f k rf k :

The intuition of Newton’s direction is to approximate f at x k by a quadratic function. That is, 1 f ðx k þ pÞ f ðx k Þ þ pT rf k þ pT r2 f k p :¼ m k ðpÞ: 2 Assume for the moment that r2 f k 0, then we solve the minimization of model function min m k ðpÞ; p

which gives the Newton’s direction.

Modern Optimization Methods

28

Remark 2.3.2. The model function is reliable when f ðx k þ pÞ  m k ðpÞ is not too large. Moreover, 2 T 2 1 rf Tk pN k ¼ rf k r f k rf k   rk krf k k ; N for some rk [ 0. If rf k 6¼ 0, then rf Tk pN k \0, implying that pk is a descent T N direction. Modification is needed in case that r2 f 1 k may not exist, or rf k pk  0.

Remark 2.3.3. The advantage of Newton’s method is that it has a superlinear (or even quadratic) convergence rate as shown later. However, the disadvantage is that it requires the twice continuous differentiability of f and the positive definiteness of r2 f . Moreover, it may involve heavy computational costs due to the calculation of r2 f as well as solving Newton’s direction. Remark 2.3.4. A nonsmooth version of Newton’s method will be introduced in chapter 6. Quasi-Newton directions Quasi-Newton direction is given by pk ¼ B 1 k rf k : The main idea of quasi-Newton methods is as follows. We approximate r2 f k by B k , and B k is updated based on the fact that changes in the gradient g provide information about r2 f along the search direction p. By Taylor’s theorem, there is Z 1  2  2 r f ðx þ tpÞ  r2 f ðxÞ pdt: rf ðx þ pÞ ¼ rf ðxÞ þ r f ðxÞp þ 0

Since rf is continuous, there is Z 1     ½r2 f ðx þ tpÞ  r2 f ðxÞ pdt  ¼ oðkpkÞ: 0

Let x ¼ xk ;

p ¼ xk þ 1  xk:

There is rf k þ 1 ¼ rf k þ r2 f k ðx k þ 1  x k Þ þ oðkx k þ 1  x k kÞ: Consequently, we have r2 f k ðx k þ 1  x k Þ rf k þ 1  rf k : Therefore, B k þ 1 is chosen to satisfy the following secant equation B k þ 1 sk ¼ yk ;

Fundamentals of Optimization

29

where sk ¼ x k þ 1  x k ;

y k ¼ rf k þ 1  rf k :

Additional constraints on B k þ 1 include symmetry and that the difference between B k þ 1 and B k has low rank. Two of the most popular updating formulae of quasi-Newton methods are the symmetric-rank-one (SR1) formula (rank-one update) Bk þ 1 ¼ Bk þ

ðy k  B k s k Þðy k  B k s k ÞT ðy k  B k s k ÞT s k

;

and the BFGS formula (by Broyden, Fletcher, Goldfarb, and Shanno, rank-two update) Bk þ 1 ¼ Bk 

B k s k s Tk B k y k y Tk þ : s Tk B k s k y Tk s k

Note that in quasi-Newton’s direction, we actually need B 1 k when we calculate 1 the search direction. Therefore, let H k ¼ B k . The corresponding update formulae for H k þ 1 are the symmetric-rank-one (SR1) formula Hk þ1 ¼ Hk þ

ðs k  H k y k Þðsk  H k y k ÞT ðs k  H k y k ÞT y k

and the BFGS formula H k þ 1 ¼ ðI  qk s k y Tk ÞH k ðI  qk y k s Tk Þ þ qk s k s Tk ; qk ¼

1 : y Tk s k

The advantages of quasi-Newton methods are as follows. They do not require the calculation of the Hessian r2 f k , and they keep a superlinear rate of convergence under reasonable assumptions. More details can be found in chapter 3. Conjugate gradient directions The conjugate gradient directions take the form pk ¼ rf ðx k Þ þ bk pk1 ; where bk is a scalar to ensure that pk and pk1 are conjugate. Conjugate gradient methods were originally designed to solve linear systems Ax ¼ b; where A 0; which is equivalent to solving the quadratic optimization problem 1 min /ðxÞ ¼ x T Ax  bT x: x 2

Modern Optimization Methods

30

The advantages of conjugate gradient methods are two folds. Firstly, they are much more effective than using the steepest descent direction. Secondly, they need less storage than Newton or quasi-Newton methods, not requiring storage of matrices. We discuss more details of conjugate gradient methods in chapter 5.

2.3.2

Trust Region Strategy

The main idea of the trust region strategy is to approximate $f$ around $x_k$ by a model function $m_k$. Then we obtain $p_k$ by solving
$$\min_p m_k(x_k + p) \quad \text{s.t.} \quad \|p\| \le \Delta,$$
where $\Delta > 0$ is called the trust region radius and $\|\cdot\|$ is usually chosen to be the Euclidean norm. If the solution does not produce a sufficient decrease in $f$, we shrink $\Delta$; otherwise, we increase $\Delta$. The model function is usually a quadratic,
$$m_k(p) = f_k + g_k^T p + \frac{1}{2} p^T B_k p, \tag{2.7}$$

where $B_k$ is either the Hessian $\nabla^2 f_k$ or an approximation to it.

Example 2.3.5. Consider
$$f(x) = 10(x_2 - x_1^2)^2 + (1 - x_1)^2.$$
At $x_k = (0, 1)$, there is
$$\nabla f_k = \begin{pmatrix} -2 \\ 20 \end{pmatrix}, \qquad \nabla^2 f_k = \begin{pmatrix} -38 & 0 \\ 0 & 20 \end{pmatrix}.$$

See figure 2.7 for illustration.

FIG. 2.7 – Two possible trust regions (circles) and their corresponding steps pk in example 2.3.5. The solid lines are contours of the model function m k .


The trust region strategy has a close relationship with the line search strategy. Let $B_k = 0$ and define the trust region using the Euclidean norm. The trust region problem becomes
$$\min_p f_k + \nabla f_k^T p \quad \text{s.t.} \quad \|p\|_2 \le \Delta_k,$$
whose optimal solution is
$$p_k = -\Delta_k \frac{\nabla f_k}{\|\nabla f_k\|}.$$
In this case, the method is equivalent to the steepest descent method with step length $\alpha_k = \Delta_k$. Now let $B_k = \nabla^2 f_k$. If
$$\|p_k^N\| = \|\nabla^2 f_k^{-1} \nabla f_k\| \le \Delta_k,$$
then Newton's direction is also the optimal solution of the trust region subproblem.

2.4 Convergence

After designing the way to generate the iterates $\{x_k\}$, we need to evaluate the performance of the iteration process. A natural question is whether the sequence $\{x_k\}$ converges. We have the following definitions.

Definition 2.4.1. If $\{x_k\}$ generated by an algorithm reaches the optimal solution $x^*$ of the problem either within a finite number of iterations, or $\{x_k\}$ has an accumulation point which is the optimal solution $x^*$, then we say that the algorithm is convergent. In particular, if $\{x_k\}$ converges to the optimal solution $x^*$ only when $x_0$ is close enough to $x^*$, the algorithm is said to be locally convergent. If for any feasible starting point $x_0$, $\{x_k\}$ always converges to the optimal solution $x^*$, the algorithm is said to be globally convergent.

Another way to evaluate the performance of an algorithm is its convergence rate, which is defined as follows.

Definition 2.4.2. Suppose $\{x_k\}$ converges to $x^*$, and
$$\lim_{k\to\infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|} = \beta.$$
If $0 < \beta < 1$, $\{x_k\}$ is said to be linearly convergent. If $\beta = 0$, $\{x_k\}$ is said to be superlinearly convergent. If $\beta = 1$, $\{x_k\}$ is said to be sublinearly convergent.


The above definitions describe different convergence rates, among which the superlinear rate indicates that the sequence $\{x_k\}$ converges to $x^*$ fastest. Below we provide a further characterization of superlinear convergence.

Definition 2.4.3. Suppose $\{x_k\}$ converges to $x^*$. If for some $p \ge 1$,
$$\lim_{k\to\infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|^p} = \beta, \qquad 0 < \beta < +\infty,$$
we say that $\{x_k\}$ is $p$-order convergent. In particular, if $p = 2$, $\{x_k\}$ is said to be quadratically convergent. We will discuss the convergence rate of each type of line search method in this book.
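A small numerical illustration of definitions 2.4.2 and 2.4.3 on the kinds of sequences appearing in exercise 2.6.2; the sample sequences below are our own:

```python
import numpy as np

def ratios(x):
    """Successive ratios x_{k+1} / x_k for a positive sequence tending to 0:
    a limit in (0,1) means linear, 0 superlinear, 1 sublinear convergence."""
    x = np.asarray(x, dtype=float)
    return x[1:] / x[:-1]

k = np.arange(1, 10)
print(ratios(1.0 / k)[-3:])              # tends to 1: sublinear
print(ratios(0.5 ** k)[-3:])             # constant 0.5: linear
print(ratios(0.5 ** (2.0 ** k))[-3:])    # tends to 0: superlinear (here quadratic)
```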

2.5 Scaling

Scaling is very important in optimization. If a problem is poorly scaled, we need to rescale it. Below we first give the definition of poor scaling.

Definition 2.5.1. In unconstrained optimization, a problem is said to be poorly scaled if changes to $x$ in a certain direction produce much larger variations in the objective $f$ than changes to $x$ in another direction.

An example of poor scaling is given as follows.

Example 2.5.2. Consider the problem
$$\min_{x_1\in\mathbb{R},\ x_2\in\mathbb{R}} f(x) = 10^9 x_1^2 + x_2^2.$$
We can see that $f$ is very sensitive to small changes in $x_1$, but not so sensitive to small changes in $x_2$. In this case, we need to rescale before solving the optimization problem.

Diagonal scaling

A popular method to tackle a poorly scaled problem is diagonal scaling. Suppose we have variables of the following magnitudes (each stands for a rate constant of reaction speed in a chemical system):
$$x_1 \approx 10^{-10}, \qquad x_2 \approx x_3 \approx 1, \qquad x_4 \approx 10^5.$$
Diagonal scaling basically includes two steps.

Step 1: Introduce a new variable $z$ by
$$\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} = \begin{pmatrix} 10^{-10} & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 10^5 \end{pmatrix} \begin{pmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \end{pmatrix} = \mathrm{Diag}(10^{-10}, 1, 1, 10^5)\, z.$$


Step 2: Solve the optimization problem in terms of $z$. The optimal values of $z$ will then be within about an order of magnitude of 1.

From the above procedure, we draw the following conclusions. Algorithms that are not sensitive to scaling are preferable, because they handle poor problem formulations more robustly. Scale invariance will be discussed for the methods studied in the following chapters; generally speaking, it is easier to preserve scale invariance for line search methods than for trust region methods.
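A minimal numerical illustration of diagonal scaling on example 2.5.2; the scaling factor is chosen (as our own assumption) so that the scaled problem has an identity-like Hessian:

```python
import numpy as np

# Example 2.5.2: f(x) = 1e9 * x1^2 + x2^2 is poorly scaled in x1.
f = lambda x: 1e9 * x[0] ** 2 + x[1] ** 2
grad_f = lambda x: np.array([2e9 * x[0], 2.0 * x[1]])

# Step 1: substitute x = D z with D = Diag(10^-4.5, 1), so that
# f(D z) = z1^2 + z2^2 is perfectly scaled in z.
d = np.array([10.0 ** -4.5, 1.0])
grad_z = lambda z: d * grad_f(d * z)     # chain rule: grad_z = D grad_x

# Step 2: solve in z; one gradient step with alpha = 0.5 is now exact.
z = np.array([1.0, 1.0])
z = z - 0.5 * grad_z(z)
print(z, d * z)                          # z = 0, hence x = D z = 0, the minimizer
```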

2.6 Exercises

Exercise 2.6.1. The support vector machine (SVM) is an important technique for classification. Review the models and methods used in SVM, pointing out for each model what type of optimization problem it is and which kind of method is proposed to solve it.

Exercise 2.6.2. Find the convergence rate of the following sequences.
- $\{x_k\} = \left\{\frac{1}{k}\right\}$;
- $\{x_k\} = \left\{\frac{1}{k^2}\right\}$;
- $\{x_k\} = \left\{\frac{1}{2^k}\,\frac{1}{c^k}\right\}$, $c > 1$;
- $\{x_k\} = \left\{\frac{1}{k^k}\right\}$;
- $\{x_k\} = \left\{\frac{1}{2^{2^k}}\right\}$.

Exercise 2.6.3. Formulate the LASSO problem of example 2.1.2 as a smooth optimization problem.

Chapter 3
Line Search Methods

Line search methods play an important role in optimization, not only in unconstrained optimization but also in constrained optimization. Consider the unconstrained optimization problem
$$\min_{x\in\mathbb{R}^n} f(x), \tag{3.1}$$
where $f$ is continuously differentiable. In this chapter, we discuss line search methods for solving (3.1), focusing mainly on different strategies to find the step length, as well as on the convergence of line search methods.

3.1 Step Length

For line search methods solving (3.1), the update takes the following form
$$x_{k+1} = x_k + \alpha_k p_k,$$
where $p_k$ is the search direction and $\alpha_k$ is the step length. Recall that for a continuously differentiable function $f$, $p_k$ is a descent direction if $p_k^T \nabla f_k < 0$. As discussed in chapter 2, the general form of $p_k$ can be summarized as
$$p_k = -B_k^{-1} \nabla f_k,$$
where $B_k = I$ gives the steepest descent method, $B_k = \nabla^2 f_k$ gives Newton's method, and $B_k \approx \nabla^2 f_k$ gives quasi-Newton methods, in which $B_k$ is updated by a low rank modification. If $B_k \succ 0$, then
$$p_k^T \nabla f_k = -\nabla f_k^T B_k^{-1} \nabla f_k < 0,$$
i.e., $p_k$ is a descent direction. The step length is determined by
$$\operatorname*{argmin}_{\alpha > 0}\ \phi(\alpha) := f(x_k + \alpha p_k). \tag{3.2}$$



The aim of the step length selection is to obtain a sufficient decrease in $f$. There are in general two ways to determine the step length: exact line search and inexact line search. Roughly speaking, exact line search means that $\alpha_k$ is a global minimizer of problem (3.2); see figure 3.1. A special case of exact line search is given as follows.

FIG. 3.1 – The ideal step length is the global minimizer. Example 3.1.1. Consider the quadratic minimization (Q  0) 1 min f ðxÞ :¼ x T Qx þ bT x: 2

x2IRn

ð3:3Þ

At $x_k$, suppose the search direction is $p_k$; then the exact line search solves (3.2) with
$$\phi(\alpha) = \frac12 p_k^T Q p_k\,\alpha^2 + p_k^T(Qx_k+b)\,\alpha + \frac12 x_k^T Q x_k + b^T x_k.$$
Clearly $\phi(\alpha)$ is a quadratic function of $\alpha$. By the positive definiteness of $Q$, the optimal solution of (3.2) with $\phi(\alpha)$ defined as above is
$$\alpha_k = -\frac{p_k^T(Qx_k+b)}{p_k^T Q p_k} = -\frac{p_k^T\nabla f_k}{p_k^T Q p_k}, \quad (3.4)$$
where $\nabla f_k = Q x_k + b$.
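Formula (3.4) is easy to verify numerically. A minimal sketch, with illustrative data:

```python
import numpy as np

# Exact line search for the quadratic f(x) = 0.5 x^T Q x + b^T x along a
# direction p, following (3.4): alpha = -p^T(Qx + b) / (p^T Q p).
def exact_step(Q, b, x, p):
    grad = Q @ x + b                     # gradient of f at x
    return -(p @ grad) / (p @ (Q @ p))

Q = np.array([[3.0, 0.5], [0.5, 1.0]])   # symmetric positive definite
b = np.array([-1.0, 2.0])
x = np.zeros(2)
p = -(Q @ x + b)                          # steepest descent direction
alpha = exact_step(Q, b, x, p)
print(alpha, x + alpha * p)
```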

However, an exact line search is in general too expensive, since too many evaluations of $f$ and possibly $\nabla f$ are needed. The inexact line search strategy instead requires that $\alpha_k$ achieve an adequate reduction in $f$ at minimal cost. The main idea of inexact line search is to try a sequence of candidate values for $\alpha$ and accept one when certain conditions are satisfied. An inexact line search typically has two phases:
- Bracketing phase: find an interval containing desirable step lengths.
- Bisection or interpolation phase: compute a good step length within this interval.

Below we discuss three inexact line search strategies: the Wolfe conditions, the Goldstein conditions, and the backtracking approach.

3.1.1 The Wolfe Conditions

The first Wolfe condition is the sufficient decrease condition
$$f(x_k+\alpha p_k) \le f(x_k) + c_1\alpha\nabla f_k^T p_k, \quad (3.5)$$
for some $c_1\in(0,1)$. It guarantees that the reduction in $f$ is proportional to both the step length and the directional derivative $\nabla f_k^T p_k$. For the interpretation, let
$$l(\alpha) := f(x_k) + c_1\alpha\nabla f_k^T p_k.$$
One can see that $l(0) = f_k$ and the slope of $l(\alpha)$ is $c_1\nabla f_k^T p_k < 0$. Recalling $\phi(\alpha) = f(x_k+\alpha p_k)$, condition (3.5) means that $\alpha$ is acceptable only if
$$\phi(\alpha) \le l(\alpha).$$
See figure 3.2 for an illustration. However, $\alpha$ may be very small if only the sufficient decrease condition is imposed.

FIG. 3.2 – Sufficient decrease condition.


This brings in the second condition, the curvature condition:
$$\nabla f(x_k+\alpha_k p_k)^T p_k \ge c_2\nabla f_k^T p_k, \quad c_2\in(c_1,1). \quad (3.6)$$
See figure 3.3. Again with $\phi(\alpha) = f(x_k+\alpha p_k)$, (3.6) is exactly the condition
$$\phi'(\alpha_k) \ge c_2\phi'(0).$$

FIG. 3.3 – The curvature condition.

Remark 3.1.2. The curvature condition rules out unacceptably small $\alpha$.
- If $\phi'(\alpha)$ is strongly negative, we have an indication that $f$ can be reduced significantly by moving further along the chosen direction.
- If $\phi'(\alpha)$ is only slightly negative or even positive, it is a sign that we cannot expect much more decrease in $f$ along this direction, and the line search terminates.

The Wolfe conditions are summarized as follows (see figure 3.4):
$$f(x_k+\alpha p_k) \le f(x_k) + c_1\alpha\nabla f_k^T p_k, \quad (3.7)$$
$$\nabla f(x_k+\alpha_k p_k)^T p_k \ge c_2\nabla f_k^T p_k, \quad (3.8)$$
with $0 < c_1 < c_2 < 1$.



FIG. 3.4 – Step lengths satisfying the Wolfe conditions.

Remark 3.1.3. Two points should be noted.
- Usually we choose $c_1 = 10^{-4}$ and $c_2 = 0.9$ for Newton or quasi-Newton methods, and $c_2 = 0.1$ for the nonlinear conjugate gradient method.
- A step length $\alpha_k$ satisfying the Wolfe conditions may not be particularly close to a minimizer of $\phi$.

One enhanced version of the Wolfe conditions is the strong Wolfe conditions, given by
$$f(x_k+\alpha p_k) \le f(x_k) + c_1\alpha\nabla f_k^T p_k, \quad (3.9)$$
$$|\nabla f(x_k+\alpha_k p_k)^T p_k| \le c_2|\nabla f_k^T p_k|, \quad (3.10)$$

with $0 < c_1 < c_2 < 1$. The difference from the Wolfe conditions is that we no longer allow $\phi'(\alpha_k)$ to be too positive, which excludes points far from the stationary points of $\phi$. The following lemma shows that the Wolfe and strong Wolfe conditions are well defined.

Lemma 3.1.4. Suppose that $f:\mathbb{R}^n\to\mathbb{R}$ is continuously differentiable. Let $p_k$ be a descent direction at $x_k$, and assume that $f$ is bounded below along the ray $\{x_k+\alpha p_k \mid \alpha > 0\}$. Then if $0 < c_1 < c_2 < 1$, there exist intervals of step lengths satisfying the Wolfe conditions and the strong Wolfe conditions.

Proof. Step 1. By assumption, $\phi(\alpha) = f(x_k+\alpha p_k)$ is bounded below for all $\alpha > 0$, while the linear function $l(\alpha) := f(x_k) + c_1\alpha\nabla f_k^T p_k$ is unbounded below (recall $c_1\in(0,1)$ and $\nabla f_k^T p_k < 0$). Therefore, $\phi$ and $l$ must intersect at least once. Let $\bar\alpha$ be the smallest intersecting value of $\alpha$, that is,
$$f(x_k+\bar\alpha p_k) = f(x_k) + c_1\bar\alpha\nabla f_k^T p_k. \quad (3.11)$$


The sufficient decrease condition (3.7) then holds for all step lengths $\alpha < \bar\alpha$.

Step 2. By the mean value theorem, there exists $\hat\alpha\in(0,\bar\alpha)$ such that
$$f(x_k+\bar\alpha p_k) - f(x_k) = \bar\alpha\,\nabla f(x_k+\hat\alpha p_k)^T p_k. \quad (3.12)$$
Combining (3.11) and (3.12) gives
$$\nabla f(x_k+\hat\alpha p_k)^T p_k = c_1\nabla f_k^T p_k > c_2\nabla f_k^T p_k. \quad (3.13)$$
Since $c_1 < c_2$ and $\nabla f_k^T p_k < 0$, $\hat\alpha$ satisfies the Wolfe conditions, and the inequalities hold strictly in both (3.7) and (3.8).

Step 3. By the smoothness assumption on $f$, there is an interval around $\hat\alpha$ in which the Wolfe conditions hold. Moreover, since
$$\nabla f(x_k+\hat\alpha p_k)^T p_k = c_1\nabla f_k^T p_k < 0,$$
the strong Wolfe conditions hold in the same interval. □

Remark 3.1.5. The Wolfe conditions are scale-invariant in a broad sense. Multiplying f by a constant or making an affine change of variables does not alter them.
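As a quick sanity check, the Wolfe conditions (3.7)–(3.8) are straightforward to test numerically. A minimal sketch; the quadratic objective below is an illustrative choice:

```python
import numpy as np

# Test whether a trial step alpha satisfies the Wolfe conditions (3.7)-(3.8).
def satisfies_wolfe(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    d = grad(x) @ p                                      # must be < 0 (descent)
    sufficient = f(x + alpha * p) <= f(x) + c1 * alpha * d
    curvature = grad(x + alpha * p) @ p >= c2 * d
    return sufficient and curvature

# Example on f(x) = 0.5 ||x||^2 along the steepest descent direction.
f = lambda x: 0.5 * x @ x
grad = lambda x: x
x = np.array([2.0, -1.0])
p = -grad(x)
print([a for a in (0.05, 0.1, 0.5, 1.0, 1.9) if satisfies_wolfe(f, grad, x, p, a)])
```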

3.1.2 The Goldstein Conditions

The Goldstein conditions are given by
$$f(x_k) + (1-c)\,\alpha_k\nabla f_k^T p_k \le f(x_k+\alpha_k p_k) \le f(x_k) + c\,\alpha_k\nabla f_k^T p_k, \quad (3.14)$$
with $0 < c < \frac12$. The second inequality is the sufficient decrease condition, while the first inequality controls the step length from below (see figure 3.5). We have the following remark regarding the Goldstein conditions.

FIG. 3.5 – Goldstein conditions.


Remark 3.1.6. Goldstein conditions and Wolfe conditions have much in common, and their convergence theories are quite similar. The Goldstein conditions are often used in Newton-type methods and are not well suited for quasi-Newton methods that maintain a positive definite Hessian approximation. A disadvantage compared with Wolfe conditions is that Goldstein conditions may exclude all minimizers of /.

3.1.3 Sufficient Decrease and Backtracking

Recall the sufficient decrease condition
$$f(x_k+\alpha p_k) \le f(x_k) + c_1\alpha\nabla f_k^T p_k.$$
The sufficient decrease condition alone is not enough to ensure that the algorithm makes reasonable progress along the given direction. However, with the backtracking approach, the sufficient decrease condition by itself suffices to terminate the line search procedure. The backtracking line search is given as follows.

Algorithm 3.1.7. (Backtracking Line Search)
Choose $\bar\alpha > 0$, $\rho\in(0,1)$, $c\in(0,1)$; set $\alpha\leftarrow\bar\alpha$.
Repeat until $f(x_k+\alpha p_k) \le f(x_k) + c\,\alpha\nabla f_k^T p_k$:
  $\alpha\leftarrow\rho\alpha$;
end (repeat). Terminate with $\alpha_k = \alpha$.

Usually, the initial step length $\bar\alpha$ is chosen to be 1 in Newton and quasi-Newton methods, but it may take different values in other algorithms. An acceptable $\alpha_k$ is found after a finite number of trials, since the sufficient decrease condition eventually holds for all sufficiently small $\alpha$. The contraction factor $\rho$ can vary at each iteration, i.e., $\rho\in[\rho_{lo},\rho_{hi}]$ with $0 < \rho_{lo} < \rho_{hi} < 1$. The accepted step $\alpha_k$ is either the fixed value $\bar\alpha$ or a value small enough to satisfy the sufficient decrease condition. Another important fact is that if $\alpha_k \ne \bar\alpha$, the previous trial $\alpha_k/\rho$ was rejected for violating the sufficient decrease condition, implying that
$$f\!\left(x_k+\frac{\alpha_k}{\rho}p_k\right) > f(x_k) + c_1\frac{\alpha_k}{\rho}\nabla f_k^T p_k.$$
This fact is very important in proving the global convergence of line search methods.

Remark 3.1.8. In this part, we only discuss line search strategies that lead to a sufficient decrease in the function value, i.e., the sequence $\{x_k\}$ satisfies
$$f(x_0) > f(x_1) > \cdots > f(x_k) > \cdots \quad (3.15)$$

However, there are also line search strategies that do not follow (3.15). Such a strategy is called a nonmonotone line search scheme, which leads to a nonmonotone


sequence of function values. Readers can refer to [21, 75] for more details and some typical examples of nonmonotone line search strategies.
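A minimal sketch of algorithm 3.1.7; the Rosenbrock test function is an illustrative choice:

```python
import numpy as np

# Backtracking line search (algorithm 3.1.7).
def backtracking(f, grad, x, p, alpha_bar=1.0, rho=0.5, c=1e-4):
    alpha = alpha_bar
    d = grad(x) @ p                       # directional derivative, < 0 for descent
    while f(x + alpha * p) > f(x) + c * alpha * d:
        alpha *= rho                      # shrink until sufficient decrease holds
    return alpha

# Example: one backtracked gradient step on the Rosenbrock function.
def f(x):
    return (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2

def grad(x):
    return np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0] ** 2),
                     200 * (x[1] - x[0] ** 2)])

x = np.array([-1.2, 1.0])
p = -grad(x)
alpha = backtracking(f, grad, x, p)
print(alpha, f(x), f(x + alpha * p))    # f decreases at the accepted step
```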

3.2 Convergence of Line Search Methods

In this section, we discuss the convergence of line search methods. To that end, we first define $\theta_k$ to be the angle between $p_k$ and the steepest descent direction $-\nabla f_k$, that is,
$$\cos\theta_k = \frac{-\nabla f_k^T p_k}{\|\nabla f_k\|\,\|p_k\|}. \quad (3.16)$$

We have the following convergence result under the Wolfe conditions.

Theorem 3.2.1. Consider any iteration of the form
$$x_{k+1} = x_k + \alpha_k p_k, \quad (3.17)$$
where $p_k$ is a descent direction and $\alpha_k$ satisfies the Wolfe conditions. Suppose that $f$ is bounded below in $\mathbb{R}^n$ and continuously differentiable in an open set $\mathcal N$ containing the level set $\mathcal L := \{x\in\mathbb{R}^n : f(x) \le f(x_0)\}$, where $x_0$ is the starting point of the iteration. Assume also that the gradient $\nabla f$ is Lipschitz continuous on $\mathcal N$; that is, there exists a constant $L > 0$ such that
$$\|\nabla f(x) - \nabla f(\tilde x)\| \le L\|x - \tilde x\|, \quad \forall\, x, \tilde x\in\mathcal N. \quad (3.18)$$
Then
$$\sum_{k\ge 0}\cos^2\theta_k\,\|\nabla f_k\|^2 < \infty. \quad (3.19)$$

The theorem, due to Zoutendijk, has far-reaching consequences, as we will see later.

Proof. From (3.8) and (3.17), we have
$$(\nabla f_{k+1} - \nabla f_k)^T p_k \ge (c_2 - 1)\nabla f_k^T p_k.$$
The Lipschitz condition implies that
$$(\nabla f_{k+1} - \nabla f_k)^T p_k \le \alpha_k L\|p_k\|^2.$$
We obtain
$$\alpha_k \ge \frac{c_2-1}{L}\,\frac{\nabla f_k^T p_k}{\|p_k\|^2}.$$
Substituting this into the first Wolfe condition (3.7), we get
$$f_{k+1} \le f_k - c_1\frac{1-c_2}{L}\,\frac{(\nabla f_k^T p_k)^2}{\|p_k\|^2}. \quad (3.20)$$


Recalling the definition of $\theta_k$ in (3.16), we can rewrite inequality (3.20) as
$$f_{k+1} \le f_k - c'\cos^2\theta_k\|\nabla f_k\|^2,$$
where $c' = c_1(1-c_2)/L$. Summing over all indices less than or equal to $k$, we obtain
$$f_{k+1} \le f_0 - c'\sum_{j=0}^{k}\cos^2\theta_j\|\nabla f_j\|^2.$$
Since $f$ is bounded below, there exists $u > 0$ such that $f_0 - f_{k+1} < u$ for all $k$. This implies that
$$\sum_{k\ge 0}\cos^2\theta_k\|\nabla f_k\|^2 < \infty. \qquad \square$$

Similar results hold when the Goldstein or strong Wolfe conditions are used. We call inequality (3.19) the Zoutendijk condition. The assumptions of the theorem are not too restrictive: if $f$ were not bounded below, the optimization problem would not be well defined, and the smoothness assumption (Lipschitz continuity of the gradient) is implied by many standard smoothness conditions. The Zoutendijk condition implies that
$$\cos^2\theta_k\|\nabla f_k\|^2 \to 0, \quad (3.21)$$
which can be used to derive global convergence of line search algorithms. If there is $\delta > 0$ such that
$$\cos\theta_k \ge \delta > 0, \quad \forall\, k,$$
i.e., the angle $\theta_k$ is bounded away from $\frac{\pi}{2}$, it follows from (3.21) that
$$\lim_{k\to\infty}\|\nabla f_k\| = 0.$$
Below we give the definition of global convergence.

Definition 3.2.2. An algorithm is globally convergent if
$$\lim_{k\to\infty}\|\nabla f_k\| = 0. \quad (3.22)$$

Remark 3.2.3. For line search methods, (3.22) is the strongest global convergence result that can be obtained: it only guarantees that the iterates are attracted to a stationary point, not necessarily a minimizer. To guarantee convergence to a local minimum, additional requirements are needed, such as information from $\nabla^2 f(x_k)$.


Below we briefly discuss global convergence for different types of line search methods. For the steepest descent method, $\cos\theta_k = 1 > 0$, so steepest descent with Wolfe or Goldstein line search is globally convergent. For Newton-like methods, assume that the matrices $B_k$ are positive definite with a uniformly bounded condition number, i.e., there is a constant $M > 0$ such that
$$\|B_k\|\,\|B_k^{-1}\| \le M, \quad \forall\, k.$$
Then
$$\cos\theta_k \ge 1/M, \quad (3.23)$$
implying $\lim_{k\to\infty}\|\nabla f_k\| = 0$. In other words, Newton and quasi-Newton methods are globally convergent if the $B_k$ have a bounded condition number and the step lengths satisfy the Wolfe or Goldstein conditions. For conjugate gradient methods, only the weaker result
$$\liminf_{k\to\infty}\|\nabla f_k\| = 0 \quad (3.24)$$
can be proved. The sketch of the proof is as follows: suppose for contradiction that (3.24) fails, so that $\|\nabla f_k\| \ge \gamma > 0$ for all $k$. The Zoutendijk condition (3.21) then forces $\cos\theta_k \to 0$. One then shows that a subsequence of $\{\cos\theta_k\}$ is bounded away from zero, which gives the contradiction [121].

3.3 Rate of Convergence

For an algorithm, we would expect that it incorporates both of the following properties: global convergence and a rapid rate of convergence. However, sometimes the two properties conflict with each other. In this part, we will consider the rate of convergence for the steepest descent method, Newton’s method, and quasi-Newton methods.

3.3.1 Steepest Descent Method

Consider the ideal case where $f$ is quadratic and the line search is exact. That is,
$$\min_{x\in\mathbb{R}^n} f(x) := \frac12 x^T Q x - b^T x, \quad (3.25)$$
where $Q$ is symmetric and $Q \succ 0$. Then $\nabla f(x) = Qx - b$, and the minimizer $x^*$ of (3.25) is the unique solution of the linear system $Qx = b$.


Let $\alpha_k$ be computed by exact line search (recall that $p_k = -\nabla f_k$), i.e.,
$$\min_{\alpha>0}\phi(\alpha) := f(x_k - \alpha\nabla f_k),$$
where
$$\phi(\alpha) = \frac12(x_k-\alpha\nabla f_k)^T Q (x_k-\alpha\nabla f_k) - b^T(x_k-\alpha\nabla f_k).$$
Solving $\phi'(\alpha) = 0$, we obtain
$$\alpha_k = \frac{\nabla f_k^T\nabla f_k}{\nabla f_k^T Q\nabla f_k} \quad (\text{note that }\alpha_k \ge 0). \quad (3.26)$$
The steepest descent iteration with exact line search for (3.25) is
$$x_{k+1} = x_k - \frac{\nabla f_k^T\nabla f_k}{\nabla f_k^T Q\nabla f_k}\nabla f_k. \quad (3.27)$$

Since $\nabla f_k = Qx_k - b$, the update yields a closed-form expression for $x_{k+1}$ in terms of $x_k$. As shown in figure 3.6, the contours of $f$ are ellipsoids, and a zigzag phenomenon toward the solution may occur.

FIG. 3.6 – Steepest descent steps.

To see the convergence rate, we introduce the weighted norm $\|x\|_Q^2 = x^T Q x$. Let $x^*$ be the optimal solution of (3.25). Since $Qx^* = b$, there is
$$\frac12\|x - x^*\|_Q^2 = f(x) - f(x^*).$$
With the update formula (3.27) and $\nabla f_k = Q(x_k - x^*)$, we derive that
$$\|x_{k+1}-x^*\|_Q^2 = \left\{1 - \frac{(\nabla f_k^T\nabla f_k)^2}{(\nabla f_k^T Q\nabla f_k)(\nabla f_k^T Q^{-1}\nabla f_k)}\right\}\|x_k - x^*\|_Q^2. \quad (3.28)$$

It describes the exact decrease in $f$ at each iteration, but the term inside the braces is difficult to interpret. In fact, we have the following result.

Theorem 3.3.1. When the steepest descent method with exact line search (3.26) is applied to the strongly convex quadratic function (3.25), the error norm satisfies
$$\|x_{k+1}-x^*\|_Q^2 \le \left(\frac{\lambda_n-\lambda_1}{\lambda_n+\lambda_1}\right)^2\|x_k-x^*\|_Q^2, \quad (3.29)$$
where $0 < \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$ are the eigenvalues of $Q$.

See Luenberger [34] for more details. It can be seen that $f_k$ converges to $f^*$ at a linear rate. If $\lambda_1 = \cdots = \lambda_n$, only one iteration is needed. In general, as the condition number $\kappa(Q) := \lambda_n/\lambda_1$ increases, the contours of the quadratic become more elongated, the zigzagging becomes more pronounced, and (3.29) implies that the convergence degrades.

Theorem 3.3.2. Suppose that $f:\mathbb{R}^n\to\mathbb{R}$ is twice continuously differentiable, and that the iterates generated by the steepest descent method with exact line search converge to a point $x^*$ at which the Hessian matrix $\nabla^2 f(x^*) \succ 0$. Let $r$ be any scalar satisfying
$$r\in\left(\frac{\lambda_n-\lambda_1}{\lambda_n+\lambda_1},\,1\right),$$
where $\lambda_1 \le \cdots \le \lambda_n$ are the eigenvalues of $\nabla^2 f(x^*)$. Then for all $k$ sufficiently large, we have
$$f(x_{k+1}) - f(x^*) \le r^2\left(f(x_k) - f(x^*)\right).$$
In general, we cannot expect the rate of convergence to improve when an inexact line search is used. The steepest descent method can have an unacceptably slow rate of convergence, even when the Hessian is reasonably well conditioned. For example, with $\kappa(Q) = 800$, $f(x_1) = 1$, and $f(x^*) = 0$, the steepest descent method with exact line search still gives $f(x_k) \approx 0.08$ after $k = 1000$ iterations.
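The linear rate and its degradation with $\kappa(Q)$ are easy to reproduce numerically. A minimal sketch of iteration (3.27); the matrix below is an illustrative choice:

```python
import numpy as np

# Steepest descent with the exact step (3.26)-(3.27) on a quadratic
# f(x) = 0.5 x^T Q x - b^T x; convergence slows as kappa(Q) grows.
def steepest_descent(Q, b, x0, iters):
    x = x0.copy()
    for _ in range(iters):
        g = Q @ x - b                        # gradient / residual
        alpha = (g @ g) / (g @ (Q @ g))      # exact step length (3.26)
        x = x - alpha * g                    # update (3.27)
    return x

Q = np.diag([1.0, 100.0])                    # kappa(Q) = 100
b = np.zeros(2)
x = steepest_descent(Q, b, np.array([100.0, 1.0]), 50)
print(np.linalg.norm(x))                     # error decreases only linearly
```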

3.3.2 Newton's Method

Newton's direction takes the form
$$p_k^N = -\nabla^2 f_k^{-1}\nabla f_k. \quad (3.30)$$


An important fact is that if $\nabla^2 f(x^*) \succ 0$ and $\nabla^2 f$ is Lipschitz continuous, then $\nabla^2 f(x) \succ 0$ for all $x$ in a neighborhood of $x^*$. We have the following result.

Theorem 3.3.3. Suppose that $f:\mathbb{R}^n\to\mathbb{R}$ is twice continuously differentiable, and that $\nabla^2 f(x)$ is Lipschitz continuous in a neighborhood of a solution $x^*$ at which the sufficient conditions (theorem 2.2.13) are satisfied. Consider the iteration $x_{k+1} = x_k + p_k$, where $p_k$ is given by (3.30). Then:
- if the starting point $x_0$ is sufficiently close to $x^*$, the sequence of iterates converges to $x^*$;
- the rate of convergence of $\{x_k\}$ is quadratic;
- the sequence of gradient norms $\{\|\nabla f_k\|\}$ converges quadratically to zero.

Proof. By the optimality condition $\nabla f(x^*) = 0$, there is
$$x_k + p_k^N - x^* = x_k - x^* - \nabla^2 f_k^{-1}\nabla f_k = \nabla^2 f_k^{-1}\left[\nabla^2 f_k(x_k-x^*) - (\nabla f_k - \nabla f^*)\right]. \quad (3.31)$$
By Taylor's theorem, we have
$$\nabla f_k - \nabla f^* = \int_0^1\nabla^2 f(x_k + t(x^*-x_k))(x_k-x^*)\,dt.$$
Therefore,
$$\begin{aligned}
\|\nabla^2 f_k(x_k-x^*) - (\nabla f_k-\nabla f^*)\| &= \left\|\int_0^1\left[\nabla^2 f(x_k)-\nabla^2 f(x_k+t(x^*-x_k))\right](x_k-x^*)\,dt\right\|\\
&\le \int_0^1\|\nabla^2 f(x_k)-\nabla^2 f(x_k+t(x^*-x_k))\|\,\|x_k-x^*\|\,dt\\
&\le \|x_k-x^*\|^2\int_0^1 Lt\,dt = \frac{L}{2}\|x_k-x^*\|^2, \quad (3.32)
\end{aligned}$$
where $L > 0$ is the Lipschitz constant of $\nabla^2 f(x)$ for $x$ near $x^*$, and $\nabla f^*$ denotes $\nabla f(x^*)$. The nonsingularity and continuity of $\nabla^2 f(x^*)$ imply that there is $r > 0$ such that
$$\|\nabla^2 f_k^{-1}\| \le 2\|\nabla^2 f(x^*)^{-1}\|$$
for all $x_k$ with $\|x_k - x^*\| \le r$. Substituting into (3.31) and (3.32), there is
$$\|x_k + p_k^N - x^*\| \le L\|\nabla^2 f(x^*)^{-1}\|\,\|x_k-x^*\|^2 = \tilde L\|x_k-x^*\|^2,$$
where $\tilde L = L\|\nabla^2 f(x^*)^{-1}\|$. Choosing $x_0$ so that
$$\|x_0 - x^*\| \le \min\left(r,\,\frac{1}{2\tilde L}\right),$$


we can show that $\{x_k\}\to x^*$ with a quadratic rate of convergence.
Recalling $x_{k+1}-x_k = p_k^N$ and $\nabla f_k + \nabla^2 f_k p_k^N = 0$, we have
$$\begin{aligned}
\|\nabla f(x_{k+1})\| &= \|\nabla f(x_{k+1}) - \nabla f_k - \nabla^2 f(x_k)p_k^N\|\\
&= \left\|\int_0^1\nabla^2 f(x_k + tp_k^N)p_k^N\,dt - \nabla^2 f(x_k)p_k^N\right\|\\
&\le \int_0^1\|\nabla^2 f(x_k+tp_k^N) - \nabla^2 f(x_k)\|\,\|p_k^N\|\,dt\\
&\le \frac{L}{2}\|p_k^N\|^2 \le \frac{L}{2}\|\nabla^2 f(x_k)^{-1}\|^2\|\nabla f_k\|^2 \le 2L\|\nabla^2 f(x^*)^{-1}\|^2\|\nabla f_k\|^2,
\end{aligned}$$
proving that the gradient norms converge to zero quadratically. The proof is finished. □

Remark 3.3.4. The step length $\alpha_k = 1$ will eventually be accepted under the Wolfe conditions as $x_k$ approaches the solution. For this reason, the initial step length is usually chosen as $\bar\alpha = 1$ in line searches for Newton's method.
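The quadratic convergence of theorem 3.3.3 is easy to observe numerically. A minimal sketch of the pure Newton iteration; the test function is an illustrative choice:

```python
import numpy as np

# Pure Newton iteration x_{k+1} = x_k + p_k^N; near x* the gradient norm
# roughly squares at every step (theorem 3.3.3).
def newton(grad, hess, x0, iters=6):
    x = x0.copy()
    for _ in range(iters):
        p = np.linalg.solve(hess(x), -grad(x))   # Newton direction (3.30)
        x = x + p
        print(np.linalg.norm(grad(x)))           # watch the norm square each step
    return x

# Example: f(x) = sum(exp(x_i) - x_i), minimized at x = 0.
grad = lambda x: np.exp(x) - 1.0
hess = lambda x: np.diag(np.exp(x))
newton(grad, hess, np.array([1.0, -0.5]))
```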

3.3.3 Quasi-Newton Methods

Recall the search direction for quasi-Newton methods,
$$p_k = -B_k^{-1}\nabla f_k, \quad (3.33)$$
where $B_k$ is updated at each iteration by a quasi-Newton update formula. The following result shows that if $p_k$ approximates the Newton direction well enough, the unit step length satisfies the Wolfe conditions as the iterates converge to the solution. It also specifies a condition that the search direction must satisfy in order to give rise to a superlinearly convergent iteration.

Theorem 3.3.5. Suppose that $f:\mathbb{R}^n\to\mathbb{R}$ is twice continuously differentiable. Consider the iteration $x_{k+1} = x_k + \alpha_k p_k$, where $p_k$ is a descent direction and $\alpha_k$ satisfies the Wolfe conditions with $c_1 \le 1/2$. If the sequence $\{x_k\}$ converges to a point $x^*$ such that $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succ 0$, and if the search direction satisfies
$$\lim_{k\to\infty}\frac{\|\nabla f_k + \nabla^2 f_k p_k\|}{\|p_k\|} = 0, \quad (3.34)$$
then the following hold:
- the step length $\alpha_k = 1$ is admissible for all $k$ greater than a certain index $k_0$; and
- if $\alpha_k = 1$ for all $k \ge k_0$, then $\{x_k\}$ converges to $x^*$ superlinearly.


Remark 3.3.6. If $c_1 > 1/2$, the line search would exclude the minimizer of a quadratic function, and unit step lengths may not be admissible.

If $p_k$ is a quasi-Newton search direction as in (3.33), condition (3.34) is equivalent to
$$\lim_{k\to\infty}\frac{\|(B_k - \nabla^2 f(x^*))p_k\|}{\|p_k\|} = 0. \quad (3.35)$$

This implies the following:
- A superlinear convergence rate can be attained even if the sequence of quasi-Newton matrices $\{B_k\}$ does not converge to $\nabla^2 f(x^*)$.
- It suffices that $B_k$ become increasingly accurate approximations of $\nabla^2 f(x^*)$ along the search directions $p_k$.
- Condition (3.35) is both necessary and sufficient for the superlinear convergence of quasi-Newton methods.

Theorem 3.3.7. Suppose that $f:\mathbb{R}^n\to\mathbb{R}$ is twice continuously differentiable. Consider the iteration $x_{k+1} = x_k + p_k$ (that is, the step length $\alpha_k$ is uniformly 1), where $p_k$ is given by (3.33). Assume that $\{x_k\}$ converges to a point $x^*$ such that $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succ 0$. Then $\{x_k\}$ converges to $x^*$ superlinearly if and only if (3.35) holds.

Proof. Step 1. We first show that (3.35) is equivalent to
$$p_k - p_k^N = o(\|p_k\|), \quad (3.36)$$
where $p_k^N = -\nabla^2 f_k^{-1}\nabla f_k$ is the Newton direction. Assuming that (3.35) holds, we have
$$\begin{aligned}
p_k - p_k^N &= \nabla^2 f_k^{-1}\left(\nabla^2 f_k p_k + \nabla f_k\right) = \nabla^2 f_k^{-1}\left(\nabla^2 f_k - B_k\right)p_k\\
&= O\!\left(\|(\nabla^2 f_k - B_k)p_k\|\right) = o(\|p_k\|),
\end{aligned}$$
where we used that $\|\nabla^2 f_k^{-1}\|$ is bounded above for $x_k$ sufficiently close to $x^*$, since $\nabla^2 f(x^*) \succ 0$. The converse follows readily by multiplying both sides of $p_k - p_k^N = o(\|p_k\|)$ by $\nabla^2 f_k$ and recalling $p_k = -B_k^{-1}\nabla f_k$.

Step 2. Combining
$$\|x_k + p_k^N - x^*\| \le \tilde L\|x_k - x^*\|^2$$
and (3.36), we obtain
$$\|x_k + p_k - x^*\| \le \|x_k + p_k^N - x^*\| + \|p_k - p_k^N\| = O(\|x_k - x^*\|^2) + o(\|p_k\|). \quad (3.37)$$


Note that $\|p_k\| = O(\|x_k - x^*\|)$, due to the fact that
$$\|p_k\| \le \|p_k - p_k^N\| + \|p_k^N\| = o(\|p_k\|) + \|\nabla^2 f(x_k)^{-1}(\nabla f_k - \nabla f(x^*))\| \le o(\|p_k\|) + O(\|x_k - x^*\|).$$
We then get
$$\|x_k + p_k - x^*\| \le o(\|x_k - x^*\|),$$
giving the superlinear convergence result. The proof is finished. □

3.4 Exercises

Exercise 3.4.1. Compare the advantages and disadvantages of exact and inexact line search, illustrating them with examples.

Exercise 3.4.2. Compare the advantages and disadvantages of the steepest descent method, Newton's method, quasi-Newton methods, and conjugate gradient methods.

Exercise 3.4.3. Try to show the global convergence of a line search method with backtracking line search.

Exercise 3.4.4. Prove the result in (3.26).

Exercise 3.4.5. Show that if $0 < c_2 < c_1 < 1$, there may be no step lengths that satisfy the Wolfe conditions.

Exercise 3.4.6. Show that $\|Bx\| \ge \|x\|/\|B^{-1}\|$ for any nonsingular matrix $B$. Use this fact to establish (3.23).

Exercise 3.4.7. Show that the one-dimensional minimizer of a strongly convex quadratic function always satisfies the Goldstein conditions (3.14).

Chapter 4
Trust Region Methods

Trust region methods were originally proposed by Powell [51]. In this chapter, we introduce the main idea of trust region methods, different ways to solve the subproblems, and the convergence of trust region methods [61]. Consider the unconstrained minimization problem
$$\min_{x\in\mathbb{R}^n} f(x),$$
where $f:\mathbb{R}^n\to\mathbb{R}$ is twice continuously differentiable. The main idea of the trust region approach is as follows: define a trust region around the current iterate, and approximate $f$ there by a model function $m_k(\cdot)$. That is, to obtain a step $p_k$, solve the subproblem
$$\min_{p\in\mathbb{R}^n} m_k(p) \quad \text{s.t.}\ \|p\| \le \Delta_k, \quad (4.1)$$
where $m_k(\cdot)$ is the approximation of $f$ at iteration $k$, and $\Delta_k$ is the radius of the trust region. Here $\|\cdot\|$ denotes some norm, usually chosen as the $\ell_2$ norm. In other words, the direction and the length of the step are chosen simultaneously, as the approximate minimizer of the model function within this region. If a step is not acceptable, reduce the size of the region and find a new minimizer. The key ingredients of trust region methods are the size of the trust region, the quadratic model function $m_k$, and the step $p_k$. We address each of them below.

The size of the trust region is critical to the effectiveness of each step. If it is too small, the algorithm may miss an opportunity to take a substantial step toward the minimizer of $f$; if it is too large, the minimizer of the model may be far from the minimizer of $f$. How should the size be chosen? It is based on the performance during previous iterations. "Good" means the model is reliable and gives a good estimate of $f$, so we may increase the size of the trust region; "bad" means the model is an inadequate representation of $f$, so the size should be reduced.

For the model function $m_k$, assume that $m_k$ is a quadratic function. By Taylor's expansion around $x_k$, one has

$$f(x_k+p) = f_k + g_k^T p + \frac12 p^T\nabla^2 f(x_k + tp)\,p, \quad \text{for some } t\in(0,1). \quad (4.2)$$
Let $B_k \approx \nabla^2 f(x_k + tp)$; $m_k$ can then be defined as
$$m_k(p) = f_k + g_k^T p + \frac12 p^T B_k p, \quad (4.3)$$

where $B_k$ is symmetric and
$$|m_k(p) - f(x_k+p)| = O(\|p\|^2),$$
which is small when $\|p\|$ is small. At each iteration, we solve the subproblem
$$\min_{p\in\mathbb{R}^n} m_k(p) = f_k + g_k^T p + \frac12 p^T B_k p \quad \text{s.t.}\ \|p\| \le \Delta_k, \quad (4.4)$$
where $\Delta_k > 0$ is the trust region radius. Here we take $\|\cdot\|$ to be the Euclidean norm, so that the feasible set $\|p\| \le \Delta_k$ is a ball. If $\|B_k^{-1}g_k\| \le \Delta_k$, the solution is the unconstrained minimizer $p_k^B = -B_k^{-1}g_k$, referred to as the full step; it coincides with Newton's direction. Otherwise, the solution is not so obvious. See figure 4.1 for a comparison of the trust region step and the line search step.

FIG. 4.1 – Trust region step and line search step.

4.1 Outline of the Trust Region Approach

As mentioned before, one of the key points in the trust region method is the size of the trust region. The radius $\Delta_k$ is chosen based on the agreement between the model function $m_k$ and the objective function $f$ at previous iterations. Given a step $p_k$, define the ratio
$$\rho_k = \frac{f(x_k) - f(x_k+p_k)}{m_k(0) - m_k(p_k)}, \quad (4.5)$$

where the numerator is the actual reduction and the denominator is the predicted reduction. The ratio $\rho_k$ measures how well the model function $m_k(\cdot)$ approximates the objective function $f$ in a local area. Note that $m_k(0) = f(x_k)$ and the predicted reduction is always nonnegative, i.e., $m_k(0) - m_k(p_k) \ge 0$. Based on the value of $\rho_k$, we take different strategies:
- If $\rho_k < 0$, then $f(x_k) - f(x_k+p_k) < 0$: the step must be rejected, and we shrink the trust region.
- If $\rho_k \approx 1$, there is good agreement between the model $m_k$ and $f$ over this step, so we expand the trust region.
- If $\rho_k$ is positive but well below 1, the step is fairly good, so we keep the trust region.
- If $\rho_k$ is close to zero or negative, the step is poor, so we shrink the trust region.

We give the details of the trust region algorithm below.

Algorithm 4.1.1. (Trust Region Method)
S0. Given $\hat\Delta > 0$, $\Delta_0\in(0,\hat\Delta)$, and $\eta\in[0,\frac14)$.
S1. Obtain $p_k$ by solving (4.4).
S2. Evaluate
$$\rho_k = \frac{f(x_k) - f(x_k+p_k)}{m_k(0) - m_k(p_k)}.$$
S3. If $\rho_k < \frac14$, let $\Delta_{k+1} = \frac14\Delta_k$. If $\rho_k > \frac34$ and $\|p_k\| = \Delta_k$, let $\Delta_{k+1} = \min\{2\Delta_k, \hat\Delta\}$. Otherwise, $\Delta_{k+1} = \Delta_k$.
S4. Update $x_k$: if $\rho_k > \eta$, set $x_{k+1} = x_k + p_k$; otherwise, $x_{k+1} = x_k$.

In algorithm 4.1.1, $\hat\Delta$ is an overall bound on the step length. The trust region radius $\Delta_k$ is increased only when $\|p_k\| = \Delta_k$; otherwise, we infer that $\Delta_k$ is not interfering with the progress of the algorithm, so we keep it. See figure 4.2 for different radii and the corresponding steps.

FIG. 4.2 – Solutions of the trust-region subproblem for different radii $\Delta_1$, $\Delta_2$, $\Delta_3$.


The key point in algorithm 4.1.1 is how to solve the subproblem (4.4). The following theorem characterizes its optimal solution.

Theorem 4.1.2. The vector $p^*$ is a global solution of the trust-region subproblem
$$\min_{p\in\mathbb{R}^n} m(p) = f + g^T p + \frac12 p^T B p \quad \text{s.t.}\ \|p\| \le \Delta,$$
if and only if $p^*$ is feasible and there is a scalar $\lambda \ge 0$ such that the following conditions are satisfied:
$$(B + \lambda I)p^* = -g, \quad (4.6)$$
$$\lambda(\Delta - \|p^*\|) = 0, \quad (4.7)$$
$$(B + \lambda I) \succeq 0. \quad (4.8)$$

Condition (4.7) is called the complementarity condition. If $\|p^*\| < \Delta$, then $\lambda = 0$, implying $p^* = -B^{-1}g$; if $\|p^*\| = \Delta$, then $\lambda \ge 0$. It can be observed from theorem 4.1.2 that when $\|p^*\| = \Delta$, one needs suitable methods to obtain such an optimal solution. Below we discuss three methods for finding approximate solutions of the subproblem: the Cauchy point method, the dogleg method (for $B_k \succ 0$), and two-dimensional subspace minimization (for indefinite $B_k$).

4.2 Algorithms Based on the Cauchy Point

In this part, we discuss numerical algorithms for solving the subproblem in the trust region method. Since subproblem (4.4) minimizes a quadratic function over a trust region constraint, a natural idea is to apply the gradient method, which results in the Cauchy point method. Combining the Cauchy point with Newton's direction leads to the dogleg method; going one step further and extending the search to a two-dimensional subspace leads to two-dimensional subspace minimization.

4.2.1 The Cauchy Point

We start with the Cauchy point method. The Cauchy point provides a sufficient reduction in the objective, which in turn guarantees the global convergence of the trust region method, much as sufficient decrease does for line search methods. The idea of the Cauchy point method is to first approximate the model function by a linear function to obtain a direction $p_k^s$, and then find a minimizer of the quadratic model (4.4) along this particular direction. The Cauchy point is calculated as follows.


Algorithm 4.2.1. (Cauchy Point Calculation)
S1. Find the vector $p_k^s$ that solves a linear version of the subproblem (4.4), i.e.,
$$\min_{p\in\mathbb{R}^n} f_k + g_k^T p \quad \text{s.t.}\ \|p\| \le \Delta_k;$$
S2. Calculate the scalar $s_k > 0$ by solving
$$\min_{s\ge 0} m_k(sp_k^s) \quad \text{s.t.}\ \|sp_k^s\| \le \Delta_k;$$
S3. Set $p_k^C = s_k p_k^s$.

The closed form of the Cauchy point is derived as follows. First, it is easy to see that
$$p_k^s = -\frac{\Delta_k}{\|g_k\|}g_k. \quad (4.9)$$

For $s_k$, there is
$$m_k(sp_k^s) = \frac{\Delta_k^2}{2\|g_k\|^2}g_k^T B_k g_k\,s^2 - \Delta_k\|g_k\|\,s + f_k, \quad (4.10)$$
which is simply a function of $s\in\mathbb{R}$. Consequently, by elementary calculation, we have the following results.
- If $g_k^T B_k g_k \le 0$, then $m_k(sp_k^s)$ decreases monotonically with respect to $s$ (for $\|g_k\| \ne 0$), so $s_k = 1$.
- If $g_k^T B_k g_k > 0$, then $m_k(sp_k^s)$ is a convex quadratic function of $s$, so
$$s_k = \min\left\{1,\ \frac{\|g_k\|^3}{\Delta_k\,g_k^T B_k g_k}\right\}.$$
In summary, we have
$$p_k^C = -s_k\frac{\Delta_k}{\|g_k\|}g_k, \qquad s_k = \begin{cases}1, & \text{if } g_k^T B_k g_k \le 0,\\[4pt] \min\left\{1,\ \dfrac{\|g_k\|^3}{\Delta_k\,g_k^T B_k g_k}\right\}, & \text{otherwise.}\end{cases}$$
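Putting the closed-form Cauchy point together with algorithm 4.1.1 gives a complete, minimal trust region sketch. The parameter defaults below are illustrative choices, not prescriptions from the text:

```python
import numpy as np

def cauchy_point(g, B, delta):
    # Closed form derived above: p^C = -s * (delta/||g||) * g.
    gBg = g @ (B @ g)
    s = 1.0 if gBg <= 0 else min(1.0, np.linalg.norm(g) ** 3 / (delta * gBg))
    return -s * delta / np.linalg.norm(g) * g

def trust_region(f, grad, hess, x0, delta_hat=2.0, eta=0.1, max_iter=500):
    x, delta = x0.copy(), delta_hat / 4
    for _ in range(max_iter):
        g, B = grad(x), hess(x)
        if np.linalg.norm(g) < 1e-8:
            break
        p = cauchy_point(g, B, delta)
        pred = -(g @ p + 0.5 * p @ (B @ p))        # m_k(0) - m_k(p_k) >= 0
        rho = (f(x) - f(x + p)) / pred             # the ratio (4.5)
        if rho < 0.25:
            delta *= 0.25                          # shrink the trust region
        elif rho > 0.75 and abs(np.linalg.norm(p) - delta) < 1e-12:
            delta = min(2 * delta, delta_hat)      # expand the trust region
        if rho > eta:
            x = x + p                              # accept the step
    return x
```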

Note that the Cauchy point produces a sufficient reduction in the model function m k , and it is inexpensive to calculate. For trust region methods, it is crucial in deciding whether an approximate solution to the subproblem is acceptable. A trust region method will be globally convergent if its step pk gives a reduction in m k that is


at least some fixed positive multiple of the decrease attained by the Cauchy step. We will discuss the convergence in more detail in section 4.3. The Cauchy point can be viewed as a steepest descent step with a special step length, so it may converge slowly. There are two ways to improve it: the dogleg method and two-dimensional subspace minimization, addressed in subsections 4.2.2 and 4.2.3.

4.2.2 The Dogleg Method

Recall the subproblem (here we omit the subscript $k$)
$$\min_{p\in\mathbb{R}^n} m(p) = f + g^T p + \frac12 p^T B p \quad \text{s.t.}\ \|p\| \le \Delta,$$
and denote its optimal solution by $p^*(\Delta)$. If $B \succ 0$, then $p^B = -B^{-1}g$ is the minimizer of the unconstrained problem, and if $\Delta \ge \|p^B\|$,
$$p^*(\Delta) = p^B.$$
If $\Delta$ is small relative to $\|p^B\|$, the constraint $\|p\| \le \Delta$ ensures that the quadratic term in $m(\cdot)$ has little effect on $p^*(\Delta)$; dropping the quadratic term gives the approximation
$$p^*(\Delta) \approx -\Delta\frac{g}{\|g\|}, \quad \text{when }\Delta\text{ is small}.$$
For intermediate $\Delta$, $p^*(\Delta)$ follows a curved trajectory. The dogleg method replaces this curved trajectory by a path consisting of two line segments. The first segment runs along the steepest descent direction to
$$p^U = -\frac{g^T g}{g^T B g}g,$$
and the second runs from $p^U$ to $p^B$:
$$p^U + \alpha(p^B - p^U), \quad \alpha\in[0,1].$$
Let $\tilde p(s)$, $s\in[0,2]$, denote the path consisting of these two segments:
$$\tilde p(s) = \begin{cases} s\,p^U, & 0 \le s \le 1,\\ p^U + (s-1)(p^B - p^U), & 1 \le s \le 2.\end{cases}$$
The dogleg method chooses $p$ to minimize the model $m(\cdot)$ along the path $\tilde p(s)$, subject to the trust region bound. That is, at each iteration it takes the step $p = \tilde p(s^*)$, where $s^*$ solves
$$\min_{s\in\mathbb{R}} m(\tilde p(s)) \quad \text{s.t.}\ \|\tilde p(s)\| \le \Delta,\ s\in[0,2].$$


FIG. 4.3 – Exact trajectory and dogleg approximation.

See figure 4.3 for an illustration of the dogleg method. We have the following properties of $\tilde p(s)$ and $m(\tilde p(s))$.

Lemma 4.2.2. Let $B \succ 0$. Then (i) $\|\tilde p(s)\|$ is an increasing function of $s$, and (ii) $m(\tilde p(s))$ is a decreasing function of $s$.

Proof. For $s\in[0,1]$, results (i) and (ii) are obvious; consider $s\in[1,2]$. For (i), define
$$h(\alpha) = \frac12\|\tilde p(1+\alpha)\|^2 = \frac12\|p^U\|^2 + \alpha\,(p^U)^T(p^B - p^U) + \frac12\alpha^2\|p^B - p^U\|^2.$$
We show that $h'(\alpha) \ge 0$ for $\alpha\in(0,1)$. Note that
$$\begin{aligned}
h'(\alpha) &= -(p^U)^T(p^U - p^B) + \alpha\|p^B - p^U\|^2\\
&\ge -(p^U)^T(p^U - p^B)\\
&= \frac{g^T g}{g^T B g}\,g^T\!\left(-\frac{g^T g}{g^T B g}g + B^{-1}g\right)\\
&= g^T g\,\frac{g^T B^{-1}g}{g^T B g}\left(1 - \frac{(g^T g)^2}{(g^T B g)(g^T B^{-1}g)}\right) \ge 0.
\end{aligned}$$
The last inequality is due to the fact that
$$\frac{(g^T g)^2}{(g^T B g)(g^T B^{-1}g)} \le 1. \quad (4.11)$$


We leave the proof of (4.11) as an exercise; part (i) is then proved. For (ii), define $\hat h(\alpha) = m(\tilde p(1+\alpha))$. We only need to show that $\hat h'(\alpha) \le 0$ for $\alpha\in(0,1)$. By the positive definiteness of $B$ and $\alpha\in(0,1)$, we have
$$\begin{aligned}
\hat h'(\alpha) &= (p^B - p^U)^T(g + Bp^U) + \alpha\,(p^B - p^U)^T B(p^B - p^U)\\
&\le (p^B - p^U)^T\!\left(g + Bp^U + B(p^B - p^U)\right)\\
&= (p^B - p^U)^T(g + Bp^B) = 0,
\end{aligned}$$
where the last equality follows from the definition of $p^B$. The proof is finished. □

Remark 4.2.3. Lemma 4.2.2 implies that the path $\tilde p(s)$ intersects the trust-region boundary $\|p\| = \Delta$ at exactly one point if $\|p^B\| \ge \Delta$, and nowhere otherwise. If $\|p^B\| \le \Delta$, then $p = p^B$ with $s = 2$. If $\|p^B\| > \Delta$, we can compute $s^*$ by solving the scalar quadratic equation
$$\|p^U + (s^* - 1)(p^B - p^U)\|^2 = \Delta^2.$$
If $\nabla^2 f(x_k)$ is available and positive definite, we can choose $B_k = \nabla^2 f(x_k)$; otherwise, $p^B$ can be defined with $B_k$ a modified Hessian matrix.
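A compact sketch of the dogleg step for positive definite B, following remark 4.2.3; the test data are illustrative:

```python
import numpy as np

# Dogleg step: take p^B if it fits, otherwise follow the two-segment path
# and stop where it meets the trust-region boundary.
def dogleg(g, B, delta):
    pB = np.linalg.solve(B, -g)                  # unconstrained minimizer
    if np.linalg.norm(pB) <= delta:
        return pB                                # full step lies inside the region
    pU = -(g @ g) / (g @ (B @ g)) * g            # minimizer along -g
    if np.linalg.norm(pU) >= delta:
        return -delta / np.linalg.norm(g) * g    # boundary hit on the first leg
    # Second leg: solve ||pU + alpha (pB - pU)||^2 = delta^2 for alpha in [0, 1].
    d = pB - pU
    a, b, c = d @ d, 2 * (pU @ d), pU @ pU - delta ** 2
    alpha = (-b + np.sqrt(b ** 2 - 4 * a * c)) / (2 * a)   # positive root
    return pU + alpha * d

g = np.array([1.0, 2.0])
B = np.array([[2.0, 0.0], [0.0, 1.0]])
print(dogleg(g, B, delta=0.8))
```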

4.2.3 Two-Dimensional Subspace Minimization

If $B_k \succ 0$, we can also consider the two-dimensional subspace subproblem
$$\min_{p\in\mathbb{R}^n} m(p) = f + g^T p + \frac12 p^T B p \quad \text{s.t.}\ \|p\| \le \Delta_k,\ p\in\operatorname{span}\{g,\,B^{-1}g\}, \quad (4.12)$$

where $\operatorname{span}\{g, B^{-1}g\}$ denotes the linear subspace spanned by $g$ and $B^{-1}g$. Problem (4.12) is inexpensive to solve, since it involves only two variables (it can be reduced to finding the roots of a fourth-degree polynomial). The Cauchy point $p^C$ is a feasible solution of (4.12), so the optimal solution of (4.12) yields at least as much reduction in $m(\cdot)$ as the Cauchy point, and global convergence is therefore guaranteed. The method can also be viewed as an extension of the dogleg method, and it is particularly appropriate when $B$ is not positive definite. When $B$ has negative eigenvalues, the two-dimensional subspace subproblem is changed to
$$\min_{p\in\mathbb{R}^n} m(p) = f + g^T p + \frac12 p^T B p \quad \text{s.t.}\ \|p\| \le \Delta,\ p\in\operatorname{span}\{g,\,(B+\alpha I)^{-1}g\}, \quad (4.13)$$


for some $\alpha\in(-\lambda_1, -2\lambda_1]$, where $\lambda_1$ denotes the most negative eigenvalue of $B$. If $\|(B+\alpha I)^{-1}g\| \le \Delta$, define the step as
$$p = -(B+\alpha I)^{-1}g + v,$$
where $v\in\mathbb{R}^n$ satisfies $v^T(B+\alpha I)^{-1}g \le 0$ (implying $\|p\| \ge \|(B+\alpha I)^{-1}g\|$). When $B$ has zero eigenvalues but no negative eigenvalues, we can choose $p = p^C$.

Remark 4.2.4. The reduction in $m(\cdot)$ achieved by the two-dimensional subspace minimization strategy is often close to that achieved by the exact solution of the subproblem (4.4). However, the two-dimensional strategy requires only a single factorization of $B$ or $B+\lambda I$, while finding the exact solution of (4.4) requires two or three such factorizations.

4.3 Global Convergence

4.3.1 Reduction Obtained by the Cauchy Point

In this part, we first derive the reduction obtained by the Cauchy point, which is the key to the global convergence analysis of trust region methods.

Lemma 4.3.1. The Cauchy point $p_k^C$ satisfies
$$m_k(0) - m_k(p_k^C) \ge c_1\|g_k\|\min\left\{\Delta_k,\ \frac{\|g_k\|}{\|B_k\|}\right\}, \quad (4.14)$$
where $c_1 = \frac12$.

Proof. For simplicity, we drop the index $k$ and discuss the following cases.

Case 1: $g^T B g \le 0$. Then $p^C = -\frac{\Delta}{\|g\|}g$, and
$$m(p^C) - m(0) = -\Delta\|g\| + \frac12\frac{\Delta^2}{\|g\|^2}g^T B g \le -\Delta\|g\| \le -\|g\|\min\left\{\Delta,\ \frac{\|g\|}{\|B\|}\right\},$$
giving (4.14).

Case 2: $g^T B g > 0$ and $\dfrac{\|g\|^3}{\Delta\,g^T B g} \le 1$. In this case,
$$s = \frac{\|g\|^3}{\Delta\,g^T B g}, \qquad p^C = -\frac{\|g\|^3}{\Delta\,g^T B g}\cdot\frac{\Delta}{\|g\|}g = -\frac{\|g\|^2}{g^T B g}g.$$


We have
$$\begin{aligned}
m(p^C) - m(0) &= -\frac{\|g\|^4}{g^T B g} + \frac12\,g^T B g\,\frac{\|g\|^4}{(g^T B g)^2} = -\frac12\frac{\|g\|^4}{g^T B g}\\
&\le -\frac12\frac{\|g\|^4}{\|B\|\,\|g\|^2} = -\frac12\frac{\|g\|^2}{\|B\|} \le -\frac12\|g\|\min\left\{\Delta,\ \frac{\|g\|}{\|B\|}\right\},
\end{aligned}$$
so (4.14) holds.

Case 3: $g^T B g > 0$ and $\dfrac{\|g\|^3}{\Delta\,g^T B g} > 1$. Then $s = 1$ and
$$p^C = -\frac{\Delta}{\|g\|}g.$$
The condition $\|g\|^3/(\Delta\,g^T B g) > 1$ implies $g^T B g < \|g\|^3/\Delta$. Therefore,
$$\begin{aligned}
m(p^C) - m(0) &= -\Delta\|g\| + \frac12\frac{\Delta^2}{\|g\|^2}g^T B g \le -\Delta\|g\| + \frac12\frac{\Delta^2}{\|g\|^2}\cdot\frac{\|g\|^3}{\Delta}\\
&= -\frac12\Delta\|g\| \le -\frac12\|g\|\min\left\{\Delta,\ \frac{\|g\|}{\|B\|}\right\},
\end{aligned}$$
showing that (4.14) holds. Overall, condition (4.14) holds for the Cauchy point $p_k^C$. □



Remark 4.3.2. Lemma 4.3.1 implies that the Cauchy point leads to a sufficient reduction in function value, which is important in proving the global convergence of the trust region method. Moreover, we have the following result. Theorem 4.3.3. Let pk be any vector such that kpk k  Dk and the following holds m k ð0Þ  m k ðpk Þ  c2 ðm k ð0Þ  m k ðpCk ÞÞ: ð4:15Þ Then pk satisfies (4.14) with c1 ¼ c22 . In particular, if pk is the exact solution pk of the subproblem (4.4), then it satisfies (4.14) with c1 ¼ 12.

Trust Region Methods

61

Proof. Since kpk k  Dk , we have from the lemma 4.3.1 that   1 kg k k C m k ð0Þ  m k ðpk Þ  c2 ðm k ð0Þ  m k ðpk ÞÞ  c2 kg k kmin Dk ; ; 2 kB k k □

giving the result.

Remark 4.3.4. The dogleg method and two-dimensional subspace minimization algorithm both satisfy (4.14) with c1 ¼ 12, because they all produce approximate solution pk for which m k ðpk Þ  m k ðpCk Þ.

4.3.2

Convergence to Stationary Points

In this part, we discuss the convergence of trust region methods. We address  global  two cases: g ¼ 0 and g 2 0; 14 . Theorem 4.3.5. Let g ¼ 0 in algorithm 4.1.1. Suppose that kB k k  b for some constant b [ 0, that f is bounded below on the level set S defined by S :¼ fx : f ðxÞ  f ðx 0 Þg;

ð4:16Þ

and Lipschitz continuously differentiable in the neighborhood SðR0 Þ for some R0 [ 0 SðR0 Þ :¼ fx : kx  yk\R0 for some y 2 Sg; and that all approximate solutions of (4.4) satisfy the inequalities and kpk k  cDk ;

ð4:17Þ

for some positive constant c1 and c. We then have lim inf kg k k ¼ 0: k!1

Proof. We prove the result in several steps.

Step 1. By the definition of $\rho_k$,
$$|\rho_k - 1| = \left|\frac{(f(x_k) - f(x_k+p_k)) - (m_k(0) - m_k(p_k))}{m_k(0) - m_k(p_k)}\right| = \left|\frac{m_k(p_k) - f(x_k+p_k)}{m_k(0) - m_k(p_k)}\right|.$$
By Taylor's theorem, we have
$$f(x_k+p_k) = f(x_k) + g(x_k)^T p_k + \int_0^1\left[g(x_k+tp_k) - g(x_k)\right]^T p_k\,dt.$$
Recalling that $m(p) = f + g^T p + \frac12 p^T B p$, there is
$$|m_k(p_k) - f(x_k+p_k)| = \left|\frac12 p_k^T B_k p_k - \int_0^1\left[g(x_k+tp_k) - g(x_k)\right]^T p_k\,dt\right| \le \frac{\beta}{2}\|p_k\|^2 + \beta_1\|p_k\|^2, \quad (4.18)$$


where $\beta_1$ is the Lipschitz constant of $g$ on $S(R_0)$, and we assume $\|p_k\| \le R_0$ so that $x_k$ and $x_{k+1}$ both lie in $S(R_0)$.

Step 2. For contradiction, suppose that there are $\epsilon > 0$ and a positive index $K$ such that
$$\|g_k\| \ge \epsilon, \quad \forall\, k \ge K.$$
This implies that for $k \ge K$,
$$m_k(0) - m_k(p_k) \ge c_1\|g_k\|\min\left\{\Delta_k,\ \frac{\|g_k\|}{\|B_k\|}\right\} \ge c_1\epsilon\min\left\{\Delta_k,\ \frac{\epsilon}{\beta}\right\}.$$
Then, by (4.17) and (4.18), we have
$$|\rho_k - 1| \le \frac{c^2\Delta_k^2(\beta/2 + \beta_1)}{c_1\epsilon\min(\Delta_k,\epsilon/\beta)}.$$
Next, we show that $\Delta_k$ is bounded away from zero. Define
$$\bar\Delta = \min\left\{\frac12\frac{c_1\epsilon}{c^2(\beta/2 + \beta_1)},\ \frac{R_0}{c}\right\}, \quad (4.19)$$
where the term $R_0/c$ ensures that the bound in (4.18) is valid (since $\|p_k\| \le c\Delta_k \le R_0$).

Step 3. Suppose $\Delta_k \le \bar\Delta$, with $\bar\Delta$ defined in (4.19). Since $c_1 \le 1$ and $c \ge 1$, we have $\bar\Delta \le \epsilon/\beta$, implying that $\min(\Delta_k,\epsilon/\beta) = \Delta_k$ for all $\Delta_k\in[0,\bar\Delta]$. We obtain
$$|\rho_k - 1| \le \frac{c^2\Delta_k(\beta/2+\beta_1)}{c_1\epsilon} \le \frac{c^2\bar\Delta(\beta/2+\beta_1)}{c_1\epsilon} \le \frac12.$$
Therefore $\rho_k > \frac14$, and by algorithm 4.1.1, $\Delta_{k+1} \ge \Delta_k$ whenever $\Delta_k \le \bar\Delta$. So a reduction of $\Delta_k$ (by a factor of $\frac14$) can occur only if $\Delta_k \ge \bar\Delta$. We then conclude that
$$\Delta_k \ge \min(\Delta_K,\,\bar\Delta/4), \quad \text{for all } k \ge K.$$

Step 4. We claim that there is an infinite subsequence $\mathcal K$ such that $\rho_k \ge \frac14$ for $k\in\mathcal K$. For contradiction, suppose there are only finitely many such $k$. Then there exists $k_0$ such that $\rho_k < \frac14$ for all $k \ge k_0$, so $\Delta_{k+1} = \frac14\Delta_k$ for all these $k$. Eventually, there will be some $\Delta_k$


$\le \bar\Delta$. Once this happens, step 3 gives $\rho_k > \frac14$, contradicting that $\rho_k < \frac14$ for all $k \ge k_0$. So the claim is correct.

Step 5. Since there is an infinite subsequence $\mathcal K$ with $\rho_k \ge \frac14$, for $k\in\mathcal K$ and $k \ge K$ we have
$$f(x_k) - f(x_{k+1}) = f(x_k) - f(x_k+p_k) \ge \frac14\left(m_k(0) - m_k(p_k)\right) \ge \frac14 c_1\epsilon\min(\Delta_k,\epsilon/\beta).$$
Since $f$ is bounded below, it follows that
$$\lim_{k\in\mathcal K,\,k\to\infty}\Delta_k = 0,$$
contradicting
$$\Delta_k \ge \min(\Delta_K,\,\bar\Delta/4), \quad \text{for all } k \ge K.$$
So no such infinite subsequence $\mathcal K$ exists, and we must have $\rho_k < \frac14$ for all $k$ sufficiently large. But then $\Delta_k$ is eventually multiplied by $\frac14$ at every iteration, so $\Delta_k \to 0$, which again is a contradiction. Hence the assertion $\|g_k\| \ge \epsilon$ for all $k \ge K$ must be false, giving the result. □

Theorem 4.3.6. Let $\eta\in(0,\frac14)$ in algorithm 4.1.1. Suppose that $\|B_k\| \le \beta$ for some constant $\beta$, that $f$ is bounded below on the level set $S$ and Lipschitz continuously differentiable in the neighborhood $S(R_0)$ for some $R_0 > 0$, and that all approximate solutions of (4.4) satisfy (4.14) and (4.17) for some positive constants $c_1$ and $c$. We then have
$$\lim_{k\to\infty}\|g_k\| = 0.$$

Proof. We consider a particular positive index $m$ with $g_m \ne 0$. Using $\beta_1$ to denote the Lipschitz constant of $g$ on $S(R_0)$, we have
$$\|g(x) - g_m\| \le \beta_1\|x - x_m\|, \quad \forall\, x\in S(R_0).$$
Let $\epsilon$ and $R$ satisfy
$$\epsilon = \frac12\|g_m\|, \qquad R = \min\left\{\frac{\epsilon}{\beta_1},\,R_0\right\}.$$
Note that the ball
$$B(x_m,R) = \{x \mid \|x - x_m\| \le R\}\subset S(R_0),$$
so the Lipschitz continuity of $g$ holds inside $B(x_m,R)$. Therefore, $x\in B(x_m,R)$ implies that
$$\|g(x)\| \ge \|g_m\| - \|g(x) - g_m\| \ge \frac12\|g_m\| = \epsilon.$$


If the iterates $\{x_k\}_{k\ge m}$ were to stay inside $B(x_m,R)$, we would have $\|g_k\| \ge \epsilon > 0$ for all $k \ge m$; an argument like that in the previous theorem shows that this scenario cannot occur. Therefore, the sequence $\{x_k\}_{k\ge m}$ eventually leaves $B(x_m,R)$. Let $l \ge m$ be such that $x_{l+1}$ is the first iterate after $x_m$ lying outside $B(x_m,R)$. Since $\|g_k\| \ge \epsilon$ for $k = m, m+1, \ldots, l$, we have
$$\begin{aligned}
f(x_m) - f(x_{l+1}) &= \sum_{k=m}^{l}\left(f(x_k) - f(x_{k+1})\right) \ge \sum_{k=m,\,x_k\ne x_{k+1}}^{l}\eta\left(m_k(0) - m_k(p_k)\right)\\
&\ge \sum_{k=m,\,x_k\ne x_{k+1}}^{l}\eta c_1\epsilon\min\left(\Delta_k,\,\frac{\epsilon}{\beta}\right).
\end{aligned}$$
If $\Delta_k \le \epsilon/\beta$ for all $k = m, m+1, \ldots, l$, we have
$$f(x_m) - f(x_{l+1}) \ge \eta c_1\epsilon\sum_{k=m,\,x_k\ne x_{k+1}}^{l}\Delta_k \ge \eta c_1\epsilon R = \eta c_1\epsilon\min\left(\frac{\epsilon}{\beta_1},\,R_0\right),$$
since the total distance traveled from $x_m$ to $x_{l+1}$ is at least $R$. Otherwise, $\Delta_k > \epsilon/\beta$ for some $k = m, m+1, \ldots, l$, and therefore
$$f(x_m) - f(x_{l+1}) \ge \eta c_1\epsilon\frac{\epsilon}{\beta}.$$
Since the sequence $\{f(x_k)\}_{k=0}^\infty$ is decreasing and bounded below, we have $f(x_k)\downarrow f^*$ for some $f^* > -\infty$. Therefore,
$$f(x_m) - f^* \ge f(x_m) - f(x_{l+1}) \ge \eta c_1\epsilon\min\left(\frac{\epsilon}{\beta},\,\frac{\epsilon}{\beta_1},\,R_0\right) = \frac12\eta c_1\|g_m\|\min\left(\frac{\|g_m\|}{2\beta},\,\frac{\|g_m\|}{2\beta_1},\,R_0\right) > 0.$$
Since $f(x_m) - f^*\downarrow 0$, we must have $g_m \to 0$, giving the result. The proof is finished. □

Remark 4.3.7. Theorems 4.3.5 and 4.3.6 imply that for $\eta\in[0,\frac14)$, under reasonable assumptions, the trust region method in algorithm 4.1.1 generates a sequence $\{x_k\}$ such that
$$\liminf_{k\to\infty}\|g_k\| = 0 \quad (4.20)$$


holds. (4.20) is a weak global convergence result compared with the global convergence result in line search methods (3.22).

4.4 Local Convergence

For the local convergence rate of trust region methods, we hope that near the solution, the (approximate) solution of the trust-region subproblem lies well inside the trust region and becomes closer and closer to the true Newton step. Steps that satisfy this property are said to be asymptotically similar to Newton steps.

Theorem 4.4.1. Let $f$ be twice continuously differentiable in a neighborhood of a point $x^*$ at which the second-order sufficient conditions (theorem 2.2.13) are satisfied. Suppose the sequence $\{x_k\}$ converges to $x^*$, and that for all $k$ sufficiently large, the trust-region algorithm based on (4.4) with $B_k = \nabla^2 f(x_k)$ chooses steps $p_k$ that satisfy the Cauchy-point-based model reduction criterion (4.15) and are asymptotically similar to the Newton step $p_k^N$ whenever $\|p_k^N\| \le \frac12\Delta_k$, that is,
$$\|p_k - p_k^N\| = o(\|p_k^N\|). \quad (4.21)$$
Then the trust-region bound $\Delta_k$ becomes inactive for all $k$ sufficiently large, and the sequence $\{x_k\}$ converges superlinearly to $x^*$.

Remark 4.4.2. If $p_k = p_k^N$ for all $k$ sufficiently large, we have quadratic convergence of $\{x_k\}$ to $x^*$. Reasonable implementations of the dogleg and subspace minimization methods eventually take the steps $p_k = p_k^N$ under the conditions of theorem 4.4.1, and therefore converge quadratically.

4.5 Other Enhancements

There are different variants and enhancements that further improve the performance of trust region methods. In this part, we discuss two frequently used techniques.

Scaling. Recall that poor scaling means that $f$ is highly sensitive to small changes in certain components of $x$ and relatively insensitive to changes in other components. For trust region methods, a spherical trust region may not be appropriate when $f$ is poorly scaled: even if the model Hessian $B_k$ is exact, rapid changes in $f$ along certain directions will probably cause $m_k$ to be a poor approximation of $f$ along those directions, while $m_k$ may be a more reliable approximation along directions in which $f$ changes more slowly. The shape of the trust region should be such that our confidence in the model is more or less the same at all points on the boundary of the region. Define the elliptical trust region by
$$\|Dp\| \le \Delta,$$


where $D$ is a diagonal matrix with positive diagonal elements. The scaled trust-region subproblem takes the form
$$\min_{p\in\mathbb{R}^n} m_k(p) := f_k + g_k^T p + \frac12 p^T B_k p \quad \text{s.t.}\ \|Dp\| \le \Delta_k. \quad (4.22)$$

When $f$ is highly sensitive to the value of the $i$-th component $x_i$, we set the corresponding diagonal element $D_{ii}$ of $D$ to be large, while $D_{ii}$ is smaller for less sensitive components. The elements $D_{ii}$, $i = 1, \ldots, n$, may be derived from the second derivatives $\partial^2 f/\partial x_i^2$.

Below we demonstrate the calculation of the Cauchy point in scaled form.

Algorithm 4.5.1. (Generalized Cauchy Point Calculation)
S1. Find $p_k^s$ that solves
$$\min_{p\in\mathbb{R}^n} m_k(p) := f_k + g_k^T p \quad \text{s.t.}\ \|Dp\| \le \Delta_k. \quad (4.23)$$
S2. Calculate the scalar $s_k > 0$ by solving
$$\min_{s\ge 0} m_k(sp_k^s) \quad \text{s.t.}\ \|sDp_k^s\| \le \Delta_k.$$

S3. Let $p_k^C = s_k p_k^s$.

In algorithm 4.5.1, $p_k^s$ takes the form
$$p_k^s = -\frac{\Delta_k}{\|D^{-1}g_k\|}D^{-2}g_k, \quad (4.24)$$
and
$$s_k = \begin{cases}1, & \text{if } g_k^T D^{-2}B_k D^{-2}g_k \le 0,\\[6pt] \min\left\{1,\ \dfrac{\|D^{-1}g_k\|^3}{\Delta_k\,g_k^T D^{-2}B_k D^{-2}g_k}\right\}, & \text{otherwise.}\end{cases} \quad (4.25)$$

Trust Region in Other Norms. Trust regions may also be defined in terms of norms other than the Euclidean norm. For example, with the $\ell_1$ or $\ell_\infty$ norm, the trust region is given by
$$\|p\|_1 \le \Delta_k \quad\text{or}\quad \|p\|_\infty \le \Delta_k, \quad (4.26)$$
or by the scaled counterparts
$$\|Dp\|_1 \le \Delta_k \quad\text{or}\quad \|Dp\|_\infty \le \Delta_k, \quad (4.27)$$
where $D$ is a positive diagonal matrix as before. See figures 4.4 and 4.5 below.


FIG. 4.4 – The region $\|p\|_1 \le 1$, $p\in\mathbb{R}^2$.

FIG. 4.5 – The region $\|p\|_\infty \le 1$, $p\in\mathbb{R}^2$.


4.6 Exercises

Exercise 4.6.1. Theorem 4.3.5 shows that the sequence $\{\|g_k\|\}$ has an accumulation point at zero. Show that if the iterates $x_k$ stay in a bounded set $B$, then there is a limit point $x_\infty$ of the sequence $\{x_k\}$ such that $g(x_\infty) = 0$.

Exercise 4.6.2. Try to prove (4.24) and (4.25).

Exercise 4.6.3. Plot the regions defined by (4.27).

Exercise 4.6.4. The Cauchy–Schwarz inequality states that for any vectors $u$ and $v$, we have
$$(u^T v)^2 \le (u^T u)(v^T v),$$
with equality only when $u$ and $v$ are parallel. When $B$ is positive definite, use this inequality to show that
$$\frac{\|g\|^4}{(g^T B g)(g^T B^{-1}g)} \le 1,$$
with equality only if $g$ and $Bg$ (and $B^{-1}g$) are parallel.

Exercise 4.6.5. Derive the solution of the two-dimensional subspace minimization problem in the case where $B$ is positive definite.

Exercise 4.6.6. Show that if $B$ is any symmetric matrix, then there exists $\lambda \ge 0$ such that $B + \lambda I$ is positive definite.

Chapter 5
Conjugate Gradient Methods

Conjugate gradient methods can be divided into two types: linear and nonlinear conjugate gradient methods. The linear conjugate gradient method was originally proposed by Hestenes and Stiefel [23] for solving linear systems $Ax = b$ with $A$ positive definite; preconditioning can be used to improve its performance. The nonlinear conjugate gradient method was introduced by Fletcher and Reeves in the 1960s [18] and can be used for large-scale nonlinear optimization problems. The main advantage of conjugate gradient methods is that they are in general faster than the steepest descent method. In this chapter, we introduce linear and nonlinear conjugate gradient methods, focusing mainly on linear conjugate gradient methods.

5.1 Linear Conjugate Gradient Method

Conjugate gradient methods were originally designed to solve the linear system $Ax = b$, where $A$ is positive definite, i.e., $A \succ 0$. This is equivalent to solving the quadratic optimization problem
$$\min_{x\in\mathbb{R}^n}\phi(x) = \frac12 x^T A x - b^T x. \quad (5.1)$$
The gradient takes the form
$$\nabla\phi(x) = Ax - b =: r(x),$$
which is also referred to as the residual. In particular, at $x_k$, there is $r_k = Ax_k - b$.

5.1.1 Conjugate Direction Method

In this part, we introduce conjugate directions. First, we define conjugacy.


Definition 5.1.1. A set of nonzero vectors $\{p_0, p_1, \ldots, p_m\}\subset\mathbb{R}^n$ is said to be conjugate with respect to the symmetric positive definite matrix $A\in\mathbb{R}^{n\times n}$ if
$$p_i^T A p_j = 0, \quad \forall\, i\ne j.$$
Notice that any set of vectors satisfying conjugacy is also linearly independent. Given $x_0\in\mathbb{R}^n$ and a set of conjugate directions $\{p_0, p_1, \ldots, p_{n-1}\}$, let $\{x_k\}$ be generated by
$$x_{k+1} = x_k + \alpha_k p_k, \quad (5.2)$$
where $\alpha_k$ is the one-dimensional minimizer of $\phi$ along $p_k$, given by
$$\alpha_k = -\frac{r_k^T p_k}{p_k^T A p_k}. \quad (5.3)$$

The updates (5.2) and (5.3) are called the conjugate direction method. We have the following important result: one can minimize $\phi(\cdot)$ in $n$ steps by successively minimizing it along the individual directions in a conjugate set.

Theorem 5.1.2. For any $x_0\in\mathbb{R}^n$, the sequence generated by the conjugate direction algorithm (5.2) and (5.3) converges to the solution $x^*$ of the linear system $Ax = b$ within at most $n$ steps.

Proof. Since the directions $\{p_i\}$ are linearly independent, they must span the whole space $\mathbb{R}^n$. Hence, we can write the difference between $x_0$ and the solution $x^*$ as
$$x^* - x_0 = \sigma_0 p_0 + \cdots + \sigma_{n-1}p_{n-1}, \quad (5.4)$$
for some scalars $\sigma_k$, $k = 0, \ldots, n-1$. Premultiplying (5.4) by $p_k^T A$ and using the conjugacy property, we obtain
$$\sigma_k = \frac{p_k^T A(x^* - x_0)}{p_k^T A p_k}.$$
We now establish the result by showing that these coefficients $\sigma_k$ coincide with the step lengths $\alpha_k$ generated by formula (5.3). If $x_k$ is generated by (5.2) and (5.3), then
$$x_k = x_0 + \alpha_0 p_0 + \cdots + \alpha_{k-1}p_{k-1}. \quad (5.5)$$
Premultiplying (5.5) by $p_k^T A$ and using the conjugacy property, we have
$$p_k^T A(x_k - x_0) = 0,$$


and therefore
$$p_k^T A(x^* - x_0) = p_k^T A(x^* - x_k) = p_k^T(b - Ax_k) = -p_k^T r_k. \quad (5.6)$$
Comparing this relation with (5.3), we find that $\sigma_k = \alpha_k$, giving the result. □

With theorem 5.1.2, we have the following result.

Theorem 5.1.3. (Expanding Subspace Minimization) Let $x_0\in\mathbb{R}^n$ be any starting point and suppose that the sequence $\{x_k\}$ is generated by the conjugate direction method (5.2) and (5.3). Then
$$r_k^T p_i = 0, \quad \text{for } i = 0, 1, \ldots, k-1, \quad (5.7)$$
and $x_k$ is the minimizer of $\phi(x) = \frac12 x^T A x - b^T x$ over the set
$$\{x \mid x = x_0 + \operatorname{span}\{p_0, p_1, \ldots, p_{k-1}\}\}. \quad (5.8)$$

Proof. We begin by showing that a point $\tilde x$ minimizes $\phi$ over the set (5.8) if and only if $r(\tilde x)^T p_i = 0$ for each $i = 0, 1, \ldots, k-1$. Define
$$h(\sigma) = \phi(x_0 + \sigma_0 p_0 + \cdots + \sigma_{k-1}p_{k-1}),$$
where $\sigma = (\sigma_0, \ldots, \sigma_{k-1})^T$. Since $h(\sigma)$ is a strictly convex quadratic function, it has a unique minimizer $\sigma^*$ satisfying
$$\frac{\partial h(\sigma^*)}{\partial\sigma_i} = 0, \quad i = 0, 1, \ldots, k-1.$$
By the chain rule, this implies
$$\nabla\phi(x_0 + \sigma_0^* p_0 + \cdots + \sigma_{k-1}^* p_{k-1})^T p_i = 0, \quad i = 0, 1, \ldots, k-1.$$
Recalling the definition $\nabla\phi(x) = Ax - b = r(x)$, we obtain the desired characterization. We now use induction to show that $x_k$ satisfies (5.7). Note that the residuals obey the update
$$r_{k+1} = r_k + \alpha_k A p_k. \quad (5.9)$$
Since $\alpha_k$ is always the one-dimensional minimizer, we have immediately that $r_1^T p_0 = 0$. Let us now make the induction hypothesis that $r_{k-1}^T p_i = 0$ for $i = 0, \ldots, k-2$. By (5.9), there is $r_k = r_{k-1} + \alpha_{k-1}A p_{k-1}$, so
$$p_{k-1}^T r_k = p_{k-1}^T r_{k-1} + \alpha_{k-1}p_{k-1}^T A p_{k-1} = 0$$


by the definition (5.3) of $\alpha_{k-1}$. Meanwhile, for the other vectors $p_i$, $i = 0, \ldots, k-2$, we have
$$p_i^T r_k = p_i^T r_{k-1} + \alpha_{k-1}p_i^T A p_{k-1} = 0,$$
by the induction hypothesis and the conjugacy of the $p_i$. We conclude that $r_k^T p_i = 0$ for $i = 0, \ldots, k-1$. The proof is completed. □

Remark 5.1.4. Property (5.7) implies that the current residual $r_k$ is orthogonal to all previous search directions.

Remark 5.1.5. Theorem 5.1.3 provides us with a way to solve the minimization problem (5.1); the key point is to choose a set of conjugate directions. In the next part, we discuss a special way to generate conjugate directions, which leads to the conjugate gradient method.

5.1.2 Conjugate Gradient Method

In the conjugate gradient method, the search direction $p_k$ is generated by
$$p_k = -r_k + \beta_k p_{k-1}.$$
By the conjugacy of $p_k$ and $p_{k-1}$, we can derive that
$$\beta_k = \frac{r_k^T A p_{k-1}}{p_{k-1}^T A p_{k-1}}.$$
We choose $p_0 = -r_0$, the steepest descent direction. This leads to the conjugate gradient method.

Algorithm 5.1.6. (Conjugate Gradient Method – Preliminary Version)
S0. Given $x_0$.
S1. Set $r_0 = Ax_0 - b$, $p_0 = -r_0$, $k = 0$.
S2. While $r_k \ne 0$, do the following calculations:
$$\alpha_k = -\frac{r_k^T p_k}{p_k^T A p_k}, \quad (5.10)$$
$$x_{k+1} = x_k + \alpha_k p_k, \quad (5.11)$$
$$r_{k+1} = Ax_{k+1} - b, \quad (5.12)$$
$$\beta_{k+1} = \frac{r_{k+1}^T A p_k}{p_k^T A p_k}, \quad (5.13)$$
$$p_{k+1} = -r_{k+1} + \beta_{k+1}p_k; \quad (5.14)$$
let $k := k+1$;
end (while)


Theorem 5.1.7. Suppose that the $k$-th iterate generated by the conjugate gradient method is not the solution point $x^*$. Then the following four properties hold:
$$r_k^T r_i = 0, \quad i = 0, 1, \ldots, k-1, \quad (5.15)$$
$$\operatorname{span}\{r_0, r_1, \ldots, r_k\} = \operatorname{span}\{r_0, Ar_0, \ldots, A^k r_0\}, \quad (5.16)$$
$$\operatorname{span}\{p_0, p_1, \ldots, p_k\} = \operatorname{span}\{r_0, Ar_0, \ldots, A^k r_0\}, \quad (5.17)$$
$$p_k^T A p_i = 0, \quad i = 0, 1, \ldots, k-1. \quad (5.18)$$

Therefore, the sequence $\{x_k\}$ converges to $x^*$ in at most $n$ steps.

Proof. The proof is given by induction. The properties (5.16) and (5.17) hold trivially for $k = 0$, while (5.18) holds by construction for $k = 1$. Assuming now that these three expressions are true for some $k$ (the induction hypothesis), we show that they continue to hold for $k+1$.

To prove (5.16), we show first that the set on the left-hand side is contained in the set on the right-hand side. Because of the induction hypothesis, we have from (5.16) and (5.17) that
$$r_k\in\operatorname{span}\{r_0, Ar_0, \ldots, A^k r_0\}, \qquad p_k\in\operatorname{span}\{r_0, Ar_0, \ldots, A^k r_0\},$$
while multiplying the second of these expressions by $A$ gives
$$Ap_k\in\operatorname{span}\{Ar_0, \ldots, A^{k+1}r_0\}. \quad (5.19)$$
Applying (5.9), we find that
$$r_{k+1}\in\operatorname{span}\{r_0, Ar_0, \ldots, A^{k+1}r_0\}. \quad (5.20)$$
Combining (5.20) with the induction hypothesis for (5.16), we conclude that
$$\operatorname{span}\{r_0, r_1, \ldots, r_{k+1}\}\subset\operatorname{span}\{r_0, Ar_0, \ldots, A^{k+1}r_0\}.$$
To prove that the reverse inclusion holds as well, we use the induction hypothesis on (5.17) to deduce that
$$A^{k+1}r_0 = A(A^k r_0)\in\operatorname{span}\{Ap_0, Ap_1, \ldots, Ap_k\}.$$
Since by (5.9) we have
$$Ap_i = (r_{i+1} - r_i)/\alpha_i, \quad \text{for } i = 0, \ldots, k,$$
it follows that
$$A^{k+1}r_0\in\operatorname{span}\{r_0, r_1, \ldots, r_{k+1}\}. \quad (5.21)$$


Combining (5.21) with the induction hypothesis for (5.16), we find that
$$\operatorname{span}\{r_0, Ar_0, \ldots, A^{k+1}r_0\}\subset\operatorname{span}\{r_0, r_1, \ldots, r_{k+1}\}.$$
Therefore, the relation (5.16) continues to hold when $k$ is replaced by $k+1$, as claimed. That (5.17) continues to hold as well follows from the argument
$$\begin{aligned}
\operatorname{span}\{p_0, \ldots, p_k, p_{k+1}\} &= \operatorname{span}\{p_0, \ldots, p_k, r_{k+1}\} && (\text{by }(5.14))\\
&= \operatorname{span}\{r_0, Ar_0, \ldots, A^k r_0, r_{k+1}\} && (\text{induction hypothesis for }(5.17))\\
&= \operatorname{span}\{r_0, r_1, \ldots, r_k, r_{k+1}\} && (\text{by }(5.16))\\
&= \operatorname{span}\{r_0, Ar_0, \ldots, A^{k+1}r_0\} && (\text{by }(5.16)\text{ for }k+1).
\end{aligned}$$
Next, we prove the conjugacy condition (5.18) with $k$ replaced by $k+1$. Multiplying (5.14) by $Ap_i$, $i = 0, 1, \ldots, k$, we obtain
$$p_{k+1}^T A p_i = -r_{k+1}^T A p_i + \beta_{k+1}p_k^T A p_i. \quad (5.22)$$
By the definition (5.13) of $\beta_{k+1}$, the right-hand side of (5.22) vanishes when $i = k$. For $i \le k-1$ we need to collect a number of observations. Note first that our induction hypothesis for (5.18) implies that the directions $p_0, \ldots, p_k$ are conjugate, so we can apply theorem 5.1.3 to deduce that
$$r_{k+1}^T p_i = 0, \quad i = 0, \ldots, k. \quad (5.23)$$
Second, by repeatedly applying (5.17), we find that for $i = 0, \ldots, k-1$ the following inclusion holds:
$$Ap_i\in A\,\operatorname{span}\{r_0, Ar_0, \ldots, A^i r_0\} = \operatorname{span}\{Ar_0, A^2 r_0, \ldots, A^{i+1}r_0\}\subset\operatorname{span}\{p_0, \ldots, p_{i+1}\}. \quad (5.24)$$
Combining (5.23) and (5.24), we deduce that
$$r_{k+1}^T A p_i = 0, \quad i = 0, \ldots, k-1,$$
so the first term on the right-hand side of (5.22) vanishes for $i = 0, 1, \ldots, k-1$. Because of the induction hypothesis for (5.18), the second term vanishes as well, and we conclude that $p_{k+1}^T A p_i = 0$, $i = 0, 1, \ldots, k$. Hence the induction argument holds for (5.18) also. It follows that the direction set generated by the conjugate gradient method is indeed a conjugate direction set, and theorem 5.1.2 tells us that the algorithm terminates in at most $n$ iterations.

Finally, we prove (5.15) by a noninductive argument. Because the direction set is conjugate, we have from (5.7) that $r_k^T p_i = 0$ for all $i = 0, 1, \ldots, k-1$ and any $k = 1, 2, \ldots, n$. By rearranging (5.14), we find that
$$p_i = -r_i + \beta_i p_{i-1},$$

Conjugate Gradient Methods

75

so that r i 2 spanfpi ; pi1 g for all i ¼ 1; . . .; k  1. We conclude that r Tk r i ¼ 0 for all i ¼ 1; . . .; k  1. To complete the proof, we note that r Tk r 0 ¼ r Tk p0 ¼ 0, by the □ definition of p0 in algorithm 5.1.6 and by (5.7). The directions p0 , p1 , . . .; pn1 are indeed conjugate, implying the termination in n steps. The residuals r j are mutually orthogonal. Each search direction pk and residual contained in the Krylov subspace of degree k for r 0 , defined as Kðr 0 ; kÞ :¼ spanfr 0 ; Ar 0 ; . . .; Ak r 0 g:

5.1.3

A Practical Form of the Conjugate Gradient Method

In this part, we introduce a practical form of the conjugate gradient method, which rT p

needs less computational cost. Replace ak ¼  pTkApk by k

ak ¼

k

r Tk r k : pTk Apk

Note that ak Apk ¼ r k þ 1  r k , we replace bk þ 1 ¼ bk þ 1 ¼

rT Apk kþ1 pT Apk k

by

r Tkþ 1 r k þ 1 : r Tk r k

It leads to the following practical version of the conjugate gradient method. Algorithm 5.1.8. (Conjugate Gradient Method, CG) S0. Given x 0 . S1. Set r 0 ¼ Ax 0  b, p0 ¼ r 0 , k ¼ 0. S2. While r k 6¼ 0, do the following calculations ak ¼

r Tk r k ; pTk Apk

x k þ 1 ¼ x k þ ak p k ;

ð5:26Þ

r k þ 1 ¼ r k þ ak Apk ;

ð5:27Þ

r Tkþ 1 r k þ 1 ; r Tk r k

ð5:28Þ

bk þ 1 ¼

pk þ 1 ¼ r k þ 1 þ bk þ 1 pk ; Let k :¼ k þ 1; end (while)

ð5:25Þ

ð5:29Þ

Modern Optimization Methods

76

In algorithm 5.1.8, one only needs to store two iterations of vector information, implying low storage cost. In each iteration, only Ap, pT ðApÞ, and r T r are involved, implying a low computational cost. Therefore, CG is recommended only for large problems. For small ones, one can choose Gaussian elimination or other factorization algorithms since they are less sensitive to rounding errors. CG sometimes approaches the solution quickly, as we will show below.

5.1.4

Rate of Convergence

To demonstrate the convergence rate of conjugate gradient method, let P k ðÞ be a polynomial of degree k with coefficients g0 ; g1 ; . . .; gk , i.e., P k ðzÞ ¼ g0 þ g1 z þ g2 z 2 þ    þ gk z k : Define kxk2A ¼ x T Ax. We have the following important convergence result for conjugate gradient method kx k þ 1  x  k2A ¼ min max ð1 þ ki P k ðki ÞÞ2 kx 0  x  k2A ; P k 1\i n

ð5:30Þ

where 0\k1 k2    kn are the eigenvalues of A. We search for a polynomial P k that makes the following nonnegative scalar quantity min max ð1 þ ki P k ðki ÞÞ2 ð5:31Þ P k 1\i n

as small as possible. In some practical cases, we can find this polynomial explicitly and get some interesting properties of CG. Theorem 5.1.9. If A has only r distinct eigenvalues, then the CG iteration will terminate at the solution in at most r iterations. Proof. Suppose that the eigenvalues k1 ; . . .; kn take the r distinct values s1

   sr . We define a polynomial Q r ðkÞ by Q r ðkÞ ¼

ð1Þr ðk  s1 Þ    ðk  sr Þ; s1    sr

and note that Q r ðki Þ ¼ 0 for i ¼ 1; . . .; n and Q r ð0Þ ¼ 1. From the latter observation, we deduce that Q r ðkÞ  1 is a polynomial of degree r with a root at k ¼ 0. So by polynomial division, the function P r1 defined by P r1 ðkÞ ¼

ðQ r ðkÞ  1Þ k

is a polynomial of degree r  1. By setting k ¼ r  1 in (5.31), we have 0 min max ð1 þ ki P r1 ðki ÞÞ2 max ð1 þ ki P r1 ðki ÞÞ2 ¼ max Q 2r ðki Þ ¼ 0: P r1 1 i n

1 i n

1 i n

Hence, the constant in (5.31) is zero for the value k ¼ r  1. By substituting into (5.30) we have that kx r  x  k2A ¼ 0. Therefore x r ¼ x  , as claimed. □

Conjugate Gradient Methods

77

Theorem 5.1.10. If A has eigenvalues k1 k2    kn , we have that   knk  k1 2  2 kx k þ 1  x kA

kx 0  x  k2A : knk þ k1

ð5:32Þ

Consider the case that A consists of m large values, with n  m smaller eigenvalues clustered around 1 (as shown in figure 5.1). Let  ¼ knm  k1 , then kx k þ 1  x  kA kx 0  x  kA :

FIG. 5.1 – Two clusters of eigenvalues. For a small value of , CG provides a good estimate of the solution after m þ 1 steps. Another convergence expression for CG is based on the Euclidean condition number of A, defined by jðAÞ ¼ kAk2 kA1 k2 ¼

kn : k1

It can be shown that !k pffiffiffiffiffiffiffiffiffiffi jðAÞ  1 kx k þ 1  x kA 2 pffiffiffiffiffiffiffiffiffiffi kx 0  x  kA : jðAÞ þ 1 

Notice that 1 pffiffiffiffiffiffiffiffiffiffi 1  pffiffiffiffiffiffiffi jðAÞ  1 jðAÞ pffiffiffiffiffiffiffiffiffiffi : ¼ 1 jðAÞ þ 1 1 þ pffiffiffiffiffiffiffi jðAÞ

ð5:33Þ

when jðAÞ ! 1, the ratio in (5.33) tends to 0, the convergence rate is faster compared with the case that jðAÞ ! 1. If jðAÞ ! 1, the convergence rate could be very slow. In this situation, we need preconditioning to improve the performance, as we will show below.

5.1.5

Preconditioning

Preconditioning is to accelerate CG by transforming the linear system to improve the eigenvalue distribution of A. Let x^ ¼ Cx:

Modern Optimization Methods

78

Accordingly, there is ^ x Þ ¼ 1 x^T ðC 1 AC 1 Þ^ /ð^ x  ðC 1 bÞT x^: 2 ^ x Þ ¼ 0 becomes The linear system r/ð^ ðC 1 AC 1 Þ^ x ¼ C 1 b: The convergence rate will depend on the eigenvalue distribution of C 1 AC 1 . We choose C such that the condition number of C 1 AC 1 is much smaller than the original condition number. Notice that there is no need to calculate C 1 explicitly. C is often chosen as a diagonal matrix. The preconditioned version of the conjugate gradient method is given below. Algorithm 5.1.11. (Preconditioned CG) S0. Given x 0 , and the preconditioner M 2 IRnn . S1. Set r 0 ¼ Ax 0  b, solve My 0 ¼ r 0 for y 0 . S2. Let p0 ¼ y 0 , k :¼ 0. S3. While r k 6¼ 0, do the following calculations ak ¼

r Tk y k ; pTk Apk

ð5:34Þ

x k þ 1 ¼ x k þ ak p k ;

ð5:35Þ

r k þ 1 ¼ r k þ ak Apk ;

ð5:36Þ

solve My k þ 1 ¼ r k þ 1 ;

ð5:37Þ

bk þ 1 ¼

r Tkþ 1 y k þ 1 ; r Tk y k

pk þ 1 ¼ y k þ 1 þ bk þ 1 pk ;

ð5:38Þ ð5:39Þ

Let k :¼ k þ 1; end (while)

5.2

Nonlinear Conjugate Gradient Methods

To extend to nonlinear conjugate gradient methods, we need to consider two factors: (i) The step length ak . One needs to choose ak to reach an approximate minimum of the nonlinear function f . (ii) The residual r. We have to replace it with the gradient of the nonlinear function f . The Fletcher-Reeves method [18] (FR) is given as follows.

Conjugate Gradient Methods

79

Algorithm 5.2.1. ðFRÞ S0. Given x 0 . S1. Evaluate f 0 ¼ f ðx 0 Þ, rf 0 ¼ rf ðx 0 Þ. Set p0 ¼ rf 0 , k :¼ 0. S2. While rf k 6¼ 0. Compute ak , set x k þ 1 ¼ x k þ ak pk ; Evaluate rf k , do the following calculations bFR k þ1 ¼

rf Tkþ 1 rf k þ 1 rf Tk rf k

;

pk þ 1 ¼ rf k þ 1 þ bFR k þ 1 pk ;

ð5:40Þ

ð5:41Þ

Let k :¼ k þ 1; end (while) As one can see in algorithm 5.2.1, r k is replaced by rf k. It leads to bFR k þ 1 in (5.40). In terms of the step length ak , we choose step length ak to make the search direction pk to be a descent direction. Recall that algorithm 5.2.1, pk is determined by pk ¼ rf k þ bFR k pk1 ; and T rf Tk pk ¼ krf k k2 þ bFR k rf k pk1 :

The next lemma shows that the strong Wolfe condition can imply that pk is a descent direction. Lemma 5.2.2. Suppose that algorithm 5.2.1 is implemented with a step length ak that satisfies the strong Wolfe conditions with 0\c2 \ 12. Then the method generates descent directions pk that satisfy the following inequalities 

1 rf Tk pk 2c2  1



; 2 1  c2 krf k k 1  c2

8 k ¼ 0; 1; . . .

ð5:42Þ

Proof. Note first that the function tðnÞ :¼

2n  1 1n

is monotonically increasing on the interval ½0; 1=2 and that tð0Þ ¼ 1 and tð1=2Þ ¼ 0. Hence, because of c2 2 ð0; 1=2Þ, we have 1\

2c1  1 \0: 1  c2

ð5:43Þ

The descent condition rf Tk pk \0 follows immediately once we establish (5.42).

Modern Optimization Methods

80

The proof is by induction. For k ¼ 0, the middle term in (5.42) is −1, so by using (5.43), we see that both inequalities in (5.42) are satisfied. Next, assume that (5.42) holds for some k 1. From (5.41) and (5.40), we have rf Tkþ 1 pk þ 1 krf k þ 1 k2

¼ 1 þ bk þ 1

rf Tkþ 1 pk krf k þ 1 k2

¼ 1 þ

rf Tkþ 1 pk krf k k2

:

ð5:44Þ

By using the line search condition (5.41), we have jrf Tkþ 1 pk j  c2 rf Tk pk ; then by combining with (5.44) and recalling (5.40), we obtain 1 þ c2

rf Tk pk krf k þ 1 k

Substituting for the term

2



rf Tkþ 1 pk þ 1 krf k þ 1 k

rf T k pk krf k k2

2

 1  c2

rf Tk pk krf k k2

:

from the left-hand-side of the induction

hypothesis (5.42), we obtain 1 

rf Tkþ 1 pk þ 1 c2 c2

 1þ ; 2 1  c2 1  c2 krf k þ 1 k

which shows that (5.42) holds for k þ 1 as well. The proof is complete.



Remark 5.2.3.  Only the second inequality is used in the proof. The first inequality will be needed later to establish global convergence.  The bounds on rf Tk pk impose a limit on how fast kpk k can grow, and they will play a crucial role in the convergence analysis.  If the method generates a bad direction and a tiny step, then the next direction and the next step are also likely to be poor.

5.2.1

The Polak-Ribiere Method and Variants

The Polak-Ribiere method proposed by Polak and Ribiere [50], defines the parameters as follows   rf Tkþ 1 rf k þ 1  rf k bPR ¼ : ð5:45Þ k þ1 krf k k2 The algorithm where bk þ 1 is replaced by (5.45) is referred to as Polak-Ribiere method (PR). Remark 5.2.4. For strongly convex quadratic function f and the exact line search, PR there is bFR k þ 1 ¼ bk þ 1 : When applied to general nonlinear functions with inexact line search, however, the behavior differs markedly. Numerical experiences indicate that Algorithm PR tends to be the more robust and efficient of the two.

Conjugate Gradient Methods

81

However, some problems, bPR k þ 1 maybe negative. Therefore, a modified version of PR is as follows. Define bkþþ 1 ¼ maxfbPR k þ 1 ; 0g: The resulting algorithm is referred to as Algorithm PR+. A simple adaption of the strong Wolfe conditions ensures that the descent property holds. Other variants of nonlinear conjugate gradient methods include the Hestenes-Stiefel method [23], with bk þ 1 given by   rf Tkþ 1 rf k þ 1  rf k HS ð5:46Þ bk þ 1 ¼ ðrf k þ 1  rf k ÞT pk and the Dai-Yuan method [12], with bk þ 1 given by bDY k þ1 ¼

krf k k2

ðrf k þ 1  rf k ÞT pk

:

For a strongly convex quadratic function f with exact line search, all bk þ 1 s are the same. For general problems, if the method generates a bad direction and a tiny step, then for the FR algorithm, the next direction and the next step are also likely to be poor. However, for the PR method, it performs a restarting after it encounters a bad direction. It is also the same with PR+ and HS algorithms. Quadratic Termination and Restart For nonlinear problems, one usually uses the restarting technique to further improve the performance of nonlinear conjugate gradient methods. Specifically, restart the iteration at every n step by setting bk ¼ 0, i.e., by choosing just the steepest descent step. It leads to an n-step quadratic convergence, i.e., kx k þ n  x  k ¼ Oðkx k  x  k2 Þ: In practice, the n-step restarting is rarely used especially for large scale problems. In fact, restarts often depend on the following condition: when two consecutive gradients are far from orthogonality, i.e., jrf Tk rf k1 j krf k k2

v;

where a typical v is 0:1.

5.2.2

Global Convergence

In this part, we introduce the convergence result of nonlinear conjugate gradient methods. We need the following assumptions.

Modern Optimization Methods

82 Assumption 5.2.5.

(i) The level set L ¼ fx j f ðxÞ f ðx 0 Þg is bounded. (ii) In some open neighborhood N of L, the objective function f is Lipschitz continuously differentiable. The assumptions imply that there is a constant c [ 0 such that krf ðxÞk c;

8 x 2 L:

ð5:47Þ

Theorem 5.2.6. Suppose that assumption 5.2.5 holds, and that algorithm 5.2.1 is implemented with a line search that satisfies the strong Wolfe conditions with 0\c1 \c2 \ 12. Then lim inf krf k k ¼ 0: k!1

ð5:48Þ

Proof. The proof is by contradiction. It assumes that the opposite of (5.48) holds. That is, there is a constant c [ 0 such that krf k k c;

ð5:49Þ

for all k sufficiently large. By substituting the left inequality of (5.42) into Zoutendijk’s condition, we obtain 1 X krf x k4 k¼0

kpk k2

\1:

By using the strong Wolfe conditions and (5.42), we obtain that T rf pk1  c2 rf T pk1 c2 krf k1 k2 : k k1 1  c2

ð5:50Þ

ð5:51Þ

Thus, from (5.41) and recalling the definition bFR k , we obtain   FR 2 T kpk1 k2 kpk k2 krf k k2 þ 2bFR k rf k pk1 þ bk  2 2c2 FR

krf k k2 þ b krf k1 k2 þ bFR kpk1 k2 k 1  c2 k    2 1 þ c2 ¼ kpk1 k2 : krf k k2 þ bFR k 1  c2 Applying this relation repeatedly, and defining c3 :¼ ð1 þ c2 Þ=ð1  c2 Þ 1, we have



 2

 2 2 kpk k2 c3 krf k k2 þ bFR c3 krf Tk k þ bFR kp0 k2 . . . c3 krf k2 k2 þ    þ bFR k k1 k ¼ c3 krf k k4

k X j¼0

krf j k2 ;

ð5:52Þ

Conjugate Gradient Methods

83

where we used the facts that 2

2

krf k k4

2

FR FR ðbFR k Þ ðbk1 Þ . . .ðbki Þ ¼

krf ki1 k4

;

and p0 ¼ rf 0 . By using the bounds (5.47) and (5.49) in (5.52), we obtain kpk k2 ¼

c 3 c4 k; c2

ð5:53Þ

which implies that 1 X k¼0

kpk k2 c4

1 X 1 k¼1

k

;

ð5:54Þ

for some positive constant c4 . On the other hand, from (5.49) and (5.50), we have that 1 X

1

k¼0

kpk k2

\1:

ð5:55Þ

P 1 However, if we combine this inequality with (5.54), we obtain that 1 k¼0 k \1, which is not true. Hence, (5.49) does not hold, and the claim (5.48) is proved. □ A counter-example for convergence is as follows. Theorem 5.2.7. Consider the PR method with an ideal line search. There exists a twice continuously differentiable objective function f : IR3 ! IR and a starting point x 0 2 IR3 such that the sequence of gradients fkrf k kg is bounded away from zero.

5.3

Exercises

Exercise 5.3.1. Any set of vectors satisfying conjugacy is also linearly independent. Exercise 5.3.2. Search online for the CG method, and download a solver. Run it on some problems, and write a report about the performance of CG. Exercise 5.3.3. Analyze the computational cost in algorithms 5.1.6 and 5.1.8 in each iteration. Which algorithm needs more computational cost? Exercise 5.3.4. Implement algorithm 5.1.8 and use to it solve linear systems in which A is the Hilbert matrix, whose elements are Ai;j ¼ 1=ði þ j  1Þ. Set the right-hand side be to b ¼ ð1; 1; . . .; 1ÞT and the initial point to x 0 ¼ 0. Try dimensions n ¼ 5; 8; 12; 20 and report the number of iterations required to reduce the residual below 106 .

Modern Optimization Methods

84

Exercise 5.3.5. Show that if f ðxÞ is a strictly convex quadratic, then the function def

hðrÞ ¼ f ðx 0 þ r0 p0 þ    þ rk1 pk1 Þ is also a strictly convex quadratic in the variable ¼ ðr0 ; r1 ; . . .; rk1 ÞT . Exercise 5.3.6. Derive algorithm 5.1.11 by applying the standard CG method in the variables x^ and then transforming back into the original variables. Exercise 5.3.7. Let fki ; v i g; i ¼ 1; 2; . . .; n be the eigenpairs of the symmetric matrix A. Show that the eigenvalues and eigenvectors of ½I þ P k ðAÞA T A½I þ P k ðAÞA are ki ½1 þ ki P k ðki Þ 2 and v i , respectively. Exercise 5.3.8. Show that when applied to a quadratic function, with exact line searches, both the Polak-Ribiere formula given by (5.45) and the Hestenes-Stiefel formula given by (5.46) reduce to the Fletcher-Reeves formula (5.40).

Chapter 6 Semismooth Newton’s Method In the optimization community, semismooth Newton’s method has been well studied and has been successfully used in many applications, especially in solving modern optimization problems, such as the nearest correlation matrix problem [52, 53], the nearest Euclidean distance matrix problem [38], the tensor eigenvalue complementarity problem [8], absolute value equations [10], and quasi-variational inequations [17], as well as linear and convex quadratic semidefinite programming problems [77]. In this chapter, we will briefly introduce the semismooth Newton’s method as well as its application to support vector machines. We begin with the definition of semismoothness.

6.1

Semismoothness

The semismooth equation is popular in the complementarity and variational inequalities (nonsmooth equations) community [49, 56]. Josephy [29] introduced Newton and quasi-Newton methods for generalized equations (in terms of Robinson). Kojima and Shindo [31] investigated Newton’s method for piecewise smooth equations. Kummer [32, 33] gave a sufficient condition to extend Kojima and Shindo’s work. Qi and Sun [55] proved that semismooth functions satisfy Kummer’s condition, which is a big leap in the extension from piecewise smoothness to potentially infinitely many smooth pieces. Since then, many exciting developments have been achieved, in particular in large-scale settings. The semismoothness of a function is closely related to the generalized Jacobian in the sense of Clarke [9], which is stated as follows. Let U : IRm ! IRl be a (locally) Lipschitz function. According to Rademacher’s theorem [58, section 14], U is differentiable almost everywhere. Define DU as the set of points at which the function U is differentiable. That is, D U :¼ fx 2 IRm j U is differentiable at xg: Let U0 ðxÞ denote the Jacobian of U at x 2 DU . The Bouligand subdifferential of U at x 2 IRm is then defined by DOI: 10.1051/978-2-7598-3174-6.c006 © Science Press, EDP Sciences, 2023

Modern Optimization Methods

86

@ B UðxÞ   :¼ V 2 IRml j V is an accumulation point of U0 ðx k Þ; x k ! x; x k 2 DU : ð6:1Þ Below is a simple example. Example 6.1.1. Consider /ðtÞ ¼ maxf0; tg;

t 2 IR:

See figure 6.1. There is D/ ¼ ð1; 0Þ [ ð0; þ 1Þ: By the definition, we have D/0 ðtÞ ¼



1; 0;

t [ 0; t\0:

With (6.1), there is @ B /ð0Þ ¼ f0; 1g: For t 6¼ 0, there is

 @ B /ðtÞ ¼

1; t [ 0; 0; t\0:

FIG. 6.1 – maxð0; tÞ. To derive Clarke’s generalized Jacobian, we also need the concept of a convex hull, which is given below. Definition 6.1.2. The convex hull of a set A is defined by all the convex combinations of the elements in A, i.e.,

Semismooth Newton’s Method ( coðAÞ ¼

X

87

ki x i ; x i 2 A;

i2I

X

) ki ¼ 1; ki  0; i 2 I :

i2I

With the convex hull, we are ready to give the definition of Clarke’s generalized Jacobian. Definition 6.1.3. The generalized Jacobian of / in the sense of Clarke [9] is the convex hull of @ B UðxÞ, i.e., @UðxÞ ¼ coð@ B UðxÞÞ; where coð@ B UðxÞÞ is the convex hull of @ B UðxÞ. Example 6.1.4. The Clarke’s generalized gradient of /ðtÞ ¼ maxð0; tÞ is 8 < 1; t [ 0; @/ðt Þ ¼ 0; t\0; : v; 0  v  1; t ¼ 0:

ð6:2Þ

The concept of semismoothness was introduced by Mifflin [47] for functionals. It was extended to vector-valued functions by Qi and Sun [55]. One way of defining semismoothness is as follows. Definition 6.1.5. We say that U is semismooth at x if (i) U is directional differentiable at x and (ii) for any V 2 @Uðx þ hÞ, Uðx þ hÞ  UðxÞ  Vh ¼ oðkhkÞ;

h ! 0:

U is said to be c-order semismooth at x (c [ 0) if for any V 2 @Uðx þ hÞ, Uðx þ hÞ  UðxÞ  Vh ¼ Oðkhk1 þ c Þ;

h ! 0:

U is said to be strongly semismooth at x if c ¼ 1. Some particular examples of semismooth functions are as follows.  Piecewise linear functions are strongly semismooth.  The composition of (strongly) semismooth functions is also (strongly) semismooth. For example, maxð0; tÞ is strongly semismooth, as shown in figure 6.1.

6.2

Nonsmooth Version of Newton’s Method

For U : IRm ! IRm , consider solving Uðx Þ ¼ 0:

ð6:3Þ

Modern Optimization Methods

88

Let U be locally Lipschitz. Nonsmooth Newton’s method takes the following update  k   x k þ 1 ¼ x k  V 1 V k 2 @U x k ; k ¼ 0; 1; 2; . . . ð6:4Þ k U x ; If U is continuously differentiable, then @UðxÞ reduces to a singleton, which is the Jacobian of UðxÞ. In this situation, the update (6.4) is the classical Newton’s method. Let x be the solution of (6.3). Assume that @Uðx k Þ is nonsingular and that x 0 is sufficiently close to x. If U is semismooth at x, then,  Uðx k Þ  UðxÞ  V k ðx k  xÞ k kx k þ 1  xk2 ¼ k V 1 k |ffl{zffl} |fflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl} 2 semismooth bounded  ¼ o kx k  xk2 : |fflfflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflfflffl} superlinear 1þc

It takes oðkx k  xk to the following result.

Þ if U is c-order semismooth at x. The above analysis leads

Theorem 6.2.1. [55, theorem 3.2] Let x be a solution of UðxÞ ¼ 0 and let U be a locally Lipschitz function which is semismooth at x. Assume that all V 2 @UðxÞ are nonsingular. Then every sequence generated by (6.4) is superlinearly convergent to x, provided that the starting point x 0 is sufficiently close to x. Moreover, if U is strongly semismooth at x, the convergence rate is quadratic. Remark 6.2.2. Theorem 6.2.1 implies that to guarantee the locally superlinear convergence rate of the nonsmooth Newton’s method, we need to verify two facts: the semismoothness of the function U as well as the nonsingularity of the elements in Clarke’s generalized Jacobian @UðxÞ. Next, we discuss the globalization of nonsmooth Newton’s method. Consider min f ðxÞ:

x2IRn

ð6:5Þ

If f is continuously differentiable, it is equivalent to solving rf ðxÞ ¼ 0: To get global convergence, we take the following update  k   x k þ 1 ¼ x k  ak V 1 V k 2 @ 2 f x k ; k ¼ 0; 1; 2; . . .; k rf x ;

ð6:6Þ

where ak is the step length. The globalization version of nonsmooth Newton’s method for solving (6.5) is given as follows. Algorithm 6.2.3. S0. Given k :¼ 0. Choose x 0 ; r 2 ð0; 1Þ, q 2 ð0; 1Þ, d [ 0, and g0 [ 0; g1 [ 0. S1. Calculate rf ðx k Þ. If krf ðx k Þk  d, stop. Otherwise, go to S2.

Semismooth Newton’s Method

89

S2. Select an element V k 2 @ 2 f ðx k Þ and apply conjugate gradient method [23] to find an approximate solution d k by   ð6:7Þ V k d k þ rf x k ¼ 0 such that

kV k d k þ rf ðx k Þk  lk krf ðx k Þk;

where lk ¼ minðg0 ; g1 krf ðx k ÞkÞ. S3. Do line search, and let m k [ 0 be the smallest integer such that the following holds f ðx k þ qm d k Þ  f ðx k Þ þ rqm rf ðx k ÞT d k : Let ak ¼ qmk . S4. Let x k þ 1 ¼ x k þ ak d k , k :¼ k þ 1, go to S1. Remark 6.2.4. In S2, the search direction d k is obtained by solving the linear system (8.39) inexactly. In fact, one can have other choices of lk to obtain a different approximate solution d k .

6.3

Support Vector Machine

SVM is a popular approach to deal with classification and regression, both of which are fundamental problems in machine learning [3, 37, 59, 78]. Support vector machine (SVM) includes support vector classification (SVC) and support vector regression (SVR) [19]. See figure 6.2.

FIG. 6.2 – Two classifications of SVM.

Modern Optimization Methods

90

Support vector classification Given training data x 1 ; x 2 ; . . .; x l 2 IRn and the corresponding label y 1 ; y 2 ; . . .; y l 2 f1; 1g, support vector classification is to find a hyperplane xT x þ b ¼ 0 to separate the two types of data. Assume that the two types of data can be separately successfully. The mathematical model of SVC is as follows 1 kxk22 min x2IRn ;b2IR 2  T ð6:8Þ s:t: y i x x i þ b  1; i ¼ 1; . . .; l: Model (6.8) is also referred to as the hard margin model. In many cases, the two types of data can not be seperated successfully. Therefore, it leads to the regularized model, which takes the following form l X 1 2 min kxk þ C nðx; x i ; y i ; bÞ; ð6:9Þ x2IRn ;b2IR 2 i¼1 where C [ 0 is a penalty parameter and nðÞ is the loss function. In particular, if nðÞ takes the L2-loss function, it leads to the L2-loss SVC model l  X    2 1 2 kxk min þ C max 1  y i xT x i þ b ; 0 : ð6:10Þ x2IRn ;b2IR 2 i¼1 In literature, problem (8.24) can be reformulated as l X 1 kxk22 þ C min ni x;b;n 2 i¼1  T  s:t: y i x x i þ b  1  ni ; i ¼ 1; . . .; l; n  0: It is referred to as the primal form of L2-loss SVC. The dual problem is l X l l P 1X max ai  ai aj y i y j x Ti x j 2 i¼1 j¼1 a2IRl i¼1 s:t:

l P

i¼1

ai yi ¼ 0;

ð6:11Þ

ð6:12Þ

0  ai  C ; i ¼ 1; . . .; l:

The derivation of dual problem will be discussed in chapter 7. Support vector regression Given training data x 1 ; x 2 ; . . .; x l 2 IRn and the corresponding observations y 1 ; y 2 ; . . .; y l 2 IR, SVR is to find x 2 IRn and b 2 IR such that xT x i þ b is close to the target value y i , i ¼ 1; . . .; l; as much as possible. One of the mathematical formulation for SVR is as follows 1 kxk22 min x2IRn ;b2IR 2 ð6:13Þ s:t: y i  xT x i  b  ; i ¼ 1; . . .; l; xT x i þ b  y i  ; i ¼ 1; . . .; l:

Semismooth Newton’s Method

91

The -L2-loss SVR model (We omit the bias term b for SVR) takes the following form l  X



2 1 max xT x i  y i  ; 0 : minn f 2 ðxÞ :¼ kxk2 þ C x2IR 2 i¼1

ð6:14Þ

Various methods have been proposed to solve L2-loss SVC and e-L2-loss SVR in literature. For L2-loss SVC, a modified Newton method (actually a semismooth Newton’s method with unit step size) is proposed by Mangasarian [45], where the inverse of the Hessian matrix is used to calculate the Newton direction. Keerthi and DeCoste [30] proposed a modified Newton method. They computed the Newton point and did an exact line search to determine step length. Lin et al. [43] proposed a trust region Newton method (TRON) for the L2-loss SVC model. Chang et al. [6] proposed a coordinate descent method for the primal problem (8.24). Hsieh et al. [26] proposed a dual coordinate descent method (DCD) for the dual problem. Hsia et al. [25] investigated trust region update rules in Newton’s method. They proposed using line search and trust region to obtain step length but they focused on investigating the trust region update rules in Newton’s method for L2-loss SVC. For -L2-loss-SVR, Ho and Lin [24] proposed TRON and DCD. Gu et al. [22] proposed a smoothing Newton method for the primal problem (6.11). The state-of-art softwares include LIBLINEAR1 and LIBSVM.2 For example, TRON (Trust Region Newton method) and DCD (Dual coordinate descent method) for SVC and SVR are included in LIBLINEAR.

6.4

Semismooth Newton’s Method for SVM

In this part, we will show how to apply semismooth Newton’s method to solve SVC (8.24) and SVR (6.14). We take SVC as an example. We refer to [71] for more details. As in [26], by setting x Ti

½x Ti ; 1;

xT

½xT ; b;

we transfer L2-loss SVC (8.24) to the unbiased form l  X    2 1 minn f 1 ðxÞ :¼ kxk2 þ C max 1  y i xT x i ; 0 : x2IR 2 i¼1

ð6:15Þ

The functions in problem (8.25) and problem (6.14) are continuously differentiable but not twice continuously differentiable. We will apply semismooth Newton’s method to solve the two models. 1

https://www.csie.ntu.edu.tw/cjlin/liblinear/. https://www.csie.ntu.edu.tw/cjlin/libsvm/index.html.

2

Modern Optimization Methods

92

Solving L2-Loss SVC (8.25) is equivalent to solving rf 1 ðxÞ ¼ 0;

ð6:16Þ

where rf 1 ðxÞ ¼ x  2C

l X i¼1

maxð0; 1  y i xT x i Þy i x i :

Note that rf 1 is strongly semismooth. By the chain rule [9, theorem 2.3.9], we have the following result. Proposition 6.4.1. There is @ 2 f 1 ðxÞ V 1 , where ( ) l X T T V 1 ¼ I þ 2C h i x i x i ; h i 2 @maxð0; z i ðxÞÞ; z i ðxÞ ¼ 1  y i x x i ; i ¼ 1; . . .; l : i¼1

Therefore, we can apply the globalized version of the semismooth Newton’s method in algorithm 6.2.3 to solve (9.4). To analyze the convergence rate, we need to check the semi smoothness of rf 1 ðxÞ as well as the nonsingularity of all elements in @ 2 f 1 ðxÞ. In fact, we have the following property. Proposition 6.4.2. For any V 2 V 1, V is positive definite. Together with the strong semismoothness of function rf 1 ðxÞ, we have the following result. Theorem 6.4.3. Let x be a solution of (8.25). Let xk be generated by algorithm 6.2.3 with f ðxÞ :¼ f 1 ðxÞ. Then fxk g is globally convergent to x . Moreover, the convergence rate is quadratic. Figure 6.3 demonstrates the fast convergence rate (quadratic convergence rate) of semismooth Newton’s method, where w3a and real-sim are datasets in LIBLINEAR.

1 l

FIG. 6.3 – Performance when C 2  f102 ; 101 ; 1; 101 ; 102 g.

Semismooth Newton’s Method

93

Exploring sparsity It seems that we have reached a perfect ending so far since we applied semismooth Newton’s method successfully to SVC and got a quadratic convergence rate. However, notice that for SVM, besides the fast convergence rate, we also need to make sure that the algorithm is very fast. The common sense for second-order methods like semismooth Newton’s method is that it involves heavy computational cost due to the calculation of the Hessian matrix (or Clarke’s generalized Jacobian) as well as the search direction. Below we will show that by exploring the sparse structure of SVM, we can significantly reduce the computational cost and speed up the semismooth Newton’s method. The key point is that we do not form Clarke’s generalized Jacobian explicitly, but use it implicitly. Moreover, we obtained the search direction by the conjugate gradient method, as shown in S2 in algorithm 6.2.3. Next, we show more details about applying the conjugate gradient method. When solving the linear system (8.39) by conjugate gradient method, we need to calculate V Dx, where V 2 V1 with V 1 defined as in proposition 6.4.1. The key point is to calculate V Dx, V 2 IRnn . In our implementation, we do not save V explicitly. Instead, we use the product V Dx directly by V Dx ¼ Dx þ 2C

l X i¼1

h i ðx Ti DxÞx i :

Recall that h i 2 @maxð0; z i ðxÞÞ and (6.2). Then one only needs to do the summition for the terms with h i ¼ 1. Let I k :¼ fi j z ki ðxk Þ ¼ 1  y i x Ti xk [ 0; i ¼ 1; :::; lg: We have the following result V Dx ¼ Dx þ 2C

X i2I k

ðx Ti DxÞx i :

Below we give the comparison of computational cost in different ways as shown in table 6.1. TAB. 6.1 – Comparison of computational cost. Formula Calculate V Dx directly With definition of I k

l P

h i ðx Ti DxÞx i P T V Dx ¼ Dx þ 2C ðx i DxÞx i

V Dx ¼ Dx þ 2C

i¼1

i2I k

Complexity lð2n þ 1Þ þ n þ 1 2jI k jn þ n þ 1

Modern Optimization Methods

94

In figure 6.4, we show by an example that the ratio of jI k j over sample size l is indeed small. Therefore, compared with calculating V Dx directly, using the definition of I k can lead to significantly lower computational cost.

I

I

FIG. 6.4 – Demonstration of jI k j and ratiok for each iteration. In tables 6.2–6.4, we report the comparison of the semismooth Newton’s method with TRON and DCD in LIBLINEAR. Here accuracy is calculated by the number of correctly estimated labels over the sample size. It is a measurement for SVC. The higher accuracy is, the better the performance is. MSE is defined as MSE ¼

m 1X ðy  y 0i Þ2 ; m i¼1 i

x i ; i ¼ 1; :::; m, and where y i are the observed data corresponds to test data b x i . MSE is a measurement for SVR. The lower MSE is, the better the y 0i ¼ xT b performance is. The winners for accuracy and MSE among the three methods are marked in bold. One can see that semismooth Newton’s method is indeed very efficient, especially for SVR. More details can be found in [71]. TAB. 6.2 – L2-loss SVC. A1: DCD1; A2: TRON1; A3: semismooth’s Newton method (LIBLINEAR datasets). Data a1a a2a a3a a4a a5a a6a a7a a8a a9a australian

t(s) (A1jA2jA3) 0.04j0.07j0.08 0.03j0.06j0.08 0.03j0.06j0.08 0.03j0.05j0.08 0.03j0.05j0.07 0.02j0.03j0.05 0.02j0.03j0.04 0.03j0.04j0.06 0.04j0.06j0.09 0.00j0.00j0.00

Accuracy (A1jA2jA3) 84.63j84.63j84.66 84.70j84.70j84.72 84.67j84.67j84.62 84.68j84.68j84.73 84.71j84.71j84.74 84.40j84.40j84.95 84.78j84.78j84.77 84.31j84.31j84.30 84.64j84.64j84.66 84.78j84.78j85.14

Semismooth Newton’s Method

95

TAB. 6.3 – L2-loss SVC. A1: DCD1; A2: TRON1; A3: semismooth’s Newton method (further LIBLINEAR datasets). Data breast-cancer cod-rna colon-cancer diabetes duke breast-cancer fourclass german. numer gisette heart ijcnn1 ionosphere leukemia liver-disorders mushrooms news20.binary phishing rcv1.binary real-sim skin nonskin splice sonar svmguide1 svmguide3 w1a w2a w3a w4a w5a w6a w7a w8a covtype.binary

t(s) (A1jA2jA3) 0.00j0.00j0.00 3.10j0.06j0.09 0.01j1.03j0.05 0.00j0.00j0.00 0.02j1.75j0.17 0.00j0.00j0.00 0.01j0.00j0.01 4.93j12.12j14.18 0.01j0.00j0.01 0.08j0.07j0.08 0.01j0.00j0.01 0.02j1.95j0.25 0.00j0.00j0.00 0.01j0.01j0.02 0.61j1.52j2.45 0.02j0.03j0.03 0.12j0.16j0.22 0.34j0.29j0.37 15.78j0.08j0.17 0.15j0.01j0.01 0.00j0.00j0.01 0.00j0.00j0.00 0.00j0.00j0.01 0.04j0.05j0.10 0.05j0.06j0.09 0.05j0.06j0.09 0.04j0.05j0.09 0.03j0.05j0.08 0.03j0.03j0.06 0.02j0.02j0.05 0.04j0.06j0.10 31.25j1.18j0.70

Accuracy (A1jA2jA3) 98.90j98.90j98.90 81.58j82.60j76.01 72.00j72.00j72.00 80.46j80.46j79.48 80.00j80.00j80.00 66.96j66.96j74.94 76.50j76.50j76.75 97.00j97.00j97.00 85.19j85.19j87.04 91.44j91.44j92.31 93.57j93.57j92.86 26.67j26.67j93.33 39.66j62.07j65.52 96.43j96.43j96.43 72.14j72.14j69.84 90.59j90.59j90.59 93.74j93.74j94.07 78.78j78.78j73.88 89.16j89.16j90.61 84.94j84.94j85.40 14.46j14.46j15.66 11.89j11.89j11.89 40.44j40.44j40.44 99.32j99.32j99.92 99.31j99.31j99.92 99.29j99.29j99.93 99.30j99.30j99.92 99.27j99.27j99.92 99.36j99.36j99.94 99.31j99.31j99.95 99.33j99.33j99.91 59.29j59.29j61.54

Modern Optimization Methods

96

TAB. 6.4 – The comparison results for -L2-loss SVR with C ¼ 1l  102 and  ¼ 1e  2. B1: DCD2; B2: TRON2; B3: the semismooth Newton’s method. Data abalone bodyfat cpusmall tfidf.train tfidf.test eunite2001 housing mg mpg pyrim space ga triazines

6.5

t(s) (B1jB2jB3) 0.00j0.00j0.01 0.00j0.00j0.00 0.01j0.01j0.02 1.64j1.43j1.73 0.57j0.41j0.78 0.00j0.00j0.00 0.00j0.00j0.01 0.00j0.00j0.00 0.00j0.00j0.00 0.01j0.00j0.01 0.00j0.00j0.01 0.03j0.00j0.02

MSE (B1jB2jB3) 50.07j50.07j4.17 0.77j0.77j0.00 112.35j112.37j102.24 0.46j0.46j0.14 0.40j0.40j0.13 131854j131854j408.44 194.38j194.38j71.45 0.87j0.87j0.02 562.55j562.56j37.48 0.07j0.07j0.01 0.44j0.44j0.03 0.03j0.03j0.03

Exercises

Exercise 6.5.1. Prove proposition 6.4.1. Exercise 6.5.2. Prove proposition 6.4.2. Exercise 6.5.3. Show an example of a strongly semismooth function and derive Clarke’s generalized Jacobian. Exercise 6.5.4. Derive the gradient rf 2 ðxÞ and Clarke’s generalized Jacobian @ 2 f 2 ðxÞ.

Chapter 7 Theory of Constrained Optimization A general form for constrained optimization problem is as follows.  ci ðxÞ ¼ 0; i 2 E; min f ðxÞ s:t: ci ðxÞ  0; i 2 I ; x2IRn

ð7:1Þ

where f and ci : IRn ! IR are all smooth functions, and E is the set of equality constraints; I is the set of inequality constraints. The feasible set X is defined as X ¼ fx j ci ðxÞ ¼ 0; i 2 E; ci ðxÞ  0; i 2 I g: Then we get min f ðxÞ: x2X

7.1

Local and Global Solutions

The potential difficulty for constrained optimization problems to identify global and local minimum is that constraints may make things more difficult. Example 7.1.1. minðx 2 þ 100Þ2 þ 0:01x 21 s:t: x 2  cosx 1  0: If there is no constraint, the optimal solution is x  ¼ ð100; 0Þ.  With constraint, there are local solutions near the points x ðkÞ ¼ ðkp; 1ÞT ;

k ¼ 1; 3; 5; . . .

See figure 7.1.

DOI: 10.1051/978-2-7598-3174-6.c007 © Science Press, EDP Sciences, 2023

Modern Optimization Methods

98

local solutions

constraint contours of f

FIG. 7.1 – Constrained problem with many isolated local solutions. Definition 7.1.2. A point x  is a local solution if x  2 X and there is a neighborhood N of x  such that f ðx  Þ  f ðxÞ for all x 2 N \ X. Definition 7.1.3. A point x  is a strict local solution (or strong local solution) if x  2 X and there is a neighborhood N of x  such that f ðx  Þ\f ðxÞ for all x 2 N \ X with x 6¼ x  . Definition 7.1.4. A point x  is a isolated local solution if x  2 X and there is a neighborhood N of x  such that x  is the only local solution in N \ X. Remark 7.1.5. Isolated local solutions are strict, but the reverse is not true.

7.1.1

Smoothness

The smoothness of f and constraints function is important. Smoothness of constraints: some constraints can be described by nonsmooth functions, and also smooth functions. For example, kxk1 ¼ jx 1 j þ jx 2 j  1: The smooth description is x 1 þ x 2  1; x 1  x 2  1; x 1 þ x 2  1; x 1  x 2  1: See figure 7.2 for the feasible region. Another example is as follows. For a nonsmooth description of f ðxÞ, where f ðxÞ ¼ maxðx 2 ; xÞ; the smooth description is min t s:t: t  x; t  x 2 :

Theory of Constrained Optimization

99

FIG. 7.2 – A feasible region with a nonsmooth boundary can be described by smooth constraints.

7.2

Examples

Active Set Definition 7.1.2. The active set AðxÞ at any feasible x consists of the equality constraint indices from E together with the indices of the inequality constraints i for which ci ðxÞ ¼ 0; that is, AðxÞ ¼ E [ fi 2 I j ci ðxÞ ¼ 0g: At a feasible point x, the inequality constraint i 2 I is said to be active if ci ðxÞ ¼ 0, and inactive if the strict inequality ci ðxÞ [ 0 is satisfied. Example 7.2.2. min x 1 þ x 2 s:t: x 21 þ x 22  2 ¼ 0: f ðxÞ ¼ x 1 þ x 2 , I ¼ ;, E ¼ f1g, c1 ðxÞ ¼ x 21 þ x 22  2. pffiffiffi pffiffiffi rf ðxÞ ¼ ½1; 1. The optimal solution is ð 2;  2Þ. See figure 7.3. rf ðx  Þ is parallel to rc1 ðx  Þ, i.e., there is k (in this case, k ¼  12) such that rf ðx  Þ ¼ k c1 ðx  Þ: The interpretation of rf ðx  Þ ¼ k c1 ðx  Þ by first-order Taylor series approximation is as follows. Suppose x is feasible, i.e., c1 ðxÞ ¼ 0. To keep feasibility, step s has to satisfy: c1 ðx þ sÞ ¼ 0, i.e., 0 ¼ c1 ðx þ sÞ c1 ðxÞ þ rc1 ðxÞT s ¼ rc1 ðxÞT s:

Modern Optimization Methods

100

FIG. 7.3 – Example 7.2.2, showing constraint and function gradients at various feasible points.

So, to stay feasibility of c1 , to first order, we requires rc1 ðxÞT s ¼ 0: To maintain reduction in f , if s is decrease direction, then we have 0 [ f ðx þ sÞ  f ðxÞ rf ðxÞT s; i.e., rf ðxÞT s\0. Overall, we have rc1 ðxÞT s ¼ 0; and rf ðxÞT s\0:

ð7:2Þ

If no s satisfies (7.2), then x is a local minimizer. This could only happen when rf ðxÞ and c1 ðxÞ are parallel. If rf ðxÞ and c1 ðxÞ are not parallel, i.e., (7.2) holds, then we can set ! rc ðxÞrc ðxÞ d 1 1 d ¼  I  rf ðxÞ; d ¼  : kdk krc1 ðxÞk2 It is easy to verify that d satisfies (7.2), i.e., rc1 ðxÞT d ¼ 0; and rf ðxÞT d\0: Introduce the Lagrangian function: Lðx; k1 Þ ¼ f ðxÞ  k1 c1 ðxÞ:

ð7:3Þ

Theory of Constrained Optimization

101

Then at solution x  , there is a scalar k1 , such that rLðx  ; k1 Þ ¼ rf ðx  Þ  k1 rc1 ðx  Þ ¼ 0; where k1 is the Lagrangian multiplier for constraint c1 ðxÞ ¼ 0. The (first-order) necessary for one equality constraint optimization problem is as follows. If x  is a local minimizer, then there is a scalar k1 , such that the followings holds. rx Lðx  ; k1 Þ ¼ 0; and c1 ðxÞ ¼ 0: Remark 7.2.3.  This is the necessary but not sufficient condition. x ¼ ð1; 1ÞT with k ¼ 12 also satisfies the first order necessary condition, but it is a local maximizer.  For equality constraint c1 ðxÞ, k1 may be either positive, or negative, or zero. Example 7.2.4. min x 1 þ x 2

s:t: 2  x 21  x 22  0:

We have the following observations.  Optimal solution x ¼ ð1; 1ÞT .  rf ðx  Þ ¼ k1 rc1 ðx  Þ with k1 ¼ 12. The difference with example 7.2.4 is k1 can be only nonnegative!  At boundary point x, rc1 points to the interior of the feasible region. Suppose a feasible point x is not optimal if we can find a small step s that both maintains feasibility and reduction in f to first order. Then we have: The reduction in f requires rf ðxÞT s\0: To keep feasibility, there is 0  c1 ðx þ sÞ c1 ðxÞ þ rc1 ðxÞT s, i.e., feasibility is maintained to first order if c1 ðxÞ þ rc1 ðxÞT s  0:

ð7:4Þ

Case 1: x lies strictly inside the circle, i.e., c1 ðxÞ [ 0, then any s satisfies (7.4) provided that the step length is sufficiently small. In this case, if rf ðxÞ 6¼ 0, then we can choose s ¼ arf ðxÞ to get another feasible point x þ as with smaller function value. Case 2: x lies on the boundary of the circle, i.e., c1 ðxÞ ¼ 0. Then the conditions for s become: rf ðxÞT s\0;

rc1 ðxÞT s  0:

s exists if and only if rf ðxÞ and rc1 ðxÞ are not in the same direction (see figure 7.4), i.e., s does not exists (the intersection of the two regions is empty) only when

Modern Optimization Methods

102

FIG. 7.4 – Improvement directions from two feasible points. rf ðxÞ ¼ k1 rc1 ðxÞ;

for some k1  0:

ð7:5Þ

If rf ðxÞ and rc1 ðxÞ are in opposite directions, then the intersection of the two regions would be an entire open half-plane. In other words, in (7.5), k can not be negative. s does not exists (the intersection of the two regions is empty) only when rf ðxÞ ¼ k1 rc1 ðxÞ;

for some k1  0:

ð7:6Þ

If x is an optimal solution lying on the boundary, then no first-order feasible descent direction s exists, (i.e., s satisfying rf ðxÞ ¼ k1 rc1 ðxÞ;

for some k1  0

ð7:7Þ

does not exist). In other words, if x is an optimal solution lying on the boundary, then the following holds: rf ðxÞ ¼ k1 rc1 ðxÞ;

for some k1  0:

ð7:8Þ

Let Lagrangian function be defined as Lðx; k1 Þ ¼ f ðxÞ  k1 c1 ðxÞ: When no first-order feasible descent direction exists for example 7.2.4 at some point x  , we have that rx Lðx  ; k1 Þ ¼ 0;

for some k1  0:

Also, we require that k1 c1 ðx  Þ ¼ 0; which is complementarity condition.

Theory of Constrained Optimization

103

Example 7.2.5. min x 1 þ x 2 s:t: 2  x 21  x 22  0; x 2  0: Let Lagrangian function be defined as Lðx; k1 Þ ¼ f ðxÞ  k1 c1 ðxÞ  k2 c2 ðxÞ: When no first-order feasible descent direction exists for example 7.2.5 at some point x  , we have that rx Lðx  ; k1 Þ ¼ 0;

for some k1  0; k2  0:

Also, we require that ki ci ðx  Þ ¼ 0;

7.3

i ¼ 1; 2:

Tangent Cone and Constraint Qualifications

In the previous section, we used the first-order Taylor expansion of the functions to analyse the feasible descent direction. This approach is based on the assumption that the linearized approximation captures the essential geometric features of the feasible set near x. How can we guarantee such as assumption hold? We need the property called constraint qualification to make sure that such an assumption holds. Let X be a closed convex set. Definition 7.3.1. Given a feasible point x, we call fz k g a feasible sequence approaching x if z k 2 X for all k sufficiently large and z k ! x. A tangent is a limiting direction of a feasible sequence. See figure 7.5. Definition 7.3.2. The vector d is said to be a tangent (or tangent vector) to X at a point x if there are a feasible sequence fz k g approaching x and a sequence of positive scalars ft k g with t k ! 0 such that zk  x lim ¼ d: k!1 tk The set of all tangents to X at x  is called the tangent cone and is denoted by T X ðx  Þ. Example 7.3.3. Take example 7.2.2 as example. Consider tangent cone at nonoptimal  qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi  pffiffiffi T T point x ¼  2; 0 . Take z k ¼  2  1=k 2 ; 1=k , t k ¼ kz k  xk, so d ¼  qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi T ð0; 1ÞT is a tangent. Take z k ¼  2  1=k 2 ; 1=k , t k ¼ kz k  xk, so d ¼ ð0; 1ÞT is a tangent. So T X ðxÞ ¼ fð0; d 2 ÞT : d 2 2 IRg.

Modern Optimization Methods

104

FIG. 7.5 – Constraint normal, objective gradient, and feasible sequence. Recall constraints



ci ðx Þ ¼ 0; i 2 E; ci ðx Þ  0; i 2 I ;

ð7:9Þ

and active set AðxÞ at a feasible point x: Aðx Þ ¼ E [ fi 2 I j ci ðx Þ ¼ 0g:

ð7:10Þ

Definition 7.3.4. Given a feasible point x and the active constraint set Aðx Þ the set of the linearized feasible direction F ðx Þ is (  )  T  d rci ðx Þ ¼ 0; for all x 2 E; : ð7:11Þ F ðx Þ ¼ d  T  d rci ðx Þ  0; for all i 2 Aðx Þ \ I Example 7.3.5. Take example 7.2.2 as an example. Consider the linearized feasible pffiffiffi direction at nonoptimal point x ¼ ð 2; 0ÞT . F ðxÞ ¼ fd : d T rc1 ðxÞ ¼ 0g: 

d 0 ¼ d rc1 ðxÞ ¼ ½2x 1 ; 2x 2  1 d2 T



pffiffiffi ¼ 2 2d 1 ! d 1 ¼ 0:

We have F ðxÞ ¼ fð0; d 2 ÞT : d 2 2 IRg: Recall T X ðxÞ ¼ fð0; d 2 ÞT : d 2 2 IRg. We have T X ðxÞ ¼ F ðxÞ. Recall the constraints in (7.9) and the active set in (7.10). Definition 7.3.6. (LICQ) Given the point x and the active set AðxÞ, we say that the linear independent constraint qualification (LICQ) holds if the set of active constraint gradients frci ðxÞ; i 2 AðxÞg is linearly independent.

Theory of Constrained Optimization

105

Roughly speaking, constraint qualifications are conditions under which the linearized feasible set F ðxÞ is similar to the tangent cone T X ðxÞ. Under LICQ, at any feasible point, F ðxÞ is the same as the tangent cone T X ðxÞ, i.e., F ðxÞ ¼ T X ðxÞ. LICQ is one of the constraint qualifications. There are other constraint qualifications such as MFCQ (Mangasarian-Fromovitz constraint qualification).

7.4

First-Order Optimality Conditions

Recall constrained optimization problem in (7.1) and active set AðxÞ at a feasible point x as in (7.10). Define the Lagrangian function as X Lðx; kÞ ¼ f ðxÞ  ki ci ðxÞ: i2E [ I

Theorem 7.4.1. (Theorem 7.4.1 First-order necessary conditions) Suppose that x  is a local solution of (7.1), that the functions f and ci are continuously differentiable, and that the LICQ holds at x  . Then there is a Lagrangian multiplier vector k , i 2 E [ I , such that the following conditions are satisfied at x  : rx Lðx  ; k Þ ¼ 0;

ð7:12Þ

ci ðx  Þ ¼ 0;

for all i 2 E;

ð7:13Þ

ci ðx  Þ  0;

for all i 2 I ;

ð7:14Þ

ki  0; ki ci ðx  Þ ¼ 0;

for all i 2 I ; for all i 2 E [ I :

ð7:15Þ ð7:16Þ

The above system is known as the Karush–Kuhn–Tucker conditions, or KKT conditions for short. (7.16) are complementarity conditions: it implies that either constraint i is active or ki ¼ 0, or both. Definition 7.4.2. (Strict Complementarity) Suppose that x  is a local solution of (7.1) and a vector k satisfying (7.12)–(7.16), we say that the strict complementarity condition holds if exactly one of ki and ci ðx  Þ is zero for each index i 2 I . In other words, we have that ki [ 0 for each i 2 I \ Aðx  Þ. Remarks: A strict complementarity condition usually makes it easier for algorithms to determine the active set Aðx  Þ, and converge rapidly to the solution x  . For a given problem and a solution x  , there may be many vectors k satisfying the KKT conditions. However, if LICQ holds, then such k is unique.

Modern Optimization Methods

106

Example 7.4.3. Consider



3 2 1 4 min x 1  þ x2  s:t: kxk1 ¼ jx 1 j þ jx 2 j  1: x 2 2 Reformulate it as a smooth optimization problem: 2 3 1  x1  x2

2

4 6 1  x1 þ x2 7 3 1 7 min x 1  þ x2  s:t: 6 4 1 þ x 1  x 2 5  0: x 2 2 1 þ x1 þ x2 As shown in figure 7.6 below.

FIG. 7.6 – Inequality constrained problem with solution at ð1; 0ÞT : The solution is x  ¼ ð1; 0ÞT . The active constraints are c1 and c2 . We have

1 T rf ðx  Þ ¼ 1;  ; rc1 ðx  Þ ¼ ð1; 1ÞT ; rc2 ðx  Þ ¼ ð1; 1ÞT : 2 rc1 ðx  Þ; rc2 ðx  Þ are linearly independent, so LICQ holds. KKT conditions are T satisfied when we set k ¼ 34 ; 14 ; 0; 0 .

7.5

Second-Order Conditions

Motivation for Second-Order Conditions Recall the KKT condition in (7.12)–(7.16), and the linearized feasible direction F ðxÞ in (7.11). P For any d 2 F ðx  Þ, 0 ¼ d T rx Lðx  ; k Þ ¼ d T rf ðx  Þ  i2AðxÞ ki d T rci ðx  Þ. Therefore, we have

Theory of Constrained Optimization X

d T rf ðx  Þ ¼

i2AðxÞ

ki d T rci ðx  Þ ¼

For d 2 F ðx  Þ, we have

107

X i2E

ki d T rci ðx  Þ þ

X i2AðxÞ \ I

ki d T rci ðx  Þ  0:

d T rf ðx  Þ  0:

If d T rf ðx  Þ [ 0, then d is an increasing direction of f . If d T rf ðx  Þ ¼ 0, then d keeps this value the same, which is the undecided direction. The second-order conditions concern the curvature of the Lagrangian function in the “undecided” directions–the directions w 2 F ðx  Þ for which w T rf ðx  Þ ¼ 0. Assumption 7.5.1. f and ci , i 2 E [ I are all assumed to be twice continuously differentiable. Definition 7.5.2. Given the linearized feasible direction F ðx  Þ and some Lagrangian multiplier vector k satisfying the KKT condition, we define the critical cone Cðx  ; k Þ as follows. Cðx  ; k Þ ¼ fw 2 F ðx  Þ j rci ðx  ÞT w ¼ 0; all i 2 AðxÞ \ I with ki [ 0g: Equivalently,

8 T < rci ðx  Þ w ¼ 0;    w 2 Cðx ; k Þ $ rci ðx ÞT w ¼ 0; : rci ðx  ÞT w  0;

for all i 2 E; for all i 2 Aðx  Þ \ I with ki [ 0; for all i 2 Aðx  Þ \ I with ki ¼ 0:

ð7:17Þ

Note that ki ¼ 0 for i 2 I nAðx  Þ, so we have w 2 Cðx  ; k Þ ) ki rci ðx  ÞT w ¼ 0; Therefore, w 2 Cðx  ; k Þ ) w T rf ðx  Þ ¼

X i2E [ I

for all i 2 E [ I :

ki w T rci ðx  Þ ¼ 0:

Example 7.5.3. Consider min x 1 s.t: x 2  0; 1  ðx 1  1Þ2  x 22  0;  The solution is x  ¼ ð0; 0ÞT , and Aðx  Þ ¼ f1; 2g, the unique Lagrangian multiplier k ¼ ð0; 0:5ÞT .  rc1 ðx  Þ ¼ ð0; 1ÞT , rc2 ðx  Þ ¼ ð2; 0ÞT , so LICQ holds. (implying that the Lagrangian multiplier is unique.)  The linearized feasible set is F ðx  Þ ¼ fd j d  0g. The critical cone is Cðx  ; k Þ ¼ fð0; w 2 ÞT j w 2  0g:

Modern Optimization Methods

108

Theorem 7.5.4. (Second-Order Necessary Conditions) Suppose that x  is a local minimizer of (7.1) and that LICQ condition is satisfied. Let k be the Lagrangian multiplier vector for which the KKT conditions are satisfied. Then w T r2xx Lðx  ; k Þw  0;

for all w 2 Cðx  ; k Þ:

The second-order necessary conditions mean that if x  is a local solution, then the Hessian of the Lagrangian has nonnegative curvature along critical directions. Theorem 7.5.5. (Second-Order Sufficient Conditions) Suppose that for some feasible point x  2 IRn , there is a Lagrangian multiplier vector k such that the KKT conditions are satisfied. Suppose also that w T r2xx Lðx  ; k Þw [ 0;

for all w 2 Cðx  ; k Þ; w 6¼ 0:

Then x  is a strict local solution for (7.1). Notice that in theorem 7.5.5, constraint qualification is not required, and strict inequality is used. In particular, if Lðx  ; k Þ 0, then x  is a strict local solution. (See example 7.5.6). Example 7.5.6. min x 1 þ x 2 s:t: 2  x 21  x 22 ¼ 0: The Lagrangian function is Lðx; kÞ ¼ x 1 þ x 2  k1 ð2  x 21  x 22 Þ: The KKT conditions are satisfied by x  ¼ ð1; 1ÞT , with k1 ¼ 12. The Lagrangian Hessian at x  is      2k1 0 1 0 r2xx Lðx  ; k Þ ¼ 0:  ¼ 0 2k1 0 1 Therefore, x  is also a strict local solution. In fact, it is the global solution. Example 7.5.7. Consider min  0:1ðx 1  4Þ2 þ x 22 s:t: x 21 þ x 22  1  0: Notice that f is a nonconvex function, the feasible set is the exterior of the unit circle. f is not bounded below, so no global solution exists. However, we could identify a strict local solution on the boundary of the constraint, by considering the KKT condition and the second-order conditions of theorem 7.5.5. We have 

 0:2ðx 1  4Þ  2k1 x 1 rx Lðx; kÞ ¼ ; 2x 2  2k1 x 2



r2xx Lðx; kÞ

0:2  2k1 ¼ 0

 0 : 2  2k1

Theory of Constrained Optimization

109

x  ¼ ð0; 1ÞT with k1 ¼ 0:3 satisfies the KKT condition, with active set Aðx  Þ ¼ f1g. To check the second-order sufficient conditions, note that   0:4 0 2   ; rc1 ðx  Þ ¼ ð2; 0ÞT ; rxx Lðx ; k1 Þ ¼ 0 1:4 so Cðx  ; k1 Þ ¼ fð0; w 2 ÞT j w 2 2 IRg: Therefore, for any w 2 Cðx  ; k1 Þ with w 6¼ 0,  T  T  0 0:4 0 0 T 2   ¼ 1:4w 22 [ 0: w rxx Lðx ; k1 Þw ¼ w2 0 1:4 w 2 Hence, the second-order sufficient conditions are satisfied, we conclude that x  ¼ ð0; 1ÞT is a strict local solution for example 7.5.7.

7.6

Duality

Duality Theory is important in the following sense. To motivate and develop some important algorithms, such as the augmented Lagrangian algorithms, one need duality theory. It provides important insight into the fields of convex nonsmooth optimization and even discrete optimization. Its specialization in linear programming proved central to the development of that area. Duality theory shows how we can construct an alternative problem (dual problem) from the functions and data that define the original problem (primal problem). In particular, sometimes the dual problem is easier to solve computationally than the original problem. Sometimes the dual can be used to obtain a lower bound on the optimal value for the primal problem. Consider (referred to as primal problem) min f ðxÞ s:t: cðxÞ  0;

x2IRn

where cðxÞ ¼ ðc1 ðxÞ; . . .; cm ðxÞÞT 2 IRm . The Lagrangian function (with Lagrangian multiplier k 2 IRm ) is Lðx; kÞ ¼ f ðxÞ  kT cðxÞ: Define the dual objective function q : IRm ! IR as qðkÞ :¼ inf Lðx; kÞ: x

In many problems, this minimum is 1 for some values of k. We define the domain of q as follows, D :¼ fk j qðkÞ [  1g:

Modern Optimization Methods

110

The dual problem is defined as max qðkÞ s:t: k  0: k2D

ð7:18Þ

Example 7.6.1. min 0:5ðx 21 þ x 22 Þ s:t: x 1  1  0: x

 Step 1: The Lagrangian function is Lðx; kÞ ¼ 0:5ðx 21 þ x 22 Þ  k1 ðx 1  1Þ:  Step 2: Find qðkÞ, i.e., inf x Lðx; kÞ. Fix k, and Lðx; kÞ is a convex function of x. So to get inf x Lðx; kÞ, we only need to let rx Lðx; kÞ ¼ 0, we have x 1  k1 ¼ 0; x 2 ¼ 0 ! x 1 ¼ k1 ; x 2 ¼ 0: Substituting into Lðx; kÞ, we have the dual objective qðk1 Þ ¼ 0:5ðk21 þ 0Þ  k1 ðk1  1Þ ¼ 0:5k21 þ k1 :  Step 3: The dual problem is max  0:5k21 þ k1 : k1  0

Clearly, the optimal solution is k1 ¼ 1. qðkÞ :¼ inf Lðx; kÞ: x

Remark 7.6.2.  The infimum requires finding the global minimizer of the function ð ; kÞ, which may be extremely difficult in practice.  However, when f and ci are convex functions and k  0, the function ð ; kÞ is also convex. In this situation, all local minimizers are global minimizers, so qðkÞ becomes easier to get. (As shown in the above example.) Theorem 7.6.3. The function q is concave and its domain D is convex. Theorem 7.6.4. (Weak Duality) For any x feasible for the primal problem. min f ðxÞ s:t: cðxÞ  0

x2IRn

and any  k  0, we have qð kÞ  f ð x Þ. Proof. T k cðxÞ  f ð x Þ  kcð x Þ  f ð x Þ; qð kÞ ¼ inf f ðxÞ   x

ð7:19Þ

Theory of Constrained Optimization

111

where the last inequality follows from  k  0 and cð x Þ  0.



Theorem 7.6.5. Suppose that x is a solution of the primal problem (7.19) and that f and ci are convex functions on IRn that are differentiable at x. Then any k for which ð x;  kÞ satisfies the KKT conditions above is a solution to the dual problem (7.18). In this case, there is

x Þ ¼ qð kÞ ¼ maxqðkÞ; min f ðxÞ ¼ f ð k0

cðxÞ  0

and

Lðx;  kÞ  Lð x;  kÞ  Lð x ; kÞ;

where x and k are primal and dual feasible, respectively. Example 7.6.6. Linear Programming mincT x s:t: Ax  b  0: The dual objetive is i h qðkÞ ¼ inf cT x  kT ðAx  bÞ ¼ inf ½ðc  AT kÞT x þ bT k: x

x

Case 1: c  AT k 6¼ 0: we can choose proper x to make ðc  AT kÞT x ! 1, so q ! 1. Case 2: c  AT k ¼ 0: q ¼ bT k. Overall, to maximize q, we drop case 1. Therefore, the dual problem is max bT k s:t: AT k ¼ c; k  0: k

Example 7.12. Convex Quadratic Programming 1 min x T Gx þ cT x s:t: Ax  b  0; 2 where G 0. Dual objective is



 1 T T T qðkÞ ¼ inf Lðx; kÞ ¼ inf x Gx þ c x  k ðAx  bÞ : x x 2

Since G 0 Lð ; kÞ is a strictly convex quadratic function, the infimum is achieved when rx Lðx; kÞ ¼ 0, that is, Gx þ c  AT k ¼ 0: There is x ¼ G 1 ðAT k  cÞ:

Modern Optimization Methods

112

Substituting x into the infimum expression, we get the dual objective: 1 qðkÞ ¼  ðAT k  cÞT G 1 ðAT k  cÞ þ bT k: 2 The dual problem is 1 max  ðAT k  cÞT G 1 ðAT k  cÞ þ bT k s:t: k  0: k 2

7.7

KKT Condition

Consider the nearest correlation matrix problem (NCM) min

X2S n

s:t:

1 kX  C k2F 2 X ii ¼ 1; i ¼ 1; . . .; n; X 0:

ð7:20Þ

It arises from finance, and it is a typical quadratic semidefinite programming problem (QSDP). It can be solved by typical QSDP solver QSDP [70, 77]. The specific solver for NCM is semismooth Newton’s method [52]. A question is how to write down the KKT condition and dual problem? We need the following tools. Consider the following constrained optimization problem: min

f ðxÞ

s:t:

GðxÞ 2 K;

x2X

ð7:21Þ

where f : X ! IR is a continuously differentiable function. G : X ! Y is continuously differentiable and K Y is a convex cone. The Lagrange function of this problem is Lðx; lÞ ¼ f ðxÞ  hl; GðxÞi: Þ 2 X  Y Definition 7.7.1. For constrained optimization problems (7.21), if ð x; l satisfies the following KKT condition in feasible point x (i.e. Gð x Þ 2 K), Þ ¼ 0; 0 2 l  þ N K ðGð rx Lð x; l x ÞÞ;

ð7:22Þ

 Lagrange multiplier, where N K ðGð then we call l x ÞÞ is the normal cone of K at point Gð x Þ 2 Y.

Theory of Constrained Optimization

113

Note that  þ N K ðGð  2 N K ðGð 02l x ÞÞ ()  l x ÞÞ () Gð x Þ 2 K; h l; d  Gð x Þi  0; 8 d 2 K ðK is a convex cone; 0 2 K; 2Gð x Þ 2 KÞ () Gð x Þ 2 K; h l; Gð x Þi ¼ 0; h l; di  0; 8 d 2 K  2 K : () Gð x Þ 2 K; h l; Gð x Þi ¼ 0; l Thus, KKT condition (7.22) is equivalent to Þ ¼ 0; x; l rx Lð

Gð x Þ 2 K;

or Þ ¼ 0; rx Lð x; l

h l; Gð x Þi ¼ 0;

 2 K ; l

 ? Gð K 3 l x Þ 2 K:

ð7:23Þ ð7:24Þ

Example 7.7.2. Consider the following optimization problem with equality and inequality constraints
$$\min_{x\in\mathbb{R}^n}\; f(x) \quad \text{s.t.} \quad c_i(x) = 0,\; i \in E := \{1,\ldots,m\}, \quad h_i(x) \geq 0,\; i \in I := \{1,\ldots,k\}. \tag{7.25}$$
Let
$$G(x) = \begin{bmatrix} c_1(x) \\ \vdots \\ c_m(x) \\ h_1(x) \\ \vdots \\ h_k(x) \end{bmatrix}, \qquad K = \{0\}^m \times \mathbb{R}^k_+.$$
Observe that $K^* = \mathbb{R}^m \times \mathbb{R}^k_+$. Let
$$\mu = \begin{bmatrix} \lambda \\ v \end{bmatrix} \in \mathbb{R}^{m+k}, \quad \lambda \in \mathbb{R}^m, \quad v \in \mathbb{R}^k.$$
Recall the KKT condition
$$\nabla_x L(\bar{x},\bar{\mu}) = 0, \quad G(\bar{x}) \in K, \quad \langle \bar{\mu}, G(\bar{x}) \rangle = 0, \quad \bar{\mu} \in K^*.$$
So we have
$$\mu \in K^* \iff \lambda \in \mathbb{R}^m,\; v \in \mathbb{R}^k_+;$$
$$G(x) \in K \iff c_i(x) = 0,\; i \in E,\; h_i(x) \geq 0,\; i \in I;$$
$$\langle \mu, G(x) \rangle = 0 \iff \sum_{i\in E}\lambda_i c_i(x) + \sum_{i\in I} v_i h_i(x) = 0 \iff \sum_{i\in I} v_i h_i(x) = 0 \;\;(c_i(x) = 0) \iff v_i h_i(x) = 0,\; i \in I \;\;(h_i(x) \geq 0,\; v_i \geq 0).$$
The KKT condition of (7.25) can therefore be written as
$$\begin{cases} \nabla_x L(x,\lambda,v) = \nabla f(x) - \sum_{i\in E}\lambda_i\nabla c_i(x) - \sum_{i\in I} v_i\nabla h_i(x) = 0, \\ c_i(x) = 0,\; i \in E, \\ h_i(x) \geq 0,\; v_i \geq 0,\; v_i h_i(x) = 0,\; i \in I. \end{cases}$$

Example 7.7.3. Consider the linear programming problem
$$\min_{x\in\mathbb{R}^n}\; \langle c, x \rangle \quad \text{s.t.} \quad Ax = b,\; x \geq 0, \tag{7.26}$$
where $A \in \mathbb{R}^{m\times n}$ and $b \in \mathbb{R}^m$ are given. The Lagrange function is $L(x,\lambda,\mu) = \langle c, x \rangle - \langle \lambda, Ax - b \rangle - \langle \mu, x \rangle$, where $\lambda \in \mathbb{R}^m$, $\mu \in \mathbb{R}^n$. The KKT condition of (7.26) can be written as
$$\begin{cases} \nabla_x L(x,\lambda,\mu) = c - A^T\lambda - \mu = 0, \\ Ax - b = 0, \\ x \geq 0,\; \mu \geq 0,\; \langle \mu, x \rangle = 0. \end{cases}$$
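For illustration, a minimal Python sketch follows that verifies the KKT system of (7.26) componentwise on a hypothetical toy LP (the data $c$, $A$, $b$ and the candidate point are our own choices, not from the text).

```python
# Toy LP: min <c,x> s.t. Ax = b, x >= 0, with c = (1,2), A = [1 1], b = 1.
# The solution is x* = (1,0) with equality multiplier lam = 1.
import numpy as np

c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

x = np.array([1.0, 0.0])    # candidate primal solution
lam = np.array([1.0])       # candidate equality multiplier
mu = c - A.T @ lam          # stationarity c - A'lam - mu = 0 defines mu

assert np.allclose(A @ x, b)               # primal feasibility: Ax - b = 0
assert np.all(x >= 0) and np.all(mu >= 0)  # x >= 0 and mu >= 0
assert abs(mu @ x) < 1e-12                 # complementarity: <mu, x> = 0
print("(x, lam, mu) satisfies the KKT conditions of (7.26)")
```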

7.8 Dual Problem

Consider problem (7.21). Its dual problem is
$$\sup_{\mu\in K^*}\,\inf_{x\in\mathcal{X}} L(x,\mu) := \sup_{\mu\in K^*} h(\mu), \tag{7.27}$$
where
$$h(\mu) = \inf_{x\in\mathcal{X}} L(x,\mu), \qquad L(x,\mu) = f(x) - \langle G(x), \mu \rangle.$$

Example 7.8.1. (The dual problem of linear programming) Consider the linear programming problem

$$\min_{x\in\mathbb{R}^n}\; \langle c, x \rangle \quad \text{s.t.} \quad Ax = b,\; x \geq 0, \tag{7.28}$$
where $A \in \mathbb{R}^{m\times n}$ and $b \in \mathbb{R}^m$ are given. Here
$$G(x) = \begin{bmatrix} Ax - b \\ x \end{bmatrix}, \qquad K = \{0\}^m \times \mathbb{R}^n_+.$$
Note that $K^* = \mathbb{R}^m \times \mathbb{R}^n_+$. Let
$$\mu = \begin{bmatrix} \lambda \\ v \end{bmatrix} \in \mathbb{R}^{m+n}, \quad \lambda \in \mathbb{R}^m, \quad v \in \mathbb{R}^n.$$
So we have
$$\mu \in K^* \iff \lambda \in \mathbb{R}^m,\; v \in \mathbb{R}^n_+; \qquad G(x) \in K \iff Ax - b = 0,\; x \geq 0.$$
The Lagrange function is
$$L(x,\mu) = \langle c, x \rangle - \langle G(x), \mu \rangle = \langle c, x \rangle - \left\langle \begin{bmatrix} Ax - b \\ x \end{bmatrix}, \begin{bmatrix} \lambda \\ v \end{bmatrix} \right\rangle = \langle c, x \rangle - \langle Ax - b, \lambda \rangle - \langle x, v \rangle = \langle x,\, c - A^T\lambda - v \rangle + \langle b, \lambda \rangle,$$
so that
$$h(\mu) = \inf_{x\in\mathbb{R}^n} \langle x,\, c - A^T\lambda - v \rangle + \langle b, \lambda \rangle = \begin{cases} \langle b, \lambda \rangle, & c - A^T\lambda - v = 0, \\ -\infty, & \text{otherwise}. \end{cases}$$

The dual problem takes the following form
$$\sup_{\mu\in K^*}\,\inf_{x\in\mathcal{X}} L(x,\mu) = \sup_{\mu\in K^*} h(\mu) = \sup_{\lambda\in\mathbb{R}^m,\, v\in\mathbb{R}^n_+} \begin{cases} \langle b, \lambda \rangle, & c - A^T\lambda - v = 0, \\ -\infty, & \text{otherwise}, \end{cases}$$
i.e.,
$$\max_{\lambda\in\mathbb{R}^m,\, v\in\mathbb{R}^n}\; \langle b, \lambda \rangle \quad \text{s.t.} \quad c - A^T\lambda - v = 0,\; v \geq 0.$$

Example 7.8.2. (The dual problem of quadratic programming) Consider the quadratic programming problem
$$\min_{x\in\mathbb{R}^n}\; \frac{1}{2}x^T Q x + \langle c, x \rangle \quad \text{s.t.} \quad Ax = b,\; Bx - d \geq 0, \tag{7.29}$$
where $Q$ is symmetric positive definite, and $A \in \mathbb{R}^{p\times n}$, $b \in \mathbb{R}^p$, $B \in \mathbb{R}^{q\times n}$, $d \in \mathbb{R}^q$ are given. Here
$$G(x) = \begin{bmatrix} Ax - b \\ Bx - d \end{bmatrix}, \qquad K = \{0\}^p \times \mathbb{R}^q_+.$$
Notice that $K^* = \mathbb{R}^p \times \mathbb{R}^q_+$. Let
$$\mu = \begin{bmatrix} \lambda \\ v \end{bmatrix} \in \mathbb{R}^{p+q}, \quad \lambda \in \mathbb{R}^p, \quad v \in \mathbb{R}^q.$$
So we have
$$\mu \in K^* \iff \lambda \in \mathbb{R}^p,\; v \in \mathbb{R}^q_+; \qquad G(x) \in K \iff Ax - b = 0,\; Bx - d \geq 0.$$

The Lagrange function is
$$L(x,\mu) = \frac{1}{2}x^T Q x + \langle c, x \rangle - \langle G(x), \mu \rangle = \frac{1}{2}x^T Q x + \langle c, x \rangle - \langle Ax - b, \lambda \rangle - \langle Bx - d, v \rangle = \frac{1}{2}x^T Q x + \langle x,\, c - A^T\lambda - B^T v \rangle + \langle b, \lambda \rangle + \langle d, v \rangle.$$
Since $Q$ is symmetric positive definite, the problem
$$\inf_{x\in\mathbb{R}^n} L(x,\mu) \tag{7.30}$$
admits a unique minimum point, and
$$\nabla_x L(x,\mu) = 0 \tag{7.31}$$
is the sufficient and necessary optimality condition for the minimizer. (7.31) is equivalent to
$$Qx + c - A^T\lambda - B^T v = 0,$$
so $x^* = -Q^{-1}(c - A^T\lambda - B^T v)$ is the minimizer of (7.30). Substituting $x^*$ gives
$$\inf_{x\in\mathbb{R}^n} L(x,\mu) = L(x^*,\mu) = -\frac{1}{2}(c - A^T\lambda - B^T v)^T Q^{-1}(c - A^T\lambda - B^T v) + \langle \lambda, b \rangle + \langle v, d \rangle.$$
The dual problem is
$$\sup_{\mu\in K^*}\,\inf_{x\in\mathcal{X}} L(x,\mu) = \sup_{\lambda\in\mathbb{R}^p,\, v\in\mathbb{R}^q_+}\; -\frac{1}{2}(c - A^T\lambda - B^T v)^T Q^{-1}(c - A^T\lambda - B^T v) + \langle \lambda, b \rangle + \langle v, d \rangle,$$
i.e.,
$$\max_{\lambda\in\mathbb{R}^p,\, v\in\mathbb{R}^q}\; -\frac{1}{2}(c - A^T\lambda - B^T v)^T Q^{-1}(c - A^T\lambda - B^T v) + \langle \lambda, b \rangle + \langle v, d \rangle \quad \text{s.t.} \quad v \geq 0.$$


Indicator Function and Support Function

The indicator function of a set $C$ is defined as
$$\delta(x \mid C) = \begin{cases} 0, & x \in C, \\ +\infty, & \text{otherwise}. \end{cases} \tag{7.32}$$
The conjugate function of $f$ at $x^*$ is defined as
$$f^*(x^*) = \sup_x \{\langle x, x^* \rangle - f(x)\}. \tag{7.33}$$
The conjugate function of $\delta(\cdot \mid C)$ at $x^*$ is referred to as the support function of $C$ at $x^*$, denoted by $\delta^*(x^* \mid C)$. The conjugate function is an important tool for deriving dual problems of more complicated problems.
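As a small illustration of the support function, a minimal Python sketch follows for the box $C = [-1,1]^n$, where $\delta^*(x^* \mid C) = \|x^*\|_1$ since the supremum of $\langle x, x^*\rangle$ over the box is attained at $x = \mathrm{sign}(x^*)$. The sampling-based check below is our own illustration, not from the text.

```python
# Sanity check: for C = [-1,1]^4 the support function equals the l1 norm.
import numpy as np

rng = np.random.default_rng(1)
for _ in range(3):
    xstar = rng.normal(size=4)
    # brute-force estimate of sup_{x in C} <x, x*> by random sampling of C
    samples = rng.uniform(-1.0, 1.0, size=(20000, 4))
    sup_est = np.max(samples @ xstar)
    assert sup_est <= np.abs(xstar).sum() + 1e-9   # never exceeds ||x*||_1
print("sampled suprema are consistent with delta*(x*|C) = ||x*||_1")
```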

7.9 Exercises

Exercise 7.9.1. Derive the dual problems of the following SVM models:
$$\min_{w\in\mathbb{R}^n,\, b\in\mathbb{R}}\; \frac{1}{2}\|w\|_2^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \geq 1,\; i = 1,\ldots,l; \tag{7.34}$$
$$\min_{w,\, b,\, \xi}\; \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^l \xi_i \quad \text{s.t.} \quad y_i(w^T x_i + b) \geq 1 - \xi_i,\; i = 1,\ldots,l,\; \xi \geq 0. \tag{7.35}$$

Exercise 7.9.2. Derive the KKT conditions for the above problems.

Exercise 7.9.3. Derive the KKT condition for the subproblem in the trust region method.

Exercise 7.9.4. Derive the KKT condition and the dual problem for the NCM problem.

Chapter 8

Penalty and Augmented Lagrangian Methods

In this chapter, we will discuss methods for the constrained optimization problem
$$\min_{x\in\mathbb{R}^n}\; f(x) \quad \text{s.t.} \quad c_i(x) = 0,\; i \in E, \quad c_j(x) \geq 0,\; j \in I, \tag{8.1}$$
mainly focusing on the penalty methods and the augmented Lagrangian method. The idea of the two types of methods is to transfer the constrained problem into an unconstrained problem.

8.1 The Quadratic Penalty Method

To introduce the quadratic penalty method, we first consider the equality constrained optimization problem
$$\min_{x\in\mathbb{R}^n}\; f(x) \quad \text{s.t.} \quad c_i(x) = 0,\; i \in E. \tag{8.2}$$
The quadratic penalty function $Q(x;\mu)$ for (8.2) is
$$Q(x;\mu) \stackrel{\mathrm{def}}{=} f(x) + \frac{\mu}{2}\sum_{i\in E} c_i^2(x), \tag{8.3}$$
where $\mu > 0$ is the penalty parameter. The idea of the quadratic penalty method is to transfer the constrained problem into an unconstrained problem by putting the constraints into the objective and penalizing the violation of the constraints by quadratic functions.

DOI: 10.1051/978-2-7598-3174-6.c008 © Science Press, EDP Sciences, 2023


By driving $\mu$ to $+\infty$, we penalize the constraint violation with increasing severity. Minimizing (8.3) is an unconstrained problem. This motivates us to consider a sequence of values $\{\mu_k\}$ with $\mu_k \uparrow \infty$ as $k \to +\infty$, and to seek an approximate minimizer $x_k$ of $Q(x;\mu_k)$ for each $k$. For each $\mu_k$, we can use techniques from unconstrained optimization to search for $x_k$. The previous iterate $x_{k-1}$ can be used as an initial guess to solve $\min_x Q(x;\mu_k)$, so just a few steps of unconstrained minimization may be needed. An example is given below.

Example 8.1.1. Consider
$$\min_{x\in\mathbb{R}^2}\; x_1 + x_2 \quad \text{s.t.} \quad x_1^2 + x_2^2 - 2 = 0. \tag{8.4}$$
The quadratic penalty function is ($\mu > 0$)
$$Q(x;\mu) = x_1 + x_2 + \frac{\mu}{2}(x_1^2 + x_2^2 - 2)^2. \tag{8.5}$$
For $\mu = 1$, the minimizer of $Q(x;\mu)$ is near $x = (-1.1, -1.1)^T$. There is also a local minimizer near $x = (-0.3, -0.3)^T$. See figures 8.1 and 8.2 for an illustration.

For problems with inequality constraints, we also penalize the violation of the inequalities by quadratic functions, and define the quadratic penalty function as
$$Q(x;\mu) = f(x) + \frac{\mu}{2}\sum_{i\in E} c_i^2(x) + \frac{\mu}{2}\sum_{i\in I}\big(\max(0, -c_i(x))\big)^2. \tag{8.6}$$

FIG. 8.1 – Contours of $Q(x;\mu)$ for $\mu = 1$, contour spacing 0.5.


FIG. 8.2 – Contours of $Q(x;\mu)$ for $\mu = 10$, contour spacing 2.

Remark 8.1.2. The advantage of the quadratic penalty method is that the subproblem is an unconstrained problem. Moreover, as long as $c_i$, $i \in E \cup I$, are continuously differentiable, $Q(x;\mu)$ in (8.6) is also continuously differentiable. Below we give the details of the quadratic penalty method.

Algorithm 8.1.3. (Quadratic Penalty Method)
S0. Given $\mu_0 > 0$ and tolerances $\{\tau_k\} \geq 0$ with $\tau_k \to 0$; initial point $x_0^s$; $k := 0$.
S1. Find an approximate minimizer $x_k$ of $Q(\cdot;\mu_k)$, starting at $x_k^s$ and terminating when $\|\nabla_x Q(x;\mu_k)\| \leq \tau_k$.
S2. If the final convergence test is satisfied, stop with the approximate solution $x_k$; otherwise, go to S3.
S3. Choose a new penalty parameter $\mu_{k+1} > \mu_k$; choose a new starting point $x_{k+1}^s$; $k := k+1$; go to S1.

We have the following results regarding the convergence properties of the quadratic penalty method.

Theorem 8.1.4. Suppose that each $x_k$ is the exact global minimizer of $Q(x;\mu_k)$ defined by (8.6) in algorithm 8.1.3, and that $\mu_k \uparrow \infty$. Then every limit point $x^*$ of the sequence $\{x_k\}$ is a global solution of problem (8.1).

Remark 8.1.5. Theorem 8.1.4 implies that if $\mu_k$ tends to $+\infty$, then any limit point generated by algorithm 8.1.3 is a global optimal solution of (8.1), provided that each iterate $x_k$ is a global minimizer of the subproblem with objective (8.6). However, $Q(x;\mu)$ in (8.6) is in general nonconvex, and finding its global minimizer is not that easy.

Theorem 8.1.6. Suppose that the tolerances and penalty parameters in algorithm 8.1.3 satisfy $\tau_k \to 0$ and $\mu_k \to \infty$. If a limit point $x^*$ of the sequence $\{x_k\}$ is infeasible, it is a stationary point of the function $\|c(x)\|^2$. On the other hand, if a limit point $x^*$ is feasible and the constraint gradients $\nabla c_i(x^*)$ are linearly independent, then $x^*$ is a KKT point for problem (8.2). For such points, we have for any infinite subsequence $K$ such that $\lim_{k\in K} x_k = x^*$ that
$$\lim_{k\in K} -\mu_k c_i(x_k) = \lambda_i^*, \quad \text{for all } i \in E, \tag{8.7}$$
where $\lambda^*$ is the multiplier vector that satisfies the KKT conditions for the equality-constrained problem (8.2).
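A minimal Python sketch of algorithm 8.1.3 on example 8.1.1 follows. The starting point, the schedule $\mu_{k+1} = 10\mu_k$, and the tolerances are our own illustrative choices, not prescribed by the text.

```python
# Quadratic penalty loop for min x1 + x2 s.t. x1^2 + x2^2 - 2 = 0.
import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0] + x[1]
cE = lambda x: x[0]**2 + x[1]**2 - 2.0

def Q(x, mu):
    return f(x) + 0.5 * mu * cE(x)**2

x = np.array([-1.5, -1.0])           # illustrative starting point
mu, tau = 1.0, 1e-1
for k in range(8):
    res = minimize(Q, x, args=(mu,), method="BFGS",
                   options={"gtol": tau})   # stop when ||grad Q|| <= tau_k
    x = res.x
    if abs(cE(x)) < 1e-8:            # final convergence test on feasibility
        break
    mu *= 10.0                       # increase the penalty parameter
    tau *= 0.5
print(x)   # approaches the solution (-1, -1)
```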

8.2 Exact Penalty Method

For the general constrained optimization problem (8.1), a popular nonsmooth penalty function is the $\ell_1$ penalty function, which takes the form
$$\phi_1(x;\mu) = f(x) + \mu\sum_{i\in E}|c_i(x)| + \mu\sum_{i\in I}[c_i(x)]^-, \tag{8.8}$$
where we use the notation $[y]^- = \max\{0, -y\}$. An interesting property of the exact penalty method is that, by choosing a proper $\mu$, the solution of (8.8) is the solution of (8.1). This is referred to as exact recovery. Note that the quadratic penalty method does not have the exact recovery property.

Example 8.2.1. Consider
$$\min_{x\in\mathbb{R}}\; x \quad \text{s.t.} \quad x \geq 1. \tag{8.9}$$
The optimal solution is $x^* = 1$. The $\ell_1$ penalty function is
$$\phi_1(x;\mu) = x + \mu[x-1]^- = \begin{cases} (1-\mu)x + \mu, & \text{if } x \leq 1, \\ x, & \text{if } x > 1. \end{cases} \tag{8.10}$$
As can be seen in figure 8.3, the penalty function has a minimizer at $x^* = 1$ when $\mu > 1$, but is a monotonically increasing function when $\mu < 1$.

FIG. 8.3 – Penalty function with $\mu > 1$ (left) and $\mu < 1$ (right).
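The exactness threshold in example 8.2.1 can be illustrated numerically; the following small Python sketch (grid bounds and the two $\mu$ values are our own choices) locates the grid minimizer of $\phi_1$ for $\mu > 1$ and $\mu < 1$.

```python
# phi1(x;mu) = x + mu*max(0, 1-x): minimizer x* = 1 exactly when mu > 1;
# for mu < 1 the function is monotonically increasing (grid min at left edge).
import numpy as np

xs = np.linspace(-3.0, 3.0, 6001)
for mu in (2.0, 0.5):
    phi = xs + mu * np.maximum(0.0, 1.0 - xs)
    print(f"mu={mu}: grid minimizer at x={xs[np.argmin(phi)]:.3f}")
# mu=2.0 -> x = 1.000 (exact recovery); mu=0.5 -> x = -3.000 (no minimizer)
```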


Below we give the details of the exact penalty method.

Algorithm 8.2.2. (Classical $\ell_1$ Penalty Method)
S0. Given $\mu_0 > 0$, tolerance $\tau > 0$, starting point $x_0^s$; $k := 0$.
S1. Find an approximate minimizer $x_k$ of $\phi_1(x;\mu_k)$, starting at $x_k^s$.
S2. If the constraint violation satisfies $h(x_k) \leq \tau$, stop with the approximate solution $x_k$; otherwise, go to S3.
S3. Choose a new penalty parameter $\mu_{k+1} > \mu_k$.
S4. Choose a new starting point $x_{k+1}^s$; $k := k+1$; go to S1.

The advantage of the $\ell_1$ penalty method is that, without driving the penalty parameter to $+\infty$, one can get the exact solution of the original problem (8.1). However, since the penalty function in subproblem (8.8) is nonsmooth, we need nonsmooth techniques to solve subproblem (8.8).

8.3 Augmented Lagrangian Method

In this part, we introduce the augmented Lagrangian method. Consider the equality constrained problem (8.2). The quadratic penalty method returns an infeasible iterate $x_k$: $c_i(x_k) \neq 0$. Moreover, $\mu \to \infty$ leads to an ill-conditioned subproblem. This motivates the augmented Lagrangian method, which avoids driving $\mu$ to infinity and includes an explicit estimate of the Lagrange multiplier $\lambda$. Given $\lambda^k$ and $\mu_k$, define the augmented Lagrangian function as
$$L_A(x;\lambda^k,\mu_k) = f(x) - \sum_{i\in E}\lambda_i^k c_i(x) + \frac{\mu_k}{2}\sum_{i\in E} c_i^2(x).$$
We generate $x_k$ by
$$x_k \approx \arg\min_x\, L_A(x;\lambda^k,\mu_k),$$
which gives
$$0 \approx \nabla_x L_A(x_k;\lambda^k,\mu_k) = \nabla f(x_k) - \sum_{i\in E}\big[\lambda_i^k - \mu_k c_i(x_k)\big]\nabla c_i(x_k).$$
Comparing with the optimality condition for (8.2), which is
$$0 = \nabla_x L(x^*,\lambda^*) = \nabla f(x^*) - \sum_{i\in E}\lambda_i^*\nabla c_i(x^*),$$
we can expect $\lambda_i^* \approx \lambda_i^k - \mu_k c_i(x_k)$, $i \in E$. So we update $\lambda^{k+1}$ by
$$\lambda_i^{k+1} = \lambda_i^k - \mu_k c_i(x_k), \quad i \in E. \tag{8.11}$$
The details of the augmented Lagrangian method are given in algorithm 8.3.1.


Algorithm 8.3.1. (Augmented Lagrangian Method — Equality Constraints)
S0. Given $\mu_0 > 0$, tolerance $\tau_0 > 0$, starting points $x_0^s$ and $\lambda^0$; $k := 0$.
S1. Find an approximate minimizer $x_k$ of $L_A(\cdot;\lambda^k,\mu_k)$, starting at $x_k^s$ and terminating when $\|\nabla_x L_A(x_k;\lambda^k,\mu_k)\| \leq \tau_k$.
S2. If a convergence test for (8.2) is satisfied, stop with the approximate solution $x_k$; otherwise, go to S3.
S3. Select a tolerance $\tau_{k+1}$; let $x_{k+1}^s = x_k$; choose $\mu_{k+1} \geq \mu_k$; update the Lagrange multipliers using (8.11) to obtain $\lambda^{k+1}$; $k := k+1$; go to S1.

Example 8.3.2. Consider
$$\min_{x\in\mathbb{R}^2}\; x_1 + x_2 \quad \text{s.t.} \quad x_1^2 + x_2^2 - 2 = 0.$$
The optimal solution is $(-1,-1)^T$, and the optimal Lagrange multiplier is $\lambda^* = -0.5$. The augmented Lagrangian function is
$$L_A(x;\lambda,\mu) = x_1 + x_2 - \lambda(x_1^2 + x_2^2 - 2) + \frac{\mu}{2}(x_1^2 + x_2^2 - 2)^2.$$
At the $k$-th iteration, suppose $\lambda^k = -0.4$ and $\mu_k = 1$. See figure 8.4 for the contours of $L_A(x;-0.4,1)$.

FIG. 8.4 – Contours of $L_A(x;\lambda,\mu)$ for $\lambda = -0.4$ and $\mu = 1$, contour spacing 0.5.
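A minimal Python sketch of algorithm 8.3.1 on example 8.3.2 follows, with the multiplier update (8.11). Keeping $\mu$ fixed at 1, the starting point, and the iteration count are illustrative choices of ours; note that no driving of $\mu$ to infinity is needed here.

```python
# ALM for min x1 + x2 s.t. x1^2 + x2^2 - 2 = 0 (example 8.3.2).
import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0] + x[1]
cE = lambda x: x[0]**2 + x[1]**2 - 2.0

def LA(x, lam, mu):
    return f(x) - lam * cE(x) + 0.5 * mu * cE(x)**2

x, lam, mu = np.array([-1.5, -1.0]), -0.4, 1.0
for k in range(10):
    x = minimize(LA, x, args=(lam, mu), method="BFGS").x
    lam = lam - mu * cE(x)           # multiplier update (8.11)
print(x, lam)   # approaches (-1, -1) and the multiplier -0.5
```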


The minimizer of $L_A(x;-0.4,1)$ is near $(-1.02,-1.02)^T$, compared with the quadratic penalty minimizer of $Q(x;1)$, which is near $(-1.1,-1.1)^T$. This shows the significant improvement of the augmented Lagrangian method over the quadratic penalty method.

We have the following results about the augmented Lagrangian method.

Theorem 8.3.3. Let $x^*$ be a local solution of (8.2) at which the LICQ is satisfied and the second-order sufficient conditions are satisfied for $\lambda = \lambda^*$. Then there is a threshold value $\bar{\mu}$ such that for all $\mu \geq \bar{\mu}$, $x^*$ is a strict local minimizer of $L_A(x;\lambda^*,\mu)$.

Theorem 8.3.4. Suppose that the assumptions of theorem 8.3.3 are satisfied at $(x^*,\lambda^*)$, and let $\bar{\mu}$ be chosen as in that theorem. Then there exist positive scalars $\delta$, $\epsilon$, and $M$ such that the following claims hold:
(a) For all $\lambda^k$ and $\mu_k$ satisfying
$$\|\lambda^k - \lambda^*\| \leq \mu_k\delta, \quad \mu_k \geq \bar{\mu}, \tag{8.12}$$
the problem
$$\min_x\; L_A(x;\lambda^k,\mu_k) \quad \text{s.t.} \quad \|x - x^*\| \leq \epsilon$$
has a unique solution $x_k$. Moreover, we have
$$\|x_k - x^*\| \leq M\,\frac{\|\lambda^k - \lambda^*\|}{\mu_k}. \tag{8.13}$$
(b) For all $\lambda^k$ and $\mu_k$ that satisfy (8.12), we have
$$\|\lambda^{k+1} - \lambda^*\| \leq M\,\frac{\|\lambda^k - \lambda^*\|}{\mu_k}, \tag{8.14}$$
where $\lambda^{k+1}$ is given by the formula
$$\lambda_i^{k+1} = \lambda_i^k - \mu_k c_i(x_k), \quad \text{for all } i \in E. \tag{8.15}$$
(c) For all $\lambda^k$ and $\mu_k$ that satisfy (8.12), the matrix $\nabla^2_{xx} L_A(x_k;\lambda^k,\mu_k)$ is positive definite and the constraint gradients $\nabla c_i(x_k)$, $i \in E$, are linearly independent.

8.4 Quadratic Penalty Method for Hypergraph Matching

In this part, we will show how to apply the quadratic penalty method to solve the hypergraph matching problem, which is an important research topic in computer vision.


8.4.1 Hypergraph Matching

Hypergraph matching is a fundamental problem in computer vision. It has been used in many applications, including object detection [1], image retrieval [67], image stitching [72, 73], and bio-informatics [65].

We start with graph matching. Traditional graph matching models only use point-to-point features or pair-to-pair features, which can be handled by linear assignment algorithms [27, 46] or quadratic assignment algorithms [16, 28, 35, 44, 68], respectively. To use more geometric information such as angles, lines, and areas, triple-to-triple graph matching was proposed in 2008 [74] and was further studied in [15, 36, 48]. Since three vertices are associated with one edge, it is also termed hypergraph matching; see figure 8.5. Optimization problems over binary assignment matrices, such as (8.19) below, are NP-hard due to their combinatorial nature.

FIG. 8.5 – Illustration of hypergraph matching.

8.4.2 Mathematical Formulation

First, we give the mathematical formulation of hypergraph matching, including its objective function and constraints; we refer to [11] for more details. Consider two hypergraphs $G_1 = \{V_1, E_1\}$ and $G_2 = \{V_2, E_2\}$, where $V_1$ and $V_2$ are sets of points with $|V_1| = n_1$, $|V_2| = n_2$, and $E_1$, $E_2$ are sets of hyperedges. In this part, we always suppose that $n_1 \leq n_2$, and each point in $V_1$ is matched to exactly one point in $V_2$, while each point in $V_2$ can be matched to an arbitrary number of points in $V_1$. For each hypergraph, we consider three-uniform hyperedges. Namely, the three points involved in each hyperedge are different, for example, $(l_1, j_1, k_1) \in E_1$. Our aim is to find the best correspondence (also referred to as "matching") between $V_1$ and $V_2$ with the maximum matching score. Let $X \in \mathbb{R}^{n_1\times n_2}$ be the assignment matrix between $V_1$ and $V_2$, i.e.,
$$X_{l_1 l_2} = \begin{cases} 1, & \text{if } l_1 \in V_1 \text{ is assigned to } l_2 \in V_2, \\ 0, & \text{otherwise}. \end{cases}$$


Two hyperedges $(l_1, j_1, k_1) \in E_1$ and $(l_2, j_2, k_2) \in E_2$ are said to be matched if $l_1, j_1, k_1 \in V_1$ are assigned to $l_2, j_2, k_2 \in V_2$, respectively. This can be represented equivalently by $X_{l_1 l_2} X_{j_1 j_2} X_{k_1 k_2} = 1$. Let $B_{l_1 l_2 j_1 j_2 k_1 k_2}$ be the matching score between $(l_1, j_1, k_1)$ and $(l_2, j_2, k_2)$. Then $B \in \mathbb{R}^{n_1\times n_2\times n_1\times n_2\times n_1\times n_2}$ is a sixth order tensor. Assume $B$ is given, satisfying $B_{l_1 l_2 j_1 j_2 k_1 k_2} \geq 0$ if $(l_1, j_1, k_1) \in E_1$ and $(l_2, j_2, k_2) \in E_2$, and $B_{l_1 l_2 j_1 j_2 k_1 k_2} = 0$ otherwise. Given hypergraphs $G_1 = \{V_1, E_1\}$, $G_2 = \{V_2, E_2\}$, and the matching score $B$, the hypergraph matching problem takes the following form
$$\max_{X\in\Pi_1}\; \sum_{\substack{(l_1,j_1,k_1)\in E_1 \\ (l_2,j_2,k_2)\in E_2}} B_{l_1 l_2 j_1 j_2 k_1 k_2}\, X_{l_1 l_2} X_{j_1 j_2} X_{k_1 k_2}, \tag{8.16}$$
where $\Pi_1$ denotes the set of binary assignment matrices whose rows each sum to one.

Note that (8.16) is a matrix optimization problem, which can be reformulated as a vector optimization problem as follows. Let $n = n_1 n_2$ and let $x \in \mathbb{R}^n$ be the vectorization of $X$, that is,
$$x := (x_1^T, \ldots, x_{n_1}^T)^T, \quad \text{with } X = \begin{bmatrix} X_{11} & \cdots & X_{1n_2} \\ \vdots & \ddots & \vdots \\ X_{n_1 1} & \cdots & X_{n_1 n_2} \end{bmatrix} := \begin{bmatrix} x_1^T \\ \vdots \\ x_{n_1}^T \end{bmatrix}.$$
Here, $x_i \in \mathbb{R}^{n_2}$ is the $i$-th block of $x$. In the following, for any vector $z \in \mathbb{R}^n$, we always assume it has the same partition as $x$. Define $A \in \mathbb{R}^{n\times n\times n}$ as
$$A_{ljk} = B_{l_1 l_2 j_1 j_2 k_1 k_2}, \tag{8.17}$$
where
$$l = (l_1 - 1)n_2 + l_2, \quad j = (j_1 - 1)n_2 + j_2, \quad k = (k_1 - 1)n_2 + k_2. \tag{8.18}$$
Consequently, (8.16) can be reformulated as
$$\min_{x\in\mathbb{R}^n}\; f(x) := -\frac{1}{6}Ax^3 \quad \text{s.t.} \quad e^T x_i = 1,\; i = 1,\ldots,n_1,\; x \in \{0,1\}^n, \tag{8.19}$$
where $e \in \mathbb{R}^{n_2}$ is the vector with all entries equal to one, and $Ax^3 := \sum_{l,j,k} A_{ljk}\, x_l x_j x_k$. Existing methods include the power iteration algorithm (TM [1]), the hypergraph matching method via regularized random walks (RRAGM [2]), HGM [3], and block coordinate descent graph matching (BCAGM [4]). They can be summarized into two categories: one is to relax $\{0,1\}$ to $[0,1]$ and solve the relaxation problem; the other is to keep the $\{0,1\}$ constraint and reduce the problem to a series of small-sized assignment problems.


8.4.3 Relaxation Problem

The approach in [11] is to first reformulate (8.19) as the following sparse constrained problem
$$\min_{x\in\mathbb{R}^n}\; f(x) \quad \text{s.t.} \quad e^T x_i = 1,\; i = 1,\ldots,n_1,\; x \geq 0,\; \|x\|_0 \leq n_1, \tag{8.20}$$
where $\|x\|_0$ denotes the number of nonzero elements of $x$. By dropping the sparse constraint, we reach the relaxation problem
$$\min_{x\in\mathbb{R}^n}\; f(x) \quad \text{s.t.} \quad e^T x_i = 1,\; i = 1,\ldots,n_1,\; x \geq 0, \tag{8.21}$$
which is basically equivalent to
$$\min_{x\in\mathbb{R}^n}\; f(x) \quad \text{s.t.} \quad e^T x_i = 1,\; i = 1,\ldots,n_1,\; x \in [0,1]^n. \tag{8.22}$$
We have the following result regarding the relaxation problem (8.21) and the original problem (8.19). More details on the properties of problem (8.20) can be found in [11].

Theorem 8.4.1. There exists a global minimizer $x^*$ of the relaxation problem (8.21) such that $\|x^*\|_0 = n_1$. Furthermore, $x^*$ is a global minimizer of the original problem (8.19).

The following algorithm shows how one can obtain a global minimizer of (8.19), starting from a global minimizer of (8.21).

Algorithm 8.4.2. (Generating a Global Minimizer of (8.19))
S0. Given $y = (y_1^T,\ldots,y_{n_1}^T)^T \in \mathbb{R}^n$, a global minimizer of (8.21); let $x = 0 \in \mathbb{R}^n$.
S1. For $i = 1,\ldots,n_1$, find $p_i = \arg\max_p (y_i)_p$, and let $(x_i)_{p_i} = 1$.
S2. Output $x = (x_1^T,\ldots,x_{n_1}^T)^T$, a global minimizer of (8.19).

For example, for $n_1 = 2$, $n_2 = 3$, $y = (y_1^T, y_2^T)^T$, we have
$$y_1 = \begin{bmatrix} 1/3 \\ 1/2 \\ 5/6 \end{bmatrix} \to x_1 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}; \qquad y_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \to x_2 = y_2.$$
The output of algorithm 8.4.2 is then $x = (x_1^T, x_2^T)^T$.
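A minimal Python sketch of algorithm 8.4.2 follows; the function name and the block layout helper are our own, and the example data reproduce the worked example above.

```python
# Round a global minimizer y of the relaxation (8.21) to a binary
# assignment vector, block by block (algorithm 8.4.2).
import numpy as np

def round_assignment(y, n1, n2):
    """For each block y_i in R^{n2}, put a 1 at (the first) argmax."""
    x = np.zeros(n1 * n2)
    for i in range(n1):
        block = y[i * n2:(i + 1) * n2]
        x[i * n2 + int(np.argmax(block))] = 1.0
    return x

y = np.array([1/3, 1/2, 5/6,   0.0, 1.0, 0.0])   # the example, n1=2, n2=3
print(round_assignment(y, 2, 3))                  # [0. 0. 1. 0. 1. 0.]
```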

8.4.4 Quadratic Penalty Method for (8.21)

Below we show how we solve the relaxation problem (8.21). Notice that our aim is to identify the support set of a global minimizer of the relaxation problem. Once the support set is found, we can follow algorithm 8.4.2 to obtain a global minimizer of (8.19). Inspired by this observation, we penalize the equality constraint violations as part of the objective function; this is another main difference between our method and the existing algorithms. It leads to the following quadratic penalty problem
$$\min_{x\in\mathbb{R}^n}\; f(x) + \frac{\sigma}{2}\sum_{i=1}^{n_1}(e^T x_i - 1)^2 \quad \text{s.t.} \quad x \geq 0,$$
where $\sigma > 0$ is a penalty parameter. However, this problem is not well defined in general, since for a fixed $\sigma$ the global minimizer can approach infinity. We can add an upper bound to make the feasible set bounded. This gives the following problem
$$\min_{x\in\mathbb{R}^n}\; h(x) := f(x) + \frac{\sigma}{2}\sum_{i=1}^{n_1}(e^T x_i - 1)^2 \quad \text{s.t.} \quad 0 \leq x \leq M, \tag{8.23}$$
where $M \geq 1$ is a given number. (8.23) is actually the quadratic penalty problem of
$$\min_{x\in\mathbb{R}^n}\; f(x) \quad \text{s.t.} \quad e^T x_i = 1,\; i = 1,\ldots,n_1,\; 0 \leq x \leq M,$$
which is equivalent to (8.21).

Algorithm 8.4.3. (Quadratic Penalty Method for (8.21))
S0. Given an initial point $x^0 \geq 0$, set the parameter $\sigma_0 > 0$; let $k := 1$.
S1. Start from $x^{k-1}$ and solve (8.23) to obtain a global minimizer $x^k$.
S2. If the termination rule is satisfied, project $x^k$ to its nearest binary assignment matrix and stop. Otherwise, choose $\sigma_{k+1} \geq \sigma_k$, $k := k+1$, and go to S1.

Theorem 8.4.4. Let $\{x^k\}$ be generated by algorithm 8.4.3, and $\lim_{k\to\infty}\sigma_k = +\infty$. Then any accumulation point of the sequence $\{x^k\}$ is a global minimizer of (8.21).

Assumption 8.4.5. Let $\{x^k\}$ be generated by algorithm 8.4.3, and $\lim_{k\to\infty}\sigma_k = +\infty$. Let $K$ be a subset of $\{1,2,\ldots\}$. Assume $\lim_{k\to\infty,\,k\in K} x^k = z$, and $z$ is a global minimizer of (8.21).

We have the following result.

Theorem 8.4.6. Suppose assumption 8.4.5 holds, and that there exists $k_0$ such that $\|x^k\|_0 = n_1$ for all $k \geq k_0$, $k \in K$. Then there exists $k_1 \geq k_0$ such that the support set of $z$ is correctly identified, i.e.,
$$\Gamma_k = \Gamma(z), \quad k \geq k_1,\; k \in K.$$
Furthermore, $z$ must be a global minimizer of (8.19). Here $\Gamma_k = \{l : x_l^k > 0\}$ and $\Gamma(z) = \{l : z_l > 0\}$.

Define
$$J_i^k = \arg\max_p\{(x_i^k)_p\}, \quad p_i^k = \min\{p : p \in J_i^k\}, \quad J^k := \bigcup_{i=1}^{n_1}\{p_i^k + n_2(i-1)\},$$
and
$$J_i(z) = \arg\max_p\{(z_i)_p\}, \quad p_i(z) = \min\{p : p \in J_i(z)\}, \quad J(z) := \bigcup_{i=1}^{n_1}\{p_i(z) + n_2(i-1)\}.$$
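A small Python sketch of these index sets follows (function name and example vector are our own illustration); it computes, for each block, the argmax set $J_i$, the smallest argmax index $p_i$, and the global positions collected in $J$.

```python
# Index sets used in theorem 8.4.7: ties are resolved by the min rule.
import numpy as np

def support_indices(x, n1, n2):
    J = []
    for i in range(n1):
        block = x[i * n2:(i + 1) * n2]
        Ji = np.flatnonzero(block == block.max())   # J_i = argmax set
        pi = int(Ji.min())                          # p_i = min index in J_i
        J.append(pi + n2 * i)                       # shift to global position
    return set(J)

xk = np.array([0.1, 0.7, 0.2,   0.5, 0.5, 0.0])
print(support_indices(xk, 2, 3))   # {1, 3}: block 2 has a tie, min index wins
```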

Theorem 8.4.7.
(i) If $\|z\|_0 = n_1$, then there exists $k_0 > 0$ such that for all $k \geq k_0$, $k \in K$, there is $\Gamma(z) = J^k$.
(ii) If $\|z\|_0 > n_1$ and $|J_i(z)| = 1$ for all $i = 1,\ldots,n_1$, then there exist a global minimizer $x^*$ of (8.19) and $k_0 > 0$ such that for all $k \geq k_0$, $k \in K$, there is $\Gamma^* = J^k$.
(iii) If $\|z\|_0 > n_1$ and $|J_i(z)| > 1$ for some $i \in \{1,\ldots,n_1\}$, then there exist a global minimizer $x^*$ of (8.19), a subsequence $\{x^k\}_{k\in K'}$ and $k_0 > 0$ such that for all $k \geq k_0$, $k \in K'$, there is $\Gamma^* = J^k$.
Here $\Gamma^* = \{l : x_l^* > 0\}$.

Remark 8.4.8. Theorems 8.4.6 and 8.4.7 imply that the quadratic penalty method has the exact recovery property in the sense that it can recover the support set of a global minimizer of (8.19) when $\sigma_k$ is sufficiently large.

Remark 8.4.9. The subproblem (8.23) can be solved by active-set-based methods, such as the projected gradient method modified from the projected Newton methods in [2].

8.4.5 Numerical Results

We demonstrate by one example that the algorithm QPPG can indeed recover the support set of the global minimizer of (8.19); see figure 8.6. For the small-scale problems in figure 8.7, one can see that QPPG performs well in both accuracy and matching score. As for CPU time, all the algorithms are competitive since the maximum time is about 0.06 s. Figure 8.8 shows the matching results for two houses with $v_1 = 0$ and $v_2 = 60$. Our algorithm is also competitive with other methods in terms of accuracy, matching score, and CPU time for the large-scale problems in figure 8.9. One of the matching results is shown in figure 8.10.

FIG. 8.6 – Entries in $x^k$ with $k = 1, 21, 39$ in QPPG, $T_k := \{j : x_j^k > 0\}$.

FIG. 8.7 – CMU house matching of different $v$ for $n_1 = 20$ and $n_2 = 30$.

FIG. 8.8 – Result of QPPG for matching two houses with $v_1 = 0$ and $v_2 = 60$.

FIG. 8.9 – Results for matching fish with different dimension $n_1$.

FIG. 8.10 – Example of matching fish data using QPPG.

8.5 Augmented Lagrangian Method for SVM

In this part, we will show how to apply the augmented Lagrangian method to solve SVM.

8.5.1 Support Vector Machine

Support vector machine (SVM) has proved to be a successful approach to machine learning. Two typical SVM models are the L1-loss model for support vector classification (SVC) and the ε-L1-loss model for support vector regression (SVR). Due to the nonsmoothness of the L1-loss function in the two models, most traditional approaches focus on solving the dual problem. Here we present an augmented Lagrangian method for the L1-loss model, which is designed to solve the primal problem. More details can be found in [69].

8.5.2 Mathematical Formulation

Given training data $(x_i, y_i)$, $i = 1,\ldots,m$, where $x_i \in \mathbb{R}^n$ are the observations and $y_i \in \{-1, 1\}$ are the labels, support vector classification seeks a hyperplane $w^T x + b = 0$ such that the data with different labels are separated by the hyperplane. The L1-loss SVC model is
$$\min_{w\in\mathbb{R}^n,\, b\in\mathbb{R}}\; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^m \max\big(1 - y_i(w^T x_i + b),\, 0\big). \tag{8.24}$$
Notice that there is a bias term $b$ in the standard SVC model. For large-scale SVC, the bias term is often omitted. By setting
$$x_i \leftarrow [x_i;\, 1], \qquad w \leftarrow [w;\, b],$$
we reach the following model
$$\min_{w\in\mathbb{R}^n}\; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^m \max\big(1 - y_i\, w^T x_i,\, 0\big). \tag{8.25}$$
It can be written as
$$\min_{w\in\mathbb{R}^n}\; \frac{1}{2}\|w\|^2 + C\,\|\max(0, Bw + d)\|_1 \tag{8.26}$$
by letting
$$B = \begin{bmatrix} -y_1 x_1^T \\ \vdots \\ -y_m x_m^T \end{bmatrix} \in \mathbb{R}^{m\times n}, \qquad d = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix} \in \mathbb{R}^m.$$
By introducing a variable $s \in \mathbb{R}^m$, we get the following constrained optimization problem
$$\min_{w\in\mathbb{R}^n,\, s\in\mathbb{R}^m}\; \frac{1}{2}\|w\|^2 + p(s) \quad \text{s.t.} \quad s = Bw + d, \tag{8.27}$$
where $p(s) = C\,\|\max(0,s)\|_1$. The Lagrangian function for (8.27) is
$$l(w,s;\lambda) = \frac{1}{2}\|w\|^2 + p(s) - \langle \lambda,\, s - Bw - d \rangle,$$
where $\lambda \in \mathbb{R}^m$ is the Lagrange multiplier corresponding to the equality constraint.
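A minimal Python sketch of assembling $B$ and $d$ in (8.26) and evaluating the L1-loss objective follows; the random data and the value of $C$ are our own illustrative choices.

```python
# Build B and d from labeled data and evaluate (8.26).
import numpy as np

rng = np.random.default_rng(0)
m, n, C = 200, 5, 1.0
X = rng.normal(size=(m, n))          # rows are the observations x_i
y = np.sign(rng.normal(size=m))      # labels in {-1, +1}

B = -y[:, None] * X                  # i-th row of B is -y_i x_i^T
d = np.ones(m)

def objective(w):
    # 0.5 ||w||^2 + C || max(0, Bw + d) ||_1, cf. (8.26)
    return 0.5 * w @ w + C * np.maximum(0.0, B @ w + d).sum()

print(objective(np.zeros(n)))        # equals C*m at w = 0
```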

8.5.3 Augmented Lagrangian Method (ALM)

Next, we apply the augmented Lagrangian method (ALM) to solve (8.27). See [69] for more details.

ALM works as follows. At iteration $k$, solve
$$\min_{w,s}\; L_{\sigma_k}(w,s;\lambda^k) \tag{8.28}$$
to get $(w^{k+1}, s^{k+1})$. Then update the Lagrange multiplier by
$$\lambda^{k+1} = \lambda^k - \sigma_k(s^{k+1} - Bw^{k+1} - d),$$
and choose $\sigma_{k+1} \geq \sigma_k$. Now the key questions are how to solve the subproblem (8.28), and what the global and local convergence rates of ALM are. Before exploring these two questions, we first introduce the Moreau-Yosida regularization.

Let $\mathcal{X}$ be a real finite-dimensional Euclidean space with an inner product $\langle\cdot,\cdot\rangle$ and its induced norm $\|\cdot\|$. Let $q : \mathcal{X} \to (-\infty, +\infty]$ be a closed convex function. The Moreau-Yosida regularization of $q$ at $x \in \mathcal{X}$ is defined by
$$h_q(x) := \min_{y\in\mathcal{X}}\; \frac{1}{2}\|y - x\|^2 + q(y). \tag{8.29}$$
The unique solution of (8.29), denoted by $\mathrm{Prox}_q(x)$, is called the proximal point of $x$ associated with $q$. The following property holds for the Moreau-Yosida regularization.

Proposition 8.5.1. Let $q : \mathcal{X} \to (-\infty, +\infty]$ be a closed convex function, $h_q(\cdot)$ be the Moreau-Yosida regularization of $q$, and $\mathrm{Prox}_q(\cdot)$ be the associated proximal point mapping. Then $h_q(\cdot)$ is continuously differentiable, and
$$\nabla h_q(x) = x - \mathrm{Prox}_q(x).$$

Recall the subproblem
$$(w^{k+1}, s^{k+1}) = \arg\min_{w,s}\; L_{\sigma_k}(w,s;\lambda^k).$$
Denote
$$\phi(w) := \min_s L_\sigma(w,s;\lambda) = \frac{1}{2}\|w\|^2 - \frac{1}{2\sigma}\|\lambda\|^2 + \sigma\min_s\left\{\frac{1}{2}\|s - z(w)\|^2 + \frac{1}{\sigma}p(s)\right\} = \frac{1}{2}\|w\|^2 - \frac{1}{2\sigma}\|\lambda\|^2 + \sigma h_p(z(w)), \tag{8.30}$$
where $z(w) = Bw + d + \frac{\lambda}{\sigma}$ and $h_p$ denotes the Moreau-Yosida regularization of $\frac{1}{\sigma}p$. By the property of the Moreau decomposition, there is
$$\nabla\phi(w) = w + \sigma B^T\nabla h_p(z(w)) = w + B^T\lambda - \sigma B^T\big(s^*(w) - Bw - d\big).$$


Therefore, we can get $(w^*, s^*)$ by
$$w^* = \arg\min_w\; \phi(w), \tag{8.31}$$
$$s^* = s^*(w^*) = \mathrm{Prox}_p^{1/\sigma}(z(w^*)). \tag{8.32}$$
We will use the semismooth Newton's method to solve (8.31). Now the key questions that remain to be answered are the following:
- how to compute the proximal mapping $\mathrm{Prox}_p(\cdot)$;
- how to compute the generalized Jacobian of $\nabla\phi(w)$;
- and how to reduce the computational cost.
We address them one by one. The proximal mapping, denoted by $\mathrm{Prox}_p^M(\cdot)$, is defined through the following problem
$$\theta(z) := \min_{s\in\mathbb{R}^m}\; \psi(z,s), \tag{8.33}$$
where
$$\psi(z,s) := \frac{1}{2M}\|z - s\|^2 + p(s), \quad z \in \mathbb{R}^m.$$
It is easy to derive that $\mathrm{Prox}_p^M(z)$ takes the following componentwise form
$$\big(\mathrm{Prox}_p^M(z)\big)_i = \begin{cases} z_i - CM, & z_i > CM, \\ z_i, & z_i < 0, \\ 0, & 0 \leq z_i \leq CM. \end{cases} \tag{8.34}$$
The proximal mapping $\mathrm{Prox}_p^M(\cdot)$ is piecewise linear, as shown in figure 8.11, and therefore strongly semismooth.

FIG. 8.11 – Demonstration of $\mathrm{Prox}_p^M(z_i)$.

FIG. 8.12 – Demonstration of $\mathrm{Prox}_{\|\cdot\|_1}^M(z_i)$.

Remark 8.5.2. We would like to point out that the proximal mapping here is closely related to the proximal mapping associated with $\|s\|_1$, which is popular in the well-known LASSO problem. Indeed, comparing figures 8.11 and 8.12, the proximal mapping $\mathrm{Prox}_p^M(\cdot)$ can be viewed as a shift of that of $\|s\|_1$.

Given $z$, the Clarke subdifferential of $\mathrm{Prox}_p^M(z)$, denoted by $\partial\mathrm{Prox}_p^M(z)$, is a set of diagonal matrices. For $U \in \partial\mathrm{Prox}_p^M(z)$, its diagonal elements take the following form
$$U_{ii} = \begin{cases} 1, & z_i > CM \text{ or } z_i < 0, \\ 0, & 0 < z_i < CM, \\ u_i,\; 0 \leq u_i \leq 1, & z_i = 0 \text{ or } z_i = CM. \end{cases} \tag{8.35}$$
In other words, we have
$$\partial\mathrm{Prox}_p^M(z) = \big\{\mathrm{Diag}(u) : u \in \mathbb{R}^m;\; u_i = 1 \text{ if } z_i > CM \text{ or } z_i < 0;\; u_i = 0 \text{ if } 0 < z_i < CM;\; u_i \in [0,1] \text{ otherwise}\big\}.$$
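A minimal Python sketch of (8.34) and of one element of (8.35) follows; choosing $u_i = 0$ at the two kink points is our own choice among the admissible elements.

```python
# Proximal mapping of M * p(s) with p(s) = C ||max(0,s)||_1, and one
# element of its Clarke subdifferential (diagonal given as a vector).
import numpy as np

def prox_p(z, C, M):
    return np.where(z > C * M, z - C * M, np.where(z < 0.0, z, 0.0))

def prox_p_jacobian(z, C, M):
    # u_i = 1 on the two linear pieces, u_i = 0 elsewhere (incl. kinks)
    return np.where((z > C * M) | (z < 0.0), 1.0, 0.0)

z = np.array([-1.0, 0.0, 0.3, 2.0])
print(prox_p(z, C=1.0, M=0.5))           # [-1.   0.   0.   1.5]
print(prox_p_jacobian(z, C=1.0, M=0.5))  # [ 1.   0.   0.   1. ]
```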

8.5.4 Semismooth Newton's Method for the Subproblem

Next, we discuss the semismooth Newton's method for solving (8.31). Note that $\phi(w)$ defined in (8.30) is continuously differentiable. By proposition 8.5.1, the gradient $\nabla\phi(w)$ takes the following form
$$\nabla\phi(w) = w + B^T\lambda - \sigma B^T\big(s^*(w) - Bw - d\big), \tag{8.36}$$
which is strongly semismooth due to the strong semismoothness of $s^*(w)$. The generalized Hessian of $\phi$ at $w$, denoted by $\partial^2\phi(w)$, satisfies the following condition: for any $h \in \mathbb{R}^n$,
$$\partial^2\phi(w)(h) \subseteq \hat{\partial}^2\phi(w)(h),$$
where
$$\hat{\partial}^2\phi(w) = I + \sigma B^T B - \sigma B^T\,\partial\mathrm{Prox}_p^{1/\sigma}(z(w))\,B. \tag{8.37}$$
Consequently, solving (8.31) is equivalent to solving the strongly semismooth equation
$$\nabla\phi(w) = 0. \tag{8.38}$$
We apply the semismooth Newton's method to solve the nonsmooth equation (8.38). At iteration $k$, we update $w$ by
$$w^{k+1} = w^k - (V^k)^{-1}\nabla\phi(w^k), \quad V^k \in \hat{\partial}^2\phi(w^k).$$
Below, we use the following well-studied globalized version of the semismooth Newton method [52, algorithm 5.1] to solve (8.38).

Algorithm 8.5.3. (A Globalized Semismooth Newton Method)
S0. Set $j := 0$. Choose $x^0$, line-search parameters $\varsigma \in (0,1)$ and $\rho \in (0,1)$, $\delta > 0$, $\eta_0 > 0$ and $\eta_1 > 0$.
S1. Calculate $\nabla\phi(x^j)$. If $\|\nabla\phi(x^j)\| \leq \delta$, stop; otherwise, go to S2.
S2. Select an element $V^j \in \hat{\partial}^2\phi(x^j)$ defined as in (8.37). Apply the conjugate gradient (CG) method to find an approximate solution $d^j$ of
$$V^j d^j + \nabla\phi(x^j) = 0 \tag{8.39}$$
such that $\|V^j d^j + \nabla\phi(x^j)\| \leq \mu_j\|\nabla\phi(x^j)\|$, where $\mu_j = \min(\eta_0, \eta_1\|\nabla\phi(x^j)\|)$.
S3. Do a line search: let $m_j \geq 0$ be the smallest nonnegative integer such that
$$\phi(x^j + \rho^{m_j} d^j) \leq \phi(x^j) + \varsigma\rho^{m_j}\nabla\phi(x^j)^T d^j,$$
and let $\alpha_j = \rho^{m_j}$.
S4. Let $x^{j+1} = x^j + \alpha_j d^j$, $j := j+1$, go to S1.
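A condensed Python sketch of algorithm 8.5.3 follows, with CG on the structured matrix from (8.37) and Armijo backtracking. The function is generic: `phi`, `grad` and `active_rows` (returning the index set $I(z(w))$ described in the next subsection) are supplied by the caller, and the default parameters are our own illustrative choices.

```python
# Globalized semismooth Newton with inexact CG steps (a sketch).
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def ssn(phi, grad, active_rows, B, sigma, w,
        delta=1e-8, rho=0.5, c1=1e-4, maxit=50):
    n = w.size
    for _ in range(maxit):
        g = grad(w)
        if np.linalg.norm(g) <= delta:
            break
        BI = B[active_rows(w)]                # rows of B indexed by I(z(w))
        V = LinearOperator((n, n), dtype=float,
                           matvec=lambda h: h + sigma * BI.T @ (BI @ h))
        d, _ = cg(V, -g)                      # inexact (CG) Newton step
        t = 1.0
        while phi(w + t * d) > phi(w) + c1 * t * g @ d:   # Armijo search
            t *= rho
        w = w + t * d
    return w
```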

8.5.5 Reducing the Computational Cost

It is well known that the high computational complexity of the semismooth Newton's method comes from solving the linear system in S2 of algorithm 8.5.3. As designed in algorithm 8.5.3, we apply the popular conjugate gradient (CG) method to solve the linear system
$$V^j d^j + \nabla\phi(x^j) = 0, \quad V^j \in \hat{\partial}^2\phi(w^j),$$
iteratively. Recall that $w \in \mathbb{R}^n$ and $B \in \mathbb{R}^{m\times n}$. The heavy burden in applying CG is to calculate $Vh$, where $h \in \mathbb{R}^n$ is a given vector. Below, we analyse several ways to implement $Vh$, where
$$V = I + \sigma B^T B - \sigma B^T U B, \quad U \in \partial\mathrm{Prox}_p^{1/\sigma}(z(w)).$$

Let $I(z) = \{i : z_i \in [0, \frac{C}{\sigma}]\}$. Taking $U = \mathrm{Diag}(u)$ with $u_i = 0$ for $i \in I(z)$ and $u_i = 1$ otherwise, we have
$$Vh = h + \sigma B^T(I - U)Bh = h + \sigma B^T(I - \mathrm{Diag}(u))(Bh) = h + \sigma B(I(z),:)^T\big(B(I(z),:)\,h\big).$$
We summarize the details of the computation for the three methods in table 8.1. As we will show in the numerical part, in most situations $|I(z)|$ is far less than $m$. Together with the complexity reported in table 8.1, we have the following relations
$$O(n|I(z)|) < O(nm) < O(nm + m^2) < O(n^2m^2 + nm^3).$$
As a result, we use Method III in our algorithm.

TAB. 8.1 – The details of the computation for the three methods of computing Vh.

Method      | Formula                                          | Computational Cost
Method I    | Form V = I + σBᵀB − σBᵀUB                        | 2n²m² + n²m³
            | Calculate VΔw, where Δw = w^{k+1} − w^k          | n²
Method II   | Calculate tmp = Bh                               | mn
            | Calculate h + σBᵀ(tmp) − σBᵀDiag(u)(tmp)         | 2nm + m² + n
Method III  | Calculate tmp = B(I(z),:)h                       | |I(z)|n
            | Calculate h + σ(B(I(z),:)ᵀ tmp)                  | |I(z)|n + n
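The following Python sketch compares Method II and Method III on random data (sizes and seed are our own illustrative choices); Method III touches only the rows indexed by $I(z)$ and produces the same result.

```python
# Vh with V = I + sigma * B^T (I - U) B: dense form vs. reduced rows.
import numpy as np

rng = np.random.default_rng(0)
m, n, sigma, C = 5000, 50, 1.0, 1.0
B = rng.normal(size=(m, n))
z = rng.normal(size=m)
h = rng.normal(size=n)

u = ((z > C / sigma) | (z < 0.0)).astype(float)    # diag(U), cf. (8.35)
full = h + sigma * B.T @ ((1.0 - u) * (B @ h))     # Method II

I = np.flatnonzero((z >= 0.0) & (z <= C / sigma))  # index set I(z)
BI = B[I]
reduced = h + sigma * BI.T @ (BI @ h)              # Method III

print(np.allclose(full, reduced), len(I), "of", m, "rows used")
```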

8.5.6 Convergence Result of ALM

Now we give the global and local convergence results of ALM. The global convergence of ALM is stated in theorem 8.5.4.

Theorem 8.5.4. [42] Let the sequence $\{(w^k, s^k, \lambda^k)\}$ be generated by ALM with the stopping criterion
$$(A) \quad \Psi_k(w^{k+1}, s^{k+1}) - \inf\Psi_k \leq \epsilon_k^2/(2\sigma_k), \quad \sum_{k=0}^{\infty}\epsilon_k < +\infty,$$
where $\Psi_k(w,s) := L_{\sigma_k}(w,s;\lambda^k)$. Then the sequence $\{\lambda^k\}$ is bounded and converges to an optimal solution of the dual problem of (8.27). Moreover, the sequence $\{(w^k, s^k)\}$ is also bounded and converges to the unique optimal solution $(w^*, s^*)$ of (8.27).

With the stopping criteria
$$(A) \quad \Psi_k(w^{k+1}, s^{k+1}) - \inf\Psi_k \leq \epsilon_k^2/(2\sigma_k), \quad \sum_{k=0}^{\infty}\epsilon_k < +\infty,$$
$$(B1) \quad \Psi_k(w^{k+1}, s^{k+1}) - \inf\Psi_k \leq (\delta_k^2/(2\sigma_k))\|\lambda^{k+1} - \lambda^k\|^2, \quad \sum_{k=1}^{\infty}\delta_k < +\infty,$$
$$(B2) \quad \mathrm{dist}\big(0, \partial\Psi_k(w^{k+1}, s^{k+1})\big) \leq (\delta_k'/\sigma_k)\|\lambda^{k+1} - \lambda^k\|, \quad \delta_k' \to 0,$$
one has the local convergence result of ALM as in theorem 8.5.5.

Theorem 8.5.5. [42] Let $\{(w^k, s^k, \lambda^k)\}$ be the infinite sequence generated by ALM with stopping criteria (A) and (B1). The following results hold.
(i) The sequence $\{\lambda^k\}$ converges to $\lambda^*$, one of the optimal solutions of the dual problem, and for all $k$ sufficiently large, there is
$$\mathrm{dist}(\lambda^{k+1}, \Omega) \leq \beta_k\,\mathrm{dist}(\lambda^k, \Omega),$$
where $\Omega$ is the set of optimal solutions of the dual problem and
$$\beta_k = \frac{a_f(a_f^2 + \sigma_k^2)^{-1/2} + 2\delta_k}{1 - \delta_k} \;\to\; \beta_\infty = a_f(a_f^2 + \sigma_\infty^2)^{-1/2} < 1$$
as $k \to +\infty$. Moreover, $\{(w^k, s^k)\}$ converges to the unique optimal solution $(w^*, s^*)$ of (8.27).
(ii) If the stopping criterion (B2) is also used, then for all $k$ sufficiently large, there is
$$\|(w^{k+1}, s^{k+1}) - (w^*, s^*)\| \leq \beta_k'\|\lambda^{k+1} - \lambda^k\|,$$
where $\beta_k' = a_l(1 + \delta_k')/\sigma_k \to a_l/\sigma_\infty$ as $k \to +\infty$.

8.5.7 Numerical Results on LIBLINEAR

In this part, we present some numerical results for ALM. LIBLINEAR is the most popular solver for linear SVM [7]. Consequently, we compare our algorithm with solvers in LIBLINEAR on datasets for SVC; we split each dataset into 80% training and 20% testing data. Specifically, we compare with DCD, which also solves model (8.25). Parameters in ALM are set as $\tau = 1$, $\sigma_0 = 0.15$, $\sigma_{\max} = 2$, $\theta = 0.8$ and $\lambda^0 = \mathrm{zeros}(m,1)$. For the semismooth Newton's method, we choose $\rho = 0.5$ and $\delta = \max(1.0\times 10^{-2},\, 1.0\times 10^{-j})$. The results are reported in table 8.2, where ALM-SNCG denotes our proposed method, the augmented Lagrangian method with subproblems solved by the CG-semismooth Newton's method.


TAB. 8.2 – The comparison results for L1-loss SVC.

dataset            | time(s) DCD | ALM-SNCG | accuracy(%) DCD | ALM-SNCG
leukemia           |  0.017 |  0.015 | 100.000 | 100.000
a1a                |  0.059 |  0.036 |  84.722 |  84.819
a2a                |  0.049 |  0.029 |  84.752 |  84.818
a3a                |  0.044 |  0.024 |  84.700 |  84.717
a4a                |  0.042 |  0.028 |  84.683 |  84.737
a5a                |  0.040 |  0.022 |  84.627 |  84.780
a6a                |  0.033 |  0.022 |  84.516 |  84.587
a7a                |  0.026 |  0.019 |  84.482 |  84.604
a8a                |  0.033 |  0.019 |  84.119 |  84.031
a9a                |  0.068 |  0.031 |  84.708 |  84.677
w1a                |  0.088 |  0.057 |  99.662 |  99.958
w2a                |  0.071 |  0.062 |  99.676 |  99.968
w3a                |  0.103 |  0.093 |  99.699 |  99.955
w4a                |  0.077 |  0.059 |  99.705 |  99.953
w5a                |  0.121 |  0.105 |  99.737 |  99.950
w6a                |  0.047 |  0.068 |  99.754 |  99.923
w7a                |  0.039 |  0.031 |  99.721 |  99.900
w8a                |  0.094 |  0.060 |  99.668 |  99.950
breast-cancer      |  0.001 |  0.010 | 100.000 |  99.270
cod-rna            |  0.066 |  0.032 |  19.879 |  20.417
diabetes           |  0.001 |  0.006 |  74.675 |  75.974
fourclass          |  0.001 |  0.004 |  68.786 |  76.879
german.numer       |  0.004 |  0.020 |  78.000 |  78.500
heart              |  0.002 |  0.006 |  81.481 |  85.185
australian         |  0.001 |  0.009 |  86.232 |  86.232
ionosphere         |  0.006 |  0.006 |  98.592 |  98.592
covtype.binary     | 12.086 |  7.956 |  64.094 |  64.331
ijcnn1             |  0.125 |  0.091 |  90.198 |  90.198
sonar              |  0.006 |  0.009 |  64.286 |  64.286
splice             |  0.016 |  0.021 |  69.500 |  80.500
svmguide1          |  0.001 |  0.007 |  53.560 |  78.479
svmguide3          |  0.002 |  0.013 |   0.000 |   0.000
phishing           |  0.081 |  0.047 |  92.763 |  92.492
madelon            |  0.181 |  0.743 |  52.500 |  52.250
mushrooms          |  0.131 |  0.202 | 100.000 | 100.000
duke breast-cancer |  0.034 |  0.022 | 100.000 | 100.000
gisette            |  3.796 | 19.270 |  97.333 |  97.583
news20.binary      |  0.642 |  4.463 |  27.775 |  11.275
rcv1-train.binary  |  0.123 |  0.537 |  94.443 |  94.394
real-sim           |  0.353 |  0.436 |  72.383 |  80.729
liver-disorders    |  0.000 |  0.005 |  41.379 |  55.172
colon-cancer       |  0.006 |  0.009 |  69.231 |  69.231
skin-nonskin       |  0.171 |  0.157 |  88.236 |  90.429


Here accuracy means prediction accuracy, calculated by
$$\frac{\text{number of correct predictions}}{\text{number of test data}}\times 100\%.$$
We can make the following observations from the results. Both algorithms obtain high accuracy; winners are marked in bold in the typeset table, and the entries marked in red there show significant improvement of ALM-SNCG over DCD in accuracy. In terms of CPU time, ALM-SNCG is competitive with DCD.

8.6 Exercises

Exercise 8.6.1. For the nearest correlation matrix problem
$$\min_{X\in S^n}\; \frac{1}{2}\|X - G\|_F^2 \quad \text{s.t.} \quad X_{ii} = 1,\; i = 1,2,\ldots,n,\; X \succeq 0,$$
if we write the constraints in the form $G(X) \in K$, where $K$ is a cone, then $K =$ __
A. $\{e\} \times S^n_+$, $e = (1,\ldots,1)^T \in \mathbb{R}^n$, where $S^n_+$ denotes the set of positive semidefinite matrices in $S^n$.
B. $\{0\}^n \times S^n$, $0 \in \mathbb{R}^n$.
C. $\{e\} \times \mathbb{R}^n_+$, $e = (1,\ldots,1)^T \in \mathbb{R}^n$.
D. $\{0\}^n \times S^n_+$, $0 \in \mathbb{R}^n$.

Exercise 8.6.2. For the linear programming problem

$$\min_{x\in\mathbb{R}^n}\; c^T x \quad \text{s.t.} \quad Ax = b,\; l \leq x_1 \leq u,$$
the constraints can be written in the form $G(x) \in K$, where $K = \{0\}^n \times \mathbb{R}_+ \times \mathbb{R}_+$ is a cone. Then $G(x)$ takes the following form:
A. $G(x) = \begin{bmatrix} Ax - b \\ x_1 - l \\ u - x_1 \end{bmatrix}$.
B. $G(x) = \begin{bmatrix} Ax \\ l - x_1 \\ u - x_1 \end{bmatrix}$.
C. $G(x) = \begin{bmatrix} Ax - b \\ x_1 - l \\ x_1 - u \end{bmatrix}$.
D. $G(x) = \begin{bmatrix} Ax - b \\ l - x_1 \\ u - x_1 \end{bmatrix}$.

Exercise 8.6.3. Consider the following problem
$$\min_{x\in\mathcal{X}}\; f(x) \quad \text{s.t.} \quad G(x) \in K,$$
where $K$ is a cone. The dual problem takes the following form:
A. $\sup_{u\in K}\inf_{x\in\mathcal{X}} L(x,u)$.
B. $\sup_{u\in K^*}\inf_{x\in K} L(x,u)$.
C. $\sup_{u\in K^*}\inf_{x\in\mathcal{X}} L(x,u)$.
D. $\sup_{u\in K}\inf_{x\in K} L(x,u)$.

Chapter 9

Bilevel Optimization and Its Applications

DOI: 10.1051/978-2-7598-3174-6.c009 © Science Press, EDP Sciences, 2023

9.1 Introduction

For machine learning models, there are hyperparameters that must be tuned, and they have a great impact on the performance of the models. However, manual tuning is ineffective, especially when there is a large number of hyperparameters; for example, a neural network with several hidden layers involves a large number of hyperparameters. Automatic hyperparameter tuning has many advantages: it can reduce the effort required, and it can also improve the performance of the models. After choosing proper hyperparameters, one can apply machine learning models to practical problems efficiently. Recent reviews of hyperparameter optimization include Luo [102], Yang and Shami [112], as well as Yu and Zhu [116]. Algorithms for hyperparameter optimization can be divided into five categories [112]: model-free algorithms [82, 94, 110], bilevel optimization [97, 98, 108], gradient-based optimization [79, 84, 107], Bayesian optimization [87, 92, 95], and metaheuristic algorithms [89, 105, 113]. In terms of bilevel optimization, we refer to [83, 85, 86, 101] for the latest reviews. Below we introduce some bilevel optimization models for hyperparameter selection in machine learning, mainly focusing on support vector machines (SVM).

Kunisch and Pock [98] proposed a bilevel optimization model for hyperparameter selection in the variational image denoising model; they used a semismooth Newton's method to solve it. Okuno et al. [108] proposed a bilevel optimization model for the nonsmooth, possibly nonconvex, $l_p$-regularized problem:

$$\min_{\lambda,\,w}\; f(\lambda, w) \quad \text{s.t.} \quad \lambda \geq 0, \quad w = \arg\min_w\; g(w) + \sum_{i=1}^r \lambda_i R_i(w),$$
where $R_1(w) = \|w\|_p^p = \sum_{i=1}^n |w_i|^p$ $(0 < p < 1)$, $R_2,\ldots,R_r$ are twice continuously differentiable functions, $f$ is a once continuously differentiable function, and $g$ is a twice continuously differentiable function, such as the $l_2$-loss function or the logistic-loss function. They used a smoothing method to solve this bilevel optimization model.

Bennett et al. [80] proposed a bilevel model for hyperparameter selection in support vector regression (SVR), which is given as follows

$$\min_{C,\,\epsilon,\,\bar{\lambda}\in\mathbb{R};\; w^t\in\mathbb{R}^n,\, t=1,\ldots,T;\; \underline{w},\,\bar{w},\,w_0}\; \sum_{t=1}^T \sum_{i\in N_t} \big|(w^t)^T x_i - y_i\big| \tag{9.1}$$
$$\text{s.t.} \quad \epsilon,\, C,\, \bar{\lambda} \geq 0, \quad \underline{w} \leq \bar{w}, \quad \text{and for } t = 1,\ldots,T:$$