Accelerated Optimization for Machine Learning: First-Order Algorithms (ISBN 9811529094, 9789811529092)

This book on optimization includes forewords by Michael I. Jordan, Zongben Xu, and Zhi-Quan Luo. Machine learning relies …


English · 273 [286] pages · 2020


Table of contents :
Foreword by Michael I. Jordan
Foreword by Zongben Xu
Foreword by Zhi-Quan Luo
Preface
References
Acknowledgements
Contents
About the Authors
Acronyms
1 Introduction
1.1 Examples of Optimization Problems in Machine Learning
1.2 First-Order Algorithm
1.3 Sketch of Representative Works on Accelerated Algorithms
1.4 About the Book
References
2 Accelerated Algorithms for Unconstrained Convex Optimization
2.1 Accelerated Gradient Method for Smooth Optimization
2.2 Extension to Composite Optimization
2.2.1 Nesterov's First Scheme
2.2.2 Nesterov's Second Scheme
2.2.2.1 A Primal-Dual Perspective
2.2.3 Nesterov's Third Scheme
2.3 Inexact Proximal and Gradient Computing
2.3.1 Inexact Accelerated Gradient Descent
2.3.2 Inexact Accelerated Proximal Point Method
2.4 Restart
2.5 Smoothing for Nonsmooth Optimization
2.6 Higher Order Accelerated Method
2.7 Explanation: A Variational Perspective
2.7.1 Discretization
References
3 Accelerated Algorithms for Constrained Convex Optimization
3.1 Some Facts for the Case of Linear Equality Constraint
3.2 Accelerated Penalty Method
3.2.1 Generally Convex Objectives
3.2.2 Strongly Convex Objectives
3.3 Accelerated Lagrange Multiplier Method
3.3.1 Recovering the Primal Solution
3.3.2 Accelerated Augmented Lagrange Multiplier Method
3.4 Alternating Direction Method of Multiplier and Its Non-ergodic Accelerated Variant
3.4.1 Generally Convex and Nonsmooth Case
3.4.2 Strongly Convex and Nonsmooth Case
3.4.3 Generally Convex and Smooth Case
3.4.4 Strongly Convex and Smooth Case
3.4.5 Non-ergodic Convergence Rate
3.4.5.1 Original ADMM
3.4.5.2 ADMM with Extrapolation and Increasing Penalty Parameter
3.5 Primal-Dual Method
3.5.1 Case 1: μg=μh=0
3.5.2 Case 2: μg>0, μh=0
3.5.3 Case 3: μg=0, μh>0
3.5.4 Case 4: μg>0, μh>0
3.6 Faster Frank–Wolfe Algorithm
References
4 Accelerated Algorithms for Nonconvex Optimization
4.1 Proximal Gradient with Momentum
4.1.1 Convergence Theorem
4.1.2 Another Method: Monotone APG
4.2 AGD Achieves Critical Points Quickly
4.2.1 AGD as a Convexity Monitor
4.2.2 Negative Curvature Descent
4.2.3 Accelerating Nonconvex Optimization
4.3 AGD Escapes Saddle Points Quickly
4.3.1 Almost Convex Case
4.3.2 Very Nonconvex Case
4.3.3 AGD for Nonconvex Problems
4.3.3.1 Locally Almost Convex → Globally Almost Convex
4.3.3.2 Outer Iterations
4.3.3.3 Inner Iterations
References
5 Accelerated Stochastic Algorithms
5.1 The Individually Convex Case
5.1.1 Accelerated Stochastic Coordinate Descent
5.1.2 Background for Variance Reduction Methods
5.1.3 Accelerated Stochastic Variance Reduction Method
5.1.4 Black-Box Acceleration
5.2 The Individually Nonconvex Case
5.3 The Nonconvex Case
5.3.1 SPIDER
5.3.2 Momentum Acceleration
5.4 Constrained Problem
5.5 The Infinite Case
References
6 Accelerated Parallel Algorithms
6.1 Accelerated Asynchronous Algorithms
6.1.1 Asynchronous Accelerated Gradient Descent
6.1.2 Asynchronous Accelerated Stochastic Coordinate Descent
6.2 Accelerated Distributed Algorithms
6.2.1 Centralized Topology
6.2.1.1 Large Mini-Batch Algorithms
6.2.1.2 Dual Communication-Efficient Methods
6.2.2 Decentralized Topology
References
7 Conclusions
References
A Mathematical Preliminaries
A.1 Notations
A.2 Algebra and Probability
A.3 Convex Analysis
A.4 Nonconvex Analysis
References
Index


Zhouchen Lin · Huan Li · Cong Fang

Accelerated Optimization for Machine Learning: First-Order Algorithms

Zhouchen Lin Key Lab. of Machine Perception School of EECS Peking University Beijing, Beijing, China

Huan Li College of Computer Science and Technology Nanjing University of Aeronautics and Astronautics Nanjing, Jiangsu, China

Cong Fang School of Engineering and Applied Science Princeton University Princeton, NJ, USA

ISBN 978-981-15-2909-2    ISBN 978-981-15-2910-8 (eBook)
https://doi.org/10.1007/978-981-15-2910-8

© Springer Nature Singapore Pte Ltd. 2020

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

To our families. Without your great support, this book would not exist and even our careers would be meaningless.

Foreword by Michael I. Jordan

Optimization algorithms have been the engine that has powered the recent rise of machine learning. The needs of machine learning are different from those of other disciplines that have made use of the optimization toolbox; most notably, the parameter spaces are of high dimensionality, and the functions that are being optimized are often sums of millions of terms. In such settings, gradient-based methods are preferred over higher order methods, and given that the computation of a full gradient can be infeasible, stochastic gradient methods are the coin of the realm. Putting such specifications together with the need to solve nonconvex optimization problems, to control the variance induced by the stochastic sampling, and to develop algorithms that run on distributed platforms, one poses a new set of challenges for optimization. Surprisingly, many of these challenges have been addressed within the past decade.

The book by Lin, Li, and Fang is one of the first book-length treatments of this emerging field. The book covers gradient-based algorithms in detail, with a focus on the concept of acceleration. Acceleration is a key concept in modern optimization, supplying new algorithms and providing insight into achievable convergence rates. The book also covers stochastic methods, including variance control, and it includes material on asynchronous distributed implementations.

Any researcher wishing to work in the machine learning field should have a foundational understanding of the disciplines of statistics and optimization. The current book is an excellent place to obtain the latter and to begin one's adventure in machine learning.

University of California
Berkeley, CA, USA
October 2019

Michael I. Jordan


Foreword by Zongben Xu

Optimization is one of the core topics in machine learning. While benefiting from the advances in the native optimization community, optimization for machine learning has its own flavor. One remarkable phenomenon is that first-order algorithms more or less dominate the optimization methods used in machine learning. While there have been some books or preprints that introduce the major optimization algorithms used in machine learning, either partially or thoroughly, this book focuses on a notable stream in recent machine learning optimization, namely the accelerated first-order methods. Originating from Polyak's heavy-ball method and triggered by Nesterov's series of works, accelerated first-order methods have become a hot topic in both the optimization and the machine learning communities and have yielded fruitful results. The results have significantly extended beyond the traditional scope of unconstrained (and deterministic) convex optimization. New results include acceleration for constrained convex optimization and nonconvex optimization, stochastic algorithms, and general acceleration frameworks such as Katyusha and Catalyst. Some of them even have nearly optimal convergence rates. Unfortunately, the existing literature is scattered across diverse and extensive publications, so mastering the basic techniques and gaining a global picture of this dynamic field is very difficult. Fortunately, this monograph, coauthored by Zhouchen Lin, Huan Li, and Cong Fang, meets the need for a quick education on accelerated first-order algorithms just in time.

The book first gives an overview of the development of accelerated first-order algorithms, which is extremely informative, despite being sketchy. Then, it introduces the representative works in different categories, with detailed proofs that greatly facilitate the understanding of the underlying ideas and the mastering of basic techniques. Without doubt, this book is a vital reference for those who want to learn the state of the art of machine learning optimization.

I have known Dr. Zhouchen Lin for a long time. He impresses me with solid work, deep insights, and careful analysis of the problems arising from his diverse research fields. With a lot of shared research interests, one of which is learning-based optimization, I am delighted to see this book finally published after its elaborate writing.

Xi'an Jiaotong University
Xi'an, China
October 2019

Zongben Xu

Foreword by Zhi-Quan Luo

First-order optimization methods have been the main workhorse in machine learning, signal processing, and artificial intelligence involving big data. These methods, while conceptually simple, require careful analysis and a good understanding to be effectively deployed. Issues such as acceleration, nonsmoothness, nonconvexity, and parallel and distributed implementation are critical due to their great impact on an algorithm's convergence behavior and running time.

This research monograph gives an excellent introduction to the algorithmic aspects of first-order optimization methods, focusing on algorithm design and convergence analysis. It treats in depth the issues of acceleration, nonconvexity, constraints, and asynchronous implementation. The topics covered and the results given in the monograph are very timely and strongly relevant to both researchers and practitioners in machine learning, signal processing, and artificial intelligence. The theoretical issues of lower bounds on complexity are purposely avoided to give way to algorithm design and convergence analysis. Overall, the treatment of the subject is quite balanced and many useful insights are provided throughout the monograph.

The authors of this monograph are experienced researchers at the interface of machine learning and optimization. The monograph is very well written and makes an excellent read. It should be an important reference book for everyone interested in the optimization aspects of machine learning.

The Chinese University of Hong Kong
Shenzhen, China
October 2019

Zhi-Quan Luo


Preface

While I was preparing advanced materials for the optimization course taught at Peking University, I found that accelerated algorithms are the most attractive and practical topic for students in engineering. Actually, this is also a hot topic at current machine learning conferences. While some books have introduced some accelerated algorithms, such as [1–3], they are nevertheless incomplete, unsystematic, or not up-to-date. Thus, in early 2018, I decided to write a monograph on accelerated algorithms. My goal was to produce a book that is organized and self-contained, with sufficient preliminary materials and detailed proofs, so that readers need not consult the scattered literature, be plagued by inconsistent notations, or be carried away from the central ideas by non-essential contents. Luckily, my two Ph.D. students, Huan Li and Cong Fang, were happy to join this work.

This task turned out to be very hard, as we had to work amid our busy schedules. Eventually, we managed to have the first complete yet crude draft right before Huan Li and Cong Fang graduated. Smoothing the book and correcting various errors took us another 4 months. Finally, we were truly honored to have forewords from Prof. Michael I. Jordan, Prof. Zongben Xu, and Prof. Zhi-Quan Luo. While this book deprived us of all our leisure time in the past nearly 2 years, we still feel that our endeavor pays off now that every part of the book is ready. We hope this book is a valuable reference for the machine learning and optimization communities. That would be the highest praise for our work.

Beijing, China
November 2019

Zhouchen Lin

References

1. A. Beck, First-Order Methods in Optimization, vol. 25 (SIAM, Philadelphia, 2017)
2. S. Bubeck, Convex optimization: algorithms and complexity. Found. Trends Mach. Learn. 8(3–4), 231–357 (2015)
3. Y. Nesterov, Lectures on Convex Optimization (Springer, New York, 2018)

Acknowledgements

The authors would like to thank all our collaborators and friends, especially: Bingsheng He, Junchi Li, Qing Ling, Guangcan Liu, Risheng Liu, Yuanyuan Liu, Canyi Lu, Zhiquan Luo, Yi Ma, Fanhua Shang, Zaiwen Wen, Xingyu Xie, Chen Xu, Shuicheng Yan, Wotao Yin, Xiaoming Yuan, Yaxiang Yuan, and Tong Zhang. The authors also thank Yuqing Hou, Jia Li, Shiping Wang, Jianlong Wu, Hongyang Zhang, and Pan Zhou for careful proofreading. The authors also thank Celine Chang from Springer, who offered much assistance during the production of the book. This monograph is supported by the National Natural Science Foundation of China under Grant Nos. 61625301 and 61731018, and by the Beijing Academy of Artificial Intelligence.



About the Authors

Zhouchen Lin is a leading expert in the fields of machine learning and computer vision. He is currently a Professor at the Key Laboratory of Machine Perception (Ministry of Education), School of EECS, Peking University. He served as an area chair for several prestigious conferences, including CVPR, ICCV, ICML, NIPS/NeurIPS, AAAI, and IJCAI. He is an associate editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence and the International Journal of Computer Vision. He is a Fellow of IAPR and IEEE.

Huan Li received his Ph.D. degree in machine learning from Peking University in 2019. He is currently an Assistant Professor at the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics. His current research interests include optimization and machine learning.

Cong Fang received his Ph.D. degree from Peking University in 2019. He is currently a Postdoctoral Researcher at Princeton University. His research interests include machine learning and optimization.


Acronyms

AAAI  Association for the Advancement of Artificial Intelligence
AACD  Asynchronous Accelerated Coordinate Descent
AAGD  Asynchronous Accelerated Gradient Descent
AASCD  Asynchronous Accelerated Stochastic Coordinate Descent
AC-AGD  Almost Convex Accelerated Gradient Descent
Acc-ADMM  Accelerated Alternating Direction Method of Multiplier
Acc-SADMM  Accelerated Stochastic Alternating Direction Method of Multiplier
Acc-SDCA  Accelerated Stochastic Dual Coordinate Ascent
ADMM  Alternating Direction Method of Multiplier
AGD  Accelerated Gradient Descent
APG  Accelerated Proximal Gradient
ASCD  Accelerated Stochastic Coordinate Descent
ASGD  Asynchronous Stochastic Gradient Descent
ASVRG  Asynchronous Stochastic Variance Reduced Gradient
DSCAD  Distributed Stochastic Communication Accelerated Dual
ERM  Empirical Risk Minimization
EXTRA  EXact firsT-ordeR Algorithm
GIST  General Iterative Shrinkage and Thresholding
IC  Individually Convex
IFO  Incremental First-order Oracle
INC  Individually Nonconvex
iPiano  Inertial Proximal Algorithms for Nonconvex Optimization
IQC  Integral Quadratic Constraint
KKT  Karush–Kuhn–Tucker
KŁ  Kurdyka–Łojasiewicz
LASSO  Least Absolute Shrinkage and Selection Operator
LMI  Linear Matrix Inequality
MISO  Minimization by Incremental Surrogate Optimization
NC  Negative Curvature/Nonconvex
NCD  Negative Curvature Descent
PCA  Principal Component Analysis
PG  Proximal Gradient
SAG  Stochastic Average Gradient
SAGD  Stochastic Accelerated Gradient Descent
SCD  Stochastic Coordinate Descent
SDCA  Stochastic Dual Coordinate Ascent
SGD  Stochastic Gradient Descent
SPIDER  Stochastic Path-Integrated Differential Estimator
SVD  Singular Value Decomposition
SVM  Support Vector Machine
SVRG  Stochastic Variance Reduced Gradient
SVT  Singular Value Thresholding
VR  Variance Reduction

Chapter 1

Introduction

Optimization is a supporting technology in many research fields related to numerical computation, such as machine learning, signal processing, industrial design, and operations research. In particular, P. Domingos, an AAAI Fellow and a Professor at the University of Washington, proposed a celebrated formula [23]:

machine learning = representation + optimization + evaluation,

showing the importance of optimization in machine learning.

1.1 Examples of Optimization Problems in Machine Learning

Optimization problems arise throughout machine learning. We provide two representative examples here. The first one is classification/regression and the second one is low-rank learning. Many classification/regression problems can be formulated as
$$\min_{w\in\mathbb{R}^n} \frac{1}{m}\sum_{i=1}^{m} l(p(x_i; w), y_i) + \lambda R(w), \qquad (1.1)$$

where $w$ consists of the parameters of a classification/regression system, $p(x; w)$ represents the prediction function of the learning model, $l$ is the loss function that penalizes the disagreement between the system prediction and the true value, $(x_i, y_i)$ is the $i$-th data sample with $x_i$ being the datum/feature vector and $y_i$ the label for classification or the corresponding value for regression, $R$ is a regularizer that enforces some special property in $w$, and $\lambda \ge 0$ is a trade-off parameter. Typical examples of $l(p, y)$ include the squared loss $l(p, y) = \frac{1}{2}(p - y)^2$, the logistic loss $l(p, y) = \log(1 + \exp(-py))$, and the hinge loss $l(p, y) = \max\{0, 1 - py\}$. Examples of $p(x; w)$ include $p(x; w) = w^T x - b$ for linear classification/regression and $p(x; W) = \phi(W_n \phi(W_{n-1} \cdots \phi(W_1 x) \cdots))$ for the forward propagation widely used in deep neural networks, where $W$ is a collection of the weight matrices $W_k$, $k = 1, \cdots, n$, and $\phi$ is an activation function. Representative examples of $R(w)$ include the $\ell_2$ regularizer $R(w) = \frac{1}{2}\|w\|^2$ and the $\ell_1$ regularizer $R(w) = \|w\|_1$. The combinations of different loss functions, prediction functions, and regularizers lead to different machine learning models. For example, the hinge loss, the linear classification function, and the $\ell_2$ regularizer give the support vector machine (SVM) problem [21]; the logistic loss, the linear regression function, and the $\ell_2$ regularizer give the regularized logistic regression problem [10]; the squared loss, the forward propagation function, and $R(W) = 0$ give the multi-layer perceptron [33]; and the squared loss, the linear regression function, and the $\ell_1$ regularizer give the LASSO problem [68].

There are also many problems investigated by the machine learning community that are not of the form of (1.1). For example, the matrix completion problem, which has wide applications in signal and data processing, can be written as
$$\min_{X\in\mathbb{R}^{m\times n}} \|X\|_*, \quad \text{s.t.} \quad X_{ij} = D_{ij}, \ \forall (i,j)\in\Omega,$$
where $\Omega$ is the set of locations of the observed entries. The low-rank representation (LRR) problem [50], which is powerful in clustering data into subspaces, is cast as
$$\min_{Z\in\mathbb{R}^{n\times n},\, E\in\mathbb{R}^{m\times n}} \|Z\|_* + \lambda\|E\|_1, \quad \text{s.t.} \quad D = DZ + E.$$

To reduce the computational cost as well as the storage space, people observe that a low-rank matrix can be factorized as a product of two much smaller matrices, i.e., $X = UV^T$. Taking the matrix completion problem as an example, it can be reformulated as follows, which is a nonconvex problem:
$$\min_{U\in\mathbb{R}^{m\times r},\, V\in\mathbb{R}^{n\times r}} \frac{1}{2}\sum_{(i,j)\in\Omega}\left( U_i V_j^T - D_{ij} \right)^2 + \frac{\lambda}{2}\left( \|U\|_F^2 + \|V\|_F^2 \right).$$
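To make the abstract form (1.1) concrete, here is a minimal NumPy sketch (our illustration, with hypothetical random data and function names of our own choosing) that evaluates the $\ell_2$-regularized logistic regression objective, i.e., the logistic loss with a linear prediction function:

```python
import numpy as np

def logistic_objective(w, X, y, lam):
    """Evaluate (1.1) with the logistic loss l(p, y) = log(1 + exp(-p y)),
    the linear prediction p(x; w) = w^T x, and R(w) = ||w||^2 / 2."""
    p = X @ w                                   # predictions p(x_i; w)
    loss = np.mean(np.log1p(np.exp(-y * p)))    # (1/m) sum_i l(p_i, y_i)
    return loss + lam * 0.5 * np.dot(w, w)      # plus lambda * R(w)

# Hypothetical data: m = 100 samples, n = 5 features, labels in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = np.sign(rng.standard_normal(100))
print(logistic_objective(np.zeros(5), X, y, lam=0.1))  # log(2) at w = 0
```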

For more examples of optimization problems in machine learning, one may refer to the survey paper written by Gambella, Ghaddar, and Naoum-Sawaya in 2019 [28].


1.2 First-Order Algorithm

In most machine learning models, a moderate numerical precision of the parameters already suffices. Moreover, an iteration needs to finish in a reasonable amount of time. Thus, first-order optimization methods are the mainstream algorithms used in the machine learning community. While "first-order" has a rigorous definition in the complexity theory of optimization, which is based on an oracle that only returns $f(x_k)$ and $\nabla f(x_k)$ when queried with $x_k$, here we adopt a much more general sense: higher order derivatives of the objective function are not used (thus allowing the closed-form solution of a subproblem, the use of the proximal mapping (Definition A.19), etc.). However, we do not want to write a book on all first-order algorithms that are commonly used or actively investigated in the machine learning community, which is clearly beyond our capability due to the huge amount of literature. Some excellent reference books, preprints, or surveys include [7, 12–14, 34, 35, 37, 58, 60, 66]. Rather, we focus on the accelerated first-order methods only, where "accelerated" means that the convergence rate is improved without making much stronger assumptions, and the techniques used are essentially exquisite interpolation and extrapolation.
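As an example of such a proximal mapping with a closed-form solution, the proximal mapping of the $\ell_1$ norm is the classical soft-thresholding operator; a minimal sketch of this standard fact (the function name is ours):

```python
import numpy as np

def prox_l1(x, t):
    """Proximal mapping of t * ||.||_1:
    argmin_u  t * ||u||_1 + ||u - x||^2 / 2,
    solved coordinate-wise by soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

print(prox_l1(np.array([-2.0, 0.3, 1.5]), t=0.5))  # [-1.5, 0., 1.]
```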

1.3 Sketch of Representative Works on Accelerated Algorithms

In the above sense of acceleration, the first accelerated optimization algorithm may be Polyak's heavy-ball method [61]. Consider a problem with an $L$-smooth (Definition A.12) and $\mu$-strongly convex (Definition A.10) objective, and let $\varepsilon$ be the error to the optimal solution. The heavy-ball method reduces the complexity $O\left(\frac{L}{\mu}\log\frac{1}{\varepsilon}\right)$ of the usual gradient descent to $O\left(\sqrt{\frac{L}{\mu}}\log\frac{1}{\varepsilon}\right)$. In 1983, Nesterov proposed his accelerated gradient descent (AGD) for $L$-smooth objective functions, where the complexity is reduced to $O\left(\frac{1}{\sqrt{\varepsilon}}\right)$ as compared with that of the usual gradient descent, $O\left(\frac{1}{\varepsilon}\right)$.
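For reference, the heavy-ball method augments the gradient step with a momentum term built from the previous iterate; a standard form of the update (the step sizes $\alpha$ and $\beta$ are tuning parameters) is
$$x_{k+1} = x_k - \alpha \nabla f(x_k) + \beta\,(x_k - x_{k-1}).$$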

Nesterov further proposed another accelerated algorithm for $L$-smooth objective functions in 1988 [53], smoothing techniques for nonsmooth functions with acceleration tricks in 2005 [54], and an accelerated algorithm for composite functions in 2007 [55] (whose formal publication is [57]). Nesterov's seminal work did not catch much attention in the machine learning community, possibly because the objective functions in machine learning models are often nonsmooth, e.g., due to the adoption of sparse and low-rank regularizers, which are not differentiable. The accelerated proximal gradient (APG) method for composite functions by Beck and Teboulle [8], which was formally published in 2009 and is an extension of [53] and simpler than [55],¹ somehow gained great interest in the machine learning community, as it fits well with the sparse and low-rank models that were hot topics at that time.

¹ In each iteration, [8] uses only information from the last two iterations and makes one call to the proximal mapping, while [55] uses the entire history of previous iterations and makes two calls to the proximal mapping.

Tseng further provided a unified analysis of the existing acceleration techniques [70] and Bubeck proposed a near-optimal method for highly smooth convex optimization [16].

Nesterov's AGD is not quite intuitive, and there have been some efforts to interpret his AGD algorithm. Su et al. gave an interpretation from the viewpoint of differential equations [67] and Wibisono et al. further extended it to higher order AGD [71]. Lessard et al. proposed a Linear Matrix Inequality (LMI) approach using the Integral Quadratic Constraints (IQCs) from robust control theory to analyze AGD [42]. Allen-Zhu and Orecchia connected AGD to mirror descent via the linear coupling technique [6]. On the other hand, some researchers work on designing other interpretable accelerated algorithms. Kim and Fessler designed an optimized first-order algorithm whose complexity is only one half of that of Nesterov's accelerated gradient method via the Performance Estimation Problem approach [40]. Bubeck proposed a geometric descent method inspired by the ellipsoid method [15] and Drusvyatskiy et al. showed that the same iterate sequence is generated via computing an optimal average of quadratic lower-models of the function [24].

For linearly constrained convex problems, different from the unconstrained case, both the error in the objective function value and the error in the constraint should be taken care of. Ideally, both errors should reduce at the same rate. A straightforward way to extend Nesterov's acceleration technique to constrained optimization is to solve its dual problem (Definition A.24) using AGD directly, which leads to the accelerated dual ascent [9] and the accelerated augmented Lagrange multiplier method [36], both with the optimal convergence rate in the dual space. Lu [51] and Li [44] further analyzed the complexity in the primal space for the accelerated dual ascent and its variant. One disadvantage of dual based methods is the need to solve a subproblem at each iteration. Linearization is an effective approach to overcome this shortcoming. Specifically, Li et al. proposed an accelerated linearized penalty method that increases the penalty along with the update of the variable [45] and Xu proposed an accelerated linearized augmented Lagrangian method [72]. ADMM and the primal-dual method, as the most commonly used methods for constrained optimization, were also accelerated in [59] and [20] for generally convex (Definition A.10) and smooth objectives, respectively. When strong convexity is assumed, ADMM and the primal-dual method can have faster convergence rates even if no acceleration techniques are used [19, 72].

Nesterov's AGD has also been extended to nonconvex problems. The first analysis of AGD for nonconvex optimization appeared in [31], which minimizes a composite objective with a smooth (Definition A.11) nonconvex part and a nonsmooth convex (Definition A.7) part. Inspired by [31], Li and Lin proposed AGD variants for minimizing the composition of a smooth nonconvex part and a nonsmooth nonconvex part [43]. Both works in [31, 43] studied the convergence to the first-order critical point (Definition A.34). Carmon et al. further gave an $O\left(\frac{1}{\varepsilon^{7/4}}\log\frac{1}{\varepsilon}\right)$ complexity analysis [17]. For many famous machine learning problems, e.g., matrix sensing and matrix completion, there is no spurious local minimum [11, 30] and the only task is to escape strict saddle points (Definition A.29). The first accelerated method to find a second-order critical point appeared in [18], which alternates between two subroutines, negative curvature descent and Almost Convex AGD, and can be seen as a combination of accelerated gradient descent and the Lanczos method. Jin et al. further proposed a single-loop accelerated method [38]. Agarwal et al. proposed a careful implementation of the Nesterov–Polyak method, using accelerated methods for fast approximate matrix inversion [1]. The complexities established in [1, 18, 38] are all $O\left(\frac{1}{\varepsilon^{7/4}}\log\frac{1}{\varepsilon}\right)$.

As for stochastic algorithms, compared with the deterministic algorithms, the main challenge is that the noise of the gradient will not reach zero through updates, which makes the famous stochastic gradient descent (SGD) converge only at a sublinear rate even for strongly convex and smooth problems. Variance reduction (VR) is an efficient technique to reduce the negative effect of noise [22, 39, 52, 63]. With the VR and the momentum techniques, Allen-Zhu proposed the first truly accelerated stochastic algorithm, named Katyusha [2]. Katyusha is an algorithm working in the primal space. Another way to accelerate stochastic algorithms is to solve the problem in the dual space, so that we can use techniques like stochastic coordinate descent (SCD) [27, 48, 56] and the stochastic primal-dual method [41, 74]. On the other hand, in 2015 Lin et al. proposed a generic framework, called Catalyst [49], that minimizes a convex objective function via an accelerated proximal point method and gains acceleration, whose idea previously appeared in [65]. Stochastic nonconvex optimization is also an important topic and some excellent works include [3–5, 29, 62, 69, 73]. Particularly, Fang et al. proposed the Stochastic Path-Integrated Differential Estimator (SPIDER) technique and attained the near optimal convergence rate under certain conditions [26].

The acceleration techniques are also applicable to parallel optimization. Parallel algorithms can be implemented in two fashions: asynchronous updates and synchronous updates. For asynchronous updates, none of the machines needs to wait for the others to finish computing. Representative works include asynchronous accelerated gradient descent (AAGD) [25] and asynchronous accelerated coordinate descent (AACD) [32]. Based on different topologies, synchronous algorithms include centralized and decentralized distributed methods. Typical works for the former organization include the distributed ADMM [13], distributed dual coordinate ascent [75], and their extensions. One bottleneck of the centralized topology lies in the high communication cost at the central node [47]. Although decentralized algorithms have been widely studied by the control community, the lower bound was not established until 2017 [64], and a distributed dual ascent with a matching upper bound is given in [64]. Motivated by the lower bound, Li et al. further analyzed the distributed accelerated gradient descent with both optimal communication and computation complexities up to a log factor [46].

1.4 About the Book

In the previous section, we briefly introduced the representative works on accelerated first-order algorithms. However, due to limited time we do not give the details of all of them in the subsequent chapters. Rather, we only introduce the results and proofs of part of them, based on our personal taste and familiarity. The algorithms are organized by their nature: deterministic algorithms for unconstrained convex problems (Chap. 2), constrained convex problems (Chap. 3), and (unconstrained) nonconvex problems (Chap. 4), as well as stochastic algorithms for centralized optimization (Chap. 5) and distributed optimization (Chap. 6). To make our book self-contained, for each introduced algorithm we give the details of its proof.

This book serves as a reference to part of the recent advances in optimization. It is appropriate for graduate students and researchers who are interested in machine learning and optimization. Nonetheless, the proofs for achieving critical points (Sect. 4.2), escaping saddle points (Sect. 4.3), and the decentralized topology (Sect. 6.2.2) are highly non-trivial, so uninterested readers may skip them.

References

1. N. Agarwal, Z. Allen-Zhu, B. Bullins, E. Hazan, T. Ma, Finding approximate local minima for nonconvex optimization in linear time, in Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, Montreal (2017), pp. 1195–1200
2. Z. Allen-Zhu, Katyusha: the first truly accelerated stochastic gradient method, in Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, Montreal (2017), pp. 1200–1206
3. Z. Allen-Zhu, Natasha2: faster non-convex optimization than SGD, in Advances in Neural Information Processing Systems, Montreal, vol. 31 (2018), pp. 2675–2686
4. Z. Allen-Zhu, E. Hazan, Variance reduction for faster non-convex optimization, in Proceedings of the 33rd International Conference on Machine Learning, New York (2016), pp. 699–707
5. Z. Allen-Zhu, Y. Li, Neon2: finding local minima via first-order oracles, in Advances in Neural Information Processing Systems, Montreal, vol. 31 (2018), pp. 3716–3726
6. Z. Allen-Zhu, L. Orecchia, Linear coupling: an ultimate unification of gradient and mirror descent, in Proceedings of the 8th Innovations in Theoretical Computer Science, Berkeley (2017)
7. A. Beck, First-Order Methods in Optimization, vol. 25 (SIAM, Philadelphia, 2017)
8. A. Beck, M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imag. Sci. 2(1), 183–202 (2009)
9. A. Beck, M. Teboulle, A fast dual proximal gradient algorithm for convex minimization and applications. Oper. Res. Lett. 42(1), 1–6 (2014)
10. J. Berkson, Application of the logistic function to bio-assay. J. Am. Stat. Assoc. 39(227), 357–365 (1944)


11. S. Bhojanapalli, B. Neyshabur, N. Srebro, Global optimality of local search for low rank matrix recovery, in Advances in Neural Information Processing Systems, Barcelona, vol. 29 (2016), pp. 3873–3881
12. L. Bottou, F.E. Curtis, J. Nocedal, Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)
13. S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
14. S. Bubeck, Convex optimization: algorithms and complexity. Found. Trends Mach. Learn. 8(3–4), 231–357 (2015)
15. S. Bubeck, Y.T. Lee, M. Singh, A geometric alternative to Nesterov's accelerated gradient descent (2015). Preprint. arXiv:1506.08187
16. S. Bubeck, Q. Jiang, Y.T. Lee, Y. Li, A. Sidford, Near-optimal method for highly smooth convex optimization, in Proceedings of the 32nd Conference on Learning Theory, Phoenix (2019), pp. 492–507
17. Y. Carmon, J.C. Duchi, O. Hinder, A. Sidford, Convex until proven guilty: dimension-free acceleration of gradient descent on non-convex functions, in Proceedings of the 34th International Conference on Machine Learning, Sydney (2017), pp. 654–663
18. Y. Carmon, J.C. Duchi, O. Hinder, A. Sidford, Accelerated methods for nonconvex optimization. SIAM J. Optim. 28(2), 1751–1772 (2018)
19. A. Chambolle, T. Pock, A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imag. Vis. 40(1), 120–145 (2011)
20. Y. Chen, G. Lan, Y. Ouyang, Optimal primal-dual methods for a class of saddle point problems. SIAM J. Optim. 24(4), 1779–1814 (2014)
21. C. Cortes, V. Vapnik, Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
22. A. Defazio, F. Bach, S. Lacoste-Julien, SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives, in Advances in Neural Information Processing Systems, Montreal, vol. 27 (2014), pp. 1646–1654
23. P.M. Domingos, A few useful things to know about machine learning. Commun. ACM 55(10), 78–87 (2012)
24. D. Drusvyatskiy, M. Fazel, S. Roy, An optimal first order method based on optimal quadratic averaging. SIAM J. Optim. 28(1), 251–271 (2018)
25. C. Fang, Y. Huang, Z. Lin, Accelerating asynchronous algorithms for convex optimization by momentum compensation (2018). Preprint. arXiv:1802.09747
26. C. Fang, C.J. Li, Z. Lin, T. Zhang, SPIDER: near-optimal non-convex optimization via stochastic path-integrated differential estimator, in Advances in Neural Information Processing Systems, Montreal, vol. 31 (2018), pp. 689–699
27. O. Fercoq, P. Richtárik, Accelerated, parallel, and proximal coordinate descent. SIAM J. Optim. 25(4), 1997–2023 (2015)
28. C. Gambella, B. Ghaddar, J. Naoum-Sawaya, Optimization models for machine learning: a survey (2019). Preprint. arXiv:1901.05331
29. R. Ge, F. Huang, C. Jin, Y. Yuan, Escaping from saddle points – online stochastic gradient for tensor decomposition, in Proceedings of the 28th Conference on Learning Theory, Paris (2015), pp. 797–842
30. R. Ge, J.D. Lee, T. Ma, Matrix completion has no spurious local minimum, in Advances in Neural Information Processing Systems, Barcelona, vol. 29 (2016), pp. 2973–2981
31. S. Ghadimi, G. Lan, Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program. 156(1–2), 59–99 (2016)
32. R. Hannah, F. Feng, W. Yin, A2BCD: an asynchronous accelerated block coordinate descent algorithm with optimal complexity, in Proceedings of the 7th International Conference on Learning Representations, New Orleans (2019)
33. S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd edn. (Pearson Prentice Hall, Upper Saddle River, 1999)


34. E. Hazan, Introduction to online convex optimization. Found. Trends Optim. 2(3–4), 157–325 (2016)
35. E. Hazan, Optimization for machine learning. Technical report, Princeton University (2019)
36. B. He, X. Yuan, On the acceleration of augmented Lagrangian method for linearly constrained optimization (2010). Preprint. http://www.optimization-online.org/DB_FILE/2010/10/2760.pdf
37. P. Jain, P. Kar, Non-convex optimization for machine learning. Found. Trends Mach. Learn. 10(3–4), 142–336 (2017)
38. C. Jin, P. Netrapalli, M.I. Jordan, Accelerated gradient descent escapes saddle points faster than gradient descent, in Proceedings of the 31st Conference on Learning Theory, Stockholm (2018), pp. 1042–1085
39. R. Johnson, T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, in Advances in Neural Information Processing Systems, Lake Tahoe, vol. 26 (2013), pp. 315–323
40. D. Kim, J.A. Fessler, Optimized first-order methods for smooth convex minimization. Math. Program. 159(1–2), 81–107 (2016)
41. G. Lan, Y. Zhou, An optimal randomized incremental gradient method. Math. Program. 171(1–2), 167–215 (2018)
42. L. Lessard, B. Recht, A. Packard, Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26(1), 57–95 (2016)
43. H. Li, Z. Lin, Accelerated proximal gradient methods for nonconvex programming, in Advances in Neural Information Processing Systems, Montreal, vol. 28 (2015), pp. 379–387
44. H. Li, Z. Lin, On the complexity analysis of the primal solutions for the accelerated randomized dual coordinate ascent. J. Mach. Learn. Res. (2020). http://jmlr.org/papers/v21/18-425.html
45. H. Li, C. Fang, Z. Lin, Convergence rates analysis of the quadratic penalty method and its applications to decentralized distributed optimization (2017). Preprint. arXiv:1711.10802
46. H. Li, C. Fang, W. Yin, Z. Lin, A sharp convergence rate analysis for distributed accelerated gradient methods (2018). Preprint. arXiv:1810.01053
47. X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, J. Liu, Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent, in Advances in Neural Information Processing Systems, Long Beach, vol. 30 (2017), pp. 5330–5340
48. Q. Lin, Z. Lu, L. Xiao, An accelerated proximal coordinate gradient method, in Advances in Neural Information Processing Systems, Montreal, vol. 27 (2014), pp. 3059–3067
49. H. Lin, J. Mairal, Z. Harchaoui, A universal catalyst for first-order optimization, in Advances in Neural Information Processing Systems, Montreal, vol. 28 (2015), pp. 3384–3392
50. G. Liu, Z. Lin, Y. Yu, Robust subspace segmentation by low-rank representation, in Proceedings of the 27th International Conference on Machine Learning, Haifa, vol. 1 (2010), pp. 663–670
51. J. Lu, M. Johansson, Convergence analysis of approximate primal solutions in dual first-order methods. SIAM J. Optim. 26(4), 2430–2467 (2016)
52. J. Mairal, Optimization with first-order surrogate functions, in Proceedings of the 30th International Conference on Machine Learning, Atlanta (2013), pp. 783–791
53. Y. Nesterov, On an approach to the construction of optimal methods of minimization of smooth convex functions. Ekonomika i Matematicheskie Metody 24(3), 509–517 (1988)
54. Y. Nesterov, Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)
55. Y. Nesterov, Gradient methods for minimizing composite objective function. Technical Report Discussion Paper #2007/76, CORE (2007)
56. Y. Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)
57. Y. Nesterov, Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)
58. Y. Nesterov, Lectures on Convex Optimization (Springer, New York, 2018)


59. Y. Ouyang, Y. Chen, G. Lan, E. Pasiliao Jr., An accelerated linearized alternating direction method of multipliers. SIAM J. Imag. Sci. 8(1), 644–681 (2015)
60. N. Parikh, S. Boyd, Proximal algorithms. Found. Trends Optim. 1(3), 127–239 (2014)
61. B.T. Polyak, Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964)
62. S.J. Reddi, A. Hefny, S. Sra, B. Poczos, A. Smola, Stochastic variance reduction for nonconvex optimization, in Proceedings of the 33rd International Conference on Machine Learning, New York (2016), pp. 314–323
63. M. Schmidt, N. Le Roux, F. Bach, Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1–2), 83–112 (2017)
64. K. Seaman, F. Bach, S. Bubeck, Y.T. Lee, L. Massoulié, Optimal algorithms for smooth and strongly convex distributed optimization in networks, in Proceedings of the 34th International Conference on Machine Learning, Sydney (2017), pp. 3027–3036
65. S. Shalev-Shwartz, T. Zhang, Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization, in Proceedings of the 31st International Conference on Machine Learning, Beijing (2014), pp. 64–72
66. S. Sra, S. Nowozin, S.J. Wright (eds.), Optimization for Machine Learning (MIT Press, Cambridge, MA, 2012)
67. W. Su, S. Boyd, E. Candès, A differential equation for modeling Nesterov's accelerated gradient method: theory and insights, in Advances in Neural Information Processing Systems, Montreal, vol. 27 (2014), pp. 2510–2518
68. R. Tibshirani, Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Methodol. 58(1), 267–288 (1996)
69. N. Tripuraneni, M. Stern, C. Jin, J. Regier, M.I. Jordan, Stochastic cubic regularization for fast nonconvex optimization, in Advances in Neural Information Processing Systems, Montreal, vol. 31 (2018), pp. 2899–2908
70. P. Tseng, On accelerated proximal gradient methods for convex-concave optimization. Technical report, University of Washington, Seattle (2008)
71. A. Wibisono, A.C. Wilson, M.I. Jordan, A variational perspective on accelerated methods in optimization. Proc. Natl. Acad. Sci. 113(47), 7351–7358 (2016)
72. Y. Xu, Accelerated first-order primal-dual proximal methods for linearly constrained composite convex programming. SIAM J. Optim. 27(3), 1459–1484 (2017)
73. Y. Xu, J. Rong, T. Yang, First-order stochastic algorithms for escaping from saddle points in almost linear time, in Advances in Neural Information Processing Systems, Montreal, vol. 31 (2018), pp. 5530–5540
74. Y. Zhang, L. Xiao, Stochastic primal-dual coordinate method for regularized empirical risk minimization. J. Mach. Learn. Res. 18(1), 2939–2980 (2017)
75. S. Zheng, J. Wang, F. Xia, W. Xu, T. Zhang, A general distributed dual coordinate optimization framework for regularized loss minimization. J. Mach. Learn. Res. 18(115), 1–52 (2017)

Chapter 2

Accelerated Algorithms for Unconstrained Convex Optimization

First-order methods have received extensive attention recently because they are effective in solving large-scale optimization problems. The accelerated gradient method may be one of the most widely used first-order methods due to its solid theoretical foundation, effective practical performance, and simple implementation. In this chapter, we summarize the basic accelerated gradient methods for unconstrained convex optimization. We are interested in algorithms for solving the following problem:
$$\min_{x} f(x), \qquad (2.1)$$
where the objective function $f(x)$ is convex. This chapter includes the descriptions of several accelerated gradient methods for smooth and composite optimization, accelerated algorithms with inexact gradient and proximal computing, the restart technique for restricted strongly convex optimization, the smoothing technique for nonsmooth optimization, and higher order accelerated gradient algorithms and their explanation from the variational perspective.

2.1 Accelerated Gradient Method for Smooth Optimization

In a series of celebrated works [19, 20, 22], Y. Nesterov proposed several accelerated gradient methods to solve the smooth problem (2.1). We introduce the momentum based accelerated gradient descent (AGD) in this section and leave the other algorithms to Sect. 2.2. Accelerated gradient descent is an extension of gradient descent, where the latter has the following iterations:
$$x_{k+1} = x_k - \frac{1}{L}\nabla f(x_k). \qquad (2.2)$$
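As a baseline for the accelerated variant introduced next, here is a direct transcription of (2.2) as a minimal sketch (assuming `grad` returns $\nabla f$ of a NumPy vector and `L` is the smoothness constant; both names are illustrative assumptions, not code from the book):

```python
def gradient_descent(grad, x0, L, num_iters=500):
    """Plain gradient descent (2.2): x_{k+1} = x_k - (1/L) * grad_f(x_k).
    x0 is a NumPy array; grad is a callable returning the gradient."""
    x = x0.copy()
    for _ in range(num_iters):
        x = x - grad(x) / L
    return x
```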


Fig. 2.1 Comparison of gradient descent without momentum (left) and with momentum (right). Reproduced from https://www.willamette.edu/~gorr/classes/cs449/momrate.html

Physically, accelerated gradient descent adds an inertia to the current point to generate an extrapolated point $y_k$ and then performs a gradient descent step at $y_k$. The algorithm is described in Algorithm 2.1, where $\beta_k$ will be specified in Theorem 2.1. Figure 2.1 demonstrates the intuition behind momentum. Intuitively, gradient descent may oscillate across the slopes of the ravine around the local minimum, while only making hesitant progress along the bottom towards the local minimum. Momentum helps accelerate gradient descent in the relevant direction and dampens the oscillations.

Algorithm 2.1 Accelerated gradient descent (AGD)
  Initialize $x_0 = x_{-1}$.
  for $k = 0, 1, 2, 3, \cdots$ do
    $y_k = x_k + \beta_k (x_k - x_{k-1})$,
    $x_{k+1} = y_k - \frac{1}{L}\nabla f(y_k)$.
  end for
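For concreteness, the following is a minimal NumPy transcription of Algorithm 2.1 with the parameter choices of Theorem 2.1 for the generally convex case; the `grad` callable, the toy quadratic, and all names are illustrative assumptions, not code from the book:

```python
import numpy as np

def agd(grad, x0, L, num_iters=500):
    """Algorithm 2.1 with the Theorem 2.1 parameters (generally convex case):
    theta_{-1} = 1, theta_{k+1} = (sqrt(theta_k^4 + 4 theta_k^2) - theta_k^2)/2,
    and beta_k = theta_k (1 - theta_{k-1}) / theta_{k-1}."""
    x_prev = x0.copy()
    x = x0.copy()
    theta_prev = 1.0                              # theta_{-1}
    theta = (np.sqrt(5.0) - 1.0) / 2.0            # theta_0 from the recursion
    for _ in range(num_iters):
        beta = theta * (1.0 - theta_prev) / theta_prev
        y = x + beta * (x - x_prev)               # extrapolation step
        x_prev, x = x, y - grad(y) / L            # gradient step at y
        theta_prev, theta = theta, (np.sqrt(theta**4 + 4.0 * theta**2) - theta**2) / 2.0
    return x

# Toy usage on a quadratic f(x) = x^T A x / 2 - b^T x, whose gradient is
# A x - b and whose smoothness constant is the largest eigenvalue of A.
rng = np.random.default_rng(0)
Q = rng.standard_normal((50, 50))
A = Q.T @ Q / 50.0 + 1e-3 * np.eye(50)
b = rng.standard_normal(50)
L = np.linalg.eigvalsh(A)[-1]
x_hat = agd(lambda x: A @ x - b, np.zeros(50), L, num_iters=2000)
print(np.linalg.norm(A @ x_hat - b))  # near-stationarity check
```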

Theoretically, gradient descent (2.2) has the $O\left(\frac{1}{k}\right)$ convergence rate for generally convex problems and the $O\left(\left(1-\frac{\mu}{L}\right)^k\right)$ convergence rate for $\mu$-strongly convex ones, respectively [21]. As a comparison, accelerated gradient descent can improve the convergence rates to $O\left(\frac{1}{k^2}\right)$ and $O\left(\left(1-\sqrt{\frac{\mu}{L}}\right)^k\right)$ for generally convex and strongly convex problems, respectively. We use the technique of estimate sequence to prove the convergence rate; it was originally proposed in [19], and interest in this concept resurrected after the publication of [22]. A more recent introduction to estimate sequences can be found in [4]. We first define the estimate sequence.

Definition 2.1 A pair of sequences $\{\phi_k(x)\}_{k=0}^{\infty}$ and $\{\lambda_k\}_{k=0}^{\infty}$, where $\lambda_k \ge 0$, is called an estimate sequence of the function $f(x)$ if $\lambda_k \to 0$ and for any $x$ we have
$$\phi_k(x) \le (1-\lambda_k) f(x) + \lambda_k \phi_0(x). \qquad (2.3)$$

2.1 Accelerated Gradient Method for Smooth Optimization

13

The following lemma indicates how estimate sequence can be used for analyzing an optimization algorithm and how fast it would converge. Lemma 2.1 If φk∗ ≡ minx φk (x) ≥ f (xk ) and (2.3) holds, then we can have f (xk ) − f (x∗ ) ≤ λk (φ0 (x∗ ) − f (x∗ )). Proof From φk∗ ≥ f (xk ) and (2.3), we can have f (xk ) ≤ φk∗ ≤ min [(1 − λk )f (x) + λk φ0 (x)] ≤ (1 − λk )f (x∗ ) + λk φ0 (x∗ ), x



which leads to the conclusion.

For μ-strongly convex function f (if f is generally convex, we let μ = 0), we use the following way to construct the estimate sequence. Define two sequences: λk+1 = (1 − θk )λk ,

(2.4)   μ φk+1 (x) = (1 − θk )φk (x) + θk f (yk ) + ∇f (yk ), x − yk  + x − yk 2 , (2.5) 2 with λ0 = 1 and φ0 (x) = f (x0 ) + γ20 x − x0 2 . Then we can prove (2.3) by induction. Indeed, φ0 (x) ≤ (1 − λ0 )f (x) + λ0 φ0 (x) since λ0 = 1. Assume that (2.3) holds for some k. Then from the μ-strong convexity of f (x) and (2.4), we obtain φk+1 (x) ≤ (1 − θk )φk (x) + θk f (x) ≤ (1 − θk ) [(1 − λk )f (x) + λk φ0 (x)] + θk f (x) = (1 − λk+1 )f (x) + λk+1 φ0 (x).

(2.6)

Thus (2.3) also holds for k + 1. The following lemma describes the minimum value of φk (x) and its minimizer, which will be used to validate φk∗ ≥ f (xk ) that appeared in Lemma 2.1. Lemma 2.2 Process (2.5) forms φk (x) = φk∗ +

γk x − zk 2 , 2

(2.7)

where γk+1 = (1 − θk )γk + θk μ, zk+1 =

γ0 =

L, if μ = 0, μ, if μ > 0,

1 [(1 − θk )γk zk + θk μyk − θk ∇f (yk )] , γk+1

(2.8) z0 = x0 ,

(2.9)

14

2 Accelerated Algorithms for Unconstrained Convex Optimization

∗ φk+1 = (1 − θk )φk∗ + θk f (yk ) −

+

θk2 ∇f (yk )2 2γk+1

 θk (1 − θk )γk  μ yk − zk 2 + ∇f (yk ), zk − yk  , γk+1 2

φ0∗ = f (x0 ). (2.10)

Proof From the definition in (2.5), we know that φk (x) is in the quadratic form of (2.7). We only need to get the recursions of γk , zk , and φk∗ . Indeed,   γk φk+1 (x) = (1 − θk ) φk∗ + x − zk 2 2   μ +θk f (yk ) + ∇f (yk ), x − yk  + x − yk 2 . 2 Letting ∇φk+1 (zk+1 ) = 0, we obtain (2.8) and (2.9). From (2.5), we also have ∗ φk+1 +

γk+1 yk − zk+1 2 = φk+1 (yk ) 2   γk = (1 − θk ) φk∗ + yk − zk 2 + θk f (yk ). (2.11) 2

From (2.9), we have γk+1 zk+1 − yk 2 2 1 (1 − θk )2 γk2 zk − yk 2 − 2θk (1 − θk )γk ∇f (yk ), zk − yk  = 2γk+1

+θk2 ∇f (yk )2 . Substituting it into (2.11), we obtain (2.10).



Now, we are ready to prove φk∗ ≥ f (xk ) and thus get the final conclusions via Lemma 2.1. Theorem 2.1 Suppose that f (x) is convex and L-smooth. Let θ−1 = 1, θk+1 =  θk4 +4θk2 −θk2 , 2

and βk =

θk (1−θk−1 ) . θk−1

Then for Algorithm 2.1, we have

4 f (xK+1 ) − f (x ) ≤ (K + 2)2 ∗



L ∗ ∗ 2 f (x0 ) − f (x ) + x0 − x  . 2

2.1 Accelerated Gradient Method for Smooth Optimization

15

Suppose that f (x) is μ-strongly convex and L-smooth. Let θk = √ √ L− μ √ √ . L+ μ



μ L

and βk =

Then for Algorithm 2.1, we have

 K+1   μ μ f (x0 ) − f (x∗ ) + x0 − x∗ 2 . f (xK+1 ) − f (x ) ≤ 1 − L 2



Proof We prove φk∗ ≥ f (xk ) by induction. Assume that it holds for some k. From (2.10), we have ∗ φk+1 ≥ (1 − θk )f (xk ) + θk f (yk ) −

+ a

θk2 ∇f (yk )2 2γk+1

θk (1 − θk )γk ∇f (yk ), zk − yk  γk+1

θk2 ∇f (yk )2 2γk+1   θk γk +(1 − θk ) ∇f (yk ), (zk − yk ) + xk − yk , γk+1

≥ f (yk ) −

a

where we use the convexity of f (x) in ≥. From the L-smoothness of f (x), we obtain f (xk+1 ) ≤ f (yk ) − and

1 2 2L ∇f (yk )

∗ by (A.6). Then φk+1 ≥ f (xk+1 ) if

θk2 γk+1

θk γ k (zk − yk ) + xk − yk = 0. γk+1 Case 1: μ = 0. Then from (2.8), we have 

θk2 γk+1

=

1 L

⇒ θk2 =

=

1 L

(2.12) (1−θk )γk L

= (1 −

 2 2 , which leads to θ = , θk ≤ k+2 , and λk+1 = ki=0 (1 − θk )θk−1 k 2  θ2 4 θi ) = ki=0 2i = θk2 ≤ (k+2) 2 via Lemma 2.3. From (2.8) and (2.9), we have 4 +4θ 2 −θ 2 θk−1 k−1 k−1

θi−1

zk+1 = zk −

θk 1 1 ∇f (yk ) = zk − ∇f (yk ) = zk − (yk − xk+1 ). (2.13) γk+1 Lθk θk

From (2.12) and (2.8), we have yk = θk zk + (1 − θk )xk .

(2.14)

16

2 Accelerated Algorithms for Unconstrained Convex Optimization

From (2.13) and (2.14), we have xk+1 = θk zk+1 + (1 − θk )xk . Thus, yk =

θk θk−1

[xk − (1 − θk−1 )xk−1 ] + (1 − θk )xk = xk +

From Lemma 2.1, we can get the conclusion.  Case 2: μ > 0. Then θk = θ =

Equation (2.4) leads to λk+1 = (1 − θ ) zk+1 = (1 − θ )zk + θ yk −

μ L and k+1

θk (1 − θk−1 ) (xk − xk−1 ). θk−1

γk = μ satisfy

θk2 γk+1

=

1 L

and (2.8).

. From (2.9) and γk = μ, we have

θ 1 ∇f (yk ) = (1 − θ )zk + θ yk + (xk+1 − yk ). (2.15) μ θ

From (2.12), we have θ zk + xk − (θ + 1)yk = 0.

(2.16)

Thus from (2.15) and (2.16), one can obtain a

xk+1 = θ zk+1 − θ (1 − θ )zk − θ 2 yk + yk = θ zk+1 + yk − θ zk + θ 2 (zk − yk ) b

= θ zk+1 + xk − θ yk + θ (yk − xk ) = θ zk+1 + (1 − θ )xk , a

b

where = uses (2.15) and = uses (2.16). So 1 1 (θ zk + xk ) = [xk − (1 − θ )xk−1 + xk ] θ +1 θ +1 √ √ L− μ 1−θ (xk − xk−1 ) = xk + √ = xk + √ (xk − xk−1 ), 1+θ L+ μ a

yk =

a

where = uses (2.16).



2.2 Extension to Composite Optimization

17

The following lemma describes some useful properties for the sequence {θk }∞ k=0 , which will be frequently used in this book. Lemma 2.3 If sequence {θk }∞ k=0 satisfies θk ≤

2 k+2/θ0 ,

k

1 i=0 θi

=

1−θk θk2

1 θk2



=

1 2 , θ−1

and θk+1 =

=

1 2 θk−1

and θ0 ≤ 1, then

 θk4 +4θk2 −θk2 . 2



2 θk−1

, we can have

1 θk



1 2

2

1 k+1/θ0





1 2 , which leads to θk−1 1 1 1 1 1 K θk − 2 ≥ θk−1 . Summing over k = 1, 2, · · · , K, we have θK ≥ θ0 + 2 , which leads  2 2 1 . On the other hand, we know θ ≤ 1 for all k and thus − 1 ≤ to θK ≤ K+2/θ k θk 0 1 1 1 1 1 2 , which leads to θk − 1 ≤ θk−1 . Similarly, we have θK ≤ θ0 + K, which leads θk−1 1 . The second conclusion can be obtained by θ1k = 12 − 21 and the to θK ≥ K+1/θ 0 θk θk−1 

last conclusion can be obtained from 1−θ2 k+1 = 12 . θk+1 θk

Proof In fact, from

1

1−θk θk2

2.2 Extension to Composite Optimization Composite convex optimization consists of the optimization of a convex function with Lipschitz continuous gradients (Definition A.12) and a nonsmooth function, which can be written as min F (x) ≡ f (x) + h(x), x

(2.17)

where f (x) is smooth and we often assume that the proximal mapping of h(x) has a closed form solution or can be computed efficiently. Accelerated gradient descent was extended to the composite optimization in [5, 24] and a unified analysis of acceleration techniques was given in [29]. We follow [29] to describe Nesterov’s three methods for solving problem (2.17).

2.2.1 Nesterov’s First Scheme The first method we describe is an extension of Algorithm 2.1, which is described in Algorithm 2.2. It can be easily checked that the momentum parameter (Lθk −μ)(1−θk−1 ) is equivalent to the settings in Theorem 2.1. Please see Remark 2.1. (L−μ)θk−1

18

2 Accelerated Algorithms for Unconstrained Convex Optimization

Algorithm 2.2 Accelerated proximal gradient (APG) method 1 Initialize x0 = x−1 . for k = 0, 1, 2, 3, · · · do k −μ)(1−θk−1 ) yk = xk + (Lθ(L−μ)θ (xk − xk−1 ), k−1  2

 L  xk+1 = argminx h(x) + 2 x − yk + L1 ∇f (yk ) . end for

We can also use the estimate sequence technique to prove the convergence rate of Algorithm 2.2. However, we introduce the techniques in [29] to enrich the toolbox of this monograph. We first describe the following lemma, which can serve as a starting point for analyzing various first-order methods. Lemma 2.4 Suppose that h(x) is convex and f (x) is μ-strongly convex and Lsmooth. Then for Algorithm 2.2, we have F (xk+1 ) ≤ F (x) −

μ L x − yk 2 − xk+1 − yk 2 + L xk+1 − yk , x − yk  , ∀x. 2 2

Proof From the optimality condition of the second step, we obtain 0 ∈ ∂h(xk+1 ) + L(xk+1 − yk ) + ∇f (yk ). Then from the convexity of h(x), we have h(x) − h(xk+1 ) ≥ −L(xk+1 − yk ) − ∇f (yk ), x − xk+1  .

(2.18)

From the L-smoothness and the μ-strong convexity of f (x) and (2.18), we get L xk+1 − yk 2 + h(xk+1 ) 2 = f (yk ) + ∇f (yk ), x − yk  + ∇f (yk ), xk+1 − x

F (xk+1 ) ≤ f (yk ) + ∇f (yk ), xk+1 − yk  +

L + xk+1 − yk 2 + h(xk+1 ) 2 μ L ≤ f (x) − x − yk 2 + xk+1 − yk 2 + h(x) 2 2 +L xk+1 − yk , x − xk+1  = F (x) − The proof is complete.

L μ x − yk 2 − xk+1 − yk 2 + L xk+1 − yk , x − yk  . 2 2 

2.2 Extension to Composite Optimization

19

We define the Lyapunov function k+1 =

2 F (xk+1 ) − F (x∗ ) L  + zk+1 − x∗  2 2 θk

(2.19)

for the case of μ = 0 and    1 μ ∗ ∗ 2  z k+1 =  F (x ) − F (x ) + − x k+1 k+1  √ k+1 2 1 − μ/L

(2.20)

for μ > 0, where zk+1 ≡

1 1 − θk xk+1 − xk , θk θk

z0 = x0 .

(2.21)

From the definitions of wk+1 and yk , we can have the following easy-to-verify identities. Lemma 2.5 For Algorithm 2.1, we have x∗ +

(1 − θk )L L−μ xk − yk = x∗ − zk , Lθk − μ Lθk − μ   θk x∗ + (1 − θk )xk − xk+1 = θk x∗ − zk+1 .

(2.22)

We will show k+1 ≤ k for all k = 0, 1, · · · and establish the convergence rates in the following theorem. Theorem 2.2 Suppose  that f (x) and h(x) are convex and f (x) is L-smooth. Let θk4 +4θk2 −θk2 . 2

θ0 = 1 and θk+1 =

Then for Algorithm 2.2, we have

F (xK+1 ) − F (x∗ ) ≤

2L x0 − x∗ 2 . (K + 2)2

Suppose that h(x) is convex and f (x) is μ-strongly convex and L-smooth. Let θk =  μ L for all k. Then for Algorithm 2.2, we have  K+1   μ μ F (x0 ) − F (x∗ ) + x0 − x∗ 2 . F (xK+1 ) − F (x ) ≤ 1 − L 2 ∗



20

2 Accelerated Algorithms for Unconstrained Convex Optimization

Proof We apply Lemma 2.4, first with x = xk and then with x = x∗ , to obtain two inequalities L xk+1 − yk 2 + L xk+1 − yk , xk − yk  , 2   μ L F (xk+1 ) ≤ F (x∗ ) − x∗ − yk 2 − xk+1 − yk 2 + L xk+1 − yk , x∗ − yk . 2 2

F (xk+1 ) ≤ F (xk ) −

Multiplying the first inequality by (1 − θk ) and the second by θk and adding them together, we have F (xk+1 ) − F (x∗ ) L θk μ ∗ x − yk 2 ≤ (1 − θk )(F (xk ) − F (x∗ )) − xk+1 − yk 2 − 2 2   + L xk+1 − yk , (1 − θk )xk + θk x∗ − yk = (1 − θk )(F (xk ) − F (x∗ )) − a

L θk μ ∗ xk+1 − yk 2 − x − yk 2 2 2

L xk+1 − yk 2 + (1 − θk )xk + θk x∗ − yk 2 2  −(1 − θk )xk + θk x∗ − xk+1 2

+

θk μ ∗ x − yk 2 = (1 − θk )(F (xk ) − F (x∗ )) − 2   2   ∗ 2 Lθk2  1 1 − θ k ∗ x − yk +   , + xk    − x − zk+1 2 θk θk a

where = uses (A.1). By reorganizing the terms in x∗ − can have

1 k θk y

+

1−θk k θk x

carefully, we

 2  Lθk2  x∗ − 1 yk + 1 − θk xk    2 θk θk 

2  Lθk2  μ ∗ Lθk − μ ∗ L(1 − θk ) L−μ  x + xk − yk  = (x − yk ) +   2 Lθk Lθk Lθk − μ Lθk − μ  2  a μθk θk (Lθk − μ)  x∗ + L(1 − θk ) xk − L − μ yk  x∗ − yk 2 + ≤  2 2 Lθk − μ Lθk − μ   θk (Lθk − μ)  b μθk zk − x∗ 2 , x∗ − yk 2 + = (2.23) 2 2

2.2 Extension to Composite Optimization

where we let 0 ≤ Thus we can have

μ Lθk

21 a

b

< 1, use the convexity of  · 2 in ≤, and use (2.22) in =.

 Lθk2  zk+1 − x∗ 2 2  θk (Lθk − μ)  zk − x∗ 2 . ≤ (1 − θk )(F (xk ) − F (x∗ )) + 2

F (xk+1 ) − F (x∗ ) +

1−θk θk2 use 12 θ−1

Case 1: μ = 0. Dividing both sides of (2.24) by θk2 and using obtain k+1 ≤ k , which leads to the first conclusion, where we

Case 2: μ > 0. Letting θ (Lθ − μ) = Lθ 2 (1 − θ ), we have θ = both sides of (2.24) by conclusion.

(1 − θ )k+1 ,



=

(2.24) 1 2 , θk−1

we

= 0. μ L.

Dividing

we obtain k+1 ≤ k , which leads to the second 

k −μ)(1−θk−1 ) Remark 2.1 When μ = 0, (Lθ(L−μ)θ = k−1 √ √  L− μ (Lθk −μ)(1−θk−1 ) μ =√ √ . L , ∀k, (L−μ)θk−1

θk (1−θk−1 ) . θk−1

When μ = 0 and θk =

L+ μ

Remark 2.2 In Theorem 2.1, we start from θ−1 = 1, while in Theorem 2.2 we start from θ0 = 1.

2.2.2 Nesterov’s Second Scheme Besides Algorithm 2.2, another accelerated algorithm is also widely used in practice and we describe it in Algorithm 2.3. Algorithm 2.3 is equivalent to Algorithm 2.1 when h(x) = 0. In fact, in this case the two algorithms produce the same sequences of {xk }, {yk }, and {zk }. However, this is not true for nonsmooth problems. Algorithm 2.3 Accelerated proximal gradient (APG) method 2 Initialize z0 = x0 . for k = 0, 1, 2, 3, · · · do L−Lθk k −μ yk = Lθ L−μ zk + L−μ xk , zk+1 = argminz h(z) + ∇f (yk ), z +

θk L 2

  z −

1 θk yk

+

2  1−θk θk xk 

,

xk+1 = (1 − θk )xk + θk zk+1 . end for

We give the convergence rates of Algorithm 2.3 in Theorem 2.3. The proof is similar to that of Algorithm 2.2 and we only show the difference. We use the Lyapunov functions defined in (2.19) and (2.20), and we can easily verify that xk , yk , and zk satisfy the relations in Lemma 2.5.

22

2 Accelerated Algorithms for Unconstrained Convex Optimization

Theorem 2.3 Suppose  that f (x) and h(x) are convex and f (x) is L-smooth. Let θk4 +4θk2 −θk2 . 2

θ0 = 1 and θk+1 =

Then for Algorithm 2.3, we have

F (xK+1 ) − F (x∗ ) ≤

2L x0 − x∗ 2 . (K + 2)2

Suppose that h(x) is convex and f (x) is μ-strongly convex and L-smooth. Let θk =  μ L for all k. Then for Algorithm 2.3, we have  K+1   μ μ F (x0 ) − F (x∗ ) + x0 − x∗ 2 . F (xK+1 ) − F (x ) ≤ 1 − L 2



Proof From the optimality condition of the second step, we have

1 1 − θk 0 ∈ ∂h(zk+1 ) + ∇f (yk ) + θk L zk+1 − yk + xk . θk θk Then from the convexity of h(x), we have

 1 1 − θk h(x) − h(zk+1 ) ≥ −θk L zk+1 − yk + xk − ∇f (yk ), x − zk+1 . θk θk (2.25) 



From the smoothness and the strong convexity of f (x), we have F (xk+1 ) ≤ f (yk ) + ∇f (yk ), xk+1 − yk  +

L xk+1 − yk 2 + h(xk+1 ) 2

= f (yk ) + ∇f (yk ), (1 − θk )xk + θk zk+1 − yk   2 Lθk2  1 − θk 1   xk + zk+1 − yk  + h((1 − θk )xk + θk zk+1 ) + 2  θk θk  ≤ (1 − θk )(f (yk ) + ∇f (yk ), xk − yk  + h(xk )) + θk (f (yk ) + ∇f (yk ), zk+1 − yk  + h(zk+1 ))  2 Lθ 2  1 − θk 1   + k  x + z − y k k+1 k 2  θk θk       ≤ (1 − θk )F (xk ) + θk f (yk ) + ∇f (yk ), x∗ − yk + ∇f (yk ), zk+1 − x∗  2 Lθ 2  1 − θk 1   +h(zk+1 )) + k  x + z − y k k+1 k 2  θk θk 

2.2 Extension to Composite Optimization a

23



μ yk − x∗ 2 + h(x∗ ) 2 

 1 1 − θk ∗ + θk L zk+1 − yk + xk , x − zk+1 θk θk  2 Lθ 2  1 − θk 1   + k  x + z − y k k+1 k 2  θk θk 

≤ (1 − θk )F (xk ) + θk f (x∗ ) −

b μθk ≤ (1 − θk )F (xk ) + θk F (x∗ ) − yk − x∗ 2 2   2  θk2 L  1 1 − θ k ∗ 2  yk − + xk − x∗  θ  − zk+1 − x  2 θk k c

≤ (1 − θk )F (xk ) + θk F (x∗ ) + a

2  θk (Lθk − μ)  zk − x∗ 2 − θk L zk+1 − x∗ 2 , 2 2

b

c

where ≤ uses (2.25), ≤ uses (A.2), and ≤ uses (2.23). Following the same inductions in the proof of Theorem 2.2, we can have the conclusions. 

2.2.2.1

A Primal-Dual Perspective

Several different explanations have been proposed for Algorithm 2.3. Allen-Zhu and Orecchia [2] considered it as a linear coupling of gradient descent and mirror descent and [15] gave an explanation from the perspective of the primal-dual method. We introduce the idea in [15] here. We only consider the case of μ = 0 and in this case the iterations of Algorithm 2.3 become yk = θk zk + (1 − θk )xk ,

θk L 2 x − zk  , zk+1 = argmin h(x) + ∇f (yk ), x + 2 x

(2.26)

xk+1 = (1 − θk )xk + θk zk+1 .

(2.27)

Let the Fenchel conjugate (Definition A.21) of f (z) be f ∗ (u). Since f ∗∗ = f , we can rewrite problem (2.17) in the following min-max form:   min max h(x) + x, u − f ∗ (u) . x

u

(2.28)

Since f is L-smooth, f ∗ is L1 -strongly convex (point 4 of Proposition A.12). We can define the Bregman distance (Definition A.20) induced by f ∗ :    ˆ ∗ (v), u − v , Df ∗ (u, v) = f ∗ (u) − f ∗ (v) + ∇f

24

2 Accelerated Algorithms for Unconstrained Convex Optimization

ˆ ∗ (v) ∈ ∂f ∗ (v). We can use the primal-dual method [8, 9] to find the saddle where ∇f point (Definition A.29) of problem (2.28), which updates x and u in the primal space and the dual space, respectively: zˆ k = αk (zk − zk−1 ) + zk ,    uk+1 = argmax zˆ k , u − f ∗ (u) − τk Df ∗ (u, uk ) ,

(2.29)

u

  ηk zk+1 = argmin h(z) + z, uk+1  + z − zk 2 . 2 z

(2.30)

In the following Lemma, we establish the equivalence between algorithms (2.26)– (2.27) and (2.29)–(2.30). θk−1 (1−θk ) k Lemma 2.6 Let u0 = ∇f (z0 ), z−1 = z0 , τk = 1−θ , ηk = Lθk , θk , αk = θk and θ0 = 1. Then Algorithms (2.26)–(2.27) and (2.29)–(2.30) are equivalent.

Proof Denoting y−1 = z0 , by point 5 of Proposition A.12 we can see from u0 = ∗ ∗ ˆ ∗ ∇f (z0 ) that  y−1 ∈ ∂f (u0 ). Letting yk−1 = ∇f (uk ) ∈ ∂f (uk ) and defining 1 yk = 1+τk zˆ k + τk yk−1 , we have     ˆ ∗ (uk ), u + (1 + τk )f ∗ (u) uk+1 = argmin − zˆ k + τk ∇f u

  = argmax yk , u − f ∗ (u) . u

Hence yk ∈ ∂f ∗ (uk+1 ). So by mathematical induction, we have yk ∈ ∂f ∗ (uk+1 ), ∀k. Thus again by point 5 of Proposition A.12, we have uk+1 = ∇f (yk ), ∀k, and (2.29)–(2.30) are equivalent to zˆ k = αk (zk − zk−1 ) + zk ,  1  zˆ k + τk yk−1 , 1 + τk   ηk = argmin h(z) + z, ∇f (yk ) + z − zk 2 . 2 z

yk = zk+1 Letting τk =

1−θk θk ,

yk =

αk =

θk−1 (1−θk ) , θk

and ηk = Lθk , we have

1 [αk (zk − zk−1 ) + zk + τk yk−1 ] 1 + τk

= θk−1 (1 − θk )(zk − zk−1 ) + θk zk + (1 − θk )yk−1 = θk zk + (1 − θk ) (θk−1 zk + yk−1 − θk−1 zk−1 ) , which satisfies the relations in (2.26) and (2.27).



2.2 Extension to Composite Optimization

25

2.2.3 Nesterov’s Third Scheme The third method we describe is remarkably different from the previous two in that it uses a weighted sum of previous gradients. We only consider the generally convex case in this section and describe the method in Algorithm 2.4. We can see that the first step and the last step of Algorithm 2.4 are the same as (2.26) and (2.27). The difference lies in the way that zk+1 is generated. Accordingly, their proofs of convergence rates are also different. Note that the zk+1 in Algorithm 2.4 is different from the zk+1 in Algorithm (2.26)–(2.27). They are the same when h(x) = 0. Algorithm 2.4 Accelerated proximal gradient (APG) method 3 Initialize x0 = z0 , φ0 (x) = 0, and θ0 = 1. for k = 0, 1, 2, 3, · · · do yk = θk zk + (1 − θk )xk ,  φk+1 (x) = ki=0 f (yi )+∇f (yθii),x−yi +h(x) ,   zk+1 = argminz φk+1 (z) + L2 z − x0 2 , xk+1 = (1 − θk )xk + θk zk+1 . end for

We define the Lyapunov function k+1 =

F (xk+1 ) L − φk+1 (zk+1 ) − zk+1 − x0 2 . 2 2 θk

We can see that it is different from the one in (2.19) used for the previous two methods. Then by exploiting the non-increment of k , we can establish the convergence rate in the following theorem. Theorem 2.4 Suppose  that f (x) and h(x) are convex and f (x) is L-smooth. Let θ0 = 1 and θk+1 =

θk4 +4θk2 −θk2 . 2

Then for Algorithm 2.4, we have

F (xK+1 ) − F (x∗ ) ≤

2L x∗ − x0 2 . (K + 2)2

Proof From the optimality condition of the third step, we have 0 ∈ ∂φk (zk ) + L(zk − x0 ). Then from the convexity of φk (x), we have φk (z) − φk (zk ) ≥ −L zk − x0 , z − zk  a

=

L L L zk − x0 2 − z − x0 2 + zk − z2 , 2 2 2

26

2 Accelerated Algorithms for Unconstrained Convex Optimization a

where = uses (A.2). Letting z = zk+1 , we obtain L zk − zk+1 2 2



L L 2 2 ≤ φk (zk+1 ) + zk+1 − x0  − φk (zk ) + zk − x0  . (2.31) 2 2 Following the same induction in the proof of Theorem 2.3, we have F (xk+1 ) ≤ (1 − θk )F (xk ) + θk (f (yk ) + ∇f (yk ), zk+1 − yk  + h(zk+1 )) +

Lθk2 zk+1 − zk 2 2

a

= (1 − θk )F (xk ) + θk2 (φk+1 (zk+1 ) − φk (zk+1 )) + b

≤ (1 − θk )F (xk ) + θk2 φk+1 (zk+1 ) + −

Lθk2 zk − zk+1 2 2

Lθk2 zk+1 − x0 2 − θk2 φk (zk ) 2

Lθk2 zk − x0 2 , 2 b

a

where = uses the definition of φk+1 (x) in Step 2 of Algorithm 2.4 and ≤ uses (2.31). The above also holds for k = 0 due to φ0 (x) = 0. Dividing both sides by θk2 and using 1−θ2 k = 21 , where 12 = 0, we obtain k+1 ≤ k , i.e., θk

θk−1

θ−1

F (xK+1 ) L L ≤ φK+1 (zK+1 ) + zK+1 − x0 2 − φ0 (z0 ) − z0 − x0 2 2 2 2 θK a

≤ φK+1 (x∗ ) + b



K  F (x∗ ) i=0

c

=

θi

+

 L x∗ − x0 2 2  L x∗ − x0 2 2

 F (x∗ ) L  x∗ − x0 2 , + 2 θK2 a

b

where we use Step 3 of Algorithm 2.4 in ≤, the convexity of f (x) in ≤, and c Lemma 2.3 and 12 = 1−θ2 0 = 0 in =. 

θ−1

θ0

2.3 Inexact Proximal and Gradient Computing

27

Remark 2.3 When h(x) = 0, Algorithms 2.3 and 2.4 are equivalent. In this case, from Step 2 of Algorithm 2.3, we have ∇f (yk ) + θk L(zk+1 − zk ) = 0.

(2.32)

From Steps 2 and 3 of Algorithm 2.4, we have k  1 ∇f (yi ) + L(zk+1 − x0 ) = 0. θi

(2.33)

i=0

Dividing both sides of (2.32) by θk and summing over k = 0, 1, · · · , we have (2.33).

2.3 Inexact Proximal and Gradient Computing In Sect. 2.2, we prove the convergence rate under the assumption that the proximal mapping of h is easily computable, e.g., having a closed form solution. This is the case for several notable choices of h, e.g., the 1 -regularization [3]. However, in many scenarios the proximal mapping may not have an analytic solution, or it may be very expensive to compute this solution exactly. This includes important problems such as total-variation regularization [12], graph-guided-fused LASSO [10], and overlapping group 1 -regularization with general groups [14]. Moreover, in some scenarios the gradient may be corrupted by noise. Motivated by these problems, several works, e.g., [11, 27], study the case that both the gradient and proximal mapping are computed inexactly. We only analyze a  (yk ) variant of Algorithm 2.2 and describe it in Algorithm 2.5. In Algorithm 2.5, ∇f means the gradient with noise, i.e.,  (yk ) = ∇f (yk ) + ek . ∇f We consider the inexact proximal mapping with error k , which is described as

L L 2 2 (2.34) h(xk+1 ) + xk+1 − wk  ≤ min h(x) + x − wk  + k . x 2 2 Algorithm 2.5 Inexact accelerated proximal gradient method Initialize x0 = x−1 . for k = 0, 1, 2, 3, · · · do −μ2 (1−θk ))(1−θk−1 ) yk = xk + (Lθk −μ1 (L−μ (xk − xk−1 ), 1 )θk−1 1 wk = yk − L ∇f (yk ),   xk+1 ≈ argminx h(x) + L2 x − wk 2 . end for

28

2 Accelerated Algorithms for Unconstrained Convex Optimization

We first give a crucial property when the proximal mapping is computed inexactly. When k = 0, from the optimality condition of the proximal mapping and the strong convexity of h, we can immediately have h(x) − h(xk+1 ) ≥ L xk+1 − wk , xk+1 − x +

μ x − xk+1 2 . 2

(2.35)

However, when k = 0 we need to modify (2.35) accordingly and it is described in the following lemma. Specially, in Lemma 2.7 h can be generally convex, i.e., μ = 0. Lemma 2.7 Assume that h(x) is μ-strongly convex. Let xk+1 be an inexact proximalmapping of h(x) such that (2.34) holds. Then there exists σ k satisfying σ k  ≤

2(L+μ)k L2

such that

h(x) − h(xk+1 ) ≥ −k + L xk+1 − wk + σ k , xk+1 − x +

μ x − xk+1 2 . (2.36) 2

Proof Let

L x∗k+1 = argmin h(x) + x − wk 2 . 2 x From the strong convexity of h(x) and the definition of xk+1 , we have 0 ∈ ∂h(x∗k+1 ) + L(x∗k+1 − wk ),  μ  h(x) − h(x∗k+1 ) ≥ −L x∗k+1 − wk , x − x∗k+1 + x − x∗k+1 2 , 2 L L h(x∗k+1 ) + x∗k+1 − wk 2 + k ≥ h(xk+1 ) + xk+1 − wk 2 . 2 2 So we can have h(x) − h(xk+1 )  μ  L ≥ −k − L x∗k+1 − wk , x − x∗k+1 + x − x∗k+1 2 + xk+1 − wk 2 2 2 L ∗ − xk+1 − wk 2 2  L ∗ a xk+1 − wk 2 + x∗k+1 − x2 − wk − x2 = −k + 2 L L μ + xk+1 − wk 2 − x∗k+1 − wk 2 + x − x∗k+1 2 2 2 2

2.3 Inexact Proximal and Gradient Computing

29

 μ L xk+1 − wk 2 + x∗k+1 − x2 − wk − x2 + x − x∗k+1 2 2 2   L xk+1 − wk 2 + xk+1 − x2 − wk − x2 = −k + 2 L L μ + x∗k+1 − x2 − xk+1 − x2 + x − x∗k+1 2 2 2 2 L+μ ∗ b = −k + L xk+1 − wk , xk+1 − x + xk+1 − x2 2 L+μ μ xk+1 − x2 + x − xk+1 2 − 2 2   = −k + L xk+1 − wk , xk+1 − x − (L + μ) xk+1 − x∗k+1 , xk+1 − x = −k +

μ L+μ xk+1 − x∗k+1 2 + x − xk+1 2 2 2 L+μ xk+1 − x∗k+1 2 = −k + L xk+1 − wk + σ k , xk+1 − x + 2 μ + x − xk+1 2 , 2 +

a

b

L+μ ∗ L (xk+1 2 2(L+μ) σ k  .

where = uses (A.2), = uses (A.1), and we define σ k = x = xk+1 , we have k ≥

L+μ ∗ 2 xk+1

− xk+1

2

=

− xk+1 ). Letting

L2



The following lemma gives a useful tool to analyze the convergence of algorithms with inexact computing. Lemma 2.8 Assume that sequence {Sk } is increasing, {uk } and {αi } are nonnegative, and u20 ≤ S0 . If u2k ≤ Sk +

k 

(2.37)

αi ui ,

i=1

then Sk +

k 



k   αi ui ≤ Sk + αi

i=1

2 (2.38)

.

i=1

Proof Let bk2 be the right-hand side of (2.37). We have for all k ≥ 1, uk ≤ bk , and bk2

= Sk +

k  i=1

αi ui ≤ Sk +

k  i=1

αi bi ≤ Sk +

 k  i=1

 αi bk .

30

2 Accelerated Algorithms for Unconstrained Convex Optimization

So

bk ≤

Using the inequality

bk ≤

1 2

1 2

k  i=1

  2  k  1 αi +  αi + Sk . 2 i=1

√ √ √ x + y ≤ x + y, we have k  i=1

  2  k k     1 αi +  αi + Sk = Sk + αi . 2 i=1

i=1

Hence Sk +

k 



αi ui =

bk2

k   ≤ Sk + αi

i=1

2 ,

i=1



finishing the proof. Chaining (2.37) and (2.38) we immediately conclude the following. Corollary 2.1 With the assumptions in Lemma 2.8, we have uk ≤

k 

αi +



Sk ,

i=1

and thus  u2k

≤2

k 

2 αi

+ 2Sk .

(2.39)

i=1

Similar to Lemma 2.5, we can give the following easy-to-verify identities. The difference from Lemma 2.5 is that Lemma 2.5 only considers the case that only f is strongly convex, while in this section we consider the case that both f and h are strongly convex with moduli μ1 and μ2 , respectively. Lemma 2.9 Also define zk as in (2.21). For Algorithm 2.5, we have x∗ +

L − Lθk + μ2 (1 − θk ) L − μ1 xk − yk = x∗ − zk , (2.40) Lθk − μ1 − μ2 (1 − θk ) Lθk − μ1 − μ2 (1 − θk )   θk x∗ + (1 − θk )xk − xk+1 = θk x∗ − zk+1 .

Following the proof in Sect. 2.2.1, we first give the following lemma, which is the counterpart of Lemma 2.4.

2.3 Inexact Proximal and Gradient Computing

31

Lemma 2.10 Suppose that f (x) is μ1 -strongly convex and L-smooth and h(x) is μ2 -strongly convex. Then for Algorithm 2.5, we have F (xk+1 ) − F (x) ≤ L xk+1 − yk , x − yk  −

μ1 μ2 x − yk 2 − x − xk+1 2 2 2

L − xk+1 − yk 2 + Lσ k + ek , x − xk+1  + k . 2

(2.41)

Proof From the L-smoothness and the μ-strong convexity of f (x), we have f (xk+1 ) ≤ f (yk ) + ∇f (yk ), xk+1 − yk  +

L xk+1 − yk 2 2

L xk+1 − yk 2 2   μ1 L  (yk ), xk+1 − x x − yk 2 + xk+1 − yk 2 + ∇f ≤ f (x) − 2 2 + ek , x − xk+1  .

= f (yk ) + ∇f (yk ), x − yk  + ∇f (yk ), xk+1 − x +

Adding (2.36), we can see that there exists σ k satisfying σ k  ≤ that



2(L+μ)k L2

such

F (xk+1 ) − F (x) ≤ L xk+1 − yk , x − xk+1  −

μ1 μ2 x − yk 2 − x − xk+1 2 2 2

L + xk+1 − yk 2 + Lσ k + ek , x − xk+1  + k 2 μ1 μ2 L x − yk 2 − x − xk+1 2 − xk+1 − yk 2 ≤ L xk+1 − yk , x − yk  − 2 2 2 + Lσ k + ek , x − xk+1  + k . The proof is complete.



In the following lemma, we give a progress in one iteration of Algorithm 2.5. Lemma 2.11 Suppose that f (x) is μ1 -strongly convex and L-smooth and h(x) is μ1 k) μ2 -strongly convex. Let θk ∈ (0, 1] and Lθ + μ2 (1−θ ≤ 1. Then for Algorithm 2.5, Lθk k we have F (xk+1 ) − F (x∗ ) − (1 − θk )(F (xk ) − F (x∗ ))   2    Lθk2 μ1 θk μ2 θk (1 − θk )  zk − x∗ 2 − Lθk x∗ − zk+1 2 ≤ − − 2 2 2 2

32

2 Accelerated Algorithms for Unconstrained Convex Optimization

μ2 θk (1 − θk ) ∗ θk μ2 ∗ x − xk 2 − x − xk+1 2 2 2   +θk 2(L + μ2 )k + ek  x∗ − zk+1  + k . +

(2.42)

Proof Following the same proof of Theorem 2.2, i.e., applying (2.41) first with x = xk and then with x = x∗ to obtain two inequalities, multiplying the first inequality by (1 − θk ), multiplying the second by θk , and adding them together, we can have F (xk+1 ) − F (x∗ ) − (1 − θk )(F (xk ) − F (x∗ ))   θk μ1 ∗ ≤ L xk+1 − yk , (1 − θk )xk + θk x∗ − yk − x − yk 2 2 θk μ2 ∗ L x − xk+1 2 − xk+1 − yk 2 − 2 2   + Lσ k + ek , (1 − θk )xk + θk x∗ − xk+1 + k   2 2   ∗  ∗ 2 1 1 − θk  θk μ1 ∗ a Lθk     x − yk + x − yk 2 = xk  − x − zk+1 −  2 θk θk 2 −

  θk μ2 ∗ x − xk+1 2 + θk Lσ k + ek , x∗ − zk+1 + k , 2

a

k where = uses (2.21). By reorganizing the terms in x∗ − θ1k yk + 1−θ θk xk carefully, we can have

 2  Lθk2  x∗ − 1 yk + 1 − θk xk    2 θk θk 

Lθk2   μ1 (x∗ − yk ) + μ2 (1 − θk ) (x∗ − xk ) + 1 − μ1 − μ2 (1 − θk ) = 2  Lθk Lθk Lθk Lθk  2 μ2 (1−θk ) L−μ1 1   θk − 1 + Lθk Lθk ∗ x + x − y  k k μ1 μ2 (1−θk ) μ1 μ2 (1−θk )  1 − Lθ − 1 − − Lθk Lθk Lθk k ≤

μ1 θk ∗ μ2 θk (1 − θk ) ∗ x − yk 2 + x − xk 2 2 2   Lθk2 μ1 θk μ2 θk (1 − θk )  x∗ + L − Lθk + μ2 (1 − θk ) xk − − +  2 2 2 Lθk − μ1 − μ2 (1 − θk ) 2  L − μ1 yk  − Lθk − μ1 − μ2 (1 − θk ) 

2.3 Inexact Proximal and Gradient Computing a

=

where we let

μ1 Lθk

33

μ1 θk ∗ μ2 θk (1 − θk ) ∗ x − yk 2 + x − xk 2 2 2    Lθk2 μ1 θk μ2 θk (1 − θk )  x∗ − zk 2 , − − + 2 2 2

+

inequality and using

a μ2 (1−θk ) Lθk ≤ 1 and = uses (2.40). Plugging it into 2 )k , we can have the conclusion. σ k  ≤ 2(L+μ L2

the above 

Now, we are ready to prove the convergence of Algorithm 2.5 in the following theorem. It describes that when we control the error to be very small, which is dependent on the iteration k, the accelerated convergence can still be maintained. Different from Theorem 2.2, in Theorem 2.5 we consider three scenarios: both f and h are generally convex, only one of f and h is strongly convex, and both f and h are strongly convex. Theorem 2.5 Suppose that f (x) and h(x) are convex and f (x) is L-smooth. Let   θk4 +4θk2 −θk2 1 L θ0 = 1, θk+1 = , e  ≤ k 2 2 , and (2.34) is satisfied with (k+1)2+δ k ≤ (k+1)1 4+2δ , where δ can be any small positive constant. Then for Algorithm 2.5, we have

18 4 2 ∗ 2 Lx . F (xK+1 ) − F (x∗ ) ≤ + − x  + 0 1 + 2δ (K + 2)2 δ2 Suppose that f (x) is μ1 -strongly convex and L-smooth and h(x) is μ2 -strongly convex (we allow μ1 = 0 or μ2 = 0 but require μ1 + μ2 > 0). Let θk = θ ≡ k+1  1 for all k, ek  ≤ [1 − (1 − δ)θ ] 2 , and (2.34) is

2 μ2 2(μ1 +μ2 ) +

μ2 2(μ1 +μ2 )



L 1 +μ2

satisfied with k ≤ [1 − (1 − δ)θ ]k+1 . Then for Algorithm 2.5, we have F (xK+1 ) − F (x∗ ) ≤ C[1 − (1 − δ)θ ]K+1 , where C = 2(F (x0 ) − F (x∗ )) + (Lθ 2 + θ μ2 )x0 − x∗ 2 +   2 L+μ2 2 2 . L + L



2 δθ

+

8



δ2 θ 2

Proof Case 1: μ1 = μ2 = 0. We use the Lyapunov function defined in (2.19). Dividing both sides of (2.42) by θk2 and using 1−θ2 k = 21 , we have θk

k+1

θk−1

√ √ 2 k + 2/Lek   k ≤ k + k+1 + 2 . θk θk

34

2 Accelerated Algorithms for Unconstrained Convex Optimization

Summing over k = 0, 1, 2, · · · , K − 1, we have K ≤0 +

K  k−1

θ2 k=1 k−1

+

√ √ K  2 k−1 + 2/Lek−1   θk−1

k=1

k .

From (2.39) of Corollary 2.1, we have  K ≤ 2 0 +

≤ 20 + 2

K  k−1

θ2 k=1 k−1 K 



 +2

2 √ √ K  2 k−1 + 2/Lek−1  θk−1

k=1

 k k−1 + 2 2

k=1

K  

  2  √ 2k k−1 + 2/Lkek−1 

k=1

≤ Lx0 − x∗ 2 +

2 18 + 2, 1 + 2δ δ

 1 1 L where we use θk ≥ k+1 from Lemma 2.3, k ≤ (k+1)1 4+2δ , and ek  ≤ (k+1) 2+δ 2. The conclusion can be obtained by the definition of K . Case 2: μ1 + μ2 > 0. Let θk = θ, ∀k, such that θ satisfies Lθ 2 − μ1 θ − μ2 θ (1 − θ ) = (1 − θ )Lθ 2 , which leads to μ1 μ2 (1 − θ ) + =θ Lθ Lθ and θ= μ2 2(μ1 +μ2 )

+



1 μ2 2(μ1 +μ2 )



2

≤ +

L μ1 +μ2

μ1 + μ2 . L

We can easily check that θ < 1 and the conditions in Lemma 2.11 hold. Define the Lyapunov function k+1 =

 Lθ 2  ∗ zk+1 − x∗ 2 F (x ) − F (x ) + k+1 2 (1 − θ )k+1

μ2 θ xk+1 − x∗ 2 . + 2 1

Dividing both sides of (2.42) by (1 − θ )k+1 , we have

k+1

  2 2 )k + 2 (L+μ k L L ek   ≤ k + k+1 + . k+1 (1 − θ )k+1 (1 − θ ) 2

2.3 Inexact Proximal and Gradient Computing

35

Summing over k = 0, 1, 2, · · · , K − 1, we have   2 K K 2 (L+μ2 )k−1 +   k−1 L L ek−1   K ≤ 0 + + k . k (1 − θ )k (1 − θ ) 2 k=1 k=1 From (2.39) of Corollary 2.1, we have ⎞2 ⎛   2 K K 2 (L+μ2 )k−1 + e    k−1 k−1 L L ⎟ ⎜ K ≤ 20 + 2 + 2⎝ ⎠ k (1 − θ )k (1 − θ ) 2 k=1 k=1 ∗

∗ 2

≤ 2(F (x0 ) − F (x )) + (Lθ + θ μ2 )x0 − x  + 2 2

' K &  1 − (1 − δ)θ k 1−θ

k=1

( 

' K & L + μ2  1 − (1 − δ)θ 2 +2 2 + L 1−θ k



k=1

' k )2 K & 2  1 − (1 − δ)θ 2 L 1−θ k=1

a

≤ 2(F (x0 ) − F (x∗ )) + (Lθ 2 + θ μ2 )x0 − x∗ 2  2 &

  ' 1 − (1 − δ)θ K 8 L + μ2 2 2 + 2 2 + + 2 δθ L L 1−θ δ θ ' & 1 − (1 − δ)θ K ≤C , 1−θ a

where in ≤ we use k ≤ [1 − (1 − δ)θ ]k+1 , ek  ≤ [1 − (1 − δ)θ ] ' K &  1 − (1 − δ)θ k k=1

1−θ

1 − (1 − δ)θ = 1−θ

K

1−(1−δ)θ 1−θ 1−(1−δ)θ 1−θ

k+1 2

,

−1

−1 & ' 1 − (1 − δ)θ 1 − (1 − δ)θ K ≤ δθ 1−θ & 'K 1 1 − (1 − δ)θ ≤ , δθ 1−θ

and ' K &  1 − (1 − δ)θ k/2 k=1

1−θ

qK − 1 = q −1 ≤

q qK q −1

 where q =



1 − (1 − δ)θ 1−θ



36

2 Accelerated Algorithms for Unconstrained Convex Optimization



1 − (1 − δ)θ qK √ 1 − (1 − δ)θ − 1 − θ √  √ √ 1 − (1 − δ)θ 1 − (1 − δ)θ + 1 − θ K = q δθ 2 K ≤ q . δθ =√



The conclusion can be obtained by the definition of K .

2.3.1 Inexact Accelerated Gradient Descent In this section, we specify h = 0. Namely, we consider the inexact variant of Algorithm 2.1. In this case, μ2 = 0 and Theorem 2.5 reduces to the following theorem. Algorithm 2.6 Inexact accelerated gradient descent Initialize x0 = x−1 . for k = 0, 1, 2, 3, · · · do k −μ1 )(1−θk−1 ) yk = xk + (Lθ(L−μ (xk − xk−1 ), 1 )θk−1 1 xk+1 = yk − L ∇f (yk ). end for

Theorem 2.6 Suppose that f (x) is convex and L-smooth. Let θ0 = 1, θk+1 =   4 2 θk +4θk −θk2 1 L , and ek  ≤ (k+1) 2+δ 2 2 , where δ can be any small positive constant. Then for Algorithm 2.6, we have F (xK+1 ) − F (x∗ ) ≤

4 (K + 2)2

Lx0 − x∗ 2 +

18 2 + 2 . 1 + 2δ δ

Suppose that f (x) is μ1 -strongly convex and L-smooth. Let θk = θ ≡ k, ek  ≤ [1 − (1 − δ)θ ]

k+1 2



μ1 L

. Then for Algorithm 2.6, we have

F (xK+1 ) − F (x∗ ) ≤ C[1 − (1 − δ)θ ]K+1 , where C = 2(F (x0 ) − F (x∗ )) + μx0 − x∗ 2 +



2 δθ

+

8 δ2 θ 2

 2  2 + L2 .

for all

2.3 Inexact Proximal and Gradient Computing

37

2.3.2 Inexact Accelerated Proximal Point Method Now we consider the case that f = 0. In this case, Algorithm 2.5 reduces to the well-known accelerated proximal point algorithm, which is described in Algorithm 2.7. Algorithm 2.7 Inexact accelerated proximal point method Initialize x0 = x−1 . for k = 0, 1, 2, 3, · · · do k )](1−θk−1 ) yk = xk + [τ θk −μ2 (1−θ (xk − xk−1 ),  τ θk−1 τ  xk+1 ≈ argminx h(x) + 2 x − yk 2 . end for

Accordingly, μ1 = 0 in this scenario and Theorem 2.5 reduces to the following Theorem. 

Theorem 2.7 Suppose that h(x) is convex. Let θ0 = 1, θk+1 = h(xk+1 ) +

θk4 +4θk2 −θk2 , 2

and

  τ τ xk+1 − yk 2 ≤ min h(x) + x − yk 2 + k x 2 2

(2.43)

is satisfied with k ≤ (k+1)1 4+2δ , where δ can be any small positive constant. Then for Algorithm 2.7, we have F (xK+1 ) − F (x∗ ) ≤

4 (K + 2)2

τ x0 − x∗ 2 +

18 2 + 2 . 1 + 2δ δ

Suppose that h(x) is μ2 -strongly convex. Let θk = θ ≡

1 0.5+ 0.25+ μτ

for all k

2

and (2.43) is satisfied with k ≤ [1 − (1 − δ)θ ]k+1 . Then for Algorithm 2.7, we have F (xK+1 ) − F (x∗ ) ≤ C[1 − (1 − δ)θ ]K+1 , where C = 2(F (x0 ) − F (x∗ )) + (τ θ 2 + θ μ2 )x0 − x∗ 2 +   2 2 2 2 τ +μ + . τ τ



2 δθ

+

8 δ2 θ 2



The accelerated proximal point method can be seen as a general algorithm framework and has been widely used in stochastic optimization, which leads to the popular framework of Catalyst [17] (see Algorithm 5.4) for the acceleration of firstorder methods. The details will be described in Sect. 5.1.4.

38

2 Accelerated Algorithms for Unconstrained Convex Optimization

2.4 Restart In this section, we consider the restart technique, which was firstly proposed in [26]. Namely, we run Algorithm 2.2 for a few iterations and then restart it with warm start. We describe the method in Algorithm 2.8. Besides its practical use, a beauty of the restart technique is that we can prove a faster rate even when the objective function is generally convex, e.g., see [18]. Specifically, we introduce the Hölderian error bound condition. Consider problem (2.17). Definition 2.2 The Hölderian error bound condition is x − x ≤ ν (F (x) − F ∗ )ϑ , where 0 < ν < ∞, ϑ ∈ (0, 1], and x is the projection of x onto the optimal solution set X∗ . If F (x) has a unique minimizer, then x = x∗ , ∀x.  When ϑ = 0.5, ν = μ2 , and F (x) has a unique minimizer, the Hölderian error bound condition reduces to the strong convexity (see (A.10)). Algorithm 2.8 Accelerated proximal gradient (APG) method with restart Initialize xK−1 = x0 . for k = 1, 2, 3, · · · , T do Minimize F (x) using Algorithm 2.2 for Kt iterations with xKt−1 being the initializer and output xKt . end for

From the results in Theorem 2.1, we can give the total runtime of Algorithm 2.8 in the following theorem. Theorem 2.8 Suppose that F (x) is convex and L-smooth and satisfies the Hölde√  √C 2ϑ−1 rian error bound condition. Let Kt ≥ 2ν 2L 2t−1 , where C = F (x0 ) − F ∗ . ∗ To achieve an xKT such that F (xKT ) − F (x ) ≤ , Algorithm 2.8 needs  √  L 1. O ν0.5−ϑ runtime when ϑ < 0.5,  √  2. O ν L log 1 runtime when ϑ = 0.5,

√ √ 2ϑ−1 3. O ν L runtime when ϑ > 0.5. C

2.5 Smoothing for Nonsmooth Optimization

39

Proof We prove F (xKt ) − F ∗ ≤ 4Ct by induction. C . From Theorem 2.2 and the Hölderian error Assume that F (xKt−1 ) − F ∗ ≤ 4t−1 bound condition, we have 2L xKt−1 − xKt−1 2 (Kt + 1)2  2ϑ 2L ≤ ν 2 f (xKt−1 ) − f ∗ 2 (Kt + 1)

C 2ϑ 2L C 2 ≤ ν ≤ t, 4 (Kt + 1)2 4t−1

F (xKt ) − F ∗ ≤

 where we use Kt + 1 ≥

8Lν 2



2ϑ−1

C 4t−1

. So we only need 4T =

C .

The total

runtime is T  t=1

−1  t √ √ 2ϑ−1 T 21−2ϑ Kt ≥ 2ν 2L C t=0

⎧ √ √ 2ϑ−1 (2T )1−2ϑ −1 ⎪ ⎪ 2ν C , if ϑ < 0.5, ⎪ ⎨ √2L 21−2ϑ −1 = 2ν 2L log4 C , if ϑ = 0.5, ⎪ √ √ 2ϑ−1 1−(2T )1−2ϑ ⎪ ⎪ ⎩ 2ν 2L C , if ϑ > 0.5. 1−21−2ϑ   ⎧  1−2ϑ √  √ 2ϑ−1 ⎪ ⎪ C ⎪ 2ν 2L 2 C − 1 , if ϑ < 0.5, ⎪  ⎪ ⎨ √ ≥ 2ν 2L log C , if ϑ = 0.5, 4  ⎪ & ⎪  2ϑ−1 ' ⎪ √ √ 2ϑ−1 ⎪  ⎪ ⎩ 2ν 2L 1− , if ϑ > 0.5. C C

The proof is complete.



2.5 Smoothing for Nonsmooth Optimization When f is nonsmooth in problem  (2.1) and only the subgradient is available,  convergence rate when the subgradient-type we can only obtain the O √1 K methods are used. In 2003, Nesterov proposed a first-order smoothing method in his seminal paper [22], where the gradient descent is used to solve  accelerated  1 a smoothed problem and the O K total convergence rate can be obtained by carefully designing the smoothing parameter.

40

2 Accelerated Algorithms for Unconstrained Convex Optimization

We use the tool of Fenchel conjugate to smooth a nonsmooth function. Let f ∗ (y) = max (x, y − f (x))

(2.44)

x

be the Fenchel conjugate of f (x). Then we have y ∈ ∂f (x∗ ),

where x∗ = argmax(x, y − f (x)).

(2.45)

x

Define a regularized function of f ∗ (y) as δ fδ∗ (y) = f ∗ (y) + y2 . 2

(2.46)

Let   fδ (x) = max x, y − fδ∗ (y) . y

The following proposition describes that fδ (x) is a smoothed approximation of f (x). Proposition 2.1 Suppose that f (x) is convex, then 1. fδ (x) is 1δ -smooth. 2. fδ1 (x) ≤ fδ2 (x) if δ1 ≥ δ2 . 2 3. Suppose that f (x) has bounded subgradient: ∂f (x) ≤ M. Then f (x) − δM 2 ≤ fδ (x) ≤ f (x). Proof The first conclusion is a direct consequence of point 4 of Proposition A.12. For the second conclusion, we know fδ∗1 (y) = f ∗ (y) +

δ1 δ2 y2 ≥ f ∗ (y) + y2 = fδ∗2 (y) 2 2

and     fδ1 (x) = max x, y − fδ∗1 (y) ≤ max x, y − fδ∗2 (y) = fδ2 (x). y

y

From (2.45) and ∂f (x) ≤ M, we have y ≤ M, i.e., fδ∗ (y) has a bounded

domain. From (2.46), we have f ∗ (y) ≤ fδ∗ (y) ≤ f ∗ (y) +

&

δM 2 fδ (x) ≥ max x, y − f (y) + y 2 ∗

δM 2 2 .

Thus we have

' = f (x) −

δM 2 2

and   fδ (x) ≤ max x, y − f ∗ (y) = f (x). y



2.5 Smoothing for Nonsmooth Optimization

41

We can use Algorithm 2.2 to minimize fδ (x) and have the following convergence rate theorem. Let xˆ ∗δ = argminx fδ (x). Theorem 2.9 Suppose that f (x) is convex and has bounded subgradient: 2 ∂f (x) ≤ M. Run Algorithm 2.2 with L = M to minimize fδ (x), where δ = M 2 . 1. If f (x) is generally convex, then we only need K = that f (xK+1 ) − f (x∗ ) ≤ . 2. If f (x) is μ-strongly convex, then we only need

2Mx0 −ˆx∗δ  

iterations such

2(f (x0 ) − f (x∗ )) + δM 2 + μx0 − xˆ ∗δ 2 M K = √ log μ  iterations such that f (xK+1 ) − f (x∗ ) ≤ . Proof From Proposition 2.1, we know f (xK+1 ) − f (x∗ ) ≤ fδ (xK+1 ) +

δM 2 δM 2 − fδ (x∗ ) ≤ fδ (xK+1 ) − fδ (ˆx∗δ ) + . 2 2

When f (x) is generally convex, fδ (x) is also generally convex. From Theorem 2.2 with h(x) = 0, we have fδ (xK+1 ) − fδ (ˆx∗δ ) ≤

2 x0 − xˆ ∗δ 2 . δ(K + 2)2

Thus we have f (xK+1 ) − f (x∗ ) ≤

2 δM 2 . x0 − xˆ ∗δ 2 + 2 2 δ(K + 2) 2Mx −ˆx∗ 

0 δ Letting δ = M 2 , we only need K = iterations such that f (xK+1 ) −  f (x∗ ) ≤ . When f (x) is μ-strongly convex, fδ (x) is also μ-strongly convex. From Theorem 2.2, we have

   K+1  μ fδ (x0 ) − fδ (ˆx∗δ ) + x0 − xˆ ∗δ 2 fδ (xK+1 ) − fδ (ˆx∗δ ) ≤ 1 − μδ 2

   K+1 μ δM 2 ∗ ∗ 2 f (x0 ) − f (ˆxδ ) + + x0 − xˆ δ  ≤ 1 − μδ 2 2    ≤ exp −(K + 1) μδ

μ δM 2 + x0 − xˆ ∗δ 2 . × f (x0 ) − f (x∗ ) + 2 2

42

2 Accelerated Algorithms for Unconstrained Convex Optimization

Letting δ = M 2 , we only need K = iterations such that f (xK+1 ) − f (x∗ ) ≤ .

√M μ

log

2(f (x0 )−f (x∗ ))+δM 2 +μx0 −ˆx∗δ 2 



From Theorem 2.9 we know that when f is strongly  the accelerated  convex, log√1/ time to achieve an gradient descent with the smoothing technique needs O  ∗ x such that f (x) − f (x ) ≤ . [1] proposed an efficient reduction which solves a sequence of smoothed problems with decreasing parameter δ such   smoothing that the convergence rate can be improved to O √1 . We describe the method in Algorithm 2.9 and the convergence rate in Theorem 2.10. Algorithm 2.9 Accelerated proximal gradient (APG) method with smoothing Initialize x0 and δ0 . for k = 1, 2, 3, · · · , K do Minimize fδk−1 (x) using Algorithm 2.2 for and output xk , δk = δk−1 /2. end for

√log 8 μδk−1

iterations with xk−1 being the initializer

Theorem 2.10 Suppose that f (x) is μ-strongly convex and has bounded   subgradi(x∗ ) √M ent: ∂f (x) ≤ M. Let δ0 = f (x0 )−f , then Algorithm 2.9 needs O μ time M2 such that f (xK ) − f (x∗ ) ≤ . Proof Let xˆ ∗δk = argminx fδk (x) and t =

√log 8 . μδk−1

From Theorem 2.2, we have

fδk−1 (xt ) − fδk−1 (ˆx∗δk−1 ) t     μ fδk−1 (xk−1 ) − fδk−1 (ˆx∗δk−1 ) + xk−1 − xˆ ∗δk−1 2 ≤ 1 − μδk−1 2 t     a fδk−1 (xk−1 ) − fδk−1 (ˆx∗δk−1 ) ≤ 2 1 − μδk−1     ≤ 2 exp −t μδk−1 fδk−1 (xk−1 ) − fδk−1 (ˆx∗δk−1 ) =

fδk−1 (xk−1 ) − fδk−1 (ˆx∗δk−1 ) 4

, a

where we use the fact that fδk−1 (x) is μ-strongly convex in ≤. Let Dδk = fδk (xk ) − fδk (ˆx∗δk ). Then we have Dδk = fδk (xk ) − fδk (ˆx∗δk ) ≤ f (xk ) − fδk−1 (ˆx∗δk )

2.5 Smoothing for Nonsmooth Optimization

≤ fδk−1 (xk ) + ≤

43

δk−1 M 2 − fδk−1 (ˆx∗δk−1 ) 2

Dδk−1 δk−1 M 2 + , 4 2

and Dδ0 ≤ f (x0 ) − f (ˆx∗δ0 ) +

δ0 M 2 δ0 M 2 ≤ f (x0 ) − f (x∗ ) + , 2 2

where we use Proposition 2.1, fδk−1 (ˆx∗δk ) ≥ fδk−1 (ˆx∗δk−1 ), and f (ˆx∗δ0 ) ≥ f (x∗ ). From δk−1 = 2δk , we have D δK

δK−3 δK−2 δ0 + 2 + · · · + K−1 δK−1 + 4 4 4

∗ 2 δK−3 f (x0 ) − f (x ) M δK−2 δ0 δ0 + 2 + · · · + K−1 + K δK−1 + ≤ + 4K 2 4 4 4 4

∗ 1 1 f (x0 ) − f (x ) 1 1 = + M 2 δK 1 + + + · · · + K−1 + K+1 4K 2 4 2 2 Dδ M2 ≤ K0 + 4 2



f (x0 ) − f (x∗ ) + 2M 2 δK 4K

and f (xK ) − f (x∗ ) ≤ fδK (xK ) − fδK (ˆx∗δK ) + = D δK + ≤

δK M 2 2

δK M 2 2

f (x0 ) − f (x∗ ) δK M 2 + 2M 2 δK + K 4 2

f (x0 ) − f (x∗ ) 5M 2 δ0 + 4K 2K ∗ f (x0 ) − f (x ) ≤6 . 2K

=

Thus we only need K = log2 total time is K  k=1



6(f (x0 )−f (x∗ )) 

 2k−1 log 8 √

μδ0

log 8 =√ μδ0

such that f (xK ) − f (x∗ ) ≤ . So the

6(f (x0 )−f (x∗ )) 

√ 2−1

−1

3M ≤√ . μ



44

2 Accelerated Algorithms for Unconstrained Convex Optimization

2.6 Higher Order Accelerated Method When the objective function f (x) is twice continuously differentiable and has Lipschitz continuous Hessians (Definition A.14), the accelerated method has a faster convergence rate by applying the cubic regularization, which was originally proposed in [25] and further extended in [23]. Baes [4] and Bubeck et al. [7] extended the cubic regularization to higher regularization and studied the higher order accelerated method, which includes the cubic regularization as a special case. We introduce the study in [4] in this section. The method in [4] is described in Algorithm 2.10, where f (m) (x)[·] is the multilinear operator that corresponds to the m-th order derivative of f and maps a tuple of m − 1 vectors to a vector. For consistency, we write ∇f (x) as f (1) (x). Then the Taylor expansion of f (x) at xk can be written as f (x) = f (xk ) +

m   1  (m) f (xk )[x − xk , · · · , x − xk ], x − xk + · · · . m! i=1

Algorithm 2.10 High order accelerated gradient descent (AGD) Initialize x0 = z0 . for k = 1, 2, 3, · · · do λk+1 = (1 − θk )λk , yk = θk zk + (1 − θk )xk ,  m 1  (m) xk+1 = argminx (xk )[x − xk , · · · , x − xk ], x − xk + i=1 m! f Define φk+1 (x) in (2.48), zk+1 = argminx φk+1 (x). end for

N m+1 (m+1)! x − yk 

 ,

We make the following high order assumption: Assumption 2.1 Assume that the m-th derivative of f (x) is Lipschitz continuous: f (m) (y) − f (m) (x) ≤ My − x. By integrating several times the above inequality, we can easily deduce that for all x and y, we have   m   M  1  (1)  (1) (m) f (x)[y − x, · · · , y − x] ≤ y − xm . f (y) − f (x) −   m! (m − 1)! i=2

We use the estimate sequence to prove the convergence. Define two sequences: λk+1 = (1 − θk )λk ,

(2.47)

φk+1 (x) = (1 − θk )φk (x) + θk [f (xk+1 ) + ∇f (xk+1 ), x − xk+1 ] , (2.48)

2.6 Higher Order Accelerated Method

45

M with λ0 = 1 and φ0 (x) = f (x0 ) + (m+1)! x − x0 m+1 . Then similar to the proof of (2.6), we can easily see that (2.3) holds. Thus it is an estimate sequence. From Lemma 2.1, we have that if φk∗ ≥ f (xk ), then f (xk ) −f (x∗ ) ≤ λk (φ0 (x∗ ) −f (x∗ )). Define

zk = argmin φk (x).

(2.49)

x

We only need to ensure f (xk ) ≤ φk (zk ). The following lemma describes the Bregman distance induced by the higher order form of x − x0 p . Lemma 2.12 Let p ≥ 2 and ρ(x) = x − x0 p . Then ρ(y) − ρ(x) − ∇ρ(x), y − x ≥ cp y − xp , p−1 where cp =

p−2 for p > 2 and cp = 1 for p = 2. 1 + (2p − 3)1/(p−2) From the fact of (2.49), we can have the following lemma, which establishes the relation between φk (x) and φk (zk ) for any x. Lemma 2.13 For the definitions of (2.48), we have φk (x) ≥ φk (zk ) + λk χ (x, zk ), ∀x, where χ (x, y) =

(2.50)

Mcm+1 m+1 . (m+1)! y − x

Proof From the definition of φ0 (x) and Lemma 2.12, we have φ0 (x) ≥ φ0 (zk ) + ∇φ0 (zk ), x − zk  + χ (x, zk ). From the definition in (2.48), we know φk (x) = λk φ0 (x) + lk (x), where lk (x) is an affine function. Thus we have φk (x) ≥ λk φ0 (zk ) + λk ∇φ0 (zk ), x − zk  + λk χ (x, zk ) + lk (x) − lk (zk ) + lk (zk ) = φk (zk ) + ∇φk (zk ), x − zk  + λk χ (x, zk ) a

= φk (zk ) + λk χ (x, zk ), a

where = uses (2.49).



Now we can provide an intermediate inequality that we will use to prove f (xk ) ≤ φk (zk ) for all k.

46

2 Accelerated Algorithms for Unconstrained Convex Optimization

Lemma 2.14 Denote by {xk }k≥0 a sequence satisfying f (xk ) ≤ φk (zk ), then φk+1 (zk+1 ) ≥ f (xk+1 ) + ∇f (xk+1 ), (1 − θk )xk + θk zk − xk+1  + min (θk ∇f (xk+1 ), x − zk  + λk+1 χ (x, zk )) . x

(2.51)

Proof From (2.48), (2.50), f (xk ) ≤ φk (zk ), and the convexity of f (x), we have φk+1 (x) ≥ (1 − θk ) [φk (zk ) + λk χ (x, zk )] + θk [f (xk+1 ) + ∇f (xk+1 ), x − xk+1 ] ≥ (1 − θk ) [f (xk ) + λk χ (x, zk )] + θk [f (xk+1 ) + ∇f (xk+1 ), x − xk+1 ] ≥ (1 − θk ) [f (xk+1 ) + ∇f (xk+1 ), xk − xk+1  + λk χ (x, zk )] +θk [f (xk+1 ) + ∇f (xk+1 ), x − xk+1 ] = f (xk+1 ) + ∇f (xk+1 ), (1 − θk )xk + θk zk − xk+1  +θk ∇f (xk+1 ), x − zk  + λk+1 χ (x, zk ). 

(2.51) follows immediately.

Our goal is to make the sum of the last two terms on the right-hand side of (2.51) positive. We start our analysis with an easy lemma. The proof is simple and we omit it. Lemma 2.15 From the definition of yk , we have min (θk ∇f (xk+1 ), x − zk  + λk+1 χ (x, zk )) x   λk+1 ≥ min ∇f (xk+1 ), x − yk  + m+1 χ (x, yk ) . x θk

(2.52)

The next lemma plays a crucial role in the validation of the desired inequality. Lemma 2.16 For any x, we have ∇f (xk+1 ), x − xk+1  ≥−

M +N N −M yk − xk+1 m x − yk  + yk − xk+1 m+1 . m! m!

Proof From the optimality condition for xk+1 , we have  f

(1)

+

1 f (m) (yk )[xk+1 − yk , · · · , xk+1 − yk ], x − xk+1 (yk ) + · · · + (m − 1)! N yk − xk+1 m−1 xk+1 − yk , x − xk+1  ≥ 0, ∀x. m!



2.6 Higher Order Accelerated Method

47

On the other hand, from the high order smoothness of f (x), we have  1 f (1) (yk ) + · · · + f (m) (yk )[xk+1 − yk , · · · , xk+1 − yk ] (m − 1)!  − f (1) (xk+1 ), x − xk+1 M yk − xk+1 m x − xk+1  m! M ≤ yk − xk+1 m (yk − xk+1  + x − yk ) . m!



Thus we have 0≤

M yk − xk+1 m (yk − xk+1  + x − yk ) m!  N y − x m−1  k k+1 xk+1 − yk , x − xk+1  . + f (1) (xk+1 ), x − xk+1 + m!

Since xk+1 − yk , x − xk+1  = xk+1 − yk , (x − yk ) − (xk+1 − yk ) ≤ xk+1 − 

yk x − yk  − xk+1 − yk 2 , we can have the conclusion. Based on the previous lemmas, we are now ready to prove the convergence rate. Theorem 2.11 Let N = (2m + 1)M and θk ∈ (0, 1] is obtained by solving the equation (2m + 2)θkm+1 = cm+1 (1 − θk )λk , then f (xK ) − f (x∗ ) ≤

2m + 2 cm+1



m+1 K

m+1

f (x0 ) − f (x∗ ) +

M x0 − x∗ m+1 . (m + 1)!

Proof From (2.51), (2.52), and the definition of yk , we only need  ∇f (xk+1 ), yk − xk+1  + min ∇f (xk+1 ), x − yk  + x

λk+1 θkm+1

 χ (x, yk ) ≥ 0

to ensure f (xk+1 ) ≤ φk+1 (zk+1 ). Since  ∇f (xk+1 ), yk − xk+1  + min ∇f (xk+1 ), x − yk  + x

 = min ∇f (xk+1 ), x − xk+1  + x

λk+1 θkm+1

 χ (x, yk )

λk+1 θkm+1

 χ (x, yk )

48

2 Accelerated Algorithms for Unconstrained Convex Optimization

≥ min − x

M +N yk − xk+1 m x − yk  m!

N −M λk+1 + yk − xk+1 m+1 + m+1 χ (x, yk ) m! θk



N −M yk − xk+1 m+1 m!   λk+1 Mcm+1 m+1 M +N m + min − yk − xk+1  t + m+1 t t≥0 m! (m + 1)! θk ⎡  ⎤  m+1 θ m+1 1/m m+1 (M + N ) yk − xk+1  k ⎦. ⎣N − M − m = m! m+1 Mcm+1 λk+1

=

This quantity is nonnegative as long as cm+1

M(N − M)m (M + N)m+1



m+1 m

m ≥

θkm+1 . λk+1 cm+1 2m+2 obtained cm+1 = 2m+2 . From

Maximizing the right-hand side with respect to N , we get a value of for N = (2m + 1)M. From the definition of θk , we have  m+1 m+1 . Lemma 2.17, we have λK ≤ 2m+2 cm+1 K

θkm+1 λk+1



The following lemma is a higher order counterpart of Lemma 2.3 with p = 1. p

Lemma 2.17 If there exists δ > 0 such that in (2.47), then λK ≤

p p+K

p √ p

δ



θk λk+1

≥ δ, where λk+1 is defined

1  p p . δ K

Proof Since λk ≤ 1 is decreasing, we have  λk − λk+1 =

p

 p

p−1

λk

p−1

≤ p λk

+

     p p−2 p p−1 p λk λk+1 + · · · + λk+1 λk − p λk+1

 p

λk −

  p λk+1 .

This inequality implies 1 1 − √ = √ p p λk+1 λk

√ √ √ p p δ λk − p λk+1 λk − λk+1 θk λk . ≥ = ≥ √ √ √ p p p p λk λk+1 pλk λk+1 pλk λk+1

2.7 Explanation: A Variational Perspective

49

Summing over k = 0, · · · , K − 1, we have √ p 1 δ 1 1 , − = − 1 ≥ K √ √ √ p p p p λ0 λK λK which is equivalent to the desired result.



2.7 Explanation: A Variational Perspective Despite the compelling evidence of the convergence rate of the accelerated algorithms, it remains something of a conceptual mystery. In recent years, a number of explanations and interpretations of acceleration have been proposed [2, 6, 13, 16, 28]. In this section, we introduce the work in [30], which gives a variational perspective. Define the Bregman Lagrangian L(X, V , t) =

& '

t p p t Dh X + V , X − Ct p f (X) , t p

which is a function of position X, velocity V , and time t. Dh (·, ·) is the Bregman distance (Definition A.20) induced by some convex function h(x). Given a general Lagrangian L(X, V , t), 2we define a function on the curve Xt via integration of the Lagrangian J (X) = L(Xt , X˙ t , t)dt. From the calculus of variations, a necessary condition for a curve to minimize this functional is that it solves the Euler–Lagrange equation: d dt

&

' ∂L ∂L (Xt , X˙ t , t) = (Xt , X˙ t , t). ∂V ∂X

Specifically, for the Bregman Lagrangian the partial derivatives are &

' t p p t 2 ∂L p (X, V , t) = t ∇h X + V − ∇h(X) − ∇ h(X)V − Ct ∇f (X) , ∂X t p p

' & t ∂L (X, V , t) = t p ∇h X + V − ∇h(X) . ∂V p Thus the Euler–Lagrange equation for the Bregman Lagrangian is a second-order differential equation given by &

'−1 p+1 ˙ p2 p t ˙ 2 ¨ Xt + ∇f (Xt ) = 0. Xt + C 2 t ∇ h Xt + Xt t p t

(2.53)

50

2 Accelerated Algorithms for Unconstrained Convex Optimization

We can also write (2.53) in the following way, which only requires that ∇h is differentiable,

p d t ∇h Xt + X˙ t = −C t p ∇f (Xt ). (2.54) dt p t To establish a convergence rate associated with solutions to the Euler–Lagrange equation, we take a Lyapunov function approach. Defining the energy function  p  εt = Dh x∗ , Xt + X˙ t + Ct p (f (Xt ) − f (x∗ )). t Then we have the following theorem to give the convergence rate. Theorem 2.12 Solutions to the Euler–Lagrange equation (2.54) satisfy ∗



f (Xt ) − f (x ) ≤ O

1 Ct p

.

Proof The time derivative of the energy function is

 p d t t ∇h Xt + X˙ t , x∗ − Xt − X˙ t + C t p (f (Xt ) − f (x∗ )) dt p p t   +Ct p ∇f (Xt ), X˙ t . 

ε˙ t = −

If Xt satisfies (2.54), then the time derivative simplifies to   p  ε˙ t = −C t p f (x∗ ) − f (Xt ) − ∇f (Xt ), x∗ − Xt ≤ 0. t  ∗  That Dh x , Xt + pt X˙ t ≥ 0 implies that for any t, Ct p (f (Xt ) − f (x∗ )) ≤ εt ≤ εt εt0 . Thus f (Xt ) − f (x∗ ) ≤ Ct0p .



2.7.1 Discretization We now turn to discretize the differential equation (2.54). We write the second-order equation (2.54) as the following system of first-order equations: Zt = Xt +

t ˙ Xt , p

d ∇h(Zt ) = −Cpt p−1 ∇f (Xt ). dt

(2.55) (2.56)

2.7 Explanation: A Variational Perspective

51

Now we discretize Xt and Zt into sequences xt and zt and set xt = Xt , xt+1 = Xt + X˙ t , zt = Zt , and zt+1 = Zt + Z˙t . Applying the forward Euler method to (2.55) gives the equation zt = xt +

t (xt+1 − xt ). p

(2.57)

Similarly, applying the backward Euler method to Eq. (2.56) gives ∇h(zt ) − ∇h(zt−1 ) = −Cpt p−1 ∇f (xt ), which can be written as the optimality condition of the following mirror descent: 3 4 zt = argmin Cpt p−1 ∇f (xt ), z + Dh (z, zt−1 ) .

(2.58)

z

However, we cannot prove the convergence for the algorithm in (2.57) and (2.58). Inspired by Algorithm 2.3, which maintains three sequences, we introduce a third sequence yt and consider the following iterates: 3 4 zt = argmin Cpt p−1 ∇f (yt ), z + Dh (z, zt−1 ) ,

(2.59)

z

p t zt + yt , t +p t +p

xt+1 =

(2.60)

where t p−1 = t (t + 1) · · · (t + p − 2) is the rising factorial. A sufficient condition for algorithm (2.59)–(2.60) to have an O(1/t p ) convergence rate is that the new sequence yt satisfies the inequality: ∇f (yt ), xt − yt  ≥ M∇f (yt )p/(p−1)

(2.61)

for some constant M > 0. Theorem 2.13 Assume that h is 1-uniformly convex of order p ≥ 2 (i.e., the Bregman distance satisfies Dh (y, x) ≥ p1 y − xp ) and yt satisfies (2.61), then the algorithm (2.59)–(2.60) with the constant C ≤ M p−1 /pp and initial condition z0 = x0 has the convergence rate ∗

f (yt ) − f (x ) ≤ O



1 tp

.

We define the following function, which can be recognized as Nesterov’s estimate function, ψt (x) = Cp

t  i=0

i p−1 [f (yi ) + ∇f (yi ), x − yi ] + Dh (x, x0 ).

(2.62)

52

2 Accelerated Algorithms for Unconstrained Convex Optimization

The optimality condition for (2.59) is ∇h(zt ) = ∇h(zt−1 ) − Cpt p−1 ∇f (yt ). By unrolling the recursion, we can write ∇h(zt ) = ∇h(z0 ) − Cp

t 

i p−1 ∇f (yi ),

i=0

and because x0 = z0 , we can write this equation as ∇ψt (zt ) = 0. Thus we have zt = argmin ψt (z). z

For proving Theorem 2.13, we have the following property. Lemma 2.18 For all t ≥ 0, we have ψt (zt ) ≥ Ct p f (yt ).

(2.63)

Proof We prove by induction. The base case k = 0 is true because both sides equal 0. Now assume (2.63) holds for some t, we will show it also holds for t + 1. Because h is 1-uniformly convex of order p, the Bregman distance Dh (x, x0 ) is 1-uniformly convex. Thus the estimate function ψt is also 1-uniformly convex of order p. Because ∇ψt (zt ) = 0, Dψt (x, zt ) = ψt (x) − ψt (zt ). So for all x we have ψt (x) = ψt (zt ) + Dψt (x, zt ) ≥ ψt (zt ) +

1 x − zt p . p

Applying the inductive hypothesis (2.63) and using the convexity of f gives ψt (x) ≥ Ct p [f (yt+1 ) + ∇f (yt+1 ), yt − yt+1 ] +

1 x − zt p . p

We now add Cp(t + 1)p−1 [f (yt+1 ) + ∇f (yt+1 ), x − yt+1 ] to both sides of the equation to obtain ψt+1 (x) = ψt (x) + Cp(t + 1)p−1 [f (yt+1 ) + ∇f (yt+1 ), x − yt+1 ] ≥ C(t + 1)p [f (yt+1 ) + ∇f (yt+1 ), xt+1 − yt+1 + τt (x − zt )] 1 + x − zt p , p from the definition of ψt+1 (x) in (2.62) and the definition of xt+1 in (2.60), where p−1 p = t+p . τt = p(t+1) (t+1)p

2.7 Explanation: A Variational Perspective

53

Applying (2.61) to the term ∇f (yt+1 ), xt+1 − yt+1 , we have ψt+1 (x) ≥ C(t + 1)p f (yt+1 ) + C(t + 1)p M∇f (yt+1 )p/(p−1) +Cp(t + 1)p−1 ∇f (yt+1 ), x − zt  +

1 x − zt p . p

(2.64)

Next, we apply the Fenchel-Young inequality (Proposition A.13) to have s, u +

1 p−1 up ≥ − sp/(p−1) p p

with the choices of u = x − zt and s = Cp(t + 1)p−1 ∇f (yt+1 ). Then from (2.64), we have p f (yt+1 ) ψt+1 (x) ≥ C(t + 1) 

p − 1 p/(p−1) 1/(p−1) ((t + 1)p−1 )p/(p−1) + M− p C p (t + 1)p 4 ×∇f (yt+1 )p/(p−1) .



Note that [(t + 1)p−1 ]p/(p−1) ≤ (t + 1)p . Then from the assumption C ≤ M p−1 /pp , we see that the second term inside the parentheses is nonnegative. Hence we conclude the desired inequality ψt+1 (x) ≥ C(t + 1)p f (yt+1 ). Because x is arbitrary, it also holds for the minimizer x = zt+1 of ψt+1 , finishing the induction. 

With Lemma 2.18, we can complete the proof of Theorem 2.13. Proof Because f is convex, we can bound the estimate function ψt by ψt (x) ≤ Cp

t 

i p−1 f (x) + Dh (x, x0 ) = Ct p f (x) + Dh (x, x0 ),

i=0

 where we can prove p ti=0 i p−1 = t p by mathematical induction, e.g., t+p−2 p−1 p = Ct+p−1 holds for t = 1, 2, · · · . The above inequality holds for all j =p−1 Cj x and in particular for the minimizer x∗ of f . Combining the bound with result of Lemma 2.18 and recalling that zt is the minimizer of ψt , we get Ct p f (yt ) ≤ ψt (zt ) ≤ ψt (x∗ ) ≤ Ct p f (x∗ ) + Dh (x∗ , x0 ). Rearranging and dividing by Ct p gives the desired convergence rate.

54

2 Accelerated Algorithms for Unconstrained Convex Optimization

Till now, the remaining thing is to find some yt such that (2.61) holds. For simplicity, we only consider the case of p = 2 and assume that f is (p − 1)-order smooth, i.e., L-smooth when p = 2. The higher order smooth case can be analyzed with the similar inductions [30]. Let

1 x − xt 2 . yt = argmin ∇f (xt ), x + 4M x Then from the optimality condition, we know ∇f (xt ) +

1 2M (yt

− xt ) = 0. Since

 2  1   L2 xt − yt 2 ≥ ∇f (xt ) − ∇f (yt )2 =  − x ) + ∇f (y ) (y t t  ,  2M t letting 2M ≤ 1/L the above gives 1 1 ∇f (yt ), xt − yt  ≥ ∇f (yt )2 + yt − xt 2 − L2 xt − yt 2 M 4M 2 ≥ ∇f (yt )2 , which leads to (2.61) with p = 2.



References 1. Z. Allen-Zhu, E. Hazan, Optimal black-box reductions between optimization objectives, in Advances in Neural Information Processing Systems, Barcelona, vol. 29 (2016), pp. 1614– 1622 2. Z. Allen-Zhu, L. Orecchia, Linear coupling: an ultimate unification of gradient and mirror descent, in Proceedings of the 8th Innovations in Theoretical Computer Science (ITCS), Berkeley, (2017) 3. F. Bach, R. Jenatton, J. Mairal, G. Obozinski, Convex optimization with sparsity-inducing norms, in Optimization for Machine Learning (MIT Press, Cambridge, 2012), pp. 19–53 4. M. Baes, Estimate sequence methods: extensions and approximations. Technical report, Institute for Operations Research, ETH, Zürich (2009) 5. A. Beck, M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imag. Sci. 2(1), 183–202 (2009) 6. S. Bubeck, Y.T. Lee, M. Singh, A geometric alternative to Nesterov’s accelerated gradient descent (2015). Preprint. arXiv:1506.08187 7. S. Bubeck, Q. Jiang, Y.T. Lee, Y. Li, A. Sidford, Near-optimal method for highly smooth convex optimization, in Proceedings of the 36th Conference on Learning Theory, Long Beach, (2019), pp. 492–507 8. A. Chambolle, T. Pock, A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imag. Vis. 40(1), 120–145 (2011) 9. A. Chambolle, T. Pock, On the ergodic convergence rates of a first-order primal-dual algorithm. Math. Program. 159(1–2), 253–287 (2016) 10. X. Chen, S. Kim, Q. Lin, J.G. Carbonell, E.P. Xing, Graph-structured multi-task regression and an efficient optimization method for general fused lasso (2010). Preprint. arXiv:1005.3579

References

55

11. O. Devolder, F. Glineur, Y. Nesterov, First-order methods of smooth convex optimization with inexact oracle. Math. Program. 146(1–2), 37–75 (2014) 12. J.M. Fadili, G. Peyré, Total variation projection with first order schemes. IEEE Trans. Image Process. 20(3), 657–669 (2010) 13. N. Flammarion, F. Bach, From averaging to acceleration, there is only a step-size, in Proceedings of the 28th Conference on Learning Theory, Paris, (2015), pp. 658–695 14. L. Jacob, G. Obozinski, J.-P. Vert, Group lasso with overlap and graph lasso, in Proceedings of the 26th International Conference on Machine Learning, Montreal, (2009), pp. 433–440 15. G. Lan, Y. Zhou, An optimal randomized incremental gradient method. Math. Program. 171(1– 2), 167–215 (2018) 16. L. Lessard, B. Recht, A. Packard, Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26(1), 57–95 (2016) 17. H. Lin, J. Mairal, Z. Harchaoui, Catalyst acceleration for first-order convex optimization: from theory to practice. J. Mach. Learn. Res. 18(212), 1–54 (2018) 18. I. Necoara, Y. Nesterov, F. Glineur, Linear convergence of first order methods for non-strongly convex optimization. Math. Program. 175(1–2), 69–107 (2019) 19. Y. Nesterov, A method for unconstrained convex minimization problem with the rate of convergence O(1/k 2 ). Sov. Math. Dokl. 27(2), 372–376 (1983) 20. Y. Nesterov, On an approach to the construction of optimal methods of minimization of smooth convex functions. Ekonomika I Mateaticheskie Metody 24(3), 509–517 (1988) 21. Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course (Springer, New York, 2004) 22. Y. Nesterov, Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005) 23. Y. Nesterov, Accelerating the cubic regularization of Newton’s method on convex problems. Math. Program. 181(1), 112–159 (2008) 24. Y. Nesterov, Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013) 25. Y. Nesterov, B.T. Polyak, Cubic regularization of Newton’s method and its global performance. Math. Program. 108(1), 177–205 (2006) 26. B. O’Donoghue, E. Candès, Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 15(3), 715–732 (2015) 27. M. Schmidt, N.L. Roux, F.R. Bach, Convergence rates of inexact proximal-gradient methods for convex optimization, in Advances in Neural Information Processing Systems, Granada, vol. 24 (2011), pp. 1458–1466 28. W. Su, S. Boyd, E. Candès, A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights, in Advances in Neural Information Processing Systems, Montreal, vol. 27 (2014), pp. 2510–2518 29. P. Tseng, On accelerated proximal gradient methods for convex-concave optimization. Technical report, University of Washington, Seattle (2008) 30. A. Wibisono, A.C. Wilson, M.I. Jordan, A variational perspective on accelerated methods in optimization. Proc. Natl. Acad. Sci. 113(47), 7351–7358 (2016)

Chapter 3

Accelerated Algorithms for Constrained Convex Optimization

Besides the unconstrained optimization, the acceleration techniques can also be applied to constrained optimization. In this chapter, we will introduce how to apply the acceleration techniques to constrained optimization. Formally, we consider the following general constrained convex optimization problem: min f (x),

x∈Rn

(3.1)

s.t. Ax = b, gi (x) ≤ 0, i = 1, · · · , p, where both f (x) and gi (x) are convex and A ∈ Rm×n . This chapter introduces the penalty method, the Lagrange multiplier method, the augmented Lagrange multiplier method, the alternating direction method of multiplier, and the primal-dual method.

3.1 Some Facts for the Case of Linear Equality Constraint We first consider a simple case of problem (3.1), with linear equality constraint only: min f (x), x

(3.2)

s.t. Ax = b. The introduction in this section is mostly for problem (3.2).

© Springer Nature Singapore Pte Ltd. 2020 Z. Lin et al., Accelerated Optimization for Machine Learning, https://doi.org/10.1007/978-981-15-2910-8_3

57

58

3 Accelerated Algorithms for Constrained Convex Optimization

From the definition of Fenchel conjugate (Definition A.21), we know that the dual problem (Definition A.24) of (3.2) is max d(u), where d(u) = −f ∗ (AT u) − u, b . u

Let σ1 ≥ σ2 ≥ · · · ≥ σr > 0 be the nonzero singular values (Definition A.1) of A, we have the following lemma for d(u). Lemma 3.1 1. If f is μ-strongly convex, then d(u) is 2. If f is L-smooth, then −d(u) is

σ12 μ -smooth.

σr2 L -strongly

convex for all u ∈ Span(A).

Proof 1. Since f is μ-strongly convex, we know that f ∗ is 1/μ-smooth (Point 4 of Proposition A.12). So ∇d(u) − ∇d(v) = A∇f ∗ (AT u) − A∇f ∗ (AT v) ≤ A2 ∇f ∗ (AT u) − ∇f ∗ (AT v) ≤

A2 T A u − AT v μ



A22 u − v. μ

2. Since f is L-smooth, we know that f ∗ is L1 -strongly convex, but not necessarily differentiable (Point 4 of Proposition A.12). So      ˆ ∗ (AT u) − A∇f ˆ ∗ (AT u ), u − u ˆ ˆ ), u − u = A∇f − ∇d(u) − ∇d(u   ˆ ∗ (AT u ), AT u − AT u ˆ ∗ (AT u) − ∇f = ∇f a



1 T A u − AT u 2 , L a

ˆ ˆ ∗ (u) ∈ ∂f ∗ (u), and ≥ uses Proposition A.11. Let A = where ∇d(u) ∈ ∂d(u), ∇f T UV be the economic SVD (Definition A.1). Since u and u ∈ Span(A), there exists y such that u − u = Ay = UVT y = Uz. So we have AT A = V 2 VT ,

3.1 Some Facts for the Case of Linear Equality Constraint

59

AAT = U 2 UT and AT u − AT u 2 = (u − u )T AAT (u − u ) = zT UT U 2 UT Uz = zT  2 z ≥ σr2 (A)z2 = σr2 (A)u − u 2 , where we use u − u 2 = zT UT Uz = z2 .



Now we introduce the following lemmas, which will be used in this section. Lemma 3.2 Suppose that f (x) is convex and let (x∗ , λ∗ ) be a KKT point (Defini tion A.26) of problem (3.2), then we have f (x) − f (x∗ ) + λ∗ , Ax − b ≥ 0, ∀x. Lemma 3.3 Suppose that f (x) is convex and let (x∗ , λ∗ ) be a KKT point of problem (3.2). If   f (x) − f (x∗ ) + λ∗ , Ax − b ≤ α1 , Ax − b ≤ α2 , then we have −λ∗ α2 ≤ f (x) − f (x∗ ) ≤ λ∗ α2 + α1 . At last, we introduce the augmented Lagrangian function: Lβ (x, u) = f (x) + u, Ax − b +

β Ax − b2 , 2

and define dβ (u) = min Lβ (x, u). x

(3.3)

For any u, we have d(u) ≤ dβ (u). Moreover, for any u, we have dβ (u) ≤ f (x∗ ). Since d(u∗ ) = f (x∗ ), we know d(u∗ ) = dβ (u∗ ) = f (x∗ ). Further, dβ (u) has a better smoothness property than d(u).

60

3 Accelerated Algorithms for Constrained Convex Optimization

Lemma 3.4 Let D(u) denote the optimal solution set of minx Lβ (x, u). Then Ax is invariant over D(u). Moreover, dβ (u) is differentiable and ∇dβ (u) = Ax(u) − b, where x(u) ∈ D(u) is any minimizer of minx Lβ (x, u). We also have that dβ (u) is 1 β -smooth, i.e., ∇dβ (u) − ∇dβ (u ) ≤

1 u − u . β

Proof Suppose that there exist x and x ∈ D(u) with Ax = Ax . Then we have dβ (u) = Lβ (x, u) = Lβ (x , u). Due to the convexity of Lβ (x, u) with respect to x, D(u) must be convex, implying x = (x + x )/2 ∈ D(u). By the convexity of f and strict convexity (Definition A.9) of  · 2 , we have 1 1 β Lβ (x, u) + Lβ (x , u) > f (x) + Ax − b, u + Ax − b2 2 2 2 = Lβ (x, u).

dβ (u) =

This contradicts the definition dβ (u) = minx Lβ (x, u). Thus Ax is invariant over D(u). So ∂dβ (u) is a singleton. By Danskin’s theorem (Theorem A.1), we know dβ (u) is differentiable and ∇d(u) = Ax(u) − b, where x(u) ∈ D(u) is any minimizer of minx Lβ (x, u). Let x = argminx Lβ (x, u) and x = argminx Lβ (x, u ). Then we have 0 ∈ ∂f (x) + AT u + βAT (Ax − b), 0 ∈ ∂f (x ) + AT u + βAT (Ax − b). From the monotonicity of ∂f (Proposition A.11), we have   −(AT u + βAT (Ax − b)) + (AT u + βAT (Ax − b)), x − x ≥ 0   ⇒ u − u , Ax − Ax + βAx − Ax 2 ≤ 0 ⇒ βAx − Ax  ≤ u − u . So we have ∇dβ (u) − ∇dβ (u ) = Ax − Ax  ≤

1 u − u . β 

3.2 Accelerated Penalty Method

61

3.2 Accelerated Penalty Method The penalty method poses the constraint in problem (3.2) as a large penalty [17, 25, 28, 29, 33] and minimizes the following problem instead: min f (x) + x

Generally speaking, if β is of the order f (x) +

β Ax − b2 . 2 1 

(3.4)

and



β β Ax − b2 ≤ min f (x) + Ax − b2 + , x 2 2

then we can have |f (x)−f (x∗ )| ≤  and Ax−b ≤  [17]. In fact, letting (x∗ , λ∗ ) be a KKT point of problem (3.2), we have f (x) +



β β Ax − b2 ≤ min f (x) + Ax − b2 +  ≤ f (x∗ ) + , x 2 2  ∗    ∗ ∗ f (x ) = f (x ) + λ , Ax∗ − b ≤ f (x) + λ∗ , Ax − b ,

which leads to   f (x) − f (x∗ ) ≤  and − λ∗ Ax − b ≤ − λ∗ , Ax − b ≤ f (x) − f (x∗ ). So β Ax − b2 − λ∗ Ax − b ≤ , 2 which further leads to 2λ∗  Ax − b ≤ + β

5 2 ≤  and − λ∗  ≤ f (x) − f (x∗ ) β

  by letting β = O 1 . Thus we can use the accelerated gradient methods described in the previous section to minimize the penalized problem (3.4). However, directly minimizing problem (3.4) with a large penalty makes the algorithm slow due to the illconditioning of β2 Ax − b2 with a large β. To solve this problem, the continuation technique is often used [17], namely solving a sequence of problems (3.4) with increasing penalty parameters.

62

3 Accelerated Algorithms for Constrained Convex Optimization

In this section, we follow [20] and introduce a little different strategy from the continuation technique. We increase the penalty parameter β at each iteration. In other words, we solve a sequence of subproblems in the original continuation technique with only one iteration and then immediately increase the penalty parameter. We adopt the acceleration technique discussed in the previous section and describe the algorithm in Algorithm 3.1, where θk , αk , and ηk will be specified in Theorems 3.2 and 3.3. Algorithm 3.1 Accelerated penalty method Initialize x0 = x−1 . for k = 0, 1, 2, 3, · · · do k−1 ) yk = xk + (ηk θ(ηk −μ)(1−θ (xk − xk−1 ), −μ)θk−1 k

1 xk+1 = yk − ηk ∇f (yk ) + αβk AT (Ayk − b) . end for

We first give a general result in Theorem 3.1, which considers both the generally convex case and the strongly convex case. The following lemma gives some basic relations that will be used in the proof of Theorem 3.1. ∞ Lemma 3.5 Assume that sequences {αk }∞ k=0 and {θk }k=0 satisfy Define

1−θk αk

=

1 αk−1 .

λk+1 =

β (Ayk − b) , αk

λk+1 =

β (Axk+1 − b), αk

wk+1 =

1 1 − θk xk+1 − xk . θk θk

(3.5)

β [Axk+1 − (1 − θk )Axk − θk b] , αk

(3.6)

Then we have λk+1 − λk =

αk βAT A2 λk+1 − λk+1 2 ≤ xk+1 − yk 2 , 2β 2αk wk =

ηk − μ ηk (1 − θk ) yk − xk . ηk θk − μ ηk θk − μ

(3.7) (3.8)

3.2 Accelerated Penalty Method

63

Proof For the first relation, we have λk+1 − λk =

β β (Axk+1 − b) − (Axk − b) αk αk−1

=

β β(1 − θk ) (Axk+1 − b) − (Axk − b) αk αk

=

β [Axk+1 − (1 − θk )Axk − θk b] . αk

For the second relation, we have  2  β αk αk  βAT A2 2  λk+1 − λk+1  = A(xk+1 − yk ) ≤ xk+1 − yk 2 .   2β 2β αk 2αk 

The third relation can be obtained from the definition of yk . Now we give the main results in the following theorem.

Theorem 3.1 Assume that f (x) is L-smooth and μ-strongly convex. Let {αk }∞ k=0 1 be a decreasing sequence with α−1 = 0 and αk ≥ 0. Define θk and ηk as those k satisfying 1−θ αk = inequalities hold:

1 αk−1

and ηk = L +

2 ηk−1 θk−1

2αk−1



βAT A2 . αk

ηk θk2 − μθk , 2αk

Assume that the following two

θk ≥

μ . ηk

(3.9)

Then for Algorithm 3.1, we have |f (xK+1 ) − f (x∗ )| ≤ O (αK ) ,

AxK+1 − b ≤ O (αK ) .

Proof From the second step, we have 0 = ∇f (yk ) + AT λk+1 + ηk (xk+1 − yk ). From the L-smoothness and the μ-strong convexity of f , we have L xk+1 − yk 2 2 μ L ≤ f (x) − x − yk 2 + ∇f (yk ), xk+1 − x + xk+1 − yk 2 2 2   μ 2 T = f (x) − x − yk  + A λk+1 , x − xk+1 2 L + ηk xk+1 − yk , x − xk+1  + xk+1 − yk 2 2

f (xk+1 ) ≤ f (yk ) + ∇f (yk ), xk+1 − yk  +

64

3 Accelerated Algorithms for Constrained Convex Optimization

  = f (x) + AT λk+1 , x − xk+1 + ηk xk+1 − yk , x − yk  −

μ x − yk 2 − 2



L βAT A2 + 2 αk

xk+1 − yk 2 .

Letting x = xk and x = x∗ , respectively, we obtain two inequalities. Multiplying the first inequality by 1 − θk and the second by θk and adding them, we have f (xk+1 ) − (1 − θk )f (xk ) − θk f (x∗ )   ≤ λk+1 , θk Ax∗ + (1 − θk )Axk − Axk+1   + ηk xk+1 − yk , θk x∗ + (1 − θk )xk − yk

L βAT A2 μθk ∗ 2 xk+1 − yk 2 . x − yk  − + − 2 2 αk   Adding λ∗ , Axk+1 − (1 − θk )Axk − θk Ax∗ to both sides and using Ax∗ = b, we have   f (xk+1 ) − f (x∗ ) + λ∗ , Axk+1 − b    − (1 − θk ) f (xk ) − f (x∗ ) + λ∗ , Axk − b   ≤ λk+1 − λ∗ , θk Ax∗ + (1 − θk )Axk − Axk+1   + ηk xk+1 − yk , θk x∗ + (1 − θk )xk − yk

μθk ∗ L βAT A2 − xk+1 − yk 2 x − yk 2 − + 2 2 αk  a αk  = λk+1 − λ∗ , λk − λk+1 β  ηk  θk x∗ + (1 − θk )xk − yk 2 − θk x∗ + (1 − θk )xk − xk+1 2 + 2 βAT A2 μθk ∗ x − yk 2 − yk − xk+1 2 2 2αk   b αk λk − λ∗ 2 − λk+1 − λ∗ 2 − λk+1 − λk 2 + λk+1 − λk+1 2 = 2β  2  2    ∗ 1 − θk ηk θk2  1 − θk 1  1 ∗    x + xk − yk  − x + xk − xk+1  +   2 θk θk θk θk −



μθk ∗ βAT A2 x − yk 2 − yk − xk+1 2 2 2αk

3.2 Accelerated Penalty Method c



65

 μθ αk  k λk − λ∗ 2 − λk+1 − λ∗ 2 − λk+1 − λk 2 − x∗ − yk 2 2β 2   2   ηk θk2  1 − θk 1  ∗ ∗ 2    + , x + θ xk − θ yk  − wk+1 − x 2 k k a

c

b

where we use (3.6) and (A.1) in =, (A.3) in =, and (3.7) and (3.5) in ≤. Consider  2  ηk θk2  x∗ + 1 − θk xk − 1 yk   2 θk θk  

ηk θk2  μ μ ∗  = (x − yk ) + 1 − 2  ηk θk ηk θk

2  ηk (1 − θk ) ηk − μ ∗ × x + xk − yk   ηk θk − μ ηk θk − μ  2  a μθk θk (ηk θk − μ)  x∗ + ηk (1 − θk ) xk − ηk − μ yk  ≤ x∗ − yk 2 +  2 2 ηk θk − μ ηk θk − μ   θk (ηk θk − μ)  b μθk wk − x∗ 2 , = x∗ − yk 2 + 2 2 a

b

where ≤ uses the convexity of  · 2 and = uses (3.8). Thus we have   f (xk+1 ) − f (x∗ ) + λ∗ , Axk+1 − b    − (1 − θk ) f (xk ) − f (x∗ ) + λ∗ , Axk − b  θ (η θ − μ)   αk  k k k wk − x∗ 2 λk − λ∗ 2 − λk+1 − λ∗ 2 + ≤ 2β 2 −

 ηk θk2  wk+1 − x∗ 2 . 2

Dividing both sides by αk and using 1 αk



1−θk αk

=

1 αk−1

and (3.9), we have

  ηk θk2  wk+1 − x∗ 2 f (xk+1 ) − f (x ) + λ , Axk+1 − b + 2 ∗

+







1 λk+1 − λ∗ 2 2β   2     θ η 1 k−1 2 k−1 wk − x∗  ≤ f (xk ) − f (x∗ ) + λ∗ , Axk − b + αk−1 2 +

1 λk − λ∗ 2 . 2β

66

3 Accelerated Algorithms for Constrained Convex Optimization

So we have    ∗  ηK θK2 1 ∗ ∗ 2 f (xK+1 ) − f (x ) + λ , AxK+1 − b + wK+1 − x  αK 2 +

1 1 λK+1 − λ∗ 2 ≤ λ0 − λ∗ 2 , 2β 2β

where we use

1 α−1

= 0. From Lemma 3.2, we have

  αK λ0 − λ∗ 2 , f (xK+1 ) − f (x∗ ) + λ∗ , AxK+1 − b ≤ 2β λK+1 − λ∗  ≤ λ0 − λ∗ .     Since  αβK (AxK+1 − b) = λK+1  ≤ λK+1 − λ∗  + λ∗  ≤ λ0 − λ∗  + λ∗ , we have AxK+1 − b ≤

λ0 − λ∗  + λ∗  αK . β 

From Lemma 3.3, we can have the conclusion.

3.2.1 Generally Convex Objectives We can specialize the value of αk for the generally convex case and the strongly convex case and establish their rates. We first consider the generally   convergence convex case and prove the O K1 convergence rate. Theorem 3.2 Assume that f (x) is L-smooth and convex. Let αk = θk = assumption (3.9) holds and we have

1 k+1 ,

then

|f (xK+1 ) − f (x∗ )| ≤ O (1/K) and AxK+1 − b ≤ O (1/K) . Proof If μ = 0 and αk = θk , then (3.9) reduces to ηk θk ≤ ηk−1 θk−1 and θk ≥ 0, 1 k which is true due to 0 ≤ θk < θk−1 and the definition of ηk . From 1−θ αk = αk−1

1 1 and α−1 = 0 we have αk = k+1 . So we have AxK+1 − b ≤ O (1/K) and ∗ |f (xK+1 ) − f (x )| ≤ O (1/K). 

3.3 Accelerated Lagrange Multiplier Method

67

3.2.2 Strongly Convex Objectives  Then we consider the strongly convex case and give a faster O rate.

1 K2

 convergence

Theorem 3.3 Assume that f (x) is L-smooth and μ-strongly convex. Let 1 2 , θk−1

αk = θk2 , θ0 = 1, and

μ2 4LAT A2

≤β ≤

μ , AT A2

1−θk θk2

=

then assumption (3.9) holds

and we have     |f (xK+1 ) − f (x∗ )| ≤ O 1/K 2 and AxK+1 − b ≤ O 1/K 2 . Proof If μ > 0 and αk = θk2 , then (3.9) reduces to ηk − μ/θk ≤ ηk−1 and T

T

Lθ + βAθk A2 ≥ μ. Consider ηk − μ/θk − ηk−1 = L + βAαkA2 − μ/θk −     k T A βAT A2 −μ 1 1 T A 2 = βA − , where we use L + βA 2 αk−1 αk αk−1 − μ/θk = θk μ , then ηk − μ/θk ≤ ηk−1 . AT A2 T θk μ−Lθk2 Then we consider Lθk + βAθk A2 ≥ μ. It holds if β ≥ A T A . Since 2 2 2 μ μ 2 θ μ − Lθ ≤ 4L , ∀θ , we only need β ≥ 4LAT A . So we finally get the condition 2 μ2 μ 1−θk ≤ β ≤ . Since θ = 1 and = 21 , from Lemma 2.3 we can 0 4LAT A2 AT A2 θk2 θk−1   2 4 easily have θk ≤ k+2 and αk ≤ (k+2)2 . Thus we have AxK+1 − b ≤ O 1/K 2   and |f (xK+1 ) − f (x∗ )| ≤ O 1/K 2 . 

1 αk



1 αk−1

=

θk αk

=

1 θk .

So if β ≤

3.3 Accelerated Lagrange Multiplier Method In this section, we consider the general problem (3.1) and use the methods described in the previous section to maximize its Lagrange dual (A.13). When f (x) is μstrongly convex and each gi has bounded subgradient, due to Danskin’s theorem [1] (Theorem A.1) we know that d(u, v) is convex, differentiable (see also Lemma 3.4) and ∇d(u, v) =



T T Ax∗ (u, v) − b , g1 (x∗ (u, v)), · · · , gp (x∗ (u, v)) , (3.10)

where x∗ (u, v) = argminx L(x, u, v). Then    d(u, v) = f (x∗ (u, v)) + u, Ax∗ (u, v) − b + vi gi (x∗ (u, v)) p

i=1 ∗

= f (x (u, v)) + λ, ∇d(u, v) ,

(3.11)

68

3 Accelerated Algorithms for Constrained Convex Optimization

Moreover, from Proposition 3.3 in [23] we have the Lipschitz smoothness condition of d(u, v): ∇d(u, v) − ∇d(u , v ) ≤ L(u, v) − (u , v ), ∀(u, v), (u , v ) ∈ D, where   √ m  p + 1 max{A2 , maxi Lgi }  A2 + L= L2gi , 2 μ i=1

in which Lgi is an upper bound of ∂gi . Thus we can use the gradient ascent or accelerated gradient ascent to maximize d(u, v). The Lagrange multiplier method solves problem (3.1) via maximizing its dual function using the gradient ascent. It consists of the following step at each iteration: λk+1 = Projλ∈D (λk + β∇d(λk )) , where λ = (u, v) and Proj is the projection operator. Instead of the gradient ascent, [23] used the accelerated gradient ascent to solve the dual problem, which leads to the accelerated Lagrange multiplier method. We describe it in Algorithm 3.2, where we use Algorithm 2.3 rather than Algorithm 2.2 in the dual space since we need to ensure ν k , μk , and λk ∈ D. Algorithm 2.2 cannot guarantee this. Algorithm 3.2 Accelerated Lagrange multiplier method Initialize ν 0 = μ0 = λ0 . for k = 0, 1, 2, 3, · · · do ν k = (1 − θk )λk +  θk μk ,

 μk+1 = Projμ∈D μk + θk1L ∇d(ν k ) , λk+1 = (1 − θk )λk + θk μk+1 . end for

From Theorem 2.3, we have the following convergence rate theorem in the dual space. Theorem 3.4 Assume that f (x) is μ-strongly convex and ∂gi  ≤ Lgi . Let θ0 = 1  4 +4θ 2 −θ 2 θk−1 k−1 k−1

and θk = 2 dual space, we have

. Then for Algorithm 3.2, with Algorithm 2.3 used in the

−d(λK+1 ) + d(λ∗ ) ≤

L λ0 − λ∗ 2 . (K + 2)2

3.3 Accelerated Lagrange Multiplier Method

69

3.3.1 Recovering the Primal Solution It is not satisfactory to establish the iteration complexity only in the dual space. We should recover the primal solutions from the dual iterates and need to estimate how quickly the primal solutions converge. We describe the main results studied in [23], which answers this question. Lemma 3.6 For any x = argminx L(x, u, v), we have 

Ax − b2 +

p

i=1 (max{0, gi (x)})

2



 6 7 2 ∗ L d(λ ) − d(λ) ,

 6 7 d(λ∗ ) − d(λ) , f (x∗ ) − f (x) ≤ λ∗  2(p+m) L  f (x) − f (x∗ ) ≤ 2[d(λ∗ ) − d(λ)] + λ∞ 2L(p + m)[d(λ∗ ) − d(λ)]. Proof Let A(λ) = {i > m : λi + L1 ∇i d(λ) < 0} and I (λ) = {1, 2, · · · , m + p}\A(λ), which means that the projection onto D is active and inactive, respectively. Then

1 − d(λ) d(λ∗ ) − d(λ) ≥ d ProjD λ + ∇d(λ) L

  a 1 ≥ ∇d(λ), ProjD λ + ∇d(λ) − λ L  2

 L 1  Proj −  λ + ∇d(λ) − λ D   2 L

 1 L b  − ∇i d(λ), λi  − λ2i + ∇i d(λ)2 = 2 2L i∈A(λ)

i∈I (λ)



 1 c  1 − ∇i d(λ), λi  + ∇i d(λ)2 ≥ 2 2L i∈A(λ)

d



 i∈I (λ)

e

= f



(3.12a)

i∈I (λ)

1 ∇i d(λ)2 2L

1 1 Ax − b2 + 2L 2L

(3.12b) 

(gi−m (x))2

i>m,i∈I (λ)

1 1  Ax − b2 + (max{0, gi (x)})2 , 2L 2L p

i=1

70

3 Accelerated Algorithms for Constrained Convex Optimization a

where we use the L-smoothness of the concave function  d(λ) in ≥, the   fact that 1 1 ProjD λi + L ∇i d(λ) = 0 if i ∈ A(λ) and ProjD λi + L ∇i d(λ) = λi +   b 1 1 2 L ∇i d(λ) if i ∈ I (λ) in =, the fact that λi ≤ λi , − L ∇i d(λ) for i ∈ A(λ) due to   c d λi , λi + L1 ∇i d(λ) ≤ 0 in ≥, ∇i d(λ), λi  ≤ −Lλ2i ≤ 0 for i ∈ A(λ) in ≥, (3.10) f

e

in =, and gi−m (x) ≥ 0 ⇒ i ∈ I (λ) in ≥. Since  i∈I (λ)

⎞2 ⎛  1 ⎝ ∇i d(λ)2 ≥ |∇i d(λ)|⎠ |I (λ)| i∈I (λ)



⎞2  1 ⎝ 1 ∇i d(λ), λi ⎠ , ≥ p + m λ∞ i∈I (λ)

we have 

 − ∇i d(λ), λi  ≤ λ∞ 2L(p + m)[d(λ∗ ) − d(λ)]

i∈I (λ)

from (3.12b) and 

− ∇i d(λ), λi  ≤ 2[d(λ∗ ) − d(λ)]

i∈A(λ)

from (3.12a). So by (3.11) and the strong duality, we have f (x) − f (x∗ ) = − λ, ∇d(λ) + d(λ) − d(λ∗ ) ≤ − λ, ∇d(λ)

 ≤ 2[d(λ∗ ) − d(λ)] + λ∞ 2L(p + m)[d(λ∗ ) − d(λ)].

On the other hand, we have   ∗ f (x ) = f (x ) + u , Ax − b + vi gi (x∗ ) ∗

a





p





i=1

  b ≤ f (x) + u∗ , Ax − b +

p  i=1

v∗i gi (x)

3.3 Accelerated Lagrange Multiplier Method

71

   ∗ c ≤ f (x) + u∗ , Ax − b + vi max{0, gi (x)} p

i=1

  p   √ ∗  ≤ f (x) + p + mλ  Ax − b2 + (max{0, gi (x)})2 , i=1 a

where we use the KKT condition (Definition A.26) in =, 

m   v∗i gi (x) x = argmin f (x) + u , Ax − b + ∗



x

b





i=1

c

in ≤, and v∗i ≥ 0 in ≤.



in the dual space has an From  Lemma 3.6, we can see that when the  algorithm  1 1 O K 2 convergence rate, it only has an O K convergence rate in the primal space. The rate can be improved in the primal space via averaging the primal solutions appropriately, e.g., see [18, 26, 27, 31, 36]. We omit the details.

3.3.2 Accelerated Augmented Lagrange Multiplier Method In Sect. 3.3, we use the accelerated gradient ascent to solve the dual problem. In this section, we consider the objective (3.3) induced by the augmented Lagrangian function, i.e., using the accelerated gradient ascent to maximize dβ (u). Lemma 3.4 shows that dβ (u) is smooth, no matter whether f is strongly convex or not. This is the advantage of the augmented Lagrange multiplier method over the Lagrange multiplier method. Specially, when applying Algorithm 2.6 in the dual space, we have the following iterations: θk (1 − θk−1 )  (λk − λk−1 ), λk = λk + θk−1 λk+1

(3.13)

 β ( = λk + β ∇d λk ),

 β ( λk ) is an error corrupted gradient of ∇dβ ( λk ). From Theorem 2.6 we where ∇d need 5   1 1 ∇dβ (  β ( λk ) − ∇d λ k ) ≤ 2+δ 2β (k + 1)

72

3 Accelerated Algorithms for Constrained Convex Optimization

for the second step. Define

  β λk , Ax − b + Ax − b2 . x∗k+1 = argmin f (x) +  2 x  β ( Then we have ∇dβ ( λk ) = Ax∗k+1 − b. Define ∇d λk ) = Axk+1 − b for some xk+1 , then we only need  ∗ Ax

2  ≤ k+1 − Axk+1

5 1 (k + 1)2+δ

1 . 2β

(3.14)

In conclusion, iteration (3.13) leads to the accelerated augmented Lagrange multiplier method [11], which is described in Algorithm 3.3. From Theorem 2.6, we can give the convergence rate of Algorithm 3.3 in the dual space directly. The convergence rate in the primal space can be recovered by Lemma 3.6. Algorithm 3.3 Accelerated augmented Lagrange multiplier method Initialize x0 = x−1 and θ0 = 1. for k = 0, 1, 2, 3, · · · do k−1 )  λk = λk + θk (1−θ (λk − λk−1 ), θk−1     λk , Ax − b + β2 Ax − b2 , xk+1 ≈ argminx f (x) +  λk+1 =  λk + β(Axk+1 − b). end for



θ 4 +4θ 2 −θ 2

k k k Theorem 3.5 Suppose that f (x) is convex. Let θ0 = 1, θk+1 = , 2 and (3.14) be satisfied, where δ can be any small positive constant. Then for Algorithm 3.3, we have

dβ (λ∗ ) − dβ (λK+1 ) ≤

4 (K + 2)2



1 18 2 λ0 − λ∗ 2 + + 2 . β 1 + 2δ δ

3.4 Alternating Direction Method of Multiplier and Its Non-ergodic Accelerated Variant In this section, we consider the following convex problem with a linear constraint and a separable objective function, which is the sum of two functions with decoupled variables: min f (x) + g(y), x,y

s.t. Ax + By = b.

(3.15)

3.4 Alternating Direction Method of Multiplier and Its Non-ergodic. . .

73

Introduce the augmented Lagrangian function Lβ (x, y, λ) = f (x) + g(y) + Ax + By − b, λ +

β Ax + By − b2 . 2

The alternating direction method of multiplier (ADMM) is a popular method to solve problem (3.15) and has broad applications, see, e.g., [2, 7, 21, 22] and references therein. It alternately updates x, y, and λ and we describe it in Algorithm 3.4. When f and g are not simple and A and B are non-unitary, the cost of solving the subproblems may be high. Thus the linearized ADMM is proposed by linearizing the augmentation term Ax + By − b2 and the complex f and g [14, 22, 34, 37] such that the subproblems may even have closed form solutions. For simplicity, we only consider the original ADMM. Algorithm 3.4 Alternating direction method of multiplier (ADMM) for k = 0, 1, 2, 3, · · · do xk+1 = argminx Lβk (x, yk , λk ), yk+1 = argminy Lβk (xk+1 , y, λk ), λk+1 = λk + βk (Axk+1 + Byk+1 − b). end for

In this section, we focus on the convergence rate analysis of ADMM for several scenarios.1 We first give several useful lemmas. The following one measures the optimality condition of the first two steps of ADMM and the KKT condition. Lemma 3.7 For Algorithm 3.4, we have 0 ∈ ∂f (xk+1 ) + AT λk + βk AT (Axk+1 + Byk − b), 0 ∈ ∂g(yk+1 ) + BT λk + βk BT (Axk+1 + Byk+1 − b), λk+1 − λk = βk (Axk+1 + Byk+1 − b), ∗

(3.16) (3.17)

T ∗

0 ∈ ∂f (x ) + A λ , ∗

0 ∈ ∂g(y∗ ) + BT λ∗ ,

(3.18)



(3.19)

Ax + By = b. Define two variables: ˆ (xk+1 ) = −AT λk − βk AT (Axk+1 + Byk − b), ∇f T T ˆ ∇g(y k+1 ) = −B λk − βk B (Axk+1 + Byk+1 − b).

1 The

four cases in Sects. 3.4.1–3.4.4 assume different conditions. They are not accelerated and cannot compare with each other. However, the convergence rate in Sect. 3.4.5 is truly accelerated.

74

3 Accelerated Algorithms for Constrained Convex Optimization

ˆ (xk+1 ) ∈ ∂f (xk+1 ) and ∇g(y ˆ Then we have ∇f k+1 ) ∈ ∂g(yk+1 ), and they further lead to the following lemma. Lemma 3.8 For Algorithm 3.4, we have 

 ˆ ∇g(y k+1 ), yk+1 − y = − λk+1 , Byk+1 − By

(3.20)

and     ˆ ˆ (xk+1 ), xk+1 − x + ∇g(y ∇f k+1 ), yk+1 − y = − λk+1 , Axk+1 + Byk+1 − Ax − By + βk Byk+1 − Byk , Axk+1 − Ax . (3.21) Proof From (3.17) we also have 

 ˆ (xk+1 ), xk+1 − x ∇f   = − AT λk + βk AT (Axk+1 + Byk − b), xk+1 − x = − λk+1 , Axk+1 − Ax + βk Byk+1 − Byk , Axk+1 − Ax

and  ˆ ), y − y = − λk+1 , Byk+1 − By . ∇g(y k+1 k+1



Adding them together, we can have (3.21).



The following lemma can be used to prove the monotonicity of ADMM. Lemma 3.9 For Algorithm 3.4, we have λk+1 − λk , Byk+1 − Byk  ≤ 0. Proof (3.20) gives 

 ˆ ∇g(y k ), yk − y + λk , Byk − By = 0.

(3.22)

Letting y = yk in (3.20) and y = yk+1 in (3.22) and adding them together, we have 

 ˆ ˆ ∇g(y k+1 ) − ∇g(y k ), yk+1 − yk + λk+1 − λk , Byk+1 − Byk  = 0.

Using the monotonicity of ∂g (Proposition A.11), we have the conclusion. Based on Lemma 3.9, we can give the monotonicity in the following lemma.



3.4 Alternating Direction Method of Multiplier and Its Non-ergodic. . .

75

Lemma 3.10 Let βk = β, ∀k. For Algorithm 3.4, we have 1 β 1 λk+1 − λk 2 + Byk+1 − Byk 2 ≤ λk − λk−1 2 2β 2 2β +

β Byk − Byk−1 2 . 2

Proof (3.21) gives 

   ˆ (xk ), xk − x + ∇g(y ˆ ∇f k ), yk − y = − λk , Axk + Byk − Ax − By + β Byk − Byk−1 , Axk − Ax .(3.23)

Letting (x, y, λ) = (xk , yk , λk ) in (3.21) and (x, y, λ) = (xk+1 , yk+1 , λk1 ) in (3.23), adding them together, and using (3.17), we have 

   ˆ (xk+1 ) − ∇f ˆ ˆ (xk ), xk+1 − xk + ∇g(y ˆ ∇f k+1 ) − ∇g(y k ), yk+1 − yk = − λk+1 − λk , Axk+1 + Byk+1 − Axk − Byk  + β Byk+1 − Byk − (Byk − Byk−1 ), Axk+1 − Axk  =−

1 λk+1 − λk , λk+1 − λk − (λk − λk−1 ) β

+ Byk+1 − Byk − (Byk − Byk−1 ), λk+1 −λk − βByk+1 − (λk − λk−1 − βByk )

a 1 λk − λk−1 2 − λk+1 − λk 2 − λk+1 − λk − (λk − λk−1 )2 = 2β β Byk − Byk−1 2 − Byk+1 − Byk 2 + 2

−Byk+1 − Byk − (Byk − Byk−1 )2 + Byk+1 − Byk − (Byk − Byk−1 ), λk+1 − λk − (λk − λk−1 )  1  λk − λk−1 2 − λk+1 − λk 2 = 2β  β Byk − Byk−1 2 − Byk+1 − Byk 2 + 2 & 1 λk+1 − λk − (λk − λk−1 )2 − 2β

76

3 Accelerated Algorithms for Constrained Convex Optimization

+

β Byk+1 − Byk − (Byk − Byk−1 )2 2

− Byk+1 − Byk − (Byk − Byk−1 ), λk+1 − λk − (λk − λk−1 ) ≤

 1  λk − λk−1 2 − λk+1 − λk 2 2β  β Byk − Byk−1 2 − Byk+1 − Byk 2 , + 2

a

where = uses (A.1). Using the monotonicity of ∂f and ∂g (Proposition A.11), we have the conclusion. 

Lemma 3.8 leads to the following lemma, which further leads to Lemma 3.12. The convergence rate of ADMM can be immediately obtained from Lemma 3.12. Lemma 3.11 For Algorithm 3.4, we have 

     ∗ ˆ (xk+1 ), xk+1 − x∗ + ∇g(y ˆ ∇f + λ∗ , Axk+1 + Byk+1 − b ), y − y k+1 k+1 1 1 1 λk − λ∗ 2 − λk+1 − λ∗ 2 − λk+1 − λk 2 2βk 2βk 2βk



βk βk βk Byk − By∗ 2 − Byk+1 − By∗ 2 − Byk+1 − Byk 2 . 2 2 2   Proof Letting (x, y, λ) = (x∗ , y∗ , λ∗ ) in (3.21), adding λ∗ , Axk+1 + Byk+1 − b to both sides, and using (3.17), (3.19), and the identities in Lemma A.1, we can have +



     ∗ ˆ (xk+1 ), xk+1 − x∗ + ∇g(y ˆ ∇f + λ∗ , Axk+1 + Byk+1 − b ), y − y k+1 k+1     = − λk+1 − λ∗ , Axk+1 + Byk+1 − b + βk Byk+1 − Byk , Axk+1 − Ax∗  1  λk+1 − λ∗ , λk+1 − λk + Byk+1 − Byk , λk+1 − λk  βk   −βk Byk+1 − Byk , Byk+1 − By∗

=−

1 1 1 λk − λ∗ 2 − λk+1 − λ∗ 2 − λk+1 − λk 2 2βk 2βk 2βk

a

=

βk βk βk Byk − By∗ 2 − Byk+1 − By∗ 2 − Byk+1 − Byk 2 2 2 2 + Byk+1 − Byk , λk+1 − λk  , +

a

where = uses (A.1). From Lemma 3.9, we can have the conclusion.

(3.24) 

3.4 Alternating Direction Method of Multiplier and Its Non-ergodic. . .

77

Lemma 3.12 Suppose that f (x) and g(y) are convex. Then for Algorithm 3.4, we have   f (xk+1 ) + g(yk+1 ) − f (x∗ ) − g(y∗ ) + λ∗ , Axk+1 + Byk+1 − b ≤

1 βk 1 λk − λ∗ 2 − λk+1 − λ∗ 2 + Byk − By∗ 2 2βk 2βk 2 −

βk Byk+1 − By∗ 2 . 2

(3.25)

If we further assume that g(y) is μ-strongly convex, then we have   f (xk+1 ) + g(yk+1 ) − f (x∗ ) − g(y∗ ) + λ∗ , Axk+1 + Byk+1 − b ≤

1 1 λk − λ∗ 2 − λk+1 − λ∗ 2 2βk 2βk +

βk βk μ Byk − By∗ 2 − Byk+1 − By∗ 2 − yk+1 − y∗ 2 . (3.26) 2 2 2

If we further assume that g(y) is L-smooth, then we have   f (xk+1 ) + g(yk+1 ) − f (x∗ ) − g(y∗ ) + λ∗ , Axk+1 + Byk+1 − b ≤

1 1 λk − λ∗ 2 − λk+1 − λ∗ 2 2βk 2βk +

βk βk 1 Byk − By∗ 2 − Byk+1 − By∗ 2 − ∇g(yk+1 ) − ∇g(y∗ )2 . 2 2 2L (3.27)

Proof We use Lemma 3.11 to prove these conclusions. From the convexity of f (x) and g(y), we have   f (xk+1 ) + g(yk+1 ) − f (x∗ ) − g(y∗ ) + λ∗ , Axk+1 + Byk+1 − b    a  ∗ ˆ ˆ (xk+1 ), xk+1 − x∗ + ∇g(y ≤ ∇f k+1 ), yk+1 − y   + λ∗ , Axk+1 + Byk+1 − b ≤

1 1 βk λk − λ∗ 2 − λk+1 − λ∗ 2 + Byk − By∗ 2 2βk 2βk 2 −

βk Byk+1 − By∗ 2 . 2

78

3 Accelerated Algorithms for Constrained Convex Optimization

When g(y) is strongly convex, we can have an extra a

μ 2 yk+1

− y∗ 2 in the left-

hand side of ≤ thus have (3.26). When g(y) is L-smooth, from (A.7) we can have an extra

1 ∗ 2 2L ∇g(yk+1 ) − ∇g(y )

a

in the left-hand side of ≤ thus have (3.27).



Based on the above properties, we can prove the convergence rate of ADMM. We consider four scenarios in the following sections, respectively. We also discuss the non-ergodic convergence rate of ADMM.

3.4.1 Generally Convex and Nonsmooth Case We first consider the scenario that f and g are  both generally convex and nonsmooth. In this case, [12] proved the ergodic O K1 convergence rate. The following theorem summarizes the result. Theorem 3.6 Suppose that f (x) and g(y) are convex. Let βk = β, ∀k. Then for Algorithm 3.4, we have |f (ˆxK+1 ) + g(ˆyK+1 ) − f (x∗ ) − g(y∗ )|

1 β 1 λ0 − λ∗ 2 + By0 − By∗ 2 ≤ K + 1 2β 2

λ∗  2 + λ0 − λ∗  + By0 − By∗  , K +1 β

2 1 ∗ ∗ λ0 − λ  + By0 − By  , AˆxK+1 + BˆyK+1 − b ≤ K +1 β where xˆ K+1 =

1 K+1

K+1  k=1

xk and yˆ K+1 =

1 K+1

K+1 

yk .

k=1

Proof Summing (3.25) over k = 0, 1, · · · , K, dividing both sides with K + 1, and using the convexity of f (x) and g(y), we have   f (ˆxK+1 ) + g(ˆyK+1 ) − f (x∗ ) − g(y∗ ) + λ∗ , AˆxK+1 + BˆyK+1 − b

1 1 β 1 ∗ 2 ∗ 2 ∗ 2 λ0 − λ  − λK+1 − λ  + By0 − By  . ≤ K + 1 2β 2β 2 From Lemma 3.2, we know that the left-hand side of the above inequality is non-negative, which leads to λK+1 − λ∗ 2 ≤ λ0 − λ∗ 2 + β 2 By0 − By∗ 2 .

3.4 Alternating Direction Method of Multiplier and Its Non-ergodic. . .

79

From (3.17), we have K    1   AˆxK+1 + BˆyK+1 − b =  (λk+1 − λk )  β(K + 1)  k=0

1 λK+1 − λ0  β(K + 1)   1 2λ0 − λ∗  + βBy0 − By∗  . ≤ β(K + 1) =



From Lemma 3.3, we can have the conclusion.

Ouyang et al. [30] and Lu et al. [24] also studied the accelerated ADMM when the objectives are composed of an L-smooth part and a nonsmoothpart. They  can partially accelerate ADMM with a better dependence on L, i.e., O KL2 + K1 .   However, their entire complexity remains O K1 . We omit the details.

3.4.2 Strongly Convex and Nonsmooth Case   When g is strongly convex, the convergence rate can be improved to O K12 via setting increasing βk , e.g., see [38]. Moreover, we do not use the acceleration techniques described for the unconstrained problem, e.g., the extrapolation technique and the using of multiple sequences. Via comparing the proofs with the previous scenario, the only difference is that we use the increasing penalty parameters here.2 Theorem 3.7 Assume that f (x) is convex and g(y) is μ-strongly convex.

−1 K K   μ 2 2 ˆK = βk βk xk+1 , and yˆ K = Let βk+1 ≤ βk + 2 βk , x

K 

k=0

−1 βk

B2

K 

k=0

k=0

βk yk+1 , then we have

k=0

|f (ˆxK+1 ) + g(ˆyK+1 ) − f (x∗ ) − g(y∗ )|   β02 1 1 ∗ 2 ∗ 2 λ0 − λ  + By0 − By  ≤ K 2 2 k=0 βk

2 In

fact, the faster rate is due to the stronger assumption, i.e., the strong convexity of g, rather than the acceleration technique.

80

3 Accelerated Algorithms for Constrained Convex Optimization

 λ∗   2λ0 − λ∗  + β0 By0 − By∗  , + K k=0 βk 1 AˆxK+1 + BˆyK+1 − b ≤ K

k=0 βk

  2λ0 − λ∗  + β0 By0 − By∗  .

Proof Multiplying both sides of (3.26) by βk and using yk+1 − y∗ 2 ≥ 1 ∗ 2 2 Byk+1 − By  , we have B2

   βk f (xk+1 ) + g(yk+1 ) − f (x∗ ) − g(y∗ ) + λ∗ , Axk+1 + Byk+1 − b ≤



β2 1 1 λk − λ∗ 2 − λk+1 − λ∗ 2 + k Byk − By∗ 2 2 2 2   2 βk μβk + − Byk+1 − By∗ 2 2 2B22 β2 1 1 λk − λ∗ 2 − λk+1 − λ∗ 2 + k Byk − By∗ 2 2 2 2 −

2 βk+1

2

Byk+1 − By∗ 2 .

Summing over k = 0, 1, · · · , K, and dividing both sides with

K

k=0 βk ,

we have

  f (ˆxK ) + g(ˆyK ) − f (x∗ ) − g(y∗ ) + λ∗ , AˆxK + BˆyK − b   β02 1 1 1 ∗ 2 ∗ 2 ∗ 2 λ0 − λ  − λK+1 − λ  + By0 − By  . ≤ K 2 2 2 k=0 βk Similar to the proof of Theorem 3.6, we have K      AˆxK+1 + BˆyK+1 − b = K  (λk+1 − λk )   k=0 βk k=0 1

1 = K

λK+1 − λ0 

1 ≤ K

  2λ0 − λ∗  + β0 By0 − By∗  .

k=0 βk

k=0 βk

Similar to the induction of Theorem 3.6, we can have the conclusion.



 By using the following lemma, we see that the convergence rate is O K12 .



3.4 Alternating Direction Method of Multiplier and Its Non-ergodic. . .

Lemma 3.13 Let βk = 1 K

k=0 βk



6B22 . μ(K+1)2

μ(k+1) . 3B22

2 Then {βk } satisfy βk+1 ≤ βk2 +

81 μ β B22 k

and

3.4.3 Generally Convex and Smooth Case Now we consider the scenario that g is smooth and both f and g are generally convex. We describe the results in [35] in the following theorem. Theorem 3.8 Assume that f (x) and g(y) are convex and g(y) is L-smooth.

−1 K K  1  1 σ2 1 ˆ ˆK = Let 1 2 + 2Lβ ≥ , x = K 2 βk βk xk+1 , and y k

2βk

K 

k=0

1 βk

−1

2βk+1

K  k=0

1 βk yk+1 ,

k=0

k=0

where σ = σmin (B), then we have

  f (ˆxK+1 ) + g(ˆyK+1 ) − f (x∗ ) − g(y∗ ) + λ∗ , AˆxK+1 + BˆyK+1 − b   1 1 1 ≤ K 1 λ0 − λ∗ 2 + By0 − By∗ 2 , 2 2 2β 0 k=0 βk   1 ∗ 2 2 ∗ 2 ∗ 2 λK+1 − λ  ≤ βK+1 λ0 − λ  + By0 − By  . β02 Proof From BT λ ≥ σ λ, where σ = σmin (B), (3.17), (3.16), and (3.18), we have σ2 1 λk+1 − λ∗ 2 ≤ BT (λk+1 − λ∗ )2 2L 2L 1 BT (λk − λ∗ ) + βBT (Axk+1 + Byk+1 − b)2 = 2L 1 ∇g(yk+1 ) + BT λ∗ 2 = 2L 1 ∇g(yk+1 ) − ∇g(y∗ )2 . = (3.28) 2L

82

3 Accelerated Algorithms for Constrained Convex Optimization

Dividing both sides of (3.27) by βk and using (3.28) and

1 2βk2

+

σ2 2Lβk



1 2 , 2βk+1

we

have   1  f (xk+1 ) + g(yk+1 ) − f (x∗ ) − g(y∗ ) + λ∗ , Axk+1 + Byk+1 − b βk   1 σ2 1 ∗ 2 λk − λ  − + λk+1 − λ∗ 2 ≤ 2Lβk 2βk2 2βk2 1 1 + Byk − By∗ 2 − Byk+1 − By∗ 2 2 2 1 1 1 ≤ λk − λ∗ 2 − 2 λk+1 − λ∗ 2 + Byk − By∗ 2 2 2 2βk 2βk+1 1 − Byk+1 − By∗ 2 . 2 Summing over k = 0, 1, · · · , K and dividing both sides with

K

1 k=0 βk ,

we have

  f (ˆxK+1 ) + g(ˆyK+1 ) − f (x∗ ) − g(y∗ ) + λ∗ , AˆxK+1 + BˆyK+1 − b   1 1 1 1 ∗ 2 ∗ 2 ∗ 2 ≤ K 1 λ0 − λ  − 2 λK+1 − λ  + By0 − By  . 2 2β02 2βK+1 k=0 β k

The proof is complete.



By using the following lemma, we see that the convergence rate is O B is of full row rank. Lemma 3.14 Suppose σ > 0 and let 1 2 2βk+1

and

K

1

1 k=0 βk



6L . σ 2 (K+1)2

1 βk

=

σ 2 (k+1) 1 3L . Then {βk } satisfy 2β 2 k

1 K2





when 2

σ + 2Lβ ≥ k

Different from the above two scenarios, Theorem 3.8 does not measure the distance to the optimal objective value and the violation of the constraint. In fact, as claimed in the following remark, we may not prove a faster rate for the violation of the constraint. Although the convergence is proven in [35], the convergence rate is very slow, rather than “accelerated.” We involve this scenario only for providing a complete comparison with other scenarios.

3.4 Alternating Direction Method of Multiplier and Its Non-ergodic. . .

83

Remark 3.1 Different from Theorem 3.7, we have  K   k=0 AˆxK+1 + BˆyK+1 − b =  

1 βk (Axk+1 + Byk+1 K 1 k=0 βk

1

= K

1 k=0 βk

1 ≤ K

1 k=0 βk

1  K

K   λ  k+1 − λk       βk2 k=0

K  λk − λ∗  + λk+1 − λ∗  βk2 k=0 K  2C

1 k=0 βk k=0

where we let C =



1 λ0 β02

 − b)    

βk

= 2C,

− λ∗ 2 + By0 − By∗ 2 . The above suggests that

AˆxK+1 + BˆyK+1 − b may not be decreasing. Similarly, we also have AxK+1 + K+1  ByK+1 − b = λK −λ  2C. The reason is that we use decreasing penalty βK parameters {βk }.

3.4.4 Strongly Convex and Smooth Case At last, we discuss the scenario that g is both strongly convex and smooth. In this case, the faster convergence rate can be proven via carefully choosing the penalty parameter [10]. Similar to the previous scenarios, we also do not use the acceleration techniques described in the previous section. Theorem 3.9 Assume that f (x) is convex and g(y) is μ-strongly convex and Lsmooth. Assume that BT λ ≥ σ λ, ∀λ, where σ = σmin (B) > 0. Let βk = β = √ μL σ B2 , then we have 1 β λk+1 − λ∗ 2 + Byk+1 − By∗ 2 2β 2

1 1 β  ≤ λk − λ∗ 2 + Byk − By∗ 2 . μ σ 2β 2 1 + 12 L B2

84

3 Accelerated Algorithms for Constrained Convex Optimization

Proof From (3.26)–(3.28), and Lemma 3.2, we have μ 1 1 λk − λ∗ 2 − λk+1 − λ∗ 2 Byk+1 − By∗ 2 ≤ 2 2β 2β 2B2 β β + Byk − By∗ 2 − Byk+1 − By∗ 2 (3.29) 2 2 and σ2 1 1 λk+1 − λ∗ 2 ≤ λk − λ∗ 2 − λk+1 − λ∗ 2 2L 2β 2β β β + Byk − By∗ 2 − Byk+1 − By∗ 2 . 2 2

(3.30)

Multiplying (3.30) by t, multiplying (3.29) by 1 − t, adding them together, and rearranging the terms, we have

σ 2t 1 + 2L 2β ≤

Letting

σ 2 βt L



 β μ(1 − t) + λk+1 − λ  + Byk+1 − By∗ 2 2 2B22

μ(1−t) , βB22

we have t =

μL μL+B22 σ 2 β 2

and



1 μβσ 2 β ∗ 2 ∗ 2 λ By + 1 − λ  + − By  k+1 k+1 2β 2 μL + B22 σ 2 β 2 β 1 λk − λ∗ 2 + Byk − By∗ 2 . 2β 2

Letting β = √1μ



∗ 2

1 β λk − λ∗ 2 + Byk − By∗ 2 . 2β 2 =



1+ 12



σ L B2

√ μL σ B2 ,

which maximizes

μβσ 2 μL+B22 σ 2 β 2

+ 1, we have

.

Theorem 3.9 shows that the convergence is linear.

1 μβσ 2 +1 2 2 μL+B2 2σ β

= 

3.4 Alternating Direction Method of Multiplier and Its Non-ergodic. . .

85

3.4.5 Non-ergodic Convergence Rate In Theorems 3.6–3.8, the convergence rates are all in the ergodic sense. Now, we discuss the non-ergodic convergence in this section. We only consider the scenario that f and g are both generally convex and nonsmooth.

3.4.5.1

Original ADMM

  We first give the O √1 non-ergodic convergence rate of the original ADMM. K The result was first proven in [13] and then extended in [6]. Theorem 3.10 Let βk = β, ∀k. For Algorithm 3.4, we have 5 C ≤ f (xK+1 ) + g(yK+1 ) − f (x∗ ) − g(y∗ ) β(K + 1) 5 C C 2C ∗ ≤ + λ  +√ , K +1 β(K + 1) K +1 5 C , AxK+1 + ByK+1 − b ≤ β(K + 1) ∗

−λ 

where C = β1 λ0 − λ∗ 2 + βBy0 − By∗ 2 . 1 Proof From Lemma 3.10, we know that 2β λk+1 − λk 2 + β2 Byk+1 − Byk 2 is decreasing. From Lemmas 3.2 and 3.11 and the convexity of f and g, we have

1 β λk+1 − λk 2 + Byk+1 − Byk 2 2β 2 ≤

1 1 λk − λ∗ 2 − λk+1 − λ∗ 2 2β 2β β β + Byk − By∗ 2 − Byk+1 − By∗ 2 . 2 2

Summing over k = 0, 1, · · · , K − 1, we have 1 λK+1 − λK 2 + βByK+1 − ByK 2 β

1 1 λ0 − λ∗ 2 + βBy0 − By∗ 2 . ≤ K +1 β

(3.31)

86

3 Accelerated Algorithms for Constrained Convex Optimization

Then we have  βAxK+1 + ByK+1 − b = λK+1 − λK  ≤ 5 ByK+1 − ByK  ≤

βC , K +1

C . β(K + 1)

On the other hand, (3.31) gives 1 β 1 β λk+1 − λ∗ 2 + Byk+1 − By∗ 2 ≤ λk − λ∗ 2 + Byk − By∗ 2 2β 2 2β 2 ≤

1 β λ0 − λ∗ 2 + By0 − By∗ 2 2β 2

=

1 C. 2

So we have λK+1 − λ∗  ≤

 5

ByK+1 − By∗  ≤

βC, C . β

Then from (3.24) and the convexity of f and g, we have   f (xK+1 ) − f (x∗ ) + g(yK+1 ) − g(y∗ ) + λ∗ , AxK+1 + ByK+1 − b ≤

1 λK+1 − λ∗ λK+1 − λK  + ByK+1 − ByK λK+1 − λK  β + βByK+1 − ByK ByK+1 − By∗ 



2C C +√ . K +1 K +1

From Lemma 3.3, we can have the conclusion.

3.4.5.2



ADMM with Extrapolation and Increasing Penalty Parameter

  Now we describe the results in [19], which gives an improved O K1 non-ergodic convergence rate. Both the extrapolation and the increasing penalty parameters are used to build the method, which is described in Algorithm 3.5.

3.4 Alternating Direction Method of Multiplier and Its Non-ergodic. . .

87

Algorithm 3.5 Accelerated alternating direction method of multiplier (AccADMM) Initialize θ0 = 1. for k = 1, 2, 3, · · · do k Solve θk via 1−θ θk = vk = yk +

1 θk−1 − τ , θk (1−θk−1 ) (yk − yk−1 ), θk−1 

 xk+1 = argminx f (x) + λk , Ax + 2θβk Ax + Bvk − b2 ,   yk+1 = argminy g(y) + λk , By + 2θβk Axk+1 + By − b2 , λk+1 = λk + βτ (Axk+1 + Byk+1 − b). end for

Define several auxiliary variables λk+1 = λk +

 β  k+1 Ax + Bvk − b , θk

β(1 − θk ) λˆ k = λk + (Axk + Byk − b) , θk zk+1 = k+1 and let θk satisfy 1−θ θk+1 = following lemma.

1 1 − θk yk+1 − yk , θk θk 1 θk

− τ , θ0 = 1, and θ−1 = 1/τ . Then we first give the

Lemma 3.15 For the definitions of λk+1 , λˆ k , λk , zk+1 , yk+1 , vk , and θk , we have β λˆ k+1 − λˆ k = [Axk+1 + Byk+1 − b − (1 − θk )(Axk + Byk − b)] , θk λˆ k+1 − λk+1  =

β Byk+1 − Bvk , θk

 β λˆ K+1 − λˆ 0 = (AxK+1 + ByK+1 − b) + βτ (Axk + Byk − b) , θK K

k=1

vk − (1 − θk )yk = θk zk . Proof From the definitions of λˆ k and λk+1 and

1−θk+1 θk+1

1 − θk+1 λˆ k+1 = λk+1 + β (Axk+1 + Byk+1 − b) θk+1

1 − τ (Axk+1 + Byk+1 − b) = λk+1 + β θk

=

1 θk

− τ , we have

88

3 Accelerated Algorithms for Constrained Convex Optimization

= λk + βτ (Axk+1 + Byk+1 − b) + β = λk +

1 − τ (Axk+1 + Byk+1 − b) θk

β (Axk+1 + Byk+1 − b) θk

(3.32a)

1 − θk β = λˆ k − β (Axk + Byk − b) + (Axk+1 + Byk+1 − b) θk θk

(3.32b)

β = λˆ k + [Axk+1 + Byk+1 − b − (1 − θk )(Axk + Byk − b)] . θk On the other hand, from (3.32a) and the definition of λk+1 we have

From (3.32b),

1−θk θk

λˆ K+1 − λˆ 0 =

λˆ k+1 − λk+1 2 =

β B(yk+1 − vk )2 . θk

=

= τ , we have

1 θk−1

− τ , and

1 θ−1

K    λˆ k+1 − λˆ k k=0

' K &  1 1 − θk (Axk+1 + Byk+1 − b) − =β (Axk + Byk − b) θk θk k=0

K &  1 1 (Axk+1 + Byk+1 − b) − (Axk + Byk − b) θk θk−1 k=0

+ τ (Axk + Byk − b)



 β (AxK+1 + ByK+1 − b) + βτ (Axk + Byk − b) . θK K

=

k=1

For the last identity, we have (1 − θk )yk + θk zk = (1 − θk )yk + = yk +

θk θk−1

[yk − (1 − θk−1 )yk−1 ]

θk (1 − θk−1 ) (yk − yk−1 ). θk−1

The right-hand side is the definition of vk .



The following lemma plays the role of Lemma 2.4 in the unconstrained optimization.

3.4 Alternating Direction Method of Multiplier and Its Non-ergodic. . .

89

Lemma 3.16 Suppose that f (x) and g(y) are convex. Then for Algorithm 3.5, we have f (xk+1 ) + g(yk+1 ) − f (x) − g(y)   β Byk+1 − Bvk , Byk+1 − By . ≤ − λk+1 , Axk+1 + Byk+1 − Ax − By − θk (3.33) Proof Let ˆ (xk+1 ) ≡ −AT λk − β AT (Axk+1 + Bvk − b) = −AT λk+1 , ∇f θk β T T ˆ B (Axk+1 + Byk+1 − b) ∇g(y k+1 ) ≡ −B λk − θk = −BT λk+1 −

β T B B(yk+1 − vk ). θk

ˆ ˆ (xk+1 ) ∈ ∂f (xk+1 ) and ∇g(y For Algorithm 3.5, we have ∇f k+1 ) ∈ ∂g(yk+1 ). From the convexity of f and g, we have     ˆ (xk+1 ), xk+1 − x = − λk+1 , Axk+1 − Ax f (xk+1 ) − f (x) ≤ ∇f and   ˆ g(yk+1 ) − g(y) ≤ ∇g(y k+1 ), yk+1 − y  β  Byk+1 − Bvk , Byk+1 − By . = − λk+1 , Byk+1 − By − θk Adding them together, we can have the conclusion.

 



The following lemma plays a crucial role for the O K1 non-ergodic convergence rate and it is close to the final conclusion except the constraint violation. Lemma 3.17 Suppose that f (x) and g(y) are convex. With the definitions in Lemma 3.15, for Algorithm 3.5 we have   f (xK+1 ) + g(yK+1 ) − f (x∗ ) − g(y∗ ) + λ∗ , AxK+1 + ByK+1 − b

 1 ˆ β ∗ 2 ∗ 2  (3.34) ≤ θK λ0 − λ  + Bz0 − By 2β 2

90

3 Accelerated Algorithms for Constrained Convex Optimization

and   K   1    (Axk + Byk − b)  (AxK+1 + ByK+1 − b) + τ   θK k=1

  2 ≤ λˆ 0 − λ∗  + Bz0 − By∗  . β

(3.35)

Proof Letting x = x∗ and x = xk in (3.33), respectively, we obtain two inequalities. Multiplying the first inequality by θk , multiplying the second by 1 − θk , adding them together, and using Ax∗ + By∗ = b, we have f (xk+1 ) + g(yk+1 ) − (1 − θk )(f (xk ) + g(yk )) − θk (f (x∗ ) + g(y∗ ))   ≤ − λk+1 , Axk+1 + Byk+1 − b − (1 − θk )(Axk + Byk − b) −

 β  Byk+1 − Bvk , Byk+1 − (1 − θk )Byk − θk By∗ . θk

Dividing both sides by θk , adding   1 − θk ∗ 1 λ , (Axk+1 + Byk+1 − b) − (Axk + Byk − b) θk θk to both sides, and using Ax − Ax∗ = Ax − b + By∗ and Lemmas 3.15 and A.1, we have   f (xk+1 ) + g(yk+1 ) − f (x∗ ) − g(y∗ ) + λ∗ , Axk+1 + Byk+1 − b θk    1 − θk f (xk ) + g(yk ) − f (x∗ ) − g(y∗ ) + λ∗ , Axk + Byk − b − θk  1 ≤ − λk+1 − λ∗ , λˆ k+1 − λˆ k β  β  − 2 Byk+1 − Bvk , Byk+1 − (1 − θk )Byk − θk By∗ θk   a 1 λˆ k − λ∗ 2 − λˆ k+1 − λ∗ 2 − λˆ k − λk+1 2 + λˆ k+1 − λk+1 2 = 2β 2 β  + 2 Bvk − (1 − θk )Byk − θk By∗  2θk   2 − Byk+1 − (1 − θk )Byk − θk By∗  − Byk+1 − Bvk 2

3.4 Alternating Direction Method of Multiplier and Its Non-ergodic. . .



91

 1  ˆ λk − λ∗ 2 − λˆ k+1 − λ∗ 2 2β     β  Bzk − By∗ 2 − Bzk+1 − By∗ 2 , + 2

a

where = uses (A.3) and (A.1). Using over k = 0, 1, · · · , K, we have

1−θk θk

=

1 θk−1

− τ and θ−1 = 1/τ and summing

  f (xK+1 ) + g(yK+1 ) − f (x∗ ) − g(y∗ ) + λ∗ , AxK+1 + ByK+1 − b θK +τ

K     f (xk ) + g(yk ) − f (x∗ ) − g(y∗ ) + λ∗ , Axk + Byk − b k=1



 β 2 1  ˆ λ0 − λ∗ 2 − λˆ K+1 − λ∗ 2 + Bz0 − By∗  . 2β 2

From Lemma 3.2, we have   f (xK+1 ) + g(yK+1 ) − f (x∗ ) − g(y∗ ) + λ∗ , AxK+1 + ByK+1 − b θK   β 2 1 λˆ 0 − λ∗ 2 − λˆ K+1 − λ∗ 2 + Bz0 − By∗  . ≤ 2β 2 So we can have (3.34) and λˆ K+1 − λ∗  ≤



λˆ 0 − λ∗ 2 + β 2 Bz0 − By∗ 2   ≤ λˆ 0 − λ∗  + β Bz0 − By∗  ,

which leads to   λˆ K+1 − λˆ 0  ≤ 2λˆ 0 − λ∗  + β Bz0 − By∗  . 

From Lemma 3.15, we can have (3.35).

We need to bound the violation of constraint in the form of Ax + By − b, rather than (3.35). The following lemma provides a useful tool for it. Lemma 3.18 Consider a sequence {ak }∞ k=1 of vectors, if {ak } satisfies   K      ak  ≤ c, [1/τ + K(1/τ − 1)]aK+1 +  

∀K = 0, 1, 2, · · · ,

k=1

    where 0 < τ < 1. Then  K a k=1 k  < c for all K = 1, 2, · · · .

92

3 Accelerated Algorithms for Constrained Convex Optimization

 Proof Let sK = K k=1 ak , ∀K ≥ 1, and s0 = 0. For each K ≥ 0, there exists cK+1 with every entry (cK+1 )i ≥ 0 such that −(cK+1 )i ≤ [1/τ + K(1/τ − 1)](aK+1 )i + (sK )i ≤ (cK+1 )i , and cK+1  = c. Then −(cK+1 )i − (sK )i (cK+1 )i − (sK )i ≤ (aK+1 )i ≤ , ∀K ≥ 0, 1/τ + K(1/τ − 1) 1/τ + K(1/τ − 1) where we use 1/τ > 1 and 1/τ + K(1/τ − 1) > 0. Thus for all K ≥ 0, we have (sK+1 )i = (aK+1 )i + (sK )i ≤

(cK+1 )i − (sK )i + (sK )i 1/τ + K(1/τ − 1)

=

(K + 1)(1/τ − 1) (cK+1 )i + (sK )i . 1/τ + K(1/τ − 1) 1/τ + K(1/τ − 1)

By recursion, we have (sK+1 )i ≤

(cK+1 )i 1/τ + K(1/τ − 1) +

(cK )i (K + 1)(1/τ − 1) 1/τ + K(1/τ − 1) 1/τ + (K − 1)(1/τ − 1)

+

K(1/τ − 1) (cK−1 )i (K + 1)(1/τ − 1) 1/τ + K(1/τ − 1) 1/τ + (K − 1)(1/τ − 1) 1/τ + (K − 2)(1/τ − 1)

+ ··· ⎡ K+1 8 +⎣ &

j =2

⎤ j (1/τ − 1) ⎦ 1/τ + (j − 1)(1/τ − 1)

(c1 )i 1/τ − 1 × + (s0 )i 1/τ + 0(1/τ − 1) 1/τ + 0(1/τ − 1) =

K+1  k=1

'

K+1 8 (ck )i j (1/τ − 1) , 1/τ + (k − 1)(1/τ − 1) 1/τ + (j − 1)(1/τ − 1) j =k+1

3.4 Alternating Direction Method of Multiplier and Its Non-ergodic. . .

where we set

K+1

j (1/τ −1) j =K+2 1/τ +(j −1)(1/τ −1)

rk =

93

= 1. Define

K+1 8 j (1/τ − 1) 1 , 1/τ + (k − 1)(1/τ − 1) 1/τ + (j − 1)(1/τ − 1) j =k+1

∀k = 1, 2, · · · , K + 1. Then we have rk > 0 and (sK+1 )i ≤ K+1 rk (ck )i . Thus (sK+1 )i ≥ − k=1 |(sK+1 )i | ≤

K+1 k=1

K+1 

rk (ck )i . Similarly, we also have

rk (ck )i .

k=1

Further define RK =

K  k=1

K K 8  1 j (1/τ − 1) = rk , 1/τ + (k − 1)(1/τ − 1) 1/τ + (j − 1)(1/τ − 1) j =k+1

k=1

then R1 =

1  k=1

1 8 1 j (1/τ − 1) = τ, 1/τ + (k − 1)(1/τ − 1) 1/τ + (j − 1)(1/τ − 1) j =k+1

and  1 1 = + 1/τ + K(1/τ − 1) 1/τ + (k − 1)(1/τ − 1) K

RK+1

k=1

×

K+1 8 j =k+1

j (1/τ − 1) 1/τ + (j − 1)(1/τ − 1)

(K + 1)(1/τ − 1)  1 1 + = 1/τ + K(1/τ − 1) 1/τ + K(1/τ − 1) 1/τ + (k − 1)(1/τ − 1) K

k=1

×

K 8 j =k+1

=

j (1/τ − 1) 1/τ + (j − 1)(1/τ − 1)

(K + 1)(1/τ − 1) 1 + RK . 1/τ + K(1/τ − 1) 1/τ + K(1/τ − 1)

94

3 Accelerated Algorithms for Constrained Convex Optimization

Next, we prove RK < 1, ∀K ≥ 1, by induction. It can be easily checked that R1 = τ < 1. Assume that RK < 1 holds, then RK+1
0 The results for the scenario that h is strongly convex can be obtained by studying problem of − minx maxy L(x, y) and using the conclusions in the previous section. We omit the details.

3.5.4 Case 4: μg > 0, μh > 0 At last, we consider the scenario that both g and h are strongly convex and prove the faster convergence rate by setting the parameters carefully. Theorem 3.14 Assume that g is μg -strongly convex and h is μh -strongly con −1 K K  1  1 √1 ˆ vex. Let θk = θ = , ∀k, x = x , yˆ K = K μg μh θk θ k k+1

K 

k=0

−1 1 θk

1+

K  k=0

1 y , θ k k+1

K2

σk = σ =

1 K2



k=0

μg μh ,

k=0

and τk = τ =

1 K2



μh μg ,

then

we have L(ˆxK , y∗ ) − L(x∗ , yˆ K ) ≤ θ K



1 ∗ 1 ∗ x − x0 2 + y − y0 2 . 2τ 2σ

Proof From Lemma 3.19 and using (3.38) with θk−1 = θ , we have L(xk+1 , y∗ ) − L(x∗ , yk+1 )

μg 1 1 ∗ 1 ∗ 2 x − xk  − + y − yk 2 ≤ x∗ − xk+1 2 + 2τ 2τ 2 2σ

K22 θ 2 σ μh 1 1 + xk − xk−1 2 − xk+1 − xk 2 y∗ − yk+1 2 + − 2σ 2 2 2τ     + Kxk+1 − Kxk , y∗ − yk+1 − θ Kxk − Kxk−1 , y∗ − yk .

102

3 Accelerated Algorithms for Constrained Convex Optimization

Letting μg 1 1 ≤ + , 2θ τ 2τ 2

1 1 μh ≤ + , 2θ σ 2σ 2

K22 θ σ ≤

1 , τ

which are satisfied by the definitions of τ , σ , and θ , we have L(xk+1 , y∗ ) − L(x∗ , yk+1 ) ≤

1 ∗ 1 ∗ x − xk 2 + y − yk 2 2τ 2σ

1 1 ∗ 1 ∗ x − xk+1 2 + y − yk+1 2 − θ 2τ 2σ θ 1 xk − xk−1 2 − xk+1 − xk 2 2τ 2τ     + Kxk+1 − Kxk , y∗ − yk+1 − θ Kxk − Kxk−1 , y∗ − yk . +

Dividing both sides by θ k , we have 1 1 L(xk+1 , y∗ ) − k L(x∗ , yk+1 ) k θ θ

1 ∗ 1 1 ∗ ≤ k x − xk 2 + y − yk 2 θ 2τ 2σ

1 ∗ 1 1 ∗ 2 2 − k+1 x − xk+1  + y − yk+1  2τ 2σ θ 1 1 1 1 xk − xk−1 2 − k xk+1 − xk 2 θ 2τ θ k−1 2τ   1  1  + k Kxk+1 − Kxk , y∗ − yk+1 − k−1 Kxk − Kxk−1 , y∗ − yk . θ θ +

Summing over k = 0, 1, · · · , K, and using x0 = x−1 and the convexity of f and g, we have K   1 6 7 L(ˆxK , y∗ ) − L(x∗ , yˆ K ) k θ k=0



Using K22 θ σ ≤

1 ∗ 1 ∗ x − x0 2 + y − y0 2 2τ 2σ 1 1 1 1 ∗ xK+1 − xK 2 − K+1 y − yK+1 2 − K θ 2τ 2σ θ  1  + K KxK+1 − KxK , y∗ − yK+1 . θ 1 τ

and

K

1 k=0 θ k

>

1 , θK

we can have the conclusion.



3.6 Faster Frank–Wolfe Algorithm

103

3.6 Faster Frank–Wolfe Algorithm The Frank–Wolfe method [8], also called the conditional gradient method, is a firstorder projection-free method for minimizing a convex function over a convex set. It works by iteratively solving a linear optimization problem and remaining inside the feasible set. It avoids the projection step and has the advantage of keeping the “sparsity” of its solution, in the sense that the solution is a linear combination of finite extreme points (Definition A.6) of the feasible set. So it is particularly suitable for solving sparse and low-rank problems. For these reasons, the Frank– Wolfe method has drawn growing interest in recent years, especially in matrix completion, structural SVM, object tracking, sparse PCA, metric learning, and many other settings (e.g., [16]). The convergence rate of the Frank–Wolfe method is O(1/k) for generally convex optimization [15]. In this section, we follow [9] to show that under stronger assumptions, the convergence rate of the Frank–Wolfe algorithm can be improved to O(1/k 2 ) (since this rate is obtained with stronger conditions, we do not called it “accelerated.”). More specifically, we need to assume that the objective function is smooth and satisfies the quadratic functional growth condition and the set is a strongly convex set. Definition 3.1 (Quadratic Functional Growth) A function satisfies the quadratic functional growth condition over set K if f (x) − f (x∗ ) ≥

μ x − x∗ 2 , 2

∀x ∈ K,

where x∗ is the minimizer of f (x) over K. If f is a strongly convex function over K (which implies that K is convex), it naturally satisfies the quadratic functional growth condition. Thus the quadratic functional growth condition is weaker than the condition of strong convexity. Another concept is the strongly convex set. Let  ·  and  · ∗ be a pair of dual norms (Definition A.3) over Rn . Definition 3.2 (Strongly Convex Set) We say that a convex set K ⊂ Rn is αstrongly convex with respect to  ·  if for any x and y ∈ K, any γ ∈ [0, 1], and any vector z ∈ Rn such that z = 1, it holds that γ x + (1 − γ )y +

αγ (1 − γ ) x − y2 z ∈ K. 2

Many convex sets in sparse and low-rank optimization are strongly convex sets [9]. Examples include the p ball for p ∈ (1, 2] and the Schatten-p ball for p ∈ (1, 2]. We consider the following general constrained convex problem: min f (x), x

s.t. x ∈ K

104

3 Accelerated Algorithms for Constrained Convex Optimization

under the following assumptions: Assumption 3.1 1. f (x) satisfies the quadratic functional growth condition. 2. f (x) is L-smooth. 3. K is an αK -strongly convex set. We describe the Frank–Wolfe method in Algorithm 3.7. We can see that xk+1 is a convex combination of xk and pk , with pk ∈ K. Thus we can simply prove xk ∈ 2 K, ∀k ≥ 0 by induction. In the traditional Frank–Wolfe algorithm, ηk is set as k+2 [15]. Here we give a complex setting to fit the proof. Algorithm 3.7 Faster Frank–Wolfe method Initialize x0 ∈ K. for k = 0, 1, 2, 3, · · · do pk = argminp∈K p, ∇f (xk ), 4 3 (xk )∗ , ηk = min 1, αK ∇f 4L xk+1 = xk + ηk (pk − xk ). end for

We first give the following lemma to describe the decreasing property of hk = f (xk ) − f (x∗ ) at each iteration. Lemma 3.20 Suppose that Assumption 3.1 holds. For Algorithm 3.7, we have 9 αK ∇f (xk )∗ 1 ,1 − . ≤ hk max 2 8L

hk+1

Proof From the optimality of pk , we have   pk − xk , ∇f (xk ) ≤ x∗ − xk , ∇f (xk ) ≤ f (x∗ ) − f (xk ) = −hk . Denote wk = argminw=1 w, ∇f (xk ), then we have w, ∇f (xk ) = pk = 12 (pk + −∇f (xk )∗ . Using the strong convexity of the set K, we have  αK 2 xk ) + 8 xk − pk  wk ∈ K. So we have pk − xk , ∇f (xk ) ≤  pk − xk , ∇f (xk ) =

αK xk − pk 2 1 pk − xk , ∇f (xk ) + wk , ∇f (xk ) 2 8

1 αK xk − pk 2 ∇f (xk )∗ . ≤ − hk − 2 8

(3.42)

3.6 Faster Frank–Wolfe Algorithm

105

On the other hand, from the smoothness of f we have f (xk+1 ) ≤ f (xk ) + ηk pk − xk , ∇f (xk ) +

Lηk2 pk − xk 2 . 2

Subtracting f (x∗ ) from both sides, we have hk+1 ≤ hk + ηk pk − xk , ∇f (xk ) +

Lηk2 pk − xk 2 . 2

Plugging (3.42), we have hk+1 If



 ηk  pk − xk 2 ηk αK ∇f (xk )∗ 2 + Lηk − . ≤ hk 1 − 2 2 4

αK ∇f (xk )∗ 4

≥ L, then ηk = 1. So we have hk+1 ≤

Otherwise, ηk =

αK ∇f (xk )∗ 4L

hk+1

hk . 2

and we have



αK ∇f (xk )∗ , ≤ hk 1 − 8L 

which completes the proof.

Based on Lemma 3.20, we can prove the convergence rate of the Frank–Wolfe algorithm in the following theorem. α



μ

Theorem 3.15 Suppose that Assumption 3.1 holds. Let M = K√ and denote 8 2L D = maxx,y∈K x − y as the diameter of K. For Algorithm 3.7, we have 3 f (xk ) − f (x∗ ) ≤

max

9 2 ∗ −2 2 LD , 4(f (x0 ) − f (x )), 18M (k + 2)2

Proof From the quadratic functional growth condition, we have f (x) − f (x∗ ) ≥

μ x − x∗ 2 . 2

4 .

106

3 Accelerated Algorithms for Constrained Convex Optimization

So we have   f (x) − f (x∗ ) ≤ x − x∗ , ∇f (x) ≤ x − x∗ ∇f (x)∗ 5 2 (f (x) − f (x∗ ))∇f (x)∗ , ≤ μ which leads to 

μ (f (x) − f (x∗ )) ≤ ∇f (x)∗ . 2

Using Lemma 3.20, we have

hk+1

 9 1 ≤ hk max , 1 − M hk . 2

(3.43)

3 C 9 2 , where C = max Now we use induction to prove hk ≤ (k+2) 2 2 LD , 4(f (x0 ) − f 4 C (x∗ )), 18M −2 . It holds for k = 0 trivially. Assume that hk ≤ (k+2) 2 for k = t. We then consider k =√t + 1. If 12 ≥ 1 − M ht , we have ht+1 ≤ If

1 2

C ht C ≤ ≤ . 2 2 2(t + 2) (t + 3)2

√ < 1 − M ht and ht ≤

C , 2(t+2)2

ht+1 ≤ ht ≤ If

1 2

√ < 1 − M ht and ht >

similar to the above analysis, it holds that

C C ≤ . 2 2(t + 2) (t + 3)2

C . 2(t+2)2

From (3.43), we have

   ht+1 ≤ ht 1 − M ht    C 1 C ≤ 1−M 2 t +2 (t + 2)2

3 C C 1 − ≤ . ≤ t +2 (t + 2)2 (t + 3)2 The proof is completed.



References

107

References 1. D.P. Bertsekas, Nonlinear Programming, 2nd edn. (Athena Scientific, Belmont, MA, 1999) 2. S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011) 3. A. Chambolle, T. Pock, A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imag. Vis. 40(1), 120–145 (2011) 4. A. Chambolle, T. Pock, On the ergodic convergence rates of a first-order primal-dual algorithm. Math. Program. 159(1–2), 253–287 (2016) 5. Y. Chen, G. Lan, Y. Ouyang, Optimal primal-dual methods for a class of saddle point problems. SIAM J. Optim. 24(4), 1779–1814 (2014) 6. D. Davis, W. Yin, Convergence rate analysis of several splitting schemes, in Splitting Methods in Communication, Imaging, Science, and Engineering (Springer, New York, 2016), pp. 115– 163 7. E. Esser, X. Zhang, T.F. Chan, A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science. SIAM J. Imag. Sci. 3(4), 1015–1046 (2010) 8. M. Frank, P. Wolfe, An algorithm for quadratic programming. Nav. Res. Logist. Q. 3(1–2), 95–110 (1956) 9. D. Garber, E. Hazan, Faster rates for the Frank-Wolfe method over strongly-convex sets, in Proceedings of the 32nd International Conference on Machine Learning, Lille, (2015), pp. 541–549 10. P. Giselsson, S. Boyd, Linear convergence and metric selection in Douglas Rachford splitting and ADMM. IEEE Trans. Automat. Contr. 62(2), 532–544 (2017) 11. B. He, X. Yuan, On the acceleration of augmented Lagrangian method for linearly constrained optimization (2010). Preprint. http://www.optimization-online.org/DB_FILE/2010/10/2760. pdf 12. B. He, X. Yuan, On the O(1/t) convergence rate of the Douglas-Rachford alternating direction method. SIAM J. Numer. Anal. 50(2), 700–709 (2012) 13. B. He, X. Yuan, On non-ergodic convergence rate of Douglas-Rachford alternating directions method of multipliers. Numer. Math. 130(3), 567–577 (2015) 14. B. He, L.-Z. Liao, D. Han, H. Yang, A new inexact alternating directions method for monotone variational inequalities. Math. Program. 92(1), 103–118 (2002) 15. M. Jaggi, Revisiting Frank-Wolfe: projection free sparse convex optimization, in Proceedings of the 31th International Conference on Machine Learning, Atlanta, (2013), pp. 427–435 16. M. Jaggi, M. Sulovsk, A simple algorithm for nuclear norm regularized problems, in Proceedings of the 27th International Conference on Machine Learning, Haifa, (2010), pp. 471–478 17. G. Lan, R.D. Monteiro, Iteration-complexity of first-order penalty methods for convex programming. Math. Program. 138(1–2), 115–139 (2013) 18. H. Li, Z. Lin, On the complexity analysis of the primal solutions for the accelerated randomized dual coordinate ascent. J. Mach. Learn. Res. (2020). http://jmlr.org/papers/v21/18-425.html 19. H. Li, Z. Lin, Accelerated alternating direction method of multipliers: an optimal O(1/K) nonergodic analysis. J. Sci. Comput. 79(2), 671–699 (2019) 20. H. Li, C. Fang, Z. Lin, Convergence rates analysis of the quadratic penalty method and its applications to decentralized distributed optimization (2017). Preprint. arXiv:1711.10802 21. Z. Lin, M. Chen, Y. Ma, The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices (2010). Preprint. arXiv:1009.5055 22. Z. Lin, R. Liu, H. 
Li, Linearized alternating direction method with parallel splitting and adaptive penalty for separable convex programs in machine learning. Mach. Learn. 99(2), 287–325 (2015)

108

3 Accelerated Algorithms for Constrained Convex Optimization

23. J. Lu, M. Johansson, Convergence analysis of approximate primal solutions in dual first-order methods. SIAM J. Optim. 26(4), 2430–2467 (2016) 24. C. Lu, H. Li, Z. Lin, S. Yan, Fast proximal linearized alternating direction method of multiplier with parallel splitting, in Proceedings of the 30th AAAI Conference on Artificial Intelligence, Phoenix, (2016), pp. 739–745 25. D.G. Luenberger, Convergence rate of a penalty-function scheme. J. Optim. Theory Appl. 7(1), 39–51 (1971) 26. I. Necoara, V. Nedelcu, Rate analysis of inexact dual first-order methods application to dual decomposition. IEEE Trans. Automat. Contr. 59(5), 1232–1243 (2014) 27. I. Necoara, A. Patrascu, Iteration complexity analysis of dual first-order methods for conic convex programming. Optim. Methods Softw. 31(3), 645–678 (2016) 28. I. Necoara, A. Patrascu, F. Glineur, Complexity of first-order inexact Lagrangian and penalty methods for conic convex programming. Optim. Methods Softw. 34(2), 305–335 (2019) 29. V.H. Nguyen, J.-J. Strodiot, Convergence rate results for a penalty function method, in Optimization Techniques (Springer, New York, 1978), pp. 101–106 30. Y. Ouyang, Y. Chen, G. Lan, E. Pasiliao Jr., An accelerated linearized alternating direction method of multipliers. SIAM J. Imag. Sci. 8(1), 644–681 (2015) 31. P. Patrinos, A. Bemporad, An accelerated dual gradient projection algorithm for embedded linear model predictive control. IEEE Trans. Automat. Contr. 59(1), 18–33 (2013) 32. T. Pock, D. Cremers, H. Bischof, A. Chambolle, An algorithm for minimizing the MumfordShah functional, in Proceedings of the 12th International Conference on Computer Vision, Kyoto, (2009), pp. 1133–1140 33. B.T. Polyak, The convergence rate of the penalty function method. USSR Comput. Math. Math. Phys. 11(1), 1–12 (1971) 34. R. Shefi, M. Teboulle, Rate of convergence analysis of decomposition methods based on the proximal method of multipliers for convex minimization. SIAM J. Optim. 24(1), 269–297 (2014) 35. W. Tian, X. Yuan, An alternating direction method of multipliers with a worst-case o(1/n2 ) convergence rate. Math. Comput. 88(318), 1685–1713 (2019) 36. P. Tseng, On accelerated proximal gradient methods for convex-concave optimization. Technical report, University of Washington, Seattle (2008) 37. X. Wang, X. Yuan, The linearized alternating direction method of multipliers for Dantzig selector. SIAM J. Sci. Comput. 34(5), A2792–A2811 (2012) 38. Y. Xu, Accelerated first-order primal-dual proximal methods for linearly constrained composite convex programming. SIAM J. Optim. 27(3), 1459–1484 (2017)

Chapter 4

Accelerated Algorithms for Nonconvex Optimization

Nonconvex optimization has gained extensive attention recently in the machine learning community. In this section, we present the accelerated gradient methods for nonconvex optimization. The topics studied include the general convergence results under the Kurdyka–Łojasiewicz (KŁ) condition (Definition A.36), how to achieve the critical point quickly, and how to escape the saddle point quickly.

4.1 Proximal Gradient with Momentum Consider the following composite minimization problem: min F (x) ≡ f (x) + g(x), x

(4.1)

where f is differentiable (it can be nonconvex) and g can be both nonconvex and nonsmooth. Examples of problem (4.1) include sparse and low-rank learning with nonconvex regularizers, e.g., p -norm [10], Capped-1 penalty [27], Log-Sum Penalty [6], Minimax Concave Penalty [26], Geman Penalty [13], Smoothly Clipped Absolute Deviation [9], and Schatten-p norm [22]. Popular methods for problem (4.1) include the general iterative shrinkage and thresholding (GIST) [15], inertial forward-backward [5], iPiano [25], and the proximal gradient with momentum [21], where GIST is the general proximal gradient (PG) method and the later three are extensions of PG with momentum. The accelerated PG method was studied in [14, 20, 24], where the convergence to the critical point is guaranteed for nonconvex programs and acceleration is proven for convex programs. As a comparison, acceleration is not ensured in [5, 21, 25].

© Springer Nature Singapore Pte Ltd. 2020 Z. Lin et al., Accelerated Optimization for Machine Learning, https://doi.org/10.1007/978-981-15-2910-8_4

109

110

4 Accelerated Algorithms for Nonconvex Optimization

The KŁ inequality [2–4] is a powerful tool proposed recently for the analysis of nonconvex programming, where the convergence of the whole sequence can be proven, rather than the subsequence only. Moreover, the convergence rate can also be proven when the desingularizing function (Definition A.35) ϕ used in the KŁ property has some special form [11]. In this section, we use the method discussed in [21] as an example to give the general convergence results for nonconvex programs and stronger results under the KŁ condition. The method is described in Algorithm 4.1. Algorithm 4.1 Proximal gradient (PG) with momentum Initialize y0 = x0 , β ∈ (0, 1), and η < for k = 0, 1, 2, 3, · · · do xk = Proxηg (yk − η∇f (yk )), vk = xk + β(xk − xk−1 ), if F (xk ) ≤ F (vk ) then yk+1 = xk , else yk+1 = vk . end if end for

1 L.

In this section, we make the following assumptions. Assumption 4.1 1) f (x) is a proper (Definition A.30) and L-smooth function and g(x) is proper and lower semicontinuous (Definition A.31). 2) F (x) is coercive (Definition A.32). 3) F (x) has the KŁ property.

4.1.1 Convergence Theorem We first give the general convergence result of Algorithm 4.1 in the following theorem. Generally speaking, we prove that every accumulation point is a critical point. Theorem 4.1 Assume that 1) and 2) of Assumption 4.1 hold, then with η < sequence {xk } generated by Algorithm 4.1 satisfies

1 L

the

1. {xk } is a bounded sequence. 2. The set of accumulation points  of {xk } forms a compact set (Definition A.27), on which the objective function F is constant. 3. All elements of  are critical points of F (x).

4.1 Proximal Gradient with Momentum

111

Proof From the L-smoothness of f we have F (xk ) ≤ g(xk ) + f (yk ) + ∇f (yk ), xk − yk  + a

≤ g(yk ) − ∇f (yk ), xk − yk  −

L xk − yk 2 2

1 xk − yk 2 2η

L + f (yk ) + ∇f (yk ), xk − yk  + xk − yk 2 2

L 1 − xk − yk 2 = F (yk ) − 2η 2 = F (yk ) − αxk − yk 2 , where α = g(xk ) +

1 2η



L 2

a

and ≤ uses the definition of xk , thus

1 1 xk − (yk − η∇f (yk ))2 ≤ g(yk ) + yk − (yk − η∇f (yk ))2 2η 2η = g(yk ) +

1 η∇f (yk )2 . 2η

So we have F (yk+1 ) ≤ F (xk ) ≤ F (yk ) − αxk − yk 2 ≤ F (xk−1 ) − αxk − yk 2 .

(4.2)

Since F (x) > −∞, we conclude that F (xk ) and F (yk ) converge to the same limit F ∗ , i.e., lim F (xk ) = lim F (yk ) = F ∗ .

k→∞

k→∞

(4.3)

On the other hand, we also have F (xk ) ≤ F (x0 ) and F (yk ) ≤ F (x0 ) for all k. Thus {xk } and {yk } are bounded and have bounded accumulation points. From (4.2), we also have αxk − yk 2 ≤ F (yk ) − F (yk+1 ). Summing over k = 0, 1, · · · , k and letting k → ∞, we have α

∞  k=0

xk − yk 2 ≤ F (yk ) − inf F < ∞.

112

4 Accelerated Algorithms for Nonconvex Optimization

It further implies that xk − yk  → 0. Thus {xk } and {yk } share the same accumulation points . Since  is closed and bounded, we conclude that  is compact. Define uk = ∇f (xk ) − ∇f (yk ) −

1 (xk − yk ). η

From the optimality condition, we have −∇f (yk ) −

1 (xk − yk ) ∈ ∂g(xk ). η

Then uk = ∇f (xk ) − ∇f (yk ) −

1 (xk − yk ) ∈ ∂F (xk ). η

We further have     1  ∇f (x (x ) − ∇f (y ) − − y ) uk  =  k k k k   η

1 ≤ L+ yk − xk  → 0. η

(4.4)

Consider any z ∈  and we write xk → z and yk → z by restricting to a subsequence. By the definition of the proximal mapping, we have ∇f (yk ), xk − yk  +

1 xk − yk 2 + g(xk ) 2η

≤ ∇f (yk ), z − yk  +

1 z − yk 2 + g(z). 2η

Taking limsup on both sides and noting that xk − yk → 0 and yk → z, we obtain that lim supk→∞ g(xk ) ≤ g(z). Since g is lower semicontinuous and xk → z, it follows that lim supk→∞ g(xk ) ≥ g(z). Combining both inequalities, we conclude that limk→∞ g(xk ) = g(z). Note that the continuity of f yields limk→∞ f (xk ) = f (z). We then conclude that limk→∞ F (xk ) = F (z). Since limk→∞ F (xk ) = F ∗

4.1 Proximal Gradient with Momentum

113

by (4.3), we conclude that F (z) = F ∗ , ∀z ∈ . Hence, F is constant on the compact set . Now we have established xk → z, F (xk ) → F (z), and ∂F (xk )  uk → 0. Recalling the definition of limiting subdifferential (Definition A.33), we conclude that 0 ∈ ∂F (z) for all z ∈ . 

With the KŁ property, we can have a stronger convergence result, that is to say, the whole sequence generated by the algorithm converges to a critical point, rather than only the subsequence proven in Theorem 4.1. Theorem 4.2 Assume that 1)–3) of Assumption 4.1 hold, then with η < sequence {xk } generated by Algorithm 4.1 satisfies

1 L

the

1. {xk } converges; 2. {xk } converges to a critical point x∗ ∈ . Proof From (4.2), we have F (xk ) ≤ F (xk−1 ) − αxk − yk 2 . Moreover, from (4.4) we have dist(0, ∂F (xk )) ≤

1 + L xk − yk . η

We have shown in the proof of Theorem 4.1 that F (xk+1 ) ≤ F (xk ), F (xk ) → F ∗ , and dist(xk , ) → 0. Thus for any  > 0 and δ > 0, there exists K0 such that xk ∈ {x, dist(x, ) ≤ } ∩ [F ∗ < F (x) < F ∗ + δ], ∀k ≥ K0 . From the uniform KŁ property (Lemma A.3), there exists a desingularizing function ϕ such that ϕ  (F (xk ) − F ∗ )dist(0, ∂F (xk )) ≥ 1. So ϕ  (F (xk ) − F ∗ ) ≥

1 1  ≥ . 1 dist(0, ∂F (xk )) + L x − y  k k η

(4.5)

114

4 Accelerated Algorithms for Nonconvex Optimization

On the other hand, since ϕ is concave, we have ϕ(F (xk+1 ) − F ∗ ) ≤ ϕ(F (xk ) − F ∗ ) + ϕ  (F (xk ) − F ∗ )(F (xk+1 ) − F (xk )) F (xk+1 ) − F (xk )  ≤ ϕ(F (xk ) − F ∗ ) +  1 η + L xk − yk  αxk+1 − yk+1 2  ≤ ϕ(F (xk ) − F ∗ ) −  . 1 η + L xk − yk  So xk+1 − yk+1  ≤ where c = have

1 η +L

α



cxk − yk (k − k+1 ) ≤

1 c xk − yk  + (k − k+1 ), 2 2

and k = ϕ(F (xk ) − F ∗ ). Summing over k = 1, 2, · · · , ∞, we ∞ 

xk − yk  ≤ x1 − y1  + c1 .

k=2

From the definition of yk in Algorithm 4.1, we have xk − yk  = xk − xk−1  or xk − yk  = xk − xk−1 − β(xk−1 − xk−2 ). So xk − xk−1  − βxk−1 − xk−2  ≤ xk − yk . Thus, ∞ 

xk − xk−1  −

k=2

∞ 

βxk − xk−1  ≤ x1 − y1  + c1 .

k=1

So (1 − β)

∞ 

xk − xk−1  ≤ βx1 − x0  + x1 − y1  + c1 .

k=2

So {xk } is a Cauchy sequence and hence is a convergent sequence. The second result follows immediately from Theorem 4.1. 

Moreover, with the KŁ property we can give the convergence rate of Algorithm 4.1. As a comparison, the convergence rate of gradient descent for the general nonconvex problem is in the form of min0≤k≤K dist(0, ∂F (xk ))2 ≤ O(1/K) [23]. Theorem 4.3 Assume that 1)–3) of Assumption 4.1 hold and the desingularizing function ϕ has the form of ϕ(t) = Cθ t θ for some C > 0, θ ∈ (0, 1]. Let F ∗ = F (x) for all x ∈  and rk = F (xk ) − F ∗ , then with η < L1 the sequence {xk } generated by Algorithm 4.1 satisfies

4.1 Proximal Gradient with Momentum

115

1. If θ = 1, then there exists k1 such that F (xk ) = F ∗ for all k > k1 and the algorithm terminates in finite steps; 2. If θ ∈ [ 12 , 1), then there exists k2 such that for all k > k2 , F (xk ) − F ∗ ≤



d1 C 2 1 + d1 C 2

k−k2 rk2 ;

3. If θ ∈ (0, 12 ), then there exists k3 such that for all k > k3 , &

C F (xk ) − F ≤ (k − k3 )d2 (1 − 2θ ) ∗

where d1 =



1 η

'

1 1−2θ

,

  2θ−1 4 2  3  1 C 2 2θ−2 − 1 r02θ−1 . + L / 2η − L2 and d2 = min 2d11 C , 1−2θ

Proof Throughout the proof we assume that F (xk ) = F ∗ for all k. From (4.5) we have 6 72 1 ≤ ϕ  (F (xk ) − F ∗ )dist(0, ∂F (xk ))

2 1 + L xk − yk 2 ≤ [ϕ  (rk )]2 η

2 1 F (xk−1 ) − F (xk ) ≤ [ϕ  (rk )]2 +L η α = d1 [ϕ  (rk )]2 (rk−1 − rk ), for all k > k0 . Because ϕ has the form of ϕ(t) = we have

C θ θt ,

1 ≤ d1 C 2 rk2θ−2 (rk−1 − rk ).

we have ϕ  (t) = Ct θ−1 . So

(4.6)

1. Case θ = 1. In this case, (4.6) becomes 1 ≤ d1 C 2 (rk − rk+1 ). Because rk → 0, d1 > 0, and C > 0, this is a contradiction. So there exists k1 such that rk = 0 for all k > k1 . Thus the algorithm terminates in finite steps. 2. Case θ ∈ [ 12 , 1). In this case, 0 < 2 − 2θ ≤ 1. As rk → 0, there exists kˆ3 such that rk2−2θ ≥ rk for all k > kˆ3 . Then (4.6) becomes rk ≤ d1 C 2 (rk−1 − rk ).

116

4 Accelerated Algorithms for Nonconvex Optimization

So we have d1 C 2 rk−1 , 1 + d1 C 2

rk ≤ for all k2 > max{k0 , kˆ3 } and rk ≤

d1 C 2 1 + d1 C 2

k−k2 rk2 .

3. Case θ ∈ (0, 12 ). In this case, 2θ − 2 ∈ (−2, −1) and 2θ − 1 ∈ (−1, 0). As rk−1 > rk , we have 2θ−2 2θ−1 rk−1 < rk2θ−2 and r02θ−1 < · · · < rk−1 < rk2θ−1 . C 2θ−1  , then φ (t) = −Ct 2θ−2 . Define φ(t) = 1−2θ t 2θ−2 , then If rk2θ−2 ≤ 2rk−1 :

rk

φ(rk ) − φ(rk−1 ) =

:



φ (t)dt = C

rk−1

rk−1

t 2θ−2 dt

rk

2θ−2 ≥ C(rk−1 − rk )rk−1



a C 1 (rk−1 − rk )rk2θ−2 ≥ , 2 2d1 C

a

for all k > k0 , where ≥ uses (4.6). 2θ−1 2θ−2 2θ−1 If rk2θ−2 ≥ 2rk−1 , then rk2θ−1 ≥ 2 2θ−2 rk−1 and C 1 − 2θ C ≥ 1 − 2θ

φ(rk ) − φ(rk−1 ) =

  2θ−1 rk2θ−1 − rk−1  2θ−1  2θ−1 2 2θ−2 − 1 rk−1

2θ−1 = qrk−1 ≥ qr02θ−1 ,

where q =

C 1−2θ

 2θ−1 4  3 2 2θ−2 − 1 . Let d2 = min 2d11 C , qr02θ−1 , we have φ(rk ) − φ(rk−1 ) ≥ d2 ,

for all k > k0 and φ(rk ) ≥ φ(rk ) − φ(rk0 ) =

k  i=k0 +1

[φ(ri ) − φ(ri−1 )] ≥ (k − k0 )d2 .

4.1 Proximal Gradient with Momentum

117

So we have rk2θ−1 ≥

(k − k0 )d2 (1 − 2θ ) C

and thus & rk ≤

C (k − k0 )d2 (1 − 2θ )

'

1 1−2θ

.

Letting k3 = k0 , we have &

C F (xk ) − F = rk ≤ (k − k3 )d2 (1 − 2θ ) ∗

'

1 1−2θ

, ∀k ≥ k3 .



4.1.2 Another Method: Monotone APG Algorithm 4.2 Monotone APG Initialize y0 = x0 , β ∈ (0, 1), and η < L1 . for k = 0, 1, 2, 3, · · · do tk−1 −1 yk = xk + tk−1 tk (zk − xk ) + tk (xk − xk−1 ), zk+1 = Proxαy g (yk − αy ∇f (yk )), vk+1 =  Proxαx g (xk − αx ∇f (xk )), tk+1 = xk+1 =

4tk2 +1+1 , 2

zk+1 , if F (zk+1 ) ≤ F (vk+1 ), vk+1 , otherwise.

end for

In this section, we introduce another accelerated PG method for nonconvex optimization, namely the monotone APG [20]. We describe it in Algorithm 4.2. Algorithm 4.2 compares the accelerated PG step and the non-accelerated PG step and chooses the one which has a smaller objective function value. Compared with Algorithm   4.1, Algorithm 4.2 has the beauty that the provable acceleration with O K12 rate is guaranteed for convex programs, while Algorithm 4.1 has no convergence rate analysis for convex problems if the KŁ property is not assumed. However, for Algorithm 4.2 with the KŁ property we cannot prove that the whole sequence converges to a critical point, i.e., Theorem 4.2, when the desingularizing function is general. But when the desingularizing function is also of the special form ϕ(t) = Cθ t θ , the same convergence rate as Theorem 4.3 describes can also be guaranteed for Algorithm 4.2 [20].

118

4 Accelerated Algorithms for Nonconvex Optimization

4.2 AGD Achieves Critical Points Quickly Although Algorithm 4.1 uses momentum, there is no provable advantage over the gradient descent. Recently, there is a trend to analyze the accelerated gradient descent for nonconvex programs with provable guarantees, e.g., the problems of how to achieve critical points quickly and how to escape saddle points quickly. In this section, we describe the result in [7] that the accelerated gradient descent achieves critical points quickly for nonconvex problems. In the following sections, we present two building blocks of the result: a monitored variation of AGD and a negative curvature descent step. Then we combine these components to obtain the desired conclusion.

4.2.1 AGD as a Convexity Monitor We describe the first component in Algorithm 4.3, where f is an L-smooth function and is conjectured to be σ -strongly convex (so the σ -strong convexity needs to be verified in Find-Witness-Pair()), and Nesterov’s accelerated gradient descent (AGD, Algorithm 2.1) for strongly convex functions is employed. At every iteration, the method invokes Certify-Progress() to test whether the optimization is progressing as it should be for strongly convex functions. In particular, it tests whether the norm of gradient decreases exponentially quickly. If the test fails, FindWitness-Pair() produces points u and v such that f violates the σ -strong convexity. Otherwise, we proceed until we find a point x such that ∇f (x) ≤ . The effectiveness of Algorithm 4.3 is based on the following guarantee on the performance of AGD, which can be obtained from the proof of Theorem 2.2, where x∗ is simply replaced by w. Lemma 4.1 Let f be L-smooth. If the sequences {xj } and {yj } and w satisfy   σ f (xj ) ≥ f (yj ) + ∇f (yj ), xi − yj + xi − yi 2 , 2 j = 0, 1, · · · , t − 1,  σ  f (w) ≥ f (yj ) + ∇f (yj ), w − yj + w − yi 2 , 2

(4.7)

then

1 t f (xt ) − f (w) ≤ 1 − √ ψ(w), κ where ψ(w) = f (x0 ) − f (w) + σ2 x0 − w2 .

(4.8)

4.2 AGD Achieves Critical Points Quickly

119

Algorithm 4.3 AGD-Until-Guilty(f, x0 , , L, σ ) √

κ = L/σ , ω = √κ−1 , and y0 = x0 , κ+1 for t = 1, 2, 3, · · · do xt = yt−1 − L1 ∇f (yt−1 ), yt = xt + ω(xt − xt−1 ), wt ←Certify-Progress(f, x0 , xt , L, σ, κ, t). if wt = null then (u, v) ←Find-Witness-Pair(f, x0:t , y0:t , wt , σ ), return (x0:t , y0:t , u, v). end if if ∇f (xt ) ≤  then return (x0:t , y0:t ,null). end if end for ************************************************************* function Certify-Progress(f, x0 , xt , L, σ, κ, t) if f (xt ) > f (x0 ) then return x0 . end if wt = xt − L1 ∇f (xt ), t  if ∇f (xt )2 > 2L 1 − √1κ ψ(wt ) then return wt . else return null. end if ************************************************************* function Find-Witness-Pair(f, x0:t , y0:t , wt , σ ) for j = 0, 1, · · · , t − 1 do for u = xj , wt do   if f (u) < f (yj ) + ∇f (yj ), u − yj + σ2 u − yj 2 then return (u, yj ). end if end for end for

1 Specifically, letting w = wt = xt − L1 ∇f (xt ), we can have 2L ∇f (xt )2 ≤ f (xt ) − f (wt ) ((A.6) of Proposition A.6). After combining with (4.8), it leads to



1 1 t 2 ∇f (xt ) ≤ 1 − √ ψ(wt ). 2L κ If Certify-Progress() always tells that the optimization progresses as we have expected, then f (wt ) ≤ f (xt ) ≤ f (x0 ). From the coerciveness assumption we know that wt − x0  is bounded by some constant C and ψ(wt ) is bounded by f (x0 ) − f (x∗ ) + σ C 2 /2. Thus ∇f (xt ) decreases exponentially quickly and Algorithm 4.3 can always terminate. With the above results in hand, we summarize the guarantees of Algorithm 4.3 as follows.

120

4 Accelerated Algorithms for Nonconvex Optimization

Lemma 4.2 Let f be L-smooth and t be the number of iterations that Algorithm 4.3 terminates. Then t satisfies ( 

) L 2Lψ(wt−1 ) t ≤ 1 + max 0, log . (4.9) σ 2 If wt = null, then (u, v) = null and f (u) < f (v) + ∇f (v), u − v +

σ u − v2 2

(4.10)

for some v = yj and u = xj or u = wt , 0 ≤ j < t. Moreover, max{f (x1 ), · · · , f (xt−1 ), f (u)} ≤ f (x0 ).

(4.11)

Proof The algorithm does not terminate at iteration t − 1. So we have

√ b 1 t−1 a ψ(wt−1 ) ≤ 2Le−(t−1)/ κ ψ(wt−1 ),  2 < ∇f (xt−1 )2 ≤ 2L 1 − √ κ a

where < uses the fact that the second “if condition” in AGD-Until-Guilty() fails b

when the algorithm does not terminate and ≤ uses that the second “if condition” in Certify-Progress() fails. So we have (4.9). When wt = null, suppose f (u) ≥ f (v) + ∇f (v), u − v +

σ u − v2 2

holds for all v = yj and u = xj or u = wt , 0 ≤ j < t. Namely, (4.7) holds for w = wt . Suppose wt = x0 , then we have

1 t a b f (xt ) − f (wt ) > 0 = 1 − √ ψ(wt ), κ a

where > uses the fact that the first “if condition” in Certify-Progress() proceeds b

and = uses ψ(wt ) = ψ(x0 ) = 0. This contradicts (4.8). Similarly, suppose wt = xt − L1 ∇f (xt ), then we have f (xt ) − f (wt ) ≥

1 1 t a ∇f (xt )2 > 1 − √ ψ(wt ), 2L κ a

again contradicting (4.8), where > uses the fact that the second “if condition” in Certify-Progress() proceeds. Thus there must exist some yj and xj or wt such that (4.7) is violated. Thus Find-Witness-Pair() always produces points (u, v) = null.

4.2 AGD Achieves Critical Points Quickly

121

From the first “if condition” in Certify-Progress(), it is clear that f (xs ) ≤ f (x0 ), s = 0, 1, · · · , t − 1. If u = xs for some 0 ≤ s ≤ t − 1, then f (u) ≤ f (x0 ) holds trivially. If u = wt , then the first “if condition” in Certify-Progress() proceeds, 

i.e., f (xt ) ≤ f (x0 ). So we have f (wt ) ≤ f (xt ) ≤ f (x0 ). So (4.11) holds.

4.2.2 Negative Curvature Descent Algorithm 4.4 Exploit-NC-Pair(f, u, v, η) u−v δ = u−v , u+ = u + ηδ, u− = u − ηδ, return z = argminu+ ,u− f (z).

The second component is the exploitation of negative curvature to decrease function values, which is described in Algorithm 4.4. The property of Algorithm 4.4 is described in the following lemma. Lemma 4.3 Let f be L1 -smooth and have L2 -Lipschitz continuous Hessians (Definition A.14) and u and v satisfy f (u) < f (v) + ∇f (v), u − v − If u − v ≤

σ 2L2

and η ≤

σ L2 ,

σ u − v2 . 2

then for Algorithm 4.4 we have f (z) ≤ f (u) −

σ η2 . 12

Proof From (4.12) and the basic calculus, we have σ − u − v2 ≥ f (u) − f (v) − ∇f (v), u − v 2 : 1 ∇f (v + t (u − v)), u − v dt − ∇f (v), u − v = 0

: =

u−v

∇f (v + τ δ), δ dτ − u − v ∇f (v), δ

0

: =

0

: =

u−v

∇f (v + τ δ) − ∇f (v), δ dτ

u−v : τ

0



0

u − v2 2

c,

δ T ∇ 2 f (v + θ δ)δdθ dτ

(4.12)

122

4 Accelerated Algorithms for Nonconvex Optimization

where δ = one hand,

u−v u−v

and c = min0≤τ ≤u−v {δ T ∇f 2 (v + τ δ)δ}. Then c ≤ −σ . On the

  δ T ∇ 2 f (u)δ − c = δ T ∇ 2 f (u) − ∇ 2 f (v + τ ∗ (u − v)) δ ≤ ∇ 2 f (u) − ∇ 2 f (v + τ ∗ (u − v)) ≤ L2 u − [v + τ ∗ (u − v)] ≤ L2 u − v. So we have δ T ∇ 2 f (u)δ ≤ − σ2 . On the other hand, from (A.8) we have 1 f (u± ) ≤ f (u) + ∇f (u), u± − u + (u± − u)T ∇f 2 (u)(u± − u) 2 L2 u± − u3 + 6 = f (u) ± ∇f (u), δ +

η2 T 2 L2 3 δ ∇ f (u)δ + |η| 2 6

≤ f (u) ± ∇f (u), δ −

σ η2 , 12

σ L2 . Since ± ∇f (u), δ must be negative η2 . f (u) − σ12

where we use η ≤ have f (z) ≤

for either u+ or u− , we 

4.2.3 Accelerating Nonconvex Optimization Algorithm 4.5 Nonconvex AGD for achieving critical points, NC-AGDCP(f, p0 , , L1 , σ, η) for k = 1, 2, 3, · · · , K do fˆ(x) = f (x) + σ x − pk−1 2 , (x0:t , y0:t , u, v) ←AGD-Until-Guilty(fˆ, pk−1 , if (u, v) = null then pk = xt . else b1 = argminu,x0 ,··· ,xt f (x), b2 = Expoint-NC-Pair(f, u, v, η), pk = argminb1 ,b2 f (x). end if if ∇f (pk ) ≤  then return pk . end if end for

 10 , L1

+ 2σ, σ ).

4.2 AGD Achieves Critical Points Quickly

123

We now combine the above building blocks and give the accelerated method in Algorithm 4.5. When f is not very nonconvex, in other words, almost convex [7], i.e., f (u) ≥ f (v) + ∇f (v), u − v −

σ u − v2 , 2

(4.13)

then function fˆ is σ2 -strongly convex and AGD-Until-Guilty produces an xt with ∇ fˆ(xt ) ≤  and (u, v) = null. Otherwise, when f is very nonconvex, we can use Expoint-NC-Pair to find a new descent direction. The following lemma verifies the condition of Exploit-NC-Pair. Lemma 4.4 Let f be L1 -smooth and τ > 0. In Algorithm 4.5, if (u, v) = null and f (b1 ) ≥ f (x0 ) − σ τ 2 , then u − v ≤ 4τ . Proof Since pk−1 is the initial point of fˆ in AGD-Until-Guilty(), pk−1 = x0 . Then from (4.11), we have fˆ(xi ) ≤ fˆ(x0 ) = f (x0 ), i = 1, · · · , t − 1. From f (xi ) ≥ f (b1 ) ≥ f (x0 ) − σ τ 2 , we have σ xi − x0 2 = fˆ(xi ) − f (xi ) ≤ f (x0 ) − f (xi ) ≤ σ τ 2 , which implies xi − x0  ≤ τ . Since fˆ(u) ≤ fˆ(x0 ) and f (u) ≥ f (b1 ), we also have u − x0  ≤ τ . From yi = xi + ω(xi − xi−1 ), we have yi − x0  ≤ (1 + ω)xi − x0  + ωxi−1 − x0  ≤ 3τ. Since v = yi for some i (see Lemma 4.2), we have u−v ≤ u−x0 +yi −x0  ≤ 4τ . 

The following central lemma provides a progress guarantee for each iteration of Algorithm 4.5. Lemma 4.5 Let f be L1 -smooth and have L2 -Lipschitz continuous Hessians and η = Lσ2 . Then for Algorithm 4.5 with k ≤ K − 1, we have (

2 σ 3 , f (pk ) ≤ f (pk−1 ) − min 5σ 64L22

)

(4.14)

.

Proof Case 1: (u, v) = null. In this case, pk = xk and ∇ fˆ(pk ) ≤ /10. On the other hand, ∇f (pk ) > , thus we have 9/10 ≤ ∇f (pk ) − ∇ fˆ(pk ) ≤ ∇f (pk ) − ∇ fˆ(pk ) = 2σ pk − pk−1 . We also have fˆ(pk ) = fˆ(xk ) ≤ fˆ(x0 ) = fˆ(pk−1 ) = f (pk−1 ). So f (pk ) = fˆ(pk ) − σ pk − pk−1 2 ≤ f (pk−1 ) − σ



9 20σ

2 ≤ f (pk−1 ) −

2 . 5σ

124

4 Accelerated Algorithms for Nonconvex Optimization

Case 2: (u, v) = null. From (4.10), we have   σ fˆ(u) < fˆ(v) + ∇ fˆ(v), u − v + u − v2 , 2 which leads to σ f (v) + ∇f (v), u − v − u − v2 − f (u) 2   σ = fˆ(v) + ∇ fˆ(v), u − v + u − v2 − fˆ(u) > 0. 2 So u and v satisfy the condition in Lemma 4.3. If f (b1 ) ≤ f (x0 ) − are done. If f (b1 ) > f (x0 ) − So from Lemma 4.3, we have

σ3 64L22

, then we have u − v ≤

f (b2 ) ≤ f (u) −

σ 2L2

σ3 , 64L22

then we

from Lemma 4.4.

σ3 σ3 ≤ f (p ) − , k−1 12L22 12L22

where we use f (u) ≤ fˆ(u) ≤ fˆ(x0 ) = f (pk−1 ).



Using the above lemma, we can give the final theorem with guarantee of acceleration. Theorem 4.4 Let f be L1 -smooth and √ have L2 -Lipschitz continuous Hessians,  = f (p0 ) − infz f (z), and σ = 2 L2 . Then Algorithm 4.5 finds a point pK such that ∇f (pK ) ≤  with at most  O

 √ 1/2 1/4 L1 L2  (L1 + L2 ) log  7/4 2

gradient computations. Proof From (4.14), we have  ≥ f (p0 ) − f (pK−1 ) =

K−1  k=1

(

2 σ 3 , [f (pk−1 ) − f (pk )] ≥ (K − 1) min 5σ 64L22

≥ (K − 1)

 3/2 1/2

.

10L2

So 1/2

K ≤ 1 + 10

L2  .  3/2

)

4.3 AGD Escapes Saddle Points Quickly

125

√ From (4.9) with the Lipschitz constant of fˆ being L = L1 + 2σ = L2 + 4 L2 , we have ⎧ 5 ⎫ √ √

⎨ 2(L1 + 4 L2 )ψ(wT −1 ) ⎬ L1 + 4 L2  log T ≤ 1 + max 0, √ ⎩ ⎭ 2 2 L2   =O

1/2

L1 1/4

L2  1/4

log

(L1 +



L2 )

2

 ,

where we use b σ σ a ψ(z) = fˆ(x0 ) − fˆ(z) + z − x0 2 = f (x0 ) − f (z) − z − x0 2 ≤ , 2 2

in which = is because x0 = pk−1 , fˆ(x0 ) = f (x0 ), and fˆ(z) = f (z) + σ z − x0 2 a

b

and ≤ is because f (x0 ) = f (pk−1 ) ≤ f (p0 ) by (4.14). So the total number of gradient computations is  KT = O

 √ 1/2 1/4 L 1 L2  (L1 + L2 ) log .  7/4 2



4.3 AGD Escapes Saddle Points Quickly In optimization and machine learning literature, recently there has been substantial work on the convergence of gradient-like methods to local optima for nonconvex problems. Typical ones include [12] which showed that stochastic gradient descent converges to a second-order local optimum, [19] which showed that gradient descent generically converges to a second-order local optimum, and [16] which showed that perturbed gradient descent can escape saddle points almost for free. [8] employed a combination of (regularized) accelerated gradient descent and the Lanczos method to obtain better rates for first-order methods, [17] proposed a single-loop accelerated method, and [1] proposed a careful implementation of the Nesterov–Polyak method, using accelerated methods for fast approximate matrix inversion. In this section, we describe the method in [8]. The algorithm alternates between finding directions of negative curvature of f and solving structured subproblems that are nearly convex. Similar to Sect. 4.2, the analysis in this section also depends on two building blocks, which uses the accelerated gradient descent for the almost convex case and finds the negative curvature for the very nonconvex case, respectively.

126

4 Accelerated Algorithms for Nonconvex Optimization

4.3.1 Almost Convex Case We first give the following lemma, which is a direct consequence of (A.6) and (A.11). Lemma 4.6 Assume that f is γ -strongly convex and L-smooth. Then 2γ (f (x) − f (x∗ )) ≤ ∇f (x)2 ≤ 2L(f (x) − f (x∗ )). We consider the case that f is σ -almost convex, defined in (4.13). Similar to Algorithm 4.5, we can add a regularizing term σ x − y2 to make f σ strongly convex and solve it quickly using accelerated gradient descent (AGD, Algorithm 2.1). We describe the method in Algorithm 4.6, which finds a zj such that ∇f (zj ) ≤ . In Algorithm 4.6, we call subroutine AGD to minimize a σ -strongly convex and (L + 2σ )-smooth function gj , initialized at zj , such that gj (zj +1 ) − minz gj (z) ≤

(  )2 2L+4σ .

Algorithm 4.6 Almost convex AGD, AC-AGD(f, z1 , , σ, L) for j = 1, 2, 3, · · · , J do If ∇f (zj ) ≤ , then return zj , √ Define gj (z) = f (z) + σ z − zj 2 and   =  σ/[50(L + 2σ )],  2 zj +1 = AGD(gj , zj , ( ) /(2L + 4σ ), L + 2σ, σ ). end for

Lemma 4.7 Assume that f is σ -almost convex and L-smooth. Then Algorithm 4.6 terminates in the total running time of 

 √ L 5 Lσ + (f (z1 ) − f (zJ )) log 1/. σ 2

Proof gj is σ -strongly convex and (L + 2σ )-smooth. Then from Theorem 2.1, √ √ = O( κ log 1/  ) iterations, we have we know that after O κ log 2L+4σ (  )2 gj (zj +1 ) − minz gj (z) ≤

(  )2 2L+4σ

5 a

∇gj (zj +1 ) ≤

and

(  )2 2(L + 2σ ) =  =  2L + 4σ a

where we use Lemma 4.6 in ≤.



σ  ≤ , 50(L + 2σ ) 10

4.3 AGD Escapes Saddle Points Quickly

127

On the other hand, ∇gj (zj ) = ∇f (zj ) ≥  for j < J . So ∇gj (zj +1 )2 < σ 2 σ 2 L+2σ ≤ L+2σ ∇gj (zj ) and gj (zj +1 ) − gj (z∗j ) ≤

1 1 ∇gj (zj +1 )2 ≤ ∇gj (zj )2 ≤ gj (zj ) − gj (z∗j ), 2σ 2(L + 2σ )

where z∗j = argminz gj (z) and we use Lemma 4.6. So gj (zj +1 ) ≤ gj (zj ) and f (zj +1 ) = gj (zj +1 ) − σ zj +1 − zj 2 ≤ gj (zj ) − σ zj +1 − zj 2 = f (zj ) − σ zj +1 − zj 2 . (4.15) Summing over j = 1, 2, · · · , J − 1, we have σ

J −1

zj +1 − zj 2 ≤ f (z1 ) − f (zJ ).

j =1

On the other hand, since 2σ zj +1 − zj  = ∇f (zj +1 ) − ∇gj (zj +1 ) ≥ ∇f (zj +1 ) − ∇gj (zj +1 ) ≥  −

 , 10

we can have f (z1 ) − f (zJ ) ≥ σ (J − 1) Thus J ≤ 1 +

5σ (f (z1 ) − f (zJ )) 2

&

0.81 2 2 ≥ (J − 1) . 2 5σ 4σ

(4.16)

and the total running time is

' 5σ L log 1/ 1 + 2 (f (z1 ) − f (zJ )) σ    √ L 5 Lσ = + (f (z1 ) − f (zJ )) log 1/. σ 2

The proof is complete.

(4.17) 

4.3.2 Very Nonconvex Case When f is very nonconvex, i.e., satisfying (4.12), similar to Algorithm 4.5, we find the negative curvature and use it as the descent direction. However, since

128

4 Accelerated Algorithms for Nonconvex Optimization

our purpose is different from that of Sect. 4.2, the negative curvature procedure is different from Algorithm 4.4. We need a tool with the following guarantee: If f is L1 -smooth and λmin (∇ 2 f (x)) ≤ −α, where α > 0 is a small T 2 constant,  then we can find a v such that v = 1 and v ∇ f (x)v ≤ −O(α) in L1  time with probability at least 1 − δ  . O α log 1/δ A number of methods computing approximate leading eigenvectors can be used as the tool, e.g., the Lanczos method [18] and noisy accelerated gradient descent [17]. Using the tool to find the leading eigenvector, we can describe the negative curvature descent in Algorithm 4.7. Algorithm 4.7 Negative curvature descent, NCD(f, z1 , L2 , α, δ) Let δ  =

δ 1+

3L2 2 α3

.

[f (z1 )−minz f (z)]

for j = 1, 2, 3, · · · , J do If λmin (∇ 2 f (zj )) ≥ −α, then return zj , Find vj such that vj  = 1 and vTj ∇ 2 f (zj )vj ≤ −O(α) with probability at least 1 − δ  , > > > T 2 >   >vj ∇ f (zj )vj > T ∇f (z ) . zj +1 = zj − ηj vj , where ηj = sign v j j L2 end for

We provide a formal guarantee for Algorithm 4.7 in the following lemma. Lemma 4.8 Assume that f has L1 -Lipschitz continuous gradients and L2 Lipschitz continuous Hessians. Then with probability of at least 1−δ, Algorithm 4.7 terminates in the total running time of 

   3L22 L1  log 1/δ . 1 + 3 (f (z1 ) − f (zJ )) O α α

Proof Since f has L2 -Lipschitz continuous Hessians, by Proposition A.8 we have f (zj − ηj vj ) − f (zj ) + ηj vTj ∇f (zj ) −

ηj2 2

vTj ∇ 2 f (zj )vj ≤

L2 ηj3 6

vj 3 .

From the definition of ηj , we have ηj vTj ∇f (zj ) ≥ 0. Then f (zj +1 ) − f (zj ) ≤

ηj2 2

=−

vTj ∇ 2 f (zj )vj +

> >3 > T 2 > >vj ∇ f (zj )vj > 3L22

L2 |ηj |3 vj 3 6 ≤−

α3 . 3L22

(4.18)

4.3 AGD Escapes Saddle Points Quickly

129

Summing over j = 1, · · · , J − 1, we have (J − 1)

α3 ≤ f (z1 ) − f (zJ ). 3L22

(4.19)

So J ≤1+

3L22 [f (z1 ) − f (zJ )]. α3

The probability of failure is J δ  ≤ δ and the total running time is     3L22 L1  log 1/δ , 1 + 3 (f (z1 ) − f (zJ )) O α α as claimed.



4.3.3 AGD for Nonconvex Problems Based on the above two building blocks, we can give the accelerated gradient descent for nonconvex optimization in Algorithm 4.8 and the complexity guarantee in Theorem 4.5. Similar to Algorithm 4.5, Algorithm 4.8 alternates between procedures NCD and AC-AGD. However, since NCD terminates at a point where f is locally almost convex, we define a new function fk (x) that is globally almost convex. The details can be found in Sect. 4.3.3.1. Algorithm 4.8 Nonconvex AGD for escaping saddle points, NC-AGDSP(f ,L1 ,L2 ,,δ,f ) √ 3L2  α = L2 , K = α23 f , and δ  = Kδ . for k = 1, 2, 3, · · · , K do xˆ k = NCD(f, xk , L2 , α, δ  ), If ∇f (ˆxk ) ≤ , return xˆ k , Set fk (x) = f (x) + L1 ([x − xˆ k  − α/L2 ]+ )2 , xk+1 = AC-AGD(fk , xˆ k , /2, 3α, 5L1 ). end for

Theorem 4.5 Assume that f has L1 -Lipschitz continuous gradients and L2 √ Lipschitz continuous Hessians. Let α = L2 , then with probability of at least 1 − δ, Algorithm 4.8 terminates √ 15 L 

1. in 1 +  3/22 f outer iterations, 1/4 1/2

L2 L1 f 1 log δ running time, 2. in  7/4

130

4 Accelerated Algorithms for Nonconvex Optimization

√ such that ∇f (xk ) ≤  and λmin (∇ 2 f (xk )) ≥ − , where f = f (ˆx0 ) − minx f (x). We will prove Theorem 4.5 in three steps in the following sections, respectively. Locally Almost Convex → Globally Almost Convex

4.3.3.1

The following lemma illustrates how to transform a locally almost convex function into a globally almost convex function.

2 α Lemma 4.9 Let fk (x) = f (x) + L1 x − xˆ k  − L2 , where [x]+ = +

max(x, 0). If λmin ∇ 2 f (ˆxk ) ≥ −α, then fk is 3α-almost convex and 5L1 -smooth.

2 α Proof Let ρ(x) = L1 x − L2 , then +

' & α x x − . ∇ρ(x) = 2L1 x L2 + We can have that ∇ρ(x) is continuous, ∇ρ(x) is differentiable except at x = Lα2 ,  T  xx I for − and ∇ 2 ρ(x) = 0 for x < Lα2 and ∇ 2 ρ(x) = 2L1 I + Lα2 x 3 x α α x > L2 . So when x > L2 , we have

α I  0, ∇ 2 ρ(x) 2L1 I − L2 x



α xxT α I ∇ 2 ρ(x) 2L1 I +  2L I +  4L1 I. 1 L2 x3 L2 x Thus fk (x) is 5L1 -smooth. 2α When x − xˆ k  > L , we have 2

α I  0. ∇ fk (x)  ∇ f (x) + 2L1 I − L2 x − xˆ k  2

2

When x − xˆ k  ≤

2α L2 ,

we have a

∇ 2 f (x)  ∇ 2 f (ˆxk ) − L2 x − xˆ k I  −3αI, a

where  uses the Lipschitz continuity of ∇ 2 f . So ∇ 2 fk (x)  ∇ 2 f (x)  −3αI. Namely, fk (x) is 3α-almost convex.



4.3 AGD Escapes Saddle Points Quickly

4.3.3.2

131

Outer Iterations

We present the complexity of the outer iterations of Algorithm 4.8 in the following lemma. Lemma 4.10 Assume that f has L1 -Lipschitz continuous gradients and L2 √ Lipschitz continuous Hessians. Let α = L , then Algorithm 4.8 terminates in 2 √   1+

15 L2 f  3/2

outer iterations.

Proof For NCD, we have α3 ≤ f (z1 ) − f (z2 ) ≤ f (z1 ) − f (zJ ), 3L22 where we use (4.18). Then α3 ≤ f (xk ) − f (ˆxk ). 3L22

(4.20)

For AC-AGD, we have fk (xk+1 ) ≤ fk (ˆxk ), fk (ˆxk ) = f (ˆxk ), and fk (xk+1 ) ≥ f (xk+1 ), where we use (4.15) and the definition of fk . So f (xk+1 ) ≤ f (ˆxk ) and α3 ≤ f (ˆxk−1 ) − f (ˆxk ), 3L22 where we use (4.20). Summing over k = 1, · · · , K, we have K

α3 ≤ f (ˆx0 ) − f (ˆxK ). 3L22

So K≤

3L22 3L22 3L22 f (f (ˆ x ) − f (ˆ x )) ≤ (f (ˆ x ) − min f (x)) = . 0 K 0 x α3 α3 α3

On the other hand, for AC-AGD we have 2 ≤ fk (z1 ) − fk (z2 ) ≤ fk (z1 ) − fk (zJ ), 15α where we use (4.16) and σ = 3α. So 2 ≤ fk (ˆxk ) − fk (xk+1 ) = f (ˆxk ) − fk (xk+1 ) 15α a

≤ f (ˆxk ) − f (xk+1 ) ≤ f (ˆxk ) − f (ˆxk+1 ),

132

4 Accelerated Algorithms for Nonconvex Optimization a

where ≤ uses (4.20). Summing over k = 1, · · · , K − 1, we have (K − 1)

2 ≤ f (ˆx1 ) − f (ˆxK ) ≤ f . 15α

So √ 15 L2 f 15αf =1+ . K ≤1+ 2  3/2

(4.21) 

4.3.3.3

Inner Iterations

Now, we consider the total inner iterations, which directly leads to the second conclusion of Theorem 4.5. Lemma 4.11 Assume that f has L1 -Lipschitz continuous gradients and L2 √ Lipschitz continuous Hessians. Let α = L2 , then Algorithm 4.8 terminates in the running time of  O

1/2 1/4

L 1 L 2 f 1 log δ  7/4

 .

Proof Consider the inner iterations of NCD. Let jk be the inner iterations at the k-th outer iteration. From (4.19), we have K K   3L22 (jk − 1) ≤ (f (xk ) − f (ˆxk )) α3 k=1

k=1



K  3L2 2

k=1

=

α3

(f (ˆxk−1 ) − f (ˆxk ))

3L22 f 3L22 (f (ˆ x ) − f (ˆ x )) ≤ . 0 K α3 α3

So K  k=1

√ 3L22 f 18 L2 f jk ≤ +K ≤ α3  3/2

References

133

and the total running time of NCD is    1/2 1/4  √ L 1 L 2 f 18 L2 f L1 1  O log , log 1/δ = O α δα  3/2  7/4

(4.22)

where we use the fact that δ  defined in Algorithm 4.7 is at the same order of δα 3 . Next, we consider the inner iterations of AC-AGD. From (4.17), we have that the running time is at the order of   √ K−1  L1 L1 α + (f (ˆxk ) − f (xk+1 )) log 1/ α 2 k=1

 L1 α (f (ˆxk ) − f (ˆxk+1 )) log 1/ ≤ 2 k=1    √ f L 1 α L1 ≤ K log 1/ + α 2 K−1 

a





L1 + α



1/2 1/4

16L1 L2 f log 1/,  7/4

(4.23) a

where we have dropped some constant coefficients √ and ≤ uses (4.21) and α = √ L2 . Adding (4.22) and (4.23) and using α = L2 , the total running time of Algorithm 4.8 is  1/2 1/4  L 1 L 2 f 1 log O , δ  7/4 which completes the proof.



References 1. N. Agarwal, Z. Allen-Zhu, B. Bullins, E. Hazan, T. Ma, Finding approximate local minima for nonconvex optimization in linear time, in Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, Montreal, (2017), pp. 1195–1200 2. H. Attouch, J. Bolte, P. Redont, A. Soubeyran, Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Łojasiewicz inequality. Math. Oper. Res. 35(2), 438–457 (2010) 3. H. Attouch, J. Bolte, B.F. Svaiter, Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Math. Program. 137(1–2), 91–129 (2013) 4. J. Bolte, S. Sabach, M. Teboulle, Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146(1–2), 459–494 (2014)

134

4 Accelerated Algorithms for Nonconvex Optimization

5. R.I. Bo¸t, E.R. Csetnek, S.C. László, An inertial forward-backward algorithm for the minimization of the sum of two nonconvex functions. EURO J. Comput. Optim. 4(1), 3–25 (2016) 6. E.J. Candès, M.B. Wakin, S.P. Boyd, Enhancing sparsity by reweighted l1 minimization. J. Fourier Anal. Appl. 14(5), 877–905 (2008) 7. Y. Carmon, J.C. Duchi, O. Hinder, A. Sidford, Convex until proven guilty: dimensionfree acceleration of gradient descent on non-convex functions, in Proceedings of the 34th International Conference on Machine Learning, Sydney, (2017), pp. 654–663 8. Y. Carmon, J.C. Duchi, O. Hinder, A. Sidford, Accelerated methods for nonconvex optimization. SIAM J. Optim. 28(2), 1751–1772 (2018) 9. J. Fan, R. Li, Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96(456), 1348–1360 (2001) 10. S. Foucart, M.-J. Lai, Sparsest solutions of underdetermined linear systems via lq minimization for 0 < q ≤ 1. Appl. Comput. Harmon. Anal. 26(3), 395–407 (2009) 11. P. Frankel, G. Garrigos, J. Peypouquet, Splitting methods with variable metric for KurdykaŁojasiewicz functions and general convergence rates. J. Optim. Theory Appl. 165(3), 874–900 (2015) 12. R. Ge, F. Huang, C. Jin, Y. Yuan, Escaping from saddle points – online stochastic gradient for tensor decomposition, in Proceedings of the 28th Conference on Learning Theory, Lille, (2015), pp. 797–842 13. D. Geman, C. Yang, Nonlinear image recovery with half-quadratic regularization. IEEE Trans. Image Process. 4(7), 932–946 (1995) 14. S. Ghadimi, G. Lan, Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program. 156(1–2), 59–99 (2016) 15. P. Gong, C. Zhang, Z. Lu, J. Huang, J. Ye, A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems, in Proceedings of the 30th International Conference on Machine Learning, Atlanta, (2013), pp. 37–45 16. C. Jin, R. Ge, P. Netrapalli, S.M. Kakade, M.I. Jordan, How to escape saddle points efficiently, in Proceedings of the 34th International Conference on Machine Learning, Sydney, (2017), pp. 1724–1732 17. C. Jin, P. Netrapalli, M.I. Jordan, Accelerated gradient descent escapes saddle points faster than gradient descent, in Proceedings of the 31th Conference On Learning Theory, Stockholm, (2018), pp. 1042–1085 18. J. Kuczy´nski, H. Wo´zniakowski, Estimating the largest eigenvalue by the power and Lanczos algorithms with a random start. SIAM J. Matrix Anal. Appl. 13(4), 1094–1122 (1992) 19. J.D. Lee, M. Simchowitz, M.I. Jordan, B. Recht, Gradient descent only converges to minimizers, in Proceedings of the 29th Conference on Learning Theory, New York, (2016), pp. 1246–1257 20. H. Li, Z. Lin, Accelerated proximal gradient methods for nonconvex programming, in Advances in Neural Information Processing Systems, Montreal, vol. 28 (2015), pp. 379–387 21. Q. Li, Y. Zhou, Y. Liang, P.K. Varshney, Convergence analysis of proximal gradient with momentum for nonconvex optimization, in Proceedings of the 34th International Conference on Machine Learning, Sydney, (2017), pp. 2111–2119 22. K. Mohan, M. Fazel, Iterative reweighted algorithms for matrix rank minimization. J. Mach. Learn. Res. 13(1), 3441–3473 (2012) 23. Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course (Springer, New York, 2004) 24. Y. Nesterov, A. Gasnikov, S. Guminov, P. Dvurechensky, Primal-dual accelerated gradient descent with line search for convex and nonconvex optimization problems (2018). Preprint. 
arXiv:1809.05895

References

135

25. P. Ochs, Y. Chen, T. Brox, T. Pock, iPiano: inertial proximal algorithm for nonconvex optimization. SIAM J. Imag. Sci. 7(2), 1388–1419 (2014) 26. C.-H. Zhang, Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38(2), 894–942 (2010) 27. T. Zhang, Analysis of multi-stage convex relaxation for sparse regularization. J. Mach. Learn. Res. 11, 1081–1107 (2010)

Chapter 5

Accelerated Stochastic Algorithms

In machine learning and statistics communities, lots of large-scale problems can be formulated as the following optimization problem: min f (x) ≡ E[F (x; ξ )], x

(5.1)

where F (x; ξ ) is a stochastic component indexed by a random number ξ . A special case that is of central interest is that f (x) can be written as a sum of functions. If we denote each component function as fi (x), then (5.1) can be restated as 1 fi (x), min f (x) ≡ x n n

(5.2)

i=1

where n is the number of individual functions. When n is finite, (5.2) is an offline problem, with typical examples including empirical risk minimization (ERM). n can also go to infinity and we refer to this case as an online (streaming) problem. Obtaining the full gradient of (5.2) might be expensive when n is large and even inaccessible when n = ∞. Instead, a standard manner is to estimate the full gradient via one or several randomly sampled counterparts from individual functions. The obtained algorithms to solve problem (5.2) are referred to as stochastic algorithms, involving the following characteristics: 1. in theory, most convergence properties are studied in the forms of expectation (concentration); 2. in real experiments, the algorithms are often much faster than the batch (deterministic) ones.

© Springer Nature Singapore Pte Ltd. 2020 Z. Lin et al., Accelerated Optimization for Machine Learning, https://doi.org/10.1007/978-981-15-2910-8_5

137

138

5 Accelerated Stochastic Algorithms

Because the updates access the gradient individually, the time complexity to reach a tolerable accuracy can be evaluated by the total number of calls for individual functions. Formally, we refer to it as the Incremental First-order Oracle (IFO) calls, with definition as follows: Definition 5.1 For problem (5.2), an IFO takes an index i ∈ [n] and a point x ∈ Rd , and returns the pair (fi (x), ∇fi (x)). As has been discussed in Chap. 2, the momentum (acceleration) technique ensures a theoretically faster convergence rate for deterministic algorithms. We might ask whether the momentum technique can accelerate stochastic algorithms. Before we answer the question, we first put our attention on the stochastic algorithms itself. What is the main challenge in analyzing the stochastic algorithms? Definitely, it is the noise of the gradients. Specifically, the variance of the noisy gradient will not go to zero through the updates, which fundamentally slows down the convergence rate. So a more involved question is to ask whether the momentum technique can reduce the negative effect of noise? Unfortunately, the existing results answer the question with “No.” The momentum technique cannot reduce the variance but instead can accumulate the noise. What really reduces the negative effect of noise is the technique called variance reduction (VR) [10]. When applying the VR technique, the algorithms are transformed to act like a deterministic algorithm and then can be further fused with momentum. Now we are able to answer our first question. The answer is towards positivity: in some cases, however, not all (e.g., when n is large), the momentum technique fused with VR ensures a provably faster rate! Besides, another effect of the momentum technique is that it can accumulate the noise. Thus one can reduce the variance together after it is aggregated by the momentum. By doing this, the mini-batch sampling size is increased, which is very helpful in distributed optimization (see Chap. 6). From a high level view, we summarize the way to accelerate stochastic methods into two steps: 1. transforming the algorithm into a “near deterministic” one by VR and 2. fusing the momentum trick to achieve a faster rate of convergence. We conclude the advantages of momentum technique for stochastic algorithms below: • Ensure faster convergence rates (by order) when n is sufficiently small. • Ensure larger mini-batch sizes for distributed optimization. In the following sections, we will concretely introduce how to use the momentum technique to accelerate algorithms according to different properties of f (x). In particular, we roughly split them into three cases:

5.1 The Individually Convex Case

139

Algorithm 5.1 Accelerated stochastic coordinate descent (ASCD) [8] Input θk , step size γ = L1c , x0 = 0, and z0 = 0. for k = 0 to K do 1 yk = (1 − θk )xk + θk zk , 2 Randomly choose an index ik from [n],    3 δ = argminδ hik (zkik + δ) + ∇ik f (yk ), δ +

nθk 2 2γ δ

 ,

4 zk+1 = zkik + δ, with other coordinates unchanged, ik 5 xk+1 = (1 − θk )xk + nθk zk+1 − (n − 1)θk zk . end for Output xK+1 .

• each fi (x) is convex, which we call the individually convex (IC) case. • each fi (x) can be nonconvex but f (x) is convex, which we call the individually nonconvex (INC) case. • f (x) is nonconvex (NC). At last, we extend some results to linearly constrained problems.

5.1 The Individually Convex Case We consider a more general form of (5.2): 1 fi (x), n n

min F (x) ≡ h(x) + x

(5.3)

i=1

where h(x) and fi (x) with i ∈ [n] are convex, the proximal mapping of h can be computed efficiently, and n is finite.

5.1.1 Accelerated Stochastic Coordinate Descent To use the finiteness of n, we can solve (5.3) in its dual space in which n becomes the dimension of the dual variable and one coordinate update of the dual variable corresponds to one time of accessing the individual function. We first introduce accelerated stochastic coordinate descent (ASCD) [8, 11, 16] and later illustrate how to apply it to solve (5.3). ASCD is first proposed in [16], and later proximal versions are discovered in [8, 11]. With a little abuse of notation, without hindering the readers to understand the momentum technique, we still write the optimization model for ASCD as follows: min F (x) ≡ h(x) + f (x),

x∈Rn

(5.4)

140

5 Accelerated Stochastic Algorithms

where f (x) has Lc -coordinate Lipschitz continuous gradients n (Definition A.13), h(x) has coordinate separable structure, i.e., h(x) = i=1 hi (xi ) with x = (xT1 , · · · , xTn )T , and f (x) and hi (x) are convex. Stochastic coordinate descent algorithms are efficient to solve (5.4). In each update, one random coordinate xik is chosen to sufficiently reduce the objective value while keeping other coordinates fixed, reducing the periteration cost. More specifically, the following types of proximal subproblem are solved:

  θ k 2 k k δ = argmin hik (xik + δ) + ∇ik f (x ), δ + δ , 2γ δ where ∇ik f (x) denotes the partial gradient of f with respect to xik . Fusing with the momentum technique, ASCD is shown in Algorithm 5.1. We give the convergence result below. The proof is taken from [8]. Lemma 5.1 If θ0 ≤ n1 and for all k ≥ 0, θk ≥ 0 and is monotonically nonincreasing, then xk is a convex combination of z0 , · · · , zk , i.e., we have xk = k i i=0 ek,i z , where e0,0 = 1, e1,0 = 1 − nθ0 , e1,1 = nθ , and for k > 1, we have ⎧ ⎨

ek+1,i

Defining hˆ k =

(1 − θk )ek,i , = n(1 − θk )θk−1 + θk − nθk , ⎩ nθk ,

k

i i=0 ek,i h(z ),

i ≤ k − 1, i = k, i = k + 1.

(5.5)

we have

Eik (hˆ k+1 ) = (1 − θk )hˆ k + θk

n  ik =1

hik (zk+1 ik ),

(5.6)

where Eik denotes that the expectation is taken only on the random number ik under the condition that xk and zk are known. Proof We prove ek+1,i first. When k = 0 and 1, we can check that it is right. We then prove (5.5). Since a

xk+1 = (1 − θk )xk + θk zk + nθk (zk+1 − zk ) = (1 − θk )

k 

ek,i zi + θk zk + nθk (zk+1 − zk )

i=0

= (1 − θk )

k−1  i=0

7 6 ek,i zi + (1 − θk )ek,k + θk − nθk zk + nθk zk+1 ,

5.1 The Individually Convex Case

141

a

where = uses Step 5 of Algorithm 5.1. Comparing the results, we obtain (5.5). Next, we prove that the above is a convex combination. It is easy to prove that the weights sum to 1 (by induction), 0 ≤ (1 − θk )ek,j ≤ 1, and 0 ≤ nθk ≤ 1. So (1 − θk )ek,k + θk − nθk = n(1 − θk )θk−1 + θk − nθk ≤ 1. On the other hand, we have n(1 − θk )θk−1 + θk − nθk ≥ n(1 − θk )θk + θk − nθk = θk (1 − nθk ) ≥ 0. For (5.6), we have Eik hˆ k+1 a

=

k  i=0

=

k  i=0

=

k 



ek+1,i h(zi ) + Eik nθk h(zk+1 ) ⎞ ⎛   1 ek+1,i h(zi ) + nθk ⎝hik (zk+1 hj (zkj )⎠ ik ) + n ek+1,i h(zi ) + θk

b

k−1 



k hik (zk+1 ik ) + (n − 1)θk h(z )

ik

i=0

=

j =ik

ik

ek+1,i h(zi ) + [n(1 − θk )θk−1 + θk − nθk ]h(zk ) + (n − 1)θk h(zk )

i=0

+ θk



hik (zk+1 ik )

ik

=

k−1 

ek+1,i h(zi ) + n(1 − θk )θk−1 h(zk ) + θk

 ik

i=0

c

=

k−1 

ek,i (1 − θk )h(zi ) + (1 − θk )ek,k h(zk ) + θk

k 

 ik

i=0

=

hik (zk+1 ik )

ek,i (1 − θk )h(zi ) + θk

i=0

= (1 − θk )hˆ k + θk

 ik



hik (zk+1 ik )

ik

hik (zk+1 ik ),

hik (zk+1 ik )

142

5 Accelerated Stochastic Algorithms a

b

where in = we use ek+1,k+1 = nθk , in = we use ek+1,k = n(1 − θk )θk−1 + θk − nθk , c 

and in = we use ek+1,i = (1 − θk )ek,i for i ≤ k − 1, and ek,k = nθk−1 . Theorem 5.1 For Algorithm 5.1, setting θk = combination of z0 , · · · , zk , and

2 2n+k ,

we have that xk is a convex

E[F (xK+1 )] − F (x∗ ) n2 Lc EzK+1 − x∗ 2 + 2 θK2 ≤

F (x0 ) − F (x∗ ) n2 Lc 0 z − x∗ 2 . + 2 2 θ−1

(5.7)

When√h(x) is strongly convex with modulus 0 ≤ μ ≤ Lc , setting θk =  √ − Lμc + μ2 /L2c +4μ/Lc μ/Lc which is denoted as θ instead, we have ∼ O 2n n E[F (xK+1 )] − F (x∗ ) 2

n2 θ 2 Lc + nθ μ   0  ≤ (1 − θ )K+1 F (x0 ) − F (x∗ ) + z − x∗  . (5.8) 2 Proof We can check that the setting of θk satisfies the assumptions in Lemma 5.1. We first consider the function value. By the optimality of zk+1 in Step 4, we have ik k k k nθk (zk+1 ik − zik ) + γ ∇ik f (y ) + γ ξ ik = 0,

(5.9)

where ξ kik ∈ ∂hik (zk+1 ik ). By Steps 1 and 5, we have xk+1 = yk + nθk (zk+1 − zk ).

(5.10)

Substituting (5.10) into (5.9), we have k k k xk+1 ik − yik + γ ∇ik f (y ) + γ ξ ik = 0.

(5.11)

Since f has Lc -coordinate Lipschitz continuous gradients on coordinate ik (see (A.4)) and xk+1 and yk only differ at the ik -th entry, we have f (xk+1 )  L  2  a c k k xk+1 ≤ f (yk ) + ∇ik f (yk ), xk+1 ik − yik + ik − yik 2  L  2  b c k xk+1 = f (yk ) − γ ∇ik f (yk ), ∇ik f (yk ) + ξ kik + − y ik ik 2

5.1 The Individually Convex Case

143

 L  2  c c k xk+1 = f (yk ) − γ ∇ik f (yk ) + ξ kik , ∇ik f (yk ) + ξ kik + ik − yik 2   +γ ξ kik , ∇ik f (yk ) + ξ kik γ = f (y ) − 2 d

k



k xk+1 ik − yik

γ

2

  k − ξ kik , xk+1 ik − yik ,

(5.12)

 a b c where ≤ uses Proposition A.7, in = we use (5.11), in = we insert γ ξ kik , ∇ik f (yk ) +  d ξ kik , and = uses γ = L1c .  2 Then we analyze zk+1 − x∗  . We have 2 n2   k+1  − θk x∗  θk z 2γ 2 n2   k  = θk z − θk x∗ + θk zk+1 − θk zk  2γ 2 n2  2 n2    k k θk zk+1 = − θ z θk z − θk x∗  + k ik ik 2γ 2γ   n2   k+1 θk zik − zkik , θk zkik − θk x∗ik + γ 2  2 2 1  k+1 a n   xik − ykik = θk zk − θk x∗  + 2γ 2γ   −n ∇ik f (yk ) + ξ kik , θk zkik − θk x∗ik ,

(5.13)

a

where = uses (5.9) and (5.10). Then by taking expectation on (5.13), we have  2 n2   Eik θk zk+1 − θk x∗  2γ n 2 2   n2  1   k+1   k xik − ykik − ∇f (yk ), θk zk − θk x∗ = θk z − θk x∗  + 2γ 2γ n ik =1



n    ξ kik , θk zkik − θk x∗ik .

ik =1

(5.14)

144

5 Accelerated Stochastic Algorithms

By Step 1, we have     − ∇f (yk ), θk zk − θk x∗ = ∇f (yk ), (1 − θk )xk + θk x∗ − yk a

≤ (1 − θk )f (xk ) + θk f (x∗ ) − f (yk ),

(5.15)

a

where ≤ uses the convexity of f . Taking expectation on (5.12) and adding (5.14) and (5.15), we have Eik f (xk+1 ) ≤ (1 − θk )f (xk ) + θk f (x∗ ) n    1  k+1 ξ kik , θk zkik − θk x∗ik + xik − ykik − n ik =1

+

2 n2  2 n2   k    Eik θk zk+1 − θk x∗  θk z − θk x∗  − 2γ 2γ

= (1 − θk )f (xk ) + θk f (x∗ ) − a

n    ∗ ξ kik , θk zk+1 − θ x k ik ik

ik =1

+

2 n2  2 n2   k    Eik θk zk+1 − θk x∗  , θk z − θk x∗  − 2γ 2γ

a

where in = we use (5.10). Using the strong convexity of hik (μ ≥ 0), we have   2 μθk  k+1 k+1 ∗ ∗ z θk ξ kik , x∗ik − zk+1 ≤ θ h (x ) − θ h (z ) − − x . (5.16) k i k i k k ik ik ik ik 2 On the other hand, by analyzing the expectation we have ⎡ ⎤ n  2       2 2 1   ⎣ zk+1 − x∗i Eik zk+1 − x∗  = zk+1 + − x∗j ⎦ ik j k n ik =1

j =ik

ik =1

j =ik

⎡ ⎤ n  2   2  1 a ⎣ zk+1 − x∗i zkj − x∗j ⎦ = + ik k n =

n 2 2 n − 1  1   k+1  k  zik − x∗ik + z − x∗  , n n ik =1

(5.17)

5.1 The Individually Convex Case

145

a

where in = we use zk+1 = zkj , j = ik . Similar to (5.17), we also have j ⎡ ⎤ n  2       2 2 1   ⎣ xk+1 − yki xk+1 + − ykj ⎦ Eik xk+1 − yk  = ik j k n j =ik

ik =1

a

=

1 n

n  ik =1



k xk+1 ik − yik

2

,

a

where in = we use xk+1 = ykj , j = ik . Then we obtain j Eik f (xk+1 ) a

≤ (1 − θk )f (xk ) + θk F (x∗ ) − θk

n  ik =1

hik (zk+1 ik ) −

n 2  μθk  k+1 zik − x∗ik 2

ik =1

2 n2  2      k + Eik θk zk+1 − θk x∗  θk z − θk x∗  − 2γ 2γ n2

= (1 − θk )f (xk ) + θk F (x∗ ) − θk b

n  ik =1

+

n2 θk2

hik (zk+1 ik )

2 n2 θ 2 + nθ μγ  2 + (n − 1)θk μγ  k     k k Eik zk+1 − x∗  z − x∗  − 2γ 2γ

c = (1 − θk )f (xk ) + θk F (x∗ ) + (1 − θk )hˆ k − Eik hˆ k+1 2 n2 θ 2 + nθ μγ  2 n2 θk2 + (n − 1)θk μγ  k  k    k + Eik zk+1 − x∗  , z − x∗  − 2γ 2γ (5.18) a

b

c

where ≤ uses (5.16), = uses (5.17), and = uses Lemma 5.1. For the generally convex case (μ = 0), rearranging (5.18) and dividing both sides with θk2 , we have  2 Eik f (xk+1 ) + Eik hˆ k+1 − F (x∗ ) n2  k+1 ∗ E + − x z  i 2γ k θk2 2  2  1 − θk  k k ∗ ˆ k − F (x∗ ) + n  z f (x ≤ ) + h − x   2γ θk2 2  n2  a 1    k f (xk ) + hˆ k − F (x∗ ) + ≤ 2 z − x∗  , 2γ θk−1

146

5 Accelerated Stochastic Algorithms a

where in ≤ we use

1−θk θk2



1 2 θk−1

when k ≥ −1. Taking full expectation, we have

2 Ef (xk+1 ) + Ehˆ k+1 − F (x∗ ) n2   k+1 ∗ E + − x z  2γ θk2 ≤ a

=

2 f (x0 ) + hˆ 0 − F (x∗ ) n2   0 ∗ + − x  z 2 2γ θ−1 F (x0 ) − F (x∗ ) n2 0 z − x∗ 2 , + 2 2γ θ−1

a where in = we use hˆ 0 = h(x0 ). Then using the convexity of h(x) and that xk+1 is a convex combination of z0 , · · · , zk+1 , we have

h(xk+1 ) ≤

k+1 

a ek+1,i h(zi ) = hˆ k+1 ,

(5.19)

i=0 a

where = uses Lemma 5.1. So we obtain (5.7). For the strongly convex case (μ > 0), (5.18) gives  2 n2 θ 2 + nθ μγ   Eik zk+1 − x∗  Eik f (xk+1 ) + Ehˆ k+1 − F (x∗ ) + 2γ 2   n2 θ 2 + (n − 1)θ μγ   k  ≤ (1 − θ ) f (xk ) + Ehˆ k − F (x∗ ) + z − x∗  . 2γ For the setting of θ , we have θ 2 + (n − 1)θ μγ = (1 − θ )(n2 θ 2 + nθ μγ ). Thus  2 n2 θ 2 + nθ μγ   Eik zk+1 − x∗  2γ 2

n2 θ 2 + nθ μγ    k ≤ (1 − θ ) f (xk ) + Ehˆ k − F (x∗ ) + z − x∗  . 2γ

Eik f (xk+1 ) + Ehˆ k+1 − F (x∗ ) +

5.1 The Individually Convex Case

147

Taking full expectation, expanding the result to k = 0, then using h(xk+1 ) ≤ hˆ k+1 (see (5.19)), h(x0 ) = hˆ 0 , and zk+1 − x∗ 2 ≥ 0, we obtain (5.8).

 An important application of ASCD is to solve ERM problems in the following form: 1 λ φi (ATi x) + x2 , n 2 n

min P (x) ≡

x∈Rd

(5.20)

i=1

where λ > 0 and φi (ATi x), i = 1, · · · , n, are loss functions over training samples. Lots of machine learning problems can be formulated as (5.20), such as linear SVM, ridge regression, and logistic regression. For ASCD, we can solve (5.20) via its dual problem: minn D(a) =

a∈R

 2 n  1 ∗ λ 1  , Aa φi (−ai ) +   n 2 λn 

(5.21)

i=1

where φi∗ (·) is the conjugate function (Definition A.21) of φi (·) and A = [A1 , · · · , An ]. Then (5.21) can be solved by Algorithm 5.1.

5.1.2 Background for Variance Reduction Methods For stochastic gradient descent, due to the nonzero variance of the gradient, using a constant step size cannot guarantee the convergence of algorithms. Instead, it only guarantees a sublinear convergence rate on the strongly convex and L-smooth objective functions. For finite-sum objective functions, the way to solve the problem is called variance reduction (VR) [10] which reduces the variance to zero through the updates. The convergence rate can be accelerated to be linear for strongly convex and L-smooth objective functions. The first VR method might be SAG [17], which uses the sum of the latest individual gradients as an estimator. It requires O(nd) memory storage and uses a biased gradient estimator. SDCA [18] also achieves a linear convergence rate. In the primal space, the algorithm is known as MISO [13], which is a majorization-minimization VR algorithm. SVRG [10] is a follow-up work of SAG [17] which reduces the memory costs to O(d) and uses an unbiased gradient estimator. The main technique of SVRG [10] is frequently pre-storing a snapshot vector and bounding the variance by the distance from the snapshot vector and the latest variable. Later, SAGA [6] improves SAG by using an unbiased updates via the technique of SVRG [10].

148

5 Accelerated Stochastic Algorithms

Algorithm 5.2 Stochastic variance reduced gradient (SVRG) [10] Input x00 . Set epoch length m, x˜ 0 = x00 , and step size γ . 1 for s = 0 to S − 1 do 2 for k = 0 to m − 1 do 3 Randomly sample ik,s from [n],  ˜ (xs ) = ∇fik,s (xs ) − ∇fik,s (˜xs ) + 1 ni=1 ∇fi (˜xs ), 5 ∇f k k n s s s ˜ (x ), 6 xk+1 = xk − γ ∇f k 7 end for k s+1 1 m−1 s 8 Option I : x0 = m k=0 xk , 9 Option I I : xs+1 = xsm , 0 s+1 s+1 10 x˜ = x0 . end for s Output x˜ S .

We introduce SVRG [10] as an example. We solve the following problem: 1 fi (x). n n

min f (x) ≡

x∈Rd

(5.22)

i=1

The proximal version can be obtained straightforwardly in Sect. 5.1.3. The algorithm of SVRG is Algorithm 5.2. We have the following theorem: Theorem 5.2 For Algorithm 5.2, if each fi (x) is μ-strongly convex and L-smooth 1 , then we have and η < 2L Ef (˜xs ) − f (x∗ )

  (5.23) 2Lη 1 + Ef (˜xs−1 ) − f (x∗ ) , s ≥ 1, ≤ μη(1 − 2Lη)m 1 − 2Lη where x∗ = argminx∈Rd f (x). In other words, by setting m = O( L μ ) and η = O( L1 ), the IFO calls (Definition 5.1) to achieve an -accuracy solution is    O n+ L μ log(1/) . If each fi (x) is L-smooth (but might be nonconvex) and f (x) is μ-strongly convex, by setting η ≤ O(η−1 μ−1 ), we have

μ 8(1+e)L2

and m = −ln−1 (1 − ημ/2) ∼

Exsk − x∗ 2 ≤ (1 − ημ/2)sm+k x00 − x∗ 2 . In other words, the IFO calls to achieve an - accuracy solution is O   L2 log(1/) . 2 μ

 n+

5.1 The Individually Convex Case

149

The proof is mainly taken from [10] and [4]. Proof Let Ek denote the expectation taken only on the random number ik,s conditioned on xsk . Then we have   n    1 ˜ ik,s (xsk ) = Ek ∇fik,s (xsk ) − Ek ∇fik,s (˜xs ) − ∇fi (˜xs ) Ek ∇f n i=1

=

∇f (xsk ).

(5.24)

˜ ik,s (xs ) is an unbiased estimator of ∇f (xs ). Then So ∇f k k ˜ (xsk )2 Ek ∇f a

≤ 2Ek ∇fik,s (xsk ) − ∇fik,s (x∗ )2 + 2Ek ∇fik,s (˜xs ) − ∇fik,s (x∗ ) − ∇f (˜xs )2 b

≤ 2Ek ∇fik,s (xsk ) − ∇fik,s (x∗ )2 + 2Ek ∇fik,s (˜xs ) − ∇fik,s (x∗ )2 , a

(5.25)

b

where in ≤ we use a − b2 ≤ 2a2 + 2b2 and in ≤ we use Proposition A.2. We first consider the case when each fi (x) is strongly convex. Since each fi (x) is convex and L-smooth, from (A.7) we have ∇fi (x) − ∇fi (y)2 ≤ 2L (fi (x) − fi (y) + ∇fi (y), y − x) .

(5.26)

Letting x = xsk and y = x∗ in (5.26) and summing the result with i = 1 to n, we have 2    Ek ∇fik,s (xsk ) − ∇fik,s (x∗ ) ≤ 2L f (xsk ) − f (x∗ ) ,

(5.27)

where we use ∇f (x∗ ) = 0. In the same way, we have 2    Ek ∇fik,s (˜xs ) − ∇fik,s (x∗ ) ≤ 2L f (˜xs ) − f (x∗ ) .

(5.28)

Plugging (5.27) and (5.28) into (5.25), we have  2 6 7 ˜  (xsk ) ≤ 4L (f (xsk ) − f (x∗ )) + (f (˜xs ) − f (x∗ )) . Ek ∇f

(5.29)

150

5 Accelerated Stochastic Algorithms

On the other hand, Ek xsk+1 − x∗ 2

  = xsk − x∗ 2 + 2Ek xsk+1 − xsk , xsk − x∗ + Ek xsk+1 − xsk 2   ˜ (xsk )2 ˜ (xsk ), xsk − x∗ + η2 Ek ∇f = xsk − x∗ 2 − 2ηEk ∇f   a ≤ xsk − x∗ 2 − 2η ∇f (xsk ), xsk − x∗ 6 7 + 4Lη2 (f (xsk ) − f (x∗ )) + (f (˜xs ) − f (x∗ )) ≤ xsk − x∗ 2 − 2η(1 − 2Lη)[f (xsk ) − f (x∗ )] + 4Lη2 [f (˜xs ) − f (x∗ )], (5.30) a

where ≤ uses (5.24) and (5.29). Suppose we choose Option I . By taking full expectation on (5.30) and telescoping the result with k = 1 to m, we have Exsm − x∗ 2 + 2η(1 − 2Lη)mE(f (˜xs+1 ) − f (x∗ )) a

≤ Exsm − x∗ 2 + 2η(1 − 2Lη)

m−1 

E(f (xsk ) − f (x∗ ))

k=0

  b ≤ Exs0 − x∗ 2 + 4Lmη2 E f (˜xs ) − f (x∗ )     ≤ 2 μ−1 + 2Lmη2 E f (˜xs − f (x∗ ) , a

b

where ≤ uses Option I of Algorithm 5.2 and ≤ is by telescoping (5.30). Using xsm − x∗ 2 ≥ 0, we can obtain (5.23). Then we consider the case when f (x) is strongly convex. From f (·) being Lsmooth, by (A.5) we have  L  f (xsk+1 ) ≤ f (xsk ) + ∇f (xsk ), xsk+1 − xsk + xsk+1 − xsk 2 , 2

(5.31)

and from f (·) being μ-strongly convex, by (A.9) we have  μ  f (xsk ) ≤ f (x∗ ) + ∇f (xsk ), xsk − x∗ − xsk − x∗ 2 . 2

(5.32)

5.1 The Individually Convex Case

151

Adding (5.31) and (5.32), we have f (xsk+1 )  L  μ ≤ f (x∗ ) + ∇f (xsk ), xsk+1 − x∗ + xsk+1 − xsk 2 − xsk − x∗ 2 2 2   L 1 = f (x∗ ) − xsk+1 − xsk , xsk+1 − x∗ + xsk+1 − xsk 2 η 2   μ s ˜ (xsk ) − ∇f (xsk ), xs − x∗ − xk − x∗ 2 − ∇f k+1 2

1 1 s 1 s L ∗ ∗ 2 ∗ 2 xsk+1 − xsk 2 = f (x ) − −x  + x x − x  − − 2η k+1 2η k 2η 2   μ ˜ (xsk ) − ∇f (xsk ), xs − x∗ . − xsk − x∗ 2 − ∇f k+1 2 Then by rearranging terms and using f (xsk+1 ) − f (x∗ ) ≥ 0, we have 1 s x − x∗ 2 2 k+1   1 − ημ s ˜ (xsk ) − ∇f (xsk ), xs − x∗ xk − x∗ 2 − η ∇f ≤ k+1 2

1 Lη − xsk+1 − xsk 2 . − 2 2

(5.33)

Considering the expectation taken on the random number of ik,s , we have   ˜ (xsk ) − ∇f (xsk ), xs − x∗ −ηEk ∇f k+1   a ˜ (xsk ) − ∇f (xsk ), xs = −ηEk ∇f k+1   b ˜ (xsk ) − ∇f (xsk ), xs − xsk = −ηEk ∇f k+1 ˜ (xsk ) − ∇f (xsk )2 + 1 Ek xs − xsk 2 ≤ η2 Ek ∇f k+1 4 2 1  ≤ η2 Ek ∇fik,s (xsk ) − ∇fik,s (˜xs ) − (∇f (xsk )−∇f (˜xs )) + Ek xsk+1 −xsk 2 4   c 1 2 ≤ η2 Ek ∇fik,s (xsk ) − ∇fik,s (˜xs ) + Ek xsk+1 − xsk 2 4 1 ≤ η2 L2 xsk − x˜ s 2 + Ek xsk+1 − xsk 2 4 1 ≤ 2η2 L2 xsk − x∗ 2 + 2η2 L2 ˜xs − x∗ 2 + Ek xsk+1 − xsk 2 , (5.34) 4

152

5 Accelerated Stochastic Algorithms a

c

b

where both = and = use (5.24) and ≤ uses Proposition A.2. From the setting of η, we have Lη ≤ 12 . Substituting (5.34) into (5.33) after taking expectation on ik , we have 1 Ek xsk+1 − x∗ 2 2 1 − ημ s xk − x∗ 2 + 2η2 L2 xsk − x∗ 2 + 2η2 L2 ˜xs − x∗ 2 . (5.35) ≤ 2 Now we use induction to prove Exsk − x∗ 2 ≤ (1 − ημ/2)sm+k x00 − x∗ 2 . When k = 0, it is right. Suppose that at iteration k ≥ 0, we have Exsk − x∗ 2 ≤ (1 − ημ/2)sm+k x00 − x∗ 2 . Now we consider k + 1. We consider Option I I in Algorithm 5.2 and have E˜xs − x∗ 2 = Exs0 − x∗ 2 ≤ (1 − ημ/2)sm x00 − x∗ 2 a

≤ e(1 − ημ/2)sm+k x00 − x∗ 2 , a

where ≤ uses k ≤ m = −ln−1 (1 − ημ/2). Then we have 4η2 L2 Exsk − x∗ 2 + 4η2 L2 E˜xs − x∗ 2 ≤ 4η2 L2 (1 + e)(1 − ημ/2)sm+k x00 − x∗ 2 a

≤ ημ/2(1 − ημ/2)sm+k x00 − x∗ 2 ,

(5.36)

a

μ where ≤ uses η ≤ 8(1+e)L 2. Taking full expectation on (5.35) and substituting (5.36) into it, we can obtain that Exsk+1 − x∗ 2 ≤ (1 − ημ/2)sm+k+1 x00 − x∗ 2 . 

5.1.3 Accelerated Stochastic Variance Reduction Method With the VR technique, we can fuse the momentum technique to achieve a faster √rate. We show that the convergence rate can be improved to  O (n + nκ) log(1/) in the IC case, where κ = L μ . We introduce Katyusha [1], which is the first truly accelerated stochastic algorithm. The main technique in Katyusha [1] is introducing a “negative momentum” which restricts the extrapolation term to be not far from x˜ , the snapshot vector introduced by SVRG [10]. The algorithm is shown in Algorithm 5.3.

5.1 The Individually Convex Case

153

Algorithm 5.3 Katyusha [1] Input θ1 , step size γ , x00 = 0, x˜ 0 = 0, z00 = 0, θ2 = 12 , m, and θ3 = μγ θ1 + 1. for s = 0 to S do for k = 0 to m − 1 1 ysk = θ1 zsk + θ2 x˜ s + (1 − θ1 − θ2 )xsk , 2 Randomly selected one (or b mini-batch in Sect. 6.2.1.1) sample(s), denoted as iks , 3 ∇˜ ks = ∇fiks (ysk ) − ∇fiks (˜xs ) + ∇f (˜xs ),   θ1 δ2 , 4 δ sk = argminδ h(zsk + δ) + ∇˜ ks , δ + 2γ zsk+1 = zsk + δ sk , xsk+1 = θ1 zsk+1 + θ2 x˜ s + (1 − θ1 − θ2 )xsk . end for k xs+1 = xsm , 0 −1   m−1 k m−1 k s s+1 x˜ = k=0 θ3 k=0 θ3 xk . end for s Output x0S+1 . 5 6

The proof is taken from [1]. We have the following lemma to bound the variance:  Lemma 5.2 For f (x) = n1 ni=1 fi (x), with each fi with i ∈ [n] being convex and L-smooth. For any u and x˜ , defining  ˜ (u) = ∇fk (u) − ∇fk (˜x) + 1 ∇fi (˜x), ∇f n n

i=1

we have 2     ˜ (u) − ∇f (u) ≤ 2L f (˜x) − f (u) + ∇f (u), u − x˜  , E ∇f

(5.37)

where the expectation is taken on the random number k under the condition that u and x˜ are known. Proof 2   ˜ (u) − ∇f (u) E ∇f   2  = E ∇fk (u) − ∇fk (˜x) − ∇f (u) − ∇f (˜x)   2 a ≤ E ∇fk (u) − ∇fk (˜x) ,

154

5 Accelerated Stochastic Algorithms a

where in inequality ≤ we use   E ∇fk (u) − ∇fk (˜x) = ∇f (u) − ∇f (˜x) 

and Proposition A.2. Then by directly applying (A.7) we obtain (5.37).

L Theorem 5.3 Suppose that h(x) is μ-strongly convex and n ≤ 4μ . For Algo μγ nμ 1 rithm 5.3, if the step size γ = 3L , θ1 = L , θ3 = 1 + θ1 , and m = n, we have

F (˜xS+1 ) − F (x∗ ) ≤ θ3−Sn Proof Because n ≤

L 4μ

&

' 1 1  z00 − x∗ 2 + 1 + F (x00 ) − F (x∗ ) . 4nγ n

and θ1 =



nμ L ,

we have

1 , 2 1 − θ1 − θ2 ≥ 0. θ1 ≤

(5.38)

For Step 1, ysk = θ1 zsk + θ2 x˜ s + (1 − θ1 − θ2 )xsk .

(5.39)

Together with Step 6, we have xsk+1 = ysk + θ1 (zsk+1 − zsk ).

(5.40)

Through the optimality of zsk+1 in Step 4 of Algorithm 5.3, there exists ξ sk+1 ∈ ∂h(zsk+1 ) satisfying θ1 (zsk+1 − zsk ) + γ ∇˜ ks + γ ξ sk+1 = 0. From (5.39) and (5.40), we have xsk+1 − ysk + γ ∇˜ ks + γ ξ sk+1 = 0. For f (·) is L-smooth, we have  L xs − ys 2 k k+1 2 2 L = f (ysk ) − γ ∇f (ysk ), ∇˜ ks + ξ sk+1  + xsk+1 − ysk  2

f (xsk+1 ) ≤ f (ysk ) + ∇f (ysk ), xsk+1 − ysk  +

= f (ysk ) − γ ∇˜ ks + ξ sk+1 , ∇˜ ks + ξ sk+1  a

(5.41)

5.1 The Individually Convex Case

155

 L xs − ys 2 + γ ∇˜ s + ξ s − ∇f (ys ), ∇˜ s + ξ s  k k k k k+1 k+1 k+1 2 

  2 γL  b  1 xs − ys  −∇˜ s − ∇f (ys ), xs − ys  = f (ysk ) − γ 1 − k  k k k k+1 2  γ k+1 +

−ξ sk+1 , xsk+1 − ysk ,

(5.42)

where in = we add and subtract the term γ ∇˜ ks + ξ sk+1 , ∇˜ ks + ξ sk+1  and = uses the equality (5.41). For the last but one term of (5.42), we have a

b

Ek ∇˜ ks − ∇f (ysk ), ysk − xsk+1 

   2 γ C 1  s 2 γ 3 ˜s s  s   Ek  xk+1 − yk  ≤ Ek ∇k − ∇f (yk ) + 2C3 2 γ   2  γ C3  1  s b γL  s  x f (˜xs ) − f (ysk ) + ∇f (ysk ), ysk − x˜ s  + ≤ − y Ek  k  ,  γ k+1 C3 2 a

(5.43) where we use Ek to denote that expectation is taken on the random number iks (step a

k and epoch s) under the condition that ysk is known, in ≤ we use the Cauchy– b

Schwartz inequality, and ≤ uses (5.37). C3 is an absolute constant determined later. Taking expectation for (5.42) on the random number iks and adding (5.43), we obtain Ek f (xsk+1 ) ≤

f (ysk ) − γ



 1  s  2 γ L C3 s   − Ek  xk+1 − yk  1− 2 2 γ

−Ek ξ sk+1 , xsk+1 − ysk  +

 γL  f (˜xs ) − f (ysk ) + ∇f (ysk ), ysk − x˜ s  . C3 (5.44)

On the other hand, we analyze zsk − x∗ 2 . Setting a = 1 − θ1 − θ2 , we have  s θ1 z

k+1

2 − θ1 x∗ 

 2 = xsk+1 − axsk − θ2 x˜ s − θ1 x∗   2 = ysk − axsk − θ2 x˜ s − θ1 x∗ − (ysk − xsk+1 )  2  2 = ysk − axsk − θ2 x˜ s − θ1 x∗  + ysk − xsk+1 

156

5 Accelerated Stochastic Algorithms

−2γ ξ sk+1 + ∇˜ ks , ysk − axsk − θ2 x˜ s − θ1 x∗  2  2 a  = θ1 zsk − θ1 x∗  + ysk − xsk+1    −2γ ξ sk+1 + ∇˜ ks , ysk − axsk − θ2 x˜ s − θ1 x∗ ,

(5.45)

a

where = uses Step 1 of Algorithm 5.3. For the last term of (5.45), we have   Ek ∇˜ ks , axsk + θ2 x˜ s + θ1 x∗ − ysk    a  = ∇f (ysk ), axsk + θ1 x∗ − (1 − θ2 )ysk + θ2 ∇f (ysk ), x˜ s − ysk   b ≤ af (xsk ) + θ1 f (x∗ ) − (1 − θ2 )f (ysk ) + θ2 ∇f (ysk ), x˜ s − ysk , (5.46) a

where in ≤ we use Ek (∇˜ ks ) = ∇f (ysk ) and that axsk + θ2 x˜ s + θ1 x∗ − ysk is constant, because the expectation is taken only on the random number ik,s , and in inequality b

≤ we use the convexity of f (·) and so for any vector u, ∇f (ysk ), u − ysk  ≤ f (u) − f (ysk ). Then dividing (5.45) by 2γ and taking expectation on the random number iks , we have  2 1 Ek θ1 zsk+1 − θ1 x∗  2γ ≤

     2 1   θ1 zs − θ1 x∗ 2 + γ Ek  1 ys − xs k k+1  2γ 2 γ k   −Ek ξ sk+1 + ∇˜ sk , ysk − axsk − θ2 x˜ s − θ1 x∗

  1  s  2 1  γ s ∗ 2 s   θ1 zk − θ1 x ≤ + Ek  yk − xk+1   2γ 2 γ  s  −Ek ξ k+1 , ysk − axsk − θ2 x˜ s − θ1 x∗   +af (xsk ) + θ1 f (x∗ ) − (1 − θ2 )f (ysk ) + θ2 ∇f (ysk ), x˜ s − ysk , (5.47) a

a

where ≤ uses (5.46). Adding (5.47) and (5.44), we obtain that Ek f (xsk+1 ) +

 2 1 Ek θ1 zsk+1 − θ1 x∗  2γ

≤ af (xsk ) + θ1 f (x∗ ) + θ2 f (ysk ) + θ2 ∇f (ysk ), x˜ s − ysk 

5.1 The Individually Convex Case

−γ

157

1 γ L C3 − − 2 2 2



  1  s  2 s  Ek  y − x k+1  γ k

−Ek ξ sk+1 , xsk+1 − axsk − θ2 x˜ s − θ1 x∗     1  γL  θ1 zs − θ1 x∗ 2 f (˜xs ) − f (ysk ) + ∇f (ysk ), ysk − x˜ s + k C3 2γ  a 1  θ1 zs − θ1 x∗ 2 ≤ af (xsk ) + θ1 f (x∗ ) + θ2 f (˜xs ) + k 2γ  s  s s s (5.48) −Ek ξ k+1 , xk+1 − axk − θ2 x˜ − θ1 x∗ , +

a

where in ≤ we set C3 =

γL θ2 .

For the last term of (5.48), we have

  − ξ sk+1 , xsk+1 − axsk − θ2 x˜ s − θ1 x∗   = − ξ sk+1 , θ1 zsk+1 − θ1 x∗ ≤ θ1 h(x∗ ) − θ1 h(zsk+1 ) −

 μθ1  zs − x∗ 2 k+1 2

a

≤ θ1 h(x∗ ) − h(xsk+1 ) + θ2 h(˜xs ) + ah(xsk ) −

 μθ1  zs − x∗ 2 , (5.49) k+1 2

a

where in ≤ we use xsk+1 = axsk + θ2 x˜ s + θ1 zsk+1 and the convexity of h(·). Substituting (5.49) into (5.48), we obtain Ek F (xsk+1 ) +

1+

μγ θ1



 s θ1 z

k+1

2 − θ1 x∗ 

≤ aF (xsk ) + θ1 F (x∗ ) + θ2 F (˜xs ) + Set θ1 =



nμ L

and θ3 =

μγ θ1

+1 ≤

Lγ θ1

+1 ≤

 1  θ1 zs − θ1 x∗ 2 . k 2γ 1 3



μ Ln

(5.50)

+ 1. Taking expectation

on (5.50) for the first k − 1 iterations, then multiplying it with θ3k , and telescoping the results with k from 0 to m − 1, we have m 

m−1      θ3k−1 F (xsk ) − F (x∗ ) − a θ3k F (xsk ) − F (x∗ )

k=1

− θ2

k=0 m−1  k=0

2    θ m  1  θ1 zs − θ1 x∗ 2 . θ3k F (˜xs ) − F (x∗ ) + 3 θ1 zsm − θ1 x∗  ≤ 0 2γ 2γ (5.51)

158

5 Accelerated Stochastic Algorithms

By rearranging the terms of (5.51), we have [θ1 + θ2 − (1 − 1/θ3 )]

m 

    θ3k F (xsk ) − F (x∗ ) + θ3m a F (xsm ) − F (x∗ )

k=1

+

θ3m 2γ

  s θ1 z − θ1 x∗ 2 m

≤ θ2

m−1  k=0

     1  θ1 zs −θ1 x∗ 2 . θ3k F (˜xs ) − F (x∗ ) + a F (xs0 )−F (x∗ ) + 0 2γ

From the definition x˜ s+1 =

[θ1 + θ2 − (1 − 1/θ3 )]θ3

 m−1 j =0

m−1 

 θ3k

 j −1 m−1 j s j =0 θ3 xj ,

θ3

we have

    F (˜xs+1 ) − F (x∗ ) + θ3m a F (xsm ) − F (x∗ )

k=0

+

θ3m 2γ

  s θ1 z − θ1 x∗ 2 m

≤ θ2

m−1 

 θ3k

    F (˜xs ) − F (x∗ ) + a F (xs0 ) − F (x∗ )

k=0

 1  θ1 zs − θ1 x∗ 2 . + 0 2γ

(5.52)

Since   θ2 θ3m−1 − 1 + (1 − 1/θ3 ) 1 ≤ 2



11 ≤ 22 a

1 1+ 3





nμ + L

μ nL 

1 3

−1 +

μ nL b

θ3



m−1

7 ≤ 12



1 3



μ nL

θ3

nμ ≤ θ1 , L

(5.53)

a

where in ≤ we use μn ≤ L/4 < L, m − 1 = n − 1 ≤ n, and the fact that 1 3 g(x) = (1 + x)c ≤ 1 + cx when c ≥ 1 and x ≤ , 2 c

(5.54)

5.1 The Individually Convex Case

159

b

and in ≤ we use θ3 n ≥ θ3 ≥ 1. To prove (5.54), we can use Taylor expansion at point x = 0 to obtain (1 + x)c = 1 + cx +

3 c(c − 1) 2 c(c − 1) 1 ξ ≤ 1 + cx + x ≤ 1 + cx, 2 2 c 2

where ξ ∈ [0, x]. Equation (5.53) indicates that θ1 + θ2 − (1 − 1/θ3 ) ≥ θ2 θ3m−1 . This and (5.52) gives θ2 θ3m

m−1 

 θ3k



   F (˜xs+1 ) − F (x∗ ) + θ3m a F (xsm ) − F (x∗ )

k=0

+

θ3m 2γ

  s θ1 z − θ1 x∗ 2 m

≤ θ2

m−1 

 θ3k

    F (˜xs ) − F (x∗ ) + a F (xs0 ) − F (x∗ )

k=0

 1  θ1 zs − θ1 x∗ 2 . + 0 2γ By expanding the above inequality from s = S, · · · , 0, we have θ2

m−1 

 θ3k

    F (˜xS+1 ) − F (x∗ ) + (1 − θ1 − θ2 ) F (xSm ) − F (x∗ )

k=0

+

2 θ12   S+1  z0 − x∗  2γ  ( m−1     ≤ θ3−Sm θ3k + (1 − θ1 − θ2 ) F (x00 ) − F (x∗ ) θ2 k=0

+

θ12 2γ

) 2   0 ∗ z0 − x  .

Since θ3k ≥ 1, we have we have S+1

F (˜x



) − F (x ) ≤

This ends the proof.

m−1 k=0

θ3−Sn

θ3k ≥ n. Then using θ2 =

1 2

and θ1 ≤

1 2

(see (5.38)),

&

2 '  1  1   0 0 ∗ ∗ 1+ F (x0 ) − F (x ) + z − x  . n 4nγ 0 

160

5 Accelerated Stochastic Algorithms

5.1.4 Black-Box Acceleration In this section, we introduce black-box acceleration methods. In general, the methods solve (5.3) by constructing a series of subproblems known as “mediator” [15], which can be solved efficiently to a high accuracy. The main advantages of these methods are listed as follows: 1. The black-box methods make acceleration easier, because we only need to concern the method to solve subproblems. In most time, the subproblems have a good condition number and so are easy to be solved. Specifically, for general problem of (5.3), the subproblems can be solved by arbitrary vanilla VR methods. For specific forms of (5.3), one is allowed to design appropriate methods according to the characteristic of functions to solve them without considering acceleration techniques. 2. The black-box methods make the acceleration technique more general. For different properties of objectives, no matter strongly convex or not and smooth or not, the black-box methods are able to give a universal accelerated algorithm. The first stochastic black-box acceleration [19]. Its con  √ might be Acc-SDCA vergence rate for the IC case is O (n + nκ) log(κ) log2 (1/) . Later, Lin et al. proposed  a generic  called Catalyst [12], which achieved a convergence √ acceleration, rate O (n + nκ) log2 (1/) , outperforming Acc-SDCA by a factor of log(κ). Allen-Zhu et al. [2] designed a black-box acceleration by gradually decreasing the condition number of the subproblem, achieving O(log(1/)) faster rate than Catalyst [12] on some general objective functions. In the following, we introduce Catalyst [12] as an example. The algorithm of Catalyst is shown in Algorithm 5.4. The main theorem for Algorithm 5.4 is as follows: Theorem 5.4 For problem (5.3), suppose that F (x) is μ-strongly convex and set √ μ α0 = q with q = μ+κ and k =

2 (F (x0 ) − F ∗ )(1 − ρ)k 9

with ρ ≤



q.

Then Algorithm 5.4 generates iterates {xk }k≥0 such that F (xk ) − F ∗ ≤ C(1 − ρ)k+1 (F (x0 ) − F ∗ )

8 with C = √ . ( q − ρ)2

We leave the proof to Sect. 5.2 and first introduce how to use it to obtain an accelerated algorithm. We have the following theorem:

5.2 The Individually Nonconvex Case

161

Algorithm 5.4 Catalyst [12] Input x0 , parameters κ and α0 , sequence {k }k≥0 , and optimization method M. 1 Initialize q = μ/(μ + κ) and y0 = x0 . 2 for k = 0 to K do ? @ 3 Applying M to solve: xk = argminx∈Rp Gk (x) ≡ F (x) + κ2 x − yk−1 2 ∗ to the accuracy satisfying Gk (xk ) − Gk ≤ k , 2 + qαk , 4 Compute αk ∈ (0, 1) from equation αk2 = (1 − αk )αk−1 5

yk = xk + βk (xk − xk−1 ) with βk =

αk−1 (1−αk−1 ) . 2 +α αk−1 k

end for k Output xK+1 .

Theorem 5.5 For (5.3), assume that each fi (x) is convex and L-smooth and h(x) is L μ-strongly convex satisfying μ ≤ L/n.1 For Algorithm 5.4, if setting κ = n−1 and solving Step 3 by SVRG [10], then one can obtain an -accuracy solution satisfying √  F (x) − F (x∗ ) ≤  with IFO calls of O nL/μ log2 (1/) .     L L Proof The subproblem in Step 3 is μ + n−1 -strongly convex and L + n−1 smooth. From Theorem 5.2, applying SVRG to solve the subproblem needs



O O

√

n+

L L+ n−1

log(1/)  nL/μ log (1/) . L μ+ n−1 2

= O (n log(1/)) IFOs. So the total complexity is 

The black-box algorithms, e.g., Catalyst [12], are proposed earlier than Katyusha [1], discussed in Sect. 5.1.3. From the convergence results, Catalyst [12] is O(log(1/)) times lower than Katyusha [1]. However, the black-box algorithms are more flexible and easier to obtain an accelerated rate. For example, in the next section we apply Catalyst to the INC case.

5.2 The Individually Nonconvex Case In this section, we consider solving (5.3) by allowing nonconvexity of fi (x) but f (x), the sum of fi (x), is still convex. One important application of it is principal component analysis [9]. It is also the core technique to obtain faster rate in the NC case. Notice that the convexity of f (x) guarantees an achievable global minimum ∗ of (5.3). As shown in Theorem 5.2, to reach 2aminimizerof F (x) − F ≤ , L 2 vanilla SVRG needs IFO calls of O n + μ2 log (1/) . We will show that     2 log (1/) by the convergence rate can be improved to be O n + n3/4 L μ n ≥ O(L/μ), O(n) is the dominant dependency in the convergence rate, thus the momentum technique cannot achieve a faster rate by order.

1 If

162

5 Accelerated Stochastic Algorithms

acceleration. Compared with the computation costs in the IC case, the costs of NC cases are (n1/4 ) times larger, which is caused by the nonconvexity of individuals. We still use Catalyst [12] to obtain the result. We first prove Theorem 5.4 in the context of INC. The proof is directly taken from [12], which is based on the estimate sequence in [14] (see Sect. 2.1) and further takes the error of inexact solver into account. Proof of Theorem 5.4 in the Context of INC Define the estimate sequence as follows: 1. φ0 (x) ≡ F (x0 ) + 2. For k ≥ 0,

γ0 2 2 x − x0  ;

φk+1 (x) = (1 − αk )φk (x) + αk [F (xk+1 ) + κ(yk − xk+1 ), x − xk+1 ] +

μ x − xk+1 2 , 2

where γ0 ≥ 0 will be defined in Step 3. One can find that the main difference of estimate sequence defined here with the original one in [15] is replacing ∇f (yk ) with κ(yk − xk+1 ) (see (2.5)). Step 1: For all k ≥ 0, we can have φk (x) = φk∗ +

γk x − vk 2 , 2

(5.55)

where φ0∗ = F (x0 ) and v0 = x0 when k = 0, and γk , vk , and φk∗ satisfy: γk = (1 − αk−1 )γk−1 + αk−1 μ, vk =

(5.56)

1 [(1 − αk−1 )γk−1 vk−1 + αk−1 μxk − αk−1 κ(yk−1 − xk )] , γk

∗ φk∗ = (1 − αk−1 )φk−1 + αk−1 F (xk ) −

+

2 αk−1

2γk

(5.57)

κ(yk−1 − xk )2

 αk−1 (1 − αk−1 )γk−1  μ xk − vk−1 2 + κ(yk−1 − xk ), vk−1 − xk  , γk 2 (5.58)

when k > 0. Proof of Step 1: we need to check φk∗ +

γk x − vk 2 2

= (1 − αk−1 )φk−1 (x)   μ + αk−1 F (xk ) + κ(yk−1 − xk ), x − xk ] + x − xk 2 . 2

5.2 The Individually Nonconvex Case

163

Suppose that at iteration k − 1, (5.55) is right. Then we need to prove γk x − vk 2 2   γk−1 ∗ x − vk−1 2 = (1 − αk−1 ) φk−1 + 2   μ +αk−1 F (xk ) + κ(yk−1 − xk ), x − xk  + x − xk 2 . 2

φk∗ +

(5.59)

Both sides of the equation are simple quadratic forms. By comparing the coefficient of x2 , we have γk = (1 − αk−1 )γk−1 + αk−1 μ. Then by computing the gradient at x = 0 on both sides of (5.59), we can obtain γk vk = (1 − αk−1 )γk−1 vk−1 − αk−1 κ(yk−1 − xk ) + αk−1 μxk . For φk∗ , we can set x = xk in (5.59) and obtain: φk∗ = −

  γk γk−1 ∗ xk − vk 2 + (1 − αk−1 ) φk−1 xk − vk−1 2 + αk−1 F (xk ). + 2 2

Then by substituting (5.56) and (5.57) to the above, we can obtain (5.58). Step 2: For Algorithm 5.4, we can have F (xk ) ≤ φk∗ + ξk ,

(5.60)

   where ξ0 = 0 and ξk = (1 − αk−1 ) ξk−1 + k − (κ + μ) xk − x∗k , xk−1 − xk . Proof of Step 2: Suppose that x∗k is the optimal solution in Step 3 of Algorithm 5.4. By the (μ + κ)-strongly convexity of Gk (x), we have Gk (x)  G∗k +

κ +μ x − x∗k 2 . 2

Then κ +μ κ x − x∗k 2 − x − yk−1 2 2 2 κ +μ κ a (x − xk ) + (xk − x∗k )2 − x − yk−1 2 = Gk (xk ) − k + 2 2 κ κ + μ κ x − xk 2 − x − yk−1 2 ≥ F (xk ) + xk − yk−1 2 − k + 2 2 2 +(κ + μ)xk − x∗k , x − xk  μ = F (xk ) + κ yk−1 − xk , x − xk  − k + x − xk 2 2 ∗ +(κ + μ)xk − xk , x − xk , (5.61)

F (x) ≥ G∗k +

164

5 Accelerated Stochastic Algorithms a

where in = we use that Gk (xk ) is an -accuracy solution of Step 3. Now we prove (5.60). When k = 0, we have F (x0 ) = φ0∗ . Suppose that for k − 1 ∗ with k ≥ 1, (5.60) is right, i.e., F (xk−1 ) ≤ φk−1 + ξk−1 . Then ∗ ≥ F (xk−1 ) − ξk−1 φk−1

  a ≥ F (xk ) + κ(yk−1 − xk ), xk−1 − xk  + (κ + μ) xk − x∗k , xk−1 − xk −k − ξk−1 = F (xk ) + κ(yk−1 − xk ), xk−1 − xk  − ξk /(1 − αk−1 ),

(5.62)

a

where ≥ uses (5.61). Then from (5.58), we have ∗ φk∗ = (1 − αk−1 )φk−1 + αk−1 F (xk ) −

+

2 αk−1

2γk

κ(yk−1 − xk )2

 αk−1 (1 − αk−1 )γk−1  μ xk − vk−1 2 + κ(yk−1 − xk ), vk−1 − xk  γk 2

a

≥ (1 − αk−1 )F (xk ) + (1 − αk−1 ) κ(yk−1 − xk ), xk−1 − xk  − ξk + αk−1 F (xk ) 2 αk−1

αk−1 (1 − αk−1 )γk−1 κ(yk−1 − xk ), vk−1 − xk  γk   αk−1 γk−1 = F (xk ) + (1 − αk−1 ) κ(yk−1 − xk ), xk−1 − xk + (vk−1 − xk ) γk −



2γk

2 αk−1

2γk

κ(yk−1 − xk )2 +

κ(yk−1 − xk )2 − ξk

  αk−1 γk−1 b = F (xk ) + (1 − αk−1 ) κ(yk−1 − xk ), xk−1 − yk−1 + (vk−1 − yk−1 ) γk   2 (κ + 2μ)αk−1 + 1− κyk−1 − xk 2 − ξk , 2γk a

b

where ≥ uses (5.62) and = uses (5.56). Step 3: Set xk−1 − yk−1 +

αk−1 γk−1 (vk−1 − yk−1 ) = 0, γk γ0 =

(5.63)

α0 [(κ + μ)α0 − μ] , 1 − α0

2 γk = (κ + μ)αk−1 ,

k ≥ 1.

(5.64)

5.2 The Individually Nonconvex Case

165

By Step 4 of Algorithm 5.4, we can have yk = xk +

αk−1 (1 − αk−1 ) (xk − xk−1 ). 2 αk−1 + αk

(5.65)

Proof of Step 3: Suppose that at iteration k − 1, (5.65) is right, then from (5.57) in Step 1, we have 1 [(1 − αk−1 )γk−1 vk−1 + αk−1 μxk − αk−1 κ(yk−1 − xk )] γk 1 − αk−1 a 1 = [(γk + αk−1 γk−1 )yk−1 − γk xk−1 ] γk αk−1 9 +αk−1 μxk − αk−1 κ(yk−1 − xk )

vk =

1

b

=

αk−1

[xk − (1 − αk−1 )xk−1 ],

a

(5.66)

b

where = uses (5.63) and in = we use c

(1 − αk−1 )(γk + αk−1 γk−1 ) = (1 − αk−1 )(γk−1 + αk−1 μ) d

e

2 2 = γk − μαk−1 = καk−1 , c

d

e

in which both = and = use (5.56) and = uses (5.64). Plugging (5.66) into  (5.63) with k − 1 being replaced by k, we obtain (5.65). Step 4: Set λk = k−1 i=0 (1 − αi ), we can have  γk 1  F (xk ) − F ∗ + x∗ − vk 2 λk 2 k k √  i  2i γi ∗ ≤ φ0 (x∗ ) − F ∗ + + x − vi . λi λi i=1

i=1

Proof of Step 4: From the definition of φk (x), we have φk (x∗ ) = (1 − αk−1 )φk−1 (x∗ )  2   μ  +αk−1 F (xk ) + κ(yk−1 − xk ), x∗ − xk + x∗ − xk  2 a

≤ (1 − αk−1 )φk−1 (x∗ ) 6 7  +αk−1 F (x∗ ) + k − (κ + μ) xk − x∗k , x∗ − xk ,

(5.67)

166

5 Accelerated Stochastic Algorithms a

where ≤ uses (5.61). Rearranging terms and using  the definition of ξk = (1 −  αk−1 ) ξk−1 + k − (κ + μ) xk − x∗k , xk−1 − xk , we have φk (x∗ ) + ξk − F ∗ ≤ (1 − αk−1 )(φk−1 (x∗ ) + ξk−1 − F ∗ ) + k   −(κ + μ) xk − x∗k , (1 − αk−1 )xk−1 + αk−1 x∗ − xk  a ≤ (1 − αk−1 )(φk−1 (x∗ ) + ξk−1 − F ∗ ) + k + 2k γk x∗ − vk , (5.68) a

where F ∗ = F (x∗ ) and in = we use   −(κ + μ) xk − x∗k , (1 − αk−1 )xk−1 + αk−1 x∗ − xk   a = −αk−1 (κ + μ) xk − x∗k , x∗ − vk ≤ αk−1 (κ + μ)xk − x∗k x∗ − vk   b c  ≤ αk−1 2(κ + μ)k x∗ − vk  = 2k γk x∗ − vk , b

a

c

where = uses (5.66), ≤ uses (A.10), and = uses (5.64). Then dividing λk on both sides of (5.68) and telescoping the result with k = 1 to k, we have k k   i  1  φk (x∗ ) + ξk − F ∗ ≤ φ0 (x∗ ) − F ∗ + + λk λi i=1

i=1

√ 2i γi ∗ x − vi . λi

Using the definition of φk (x∗ ), we have φk (x∗ ) + ξk − F ∗ = φk∗ + ξk − F ∗ + a

≥ F (xk ) − F ∗ +

γk ∗ x − vk 2 2

γk ∗ x − vk 2 , 2

a

obtain (5.67). where ≥ uses (5.60). So we  γi ∗ − v , α = 2 i , and S = φ (x∗ ) − F ∗ + Step 5: By setting ui = 2λ x i i k 0 λi i k  i ∗ i=1 λi , (2.37) of Lemma 2.8 is validated due to (5.67) and F (x) − F ≥ 0. So by Lemma 2.8 we have  ∗ a

F (xk ) − F ≤ λk Sk +

k  i=1

a

where ≤ uses (5.67).

 αi ui

 2 k    i Sk + 2 , ≤ λk λi i=1

(5.69)

5.2 The Individually Nonconvex Case

167

 √ μ Step 6: We prove Theorem 5.4. By setting α0 = q = μ+κ , we can have √ √ k αk = q for k ≥ 0, and then λk = (1 − q) and γ0 = μ. Thus by the μ-strong convexity of F (·), we have γ20 x0 − x∗ 2 ≤ F (x0 ) − F ∗ . Then 

Sk + 2

k   i

λi i=1   k k     i i γ0 = F (x0 ) − F ∗ + x0 − x∗ 2 + +2 2 λi λi i=1

 F (x0 ) − F ∗ +





=



γ0 x0 − x∗ 2 + 3 2

2(F (x0 ) − F ∗ ) + 3





k   i=1

2(F (x0 ) − F ∗ ) ⎣1 +

k  i=1

= ≤

 

where we set η =

2(F (x0 ) − F ∗ )

2(F (x0 ) − F ∗ )



1−ρ √ 1− q .

i=1

k   i=1

i λi

i λi

5

1−ρ √ 1− q

i ⎤ ⎦

−1 η−1

ηk+1

ηk+1 , η−1

Then from (5.69), we have

F (xk ) − F ∗

2 ηk+1 ≤ 2λk (F (x0 ) − F ) η−1

2 a η ≤2 (1 − ρ)k (F (x0 ) − F ∗ ) η−1  2 √ 1−ρ  =2 √ (1 − ρ)k (F (x0 ) − F ∗ ) √ 1−ρ− 1− q  2 1  =2 √ (1 − ρ)k+1 (F (x0 ) − F ∗ ), √ 1−ρ− 1− q ∗



168

5 Accelerated Stochastic Algorithms

 √ √ k − αi ) ≤ 1 − q . Using that 1 − x + x2 is √  √ √ q monotonically decreasing, we have 1 − ρ + ρ2 ≥ 1 − q + 2 . So we have a

where in ≤ we use λk =

k−1

i=0 (1

  8 F (xk ) − F ∗ ≤ √ (1 − ρ)k+1 F (x0 ) − F ∗ . ( q − ρ)2

(5.70) 

This ends the proof.

With a special setting of the parameters for Algorithm 5.4, we are able to give the convergence result for the INC case. Theorem 5.6 For (5.3), assume that each fi (x) is L-smooth, f (x) is convex, and 1/2 , then for Algorithm 5.4, by setting h(x) is μ-strongly convex, satisfying L μ ≥ n L κ = √n−1 and solving Step 3 by SVRG [10] by running O(n log(1/)) steps, one can obtain solution satisfying F (x) − F ∗ ≤  with IFO calls of √ an -accuracy 2 3/4 O(n L/μ log (1/)).

5.3 The Nonconvex Case In this section, we consider a hard case when f (x) is nonconvex. We only consider the problem where h(x) = 0 and focus on the IFO complexity to achieve an approximate first-order stationary point satisfying ∇f (x) ≤ . In the IC case, SVRG [10] has already achieved nearly optimal rate (ignoring some constants) when n is sufficiently large (n ≥ L μ ). However, for the NC case SVRG is not the optimal. We will introduce SPIDER [7] which can find an approximate first-order  1/2  n stationary point in an optimal (ignoring some constants) O  2 rate. Next, we will show that if further assume that the objective function has Lipschitz continuous Hessians, the momentum technique can ensure a faster rate when n is much smaller than κ = L/μ.

5.3.1 SPIDER The Stochastic Path-Integrated Differential Estimator (SPIDER) [7] technique is a radical VR method which is used to track quantities using reduced stochastic oracles. Let us consider an arbitrary deterministic vector quantity Q(x). Assume that we observe a sequence xˆ 0:K and we want to dynamically track Q(ˆxk ) for k = ˜ x0 ) ≈ Q(ˆx0 ) and an 0, 1, · · · , K. Further assume that we have an initial estimate Q(ˆ k k−1 unbiased estimate ξ k (ˆx0:k ) of Q(ˆx ) − Q(ˆx ) such that for each k = 1, · · · , K, 7 6 E ξ k (ˆx0:k ) | xˆ 0:k = Q(ˆxk ) − Q(ˆxk−1 ).

5.3 The Nonconvex Case

169

Then we can integrate (in the discrete sense) the stochastic differential estimate as ˜ x0:K ) ≡ Q(ˆ ˜ x0 ) + Q(ˆ

K 

ξ k (ˆx0:k ).

(5.71)

k=1

˜ x0:K ) the Stochastic Path-Integrated Differential Estimator, or We call estimator Q(ˆ SPIDER for brevity. We have Proposition 5.1 The martingale (Definition A.4) variance bound has  2  2 ˜  ˜ 0  x0:K ) − Q(ˆxK ) = E Q(ˆ E Q(ˆ x ) − Q(ˆx0 ) +

K  2    E ξ k (ˆx0:k ) − (Q(ˆxk ) − Q(ˆxk−1 )) .

(5.72)

k=1

Proposition 5.1 can be easily proven using the property of square-integrable martingales. Now, let Bi map any x ∈ Rd to a random estimate Bi (x), where B(x) is the true value to be estimated. At each step k, let S∗ be a subset that samples |S∗ | elements in [n] with replacement and let the stochastic estimator BS∗ = (1/|S∗ |) i∈S∗ Bi satisfy E Bi (x) − Bi (y)2 ≤ L2B x − y2 ,

(5.73)

  and xk − xk−1  ≤ 1 for all k = 1, · · · , K. Finally, we set our estimator Vk of B(xk ) as Vk = BS∗ (xk ) − BS∗ (xk−1 ) + Vk−1 . Applying Proposition 5.1 immediately concludes the following lemma, which gives an error bound of the estimator Vk in terms of the second moment of  Vk − B(xk ): Lemma 5.3 Under the condition (5.73) we have that for all k = 1, · · · , K,   2 kL2  2 2     B 1 E Vk − B(xk ) ≤ + E V0 − B(x0 ) . |S∗ |

(5.74)

˜ = V) Proof For any k > 0, we have from Proposition 5.1 (by applying Q   2 2     Ek Vk − B(xk ) = Ek BS∗ (xk ) − B(xk ) − BS∗ (xk−1 ) + B(xk−1 )  2   + Vk−1 − B(xk−1 ) .

(5.75)

170

5 Accelerated Stochastic Algorithms

Then 2    Ek BS∗ (xk ) − B(xk ) − BS∗ (xk−1 ) + B(xk−1 ) 2 1    E Bi (xk ) − B(xk ) − Bi (xk−1 ) + B(xk−1 ) |S∗ |  2 b 1   E Bi (xk ) − Bi (xk−1 ) ≤ |S∗ | 2 L2  2 c 1 2    LB E xk − xk−1  ≤ B 1 , ≤ |S∗ | |S∗ | a

=

(5.76)

a

where in = we use that S∗ is randomly sampled from [n] with replacement, so b

c

the variance reduces by |S1∗ | times. In ≤ and ≤ we use Proposition A.2 and (5.73), respectively. Combining (5.75) and (5.76), we have  2 L2  2  2     Ek Vk − B(xk ) ≤ B 1 + Vk−1 − B(xk−1 ) . |S∗ |

(5.77)

Telescoping the above display for k  = k − 1, · · · , 0 and using the iterated law of expectation (Proposition A.4), we have   2 kL2  2 2     B 1 + E V0 − B(x0 ) . E Vk − B(xk ) ≤ |S∗ |

(5.78) 

The algorithm using SPIDER to solve (5.2) is shown in Algorithm 5.5. We have the following theorem. Theorem 5.7 For the optimization problem (5.22) in the online case (n = ∞), assume that each fi (x) is L-smooth and E∇fi (x) − ∇f (x)2 ≤ σ 2 . Set the parameters S1 , S2 , η, and q as

σ n0 1  2σ 2 2σ  , q= , , S1 = 2 , S2 = , η= , ηk = min n0 Ln0 Ln0 vk  2Ln0   (5.79) A B and set K = (4Ln0 ) −2 + 1. Then running Algorithm 5.5 with OPTION II for K iterations outputs an x˜ satisfying E∇f (˜x) ≤ 5,

(5.80)

5.3 The Nonconvex Case

171

Algorithm 5.5 SPIDER for searching first-order stationary point (SPIDER-SFO) 1: Input x0 , q, S1 , S2 , n0 , , and ˜ . 2: for k = 0 to K do 3: if mod (k, q) = 0 then 4: Draw S1 samples (or compute the full gradient for the finite-sum case) and let vk = ∇fS1 (xk ). 5: else 6: Draw S2 samples and let vk = ∇fS2 (xk ) − ∇fS2 (xk−1 ) + vk−1 . 7: end if 8: OPTION I 9: if vk  ≤ 2˜ then 10: return xk . 11: else 12: xk+1 = xk − η · (vk /vk ), where 13:

η=

end if

14: OPTION II 15:

for convergence rates in high probability

xk+1

=

xk

 . Ln0

for

convergence rates in expectation  1 ηk = min . , Ln0 vk  2Ln0

− ηk

vk ,

where

16: end for 17: OPTION I: Return xK .

however, this line is not reached with high probability

18: OPTION II: Return x˜ chosen uniformly at random from {xk }K−1 k=0 .

where  = f (x0 ) − f ∗ (f ∗ = infx f (x)). The gradient cost is bounded by 24Lσ · −1 for any choice of n ∈ [1, 2σ/]. Treating , L, and σ  −3 + 2σ 2  −2 + 4σ n−1 0 0  as positive constants, the stochastic gradient complexity is O( −3 ). To prove Theorem 5.7, we first prepare the following lemmas. Lemma 5.4 Set the parameters S1 , S2 , η, and q as in (5.79), and k0 = "k/q# · q. We have & ' 2 >>   Ek0 vk − ∇f (xk ) >> x0:k0 ≤  2 , (5.81) where Ek0 denotes the conditional expectation over the randomness of x(k0 +1):k . Proof For k = k0 , we have   2 2 σ 2 2     Ek0 vk0 − ∇f (xk0 ) = Ek0 ∇fS1 (xk0 ) − ∇f (xk0 ) ≤ = . S1 2

(5.82)

172

5 Accelerated Stochastic Algorithms

From Line 15 of Algorithm 5.5 we have that for all k ≥ 0,     k+1 − xk  = min x

 1 , k Ln0 v  2Ln0

vk  ≤

 . Ln0

(5.83)

Applying Lemma 5.3 with 1 = /(Ln0 ), S2 = 2σ/(n0 ), and K = k − k0 ≤ q = σ n0 /, we have

  2 σ n 2 a  2 n0 0     Ek0 vk − ∇f (xk ) ≤ · L2 · + Ek0 vk0 − ∇f (xk0 ) ≤  2 , ·  Ln0 2σ a

where ≤ uses (5.82). The proof is completed.



Lemma 5.5 Setting k0 = "k/q# · q, we have  

3 2    . Ek0 f (xk+1 ) − f (xk ) ≤ − Ek0 vk  + 4Ln0 4n0 L

(5.84)

Proof We have ∇f (x) − ∇f (y)2 = Ei (∇fi (x) − ∇fi (y))2 ≤ Ei ∇fi (x) − ∇fi (y)2 ≤ L2 x − y2 . So f (x) is L-smooth, then 2  L    f (xk+1 ) ≤ f (xk ) + ∇f (xk ), xk+1 − xk + xk+1 − xk  2   Lη2  2 k  k = f (xk ) − ηk ∇f (xk ), vk + v  2

   ηk L   k 2 = f (xk ) − ηk 1 − v  − ηk ∇f (xk ) − vk , vk 2

 2  a 1 ηk L    k 2 ηk  k k − ≤ f (x ) − ηk v − ∇f (xk ) , v  + 2 2 2 a

where in ≤ we apply the Cauchy–Schwartz inequality. Since

1  , ηk = min k Ln0 v  2Ln0



1 1 , ≤ 2Ln0 2L

(5.85)

5.3 The Nonconvex Case

173

we have

   1 ηk L   k 2 1  k 2 − ηk v  ≥ ηk v  2 2 4

    2  k 2  vk   vk  a 2  ,   ≥ v  − 2 , min 2  =     8n0 L   4n0 L

  a 2 where in ≥ we use V (x) = min |x|, x2 ≥ |x| − 2 for all x. Hence f (xk+1 ) ≤ f (xk ) − a

≤ f (xk ) −

ηk vk  2 + + 4Ln0 2n0 L 2

 2  k  v − ∇f (xk )

2 1  vk  2  k  + + v − ∇f (xk ) , 4Ln0 2n0 L 4Ln0

(5.86)

a

1 . where ≤ uses ηk ≤ 2Ln 0 Taking expectation on the above display and using Lemma 5.4, we have

Ek0 f (xk+1 ) − Ek0 f (xk ) ≤ −

  3 2    Ek0 vk  + . 4Ln0 4Ln0

(5.87) 

Lemma 5.6 For all k ≥ 0, we have         E ∇f (xk ) ≤ E vk  + .

(5.88)

Proof By taking the total expectation on (5.81), we have  2   E vk − ∇f (xk ) ≤  2 .

(5.89)

Then by Jensen’s inequality (Proposition A.3), 2  2       E vk − ∇f (xk ) ≤ E vk − ∇f (xk ) ≤  2 . So using the triangle inequality,         E ∇f (xk ) = E vk − (vk − ∇f (xk ))             ≤ E vk  + E vk − ∇f (xk ) ≤ E vk  + . This completes our proof.

(5.90) 

174

5 Accelerated Stochastic Algorithms

Now, we are ready to prove Theorem 5.7. Proof of Theorem 5.7 Taking full expectation on (5.84) and telescoping the results from k = 0 to K − 1, we have K−1     3K 2 a 3K 2   E vk  ≤ f (x0 ) − Ef (xK ) + ≤+ , 4Ln0 4Ln0 4Ln0

(5.91)

k=0

a

where ≤ uses Ef (xK ) ≥ f ∗ . Dividing both sides of (5.91) by we have

 4Ln0 K

0 and using K = " 4Ln #+1 ≥ 2

K−1  4Ln0 1 1     + 3 ≤ 4. E vk  ≤  · K  K

4Ln0 , 2

(5.92)

k=0

Then from the choice of x˜ in Line 17 of Algorithm 5.5, we have E∇f (˜x) =

K−1   a 1 K−1   b 1       E ∇f (xk ) ≤ E vk  +  ≤ 5, K K k=0

a

(5.93)

k=0

b

where ≤ and ≤ use (5.88) and (5.92), respectively. To compute the gradient cost, note that in each q iteration we access for one time of S1 stochastic gradients and for q times of 2S2 stochastic gradients, hence the cost is D C a 1 S1 + 2KS2 ≤ 3K · S2 + S1 K· q

' & 2σ 4Ln0  2σ 2 + 2 ≤ 3 + 2 2 n0   =

2σ 2 24Lσ  4σ + 2 , + 3 n0   

(5.94)

a

where ≤ uses S1 = qS2 . This concludes a gradient cost of 24Lσ  −3 + 2σ 2  −2 + −1 4σ n−1 

0  . Theorem 5.8 For the optimization problem (5.22) in the finite-sum case (n < ∞), assume that each fi (x) is L-smooth, set the parameters S2 , ηk , and q as

1  n1/2  , q = n0 n1/2 , (5.95) , S2 = , η= , ηk = min n0 Ln0 Ln0 vk  2Ln0

5.3 The Nonconvex Case

175

B A set K = (4Ln0 ) −2 + 1, and let S1 = n, i.e., we obtain the full gradient in Line 4. Then running Algorithm 5.5 with OPTION II for K iterations outputs an x˜ satisfying E∇f (˜x) ≤ 5. 1/2 for any choice of The gradient cost is bounded by n + 12(L) · n1/2  −2 + 2n−1 0 n n0 ∈ [1, n1/2 ]. Treating , L, and σ as positive constants, the stochastic gradient complexity is O(n + n1/2  −2 ).

Proof For k = k0 , we have   2 2     Ek0 vk0 − ∇f (xk0 ) = Ek0 ∇f (xk0 ) − ∇f (xk0 ) = 0. For k = k0 , applying Lemma 5.3 with 1 = n0 n1/2 , we have

 Ln0 ,

S2 =

n1/2 n0 ,

(5.96)

and K = k − k0 ≤ q =

  2 2  2 n0     · 1/2 + Ek0 vk0 − ∇f (xk0 ) Ek0 vk − ∇f (xk ) ≤ n0 n1/2 · L2 · Ln0 n a

= 2, a

where = uses (5.96). So Lemma 5.4 holds for all k. Then from the same technique of the online case (n = ∞), we can also obtain (5.83), (5.84), and (5.93). The gradient cost analysis is computed as C K·

D a 1 S1 + 2KS2 ≤ 3K · S2 + S1 q

' 1/2 & n 4Ln0  + 2 ≤ 3 +n 2 n0  =

2n1/2 12(L) · n1/2 + + n, 2 n0 

(5.97)

a

where ≤ uses S1 = qS2 . This concludes a gradient cost of n + 12(L) · n1/2  −2 + 1/2 . 2n−1 

0 n

5.3.2 Momentum Acceleration When computing a first-order stationary point, SPIDER is actually (nearly) optimal, if only with the gradient-smoothness condition under certain regimes. Thus only

176

5 Accelerated Stochastic Algorithms

with the condition of smoothness of the gradient, it is hard to apply the momentum techniques to accelerate algorithms. However, one can obtain a faster rate with an additional assumption on Hessian: Assumption 5.1 Each fi (x) has ρ-Lipschitz continuous Hessians (Definition A.14). The technique to accelerate nonconvex algorithms is briefly described as follows: • Run an efficient Negative Curvature Search (NC-Search) iteration to find an δapproximate negative Hessian direction w1 using stochastic gradients,2 e.g., the shift-and-invert technique in [9]. • If NC-Search finds a w1 , update xk+1 ← xk ± (δ/ρ)w1 . • If not, solve the INC problem: k+1

x



(δ) k 2 x − x  , = argmin f (x) + 2 x

using a momentum acceleration technique, e.g., Catalyst [12] described in Sect. 5.1.4. If xk+1 − xk  ≥ (δ), return to Step 1, otherwise output xk+1 . We informally list the convergence result as follows: Theorem 5.9 Suppose solving NC-Search by [9] and INC blocks by Catalyst described in Theorem 5.6, the total stochastic gradient complexity to achieve √ an -accuracy solution satisfying ∇f (xk ) ≤  and λmin (∇ 2 f (xk )) ≥ −  is ˜ 3/4  −1.75 ). O(n The proof of Theorem 5.9 is lengthy, so we omit the proof here. We mention that when n is large, e.g., n ≥  −1 , the above method might not be faster than SPIDER [7]. Thus the existing lowest complexity to find a first-order stationary 3/4  −1.75 , n1/2 ε −2 , ε −3 )). In fact, for the problem of searching ˜ point is O(min(n a stationary point in nonconvex (stochastic) optimization, neither the upper nor the lower bounds have been well studied up to now. However, it is a very hot topic recently and has aroused a lot of attention in both the optimization and the machine learning communities due to the empirical practicability for nonconvex models. Interested readers may refer to [3, 7, 9] for some latest developments.

task that given a point x ∈ Rd , decides if λmin (∇ 2 f (x)) ≥ −2δ or finds a unit vector w1 such that wT1 ∇ 2 f (x)w1 ≤ −δ (for numerical reasons, one has to leave some room).

2A

5.4 Constrained Problem

177

Algorithm 5.6 Inner loop of Acc-SADMM for k = 0 to m − 1 do k Update dual variable: λks = λ˜ s +

βθ2 θ1,s



 A1 xks,1 + A2 xks,2 − b˜ s ,

Update xk+1 s,1 by (5.99), Update xk+1 s,2 by (5.100),   k+1 k+1 Update dual variable: λ˜ s = λks + β A1 xk+1 s,1 + A2 xs,2 − b , by yk+1 = xk+1 + (1 − θ1,s − θ2 )(xk+1 − xks ). Update yk+1 s s s s end for k

5.4 Constrained Problem We extend the stochastic acceleration methods to solve the constrained problem in this section. As a simple example, we consider the convex finite-sum problem with linear constraints: 1 f2,i (x2 ), n n

min h1 (x1 ) + f1 (x1 ) + h2 (x2 ) +

x1 ,x2

(5.98)

i=1

s.t. A1 x1 + A2 x2 = b, where f1 (x1 ) and f2,i (x2 ) with i ∈ [n] are convex and have Lipschitz continuous gradients, and h1 (x1 ) and h2 (x2 ) are also convex and their proximal mappings can be solved efficiently. We use L1 to denote the Lipschitz constant of f1 (x 1 ), and L2 to  denote the Lipschitz constant of f2,i (x2 ) with i ∈ [n], and f2 (x) = n1 ni=1 f2,i (x). We show that by fusing the VR technique and momentum, the convergence rate can be improved to be non-ergodic O(1/K). We list the notations and variables in Table 5.1. The algorithm has double loops: in the inner loop, we update primal variables xks,1 and xks,2 through extrapolation terms yks,1 and yks,2 and the dual variable λks ; in the outer loop, we maintain snapshot vectors x˜ s+1,1 , x˜ s+1,2 , and b˜ s+1 , and then assign the initial value to the extrapolation terms y0s+1,1 and y0s+1,2 . The whole algorithm is shown in Algorithm 5.7. In the Table 5.1 Notations and variables Notation x, yG , xG

Meaning √ xT Gy, xT Gx

Variable yks,1 , yks,2

Meaning Extrapolation variables

Fi (xi )

hi (xi ) + fi (xi )

xks,1 , xks,2

Primal variables

x

k k λ˜ s , λks , λˆ

Dual and temporary variables

y

(xT1 , xT2 )T (yT1 , yT2 )T

F (x)

F1 (x1 ) + F2 (x2 )

A Ik,s

x˜ s,1 , x˜ s,2 , b˜ s

Snapshot vectors

[A1 , A2 ]

(x∗1 , x∗2 , λ∗ )

KKT point of (5.98)

Mini-batch indices

b

Batch size

used for VR

178

5 Accelerated Stochastic Algorithms

process of solving primal variables, we linearize both the smooth term fi (xi ) and the augmented term β2 A1 x1 + A2 x2 − b + βλ 2 . The update rules of x1 and x2 can be written as   k = argmin h (x ) + ∇f (y ), x xk+1 1 1 1 1 s,1 s,1 x1

  β  k k k A1 ys,1 + A2 ys,2 − b + λs , A1 x1 + θ1,s    2 β AT1 A1   L1   + + x1 − yks,1  2 2θ1,s 

(5.99)

and   k ˜ xk+1 s,2 = argmin h2 (x2 ) + ∇f2 (ys,1 ), x2 x2

  β  k k A1 xk+1 + A y − b + λ , A x 2 s,2 2 2 s s,1 θ1,s  ⎛ ⎞  2 1 + bθ12 L2 β AT2 A2    ⎠ +⎝ + x2 − yks,2  , 2 2θ1,s 

+

(5.100)

˜ 2 (yk ) is defined as where ∇f s,2 ˜ 2 (yk ) = 1 ∇f s,2 b

   ∇f2,ik,s (yks,2 ) − ∇f2,ik,s (˜xs,2 ) + ∇f2 (˜xs,2 ) , ik,s ∈Ik,s

in which Ik,s is a mini-batch of indices randomly chosen from [n] with a size of b. Now, we give the convergence result. The main property of Acc-SADMM (Algorithm 5.7) in the inner loop is shown below. Lemma 5.7 For Algorithm 5.6, in any epoch with fixed s (for simplicity we drop the subscript s throughout the proof unless necessary), we have k+1 ∗ Eik L(xk+1 x1 , x˜ 2 , λ∗ ) − (1 − θ2 − θ1 )L(xk1 , xk2 , λ∗ ) 1 , x2 , λ ) − θ2 L(˜  2  k+1 2

θ1  ˆ k ˆ ∗ ∗ −λ  ≤ λ − λ  − Eik λ 2β 2 1   + yk1 − (1 − θ1 − θ2 )xk1 − θ2 x˜ 1 − θ1 x∗1  G1 2  2 1   − Eik xk+1 − (1 − θ1 − θ2 )xk1 − θ2 x˜ 1 − θ1 x∗1  1 G1 2

5.4 Constrained Problem

179

Algorithm 5.7 Accelerated stochastic alternating direction method of multiplier (Acc-SADMM) 0

Input: epoch length m > 2, β, τ = 2, c = 2, x00 = 0, λ˜ 0 = 0, x˜ 0 = x00 , y00 = x00 , θ1,s = m−τ and θ2 = τ (m−1) . for s = 0 to S − 1 do Do inner loop, as stated in Algorithm 5.6, Set primal variables: x0s+1 = xm s , 

 (τ −1)θ1,s+1 (τ −1)θ1,s+1 m−1 k 1 xm Update x˜ s+1 by x˜ s+1 = m 1 − s + 1 + (m−1)θ2 k=1 xs , θ2

1 c+τ s ,

0 m λ˜ s+1 = λsm−1 + β(1 − τ )(A1 xm s,1 + A2 xs,2 − b), Update dual snapshot variable: b˜ s+1 = A1 x˜ s+1,1 + A2 x˜ s+1,2 , Update extrapolation terms y0s+1 through

Update dual variable:

˜ s+1 + y0s+1 = (1 − θ2 )xm s + θ2 x end for s Output: xˆ S =

θ1,s+1 m−1 (1 − θ1,s )xm − θ2 x˜ s . s − (1 − θ1,s − θ2 )xs θ1,s

1 (m − 1)(θ1,S + θ2 ) + 1

xm S +

m−1  θ1,S + θ2 xkS . (m − 1)(θ1,S + θ2 ) + 1

(5.101)

k=1

2 1   + yk2 − (1 − θ1 − θ2 )xk2 − θ2 x˜ 2 − θ1 x∗2  G2 2  2 1  k ∗ ˜ x − Eik xk+1 − (1 − θ − θ )x − θ − θ x , 1 2 2 2 1 2 2 2 G2 2

(5.102)

where Eik denotes that the expectation is taken over the random samples in the minibatch Ik,s , L(x1 , x2 , λ) = F1 (x1 )+F2 (x2 )+λ, A 1 x1 +A2 x2 −b is the Lagrangian  

T

 

T

k k β A1 A1 βA A 1) I − θ11 1 , and (Axk − b), G1 = L1 + function, λˆ = λ˜ + β(1−θ θ1 θ1   

 β AT2 A2  1 I. Other notations can be found in Table 5.1. G2 = 1 + bθ2 L2 + θ1

Proof Step 1: We first analyze x1 . By the optimality of xk+1 in (5.99) and the 1 convexity of F1 (·), we can obtain F1 (xk+1 1 ) ≤ (1 − θ1 − θ2 )F1 (xk1 ) + θ2 F1 (˜x1 ) + θ1 F1 (x∗1 )   ¯ k+1 , yk ), xk+1 − (1 − θ1 − θ2 )xk − θ2 x˜ 1 − θ1 x∗1 − AT1 λ(x 2 1 1 1 2 L1    k+1 x1 − yk1  2   − xk+1 − yk1 , xk+1 − (1 − θ1 − θ2 )xk1 − θ2 x˜ 1 − θ1 x∗ 1 1 +

G1

.

(5.103)

180

5 Accelerated Stochastic Algorithms

We prove (5.103) below. By the same proof technique of Lemma 5.2 for Katyusha (Algorithm 5.3), we can bound the variance through  2 2L

2  k k k ˜ 2 (yk ) ˜ f Eik ∇f2 (yk2 ) − ∇f ≤ (˜ x ) − f (y ) − ∇f (y ), x − y   2 2 2 2 2 2 2 2 2 . b (5.104) Set ¯ 1 , x2 ) = λk + β (A1 x1 + A2 x2 − b) . λ(x θ1 For the optimality solution of xk+1 in (5.99), we have 1    β AT1 A1   k+1 ¯ k , yk ) ∈ −∂h1 (xk+1 ). x1 − yk1 + ∇f1 (yk1 ) + AT1 λ(y L1 + 1 2 1 θ1 (5.105)



Since f1 is L1 -smooth, we have 2  L   1  k+1 k+1 k k k k x f1 (xk+1 + ) ≤ f (y ) + ∇f (y ), x − y − y  1 1 1 1 1 1 1 1 1 2 2  L   a 1  k+1 k + ≤ f1 (u1 ) + ∇f1 (yk1 ), xk+1 − u − y x 1 1 1 1 2   b k+1 k+1 T ¯ k k ≤ f1 (u1 ) − ∂h1 (xk+1 ), x − u  − A , y ), x − u λ(y 1 1 1 1 2 1 1 1   T  2  L  β A1 A1   k+1 1  k+1 k x1 − yk1 , xk+1 + − L1 + − u − y x 1 1 , 1 1 θ1 2 a

where u1 is an arbitrary variable. In the inequality ≤ we use the fact that f1 (·) is b

convex and so f1 (yk1 ) ≤ f1 (u1 )+∇f1 (yk1 ), yk1 −u1 . The inequality ≤ uses (5.105). k+1 k+1 Then the convexity of h1 (·) gives h1 (xk+1 − u1 . So 1 ) ≤ h1 (u1 ) + ∂h1 (x1 ), x1 we have 2  L   1  k+1  k+1 T ¯ k k F1 (xk+1 − u1 + x1 − yk1  1 ) ≤ F1 (u1 ) − A1 λ(y1 , y2 ), x1 2     β AT1 A1   k+1 x1 − yk1 , xk+1 − L1 + − u1 . 1 θ1

5.4 Constrained Problem

181

Setting u1 be xk1 , x˜ 1 , and x∗1 , respectively, then multiplying the three inequalities by (1 − θ1 − θ2 ), θ2 , and θ1 , respectively, and adding them, we have F1 (xk+1 1 )

2 L1   k+1  ≤ (1 − θ1 − θ2 )F1 (xk1 ) + θ2 F1 (˜x1 ) + θ1 F1 (x∗1 ) + x1 − yk1  2   ¯ k , yk ), xk+1 − (1 − θ1 − θ2 )xk − θ2 x˜ 1 − θ1 x∗1 − AT1 λ(y 1 2 1 1     β AT1 A1   k+1 x1 − yk1 , xk+1 − L1 + − (1 − θ1 − θ2 )xk1 − θ2 x˜ 1 − θ1 x∗1 1 θ1 2 L1  a  k+1  = (1 − θ1 − θ2 )F1 (xk1 ) + θ2 F1 (˜x1 ) + θ1 F1 (x∗1 ) + x1 − yk1  2   ¯ k+1 , yk ), xk+1 − (1 − θ1 − θ2 )xk − θ2 x˜ 1 − θ1 x∗1 − AT1 λ(x 2 1 1 1   k k+1 k ∗ ˜ x − y , x − (1 − θ − θ )x − θ − θ x , (5.106) − xk+1 1 2 1 2 1 1 1 1 1 G1

a

¯ k , yk ) with AT λ(x ¯ k+1 , yk ) − where in the equality ≤ we replace AT1 λ(y 1 1 2 2 1 βAT1 A1 k+1 θ1 (x1

− yk1 ).

Step 2: We next analyze x2 . By the optimality of xk+1 in (5.100) and the 2 convexity of F2 (·), we can obtain Eik F2 (xk+1 2 ) E



   β AT2 A2   k+1 x2 − yk2 , αL2 + θ1

¯ k+1 , yk ) + AT2 λ(x 2 1

≤ −Eik

F xk+1 − θ2 x˜ 2 2 E −Eik



¯ k+1 , yk ) + AT2 λ(x 2 1

   β AT2 A2   k+1 x2 − yk2 , αL2 + θ1 F

−(1 − θ2 − θ1 )xk2 − θ1 x∗2 +(1 − θ2 − θ1 )F2 (xk2 ) + θ1 F2 (x∗2 ) + θ2 F2 (˜x2 )  ⎛ ⎞ 2 1 + bθ12 L2    k+1 +Eik ⎝ x2 − yk2  ⎠ . 2 We prove (5.107) below.

(5.107)

182

5 Accelerated Stochastic Algorithms

For the optimality of xk+1 in (5.100), we have 2    β AT2 A2   k+1 ¯ k+1 , yk ) ˜ 2 (yk ) + AT2 λ(x x2 − yk2 + ∇f αL2 + 2 2 1 θ1



∈ −∂h2 (xk+1 2 ), where we set α = 1 +

1 bθ2 .

(5.108)

Since f2 is L2 -smooth, we have

2  L   2  k+1 k+1 k k k k x + ) ≤ f (y ) + ∇f (y ), x − y − y f2 (xk+1  2 2 2 2 2 2  . (5.109) 2 2 2 2 We first consider ∇f2 (yk2 ), xk+1 − yk2  and have 2   − yk2 ∇f2 (yk2 ), xk+1 2   a = ∇f2 (yk2 ), u2 − yk2 + xk+1 − u 2 2       b = ∇f2 (yk2 ), u2 − yk2 − θ3 ∇f2 (yk2 ), yk2 − x˜ 2 + ∇f2 (yk2 ), zk+1 − u2     = ∇f2 (yk2 ), u2 − yk2 − θ3 ∇f2 (yk2 ), yk2 − x˜ 2     ˜ 2 (yk ), zk+1 − u2 , ˜ 2 (yk ), zk+1 − u2 + ∇f2 (yk ) − ∇f (5.110) + ∇f 2 2 2 a

where in the equality = we introduce an arbitrary variable u2 (we will set it to be b

xk2 , x˜ 2 , and x∗2 ) and in the equality = we set zk+1 = xk+1 + θ3 (yk2 − x˜ 2 ), 2

(5.111)

˜ 2 (yk ), zk+1 − u2 , we in which θ3 is an absolute constant determined later. For ∇f 2 have   ˜ 2 (yk ), zk+1 − u2 ∇f 2 E    β AT2 A2  a k+1 T ¯ k+1 k = − ∂h2 (x2 ) + A2 λ(x1 , y2 ) + αL2 + θ1 F   xk+1 − yk2 , zk+1 − u2 2   b k+1 k ˜ = − ∂h2 (xk+1 ), x + θ (y − x ) − u 3 2 2 2 2 2

5.4 Constrained Problem

183

E



F    β AT2 A2   k+1 k k+1 x2 − y2 , z − αL2 + − u2 θ1   k+1 k+1 k+1 k ˜ = − ∂h2 (xk+1 ), x + θ (y − x + x − x ) − u 3 2 2 2 2 2 2 2 E  F  T   β A2 A2   k+1 T ¯ k+1 k k k+1 x2 − y2 , z − A2 λ(x1 , y2 ) + αL2 + − u2 θ1   c k+1 k+1 k+1 k ∂h ≤ h2 (u2 ) − h2 (xk+1 ) + θ h (˜ x ) − θ h (x ) − θ (x ), y − x 3 2 2 3 2 3 2 2 2 2 2 2 E  F  T   β A2 A2   k+1 ¯ k+1 , yk ) + αL2 + x2 − yk2 , zk+1 − u2 − AT2 λ(x 2 1 θ1 ¯ k+1 , yk ) + AT2 λ(x 2 1

d

= h2 (u2 ) − h2 (xk+1 x2 ) − θ3 h2 (xk+1 2 ) + θ3 h2 (˜ 2 )  E F  T   β A2 A2   k+1 T ¯ k+1 k k k+1 x2 − y2 , z − u2 − A2 λ(x1 , y2 ) + αL2 + θ1  E    β AT2 A2   k+1 T ¯ k+1 k x2 − yk2 −θ3 A2 λ(x1 , y2 ) + αL2 + θ1 F ˜ 2 (yk ), xk+1 − yk , +∇f (5.112) 2

2

2

a

b

where in the equalities = and =, we use (5.108) and (5.111), respectively. The c

d

inequality = uses (5.108) again. The inequality ≤ uses the convexity of h2 :   k+1 ∂h2 (xk+1 ≤ h2 (w) − h2 (xk+1 2 ), w − x2 2 ),

w = u2 , x˜ 2 .

  ˜ 2 (yk )−∇f2 (yk ) , ˜ 2 (yk )=∇f2 (yk )+ ∇f Rearranging terms in (5.112) and using ∇f 2 2 2 2 we have   ˜ 2 (yk ), zk+1 − u2 ∇f 2 = h2 (u2 ) − h2 (xk+1 x2 ) − θ3 h2 (xk+1 2 ) + θ3 h2 (˜ 2 )  E  T   β A2 A2   k+1 ¯ k+1 , yk ) + αL2 + x2 − yk2 , − AT2 λ(x 2 1 θ1 F θ3 (xk+1 − yk2 ) + zk+1 − u2 2     ˜ 2 (yk ) − ∇f2 (yk ) , xk+1 − yk . −θ3 ∇f2 (yk2 ) + ∇f 2 2 2 2

(5.113)

184

5 Accelerated Stochastic Algorithms

Substituting (5.113) in (5.110), we obtain   − yk2 (1 + θ3 ) ∇f2 (yk2 ), xk+1 2     = ∇f2 (yk2 ), u2 − yk2 − θ3 ∇f2 (yk2 ), yk2 − x˜ 2 + h2 (u2 ) − h2 (xk+1 2 ) +θ3 h2 (˜x2 ) − θ3 h2 (xk+1 2 )  E    β AT2 A2   k+1 T ¯ k+1 k x2 − yk2 , − A2 λ(x1 , y2 ) + αL2 + θ1  k zk+1 − u2 + θ3 (xk+1 − y ) 2 2   ˜ 2 (yk ), θ3 (xk+1 − yk ) + zk+1 − u2 . + ∇f2 (yk2 ) − ∇f 2 2 2

(5.114)

Multiplying (5.109) by (1 + θ3 ) and then adding (5.114), we can eliminate the term ∇f2 (yk2 ), xk+1 − yk2  and obtain 2 (1 + θ3 )F2 (xk+1 2 )

  ≤ (1 + θ3 )f2 (yk2 ) + ∇f2 (yk2 ), u2 − yk2 − θ3 ∇f2 (yk2 ), yk2 − x˜ 2  +h2 (u2 ) + θ3 h2 (˜x2 ) E     β AT2 A2   k+1 T ¯ k+1 k x2 − yk2 , − A2 λ(x1 , y2 ) + αL2 + θ1 F zk+1 − u2 + θ3 (xk+1 − yk2 ) 2   ˜ 2 (yk ), θ3 (xk+1 − yk ) + zk+1 − u2 + ∇f2 (yk2 ) − ∇f 2 2 2

2 (1 + θ3 )L2   k+1  x2 − yk2  2   a ≤ F2 (u2 ) − θ3 ∇f (yk2 ), yk2 − x˜ 2 + θ3 f2 (yk2 ) + θ3 h2 (˜x2 )  E    β AT2 A2   k+1 T ¯ k+1 k x2 − yk2 , − A2 λ(x1 , y2 ) + αL2 + θ1 F +

zk+1 − u2 + θ3 (xk+1 − yk2 ) 2

5.4 Constrained Problem

185

  ˜ 2 (yk ), θ3 (xk+1 − yk ) + zk+1 − u2 + ∇f (yk2 ) − ∇f 2 2 2 +

2 (1 + θ3 )L2   k+1  x2 − yk2  , 2

(5.115)

a

where the inequality ≤ uses the convexity of f2 : ∇f2 (yk2 ), u2 − yk2  ≤ f2 (u2 ) − f2 (yk2 ).   ˜ 2 (yk ), θ3 (xk+1 − yk ) + zk+1 − u2 . We now consider the term ∇f2 (yk2 ) − ∇f 2 2 2 We will set u2 to be xk2 and x∗2 , which do not depend on Ik,s . So we obtain   ˜ 2 (yk ), θ3 (xk+1 − yk ) + zk+1 − u2 Eik ∇f2 (yk2 ) − ∇f 2 2   ˜ 2 (yk ), θ3 zk+1 + zk+1 = Eik ∇f2 (yk ) − ∇f 2

2

  ˜ 2 (yk ), θ32 (yk − x˜ 2 ) + θ3 yk + u2 −Eik ∇f2 (yk2 ) − ∇f 2 2 2 ˜ 2 (yk ), zk+1  = (1 + θ3 )Eik ∇f2 (yk2 ) − ∇f 2 a

b ˜ 2 (yk ), xk+1  = (1 + θ3 )Eik ∇f2 (yk2 ) − ∇f 2 2 c ˜ 2 (yk ), xk+1 − yk  = (1 + θ3 )Eik ∇f2 (yk2 ) − ∇f 2 2 2

 2

 2 d θ3 b  (1 + θ3 )2 L2   k+1 k k  k ˜ + E ≤ Eik (y ) − ∇f (y ) − y x2 ∇f2 2 2 2  ik 2 2L2 2θ3 b   e ≤ θ3 f2 (˜x2 ) − f2 (yk2 ) − ∇f2 (yk2 ), x˜ 2 − yk2 

+Eik

2

(1 + θ3 )2 L2   k+1 k x2 − y2  , 2θ3 b

(5.116)

a

where in the equality = we use the fact that   ˜ 2 (yk ) = 0, Eik ∇f2 (yk2 ) − ∇f 2 and xk2 , yk2 , x˜ 2 , and u2 are independent of ik,s (they are known), so ˜ 2 (yk ), yk  = 0, Eik ∇f2 (yk2 ) − ∇f 2 2 ˜ 2 (yk ), x˜ 2  = 0, Eik ∇f2 (yk2 ) − ∇f 2 ˜ 2 (yk ), u2  = 0; Eik ∇f2 (yk2 ) − ∇f 2 b

c

d

the equalities = and = hold similarly; the inequality ≤ uses the Cauchy–Schwartz e

inequality; and ≤ uses (5.104). Taking expectation on (5.115) and adding (5.116),

186

5 Accelerated Stochastic Algorithms

we obtain (1 + θ3 )Eik F2 (xk+1 2 )  E  T   A A2   β 2 k+1 k ¯ k+1 , yk ) + αL2 + x , − y ≤ −Eik AT2 λ(x 2 2 1 2 θ1 F zk+1 − u2 + θ3 (xk+1 − yk2 ) 2 ⎛ +F2 (u2 ) + θ3 F (˜x2 ) + Eik ⎝

= −Eik

1+θ3 bθ3

2



⎞ 2 L2   k+1  x2 − yk2  ⎠



E a

 (1 + θ3 ) 1 +

   β AT2 A2   k+1 x2 − yk2 , αL2 + θ1 F

¯ k+1 , yk ) + AT2 λ(x 2 1

(1 + θ3 )xk+1 − θ3 x˜ 2 − u2 2 ⎛ +F2 (u2 ) + θ3 F (˜x2 ) + Eik ⎝

 (1 + θ3 ) 1 + 2

1 bθ2



⎞ 2 L2    k+1 x2 − yk2  ⎠ ,

a

θ3 . Setting u2 to be where in equality = we use (5.111) and set θ3 satisfying θ2 = 1+θ 3 xk2 and x∗2 , respectively, then multiplying the two inequalities by 1 − θ1 (1 + θ3 ) and θ1 (1 + θ3 ), respectively, and adding them, we obtain

(1 + θ3 )Eik F2 (xk+1 2 )  E    β AT2 A2   k+1 T ¯ k+1 k x2 − yk2 , ≤ −Eik A2 λ(x1 , y2 ) + αL2 + θ1 F (1 + θ3 )xk+1 − θ3 x˜ 2 2 

E −Eik

¯ k+1 , yk ) + AT2 λ(x 2 1

   β AT2 A2   k+1 x2 − yk2 , αL2 + θ1

F − [1 − θ1 (1 + θ3 )] xk2 E −Eik

¯ k+1 , yk ) + AT2 λ(x 2 1



   β AT2 A2   k+1 x2 − yk2 , αL2 + θ1

5.4 Constrained Problem

187

F −θ1 (1 + θ3 )x∗2 + [1 − θ1 (1 + θ3 )] F2 (xk2 ) + θ1 (1 + θ3 )F2 (x∗2 ) + θ3 F (˜x2 )   ⎛ ⎞ 2 (1 + θ3 ) 1 + bθ12 L2    k+1 +Eik ⎝ x2 − yk2  ⎠ . 2

(5.117)

Dividing (5.117) by (1 + θ3 ), we obtain Eik F2 (xk+1 2 ) E



F    β AT2 A2   k+1 k+1 k x2 − y2 , x2 − θ2 x˜ 2 αL2 + ≤ −Eik θ1  E    β AT2 A2   k+1 T ¯ k+1 k x2 − yk2 , −Eik A2 λ(x1 , y2 ) + αL2 + θ1 F ¯ k+1 , yk ) + AT2 λ(x 2 1

−(1 − θ2 − θ1 )xk2 − θ1 x∗2 +(1 − θ2 − θ1 )F2 (xk2 ) + θ1 F2 (x∗2 ) + θ2 F2 (˜x2 )  ⎛ ⎞ 2 1 + bθ12 L2    − yk2  ⎠ , +Eik ⎝ xk+1 2 2 where we use θ2 = Step 3: Setting

θ3 1+θ3

and so

1−θ1 (1+θ3 ) 1+θ3

(5.118)

= 1 − θ2 − θ1 . We obtain (5.107).

β(1 − θ1 ) k k (A1 xk1 + A2 xk2 − b), λˆ = λ˜ + θ1

(5.119)

we prove that it has the following properties: k+1 ¯ k+1 , xk+1 ), = λ(x (5.120) λˆ 1 2

β k+1 k λˆ − λˆ = A1 xk+1 − (1 − θ1 − θ2 )xk1 − θ2 x˜ 1 − θ1 x∗1 1 θ1

β + A2 xk+1 − (1 − θ1 − θ2 )xk2 − θ2 x˜ 2 − θ1 x∗2 , (5.121) 2 θ1 0 m λˆ s = λˆ s−1 ,

s ≥ 1.

(5.122)

188

5 Accelerated Stochastic Algorithms

Indeed, for Algorithm 5.6 we have  βθ2  k A1 xk1 + A2 xk2 − b˜ λk = λ˜ + θ1

(5.123)

and   k+1 k+1 = λk + β A1 xk+1 + A x − b . λ˜ 2 1 2

(5.124)

With (5.119) we have k+1 k+1 = λ˜ +β λˆ



1 − 1 (A1 xk+1 + A2 xk+1 − b) 1 2 θ1

β (A1 xk+1 + A2 xk+1 − b) (5.125) 1 2 θ1

4 β 3 b k A1 xk+1 = λ˜ + + A2 xk+1 − b + θ2 A1 (xk2 − x˜ 1 ) + A2 (xk2 − x˜ 2 ) , 1 2 θ1 a

= λk +

a

b

where in equality = we use (5.124) and the equality = is obtained by (5.123) and b˜ = A1 x˜ 1 + A2 x˜ 2 (see Algorithm 5.7). Together with (5.119) we obtain

β k+1 k k ∗ k ˜ − λˆ = A1 xk+1 − (1 − θ )x − θ x + θ (x − x ) λˆ 1 1 1 1 2 1 1 1 θ1

β + A2 xk+1 − (1 − θ1 )xk2 − θ1 x∗2 + θ2 (xk2 − x˜ 2 ) , 2 θ1 where we use the fact that A1 x∗1 + A2 x∗2 = b. So (5.121) is proven. ¯ k+1 , xk+1 ), we obtain (5.120). Now we prove λˆ m ˆ0 Since (5.125) equals λ(x s−1 = λs 1 2 when s ≥ 1.  β(1 − θ1,s )  0 a 0 m A1 xm λˆ s = λ˜ s + s,1 + A2 xs,2 − b θ1,s

  1 b 0 m = λ˜ s + β + τ − 1 A1 xm s,1 + A2 xs,2 − b θ1,s−1   c m−1 m = λs−1 − β(τ − 1) A1 xm s,1 + A2 xs,2 − b

  1 m +β + τ − 1 A1 xm s,1 + A2 xs,2 − b θ1,s−1 m−1 = λs−1 +

β θ1,s−1

  m A1 xm s,1 + A2 xs,2 − b

5.4 Constrained Problem

189

m ˜ = λs−1 − β − d

β



θ1,s−1

  m A1 xm s,1 + A2 xs,2 − b

m

= λˆ s−1 , a

b

where = uses (5.119), = uses the fact that

(5.126) 1 θ1,s

=

1 θ1,s−1

0

+ τ , = uses λ˜ s+1 = λsm−1 + c

d

m β(1 − τ )(A1 xm s,1 + A2 xs,2 − b) in Algorithm 5.7, and = uses (5.124). Step 4: We now are ready to prove (5.102). Define L(x1 , x2 , λ) = F1 (x1 ) − F1 (x∗1 ) + F2 (x2 ) − F2 (x∗2 ) + λ, A1 x1 + A2 x2 − b. We have k+1 ∗ L(xk+1 x1 , x˜ 2 , λ∗ ) − (1 − θ1 − θ2 )L(xk1 , xk2 , λ∗ ) 1 , x2 , λ ) − θ2 L(˜ k ∗ x1 ) = F1 (xk+1 1 ) − (1 − θ2 − θ1 )F1 (x1 ) − θ1 F1 (x1 ) − θ2 F1 (˜ k ∗ x2 ) +F2 (xk+1 2 ) − (1 − θ2 − θ1 )F2 (x2 ) − θ1 F2 (x2 ) − θ2 F2 (˜ 

 + λ∗ , A1 xk+1 − (1 − θ1 − θ2 )xk1 − θ2 x˜ 1 − θ1 x∗1 1

  k ∗ ˜ x − (1 − θ − θ )x − θ − θ x + λ∗ , A2 xk+1 1 2 2 2 1 2 . 2 2

Plugging (5.106) and (5.118) into the above, we have k+1 ∗ Eik L(xk+1 x1 , x˜ 2 , λ∗ ) − (1 − θ2 − θ1 )L(xk1 , xk2 , λ∗ ) 1 , x2 , λ ) − θ2 L(˜ 

 ¯ k+1 , yk ), A1 xk+1 − (1 − θ1 − θ2 )xk − θ2 x˜ 1 − θ1 x∗1 ≤ Eik λ∗ − λ(x 2 1 1 1 

 ¯ k+1 , yk ), A2 xk+1 − (1 − θ1 − θ2 )xk − θ2 x˜ 2 − θ1 x∗2 + Eik λ∗ − λ(x 2 2 1 2   − yk1 , xk+1 − (1 − θ1 − θ2 )xk1 − θ2 x˜ 1 − θ1 x∗1 − Eik xk+1 1 1 G1

  k k+1 k ∗ ˜ x − Eik xk+1 − y , x − (1 − θ − θ )x − θ − θ x 1 2 2 2 1 2 2 2 2 2

αL2 +

 ⎛ ⎞  2 2 1 + bθ12 L2  L1     k+1 Ei xk+1 − yk1  + Eik ⎝ + x2 − yk2  ⎠ 2 k 1 2



β AT 2 A2 θ1



I



 a ¯ k+1 , xk+1 ), A1 xk+1 − (1 − θ1 − θ2 )xk − θ2 x˜ 1 − θ1 x∗1 = Eik λ∗ − λ(x 1 1 2 1 

 ¯ k+1 , xk+1 ), A2 xk+1 − (1 − θ1 − θ2 )xk − θ2 x˜ 2 − θ1 x∗2 + Eik λ∗ − λ(x 2 1 2 2   k k+1 k ∗ ˜ x − y , x − (1 − θ − θ )x − θ − θ x − Eik xk+1 1 2 2 1 1 1 1 1 1 1  − Eik xk+1 − yk2 , xk+1 − (1 − θ1 − θ2 )xk2 2 2

G1

190

5 Accelerated Stochastic Algorithms

−θ2 x˜ 2 − θ1 x∗2





T β  AT 2 A2  I− βA2 A2 αL2 + θ θ 1

1

 ⎞  2 2 1 + bθ12 L2  L1     k+1 Ei xk+1 − yk1  + Eik ⎝ + x2 − yk2  ⎠ 2 k 1 2 +

⎛



 β k+1 k k ∗ ˜ x x Eik A2 xk+1 − A y , A − (1 − θ − θ )x − θ − θ x 2 1 1 2 2 1 1 1 , 2 1 2 1 θ1 (5.127)

a ¯ k+1 , yk ) to λ(x ¯ k+1 , xk+1 ) − where in the equality = we change the term λ(x 2 1 1 2 β k+1 k ). For the first two terms in the right-hand side of (5.127), we have A (x − y 2 2 2 θ1



 ¯ k+1 , xk+1 ), A1 xk+1 − (1 − θ1 − θ2 )xk − θ2 x˜ 1 − θ1 x∗1 λ∗ − λ(x 1 1 2 1

  ¯ k+1 , xk+1 ), A2 xk+1 − (1 − θ1 − θ2 )xk − θ2 x˜ 2 − θ1 x∗2 + λ∗ − λ(x 1

2

1

2

θ1 ∗ ˆ k+1 ˆ k+1 ˆ k λ − λ , λ −λ  β  2  k+1 2  k+1 

k 2 b θ1 ˆk ˆ ˆ ∗ ∗ ˆ = − λ  − λ −λ  , λ − λ  − λ 2β a

=

a

(5.128)

b

where = uses (5.120) and (5.121) and = uses (A.2). Substituting (5.128) into (5.127), we obtain k+1 ∗ x1 , x˜ 2 , λ∗ ) − (1 − θ2 − θ1 )L(xk1 , xk2 , λ∗ ) Eik L(xk+1 1 , x2 , λ ) − θ2 L(˜ 2  k+1 2  k+1 

θ1  k 2 ˆk     − λ∗  − Eik λˆ − λˆ  ≤ λ − λ∗  − Eik λˆ 2β   k k+1 k ∗ ˜ x − Eik xk+1 − y , x − (1 − θ − θ )x − θ − θ x 1 2 2 1 1 1 1 1 1 1

 − Eik xk+1 − yk2 , xk+1 − (1 − θ1 − θ2 )xk2 2 2 

−θ2 x˜ 2 − θ1 x∗2 T β AT 2 A2  I− βA2 A2 αL + 2

θ1

G1

θ1

 ⎞  2 2 1 + bθ12 L2  L1  k+1    Ei xk+1 − yk1  + Eik ⎝ + x2 − yk2  ⎠ 2 k 1 2 +

⎛



 β k+1 k k ∗ ˜ x x Eik A2 xk+1 − A y , A − (1 − θ − θ )x − θ − θ x 2 1 1 2 2 1 1 1 . 2 1 2 1 θ1 (5.129)

5.4 Constrained Problem

191

Then applying identity (A.1) to the second and the third terms in the right-hand side of (5.129) and rearranging, we have k+1 ∗ x1 , x˜ 2 , λ∗ ) − (1 − θ2 − θ1 )L(xk1 , xk2 , λ∗ ) Eik L(xk+1 1 , x2 , λ ) − θ2 L(˜ 2  k+1 2  k+1 

θ1  k 2 ˆk     − λ∗  − Eik λˆ − λˆ  ≤ λ − λ∗  − Eik λˆ 2β 2 1   + yk1 − (1 − θ1 − θ2 )xk1 − θ2 x˜ 1 − θ1 x∗1  G1 2  2 1  k ∗ ˜ x − Eik xk+1 − (1 − θ − θ )x − θ − θ x  1 2 2 1 1 1 1 1 G1 2 2 1  

T + yk2 − (1 − θ1 − θ2 )xk2 − θ2 x˜ 2 − θ1 x∗2  β  AT 2 A2  I− βA2 A2 αL2 + 2 θ1 θ1

 2 1  

T − Eik xk+1 − (1 − θ1 − θ2 )xk2 − θ2 x˜ 2 − θ1 x∗2  β AT 2 2 A2  I− βA2 A2 αL2 + 2 θ1 θ1  2  2 1 1   k+1 k k T T x E − Eik xk+1 − y − − y   i 1 β A1 A1  βA1 A1 2  β AT2 A2  βAT2 A2 1 I− θ I− θ 2 2 k 2 θ1 θ1 1 1 

 β + Eik A2 xk+1 − A2 yk2 , A1 xk+1 − (1 − θ1 − θ2 )xk1 − θ2 x˜ 1 − θ1 x∗1 . 2 1 θ1 (5.130) For the last term in the right-hand side of (5.130), we have

 β  k+1 k k ∗ ˜ A2 xk+1 x x − A y , A − (1 − θ − θ )x − θ − θ x 2 1 1 2 2 1 1 1 2 1 2 1 θ1  a β k A2 xk+1 = 2 −A2 v −(A2 y2 − A2 v), θ1

 k ∗ ˜ A1 xk+1 x − (1−θ −θ )x − θ − θ x 1 2 2 1 1 1 −0 1 1 b

=



2 β    − A2 v + A1 xk+1 − (1 − θ1 − θ2 )xk1 − θ2 x˜ 1 − θ1 x∗1  A2 xk+1 2 1 2θ1 2 2 β  β      k A − − A v + y − A v A2 xk+1    2 2 2 2 2 2θ1 2θ1

2 β   k ∗  ˜ x − − (1 − θ − θ )x − θ − θ x A2 yk2 − A2 v + A1 xk+1 1 2 1 2 1 1 1  1 2θ1

192

5 Accelerated Stochastic Algorithms c

=

2 2  θ1  β  β   ˆ k+1 ˆ k 2     −λ  − − A2 v + λ A2 xk+1 A2 yk2 − A2 v 2 2β 2θ1 2θ1

2 β   k ∗  ˜ x − − (1 − θ − θ )x − θ − θ x A2 yk2 − A2 v + A1 xk+1 1 2 2 1 1 1  , 1 1 2θ1 (5.131) a

b

c

where in = we set v = (1 − θ1 − θ2 )xk2 + θ2 x˜ 2 + θ1 x∗2 , = uses (A.3), and = uses (5.121). Substituting (5.131) into (5.130), we have k+1 ∗ x1 , x˜ 2 , λ∗ ) − (1 − θ2 − θ1 )L(xk1 , xk2 , λ∗ ) Eik L(xk+1 1 , x2 , λ ) − θ2 L(˜ 2  k+1 2

θ1     ˆk − λ∗  ≤ λ − λ∗  − Eik λˆ 2β 2 1   + yk1 − (1 − θ1 − θ2 )xk1 − θ2 x˜ 1 − θ1 x∗1  G1 2  2 1  k ∗ ˜ x − Eik xk+1 − (1 − θ − θ )x − θ − θ x 1 2 2 1 1 1 1 1 G1 2 2 1  

+ yk2 − (1 − θ1 − θ2 )xk2 − θ2 x˜ 2 − θ1 x∗2  β  AT 2 A2  I αL2 + 2 θ1

 2 1  k ∗ 

˜ x − Eik xk+1 − (1 − θ − θ )x − θ − θ x 1 2 2 2 1 β AT 2 2 2 2 A2  I αL2 + 2 θ1  2  2 1 1     − Eik xk+1 − yk1  β AT1 A1  βAT1 A1 − Eik xk+1 − yk2  β AT2 A2  βAT2 A2 1 2 I− I− θ 2 2 θ1 θ1 θ1 1 

 β 2  − Eik A2 yk2 − A2 v + A1 xk+1 − (1 − θ1 − θ2 )xk1 − θ2 x˜ 1 − θ1 x∗1  . 1 2θ1 (5.132) Since the last three terms in the right-hand side of (5.132) are nonpositive, we obtain (5.102).

 Theorem 5.10 If the conditions in Lemma 5.7 hold, then we have    1   β(m − 1)(θ2 + θ1,S ) + β AˆxS −b E  2β θ1,S 2    β(m−1)θ2  0 0 ∗ ˜ Ax0 − b + λ0 − λ  − θ1,0

 (m − 1)(θ2 + θ1,S ) + 1  F (ˆxS ) − F (x∗ ) + λ∗ , AˆxS − b +E θ1,S   ≤ C3 F (x00 ) − F (x∗ ) + λ∗ , Ax00 − b

5.4 Constrained Problem

193

 2  1  λ˜ 0 + β(1 − θ1,0 ) (Ax0 − b) − λ∗  0 0   2β θ1,0 2 1   + x00,1 − x∗1   θ1,0 L1 +βAT1 A1  I−βAT1 A1 2 2 1    , + x00,2 − x∗2  1  1+ bθ θ1,0 L2 +βAT2 A2  I 2 2 +

where C3 =

(5.133)

1−θ1,0 +(m−1)θ2 . θ1,0

Proof Taking expectation over the first k iterations for (5.102) and dividing θ1 on both sides of it, we obtain 1 θ2 1 − θ2 − θ1 k+1 ∗ EL(xk+1 L(˜x1 , x˜ 2 , λ∗ ) − L(xk1 , xk2 , λ∗ ) 1 , x2 , λ ) − θ1 θ1 θ1 2 2

 k+1 1  ˆk ˆ ∗ ∗ ≤ −λ  λ − λ  − E λ 2β  2

 1 k θ1  k ∗  

+  y1 − (1 − θ1 − θ2 )x1 − θ2 x˜ 1 − x1  T β AT 2 θ1 1 A1  I− βA1 A1 L1 + θ1

 2

 1 k+1 θ1  k ∗   x1 − (1 − θ1 − θ2 )x1 − θ2 x˜ 1 − x1  − E 2 θ 1

+



θ1 2

 2

1 k  k ∗   ˜ y − x x − (1 − θ − θ )x − θ 1 2 2 2 2 2 2 θ 1

L1 +

αL2 +



θ1



β AT 1 A1 θ1

β AT 2 A2 θ1





I−

βAT 1 A1 θ1

I

 2

 1 k+1 θ1  k ∗ 

, ˜ E x − x x − (1 − θ − θ )x − θ 1 2 2 2 2 2 2  β AT 2 θ1 2 A2  I αL2 + θ1

(5.134) where the expectation is taken under the condition that the randomness under the first s epochs is fixed. Since yk = xk + (1 − θ1 − θ2 )(xk − xk−1 ),

k ≥ 1,

we obtain 1 θ2 1 − θ2 − θ1 k+1 ∗ EL(xk+1 L(˜x1 , x˜ 2 , λ∗ ) − L(xk1 , xk2 , λ∗ ) 1 , x2 , λ ) − θ1 θ1 θ1 2 2

 k+1 1  ˆ ˆk ∗ ∗ ≤ −λ  λ − λ  − E λ 2β

194

5 Accelerated Stochastic Algorithms

+



θ1 2

 2

1 k  k−1 ∗   ˜ x − x x − (1 − θ − θ )x − θ 1 2 1 2 1 1 1 θ 1

L1 +



β AT 1 A1 θ1



I−

βAT 1 A1 θ1

2 

 1 k+1 θ1  k ∗ 

˜ x − x x − (1 − θ − θ )x − θ E 1 2 1 2 1 T 1 1  β  AT 2 θ1 1 A1  I− βA1 A1 L1 + θ1

θ1

 2

 1 k θ1  k−1 ∗  

x2 − (1 − θ1 − θ2 )x2 − θ2 x˜ 2 − x2  +  β AT 2 θ1 2 A2  I αL2 + θ1

 2

 1 k+1 θ1  k ∗   x2 − (1 − θ1 − θ2 )x2 − θ2 x˜ 2 − x2  − E 2 θ 1

k ≥ 1.

αL2 +



β AT 2 A2 θ1





, I

(5.135)

Adding back the subscript s, taking expectation on the first s epoches, and then summing (5.134) with k from 0 to m − 1 (for k ≥ 1, using (5.135)), we have      θ2 + θ1,s m−1 ∗ ∗ ∗ E L(xm , λ ) − L(x , λ ) + E L(xks , λ∗ ) − L(x∗ , λ∗ ) s θ1,s θ1,s 1





k=1

 mθ   1 − θ1,s − θ2 2 E L(x0s , λ∗ ) − L(x∗ , λ∗ ) + E L(˜xs , λ∗ ) − L(x∗ , λ∗ ) θ1,s θ1,s 

 1 1 y0s,1 − θ2 x˜ s,1 − (1 − θ1,s − θ2 )x0s,1 + E  2 θ1,s 2  −x∗1     θ1,s L1 +β AT1 A1  I−βAT1 A1



1  1 m m−1 xs,1 − θ2 x˜ s,1 − (1 − θ1,s − θ2 )xs,1 − E  2 θ 1,s

2  ∗ −x1 

  θ1,s L1 +β AT1 A1  I−βAT1 A1

 2

 1 0 1  0 ∗  ys,2 − θ2 x˜ s,2 − (1 − θ1,s − θ2 )xs,2 − x2  + E   2 θ1,s αθ1,s L2 +β AT2 A2  I  2

 1 m 1  m−1 ∗  xs,2 − θ2 x˜ s,2 − (1 − θ1,s − θ2 )xs,2 − x2  − E   2 θ1,s αθ1,s L2 +β AT2 A2  I  2 2

 m 1  0    E λˆ s − λ∗  − E λˆ s − λ∗  , s ≥ 0, (5.136) + 2β

5.4 Constrained Problem

195

where we use L(xks , λ∗ ) and L(˜xs , λ∗ ) to denote L(xks,1 , xks,2 , λ∗ ) and L(˜xs,1 , x˜ s,2 , λ∗ ), respectively. Since L(x, λ∗ ) is convex for x, we have mL(˜xs , λ∗ ) 

1 = mL m & ≤ 1−

  

m−1  (τ − 1)θ1,s (τ − 1)θ 1,s xm 1− xks−1 , λ∗ s−1 + 1 + θ2 (m − 1)θ2 k=1

(τ − 1)θ1,s θ2

'

' m−1 & (τ − 1)θ1,s  ∗ L(xm , λ ) + 1 + L(xks−1 , λ∗ ). s−1 (m − 1)θ2 k=1

(5.137) 0 Substituting (5.137) into (5.136), and using xm s−1 = xs , we have

     θ2 + θ1,s m−1 ∗ ∗ ∗ E L(xm , λ ) − L(x , λ ) + E L(xks , λ∗ ) − L(x∗ , λ∗ ) s θ1,s θ1,s 1

k=1

 1 − τ θ1,s  ∗ ∗ ∗ E L(xm ≤ s−1 , λ ) − L(x , λ ) θ1,s +

θ2 +

m−1 τ −1  m−1 θ1,s

  E L(xks−1 , λ∗ ) − L(x∗ , λ∗ )

θ1,s k=1 

 1 1 y0s,1 − θ2 x˜ s,1 − (1 − θ1,s − θ2 )x0s,1 + E  2 θ1,s 2  −x∗1    θ1,s L1 +βAT1 A1  I−βAT1 A1



1  1 m m−1 xs,1 − θ2 x˜ s,1 − (1 − θ1,s − θ2 )xs,1 − E  2 θ 1,s

2  ∗ −x1 

 θ1,s L1 +βAT1 A1  I−βAT1 A1



1  1 0 ys,2 − θ2 x˜ s,2 − (1 − θ1,s − θ2 )x0s,2 + E  2 θ1,s 2  −x∗2    αθ1,s L2 +βAT2 A2  I



1 m 1  m−1 ˜ x x − E − θ − (1 − θ − θ )x 2 s,2 1,s 2 s,2 s,2 2 θ 1,s

196

5 Accelerated Stochastic Algorithms

2  −x∗2  

 αθ1,s L2 +βAT2 A2  I

1 + 2β

  m 2 2

ˆ ˆ0 ∗ ∗ E λs − λ  − E λs − λ  , s ≥ 0.

Then from the setting of θ1,s =

1 2+τ s

and θ2 =

m−τ τ (m−1) ,

1 1 − τ θ1,s+1 = , θ1,s θ1,s+1

(5.138)

we have

s ≥ 0,

(5.139)

and τ −1 θ1,s+1 θ2 + m−1 θ2 + θ1,s θ2 = − τ θ2 + 1 = , θ1,s θ1,s+1 θ1,s+1

s ≥ 0.

(5.140)

Substituting (5.139) into the first term and (5.140) into the second term in the righthand side of (5.138), we obtain      θ2 + θ1,s m−1 ∗ ∗ ∗ k ∗ ∗ ∗ E L(xm , λ ) − L(x , λ ) + E L(x , λ ) − L(x , λ ) s s θ1,s θ1,s 1

k=1



  1 ∗ ∗ ∗ E L(xm s−1 , λ ) − L(x , λ ) θ1,s−1 m−1  θ2 + θ1,s−1   E L(xks−1 , λ∗ ) − L(x∗ , λ∗ ) + θ1,s−1 k=1 

 1 1 + E y0s,1 − θ2 x˜ s,1 − (1 − θ1,s − θ2 )x0s,1  2 θ1,s 2  −x∗1     θ1,s L1 +β AT1 A1  I−βAT1 A1



1  1 m m−1 ˜ x x − E − θ − (1 − θ − θ )x 2 s,1 1,s 2 s,1 2  θ1,s s,1 2  − x∗1    T  T   θ1,s L1 +β A1 A1

I−βA1 A1

 2

 1 0 1  0 ∗  ys,2 − θ2 x˜ s,2 − (1 − θ1,s − θ2 )xs,2 − x2  + E 2 θ 1,s

  αθ1,s L2 +β AT2 A2  I

5.4 Constrained Problem

197

 2

 1  1 m m−1 ∗ ˜ x − x x − E − θ − (1 − θ − θ )x 2 s,2 1,s 2 s,2 2  s,2    2 θ1,s αθ1,s L2 +β AT2 A2  I   m 2 2

1 ˆ ˆ0 ∗ ∗ + (5.141) E λs − λ  − E λs − λ  , s ≥ 0. 2β When k = 0, for ˜ s+1 y0s+1 = (1 − θ2 )xm s + θ2 x

θ1,s+1 m−1 ˜ (1 − θ1,s )xm , x + − (1 − θ − θ )x − θ 1,s 2 2 s s s θ1,s we obtain

1 m xs − θ2 x˜ s − (1 − θ1,s − θ2 )xsm−1

θ1,s

=

1 θ1,s+1



y0s+1 − θ2 x˜ s+1 − (1 − θ1,s+1 − θ2 )x0s+1 .

(5.142)

Substituting (5.142) into the third and the fifth terms in the right-hand side of (5.141) and substituting (5.122) into the last term in the right-hand side of (5.141), we obtain      θ2 + θ1,s m−1 ∗ ∗ ∗ E L(xm , λ ) − L(x , λ ) + E L(xks , λ∗ ) − L(x∗ , λ∗ ) s θ1,s θ1,s 1

k=1



1 θ1,s−1



∗ ∗ ∗ E L(xm s−1 , λ ) − L(x , λ )



m−1  θ2 + θ1,s−1   E L(xks−1 , λ∗ ) − L(x∗ , λ∗ ) θ1,s−1 k=1 

 1 1 m−1 ˜ s−1,1 − (1 − θ1,s−1 − θ2 )xs−1,1 + E xm s−1,1 − θ2 x  2 θ1,s−1 2  −x∗1    

+

θ1,s L1 +β AT1 A1  I−βAT1 A1



1  1 m m−1 xs,1 − θ2 x˜ s,1 − (1 − θ1,s − θ2 )xs,1 − E  2 θ 1,s

2  ∗ −x1 

  θ1,s L1 +β AT1 A1  I−βAT1 A1

198

5 Accelerated Stochastic Algorithms



1 m 1  m−1 ˜ x x + E − θ − (1 − θ − θ )x 2 s−1,2 1,s−1 2 s−1,2 s−1,2 2  θ1,s−1 2  ∗ −x2    αθ1,s L2 +β AT2 A2  I



1 m 1  m−1 xs,2 − θ2 x˜ s,2 − (1 − θ1,s − θ2 )xs,2 − E  2 θ 1,s

2  ∗ −x2 

  αθ1,s L2 +β AT2 A2  I

+

1 2β

 2 2

 m  m    E λˆ s−1 − λ∗  − E λˆ s − λ∗  ,

s ≥ 1.

For θ1,s−1 ≥ θ1,s , we have x2θ1,s−1 L ≥ x2θ1,s L , thus      θ2 + θ1,s m−1 ∗ ∗ ∗ E L(xm , λ ) − L(x , λ ) + E L(xks , λ∗ ) − L(x∗ , λ∗ ) s θ1,s θ1,s 1

k=1



1 θ1,s−1

  ∗ ∗ ∗ E L(xm s−1 , λ ) − L(x , λ )

m−1  θ2 + θ1,s−1   E L(xks−1 , λ∗ ) − L(x∗ , λ∗ ) θ1,s−1 k=1 

 1 1 m−1 ˜ s−1,1 − (1 − θ1,s−1 − θ2 )xs−1,1 + E xm s−1,1 − θ2 x  2 θ1,s−1 2  −x∗1    

+

θ1,s−1 L1 +β AT1 A1  I−βAT1 A1



1  1 m m−1 xs,1 − θ2 x˜ s,1 − (1 − θ1,s − θ2 )xs,1 − E  2 θ 1,s

2  ∗ −x1 

  θ1,s L1 +β AT1 A1  I−βAT1 A1



1  1 m m−1 xs−1,2 − θ2 x˜ s−1,2 − (1 − θ1,s−1 − θ2 )xs−1,2 + E  2 θ1,s−1 2  −x∗2     αθ1,s−1 L2 +β AT2 A2  I

5.4 Constrained Problem

199

 2

 1  1 m m−1 ∗ ˜ x − x x − E − θ − (1 − θ − θ )x 2 s,2 1,s 2 s,2 2  s,2    2 θ1,s αθ1,s L2 +β AT2 A2  I   m 2 2

1 ˆ ˆm ∗ ∗ + (5.143) E λs−1 − λ  − E λs − λ  , s ≥ 1. 2β When s = 0, via (5.136) and using y00,1 = x˜ 0,1 = x00,1 , y00,2 = x˜ 0,2 = x00,2 , and θ1,0 ≥ θ1,1 , we obtain      θ2 + θ1,0 m−1 ∗ ∗ ∗ k ∗ ∗ ∗ E L(xm , λ ) − L(x , λ ) + E L(x , λ ) − L(x , λ ) 0 0 θ1,0 θ1,0 1

k=1

 1 − θ1,0 + (m − 1)θ2  L(x00 , λ∗ ) − L(x∗ , λ∗ ) ≤ θ1,0 2  1  + x00,1 − x∗1    θ1,0 L1 +β AT1 A1  I−βAT1 A1 2 

1 m 1  m−1 ˜ x x − E − θ − (1 − θ − θ )x 2 0,1 1,0 2 0,1 0,1 2 θ 1,0

2  −x∗1  

  θ1,1 L1 +β AT1 A1  I−βAT1 A1

2 1   + x00,2 − x∗1    αθ1,0 L2 +β AT2 A2  I 2 

1  1 m m−1 x0,2 − θ2 x˜ 0,2 − (1 − θ1,0 − θ2 )x0,2 − E  2 θ 1,0

2  ∗ −x2 

  αθ1,1 L2 +β AT2 A2  I

+

1 2β

 2 2

 m    ˆ0 λ0 − λ∗  − E λˆ 0 − λ∗  ,

(5.144)

where we use θ1,0 ≥ θ1,1 in the fourth and the sixth lines. Summing (5.143) with s from 1 to S − 1 and adding (5.144), we have      θ1,S + θ2 m−1 ∗ ∗ ∗ k ∗ ∗ ∗ E L(xm , λ ) − L(x , λ ) + E L(x , λ ) − L(x , λ ) S S θ1,S θ1,S 1

k=1

 1 − θ1,0 + (m − 1)θ2  L(x00 , λ∗ ) − L(x∗ , λ∗ ) ≤ θ1,0 2 2  1 1   0 ∗ x + − x + x00,1 − x∗1   T     2  θ1,0 L1 +β A1 A1  I−βAT1 A1 αθ1,0 L2 +β AT2 A2  I 2 2 0,2

200

5 Accelerated Stochastic Algorithms

1 + 2β

 2 2

 m ˆ0 ˆ ∗ ∗ λ0 − λ  − E λS − λ 



1  1 m m−1 − E xS,1 − θ2 x˜ S,1 − (1 − θ1,s − θ2 )xS,1  2 θ1,S 2  −x∗1     θ1,S L1 +β AT1 A1  I−βAT1 A1

 2

 1 m 1  m−1 ∗  xS,2 − θ2 x˜ S,2 − (1 − θ1,S − θ2 )xs,2 − x2  − E 2 θ

  αθ1,S L2 +β AT2 A2  I

1,S



 1 − θ1,0 + (m − 1)θ2  L(x00 , λ∗ ) − L(x∗ , λ∗ ) θ1,0 2 2  1 1   0 ∗ + − x + x00,1 − x∗1   T     x 2 0,2 θ1,0 L1 +β A1 A1  I−βAT1 A1 αθ1,0 L2 +β AT2 A2  I 2 2 2 2

 m 1  ˆ0 ˆ ∗ ∗ λ λ + − λ − E − λ (5.145)  0   .  S 2β

m Now we analyze λˆ S − λ∗ 2 . From (5.126), for s ≥ 1 we have m m m 0 λˆ s − λˆ s−1 = λˆ s − λˆ s =

m  

k k−1 λˆ s − λˆ s



k=1 a



m &  k=1

=

'  1−θ −θ    1  k θ2  1,s 2 Axs − b − Axk−1 A˜ x − b − − b s s θ1,s θ1,s θ1,s

   β(θ2 + θ1,s ) m−1 β  m Axks − b Axs − b + θ1,s θ1,s k=1

− b

=

 mβθ2   β(1 − θ1,s − θ2 )  m Axs−1 − b − A˜xs − b θ1,s θ1,s

   β(θ2 + θ1,s ) m−1 β  m Axks − b Axs − b + θ1,s θ1,s k=1 &   1 − θ1,s − (τ − 1)θ1,s Axm −β s−1 − b θ1,s  τ −1   θ1,s m−1 θ2 + m−1 k Axs−1 − b + θ1,s k=1

5.4 Constrained Problem

c

=

201

   β(θ2 + θ1,s ) m−1 β  m Axks − b Axs − b + θ1,s θ1,s k=1



β θ1,s−1

   β(θ2 + θ1,s−1 ) m−1  m Axks−1 − b , Axs−1 − b − θ1,s−1

(5.146)

k=1

a

b

c

where = uses (5.121), = uses the definition of x˜ s , and = uses (5.139) and (5.140). When s = 0, we can obtain m 0 λˆ 0 − λˆ 0 =

m    k k−1 λˆ 0 − λˆ 0 k=1

=

=

m &  β(1 − θ − θ )    β  k 1,0 2 Ax0 − b − Axk−1 −b 0 θ1,0 θ1,0 k=1 ' θ2 β  0 − Ax0 − b θ1,0

   β(θ2 + θ1,0 ) m−1 β  m Axk0 − b Ax0 − b + θ1,0 θ1,0 k=1

 β[1 − θ1,0 + (m − 1)θ2 ]  0 Ax0 − b . − θ1,0

(5.147)

Summing (5.146) with s from 1 to S − 1 and adding (5.147), we have m m 0 0 λˆ S − λ∗ = λˆ S − λˆ 0 + λˆ 0 − λ∗

=

a

=

   β(θ2 + θ1,S ) m−1 β  m AxkS − b AxS − b + θ1,S θ1,S k=1 7 6  β 1 − θ1,0 + (m − 1)θ2 Ax00 − b − θ1,0  β(1 − θ1,0 )  0 0 +λ˜ 0 + Ax0 − b − λ∗ θ1,0  (m − 1)(θ2 + θ1,S )β + β  AˆxS − b θ1,S  β(m − 1)θ2  0 0 +λ˜ 0 − Ax0 − b − λ∗ , θ1,0

(5.148)

202

5 Accelerated Stochastic Algorithms a

where the equality = uses the definition of xˆ S . Substituting (5.148) into (5.145) and using that L(x, λ) is convex in x, we have    1   β(m − 1)(θ2 + θ1,S ) + β AˆxS −b E  2β θ1,S 2    β(m−1)θ2  0 0 Ax0 − b + λ˜ 0 − λ∗  −  θ1,0

 (m − 1)(θ2 + θ1,S ) + 1  L(ˆxS , λ∗ ) − L(x∗ , λ∗ ) +E θ1,S 2  1  1 − θ1,0 + (m − 1)θ2  ˆ0  L(x00 , λ∗ ) − L(x∗ , λ∗ ) + ≤ λ 0 − λ ∗  θ1,0 2β 2 1   + x00,1 − x∗1    θ1,0 L1 +β AT1 A1  I−βAT1 A1 2 2 1   + x00,2 − x∗2    . αθ1,0 L2 +β AT2 A2  I 2

0

By the definitions of L(x, λ) and λˆ 0 we can obtain (5.133).



With Theorem 5.10 in hand, we can check that the convergence rate of Acc-SADMM to solve problem (5.98) is exactly O(1/(mS)). However, for nonaccelerated methods, the convergence rate for ADMM in the non-ergodic sense is √ O(1/ mS) [5].

5.5 The Infinite Case In the previous section, the momentum technique is used to achieve faster convergence rate. We introduce the other benefit: the momentum technique can increase the mini-batch size. Unlike the previous sections, here we focus on the infinite case which differs from the finite case, since the true gradient for the infinite case is unavailable. We show that by using the momentum technique, the mini-batch size can be largely increased. We consider the problem: min f (x) ≡ E[F (x; ξ )], x

(5.149)

where f (x) is μ-strongly convex and L-smooth and its stochastic gradient has finite variance bound σ , i.e., E∇F (x; ξ ) − ∇f (x)2 ≤ σ 2 .

5.5 The Infinite Case

203

Algorithm 5.8 Stochastic accelerated gradient descent (SAGD) √ 1 Input η = 2L , θ = μ/(2L), x0 , and mini-batch size b. for k = 0 to K do 2 k−1 , yk = 1+θ xk − 1−θ 1+θ x Randomly sampleb functions whose index set is denoted as Ik , ˜ (yk ) = 1 i∈Ik ∇F (yk ; i) , Set ∇f b k+1 ˜ (yk ). = yk − η∇f x end for k Output xK+1 .

We apply mini-batch AGD to solve (5.149). The algorithm is shown in Algorithm 5.8. 2

σ Theorem 5.11 For Algorithm 5.8, set the mini-batch size b = 2Lθ , where   √ θ = μ/(2L). After running K = log1−√ μ f (x0 ) − f (x∗ ) + Lx0 − x∗ 2 − 2L   L 0 − x∗ 2 /) iterations, we have log(Lx log1−√ μ () ∼ O μ 2L

 2   Ef (xK+1 ) − f (x∗ ) + LE xK+1 − (1 − θ )xK − θ x∗  ≤ 2. Proof For f (x) is L-smooth, we have f (xk+1 ) ≤ f (yk ) + ∇f (yk ), xk+1 − yk  +

2 L   k+1 − yk  . x 2

(5.150)

For f (y) is μ-strongly convex, we have 2  μ    f (yk ) ≤ f (x∗ ) + ∇f (yk ), yk − x∗ − yk − x∗  , 2

(5.151)

and   f (yk ) ≤ f (xk ) + ∇f (yk ), yk − xk .

(5.152)

Multiplying (5.151) by θ and (5.152) by 1−θ and summing the results with (5.150), we have   f (xk+1 ) ≤ θf (x∗ ) + (1 − θ )f (xk ) + ∇f (yk ), xk+1 − (1 − θ )xk − θ x∗ +

2 μθ  2 L  k+1  k   − yk  − x y − x∗  2 2

= θf (x∗ ) + (1 − θ )f (xk )   ˜ (yk ), xk+1 − (1 − θ )xk − θ x∗ + ∇f (yk ) − ∇f

204

5 Accelerated Stochastic Algorithms

2 μθ  2 L  k+1  k   − yk  − x y − x∗  2 2   ˜ (yk ), xk+1 − (1 − θ )xk − θ x∗ , + ∇f +

√ where we will set θ = μ/(2L). By taking the expectation on the random number in Ik , we have   ˜ (yk ), xk+1 − (1 − θ )xk − θ x∗ Ek ∇f (yk ) − ∇f   a ˜ (yk ), xk+1 − yk = Ek ∇f (yk ) − ∇f  2 2 L  1    ˜ (yk ) Ek ∇f (yk ) − ∇f  + Ek xk+1 − yk  , 2L 2   a ˜ (yk ) = 0. Thus we have where in = we use Ek ∇f (yk ) − ∇f ≤

Ek f (xk+1 )

  ≤ θf (x∗ ) + (1 − θ )f (xk ) − 2LEk xk+1 − yk , xk+1 − (1 − θ )xk − θ x∗

2 μθ  2 L  2 L   k      + Ek xk+1 − yk  − y − x∗  + Ek xk+1 − yk  2 2 2  2 1  ˜ (yk ) + Ek ∇f (yk ) − ∇f  2L  2 1  ˜ (yk ) Ek ∇f (yk ) − ∇f ≤ θf (x∗ ) + (1 − θ )f (xk ) +  2L 2  2      +L yk − (1 − θ )xk − θ x∗  − LEk xk+1 − (1 − θ )xk − θ x∗  −

2 μθ   k  y − x∗  . 2

Then using  2   L yk − (1 − θ )xk − θ x∗  2   1 k 1−θ k ∗ ∗ y x = θ 2L  − θ x − − (1 − θ )x  θ θ 2 

  k 1 1−θ k ∗ k ∗ θ (y − θ y x = θ 2L  − x ) + − − (1 − θ )x   θ θ  '2

&   k 1 k 1 k 2  ∗ ∗  y − x −x  = θ L θ (y − x ) + (1 − θ ) 1 + θ θ

5.5 The Infinite Case

205

2 

 2   a 1 k 1 k   ∗ 1 + x y ≤ θ 3 L yk − x∗  + θ 2 (1 − θ )L  − − x   θ θ 2 

 2   1 k 1 k b θμ  k ∗ 2 ∗  1 + y y = − x + θ (1 − θ )L − − x x    ,  2 θ θ a

b

where in ≤ we use the convexity of  · 2 and in = we use θ = from the update rule, we have

√ μ/(2L) . Then



1 k 1 k 1 k k−1 x − (1 − θ )x y − x . = 1+ θ θ θ So we obtain  2   Ek f (xk+1 ) − f (x∗ ) + LEk xk+1 − (1 − θ )xk − θ x∗  2

   ≤ (1 − θ ) f (xk ) − f (x∗ ) + L xk − (1 − θ )xk−1 − θ x∗   2 1  ˜ (yk ) Ek ∇f (yk ) − ∇f  . 2L   Setting K = log1−θ f (x0 ) − f (x∗ ) + Lx0 − x∗ 2 + log1−θ ( −1 )   L O μ log(1/) and taking the first K iterations into account, we have +



2    Ef (xK+1 ) − f (x∗ ) + LE xK+1 − (1 − θ )xK − θ x∗  K  2 1   ˜ (yi ) ≤+ (1 − θ )K−i E ∇f (yi ) − ∇f  . 2L i=0

By the setting of b =

σ2 2Lθ

mini-batch size, we have

 2  ˜ (yk ) E ∇f (yk ) − ∇f  ≤ 2Lθ ,

k ≥ 0.

So we have 2    Ef (xK+1 ) − f (x∗ ) + LE xK+1 − (1 − θ )xK − θ x∗  ≤ 2. 

206

5 Accelerated Stochastic Algorithms

 2 σ . Theorem 5.11 indicates that the total stochastic gradient calls is O˜ μ   L However, the algorithm converges in O˜ μ steps, which is the fastest rate even in deterministic algorithms. So the momentum technique enlarges the mini-batch   size by 

L μ

times.

References 1. Z. Allen-Zhu, Katyusha: the first truly accelerated stochastic gradient method, in Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, Montreal, (2017), pp. 1200–1206 2. Z. Allen-Zhu, E. Hazan, Optimal black-box reductions between optimization objectives, in Advances in Neural Information Processing Systems, Barcelona, vol. 29 (2016), pp. 1614– 1622 3. Z. Allen-Zhu, Y. Li, Neon2: finding local minima via first-order oracles, in Advances in Neural Information Processing Systems, Montreal, vol. 31 (2018), pp. 3716–3726 4. Z. Allen-Zhu, Y. Yuan, Improved SVRG for non-strongly-convex or sum-of-non-convex objectives, in Proceedings of the 33th International Conference on Machine Learning, New York, (2016), pp. 1080–1089 5. D. Davis, W. Yin, Convergence rate analysis of several splitting schemes, in Splitting Methods in Communication, Imaging, Science, and Engineering (Springer, New York, 2016), pp. 115– 163 6. A. Defazio, F. Bach, S. Lacoste-Julien, SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives, in Advances in Neural Information Processing Systems, Montreal, vol. 27 (2014), pp. 1646–1654 7. C. Fang, C.J. Li, Z. Lin, T. Zhang, SPIDER: near-optimal non-convex optimization via stochastic path-integrated differential estimator, in Advances in Neural Information Processing Systems, Montreal, vol. 31 (2018), pp. 689–699 8. O. Fercoq, P. Richtárik, Accelerated, parallel, and proximal coordinate descent. SIAM J. Optim. 25(4), 1997–2023 (2015) 9. D. Garber, E. Hazan, C. Jin, S.M. Kakade, C. Musco, P. Netrapalli, A. Sidford, Faster eigenvector computation via shift-and-invert preconditioning, in Proceedings of the 33th International Conference on Machine Learning, New York, (2016), pp. 2626–2634 10. R. Johnson, T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, in Advances in Neural Information Processing Systems, Lake Tahoe, vol. 26 (2013), pp. 315–323 11. Q. Lin, Z. Lu, L. Xiao, An accelerated proximal coordinate gradient method, in Advances in Neural Information Processing Systems, Montreal, vol. 27 (2014), pp. 3059–3067 12. H. Lin, J. Mairal, Z. Harchaoui, A universal catalyst for first-order optimization, in Advances in Neural Information Processing Systems, Montreal, vol. 28 (2015), pp. 3384–3392 13. J. Mairal, Optimization with first-order surrogate functions, in Proceedings of the 30th International Conference on Machine Learning, Atlanta, (2013), pp. 783–791 14. Y. Nesterov, A method for unconstrained convex minimization problem with the rate of convergence O(1/k 2 ). Sov. Math. Dokl. 27(2), 372–376 (1983) 15. Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course (Springer, New York, 2004) 16. Y. Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)

References

207

17. M. Schmidt, N. Le Roux, F. Bach, Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1–2), 83–112 (2017) 18. S. Shalev-Shwartz, T. Zhang, Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14, 567–599 (2013) 19. S. Shalev-Shwartz, T. Zhang, Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization, in Proceedings of the 31th International Conference on Machine Learning, Beijing, (2014), pp. 64–72

Chapter 6

Accelerated Parallel Algorithms

Along with the popularity of multi-core computers and the crucial demands for handling large-scale data in machine learning, designing parallel algorithms has attracted lots of interests in recent years. In this section, we introduce how to apply the momentum technique to accelerate parallel algorithms.

6.1 Accelerated Asynchronous Algorithms Parallel algorithms can be implemented in two fashions: asynchronous updates and synchronous updates. For asynchronous update, the threads (machines) are allowed to work independently and concurrently and none of them needs to wait for the others to finish computing. Therefore, a large overhead is reduced when we compare asynchronous algorithms with synchronous ones, especially when the computation costs for each thread are different, or a large load imbalance exists. For synchronous algorithms, the updates are essentially identical to the serial one with variants only on implementation. However, for asynchronous algorithms the states of the parameters for computing the gradient are different. This is because when one thread is computing the gradient, other threads might have updated the parameters. So the gradient might be computed on an old state. We consider a simple algorithm, asynchronous gradient descent, in which all the threads concurrently load the parameters and compute the gradients, and update on the latest parameters. If we assign a global counter k to indicate each update from any thread, the iteration can be formulated as xk+1 = xk − γ ∇f (xj (k) ),

© Springer Nature Singapore Pte Ltd. 2020 Z. Lin et al., Accelerated Optimization for Machine Learning, https://doi.org/10.1007/978-981-15-2910-8_6

209

210

6 Accelerated Parallel Algorithms

where γ is the step size and xj (k) is the state of x at the reading time. Typically, xj (k) can be any of {x1 , · · · , xk } when the parameters are updated with locks. In other words, for asynchronous algorithms the gradient might be delayed. Up to now, lots of plain asynchronous algorithms have been designed. Niu et al. [12] and Agarwal et al. [1] proposed Asynchronous Stochastic Gradient Descent (ASGD). Some variance reduction (VR) based asynchronous algorithms were also designed later, such as [4, 13] in the primal space and [8] in the dual. Mania et al. [10] introduced a unified approach to analyze asynchronous stochastic algorithms by viewing them as serial methods operating on noisy inputs. Under the proposed framework of perturbed iterate analysis, lots of asynchronous algorithms, e.g., ASGD, asynchronous SVRG (ASVRG), and asynchronous SCD, can be studied in a unified scheme. Notably, linear convergence rates of asynchronous SCD and asynchronous sparse SVRG are achievable. However, to the best of our knowledge, the framework cannot incorporate the momentum based acceleration algorithms. In the following, we illustrate how to apply the momentum technique to asynchronous algorithms. The challenges lying in designing asynchronous algorithms are twofold: • In serial accelerated schemes, the extrapolation points are subtly and strictly  k  k−1 ) connected with xk and xk−1 , i.e., yk = xk + θk (1−θ x − xk−1 . However, θk such information might not be available for asynchronous algorithms because there are unknown delays in updating the parameters. • Since xk+1 is updated based on yk , i.e., xk+1 = yk − L1 ∇f (yk ), xk+1 is related to the old past updates (to generate yk ). This is different from unaccelerated algorithms, e.g., in gradient descent, xk+1 = xk − L1 ∇f (xk ), in which xk+1 only depends on xk . The main technique to tackle the above difficulties is by the “momentum compensation” [5]. We will show that doing only one original step of momentum prevents us from bounding the distance between delayed gradient and the latest one. Instead, by “momentum compensation” our algorithm is able to achieve a faster rate by order. Later, we will demonstrate that the technique can be applied to modern stochastic algorithms.

6.1.1 Asynchronous Accelerated Gradient Descent Because the gradients are delayed for asynchronous algorithms, we make an assumption on a bounded delay for analysis: Assumption 6.1 All the updates before the (k − τ − 1)-th iteration are completed before the “read” step of the k-th iteration.

6.1 Accelerated Asynchronous Algorithms

211

In most asynchronous parallelism, there are typically two schemes: • Atom (consistent read) scheme: The parameter x is updated as an atom. When x is read or updated in the central node, it will be locked. So xj (k) ∈ {x0 , x1 , · · · , xk }. • Wild (inconsistent read) scheme: To further reduce the system overhead, there is no lock in implementation. All the threads may perform modifications on x at the same time [12]. Obviously, analysis becomes more complicated in this situation. In this book, we only consider the atom scheme. If Assumption 6.1 holds, we have j (k) ∈ {k − τ, k − τ + 1, · · · , k}. We consider the following objective function: min f (x) + h(x), x

where f (x) is L-smooth and both f (x) and h(x) are convex. Recall the serial accelerated gradient descent (AGD) [11] described in Sect. 2.2.1 (called Accelerated Proximal Gradient Method when h(x) = 0). If we directly implement AGD [11] asynchronously, we can only obtain the gradient ∇f (yj (k) ) due to the delay. Now we measure the distance between yj (k) and yk . We have yk = xk +

θk (1 − θk−1 ) k (x − xk−1 ). θk−1

(6.1)

k−1 ) Setting ak = θk (1−θ , we have ak ≤ 1, because θk is monotonically nonθk−1 increasing and 0 ≤ θk ≤ 1 in our setting. We have

yk = xk + ak (xk − yk−1 ) + ak (yk−1 − xk−1 ) = yk−1 + (ak + 1)(xk − yk−1 ) + ak ak−1 (xk−1 − xk−2 ),

k ≥ 2. (6.2)

For xk−1 − xk−2 , where k ≥ j (k) + 2 ≥ 2, we have xk−1 − xk−2 = xk−1 − yk−2 + yk−2 − xk−2 = xk−1 − yk−2 + ak−2 (xk−2 − xk−3 ) = xk−1 − yk−2 + ak−2 (xk−2 − yk−3 ) + ak−2 ak−3 (xk−3 − xk−4 )

212

6 Accelerated Parallel Algorithms

=x

k−1

⎛ +⎝

−y

k−2

k−2 8

+



 k−2  8 i i−1 al (x − y )

k−2  i=j (k)+1

l=i

al ⎠ (xj (k) − xj (k)−1 ).

(6.3)

l=j (k)

k

Set b(l, k) =

where l ≤ k. Substituting (6.3) into (6.2), we have

i=l ai ,

yk = yk−1 + (b(k, k) + 1)(xk − yk−1 ) + b(k − 1, k)(xk−1 − yk−2 )   b(i, k)(xi − yi−1 ) + b(j (k), k)(xj (k) − xj (k)−1 )

k−2 

+

i=j (k)+1

  b(i, k)(xi − yi−1 )

k 

= yk−1 + (xk − yk−1 ) +

i=j (k)+1

+b(j (k), k)(xj (k) − xj (k)−1 ).

(6.4)

By checking, when k = j (k) and k = j (k) + 1, (6.4) is also right. Summing (6.4) with k = j (k) + 1 to k, we have y =y k

j (k)

+

k 

(x − y i

i=j (k)+1



k 

+⎝

i−1

)+

k 

l 

b(i, l)(xi − yi−1 )

l=j (k)+1 i=j (k)+1



b(j (k), i)⎠ (xj (k) − xj (k)−1 )

i=j (k)+1 a

=y

j (k)

⎛ +⎝

+

k  i=j (k)+1

k 

(x − y i

i−1



)+

k  i=j (k)+1

 k 

 b(i, l) (xi − yi−1 )

l=i

b(j (k), i)⎠ (xj (k) − xj (k)−1 ),

(6.5)

i=j (k)+1 a

where = is obtained by reordering the summations on i and l. Notice that xj (k) − xj (k)−1 is related to all the past updates before j (k). If we directly implement AGD asynchronously like most asynchronous algorithms, then j (k) < k (due to delay),  so ki=j (k)+1 b(j (k), i) > 0. Because xj (k) − xj (k)−1 is hard to bound, it causes difficulty in obtaining the accelerated convergence rate.

6.1 Accelerated Asynchronous Algorithms

213

Algorithm 6.1 Asynchronous accelerated gradient descent (AAGD) k−1 ) Input θk , step size γ , x0 = 0, z0 = 0, ak = θk (1−θ , and b(l, k) = θk−1 for k = 0 to K do   k j (k) − xj (k)−1 ), 1 wj (k) = xj (k) + i=j (k) b(j (k), i) (x   θk δ2 , 2 δ k = argminδ h(zk + δ) + ∇f (wj (k) ), δ + 2γ

k

i=l

ai .

3 zk+1 = zk + δ k , 4 xk+1 = θk zk+1 + (1 − θk )xk . end for k Output xK+1 .

Instead, one can compensate the momentum term and introduce a new extrapolation point wj (k) , such that ⎛ ⎞ k  wj (k) = xj (k) + ⎝ b(j (k), i)⎠ (xj (k) − xj (k)−1 ). i=j (k)

One can find that these are actually several steps of momentum. Then the difference between yk and wj (k) can be directly bounded by the norm of several latest updates,  2      namely  ki=j (k)+1 1 + kl=i b(i, l) (xi − yi−1 ) (see (6.23)). So we are able to obtain the accelerated rate. The algorithm is shown in Algorithm 6.1. We have the following theorem. Theorem 6.1 Under Assumption 6.1, for Algorithm 6.1, when h(x) is generally 2 convex, if the step size γ satisfies 2γ L + 3γ L(τ 2 + 3τ )2 ≤ 1, θk = k+2 , and the 1 first τ iterations are updated in serial, we have F (xK+1 ) − F (x∗ ) ≤

θK2 0 x − x∗ 2 . 2γ

(6.6)

When h(x) is μ-strongly convex (μ ≤ L), the step size γ satisfies 52 γ L + γ L(τ 2 + √ −γ μ+ γ μ2 +4γ μ 3τ )2 ≤ 1, and θk = , denoted as θ instead, we have 2 K+1

F (x



&

) − F (x ) ≤ (1 − θ )

K+1



F (x ) − F (x ) + 0



μθ θ2 + 2γ 2



∗ 2

x − x  0

' .

(6.7) Proof We introduce yk = (1 − θk )xk + θk zk .

1 This

assumption is used only for simplifying the proof.

(6.8)

214

6 Accelerated Parallel Algorithms

Because xk = θk−1 zk + (1 − θk−1 )xk−1 , we can eliminate zk and obtain yk =  k  k−1 ) x − xk−1 , which is (6.1). We can further have xk + θk (1−θ θk−1 xk+1 − yk = θk (zk+1 − zk ).

(6.9)

  From (6.5) and yj (k) = xj (k) + aj (k) xj (k) − xj (k)−1 , we have y =x k

j (k)

⎛ +⎝

k 

+

i=j (k)+1 k 

(x − y i

i−1



k 

)+

 k 

i=j (k)+1

 b(i, l) (xi − yi−1 )

l=i

b(j (k), i)⎠ (xj (k) − xj (k)−1 ).

(6.10)

i=j (k)

Through the optimality of zk+1 in Step 2 of Algorithm 6.1, we have θk (zk+1 − zk ) + γ ∇f (wj (k) ) + γ ξ k = 0,

(6.11)

where ξ k ∈ ∂h(zk+1 ) and (xk+1 − yk ) + γ ∇f (wj (k) ) + γ ξ k = 0,

(6.12)

thanks to (6.9). For f is L-smooth, we obtain 2 L  k+1  − yk  x 2 2 L a   = f (yk ) − γ ∇f (yk ), ∇f (wj (k) ) + ξ k  + xk+1 − yk  2 2 L b   = f (yk ) − γ ∇f (wj (k) ) + ξ k , ∇f (wj (k) ) + ξ k  + xk+1 − yk  2

f (xk+1 ) ≤ f (yk ) + ∇f (yk ), xk+1 − yk  +

+γ ξ k , ∇f (wj (k) ) + ξ k  + γ ∇f (wj (k) ) − ∇f (yk ), ∇f (wj (k) ) + ξ k 

   2 γL  c  1 xk+1 − yk  − ξ k , xk+1 − yk  = f (yk ) − γ 1 −  2 γ −∇f (wj (k) ) − ∇f (yk ), xk+1 − yk ,

(6.13)

  a b where in = we use (6.12), in = we use −∇f (yk ) = − ∇f (wj (k) ) + ξ k + ξ k +   c ∇f (wj (k) ) − ∇f (yk ) , and in = we reuse (6.12).

6.1 Accelerated Asynchronous Algorithms

215

For the last term of (6.13), applying the Cauchy–Schwartz inequality, we have   ∇f (wj (k) ) − ∇f (yk ), xk+1 − yk 2 γ C    2 γ  1     1 xk+1 − yk  ∇f (wj (k) ) − ∇f (yk ) +  2C1 2 γ 2 γ C   2 1  k+1 γ L2  1   j (k) k k   w x ≤ − y + − y     . 2C1 2 γ ≤

(6.14)

Substituting (6.14) into (6.13), we obtain

   2 γL   1 xk+1 − yk  − ξ k , xk+1 − yk  f (xk+1 ) ≤ f (yk ) − γ 1 −  2 γ 2 γ C   2 1  k+1 γ L2  1   j (k) k k   x + − y + − y (6.15) w    . 2C1 2 γ Considering zk+1 − x∗ 2 , we have 2 1   k+1  − θk x∗  θk z 2γ 2 1   k  = θk z − θk x∗ + θk zk+1 − θk zk  2γ 2 2 1  1     k  k+1 = − θk zk  θk z − θk x∗  + θk z 2γ 2γ   1  + θk zk+1 − zk , θk zk − θk x∗ γ  2 2 1  a 1   k+1   = − yk  − ∇f (wj (k) ) + ξ k , θk zk − θk x∗ , θk zk − θk x∗  + x 2γ 2γ (6.16) a

where in = we use (6.11). For the last term, we have −∇f (wj (k) ), θk zk − θk x∗  = −∇f (wj (k) ), yk − (1 − θk )xk − θk x∗  a

= −∇f (wj (k) ), wj (k) − (1 − θk )xk − θk x∗  − ∇f (wj (k) ), yk − wj (k)  b c

≤ (1 − θk )f (xk ) + θk f (x∗ ) − f (wj (k) ) − ∇f (wj (k) ), yk − wj (k)  d

≤ (1 − θk )f (xk ) + θk f (x∗ ) − f (yk ) + ∇f (yk ) − ∇f (wj (k) ), yk − wj (k) , (6.17)

216

6 Accelerated Parallel Algorithms a

c

b

where = uses (6.8), in = we insert wj (k) , in ≤ we use the convexity of f , namely applying f (wj (k) ) + ∇f (wj (k) ), a − wj (k)  ≤ f (a) d

on a = x∗ and a = xk , respectively, and in ≤ we use −f (wj (k) ) ≤ −f (yk ) + ∇f (yk ), yk − wj (k) . Substituting (6.17) into (6.16), we have 2 1   k+1  − θk x∗  θk z 2γ 2 2 1  1   k  k+1   ≤ − yk  − ξ k , θk zk − θk x∗  θk z − θk x∗  + x 2γ 2γ   +(1 − θk )f (xk ) + θk f (x∗ ) − f (yk ) + ∇f (yk ) − ∇f (wj (k) ), yk − wj (k) . (6.18) Adding (6.15) and (6.18), we have k+1

f (x

   2 1 k+1 1 γL  k   x − ) ≤ (1 − θk )f (x ) + θk f (x ) − γ − y  2 2 γ  2 γ C  1   2 γ L2  1   j (k) k k+1 k k k+1 k  x −ξ , x −y + −y  + − y w  2C1 2 γ     − ξ k , θk zk − θk x∗ + ∇f (yk ) − ∇f (wj (k) ), yk − wj (k) k





2 2 1  1   k  k+1   − θk x∗  θk z − θk x∗  − θk z 2γ 2γ

   2 a 1 k+1 1 γL  k ∗ k   x − ≤ (1 − θk )f (x ) + θk f (x ) − γ − y  2 2 γ 2

 2 γL   −ξ k , xk+1 − yk  + + L wj (k) − yk  2C1   2   2 γ C1  k ∗  1 xk+1 − yk  − ξ k , θk zk − θk x∗  + 1  + z − θ x θ k k   2 γ 2γ +



2 1    k+1 − θk x∗  , θk z 2γ

(6.19)

6.1 Accelerated Asynchronous Algorithms

217

a

where in ≤ we use ∇f (yk ) − ∇f (wj (k) ), yk − wj (k)  ≤ Lyk − wj (k) 2 , by the Cauchy–Schwartz inequality and the L-smoothness of f (·). Since ξ k ∈ ∂h(zk+1 ), we also have −ξ k , xk+1 − yk  − ξ k , θk zk − θk x∗  = θk ξ k , x∗ − zk+1  ≤ θk h(x∗ ) − θk h(zk+1 ) −

2 μθk   k+1  − x∗  . z 2

(6.20)

For the convexity of h(zk+1 ), we have θk h(zk+1 ) + (1 − θk )h(xk ) ≥ h(xk+1 ).

(6.21)

Substituting (6.20) into (6.19) and using (6.21), we have

   2 1 γL   1 xk+1 − yk  −  2 2 γ 2

 2 γ C   2 1  k+1 γL 1   j (k) k k   x + + L w −y  + −y   2C1 2 γ

   2 μ  k+1 1 1  k 2  + + − θk x∗  . (6.22) θk z θk z − θk x∗  − 2γ 2γ 2θk

F (xk+1 ) ≤ (1 − θk )F (xk ) + θk F (x∗ ) − γ



We first consider the generally convex case. Through (6.10), we have  2  j (k)  − yk  w  2      k   k  i i−1   = b(i, l) (x − y ) 1+ i=j (k)+1  l=i ⎡ ⎤    k k k k 2      a   ≤⎣ b(i, l) ⎦ b(i, l) xi − yi−1  1+ 1+ l=i

i=j (k)+1

⎡ b

≤⎣



k 

1+

k−i+1 

c

k−j (k)

≤⎣

ii=1

 1+

ii  l=1

k 

1 ⎦

l=1

i=j (k)+1



⎤

1 ⎦

l=i

i=j (k)+1

⎤

 1+

k−i+1 

ii=1

 1+

2    1 xi − yi−1 

l=1

i=j (k)+1 k−j (k)



ii  l=1



2    1 xk−ii+1 − yk−ii 

218

6 Accelerated Parallel Algorithms



min(τ,k) 

d

≤⎣

 1+

ii=1

τ 2 + 3τ ≤ 2

ii 

⎤ τ   ii  2     1 ⎦ 1 xk−ii+1 − yk−ii  1+

l=1 min(τ,k) 

ii=1

l=1

2    (i + 1) xk−i+1 − yk−i  ,

(6.23)

i=1

a

where in ≤ we use the fact that for ci ≥ 0, 0 ≤ i ≤ n, c1 a1 + c2 a2 + · · · cn an 2

  ≤ (c1 + c2 + · · · + cn ) c1 a1 2 + c2 a2 2 + · · · cn an 2 , b

c

because function φ(x) = x2 is convex, in ≤ we use b(i, l) ≤ 1, in ≤ we change d

variable ii = k − i + 1, and in ≤ we use k − j (k) ≤ τ . As we are more interested in the limiting case, when k is large (e.g., k ≥ 2(τ −1)) we assume that at the first τ steps, we run our algorithm in serial. Dividing θk2 on both sides of (6.23) and summing the results with k = 0 to K, we have K K 2  2  1  1   j (k)  j (k) k k − y = − y w  w  θ2 θ2 k=τ k k=0 k



K min(τ,k−τ ) 2 τ 2 + 3τ   i + 1   k−i+1 k−i  − y x  2 θk2 k=τ

a



i=1

K min(τ,k−τ ) 2 τ 2 + 3τ   4(i + 1)   k−i+1 k−i  x − y   2 2 θk−i k=τ

i=1



K τ 2 τ 2 + 3τ   4(i + 1)   k−i+1 k−i  − y x  2 2 θk−i i=1 k=i+τ

=

τ K−i 2  1  τ 2 + 3τ   k  +1 k  [4(i + 1)] − y x  2 θ2 i=1 k  =τ k 



τ K 2  1  τ 2 + 3τ   k+1 k [4(i + 1)] − y x  2 θk2 i=1

b

= (τ 2 + 3τ )2

k=0

K 2  1   k+1 k − y x  , θ2 k=0 k

(6.24)

6.1 Accelerated Asynchronous Algorithms a

where in ≤ we use the fact that b

1 θk2

≤ τ

219 4 2 θk−i

for θk =

2 k+2 ,

k ≥ 2(τ − 1), and

i ≤ min(τ, k − τ ), and = is because i=1 (i + 1) = 12 (τ 2 + 3τ ). Dividing θk2 on both sides of (6.22) and using μ = 0, we have F (xk+1 ) − F (x∗ ) θk2



   2 (1 − θk )(F (xk ) − F (x∗ )) γ 1 γL   1 xk+1 − yk  − −   2 2 2 γ θk θk 2 

    2 γ C1  1  k+1 2 1 γ γ 2 L2 j (k) k  k   w x + 2 + γL  − y + − y   γ 2C1 θk 2θk2  γ 2 2 1  1     k  k+1 + − x∗  z − x∗  − z 2γ 2γ

   2 a F (xk ) − F (x∗ ) γ 1 γL   1 xk+1 − yk  ≤ − −  2 γ θ2 θ2 2 ≤

k

k−1

   2 γ C 1 1 γ j (k) k  w + 2 + Lγ  − y  + 2θ 2 γ 2C1 θk k  2 2 1  1  k  k+1   + − x∗  , z − x∗  − z 2γ 2γ

a

γ 2 L2

where in ≤ we use

1−θk θk2



1 2 θk−1

for θk =

2 k+2 ,

   2  1 k+1 k   x − y  γ (6.25)

k ≥ 1.

Telescoping (6.25) with k from 0 to K and applying (6.24), and when k = 0, for = 0, we have

1−θ0 θ0

F (xK+1 ) − F (x∗ ) θk2 ≤−



  K   2 γ 1 γL   1 xk+1 − yk  −  2 γ θ2 2 k=0 k



  K   2 1 γ γ 2 L2 j (k) k   w + + γ L − y  γ θ 2 2C1 k=0 k   K   2 1 k+1 γ C1  k   x + − y  2θk2  γ k=0

+

2 2 1  1   0  K+1   − x∗  z − x∗  − z 2γ 2γ

220

6 Accelerated Parallel Algorithms



2 2 1  1   0  K+1   − x∗  z − x∗  − z 2γ 2γ ' 2 2 &

γ L 1 γ L C1 − − − − + γ L (τ 2 + 3τ )2 2 2 2 2C1   K   2 γ   1 xk+1 − yk  . ×  θk2  γ k=0

Setting C1 = γ L, we have that 2γ L + 3γ L(τ 2 + 3τ )2 ≤ 1 implies 1 γ L C1 − − − 2 2 2



γ 2 L2 + γ L (τ 2 + 3τ )2 ≥ 0. 2C1

So 2 2 F (xK+1 ) − F (x∗ ) 1  1   K+1  0 ∗ ∗ z z + − x ≤ − x     , 2γ 2γ θK2 and we have (6.6). Now we consider the strongly convex case. In the following, we set θk = θ in the previous deductions. Multiplying (6.23) with (1 − θ )K−k and telescoping the results with k from 0 to K, we have K  2    (1 − θ )K−k wj (k) − yk  k=0



K min(τ,k)  2 τ 2 + 3τ     (i + 1)(1 − θ )K−k xk−i+1 − yk−i  2 k=0

=

K min(τ,k)  2 τ 2 + 3τ     (1 − θ )−i (i + 1)(1 − θ )K−(k−i) xk−i+1 − yk−i  2 k=0



i=1

τ K  2  τ 2 + 3τ  K−(k−i)  k−i+1 k−i  (i + 1) (1 − θ ) − y x  2(1 − θ )τ i=1

=

i=1

K min(τ,k)  2 τ 2 + 3τ   K−(k−i)  k−i+1 k−i  (i + 1)(1 − θ ) − y x  2(1 − θ )τ k=0

=

i=1

k=i

τ K−i  2  τ 2 + 3τ  K−k   k  +1 k  x (i + 1) (1 − θ ) − y   2(1 − θ )τ  i=1

k =0

6.1 Accelerated Asynchronous Algorithms



τ K  2  τ 2 + 3τ  K−k   k  +1 k  x (i + 1) (1 − θ ) − y   2(1 − θ )τ  k =0

i=1

a

=

221

K  2 (τ 2 + 3τ )2  K−k  k+1 k (1 − θ ) − y x  , 4(1 − θ )τ

(6.26)

k=0

 a where = is because τi=1 (i + 1) = 12 (τ 2 + 3τ ). By rearranging terms in (6.22), we have k+1

F (x





) − F (x ) +

μθ θ2 + 2γ 2

 2  k+1  − x∗  z

& 2 2 ' μθ  θ  k k ∗ ∗ + ≤ (1 − θ ) F (x ) − F (x ) + z − x  2γ 2

   2 1 k+1 1 γ L C1  k   −γ x − − − y  2 2 2 γ 2 2

   2 1 γ L j (k) k  w +γ + γL  − y  , γ 2C1 where we use θ =



−γ μ+

γ μ2 +4γ μ , 2



which satisfies

θ2 μθ + 2γ 2

(1 − θ ) =

θ2 . 2γ

Equivalently, θ is the root of x 2 + μγ x − μγ = 0 and so when γ μ ≤ 1. For the assumption on γ , we have 9γ Lτ 2 ≤ We then consider we have

(6.27)

1 (1−θ)τ



γ μ/2 ≤ θ ≤

5 γ L + γ L(τ 2 + 3τ )2 ≤ 1. 2

√ γμ

(6.28)

. Without loss of generality, we assume that τ ≥ 2. Then

a b 1 1 ≤ ≤  √ τ τ (1 − θ ) (1 − γ μ) 1− c



1 1 √ 3τ

1 (1 −

1 τ 3τ )

τ μ/L d



1 (1 −

1 1 3)

=

3 , 2

(6.29)

222

6 Accelerated Parallel Algorithms

a b c d √ μ where in ≤ we use θ ≤ γ μ, in ≤ we use (6.28), in ≤ we use L ≤ 1, and in ≤ 1 −x we use the fact that function g(x) = (1 − 3x ) is monotonically decreasing for x ∈ [1, ∞). Multiplying (6.27) with θ K−k and summing the result with k from 0 to K, we have

2 2 μθ  θ   K+1 + F (xK+1 ) − F (x∗ ) + − x∗  z 2γ 2 &

2 2 ' μθ  θ  0  + ≤ (1 − θ )K+1 F (x0 ) − F (x∗ ) + z − x∗  2γ 2

K  2 1 γ L C1    −γ − − (1 − θ )K−k xk+1 − yk  2 2 2 i=0



γ 2 L2 + γL 2C1

 K

&

(1 − θ )

k=0

K−k

   2 1 j (k) k   w − y  γ

2 ' μθ  θ2  0 ∗ F (x ) − F (x ) + + ≤ (1 − θ ) z − x  2γ 2 ' &

2 2 2 1 γ L C1 (τ + 3τ )2 γ L −γ + γL − − − 2 2 2 2C1 4(1 − θ )τ a

×



0

K+1



K  2    (1 − θ )K−k xk+1 − yk  i=0

2 ' μθ  θ2  0 ∗ F (x ) − F (x ) + + ≤ (1 − θ ) z − x  2γ 2 2 2 ' &

γ L 1 γ L C1 3(τ 2 + 3τ )2 − − − −γ + γL 2 2 2 2C1 8 b

&

K+1



0



K  2    × (1 − θ )K−k xk+1 − yk  , i=0 a

b

where in ≤ we use (6.26) and in ≤ we use (6.29). Setting C1 = γ L, by the assumption on the choice of γ , we have 1 γ L C1 − − − 2 2 2 So (6.7) is proven.



γ 2 L2 3(τ 2 + 3τ )2 ≥ 0. + γL 2C1 8 

6.1 Accelerated Asynchronous Algorithms

223

Algorithm 6.2 Asynchronous accelerated stochastic coordinate descent (AASCD) Input θk , step size γ , x0 = 0, and z0 = 0.  k−1 ) Define ak = θk (1−θ , b(l, k) = ki=l ai . θk−1 for k = 0 to K do  1 wj (k) = yj (k) + ki=j (k)+1 b(j (k) + 1, i)(yj (k) − xj (k) ), 2 Randomly choose an index ik from [n],   θk 2 3 δk = argminδ hik (zkik + δ) + ∇ik f (wj (k) ) · δ + 2γ δ , 4 zk+1 = zkik + δk with other coordinates unchanged, ik 5 yk = (1 − θk )xk + θk zk , 6 xk+1 = (1 − θk )xk + nθk zk+1 − (n − 1)θk zk . end for k Output xK+1 .

6.1.2 Asynchronous Accelerated Stochastic Coordinate Descent To meet the demand of large-scale machine learning, most asynchronous algorithms are designed in a stochastic fashion. Momentum compensation can further be applied to accelerate modern state-of-the-art stochastic asynchronous algorithms, such as asynchronous SCD [8] and ASVRG [13]. We illustrate Asynchronous Accelerated Stochastic Coordinate Descent (AASCD) as an example. The readers can refer to [5] for fusing momentum compensation with variance reduction. Stochastic coordinate descent algorithms have been described in Sect. 5.1.1. We solve the following problem: min f (x) + h(x),

x∈Rn

where f (x) has Lc -coordinate Lipschitz continuous gradient (see (A.4)), h(x) has  coordinate separable structure, i.e., h(x) = ni=1 hi (xi ) with x = (xT1 , · · · , xTn )T , and f (x) and hi (xi ) are convex. We illustrate AASCD. Like AAGD, we compute the distance between the delayed extrapolation points and the latest ones, and introduce a new extrapolation term to compensate the “lost” momentum. The algorithm is shown in Algorithm 6.2. Theorem 6.2 Assume that f (x) and hi (x √i ) are convex, f (x) has Lc -coordinate Lipschitz continuous gradient, and τ ≤ n. For Algorithm 6.2, if the step size γ   6 2 72 1 2 satisfies 2γ Lc + 2 + n γ Lc (τ + τ )/n + 2τ ≤ 1 and θk = 2n+k , we have n2 F (x0 ) − F (x∗ ) n2 0 EF (xK+1 ) − F (x∗ ) K+1 ∗ 2 Ez z − x∗ 2 . + − x  ≤ + 2 2γ 2γ θK2 θ−1

224

6 Accelerated Parallel Algorithms

When h(x) is μ-strongly convex with μ ≤ Lc , setting the step √ size γ to satisfy that 6 2 72 −γ μ+ γ 2 μ2 +4γ μ 3 3 2γ Lc + ( 4 + 8n )γ Lc (τ + τ )/n + 2τ ≤ 1 and θk = , denoted 2n as θ instead, we have EF (xK+1 ) − F (x∗ ) +

n2 θ 2 + nθ μγ EzK+1 − x∗ 2 2γ



n2 θ 2 + nθ μγ 0 x − x∗ 2 . ≤ (1 − θ )K+1 F (x0 ) − F (x∗ ) + 2γ Proof From Steps 5 and 6 of Algorithm 6.2, we have θk zk = yk − (1 − θk )xk ,

(6.30)

nθk zk+1 = xk+1 − (1 − θk )xk + (n − 1)θk zk ,

(6.31)

xk+1 = yk + nθk (zk+1 − zk ).

(6.32)

and

Multiplying (6.30) with (n − 1) and adding with (6.31), we have nθk zk+1 = xk+1 − (1 − θk )xk + (n − 1)yk − (n − 1)(1 − θk )xk .

(6.33)

Eliminating zk using (6.33) and (6.30), for k ≥ 1, we have 1 k [y − (1 − θk )xk ] θk =

1 [xk − (1 − θk−1 )xk−1 + (n − 1)yk−1 − (n − 1)(1 − θk−1 )xk−1 ]. nθk−1 (6.34)

Computing out yk through (6.34), we have yk = xk − θk xk +

θk k θk (1 − θk−1 ) k−1 x − x nθk−1 nθk−1

(n − 1)θk (1 − θk−1 ) k−1 (n − 1)θk k−1 x + y nθk−1 nθk−1

1 θk θk (1 − θk−1 ) k−1 k − θk−1 (xk − yk−1 ) + =x + (y − xk−1 ). θk−1 n θk−1 −

6.1 Accelerated Asynchronous Algorithms

We set ak =

θk (1−θk−1 ) θk−1

225

and b(l, k) =

l > k. Then by setting ck =

θk

1 θk−1 ( n

k

i=l ai

when l ≤ k and b(l, k) = 0 when

− θk−1 ), we have

yk = xk + ck (xk − yk−1 ) + ak (yk−1 − xk−1 ) = yk−1 + (1 + ck )(xk − yk−1 ) + ak (yk−1 − xk−1 ) = yk−1 + (1 + ck )(xk − yk−1 ) + ak ck−1 (xk−1 − yk−2 ) + ak ak−1 (yk−2 − xk−2 ) =y

k−1

+ (1 + ck )(x − y k

k−1

k−1 

)+

b(i + 1, k)ci (xi − yi−1 )

i=j (k)+1

+b(j (k) + 1, k)(y

j (k)

−x

j (k)

k ≥ j (k) + 1 ≥ 1.

),

(6.35)

Summing (6.35) with k = j (k) + 1 to k, we have k 

yk = yj (k) +

(1 + ci )(xi − yi−1 )

i=j (k)+1 k 

+

l−1 

ci b(i + 1, l)(xi − yi−1 )

l=j (k)+1 i=j (k)+1



k 

+⎝



b(j (k) + 1, i)⎠ (yj (k) − xj (k) )

i=j (k)+1

=y

j (k)

k 

+

(1 + ci )(xi − yi−1 )

i=j (k)+1



k−1 

+

+⎝

 ci b(i + 1, l) (xi − yi−1 )

l=i+1

i=j (k)+1



k 

k 



b(j (k) + 1, i)⎠ (yj (k) − xj (k) )

i=j (k)+1 k 

= yj (k) +

(1 + ci )(xi − yi−1 )

i=j (k)+1 k 

+

 ci

i=j (k)+1



+⎝

k  i=j (k)+1

k 

l=i+1

 b(i + 1, l) (xi − yi−1 ) ⎞

b(j (k) + 1, i)⎠ (yj (k) − xj (k) ),

(6.36)

226

6 Accelerated Parallel Algorithms

where in the last equality, we use b(k + 1, l) = 0 for all l ≤ k. Then from the optimality of zk+1 in Steps 3 and 4, we have ik k j (k) ) + γ ξ kik = 0, nθk (zk+1 ik − zik ) + γ ∇ik f (w

(6.37)

where ξ kik ∈ ∂hik (zk+1 ik ). From (6.32), k j (k) ) + γ ξ kik = 0. xk+1 ik − yik + γ ∇ik f (w

(6.38)

Since f has coordinate Lipschitz continuous gradients and xk+1 and yk only differ in the ik -th entry, we have 2 Lc  k+1 xik − ykik 2 2 Lc  k+1 a xik − ykik = f (yk ) − γ ∇ik f (yk ), ∇ik f (wj (k) ) + ξ kik  + 2

k f (xk+1 ) ≤ f (yk ) + ∇ik f (yk ), xk+1 ik − yik  +

b

= f (yk ) − γ ∇ik f (wj (k) ) + ξ kik , ∇ik f (wj (k) ) + ξ kik  2 Lc  k+1 xik − ykik + γ ξ kik , ∇ik f (wj (k) ) + ξ kik  + 2 +γ ∇ik f (wj (k) ) − ∇ik f (yk ), ∇ik f (wj (k) ) + ξ kik  2

 k+1 xik − ykik γ Lc k k − ξ kik , xk+1 = f (y ) − γ 1 − ik − yik  2 γ k −∇ik f (wj (k) ) − ∇ik f (yk ), xk+1 ik − yik  2

 k+1 xik − ykik c γ Lc k k ≤ f (y ) − γ 1 − − ξ kik , xk+1 ik − yik  2 γ

2 γ C γ L2  j (k) 2 + c wik − ykik + 2C2 2 a



k xk+1 ik − yik

γ

2 ,

(6.39)

b

where in ≤ we use (6.38), in ≤ we insert −γ ∇ik f (wj (k) ) + ξ kik , ∇ik f (wj (k) ) + c

ξ kik , and in ≤ we use the Cauchy–Schwartz inequality and the coordinate Lipschitz continuity of ∇f . C2 > 0 will be chosen later.

6.1 Accelerated Asynchronous Algorithms

227

We consider zk − z∗ 2 and have 2 2 n2  n2   k+1  k   − θk x∗  = θk z θk z − θk x∗ + θk zk+1 − θk zk  2γ 2γ 2 n2  2 n2    k k θk zk+1 = − θ z θk z − θk x∗  + k ik ik 2γ 2γ   n2   k+1 θk zik − zkik , θk zkik − θk x∗ik + γ  2 2 2 1  k+1 a n   xik − ykik = θk zk − θk x∗  + 2γ 2γ   (6.40) −n ∇ik f (wj (k) ) + ξ kik , θk zkik − θk x∗ik , a

where = uses (6.32) and (6.37). So taking expectation on (6.40), we have  2 n2   Eik θk zk+1 − θk x∗  2γ n 2 2 n2  1   k+1  k  xik − ykik = θk z − θk x∗  + 2γ 2γ n ik =1

−∇f (wj (k) ), θk zk − θk x∗  −

n  ik =1

ξ kik , θk zkik − θk x∗ik .

(6.41)

By the same technique of (6.17), for the last but one term of (6.41), we have −∇f (wj (k) ), θk zk − θk x∗  = −∇f (wj (k) ), yk − (1 − θk )xk − θk x∗  = −∇f (wj (k) ), wj (k) − (1 − θk )xk − θk x∗  − ∇f (wj (k) ), yk − wj (k)  ≤ (1 − θk )f (xk ) + θk f (x∗ ) − f (wj (k) ) − ∇f (wj (k) ), yk − wj (k)    ≤ (1 − θk )f (xk ) + θk f (x∗ ) − f (yk ) + ∇f (yk ) − ∇f (wj (k) ), yk − wj (k) . (6.42)

228

6 Accelerated Parallel Algorithms

Substituting (6.42) into (6.41), we have  2 n2   Eik θk zk+1 − θk x∗  2γ n n 2 2  n2  1   k+1  k  xik − ykik − ≤ ξ kik , θk zkik − θk x∗ik . θk z − θk x∗  + 2γ 2γ n ik =1

ik =1

+(1 − θk )f (xk ) + θk f (x∗ ) − f (yk ) + ∇f (yk ) − ∇f (wj (k) ), yk − wj (k)  n n 2 2  1   k+1 n2   k  xik − ykik − ξ kik , θk zkik − θk x∗ik  ≤ θk z − θk x∗  + 2γ 2γ n ik =1

ik =1

 2   +(1 − θk )f (xk ) + θk f (x∗ ) − f (yk ) + Lc yk − wj (k)  .

(6.43)

Taking expectation on the random index ik for (6.39), we have Eik f (x

k+1

 k+1 

n k 2 C2 1  xik − yik γ Lc − ) ≤ f (y ) − γ 1 − 2 2 n γ k

ik =1



n 2 1  k k+1 γ L2c   j (k)  ξ ik , xik − ykik  + − yk  . w n 2nC2

(6.44)

ik =1

Adding (6.44) and (6.43) we have Eik f (xk+1 ) γ ≤ (1 − θk )f (x ) + θk f (x ) − n k





1 γ Lc C2 − − 2 2 2



 n ik =1

 2 γ L2c   + + Lc wj (k) − yk  2nC2 n    1  k+1 k k ∗ k x ξ ik , θk zik − θk xik + − − yik n ik

k xk+1 ik − yik

2

γ

ik =1

+

2 n2  2 n2   k    Eik θk zk+1 − θk x∗  θk z − θk x∗  − 2γ 2γ

γ = (1 − θk )f (x ) + θk f (x ) − n a

k





1 γ Lc C2 − − 2 2 2



 n ik =1

k xk+1 ik − yik

γ

2

6.1 Accelerated Asynchronous Algorithms

+

229

 n 2  γ L2c   ∗ + Lc wj (k) − yk  − ξ kik , θk zk+1 ik − θk xik  2nC2 ik =1

+

2 n2  2 n2   k    Eik θk zk+1 − θk x∗  , θk z − θk x∗  − 2γ 2γ

(6.45)

a

where in = we use (6.32). For hik is μ-strongly convex, we have k+1 ∗ θk ξ kik , x∗ik − zk+1 ik  ≤ θk hik (x ) − θk hik (zik ) −

2 μθk  k+1 zik − x∗ik . (6.46) 2

Analyzing the expectation, we have ⎡ ⎤ n  2  2   2  1  k+1  ⎣ zk+1 − x∗i Eik z zk+1 − x∗  = + − x∗j ⎦ ik j k n ik =1

j =ik

ik =1

j =ik

⎡ ⎤ n       2 2 a 1 ⎣ zk+1 − x∗i zkj − x∗j ⎦ = + ik k n =

1 n

n  ik =1

2  2 n − 1   k ∗ ∗ z zk+1 − x + − x   , ik ik n

(6.47)

a

where = uses the fact that zk+1 and zk only differ at the ik -th entry. Similar to (6.47), we can find that ⎡ ⎤ n  2  2   2  1   ⎣ xk+1 − yki xk+1 + − ykj ⎦ Eik xk+1 − yk  = ik j k n j =ik

ik =1

a

=

n 2 1   k+1 xik − ykik , n

(6.48)

ik =1

a

where = uses (6.32) and that xk+1 and yk only differ at the ik -th entry. Then plugging (6.46) into (6.45), we have Eik f (x

k+1

) + θk

n  ik =1

hik (zk+1 ik )

γ ≤ (1 − θk )f (x ) + θk F (x ) − n k





1 γ L C2 − − 2 2 2



 n ik =1

k xk+1 ik − yik

γ

2

230

6 Accelerated Parallel Algorithms

+

 n 2  2 μθk  k+1 γ L2c   zik − x∗ik + Lc wj (k) − yk  − 2nC2 2 ik =1

2 n2  2 n2   k    Eik θk zk+1 − θk x∗  θk z − θk x∗  − 2γ 2γ

 2 1 1 γ L C2 a   = (1 − θk )f (xk ) + θk F (x∗ ) − − − Eik xk+1 − yk  γ 2 2 2

 2 n2 θ 2 + (n − 1)θ μγ  2 γ L2c k   k   k + + Lc wj (k) − yk  + z − x∗  2nC2 2γ  2 n2 θk2 + nθk μγ   Eik zk+1 − x∗  , − (6.49) 2γ +

a

where in = we use (6.47) and (6.48). k+1 − (n − 1)θ zk = On the other hand, because xk+1 = (1 − θk )xk + nθ kz k  k k k k+1 k ˆ (1 − θk )x + θk z + nθk (z − z ), we can define hk = i=0 ek,i h(zi ) and obtain the same result of Lemma 5.1 in ASCD (Sect. 5.1.1). Thus Eik hˆ k+1 = (1 − θk )hˆ k + θk

n  ik =1

hik (zk+1 ik ).

(6.50)

From (6.36), using the same technique of (6.23) we have  2  j (k)  − yk  w  2      k   k  i i−1   = b(i + 1, l) (x − y ) 1 + ci + ci i=j (k)+1  l=i+1 ⎡ ⎤   k k   a ≤⎣ b(i + 1, l) ⎦ 1 + ci + ci l=i+1

i=j (k)+1

× ⎡ b

≤⎣



k 

1 + ci + ci 

k 

1 + ci

k−i+1  l=1

i=j (k)+1

c

k−j (k)

≤⎣

ii=1



2    b(i + 1, l) xi − yi−1 

l=i+1

i=j (k)+1



k 

⎤ 1 ⎦

⎤

k 

 1 + ci

k−i+1 



2    1 xi − yi−1 

l=1

i=j (k)+1

  k−j (k) ii ii 2  1 ⎦  1   1 1 xk−ii+1 − yk−ii  1+ 1+ n n



l=1

ii=1

l=1

6.1 Accelerated Asynchronous Algorithms



min(τ,k) 

d

≤⎣

ii=1



231

⎤ τ   ii ii  2 1  ⎦ 1   1 1 xk−ii+1 − yk−ii  1+ 1+ n n



l=1

τ2 + τ +τ 2n

ii=1

min(τ,k)  i=1

l=1

 2 i   + 1 xk−i+1 − yk−i  , n

a

b

(6.51)

c

where ≤ uses the convexity of  · 2 , in ≤ we use b(i, l) ≤ 1, in ≤ we change d

variable ii = k − i + 1 and use ck ≤ n1 , and in ≤ we use k − j (k) ≤ τ . As in Theorem 6.1, we are more interested in the limiting case, when k is large (e.g., k ≥ 2(τ − 1)) we assume that at the first τ steps, we run our algorithm in serial. Dividing θk2 on both sides of (6.51) and summing the results with k = 0 to K, we have K 2  1   j (k) k − y w  θ2 k=0 k a



 K min(τ,k) 2  4( i + 1)  τ2 + τ  k−i+1 n k−i  +τ − y x  2 2n θk−i



 K τ 2  4( ni + 1)  τ2 + τ  k−i+1 k−i  x +τ − y   2 2n θk−i i=1 k=i+τ





' K−i τ & 2  1  i τ2 + τ  k  +1 k  4 +τ +1 − y x  2n n θ2 i=1 k  =τ k 





'  τ & K 2 τ2 + τ i 1   k+1 k − y 4 +τ +1 x  2n n θk2



k=0



=



i=1

i=1

= a

τ2 + τ + 2τ n

where in ≤ we use

1 θk2



4 2 , θk−i

2  K k=0

k=0

2 1   k+1 k − y x  , θk2

since θk =

2 2n+k

(6.52)

and 2i ≤ 2τ ≤ 2n + k.

We first consider the generally convex case. Clearly, we can check that the setting of θk satisfies the assumptions of Lemma 5.1 in Sect. 5.1.1. Now we consider (6.49).

232

6 Accelerated Parallel Algorithms

 ˆ ˆ By (6.50), replacing θk nik =1 hik (zk+1 ik ) with Eik hk+1 − (1 − θk )hk , dividing both 2 sides of (6.49) by θk , and using μ = 0, we have  2 Eik f (xk+1 ) + Eik hˆ k+1 − F (x∗ ) n2  k+1 ∗ + − x E z  i 2γ k θk2

 2  1 γ L C2 1 − θk    k ˆ k − F (x∗ ) − 1 − − Eik xk+1 − yk  f (x ≤ ) + h 2 2 2 2 θk γ θk 2

 2 n2  2 γ L2c 1  k    + 2 + Lc wj (k) − yk  + z − x∗  2γ θk 2nC2

 2  a γ L C2 1  1  k+1 k 1 − − E f (xk ) + hˆ k − F (x∗ ) − ≤ 2 − y x  i k 2 2 θk−1 γ θk2

 2 n2  2 γ L2c 1   k   + 2 + Lc wj (k) − yk  + (6.53) z − x∗  , 2nC 2γ θk 2 a

where in ≤ we use

1−θk θk2



1 2 . θk−1

Taking expectation on the first k iterations for (6.53) and summing it with k from 0 to K, we have 2 n2  Ef (xK+1 ) + Ehˆ K+1 − F (x∗ )  K+1 ∗ z E + − x   2γ θK2

K  2 1 γ L C2  1 f (x0 ) + hˆ 0 − F (x∗ )  k+1 k x − − ≤ − E − y   2 2 2 2 θ−1 γ θk2 k=0

 K 2 n2  2 1  γ L2c  0  j (k) k ∗ + + Lc E − y + − x z   w 2nC2 2γ θ2 k=0 k

a



2 f (x0 ) + hˆ 0 − F (x∗ ) n2   0 ∗ + − x z  2 2γ θ−1 

2

2  1 γ Lc τ +τ γ L2c C2 − − −γ + 2τ − + Lc 2 2 2 2nC2 n ×

a

2  k+1 K  γ  − yk   , x E  2  γ θ k k=0

where ≤ uses (6.52).

6.1 Accelerated Asynchronous Algorithms

233

Setting C2 = γ Lc , by the assumption 2

2

τ +τ 1 + 2τ γ Lc ≤ 1, 2γ L + 2 + n n we have 1 γ Lc C2 − − −γ 2 2 2



γ L2c + Lc 2nC2



τ2 + τ + 2τ n

2 ≥ 0.

So 2 EF (xK+1 ) − F (x∗ ) n2   K+1 ∗ z E + − x   2γ θK2 ≤

2 Ef (xK+1 ) + Ehˆ K+1 − F (x∗ ) n2   K+1 ∗ z E + − x   2γ θK2



2 f (x0 ) + hˆ 0 − F (x∗ ) n2   0 ∗ + − x z  2 γ2 θ−1

a

b

=

2 F (x0 ) − F (x∗ ) n2   0 ∗ + − x z  , 2 2γ θ−1

a

where in ≤ we use h(xK+1 ) = h



K+1 i i=0 ek+1,i z





K+1 i=0

ek+1,i h(zi ) = hˆ K+1

b

and in ≤ we use hˆ 0 = h(x0 ). Now we consider the strongly convex case. In the following, again we set θk = θ . Multiplying (6.51) with (1 − θ )K−k and summing the results with k from 0 to K, we have K 

 2   (1 − θ )K−k wj (k) − yk 

k=0





K min(τ,k)  2  i τ2 + τ   +τ + 1 (1 − θ )K−k xk−i+1 − yk−i  2n n k=0

=

τ2 + τ +τ 2n

i=1

 K min(τ,k)  k=0

i=1

(1 − θ )

−i

 2   ×(1 − θ )K−(k−i) xk−i+1 − yk−i 



i +1 n

234

6 Accelerated Parallel Algorithms

1 ≤ (1 − θ )τ





K min(τ,k)  i τ2 + τ +τ +1 2n n k=0

i=1

 2   ×(1 − θ )K−(k−i) xk−i+1 − yk−i  =

1 (1 − θ )τ ×





τ τ2 + τ i +τ +1 2n n i=1

K  2    (1 − θ )K−(k−i) xk−i+1 − yk−i  k=i

1 = (1 − θ )τ ≤

1 (1 − θ )τ





K−i τ    τ2 + τ i     2 +τ +1 (1 − θ )K−k xk +1 − yk  2n n  k =0

i=1



τ2 + τ +τ 2n

 τ i=1

 K  2 i   +1 (1 − θ )K−k xk+1 − yk  n k=0

72 K 6 2  2 (τ + τ )/n + 2τ    = (1 − θ )K−k xk+1 − yk  . τ 4(1 − θ )

(6.54)

k=0

Since we have set θ =



−γ μ+

γ 2 μ2 +4γ μ , 2n

which satisfies

2 2

θ n + nμθ γ θ 2 n2 + (n − 1)μθ γ = (1 − θ ) , 2γ 2γ by rearranging terms in (6.49) and using (6.50) again, we have  2 n2 θ 2 + nθ μγ   Eik f (xk+1 ) + Eik hˆ k+1 − F (x∗ ) + Eik zk+1 − x∗  2γ 2

n2 θ 2 + nθ μγ   k k ∗ ∗ ˆ ≤ (1 − θ ) f (x ) + hk − F (x ) + z − x  2γ



 2 γ L2 2 1 1 γ L C2   k+1  c k − − Eik x − −y  + + Lc wj (k) − yk  . γ 2 2 2 2nC2 (6.55) From the assumption, we have μ/Lc ≤ 1. By the setting of γ , we have γ μ ≤ 1. So √ nθ ≤ γ μ ≤ 1. Thus the setting of θ also satisfies the assumptions of Lemma 5.1 in Sect. 5.1.1. On the other hand, from the assumption on γ we have 3γ Lc τ 2 ≤ 2γ Lc +



2 3 3 + γ Lc (τ 2 + τ )/n + 2τ ≤ 1. 4 8n

(6.56)

6.1 Accelerated Asynchronous Algorithms

Now we consider have

1 (1−θ)τ

235

. Without loss of generality, we assume n ≥ 2. Then we

a b 1 1 1

τ ≤ ≤ √ √ τ τ (1 − θ ) (1 − γ μ/n) 1 − τ1 μ/(3Lc )/n

1

c

≤

a

where in ≤ we use θ ≤

1−

√1 2 3τ

τ ≤

1 1−

1 √



2 3

3 , 2

 b c √ γ μ/n, in ≤ we use (6.56), and ≤ uses Lμc /n ≤

1 n

≤ 12 .

Taking expectation on (6.55), multiplying (6.55) with θ K−k , and then summing the result with k from 0 to K, we have 2 n2 θ 2 + nθ μγ    E zK+1 − x∗  2γ 2

n2 θ 2 + nθ μγ   0  ≤ (1 − θ )K+1 f (x0 ) + hˆ 0 − F (x∗ ) + z − x∗  2γ

K 2  1 1 γ L C2    − − − (1 − θ )K−k E xk+1 − yk  γ 2 2 2

Ef (xK+1 ) + Ehˆ K+1 − F (x∗ ) +

k=0

 K 2  γ L2c   + + Lc (1 − θ )K−k E wj (k) − yk  2nC2

k=0

2

a n2 θ 2 + nθ μγ   0 K+1 0 ∗ ∗ ˆ f (x ) + h0 − F (x ) + ≤ (1 − θ ) z − x  2γ ( 72 )

6 2 2 2 (τ + τ )/n + 2τ 1 1 γ Lc γ Lc C2 − − − − + γ Lc γ 2 2 2 2nC2 4(1 − θ )τ ×

K 2     (1 − θ )K−k E xk+1 − yk  , k=0

a

where ≤ uses (6.54). Setting C2 = γ Lc , since 72

6 2 (τ + τ )/n + 2τ γ Lc + 2γ Lc 2γ Lc + n 4(1 − θ )τ 72

6 2 3 (τ + τ )/n + 2τ γ Lc + 2γ Lc , ≤ 2γ Lc + n 8

236

6 Accelerated Parallel Algorithms

by the assumption of γ we have 1 γ Lc C2 − − − 2 2 2



γ 2 L2c + γ Lc 2nC2

72

6 2 (τ + τ )/n + 2τ ≥ 0. 4(1 − θ )τ

Then using h(xK+1 ) ≤ hˆ K+1 and hˆ 0 = h(x0 ), we obtain: 2 n2 θ 2 + nθ μγ    E zK+1 − x∗  2γ 2

n2 θ 2 + nθ μγ   0 K+1 0 ∗ ∗ ≤ (1 − θ ) F (x ) − F (x ) + z − x  . 2γ

EF (xK+1 ) − F (x∗ ) +



6.2 Accelerated Distributed Algorithms We describe accelerated distributed algorithms in this section. Distributed algorithms allow machines to handle their local data and communicate with each other to solve the coupled problems. The main challenges of distributed algorithms are: 1. Reducing communication costs. In practice, the computation time of local computation is usually significantly smaller than the communication cost. So balancing the local computation costs and communication ones becomes very important in designing an efficient distributed algorithms. 2. Improving the speed-up factor. Speed-up is the factor that the running time using a single machine divided by the running time using m machines. We say that a distributed algorithm achieves a linear speed-up if using m machines to solve can be αm times faster than using a single machine, where α is an absolute constant. Typical organizations for communication between nodes are centralized topology where there is a central machine and all others communicate with it, and decentralized one where all the machines can only communicate with their neighbors.

6.2.1 Centralized Topology 6.2.1.1

Large Mini-Batch Algorithms

A straightforward way to implement an algorithm on a distributed system is by transporting the algorithm into a large mini-batch setting. All the machines compute the (stochastic) gradient synchronously and then send back the gradient to the central node. In this way, the algorithm is essentially identical to a serial algorithm. For infinite-sum√problems, we have shown in Sect. 5.5 that the momentum technique ensures an ( κ) times larger mini-batch size. For finite-sum problems, the

6.2 Accelerated Distributed Algorithms

237

best result is obtained by Katyusha with a large mini-batch size version [2] (c.f. Algorithm 5.3 in Sect. 5.1.3). We show the convergence results as follows. Theorem 6.3 Suppose that h(x) is μ-strongly convex with μ ≤ 3L 8n . For Algo μγ 1 , θ1 = 2nμ rithm 5.3, if the step size γ = 3L 3L , θ3 = 1 + θ1 , the mini-batch √ size b satisfies b ≤ n, and m = n/b, then we have out

F (x

 −Sm    F (x00 ) − F (x∗ ) , ) − F (x ) ≤ O 1 + μ/(Ln) ∗

√ m˜xS +(1−2θ1 )zSm where xout = . In other words, by setting b = n, the total m+1−2θ1 computation costs √ and total communication costs for Algorithm 5.3 to achieve an √ ˜ nκ) and O( ˜ κ), respectively. -accuracy are O( The proof of Theorem 6.3 is similar to that of Theorem 5.3 but is a little more involved by taking the mini-batch size b into account. Readers who are interested in it can refer to [2] for the details.

6.2.1.2

Dual Communication-Efficient Methods

The above mini-batch algorithms are mainly implemented on shared memory systems where the cores can access all the data. However, in some distributed systems the data are stored separately. To handle this case, we can formulate the model into a constrained problem. We consider the following objective function: 1  λ lij (w) + h(w) + w2 , mn 2 m

L(w) =

n

w ∈ Rd ,

i=1 j =1

where m is the number of machines in the network and n is the number of individual functions on each machine. We assume that h(w) is μ-strongly convex and each lij is convex and L-smooth. We also denote κ = L μ and assume κ ≥ n. For centralized algorithms, we can introduce auxiliary variables as follows:

m n 1  λ 2 lij (wij ) + ui  + h(u0 ), min w,u mn 2 i=1 j =1

s.t. wij = ui , ui = u0 ,

i ∈ [m], j ∈ [n], i ∈ [m].

(6.57)

238

6 Accelerated Parallel Algorithms

1 Let mn aij ∈ Rd with i ∈ [m] and j ∈ [n] be the Lagrange multipliers for wij = ui , 1 bi ∈ Rd be the Lagrange multipliers for ui = u0 . Then we can obtain the and mn dual problem of (6.57) as



m n  λ  1  2 ⎣ lij (wij ) + aij , wij − ui + ui  max D(a, b) = max min mn 2 a,b a,b w,u i=1 j =1

 m 1  bi , u0 − ui  + h(u0 ) + nm i=1 ⎛ ⎡ 2 ⎞    n m n    1 1  1  ⎜ 1 ⎟ ⎢ ∗  = max − ⎣ lij (−aij ) + aij + bi  ⎝  ⎠ mn 2mλ  n n a,b   j =1 i=1 j =1  +h



1  bi − nm m

 ,

(6.58)

i=1

where lij∗ (·) and h∗ (·) are the conjugate functions of lij (·) and h(·), respectively. If h(x) ≡ 0 we have h∗ (x) =



0, x = 0, +∞, otherwise.

We denote ai = [ai1 ; · · · ; ain ] ∈ Rnd , a = [a1 ; · · · ; am ] ∈ Rmnd , and b = [b1 ; · · · ; bm ] ∈ Rmd . To solve (6.58), when bi is fixed, each ai can be solved independently, so each ai can be stored separately on a local machine. We refer a as the local variable. When each ai is fixed, bi reflects the “inconsistent degree” from a local solution to the global one, and solving bi needs communication. So we refer b as the consensus variable. We can solve a and b alternately. The iteration can be written as a˙ k+1 = max D(˙ai , bk ), a˙i is a subset of ai , i ∈ [m], i a˙ i

bk+1 = max D(ak+1 , b). b

The above method has been considered in [9, 18]. By designing a subproblem for ai , each machine can solve the local problem by efficient stochastic algorithms. However, the above method is still suboptimal. Suppose that the sample size √ of a˙ k+1 is n0 , then we can only obtain the total computation costs of O( nκn0 ) i √ and communication costs of O( κn/n0 ). To obtain a communication-efficient algorithm, we need choose n0 = n. However, the computation costs become

6.2 Accelerated Distributed Algorithms

239

√ O(n κ), whose dependence on n is not square rooted. To obtain a faster rate, we consider the following shifted form of (6.58). Setting F (a, b) = −D(a, b), we have min F (a, b) ≡ g0 (b) + a,b

n m  

gij (aij ) + f (a, b),

(6.59)

i=1 j =1

where

 g0 (b) = h

1  bi − nm



m



i=1

+

m  μ1 bi 2 , 6mn2 i=1

 1 ∗ μ1 lij (−aij ) − gij (aij ) = aij 2 , mn 4 m  f (a, b) = fi (ai , b),

i ∈ [m],

j ∈ [n],

i=1

in which 2    n n    μ1 1 1 1 1  2 fi (ai , b) = aij + bi  aij  + mn 4 2mλ  n n    j =1 j =1 −

μ1 bi 2 , 6mn2

i ∈ [m].

¯ ∈ Rmnd as the partial gradient of f (¯a, b) ¯ w.r.t. a. Similarly, we We define ∇a f (¯a, b) nd d ¯ ¯ ¯ ¯ ∈ denote ∇ai f (¯a, b) ∈ R , ∇bi f (¯a, b) ∈ R , ∇b f (¯a, b) ∈ Rmd , and ∇aij f (¯a, b) d ¯ w.r.t. ai , bi , b and R with i ∈ [m] and j ∈ [n] as the partial gradient of f (¯a, b) aij , respectively. We have the following lemma. 1 1 Lemma 6.1 Suppose n ≤ Lλ and set μ1 = L1 and Lc = 2mn 2 λ + 4mnL . Then f (a, b) is convex and has Lc -block coordinate Lipschitz continuous gradients with respect to (aij , b), i.e., for all i ∈ [m] and j ∈ [n], for any b ∈ Rmd , and for any a¯ ∈ Rmnd and a˜ ∈ Rmnd which differ only on the (i, j )-th component, i.e., a¯ k,l = a˜ k,l when k = i or l = j , we have

∇aij f (˜a, b) − ∇aij f (¯a, b) ≤ Lc ˜aij − a¯ ij ,

(6.60)

and for any a ∈ Rmnd , b˜ ∈ Rmd , and b¯ ∈ Rmd , we have ˜ − ∇b f (a, b) ¯ ≤ Lc b˜ − b. ¯ ∇b f (a, b) Moreover, g0 (b) is strongly convex.

1 -strongly 6mn2 L

convex and gij (aij ) (j ∈ [n], i ∈ [m]) are

(6.61) 3 4nmL -

240

6 Accelerated Parallel Algorithms

μ1 1 Proof By checking, we can find that (6.60) is right. Because 6mn 2 = 6mn2 L ≤ μ1 1 4mnL , (6.61) is right. Also, it is easy to prove that g0 (b) is 6mn2 -strongly convex 3μ1 and gij (aij ) with i ∈ [m] and j ∈ [n] are 4nm -strongly convex. We now prove that f (a, b) is convex. Consider the following inequality:



1 (1 + η)a 2 + 1 + b2 ≥ (a + b)2 . η Dividing 1 + η on both sides, we have 1 (a + b)2 − b2 . 1+η η

(6.62)

2L 2 = ≥ 2n ≥ 2, μ1 λ λ

(6.63)

a2 ≥ Then setting η= we have

2     n n   1 1  μ1 1 1 2  aij  + aij + bi   mn 2 2mλ  n n   j =1 j =1   n  1 2 1  μ1 1 1 2  aij  + bi  ≥ −   mn 2 2mλ(1 + η) n 2mλη a

j =1

2      1 n  aij   n  j =1 

  n n  1 2  1  μ1 1 1  2   − aij 2 aij  + b ≥ i   mn 2 2mλ(1 + η) n 2mnλη j =1

b

= c

≥ d

=

1 mn

n  j =1

1 2mn

j =1

  n  1 2  μ1 1 1  μ1  2   aij 2 aij  + bi  −  2 2mλ(1 + η) n 2mn 2

n  j =1

j =1

   1 2 μ1 1  bi  aij 2 +   2 2m( ηλ + ηλ) n 2

 2 n  1  μ1 μ1   1 bi  , aij 2 +  mn 4 6m n  j =1

a

b

c

d

where ≥ uses (6.62), and =, ≥, and ≥ all use (6.63). Then by summing i = 1 to m, we obtain that f (a, b) is convex as it is a nonnegative quadratic function.



6.2 Accelerated Distributed Algorithms

241

Algorithm 6.3 Distributed stochastic communication accelerated dual (DSCAD) b Input θk , pb , and L2 . Set a0 = aˆ 0 = 0, b0 = bˆ 0 = 0, and pa = 1−p n . 1 for k = 0 to K do 2 a˜ k = (1 − θk )ak + θk aˆ k , 3 b˜ k = (1 − θk )bk + θk bˆ k , 4 Uniformly sample a random number q from [0, 1], 5 If q ≤ pb , Communication  2

  k L2  ˆk 7 bˆ k+1 = argminb g0 (b) + ∇b f (˜ak , b˜ k ), b + θ2p − b b  . b

8 9

Else Local Update For each machine i with sample an index j (i), i ∈ [m], randomly  2

   θk L2  k, b k k ), a ˜ ˆ + 10 aˆ k+1 = argmin (a )+ ∇ f (˜ a − a g a ai,j (i) i i i,j (i) i,j (i) ai,j (i) i,j (i) i,j (i) i,j (i) . i,j (i) 2pa

11 End If 12 ak+1 = a˜ k + pθka (ˆak+1 − aˆ k ), 13 bk+1 = b˜ k + θk (bˆ k+1 − bˆ k ). pb

14 end for k Output aK+1 and bK+1 .

Lemma 6.1 suggests that f (a, b) has the same block Lipschitz constants and the strongly convex moduli of gij (aij ) are O(n) times larger than that of g0 (b). We can update a and b with a nonuniform probability, which is known as importance sampling in [3, 17]. During the updates, if a is chosen, we solve b with communication. Otherwise, for each machine i, we randomly choose a sample j (i) and update ai,j (i) . By integrating the momentum technique [7, 11], we obtain the framework called Distributed Stochastic Communication Accelerated Dual (DSCAD) and shown in Algorithm 6.3. The main improvement in the algorithm is the variants that the machines are scheduled to do local communication or communicate with each other under a certain probability. In this way, in each round of communication, each machine is not solving a local subproblem but instead concurrently solving the original problem √ (6.59). The algorithm ensures convergence √ with computation costs ˜ κ). of T1 = O( nκ) and communication costs of T2 = O( DSCAD can also be implemented in decentralized systems. So we leave the proof of the convergence result to the next section.

6.2.2 Decentralized Topology For decentralized topology, the machines are connected by an undirected network and cooperatively solve a joint problem. We represent the topology of the network as a graph g = {V , E}, where V and E are the node and the edge sets, respectively. Each node v ∈ V represents a machine in the network and eij = (i, j ) ∈ E indicates that nodes i and j are connected. Let the Laplacian matrix (Definition A.2) of the graph g be L. The early works for decentralized algorithms include typically decentralized gradient descent [16], which exhibits sublinear convergence rate even if each fi (x)

242

6 Accelerated Parallel Algorithms

is strongly convex and has Lipschitz continuous gradients. More recently, a number of methods of linear convergence rate have been designed, such as EXTRA [15] and augmented Lagrangians [6], and a more recent work [14] shows that a little variants of decentralized gradient descent can achieve a linear convergence√rate and ˜ κ), and can further obtain the optimal convergence rate with iteration costs of O( √ ˜ O( κκg ) by fusing with the momentum technique, where κ is the condition number of the local function and κg is the eigengap of the gossip matrix (matrix A in (6.64)) used for communication. We consider the following optimization problem: min L2 (w) ≡ w

m n 1  λ lij (wi ) + h(w) , mn 2

(6.64)

i=1 j =1

s.t. Aw = 0, H md 1/2 where w = [wH Id ∈ Rdm×dm is the 1 ; · · · ; wm ] ∈ R , A = (L/L) gossip matrix, indicates the Kronecker product, Id is the identity matrix with size d, wi ∈ Rd , and λ ≥ 0. A is a symmetric matrix and the choice of A ensures that computing A2 w needs one time of communication [14, 15]. For simplicity, we assume that h(w) = w2 . Then by introducing auxiliary variables to split the loss and the regularization terms, we have min w,u

m n 1  λ lij (wij ) + ui 2 mn 2 i=1 j =1

s.t. wij = ui ,

i ∈ [m],

j ∈ [n],

Au = 0, 1 where u = [u1 ; · · · ; um ]. By introducing the dual variables mn aij with i ∈ [m] and 1 j ∈ [n] for wij = ui , and mn b for Au = 0, we can derive the dual problem as



n m      λ  1 1 max min ⎣ lij (wij ) + aij , wij − ui + ui 2 + AT b, u ⎦ mn 2 nm a,b w,u ⎡

i=1 j =1



m n  λ  1  2 ⎣ = max min lij (wij ) + aij , wij − ui + ui  mn 2 a,b w,u i=1 j =1

 m  1  T Ai b, ui + nm i=1

2 ⎞    n n m     1 1 1 T  ⎜ 1  ⎟ lij∗ (−aij ) + a + b A = max − ⎝ ij i  ⎠, mn 2mλ  n n a,b   j =1 i=1 j =1 ⎛

(6.65)

6.2 Accelerated Distributed Algorithms

243

where we use A being symmetric and ATi denotes the i-th row-block of A. We can transform (6.65) into a shifted form that is similar to (6.59) by setting g0 (b) =

m  μ1 μ2 i=1

6mn2

bi 2 ,

 1 ∗ μ1 lij (−aij ) − aij 2 , i ∈ [m], j ∈ [n], mn 4 2    n n    1 1  μ1 1 1 2 T   fi (ai , b) = aij  + aij + Ai b  mn 4 2mλ  n n  gij (aij ) =

j =1



j =1

μ1 μ2 bi 2 , i ∈ [m]. 6mn2

The above has the similar form as (6.59) and we can solve the transformed problem by Algorithm 6.3. Lemma 6.2 Suppose n ≤ Lλ and set u1 = L1 and μ2 = 1/κg . Then f (a, b) has Lc -block coordinate Lipschitz continuous gradients w.r.t. (aij , b) (see (6.60) 1 1 mnd × and (6.61)), where Lc = 2mn 2 λ + 4mnL . f (a, b) is convex w.r.t. (a, b) ∈ R Span(A). g0 (b) is 6mn12 κ L -strongly convex. gij (aij ) with i ∈ [m] and j ∈ [n] are g

1 4nmL -strongly

convex.

Proof By checking, we can find that (6.60) is right. Using the fact that  m T 2 i=1 Ai Ai = A and A = 1, the block Lipschitz constant of b for f (a, b) 1 1 , so (6.61) is right. Also, it is easy to prove that g0 (b) is is less than 2mn2 λ + 4mnL μ1 1 -strongly convex and gij (aij ) are 4nmL -strongly convex, i ∈ [m], j ∈ [n]. 3mn2 Now we prove that f (a, b) is convex w.r.t. (a, b) ∈ Rmnd × Span(A). Applying (6.62) and using η=

2L 2 = ≥ 2n ≥ 2, μ1 λ λ

(6.66)

we have 2    n n    1 μ1 1 1 1 T   aij 2 + A a + b ij mn 2 2mλ  n i    n j =1 j =1 ≥

  n n   1 T 2 μ1 1  μ1 1  A b − 1 aij 2 + aij 2 i   mn 2 2mλ(1 + η) n 2mn 2 j =1

j =1

244

6 Accelerated Parallel Algorithms



  n  1 T 2 1  μ1 1  A b aij 2 + n i  mn 4 2m( ηλ 2 + ηλ) j =1

=

 2 n    1  μ1  aij 2 + μ1  1 AT b , i   mn 4 3m n

a

(6.67)

j =1

a

where ≥ uses (6.66). Summing (6.67) with i = 1, · · · , m, we have 2   n m  m   n    1 1 μ1 1 1 T  2  aij  + aij + Ai b  n mn 2 2mλ n   i=1 j =1

≥ a



1 mn

i=1

n m   i=1 j =1

j =1

 2  μ1 μ1   1 Ab aij 2 +  4 3m n 

1   μ1 μ1 b2 , aij 2 + mn 4 3mn2 κg m

n

i=1 j =1

where in = we use Ab2 ≥ b κg if b ∈ SpanA. So f (a, b) is convex as it is a nonnegative quadratic function.

 2

a

We give a unified convergence results for Algorithm 6.3. Theorem 6.4 For Algorithm 6.3, suppose that f (a, b) is convex and has L2 -block coordinate Lipschitz continuous gradient, g0 (b) is u3 -strongly convex, gij (aij ) with i ∈ [m] and j ∈ [n] are u4 -strongly √ convex, and max(μ3 , μ4 ) ≤ L2 . Then by setting pb μ3 /L2 1 pb = , we have μ3 and θk ≡ θ = 2 1+n

μ4



2 θ 2 L2 + θpb μ3   ˆ k+1 ∗ E − b b   2pb2 2

θ 2 L2 + θpa μ4   k+1 ∗ ˆ a + E − a   2pa2  2 θ 2 L2 + θpb μ3  ˆ0 k ∗ ≤ (1 − θ ) F (a0 , b0 ) − F (a∗ , b∗ ) + E − b b   2pb2 2

θ 2 L2 + θpa μ4   0 ∗ + E aˆ − a  . 2pa2

E F (ak+1 , bk+1 ) − F (a∗ , b∗ ) +

6.2 Accelerated Distributed Algorithms

245

Proof We use bk+1 and bˆ k+1 to denote the result of bk+1 and bˆ k+1 , respectively, if b c c ˆ k+1 is chosen to update at iteration k (i.e., q ≤ pb). Similarly, ak+1 i,j (i),c and a i,j (i),c denote ˆ k+1 the result of ak+1 i,j (i) and a i,j (i) , respectively, if ai,j (i) is chosen update at iteration k. Then by the optimality condition of bˆ k+1 in Step 7, we have θ L2 ˆ k+1 ˆ k ak , b˜ k ) + (b − b ) = 0. ∂g0 (bˆ k+1 c ) + ∇b f (˜ pb c

(6.68)

From Step 13, we have ak , b˜ k ) + L2 (bk+1 − b˜ k ) = 0. ∂g0 (bˆ k+1 c ) + ∇b f (˜ c

(6.69)

If b is chosen to update at iteration k (i.e., q ≤ pb ), we have ak+1 = a˜ k . Since f (a, b) has L2 -block coordinate Lipschitz continuous gradient w.r.t. b, we have 2  L   2  k+1 k k ˜ ˜ + − b − b f (ak+1 , bk+1 ) ≤ f (˜ak , b˜ k ) + ∇b f (˜ak , b˜ k ), bk+1 b  . c c 2 (6.70) Substituting (6.69) into (6.70), we have 2  L   2  k+1 k+1 k k ˜ ˜ b − f (ak+1 , bk+1 ) ≤ f (˜ak , b˜ k ) − ∂g0 (bˆ k+1 ), b − b − b   . c c c 2 (6.71) Similar to (6.69), we have     k+1 k ˜k k ˜ ∂gi,j (i) aˆ k+1 + ∇ a = 0. f (˜ a , b ) + L − a a i 2 i,j (i) i i,j (i) i,j (i),c i,j (i),c

(6.72)

If a is chosen to update (i.e., q > pb ), we have bk+1 = b˜ k . For each fi , suppose that j (i) is chosen. Since fi (ai , b) has L2 -coordinate Lipschitz continuous gradient w.r.t. ai,j (i) , we have fi (ak+1 , bk+1 )

2  L   2  k+1  k k ˜ ˜ + ≤ fi (˜ak , b˜ k ) + ∇aj (i) fi (˜aki , b˜ k ), ak+1 − a − a a i,j (i) i,j (i)  i,j (i),c i,j (i),c 2 2  L   a 2  k+1  k+1 k k ˜ ˜ a − = fi (˜ak , b˜ k ) − ∂gi,j (i) (ˆak+1 ), a − a − a  i,j (i) i,j (i)  , i,j (i),c i,j (i),c i,j (i),c 2

246

6 Accelerated Parallel Algorithms a

where = uses (6.72). Taking expectation only on j (i) under the condition that a is chosen at iteration k, we have Ea fi (ak+1 , bk+1 ) ≤ fi (˜ak , b˜ k ) −

n n 2  1 L2  1   k+1 k+1 k k  ˜ ˜ ∂gij (ˆak+1 − ), a − a − a a ij ij  . i,j,c i,j,c i,j,c n n 2 j =1

j =1

Summing i = 1 to m, we have  1  k+1 k ˜ ∂gij (ˆak+1 Ea f (ak+1 , bk+1 ) ≤ f (˜ak , b˜ k ) − ), a − a ij i,j,c i,j,c n m

n

i=1 j =1



m n 2 1   L2   k+1  ai,j,c − a˜ kij  . n 2 i=1 j =1

Then by taking expectation on the random choice of a and b at iteration k, we have Ek f (ak+1 , bk+1 )

  k+1 − b˜ k ≤ f (˜ak , b˜ k ) − pb ∂g0 (bˆ k+1 c ), bc −

n m  

  k+1 k ˜ pa ∂gij (ˆak+1 ), a − a ij i,j,c i,j,c

i=1 j =1



m n   2 L2 p b   k+1 ˜ k 2   L2 pa  k+1  bc − b  − ai,j,c − a˜ kij  , 2 2

(6.73)

i=1 j =1

b . where we use pa = 1−p 2 n    − b∗  , we have For bˆ k+1 c

2 1  ˆ k+1  bc − b∗  2 2 1  k k ∗ ˆ ˆ = bˆ k+1 − b + b − b  2 c 2 1  2   1  ˆk k ∗ k+1 k ˆk ∗ ˆ ˆ ˆ b = bˆ k+1 − b + − b + − b , b − b b    c 2 c 2     a 1 2 1  2 = bˆ k+1 − bˆ k  + bˆ k − b∗  c 2 2  pb  k+1 ∂g0 (bˆ c ) + ∇b f (˜ak , b˜ k ), bˆ k − b∗ , − θ L2

(6.74)

6.2 Accelerated Distributed Algorithms

247

a

where = uses (6.68).

2    Considering the expectation of bˆ k+1 − b∗  under the random numbers in iteration k, we have   2  2 2 ˆk    ∗ ∗ Ek bˆ k+1 − b∗  = pb bˆ k+1 − b + (1 − p ) − b b    . b c

(6.75)

Multiplying (6.75) by L2 θ 2 /(2pb2 ) and substituting (6.74) into it, we have 2 L θ 2  2 L2 θ 2  2  ˆ k+1 ˆk ∗ ∗ b b E − b − − b     k 2pb2 2pb2    L2 θ 2   ˆ k+1 ˆ k 2 k ˜k k ∗ ˆ . = ) + ∇ f (˜ a , b ), θ b − θ b bc − b  − ∂g0 (bˆ k+1 b c 2pb

(6.76)

From Step 3 of Algorithm 6.3, we have θ bˆ k = b˜ k − (1 − θ )bk ,

(6.77)

and from Step 13 of Algorithm 6.3, we have − b˜ k = bk+1 c

θ ˆ k+1 ˆ k (b − b ). pb c

(6.78)

So we have 2 L θ 2  2 L2 θ 2  2  ˆ k+1 ˆk ∗ ∗ b b E − b − − b     k 2pb2 2pb2 2    a L2 pb  k+1  ˆ k+1 − pb (bk+1 = − b˜ k ) − θ b∗ bc − b˜ k  − ∂g0 (bˆ k+1 c ), θ bc c 2   − ∇b f (˜ak , b˜ k ), θ bˆ k − θ b∗ . b

=

a

   L 2 pb   k+1 ˜ k 2 ˆ k+1 − pb (bk+1 − b˜ k ) − θ b∗ bc − b  − ∂g0 (bˆ k+1 c ), θ bc c 2   (6.79) − ∇b f (˜ak , b˜ k ), b˜ k − (1 − θ )bk − θ b∗ , b

where = uses (6.76) and (6.78) and = uses (6.77).

248

6 Accelerated Parallel Algorithms

Using the same technique on akij , we have 2 L θ 2  2 L2 θ 2  2  k+1  k ∗ ∗ ˆ ˆ a a E − a − − a     k ij ij ij ij 2pa2 2pa2 2   L 2 pa    k+1 k+1 ˆ k+1 ˜ kij ) − θ a∗ij = ai,j,c − a˜ kij  − ∂gij (ˆak+1 i,j,c ), θ a i,j,c − pa (ai,j,c − a 2   (6.80) − ∇ai,j (i) fi (˜ak , b˜ k ), a˜ kij − (1 − θ )akij − θ a∗ij . Then by summing i = 1, · · · , m and j = 1, · · · , n for (6.80), and adding (6.79) and (6.73) we have 2 θ 2 L θ 2 L2  2  ˆ k+1 ∗ b E − b   − k 2 2 2pb 2pb     2 θ 2 L2  k+1 2 θ L2  k 2 + E aˆ − a∗  − aˆ − a∗  2 2 2pa 2pa   ˆ k+1 − b∗ ≤ f (˜ak , b˜ k ) − θ ∂g0 (bˆ k+1 c ), bc

Ek f (ak+1 , bk+1 ) +



 2 ˆk  b − b∗ 

n m     k+1 ∗ ˆ θ ∂gij (ˆak+1 ), a − a ij i,j,c i,j,c i=1 j =1



n  m    ∇ai,j (i) fi (˜aki , b˜ k ), a˜ kij − (1 − θ )akij − θ a∗ij i=1 j =1

 − ∇b f (˜ak , b˜ k ), b˜ k − (1 − θ )bk − θ b∗ . 

(6.81)

By the μ3 -strongly convexity of g0 , we have   ˆ k+1 − b∗ − ∂g0 (bˆ k+1 c ), bc 2 μ3   ˆ k+1  bc − b∗  2  2 μ (1 − p )  2 μ3 a 3 b ˆk  ˆ k+1 ∗ ∗ ∗ b b = −g0 (bˆ k+1 ) + g (b ) − E − b + − b     , 0 k c 2pb 2pb ∗ ≤ −g0 (bˆ k+1 c ) + g0 (b ) −

(6.82) a

where = uses (6.75).

6.2 Accelerated Distributed Algorithms

249

Similarly, for aij we have   ∗ ˆ k+1 − ∂gij (ˆak+1 i,j,c ), a i,j,c − aij 2 μ4    k+1 aˆ i,j,c − a∗ij  2  2 μ4  ∗ ∗ = −gij (ˆak+1 Ek aˆ k+1 i,j,c ) + gij (aij ) − ij − aij  2pa  2 μ4 (1 − pa )  k+1  + aˆ ij − a∗ij  . 2pa ∗ ≤ −gij (ˆak+1 i,j,c ) + gij (aij ) −

(6.83)

Next, we have   − ∇b f (˜ak , b˜ k ), b˜ k − (1 − θ )bk − θ b∗ −

n  m    ∇ai,j (i) fi (˜ak , b˜ k ), a˜ kij − (1 − θ )akij − θ a∗ij i=1 j =1

  = − ∇b f (˜ak , b˜ k ), b˜ k − (1 − θ )bk − θ b∗   − ∇a f (˜ak , b˜ k ), a˜ k − (1 − θ )ak − θ a∗ a

≤ −f (˜ak , b˜ k ) + (1 − θ )f (ak , bk ) + θf (a∗ , b∗ ),

(6.84)

a

where in ≤ we use the convexity of f (a, b). Substituting (6.82), (6.83), and (6.84) into (6.81), we have Ek f (ak+1 , bk+1 ) + θg0 (bˆ k+1 c )+θ

n m  

gij (ˆak+1 i,j,c )

i=1 j =1

2 θ 2 L + θ μ p (1 − p )  2 θ 2 L2 + θpb μ3  2 3 b b ˆk    Ek bˆ k+1 − b∗  − b − b∗  2 2 2pb 2pb 2 θ 2 L + θ μ p (1 − p )  2 θ 2 L2 + θpa μ4  2 4 a a  k  k+1 ∗ ∗ ˆ ˆ a a + E − a − − a     k 2pa2 2pa2 +

≤ (1 − θ )f (ak , bk ) + θ F (a∗ , b∗ ).

250

6 Accelerated Parallel Algorithms

Using the definition of gˆ k in Lemma 6.3 (see the end of the proof), we have   Ek f (ak+1 , bk+1 ) + gˆ k+1   ≤ (1 − θ ) f (ak , bk ) + gˆ k + θ F (a∗ , b∗ ) 2 θ 2 L + θ μ p (1 − p )  2 θ 2 L2 + θpb μ3  2 3 b b ˆk    Ek bˆ k+1 − b∗  + b − b∗  2 2 2pb 2pb 2 θ 2 L + θ μ p (1 − p )  2 θ 2 L2 + θpa μ4  2 4 a a  k  k+1 ∗ ∗ ˆ ˆ a a − E − a + − a     . k 2pa2 2pa2 −

From the setting of pb , we have can have

pb pa

=

√ μ √ 4, μ3

then θ =

√ pb u3 /L2 2

=

√ pa u4 /L2 . 2

So we

θ 2 L2 + θ μ3 pb (1 − pb ) θ 2 L2 + θpb μ3 (1 − θ ) ≥ , 2pb2 2pb2

(6.85)

θ 2 L2 + θpa μ4 θ 2 L2 + θ μ4 pa (1 − pa ) (1 − θ ) ≥ . 2pa2 2pa2

(6.86)

and

Indeed, by solving (6.85) we have

θ≤



μ23 μ3 pb −μ3 /L2 + 2 + 4 L2 L2

2

Using u3 /L2 ≤ 1, we can check that θ = u4 /L2 ≤ 1, we have that θ =



pa u4 /L2 2

√ pb u3 /L2 2

.

(6.87)

satisfies (6.87). Also using

satisfies (6.86). Then we have

  Ek f (ak+1 , bk+1 ) + gˆ k+1 − F (a∗ , b∗ )   2 θ 2 L + θp μ  2 θ 2 L2 + θpb μ3  2 a 4  k+1  ˆ k+1 ∗ ∗ −b  + −a  +Ek b aˆ 2pa2 2pb2 ≤ (1 − θ ) f (ak , bk ) + gˆ k − F (a∗ , b∗ )  2 θ 2 L + θp μ  2 θ 2 L2 + θpb μ3  2 a 4  k ˆk ∗ ∗ + b − b  + aˆ − a  . 2pa2 2pb2

6.2 Accelerated Distributed Algorithms

251

By taking full expectation, we have   E f (ak+1 , bk+1 ) + gˆ k+1 − F (a∗ , b∗ )   2 θ 2 L + θp μ  2 θ 2 L2 + θpb μ3  2 a 4  k+1  ˆ k+1 ∗ ∗ E b −b  + E aˆ −a  + 2pa2 2pb2 k ≤ (1 − θ ) f (a0 , b0 ) + gˆ 0 − F (a∗ , b∗ ) θ 2 L2 + θpb μ3 + 2pb2

  2 θ 2 L + θp μ  2 2 a 4 ˆ0    b − b∗  + aˆ 0 − a∗  . 2pa2

Then using the convexity of g0 we have k+1

g0 (b

) = g0

k+1 

 ˆi



ek+1,i,1 b

i=0

k+1 

ek+1,i,1 g0 (bˆ i ) = gˆ 0k+1 .

i=0

Also for any aij , we have ⎛ ⎝ gij (ak+1 ij ) = gij

k+1 

⎞ ek+1,q,2 aˆ ij ⎠ ≤ q

q=0

and gˆ 0 = g0 (bˆ 0 ) +

 m n i=1

a0ij ). j =1 gij (ˆ

k+1 

q

ek+1,q,2 gij (ˆaij ) = gˆ ijk+1 ,

q=0

We obtain Theorem 6.4.



The following lemma is a straightforward extension of Lemma 5.1 in Sect. 5.1.1. Lemma 6.3 From Algorithm 6.3 with the parameters set in Theorem 6.4, we have that bk is a convex combination of {bˆ i }ki=0 , i.e., bk = ki=0 ek,i,1 bˆ i , where e0,0,1 = 1, e1,0,1 = 1 − θ/pb , e1,1,1 = θ/pb . And for k > 1, we have ek,i,1 =

⎧ ⎨

(1 − θ )ek−1,i,1 , (1 − θ )θ/pb + θ − θ/pb , ⎩ θ/pb ,

i ≤ k − 2, i = k − 1, i = k.

(6.88)

 Also, ak is a convex combination of {ˆai }ki=0 , i.e., akj = ki=0 ek,i,2 aˆ ij , with e0,0,2 = 1, e1,0,2 = 1 − θ/pa , e1,1,2 = θ/pa . And for k > 1, we have ek,i,2 =

⎧ ⎨

(1 − θ )ek−1,i,2 , (1 − θ )θ/pa + θ − θ/pa , ⎩ θ/pa ,

i ≤ k − 2, i = k − 1, i = k.

252

6 Accelerated Parallel Algorithms

Set gˆ 0k = m  n i=1

k

ˆq q=0 ek,q,1 g0 (b ),

k j =1 gˆ ij ,

gˆ ijk =

k

q aij ), q=0 ek,q,2 gij (ˆ

and gˆ k = gˆ 0k +

we have Ek (gˆ 0k+1 ) = (1 − θ )gˆ 0k + θg0 (bˆ k+1 c ),

(6.89)

and Ek (gˆ ijk+1 ) = (1 − θ )gˆ ijk + θgij (ˆak+1 i,j,c ), with i ∈ [m] and j ∈ [n], where Ek denotes that the expectation is only taken on the random number in the k-th iteration under the condition that ak and bk are known, bˆ k+1 denotes the result of bˆ k+1 if b is chosen to update at iteration k, and aˆ k+1 c i,j,c denotes the result of aˆ k+1 ij if aij is chosen to update at iteration k. Proof We consider ek,i,j first. When k = 0 and 1, it is true that e0,0,1 = 1 and e0,0,2 = 1. We first prove (6.88). Assume for k, (6.88) holds. From Steps 3 and 13, we have bk+1 = (1 − θ )bk + θ bˆ k + θ/pb (bˆ k+1 − bˆ k ) = (1 − θ )

k 

ek,i,1 bˆ i + θ bˆ k + θ/pb (bˆ k+1 − bˆ k )

i=0

= (1 − θ )

k−1 

7 6 ek,i,1 bˆ i + (1 − θ )ek,k,1 + θ − θ/pb bˆ k + θ/pb bˆ k+1 .

i=0

Comparing the results, we obtain (6.88). To prove convex combination, it is easy to prove that the weights sum to 1. We then prove that ek,i,1 ≥ 0 for all k ≥ 0 and 0 ≤ i ≤ k.√ When k = 0 and k = 1, we have e0,0,1 = 1 ≥ 0, e1,0,1 = 1 − pθb = 1 − μ42/L2 ≥ 0, e1,1,1 = pθb ≥ 0. When k ≥ 1, suppose that at k, ek,i,1 ≥ 0 with 0 ≤ i ≤ k. Then we have ek+1,i,1 = (1 − θ )ek,i,1 ≥ 0 (i ≤ k − 1), ek+1,k+1,1 = θ/pb ≥ 0, and ek+1,k,1



μ4 /L2 ≥ 0. = (1 − θ )θ/pb + θ − θ/pb = θ (1 − θ/pb ) = θ 1 − 2

6.2 Accelerated Distributed Algorithms

253

We then prove (6.89), we have a

Ek gˆ 0k+1 =

k 

ek+1,i,1 g0 (bˆ i ) + (θ/pb )Ek g0 (bˆ k+1 )

i=0 b

=

k  i=0

=

k 



1 − pb k ˆ ek+1,i,1 g0 (bˆ i ) + θ g0 (bˆ k+1 ) + g ( b ) 0 c pb ˆk ek+1,i,1 g0 (bˆ i ) + θg0 (bˆ k+1 c ) + (1/pb − 1)θg0 (b )

i=0 c

=

k−1 

ek+1,i,1 g0 (bˆ i ) + [(1 − θ )θ/pb + θ − θ/pb ]g0 (bˆ k )

i=0

+(1/pb − 1)θg0 (bˆ k ) + θg0 (bˆ k+1 c ) =

k−1 

ek+1,i,1 g0 (bˆ i ) + (1 − θ )θ/pb g0 (bˆ k ) + θg0 (bˆ k+1 c )

i=0 d

=

k−1 

(1 − θ )ek,i,1 g0 (bˆ i ) + (1 − θ )ek,k,1 g0 (bˆ k ) + θg0 (bˆ k+1 c )

i=0

= (1 − θ )

k 

k ˆ k+1 ek,i,1 g0 (bˆ i ) + θg0 (bˆ k+1 c ) = (1 − θ )gˆ 0 + θg0 (bc ),

i=0 a

b

where in = we use ek+1,k+1,1 = θ/pb , in = we use ˆk Ek g0 (bˆ k+1 ) = pb g0 (bˆ k+1 c ) + (1 − pb )g0 (b ), c

d

in = we use ek+1,k,1 = (1 − θ )θ/pb + θ − θ/pb , and in = we use ek+1,i,1 = (1 − θ )ek,i,1 for i ≤ k − 1 and ek,k,1 = θ/pb . By the same way, we can prove the result for a. 

With Theorem 6.4 in hand, we can compute the communication and the iteration costs of Algorithm 6.3, which are stated in the following theorems. 1 1 1 Theorem 6.5 Assume Lλ ≥ n. Set μ1 = L1 , L2 = 2mn 2 λ + 4mnL , μ3 = 6mn2 L ,   √ 3 . It takes O˜ nL/λ iterations to obtain an -accuracy solution and μ4 = 4mnL ∗ ∗ k k for problem (6.58) satisfying D(a √  , b ) − D(a √ , b ) ≤ . The communication and ˜ ˜ the iteration costs are O nL/λ and O L/λ , respectively.

254

6 Accelerated Parallel Algorithms

1 μ2 Also for the decentralized case, for problem (6.65), because g0 (b) = μ b2 , 6mn2 from Steps 3, 7, and 13 of Algorithm 6.3 and b0 = bˆ 0 = 0 ∈ Span(A), we can obtain that bk ∈ Span(A) and bˆ k ∈ Span(A) for all k ≥ 0. Thus we have

1 1 Theorem 6.6 Assume Lλ ≥ n. Set μ1 = L1 , L2 = 2mn + 4mnL , μ3 = √  2λ √ √ 1 1 ˜ , and μ = . It takes O κ + n L/λ iterations to obtain 4 g 4mnL 6mn2 κ L g

∗ ∗ k , bk ) ≤ . an -accuracy solution for problem (6.65) satisfying D(a  √ , b ) − D(a  The communication and the iteration costs are O˜ nL/λ and O˜ κg L/λ , respectively.

References 1. A. Agarwal, J.C. Duchi, Distributed delayed stochastic optimization, in Advances in Neural Information Processing Systems, Granada, vol. 24 (2011), pp. 873–881 2. Z. Allen-Zhu, Katyusha: the first truly accelerated stochastic gradient method, in Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, Montreal, (2017), pp. 1200–1206 3. Z. Allen-Zhu, Z. Qu, P. Richtárik, Y. Yuan, Even faster accelerated coordinate descent using non-uniform sampling, in Proceedings of the 33th International Conference on Machine Learning, New York, (2016), pp. 1110–1119 4. C. Fang, Z. Lin, Parallel asynchronous stochastic variance reduction for nonconvex optimization, in Proceedings of the 31th AAAI Conference on Artificial Intelligence, San Francisco, (2017), pp. 794–800 5. C. Fang, Y. Huang, Z. Lin, Accelerating asynchronous algorithms for convex optimization by momentum compensation (2018). Preprint. arXiv:1802.09747 6. D. Jakoveti´c, J.M. Moura, J. Xavier, Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Trans. Automat. Contr. 60(4), 922–936 (2014) 7. Q. Lin, Z. Lu, L. Xiao, An accelerated proximal coordinate gradient method, in Advances in Neural Information Processing Systems, Montreal, vol. 27 (2014), pp. 3059–3067 8. J. Liu, S.J. Wright, C. Ré, V. Bittorf, S. Sridhar, An asynchronous parallel stochastic coordinate descent algorithm. J. Mach. Learn. Res. 16(1), 285–322 (2015) 9. C. Ma, V. Smith, M. Jaggi, M.I. Jordan, P. Richtarik, M. Takac, Adding vs. averaging in distributed primal-dual optimization, arXiv preprint, arXiv:1502.03508 (2015) 10. H. Mania, X. Pan, D. Papailiopoulos, B. Recht, K. Ramchandran, M.I. Jordan, Perturbed iterate analysis for asynchronous stochastic optimization. SIAM J. Optim. 27(4), 2202–2229 (2017) 11. Y. Nesterov, A method for unconstrained convex minimization problem with the rate of convergence O(1/k 2 ). Sov. Math. Dokl. 27(2), 372–376 (1983) 12. B. Recht, C. Re, S. Wright, F. Niu, HOGWILD!: a lock-free approach to parallelizing stochastic gradient descent, in Advances in Neural Information Processing Systems, Granada, vol. 24 (2011), pp. 693–701 13. S.J. Reddi, A. Hefny, S. Sra, B. Poczos, A.J. Smola, On variance reduction in stochastic gradient descent and its asynchronous variants, in Advances in Neural Information Processing Systems, Montreal, vol. 28 (2015), pp. 2647–2655 14. K. Seaman, F. Bach, S. Bubeck, Y.T. Lee, L. Massoulié, Optimal algorithms for smooth and strongly convex distributed optimization in networks, in Proceedings of the 34th International Conference on Machine Learning, Sydney, (2017), pp. 3027–3036 15. W. Shi, Q. Ling, G. Wu, W. Yin, EXTRA: an exact first-order algorithm for decentralized consensus optimization. SIAM J. Optim. 25(2), 944–966 (2015)

References

255

16. K. Yuan, Q. Ling, W. Yin, On the convergence of decentralized gradient descent. SIAM J. Optim. 26(3), 1835–1854 (2016) 17. P. Zhao, T. Zhang, Stochastic optimization with importance sampling for regularized loss minimization, in Proceedings of the 32th International Conference on Machine Learning, Lille, (2015), pp. 1–9 18. S. Zheng, J. Wang, F. Xia, W. Xu, T. Zhang, A general distributed dual coordinate optimization framework for regularized loss minimization. J. Mach. Learn. Res. 18(115), 1–52 (2017)

Chapter 7

Conclusions

In the previous chapters, we have introduced many representative accelerated firstorder algorithms used or investigated by the machine learning community. It is inevitable that our review is incomplete and biased. Moreover, new accelerated algorithms still emerge when this book was under preparation, but we had to leave them out. Although accelerated algorithms are very attractive in theory, in practice whether we should use them depends on many practical factors. For example, when using the warm start technique to solve a series of LASSO problems with different penalties, the advantages of accelerated algorithms may diminish when the grid of penalties is fine enough. For low-rank problems [7], the inter-/extra-polation may destroy the low-rank structure of matrices and make the related SVD (used in the singular value thresholding) computation much more expensive, counteracting the benefit of requiring less iterations. Since most accelerated algorithms only consider the L-smoothness and μ-strong convexity of the objective functions, and do not consider other characteristics of the problems, for some problems the unaccelerated algorithms and accelerated ones can actually converge equally fast, or even faster than the predicted rate (such as the problems with the objective functions being only restricted strongly convex, meaning that strongly convex only on a subset of the domain, and the “well-behaved” learning problems, meaning that the signal to noise ratio is decently high, the correlations between predictor variables are under control, and the number of predictors is larger than that of observations). One remarkable example is that although there have been several algorithms for nonconvex problems that are proven to converge faster to stationary points than gradient descent does, in reality when training deep neural networks they still cannot beat gradient descent. On the other hand, optimization actually involves multiple aspects of computation. If some details of computation are not treated appropriately, accelerated algorithms can be slow. For example, the singular value thresholding (SVT) operator is often encountered when solving low-rank models. However, if naively implementing it, full SVD will be needed and the computation can be extremely expensive. © Springer Nature Singapore Pte Ltd. 2020 Z. Lin et al., Accelerated Optimization for Machine Learning, https://doi.org/10.1007/978-981-15-2910-8_7

257

258

7 Conclusions

Actually, very often SVT could be done by partial SVD [7], whose computation can be much cheaper than that of full SVD. For distributed optimization, some pre-processing such as balancing the data according to the computing power and communication bandwidth (if possible) can help a lot. Although many intricate accelerated algorithms have been proposed, there are fundamental complexity lower bounds that cannot be broken if the algorithms are designed in the traditional ways. Some algorithms have achieved the complexity lower bounds, if looking at the orders and ignoring the constant factors (e.g., [2, 5, 6, 9]). So it is not quite exciting to improve the constants. Recently, there has been some work on using machine learning techniques to boost convergence in optimization algorithms, which seem to break the complexity lower bounds when running on test data [1, 3, 8, 11, 12]. Although promising experimental results have been shown, rare algorithms have theoretical guarantee. Thus most learningbased optimization algorithms remain heuristic. Chen et al. might be the first to provide convergence guarantees [1]. However, their proof is only valid for the LASSO problem. Learning-based optimization algorithms for general problems, rather than a particular problem, that have convergence guarantees are scarce. Liu et al. [8] and Xie et al. [10], which aim at solving nonconvex inverse problems and linearly constrained separable convex problems, respectively, are among the very limited literatures. It is not surprising that when considering characteristics of data, optimization can be accelerated. An example from the traditional algorithms is assuming that the dictionary matrix in the LASSO problem has the Restricted Isometric Property or the sampling operator in the matrix completion problem has the Matrix Restricted Isometric Property [4]. Learning-based optimization just aims at describing the characteristics of data more accurately, but currently only using samples themselves rather than mathematical properties. Although learning-based optimization is still in its infant age, we believe that it will bring another wave of acceleration in the future.

References 1. X. Chen, J. Liu, Z. Wang, W. Yin, Theoretical linear convergence of unfolded ISTA and its practical weights and thresholds, in Advances in Neural Information Processing Systems, Montreal, vol. 31 (2018), pp. 9079–9089 2. C. Fang, C.J. Li, Z. Lin, T. Zhang, SPIDER: near-optimal non-convex optimization via stochastic path-integrated differential estimator, in Advances in Neural Information Processing Systems, Montreal, vol. 31 (2018), pp. 689–699 3. K. Gregor, Y. LeCun, Learning fast approximations of sparse coding, in Proceedings of the 27th International Conference on Machine Learning, Haifa, (2010), pp. 399–406 4. M.-J. Lai, W. Yin, Augmented 1 and nuclear-norm models with a globally linearly convergent algorithm. SIAM J. Imag. Sci. 6(2), 1059–1091 (2013) 5. G. Lan, Y. Zhou, An optimal randomized incremental gradient method. Math. Program. 171(1– 2), 167–215 (2018) 6. H. Li, Z. Lin, Accelerated alternating direction method of multipliers: an optimal O(1/K) nonergodic analysis. J. Sci. Comput. 79(2), 671–699 (2019)

References

259

7. Z. Lin, H. Zhang, Low-Rank Models in Visual Analysis: Theories, Algorithms, and Applications (Academic, New York, 2017) 8. R. Liu, S. Cheng, Y. He, X. Fan, Z. Lin, Z. Luo, On the convergence of learning-based iterative methods for nonconvex inverse problems. IEEE Trans. Pattern Anal. Mach. Intell. (2020). https://doi.org/10.1109/TPAMI.2019.2920591 9. K. Seaman, F. Bach, S. Bubeck, Y.T. Lee, L. Massoulié, Optimal algorithms for smooth and strongly convex distributed optimization in networks, in Proceedings of the 34th International Conference on Machine Learning, Sydney, (2017), pp. 3027–3036 10. X. Xie, J. Wu, G. Liu, Z. Zhong, Z. Lin, Differentiable linearized ADMM, in Proceedings of the 36th International Conference on Machine Learning, Long Beach, (2019), pp. 6902–6911 11. Y. Yang, J. Sun, H. Li, Z. Xu, Deep ADMM-Net for compressive sensing MRI, in Advances in Neural Information Processing Systems, Barcelona, vol. 29 (2016), pp. 10–18 12. J. Zhang, B. Ghanem, ISTA-Net: interpretable optimization-inspired deep network for image compressive sensing, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake, (2018), pp. 1828–1837

Appendix A

Mathematical Preliminaries

In this appendix, we list the conventions of notations and some basic definitions and facts that are used in the book.

A.1 Notations Notations Normal font, e.g., s Bold lowercase, e.g., v Bold capital, e.g., M Calligraphic capital, e.g., T R, R+ Z+ [n] EX I, 0, 1

Meanings A scalar A vector A matrix A subspace, an operator, or a set

x≥y XY

Set of real numbers, set of nonnegative real numbers Set of nonnegative intergers {1, 2, · · · , n} Expectation of random variable (or random vector) X The identity matrix, all-zero matrix or vector, and all-one vector x − y is a nonnegative vector X − Y is a positive semi-definite matrix

f (N ) = O(g(N ))

∃a > 0, such that

f (N ) g(N )

≤ a for all N ∈ Z+

˜ f (N ) = O(g(N ))

∃a > 0, such that

f˜(N ) g(N )

≤ a for all N ∈ Z+ ,

where f˜(N) is the function ignoring poly-logarithmic factors in f (N)

© Springer Nature Singapore Pte Ltd. 2020 Z. Lin et al., Accelerated Optimization for Machine Learning, https://doi.org/10.1007/978-981-15-2910-8

261

262

A Mathematical Preliminaries

f (N ) g(N )

≥ a for all N ∈ Z+

f (N ) = (g(N ))

∃a > 0, such that

xi ∇f (x)

The i-th vector in a sequence or the i-th coordinate of x Gradient of f at x

∇i f (x)

∂f ∂xi

X:j Xij XT Diag(x)

The j -th column of matrix X The entry at the i-th row and the j -th column of X Transpose of matrix X Diagonal matrix whose diagonal entries are entries of vector x The i-th largest singular value of matrix X The i-th largest eigenvalue of matrix X Matrix whose (i, j )-th entry is |Xij |

σi (X) λi (X) |X| Span(X) || · || || · ||2 or  · 

The subspace spanned by the columns of X Operator norm of an operator or a matrix  2 2 norm of vectors, ||v||2 = i vi ;

|| · ||F

|| · || is also used for general norm of vectors Nuclear norm of matrices, the sum of singular values 0 pseudo-norm, number of nonzero entries  1 norm, ||X||1 = i,j |Xij | 1/p  p p norm, ||X||p = i,j |Xij |  2 Frobenius norm of a matrix, ||X||F = i,j Xij

|| · ||∞

∞ norm, ||X||∞ = maxij |Xij |

conv(X) ∂f

Convex hull of set X Subgradient (resp. supergradient) of a convex (resp concave) function f Optimum value of f (x), where x varies in domf and the constraints The conjugate function of f (x) Proximal mapping w.r.t. f and parameter α,   Proxαf (y) = argminx αf (x) + 12 x − y22

|| · ||∗ || · ||0 || · ||1 || · ||p

f∗ f ∗ (x) Proxαf (·)

A Mathematical Preliminaries

263

A.2 Algebra and Probability Proposition A.1 (Cauchy-Schwartz Inequality) x, y ≤ xy. Lemma A.1 For any x, y, z, and w ∈ Rn , we have the following three identities: 2 x, y = x2 + y2 − x − y2 ,

(A.1)

2 x, y = x + y2 − x2 − y2 ,

(A.2)

2 x − z, y − w = x − w − z − w − x − y + z − y . 2

2

2

2

(A.3)

Definition A.1 (Singular Value Decomposition (SVD)) Suppose that A ∈ Rm×n with rankA = r. Then A can be factorized as A = UV% , where U ∈ Rm×r satisfies U% U = I, V ∈ Rn×r satisfies V% V = I, and  = Diag(σ1 , · · · , σr ), with σ1 ≥ σ2 ≥ · · · ≥ σr > 0. The factorization (1) is called the economic singular value decomposition (SVD) of A. The columns of U are called left singular vectors of A, the columns of V are right singular vectors, and the numbers σi are the singular values. Definition A.2 (Laplacian Matrix of a Graph) Denote a graph as g = {V , E}, where V and E are the node and the edge sets, respectively. eij = (i, j ) ∈ E indicates that nodes i and j are connected. Define Vi = {j ∈ V |(i, j ) ∈ E} to be the index set of the nodes that are connected to node i. The Laplacian matrix of the graph g = {V , E} is defined as: ⎧ ⎨ |Vi |, if i = j, Lij = −1, if i = j and (i, j ) ∈ E, ⎩ 0, otherwise. Definition A.3 (Dual Norm) Let  ·  be a norm of vectors in Rn , then its dual norm  · ∗ is defined as: y∗ = max{x, y |x ≤ 1}. Proposition A.2 Given random vector ξ , we have Eξ − Eξ 2 ≤ Eξ 2 .

264

A Mathematical Preliminaries

Proposition A.3 (Jensen’s Inequality: Continuous Case) If f : C ⊆ Rn → R is convex and ξ is a random vector over C, then f (Eξ ) ≤ Ef (ξ ). Definition A.4 (Discrete-Time Martingale) A sequence of random variables (or random vectors) X1 , X2 , · · · is called a martingale if it satisfies for any time n, E|Xn | < ∞, E(Xn+1 |X1 , · · · , Xn ) = Xn . That is, the conditional expected value of the next observation, given all the past observations, is equal to the most recent observation. Proposition A.4 (Iterated Law of Expectation) For two random variables X and Y , we have EY = EX (E(Y |X)).

A.3 Convex Analysis The descriptions for the basic concepts of convex sets and convex functions can be found in [4]. Definition A.5 (Convex Set) A set C ⊆ Rn is called convex if for all x, y ∈ C and α ∈ [0, 1] we have αx + (1 − α)y ∈ C. Definition A.6 (Extreme Point) Given a nonempty convex set C, a vector x ∈ C is said to be an extreme point of C if it does not lie strictly between the endpoints of any line segment contained in C. Namely, there do not exist vectors y, z ∈ C, with y = x and z = x, and a scalar α ∈ (0, 1) such that x = αy + (1 − α)z. Definition A.7 (Convex Function) A function f : C ⊆ Rn → R is called convex if C is a convex set and for all x, y ∈ C and α ∈ [0, 1] we have f (αx + (1 − α)y) ≤ αf (x) + (1 − α)f (y). C is called the domain of f . Definition A.8 (Concave Function) A function f : C ⊆ Rn → R is called concave if −f is convex.

A Mathematical Preliminaries

265

Definition A.9 (Strictly Convex Function) A function f : C ⊆ Rn → R is called strictly convex if C is a convex set and for all x = y ∈ C and α ∈ (0, 1) we have f (αx + (1 − α)y) < αf (x) + (1 − α)f (y). Definition A.10 (Strongly Convex Function and Generally Convex Function) A function f : C ⊆ Rn → R is called strongly convex if C is a convex set and there exists a constant μ > 0 such that for all x, y ∈ C and α ∈ [0, 1] we have f (αx + (1 − α)y) ≤ αf (x) + (1 − α)f (y) −

μα(1 − α) y − x2 . 2

μ is called the strong convexity modulus of f . For brevity, a strongly convex function with a strong convexity modulus μ is called a μ-strongly convex function. If a convex function is not strongly convex, we also call it a generally convex function. Proposition A.5 (Jensen’s Inequality: Discrete Case) If f : C ⊆ Rn → R is m  convex, xi ∈ C, αi ≥ 0, i = 1, · · · , m, and αi = 1, then i=1

f

 m  i=1

 αi xi



m 

αi f (xi ).

i=1

Definition A.11 (Smooth Function) A function is (informally) called smooth if it is continuously differentiable. Definition A.12 (Function with Lipschitz Continuous Gradients) A differentiable function f : C ⊆ Rn → R is called to have Lipschitz continuous gradients if there exists L > 0 such that ∇f (x) − ∇f (y) ≤ Ly − x,

∀x, y ∈ C.

For simplicity, if the constant L is explicitly specified we also call such a function an L-smooth function. Definition A.13 (Function with Coordinate Lipschitz Continuous Gradients) f (x) is said to have Lc -coordinate Lipschitz continuous gradients if: |∇i f (x) − ∇i f (y)| ≤ Lc |xi − yi |,

all i ∈ [n].

(A.4)

Definition A.14 (Function with Lipschitz Continuous Hessians) A twice differentiable function f : C ⊆ Rn → R is called to have Lipschitz continuous Hessians if there exists L > 0 such that ∇ 2 f (x) − ∇ 2 f (y) ≤ Ly − x,

∀x, y ∈ C.

266

A Mathematical Preliminaries

If the constant L is explicitly specified such a function is also called to have LLipschitz continuous Hessians. Proposition A.6 ([6]) If f : C ⊆ Rn → R is L-smooth, then |f (y) − f (x) − ∇f (x), y − x| ≤

L y − x2 , 2

∀x, y ∈ C.

(A.5)

In particular, if y = x − L1 ∇f (x), then f (y) ≤ f (x) −

1 ∇f (x)2 . 2L

(A.6)

If f is further convex, then f (y) ≥ f (x) + ∇f (x), y − x +

1 ∇f (y) − ∇f (x)2 . 2L

(A.7)

Proposition A.7 If f : C ⊆ Rn → R has Lc -coordinate Lipschitz continuous gradients on coordinate i and x and y only differ at the i-th entry, then we have |f (x) − f (y) − ∇i f (y), xi − yi | ≤

Lc (xi − yi )2 . 2

Proposition A.8 If f : C ⊆ Rn → R has L-Lipschitz continuous Hessians, then [6] > > > > >f (y) − f (x) − ∇f (x), y − x − 1 (y − x)T ∇ 2 f (x)(y − x)> ≤ L y − x2 , > > 6 2 ∀x, y ∈ C. (A.8) Definition A.15 (Subgradient of a Convex Function) A vector g is called a subgradient of a convex function f : C ⊆ Rn → R at x ∈ C if f (y) ≥ f (x) + g, y − x , ∀y ∈ C. The set of subgradients at x is denoted as ∂f (x). Proposition A.9 For convex function f : C ⊆ Rn → R, its subgradient exists at every interior point of C. It is differentiable at x iff (aka if and only if) ∂f (x) is a singleton. Proposition A.10 If f : Rn → R is μ-strongly convex, then f (y) ≥ f (x) + g, y − x +

μ y − x2 , 2

∀g ∈ ∂f (x).

(A.9)

A Mathematical Preliminaries

267

In particular, if f is differentiable and μ-strongly convex and x∗ = argminx f (x), then f (x) − f (x∗ ) ≥

μ x − x∗ 2 . 2

(A.10)

1 ∇f (x)2 . 2μ

(A.11)

On the other hand, we can have f (x∗ ) ≥ f (x) −

Definition A.16 (Epigraph) The epigraph of f : C ⊆ Rn → R is defined as epi f = {(x, t)|x ∈ C, t ≥ f (x)}. Definition A.17 (Closed Function) If epi f is a closed set, then f is called a closed function. Definition A.18 (Monotone Mapping and Monotone Function) A set valued n function f : C ⊆ Rn → 2R is called a monotone mapping if x − y, u − v ≥ 0,

∀x, y ∈ C and u ∈ f (x), v ∈ f (y).

In particular, if f is a single valued function and x − y, f (x) − f (y) ≥ 0,

∀x, y ∈ C.

then it is called a monotone function. Proposition A.11 (Monotonicity of Subgradient) If f : C ⊆ Rn → R is convex, then ∂f (x) is a monotone mapping. If f is further μ-strongly convex, then x1 − x2 , g1 − g2  ≥ μx1 − x2 2 ,

∀xi ∈ C and gi ∈ ∂f (xi ), i = 1, 2.

Definition A.19 (Envelope Function and Proximal Mapping) Given a function f : C ⊆ Rn → R and a > 0,

1 2 y − x Envaf (x) = min f (y) + y∈C 2a is called the envelope function of f (x), and

1 y − x2 Proxaf (x) = argmin f (y) + 2a y∈C



is called the proximal mapping of f (x). Proxaf (x) may be set-valued if f is not convex.

268

A Mathematical Preliminaries

Further descriptions of proximal mapping can be found in [7]. Definition A.20 (Bregman Distance) Given a differentiable strongly convex function h, the Bregman distance is defined as Dh (y, x) = h(y) − h(x) − ∇h(x), y − x . The Euclidean distance is obtained when h(x) = 12 x2 , in which case Dh (y, x) = 1 2 2 x − y . The generalization to a nondifferentiable h was discussed in [5]. Definition A.21 (Conjugate Function) Given f : C ⊆ Rn → R, its conjugate function is defined as f ∗ (u) = sup (z, u − f (z)) . z∈C

The domain of f ∗ is domf ∗ = {u|f ∗ (u) < +∞}. Proposition A.12 (Properties of Conjugate Function) Given f : C ⊆ Rn → R, its conjugate function has the following properties: f ∗ is always a convex function; f ∗∗ (x) ≤ f (x), ∀x ∈ C; If f is a proper, closed and convex function, then f ∗∗ (x) = f (x), ∀x ∈ C. If f is L-smooth, then f ∗ is L−1 -strongly convex on domf ∗ . Conversely, if f is μ-strongly convex, then f ∗ is μ−1 -smooth on domf ∗ . 5. If f is closed and convex, then y ∈ ∂f (x) if and only if x ∈ ∂f ∗ (y).

1. 2. 3. 4.

Proposition A.13 (Fenchel-Young Inequality) Let f ∗ be the conjugate function of f , then f (x) + f ∗ (y) ≥ x, y . Definition A.22 (Lagrangian Function) Given a constrained problem: min f (x),

x∈Rn

(A.12)

s.t. Ax = b, g(x) ≤ 0, where A ∈ Rm×n and g(x) = (g1 (x), · · · , gp (x))T , the Lagrangian function is L(x, u, v) = f (x) + u, Ax − b + v, g(x) , where v ≥ 0.

A Mathematical Preliminaries

269

Definition A.23 (Lagrange Dual Function) Given a constrained problem (A.12), the Lagrange dual function is

d(u, v) = min_{x∈C} L(x, u, v),  (A.13)

where C is the domain of f. The domain of the dual function is D = {(u, v) | d(u, v) > −∞}.

Definition A.24 (Dual Problem) Given a constrained problem (A.12), the dual problem is

max_{u,v} d(u, v),  s.t.  v ≥ 0.

Accordingly, problem (A.12) is called the primal problem.

Definition A.25 (Slater's Condition) For the convex primal problem (A.12), if there exists an x_0 such that Ax_0 = b, g_i(x_0) ≤ 0, i ∈ I_1, and g_i(x_0) < 0, i ∈ I_2, where I_1 and I_2 are the sets of indices of the linear and nonlinear inequality constraints, respectively, then Slater's condition holds.

Proposition A.14 (Properties of Dual Problem)
1. d(u, v) is always a concave function, even if the primal problem (A.12) is not convex.
2. The primal and the dual optimal values, f^* and d^*, always satisfy weak duality: f^* ≥ d^*.
3. When Slater's condition holds, strong duality holds: f^* = d^*.

Definition A.26 (KKT Point and KKT Condition) (x, u, v) is called a Karush–Kuhn–Tucker (KKT) point of problem (A.12) if:
1. Stationarity: 0 ∈ ∂f(x) + A^T u + Σ_{i=1}^p v_i ∂g_i(x).
2. Primal feasibility: Ax = b, g_i(x) ≤ 0, i = 1, · · · , p.
3. Complementary slackness: v_i g_i(x) = 0, i = 1, · · · , p.
4. Dual feasibility: v_i ≥ 0, i = 1, · · · , p.

The above conditions are called the KKT condition of problem (A.12). They are the optimality condition of problem (A.12) when f(x) and g_i(x), i = 1, · · · , p, are all convex.

Proposition A.15 When f(x) and g_i(x), i = 1, · · · , p, are all convex, (x^*, u^*, v^*) is a pair of primal and dual solutions with zero duality gap if and only if it satisfies the KKT condition.

Definition A.27 (Compact Set) A subset S of R^n is called compact if it is both bounded and closed.
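To make the dual function, strong duality, and the KKT condition concrete (a minimal sketch on an assumed equality-constrained toy problem, not an example from the book), consider min (1/2)‖x‖² s.t. ⟨a, x⟩ = b, whose dual is a one-dimensional concave quadratic:

```python
import numpy as np

a, b = np.array([1.0, 2.0]), 3.0       # assumed toy data

def dual(u):
    # d(u) = min_x 0.5*||x||^2 + u*(<a, x> - b); the minimizer is x(u) = -u*a
    x = -u * a
    return 0.5 * x @ x + u * (a @ x - b)

u_star = -b / (a @ a)                  # maximizer of the concave dual
x_star = -u_star * a                   # primal point recovered from stationarity
f_star = 0.5 * x_star @ x_star
print(np.isclose(f_star, dual(u_star)))          # strong duality: f* = d*
print(np.allclose(x_star + u_star * a, 0.0),     # KKT stationarity
      np.isclose(a @ x_star, b))                 # primal feasibility
```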


Definition A.28 (Convex Hull) The convex hull of a set X, denoted as conv(X), is the set of all convex combinations of points in X:

conv(X) = { Σ_{i=1}^k α_i x_i | x_i ∈ X, α_i ≥ 0, i = 1, · · · , k, Σ_{i=1}^k α_i = 1 }.

Theorem A.1 (Danskin's Theorem) Let Z be a compact subset of R^m, and let φ : R^n × Z → R be continuous and such that φ(·, z) : R^n → R is convex for each z ∈ Z. Define f : R^n → R by

f(x) = max_{z∈Z} φ(x, z)  and  Z(x) = { z̄ | φ(x, z̄) = max_{z∈Z} φ(x, z) }.

If φ(·, z) is differentiable for all z ∈ Z and ∇_x φ(x, ·) is continuous on Z for each x, then

∂f(x) = conv {∇_x φ(x, z) | z ∈ Z(x)}, ∀x ∈ R^n.
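For intuition (a sketch with an assumed finite Z, which is compact, so Theorem A.1 applies), take φ(x, z) = ⟨x, z⟩; at a point where the maximizer is unique, ∂f(x) is the singleton {∇_x φ(x, z^*)} = {z^*}, which matches a finite-difference gradient:

```python
import numpy as np

Z = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # finite, hence compact, Z
phi = lambda x: Z @ x                                # phi(x, z_i) = <x, z_i>
f = lambda x: np.max(phi(x))                         # f(x) = max_z phi(x, z)

x = np.array([2.0, 1.0])
z_star = Z[np.argmax(phi(x))]   # unique maximizer here, so f is differentiable
eps = 1e-6
num_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(2)])
print(z_star, np.allclose(num_grad, z_star, atol=1e-5))
```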

Definition A.29 (Saddle Point) (x^*, λ^*) is called a saddle point of a function f(x, λ) : C × D → R if it satisfies the following inequalities:

f(x^*, λ) ≤ f(x^*, λ^*) ≤ f(x, λ^*), ∀x ∈ C, λ ∈ D.
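For example (a minimal sketch with an assumed toy function), f(x, λ) = x² − λ² has the saddle point (x^*, λ^*) = (0, 0), and both inequalities of Definition A.29 can be checked on a grid:

```python
import numpy as np

f = lambda x, lam: x ** 2 - lam ** 2
x_s, lam_s = 0.0, 0.0                  # candidate saddle point
grid = np.linspace(-3.0, 3.0, 121)

# Definition A.29: f(x*, lam) <= f(x*, lam*) <= f(x, lam*) for all x, lam
print(np.all(f(x_s, grid) <= f(x_s, lam_s)),
      np.all(f(x_s, lam_s) <= f(grid, lam_s)))
```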

A.4 Nonconvex Analysis

Definition A.30 (Proper Function) A function g : R^n → (−∞, +∞] is said to be proper if dom g ≠ ∅, where dom g = {x ∈ R^n : g(x) < +∞}.

Definition A.31 (Lower Semicontinuous Function) A function g : R^n → (−∞, +∞] is said to be lower semicontinuous at a point x_0 if

lim inf_{x→x_0} g(x) ≥ g(x_0).

Definition A.32 (Coercive Function) F(x) is called coercive if inf_x F(x) > −∞ and the sublevel set {x | F(x) ≤ a} is bounded for every a.

Definition A.33 (Subdifferential) Let f be a proper and lower semicontinuous function.


1. For a given x ∈ dom f, the Fréchet subdifferential of f at x, written as ∂̂f(x), is the set of all vectors u ∈ R^n satisfying

lim inf_{y≠x, y→x} [f(y) − f(x) − ⟨u, y − x⟩] / ‖y − x‖ ≥ 0.

2. The limiting subdifferential, or simply the subdifferential, of f at x ∈ R^n, written as ∂f(x), is defined through the following closure process:

∂f(x) := {u ∈ R^n : ∃ x^k → x with f(x^k) → f(x) and u^k ∈ ∂̂f(x^k) with u^k → u, as k → ∞}.

Definition A.34 (Critical Point) A point x is called a critical point of a function f if 0 ∈ ∂f(x).

The following lemma describes the properties of the subdifferential.

Lemma A.2
1. In the nonconvex context, Fermat's rule remains unchanged: if x ∈ R^n is a local minimizer of g, then 0 ∈ ∂g(x).
2. Let (x^k, u^k) be a sequence such that x^k → x, u^k → u, g(x^k) → g(x), and u^k ∈ ∂g(x^k); then u ∈ ∂g(x).
3. If f is a continuously differentiable function, then ∂(f + g)(x) = ∇f(x) + ∂g(x).

Definition A.35 (Desingularizing Function) A function ϕ : [0, η) → R_+ satisfying the following conditions is called a desingularizing function: (1) ϕ is concave and continuously differentiable on (0, η); (2) ϕ is continuous at 0 with ϕ(0) = 0; and (3) ϕ′(x) > 0, ∀x ∈ (0, η). Φ_η denotes the set of desingularizing functions defined on [0, η).

Now we define the KŁ function. More introductions and applications can be found in [1–3].

Definition A.36 (Kurdyka–Łojasiewicz (KŁ) Property) A function f : R^n → (−∞, +∞] is said to have the Kurdyka–Łojasiewicz (KŁ) property at ū ∈ dom ∂f := {x ∈ R^n : ∂f(x) ≠ ∅} if there exist η ∈ (0, +∞], a neighborhood U of ū, and a desingularizing function ϕ ∈ Φ_η, such that for all u ∈ U ∩ {u ∈ R^n : f(ū) < f(u) < f(ū) + η}, the following inequality holds:

ϕ′(f(u) − f(ū)) dist(0, ∂f(u)) > 1.


Lemma A.3 (Uniform Kurdyka–Łojasiewicz Property) Let Ω be a compact set and let f : R^n → (−∞, +∞] be a proper and lower semicontinuous function. Assume that f is constant on Ω and satisfies the KŁ property at each point of Ω. Then there exist ε > 0, η > 0, and ϕ ∈ Φ_η, such that for all ū ∈ Ω and all u in the intersection

{u ∈ R^n : dist(u, Ω) < ε} ∩ {u ∈ R^n : f(ū) < f(u) < f(ū) + η},

the following inequality holds:

ϕ′(f(u) − f(ū)) dist(0, ∂f(u)) > 1.

Functions satisfying the KŁ property are quite general. Typical examples include: real polynomial functions, the logistic loss function log(1 + e^{−t}), ‖x‖_p (p ≥ 0), ‖x‖_∞, and indicator functions of the positive semidefinite (PSD) cone, the Stiefel manifolds, and the set of constant-rank matrices.
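As a small illustration (a numeric sketch with an assumed exponent p and constant c; not an example from the book), f(t) = |t|^p satisfies the KŁ property at t̄ = 0 with desingularizing function ϕ(s) = c·s^{1/p}: the product ϕ′(f(t) − f(0)) · dist(0, ∂f(t)) is identically c, so any c > 1 makes the KŁ inequality hold near 0.

```python
import numpy as np

p, c = 4.0, 2.0                                       # assumed exponent, constant
f = lambda t: np.abs(t) ** p
df = lambda t: p * np.sign(t) * np.abs(t) ** (p - 1)  # dist(0, grad f(t)) = |f'(t)|
phi_prime = lambda s: (c / p) * s ** (1.0 / p - 1.0)  # phi(s) = c * s**(1/p)

for t in [0.5, 0.1, 1e-3]:                            # points approaching 0
    kl_product = phi_prime(f(t) - f(0.0)) * abs(df(t))
    print(np.isclose(kl_product, c))                  # equals c > 1 for every t
```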

References

1. H. Attouch, J. Bolte, P. Redont, A. Soubeyran, Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka–Łojasiewicz inequality. Math. Oper. Res. 35(2), 438–457 (2010)
2. H. Attouch, J. Bolte, B.F. Svaiter, Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Math. Program. 137(1–2), 91–129 (2013)
3. J. Bolte, S. Sabach, M. Teboulle, Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146(1–2), 459–494 (2014)
4. S. Boyd, L. Vandenberghe, Convex Optimization (Cambridge University Press, Cambridge, 2004)
5. K.C. Kiwiel, Proximal minimization methods with generalized Bregman functions. SIAM J. Control Optim. 35(4), 1142–1168 (1997)
6. Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course (Springer, New York, 2004)
7. N. Parikh, S. Boyd, Proximal algorithms. Found. Trends Optim. 1(3), 127–239 (2014)

Index

A

Accelerated alternating direction method of multiplier (Acc-ADMM), 79, 87, 99
Accelerated augmented Lagrange multiplier method, 4, 71, 72
Accelerated dual ascent, 4
Accelerated gradient descent (AGD), 3–5, 11, 12, 61, 68, 71, 109, 118, 125, 126, 128, 129, 203, 211, 212
Accelerated Lagrange multiplier method, 67, 68
Accelerated linearized augmented Lagrangian method, 4
Accelerated linearized penalty method, 4
Accelerated penalty method, 61, 62
Accelerated primal-dual method, 99
Accelerated proximal gradient (APG), 3, 18, 21, 25, 38, 42, 109, 211
Accelerated proximal point method, 5, 37
Accelerated stochastic alternating direction method of multiplier (Acc-SADMM), 177–179, 202
Accelerated stochastic coordinate descent (ASCD), 139, 140, 147, 230
Accelerated stochastic dual coordinate ascent (Acc-SDCA), 160
Almost convex AGD (AC-AGD), 5, 126, 129, 131, 133
Alternating direction method of multiplier (ADMM), 4, 73, 74, 76, 78, 79, 85, 86, 202
Asynchronous accelerated coordinate descent (AACD), 5
Asynchronous accelerated gradient descent (AAGD), 5, 210, 213, 223

Asynchronous accelerated stochastic coordinate descent (AASCD), 223
Asynchronous gradient descent, 209
Asynchronous stochastic coordinate descent, 210, 223
Asynchronous stochastic gradient descent (ASGD), 210
Asynchronous stochastic variance reduced gradient (ASVRG), 210, 223
Augmented Lagrange multiplier method, 71
Augmented Lagrangian function, 59, 71, 73

B

Black-box acceleration, 160
Bregman distance, 23, 45, 49, 51, 52, 268
Bregman Lagrangian, 49

C

Catalyst, 5, 37, 160–162, 176
Compact set, 110, 112, 113, 269, 270, 272
Composite optimization, 17
Conjugate function, 23, 40, 147, 268
Continuation technique, 61, 62
Critical point, 5, 109, 110, 113, 118, 271

D

Danskin's theorem, 60, 67, 270
Desingularizing function, 110, 113, 114, 117, 271
Distributed accelerated gradient descent, 6
Distributed ADMM, 5
Distributed dual ascent, 5



Distributed dual coordinate ascent, 5
Distributed stochastic communication accelerated dual (DSCAD), 241
Dual norm, 103
Dual problem, 4, 58, 68, 71, 147, 238, 242

E

Ergodic, 78, 85
Estimate function, 51–53
Estimate sequence, 12, 13, 18, 162
Euler method, 51
Exact first-order algorithm (EXTRA), 242

F

Faster Frank–Wolfe method, 104
Fenchel–Young inequality, 53, 268
Frank–Wolfe method, 103–105

G

Gossip matrix, 242

H

Hölderian error bound condition, 38
Heavy-ball method, 3
High order accelerated gradient descent, 44

I

Incremental first-order oracle (IFO), 138, 148, 161, 168
Inexact accelerated gradient descent, 36
Inexact accelerated proximal gradient, 27
Inexact accelerated proximal point method, 37
Inexact proximal mapping, 27

J

Jensen's inequality, 173, 264, 265

K

Katyusha, 5, 152, 153, 161, 180
KKT condition, 71, 73, 269
KKT point, 59, 61, 177, 269
Kurdyka–Łojasiewicz (KŁ) condition/property, 109, 110, 113, 114, 117, 271, 272
KŁ function, 271

L

Lagrange dual function, 67, 68, 269

Lagrange multiplier method, 71
Lagrangian function, 49, 179, 268
Laplacian matrix, 241, 263
Linearized alternating direction method of multiplier (Linearized ADMM), 73
Lyapunov function, 19, 21, 25, 33, 34, 50

M

Minimization by incremental surrogate optimization (MISO), 147
Mirror descent, 4, 23, 51
Momentum, 5, 11, 12, 17, 109, 110, 118, 138–140, 152, 161, 168, 175–177, 202, 213, 223, 236, 241, 242
Momentum compensation, 210, 223
Monotone APG, 117
Monotonicity, 60, 74, 76, 267

N

Negative curvature descent (NCD), 5, 118, 121, 128, 129, 131–133
Negative curvature search (NC-Search), 176
Nesterov–Polyak method, 5, 125
Non-ergodic, 78, 85, 86, 89, 177, 202
Nonconvex AGD, 122, 129

P

Performance estimation problem, 4
Primal-dual gap, 95
Primal-dual method, 4, 23, 24, 95, 96, 99
Proximal mapping, 3, 4, 17, 27, 28, 112, 139, 177, 262, 267

Q

Quadratic functional growth, 103–105

R

Restart, 38

S

Saddle point, 5, 24, 95, 109, 118, 125, 270
Singular value decomposition (SVD), 58, 257, 258, 263
Slater's condition, 269
Stochastic accelerated gradient descent (SAGD), 203
Stochastic average gradient (SAG), 147
Stochastic coordinate descent (SCD), 5, 140, 223

Stochastic dual coordinate ascent (SDCA), 147
Stochastic gradient descent (SGD), 5, 125, 147
Stochastic path-integrated differential estimator (SPIDER), 5, 168–171, 175, 176
Stochastic primal-dual method, 5
Stochastic variance reduced gradient (SVRG), 147, 148, 152, 161, 168

Strongly convex set, 103, 104

V

Variance reduction (VR), 5, 138, 147, 152, 160, 168, 177, 210