Proceedings of Symposia in Applied Mathematics
Volume 72

Modern Aspects of Random Matrix Theory
AMS Short Course: Random Matrices
January 6–7, 2013, San Diego, California

Van H. Vu, Editor

AMS SHORT COURSE LECTURE NOTES
Introductory Survey Lectures published as a subseries of Proceedings of Symposia in Applied Mathematics

EDITORIAL COMMITTEE: Mary Pugh (Chair), Daniel Rockmore

2010 Mathematics Subject Classification. Primary 11C20, 60B20, 15B52, 05D40, 60H25, 62-07.

Library of Congress Cataloging-in-Publication Data

AMS Short Course, Random Matrices (2013 : San Diego, Calif.)
Modern aspects of random matrix theory : AMS Short Course, Random Matrices, January 6–7, 2013, San Diego, California / Van H. Vu, editor.
pages cm. — (Proceedings of symposia in applied mathematics ; volume 72)
Includes bibliographical references and index.
ISBN 978-0-8218-9471-2 (alk. paper)
1. Random matrices—Congresses. 2. Number theory—Congresses. I. Vu, Van, 1970– editor of compilation. II. Title.
QA196.5.A47 2013
512.5—dc23
2013051063

Copying and reprinting. Material in this book may be reproduced by any means for educational and scientific purposes without fee or permission with the exception of reproduction by services that collect fees for delivery of documents and provided that the customary acknowledgment of the source is given. This consent does not extend to other kinds of copying for general distribution, for advertising or promotional purposes, or for resale. Requests for permission for commercial use of material should be addressed to the Acquisitions Department, American Mathematical Society, 201 Charles Street, Providence, Rhode Island 02904-2294, USA. Requests can also be made by e-mail to [email protected].

Excluded from these provisions is material in articles for which the author holds copyright. In such cases, requests for permission to use or reprint should be addressed directly to the author(s). (Copyright ownership is indicated in the notice in the lower right-hand corner of the first page of each article.)

© 2014 by the American Mathematical Society. All rights reserved. The American Mathematical Society retains all rights except those granted to the United States Government. Copyright of individual articles may revert to the public domain 28 years after publication. Contact the AMS for copyright status of individual articles.

Printed in the United States of America. The paper used in this book is acid-free and falls within the guidelines established to ensure permanence and durability. Visit the AMS home page at http://www.ams.org/

Contents

Preface . . . vii

Lecture notes on the circular law, by Charles Bordenave and Djalil Chafaï . . . 1

Free probability and random matrices, by Alice Guionnet . . . 35

Random matrix theory, numerical computation and applications, by Alan Edelman, Brian D. Sutton, and Yuyang Wang . . . 53

Recent developments in non-asymptotic theory of random matrices, by Mark Rudelson . . . 83

Random matrices: The universality phenomenon for Wigner ensembles, by Terence Tao and Van Vu . . . 121

Index . . . 173

Preface

The theory of random matrices is an amazingly rich topic in mathematics. Besides being interesting in its own right, random matrices play a fundamental role in various areas such as statistics, mathematical physics, combinatorics, theoretical computer science, number theory and numerical analysis, to mention a few. A famous example is the work of the physicist Eugene Wigner, who used the spectrum of random matrices to model energy levels of atoms, and consequently discovered the fundamental semicircle law, which describes the limiting distribution of the eigenvalues of a random Hermitian matrix.

Special random matrix models in which the entries are iid complex or real Gaussian random variables (GUE, GOE or Wishart) have been studied in detail. However, much less was known about general models, as the studies just mentioned rely very heavily on properties of the Gaussian distribution. In the last ten years or so, we have witnessed considerable progress on several fundamental problems concerning general models, such as the Circular Law conjecture or the Universality conjectures. More importantly, these new results are proved using novel and robust approaches which seem to be applicable to many other problems. Surprising connections to the emerging field of free probability have also been made and fortified. Equally surprising is the discovery that many practical tricks for numerical problems (to make the computation of eigenvalues faster or more reliable, say) can also be used as powerful theoretical tools to study spectral limits.

Another area where we see rapid progress is the theory of computing and applications (which includes numerical analysis, theoretical computer science, machine learning and data analysis). Here properties of random matrices have been used for the purpose of designing and analyzing practical algorithms. As already realized by von Neumann and Goldstine at the dawn of the computer era, bounds on the condition number of large random matrices play a central role in a vast number of numerical problems. Their questions were posed nearly 70 years ago, but effective ways to estimate this quantity have only been found in recent years. As a model for random noise and error, random matrices enter all problems concerning large data, perhaps one of the most talked about subjects in applied science in recent years. Today, random matrices are studied not only for their own mathematical beauty, but also for the very real purpose of making digital images sharper or computer networks more reliable. These new goals have motivated new lines of research, such as non-asymptotic and large deviation theory for random matrices.

This volume contains surveys by leading researchers in the field, written in an introductory style to quickly provide a broad picture of this fascinating and rapidly developing topic. We aim to touch on most of the key points mentioned above (and many more) without putting too much technical burden on the readers. Most of the surveys are accessible with basic knowledge of probability and linear algebra. We have also made an attempt to discuss a considerable number of open problems. Some of these are classical, but many are new, motivated by current developments. These problems may serve as a guideline for future research, especially for young researchers who would like to study this wonderful subject.

Van H. Vu
New Haven, Fall 2013

Proceedings of Symposia in Applied Mathematics Volume 72, 2014 http://dx.doi.org/10.1090/psapm/072/00617

Lecture notes on the circular law
Charles Bordenave and Djalil Chafaï

Abstract. The circular law theorem states that the empirical spectral distribution of an n × n random matrix with i.i.d. entries of variance 1/n tends to the uniform law on the unit disc of the complex plane as the dimension n tends to infinity. This phenomenon is the non-Hermitian counterpart of the semicircular limit for Wigner random Hermitian matrices, and of the quarter circular limit for Marchenko-Pastur random covariance matrices. In these expository notes, we present a proof in the Gaussian case, due to Mehta and Silverstein, based on a formula by Ginibre, and a proof of the universal case by revisiting the approach of Tao and Vu, based on the Hermitization of Girko, the logarithmic potential, and the control of the small singular values. We also discuss some related models and open problems.

2010 Mathematics Subject Classification. Primary 15B52 (60B20; 60F15).
Key words and phrases. Spectrum, singular values, eigenvalues, random matrices, random graphs, circular law, Ginibre ensemble, non-Hermitian matrices, non-normal matrices.
© 2014 Bordenave and Chafaï

These notes constitute an abridged and updated version of the probability survey [BC], prepared on the occasion of the American Mathematical Society Short Course on Random Matrices, organized by Van H. Vu for the 2013 AMS-MAA Joint Mathematics Meeting held January 9–13 in San Diego, CA, USA. Section 1 introduces the notions of eigenvalues and singular values and discusses their relationships. Section 2 states the circular law theorem. Section 3 is devoted to the Gaussian model known as the Complex Ginibre Ensemble, for which the law of the spectrum is known and leads to the circular law. Section 4 provides the proof of the circular law theorem in the universal case, using the approach of Tao and Vu based on the Hermitization of Girko and the logarithmic potential. Section 5 finally gathers some comments on related problems and models.

All random variables are defined on a common probability space (Ω, A, P). An element of Ω is denoted ω. We write a.s., a.a., and a.e. for almost surely, Lebesgue almost all, and Lebesgue almost everywhere respectively.

1. Two kinds of spectra

The eigenvalues of a matrix A ∈ M_n(C) are the roots in C of its characteristic polynomial P_A(z) := det(A − zI). We label them λ_1(A), . . . , λ_n(A) so that |λ_1(A)| ≥ · · · ≥ |λ_n(A)|, with growing phases. The spectral radius is |λ_1(A)|. The eigenvalues form the algebraic spectrum of A.


The singular values of A are defined by

s_k(A) := λ_k(√(AA*)) for all 1 ≤ k ≤ n,

where A* = Ā^⊤ is the conjugate-transpose. We have s_1(A) ≥ · · · ≥ s_n(A) ≥ 0. The matrices A, A^⊤, Ā have the same singular values. The Hermitian matrix

H_A := ( 0   A
         A*  0 )

is 2n × 2n with eigenvalues s_1(A), −s_1(A), . . . , s_n(A), −s_n(A). This turns out to be useful because the mapping A ↦ H_A is linear in A, in contrast with the mapping A ↦ √(AA*). If A ∈ {0, 1}^{n×n} then A is the adjacency matrix of a graph, while H_A is the adjacency matrix of a bipartite nonoriented graph. Geometrically, the matrix A maps the unit sphere to an ellipsoid, the half-lengths of its principal axes being exactly the singular values of A. The operator norm or spectral norm of A is

‖A‖_{2→2} := max_{‖x‖_2=1} ‖Ax‖_2 = s_1(A), while s_n(A) = min_{‖x‖_2=1} ‖Ax‖_2.

The rank of A is equal to the number of non-zero singular values. If A is non-singular then s_i(A^{−1}) = s_{n−i+1}(A)^{−1} for all 1 ≤ i ≤ n and s_n(A) = s_1(A^{−1})^{−1} = ‖A^{−1}‖_{2→2}^{−1}.
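As a quick numerical illustration of the Hermitization H_A (a minimal MATLAB sketch in the style of the algorithms appearing later in this volume; the dimension n = 5 is an arbitrary choice of ours), one can check that the eigenvalues of H_A are exactly the singular values of A together with their negatives:

n = 5;
A = randn(n) + 1i*randn(n);          % an arbitrary complex test matrix
HA = [zeros(n) A; A' zeros(n)];      % the Hermitization of A
ev = sort(eig(HA));                  % 2n real eigenvalues of HA
sv = svd(A);                         % n singular values of A
disp(norm(ev - sort([-sv; sv])))     % agrees up to rounding errors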

Figure 1. Largest and smallest singular values of A ∈ M_2(R). Taken from [CGLP] and used with permission of the Société Mathématique de France.

Since the singular values are the eigenvalues of a Hermitian matrix, we have variational formulas for all of them, often called the Courant-Fischer variational formulas [HJ, Theorem 3.1.2]. Namely, denoting G_{n,i} the Grassmannian of all i-dimensional subspaces, we have

s_i(A) = max_{E ∈ G_{n,i}} min_{x ∈ E, ‖x‖_2=1} ‖Ax‖_2 = max_{E,F ∈ G_{n,i}} min_{(x,y) ∈ E×F, ‖x‖_2=‖y‖_2=1} |⟨Ax, y⟩|.

Most useful properties of the singular values are consequences of their Hermitian nature via these variational formulas, which are valid in C^n if A ∈ M_n(C), and in R^n if A ∈ M_n(R). In contrast, there are no such variational formulas for the eigenvalues in great generality, beyond the case of normal matrices. It follows from the Schur unitary triangularization¹ that for any fixed A ∈ M_n(C), we have s_i(A) = |λ_i(A)| for every 1 ≤ i ≤ n if and only if A is normal (i.e. AA* = A*A)². Beyond normal matrices, the relationships between eigenvalues and singular values are captured by a set of inequalities due to Weyl, which can be obtained by using the Schur unitary triangularization, see for instance [HJ, Th. 3.3.2 p. 171].

¹ If A ∈ M_n(C) then there exists a unitary matrix U such that T = UAU* is upper triangular.
² We always use the word normal in this way, and never as a synonym for Gaussian.


Theorem 1.1 (Weyl inequalities). For every A ∈ M_n(C) and 1 ≤ k ≤ n,

(1.1) ∏_{i=1}^{k} |λ_i(A)| ≤ ∏_{i=1}^{k} s_i(A).

The reversed form ∏_{i=n−k+1}^{n} s_i(A) ≤ ∏_{i=n−k+1}^{n} |λ_i(A)| for every 1 ≤ k ≤ n can be deduced easily (exercise!). Equality is achieved for k = n and we have

(1.2) ∏_{i=1}^{n} |λ_i(A)| = |det(A)| = √(|det(A)||det(A*)|) = |det(√(AA*))| = ∏_{i=1}^{n} s_i(A).

One may deduce from Weyl's inequalities that (see [HJ, Theorem 3.3.13])

(1.3) ∑_{i=1}^{n} |λ_i(A)|² ≤ ∑_{i=1}^{n} s_i(A)² = Tr(AA*) = ∑_{i,j=1}^{n} |A_{i,j}|².

Since s_1(·) = ‖·‖_{2→2} we have for any A, B ∈ M_n(C) that

(1.4) s_1(AB) ≤ s_1(A) s_1(B) and s_1(A + B) ≤ s_1(A) + s_1(B).

We define the empirical eigenvalues and singular values measures by

μ_A := (1/n) ∑_{k=1}^{n} δ_{λ_k(A)} and ν_A := (1/n) ∑_{k=1}^{n} δ_{s_k(A)}.

Note that μ_A and ν_A are supported respectively in C and R_+. There is a rigid determinantal relationship between μ_A and ν_A, namely from (1.2) we get

∫ log|λ| dμ_A(λ) = (1/n) ∑_{i=1}^{n} log|λ_i(A)| = (1/n) log|det(A)| = (1/n) ∑_{i=1}^{n} log(s_i(A)) = ∫ log(s) dν_A(s).

This identity is at the heart of the Hermitization technique used in section 4. The singular values are quite regular functions of the matrix entries. For instance, the Courant-Fischer formulas imply that the map A ↦ (s_1(A), . . . , s_n(A)) is 1-Lipschitz for the operator norm and the ℓ∞ norm: for any A, B ∈ M_n(C),

(1.5) max_{1≤i≤n} |s_i(A) − s_i(B)| ≤ s_1(A − B).

Recall that M_n is a Hilbert space for the dot product A · B = Tr(AB*). The associated norm ‖·‖_HS, called the Hilbert-Schmidt norm³, satisfies

(1.6) ‖A‖²_HS = Tr(AA*) = ∑_{i=1}^{n} s_i(A)² = n ∫ s² dν_A(s).

In the sequel, we say that a sequence of (possibly signed) measures (η_n)_{n≥1} on C (respectively on R) tends weakly to a (possibly signed) measure η, and we denote η_n ⇝ η, when for all continuous and bounded functions f : C → R (respectively f : R → R),

lim_{n→∞} ∫ f dη_n = ∫ f dη.

³ Also known as the trace norm, the Schur norm, or the Frobenius norm.

Example 1.2 (Spectra of non-normal matrices). The eigenvalues depend continuously on the entries of the matrix. It turns out that for non-normal matrices, the eigenvalues are more sensitive to perturbations than the singular values. Among non-normal matrices, we find non-diagonalizable matrices, including nilpotent matrices. Let us recall a striking example taken from [Ś] and [BS1, Chapter 10]. Let us consider A, B ∈ M_n(R) given by

A = ( 0   1  0  ···  0          B = ( 0   1  0  ···  0
      0   0  1  ···  0                0   0  1  ···  0
      ⋮   ⋮  ⋮   ⋱   ⋮                ⋮   ⋮  ⋮   ⋱   ⋮
      0   0  0  ···  1                0   0  0  ···  1
      0   0  0  ···  0 )              κ_n 0  0  ···  0 )

where (κ_n) is a sequence of positive real numbers. The matrix A is nilpotent, and B is a perturbation with small norm (and rank one!):

rank(A − B) = 1 and ‖A − B‖_{2→2} = κ_n.

We have λ_1(A) = · · · = λ_n(A) = 0 and thus μ_A = δ_0. In contrast, B^n = κ_n I and thus λ_k(B) = κ_n^{1/n} e^{2kπi/n} for all 1 ≤ k ≤ n, which gives

μ_B ⇝ Uniform{z ∈ C : |z| = 1}

as soon as κ_n^{1/n} → 1 (this allows κ_n → 0). On the other hand, the identities

AA* = diag(1, . . . , 1, 0) and BB* = diag(1, . . . , 1, κ_n²)

give s_1(A) = · · · = s_{n−1}(A) = 1 and s_1(B) = · · · = s_{n−1}(B) = 1, and thus

ν_A ⇝ δ_1 and ν_B ⇝ δ_1.

This shows the stability of the limiting singular values distribution under an additive perturbation of rank 1 of arbitrarily large norm, and the instability of the limiting eigenvalues distribution under an additive perturbation of rank 1 and small norm.
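This example is easy to reproduce numerically. A minimal MATLAB sketch (the choice κ_n = 1/n is ours; any κ_n with κ_n^{1/n} → 1 works):

n = 200; kappa = 1/n;
A = diag(ones(n-1,1), 1);           % nilpotent shift: all eigenvalues equal 0
B = A; B(n,1) = kappa;              % rank-one perturbation of operator norm kappa
ev = eig(B);                        % n points of modulus kappa^(1/n), nearly 1
plot(real(ev), imag(ev), '.'), axis equal  % approximately uniform on the unit circle
disp([rank(A - B), norm(A - B)])    % rank 1 and norm kappa, as in the text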

We must keep in mind the fact that the singular values are related to the geometry of the matrix rows. We end this section with a couple of lemmas relating row distances to norms of the inverse, which are used in the sequel.

Lemma 1.3 (Rows and operator norm of the inverse). Let A ∈ M_n(C) with rows R_1, . . . , R_n. Define the vector space R_{−i} := span{R_j : j ≠ i}. We have then

n^{−1/2} min_{1≤i≤n} dist(R_i, R_{−i}) ≤ s_n(A) ≤ min_{1≤i≤n} dist(R_i, R_{−i}).

Proof of lemma 1.3. The argument is essentially in [RuVe1]. Since A and A^⊤ have the same singular values, one can consider the columns C_1, . . . , C_n of A instead


of the rows. For every column vector x ∈ C^n and 1 ≤ i ≤ n, the triangle inequality and the identity Ax = x_1 C_1 + · · · + x_n C_n give

‖Ax‖_2 ≥ dist(Ax, C_{−i}) = min_{y ∈ C_{−i}} ‖Ax − y‖_2 = min_{y ∈ C_{−i}} ‖x_i C_i − y‖_2 = |x_i| dist(C_i, C_{−i}).

If ‖x‖_2 = 1 then necessarily |x_i| ≥ n^{−1/2} for some 1 ≤ i ≤ n, and therefore

s_n(A) = min_{‖x‖_2=1} ‖Ax‖_2 ≥ n^{−1/2} min_{1≤i≤n} dist(C_i, C_{−i}).

Conversely, for every 1 ≤ i ≤ n, there exists a vector y with y_i = 1 such that

dist(C_i, C_{−i}) = ‖y_1 C_1 + · · · + y_n C_n‖_2 = ‖Ay‖_2 ≥ ‖y‖_2 min_{‖x‖_2=1} ‖Ax‖_2 ≥ s_n(A),

where we used the fact that ‖y‖_2² = |y_1|² + · · · + |y_n|² ≥ |y_i|² = 1. □

Lemma 1.4 (Rows and trace norm of the inverse). Let 1 ≤ m ≤ n. If A ∈ M_{m,n}(C) has full rank, with rows R_1, . . . , R_m and R_{−i} := span{R_j : j ≠ i}, then

∑_{i=1}^{m} s_i(A)^{−2} = ∑_{i=1}^{m} dist(R_i, R_{−i})^{−2}.

Proof. We follow the proof given in [TV4, Lemma A4]. The orthogonal projection of R_i* on the subspace R_{−i} is B*(BB*)^{−1}BR_i*, where B is the (m−1) × n matrix obtained from A by removing the row R_i. In particular, by the Pythagoras theorem,

‖R_i‖² − dist(R_i, R_{−i})² = ‖B*(BB*)^{−1}BR_i*‖_2² = (BR_i*)*(BB*)^{−1}BR_i*.

On the other hand, the Schur block inversion formula states that if M is an m × m matrix then for every partition {1, . . . , m} = I ∪ I^c,

(M^{−1})_{I,I} = (M_{I,I} − M_{I,I^c}(M_{I^c,I^c})^{−1}M_{I^c,I})^{−1}.

We take M = AA* and I = {i}, and we note that (AA*)_{i,j} = R_i R_j*, which gives

((AA*)^{−1})_{i,i} = (R_i R_i* − (BR_i*)*(BB*)^{−1}BR_i*)^{−1} = dist(R_i, R_{−i})^{−2}.

The desired formula follows by taking the sum over i ∈ {1, . . . , m}, since ∑_{i=1}^{m} s_i(A)^{−2} = Tr((AA*)^{−1}). □
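Lemma 1.4 is an exact identity and can be checked numerically. A minimal MATLAB sketch (the dimensions m = 4, n = 7 are our arbitrary choices), computing both sides for a random full-rank rectangular matrix:

m = 4; n = 7;
A = randn(m, n);                        % full rank with probability one
lhs = sum(svd(A).^(-2));                % sum of s_i(A)^(-2)
rhs = 0;
for i = 1:m
    B = A([1:i-1, i+1:m], :);           % A with the row R_i removed
    P = B'*((B*B')\B);                  % orthogonal projection onto R_{-i}
    d = norm(A(i,:)' - P*A(i,:)');      % dist(R_i, R_{-i})
    rhs = rhs + d^(-2);
end
disp(abs(lhs - rhs))                    % zero up to rounding errors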

2. Circular law

The variance of a random variable Z on C is Var(Z) = E(|Z|²) − |E(Z)|². Let (X_{ij})_{i,j≥1} be an infinite table of i.i.d. random variables on C with variance 1. We consider the square random matrix X := (X_{ij})_{1≤i,j≤n} as a random variable in M_n(C). We start with the classical Marchenko-Pastur theorem for the "empirical covariance matrix" (1/n)XX*. This theorem is universal in the sense that the limiting distribution does not depend on the law of X_{11}.

Theorem 2.1 (Marchenko-Pastur quarter circular law). a.s. ν_{n^{−1/2}X} ⇝ Q_2 as n → ∞, where Q_2 is the quarter circular law⁴ on [0, 2] ⊂ R_+ with density

x ↦ (√(4 − x²)/π) 1_{[0,2]}(x).

⁴ Actually, it is a quarter ellipse rather than a quarter circle, due to the normalizing factor 1/π. However, one may use different scales to see a true quarter circle, as in figure 2.

The n^{−1/2} normalization is easily understood from the law of large numbers:

(2.1) ∫ s² dν_{n^{−1/2}X}(s) = (1/n²) ∑_{i=1}^{n} s_i(X)² = (1/n²) Tr(XX*) = (1/n²) ∑_{i,j=1}^{n} |X_{i,j}|² → E(|X_{1,1}|²).

Note that from (1.6) the left hand side is nothing else but (1/n²)‖X‖²_HS. The main subject of these lecture notes is the following counterpart for the eigenvalues.

Theorem 2.2 (Girko circular law). a.s. μ_{n^{−1/2}X} ⇝ C_1 as n → ∞, where C_1 is the circular law⁵, which is the uniform law on the unit disc of C with density

z ↦ (1/π) 1_{{z∈C:|z|≤1}}.

If Z is a complex random variable following the circular law C_1 then the random variables Re(Z) and Im(Z) follow the semicircle law on [−1, 1], and are not independent. Additionally, the random variables |Re(Z)| and |Im(Z)| follow the quarter circular law on [0, 1], and |Z| follows the law with density ρ ↦ 2ρ 1_{[0,1]}(ρ). As we will see in section 4, we will deduce theorem 2.2 from an extension of theorem 2.1 by using a Hermitization technique.

The circular law theorem 2.2 has a long history. It was established through a sequence of partial results during the period 1965–2009, the general case being finally obtained by Tao and Vu [TV4]. Indeed Mehta [Meh1] was the first to obtain a circular law theorem, for the expected empirical spectral distribution in the complex Gaussian case, by using the explicit formula for the spectrum due to Ginibre [Gin]. Edelman was able to prove the same kind of result for the far more delicate real Gaussian case [E]. Silverstein provided an argument to pass from the expected to the almost sure convergence in the complex Gaussian case [Hwa]. Girko worked on the universal version and came up with very good ideas, such as the Hermitization technique [Gir1, Gir3, Gir5, Gir6, Gir7]. Unfortunately, his work was controversial due to a lack of clarity and rigor⁶. In particular, his approach relies implicitly on an unproved uniform integrability related to the behavior of the smallest singular values of random matrices. Let us mention that the Hermitization technique is also present in the work of Widom [Wi] on Toeplitz matrices and in the work of Goldsheid and Khoruzhenko [GK1]. Bai [Ba] was the first to circumvent the problem in the approach of Girko, at the price of bounded density assumptions and moment assumptions⁷. Bai improved his approach in his book written with Silverstein [BS1]. His approach involves the control of the speed of convergence of the singular values distribution. Śniady considered a universal version beyond random matrices and the circular law, using the notion of ∗-moments and the Brown measure of operators in free probability, and a regularization by adding an independent Gaussian Ginibre noise [Ś]. Goldsheid and Khoruzhenko [GK2] successfully used

⁵ It is not customary to call it the "disc law". The terminology corresponds to what we draw: a circle for the circular law, a quarter circle (ellipse) for the quarter circular law, even if it is the boundary of the support in the first case, and the density in the second case. See figure 2.
⁶ Girko's writing style is also quite original, see for instance the more recent paper [GV].
⁷ ". . . I worked for 13 years from 1984 to 1997, which was eventually published in Annals of Probability. It was the hardest problem I have ever worked on." Zhidong Bai, interview with Atanu Biswas in 2006 [CZH].


the logarithmic potential to derive the analogue of the circular law theorem for random non-Hermitian tridiagonal matrices. The smallest singular value of random matrices was the subject of an impressive activity culminating with the works of Tao and Vu [TV3, TV5] and of Rudelson and Vershynin [RuVe1], using tools from asymptotic geometric analysis and additive combinatorics (Littlewood-Offord problems). These achievements allowed Götze and Tikhomirov [GT1] to obtain the expected circular law theorem up to a small loss in the moment assumption, by using the logarithmic potential. Similar ingredients are present in the work of Pan and Zhou [PZ]. At the same time, Tao and Vu, using a refined bound on the smallest singular value and the approach of Bai, deduced the circular law theorem up to a small loss in the moment assumption [TV1]. As in the works of Girko, Bai and their followers, the loss was due to a sub-optimal usage of the Hermitization approach. In [TV4], Tao and Vu finally obtained the full circular law theorem 2.2 by using the full strength of the logarithmic potential, and a new control of the count of the small singular values which replaces the speed of convergence estimates of Bai. See also their synthetic paper [TV2]. We will follow essentially their approach in section 4 to prove theorem 2.2.

The a.s. tightness of μ_{n^{−1/2}X} is easily understood since Weyl's inequalities give

∫ |λ|² dμ_{n^{−1/2}X}(λ) = (1/n²) ∑_{i=1}^{n} |λ_i(X)|² ≤ (1/n²) ∑_{i=1}^{n} s_i(X)² = ∫ s² dν_{n^{−1/2}X}(s).

Note that thanks to the n^{−1/2} normalization and to the strong law of large numbers, the Euclidean norm of each row and of each column of n^{−1/2}X tends almost surely to 1 when n → ∞. Moreover, their scalar products tend to 0. This suggests that n^{−1/2}X is in a sense asymptotically a unitary matrix. The convergence in the couple of theorems above is the weak convergence of probability measures with respect to continuous bounded functions. We recall that this mode of convergence does not capture the convergence of the support. It implies only that a.s.

lim inf_{n→∞} s_1(n^{−1/2}X) ≥ 2 and lim inf_{n→∞} |λ_1(n^{−1/2}X)| ≥ 1.

However, it can be shown that if E(X_{1,1}) = 0 and E(|X_{1,1}|⁴) < ∞ then a.s.

(2.2) lim_{n→∞} s_1(n^{−1/2}X) = 2 and lim_{n→∞} |λ_1(n^{−1/2}X)| = 1,

see [BY, Hwa, Ge, PZ, BS2]. The asymptotic factor 2 between the operator norm and the spectral radius indicates in a sense that X is a non-normal matrix asymptotically as n → ∞ (note that if X_{11} is absolutely continuous then X is absolutely continuous and thus XX* ≠ X*X a.s., which means that X is non-normal a.s.). The law of the modulus under the circular law has density ρ ↦ 2ρ 1_{[0,1]}(ρ), which differs completely from the shape of the quarter circular law s ↦ π^{−1}√(4 − s²) 1_{[0,2]}(s); see figure 3. The integral of "log" is the same for both laws.
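Both theorems and the limits (2.2) can be seen in a small MATLAB experiment in the spirit of figure 2 below (the size n = 1000 and the Bernoulli choice of entries are ours):

n = 1000;
X = sign(randn(n));                 % symmetric Bernoulli entries: mean 0, variance 1
Y = X/sqrt(n);
ev = eig(Y);
plot(real(ev), imag(ev), '.'), axis equal  % roughly uniform on the unit disc
disp(max(svd(Y)))                   % operator norm: close to 2, see (2.2)
disp(max(abs(ev)))                  % spectral radius: close to 1, see (2.2)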

Figure 2. Illustration of universality in the quarter circular law and the circular law theorems 2.1 and 2.2. The plots are made with the singular values (upper plots) and eigenvalues (lower plot) of a single random matrix X of dimension n = 1000. On the left hand side, X_{11} follows a standard Gaussian law on R, while on the right hand side X_{11} follows a symmetric Bernoulli law on {−1, 1}.

3. Gaussian case

This section is devoted to the case where X_{11} ∼ N(0, ½I_2). From now on, we denote G instead of X in order to distinguish the Gaussian case from the general case. We say that G belongs to the Complex Ginibre Ensemble. The Lebesgue density of the n × n random matrix G = (G_{i,j})_{1≤i,j≤n} in M_n(C) ≡ C^{n×n} is

(3.1) A ∈ M_n(C) ↦ π^{−n²} e^{−∑_{i,j=1}^{n} |A_{ij}|²}.

This law is a Boltzmann type distribution with energy

A ↦ ∑_{i,j=1}^{n} |A_{ij}|² = Tr(AA*) = ‖A‖²_HS = ∑_{i=1}^{n} s_i²(A).

This law is unitary invariant, in the sense that if U and V are n × n unitary matrices then UGV and G are equally distributed. If H_1 and H_2 are two independent copies of GUE⁸ then (H_1 + iH_2)/√2 has the law of G. Conversely, the matrices (G + G*)/√2 and (G − G*)/(√2 i) are independent and belong to the GUE.

⁸ Up to scaling, a random n × n Hermitian matrix H belongs to the Gaussian Unitary Ensemble (GUE) when its density is proportional to exp(−½ Tr(H²)). Equivalently, the random variables {H_ii, H_ij : 1 ≤ i ≤ n, i < j ≤ n} are independent, with H_ii ∼ N(0, 1) and H_ij ∼ N(0, ½I_2) for i ≠ j.
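These relations are easy to try out in MATLAB (a minimal sketch with our own choice of n; the entrywise normalization matches X_{11} ∼ N(0, ½I_2)):

n = 500;
G = (randn(n) + 1i*randn(n))/sqrt(2);   % complex Ginibre matrix
H1 = (G + G')/sqrt(2);                  % belongs to the GUE
H2 = (G - G')/(sqrt(2)*1i);             % belongs to the GUE, independent of H1
disp(norm(H2 - H2'))                    % zero up to rounding: H2 is Hermitian
plot(eig(G)/sqrt(n), '.'), axis equal   % eigenvalues spread over the unit disc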

Figure 3. Comparison between the quarter circular distribution of theorem 2.1 for the singular values, and the modulus under the circular law of theorem 2.2 for the eigenvalues. The supports and the shapes are different. This difference indicates the asymptotic non-normality of these matrices. The integral of the function t ↦ log(t) is the same for both distributions.
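The two densities compared in figure 3 are elementary and can be replotted directly (a minimal MATLAB sketch of ours):

x = linspace(0, 2, 200);
plot(x, sqrt(4 - x.^2)/pi)       % quarter circular law on [0,2] (singular values)
hold on
r = linspace(0, 1, 100);
plot(r, 2*r)                     % density of |Z| under the circular law
hold off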

The singular values of G are the square roots of the eigenvalues of the positive semidefinite Hermitian matrix GG*. The matrix GG* is a complex Wishart matrix, and belongs to the complex Laguerre Ensemble (β = 2). The empirical distribution of the singular values of n^{−1/2}G tends to the Marchenko-Pastur quarter circular distribution (the Gaussian case in theorem 2.1). This section is rather devoted to the study of the eigenvalues of G, and in particular to the proof of the circular law theorem 2.2 in this Gaussian setting.

The set of elements of M_n(C) with multiple eigenvalues has zero Lebesgue measure in C^{n×n}. In particular, the set of non-diagonalizable elements of M_n(C) has zero Lebesgue measure in C^{n×n}. Since G is absolutely continuous, we have a.s. GG* ≠ G*G (non-normality) and G is diagonalizable with distinct eigenvalues. Following Ginibre [Gin] – see also [Meh2, Chapter 15; F2] and [KS] – one may then compute the joint density of the eigenvalues λ_1(G), . . . , λ_n(G) of G by integrating (3.1) over the non-eigenvalue variables. Let G = UTU* be the Schur unitary decomposition of G, with U unitary and T upper triangular, T = D + N with D = diag(λ_1(G), . . . , λ_n(G)) and N nilpotent. Since Tr(DN*) = Tr(ND*) = 0, the quadratic nature of Tr(GG*) gives the following remarkable separation of variables:

Tr(GG*) = Tr(DD*) + Tr(NN*).

This leads to the result stated in theorem 3.1 below. It is worthwhile to mention that, in contrast with what we do for the unitary invariant ensembles of random Hermitian matrices, here the computation of the spectrum law is problematic if one replaces the square potential by a more general potential, see [KS]. The law of G is invariant under multiplication of the entries by a common phase, and thus the law of the spectrum of G has the same property. In the sequel we set Δ_n := {(z_1, . . . , z_n) ∈ C^n : |z_1| ≥ · · · ≥ |z_n|}.

Theorem 3.1 (Spectrum law). (λ_1(G), . . . , λ_n(G)) has density n! φ_n 1_{Δ_n} where

φ_n(z_1, . . . , z_n) = (π^{−n}/(1!2! · · · n!)) exp(−∑_{k=1}^{n} |z_k|²) ∏_{1≤i<j≤n} |z_i − z_j|².

For the proof of theorem 2.2 in the universal case, the crucial point is the uniform integrability required by the Hermitization lemma 4.2: it suffices to show that for a.a. z ∈ C there exists p > 0 such that a.s.

(4.9) lim sup_{n→∞} ∫ s^{−p} dν_{n^{−1/2}X−zI}(s) < ∞ and lim sup_{n→∞} ∫ s^{p} dν_{n^{−1/2}X−zI}(s) < ∞.

The second statement in (4.9) with p ≤ 2 follows from the strong law of large numbers (2.1) together with (1.5), which gives s_i(n^{−1/2}X − zI) ≤ s_i(n^{−1/2}X) + |z| for all 1 ≤ i ≤ n. The first statement in (4.9) concentrates most of the difficulty behind theorem 2.2. In the next two sub-sections, we will prove and comment on the following couple of key lemmas, taken from [TV4] and [TV1] respectively.

Lemma 4.6 (Count of small singular values). There exist c_0 > 0 and 0 < γ < 1 such that a.s. for n ≫ 1, n^{1−γ} ≤ i ≤ n − 1, and all M ∈ M_n(C), we have

s_{n−i}(n^{−1/2}X + M) ≥ c_0 (i/n).

Lemma 4.6 is more meaningful when i is close to n^{1−γ}. For i = n − 1, it gives only a lower bound on s_1. The linearity in i is what we may expect on spacing.

Lemma 4.7 (Lower bound on s_n). For every d > 0 there exists b > 0 such that if M ∈ M_n(C) is deterministic with s_1(M) ≤ n^d then a.s. for n ≫ 1,

s_n(X + M) ≥ n^{−b}.

Let us abridge s_i(n^{−1/2}X − zI) into s_i. Applying lemmas 4.6-4.7 with M = −zI and M = −z√n I respectively, we get, for any p > 0, z ∈ C, a.s. for n ≫ 1,

(1/n) ∑_{i=1}^{n} s_i^{−p} ≤ (1/n) ∑_{i=1}^{n−⌈n^{1−γ}⌉} s_i^{−p} + (1/n) ∑_{i=n−⌈n^{1−γ}⌉+1}^{n} s_i^{−p} ≤ c_0^{−p} (1/n) ∑_{i=1}^{n} (n/i)^p + 2 n^{−γ} n^{bp}.

The first term of the right hand side is a Riemann sum for ∫_0^1 s^{−p} ds, which converges as soon as 0 < p < 1. We finally obtain the first statement in (4.9) as soon as 0 < p < min(γ/b, 1). Now the Hermitization lemma 4.2 ensures that there exists a probability measure μ ∈ P(C) such that a.s. μ_Y ⇝ μ as n → ∞ and for all z ∈ C,

U_μ(z) = − ∫_0^∞ log(s) dν_z(s).

Since ν_z does not depend on the law of X_{11} (we say that it is then universal), it follows that μ also does not depend on the law of X_{11}. It remains to show that ∫_0^∞ log(s) dν_z(s) is equal to the logarithmic potential of C_1 given by (4.2). This can be done either indirectly by using the circular law for the complex Gaussian case (theorem 3.4), or by a direct computation as in [PZ, Lemma 3].

Since νz does not depend on the law of X11 (we say that it is then universal), it follows that μ also does not depend on the law of X11 . It remains to show that ∞ log(s) dνz (s) is equal to the logarithmic potential of C1 given by (4.2). This can 0 be done either indirectly by using the circular law for the complex Gaussian case (theorem 3.4), or by a direct computation as in [PZ, Lemma 3]. 4.3. Count of small singular values. This sub-section is devoted to lemma 4.6 used in the proof of theorem 2.2 to check the uniform integrability assumption in lemma 4.2. Lemma 4.6 says that νn−1/2 X−zI ([0, η]) ≤ η/C for every η ≥ 2Cn−γ . In this form, we see that lemma 4.6 is an upper bound on the counting measure nνn−1/2 X−zI on a small interval [0, η]. This type of estimate (local Wegner estimates) has already been studied. Notably, an alternative proof of lemma 4.6 can be obtained following the work of [ESY] on the resolvent of Wigner matrices.


Proof of lemma 4.6. We follow the proof of Tao and Vu [TV4]. Up to increasing γ, it is enough to prove the statement for all 2n^{1−γ} ≤ i ≤ n − 1 for some γ ∈ (0, 1) to be chosen later. To lighten the notations, we denote by s_1 ≥ · · · ≥ s_n the singular values of Y := n^{−1/2}X + M. We fix 2n^{1−γ} ≤ i ≤ n − 1 and we consider the matrix Y' formed by the first m := n − ⌈i/2⌉ rows of √n Y. Let s'_1 ≥ · · · ≥ s'_m be the singular values of Y'. By the Cauchy-Poincaré interlacing¹², we get

n^{−1/2} s'_{n−i} ≤ s_{n−i}.

Next, by lemma 1.4 we obtain

s'_1{}^{−2} + · · · + s'_{n−⌈i/2⌉}{}^{−2} = dist_1^{−2} + · · · + dist_{n−⌈i/2⌉}^{−2},

where dist_j := dist(R_j, H_j) is the distance from the j-th row R_j of Y' to H_j, the subspace spanned by the other rows of Y'. In particular, we have

(4.10) (i/(2n)) s_{n−i}^{−2} ≤ (i/2) s'_{n−i}{}^{−2} ≤ ∑_{j=n−i}^{n−⌈i/2⌉} s'_j{}^{−2} ≤ ∑_{j=1}^{n−⌈i/2⌉} dist_j^{−2}.

Now H_j is independent of R_j and dim(H_j) ≤ n − i/2 ≤ n − n^{1−γ}, and thus, for the choice of γ given in the forthcoming lemma 4.8,

∑_{n≥1} ∑_{i=2n^{1−γ}}^{n−1} ∑_{j=1}^{n−⌈i/2⌉} P(dist_j ≤ √i/(2√2)) < ∞

(note that the exponential bound in lemma 4.8 kills the polynomial factor due to the union bound over i, j). Consequently, by the first Borel-Cantelli lemma, we obtain that a.s. for n ≫ 1, all 2n^{1−γ} ≤ i ≤ n − 1, and all 1 ≤ j ≤ n − ⌈i/2⌉,

dist_j ≥ √i/(2√2) ≥ √i/4.

Finally, (4.10) gives s²_{n−i} ≥ i²/(32n²), i.e. the desired result with c_0 := 1/(4√2). □

Lemma 4.8 (Distance of a random vector to a subspace). There exist γ > 0 and δ > 0 such that for all n ≫ 1, 1 ≤ i ≤ n, any deterministic vector v ∈ C^n and any subspace H of C^n with 1 ≤ dim(H) ≤ n − n^{1−γ}, we have, denoting R := (X_{i1}, . . . , X_{in}) + v,

P(dist(R, H) ≤ (1/2)√(n − dim(H))) ≤ exp(−n^δ).

The exponential bound above is not optimal, but is more than enough for our purposes: in the proof of lemma 4.6, a large enough polynomial bound suffices.

Proof. The argument is due to Tao and Vu [TV4, Proposition 5.1]. We first note that if H' is the vector space spanned by H, v and ER, then

dim(H') ≤ dim(H) + 2 and dist(R, H) ≥ dist(R, H') = dist(R', H'),

¹² If A ∈ M_n(C) and 1 ≤ m ≤ n, and if B ∈ M_{m,n}(C) is obtained from A by deleting r := n − m rows, then s_i(A) ≥ s_i(B) ≥ s_{i+r}(A) for every 1 ≤ i ≤ m. In particular, [s_m(B), s_1(B)] ⊂ [s_n(A), s_1(A)], i.e. the smallest singular value is increased while the largest singular value is diminished. See [HJ, Corollary 3.1.3].


where R' := R − E(R). We may thus directly suppose, without loss of generality, that v = 0 and E(X_{ik}) = 0. Then, it is easy to check that (see the computation below) E(dist(R, H)²) = n − dim(H). The lemma is thus a statement on the deviation probability of dist(R, H).

We first perform a truncation. Let 0 < ε < 1/3. Markov's inequality gives P(|X_{ik}| ≥ n^ε) ≤ n^{−2ε}. Hence, from Hoeffding's deviation inequality¹³, for n ≫ 1,

P(∑_{k=1}^{n} 1_{{|X_{ik}|≤n^ε}} < n − n^{1−ε}) ≤ exp(−2n^{1−2ε}(1 − n^{−ε})²) ≤ exp(−n^{1−2ε}).

It is thus sufficient to prove that the result holds by conditioning on E_m := {|X_{i1}| ≤ n^ε, . . . , |X_{im}| ≤ n^ε} with m := n − n^{1−ε}. Let E_m[·] := E[· | E_m; F_m] denote the conditional expectation given E_m and the filtration F_m generated by X_{i,m+1}, . . . , X_{i,n}. Let W be the subspace spanned by H,

u = (0, . . . , 0, X_{i,m+1}, . . . , X_{i,n}) and w = (E_m[X_{i1}], . . . , E_m[X_{im}], 0, . . . , 0).

Then, by construction, dim(W) ≤ dim(H) + 2 and W is F_m-measurable. If we set λ := E_m[X_{i1}] and Y := (X_{i1} − λ, . . . , X_{im} − λ, 0, . . . , 0) = R − u − w, then

dist(R, H) ≥ dist(R, W) = dist(Y, W).

Next we have

σ² := E_m[|Y_1|²] = E(|X_{i1} − E(X_{i1} | |X_{i1}| ≤ n^ε)|² | |X_{i1}| ≤ n^ε) = 1 − o(1).

Now, let us consider the disc D := {z ∈ C : |z| ≤ n^ε} and the convex function f : D^m → R_+ defined by f(x) = dist((x, 0, . . . , 0), W), which is also 1-Lipschitz: |f(x) − f(x')| ≤ dist(x, x'). Talagrand's concentration inequality¹⁴ then gives

(4.11) P_m(|dist(Y, W) − M_m| ≥ t) ≤ 4 exp(−t²/(16 n^{2ε})),

where M_m is the median of dist(Y, W) under E_m. In particular,

M_m ≥ √(E_m[dist²(Y, W)]) − c n^ε.

¹³ If X_1, . . . , X_n are independent and bounded real random variables, and if S_n := X_1 + · · · + X_n, then P(S_n − ES_n ≤ −tn) ≤ exp(−2n²t²/(d_1² + · · · + d_n²)) for any t ≥ 0, where d_i := max(X_i) − min(X_i). See [McD, Theorem 5.7].
¹⁴ If X_1, . . . , X_n are i.i.d. random variables on D := {z ∈ C : |z| ≤ r} and if f : D^n → R is convex, 1-Lipschitz, with median M, then P(|f(X_1, . . . , X_n) − M| ≥ t) ≤ 4 exp(−t²/(16r²)) for any t ≥ 0. See [Tal] and [Le1, Corollary 4.9].


Also, if P denotes the orthogonal projection on the orthogonal complement of W, we find

E_m dist²(Y, W) = ∑_{k=1}^{m} E_m[|Y_k|²] P_kk = σ² (∑_{k=1}^{n} P_kk − ∑_{k=m+1}^{n} P_kk) ≥ σ²(n − dim(W) − (n − m)) ≥ σ²(n − dim(H) − n^{1−ε} − 2).

Pick some 0 < γ < ε. Then, from the above expression, for any 1/2 < c < 1 and n ≫ 1, M_m ≥ c√(n − dim(H)). We set t = (c − 1/2)√(n − dim(H)) in (4.11). □

4.4. Smallest singular value. This sub-section is devoted to lemma 4.7, which was used in the proof of theorem 2.2 to get the uniform integrability in lemma 4.2. The full proof of lemma 4.7 by Tao and Vu in [TV1] is based on Littlewood-Offord type problems. The main difficulty is the possible presence of atoms in the law of the entries (in this case X is non-invertible with positive probability). Regarding the assumptions, the finite second moment hypothesis on X_{11} is not crucial and can be considerably weakened. For the sake of simplicity, we give here a simplified proof when the law of X_{11} has a bounded density on C or on R (which implies that X + M is invertible with probability one).

Proof of lemma 4.7. By the first Borel-Cantelli lemma, it suffices to show that for every a, d > 0 (actually a > 1 is enough), there exists b > 0 such that if M ∈ M_n(C) is deterministic with s_1(M) ≤ n^d then

P(s_n(X + M) ≤ n^{−b}) ≤ n^{−a}.

As already mentioned, we will only prove this when X_{11} has a bounded density. For every x, y ∈ C^n and S ⊂ C^n, we set x · y := x_1ȳ_1 + · · · + x_nȳ_n, ‖x‖_2 := √(x · x), and dist(x, S) := min_{y∈S} ‖x − y‖_2. Let R_1, . . . , R_n be the rows of X + M and set R_{−i} := span{R_j : j ≠ i} for every 1 ≤ i ≤ n. The lower bound in lemma 1.3 gives

min_{1≤i≤n} dist(R_i, R_{−i}) ≤ √n s_n(X + M)

and consequently, by the union bound, for any u ≥ 0,

P(√n s_n(X + M) ≤ u) ≤ n max_{1≤i≤n} P(dist(R_i, R_{−i}) ≤ u).

Let us fix 1 ≤ i ≤ n. Let Y_i be a unit vector orthogonal to R_{−i}. Such a vector is not unique, but we may just pick a measurable one which is independent of R_i. This defines a random variable on the unit sphere S^{n−1} = {x ∈ C^n : ‖x‖_2 = 1}. By the Cauchy-Schwarz inequality,

|R_i · Y_i| ≤ ‖π_i(R_i)‖_2 ‖Y_i‖_2 = dist(R_i, R_{−i}),

where π_i is the orthogonal projection on the orthogonal complement of R_{−i}. Let ν_i be the distribution of Y_i on S^{n−1}. Since Y_i and R_i are independent, for any u ≥ 0,

P(dist(R_i, R_{−i}) ≤ u) ≤ P(|R_i · Y_i| ≤ u) = ∫_{S^{n−1}} P(|R_i · y| ≤ u) dν_i(y).

Let us assume that X_{11} has a bounded density φ on C. Since ‖y‖_2 = 1, there exists an index j_0 ∈ {1, . . . , n} such that y_{j_0} ≠ 0 with |y_{j_0}|^{−1} ≤ √n. The complex random variable R_i · y is a sum of independent complex random variables, and one of them is X_{ij_0}ȳ_{j_0}, which is absolutely continuous with a density bounded above by √n ‖φ‖_∞. Consequently, by a basic property of convolutions of probability measures, the complex random variable R_i · y is also absolutely continuous with a density φ_i bounded above by √n ‖φ‖_∞, and thus

P(|R_i · y| ≤ u) = ∫_C 1_{{|s|≤u}} φ_i(s) ds ≤ π u² √n ‖φ‖_∞.

Therefore, for every b > 0, we obtain the desired result P(s_n(X + M) ≤ n^{−b−1/2}) = O(n^{3/2−2b}). This scheme remains valid in the case where X_{11} has a bounded density on R. □

5. Comments

This section gathers some comments and extensions around the circular law.

Weak convergence in probability. For simplicity, the main mode of convergence considered in these notes for the empirical spectral distributions of random matrices is the a.s. weak convergence. It is often useful to consider another mode of convergence, which is the weak convergence in probability. The Hermitization lemma 4.2 is also available for this mode of convergence. The details are in [BC].

Replacement principle. A variant of the Hermitization lemma 4.2, known as the "replacement principle", states that if (A_n)_{n≥1} and (B_n)_{n≥1} are sequences where A_n and B_n are random variables in M_n(C) and such that for a.a. z ∈ C, a.s.

(k) lim_{n→∞} (U_{μ_{A_n}}(z) − U_{μ_{B_n}}(z)) = 0,
(kk) log(1 + ·) is uniformly integrable for (ν_{A_n})_{n≥1} and (ν_{B_n})_{n≥1},

then a.s. μ_{A_n} − μ_{B_n} ⇝ 0 as n → ∞. The details are in [TV4, Theorem 2.1]. This replacement principle is the key to obtaining a universality principle going beyond the circular law. Namely, following [TV4] and [Bo], if X and G are the random matrices considered in sections 3-4, obtained from infinite tables with i.i.d. entries, and if (M_n)_{n≥1} is a deterministic sequence such that M_n ∈ M_n(C) and

lim sup_{n→∞} ∫ s^p dν_{M_n}(s) < ∞

for some p > 0, then a.s. μ_{n^{−1/2}X+M_n} − μ_{n^{−1/2}G+M_n} ⇝ 0 as n → ∞.

Logarithmic potential and Cauchy-Stieltjes transform. The Cauchy-Stieltjes transform m_μ : C → C ∪ {∞} of a probability measure μ on C is

m_μ(z) := ∫_C (1/(λ − z)) dμ(λ).

Since 1/|·| is Lebesgue locally integrable on C, the Fubini-Tonelli theorem implies that m_μ(z) is finite for a.a. z ∈ C, and moreover m_μ is locally Lebesgue integrable on C and thus belongs to D'(C). Note that for a matrix A ∈ M_n(C), we have m_{μ_A}(z) = (1/n) Tr((A − zI)^{−1}), which is the normalized trace of the resolvent of A


at point z, finite outside the spectrum of A. Suppose now that μ ∈ P(C). The logarithmic potential is related to the Cauchy-Stieltjes transform via the identity m_μ = 2∂U_μ in D'(C). In particular, since 4∂∂̄ = 4∂̄∂ = Δ in D'(C), we obtain, still in D'(C),

2∂̄m_μ = ΔU_μ = −2πμ.

Thus we can recover μ from m_μ. Note that for any ε > 0, m_μ is bounded on D_ε = {z ∈ C : dist(z, supp(μ)) > ε}. If supp(μ) is one-dimensional then one may completely recover μ from the knowledge of m_μ on D_ε as ε → 0. Note also that m_μ is analytic outside supp(μ), and is thus characterized by its real part or its imaginary part on arbitrarily small balls in the connected components of supp(μ)^c. If supp(μ) is not one-dimensional then one needs the knowledge of m_μ inside the support to recover μ. An elegant solution to this problem is to define a quaternionic Cauchy-Stieltjes transform, replacing the complex variable z by a quaternionic variable h. This idea appears in various works such as [FZ1, GJNP, RC, Ro, BCC1]. The quaternionic Cauchy-Stieltjes transform is a powerful tool well suited for non-Hermitian random matrices, which may completely replace the logarithmic potential, see [BC].

Free probability, Brown spectral measure, and ∗-moments. The circular law has an interpretation in free probability theory, a sub-domain of operator algebra theory connected to random matrices, see the books by Voiculescu, Dykema and Nica [VDN] and by Anderson, Guionnet, and Zeitouni [AGZ]. Let M be an algebra of bounded operators on a Hilbert space H, with unit 1, stable under the adjoint operation ∗. Let τ : M → C be a linear map such that τ(1) = 1 and τ(aa*) = τ(a*a) ≥ 0. For a ∈ M, define |a| := √(aa*). If b ∈ M is self-adjoint, i.e. b* = b, the spectral measure μ_b of b is the unique probability measure on the real line satisfying, for any integer k ∈ N,

τ(b^k) = ∫ t^k dμ_b(t).

Brown spectral measure. For any a ∈ M we define ν_a = μ_{|a|}, which is a probability measure on R_+. In the spirit of (4.7), the Brown spectral measure [Br] of a ∈ M is the unique probability measure μ_a on C which satisfies, for a.a. z ∈ C,

∫ log|z − λ| dμ_a(λ) = ∫ log(s) dν_{a−z}(s).

In the sense of distributions, it is given by the formula¹⁵

(5.1) μ_a = (1/(2π)) Δ ∫ log(s) dν_{a−z}(s).

The fact that the above definition indeed gives a probability measure requires a proof, which can be found in [HS]. Our notation is consistent: first, if a is self-adjoint, then the Brown spectral measure coincides with the spectral measure. Secondly, if M = M_n(C) and τ := (1/n)Tr is the normalized trace on M_n(C), then we retrieve our usual definitions of ν_A and μ_A. It is interesting to point out that the identity (5.1), which is a consequence of the definition of the eigenvalues when M = M_n(C), serves as a definition in the general setting of operator algebras. All these definitions can be extended beyond bounded operators, see Haagerup and Schultz [HS].

Failure of the method of moments. For non-Hermitian matrices, the spectrum does not necessarily belong to the real line, and in general, the limiting spectral distribution is not supported in the real line. The problem here is that the moments are not enough to characterize laws on C. For instance, if Z is a complex random variable following the uniform law C_κ on the centered disc {z ∈ C : |z| ≤ κ} of radius κ, then for every r > 0, E(Z^r) = 0 and thus C_κ is not characterized by its moments. Any rotationally invariant law on C with light tails shares with C_κ the same sequence of null moments. One can try to circumvent the problem by using "mixed moments", which uniquely determine μ by the Weierstrass theorem. Namely, for every A ∈ M_n(C), if A = UTU* is the Schur unitary triangularization of A, with T = D + N, D diagonal and N nilpotent, then for all integers r, r' ≥ 0, with z = x + iy and τ = (1/n)Tr,

∫_C z^r z̄^{r'} dμ_A(z) = (1/n) ∑_{i=1}^{n} λ_i(A)^r λ̄_i(A)^{r'} = τ(D^r D^{∗r'}) ≠ τ(T^r T^{∗r'}) = τ(A^r A^{∗r'}).

Indeed, equality holds true when T = D, i.e. when T is diagonal, i.e. when A is normal. This explains why the method of moments loses its strength for non-normal operators. To circumvent the problem, one may think about using the notion of ∗-moments. Note that if A is normal then for every word A^{ε_1} · · · A^{ε_k}, where ε_1, . . . , ε_k ∈ {1, ∗}, we have τ(A^{ε_1} · · · A^{ε_k}) = τ(A^{k_1} A^{∗k_2}) where k_1, k_2 are the numbers of occurrences of A and A*.

∗-distribution. The ∗-distribution of a ∈ M is the collection of all its ∗-moments: τ(a^{ε_1} a^{ε_2} · · · a^{ε_k}), where k ≥ 1 and ε_1, . . . , ε_k ∈ {1, ∗}. The element c ∈ M is circular when it has the ∗-distribution of w_1 + iw_2 where w_1, w_2 ∈ M are free and semicircular with spectral measure of Lebesgue density x ↦ (1/π)√(2 − x²) 1_{[−√2,√2]}(x). As a free complex variable, c has mean zero and half-unit covariance matrix:

τ(c) = τ(w_1) + iτ(w_2) = 0 and τ( (w_1², w_1w_2; w_2w_1, w_2²) ) = ( 1/2, 0; 0, 1/2 ).

In particular, c has unit variance in the sense that τ(|c|²) = τ(w_1² + w_2²) = 1. It is worthwhile to mention that the circular law appears naturally in the free central limit theorem, which states that if a_1, a_2, . . . is a sequence of free elements of M with identical ∗-distribution of zero mean and half-unit covariance matrix – in other words the moments of a_i and of c match up to second order – then, in ∗-distribution,

(a_1 + · · · + a_n)/√n → c as n → ∞.

The ∗-distribution of a ∈ M allows one to recover the moments of |a − z|² = (a − z)(a − z)* for all z ∈ C, and thus ν_{a−z} for all z ∈ C, and thus the Brown measure μ_a of a. Actually, for a random matrix, the ∗-distribution contains, in addition to the spectral measure, information on the eigenvectors of the matrix. We say that a sequence of matrices (A_n)_{n≥1}, where A_n takes its values in M_n(C), converges in ∗-moments to a ∈ M if all ∗-moments converge to the ∗-moments of a. For example, if G ∈ M_n(C) is our complex Ginibre matrix, then a.s. as n → ∞, n^{−1/2}G converges in ∗-moments to a circular element.

¹⁵ The so-called Fuglede-Kadison determinant of a is exp ∫ log(t) dμ_{|a|}(t), see [FuKa].


Discontinuity of the Brown measure. Due to the unboundedness of the logarithm, the Brown measure μ_a depends discontinuously on the ∗-moments of a [BL, Ś]. A simple counter-example is given by the matrices of example 1.2. For random matrices, this discontinuity is circumvented in the Girko Hermitization by requiring a uniform integrability, which turns out to be a.s. satisfied by the random matrices n^{−1/2}X in the circular law theorem 2.2.

However, Śniady [Ś, Theorem 4.1] has shown that it is always possible to regularize the Brown measure by adding an additive noise. More precisely, if G is as above and (A_n)_{n≥1} is a sequence of matrices where A_n takes its values in M_n(C), and if the ∗-moments of A_n converge to the ∗-moments of a ∈ M as n → ∞, then a.s. as n → ∞, μ_{A_n + t n^{−1/2}G} converges to μ_{a+tc}, where c is a circular element free of a. In particular, by choosing a sequence t_n going to 0 sufficiently slowly, it is possible to regularize the Brown measure: a.s. μ_{A_n + t_n n^{−1/2}G} converges to μ_a. The Śniady theorem was revisited recently by Guionnet, Wood, and Zeitouni [GWZ]. Beyond the Śniady theorem, one may conjecture that a version of the asymptotic freeness theorem of Voiculescu remains valid at the level of Brown spectral measures.

Outliers. The circular law theorem 2.2 allows the blow up of an arbitrary (asymptotically negligible) fraction of the extremal eigenvalues. Indeed, it was shown by Silverstein [S] that if E(|X_{11}|⁴) < ∞ and E(X_{11}) ≠ 0 then the spectral radius |λ_1(n^{−1/2}X)| tends to infinity at speed √n and has a Gaussian fluctuation. This observation of Silverstein is the basis of [C], see also the ideas of Andrew [An]. More recently, Tao studied in [Tao] the outliers produced by various types of perturbations, including general additive perturbations.

Sums and products. The scheme of proof of theorem 2.2 (based on Hermitization, logarithmic potential, and uniform integrability) turns out to be quite robust. It allows for instance to study the limit of the empirical distribution of the eigenvalues of sums and products of random matrices, see [Bo], and also [GT2] in relation with Fuss-Catalan laws. We may also mention [OS]. The crucial step lies in the control of the small singular values.

Cauchy and the sphere. It is well known that the ratio of two independent standard real Gaussian variables is a Cauchy random variable, which has heavy tails. The complex analogue of this phenomenon leads to a complex Cauchy random variable, which is also the image law by the stereographical projection of the uniform law on the sphere. The matrix analogue consists in starting from two independent copies G_1 and G_2 of the Complex Ginibre Ensemble, and considering the random matrix Y = G_1^{−1}G_2. The limit of μ_Y was analyzed by Forrester and Krishnapur [FoKr]. Note that Y does not have i.i.d. entries.

Random circulant matrices. The eigenvalues of a non-Hermitian circulant matrix are linear functionals of the matrix entries. Meckes [Mec] used this fact together with the central limit theorem in order to show that if the entries are i.i.d. with finite positive variance then the scaled empirical spectral distribution of the eigenvalues tends to a Gaussian law. We can imagine a heavy tailed version of this phenomenon with α-stable limiting laws.
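The circulant case is particularly simple to simulate, since the eigenvalues of the circulant matrix with first column c are exactly the discrete Fourier transform fft(c). A minimal MATLAB sketch (the size and scaling are our choices):

n = 2000;
c = randn(n, 1);                        % i.i.d. entries with finite positive variance
lambda = fft(c)/sqrt(n);                % eigenvalues of the circulant matrix, rescaled
plot(real(lambda), imag(lambda), '.')   % an asymptotically Gaussian cloud
axis equal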


Dependent entries. According to Girko, the circular law theorem 2.2 remains valid for random matrices with independent rows under some natural hypotheses [Gir4]. Indeed, a circular law theorem is available for random Markov matrices including the Dirichlet Markov Ensemble [BCC2], for random matrices with i.i.d. log-concave isotropic rows [Ad], for random matrices with given row sums [NV], and for random matrices with projected rows [Tao]. Another Markovian model consists in a non-Hermitian random Markov generator with i.i.d. off-diagonal entries, which gives rise to new limiting spectral distributions, possibly not rotationally invariant, which can be interpreted using free probability theory [BCC3]. Beyond the i.i.d. rows structure, the circular law is also available for random doubly stochastic matrices [Ng], for random matrices with unconditional log-concave law [AC], and for random matrices with exchangeable entries [ACW]. All these models can be analyzed using the Hermitization lemma 4.2; the main technical difficulty in most cases is concentrated in the control of the small singular values.

Another kind of dependence comes from truncation of random matrices with dependent entries, such as Haar unitary matrices. Namely, let U be distributed according to the uniform law on the unitary group U_n (we say that U is Haar unitary). Dong, Jiang, and Li have shown in [DJL] that the empirical spectral distribution of the diagonal sub-matrix (U_{ij})_{1≤i,j≤m} tends to the circular law if m/n → 0, while it tends to the arc law (the uniform law on the unit circle {z ∈ C : |z| = 1}) if m/n → 1. Other results of the same flavor can be found in [Ji]. Yet another way to add some dependence consists in considering an infinite array (X_{ij}, X_{ji})_{1≤i<j}

Random matrix theory, numerical computation and applications
Alan Edelman, Brian D. Sutton, and Yuyang Wang

Algorithm 1 Semicircle law and Tracy-Widom distribution

%Experiment: eigenvalues of GOE/GUE matrices as n -> infinity;
%% Parameters
n=100;        % matrix size
t=5000;       % trials
v=[];         % eigenvalue samples
vl=[];        % largest eigenvalue samples
dx=.2;        % binsize
%% Experiment
for i=1:t
    %% Sample GOE and collect their eigenvalues
    a=randn(n);                    % n by n matrix of random Gaussians
    s=(a+a')/2;                    % symmetrize matrix
    v=[v; eig(s)];                 % eigenvalues
    %% Sample GUE and collect their largest eigenvalues
    a=randn(n)+sqrt(-1)*randn(n);  % random nxn complex matrix
    s=(a+a')/2;                    % Hermitian matrix
    vl=[vl; max(eig(s))];          % largest eigenvalue
end
%% Semicircle law
v=v/sqrt(n/2);                     % normalize eigenvalues
% Plot
[count,x]=hist(v,-2:dx:2);
bar(x,count/(t*n*dx),'y')
% Theory
hold on
plot(x,sqrt(4-x.^2)/(2*pi),'LineWidth',2)
axis([-2.5 2.5 -.1 .5])
hold off
%% Tracy-Widom distribution
vl=n^(1/6)*(vl-2*sqrt(n));         % normalized largest eigenvalues
% Plot
figure; [count,x]=hist(vl,-5:dx:2);
bar(x,count/(t*dx),'y')
% Theory
hold on
tracywidom

Output of Algorithm 1: (a) Semicircle law; (b) Tracy-Widom distribution.


Algorithm 2 Tracy-Widom distribution (β = 2)

%Theory: Compute and plot the Tracy-Widom distribution
%% Parameters
t0=5;      % right endpoint
tn=-8;     % left endpoint
dx=.005;   % discretization
%% Theory: The differential equation solver
deq=@(t,y) [y(2); t*y(1)+2*y(1)^3; y(4); y(1)^2];
opts=odeset('reltol',1e-12,'abstol',1e-15);
y0=[airy(t0); airy(1,t0); 0; airy(t0)^2];   % boundary conditions
[t,y]=ode45(deq,t0:-dx:tn,y0,opts);         % solve
F2=exp(-y(:,3));                            % the distribution
f2=gradient(F2,t);                          % the density
%% Plot
plot(t,f2,'LineWidth',2)
axis([-5 2 0 .5])

As we can see in Algorithm 1, direct random matrix experiments usually involve calculating the eigenvalues of random matrices, i.e., eig(s). Since many linear algebra computations require O(n³) operations, it seems more feasible to take n relatively small and take a large number of Monte Carlo instances. This is our strategy in Algorithm 1. In fact, sophisticated matrix computations involve a series of reductions. With normally distributed matrices, the most expensive reduction steps can be avoided on the computer, as they can be done with mathematics! All of a sudden O(n³) computations become O(n²) or even better. The resulting matrix requires less storage, either using sparse matrices or data structures with even less overhead.

The story gets better. Random matrix experiments involving complex numbers or even quaternions reduce to real matrices even before they need to be stored on a computer. The story gets even better yet. On one side, for finite n, the reduced form leads to the notion of a "ghost" random matrix quantity that exists for every β (not only real, complex and quaternion), and a "shadow" quantity, which may be real or complex, which allows for computation. On the other hand, the reduced forms connect random matrices to the continuous limit, stochastic operators, which in some ways represent a truer view of why random matrices behave as they do.

The rest of the notes are organized as follows. In Chapter 2, we prepare our readers with matrix factorization preliminaries for random matrices. In Chapter 3, the stochastic operator is introduced with applications, and we discuss Sturm sequences and Riccati diffusion in Chapter 4. We introduce "ghost" and "shadow" techniques for random matrices in Chapter 5. The final chapter is devoted to the smallest singular value of randn(n).

Note: It has now been nine years since the first author wrote a large survey for Acta Numerica [ER05], and three years since the applications survey [EW13]. This survey is meant to be different, as we mean to demonstrate the thesis in the very name of this section.

RANDOM MATRIX THEORY, NUMERICAL COMPUTATION AND APPLICATIONS

Ensemble Hermite Laguerre

Matrices

57

Numeric Models

matlab (β = 1) g = randn(n,n); Wigner eig Tridiagonal (2.4) H=(g+g’)/2; g = randn(m,n); Wishart svd Bidiagonal (2.5) L=(g’*g)/m; Table 2. Hermite and Laguerre ensembles.

2. Random Matrix Factorization In this section, we will provide the details of matrix reductions that do not require a computer. Then, we derive the reduced forms of β-Hermite and β-Laguerre ensembles, which is summarized in Table 2 and Table 3 shows how to generate them in sparse formula. Later this section, we give an overview of how these reductions lead to various computational and theoretical impact. Ensemble

matlab commands (Statistics Toolbox required)

Hermite (2.4)

% d H H

Pick n , beta = s q r t ( c h i 2 r n d ( beta ∗ [ n : − 1 : 1 ] ) ) ’ ; = spdiags ( d , 1 , n , n ) + spdiags ( randn ( n , = (H + H ’ ) / s q r t ( 2 ) ;

Laguerre (2.5)

% % d s B L

P i c k m, n , b e t a Pick a > b e ta ∗ (n − 1)/2 = s q r t ( c h i 2 r n d ( 2 ∗ a − beta ∗ [ 0 : 1 : n − 1 ] ) ) ’ ; = s q r t ( c h i 2 r n d ( beta ∗ [ n : − 1 : 1 ] ) ) ’ ; = spdiags ( s , −1, n , n ) + spdiags ( d , 0 , n , n ) ; = B ∗ B’ ;

1) ,

0, n, n);

Table 3. Generating the Hermite and Laguerre ensembles as sparse matrices.

2.1. The Chi-distribution and orthogonal invariance. There are two key facts to know about a vector of independent standard normals. Let vn denote such a vector. In matlab this would be randn(n,1). Mathematically, we say that the n elements are independently identically distributed (iid) standard normals (i.e., mean 0, variance 1). • Chi distribution: the Euclidean length vn , which is the square root of the sum of the n squares of Gaussians, has what is known as the χn distribution. • Orthogonal invariance: for any fixed orthogonal matrix Q, or if Q is random and independent of vn , the distribution of Qvn is identical to that of vn . In other words, it is impossible to tell the difference between a computer generated vn or Qvn upon inspecting only the output. It is easy to see that the density of vn is (2π)− 2 e− on the length of vn . n

vn 2 2

which only depends

We shall see that these two facts allow us to transform matrices involving standard normals to simpler forms. For reference, we mention that the χn distribution has the probability density

58

ALAN EDELMAN, BRIAN D. SUTTON, AND YUYANG WANG

xn−1 e−x /2 . 2n/2−1 Γ(n/2) 2

f (x) =

Notice that there is no specific requirement that n be an integer, despite our original motivation as the length of a Gaussian vector. The square of χn is the distribution that underlies the well-known Chi-squared test. It can be seen that the mean of χ2n is n. For integers, it is the sum of the n standard normal variables. We have that vn is the product of the random scalar χn , which serves as the length, and an independent vector that is uniform on the sphere, which serves as the direction. 2.2. The QR decomposition of randn(n). Given a vector vn , we can readily construct an orthogonal reflection or rotation Hn such that Hn vn = ± vn e1 , where e1 denotes the first column of the identity. We do this using the standard technique of Householder transformations [TB97] (see Lec. 10) in numerical linear algebra, which is a reflection across the external angle bisector of these two vectors. ww T In this case, Hn = I − 2 T where w = vn ± vn e1 . w w Therefore, if vn follows a multivariate standard normal distribution, Hn vn yields a Chi distribution for the first element and 0 otherwise. Furthermore, let randn(n) be an n × n matrix of iid standard normals. It is easy to see now that through successive Householder reflections of size n, n−1, . . . , 1 we can orthogonally transform randn(n) into the upper triangular matrix ⎞ ⎛ χn G ... G G ⎜ χn−1 . . . G G ⎟ ⎟ ⎜ ⎜ .. .. ⎟ . .. H1 H2 · · · Hn−1 Hn × randn(n) = Rn = ⎜ ⎟ . . . ⎟ ⎜ ⎝ χ2 G ⎠ χ1 Here all elements are independent and represent a distribution and each G is an iid standard normal. It is helpful to watch a 3 × 3 matrix turn into R3 : ⎛ ⎞ ⎞ ⎛ ⎞ ⎛ ⎞ ⎛ G G G χ3 G G χ3 G G χ3 G G ⎝ G G G ⎠ → ⎝ 0 G G ⎠ → ⎝ 0 χ2 G ⎠ → ⎝ 0 χ2 G ⎠ . 0 0 G 0 0 χ1 G G G 0 G G The Gs as the computation progresses are not the same numbers, but merely indicating that the distributions remain unchanged. With a bit of care we can say that randn(n) = (orthogonal uniform with Haar measure) · Rn is the QR decomposition of randn(n). Notice that in earlier versions of lapack and matlab [Q, R]=qr(randn(n)) did not always yield Q with Haar measure. Random matrix theory provided the impetus to fix this! One immediate consequence is the following interesting fact (2.1)

IE[det[randn(n)]2 ] = n!.

2.3. The tridiagonal reduction of the GOE. Eigenvalues are usually introduced for the first time as the roots of the characteristic polynomial. Many people just assume that this is the definition that is used during a computation, but it is well-established that this is not a good method for computing eigenvalues.

RANDOM MATRIX THEORY, NUMERICAL COMPUTATION AND APPLICATIONS

59

Rather, a matrix factorization is used. In the case that S is symmetric, an orthogonal matrix Q is found such that QT SQ = Λ is diagonal. The columns of Q are the eigenvectors and the diagonal of Λ are the eigenvalues. Mathematically, the construction of Q is an iterative procedure, requiring infinitely many steps to converge. In practice, S is first tridiagonalized through a finite process which usually takes the bulk of the time. The tridiagonal is then iteratively diagonalized. Usually, this tridiagonal to diagonal step takes a negligible amount of time to converge in finite precision. √ Suppose A = randn(n) and S = (A + AT )/ 2, we can tridiagonalize S with the finite Householder procedure (see [TB97] for general algorithms.) The result [DE02] is ⎞ ⎛ √ G 2 χn−1 √ ⎟ ⎜χn−1 G 2 χn−2 ⎟ ⎜ ⎟ ⎜ .. .. .. (2.2) Tn = ⎜ ⎟, . . . ⎟ ⎜ √ ⎝ χ2 G 2 χ√1 ⎠ χ1 G 2 √ where G 2 refers to a Gaussian with mean 0 and variance 2. The superdiagonal and diagonal are independent, as the matrix is symmetric. The matrix Tn has the same eigenvalue distribution as S, but numerical computation of the eigenvalues is considerably faster when the right software is used, for example, lapack’s DSTEQR or DSTEBZ (bisection). The largest eigenvalue benefits further as we only need to build around a 10n1/3 × 10n1/3 matrix (shown in Table 1)) and we can input an estimate for the largest eigenvalues such as λmax = 2. See Section 2.5 for details. A dense eigensolver requires O(n3 ) operations and will spend nearly all of its time constructing Tn . Given that we know the distribution for Tn a priori, this is wasteful. The eigenvalues of Tn require O(n2 ) time or better. In addition, a dense matrix requires O(n2 ) storage while the tridiagonal matrix only needs O(n). 2.4. Bidiagonal reduction of Real Wishart Matrices. Given a Gaussian random matrix A=randn(m,n), W = AT A/m is called the Wishart matrix or Laguerre ensemble (β = 1). Computing its eigenvalues amounts to calculating the singular values of A. For that purpose, we need to reduce A to lower bidiagonal form [TB97] (Lec. 31) (shown here for n > m), ⎞ ⎛ χn ⎟ ⎜ χm−1 χn−1 ⎟ ⎜ ⎟ ⎜ χ χ m−2 n−2 ⎟ ⎜ ⎟ ⎜ . . .. .. Bn = ⎜ ⎟. ⎟ ⎜ ⎟ ⎜ χ3 χn−m+3 ⎟ ⎜ ⎠ ⎝ χ2 χn−m+2 χ1 χn−m+1 See [Sil85] and [Tro84] for details. Computation of singular values is greatly accelerated in bidiagonal form when using, for example, lapack’s DBDSQR. 2.5. Superfast computation. Most of earlier numerical experiments computed the eigenvalues of random matrices and then histogrammed them. Can we histogram without histogramming? The answer is Yes! Sturm sequences can be

60

ALAN EDELMAN, BRIAN D. SUTTON, AND YUYANG WANG

used with Tn for the computation of histograms [ACE09]. This is particularly valuable when there is interest in a relatively small number of histogram intervals (say 20 or 30) and n is very large. This is an interesting idea, particularly because most people think that histogramming eigenvalues first requires that they compute the eigenvalues, then sort them into bins. The Sturm sequence [TB97] idea gives a count without computing the eigenvalues at all. This is a fine example of not computing more than is needed: if you only need a count, why should one compute the eigenvalues at all? We will further discuss Sturm sequence in Section 4. For the largest eigenvalue, as mentioned before, the best trick for very large n is to only generate the upper left 10 n1/3 × 10 n1/3 of the matrix. Because of what is known as the “Airy decay” in the corresponding eigenvector, the largest eigenvalue, which technically depends on every element in the tridiagonal matrix — numerically depends significantly only on the upper left part. This is a huge savings in Monte Carlo sampling. Further savings can be obtained by using the Lanczos “shift and invert” strategy given an estimate for the largest eigenvalue. Similar ideas may be used for singular values. We refer interested readers to Section 10 of [ER05]. Algorithm 3 provides an example of how we succeed to compute the largest eigenvalue of a billion by billion matrix in the time required by naive methods for a hundred by hundred matrix.

2.6. Generalizations to complex and quaternion. We can consider extending the same matrix algorithms to random complex (GUE) and quaternion (GSE) matrices. For the complex case, we take randn(n)+i*randn(n). Quaternions may be less familiar. Not available in matlab (without special programming) but easily imagined is randn(n)+i*randn(n)+j*randn(n)+k*randn(n), where ij = k, jk = i, ki = j, ji = −k, kj = −i, ik = −j, ijk = −1. One can complete to an entire algebraic system obtaining the third division ring. Remember that a division ring is an algebra where ab = 0 implies at least one of a or b is 0. Matrices are not a division ring even though they are an algebra. In matlab, one can simulate scalar quaternions a+bi+cj +dk with the matrix [a+bi c+di;-c+di a-bi]. Similarly, the quaternion matrix A + Bi + Cj + Dk can be simulated with the matlab matrix [A+Bi C+Di;-C+Di A-Bi]. For more details on matrix computations with quaternion matrices, readers are encouraged to consult [Zha97, BGBM89]. The generalizations to β = 2, 4 are as follows. Let β count the number of independent real Gaussians, and let Gβ be a complex (β = 2) or quaternion (β = 4) Gaussian respectively. G denotes G1 by default. Therefore, the upper triangular Rn , tridiagonal Tn (β-Hermite ensemble) and bidiagonal Bn (β-Laguerre ensemble) reductions have the following form

(2.3)

⎛ χnβ ⎜ ⎜ ⎜ Rn = ⎜ ⎜ ⎝

Gβ χ(n−1)β

... ... .. .

Gβ Gβ .. . χ2β

⎞ Gβ Gβ ⎟ ⎟ .. ⎟ , . ⎟ ⎟ Gβ ⎠ χβ

RANDOM MATRIX THEORY, NUMERICAL COMPUTATION AND APPLICATIONS

√ G 2 ⎜χ(n−1)β ⎜ ⎜ Tn = ⎜ ⎜ ⎜ ⎝ ⎛

(2.4)



(2.5)

χ(n−1)β √ G 2 χ(n−2)β

61

⎞ χ(n−2)β .. . .. .

..

⎟ ⎟ ⎟ ⎟, ⎟ ⎟ ⎠

. √ G 2 χ√ β χβ G 2 ⎞

and

χnβ ⎟ ⎜χ(m−1)β χ(n−1)β ⎟ ⎜ ⎟ ⎜ .. .. Bn = ⎜ ⎟. . . ⎟ ⎜ ⎠ ⎝ χ(n−m+2)β χ2β χβ χ(n−m+1)β

Of interest is that Tn and Bn are real matrices whose eigenvalue and singular value distributions are exactly the same as the original complex and quaternion matrices. This leads to even greater computational savings because only real numbers need to be stored or computed with. Algorithm 3 Compute the largest eigenvalues of a billion by billion matrix. %% T h i s c o d e r e q u i r e s s t a t i s t i c s t o o l b o x beta = 1 ; n = 1 e9 ; o p t s . disp = 0 ; o p t s . i s s y m = 1 ; a l p h a = 1 0 ; k = round ( a l p h a ∗ n ˆ ( 1 / 3 ) ) ; % c u t o f f p a r a m e t e r s d = s q r t ( c h i 2 r n d ( beta ∗ n : −1: ( n − k − 1 ) ) ) ’ ; H = spdiags ( d , 1 , k , k ) + spdiags ( randn ( k , 1 ) , 0 , k , k ) ; H = (H + H’ ) / s q r t ( 4 ∗ n ∗ beta ) ; % S c a l e s o l a r g e s t e i g e n v l a u e e i g s (H, 1 , 1 , o p t s ) ;

is

near 1

2.7. Generalization Beyond. We continue our theme that a computational trick can lead to deep theoretical results. We summarize two generalizations and will survey recent results in Section 3 and Section 5 correspondingly. • Stochastic Operators: that tridiagonals tend to a stochastic operator was first announced by Edelman [Ede03] in 2003 and subsequently developed in [Sut05, ES07b]. Further mathematical treatment and generalizations appear in [RR09, RRV11, Blo11, BV11, VV13]. • Ghosts and Shadows of Random Matrices: there is little reason other than history and psychology to restrict β to only the values corresponding to the reals, complexes, and quaternions β = 1, 2, 4. The matrices given by Tn and Bn are well defined for any β, and are deeply related to generalizations of the Schur polynomials known as the Jack Polynomials of parameter α = 2/β. Edelman proposed in the method of “Ghosts and Shadows” that even the “ghost Gaussian” Gβ exists and has a meaning under which algebra is doable [Ede10]. 3. Stochastic Operators Classically, many important distributions of random matrix theory were accessed through what now seems like an indirect procedure: first formulate an n-by-n random matrix, then compute an eigenvalue distribution, and finally let n approach infinity. The limiting distribution was reasonably called an eigenvalue distribution, but it did not describe the eigenvalue of any specific operator, since the matrices were left behind in the n → ∞ limit.

62

ALAN EDELMAN, BRIAN D. SUTTON, AND YUYANG WANG

This changed in 2003 [Ede03] with the stochastic operator approach to random matrix theory. The new framework is this: • Select a stochastic differential operator such as the stochastic Airy operator 2 d2 − x + √ W (x), 2 dx β where W (x) is the Wiener process. • Compute an eigenvalue distribution. That’s it. This approach produces the same eigenvalue statistics that have been studied by the random matrix theory community for decades but in a more direct fashion. The reason: the stochastic differential operators of interest are the n → ∞ continuum limits of the most-studied random matrix models, as we shall see. 3.1. Brownian motion and white noise. We begin by discussing simple Brownian motion and its derivative, “white noise.” Right away we would like to demystify ideas that almost fit the usual calculus framework, but with some differences. Readers familiar with the Dirac delta function (an infinitesimal spike) have been in this situation before. The following simple matlab code produces a figure of the sort that resembles logarithmic stock market prices. Every time we execute this code we get a different random picture (shown in Figure 1 Left). x = [0:h:1]; %T h i n k o f h a s ” D e l t a x ” dW = randn ( length ( x ) , 1 ) ∗ s q r t ( h ) ; %T h i n k W = cumsum(dW) ; p l o t ( x ,W)

s q r t ( Delta x )

Intuitively, we break [0, x] into intervals each having length Δx. For each interval, we sample ΔW which is a zero mean normal with variance equal to Δx and then sum them up. Thus, if we look at one point x, we have d

W (x) =

x [ Δx ]

i=1

ΔW =

x [ Δx ]



√ Δx.

i=1 x variance Δx

× Δx = x, i.e. W (x) ∼ N (0, x) W (x) is a normal with mean 0 and √ (shown in Figure 1 Center and Right). We can write this as W (x) = x · G, G denoting a standard normal. In particular, W (1) is a standard normal, and √ W (x) − W (y) has mean 0 and variance (x − y), i.e. W (x) − W (y) = x − y · G. W (x) is known as the Wiener process or standard Brownian motion. It has the property that W (x) − W (y) has the distribution N (0, x − y). A suggestive notation is √ dW = (standard normal) · dx and the corresponding Wiener process is W (x) =

dW.

√ The dx seems troubling as notation, until one realizes that the cumsum then has quantities that do not depend on h (or Δx) at all. Like the standard integral, mathematics prefers quantities that at least in the limit do not depend on the √ discretization size or method. Random quantities are the same. The dx captures

RANDOM MATRIX THEORY, NUMERICAL COMPUTATION AND APPLICATIONS

63

Figure 1. Left: Sample paths for standard Brownian motion; Center: histogram of W (1) vs. the pdf of the standard normal; Right: quantile-quantile plot of W (1).

the idea that √ variances add when adding normals. If each increment depends on dx instead of dx, then there will be no movement at all because the variance of x W (x) will be × (Δx)2 = x × Δx which will be 0 when Δx → 0. Δx The derivative W (x) = dW /dx at first seems strange. The discretization would be dW/h in the matlab code above, which is a discrete-time white noise process. At every point, it is a normal with mean 0 and variance 1/h, and the covariance matrix is h1 I.In the continuous limit, the differential form dW denotes a white noise process formally satisfying

f (x, W )W (x)dx = f (x, W )dW. Its covariance function is the Dirac delta dWx dWy = δ(x − y). We might say that W (x) has a “variance density” of 1, referring to the variance divided by the step size of the discretization. In general we can consider integrals of the form

x

f (t)dW = lim 0

Δx→0

x [ Δx ]

f (i [Δx])ΔW,

i=1

which discretizes to cumsum(f(t).*dW). We can think of dW as an operator such that f dW is a distribution—not a function in the classical sense, but able to serve as the differential in a stochastic integral. Multiplication by dW is called the white noise transformation [Ros09]. 3.2. Three local eigenvalue behaviors; three stochastic differential operators. The most commonly studied random matrix models over the years have been the Gaussian, Wishart, and MANOVA (Multivariate Analysis of Variance) ensembles, also known as the Hermite, Laguerre, and Jacobi ensembles. We are often concerned with local eigenvalue behavior (that is, a single eigenvalue or a small number of eigenvalues rather than the entire spectrum), which depends on the location in the spectrum as well as the random matrix distribution. Remarkably, though, we see only three different local behaviors among the classical ensembles:

64

ALAN EDELMAN, BRIAN D. SUTTON, AND YUYANG WANG

Algorithm 4 Distribution of the largest eigenvalue of the stochastic Airy operator. % Experiment : L a r g e s t e i g e n v a l u e o f a S t o c h a s t i c Airy Operator % Plot : Histogram of the l a r g e s t e i g e n v a l u e s % Th e o r y : The Tracy−Widom l a w %% P a r a m e t e r s t =1 0 0 0 0 ; % number o f t r i a l s v=z er o s ( t , 1 ) ; % s a m p l e s n=1e9 ; % l e v e l of d i s c r e t i z a t i o n beta =2; h=n ˆ ( − 1 / 3 ) ; % h s e r v e s as dx x=[0:h : 1 0 ] ; % d i s c r e t i z a t i o n of x N=length ( x ) ; %% E x p e r i m e n t % generate the o f f diagonal elements b=(1/h ˆ 2 ) ∗ o n e s ( 1 , N− 1 ) ; f o r i =1: t %% d i s c r e t i z e s t o c h a s t i c a i r y o p e r a t o r % d i s c r e t i z e airy operator a=−(2/h ˆ 2 ) ∗ o n e s ( 1 ,N ) ; % d i f f e r e n t i a l o p e r a t o r : d ˆ2/ d x ˆ2 a=a−x ; % d ˆ2/ d x ˆ2 − x % add t h e s t o c h a s t i c p a r t dW=randn ( 1 ,N) ∗ s q r t ( h ) ; a=a +(2/ s q r t ( beta ) ) ∗dW/h ; %% c a l c u l a t e t h e l a r g e s t e i g e n v a l u e o f t r i d i a g o n a l m a t r i x T % d iag on al of T: a % s u b d i a g o n a l of T: b v ( i ) = maxeig ( a , b ) ; % maxeig ( a , b ) : e i g e n v a l u e s o l v e r f o r t r i d i a g o n a l m a t r i c e s % d o w n l o a d a b l e a t h t t p : / / p e r s s o n . b e r k e l e y . edu / m l t r i d / i n d e x . h t m l end %% P l o t binsize = 1/6; [ count , x ] = h i s t ( v , − 6 : b i n s i z e : 6 ) ; bar ( x , c o u n t / ( t ∗ b i n s i z e ) , ’ y ’ ) ; %% Theor y hold on ; tracywidom

RANDOM MATRIX THEORY, NUMERICAL COMPUTATION AND APPLICATIONS

65

Figure 2. Local eigenvalue behavior

Ensemble Hermite Laguerre Jacobi

Region of spectrum Left edge Interior Right edge soft edge bulk soft edge hard or soft edge bulk soft edge hard or soft edge bulk hard or soft edge

The edge of the support of the limiting density is classified as soft or hard depending on whether for finite n, there is a non-zero probability of an eigenvalue appearing. For Laguerre the hard edge would be at 0 while for Jacobi the hard edges would be at 0 and 1. In the next section, we will explore how the operator Aβ =

d2 2 − x + √ W (x) 2 dx β

might reasonably be considered to have a random largest eigenvalue that follows the limiting largest eigenvalue law of random matrices. Before proceeding to rigorous mathematical treatment, the authors hope readers will be convinced after running the following numerical experiment. For example, after appropriate recentering and rescaling, the largest eigenvalues of the Hermite and Laguerre ensembles are indistinguishable in the n → ∞ limit, because the limiting distributions are identical—both show “soft edge” behavior. In contrast, the limiting behavior of the smallest eigenvalues of the Laguerre ensemble—those at the “hard edge”—follow a very different law. Near a point in the interior of the spectrum support—in the “bulk”— a pair of eigenvalues is more interesting than a single eigenvalue, and the spacing between consecutive eigenvalues is the most commonly studied distribution. Figure 2 contains plots for the three scaling regimes. Plot (a) often describes a largest eigenvalue; plot (b) often describes a smallest singular value; and plot (c) often describes the spacing between two consecutive eigenvalues in the interior of a spectrum. The stochastic differential operators mentioned above are associated with the three local eigenvalue behaviors: Local eigenvalue behavior Stochastic differential operator Soft edge Stochastic Airy operator Hard edge Stochastic Bessel operator Bulk Stochastic sine operator

66

ALAN EDELMAN, BRIAN D. SUTTON, AND YUYANG WANG

They are simple to state: Stochastic Airy operator: d2 2 − x + √ W (x), dx2 β b.c.’s: f : [0, +∞) → R, f (0) = 0,

Aβ =

lim f (x) = 0;

x→+∞

Stochastic Bessel operator: √ d a 2 Jaβ = −2 x + √ + √ W (x), dx x β b.c.’s:

f : [0, 1] → R, f (1) = 0, (Jaβ f )(0) = 0;

Stochastic sine operator: " " # #

∞ √1 W (x) W (x) J 2 11 12 −1/2 2 Sβ = +√ , 1 ∞



(J−1/2 )∗ (x) W22 (x) β √β W12   ∞ ∞ and (J−1/2 )∗ apply. b.c.’s: S β acts on fg , b.c.’s of J−1/2 The eigenvalues of the stochastic Airy operator show soft edge behavior; the singular values of the stochastic Bessel operator show hard edge behavior; and the spacing between consecutive eigenvalues of the stochastic sine operator show bulk behavior. These operators allow us to study classical eigenvalue distributions directly, rather than finding the eigenvalue of a finite random matrix and then taking an n → ∞ limit. 3.3. Justification: from random matrices to stochastic operators. The stochastic operators were discovered by interpreting the tridiagonal (2.4) and bidiagonal beta models (2.5) as finite difference schemes. We have three classical ensembles, and each has three spectrum regions, as discussed in the previous section. Continuum limits for eight of the 3 × 3 = 9 combinations have been found [Sut05, ES07b]. We shall review one derivation. Consider the largest eigenvalues of the β-Hermite matrix model H = [hij ]. These lie at a soft edge, and therefore we hope to find the stochastic Airy operator as n → ∞. First, a similarity transform produces a nonsymmetric matrix whose entries are totally independent and which is easier to interpret as a finite difference scheme. Define D to be the diagonal matrix whose ith diagonal entry equals (n/2)−(i−1)/2

i−1 

hk,k+1 .

k=1

Then DHD−1 equals ⎡

√ G 2 ⎢ √1 χ2 ⎢ βn (n−1)β 1 ⎢ √ ⎢ 2β ⎢ ⎢ ⎣

and all entries are independent.

√ βn √ G 2 .. .

√ βn .. . 1 √ χ2 βn 2β

⎤ .. . √ G 2 √1 χ2 βn β

⎥ ⎥ ⎥ ⎥, √ ⎥ ⎥ βn ⎦ √ G 2

RANDOM MATRIX THEORY, NUMERICAL COMPUTATION AND APPLICATIONS

67

To see the √largest eigenvalues√more clearly, the matrix is recentered and rescaled. We consider 2n1/6 (DHD−1 − 2nI). The distribution of the algebraically largest eigenvalue of this matrix, when β ∈ 1, 2, 4, converges to one of the curves in Figure 2(a) as n → ∞. The recentered and rescaled matrix has a natural interpretation as a finite difference scheme on the grid xi = hi, i = 1, . . . , n, with h = n−1/3 . First, the tridiagonal matrix is expressed as a sum of three simpler matrices: √ √ 1/6 2n (DHD−1 − 2nI) ⎤ ⎡ −2 1 ⎥ ⎢ 1 −2 1 ⎥ 1 ⎢ ⎥ ⎢ .. .. .. = 2⎢ ⎥ . . . ⎥ h ⎢ ⎣ 1 −2 1 ⎦ 1 −2 ⎡ ⎤ 0 0 ⎢x1 0 ⎥ 0 ⎢ ⎥ ⎢ ⎥ . . . .. .. .. −⎢ ⎥ ⎢ ⎥ ⎣ xn−2 0 0⎦ xn−1 0 ⎤ ⎡ G 0 ⎥ ⎢χ ˜2 G 0 ⎥ ⎢ (n−1)β 2 ⎥ ⎢ χ ˜ G 0 2 (n−2)β 1 ⎢ ⎥ +√ ·√ ⎢ ⎥, . . . . . . ⎥ ⎢ β 2h ⎢ . . . ⎥ 2 ⎣ χ ˜2β G 0 ⎦ χ ˜2β G with χ ˜2r shorthand for

√ 1 (χ2 r 2βn

− r). More briefly,

√ √ 1/6 1 2 2n (DHD−1 − 2nI) = 2 Δ − diag−1 (x1 , x2 , . . . , xn−1 ) + √ N. h β The random variable χ ˜2(n−j)β has mean zero and standard deviation 1 + O(h2 ) uniformly for j satisfying xj ≤ M √ for fixed M . Hence, the total standard deviation on row i is asymptotic to √12h 12 + 12 = h−1/2 . The recentered and rescaled matrix model encodes a finite difference scheme for d2 2 − x + √ W (x). Aβ = 2 dx β 4. Sturm Sequences and Riccati Diffusion Probabilists and engineers seem to approach eigenvalues in different ways. When a probabilist considers a cumulative distribution function F (λ) = Pr[Λ < λ], he or she conducts a test: Is the random eigenvalue less than a fixed cutoff? In contrast, when an engineer types eig(A) into matlab, he or she expects to receive the locations of the eigenvalues directly. If one looks under the hood, however, the distinction may disappear. A competitive numerical method for eigenvalues is bisection iteration with Sturm sequences. The method is easiest to describe for the largest eigenvalue. Starting from an initial

68

ALAN EDELMAN, BRIAN D. SUTTON, AND YUYANG WANG

guess λ0 , the method determines if there are any eigenvalues greater than λ0 . If so, the guess is increased; if not, it is decreased. In time, the largest eigenvalue is captured within an interval, and then the interval is halved with each step. This is linear convergence, because the number of correct bits increases by one with each step. At the end, the numerical location is found from a sequence of tests. Linear convergence is fast, but it is not as fast as, say, Newton’s iteration, which converges quadratically. What makes the overall method competitive is the sheer speed with which the test against λk can be conducted. The key tool is the Sturm sequence for tridiagonal matrices, which has a close connection to the Sturm-Liouville theory of ordinary differential equations. Below, we show how these theories inspire two seemingly different approaches to computing random eigenvalue distributions. The Sturm sequence approach was introduced by Albrecht, Chan, and Edelman and applied to computing eigenvalue distributions of the β-Hermite ensemble [ACE09]. The continuous Riccati diffusion was introduced by Ram´ırez, Rider, and Vir´ ag [RRV11], by applying a change of variables to the stochastic differential operators of the previous section [RRV11]. 4.1. Sturm sequences for numerical methods. A Sturm sequence can reveal the inertia of a matrix, i.e., the number of positive, negative, and zero eigenvalues. For an n-by-n matrix A = [aij ], define Ak to be the  submatrix  a k-by-kaprincipal n−1,n . Because the in the lower-right corner, e.g., A1 = [an,n ] and A2 = n−1,n−1 an,n−1 an,n eigenvalues of Ak interlace those of Ak+1 , the Sturm sequence (det A0 , det A1 , det A2 , . . . , det An ) reveals the inertia. Specifically, assuming that no zeros occur in the sequence, the number of sign changes equals the number of negative eigenvalues. (Because a zero determinant occurs with zero probability in our random matrices of interest, we will maintain the assumption of a zero-free Sturm sequence.) Alternatively, the Sturm ratio sequence can be used. If ri = (det Ai )/(det Ai−1 ), then the number of negative values in (r1 , r2 , . . . , rn ) equals the number of negative eigenvalues. The Sturm ratio sequence can be computed extremely quickly when A is tridiagonal. Labeling the diagonal entries an , an−1 , . . . , a1 and the subdiagonal entries bn−1 , bn−2 , . . . , b1 from top-left to bottom-right, the ith Sturm ratio is  i = 1; a1 , ri = b2i−1 ai − ri−1 , i > 1. This reveals in quick order the number of negative eigenvalues of A, or the number of eigenvalues less than λ if A − λI is substituted for A. Computing eigenvalues is one of those “impossible” problems that follow from the insolubility of the quintic. It is remarkable that counting eigenvalues is so quick and easy. 4.2. Sturm sequences in random matrix theory. The Sturm sequence enables the computation of various random eigenvalue distributions. Let us consider the largest eigenvalue of the β-Hermite ensemble. The tridiagonal β-Hermite matrix model Hnβ has diagonal entries ai = Gi and √ subdiagonal entries bi = χ(i−1)β / 2. Hence, the Sturm ratio sequence of Hnβ − λI

RANDOM MATRIX THEORY, NUMERICAL COMPUTATION AND APPLICATIONS

is

 ri =

(4.1)

G1 − λ, Gi − λ −

69

i = 1; χ2(i−1)β 2ri−1 ,

i > 1.

The distribution of the largest eigenvalue is Pr[Λmax < λ] = Pr[Hnβ − λI has all negative eigenvalues] = Pr[all Sturm ratios ri are negative] 0 0 ··· fr1 ,...,rn (s1 , . . . , sn )ds1 · · · dsn , = −∞

−∞

in which fr1 ,...,rn is the joint density of all Sturm ratios. Albrecht, Chan, and Edelman compute this joint density from (4.1) and find n  2 1 fri |ri−1 (si |si − 1), fr1 ,...,rn (s1 , . . . , sn ) = √ e−(s−λ) /2 2π i=2 p

fri |ri−1 (si |si−1 ) =

|si−1 | i − 12 (si +λ)2 +zi2 /4 √ e D−pi (sign(si−1 )(si + λ + si−1 )), 2π

with Dp denoting a parabolic cylinder function. The level density, i.e., the distribution of a randomly chosen eigenvalue from the spectrum, has also been computed using Sturm sequences [ACE09]. 4.3. Sturm-Liouville theory. The presentation of Sturm sequences focuses on finite matrices. However, there is a deep connection with the continuous world, through Sturm-Liouville theory. Recall that the eigenvalues of A that are less than λ are equal in number to the negative Sturm ratios of A − λI. This is not all. A similar relationship exists with the solution vector x = (xn , xn−1 , . . . , x1 ) of (A − λI)x = 0. Of course, if λ is not an eigenvalue, then no nontrivial solution exists. However, the underdetermined matrix   T˜ = e1 (A − λI) , with e1 denoting the first standard basis vector, always has a nontrivial solution x ˜ = (xn+1 , xn , xn−1 , . . . , x1 ), and λ is an eigenvalue of A if and only if T˜ x = 0 has a nontrivial solution with xn+1 = 0. This recalls the shooting method for boundary value problems—the solution space is expanded by relaxing a boundary condition, a solution is found, and then the boundary condition is reasserted. ? As just mentioned, the test xn+1 = 0 highlights the eigenvalues of A. The other solution vector entries xn , xn−1 , . . . , x1 provide useful information as well. Letting si = xi /xi−1 , i = 2, . . . , n + 1, the reader can check that  r i−1 , i = 2, . . . , n; − bi−1 si = −rn , i = n + 1. If the subdiagonal entries bi−1 are all positive—as they are for the β-Hermite matrix model with probability one—then the “shooting vector ratios” sn+1 , . . . , s2 and the Sturm ratios rn , . . . , r1 have opposite signs. In particular, A has no eigenvalues greater than λ if and only if the shooting vector ratios sn+1 , . . . , s2 are all positive.

70

ALAN EDELMAN, BRIAN D. SUTTON, AND YUYANG WANG

This result may sound familiar to a student of differential equations. One of the important results of Sturm-Liouville theory is this: the nth eigenfunction of a regular Sturm-Liouville operator L has exactly n zeros. In particular, the lowestenergy eigenfunction never crosses 0. This leads to the already-mentioned shooting method: From a guess λ for the lowest eigenvalue, relax a boundary condition and solve L − λf = 0. If the solution has no zeros, then the guess was too low; if the solution has zeros, then the guess was too high. 4.4. Riccati diffusion. The stochastic Airy operator is a regular SturmLiouville problem [ES07b, Blo11]. It can be analyzed by the shooting method and a Riccati transform, following Ram´ırez, Rider, and Vir´ ag [RR09], and the largest eigenvalue distribution can be computed with the help of Kolmogorov’s backward equation, as shown by Bloemendal and Vir´ ag [BV10, BV11]. Bloemendal and Sutton have developed an effective numerical method based on this approach [BS12]. First, the Riccati transform. Consider the stochastic Airy operator Aβ acting on a function f (x). Define w(x) = f (x)/f (x). Then 

2 f (x) f

(x) f

(x) − − w(x)2 . = w (x) = f (x) f (x) f (x) If f (x) is an eigenfunction of Aβ , then it passes two tests: the differential equation 2 f

(x) − xf (x) + √ W (x)f (x) = Λf (x) β and the boundary conditions f (0) = 0 and limx→+∞ f (x) = 0. In fact, the boundary condition at +∞ forces f (x) to decay at the same rate as Ai(x). After the change of variables, these conditions become 2 w (x) = x + Λ − √ W (x) − w(x)2 . β and lim w(x) = +∞, √ w(x) ∼ Ai (x)/ Ai(x) ∼ − x (x → +∞). x→0+

Conversely, if w(x) satisfies the first-order differential equation and satisfies the  boundary conditions, then f (x) = exp( w(x) dx) is an eigenfunction with eigenvalue Λ. Sturm-Liouville theory leads to the following three equivalent statements concerning the largest eigenvalue Λmax of the stochastic Airy operator: (1) Λmax < λ. (2) Suppressing the right boundary condition, the solution to (Aβ − λ)f (x) = 0 has no zeros on the nonnegative half-line. (3) Suppressing the right boundary condition, the solution w(x) to the firstorder ODE 2 w (x) = x + λ − w(x)2 + √ W (x) β has no poles on the nonnegative half-line.

RANDOM MATRIX THEORY, NUMERICAL COMPUTATION AND APPLICATIONS

71

Computing the probability of any of these events gives the desired distribution, the generalization of the Tracy-Widom distribution to arbitrary β > 0. One final trick is in order before moving to computation. The test value λ can be removed from the diffusion equation with the change of variables t = x + λ. The resulting equation is equivalent in distribution to (4.2)

2 w (t) = t − w(t)2 + √ W (t), β

and the left boundary condition becomes limt→λ+ w(t) = +∞. We have Λmax < λ if and only if w(t) has no poles in [λ, +∞). 4.5. Kolmogorov’s backward equation. The probability of a pole in the solution to the stochastic Riccati diffusion (4.2) turns out to be a tractable computation; Kolmogorov’s backward equation is designed for this sort of problem. We ultimately need to enforce the left boundary condition limt→0+ w(t) = +∞. The trick is to broaden the problem, analyzing all boundary conditions before finding the original one as a special case [BV10]. That is, we compute Pr(t0 ,w0 ) [no poles] for all initial conditions w(t0 ) = w0 . For initial conditions with large t0 , it is rather easy to predict whether a pole appears in the solution of the Riccati equation. When√noise is removed, the equation has two fundamental solutions: Ai (t)/ Ai(t) ∼ − t, which is like √an unstable equilibrium in that it repels nearby solutions, and Bi (t)/ Bi(t) ∼ t, which is like a stable equilibrium. Solutions with w(t0 ) < Ai (t0 )/ Ai(t0 ) hit w = −∞ in

finite time when √ run forward, and solutions with w(t0 ) > Ai (t0 )/ Ai(t0 ) become asymptotic to t when run forward. White noise has no effect on this behavior in the t0 → +∞ limit. Hence, we know a slice: lim

Pr [no poles] = 1w0 ≥−√t0 .

t0 →+∞ (t0 ,w0 )

Kolmogorov’s backward equation specifies how this probability evolves as the initial condition moves backward. Let F (t, w) = Pr(t,w) [no poles]. (Notice that we have dropped subscripts from t0 and w0 , but these are still initial values.) The backward equation is ∂F ∂F 2 ∂2F = 0. + (t − w2 ) + ∂t ∂w β ∂w2 With our initial condition and the boundary condition limw→−∞ F (t, w) = 0, this has a unique solution. The desired quantity is a horizontal slice: Pr[Λmax < λ] = Pr[Riccati diffusion started at w(λ) = +∞ has no poles with t > λ] = F (λ, +∞). Bloemendal and Sutton have developed a numerical routine for solving the PDE numerically. Some challenges arise, particularly when β becomes large. Then the PDE is dominated by convection, and its solution develops a jump discontinuity (a butte, so to speak). The solution can be smoothed out by an additional change of variables [BS12].

72

ALAN EDELMAN, BRIAN D. SUTTON, AND YUYANG WANG

5. Ghosts and Shadows We propose to abandon the notion that a random matrix exists only if it can be sampled. Much of today’s applied finite random matrix theory concerns real or complex random matrices (β = 1, 2). The “threefold way” so named by Dyson in 1962 [Dys62] adds quaternions (β = 4). While it is true there are only three real division algebras (β=“dimension over the reals”), this mathematical fact while critical in some ways, in other ways is irrelevant and perhaps has been over interpreted over the decades. We introduce the notion of a “ghost” random matrix quantity that exists for every β, and a “shadow” quantity which may be real or complex which allows for computation. Any number of computations have successfully given reasonable answers to date though difficulties remain in some cases. Though it may seem absurd to have a “three and a quarter” dimensional or “π” dimensional algebra, that is exactly what we propose and what we compute with. In the end β becomes a noisiness parameter rather than a dimension. This section contains an “idea” which has become a “technique.” Perhaps it might be labeled “a conjecture,” but we think “idea” is the better label right now. Soon, we hopefully predict, this idea will be embedded in a rigorous theory. The idea was discussed informally to a number of researchers and students at mit for years now, probably dating back to 2003 or so. It was also presented at a number of conferences [Ede03] and in a paper [Ede10]. Mathematics has many precedents, the number 0 was invented when we let go of the notion that a count requires objects to exist. Similarly negative numbers are more than the absence of existing objects, imaginary numbers can be squared to obtain negative numbers, and infinitesimals act like the “ghosts of departed quantities.” Without belaboring the point, mathematics makes great strides by letting go of what at first seems so dear. What we will obtain here is a rich algebra that acts in every way that we care about as a β-dimensional real algebra for random matrix theory. Decades of random matrix theory have focused on reals, complexes, and quaternions or β = 1, 2, 4. Statisticians would say the real theory is more than enough and those who study wireless antenna networks would say that the complexes are valuable, while physicists are an applied community that also find the quaternions of value. Many random  matrix papers allow for general betas formally, perhaps in a formula with factor i 0. We call xβ a ghost Gaussian, vβ a vector of ghost Gaussians, and Qβ , a ghost unitary matrix. Definition 1. (Shadows) A shadow is a real (or complex) quantity derived from a ghost that we can sample and compute with. We therefore have that the norm xβ ∼ χβ is a shadow. So is (xβ ). 5.2. Ghost Orthogonals (“The Beta Haar Distribution”). We reason by analogy with β = 1, 2, 4 and imagine a notion of orthogonals that generalizes the orthogonal, unitary, and symplectic groups. A matrix Q of ghosts may be said to be orthogonal if QT Q = I. The elements of course will not be independent. We sketch an understanding based on the QR decomposition on general matrices of independent ghost Gaussians. We imagine using Householder transformations as is standard in numerical linear algebra software. We obtain immediately Proposition 2. Let A be an n × n matrix of standard β ghost Gaussians We may perform the QR decomposition into ghost orthogonal times ghost upper triangular. The matrix R has independent entries in the upper triangle. 
Its entries are standard ghost Gaussians above the diagonal, and the non-negative real quantity

74

ALAN EDELMAN, BRIAN D. SUTTON, AND YUYANG WANG

Rii = χβ(n+1−i) on the diagonal. The resulting Q may be thought of as a β analogue of Haar measure. It is the product of Householder matrices Hk obtained by reflecting on the uniform k-dimensional “β sphere.” The Householder procedure may be thought of as an analog for the O(n2 ) algorithm for representing random real orthogonal matrices as described by Stewart [Ste80]. We illustrate the procedure when n = 3. We use Gβ to denote independent standard ghost Gaussians as distributions. They are not meant in any way to indicate common values or that even there is a meaning to having values at all. ⎛

Gβ ⎝ Gβ Gβ

Gβ Gβ Gβ

⎞ ⎛ Gβ χ3β Gβ ⎠ = H3T ⎝ 0 Gβ 0 ⎛

⎞ Gβ Gβ ⎠ Gβ

Gβ Gβ Gβ

χ3β = H2T H3T ⎝ 0 0 ⎛

Gβ χ2β 0

χ3β = H1T H2T H3T ⎝ 0 0

⎞ Gβ Gβ ⎠ Gβ

Gβ χ2β 0

.

⎞ Gβ Gβ ⎠ χβ

The Hi are reflectors that do nothing on the first n − i elements and reflect uniformly on the remaining i elements. The absolute values of the elements on the sphere behave like i independent χβ random variables divided by their root mean square. The Q is the product of the Householder reflectors. We remark that the β-Haar are different from the circular β-ensembles for β = 2. 5.3. Ghost Gaussian Ensembles and Ghost Wishart Matrices. It is very interesting that if we tridiagonalize a complex Hermitian matrix (or a quaternion self-dual matrix), as is done with software for computing eigenvalues, the result is a real tridiagonal matrix. Equally interesting, and perhaps even easier to say, is that the bidigonalization procedure for computing singular values takes general rectangular complex (or quaternion) matrices into real bidiagonal matrices. The point of view is that the Hermite and Laguerre models introduced in Section 2 are not artificial constructions, but they are shadows of symmetric or general rectangular ghost matrices respectively. If we perform the traditional Householder reductions on the ghosts the answers are the tridiagonal and bidiagonal models. The tridiagonal reduction of a normalized symmetric Gaussian (“The Gaussian β-orthogonal Ensemble”) is √ ⎞ ⎛ χ(n−1)β G 2 √ ⎟ ⎜ χ(n−1)β G 2 χ(n−2)β ⎟ 1 ⎜ ⎟ ⎜ β . . . .. .. .. Hn ∼ √ ⎟, ⎜ ⎟ 2 nβ ⎜ √ ⎠ ⎝ χ2β G 2 χ√ β χβ G 2 where the elements on the diagonal are each independent Gaussians with mean 0 and variance 2. The χ s on the super and subdiagonal are equal giving a symmetric tridiagonal.

RANDOM MATRIX THEORY, NUMERICAL COMPUTATION AND APPLICATIONS

75

The bidiagonal for a the singular values of a general ghost is similar with chi’s running on the diagonal and off-diagonal respectively. See [DE02] for details. We repeat the key point that these “shadow” matrices are real and can therefore be used to compute the eigenvalues or the singular values very efficiently. The notion is that they are not artificial constructions, but what we must get when we apply the ghost Householder transformations. 5.4. Jack Polynomials and Ghosts. Around 1970, Henry Jack, a Scottish mathematician, obtained a sequence of symmetric polynomials Jκα (x) that are closely connected to our ghosts. The parameter α = 2/β for our purposes, and κ is a partition of an integer k. The argument x can be a finite vector or a matrix. It can also be a formal infinite sequence. With MOPS [DES07], we can press a few buttons before understanding the polynomials just to see what they look like for the partition [2,1,1] of 4: 1 (x, y, z) = 3xyz(x + y + z) J2,1,1

12α2 xyz(x + y + z) (1 + α)2 α (x, y, z) = 2(3 + α)m2,1,1 + 24m1,1,1,1, J2,1,1 α J2,1,1 (x, y, z) =

where m2,1,1 and m1,1,1,1 denote the monomial symmetric functions. When β = 2, the Jack polynomials are the Schur polynomials that are widely used in combinatorics and representation theory. When β = 1, the Jack polynomials are the zonal polynomials. A wonderful reference for β = 1, is [Mui82]. In general see [Sta89, Mac95]. We will not define the Jack polynomials here. Numerical and symbolic routines for their computation may be found in [DES07, KE06] respectively. We expect that the Jack Polynomial formula gives consistent moments for Q through what might be seen as a generating function. Let A and B be diagonal matrices of indeterminates. The formula EQ Jκ (AQBQ ) = Jκ (A)Jκ (B)/Jκ (I), provides expressions for moments in Q. Here the Jκ are the Jack Polynomials with parameter α = 2/β [Jac70, Sta89]. This formula is an analog of Theorem 7.2.5 of page 243 of [Mui82]. It must be understood that the formula is a generating function involving the moments of Q and Q . This is formally true whether or not one thinks that Q exists, or whether the formula is consistent or complete. For square Ghost Gaussian matrices, we expect an analog such that EG Jκ (AGBG ) = cκ(β) Jκ (A)Jκ (B). 5.5. Ghost Jacobian Computations. We propose a β-dimensional volume in what in retrospect must seem a straightforward manner. The volume element (dx)∧ satisfies the key scaling relationship. This makes us want to look a little into “fractal theory,” but at the moment we are suspecting this is not really the key direction. Nonetheless we keep an open mind. The important relationship must be





a 0 EeaX < +∞ (ψ2 -condition); 2 (3) ∃B, b > 0 ∀λ ∈ R EeλX ≤ Beλ b (Laplace transform condition); √ 1/p (4) ∃K > 0 ∀p ≥ 1 (E|X|p ) ≤ K p (moment condition). Moreover, if X is a centered random variable, (3) can be rewritten as 2  (3) ∃b > 0 ∀λ ∈ R EeλX ≤ eλ b . Proof. The proof is a series of elementary calculations. (1) ⇒ (2) Let a < v. By the integral distribution formula, ∞ ∞ 2 aX 2 at2 =1+ 2ate · P(|X| > t) dt ≤ 1 + 2at · Ce−(v−a)t dt < +∞. Ee 0

0

(2) ⇒ (3) Let λ be any real number. Then 2

2

2

2

EeλX = EeλX−aX eaX ≤ sup eλt−at · EeaX ≤ Beλ

2

t∈R

(3) ⇒ (4) Set λ = get

/4a

.

√ p. Replacing, as before, the the function by its supremum, we √ p − pt

E|X| ≤ sup t e p

t>0

√ p|X|

· Ee

 √ p p ≤ · Cepb . e √ p

(4) ⇒ (1) Assume first t ≥ eK. Choose p so that t = e−1 .  √ p K p 2 E|X|p ≤ = e−p = e−vt , P(|X| > t) ≤ p t t K

NON-ASYMPTOTIC THEORY

87

where v = e−2 K −2 . This proves (1) for t ≥ eK. Setting C = e automatically guaranties that (1) holds for 0 < t < eK as well. (3) We will assume that (3) holds with B > 1 since otherwise the statement is trivial. Assume first that X is symmetric. For large values of λ, we can √ derive (3) with constant B = 1 by changing the parameter b. Indeed, set λ0 = 2a and 2 2¯ choose ¯b > 0 so that Beλ0 b ≤ eλ0 b . This guarantees that (3) holds for all λ such that |λ| ≥ λ0 with B = 1 and b replaced by ¯b. If λ2 ≤ 2a, then by Holder’s inequality and the ψ2 -condition,  2  2  2 2 2 λ /2a 1 λ ≤ exp c EeλX = E (eλX + e−λX ) ≤ Eeλ X /2 ≤ EeaX . 2 2a Finally, we set b = max(c/2a, ¯b). In the general case, we use a simple symmetrization. Let X be an independent copy of X. Then by Jensen’s inequality, 



EeλX = Eeλ(X−EX ) ≤ Eeλ(X−X ) , where X − X is a symmetric subgaussian random variable.



Remark. The ψ2 -condition turns the set of centered subgaussian random variables into a normed space. Define the function ψ2 : R → R by ψ2 (t) = exp(t2 ) − 1. Then for a non-zero random variable set X ψ2 = inf{s > 0 | Eψ2 (X/s) ≤ 1}. The subgaussian random variables equipped with this norm form an Orlicz space (see [18] for the details). To estimate the first singular value, we have to prove a large deviation inequality for a linear combination of independent subgaussian random variables. Note that a linear combination of independent Gaussian random variables is Gaussian. We prove below that a linear combination of independent subgaussian random variables is subgaussian. Theorem 3.3. Let X1 , . . . , Xn be independent centered subgaussian random variables. Then for any a1 , . . . , an ∈ R  ⎛ ⎞    n  2   ct P ⎝ aj Xj  > t⎠ ≤ 2 exp − n . 2  j=1  j=1 aj 1/2  n 2 Proof. Set vj = aj / . We have to show that the random variable j=1 aj n Y = j=1 vj Xj is subgaussian. Let us check the Laplace transform condition (3) . For any λ ∈ R ⎛ ⎞ n n   E exp ⎝λ vj Xj ⎠ = E exp(λvj Xj ) j=1



n 

j=1



exp(λ2 vj2 b) = exp ⎝λ2 b

j=1

n 

⎞ vj2 ⎠ = eλ b . 2

j=1

The inequality here follows from (3) . Note that the fact that the constant in front  of the exponent in (3) is 1 plays the crucial role here.

88

MARK RUDELSON

Theorem 3.3 can be used to give a very short proof of a classical inequality due to Khinchin. Theorem 3.4 (Khinchin). Let X1 , . . . , Xn be independent centered subgaussian random variables. For any p ≥ 1 there exist Ap , Bp > 0 such that the inequality p ⎞1/p ⎛ ⎞1/2 ⎛  ⎛ ⎞1/2   n n n     a2j ⎠ ≤ ⎝E  aj Xj  ⎠ ≤ Bp ⎝ a2j ⎠ Ap ⎝   j=1 j=1 j=1 holds for all a1 , . . . , an ∈ R.

1/2  n 2 = 1. Proof. Without loss of generality, assume that j=1 aj Let p ≥ 2. Then by H¨ older’s inequality 2 ⎞1/2 ⎛ p ⎞1/p ⎞1/2 ⎛  ⎛      n n n      ⎟ ⎜ 2⎠    ⎝ ⎝ aj = ⎝E  aj X j  ⎠ ≤ E aj Xj  ⎠ ,  j=1   j=1 j=1

 so Ap = 1. By Theorem 3.3, Y = nj=1 aj Xj is a subgaussian random variable. Hence, √ 1/p (E|Y |p ) ≤ C p =: Bp . This is the right asymptotic as p → ∞. In the case 1 ≤ p ≤ 2 it is enough to prove the inequality for p = 1. As before, by H¨older’s inequality, we can choose Bp = 1. Applying Khinchin’s inequality with p = 3, we get 1/2 3/4 3/2  1/2  1/2 E|Y |2 ≤ (E|Y |) · B3 . E|Y |2 = E|Y |1/2 · |Y |3/2 ≤ (E|Y |) · E|Y |3 Hence,

 1/2 B3−3 E|Y |2 ≤ E|Y |.



4. Invertibility of a rectangular random matrix We introduce the ε-net argument, which will enable us to bound the condition number for a random N × n matrix with independent entries in the case when N  n. To simplify the proofs we assume from now on that the entries of the matrix are centered, subgaussian random variables. Recall the definition of an ε-net. Definition 4.1. Let (T, d) be a metric space. Let K ⊂ T . A set N ⊂ T is called an ε-net for K if ∀x ∈ K ∃ y ∈ N d(x, y) < ε. A set S ⊂ K is called ε-separated if ∀x, y ∈ S

d(x, y) ≥ ε.

The union of ε-balls centered at the ε-net N covers K, while the ε-balls centered at S form a packing. These two notions are closely related. Namely, we have the following elementary Lemma. Lemma 4.2. Let K be a subset of a metric space (T, d), and let N ⊂ T be an ε-net for K. Then

NON-ASYMPTOTIC THEORY

89

(1) there exists a 2ε-net N ⊂ K such that |N | ≤ |N |; (2) any 2ε-separated set S ⊂ K satisfies |S| ≤ |N |. (3) From the other side, any maximal ε-separated set S ⊂ K is an ε-net for K. We leave the proof of this lemma for a reader as an exercise. Lemma 4.3 (Volumetric estimate). For any ε < 1 there exists an ε-net N ⊂ S n−1 such that  n 3 |N | ≤ . ε Proof. Let N be a maximal ε-separated subset of S n−1 . Then for any distinct points x, y ∈ N  ε  ε   x + B2n ∩ y + B2n = ∅. 2 2 Hence,    ε    ε n ε  n x + B2 ≤ vol 1 + B2n , B2 = vol |N | · vol 2 2 2 x∈N

which implies |N | ≤

 n  n 3 2 1+ ≤ . ε ε



Using ε-nets, we prove a basic bound on the first singular value of a random subgaussian matrix: Proposition 4.4 (First singular value). Let A be an N × n random matrix, N ≥ n, whose entries are independent copies of a centered subgaussian random variable. Then √   2 P s1 (A) > t N ≤ e−c0 t N for t ≥ C0 . Proof. Let N be a (1/2)-net in S N −1 and M be a (1/2)-net in S n−1 . For any u ∈ S n−1 , we can choose a x ∈ N such that x − u 2 < 1/2. Then 1 Au 2 ≤ Ax 2 + A · x − u 2 ≤ Ax 2 + A . 2 This shows that A ≤ 2 supx∈N Ax 2 = 2 supx∈N supv∈S N −1 Ax, v . Approximating v in a similar way by an element of M, we obtain A ≤ 4

max

x∈N , y∈M

| Ax, y |.

By Lemma 4.3, we can choose these nets so that |N | ≤ 6N ,

|M| ≤ 6n .

By Theorem 3.3, for every x ∈ N and y ∈ M, the random variable Ax, y = N n j=1 k=1 aj,k yj xk is subgaussian, i.e., √   2 for t > 0. P | Ax, y | > t N ≤ C1 e−c1 t N Taking the union bound, we get √   P A > t N ≤ |N ||M|

√   P | Ax, y | > t N /4

max

x∈N , y∈N

≤ 6 · 6 · C1 e−c2 t N

N

2

N

≤ C1 e−c0 t

2

N

,

90

MARK RUDELSON

provided that t ≥ C0 for an appropriately chosen constant C0 > 0. This completes the proof.  √ Proposition 4.4 means that for any N ≥ n the first singular value is O( N ) with probability close to 1. Thus, the bound for the condition number reduces to a lower estimate of the last singular value. To obtain it, we prove an easy estimate for a small ball probability of a sum of independent random variables. Lemma 4.5. Let ξ1 , . . . , ξn be independent copies of a centered subgaussian random variable with variance 1. Then there exists μ ∈ (0, 1)  such that for every n coefficient vector a = (a1 , . . . , an ) ∈ S n−1 the random sum S = k=1 ak ξk satisfies P(|S| < 1/2) ≤ μ. Proof. Let 0 < λ < (ES 2 )1/2 = 1. By the Cauchy–Schwarz inequality, 1/2  ES 2 = ES 2 1[−λ,λ] (S) + ES 2 1R\[−λ,λ] (S) ≤ λ2 + ES 4 P(|S| > λ)1/2 . This leads to the Paley–Zygmund inequality: (1 − λ2 )2 (ES 2 − λ2 )2 = . 4 ES ES 4 By Theorem 3.3, the random variable S is subgaussian, so by part (4) of Theo rem 3.2, ES 4 ≤ C. To finish the proof, set λ = 1/2. P(|S| > λ) ≥

Lemma 4.5 implies the following invertibility estimate for a fixed vector. Corollary 4.6. Let A be a matrix as in Proposition 4.4. Assume that all entries of A have variance 1. Then there exist constants η, ν ∈ (0, 1) such that for every x ∈ S n−1 , √ P( Ax 2 < η N ) ≤ ν N . Proof. The coordinates of the vector Ax are independent linear combinations of i.i.d. subgaussian random variables with coefficients (x1 , . . . , xn ) ∈ S n−1 . Hence, by Lemma 4.5, P(|(Ax)j | < √ 1/2) ≤ μ for all j = 1, . . . , N . Assume that Ax 2 < η N . Then |(Ax)j | < 1/2 for at least (1−4η 2 )N > N/2 coordinates. If η is small enough, then the number M of subsets of {1, . . . , N } with at least (1 − 4η 2 )N elements is less than μ−N/4 . Then the union bound implies √  P( Ax 2 < η N ) ≤ M · μN/2 ≤ μN/4 . Combining this with the ε-net argument, we obtain the estimate for the smallest singular value of a random matrix, whose dimensions are significantly different. Proposition 4.7 (Smallest singular value of rectangular matrices). Let A be an N × n matrix whose entries are i.i.d. centered subgaussian random variables with variance 1. There exist c1 , c2 > 0 and δ0 ∈ (0, 1) such that if n < δ0 N , then √   Ax 2 ≤ c1 N ≤ e−c2 N . (4.1) P min n−1 x∈S

Proof. Let ε > 0 to be chosen later. Let N be an ε-net in S n−1 of cardinality |N | ≤ (3/ε)n . Let η and ν be the numbers in Corollary 4.6. Then by the union bound,  √  (4.2) P ∃y ∈ N : Ay 2 < η N ≤ (3/ε)n · ν N .

NON-ASYMPTOTIC THEORY

91

√ √ Let V be the event that A ≤ C0 N and Ay 2 ≥ η N for all points y ∈ N . Assume that V occurs, and let x ∈ S n−1 be any point. Choose y ∈ N such that y − x 2 < ε. Then √ √ √ η N , Ax 2 ≥ Ay 2 − A · x − y 2 ≥ η N − C0 N · ε = 2 if we set ε = η/(2C0 ). By (4.2) and Proposition 4.4,   n/N N P(V c ) ≤ ν · (3/ε) + e−c N ≤ e−c2 N , if we assume that n/N ≤ δ0 for an appropriately chosen δ0 < 1. This completes the proof.  Remark. Note that although we assumed that the entries of the matrix A are independent, Proposition 4.7 can be proved under a weaker assumption. It is enough to assume that for any x ∈ S n−1 , the coordinates of the vector Ax are independent centered subgaussian random variables of unit variance. Indeed, in this case Corollary 4.6 applies without any changes. We will use this observation in Subsection 8.2. 5. Invertibility of a square matrix: absolutely continuous entries Until recently, much less has been known about the behavior of the smallest singular value of a square matrix. In the classic work on numerical inversion of large matrices, von Neumann and his associates used random matrices to test their algorithms, and they speculated that (5.1)

sn (A) ∼ n−1/2

with high probability

(see [47, pp. 14, 477, 555]). In a more precise form, this estimate was conjectured by Smale [33] and proved by Edelman [6] and Szarek [35] for random Gaussian matrices A, i.e., those with i.i.d. standard normal entries. Edelman’s theorem states that for every ε ∈ (0, 1),   (5.2) P sn (A) ≤ εn−1/2 ∼ ε. Conjecture (5.1) for general random matrices was an open problem, unknown even for the random sign matrices A, i.e., those whose entries are ±1 symmetric random variables. The first polynomial bound for the smallest singular value of a random matrix with i.i.d. subgaussian, in particular, ±1 entries was obtained in [26]. It was proved that for such matrix sn (A) ≥ Cn−3/2 with high probability. Following that, Tao and Vu proved that if A is a ±1 random matrix, then for any α > 0 there exists β > 0 such that sn (A) ≥ n−β with probability at least 1 − n−α . In [27] the conjecture (5.1) is proved in full generality under the fourth moment assumption. Theorem 5.1 (Invertibility: fourth moment). Let A be an n × n matrix whose entries are independent centered real random variables with variances at least 1 and fourth moments bounded by B. Then, for every δ > 0 there exist ε > 0 and n0 which depend (polynomially) only on δ and B, such that   for all n ≥ n0 . P sn (A) ≤ εn−1/2 ≤ δ

92

MARK RUDELSON

This shows in particular that the median of sn (A) is at least of order n−1/2 . To show that sn (A) ∼ n−1/2 with high probability, one has to prove a matching lower bound. This was done in [29] for matrices with subgaussian entries and extended in [45] to matrices, whose entries have the finite fourth moment. Under stronger moment assumptions, more is known about the distribution of the largest singular value, and similarly one hopes to know more about the smallest singular value. One might then expect that the estimate (5.2) for the distribution of the smallest singular value of Gaussian matrices should hold for all subgaussian matrices. Note however that (5.2) fails for the random sign matrices, since they are singular with positive probability. Estimating the probability of singularity for random sign matrices is a longstanding open problem. Even proving that it converges to 0 as n → ∞ is a nontrivial result due to Koml´ os [17]. Later Kahn, Koml´os and Szemer´edi [16] showed that it is exponentially small:   (5.3) P random sign matrix A is singular < cn for some universal constant c ∈ (0, 1). The often √ conjectured optimal value of c is 1/2 + o(1) [16], and the best known value 1/ 2 + o(1) is due to Bourgain, Vu, and Wood [4], (see [37, 39] for earlier results). Spielman and Teng [34] conjectured that (5.2) should hold for the random sign matrices up to an exponentially small term that accounts for their singularity probability:   P sn (A) ≤ εn−1/2 ≤ ε + cn . We prove Spielman-Teng’s conjecture up to a coefficient in front of ε. Moreover, we show that this type of behavior is common for all matrices with subgaussian i.i.d. entries. For a bound for random matrices with general i.i.d. entries see [27]. Theorem 5.2 (Invertibility: subgaussian). Let A be an n × n matrix whose entries are independent copies of a centered subgaussian real random variable. Then for every ε ≥ 0, one has   (5.4) P sn (A) ≤ εn−1/2 ≤ Cε + cn , where C > 0 and c ∈ (0, 1). Note that setting ε = 0 we recover the result of Kahn, Koml´ os and Szemer´edi. Also, note that the question whether (5.4) holds for random sign matrices with coefficient C = 1 remains open. We shall start with an attempt to apply the ε-net argument. Let us consider an n × n Gaussian matrix, i.e., a matrix with independent N (0, 1) entries. In this case, for any x ∈ S n−1 , the vector Ax has independent N (0, 1) coordinates, so it is distributed like the standard Gaussian vector in Rn . Hence, for any t > 0, √ √ 2 P( Ax 2 ≤ t n) = (2π)−n/2 √ e−x2 /2 dx ≤ (2π)−n/2 vol(t n · B2n ) t n·B2n

≤ (C1 t) . n

Fix ε > 0. Let N be an ε-net in S n−1 of cardinality |N | ≤ (3/ε)n . Then by the union bound,   P ∃x ∈ N : Ax 2 < tn1/2 ≤ (3/ε)n · (C1 t)n .

NON-ASYMPTOTIC THEORY

93

To obtain a meaningful estimate we have to require (3/ε) · (C1 t) < 1.

(5.5)

√ As in Proposition 4.7, we may assume that A ≤ C0 n, since the complement of this event has √ an exponentially small probability. Assume that for any y ∈ N , Ay 2 ≥ t n. Given x ∈ S n−1 , find y ∈ N satisfying x − y 2 < ε. Then Ax 2 ≥ Ay 2 − A · x − y 2 ≥ tn1/2 − C0 n1/2 · ε.

To obtain a non-trivial lower bound, we have to assume that t > C0 ε.

(5.6)

Unfortunately, the system of inequalities (5.5) and (5.6) turns out to be inconsistent, and the ε-net argument fails for the square matrix. Nevertheless, a part of this idea can be salvaged. Namely, if the cardinality of the ε-net satisfies a better estimate |N | ≤ (α/ε)n

(5.7)

for a small constant α > 0, then (5.5) is replaced by (α/ε) · (C1 t) < 1, and the system (5.5), (5.6) becomes consistent. Although the estimate (5.7) is impossible for the whole sphere, it can be obtained for a small part of it. This becomes the first ingredient of our strategy: small parts of the sphere will be handled by the ε-net argument. However, the “bulk” of the sphere has to be handled differently. The proof of Theorem 5.2 for random matrices with i.i.d. subgaussian entries having a bounded density is presented below. 5.1. Conditional argument. To handle the “bulk”, we have to produce an estimate which holds for all vectors in it simultaneously, without taking the union bound. Let x ∈ S n−1 be a vector such that |x1 | ≥ n−1/2 . Denote the columns of the matrix A by X1 , . . . , Xn , and let Then Ax = (5.8)

n k=1

Hj := span(Xk | k = j). xk Xk , so

Ax 2 ≥ dist(Ax, H1 ) = dist(x1 X1 , H1 ) ≥ n−1/2 dist(X1 , H1 ).

Note that the right hand side is independent of x. Therefore it provides a uniform lower bound for all x such that |x1 | ≥ n−1/2 . Since any vector x ∈ S n−1 has a coordinate with absolute value greater than n−1/2 , we can try to extend this bound to the whole sphere. This approach immediately runs into a problem: we don’t know a priori which of the coordinates of x is big. To modify this approach we shall pick a random coordinate. To this end we have to know that the random coordinate is big with relatively high probability. This is true for vectors, which look like the vertices of a discrete cube, but is obviously false for vectors with small support, i.e. a small number of non-zero coordinates. This observation leads us to the first decomposition of the sphere: Definition 5.3 (Compressible and incompressible vectors). Fix δ, ρ ∈ (0, 1). A vector x ∈ Rn is called sparse if |supp(x)| ≤ δn. (Here supp(x) means the set of non-zero coordinates of x.) A vector x ∈ S n−1 is called compressible if x is within Euclidean distance ρ from the set of all sparse vectors. A vector x ∈ S n−1 is called incompressible if it is not compressible. The sets of sparse, compressible and incompressible vectors will be denoted by Sparse, Comp and Incomp respectively.

94

MARK RUDELSON

Using the decomposition of the sphere S n−1 = Comp ∪ Incomp, we break the invertibility problem into two subproblems, for compressible and incompressible vectors:   (5.9) P sn (A) ≤ εn−1/2   ≤P inf Ax 2 ≤ εn−1/2 x∈Comp   +P inf Ax 2 ≤ εn−1/2 . x∈Incomp

On the set of compressible vectors, we obtain an inequality, which is much stronger than we need. Lemma 5.4 (Invertibility for compressible vectors). Let A be a random matrix as in Theorem 5.2, Then there exist δ, ρ, c1 , c2 > 0 such that   P inf Ax 2 ≤ c1 n1/2 ≤ e−c2 n . x∈Comp

Sketch of the proof. Any compressible vectors is close to a coordinate subspace of a small dimension δn. The restriction of our random matrix A onto such a subspace is a random rectangular n × δn matrix. Such matrices are well invertible outside of an event of exponentially small probability, provided that δ is small enough (see Proposition 4.7). By taking the union bound over all coordinate subspaces, we deduce the invertibility of the random matrix on the set of compressible vectors.  We shall fix δ and ρ as in Lemma 5.4 for the rest of the proof. The incompressible vectors are well spread in the sense that they have many coordinates of the order n−1/2 . This observation will allow us to realize the scheme described at the beginning of this section. Lemma 5.5 (Incompressible vectors are spread). Let x ∈ Incomp. Then there exists a set σ(x) ⊆ {1, . . . , n} of cardinality |σ(x)| ≥ ν1 n and such that ν ν √2 ≤ |xk | ≤ √3 for all k ∈ σ. n n Here 0 < ν1 , ν2 < 1 and ν3 > 1 are constants depending only on the parameters δ, ρ. We leave the proof of this lemma to the reader. The main difficulty in implementing the distance bound like (5.8) is to avoid taking the union bound. We achieve this in the proof of the next lemma by a random choice of a coordinate. Lemma 5.6 (Invertibility via distance). Let A be a random matrix with i.i.d. entries. Let X1 , . . . , Xn denote the column vectors of A, and let Hk denote the span of all column vectors except the k-th one: Hj = span(Xk | k = j). Then for every ε > 0, one has     1 · P dist(Xn , Hn ) < ε . (5.10) P inf Ax 2 < εν2 n−1/2 ≤ x∈Incomp ν1 Proof. Denote

  p := P dist(Xk , Hk ) < ε .

NON-ASYMPTOTIC THEORY

95

Note that since the entries of the matrix A are i.i.d., this probability does not depend on k. Then   E{k : dist(Xk , Hk ) < ε} = np. Denote by U the event that the set σ1 := {k : dist(Xk , Hk ) ≥ ε} contains more than (1 − ν1 )n elements. Then by Chebychev’s inequality, p P(U c ) ≤ . ν1 Assume that the event U occurs. Fix any incompressible vector x and let σ(x) be the set from Lemma 5.5. Then |σ1 | + |σ(x)| > (1 − ν1 )n + ν1 n = n, so the sets σ1 and σ(x) have nonempty intersection. Let k ∈ σ1 ∩ σ(x), so Writing Ax =

|xk | ≥ ν2 n−1/2

n j=1

and

dist(Xk , Hk ) ≥ ε.

xj Xj , we get

Ax 2 ≥ dist(Ax, Hk ) = dist(xk Xk , Hk ) = |xk | dist(Xk , Hk ) ≥ ν2 n−1/2 · ε. Summarizing, we have shown that   p P inf Ax 2 < εν2 n−1/2 ≤ P(U c ) ≤ . x∈Incomp ν1 This completes the proof.



Lemma 5.6 reduces the invertibility problem to a lower bound on the distance between a random vector and a random subspace. Now we reduce bounding the distance to a small ball probability estimate. Let X1 , . . . , Xn be the column vectors of A. Let Z be any unit vector orthogonal to X1 , . . . , Xn−1 . We call it a random normal. We clearly have (5.11)

dist(Xn , Hn ) ≥ | Z, Xn |.

The vector Z depends only on X1 , . . . , Xn−1 , so Z =: (a1 , . . . , an ) and Xn =: (ξ1 , . . . , ξn ) are independent. Condition on the vectors X1 , . . . , Xn−1 . Then the vector Z can be viewed as fixed, and the problem reduces to the small ball probability estimate for a linear combination of independent random variables n 

Z, Xn = ak ξ k . k=1

Assume for a moment that the distribution of a random variable ξ is absolutely continuous with bounded density. Then (5.12)

P(|ξ| < t) ≤ C t

for any t > 0.

This estimate can be extended to a linear combination of independent copies of ξ. Therefore, P(| Z, Xn | < t | Z) ≤ Ct. Integrating over X1 , . . . , Xn−1 , we obtain P(| Z, Xn | < t) ≤ Ct. Thus, combining this estimate with Lemma 5.6, we prove that   P inf Ax 2 < εν2 n−1/2 ≤ Cε. x∈Incomp

96

MARK RUDELSON

Then (5.9) and Lemma 5.4 imply Theorem 5.2 in this case even without the additive term cn . 6. Arithmetic structure and the small ball probability To prove Theorem 5.2 in the previous section, we used the small ball probability estimate (5.12). However, this estimate does not hold for a general subgaussian random variable, and in particular for any n random variable having an atom at 0. Despite this, a linear combination k=1 ak ξk of independent copies of a subgaussian random variable ξ obeys an estimate similar to (5.12) for a typical vector a = (a1 , . . . , an ) up to a certain threshold. It is easy to see that this threshold should depend on the vector a ∈ S n−1 . Indeed, assume that ξ is the random ±1 variable. Then for  n     1 1 1 (1) P a = √ , √ , 0, . . . , 0 , ak ξ k = 0 = . 2 2 2 k=1

This singular behavior is due to the fact that the vector a(1) is sparse. If we choose the vector a, which is far from the sparse ones, i.e. an incompressible vector, the small ball probability may be significantly improved. Consider for example, the vector   1 1 1 a(2) = √ , √ , . . . , √ . n n n Then by the Berry–Ess´een Theorem,    n     1 1   √ ξk  ≤ t ≤ C t + √ P   n  n k=1

This estimate cannot be improved, since for an even n,  n   1 c √ ξk = 0 ≥ √ . P n n k=1

(2) The coordinates of the nvector a are the same, which results in a lot of cancelations in the random sum k=1 ak ξk . If the arithmetic structure of the coordinates of the vector a is less rigid, the small ball probability can be improved even further. For example, for the (not normalized) vector   n    1 + 1/n 1 + 2/n 1 + n/n (3) √ , √ ,..., √ a = ak ξk = 0 ∼ n−3/2 . , P n n n k=1

Determining the influence of the arithmetic structure nof the coordinates of a vector a on the small ball probability for the random sum k=1 ak ξk became known as the Littlewood–Offord Problem. It was investigated by Littlewood and Offord [19], Erd¨os [7], S´arc¨ ozy and Szem´eredi [32], etc. Recently Tao and Vu [40] put forward the inverse Littlewood–Offord theorems, stating that the large value of the small ball probability implies a rigid arithmetic structure. The inverse Littlewood– Offord theorems are extensively discussed in [38], see also [24] for current results in this direction. We will need a result of this type for the conditional argument to compensate for the lack of the bound (5.12). The additive structure of a sequence a = (a1 , . . . , an ) of real numbers ak can be described in terms of the shortest arithmetic progression into which it embeds. This

NON-ASYMPTOTIC THEORY

97

length is conveniently expressed as the least common denominator of a, defined as follows: + * lcd(a) := inf θ > 0 : θa ∈ Zn \ {0} . For the vector a(2) , lcd(a

(2)

 n  ,  √ )= n∼1 P ak ξ k = 0 . k=1

A similar phenomenon occurs for the vector a(3) :  n  ,  (3) 3/2 lcd(a ) = n ∼1 P ak ξ k = 0 . k=1

This suggests that the least common denominator of the sequence controls the small ball probability. However, in the case when t > 0, or when the random variable ξ is not purely discrete, the precise inclusion θa ∈ Zn \ {0} loses its meaning. It should be relaxed to measure the closeness of the vector θa to the integer lattice. This leads us to the definition of the essential least common denominator. Fix a parameter γ ∈ (0, 1). For α > 0 define * + LCDα (a) := inf θ > 0 : dist(θa, Zn ) < min(γ θa 2 , α) . The requirement that the distance is smaller than γ θa 2 forces us to consider only non-trivial integer points as approximations of θa – only those in a small aperture cone around the direction of a (see the picture below).

√ One typically uses this definition with γ a small constant, and for α = c n with n a small constant c > 0. The inequality dist(θa, Z ) < α then yields that most coordinates of θa are within a small constant distance from integers. This choice would allow us to conclude that√the least common denominator of any incompressible vector is of order at least n. Let us formulate this statement precisely. Lemma 6.1. There exist constants γ > 0 and λ > 0 depending only on the compressibility parameters δ, ρ such that any incompressible vector a satisfies LCDα (a) √ ≥ λ n.

98

MARK RUDELSON

Proof. Assume that a is an incompressible vector, and let σ(a) be the set √ defined in Lemma 5.5. If LCDα (a) < λ n, then √ √ for some θ ∈ (0, λ n), z ∈ Zn . θa − z 2 < γθ < γλ n Let I(a) be the set of all j ∈ {1, . . . , n} such that 2γλ . ν1 The previous inequality implies that |I(a)| > (1 − ν1 /2)n. Therefore, for the set J(a) = I(a) ∩ σ(a), we have ν1 |J(a)| > n. 2 For any j ∈ J(a), we have √ 2γλ 2γλ ν3 |zj | < θ|aj | + γθ θa − z 2 ≥ ⎝ θ 2 a2j ⎠ > θν2 2 |θaj − zj |
0, and for ε≥ we have

(4/π) , LCDα (a)

  n    2   P  ak ξk  ≤ ε ≤ Cε + Ce−cα .   k=1

We shall prove more than is claimed in the Theorem. Instead of the small ball probability we shall bound a parameter, which controls the concentration of a random variable around any fixed point. Definition 6.3. The L´evy concentration function of a random variable S is defined for ε > 0 as L(S, ε) = sup P(|S − v| ≤ ε). v∈R

The proof of the Theorem uses the Fourier-analytic approach developed by Hal´asz [15], [14]. We start with the classical Lemma of Ess´een, which estimate the L´evy concentration function in terms of the characteristic function of a random variable.

NON-ASYMPTOTIC THEORY

99

Lemma 6.4. Let Y be a real-valued random variable. Then 2 |φY (θ)| dθ, sup P(|Y − v| ≤ 1) ≤ C −2

v∈R

where φY (θ) = E exp(iθY ) is the characteristic function of Y . ˆ Proof. Let ψ = χ[−1,1] ∗ χ[−1,1] and let f = ψ:  2 2 sin t f (t) = . t Then both f ∈ L1 (R) and ψ ∈ L1 (R), so f satisfies the Fourier inversion formula. Note also, that f (t) ≥ c whenever |t| ≤ 1. Therefore, 1 P(|X − v| ≤ 1) = Eχ[−1,1] (X − v) ≤ Ef (X − v) c   1 1 1 = E ψ(θ)eiθ(X−v) dθ ≤ ψ(θ)|Eeiθ(X−v) | dθ c 2π R 2πc R 2 1 ≤ |EeiθX | dθ. πc −2 The last inequality follows from supp(ψ) = [−2, 2] and ψ(x) ≤ 2.



Proof of Theorem 6.2. To make the proof more transparent, we shall assume that ξ isthe random ±1 variable. The general case is considered in [27]. n Let S = j=1 aj ξj . Applying Ess´een’s Lemma to the random variable Y = S/ε, we obtain 2  2 n |φS (θ/ε)| dθ = C |φj (θ/ε)| dθ, (6.1) L(S, ε) ≤ C −2

−2 j=1

where φj (t) = E exp(iaj ξj t) = cos(aj t). The last equality in (6.1) follows from the independence of ξj , j = 1, . . . , n. The inequality |x| ≤ exp(− 12 (1 − x2 )), which is valid for all x ∈ R, implies     1 1 2 2 2 |φj (t)| ≤ exp − sin (aj t) ≤ exp − min | aj t − q| . 2 2 q∈Z π In the last inequality we estimated the absolute value of the sinus by a piecewise linear function, see the picture below.

100

MARK RUDELSON

Combining the previous inequalities, we get ⎞ ⎛ 2  2 n   2 1 θ L(S, ε) ≤ C (6.2) exp ⎝− min  aj · − q  ⎠ dθ 2 j=1 q∈Z  π ε −2 2 =C exp(−h2 (θ)/2) dθ, −2

where

    2  · θa − p h(θ) = minn   . p∈Z πε 2

Since by the assumption, 4/(πε) ≤ LCDα (a), the definition of the least common denominator implies that for any θ ∈ [−2, 2], 2 · θ a 2 , α). πε Recall that a 2 = 1. Then the previous inequality implies   2 ,  2γ 2 + exp(−α2 /2). θ exp(−h2 (θ)/2) ≤ exp − πε h(θ) ≥ min(γ



Substituting this into (6.2) we complete the proof.

To apply the previous result for random matrices we shall combine it with the following Tensorization Lemma. Lemma 6.5 (Tensorization). Let ζ1 , . . . , ζm be independent real random variables, and let K, ε0 ≥ 0. Assume that for each k P(|ζk | < ε) ≤ Kε Then

for all ε ≥ ε0 .

m   ζk2 < ε2 m ≤ (CKε)m P

for all ε ≥ ε0 ,

k=1

where C is an absolute constant. Proof. Let ε ≥ ε0 . We have m m m      1  2 1  2 ζk2 < ε2 m = P m − 2 ζk > 0 ≤ E exp m − 2 ζk P ε ε k=1

k=1

= em

(6.3)

m 

k=1

E exp(−ζk2 /ε2 ).

k=1

By Fubini’s theorem, E exp(−ζk2 /ε2 )

=E



−u2

2ue |ζ|/ε





du =

2ue−u P(|ζk |/ε < u) du. 2

0

For u ∈ (0, 1), we have P(|ζk |/ε < u) ≤ P(|ζk | < ε) ≤ Kε. This and the assumption of the lemma yields 1 ∞ 2 2 E exp(−ζk2 /ε2 ) ≤ 2ue−u Kε du + 2ue−u Kεu du ≤ CKε. 0

1

NON-ASYMPTOTIC THEORY

101

Putting this into (6.3) yields m   ζk2 < ε2 m ≤ em (CKε)m . P k=1

This completes the proof.



Combining Theorem 6.2 and Lemma 6.5 yields the multidimensional small ball probability estimate similar to the one we had for absolutely continuous random variable. Lemma 6.6 (Invertibility on a single vector). Let A be an m×n random matrix, whose entries are independent copies of a centered subgaussian random variable ξ of unit variance. Then for any α > 0, for every vector x ∈ S n−1 , and for every t ≥ 0, satisfying   (4/π) −cα2 t ≥ max , e , LCDα (x) one has   P A x 2 < tn1/2 ≤ (Ct)m .  2 2 2 To proveLemma 6.6, note that A x 2 can be represented as A x 2 = m k=1 ζk , n where ζk = j=1 a k,j xj are i.i.d. random variables satisfying the conditions of Theorem 6.2. 7. Putting all ingredients together Now we have developed all necessary tools to prove the invertibility theorem, which has been formulated in Section 5. Theorem. 5.2. Let A be an n × n matrix whose entries are independent copies of a centered subgaussian real random variable of unit variance. Then for every ε ≥ 0 one has   P sn (A) ≤ εn−1/2 ≤ Cε + cn , where C > 0 and c ∈ (0, 1). Recall that we have divided the unit sphere into compressible and incompressible vectors (see Definition 5.3 and inequality (5.9)), and proved that the first term in (5.9) is exponentially small. Applying Lemma 5.6 and (5.11), we reduced the estimate for the second term to the bound for   p(ε) := P | Z, Xn | ≤ ε , where Xn is the n-th column of the matrix A, and Z is a unit vector orthogonal to the first n − 1 columns. To complete the proof, we have to show that p(ε) ≤ Cε, n whenever ε ≥ e−cn . Here Z, Xn = j=1 Zj ξj , where Z = (Z1 , . . . , Zn ). Throughout the rest of the proof set √ (7.2) α = β n, (7.1)

where β > 0 is a small absolute constant, which will be chosen at the end of the proof. If LCDα (Z) ≥ ecn , then (7.1) follows from Theorem 6.2. Therefore, our problem has been further reduced to proving

102

MARK RUDELSON

Theorem 7.1 (Random normal). Let X1 , . . . , Xn−1 be random vectors whose coordinates are independent copies of a centered subgaussian random variable ξ. Consider a unit vector Z orthogonal to all these vectors. There exist constants c, c > 0 such that    P LCDα (Z) < ecn ≤ e−c n . The components of a random vector should be arithmetically incommensurate to the extent that their essential LCD is exponential in n. Intuitively, this is rather obvious for a random vector uniformly distributed over the sphere. It can be rigorously checked by estimating the total area of the points on the sphere, which have smaller values of the LCD. However, the distribution of the random normal Z is more involved, and it requires some work to confirm this intuition. T Proof. Let A be the (n − 1) × n matrix with rows X1T , . . . , Xn−1 . Then Z ∈



Ker(A ). The matrix A has i.i.d. entries. We start with using the decomposition similar to (5.9):   P ∃Z ∈ S n−2 LCDα (Z) < ecn and A Z = 0   ≤ P ∃Z ∈ Comp AZ = 0   + P ∃Z ∈ Incomp LCDα (Z) < ecn and A Z = 0 .

Lemma 5.4 implies that the first term in the right hand side does not exceed e−cn . Formally, we have to reprove this lemma for (n − 1) × n matrices, instead of the n × n ones, but the proof extends to this case without any changes. To bound the second term, we introduce a new decomposition of the sphere based on the LCD. Recall that by Lemma 6.1, any incompressible vector a satisfies √ LCDα (a) ≥ λ n. For D > 0, set SD = {x ∈ S n−1 | D ≤ LCDα (x) ≤ 2D}. It is enough to prove that P(∃x ∈ SD A x = 0) ≤ e−n . √ whenever λ n ≤ D ≤ ecn . Indeed, the statement of the Theorem will then follow by taking the union bound over D = 2k for k ≤ cn. To this end, we shall use the ε-net argument to bound A x 2 below. For a fixed x ∈ SD , the required estimate follows from substituting the bound LCDα (x) ≥ D in Lemma 6.6:   (7.3) P A x 2 < tn1/2 ≤ (Ct)n−1 , provided t ≥ (4/π) D . To estimate the size of the ε-net we use the bound for the essential least common denominator again. The simple volumetric bound is not sufficient for our purposes, and this is the crucial step where we explore the additive structure of SD to construct a smaller net. Lemma 7.2 √ (Nets of level sets). There exists a (4α/D)-net in SD of cardinality at most (CD/ n)n . It is important, that the mesh of this net depends on the small parameter α, while its cardinality is independent of it. This feature would allow us later to use the union bound for an appropriately chosen α. The proof of Lemma 7.2 is based on counting the number of integer points in a ball of a large radius. We will show that if x ∈ SD , then the ray {λx | λ > 0} passes

NON-ASYMPTOTIC THEORY

103

within distance α from an integer point in a ball of radius 3D. The number of such points is independent of α, and can be bounded from the volume considerations.

{λx | λ > 0}

3D

Proof. We can assume that 4α/D ≤ 1, otherwise the conclusion is trivial. To shorten the notation, denote for x ∈ SD D(x) := LCDα (x). By the definition of SD , we have D ≤ D(x) < 2D. By the definition of the essential least common denominator, there exists p ∈ Zn such that D(x)x − p 2 < α.

(7.4) Therefore

  x −

Since x 2 = 1, it follows that (7.5)

α 1 α p   ≤ ≤ .  < D(x) 2 D(x) D 4   x −

p  2α  .  < p 2 2 D

On the other hand, by (7.4) and using x 2 = 1, D(x) ≤ 2D and 4α/D ≤ 1, we obtain (7.6)

p 2 < D(x) + α ≤ 2D + α ≤ 3D.

Inequalities (7.5) and (7.6) show that the set * p + N := : p ∈ Zn ∩ B(0, 3D) p 2 argument, the number is a (2α/D)-net of SD . Recall that, by a known volumetric √ √ of integer points in B(0, 3D) is at most (1 + 9D/ n)n ≤ (CD/ n)n (where √ in the last inequality we used that by the definition of the level set, D > c0 n for all

104

MARK RUDELSON

incompressible vectors). Finally, we can find a (4α/D)-net of the same cardinality,  which lies in SD . Now we can complete the ε-argument. Recall that by Proposition 4.4, √ P(s1 (A ) ≥ C0 n) ≤ e−cn . Therefore, in order to complete the proof, it is enough to show that the event * √ + E := ∃x ∈ SD A x = 0 and A ≤ C0 n has probability at most e−n . Assume that E occurs, and let x ∈ SD be such that A x = 0. Let N be the (4α/D)-net constructed in Lemma 7.2. Choose y ∈ N such that x − y < 4α/D. Then by the triangle inequality, √ 4α n = 4C0 β , A y 2 ≤ A · x − y 2 < C0 n · D D √ √ if we recall that α = β n. Set t = 4C0 β n/D. Combining the estimate (7.3) for this t with the union bound, we obtain  n √ CD

n−1 P(E) ≤ P(∃y ∈ N A y 2 ≤ t n) ≤ |N | · (Ct) ≤ √ · (Ct)n−1 n   CD n−1 ≤ √ . · (4CC0 β) n Since D ≤ ecn , we can choose the constant β so that the right hand side of the previous inequality will be less than e−n . The proof of Theorem 5.2 is complete.  8. Short Khinchin inequality Let 1 ≤ p < ∞. Recall that · p denotes the standard p norm in Rn , and Bpn its unit ball. Let X ∈ Rn be a vector with independent centered random ±1 coordinates, i.e. a random vertex of the discrete cube {−1, 1}n . The classical Khinchin inequality, 1/p  Theorem 3.4, asserts that for any p ≥ 1 and for any vector a ∈ Rn , E| a, X |p is equivalent to a 2 up to multiplicative constants depending on p. This equivalence can be obtained if one averages not over the whole discrete cube, but over some small part of it. The problem how small should this set be was around since midseventies. More precisely, Let p ≥ 1. Find constants αp , βp and a set V ⊂ {−1, 1}n of a small cardinality such that  1/p 1  p | a, x | ≤ βp a 2 αp a 2 ≤ |V | x∈V

for any a ∈ R . Deterministic constructions of sets V of reasonably small cardinality are unknown. Therefore, we shall construct the set V probabilistically. Namely, we choose N = N (n, p) and consider N independent copies X1 , . . . , XN of the random vector X. If N  2n/2 , in particular, if N is polynomial in n, all vectors X1 , . . . , XN are n

NON-ASYMPTOTIC THEORY

105

distinct with high probability. The problem thus is reduced to showing that with high probability, any vector y ∈ Rn satisfies ⎞1/p ⎛ N  1 | y, Xj |p ⎠ ≤ βp y 2 . (8.1) αp y 2 ≤ ⎝ N j=1 This problem can be recast in the language of random matrices. Let A be the N ×n matrix with rows X1 , . . . , XN . Then the inequality above means that A defines a nice isomorphic embedding of n2 into N p . As in the proof of the original Khinchin inequality, we consider cases p = 1 and p > 2 separately. 8.1. Short Khinchin inequality for p = 1. In this case we derive the inequality (8.1) in a more general setup. Assume that the coordinates of the vector X are i.i.d. centered subgaussian variables. The middle term in (8.1) can be rewritten as N −1/p Ay p , where A is the matrix with columns X1 , . . . , XN . In this language, establishing (8.1) is equivalent to estimating the the maximum and ther minimum of Ay p over the unit sphere.   √  ≤ N · 4.4 combined with the inequality A : n2 → N 1   Proposition  A : n2 → N yields the following 2 Proposition 8.1. Let A be an N × n random matrix, N ≥ n, whose entries are independent copies of a subgaussian random variable. Then     > tN ≤ e−c0 t2 N P A : n2 → N for t ≥ C0 . 1 This implies the second inequality in (8.1) with p = 1 and β1 = C0 , so (8.1) is reduced to the first inequality. To establish it we apply the random matrix machinery developed in the previous sections. Without loss of generality, we may assume that n ≤ N ≤ 2n, because we are looking for small values of N . Then the following Theorem will imply that the short Khinchin inequality holds for any N ≥ n with α1 depending only on the ratio of N/n. Theorem 8.2. Let n, N be natural numbers such that n ≤ N ≤ 2n. Let A be an N × n matrix, whose entries are i.i.d. centered subgaussian random variable of variance 1. Set m = N − n + 1. Then for any ε > 0 m    CN ·ε + cn , P ∃x ∈ S n−1 Ax 1 < εm ≤ m where C > 0 and c ∈ (0, 1). Proof. Adding to the entries of A small multiples of independent N (0, 1) variables, we may assume that the entries of A are absolutely continuous, so the matrix A is of a full rank almost surely. We start with an elementary lemma from linear algebra. The section of B1N by the range of the operator A is a convex polytope. The next lemma shows that the minimum of the 2 norm of Ay over the Euclidean unit sphere is attained at a point y, which is mapped to a multiple of a vertex of this polytope. Such vertex will have exactly m non-zero coordinates.

106

MARK RUDELSON

Lemma 8.3. Let N > n and let A : Rn → RN be a random matrix with absolutely continuous entries. Let x ∈ S n−1 be a vector for which Ax 1 attains the minimal value. Then | supp(Ax)| = N − n + 1 almost surely. Proof. Let E = ARn and let K = B1N ∩ E. Set y = Ax/ Ax 1 . Since the function g : S n−1 → (0, ∞), g(u) = Au 1 attains the minimum at u = x, the function f : K → (0, ∞), f (z) = A−1 |E z 2 attains the maximum over K at z = y. The convexity of · 2 implies that y is an extreme point of K. Since K is the intersection of the octahedron B1N with an n-dimensional subspace, this means that |supp y| ≤ N − n + 1. Finally, since the entries of A are absolutely continuous, any coordinate subspace F ⊂ RN , whose dimension does not exceed N − n, satisfies E ∩ F = {0} a.s. Therefore, |supp y| = N − n + 1.  This lemma allows us to reduce the minimum of Ax 1 over the whole sphere S n−1 to a certain finite subset of it. Indeed, to each subset J ⊂ {1, . . . , N } of cardinality m =  N − n + 1 corresponds a unique pair of extreme points vJ and −vJ of K such that j∈J |vJ (j)| = 1 and vJ (j) = 0 whenever j ∈ / J. Let AJ  be the matrix consisting of the rows of A, whose indices belong to J = {1, . . . , N } \ J. The vector yJ ∈ S n−1 such that AyJ = tvJ for some t > 0 is uniquely defined by the matrix AJ  via the condition AJ  yJ = 0. By Lemma 8.3, min{ Ay 1 | y ∈ S n−1 } = min{ AyJ 1 | J ⊂ {1, . . . , N }, |J| = m}. To finish the proof, we estimate AyJ 1 below and apply the union bound over the sets J. Fix a set J ⊂ {1, . . . , N } of cardinality m. Denote the rows of the matrix T . The condition AJ  yJ = 0 means that yJ is orthogonal AJ  by X1T , . . . , Xn−1 to each of these vectors. Applying Theorem 7.1 to the vectors X1 , . . . , Xn−1 , we conclude that    (8.2) P LCDα (yJ ) < ecn ≤ e−c n . Conditioning on the matrix AJ  , we may regard the vector yJ as fixed. Denote a row of the matrix AJ by Y T , so the coordinates of AJ yJ are distributed like Y, yJ . If LCDα (yJ ) ≥ ecn , then by Theorem 6.2 P(| Y, yJ | ≤ ε | AJ  ) ≤ Cε, whenever ε > Ce−cn . Then taking expectation over AJ  and using (8.2) yields 

P(| Y, yJ | ≤ ε) ≤ Cε + Ce−cn + e−c n for any ε > 0. Coordinates ζj , j ∈ J of the vector AJ yJ are i.i.d. random  variables. ζj2 . In this Tensorization Lemma 6.5 can be easily reproved for |ζj | instead of form it implies m  P( AyJ 1 ≤ εm) = P( AJ yJ 1 ≤ εm) ≤ Cε + Ce−cn for any ε > 0. Finally, taking the union bound over all sets J, we obtain   m  N · Cε + Ce−cn P(∃J |J| = m, AyJ 1 ≤ εm) ≤ m m   CN ·ε ≤ + Ce−c n . m



NON-ASYMPTOTIC THEORY

107

Assume now that N is in a fixed proportion to n, and define δ by N = (1 + δ)n. In this notation, Theorem 8.2 reads  δn+1   Cε P ∃ x ∈ S n−1 Ax 1 < εδn ≤ + cn . δ Set ε = c δ, where the constant c is chosen to make the right hand side of the inequality above smaller than 1. Then the previous estimate shows that, with high probability, the short Khinchin inequality holds for N = (1 + δ)n independent subgaussian vectors X1 , . . . , XN with p = 1 and constants α1 = cδ 2 , β1 = C0 : ∀ y ∈ Rn

cδ 2 y 2 ≤

N 1  | y, Xj | ≤ C0 y 2 . N j=1

Theorem 8.2 proves more than the short Khinchin inequality. Combining it with Proposition 4.4, we show that √ (8.3) ∀x ∈ Rn εδn x 2 ≤ Ax 1 ≤ N Ax 2 ≤ C n x 2 with probability greater than 1 − C exp(−cn) − (ε/¯ cδ)δn . The second inequality here follows from Cauchy–Schwarz, and the third one from Proposition 4.4. Inequality (8.3) immediately yields a lower bound for the smallest singular value of a rectangular random matrix. Corollary 8.4. Let n, N, value √δ, A, ε be as above. Then the smallest singular cδ)δn . of A is bounded below by εδ · n with probability at least 1 − cn − (ε/¯ This bound is not sharp for small δ. The optimal estimate  √ √  P sn (A) ≤ ε N − n − 1 ≤ (Cε)N −n+1 + cN , valid for all n, N and ε, was obtained in [28]. Another application of the inequality (8.3) is a bound on the diameter of a random section of the octahedron B1N . A celebrated theorem of Kashin [13] states that a random n-dimensional section of the standard octahedron B1N √ of dimension N = (1 + δ)n! is close to the section of the inscribed ball (1/ N )B2N . The optimal estimates for the diameter of a random section of the octahedron were obtained by Garnaev and Gluskin [8]. Recently the attention was attracted to the question whether the almost spherical sections of the octahedron can be generated by simple random matrices, in particular by a random ±1 matrix. A general result proved in [20] implies that if N = (1 + δ)n! with δ ≥ c/ log n, then a random N × n matrix A with independent subgaussian entries generates a section of the octahedron B1N which is not far from the ball with probability exponentially close to 1. More precisely, if E = ARn ⊂ RN , then √ √ (1/ N )B2N ∩ E ⊂ B1N ∩ E ⊂ ϕ(δ) · (1/ N )B2N , where ϕ(δ) ≤ C 1/δ . For random ±1 matrices this result was improved by Artstein-Avidan at al. [2], who proved a polynomial type estimate for the diameter of a section ϕ(δ) ≤ (1/δ)α for α > 5/2 and δ ≥ Cn−1/10 . Using (8.3) we obtain a polynomial estimate for the diameter of sections for smaller values of δ.

108

MARK RUDELSON

Corollary 8.5. Let n, N be natural numbers such that n < N < 2n. Denote δ = (N −n)/n. Let ξ be a centered subgaussian random variable. Let A be an N ×n matrix, whose entries are independent copies of ξ and let E = ARn . Then for any ε>0   1 1 c · √ B2N ≥ 1 − cn − (ε/¯ cδ)δn . P √ B2N ∩ E ⊂ B1N ∩ E ⊂ εδ N N Note that to make the probability bound non-trivial, we have to assume that ε = c δ for some 0 < c < c¯. In this case the corollary means that a random n-dimensional subspace E satisfies c 1 1 √ B2N ∩ E ⊂ B1N ∩ E ⊂ 2 · √ B2N . δ N N c √ This inclusion remains non-trivial as long as δ2 < N , i.e., as long as δ > cN −1/4 . 8.2. Short Khinchin inequality for p > 2. The case p > 2 requires a completely different approach. In this case we will prove the short Khinchin inequality without the assumption that the coordinates of the random vector X are independent. We shall assume instead that X is isotropic and subgaussian. The first property means that for any y ∈ S n−1 E X, y 2 = 1, while the second means that for any y ∈ S n−1 the random variable X, y is centered subgaussian. By Theorem 3.3, any random vector with independent centered subgaussian coordinates of variance 1 is isotropic subgaussian. This includes, in particular, an appropriately scaled random vertex of the discrete cube {−1, 1}n . We prove the following Theorem [11]. Theorem 8.6. Let X be an isotropic subgaussian vector in Rn . Let X1 , . . . , XN be independent copies of X. Let p > 2 and N ≥ np/2 . Then, with high probability, the inequalites ⎞1/p ⎛ N  1 √ | y, Xj |p ⎠ ≤ C p y 2 c y 2 ≤ ⎝ N j=1 hold for all y ∈ Rn . Proof. As in the classical Khinchin inequality, the first inequality in Theorem 8.6 is easy. Denote, as before, by A the N × n matrix with rows X1 , . . . , XN . Assume that n is large enough, so that N ≥ np/2 ≥ δ0−1 n, where δ0 is the constant from Proposition 4.7. By the remark after this proposition, it is applicable to the matrix A despite the fact that its entries are dependent. Combining Proposition 4.7 with the inequality y 2 ≤ N 1/2−1/p · y p , valid for all y ∈ RN , we obtain   P min Ax p ≤ c1 N 1/p ≤ e−c2 N , n−1 x∈S

which establishes the left inequality with probability exponentially close to 1. If the vectors X1 , . . . , XN were independent standard gaussian, then the right inequality in Theorem 8.6 would follow from the classical Gordon–Chevet inequality for the norm of the Gaussian linear operator, see e.g., [5]. We will establish an analog of this inequality for isotropic subgaussian vectors. To this end, we use the method of majorizing measures, or generic chaining, developed by Talagrand [36].

NON-ASYMPTOTIC THEORY

109

Let {Xt }t∈T be a real-valued random process, i.e., a collection of interdependent random variables, indexed by some set T . In the setup below, we can assume that T is finite or countable, eliminating the question of measurability of supt∈T Xt . We shall call the process {Xt }t∈T centered if EXt = 0 for all t ∈ T . Definition 8.7. Let (T, d) be a metric space. A random process {Xt }t∈T is called subgaussian with respect to the metric d if for any t, s ∈ T, t = s the random variable (Xt − Xs )/d(t, s) is subgaussian. A random process {Gt }t∈T is called Gaussian with respect to the metric d if for any finite set F ⊂ T the joint distribution of {Gt }t∈F is Gaussian, and for any t, s ∈ T, t = s (Gt − Gs )/d(t, s) is N (0, 1) random variable. We use a fundamental result of Talagrand [36] comparing subgaussian and Gaussian processes. Theorem 8.8 (Majorizing Measure Theorem). Let (T, d) be a metric space, and let {Gt }t∈T be a Gaussian random process with respect to the metric d. For any centered random process {Xt }t∈T , which is subgaussian with respect to the same metric, E sup Xt ≤ C E sup Gt . t∈T

t∈T

For (s, y) ∈ R × R define the random variable Xs,y by N

n

Xs,y =

N 

sj Xj , y .

j=1

Let us show that for any T ⊂ B2N × B2n , the random process {Xs,y }(s,y)∈T is subgaussian with respect to the Euclidean metric. For any (s, y), (s , y ) ∈ T , Xs,y − Xs ,y =

N    (sj − s j ) Xj , y + s j Xj , y − y . j=1

Let λ ∈ R. Since the vector X is centered subgaussian, for any z ∈ RN exp(λ X, z ) ≤ exp(Cλ2 z 22 ). Hence, using independence of Xj and applying Cauchy–Schwartz inequality, we get   E exp λ(Xs,y − Xs ,y ) =

N       E exp λ(sj − s j ) Xj , y · exp λs j Xj , y − y j=1



N  j=1

N     

2 exp 2Cλ2 ((sj − s j )2 y 22 ) · exp 2Cλ2 (s 2 j y − y 2 )



j=1

≤ exp 2Cλ ( s − 2

2 s 2

+ y −

2  y 2 ) .

The last inequality follows because (s, y), (s , y ) ∈ T ⊂ B2N × B2n . By Theorem 3.2 this means that the random variable Xs,y − Xs ,y (s, y) − (s , y ) 2 is subgaussian, so the process (Xs,y )(s,y)∈T is subgaussian with respect to the 2 metric.

110

MARK RUDELSON

Now we will consider a Gaussian process with respect to the same metric. Let Y and Z be independent standard Gaussian vectors in Rn and RN respectively. Set Gs,y = s, Z + y, Y . Then for any T ⊂ RN × Rn , {Gs,y }(s,y)∈T is a Gaussian process with respect to the Euclidean metric. Let 1/p + 1/p∗ = 1, and set T = BpN∗ × B2n ⊂ B2N × B2n . By the Majorizing Measure Theorem E sup Xs,y ≤ C E sup Gs,y . (s,y)∈T

(s,y)∈T

Therefore, writing the p norm as the supremum of the values of functionals over the unit ball of the dual space, we obtain ⎛ ⎞1/p N N   1 1 | Xj , y |p ⎠ = 1/p E sup sup sj Xj , y E sup ⎝ N j=1 N y∈B2n s∈B N∗ y∈B2n j=1 p

 C  C ≤ 1/p E sup sup Gs,y = 1/p E Z p + E Y 2 N N s∈BpN∗ y∈B2n  √  n √ ≤C p + 1/p . N √ Since N ≥ np/2 , the last expression does not exceed C p. To complete the proof we combine this estimate of the expectation with Chebyshev’s inequality.  Remark. The same proof can be repeated for an general normed space, instead of the space p . This would establish a version of Gordon–Chevet inequality valid for a general isotropic subgaussian vector. We omit the details. Note that Theorem 8.6 implies that the matrix A formed by the vectors X1 , . . . , XN defines a subspace of N p which is close to Euclidean, so Theorem 8.6 can be viewed as an analog of the Isomorphic Dvoretzky’s Theorem of Milman and Schechtman [21]. This, in particular, means that the bound N ≥ np/2 is optimal (see e.g., [9] for details). 9. Random unitary and orthogonal perturbations The need for probabilistic bounds for the smallest singular value of a random matrix from a certain class arises in many intrinsic problems of the random matrix theory. Such bounds are the standard step in many proofs based on the convergence of Stieltjes transforms of the empirical measures to the Stieltjes transform of the limit measure. One of the examples, where such bounds become necessary is the Circular Law [10, 41, 42]. The proof of this law requires the lower bound on the smallest singular value of a random matrix with i.i.d. entries, which was obtained above. Another setup, where such bounds become necessary, is provided by the Single Ring Theorem of Guionnet, Krishnapur and Zeitouni [12]. The proof of this theorem deals with another natural class of random matrices, namely random unitary or orthogonal perturbations of a fixed matrix. Let us consider the complex case first. Let D be a fixed n × n matrix, and let U be a random matrix uniformly distributed over the unitary group U (n). In this case the solution of the qualitative invertibility problem is trivial, since the matrix D + U is non-singular with probability 1. This can be easily concluded by

NON-ASYMPTOTIC THEORY

111

considering the determinant of D + U . The determinant, however, provides a poor tool for studying the quantitative invertibility problem. In regard to this problem we will prove the following theorem. Theorem 9.1. Let D be an arbitrary n × n matrix, n ≥ 2. Let U be a random matrix uniformly distributed over the unitary group U (n). Then P(sn (D + U ) ≤ t) ≤ tc nC

for all t > 0.

Here C and c are absolute constants. An important feature of Theorem 9.1 is its independence of the matrix D. This independence is essential for the Single Ring Theorem. The statement similar to Theorem 9.1 fails in the real case, i.e., for random matrices distributed over the orthogonal group. Indeed, suppose that n is odd. If −D, U ∈ SO(n), then −D−1 U ∈ SO(n) has the eigenvalue 1, and the matrix D + U = D(D−1 U + In ) is singular. Therefore, if U is uniformly distributed over O(n), then sn (D + U ) = 0 with probability at least 1/2. Nevertheless, it turns out that this is essentially the only obstacle to the extension of Theorem 9.1 to the orthogonal case. Theorem 9.2 (Orthogonal perturbations). Let D be a fixed n × n real matrix, n ≥ 2. Assume that (9.1)

D ≤ K,

inf

V ∈O(n)

D − V ≥ δ

for some K ≥ 1, δ ∈ (0, 1). Let U be a random matrix uniformly distributed over the orthogonal group O(n). Then P(sn (D + U ) ≤ t) ≤ tc (Kn/δ)C ,

t > 0.

Similarly to the complex case, this bound is uniform over all matrices D satisfying (9.1). This condition is relatively mild: in the case when K = nC1 and δ = n−C2 for some constants C1 , C2 > 0, we have P(sn (D + U ) ≤ t) ≤ tc nC ,

t > 0,

as in the complex case. It is possible that the condition D ≤ K can be eliminated from the Theorem 9.2. However, this is not crucial because such condition already appears in the Single Ring Theorem. The problems we face in the proofs of Theorems 9.1 and 9.2 are significantly different from those appearing in Sections 5, 7. In the case of the independent entries the argument was based on the analysis of the small ball probability P( Ax 2 < t) or P( Ax 1 ) < t for a fixed vector x. As shown in Section 6, the decay of this probability as t → 0 is determined by the arithmetic structure of the coordinates of x. In contrast to this, the arithmetic structure plays no role in Theorems 9.1 and 9.2. The difficulty lies elsewhere, namely in the lack of independence of the entries of the matrix. We will have to introduce a set of the independent random variables artificially. These variables have to be chosen in a way that allows one to express tractably the smallest singular value in terms of them. To illustrate this approach, we present the proof of Theorem 9.1 below. The proof of Theorem 9.2 starts with the similar ideas, but requires new and significantly more delicate arguments. We refer the reader to [31] for the details.

112

MARK RUDELSON

Proof of Theorem 9.1. Throughout the proof we fix t > 0 and introduce several small and large parameters depending on t. The values of such parameters will be chosen of orders ta , where 0 < a < 1 for the small parameters, and t−b , 0 < b < 1 for the large ones. This would allow us to introduce an hierarchy of parameters, and disregard the terms corresponding to the smaller ones. Also,  note that we have to prove Theorem 9.1 only for t < n−C for a given constant C , because for larger values of t its statement can be made vacuous by choosing a large √  constant C. This observation would allow us to use bounds of the type nta ≤ ta

whenever a < a are constants. For convenience of a reader, we include a special paragraph entitled “Choice of the parameters” in the analysis of each case. In these paragraphs we list the constraints that the small and large parameters must satisfy, as well as the admissible numerical values of those parameters. These paragraphs will be printed in sans-serif and can be omitted on the first reading.

To simplify the argument, we will also assume that D ≤ K, as in Theorem 9.2. The proof of Theorem 9.1 without this assumption can be found in [31]. 9.1. Decomposition of the sphere and introduction of local and global perturbations. We have to bound sn (U + D), which is the minimum of (D + U )x 2 over√ the unit sphere. For every x ∈ S n−1 , there is a coordinate xj with |xj | ≥ 1/ n. Hence, the union bound yields   n  P inf (U + D)x 2 ≤ t , P(sn (D + U ) ≤ t) ≤ j=1

where

x∈Sj

√ . Sj = x ∈ S n−1 | |xj | ≥ 1/ n .

All terms on the right hand side of the inequality above can be estimated in the same way. So, without loss of generality we will consider the case j = 1. Note that the application of the crude union bound here may have increased the probability estimate of Theorem 9.1 n times. This, however, is unimportant, since we allow the coefficient nC anyway. The proof of the theorem reduces to the estimate of   (9.2) P inf (U + D)x 2 ≤ t . x∈S1

The structure of the set S1 gives a special role to the first coordinate. This will be reflected in our choice of independent random variables. If R, W ∈ U (n) are any matrices, and V is uniformly distributed over U (n), then the matrix U = V −1 R−1 W is uniformly distributed over U (n) as well. Hence, if we assume that the matrices R and W are random and independent of V , then this property would remain valid for U . The choice of the distributions of R and W is in our hands. Set R = diag(r, 1, . . . , 1), where r is a random variable uniformly distributed over {z ∈ C | |z| = 1}. This is a “global” perturbation, since we will need the values of r, which are far from 1. The matrix W will be “local”, i.e., it will be a small perturbation of the identity matrix. Let ε > 0 be a “small” parameter, and set W = exp(εS), where S is an n × n skew-symmetric matrix, i.e. S ∗ = −S. Although the matrix W is unitary,

NON-ASYMPTOTIC THEORY

113

the dependence of its entries on the entries of S is hard to trace. To simplify the structure, we consider the linearization of W , W0 = I + εS. The matrix W0 is not unitary, but its distance to the group U (n) is at most W − W0 ≤ ε2 S 2 . Thus, for any x ∈ S1 , (D + U )x 2 = (D + V −1 R−1 W )x 2 = (RV D + W )x 2 ≥ (RV D + W0 )x 2 − W − W0 ≥ (RV D + I + εS)x 2 − ε2 S 2 . We will use S to introduce a collection of independent random variables. Set /√ 0 −1 s −Z T (9.3) S= Z 0 where s ∼ NR (0, 1) and Z ∼ NR (0, In−1 ) are independent real-valued standard normal random variable and vector respectively. Clearly, S is skew-Hermitian. If K0 is a “large” parameter, K0 = t−b0 , then by Proposition 4.4, √ P( Z 2 ≥ K0 n) ≤ exp(−c0 K02 n) ≤ t 2

for all sufficiently small t > 0. This means that S ≤ K02 n with probability close to 1. Disregarding an event of a small probability, we reduce the problem to obtaining a lower bound for inf (RV D + I + εS)x 2 ,

x∈S1

provided that the bound we obtain is of order at least ε. Indeed, we may assume that K02 nε2  ε, if ε is chosen small enough. Choice of the parameters. The second order term 2ε2 S 2  should not affect the estimate of P(inf x∈S1 Ax ≤ t). To guarantee it, we require that K02 nε2 ≤ t/2. Also, to bound the probability by a power of t, we have to assume that exp(−c0 K02 n) ≤ tc for some c > 0. Both inequalities are satisfied for small t if ε = t0.6 and K0 = t−0.05 .

Starting from this moment we will condition on the matrix V and evaluate the conditional probability with respect to the random matrices R and S. The original random structure will be lost after this conditioning. However, we introduced a new independent structure in the form of the matrices R and S, and it will be easier to manipulate. Each of the matrices R and S alone is insufficient to obtain any meaningful estimate. Nevertheless, the combination of these two sources of randomness, a local perturbation S and a global perturbation R, produces enough power to conclude that RV D + I + εS is typically well invertible, and this leads to the proof of Theorem 9.1. Summarizing the previous argument, we conclude that our goal is to bound P( inf Ax 2 ≤ t), x∈S1

where (9.4)

/

A11 A = RV D + I + εS =: X

0 YT , BT

114

MARK RUDELSON

X, Y ∈ Cn−1 , B is an (n − 1) × (n − 1) matrix, and ε = ta . Here we decomposed the matrix A separating the first coordinate to emphasize its special role. For future reference we write A in terms of the components of the matrix V D, and random variables r, s, and Z exposing the dependence on these random parameters: / 0 / 0 √ A11 Y T ra + 1 + −1 εs (rv − εZ)T (9.5) A= = . X BT u + εZ BT Here a ∈ C, u, v ∈ Cn−1 , and the matrix B are independent of r, s, and Z. After conditioning on V , we can treat them as constants. The further strategy takes into account the properties of the matrix B. Depending on the invertibility properties of this matrix, we condition on some of the random variables r, s, and Z, and use the other ones to show that A is well-invertible with high probability. 9.2. Case 1: B is poorly invertible. Assume that sn (B) ≤ λ1 ε, where λ1 is another “small” parameter (λ1 = ta1 for 0 < a1 < 1). In this case we will condition on r and s, and rely on Z to obtain the probability bound. We know that there ˜ 2 ≤ λ1 ε. Let x ∈ S1 be arbitrary. We can exists a vector w ˜ ∈ S n−2 such that B w express it as / 0 1 x x = 1 , where |x1 | ≥ √ . x ˜ n Set w=

/ 0 0 ∈ Cn . w ˜

Using the decomposition of A given in (9.4), we obtain  0 / 0 /   A11 Y T x1  T T   ˜ Ax 2 ≥ |w Ax| =  0 w X BT x ˜  = |x1 · w ˜T X + w ˜T B T x ˜| ˜ T X| − B w ˜ 2 (by the triangle inequality) ≥ |x1 | · |w √ 1 ˜ T X| − λ1 ε (using |x1 | ≥ 1/ n). ≥ √ |w n By the representation (9.5), X = u + εZ, where u ∈ Cn−1 is a vector independent of Z. Taking the infimum over x ∈ S1 , we obtain 1 ˜ T u + εw ˜ T Z| − λ1 ε. inf Ax 2 ≥ √ |w n

x∈S1

Recall that w, ˜ u are fixed vectors, w ˜ 2 = 1, and Z ∼ NR (0, In−1 ). Then w ˜T Z = γ 2 is a complex normal random variable of variance 1: E|γ| = 1. This means that  2  2 E Re(γ) ≥ 1/2 or E Im(γ) ≥ 1/2. A quick density calculation yields the following bound on the conditional probability: - T √ . √ ˜ u + εw PZ |w ˜ T Z| ≤ 2λ1 ε n ≤ Cλ1 n. Therefore, a similar bound holds unconditionally. Thus, combining the previous estimates, we conclude that in case when sn (B) ≤ λ1 ε, and if ε and λ1 are chosen

NON-ASYMPTOTIC THEORY

115

so that λ1 ε ≥ t, we have 1 P( inf Ax 2 ≤ t) ≤ P( √ |w ˜ T X| − λ1 ε ≤ t) x∈S1 n - T √ . √ √ P |w ˜ u + εw ˜ T Z| ≤ 2λ1 ε n ≤ Cλ1 n = C n · ta1 . Choice of the parameters. The constraint λ1 ε ≥ t, appearing in this case, holds if we take λ1 = t0.1 .

9.3. Case 2: B is nicely invertible. Assume that sn (B) ≥ λ2 , where λ2 = ta2 is a “small” parameter. In this case, we will also use only the local perturbation, however the crucial random variable will be different. We will condition on r and Z, and use the dependence on s to derive the conclusion of the theorem. Set / 0 1 0 M= , 0 (B T )−1 then M ≤ λ−1 2 . Therefore, inf Ax 2 ≥ λ2 inf M Ax 2 .

x∈S1

x∈S1

The matrix M A has the following block representation: / 0 A11 YT MA = . (B T )−1 X In−1 Recall that we assumed that D ≤ K where √ K is a constant. Combining this with the already used inequality Z 2 ≤ K0 n, which holds outside of the event of exponentially small probability, we conclude that Y = rv − εZ satisfies Y 2 ≤ v 2 + ε Z 2 ≤ 2K √ if εK0 n ≤ K. To bound inf x∈S1 Ax 2 , we use an observation that / T 0   Y 1 −Y T · = 0. In−1 This implies that for every x ∈ S1 ,

 / 0    1  1 −Y T M A x1  ·  T x ˜  [1 − Y ] 2 1 ≥ · |A11 − Y T (B T )−1 X| · |x1 | 2K 1 √ · |A11 − Y T (B T )−1 X|. ≥ 2K n

M Ax 2 ≥

The right hand side of this inequality does not depend on x, so we can take the infimum over x ∈ S1 in the left hand side. Combination of the previous two inequalities reads λ2 √ · |A11 − Y T (B T )−1 X| 2K n √ Recall that according to (9.5), A11 = −1εs + d, where s is a real N (0, 1) random variable, and d is independent of s. Conditioning on everything but s, we can treat inf Ax 2 ≥

x∈S1

116

MARK RUDELSON

d and Y T (B T )−1 X as constants. An elementary estimate using the normal density yields μ for all μ > 0. Ps (|A11 − Y T (B T )−1 X| ≤ μ) ≤ C ε √ Applying this estimate with μ = 2Kλ2 n · t and integrating over the other random variables, we obtain √ √ 2K n · t ≤ C n · tc P( inf Ax 2 ≤ t) ≤ C x∈S1 λ2 ε for some c > 0 if λ2 is chosen appropriately. Choice of the parameters. The inequality 1 · t ≤ tc , c > 0 λ2 ε holds with c = 0.2 if we set λ2 = t0.2 . The constraint √ εK0 n ≤ K, appearing above, is satisfied since we have chosen ε = t0.6 and K0 = t−0.05 . One can try to tweak the parameters λ1 , λ2 , and ε to cover all possible scenarios. This attempt, however, is doomed to fail since the system of the constraints becomes inconsistent. Indeed, to include all matrices B in Cases 1 and 2, we have to choose λ2 ≤ λ1 ε. With this choice, t t ≥ > 1, λ2 ε λ1 ε2 because of the constraint K02 nε2 ≤ t/2. This forces us to consider the intermediate case.

9.4. Case 3, intermediate: B is invertible, but not nicely invertible. Assume that λ1 ε ≤ sn (B) ≤ λ2 with λ2 , λ1 defined in Cases 1 and 2. This is the most delicate case. Here we will have to rely on both local and global perturbations. We proceed like in Case 2 by multiplying Ax from the left by a vector which eliminates the dependence on all coordinates of x, except the first one. To this end, note that / 0  YT  T T −1 1 −Y (B ) · = 0. BT Hence, for any x ∈ S1 ,  / 0 / 0   A11 Y T 1 x  T T −1    · Ax 2 ≥  · 1   1 −Y T (B T )−1   1 −Y (B ) X BT x ˜ 2

  1 (A11 − Y T (B T )−1 X)x1  ≥ T T −1 1 + Y (B ) 2 1 1 ≥ |A11 − Y T (B T )−1 X| · √ . 1 + Y T (B T )−1 2 n

Since the right hand side is independent of x, we can take the infimum over x ∈ S1 . that Y T (B T )−1 is independent of s, see (9.5). We consider two subcases.  Note T T −1   ≤ λ−1 If Y (B ) 2 , then 2 λ2 inf Ax 2 ≥ √ |A11 − Y T (B T )−1 X|, 2 n

x∈S1

and we can finish the proof exactly like in Case 2, by conditioning on everything except s, and estimating the probability with respect to s.

NON-ASYMPTOTIC THEORY

117

  The second subcase requires more work. Assume that Y T (B T )−1 2 ≥ λ−1 2 . Then the inequality above yields 1 inf Ax 2 ≥ √ |A11 − Y T (B T )−1 X|. T 2 n Y (B T )−1 2   Since we do not have a satisfactory upper bound for Y T (B T )−1 2 , we cannot rely on A11 to estimate the small ball probability. The second term in the numerator looks more promising, because it contains the same vector Y T (B T )−1 . This term, however, is difficult to analyze, since the random vectors X and Y are dependent. A simplification of both numerator and denominator would allow us to get rid of this dependence. We start with analyzing the denominator. By (9.5), Y = rv − εZ, so      T T −1  Y (B )  ≤ v T (B T )−1  + ε Z T (B T )−1  . 2 2 2 x∈S1

As in the previous √ cases, disregarding an event of a small probability, we can assume that Z 2 ≤ K0 n. Then by the assumption on sn (B), √ √   K0 n εK0 n ≤ ε Z T (B T )−1 2 ≤ . sn (B) λ1 √

The parameters K0 , λ1 , and λ2 can be chosen so that K0λ1 n ≤ λ−1 2 /2. Then, since  T T −1  −1   by assumption Y (B ) ≥ λ2 , we conclude that 2  T T −1    Y (B )  ≤ 2 v T (B T )−1  2 2 and 1 inf Ax 2 ≥ √ · |A11 − Y T (B T )−1 X|. 4 n v T (B T )−1 2 The denominator here is independent of our random parameters. Now we pass to the analysis of the numerator. From (9.5) it follows that A11 − Y T (B T )−1 X = αr + β is a linear function of r with coefficients α and β, which depend on other random parameters. This representation would allow us to filter out several complicated terms in A11 − Y T (B T )−1 X by using the global perturbation r. Let λ3 > 0 be a “small” parameter: λ3 = ta3 . Condition on everything except r. Since r is uniformly distributed over the unit circle in C, an easy density calculation yields x∈S1

Pr (|αr + b| ≥ λ3 |α|) ≥ 1 − Cλ3 .

(9.6)

Taking the expectation with respect to the other random variables shows that the same bound holds unconditionally. Thus, disregarding the event of a small probability Cλ3 , we obtain that |A11 − Y T (B T )−1 X| ≥ λ3 |α|. The coefficient α in turn can be represented as follows: α = α − εv T (B T )−1 Z, where α ∈ C is independent of Z. Incorporating this into the bound above, we obtain λ3 inf Ax 2 ≥ √ |α − εv T (B T )−1 Z|. 4 n v T (B T )−1 2

x∈S1

Using the global perturbation allowed us to simplify the numerator and expose its dependence on the local perturbation variable Z. We will finish the proof using this local perturbation.

118

MARK RUDELSON

  Set hT = v T (B T )−1 / v T (B T )−1 2 and recall that h ∈ Cn−1 is independent of Z. Conditioning on everything except Z, we see that α

g := T T −1 − εhT Z = const + εγ , v (B ) 2 where γ is a complex normal random variable of unit variance: E|γ |2 = 1. Hence, as before, for any μ > 0 PZ (|g| ≤ μ) ≤ Cμ/ε, and integrating over other random variables, we conclude that the same estimate holds unconditionally. Combining this inequality with the previous one and recalling that we dropped an event of probability Cλ3 while using (9.6), we obtain  √  √ √  4 n 4 n t + Cλ3 ≤ C ntc P( inf Ax 2 ≤ t) ≤ P |g| ≤ t + Cλ3 ≤ C x∈S1 λ3 λ3 ε for some c > 0. Choosing appropriate constants a and a3 in ε = ta and λ3 = ta3 finishes the proof in this case and completes the proof of Theorem 9.1. Choice of the parameters. The analysis of this case requires the following two constraints: √  λ−1 t K0 n ≤ 2 and + λ3 ≤ tc , c > 0. λ1 2 λ3 ε The first one is satisfied with the choice K0 = t−0.05 , λ1 = t0.1 , λ2 = t0.2 that we made above. To satisfy the second one, set λ3 = t0.2 . 

We made no effort to optimize the dependence on t and n in the proof above. It would be interesting to find the optimal bound here. Another interesting question, suggested by Djalil Chafai, is to analyze the behavior of the smallest singular value of the matrix D + U where U is uniformly distributed over a discrete subgroup of the unitary group. The case of the permutation group may be of special interest, because of its relevance for random graph theory. This question may require a combination of tools from Sections 5–9, since both obstacles, the arithmetic structure and the lack of independence, make an appearance here. References [1] Greg W. Anderson, Alice Guionnet, and Ofer Zeitouni, An introduction to random matrices, Cambridge Studies in Advanced Mathematics, vol. 118, Cambridge University Press, Cambridge, 2010. MR2760897 (2011m:60016) [2] Shiri Artstein-Avidan, Omer Friedland, Vitali Milman, and Sasha Sodin, Polynomial bounds for large Bernoulli sections of l1N , Israel J. Math. 156 (2006), 141–155, DOI 10.1007/BF02773829. MR2282373 (2008k:46027) [3] Z. D. Bai and Y. Q. Yin, Limit of the smallest eigenvalue of a large-dimensional sample covariance matrix, Ann. Probab. 21 (1993), no. 3, 1275–1294. MR1235416 (94j:60060) [4] Jean Bourgain, Van H. Vu, and Philip Matchett Wood, On the singularity probability of discrete random matrices, J. Funct. Anal. 258 (2010), no. 2, 559–603, DOI 10.1016/j.jfa.2009.04.016. MR2557947 (2011b:60023) [5] Kenneth R. Davidson and Stanislaw J. Szarek, Local operator theory, random matrices and Banach spaces, Handbook of the geometry of Banach spaces, Vol. I, North-Holland, Amsterdam, 2001, pp. 317–366, DOI 10.1016/S1874-5849(01)80010-3. MR1863696 (2004f:47002a) [6] Alan Edelman, Eigenvalues and condition numbers of random matrices, SIAM J. Matrix Anal. Appl. 9 (1988), no. 4, 543–560, DOI 10.1137/0609045. MR964668 (89j:15039) [7] P. Erd¨ os, On a lemma of Littlewood and Offord, Bull. Amer. Math. Soc. 51 (1945), 898–902. MR0014608 (7,309j) [8] A. Yu. Garnaev and E. D. Gluskin, The widths of a Euclidean ball (Russian), Dokl. Akad. Nauk SSSR 277 (1984), no. 5, 1048–1052. MR759962 (85m:46023)


[9] A. A. Giannopoulos and V. D. Milman, Concentration property on probability spaces, Adv. Math. 156 (2000), no. 1, 77–106, DOI 10.1006/aima.2000.1949. MR1800254 (2001m:28001)
[10] Friedrich Götze and Alexander Tikhomirov, The circular law for random matrices, Ann. Probab. 38 (2010), no. 4, 1444–1491, DOI 10.1214/09-AOP522. MR2663633 (2012a:60011)
[11] Olivier Guédon and Mark Rudelson, L_p-moments of random vectors via majorizing measures, Adv. Math. 208 (2007), no. 2, 798–823, DOI 10.1016/j.aim.2006.03.013. MR2304336 (2008m:46017)
[12] Alice Guionnet, Manjunath Krishnapur, and Ofer Zeitouni, The single ring theorem, Ann. of Math. (2) 174 (2011), no. 2, 1189–1217, DOI 10.4007/annals.2011.174.2.10. MR2831116
[13] B. S. Kašin, The widths of certain finite-dimensional sets and classes of smooth functions (Russian), Izv. Akad. Nauk SSSR Ser. Mat. 41 (1977), no. 2, 334–351, 478. MR0481792 (58 #1891)
[14] Gábor Halász, On the distribution of additive arithmetic functions, Acta Arith. 27 (1975), 143–152. Collection of articles in memory of Juriĭ Vladimirovič Linnik. MR0369292 (51 #5527)
[15] G. Halász, Estimates for the concentration function of combinatorial number theory and probability, Period. Math. Hungar. 8 (1977), no. 3-4, 197–211. MR0494478 (58 #13338)
[16] Jeff Kahn, János Komlós, and Endre Szemerédi, On the probability that a random ±1-matrix is singular, J. Amer. Math. Soc. 8 (1995), no. 1, 223–240, DOI 10.2307/2152887. MR1260107 (95c:15047)
[17] J. Komlós, On the determinant of (0, 1) matrices, Studia Sci. Math. Hungar 2 (1967), 7–21. MR0221962 (36 #5014)
[18] Michel Ledoux and Michel Talagrand, Probability in Banach spaces, Ergebnisse der Mathematik und ihrer Grenzgebiete (3) [Results in Mathematics and Related Areas (3)], vol. 23, Springer-Verlag, Berlin, 1991. Isoperimetry and processes. MR1102015 (93c:60001)
[19] J. E. Littlewood and A. C. Offord, On the number of real roots of a random algebraic equation. III (English, with Russian summary), Rec. Math. [Mat. Sbornik] N.S. 12(54) (1943), 277–286. MR0009656 (5,179h)
[20] A. E. Litvak, A. Pajor, M. Rudelson, N. Tomczak-Jaegermann, and R. Vershynin, Euclidean embeddings in spaces of finite volume ratio via random matrices, J. Reine Angew. Math. 589 (2005), 1–19, DOI 10.1515/crll.2005.2005.589.1. MR2194676 (2006j:46016)
[21] Vitali Milman and Gideon Schechtman, An "isomorphic" version of Dvoretzky's theorem. II, Convex geometric analysis (Berkeley, CA, 1996), Math. Sci. Res. Inst. Publ., vol. 34, Cambridge Univ. Press, Cambridge, 1999, pp. 159–164. MR1665588 (2000c:46016)
[22] Hoi H. Nguyen, Inverse Littlewood-Offord problems and the singularity of random symmetric matrices, Duke Math. J. 161 (2012), no. 4, 545–586, DOI 10.1215/00127094-1548344. MR2891529
[23] H. H. Nguyen and V. Vu, Random matrices: law of the determinant, to appear in Annals of Probability.
[24] Hoi Nguyen and Van Vu, Optimal inverse Littlewood-Offord theorems, Adv. Math. 226 (2011), no. 6, 5298–5319, DOI 10.1016/j.aim.2011.01.005. MR2775902 (2012b:60180)
[25] Mark Rudelson, Lower estimates for the singular values of random matrices (English, with English and French summaries), C. R. Math. Acad. Sci. Paris 342 (2006), no. 4, 247–252, DOI 10.1016/j.crma.2005.11.013. MR2196007 (2007h:15029)
[26] Mark Rudelson, Invertibility of random matrices: norm of the inverse, Ann. of Math. (2) 168 (2008), no. 2, 575–600, DOI 10.4007/annals.2008.168.575. MR2434885 (2010f:46021)
[27] Mark Rudelson and Roman Vershynin, The Littlewood-Offord problem and invertibility of random matrices, Adv. Math. 218 (2008), no. 2, 600–633, DOI 10.1016/j.aim.2008.01.010. MR2407948 (2010g:60048)
[28] Mark Rudelson and Roman Vershynin, Smallest singular value of a random rectangular matrix, Comm. Pure Appl. Math. 62 (2009), no. 12, 1707–1739, DOI 10.1002/cpa.20294. MR2569075 (2011a:60034)
[29] Mark Rudelson and Roman Vershynin, The least singular value of a random square matrix is O(n^{−1/2}) (English, with English and French summaries), C. R. Math. Acad. Sci. Paris 346 (2008), no. 15-16, 893–896, DOI 10.1016/j.crma.2008.07.009. MR2441928 (2009i:60104)
[30] Mark Rudelson and Roman Vershynin, Non-asymptotic theory of random matrices: extreme singular values, Proceedings of the International Congress of Mathematicians. Volume III, Hindustan Book Agency, New Delhi, 2010, pp. 1576–1602. MR2827856 (2012g:60016)


[31] M. Rudelson and R. Vershynin, Invertibility of random matrices: unitary and orthogonal perturbations, to appear in Journal of the AMS, arXiv:1206.5180.
[32] A. Sárközi and E. Szemerédi, Über ein Problem von Erdős und Moser (German), Acta Arith. 11 (1965), 205–208. MR0182619 (32 #102)
[33] Steve Smale, On the efficiency of algorithms of analysis, Bull. Amer. Math. Soc. (N.S.) 13 (1985), no. 2, 87–121, DOI 10.1090/S0273-0979-1985-15391-1. MR799791 (86m:65061)
[34] Daniel A. Spielman and Shang-Hua Teng, Smoothed analysis of algorithms, Proceedings of the International Congress of Mathematicians, Vol. I (Beijing, 2002), Higher Ed. Press, Beijing, 2002, pp. 597–606. MR1989210 (2004d:90138)
[35] Stanislaw J. Szarek, Condition numbers of random matrices, J. Complexity 7 (1991), no. 2, 131–149, DOI 10.1016/0885-064X(91)90002-F. MR1108773 (92i:65086)
[36] Michel Talagrand, Majorizing measures: the generic chaining, Ann. Probab. 24 (1996), no. 3, 1049–1103, DOI 10.1214/aop/1065725175. MR1411488 (97k:60097)
[37] Terence Tao and Van Vu, On random ±1 matrices: singularity and determinant, Random Structures Algorithms 28 (2006), no. 1, 1–23, DOI 10.1002/rsa.20109. MR2187480 (2006g:15048)
[38] Terence Tao and Van Vu, Additive combinatorics, Cambridge Studies in Advanced Mathematics, vol. 105, Cambridge University Press, Cambridge, 2006. MR2289012 (2008a:11002)
[39] Terence Tao and Van Vu, On the singularity probability of random Bernoulli matrices, J. Amer. Math. Soc. 20 (2007), no. 3, 603–628, DOI 10.1090/S0894-0347-07-00555-3. MR2291914 (2008h:60027)
[40] Terence Tao and Van H. Vu, Inverse Littlewood-Offord theorems and the condition number of random discrete matrices, Ann. of Math. (2) 169 (2009), no. 2, 595–632, DOI 10.4007/annals.2009.169.595. MR2480613 (2010j:60110)
[41] Terence Tao and Van Vu, Random matrices: the circular law, Commun. Contemp. Math. 10 (2008), no. 2, 261–307, DOI 10.1142/S0219199708002788. MR2409368 (2009d:60091)
[42] Terence Tao and Van Vu, Random matrices: universality of ESDs and the circular law, Ann. Probab. 38 (2010), no. 5, 2023–2065, DOI 10.1214/10-AOP534. With an appendix by Manjunath Krishnapur. MR2722794 (2011e:60017)
[43] Terence Tao and Van Vu, A central limit theorem for the determinant of a Wigner matrix, Adv. Math. 231 (2012), no. 1, 74–101, DOI 10.1016/j.aim.2012.05.006. MR2935384
[44] Roman Vershynin, Introduction to the non-asymptotic analysis of random matrices, Compressed sensing, Cambridge Univ. Press, Cambridge, 2012, pp. 210–268. MR2963170
[45] Roman Vershynin, Spectral norm of products of random and deterministic matrices, Probab. Theory Related Fields 150 (2011), no. 3-4, 471–509, DOI 10.1007/s00440-010-0281-z. MR2824864
[46] R. Vershynin, Invertibility of symmetric random matrices, arXiv:1102.0300v4, Random Structures and Algorithms, to appear.
[47] John von Neumann, Collected works. Vol. V: Design of computers, theory of automata and numerical analysis, General editor: A. H. Taub. A Pergamon Press Book, The Macmillan Co., New York, 1963. MR0157875 (28 #1104)

Department of Mathematics, University of Michigan, Ann Arbor, Michigan 48109
E-mail address: [email protected]

Proceedings of Symposia in Applied Mathematics Volume 72, 2014 http://dx.doi.org/10.1090/psapm/072/00615

Random matrices: The universality phenomenon for Wigner ensembles

Terence Tao and Van Vu

Abstract. In this paper, we survey some recent progress on rigorously establishing the universality of various spectral statistics of Wigner Hermitian random matrix ensembles, focusing on the Four Moment Theorem and its refinements and applications, including the universality of the sine kernel and central limit theorems for several spectral parameters.

1. Introduction

Random matrix theory is a central topic in probability and mathematical physics, with many connections to various areas such as statistics, number theory, combinatorics, numerical analysis and theoretical computer science. One of the primary goals of random matrix theory is to derive limiting laws for the eigenvalues and eigenvectors of ensembles of large (n × n) Hermitian random matrices¹, in the asymptotic limit n → ∞. There are many random matrix ensembles of interest, but to focus the discussion and to simplify the exposition we shall restrict attention to an important model class of ensembles, the Wigner matrix ensembles.

2010 Mathematics Subject Classification. Primary 15A52.
The first author was supported by a grant from the MacArthur Foundation, by NSF grant DMS-0649473, and by the NSF Waterman award. The second author was supported by research grants DMS-0901216 and AFOSR grant FA9550-09-1-0167.
¹One can of course also study non-Hermitian or non-square random matrices, though in the latter case the concept of an eigenvalue needs to be replaced with that of a singular value. In this survey we will focus almost exclusively on the square Hermitian case.
© 2014 American Mathematical Society

Definition 1.1 (Wigner matrices). Let n ≥ 1 be an integer (which we view as a parameter going off to infinity). An n × n Wigner Hermitian matrix M_n is defined to be a random Hermitian n × n matrix M_n = (ξ_{ij})_{1≤i,j≤n}, in which the ξ_{ij} for 1 ≤ i ≤ j ≤ n are jointly independent with $\xi_{ji} = \overline{\xi_{ij}}$ (in particular, the ξ_{ii} are real-valued). For 1 ≤ i < j ≤ n, we require that the ξ_{ij} have mean zero and variance one, while for 1 ≤ i = j ≤ n we require that the ξ_{ii} (which are necessarily real) have mean zero and variance σ² for some σ² > 0 independent of i, j, n. To simplify some of the statements of the results here, we will also assume that the ξ_{ij} ≡ ξ are identically distributed for i < j, the ξ_{ii} ≡ ξ′ are also identically distributed for i = j, and furthermore that the real and imaginary parts of ξ are independent.


We refer to the distributions Re ξ, Im ξ, and ξ′ as the atom distributions of M_n.

We say that the ensemble obeys Condition C0 if we have the exponential decay condition
$$P(|\xi_{ij}| \ge t^C) \le e^{-t}$$
for all 1 ≤ i, j ≤ n and t ≥ C′, and some constants C, C′ (independent of i, j, n). We say that the Wigner matrix ensemble obeys Condition C1 with constant C_0 if one has
$$E|\xi_{ij}|^{C_0} \le C$$
for some constant C (independent of n). Of course, Condition C0 implies Condition C1 for any C_0, but not conversely.

We refer to the matrix W_n := (1/√n) M_n as the coarse-scale normalised Wigner Hermitian matrix, and A_n := √n M_n as the fine-scale normalised Wigner Hermitian matrix.

Example 1.2. An important special case of a Wigner Hermitian matrix M_n is the gaussian unitary ensemble (GUE), in which ξ_{ij} ≡ N(0,1)_C are complex gaussians with mean zero and variance one for i ≠ j, and ξ_{ii} ≡ N(0,1)_R are real gaussians with mean zero and variance one for 1 ≤ i ≤ n (thus σ² = 1 in this case). Another important special case is the gaussian orthogonal ensemble (GOE), in which ξ_{ij} ≡ N(0,1)_R are real gaussians with mean zero and variance one for i ≠ j, and ξ_{ii} ≡ N(0,1/2)_R are real gaussians with mean zero and variance 1/2 for 1 ≤ i ≤ n (thus σ² = 1/2 in this case). These ensembles obey Condition C0, and hence Condition C1 for any C_0. For these two ensembles, the probability distribution of M_n can be expressed invariantly in either case as
$$\frac{1}{Z_n} e^{-c\,\mathrm{tr}\, M_n M_n^*}\, dM_n \tag{1.1}$$
for some quantity Z_n > 0 depending only on n, where dM_n is Haar measure on the vector space of n × n Hermitian matrices (in the case of GUE) or real symmetric matrices (in the case of GOE), and c is equal to 1/2 (for GUE) or 1/4 (for GOE). From (1.1) we easily conclude that the probability distribution of GUE is invariant with respect to conjugations by unitary matrices, and similarly the probability distribution of GOE is invariant with respect to conjugations by orthogonal matrices. However, a general Wigner matrix ensemble will not enjoy invariances with respect to such large classes of matrices. (For instance, the Bernoulli ensembles described below are only invariant with respect to conjugation by a discrete group of matrices, which include permutation matrices and reflections around the coordinate axes.)

Example 1.3. At the opposite extreme from the invariant ensembles are the Bernoulli ensembles, which are discrete instead of continuous. In the real Bernoulli ensemble (also known as symmetric random sign matrices), each of the ξ_{ij} is equal to +1 with probability 1/2 and −1 with probability 1/2. In the complex Bernoulli ensemble, the diagonal entries ξ_{ii} still have this distribution, but the off-diagonal entries now take values² (±1 ± √−1)/√2, with each of these four complex numbers occurring with probability 1/4.

²We use √−1 to denote the imaginary unit, in order to free up the symbol i as an index variable.
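For readers who wish to experiment, the following Python sketch (an illustrative addition, not part of the original text) samples the three ensembles just described, GUE, GOE, and the complex Bernoulli ensemble, with the normalisations of Definition 1.1 and Examples 1.2–1.3.

```python
import numpy as np

rng = np.random.default_rng(0)

def gue(n):
    # Off-diagonal: complex N(0,1)_C (total variance 1); diagonal: real N(0,1)_R.
    a = (rng.normal(0, np.sqrt(0.5), (n, n))
         + 1j * rng.normal(0, np.sqrt(0.5), (n, n)))
    m = np.triu(a, 1)
    m = m + m.conj().T
    np.fill_diagonal(m, rng.normal(0, 1, n))
    return m

def goe(n):
    # Off-diagonal: real N(0,1); diagonal: real N(0,1/2), i.e. sigma^2 = 1/2.
    a = np.triu(rng.normal(0, 1, (n, n)), 1)
    m = a + a.T
    np.fill_diagonal(m, rng.normal(0, np.sqrt(0.5), n).astype(complex).real)
    return m

def complex_bernoulli(n):
    # Off-diagonal: (+-1 +- sqrt(-1))/sqrt(2), each value with prob 1/4;
    # diagonal: +-1 with prob 1/2.
    re = rng.choice([-1.0, 1.0], (n, n)) / np.sqrt(2)
    im = rng.choice([-1.0, 1.0], (n, n)) / np.sqrt(2)
    m = np.triu(re + 1j * im, 1)
    m = m + m.conj().T
    np.fill_diagonal(m, rng.choice([-1.0, 1.0], n))
    return m

for sample in (gue, goe, complex_bernoulli):
    M = sample(4)
    assert np.allclose(M, np.asarray(M).conj().T)   # Hermitian by construction
```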


Remark 1.4. Many of the results given here have been extended to somewhat broader classes of matrices. For instance, one can consider generalised Wigner ensembles in which entries are not identically distributed; for instance, one can allow the variances σ²_{ij} of each entry ξ_{ij} to vary in i, j, and even vanish for some i, j; the latter situation occurs for instance in band-limited random matrices. One can also consider allowing the means μ_{ij} of the entries ξ_{ij} to be non-zero (this is for instance the situation with the adjacency matrices of Erdős–Rényi random graphs). One can also consider sparse Wigner random matrices, in which only a small (randomly selected) number of entries are non-zero. For simplicity, though, we shall mostly restrict attention in this survey to ordinary Wigner ensembles. We do remark, however, that in all of these generalisations, it remains crucial that the entries ξ_{ij} for 1 ≤ i ≤ j ≤ n are jointly independent, as many of the techniques currently available to control the fine-scale spectral structure of Wigner matrices rely heavily on joint independence.

Given an n × n Hermitian matrix A, we denote its n eigenvalues in increasing order³ as λ_1(A) ≤ … ≤ λ_n(A), and write λ(A) := (λ_1(A), …, λ_n(A)). We also let u_1(A), …, u_n(A) ∈ C^n be an orthonormal basis of eigenvectors of A with A u_i(A) = λ_i(A) u_i(A); these eigenvectors u_i(A) are only determined up to a complex phase even when the eigenvalues are simple (or up to a sign in the real symmetric case), but this ambiguity will not cause much of a difficulty in our results, as we will usually only be interested in the magnitude |u_i(A)^* X| of various inner products u_i(A)^* X of u_i(A) with other vectors X. We also introduce the eigenvalue counting function
$$N_I(A) := |\{1 \le i \le n : \lambda_i(A) \in I\}| \tag{1.2}$$

for any interval I ⊂ R. We will be interested in both the coarse-scale eigenvalue counting function N_I(W_n) and the fine-scale eigenvalue counting function N_I(A_n), which are of course transformable to each other by the identity N_I(W_n) = N_{nI}(A_n).

2. Global and local semi-circular laws

We first discuss the coarse-scale spectral structure of Wigner matrices, that is to say the structure of the eigenvalues of W_n at unit scales (or equivalently, the eigenvalues of M_n at scale √n, or A_n at scale n). The fundamental result in this topic is the (global) Wigner semi-circular law. Denote by ρ_sc the semi-circle density function with support on [−2, 2],
$$\rho_{sc}(x) := \begin{cases} \frac{1}{2\pi}\sqrt{4 - x^2}, & |x| \le 2 \\ 0, & |x| > 2. \end{cases} \tag{2.1}$$

Theorem 2.1 (Global). Let M_n be a Wigner Hermitian matrix. Then for any interval I (independent of n), one has
$$\lim_{n\to\infty} \frac{1}{n} N_I(W_n) = \int_I \rho_{sc}(y)\, dy$$

³It is also common in the literature to arrange eigenvalues in decreasing order instead of increasing. Of course, the results remain the same under this convention except for minor notational changes.


in the sense of probability (and also in the almost sure sense, if the M_n are all minors of the same infinite Wigner Hermitian matrix).

Remark 2.2. Wigner [126] proved this theorem for special ensembles. The general version above is due to Pastur [85] (see [1, 6] for a detailed discussion). The semi-circular law in fact holds under substantially more general hypotheses than those given in Definition 1.1, but we will not discuss this matter further here.

One consequence of Theorem 2.1 is that we expect most of the eigenvalues of W_n to lie in the interval (−2 + ε, 2 − ε) for ε > 0 small; we shall thus informally refer to this region as the bulk of the spectrum.

An essentially equivalent⁴ formulation of the semi-circular law is as follows: if 1 ≤ i ≤ n, then one has⁵
$$\lambda_i(W_n) = \lambda^{cl}_i(W_n) + o(1) \tag{2.2}$$
with probability 1 − o(1) (and also almost surely, if the M_n are minors of an infinite matrix), where the classical location λ^{cl}_i(W_n) of the i-th eigenvalue is the element of [−2, 2] defined by the formula
$$\int_{-2}^{\lambda^{cl}_i(W_n)} \rho_{sc}(y)\, dy = \frac{i}{n}.$$

A remarkable feature of the semi-circular law is its universality: the precise distribution of the atom variables ξ_{ij} is irrelevant for the conclusion of the law, so long as they are normalised to have mean zero and variance one (or variance σ², on the diagonal), and are jointly independent on the upper-triangular portion of the matrix. In particular, continuous matrix ensembles such as GUE or GOE, and discrete matrix ensembles such as the Bernoulli ensembles, have the same asymptotic spectral distribution when viewed at the coarse scale (i.e. by considering eigenvalue ranges of size ∼ 1 for W_n, or equivalently of size ∼ √n for M_n or ∼ n for A_n). However, as stated, the semi-circular law does not give good control on the fine-scale behaviour of the eigenvalues, for instance in controlling N_I(W_n) when I is a very short interval (of length closer to 1/n than to 1). The fine-scale theory for Wigner matrices is much more recent than the coarse-scale theory given by the semi-circular law, and is the main focus of this survey.

There are several ways to establish the semi-circular law. For invariant ensembles such as GUE or GOE, one can use explicit formulae for the probability distribution of N_I coming from the theory of determinantal processes, giving precise estimates all the way down to infinitesimally small scales; see e.g. [1] or Section 3 below. However, these techniques rely heavily on the invariance of the ensemble, and do not directly extend to more general Wigner ensembles. Another popular technique is the moment method, based on the basic moment identities
$$\sum_{i=1}^n \lambda_i(W_n)^k = \mathrm{trace}\, W_n^k = \frac{1}{n^{k/2}} \mathrm{trace}\, M_n^k \tag{2.3}$$

⁴This formulation is slightly stronger because it also incorporates the upper bound ‖W_n‖_op ≤ 2 + o(1) on the operator norm of W_n, which is consistent with, but not implied by, the semi-circular law, and follows from the work of Bai and Yin [7].
⁵We use the asymptotic notation o(1) to denote any quantity that goes to zero as n → ∞, and O(X) to denote any quantity bounded in magnitude by CX, where C is a constant independent of n.
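As a quick numerical illustration of Theorem 2.1 (a sketch added here, not part of the original text), one can sample a real symmetric Wigner matrix, form W_n = M_n/√n, and compare N_I(W_n)/n with the semicircle mass of a few intervals I:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# A real symmetric Wigner matrix with N(0,1) atoms, then W_n = M_n / sqrt(n).
a = np.triu(rng.normal(0.0, 1.0, (n, n)), 1)
M = a + a.T + np.diag(rng.normal(0.0, 1.0, n))
eig = np.linalg.eigvalsh(M / np.sqrt(n))

def rho_sc(x):
    # the semicircle density (2.1)
    return np.where(np.abs(x) <= 2, np.sqrt(np.maximum(4 - x**2, 0.0)) / (2 * np.pi), 0.0)

for lo, hi in [(-1.0, 1.0), (0.5, 1.5), (1.5, 2.0)]:
    empirical = np.mean((eig >= lo) & (eig <= hi))       # N_I(W_n) / n
    xs = np.linspace(lo, hi, 100_001)
    semicircle_mass = rho_sc(xs).mean() * (hi - lo)      # Riemann sum for int_I rho_sc
    print(f"I=({lo},{hi}):  N_I/n = {empirical:.4f}   int_I rho_sc = {semicircle_mass:.4f}")
```

Already at n = 1000 the empirical proportions match the semicircle masses to two or three decimal places.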


for all k ≥ 0. The moment method is already instructive for revealing at least one explanation for the universality phenomenon. If one takes expectations in the above formula, one obtains
$$E \sum_{i=1}^n \lambda_i(W_n)^k = \frac{1}{n^{k/2}} \sum_{1\le i_1,\dots,i_k\le n} E\, \xi_{i_1 i_2} \cdots \xi_{i_k i_1}.$$

For those terms for which each edge {i_j, i_{j+1}} appears at most twice, the summand can be explicitly computed purely in terms of the means and variances of the ξ_{ij}, and is thus universal. Terms for which an edge appears three or more times are sensitive to higher moments of the atom distribution, but can be computed to give a contribution of o(1) (at least assuming a decay condition such as Condition C0). This already explains universality for quantities such as E N_I(W_n) for intervals I of fixed size (independent of n), at least if one assumes a suitable decay condition on the entries (and one can use standard truncation arguments to relax such hypotheses substantially). At the edges ±2 of the spectrum, the moment method can be pushed further, to give quite precise control on the most extreme eigenvalues of W_n (which dominate the sum in (2.3)) if k is sufficiently large; see [50, 87, 97–99, 124] and the references therein. However, the moment method is quite poor at controlling the spectrum in the bulk.

To improve the understanding of the bulk spectrum of Wigner matrices, Bai [2, 3] (see also [6, Chapter 8]) used the Stieltjes transform method (building upon the earlier work of Pastur [85]) to show that the speed of convergence to the semi-circle was O(n^{−1/2}). Instead of working with moments, one instead studied the Stieltjes transform
$$s_n(z) = \frac{1}{n} \mathrm{trace}(W_n - z)^{-1} = \frac{1}{n} \sum_{i=1}^n \frac{1}{\lambda_i(W_n) - z}$$

of W_n, which is well-defined for z outside of the spectrum of W_n (and in particular for z in the upper half-plane {z ∈ C : Im(z) > 0}). To establish the semi-circular law, it suffices to show that s_n(z) converges (in probability or almost surely) to s_sc(z) for each z, where
$$s_{sc}(z) := \int_{\mathbb{R}} \frac{\rho_{sc}(x)}{x - z}\, dx$$
is the Stieltjes transform of the semi-circular distribution ρ_sc. The key to the argument is to establish a self-consistent equation for s_n, which roughly speaking takes the form
$$s_n(z) \approx \frac{-1}{s_n(z) + z}. \tag{2.4}$$
One can explicitly compute by contour integration that
$$s_{sc}(z) = \frac{1}{2}\left(-z + \sqrt{z^2 - 4}\right)$$
for z ∉ [−2, 2], where √(z² − 4) is the branch of the square root that is asymptotic to z at infinity, and in particular that
$$s_{sc}(z) = \frac{-1}{s_{sc}(z) + z}. \tag{2.5}$$
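Both the empirical transform s_n and the closed form (2.5) are easy to compute; the sketch below (an illustrative addition with arbitrary test points z, not from the original text) checks the self-consistent equation (2.4) on one sample.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1500
a = np.triu(rng.normal(0.0, 1.0, (n, n)), 1)
eig = np.linalg.eigvalsh((a + a.T + np.diag(rng.normal(0.0, 1.0, n))) / np.sqrt(n))

def s_n(z):
    # empirical Stieltjes transform (1/n) * sum_i 1/(lambda_i - z)
    return np.mean(1.0 / (eig - z))

def s_sc(z):
    # (2.5): s_sc(z) = (-z + sqrt(z^2 - 4))/2; for Im z > 0 the product
    # sqrt(z-2)*sqrt(z+2) of principal roots gives the branch ~ z at infinity.
    return 0.5 * (-z + np.sqrt(z - 2) * np.sqrt(z + 2))

for z in [1.0 + 0.5j, 0.3 + 0.05j, 2.5 + 0.1j]:
    res = s_n(z) + 1.0 / (s_n(z) + z)   # residual of the self-consistent eq (2.4)
    print(f"z={z}: s_n={s_n(z):.4f}  s_sc={s_sc(z):.4f}  |residual|={abs(res):.2e}")
```

The residual of (2.4) and the gap |s_n(z) − s_sc(z)| are both small, and shrink as n grows or as z moves away from the spectrum.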


From a comparison with the elementary equation (2.5) and a continuity argument, one can then use (2.4) to show that
$$s_n(z) \approx s_{sc}(z), \tag{2.6}$$
which then implies convergence to the semi-circular law. If one can obtain quantitative control on the approximation in (2.4), one can then deduce quantitative versions of the semi-circular law that are valid for certain short intervals.

We briefly sketch why one expects the self-consistent equation to hold. One can expand
$$s_n(z) = \frac{1}{n} \sum_{i=1}^n ((W_n - zI)^{-1})_{ii}, \tag{2.7}$$
where ((W_n − zI)^{−1})_{ii} denotes the ii-th entry of the matrix (W_n − zI)^{−1}. Let us consider the i = n term for sake of concreteness. If one expands W_n − zI as a block matrix
$$W_n - zI := \begin{pmatrix} \tilde W_{n-1} - zI & \frac{1}{\sqrt{n}} X_n \\ \frac{1}{\sqrt{n}} X_n^* & \frac{1}{\sqrt{n}}\xi_{nn} - z \end{pmatrix}, \tag{2.8}$$
where W̃_{n−1} = (1/√n) M_{n−1} is the top left (n−1) × (n−1) minor of M_n, and X_n is the (n−1) × 1 column vector with entries ξ_{n1}, …, ξ_{n(n−1)}, then an application of Schur's complement yields the identity
$$((W_n - zI)^{-1})_{nn} = \frac{-1}{z + \frac{1}{n} X_n^* (\tilde W_{n-1} - zI)^{-1} X_n - \frac{1}{\sqrt{n}}\xi_{nn}}. \tag{2.9}$$
The term (1/√n) ξ_{nn} is usually negligible and will be ignored for this heuristic discussion. Let us temporarily freeze (or condition on) the entries of the random matrix M_{n−1}, and hence W̃_{n−1}. Due to the joint independence of the entries of M_n, the entries of X_n remain jointly independent even after this conditioning. As these entries also have mean zero and variance one, we easily compute that
$$E\, \frac{1}{n} X_n^* (\tilde W_{n-1} - zI)^{-1} X_n = \frac{1}{n} \mathrm{trace}(\tilde W_{n-1} - zI)^{-1}.$$
Using the Cauchy interlacing property
$$\lambda_{i-1}(M_{n-1}) \le \lambda_i(M_n) \le \lambda_i(M_{n-1}) \tag{2.10}$$
between the eigenvalues of M_{n−1} and the eigenvalues of M_n (which easily follows from the Courant–Fischer minimax formula
$$\lambda_i(M_n) = \inf_{V \subset \mathbb{C}^n;\, \dim(V) = i}\; \sup_{u \in V :\, \|u\| = 1} u^* M_n u$$
for the eigenvalues), one easily obtains an approximation of the form
$$\frac{1}{n} \mathrm{trace}(\tilde W_{n-1} - zI)^{-1} \approx s_n(z).$$
Assuming that the expression (1/n) X_n^* (W̃_{n−1} − zI)^{−1} X_n concentrates around its mean (which can be justified under various hypotheses on M_n and z using a variety of concentration-of-measure tools, such as Talagrand's concentration inequality, see e.g. [76]), one thus has
$$\frac{1}{n} X_n^* (\tilde W_{n-1} - zI)^{-1} X_n \approx s_n(z) \tag{2.11}$$
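The Schur complement identity (2.9) is an exact algebraic fact, and the concentration step (2.11) can also be observed on one sample. The following sketch (illustrative only, with an arbitrary test point z) compares both sides:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
a = np.triu(rng.normal(0.0, 1.0, (n, n)), 1)
M = a + a.T + np.diag(rng.normal(0.0, 1.0, n))
W = M / np.sqrt(n)

z = 0.7 + 0.2j
G = np.linalg.inv(W - z * np.eye(n))

Wt = M[:-1, :-1] / np.sqrt(n)     # the minor W~_{n-1} = M_{n-1} / sqrt(n)
X = M[-1, :-1]                    # X_n: entries xi_{n1}, ..., xi_{n(n-1)}
quad = X @ np.linalg.solve(Wt - z * np.eye(n - 1), X) / n   # (1/n) X* (W~ - zI)^{-1} X
rhs = -1.0 / (z + quad - M[-1, -1] / np.sqrt(n))            # right side of (2.9)

print(G[-1, -1], rhs)                                       # equal up to rounding
print(quad, np.mean(1.0 / (np.linalg.eigvalsh(W) - z)))     # quad ~ s_n(z), cf. (2.11)
```

The first line verifies (2.9) to machine precision, while the second shows the quadratic form already concentrating near s_n(z) at this modest size.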


and thus
$$((W_n - zI)^{-1})_{nn} \approx \frac{-1}{z + s_n(z)}.$$
Similarly for the other diagonal entries ((W_n − zI)^{−1})_{ii}. Inserting this approximation back into (2.7) gives the desired approximation (2.4), heuristically at least.

The above argument was optimized⁶ in a sequence of papers [37–39] by Erdős, Schlein, and Yau (see also [108, Section 5.2] for a slightly simplified proof). As a consequence, one was able to obtain good estimates of the form (2.6) even when z was quite close to the spectrum [−2, 2] (e.g. at distance O(n^{−1+ε}) for some small ε > 0), which in turn leads (by standard arguments) to good control on the eigenvalue counting function N_I(W_n) for intervals I of length as short as n^{−1+ε}. (Note that as there are only n eigenvalues in all, such intervals are expected to only have about O(n^ε) eigenvalues in them.) Such results are known as local semicircular laws. A typical such law (though not the strongest such law known) is as follows:

Theorem 2.3. Let M_n be a Wigner matrix obeying Condition C0, let ε > 0, and let I ⊂ R be an interval of length |I| ≥ n^{−1+ε}. Then with overwhelming probability⁷, one has
$$N_I(W_n) = n \int_I \rho_{sc}(x)\, dx + o(n|I|). \tag{2.12}$$

Proof. See e.g. [103, Theorem 1.10]. For the most precise estimates currently known of this type (and with the weakest decay hypotheses on the entries), see [32]. □

A variant of Theorem 2.3, which was established⁸ in the subsequent paper [44], is the extremely useful eigenvalue rigidity property
$$\lambda_i(W_n) = \lambda^{cl}_i(W_n) + O_\varepsilon(n^{-1+\varepsilon}), \tag{2.13}$$

valid with overwhelming probability in the bulk range δn ≤ i ≤ (1 − δ)n for any fixed δ > 0 (and assuming Condition C0), and which significantly improves upon (2.2). This result is key in some of the strongest applications of the theory. See Section 7.7 for the precise form of this result and recent developments.

Roughly speaking, results such as Theorem 2.3 and (2.13) control the spectrum of W_n at scales n^{−1+ε} and above. However, they break down at the fine scale n^{−1}; indeed, for intervals I of length |I| = O(1/n), one has n ∫_I ρ_sc(x) dx = O(1), while N_I(W_n) is clearly a natural number, so that one can no longer expect an asymptotic of the form (2.12).

⁶There are further refinements to this method in Erdős, Yau, and Yin [43], [44] and Erdős–Knowles–Yau–Yin [32], [33] that took advantage of some additional cancellation between the error terms in (2.9) (generalised to indices i = 1, …, n) that could be obtained (via the moment method in [43], [44], and via decoupling arguments in [32], [33]), to improve the error estimates further. These refinements are not needed for the application to Wigner matrices assuming a strong decay condition such as Condition C0, but are useful in generalised Wigner matrix models, such as sparse Wigner matrices in which most of the entries are zero, or in models where one only has a weak amount of decay (e.g. Condition C1 with C_0 = 4 + ε). These refinements also lead to the very useful eigenvalue rigidity bound (2.13).
⁷By this, we mean that the event occurs with probability 1 − O_A(n^{−A}) for each A > 0.
⁸The result in [44] actually proves a more precise result that also gives sharp results at the edge of the spectrum, though due to the sparser nature of the λ^{cl}_i(W_n) in that case, the error term O_ε(n^{−1+ε}) must be enlarged.
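The rigidity statement (2.13) can be visualised numerically. In the sketch below (an added illustration, not from the original text), the classical locations λ^{cl}_i(W_n) are obtained by inverting the semicircle distribution function on a grid and compared with one sample's sorted eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
a = np.triu(rng.normal(0.0, 1.0, (n, n)), 1)
eig = np.sort(np.linalg.eigvalsh((a + a.T + np.diag(rng.normal(0.0, 1.0, n))) / np.sqrt(n)))

# Classical locations: solve int_{-2}^{gamma_i} rho_sc = i/n by inverting the CDF.
xs = np.linspace(-2.0, 2.0, 400_001)
pdf = np.sqrt(np.maximum(4 - xs**2, 0.0)) / (2 * np.pi)
cdf = np.cumsum(pdf)
cdf /= cdf[-1]
classical = np.interp(np.arange(1, n + 1) / n, cdf, xs)

bulk = slice(n // 10, 9 * n // 10)
dev = np.max(np.abs(eig[bulk] - classical[bulk]))
print(f"max bulk deviation |lambda_i - lambda_i^cl| = {dev:.4f}   (1/n = {1/n})")
```

The maximum bulk deviation is a small multiple of 1/n, consistent with (2.13) up to the n^ε factors.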


Nevertheless, local semicircle laws are an essential part of the fine-scale theory. One particularly useful consequence of these laws is that of eigenvector delocalisation, first established in [39]:

Corollary 2.4 (Eigenvector delocalisation). Let M_n be a Wigner matrix obeying Condition C0, and let ε > 0. Then with overwhelming probability, one has
$$|u_i(W_n)^* e_j| = O(n^{-1/2+\varepsilon})$$
for all 1 ≤ i, j ≤ n.

Note from Pythagoras' theorem that ∑_{j=1}^n |u_i(M_n)^* e_j|² = ‖u_i(M_n)‖² = 1; thus Corollary 2.4 asserts, roughly speaking, that the coefficients of each eigenvector are as spread out (or delocalised) as possible.

Proof. (Sketch) By symmetry we may take j = n. Fix i, and set λ := λ_i(W_n); then the eigenvector equation (W_n − λ)u_i(W_n) = 0 can be expressed using (2.8) as
$$\begin{pmatrix} \tilde W_{n-1} - \lambda I & \frac{1}{\sqrt{n}} X_n \\ \frac{1}{\sqrt{n}} X_n^* & \frac{1}{\sqrt{n}}\xi_{nn} - \lambda \end{pmatrix} \begin{pmatrix} \tilde u_i \\ u_i(W_n)^* e_n \end{pmatrix} = 0,$$
where ũ_i is the vector of the first n − 1 coefficients of u_i(W_n). After some elementary algebraic manipulation (using the normalisation ‖u_i(W_n)‖ = 1), this leads to the identity
$$|u_i(W_n)^* e_n|^2 = \frac{1}{1 + \|(\tilde W_{n-1} - \lambda)^{-1} X_n\|^2/n}$$
and hence by eigenvalue decomposition
$$|u_i(W_n)^* e_n|^2 = \frac{1}{1 + \frac{1}{n}\sum_{j=1}^{n-1} (\lambda_j(\tilde W_{n-1}) - \lambda)^{-2} |u_j(\tilde W_{n-1})^* X_n|^2}.$$

Suppose first that we are in the bulk case when δn ≤ i ≤ (1 − δ)n for some fixed δ > 0. From the local semicircle law, we then see with overwhelming probability that there are ≫ n^{ε/2} eigenvalues λ_j(W̃_{n−1}) that lie within n^{−1+ε/2} of λ. Letting V be the span of the corresponding eigenvectors, we conclude that
$$|u_i(W_n)^* e_n|^2 \lesssim n^{-1+\varepsilon} / \|\pi_V(X_n)\|^2,$$
where π_V is the orthogonal projection to V. If we freeze (i.e. condition) on W̃_{n−1} and hence on V, then the coefficients of X_n remain jointly independent with mean zero and variance one. A short computation then shows that
$$E\, \|\pi_V(X_n)\|^2 = \dim(V) \gg n^{\varepsilon/2},$$
and by using concentration of measure tools such as Talagrand's concentration inequality (see e.g. [76]), one can then conclude that ‖π_V(X_n)‖ ≫ 1 with overwhelming probability. This concludes the claim of the Corollary in the bulk case.

The edge case is more delicate, due to the sparser spectrum near λ. Here, one takes advantage of the identity
$$\lambda + \frac{1}{n} X_n^* (\tilde W_{n-1} - \lambda I)^{-1} X_n - \frac{1}{\sqrt{n}}\xi_{nn} = 0$$
(cf. (2.9)), which we can rearrange as
$$\frac{1}{n}\sum_{j=1}^{n-1} (\lambda_j(\tilde W_{n-1}) - \lambda)^{-1} |u_j(\tilde W_{n-1})^* X_n|^2 = \frac{1}{\sqrt{n}}\xi_{nn} - \lambda. \tag{2.14}$$
In the edge case, the right-hand side is close to ±2, and this can be used (together with the local semicircle law) to obtain the lower bound
$$\frac{1}{n}\sum_{j=1}^{n-1} (\lambda_j(\tilde W_{n-1}) - \lambda)^{-2} |u_j(\tilde W_{n-1})^* X_n|^2 \gg n^{1-\varepsilon}$$
with overwhelming probability, which gives the claim. See [103] for details. □
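Delocalisation is easy to observe empirically. This sketch (an illustrative addition) computes the largest eigenvector coefficient sup_{i,j} |u_i(W_n)^* e_j| for a few sizes n and normalises by √n:

```python
import numpy as np

rng = np.random.default_rng(5)
for n in [200, 800, 3200]:
    a = np.triu(rng.normal(0.0, 1.0, (n, n)), 1)
    W = (a + a.T + np.diag(rng.normal(0.0, 1.0, n))) / np.sqrt(n)
    _, vecs = np.linalg.eigh(W)        # columns of vecs are the eigenvectors u_i(W_n)
    sup = np.abs(vecs).max()           # sup_{i,j} |u_i(W_n)^* e_j|
    print(f"n={n}: sup coefficient = {sup:.4f}   sup * sqrt(n) = {sup * np.sqrt(n):.2f}")
```

The normalised quantity sup·√n grows only very slowly with n, which is the n^ε factor in Corollary 2.4.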

Remark 2.5. A slicker approach to eigenvector delocalisation proceeds via control of the resolvent (or Green's function) (W_n − zI)^{−1}, taking advantage of the identity
$$\mathrm{Im}((W_n - zI)^{-1})_{jj} = \sum_{i=1}^n \frac{\eta}{(\lambda_i(W_n) - E)^2 + \eta^2} |u_i(W_n)^* e_j|^2$$
for z = E + iη; see for instance [31] for details of this approach. Note from (2.9) that the Stieltjes transform arguments used to establish the local semicircle law already yield control on quantities such as ((W_n − zI)^{−1})_{jj} as a byproduct.

3. Fine-scale spectral statistics: the case of GUE

We now turn to the question of the fine-scale behavior of eigenvalues of Wigner matrices, starting with the model case of GUE. Here, it is convenient to work with the fine-scale normalisation A_n := √n M_n. For simplicity we will restrict attention to the bulk region of the spectrum, which in the fine-scale normalisation corresponds to eigenvalues λ_i(A_n) of A_n that are near nu for some fixed −2 < u < 2 independent of n.

There are several quantities at the fine scale that are of interest to study. For instance, one can directly study the distribution of individual (fine-scale normalised) eigenvalues λ_i(A_n) for a single index 1 ≤ i ≤ n, or more generally study the joint distribution of a k-tuple λ_{i_1}(A_n), …, λ_{i_k}(A_n) of such eigenvalues for some 1 ≤ i_1 < … < i_k ≤ n. Equivalently, one can obtain estimates for expressions of the form
$$E F(\lambda_{i_1}(A_n), \dots, \lambda_{i_k}(A_n)) \tag{3.1}$$

for various test functions F : R^k → R. By specializing to the case k = 2 and to translation-invariant functions F(x, y) := f(x − y), one obtains distributional information on individual eigenvalue gaps λ_{i+1}(A_n) − λ_i(A_n).

A closely related set of objects to the joint distribution of individual eigenvalues are the k-point correlation functions R_n^{(k)} = R_n^{(k)}(A_n) : R^k → R^+, defined via duality to be the unique symmetric function (or measure) for which one has
$$\int_{\mathbb{R}^k} F(x_1, \dots, x_k) R_n^{(k)}(x_1, \dots, x_k)\, dx_1 \dots dx_k = k! \sum_{1\le i_1 < \dots < i_k \le n} E F(\lambda_{i_1}(A_n), \dots, \lambda_{i_k}(A_n)) \tag{3.2}$$

An alternate approach to these results was developed by Erdős, Ramirez, Schlein, Yau, and Yin [34], [40], [41]. The details are too technical to be given here, but the main idea is to use standard tools such as log-Sobolev inequalities to control the rate of convergence of the Dyson Fokker–Planck equation to the equilibrium measure ρ_n^∞, starting from the initial data ρ_n^0; informally, if one has a good convergence to this equilibrium measure by time t, then one can obtain universality results for gauss divisible ensembles with this parameter t.

A simple model to gain heuristic intuition on the time needed to converge to equilibrium is given by the one-dimensional Ornstein–Uhlenbeck process
$$dx = \sigma\, d\beta_t - \theta(x - \mu)\, dt \tag{4.3}$$
for some parameters σ, θ > 0, μ ∈ R and some standard Brownian motion β_t. Standard computations (or dimensional analysis) suggest that this process should converge to the equilibrium measure (in this case, a normal distribution N(μ, σ²/2θ)) in time¹⁴ O(1/θ), in the sense that the probability distribution of x should differ from the equilibrium distribution by an amount that decays exponentially in tθ.

As was already observed implicitly by Dyson, the difficulty with the Dyson Fokker–Planck equation (4.2) (or equivalently, the Dyson Brownian motion (4.1)) is that different components of the evolution converge to equilibrium at different rates. Consider for instance the trace variable T := λ_1(A_n^t) + … + λ_n(A_n^t). Summing up (4.1), we see that this variable evolves by the Ornstein–Uhlenbeck process
$$dT = n\, d\beta_t - \frac{1}{2} T\, dt$$
for some standard Brownian motion β_t, and so we expect convergence to equilibrium for this variable in time O(1). At the other extreme, consider an eigenvalue gap s_i := λ_{i+1}(A_n^t) − λ_i(A_n^t) somewhere in the bulk of the spectrum. Subtracting two consecutive cases of (4.1), we see that
$$ds_i = \sqrt{2n}\, d\beta_{t,i} - \theta_i s_i\, dt + \frac{2n}{s_i}\, dt, \tag{4.4}$$
where
$$\theta_i := n \sum_{1\le j\le n:\, j\ne i,i+1} \frac{1}{(\lambda_{i+1}(A_n^t) - \lambda_j(A_n^t))(\lambda_i(A_n^t) - \lambda_j(A_n^t))} + \frac{1}{2}.$$

¹⁴If the initial position x(0) is significantly farther away from the mean μ than the standard deviation √(σ²/2θ), say |x(0) − μ| ∼ K√(σ²/2θ), then one acquires an additional factor of log K in the convergence to equilibrium, because it takes time about log K/θ for the drift term −θ(x − μ) dt in (4.3) to move x back to within O(1) standard deviations of μ. These sorts of logarithmic factors will be of only secondary importance in this analysis, ultimately being absorbed in various O(n^ε) error factors.
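The convergence heuristic for (4.3) can be simulated directly. The sketch below (an illustrative addition with arbitrary parameter values) runs an Euler–Maruyama discretisation of the Ornstein–Uhlenbeck process over many paths and tracks the distance from the equilibrium N(μ, σ²/2θ):

```python
import numpy as np

rng = np.random.default_rng(6)
sigma, theta, mu = 1.0, 4.0, 2.0
dt, steps, paths = 1e-3, 3001, 20_000

x = np.zeros(paths)                    # all paths start at 0, far from mu
for k in range(steps):
    if k % 500 == 0:
        t = k * dt
        print(f"t={t:.2f}  |E x - mu| = {abs(x.mean() - mu):.4f}  "
              f"var = {x.var():.4f}  equilibrium var = {sigma**2 / (2 * theta):.4f}")
    # Euler-Maruyama step for dx = sigma dbeta_t - theta (x - mu) dt
    x += -theta * (x - mu) * dt + sigma * np.sqrt(dt) * rng.normal(size=paths)
```

The gap in the mean decays like e^{−θt}, so the process equilibrates in time O(1/θ), exactly the heuristic used above to contrast the slow variable T (relaxation time O(1)) with the fast gap variables s_i (relaxation time O(1/n)).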


Using the heuristic λ_j(A_n) ≈ λ^{cl}_j(A_n), we expect θ_i to be of size comparable to n and s_i to be of size comparable to 1; comparing (4.4) with (4.3), we thus expect s_i to converge to equilibrium in time O(1/n).

One can use a standard log-Sobolev argument of Bakry and Émery [5] (exploiting the fact that the equilibrium measure ρ_n^∞ is the negative exponential of a strictly convex function H) to show (roughly speaking) that the Dyson Brownian motion converges to global equilibrium in time O(1); see e.g. [31]. Thus the trace variable T is among the slowest of the components of the motion to converge to equilibrium. However, for the purposes of controlling local statistics such as the normalised k-point correlation function
$$\rho_{n,u}^{(k)}(x_1, \dots, x_k) := R_n^{(k)}\Big(nu + \frac{x_1}{\rho_{sc}(u)}, \dots, nu + \frac{x_k}{\rho_{sc}(u)}\Big),$$
and particularly the averaged normalised k-point correlation function
$$\frac{1}{2b} \int_{u_0-b}^{u_0+b} R_n^{(k)}\Big(nu + \frac{x_1}{\rho_{sc}(u_0)}, \dots, nu + \frac{x_k}{\rho_{sc}(u_0)}\Big)\, du,$$
these "slow" variables turn out to essentially be irrelevant, and it is the "fast" variables such as s_i which largely control these expressions. As such, one expects these particular quantities to converge to their equilibrium limit at a much faster rate. By replacing the global equilibrium measure ρ_n^∞ with a localized variant which has better convexity properties in the slow variables, it was shown in [40], [41] by a suitable modification of the Bakry–Émery argument that one in fact had convergence to equilibrium for such expressions in time O(n^{−1+ε}) for any fixed ε; a weak version¹⁵ of the rigidity of eigenvalues statement (2.13) is needed in order to show that the error incurred by replacing the actual equilibrium measure with a localized variant is acceptable. Among other things, this argument reproves a weaker version of the result in [35] mentioned earlier, in which one obtained universality for the asymptotic (3.7) after an additional averaging in the energy parameter u. However, the method was simpler and more flexible than that in [35], as it did not rely on explicit identities, and has since been extended to many other types of ensembles, including the real symmetric analogue of gauss divisible ensembles in which the role of GUE is replaced instead by GOE. Again, we refer the reader to [31] for more details.

5. Extending beyond the GUE case II. Swapping and the Four Moment Theorem

The heat flow methods discussed in the previous section enlarge the class of Wigner matrices to which GUE-type statistics are known to hold, but do not cover all such matrices, and in particular leave out discrete ensembles such as the Bernoulli ensembles, which are not gauss divisible for any t > 0. To complement these methods, we have a family of swapping methods to extend spectral asymptotics from one ensemble to another, based on individual replacement of each coefficient of a Wigner matrix, as opposed to deforming all of the coefficients simultaneously via heat flow.

The simplest (but rather crude) example of a swapping method is based on the total variation distance d(X, Y) between two random variables X, Y taking values

¹⁵Roughly speaking, the rigidity result that is needed is that one has λ_i(W_n) = λ^{cl}_i(W_n) + O(n^{−1/2−c}) in an ℓ²-averaged sense for some absolute constant c > 0. See [31] for details.


in the same range R, defined by the formula
$$d(X, Y) := \sup_{E \subset R} |P(X \in E) - P(Y \in E)|,$$
where the supremum is over all measurable subsets E of R. Clearly one has
$$|E F(X) - E F(Y)| \le \|F\|_{L^\infty} d(X, Y)$$
for any measurable function F : R → R. As such, if d(M_n, M_n′) is small, one can approximate various spectral statistics of M_n by those of M_n′, or vice versa. For instance, one has
$$|E F(\lambda_{i_1}(A_n), \dots, \lambda_{i_k}(A_n)) - E F(\lambda_{i_1}(A_n'), \dots, \lambda_{i_k}(A_n'))| \le \|F\|_{L^\infty} d(M_n, M_n'),$$
and thus from (3.2) we have the somewhat crude bound
$$\Big|\int_{\mathbb{R}^k} F(x_1, \dots, x_k) R_n^{(k)}(A_n)(x_1, \dots, x_k) - F(x_1, \dots, x_k) R_n^{(k)}(A_n')(x_1, \dots, x_k)\, dx_1 \dots dx_k\Big| \le n^k \|F\|_{L^\infty} d(M_n, M_n')$$

and hence by (3.3)
$$\Big|\int_{\mathbb{R}^k} F(x_1, \dots, x_k) \rho_{n,u}^{(k)}(A_n)(x_1, \dots, x_k) - F(x_1, \dots, x_k) \rho_{n,u}^{(k)}(A_n')(x_1, \dots, x_k)\, dx_1 \dots dx_k\Big| \le (\rho_{sc}(u)\, n)^k \|F\|_{L^\infty} d(M_n, M_n')$$

for any test function F. On the other hand, by swapping the entries of M_n with M_n′ one at a time, we see that
$$d(M_n, M_n') \le \sum_{i=1}^n \sum_{j=1}^n d(\xi_{ij}, \xi_{ij}').$$
We thus see that if d(ξ_{ij}, ξ_{ij}′) ≤ n^{−C} for a sufficiently large constant C (depending on k), then the k-point correlation functions of M_n and M_n′ are asymptotically equivalent. This argument was quite crude, costing many more powers of n than is strictly necessary, and by arguing more carefully one can reduce this power; see [35]. However, it does not seem possible to eliminate the factors of n entirely from this type of argument.

By combining this sort of total variation-based swapping argument with the heat flow universality results for time t = n^{−1+ε}, the asymptotic (3.7) was demonstrated to hold in [35] for Wigner matrices with sufficiently smooth distribution; in particular, the k = 2 case of (3.7) was established if the distribution function of the atom variables was C^6 (i.e. six times continuously differentiable) and obeyed a number of technical decay and positivity conditions that we will not detail here. The basic idea is to approximate the distribution ρ of an atom variable ξ_{ij} in total variation distance (or equivalently, in L^1 norm) by the distribution e^{tL} ρ̃ of a gauss divisible atom variable ξ_{ij}′, with an accuracy that is better than n^{−C} for a suitable C, where L is the generator of the Ornstein–Uhlenbeck process and t = n^{−1+ε}; this can be accomplished by setting ρ̃ to essentially be a partial Taylor expansion of the (formal) backwards Ornstein–Uhlenbeck evolution e^{−tL} ρ of ρ to some bounded order, with the smoothness of ρ needed to ensure that this partial Taylor expansion ρ̃ remains well-defined as a probability distribution, and that e^{tL} ρ̃ approximates ρ sufficiently well. See [35] for more details of this method (referred to as the method of reverse heat flow in that paper).


Another fundamental example of a swapping method is the Lindeberg exchange strategy¹⁶, introduced in Lindeberg's classic proof [77] of the central limit theorem, and first applied to Wigner ensembles in [17]. We quickly sketch that proof here. Suppose that X_1, …, X_n are iid real random variables with mean zero and variance one, and let Y_1, …, Y_n be another set of iid real random variables with mean zero and variance one (which we may assume to be independent of X_1, …, X_n). We would like to show that the averages (X_1 + … + X_n)/√n and (Y_1 + … + Y_n)/√n have asymptotically the same distribution, thus
$$E F\Big(\frac{X_1 + \dots + X_n}{\sqrt{n}}\Big) = E F\Big(\frac{Y_1 + \dots + Y_n}{\sqrt{n}}\Big) + o(1)$$
for any smooth, compactly supported function F. The idea is to swap the entries X_1, …, X_n with Y_1, …, Y_n one at a time and obtain an error of o(1/n) on each such swap. For sake of illustration we shall just establish this for the first swap:
$$E F\Big(\frac{X_1 + \dots + X_n}{\sqrt{n}}\Big) = E F\Big(\frac{X_1 + \dots + X_{n-1} + Y_n}{\sqrt{n}}\Big) + o(1/n). \tag{5.1}$$
We write (X_1 + … + X_n)/√n = S + n^{−1/2} X_n, where S := (X_1 + … + X_{n−1})/√n. From Taylor expansion we see (for fixed smooth, compactly supported F) that
$$F\Big(\frac{X_1 + \dots + X_n}{\sqrt{n}}\Big) = F(S) + n^{-1/2} X_n F'(S) + \frac{1}{2} n^{-1} X_n^2 F''(S) + O(n^{-3/2} |X_n|^3).$$

(S) + O(n−3/2 |Xn |3 ). n 2

We then make the crucial observation that S and Xn are independent. On taking expectations (and assuming that Xn has a bounded third moment) we conclude that  EF

X1 + . . . + Xn √ n



1 = EF (S) + n−1/2 (EXn )EF  (S) + n−1 (EXn2 )EF  (S) + O(n−3/2 ). 2

Similarly one has  EF

X1 + . . . + Xn−1 + Yn √ n



1 = EF (S) + n−1/2 0(EYn )EF  (S)+ n−1 (EYn2 )EF  (S)+O(n−3/2 ). 2

But by hypothesis, Xn and Yn have matching moments to second order, in the sense that EXni = EYni for i = 0, 1, 2. Thus, on subtracting, we obtain (5.1) (with about a factor of n−1/2 to spare; cf. the Berry-Ess´een theorem [10], [45]). Note how the argument relied on the matching moments of the two atom variables Xi , Yi ; if one had more matching moments, one could continue the Taylor expansion and obtain further improvements to the error term in (5.1), with an additional gain of n−1/2 for each further matching moment. We can apply the same strategy to control expressions such as EF (Mn ) − F (Mn ), where Mn , Mn are two (independent) Wigner matrices. If one can obtain bounds such as ˜ n ) = o(1/n) EF (Mn ) − EF (M

¹⁶We would like to thank S. Chatterjee and M. Krishnapur for introducing this method to us.
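The swapping argument can be watched numerically. In the sketch below (an illustration; the test function and atom choices are ours, not the authors'), Rademacher variables, which match the gaussian to third order, are replaced by gaussians one block at a time, and E F of the normalised sum barely moves:

```python
import numpy as np

rng = np.random.default_rng(7)
n, trials = 400, 20_000
F = lambda s: np.exp(-s ** 2)                 # a fixed smooth bounded test function

X = rng.choice([-1.0, 1.0], (trials, n))      # Rademacher: E X = E X^3 = 0, E X^2 = 1
Y = rng.normal(0.0, 1.0, (trials, n))         # gaussian comparison variables

for k in [0, n // 4, n // 2, 3 * n // 4, n]:  # first k coordinates swapped to gaussian
    S = (Y[:, :k].sum(axis=1) + X[:, k:].sum(axis=1)) / np.sqrt(n)
    print(f"{k:>4} swaps:  E F = {F(S).mean():.5f}")
```

Each swap moves E F by O(n^{−3/2}) because the third moments match, so even all n swaps together change E F only by O(n^{−1/2}), which is what the printed values show up to Monte Carlo noise.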


We can apply the same strategy to control expressions such as E F(M_n) − E F(M_n′), where M_n, M_n′ are two (independent) Wigner matrices. If one can obtain bounds such as
$$E F(M_n) - E F(\tilde M_n) = o(1/n)$$
when M̃_n is formed from M_n by replacing¹⁷ one of the diagonal entries ξ_{ii} of M_n by the corresponding entry ξ_{ii}′ of M_n′, and bounds such as
$$E F(M_n) - E F(\tilde M_n) = o(1/n^2)$$
when M̃_n is formed from M_n by replacing one of the off-diagonal entries ξ_{ij} of M_n with the corresponding entry ξ_{ij}′ of M_n′ (and also replacing $\xi_{ji} = \overline{\xi_{ij}}$ with $\xi_{ji}' = \overline{\xi_{ij}'}$ to preserve the Hermitian property), then on summing an appropriate telescoping series, one would be able to conclude asymptotic agreement of the statistics E F(M_n) and E F(M_n′):
$$E F(M_n) - E F(M_n') = o(1). \tag{5.2}$$

As it turns out, the numerology of swapping for matrices is similar to that for the central limit theorem, in that each matching moment leads to an additional factor of O(n^{−1/2}) in the error estimates. From this, one can expect to obtain asymptotics of the form (5.2) when the entries of M_n, M_n′ match to second order on the diagonal and to fourth order off the diagonal; informally, this would mean that E F(M_n) depends only on the first four moments of the entries (and the first two moments of the diagonal entries). In the case of statistics arising from eigenvalues or eigenvectors, this is indeed the case, and the precise statement is known as the Four Moment Theorem. We first state the Four Moment Theorem for eigenvalues.

Definition 5.1. Let k ≥ 1. Two complex random variables ξ, ξ′ are said to match to order k if one has
$$E\, \mathrm{Re}(\xi)^a \mathrm{Im}(\xi)^b = E\, \mathrm{Re}(\xi')^a \mathrm{Im}(\xi')^b$$
whenever a, b ≥ 0 are integers such that a + b ≤ k.

In the model case when the real and imaginary parts of ξ or of ξ′ are independent, the matching moment condition simplifies to the assertion that E Re(ξ)^a = E Re(ξ′)^a and E Im(ξ)^b = E Im(ξ′)^b for all 0 ≤ a, b ≤ k.

Theorem 5.2 (Four Moment Theorem for eigenvalues). Let c_0 > 0 be a sufficiently small constant. Let M_n = (ξ_{ij})_{1≤i,j≤n} and M_n′ = (ξ_{ij}′)_{1≤i,j≤n} be two Wigner matrices obeying Condition C1 for some sufficiently large absolute constant C_0. Assume furthermore that for any 1 ≤ i < j ≤ n, ξ_{ij} and ξ_{ij}′ match to order 4, and for any 1 ≤ i ≤ n, ξ_{ii} and ξ_{ii}′ match to order 2. Set A_n := √n M_n and A_n′ := √n M_n′, let 1 ≤ k ≤ n^{c_0} be an integer, and let G : R^k → R be a smooth function obeying the derivative bounds
$$|\nabla^j G(x)| \le n^{c_0} \tag{5.3}$$
for all 0 ≤ j ≤ 5 and x ∈ R^k. Then for any 1 ≤ i_1 < i_2 < ⋯ < i_k ≤ n, and for n sufficiently large, we have
$$|E(G(\lambda_{i_1}(A_n), \dots, \lambda_{i_k}(A_n))) - E(G(\lambda_{i_1}(A_n'), \dots, \lambda_{i_k}(A_n')))| \le n^{-c_0}. \tag{5.4}$$
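A convenient discrete atom distribution that matches the standard gaussian to fourth order is ξ′ = ±√3 with probability 1/6 each and 0 with probability 2/3 (so Eξ′ = E(ξ′)³ = 0, E(ξ′)² = 1, E(ξ′)⁴ = 3). The sketch below is our illustration of two matched ensembles (it is not from the original text) and compares a single bulk eigenvalue statistic across them:

```python
import numpy as np

rng = np.random.default_rng(8)
n, samples = 200, 300
i = 3 * n // 4                               # a bulk index

def bulk_eig(atom):
    a = np.triu(atom((n, n)), 1)
    W = (a + a.T + np.diag(atom(n))) / np.sqrt(n)
    return n * np.sort(np.linalg.eigvalsh(W))[i]   # lambda_i(A_n) = n * lambda_i(W_n)

gauss = lambda size: rng.normal(0.0, 1.0, size)
# +-sqrt(3) w.p. 1/6 each and 0 w.p. 2/3: matches the gaussian to order 4.
matched = lambda size: rng.choice([-np.sqrt(3.0), 0.0, np.sqrt(3.0)],
                                  size, p=[1 / 6, 2 / 3, 1 / 6])

for atom, name in [(gauss, "gaussian atoms"), (matched, "4-moment-matched atoms")]:
    lam = np.array([bulk_eig(atom) for _ in range(samples)])
    print(f"{name:>24}: mean {lam.mean():+.3f}, se {lam.std() / np.sqrt(samples):.3f}")
```

The two ensembles produce statistically indistinguishable values of this fine-scale statistic, in line with Theorem 5.2.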

We remark that in the papers [108], [103], [109], a variant of the above result, which we called the three moment theorem, was asserted, in which the hypothesis of four matching moments off the diagonal was relaxed to three matching moments 17 Technically, the matrices M ˜ n formed by such a swapping procedure are not Wigner matrices as defined in Definition 1.1, because the diagonal or upper-triangular entries are no longer identically distributed. However, all of the relevant estimates for Wigner matrices can be extended to the non-identically-distributed case at the cost of making the notation slightly more complicated. As this is a relatively minor issue, we will not discuss it further here.


(and no moment matching was required on the diagonal), but for which the bound (5.3) was improved to |∇^j G(x)| ≤ n^{−Cjc_0} for some sufficiently large absolute constant C > 0. Unfortunately, the proof given of the three moment theorem in these papers was not correct as stated, although the claim can still be proven in most cases by other means; see [114].

A preliminary version of Theorem 5.2 was first established by the authors in [108], in the case¹⁸ of bulk eigenvalues (thus δn ≤ i_1, …, i_k ≤ (1 − δ)n for some absolute constant δ > 0) and assuming Condition C0 instead of Condition C1. In [103], the restriction to the bulk was removed; and in [109], Condition C0 was relaxed to Condition C1 for a sufficiently large value of C_0. We will discuss the proof of this theorem in Section 6.

The following technical generalization of the Four Moment Theorem, in which the entries of M_n, M_n′ only match approximately rather than exactly, is useful for some applications.

Proposition 5.3 (Four Moment Theorem, approximate moment matching case). The conclusions of Theorem 5.2 continue to hold if the requirement that ξ_{ij} and ξ_{ij}′ match to order 4 is relaxed to the conditions
$$|E\, \mathrm{Re}(\xi_{ij})^a \mathrm{Im}(\xi_{ij})^b - E\, \mathrm{Re}(\xi_{ij}')^a \mathrm{Im}(\xi_{ij}')^b| \le \varepsilon_{a+b}$$
whenever a, b ≥ 0 and a + b ≤ 4, where
$$\varepsilon_0 = \varepsilon_1 = \varepsilon_2 := 0; \qquad \varepsilon_3 := n^{-1/2-Cc_0}; \qquad \varepsilon_4 := n^{-Cc_0}$$
for some absolute constant C > 0.

This proposition follows from an inspection of the proof of Theorem 5.2: see Section 6.

A key technical result used in the proof of the Four Moment Theorem, which is also of independent interest, is the gap theorem:

Theorem 5.4. Let M_n be a Wigner matrix obeying Condition C1 for a sufficiently large absolute constant C_0. Then for every c_0 > 0 there exists a c_1 > 0 (depending only on c_0) such that
$$P(|\lambda_{i+1}(A_n) - \lambda_i(A_n)| \le n^{-c_0}) \lesssim n^{-c_1}$$
for all 1 ≤ i < n.

We discuss this theorem in Section 6. Among other things, the gap theorem tells us that eigenvalues of a Wigner matrix are usually simple. Closely related level repulsion estimates were established (under an additional smoothness hypothesis on the atom distributions) in [39].

Another variant of the Four Moment Theorem was subsequently introduced in [43], in which the eigenvalues λ_{i_j}(A_n) appearing in Theorem 5.2 were replaced by expressions such as (3.4) that are derived from the resolvent (or Green's function) (W_n − z)^{−1}, but with slightly different technical hypotheses on the matrices M_n, M_n′; see [43] for full details. As the resolvent-based quantities (3.4) are averaged statistics that sum over many eigenvalues, they are far less sensitive to the eigenvalue repulsion phenomenon than the individual eigenvalues.

for some absolute constant C > 0. This proposition follows from an inspection of the proof of Theorem 5.2: see Section 6. A key technical result used in the proof of the Four Moment Theorem, which is also of independent interest, is the gap theorem: Theorem 5.4. Let Mn be a Wigner matrix obeying Condition C1 for a sufficiently large absolute constant C0 . Then for every c0 > 0 there exists a c1 > 0 (depending only on c0 ) such that P(|λi+1 (An ) − λi (An )| ≤ n−c0 )  n−c1 for all 1 ≤ i < n. We discuss this theorem in Section 6. Among other things, the gap theorem tells us that eigenvalues of a Wigner matrix are usually simple. Closely related level repulsion estimates were established (under an additional smoothness hypothesis on the atom distributions) in [39]. Another variant of the Four Moment Theorem was subsequently introduced in [43], in which the eigenvalues λij (An ) appearing in Theorem 5.2 were replaced by expressions such as (3.4) that are derived from the resolvent (or Green’s function) (Wn − z)−1 , but with slightly different technical hypotheses on the matrices Mn , Mn ; see [43] for full details. As the resolvent-based quantities (3.4) are averaged statistics that sum over many eigenvalues, they are far less sensitive to the eigenvalue repulsion phenomenon than the individual eigenvalues, and as such the version of the Four Moment Theorem for Green’s function has a somewhat simpler 18 In the paper, k was held fixed, but an inspection of the argument reveals that it extends without difficulty to the case when k is as large as nc0 , for c0 small enough.


As such, the version of the Four Moment Theorem for Green's function has a somewhat simpler proof (based on resolvent expansions rather than the Hadamard variation formulae and Taylor expansion). Conversely, though, to use the Four Moment Theorem for Green's function to control individual eigenvalues, while possible, requires a significant amount of additional argument; see [71].

We now discuss the extension of the Four Moment Theorem to eigenvectors rather than eigenvalues. Recall that we are using u_1(M_n), …, u_n(M_n) to denote the unit eigenvectors of a Hermitian matrix M_n associated to the eigenvalues λ_1(M_n), …, λ_n(M_n); thus u_1(M_n), …, u_n(M_n) lie in the unit sphere S^{2n−1} := {z ∈ C^n : |z| = 1} of C^n. We write u_{i,p}(M_n) = u_i(M_n)^* e_p for the p-th coefficient of u_i(M_n) for each 1 ≤ i, p ≤ n. If M_n is not Hermitian, but is in fact real symmetric, then we can require the u_i(M_n) to have real coefficients, thus taking values in the unit sphere S^{n−1} := {x ∈ R^n : |x| = 1} of R^n.

Unfortunately, the eigenvectors u_i(M_n) are not unique in either the Hermitian or real symmetric cases; even if one assumes that the spectrum of M_n is simple, in the sense that
$$\lambda_1(M_n) < \dots < \lambda_n(M_n),$$
one has the freedom to rotate each u_i(M_n) by a unit phase e^{√−1 θ} ∈ U(1). In the real symmetric case, in which we force the eigenvectors to have real coefficients, one only has the freedom to multiply each u_i(M_n) by a sign ± ∈ O(1). There are a variety of ways to eliminate this ambiguity. For sake of concreteness we will remove the ambiguity by working with the orthogonal projections P_i(M_n) to the eigenspace at eigenvalue λ_i(M_n); if this eigenvalue is simple, we simply have P_i(M_n) := u_i(M_n) u_i(M_n)^*.

Theorem 5.5 (Four Moment Theorem for eigenvectors). Let c_0, M_n, M_n′, C_0, A_n, A_n′, k be as in Theorem 5.2. Let G : R^k × C^k → R be a smooth function obeying the derivative bounds
$$|\nabla^j G(x)| \le n^{c_0} \tag{5.5}$$
for all 0 ≤ j ≤ 5 and x ∈ R^k × C^k. Then for any 1 ≤ i_1, i_2, …, i_k ≤ n and 1 ≤ p_1, …, p_k, q_1, …, q_k ≤ n, and for n sufficiently large depending on ε, c_0, C_0, we have
$$|E\, G(\Phi(A_n)) - E\, G(\Phi(A_n'))| \le n^{-c_0}, \tag{5.6}$$
where for any matrix M of size n, Φ(M) ∈ R^k × C^k is the tuple
$$\Phi(M) := \big((\lambda_{i_a}(M))_{1\le a\le k},\ (n P_{i_a,p_a,q_a}(M))_{1\le a\le k}\big),$$
and P_{i,p,q}(M) is the pq coefficient of the projection P_i(M). The bounds are uniform in the choice of i_1, …, i_k, p_1, …, p_k, q_1, …, q_k.

Theorem 5.5 extends (the first part of) Theorem 5.2, which deals with the case where the function G only depends on the R^k component of R^k × C^k. This theorem


as stated appears in [107]; a slight variant¹⁹ of the theorem (proven via the Four Moment Theorem for Green's function) was simultaneously²⁰ established in [71]. We also remark that the Four Moment Theorem for eigenvectors (in conjunction with the eigenvalue rigidity bound (2.13)) can be used to establish a variant of the Four Moment Theorem for Green's function, which has the advantage of being applicable all the way up to the real axis (assuming a level repulsion hypothesis); see [107].

5.1. The necessity of Four Moments. It is a natural question to ask whether the requirement of four matching moments (or four approximately matching moments, as in Proposition 5.3) is genuinely necessary. As far as the distribution of individual eigenvalues λ_i(A_n) is concerned, the answer is essentially "yes", even in the identically distributed case, as the following result from [106] shows.

Theorem 5.6 (Necessity of fourth moment hypothesis). Let M_n, M_n′ be real symmetric Wigner matrices whose atom variables ξ, ξ′ have vanishing third moments Eξ³ = E(ξ′)³ = 0 but distinct fourth moments Eξ⁴ ≠ E(ξ′)⁴. Then for all sufficiently large n, one has
$$\frac{1}{n} \sum_{i=1}^n |E \lambda_i(A_n) - E \lambda_i(A_n')| \ge \kappa$$
for some κ depending only on the atom distributions.

This result is established by combining a computation of the fourth moment ∑_{i=1}^n λ_i(A_n)⁴ with eigenvalue rigidity estimates such as (2.13). Informally, it asserts that on average, the mean value of λ_i(A_n) is sensitive to the fourth moment of the atom distributions at the scale of the mean eigenvalue spacing (which is comparable to 1 in the bulk at least). In contrast, the Four Moment Theorem morally²¹ asserts that when the atom variables of M_n and M_n′ match to fourth order, then the medians of λ_i(A_n) and of λ_i(A_n′) only differ by O(n^{−c_0}). Thus, Theorem 5.6 and Theorem 5.2 are not directly comparable to each other. Nevertheless, it is expected that the mean and median of λ_i(A_n) should be asymptotically equal at the level of the mean eigenvalue spacing, although this is just beyond the known results (such as (2.13)) on the distribution of these eigenvalues. As such, Theorem 5.6 provides substantial evidence that the Four Moment Theorem breaks down if one does not have any sort of matching at the fourth moment.

¹⁹Besides the differences in the methods of proof, the hypotheses of the result in [71] differ in some technical aspects from those in Theorem 5.5. For instance, Condition C1 is replaced with Condition C0, and k is restricted to be bounded, rather than being allowed to be as large as n^{c_0}. On the other hand, the result is sharper at the edge of the spectrum (one only requires matching up to two moments, rather than up to four), and the result can be extended to "generalized Wigner matrices" for which the variances of the entries are allowed to differ, provided that a suitable analogue of Theorem 5.4 holds.
²⁰More precisely, the results in [107] were announced at the AIM workshop "Random matrices" in December 2010 and appeared on the arXiv in March 2011. A preliminary version of the results in [71] appeared in February 2011, with a final version appearing in March 2011.
²¹This is not quite true as stated, because of the various error terms in the Four Moment Theorem, and the requirement that the function G in that theorem is smooth. A more accurate statement (cf. the proof of Theorem 7.1 below) is that if the median of λ_i(A_n) is M (thus P(λ_i(A_n) ≤ M) = 1/2, in the continuous case at least), then one has
$$P(\lambda_i(A_n') \le M + n^{-c_0}),\ P(\lambda_i(A_n') \ge M - n^{-c_0}) \ge 1/2 - n^{-c_0},$$
which almost places the median of λ_i(A_n′) within O(n^{−c_0}) of M.


By computing higher moments of λ_i(A_n), it was conjectured in [106] that one has an asymptotic of the form
$$E \lambda_i(A_n) = n \lambda^{cl}_i(W_n) + C_{i,n} + \frac{1}{4}\big(\lambda^{cl}_i(W_n)^3 - 2\lambda^{cl}_i(W_n)\big) E\xi^4 + O(n^{-c}) \tag{5.7}$$
for all i in the bulk region δn ≤ i ≤ (1 − δ)n, where C_{i,n} is a quantity independent of the atom distribution ξ. (At the edge, the dependence on the fourth moment is weaker, at least when compared against the (now much wider) mean eigenvalue spacing; see Section 7.1.)

We remark that while the statistics of individual eigenvalues are sensitive to the fourth moment, averaged statistics such as the k-point correlation functions ρ_{n,u}^{(k)} are much less sensitive to this moment (or the third moment). Indeed, this is already visible from the results in Section 4, as gauss divisible matrices can have a variety of possible third or fourth moments for their atom distributions (see Lemma 7.6).
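The fourth-moment sensitivity in Theorem 5.6 and the shape of the conjectured correction in (5.7) can be probed by Monte Carlo. In this sketch (an illustration; the sample size is small, so the output is noisy) the atoms are Rademacher (Eξ⁴ = 1) versus gaussian (Eξ⁴ = 3):

```python
import numpy as np

rng = np.random.default_rng(10)
n, samples = 100, 1000
i = 3 * n // 4                    # a bulk index away from the centre, where the
                                  # coefficient (x^3 - 2x)/4 in (5.7) is nonzero

def mean_bulk_eig(atom):
    acc = 0.0
    for _ in range(samples):
        a = np.triu(atom((n, n)), 1)
        A = np.sqrt(n) * (a + a.T + np.diag(atom(n)))    # fine-scale A_n = sqrt(n) M_n
        acc += np.sort(np.linalg.eigvalsh(A))[i]
    return acc / samples

gauss = lambda size: rng.normal(0.0, 1.0, size)            # E xi^4 = 3
rademacher = lambda size: rng.choice([-1.0, 1.0], size)    # E xi^4 = 1

m_g, m_r = mean_bulk_eig(gauss), mean_bulk_eig(rademacher)
print(f"E lambda_i(A_n), gaussian:   {m_g:+.3f}")
print(f"E lambda_i(A_n), rademacher: {m_r:+.3f}")
print(f"difference: {m_g - m_r:+.3f}")
```

By (5.7), the difference should be roughly (x³ − 2x)/4 times (3 − 1) with x = λ^{cl}_i(W_n), i.e. an O(1) shift on the scale of the mean eigenvalue spacing of A_n, and this is visible above the Monte Carlo noise.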

6. Sketch of proof of the Four Moment Theorem

In this section we discuss the proof of Theorem 5.2 and Theorem 5.4, following the arguments that originated in [108] and were refined in [109]. To simplify the exposition, we will just discuss the four moment theorem; the proof of the approximate four moment theorem in Proposition 5.3 is established by a routine modification of the argument.

For technical reasons, the two theorems need to be proven together. Let us say that a Wigner matrix M_n has the gap property if it obeys the conclusion of Theorem 5.4; thus Theorem 5.4 asserts that all Wigner matrices obeying Condition C1 for sufficiently large C_0 have the gap property. We do not know of a direct proof of this result that does not go through the Four Moment Theorem; however, it is possible to establish an independent proof of a more restrictive result:

Theorem 6.1 (Gap theorem, special case). Any Wigner matrix obeying Condition C0 has the gap property.

We discuss this theorem (which is [108, Theorem 19]) later in this section.

Another key ingredient is the following truncated version of the Four Moment Theorem, in which one removes the event that two consecutive eigenvalues are too close to each other. For technical reasons, we need to introduce the quantities
$$Q_i(A_n) := \sum_{j \ne i} \frac{1}{|\lambda_j(A_n) - \lambda_i(A_n)|^2}$$

for i = 1, . . . , n, which is a regularised measure of extent to which λi (An ) is close to any other eigenvalue of An . Theorem 6.2 (Truncated Four Moment Theorem). Let c0 > 0 be a sufficiently

)1≤i,j≤n be two Wigner matrismall constant. Let Mn = (ξij )1≤i,j≤n and Mn = (ξij ces obeying Condition C1for some sufficiently large absolute constant C0 . Assume

match to order 4 and for any furthermore that for any 1 ≤ i < j ≤ n, ξij and ξij √ √

1 ≤ i ≤ n, ξii and ξii match to order 2. Set An := nMn and A n := nMn , let 1 ≤ k ≤ nc0 be an integer, and let G = G(λi1 , . . . , λik , Qi1 , . . . , Qik )

144

TERENCE TAO AND VAN VU

be a smooth function from Rk × Rk+ to R that is supported in the region Qi1 , . . . , Qik ≤ nc0

(6.1) and obeys the derivative bounds (6.2)

|∇j G(λi1 , . . . , λik , Qi1 , . . . , Qik )| ≤ nc0

for all 0 ≤ j ≤ 5. Then EG(λi1 (An ), . . . , λik (An ), Qi1 (An ), . . . , Qik (An )) = (6.3) EG(λi1 (A n ), . . . , λik (A n ), Qi1 (A n ), . . . , Qik (A n )) + O(n−1/2+O(c0 ) . We will discuss the proof of this theorem shortly. Applying Theorem 6.2 with k = 1 and a function G that depends only a single variable Qi , and using the gap property to bound Qi (cf. [108, Lemma 49]), one can show a four moment property for the gap theorem: if Mn , Mn are Wigner matrices obeying Condition C1 for a sufficiently large C0 which match to fourth order, and Mn obeys the gap property, then Mn also obeys the gap property. Using this and Theorem 6.1, one can then obtain Theorem 5.4 in full generality. Using Theorem 5.4, one can then deduce Theorem 5.2 from Theorem 6.2 by smoothly truncating in the Q variables: see [108, §3.3]. It remains to establish Theorem 6.1 and Theorem 6.2. We begin with Theorem 6.2. To simplify the exposition slightly, let us assume that the matrices Mn , Mn

are real symmetric rather than Hermitian. To reduce the number of parameters, we will also set C0 := 1/c0 . As indicated in Section 5, the basic idea is to use the Lindeberg exchange ˜ n be the matrix formed from Mn by replacing strategy. To illustrate the idea, let M

a single entry ξpq of Mn with the corresponding entry ξpq of Mn for some p < q, ˜ n Hermitian. with a similar swap also being performed at the ξqp entry to keep M ˜ Strictly speaking, Mn is not a Wigner matrix as defined in Definition 1.1, as the entries are no longer identically distributed, but this will not significantly affect the arguments. (One also needs to perform swaps on the diagonal, but this can be handled in essentially the same manner.) √ ˜ Set A˜n := nM n as usual. We will sketch the proof of the claim that EG(λi1 (An ), . . . , λik (An ), Qi1 (An ), . . . , Qik (An )) = EG(λi1 (A˜n ), . . . , λik (A˜n ), Qi1 (A˜n ), . . . , Qik (A˜n )) + O(n−5/2+O(c0 ) ; by telescoping together O(n2 ) estimates of this sort one can establish (6.3). (For swaps on the diagonal, one only needs an error term of O(n−3/2+O(c0 ) ), since there are only O(n) swaps to be made here rather than O(n2 ). This is ultimately why there are two fewer moment conditions on the diagonal than off it.)

We can write An = A(ξpq ), A˜n = A(ξpq ), where A(t) = A(0) + tA (0) is a (random) Hermitian matrix depending linearly22 on a real parameter t, with A(0) being a Wigner matrix with one entry (and its adjoint) zeroed out, and A (0) 22 If we were working with Hermitian matrices rather than real symmetric matrices, then one could either swap the real and imaginary parts of the ξij separately (exploiting the hypotheses that these parts were independent), or else repeat the above analysis with t now being a complex parameter (or equivalently, two real parameters) rather than a real one. In the latter case, one needs to replace all instances of single variable calculus below (such as Taylor expansion) with

UNIVERSALITY FOR WIGNER ENSEMBLES

145

is the explicit elementary Hermitian matrix A (0) = ep e∗q + e∗p eq .

(6.4)

We note the crucial fact that the random matrix A(0) is independent of both ξpq



and ξpq . Note from Condition C1 that we expect ξpq , ξpq to have size O(nO(c0 ) ) most of the time, so we should (heuristically at least) be able to restrict attention to the regime t = O(nO(c0 ) ). If we then set (6.5)

F (t) := EG(λi1 (A(t)), . . . , λik (A(t)), Qi1 (A(t)), . . . , Qik (A(t)))

then our task is to show that

) + O(n−5/2+O(c0 ) ). EF (ξpq ) = EF (ξpq

(6.6)

Suppose that we have Taylor expansions of the form (6.7)

λil (A(t)) = λil (A(0)) +

4 

cl,j tj + O(n−5/2+O(c0 ) )

j=1 O(c0 )

) and l = 1, . . . , k, where the Taylor coefficients cl,j have size for all t = O(n −j/2+O(c0 ) cl,j = O(n , and similarly for the quantities Qil (A(t)). Then by using the hypothesis (6.2) and further Taylor expansion, we can obtain a Taylor expansion F (t) = F (0) +

4 

fj tj + O(n−5/2+O(c0 ) )

j=1

for the function F (t) defined in (6.5), where the Taylor coefficients fj have size fj = O(n−j/2+O(c0 ) ). Setting t equal to ξpq and taking expectations, and noting that the Taylor coefficients fj depend only on F and A(0) and is thus independent of ξij , we conclude that EF (ξpq ) = EF (0) +

4 

j (Efj )(Eξpq ) + O(n−5/2+O(c0 ) )

j=1

). EF (ξpq

If ξpq and ξpq have matching moments to fourth order, and similarly for this gives (6.6). (Note that a similar argument also would give the Three Moment Theorem, as well as Proposition 5.3.) It remains to establish (6.7) (as well as the analogue for Qil (A(t)), which turns out to be analogous). We abbreviate il simply as i. By Taylor’s theorem with remainder, it would suffice to show that

dj λi (A(t)) = O(n−j/2+O(c0 ) ) dtj for j = 1, . . . , 5. As it turns out, this is not quite true as stated, but it becomes true (with overwhelming probability23 ) if one can assume that Qi (A(t)) is bounded by nO(c0 ) . In principle, one can reduce to this case due to the restriction (6.1) on the support of G, although there is a technical issue because one will need to (6.8)

double variable calculus, but aside from notational difficulties, it is a routine matter to perform this modification. 23 Technically, each value of t has a different exceptional event of very small probability for which the estimates fail. Since there are uncountably many values of t, this could potentially cause a problem when applying the union bound. In practice, though, it turns out that one can restrict t to a discrete set, such as the multiples of n−100 , in which case the union bound can be applied without difficulty. See [108] for details.

146

TERENCE TAO AND VAN VU

establish the bounds (6.8) for values of t other than ξpq or ξ˜pq . This difficulty can be overcome by a continuity argument; see [108]. For the purposes of this informal discussion, we shall ignore this issue and simply assume that we may restrict to the case where (6.9)

Qi (A(t))  nO(c0 ) .

In particular, the eigenvalue λi (A(t)) is simple, which ensures that all quantities depend smoothly on t (locally, at least). To prove (6.8), one can use the classical Hadamard variation formulae for the derivatives of λi (A(t)), which can be derived for instance by repeatedly differentiating the eigenvector equation A(t)ui (A(t)) = λi (A(t))ui (A(t)). The formula for the first derivative is d λi (A(t)) = ui (A(t))∗ A (0)ui (A(t)). dt But recall from eigenvalue delocalisation (Corollary 2.4) that with overwhelming probability, all coefficients of ui (A(t)) have size O(n−1/2+o(1) ); given the nature of the matrix (6.4), we can then obtain (6.8) in the j = 1 case. Now consider the j = 2 case. The second derivative formula reads  |ui (A(t))∗ A (0)uj (A(t))|2 d2 λ (A(t)) = −2 i dt2 λj (A(t)) − λi (A(t)) j=i

(compare with the formula (4.1) for Dyson Brownian motion). Using eigenvalue delocalisation as before, we see with overwhelming probability that the numerator is O(n−1+o(1) ). To deal with the denominator, one has to exploit the hypothesis (6.9) and the local semicircle law (Theorem 2.3). Using these tools, one can conclude (6.8) in the j = 2 case with overwhelming probability. It turns out that one can continue this process for higher values of j, although the formulae for the derivatives for λi (A(t)) (and related quantities, such as Pi (A(t)) and Qi (A(t))) become increasingly complicated, being given by a certain recursive formula in j. See [108] for details. Now we briefly discuss the proof24 of Theorem 6.1. For sake of discussion we restrict attention to the bulk case εn ≤ i ≤ (1 − ε)n; the changes needed to deal with the edge case are relatively minor and are discussed in [103]. The objective here is to limit the probability of the event that the quantity λi+1 (An ) − λi (An ) is unexpectedly small. The main difficulty here is the fact that one is comparing two adjacent eigenvalues. If instead one was bounding λi+k (An ) − λi (An ) for a larger value of k, say k ≥ logC n for a large value of C, then one could obtain such a bound from the local semicircle law (Theorem 2.3) without much difficulty. To reduce k all the way down to 1, the idea is to exploit the following phenomenon: If λi+1 (An ) − λi (An ) is small, then λi+1 (An−1 ) − λi−1 (An−1 ) is also likely to be small. 24 The argument here is taken from [108]. In the case when the atom distributions are sufficiently smooth, one can also deduce this result from the level repulsion argument in [39, Theorem 3.5] and the eigenvalue rigidity estimate (2.13), and by using Theorem 6.2 one can extend the gap property to several other Wigner ensembles. However, this argument does not cover the case of Bernoulli ensembles, which is perhaps the most difficult case of Theorem 5.4 or Theorem 6.1.

UNIVERSALITY FOR WIGNER ENSEMBLES

147

Here An−1 denotes25 the top left n − 1 × n − 1 minor of An . This phenomenon can be viewed as a sort of converse to the classical Cauchy interlacing law (6.10)

λi−1 (An−1 ) ≤ λi (An ) ≤ λi (An−1 ) ≤ λi+1 (An ) ≤ λi+1 (An−1 )

(cf. (2.10)), since this law clearly shows that λi+1 (An ) − λi (An ) will be small whenever λi+1 (An−1 ) − λi−1 (An−1 ) is. In principle, if one iterates (generalisations of) the above principle k = logC n times, one eventually reaches an event that can be shown to be highly unlikely by the local semicircle law. To explain why we expect such a phenomenon to be true, let us expand An as   An−1 √ X An = nξnn X∗ √ n−1 where X ∈ C is the random vector with entries nξnj for j = 1, . . . , n − 1. By expanding out the eigenvalue equation An ui (An ) = λi (An )ui (An ), one eventually obtains the formula n−1  |uj (An−1 )∗ X|2 √ = nξnn − λi (An ) (6.11) λ (An−1 ) − λi (An ) j=1 j that relates λi (An ) to the various eigenvalues λj (An−1 ) of An−1 (ignoring for sake of discussion the non-generic case when one or more of the denominators in (6.11) vanish); compare with (2.14). Using concentration of measure tools (such as Talagrand’s inequality, see e.g. [76]), one expects |uj (An−1 )∗ X|2 to concentrate around its mean, which can be computed to be n(n − 1). In view of this and, one expects the largest (and thus, presumably, the most dominant) terms in (6.11) to be the summands on the left-hand side when j is equal to either i − 1 or i. In particular, if λi+1 (An ) − λi (An ) is unexpectedly small (e.g. smaller than n−c0 ), then by (6.10) λi (An−1 ) − λi (An ) is also small. This causes the j = i summand in (6.11) to (usually) be large and positive; to counterbalance this, one then typically expects the j = i−1 summand to be large and negative, so that λi (An )−λi−1 (An−1 ) is small; in particular, λi (An−1 ) − λi−1 (An−1 ) is small. A similar heuristic argument (based on (6.11) but with λi (An ) replaced by λi+1 (An ) predicts that λi+1 (An−1 ) − λi (An−1 ) is also small; summing, we conclude that λi+1 (An−1 ) − λi−1 (An−1 ) is also small, thus giving heuristic support to the above phenomenon. One can make the above arguments more rigorous, but the details are rather complicated. One of the complications arises from the slow decay of the term 1 λj (An−1 )−λi (An ) as i moves away from j. Because of this, a large positive term (such as the j = i summand) in (6.11) need not be balanced primarily by the negative j = i − 1 summand, but instead by a dyadic block i − 2k ≤ j < i − 2k−1 of such summands; but this can be addressed by replacing the gap λi+1 (An ) − λi (An ) by a more complicated quantity (called the regularized gap in [108]) that is an infimum of a moderately large number of (normalised) gaps λi+ (An ) − λi− (An ). A more serious issue is that the numerators |uj (An−1 )∗ X| can sometimes be much smaller than their expected value of ∼ n, which can cause the gap at An−1 to be significantly larger than that at An . By carefully counting all the possible cases and estimating all the error probabilities, one can still keep the net error of this √

n−1 speaking, one has to multiply An−1 also by √ to be consistent with our n conventions for Mn and Mn−1 , although this factor turns out to make very little difference to the analysis. 25 Strictly

148

TERENCE TAO AND VAN VU

situation to be of the form O(n−c ) for some c > 0. It is in this delicate analysis that one must rely rather heavily on the exponential decay hypothesis in Condition C0, as opposed to the polynomial decay hypothesis in Condition C1. This concludes the sketch of Theorem 6.1. We remarked earlier that the extension to the edge case is fairly routine. In part, this is because the expected eigenvalue gap λi+1 (An ) − λi (An ) becomes much wider at the edge (as large as n1/3 , for instance, when i = 1 or i = n − 1), and so Theorem 6.1 and Theorem 5.4 becomes a weaker statement. There is however an interesting “bias” phenomenon that is worth pointing out at the edge, for instance with regard with the interlacing (6.12)

λn−1 (An ) ≤ λn−1 (An−1 ) ≤ λn (An )

of the very largest eigenvalues. On the one hand, the gap λn (An ) − λn−1 (An ) between the top two eigenvalues of An is expected (and known, in many cases) to be comparable to n1/3 on the average; see (7.2) below. On the other hand, from the semi-circular law one expects λn (An ) to grow like 2n, which suggests that λn (An ) − λn−1 (An−1 ) should be comparable to 1, rather than to n1/3 . In other words, the interlacing (6.12) is biased ; the intermediate quantity λn−1 (An−1 ) should be far closer to the right-most quantity λn (An ) than the left-most quantity λn−1 (An ). This bias can in fact be demonstrated by using the fundamental equation (6.11); the point is that in the edge case (when i is close to n) the term −λi (An ) on the right-hand side plays a major role, and has to be balanced by λi (An )−λi (An−1 ) being as small as O(1). This bias phenomenon is not purely of academic interest; it turns out to be an essential ingredient in the proof of eigenvalue delocalisation (Corollary 2.4) in the edge case, as discussed in Section 2. See [103] for more discussion. It would be of interest to understand the precise relationship between the various eigenvalues in (6.10) or (6.12); the asymptotic joint distribution for, say, λi (An ) and λi (An−1 ) is currently not known, even in the GUE case. 7. Applications By combining the heat flow methods with swapping tools such as the Four Moment Theorem, one can extend a variety of results from the GUE (or gauss divisible) regime to wider classes of Wigner ensembles. We now give some examples of such extensions. 7.1. Distribution of individual eigenvalues. One of the simplest instances of the method arises when extending the central limit theorem (3.10) of Gustavsson [57] for eigenvalues λi (An ) in the bulk from GUE to more general ensembles: Theorem 7.1. The gaussian fluctuation law (3.10) continues to hold for Wigner matrices obeying Condition C1 for a sufficiently large C0 , and whose atom distributions match that of GUE to second order on the diagonal and fourth order off the diagonal; thus, one has λ (A ) − λcl i (An ) i n → N (0, 1)R 2 log n/2π /ρsc (u) whenever λcl i (An ) = n(u + o(1)) for some fixed −2 < u < 2.

UNIVERSALITY FOR WIGNER ENSEMBLES

149

Proof. Let Mn be drawn from GUE, thus by (3.10) one already has λ (A ) − λcl i (An ) i n → N (0, 1)R 2 log n/2π /ρsc (u) cl

(note that λcl i (An ) = λi (An ). To conclude the analogous claim for An , it suffices to show that

(7.1)

P(λi (A n ) ∈ I− ) − n−c0 ≤ P(λi (An ) ∈ I) ≤ P(λi (A n ) ∈ I+ ) + n−c0

for all intervals I = [a, b], and n sufficiently large, where I+ := [a − n−c0 /10 , b + n−c0 /10 ] and I− := [a + n−c0 /10 , b − n−c0 /10 ]. We will just prove the second inequality in (7.1), as the first is very similar. We define a smooth bump function G : R → R+ equal to one on I− and vanishing outside of I+ . Then we have P(λi (An ) ∈ I) ≤ EG(λi (An )) and

EG(λi (A n )) ≤ P(λi (A n ) ∈ I) On the other hand, one can choose G to obey (5.3). Thus by Theorem 5.2 we have |EG(λi (An )) − EG(λi (A n ))| ≤ n−c0 and the second inequality in (7.1) follows from the triangle inequality. The first inequality is similarly proven using a smooth function that equals 1 on I− and vanishes outside of I.  Remark 7.2. In [57] the asymptotic joint distribution of k distinct eigenvalues λi1 (Mn ), . . . , λik (Mn ) in the bulk of a GUE matrix Mn was computed (it is a gaussian k-tuple with an explicit covariance matrix). By using the above argument, one can extend that asymptotic for any fixed k to other Wigner matrices, so long as they match GUE to fourth order off the diagonal and to second order on the diagonal. If one could extend the results in [57] to broader ensembles of matrices, such as gauss divisible matrices, then the above argument would allow some of the moment matching hypotheses to be dropped, using tools such as Lemma 7.6. Remark 7.3. Recently in [25], a moderate deviations property of the distribution of the eigenvalues λi (An ) was established first for GUE, and then extended to the same class of matrices considered in Theorem 7.1 by using the Four Moment Theorem. An analogue of Theorem 7.1 for real symmetric matrices (using GOE instead of GUE) was established in [84]. A similar argument to the one given in Theorem 7.1 also applies at the edge of the spectrum. For sake of discussion we shall just discuss the distribution of the largest eigenvalue λn (An ). In the case of a GUE ensemble, this largest eigenvalue is famously governed by the Tracy-Widom law [115, 116], which asserts that λn (An ) − 2n ≤ t) → det(1 − T[t,+∞) ) n1/3 for any fixed t ∈ R, where T[t,+∞) : L2 ([t, +∞)) → L2 ([t, +∞)) is the integral operator +∞ Ai(x)Ai (y) − Ai (x)Ai(y) f (y) dy T[t,+∞) f (x) := x−y t

(7.2)

P(

150

TERENCE TAO AND VAN VU

and Ai : R → R is the Airy function 1 Ai(x) := π





cos( 0

t3 + xt) dt. 3

Interestingly, the limiting distribution in (7.2) also occurs in many other seemingly unrelated contexts, such as the longest increasing subsequence in a random permutation [4, 116]. It is conjectured that the in fact holds for all Wigner matrices obeying Condition C1 with C0√= 4; this value of C0 is optimal, as one does not expect λ1 (An ) to stay near 2 n without this hypothesis (see [7]). While this conjecture is not yet fully resolved, there has now been a substantial amount of partial progress on the problem [44, 65, 69, 95, 99, 103]. Soshnikov [99] was the first to obtain the TracyWidom law for a large class of Wigner matrices; thanks to subsequent refinements in [69, 95], we know that (7.2) holds for all Wigner matrices whose entries are iid with symmetric distribution and obeying Condition C1 with C0 = 12. On the other hand, by using the argument used to prove Theorem 7.1, one can also obtain the asymptotic (7.2) for Wigner matrices obeying Condition C1 for a sufficiently large C0 , provided that the entries match that of GUE to fourth order. Actually, since the asymptotic (7.2) applies at scale n1/3 rather than at scale 1, it is possible to modify the arguments to reduce the amount of moment matching required, that one only needs the entries to match GUE to third order (which in particular subsumes the case when the distribution is symmetric); see [114]. More recently, Johansson [65] established (7.2) for gauss divisible Wigner ensembles obeying Condition C1with the optimal decay condition C0 = 4. Combining this result with the three moment theorem (and noting that any Wigner matrix can be matched up to order three with a gauss divisible matrix, see Lemma 7.6 below), one can then obtain (7.2) for any Wigner matrix obeying Condition C1for sufficiently large C0 . An independent proof of this claim (which also applied to generalized Wigner matrix models in which the variance of the entries was nonconstant) was also established in [44]. Finally, it was shown very recently in [71] that there is a version of the Four Moment Theorem for the edge that only requires two matching moments, which allows one to establish the Tracy-Widom law for all Wigner matrices obeying Condition C0. 7.2. Universality of the sine kernel for Hermitian Wigner matrices. We now turn to the question of the extent to which the asymptotic (3.7), which as(k) serts that the normalised k-point correlation functions ρn,u converge to the universal (k) limit ρSine , can be extended to more general Wigner ensembles. A long-standing conjecture of Wigner, Dyson, and Mehta (see e.g. [80]) asserts (informally speaking) that (3.7) is valid for all fixed k, all Wigner matrices and all fixed energy levels −2 < u < 2 in the bulk. (They also make the same conjecture for random symmetric and random symplectic matrices.) However, to make this conjecture precise one has to specify the nature of convergence in (3.7). For GUE, the convergence is quite strong (in the local uniform sense), but one cannot expect such strong convergence (k) in general, particularly in the case of discrete ensembles in which ρn,u is a discrete probability distribution (i.e. a linear combination of Dirac masses) and thus is unable to converge uniformly or pointwise to the continuous limiting distribution (k) ρSine . We will thus instead settle for the weaker notion of vague convergence. More

UNIVERSALITY FOR WIGNER ENSEMBLES

151

precisely, we say that (3.7) holds in the vague sense if one has lim

(7.3)

n→∞



Rk

F (x1 , . . . , xk )ρ(k) n,u (x1 , . . . , xk ) dx1 . . . dxk (k)

= Rk

F (x1 , . . . , xk )ρSine (x1 , . . . , xk ) dx1 . . . dxk

P for all continuous, compactly supported functions F : Rk → R. By the StoneWeierstrass theorem we may take F to be a test function (i.e. smooth and compactly supported) without loss of generality. Remark 7.4. Vague convergence is not the only notion of convergence studied in the literature. Another commonly studied notion of convergence is averaged vague convergence, in which one averages over the energy parameter u as well, thus replacing (7.3) with the weaker claim that 1 b→0 n→∞ 2b



E+b



lim lim

(7.4)

E−b



Rk

F (x1 , . . . , xk )ρ(k) n,u (x1 , . . . , xk ) dx1 . . . dxk du (k)

= Rk

F (x1 , . . . , xk )ρSine (x1 , . . . , xk ) dx1 . . . dxk

for all −2 < E < 2. It can be argued (as was done in [42]) that as the original conjectures of Wigner, Dyson, and Mehta did not precisely specify the nature of convergence, that any one of these notions of convergence would be equally valid for the purposes of claiming a proof of the conjecture. However, the distinction between such notions is not purely a technical one, as certain applications of the Wigner-Dyson-Mehta conjecture are only available if the convergence notion is strong enough. Consider for instance the gap problem of determining the probability that there is no eigenvalue in the interval [−t/2n, t/2n] (in other words the distribution of the least singular value). This is an important problem which in the GUE case was studied in [63] and discussed in length in a number of important books in the field, including Mehta’s (see [80, Chapter 5]), Deift’s (see [20, Section 5.4]), Deift-Gioev’s (see [21, Section 4.2]) and Anderson-Guionnet-Zeitouni (see [1, Section 3.5]), Forrester’s (see [48, Section 9.6]) and the Oxford handbook of matrix theory edited by Akemann et. al. (see [58, Section 4.6] by Anderson). This distribution can (k) be determined from the correlation functions ρn,0 at the energy level u = 0 by a standard combinatorial argument. In particular, if one has the Wigner-DysonMehta conjecture in the sense of vague convergence, one can make the limiting law in [63] universal over the class of matrices for which that conjecture is verified; see Corollary 7.12 below. However, if one only knows the Wigner-Dyson-Mehta conjecture in the sense of averaged vague convergence (7.4) instead of vague convergence (7.3), one cannot make this conclusion, because the distributional information at u = 0 is lost in the averaging process. See also Theorem 7.11 for a further example of a statistic which can be controlled by the vague convergence version of the Wigner-Dyson-Mehta conjecture, but not the averaged vague convergence version.

152

TERENCE TAO AND VAN VU

For this reason, it is our opinion that a solution of the Wigner-Dyson-Mehta conjecture in the averaged vague convergence sense should be viewed as an important partial resolution to that conjecture, but one that falls short of a complete solution to that conjecture26 . As a consequence of the heat flow and swapping techniques, the Wigner-DysonMehta conjecture for (Hermitian) Wigner matrices is largely resolved in the vague convergence category: Theorem 7.5. Let Mn be a Wigner matrix obeying Condition C1 for a sufficiently large absolute constant C0 which matches moments with GUE to second order, and let −2 < u < 2 and k ≥ 1 be fixed. Then (3.7) holds in the vague sense. This theorem has been established as a result of a long sequence of partial results towards the Wigner-Dyson-Mehta conjecture [34–36, 43, 44, 64, 108, 110], which we will summarise (in a slightly non-chronological order) below; see Remark 7.7 for a more detailed discussion of the chronology. The precise result stated above was first proven explicitly in [110], but relies heavily on the previous works just cited. As recalled in Section 3, the asymptotic (3.7) for GUE (in the sense of locally uniform convergence, which is far stronger than vague convergence) follows as a consequence of the Gaudin-Mehta formula and the Plancherel-Rotach asymptotics for Hermite polynomials27 . The next major breakthrough was by Johansson [64], who, as discussed in Section 4, establshed (3.7) for gauss divisible ensembles at some fixed time parameter t > 0 independent of n, obtained (3.7) in the vague sense (in fact, the slightly stronger convergence of weak convergence was established in that paper, in which the function F in (7.3) was allowed to merely be L∞ and compactly supported, rather than continuous and compactly supported). The main tool used in [64] was an explicit determinantal formula for the correlation functions in the gauss divisible case, essentially due to Br´ezin and Hikami [13]. In Johansson’s result, the time parameter t > 0 had to be independent of n. It was realized by Erd˝ os, Ramirez, Schlein, and Yau that one could obtain many further cases of the Wigner-Dyson-Mehta conjecture if one could extend Johansson’s result to much shorter times t that decayed at a polynomial rate in n. This was first achieved (again in the context of weak convergence) for t > n−3/4+ε for an arbitrary fixed ε > 0 in [34], and then to the essentially optimal case t > n−1+ε (for weak convergence, and (implicitly) in the local L1 sense as well) in [35]. By combining this with the method of reverse heat flow discussed in Section 5, the asymptotic (3.7) (again in the sense of weak convergence) was established for all Wigner matrices whose distribution obeyed certain smoothness conditions (e.g. when k = 2 one needs a C 6 type condition), and also decayed exponentially. The methods used in [35] were an extension of those in [64], combined with an approximation argument (the “method of time reversal”) that approximated a continuous distribution by a 26 In particular, we view this conjecture as only partially resolved in the real symmetric and symplectic cases, as opposed to the Hermitian cases, because the available results in that setting are either only in the averaged vague convergence sense, or require some additional moment matching hypotheses on the coefficients. Completing the proof of the Wigner-Dyson-Mehta conjecture in these categories remains an interesting future direction of research. 
27 Analogous results are known for much wider classes of invariant random matrix ensembles, see e.g. [19], [86], [11]. However, we will not discuss these results further here, as they do not directly impact on the case of Wigner ensembles.

UNIVERSALITY FOR WIGNER ENSEMBLES

153

gauss divisible one (with a small value of t); the arguments in [34] are based instead on an analysis of the Dyson Brownian motion. Note from the eigenvalue rigidity property (2.13) that only a small number of eigenvalues (at most no(1) or so) make28 a significant contribution to the normalised (k) correlation function ρn,u on any fixed compact set, and any fixed u. Because of this, the Four Moment Theorem (Theorem 5.2) can be used to show that29 if one Wigner matrix Mn obeyed the asymptotics (3.7) in the vague sense, then any other Wigner matrix Mn that matched Mn to fourth order would also obey (3.7) in the vague sense, assuming that Mn , Mn both obeyed Condition C1 for a sufficiently large C0 (so that the eigenvalue rigidity and four moment theorems are applicable). By combining the above observation with the moment matching lemma presented below, one immediately concludes Theorem 7.5 assuming that the offdiagonal atom distributions are supported on at least three points. Lemma 7.6 (Moment matching lemma). Let ξ be a real random variable with mean zero, variance one, finite fourth moment, and which is supported on at least three points. Then there exists a gauss divisible, exponentially decaying real random variable ξ that matches ξ to fourth order. For a proof of this elementary lemma, see [108, Lemma 28]. The requirement of support on at least three points is necessary; indeed, if ξ is supported in just two points a, b, then E(ξ − a)2 (ξ − b)2 = 0, and so any other distribution that matches ξ to fourth order must also be supported on a, b and thus cannot be gauss divisible. To remove the requirement that the atom distributions be supported on at least three points, one can use the observation from Proposition 5.3 that one only needs the moments of Mn and Mn to approximately match to fourth order in order to be able to transfer results on the distribution of spectra of Mn to that of Mn . In particular, if t = n−1+ε for some small ε > 0, then the Ornstein-Uhlenbeck flow Mnt of Mn by time t is already close enough to matching the first four moments of Mn to apply Proposition 5.3. The results of [35] give the asymptotic (3.7) for Mnt , and the eigenvalue rigidity property (2.13) then allows one to transfer this property to Mn , giving Theorem 7.5. Remark 7.7. The above presentation (drawn from the most recent paper [110]) is somewhat ahistorical, as the arguments used above emerged from a sequence of papers, which obtained partial results using the best technology available at the time. In the paper [108], where the first version of the Four Moment Theorem was introduced, the asymptotic (3.7) was established under the additional assumptions of Condition C0, and matching the GUE to fourth order30 ; the former hypothesis was due to the weaker form of the four moment theorem known at 28 Strictly speaking, the results in [44] only establish the eigenvalue rigidity property (2.13) assuming Condition C0. However, by using the Four Moment Theorem one can relax this to Condition C1 for a sufficiently large C0 , at the cost of making (2.13) hold only with high probability rather than overwhelming probability. 29 Very recently, it was observed in [42] that if one uses the Green’s function Four Moment Theorem from [43] in place of the earlier eigenvalue Four Moment Theorem from [108] at this juncture, then one can reach the same conclusion here without the need to invoke the eigenvalue rigidity theorem, thus providing a further simplification to the argument. 
30 In [108] it was claimed that one only needed a matching condition of GUE to third order, but this was not rigorously proven in that paper due to an issue with the Three Moment Theorem that we discuss in [114].

154

TERENCE TAO AND VAN VU

the time, and the latter was due to the fact that the eigenvalue rigidity result (2.13) was not yet established (and was instead deduced from the results of Gustavsson [57] combined with the Four Moment Theorem, thus necessitating the matching moment hypothesis). For related reasons, the paper in [36] (which first introduced the use of Proposition 5.3) was only able to establish (3.7) after an additional averaging in the energy parameter u (and with Condition C0). The subsequent progress in [40] via heat flow methods gave an alternate approach to establishing (3.7), but also required an averaging in the energy and a hypothesis that the atom distributions be supported on at least three points, although the latter condition was then removed in [44]. In a very recent paper [33], the exponent C0 in Condition C1 has been relaxed to as low as 4 + ε for any fixed ε > 0, though still at the cost of averaging in the energy parameter. Some generalisations in other directions (e.g. to covariance matrices, or to generalised Wigner ensembles with non-constant variances) were also established in [8], [109], [41], [43], [44], [32], [33], [125], [88], [89], [42]. For instance, it was very recently observed in [42] that by using a variant of the argument from [110] (replacing the asymptotic four moment theorem for eigenvalues by an asymptotic four moment theorem for the Green function) together with a careful inspection of the arguments in [35] and invoking some results from [43], one can extend Theorem 7.5 to generalised Wigner ensembles in which the entries are allowed to have variable variance (subject to some additional hypotheses); see [42] for details. To close this section, we remark that while Theorem 7.5 is the “right” result for discrete Wigner ensembles (except for the large value of C0 in Condition C1, which in view of the results in [33] should be reducible to 4 + ε), one expects stronger notions of convergence when one has more smoothness hypotheses on the atom distribution; in particular, one should have local uniform convergence of the correlation functions when the distribution is smooth enough. Some very recent progress in this direction in the k = 1 case was obtained by Maltsev and Schlein [78], [79]. For results concerning symmetric31 and symplectic random matrices, we refer to [31] and the references therein. In these cases, the Wigner-Dyson-Mehta conjecture is established in the context of averaged vague convergence; the case of vague convergence remains open even for gauss divisible distributions (although in the case when four moments agree with GOE, one can recover the vague convergence version of the conjecture thanks to the Four Moment Theorem). This appears to be an inherent limitation to the relaxation flow method, at least with the current state of technology. In the Hermitian case, this limitation is overcome by using the explicit formulae of Brezin-Hikami and Johansson (see [64]), but no such tool appears to be available at present in the symmetric and symplectic cases. 
It remains an interesting future direction of research to find a way to overcome this limitation and obtain a complete solution to the Wigner-Dyson-Mehta conjecture in these cases, as this could lead to a number of interesting applications, such as the determination of the asymptotic distribution of the least singular value of the adjacency matrix of an Erd˝os-Renyi graph, which remains open currently despite the significant partial progress (see [33]) on the Wigner-Dyson-Mehta conjecture 31 With our conventions, the symmetric case is not a sub-case of the Hermitian case, because the matrix would now be required to match GOE to second order, rather than GUE to second order.

UNIVERSALITY FOR WIGNER ENSEMBLES

155

for such matrices (though see the recent papers [81], [120] for some lower bounds relating to this problem). 7.3. Distribution of the gaps. In Section 1.2 the averaged gap distribution Sn (s, u, tn ) :=

1 |{1 ≤ i ≤ n : |λi (An ) − nu| ≤ tn /ρsc (u); λi+1 (An ) − λi (An ) ≤ s/ρsc (u)}. tn

was defined for a given energy level −2 < u < 2 and a scale window tn with 1/tn , tn /n both going to zero as n → ∞. For GUE, it is known that the expected value ESn (s, u, tn ) of this distribution converges as n → ∞ (keeping u, s fixed) to the (cumulative) Gaudin distribution (3.9); see [19]. Informally, this result asserts that a typical normalised gap ρsc (u)(λi+1 (An )−λi (An )), where λsc i (Wn ) = u+o(1), is asymptotically distributed on the average according to the Gaudin distribution. The eigenvalue gap distribution has received much attention in the mathematics community, partially thanks to the fascinating (numerical) coincidence with the gap distribution of the zeros of the zeta functions. For more discussions, we refer to [20, 22, 67] and the references therein. 32 It is possible to use an inclusion-exclusion argument to deduce information about ESn (s, u, tn ) from information on the k-point correlation functions; see e.g. [20], [19], [64], [35], [36]. In particular, one can establish universality of the Gaudin distribution for the Wigner matrix ensembles considered in those papers, sometimes assuming additional hypotheses due to the averaging of the correlation function in the u parameter; for instance, in [36] universality is shown for all Wigner matrices obeying Condition C0, assuming that tn /n decays very slowly to zero, by utilising the universality of the averaged k-point correlation function established in that paper. A slightly different approach proceeds by expressing the moments of Sn (s, u, tn ) in terms of the joint distribution of the eigenvalues. Observe that ESn (s, u, tn ) =

n 1  E1|λi (An )−nu|≤tn /ρsc (u) 1λi+1 (An )−λi (An )≤s/ρsc (u) . tn i=1

Replacing the sharp cutoffs in the expectation by smoothed out versions, and applying Theorem 5.2 (and also using (2.13) to effectively localise the i summation to about O(tn )+O(no(1) ) indices), we see that if Mn , Mn have four matching moments and both obey Condition C1 for a sufficiently large C0 , then one has ESn (s, u, tn ) ≤ ESn (s + o(1), u, (1 + o(1))tn ) + o(1) and similarly ESn (s, u, tn ) ≥ ESn (s − o(1), u, (1 − o(1))tn ) − o(1) for suitable choices of decaying quantities o(1). In particular, if Mn is known to exhibit Gaudin asymptotics (3.9) for any −2 < u < 2 and any tn with 1/tn , tn /n = o(1), then Mn does as well. In [64], the Gaudin asymptotics (3.9) were established for gauss divisible ensembles (with fixed time parameter t) and any tn with 1/tn , tn /n = o(1), and thus by Lemma 7.6 and the above argument we conclude that (3.9) also holds for Wigner matrices obeying Condition C1 for a sufficiently large C0 whose off-diagonal atom distributions are supported on at least three points. 32 We would like to thank P. Sarnak for enlightening conversations regarding the gap distribution and for constantly encouraging us to work on the universality problem.

156

TERENCE TAO AND VAN VU

This last condition can be removed by using Proposition 5.3 as in the previous section (using the results in [35] instead of [64]), thus giving (3.9) with no hypotheses on the Wigner matrix other than Condition C1 for a sufficiently large C0 . Remark 7.8. This argument appeared previously in [108], but at the time the eigenvalue rigidity result (2.13) was not available, and the Four Moment Theorem required Condition C0 instead of Condition C1, so the statement was weaker, requiring both Condition C0 and a vanishing third moment hypothesis (for the same reason as in Remark 7.7). By averaging over all u (effectively setting tn = n), the need for eigenvalue rigidity could be avoided in [108] (but note that the formulation of the averaged universality for the eigenvalue gap in [108, (5)] is not quite correct33 as stated, due to the absence of the normalisation by ρsc (u)). Remark 7.9. A similar argument also applies to higher moments ESn (s, u, tn )k of Sn (s, u, tn ), and so in principle one can also use the moment method to obtain universal statistics for the full distribution of Sn (s, u, tn ), and not just the expectation. However, this argument would require extending the results in [19], [64], or [35] to control the distribution (and not just the expectation) of Sn (s, u, tn ) for GUE, gauss divisible matrices (with fixed time parameter), or gauss divisible matrices (with t = n−1+ε ) respectively. Remark 7.10. It is natural to ask whether the averaging over the window tn can be dispensed with entirely. Indeed, one expects that the distribution of the individual normalised eigenvalue gaps ρsc (u)(λi+1 (An )−λi (An )), where −2 < u < 2 is fixed and λsc i (Wn ) = u + o(1), should asymptotically converge to the Gaudin distribution in the vague topology, without any averaging in i. The Four moment theorem allows one to readily deduce such a fact for general Wigner ensembles once one has established it for special ensembles such as GUE or gauss divisible ensembles (and to remove all moment matching and support hypotheses on the Wigner ensemble, one would need to treat gauss divisible ensembles with time parameter equal to a negative power of n). However, control of the individual normalised eigenvalue gaps in the bulk are not presently in the literature, even for GUE, though they are in principle obtainable from determinantal process methods. 7.4. Universality of the counting function and gap probability. Recall for Section 3 that one has an asymptotic (3.8) for the number of eigenvalues of a fine-scale normalised GUE matrix for an interval I := [nu + a/ρsc (u), nu + b/ρsc (u)] in the bulk, where −2 < u < 2 and a, b ∈ R are fixed independently of n; this asymptotic is controlled by the spectral of the Dyson sine kernel KSine on the interval [a, b]. In fact one can allow u to vary in n, so long as it lies inside an interval [−2 + ε, 2 − ε] for some fixed ε > 0. 33 More

precisely, instead of

1 |{1 ≤ i ≤ n : xi+1 − xi ≤ s}|, n the gap distribution should instead be expressed as 1 s }|, Sn (s; x) := |{1 ≤ i ≤ n : xi+1 − xi ≤ n ρsc (ti ) Sn (s; x) :=

where ti ∈ [−2, 2] is the classical location of the ith eigenvalue:  ti i ρsc (x) dx := . n −2

UNIVERSALITY FOR WIGNER ENSEMBLES

157

Using the four moment theorem, one can extend this asymptotic to more general Wigner ensembles. Theorem 7.11 (Asymptotic for NI ). Consider a Wigner matrix satisfying Condition C1 for a sufficiently large constant C0 . Let ε > 0 and a < b be independent of n. For any n, let u = un be an element of [−2 + ε, 2 − ε]. Then the asymptotic (3.8) (in the sense of convergence in distribution) holds. Proof. See [110, Theorem 8]. The basic idea is to use the moment method, combining Theorem 7.5 with the identity   NI 1 ρ(k) (x1 , . . . , xk ) dx1 . . . dxk E = k k! [a,b] n,u for any k ≥ 1. We remark that the method also allows one to control the asymptotic joint distribution of several intervals NI1 (An ), . . . , NIk (An ); for instance, one can show that NI (An ) and NJ (An ) are asymptotically uncorrelated if I, J have bounded lengths, lie in the bulk, and have separation tending to infinity as n → ∞. We omit the details.  Specialising to the case u = 0, one can obtain universal behaviour √ for √the probability that Mn has no eigenvalue in an interval of the form (−t/2 n, t/2 n) for any fixed t: Corollary 7.12. For any fixed t > 0, and Mn satisfy the conditions of Theorem 7.11, one has t f (x) 1 √ dx) Mn has no eigenvalues in (−t/2n, t/2n)) → exp( P( x n 0 as n → ∞, where f : R → R is the solution of the differential equation (tf

)2 + 4(tf − f )(tf − f + (f )2 ) = 0 with the asymptotics f (t) =

−t π



t2 π2



t3 π3

+ O(t4 ) as t → 0.

Proof. In the case of GUE, this was established in [1, Theorem 3.1.2], [63]. The general case then follows from Theorem 7.11. A weaker extension was established in [108], assuming matching moments with GUE to fourth order as well as Condition C0.  This corollary is the Hermitian version of the Goldstine-Von Neumann least singular value problem (see [30, 91, 94, 105, 123] for more details). Recently, some additional bounds on the least singular value of Wigner matrices were established in [81], [120]. The above universality results concerned intervals whose length was comparable to the fine-scale eigenvalue spacing (which, using the fine-scale normalisation An , is 1/ρsc (u)). One can also use the Four Moment Theorem to obtain similar universality results for the counting function on much larger scales, such as NI (Wn ), where I := [y, +∞) with y ∈ (−2, 2) in the bulk. For instance, in [57] the mean and variance of this statistic for GUE matrices was computed as  log n  E[NI (Wn )] = n ρsc + O n I (7.5)  1  + o(1) log n. Var(NI (Wn )) = 2π 2

158

TERENCE TAO AND VAN VU

Combining this with a general central limit theorem for determinantal processes due to Costin and Lebowitz [14], one obtains (7.6) and hence (7.7)

NI (Wn ) − E[NI (Wn )]  → N (0, 1). n→∞ Var(NI (Wn ))  NI (Wn ) − n I ρsc ! → N (0, 1) n→∞ 1 log n 2 2π

This result was extended to more general Wigner ensembles: Theorem 7.13 (Central limit theorem for NI (Wn )). Let Mn be a Wigner matrix obeying Condition C0 which matches the corresponding entries of GUE up to order 4. Let y ∈ (−2, 2) be fixed, and set I := [y, +∞). Then the asymptotic (7.7) hold (in the sense of probability distributions), as do the mean and variance bounds (7.5). Proof. See [18, Theorem 2]. The first component of this theorem is established using the Four Moment Theorem; the second part also uses the eigenvalue rigidity estimate (2.13).  7.5. Distribution of the eigenvectors. Let Mn be a matrix drawn from the gaussian orthogonal ensemble (GOE). Then the eigenvalues λi (Mn ) are almost surely simple, and the unit eigenvectors ui (Mn ) ∈ S n−1 ⊂ Rn−1 are well-defined up to sign. To deal with this sign ambiguity, let us select each eigenvector ui (Mn ) independetly and uniformly at random among the two possible choices. Since the GOE ensemble is invariant under orthogonal transformations, the eigenvectors ui (Mn ) must each be uniformly distributed on√the unit sphere. √ As is well known, this implies that the normalised coefficients nui,j (Mn ) := nui (Mn )e∗j of these eigenvectors are asymptotically normally distributed; see [61], [62] for more precise statements in this direction. By combining the results of [61], [62] with the four moment theorem for eigenvectors (Theorem 5.5), one can extend this fact to other Wigner ensembles in [107]. Here is a typical result: Theorem 7.14. Let Mn be a random real symmetric matrix obeying hypothesis C1 for a sufficiently large constant C0 , which matches GOE to fourth order. Assume furthermore that the atom distributions of Mn are symmetric (i.e. ξij ≡ −ξij for all 1 ≤ i, j ≤ n). Let i = in be an index (or more precisely, a sequence of indices) between 1 and n, and let a = an ∈ S n−1 be a unit vector in Rn (or more precisely, a sequence of unit vectors). For each i, let ui (Mn ) ∈ S n−1 √ be chosen randomly among all unit eigenvectors with eigenvalue λi (Mn ). Then nui (Mn ) · a tends to N (0, 1)R in distribution as n → ∞. Proof. See [107, Theorem 13].



As an example to illustrate Theorem 7.14, we can take a = an := √1n (1, . . . , 1) ∈ S , and i := n/2!. Then Theorem 7.14 asserts that the sum of the entries of the middle eigenvector u n/2 (Mn ) is gaussian in the limit. n−1

UNIVERSALITY FOR WIGNER ENSEMBLES

159

7.6. Central limit theorem for log-determinant. One of most natural and important matrix functionals is the determinant. As such, the study of determinants of random matrices has a long and rich history. The earlier papers on this study focused on the determinant det An of the non-Hermitian iid model An , where the entries ζij of the matrix were independent random variables with mean 0 and variance 1 [23, 49, 52, 53, 55, 73, 74, 83, 90, 93, 111, 119] (see [82] for a brief survey of these results). See also the early paper of Dixon [29] treating random matrix models in which the distribution is invariant with respect to rotation of the individual rows. In [55], Goodman considered random gaussian matrices An = (ζij )1≤i,j≤n where the atom variables ζij are iid standard real gaussian variables, ζij ≡ N (0, 1)R . He noticed that in this case the square of the determinant can be expressed as the product of independent chi-square variables. Therefore, its logarithm is the sum of independent variables and thus one expects a central limit theorem to hold. In fact, using properties of the chi-square distribution, it is not hard to prove34 (7.8)

log(| det An |) − ! 1 2

1 2

log n! +

1 2

log n

→ N (0, 1)R ,

log n

where N (0, 1)R denotes the law of the real gaussian with mean 0 and variance 1. A similar analysis (but with the real chi distribution replaced by a complex chi distribution) also works for complex gaussian matrices, in which ζij remain jointly independent but now have the distribution of the complex gaussian N (0, 1)C (or equivalently, the real and imaginary parts of ζij are independent and have the distribution of N (0, 12 )R ). In that case, one has a slightly different law (7.9)

log(| det An |) − ! 1 4

1 2

log n! +

1 4

log n

→ N (0, 1)R .

log n

We turn now to real iid matrices, in which the ζij are jointly independent and real with mean zero and variance one. In [52], Girko stated that (7.8) holds for such random matrices under the additional assumption that the fourth moment of the atom variables is 3. Twenty years later, he claimed a much stronger result which replaced the above assumption by the assumption that the atom variables have bounded (4 + δ)-th moment [53]. However, there are several points which are not clear in these papers. Recently, Nguyen and the second author [82] gave a new proof for (7.8). Their approach also results in an estimate for the rate of convergence and is easily extended to handle to complex case. The analysis of the above random determinants relies crucially on the fact that the rows of the matrix are jointly independent. This independence no longer holds for Hermitian random matrix models, which makes the analysis of determinants of Hermitian random matrices more challenging. Even showing that the determinant of a random symmetric Bernoulli matrix is non-zero (almost surely) was a long standing open question by Weiss in the 1980s and solved only five years ago [15]; for more recent developments see [108, Theorem 31], [16], [81], [120]. These results give good lower bound for the absolute value of the determinant or the least singular number, but do not reveal any information about the distribution. 34 Here

and in the sequel, → denotes convergence in distribution.

160

TERENCE TAO AND VAN VU

Even in the GUE case, it is highly non-trivial to prove an analogue of the central limit theorem (7.9). Notice that the observation of Goodman does not apply due to the dependence between the rows and so it is not even clear why a central limit theorem must hold for the log-determinant. In [28], Delannay and Le Caer made use of the explicit distribution of GUE and GOE to prove the central limit theorem for these cases. While it does not seem to be possible to express the log-determinant of GUE as a sum of independent random variables, in [112], the authors found a way to approximate the log-determinant as a sum of weakly dependent terms, based on analysing a tridiagonal form of GUE due to Trotter [118]35 . Using stochastic calculus and the martingale central limit theorem, we gave another proof (see [112]) for the central limit theorem for GUE and GOE: Theorem 7.15 (Central limit theorem for log-determinant of GUE and GOE). [28] Let Mn be drawn from GUE. Then log | det(Mn )| − ! 1 2

1 2

log n! +

1 4

log n

→ N (0, 1)R .

log n

Similarly, if Mn is drawn from GOE rather than GUE, one has log | det(Mn )| − 12 log n! + √ log n

1 4

log n

→ N (0, 1)R .

The next task is to extend beyond the GUE or GOE case. Our main tool for this is a four moment theorem for log-determinants of Wigner matrices, analogous to Theorem 5.2. Theorem 7.16 (Four moment theorem for determinant). Let Mn , Mn be Wigner matrices whose atom distributions have independent real and imaginary parts that match to fourth order off the diagonal and to second order on the diagonal, are bounded by nO(c0 ) for some sufficiently small but fixed c0 > 0, and are supported on at least three points. Let G : R → R obey the derivative estimates (7.10)

|

dj G(x)| = O(nc0 ) dxj

√ for 0 ≤ j ≤ 5. Let z0 = E + −1η0 be a complex number with |E| ≤ 2 − δ for some fixed δ > 0. Then √ √ EG(log | det(Mn − nz0 )|) − EG(log | det(Mn − nz0 )|) = O(n−c ) for some fixed c > 0, adopting the convention that G(−∞) = 0. The requirements that Mn , Mn be supported on at least three points, and that E lie in the bulk region |E| < 2−δ are artificial, due to the state of current literature on level repulsion estimates. It is likely that with further progress on those estimates that these hypotheses can be removed. The hypothesis that the atom distributions have independent real and imaginary parts is mostly for notational convenience and can also be removed with some additional effort. By combining Theorem 7.16 with Theorem 7.15 we obtain 35 We

would like to thank R. Killip for suggesting the use of Trotter’s form.

UNIVERSALITY FOR WIGNER ENSEMBLES

161

Corollary 7.17 (Central limit theorem for log-determinant of Wigner matrices). Let Mn be a Wigner matrix whose atom distributions ζij are independent of n, have real and imaginary parts that are independent and match GUE to fourth order, and obey Condition C1 for some sufficiently large C0 . Then log | det(Mn )| − ! 1 2

1 2

log n! +

1 4

log n

→ N (0, 1)R .

log n

If Mn matches GOE instead of GUE, then one instead has log | det(Mn )| − 12 log n! + √ log n

1 4

log n

→ N (0, 1)R .

The deduction of this proposition from Theorem 7.16 and Theorem 7.15 is standard (closely analogous, for instance, to the proof the central limit theorem for individual eigenvalues) and is omitted. (Notice that in order for the atom variables of Mn match those of GUE to fourth order, these variables most have at least three points in their supports.) 7.7. Concentration of eigenvalues. We first discuss the case of the Gaussian Unitary Ensemble (GUE), which is the best understood case, as the joint distribution of the eigenvalues is given by a determinantal point process. Because of this, it is known that for any interval I, the random variable NI (Wn ) in the GUE case obeys a law of the form (7.11)

NI (Wn ) ≡

∞ 

ηi

i=1

where the ηi = ηi,n,I are jointly independent indicator random variables (i.e. they take values in {0, 1}); see e.g. [1, Corollary 4.2.24]. The mean and variance of NI (Wn ) can also be computed in the GUE case with a high degree of accuracy: Theorem 7.18 (Mean and variance for GUE). [57] Let Mn be drawn from GUE, let Wn := √1n Mn , and let I = [−∞, x] for some real number x (which may depend on n). Let ε > 0 be independent of n. (i) (Bulk case) If x ∈ [−2 + ε, 2 − ε], then log n ). ENI (Wn ) = n ρsc (y) dy + O( n I (ii) (Edge case) If x ∈ [−2, 2], then ENI (Wn ) = n ρsc (y) dy + O(1). I

(iii) (Variance bound) If one has x ∈ [−2, 2 − ε] and n2/3 (2 + x) → ∞ as n → ∞, one has 1 + o(1)) log(n(2 + x)3/2 ). 2π 2 In particular, one has VarNI (Wn ) = O(log n) in this regime. VarNI (Wn ) = (

By combining these estimates with a well-known inequality of Bennett [9] (see [113] for details) we obtain a concentration estimate for NI (Wn ) in the GUE case:

162

TERENCE TAO AND VAN VU

Corollary 7.19 (Concentration for GUE). Let $M_n$ be drawn from GUE, let $W_n := \frac{1}{\sqrt{n}} M_n$, and let $I$ be an interval. Then one has
$$\mathbf{P}\Big( \Big| N_I(W_n) - n \int_I \rho_{sc}(y)\,dy \Big| \ge T \Big) \ll \exp(-cT)$$
for all $T \gg \log n$.

From the above corollary we see in particular that in the GUE case, one has
$$N_I(W_n) = n \int_I \rho_{sc}(y)\,dy + O(\log^{1+o(1)} n)$$
with overwhelming probability for each fixed $I$, and an easy union bound argument (ranging over all intervals $I$ in, say, $[-3,3]$ whose endpoints are a multiple of $n^{-100}$ (say)) then shows that this is also true uniformly in $I$ as well.

Now we turn from the GUE case to more general Wigner ensembles. As already mentioned, there has been much interest in recent years in obtaining concentration results for $N_I(W_n)$ (and for closely related objects, such as the Stieltjes transform $s_{W_n}(z) := \frac{1}{n}\,\mathrm{trace}(W_n - z)^{-1}$ of $W_n$) for short intervals $I$, due to the applicability of such results to establishing various universality properties of such matrices; see [37-40, 43, 44, 103, 108]. The previous best result in this direction was by Erdős, Yau, and Yin [44] (see also [32] for a variant):

Theorem 7.20 ([44]). Let $M_n$ be a Wigner matrix obeying Condition C0, and let $W_n := \frac{1}{\sqrt{n}} M_n$. Then, for any interval $I$, one has
$$\mathbf{P}\Big( \Big| N_I(W_n) - n \int_I \rho_{sc}(y)\,dy \Big| \ge T \Big) \ll \exp(-cT^c) \qquad (7.12)$$
for all $T \ge \log^{A\log\log n} n$, and some constant $A > 0$.

One can reformulate (7.12) equivalently as the assertion that
$$\mathbf{P}\Big( \Big| N_I(W_n) - n \int_I \rho_{sc}(y)\,dy \Big| \ge T \Big) \ll \exp(\log^{O(\log\log n)} n) \exp(-cT^c)$$
for all $T > 0$. In particular, this theorem asserts that with overwhelming probability one has
$$N_I(W_n) = n \int_I \rho_{sc}(y)\,dy + O(\log^{O(\log\log n)} n)$$
for all intervals $I$. The proof of the above theorem is somewhat lengthy, requiring a delicate analysis of the self-consistent equation of the Stieltjes transform of $W_n$. Comparing this result with the previous results for the GUE case, we see that there is a loss of a double logarithm $\log\log n$ in the exponent. It has turned out that using the swapping method one can remove this double logarithmic loss, at least under an additional vanishing moment assumption:^{36}

^{36} We thank M. Ledoux for a conversation leading to this study.

Theorem 7.21 (Improved concentration of eigenvalues [113]). Let $M_n$ be a Wigner matrix obeying Condition C0, and let $W_n := \frac{1}{\sqrt{n}} M_n$. Assume that $M_n$ matches moments with GUE to third order off the diagonal (i.e. $\mathrm{Re}\,\xi_{ij}$, $\mathrm{Im}\,\xi_{ij}$ have variance $1/2$ and third moment zero). Then, for any interval $I$, one has
$$\mathbf{P}\Big( \Big| N_I(W_n) - n \int_I \rho_{sc}(y)\,dy \Big| \ge T \Big) \ll n^{O(1)} \exp(-cT^c)$$


for any $T > 0$. This estimate is phrased for any $T$, but the bound only becomes non-trivial when $T \gg \log^C n$ for some sufficiently large $C$. In that regime, we see that this result removes the double-logarithmic factor from Theorem 7.20. In particular, this theorem implies that with overwhelming probability one has
$$N_I(W_n) = n \int_I \rho_{sc}(y)\,dy + O(\log^{O(1)} n)$$
for all intervals $I$; in particular, for any $I$, $N_I(W_n)$ has variance $O(\log^{O(1)} n)$.

Remark 7.22. As we are assuming $\mathrm{Re}(\xi_{ij})$ and $\mathrm{Im}(\xi_{ij})$ to be independent, the moment matching condition simplifies to the constraints that $\mathbf{E}\,\mathrm{Re}(\xi_{ij})^2 = \mathbf{E}\,\mathrm{Im}(\xi_{ij})^2 = \frac{1}{2}$ and $\mathbf{E}\,\mathrm{Re}(\xi_{ij})^3 = \mathbf{E}\,\mathrm{Im}(\xi_{ij})^3 = 0$. However, it is possible to extend this theorem to the case when the real and imaginary parts of $\xi_{ij}$ are not independent.

Remark 7.23. The constant $c$ in the bound in Theorem 7.21 is quite decent in several cases. For instance, if the atom variables of $M_n$ are Bernoulli or have sub-gaussian tails, then we can set $c = 2/5 - o(1)$ by optimizing our arguments (details omitted). If we assume four matching moments rather than three, then we can set $c = 1$, matching the bound in Corollary 7.19. It is an interesting question to determine the best value of $c$. The value of $c$ in [43] is implicit and rather small.

The proof of the above theorem is different from that in [44] in that it only uses a relatively crude analysis of the self-consistent equation to obtain some preliminary bounds on the Stieltjes transform and on $N_I$ (which were also essentially implicit in the previous literature). Instead, the bulk of the argument relies on using the Lindeberg swapping strategy to deduce concentration of $N_I(W_n)$ in the non-GUE case from the concentration results in the GUE case provided by Corollary 7.19. In order to keep the error terms in this swapping under control, three matching moments are needed.^{37} We need one less moment here because we are working at "mesoscopic" scales (in which the number of eigenvalues involved is much larger than 1) rather than at "microscopic" scales. Very roughly speaking, the main idea of the argument is to show that high moments such as
$$\mathbf{E}\Big| N_I(W_n) - n \int_I \rho_{sc}(y)\,dy \Big|^k$$
are quite stable (in a multiplicative sense) if one swaps (the real or imaginary part of) one of the entries of $W_n$ (and its adjoint) with another random variable that matches the moments of the original entry to third order. For technical reasons, however, we do not quite manipulate $N_I(W_n)$ directly, but instead work with a proxy for this quantity, namely a certain integral of the Stieltjes transform of $W_n$. As observed in [43], the Lindeberg swapping argument is quite simple to implement at the level of the Stieltjes transform (due to the simplicity of the resolvent identities, when compared against the rather complicated Taylor expansions of individual eigenvalues used in [108]). A similar application of the Lindeberg swapping argument to high moments was also recently performed in [72] to control coefficients of the resolvent in arbitrary directions.

^{37} Compare with Theorem 5.2.
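To make this proxy concrete: the empirical Stieltjes transform is directly computable from the spectrum, and at mesoscopic spectral parameters $z = E + i\eta$ (with $\eta$ well above the microscopic scale $1/n$) it is already close to the semicircle transform $s_{sc}(z) = \frac{-z + \sqrt{z^2 - 4}}{2}$, with the branch of the square root chosen so that $\mathrm{Im}\, s_{sc}(z) > 0$ when $\mathrm{Im}\, z > 0$. A small sketch of ours, assuming numpy:

    import numpy as np

    # Sketch: empirical Stieltjes transform of one GUE sample versus the
    # semicircle prediction, at a few mesoscopic spectral parameters.
    rng = np.random.default_rng(2)
    n = 1000
    A = (rng.normal(scale=np.sqrt(0.5), size=(n, n))
         + 1j * rng.normal(scale=np.sqrt(0.5), size=(n, n)))
    W = (A + A.conj().T) / np.sqrt(2) / np.sqrt(n)
    eigs = np.linalg.eigvalsh(W)

    def s_sc(z):
        # Pick the square-root branch with Im s_sc(z) > 0 for Im z > 0.
        r = np.sqrt(z * z - 4)
        m = (-z + r) / 2
        return m if m.imag > 0 else (-z - r) / 2

    for z in (0.5 + 0.05j, 1.5 + 0.05j, 3.0 + 0.05j):
        s_emp = np.mean(1.0 / (eigs - z))   # (1/n) trace (W_n - z)^{-1}
        print(z, s_emp, s_sc(z))            # the two values should nearly agree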


As a corollary of Theorem 7.21, we obtain the following rigidity of eigenvalues result, improving upon (2.13) when one has a matching moment hypothesis:

Corollary 7.24 (Concentration of eigenvalues). Let $M_n$ be a Wigner matrix obeying Condition C0, and let $W_n := \frac{1}{\sqrt{n}} M_n$. Assume that $M_n$ matches moments with GUE to third order off the diagonal and second order on the diagonal. Then for any $i$ in the bulk one has
$$\mathbf{P}\big(|\lambda_i(W_n) - \gamma_i| \ge T/n\big) \ll n^{O(1)} \exp(-cT^c)$$
for any $T > 0$, where the classical location $\gamma_i \in [-2,2]$ is defined by the formula
$$\int_{-2}^{\gamma_i} \rho_{sc}(y)\,dy = \frac{i}{n}.$$

This corollary improves [44, Theorem 2.2], as it allows $T$ to be as small as $\log^{O(1)} n$, instead of $\log^{O(\log\log n)} n$, under the extra third moment assumption. In particular, in the Bernoulli case, this shows that the variance of the bulk eigenvalues is of order $\log^{O(1)} n / n^2$. We believe that this is sharp, up to the hidden constant in the $O(1)$. This corollary also significantly improves [108, Theorem 29]. (As a matter of fact, the original proof of that theorem has a gap in it; see Appendix A for a further discussion.) One can obtain analogous results for the edge case, under the four moment assumption; see [113] for details.
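The classical locations are explicitly computable, since $\int_{-2}^{x}\rho_{sc}(y)\,dy = \frac{1}{2} + \frac{x\sqrt{4-x^2}}{4\pi} + \frac{1}{\pi}\arcsin\frac{x}{2}$. The following sketch of ours (assuming numpy) inverts this numerically and compares the $\gamma_i$ with the ordered spectrum of one GUE sample:

    import numpy as np

    # Sketch: classical locations gamma_i versus one GUE sample.
    def semicircle_cdf(x):
        x = np.clip(x, -2.0, 2.0)
        return 0.5 + x * np.sqrt(4 - x * x) / (4 * np.pi) + np.arcsin(x / 2) / np.pi

    n = 500
    grid = np.linspace(-2, 2, 20001)
    # gamma_i solves F(gamma_i) = i/n; invert F by interpolation on a fine grid.
    gamma = np.interp(np.arange(1, n + 1) / n, semicircle_cdf(grid), grid)

    rng = np.random.default_rng(3)
    A = (rng.normal(scale=np.sqrt(0.5), size=(n, n))
         + 1j * rng.normal(scale=np.sqrt(0.5), size=(n, n)))
    W = (A + A.conj().T) / np.sqrt(2) / np.sqrt(n)
    lam = np.linalg.eigvalsh(W)                 # sorted eigenvalues of W_n
    # Bulk rigidity: the deviations below should be of size log^{O(1)} n / n.
    print(np.max(np.abs(lam - gamma)[n // 10 : 9 * n // 10]))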

8. Open questions

While the universality of many spectral statistics of Wigner matrices has now been established, there are still several open questions remaining. Some of these have already been raised in earlier sections; we collect some further such questions in this section.

In one direction, one can continue generalising the class of matrices for which the universality result holds, for instance by lowering the exponent $C_0$ in Condition C1, allowing the entries $\xi_{ij}$ of the Wigner matrix to have different variances $\sigma_{ij}^2$, or allowing the Wigner matrix to be quite sparse. For recent work in these directions, see [32], [33], [43], [44]. With regards to the different variances case, one key assumption that is still needed for existing arguments to work is a spectral gap hypothesis, namely that the matrix of variances $(\sigma_{ij}^2)_{1 \le i,j \le n}$ has a significant gap between its largest eigenvalue and its second largest one; in addition, for the most complete results one also needs the variances to be bounded away from zero. This omits some interesting classes of Wigner-type matrices, such as those with large blocks of zeroes. However, the spectral statistics of $(p+n) \times (p+n)$ matrices of the form
$$\begin{pmatrix} 0 & M \\ M^* & 0 \end{pmatrix}$$
for rectangular $p \times n$ matrices $M$ with iid entries are well understood, as the problem is equivalent to that of understanding the singular values of $M$ (or the eigenvalues of the covariance matrix $MM^*$); in particular, analogues of the key tools discussed here (such as the four moment theorem, the local semicircle law, and heat flow methods) are known [109], [41]; this suggests that other block-type variants of Wigner matrices could be analysed by these methods. A related problem


would be to understand the spectral properties of various self-adjoint polynomial combinations of random matrices, e.g. the commutator $AB - BA$ of two Wigner matrices $A$, $B$. The global coarse-scale nature of the spectrum for such matrices can be analysed by the tools of free probability [122], but there are still very few rigorous results for the local theory.

Another direction of generalisation is to consider generalised Wigner matrices whose entries have non-zero mean, or equivalently to consider the spectral properties of a random matrix $M_n + D_n$ that is the sum of an ordinary Wigner matrix $M_n$ and a deterministic Hermitian matrix $D_n$. Large portions of the theory seem amenable to extension in this direction, although the global and local semicircular laws would need to be replaced by a more complicated variant (in particular, the semicircular distribution $\rho_{sc}$ should be replaced by the free convolution of $\rho_{sc}$ with the empirical spectral distribution of $D_n$; see [121, 122]).

In yet another direction, one could consider non-Hermitian analogues of these problems, for instance by considering the statistics of eigenvalues of iid random matrices (in which the entries are not constrained to be Hermitian, but are instead independent and identically distributed). The analogue of the semicircular law in this setting is the circular law, which has been analysed intensively in recent years (see [102] for a survey). There are in fact a number of close connections between the Hermitian and non-Hermitian ensembles, and so it is likely that progress in the former can be applied to some extent to the latter.

Another natural question is to see whether the universality theory for Wigner ensembles can be unified in some way with the older, but very extensively developed, universality theory for invariant ensembles (as covered for instance in [20]). Significant progress in this direction has recently been achieved in [12], in which heat flow methods are adapted to show that the local spectral statistics of β-ensembles are asymptotically independent of the choice of potential function (assuming some analyticity conditions on the potential). This reduces the problem to the gaussian case when the potential is quadratic, which can be handled by existing methods, similarly to how the methods discussed here reduce the statistics of general Wigner matrices to those of invariant ensembles such as GUE or GOE. Note though that these techniques do not provide an independent explanation as to why these invariant ensembles have the limiting statistics they do (e.g. governed by the sine determinantal process in the bulk, and the Airy determinantal process at the edge, in the case of GUE); for that, one still needs to rely on the theory of determinantal processes.

Returning now to Wigner matrices, one of the major limitations of the methods discussed here is the heavy reliance on the hypothesis that the (upper-triangular) entries are jointly independent; even weak coupling between entries makes many of the existing methods (such as the swapping technique used in the Four Moment Theorem, or the use of identities such as (2.9)) break down. A good test case would be the asymptotic statistics of the adjacency matrices of random regular graphs, where the fixed degree $d$ is a constant multiple of $n$, such as $n/2$. This is essentially equivalent to a Wigner matrix model (such as the real symmetric Bernoulli matrix ensemble) in which the row and column sums have been constrained to be zero. For this model, the global semicircular law and eigenvector delocalisation have recently been established; see [24], [117].
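One can experiment with this model numerically: sample a random $d$-regular graph, shift its adjacency matrix to a symmetric $\pm 1$ matrix with zero diagonal (whose rows then all sum to the constant $2d - (n-1)$, which is $1$ when $d = n/2$), and compare the empirical spectral distribution with the semicircle. A sketch of ours, assuming numpy and networkx:

    import numpy as np
    import networkx as nx

    # Sketch: semicircle law for the shifted adjacency matrix of a random
    # d-regular graph with d = n/2; cf. [24], [117]. (Sampling a dense
    # regular graph with networkx may be slow for large n.)
    n, d = 1000, 500
    G = nx.random_regular_graph(d, n, seed=4)
    A = nx.to_numpy_array(G)
    B = 2 * A - (np.ones((n, n)) - np.eye(n))   # off-diagonal entries +-1, zero diagonal
    eigs = np.linalg.eigvalsh(B / np.sqrt(n))
    hist, edges = np.histogram(eigs, bins=40, range=(-2.2, 2.2), density=True)
    mids = (edges[:-1] + edges[1:]) / 2
    rho = np.sqrt(np.clip(4 - mids ** 2, 0, None)) / (2 * np.pi)
    print(np.max(np.abs(hist - rho)))           # roughly small, shrinking with n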


Recall that the central limit theorem for the log-determinant of non-Hermitian matrices requires only two matching moments. In the Hermitian case, however, Corollary 7.17 requires four matching moments. We believe that this requirement can be weakened; for instance, the central limit theorem should hold for random Bernoulli matrices. Another interesting problem is to determine the distribution of bulk eigenvalues of a random Bernoulli matrix. This matrix has only three matching moments, but perhaps a central limit theorem like Theorem 7.1 also holds here. We have proved [113] that the variance of any bulk eigenvalue is $\log^{O(1)} n / n^2$. A good first step would be to determine the right value of the hidden constant in the $O(1)$.
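One can at least probe the first conjecture numerically. The following exploratory sketch of ours (assuming numpy) forms the GOE-normalised log-determinant statistic of Corollary 7.17 for symmetric Bernoulli matrices; if the conjecture holds, it should be asymptotically standard normal:

    import numpy as np

    # Exploratory sketch: the GOE-type log-determinant statistic for random
    # symmetric Bernoulli matrices (a numerical probe, not a proof).
    rng = np.random.default_rng(5)
    n, trials = 200, 2000
    log_n_fact = np.sum(np.log(np.arange(1, n + 1)))
    stats = []
    for t in range(trials):
        X = rng.choice([-1.0, 1.0], size=(n, n))
        M = np.triu(X) + np.triu(X, 1).T        # symmetric matrix with +-1 entries
        sign, logabsdet = np.linalg.slogdet(M)
        if sign != 0:                           # discard the rare singular samples
            stats.append((logabsdet - 0.5 * log_n_fact + 0.25 * np.log(n))
                         / np.sqrt(np.log(n)))
    stats = np.asarray(stats)
    print(stats.mean(), stats.var())            # conjecturally near 0 and 1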

References

[1] Greg W. Anderson, Alice Guionnet, and Ofer Zeitouni, An introduction to random matrices, Cambridge Studies in Advanced Mathematics, vol. 118, Cambridge University Press, Cambridge, 2010. MR2760897
[2] Z. D. Bai, Convergence rate of expected spectral distributions of large random matrices. I. Wigner matrices, Ann. Probab. 21 (1993), no. 2, 625–648. MR1217559
[3] Z. D. Bai, Convergence rate of expected spectral distributions of large random matrices. II. Sample covariance matrices, Ann. Probab. 21 (1993), no. 2, 649–672. MR1217560
[4] Jinho Baik, Percy Deift, and Kurt Johansson, On the distribution of the length of the longest increasing subsequence of random permutations, J. Amer. Math. Soc. 12 (1999), no. 4, 1119–1178. MR1682248
[5] D. Bakry and Michel Émery, Diffusions hypercontractives (French), Séminaire de probabilités, XIX, 1983/84, Lecture Notes in Math., vol. 1123, Springer, Berlin, 1985, pp. 177–206. MR889476
[6] Z. D. Bai and J. Silverstein, Spectral analysis of large dimensional random matrices, Mathematics Monograph Series 2, Science Press, Beijing, 2006.
[7] Z. D. Bai and Y. Q. Yin, Necessary and sufficient conditions for almost sure convergence of the largest eigenvalue of a Wigner matrix, Ann. Probab. 16 (1988), no. 4, 1729–1741. MR958213
[8] G. Ben Arous and S. Péché, Universality of local eigenvalue statistics for some sample covariance matrices, Comm. Pure Appl. Math. 58 (2005), no. 10, 1316–1357. MR2162782
[9] G. Bennett, Probability inequalities for the sum of independent random variables, J. Amer. Statist. Assoc. 57 (1962), 33–45.
[10] Andrew C. Berry, The accuracy of the Gaussian approximation to the sum of independent variates, Trans. Amer. Math. Soc. 49 (1941), 122–136. MR0003498
[11] Pavel Bleher and Alexander Its, Semiclassical asymptotics of orthogonal polynomials, Riemann-Hilbert problem, and universality in the matrix model, Ann. of Math. (2) 150 (1999), no. 1, 185–266. MR1715324
[12] Paul Bourgade, László Erdős, and Horng-Tzer Yau, Bulk universality of general β-ensembles with non-convex potential, J. Math. Phys. 53 (2012), no. 9, 095221. MR2905803
[13] E. Brézin and S. Hikami, Level spacing of random matrices in an external source, Phys. Rev. E (3) 58 (1998), no. 6, 7176–7185. MR1662382
[14] O. Costin and J. Lebowitz, Gaussian fluctuations in random matrices, Phys. Rev. Lett. 75 (1995), no. 1, 69–72.
[15] Kevin P. Costello, Terence Tao, and Van Vu, Random symmetric matrices are almost surely nonsingular, Duke Math. J. 135 (2006), no. 2, 395–413. MR2267289
[16] Kevin P. Costello, Bilinear and quadratic variants on the Littlewood-Offord problem, Israel J. Math. 194 (2013), no. 1, 359–394. MR3047075


[17] Sourav Chatterjee, A generalization of the Lindeberg principle, Ann. Probab. 34 (2006), no. 6, 2061–2076. MR2294976
[18] Sandrine Dallaporta and Van Vu, A note on the central limit theorem for the eigenvalue counting function of Wigner matrices, Electron. Commun. Probab. 16 (2011), 314–322. MR2819655
[19] P. Deift, T. Kriecherbauer, K. T.-R. McLaughlin, S. Venakides, and X. Zhou, Uniform asymptotics for polynomials orthogonal with respect to varying exponential weights and applications to universality questions in random matrix theory, Comm. Pure Appl. Math. 52 (1999), no. 11, 1335–1425. MR1702716
[20] P. A. Deift, Orthogonal polynomials and random matrices: a Riemann-Hilbert approach, Courant Lecture Notes in Mathematics, vol. 3, New York University Courant Institute of Mathematical Sciences, New York, 1999. MR1677884
[21] Percy Deift and Dimitri Gioev, Random matrix theory: invariant ensembles and universality, Courant Lecture Notes in Mathematics, vol. 18, Courant Institute of Mathematical Sciences, New York, 2009. MR2514781
[22] Percy Deift, Universality for mathematical and physical systems, International Congress of Mathematicians, Vol. I, Eur. Math. Soc., Zürich, 2007, pp. 125–152. MR2334189
[23] A. Dembo, On random determinants, Quart. Appl. Math. 47 (1989), no. 2, 185–195. MR998095
[24] Ioana Dumitriu and Soumik Pal, Sparse regular random graphs: spectral density and eigenvectors, Ann. Probab. 40 (2012), no. 5, 2197–2235. MR3025715
[25] H. Doering and P. Eichelsbacher, Moderate deviations for the eigenvalue counting function of Wigner matrices, arXiv:1104.0221.
[26] Freeman J. Dyson, The threefold way. Algebraic structure of symmetry groups and ensembles in quantum mechanics, J. Mathematical Phys. 3 (1962), 1199–1215. MR0177643
[27] Freeman J. Dyson, Correlations between eigenvalues of a random matrix, Comm. Math. Phys. 19 (1970), 235–250. MR0278668
[28] R. Delannay and G. Le Caër, Distribution of the determinant of a random real-symmetric matrix from the Gaussian orthogonal ensemble, Phys. Rev. E (3) 62 (2000), no. 2, 1526–1536. MR1797664
[29] John D. Dixon, How good is Hadamard's inequality for determinants?, Canad. Math. Bull. 27 (1984), no. 3, 260–264. MR749630
[30] Alan Edelman, Eigenvalues and condition numbers of random matrices, SIAM J. Matrix Anal. Appl. 9 (1988), no. 4, 543–560. MR964668
[31] L. Erdős, Universality of Wigner random matrices: a survey of recent results (Russian), Uspekhi Mat. Nauk 66 (2011), no. 3(399), 67–198; English transl., Russian Math. Surveys 66 (2011), no. 3, 507–626. MR2859190
[32] László Erdős, Antti Knowles, Horng-Tzer Yau, and Jun Yin, Spectral statistics of Erdős-Rényi graphs I: Local semicircle law, Ann. Probab. 41 (2013), no. 3B, 2279–2375. MR3098073
[33] László Erdős, Antti Knowles, Horng-Tzer Yau, and Jun Yin, Spectral statistics of Erdős-Rényi graphs II: Eigenvalue spacing and the extreme eigenvalues, Comm. Math. Phys. 314 (2012), no. 3, 587–640. MR2964770
[34] László Erdős, José A. Ramírez, Benjamin Schlein, and Horng-Tzer Yau, Universality of sine-kernel for Wigner matrices with a small Gaussian perturbation, Electron. J. Probab. 15 (2010), no. 18, 526–603. MR2639734
[35] László Erdős, Sandrine Péché, José A. Ramírez, Benjamin Schlein, and Horng-Tzer Yau, Bulk universality for Wigner matrices, Comm. Pure Appl. Math. 63 (2010), no. 7, 895–925. MR2662426
[36] László Erdős, José Ramírez, Benjamin Schlein, Terence Tao, Van Vu, and Horng-Tzer Yau, Bulk universality for Wigner Hermitian matrices with subexponential decay, Math. Res. Lett. 17 (2010), no. 4, 667–674. MR2661171


[37] László Erdős, Benjamin Schlein, and Horng-Tzer Yau, Semicircle law on short scales and delocalization of eigenvectors for Wigner random matrices, Ann. Probab. 37 (2009), no. 3, 815–852. MR2537522
[38] László Erdős, Benjamin Schlein, and Horng-Tzer Yau, Local semicircle law and complete delocalization for Wigner random matrices, Comm. Math. Phys. 287 (2009), no. 2, 641–655. MR2481753
[39] László Erdős, Benjamin Schlein, and Horng-Tzer Yau, Wegner estimate and level repulsion for Wigner random matrices, Int. Math. Res. Not. IMRN 3 (2010), 436–479. MR2587574
[40] László Erdős, Benjamin Schlein, and Horng-Tzer Yau, Universality of random matrices and local relaxation flow, Invent. Math. 185 (2011), no. 1, 75–119. MR2810797
[41] László Erdős, Benjamin Schlein, and Horng-Tzer Yau, Universality of random matrices and local relaxation flow, Invent. Math. 185 (2011), no. 1, 75–119. MR2810797
[42] László Erdős and Horng-Tzer Yau, A comment on the Wigner-Dyson-Mehta bulk universality conjecture for Wigner matrices, Electron. J. Probab. 17 (2012), no. 28, 5. MR2915664
[43] László Erdős, Horng-Tzer Yau, and Jun Yin, Bulk universality for generalized Wigner matrices, Probab. Theory Related Fields 154 (2012), no. 1-2, 341–407. MR2981427
[44] László Erdős, Horng-Tzer Yau, and Jun Yin, Rigidity of eigenvalues of generalized Wigner matrices, Adv. Math. 229 (2012), no. 3, 1435–1515. MR2871147
[45] Carl-Gustav Esseen, On the Liapounoff limit of error in the theory of probability, Ark. Mat. Astr. Fys. 28A (1942), no. 9, 19. MR0011909
[46] Ohad N. Feldheim and Sasha Sodin, A universality result for the smallest eigenvalues of certain sample covariance matrices, Geom. Funct. Anal. 20 (2010), no. 1, 88–123. MR2647136
[47] Peter J. Forrester and Eric M. Rains, Interrelationships between orthogonal, unitary and symplectic matrix ensembles, Random matrix models and their applications, Math. Sci. Res. Inst. Publ., vol. 40, Cambridge Univ. Press, Cambridge, 2001, pp. 171–207. MR1842786
[48] P. J. Forrester, Log-gases and random matrices, London Mathematical Society Monographs Series, vol. 34, Princeton University Press, Princeton, NJ, 2010. MR2641363
[49] G. E. Forsythe and J. W. Tukey, The extent of n random unit vectors, Bull. Amer. Math. Soc. 58 (1952), 502.
[50] Z. Füredi and J. Komlós, The eigenvalues of random symmetric matrices, Combinatorica 1 (1981), no. 3, 233–241. MR637828
[51] Jean Ginibre, Statistical ensembles of complex, quaternion, and real matrices, J. Mathematical Phys. 6 (1965), 440–449. MR0173726
[52] V. L. Girko, A central limit theorem for random determinants (Russian), Teor. Veroyatnost. i Primenen. 24 (1979), no. 4, 728–740. MR550529
[53] V. L. Girko, A refinement of the central limit theorem for random determinants (Russian), Teor. Veroyatnost. i Primenen. 42 (1997), no. 1, 63–73; English transl., Theory Probab. Appl. 42 (1997), no. 1, 121–129 (1998). MR1453330
[54] V. L. Girko, Theory of random determinants, Mathematics and its Applications (Soviet Series), vol. 45, Kluwer Academic Publishers Group, Dordrecht, 1990. Translated from the Russian. MR1080966
[55] N. R. Goodman, The distribution of the determinant of a complex Wishart distributed matrix, Ann. Math. Statist. 34 (1963), 178–180. MR0145619
[56] A. Guionnet, Grandes matrices aléatoires et théorèmes d'universalité, Séminaire Bourbaki, 62ème année, 2009–2010, no. 1019, Avril 2010.
[57] Jonas Gustavsson, Gaussian fluctuations of eigenvalues in the GUE, Ann. Inst. H. Poincaré Probab. Statist. 41 (2005), no. 2, 151–178. MR2124079


[58] The Oxford handbook of random matrix theory, edited by G. Akemann, J. Baik, and P. Di Francesco, Oxford University Press, Oxford, 2011.
[59] Harish-Chandra, Differential operators on a semisimple Lie algebra, Amer. J. Math. 79 (1957), 87–120. MR0084104
[60] J. Ben Hough, Manjunath Krishnapur, Yuval Peres, and Bálint Virág, Determinantal processes and independence, Probab. Surv. 3 (2006), 206–229. MR2216966
[61] Tiefeng Jiang, How many entries of a typical orthogonal matrix can be approximated by independent normals?, Ann. Probab. 34 (2006), no. 4, 1497–1529. MR2257653
[62] Tiefeng Jiang, The entries of Haar-invariant matrices from the classical compact groups, J. Theoret. Probab. 23 (2010), no. 4, 1227–1243. MR2735744
[63] Michio Jimbo, Tetsuji Miwa, Yasuko Môri, and Mikio Sato, Density matrix of an impenetrable Bose gas and the fifth Painlevé transcendent, Phys. D 1 (1980), no. 1, 80–158. MR573370
[64] Kurt Johansson, Universality of the local spacing distribution in certain ensembles of Hermitian Wigner matrices, Comm. Math. Phys. 215 (2001), no. 3, 683–705. MR1810949
[65] Kurt Johansson, Universality for certain Hermitian Wigner matrices under weak moment conditions, Ann. Inst. Henri Poincaré Probab. Stat. 48 (2012), no. 1, 47–79. MR2919198
[66] Jeff Kahn, János Komlós, and Endre Szemerédi, On the probability that a random ±1-matrix is singular, J. Amer. Math. Soc. 8 (1995), no. 1, 223–240. MR1260107
[67] Nicholas M. Katz and Peter Sarnak, Random matrices, Frobenius eigenvalues, and monodromy, American Mathematical Society Colloquium Publications, vol. 45, American Mathematical Society, Providence, RI, 1999. MR1659828
[68] J. P. Keating and N. C. Snaith, Random matrix theory and ζ(1/2 + it), Comm. Math. Phys. 214 (2000), no. 1, 57–89. MR1794265
[69] Oleksiy Khorunzhiy, High moments of large Wigner random matrices and asymptotic properties of the spectral norm, Random Oper. Stoch. Equ. 20 (2012), no. 1, 25–68. MR2899796
[70] Rowan Killip, Gaussian fluctuations for β ensembles, Int. Math. Res. Not. IMRN 8 (2008), Art. ID rnn007, 19. MR2428142
[71] Antti Knowles and Jun Yin, Eigenvector distribution of Wigner matrices, Probab. Theory Related Fields 155 (2013), no. 3-4, 543–582. MR3034787
[72] A. Knowles and J. Yin, The isotropic semicircle law and deformation of Wigner matrices, arXiv:1110.6449.
[73] J. Komlós, On the determinant of (0, 1) matrices, Studia Sci. Math. Hungar. 2 (1967), 7–21. MR0221962
[74] J. Komlós, On the determinant of random matrices, Studia Sci. Math. Hungar. 3 (1968), 387–399. MR0238371
[75] I. V. Krasovsky, Correlations of the characteristic polynomials in the Gaussian unitary ensemble or a singular Hankel determinant, Duke Math. J. 139 (2007), no. 3, 581–619. MR2350854
[76] Michel Ledoux, The concentration of measure phenomenon, Mathematical Surveys and Monographs, vol. 89, American Mathematical Society, Providence, RI, 2001. MR1849347
[77] J. W. Lindeberg, Eine neue Herleitung des Exponentialgesetzes in der Wahrscheinlichkeitsrechnung (German), Math. Z. 15 (1922), no. 1, 211–225. MR1544569
[78] Anna Maltsev and Benjamin Schlein, Average density of states of Hermitian Wigner matrices, Adv. Math. 228 (2011), no. 5, 2797–2836. MR2838059
[79] Anna Maltsev and Benjamin Schlein, A Wegner estimate for Wigner matrices, Entropy and the quantum II, Contemp. Math., vol. 552, Amer. Math. Soc., Providence, RI, 2011, pp. 145–160. MR2868046


[80] M. L. Mehta, Random matrices and the statistical theory of energy levels, Academic Press, New York, 1967. MR0220494
[81] Hoi H. Nguyen, On the least singular value of random symmetric matrices, Electron. J. Probab. 17 (2012), no. 53, 19. MR2955045
[82] H. Nguyen and V. Vu, Random matrix: Law of the determinant, to appear in Ann. Probab.
[83] H. Nyquist, S. O. Rice, and J. Riordan, The distribution of random determinants, Quart. Appl. Math. 12 (1954), 97–104. MR0063591
[84] Sean O'Rourke, Gaussian fluctuations of eigenvalues in Wigner random matrices, J. Stat. Phys. 138 (2010), no. 6, 1045–1066. MR2601422
[85] L. A. Pastur, The spectrum of random matrices (Russian), Teoret. Mat. Fiz. 10 (1972), no. 1, 102–112. MR0475502
[86] L. Pastur and M. Shcherbina, Universality of the local eigenvalue statistics for a class of unitary invariant random matrix ensembles, J. Statist. Phys. 86 (1997), no. 1-2, 109–147. MR1435193
[87] Sandrine Péché and Alexander Soshnikov, Wigner random matrices with non-symmetrically distributed entries, J. Stat. Phys. 129 (2007), no. 5-6, 857–884. MR2363385
[88] N. Pillai and J. Yin, Universality of covariance matrices, arXiv:1110.2501.
[89] Natesh S. Pillai and Jun Yin, Edge universality of correlation matrices, Ann. Statist. 40 (2012), no. 3, 1737–1763. MR3015042
[90] A. Prékopa, On random determinants. I, Studia Sci. Math. Hungar. 2 (1967), 125–132. MR0211439
[91] Mark Rudelson and Roman Vershynin, The Littlewood-Offord problem and invertibility of random matrices, Adv. Math. 218 (2008), no. 2, 600–633. MR2407948
[92] Alain Rouault, Asymptotic behavior of random determinants in the Laguerre, Gram and Jacobi ensembles, ALEA Lat. Am. J. Probab. Math. Stat. 3 (2007), 181–230. MR2365642
[93] G. Szekeres and P. Turán, On an extremal problem in the theory of determinants, Math. Naturwiss. Anz. Ungar. Akad. Wiss. 56 (1937), 796–806.
[94] Mark Rudelson and Roman Vershynin, The least singular value of a random square matrix is O(n^{-1/2}), C. R. Math. Acad. Sci. Paris 346 (2008), no. 15-16, 893–896. MR2441928
[95] A. Ruzmaikina, Universality of the edge distribution of eigenvalues of Wigner random matrices with polynomially decaying distributions of entries, Comm. Math. Phys. 261 (2006), no. 2, 277–296. MR2191882
[96] Benjamin Schlein, Spectral properties of Wigner matrices, Mathematical results in quantum physics, World Sci. Publ., Hackensack, NJ, 2011, pp. 79–94. MR2885161
[97] Ya. Sinai and A. Soshnikov, Central limit theorem for traces of large random symmetric matrices with independent matrix elements, Bol. Soc. Brasil. Mat. (N.S.) 29 (1998), no. 1, 1–24. MR1620151
[98] Ya. G. Sinaĭ and A. B. Soshnikov, A refinement of Wigner's semicircle law in a neighborhood of the spectrum edge for random symmetric matrices (Russian), Funktsional. Anal. i Prilozhen. 32 (1998), no. 2, 56–79, 96; English transl., Funct. Anal. Appl. 32 (1998), no. 2, 114–131. MR1647832
[99] Alexander Soshnikov, Universality at the edge of the spectrum in Wigner random matrices, Comm. Math. Phys. 207 (1999), no. 3, 697–733. MR1727234
[100] Alexander Soshnikov, Gaussian limit for determinantal random point fields, Ann. Probab. 30 (2002), no. 1, 171–187. MR1894104
[101] G. Szegő, On certain Hermitian forms associated with the Fourier series of a positive function, Comm. Sém. Math. Univ. Lund [Medd. Lunds Univ. Mat. Sem.] 1952 (1952), Tome Supplémentaire, 228–238. MR0051961
[102] Terence Tao and Van Vu, From the Littlewood-Offord problem to the circular law: universality of the spectral distribution of random matrices, Bull. Amer. Math. Soc. (N.S.) 46 (2009), no. 3, 377–396. MR2507275


[103] Terence Tao and Van Vu, Random matrices: universality of local eigenvalue statistics up to the edge, Comm. Math. Phys. 298 (2010), no. 2, 549–572. MR2669449
[104] Terence Tao and Van Vu, Random matrices: universality of ESDs and the circular law, Ann. Probab. 38 (2010), no. 5, 2023–2065. With an appendix by Manjunath Krishnapur. MR2722794
[105] Terence Tao and Van Vu, Random matrices: the distribution of the smallest singular values, Geom. Funct. Anal. 20 (2010), no. 1, 260–297. MR2647142
[106] Terence Tao and Van Vu, Random matrices: localization of the eigenvalues and the necessity of four moments, Acta Math. Vietnam. 36 (2011), no. 2, 431–449. MR2908537
[107] Terence Tao and Van Vu, Random matrices: universal properties of eigenvectors, Random Matrices Theory Appl. 1 (2012), no. 1, 1150001, 27. MR2930379
[108] Terence Tao and Van Vu, Random matrices: universality of local eigenvalue statistics, Acta Math. 206 (2011), no. 1, 127–204. MR2784665
[109] Terence Tao and Van Vu, Random covariance matrices: universality of local statistics of eigenvalues, Ann. Probab. 40 (2012), no. 3, 1285–1315. MR2962092
[110] Terence Tao and Van Vu, The Wigner-Dyson-Mehta bulk universality conjecture for Wigner matrices, Electron. J. Probab. 16 (2011), no. 77, 2104–2121. MR2851058
[111] Terence Tao and Van Vu, On random ±1 matrices: singularity and determinant, Random Structures Algorithms 28 (2006), no. 1, 1–23. MR2187480
[112] Terence Tao and Van Vu, A central limit theorem for the determinant of a Wigner matrix, Adv. Math. 231 (2012), no. 1, 74–101. MR2935384
[113] T. Tao and V. Vu, Random matrices: Sharp concentration of eigenvalues, preprint, http://arxiv.org/pdf/1201.4789.pdf.
[114] T. Tao and V. Vu, Some errata, appendix to http://arxiv.org/abs/1202.0068v1.
[115] Craig A. Tracy and Harold Widom, On orthogonal and symplectic matrix ensembles, Comm. Math. Phys. 177 (1996), no. 3, 727–754. MR1385083
[116] Craig A. Tracy and Harold Widom, Distribution functions for largest eigenvalues and their applications, Proceedings of the International Congress of Mathematicians, Vol. I (Beijing, 2002), Higher Ed. Press, Beijing, 2002, pp. 587–596. MR1989209
[117] Linh V. Tran, Van H. Vu, and Ke Wang, Sparse random graphs: eigenvalues and eigenvectors, Random Structures Algorithms 42 (2013), no. 1, 110–134. MR2999215
[118] Hale F. Trotter, Eigenvalue distributions of large Hermitian matrices; Wigner's semicircle law and a theorem of Kac, Murdock, and Szegő, Adv. in Math. 54 (1984), no. 1, 67–82. MR761763
[119] P. Turán, On a problem in the theory of determinants (Chinese, with English summary), Acta Math. Sinica 5 (1955), 411–423. MR0073555
[120] R. Vershynin, Invertibility of symmetric random matrices, to appear in Random Structures Algorithms.
[121] Dan Voiculescu, Addition of certain noncommuting random variables, J. Funct. Anal. 66 (1986), no. 3, 323–346. MR839105
[122] Dan Voiculescu, Limit laws for random matrices and free products, Invent. Math. 104 (1991), no. 1, 201–220. MR1094052
[123] John von Neumann and H. H. Goldstine, Numerical inverting of matrices of high order, Bull. Amer. Math. Soc. 53 (1947), 1021–1099. MR0024235
[124] Van H. Vu, Spectral norm of random matrices, Combinatorica 27 (2007), no. 6, 721–736. MR2384414
[125] Ke Wang, Random covariance matrices: universality of local statistics of eigenvalues up to the edge, Random Matrices Theory Appl. 1 (2012), no. 1, 1150005, 24. MR2930383


[126] Eugene P. Wigner, On the distribution of the roots of certain symmetric matrices, Ann. of Math. (2) 67 (1958), 325–327. MR0095527

Department of Mathematics, UCLA, Los Angeles, California 90095-1555
E-mail address: [email protected]

Department of Mathematics, Yale, New Haven, Connecticut 06520
E-mail address: [email protected]

Index

ψ2-condition, 86
∗-distribution, 24
∗-moment, 24
ε-net, 88
Airy decay, 60
Aldous-Steele objective method, 28
asymptotically free, 36, 39
Bernoulli ensembles, 122
bisection iteration, 67
Brown measure, 47
Brown spectral measure, 23
Brownian motion, 62
bulk, 63
Catalan numbers, 41
Cauchy ensemble, 25
Cauchy transform, 22
central limit theorem, 28
characteristic polynomial, 14
Chi-squared test, 58
circulant matrices, 25
circular law, 6, 11
compressible and incompressible vectors, 93
concentration of measure, 19
condition number, 85
configuration model, 28
correlation functions, 10
Coulomb gas, 10, 13
Courant-Fischer variational formulas, 2
dependent entries, 26
determinantal process, 10
determinantal processes, 131
distribution of traffics, 38
doubly stochastic matrices, 26
Dyson Brownian motion, 134
Dyson Fokker-Planck equation, 134
eigenvalue spacing, 132
eigenvalues, 1
empirical distribution of the matrices, 38
empirical spectral distribution, 3
energy functional, 13
ergodic theorem, 26
essential least common denominator, 97
exchangeable entries, 26
first order global asymptotics, 5
Four Moment Theorem, 136
fourth moment, 7
free, 39
free probability, 23, 35, 72
freeness, 35
Frobenius norm, 3
Fuglede-Kadison determinant, 23
fundamental solution, 14
gap theorem, 140
Gaussian unitary ensemble, 8
ghosts and shadows, 72
  ghost Gaussian, 73
  ghost Haar distribution, 73
  ghost random variables, 73
  ghost Wishart matrix, 76
  shadows, 73
Ginibre ensemble, 8
Girko theorem, 6
GOE, 124
GUE, 124
Gumbel fluctuation, 13, 29
Haar measure, 58
Haar unitary, 28
Haar unitary matrices, 26
hard edge, 63
Harish-Chandra-Itzykson-Zuber integral, 76
heat flow, 133
heavy tails, 27
Hermite ensemble, 57
  GOE, 54
  GSE, 60
  GUE, 54, 60
Hermite polynomials, 131


Hermitization, 2, 15
Hilbert-Schmidt norm, 3
Hoeffding inequality, 19
Householder transformations, 58
invariant ensembles, 122
Jack polynomials, 73, 75
  Schur polynomials, 73, 75
  Zonal polynomials, 73
Kashin's theorem, 107
Kesten-McKay measure, 28
Khinchin's inequality, 88
Kolmogorov's backward equation, 71
Kostlan layers, 12
Laguerre ensemble, 57
Lanczos method, 60
Lapack, 59
large deviations principle, 13
Lévy concentration function, 98
linear statistics, 28
Littlewood–Offord problem, 21, 27, 96
local semi-circle law, 127
local universality, 29
log-concave distributions, 26
logarithmic potential, 14
logarithmic potential with external field, 13
Lyapunov exponent, 26
majorization, 16
majorizing measure theorem, 109
MANOVA matrices, 63
Marchenko-Pastur theorem, 5
Markov matrices, 26
matching moments, 139
Monte Carlo, 54
non-commutative laws, 38
non-normal matrices, 47
normal matrix, 2
operator norm, 2, 7
Ornstein-Uhlenbeck operator, 134
orthogonal polynomials, 11
outliers in the spectrum, 25
Paley–Zygmund inequality, 90
Poisson weighted infinite tree (PWIT), 28
propagation of chaos, 12
QR decomposition, 58
quarter circular law, 5
R-transform, 46
random matrix factorization, 57
  bidiagonal models, 60
  tridiagonal models, 60
random polynomials, 27
rate function, 13
regular graphs, 28
replacement principle, 22
resolvent, 22
Riccati diffusion, 70
S-transform, 46
Schur block inversion, 5
Schur norm, 3
Schur unitary triangularization, 2, 9
semi-circle law, 40
semi-circular law, 123
semi-circular variables, 40
semicircle law, 53, 54
short Khinchin inequality, 104
single ring theorem, 26
singular value decomposition (SVD), 76
singular values, 2, 85
soft edge, 63
sparsity, 27
spectral measure, 35
spectral norm, 2
spectral radius, 1, 7, 13
stability analysis, 126
Stieltjes transform, 22, 130
stochastic operators, 61
  stochastic Airy operator, 65
  stochastic Bessel operator, 65
  stochastic sine operator, 65
strong asymptotic freeness, 44
Sturm sequences, 68
Sturm-Liouville theory, 69
subgaussian random variables, 86
swapping methods, 133
Talagrand inequality, 19
trace norm, 3
Tracy-Widom distribution, 54
Tracy-Widom law, 150
tridiagonal matrices, 26
uniform integrability, 15
unitary matrices with Haar law, 43
universality, 5
universality of a limiting law, 131
Vandermonde determinant, 10
volumetric estimate, 89
Weyl inequalities, 3
white noise transformation, 63
Wiener process, 62
Wigner matrix, 122
Wishart matrix, 59
