Theoretical foundations of functional data analysis, with an introduction to linear operators 9780470016916

1,262 124 2MB

English Pages 364 Year 2015

Report DMCA / Copyright

DOWNLOAD FILE

Polecaj historie

Theoretical foundations of functional data analysis, with an introduction to linear operators
 9780470016916

Table of contents :
Cover......Page 1
Contents......Page 9
Preface......Page 13
Chapter 1 Introduction......Page 17
1.1 Multivariate analysis in a nutshell......Page 18
1.2 The path that lies ahead......Page 29
Chapter 2 Vector and function spaces......Page 31
2.1 Metric spaces......Page 32
2.2 Vector and normed spaces......Page 36
2.3 Banach and Lp spaces......Page 42
2.4 Inner Product and Hilbert spaces......Page 47
2.5 The projection theorem and orthogonal decomposition......Page 54
2.6 Vector integrals......Page 56
2.7 Reproducing kernel Hilbert spaces......Page 62
2.8 Sobolev spaces......Page 71
Chapter 3 Linear operator and functionals......Page 77
3.1 Operators......Page 78
3.2 Linear functionals......Page 82
3.3 Adjoint operator......Page 87
3.4 Nonnegative, square-root, and projection operators......Page 90
3.5 Operator inverses......Page 93
3.6 Fréchet and Gâteaux derivatives......Page 99
3.7 Generalized Gram-Schmidt decompositions......Page 103
Chapter 4 Compact operators and singular value decomposition......Page 107
4.1 Compact operators......Page 108
4.2 Eigenvalues of compact operators......Page 112
4.3 The singular value decomposition......Page 119
4.4 Hilbert-Schmidt operators......Page 123
4.5 Trace class operators......Page 129
4.6 Integral operators and Mercer's Theorem......Page 132
4.7 Operators on an RKHS......Page 139
4.8 Simultaneous diagonalization of two nonnegative definite operators......Page 142
5.1 Perturbation of self-adjoint compact operators......Page 145
5.2 Perturbation of general compact operators......Page 156
6.1 Functional linear model......Page 163
6.2 Penalized least squares estimators......Page 166
6.3 Bias and variance......Page 173
6.4 A computational formula......Page 174
6.5 Regularization parameter selection......Page 177
6.6 Splines......Page 181
Chapter 7 Random elements in a Hilbert space......Page 191
7.1 Probability measures on a Hilbert space......Page 192
7.2 Mean and covariance of a random element of a Hilbert space......Page 194
7.3 Mean-square continuous processes and the Karhunen-Lòeve Theorem......Page 200
7.4 Mean-square continuous processes in L2(E, B(E),μ)......Page 206
7.5 RKHS valued processes......Page 211
7.6 The closed span of a process......Page 214
7.7 Large sample theory......Page 219
Chapter 8 Mean and covariance estimation......Page 227
8.1 Sample mean and covariance operator......Page 228
8.2 Local linear estimation......Page 230
8.3 Penalized least-squares estimation......Page 247
Chapter 9 Principal components analysis......Page 267
9.1 Estimation via the sample covariance operator......Page 269
9.2 Estimation via local linear smoothing......Page 271
9.3 Estimation via penalized least squares......Page 277
Chapter 10 Canonical correlation analysis......Page 281
10.1 CCA for random elements of a Hilbert space......Page 283
10.2 Estimation......Page 290
10.3 Prediction and regression......Page 297
10.4 Factor analysis......Page 300
10.5 MANOVA and discriminant analysis......Page 304
10.6 Orthogonal subspaces and partial cca......Page 310
11.1 A functional regression model......Page 321
11.2 Asymptotic theory......Page 324
11.3 Minimax optimality......Page 334
11.4 Discretely sampled data......Page 337
References......Page 343
Index......Page 347
Notation Index......Page 350
Wiley Series in Probability and Statistics......Page 351
EULA......Page 364

Citation preview

Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators

WILEY SERIES IN PROBABILITY AND STATISTICS Established by WALTER A. SHEWHART and SAMUEL S. WILKS Editors: David J. Balding, Noel A. C. Cressie, Garrett M. Fitzmaurice, Geof H. Givens, Harvey Goldstein, Geert Molenberghs, David W. Scott, Adrian F. M. Smith, Ruey S. Tsay, Sanford Weisberg Editors Emeriti: J. Stuart Hunter, Iain M. Johnstone, Joseph B. Kadane, Jozef L. Teugels A complete list of the titles in this series appears at the end of this volume.

Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators

Tailen Hsing Professor, Department of Statistics University of Michigan, USA

Randall Eubank Professor Emeritus, School of Mathematical and Statistical Sciences, Arizona State University, USA

This edition first published 2015 © 2015 John Wiley & Sons, Ltd Registered office John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com. The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher. Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books. Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Library of Congress Cataloging-in-Publication Data applied for. A catalogue record for this book is available from the British Library. ISBN: 9780470016916 Set in 10/12pt TimesLTStd by Laserwords Private Limited, Chennai, India 1 2015

To Our Families

Contents Preface

xi

1

Introduction 1.1 Multivariate analysis in a nutshell 1.2 The path that lies ahead

1 2 13

2

Vector and function spaces 2.1 Metric spaces 2.2 Vector and normed spaces 2.3 Banach and 𝕃p spaces 2.4 Inner Product and Hilbert spaces 2.5 The projection theorem and orthogonal decomposition 2.6 Vector integrals 2.7 Reproducing kernel Hilbert spaces 2.8 Sobolev spaces

15 16 20 26 31 38 40 46 55

3

Linear operator and functionals 3.1 Operators 3.2 Linear functionals 3.3 Adjoint operator 3.4 Nonnegative, square-root, and projection operators 3.5 Operator inverses 3.6 Fréchet and Gâteaux derivatives 3.7 Generalized Gram–Schmidt decompositions

61 62 66 71 74 77 83 87

4

Compact operators and singular value decomposition 4.1 Compact operators 4.2 Eigenvalues of compact operators 4.3 The singular value decomposition 4.4 Hilbert–Schmidt operators 4.5 Trace class operators

91 92 96 103 107 113

viii

CONTENTS

4.6 4.7 4.8

Integral operators and Mercer’s Theorem Operators on an RKHS Simultaneous diagonalization of two nonnegative definite operators

116 123 126

5

Perturbation theory 5.1 Perturbation of self-adjoint compact operators 5.2 Perturbation of general compact operators

129 129 140

6

Smoothing and regularization 6.1 Functional linear model 6.2 Penalized least squares estimators 6.3 Bias and variance 6.4 A computational formula 6.5 Regularization parameter selection 6.6 Splines

147 147 150 157 158 161 165

7

Random elements in a Hilbert space 7.1 Probability measures on a Hilbert space 7.2 Mean and covariance of a random element of a Hilbert space 7.3 Mean-square continuous processes and the Karhunen–Lòeve Theorem 7.4 Mean-square continuous processes in 𝕃2 (E, ℬ(E), 𝜇) 7.5 RKHS valued processes 7.6 The closed span of a process 7.7 Large sample theory

175 176 178

8

Mean and covariance estimation 8.1 Sample mean and covariance operator 8.2 Local linear estimation 8.3 Penalized least-squares estimation

211 212 214 231

9

Principal components analysis 9.1 Estimation via the sample covariance operator 9.2 Estimation via local linear smoothing 9.3 Estimation via penalized least squares

251 253 255 261

10 Canonical correlation analysis 10.1 CCA for random elements of a Hilbert space 10.2 Estimation 10.3 Prediction and regression

184 190 195 198 203

265 267 274 281

CONTENTS

10.4 Factor analysis 10.5 MANOVA and discriminant analysis 10.6 Orthogonal subspaces and partial cca

ix

284 288 294

11 Regression 11.1 A functional regression model 11.2 Asymptotic theory 11.3 Minimax optimality 11.4 Discretely sampled data

305 305 308 318 321

References

327

Index

331

Notation Index

334

Preface This book aims to provide a compendium of the key mathematical concepts and results that are relevant for the theoretical development of functional data analysis (fda). As such, it is not intended to provide a general introduction to fda per se and, accordingly, we have not attempted to catalog the volumes of fda research work that have flowed at a brisk pace into the statistics literature over the past 15 years or so. Readers might therefore find it helpful to read the present text alongside other books on fda, such as Ramsay and Silverman (2005), which provide more thorough and practical developments of the topic. This project grew out of our own struggle in acquiring the theoretical foundations for fda research in diverse fields of mathematics and statistics. With that in mind, the book strives to be self-contained. Rigorous proofs are provided for most of the results that we present. Nonetheless, a solid mathematics background at a graduate level is needed to be able to appreciate the content of the text. In particular, the reader is assumed to be familiar with linear algebra and real analysis and to have taken a course in measure theoretic probability. With this proviso, the material in the book would be suitable for a one-semester, special-topics class for advanced graduate students. Functional data analysis is, from our perspective, the statistical analysis of sample path data observed from continuous time stochastic processes. Thus, we are dealing with random functions whose realizations fall into some suitable (large) collection of functions. This makes an overview of function space theory a natural starting point for our treatment of fda. Accordingly, we begin with that topic in Chapter 2. There we develop essential concepts such as Sobolev and reproducing kernel Hilbert spaces that play pivotal roles in subsequent chapters. We also lay the foundation that is needed to understand the essential mathematical properties of bounded operators on Banach and, in particular, Hilbert spaces. Our treatment of operator theory is broken into three chapters. The first of these, Chapter 3, deals with basic concepts such as adjoint, inverse, and projection operators. Then, Chapter 4 investigates the spectral theory that underlies compact operators in some detail. Here, we present both the typical eigenvalue/eigenvector expansion for self-adjoint operators and the somewhat less common singular value expansion that applies in the non-self-adjoint case

xii

PREFACE

or, more generally, for operators between two different Hilbert spaces. These expansions make it possible to develop the concepts of Hilbert–Schmidt and trace class operators at a level of generality that makes them useful in subsequent aspects of the text. The treatment of principal components analysis in Chapter 9 requires some understanding of perturbation theory for compact operators. This material is therefore developed in Chapter 5. As was the case for Chapter 4, we do this for both the self-adjoint and the-non self-adjoint scenarios. The latter instance therefore provides an introduction to the less well-documented perturbation theory for singular values and vectors that, for example, can be employed to investigate the properties of canonical correlation estimators. The fact that sample paths must be digitized for storage entails that data smoothing of some kind often becomes necessary. Smoothing and regularization problems also arise naturally from the approximate solution of operator equations, functional regression, and various other problems that are endemic to the fda setting. Chapter 6 examines a general abstract smoothing or regularization problem that corresponds to what we call a functional linear model. An explicit form is derived for the associated estimator of the underlying functional parameter. The problems of computation and regularization parameter selection are considered for the case of real valued, scalar response data. A special case of our abstract smoothing scenario leads us back to ordinary smoothing splines and we spend some time studying their associated properties as nonparametric regression estimators. Chapter 7 aims to establish the probabilistic underpinnings of fda. The mean element, covariance operator, and cross-covariance operators are rigorously defined here for random elements of a Hilbert space. The fda case where a random element has a representation as a continuous time stochastic process is given special treatment that, among other factors, clarifies the relationship between its covariance operator and covariance kernel. A brief foray into representation theory produces congruence relationships that prove useful in Chapter 10. The chapter then concludes with selected aspects of the large sample theory for Hilbert space valued random elements that includes both a strong law and central limit theorem. The large sample behavior of the sample mean element and covariance operator are studied in Chapter 8. This is relevant for cases where functional data is completely observed. When meaningful discretization occurs, smoothing becomes necessary and we look into the large sample performance of two estimation schema that can be used for that purpose: namely, local linear and penalized least-squares smoothing. Chapter 9 is the principal components counterpart of Chapter 8 in that it investigates the properties of eigenvalues and eigenfunctions associated with the covariance operator estimators that were derived in that chapter.

PREFACE

xiii

Chapters 10 and 11 both address bivariate situations. In Chapter 10, the focus is canonical correlation and this concept is used to study a variety of fda problems including functional factor analysis and discriminant analysis. Then, Chapter 11 deals with the problem of functional regression with a scalar response and functional predictor. An asymptotically, optimal penalized least-squares estimator is investigated in this setting. We have been fortunate to have talented coworkers and students that have generously shared their ideas and expertise with us on many occasions. A nonexhaustive list of such important contributors includes Toshiya Hoshikawa, Ana Kupresanin, Yehua Li, Heng Lian, Yolanda MunozMaldanado, Rosie Renaut, Hyejin Shin, and Jack Spielberg. We sincerely appreciate the invaluable help they have provided in bringing this book to fruition. The inspiration for much of the development in Chapter 10 can be traced to the serendipitous path through academics that brought us into contact with Anant Kshirsagar and Emanuel Parzen. We gratefully acknowledge the profound influence these two great scholars have had on this as well as many other aspects of our writing. TH also wishes to thank Ross Leadbetter for introducing him to the world of research and Ray Carroll for his support which has opened doors to many possibilities, including this book.

1

Introduction Briefly stated, a stochastic process is an indexed collection of random variables all of which are defined on a common probability space (Ω, ℱ, ℙ). If we denote the index set by E, then this can be described mathematically as {X(t, 𝜔) ∶ t ∈ E, 𝜔 ∈ Ω} , where X(t, ⋅) is a ℱ-measurable function on the sample space Ω. The 𝜔 argument will generally be suppressed and X(t, 𝜔) will typically be shortened to just X(t). Once the X(t) have been observed for every t ∈ E, the process has been realized and the resulting collection of real numbers is called a sample path for the process. Functional data analysis (fda), in the sense of this text, is concerned with the development of methodology for statistical analysis of data that represent sample paths of processes for which the index set is some (closed) interval of the real line; without loss, the interval can be taken as [0, 1]. This translates into observations that are functions on [0, 1] and data sets that consist of a collection of such random curves. From a practical perspective, one cannot actually observe a functional data set in its entirety; at some point, digitization must occur. Thus, analysis might be predicated on data of the form xi (j∕r), j = 1, … , r, i = 1, … , n, involving n sample paths x1 (⋅), … , xn (⋅) for some stochastic process with each sample path only being evaluated at r points in [0, 1]. When viewed from this perspective, the data is inherently finite dimensional and the temptation is to treat it as one would data in a multivariate analysis (mva) context.

Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators, First Edition. Tailen Hsing and Randall Eubank. © 2015 John Wiley & Sons, Ltd. Published 2015 by John Wiley & Sons, Ltd.

2

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

However, for truly functional data, there will be many more “variables” than observations; that is, r ≫ n. This leads to drastic ill conditioning of the linear systems that are commonplace in mva which has consequences that can be quite profound. For example, Bickel and Levina (2004) showed that a naive application of multivariate discriminant analysis to functional data can result in a rule that always classifies by essentially flipping a fair coin regardless of the underlying population structure. Rote application of mva methodology is simply not the avenue one should follow for fda. On the other hand, the basic mva techniques are still meaningful in a certain sense. Data analysis tools such as canonical correlation analysis, discriminant analysis, factor analysis, multivariate analysis of variance (MANOVA), and principal components analysis exist because they provide useful ways to summarize complex data sets as well as carry out inference about the underlying parent population. In that sense, they remain conceptually valid in the fda setting even if the specific details for extracting the relevant information from data require a bit of adjustment. With that in mind, it is useful to begin by cataloging some of the multivariate methods and their associated mathematical foundations, thereby providing a roadmap of interesting avenues for study. This is the subject of the following section.

1.1

Multivariate analysis in a nutshell

mva is a mature area of statistics with a rich history. As a result, we cannot (and will not attempt to) give an in-depth overview of mva in this text. Instead, this section contains a terse, mathematical sketch of a few of the methods that are commonly employed in mva. This will, hopefully, provide the reader with some intuition concerning the form and structure of analogs of mva techniques that are used in fda as well as an appreciation for both the similarities and the differences between the two fields of study. Introductions to the theory and practice of mva can be found in a myriad of texts including Anderson (2003), Gittins (1985), Izenman (2008), Jolliffe (2004), and Johnson and Wichern (2007). Let us begin with the basic set up where we have a p-dimensional random vector X = (X1 , … , Xp )T having (variance-)covariance matrix [ ] 𝒦 = 𝔼 (X − m) (X − m)T

(1.1)

m = 𝔼X

(1.2)

with

the mean vector for X. Here, 𝔼 corresponds to mathematical expectation and 𝑣T indicates the transpose of a vector 𝑣. The matrix 𝒦 admits an

INTRODUCTION

3

eigenvalue–eigenvector decomposition of the form 𝒦=

p ∑

𝜆j ej eTj

(1.3)

j=1

for eigenvalues 𝜆1 ≥ · · · ≥ 𝜆p ≥ 0 and associated orthonormal eigenvectors ( )T ej = e1j , … , epj , j = 1, … , p that satisfy eTi 𝒦ej = 𝜆j 𝛿ij , where 𝛿ij is 1 or 0 depending on whether or not i and j coincide. This provides a basis for principal components analysis (pca). We can use the eigenvectors in (1.3) to define new variables Zj = eTj (X − m), which are referred to as principal components. These are linear combinations of the original variables with the weight or loadings eij that is applied to Xi in the jth component indicating its importance to Zj ; more precisely, Cov(Zj , Xi ) = 𝜆j eij . In fact, X =m+

p ∑

Zj e j

(1.4)

j=1

as, if 𝒦 is full rank, e1 , … , ep provide an orthonormal basis for ℝp ; this is even true when 𝒦 has less than full rank as eTj X is zero with probability one when 𝜆j = 0. The implication of (1.4) is that X can be represented as a weighted sum of the eigenvectors of 𝒦 with the weights/coefficients being uncorrelated random variables having variances that are the eigenvalues of 𝒦. In practice, one typically retains only some number q < p of the components and views them as providing a summary of the (covariance) relationship between the variables in X. As with any type of summarization, this results in a loss of information. The extent of this loss can be gauged by the proportion of the total X variance V ∶= trace(𝒦) that is recovered by the principal components that are retained. In this regard, we know that V=

p ∑

𝜆j

j=1

while the variance of the jth component is Var(Zj ) = eTj 𝒦ej = 𝜆j .

4

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Thus, the jth component accounts ( ) for 100𝜆j ∕V percentage of the total vari∑j ance and 100 1 − k=1 𝜆k ∕V is the percentage of variability that is not accounted for by Z1 , … , Zj . Principal components possess various optimality features such as the one catalogued in Theorem 1.1.1. Theorem 1.1.1 Var(Zj ) = max{eT e=1,eT 𝒦ei =0,i=1,…, j−1} Var(eT X). The proof of this result is, e.g., a consequence of developments in Section 4.2. It can be interpreted as saying that the jth principle component is the linear combination of X that accounts for the maximum amount of the remaining total variance after removing the portion that was explained by Z1 , … , Zj−1 . The discussion to this point has been concerned with only the population aspects of pca. Given a random sample x1 , … , xn of observations on X, we estimate 𝒦 by the sample covariance matrix 𝒦n = (n − 1)

−1

n ∑ ( )( )T xi − xn xi − xn

(1.5)

i=1

with xn = n

−1

n ∑

xi

(1.6)

i=1

the sample mean vector. As 𝒦n is positive semidefinite, it has the eigenvalue– eigenvector representation 𝒦n =

p ∑

𝜆jn ejn eTjn ,

(1.7)

j=1

where the ein are orthonormal and satisfy eTin 𝒦n ejn = 𝜆jn 𝛿ij . This produces the sample principle components zjn = eTjn (x − xn ) for j = 1, … , p with x = (x1 , … , xp )T and the associated scores eTjn (xi − xn ), i = 1, … , n that provide sample information concerning the Zj . Theorems 9.1.1 and 9.1.2 of Chapter 9 can be used to deduce the large sample behavior of the sample eigenvalue–eigenvector pairs, (𝜆jn , ejn ), j = √ √ 1, … , r. The limiting distributions of n(𝜆jn − 𝜆j ) and n(ejn − ej ) are found to be normal which provides a foundation for hypothesis testing and interval estimation. The next step it to assume that X consists of two subsets of variables that we indicate by writing X = (X1T , X2T )T , where X1 = (X11 , … , X1p )T and

INTRODUCTION

5

X2 = (X21 , … , X2q )T . Questions of interest now concern the relationships that may exist between X1 and X2 . Our focus will be on those that are manifested in their covariance structure. For this purpose, we partition the covariance matrix 𝒦 for X from (1.1) as [ ] 𝒦1 𝒦12 𝒦= . (1.8) 𝒦21 𝒦2 Here, 𝒦1 , 𝒦2 are the covariance matrices for X1 , X2 , respectively, and 𝒦12 = 𝒦T21 is sometimes called the cross-covariance matrix. The goal is now to summarize the (cross-)covariance properties of X1 and X2 . Analogous to the pca approach, this will be accomplished using linear combinations of the two random vectors. Specifically, we seek vectors a1 ∈ ℝp and a2 ∈ ℝq that maximize 𝜌2 (a1 , a2 ) =

Cov2 (aT1 X1 , aT2 X2 ) ( ) ( ). Var aT1 X1 Var aT2 X2

(1.9)

This optimization problem can be readily solved with the help of the singular value decomposition: e.g., Corollary 4.3.2. Assuming that X1 , X2 contain no redundant variables, both 𝒦1 and 𝒦2 will be positive-definite with nonsingu1∕2 lar square roots 𝒦i , i = 1, 2. This allows us to write ( T )2 a ̃ ℛ a ̃ 12 2 1 𝜌2 (a1 , a2 ) = T T , (1.10) ã 1 ã 1 ã 2 ã 2 where

−1∕2

ℛ12 = 𝒦1 1∕2

−1∕2

𝒦12 𝒦2

,

(1.11)

1∕2

ã 1 = 𝒦1 a1 and ã 2 = 𝒦2 a2 . The matrix ℛ12 can be viewed as a multivariate analog of the linear correlation coefficient between two variables. Using the singular value decomposition in Corollary 4.3.2, we can see that (1.10) is maximized by choosing ã 1 , ã 2 to be the pair of singular vectors ã 11 , ã 21 that correspond to its largest singular value 𝜌1 . The optimal linear combi−1∕2 nations of X1 and X2 are therefore provided by the vectors a11 = 𝒦1 ã 11 −1∕2 and a21 = 𝒦2 ã 21 . The corresponding random variables U11 = aT11 X1 and T U21 = a21 X2 are called the first canonical variables of the X1 and X2 spaces, respectively. They each have unit variance and correlation 𝜌1 that is referred to as the first canonical correlation. The summarization process need not stop after the first canonical variables. If 𝒦12 has rank r, then there are actually r − 1 additional canonical variables that can be found: namely, for j = 2, … , r, we have U1j = aT1j X1

(1.12)

6

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

and

U2j = aT2j X2 , −1∕2

(1.13)

−1∕2

where a1j = 𝒦1 ã 1j , a2j = 𝒦2 ã 2j with ã 1j , ã 2j , the other singular vector pairs from ℛ12 that correspond to its remaining nonzero singular values 𝜌 ( 2 ≥ · · · )≥ 𝜌r > 0. For each choice of the index j, the random variable pair U1j , U2j is uncorrelated with all the other canonical variable pairs and has corresponding canonical correlation 𝜌j . When all this is put Together, it gives us 1∕2 1∕2 ℛ12 = 𝒦1 𝒜1 𝒟 𝒜2T 𝒦2 (1.14) with

[ ] 𝒜i = ai1 , … , air

the matrix of canonical weight vectors for Xi , i = 1, 2, and 𝒟 = diag (𝜌1 , … , 𝜌r ) a diagonal matrix containing the corresponding canonical correlations. There are various other ways of characterizing the canonical correlations and vectors. As they stem from the singular values and vectors of ℛ12 , −1∕2 they are the eigenvalue and eigenvectors obtained from 𝒦1 𝒦12 𝒦−1 2 −1∕2 −1∕2 −1∕2 𝒦21 𝒦1 and 𝒦2 𝒦21 𝒦−1 𝒦 𝒦 . For example, the squared canon12 1 2 ical correlations and canonical vectors of the X1 space can be derived from the linear system −1 2 𝒦−1 (1.15) 1 𝒦12 𝒦2 𝒦21 a1 = 𝜌 a1 . The population canonical correlations and associated canonical vectors can be estimated from the sample covariance matrix (1.5). For this purpose, one partitions 𝒦n analogous to 𝒦 and carries out the same form of singular value decomposition except using sample entities in place of 𝒦1 , 𝒦2 , and 𝒦12 . The resulting sample canonical correlations have a limiting multivariate normal distribution under various conditions as detailed in Muirhead and Waternaux (1980). The vectors X1 and X2 are linearly independent when 𝒦12 is a matrix of all zeros which we now recognize as being equivalent to 𝜌1 = · · · = 𝜌min(p,q) = 0. Test statistics for this and other related hypotheses can be developed from the sample canonical correlations. Canonical correlation occupies a pervasive role in classical multivariate analysis. One place it arises naturally is in the (linear) prediction of X1 from X2 . A best linear unbiased predictor(BLUP) is provided by the vector 𝛽0 + 𝛽1 X2 , where 𝛽0 , 𝛽1 are, respectively, the p × 1 vector and p × q matrix that minimize 𝔼(X1 − b0 − b1 X2 )T (X1 − b0 − b1 X2 )

INTRODUCTION

7

as a function of b0 , b1 . The minimizers are easily seen to be 𝛽0 = m1 − 𝛽1 m2 𝛽1 = 𝒦12 𝒦−1 2 , wherein mj = 𝔼Xj , j = 1, 2. From this, we recognize 𝛽1 as the least-squares regression coefficients for the regression of X1 on X2 . Now, premultiply (1.14) 1∕2 −1∕2 by 𝒦1 and postmultiply by 𝒦2 to obtain 𝛽1 =

r ∑

𝜌j 𝒦1 a1j aT2j ,

j=1

where, again, r is the rank of 𝒦12 . This establishes a fundamental relationship between the best linear predictor and the canonical variables: namely, 𝛽0 + 𝛽1 X2 = 𝛽0 +

r ∑

𝜌j 𝒦1 a1j U2j .

(1.16)

j=1

Thus, canonical correlation lies at the heart of the linear prediction of X1 from X2 . The converse is seen to be true as well by simply interchanging the roles of X1 and X2 in the above-mentioned discussion. In finding our linear predictor for X1 , we chose a priori to restrict attention to only those that were linear functions of X2 . A related, but distinct, variation on this theme is to presume the existence of a linear model that relates X1 to X2 by an expression such as X1 = 𝛽0 + 𝛽1 X2 + 𝜀

(1.17)

with 𝛽0 and 𝛽1 of dimension p × 1 and p × q as before and 𝜀 a p × 1 random vector that is uncorrelated with X2 while having zero mean and covariance matrix 𝒦𝜀 = 𝔼𝜀𝜀T . In this event,

𝒦12 = 𝛽1 𝒦2 ,

(1.18)

𝒦1 = 𝛽1 𝒦2 𝛽1T + 𝒦𝜀 = 𝒦12 𝒦−1 2 𝒦21 + 𝒦𝜀

(1.19)

and the canonical correlations now satisfy the relationship 𝒜1 𝒟 𝒜2T = (𝛽1 𝒦2 𝛽1T + 𝒦𝜀 )−1 𝛽1 .

(1.20)

Weyl’s inequality (e.g., Thompson and Freede, 1971) tells us that the eigen( ) values of 𝛽1 𝒦2 𝛽1T + 𝒦𝜀 are at least as large as those for 𝒦𝜀 . Thus, 𝛽1 is null if and only if 𝒟 in (1.20) is the zero matrix.

8

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Factor analysis is another multivariate analysis method that aims to examine the relationship between the two sets of variables. However, in this setting, only the values of the response variable X1 are observed while X2 is viewed as a collection of latent variables whose values represent the object of the analysis. The basic premise is that X1 and X2 are linearly related in that X1 = X2 + 𝜀

(1.21)

with 𝜀 a vector of zero mean random errors with variance–covariance matrix 𝒦𝜀 as before. The goal is now to use X1 to predict the unobserved values of X2 . Canonical correlation again provides the tool that allows us to make a modicum of progress toward solving the prediction problem posed by model (1.21). In this regard, we can look for a linear combination aT1 X1 that is maximally correlated with a linear combination aT2 X2 of X2 . There are some simplifications in this instance in that 𝒦12 = 𝒦2 and 𝒦1 = 𝒦2 + 𝒦𝜀 lead us to consideration of the objective function ( T )2 a1 𝒦2 a2 2 T T Corr (a1 X1 , a2 X2 ) = T a1 𝒦1 a1 aT2 𝒦2 a2 ( )2 −1∕2 1∕2 ã T1 𝒦1 𝒦2 ã 2 = ã T1 ã 1 ã T2 ã 2 1∕2

1∕2

with ã 1 = 𝒦1 a1 , ã 2 = 𝒦2 a2 . Thus, for example, the optimal correlations 𝜌 and choices for ã 1 can be obtained as solutions of the eigenvalue problem −1∕2

𝒦1

−1∕2

𝒦2 𝒦1

ã 1 = 𝜌2 ã 1 .

A little algebra then reveals this to be equivalent to finding solutions of 𝒦1 a1 =

1 𝒦𝜀 a1 . 1 − 𝜌2

(1.22)

To proceed further, we need to impose some additional structure on X2 . The standard approach is to assume that X2 = ΦZ

(1.23)

[ ] Φ = {𝜙ij }i=1∶p, j=1∶r = 𝜙1 , … , 𝜙r

(1.24)

for some unknown p × r matrix

and Z = (Z1 , … , Zr )T a vector of zero mean random variables with [ ] 𝔼 ZZ T = I.

INTRODUCTION

9

The elements of Z are referred to as factors while the elements of Φ are called factor loadings. A typical identifiability constraint arising from maximum likelihood estimation is to have 𝜙Ti 𝒦−1 𝜀 𝜙j = 0, i ≠ j.

(1.25)

This has the consequence of making ΦT 𝒦−1 𝜀 Φ a diagonal matrix, which we will subsequently presume to be the case. When (1.23) holds 𝒦1 = ΦΦT + 𝒦𝜀 ( ) 1∕2 −1∕2 −1∕2 1∕2 = 𝒦𝜀 𝒦𝜀 ΦΦT 𝒦𝜀 + I 𝒦𝜀 . −1∕2

−1∕2

It is readily verified that under (1.25) the matrix 𝒦𝜀 ΦΦT 𝒦𝜀 has eigen−1∕2 values 𝛾j = 𝜙Tj 𝒦−1 𝜙j . However, this 𝜀 𝜙j with associated eigenvectors 𝒦𝜀 −1∕2

−1∕2

means that 𝒦𝜀 𝒦1 𝒦𝜀 has eigenvalues 1 + 𝛾j associated with this same set of eigenvectors; i.e., [ ( ) ] −1∕2 −1∕2 −1∕2 𝒦𝜀 𝒦1 𝒦𝜀 − 1 + 𝛾j I 𝒦𝜀 𝜙j = 0 (1.26) or

[

] 𝒦1 − (1 + 𝛾j )𝒦𝜀 𝒦−1 𝜀 𝜙j = 0.

(1.27)

Comparing this with (1.22) leads to the conclusion that 𝒦−1 𝜀 𝜙j is a canonical weight vectors for the X1 space with 𝛾j = 𝜌2j ∕(1 − 𝜌2j ) obtained from its corresponding canonical correlation. To see where these developments might take us consider the unrealistic scenario where we knew 𝒦1 and 𝒦𝜀 but not Φ. If that were the case, the coefficient matrix could be [ recovered]from the canonical weight functions for the X1 space as Φ = 𝒦𝜀 a11 , … , a1r . We could then predict Z via the best linear unbiased predictor: namely, the linear transformation of X1 that minimizes the prediction error 𝔼(Z − ℒX1 )T (Z − ℒX1 ) over all possible choices for the r × p matrix ℒ. The minimum is attained with ℒ = Cov(Z, X1 )𝒦−1 giving 1 ( )−1 Ẑ = ΦT ΦΦT + 𝒦𝜀 X1

(1.28)

as the optimal predictor. Of course, in practice, we will not know either of 𝒦1 or 𝒦𝜀 . However, given a random sample of values from X1 , the first of these two quantities is easy to estimate using the sample variance–covariance matrix. While estimation of 𝒦𝜀 is more problematic, there are various ways this can be accomplished that produce at least acceptable initial estimators. Such estimators can be substituted into (1.27) to obtain an estimator of Φ. This, in turn, provides an update

10

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

of 𝒦𝜀 = 𝒦1 − ΦΦT . The result is one possible iterative estimation algorithm that is employed in the factor analysis genre. Our particular development of factor analysis is due to Rao (1955). A detailed treatments of this and many other factor analysis-related topics can be found in Basilevsky (1994). A case of particular interest that can be treated with a linear model is MANOVA and its predictive analog known as discriminant analysis. To develop these ideas, we begin with the model [ ] X1 = m + m1 − m, … , mq+1 − m X̃ 2 + 𝜀, ( )T where X̃ 2 = X̃ 21 , … , X̃ 2(q+1) has a multinomial distribution with ∑q+1 ̃ X = 1 and success probabilities 𝜋1 , … , 𝜋q+1 , m1 , … , mq+1 are the X1 j=1 2j ∑q+1 mean vectors for the q + 1 different populations, m = j=1 𝜋j mj is the grand mean and 𝜀 is a p-variate ∑ random vector with mean∑zero and covariance q q matrix 𝒦𝜀 . As, 𝜋q+1 = 1 − j=1 𝜋j and X̃ 2(q+1) = 1 − j=1 X̃ 2j , the previous model can be equivalently expressed as [ ] X1 = mq+1 + m1 − mq+1 , … , mq − mq+1 X2 + 𝜀 (1.29) with

( )T X2 ∶= X̃ 21 , … , X̃ 2q . (1.30) [ ] This corresponds to (1.17) with 𝛽1 = (m1 − mq+1 ), … , (mq − mq+1 ) and 𝛽0 = mq+1 . To apply formula (1.20), we must calculate 𝒦1 , 𝒦12 and 𝒦2 . In this regard, first note that 𝔼X2 = (𝜋1 , … , 𝜋q )T =∶ 𝜋 and

𝒦2 = Var(X2 ) ( ) = diag 𝜋1 , … , 𝜋q − 𝜋𝜋 T .

Then, one may check that

( −1 ) −1 −1 𝒦−1 + 𝜋q+1 11T 2 = diag 𝜋1 , … , 𝜋q

for a q-vector 1 of all unit elements. From these identities, we obtain [ ] 𝒦12 = 𝜋1 (m1 − m), … , 𝜋q (mq − m) and

𝒦1 = 𝒦B + 𝒦𝜀

INTRODUCTION

with 𝒦B =

q+1 ∑

11

𝜋j (mj − m)(mj − m)T = 𝒦12 𝒦−1 2 𝒦21 .

j=1

Relations (1.18) and (1.15) can now be used to see that in this instance the canonical correlations and canonical vectors of the X1 space are characterized by 𝜌2 𝒦−1 a. (1.31) 𝜀 𝒦B a = 1 − 𝜌2 Thus, all the canonical correlations are zero if and only if 𝒦B is the zero matrix, which, in turn, is equivalent to the standard MANOVA null hypothesis that all of the q + 1 populations have the same mean vector. Statistical tests for this null model can then be constructed from the sample canonical correlations. If the mean vectors are the same for all of our q + 1 populations, it will not generally be possible to distinguish between them on the basis of location. However, if the MANOVA null model is rejected, one can expect to have at least some success in categorizing incoming observations according to population membership. The process of doing so is often referred to as discriminant analysis. While there are many discrimination methods that appear in the literature, our focus here will be limited to Fisher’s classical proposal. The idea is to find a linear combination, or discriminant function, hT X1 that provides the maximum separation between the populations in the sense of maximizing the ratio hT 𝒦B h . hT 𝒦𝜀 h

(1.32)

However, this is just a variation of a problem we already encountered with −1∕2 −1∕2 pca. The solution is the largest eigenvalue 𝜆1 of 𝒦𝜀 𝒦B 𝒦𝜀 with the −1∕2 optimal discriminant function weight vector h1 = 𝒦𝜀 u1 obtained from the eigenvector u1 that corresponds to 𝜆1 . These quantities are characterized by −1∕2

𝒦𝜀 or, equivalently, by

−1∕2

𝒦B 𝒦𝜀

u1 = 𝜆1 u1

𝒦−1 𝜀 𝒦B h1 = 𝜆1 h1 .

So, 𝜆1 = 𝜌21 ∕(1 − 𝜌21 ) with 𝜌1 the first canonical correlation. Additional discriminant functions are provided by maximizing (1.32) conditional on the resulting linear combinations of variables being uncorrelated with the discriminant functions that have already been determined from this iterative process. The resulting eigenvalues will, of course, enjoy the

12

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

same relation to their corresponding canonical correlations. If we now opt to retain r ≤ min(p, q) discriminant functions having weight vectors h1 , … hr , a new observation x is classified as being from the population whose index minimizes r ∑ (hTj x − hTj mi )2 (1.33) j=1

over i = 1, … , q + 1. As one might expect, the relationship between Fisher’s discriminant functions and canonical variables goes much deeper than just the eigenvalue connection. The equivalence of Fisher’s discriminant analysis and canonical correlation is revealed by observing that 𝛿ij = =

aT1i 𝒦B a1j 𝜌2i hTi 𝒦B hj 𝜆i

.

This shows that the hi and a1i are eigenvector of 𝒦B that have been adjusted to √ 2 have norms 𝜆i and 𝜌i . So, a1i = 𝜌i hi ∕ 𝜆i . In particular, this means that (1.33) will produce the same classification as would be obtained using canonical variables with the criterion r (aT x − aT m )2 ∑ j j i j=1

1 − 𝜌2j

.

A typical situation with data would have us observing p dimensional random vectors Xij , i = 1, … , q + 1, j = 1, … , ni with n =

∑q+1 i=1

ni . Then, estimators of 𝒦𝜀 , 𝒦B are n−1

q+1 ni ( )( )T ∑ ∑ Xij − X i Xij − X i i=1 j=1

and n−1

q+1 ∑

( )( )T ni X i − X X i − X ,

i=1

∑ni ∑q+1 respectively, with X i = n−1 X , X = n−1 i=1 ni X i . We then carry out i j=1 ij classification by replacing 𝒦𝜀 , 𝒦B by their sample analogs in the previous formulation.

INTRODUCTION

1.2

13

The path that lies ahead

The basic concepts described in Section 1.1 remain conceptually valid in the context of fda. The first challenge faced by researchers in the area lies in developing them fully and rigorously from a mathematical perspective. This is the essential precursor to the growth and maturation of inferential methodology in general and for fda in particular. One cannot estimate a “parameter” if it is undefined and it is easy to take a misstep in fda formulations that lead to exactly such a conundrum. The theory of multivariate analysis is inextricably interwoven with matrix theory. If the observations in fda are viewed as vectors of infinite length, then we would anticipate that the infinite dimensional analog of matrices would represent the tools of the trade for advancing our understanding of this emerging field. Such entities are called linear operators and, in particular, compact operators give us the infinite dimensional extension of matrices that arise naturally in fda. Just as one cannot venture far into multivariate analysis without understanding matrix theory, one cannot expect to appreciate the mathematical aspects of fda without a thorough background in compact operators and their Hilbert–Schmidt and trace class variants. After developing the necessary background on linear spaces in Chapter 2, we accumulate some of the essential ingredients of functional analysis and operator theory in Chapters 3–5. Chapter 3 provides a general overview that introduces linear operators and linear functionals as well as fundamental concepts such as the inverse and adjoint of an operator, nonnegative and projection operators. Compact operators are then treated in Chapter 4 where we develop both eigenvalue and singular value expansions and treat the special cases of Hilbert–Schmidt and trace class operators. This work plays an important role throughout the remainder of this text. Chapter 5 deals with perturbation theory for compact operators and provides key tools for the treatment of functional pca in Chapter 9. Data smoothing methods tend to be a prominent aspect of most fda inferential methodology making a foray into the mathematical aspects of smoothing somewhat de rigueur for this particular treatise. Our treatment of the topic focuses on an abstract penalized smoothing problem that can be specialized to recover the spline smoothers that are most common in the fda literature. Penalized smoothing methods recur in Chapters 8 and 11 where we study their performance for estimation of certain functional parameters. Classical statistics is concerned with inference about the distribution of a basic random variable that we are able to sample repeatedly. The same can be said for fda except that the meaning of the “random variable” phrase requires a bit of reinterpretation before it becomes relevant to that setting. What is needed is the concept of a random element of a Hilbert space. That idea along with its associated probabilistic machinery is developed in Chapter 7.

14

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Here, we extend the concepts of a mean vector and covariance matrix to the infinite dimensional scenario and develop some asymptotic theory that is relevant for random samples of Hilbert space valued random elements. Chapters 8–11 will provide extended, detailed illustrations of how the mathematical machinery in Chapters 2–6 can be used to address problems that arise in the fda environment. In Chapter 8, this takes the form of analysis of the large sample properties of three types of estimators of the mean element and covariance function: ones that are based on the sample mean element and covariance operator for completely observed functional data and local linear and penalized least-squares estimators for the discretely observed case. This is followed in Chapter 9 with an investigation of the asymptotic behavior of the principle components estimators that are produced by the covariance estimators introduced in Chapter 8. Chapter 10 deals with the bivariate case where, for example, one has two stochastic processes and wishes to analyze their dependence structure. Somewhat more generally, it provides a development of abstract canonical correlation for two Hilbert space valued random elements. By specializing this theory to the fda stochastic processes context, we are then able to obtain parallels of results in Section 1.1 for functional analogs of linear prediction, regression, factor analysis, MANOVA, and discriminant analysis. Finally, Chapter 11 deals with the important case of bivariate data having both a scalar and functional (i.e., stochastic process) response. The large sample properties of a particular penalized least-squares estimator of the regression coefficient function are investigated in this setting.

2

Vector and function spaces In Chapter, we loosely defined functional data to be a collection of sample paths for a stochastic process (or processes) with the index set [0, 1]. Thus, the data are functions that may have various properties such as being continuous or square integrable with probability one. Characteristics such as these signify a commonality that becomes amenable to treatment through the study of function spaces. An understanding of function spaces is an essential first step in dealing with functional data. Beyond that, the properties of linear functionals and operators on function spaces that are studied in Chapters 3 and 4 lie at the heart of the functional analogs of the basic concepts from multivariate analysis. The purpose of this chapter is to present the function space theory that we perceive to be most relevant for fda. Clearly, it will be impossible to give a comprehensive treatment of each topic that we touch upon here. Certain function space concepts like reproducing kernel Hilbert spaces and Sobolev spaces are treated in some detail in view of their relative obscurity and relevance for our targeted audience. However, many topics such as vector, metric, Banach, and Hilbert spaces are included for completeness and to serve as a review of material that the reader will have hopefully seen elsewhere. Thorough expositions of these concepts can be found in, e.g., Luenberger (1969) and Rynne and Youngson (2001). It is also presumed that the reader is comfortable with standard measure and integration theory with Billingsley (1995) and Royden and Fitzpatrick (2010) being two of the standard sources that provide introductions to this area.

Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators, First Edition. Tailen Hsing and Randall Eubank. © 2015 John Wiley & Sons, Ltd. Published 2015 by John Wiley & Sons, Ltd.

16

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

2.1

Metric spaces

One of the signature features of the p-dimensional Euclidean space ℝp is the ease with which one can define measures of distance or closeness. This can be accomplished in a variety of ways. However, all such measures must possess certain properties; namely, they must all be metrics in the sense defined in the following definition. Definition 2.1.1 A metric on a set 𝕄 is a function d ∶ 𝕄 × 𝕄 → ℝ that satisfies 1. d(x1 , x2 ) ≥ 0, 2. d(x1 , x2 ) = 0 if x1 = x2 , 3. d(x1 , x2 ) = d(x2 , x1 ) and 4. d(x1 , x3 ) ≤ d(x1 , x2 ) + d(x2 , x3 ) for x1 , x2 , x3 ∈ 𝕄. We refer to the pair (𝕄, d) as a metric space. Example 2.1.2 Let xi = (xi1 , … , xip ) ∈ ℝp , i = 1, 2. Metrics for ℝp include the Euclidean distance √ √ p √∑ d(x1 , x2 ) = √ (x1j − x2j )2 j=1

and

d(x1 , x2 ) = max |x1j − x2j |. 1≤j≤p

On the other hand, squared Euclidean distance is not a metric. Example 2.1.3 A somewhat more exotic example of a metric space is C[0, 1]: the set of continuous functions on [0, 1]. Recall that continuous functions defined on a closed intervals are uniformly continuous. The sup metric is then well defined as d(f , g) = sup{|f (t) − g(t)| ∶ t ∈ [0, 1]} for f , g ∈ C[0, 1]. Conditions 1–4 in Definition 2.1.1 can be easily verified for this case. Once we have a way to assess distances, it becomes possible to define parallels of many of the familiar point-set concepts associated with numbers on the real line. For example, the idea of open and closed sets takes the expected form with absolute values being replaced by an abstract metric.

VECTOR AND FUNCTION SPACES

17

Definition 2.1.4 Let (𝕄, d) be a metric space with E ⊂ 𝕄. Then, E is said to be open if for every e ∈ E there exists an 𝜖 > 0 such that {x ∈ 𝕄 ∶ d(x, e) < 𝜖} ⊂ E. The subset E is closed if {x ∈ 𝕄 ∶ x ∉ E} is open. Related concepts such as denseness, separability, etc., now translate virtually unchanged to the metric space setting. Definition 2.1.5 The closure E of E ⊂ 𝕄 is the smallest closed set in 𝕄 that contains E. Definition 2.1.6 A set E ⊂ 𝕄 is dense in 𝕄 if E = 𝕄. Definition 2.1.7 A metric space 𝕄 is separable if it has a countable, dense subset. Example 2.1.8 The p-dimensional real space, ℝp , is separable as the set of vectors with rational components is countable and is dense in ℝp . If we have a function mapping from one metric space to another, then its smoothness (i.e., how the value of the function changes as the argument changes) is an important consideration for functional data. The most basic form of smoothness is continuity. Definition 2.1.9 Let (𝕄1 , d1 ) and (𝕄2 , d2 ) be metric spaces and let f be a function defined on 𝕄1 that takes values in 𝕄2 . Then, f is continuous at x ∈ 𝕄1 if for every 𝜖 > 0 there is a 𝛿x,𝜖 > 0 such that for all y ∈ 𝕄1 satisfying d1 (x, y) < 𝛿x,𝜖 we have d2 (f (x), f (y)) < 𝜖. The function f is uniformly continuous when 𝛿x,𝜖 does not depend on x. An equivalent definition of continuity can be based on convergence of sequences in the sense defined in the following definition. Definition 2.1.10 Let (𝕄, d) be a metric space with {xn } a sequence of points in 𝕄. The sequence converges to x ∈ 𝕄, denoted by xn → x, if d(xn , x) → 0 as n → ∞. With this notation, we can say that f is continuous at x if f (xn ) → f (x) whenever xn → x. The Cauchy criterion arises as a means to determine the convergence of a sequence of real numbers. The result is that such a sequence converges if and only if it is a Cauchy sequence in the sense of our following definition.

18

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Definition 2.1.11 A sequence of elements {xn } in a metric space (𝕄, m) is said to be a Cauchy sequence if supm,n≥N d(xm , xn ) → 0 as N → ∞. Now, if xn → x, part 4 of Definition 2.1.1 (also called the triangle inequality) entails that d(xn , xm ) ≤ d(xn , x) + d(xm , x) → 0 as m, n → ∞. So, any convergent sequence is necessarily Cauchy. Unfortunately, the converse if not always true and cases where it is true are sufficiently important to merit a special name. Definition 2.1.12 A metric space (𝕄, m) is said to be complete if every Cauchy sequence is convergent. Example 2.1.13 It is easy to show that ℝ is complete by arguing that for any Cauchy sequence {xn } in ℝ, lim infxn must equal lim supxn and be finite. Thus, a sequence being Cauchy is equivalent to convergence and the Cauchy criterion stems from that fact. This is all readily generalized to ℝp . A somewhat more challenging case than the previous example is provided by C[0, 1]. Our task in this instance is to establish Theorem 2.1.14 The set of continuous functions on the interval [0, 1] equipped with the sup metric is complete and separable. Proof: Let {fn } be a Cauchy sequence in C[0, 1]: i.e., sup sup |fm (t) − fn (t)| → 0

(2.1)

m,n≥N t∈[0,1]

as N → ∞. Then, for each t ∈ [0, 1], the sequence {fn (t)} is a Cauchy sequence in ℝ and necessarily converges. Let f be the point-wise limit of fn . As supn≥N fn (t) ↓ f (t), inf n≥N fn (t) ↑ f (t) as N → ∞, |fN (t) − f (t)| ≤ sup fn (t) − inf fn (t) = sup |fm (t) − fn (t)| n≥N

n≥N

m,n≥N

which tends to 0 uniformly by (2.1) and thereby establishes completeness. To show that C[0, 1] is separable, we present a proof that employs some simple tools from probability to establish an elementary version of the Stone–Weierstrass Theorem. Let f ∈ C[0, 1] and define the Bernstein polynomial of degree n by n ( ) ∑ n m fn (x) = x (1 − x)n−m f (m∕n). m m=0

VECTOR AND FUNCTION SPACES

19

As the collection of all polynomials is countable, it suffices to show that lim sup |fn (x) − f (x)| = 0.

(2.2)

n→∞ x∈[0,1]

By uniform continuity, for any given 𝜖 > 0, there exists a 𝛿 > 0 such that if |x − y| < 𝛿 then |f (x) − f (y)| < 𝜖. Fix x ∈ [0, 1] and let Y have a binomial distribution with n trials and success probability x. In that case, fn (x) = 𝔼[f (Y∕n)] and |fn (x) − f (x)| ≤ 𝔼| f (Y∕n) − f (x)| = 𝔼[| f (Y∕n) − f (x)|I(|Y∕n − x| ≤ 𝛿)] +𝔼[| f (Y∕n) − f (x)|I(|Y∕n − x| > 𝛿)] with I(A) the indicator function for the set A. By the choice of 𝛿, 𝔼[|f (Y∕n) − f (x)|I(|Y∕n − x| ≤ 𝛿)] ≤ 𝜖. On the other hand, Chebyshev’s inequality gives 𝔼[|f (Y∕n) − f (x)|I(|Y∕n − x| > 𝛿)] ≤ 2 sup |f (y)| y

= 2 sup |f (y)| y

Var(Y) n2 𝛿 2 nx(1 − x) . n2 𝛿 2

In combination, the two cases lead us to the conclusion that lim sup sup |fn (x) − f (x)| ≤ 𝜖 n→∞

x∈[0,1]

for each 𝜖 > 0. As 𝜖 > 0 is arbitrary, (2.2) follows.



Two other important properties of metric spaces are the following. Definition 2.1.15 A metric space 𝕄 is compact if every sequence of elements in 𝕄 contains a subsequence that converges to an element of 𝕄. If 𝕄 is compact, 𝕄 is termed relatively compact. Definition 2.1.16 A metric space (𝕄, d) is totally bounded if for any 𝜖 > 0 there exists y1 , … , yn in 𝕄 such that 𝕄 ⊂ ∪ni=1 {x ∈ 𝕄 ∶ d(x, yi ) < 𝜖} for some finite positive integer n. The following result, known as the Heine–Borel theorem, is proved, for example, in Royden and Fitzpatrick (2010).

20

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Theorem 2.1.17 A metric space is compact if it is complete and totally bounded. Example 2.1.18 The p-dimensional unit square [0, 1]p is a compact metric space under either of the metrics in Example 2.1.2.

2.2

Vector and normed spaces

Metric spaces represent a natural topological abstraction of ℝp that is suitable for our purposes. What remains is the addition of an appropriate algebraic structure. This is provided by the vector space concept that is introduced in this section. It is safe to say that data deriving from complex vis-a-vis real valued random variables is a rarity in statistics. This comment carries over to functional data and, for that reason, we will confine attention to vector spaces over ℝ throughout this text unless explicitly stated to the contrary. Definition 2.2.1 A vector space 𝕍 is a set of elements, referred to as vectors, for which two operations have been defined: addition and scalar multiplication. Given two vectors 𝑣1 , 𝑣2 , addition returns another vector denoted by 𝑣1 + 𝑣2 . Given a vector 𝑣 and a ∈ ℝ, scalar multiplication returns a vector denote by a𝑣. The addition and multiplication operations are assumed to satisfy 1. 𝑣1 + 𝑣2 = 𝑣2 + 𝑣1 , 2. 𝑣1 + (𝑣2 + 𝑣3 ) = (𝑣1 + 𝑣2 ) + 𝑣3 , 3. a1 (a2 𝑣) = (a1 a2 )𝑣, 4. a(𝑣1 + 𝑣2 ) = a𝑣1 + a𝑣2 , (a1 + a2 )𝑣 = a1 𝑣 + a2 𝑣, and 5. 1𝑣 = 𝑣. In addition, there is a unique element 0 with the property that 𝑣 + 0 = 𝑣 for every 𝑣 ∈ 𝕍 and corresponding to each element 𝑣 there is another element −𝑣 such that 𝑣 + (−𝑣) = 0. We will often work with subsets of vector spaces. When such a subset admits the same algebraic operations as its parent vector space we refer to it as a linear subspace or merely as a subspace. In particular, beginning with a subset A of a vector space 𝕍 , we can create a subspace by forming a new set that contains all the finite dimensional linear combinations of its elements. This is referred to as the span of A and denoted by span(A). One characterization of span(A) is as the intersection of all subspaces in 𝕍 that contain A. In that sense, span(A) is the smallest subspace that contains the elements of A.

VECTOR AND FUNCTION SPACES

21

The linear span concept can be sharpened somewhat by introducing the idea of a basis. First, we need to define what we mean by linear independence. Definition 2.2.2 Let 𝕍 be a vector space and B = {𝑣1 , … , 𝑣k } ⊂ 𝕍 for some finite, ∑k positive integer k. The collection B is said to be linearly independent if i=1 ai 𝑣i = 0 entails that ai = 0 for all i. Then, we have Definition 2.2.3 If B = {𝑣1 , … , 𝑣k } is a linearly independent subset of the vector space 𝕍 and span(B) = 𝕍 , B is said to be a basis for 𝕍 . Note that it is necessarily the case that if {𝑣1 , … , 𝑣k } ⊂ 𝕍 is a basis and 𝑣 ∑k is any element of 𝕍 there are coefficients b1 , … , bk such that 𝑣 = j=1 bj 𝑣j . Moreover, the integer k is unique in the following sense. Theorem 2.2.4 If, for finite k and 𝓁, B1 = {𝑣1 , … , 𝑣k } and B2 = {u1 , … , u𝓁 } are both bases of the vector space 𝕍 , then k = 𝓁. Proof: Suppose that 𝓁 > k. As B1 is a basis the span of B1 coincides with 𝕍 . Thus, any element of 𝕍 can be written as a linear combination of the elements of B1 . In particular, this means that there are coefficients aij , j = 1, … , k such that k ∑ ui = aij 𝑣j , 1 ≤ i ≤ 𝓁. j=1

Let 𝒜 = {aij }i=1∶𝓁, j=1∶k . Then, by the method of elimination, the linear system 𝒜 x = 0 can be shown to have a nonzero solution x = (x1 , … , x𝓁 )T . Thus, 𝓁 ∑ i=1

x i ui =

k 𝓁 ∑ ∑

xi aij 𝑣j = 0

j=1 i=1

which means that B2 is linearly dependent and cannot be a basis.



As in the Euclidean case, the number of basis elements provides us with a value that can be viewed as the dimensionality of the space. This works in the expected way provided that the number of basis elements in finite. Specifically, if a vector space 𝕍 has a basis consisting of p < ∞ elements, it is said to have dimension p which we then denote by dim(𝕍 ) = p. However, it turns out that our labor to create the abstract vector space machinery has not moved us very far from our Euclidean starting point as a consequence of the next result. Theorem 2.2.5 Any p-dimensional vector space is isomorphic to ℝp .

22

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Proof: Let 𝕍 be a vector ∑p space with basis {𝑣1 , … , 𝑣p }. Each x can be uniquely written as x = i=1 ai (x)𝑣i and the mapping x → (a1 (x), … , ap (x)) is one-to-one and onto ℝp . ◽ When a vector space does not possess a finite dimensional basis, it is said to be infinite dimensional. Even the meaning of the term “basis” becomes open for debate in this instance. While the idea can be extended to deal with this eventuality, the process of doing so entails considerable complications (such as convergence of infinite expansions) that are not relevant for this book. Our focus will instead be on the more readily accessible case of orthonormal bases for separable Hilbert spaces that arise in the following section. We will subsequently have use for the following concept. Definition 2.2.6 Let 𝕍1 , 𝕍2 be vector spaces. A transformation T mapping 𝕍1 into 𝕍2 is said to be linear if T(a1 𝑣1 + a2 𝑣2 ) = a1 T(𝑣1 ) + a2 T(𝑣2 ) for all a1 , a2 ∈ ℝ and all 𝑣1 , 𝑣2 ∈ 𝕍1 . Example 2.2.7 The identity mapping I(x) = x is clearly linear. Example 2.2.8 In ℝp , linear transformations coincide with matrix multiplication. Things become much more interesting when we merge the metric and vector space concepts. This is accomplished by adding a measure of distance called a norm to obtain a normed vector space or just normed space. Definition 2.2.9 Let 𝕍 be a vector space. A norm on 𝕍 is a function ‖ ⋅ ‖ ∶ 𝕍 → ℝ such that 1. ‖𝑣‖ ≥ 0, 2. ‖𝑣‖ = 0 if 𝑣 = 0, 3. ‖a𝑣‖ = |a|‖𝑣‖, and 4. ‖𝑣1 + 𝑣2 ‖ ≤ ‖𝑣1 ‖ + ‖𝑣2 ‖ for all 𝑣, 𝑣1 , 𝑣2 ∈ 𝕍 and a ∈ ℝ. Property 4 in Definition 2.2.9 is typically referred to as the triangle inequality. This is consistent with the usage of this phrase in the previous section because of the following result. Theorem 2.2.10 If 𝕍 is a vector space with norm ‖ ⋅ ‖, then d(x, y) ∶= ‖x − y‖ for x, y ∈ 𝕍 is a metric.

VECTOR AND FUNCTION SPACES

23

Example 2.2.11 Let 𝕍 be ∑p a finite-dimensional vector space with basis {𝑣1 , … , 𝑣p }. Then, if 𝑣 = i=1 ai 𝑣i , ( p ∑

‖𝑣‖ =

)1∕2 a2i

i=1

is a norm. To verify the triangle inequality in this case, let 𝑤 = observe that ‖𝑣 + 𝑤‖ = 2

p ∑



i=1

i=1

bi 𝑣i and

(ai + bi )2

i=1 p ∑

∑p

a2i

+2

( p ∑

)1∕2 ( a2i

i=1

p ∑

)1∕2 b2i

i=1

+

p ∑

b2i

i=1

= (‖𝑣‖ + ‖𝑤‖) , 2

where the inequality follows from the Cauchy–Schwarz inequality. Example 2.2.12 The 𝓁 2 space of square summable consists of ∑∞sequences 2 elements of the form (x1 , x2 , …), where xi ∈ ℝ and i=1 xi < ∞. The vector space operations of addition and multiplication are defined by (x1 , x2 , …) + (y1 , y2 , …) = (x1 + y1 , x2 + y2 , …), a(x1 , x2 , …) = (ax1 , ax2 , …) and the norm is ( ‖(x1 , x2 , …)‖ =

∞ ∑

)1∕2 x2i

.

i=1

As shown subsequently in Example 2.3.6, 𝓁 2 is also complete. With the norm topology in place, we now possess one way to extend the idea of a linear span to allow for infinite dimensions. Definition 2.2.13 Let A be a subset of a normed vector space. The closed span of A, denoted by span(A), is the closure of span(A) with respect to the metric induced by the norm. This definition is quite important and arises in various contexts going forward. Thus, it is worthwhile to explore it in a bit more detail. We can think of

24

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

span(A) as being built up from all finite dimensional linear combinations of the elements of A. We then append all the limits of sequences of these linear combinations to obtain the final space of interest. This latter step is simplified somewhat when we can draw on completeness so that only Cauchy sequences need consideration. Finite dimensional linear combinations are therefore by construction dense in span(A). Theorem 2.2.14 Suppose that A is a subset of a separable, complete normed vector space 𝕏. Then, if {xn } is dense in A, span(A) consists of all finite dimensional linear combinations of the xn and the limits in 𝕏 of Cauchy sequences of such linear combinations. A particular norm may be difficult to work with for some purposes. In such instances, the following concept can be useful Definition 2.2.15 Two norms ‖ ⋅ ‖1 and ‖ ⋅ ‖2 on a vector space 𝕍 are equivalent if there are constants c, C ∈ (0, ∞) such that for all x ∈ 𝕍 , c‖x‖1 ≤ ‖x‖2 ≤ C‖x‖1 . The point here is that if ‖ ⋅ ‖1 and ‖ ⋅ ‖2 are equivalent, the metric spaces induced by the two norms have the same “topological” properties; for instance, the classes of open sets are the same, if one is complete then the other is also complete, etc. Thus, one can replace a particular norm with another that is equivalent, but possibly more mathematically tractable, if such a thing can be found. This is not so difficult in finite dimensions in view of the following theorem. Theorem 2.2.16 Let 𝕍 be a finite-dimensional normed space. Then, 1. all norms for 𝕍 are equivalent and 2. 𝕍 is a complete and separable metric space in any metric generated by a norm. Proof: By Theorem 2.2.5, we can assume without loss of generality that 𝕍 = ℝp . So, we first show the equivalence of the Euclidean norm ‖ ⋅ ‖1 and an arbitrary norm ‖ ⋅ ‖2 on ℝp . In this regard, let ei be a vector of all zeros except for a one ∑pas its ith component ∑p and write x = (x1 , … , xp ) and y = (y1 , … , yp ) as x = i=1 xi ei and y = i=1 yi ei , respectively. Let 𝜖 > 0 and suppose that

VECTOR AND FUNCTION SPACES

‖x − y‖1 < 𝜖∕(

∑p i=1

25

‖ei ‖2 ). Then,

|‖x‖2 − ‖y‖2 | ≤ ‖x − y‖2 p ( )∑ ≤ max |xi − yi | ‖ei ‖2 i

i=1

∑ p

≤ ‖x − y‖1

‖ei ‖2

< 𝜖.

i=1

So, f (⋅) ∶= ‖ ⋅ ‖2 is a continuous function from (ℝp , d1 ) to (ℝp , d2 ), where di is the metric defined by ‖ ⋅ ‖i . Now, consider the restriction of f to the unit sphere ‖x‖1 = 1. The range of this function is some closed interval [c, C] with c, C ∈ (0, ∞). As a result, when ‖x‖1 = 1, c ≤ ‖x‖2 ≤ C and, hence, ‖x‖2 c≤ ≤C ‖x‖1 for any x ≠ 0. We can argue similarly for another norm ‖ ⋅ ‖3 to see that c∗ ≤

‖x‖3 ≤ C∗ ‖x‖1

for 0 < c∗ ≤ C∗ < ∞. In combination, the two inequalities lead to c∗ C∗ ‖x‖2 ≤ ‖x‖3 ≤ ‖x‖2 C c and thereby establish the first part of the theorem. For part 2, we again need only consider ℝp with the Euclidean norm. The separability of ℝp was discussed in Example 2.1.8. The completeness of ℝp was the subject of Example 2.1.13. ◽ Both of the conclusions in Theorem 2.2.16 are generally false for infinite dimensions. Example 2.2.17 From Theorem 2.1.14, we know that the space C[0, 1] is complete under the sup norm. Another norm is provided by ( ‖f ‖ =

)1∕2

1

∫0

2

f (t)dt

.

26

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

But, by taking En = [.5(1 − 1∕n), .5(1 + 1∕n)] with fn (t) = (1 − 2n|.5 − t|)I(t ∈ En ), we produce a sequence of functions {fn } that is Cauchy but does not have a limit in C[0, 1].

2.3

Banach and 𝕃p spaces

As seen in Section 2.2, not every normed vector space is complete. However, many useful normed vector spaces are complete and those that exhibit this property receive the following designation. Definition 2.3.1 A Banach space is a normed vector space which is complete under the metric associated with the norm. As we saw earlier, the spaces C[0, 1] with the sup norm and finitedimensional normed vector spaces are examples of Banach spaces. For fda, a more important Banach space is the following. Definition 2.3.2 Let (E, ℬ, 𝜇) be a measure space and for p ∈ [1, ∞) denote by 𝕃p (E, ℬ, 𝜇) the collection of measurable functions f on E that satisfy ∫E |f |p d𝜇 < ∞. Define ( ‖f ‖p =

∫E

)1∕p |f | d𝜇 p

(2.3)

when f ∈ 𝕃p (E, ℬ, 𝜇). The collection of measurable functions f on E that are finite a.e. 𝜇 is denoted by 𝕃∞ (E, ℬ, 𝜇), for which we define ‖f ‖∞ = ess sup|f (s)| s∈E

= inf{x ∈ ℝ ∶ 𝜇(s ∶ |f (s)| > x) = 0}.

(2.4)

Of course, the notation ‖ ⋅ ‖p suggests that we are working with a norm. Properties 1 and 2 of Definition 2.2.9 can be easily verified for ‖ ⋅ ‖p , while property 4 follows from Minkowski’s inequality that we give in Theorem 2.3.4. Somewhat more problematic is showing that ‖f ‖p = 0 means that f = 0. Indeed, this is not true and we can only conclude that f = 0 a.e. 𝜇 in such an instance. To bypass this stumbling block, we must adopt the

VECTOR AND FUNCTION SPACES

27

convention that functions differing only on a set of 𝜇 measure 0 are identified as being the same function. That is, we define the equivalence relation ∼ by f ∼ g if

𝜇{x ∶ f (x) ≠ g(x)} = 0

and focus on the quotient space 𝕃p (E, ℬ, 𝜇)∕ ∼ of equivalence classes. For convenience, we continue to denote this space as 𝕃p (E, ℬ, 𝜇) or 𝕃p if the associated 𝜎-field and measure are clear from the context. With this modification, ‖ ⋅ ‖p defined in (2.3) and (2.4) are norms for their respective spaces that will be referred to as the 𝕃p norms. The key to working with the 𝕃p spaces is the Hölder and Minkowski inequalities. The former of these takes the form Theorem 2.3.3 For 1 ≤ p, q ≤ ∞ suppose that f1 ∈ 𝕃p and f2 ∈ 𝕃q with 1∕p + 1∕q = 1. Then, ‖f1 f2 ‖1 ≤ ‖f1 ‖p ‖f2 ‖q . Proof: The case of p = ∞, q = 1 is straightforward. It follows immediately upon observing that |fg| ≤ ‖f ‖∞ |g|. Now assume that p, q ∈ (0, ∞) with p−1 + q−1 = 1. In this event, for any two real number a1 , a2 > 0, we have p

q

log (a1 a2 ) = p−1 log (a1 ) + q−1 log (a2 ) p

q

≤ log (p−1 a1 + q−1 a2 ) as log is concave. Upon exponentiating both sides, we obtain Young’s inequality p q a1 a2 ≤ p−1 a1 + q−1 a2 . Now replace a1 , a2 by functions |f1 |, |f2 | for which ‖f1 ‖p = ‖f2 ‖q = 1 to finish the proof. ◽ With the aid of Hölder’s inequality, we can establish Minkowski’s inequality. As mentioned earlier, an important application of Minkowski’s inequality is establishing the triangle inequality for ‖ ⋅ ‖p . Theorem 2.3.4 Suppose that p ≥ 1 and f1 , f2 ∈ 𝕃p . Then ‖f1 + f2 ‖p ≤ ‖f1 ‖p + ‖f2 ‖p . Proof: The cases p = 1 and ∞ are straightforward. For p ∈ (1, ∞), first note that the function xp is convex for x ≥ 0. Thus, |f1 ∕2 + f2 ∕2|p ≤ (|f1 |p + |f2 |p )∕2 and, hence, |f1 + f2 |p ≤ 2p−1 (|f1 |p + |f2 |p ). Consequently, if f1 and f2 are both in 𝕃p so is their sum.

28

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS p

We can treat ‖f1 + f2 ‖p as being finite and employ the triangle inequality to obtain p

‖f1 + f2 ‖p ≤

∫E

|f1 ‖f1 + f2 |p−1 d𝜇 +

∫E

|f2 ‖f1 + f2 |p−1 d𝜇.

Now apply Hölder’s inequality to the two integrals using q = p∕(p − 1) to finish the proof. ◽ The following result is known as the Riesz–Fischer Theorem. Theorem 2.3.5 The space 𝕃p is complete for each p ≥ 1. Proof: We focus on p ∈ [1, ∞). Let {fn } be a Cauchy sequence in 𝕃p : i.e., lim sup ‖fm − fn ‖p = 0.

N→∞ m,n≥N

(2.5)

Choose an integer subsequence {nk } such that C ∶=

∞ ∑

‖fnk+1 − fnk ‖p < ∞.

k=1

By Minkowsky’s inequality, ∞ ||∑ || || || || |fnk+1 − fnk ||| ≤ C. || || || k=1 ||p

Thus,

∞ ∑

|fnk+1 (s) − fnk (s)| < ∞

(2.6)

k=1

a.e. 𝜇 and from the triangle inequality ∑

k2 −1

|fnk (s) − fnk (s)| ≤ 2

1

|fnk+1 (s) − fnk (s)|,

k=k1

for s ∈ E and k1 < k2 . This allows us to conclude that {fnk (s)} is Cauchy and hence convergent a.e. 𝜇. Now define { lim fn (s), if the limit exists and is finite, f (s) = k→∞ k 0, otherwise,

VECTOR AND FUNCTION SPACES

29

and fix arbitrary 𝜖 > 0. Using Fatou’s Lemma along with (2.5) reveals that for all n sufficiently large ∫E

|fn − f |p d𝜇 =

∫E

lim inf |fn − fnk |p d𝜇 k→∞

≤ lim inf k→∞

∫E

|fn − fnk |p d𝜇 < 𝜖.

Thus, f ∈ 𝕃p and ‖fn − f ‖p → 0.



The sequence spaces 𝓁 p are an important special case of the 𝕃p spaces. Example 2.3.6 The 𝓁 p space for p ∈ [1,∑∞] consists of elements of the p form x = (x1 , x2 , …), where xi ∈ ℝ and ∞ i=1 |xi | < ∞ if p ∈ [1, ∞) and maxi |xi | < ∞ if p = ∞. The norm for the space is {( ∑ )1∕p ∞ |xi |p , p ∈ [1, ∞), i=1 ‖x‖p = maxi |xi |, p = ∞. We can view this from the perspective of the 𝕃p (E, ℬ, 𝜇) spaces with E = ℤ+ and 𝜇 chosen to be the counting measure. Thus, the 𝓁 p spaces are also Banach spaces. The Banach space of random variables with finite pth moments will appear in a number of places going forward. Example 2.3.7 Let (Ω, ℱ, ℙ) be a probability space. Then, the space 𝕃p (Ω, ℱ, ℙ) contains random variables defined on (Ω, ℱ, ℙ) with ( )1∕p p ‖X‖p = |X| dℙ = (𝔼|X|p )1∕p < ∞, ∫Ω where 𝔼 denotes expected value. Next, we turn to the case where E = [0, 1], ℬ is the Borel 𝜎-field of [0, 1] and 𝜇 is Lebesgue measure. For convenience, we will refer to these 𝕃p spaces simply as 𝕃p [0, 1]. As 𝕃p contains equivalence classes, it is meaningless to speak of a function’s value f (t) at any one specific t for f ∈ 𝕃p [0, 1] because singletons in [0, 1] have Lebesgue measure zero. One might argue from this that smoothness concepts such as continuity are not relevant in 𝕃p [0, 1], which makes the following result especially interesting. Theorem 2.3.8 The set of equivalence classes that correspond to functions in C[0, 1] is dense in 𝕃p [0, 1].

30

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Proof: It suffices to show that for any uniformly bounded measurable function f on [0, 1] there exists a sequence of continuous functions gn on [0, 1] 1 such that ∫0 |f − gn |2 d𝜇 → 0. Toward this goal first recall (cf. Billingsley, 1995) that for any Borel set B of [0, 1] and any 𝜖 > 0, there exist intervals (ai , bi ), 1 ≤ i ≤ k, such that 𝜇(BΔ ∪ki=1 (ai , bi )) < 𝜖∕2 where Δ stands for symmetric difference; it is therefore possible to construct a continuous function f such that 1

∫0

1

|IB − f | d𝜇 ≤ p

∫0

1

|IB − I

∪ki=1 (ai ,bi )

| d𝜇 + p

∫0

|I∪k

i=1

(ai ,bi )

− f |p d𝜇 ≤ 𝜖,

where we have now used IA to represent the indicator function for a set A. Generalizing this from indicator to simple functions, we arrive at the conclusion that for any simple function f and any 𝜖 > 0 there exists a continuous 1 function g such that ∫0 |f − g|p d𝜇 < 𝜖. Recall that if f is a uniformly bounded, nonnegative measurable function then there exist a sequence of nonnegative simple functions fn such that fn ↑ f 1 uniformly and ∫0 |f − fn |p d𝜇 ↓ 0. Combining this fact with what we already know about approximating simple functions and applying Minkowsky’s inequality, we conclude that for any uniformly bounded, nonnegative measurable function f there exist a sequence of continuous functions gn such that 1 ∫0 |f − gn |p d𝜇 → 0. For a general uniformly bounded measurable function, it suffices to consider the positive parts and negative parts separately. ◽ Our last example in this section is concerned with functions having bounded variation on [0, 1]. As we will see in Section 3.2, such functions play an interesting role in characterizing a certain key property of the Banach space C[0, 1]. Example 2.3.9 The total variation of a function f on [0, 1] is TV(f ) = sup

n ∑

|f (ti ) − f (ti−1 )|,

(2.7)

i=1

where the supremum is taken over all partitions 0 = t0 < t1 < · · · < tn = 1 of the interval [0, 1]. If TV(f ) is finite, f is said to be of bounded variation. A function on [0, 1] is of bounded variation if and only if it can be expressed as the difference of two finite nondecreasing functions. Let the space BV[0, 1] be the collection of all functions of bounded variation on [0, 1] equipped with the norm ‖f ‖ = |f (0)| + TV(f ).

(2.8)

It can be verified that ‖f ‖ is a norm and that BV[0, 1] is a Banach space.

VECTOR AND FUNCTION SPACES

2.4

31

Inner Product and Hilbert spaces

Banach spaces provide one natural extension of finite dimensional, normed vector spaces. However, they do so without an immediate extension of the orthogonality concept that is so important in the Euclidean case. In some instances, this feature can be recovered by introducing an abstract analog of the dot or inner product. Definition 2.4.1 A function ⟨⋅, ⋅⟩ on a vector space 𝕍 is called an inner product if it satisfies 1. ⟨𝑣, 𝑣⟩ ≥ 0, 2. ⟨𝑣, 𝑣⟩ = 0 if 𝑣 = 0, 3. ⟨a1 𝑣1 + a2 𝑣2 , 𝑣⟩ = a1 ⟨𝑣1 , 𝑣⟩ + a2 ⟨𝑣2 , 𝑣⟩, and 4. ⟨𝑣1 , 𝑣2 ⟩ = ⟨𝑣1 , 𝑣2 ⟩ for every 𝑣, 𝑣1 , 𝑣2 ∈ 𝕍 and a1 , a2 ∈ ℝ. A vector space with an associated inner product is called an inner-product space. The following result establishes the connection between such spaces and the normed spaces of Section 2.3. Theorem 2.4.2 An inner product ⟨⋅, ⋅⟩ on a vector space 𝕍 produces a norm ‖ ⋅ ‖ defined by ‖𝑣‖ = ⟨𝑣, 𝑣⟩1∕2 for 𝑣 ∈ 𝕍 . The inner product then satisfies the Cauchy–Schwarz inequality |⟨𝑣1 , 𝑣2 ⟩| ≤ ‖𝑣1 ‖‖𝑣2 ‖

(2.9)

for 𝑣1 , 𝑣2 ∈ 𝕍 with equality in (2.9) if 𝑣1 = a1 + a2 𝑣2 for some a1 , a2 ∈ ℝ. ( ) Proof: Let 𝑣3 = 𝑣1 − ⟨𝑣1 , 𝑣2 ⟩∕‖𝑣2 ‖2 𝑣2 and observe that ⟨𝑣3 , 𝑣2 ⟩ = 0. Thus, ( ) ‖𝑣1 ‖2 = ‖𝑣3 + ⟨𝑣1 , 𝑣2 ⟩∕‖𝑣2 ‖2 𝑣2 ‖2 ( ) = ‖𝑣3 ‖2 + ‖ ⟨𝑣1 , 𝑣2 ⟩∕‖𝑣2 ‖2 𝑣2 ‖2 ≥ ⟨𝑣1 , 𝑣2 ⟩2 ∕‖𝑣2 ‖2 . The only way that equality can be achieved in this last expression is when ‖𝑣3 ‖ = 0.

32

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

To verify that ‖ ⋅ ‖ is a norm, we need only check the triangle inequality. This is accomplished with ‖𝑣1 + 𝑣2 ‖2 = ‖𝑣1 ‖2 + 2⟨𝑣1 , 𝑣2 ⟩ + ‖𝑣2 ‖2 ≤ ‖𝑣1 ‖2 + 2|⟨𝑣1 , 𝑣2 ⟩| + ‖𝑣2 ‖2 ≤ ‖𝑣1 ‖2 + 2‖𝑣1 ‖‖𝑣2 ‖ + ‖𝑣2 ‖ = (‖𝑣1 ‖ + ‖𝑣2 ‖)2 .



Although any inner product naturally defines a metric, the following example shows that not every metric space exhibits the structure that is necessary to also be an inner-product space. Example 2.4.3 The norm ‖f ‖ = sup{|f (x)| ∶ x ∈ [0, 1]} on C[0, 1] is not induced by an inner product. To see this consider the functions f , g ∈ C[0, 1] defined by f (x) ≡ 1, g(x) = x, x ∈ [0, 1]. Then, ‖f + g‖2 + ‖f − g‖2 = 4 + 1 = 5, 2(‖f ‖2 + ‖g‖2 ) = 2(1 + 1) = 4. This choice for f and g fails to satisfy the parallelogram rule ‖f + g‖2 + ‖f − g‖2 = ⟨f + g, f + g⟩ + ⟨f − g, f − g⟩ = 2(‖f ‖2 + ‖g‖2 ) that must hold for any two function with a norm-induced inner product. The standard norm for an inner-product space is the one defined through its inner product as in Theorem 2.4.2. Topological properties then derive from those of the corresponding metric space whose metric is the standard norm. With that in mind, we are lead to conclude that inner products are continuous functions under the norm induced topology. Theorem 2.4.4 Let {𝑣1n }, {𝑣2n } be sequences and 𝑣1 , 𝑣2 elements in an inner product space 𝕍 with inner product and norm ⟨⋅, ⋅⟩ and ‖ ⋅ ‖. If ‖𝑣i − 𝑣in ‖ → 0, i = 1, 2 then ⟨𝑣1n , 𝑣2n ⟩ → ⟨𝑣1 , 𝑣2 ⟩. Proof: The proof follows from |⟨𝑣1n , 𝑣2n ⟩ − ⟨𝑣1 , 𝑣2 ⟩| ≤ |⟨𝑣1n − 𝑣1 , 𝑣2n ⟩| + |⟨𝑣1 , 𝑣2n − 𝑣2 ⟩| ≤ ‖𝑣1n − 𝑣1 ‖‖𝑣2n ‖ + ‖𝑣1 ‖‖𝑣2n − 𝑣2 ‖.



VECTOR AND FUNCTION SPACES

33

Definition 2.4.5 A complete inner-product space is called a Hilbert space. Example 2.4.6 From Theorem 2.2.16, if follows that any finite-dimensional inner-product space is a Hilbert space. Example 2.4.7 The 𝓁 2 space from Example 2.3.6 is a Hilbert space. The inner product of elements 𝑣i = (𝑣i1 , 𝑣i2 , …), i = 1, 2 is ⟨𝑣1 , 𝑣2 ⟩ =

∞ ∑

𝑣1j 𝑣2j .

j=1

Inner product spaces provide the framework that is needed to extend the concepts of perpendicular vectors and subspaces to abstract settings. Definition 2.4.8 Elements x1 , x2 of an inner-product space 𝕏 are said to be orthogonal if ⟨x1 , x2 ⟩ = 0. A countable collection of elements {e1 , e2 , …} is said to be an orthonormal sequence if ‖ej ‖ = 1 for all j and the ej are pairwise orthogonal. The inner products of elements from an orthonormal sequence with the elements of its parent Hilbert space are of paramount interest in the study of inner-product spaces. To be precise, let {e1 , e2 , …} be an orthonormal sequence in an inner product space 𝕏 with associated inner product ⟨⋅, ⋅⟩. The corresponding generalized Fourier coefficients for x ∈ 𝕏 are ⟨x, ej ⟩, j = 1, … It is customary to drop the “generalized” from their name and refer to them simply as Fourier coefficients and we will adhere to that practice throughout the remainder of this text. The Fourier coefficients of an element are square summable due to the following result known as Besssel’s inequality. Theorem 2.4.9 Let {e1 , e2 , …} be an sequence in an ∑∞orthonormal 2 ≤ ‖x‖2 and, therefore, inner-product space 𝕏. For any x ∈ 𝕏, ⟨x, e ⟩ i i=1 ∑∞ i=1 ⟨x, ei ⟩ei converges in 𝕏. Proof: The result is an immediate consequence of the fact that n n || ||2 ∑ ∑ || || 0 ≤ ||x − ⟨x, ei ⟩ei || = ‖x‖2 − ⟨x, ei ⟩2 || || i=1 i=1 || ||

for all n.



The standard approach to creating an orthonormal sequence is the Gram–Schmidt algorithm described in the following theorem. The result is easily verified by induction.

34

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Theorem 2.4.10 Let {xn } be a countable collection of elements in a Hilbert space such that every finite subcollection of {xn } is linearly independent. Define e1 = x1 ∕‖x1 ‖ and ei = 𝑣i ∕‖𝑣i ‖ for 𝑣i = xi −

i−1 ∑ ⟨xi , ej ⟩ej . j=1

Then, {en } is an orthonormal sequence and span{xn } = span{en }. We would now like to expand the basis concept to the infinite dimensional Hilbert space context. The first step is to come to some agreement on precisely what the term “basis” might mean in this instance. With that in mind, the next definition provides one possible starting point. Definition 2.4.11 An orthonormal sequence {en } in a Hilbert space ℍ is called an orthonormal basis or a complete orthonormal system (CONS) if span{en } = ℍ. A simple application of the continuity of the inner product gives us one way to check that Definition 2.4.11 is applicable. Theorem 2.4.12 An orthonormal sequence {en } in a Hilbert space ℍ is a CONS if ⟨x, en ⟩ = 0 for all n implies that x = 0. This allows us to see that a CONS {en } provides a representation for the elements of a Hilbert space as linear combinations of the basis elements in an extended sense. Specifically, let {en } be a∑CONS for a Hilbert space ℍ and ∞ let x be any element of ℍ. Now take x̃ = j=1 ⟨x, ej ⟩ej which is well defined as a result of Bessel’s inequality. As, ⟨x − x̃ , ej ⟩ = 0 for every j, we reach the conclusion. Theorem 2.4.13 Every element x of a Hilbert space ℍ with CONS {ej } can be expressed in terms of the Fourier expansion x=

∞ ∑

⟨x, ej ⟩ej

(2.10)

j=1

and ‖x‖ = 2

∞ ∑

⟨x, ej ⟩2 .

(2.11)

j=1

The identity (2.11) strengthens Bessel’s inequality and is known as Parseval’s relation.

VECTOR AND FUNCTION SPACES

35

The CONS concept also provides a way to characterize separability. Theorem 2.4.14 A Hilbert space is separable if it has an orthonormal basis. Proof: Assume that {ej } is a CONS for the Hilbert space ℍ. By Theorem 2.4.13, it is easy to show that the countable subset of elements x with ⟨x, ej ⟩ in the set of rationals for all j is dense. Thus, ℍ is separable. Conversely, any countable dense subset of elements can be transformed into an orthonormal sequence via the Gram–Schmidt algorithm of Theorem 2.4.10. ◽ The next topic for consideration plays an important role in the representation of stochastic processes studied in Section 7.6. Definition 2.4.15 Two metric spaces (𝕄1 , d1 ) and (𝕄2 , d2 ) are said to be isometrically isomorphic or congruent if there exists a bijective function Ψ ∶ 𝕄2 → 𝕄1 such that d2 (x1 , x2 ) = d1 (Ψ(x1 ), Ψ(x2 )) for all x1 , x2 ∈ 𝕄2 . The case of most interest to us in when the two metric spaces in the definition are Hilbert spaces. In that case, we have the following. Theorem 2.4.16 Let ℍi , i = 1, 2, be Hilbert spaces with inner products ⟨⋅, ⋅⟩i , i = 1, 2. Suppose that for some index set E there are collections of vectors 𝕌i = {uit ∶ t ∈ E} such that span(𝕌i ) = ℍi , i = 1, 2. If for every s, t ∈ E ⟨u1s , u1t ⟩1 = ⟨u2s , u2t ⟩2 , (2.12) ℍ1 and ℍ2 are congruent.

Proof: The proof is an immediate consequence of the denseness of the sets 𝕌1 and 𝕌2 and the continuity of the inner product in a Hilbert space. ◽ An application of Theorem 2.4.16 gives us Theorem 2.4.17 Any infinite-dimensional separable Hilbert space is congruent to 𝓁 2 . Proof: Let {ej }∞ be an orthonormal basis for an infinite-dimensional j=1 separable Hilbert space ℍ. In 𝓁 2 , define the orthonormal basis {𝜙j }∞ with j=1 𝜙j being a sequence of all zeros except for a 1 as its jth entry. Now apply Theorem 2.4.16 with E = ℤ+ . ◽

36

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

We defined the Banach space 𝕃p (E, ℬ, 𝜇) in Section 2.3. In this collection of Banach spaces, the only one that is also a Hilbert space is 𝕃2 (E, ℬ, 𝜇) for which the inner product is defined as ⟨f1 , f2 ⟩ =

∫E

f1 f2 d𝜇

for f1 , f2 ∈ 𝕃2 (E, ℬ, 𝜇). In the fda, literature attention has focused on 𝕃2 [0, 1]: namely, the 𝕃2 space with E = [0, 1], ℬ the Borel 𝜎-field of [0, 1] and 𝜇 Lebesgue measure. The following result catalogs several Fourier bases for 𝕃2 [0, 1]. We note in passing that they all consist of continuous and, in fact, infinitely differentiable functions. Theorem 2.4.18 The following sets of functions √ B1 = {f0 (x) = 1, fn (x) = 2 cos(n𝜋x), n ≥ 1}, √ B2 = {gn (x) = 2 sin(n𝜋x), n ≥ 1} and B3 = {h0 (x) = 1, h2n−1 (x) =



2 sin(2n𝜋x), h2n (x) =



2 cos(2n𝜋x), n ≥ 1}

are all orthonormal bases for 𝕃2 [0, 1]. Proof: It is clear that B1 , B2 , and B3 are orthonormal. Hence, we need only show that they are bases. We begin with B1 and pick an arbitrary f ∈ 𝕃2 [0, 1] for which we wish to show that for each 𝜖 > 0 there exist an∑integer m𝜖 and m𝜖 real coefficients a1𝜖 , a2𝜖 , … , am𝜖 𝜖 with ‖f − f𝜖 ‖ < 𝜖 and f𝜖 = i=0 ai𝜖 fi . From Theorem 2.3.8, for any f ∈ 𝕃2 [0, 1] and 𝜖 > 0, there exist g ∈ C[0, 1] such that ‖f − g‖ < 𝜖∕2. Now observe that the function cos−1 is a continuous bijection, so that we can define the continuous function h(s) = g((1∕𝜋)cos−1 s) for s ∈ [−1, 1]. Then, it follows that (cf. Theorem 2.1.14) there is a polynomial p such that |h(s) − p(s)| < 𝜖∕2 uniformly in s ∈ [−1, 1]. Hence, writing k(x) = p(cos 𝜋x), x ∈ [0, 1], we have |g(x) − k(x)| = |h(cos 𝜋x) − p(cos 𝜋x)| < 𝜖∕2 for x ∈ [0, 1] so that ‖g − k‖ < 𝜖∕2 and ‖f − k‖ < 𝜖. As powers of cosine functions can be expressed as linear combinations of cosine function, the result has been shown. Now consider B2 . For f ∈ 𝕃2 [0, 1] and any 𝛿 > 0, define f𝛿 (x) = f (x)I(𝛿,1] (x). Given 𝜖 > 0, there obviously exists 𝛿 > 0 such that ‖f − f𝛿 ‖ < 𝜖∕2.

VECTOR AND FUNCTION SPACES

37

2 Now ∑m h(x) = f𝛿 (x)∕ sin x ∈ 𝕃 [0, 1] and so there is a function k(x) = i=0 ai cos(i𝜋x) such that ‖h − k‖ < 𝜖∕2. However,

‖h − k‖ = 2



𝛿

1 2

k (x)dx +

∫0 𝛿

∫0

∫𝛿

[

f𝛿 (x) sin2 (𝜋x)

]2 − k(x) dx

1

k2 (x) sin2 (𝜋x)dx +

∫𝛿

[f𝛿 (x) − k(x) sin(𝜋x)]2 dx

1

=

∫0

[f𝛿 (x) − k(x) sin(𝜋x)]2 dx = ‖f𝛿 (⋅) − k(⋅) sin(𝜋⋅)‖2

and k(⋅) sin(𝜋⋅) can be expressed as a linear combination of sine functions. Finally, consider B3 and suppose that it is not a basis. In that Case, there exists a nonzero function f ∈ 𝕃2 [0, 1] such that ⟨f , h0 ⟩ = ⟨f , hn ⟩ = 0, n ≥ 1. However, this means that 1

0=

1

f (x)dx =

∫0

∫−1

f (x∕2 + 1∕2)dx

1

=

[f (x∕2 + 1∕2) + f (−x∕2 + 1∕2)]dx,

∫0 1

0=

∫0

1

f (x) cos(2n𝜋x)dx =

∫−1

f (x∕2 + 1∕2) cos(n𝜋x + n𝜋)dx

1

= (−1)n

∫−1

f (x∕2 + 1∕2) cos(n𝜋x)dx

1

= (−1)n

∫0

[f (x∕2 + 1∕2) + f (−x∕2 + 1∕2)] cos(n𝜋x)dx

and, similarly, that 1

0 = (−1)n

∫0

[f (x∕2 + 1∕2) − f (−x∕2 + 1∕2)] sin(n𝜋x)dx

for each n ≥ 1. In view of what was already proved for B1 and B2 , we conclude that for any x ∈ [0, 1] f (x∕2 + 1∕2) − f (−x∕2 + 1∕2) = f (x∕2 + 1∕2) + f (−x∕2 + 1∕2) = 0 which, in turn, implies that f (x∕2 + 1∕2) = f (−x∕2 + 1∕2) = 0: i.e., f (x) = 0 for all x ∈ [0, 1]. ◽

38

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

2.5

The projection theorem and orthogonal decomposition

It is safe to say that almost every statistical problem eventually leads to some type of optimization problem with the classical Gauss–Markov Theorem providing an important case in point. Optimization in vector and function spaces becomes much more tractable when there is an inherent geometry that can be exploited to aid in the characterization of extrema. This is undoubtedly why Hilbert spaces have occupied such a central role in statistics. The following result is fundamental in optimization theory. Theorem 2.5.1 Let 𝕄 be a closed convex set in a Hilbert space ℍ. For every x ∈ ℍ, ‖x − y‖ has a unique minimizer x̂ in 𝕄 that satisfies ⟨x − x̂ , y − x̂ ⟩ ≤ 0

(2.13)

for all y ∈ 𝕄. Proof: Our proof follows along the lines of that in Luenberger (1969). First we observe that by the definition of infimum, we can obtain a sequence yn ∈ 𝕄 such that lim ‖x − yn ‖2 = inf ‖x − y‖2 . n→∞

y∈𝕄

The first step is to show that this sequence is Cauchy thereby insuring that it converges to some element in 𝕄. For this purpose, we use the identity || y + ym ||||2 ‖yn − ym ‖2 = 2‖x − yn ‖2 + 2‖x − ym ‖2 − 4||||x − n . 2 |||| || From convexity, (yn + ym )∕2 ∈ 𝕄 and, hence ‖yn − ym ‖2 ≤ 2‖x − ym ‖2 + 2‖x − ym ‖2 − 4 inf ‖x − y‖2 . y∈𝕄

Thus, {yn } is Cauchy and must have a limit x̂ that attains the infimum. If there were another element x̃ in 𝕄 that also attained the infimum, we could create the alternating series having yn = x̂ for n even and yn = x̃ for n odd. It is now trivially true that lim ‖x − yn ‖2 = inf y∈𝕄 ‖x − y‖2 and our previous argument can be used to see that the sequence is Cauchy. So, it must have a limit and that can only be true if x̂ = x̃ . Suppose now that there is a y ∈ 𝕄 for which ⟨x − x̂ , y − x̂ ⟩ > 0. Then, if we let x(a) = ay + (1 − a)̂x ∈ 𝕄 for a ∈ (0, 1), the derivative of ‖x − x(a)‖2 at a = 0 is negative. Thus, there is a choice of a that makes ‖x − x(a)‖2

VECTOR AND FUNCTION SPACES

39

smaller than ‖x − x̂ ‖2 which is a contradiction. On the other hand, if ⟨x − x̂ , y − x̂ ⟩ ≤ 0, ‖x − y‖2 = ‖x − x̂ ‖2 + 2⟨x − x̂ , x̂ − y⟩ + ‖̂x − y‖2 ≥ ‖x − x̂ ‖2 .



An important application of the previous result is the projection theorem. Theorem 2.5.2 Let 𝕄 be a closed linear subspace of a Hilbert space ℍ. For any element x ∈ ℍ, there exists a unique element of 𝕄 that minimizes ‖x − y‖ over y ∈ 𝕄. The minimizer x̂ is uniquely determined by the condition ⟨̂x, y⟩ = ⟨x, y⟩

(2.14)

for all y ∈ 𝕄. Proof: In view of Theorem 2.5.1, all we need to verify is (2.14). As 𝕄 is a subspace, we can choose y = 0 and y = 2̂x in (2.13) to see that ⟨x − x̂ , x̂ ⟩ = 0. Thus, (2.13) gives ⟨x − x̂ , y⟩ ≤ 0 for all y ∈ 𝕄. Replacing y by −y in this inequality finishes the proof.



The element x̂ in Theorem 2.5.2 is called the projection of x onto 𝕄. It is uniquely determined by the fact that the residual or error term x − x̂ is orthogonal to the entire 𝕄 subspace. The collection of all elements that have this orthogonality property is of independent interest. Definition 2.5.3 Let 𝕏 be an inner-product space with 𝕄 ⊂ 𝕏. The orthogonal complement of 𝕄 is the set 𝕄⊥ = {x ∈ 𝕏 ∶ ⟨x, y⟩ = 0 for all y ∈ 𝕄}. Theorem 2.5.2 can now be seen to have the consequence that when 𝕄 is a closed subspace every element x in ℍ can be uniquely expressed as x = x1 + x2

(2.15)

for x1 ∈ 𝕄 and x2 ∈ 𝕄⊥ . Definition 2.5.4 Let 𝕄1 and 𝕄2 be orthogonal subspaces of 𝕏; i.e., x1 ⊥ x2 for all xi ∈ 𝕄i . Then, the collection {x1 + x2 ∶ xi ∈ 𝕄i , i = 1, 2} is denoted by 𝕄1 ⊕ 𝕄2 and is referred to as the orthogonal direct sum of 𝕄1 and 𝕄2 . If 𝕄1 and 𝕄2 are not orthogonal but satisfy 𝕄1 ∩ 𝕄2 = {0}, then {x1 + x2 ∶

40

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

xi ∈ 𝕄i , i = 1, 2} is denoted by 𝕄1 + 𝕄2 and is referred to as the algebraic direct sum of 𝕄1 and 𝕄2 . Using the orthogonal direct sum notation, the decomposition in (2.15) can be restated as Theorem 2.5.5 Let 𝕄 be a closed subspace of a Hilbert space ℍ. Then ℍ = 𝕄 ⊕ 𝕄⊥ .

(2.16)

Some additional properties of the orthogonal complement are catalogued in our following theorem. Theorem 2.5.6 Let ℍ be a Hilbert space with 𝕄 a subset of ℍ. Then, 1. 𝕄⊥ is a closed subspace, 2. 𝕄 ⊂ (𝕄⊥ )⊥ , and 3. (𝕄⊥ )⊥ = 𝕄 if 𝕄 is a subspace. Proof: First, it is clear that 𝕄⊥ is a linear space. By the Cauchy–Schwarz inequality, the mapping x → ⟨x, a⟩ is continuous for any a ∈ 𝕏 . If x1 , x2 , … are in 𝕄⊥ and xn → x in 𝕏, then ⟨x, a⟩ = limn→∞ ⟨xn , a⟩ = 0 for all a ∈ 𝕄. Thus, 𝕄⊥ is closed. Part 2 of the theorem comes from observing that if x ∈ 𝕄 it must be orthogonal to every y ∈ 𝕄⊥ ; i.e., it is in (𝕄⊥ )⊥ . Finally, from parts 1 and 2, we conclude that 𝕄 ⊂ (𝕄⊥ )⊥ . By Theorem 2.5.5, any element x ∈ (𝕄⊥ )⊥ can be uniquely written as x = y + z for some y ∈ 𝕄 and z ∈ 𝕄⊥ ∩ (𝕄⊥ )⊥ . However, ⊥

𝕄 ∩ (𝕄⊥ )⊥ ⊂ 𝕄⊥ ∩ (𝕄⊥ )⊥ = {0}.

2.6



Vector integrals

Suppose now that we have a function f on a measure space (E, ℬ, 𝜇) that takes on values in a Banach space 𝕏. In the case of 𝕏 = ℝ, we are familiar with the Lebesgue integral ∫ fd𝜇 and may recall its definition as the limit of integrals of simple functions that take on only finitely many values. Here we wish to extend this idea to the case of a general Banach space. The resulting abstract notion of an integral is called a vector integral and there are various ways such integrals can be constructed. The objective of this section is to give a concise, but rigorous and self-contained, treatise concerning vector integrals that is appropriate for the target audience of this book. In particular, one place in fda where these

VECTOR AND FUNCTION SPACES

41

integrals are relevant is the definition of the mean element and covariance operator for a random element of a Hilbert space (Section 7.2). Readers interested in additional details on vector integrals may consult sources such as Diestel and Uhl (1977), Dunford and Schwarz (1988), and Yosida (1971). The integral we will focus on is due to Bochner (1933) and, accordingly, is referred to as the Bochner integral. The construction of these integrals parallels that of the Lebesgue integral in real analysis beginning with a definition of simple functions and their integrals. Definition 2.6.1 A function f ∶ E → 𝕏 is called simple if it can be represented as f (𝜔) =

k ∑

IEi (𝜔)gi

(2.17)

i=1

for some finite k, Ei ∈ ℬ and gi ∈ 𝕏. ∑k Definition 2.6.2 Any simple function f (𝜔) = i=1 IEi (𝜔)gi with 𝜇(Ei ) < ∞ for all i is said to be integrable and its Bochner integral is defined as k ∑

fd𝜇 = 𝜇(Ei )gi . (2.18) ∫E i=1 It is not difficult to verify that this definition does not depend on the particular representation of f . In particular, the Ei can be chosen without loss to be disjoint. To see this merely observe that if any two sets Ei and Ej in a simple function overlap, which part of the function can be rewritten in an equivalent form using disjoint sets as gi IEi ∩EC (𝜔) + gj IEj ∩EC (𝜔) + (gi + gj )IEi ∩Ej (𝜔) j

i

with AC indicating the complement of the set A. The previous definition is extended to a general measurable function from E to 𝕏 as follows. Definition 2.6.3 A measurable function f is said to be Bochner integrable if there exists a sequence {fn } of simple and Bochner integrable functions such that lim

n→∞ ∫E

‖fn − f ‖d𝜇 = 0.

(2.19)

In this case, the Bochner integral of f is defined as ∫E

fd𝜇 = lim

n→∞ ∫E

fn d𝜇.

(2.20)

42

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

To see that this definition has merit first observe from (2.18) and the triangle inequality that || || || fd𝜇 || ≤ (2.21) ||∫ || ∫ ‖f ‖d𝜇 || E || E for any simple function f . When applied to the simple function fn − fm in (2.19) this inequality produces || || || fn d𝜇 − fm d𝜇|| ≤ ‖fn − fm ‖d𝜇. ||∫ || ∫ ∫E || E || E However, this upper bound is just a Lebesgue integral and the triangle inequality assures us that ∫E

‖fn − fm ‖d𝜇 ≤

∫E

‖f − fn ‖d𝜇 +

∫E

‖f − fm ‖d𝜇

which converges to zero as m, n → ∞ by (2.19). This shows that {∫E fn d𝜇} is a Cauchy sequence and the completeness of 𝕏 can now be invoked to conclude that the limit in (2.20) must exist. It is also independent of the approximating sequence since we can combine two approximating sequences into a third that must also be convergent. The definition of the Bochner integral relies on the existence of an approximating sequence of simple and Bochner integrable functions {fn }. Of course, this condition may not be satisfied for any particular function and we would at least like to have a sufficient condition that we could more easily check to see if it were true. The following result serves that purpose for a scenario that is sufficiently general to be applicable in most fda applications. Theorem 2.6.4 Let f be a measurable function from E to 𝕏 with ∫E

‖f ‖d𝜇 < ∞.

(2.22)

Suppose that for each n there exists a finite-dimensional subspace 𝕏n of 𝕏 such that lim ‖f − gn ‖d𝜇 = 0 (2.23) n→∞ ∫E for some measurable gn taking value in 𝕏n . Then, there exist simple and Bochner integrable functions fn such that (2.19) holds. Proof: Define

̃ n = 𝕏n ∩ {g ∈ 𝕏 ∶ ‖g‖ ∈ [n−1 , n]} 𝕏

VECTOR AND FUNCTION SPACES

and

43

̃ n }. En = {𝜔 ∈ E ∶ gn (𝜔) ∈ 𝕏

Markov’s inequality produces 𝜇(En ) ≤ n

∫En

‖gn ‖d𝜇 ≤ n

∫E

‖gn ‖d𝜇 < ∞.

(2.24)

̃ n is bounded and finite-dimensional, it is totally bounded (cf. As 𝕏 Section 2.1). Thus, the Heine–Borel theorem (Theorem 2.1.17) can be ̃ n such that invoked to see that there is a finite partition Bi , 1 ≤ i ≤ k, of 𝕏 each Bi is in the Borel 𝜎-field for 𝕏 and has diameter less than (n𝜇(En ))−1 . For an arbitrary element bi ∈ Bi , set fn (𝜔) =

k ∑

bi I{gn (𝜔)∈Bi } .

i=1

Note that fn = 0 on Enc and hence (2.24) entails that fn is simple and Bochner integrable as a simple function. Also, by construction, ‖fn (𝜔) − gn (𝜔)‖ ≤ max sup ‖bi − x‖ ≤ (n𝜇(En ))−1 1≤i≤k x∈B

(2.25)

i

for 𝜔 ∈ En . Thus, by the triangle inequality, ∫E

‖fn − f ‖d𝜇

≤ =

∫E

‖fn − gn ‖d𝜇 +

∫En

∫E

‖fn − gn ‖d𝜇 +

‖gn − f ‖d𝜇

∫Ec

‖fn − gn ‖d𝜇 +

n

∫E

‖gn − f ‖d𝜇.

The first and third terms in the last expression tend to zero by (2.25) and (2.23), respectively, while the second term reduces to ∫Ec ‖gn ‖d𝜇 because fn = n 0 on Enc . Thus, to verify (2.19), we only need to establish that lim

n→∞ ∫Ec

‖gn ‖d𝜇 = 0,

n

or, equivalently, by (2.23) that lim

n→∞ ∫E

‖f ‖I(‖gn ‖ > n)d𝜇 = 0

(2.26)

44

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

and that lim

n→∞ ∫E

‖f ‖I(‖gn ‖ < n−1 )d𝜇 = 0.

(2.27)

First, from (2.23) and Markov’s inequality, 𝜇(‖gn ‖ > n) ≤ n−1

∫E

‖gn ‖d𝜇 → 0.

As ‖f ‖ is integrable, (2.26) follows easily (cf. Exercise 5.6 of Resnick, 1998) from this relation. To show (2.27), for 𝜖 > 0, write ∫E

‖f ‖I(‖gn ‖ < n−1 )d𝜇 =

∫E

‖f ‖I(‖gn ‖ < n−1 , ‖f ‖ > 𝜖)d𝜇

+ ‖f ‖I(‖gn ‖ < n−1 , ‖f ‖ ≤ 𝜖)d𝜇 ∫E so that lim sup n→∞

∫E

‖f ‖I(‖gn ‖ < n−1 )d𝜇 ≤ lim sup n→∞

∫E

‖f ‖I(‖gn ‖ < n−1 , ‖f ‖ > 𝜖)d𝜇

+ ‖f ‖I(‖f ‖ ≤ 𝜖)d𝜇. ∫E The first term on the right of the inequality is zero as 𝜇(‖gn ‖ < n−1 , ‖f ‖ > 𝜖) ≤ ≤

∫E ‖f − gn ‖I(‖gn ‖ < n−1 , ‖f ‖ > 𝜖)d𝜇 𝜖 − n−1 ∫E ‖f − gn ‖d𝜇 𝜖 − n−1

→0

due to Markov’s inequality and (2.23). Hence, (2.27) follows by now letting 𝜖 → 0 and applying Lebesgue’s dominated convergence theorem. ◽ The following result shows that the existence of gn in (2.23) is guaranteed if 𝕏 is a separable Hilbert space. Theorem 2.6.5 Suppose 𝕏 is a separable Hilbert space and f is a measurable function from E to 𝕏 with ∫E ‖f ‖d𝜇 < ∞. Then, f is Bochner integrable. Proof: If suffices to verify (2.23) in Theorem 2.6.4. For that purpose, take 𝕏n to be span{e1 , … , en } for any CONS {ej }∞ of 𝕏 and gn the projection j=1 of f on 𝕏n . Then, (2.23) follows from Lebesgue’s dominated convergence theorem. ◽

VECTOR AND FUNCTION SPACES

45

Another connection with Lebesgue integration is the following Bochner integral version of the dominated convergence theorem. Theorem 2.6.6 Let {fn } be a sequence of Bochner integrable functions in 𝕏 that converges to some f ∈ 𝕏. If there is a nonnegative Lebesgue integrable function g such that ‖fn ‖ ≤ g for all n a.e. 𝜇, then f is Bochner integrable and ∫E fd𝜇 = lim ∫E fn d𝜇. n→∞

Proof: In combination, ‖f − fn ‖ ≤ 2g and ‖f − fn ‖ → 0 allow us to apply the Lebesgue dominated convergence theorem to obtain ∫E ‖f − fn ‖d𝜇 → 0. As the fn are Bochner Integrable, we may find simple functions f̃n such that ∫E ‖fn − f̃n ‖d𝜇 → 0. These new functions satisfy (2.19) because ∫E

‖f − f̃n ‖d𝜇 ≤

∫E

‖f − fn ‖d𝜇 +

∫E

‖fn − f̃n ‖d𝜇 ◽

and the theorem has been proved.

The Bochner integral also has a feature similar to the monotonicity of the Lebesgue integral: namely, Theorem 2.6.7 If f is Bochner integrable, then ‖∫E fd𝜇‖ ≤ ∫E ‖f ‖d𝜇. Proof: Let fn = Then,

∑n i=1

gi IEi (𝜔) be a simple function with Ei ∩ Ej = 𝜙 for i ≠ j.

n || || || ||∑ || fn (𝜔)d𝜇 || = |||| gi 𝜇(Ei )|||| ||∫ || || || || E || || i=1 || n ∑ ≤ ‖gi ‖𝜇(Ei ) = i=1

∫E

‖fn ‖d𝜇.

So, if {fn } is a sequence of simple Bochner integrable functions that satisfie (2.19), || || || || || || || fd𝜇 || ≤ || fd𝜇 − fn d𝜇 || + || fn d𝜇 || ||∫ || ||∫ | | | | || ∫ ∫E || E || || E || || E || || || ≤ |||| fd𝜇 − fn d𝜇 |||| + ‖fn ‖d𝜇 ∫E ||∫E || ∫E || || ≤ |||| fd𝜇 − fn d𝜇 |||| + ‖fn − f ‖d𝜇 + ‖f ‖d𝜇 ∫E ∫E ||∫E || ∫E and the result follows upon taking limits with respect to n.



46

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

2.7

Reproducing kernel Hilbert spaces

The concept of a reproducing kernel Hilbert space (RKHS) owes its origin to the work of Moore (1916) and Aronszajn (1950). An overview of statistical applications for RKHS theory is provided in Berlinet and Thomas-Agnan (2004). We will see RKHSs arise in Chapter 6 and as part of the representation theory for second-order stochastic processes discussed in Section 7.6. Throughout this section, we restrict our attention to the case where ℍ is a Hilbert space of real valued functions defined on some set E. With this being understood, we can now define the reproducing kernel concept. Definition 2.7.1 A bivariate function K on E × E is said to be a reproducing kernel (rk) for ℍ if 1. for every t ∈ E, K(⋅, t) ∈ ℍ and 2. K satisfies the reproducing property that for every f ∈ ℍ and t ∈ E f (t) = ⟨f , K(⋅, t)⟩.

(2.28)

When ℍ possesses an rk it is said to be an RKHS. Example 2.7.2 Let ℍ be a finite-dimensional Hilbert space with {e1 , … , ep } an associated orthonormal basis. Now define K(s, t) =

p ∑

ei (s)ei (t)

i=1

for s, t ∈ E. Clearly, K(⋅, t) ∈ ℍ. Also, for 1 ≤ j ≤ p, ⟨ej , K(⋅, t)⟩ =

p ∑

⟨ej , ei ⟩ei (t) = ej (t)

i=1

from which the reproducing property follows at once. Thus, ℍ is an RKHS. The theory of reproducing kernels is driven by the properties of nonnegative definite functions. We say that a bivariate function on a set E × E is nonnegative definite, or nonnegative for short when there is no ambiguity, if for any set of real numbers {aj }nj=1 , any set of elements t1 , … , tn from E, and any n ∈ ℤ+ n n ∑ ∑ ai aj K(ti , tj ) ≥ 0. (2.29) i=1 j=1

VECTOR AND FUNCTION SPACES

47

Bivariate functions that are nonnegative definite and symmetric in their arguments are sometimes referred to as kernels. The following two results are fundamental. Theorem 2.7.3 The rk K of an RKHS ℍ is unique, symmetric with K(s, t) = K(t, s) for all s, t ∈ E and nonnegative definite. Proof: Suppose that there are two kernels K1 and K2 for ℍ. Then, by the reproducing property f (t) = ⟨f , K1 (⋅, t)⟩ = ⟨f , K2 (⋅, t)⟩, for all f and all t which means that ⟨f , K1 (⋅, t) − K2 (⋅, t)⟩ = 0 for all f and all t. Thus, K1 = K2 . Symmetry is a consequence of the reproducing property as K(s, t) = ⟨K(⋅, t), K(⋅, s)⟩ = ⟨K(⋅, s), K(⋅, t)⟩ = K(t, s). However, this also means that n n ∑ ∑

ai aj K(ti , tj ) =

i=1 j=1

n n ∑ ∑

ai aj ⟨K(⋅, ti ), K(⋅, tj )⟩

i=1 j=1

⟨ =

n ∑

ai K(⋅, ti ),

i=1

n ∑

⟩ ai K(⋅, ti )

≥ 0. ◽

i=1

The following result is known as the Moore–Aronszajn Theorem. Theorem 2.7.4 Suppose that K(s, t), s, t ∈ E, is a symmetric and positive definite function. Then, there is a unique Hilbert space ℍ(K) of functions on E with K as its rk. Proof: Set

{

ℍ0 ∶= span{K(⋅, t) ∶ t ∈ E} =

n ∑

} ai K(⋅, ti ) ∶ ai ∈ ℝ, ti ∈ E, n ∈ ℤ

i=1

Define the bivariate function ⟨⋅, ⋅⟩0 on ℍ0 by ⟨m ⟩ n m n ∑ ∑ ∑ ∑ ai K(⋅, si ), bj K(⋅, tj ) ∶= ai bj K(si , tj ). i=1

j=1

0

i=1 j=1

+

.

48

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Clearly, ⟨⋅, ⋅⟩0 is bilinear and the assumption that K is nonnegative definite implies that ⟨f , f ⟩0 ≥ 0 for f ∈ ℍ0 . To establish that ⟨⋅, ⋅⟩0 is an inner product, it suffices to verify that ⟨f , f ⟩0 = 0 means that f = 0. Note that ⟨f , K(⋅, t)⟩0 = f (t) for f ∈ ℍ0 , t ∈ E and |⟨f , g⟩0 |2 ≤ ⟨f , f ⟩0 ⟨g, g⟩0 if f , g ∈ ℍ0 . So, if ⟨f , f ⟩0 = 0, for t ∈ E, |f (t)|2 = |⟨f , K(⋅, t)⟩0 |2 ≤ ⟨f , f ⟩0 K(t, t) = 0, showing that f = 0. Thus, ⟨⋅, ⋅⟩0 is indeed an inner product on ℍ0 . Now we proceed to complete ℍ0 . Suppose that {fn }∞ is a Cauchy sequence n=1 in ℍ0 . By the Cauchy–Schwarz inequality, for each t ∈ E and n1 , n2 , |fn1 (t) − fn2 (t)| = |⟨fn1 − fn2 , K(⋅, t)⟩0 | ≤ ‖fn1 − fn2 ‖0 K 1∕2 (t, t). This implies that {fn (t)}∞ is a Cauchy sequence in ℝ and hence has a limit. n=1 Thus, for any Cauchy sequence {fn }∞ in ℍ0 , we can define a function f by n=1 taking the pointwise limit of fn . If the limit f also happens to be in ℍ0 , then ‖fn − f ‖0 → 0. To see why this is true, assume without loss of generality that f = 0 and consider the identity ‖fm − fn ‖20 = ‖fn ‖20 + ‖fm ‖20 − 2⟨fn , fm ⟩0 . For fixed n, the pointwise convergence of fm to 0 implies that ⟨fn , fm ⟩0 → 0 as m → ∞, which entails that lim sup ‖fm − fn ‖20 = ‖fn ‖20 + lim sup ‖fm ‖20 . m→∞

m→∞

As {fn } is Cauchy, letting n → ∞ on both sides shows that ‖fn ‖0 → 0. Let ℍ be the collection of functions on E that are the pointwise limits of all Cauchy sequences in ℍ0 . For f , g ∈ ℍ, suppose that {fn }∞ and {gn }∞ are n=1 n=1 arbitrary Cauchy sequences in ℍ0 that converge to f and g pointwise. It is easy to see that ⟨fn , gn ⟩0 is also a Cauchy sequence and has a limit. Moreover, if {f̃n }∞ and {̃gn }∞ are another pair of Cauchy sequences in ℍ0 that converge n=1 n=1 to f and g pointwise, then fn − f̃n and gn − g̃ n both converge to 0 pointwise and

VECTOR AND FUNCTION SPACES

49

we know from the previous paragraph that they both converge to 0 in norm in ℍ0 . This implies that the limit of ⟨fn , gn ⟩0 only depends on f , g. Thus, the bivariate function ⟨f , g⟩ ∶= lim ⟨fn , gn ⟩0 n→∞

is well defined. It is straightforward to establish that ⟨⋅, ⋅⟩ is an inner product and ℍ is an RKHS with rk. K. Finally, we demonstrate that ℍ with this choice for its inner product must be the only Hilbert space for which K is the rk. Indeed, suppose to the contrary that there is another Hilbert space 𝔾 with inner product ⟨⋅, ⋅⟩𝔾 for which K is the rk. Then, it is easy to see that 𝔾 contains ℍ0 and hence ℍ as subspaces. By Theorem 2.5.5, we can write 𝔾 = ℍ ⊕ ℍ⊥ . For any f ∈ ℍ⊥ , we have f (t) = ⟨f , K(⋅, t)⟩𝔾 = 0 for t ∈ E which shows that ℍ⊥ = {0}.



The notation ℍ(K) in Theorem 2.7.4 will denote the RKHS with rk. K throughout the book. A fundamental implication of the proof of Theorem 2.7.4 is that linear combinations of the kernel functions that correspond to finitely many elements from E are dense in ℍ(K). With that in mind, we can establish the following result. Theorem 2.7.5 If E is a separable metric space and K is continuous on E × E, ℍ(K) is separable and the functions in ℍ(K) are continuous on E. Proof: By assumption, there is a countable set {t1 , t2 , …} that is dense in E. If f ∈ ℍ(K), |f (t) − f (s)| = |⟨f , K(⋅, t) − K(⋅, s)⟩| ≤ ‖f ‖‖K(⋅, t) − K(⋅, s)‖. However, ‖K(⋅, t) − K(⋅, s)‖2 = K(t, t) − 2K(t, s) + K(s, s) which converges to zero as t → s by the continuity of K.

(2.30)

50

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

By (2.30) and the proof of Theorem 2.7.4, the collection of functions { n } ∑ + ai K(⋅, ti ) ∶ ai ∈ ℚ, n ∈ ℤ , i=1

with ℚ denoting the set of rationals, is dense in ℍ(K). This implies that ℍ(K) is separable. ◽ A property of RKHSs that sets them apart from many function spaces is that norm convergence of functions in the RKHS entails point-wise convergence as well. Theorem 2.7.6 Let ℍ be an RKHS containing functions on E. If f , f1 , f2 , … are functions in ℍ such that limn→∞ ‖fn − f ‖ = 0, then limn→∞ |fn (t) − f (t)| = 0 for all t ∈ E. The convergence is uniform if supt∈E K(t, t) < ∞. Proof: By the reproducing property and the Cauchy–Schwarz inequality |fn (t) − f (t)| ≤ ‖fn − f ‖K 1∕2 (t, t) ◽

from which the result follows.

It turns out that this theorem can be strengthened. Boundedness of the evaluation functionals 𝓁t (t) ∶= f (t), t ∈ E actually characterizes an RKHS as a result of Theorem 3.2.3 in Chapter 3. The problem that is generally encountered in an fda setting is that we are in possession of a kernel that is a nonnegative definite function, namely, a covariance kernel (cf. Theorem 7.3.1). The interest is then in determining the form of the corresponding RKHS. The following result provides a useful tool for making such determinations. It is typically referred to as the integral representation theorem; see, e.g., Parzen (1970). Theorem 2.7.7 Let K be a function over the set E × E which admits the representation K(t, t′ ) =

∫s

g(t, s)g(t′ , s)d𝜇(s)

(2.31)

for t, t′ ∈ E, where (S, ℬ, 𝜇) is a measure space and {g(t, ⋅) ∶ t ∈ E} a collection of functions in 𝕃2 (S, ℬ, 𝜇). Then, the RKHS ℍ corresponding to K consists of all functions of the form f (t) =

∫s

F(s)g(t, s)d𝜇(s)

(2.32)

VECTOR AND FUNCTION SPACES

51

for some unique element F ∈ span{g(t, ⋅) ∶ t ∈ E} ⊂ 𝕃2 (S, ℬ, 𝜇). The RKHS norm for f ∈ ℍ is ‖f ‖ = ‖F‖2 (2.33) with ‖ ⋅ ‖2 and ⟨⋅, ⋅⟩2 the L2 (S, ℬ, 𝜇) norm and inner products. Proof: It suffices to observe that K(t, t′ ) = ⟨K(t, ⋅), K(t′ , ⋅)⟩ = ⟨g(t, ⋅), g(t′ , ⋅)⟩2 which establishes a congruence relationship between ℍ and span{g(t, ⋅) ∶ t ∈ E} as a result of Theorem 2.4.16. If f ∈ ℍ is the image of F ∈ span{g(t, ⋅) ∶ t ∈ E} under this congruence, the reproducing property gives f (t) = ⟨f , K(t, ⋅)⟩ = ⟨F, g(t, ⋅)⟩2 ◽

as was to be shown.

Example 2.7.8 A simple, but fundamentally important, example of an RKHS can be obtained when E is finite-dimensional. Thus, let E = {t1 , … , tp } in which case the kernel K is equivalent to the matrix 𝒦 = {K(ti , tj )}i, j=1∶p . The RKHS is now found to be the set of functions on E of the form f (⋅) =

p ∑

ai K(⋅, ti ),

i=1

where (a1 , … , ap ) is perpendicular to the null space of 𝒦: i.e., the set of vectors a for which 𝒦a = 0. Note that f (⋅) can take on only p values which means that it has a p-vector representation as (f (t1 ), … , f (tp ))T = 𝒦a for a = (a1 , … , ap )T . For notational clarity in this instance, we will use f (⋅) to indicate its representation as a function on E and f to denote its vector form. With that convention, the inner product between f1 (⋅), f2 (⋅) ∈ ℍ(K) is ⟨f1 (⋅), f2 (⋅)⟩ = f1T 𝒦− f2

(2.34)

with 𝒦− any generalized inverse of 𝒦: i.e., any matrix that satisfies 𝒦𝒦− 𝒦 = 𝒦. The Moore–Penrose generalized inverse of 𝒦 (see, e.g., Section 3.5) is one possible choice for 𝒦− and, of course, we use 𝒦− = 𝒦−1 when 𝒦 is invertible. Example 2.7.9 An elementary illustration of the integral representation theorem arises from the kernel K(s, t) = min(s, t) 1

=

∫0

(t − u)0+ (s − u)0+ du

52

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

defined on [0, 1] × [0, 1] with x+ being 0 for x ≤ 0 and x otherwise. The RKHS ℍ(K) generated by this kernel consists of all functions of the form 1

f (t) = with

∫0

t

(t − u)0+ F(u)du =

∫0

F(u)du

1

‖f ‖2 =

∫0

F2 (u)du.

Another way to describe ℍ(K) is as the set of all absolutely continuous (with respect to Lebesgue measure) functions that have square integrable derivatives. More general spaces of this variety are the topic of the following section. It is also of interest to consider what transpires when we add or difference reproducing kernels. In the case of sums of rks, we state the following result from Aronszajn (1950). Theorem 2.7.10 Let K1 , K2 be nonnegative kernels with RKHSs ℍ(Ki ), i = 1, 2 that have norms ‖ ⋅ ‖i , i = 1, 2. Then, K = K1 + K2 is the rk of the set of all functions of the form f1 + f2 with fi ∈ ℍ(Ki ), i = 1, 2 under the norm ‖f ‖2 =

min

fi ∈ℍ(Ki ),i=1,2∶f =f1 +f2

{‖f1 ‖21 + ‖f2 ‖22 }.

(2.35)

Now let us turn to the case of kernel differences. In that instance, we will write K1 ≪ K2 if K2 − K1 is nonnegative definite in the sense of (2.29). Theorem 2.7.11 If K1 ≪ K2 , ℍ(K1 ) ⊂ ℍ(K2 ) and the norms ‖ ⋅ ‖1 and ‖ ⋅ ‖2 for ℍ(K1 ) and ℍ(K2 ) satisfy ‖f1 ‖2 ≤ ‖f1 ‖1 for every f1 ∈ ℍ(K1 ). Proof: As K3 = K2 − K1 is nonnegative definite, Theorem 2.7.4 has the consequence that it generates a Hilbert space ℍ(K3 ) for which it is the rk. However, from Theorem 2.7.10, this means that K2 = K3 + K1 is the rk for the space of all functions of the form f3 + f1 with f3 ∈ ℍ(K3 ) and f1 ∈ ℍ(K1 ). The theorem follows from this fact by taking f3 = 0. ◽ A much stronger result than Theorem 2.7.11 is the theorem below that was proved by Aronszajn (1950). Theorem 2.7.12 Let K1 , K2 be nonnegative kernels. Then, ℍ(K1 ) ⊂ ℍ(K2 ) if there exists a positive constant B such that K1 ≪ BK2 .

VECTOR AND FUNCTION SPACES

53

In addition to sums and differences of reproducing kernels, we will also encounter spaces that derive from their products. Thus, suppose that K1 , K2 are rks for RKHSs ℍ1 , ℍ2 consisting of functions on E. Let ‖ ⋅ ‖i , ⟨⋅, ⋅⟩i and {eij }∞ be, respectively, the norm and inner product and a CONS for ℍi , i = j=1 1, 2. Then, the direct product. Hilbert space ℍ ∶= ℍ1 ⊗ ℍ2 can be derived in the following manner. For t = (t1 , t2 ) ∈ E × E, we first consider functions of the form g(t) =

n ∑

g1i (t1 )g2i (t2 )

(2.36)

i=1

with∑ the g1i being functions in ℍ1 and the g2i deriving from ℍ2 . Then, given m f = i=1 f1i f2i for fij ∈ ℍi , we take the inner product of f and g to be ⟨g, f ⟩ =

n m ∑ ∑

⟨g1i , f1j ⟩1 ⟨g2i , f2j ⟩2

(2.37)

i=1 j=1

and, as usual, have ‖g‖2 = ⟨g, g⟩. It is not difficult to see that these choices produce a valid norm and inner product for the space of functions of the form (2.36). To this point, we have succeeded in constructing a pre-Hilbert space. To complete it, we include all functions of the form g=

∞ ∞ ∑ ∑

aij e1i e2j

(2.38)

i=1 j=1

such that ‖g‖2 ∶=

∞ ∞ ∑ ∑

a2ij < ∞.

i=1 j=1

The inner product of g with f =

∑∞ ∑∞ i=1

⟨g, f ⟩ =

j=1

∞ ∞ ∑ ∑

bij e1i e2j is defined as aij bij .

(2.39)

i=1 j=1

It is clear that sums of finitely many products fall into this new category of functions and that the inner products (2.37) and (2.39) for ℍ coincide in that instance. With a little additional work, one may verify that every function of type (2.38) admits a representation as a function of type (2.36) and thereby conclude that the formulation provided by (2.38) and (2.39) is, in fact, the completion of our initial pre-Hilbert space construction.

54

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Note that if g ∈ ℍ1 ⊗ ℍ2 , for every s, t ∈ E, |g(s, t)| ≤

∞ ∞ ∑ ∑

|aij ‖e1i (s)‖e2j (t)|

i=1 j=1



∞ ∑

(∞ )1∕2 ( ∞ )1∕2 ∑ ∑ |e1i (s)| e22j (t) a2ij .

i=1

j=1

j=1

However, ∞ ∑

e22j (t) =

j=1

∞ ∑

⟨e2j , K2 (⋅, t)⟩2 e2j (t)

j=1

= K2 (t, t) and, similarly, ∞ ∑ i=1

( |e1i (s)|

∞ ∑

)1∕2 a2ij

j=1

Therefore, |g(s, t)| ≤



(∞ ∑

)1∕2 ( e21i (s)

i=1

√ = K1 (s, s)‖g‖. √

∞ ∞ ∑ ∑

)1∕2 a2ij

i=1 j=1

√ K1 (s, s) K2 (t, t)‖g‖.

This means that point evaluation of functions in ℍ is a bounded linear operation, which, according to Theorem 3.2.3 of Chapter 3, insures that ℍ is an RKHS. The obvious candidate for the rk is K(s, t, s′ , t′ ) ∶= K1 (s, t)K2 (s′ , t′ ). For fixed t, t′ ∈ E, K(s, t, s′ , t′ ) is clearly in the space as a function of s, s′ because K1 (⋅, t) ∈ ℍ1 and K2 (⋅, t′ ) ∈ ℍ2 . As, g(s, t) =

∞ ∞ ∑ ∑

aij e1i (s)e2j (t)

i=1 j=1

=

∞ ∞ ∑ ∑

aij ⟨e1i , K1 (⋅, s)⟩1 ⟨e2j , K2 (⋅, t)⟩2

i=1 j=1

=

∞ ∞ ∑ ∑

aij ⟨e1i (⋅)e2j (⋆), K(⋅, s, ⋆, t)⟩

i=1 j=1

we have proved the following theorem.

VECTOR AND FUNCTION SPACES

55

Theorem 2.7.13 The direct product of two RKHSs with rks K1 and K2 is also an RKHS with K1 K2 as its rk.

2.8

Sobolev spaces

Sobolev spaces generally refer to function spaces whose norms involve derivatives. In this section, we look at a particularly simple variety of Sobolev space that consists of univariate functions on the interval [0, 1]. For q ≥ 1, consider the collection 𝕎q [0, 1] of functions f on [0, 1] that are q − 1 times differentiable, where f (q−1) is absolutely continuous having a derivative f (q) almost everywhere with f (q) ∈ 𝕃2 [0, 1]. Note that 𝕎q [0, 1] is not complete in the Hilbert space 𝕃2 [0, 1]. Define 𝜙i (t) = ti ∕i!, i = 0, 1, … , and

q−1

Gq (t, u) =

(t − u)+

(q − 1)!

.

For each f ∈ 𝕎q [0, 1], we have by Taylor’s formula with remainder that f (t) =

q−1 ∑

1

f (i) (0)𝜙i (t) +

∫0

i=0

Gq (t, u)f (q) (u)du.

(2.40)

Thus, a function is in 𝕎q [0, 1] if and only if it can be expressed as q−1 ∑

1

bi 𝜙i (t) +

i=0

∫0

Gq (t, u)g(u)du

(2.41)

for some b0 , … , bq−1 ∈ ℝ and g ∈ 𝕃2 [0, 1]. There are a number of ways to define an inner product on 𝕎q [0, 1]. The following approach is especially insightful. Define ℍ0 = span{𝜙0 , … , 𝜙q−1 } and, for any f , g ∈ ℍ0 , let ⟨f , g⟩ℍ0 =

q−1 ∑

f (i) (0)g(i) (0).

i=0

It is easy to check that ⟨f , g⟩ℍ0 is an inner product on ℍ0 and that {𝜙0 , … , 𝜙q−1 } is an orthonormal basis.

56

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Next consider

ℍ1 := { ∫_0^1 Gq(t, u)g(u)du : g ∈ 𝕃2[0, 1] }.    (2.42)

By (2.40) and (2.41), if f(t) = ∫_0^1 Gq(t, u)g(u)du for some g ∈ 𝕃2[0, 1], then f(0) = f′(0) = · · · = f^{(q−1)}(0) = 0 and f^{(q)} = g, so that

⟨f, g⟩_{ℍ1} = ∫_0^1 f^{(q)}(u)g^{(q)}(u)du

provides an inner product on ℍ1.

Theorem 2.8.1 The inner-product spaces (ℍi, ⟨·, ·⟩_{ℍi}), i = 0, 1, are RKHSs with reproducing kernels

K0(s, t) := ∑_{i=0}^{q−1} 𝜙_i(s)𝜙_i(t)

and

K1(s, t) := ∫_0^1 Gq(s, u)Gq(t, u)du,

respectively.

Proof: The statement about ℍ0 follows at once from Example 2.7.2, which allows us to focus on ℍ1. For that case, observe that a sequence {f_n} in ℍ1 is Cauchy if {f_n^{(q)}} is Cauchy in 𝕃2[0, 1]. As 𝕃2[0, 1] is complete, if {f_n} is Cauchy in ℍ1, then f_n^{(q)} → g for some g in 𝕃2[0, 1]. This, in turn, is equivalent to convergence of f_n in ℍ1 to

f(·) := ∫_0^1 Gq(·, u)g(u)du,

which is in ℍ1. Thus, ℍ1 is complete. The form for the rk in this instance is a consequence of Theorem 2.7.7. ◽

Now construct 𝕎q[0, 1] as the RKHS with kernel

K(s, t) := K0(s, t) + K1(s, t).    (2.43)
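A direct way to get a feel for (2.43) is to evaluate K0, K1, and their sum on a grid and confirm nonnegative definiteness. The following sketch is an added illustration for q = 2; the grid sizes and the simple averaging quadrature for K1 are assumptions made only for the example.

```python
import numpy as np
from math import factorial

q = 2
u = np.linspace(0.0, 1.0, 400)            # quadrature grid for K1
s = np.linspace(0.0, 1.0, 25)             # evaluation grid

def G(t, x):                               # G_q(t, x) = (t - x)_+^(q-1) / (q-1)!
    return np.where(t > x, (t - x) ** (q - 1), 0.0) / factorial(q - 1)

def K0(a, b):                              # sum_{i<q} phi_i(a) phi_i(b)
    return sum((a ** i / factorial(i)) * (b ** i / factorial(i)) for i in range(q))

def K1(a, b):                              # approx int_0^1 G_q(a, x) G_q(b, x) dx
    return float(np.mean(G(a, u) * G(b, u)))

K = np.array([[K0(a, b) + K1(a, b) for b in s] for a in s])   # kernel (2.43) on the grid
print(np.linalg.eigvalsh(K).min() > -1e-10)                   # nonnegative definite
```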


As ℍ0 ∩ ℍ1 = {0}, according to Theorem 2.7.10, the inner product of h1 = f1 + g1 and h2 = f2 + g2 for f1, f2 ∈ ℍ0 and g1, g2 ∈ ℍ1 is

⟨h1, h2⟩_{𝕎q[0,1]} = ⟨f1 + g1, f2 + g2⟩_{𝕎q[0,1]} = ⟨f1, f2⟩_{ℍ0} + ⟨g1, g2⟩_{ℍ1}.

This shows that ℍ0 and ℍ1 are orthogonal subspaces of 𝕎q[0, 1]. This space is extremely important in approximation theory due, in part, to its connection to spline functions. See, e.g., Section 6.6.

Now consider another inner product for 𝕎q[0, 1]: namely,

⟨f, g⟩′_{𝕎q[0,1]} := ⟨f, g⟩_2 + ⟨f^{(q)}, g^{(q)}⟩_2    (2.44)

with ⟨·, ·⟩_2 and ‖ · ‖_2 the 𝕃2[0, 1] inner product and norm. The norm associated with ⟨·, ·⟩′_{𝕎q[0,1]} is similarly denoted by ‖ · ‖′_{𝕎q[0,1]}.

Theorem 2.8.2 ‖f‖_{𝕎q[0,1]} and ‖f‖′_{𝕎q[0,1]} are equivalent norms.

Proof: Write f ∈ 𝕎q[0, 1] as

f = ∑_{i=0}^{q−1} b_i 𝜙_i(t) + ∫_0^1 Gq(t, u)f^{(q)}(u)du

and observe that

‖f‖²_{𝕎q[0,1]} = ∑_{i=0}^{q−1} b_i² + ‖f^{(q)}‖_2².

It is easy to see that ‖f‖_2² ≤ C‖f‖²_{𝕎q[0,1]} for some finite constant C and, hence, that ‖f‖′²_{𝕎q[0,1]} ≤ (C + 1)‖f‖²_{𝕎q[0,1]}.

On the other hand,

‖∑_{i=0}^{q−1} b_i 𝜙_i‖_2² ≤ 2‖f‖_2² + 2‖∫_0^1 Gq(·, u)f^{(q)}(u)du‖_2² ≤ C‖f‖′²_{𝕎q[0,1]}

also for some finite C. Now,

‖∑_{i=0}^{q−1} b_i 𝜙_i‖_2² = ∑_{i=0}^{q−1} ∑_{j=0}^{q−1} b_i b_j ∕ (i!j!(i + j + 1)) = bᵀ𝒜b,

where b = (b0, …, b_{q−1})ᵀ and 𝒜 = {(i!j!(i + j + 1))^{−1}}. If there were an a = (a0, …, a_{q−1})ᵀ such that aᵀ𝒜a = 0, that would mean that ∑_{i=0}^{q−1} a_i 𝜙_i(t) ≡ 0. So, 𝒜 is positive-definite and its smallest eigenvalue, 𝜆, must be strictly positive. Consequently,

‖∑_{i=0}^{q−1} b_i 𝜙_i‖_2² ≥ 𝜆bᵀb = 𝜆 ∑_{i=0}^{q−1} b_i².

These derivations show that ∑_{i=0}^{q−1} b_i² ≤ C‖f‖′²_{𝕎q[0,1]} for some finite C, from which we can conclude that ‖f‖²_{𝕎q[0,1]} ≤ (C + 1)‖f‖′²_{𝕎q[0,1]}. ◽
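The matrix 𝒜 appearing in the proof is just the Gram matrix of 𝜙_0, …, 𝜙_{q−1} in 𝕃2[0, 1], so its positive definiteness can also be observed numerically. A small added check (not from the text; the values of q are arbitrary):

```python
import numpy as np
from math import factorial

for q in (2, 3, 4):
    A = np.array([[1.0 / (factorial(i) * factorial(j) * (i + j + 1))
                   for j in range(q)] for i in range(q)])
    lam_min = np.linalg.eigvalsh(A).min()
    print(q, lam_min > 0.0, lam_min)      # smallest eigenvalue is strictly positive
```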



Note that the kernel K0 is no longer the rk for ℍ0 and, as a result, K is not the rk for 𝕎q[0, 1] under this alternative norm. This is easily rectified by taking

K0′(s, t) = ∑_{j=0}^{q−1} p_j(s)p_j(t)    (2.45)

with p0 , … , pq−1 the Legendre polynomials that one obtains by applying the Gram–Schmidt algorithm to 𝜙0 , … , 𝜙q−1 using the 𝕃2 [0, 1] norm. The rk for 𝕎q [0, 1] that results from this is K ′ (s, t) = K0′ (s, t) + K1 (s, t)

(2.46)

with K1 from Theorem 2.8.1 as before and K0′ in (2.45).

To conclude this section, we construct a CONS for 𝕎q[0, 1] under the inner product in (2.44). A first step in that direction is

Theorem 2.8.3 There exists a complete orthonormal sequence {e_j}_{j=1}^∞ for 𝕃2[0, 1] such that for i, j ≥ 1

⟨e_i^{(q)}, e_j^{(q)}⟩_2 = 𝛾_i 𝛿_{ij}

for values 0 = 𝛾_1 = · · · = 𝛾_q < 𝛾_{q+1} < · · · that, for constants C1, C2 ∈ (0, ∞), satisfy

C1 j^{2q} ≤ 𝛾_{j+q} ≤ C2 j^{2q},    j ≥ 1.

Proof: We merely sketch some aspects of the proof. A more detailed development is available in Utreras (1988). Let f be 2q times differentiable and satisfy f (j) (0) = f (j) (1) = 0, q ≤ j ≤ 2q − 1. Performing integration by parts q times, we obtain ‖f (q) ‖22 = ⟨f , (−1)q D2q f ⟩2 .


Now consider the solution, with respect to 𝛾 and f, of the differential equation

(−1)^q D^{2q} f = 𝛾f

subject to f^{(j)}(0) = f^{(j)}(1) = 0, q ≤ j ≤ 2q − 1. This gives rise to a sequence {(𝛾_j, e_j)}_{j=1}^∞ satisfying

⟨e_i, e_j⟩_2 = 𝛿_{ij},    ⟨e_i^{(q)}, e_j^{(q)}⟩_2 = 𝛾_j 𝛿_{ij}

for 𝛾_j > 0. To illustrate the idea consider the case of q = 1. Then, the general solution for −D²f = 𝛾f is

f(t) = a sin(√𝛾 t) + b cos(√𝛾 t)

for a, b ∈ ℝ. Now,

f′(0) = a√𝛾 = 0

implies that a = 0 and

f′(1) = −b√𝛾 sin(√𝛾) = 0

leads to 𝛾 = (j𝜋)² for j = 1, … If, however, 𝛾 = 0, the general solution is f(t) = a + bt and the boundary conditions imply that b = 0. Thus, the solutions are e_1 = 1 and

e_j(t) = √2 cos((j − 1)𝜋t)

for j ≥ 2. By Theorem 2.4.18, {e_j}_{j=1}^∞ is a complete orthonormal basis for 𝕃2[0, 1]. ◽
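For the q = 1 case just worked out, the claimed properties of the e_j can be confirmed numerically. The sketch below is an added illustration (the trapezoid rule and grid size are arbitrary) checking ⟨e_i, e_j⟩_2 = 𝛿_{ij} and ⟨e_i′, e_j′⟩_2 = 𝛾_j𝛿_{ij} with 𝛾_j = ((j − 1)𝜋)².

```python
import numpy as np

t, h = np.linspace(0.0, 1.0, 20001, retstep=True)

def e(j):                                  # e_1 = 1, e_j = sqrt(2) cos((j-1) pi t)
    return np.ones_like(t) if j == 1 else np.sqrt(2.0) * np.cos((j - 1) * np.pi * t)

def de(j):                                 # derivative e_j'
    return np.zeros_like(t) if j == 1 else -np.sqrt(2.0) * (j - 1) * np.pi * np.sin((j - 1) * np.pi * t)

def ip(f, g):                              # L2[0,1] inner product, trapezoid rule
    return float(h * (np.sum(f * g) - 0.5 * (f[0] * g[0] + f[-1] * g[-1])))

ok = True
for i in range(1, 6):
    for j in range(1, 6):
        ok &= abs(ip(e(i), e(j)) - (i == j)) < 1e-6
        ok &= abs(ip(de(i), de(j)) - (i == j) * ((j - 1) * np.pi) ** 2) < 1e-4
print(ok)                                  # True
```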

Since ‖e_j^{(q)}‖_2 = 0 for j = 1, …, q, the functions e_1, …, e_q must be an orthonormal basis for the polynomials of order q (degree q − 1). Thus, e_1, …, e_q is an orthonormal basis for ℍ0 under the (2.44) norm. However, the functions e_j, j ≥ q + 1, are not in ℍ1 as the boundary conditions e_j^{(k)}(0) = 0, 0 ≤ k ≤ q − 1, are not satisfied. Observe that

⟨e_i, e_j⟩′_{𝕎q[0,1]} = ⟨e_i, e_j⟩_2 + ⟨e_i^{(q)}, e_j^{(q)}⟩_2 = (1 + 𝛾_i)𝛿_{ij}.    (2.47)


Thus, the e_j provide orthogonal functions in 𝕎q[0, 1]. As each function in 𝕎q[0, 1] is a function in 𝕃2[0, 1], we can also conclude that {e_j}_{j=1}^∞ is a basis for 𝕎q[0, 1].

Theorem 2.8.4 The sequence {e_j ∕ (1 + 𝛾_j)^{1∕2}}_{j=1}^∞ is a CONS for 𝕎q[0, 1] under the inner product (2.44) and any f ∈ 𝕎q[0, 1] can be written as

f = ∑_{j=1}^∞ (1 + 𝛾_j)^{−1∕2} f_j e_j    (2.48)

for a square summable coefficient sequence {f_j}_{j=1}^∞.

The e_j are, of course, all in C[0, 1], which means that max_t |e_j(t)| is bounded for each j. However, we can say much more than that in this particular instance. An application of results in Salaff (1968) shows that the boundary conditions f^{(j)}(0) = f^{(j)}(1) = 0, q ≤ j ≤ 2q − 1, for the differential operator (−1)^q D^{2q} are regular in the sense of Birkhoff (1908). Thus, we can use results from Stone (1926) to conclude that there is a universal M ∈ (0, ∞) such that

sup_{j≥1} max_{t∈[0,1]} |e_j(t)| ≤ M;    (2.49)

i.e., the e_j(t) are uniformly bounded in both t and j.

3

Linear operator and functionals

In Chapter 2, we provided a review of the basic facts concerning Banach and Hilbert spaces. What is of more interest for our purposes is the properties of functions that operate on such spaces. This chapter proceeds in that direction. We are primarily interested in transformations that are linear and these tend to come in two varieties: linear functionals and operators, with the former being a special case of the latter. Both topics fall into the realm of mathematics known as functional analysis. This is a very broad area and our exposition cannot hope to do it justice. Rather than attempt to do so, we pick and choose topics that we feel are most relevant to fda and, in particular, those that are needed for subsequent chapters. More complete treatments of functional analysis are available through many sources. For example, Dunford and Schwarz (1988) is a standard reference for linear operator theory; for an elementary introduction, one can consult the text by Rynne and Youngson (2001).

Before proceeding, it is perhaps worthwhile to comment on our use of the word “functional” as an adjective modifier of both “data” and “analysis” throughout this text. Functional analysis derives its name from its foundation in the study of linear functionals on, typically, Banach spaces. Such spaces need have nothing to do with functions per se. In contrast, functional data is, by definition, a collection of (random) functions that need have no direct connection to linear functionals. Such data would perhaps be better served with titles such as “function data,” “curve data,” and “sample path data”. However, “functional data” is now the accepted moniker and we will adhere to that convention. With this particular caveat in mind, there should be no reason for confusion when the “functional” term is invoked.


3.1 Operators

We now investigate the properties of linear transformations (in the sense of Definition 2.2.6) on normed linear spaces. A quick note on notation may be useful. While, as a rule, we indicate the result of applying a function 𝒯 to an element x by 𝒯(x), in the case of linear transformations it is customary to suppress the parentheses. Often, an expression such as 𝒯(x) is written as merely 𝒯x.

Let 𝕏1, 𝕏2 be normed linear spaces. Associated with a linear transformation 𝒯 that maps from 𝕏1 into 𝕏2 are the spaces

Dom(𝒯) = the subset of 𝕏1 on which 𝒯 is defined,    (3.1)

Im(𝒯) = {𝒯x : x ∈ Dom(𝒯)}    (3.2)

and

Ker(𝒯) = {x ∈ Dom(𝒯) : 𝒯x = 0}    (3.3)

called, respectively, the domain, range (or image), and kernel (or null space) of 𝒯. We will always assume that Dom(𝒯) is a linear space, which implies that Im(𝒯) is also a linear space. Unless otherwise noted, we will take Dom(𝒯) as the entire 𝕏1 space. The rank of an operator 𝒯 is defined to be rank(𝒯) = dim(Im(𝒯)), which may be infinite.

The objective of our study is not just linear transformations but rather those that are bounded in the sense of the following definition.

Definition 3.1.1 Suppose that 𝕏1, 𝕏2 are normed vector spaces with norms ‖ · ‖i, i = 1, 2. A linear transformation 𝒯 from 𝕏1 to 𝕏2 is bounded if there exists a finite constant C > 0 such that ‖𝒯x‖2 ≤ C‖x‖1 for all x ∈ 𝕏1.

Boundedness has rather profound consequences as a result of the following theorem.

Theorem 3.1.2 A linear transformation between two normed spaces is uniformly continuous if and only if it is bounded.

Proof: Let 𝕏1 and 𝕏2 be normed linear spaces with norms ‖ · ‖i, i = 1, 2 and let 𝒯 be a linear transformation between the two spaces. If 𝒯 is uniformly


continuous, then it is continuous at 0 from which it follows that there is a universal 𝛿 > 0 such that ‖𝒯x‖2 ≤ 1 whenever ‖x‖1 ≤ 𝛿. Thus, for example, with any x ≠ 0, we will have ‖𝒯x‖2 = ‖𝒯(𝛿x∕‖x‖1 )‖2 ‖x‖1 ∕𝛿 ≤ ‖x‖1 ∕𝛿. For the converse, if xn → x, the fact that ‖𝒯(x − xn )‖2 ≤ C‖x − xn ‖1 means that 𝒯xn → 𝒯x. As C is independent of x, the continuity is uniform. ◽ Theorem 3.1.2 means that the phrases “continuous linear transformation” and “bounded linear transformation” are synonymous. We will use 𝔅(𝕏1 , 𝕏2 ) to denote the set of all bounded (and, hence, continuous) linear transformations from 𝕏1 to 𝕏2 . This becomes a normed space under the operator norm ‖𝒯‖ =

sup

x∈𝕏1 ,‖x‖1 =1

‖𝒯x‖2 .

(3.4)

Then, for any x ∈ 𝕏1 , ‖𝒯x‖2 ≤ ‖𝒯‖‖x‖1 .

(3.5)

The elements of 𝔅(𝕏1 , 𝕏2 ) are called bounded linear operators, linear operators, or operators. If 𝕏1 = 𝕏2 = 𝕏, we use 𝔅(𝕏) for the set of all bounded operators on 𝕏. Theorem 3.1.3 Let 𝕏1 and 𝕏2 be normed linear spaces. If 𝕏2 is complete, then 𝔅(𝕏1 , 𝕏2 ) with norm (3.4) is a Banach space. Proof: Let {𝒯n } be a Cauchy sequence in 𝔅(𝕏1 , 𝕏2 ). For fixed x ∈ 𝕏1 , consider the sequence {𝒯n x} in 𝕏2 . Applying (3.5) we see that {𝒯n x} is a Cauchy sequence in 𝕏2 . By the completeness of 𝕏2 , 𝒯n x has a limit which we denote by 𝒯x. It is obvious that 𝒯 is linear. As {𝒯n } is Cauchy, for any 𝜖 > 0, there exists an N𝜖 such that supn,m≥N𝜖 ‖𝒯n − 𝒯m ‖ < 𝜖∕2. In addition, for any x with ‖x‖1 = 1 there exists an m(x) ≥ N𝜖 such that ‖𝒯m(x) x − 𝒯x‖2 < 𝜖∕2. Thus, if n ≥ N𝜖 , ‖𝒯n x − 𝒯x‖2 ≤ ‖𝒯n x − 𝒯m(x) x‖2 + ‖𝒯m(x) x − 𝒯x‖2 < 𝜖.

(3.6)

By the triangle inequality, ‖𝒯x‖2 ≤ ‖𝒯n x‖2 + 𝜖, which shows that 𝒯 is bounded. Another application of (3.6) shows that 𝒯n converges to 𝒯 in operator norm. ◽


Example 3.1.4 Let 𝕏1 = ℝp and 𝕏2 = ℝq with ‖ ⋅ ‖ the Euclidean norm for either space. Then, any linear transformation 𝒯 from ℝp to ℝq can be represented as a q × p matrix/array 𝒯 = {𝜏ij }i=1∶q,,j=1∶p of real numbers and ‖𝒯‖2 = max xT 𝒯T 𝒯x T x x=1

for x = (x1 , … , xp )T , the transpose of the p-vector x and 𝒯T = {𝜏ji }i=1∶q,j=1∶p the transpose of 𝒯. We will see eventually that when 𝒯 is symmetric ‖𝒯‖ is the absolute value of the eigenvalue of 𝒯 with largest magnitude. Somewhat more generally, ‖𝒯‖ is the largest singular value of the matrix 𝒯. Example 3.1.5 Consider the space C[0, 1] equipped with the sup norm in Example 2.1.3. Let t0 be an arbitrary point in [0, 1] and define the mapping 𝒯 ∶ f → f (t0 ) from 𝕏1 ∶= C[0, 1] into 𝕏2 ∶= ℝ. Clearly, 𝒯 is linear and ‖𝒯f ‖2 = |f (t0 )| ≤ sup |f (t)| = ‖f ‖1 t∈[0,1]
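The matrix case can be checked directly. The short sketch below is an added illustration (the random matrix and sample size are arbitrary) comparing a brute-force search over unit vectors with the largest singular value.

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.normal(size=(4, 3))                       # a linear map from R^3 to R^4

xs = rng.normal(size=(3, 20000))                  # many random unit vectors
xs /= np.linalg.norm(xs, axis=0)
brute = np.linalg.norm(T @ xs, axis=0).max()      # sup of ||Tx|| over the sampled unit x

sigma_max = np.linalg.svd(T, compute_uv=False)[0]
print(float(brute), float(sigma_max))             # brute-force estimate approaches ||T||
```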

with equality holding for constant functions. This establishes that 𝒯 is also bounded with unit (operator) norm. Example 3.1.6 Consider the standard 𝕃2 [0, 1] space and the linear mapping defined by 1

(𝒯f )(⋅) =

K(⋅, u)f (u)du

∫0

(3.7)

for f ∈ 𝕃2 [0, 1] and some square-integrable function K on [0, 1] × [0, 1]. Operators of this type are called integral operators (Section 4.6) and the function K is called a kernel. Integral operators are bounded because 1

|(𝒯f )(t)|2 ≤

∫0

1

K 2 (t, u)du

1

f 2 (u)du = ‖f ‖22

∫0

∫0

K 2 (t, u)du

with ‖ ⋅ ‖2 the 𝕃2 [0, 1] norm. Therefore, 1

‖𝒯f ‖22 ≤ ‖f ‖22

∫0 ∫0

1

K 2 (t, u)dudt.

The operator norm for 𝒯 in this instance may be deduced from Theorem 4.3.4. In Section 2.6, we defined the Bochner integral that extended the concept of Lebesgue integration to integration over a Banach space. Given a function

LINEAR OPERATOR AND FUNCTIONALS

65

f on a measure space (E, ℬ, 𝜇) that takes on values in a Banach space 𝕏, the Bochner integral of f over E was defined as ∫E

fd𝜇 = lim

n→∞ ∫E

fn d𝜇,

where fn is a sequence of simple functions in the sense of Definition 2.6.1 such that limn→∞ ∫E ‖fn − f ‖d𝜇 = 0: i.e., (2.19) is satisfied. We know that Bochner integration is a linear operation from which one might have guessed that the following result would be true. Theorem 3.1.7 Let 𝕏1 , 𝕏2 be Banach spaces, f a Bochner integrable function from E to 𝕏1 and 𝒯 ∈ 𝔅(𝕏1 , 𝕏2 ). Then, 𝒯f is Bochner integrable and ( ) fd𝜇 = 𝒯fd𝜇. 𝒯 ∫E ∫E Proof: Let {fn } be a sequence of simple functions that satisfies (2.19). Then, 𝒯fn is also simple and ∫E 𝒯fn d𝜇 = 𝒯∫E fn d𝜇 for all n. Now 𝒯∫E fn d𝜇 (and, hence, ∫E 𝒯fn d𝜇) converges to 𝒯∫E fd𝜇 by the continuity of 𝒯. Approaching matters from the other direction we have ∫E

‖𝒯f − 𝒯fn ‖2 d𝜇 ≤

∫E

‖𝒯‖‖f − fn ‖1 d𝜇 → 0.

Thus, 𝒯f is Bochner integrable and its Bochner integral is the limit of ∫E 𝒯fn d𝜇. ◽

We say that 𝒯̃ is an extension of a linear transformation 𝒯 if Dom(𝒯) ⊆ Dom(𝒯̃) and 𝒯̃x = 𝒯x for all x ∈ Dom(𝒯). A linear transformation between two normed vector spaces 𝕏1 and 𝕏2 is said to be densely defined if Dom(𝒯) is dense in 𝕏1. When 𝒯 is bounded, being densely defined turns out to be enough to define it globally in the sense of being able to produce an extension of 𝒯 whose domain is all of 𝕏1. This result is sometimes called the extension principle.

Theorem 3.1.8 Let 𝕏1, 𝕏2 be Banach spaces and suppose that 𝒯 is a bounded linear transformation from Dom(𝒯) ⊆ 𝕏1 into 𝕏2. Then, there is a unique extension of 𝒯 to the closure of Dom(𝒯) that has the same bound.

Proof: As any closed subspace of a Banach space is also a Banach space, we may as well assume that the closure of Dom(𝒯) is 𝕏1. Then, the first step is to realize that any 𝒯̃ ∈ 𝔅(𝕏1, 𝕏2) is uniquely determined by its values on a dense subset of 𝕏1; if 𝔻 is a dense subset of 𝕏1, for any x ∈ 𝕏1, there is a sequence {xn} in 𝔻 that converges to x and 𝒯̃xn converges to 𝒯̃x by continuity. So, if we can find a bounded extension of 𝒯 to 𝕏1, it will necessarily be unique.


To actually create the extension let x be any element of 𝕏1 with {xn } an arbitrary sequence from Dom(𝒯) having x as its limit. The sequence 𝒯xn is Cauchy in 𝕏2 because ‖𝒯xn − 𝒯xm ‖2 ≤ C‖xn − xm ‖1 with C the assumed finite bound for 𝒯. Thus, we know that 𝒯xn must have a ̃ to be that limit. Since ‖𝒯xn ‖2 ≤ C‖xn ‖1 limit in 𝕏2 and we simply define 𝒯x ̃ for all n, ‖𝒯‖ ≤ C. ◽

3.2 Linear functionals

Given a normed space 𝕏, a case of special interest is 𝔅(𝕏, ℝ). It is called the dual space of 𝕏 and its elements are called bounded linear functionals or just linear functionals. In the case of Hilbert spaces, the dual space has a particularly simple form as a result of the Riesz Representation Theorem that is stated in the following. We will have many occasions to draw on the power of this result. Theorem 3.2.1 Suppose that ℍ is a Hilbert space with inner-product and norm ⟨⋅, ⋅⟩, ‖ ⋅ ‖ and 𝒯 ∈ 𝔅(ℍ, ℝ). There is a unique element e𝒯 ∈ ℍ, called the representer of 𝒯, with the property that 𝒯x = ⟨x, e𝒯⟩ for all x ∈ ℍ and ‖𝒯‖ = ‖e𝒯‖. Proof: If 𝒯 maps all elements to 0, take e𝒯 = 0. Otherwise, Theorem 2.5.6 tells us that Ker(𝒯)⊥ is a closed subspace from which we may choose an element y with 𝒯y = 1 to obtain 𝒯(x − (𝒯x)y) = 𝒯x − 𝒯x𝒯y = 0 for every x ∈ ℍ; i.e., x − 𝒯(x)y ∈ Ker(𝒯). As y ∈ Ker(𝒯)⊥ , we have ⟨x − 𝒯(x)y, y⟩ = 0 and therefore ⟨x, y⟩ = 𝒯x⟨y, y⟩ = 𝒯x‖y‖2 . So, e𝒯 = y∕‖y‖2 has the requisite properties. If we could now find another e′𝒯 to serve as a representer, their difference would satisfy ⟨x, e𝒯 − e′𝒯⟩ = 0 for every x ∈ ℍ. Thus, e𝒯 is unique. ◽ The theorem has the consequence that a Hilbert space is self-dual. That is, if ℍ is a Hilbert space, its dual space is isomorphic to ℍ. However, the relationship is stronger than just that because the last statement of the theorem entails that ℍ and 𝔅(ℍ, ℝ) are congruent or isometrically isomorphic.

LINEAR OPERATOR AND FUNCTIONALS

67

Example 3.2.2 Consider the RKHS setting of Section 2.7. There we had a Hilbert space ℍ of real valued functions on a set E. If ℍ is an RKHS with inner product ⟨⋅, ⋅⟩, there is an rk with the reproducing property that for every f ∈ ℍ, f (t) = ⟨f , K(⋅, t)⟩ for t ∈ E. In this setting, the evaluation functionals 𝒯t ∶ f → f (t) are well defined for any f ∈ ℍ and t ∈ E. It is also clear that 𝒯t is linear and that its representer is K(⋅, t). Theorem 3.2.3 ℍ is an RKHS if all evaluation functionals are bounded. Proof: The necessity follows from Theorem 2.7.6. To go the other direction observe that if 𝒯t is bounded for each t ∈ E, Theorem 3.2.1 tells us that there is a gt ∈ ℍ such that f (t) = 𝒯t f = ⟨f , gt ⟩. Then, K(s, t) ∶= gt (s) is the rk. ◽ Example 3.2.4 In Theorem 3.1.7, let 𝕏1 be a Hilbert space and take 𝕏2 = ℝ. With these choices, 𝒯 is a linear functional on a Hilbert space and Theorem 3.1.7 translates to ⟨ ⟩ fd𝜇, g = ⟨f , g⟩1 d𝜇 (3.8) ∫E ∫E 1 for all g ∈ 𝕏1 . Definition 3.2.5 A function f on a normed space 𝕏 is called a sublinear functional if for all x, y in 𝕏 and a ∈ ℝ 1. f (x + y) ≤ f (x) + f (y) and 2. f (ax) = af (x). For example, any norm on 𝕏 is a sublinear functional. The following result is known as the Hahn–Banach Extension Theorem. It plays an important role in optimization theory as detailed, for example, in Luenberger (1969). Theorem 3.2.6 Let 𝓁 be a linear functional defined on a subspace 𝕄 of a normed linear space 𝕏. Suppose that there is a continuous sublinear functional f defined on 𝕏 such that 𝓁(x) ≤ f (x) for all x ∈ 𝕄. Then, there is a linear functional 𝓁̃ defined on 𝕏 that agrees with 𝓁 on 𝕄 and satisfies ̃ ≤ f (x) for all x ∈ 𝕏. 𝓁(x) Proof: Let x be any element of 𝕏 that is not in 𝕄 and let y be an arbitrary element of the subspace obtained by translating 𝕄 by x: i.e., y is an element of

68

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

the subspace in 𝕏 consisting of vectors of the form ax + m for some a ∈ ℝ and ̃ = 𝓁(m) + ah(x) m ∈ 𝕄. To extend 𝓁 to work on y, we need only define 𝓁(y) for some function h satisfying h(x) ≤ f (x). The trick is in establishing the existence of h. Let m1 , m2 be arbitrary elements of 𝕄 and observe that by assumption 𝓁(m1 + m2 ) ≤ f (m1 + m2 ) ≤ f (m1 − x) + f (m2 + x) because f is sublinear on all of 𝕏. As 𝓁 is linear this translates into the inequality 𝓁(m1 ) − f (m1 − x) ≤ f (m2 + x) − 𝓁(m2 ) and, as m1 , m2 are arbitrary, it must be that sup [𝓁(m) − f (m − x)] ≤ inf [f (m + x) − 𝓁(m)]. m∈𝕄

m∈𝕄

We now define the value of h(x) to be any number between supm∈𝕄 [𝓁(m) − f (m − x)] and inf m∈𝕄 [f (m + x) − 𝓁(m)]. To see that this works, suppose that y = ax + m for m ∈ 𝕄 and a > 0. Then, [ ( ) ] ( ) ̃ = 𝓁(m) + ah(x) = a 𝓁 m + h(x) ≤ af m + x = f (y). 𝓁(y) a a The case of a < 0 is handled similarly. So, we have succeeded in obtaining an extension of 𝓁 to the manifold that includes 𝕄 and x for an arbitrary x ∈ 𝕏 that is not in 𝕄. Now consider the set Θ of all subspace and linear functional pairs (𝔸, 𝓁A ) such that 𝕄 ⊆ 𝔸 and 𝓁A coincides with 𝓁 on 𝕄. We can provide a partial order on Θ by saying that (𝔸1 , 𝓁A1 ), (𝔸2 , 𝓁A2 ) ∈ Θ satisfy (𝔸1 , 𝓁A1 ) ≤ (𝔸2 , 𝓁A2 ) when 𝔸1 ⊆ 𝔸2 and 𝓁A2 coincides with 𝓁A1 on 𝔸1 . In particular, when either (𝔸1 , 𝓁A1 ) ≤ (𝔸2 , 𝓁A2 ) or (𝔸2 , 𝓁A2 ) ≤ (𝔸1 , 𝓁A1 ), (𝔸1 , 𝓁A1 ) and (𝔸2 , 𝓁A2 ) are said to be comparable. Let {(𝔸𝛽 , 𝓁A𝛽 )}𝛽∈B be any collection of comparable sets in Θ and define 𝔸 = ∪𝛽∈B 𝔸𝛽 with 𝓁A ∶= 𝓁A𝛽 for all 𝛽 ∈ B. With this formulation (𝔸, 𝓁A ) provides an upper bound on the chain {(𝔸𝛽 , 𝓁A𝛽 )}𝛽∈B in that (𝔸𝛽 , 𝓁A𝛽 ) ≤ (A, 𝓁A ) for all 𝛽 ∈ B. As a result, we may apply Zorn’s lemma to conclude that Θ has ̃ The result now follows provided that 𝕏 ̃ 𝓁). ̃ = 𝕏. a maximal element (𝕏, However, if that were not the case, we could apply the previous extension ̃ and obtain a contradiction argument to any element in 𝕏 that was not in 𝕏 of maximality. ◽ Perhaps the most obvious application of Theorem 3.2.6 is for the case where the linear functional in question is bounded. We state this result formally as follows.

LINEAR OPERATOR AND FUNCTIONALS

69

Corollary 3.2.7 If 𝓁 is a bounded linear functional on a subspace 𝕄 of a normed vector space 𝕏, there is an extension 𝓁̃ of 𝓁 defined on all of 𝕏 with the same norm as 𝓁 on 𝕄. Proof: To be precise let us define |𝓁(m)| . m∈𝕄 ‖m‖

‖𝓁‖M = sup

Then, specify the sublinear functional in Theorem 3.2.6 by f (x) = ‖𝓁‖M ‖x‖ ̃ for any x ∈ 𝕏. This means that |𝓁(x)| ≤ ‖𝓁‖M ‖x‖ on 𝕏 from which it follows ̃ = ‖𝓁‖M . that ‖𝓁‖ ◽ Corollary 3.2.8 Let x be an element of a normed vector space 𝕏. Then, there is a linear functional 𝓁 with unit norm for which 𝓁(x) = ‖x‖. Proof: Define 𝓁(ax) = a‖x‖ for any a ∈ ℝ. Then, 𝓁 is a bounded linear functional on span{x} and the result follows from Corollary 3.2.7. ◽ Example 3.1.5 gives one simple instance of a bounded linear functional for the space C[0, 1]. More generally, with the aid of the Hahn–Banach Theorem, it becomes possible to characterize the entire dual of C[0, 1] in terms of the function space BV[0, 1] from Example 2.3.9 that contains all functions of bounded variation on [0, 1]. Theorem 3.2.9 Let C[0, 1] be the space of continuous functions on [0, 1] equipped with the sup norm. A function 𝓁 is a bounded linear functional on C[0, 1] if and only if 1

𝓁(f ) =

∫0

f (t)dw(t)

(3.9)

for some 𝑤 ∈ BV[0, 1]. Proof: Let 𝓁 be a bounded linear functional on ℂ[0, 1] and define 𝕏 to be the set of bounded functions on the interval [0, 1] with associated norm ‖f ‖ = sup |f (x)|. x∈[0,1]

Clearly, C[0, 1] is a proper subset of 𝕏. Thus, from Corollary 3.2.7, 𝓁 has an extension to a bounded linear functional 𝓁̃ on 𝕏 that has the same norm. Let f be an arbitrary element of C[0, 1] and consider approximating it by fn (t) =

n ∑ i=1

f (ti−1 )[gti (t) − gti−1 (t)]

70

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

for some partition 0 = t0 < t2 < · · · < tn = 1 and gs (t) = (s − t)0+ . Now, fn ∈ 𝕏 and ‖fn − f ‖ = max max |f (t) − f (ti−1 )| → 0 1≤i≤n ti−1 ≤t≤ti

as n → ∞ because f is uniformly continuous. ̃ n ) is well defined and 𝓁(f ̃ n) → As 𝓁̃ is a bounded linear functional on 𝕏, 𝓁(f ̃ ). However, as f ∈ C[0, 1], 𝓁(f ̃ ) = 𝓁(f ). The fact that 𝓁̃ is linear means 𝓁(f that n ∑ ̃ n) = 𝓁(f f (ti−1 )[𝑤(ti ) − 𝑤(ti−1 )] i=1

̃ s ). with 𝑤(s) = 𝓁(g Now we claim that 𝑤 belongs to BV[0, 1]. To see that this is so begin by observing that n ∑

|𝑤(ti ) − 𝑤(ti−1 )| =

i=1

n ∑

sign (𝑤(ti ) − 𝑤(ti−1 )) [𝑤(ti ) − 𝑤(ti−1 )]

i=1

(

= 𝓁̃

n ∑

) sign (𝑤(ti ) − 𝑤(ti−1 )) [gti − gti−1 ]

i=1

which has the implication that n ∑ i=1

n ‖∑ ‖ ‖ ‖ ̃ |𝑤(ti ) − 𝑤(ti−1 )| ≤ ‖𝓁‖ ‖ sign (𝑤(ti ) − 𝑤(ti−1 )) [gti − gti−1 ]‖ . ‖ ‖ ‖ i=1 ‖

̃ = ‖𝓁‖ and However, ‖𝓁‖ n ‖∑ ‖ ‖ ‖ ‖ sign (𝑤(ti ) − 𝑤(ti−1 )) [gti − gti−1 ]‖ ‖ ‖ ‖ i=1 ‖ n |∑ | | | = max | sign (𝑤(ti ) − 𝑤(ti−1 )) [gti (s) − gti−1 (s)]| | s∈[0,1] | | i=1 | = 1.

So, TV(𝑤) ≤ ‖𝓁‖ and our claim has been shown to hold. As 𝑤 is of bounded variation and f is continuous n ∑ i=1

1

f (ti−1 )[𝑤(ti ) − 𝑤(ti−1 )] →

∫0

f (t)dw(t).

̃ n ) = 𝓁(f ) is the Riemann–Stieltjes integral of f with respect That is, lim 𝓁(f to 𝑤.

LINEAR OPERATOR AND FUNCTIONALS

71

All that remains is to prove that (3.9) defines a linear functional. The linear part is obvious and |𝓁(f )| ≤ ‖f ‖TV(𝑤) verifies the boundedness condition.◽ The dual space 𝔅(𝕏, ℝ) induces a topology different than the norm topology on 𝕏 through consideration of the following notion of convergence. Definition 3.2.10 A sequence {xn }∞ in a Banach space 𝕏 converges weakly n=1 to x if 𝓁(xn ) → 𝓁(x) for every 𝓁 ∈ 𝔅(𝕏, ℝ). Strong or norm convergence implies weak convergence as |𝓁(xn ) − 𝓁(x)| = |𝓁(xn − x)| ≤ ‖𝓁‖‖xn − x‖. However, the converse is not true. Most of our discussion in this text will concern strong convergence and that is the mode of convergence that should be assumed unless it has been explicitly stated to the contrary. Using Theorem 3.2.1, weak convergence in a Hilbert space ℍ can be characterized as having ⟨xn , y⟩ → ⟨x, y⟩ for every y ∈ ℍ. In this case if we also have ‖xn ‖ → ‖x‖, ‖xn − x‖2 = ‖xn ‖2 − 2⟨xn , x⟩ + ‖x‖2 → 0 and xn converges strongly. The main result we will eventually need about weak convergence in Section 6.2 is the following theorem whose proof is a consequence of the Banach–Alaoglu Theorem (e.g., Rudin, 1991). Theorem 3.2.11 Let {xn }∞ be a (norm) bounded sequence in a Hilbert n=1 space ℍ. Then, there is subsequence {xnk }∞ such that xnk converges weakly k=1 to some element x ∈ ℍ.
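Weak-but-not-strong convergence can be seen concretely with an orthonormal sequence in 𝕃2[0, 1]. The sketch below is an added illustration (the particular y and the grid are arbitrary): ⟨x_n, y⟩ → 0 while ‖x_n‖ stays equal to 1, so x_n converges weakly to 0 but not strongly.

```python
import numpy as np

t, h = np.linspace(0.0, 1.0, 20001, retstep=True)
y = np.exp(t)                                      # a fixed element of L2[0,1]

def ip(f, g):                                      # trapezoid-rule inner product
    return float(h * (np.sum(f * g) - 0.5 * (f[0] * g[0] + f[-1] * g[-1])))

for n in (1, 5, 25, 125):
    xn = np.sqrt(2.0) * np.cos(n * np.pi * t)      # an orthonormal sequence in L2[0,1]
    print(n, round(ip(xn, y), 6), round(ip(xn, xn), 6))   # <x_n, y> -> 0, ||x_n||^2 = 1
```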

3.3 Adjoint operator

The discussion in this section is restricted to operators on Hilbert spaces. In that setting, we seek to develop an abstract extension of the concept of the transpose of a matrix that pertains to linear operators in the most familiar Hilbert space ℝp . Theorem 3.3.1 Let ℍ1 , ℍ2 be Hilbert spaces with inner products ⟨⋅, ⋅⟩i , i = 1, 2. Corresponding to every 𝒯 ∈ 𝔅(ℍ1 , ℍ2 ), there is a unique element 𝒯∗ of 𝔅(ℍ2 , ℍ1 ) determined by the relation ⟨𝒯x1 , x2 ⟩2 = ⟨x1 , 𝒯∗ x2 ⟩1 for all x1 ∈ ℍ1 , x2 ∈ ℍ2 .

(3.10)

72

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Proof: Consider ⟨𝒯x1 , x2 ⟩2 as a function of x1 for fixed x2 . It is bounded and linear so that the Riesz representation theorem may be applied to see that there exists a unique y ∈ ℍ1 such that ⟨𝒯x1 , x2 ⟩2 = ⟨x1 , y⟩1 . Thus, we take 𝒯∗ x2 = y. This definition gives us a linear mapping. To see that it bounded first note that 𝒯 is necessarily the adjoint of 𝒯∗ . Then, ‖𝒯∗ x2 ‖21 = |⟨x2 , 𝒯𝒯∗ x2 ⟩2 | ≤ ‖𝒯‖‖𝒯∗ x2 ‖1 ‖x2 ‖2 .



Definition 3.3.2 For any 𝒯 ∈ 𝔅(ℍ1 , ℍ2 ), the unique operator 𝒯∗ of 𝔅(ℍ2 , ℍ1 ) in (3.10) is said to be the adjoint of 𝒯. If ℍ1 = ℍ2 , 𝒯 is called self-adjoint when 𝒯∗ = 𝒯. Example 3.3.3 Let ℍ = ℝp . Then any member of 𝔅(ℍ) can be expressed as a p × p matrix 𝒯 = {𝜏ij }i,j=1∶p . Clearly, ⟨𝒯x, y⟩ = xT 𝒯T y = ⟨x, 𝒯T y⟩. Hence, 𝒯∗ = 𝒯T and any symmetric matrix 𝒯 determines a self-adjoint operator. Example 3.3.4 Consider the 𝕃2 [0, 1] integral operator defined in Example 3.1.6. In that case ⟨ ⟩ 1 1 1 ⟨𝒯f , g⟩ = K(t, s)f (s)g(t)dsdt = f , K(t, ⋅)g(t)dt ∫0 ∫0 ∫0 and, hence,

1

(𝒯∗ g)(s) =

∫0

K(t, s)g(t)dt.

If K is symmetric then 𝒯 is self-adjoint. Example 3.3.5 Let ℍ(K) be an RKHS of functions on E with 𝒯t the evaluation functional that takes a function f to f (t) for each t ∈ E. Then, uf (t) = ⟨f , 𝒯∗t u⟩, for any u ∈ ℝ and f ∈ ℍ(K). By the reproducing property, this implies that 𝒯∗t u = uK(⋅, t). Using the adjoint, we can obtain a simple proof of the following result.
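On a grid, the integral operator of Example 3.3.4 becomes a matrix and its adjoint is represented by the transposed kernel, which can be verified directly. A minimal added sketch (the kernel K(t, s) = min(t, s), the grid size, and the crude quadrature are assumptions for illustration only):

```python
import numpy as np

n = 200
s = np.linspace(0.0, 1.0, n)
w = 1.0 / n                                        # quadrature weight
K = np.minimum.outer(s, s)                         # example kernel K(t, s) = min(t, s)
T = K * w                                          # discretized (Tf)(t) = int K(t, s) f(s) ds

rng = np.random.default_rng(1)
f, g = rng.normal(size=n), rng.normal(size=n)
lhs = np.sum((T @ f) * g) * w                      # <Tf, g>
rhs = np.sum(f * (T.T @ g)) * w                    # <f, T*g> with T* built from the transposed kernel
print(np.isclose(lhs, rhs))                        # True
```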

LINEAR OPERATOR AND FUNCTIONALS

73

Theorem 3.3.6 If 𝒯 ∈ 𝔅(ℍ) for some Hilbert space ℍ and {xn } is a sequence that converges weakly to x ∈ ℍ, then 𝒯xn converges weakly to 𝒯x. Proof: For any y ∈ ℍ ⟨𝒯xn , y⟩ = ⟨xn , 𝒯∗ y⟩ → ⟨x, 𝒯∗ y⟩ = ⟨𝒯x, y⟩.



The following result collects various properties we will eventually need involving the adjoint operator. Theorem 3.3.7 Let 𝒯 ∈ 𝔅(ℍ1 , ℍ2 ) for real Hilbert spaces ℍ1 , ℍ2 . Then, 1. (𝒯∗ )∗ = 𝒯, 2. ‖𝒯∗ ‖ = ‖𝒯‖, 3. ‖𝒯∗ 𝒯‖ = ‖𝒯‖2 , 4. Ker(𝒯) = (Im(𝒯∗ ))⊥ , 5. Ker(𝒯∗ 𝒯) = Ker(𝒯) and Im(𝒯∗ 𝒯) = Im(𝒯∗ ), 6. ℍ1 = Ker(𝒯) ⊕ Im(𝒯∗ ) = Ker(𝒯∗ 𝒯) ⊕ Im(𝒯∗ 𝒯), and 7. rank(𝒯∗ ) = rank(𝒯). Proof: Let xi ∈ ℍi , i = 1, 2 and observe that ⟨(𝒯∗ )∗ x1 , x2 ⟩2 = ⟨x1 , 𝒯∗ x2 ⟩1 = ⟨𝒯x1 , x2 ⟩2 to establish part 1. For part 2, first recall from the proof of Theorem 3.3.1 that ‖𝒯∗ x2 ‖21 ≤ ‖𝒯‖‖𝒯∗ x2 ‖1 ‖x2 ‖2 . Thus, ‖𝒯∗ x2 ‖1 ≤ ‖𝒯‖‖x2 ‖2 provided that ‖𝒯∗ x2 ‖1 ≠ 0. Now exchange the roles of 𝒯 and 𝒯∗ and apply part 1 to obtain the desired result. From part 2, we know that ‖𝒯∗ 𝒯‖ ≤ ‖𝒯‖2 . On the other hand, ‖𝒯x1 ‖22 = ⟨𝒯∗ 𝒯x1 , x1 ⟩1 ≤ ‖𝒯∗ 𝒯‖‖x1 ‖21 , which means that ‖𝒯‖2 ≤ ‖𝒯∗ 𝒯‖ and part 3 has been shown. For part 4, let x1 ∈ Ker(𝒯) in which case ⟨x1 , 𝒯∗ x2 ⟩1 = ⟨𝒯x1 , x2 ⟩2 = 0 for all x2 . This entails that x1 is orthogonal to any element in Im(𝒯∗ ) and must be in (Im(𝒯∗ ))⊥ . Going the other direction let x1 ∈ (Im(𝒯∗ ))⊥ . Then, since 𝒯∗ 𝒯x1 ∈ Im(𝒯∗ ), we obtain ‖𝒯x1 ‖2 = ⟨x1 , 𝒯∗ 𝒯x1 ⟩1 = 0. Next, if x1 ∈ Ker(𝒯), then x1 ∈ Ker(𝒯∗ 𝒯). So, suppose x1 ∈ Ker(𝒯∗ 𝒯). However, in that event 0 = ⟨𝒯∗ 𝒯x1 , x1 ⟩1 = ‖𝒯x1 ‖2 and x1 ∈ Ker(𝒯). This proves the first identity of part 5. The second identity follows from the first, part 4, and the third part of Theorem 2.5.6.

74

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

By (2.16), part 4 and part 3 of Theorem 2.5.6, ℍ1 = Ker(𝒯) ⊕ (Ker(𝒯))⊥ = Ker(𝒯) ⊕ Im(𝒯∗ ), which is the first identity in part 6. Now apply part 5 to obtain the second result. To show part 7, assume first that rank(𝒯) < ∞ in which case Im(𝒯) must be finite-dimensional and hence Im(𝒯) = Im(𝒯). Applying part 6 for x ∈ ℍ2 , we have 𝒯∗ x = 𝒯∗ x′ where x′ is the projection of x on Im(𝒯). This shows that Im(𝒯∗ ) ⊂ 𝒯∗ (Im(𝒯)), which implies that rank(𝒯∗ ) ≤ rank(𝒯) < ∞. One can then reverse the roles of 𝒯 and 𝒯∗ to conclude that rank(𝒯) ≤ rank(𝒯∗ ). Hence, rank(𝒯) = rank(𝒯∗ ) if either rank(𝒯) or rank(𝒯∗ ) is finite. If, instead, we start with either rank(𝒯) or rank(𝒯∗ ) being infinite, the same argument shows that the other rank must also be infinite. ◽

3.4

Nonnegative, square-root, and projection operators

The role of projection matrices in statistics can be appreciated by examination of any book on linear models where they arise in quadratic forms involving random vectors. Projection matrices are, of course, positive semidefinite. More generally, nonnegative matrices occur naturally as variance–covariance matrices for random vectors. The extension of these ideas for fda occurs when finite dimensional vectors become functions. This leads to operators replacing matrices in covariance and other calculations as will be detailed, for example, in Chapter 7. At present, it suffices to merely set the stage by defining the ideas of nonnegative and projection operators. Definition 3.4.1 An operator 𝒯 on a Hilbert space ℍ is said to be nonnegative definite (or just nonnegative) if it is self-adjoint and ⟨𝒯x, x⟩ ≥ 0 for all x ∈ ℍ. It is positive definite (or just positive) if ⟨𝒯x, x⟩ > 0 for all x ∈ ℍ. For two operators 𝒯1 , 𝒯2 , we write 𝒯1 ≤ 𝒯2 (respectively, 𝒯1 < 𝒯2 ) if 𝒯2 − 𝒯1 is nonnegative (respectively, positive) definite. Example 3.4.2 If 𝒯 is any element of 𝔅(ℍ), 𝒯∗ 𝒯 is nonnegative definite because it is self-adjoint and ⟨𝒯∗ 𝒯x, x⟩ = ‖𝒯x‖2 . A fundamental result we will need about nonnegative operators is that they admit a square root type decomposition.

LINEAR OPERATOR AND FUNCTIONALS

75

Theorem 3.4.3 Let 𝒯 ∈ 𝔅(ℍ) for a Hilbert space ℍ. If 𝒯 is nonnegative, there is a unique nonnegative operator 𝒮 ∈ 𝔅(ℍ) that satisfies 𝒮2 = 𝒯 and commutes with any operator that commutes with 𝒯. Proof: Assume without loss of generality that ‖𝒯‖ ≤ 1 so that we also have ‖I − 𝒯‖√ ≤ 1. The argument relies on the fact that the Maclaurin series expan∑∞ sion for 1 − z = 1 + j=1 cj zj is absolutely convergent for all |z| ≤ 1 with all the ∑n cj being negative. This has the consequence that the series 𝒮n ∶= I + j=1 cj (I − 𝒯)j is Cauchy and must therefore converge to some opera∑∞ tor 𝒮 in the Banach space 𝔅(ℍ). Representing 𝒮 as 𝒮 = I + j=1 cj (I − 𝒯)j , we can rearrange terms by absolute convergence to show that 𝒮2 = 𝒯. Now, 𝒮 is nonnegative by the fact ⟨x, 𝒮x⟩ = 1 +

∞ ∑

∞ ∑ ⟨ ⟩ j cj x, (I − 𝒯) x ≥ 1 + cj = 0

j=1

j=1

𝒯)j x⟩

as cj < 0 and 0 ≤ ⟨x, (I − ≤ 1. In addition, 𝒮n commutes with any operator that commutes with 𝒯 and this property passes on to the limit 𝒮. It remains to show that 𝒮 is unique. Suppose that there is another operator 𝒮̃ with the prescribed properties. Then ( ) ( ) ( ) ( ) ( )( ) 𝒮̃ − 𝒮 𝒮̃ 𝒮̃ − 𝒮 + 𝒮̃ − 𝒮 𝒮 𝒮̃ − 𝒮 = 𝒮̃ 2 − 𝒮2 𝒮̃ − 𝒮 = 0. As both operators on the left-hand side of the last expression are nonnegative, they must each be the zero operator. Thus, ( ) ( ) ( ) ( ) ( )3 𝒮̃ − 𝒮 𝒮̃ 𝒮̃ − 𝒮 − 𝒮̃ − 𝒮 𝒮 𝒮̃ − 𝒮 = 𝒮̃ − 𝒮 = 0. ( )n This shows that 𝒮̃ − 𝒮 = 0 for all n ≥ 3. So, for all x ∈ ℍ, ⟨( )4 ⟩ ‖( )2 ‖2 𝒮̃ − 𝒮 x, x = ‖ 𝒮̃ − 𝒮 x‖ = 0 ‖ ‖ ( )2 which implies that 𝒮̃ − 𝒮 = 0. Applying the argument again leads to the conclusion that 𝒮̃ − 𝒮 = 0. ◽ From this point forward, for any nonnegative operator 𝒯, the notation 𝒯1∕2 will refer to the square-root operator 𝒮 described in Theorem 3.4.3. Now recall the projection theorem (Theorem 2.5.2) for Hilbert spaces. If 𝕄 is a closed subspace of a Hilbert space ℍ, it stated that for each x ∈ ℍ, there was a unique element in 𝕄 that minimized the norm difference ‖x − y‖ over all y ∈ 𝕄. Here we denote this minimizing element by 𝒫𝕄 x. When viewed as a function on ℍ, 𝒫𝕄 is referred to as the projection operator for the subspace 𝕄.

76

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Theorem 3.4.4 If 𝕄 is a closed subspace of a Hilbert space ℍ, 𝒫𝕄 is a self-adjoint element of 𝔅(ℍ) that satisfies 𝒫𝕄 = 𝒫2𝕄 . Proof: To see that 𝒫𝕄 is linear we can use the characterization in (2.14): namely, for x1 , x2 ∈ ℍ, a1 , a2 ∈ ℝ, and any y ∈ 𝕄, ⟨a1 𝒫𝕄 x1 + a2 𝒫𝕄 x2 , y⟩ = a1 ⟨𝒫𝕄 x1 , y⟩ + a2 ⟨𝒫𝕄 x2 , y⟩ = a1 ⟨x1 , y⟩ + a2 ⟨x2 , y⟩ = ⟨a1 x1 + a2 x2 , y⟩. Thus, a1 𝒫𝕄 x1 + a2 𝒫𝕄 x2 = 𝒫𝕄 (a1 x1 + a2 x2 ). The fact that 𝒫𝕄 is self-adjoint is obtained similarly. Finally, for any x ∈ ℍ, 𝒫𝕄 x is in 𝕄. The minimization feature of projection now has the consequence that 𝒫2𝕄 = 𝒫𝕄 𝒫𝕄 = 𝒫𝕄 . ◽ Projection operators are obviously nonnegative. One interesting relationship that derives from Theorem 3.4.4 is that ‖𝒫𝕄 x‖ = ‖𝒫2𝕄 x‖ ≤ ‖𝒫𝕄 ‖‖𝒫𝕄 x‖, which shows that ‖𝒫𝕄 ‖ ≥ 1. However, as ‖𝒫𝕄 x‖ ≤ ‖x‖, we arrive at the following corollary. Corollary 3.4.5 ‖𝒫𝕄 ‖ = 1. If the subspace 𝕄 of ℍ has dimension one and is spanned by x with ‖x‖ = 1, then 𝒫𝕄 can be written as x ⊗ x where (x ⊗ x)y = ⟨y, x⟩x for any y ∈ ℍ. This can be established with Theorem 2.5.2 upon observing ⟨y, x⟩ = ⟨(x ⊗ x)y, x⟩, y ∈ ℍ. This is a special case of more general tensor product notation that we will use frequently throughout the text. Definition 3.4.6 Let x1 , x2 be elements of Hilbert spaces ℍ1 and ℍ2 , respectively. The tensor product operator (x1 ⊗1 x2 ) ∶ ℍ1 → ℍ2 is defined by (x1 ⊗1 x2 )y = ⟨x1 , y⟩1 x2 for y ∈ ℍ1 . If ℍ1 = ℍ2 we use ⊗ in lieu of ⊗1 . Suppose that ℍi = ℝpi , i = 1, 2 for some finite positive integers p2 , p1 . Then, x1 ⊗1 x2 = x2 xT1 for xi ∈ ℝpi , i = 1, 2: namely, x1 ⊗1 x2 is the vector outer product of x2 and x1 .

LINEAR OPERATOR AND FUNCTIONALS

77

Theorem 3.4.7 Let xi ∈ ℍi , i = 1, 2. Then ‖x1 ⊗1 x2 ‖ = ‖x2 ‖2 ‖x1 ‖1 . Proof: For x1 ≠ 0, ‖x1 ⊗1 x2 ‖ = sup ‖⟨x1 , u⟩1 x2 ‖2 ≤ ‖x2 ‖2 ‖x1 ‖1 . ‖u‖=1

The inequality is attained when u = x1 ∕‖x1 ‖1 .

3.5



Operator inverses

A linear mapping 𝒯 from a vector space 𝕏1 into a vector space 𝕏2 is one-to-one if Ker(𝒯) = {0} and onto if Im(𝒯) = 𝕏2 . When 𝒯 is both one-to-one and onto it is said to be bijective. Bijective linear mappings are invertible. That is, there exists a linear mapping 𝒯−1 from 𝕏2 to 𝕏1 such that 𝒯−1 𝒯 and 𝒯𝒯−1 are the identity transformation on their respective spaces. Suppose now that 𝕏1 and 𝕏2 are Banach spaces and that 𝒯 ∈ 𝔅(𝕏1 , 𝕏2 ) is bijective. In this event, we know that 𝒯−1 exists. However, we must still ask whether or not it is bounded. The answer is in the affirmative and the specific result that is stated here is sometimes referred to as the Banach Inverse Theorem. Theorem 3.5.1 Let 𝕏1 and 𝕏2 be Banach spaces with 𝒯 ∈ 𝔅(𝕏1 , 𝕏2 ). If 𝒯−1 exists, i.e., 𝒯 is bijective from 𝕏1 to 𝕏2 , it is an element of 𝔅(𝕏2 , 𝕏1 ) Proof: The core of the proof is the so-called Open Mapping Theorem which states that if 𝕏1 and 𝕏2 are Banach spaces and 𝒯 maps 𝕏1 onto 𝕏2 , then there exists some r ∈ (0, ∞) such that {y ∈ 𝕏2 ∶ ‖y‖2 ≤ r} ⊂ {𝒯x ∶ ‖x‖1 ≤ 1}. This result is a consequence of the Baire Category Theorem. A detailed proof can be found in, e.g., Rynne and Youngson (2001). Let us apply the Open Mapping Theorem here. Take any y ∈ 𝕏2 with ‖y‖2 ≤ 1 so that ry ∈ {𝒯x ∶ ‖x‖1 ≤ 1}. As 𝒯 is one-to-one, there exists an x ∈ 𝕏1 with ‖x‖1 ≤ 1 such that 𝒯x = ry. Thus, 1 1 1 1 ‖𝒯−1 y‖1 = ‖𝒯−1 (ry)‖1 = ‖𝒯−1 𝒯x‖1 = ‖x‖1 ≤ , r r r r which shows that 𝒯−1 is bounded.



Example 3.5.2 Define the integral operator 1

(𝒯q f )(⋅) =

∫0

Gq (⋅, u)f (u)du

78

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

for

q−1

(s − u)+ Gq (s, u) = (q − 1)!

and f ∈ 𝕃2 [0, 1]. We saw this operator in Section 2.8 in our discussion of the Sobolev space 𝕎q [0, 1]. In particular, we showed that any element f ∈ 𝕎q [0, 1] could be represented as f =

q−1 ∑

f (j) (0)𝜙j + 𝒯q f (q)

j=0

for 𝜙j , j = 0, … , q − 1 a basis for the polynomials of order q. This allowed us to write 𝕎q [0, 1] = ℍ0 ⊕ ℍ1 , where ℍ0 = span{𝜙0 , … , 𝜙q−1 } and ℍ1 = {𝒯q g ∶ g ∈ 𝕃2 [0, 1]}. This latter space is a Hilbert space under the inner product ⟨f , g⟩1 = ⟨f (q) , g(q) ⟩

(3.11)

with f , g ∈ ℍ1 and ⟨⋅, ⋅⟩ the 𝕃2 [0, 1] inner product. The operator 𝒯q can now be seen as a bijective mapping from 𝕏1 = 𝕃2 [0, 1] to 𝕏2 = ℍ1 . Define the linear mapping D by (Df ) = f ′ whenever the operation makes sense. If, for example, we consider functions in 𝕃2 [0, 1], this transformation is not bounded. However, Dq = 𝒯−1 q and the Banach Inverse Theorem can be invoked to conclude that Dq ∈ 𝔅(ℍ1 , 𝕃2 [0, 1]). In fact, there is a bit more that can be said about the relationship between ℍ1 and 𝕃2 [0, 1]. Theorem 3.5.3 The Hilbert spaces 𝕃2 [0, 1] and ℍ1 are congruent under the mapping Ψ ∶= 𝒯q ∈ 𝔅(𝕃2 [0, 1], ℍ1 ) and Ψ−1 = Dq ∈ 𝔅(ℍ1 , 𝕃2 [0, 1]). (q)

(q)

Proof: If f1 , f2 ∈ ℍ1 , ⟨f1 , f2 ⟩1 = ⟨𝒯q f1 , 𝒯q f2 ⟩1 = ⟨Dq f1 , Dq f2 ⟩.



In practice, we need sufficient conditions that can be used to establish that a particular operator is invertible. A useful sufficient condition that insures the existence of an operator’s inverse on a Hilbert spaces is

LINEAR OPERATOR AND FUNCTIONALS

79

Theorem 3.5.4 Let ℍ be a Hilbert space with norm ‖ ⋅ ‖ and 𝒯 ∈ 𝔅(ℍ). If 𝒯 is self-adjoint and there is some C > 0 such that ‖𝒯f ‖ ≥ C‖f ‖

(3.12)

for all f ∈ ℍ, then 𝒯 is invertible. Proof: We first show that Im(𝒯) is closed. Let yn ∈ Im(𝒯) and yn → y ∈ ℍ. Then, there exists xn ∈ ℍ such that 𝒯xn = yn with ‖ym − yn ‖ = ‖𝒯(xm − xn )‖ ≥ C‖xm − xn ‖ showing that xn is Cauchy and must converge to some limit x ∈ ℍ. By continuity, 𝒯x = y and we must have y ∈ Im(𝒯). It follows from Theorem 3.3.7 that ℍ = Im(𝒯) ⊕ Ker(𝒯). Assumption (3.12) implies that Ker(𝒯) = {0} and, hence, 𝒯 is both one-to-one and onto. ◽ If 𝒯 ∈ 𝔅(ℍ) satisfies ‖𝒯‖ < 1, then (3.12) holds with 𝒯 replaced by I − 𝒯, in which case Theorem 3.5.4 can be invoked to establish that I − 𝒯 is invertible. The following result provides the details on the form of the inverse operator. Theorem 3.5.5 Let 𝕏 be a Banach space and 𝒯 ∈ 𝔅 (𝕏). If ‖𝒯‖ < 1, then I − 𝒯 is invertible and (I − 𝒯)−1 = I +

∞ ∑

𝒯j .

(3.13)

j=1

∑ j Proof: As ‖𝒯‖ < 1, we have ∞ j=0 ‖𝒯‖ < ∞. The triangle inequality then ∑k insures that the partial sums 𝒮k ∶= I + j=1 𝒯j form a Cauchy sequence in the Banach space ∑∞ 𝔅 (𝕏) and must therefore have some limit which we denote by 𝒮 = I + j=1 𝒯j . Observe that ‖I − (I − 𝒯)𝒮k ‖ = ‖I − 𝒮k (I − 𝒯)‖ = ‖𝒯k+1 ‖ ≤ ‖𝒯‖k+1 → 0. This shows that (I − 𝒯)𝒮 = 𝒮(I − 𝒯) = I and completes the proof.



The Sherman–Morrison–Woodbury matrix identity is useful for solving a variety of problems in linear algebra and matrix theory. The operator version that we give here will prove to be similarly valuable.

80

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Theorem 3.5.6 For operators 𝒮, 𝒯, 𝒰, and 𝒱 with 𝒮, 𝒯 invertible, ( )−1 ( )−1 𝒯 + 𝒰𝒮−1 𝒱 = 𝒯−1 − 𝒯−1 𝒰 𝒮 + 𝒱𝒯−1 𝒰 𝒱𝒯−1 . (3.14) Proof: The proof ( proceeds by ) showing that the right-hand side of (3.14) is the inverse of 𝒯 + 𝒰𝒮−1 𝒱 . This latter fact follows from ) ( )( ( )−1 𝒯 + 𝒰𝒮−1 𝒱 𝒯−1 − 𝒯−1 𝒰 𝒮 + 𝒱𝒯−1 𝒰 𝒱𝒯−1 ( )( )−1 = I + 𝒰𝒮−1 𝒱𝒯−1 − 𝒰 + 𝒰𝒮−1 𝒱𝒯−1 𝒰 𝒮 + 𝒱𝒯−1 𝒰 𝒱𝒯−1 ( )( )−1 = I + 𝒰𝒮−1 𝒱𝒯−1 − 𝒰𝒮−1 𝒮 + 𝒱𝒯−1 𝒰 𝒮 + 𝒱𝒯−1 𝒰 𝒱𝒯−1 = I.



It is fair to say that invertible operators are the exception rather than the rule and this is especially true for the fda applications that are central to this text. The question then arises as to what can be said about the solution of a linear system such as 𝒯x = y (3.15) for a given y when the operator 𝒯 is not invertible. In the remainder of this section, we provide an answer for the case where 𝒯 is a bounded operator on a Hilbert space ℍ. Even when 𝒯 in (3.15) is neither one-to-one nor onto, there is still an aspect ̃ of the operator that can be inverted in some general sense. Specifically, let 𝒯 be the restriction of 𝒯 to the orthogonal complement of Ker(𝒯). Then, this restricted operator is one-to-one and onto its range and, hence, invertible. This line of thought leads us to the concept of a Moore–Penrose inverse that also goes by the names Moore–Penrose generalized inverse and pseudo inverse. Definition 3.5.7 Let ℍ1 , ℍ2 be Hilbert spaces. Suppose that 𝒯 ∈ 𝔅(ℍ1 , ℍ2 ) ̃ be the operator 𝒯 restricted to Ker(𝒯)⊥ . The Moore–Penrose (genand let 𝒯 eralized) inverse, 𝒯† , of 𝒯 is a linear transformation with domain Dom(𝒯† ) ∶= Im(𝒯) + Im(𝒯)⊥ . For y ∈ Dom(𝒯† ),

{ 𝒯† y =

̃ −1 y, 𝒯 0,

y ∈ Im(𝒯), y ∈ Im(𝒯)⊥ .

If Im(𝒯) is closed in ℍ2 , then ℍ2 = Im(𝒯) ⊕ Im(𝒯)⊥

LINEAR OPERATOR AND FUNCTIONALS

81

̃ −1 ∈ 𝔅(Im(𝒯), by (2.16). In that case, Theorem 3.5.1 implies that 𝒯 ⊥ † Ker(𝒯) ) and, hence, that 𝒯 ∈ 𝔅(ℍ2 , ℍ1 ). However, the requirement that Im(𝒯) be closed is a rather restrictive condition; for instance, in the case of the compact operators introduced in Chapter 4 this amounts to the operator being finite-dimensional (cf. Theorem 4.3.7). In general, particularly in fda applications, Im(𝒯) is not closed but 𝒯† still possesses some of the other properties we might expect from an inverse operator. Specifically, it satisfies the four Moore–Penrose equations (3.16)–(3.19) that are given in the following theorem. Theorem 3.5.8 Let 𝒯† be as in Definition 3.5.7. Then, 𝒯𝒯† 𝒯 = 𝒯,

(3.16)

𝒯† 𝒯𝒯† = 𝒯† ,

(3.17)

𝒯 𝒯=I − 𝒫

(3.18)

𝒯𝒯† = 𝒬

(3.19)



and

with 𝒫 the projection operator for Ker(𝒯) and 𝒬 the restriction of the projection operator for Im(𝒯) to Dom(𝒯† ). Proof: To verify (3.16)–(3.19), we will follow the arguments in Engl et al. (2000). First note that if y ∈ Dom(𝒯† ) it is necessarily true that ̃ −1 𝒬y. 𝒯† y = 𝒯† 𝒬y = 𝒯

(3.20)

This allows us to show (3.19) and (3.17) by observing that ̃ −1 𝒬y = 𝒬y 𝒯𝒯† y = 𝒯𝒯 and, hence,

𝒯† 𝒯𝒯† y = 𝒯† 𝒬y = 𝒯† y.

For (3.18), take any x ∈ ℍ and decompose it as 𝒫x + (I − 𝒫)x to see that ̃ −1 𝒯(I ̃ − 𝒫)x = (I − 𝒫)x. 𝒯† 𝒯x = 𝒯 However, from this identity, we obtain (3.16) because 𝒯𝒯† 𝒯x = 𝒯(I − 𝒫)x = 𝒯x. Another consequence of (3.20) is



82

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Theorem 3.5.9 Im(𝒯† ) = Ker(𝒯)⊥ . Proof: Expression (3.20) indicates that for any y ∈ Dom(𝒯† ), 𝒯† y is in the ̃ −1 , which is synonymous with Ker(T)⊥ . If, on the other hand, x ∈ range of 𝒯 Ker(T)⊥ , by definition 𝒯† 𝒯x = x. ◽ The Moore–Penrose pseudo inverse exhibits various other properties that would be expected for a inverse operator. For example, one may deduce from the definition that 𝒯† = 𝒯−1 (3.21) when 𝒯 is invertible and

(𝒯∗ )† = (𝒯† )∗ .

(3.22)

If 𝒯1 , 𝒯2 are two bounded operators, it can be shown that (𝒯1 𝒯2 )† = 𝒯†2 𝒯†1 .

(3.23)

We now return to the solution of (3.15) and elucidate its connection to the Moore–Penrose inverse. Certainly, we cannot expect there to be an exact or even unique solution to (3.15) when 𝒯 is not invertible. As a result, we must be satisfied with some form of approximate solution that has certain specified desirable qualities. In this respect, one possibility is to use a least-squares solution: i.e., an element x̂ ∈ ℍ that satisfies ‖𝒯̂x − y‖ = inf {‖𝒯x − y‖ ∶ x ∈ ℍ}

(3.24)

for a given y. In general, there may be more than one least-squares solution. Indeed, suppose that y in (3.24) is in Dom(𝒯† ) and consider the subspace 𝕄 = {x ∈ ℍ ∶ 𝒯x = 𝒬y} for 𝒬 the projection operator for Im(𝒯). This subspace is not empty because if y ∈ Dom(𝒯† ) we can choose, e.g., x = 𝒯† y by (3.19). However, for any x∈𝕄 ‖𝒯x − y‖ = ‖𝒬y − y‖ ≤ ‖z − y‖ for any z ∈ Im(𝒯), as a result of the projection theorem (Theorem 2.5.2). In particular, this means that ‖𝒯x − y‖ ≤ ‖𝒯h − y‖ for all h ∈ ℍ; i.e., all the elements of 𝕄 give least-squares approximations.

LINEAR OPERATOR AND FUNCTIONALS

83

One way to narrow down the possibilities for least-squares approximations is to use the one that has the smallest norm. This “best” approximate solution returns us to the Moore–Penrose inverse. Theorem 3.5.10 If y ∈ Dom(𝒯† ), the unique element of minimal norm that satisfies (3.24) is x̂ = 𝒯† y. Proof: As mentioned earlier, x = 𝒯† y is in 𝕄. Thus, any element in 𝕄 can be represented as 𝒯† y + z for some z ∈ Ker(𝒯). The result then follows from the fact that 𝒯† y ∈ Ker(𝒯)⊥ by Theorem 3.5.9 ◽ Example 3.5.11 If y in (3.24) is in Dom(𝒯† ) and x̂ = 𝒯† y, Theorem 2.5.2 has the consequence that 𝒯̂x − y ∈ Im(𝒯)⊥ = Ker(𝒯∗ ) due to Theorem 3.3.7. Thus, any least-squares solution must satisfy the normal equations 𝒯∗ 𝒯x = 𝒯∗ y.

(3.25)
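These relations are easy to probe numerically with a rank-deficient matrix. The sketch below is an added illustration in which numpy's pinv stands in for the Moore–Penrose inverse; it checks that the pseudo-inverse solution satisfies the normal equations (3.25), as well as the identity 𝒯† = (𝒯∗𝒯)†𝒯∗ noted at the end of this section.

```python
import numpy as np

rng = np.random.default_rng(2)
T = rng.normal(size=(6, 4)) @ np.diag([1.0, 1.0, 1.0, 0.0])   # a rank-deficient map
y = rng.normal(size=6)

x_hat = np.linalg.pinv(T) @ y                      # minimum-norm least-squares solution
print(np.allclose(T.T @ T @ x_hat, T.T @ y))       # normal equations (3.25) hold
print(np.allclose(np.linalg.pinv(T),
                  np.linalg.pinv(T.T @ T) @ T.T))  # T^dagger = (T* T)^dagger T*
```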

Combining Theorem 3.5.10 with (3.18) characterizes the best approximate solution as the minimum norm solution to the normal equations (3.25). Another application of Theorem 3.5.10 with 𝒯 and y in (3.15) replaced by 𝒯∗ 𝒯and 𝒯∗ y leads us to the realization that x̂ = (𝒯∗ 𝒯)† 𝒯∗ y: i.e., 𝒯† = (𝒯∗ 𝒯)† 𝒯∗ .

(3.26)

3.6 Fréchet and Gâteaux derivatives

In Section 2.6, we saw that the familiar concept of integrating a functions over an interval of the real line could be extended to various abstract formulations wherein integration takes place in some general Banach space. It should therefore come as no surprise that the operation of differentiating a function of a real variable admits similar types of extensions. In this section, we briefly introduce two such abstract views of differentiation: namely, the Gâteaux and Fréchet derivatives. For our purposes, it suffices to concentrate on the case of two Banach spaces 𝕏1 , 𝕏2 with respective norms ‖ ⋅ ‖1 and ‖ ⋅ ‖2 . Let f be a function defined on an open subset U of 𝕏1 that takes values in 𝕏2 . We say that f is Gâteaux differentiable at x ∈ U if there is an element f ′ (x) ∈ 𝔅(𝕏1 , 𝕏2 ) such that lim t−1 ‖f (x + tv) − f (x) − tf ′ (x)𝑣‖2 = 0 t→0

(3.27)

for every 𝑣 ∈ 𝕏1 . When (3.27) holds, f ′ (x) is called the Gâteaux derivative of f at x. The derivative, if it exists, is necessarily unique. If there were another

84

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

operator f̃ ′ (x) that satisfied (3.27) then, for every 𝑣 and any t ≠ 0, ‖(f̃ ′ (x) − f ′ (x))𝑣‖2 ≤ t−1 ‖f (x + tv) − f (x) − tf̃ ′ (x)𝑣‖2 + t−1 ‖f (x + tv) − f (x) − tf ′ (x)𝑣‖2 and the right-hand side tends to 0 as t → 0. Various analogs of classical results for derivatives of functions of real variables hold for Gâteaux derivatives. For example, we will have use shortly for the following parallel of the mean value theorem. Theorem 3.6.1 Given x, y ∈ 𝕏1 , assume that f has a Gâteaux derivative at each point in the set {x + t(y − x) ∶ 0 ≤ t ≤ 1}. Then, for every bounded linear functional 𝓁 on 𝕏2 , ( ) 𝓁 (f (y) − f (x)) = 𝓁 f ′ (x + 𝜉(y − x))(y − x) for some 𝜉 ∈ (0, 1). Proof: Set g(t) = 𝓁 (f((x + t(y − x))) so that )the linearity of 𝓁 has the consequence that g′ (t) = 𝓁 f ′ (x + t(y − x))(y − x) . An application of the ordinary mean value theorem assures the existence of a 𝜉 ∈ (0, 1) for which g(1) − g(0) = g′ (𝜉) and proves the result. ◽ The existence of a Gâteaux derivative is a rather weak assumption as the limit in (3.27) is taken in a fixed direction: namely, in the direction of the vector 𝑣. The Fréchet derivative concept arises when we allow the direction to vary thereby producing the condition that ‖f (x + 𝑣) − f (x) − f ′ (x)𝑣‖2 = 0. 𝑣→0 ‖𝑣‖1 lim

(3.28)

This, in turn, amounts to saying that for any arbitrary sequence {𝑣n } tending to 0 ‖f (x + 𝑣n ) − f (x) − f ′ (x)𝑣n ‖2 lim = 0. n→∞ ‖𝑣n ‖1 If (3.28) is satisfied for some f ′ (x) ∈ 𝔅(𝕏1 , 𝕏2 ), we call f ′ (x) the Fréchet derivative of f at x. Clearly, the existence of the Fréchet derivative of f at x implies the existence of the corresponding Gâteaux derivative and the two derivatives must coincide in this instance. The converse is not true in general. For instance, if f is Fréchet differentiable at x then it is continuous at x, as for any 𝜖 > 0, there exists 𝛿 > 0 such that whenever ‖𝑣‖1 ≤ 𝛿, 𝜖‖𝑣‖1 ≥ ‖f (x + 𝑣) − f (x) − f ′ (x)𝑣‖2 ≥ ‖f (x + 𝑣) − f (x)‖2 − ‖f ′ (x)𝑣‖2

LINEAR OPERATOR AND FUNCTIONALS

or

85

‖f (x + 𝑣) − f (x)‖2 ≤ (𝜖 + ‖f ′ (x)‖)‖𝑣‖1 .

This property is not shared by the Gâteaux derivative. For example, the function { 3 x y , (x, y) ≠ (0, 0), f (x, y) = x6 +y2 0, (x, y) = (0, 0), on ℝ2 is discontinuous at (0, 0) but f ′ (0, 0) = 0 in the Gâteaux sense. More discussions along this line can be found in Ortega and Rheinboldt (1970). A sufficient condition for our two derivative notions to agree is the following. Theorem 3.6.2 Suppose that f is Gâteaux differentiable in an open subset U of 𝕏1 . If f ′ is continuous at x ∈ U, then f ′ (x) is the Fréchet derivative of f at x. Proof: As f ′ is continuous at x, for any given 𝜖 > 0, there is 𝛿 > 0 such that ‖f ′ (x + 𝑣) − f ′ (x)‖2 < 𝜖 if ‖𝑣‖1 < 𝛿. The fact that U is open means that for sufficiently small 𝛿 we will also have x + t𝑣 ∈ U for all t ∈ [0, 1]. Theorem 3.6.1 now indicates that ( ) ( ) 𝓁 f (x + 𝑣) − f (x) − f ′ (x)𝑣 = 𝓁 f ′ (x + 𝜉𝑣)𝑣 − f ′ (x)𝑣 for some 𝜉 ∈ (0, 1). As this is true for all linear functional, we can apply it to the linear functional from Corollary 3.2.8 that returns the norm of f (x + 𝑣) − f (x) − f ′ (x)𝑣 with the consequence that ‖f (x + 𝑣) − f (x) − f ′ (x)𝑣‖2 = ‖f ′ (x + 𝜉𝑣)𝑣 − f ′ (x)𝑣‖2 ≤ ‖f ′ (x + 𝜉𝑣) − f ′ (x)‖‖𝑣‖1 ≤ 𝜖‖𝑣‖1 .



Higher order derivatives of both the Fréchet and Gâteaux varieties can be defined as the derivative of a derivative of one lower order. For example, if the first Gâteaux derivative of f exists over some open subset of 𝕏1 that contains a point x and there is an element f ′′ (x) of 𝔅(𝕏1 , 𝔅(𝕏1 , 𝕏2 )) that satisfies lim t−1 ‖f ′ (x + tv) − f ′ (x) − tf ′′ (x)𝑣‖ = 0 t→0

(3.29)

for every 𝑣 ∈ 𝕏1 , we refer to f ′′ (x) as the second Gâteaux derivative of f at x. Note that the norm ‖ ⋅ ‖ in (3.29) is operator norm. If ‖f ′ (x + 𝑣) − f ′ (x) − f ′′ (x)𝑣‖ = 0, 𝑣→0 ‖𝑣‖1 lim

f ′′ (x) is the second Fréchet derivative of f at x.

(3.30)

86

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

As f ′′ (x) is in 𝔅(𝕏1 , 𝔅(𝕏1 , 𝕏2 )), this means[that for ]every 𝑣1 ∈ 𝕏1 , f ′′ (x)𝑣1 is in 𝔅(𝕏1 , 𝕏2 ). Then, for every 𝑣1 , 𝑣2 ∈ 𝕏1 , f ′′ (x)𝑣1 𝑣2 ∶= f ′′ (x)𝑣1 𝑣2 is an element of 𝕏2 . That is, the mapping h(𝑣1 , 𝑣2 ) ∶ 𝕏1 × 𝕏1 → 𝕏2 defined by h(𝑣1 , 𝑣2 ) = f ′′ (s)𝑣1 𝑣2 is bilinear which provides a useful view of f ′′ (x) in this instance. Life becomes somewhat simpler in the case where 𝕏2 = ℝ. Now 𝔅(𝕏1 , 𝕏2 ) is the dual space of linear functionals on 𝕏1 and as a result Gâteaux derivatives derive their properties from ordinary calculus in this instance. For example, extrema of linear functionals can be partially characterized by the behavior of their Gâteaux derivatives. Theorem 3.6.3 Suppose that f is Gâteaux differentiable over 𝕏1 . If f has a local maximum or minimum at x ∈ 𝕏1 , f ′ (x)𝑣 = 0 for every 𝑣 ∈ 𝕏1 . Proof: As f (x) is a local maximum or minimum of f , the function g(t) = f (x + tv) must attain a local maximum of minimum at t = 0 for every 𝑣 ∈ 𝕏1 . Therefore, g′ (0) = 0. ◽ The calculation used to prove the previous theorem suggests the following result that proves quite useful for evaluation of Gâteaux and Fréchet derivatives. Theorem 3.6.4 Let f be twice Gâteaux differentiable at x ∈ 𝕏1 and set g(t) = f (x + tv) and h(t) = f ′ (x + t𝑣1 )𝑣2 . Then, f ′ (x)𝑣 = g′ (0) and f ′′ (x)𝑣1 𝑣2 = h′ (0). Proof: If suffices to observe that g(t) − g(0) f (x + tv) − f (x) = t t and that

[ ′ ] f (x + t𝑣1 ) − f ′ (x) h(t) − h(0) = 𝑣2 . t t



Example 3.6.5 To illustrate the use of Theorem 3.6.4, let 𝕏1 be a Hilbert space with inner product ⟨⋅, ⋅⟩ and define f(x) = ⟨x, x⟩. Then, g(t) = ⟨x + tv, x + tv⟩ and g′(0) = 2⟨x, 𝑣⟩ = f′(x)𝑣. So, f′(x) is the linear functional with representer 2x. It is therefore continuous in x, which means that f′(x) is also the Fréchet derivative of f at x as a result of Theorem 3.6.2. Now, h(t) = 2⟨x + t𝑣1, 𝑣2⟩ and h′(0) = 2⟨𝑣1, 𝑣2⟩. Thus, f′′(x) is the linear operator that maps 𝑣1 ∈ 𝕏1 into the linear functional with representer 2𝑣1. As this action is constant as a function of x, the derivative is trivially continuous in x and must therefore be the Fréchet derivative as well. In fact, f′′(x) is even invertible in this instance with [f′′(x)]^{−1}𝓁 = 𝑣𝓁∕2, with 𝑣𝓁 ∈ 𝕏1 the representer of the linear functional 𝓁 ∈ 𝔅(𝕏1, ℝ).
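A finite-dimensional numerical check of this example (an added illustration; the vectors are arbitrary) compares the difference quotient with 2⟨x, 𝑣⟩:

```python
import numpy as np

rng = np.random.default_rng(3)
x, v = rng.normal(size=5), rng.normal(size=5)

f = lambda z: float(z @ z)                         # f(x) = <x, x>
for t in (1e-2, 1e-4, 1e-6):
    print((f(x + t * v) - f(x)) / t)               # difference quotient tends to f'(x)v
print(2.0 * float(x @ v))                          # = 2 <x, v>
```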

LINEAR OPERATOR AND FUNCTIONALS

3.7

87

Generalized Gram–Schmidt decompositions

Projection operators provide the means by which the Gram–Schmidt algorithm from Theorem 2.4.10 can be extended to a much more general context. To explain this idea, let us revisit the original Gram–Schmidt formulation from a projection perspective. Suppose that m1 , … , mn is a collection of linearly independent vectors in a Hilbert space ℍ. Define 𝕄j = span{mj } for j = 1, … , n. These are all simple, closed subspaces with x ∈ 𝕄j meaning that x = cmj for some c ∈ ℝ. For each integer 1 ≤ k ≤ n, the algebraic direct sum of 𝕄1 , … , 𝕄k described in Definition 2.5.4 is just 𝕊k = 𝕄1 + · · · + 𝕄k = span{m1 , … , mk }. As the mj are linearly independent, 𝕄j ∩ 𝕄k = {0} and, more generally, this entails that ( ) ∑ 𝕄i ∩ 𝕄j = 𝕄i ∩ span{m1 , … , mi−1 , mi+1 , … , mn } j≠i

= {0}. Now, the Gram–Schmidt algorithm uses the mj to create a new set of orthonormal vectors e1 , … , en with e1 = m1 ∕‖m1 ‖ and ( ek =

mk −

k−1 ∑ j=1

) ⟨mk , ej ⟩ej

‖ k−1 /‖ ∑ ‖ ‖ ‖mk − ‖ ⟨m , e ⟩e k j j‖ ‖ ‖ ‖ j=1 ‖ ‖

for k = 2, … , n. The method of construction ensures that span{e1 , … , ek } = span{m1 , … , mk }. If we let ℕj = span{ej }, we can express this last relation in terms of the orthogonal direct sum notation of Definition 2.5.4 as 𝕊k = ℕ1 ⊕ · · · ⊕ ℕk .

88

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

In particular, we see from this that the ℕk are characterized by 𝕊k ∩ 𝕊⊥k−1 = (𝕄1 + · · · + 𝕄k ) ∩ (𝕄1 + · · · + 𝕄k−1 )⊥ = span{m1 , … , mk } ∩ span{m1 , … , mk−1 }⊥ = span{e1 , … , ek } ∩ span{e1 , … , ek−1 }⊥ = span{ek } = ℕk . From Section 3.4, we know that the projection operator for ℕj has the simple form 𝒫ℕj = ej ⊗ ej . Thus, any x ∈ 𝕄k can be expressed as x=

k ∑

⟨x, ej ⟩ej =

j=1

k ∑

𝒫ℕj x.

(3.31)

j=1

One can certainly view this as the culmination of the Gram–Schmidt procedure. However, there is a bit more that can be said if we think about how one might need to use this type of outcome. In some cases, it may be more convenient to work directly with the orthonormal ej . This can be true, for example, in norm-based optimization problems where orthogonality may render the calculations more tractable. In such instances, it may be possible to find a desirable coefficient for ek but, upon doing so, it becomes necessary to trace ones way back to the corresponding element of 𝕄k that was involved in the original problem formulation. A little adjustment is needed in (3.31) to take care of such eventualities. Let 𝒫ℕj |𝕄k be the projection operator 𝒫ℕj restricted to 𝕄k . Then, 𝒫ℕk |𝕄k mk = ⟨mk , ek ⟩ek and, similarly,

𝒫𝕄k |ℕk ek = ⟨mk , ek ⟩mk

with 𝒫𝕄k |ℕk now the projection operator for 𝕄k restricted to ℕk . Therefore, ( )−1 𝒫ℕk |𝕄k ek =

mk ⟨mk , ek ⟩

and every x ∈ 𝕄k can be expressed as x=

k ∑

( )−1 𝒫ℕj 𝒫ℕk |𝕄k z

j=1

=

k ∑ j=1

( )−1 𝒫ℕj |𝕄k 𝒫ℕk |𝕄k z

(3.32)

LINEAR OPERATOR AND FUNCTIONALS

89

for some z ∈ ℕk . Conversely, for any z ∈ ℕk , x in (3.32) is the corresponding element of 𝕄k . To step beyond the Gram–Schmidt scenario, we follow the path of Sunder (1988) and now consider a Hilbert space ℍ that can be written as the algebraic direct sum of n closed subspaces 𝕄1 , … , 𝕄n . That is, ℍ=

n ∑

𝕄i ,

i=1

where 𝕄1 , … , 𝕄n satisfy 𝕄i ∩



𝕄j = {0}.

(3.33)

j≠i

∑ If x1 + · · · + xn = 0 for xi ∈ 𝕄i then xi = − j≠i xj = 0 for all i by (3.33). Thus, (3.33) defines a notion of linear independence for subspaces. As such, every element in ℍ can be written as a sum of elements from the 𝕄j with the components of the sum being determined uniquely. Now, for 1 ≤ k ≤ n, define the partial sums 𝕊k =

k ∑

𝕄i

i=1

and set

ℕk = 𝕊k ∩ 𝕊⊥k−1 ,

where 𝕊0 ∶= {0}. Then, ℕk ⊥𝕊k−1 for all k and ℕi ⊥ℕj for i ≠ j. One can show by induction that k ∑

𝕄i = ⊕ki=1 ℕi

(3.34)

i=1

for all k, and in particular ℍ = ⊕ni=1 ℕi .

(3.35)

Let 𝒫ℕk be the orthogonal projection operators onto ℕk for 1 ≤ k ≤ n and for 1 ≤ j ≤ k ≤ n define the restriction of 𝒫ℕj to 𝕄k by 𝒫ℕj |𝕄k x = 𝒫ℕj x for x ∈ 𝕄k . Theorem 3.7.1 𝒫ℕk |𝕄k is bijective.

90

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Proof: Any x ∈ ℕk can be written as x = sk−1 + xk where sk−1 ∈ 𝕊k−1 and xk ∈ 𝕄k . As ℕk ⊥𝕊k−1 , x = 𝒫ℕk x = 𝒫ℕk xk . So, 𝒫ℕk maps 𝕄k onto ℕk . If x ∈ 𝕄k satisfies 𝒫ℕk x = 0, it must be that x ∈ 𝕄k ∩ ℕ⊥k . By (3.35), 𝕄k ∩ ℕ⊥k = 𝕄k ∩ ⊕i≠k ℕi = 𝕄k ∩ ⊕i≤k−1 ℕi = 𝕄k ∩ 𝕊k−1 as ℕj ⊥𝕄k for j > k. The right-hand side of the last relation is {0} by the definition of 𝕊k−1 . This shows that 𝒫ℕk |𝕄k is one-to-one. ◽ In view of (3.34), the inverse of 𝒫ℕk |𝕄k can be written as (𝒫ℕk |𝕄k )−1 =

k ∑

𝒫ℕj (𝒫ℕk |𝕄k )−1 .

j=1

As a consequence, any x ∈ 𝕄k has the representation x=

k ∑

𝒫ℕj (𝒫ℕk |𝕄k )−1 z

j=1

=

k ∑

𝒫ℕj |𝕄k (𝒫ℕk |𝕄k )−1 z

(3.36)

j=1

for some z ∈ ℕk and we have successfully extended (3.32) to our more general setting.

4

Compact operators and singular value decomposition In this chapter, we continue the discussion of operators that was begun in Chapter 3. However, in doing so, we will narrow our focus to the special case of operators that are compact in a sense to be described shortly. When working with Hilbert spaces, this type of operator can be approximated by finite-dimensional operators and, as a result, exhibits similar properties to those we are familiar with from the study of matrices. Not surprisingly, it is compact operators that are pervasive throughout statistics, in general, and fda, in particular. We begin in Section 4.1 with a general treatment of compact operators. Then, we specialize again; this time to the case of operators between Hilbert spaces in Sections 4.2 and 4.3 and derive both eigenvalue and singular value decompositions (svds) for this setting. Within the class of compact operators on Hilbert spaces, Hilbert–Schmidt and trace-class operators are of special interest due, in part, to the rapid convergence of their optimal finite-dimensional approximations. Accordingly, we investigate the properties of these operator classes in some depth in Sections 4.4 and 4.5. In functional data, integral operators are especially relevant. A key result for this type of operator is Mercer’s Theorem that uses the eigenvalue–eigenvector decomposition of an integral operator to obtain a corresponding series expansion for the operator’s kernel. This latter series is very important for functional pca and also provides a simple way to connect integral operators to the Hilbert–Schmidt and trace classes.

Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators, First Edition. Tailen Hsing and Randall Eubank. © 2015 John Wiley & Sons, Ltd. Published 2015 by John Wiley & Sons, Ltd.

92

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

4.1

Compact operators

We have mentioned matrices as providing the simplest and most familiar example of linear operators. A natural extension of the matrix idea leads us to the concept of compact operators. Definition 4.1.1 A linear transformation 𝒯 from a normed space 𝕏1 into another normed space 𝕏2 is compact if for any bounded sequence {xn } ∈ 𝕏1 , {𝒯xn } contains a convergent subsequence in 𝕏2 . First note that compact linear transformations are necessarily bounded and are therefore referred to as compact operators. To see this, suppose that instead 𝒯 were unbounded. Then, we could find a bounded sequence {xn } in 𝕏1 for which ‖𝒯xn ‖2 ≥ n for each n and, consequently, {𝒯xn } would not contain a convergent subsequence. On the other hand, being compact is a special quality that is not shared by every operator as demonstrated by the following result. Theorem 4.1.2 The identity operator defined on an infinite-dimensional normed space is not compact. Proof: Let us begin with the case of operators on a Hilbert space with CONS {ej } and identity operator I. Then, for i ≠ j, ‖Iei − Iej ‖ = ‖ei − ej ‖ =



2.

So, {Iej } does not contain a convergent subsequence even though {ej } is a bounded sequence. As suggested by our Hilbert space argument, verification of this result for a general infinite-dimensional normed space 𝕏 relies on the construction of a bounded sequence {ej } for which inf i, j ‖ei − ej ‖ > 0. This can be achieved as follows. Let e1 be any element of 𝕏 with unit norm and set 𝕐 = span{e1 }. Now choose an arbitrary element x from 𝕐 C and define d = inf{‖x − y‖ ∶ y ∈ 𝕐 }. As 𝕐 is closed, d must be bigger than zero. By the definition of infimum, for any 𝛼 ∈ (1, ∞), there exists some z ∈ 𝕐 such that ‖x − z‖ < 𝛼d. Thus, we now take e2 = (x − z)∕‖x − z‖ which has unit norm and satisfies ‖e2 − y‖ =

‖x − (z + ‖x − z‖y)‖ d > = 𝛼 −1 ‖x − z‖ 𝛼d

COMPACT OPERATORS AND SINGULAR VALUE DECOMPOSITION

93

for any y ∈ 𝕐 as z + ‖x − z‖y ∈ 𝕐 . In particular, ‖e1 − e2 ‖ > 𝛼 −1 . Next, let 𝕐 = span{e2 , e1 } and repeat the above-mentioned process by defining e3 with unit norm such that ‖e3 − y‖ > 𝛼 −1 for all y ∈ 𝕐 and ‖e3 − ei ‖ > 𝛼 −1 , i = 1, 2. These basic steps can be repeated ad infinitum to construct the desired sequence. ◽ Some of the basic properties of compact operators are collected in the following theorem. Theorem 4.1.3 The following results apply to compact operators between two normed linear spaces. 1. The closure of the range of any compact operator is separable. 2. Operators with finite rank are compact. 3. The composition of two operators is compact if either operator is compact. 4. The set of compact operators that map to any Banach space is closed. Proof: Let 𝒯 be a compact operator from a normed space 𝕏1 to a normed space 𝕏2 . Consider the set 𝒯(B(0; r)) in 𝕏2 where B(0; r) = {x ∈ 𝕏1 ∶ ‖x‖1 ≤ r} and let {yn } be any sequence in 𝒯(B(0; r)). Then, for every n, there exists an xn ∈ B(0; r) such that ‖yn − 𝒯xn ‖2 < n−1 .

(4.1)

As {xn } is bounded and 𝒯 is compact, {𝒯xn } contains a convergent subsequence. It then follows from (4.1) that {yn } also contains a convergent subsequence, which shows that 𝒯(B(0; r)) is sequentially compact. However, the fact that 𝕏2 is a metric space entails that 𝒯(B(0; r)) is compact and hence separable. In view of the relationship Im(𝒯) ⊂ ∪∞ 𝒯(B(0; r)), it follows that r=1 Im(𝒯) is also separable and the first result is proved. To prove the second claim, suppose that Im(𝒯) is finite-dimensional and {xn } is a bounded sequence in 𝕏1 . Then, the Bolzano–Weierstrass Theorem implies that {𝒯xn } contains a convergent subsequence in 𝕏2 thereby showing compactness. For part 3, let 𝒯1 and 𝒯2 be operators from 𝕏1 to 𝕏2 and 𝕏2 to 𝕏3 , respectively, with {xn } a bounded sequence in 𝕏2 . We then need to show that {𝒯2 (𝒯1 xn )} contains a convergent subsequence in 𝕏3 if either 𝒯1 or 𝒯2 is compact. If 𝒯1 is compact, then {𝒯1 xn } contains a convergent subsequence in 𝕏2 , and, as 𝒯2 is continuous, the image of this subsequence under 𝒯2 also converges in 𝕏3 . If, instead, 𝒯2 is compact then, as {𝒯1 xn } is bounded in 𝕏2 , {𝒯2 (𝒯1 xn )} contains a convergent subsequence in 𝕏3 .

94

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

To prove part 4, let {𝒯n } be a sequence of compact operators from 𝕏1 to 𝕏2 , where 𝕏2 is a Banach space. Our goal is to show that if ‖𝒯n − 𝒯‖ → 0 for some 𝒯 then 𝒯 is also compact. The proof is based on a “diagonalization” argument that we will now explain. Let {xk } be any bounded sequence in 𝕏1 . By compactness, 𝒯1 x1k converges to some y1 as k → ∞ for some subsequence {x1k } ⊂ {x0k } ∶= {xk }. Using this same argument, we can conclude 𝒯2 x2k converges to some y2 as k → ∞ for some {x2k } ⊂ {x1k }. Continuing in this manner, we see that for each n ≥ 1, 𝒯n xnk converges to some yn as k → ∞ for a subsequence {xnk } ⊂ {x(n−1)k }. We now show that yn converges. For each n, there is a large enough kn such that ‖𝒯n xnk − yn ‖2 < n−1 for all k ≥ kn . We can obviously choose the kn to satisfy kn′ > kn for n′ > n. So, for n′ > n write yn − yn′ = (yn − 𝒯n xn′ kn′ ) + (𝒯n′ xn′ kn′ − yn′ ) + (𝒯n xn′ kn′ − 𝒯xn′ kn′ ) +(𝒯xn′ kn′ − 𝒯n′ xn′ kn′ ), which gives ‖yn − yn′ ‖2 ≤ n−1 + n′

−1

+ ‖𝒯n − 𝒯‖ + ‖𝒯n′ − 𝒯‖.

This establishes that {yn } is Cauchy. As 𝕏2 is a Banach space, yn converges to some y in 𝕏2 . The inequality ‖𝒯xnkn − y‖2 ≤ ‖𝒯 − 𝒯n ‖‖xnkn ‖1 + ‖𝒯n xnkn − yn ‖2 + ‖yn − y‖2 now shows that 𝒯xnkn converges to y and proves that 𝒯 is compact.



We showed in Theorem 4.1.2 that the identity operator defined on an infinite-dimensional normed space is not compact. The following is a partial extension of that result. Theorem 4.1.4 Let 𝒯 be a bijective operator in 𝔅(𝕏1 , 𝕏2 ) for infinitedimensional Banach spaces 𝕏1 and 𝕏2 . Then, 𝒯 is not compact. Proof: By Theorem 3.5.1, 𝒯−1 ∈ 𝔅(𝕏2 , 𝕏1 ). If 𝒯 is compact, this implies that the identity mapping I = 𝒯−1 𝒯 is also compact due to result 3 of Theorem 4.1.3, which contradiction Theorem 4.1.2. ◽ The fundamental fact that underlies Theorem 4.1.4 is that the range of an infinite-dimensional compact operator is necessarily not closed. We will return to establish this in Theorem 4.3.7 .

COMPACT OPERATORS AND SINGULAR VALUE DECOMPOSITION

95

Typically, one uses some combination of results 2–4 of Theorem 4.1.3 to establish that a particular operator is compact. In the context of Hilbert spaces, it is possible to strengthen the connection between finite dimensional and general compact operators that was implied by parts 2 and 4 of the theorem. Theorem 4.1.5 Let ℍ1 , ℍ2 be Hilbert spaces and assume that 𝒯 ∈ 𝔅(ℍ1 , ℍ2 ). Then, 1. 𝒯 is compact if there exists a sequence {𝒯n } of finite-dimensional operators such that ‖𝒯n − 𝒯‖ → 0 as n → ∞ and 2. 𝒯 is compact if 𝒯∗ is compact. Proof: If there are finite-dimensional operators 𝒯n such that ‖𝒯 − 𝒯n ‖ → 0, then we conclude that 𝒯 is compact by results 2 and 4 of Theorem 4.1.3. For the other direction, using result 1 of Theorem 4.1.3, we can find an CONS {ej } for Im(𝒯) and define the finite-dimensional operator 𝒯n x ∶=

n ∑

⟨𝒯x, ej ⟩2 ej = 𝒫n 𝒯x,

j=1

for x ∈ ℍ1 with 𝒫n the projection operator onto span{e1 , … , en }. Suppose that ‖𝒯 − 𝒯n ‖ ↛ 0. Then, there exist an 𝜖 > 0 and a unit-norm sequence {xn′ } such that ‖(𝒯 − 𝒯n′ )xn′ ‖2 > 𝜖

(4.2)

for all n′ . Now, the compactness of 𝒯 implies that 𝒯xn′′ → y for some subsequence {xn′′ } ⊂ {xn′ } and some y. Thus, write (𝒯 − 𝒯n′′ )xn′′ = (I − 𝒫n′′ )𝒯xn′′ = (I − 𝒫n′′ )y + (I − 𝒫n′′ )(𝒯xn′′ − y), where I is the identity operator on Im(𝒯). As I − 𝒫n is bounded and (I − 𝒫n )y → 0 (although ‖I − 𝒫n ‖ ↛ 0), the right-hand side converges to 0 and contradicts (4.2). This shows that ‖𝒯 − 𝒯n ‖ → 0 and completes the proof of part 1. By parts 2 and 7 of Theorem 3.3.7, 𝒯∗n in the above-mentioned construction is also finite-dimensional and ‖𝒯∗n − 𝒯∗ ‖ = ‖𝒯n − 𝒯‖. Thus, part 2 follows from part 1. ◽ Part 2 of Theorem 4.1.5 and part 1 of Theorem 4.1.3 have the implication that if 𝒯 is a compact operator between Hilbert spaces ℍ1 and ℍ2 then Im(𝒯∗ ) is separable. Recall from Theorem 3.3.7 that ℍ1 = Ker(𝒯) ⊕ Im(𝒯∗ ). As a consequence, when considering the properties of a compact operator in 𝔅(ℍ1 , ℍ2 ), there is no loss of generality in assuming that both ℍ1 and ℍ2

96

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

are separable. We will therefore make that assumption throughout the rest of this chapter.

4.2

Eigenvalues of compact operators

The use of eigenvalues and eigenvectors for real positive definite matrices is central to the development of the mva pca concept from Section 1.1. If 𝒯 is a symmetric p × p positive semidefinite matrix, we know that it can be expressed as p ∑ 𝒯= 𝜆j ej ⊗ ej (4.3) j=1

with 𝜆1 ≥ 𝜆2 ≥ · · · ≥ 𝜆p ≥ 0, ej an eigenvector corresponding to 𝜆j and ej ⊗ ej = ej eTj the outer product of ej with itself (see Definition 3.4.6). In this section, we aim to extend the eigenvalue–eigenvector decomposition to work with compact operators on Hilbert spaces. As one might suspect, the tools we develop here will eventually be employed in functional data parallels of the mva pca concept. Let us focus on operators in 𝔅(ℍ) for some Hilbert space ℍ. The first step is to give a definition of an eigenvalue and eigenvector that is appropriate for that setting. Definition 4.2.1 Let 𝒯 ∈ 𝔅(ℍ) and suppose that there exists 𝜆 ∈ ℝ and a nonzero e ∈ ℍ such that 𝒯e = 𝜆e. (4.4) Then, 𝜆 is an eigenvalue and e is a corresponding eigenvector (or eigenfunction when ℍ is a function space) of 𝒯. It is easy to see that Ker(𝒯 − 𝜆I) is a closed linear subspace of ℍ and is therefore also a Hilbert space. Clearly, Ker(𝒯 − 𝜆I) is nontrivial if and only if 𝜆 is an eigenvalue, in which case we call Ker(𝒯 − 𝜆I) the eigenspace of 𝜆. Theorem 4.2.2 Let 𝒯 ∈ 𝔅(ℍ) and suppose that ej ∈ Ker(𝒯 − 𝜆j I) for j = 1, 2, … where the 𝜆j are distinct and nonzero. Then, 1. the ej are linearly independent and 2. if 𝒯 is self-adjoint, the ej are mutually orthogonal. Proof: For part 1, we need to show that if for some c1 , … , cn n ∑ j=1

cj ej = 0 and

n ∑ j=1

𝜆 j cj ej = 0

(4.5)

COMPACT OPERATORS AND SINGULAR VALUE DECOMPOSITION

97

then c1 = · · · = cn = 0. We will establish this for all n by induction. We start with n = 2. Then, (4.5) implies 𝜆1 (c1 e1 + c2 e2 ) − (𝜆1 c1 e1 + 𝜆2 c2 e2 ) = (𝜆1 − 𝜆2 )c2 e2 = 0, from which we conclude that c2 = 0 as 𝜆1 ≠ 𝜆2 . This entails that c1 = 0 as well since c1 e1 = −c2 e2 = 0. Next, suppose we have shown the claim for some n = k but we have found coefficients c1 , … , ck+1 such (4.5) holds for n = k + 1. In that case, 𝜆k+1

k+1 ∑ j=1

cj ej −

k+1 ∑ j=1

𝜆j cj ej =

k ∑

(𝜆k+1 − 𝜆j )cj ej = 0,

j=1

which, by the induction assumption, implies that cj = 0 for all k = 1, … , k + 1. For part 2, write −1 −1 ⟨ei , ej ⟩ = ⟨ei , 𝜆−1 j 𝒯ej ⟩ = 𝜆j ⟨𝒯ei , ej ⟩ = 𝜆j 𝜆i ⟨ei , ej ⟩.

As 𝜆i ≠ 𝜆j , this can only be true if ⟨ei , ej ⟩ = 0.



Theorem 4.2.3 Let 𝒯 ∈ 𝔅(ℍ) be a compact operator. Then, 1. Ker(𝒯 − 𝜆I) is finite-dimensional for any 𝜆 ≠ 0, 2. the number of distinct eigenvalues of 𝒯 with absolute values bigger than any positive number is finite, and 3. the set of nonzero eigenvalues of 𝒯 is countable. Proof: We need only prove parts 1 and 2 as part 3 is a direct consequence of these two results. To verify part 1, suppose that Ker(𝒯 − 𝜆I) is infinite-dimensional for some 𝜆 ≠ 0. Then, the same construction that was employed in the proof of Theorem 4.1.2 can be used here to find a sequence of unit-norm elements {ej } in Ker(𝒯 − 𝜆I) such that inf i, j ‖ei − ej ‖ > 0 and, hence, inf ‖𝒯ei − 𝒯ej ‖ = 𝜆 inf ‖ei − ej ‖ > 0. i,j

i,j

This shows that {𝒯ej } does not contain a convergent subsequence and contradicts the assumption that 𝒯 is compact. For part 2, let 𝜆1 , 𝜆2 , … be a sequence of distinct eigenvalues with |𝜆j | ≥ 𝜆 > 0 for all j. Define 𝕄n = span{e1 , … , en } for ej ∈ Ker(𝒯 − 𝜆j I) and take 𝕄0 ∶= {0}. The ej are linearly independent by part 1 of Theorem 4.2.2 which means that 𝕄n ∩ 𝕄⊥n−1 is a one-dimensional linear space for each n. We can

98

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

now use the Gram–Schmidt recursion to create an orthonormal sequence {xn } such that xn ∈ Ker(𝒯 − 𝜆n I) ∩ 𝕄⊥n−1 for all n and, for n > m, ‖𝒯xn − 𝒯xm ‖2 = ‖𝒯xn ‖2 + ‖𝒯xm ‖2 ≥ 2𝜆2 . Again, this contradicts the assumption that 𝒯 is compact.



With these preliminaries, we are now ready to present the eigenvalue– eigenvector decomposition for a self-adjoint compact operator. Theorem 4.2.4 Let 𝒯 be a compact, self-adjoint operator on ℍ. The set of nonzero eigenvalues for 𝒯 is either finite or consists of a sequence which tends to zero. Each nonzero eigenvalue has finite multiplicity and eigenvectors corresponding to different eigenvalues are orthogonal. Let 𝜆1 , 𝜆2 , … be the eigenvalues ordered so that |𝜆1 | ≥ |𝜆2 | ≥ · · · and let e1 , e2 , … be the corresponding orthonormal eigenvectors obtained using the Gram–Schmidt orthogonalization process as necessary for repeated eigenvalues. Then, {ej } is a CONS for Im(𝒯) and ∑ 𝒯= 𝜆j ej ⊗ ej ; (4.6) j≥1

i.e, for every x ∈ ℍ,

𝒯x =



𝜆j ⟨x, ej ⟩ej .

(4.7)

j≥1

Proof: In view of Theorems 4.2.2 and 4.2.3, we only need to show (4.7). By part 6 of Theorems 3.3.7 and the fact that 𝒯 is self-adjoint ℍ = Ker(𝒯) ⊕ Im(𝒯). As (4.7) holds for all x ∈ span{ej ∶ j ≥ 1} or Ker(𝒯), it suffices to show that {ej }∞ is a CONS for Im(𝒯): namely, that j=1 Im(𝒯) = span{ej ∶ j ≥ 1}.

(4.8)

For any finite n and nonzero c1 , … , cn ∈ ℝ, we have n ∑ j=1

cj ej = 𝒯

( n ∑

∑n

j=1 cj ej

∈ Im(𝒯) as

) 𝜆−1 j cj ej

.

j=1

This shows that span{ej ∶ j ≥ 1} ⊂ Im(𝒯) so that span{ej ∶ j ≥ 1} ⊂ Im(𝒯).

COMPACT OPERATORS AND SINGULAR VALUE DECOMPOSITION

99

From (2.16), we can write Im(𝒯) = span{ej ∶ j ≥ 1} ⊕ ℕ, where ℕ contains those elements in Im(𝒯) that are orthogonal to span{ej ∶ j ≥ 1}. Clearly, 𝒯 maps span{ej ∶ j ≥ 1} into its closure. It also maps ℕ into ℕ. To see that this latter claim is valid let x ∈ ℕ and y ∈ span{ej ∶ j ≥ 1}. Then, ⟨𝒯x, y⟩ = ⟨x, 𝒯y⟩ = 0 ⊥

and 𝒯x ∈ span{ej ∶ j ≥ 1} = ℕ. Let 𝒯ℕ be the restriction of 𝒯 to ℕ and observe that any nonzero eigenvalue of 𝒯ℕ must also be an eigenvalue for the original operator 𝒯. If 𝒯ℕ is not the zero operator, it is easy to see that either ‖𝒯ℕ ‖ or −‖𝒯ℕ ‖ must be one of its eigenvalues and, hence, an eigenvalue for 𝒯 as well. However, all the nonzero eigenvalues for 𝒯 were already captured in the collection {𝜆j , j ≥ 1} which leaves 𝒯ℕ being the zero operator as the only option. Thus, ℕ ⊂ Im(𝒯) ∩ Ker(𝒯) = {0} and (4.8) is proved. ◽ The sum in (4.6) is taken over all positive values for the index. This allows for the possibility that the operator has either finitely or infinitely many nonzero eigenvalues. Our interest is primarily in the latter instance and, as a result, we will usually write our sums as having an infinite upper limit with the understanding that all but finitely many eigenvalues will be zero for a finite dimensional operator. It is clear that all eigenvalues of a nonnegative definite operator are nonnegative, for if (𝜆, e) is an eigenvalue/eigenvector pair of 𝒯 then ⟨𝒯e, e⟩ = 𝜆‖e‖2 . If 𝒯 is compact and self-adjoint, then Theorem 4.2.4 entails that 𝒯 is nonnegative definite if and only if all eigenvalues of 𝒯 are nonnegative. The eigenvalues of compact operators provide solutions to various variational problems. For example, in the case of nonnegative operators, we have the following rather immediate result. Theorem 4.2.5 If 𝒯 is compact and nonnegative definite with associated eigenvalue/eigenvector sequence {(𝜆j , ej )}∞ , then j=1 𝜆k =

max

e∈span{e1 ,…,ek−1

}⊥

⟨𝒯e, e⟩ , ‖e‖2

for all k with span{e1 , … , ek−1 }⊥ being the entirety of ℍ when k = 1.

(4.9)

100

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

This result holds somewhat more generally. From Theorem 4.2.4, a self-adjoint compact operator with the representation (4.6) satisfies ∑ 𝒯n = 𝜆nj ej ⊗ ej (4.10) j≥1

for any positive integer n: namely, (𝜆nj , ej ), j ≥ 1, are the eigenvalue– eigenvector pairs of 𝒯n . Thus, the following is a straightforward consequence of Theorem 4.2.5. Theorem 4.2.6 Let 𝒯 ∈ 𝔅(ℍ) be a compact, self-adjoint operator with eigenvalue-eigenvector pairs {(𝜆j , ej )}∞ arranged so that 𝜆21 ≥ 𝜆22 ≥ · · ·. j=1 Then, ‖𝒯e‖ |𝜆k | = max . (4.11) e∈span{e1 ,…,ek−1 }⊥ ‖e‖ Note that Theorem 4.2.6 characterizes the operator norm of a self-adjoint compact operator as being the absolute value of its eigenvalue with the largest magnitude. If the compact, self-adjoint operator 𝒯 is also nonnegative-definite, then all the eigenvalues are nonnegative and the representation in (4.10) can be extended to noninteger powers. For example, this allows us to define the self-adjoint operator ∞ √ ∑ 𝒯1∕2 = 𝜆j ej ⊗ ej , j=1

which satisfies 𝒯12 𝒯1∕2 = 𝒯 (cf. Theorem 3.4.3). Note that 𝒯1∕2 is necessarily a compact element of 𝔅(ℍ). A more profound characterization of eigenvalues than the one in Theorem 4.2.5 is the Courant–Fischer minimax principle that can be described in the following manner. Theorem 4.2.7 Let 𝒯 be a nonnegative, compact operator on a Hilbert space ℍ with eigenvalues 𝜆1 ≥ 𝜆2 ≥ · · · ≥ 0. Then, 𝜆k =

⟨𝒯𝑣, 𝑣⟩ 𝑣1 ,…,𝑣k ∈ℍ 𝑣∈span{𝑣1 ,…,𝑣k } ‖𝑣‖2 max

min

and 𝜆k =

min

max

𝑣1 ,…,𝑣k−1 ∈ℍ 𝑣∈span{𝑣1 ,…,𝑣k−1 }⊥

⟨𝒯𝑣, 𝑣⟩ , ‖𝑣‖2

(4.12)

(4.13)

where the max and min in (4.12) and (4.13) are attained when 𝑣k is the eigenvector ek that corresponds to 𝜆k .

COMPACT OPERATORS AND SINGULAR VALUE DECOMPOSITION

101

This is a truly remarkable result and, accordingly, a bit of interpretation is in order before we proceed to the proof. For example, the minimax relation (4.13) states that one can pick any k − 1 dimensional subspace 𝕍k−1 and the maximum of 𝒯’s Rayleigh quotient R𝒯(𝑣) ∶=

⟨𝒯𝑣, 𝑣⟩ ‖𝑣‖2

(4.14)

across all vectors that are in the orthogonal complement of this subspace can be no smaller than 𝜆k : i.e., 𝜆k ≤ max ⊥

𝑣∈𝕍k−1

⟨𝒯𝑣, 𝑣⟩ ‖𝑣‖2

for all 𝕍k−1 with equality for 𝕍k−1 = span{e1 , … , ek−1 } and 𝑣 = ek . Proof: Define 𝕄k−1 = span{e1 , … , ek−1 }. Now suppose that 𝕍k is any k dimensional subspace of Im(𝒯) and note that 𝕍k ∩ 𝕄⊥k−1 is nonempty. If ∑∞ 𝑣 is any element in this intersection, we can write it as 𝑣 = j=k aj ej to see that ∞ ∑ 𝜆j a2j ⟨𝒯𝑣, 𝑣⟩ j=k = ∞ ≤ 𝜆k . ‖𝑣‖2 ∑ 2 aj j=k

Thus, min 𝑣∈𝕍k

⟨𝒯𝑣, 𝑣⟩ ⟨𝒯𝑣, 𝑣⟩ ≤ min ≤ 𝜆k , ‖𝑣‖2 𝑣∈𝕍k ∩𝕄⊥k−1 ‖𝑣‖2

where the continuity of the Rayleigh quotient allows us to conclude that these minima all exist. However, the choice of 𝕍k was arbitrary which means that (4.12) holds because equality is obtained with the choice 𝕍k = span{e1 , … , ek } by Theorem 4.2.5. ⊥ Things work similarly in proving (4.13). The difference is that 𝕍k−1 ∩ 𝕄k is now the nonempty ∑k set of interest. In this case, an element in the intersection looks like 𝑣 = j=1 aj ej , which produces the inequality k ∑

𝜆j a2j

⟨𝒯𝑣, 𝑣⟩ = k ‖𝑣‖2 ∑ j=1

j=1

that is needed to complete the proof.

≥ 𝜆k a2j ◽

102

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

A consequence of Theorem 4.2.7 that will eventually prove useful is given in the following theorem. ̃ be nonnegative definite, compact operators with Theorem 4.2.8 Let 𝒯, 𝒯 eigenvalue sequences {𝜆j } and {𝜆̃ j }, respectively. Then ̃ sup |𝜆k+1 − 𝜆̃ k+1 | ≤ ‖𝒯 − 𝒯‖. k≥0

Proof: By Theorem 4.2.7, 𝜆k+1 =

min

max

𝑣1 ,…,𝑣k ∈ℍ 𝑣∈span{𝑣1 ,…,𝑣k }⊥

⟨𝒯𝑣, 𝑣⟩ . ‖𝑣‖2

Clearly, for any 𝑣1 , … , 𝑣k , max

𝑣∈span{𝑣1 ,…,𝑣k }⊥

̃ + 𝒯 − 𝒯)𝑣, ̃ 𝑣⟩ ⟨𝒯𝑣, 𝑣⟩ ⟨(𝒯 = max 𝑣∈span{𝑣1 ,…,𝑣k }⊥ ‖𝑣‖2 ‖𝑣‖2 ̃ 𝑣⟩ ⟨𝒯𝑣, ≤ max ⊥ 𝑣∈span{𝑣1 ,…,𝑣k } ‖𝑣‖2 ̃ 𝑣⟩ ⟨(𝒯 − 𝒯)𝑣, + max , 𝑣∈ℍ ‖𝑣‖2

where the second term of the right-hand side of the inequality is clearly ̃ Taking the minimum over 𝑣1 , … , 𝑣k ∈ ℍ on both bounded by ‖𝒯 − 𝒯‖. sides gives ̃ 𝜆k+1 ≤ 𝜆̃k+1 + ‖𝒯 − 𝒯‖. ̃ gives the result. Interchanging 𝒯 and 𝒯



The previous two results, as stated, would only seem to be applicable to nonnegative compact operators. However, this is a bit deceiving and they can actually be used with any self-adjoint, compact operator once one separates its eigenvalues into those that are all positive or all negative. Using Theorem 4.2.4, we can write any self-adjoint compact operator as 𝒯 = 𝒯+ − 𝒯− , where

𝒯+ =



𝜆j ej ⊗ ej

𝜆j >0

and

𝒯− =



(−𝜆j )ej ⊗ ej .

𝜆j 0. Proof: As 𝒯 ∑kis of finite rank, it is necessarily compact. Thus, by Theorem 4.3.1, 𝒯 = j=1 𝜆j (𝑣j ⊗1 uj ) with (i) 𝜆21 ≥ · · · 𝜆2k > 0 the nonzero eigenvalues of 𝒯T 𝒯, and 𝒯𝒯T , (ii) 𝑣1 , … , 𝑣q the eigenvectors of 𝒯T 𝒯 and (iii) u1 , … , up the eigenvectors of 𝒯𝒯T .

COMPACT OPERATORS AND SINGULAR VALUE DECOMPOSITION

105

The tensor-product operator ⊗1 in this case is just the vector outer product: i.e., 𝑣j ⊗1 uj = uj 𝑣Tj . ◽ The singular values for a compact operator inherit certain optimality properties from the fact that they are eigenvalues. For instance, an svd version of Theorem 4.2.7 takes the following form. Theorem 4.3.3 Let 𝒯 ∈ 𝔅(ℍ1 , ℍ2 ) be a compact with singular values 𝜆1 ≥ 𝜆2 ≥ · · · ≥ 0. Then 𝜆k =

max

min

𝑣1 ,…,𝑣k ∈ℍ1 f ∈span{𝑣1 ,…,𝑣k }

‖𝒯f ‖2 ‖f ‖1

and 𝜆k =

min

max

𝑣1 ,…,𝑣k−1 ∈ℍ1 f ∈span{𝑣1 ,…,𝑣k−1 }⊥

‖𝒯f ‖2 , ‖f ‖1

(4.17)

(4.18)

where the max and min in (4.12) and (4.13) are attained with 𝑣k equal to the right singular vector f1k corresponding to 𝜆k . Proof: The proof is immediate from Theorem 4.2.7 as the 𝜆2i are the nonascending eigenvalues of 𝒯∗ 𝒯. ◽ A simple but important case of the previous result occurs when k = 1. In that instance, we obtain Theorem 4.3.4 Let ℍ1 and ℍ2 be Hilbert spaces and let 𝒯 be a compact operator from ℍ1 into ℍ2 whose largest singular value is 𝜆1 . Then, ‖𝒯‖ = 𝜆1 . The following result provides a fundamental characterization of compact operators. Theorem 4.3.5 An operator 𝒯 ∈ 𝔅(ℍ1 , ℍ2 ) is compact if and only if the sve (4.15) holds. Proof: Theorem 4.3.1 establishes the sve for any compact operator. To proceed in the other direction, assume that (4.15) holds and define 𝒯n =

n ∑

𝜆j (f1j ⊗1 f2j ).

j=1

Then, 𝒯n is finite dimensional and hence compact. By Theorem 4.3.4, ‖𝒯 − 𝒯n ‖ = 𝜆n+1 → 0. The compactness of 𝒯 is now a consequence of part 1 of Theorem 4.1.5. ◽

106

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

An alternative version of (4.17) that directly connects the left and right singular vectors takes the following form. Theorem 4.3.6 Let 𝒯 ∈ 𝔅(ℍ1 , ℍ2 ) be a compact with associated singular system {(𝜆j , f1j , f2j )}∞ . Then, j=1 𝜆k =

|⟨𝒯f1 , f2 ⟩2 | , 𝑣11 ,…,𝑣1k ∈ℍ1 f1 ∈span{𝑣11 ,…,𝑣1k } ‖f1 ‖1 ‖f2 ‖2 max

min

(4.19)

𝑣21 ,…,𝑣2k ∈ℍ2 f2 ∈span{𝑣21 ,…,𝑣2k }

where the maximum is attained with 𝑣1k = f1k and 𝑣2k = f2k . Proof: Analogous to the arguments for Theorem 4.2.7, we define 𝕄i(k−1) = span{fi1 , … , fi(k−1) } for i = 1, 2 and choose k dimensional subspaces 𝕍ik with 𝕍ik ⊂ ℍi , i = 1, 2. Then, the subspaces 𝕄⊥i(k−1) ∩ 𝕍ik , i = 1, 2, are not empty and we can express ∑∞ ∑∞ any element in 𝕄⊥i(k−1) ∩ 𝕍ik as fi = j=k aij fij with ‖fi ‖2i = j=k a2ij . Thus, ⟨𝒯f1 , f2 ⟩22 ‖f1 ‖21 ‖f2 ‖22

(∑ ∞

)2 𝜆 a a j=k j 1j 2j = (∑ ) (∑ ) ≤ 𝜆2k ∞ ∞ 2 2 j=k a1j j=k a2j

from the Cauchy–Schwarz inequality. This means that min

⟨𝒯f1 , f2 ⟩22

f1 ∈𝕍1k ,f2 ∈𝕍2k

‖f1 ‖21 ‖f2 ‖22

≤ 𝜆2k

for arbitrary choices of 𝕍ik with equality when 𝕍ik = 𝕄ik , i = 1, 2, as a result of Theorem 4.3.1. ◽ For compact operators, the Moore–Penrose inverse from Section 3.5 can be characterized in terms of the operator’s singular system. Assume that {eij }∞ , i = 1, 2 are CONSs for ℍ1 and ℍ2 so that, for example, any x ∈ ℍ1 j=1 ∑ ∑∞ 2 can be expressed as x = ∞ j=1 ⟨x, e1j ⟩1 e1j with j=1 ⟨x, e1j ⟩1 < ∞. The sve then gives 𝒯x =

∞ ∑ j=1

𝜆j ⟨x, e1j ⟩1 e2j .

COMPACT OPERATORS AND SINGULAR VALUE DECOMPOSITION

107

∑∞ A condition that characterizes y = j=1 ⟨y, e2j ⟩2 e2j ∈ Im(𝒯) is now seen to be that ∞ ∑ ⟨y, e2j ⟩22 < ∞. (4.20) 𝜆2j j=1 This is known as the Picard condition and, when it holds, we can write 𝒯† y =

∞ ∑ ⟨y, e2j ⟩2 j=1

𝜆j

e1j .

(4.21)

The following result builds on the previous discussion and uses Theorem 4.1.4 to provide some additional insight into the properties of compact operators. Theorem 4.3.7 Suppose that 𝒯 is an infinite-dimensional compact operator between two Hilbert spaces. Then, Im(𝒯) is not closed. Proof: Let 𝒯 have the singular system {(𝜆j , f1j , f2j )}∞ . As 𝜆j ↓ 0 as j → ∞, j=1 we can choose a subsequence {jk } such that 𝜆jk ≤ k−2 for all k. Define y=

∞ ∑

𝜆jk f2jk .

k=1

Clearly, y ∈ Im(𝒯) as f2jk = 𝒯(f1,jk ∕𝜆jk ). However, the Picard condition does not hold for y. Thus, y ∉ Im(𝒯). ◽

4.4

Hilbert–Schmidt operators

There are various subclasses of the set of compact operators that arise in our work ahead. One of these is the collection of Hilbert–Schmidt operators that are the object of interest for this section. We begin by stating a result that will prove useful in several places below in this section. Theorem 4.4.1 Let 𝒯1 and 𝒯2 be operators in 𝔅(ℍ1 , ℍ2 ). Suppose that there exist CONSs {e1i } and {e2j } for ℍ1 and ℍ2 , respectively, such that either ∞ ∑ i=1

(‖𝒯1 e1i ‖22 + ‖𝒯2 e1i ‖22 ) < ∞,

(4.22)

108

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

or ∞ ∑

(‖𝒯∗1 e2j ‖21 + ‖𝒯∗2 e2j ‖21 ) < ∞.

(4.23)

j=1

Then, ∞ ∑

⟨𝒯1 e1i , 𝒯2 e1i ⟩2 =

i=1

∞ ∞ ∑ ∑

⟨𝒯1 e1i , e2j ⟩2 ⟨𝒯2 e1i , e2j ⟩2

i=1 j=1

=

∞ ∞ ∑ ∑

⟨e1i , 𝒯∗1 e2j ⟩1 ⟨e1i , 𝒯∗2 e2j ⟩1

i=1 j=1

=

∞ ∑

⟨𝒯∗1 e2j , 𝒯∗2 e2j ⟩1 ,

(4.24)

j=1

where this identity holds for all choices of the CONSs {e1i } and {e2j }. Proof: First we verify (4.24) for the special case 𝒯1 = 𝒯2 = 𝒯. The argument is straightforward using the expansions 𝒯e1i =

∞ ∑

⟨𝒯e1i , e2j ⟩2 e2j

j=1

and 𝒯 e2j = ∗

∞ ∑

⟨𝒯∗ e2j , e1i ⟩1 e1i .

i=1

Thus, we have ∞ ∑

‖𝒯e1i ‖22 =

i=1

∞ ∞ ∑ ∑

⟨𝒯e1i , e2j ⟩22

i=1 j=1

=

∞ ∞ ∑ ∑

⟨e1i , 𝒯∗ e2j ⟩21

i=1 j=1

=

∞ ∑

‖𝒯∗ e2j ‖21 .

(4.25)

j=1

As the last expression does not depend on {e1i }, we can conclude that all our previous formulae are also independent of this choice for the CONS; the same argument then leads to the conclusion that the expressions are independent of the choice for {e2j } as well.

COMPACT OPERATORS AND SINGULAR VALUE DECOMPOSITION

109

Next, consider the three identities obtained by replacing 𝒯 in (4.25) with 𝒯1 + 𝒯2 , 𝒯1 and 𝒯2 . Using (4.22) or (4.23), we can obtain (4.24) by subtracting the second and third of these identities from the first. ◽ Definition 4.4.2 Let {ei } be a CONS for ℍ1 and 𝒯 ∈ 𝔅(ℍ1 , ℍ2 ). If 𝒯 satisfies ∞ ∑

‖𝒯ei ‖22 < ∞,

i=1

then 𝒯 is called a Hilbert–Schmidt (HS) operator. The collection of HS operators in 𝔅(ℍ1 , ℍ2 ) is denoted by 𝔅HS (ℍ1 , ℍ2 ). Theorem 4.4.1 shows that Definition 4.4.2 does not depend on the choice of CONS. It also reveals that 𝒯 is HS if 𝒯∗ is HS. Our next result has the consequence that Hilbert–Schmidt operators constitute a subclass of the collection of all compact operators. Theorem 4.4.3 A Hilbert–Schmidt operator is compact. Proof: Let 𝒯 be a HS operator and for x ∈ ℍ1 define 𝒯n x =

n ∑

⟨𝒯x, e2i ⟩2 e2i ,

i=1

where {e2i } is a CONS for ℍ2 . Clearly, the range of 𝒯n is finite-dimensional and so 𝒯n is ∑ compact. It suffices to show that ‖𝒯 − 𝒯n ‖ → 0. However, ∞ (𝒯n − 𝒯)x = i=n+1 ⟨𝒯x, e2i ⟩2 e2i and, hence, if ‖x‖1 ≤ 1 an application of the Cauchy–Schwarz inequality produces ‖(𝒯n − 𝒯)x‖22 =

∞ ∑

∞ ∑

⟨𝒯x, e2i ⟩22 =

i=n+1

⟨x, 𝒯∗ e2i ⟩21 ≤

i=n+1

∞ ∑

‖𝒯∗ e2i ‖21 ,

i=n+1



which tends to 0 as n tends to ∞.

Clearly, 𝔅HS (ℍ1 , ℍ2 ) is a linear space. We can also construct an associated inner product in the following manner. Definition 4.4.4 The inner product of 𝒯1 , 𝒯2 ∈ 𝔅HS (ℍ1 , ℍ2 ) is ⟨𝒯1 , 𝒯2 ⟩HS =

∞ ∑ j=1

⟨𝒯1 ej , 𝒯2 ej ⟩2 ,

(4.26)

110

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

where {ej } is any CONS for ℍ1 . The HS norm of 𝒯, ‖𝒯‖HS , is the norm determined by the HS inner product: namely, {∞ }1∕2 ∑ ‖𝒯‖HS = ‖𝒯ej ‖22 . j=1

Theorem 4.4.1 again shows that the definition for the HS norm and inner product is independent of the choice of CONS. It also entails that ‖𝒯‖2HS =

∞ ∞ ∑ ∑ ⟨𝒯e1i , e2j ⟩22

(4.27)

i=1 j=1

for arbitrary CONSs {e1i } and {e2j } of ℍ1 and ℍ2 , respectively. In particular by choosing the CONSs to be the singular vectors of 𝒯 we obtain ‖𝒯‖2HS =

∞ ∑

𝜆2j ,

(4.28)

i=1

where the 𝜆j are the singular values of 𝒯. In the finite-dimensional case, the HS norm is often referred to as the Frobenius norm. Suppose that 𝒯 = {𝜏ij }i=1∶q,j=1∶p is a q × p real matrix: i.e., it is a linear operator that maps ℍ1 = ℝp into ℍ2 = ℝq . Then, its squared HS norm is ‖𝒯‖2HS

=

q p ∑ ∑

𝜏ij2 = trace(𝒯T 𝒯)

(4.29)

i=1 j=1

with trace the matrix trace. To see this, choose e1 , … , ep so that ej is a vector of all zeros except for a one as its jth element. Then, 𝒯ej is the jth column of ∑q 𝒯 with ‖𝒯ej ‖2 = eTj 𝒯T 𝒯ej = i=1 𝜏ij2 . The following theorem validates our choice for the inner product in Definition 4.4.4. Theorem 4.4.5 The linear space 𝔅HS (ℍ1 , ℍ2 ) is a separable Hilbert space when equipped with the HS inner product. For any choice of CONS {e1i } and {e2j } for ℍ1 and ℍ2 , respectively, {e1i ⊗1 e2j } is a CONS of 𝔅HS (ℍ1 , ℍ2 ). ∑∞ ∑∞ Proof: As ‖𝒯‖2HS = i=1 j=1 a2ij for aij = ⟨𝒯e1i , e2j ⟩2 , completeness under the HS norm is a consequence of the completeness of the 𝓁 2 space (Theorem 2.3.5). It only remains to show that {e1i ⊗1 e2j } is a CONS. Orthonormality is obvious in this instance. Suppose then that 𝒯 satisfies ⟨𝒯, e1i ⊗1 e2j ⟩HS = 0 for all i, j. This is equivalent to ⟨𝒯e1i , e2j ⟩2 = 0 for all i, j, which, in turn, implies that 𝒯 = 0 due to (4.27). Theorem 2.4.12 can now be invoked to complete the proof. ◽

COMPACT OPERATORS AND SINGULAR VALUE DECOMPOSITION

111

Example 4.4.6 Let ℍ1 , ℍ2 be separable Hilbert spaces. Given a measure space (E, ℬ, 𝜇), Theorem 2.6.5 has the consequence that a measurable mapping 𝒢 on E taking values in 𝔅HS (ℍ1 , ℍ2 ) is Bochner integrable if ‖𝒢‖HS is integrable: i.e., if ∫E

‖𝒢‖HS d𝜇 < ∞.

Our aim is to show that for any such function ( ) (𝒢f )d𝜇 = 𝒢d𝜇 f ∫E ∫E

(4.30)

for all f ∈ ℍ1 . To see why this holds, for any fixed f ∈ ℍ1 define a mapping ℋ that maps 𝒯 ∈ 𝔅HS (ℍ1 , ℍ2 ) to 𝒯f ∈ ℍ2 . Then, we can rewrite (4.30) as ( ) ℋ(𝒢)d𝜇 = ℋ 𝒢d𝜇 . (4.31) ∫E ∫E Observe that the operator norm of 𝒢 is bounded by ‖f ‖1 and hence ℋ is in 𝔅(𝔅HS (ℍ1 , ℍ2 ), ℍ2 ). Thus, (4.31) follows immediately from Theorem 3.1.7. A truncated sve provides a best approximation to HS operators in the sense of the following theorem. Theorem 4.4.7 Let 𝒯 be a Hilbert–Schmidt operator between two Hilbert spaces ℍ1 , ℍ2 with singular system {(𝜆j , f1j , f2j )}∞ . Then, for any finite j=1 integer k, ‖ ‖ ‖ ‖ k k ∑ ∑ ‖ ‖ ‖ ‖ ‖𝒯 − ‖ ‖ ‖ x ⊗ y ≥ 𝒯 − 𝜆 f ⊗ f (4.32) j 1 j j 1j 1 2j ‖ ‖ ‖ ‖ ‖ ‖ ‖ ‖ j=1 j=1 ‖ ‖HS ‖ ‖HS for any set of functions xj ∈ ℍ1 , yj ∈ ℍ2 , j = 1, … , k. This result has various names such as the Schmidt–Mirsky Theorem or the Eckart–Young Theorem in the case of finite dimensions. A review of its early history can be found in Stewart (1993). In the special case where ℍ1 and ℍ2 coincide, Theorem 4.4.7 translates into saying that a truncated eigenvalue–eigenvector decomposition provides a best approximation to a nonnegative self-adjoint Hilbert–Schmidt operator. One can, in fact, view Theorem 4.6.8 as representing a refinement of this type of result. Proof: It suffices to show that ‖ ‖2 k k ∑ ∑ ‖ ‖ 2 ‖𝒯 − ‖ xj ⊗1 yj ‖ ≥ ‖𝒯‖HS − 𝜆2j . ‖ ‖ ‖ j=1 j=1 ‖ ‖HS

112

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Without loss, we can assume that the (yj , xj ) are orthonormal. In that case, if {ej } is any CONS for ℍ1 ‖𝒯 −

k ∑ i=1

=

∞ ∑

xi ⊗1 yi ‖2HS

⟨( 𝒯∗ 𝒯 +

k ∑

j=1

) (xi − 𝒯∗ yi ) ⊗1 (xi − 𝒯∗ yi )∗

⟩ ej , ej

i=1



1

k



‖𝒯∗ yi ‖21 .

i=1

As (xi − 𝒯∗ yi )⊗1 (xi − 𝒯∗ yi )∗ is nonnegative, the result will follow once we ∑k ∑k establish that i=1 ‖𝒯∗ yi ‖21 ≤ i=1 𝜆2i .∑ Now the sve of 𝒯∗ gives 𝒯∗ yi = ∞ j=1 𝜆j ⟨f2j , yi ⟩2 f1j . This leads to the identity ( k ) k ∑ ∑ ‖𝒯∗ yi ‖21 = 𝜆2k + 𝜆2j ⟨ f2j , yi ⟩22 − 𝜆2k ⟨ f2j , yi ⟩22 j=1

(

∞ ∑

− 𝜆2k ( −𝜆2k

j=1

⟨ f2j , yi ⟩22 −

j=k+1

1−

∞ ∑

)

𝜆2j ⟨ f2j , yi ⟩22

j=k+1

∞ ∑

) ⟨ f2j , yi ⟩22

.

j=1

The last two terms are nonpositive meaning that k ∑

‖𝒯∗ yi ‖21 ≤ k𝜆2k +

i=1

=

k ∑

[

i=1 j=1

𝜆2k

j=1



k k ∑ ∑ (𝜆2j − 𝜆2k )⟨ f2j , yi ⟩22

+

(𝜆2j



𝜆2k )

k ∑ ⟨ f2j , yi ⟩22

]

i=1

k



𝜆2j

j=1

as a result of the orthonormality of the yi and Parseval ’s relation.



COMPACT OPERATORS AND SINGULAR VALUE DECOMPOSITION

4.5

113

Trace class operators

In the previous section, we introduced HS operators that provide an important subset of the compact operators. Another such subset is the trace class or nuclear operators that we examine in this section. Let 𝒯 be any bounded linear operator on some Hilbert space ℍ. By Theorem 3.4.3, we can define the square root of 𝒯∗ 𝒯 as a bounded, nonnegative, self-adjoint operator (𝒯∗ 𝒯)1∕2 such that (𝒯∗ 𝒯)1∕2 (𝒯∗ 𝒯)1∕2 = 𝒯∗ 𝒯. Definition 4.5.1 Let 𝒯 ∈ 𝔅(ℍ1 , ℍ2 ) for separable Hilbert spaces ℍ1 and ℍ2 . Then, 𝒯 is trace class if for some CONS {ej }∞ of ℍ1 , the quantity j=1 ‖𝒯‖TR ∶=

∞ ∑

⟨(𝒯∗ 𝒯)1∕2 ej , ej ⟩1

(4.33)

j=1

is finite. In this case, ‖𝒯‖TR is said to be the trace norm of 𝒯. An argument similar to that used for Theorem 4.4.1 shows that this definition does not depend on the choice of CONS. For any trace class operator 𝒯, we have ‖𝒯‖TR = ‖(𝒯∗ 𝒯)1∕4 ‖2HS , where (𝒯∗ 𝒯)1∕4 = ((𝒯∗ 𝒯)1∕2 )1∕2 . This establishes that if 𝒯 is trace class then (𝒯∗ 𝒯)1∕4 is HS and hence compact. As 𝒯∗ 𝒯 = ((𝒯∗ 𝒯)1∕4 )4 , and compositions of compact operators are compact (Theorem 4.1.3), we see that 𝒯∗ 𝒯 is compact. The development that led to the sve and Theorem 4.3.5 now allows us to conclude that trace class operators are compact and that ‖𝒯‖TR =

∞ ∑

𝜆i ,

i=1

where the 𝜆i are the singular values of 𝒯. Thus, from (4.28) ( ∞ )1∕2 ∑ √ ‖𝒯‖HS ≤ 𝜆1 𝜆i = 𝜆1 ‖𝒯‖TR i=1

(4.34)

114

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

and it follows that Theorem 4.5.2 Trace class operators are HS. Now let us focus on operators in 𝔅(ℍ). If 𝒯 is self-adjoint and has the associated eigenvalue sequence {𝜆j }, ‖𝒯‖TR =

∞ ∑

|𝜆j |.

j=1

∑∞ In this case, j=1 𝜆j is well defined and finite. It is called the trace of 𝒯 and denoted by trace(𝒯). One can also readily verify that trace(𝒯) =

∞ ∑

⟨𝒯ej , ej ⟩

j=1

for any CONS {ej }. If, moreover, 𝒯 is nonnegative, then trace(𝒯) = ‖𝒯‖TR =

∞ ∑

𝜆j .

(4.35)

j=1

In the finite dimensional case, trace(𝒯) is the same as the matrix trace. Unsurprisingly, the operator trace shares the basic features of the matrix trace in that, for example, given two trace class operators 𝒯1 , 𝒯2 and a1 , a2 ∈ ℝ, trace(a1 𝒯1 + a2 𝒯2 ) = a1 trace(𝒯1 ) + a2 trace(𝒯2 ), and trace(𝒯2 𝒯1 ) = trace(𝒯2 𝒯1 ). Proofs of these relations are given in, e.g., Reed and Simon (1980). The connections between the eigenvalues of an operator and its HS and trace norms (when they exist) can be used to obtain various bounds on the size of its eigenvalues. When this feature is applied to the difference between operators, one result that emerges is von Neumann’s trace inequality that has implications for the perturbation of operator eigenvalues. Results of this nature are explored in more detail in Chapter 5. ̃ are HS operators from ℍ1 to ℍ2 with Theorem 4.5.3 Suppose that 𝒯 and 𝒯 nonascending singular values 𝜆j , 𝜆̃j , j = 1, 2, … Then, ∞ ∑

̃ ∗ (𝒯 − 𝒯)}. ̃ (𝜆j − 𝜆̃j )2 ≤ trace{(𝒯 − 𝒯)

(4.36)

j=1

Proof: The proof is based on developments in Grigorieff (1991). Let f1j and f2j be the right and left singular vectors of 𝒯 corresponding to 𝜆j and, similarly,

COMPACT OPERATORS AND SINGULAR VALUE DECOMPOSITION

115

̃ corresponding take f̃1j and f̃2j to be the right and left singular vectors of 𝒯 to 𝜆̃j . Now write ̃ ∗ (𝒯 − 𝒯)} ̃ trace{(𝒯 − 𝒯) =

∞ ∑

𝜆2j

+

j=1

j=1







=



𝜆2j

+

j=1

j=1







=

∞ ∑



𝜆2j +

j=1

𝜆̃2j − 2

∞ ∑

̃ 1j ⟩2 ⟨𝒯f1j , 𝒯f

j=1

𝜆̃2j − 2

∞ ∑

̃ 1j ⟩2 𝜆j ⟨ f2j , 𝒯f

j=1

𝜆̃2j − 2

j=1

∞ ∞ ∑ ∑

𝜆j 𝜆̃k ⟨ f1j , f̃1k ⟩1 ⟨ f2j , f̃2k ⟩2 ,

j=1 k=1

where we applied the expansion ̃ 1j = 𝒯f

∞ ∑

̃ f̃1k = ⟨ f1j , f̃1k ⟩1 𝒯

k=1

∞ ∑

𝜆̃k ⟨ f1j , f̃1k ⟩1 f̃2k

k=1

in the last step. Observe that ⟨( ∞ ) ⟩| ∞ |∑ | || ∑ | | | f̃1k ⊗1 f̃2k f1j , f2j || ≤ 1 (4.37) | ⟨ f1j , ̃f 1k ⟩1 ⟨ f2j , ̃f 2k ⟩2 | = || | | | | k=1 | k=1 | | 2| for all j and ) ⟩| |∑ | |⟨(∑ ∞ |∞ | | | | ⟨ f1j , ̃f 1k ⟩1 ⟨ f2j , ̃f 2k ⟩2 | = | f1j ⊗1 f2j f̃1k , ̃f 2k || ≤ 1 (4.38) | | | | j=1 | | | j=1 2| | | | for all k. ̃ are We will now prove (4.36) while assuming that rank(𝒯) and rank(𝒯) both bounded by some finite integer n. If this is not the case, we can take limits on the inequalities proved for truncated operators using the singular value decompositions. Using the above-mentioned derivations, in the finite rank case (4.36) becomes ̃ 𝜆T 𝒜 𝜆̃ ≤ 𝜆T 𝜆, where 𝜆 = (𝜆1 , … , 𝜆n )T , 𝜆 = (𝜆̃1 , … , 𝜆̃n )T

(4.39)

116

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

and

𝒜 = {⟨ f1j , ̃f 1k ⟩1 ⟨ f2j , ̃f 2k ⟩2 }j,k=1∶n .

For i = 1, … , n, let ei be an n-dimensional vector whose first i elements are equal to 1 and the rest equal to 0. As 𝜆i and 𝜆̃i are nonascending, both 𝜆 and 𝜆̃ can be written as linear combinations of e1 , … , en with nonnegative coefficients. Without loss of generality, assume that 𝜆1 and 𝜆̃1 are bounded by 1. It then suffices to establish (4.39) for 𝜆 = ek and 𝜆̃ = e𝓁 for arbitrary k and 𝓁. However, this is immediate as the row and column sums of 𝒜 are all bounded by 1 as a result of (4.37) and (4.38). ◽

4.6

Integral operators and Mercer’s Theorem

We briefly discussed integral operators in Example 3.1.6. In this section, we use the results developed in this chapter to expand on our previous treatment of the topic. Let (E, ℬ, 𝜇) be a measure space for some finite measure 𝜇. Suppose that K is a measurable function on E × E such that ∫ ∫E×E K 2 (s, t)d𝜇(s)d𝜇(t) is finite and define the integral operator 𝒦 by (𝒦f )(⋅) ∶=

∫E

K(s, ⋅)f (s)d𝜇(s)

(4.40)

for f ∈ 𝕃2 (E, ℬ, 𝜇). The function K is referred to as the kernel of 𝒦. By the Cauchy–Schwarz inequality, for f ∈ 𝕃2 (E, ℬ, 𝜇), the function (𝒦f )(⋅) is measurable and satisfies ∫E

(𝒦f )2 (t)d𝜇(t) ≤

∫ ∫E×E

K 2 (s, t)d𝜇(s)d𝜇(t) f 2 (s)d𝜇(s). ∫E

Thus, 𝒦 ∈ 𝔅(𝕃2 (E, ℬ, 𝜇)) with ( )1∕2 2 ‖𝒦‖ ≤ K (s, t)d𝜇(s)d𝜇(t) . ∫ ∫E×E We will generally take E = [0, 1] for our applications. However, the results that follow are applicable to any compact metric space and, accordingly, we will tacitly assume that E has that type of structure and that ℬ = ℬ(E) is the Borel 𝜎-field of E: namely, the smallest 𝜎-field containing all the open sets in E. Without loss of generality, we will also assume that the support of 𝜇 is the entire space E; if this is not the case then our results remain true with E replaced by the the support of 𝜇, which is also compact. The integral operators that we focus on are those whose operator kernels K are continuous on E × E. Unless otherwise stated, this will also be included

COMPACT OPERATORS AND SINGULAR VALUE DECOMPOSITION

117

as a standard assumption in the rest of the section. The compactness of E then has the consequence that K is uniformly continuous. This continuity is translated to the image of 𝒦 as indicated in the following result. Lemma 4.6.1 For each f ∈ 𝕃2 (E, ℬ, 𝜇), (𝒦f )(⋅) is uniformly continuous. Proof: As K is uniformly continuous, for any given 𝜖 > 0, there exist 𝛿 > 0 such that |K(s, s2 ) − K(s, s1 )| < 𝜖 for all s, s2 , s1 ∈ E with |s2 − s1 | < 𝛿. Thus, | | | K(s, s2 )f (s)d𝜇(s) − K(s, s1 )f (s)d𝜇(s)| ≤ 𝜖‖f ‖. |∫ | ∫E | E |



The following result provides our first characterization of integral operators. Theorem 4.6.2 𝒦 is compact. Proof: The proof employs Theorem 4.1.5. To do so, we need to construct an approximating sequence of finite-dimensional operators. One such sequence can be obtained by use of the Stone–Weierstrass Theorem (see, e.g., Royden and Fitzpatrick 2010), which tells us that for any 𝜖 > 0 there is some finite n𝜖 and continuous functions gi , hi , i = 1, … , n𝜖 with the property that sup |K(s, t) − Kn𝜖 (s, t)| < 𝜖

s,t∈E

∑ for Kn (s, t) = ni=1 gi (s)hi (t). Define 𝒦n to be the integral operator with kernel Kn . Observe that Im(𝒦n ) ⊂ span{h1 , … , hn }. Thus, 𝒦n is finite-dimensional and compact. By the Cauchy–Schwarz inequality, ( ‖(𝒦n𝜖 − 𝒦)f ‖ = 2

∫E ∫E

)2 [Kn𝜖 (s, t) − K(s, t)] f (s)d𝜇(s)

≤ 𝜖 2 ‖f ‖2 𝜇2 (E).

d𝜇(t) ◽

Assume that K is symmetric in which case we know from Example 3.3.4 that 𝒦 is self-adjoint. ∑Theorem 4.2.4 gives us the eigenvalue–eigenvector decomposition 𝒦 = ∞ j=1 𝜆j ej ⊗ ej , where 𝜆j , ej satisfy the descriptions in Theorem 4.2.4. Lemma 4.6.1 ensures that the version of ej determined by ej (t) = 𝜆−1 j

∫E

K(s, t)ej (s)d𝜇(s)

118

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

is continuous in t. In the following, we will always assume that this is the choice that has been made for ej . Example 4.6.3 An operator that arises in various settings (e.g., in the construction of the 𝕎1 [0, 1] space of Section 2.8) is the 𝕃2 [0, 1] integral operator that has the kernel K(s, t) = min(s, t) for s, t ∈ [0, 1]. Note that K is the covariance function of the standard Brownian motion process on [0, 1]. The operator that corresponds to K is given explicitly by 1

(𝒦f )(t) =

t

K(t, s)f (s)ds =

∫0

∫0

f (s)ds.

Let us now find its eigenvalues and eigenfunctions by solving 1

∫0

min(s, t)e(s)ds = 𝜆e(t)

(4.41)

for 𝜆 and e. In this regard, first note that (4.41) is the same as t

∫0

1

se(s)ds + t

∫t

e(s)ds = 𝜆e(t).

Differentiating both sides of this relation produces 1

te(t) +

∫t

i.e.,

e(s)ds − te(t) = 𝜆e′ (t); 1

∫t

e(s)ds = 𝜆e′ (t).

(4.42)

Differentiating (4.42) again reveals that e(t) = −𝜆e′′ (t) for which the general solution is ( √ ) ( √ ) e(s) = a sin s∕ 𝜆 + b cos s∕ 𝜆 . From (4.41), e(0) = 0 and hence b = 0 in the above. Similarly, (4.42) implies that e′ (1) = 0 so that ( √ ) a cos 1∕ 𝜆 = 0,

COMPACT OPERATORS AND SINGULAR VALUE DECOMPOSITION

which leads to

119

√ 1∕ 𝜆 = (2j − 1)𝜋∕2, j = 1, 2, …

Thus, the eigenvalues are 𝜆j =

4 ((2j − 1)𝜋)2

and the orthonormal eigenfunctions are ( ) √ (2j − 1)𝜋 ej (t) = 2 sin t . 2 Recall that a kernel K(s, t) is nonnegative definite if it satisfies (2.29). The following result relates this notion to that of nonnegative-definite integral operators. Theorem 4.6.4 An integral operator is nonnegative definite if and only if its kernel is nonnegative definite. Proof: Let K be the operator kernel of the integral operator 𝒦. Given n > 0 let 𝛿n be chosen so that |K(s2 , t2 ) − K(s1 , t1 )| < n−1 whenever d((s1 , t1 ), (s2 , t2 )) < 𝛿n , where d is some metric for the product space E × E. As E is a compact metric space, the Heine–Borel Theorem (Theorem 2.1.17) has the implication that there exists a finite partition {Eni } of E such that each Eni has diameter less than 𝛿n . Let 𝑣i be an arbitrary point of Eni and, for all (s, t) ∈ Eni × Enj , define Kn (s, t) to be K(𝑣i , 𝑣j ). The uniform continuity of K now has the consequence that max |K(s, t) − Kn (s, t)| < n−1 .

(s,t)∈E×E

Now let 𝒦n be the integral operator with kernel Kn . With this choice, we find that, for any f ∈ 𝕃2 (E, ℬ, 𝜇), |⟨𝒦f , f ⟩ − ⟨𝒦n f , f ⟩| ≤ n−1 ‖ f ‖2

(4.43)

and ⟨𝒦n f , f ⟩ =

n n ∑ ∑ i=1 j=1

K(𝑣i , 𝑣j ) f (t)d𝜇(t) f (t)d𝜇(t). ∫Eni ∫Enj

(4.44)

If K is nonnegative definite, then (4.43) and (4.44) entail that ⟨𝒦f , f ⟩ ≥ 0.

120

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Conversely, suppose that m m ∑ ∑

ai aj K(𝑣i , 𝑣j ) < 0

i=1 j=1

for some ai , 𝑣i . As K is uniformly continuous, there exist disjoint sets E1 , … , Em ∈ ℬ with 𝜇(Ei ) > 0, 𝑣i ∈ Ei for all i, and max

ui ,𝑣i ∈Ei ,i=1,…,m

m m ∑ ∑

ai aj K(ui , 𝑣j ) < 0.

i=1 j=1

This implies that m m ∑ ∑

ai aj (𝜇(Ei )𝜇(Ej ))−1

i=1 j=1

∫Ei ∫Ej

K(u, 𝑣)d𝜇(u)d𝜇(𝑣) < 0

due to the mean-value theorem. Upon observing that the last expression is ∑m simply ⟨𝒦f , f ⟩ for f = i=1 ai (𝜇(Ei ))−1 IEi , we conclude that 𝒦 is not nonnegative definite. ◽ The following result is the celebrated Mercer’s Theorem, which says essentially that for an integral operator 𝒦 with a symmetric and nonnegative-definite kernel, equivalent series expansions can be obtained for both the operator and its kernel. Theorem 4.6.5 Let the continuous kernel K be symmetric and nonnegative definite and 𝒦 the corresponding integral operator. If (𝜆j , ej ) are the eigenvalue and eigenfunction pairs of 𝒦, then K has the representation K(s, t) =

∞ ∑

𝜆j ej (s)ej (t),

j=1

for all s, t, with the sum converging absolutely and uniformly. The following lemma contains most of the technical details that are needed to prove Theorem 4.6.5. Lemma 4.6.6 Under the conditions of Theorem 4.6.5, ∑ 2 1. ∞ j=1 𝜆j ej (t) ≤ K(t, t) for all t, ∑∞ 2. j=1 |𝜆j ej (s)ej (t)| ≤ {K(t, t)K(s, s)}1∕2 for all s, t, ∑ 3. limn→∞ sups,t ∞ j=n+1 |𝜆j ej (s)ej (t)| = 0, and

COMPACT OPERATORS AND SINGULAR VALUE DECOMPOSITION

121

∑∞ 4. the function j=1 𝜆j ej (s)ej (t) is well defined and uniformly continuous in (s, t), with the sum converging absolutely and uniformly. Proof: Let Kn (s, t) = K(s, t) −

n ∑

𝜆j ej (s)ej (t)

j=1

and take 𝒦n to be the integral operator with kernel Kn . Note that Kn is continuous. Then, for any f , ⟨𝒦n f , f ⟩ = ⟨𝒦f , f ⟩ −

n ∑

𝜆j ⟨ f , ej ⟩2 =

j=1

∞ ∑

𝜆 j ⟨ f , e j ⟩2 ≥ 0

j=n+1

and 𝒦n must be nonnegative definite. This implies by Theorem 4.6.4 that Kn is nonnegative definite and hence Kn (t, t) ≥ 0 thereby proving part 1. Part 2 is now a consequence of part 1 as, for any set J of positive integers, ( )1∕2 ( )1∕2 ∑ ∑ ∑ |𝜆j ej (s)ej (t)| ≤ 𝜆j e2j (s) 𝜆j e2j (t) (4.45) j∈J

j∈J

j∈J

from the Cauchy–Schwarz inequality. To show part 3 first observe that (4.45) ensures that for all s, t ( ∞ )1∕2 ( ∞ )1∕2 ∞ ∑ ∑ ∑ 2 2 |𝜆j ej (s)ej (t)| ≤ 𝜆j ej (s) 𝜆j ej (t) , j=n+1

j=n+1

(4.46)

j=n+1

which tends to 0 monotonically due to Lebesgue’s dominated convergence theorem. As E compact, uniform convergence of the left-hand side of (4.46) follows from Dini’s Theorem. Fix any 𝜖 > 0. Using part 3, we can conclude that there exists n𝜖 such that sup

∞ ∑

s,t j=n +1 𝜖

|𝜆j ej (s)ej (t)| < 𝜖.

(4.47)

In addition, the uniform continuity of e1 , … , en𝜖 guarantees the existence of 𝛿 > 0 such that |∑ | n n𝜖 ∑ | 𝜖 | ′ ′ | | 𝜆j ej (s)ej (t) − 𝜆j ej (s )ej (t )| < 𝜖 (4.48) | | j=1 | j=1 | | whenever d((s, t), (s′ , t′ )) < 𝛿. Part 4 is now a straightforward consequence of (4.47) and (4.48). ◽

122

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Proof of Theorem 4.6.5. For different continuous kernels K1 (s, t) and K2 (s, t), it is straightforward to construct a function f such that ∫E K2 (s, t)f (s)d𝜇(s) and ∫E K1 (s, t)f (s)d𝜇(s) differ. Thus, K is the unique operator kernel ∑∞ that defines 𝒦. Now, the integral operator with the continuous kernel j=1 𝜆i ei (s)ei (t) has the same eigen ∑∞ decomposition as 𝒦 and is therefore the same operator. Thus, K(s, t) = j=1 𝜆i ei (s)ei (t) for all s, t with the right-hand side converging absolutely and uniformly as a consequence of Lemma 4.6.6. ◽ With the aid of Mercer’s Theorem and the developments in Section 4.5, we can now see that the integral operator 𝒦 in (4.40) is trace class. Simple formulae for the trace and HS norms of 𝒦 are given in the subsequent theorem. Theorem 4.6.7 Under the conditions of Theorem 4.6.5, trace(𝒦) =

∫E

K(s, s)d𝜇(s)

(4.49)

and ‖𝒦‖2HS =

∫ ∫E×E

K 2 (s, t)d𝜇(s)d𝜇(t).

(4.50)

Proof: By (4.35) and Theorem 4.6.5, we see that trace(𝒦) =

∞ ∑ j=1

= =

𝜆i =

(

j=1



∫E ∫E

∞ ∑



𝜆i

e2 (t)d𝜇(t) ∫E i )

𝜆i e2i (t) d𝜇(t)

j=1

K(t, t)d𝜇(t).

Exchange of the order of summation and integration here is allowed by Fubini’s Theorem. ∑ Now, let Kn (s, t) = ni=1 𝜆i ei (s)ei (t) and take 𝒦n to be the corresponding integral operator. First, trace(𝒦2n )

=

n ∑

𝜆2i



i=1

∞ ∑

𝜆2i = trace(𝒦2 ),

i=1

where the right-hand side is ‖𝒦‖2HS by (4.28). In addition, trace(𝒦2n ) =

∫ ∫E×E

Kn2 (s, t)d𝜇(s)d𝜇(t).

COMPACT OPERATORS AND SINGULAR VALUE DECOMPOSITION

123

The right side of this expression tends to ∫ ∫E×E K 2 (s, t)d𝜇(s)d𝜇(t) as a result of Theorem 4.6.5 and Lebesgue’s dominated convergence theorem. ◽ The following result describes an optimality feature of the decomposition of K under Mercer’s Theorem. We say that a kernel K(s, t) has rank r if the corresponding integral operator has rank r. Theorem 4.6.8 Let K be a symmetric and nonnegative-definite kernel with the eigen decomposition K(s, t) =

∞ ∑

𝜆j ej (s)ej (t).

j=1

Then, for any positive integer r for which 𝜆r > 0, min

rank(W)=r ∫

∫E×E

2

{K(s, t) − W(s, t)} d𝜇(s)d𝜇(t) =

∞ ∑

𝜆2j ,

j=r+1

where the minimum is achieved by W(s, t) =

∑r j=1

𝜆j ej (s)ej (t).

Proof: Let 𝒦 be the integral operator with kernel K. Theorem∑ 4.4.7 implies r that the operator 𝒲 that minimizes ‖𝒦 − 𝒲‖2HS is 𝒲 = j=1 𝜆j ej ⊗ ej ∑ ∞ for which ‖𝒦 − 𝒲‖2HS = j=r+1 𝜆2j . Note that 𝒲 is an integral operator with kernel W(s, t) =

r ∑

𝜆j ej (s)ej (t).

j=1

Part 2 of Theorem 4.6.7 then gives ‖𝒦 − 𝒲‖2HS =

∫ ∫E×E

{K(s, t) − W(s, t)}2 d𝜇(s)d𝜇(t)

from which the result follows readily.

4.7



Operators on an RKHS

As one might expect, the bounded operators defined on an RKHS can be characterized by their actions on the rk. To describe this property, let 𝒯 be an operator on ℍ(K) for some rk K. Then, let 𝒯K(⋅, s) be the function produced by applying 𝒯 to the first argument of K for a fixed value of the second argument. Now define the operator’s kernel to be R(⋅, s) = 𝒯∗ K(⋅, s).

(4.51)

124

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

With this definition, we obtain ⟨ f , R(⋅, s)⟩ = ⟨𝒯f , K(⋅, s)⟩ = (𝒯f )(s) for f ∈ ℍ(K). Thus, R(⋅, s) is the representer of the linear functional (𝒯f )(s). An application of the reproducing property then reveals that the kernel corresponding to the adjoint of 𝒯 is R∗ (⋅, s) = R(s, ⋅). An operator on an RKHS is therefore self-adjoint if its kernel is symmetric. Formula (4.51) provides a recipe for assigning a kernel to every operator on an RKHS. Such kernels persist through linear operations; i.e., the kernel for the linear combination a1 𝒯1 + a2 𝒯2 , a1 , a2 ∈ ℝ is a1 R1 + a2 R2 with Ri the kernel for 𝒯i , i = 1, 2. Taking 𝒯 = 𝒯2 𝒯1 , we see that 𝒯∗ K(⋅, s) = 𝒯∗1 R2 (⋅, s) = ⟨𝒯∗1 R2 (⋆, s), K(⋆, ⋅)⟩ = ⟨R2 (⋆, s), 𝒯1 K(⋆, ⋅)⟩ = ⟨R2 (⋆, s), R1 (⋅, ⋆)⟩. This provides a characterization of the composition of two operators on an RKHS in terms of their kernels. One might now wonder when an operator’s kernel generates an RKHS. We know that this is equivalent to the kernel being nonnegative definite. However, more can be said on this issue. Theorem 4.7.1 An operator on an RKHS with rk K is nonnegative if its kernel R is nonnegative in which case there is a nonnegative constant B such that BK − R is nonnegative. Proof: An operator 𝒯 being nonnegative means that ⟨𝒯f , f ⟩ ≥ 0 for every ∑n f ∈ ℍ(K). Take f = fn for fn (⋅) = i=1 ai K(⋅, ti ) and use the reproducing property to see that this implies the kernel for 𝒯 is nonnegative definite. The converse follows from the fact that functions of the form fn are dense in ℍ(K). As 𝒯 is nonnegative 0 ≤ ⟨𝒯f , f ⟩ ≤ B‖ f ‖2 with, e.g., B = ‖𝒯‖2 . Therefore, BI − 𝒯 is a nonnegative operator and its kernel (BI − 𝒯)∗ K(⋅, s) = BK(⋅, s) − R(⋅, s) must be a nonnegative function.



There is no reason to restrict attention to bounded operators on a single RKHS and the idea of an operator’s kernel works equally well when there are two RKHSs. To see this, let ℍ(K1 ) and ℍ(K2 ) be RKHSs corresponding to two rks K1 , K2 on sets E1 , E2 and suppose that 𝒯 ∈ 𝔅(ℍ(K1 ), ℍ(K2 )). Then, the operator’s kernel is R(⋅, s) = 𝒯∗ K2 (⋅, s)

COMPACT OPERATORS AND SINGULAR VALUE DECOMPOSITION

125

because (𝒯f )(s) = ⟨𝒯f , K2 (⋅, s)⟩2 = ⟨ f , R(⋅, s)⟩1 with ⟨⋅, ⋅⟩i the inner product for ℍ(Ki ), i = 1, 2. Similarly, (𝒯∗ f )(t) = ⟨𝒯∗ f , K1 (⋅, t)⟩1 = ⟨ f , 𝒯K1 (⋅, t)⟩2 and the kernel for the adjoint of 𝒯 is R∗ (⋅, t) ∶= 𝒯K1 (⋅, t). Now let {eij }∞ be CONSs for ℍ(Ki ), i = 1, 2. j=1 Theorem 4.7.2 An operator 𝒯 ∶ ℍ(K1 ) → ℍ(K2 ) is HS if for each t ∈ E1 and s ∈ E2 ∞ ∞ ∑ ∑ R(t, s) = aij e1i (t)e2j (s) (4.52) i=1 j=1

with

∑∞ ∑∞ i=1

j=1

a2ij < ∞.

Proof: Assume first that (4.52) holds. Then, (𝒯e1i )(s) = ⟨e1i , R(⋅, s)⟩1 =

∞ ∑

aij e2j (s),

j=1

which means that ‖𝒯‖2HS

=

∞ ∑

‖𝒯e1i ‖22

i=1

=

∞ ∞ ∑ ∑

a2ij < ∞

i=1 j=1

with ‖ ⋅ ‖2 the ℍ(K2 ) norm. So, 𝒯 is HS. Conversely, suppose that 𝒯 is HS. As R(⋅, s) is an element of ℍ(K1 ) R(t, s) =

∞ ∑

⟨R(⋅, s), e1i ⟩1 e1i (t)

i=1

=

∞ ∑

⟨K2 (⋅, s), 𝒯e1i ⟩2 e1i (t)

i=1 ∞ ∑ = (𝒯e1i )(s)e1i (t) i=1

=

∞ ∞ ∑ ∑ i=1 j=1

⟨e2j , 𝒯e1i ⟩2 e1i (t)e2j (s)

126

and

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS ∞ ∞ ∑ ∑

⟨e2j , 𝒯e1i ⟩22 =

i=1 j=1

∞ ∑

‖𝒯e1i ‖22 = ‖𝒯‖2HS .

i=1



Arguing similarly to the proof of the previous theorem we see that an operator 𝒯 ∈ 𝔅(ℍ(K1 ), ℍ(K2 )) is trace class if (4.52) holds with ∞ ∞ ∑ ∑

|aij | < ∞.

i=1 j=1

In this latter instance, the convergence of the series is uniform in s, t when K1 and K2 are bounded because |R(t, s) −

n n ∑ ∑

aij e1i (t)e2j (s)|

i=1 j=1



∞ ∞ ∑ ∑

|aij e1i (t)e2j (s)|

i=n+1 j=n+1

=

∞ ∞ ∑ ∑

|aij ⟨e1i , K1 (⋅, t)⟩1 ⟨e2j , K2 (⋅, s)⟩2 |

i=n+1 j=n+1

≤ |K1 (t, t)|1∕2 |K2 (s, s)|1∕2

∞ ∞ ∑ ∑

|aij |.

i=n+1 j=n+1

4.8

Simultaneous diagonalization of two nonnegative definite operators

A classic result from matrix theory that arises in various statistical venues such as linear models concerns the simultaneous diagonalization of two nonnegative definite matrices. In this section, we briefly discuss a situation where it is possible to extend this idea to a more general context involving a compact operator. The result we will prove can be stated as follows. Theorem 4.8.1 Let 𝒞 and 𝒲 be self-adjoint operators on a separable Hilbert space ℍ. Suppose that 𝒞 is compact and nonnegative definite, 𝒲 is positive definite, and 𝒢 ∶= 𝒞 + 𝒲 is invertible. Let {(𝜂j , 𝑣j )}∞ be the j=1 eigenvalue–eigenvector pairs of 𝒢 −1∕2 𝒞 𝒢 −1∕2 , where the 𝜂j are necessarily

COMPACT OPERATORS AND SINGULAR VALUE DECOMPOSITION

in [0, 1), and define 𝛾j = 𝜂j ∕(1 − 𝜂j ), uj = (1 − 𝜂j )−1∕2 𝒢 −1∕2 𝑣j . Then, 𝒞 uj = 𝜂j 𝒢uj = 𝛾j 𝒲uj ⟨ui , 𝒲uj ⟩ = 𝛿ij , and x=

∞ ∑ ⟨x, 𝒲uj ⟩uj j=1

for all x. Proof: As 𝒢 −1∕2 𝒞 𝒢 −1∕2 = I − 𝒢 −1∕2 𝒲𝒢 −1∕2 , all the eigenvalues 𝜂j must be in [0, 1). Now 𝒢 −1∕2 𝒲𝒢 −1∕2 𝑣j = 𝒢 −1∕2 (𝒢 − 𝒞 )𝒢 −1∕2 𝑣j = (1 − 𝜂j )𝑣j . Thus, ⟨ui , 𝒲uj ⟩ = (1 − 𝜂i )−1∕2 (1 − 𝜂j )−1∕2 ⟨𝑣i , 𝒢 −1∕2 𝒲𝒢 −1∕2 𝑣j ⟩ = 𝛿ij and 𝒞 uj = (1 − 𝜂j )−1∕2 𝒞 𝒢 −1∕2 𝑣j = (1 − 𝜂j )−1∕2 𝒢 1∕2 𝒢 −1∕2 𝒞 𝒢 −1∕2 𝑣j = 𝜂j 𝒢uj = 𝜂j (𝒞 + 𝒲)uj . The last identity also gives (1 − 𝜂j )𝒞 uj = 𝜂j 𝒲uj .

127

128

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Finally, as {𝑣j } is a CONS for ℍ, any x ∈ ℍ satisfies x = 𝒢 −1∕2 𝒢 1∕2 x (∞ ) ∑ = 𝒢 −1∕2 ⟨𝒢 1∕2 x, 𝑣j ⟩𝑣j j=1

∑ ∞

=

⟨𝒢 1∕2 x, 𝑣j ⟩𝒢 −1∕2 𝑣j

j=1

=

∞ ∑

(1 − 𝜂j )1∕2 ⟨𝒢 1∕2 x, 𝒢 1∕2 uj ⟩𝒢 −1∕2 𝑣j

j=1

=

∞ ∑

(1 − 𝜂j )⟨x, 𝒢uj ⟩uj ,

j=1

and 𝒢uj = (1 − 𝜂j )−1 𝒲uj .



As a corollary to Theorem 4.8.1, we can state the following. Corollary 4.8.2 Let 𝒞 , 𝒲 be symmetric n × n matrices with 𝒲 positive definite. Then, there exists a matrix 𝒰 such that 𝒰T 𝒞 𝒰 is diagonal and 𝒰T 𝒲𝒰 is the identity. Proof: If 𝒞 is nonnegative definite, then the conclusions follow immediately from Theorem 4.8.1. If 𝒞 is not nonnegative definite, then first consider 𝒞̃ ∶= 𝒞 + B𝒲, where B is a large enough constant that makes 𝒞̃ positive definite. ◽

5

Perturbation theory This chapter delves into perturbation theory for compact operators. The material collected here will subsequently furnish some of the tools that will be needed for establishing large sample properties associated with methods for principle components estimation in Chapter 9. The definitive treatise on operator perturbation theory is that of Kato (1995). Our particular treatment of this topic focuses on two scenarios that parallel the developments in Chapter 4 and is partly motivated by the results in Dauxious, Pousse, and Romain (1982), Hall and Hosseini-Nasab (2005, 2009), and Riesz and Sz.-Nagy (1990). First, in Section 5.1, we consider the more standard case of self-adjoint, compact operators on a Hilbert space. In that setting, we obtain bounds and expansions that allow us to measure the effect that perturbing an operator will have on its eigenvalues and eigenvectors. Section 5.2 then explores the more complicated case of operators that are not self-adjoint and presents similar results to those of Section 5.1 for singular values and vectors.

5.1

Perturbation of self-adjoint compact operators

Theorems 4.2.8 and 4.5.3 represent anomalies of sorts when viewed in terms of the overall theme of Chapter 4. They both dealt with the eigenvalues of two operators rather than just one and then provided bounds for the difference in the operators’ eigenvalues in terms of some measure of the size of the overall difference between the two operators. Results of this nature can be particularly useful when the eigenvalues are the object of interest and it is possible to directly measure or bound the size of the operator difference. ̃ and Somewhat more generally, we might consider two operators 𝒯 and 𝒯 ̃ ̃ define the perturbation operator Δ = 𝒯 − 𝒯 so that 𝒯 = 𝒯 + Δ. This type of Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators, First Edition. Tailen Hsing and Randall Eubank. © 2015 John Wiley & Sons, Ltd. Published 2015 by John Wiley & Sons, Ltd.

130

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

̃ is some type of approximation to 𝒯 formulation gives the impression that 𝒯 with Δ representing an error or residual term that stems from the approximation. Thus, 𝒯 and its associated eigenvalues and eigenvectors represent the targets and we wish to ascertain how well the eigenvalues and eigenvectors of ̃ can serve if they are used in a surrogate capacity where they are employed 𝒯 in place of those for 𝒯. This is the direction we will follow in this section. The problem setting is one where we have a compact, self-adjoint operator 𝒯 on a Hilbert space ℍ. From Section 4.2, we know that 𝒯 admits a representation in terms of the eigenvalue–eigenvector expansion (4.6). We will sometimes need to list the eigenvalues in this representation as repeated and other times as nonrepeated. Repeated eigenvalues 𝜆1 ≥ 𝜆2 ≥ · · · are those that are listed along with information about their multiplicity while nonrepeated eigenvalues 𝜆1 > 𝜆2 > · · · are given without this information. Unless otherwise stated, 𝜆j will only represent a nonrepeated eigenvalue. Using this notational paradigm, we can alternatively express the eigen decomposition of 𝒯 as 𝒯=

∞ ∑

𝜆j 𝒫j ,

(5.1)

j=1

where 𝒫j is the projection operator for the eigenspace of 𝜆j . Note that if 𝜆j has multiplicity one then Theorem 3.4.7 tells us that 𝒫j = ej ⊗ ej for ej the eigenvector corresponding to 𝜆j . More generally, if the multiplicity of 𝜆j is more than one, from Theorem 4.2.4, we know that it must be some finite number, nj . In that case, we can use the Gram–Schmidt method of Theorem 2.4.10 to create orthonormal vectors ej1 , … , ejnj that span the eigenspace. Then, 𝒫j =

nj ∑

ejk ⊗ ejk .

k=1

We initially need to quantify the effect of perturbations on an operator’s projection operators. This will eventually provide tools that can be used to make similar assessments about eigenvalues and eigenvectors. To deal with projection operators effectively, it becomes expedient to bring in the concept of the resolvent operator. This requires us to expand the Hilbert space formulation we have used up to this point. Specifically, we now need the scalar field for ℍ to be the set of complex numbers ℂ. Definition 2.4.1 must be adjusted to reflect this change with its condition 4 being restated as ⟨x, y⟩ = ⟨y, x⟩, where a is the complex conjugate of a ∈ ℂ.

PERTURBATION THEORY

131

The set of z ∈ ℂ for which 𝒯 − zI is not invertible is called the spectrum of 𝒯 and denoted by 𝜎(𝒯). For compact operators on an infinite-dimensional space ℍ, 𝜎(𝒯) contains all the eigenvalues plus 0 whether 0 is an eigenvalue or not. The complement of 𝜎(𝒯) is called the resolvent set of 𝒯 and denoted by 𝜌(𝒯). For z ∈ 𝜌(𝒯), direct multiplication using (5.1) establishes that ℛ(z) ∶= (𝒯 − zI)−1 ∞ ∑ 1 = 𝒫 𝜆 −z j j=1 j

(5.2)

is a bounded operator that we will refer to as the resolvent of 𝒯. In fact, as ‖ℛ(z)f ‖2 ≤ max j

= max j

∞ ∑ 1 ‖𝒫j f ‖2 |z − 𝜆j |2 j=1

1 ‖f ‖2 , |z − 𝜆j |2

we have the exact expression ‖ℛ(z)‖ = 1∕ min |z − 𝜆j |. j

(5.3)

The resolvent plays a fundamental role in perturbation theory as we will shortly see. However, to access this utility, we need to be able to compute contour integrals of (5.2) as a function of its z argument. Development of the mathematical framework that will allow us to do so is our following task. We have already seen abstract integration from another perspective in Section 2.6. In complex analysis, contour integrals are integrals of functions over paths or curves in the complex plane. Thus, we now take Γ to be a simple closed curve, also known as a Jordan curve, in ℂ of length lΓ and let {𝒯(z) ∶ z ∈ Γ} be an indexed collection of operators on ℍ; that is, 𝒯(z) ∈ 𝔅(ℍ) for every z ∈ Γ and we need look no further than (5.2) to see a specific example of such a collection. The length of the arc on Γ connecting points z to z′ on the curve counterclockwise will be denoted as dΓ (z, z′ ). As a first step, let us choose points z0 , … , zn on Γ and order them counterclockwise with z0 = zn . If 𝜉j is any point on the arc going from zj to zj+1 , we define the operator ℐ({zj , 𝜉j }nj=0 ) =

n−1 ∑

𝒯(𝜉j )(zj+1 − zj ).

(5.4)

j=0

This bears a formal similarity to a Riemann sum and one might anticipate that something of this nature could be extended to a limit in an analogous manner

132

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

to the development of the Riemann integral of a real function. Conditions that can be employed to make such a notion rigorous stem from the following simple result. Lemma 5.1.1 Let {zj , 𝜉j }nj=0 and {̃zj , 𝜉̃j }nj=0 be two ordered sets of points on Γ and define 𝛿 = max dΓ (zj , zj+1 ) + max dΓ (̃zj , z̃j+1 ). j

j

Then, ‖ℐ({zj , 𝜉j }nj=0 ) − ℐ({̃zj , 𝜉̃j }nj=0 )‖ ≤

sup ‖𝒯(z) − 𝒯(z′ )‖lΓ .

(5.5)

dΓ (z,z′ )≤𝛿

Proof: Merge the two sets into a single ordered set {̌zj , 𝜉̌j } so that ) ∑( ℐ({zj , 𝜉j }nj=0 ) − ℐ({̃zj , 𝜉̃j }nj=0 ) = 𝒯(𝜉̌j ) − 𝒯(𝜉̌j′ ) (̌zj − ž j−1 ). j

The result then follows from the triangle inequality.



This leads us to the integral definition that we seek. Definition 5.1.2 Let {zjn , 𝜉jn }nj=0 , n = 1, … be a sequence of ordered sets with lim max dΓ (zjn , z(j+1)n ) = 0

(5.6)

lim sup ‖𝒯(z) − 𝒯(z′ )‖ = 0.

(5.7)

n→∞

and assume that

j

𝛿↓0 d (z,z′ )≤𝛿 Γ

The contour integral of 𝒯(z) over Γ is ∮Γ

𝒯(z)dz = lim ℐ({zjn , 𝜉jn }nj=0 ). n→∞

(5.8)

Lemma 5.1.1 has the implication that ℐ({znj , 𝜉nj }nj=0 ) is a Cauchy sequence under conditions (5.6) and (5.7) and must therefore have a limit in 𝔅(ℍ). To see that this limit is unique, take any two sequences that satisfy (5.6) and merge them. Condition (5.6) is still satisfied by this new sequence and the corresponding Riemann type sum of operators must therefore converge to a limit that agrees with the one obtained from either of the sequences that were used in its construction. The fact that ∮Γ 𝒯(z)dz arises from (5.4) makes it easy to see that contour integration of operators behaves much like ordinary Riemann integration.

PERTURBATION THEORY

133

In particular, it is a linear operation: if {𝒯1 (z) ∶ z ∈ Γ}, {𝒯2 (z) ∶ z ∈ Γ} are two collections of functions ∮Γ (𝛼1 𝒯1 (z) + 𝛼2 𝒯2 (z))dz = 𝛼1 ∮Γ 𝒯1 (z)dz + 𝛼2 ∮Γ 𝒯2 (z)dz for 𝛼1 , 𝛼2 ∈ ℝ. When contour integration is applied to the resolvent, the outcome is a formula that gives a useful characterization of projection operators. Theorem 5.1.3 Let 𝒯 be a compact, self-adjoint operator and let Γ represent the boundary of a disk D that contains {𝜆k ∶ p ≤ k ≤ q} and no other eigenvalues with {𝜆k ∶ p ≤ k ≤ q} strictly in the interior of D. Then, the projection operator, 𝒫, for the union of the eigenspaces corresponding to these eigenvalues has the representation 𝒫=−

1 ℛ(z)dz, 2𝜋i ∮Γ

(5.9)

where ℛ is the resolvent of 𝒯 in (5.2). Proof: We first show the contour integral in (5.9) is well defined by verifying (5.7). It follows that | | | | 1 | z − z′ | 1 | | ‖ℛ(z) − ℛ(z′ )‖ = max | − ′ | = max | |. ′ − 𝜆 )| | | j | z − 𝜆j j z − 𝜆 (z − 𝜆 )(z j| j j | | | As 𝜖 ∶= inf z∈Γ, j≥1 |z − 𝜆j | > 0, sup ‖ℛ(z) − ℛ(z′ )‖ ≤

dΓ (z,z′ )≤𝛿

𝛿 . 𝜖2

Hence, (5.7) holds and the contour integral in (5.9) is well defined. By (5.2), we obtain 1 1 ∑ dz ℛ(z)dz = − 𝒫 . 2𝜋i ∮Γ 2𝜋i k=1 k ∮Γ 𝜆k − z ∞



Cauchy’s integral formula gives dz = (−2𝜋i)I(p ≤ k ≤ q) ∮Γ 𝜆k − z and completes the proof.



̃ However, We now wish to compare the projection operators for 𝒯 and 𝒯. there is a bit of technical detail that must be dispensed with before such comparisons can be made mathematically tractable.

134

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

̃ and 𝒯, then 𝒫̃ j clearly First, if all the eigenvalues are distinct for both 𝒯 should be the projection operator corresponding to the jth largest eigenvalues ̃ However, things become more complicated if one or both of 𝒯 and of 𝒯. ̃ 𝒯 have eigenvalues with multiplicities greater than one and the two sets of eigenvalues must be paired in some manner. To do this, we will use the eigenvalues of the “target” operator 𝒯 and their multiplicities as the basis for the pairing. Let 𝜆1 > 𝜆2 > · · · be the nonrepeated eigenvalues of 𝒯 and, as before, let nj be the multiplicity of 𝜆j . Now let 𝜆̃1 ≥ 𝜆̃2 ≥ · · · be the repeated eigenvalues of ̃ Then, 𝒫̃ 1 is defined as the projection operator corresponding to the union 𝒯. of the eigenspaces for 𝜆̃k , 1 ≤ k ≤ n1 , and 𝒫̃ 2 is the projection operator for the union of the eigenspaces for 𝜆̃k , n1 + 1 ≤ k ≤ n2 , etc. In general, define Nj =

j ∑

nk

k=1

and take 𝒫̃ j =

Nj ∑

ẽ k ⊗ ẽ k

(5.10)

k=Nj−1 +1

with ẽ k the eigenvector associated with 𝜆̃k . Using this approach, the 𝒫̃ j are ̃ with those from 𝒯, with obtained by matching the repeated eigenvalues of 𝒯 the latter as the basis for matching. This ensures that the dimension of 𝒫̃ j is the same as that of 𝒫j which would not necessarily occur if we simply matched the eigenspaces corresponding to the nonrepeated eigenvalues of 𝒯 ̃ with those from 𝒯. Theorem 5.1.4 Let the nonrepeated eigenvalues of the compact, self-adjoint operators 𝒯 be 𝜆1 > 𝜆2 > · · · and take Γ as the circle in the complex plane ̃ be another comcentered at 𝜆j with radius 𝜂j ∶= (1∕2) mink≠j |𝜆k − 𝜆j |. Let 𝒯 pact, self-adjoint operator and assume that ‖Δ‖ < 𝜂j

(5.11)

̃ − 𝒯. Then, for 𝒫̃ j in (5.10) for Δ = 𝒯 1 𝒫̃ j − 𝒫j = 𝒮j Δ𝒫j + 𝒫j Δ𝒮j + ℳ(z)dz, 2𝜋i ∮Γ

(5.12)

where 𝒮j =

∑ k≠j

1 𝒫 𝜆k − 𝜆j k

and ℳ(z) = ℛ(z)

∞ ∑ k=2

{−Δℛ(z)}k .

(5.13)

PERTURBATION THEORY

135

̃ by ℛ. ̃ Theorem 4.2.8 ensures that Γ encirProof: Denote the resolvent for 𝒯 ̃ corresponding to 𝜆j but no other eigencles 𝜆j and all the eigenvalues of 𝒯 values. So, Theorem 5.1.3 can be applied to produce ( ) 1 ̃ − ℛ(z) dz. 𝒫̃ j − 𝒫j = − ℛ(z) 2𝜋i ∮Γ Now

(5.14)

̃ = (Δ + 𝒯 − zI)−1 = ℛ(z)(Δℛ(z) + I)−1 ℛ(z)

and (5.3) entails that sup ‖Δℛ(z)‖ ≤ z∈Γ

‖Δ‖ < 1. 𝜂j

Thus, (3.13) allows us to write ̃ − ℛ(z) = ℛ(z) ℛ(z)

∞ ∑

{−Δℛ(z)}k .

k=1

Using this in (5.14) leads to 1 1 𝒫̃ j − 𝒫j = − ℛ(z)Δℛ(z)dz + ℳ(z)dz. 2𝜋i ∮Γ 2𝜋i ∮Γ As

1 ⎧ ⎪− 𝜆 − 𝜆 , j ⎪ k 1 1 1 dz = ⎨− 1 , 2𝜋i ∮Γ 𝜆k − z 𝜆l − z ⎪ 𝜆l − 𝜆j ⎪0, ⎩

if k ≠ j and l = j, if k = j and l ≠ j, otherwise,

the theorem follows from 1 1 ∑∑ 1 1 ℛ(z)Δℛ(z)dz = − dz𝒫k Δ𝒫l 2𝜋i ∮Γ 2𝜋i k=1 l=1 ∮Γ 𝜆k − z 𝜆l − z ∑ 1 = (𝒫 Δ𝒫j + 𝒫j Δ𝒫k ). 𝜆 − 𝜆j k k≠j k ∞







Corollary 5.1.5 Under the conditions of Theorem 5.1.4, ‖𝒫̃ j − 𝒫j − (𝒮j Δ𝒫j + 𝒫j Δ𝒮j )‖ ≤ with 𝛿j = ‖Δ‖∕𝜂j .

𝛿j2 1 − 𝛿j

,

(5.15)

136

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Proof: Note that ‖ℳ(z)‖ ≤ ‖ℛ(z)‖

∞ ∑

‖Δℛ(z)‖k ≤

k=2

2 1 𝛿j . 𝜂j 1 − 𝛿j

This leads to 𝛿j2 𝛿j2 ‖ 1 ‖ ‖ ‖ ≤ lΓ ℳ(z)dz ≤ ‖ 2𝜋i ∮ ‖ 2𝜋𝜂 1 − 𝛿 1 − 𝛿j ‖ ‖ Γ j j ◽

and thereby proves the corollary. It follows that ‖𝒮j Δ𝒫j + 𝒫j Δ𝒮j ‖2 = ‖𝒮j Δ𝒫j ‖2 + ‖𝒫j Δ𝒮j ‖2 ≤ 2‖𝒮j ‖2 ‖Δ‖2 . As ‖𝒮j ‖ ≤ (2𝜂j )−1 , we conclude that 𝛿j ‖𝒮j Δ𝒫j + 𝒫j Δ𝒮j ‖ ≤ √ . 2

(5.16)

Consequently, (5.15) implies that 𝒮j Δ𝒫j + 𝒫j Δ𝒮j represents a type of first-order approximation for 𝒫̃ j − 𝒫j . In addition, combining (5.15) and (5.16) gives 𝛿j 𝛿j 𝛿j ‖𝒫̃ j − 𝒫j ‖ ≤ √ + ≤ . 1 − 𝛿j 2 1 − 𝛿j 2

(5.17)

The nonasymptotic nature of these results allows us to potentially obtain useful bounds for multiple projection spaces simultaneously provided that ‖Δ‖ is small relative to 𝜂j . Our results concerning projections can be used to develop bounds for the ̃ First consider differences between eigenvalues and eigenvectors of 𝒯 and 𝒯. the eigenvalues. Theorem 5.1.6 Let 𝜆j be an eigenvalue of the compact, self-adjoint operator ̃ 𝒯 and let 𝒫j be the projection operator for the eigenspace. Suppose that 𝒯 is another compact, self-adjoint operator with associated projection operator 𝒫̃ j as in (5.10). Then, ̃ 𝒫̃ j − 𝜆j 𝒫̃ j = 𝒫̃ j Δ𝒫̃ j + (𝒫̃ j − 𝒫j )(𝒯 − 𝜆j I)(𝒫̃ j − 𝒫j ) 𝒫̃ j 𝒯

(5.18)

and, consequently, for each k = Nj−1 + 1, … , Nj , 𝜆̃k − 𝜆j = ⟨Δ̃ek , ẽ k ⟩ + ⟨(𝒫̃ j − 𝒫j )(𝒯 − 𝜆j I)(𝒫̃ j − 𝒫j )̃ek , ẽ k ⟩.

(5.19)

PERTURBATION THEORY

137

Proof: Write ̃ 𝒫̃ j − 𝜆j 𝒫̃ j = 𝒫̃ j Δ𝒫̃ j + 𝒫̃ j (𝒯 − 𝜆j I)𝒫̃ j . 𝒫̃ j 𝒯 As 𝒫j (𝒯 − 𝜆j I) = 0, we have 𝒫̃ j (𝒯 − 𝜆j I)𝒫̃ j = (𝒫̃ j − 𝒫j )(𝒯 − 𝜆j I)(𝒫̃ j − 𝒫j ) and (5.18) is proved. For (5.19), write ̃ 𝒫̃ j = 𝒫̃ j 𝒯

Nj ∑

𝜆̃k ẽ k ⊗ ẽ k

k=Nj−1 +1



and take inner products. As ‖(𝒫̃ j − 𝒫j )(𝒯 − 𝜆j I)(𝒫̃ j − 𝒫j )‖ ≤ ‖𝒯‖‖(𝒫̃ j − 𝒫j )‖2 ,

(5.20)

(5.19) shows that the first-order approximation to 𝜆̃jk − 𝜆j is ⟨Δ̃ejk , ẽ jk ⟩. The remainder can be dealt with using inequalities such as (5.17). Suppose that both 𝒫j and 𝒫̃ j are of dimension one. Then, arguing as in the proof of Theorem 5.1.6, we see that ̃ − 𝜆̃̃j I)(𝒫̃ j − 𝒫j ) 𝜆̃̃j 𝒫j − 𝒫j 𝒯𝒫j = 𝒫j Δ𝒫j + (𝒫̃ j − 𝒫j )(𝒯 for ̃j = Nj−1 + 1. Thus, in this case, ̃ − 𝜆̃̃j I)(𝒫̃ ̃j − 𝒫j )ej , ej ⟩. 𝜆̃̃j − 𝜆j = ⟨Δej , ej ⟩ + ⟨(𝒫̃ ̃j − 𝒫j )(𝒯

(5.21)

Furthermore, if 𝒫j and 𝒫̃ j are of dimension one for all j then (5.21) takes the form ̃ − 𝜆̃j I)(𝒫̃ j − 𝒫j )ej , ej ⟩. 𝜆̃j − 𝜆j = ⟨Δej , ej ⟩ + ⟨(𝒫̃ j − 𝒫j )(𝒯

(5.22)

Next we consider eigenvectors. In that regard, we first give a basic result that shows differences between eigenvectors are intimately related to differences between the corresponding projection operators. Lemma 5.1.7 Let 𝒫 = e ⊗ e and 𝒫̃ = ẽ ⊗ ẽ with ‖e‖ = ‖̃e‖ = 1. Then, 1. the eigenvalues of 𝒫 − 𝒫̃ are ±(1 − ⟨e, ẽ ⟩2 )1∕2 , ̃ 2 )1∕2 ] if ⟨e, ẽ ⟩ ≥ 0 and 2. ‖e − ẽ ‖2 = 2[1 − (1 − ‖𝒫 − 𝒫‖ ̃ 2 = 2(1 − ⟨e, ẽ ⟩2 ) = 2‖𝒫 − 𝒫‖ ̃ 2. 3. ‖𝒫 − 𝒫‖ HS

138

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Proof: It is clear that the image space of 𝒫 − 𝒫̃ is at most of dimension two. The eigenvalues can be computed from the equivalent matrix eigen problem [ ][ ] [ ] 1 ⟨e, ẽ ⟩ a a =𝜆 . −⟨e, ẽ ⟩ −1 b b So, part 1 follows trivially. By Theorem 4.3.4 and part 1 of the lemma, ̃ 2 = 1 − ⟨̃e, e⟩2 . ‖𝒫 − 𝒫‖

(5.23)

Thus, part 2 of the lemma is a consequence of the identity ‖e − ẽ ‖2 = 2(1 − ⟨̃e, e⟩) ◽

while part 3 is due to part 1 and (5.23). Observe that for ⟨e, ẽ ⟩ ≥ 0 the inequality 1 − (1 − x)1∕2 ≤

x x2 + 2 4(1 − x)

that holds for x ∈ [0, 1] can be used with result 2 of Lemma 5.1.7 to see that ̃ 2+ ‖e − ẽ ‖2 ≤ ‖𝒫 − 𝒫‖

̃ 4 1 ‖𝒫 − 𝒫‖ . ̃ 2 2 1 − ‖𝒫 − 𝒫‖

(5.24)

Moreover, as (1 + x)1∕2 ≤ 1 + x∕2 for x ∈ [0, 1], we obtain ̃ 3 ̃ + 1 ‖𝒫 − 𝒫‖ . ‖e − ẽ ‖ ≤ ‖𝒫 − 𝒫‖ ̃ 2 4 1 − ‖𝒫 − 𝒫‖

(5.25)

Our perturbation result for eigenvectors can now be stated as follows. Theorem 5.1.8 Let 𝜆j be an eigenvalue of 𝒯 with multiplicity one and corresponding eigenvector ej . The multiplicities of other eigenvalues are not ̃ be an approximating operator and suppose (𝜆̃j , ẽ j ) restricted to be one. Let 𝒯 correspond to (𝜆j , ej ) with ⟨ej , ẽ j ⟩ ≥ 0. If inf k≠j |𝜆̃j − 𝜆k | > 0, ∑ ẽ j − ej = (𝜆̃j − 𝜆k )−1 𝒫k Δ̃ej + 𝒫j (̃ej − ej ). (5.26) k≠j

Let 𝜂j = (1∕2) inf k≠j |𝜆j − 𝜆k | and 𝛿j = ‖Δ‖∕𝜂j . If 𝛿j < 1 then ‖ej − ẽ j − 𝒮j Δej ‖ ≤ 𝜓(𝛿j )𝛿j2 for some finite function 𝜓, where 𝜓(𝛿) ↓ 1 as 𝛿 ↓ 0.

(5.27)

PERTURBATION THEORY

139

Proof: It is easy to verify that for k ≠ j (𝜆̃j − 𝜆k )𝒫k (̃ej − ej ) = (𝜆̃j − 𝜆k )𝒫k ẽ j = 𝒫k Δ̃ej . So, if 𝜆̃j ≠ 𝜆k for k ≠ j, ẽ j − ej =

∞ ∑

𝒫k (̃ej − ej ) =

∑ k≠j

k=1

1 𝒫k Δ̃ej + 𝒫j (̃ej − ej ), ̃ 𝜆j − 𝜆k

which is (5.26). Theorem 4.2.8 gives |𝜆̃j − 𝜆j | ≤ ‖Δ‖ < inf k≠j |𝜆j − 𝜆k | so that inf |𝜆̃j − 𝜆k | = inf |𝜆̃j − 𝜆j + 𝜆j − 𝜆k | k≠j

k≠j

≥ inf |𝜆j − 𝜆k | − |𝜆̃j − 𝜆j | k≠j

≥ inf |𝜆j − 𝜆k | − ‖Δ‖ > 0

(5.28)

k≠j

and the condition we need to use the expansion in (5.26) is met. As |𝜆j − 𝜆̃ j | < |𝜆j − 𝜆k |, we can write ( (𝜆̃j − 𝜆k )−1 = (𝜆j − 𝜆k )−1 1 −

𝜆j − 𝜆̃j

)−1

𝜆j − 𝜆k

=

∞ ∑ (𝜆j − 𝜆̃j )s s=0

(𝜆j − 𝜆k )s+1

.

Using this in conjunction with (5.26) produces ẽ j − ej =



(𝜆j − 𝜆k ) 𝒫k Δ̃ej + −1

k≠j

∞ ∑∑ (𝜆j − 𝜆̃j )s k≠j s=1

(𝜆j − 𝜆k )s+1

𝒫k Δ̃ej

+𝒫j (̃ej − ej ) ∑ ∑ = (𝜆j − 𝜆k )−1 𝒫k Δej + (𝜆j − 𝜆k )−1 𝒫k Δ(̃ej − ej ) k≠j

k≠j

∑ ∑ (𝜆j − 𝜆̃j ) ∞

+

k≠j s=1

s

(𝜆j − 𝜆k )s+1

𝒫k Δ̃ej + 𝒫j (̃ej − ej ).

The proof will be complete once we obtain bounds for the last three terms that arise in the last relation. From Bessel’s inequality, we see that ‖∑ ‖ ‖Δ‖‖̃e − e ‖ 𝛿 ‖ ‖ j j j ‖ (𝜆j − 𝜆k )−1 𝒫k Δ(̃ej − ej )‖ ≤ = ‖̃ej − ej ‖. ‖ ‖ 2𝜂j 2 ‖ k≠j ‖ ‖ ‖

140

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Similarly, )2 ‖∑ ∑ ‖2 ∑ (∑ ∞ ∞ (𝜆j − 𝜆̃j )s (𝜆j − 𝜆̃ j )s ‖ ‖ ‖ 𝒫k Δ̃ej ‖ ‖𝒫k Δ̃ej ‖2 ‖ ‖ = s+1 (𝜆 − 𝜆 ) ‖ k≠j s=1 (𝜆j − 𝜆k )s+1 ‖ j k k≠j s=1 ‖ ‖ ∑ (𝜆j − 𝜆̃j )2 = ‖𝒫k Δ̃ej ‖2 2 (𝜆 2 ̃ (𝜆 − 𝜆 ) − 𝜆 ) j k j k k≠j 𝛿j4 ‖Δ‖4 ≤ 2 ≤ 4 4𝜂j (2𝜂j − ‖Δ‖)2 as a result of (5.28) and Theorem 4.2.8. Finally, 1 ‖𝒫j (̃ej − ej )‖ = (1 − ⟨̃ej , ej ⟩) = ‖̃ej − ej ‖2 . 2 Assertion (5.27) now follows from (5.17), (5.24), and (5.25).

5.2



Perturbation of general compact operators

To this point, we have dealt only with the case of eigenvalues and eigenvectors for a self-adjoint, compact operator 𝒯. It is also of interest to have similar results that can be used more generally for any compact operator between two Hilbert spaces. This means that we now need to assess the effect of perturbation on singular values and vectors. That is the subject addressed in this section. ̃ ∈ 𝔅(ℍ1 , ℍ2 ) where, Let ℍ1 and ℍ2 be separable Hilbert spaces with 𝒯, 𝒯 ̃ as an approximating operator to 𝒯. As singular valagain, we think of 𝒯 ues and vectors are obtained from the eigenvalues and eigenvectors of 𝒯∗ 𝒯 and 𝒯𝒯∗ , we have in some sense already addressed how the singular values ̃ approximate those of 𝒯. However, the goal is to assess the and vectors of 𝒯 effect of perturbations of 𝒯 on its singular values and the theory we have developed would, strictly speaking, only be directly applicable to perturbations that were made to 𝒯∗ 𝒯 or 𝒯𝒯∗ . Thus, it it worthwhile to explore this issue in somewhat more detail. We will assume that all the nonzero singular values are of unit multiplicity. As will be evident from our proofs, this is by no means necessary but ̃ have singular systems appreciably simplifies our presentation. Let 𝒯 and 𝒯 ∞ ∞ ̃ ̃ ̃ {(𝜆j , f1j , f2j )}j=1 and {(𝜆j , f1j , f2j )}j=1 . Without loss, we will assume that {f1j } and {f2j } provide CONSs for ℍ1 and ℍ2 . Then, for k, 𝓁 = 1, 2, we will have need for the operators 𝒫k𝓁j = fkj ⊗ f𝓁j , 𝒫̃ k𝓁j = f̃kj ⊗ f̃𝓁j ;

PERTURBATION THEORY

141

i.e., 𝒫k𝓁j f = ⟨ fkj , f ⟩k f𝓁j , 𝒫̃ k𝓁j f = ⟨ ̃f kj , f ⟩k ̃f 𝓁j for f ∈ ℍk . In particular, 𝒫11j and 𝒫22j are the projection operators for span{f1j } and span{f2j }. Now take ̃ −𝒯 Δ=𝒯 and let

( 𝜁j =

)−1 1 min |𝜆2k − 𝜆2j | (‖𝒯‖ + ‖Δ‖)‖Δ‖. 2 k≠j

Our first objective is to establish analogs of (5.21) and (5.27) that are applicable to the present setting. Theorem 5.2.1 Let 𝜖 be any number in (0, 1) and assume that 𝜁j < 1 − 𝜖. Then, there exists a constant C = C𝜖 ∈ (0, ∞) such that ‖(𝜆̃j 𝒫12j − 𝒫22j 𝒯𝒫11j ) − 𝒫22j Δ𝒫11j ‖ ≤ C max(‖𝒯‖, 1)𝜁j2

(5.29)

|(𝜆̃j − 𝜆j ) − ⟨f2j , Δf1j ⟩2 | ≤ C max(‖𝒯‖, 1)𝜁j2 .

(5.30)

and

̃ = 𝒯n Our applications are usually for situations with fixed 𝒯 and j with 𝒯 for some approximating sequence of operators {𝒯n }. In such instances, 𝜁j = O(‖Δ‖) with the consequence that the bounds in (5.29) and (5.30) are of order ‖Δ‖2 . Proof: Relation (5.30) follows from (5.29) by taking inner products. Thus, we need only establish the latter result. A first step in this direction involves showing that 𝜆̃j 𝒫12j − 𝒫22j 𝒯𝒫11j ̃ − 𝜆̃j 𝒫̃ 12j )(𝒫̃ 11j − 𝒫11j ) = 𝒫22j Δ𝒫11j + (𝒫̃ 22j − 𝒫22j )(𝒯 +𝜆̃ j 𝒫̃ 22j (𝒫12j − 𝒫̃ 12j )𝒫̃ 11j . To verify this identity observe that as 𝒫22j 𝒫12j 𝒫11j = 𝒫12j 𝜆̃j 𝒫12j − 𝒫22j 𝒯𝒫11j ̃ − 𝜆̃j 𝒫12j )𝒫11j = 𝒫22j Δ𝒫11j − 𝒫22j (𝒯 ̃ − 𝜆̃j 𝒫̃ 12j )𝒫11j − 𝜆̃ j 𝒫22j (𝒫̃ 12j − 𝒫12j )𝒫11j . = 𝒫22j Δ𝒫11j − 𝒫22j (𝒯

142

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

The second term on the right-hand side of the last relation is ̃ − 𝜆̃j 𝒫̃ 12j )(𝒫̃ 11j − 𝒫11j ) −(𝒫̃ 22j − 𝒫22j )(𝒯 ̃ − 𝜆̃j 𝒫̃ 12j ) = 0 and (𝒯 ̃ − 𝜆̃j 𝒫̃ 12j )𝒫̃ 11j = 0. because 𝒫̃ 22j (𝒯 It remains to obtain norm bounds for the remainder terms (𝒫̃ 22j − ̃ − 𝜆̃j 𝒫̃ 12j )(𝒫̃ 11j − 𝒫11j ) and 𝜆̃j 𝒫22j (𝒫̃ 12j − 𝒫12j )𝒫11j . To do so 𝒫22j )(𝒯 begin by noting that ( )−1 1 ̃ ∗𝒯 ̃ − 𝒯∗ 𝒯‖ ≤ 𝜁j . min |𝜆2k − 𝜆2j | ‖𝒯 2 k≠j Then, from (5.17), we have max(‖𝒫̃ 11j − 𝒫11j ‖, ‖𝒫̃ 22j − 𝒫22j ‖) ≤

𝜁j 1 − 𝜁j

≤ C𝜁j

(5.31)

with C = 𝜖 −1 and we can conclude that ̃ − 𝜆̃ j 𝒫̃ 12j )(𝒫̃ 11j − 𝒫11j )‖ ‖(𝒫̃ 22j − 𝒫22j )(𝒯 ̃ ≤ ‖𝒫̃ 11j − 𝒫11j ‖‖𝒫̃ 22j − 𝒫22j ‖‖𝒯‖ ≤ C‖𝒯‖𝜁j2 . Next, ‖𝒫22j (𝒫̃ 12j − 𝒫12j )𝒫11j ‖ = 1 − ⟨f̃1j , f1j ⟩1 ⟨f̃2j , f2j ⟩2 = 1 − ⟨f̃2j , f2j ⟩2 + (1 − ⟨f̃1j , f1j ⟩1 )⟨f̃2j , f2j ⟩2 , where we assume without loss of generality that ⟨f̃1j , f1j ⟩1 and ⟨f̃2j , f2j ⟩2 are nonnegative. Applying (5.24) and (5.31) and using the identity 1 − ⟨f̃𝓁j , f𝓁j ⟩𝓁 = ‖f̃𝓁j − f𝓁j ‖2𝓁 ∕2 leads to ‖𝒫22j (𝒫̃ 12j − 𝒫12j )𝒫11j ‖ ≤ C𝜁j2 for some suitable C and completes the proof.



The singular vector version of Theorem 5.1.8 now takes the following form.

PERTURBATION THEORY

143

Theorem 5.2.2 Let 𝜖 be any number in (0, 1) and assume that 𝜁j < 1 − 𝜖. Then, there exists a constant C = C𝜖 ∈ (0, ∞) such that ‖ ∑ 𝜆k ⟨f2k , Δf1j ⟩2 + 𝜆j ⟨f2j , Δf1k ⟩2 ‖ ‖ ‖ ‖(f̃1j − f1j ) − f1k ‖ ‖ ‖ ≤ C𝛾j 2 2 𝜆k − 𝜆j ‖ ‖ k≠j ‖ ‖1 ( 𝛾j = 𝜁j 𝜁j +

with

‖Δ‖ ‖Δ‖ + ‖𝒯‖

) .

Proof: If 𝜁j < 1 then, by Theorem 5.1.8, ̃ ∗𝒯 ̃ − 𝒯∗ 𝒯)f1j ‖1 ≤ 𝜓(𝜁j )𝜁 2 , ‖(f̃1j − f1j ) − 𝒮11j (𝒯 j where 𝒮11j =



(𝜆2k − 𝜆2j )−1 𝒫11k .

k≠j

Now ̃ ∗𝒯 ̃ −𝒯 ̃ ∗ 𝒯)f ̃ 1j = 𝒮11j (𝒯

̃ ∗𝒯 ̃ − 𝒯∗ 𝒯)f1j ⟩1 ∑ ⟨f1k , (𝒯 k≠j

𝜆2k − 𝜆2j

f1k

and we can write ̃ ∗𝒯 ̃ − 𝒯∗ 𝒯 = 𝒯∗ (𝒯 ̃ − 𝒯) + (𝒯 ̃ ∗ − 𝒯∗ )𝒯 + (𝒯 ̃ ∗ − 𝒯∗ )(𝒯 ̃ − 𝒯). 𝒯 Thus, ̃ ∗𝒯 ̃ − 𝒯∗ 𝒯)f1j ⟩1 ⟨f1k , (𝒯 ̃ − 𝒯)f1j ⟩1 + ⟨f1k , (𝒯 ̃ ∗ − 𝒯∗ )𝒯f1j ⟩1 = ⟨f1k , 𝒯∗ (𝒯 ̃ ∗ − 𝒯∗ )(𝒯 ̃ − 𝒯)f1j ⟩1 +⟨f1k , (𝒯 = 𝜆k ⟨f2k , Δf1j ⟩2 + 𝜆j ⟨f2j , Δf1k ⟩2 + ⟨f1k , Δ∗ Δf1j ⟩1 and ‖∑ ⟨f , Δ∗ Δf ⟩ ‖ ‖ ‖ ‖Δ‖ 1k 1j 1 1 ‖ f1k ‖ ‖Δ‖2 = 𝜁. ‖ ‖ ≤ 2 2 2 2 2(‖Δ‖ + ‖𝒯‖) j 𝜆k − 𝜆j mink≠j |𝜆j − 𝜆k | ‖ k≠j ‖ ‖ ‖1



144

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

We conclude this chapter with an application of the developments in this section to an fda relevant problem from numerical analysis. Example 5.2.3 Consider the 𝕃2 [0, 1] integral operator 1

(𝒯f )(t) =

f (s)K(t, s)ds

∫0

for some continuous kernel function K on [0, 1] × [0, 1]. We do not presume that the kernel is symmetric. So, 𝒯 is compact but not necessarily self-adjoint. An application of (4.15) in this instance allows us to express the operator as 𝒯=

∞ ∑

𝜆j f1j ⊗ f2j

j=1

with {(𝜆j , f1j , f2j )}∞ its associated singular system. We will approximate the j=1 components of this system using a Rayleigh-Ritz motivated scheme proposed by Hansen (1988). The goal is to assess the error incurred by this approach in approximating the singular values and functions for 𝒯. Let {eij }∞ , i = 1, 2, be two CONSs for 𝕃2 [0, 1]. The idea is that we approxj=1 imate 𝒯 by n n ∑ ∑ 𝒯n = ⟨e2j , 𝒯e1i ⟩e1i ⊗ e2j (5.32) i=1 j=1

and then use its singular values and functions to approximate those for 𝒯. Note that under our assumptions 𝒯 is an HS operator. This is because 𝒯∗ 𝒯 is an integral operator with the symmetric and continuous kernel 1

Q(s, t) =

K(u, t)K(u, s)du.

∫0

It is therefore trace class and its eigenvalues 𝜆2j are summable. As a result, it is easy to assess the size of Δn = 𝒯n − 𝒯 because ‖Δn ‖2 ≤ ‖Δn ‖2HS =

∞ ∞ ∑ ∑

⟨e2j , 𝒯e1i ⟩2 .

i=n+1 j=n+1

This bound necessarily decays to zero as n diverges. It is not difficult to see that the singular values and functions for 𝒯n derive from the singular values and vectors of the matrix 𝒜 = {⟨e2j , 𝒯e1i ⟩}i,j=1∶n . Using Corollary 4.3.2, we can write 𝒜 =

n ∑ j=1

𝜆(n) uj 𝑣Tj j

PERTURBATION THEORY

145

for orthonormal n-vectors 𝑣j = (𝑣1j , … , 𝑣nj )T and uj = (u1j , … , unj )T . The sve for 𝒯n is now given explicitly by 𝒯n =

n ∑

𝜆(n) f (n) ⊗ f2j(n) j 1j

j=1

with f1j(n)

=

n ∑

𝑣ij e1i

i=1

and f2j(n)

=

n ∑

uij e2i .

i=1

Theorem 5.2.1 now implies that for any particular nonrepeated singular value 𝜆j we will have 𝜆(n) = 𝜆j + ⟨f2j , Δn f1j ⟩ + O(‖Δn ‖2 ). j Similarly, for a specific singular function f1j , Theorem 5.2.2 produces f1j(n) = f1j +

∑ 𝜆k ⟨f2k , Δn f1j ⟩ + 𝜆j ⟨f2j , Δn f1k ⟩ k≠j

𝜆2k



𝜆2j

f1k + O(‖Δn ‖2 ).

6

Smoothing and regularization As stated in Chapter 1, our view of functional data analysis involves the statistical analysis of sample paths that arise from one or more stochastic processes. The processes themselves are presumed to be random elements of some Hilbert space such as 𝕃2 [0, 1] in a sense that will be precisely defined in Chapter 7. For now, it suffices to merely think of the collected data as being discretized readings from a sample of curves. Discretization entails some loss of information. This may be sufficiently problematic in some instances to require remedial measures to recover some of what was lost. In addition, there may be contamination or distortions of the actual sample path values by noise or other sources of error. In such cases, it may be worthwhile to perform some preprocessing to filter out artifacts in the data that have arisen from extraneous sources. Such problems are not unique to fda and arise in a variety of statistical contexts. The methods that have evolved for their solution are generally referred to as smoothing or nonparametric smoothing techniques that will be the focus of this chapter.

6.1

Functional linear model

A conceptually simple smoothing problem arises from nonparametric regression analysis. In that setting, we have a real valued mean or regression function m on [0, 1] that is discretely observed with additive random noise; i.e., the realized data takes the form (ti , Yi ), i = 1, … , n, with Yi = m(ti ) + 𝜀i ,

(6.1)

for “time” ordinates 0 ≤ t1 < · · · < tn ≤ 1 and 𝜀1 , … , 𝜀n zero mean, uncorrelated random variables with some common variance 𝜎 2 . The objective is estimation of m. Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators, First Edition. Tailen Hsing and Randall Eubank. © 2015 John Wiley & Sons, Ltd. Published 2015 by John Wiley & Sons, Ltd.

148

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

An estimator of m serves a similar purpose to a sample mean in that it gives a summary of how the responses behave on the average as a function of the independent or time variable, T. However, some restrictions are needed on m in order to make the summarization process fruitful; if m is totally arbitrary the best summary statistic is just the original response values Y1 , … , Yn . When m is smooth in the sense of being continuous or differentiable, the true value of m at any time point t ∈ [0, 1] may be estimated by suitably averaging the Yi ’s whose corresponding T ordinates are close to t. The process of replacing groups of response values by their average has the effect of reducing the local influence of individual responses with the outcome being smoother or less variable than what was present in the original data. This is what has motivated terminology such as data smoothing, scatter plot smoothing, and nonparametric smoothing for such averaging procedures. The word nonparametric is used here to emphasize the fact that the function m that is being estimated is presumed to derive from an infinite-dimensional space of candidate functions rather than some finite dimensional parametric family as has classically been the case in, e.g., linear regression analysis. In this chapter, we take a broad view of nonparametric function estimation that subsumes nonparametric regression as well as smoothing problems that arise from other contexts such as fda. The formulation assumes that we have two Hilbert spaces 𝕐 and ℍ with norms and inner products ‖ ⋅ ‖𝕐 , ⟨⋅, ⋅⟩𝕐 , ‖ ⋅ ‖ℍ and ⟨⋅, ⋅⟩ℍ . Then, given 𝒯 ∈ 𝔅(ℍ, 𝕐 ), we observe Y = 𝒯m + 𝜀,

(6.2)

for Y, 𝜀 ∈ 𝕐 , and m ∈ ℍ. Here and hereafter, 𝜀 will be used to represent a generic error term that has zero mean. Although we treat 𝒯 as being known, we should perhaps note that in practice 𝒯 may not be known exactly and/or may be random as in Chapter 11. We will refer to (6.2) as the (functional) linear model. As with nonparametric regression, our goal is to use the observed value of Y to estimate m. That is the topic that will be addressed in the following section. First, however, it will be useful to look at some specific examples of model (6.2) to gain some insight into the potential applications we have in mind. Example 6.1.1 Our starting point was nonparametric regression. So, let us begin by seeing how it can be treated using model (6.2). One approach would be to take ℍ as an RKHS containing functions defined on [0, 1] with 𝕐 = ℝn . Then, for g ∈ ℍ, we define 𝒯 by 𝒯g = (g(t1 ), … , g(tn ))T . As evaluation functionals are bounded and linear in an RKHS (Example 3.2.2), 𝒯 ∈ 𝔅(ℍ, 𝕐 ).

SMOOTHING AND REGULARIZATION

149

Example 6.1.2 Let {X(t) ∶ t ∈ [0, 1]} be a stochastic process on [0, 1] with mean function m(⋅) and suppose that we observe n independent copies X1 , … , Xn of X(⋅) whose sample paths fall in an RKHS ℍ. The information for the ith sample path is digitized with readings being recorded at time ordinates 0 ≤ ti1 < · · · < tiri ≤ 1. The realized, digitized data then takes the form (tij , Yij ), i = 1, … , n, j = 1, … , ri for Yij = xi (tij ) + 𝜀̃ij = m(tij ) + {xi (tij ) − m(tij )} + 𝜀̃ij with random errors 𝜀̃ij . Thus, let Y = (Y11 , … , Y1r1 , … , Yn1 , … , Ynrn )T , 𝒯g = (g(t11 ), … , g(t1r1 ), … , g(tn1 ), … , g(tnrn ))T and set 𝜀ij = xi (tij ) − m(tij ) + 𝜀̃ij to return to the (6.2) formulation. Problems of estimation in this setting have been investigated by, e.g., Rice and Silverman (1991). Example 6.1.3 Let R be a known, continuous bivariate function defined on [0, 1] × [0, 1]. Suppose we observe (ti , Yi ), i = 1, … , n, where 1

Yi =

∫0

R(ti , u)m(u)du + 𝜀i ,

with m ∈ 𝕃2 [0, 1]. In this case, Y = (Y1 , … , Yn )T , ( 1 )T 1 𝒯g = R(t1 , u)g(u)du, … , R(tn , u)g(u)du . ∫0 ∫0 Estimation of m using data of this variety has been considered by, e.g., Nychka and Cox (1989). Example 6.1.4 Assume that m ∈ 𝕃2 [0, 1] and that {X(t) ∶ t ∈ [0, 1]} is a stochastic process from which we observe n independent copies X1 , … , Xn . Suppose that the data at our disposal takes the form of (Xi , Yi ) and (tij , Xi (tij )) for i = 1, … , n and j = 1, … , ri , where 1

Yi =

∫0

m(u)Xi (u)du + 𝜀i .

150

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

This has the form of model (6.2) under the choice of Y = (Y1 , … , Yn )T , ( 1 )T 1 𝒯g = g(u)X1 (u)du, … , g(u)Xn (u)du . ∫0 ∫0 Note that in this instance 𝒯 is random and is not observed directly but must instead be evaluate using the observed X sample paths. If, for each given i, the tij ’s are uniformly distributed, then the values produced by the 𝒯 operator can be approximated by ( )T r1 rn ∑ ∑ 1 1 ̃ = 𝒯g g(t )X (t ), … , g(t )X (t ) r1 j=1 1j 1 1j rn j=1 nj n nj or, more generally, by some other quadrature type method.

6.2

Penalized least squares estimators

We now wish to develop an estimation criterion that can be used to accompany the functional regression model (6.2). As a first step in that direction, let us return momentarily to the prototypical nonparametric regression problem corresponding to model (6.1). In that setting, an initial thought might be to obtain a least-squares estimator m ̂ of m from m ̂ = argming∈ℍ

n ∑

(Yi − g(ti ))2 ∕n.

(6.3)

i=1

However, if we were to take, e.g., ℍ = 𝕎2 [0, 1] from Section 2.8, there would be infinitely many choices for m ̂ in (6.3); any function g ∈ 𝕎2 [0, 1] that interpolates the Y data by satisfying g(ti ) = Yi will minimize the least-squares criterion. Perhaps more to the point is that an estimator that interpolates the responses would generally be rejected a priori as being too slavish to the data as well as ineffective as a summary device. Some additional restrictions are therefore needed to make a least-square type criterion meaningful when ℍ is infinite dimensional. One way to accomplish this for nonparametric regression when ℍ = 𝕎2 [0, 1] is to instead consider the estimator defined by m ̂ = argming∈ℍ𝛿

n ∑

(Yi − g(ti ))2 ∕n,

(6.4)

i=1

{

where ℍ𝛿 ∶=

}

1

g ∈ 𝕎2 [0, 1] ∶

∫0

|g (t)| dt ≤ 𝛿 (2)

2

(6.5)

SMOOTHING AND REGULARIZATION

151

for some user specified 𝛿 > 0. One justification for this approach derives from 1 linear regression analysis. Specifically, if g is a line, ∫0 |g(2) (t)|2 dt is zero and the minimizer of the least-squares criterion in (6.3) over all such functions is the simple linear regression estimator of m. With that in mind, 𝛿 in (6.5) now has the interpretation as a measure of departure from linearity. Thus, (6.4) produces a constrained least-squares estimator with bounds being enforced on how far the solution can stray from the simple linear fit. Criterion (6.4) does not lend itself to simple practical implementation. Instead, it is better to work with the equivalent problem of finding the solution to ( ) n 1 ∑ m𝜂 = argming∈𝕎2 [0,1] n−1 |g(2) (t)|2 dt (6.6) (Yi − g(ti ))2 + 𝜂 ∫ 0 i=1 with 𝜂 > 0 now being some value that can be shown to be uniquely determined by 𝛿 in the previous formulation. The interpretation to be placed on (6.6) is that we wish to fit the response data with a smooth function (in the sense of being in 𝕎2 [0, 1]). The first component ∑n of the estimation criterion is the (average) residual sum of squares n−1 i=1 (Yi − g(ti ))2 that evaluates the fidelity 1 of the fit to the observations. The ∫0 |g(2) (t)|2 dt term measures the smoothness of g in terms of its average squared curvature. The relationship between the two components is that smaller values of the latter for m𝜂 correspond to larger values of the former and conversely. So, by choosing a large value of 1 𝜂, we place a premium on smoothness as assessed by ∫0 |g(2) (t)|2 dt with the consequence that the resulting estimator will behave more like a simple linear regression estimator than would be true had we opted for a smaller value. On the other hand, larger values of 𝜂 deemphasize smoothness thereby allowing more flexible function forms that are capable of adhering more closely to the response values. The smoothing parameter 𝜂 in (6.6) can now be viewed as a tuning devise that regulates the relative importance one places on fit to the data vis-a-vis smoothness. Now suppose we adopt either of the inner products for 𝕎2 [0, 1] that were introduced in Section 2.8. Then, 1

∫0

|g(2) (t)|2 dt = ⟨g, 𝒲g⟩ℍ

with 𝒲 the 𝕎2 [0, 1] projection operator for ℍ1 in (2.42). The salient feature of 𝒲 that we wish to take forward is that it is a nonnegative operator. In light of this fact, one possible generalization of the estimator in (6.6) that could be used for m in model (6.2) is a minimizer of f (g; 𝒯, 𝒲, Y, 𝜂) ∶= ‖Y − 𝒯g‖2𝕐 + 𝜂⟨g, 𝒲g⟩ℍ

(6.7)

152

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

over g ∈ ℍ with 𝒲 a nonnegative element of 𝔅(ℍ) and 𝜂 ∈ (0, ∞). In the context of this, more general problem 𝜂 is frequently called a regularization parameter and the associated estimators of m are referred to as method of regularization estimators. We will use “smoothing parameter” and “regularization parameter” interchangeably when referring to 𝜂. For the moment, we will treat the value of 𝜂 as being given. Methods for adaptively choosing its value from data are discussed in Section 6.5. The following result provides a characterization of the method of regularization estimator. Theorem 6.2.1 Assume that (𝒯∗ 𝒯 + 𝒲) is invertible. Then, for 𝜂 ∈ (0, ∞) m𝜂 ∶= (𝒯∗ 𝒯 + 𝜂𝒲)−1 𝒯∗ Y

(6.8)

is the unique minimizer of (6.7) over g ∈ ℍ. Proof: Consider any candidate solution of the form g̃ = m𝜂 + g. Then, f (̃g; 𝒯, 𝒲, Y, 𝜂) = ‖Y‖2𝕐 + ‖𝒯̃g‖2𝕐 − 2⟨Y, 𝒯̃g⟩𝕐 + 𝜂⟨̃g, 𝒲 g̃ ⟩ℍ . By definition, 𝒯∗ Y = (𝒯∗ 𝒯 + 𝜂𝒲)m𝜂 and, hence, ⟨Y, 𝒯̃g⟩𝕐 = ⟨(𝒯∗ 𝒯 + 𝜂𝒲)m𝜂 , g̃ ⟩ℍ . Thus, f (̃g; 𝒯, 𝒲, Y, 𝜂) = ‖Y‖2𝕐 + ⟨𝒯∗ 𝒯̃g, g̃ ⟩ℍ − 2⟨(𝒯∗ 𝒯 + 𝜂𝒲)m𝜂 , g̃ ⟩ℍ + 𝜂⟨𝒲 g̃ , g̃ ⟩ℍ = ‖Y‖2𝕐 + ⟨(𝒯∗ 𝒯 + 𝜂𝒲)(̃g − 2m𝜂 ), g̃ ⟩ℍ , where ⟨(𝒯∗ 𝒯 + 𝜂𝒲)(̃g − 2m𝜂 ), g̃ ⟩ℍ = ⟨(𝒯∗ 𝒯 + 𝜂𝒲)(̃g − m𝜂 ), g̃ ⟩ℍ − ⟨(𝒯∗ 𝒯 + 𝜂𝒲)m𝜂 , g̃ ⟩ℍ = ⟨(𝒯∗ 𝒯 + 𝜂𝒲)(̃g − m𝜂 ), g̃ ⟩ℍ − ⟨(𝒯∗ 𝒯 + 𝜂𝒲)̃g, m𝜂 ⟩ℍ = ⟨(𝒯∗ 𝒯 + 𝜂𝒲)(̃g − m𝜂 ), g̃ − m𝜂 ⟩ℍ − ⟨(𝒯∗ 𝒯 + 𝜂𝒲)m𝜂 , m𝜂 ⟩ℍ = ⟨(𝒯∗ 𝒯 + 𝜂𝒲)g, g⟩ℍ − ⟨(𝒯∗ 𝒯 + 𝜂𝒲)m𝜂 , m𝜂 ⟩ℍ . As (𝒯∗ 𝒯 + 𝜂𝒲) is positive definite, the last expression is uniquely minimized at g = 0. ◽ The theorem applies to values of 𝜂 in (0, ∞). However, it leaves open what transpires as the smoothing parameter tends to either end of this interval. These cases are conceptually important. If we are to view smoothing as a

SMOOTHING AND REGULARIZATION

153

sliding scale controlled by our choice for 𝜂, then we must understand what the two extreme values for 𝜂 reflect in order to appreciate what may result from selection of a value in the interior of the interval. We begin with the simple case of ordinary Tikhonov regularization where we chose to minimize (6.7) with 𝒲 = I: i.e., f (g; 𝒯, I, Y, 𝜂) = ‖Y − 𝒯g‖2𝕐 + 𝜂‖g‖2ℍ .

(6.9)

We will eventually use the results for this case to handle the general problem. Applying Theorem 6.2.1 in this instance produces the Tikhonov estimator m𝜂 = (𝒯∗ 𝒯 + 𝜂I)−1 𝒯∗ Y.

(6.10)

The behavior of this estimator for the two extreme values of the smoothing parameter is described in our following result. Theorem 6.2.2 Assume that Y ∈ Dom(𝒯† ). Then, m𝜂 in (6.10) satisfies lim m𝜂 = 0

𝜂→∞

and

lim m𝜂 = (𝒯∗ 𝒯)† 𝒯∗ Y.

𝜂→0

Proof: We know that 𝜂‖m𝜂 ‖2ℍ ≤ ‖Y − 𝒯m𝜂 ‖2𝕐 + 𝜂‖m𝜂 ‖2ℍ ≤ ‖Y − 𝒯g‖2𝕐 + 𝜂‖g‖2ℍ

(6.11)

for all g ∈ ℍ. In particular, this is true for g = 0, which means that 𝜂‖m𝜂 ‖ℍ ≤ ‖Y‖2𝕐 and m𝜂 must therefore tend to zero as 𝜂 diverges. Now consider the case where 𝜂 → 0. In this instance, use (6.11) with g = 𝒯† Y to see that 𝜂‖m𝜂 ‖2ℍ ≤ ‖Y − 𝒯𝒯† Y‖2𝕐 + 𝜂‖𝒯† Y‖2ℍ = 𝜂‖𝒯† Y‖2ℍ . The equality stems from (3.19) and the fact that Y ∈ Dom(𝒯† ). So, if 𝜂n is any sequence of smoothing parameter values that converges to zero, we must have lim sup ‖m𝜂n ‖ℍ ≤ ‖𝒯† Y‖ℍ . n→∞

Theorem 3.2.11 entails that the bounded sequence m𝜂n must have a subsequence m𝜂n that converges weakly to some g ∈ ℍ. Then, from Theorem 3.3.6 k 𝒯m𝜂n must converge weakly to 𝒯g. However, k

‖Y − 𝒯m𝜂n ‖2𝕐 ≤ ‖Y − 𝒯m𝜂n ‖2𝕐 + 𝜂nk ‖m𝜂n ‖2ℍ ≤ 𝜂nk ‖𝒯† Y‖2ℍ → 0. k

Thus, 𝒯g = Y.

k

k

154

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Now, it must be that all the m𝜂n are in the closed (from Theorem 2.5.6) k subspace Ker(𝒯)⊥ . To see this note that we can write every element of ℍ as g = h + 𝑣 with 𝑣 ∈ Ker(𝒯) and h ∈ Ker(𝒯)⊥ . Thus, f (g; 𝒯, I, Y, 𝜂) ≥ f (h; 𝒯, I, Y, 𝜂) and only elements of Ker(𝒯)⊥ need be examined when searching for a minimizer. Property (3.18) now has the consequence that if m𝜂n converges weakly k to g it must be that 𝒯† 𝒯g = g = 𝒯† Y. The same argument applies to any subsequence of m𝜂n with the consequence that every such subsequence will contain a further subsequence that converges weakly to 𝒯† Y. As a result, m𝜂n must also converge weakly to 𝒯† Y. Now suppose that there were a subsequence having ‖m𝜂n ‖ℍ ≤ ‖𝒯† Y‖ℍ − 𝜀 k

for some 𝜀 > 0 and all k. This sequence would be bounded and have a further subsequence that converges weakly to 𝒯† Y thereby producing a contradiction. Accordingly, we must conclude that lim inf ‖m𝜂n ‖ℍ ≥ ‖𝒯† Y‖ℍ . n→∞

We have now established that m𝜂n converges weakly to 𝒯† Y and that ‖m𝜂n ‖ℍ → ‖𝒯† Y‖ℍ . As, ‖m𝜂n − 𝒯† Y‖2ℍ = ‖m𝜂n ‖2ℍ − 2⟨m𝜂n , 𝒯† Y⟩ℍ + ‖𝒯† Y‖2ℍ this is sufficient to conclude that ‖m𝜂n − 𝒯† Y‖ℍ → 0 and complete the proof. ◽ Referring back to Theorem 3.5.10, we see that when 𝒲 = I in (6.7) and we let 𝜂 tend to zero, the Tikhonov estimator converges to the best (i.e., minimum norm) least-squares (approximate) solution of 𝒯g = Y. On the other hand, the estimator converges to zero as 𝜂 grows large. While one cannot debate the smoothness of this latter choice, it seems rather simplistic from an estimation standpoint. This is remedied by considering a general choice for 𝒲 that has a nontrivial null space. In such instances, one can argue intuitively that the estimator will come from this null space when 𝜂 diverges. This allows us to view the choice of 𝒲 from a modeling perspective wherein one can choose the 𝒲 operator in such a way that its null space corresponds to some “ideal” form for m in (6.2). For example, in the special case of (6.6), the idealized model would have m as a line.

SMOOTHING AND REGULARIZATION

155

We still need a rigorous extension of Theorem 6.2.2 that applies to general 𝒲. This is provided by the following result. Theorem 6.2.3 Assume that Y ∈ Dom(𝒯† ), 𝒲 is nonnegative definite and that 𝒯∗ 𝒯 + 𝜂𝒲 is invertible for 𝜂 ∈ (0, ∞). Then, m𝜂 in (6.8) satisfies lim m𝜂 = 𝒯† Y

𝜂→0

and

lim m𝜂 = 𝒫Ker(𝒲) 𝒯† Y,

𝜂→∞

where 𝒫Ker(𝒲) is the projection operator for the null space of 𝒲. Proof: To simplify the presentation, it will be helpful to use 𝒫 = 𝒫Ker(𝒲) and 𝒬 = I − 𝒫 throughout the proof. With this notation, we can write any element g of ℍ as g = 𝒫g + 𝒬g ∶= 𝑣 + h = 𝒫𝑣 + 𝒬h. Then, f (g; 𝒯, 𝒲, Y, 𝜂) = ‖Y − 𝒯𝒫𝑣 + 𝒯𝒬h‖2𝕐 + 𝜂‖𝒲 1∕2 𝒬h‖2ℍ ̃ h‖ ̃ 2 + 𝜂‖h‖ ̃ 2 = ‖Y − 𝒯𝒫𝑣 + 𝒯 𝕐 ℍ ̃ ̃ = f (h; 𝒯, I, Y − 𝒯𝒫𝑣, 𝜂)

̃ = 𝒯𝒬𝒲 −1∕2 and h̃ = 𝒲 1∕2 𝒬h. The use of 𝒲 −1∕2 here is justified for 𝒯 because h̃ is the 𝒲 1∕2 image of an element of ℍ that has no nonzero component from the operator’s null space. An application of Theorem 6.2.1 shows that for any given 𝑣 in the null space ̃ I, Y − 𝒯𝒫𝑣, 𝜂) is minimized by ̃ 𝒯, of 𝒲, f (h; ( ∗ ) ̃ 𝒯 ̃ + 𝜂I −1 𝒯 ̃ ∗ (Y − 𝒯𝒫𝑣). h̃ 𝜂 (𝑣) = 𝒯 A little algebra along with the identity ( ∗ ) ( ∗ ) ̃ 𝒯 ̃ + 𝜂I −1 𝒯 ̃ ∗𝒯 ̃ =I−𝜂 𝒯 ̃ 𝒯 ̃ + 𝜂I −1 𝒯 leads to

̃ I, Y − 𝒯𝒫𝑣, 𝜂) f (h̃ 𝜂 (𝑣); 𝒯, ⟨ ⟩ −1 ̃ ∗ ̃ = Y − 𝒫𝒯𝑣, (I − 𝒯𝒢(𝜂) 𝒯 )(Y − 𝒯𝒫𝑣) 𝕐

̃ ∗𝒯 ̃ + 𝜂I. Now use the Sherman-Morrison–Woodbury formula with 𝒢(𝜂) = 𝒯 from (3.14) to write −1 ̃ ∗ ̃ ̃𝒯 ̃ ∗ )−1 , I − 𝒯𝒢(𝜂) 𝒯 = 𝜂(𝜂I + 𝒯

thereby obtaining 2 ( ) ̃ I, Y − 𝒯𝒫𝑣, 𝜂) = 𝜂 ‖ ̃𝒯 ̃ ∗ −1∕2 (Y − 𝒯𝒫𝑣)‖ f (h̃ 𝜂 (𝑣), 𝒯, ‖ 𝜂I + 𝒯 ‖ . ‖ ‖𝕐

156

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

This latter expression is minimized as a function of 𝑣 by any solution to the normal equations ( ) ( ) ̃𝒯 ̃ ∗ −1 𝒯𝒫𝑣 = 𝒫𝒯∗ 𝜂I + 𝒯 ̃𝒯 ̃ ∗ −1 Y. 𝒫𝒯∗ 𝜂I + 𝒯 In particular, we can use the best approximate solution [( ]† ( ) ) ̃𝒯 ̃ ∗ −1∕2 𝒯𝒫 𝜂I + 𝒯 ̃𝒯 ̃ ∗ −1∕2 Y = 𝒫𝒯† Y 𝜂I + 𝒯 as a result of (3.21) and (3.23). Putting all the pieces together gives us m𝜂 = 𝒫𝒯† Y + 𝒬𝒲 −1∕2 h̃ 𝜂 (𝒫𝒯† Y) ( ∗ ) ( ) ̃ 𝒯 ̃ + 𝜂I −1 𝒯 ̃ ∗ Y − 𝒯𝒫𝒯† Y . = 𝒫𝒯† Y + 𝒬𝒲 −1∕2 𝒯 Arguing as in the proof of Theorem 6.2.2, 𝜂‖h̃ 𝜂 (𝒫𝒯† Y)‖2ℍ ≤ ‖Y − 𝒯𝒫𝒯† Y‖2𝕐 and it follows that lim m𝜂 = 𝒫𝒯† Y. An application of Theorem 6.2.2 establishes that

𝜂→∞

( ∗ ) ̃ 𝒯 ̃ + 𝜂I −1 𝒯 ̃ ∗ = 𝒬𝒲 −1∕2 𝒯 ̃† lim 𝒬𝒲 −1∕2 𝒯

𝜂→0

= 𝒬𝒯† . Thus,

lim m𝜂 = 𝒯† Y − 𝒬𝒯† 𝒯𝒫𝒯† Y. 𝜂→0

However, property (3.18) means that 𝒯† 𝒯 = I − 𝒫Ker(𝒯) with Ker(𝒯) the null space for 𝒯. Hence, 𝒬𝒯† 𝒯𝒫𝒯† Y = (I − 𝒫)(I − 𝒫Ker(𝒯) )𝒫𝒯† Y = (I − 𝒫)𝒫𝒯† Y − (I − 𝒫)𝒫Ker(𝒯) 𝒫𝒯† Y. It is always true that 𝒫(I − 𝒫) = 0. The condition that 𝒯∗ 𝒯 + 𝜂𝒲 be invertible entails that Ker(𝒯) ∩ Ker(𝒲) = 0 which implies that 𝒫Ker(𝒯) 𝒫 = 0 and the theorem is proved. ◽ Example 6.2.4 Let us return to the nonparametric regression problem from model (6.1) where ℍ = 𝕎2 [0, 1], 𝕐 = ℝn and we estimate m by m𝜂 in (6.6). In this instance, ⎡g(t1 )⎤ ⎡⟨g, K(⋅, t1 )⟩ℍ ⎤ ⎥ ⋮ 𝒯g = ⎢ ⋮ ⎥ = ⎢ ⎢g(t )⎥ ⎢⟨g, K(⋅, t )⟩ ⎥ ⎣ n⎦ ⎣ n ℍ⎦

SMOOTHING AND REGULARIZATION

157

with K the rk for 𝕎2 [0, 1] in, e.g., (2.43). Then 𝒯† Y is the minimum norm solution of 𝒯g = Y. This solution must come from 𝕄 = span{K(⋅, t1 ), … , K(⋅, tn )} as adding anything ∑n orthogonal to this space will onlyT increase its norm. As † a result, 𝒯 Y = j=1 cj K(⋅, tj ), where c = (c1 , … , cn ) is the unique solution of 𝒦c = Y with

6.3

𝒦 = {K(ti , tj )}i,j=1.n .

Bias and variance

Let m𝜂 be the estimator for m in model (6.1) that was provided by Theorem 6.2.1. Then, an associated predictor for a future observation Ynew can be obtained from 𝒯m𝜂 . To assess the statistical performance of our penalized least-squares estimator, we might then use its prediction mean squared error 𝔼‖Ynew − 𝒯m𝜂 ‖2𝕐 . However, if Ynew is uncorrelated with the data that was used to construct the estimator, this is tantamount to consideration of Risk(𝜂) = Var(𝜂) + Bias2 (𝜂) with

Bias2 (𝜂) ∶= ‖𝒯(𝔼m𝜂 − m)‖2𝕐

(6.12)

the squared bias of the estimator and Var(𝜂) ∶= 𝔼‖𝒯(m𝜂 − 𝔼m𝜂 )‖2𝕐

(6.13)

its variance. Some general conclusions that can be drawn about these two quantities are provided in the following result. Theorem 6.3.1 The squared bias (6.12) and variance (6.13) for 𝒯m𝜂 satisfy Bias2 (𝜂) ≤ 𝜂⟨m, 𝒲m⟩ℍ and

Var(𝜂) = 𝔼‖𝒯(𝒯∗ 𝒯 + 𝜂𝒲)−1 𝒯∗ 𝜀‖2𝕐 .

Proof: First consider Bias2 (𝜂) and note that 𝔼m𝜂 = (𝒯∗ 𝒯 + 𝜂𝒲)−1 𝒯∗ 𝒯m,

158

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

is the minimizer of ‖𝒯m − 𝒯g‖2𝕐 + 𝜂⟨g, 𝒲g⟩ℍ .

(6.14)

As a result, ‖𝒯(𝔼m𝜂 − m)‖2𝕐 ≤ ‖𝒯(𝔼m𝜂 − m)‖2𝕐 + 𝜂⟨𝔼m𝜂 , 𝒲𝔼m𝜂 ⟩ℍ ≤ ‖𝒯(g − m)‖2𝕐 + 𝜂⟨g, 𝒲g⟩ℍ for all g ∈ ℍ, and taking g = m leads to obtain the stated result. For Var(𝜂), one need only observe that m𝜂 − 𝔼m𝜂 = (𝒯∗ 𝒯 + 𝜂𝒲)−1 𝒯∗ 𝜀. ◽ A useful application of the theorem is the following corollary. Corollary 6.3.2 Let 𝕐 = ℝn with squared norm ‖Y‖2𝕐

=n

−1

n ∑

Yj2

j=1

for Y = (Y1 , … , Yn )T ∈ 𝕐 . If 𝜀 contains uncorrelated random variables with mean 0 and variance 𝜎 2 and 𝒲 is invertible )2 n ( 𝛾i 𝜎2 ∑ 2 𝔼‖𝒯(m𝜂 − 𝔼m𝜂 )‖𝕐 = , n j=1 𝛾i + 𝜂 where the 𝛾i are the eigenvalues of the compact operator 𝒲 −1∕2 𝒯∗ 𝒯𝒲 −1∕2 . Proof: As the range of 𝒯 is finite dimensional, it is necessarily compact. ̃ ∶= 𝒯𝒲 −1∕2 is also compact and will have at most n nonzero singuThus, 𝒯 lar values 𝛾1 ≥ 𝛾2 ≥ · · · ≥ 𝛾n ≥ 0. The result now follows from ̃ 𝒯 ̃ ∗𝒯 ̃ + 𝜂I)−1 𝒯 ̃∗ 𝒯(𝒯∗ 𝒯 + 𝜂𝒲)−1 𝒯∗ = 𝒯( and the fact that 𝔼⟨𝜀, e⟩2𝕐 = 𝜎 2 eT e∕n2 for any n-vector e.

6.4



A computational formula

In this section, we consider a special but common setting where it is possible to give an expression for m𝜂 that lends itself to explicit computation of the estimator. For this purpose, we take 𝕐 = ℝn and suppose that ℍ = ℍ0 ⊕ ℍ1 ,

SMOOTHING AND REGULARIZATION

159

where ℍ0 and ℍ1 are subspaces of ℍ with ℍ0 having dimension q < ∞. Let 𝒲 = 𝒫1 be the projection onto ℍ1 and then consider the optimization of f (g; 𝒯, 𝒫1 , Y, 𝜂) = ‖Y − 𝒯g‖2𝕐 + 𝜂⟨g, 𝒫1 g⟩ℍ

(6.15)

over g ∈ ℍ. Let {𝜙1 , … , 𝜙q } be a basis for ℍ0 and express 𝒯 as 𝒯g = (𝒯1 g, … , 𝒯n g)T ,

(6.16)

for linear functionals 𝒯1 , … , 𝒯n . Denote the representer for 𝒯i by 𝜏i with 𝜉i being the restriction of 𝜏i to ℍ1 : i.e., 𝒯i g = ⟨𝜏i , g⟩ℍ and

𝜉i = 𝒫1 𝜏i .

Theorem 6.4.1 Define the matrices Φ = {𝒯i 𝜙j }i=1∶n,j=1∶q and

Ψ = {𝒯i 𝜉j }i,j=1∶n = {⟨𝜉i , 𝜉j ⟩ℍ }i,j=1∶n

with and

𝒰 = (Φ, Ψ) ( ) 0q×q 0q×n 𝒱= . 0n×q Ψn×n

If 𝒰T 𝒰 + 𝜂𝒱 is invertible, the unique minimizer of (6.15) is m𝜂 =

q ∑

ĉ j 𝜙j +

j=1

where

n ∑

b̂ i 𝜉i ,

i=1

( ) ĉ = (𝒰T 𝒰 + 𝜂𝒱)−1 𝒰T Y. b̂

Proof: Any function in ℍ can be written as g=

q ∑ j=1

cj 𝜙j +

n ∑ j=1

bj 𝜉j + h

160

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

for coefficients cj and bj and some h ∈ span{𝜙1 , … , 𝜙q , 𝜉1 , … , 𝜉n }⊥ . Now, 𝒯i h = ⟨𝜏i , h⟩ℍ = ⟨𝜏i − 𝜉i , h⟩ℍ + ⟨𝜉i , h⟩ℍ = 0 since 𝜏i − 𝜉i ∈ ℍ0 = span{𝜙1 , … , 𝜙q }. Thus, ( q ) ( q ) n n ∑ ∑ ∑ ∑ 𝒯i cj 𝜙j + bj 𝜉j + h = 𝒯i cj 𝜙j + bj 𝜉 j . j=1

j=1

j=1

j=1

In view of (6.15) and (6.16), h = 0 and m𝜂 ∈ span{𝜙1 , … , 𝜙q , 𝜉1 , … , 𝜉n }. now ∑ know that any viable minimizer must have the form g = ∑We q n c 𝜙 + j=1 bj 𝜉j . This entails that j=1 j j 𝒯g = Φc + Ψb and ⟨g, 𝒫1 g⟩ℍ = bT Ψb for c = (c1 , … , cq )T and b = (b1 , … , bn )T . Hence, the penalized least-squares criterion function in (6.15) can be expressed as [ ]‖2 [ ] ‖ [ T ] c ‖ c ‖ T c b 𝒱 (6.17) ‖Y − 𝒰 ‖ +𝜂 b ‖ b ‖ ‖ ‖𝕐 and the conclusion of the theorem follows immediately from Theorem 6.2.1. ◽ In the special case where ℍ0 , ℍ1 and ℍ are RKHSs, the following result is readily established using the reproducing property (cf. Definition 2.7.1). Theorem 6.4.2 Let ℍ0 = ℍ(K0 ), ℍ1 = ℍ(K1 ) and ℍ = ℍ(K) where K0 and K1 are rks and K = K0 + K1 . Assume also that ℍ0 ∩ ℍ1 = {0}. Then ℍ = ℍ0 ⊕ ℍ1 , and 𝜏i (t) = 𝒯i(⋅) K(t, ⋅) and 𝜉i (t) = 𝒯i(⋅) K1 (t, ⋅), where 𝒯i(⋅) means that 𝒯i is applied to what follows as a function of (⋅). Proof: If ℍ0 ∩ ℍ1 = {0} then Theorem 2.7.10 entails that ℍ = ℍ0 ⊕ ℍ1 . By the reproducing property, 𝜏i (t) = ⟨𝜏i , K(t, ⋅)⟩ = 𝒯i(⋅) K(t, ⋅). Similarly, 𝜉i (t) = ⟨𝜉i , K(t, ⋅)⟩ = ⟨𝒫1 𝜏i , K(t, ⋅)⟩ = ⟨𝜏i , 𝒫1 K(t, ⋅)⟩ = ⟨𝜏i , 𝒫1 (K0 (t, ⋅) + K1 (t, ⋅))⟩ = ⟨𝜏i , K1 (t, ⋅)⟩ = 𝒯i(⋅) K1 (t, ⋅),



SMOOTHING AND REGULARIZATION

161

An important application of Theorem 6.4.2 is to the Sobolev space ℍ = 𝕎q [0, 1], introduced in Section 2.8, for which K0 , K1 are defined as in Theorem 2.8.1.

6.5

Regularization parameter selection

We have to this point avoided the question of how to choose the parameter 𝜂 that appears in the penalized least-squares criterion. There are several data adaptive techniques that are typically used for this purpose. We will describe three of them that are appropriate for the scenario that was examined in the previous section: namely, where 𝕐 = ℝn and ℍ = ℍ0 ⊕ ℍ1 for ℍ0 finite dimensional with basis {𝜙1 , … , 𝜙q }. As before the squared norm for Y ∈ 𝕐 is taken to be n ∑ ‖Y‖2𝕐 = n−1 Yj2 . j=1

We begin with ordinary cross validation. Let m[k] 𝜂 be the minimizer of n−1

n ∑

(Yi − 𝒯i g)2 + 𝜂⟨g, 𝒫1 g⟩ℍ

i=1 i≠k

and define CV(𝜂) = n−1

n ∑

2 (Yk − 𝒯k m[k] 𝜂 ) .

k=1

Ordinary cross validation then picks 𝜂 to minimize CV(𝜂). The intuition behind ordinary cross validation is that any good choice for 𝜂 should endow the estimator with good predictive ability for future response values. If new data is not forthcoming, our only resource is to assess predictive ability by reusing the data that we currently have on hand. We then “predict” each response using an estimator that was computed without its contribution to the fit and average the squared prediction errors to give an indication of the estimators’s performance for that particular choice for 𝜂. Brute force evaluation of CV(𝜂) is generally prohibitively time consuming. Fortunately, with some clever algebra, the task can be somewhat simplified using Wahba’s “Leave-One-Out” lemma that we state and prove in the following. For any constant y, let m𝜂 [k, y] be the minimizer of n−1 (y − 𝒯k g)2 + n−1

n ∑ i=1 i≠k

(Yi − 𝒯i g)2 + 𝜂⟨g, 𝒫1 g⟩ℍ

162

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

over g ∈ ℍ. Thus, for example, m𝜂 = m𝜂 [k, Yk ] for each k. Somewhat beyond this simple realization is the following result. [ ] Lemma 6.5.1 m𝜂 k, 𝒯k m[k] = m[k] 𝜂 𝜂 . Proof: For any g ∈ ℍ, the definition of m[k] 𝜂 tells us that −1

n

(𝒯k m[k] 𝜂



2 𝒯k m[k] 𝜂 )

+n

−1

n ∑

[k] [k] 2 (Yi − 𝒯i m[k] 𝜂 ) + 𝜂⟨m𝜂 , 𝒫1 m𝜂 ⟩ℍ

i=1 i≠k

= n−1

n ∑

[k] [k] 2 (Yi − 𝒯i m[k] 𝜂 ) + 𝜂⟨m𝜂 , 𝒫1 m𝜂 ⟩ℍ

i=1 i≠k

≤n

−1

n ∑

(Yi − 𝒯i g)2 + 𝜂⟨g, 𝒫1 g⟩ℍ

i=1 i≠k

≤n

−1

(𝒯k m[k] 𝜂

− 𝒯k g) + n 2

−1

n ∑

(Yi − 𝒯i g)2 + 𝜂⟨g, 𝒫1 g⟩ℍ .

i=1 i≠k



Now define akk (𝜂) =

𝒯k m𝜂 − 𝒯k m[k] 𝜂 Yk − 𝒯k m[k] 𝜂

and write Yk − 𝒯k m[k] 𝜂 =

Yk − 𝒯k m𝜂 1 − akk (𝜂)

.

Thus, CV(𝜂) = n

−1

n ∑ (Yk − 𝒯k m𝜂 )2 k=1

[1 − akk (𝜂)]2

.

Letting Yk′ = 𝒯k m[k] 𝜂 , it follows from Lemma 6.5.1 that akk (𝜂) =

𝒯k m𝜂 [k, Yk ] − 𝒯k m𝜂 [k, Yk′ ] Yk − Yk′

.

However, m𝜂 [k, y] = (𝜙1 , … , 𝜙q )c(y) + (𝜉1 , … , 𝜉n )b(y),

SMOOTHING AND REGULARIZATION

163

where c(y), b(y) are the coefficient vectors from Theorem 6.4.1 with the kth element of the response vector being replaced by y. Thus, 𝒯k m𝜂 [k, y] is linear in y and 𝜕𝒯k m𝜂 [k, y] || akk (𝜂) = | | 𝜕y |y=Yk which happens to be the kth diagonal element of the “hat matrix” 𝒜(𝜂) defined by ⎛𝒯1 m𝜂 ⎞ ⎜ ⋮ ⎟ = 𝒜(𝜂)Y. ⎜𝒯 m ⎟ ⎝ n 𝜂⎠ Specifically,

𝒜(𝜂) = 𝒰(𝒰T 𝒰 + 𝜂𝒱)−1 𝒰T .

The conclusion is that for any specified 𝜂, CV(𝜂) can be evaluated directly from the residuals \(Y - \mathcal{T}m_\eta\) and the diagonal elements of 𝒜(𝜂).

Another popular method for choosing 𝜂 is generalized cross validation. The smoothing parameter value is chosen to minimize its associated criterion function defined by
\[
\mathrm{GCV}(\eta) = \frac{\|Y - \mathcal{T}m_\eta\|_{\mathbb{Y}}^2}{\bigl(1 - n^{-1}\operatorname{trace}\mathcal{A}(\eta)\bigr)^2}
 = n^{-1}\sum_{k=1}^{n}\bigl(Y_k - \mathcal{T}_k m_\eta^{[k]}\bigr)^2 w_{kk}(\eta),
\]
where
\[
w_{kk}(\eta) = \left(\frac{1 - a_{kk}(\eta)}{1 - n^{-1}\operatorname{trace}\mathcal{A}(\eta)}\right)^{2}.
\]
From the last expression, one can conclude that GCV(𝜂) is a variant of CV(𝜂) wherein the individual weights for the squared deleted residuals are replaced by a common value using the average of the diagonal elements of 𝒜(𝜂).

Assume now that the errors \(\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T\) consist of uncorrelated random variables having zero means and common variance \(\sigma^2\). Let \(\mu := \mathbb{E}Y = (\mathcal{T}_1 m, \ldots, \mathcal{T}_n m)^T\) and consider the squared-error risk associated with the estimator \(m_\eta\) as defined by
\[
\mathrm{Risk}(\eta) := \mathbb{E}\|\mathcal{A}(\eta)Y - \mu\|_{\mathbb{Y}}^2
 = n^{-1}\,\mathbb{E}\bigl[(\mathcal{A}(\eta)Y - \mu)^T(\mathcal{A}(\eta)Y - \mu)\bigr].
\]
A value of 𝜂 that minimizes this quantity would be optimal in the sense of providing an estimator of 𝜇 that performed well on the average. As \(\mathbb{E}[\mathcal{A}(\eta)Y - \mu] = (\mathcal{A}(\eta) - I)\mu\) and \(\mathrm{Var}(\mathcal{A}(\eta)Y) = \sigma^2\mathcal{A}(\eta)\mathcal{A}(\eta)^T\), an explicit form for the risk is seen to be
\[
\mathrm{Risk}(\eta) = \frac{1}{n}\,\mu^T(I - \mathcal{A}(\eta))^T(I - \mathcal{A}(\eta))\mu
 + \frac{\sigma^2}{n}\operatorname{trace}\mathcal{A}(\eta)\mathcal{A}(\eta)^T.
\]
Now let \(\mathrm{MSE}(\eta) = \|(I - \mathcal{A}(\eta))Y\|_{\mathbb{Y}}^2\) and observe that
\[
\mathrm{GCV}(\eta) = \frac{\mathrm{MSE}(\eta)}{\bigl(1 - n^{-1}\operatorname{trace}\mathcal{A}(\eta)\bigr)^2}.
\]
As before,
\[
\begin{aligned}
\mathbb{E}\bigl[\mathrm{MSE}(\eta)\bigr] &= \frac{1}{n}\,\mathbb{E}\bigl(Y^T(I - \mathcal{A}(\eta))^T(I - \mathcal{A}(\eta))Y\bigr)\\
 &= \frac{1}{n}\,\mu^T(I - \mathcal{A}(\eta))^T(I - \mathcal{A}(\eta))\mu
 + \frac{\sigma^2}{n}\operatorname{trace}(I - \mathcal{A}(\eta))^T(I - \mathcal{A}(\eta))\\
 &= \mathrm{Risk}(\eta) + \sigma^2 - \frac{2\sigma^2}{n}\operatorname{trace}\mathcal{A}(\eta).
\end{aligned}
\tag{6.18}
\]

From this relation, we can establish the so-called GCV theorem.

Theorem 6.5.2 Let \(\tau_j(\eta) = \operatorname{trace}\mathcal{A}^j(\eta)/n\), j = 1, 2, and define
\[
g(\eta) = \frac{2\tau_1(\eta) + \tau_1^2(\eta)/\tau_2(\eta)}{(1 - \tau_1(\eta))^2}.
\]
Then,
\[
\frac{\bigl|\mathbb{E}\,\mathrm{GCV}(\eta) - \sigma^2 - \mathrm{Risk}(\eta)\bigr|}{\mathrm{Risk}(\eta)} \le g(\eta).
\]

Proof: Using our expression for \(\mathbb{E}[\mathrm{MSE}(\eta)]\), a little algebra reveals that
\[
\mathbb{E}\,\mathrm{GCV}(\eta) - \sigma^2 - \mathrm{Risk}(\eta)
 = \frac{\tau_1(\eta)(2 - \tau_1(\eta))}{(1 - \tau_1(\eta))^2}\,\mathrm{Risk}(\eta)
 - \sigma^2\,\frac{\tau_1^2(\eta)}{(1 - \tau_1(\eta))^2}.
\]
This gives
\[
\begin{aligned}
\frac{\bigl|\mathbb{E}\,\mathrm{GCV}(\eta) - \sigma^2 - \mathrm{Risk}(\eta)\bigr|}{\mathrm{Risk}(\eta)}
 &= \left|\frac{\tau_1(\eta)(2 - \tau_1(\eta))}{(1 - \tau_1(\eta))^2}
 - \sigma^2\frac{\tau_1^2(\eta)}{\mathrm{Risk}(\eta)(1 - \tau_1(\eta))^2}\right|\\
 &\le \frac{\tau_1(\eta)(2 - \tau_1(\eta))}{(1 - \tau_1(\eta))^2}
 + \frac{\tau_1^2(\eta)/\tau_2(\eta)}{(1 - \tau_1(\eta))^2}\\
 &\le g(\eta)
\end{aligned}
\]
because \(\sigma^2/\mathrm{Risk}(\eta) \le 1/\tau_2(\eta)\). ◽

The theorem can be interpreted as saying that when g(𝜂) is small GCV(𝜂) provides a nearly unbiased estimator of \(\sigma^2 + \mathrm{Risk}(\eta)\) in terms of the inherent estimation scale defined by Risk(𝜂). A somewhat more immediate consequence of (6.18) stems from the realization that
\[
\mathrm{MSE}(\eta) - \sigma^2 + \frac{2\sigma^2}{n}\operatorname{trace}\mathcal{A}(\eta)
\]
is an unbiased estimator of Risk(𝜂). This leads us to Mallows’ \(C_L\) criterion function defined as
\[
C_L(\eta) = \mathrm{MSE}(\eta) + \frac{2\sigma^2}{n}\operatorname{trace}\mathcal{A}(\eta).
\]

Again, one minimizes this criterion function to find a value for 𝜂. Both GCV(𝜂) and CL (𝜂) provide estimators of Risk(𝜂) + 𝜎 2 . The latter one is unbiased. However, it will generally require an estimator of 𝜎 2 that may introduce some level of bias back into the selection criterion. The generalized cross-validation criterion, although a priori biased, has the advantage of circumventing the problem of estimating 𝜎 2 .
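To make the foregoing selection criteria concrete, the following sketch evaluates CV(𝜂), GCV(𝜂), and \(C_L(\eta)\) from the hat matrix \(\mathcal{A}(\eta) = \mathcal{U}(\mathcal{U}^T\mathcal{U} + \eta\mathcal{V})^{-1}\mathcal{U}^T\) on simulated data. It is illustrative only: the design matrix U, the penalty matrix V, the response Y, and the use of the true 𝜎² in \(C_L\) are stand-ins assumed for the example rather than objects constructed in the text.

```python
# Illustrative sketch: smoothing parameter selection via CV, GCV, and Mallows' C_L,
# all computed from the hat matrix A(eta) = U (U^T U + eta V)^{-1} U^T.
# U, V, Y, and sigma below are made-up stand-ins, not the text's construction.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10
U = rng.standard_normal((n, p))          # stand-in for the matrix U of Theorem 6.4.1
V = np.eye(p)                            # stand-in for the penalty matrix V
mu = U @ rng.standard_normal(p)          # "true" mean vector (T_1 m, ..., T_n m)^T
sigma = 0.5
Y = mu + sigma * rng.standard_normal(n)

def hat_matrix(eta):
    return U @ np.linalg.solve(U.T @ U + eta * V, U.T)

def criteria(eta):
    A = hat_matrix(eta)
    resid = Y - A @ Y
    a_kk = np.diag(A)
    cv = np.mean((resid / (1.0 - a_kk)) ** 2)          # ordinary CV via Lemma 6.5.1
    mse = np.mean(resid ** 2)                          # MSE(eta) = ||(I - A(eta))Y||_Y^2
    gcv = mse / (1.0 - np.trace(A) / n) ** 2           # GCV(eta)
    cl = mse + 2.0 * sigma ** 2 * np.trace(A) / n      # Mallows' C_L (sigma^2 taken as known)
    return cv, gcv, cl

etas = np.logspace(-4, 2, 25)
scores = np.array([criteria(e) for e in etas])
for name, col in zip(["CV", "GCV", "C_L"], scores.T):
    print(f"{name:>4}: minimizing eta = {etas[np.argmin(col)]:.4g}")
```

In practice one would minimize any of these criteria over a grid of 𝜂 values exactly as above; the point of the sketch is simply that each criterion is a cheap function of the residuals and of the trace and diagonal of 𝒜(𝜂).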

6.6

Splines

The motivation for the method of regularization estimator in Theorem 6.2.1 and its associated estimation criterion derived from the penalized least-squares estimator (6.6) for the nonparametric regression problem. Thus, it seems fitting that we end this chapter by giving a detailed treatment of the root problem posed by (6.6). This entails a foray into the topic of polynomial splines, which is of independent interest for its importance in approximation theory. References on splines include de Boor (1978) and Schumaker (1981). The term spline comes from drafting, where splines were flexible strips used to draw curves connecting points. Here, though, splines mean piecewise polynomials that satisfy a number of continuity constraints.


Definition 6.6.1 A spline of order q ≥ 1, with knots at \(0 < t_1 < \cdots < t_J < 1\), is any function of the form
\[
g(t) = \sum_{i=0}^{q-1}\theta_i t^i + \sum_{j=1}^{J}\delta_j (t - t_j)_+^{q-1}
\tag{6.19}
\]
for constants \(\theta_0, \ldots, \theta_{q-1}, \delta_1, \ldots, \delta_J \in \mathbb{R}\).

Immediate consequences of the definition are that

1. g is a piecewise polynomial of order q on any subinterval \([t_i, t_{i+1})\),
2. for q ≥ 2, g has q − 2 continuous derivatives, and
3. the (q − 1)st derivative of g is a step function with jumps at \(t_1, \ldots, t_J\).

Conversely, it is not difficult to see that any piecewise polynomial on \([t_1, t_J]\) satisfying these three conditions can be written in the form (6.19) and must be a spline.

For a given q, J and a set of knots \(t_1 < \cdots < t_J\), consider the vector space \(S_q(t_1, \ldots, t_J)\) of splines of order q. It is easy to verify that \(1, t, \ldots, t^{q-1}, (t - t_1)_+^{q-1}, \ldots, (t - t_J)_+^{q-1}\) are linearly independent and hence constitute a basis for \(S_q(t_1, \ldots, t_J)\). Thus, \(\dim(S_q(t_1, \ldots, t_J)) = q + J\).

For computational purposes, it is desirable to have available an easily evaluated, nearly-orthogonal basis. For spline functions, this comes in the form of the Anselone-Laurent–Reinsch or B-spline basis that we now develop. For this purpose, it will be useful to take \(t_0 = 0\), \(t_{J+1} = 1\) and then define 2(q − 1) additional (phantom) knots
\[
t_{-(q-1)} < t_{-(q-2)} < \cdots < t_{-1} \le 0, \qquad 1 \le t_{J+2} < t_{J+3} < \cdots < t_{J+q}.
\]
The values for these knots are arbitrary in what follows. Typically, one merely takes \(t_{-(q-1)} = \cdots = t_{-1} = 0\) and \(t_{J+2} = \cdots = t_{J+q} = 1\). B-splines then derive from the concept of divided differences as we spell out in the following definition.

Definition 6.6.2 The q-th order divided difference of a function g at \(\{t_i, \ldots, t_{i+q}\}\) is
\[
[t_i, \ldots, t_{i+q}]g = \frac{[t_{i+1}, \ldots, t_{i+q}]g - [t_i, \ldots, t_{i+q-1}]g}{t_{i+q} - t_i}
\]
with \([t_i]g = g(t_i)\) being used to initiate the recursion.


For example, the first-order divided difference is
\[
[t_i, t_{i+1}]g = \frac{g(t_{i+1}) - g(t_i)}{t_{i+1} - t_i}
\]
and the second-order divided difference is
\[
\begin{aligned}
[t_i, t_{i+1}, t_{i+2}]g &= \frac{[t_{i+1}, t_{i+2}]g - [t_i, t_{i+1}]g}{t_{i+2} - t_i}
 = \frac{\dfrac{g(t_{i+2}) - g(t_{i+1})}{t_{i+2} - t_{i+1}} - \dfrac{g(t_{i+1}) - g(t_i)}{t_{i+1} - t_i}}{t_{i+2} - t_i}\\
 &= (t_{i+2} - t_i)^{-1}\Bigl\{(t_{i+2} - t_{i+1})^{-1}g(t_{i+2})
 - \bigl[(t_{i+2} - t_{i+1})^{-1} + (t_{i+1} - t_i)^{-1}\bigr]g(t_{i+1})
 + (t_{i+1} - t_i)^{-1}g(t_i)\Bigr\}.
\end{aligned}
\]
Note that \([t_i, \ldots, t_{i+q}]g\) depends only on \(t_i, \ldots, t_{i+q}\) and \(g(t_i), \ldots, g(t_{i+q})\).

Theorem 6.6.3 Let \(p_q(x) = \sum_{i=1}^{q} g(t_i)\ell_i(x)\) for
\[
\ell_i(x) = \prod_{\substack{j=1\\ j\neq i}}^{q}\frac{x - t_j}{t_i - t_j}.
\]
Then,

1. \(p_q\) is the unique qth order polynomial that agrees with g at \(t_i\) for i = 1, …, q and
2. for each q = 1, 2, … the coefficient of \(x^q\) in \(p_{q+1}\) is \([t_i, \ldots, t_{i+q}]g\).

Proof: The function \(\ell_i(x)\) is a polynomial of order q and vanishes at all the \(t_j\) except for \(t_i\) where it takes the value 1. So, property 1 holds. To verify the second property, we proceed by induction. The claim is clearly true for q = 1. Next, let \(p_q(x)\) be the polynomial of order q that agrees with g at \(t_i, \ldots, t_{i+q-1}\) and take \(\tilde{p}_q(x)\) to be the polynomial of order q that agrees with g at \(t_{i+1}, \ldots, t_{i+q}\). Then,
\[
p(x) = \frac{x - t_i}{t_{i+q} - t_i}\,\tilde{p}_q(x) + \frac{t_{i+q} - x}{t_{i+q} - t_i}\,p_q(x)
\]
is a polynomial of order q + 1 and agrees with g at \(t_i, \ldots, t_{i+q}\). By the uniqueness of interpolating polynomials, we must have \(p_{q+1}(x) = p(x)\) and the coefficient of \(x^q\) in \(p_{q+1}\) is
\[
\frac{[t_{i+1}, \ldots, t_{i+q}]g - [t_i, \ldots, t_{i+q-1}]g}{t_{i+q} - t_i}.
\]


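As a small computational aside (not part of the text's development), the divided-difference recursion of Definition 6.6.2 is easy to implement directly, and Theorem 6.6.3 and Corollary 6.6.4 below can then be checked numerically: the qth order divided difference over q + 1 distinct knots recovers the leading coefficient of a degree-q polynomial and annihilates polynomials of order q. The knots and test functions in the sketch are arbitrary.

```python
# Illustrative sketch of the divided-difference recursion in Definition 6.6.2.
import numpy as np

def divided_difference(knots, g):
    """[t_i, ..., t_{i+q}]g via the recursion; the knots are assumed distinct."""
    vals = [g(t) for t in knots]
    for level in range(1, len(knots)):
        vals = [(vals[r + 1] - vals[r]) / (knots[r + level] - knots[r])
                for r in range(len(knots) - level)]
    return vals[0]

t = np.array([0.0, 0.2, 0.5, 0.7, 1.0])     # five knots: a fourth-order divided difference
print(divided_difference(t, lambda x: 1 + 2*x - x**2 + 0.5*x**3))  # order-4 polynomial -> 0.0
print(divided_difference(t, lambda x: x**4))                       # leading coefficient -> 1.0
```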


Corollary 6.6.4 If g is a polynomial of order q on \([t_i, t_{i+q}]\), then \([t_i, \ldots, t_{i+q}]g = 0\).

Proof: Let \(p_{q+1}\) be as in Theorem 6.6.3, based on the points \(t_i, \ldots, t_{i+q}\), and note that \(p_{q+1} = g\) because g is itself a polynomial of order q. Then, \([t_i, \ldots, t_{i+q}]g\) is the coefficient of \(x^q\) in \(p_{q+1}\). As \(p_{q+1}\) is equal to g and g has degree less than q, that coefficient must be zero. ◽

Denote by \(\alpha_{rq}^{[i]}\) the coefficient of \(g(t_{i+r})\) in \((t_{i+q} - t_i)[t_i, \ldots, t_{i+q}]g\): i.e.,
\[
(t_{i+q} - t_i)[t_i, \ldots, t_{i+q}]g = \sum_{r=0}^{q}\alpha_{rq}^{[i]}g(t_{i+r}).
\]
For example, if q = 2 then
\[
(t_{i+2} - t_i)[t_i, \ldots, t_{i+2}]g = (t_{i+2} - t_{i+1})^{-1}g(t_{i+2})
 - \bigl[(t_{i+2} - t_{i+1})^{-1} + (t_{i+1} - t_i)^{-1}\bigr]g(t_{i+1})
 + (t_{i+1} - t_i)^{-1}g(t_i),
\]
so that \(\alpha_{02}^{[i]} = (t_{i+1} - t_i)^{-1}\), \(\alpha_{22}^{[i]} = (t_{i+2} - t_{i+1})^{-1}\), \(\alpha_{12}^{[i]} = -(t_{i+2} - t_{i+1})^{-1} - (t_{i+1} - t_i)^{-1}\) and \(\alpha_{r2}^{[i]} = 0\) for r > 2. Now define
\[
N_{iq}(t) = (t_{i+q} - t_i)[t_i, \ldots, t_{i+q}](\cdot - t)_+^{q-1}
 = \sum_{r=0}^{q}\alpha_{rq}^{[i]}(t_{i+r} - t)_+^{q-1}
\tag{6.20}
\]

for i = −(q − 1), …, J. These functions are called B-splines; the “B” stands for “basis”.

Theorem 6.6.5 The collection \(\{N_{iq}(\cdot)\}_{i=-(q-1)}^{J}\) satisfies

1. \(N_{iq}\) is a polynomial of order q on each interval \((t_i, t_{i+1})\),
2. \(N_{iq}(t) = 0\) for \(t \notin [t_i, t_{i+q}]\), and
3. if we take \(N_{i1}(t) := I_{[t_i, t_{i+1}]}(t)\), the \(N_{iq}\) may be evaluated recursively using
\[
N_{iq}(t) = \frac{t - t_i}{t_{i+q-1} - t_i}\,N_{i(q-1)}(t) + \frac{t_{i+q} - t}{t_{i+q} - t_{i+1}}\,N_{(i+1)(q-1)}(t).
\]

Proof: Part 1 is obvious. We will prove only part 2. For this purpose, first note that \(N_{iq}(t) = 0\) for \(t > t_{i+q}\) as all functions in the sum in (6.20) vanish in this case. Now consider \(t < t_i\). Then, from (6.20),
\[
N_{iq}(t) = \sum_{r=0}^{q}\alpha_{rq}^{[i]}(t_{i+r} - t)_+^{q-1}
 = \sum_{r=0}^{q}\alpha_{rq}^{[i]}(t_{i+r} - t)^{q-1}
 = (t_{i+q} - t_i)[t_i, \ldots, t_{i+q}](\cdot - t)^{q-1}.
\]
As the function \((\cdot - t)^{q-1}\) is a polynomial of order q on \([t_i, t_{i+q}]\), \(N_{iq}(t) = 0\) by Corollary 6.6.4. ◽

Property 2 of Theorem 6.6.5 is called the local support property of the B-spline basis; given a set of knots \(t_1, \ldots, t_J\), \(N_{iq}\) and \(N_{jq}\) have disjoint supports if |i − j| > q − 1. Thus, for example, when |i − j| > q − 1
\[
\langle N_{iq}, N_{jq}\rangle_2 = \int_0^1 N_{iq}(t)N_{jq}(t)\,dt = 0,
\]

which illustrates the “nearly orthogonal” quality that we mentioned earlier. This also serves the purpose of showing that the \(N_{iq}\) are linearly independent and therefore live up to their name by providing a basis for \(S_q(t_1, \ldots, t_J)\).

Regularization-type problems for real data lead to the consideration of a special sort of spline that satisfies some additional boundary conditions.

Definition 6.6.6 A spline g of order 2q for q ≥ 1 with knots at \(t_1 < \cdots < t_J\) is said to be a natural spline of order 2q if g is a polynomial of order q outside of \([t_1, t_J]\). The collection of all such splines will be denoted by \(N^{2q}(t_1, \ldots, t_J)\).

It is clear that any \(g \in N^{2q}(t_1, \ldots, t_J)\) must satisfy
\[
g^{(j)}(0) = g^{(j)}(1) = 0, \qquad j = q, \ldots, 2q - 1,
\]
which are known as the natural boundary conditions. These impose 2q linear constraints on the coefficients in (6.19) with the consequence that \(N^{2q}(t_1, \ldots, t_J)\) has dimension 2q + J − 2q = J. The following result stems from this fact.

Theorem 6.6.7 Given \(0 \le t_1 < \cdots < t_J \le 1\) and specified constants \(z_1, \ldots, z_J\), there is a unique natural spline g of order 2q with knots \(t_1, \ldots, t_J\) and \(g(t_j) = z_j\), j = 1, …, J.

The natural spline whose existence is guaranteed by Theorem 6.6.7 is called a natural interpolating spline. It turns out to be optimal in a way that is very important in the context of (6.6) and related problems.
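Before taking up the optimality properties of natural interpolating splines, it may help to see the B-spline recursion of part 3 of Theorem 6.6.5 in action. The sketch below is purely illustrative: it adopts the usual computational convention that a term whose knot-difference denominator vanishes (which happens with the coincident phantom knots mentioned above) is simply dropped, and the knot layout is an arbitrary example.

```python
# Illustrative sketch of the B-spline recursion in part 3 of Theorem 6.6.5.
# Terms with a zero denominator (coincident phantom knots) are dropped by convention.
import numpy as np

def bspline(i, q, t, knots):
    """N_{i,q}(t) for the knot sequence `knots`, indexed so that knots[i] plays t_i."""
    if q == 1:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    val = 0.0
    d1 = knots[i + q - 1] - knots[i]
    if d1 > 0:
        val += (t - knots[i]) / d1 * bspline(i, q - 1, t, knots)
    d2 = knots[i + q] - knots[i + 1]
    if d2 > 0:
        val += (knots[i + q] - t) / d2 * bspline(i + 1, q - 1, t, knots)
    return val

q, interior = 4, [0.25, 0.5, 0.75]                     # cubic B-splines (order 4), J = 3 knots
knots = np.r_[[0.0] * q, interior, [1.0] * q]          # phantom knots stacked at 0 and 1
grid = np.linspace(0.0, 1.0, 5, endpoint=False)
basis = np.array([[bspline(i, q, u, knots) for i in range(len(knots) - q)] for u in grid])
print(basis.shape[1])        # q + J = 7 basis functions, matching dim(S_q(t_1, ..., t_J))
print(basis.sum(axis=1))     # local support: at each point at most q are nonzero, summing to one
```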

Theorem 6.6.8 For knots \(0 < t_1 < \cdots < t_J < 1\) let \(\tilde{g}\) be the natural spline of order 2q that interpolates \(z_1, z_2, \ldots, z_J\). Then, for all functions \(g \in \mathbb{W}^q[0,1]\) that interpolate the \(z_j\),
\[
\int_0^1|\tilde{g}^{(q)}(t)|^2\,dt \le \int_0^1|g^{(q)}(t)|^2\,dt.
\tag{6.21}
\]

Equality holds in (6.21) if \(g = \tilde{g} + h\) for h a qth order polynomial that vanishes at \(t_1, \ldots, t_J\). If J ≥ q, then h = 0 on [0, 1].

Proof: Let g be a function in \(\mathbb{W}^q[0,1]\) that interpolates the \(z_j\) and set \(h = g - \tilde{g}\). Then, \(h(t_j) = 0\) for all j. Using integration by parts and the natural boundary conditions, we find that
\[
\int_0^1 \tilde{g}^{(q)}(t)h^{(q)}(t)\,dt
 = \tilde{g}^{(q)}(t)h^{(q-1)}(t)\Big|_0^1 - \int_0^1 \tilde{g}^{(q+1)}(t)h^{(q-1)}(t)\,dt
 = -\int_0^1 \tilde{g}^{(q+1)}(t)h^{(q-1)}(t)\,dt.
\]

This suggests the way to proceed and by continuing to integrate by parts we eventually arrive at
\[
\int_0^1 \tilde{g}^{(q)}(t)h^{(q)}(t)\,dt
 = (-1)^{q-1}\int_0^1 \tilde{g}^{(2q-1)}(t)h^{(1)}(t)\,dt
 = (-1)^{q-1}\sum_{j=1}^{J-1}\tilde{g}^{(2q-1)}(t_j^{+})\bigl[h(t_{j+1}) - h(t_j)\bigr] = 0,
\]

where we have used the property that \(\tilde{g}^{(2q-1)}\) vanishes on \((0, t_1)\) and \((t_J, 1)\), while being constant on each interval \((t_j, t_{j+1})\) with value \(\tilde{g}^{(2q-1)}(t_j^{+})\). Thus,
\[
\begin{aligned}
\int_0^1|g^{(q)}(t)|^2\,dt
 &= \int_0^1\bigl(\tilde{g}^{(q)}(t) + h^{(q)}(t)\bigr)^2\,dt\\
 &= \int_0^1|\tilde{g}^{(q)}(t)|^2\,dt + 2\int_0^1 \tilde{g}^{(q)}(t)h^{(q)}(t)\,dt + \int_0^1|h^{(q)}(t)|^2\,dt\\
 &= \int_0^1|\tilde{g}^{(q)}(t)|^2\,dt + \int_0^1|h^{(q)}(t)|^2\,dt
 \ \ge\ \int_0^1|\tilde{g}^{(q)}(t)|^2\,dt.
\end{aligned}
\]


From the developments so far, we see that equality will be attained in (6.21) if and only if \(\int_0^1|h^{(q)}(t)|^2\,dt = 0\): namely, \(h^{(q)} = 0\) a.e. Recall that both g and \(\tilde{g}\) are members of \(\mathbb{W}^q[0,1]\) so that \(h = g - \tilde{g} \in \mathbb{W}^q[0,1]\). As
\[
h(t) = \sum_{i=0}^{q-1}\frac{h^{(i)}(0)}{i!}\,t^i + \int_0^1\frac{(t - u)_+^{q-1}}{(q-1)!}\,h^{(q)}(u)\,du,
\]
we conclude that
\[
h(t) = \sum_{i=0}^{q-1}\frac{h^{(i)}(0)}{i!}\,t^i.
\]

Note that any qth order polynomial can have at most q − 1 roots unless it is identically equal to 0. The final statement in the theorem is a consequence of that fact as \(h(t_j) = 0\) for j = 1, …, J. ◽

As promised, the preceding result has a very powerful consequence for regularization-related problems.

Theorem 6.6.9 Suppose that n > q and 𝜂 ∈ (0, ∞). Let L be a mapping from \(\mathbb{R}^n\) to [0, ∞). Then, if
\[
L(g(t_1), \ldots, g(t_n)) + \eta\int_0^1|g^{(q)}(t)|^2\,dt
\tag{6.22}
\]

has a minimizer in \(\mathbb{W}^q[0,1]\), it must be an element of \(N^{2q}(t_1, \ldots, t_n)\).

Proof: Let \(\hat{g}\) be any minimizer of (6.22). If \(\tilde{g}\) is the element of \(N^{2q}(t_1, \ldots, t_n)\) that interpolates \(\hat{g}\) at \(t_1, \ldots, t_n\), then \(L(\hat{g}(t_1), \ldots, \hat{g}(t_n)) = L(\tilde{g}(t_1), \ldots, \tilde{g}(t_n))\). The result is now a consequence of Theorem 6.6.8 and, since n > q, of its final uniqueness statement.



Theorem 6.6.9 has the remarkable implication that for any optimization problem corresponding to a criterion of the form (6.22), we need only look for minimizers over the finite dimensional space of natural splines of order 2q rather than all of 𝕎q [0, 1]. Thus, we have traded an infinite dimensional problem for one that has only finite-dimensional complexity.


Now let us return to the nonparametric regression estimator (6.6) and, for that purpose, take \(\mathbb{Y} = \mathbb{R}^n\) with
\[
\|Y\|_{\mathbb{Y}}^2 = n^{-1}\sum_{j=1}^{n}Y_j^2
\]
for \(Y \in \mathbb{R}^n\). Then, \(\mathbb{H} = \mathbb{W}^q[0,1] = \mathbb{H}_0 \oplus \mathbb{H}_1\) and \(\mathcal{T}g = (g(t_1), \ldots, g(t_n))^T\). If \(\mathcal{P}_1\) is the projection operator for \(\mathbb{H}_1\), our estimation criterion is
\[
\|Y - \mathcal{T}g\|_{\mathbb{Y}}^2 + \eta\langle g, \mathcal{P}_1 g\rangle_{\mathbb{H}}
 = n^{-1}\sum_{j=1}^{n}\bigl(Y_j - g(t_j)\bigr)^2 + \eta\int_0^1|g^{(q)}(t)|^2\,dt.
\]

When q = 2, this gives us the estimator in (6.6). For j = 1, …, n let \(g_j\) be the element of \(N^{2q}(t_1, \ldots, t_n)\) that interpolates \(z_1 = \cdots = z_{j-1} = 0\), \(z_j = 1\), \(z_{j+1} = \cdots = z_n = 0\). Then, any \(g \in N^{2q}(t_1, \ldots, t_n)\) can be written as
\[
g(\cdot) = \sum_{j=1}^{n}g(t_j)g_j(\cdot).
\]
In particular, by combining this fact with Theorem 6.6.9, we see that minimization of our estimation criterion over \(\mathbb{H} = \mathbb{W}^q[0,1]\) is equivalent to finding the vector \(g = (g(t_1), \ldots, g(t_n))^T\) that minimizes
\[
\|Y - g\|_{\mathbb{Y}}^2 + \eta\,g^T\mathcal{W}g
\]
for
\[
\mathcal{W} = \left\{\int_0^1 g_i^{(q)}(t)\,g_j^{(q)}(t)\,dt\right\}_{i,j=1,\ldots,n}.
\]
The optimal coefficient vector is
\[
m_\eta = (m_\eta(t_1), \ldots, m_\eta(t_n))^T = (I + \eta\mathcal{W})^{-1}Y,
\tag{6.23}
\]
giving
\[
m_\eta(\cdot) = \sum_{j=1}^{n}m_\eta(t_j)g_j(\cdot).
\tag{6.24}
\]

Thus, the nonparametric regression estimator determined by (6.23) and (6.24) is a natural spline of order 2q that is called a smoothing spline. In particular, when q = 2, we get the cubic smoothing spline estimator (6.6).
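The linear-algebraic form of (6.23) is easy to experiment with numerically. In the sketch below the exact penalty matrix 𝒲, which the text builds from the interpolating natural-spline basis \(\{g_j\}\), is replaced by a scaled second-difference penalty, a common discrete stand-in for the q = 2 roughness penalty; the data, the scaling of the penalty, and the grid of 𝜂 values are likewise assumptions made only for this illustration.

```python
# Illustrative sketch of (6.23): fitted values m_eta = (I + eta*W)^{-1} Y.
# W here is a discrete second-difference stand-in for the natural-spline penalty matrix,
# not the text's exact construction; the data are simulated.
import numpy as np

rng = np.random.default_rng(1)
n = 100
t = (2 * np.arange(1, n + 1) - 1) / (2 * n)            # the design points used in Theorem 6.6.10
Y = np.sin(2 * np.pi * t) + 0.3 * rng.standard_normal(n)

D = np.diff(np.eye(n), 2, axis=0)                      # second-difference operator, (n-2) x n
W = n**3 * D.T @ D                                     # crude discrete roughness penalty (q = 2)

def fit(eta):
    return np.linalg.solve(np.eye(n) + eta * W, Y)     # fitted values at t_1, ..., t_n

for eta in [1e-6, 1e-3, 1e-1]:
    m_eta = fit(eta)
    print(f"eta = {eta:g}: residual RMS = {np.sqrt(np.mean((Y - m_eta) ** 2)):.3f}")
```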


To conclude this chapter, we provide a result concerning the bias and variance properties of smoothing spline estimators of m in model (6.1).

Theorem 6.6.10 Assume that \(\mathbb{E}[\varepsilon\varepsilon^T] = \sigma^2 I\), \(t_j = (2j-1)/2n\), j = 1, …, n and let \(m = (m(t_1), \ldots, m(t_n))^T\). For any n and 𝜂 > 0,
\[
\|\mathbb{E}m_\eta - m\|_{\mathbb{Y}}^2 \le \frac{\eta}{4}\bigl(m^T m + m^T\mathcal{W}m\bigr)
\]
and
\[
\mathbb{E}\|m_\eta - \mathbb{E}m_\eta\|_{\mathbb{Y}}^2 \le \frac{q\sigma^2}{n} + \frac{C\sigma^2}{n}\,\eta^{-1/(2q)}
\]
for a constant C that depends only on q.

Proof: Let \(\gamma_1 \le \gamma_2 \le \cdots \le \gamma_n\) be the eigenvalues of 𝒲 with \(e_1, \ldots, e_n\) the corresponding eigenvectors. Utreras (1983) shows that in this case \(\gamma_j = 0\), j = 1, …, q and
\[
C_1(j - q)^{2q} \le \gamma_j \le C_2(j - q)^{2q}, \qquad j = q+1, \ldots, n,
\]
where \(0 < C_1 \le C_2 < \infty\) depend only on q. Clearly, then
\[
\begin{aligned}
\|\mathbb{E}(m_\eta - m)\|_{\mathbb{Y}}^2 &= \|(I + \eta\mathcal{W})^{-1}m - m\|_{\mathbb{Y}}^2
 = n^{-1}\sum_{j=1}^{n}\left(\frac{\eta\gamma_j}{1 + \eta\gamma_j}\right)^2\bigl(m^T e_j\bigr)^2\\
 &= n^{-1}\sum_{j=1}^{n}\Bigl[(1 + \gamma_j)\bigl(m^T e_j\bigr)^2\Bigr](1 + \gamma_j)^{-1}\left(\frac{\eta\gamma_j}{1 + \eta\gamma_j}\right)^2.
\end{aligned}
\]
Now note that
\[
\sum_{j=1}^{n}(1 + \gamma_j)\bigl(m^T e_j\bigr)^2 = m^T m + m^T\mathcal{W}m
\]
and
\[
\begin{aligned}
\sup_{q+1\le j\le n}(1 + \gamma_j)^{-1}\left(\frac{\eta\gamma_j}{1 + \eta\gamma_j}\right)^2
 &\le \sup_{u>0}\,(1 + u\eta^{-1})^{-1}\left(\frac{u}{1 + u}\right)^2
 = \eta\sup_{u>0}\,(\eta + u)^{-1}\left(\frac{u}{1 + u}\right)^2\\
 &\le \eta\sup_{u>0}\frac{u}{(1 + u)^2} = \frac{\eta}{4}.
\end{aligned}
\]
For the variance term
\[
\mathbb{E}\|m_\eta - \mathbb{E}m_\eta\|_{\mathbb{Y}}^2
 = n^{-1}\,\mathbb{E}\bigl[\varepsilon^T(I + \eta\mathcal{W})^{-2}\varepsilon\bigr]
 = n^{-1}\,\mathbb{E}\bigl[\operatorname{trace}\bigl((I + \eta\mathcal{W})^{-2}\varepsilon\varepsilon^T\bigr)\bigr]
 = \frac{\sigma^2}{n}\sum_{j=1}^{n}(1 + \eta\gamma_j)^{-2}.
\]
Approximating the sum by an integral gives
\[
\sum_{j=1}^{n}(1 + \eta\gamma_j)^{-2} \le q + \int_0^{\infty}(1 + \theta x^{2q})^{-2}\,dx
 = q + \theta^{-1/(2q)}\int_1^{\infty}u^{-2}(u - 1)^{1/(2q)-1}\,du\big/2q,
\]
where \(\theta = C_1\eta\). ◽
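The bias bound in Theorem 6.6.10 depends on 𝒲 only through its being nonnegative definite, so it can be checked numerically with any such matrix. The following sketch does exactly that; the matrix W, the mean vector m, and the grid of 𝜂 values are arbitrary stand-ins rather than the spline quantities of the theorem.

```python
# Illustrative numerical check of the bias bound in Theorem 6.6.10:
#   ||E m_eta - m||_Y^2 = n^{-1} ||(I + eta W)^{-1} m - m||^2  <=  (eta/4) (m^T m + m^T W m).
# W and m are generic stand-ins, not the natural-spline penalty of the text.
import numpy as np

rng = np.random.default_rng(6)
n = 40
B = rng.standard_normal((n, n))
W = B @ B.T                                  # an arbitrary nonnegative definite matrix
m = rng.standard_normal(n)

for eta in [1e-3, 1e-1, 1.0, 10.0]:
    bias_sq = np.sum((np.linalg.solve(np.eye(n) + eta * W, m) - m) ** 2) / n
    bound = eta / 4.0 * (m @ m + m @ W @ m)
    print(f"eta = {eta:g}: squared bias = {bias_sq:.4f} <= bound = {bound:.4f}")
```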

7

Random elements in a Hilbert space

In this chapter, we give an overview of various foundational issues that arise in fda and related settings. There are two somewhat different perspectives of functional data. The first one is that functional data are realizations of random variables that take values in a Hilbert space; for convenience, we call this the random element perspective. The second view is that functional data are the sample paths of a (typically continuous time) stochastic process with smooth mean and covariance functions; we will refer to this as the stochastic process perspective. The differences between the two perspectives are subtle and worth exploring from a theoretical standpoint. To develop the random element perspective, we need to lay a rigorous foundation for the study of Hilbert space valued “random variables” so that we can develop concepts for the mean and covariance in that abstract environment. The stochastic process perspective uses the covariance function of the stochastic process as the fundamental tool for assessing the variability. The classical theorem by Karhunen and Loève appears in this context. For the purpose of developing the notions of prediction and canonical correlations, we define the concept of closed linear span corresponding to both the random element and stochastic process perspectives and explore the properties of various associated congruence relations. This work will be useful in Chapter 10. Finally, we present some large sample theory for the case where one acquires (many) independent realizations of some random element of interest.


7.1

Probability measures on a Hilbert space

Throughout this chapter, we work within the context of a separable Hilbert space ℍ with associated norm and inner product ‖ ⋅ ‖ and ⟨⋅, ⋅⟩. Topologically, we can view ℍ as a metric space with metric \(d(f, g) = \|f - g\| = \langle f - g, f - g\rangle^{1/2}\). As usual, the Borel 𝜎-field of any topological space 𝕏 is the smallest 𝜎-field containing all the open (relative to the norm-based metric) subsets of 𝕏 and will be denoted by ℬ(𝕏). The smallest 𝜎-field containing a class 𝒞 of sets is also indicated by the standard notation 𝜎(𝒞). Let ℳ be the class of all sets of the form {x ∈ ℍ ∶ ⟨x, f⟩ ∈ B} for f ∈ ℍ and B an open subset of ℝ.

Theorem 7.1.1 The 𝜎-fields 𝜎(ℳ) and ℬ(ℍ) are identical.

Proof: The proof is accomplished directly by establishing that 𝜎(ℳ) ⊂ ℬ(ℍ) and ℬ(ℍ) ⊂ 𝜎(ℳ). With this as our aim, first note that each set in ℳ is either open or closed since x → ⟨x, f⟩ is a continuous mapping from ℍ into ℝ. Hence, ℳ ⊂ ℬ(ℍ) and we must have 𝜎(ℳ) ⊂ ℬ(ℍ). Next, recall that ℬ(ℍ) is the smallest 𝜎-field that contains all the open sets {x ∈ ℍ ∶ ‖x − f‖ > r}, f ∈ ℍ, r ∈ (0, ∞). To show ℬ(ℍ) ⊂ 𝜎(ℳ), it suffices to show that these open sets are in 𝜎(ℳ). Theorem 2.4.14 ensures that there exists a CONS \(\{e_j\}_{j=1}^{\infty}\) in ℍ and, for any r ∈ (0, ∞), we have
\[
\{x \in \mathbb{H} : \|x\| > r\} = \bigcup_{j=1}^{\infty}\Bigl\{x \in \mathbb{H} : \sum_{k=1}^{j}\langle x, e_k\rangle^2 > r^2\Bigr\}.
\tag{7.1}
\]
Initially, we can observe that
\[
\begin{aligned}
\{x \in \mathbb{H} : \langle x, e_1\rangle^2 + \langle x, e_2\rangle^2 > r^2\}
 &= \bigcup_{q\in\mathbb{Q}}\{x \in \mathbb{H} : |\langle x, e_1\rangle|^2 > q > r^2 - |\langle x, e_2\rangle|^2\}\\
 &= \bigcup_{q\in\mathbb{Q}}\bigl(\{x \in \mathbb{H} : |\langle x, e_1\rangle|^2 > q\} \cap \{x \in \mathbb{H} : |\langle x, e_2\rangle|^2 > r^2 - q\}\bigr)
\end{aligned}
\]
for ℚ the set of rational numbers. As both \(\{x \in \mathbb{H} : |\langle x, e_1\rangle|^2 > q\}\) and \(\{x \in \mathbb{H} : |\langle x, e_2\rangle|^2 > r^2 - q\}\) are in ℳ, \(\{x \in \mathbb{H} : \langle x, e_1\rangle^2 + \langle x, e_2\rangle^2 > r^2\} \in \sigma(\mathcal{M})\). Induction using (7.1) allows us to conclude that
\[
\{x \in \mathbb{H} : \|x\| > r\} \in \sigma(\mathcal{M}).
\tag{7.2}
\]


To complete the proof, let f be any element of ℍ and write
\[
\begin{aligned}
\{x \in \mathbb{H} : \|x - f\| > r\}
 &= \{x \in \mathbb{H} : \|x\|^2 > r^2 + 2\langle x, f\rangle - \|f\|^2\}\\
 &= \bigcup_{q\in\mathbb{Q}}\bigl(\{x \in \mathbb{H} : \|x\|^2 > q\} \cap \{x \in \mathbb{H} : 2\langle x, f\rangle < q - r^2 + \|f\|^2\}\bigr).
\end{aligned}
\]
By (7.2) and the above-mentioned argument, the last expression is a set in 𝜎(ℳ) and, hence, ℬ(ℍ) ⊂ 𝜎(ℳ). ◽

The following result provides a Hilbert space specific characterization of measurability that we will need in subsequent sections of this chapter.

Theorem 7.1.2 Let 𝜒 be a mapping from some probability space (Ω, ℱ, ℙ) into (ℍ, ℬ(ℍ)). Then,

1. 𝜒 is measurable if and only if ⟨𝜒, f⟩ is measurable for all f ∈ ℍ and
2. if 𝜒 is measurable, its distribution is uniquely determined by the (marginal) distributions of ⟨𝜒, f⟩ over f ∈ ℍ.

Proof: If 𝜒 is measurable, ⟨𝜒, f⟩ is measurable as x → ⟨x, f⟩ is continuous for all f. Conversely, if ⟨𝜒, f⟩ is measurable for all f, then for any open B ⊂ ℝ, the set {x ∈ ℍ ∶ ⟨x, f⟩ ∈ B} ∈ ℳ satisfies
\[
\chi^{-1}\bigl(\{x \in \mathbb{H} : \langle x, f\rangle \in B\}\bigr) = \{\omega \in \Omega : \langle\chi(\omega), f\rangle \in B\} \in \mathcal{F}.
\]
Thus, Theorem 7.1.1 tells us that 𝜒 is measurable.

Next, let ℳ̌ be the class containing finite intersections of sets in ℳ. Clearly, 𝜎(ℳ̌) = 𝜎(ℳ) = ℬ(ℍ) due to Theorem 7.1.1. We first establish that ℳ̌ is a determining class; namely, if two probability measures \(\mu_1, \mu_2\) agree on ℳ̌ it means that they must also agree on ℬ(ℍ). Define
\[
G = \{A \in \mathcal{B}(\mathbb{H}) : \mu_1(A) = \mu_2(A)\}.
\]
It is easy to verify that G is a 𝜆-system. As ℳ̌ is a 𝜋-system and ℳ̌ ⊂ G, the 𝜋-𝜆 theorem (e.g., Billingsley, 1995) implies that 𝜎(ℳ̌) ⊆ G and, hence, that ℳ̌ is indeed determining. Now, let \(\mathcal{M}_i = \{x \in \mathbb{H} : \langle x, f_i\rangle \in B_i\}\) for some \(f_i \in \mathbb{H}\), open sets \(B_i \subset \mathbb{R}\) and i = 1, …, n. Then,
\[
\mathbb{P}\circ\chi^{-1}\bigl(\cap_{i=1}^{n}\mathcal{M}_i\bigr) = \mathbb{P}\bigl(\langle\chi, f_i\rangle \in B_i,\ 1 \le i \le n\bigr).
\]


The fact that ℳ̌ is determining means that probabilities of the form ℙ(⟨𝜒, fi ⟩ ∈ Bi , 1 ≤ i ≤ n) uniquely determine ℙ∘𝜒 −1 while these quantities are, in turn, determined by the joint distributions of ⟨𝜒, fi ⟩, 1 ≤ i ≤ n. The Cramér–Wold Theorem entails that these joint distributions are determined by the marginal distributions of ⟨𝜒, f ⟩, f ∈ ℍ. ◽ In subsequent discussions, we will refer to measurable functions that take values in a Hilbert space ℍ as being random elements of ℍ. As such, they represent a useful abstraction of the random variable concept that is well suited for our intended direction of study.

7.2

Mean and covariance of a random element of a Hilbert space

Let 𝜒 be a random element of a separable Hilbert space ℍ defined on a probability space (Ω, ℱ, ℙ). Theorem 2.6.5 suggests one notion of a mean that can be used here.

Definition 7.2.1 If 𝔼(‖𝜒‖) < ∞, the mean element of 𝜒, or simply the mean of 𝜒, is defined as the Bochner integral
\[
m = \mathbb{E}(\chi) := \int_{\Omega}\chi\,d\mathbb{P}.
\tag{7.3}
\]

This definition provides the natural extension of the mean of a random variable to the case of random elements. It is, roughly speaking, a “weighted sum” of the possible realizations of 𝜒 that returns another, nonrandom, element of ℍ. An alternative way to define the mean element in general is to observe that the assumption 𝔼(‖𝜒‖) < ∞ implies that the functional f → 𝔼[⟨𝜒, f ⟩] is in 𝔅(ℍ, ℝ) and then define m as its representer (cf. Theorem 3.2.1). An application of Theorem 3.1.7 reveals that either definition gives ⟨m, f ⟩ = 𝔼[⟨𝜒, f ⟩]

(7.4)

for f ∈ ℍ. The linear functional approach corresponds to the Gelfand–Pettis integral, which is more general than the Bochner integral. In view of (7.4), the two integrals are identical in this setting. Assuming the expectation exists, one might take 𝔼‖𝜒 − m‖2 as a variance type measure associated with a random element 𝜒. In this regard, we have the following analog of a familiar variance identity.


Theorem 7.2.2 Assume that \(\mathbb{E}\|\chi\|^2 < \infty\). Then \(\mathbb{E}\|\chi - m\|^2 = \mathbb{E}\|\chi\|^2 - \|m\|^2\), where m is the mean of 𝜒.

Proof: Write \(\mathbb{E}\|\chi - m\|^2 = \mathbb{E}\|\chi\|^2 - 2\mathbb{E}\langle\chi, m\rangle + \|m\|^2\). The result follows from Theorem 3.1.7 that allows us to interchange the expectation and inner product operations. ◽

The next step is to develop a concept of covariance for 𝜒. In that regard, one might recall that the covariance for a random p-vector X is \(\mathbb{E}[(X - \mathbb{E}X)(X - \mathbb{E}X)^T] = \mathbb{E}[(X - \mathbb{E}X)\otimes(X - \mathbb{E}X)]\). This is a p × p matrix and therefore an element of 𝔅(ℝ^p). Our Hilbert space formulation builds on this idea. Specifically, if 𝜒 is a random element of a Hilbert space ℍ, we define the corresponding covariance operator as follows.

Definition 7.2.3 Assume that \(\mathbb{E}\|\chi\|^2 < \infty\). Then, the covariance operator for 𝜒 is the element of \(\mathfrak{B}_{HS}(\mathbb{H})\) given by the Bochner integral
\[
\mathcal{K} = \mathbb{E}\bigl[(\chi - m)\otimes(\chi - m)\bigr] := \int_{\Omega}(\chi - m)\otimes(\chi - m)\,d\mathbb{P}.
\tag{7.5}
\]

If 𝜒(𝜔) is in ℍ, a direct calculation shows that (𝜒(𝜔) − m) ⊗ (𝜒(𝜔) − m) is a Hilbert–Schmidt operator with norm ‖𝜒(𝜔) − m‖2 . As a result, (𝜒 − m) ⊗ (𝜒 − m) is a random element of the Hilbert space 𝔅HS (ℍ). From Theorem 7.2.2, the assumption that 𝔼‖𝜒‖2 < ∞ implies that the expectation of the HS norm of (𝜒 − m) ⊗ (𝜒 − m) is finite. As 𝔅HS (ℍ) is a separable Hilbert space (Section 4.4), it follows from Theorem 2.6.5 that 𝒦 is well defined as the Bochner integral (7.5) that returns a nonrandom element of 𝔅HS (ℍ). The following result is the extension of a familiar covariance identity for finite dimensions. [ ] Theorem 7.2.4 𝔼 (𝜒 − m) ⊗ (𝜒 − m) = 𝔼(𝜒 ⊗ 𝜒) − m ⊗ m. Proof: All three of 𝜒 ⊗ m, m ⊗ 𝜒, and m ⊗ m are HS operators. The latter of the three is constant on Ω while 𝔼‖𝜒 ⊗ m‖HS = 𝔼‖m ⊗ 𝜒‖HS = ‖m‖𝔼‖𝜒‖.


Thus, the result holds if for all f ∈ ℍ 𝔼(m ⊗ 𝜒)f = 𝔼(𝜒 ⊗ m)f = (m ⊗ m) f = ⟨m, f ⟩m, ◽

which is immediate from (4.30).

For simplicity of notation, we will generally assume that m = 0 unless stated otherwise. In this instance,
\[
\mathcal{K} = \mathbb{E}(\chi\otimes\chi) := \int_{\Omega}(\chi\otimes\chi)\,d\mathbb{P}.
\]

Some properties of the covariance operator are listed in our following result.

Theorem 7.2.5 Suppose that m = 0 and 𝔼(‖𝜒‖²) < ∞. For f, g ∈ ℍ

1. \(\langle\mathcal{K}f, g\rangle = \mathbb{E}\bigl[\langle\chi, f\rangle\langle\chi, g\rangle\bigr]\),
2. 𝒦 is a nonnegative-definite, trace-class operator with \(\|\mathcal{K}\|_{TR} = \mathbb{E}\|\chi\|^2\), and
3. \(\mathbb{P}\bigl(\chi \in \overline{\mathrm{Im}(\mathcal{K})}\bigr) = 1\).

Proof: To prove part 1 first note that as \(\mathcal{K} \in \mathfrak{B}_{HS}(\mathbb{H})\) we can use (4.30) to obtain
\[
\mathcal{K}f = \Bigl(\int_{\Omega}\chi\otimes\chi\,d\mathbb{P}\Bigr)f = \int_{\Omega}\chi\langle\chi, f\rangle\,d\mathbb{P}
\]
for any f ∈ ℍ. As a result, we can express 𝒦f as \(\int_{\Omega}\chi\langle\chi, f\rangle\,d\mathbb{P}\). Then, an application of Theorem 3.1.7 to the linear functional \(Tf := \langle f, g\rangle\) gives the result.

From part 1, we can see that 𝒦 is clearly nonnegative definite. To show that 𝒦 is trace class, let \(\{e_j\}\) be any CONS for ℍ and observe that
\[
\|\mathcal{K}\|_{TR} = \sum_{j=1}^{\infty}\langle\mathcal{K}e_j, e_j\rangle = \sum_{j=1}^{\infty}\mathbb{E}\langle\chi, e_j\rangle^2 = \mathbb{E}\|\chi\|^2 < \infty.
\]

Finally, from part 4 of Theorem 3.3.7, (Im(𝒦))⊥ = Ker (𝒦∗ ) = Ker (𝒦) as 𝒦 is self-adjoint. Hence, for any f ∈ (Im(𝒦))⊥ , 𝔼[⟨𝜒, f ⟩2 ] = ⟨𝒦f , f ⟩ = 0.


This implies that, with probability one, 𝜒 is orthogonal to any function in (Im(𝒦))⊥. Thus, with probability one,
\[
\chi \in \bigl(\mathrm{Im}(\mathcal{K})\bigr)^{\perp\perp} = \overline{\mathrm{Im}(\mathcal{K})}
\]
by part 3 of Theorem 2.5.6. ◽

∞ ∑

𝜆j ej ⊗ ej .

(7.6)

j=1

The eigenfunctions {ej }∞ form a CONS for Im(𝒦) while the eigenvalues are j=1 nonnegative with the set {𝜆j }∞ being either finite or consisting of a sequence j=1 that tends to zero. Each nonzero eigenvalue has finite multiplicity. Relation (7.6) extends the spectral decomposition of a variance–covariance matrix for a random vector. Indeed, it gives (1.3) as a special case when 𝜒 is a random vector in ℝp . The combination of Theorems 7.2.5 and 7.2.6 provide the corresponding extension of the principal component decomposition (1.4) that we state formally as follows. Theorem 7.2.7 Suppose that 𝒦 has the eigen decomposition (7.6). Then, with probability one, 𝜒=

∞ ∑

⟨𝜒, ej ⟩ej ,

(7.7)

j=1

where ⟨𝜒, ej ⟩, j ≥ 1, are uncorrelated random variables with mean zero and variances 𝜆j . The decomposition in (7.7) has various optimality properties. For example, it is straightforward to extend Theorem 1.1.1 to this context. Another possibility is provided by our following result. Theorem 7.2.8 If {fj }∞ is any CONS for ℍ, j=1 ‖ ‖2 n n ∑ ∑ ‖ ‖ 2 ‖ ‖ 𝔼‖𝜒 − ⟨𝜒, fj ⟩fj ‖ = 𝔼‖𝜒‖ − ⟨𝒦fj , fj ⟩, ‖ ‖ j=1 j=1 ‖ ‖ which is minimized by taking fj = ej , 1 ≤ j ≤ n.

182

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Proof: We know that ‖ ‖2 ‖∑ ‖2 n n ∑ ‖ ‖ ‖ ‖ 2 ‖ ‖ 𝔼‖ ⟨𝜒, fj ⟩fj ‖ ‖𝜒 − ‖ = 𝔼‖𝜒‖ + 𝔼‖ ⟨𝜒, fj ⟩fj ‖ ‖ ‖ ‖ ‖ j=1 ‖ ‖ ‖ j=1 ‖ ⟨ ⟩ n ∑ −2𝔼 𝜒, ⟨𝜒, fj ⟩fj j=1

with

⟨ ⟩ ‖∑ ‖2 n ∑ ‖ n ‖ ‖ 𝔼‖ ⟨𝜒, fj ⟩fj ‖ ⟨𝜒, fj ⟩fj ‖ = 𝔼 𝜒, ‖ j=1 ‖ j=1 ‖ ‖ n n ∑ ∑ = 𝔼⟨𝜒, fj ⟩2 = ⟨𝒦fj , fj ⟩. j=1

j=1

So, ‖ ‖2 n n ∑ ∑ ‖ ‖ 2 ‖ ‖ 𝔼‖𝜒 − ⟨𝜒, fj ⟩fj ‖ = 𝔼‖𝜒‖ − ⟨𝒦fj , fj ⟩. ‖ ‖ j=1 j=1 ‖ ‖ Part 2 of Theorem 7.2.5 gives ‖𝒦‖TR = 𝔼‖𝜒‖2 and we can now use relation (4.35) along with Theorem 4.2.5 to complete the proof. ◽ As seen from the developments in Section 1.1, principle component decompositions tell us only a part of the story about the relationships that may be present among variables. It can also be of interest to examine the dependence between two different groups of variables using, e.g., canonical correlation analysis (cca). We will present a general treatment of abstract cca in Chapter 10 and therefore postpone further discussion of the topic until that point. However, in preparation for that work, we need to have at our disposal some general Hilbert space version of the finite-dimensional concept of a cross-covariance matrix in (1.8). The remainder of the section is devoted to establishing the existence and properties of this particular operator. Suppose now that we have two random elements 𝜒1 , 𝜒2 defined on the same probability space (Ω, ℱ, ℙ) but taking values in two separable Hilbert spaces ℍ1 and ℍ2 , respectively. Assume that, for i = 1, 2, 𝔼‖𝜒i ‖2i < ∞ and, for simplicity, take 𝔼𝜒1 = 𝔼𝜒2 = 0. Then, the cross-covariance operator for 𝜒1 , 𝜒2 is defined as the Bochner integral 𝒦12 =

∫Ω

(𝜒2 ⊗2 𝜒1 ) dℙ,

(7.8)

RANDOM ELEMENTS IN A HILBERT SPACE

183

where the tensor product 𝜒2 ⊗2 𝜒1 is defined as in Definition 3.4.6 to be the mapping that takes any element f ∈ ℍ2 to ⟨𝜒2 , f ⟩2 𝜒1 . This integral exists for essentially the same reasons as we used to verify the existence of the covariance operator. In particular, Theorem 3.4.7 shows that 𝜒2 (𝜔)⊗2 𝜒1 (𝜔) is an HS operator with HS norm ‖𝜒1 (𝜔)‖1 ‖𝜒2 (𝜔)‖2 for any 𝜔 ∈ Ω so that Theorem 2.6.5 can again be employed to see that the integral is well defined as an element of 𝔅HS (ℍ2 , ℍ1 ). The extension of Theorem 7.2.5 to this situation takes the following form. Theorem 7.2.9 Suppose that 𝔼𝜒i = 0 and 𝔼‖𝜒i ‖2i < ∞ for i = 1, 2. Then, for any g ∈ ℍ1 , f ∈ ℍ2 , [ ] 1. ⟨𝒦12 f , g⟩1 = 𝔼 ⟨𝜒1 , g⟩1 ⟨𝜒2 , f ⟩2 , 1∕2

1∕2

2. |⟨𝒦12 f , g⟩1 | ≤ ⟨𝒦1 g, g⟩1 ⟨𝒦2 f , f ⟩2 , and 3. the adjoint of 𝒦12 is 𝒦21 = ∫Ω (𝜒1 ⊗1 𝜒2 ) dℙ. Proof: The first claim of the theorem is verified by the same technique that was used to prove part 1 of Theorem 7.2.5. The second and third claims follow immediately from the first. ◽ In the multivariate analysis case, canonical correlation revolves around the singular value decomposition of the generalized correlation measure −1∕2 −1∕2 provided by the matrix ℛ12 = 𝒦1 𝒦12 𝒦2 with 𝒦1 , 𝒦2 , and 𝒦12 the covariance and cross-covariance matrices for the two random variables that are the subject of the analysis (cf. Chapter 1). Unfortunately, this approach fails in the case of general Hilbert space valued random elements because the compact nature of covariance operators renders them noninvertible for anything but the finite-dimensional case (Theorem 4.1.4). Nevertheless, it is a remarkable fact that there is still an infinite-dimensional parallel of ℛ12 that is the subject of the following theorem. The development here follows that in Baker (1970); a somewhat more general treatment can be found in Baker (1973). Theorem 7.2.10 There exists an operator ℛ12 ∈ 𝔅(ℍ2 , ℍ1 ) with ‖ℛ12 ‖ ≤ 1 1∕2 1∕2 such that 𝒦12 = 𝒦1 ℛ12 𝒦2 . Proof: Let (𝜆1j , e1j ) be the eigenvalues and eigenfunctions of 𝒦1 and 𝒫n the projection in ℍ1 on span{e11 , … , e1n }. Then, for every f ∈ ℍ2 , −1∕2

‖𝒫n 𝒦1

𝒦12 f ‖21 = ⟨𝒦12 f , 𝒫n 𝒦−1 1 𝒦12 f ⟩1 .

(7.9)

184

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

By part 2 of Theorem 7.2.9, ⟨𝒦12 f , 𝒫n 𝒦−1 1 𝒦12 f ⟩1 1∕2

1∕2

−1 ≤ ⟨𝒦2 f , f ⟩2 ⟨𝒦1 𝒫n 𝒦−1 1 𝒦12 f , 𝒫n 𝒦1 𝒦12 f ⟩1 1∕2

−1∕2

= ⟨𝒦2 f , f ⟩2 ‖𝒫n 𝒦1

(7.10)

𝒦12 f ‖1 .

Combining (7.9) and (7.10) gives −1∕2

‖𝒫n 𝒦1

1∕2

𝒦12 f ‖1 ≤ ‖𝒦2 f ‖2

for any n, and consequently −1∕2

‖𝒦1

1∕2

𝒦12 f ‖1 ≤ ‖𝒦2 f ‖2

1∕2 1∕2 for every f ∈ ℍ2 . So, if f ∈ Im(𝒦2 ) in that f = 𝒦2 f̃ for some f̃ ∈ ℍ2 , we have −1∕2 −1∕2 ‖𝒦1 𝒦12 𝒦2 f ‖1 ≤ ‖f ‖2 ; (7.11) −1∕2

i.e., ℛ12 ∶= 𝒦1

−1∕2

𝒦12 𝒦2

1∕2

is bounded on Im(𝒦2 ) with norm at most 1∕2

one. Now ℛ12 may be extended to Im(K2 ) by application of the extension principle and the extended operator has the same norm. Finally, define 1∕2



ℛ12 f = 0 for f ∈ Im(K2 ) to complete the definition of ℛ12 on the entirety of ℍ1 . ◽ The operator ℛ12 is unique in a certain sense. If there is another operator ℛ 1∕2 1∕2 1∕2 1∕2 such that 𝒦12 = 𝒦1 ℛ𝒦2 with ‖ℛ‖ ≤ 1, then 𝒦1 (ℛ12 − ℛ)𝒦2 f = 0 for all f ∈ ℍ2 . Thus, ℛ and ℛ12 must coincide when viewed as linear map1∕2

1∕2

pings from Im(𝒦1 ) to Im(𝒦2 ). As in the mva setting, the ℛ12 operator provides an assessment of the dependence relationship between 𝜒1 and 𝜒2 . In this regard, Baker (1970) shows that in the Gaussian case (i.e., when (⟨𝜒1 , g⟩1 , ⟨𝜒2 , f ⟩2 ) are bivariate normal for all g ∈ ℍ1 , f ∈ ℍ2 ), the mutual information of 𝜒1 and 𝜒2 is finite if and only if ℛ12 is HS and ‖ℛ12 ‖ < 1. This has the implication that ℛ12 is an element of 𝔅HS (ℍ2 , ℍ1 ) with norm strictly less than one unless 𝜒1 is a deterministic function of 𝜒2 .

7.3

Mean-square continuous processes and the Karhunen–Lòeve Theorem

So far, we have dealt with random elements in an abstract Hilbert space. Such a space may be a purely algebraic construction and have nothing to do with functions. In this section, we examine another important paradigm for fda.

RANDOM ELEMENTS IN A HILBERT SPACE

185

Specifically, our interest here is in dealing with a stochastic process X = {X(t) ∶ t ∈ E} on a probability space (Ω, ℱ, ℙ), where E is a general compact metric space. The index t is referred to as “time” and a continuous time process would correspond to E = [0, 1], for example. In fda, and more generally in statistics, X represents a random function that can be fully or partially observed. The basic measure-theoretic assumption of a stochastic process is that X(t) is a random variable, i.e., ℱ-measurable, for each fixed t. This alone does not imply that X(⋅) is a random element of 𝕃2 (E, ℬ(E), 𝜇), for instance; in the following section, we will give conditions that ensure this. We note in passing that Theorem 7.1.2 has the consequence that random elements in a Hilbert space ℍ may be viewed as stochastic processes with index set E = 𝔅(ℍ). However, we will not pursue that viewpoint as that particular abstraction takes us beyond what is relevant for this book. The mean function of the process X is defined by m(t) = 𝔼[X(t)]

(7.12)

and the covariance function or covariance kernel by K(s, t) = Cov(X(s), X(t))

(7.13)

for s, t ∈ E, provided the expectations exist. Processes with well-defined mean and covariance functions are referred to as second-order processes. As ( n ) n n ∑ ∑ ∑ ai aj K(ti , tj ) = Var aj X(tj ) , i=1 j=1

j=1

we can conclude that Theorem 7.3.1 The function K in (7.13) is nonnegative definite. This result in combination with the Moore–Aronszajn theorem (Theorem 2.7.4) guarantees that the covariance kernel for a stochastic process generates a reproducing kernel Hilbert space for which it is the rk. The implications of this fact will be explored in Section 7.6. We will focus on second-order processes such that [ ]2 lim 𝔼 X(tn ) − X(t) = 0,

n→∞

(7.14)

for any t ∈ E and any sequence {tn } in E converging to t. Processes with this property are said to be mean-square continuous and they are the focal entity that underlies the Karhunen–Lòeve Theorem that will be established in this section.

186

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

The following result characterizes mean-square continuity by the the continuity of the covariance function. Theorem 7.3.2 Let X be a second-order process. Then, X is mean-square continuous if and only if its mean and covariance functions are continuous. Proof: As 𝔼[X(s) − X(t)]2 = K(s, s) + K(t, t) − 2K(s, t) + (m(s) − m(t))2 , the continuity of m and K implies (7.14). To show the other direction, we first see that the continuity of m follows from |m(s) − m(t)| = |𝔼[X(s) − X(t)]| { }1∕2 ≤ 𝔼[X(s) − X(t)]2 and (7.14). Without loss of generality, assume that m(t) ≡ 0 in the following. Write K(s, t) − K(s′ , t′ ) = (K(s, t) − K(s′ , t)) + (K(s′ , t) − K(s′ , t′ )). The Cauchy–Schwarz inequality then gives ( [ ]2 )1∕2 |K(s, t) − K(s′ , t)| ≤ K 1∕2 (t, t) 𝔼 X(s) − X(s′ ) and

( [ ]2 )1∕2 |K(s′ , t) − K(s′ , t′ )| ≤ K 1∕2 (s′ , s′ ) 𝔼 X(t) − X(t′ )

so that mean-square continuity of X is seen to imply continuity of K.



Observe that a by-product of the above-mentioned proof is the fact that if the mean function m(t) is continuous, then the covariance function K(s, t) is continuous at all (s, t) if and only if it is continuous at all “diagonal points” (t, t). As a mean-square continuous process X may not be a random elements in any Hilbert space, Definition 7.2.3 cannot be applied to define the covariance operator. However, the following integral operator on 𝕃2 (E, ℬ(E), 𝜇) is well defined (𝒦f )(t) =

∫E

K(t, s)f (s)d𝜇(s),

(7.15)

where 𝜇 is a finite measure. We refer to 𝒦 as the covariance operator of a mean-square continuous process. Note that the measure 𝜇 is exogenous

RANDOM ELEMENTS IN A HILBERT SPACE

187

to X and is often set to be Lebesgue measure when E is an interval, or the counting measure when E is a finite (discrete) set. By Mercer’s Theorem (Theorem 4.6.5), K(s, t) =

∞ ∑

𝜆j ej (s)ej (t),

(7.16)

j=1

where the sum converges absolutely and uniformly on the support of 𝜇, with (𝜆j , ej ) the eigenvalue and (continuous) eigenfunction pairs of 𝒦. The next step is to define the 𝕃2 stochastic integral of a mean-square continuous process X. For convenience, we will assume in the rest of the section that the mean function is identically 0. For any function f ∈ 𝕃2 (E, ℬ(E), 𝜇), define ∑

m(n)

IX (f ; {Ei , ti ∶ 1 ≤ i ≤ m(n)}) =

i=1

X(ti )

∫Ei

f (u)d𝜇(u),

where Ei ∈ ℬ(E) and ti ∈ E. The total boundedness of E ensures that for each n > 0 there is a partition Ei , 1 ≤ i ≤ m(n), of E such that each Ei is an element of ℬ(E) and has diameter less than 1∕m(n); let ti be an arbitrary point in Ei . Take two such partitions {Ei , ti ∶ 1 ≤ i ≤ m(n)} and {Ej′ , tj′ ∶ 1 ≤ j ≤ m(n′ )}. Then, [ ( )]2 𝔼 IX (f ; {Ei , ti ∶ 1 ≤ i ≤ m(n)}) − IX f ; {Ej′ , tj′ ∶ 1 ≤ j ≤ m(n′ )} ∑∑

m(n) m(n)

=

i1 =1 i2 =1

K(ti1 , ti2 ) f (u)d𝜇(u) f (𝑣)d𝜇(𝑣) ∫Ei ∫Ei 1

2

m(n′ ) m(n′ )

+

∑ ∑

j1 =1

K(tj′ , tj′ ) f (u)d𝜇(𝑣) f (u)d𝜇(𝑣) 1 2 ∫ ′ ∫E′ E j =1 j1

j2

m(n′ )

∑∑

m(n)

−2

2

i=1 j=1

K(ti , tj′ ) f (u)d𝜇(𝑣) f (u)d𝜇(𝑣). ∫Ei ∫E′ j

As K is uniformly continuous, each double sum on the right-hand side can be made arbitrarily close to ∫ ∫E×E K(u, 𝑣)f (u)f (𝑣)d𝜇(u)d𝜇(𝑣) for large enough n and n′ . By the completeness of 𝕃2 (Ω, ℱ, ℙ), we conclude that there is a random variable IX (f ) ∈ 𝕃2 (Ω, ℱ, ℙ) to which IX (f ; {Ei , ti ∶ 1 ≤ i ≤ m(n)}) converges in mean square as n → ∞ and the limit is independent of the choice of {Ei , ti ∶ 1 ≤ i ≤ m(n)}.

188

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Theorem 7.3.3 Let {X(t) ∶ t ∈ E} be a mean-square continuous stochastic process with mean zero. Then, for any f , g ∈ 𝕃2 (E, ℬ(E), 𝜇), 1. 𝔼[IX (f )] = 0, 2. 𝔼[IX (f )X(t)] = ∫E K(u, t)f (u)d𝜇(u) for any t ∈ E, and 3. 𝔼[IX (f )IX (g)] = ∫ ∫E×E K(u, 𝑣)f (u)g(𝑣)d𝜇(u)d𝜇(𝑣). Proof: The general strategy is the same for proving all three assertions. For instance, |𝔼[IX (f )]| = |𝔼[IX (f ) − IX (f ; {Ei , ti ∶ 1 ≤ i ≤ m(n)})]| { }1∕2 ≤ 𝔼[IX (f ) − IX (f ; {Ei , ti ∶ 1 ≤ i ≤ m(n)})]2 , which tends to 0 as n → ∞.



The following corollary is a simple consequence of part 3 of Theorem 7.3.3 and the orthogonality of eigenfunctions. Corollary 7.3.4 Let 𝜆i and ei be the eigenvalues and eigenfunctions of the operator 𝒦 in (7.15) and (7.16). Then, the IX (ej ) are mean zero random variables with 𝔼[IX (ei )IX (ej )] = 𝛿ij 𝜆i . We are now ready to state the Karhunen–Lòeve Theorem . Theorem 7.3.5 Let {X(t) ∶ t ∈ E} be a mean-square continuous stochastic process with mean zero. Then, [ ]2 lim sup 𝔼 X(t) − Xn (t) = 0, (7.17) where Xn (t) ∶=

∑n

n→∞ t∈E

j=1 IX (ej )ej (t).

Proof: Using Theorem 7.3.3 and Corollary 7.3.4, we see that 𝔼[Xn (t) − X(t)]2 = 𝔼[Xn (t)]2 − 2𝔼[Xn (t)X(t)] + 𝔼[X(t)]2 =

n ∑

𝜆j e2j (t)

−2

j=1

n ∑

𝜆j e2j (t) + K(t, t)

j=1

∑ n

= K(t, t) −

𝜆j e2j (t),

j=1

which tends to zero uniformly by Mercer’s Theorem.



RANDOM ELEMENTS IN A HILBERT SPACE

189

The random variables IX (ej ) in the Karhunen–Lòeve Theorem are sometimes referred to as the principle component scores or simply scores. The partial sum Xn (t) is the n-term Karhunen–Lòeve expansion and (7.17) says that the 𝕃2 (Ω, ℱ, ℙ) distance between X(t) and its Karhunen–Lòeve expansion can be made arbitrarily small uniformly in t. Finally, consider two mean-square continuous processes X1 = {X1 (s), s ∈ E1 } and X2 = {X2 (t), t ∈ E2 }, where E1 , E2 are both compact metric spaces. Define the auto-covariance functions Ki (s, t) = Cov(Xi (s), Xi (t)) for s, t ∈ Ei and the cross-covariance function K12 (s, t) = Cov(X1 (s), X2 (t)) for s ∈ E1 , t ∈ E2 . Let ℍi = 𝕃2 (Ei , ℬ(Ei ), 𝜇i ) for some arbitrary finite measure 𝜇i on Ei for i = 1, 2. In this case, the auto-covariance operators of X1 , X2 are defined by (𝒦i f )(t) =

∫Ei

Ki (s, t)f (s)d𝜇i (s)

(7.18)

for f ∈ ℍi and cross-covariance operators by (𝒦12 f )(t) = (𝒦21 g)(t) =

∫E2

K12 (t, s)f (s)d𝜇2 (s),

∫E1

K12 (s, t)g(s)d𝜇1 (s)

(7.19)

for g ∈ ℍ1 , f ∈ ℍ2 . The following result provides the stochastic process analog of Theorem 7.2.9. Theorem 7.3.6 For any g ∈ ℍ1 , f ∈ ℍ2 , 1. ⟨𝒦12 f , g⟩1 = ∫E ∫E K12 (s, t)g(s)f (t)d𝜇1 (s)d𝜇2 (t), 1

2

1∕2

1∕2

2. ⟨𝒦12 f , g⟩1 ≤ ⟨𝒦1 g, g⟩1 ⟨𝒦2 f , f ⟩2 , and 3. the adjoint of 𝒦12 is 𝒦21 where ⟨⋅, ⋅⟩i denotes inner product of ℍi . The proof of Theorem 7.2.10 is predicated on the Cauchy–Schwarz inequality in Theorem 7.2.9. In light of this fact and the parallel between Theorem 7.2.9 and Theorem 7.3.6, we see that the conclusion of Theorem 7.2.10 holds in exactly the same way for the covariance operators defined in (7.18) and (7.19) for mean-square continuous processes.
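To make the Karhunen–Loève expansion of Theorem 7.3.5 and the scores \(I_X(e_j)\) concrete, the following simulation sketch uses Brownian motion on [0, 1], whose eigenpairs for the covariance operator (7.15) with Lebesgue measure are available in closed form (cf. Example 4.6.3). The grid size, number of sample paths, and Riemann-sum approximation of the scores are assumptions made purely for this illustration.

```python
# Illustrative simulation of the Karhunen-Loeve expansion (Theorem 7.3.5) for Brownian
# motion on [0, 1].  For K(s, t) = min(s, t), the eigenpairs of the integral operator
# (7.15) with Lebesgue measure are lambda_j = ((j - 1/2) pi)^{-2} and
# e_j(t) = sqrt(2) sin((j - 1/2) pi t); the scores are approximated by Riemann sums.
import numpy as np

rng = np.random.default_rng(2)
n_grid, n_paths = 500, 2000
t = np.linspace(0.0, 1.0, n_grid + 1)[1:]
dt = 1.0 / n_grid
# Brownian motion paths as cumulative sums of independent N(0, dt) increments.
X = np.cumsum(np.sqrt(dt) * rng.standard_normal((n_paths, n_grid)), axis=1)

def sup_mse(n_terms):
    j = np.arange(1, n_terms + 1)
    e = np.sqrt(2.0) * np.sin(np.outer(t, (j - 0.5) * np.pi))   # eigenfunctions on the grid
    scores = X @ e * dt                                         # I_X(e_j) ~ sum_t X(t) e_j(t) dt
    Xn = scores @ e.T                                           # n-term Karhunen-Loeve expansion
    return np.max(np.mean((X - Xn) ** 2, axis=0))               # sup_t E[X(t) - X_n(t)]^2

for k in [1, 5, 20, 50]:
    print(f"{k:>3} terms: sup-MSE ~ {sup_mse(k):.4f}")          # decreases, as (7.17) predicts
```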

190

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

7.4

Mean-square continuous processes in 𝕃2 (E, 𝓑(E), 𝝁)

We have considered two different perspectives to this point: namely, random elements and second-order process. However, much of the fda research nowadays assumes that one is in a situation that involves the intersection of these two settings. The two dominant examples for this are processes that take values in RKHSs and in 𝕃2 (E, ℬ(E), 𝜇). In this section, we consider the latter space, which is more common but for which the theoretical issues are somewhat harder to handle. The RKHS setting will be discussed in Section 7.5. Let us start with a mean-square continuous process X = {X(t), t ∈ E} defined on some probability space (Ω, ℱ, ℙ). Thus, X(t) is ℱ-measurable for each t ∈ E. The first question is what additional assumption is required that will ensure that X is a random element of ℍ = 𝕃2 (E, ℬ(E), 𝜇). As usual, E is a compact metric space and 𝜇 a finite measure. A convenient technical assumption in this regard is the joint measurability of X(t, 𝜔) in both arguments t and 𝜔: namely, X(t, 𝜔) is measurable with respect to the product 𝜎-field ℬ(E) × ℱ. Joint measurability implies that for each 𝜔, X(⋅, 𝜔) is a measurable function on E which places us in a position where we can consider issues such as whether X(⋅, 𝜔) belongs to ℍ or whether X is a random element in ℍ, etc. Theorem 7.4.1 Suppose that the process X is jointly measurable and X(⋅, 𝜔) ∈ ℍ for each 𝜔. Then, the mapping 𝜔 → X(⋅, 𝜔) is measurable from Ω to ℍ: i.e., X is a random element of ℍ. Proof: By joint measurability, ⟨X(⋅, 𝜔), f ⟩ is measurable for each f ∈ ℍ. A proof of this can be found in the standard treatment of Fubini’s Theorem. The present result then follows from Theorem 7.1.2. ◽ In Theorem 7.4.1, the notation X is used to denote a stochastic process and a Hilbert space random element. It might be useful to emphasize a subtle distinction between the two roles for X. As mentioned earlier, the process X(t) is a concrete object that can potentially be observed. The Hilbert space element X is an abstract representation of X; for instance, for any t, the symbol X(t) has no meaning as elements of ℍ are equivalence classes of functions. For convenience of notation, we will continue to use X to represent both the process and the Hilbert space element. The question now arises what might represent an easily understood condition that would imply joint measurability. The following result provides one possible answer.

RANDOM ELEMENTS IN A HILBERT SPACE

191

Theorem 7.4.2 Assume that for each t, X(t, ⋅) is measurable and that X(⋅, 𝜔) is continuous for each 𝜔 ∈ Ω. Then, X(t, 𝜔) is jointly measurable and hence X is a random element of ℍ. In this case, the distribution of X is uniquely determined by the (finite-dimensional) distributions of (X(t1 , ⋅), … , X(tn , ⋅)) for all t1 , … , tn ∈ E and all n. ∑k Proof: Consider any bivariate function of the form g(t, 𝜔) ∶= i=1 IEi (t)fi (𝜔), where the Ei are disjoint sets in ℬ(E) and the fi (𝜔) are measurable functions. For any Borel set B of the real line, ( ) g−1 (B) = ∪ki=1 Ei × fi−1 (B) , which is in the product 𝜎-field ℬ(E) × ℱ. Thus, g is jointly measurable. Let {Ei , ti ∶ 1 ≤ i ≤ m(n)} be a partition defined as in Section 7.3 and define the jointly measurable function ∑

m(n)

Xn (t, 𝜔) =

IEi (t)X(ti , 𝜔).

i=1

By the uniform continuity of X(t, 𝜔) in t for any fixed 𝜔, we have Xn (t, 𝜔) → X(t, 𝜔) uniformly in t for any fixed 𝜔. Thus, X(t, 𝜔) is jointly measurable. The rest of the proof is straightforward due to Theorem 7.4.1 and the construction of Xn . ◽ Theorem 7.4.2 leads us to the following obvious question of how to verify the sample path continuity of a stochastic process. There is a large literature that considers this problem. One sufficient condition for continuity is the Kolmogorov criterion (see, e.g., Bass 2011), which we now describe. For two processes X and Y with E = [0, 1] as their index set, we say they are modifications of each other if ℙ(X(t) = Y(t)) = 1

(7.20)

for all t ∈ [0, 1]. By combining a countable number of zero-measure sets in Ω, (7.20) implies that ℙ(X(t) = Y(t) for all t ∈ 𝒞 ) = 1

(7.21)

for any countable sets 𝒞 . Note, however, that we cannot conclude from (7.20) alone that (7.21) holds for 𝒞 = [0, 1], a notion referred to as indistinguishability. Processes that are modifications of each other have the same finite-dimensional distributions. Kolmogorov’s criterion says that if there are finite, positive constants 𝛼, 𝛽, C such that 𝔼|X(t1 ) − X(t2 )|𝛼 ≤ C|t1 − t2 |1+𝛽

(7.22)

192

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

for all t1 , t2 ∈ [0, 1], X has a modification Y for which all realizations are continuous functions on [0, 1]. This allows us to verify continuity by calculations using the moments of the process. To illustrate the use of (7.22), consider the case where X corresponds to the Brownian motion process on the interval [0, 1]. This means that all the X(t) are normally distributed with mean zero and 𝔼[X(t1 )X(t2 )] = min(t1 , t2 ).

(7.23)

One can conclude from this that Brownian motion has uncorrelated and, hence, independent increments. Using this property in conjunction with the recurrence, 𝔼[X k (t2 )X r (t1 )] = 𝔼[(X(t2 ) − X(t1 )) X k−1 (t2 )X r (t1 )] + 𝔼[X k−1 (t2 )X r+1 (t1 )] reveals that

𝔼|X(t2 ) − X(t1 )|4 = 3|t2 − t1 |2 .
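A quick Monte Carlo check of this identity is given below; the time points and simulation size are arbitrary choices made only for the illustration.

```python
# Illustrative Monte Carlo check that E|X(t2) - X(t1)|^4 = 3|t2 - t1|^2 for Brownian motion,
# i.e., the moment condition used in Kolmogorov's criterion (7.22) with alpha = 4, beta = 1, C = 3.
import numpy as np

rng = np.random.default_rng(3)
t1, t2, n_sim = 0.3, 0.8, 200_000
incr = np.sqrt(t2 - t1) * rng.standard_normal(n_sim)   # X(t2) - X(t1) ~ N(0, t2 - t1)
print(np.mean(incr ** 4), 3 * (t2 - t1) ** 2)          # the two values should nearly agree
```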

So, Brownian motion satisfies (7.22) with 𝛼 = 4, 𝛽 = 1 and C = 3. The following result shows what happens if a mean-square continuous process is also a random element of ℍ. Theorem 7.4.3 Let X = {X(t) ∶ t ∈ E} be a mean-square continuous process that is jointly measurable. Then 1. the mean function m belongs to ℍ and coincides with mean element of X in ℍ, 2. the covariance operator 𝔼(X ⊗ X) is defined and coincides with the operator 𝒦 in (7.15), 3. for any f ∈ ℍ IX (f ) =

∫E

X(t)f (t)d𝜇(t)

= ⟨X, f ⟩. Proof: By Fubini’s Theorem, ( ) 𝔼 X(t)f (t)d𝜇(t) = m(t)f (t)d𝜇(t), ∫E ∫E

RANDOM ELEMENTS IN A HILBERT SPACE

193

for f ∈ ℍ and therefore m is the mean element of X as a result of (7.4). Similarly, for any f , g ∈ ℍ, Fubini’s Theorem allows us to write ( ) 𝔼 [X(s) − m(s)][X(t) − m(t)]f (s)g(t)d𝜇(s)d𝜇(t) ∫ ∫E×E =

∫ ∫E×E

K(s, t)f (s)g(t)d𝜇(s)d𝜇(t).

The left-hand side of this last expression is 𝔼(⟨X − m, f ⟩⟨X − m, g⟩) while the right-hand side is ⟨𝒦f , g⟩. Thus, part 1 of Theorem 7.2.5 shows that 𝒦 is the covariance operator of X. To show part 3, let {Ei , ti ∶ 1 ≤ i ≤ m(n)} be a partition defined as in Section 7.3. Then, by Fubini’s Theorem, ( )2 𝔼 IX (f ; {Ei , ti ∶ 1 ≤ i ≤ m(n)}) − X(t)f (t)d𝜇(t) ∫E ∑∑

m(n) m(n)

=

K(ti1 , ti2 )

i1 =1 i2 =1

+

∫Ei

f (u)d𝜇(u)

1

∫ ∫E×E

∫Ei

f (𝑣)d𝜇(𝑣)

2

K(u, 𝑣)f (u)f (𝑣)d𝜇(u)d𝜇(𝑣)



m(n)

−2

i=1

∫Ei ∫E

K(ti , 𝑣)f (u)f (𝑣)d𝜇(u)d𝜇(𝑣),

where the right-hand side tends to zero as n → ∞. This verifies part 3 and concludes the proof. ◽ One immediate implication of Theorem 7.4.3 is that when X is mean-square continuous and jointly measurable, the scores IX (ej ) in Theorem 7.3.5 can be represented as IX (ej ) =

∫E

X(t)ej (t)d𝜇(t)

(7.24)

= ⟨X, ej ⟩. Our last result in this section shows that even if the mean-square continuous process X is not a random element of ℍ = 𝕃2 (E, ℬ(E), 𝜇), one can speak about an abstract random element 𝜒 of ℍ that plays the role of X in (7.24).

194

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Theorem 7.4.4 Let X = {X(t) ∶ t ∈ E} be a mean-square continuous stochastic process with mean zero. Then, there is a random element 𝜒 of ℍ whose covariance operator is given by (7.15). Proof: For any choice of {Ei , ti ∶ 1 ≤ i ≤ m(n)}, define the process ∑

m(n)

𝜒 (t, 𝜔; {Ei , ti ∶ 1 ≤ i ≤ m(n)}) =

IEi (t)X(ti , 𝜔).

i=1

Now view this as an element of ℍ0 ∶= 𝕃2 (E × Ω, ℱ(E) × ℬ, 𝜇 × ℙ), where the measurability is checked as in Theorem 7.4.2. Observe that ( )‖2 ‖ ‖𝜒 (t, 𝜔; {Ei , ti ∶ 1 ≤ i ≤ m(n)}) − 𝜒 t, 𝜔; {E′ , t′ ∶ 1 ≤ j ≤ m(n′ )} ‖ ‖ ‖ j j ‖ ‖ℍ0 [ ( )]2 = 𝔼 IX (f ; {Ei , ti ∶ 1 ≤ i ≤ m(n)}) − IX f ; {Ej′ , tj′ ∶ 1 ≤ j ≤ m(n′ )} , with f (t) ≡ 1. By completeness, there exists an element 𝜒 in ℍ0 such that ‖𝜒 (t, 𝜔; {Ei , ti ∶ 1 ≤ i ≤ m(n)}) − 𝜒(t, 𝜔)‖ℍ0 → 0 as n → ∞. Note that the event { 𝜔∈Ω∶

(7.25)

} ∫E

𝜒 (t, 𝜔)d𝜇(t) = ∞ 2

has probability zero. Thus, if necessary we can modify 𝜒 and define a new element in ℍ0 , also denoted by 𝜒 for convenience, such that ∫E 𝜒 2 (t, 𝜔)d𝜇(t) < ∞ for all 𝜔. Then Theorem 7.4.1 can be invoked to conclude that 𝜒(⋅, 𝜔) is a random element of 𝕃2 (E, ℬ(E), 𝜇). Let 𝒦̃ = 𝔼(𝜒 ⊗ 𝜒) be the covariance operator of 𝜒. From part 1 of ̃ , g⟩ = 𝔼(⟨𝜒, f ⟩⟨𝜒, g⟩) for any f , g ∈ 𝕃2 (E, ℬ(E), 𝜇), and Theorem 7.2.5, ⟨𝒦f (7.25) now has the consequence that 𝔼(⟨𝜒, f ⟩⟨𝜒, g⟩) is the limit of 𝔼(IX (f ; {Ei , ti ∶ 1 ≤ i ≤ m(n)}) IX (g; {Ei , ti ∶ 1 ≤ i ≤ m(n)})) ∑∑

m(n) m(n)

=

i1 =1 i2 =1

K(ti1 , ti2 )

∫Ei

1

f (u)d𝜇(u)

∫Ei

g(𝑣)d𝜇(𝑣)

2

as n → ∞. On the other hand, the continuity of K implies that the last expression tends to ⟨𝒦f , g⟩, from which we conclude that 𝒦̃ and 𝒦 are the same operator. ◽

RANDOM ELEMENTS IN A HILBERT SPACE

7.5

195

RKHS valued processes

In Section 7.4, we considered the setting where a mean-square continuous stochastic process {X(t), t ∈ E} is also a random element of ℍ = 𝕃2 (E, ℬ(E), 𝜇). Here we consider what transpires when the Hilbert space ℍ is an RKHS ℍ(R) where the rk R is a continuous function defined on E × E. In contrast to the 𝕃2 (E, ℬ(E), 𝜇) functional space in Section 7.4, as the elements of ℍ(R) are functions, there is no need to distinguish between the process X and the potential Hilbert space element X. As such, the theoretical development at least in the fda context becomes a bit cleaner. For example, measurability issues are more straightforward as the following result shows. As usual, we say that X is a stochastic process if X(t) is a random variable for each fixed t and we say that X is a random element of ℍ(R) if X is a measurable mapping from the probability space into ℍ(R). Theorem 7.5.1 A random element X of ℍ(R) is a stochastic process. Conversely, a stochastic process X taking values in ℍ(R) is a random element of ℍ(R). Proof: Denote the probability space under discussion by (Ω, ℱ, ℙ). First assume that X is a random element of ℍ. By the reproducing property, X(t) = ⟨X, R(⋅, t)⟩. As the composition of two measurable transformations, X(t) is necessarily a random variable. To go the other direction, let X be an ℍ(R) valued ∑n process. For any g ∈ ℍ(R), we may find a sequence of∑the form gn (⋅) = i=1 ai R(⋅, ti ) that converges to g n for which ⟨X(⋅, 𝜔), gn ⟩ = i=1 ai X(ti , 𝜔). Thus, by the continuity of the inner product, we see that ⟨X(⋅, 𝜔), g⟩ is the point-wise limit of a sequence of measurable function in ℍ(R) and must aslo be measurable. The result is now a consequence of Theorem 7.1.2. ◽ With the aid of the previous theorem, we can obtain expressions for the mean and covariance functions of an RKHS valued process. Theorem 7.5.2 Let X be a random element of ℍ(R) with 𝔼‖X‖2 < ∞. Then, X is a mean-square continuous process on E and the mean element m and covariance operator 𝒦 are related to the mean and covariance functions m(t) and K(s, t) by m(t) = 𝔼[X(t)] = ⟨m, R(⋅, t)⟩ (7.26) and K(s, t) = Cov(X(t), X(s)) = ⟨𝒦R(⋅, t), R(⋅, s)⟩.

(7.27)

196

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Furthermore, K ∈ ℍ(R) ⊗ ℍ(R), with ‖K‖ℍ(R)⊗ℍ(R) ≤ 𝔼‖X − m‖2 < ∞.

(7.28)

Proof: That assertion (7.26) follows from (7.4) as X(t) = ⟨X, R(⋅, t)⟩. To prove (7.27), we need only observe that Cov(X(t), X(s)) = 𝔼⟨X, R(⋅, t)⟩⟨X, R(⋅, s)⟩ − 𝔼⟨X, R(⋅, t)⟩𝔼⟨X, R(⋅, s)⟩ and then apply (7.4) and Theorem 7.2.5. The continuity of K is a consequence of the continuity of R. Finally, X(s)X(t) is a bivariate stochastic process with sample paths in the direct product space ℍ(R) ⊗ ℍ(R), which is also an RKHS with rk R(⋅, s)R(⋅, t) by Theorem 2.7.13. As a result, X(s)X(t) is a random element of ℍ(R) ⊗ ℍ(R) and 𝔼‖X(s)X(t)‖ℍ(R)⊗ℍ(R) = 𝔼‖X‖2 < ∞. Thus, the assertions on K follow from the fact that K(s, t) is the mean element of the random element {X(s) − m(s)}{X(t) − m(t)}. ◽ The following result contains parallels of Mercer’s Theorem and the Karhunen and Lòeve Theorem for an RKHS valued process. Theorem 7.5.3 Let X be a random element of ℍ(R) with mean zero and 𝔼‖X‖2 < ∞. If the covariance operator 𝒦has the eigen decomposition 𝒦=

∞ ∑

𝜆j ej ⊗ ej ,

j=1

then K(s, t) =

∞ ∑

𝜆j ej (s)ej (t),

(7.29)

j=1

where the sum converges absolutely and uniformly, and [ ]2 lim sup 𝔼 X(t) − Xn (t) = 0, where Xn (t) ∶=

∑n

n→∞ t∈E

j=1 ⟨X, ej ⟩ej (t).

Proof: By (7.27) the covariance function of X satisfies K(s, t) = ⟨𝒦R(⋅, s), R(⋅, t)⟩ =

∞ ∑

𝜆j ⟨ej , R(⋅, s)⟩⟨ej , R(⋅, t)⟩

j=1

=

∞ ∑ j=1

𝜆j ej (s)ej (t).

(7.30)

RANDOM ELEMENTS IN A HILBERT SPACE

197

As K(t, t) = ⟨𝒦R(⋅, t), R(⋅, t)⟩, we have sup K(t, t) ≤ ‖𝒦‖ sup R(t, t) < ∞. t∈E

t∈E

Thus, uniformly in s, t, ∞ ∑ j=1

𝜆j |ej (s)ej (t)| ≤

(∞ ∑ j=1

𝜆j e2j (s)

∞ ∑

)1∕2 𝜆j e2j (t)

j=1

= (K(s, s)K(t, t))1∕2 < ∞. Finally, as the scores ⟨X, ej ⟩ are uncorrelated (Theorem 7.2.7), (7.30) can be established along similar lines as those in the proof of Theorem 7.3.5. ◽ Note that the components of the expansions in (7.29) and (7.30) are not the same as those in Mercer’s Theorem and the Karhunen and Lòeve Theorem. The latter quantities are based on the integral covariance operator on 𝕃2 (E, ℬ(E), 𝜇). At this point, one may begin to wonder about the conditions that are required for a process to take values in a particular RKHS. To motivate that discussion, let us again consider the Brownian motion process on [0, 1]. As stated in (7.23), this is a zero mean process that has covariance kernel K(s, t) = min(s, t). The eigen decomposition for the corresponding covariance operator 𝒦 in 𝕃2 [0, 1] was the subject of Example 4.6.3. By (2.41), Im(𝒦) contains functions f in 𝕎1 [0, 1] satisfying f (0) = 0. However, Brownian motion is the classic example of a continuous process that has nowhere differentiable sample paths, which means that ℙ(X ∈ Im(𝒦)) = 0. This would, at first, seem to be a contradiction to part 3 of Theorem 7.2.5. However, that is not the case because Im(𝒦) is not closed in 𝕃2 [0, 1] (Theorem 4.3.7) and the completion of Im(𝒦) in 𝕃2 [0, 1] is precisely the set of functions in 𝕃2 [0, 1] that are equal to 0 at 0 (Theorem 2.3.8). By Theorem 7.6.4 below, Im(𝒦) ⊂ ℍ(K). Thus, the following result is an attempt to describe this phenomenon in a general way. Theorem 7.5.4 Suppose that X is a random element of ℍ(K) with 𝔼‖X‖2 < ∞, where K is the covariance function of X. Then, ℍ(K) must be finite dimensional. Proof: The assumption 𝔼‖X‖2ℍ(K) < ∞ implies that the covariance operator 𝒦 of X on ℍ(K) is well defined and therefore must be trace class by Theorem 7.2.5. Then (7.26) of Theorem 7.5.2 has the consequence that 𝒦

198

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

is the identity mapping of ℍ(K). However, by Theorem 4.1.2, the identity of an infinite-dimensional space is not compact and hence the conclusion. ◽ Necessary and sufficient conditions for the sample paths of a Gaussian process to lie in an RKHS were determined by Driscoll (1973). The general case was examined by Lukíc and Beder (2001). They showed that if a process has covariance kernel K, a necessary condition for its sample paths to fall in an RKHS ℍ(R) is that ℍ(K) ⊆ ℍ(R) and the process has a valid covariance operator. This is also sufficient when, e.g., K is continuous.

7.6

The closed span of a process

One of the wonderful aspects of congruence mappings is that they provide us with the potential for translating a problem formulated in one space into an equivalent problem in another space where the mathematics may be more tractable. This turns out to be particularly true in the case of stochastic processes as a result of work begun in Parzen (1961). Here we describe certain aspects of this representation theory that are particularly relevant to the development of canonical correlations of functional data. As in previous sections, one can approach from the viewpoint of a stochastic process or a Hilbert space valued random element. We start with the first perspective. Let {X(t), t ∈ E} be a second-order process with mean zero and a continuous covariance function K(s, t). For the purpose of statistical inference, a space of particular interest is the completion of the set of all random variables of the form n ∑

ai X(ti ), ai ∈ ℝ, ti ∈ E, n = 1, 2, … ,

(7.31)

i=1

in the space 𝕃2 (Ω, ℱ, ℙ). We denote the completed space by 𝕃2 (X), i.e., { 𝕃 (X) = 2

n ∑

} aj X(tj ), tj ∈ E, aj ∈ ℝ, n = 1, 2, …

,

(7.32)

j=1

and call it the closed linear span of X. Recall from Section 2.7 that ℍ(K) denotes the RKHS with rk ∑ K; ℍ(K) is the completion of the space that contains functions of the form ni=1 ai K(⋅, ti ) with ai ∈ ℝ, ti ∈ E, n ∈ ℤ+ and with inner product determined by ⟨K(⋅, s), K(⋅, t)⟩ℍ(K) = K(s, t). The following result was first discussed by Lòeve.

RANDOM ELEMENTS IN A HILBERT SPACE

199

Theorem 7.6.1 𝕃2 (X) is congruent to ℍ(K). Proof: The proof is accomplished by verifying that the correspondence ∞ ∑

ai X(ti ) ←→

∞ ∑

ai K(⋅, ti )

(7.33)

determines a congruence between 𝕃2 (X) and ℍ(K).



i=1

i=1

For f ∈ ℍ(K), denote by Z(f ), the random variable in 𝕃2 (X) determined by the congruence (7.33). As such, Cov(Z(f1 ), Z(f2 )) = ⟨f1 , f2 ⟩ℍ(K)

(7.34)

for f1 , f2 ∈ ℍ(K). We now explore the properties of the congruence beginning with the following insightful example. Example 7.6.2 Suppose that E = {t1 , … , tp } so that X(⋅) can be represented as the random p-vector X = (X(t1 ), … , X(tp ))T . Its covariance kernel is equivalent to the p × p matrix 𝒦 = {K(ti , tj )}i,j=1∶p and the congruent RKHS ℍ(K) in this instance was described in Example 2.7.8. Analogous to that development we will use X(⋅) and X to refer to the stochastic process and vector forms for the X(⋅) process and use f and f (⋅) to represent the vector and function forms of an element f (⋅) of ℍ(K). We now claim that Z(f ) = f T 𝒦† X

(7.35)

with 𝒦† the Moore–Penrose generalized inverse of 𝒦. To verify this, we need only observe that ( ) Var f T 𝒦† X = f T 𝒦† f = ‖f (⋅)‖2ℍ(K) . The previous example provides an important case where Z(f ) can actually be evaluated. As both 𝕃2 (X) and ℍ(K) are built up from finite-dimensional subspaces, it provides us with a prescription for connecting the essential ingredients for the two spaces. The following result leverages that idea and gives a generalization of Example 7.6.2. Theorem 7.6.3 A function f is in ℍ(K) if there exists Y ∈ 𝕃2 (X) such that f (⋅) = 𝔼[YX(⋅)], in which case Y = Z(f ). ∑ Proof: Let Y be the limit of a sequence Yn = ∑ ni=1 ani X(tni ) in 𝕃2 (X). If f ∈ ℍ(K) is the congruent image of Y, then fn (⋅) = ni=1 ani K(⋅, tni ) must converge

200

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

to f as a result of the isometry. For each t ∈ E, fn (t) =

n ∑

ani 𝔼[X(t)X(tni )] = 𝔼[Yn X(t)]

i=1

and continuity of the 𝕃2 (X) inner product ensures that 𝔼[Yn X(t)] → 𝔼[YX(t)] and Y = Z(f ). Conversely, suppose that f is in ℍ(K). Then, f corresponds to Z(f ) in 𝕃2 (X) and by the reproducing property f (t) = ⟨f , K(t, ⋅)⟩ℍ(𝒦) = 𝔼[Z(f )X(t)]. ◽ Although Theorem 7.6.3 is insightful, it falls well short of the type of relation we were able to produce in Example 7.6.2. Such explicit characterizations are of paramount importance because if we are to solve optimization problems deriving from 𝕃2 (X) in ℍ(K), we need to have a formula that tells us how to return to the space of random variables that is actually of interest. Our following approach is an attempt in that regard. Suppose now that E is a compact metric space. Let 𝒦 denote the covariance operator on ℍ = 𝕃2 (E, ℬ(E), 𝜇) defined by (7.15) and (𝜆j , ej ) the eigenvalue–eigenfunction pairs of 𝒦. Denote by 𝔾(𝒦), the Hilbert space of functions in the range of 𝒦1∕2 , equipped with the inner product ⟨𝒦1∕2 f , 𝒦1∕2 g⟩𝔾(𝒦) = ⟨f , g⟩ℍ

(7.36)

for f , g ∈ ℍ. The relationship (7.36) shows that functions in 𝔾(𝒦) can be represented ∑∞ 1∕2 ∑∞ ∑∞ as j=1 𝜆j aj ej with j=1 a2j < ∞, or, equivalently, as j=1 𝜆j aj ej with ∑∞ 2 j=1 𝜆j aj < ∞. We will primarily use the latter representation in the following. To summarize, {∞ } ∞ ∑ ∑ 𝔾(𝒦) = 𝜆j aj ej ∶ 𝜆j a2j < ∞ (7.37) j=1

and



∞ ∑

𝜆j aj ej ,

j=1

∞ ∑

j=1

⟩ 𝜆j bj ej

j=1

=

∞ ∑

𝜆 j aj bj .

(7.38)

j=1

𝔾(𝒦)

Theorem 7.6.4 𝔾(𝒦) = ℍ(K). Proof: Mercer’s Theorem gives us K(⋅, t) =

∞ ∑ j=1

𝜆j ej (t)ej (⋅) =

∞ ∑ j=1

𝜆j aj ej (⋅)

(7.39)

RANDOM ELEMENTS IN A HILBERT SPACE

201

for any fixed t ∈ E, where aj = ej (t). As ∞ ∑

𝜆j a2j =

j=1

∞ ∑

𝜆j e2j (t) = K(t, t) < ∞,

j=1



we conclude that K(⋅, t) ∈ 𝔾(𝒦). Now, for g = (7.39) entail that ⟨g, K(⋅, t)⟩𝔾(𝒦) =

∞ ∑

𝜆j aj bj =

j=1

∞ ∑

j 𝜆j bj ej

∈ 𝔾(𝒦), (7.38) and

𝜆j bj ej (t) = g(t).

j=1

Therefore, K is an rk for 𝔾(𝒦) and our claim has been verified due to the unicity aspect of the Moore–Aronszajn Theorem (Theorem 2.7.4). ◽ 7.6.4∑shows that functions of ℍ(K) can be represented as ∑Theorem ∞ ∞ 2 2 j=1 𝜆j aj ej with j=1 𝜆j aj < ∞ and therefore the congruence between 𝕃 (X) and ℍ(K) can be described as Z(f ) =

∞ ∑

aj IX (ej )

(7.40)

j=1

∑∞ for f = j=1 𝜆j aj ej ∈ ℍ(K), where, as in the Karhunen–Lòeve Theorem (Theorem 7.3.5), IX (ej ) is the score of X that corresponds to ej . To see why this is true we only needs to observe that (∞ ) ∞ ∑ ∑ Var aj IX (ej ) = 𝜆j a2j = ‖f ‖2ℍ(K) . j=1

j=1

Suppose we make the additional assumption that X is a random element of ℍ = 𝕃2 (E, ℬ(E), 𝜇). Then (7.24) shows that Zj = ⟨X, ej ⟩, in which case we can write Z(f ) =

∞ ∑

aj ⟨X, ej ⟩

(7.41)

j=1

∑∞ for f = j=1 𝜆j aj ej . With (7.38) and (7.41), it is tempting to conclude that Z(f ) in this case is just ⟨X, f ⟩ℍ(K) . Unfortunately, in general, that is not true as X cannot be a random element of ℍ(K) except when ℍ(K) is finite dimensional (Theorem 7.5.4). ( )T Example 7.6.5 Take X = X(t1 ), … X(tp ) as in Example 7.6.2 with ℍ = ℝp and p ∑ ⟨X(⋅), g(⋅)⟩ = g(ti )X(ti ) i=1

202

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

( )T ∑q for g = g(t1 ), … , g(tp ) ∈ ℝp . Let 𝒦 = j=1 𝜆j ej eTj for nonzero eigenvalues 𝜆1 , … , 𝜆q with q ≤ p and associated eigenvectors ej = ( )T ∑q ej (t1 ), … , ej (tp ) . Then, X = j=1 ⟨X(⋅), ej (⋅)⟩ej with probability one and if ∑q f (⋅) = j=1 𝜆j aj ej (⋅) Z(f ) =

q ∑

aj ⟨X(⋅), ej (⋅)⟩

j=1

= f T 𝒦† X because 𝒦† =

∑q j=1

𝜆−1 ej eTj . j

So far in this section, we have considered mean-square continuous processes. We next turn to the more abstract setting where 𝜒 is a random element in a Hilbert space ℍ. Assume that 𝔼𝜒 = 0 and 𝔼‖𝜒‖2 < ∞. As explained briefly in Section 7.3, we can view 𝜒 as a stochastic process indexed by f ∈ ℍ, i.e., 𝜒 = {⟨𝜒, f ⟩ ∶ f ∈ ℍ}. The covariance function of this process is K(f , g) = Cov(⟨𝜒, f ⟩, ⟨𝜒, g⟩) and the closed linear span is 𝕃2 (𝜒) = {⟨𝜒, f ⟩, f ∈ ℍ}, where the completion is taken in 𝕃2 (Ω, ℱ, ℙ). As in Theorem 7.6.1, we can define the RKHS ℍ(K) and the correspondence ⟨𝜒, f ⟩ ←→ K(⋅, f ) that determines the congruence between 𝕃2 (𝜒) and ℍ(K). Unfortunately, it is hard to visualize functions whose argument is a Hilbert space element. Thus, it might be useful to consider a different parameterization of this congruence. Let 𝒦 be the covariance operator of the random element 𝜒 as defined by Definition 7.2.3 and (𝜆j , ej ) its eigenvalue–eigenfunction pairs. In terms of (𝜆j , ej ), we can express 𝕃2 (𝜒) as {∞ } ∞ ∑ ∑ 𝕃2 (𝜒) = aj ⟨𝜒, ej ⟩ ∶ 𝜆j a2j < ∞ . (7.42) j=1

j=1

Define the Hilbert space { 𝔾(𝒦) =

∞ ∑ j=1

𝜆j aj ej ∶

∞ ∑ j=1

} 𝜆j a2j < ∞

(7.43)

RANDOM ELEMENTS IN A HILBERT SPACE

with inner product ⟨

∞ ∑ j=1

𝜆j aj ej ,

∞ ∑

⟩ 𝜆j aj ej

=

j=1

𝔾(𝒦)

∞ ∑

𝜆j aj bj .

203

(7.44)

j=1

Then, 𝔾(𝒦) can take the role of ℍ(K) and the the congruence between 𝕃2 (𝜒) and ℍ(𝒦) is Z(f ) =

∞ ∑

aj ⟨𝜒, ej ⟩

(7.45)

j=1

∑∞ for f = j=1 𝜆j aj ej ∈ 𝔾(𝒦). Observe that (7.43), (7.44), and (7.45) for the random element setting completely parallel (7.37), (7.38), and (7.40) for the process setting. The only difference is how the covariance operators are defined in the two settings and that 𝔾(𝒦) in the former is an RKHS. In the special case of Theorem 7.4.3 where a mean-square continuous process is also a random element in 𝕃2 (E, ℬ(E), 𝜇), the space 𝔾(𝒦) defined from the process and random element settings are identical. This will facilitate our study of canonical correlations in Chapter 10 in a unified manner for the two settings.

7.7

Large sample theory

In this section, we introduce some basic large-sample results that are useful for the inference problems that arise in fda. More generally, probability theory in Banach and Hilbert spaces is an important branch of modern probability. The interested reader is referred to Ledoux and Talagrand (2013) for a complete treatment of this topic. Suppose that we have independent realization X1 , … , Xn of some real valued random variable ∑ X with 𝔼X = m. Then, we know that under various conditions X n = n−1 ni=1 Xi converges almost surely to m and, when suitably normalized, has an approximate normal distribution. Results of this nature are referred to as the strong law of large numbers and the central limit theorem, respectively. The goal of this section is to obtain analogs of these results with the Xi replaced by random elements of a Hilbert space. Let 𝜒1 , 𝜒2 , … be random elements in ℍ and define Sn =

n ∑

𝜒i .

i=1

Theorem 7.7.1 If 𝜒1 , … , 𝜒n are pairwise independent with mean 0, then 𝔼‖Sn ‖2 =

n ∑ i=1

𝔼‖𝜒i ‖2 .

204

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Proof: Let {ek , k ≥ 1} be a CONS for ℍ. As the 𝜒i ’s are pairwise independent, for i ≠ j, we have 𝔼⟨𝜒i , ek ⟩⟨𝜒j , ek ⟩ = 𝔼⟨𝜒i , ek ⟩𝔼⟨𝜒j , ek ⟩ = ⟨𝔼𝜒i , ek ⟩⟨𝔼𝜒j , ek ⟩ = 0. It then follows that 𝔼‖Sn ‖2 =

∞ ∑

𝔼⟨Sn , ek ⟩2 =

k=1

𝔼⟨𝜒i , ek ⟩2

k=1 i=1

∑∑ n

=

∞ n ∑ ∑



𝔼⟨𝜒i , ek ⟩ = 2

i=1 k=1

n ∑

𝔼‖𝜒i ‖2 . ◽

i=1

Theorem 7.7.2 Let 𝜒1 , 𝜒2 , … be pairwise independent and identically distributed with 𝔼‖𝜒1 ‖ < ∞. Then, lim n−1 Sn = 𝔼(𝜒1 ) a.s. n→∞

Proof: We follow the line of proof in Etemadi (1983). First define 𝜒i′ = 𝜒i I{‖𝜒i ‖≤i} and Sn′

=

n ∑

𝜒i′ .

i=1

Then, take kn = [𝛼 ] for 𝛼 > 1, where [a] denotes the integral part of a and let {ek }∞ be a CONS for ℍ. Applying Theorem 7.2.2, Theorem 7.7.1, and k=1 Markov’s inequality, we see that for any 𝜖 > 0 ( ′ ) kn ∞ ∞ ‖Sk − 𝔼Sk′ ‖ ∑ ∑ ∑ n n −2 −2 ℙ > 𝜖 ≤𝜖 kn 𝔼‖𝜒i′ ‖2 k n n=1 n=1 i=1 n

=𝜖

−2

∞ ∑

𝔼‖𝜒i′ ‖2

i=1

≤ 4(1 − 𝛼 ) 𝜖



kn−2

{n∶kn ≥i}

−2 −1 −2

∞ ∑

𝔼‖𝜒i′ ‖2 i−2 ,

i=1

where the last inequality follows from a simple calculation in Durrett (1995, page 57). However,

RANDOM ELEMENTS IN A HILBERT SPACE ∞ ∑

𝔼‖𝜒i′ ‖2 i−2

=

∞ ∑

i=1

−2

i

205

i−1 ∑ [ ] 𝔼 ‖𝜒1 ‖2 I{j≤‖𝜒1 ‖≤j+1}

i=1

j=0

∞ ∑ [ ]∑ 2 = 𝔼 ‖𝜒1 ‖ I{j≤‖𝜒1 ‖≤j+1} i−2 ∞

j=0

i=j+1

∑ ∞

≤C

[ ] (j + 1)−1 𝔼 ‖𝜒1 ‖2 I{j≤‖𝜒1 ‖≤j+1}

j=0

for some C < ∞. Noticing that the last sum is bounded by 𝔼‖𝜒1 ‖, we conclude from the Borel–Cantelli Lemma that lim

Sk′ − 𝔼Sk′

n→∞

n

n

kn

=0

a.s.

By Lebesgue’s dominated convergence theorem, lim ‖𝔼𝜒n′ − 𝔼𝜒1 ‖ ≤ lim 𝔼‖𝜒1 ‖I{‖𝜒1 ‖>n} = 0.

n→∞

n→∞

Hence, ‖ 𝔼S′ ‖ kn ‖ kn ‖ 1 ∑ ‖ ‖ lim − 𝔼𝜒1 ‖ ≤ lim ‖𝔼𝜒i′ − 𝔼𝜒1 ‖ = 0 n→∞ ‖ ‖ kn ‖ n→∞ kn i=1 ‖ ‖ and we have lim

n→∞

Sk′

n

kn

= 𝔼𝜒1

a.s.

A standard argument using the Borel–Cantelli Lemma now shows that, with probability one, Xn′ = Xn eventually (i.e., Xn′ and Xn are tail equivalent). Thus, lim

n→∞

Skn kn

= 𝔼𝜒1

a.s.

(7.46)

The remainder of the proof deviates slightly from the Etemadi arguments we have used to this point. We now observe that for any n there exists a positive integer m(n) such that km(n)−1 = [𝛼 m(n)−1 ] < n ≤ [𝛼 m(n) ] = km(n) .

206

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

It follows that ‖S ‖ ‖ ‖ ‖ n Skm(n) ‖ ‖ Skm(n) Skm(n) Sn − Skm(n) ‖ − + ‖ − ‖=‖ ‖ ‖n ‖ ‖ km(n) ‖ km(n) n ‖ ‖ ‖ n ‖ ( ) ‖S ‖ km(n) km(n) ‖ km(n) ‖ 1 ∑ ≤ −1 ‖ ‖𝜒i ‖ ‖+ ‖ km(n) ‖ n n i=n+1 ‖ ‖ m(n) ‖ Sk ‖ 1 k∑ ‖ m(n) ‖ ≤ (𝛼 − 1) ‖ ‖𝜒i ‖. ‖+ ‖ km(n) ‖ n i=n+1 ‖ ‖

From (7.46), ‖ Sk ‖ ‖ m(n) ‖ lim ‖ ‖ = ‖𝔼𝜒1 ‖ ≤ 𝔼‖𝜒1 ‖ n→∞ ‖ km(n) ‖ ‖ ‖

a.s.

In addition, Etemadi’s strong law of large numbers for real-valued random ∑n −1 ∑km(n) variables ensures that km(n) ‖𝜒i ‖ and n−1 i=1 ‖𝜒i ‖ tend to 𝔼‖𝜒1 ‖ with i=1 probability one. Therefore, km(n) 1 ∑ lim sup ‖𝜒i ‖ n→∞ n i=n+1 km(n) n ⎛k ⎞ m(n) 1 ∑ 1∑ ⎜ = lim sup ‖𝜒i ‖ − ‖𝜒 ‖⎟ n i=1 i ⎟ n→∞ ⎜ n km(n) i=1 ⎝ ⎠

≤ (𝛼 − 1)𝔼‖𝜒1 ‖

a.s.

This gives ‖S Skm(n) ‖ ‖ ‖ lim sup ‖ n − ‖ ≤ 2(𝛼 − 1)𝔼‖𝜒1 ‖ ‖ n k n→∞ ‖ m(n) ‖ ‖ and the result follows as we let 𝛼 ↓ 1.



The remaining concept to consider is that of convergence in distribution or law. To explore that topic, we need a suitable notion of weak convergence of probability measures that we now develop. Let ℙ, ℙn , n ≥ 1 be probability measures on (ℍ, ℬ(ℍ)). We say that Pn con𝑤 verges weakly to P, denoted by Pn −−→ P, if ∫ℍ

f (x)dPn (x) →

∫ℍ

f (x)dP(x)

RANDOM ELEMENTS IN A HILBERT SPACE

207

for any bounded and continuous functions f on ℍ; for random elements 𝜒 and w 𝜒n , n ≥ 1, we say that 𝜒n converges in distribution to 𝜒 if ℙ∘𝜒n−1 −−→ ℙ∘𝜒 −1 d

and denote this relationship by 𝜒n −−→ 𝜒. A general treatment of weak convergence can be found in Billingsley (1999). Here, we introduce an approach that will be especially effective for our applications. The idea was adapted from de Acosta (1970) and Mas (2006). First let us recall the definition of tightness. Definition 7.7.3 An arbitrary set of probability measures {𝜇𝛼 }𝛼∈I on (ℍ, ℬ(ℍ)) is tight if for any 𝜖 > 0 there exists a compact set W such that inf 𝜇𝛼 (W) ≥ 1 − 𝜖. 𝛼∈I

For any S ⊂ ℍ and any 𝜖 > 0, let S𝜖 = {x ∈ ℍ ∶ inf{‖x − z‖ ∶ z ∈ S} ≤ 𝜖}.

(7.47)

Theorem 7.7.4 Let {𝜇𝛼 }𝛼∈I be a family of probability measures on (ℍ, ℬ(ℍ)). Assume that for each 𝜖, 𝛿 > 0, there exists a finite subset {y1 , … , yk } ⊂ ℍ such that 1. inf 𝜇𝛼 (S𝜖 ) ≥ 1 − 𝛿, where S ∶= span{y1 , … , yk } and 𝛼∈I

2. inf 𝜇𝛼 ({x ∈ ℍ ∶ |⟨x, yj ⟩| ≤ r, j = 1, … , k}) ≥ 1 − 𝛿 for some r > 0. 𝛼∈I

Then, {𝜇𝛼 }𝛼∈I is tight. Proof: Let 𝜖 = 𝛿 and pick {y1 , … , yk } ⊂ ℍ and r to satisfy the two conditions of the theorem. Define A𝜖 = S𝜖 ∩ {x ∈ ℍ ∶ |⟨x, yj ⟩| ≤ r, j = 1, … , k}. We first show that A𝜖 ⊂ W𝜖𝜖 for some compact set W𝜖 , where the 𝜖 subscript has been used to emphasize that the sets depend on 𝜖. For x, z ∈ ℍ write x = x1 + x2 and z = z1 + z2 with x1 and z1 the projections of x and z onto S. Then, ⟨x, z⟩ = ⟨x1 , z1 ⟩ + ⟨x2 , z2 ⟩.

(7.48)

If x ∈ A𝜖 , we know from the definition of S𝜖 that ‖x2 ‖ ≤ 𝜖 and, hence, |⟨x2 , z2 ⟩| ≤ 𝜖‖z2 ‖ ≤ 𝜖‖z‖.

(7.49)

208

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Without loss of generality, assume that y1 , … , yk are orthonormal. Then, |∑ | | k | | |⟨x1 , z1 ⟩| = | ⟨x, yj ⟩⟨z, yj ⟩|| ≤ kr‖z‖ | j=1 | | |

(7.50)

as each |⟨x, yj ⟩| ≤ r by the choice of A𝜖 . From (7.48)–(7.50) ‖x‖ = sup{|⟨x, z⟩| ∶ ‖z‖ ≤ 1} ≤ 𝜖 + kr =∶ 𝜈, which shows that A𝜖 ⊂ B𝜈 , where B𝜈 = {x ∈ ℍ ∶ ‖x‖ ≤ 𝜈}. Thus, A𝜖 ⊂ B𝜈 ∩ S 𝜖 .

(7.51)

Now, if x ∈ B𝜈 ∩ S𝜖 , this means that ‖x‖ ≤ 𝜈 and there exists a yj ∈ S such that ‖x − yj ‖ ≤ 𝜖. So, from the triangle inequality, ‖yj ‖ ≤ ‖x‖ + ‖x − yj ‖ ≤ 𝜈 + 𝜖, which implies that inf

z∈S∩B𝜈+𝜖

Thus, we conclude

‖x − z‖ ≤ ‖x − yj ‖ ≤ 𝜖.

B𝜈 ∩ S𝜖 ⊂ (S ∩ B𝜈+𝜖 )𝜖 .

(7.52)

In combination with (7.51), this produces A𝜖 ⊂ (S ∩ B𝜈+𝜖 )𝜖 . For notational convenience, we will use W𝜖 = S ∩ B𝜈+𝜖 in what follows. Note that as S is finite dimensional, W𝜖 is compact. Conditions 1 and 2 ensure that inf 𝜇𝛼 (A𝜖 ) ≥ 1 − 2𝜖. Consequently, we see 𝛼∈I that inf 𝜇𝛼 (W𝜖𝜖 ) ≥ 1 − 2𝜖; more generally, for each 𝜖 > 0, j ≥ 1, 𝛼∈I

( ) 𝜖∕2j inf 𝜇𝛼 W𝜖∕2j ≥ 1 − 2𝜖∕2j 𝛼∈

𝜖∕2j

and the set W ∶= ∩∞ W satisfies inf 𝜇𝛼 (W) ≥ 1 − 𝜖. j=2 𝜖∕2j 𝛼∈I To conclude the proof, it remains to verify that W is compact. Now, W is obviously closed. Thus, we need only show it is totally bounded to be able to

RANDOM ELEMENTS IN A HILBERT SPACE

209

apply the Heine–Borel theorem. Specifically, we need to show that for any 𝜈 > 0 there exists a finite set F ⊂ W such that W ⊂ F 𝜈 . For this purpose, write 𝜖∕2j

𝜖∕2j

∞ W = ∩∞ j=1 (W𝜖∕2j ∩ W) = ∩j=1 {(W𝜖∕2j ∩ W) ∪ (W𝜖∕2j ∩ W − W𝜖∕2j ∩ W)}.

Note that, for each j, W𝜖∕2j ∩ W is compact as it is a closed subset of a compact set. For any 𝜈 > 0, pick a large enough j such that 𝜖∕2j < 𝜈∕2. By the total boundedness of W𝜖∕2j ∩ W, there exists a finite set F ⊂ W𝜖∕2j ∩ W such that W𝜖∕2j ∩ W ⊂ F 𝜈∕2 . So, 𝜖∕2j

(W𝜖∕2j ∩ W) ∪ (W𝜖∕2j ∩ W − W𝜖∕2j ∩ W) ⊂ F𝜈 , which implies that W ⊂ F 𝜈 and completes the proof.



Condition 1 of Theorem 7.7.4 is sometimes referred to as flat concentration in the literature. We will apply Theorem 7.7.4 in establishing weak convergence mainly through the following result. Theorem 7.7.5 Let 𝜒, 𝜒n , n ≥ 1, be random elements in (ℍ, ℬ(ℍ)). Assume d

that ⟨𝜒n , f ⟩ −−→ ⟨𝜒, f ⟩ in ℝ for all f ∈ ℍ and for each 𝜖, 𝛿 > 0, there exists a finite-dimensional subspace S such that inf ℙ(𝜒n ∈ S𝜖 ) ≥ 1 − 𝛿

(7.53)

n≥1

d

for S𝜖 defined as in (7.47). Then, 𝜒n −−→ 𝜒. Proof: It is easy to show that the assumptions of Theorem 7.7.4 hold for the family of probability measures {ℙ∘𝜒n−1 ∶ n ≥ 1}. Therefore, {ℙ∘𝜒n−1 ∶ n ≥ 1} is tight and, by Prohorov’s Theorem (e.g., Billingsley 1995), it is relatively −1 compact. Now, let {ℙ∘𝜒n−1 ′ } and {ℙ∘𝜒n′′ } be two weakly convergent subsed

d

quences; for convenience, write 𝜒n′ −−→ 𝜒̃ and 𝜒n′′ −−→ 𝜒̌ for some random eled

ments 𝜒̃ and 𝜒̌ in ℍ. By the continuous mapping theorem, ⟨𝜒n′ , f ⟩ −−→ ⟨𝜒, ̃ f⟩ d

and ⟨𝜒n′′ , f ⟩ −−→ ⟨𝜒, ̌ f ⟩ for all f ∈ ℍ. However, it follows that d

d

⟨𝜒, ̃ f ⟩ = ⟨𝜒, ̌ f ⟩ = ⟨𝜒, f ⟩, d

f ∈ ℍ, d

d

where = indicates “equal in distribution.” By Theorem 7.1.2 , 𝜒 = 𝜒̃ = 𝜒̌ and so both 𝜒n′ and 𝜒n′′ converge in distribution to 𝜒. This completes the proof. ◽

210

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

We conclude by proving an elementary central limit theorem for random elements of a Hilbert space. Theorem 7.7.6 Let 𝜒1 , 𝜒2 , … be independent and identically distributed random elements in ℍ with mean 0 and 𝔼‖𝜒1 ‖2 < ∞. Then 𝜉n ∶= n

−1∕2

n ∑

d

𝜒i → 𝜉,

i=1

where 𝜉 is Gaussian random element of ℍ with covariance operator equal to 𝔼(𝜒1 ⊗ 𝜒1 ). Proof: We need only verify the two conditions in Theorem 7.7.5. First, by the central limit theorem for real-values random variables, for any f ∈ ℍ, the distribution of ⟨𝜉n , f ⟩ converges to N(0, ⟨𝔼(𝜒1 ⊗ 𝜒1 ), f ⟩), which is the distribution of ⟨𝜉, f ⟩. To show the second condition, let {ej } be any CONS for ℍ and take S in ′ Theorem 7.7.5 to be SJ = span{e1 , … , eJ }. Let 𝜉nJ and 𝜉nJ be the projections ⊥ ′ of 𝜉n on SJ and SJ , respectively, and let 𝜒iJ be the projection of 𝜒i on S⊥J . Then, for any 𝜖 > 0, ′ ℙ(‖𝜉nJ ‖ ≤ 𝜖) = ℙ(𝜉n ∈ SJ𝜖 ).

By Chebyshev’s inequality, ′ ′ 2 ℙ(‖𝜉nJ ‖ > 𝜖) ≤ 𝜖 −2 𝔼(‖𝜒1J ‖ ),

which will be smaller than any 𝛿 for J sufficiently large.



8

Mean and covariance estimation In this chapter, we study mean and covariance estimation for both the random element and stochastic process settings described in Chapter 7. For the first perspective, let ℍ be a separable Hilbert space and suppose that we have a random element 𝜒 of ℍ with 𝔼‖𝜒‖2 < ∞. The mean element and covariance operator of 𝜒 from Section 7.2 are therefore well defined as m = 𝔼𝜒 and 𝒦 = 𝔼(𝜒 − m) ⊗ (𝜒 − m). In Section 8.1, we address the base problem of estimating m and 𝒦 using a random sample 𝜒1 , … , 𝜒n from 𝜒. The estimators of choice are the natural parallels of the sample mean and covariance that are used for real-valued random variables. Suppose, on the other hand, that {X(t) ∶ t ∈ E} is a second-order stochastic process. Then, the mean function m(t) = 𝔼X(t) and covariance function K(s, t) = 𝔼(X(s) − m(s))(X(t) − m(t)) are well defined for all s, t ∈ E, and they are the basic quantities of interest in our inference problem. Sections 8.2 and 8.3 focus on two scenarios that Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators, First Edition. Tailen Hsing and Randall Eubank. © 2015 John Wiley & Sons, Ltd. Published 2015 by John Wiley & Sons, Ltd.

212

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

are more realistic from an fda standpoint. In both instances, we deal with stochastic processes that take values in a Hilbert function space. However, the sample paths are now only assumed to be observed at discrete time points with the resulting values then distorted by additive random noise. For Section 8.2, we let ℍ = 𝕃2 (E, ℬ(E), 𝜇) and estimate m and K with local linear regression type smoothers. In Section 8.3, the choice is ℍ = 𝕎q [0, 1] and estimation is by penalized least squares methods.

8.1

Sample mean and covariance operator

Let ℍ be a separable Hilbert space and 𝜒 a random element of ℍ. Assume that we observe an iid sample 𝜒1 , … , 𝜒n from 𝜒. We will then estimate m and 𝒦 by the sample mean 1∑ mn = 𝜒 n i=1 i n

and sample covariance operator 1 ∑ 𝒦n = (𝜒 − mn ) ⊗ (𝜒i − mn ). n − 1 i=1 i n

(8.1)

It is easy to verify that mn and 𝒦n are random elements of ℍ and 𝔅HS (ℍ), respectively, and they are unbiased for their population counterparts. We now apply the results in Section 7.7 to derive the asymptotic properties of mn and 𝒦n as n → ∞. For the sample mean, we have the following result whose proof follows directly from Theorems 7.7.2 and 7.7.6. a.s.

Theorem 8.1.1 If 𝔼‖𝜒1 ‖ < ∞ then mn −−−→ m. If 𝔼‖𝜒1 ‖2 < ∞ then √

d

n(mn − m) −−→ 𝜉

in ℍ where 𝜉 is a Gaussian random element with mean zero and covariance operator 𝒦. The basic asymptotic properties of 𝒦n are established as a.s.

Theorem 8.1.2 If 𝔼‖𝜒1 ‖2 < ∞ then 𝒦n −−−→ 𝒦. If 𝔼‖𝜒1 ‖4 < ∞ then √

d

n(𝒦n − 𝒦) −−→ ℨ

MEAN AND COVARIANCE ESTIMATION

213

in ℬHS (ℍ) where ℨ is a Gaussian random element with mean zero and covariance operator 𝔼((𝜒1 − m) ⊗ (𝜒1 − m) − 𝒦)⊗HS ((𝜒1 − m) ⊗ (𝜒1 − m) − 𝒦). Proof: Write 1 ∑ n (𝜒 − m) ⊗ (𝜒i − m) − (m − m) ⊗ (mn − m). n − 1 i=1 i n−1 n n

𝒦n =

The first conclusion follows from Theorems 7.7.2 and 8.1.1. The second conclusion will follow from Theorem 7.7.6 once we verify that 𝔼‖(𝜒i − m) ⊗ (𝜒i − m) − 𝒦‖2HS < ∞. However, from Theorem 7.2.2, 𝔼‖(𝜒i − m) ⊗ (𝜒i − m) − 𝒦‖2HS ≤ 𝔼‖(𝜒i − m) ⊗ (𝜒i − m)‖2HS = 𝔼‖𝜒i − m‖4 < ∞ ◽

and the proof is complete.

To illustrate the implication of this result, let 𝒢 be any element of 𝔅√ HS (ℍ) and initially take m = 0. Then, the limiting distribution of ⟨ n(𝒦n − 𝒦), 𝒢⟩HS is that of the random variable ⟨ℨ, 𝒢⟩HS . This quantity is normally distributed with mean zero and, from part 1 of Theorem 7.2.5, Var(⟨ℨ, 𝒢⟩HS ) = 𝔼⟨ℨ, 𝒢⟩2HS = ⟨𝔼(𝜒1 ⊗ 𝜒1 − 𝒦)⊗HS (𝜒1 ⊗ 𝜒1 − 𝒦)𝒢, 𝒢⟩HS = 𝔼⟨(𝜒1 ⊗ 𝜒1 − 𝒦), 𝒢⟩2HS = Var(⟨𝜒1 ⊗ 𝜒1 , 𝒢⟩HS ). We can evaluate the HS inner product in this last expression directly. For that purpose, we choose any CONS {ej } for ℍ whose first element in e1 = 𝜒1 ∕‖𝜒1 ‖. Then, (𝜒1 ⊗ 𝜒1 )ej = 0 for all j > 1 in which case Definition 4.4.4 for the HS inner product produces ⟨𝜒1 ⊗ 𝜒1 , 𝒢⟩HS = ⟨(𝜒1 ⊗ 𝜒1 )e1 , 𝒢e1 ⟩ = ⟨𝜒1 , 𝒢𝜒1 ⟩. So,

Var(⟨ℨ, 𝒢⟩HS ) = Var(⟨𝜒1 , 𝒢𝜒1 ⟩).

214

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

When m ≠ 0, this becomes Var(⟨ℨ, 𝒢⟩HS ) = Var(⟨(𝜒1 − m), 𝒢(𝜒1 − m)⟩).

8.2

Local linear estimation

We assume in this section that X is a stochastic process on E = [0, 1] where the mean function m(t) and covariance function K(s, t) are both smooth. For convenience, we also assume that X can be viewed as a random element of 𝕃2 [0, 1]. Thus, we are in the setting described in Section 7.4. Our attention is then directed toward the use of local linear regression estimators along the lines of those in Fan and Gijbels (1996) for estimation of m and K. The basic nonparametric regression formulation that underlies such estimators can be employed directly to construct estimators for the mean function but requires a slight adjustment for estimation of covariances as we subsequently demonstrate. Suppose that X1 , … , Xn are independent copies of a X. Consider the model Yij = Xi (Tij ) + 𝜀ij ,

j = 1, … , r, i = 1, … , n,

(8.2)

where the Tij ’s are iid sampling points and the 𝜀ij are iid errors. For data of the form (8.2), a local linear smoothing based estimator of m(t) is obtained by minimization of n r ∑ ∑

Wh (Tij − t)(Yij − a0 − a1 (Tij − t))2

(8.3)

i=1 j=1

with respect to a0 , a1 . This is just a weighted sum of squared residuals corresponding to a simple linear regression fit. What makes it special and worthwhile for our situation is how the weights are chosen. This is accomplished with the kernel weight function W and associated bandwidth h > 0 via the relation ( ) u Wh (u) = h−1 W . (8.4) h Here, W is a symmetric probability density function on [−1, 1], which makes h a scale parameter that governs how concentrated the associated weights Wh (Tij − t) are around t. We also assume that W is of bounded variation and satisfies the moment conditions ⎧1, j = 0, ⎪ j u W(u)du = ⎨0, j = 1, ∫−1 ⎪C ≠ 0, j = 2. ⎩ 1

(8.5)

MEAN AND COVARIANCE ESTIMATION

215

When the bandwidth is small, criterion (8.3) uses only those responses whose time ordinates are in the immediate neighborhood of the point of estimation t. First-order Taylor expansions of the m(Tij ) around t then suggest that we estimate m(t) by the minimizing intercept term, â 0 , of (8.3); i.e., the estimator we will consider is mh (t) =

S0 (t)M2 (t) − S1 (t)M1 (t) M0 (t)M2 (t) − M12 (t)

,

(8.6)

with Mp , Sp , p = 0, 1, 2, defined by Mp (t) =

Sp (t) =

n r 1 ∑∑ W (T − t), nr i=1 j=1 hp ij

(8.7)

n r 1 ∑∑ W (T − t)Yij nr i=1 j=1 hp ij

(8.8)

for Whp (u) =

( )p u Wh (u). h

(8.9)

In the spirit of typical developments in the nonparametric smoothing literature, we will analyze the behavior of sup |mh (t) − m(t)|

t∈[0,1]

as n → ∞ and h tends to zero at a rate that is some function of the sample size. What distinguishes the calculations performed here from those that arise in classical nonparametric regression problems is the necessity of dealing with an additional random (and correlated) component from our stochastic process formulation and the presence of replicates. We now spell out the conditions that are required for our analysis. First, the Tij are assumed to be a random sample from a continuous random variable T with positive density function f (⋅) having a derivative in C[0, 1]. We should perhaps mention at this point that the didactic assumption of an equal number of sampling points for every sample path can be relaxed (Li and Hsing, 2010). The 𝜀ij are also independent and identically distributed with 𝔼𝜀11 = 0, Var(𝜀11 ) = 𝜎 2 < ∞ and, collectively, the Xi , Tij and 𝜀ij are independent of one another. Finally, we will need

and

𝔼|𝜀11 |q < ∞

(8.10)

𝔼 sup |X(t)|q < ∞

(8.11)

t∈[0,1]

216

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

for some positive q falling in a range that will be made explicit in the following. Moment conditions such as (8.11) hold rather generally. In particular, they are satisfied for Gaussian processes with continuous sample paths (cf. Landau and Shepp 1970). To simplify subsequent notation, we will use the definitions 𝛿n1 (h) = ({1 + (hr)−1 } log n∕n)1∕2 , 𝛿n2 (h) = ({1 + (hr)−1 + (hr)−2 } log n∕n)1∕2 when stating our major results in the following. Under the above-mentioned conditions, we are able to establish Theorem 8.2.1 Assume that m has a uniformly bounded second derivative on [0, 1], (8.10) and (8.11) hold for some q ∈ (2, ∞) and that h → 0 as n → ∞ in such a way that (h2 + h∕r)−1 (log n∕n)1−2∕q → 0. Then, sup |mh (t) − m(t)| = O(h2 + 𝛿n1 (h))

a.s.

(8.12)

t∈[0,1]

Proof: Straightforward calculations give us mh (t) − m(t) = with

S̃ 0 (t)M2 (t) − S̃ 1 (t)M1 (t) M0 (t)M2 (t) − M12 (t)

S̃ p (t) = Sp (t) − m(t)Mp (t) − hm′ (t)Mp+1 (t).

To establish the result, we will proceed by obtaining almost sure, uniform approximations for the S̃ p (t) and Mp (t) with the aid of Lemma 8.2.2 in the following text. First, let us consider Mp . As the Tij are iid and the support of W is [−1, 1], uniformly for all t 1

𝔼Mp (t) =

∫0

Whp (s − t)f (s)ds (1−t)∕h

=

∫−t∕h

up W(u)f (t + hu)du min(1,(1−t)∕h)

= (f (t) + O(h))

∫max(−1,−t∕h)

up W(u)du.

Thus, for p = 0, 1, 2, 𝔼Mp (t) is uniformly bounded away from ∞, whereas for p = 0, 2, 𝔼Mp (t) is also uniformly bounded away from 0 as h → 0. Applying Lemma 8.2.2 with Zij ≡ 1 then enables us to conclude that these statements

MEAN AND COVARIANCE ESTIMATION

217

hold for Mp (t) in place of 𝔼[Mp (t)] with probability one. Thus, the rate of convergence for mh (t) − m(t) is determined by those of the S̃ p . Observe that 1 ∑∑ S̃ p (t) = W (T − t)(Yij − m(t) − m′ (t)(Tij − t)) nr i=1 j=1 hp ij n

r

1 ∑∑ = Up (t) + W (T − t)[m(Tij ) − m(t) − m′ (t)(Tij − t)] nr i=1 j=1 hp ij n

with

r

1 ∑∑ W (T − t)[𝜀ij + Xi (Tij ) − m(Tij )]. nr i=1 j=1 hp ij n

Up (t) =

r

Taylor’s Theorem and the assumption that m′′ is uniformly bounded then lead to 1 ∑∑ W (T − t)[m(Tij ) − m(t) − m′ (t)(Tij − t)] nr i=1 j=1 hp ij n

r

1 ∑∑ W (T − t)O((Tij − t)2 ) nr i=1 j=1 hp ij n

=

r

= Mp (t)O(h2 ) 2

= O(h )

a.s.

a.s.

as a result of what we previously established for Mp . Thus, with probability one, S̃p (t) = Up (t) + O(h2 ). Finally, express Up (t) as the sum of two zero mean process; i.e., write Up (t) = U1p (t) + U2p (t) with U1p (t) = and

n r 1 ∑∑ W (T − t)𝜀ij nr i=1 j=1 hp ij

1 ∑∑ U2p (t) = W (T − t)[Xi (tij ) − m(Tij )]. nr i=1 j=1 hp ij n

r

An application of Lemma 8.2.2 to each of the Uip , i = 1, 2 verifies (8.12) and concludes the proof. ◽

218

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Lemma 8.2.2 Let variables satisfying

Zij , 1 ≤ i ≤ n, 1 ≤ j ≤ r,

be

real-valued

random

sup 𝔼|Zij |q < ∞

(8.13)

1 ∑∑ |Z |q = O(1) a.s. nr i=1 j=1 ij

(8.14)

i,j

and

n

r

for some q ∈ (2, ∞). Let Tij , 1 ≤ i ≤ n, 1 ≤ j ≤ r, be independent random variables taking values in [0, 1] with probability density functions that are uniformly bounded. Assume the sets of random variables {Zij , Tij , 1 ≤ j ≤ r}, i = 1, … , n, are mutually independent and that there is a universal constant C such that sup 𝔼[|Zij Zik ‖Tij , Tik ] < C

a.s.

(8.15)

i,j,k

Define

1 ∑∑ W (T − t)Zij . nr i=1 j=1 hp ij n

Z p (t) =

r

Set 𝛽n = h2 + h∕r and assume that h → 0 in such a way that 𝛽n−1 (log n∕n)1−2∕q = o(1). Then, √ sup nh2 ∕(𝛽n log n)|Z p (t) − 𝔼Z p (t)| = O(1) a.s. (8.16) t∈[0,1]

Proof: As both W(u) and up are of bounded variation, Whp (u) is also of bounded variation. As a result, Whp (u) = Whp,1 (u) − Whp,2 (u) for increasing functions Whp,1 (u) and Whp,2 (u); without loss of generality, assume that Whp,1 (−h) = Whp,2 (−h) = 0 and write 1 ∑∑ Z p (t) = W (T − t)Zij n i=1 j=1 hp ij n

r

ij 1 ∑∑ = Zij I(Tij − t ≤ h) dWhp (𝑣) ∫−h nr i=1 j=1

n

h

=

∫−h

r

T −t

1 ∑∑ Z I(𝑣 ≤ Tij − t ≤ h)dWhp (𝑣) nr i=1 j=1 ij n

r

h

=

∫−h

Gn (t + 𝑣, t + h)dWhp (𝑣),

MEAN AND COVARIANCE ESTIMATION

219

where 1 ∑∑ Gn (t1 , t2 ) = Z I(T ∈ [t1 ∧ t2 , t1 ∨ t2 ]). nr i=1 j=1 ij ij n

r

(8.17)

Define Vn (t, h) = sup |Gn (t, t + u) − G(t, t + u)| |u|≤c

for G(t1 , t2 ) = 𝔼{Gn (t1 , t2 )} and observe that h

sup |Z p (t)(t) − 𝔼Z p (t)| ≤ sup Vn (t, 2h)

t∈[0,1]

t∈[0,1]

∫−h

|dWhp |

≤ sup Vn (t, 2h){Whp,1 (h) + Whp,2 (h)}. t∈[0,1]

Now, Whp,1 (h) + Whp,2 (h) = O(h−1 ) by the definition of Whp , and we will show below that sup Vn (t, h) = O({𝛽n log n∕n}1∕2 ) a.s. (8.18) t∈[0,1]

In combination, this gives us (8.16). In proving (8.18), we can obviously treat the positive and negative parts of Zij separately and will therefore assume in the following that Zij is nonnegative. Without loss of generality, assume that 1∕h is an integer. Define an equally spaced grid 𝒢 ∶= {𝑣k }, with 𝑣k = kh for k = 0, … , 1∕h. For any t ∈ [0, 1] and |u| ≤ h, one may now find a grid point 𝑣k that is within h of both t and t + u. As |Gn (t, t + u) − G(t, t + u)| ≤ |Gn (𝑣k , t + u) − G(𝑣k , t + u)| +|Gn (𝑣k , t) − G(𝑣k , t)|, we can write |Gn (t, t + u) − G(t, t + u)| ≤ 2 sup Vn (t, h). t∈𝒢

Thus, sup Vn (t, h) ≤ 2 sup Vn (t, h).

t∈[0,1]

(8.19)

t∈𝒢

From now on, we focus on the right-hand side of (8.19). Let an = (𝛽n log n∕n)1∕2 ,

(8.20)

Qn = 𝛽n ∕an

(8.21)

220

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

and define G∗n (t1 , t2 ), G∗ (t1 , t2 ), and Vn∗ (t, h) in the same way that we defined Gn (t1 , t2 ), G(t1 , t2 ), and Vn (t, h) except with Zij I(Zij ≤ Qn ) replacing Zij in the expressions. Then, sup Vn (t, h) ≤ sup Vn∗ (t, h) + An1 + An2 , t∈𝒢

(8.22)

t∈𝒢

where An1 = sup sup (Gn (t, t + u) − G∗n (t, t + u)), t∈𝒢 |u|≤h

An2 = sup sup (G(t, t + u) − G∗ (t, t + u)). t∈𝒢 |u|≤h

We first consider An1 and An2 . For all t and u, (Gn (t, t + u) −

G∗n (t, t

n r 1 ∑ ∑ q 1−q + u)) ≤ Z Z I(Zij > Qn ) nr i=1 j=1 ij ij

1 ∑∑ q Z . nr i=1 j=1 ij n

1−q

≤ Qn

r

It follows that 1−q

a−1 n Qn

= {𝛽n−1 (log n∕n)1−2∕q }q∕2 = o(1).

(8.23)

So, from (8.13) and (8.14) and (8.23), we conclude that a.s

a−1 n An1 → 0. Similarly, a−1 n An2 = 0 and we have proved An1 + An2 = o(an ) a.s.

(8.24)

To bound Vn∗ (t, h) for a fixed t ∈ 𝒢, we perform a further partition of the interval [t − h, t + h]. Define 𝑤n = [Qn h∕an + 1] and u𝓁 = 𝓁h∕𝑤n , for 𝓁 = −𝑤n , −𝑤n + 1, … , 𝑤n . Note that 𝑤n → ∞ as an ∕(Qn h) = h−1 log n∕n ≤ 𝛽n−1 log n∕n → 0. Now pick any u in [−h, h]. There is an 𝓁 such that u𝓁 ≤ u ≤ u𝓁+1 . Note that we have either 0 ≤ u𝓁 ≤ u ≤ u𝓁+1 or u𝓁 ≤ u ≤ u𝓁+1 ≤ 0. So, consider the former case as the other one can be treated similarly. Using the monotonicity of G∗n (t, t + u) in |u|, as Zij ≥ 0, we obtain G∗n (t, t + u𝓁 ) − G∗ (t, t + u𝓁+1 ) ≤ G∗n (t, t + u) − G∗ (t, t + u) ≤ G∗n (t, t + u𝓁+1 ) − G∗ (t, t + u𝓁 ).

MEAN AND COVARIANCE ESTIMATION

221

The left-hand side can be written as G∗n (t, t + u𝓁 ) − G∗ (t, t + u𝓁 ) + G∗ (t, t + u𝓁 ) − G∗ (t, t + u𝓁+1 ) = G∗n (t, t + u𝓁 ) − G∗ (t, t + u𝓁 ) − G∗ (t + u𝓁 , t + u𝓁+1 ), and, similarly, the right-hand side can be written as G∗n (t, t + u𝓁+1 ) − G∗ (t, t + u𝓁+1 ) + G∗ (t, t + u𝓁+1 ) − G∗ (t, t + u𝓁 ) = G∗n (t, t + u𝓁+1 ) − G∗ (t, t + u𝓁+1 ) + G∗ (t + u𝓁 , t + u𝓁+1 ). From these relations, we conclude that |G∗n (t, t + u) − G∗ (t, t + u)| ≤ max(𝜉n𝓁 , 𝜉n,𝓁+1 ) + G∗ (t + u𝓁 , t + u𝓁+1 ), where 𝜉n𝓁 = |G∗n (t, t + u𝓁 ) − G∗ (t, t + u𝓁 )|. Thus, Vn∗ (t, h) ≤

max

−𝑤n ≤𝓁≤𝑤n

𝜉n𝓁 +

max

−𝑤n ≤𝓁≤𝑤n

G∗ (t + u𝓁 , t + u𝓁+1 ).

For all 𝓁, G∗ (t + u𝓁 , t + u𝓁+1 ) ≤ Qn

n r 1 ∑∑ ℙ(t + u𝓁 ≤ Tij ≤ t + u𝓁+1 ) nr i=1 j=1

≤ MT Qn (u𝓁+1 − u𝓁 ) ≤ MT an , where MT is the maximum of the densities of the Tij ’s. Therefore, for any B, ℙ{Vn∗ (t, h) ≥ Ban } ≤ ℙ{ max

−𝑤n ≤𝓁≤𝑤n

𝜉n𝓁 ≥ (B − MT )an }.

(8.25)

We now proceed to develop an upper bound for ℙ{𝜉n𝓁 ≥ (B − MT )an }. Write n n |1 ∑ | 1∑ | | 𝜉n𝓁 = | {Zi − 𝔼(Zi )}| ≤ |Zi − 𝔼(Zi )|, |n | n i=1 | i=1 |

where 1∑ Zi = Z I(Z ≤ Qn )I(Tij ∈ (t, t + u𝓁 ]). r j=1 ij ij r

222

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

It follows that Var(Zi ) ≤

r r ) 1 ∑∑ ( 𝔼 𝔼[Z Z |T T ]I(T , T ∈ (t, t + u ]) ij ik ij ik ij ik 𝓁 r2 j=1 k=1

≤C

r r 1 ∑∑ ℙ(Tij , Tik ∈ (t, t + u𝓁 ])) r 2 j=1 k=1

due to (8.15), where

{ ℙ(Tij ∈ (t, t + u𝓁 ])), ℙ(Tij , Tik ∈ (t, t + u𝓁 ])) = ℙ(Tij ∈ (t, t + u𝓁 ]))ℙ(Tik ∈ (t, t + u𝓁 ])),

j = k, j ≠ k.

As the densities of the Tij are uniformly bounded, our derivations have established the existence of a universal constant M such that Var(Zi ) ≤ M(u2𝓁 + u𝓁 ∕r) ≤ M𝛽n . Bernstein’s inequality, the fact that |Zi − 𝔼(Zi )| ≤ Qn and (8.20) and (8.21) now imply that for B > MT { } (B − MT )2 n2 a2n ℙ(𝜉nr ≥ (B − MT )an ) ≤ exp − ∑n 2 i=1 Var(Zi ) + (2∕3)(B − MT )Qn nan { } (B − MT )2 n2 a2n ≤ exp − 2Mn𝛽n + (2∕3)(B − MT )n𝛽n = exp {−B∗ na2n ∕𝛽n } ∗

= n−B , where B∗ =

(B−MT )2 . 2M+(2∕3)(B−MT )

Then, using (8.25) along with Boole’s inequality

( ) ( [ ] ) Qn h Q ∗ ∗ ∗ −1 ℙ sup Vn (t, h) ≥ Ban ≤ h 2 + 1 + 1 n−B ≤ C n n−B an an t∈𝒢 for some finite C. Observe that Qn ∕an = 𝛽n ∕a2n = n∕ log n. So, select B large enough that B∗ > 2 to obtain ∞ ∑

ℙ(sup Vn∗ (t, h) ≥ Ban ) < ∞.

n=1

t∈𝒢

We therefore conclude from the Borel–Cantelli lemma that sup Vn∗ (t, h) = O(an ) a.s.

(8.26)

t∈𝒢

Hence, (8.18) follows from combining (8.19), (8.22), (8.24), and (8.26).



MEAN AND COVARIANCE ESTIMATION

223

There are two scenarios for model (8.2) that have received special attention in the literature. The so-called sparse case has r uniformly bounded with only n divergent. In contrast, for the “dense” case, both the number of sampled processes and the number of sampling points are allowed to grow large. Thus, a spectrum of rates are possible depending on the nature of r. The following corollary addresses two special instances. To simplify subsequent presentations, we adopt the notation an ≲ bn if an = O(bn ), an ≳ bn if bn = O(an ), and an ≍ bn if an = O(bn ) and bn = O(an ). Corollary 8.2.3 Under the conditions of Theorem 8.2.1, 1. if r is bounded sup |mh (t) − m(t)| = O(h2 + [(log n∕(nh)]1∕2 ) a.s.

t∈[0,1]

2. if r = rn is such that rn−1 ≲ h ≲ (log n∕n)1∕4 , sup |mh (t) − m(t)| = O([(log n)∕n]1∕2 ) a.s.

t∈[0,1]

Proof: For the first statement of the corollary, (1 + (hr)−1 ) = O(h−1 ) as n → ∞ as r is bounded. In the second case, (1 + (hr)−1 ) = O(1) as n ∈ ∞. The condition h ≲ (log n∕n)1∕4 then insures that the bias or h2 term does not dominate. ◽ The rate in part 1 of the corollary for sparse functional data is the classical nonparametric rate for estimating a univariate function; see. e.g., Stone (1982). The second part of the result indicates that if r ≳ n1∕4 convergence occurs at a parametric rate. Next, we consider estimation of the covariance function K(s, t). To do that, we can proceed similarly to estimation of the mean function. First, we estimate R(s, t) ∶= 𝔼[X(s)X(t)] by Rh (s, t) = a0 , where ⎡ n ∑∑ 1 ∑⎢ 1 (̂ a0 , ̂ a1 , ̂ a2 ) = arg min (yij yik − a0 − a1 (Tij − s) ⎢ a0 ,a1 ,a2 n i=1 ⎢ r(r − 1) 1≤j,k≤r ⎣ j≠k − a2 (Tik − t))2 Wh (Tij − s)Wh (Tik − t)]. This produces Rh (s, t) = (ℳ1 (s, t)S00 (s, t) − ℳ2 (s, t)S10 (s, t) − ℳ3 (s, t)S01 )∕𝒟 (s, t),

224

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

where 2 ℳ1 (s, t) = M20 (s, t)M02 (s, t) − M11 (s, t),

ℳ2 (s, t) = M10 (s, t)M02 (s, t) − M01 (s, t)M11 (s, t), ℳ3 (s, t) = M01 (s, t)M20 (s, t) − M10 (s, t)M11 (s, t), 𝒟 (s, t) = ℳ1 (s, t)M00 (s, t) − ℳ2 (s, t)M10 (s, t) − ℳ3 (s, t)M01 (s, t) for

∑∑∑ 1 W (T − s)Whp2 (Tik − t) nr(r − 1) i=1 1≤j,k≤r hp1 ij n

Mp1 p2 (s, t) =

j≠k

and ∑∑∑ 1 W (T − s)Whp2 (Tik − t)Yij Yik . nr(r − 1) i=1 1≤j,k≤r hp1 ij n

Sp1 p2 (s, t) =

j≠k

We then estimate K(s, t) by Kh (s, t) = RhR (s, t) − mhm (s)mhm (t) with h = (hm , hR ), a vector containing the two bandwidths hm , hR that are used for estimation of R and m, respectively. The following result gives the convergence rates for Kh (s, t). Theorem 8.2.4 Assume that all second-order partial derivatives of K(s, t) exist and are bounded on [0, 1]2 and that (8.10) and (8.11) hold for some q ∈ (4, ∞). If h → 0 as n → ∞ in such a way that (h2m + hm ∕r)−1 (log n∕n)1−2∕q → 0 and (h4R + h3R ∕r + (h∕r)2 )−1 (log n∕n)1−4∕q → 0, sup |Kh (s, t) − K(s, t)| = O(h2m + h2R + 𝛿n1 (hm ) + 𝛿n2 (hR ))

a.s.

s,t∈[0,1]

Proof: Define S̃ p1 p2 (s, t) = Sp1 p2 (s, t) − R(s, t)Mp1 p2 − hR R(1,0) (s, t)M(p1 +1)p2 −hR R(0,1) (s, t)Mp1 (p2 +1) (s, t) and observe that RhR (s, t) − R(s, t) =

(ℳ1 (s, t)S̃ 00 (s, t) − ℳ2 (s, t)S̃ 10 (s, t) − ℳ3 (s, t)S̃ 01 (s, t)) . (8.27) 𝒟 (s, t)

MEAN AND COVARIANCE ESTIMATION

225

Note that cancellations occur that eliminate all the terms containing R(1,0) (s, t) and R(0,1) (s, t) from this last expression. Uniformly for all s, t, 𝔼Mp1 p2 (s, t) = {f (s)f (t) + O(hR )}IhR ,p1 ,p2 (s, t),

(8.28)

where min(1,(1−s)∕h)

Ih,p1 ,p2 (s, t) =

min(1,(1−t)∕h)

∫max(−1,−s∕h) ∫max(−1,−t∕h)

up1 𝑣p2 W(u)W(𝑣)dud𝑣.

Upon applying Lemma 8.2.5 with 𝜅 = q∕2 and Zijk equal to 1 and Yij Yik , we see that ℳi (s, t), 1 ≤ i ≤ 3 are uniformly bounded a.s. and 𝒟 (s, t) is uniformly bounded away from 0 a.s. Thus, by (8.27), the rate for RhR (s, t) − R(s, t) is determined from the rates for S̃ 00 (s, t), S̃ 10 (s, t), and S̃ 01 (s, t). To analyze the behavior of the S̃ p1 q2 (s, t), write S̃ p1 p2 (s, t) = U1 (s, t) + U2 (s, t) + U3 (s, t) + U4 (s, t)

(8.29)

with ∑∑∑ 1 U1 (s, t) = 𝜀 𝜀 W (T − s)WhR p2 (Tik − t), nr(r − 1) i=1 1≤j,k≤r ij ik hR p1 ij n

j≠k

U2 (s, t) =

∑∑∑ 1 𝜀 X (T )W (T − s)WhR p2 (Tik − t), nr(r − 1) i=1 1≤j,k≤r ij i ik hR p1 ij

U3 (s, t) =

∑∑∑ 1 [X (T )X (T ) − R(Tij , Tik )] nr(r − 1) i=1 1≤j,k≤r i ij i ik

n

j≠k n

j≠k

× WhR p1 (Tij − s)WhR p2 (Tik − t), ∑∑∑ 1 U4 (s, t) = [R(Tij , Tik ) − R(s, t) − (Tij − s)R(1,0) (s, t) nr(r − 1) i=1 1≤j,k≤r n

j≠k

− (Tik − t)R(0,1) (s, t)]WhR p1 (Tij − s)WhR p2 (Tik − t). An application of Taylor’s Theorem establishes that U4 (s, t) = O(h2R ) uniformly. The other Ui have zero means and Lemma 8.2.5 can be applied again to conclude the proof. ◽

226

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Lemma 8.2.5 Let Zijk , 1 ≤ i ≤ n, 1 ≤ j ≠ k ≤ r, be real-valued random variables satisfying sup 𝔼|Zijk |𝜅 < ∞ (8.30) i,j,k

and

n 1 ∑∑∑ |Zijk |𝜅 = O(1) a.s. nr2 i=1 1≤j,k≤r

(8.31)

j≠k

for some 𝜅 ∈ (2, ∞). Let Tij , 1 ≤ i ≤ n, 1 ≤ j ≤ r, be independent random variables with values in [0, 1] with uniformly bounded probability density functions. Assume that there is a universal constant C such that sup

i,j1 ,k1 ,j2 ,k2

𝔼[|Zij1 k1 Zij2 k2 ‖Tij1 , Tik1 , Tij2 , Tik2 ] < C

a.s.

(8.32)

For any nonnegative integers p1 , p2 , define ∑∑∑ 1 Z p1 p2 (s, t) = Z W (T − s)Whp2 (Tik − t) nr(r − 1) i=1 1≤j,k≤r ijk hp1 ij n

j≠k

and let 𝛽n = h4 + h3 ∕r + (h∕r)2 be such that 𝛽n−1 (log n∕n)1−2∕𝜅 = o(1) as h → 0. Then, √ sup nh4 ∕(𝛽n log n)|Z p1 p2 (s, t) − 𝔼Z p1 p2 (s, t)| = O(1) a.s. s,t∈[0,1]

Proof: Write ∑∑∑ 1 Z p1 p2 (s, t) = Z I(T ≤ s + h, Tik ≤ t + h) nr(r − 1) i=1 1≤j,k≤r ijk ij n

j≠k

× Whp1 (Tij − s)Whp2 (Tik − t) ∑∑∑ 1 Z I(T ∈ [s + u, s + h]) ∫ ∫(u,𝑣)∈[−h,h]2 nr(r − 1) i=1 1≤j,k≤r ijk ij n

=

j≠k

× I(Tik ∈ [t + 𝑣, t + h])dWhp1 (u)dWhp2 (𝑣) =

∫ ∫(u,𝑣)∈[−h,h]2

Gn (s + u, t + 𝑣, s + h, t + h)dWhp1 (u)dWhp2 (𝑣),

MEAN AND COVARIANCE ESTIMATION

227

where Gn (s1 , t1 , s2 , t2 ) ∑∑∑ 1 = Z I(T ∈ [s1 ∧ s2 , s1 ∨ s2 ], Tik ∈ [t1 ∧ t2 , t1 ∨ t2 ]) nr(r − 1) i=1 1≤j,k≤r ijk ij n

j≠k

and set G(s1 , t1 , s2 , t2 ) = 𝔼{Gn (s1 , t1 , s2 , t2 )} with Vn (s, t, h) =

sup

|u1 |,|u2 |≤h

|Gn (s, t, s + u1 , t + u2 ) − G(s, t, s + u1 , t + u2 )|.

Then, sup |Z p1 p2 (s, t) − 𝔼Z p1 p2 (s, t)|

s,t∈[0,1]

≤ sup Vn (s, t, 2h) s,t∈[0,1]

∫ ∫(u,𝑣)∈[−h,h]2

|dWhp1 (u)‖dWhp2 (𝑣)|

= O(h−2 ) sup Vn (s, t, h). s,t∈[0,1]

We will show below that sup Vn (s, t, h) = O({𝛽n log n∕n}1∕2 ) a.s.

(8.33)

s,t∈[0,1]

The proof of (8.33) is similar to that of (8.18) in Lemma 8.2.2. Accordingly, we only outline the unique aspects that arise for the covariance estimation case here. Let an , Qn be as in (8.20) and (8.21) and let 𝒢 be the same grid defined in the proof of Lemma 8.2.2. Then we have sup Vn (s, t, h) ≤ 4 sup Vn (s, t, h).

s,t∈[0,1]

(8.34)

s,t∈𝒢

Now define G∗n (s1 , t1 , s2 , t2 ), G∗ (s1 , t1 , s2 , t2 ), and Vn∗ (s, t, h) in the same way as we defined Gn (s1 , t1 , s2 , t2 ), G(s1 , t1 , s2 , t2 ), and Vn (s, t, 𝛿) except now with Zijk being replaced by Zijk I(Zijk ≤ Qn ) and observe that sup Vn (s, t, h) ≤ sup Vn∗ (s, t, h) + An1 + An2 ,

s,t∈𝒢

(8.35)

s,t∈𝒢

where An1 = sup

sup

|Gn (s, t, s + u1 , t + u2 ) − G∗n (s, t, s + u1 , t + u2 )|,

An2 = sup

sup

|G(s, t, s + u1 , t + u2 ) − G∗ (s, t, s + u1 , t + u2 )|.

s,t∈𝒢 |u1 |,|u2 |≤h s,t∈𝒢 |u1 |,|u2 |≤h

228

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Using arguments similar to those in the proof of Lemma 8.2.2 with relations (8.30) and (8.31), we can show that both An1 and An2 are o(an ) almost surely. To bound Vn∗ (s, t, h) for fixed (s, t), we create a further partition of the interval [s − h, s + h] × [t − h, t + h]. Put 𝑤n = [Qn h2 ∕an + 1] and u𝓁 = 𝓁h∕𝑤n , 𝓁 = −𝑤n , … , 𝑤n . Clearly, 𝑤n → ∞ as an ∕(Qn h2 ) = h−2 log n∕n ≤ 𝛽n−1 log n∕n → 0. Then, Vn∗ (s, t, h) ≤

max

−𝑤n ≤𝓁1 ,𝓁2 ≤𝑤n

𝜉n𝓁1 𝓁2 +

max

−𝑤n ≤𝓁1 ,𝓁2 ≤𝑤n

{G∗ (s, t, s + u𝓁1 +1 , t + u𝓁2 +1 )

−G∗ (s, t, s + u𝓁1 , t + u𝓁2 )}, where 𝜉n𝓁1 𝓁2 = |G∗n (s, t, s + u𝓁1 , t + u𝓁2 ) − G(s, t, s + u𝓁1 , t + u𝓁2 )|. Note that 𝔼{G∗ (s, t, s + u𝓁1 +1 , t + u𝓁2 +1 ) − G∗ (s, t, s + u𝓁1 , t + u𝓁2 )} ≤ MQn h2 ∕𝑤n ≤ Man . Now consider the case where u𝓁1 , u𝓁2 are both nonnegative with other cases being similar. Write 𝜉n𝓁1 𝓁2

n n | ∑ | |1 | 1∑ =| {Zi − 𝔼(Zi )}| ≤ |Zi − 𝔼(Zi )|, |n | n i=1 | i=1 |

where Zi =

∑∑ 1 Z I(Z ≤ Qn )I(Tij ∈ (s, s + u𝓁1 ], Tik ∈ (t, t + u𝓁2 ]). r(r − 1) 1≤j,k≤r ijk ijk j≠k

It follows that Var(Zi ) ≤

r2 (r

∑ ∑ C ℙ(Tij1 , Tij2 ∈ (t, t + u𝓁1 ], Tik1 , Tik2 ∈ (t, t + u𝓁2 ]) 2 − 1) j ≠k j ≠k 1

1 2

2

as a result of (8.32). Considering the possible scenarios of how some of the indices j1 , j2 , k1 , k2 are the same, we conclude that there exists a universal constant M such that Var(Zi ) ≤ M(h4 + h3 ∕r + (h∕r)2 ) = M𝛽n .

MEAN AND COVARIANCE ESTIMATION

229

The rest of the proof completely mirrors that of Lemma 8.2.2 and is omitted. ◽ As we did for the mean estimation problem, we present the following result that highlights the implications of Theorem 8.2.4 for the cases of sparse and dense functional data. Corollary 8.2.6 Under conditions of Theorem 8.2.4, 1. if r is bounded and h2R ≲ hm ≲ hR , sup |Kh (s, t) − K(s, t)| = O(h2R + {(log n∕(nh2R )}1∕2 ) a.s.

s,t∈[0,1]

2. if r = rn is such that rn−1 ≲ hm , hR ≲ (log n∕n)1∕4 , then sup |Kh (s, t) − K(s, t)| = O({log n∕n}1∕2 ) a.s.

s,t∈[0,1]

The rate in part 1 of the corollary is the classical nonparametric rate for estimating a bivariate function. In contrast, Kh (s, t) has a root-n convergence rate in the dense setting. Corollary 8.2.6 indicates that to estimate K optimally in the sparse case we should use a bandwidth ĥ R ≍ n−1∕6 while Corollary 8.2.3 suggests that the optimal bandwidth for estimating m(t) in this instance is ĥ m ≍ n−1∕5 . Thus, ĥ m is within the range of [ĥ 2R , ĥ R ] and Corollary 8.2.6 is applicable when optimal bandwidths are used for both estimation of m and R. Finally, we consider the difference between the estimated and true covariance operators that correspond to the 𝕃2 [0, 1] integral operators obtained from the true and estimated covariance kernel. For this purpose, we will study the behavior of 1

(Δh e)(t) =

∫0

{Kh (s, t) − K(s, t)}e(s)ds,

over a class of elements e from 𝕃2 [0, 1]. One might expect the rate of convergence for Δh to be the same as that of Kh (s, t) − K(s, t). However, our following result demonstrates that not to be the case. In Section 9.2, we will apply this work to the problem of inference for principal components. Theorem 8.2.7 Assume that the conditions of Theorem 8.2.4 hold. For any bounded measurable function e on [0, 1], sup |(Δh e)(t)| = O(h2m + h2R + 𝛿n1 (hm ) + 𝛿n1 (hR )) a.s.

t∈[0,1]

230

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Proof: It follows that 1

(Δh e)(t) =

∫0

{RhR (s, t) − R(s, t)}e(s)ds 1



∫0

{mhm (s)mhm (t) − m(s)m(t)}e(s)ds

=∶ An1 (t) − An2 (t). From (8.27),

1

An1 (t) =

[

∫0

{ℳ1 (s, t)S̃ 00 (s, t) − ℳ2 (s, t)S̃ 10 (s, t)

] −ℳ3 (s, t)S̃ 01 (s, t)}∕𝒟 (s, t) e(s)ds.

] 1[ We only consider the ∫0 ℳ1 (s, t)S̃ 00 (s, t)∕𝒟 (s, t) e(s)ds term in this expression as the other two terms are of lower order and can be dealt with similarly. By (8.28) and Lemma 8.2.5, ℳ1 (s, t)∕𝒟 (s, t) behaves like a constant multiple[ of 1∕(f (s)f (t)). ] As f (t) > 0 for all t, it will be sufficient to focus only on 1 ̃ ∫0 S00 (s, t)∕f (s) e(s)ds. From (8.29), 1

∫0

[

3 ∑ ] S̃ 00 (s, t)∕f (s) e(s)ds = i=1

1

∫0

[Ui (s, t)∕f (s)]e(s)ds + O(h2R ),

where for each i, we can write 1

∫0

[Ui (s, t)∕f (s)]e(s)ds ∑∑∑ 1 = Z W (T − t) [WhR (Tij − s)∕f (s)]e(s)ds ∫0 nr(r − 1) i=1 1≤j,k≤r ijk hR ik n

1

j≠k

and the Zijk all have zero means. Express the right-hand side of this last formula as 1 ∑∑ Z W (T − t), nr i=1 k=1 ik hR ik n

where

r

1 ∑∑ Z [WhR (Tij − s)∕f (s)]e(s)ds r − 1 1≤j≤r ijk ∫0 1

Zik =

j≠k

MEAN AND COVARIANCE ESTIMATION

231

Note that 1 | 1[ | ] | | WhR (Tij − s)∕f (s) e(s)ds| ≤ sup (|e(s)|∕f (s)) W(u)du < ∞. | |∫0 | s∈[0,1] ∫−1 | | The assumptions of Lemma 8.2.2 can be easily verified for Zik , which entails that n r |1 ∑ | ∑ | | sup | Zik WhR (Tik − t)| = O(h2R + 𝛿n1 (hR )) a.s. | | nr t∈[0,1] | i=1 k=1 |

and consequently that | 1[ | ] | | sup | ℳ1 (s, t)S̃ 00 (s, t)∕𝒟 (s, t) e(s)ds| = O(h2R + 𝛿n1 (hR )) | | ∫ t∈[0,1] | 0 |

a.s.

Thus, we obtain the rate sup |An1 (t)| = O(h2R + 𝛿n1 (hR )). t∈[0,1]

Finally, we write 1

An2 (t) = mhm (t)

∫0

{mhm (s) − m(s)}e(s)ds 1

+{mhm (t) − m(t)}

∫0

m(s)e(s)ds.

This has the uniform rate O(h2m + 𝛿n1 (hm )) from Theorem 8.2.1 and the proof is complete. ◽

8.3

Penalized least-squares estimation

Section 8.2 considered nonparametric estimation of the mean function and covariance kernel using local linear regression type estimators. In this section, we explore a slightly different development with smoothing spline variants as the estimators of choice. The model to be considered is much the same as (8.2) with X1 , … , Xn iid as some second-order stochastic process X on E = [0, 1]. As before, let m(t) and K(s, t) be the mean and covariance functions of X. For simplicity, we will focus attention on the case where the Tij are a random sample from the uniform distribution. Apart from that, the major difference is that we now assume that X is a random element of the Sobolev space 𝕎q [0, 1] described in Section 2.8 for some q > 1. As 𝕎q [0, 1] is an RKHS, this is the setting

232

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

addressed in Section 7.5. For f , g ∈ 𝕎q [0, 1], we will use the (squared) norm and inner product ‖f ‖2𝕎 [0,1] = ‖f ‖2𝕃 [0,1] + ‖f (q) ‖2𝕃 [0,1] q

2

2

1

=

1

f 2 (t)dt +

∫0

[f (q) (t)]2 dt

∫0

(8.36)

and ⟨f , g⟩𝕎q [0,1] = ⟨f , g⟩𝕃2 [0,1] + ⟨f (q) , g(q) ⟩𝕃2 [0,1] 1

=

1

f (t)g(t)dt +

∫0

f (q) (t)g(q) (t)dt.

∫0

Recall from Section 2.8 that a convenient basis for 𝕎q [0, 1] is the set of eigenfunctions {ej } for the differential operator (−1)q D2q subject to the boundary conditions e(k) (0) = e(k) (1) = 0, k = q, … , 2q − 1. The ej satisfy j j 1

⟨ej , ek ⟩𝕃2 [0,1] = and (q)

ej (t)ek (t)dt = 𝛿jk

∫0 1

(q)

⟨ej , ek ⟩𝕃2 [0,1] =

∫0

(q)

(q)

ej (t)ek (t)dt = 𝛾j 𝛿jk

for 𝛾1 = · · · = 𝛾q = 0 and universal constants C1 , C2 ∈ (0, ∞) such that C1 j2q ≤ 𝛾j+q ≤ C2 j2q ,

j ≥ 1.

Thus, any element 𝑣 ∈ 𝕎q [0, 1] can be expressed as 𝑣 =

∑∞ j=1

𝑣j ej with

1

𝑣j = ⟨𝑣, ej ⟩𝕃2 [0,1] = and ‖𝑣‖2𝕎 [0,1] = q

In particular, ‖X‖2𝕎 [0,1] = q

∫0

∞ ∑

𝑣(t)ej (t)dt

(1 + 𝛾j )𝑣2j .

(8.37)

j=1

∞ ∑ (1 + 𝛾j )⟨X, ej ⟩2𝕃 [0,1] , j=1

2

which is finite by the assumption that X is in 𝕎q [0, 1].

MEAN AND COVARIANCE ESTIMATION

233

Let m ̂ (t) be any estimator of m(t) that has been constructed from the observed data (Tij , Yij ), i = 1, … , n, j = 1, … , r. We will assess its departure from m via the 𝕃2 [0, 1] norm of the difference; i.e., by { 1 }1∕2 ( )2 ‖̂ m − m‖𝕃2 [0,1] = m ̂ (t) − m(t) dt . ∫0 A gold standard for performance of estimators in this context has been established in Cai and Yuan (2011) that we state here as follows. Theorem 8.3.1 Let ℙ(q, C1 ) be the collection of probability measures for 𝕎q [0, 1] valued processes such that for any X with probability law ℙX ∈ ℙ(q, C1 ) 𝔼‖X (q) ‖2𝕃 [0,1] ≤ C1 2

for some C1 > 0. Then, there is a constant C2 > 0 that depends only on C1 and 𝜎 2 = 𝔼[𝜀211 ] such that ( ( )) lim sup sup ℙ ‖̂ m − m‖2𝕃2 [0,1] > C2 (nr)−2q∕(2q+1) + n−1 > 0. n→∞

ℙX ∈ℙ(q,C)

Our immediate aim is to construct a rate optimal estimator in the sense of this result. As X is second order with sample paths in 𝕎q [0, 1], we know from Section 7.2 that m ∈ 𝕎q [0, 1] as well. This observation leads us to consider estimation of m via the smoothing spline regression estimator discussed in Section 6.6. Specifically, we will estimate m by m𝜂 = argmin𝑣∈𝕎q [0,1] frn,𝜂 (𝑣) for −1

frn,𝜂 (𝑣) = (nr)

n r ∑ ∑

(Yij − 𝑣(Tij ))2 + 𝜂‖𝑣(q) ‖2𝕃 [0,1] . 2

i=1 j=1

(8.38)

We will have shown that this choice attains the minimax lower bound of Theorem 8.3.1 upon proving Theorem 8.3.2 If 𝜂 ≍ (rn)−2q∕(2q+1) , then ( ) ‖m𝜂 − m‖2𝕃2 [0,1] = Op (nr)−2q∕(2q+1) + n−1 . Proof: The development that follows is based on the proof of Theorem 3.2 of Cai and Yuan (2011). In keeping with their work, define f∞,𝜂 (𝑣) = 𝔼frn,𝜂 (𝑣) = 𝔼[(Y11 − m(T11 ))2 ] + ‖m − 𝑣‖2𝕃 [0,1] + 𝜂‖𝑣(q) ‖2𝕃2 [0,1] 2

234

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

and let m𝜂 = argmin𝑣∈𝕎q [0,1] f∞,𝜂 (𝑣). The Cai/Yuan proof then proceeds on the basis of the identity

where

m𝜂 − m = m𝜂 − m ̃𝜂 +m ̃ 𝜂 − m𝜂 + m𝜂 − m,

(8.39)

′′ −1 ′ m ̃ 𝜂 = m𝜂 − (f∞,𝜂 ) frn,𝜂 (m𝜂 ).

(8.40)

′ ′′ In this last expression frn,𝜂 and f∞,𝜂 are, respectively, the first Fréchet derivative of frn,𝜂 and the second Fréchet derivative of f∞,𝜂 in the sense of the definitions in Section 3.6. The validity of (8.39) is indisputable. However, the motivation behind this particular representation for m𝜂 − m is somewhat less obvious. The intermediate approximation m ̃ 𝜂 that appears in (8.39) can be motivated by a formal Taylor type expansion wherein one writes ′ ′ ′′ frn,𝜂 (m𝜂 ) − frn,𝜂 (m𝜂 ) ≈ frn,𝜂 (m𝜂 − m𝜂 ). ′ From Theorem 3.6.3, we know that frn,𝜂 (m𝜂 ) = 0 which leads to ′′ −1 ′ (m𝜂 − m𝜂 ) ≈ −(frn,𝜂 ) frn,𝜂 (m𝜂 ). ′′ by f ′′ Then, a Fisher scoring type of strategy suggests that we replace frn,𝜂 ∞,𝜂 in this last approximation to obtain ′′ −1 ′ m𝜂 ≈ m𝜂 − (f∞,𝜂 ) frn,𝜂 (m𝜂 ) = m ̃ 𝜂.

The task at hand is to show that each of ‖m𝜂 − m ̃ 𝜂 ‖2𝕃2 [0,1] , ‖m ̃ 𝜂 − m𝜂 ‖2𝕃2 [0,1] , and ‖m𝜂 − m‖2𝕃2 [0,1] are Op ((rn)−2q∕(2q+1) + n−1 ) when 𝜂 ≍ (rn)−2q∕(2q+1) . In this regard, the most immediate result is for ‖m𝜂 − m‖2𝕃2 [0,1] . By the definition of m𝜂 , we know that 𝔼[(Y11 − m(T11 ))2 ] + ‖m𝜂 − m‖2𝕃2 [0,1] ≤ f∞,𝜂 (m𝜂 ) ≤ f∞,𝜂 (𝑣) for all 𝑣 ∈ 𝕎q [0, 1]. In particular, if we choose 𝑣 = m, this gives ‖m𝜂 − m‖2𝕃2 [0,1] ≤ 𝜂‖m(q) ‖2𝕃 [0,1] 2

(8.41)

with the right-hand side of this expression being of the order (rn)−2q∕(2q+1) under the assumptions on 𝜂. To proceed further, we must explicitly evaluate the Fréchet derivatives that appear in (8.39). ◽ Lemma 8.3.3 In what follows, let h, 𝑣, 𝑣1 , 𝑣2 be arbitrary elements of 𝕎q [0, 1].

MEAN AND COVARIANCE ESTIMATION

235

′ (h) of 1. The Fréchet derivative of frn,𝜂 at h is the element frn,𝜂 𝔅(𝕎q [0, 1], ℝ) characterized by

2 ∑∑ (Y − h(Tij ))𝑣(Tij ) + 2𝜂 h(q) (t)𝑣(q) (t)dt. ∫0 rn i=1 j=1 ij n

′ frn,𝜂 (h)𝑣 = −

r

1

′′ The second Fréchet derivative frn,𝜂 ∈ 𝔅(𝕎q [0, 1], 𝔅(𝕎q [0, 1], ℝ)) is characterized by

2 ∑∑ (q) (q) 𝑣 (T )𝑣 (T ) + 2𝜂 𝑣 (t)𝑣2 (t)dt. ∫0 1 rn i=1 j=1 1 ij 2 ij n

′′ frn,𝜂 𝑣1 𝑣2 =

r

1

′ (h) of 2. The Fréchet derivative of f∞,𝜂 at h is the element f∞,𝜂 𝔅(𝕎q [0, 1], ℝ) characterized by 1 ′ f∞,𝜂 (h)𝑣 = −2

∫0

1

(m(t) − h(t))𝑣(t)dt + 2𝜂

∫0

h(q) (t)𝑣(q) (t)dt.

′′ ∈ 𝔅(𝕎 [0, 1], 𝔅(𝕎 [0, 1], ℝ)) is The second Fréchet derivative f∞,𝜂 q q characterized by 1 ′′ f∞,𝜂 𝑣1 𝑣2 = 2

1

𝑣1 (t)𝑣2 (t)dt + 2𝜂

∫0

∫0

(q)

(q)

𝑣1 (t)𝑣2 (t)dt.

(8.42)

Proof: Verification of the stated results is similar in both cases. Thus, we only give the details for part 1. In Theorem 3.6.4, take 𝜙(s) = frn,𝜂 (h + sv), s ∈ ℝ, with the consequence that 2 ∑∑ 𝜙 (0) = − (Y − h(Tij ))𝑣(Tij ) + 2𝜂 h(q) (t)𝑣(q) (t)dt ∫0 rn i=1 j=1 ij n

n

1



is the value of the Gâteaux derivative of frn,𝜂 at h applied to 𝑣. This is continuous in h and therefore the Fréchet derivative and Gâteaux derivative coincide due to Theorem 3.6.2. Now set 𝜔(s) = f ′ (h + s𝑣1 )𝑣2 2 ∑∑ =− (Y − h(Tij ) − t𝑣1 (Tij ))𝑣2 (Tij ) rn i=1 j=1 ij n

r

1

+2𝜂

∫0

(q)

(q)

[h(q) (t) + s𝑣1 (t)]𝑣2 (t)dt

236

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

for s ∈ ℝ. Then, 1 ∑∑ (q) (q) 𝜔 (0) = 𝑣1 (Tij )𝑣2 (Tij ) + 2𝜂 𝑣1 (t)𝑣2 (t)dt. ∫ rn i=1 j=1 0 n

r

1



This can be interpreted as follows. The second Gâteaux derivative of frn,𝜂 at h when applied to 𝑣1 produces a linear functional whose value at 𝑣2 is 𝜔′ (0). As this is constant (and, hence, continuous) as a function of h the derivative ′′ ′′ is denoted by frn,𝜂 rather than frn,𝜂 (h) and must also be the second Fréchet derivative. ◽ ′′ The evaluation of m ̃ 𝜂 involves the inverse of the operator f∞,𝜂 that belongs to 𝔅(𝕎q [0, 1], 𝔅(𝕎q [0, 1], ℝ)). It is not obvious at this point that the inverse exists and, in addition, working directly with the space 𝔅(𝕎q [0, 1], 𝔅(𝕎q [0, 1], ℝ)) is somewhat inconvenient. A work around for the latter problem derives from the Reisz representation theorem (Theorem 3.2.1), which tells us that there is an invertible norm-preserving mapping 𝒬 such that 𝒬𝔅(𝕎q [0, 1], ℝ) = 𝕎q [0, 1]. Thus, ′′ ′′ f̃∞,𝜂 ∶= 𝒬f∞,𝜂

(8.43)

′′ is invertible. belongs to 𝔅(𝕎q [0, 1]) and it is invertible if and only if f∞,𝜂 ′′ ̃ We can actually obtain an explicit representation for f∞,𝜂 using the CONS for 𝕎q [0, 1] that was described earlier.

̃ ′′ Lemma 8.3.4 The operator ∑∞ f∞,𝜂 in (8.43) is an invertible element of 𝔅(𝕎q [0, 1]). For any 𝑣 = j=1 𝑣j ej in 𝕎q [0, 1], ∞ 1 ∑ 1 + 𝛾j ′′ −1 ̃ (f∞,𝜂 ) 𝑣 = 𝑣e. 2 j=1 1 + 𝜂𝛾j j j ′′ 𝑣 Proof: Let 𝑣1 ∈ 𝕎q [0, 1] with the consequence that f∞,𝜂 1 is in ′′ ̃ 𝔅(𝕎q [0, 1], ℝ) with representer f∞,𝜂 𝑣1 . Then, for any 𝑣2 ∈ 𝕎q [0, 1], we see from (8.42) that ′′ ′′ f∞,𝜂 𝑣1 𝑣2 = ⟨f̃∞,𝜂 𝑣1 , 𝑣2 ⟩𝕎q [0,1] 1

=2

∫0

1

𝑣1 (t)𝑣2 (t)dt + 2𝜂

∫0

(q)

(q)

𝑣1 (t)𝑣2 (t)dt

and, hence, that ′′ f̃∞,𝜂 𝑣1 = 2

∞ ∑ 1 + 𝜂𝛾j j=1

1 + 𝛾j

⟨𝑣1 , ej ⟩𝕃2 [0,1] ej .

The stated form for the inverse follows immediately from this relation.



MEAN AND COVARIANCE ESTIMATION

237

We now have at our disposal the requisite tools for evaluation of the norms of m ̃ 𝜂 − m and m𝜂 − m ̃ 𝜂 . For reasons that will become clear subsequently, it will be worthwhile to work with an alternative norm ‖ ⋅ ‖𝛼 defined by ‖𝑣‖2𝛼

=

∞ ∑

(1 + 𝛾j )𝛼 𝑣2j

j=1

∑∞ for 𝑣 = j=1 𝑣j ej ∈ 𝕎q [0, 1] and 0 ≤ 𝛼 ≤ 1. The restriction to 𝛼 ∈ [0, 1] means that ‖𝑣‖2𝕃2 [0,1] = ‖𝑣‖20 ≤ ‖𝑣‖2𝛼 ≤ ‖𝑣‖21 = ‖𝑣‖2𝕎 [0,1] . q

As ‖𝑣‖𝛼 is the 𝕃2 [0, 1] norm of 𝕎q [0, 1]. We now claim that

∑∞

j=1 (1

+ 𝛾j )𝛼∕2 𝑣j ej , it is, in fact, a norm for

1 ∑ (1 + 𝛾k )𝛼 ′ = (f (m )e )2 . 4 k=1 (1 + 𝜂𝛾k )2 rn,𝜂 𝜂 k ∞

‖m ̃𝜂 −

m𝜂 ‖2𝛼

(8.44)

To see that this is so observe that, by the definitions of m ̃ 𝜂 and ‖ ⋅ ‖𝛼 , ′′ −1 ′ ‖m ̃ 𝜂 − m𝜂 ‖2𝛼 = ‖(f∞,𝜂 ) frn,𝜂 (m𝜂 )‖2𝛼

=

∞ ∑ k=1

′′ −1 ′ (1 + 𝛾k )𝛼 ⟨(f∞,𝜂 ) frn,𝜂 (m𝜂 ), ek ⟩2𝕃2 [0,1] .

However, from Lemma 8.3.4, ′′ −1 ′ ⟨(f∞,𝜂 ) frn,𝜂 (m𝜂 ), ek ⟩𝕃2 [0,1]

1 ′ ⟨(f ′′ )−1 frn,𝜂 (m𝜂 ), ek ⟩𝕎q [0,1] 1 + 𝛾k ∞,𝜂 1 ′ ′′ −1 = ⟨𝒬frn,𝜂 (m𝜂 ), (f̃∞,𝜂 ) ek ⟩𝕎q [0,1] 1 + 𝛾k 1 ′ = ⟨𝒬frn,𝜂 (m𝜂 ), ek ⟩𝕎q [0,1] 2(1 + 𝜂𝛾j ) =

′ = (frn,𝜂 (m𝜂 )ek )∕2(1 + 𝜂𝛾k ) ′ ′ ′′ −1 because 𝒬frn,𝜂 (m𝜂 ) is the representer of frn,𝜂 (m𝜂 ) and (f̃∞,𝜂 ) is selfadjoint. From (8.44), we see that the probabilistic behavior of ‖m ̃ 𝜂 − m𝜂 ‖2𝛼 is gov′ erned by that of the random sequence {frn,𝜂 (m𝜂 )ek }. To assess the size of the ′ (m ) = 0. sequence elements, we first observe from Theorem 3.6.3 that f∞,𝜂 𝜂

238

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Then, an application of Lemma 8.3.3 reveals that for every 𝑣 ∈ 𝕎q [0, 1] ′ ′ ′ frn,𝜂 (m𝜂 )𝑣 = frn,𝜂 (m𝜂 )𝑣 − f∞,𝜂 (m𝜂 )𝑣

2 ∑∑ (Y − m𝜂 (Tij ))𝑣(Tij ) rn i=1 j=1 ij n

=−

r

1

+2 Consequently, and

∫0

(m(t) − m𝜂 (t))𝑣(t)dt.

′ ′ 𝔼frn,𝜂 (m𝜂 )ek = 𝔼T 𝔼[frn,𝜂 (m𝜂 )ek |T] = 0

′ ′ 𝔼(frn,𝜂 (m𝜂 )ek )2 = Var(frn,𝜂 (m𝜂 )ek ) ( r ) n ∑ ∑ ( ) 4 = Var Yij − m𝜂 (Tij ) ek (Tij ) (rn)2 i=1 j=1 ( r ) ∑( ) 4 = 2 Var Y1j − m𝜂 (T1j ) ek (T1j ) nr j=1

(8.45)

Now, for any random variable Z, Var(Z) = VarT (𝔼[Z|T]) + 𝔼T [Var(Z|T)]. An application of this relation here produces ( r ) ∑( ) Var Y1j − m𝜂 (T1j ) ek (T1j ) j=1

(

) r ∑ ( ) = Var m(T1j ) − m𝜂 (T1j ) ek (T1j ) [

j=1

+𝔼T Var

( r ∑

)] Y1j ek (T1j )| T

.

(8.46)

j=1

For the first term on the right hand side of this expression, we have ( r ) ∑( ) VarT m(Tij ) − m𝜂 (Tij ) ek (Tij ) j=1

= rVar((m(T) − m𝜂 (T))ek (T)) 1

≤r

∫0

(m(t) − m𝜂 (t))2 e2k (t)dt

≤ r max |ek (t)|2 ‖m − m𝜂 ‖2𝕃2 [0,1] t∈[0,1]

= O(r𝜂),

(8.47)

MEAN AND COVARIANCE ESTIMATION

239

where the bound is independent of k. This is due to (8.41) and the fact from Section 2.8 that the |ek | are uniformly bounded. Then, for the second term, [ ( r )] ∑ 𝔼T Var Y1j ek (T1j )| T j=1 1

≤ r(r − 1)

∫0 ∫0

1

1

ek (s)ek (t)R(s, t)dsdt + r

e2k (t)R(t, t)dt.

∫0

(8.48)

Both integrals on the right are bounded as R(s, t) is continuous (see the fol1 1 lowing text) and ‖ek ‖𝕃2 [0,1] = 1. In particular, ∫0 ∫0 ek (s)ek (t)R(s, t)dsdt can 2 be expressed as 𝔼⟨ek , X⟩𝕃2 [0,1] . Combining all the bounds in (8.45)–(8.48), we obtain ( ) 4 1 ′ 𝔼(frn,𝜂 (m𝜂 )ek )2 ≤ 𝔼⟨ek , X⟩2𝕃2 [0,1] + O . n rn Consequently, 4 ∑ (1 + 𝛾k )𝛼 𝔼⟨ek , X⟩2𝕃2 [0,1] n k=1 (1 + 𝜂𝛾k )2 ∞

𝔼[‖m ̃ 𝜂 − m𝜂 ‖2𝛼 ] ≤

(

) ∞ 1 ∑ (1 + 𝛾k )𝛼 +O . rn k=1 (1 + 𝜂𝛾k )2 However, n−1

∞ ∞ ∑ ∑ (1 + 𝛾k )𝛼 2 −1 𝔼⟨e , X⟩ ≤ n (1 + 𝛾k )𝔼⟨ek , X⟩2𝕃2 [0,1] k 𝕃2 [0,1] 2 (1 + 𝜂𝛾 ) k k=1 k=1

= n−1 𝔼‖X‖2𝕎 [0,1] = O(n−1 ) q

and, from Theorem 2.8.3, there are constants 0 < C1 ≤ C2 < ∞ such that ∞ ∞ ∑ ∑ (1 + 𝛾k )𝛼 (1 + C2 k2q )𝛼 ≤ q + (1 + 𝜂𝛾k )2 (1 + C1 𝜂k2q )2 k=1 k=1 ∞

≤q+ = O(𝜂

∫0

(1 + C2 x2q )𝛼 dx (1 + C1 𝜂x2q )2

−𝛼−1∕(2q)

)

as 𝜂 → 0 if 𝛼 + 1∕(2q) < 2. Thus, we are lead to the conclusion that ( ) 1 1 2 ‖m ̃ 𝜂 − m𝜂 ‖𝛼 = Op + . nr𝜂 𝛼+1∕(2q) n

240

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

In particular, by taking 𝛼 = 0 for 𝜂 ≍ (rn)−2q∕(2q+1) , this gives ‖m ̃ 𝜂 − m𝜂 ‖2𝕃2 [0,1] = Op ((nr)−2q∕(2q+1) + n−1 ).

(8.49)

The remaining quantity to consider is ‖m𝜂 − m ̃ 𝜂 ‖2𝛼 . The first step in this direction is to obtain a useful analytic form for m𝜂 − m ̃ 𝜂 . In this regard, observe that ′′ −1 ′ m𝜂 − m ̃ 𝜂 = m𝜂 − m𝜂 + (f∞,𝜂 ) frn,𝜂 (m𝜂 ) ′′ −1 ′′ ′ = (f∞,𝜂 ) [f∞,𝜂 (m𝜂 − m𝜂 ) + frn,𝜂 (m𝜂 )]. ′ (m ) = 0. Then, using As m𝜂 maximizes frn,𝜂 , Theorem 3.6.3 tells us that frn,𝜂 𝜂 Lemma 8.3.3, we see that for every 𝑣 ∈ 𝕎q [0, 1] ′′ ′ [f∞,𝜂 (m𝜂 − m𝜂 ) + frn,𝜂 (m𝜂 )]𝑣 ′′ ′ ′ = [f∞,𝜂 (m𝜂 − m𝜂 ) + frn,𝜂 (m𝜂 ) − frn,𝜂 (m𝜂 )]𝑣 1

=2

(m𝜂 (t) − m𝜂 (t))𝑣(t)dt

∫0

2 ∑∑ (m (T ) − m𝜂 (Tij ))𝑣(Tij ) rn i=1 j=1 𝜂 ij n



r

′′ ′′ = [f∞,0 (m𝜂 − m𝜂 ) − frn,0 (m𝜂 − m𝜂 )]𝑣

or ′′ −1 ′′ ′′ m𝜂 − m ̃ 𝜂 = (f∞,𝜂 ) [f∞,0 (m𝜂 − m𝜂 ) − frn,0 (m𝜂 − m𝜂 )].

Therefore, the same argument that produced (8.44) gives us ‖m𝜂 − m ̃ 𝜂 ‖2𝛼 1 ∑ (1 + 𝛾k )𝛼 ′′ = ([f ′′ (m − m𝜂 ) − frn,0 (m𝜂 − m𝜂 )]ek )2 4 k=1 (1 + 𝜂𝛾k )2 ∞,0 𝜂 ∞

with ′′ ′′ [f∞,0 (m𝜂 − m𝜂 ) − frn,0 (m𝜂 − m𝜂 )]ek 1

=2

∫0

(m𝜂 (t) − m𝜂 (t))ek (t)dt 2 ∑∑ − (m (T ) − m𝜂 (Tij ))ek (Tij ). rn i=1 j=1 𝜂 ij n

r

(8.50)

MEAN AND COVARIANCE ESTIMATION

241

∑∞ Now write m𝜂 − m𝜂 = 𝜈=1 h𝜈 e𝜈 in (8.50) and apply the Cauchy–Schwarz inequality to see that for arbitrary 𝜃 ∈ (1∕(2q), 1], we have [∞ { }]2 ∞ n r ∑ (1 + 𝛾k )𝛼 ∑ 1 ∑∑ 2 ‖m𝜂 − m ̃ 𝜂 ‖𝛼 = h𝜈 𝛿k𝜈 − e (T )e (T ) rn i=1 j=1 k ij 𝜈 ij (1 + 𝜂𝛾k )2 𝜈=1 k=1 ≤ ‖m𝜂 − m𝜂 ‖2𝜃 with

∞ ∞ ∑ (1 + 𝛾k )𝛼 ∑ 2 (1 + 𝛾𝜈 )−𝜃 Vk𝜈 2 (1 + 𝜂𝛾 ) k k=1 𝜈=1

1 ∑∑ Vk𝜈 = 𝛿k𝜈 − e (T )e (T ). rn i=1 j=1 k ij 𝜈 ij n

r

We have 𝔼Vk𝜈 = 0 so that 1 Var(ek (T)e𝜈 (T)) rn ( ) 1 1 ≤ 𝔼[e2k (T)e2𝜈 (T)] = O rn rn

2 𝔼Vk𝜈 = Var(Vk𝜈 ) =

∑∞ due to the uniform boundedness of the basis functions. As 𝜈=1 (1 + 𝛾𝜈 )−𝜃 is finite for 𝜃 > 1∕(2q), it follows that ( ) 1 2 ‖m𝜂 − m ̃ 𝜂 ‖𝛼 = Op ‖m𝜂 − m𝜂 ‖2𝜃 . rn𝜂 𝛼+1∕(2q) In particular, if 𝛼 > 1∕(2q) ‖m𝜂 −

m ̃ 𝜂 ‖2𝛼

( = Op

1 rn𝜂 𝛼+1∕(2q)

) ‖m𝜂 − m𝜂 ‖2𝛼 .

(8.51)

From (8.51), we can conclude that ‖m𝜂 − m ̃ 𝜂 ‖2𝛼 = op (‖m𝜂 − m𝜂 ‖2𝛼 ). So, ‖m ̃ 𝜂 − m𝜂 ‖𝛼 ≥ ‖m𝜂 − m𝜂 ‖𝛼 − ‖m𝜂 − m ̃ 𝜂 ‖𝛼 = (1 − op (1))‖m𝜂 − m𝜂 ‖𝛼 ; i.e., ‖m𝜂 − m𝜂 ‖2𝛼 = Op (‖m ̃ 𝜂 − m𝜂 ‖2𝛼 ). Using this fact in combination with (8.51) and (8.49) reveals that for 𝛼 > 1∕(2q) ‖m𝜂 − m ̃ 𝜂 ‖2𝕃2 [0,1] ≤ ‖m𝜂 − m ̃ 𝜂 ‖2𝛼 ( ) 1 = Op Op (‖m ̃ 𝜂 − m𝜂 ‖2𝛼 ) 𝛼+1∕(2q) rn𝜂 ( ) ( ) 1 1 1 = Op Op + rn𝜂 𝛼+1∕(2q) nr𝜂 𝛼+1∕(2q) n = op (n−1 + (nr𝜂 1∕(2q) )−1 )

242

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

provided that 𝜂 → 0 while rn → ∞ in such a way that rn𝜂 𝛼+1∕(2q) → ∞. This latter condition holds for 𝜂 ≍ (rn)−2q∕(2q+1) if 𝛼 < 1∕2. Our results were predicated on having 𝛼 > 1∕(2q). Beyond that, the choice for 𝛼 was arbitrary and both bounds can be satisfied whenever q > 1. We now turn our attention to the problem of estimating the covariance kernel for the X process. To simplify matters, we assume that 𝔼[X(t)] = 0 for all t ∈ [0, 1]; i.e., the mean function is either known or identically zero. For estimation in the general case, one may replace the response data by Yij − m ̂ (Tij ) with m ̂ some suitable estimator of m. Provided this estimator converges sufficiently fast the results that follow will not be affected by this substitution. In particular, it follows from Theorem 4 of Cai and Yuan (2010) that the smoothing spline estimator in Theorem 8.3.2 will work in this capacity. As before, the sample paths of the X process are presumed to lie in 𝕎q [0, 1]. Theorem 7.5.2, therefore has the consequence that the process covariance kernel is in the direct product Hilbert space ℍ = 𝕎q [0, 1] ⊗ 𝕎q [0, 1]. Similar to our scheme for estimation of the mean function, we estimate K by K𝜂 = argmin𝑣∈ℍ frn,𝜂 (𝑣) with frn,𝜂 now defined by frn,𝜂 (𝑣) = (r(r − 1)n)−1

n ∑ ∑

(Yij Yik − 𝑣(Tij , Tik ))2 + 𝜂‖𝑣‖2ℍ .

(8.52)

i=1 1≤j · · ·. Assume that 𝜖n ∶= ‖𝒦n − 𝒦‖ −−→ 0 as n → ∞ and let {bn } be a sequence of constants tending to ∞ and satisfying −2∕3 bn = Op (𝜖n ). Define k = kn to be the largest positive integer k such that 1∕2

p p 𝜖n −−→ 0 and (𝜆kn bn )−1 −−→ 0 minj≤k 𝜂jn

p

k(bn 𝜖n )1∕2 −−→ 0,

(10.11)

for 𝜂jn = (1∕2)inf s≠j |𝜆jn − 𝜆sn |. Then, 1. k diverges in probability,

( ) 3∕2 | = op bn 𝜖n , and ( ) ( ) 3∕2 −1∕2 (k) −1∕2 1∕2 3. ‖(𝒦(k) ) − (𝒦 ) ‖ = o b 𝜖 . n n p n + Op k(bn 𝜖n ) −1∕2

2. maxj≤k |𝜆jn

−1∕2

− 𝜆j

Proof: Theorem 4.2.8 has the implication that sup |𝜆jn − 𝜆j | ≤ 𝜖n j

and, therefore, for any fixed J 𝜆Jn ≥ 𝜆J − 𝜖n and min 𝜂jn ≥ min 𝜂j − 𝜖n j≤J

j≤J

with 𝜂j = (1∕2) inf s≠j |𝜆j − 𝜆s |. Part 1 of the lemma is a straightforward consequence of these two relations. A Taylor expansion reveals that −1∕2

x−1∕2 − x0

1 −3∕2 3 = − x0 (x − x0 ) + x̂ −5∕2 (x − x0 )2 , 2 8

where x̂ is some value between x and x0 . An application of this fact then produces ( ) −1∕2 −1∕2 −3∕2 |𝜆jn − 𝜆j | = 𝜆j Op |𝜆jn − 𝜆j | ( )−5∕2 ( ) + 𝜆j − Op (𝜖n ) Op |𝜆jn − 𝜆j |2

278

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

for j ≤ k. Thus, −1∕2

max |𝜆jn j≤k

−1∕2

− 𝜆j

( ) ( ) ( ) 3∕2 5∕2 3∕2 | = op bn 𝜖n + op bn 𝜖n2 = op bn 𝜖n

and part 2 has been proved. To show part 3, write −1∕2 (𝒦(k) − (𝒦(k) )−1∕2 n )

=

k ∑

−1∕2

𝜆jn

j=1

=

k ∑

𝒫jn −

k ∑

−1∕2

𝜆j

𝒫j

j=1 −1∕2

𝜆jn

(𝒫jn − 𝒫j ) +

j=1

k ∑ −1∕2 −1∕2 (𝜆jn − 𝜆j )𝒫j .

(10.12)

j=1

Now apply part 2 of the lemma to see that ‖∑ ‖ ( ) ‖ k −1∕2 ‖ −1∕2 3∕2 ‖ (𝜆 ‖ − 𝜆 )𝒫 = o b 𝜖 (10.13) n jn ‖ p n . ‖ j ‖ j=1 jn ‖ ‖ ‖ Using (5.17) with 𝒫j and 𝒫̃ j there chosen to be 𝒫jn and 𝒫j in this context gives 𝛿jn ‖𝒫jn − 𝒫j ‖ ≤ 1 − 𝛿jn 1∕2

for 𝛿jn = ‖𝒦n − 𝒦‖∕𝜂jn = O(𝜖n ) uniformly for all j ≤ k. Thus, ‖∑ ‖ ∑ k ‖ k −1∕2 ‖ −1∕2 ‖ 𝜆 ‖ (𝒫 − 𝒫 ) ≤ 𝜆 ‖𝒫jn − 𝒫j ‖ jn j ‖ ‖ ‖ j=1 jn ‖ j=1 jn ‖ ‖ k ( )∑ 1∕2 −1∕2 = Op 𝜖n 𝜆jn (

j=1 1∕2

1∕2

= Op 𝜖n kbn and part 3 follows from (10.12)–(10.14).

) (10.14) ◽

We now apply Lemma 10.2.1 to our cca problem. As discussed in Chapter 9, estimation of 𝒦i in fda can be achieved with a root-n rate for dense data while a slower, nonparametric rate is the best that can be expected for the sparse data scenario.

CANONICAL CORRELATION ANALYSIS

279

Theorem 10.2.2 For i = 1, 2, let 𝒦i be infinite-dimensional with distinct eigenvalues 𝜆i1 > 𝜆i2 > · · · > 0 and let 𝒦in be an estimator of 𝒦i with eigenvalues 𝜆i1n > 𝜆i2n > · · ·. In addition, let 𝒦12n be an estimator of 𝒦12 . Assume that 𝜖n∶= max (‖𝒦1n − 𝒦1 ‖, ‖𝒦2n − 𝒦2 ‖, ‖𝒦12n − 𝒦12 ‖) = Op (1) −2∕3 and let bn = Op (𝜖n ) with k = kn the largest integer such that (10.11) holds for both 𝒦1n and 𝒦2n . Then ( ) ( ) 3∕2 (k) 1∕2 ‖ℛ(k) − ℛ ‖ = o b 𝜖 . (10.15) n n p n + Op k(bn 𝜖n ) Proof: Write ( ) (k) −1∕2 (k) −1∕2 (k) ℛ(k) − ℛ = (𝒦 ) − (𝒦 ) 𝒦12n (𝒦(k) )−1∕2 n 1n 1 2n + (𝒦(k) )−1∕2 (𝒦12n − 𝒦12 ) (𝒦(k) )−1∕2 1 2n ( ) (k) −1∕2 (k) −1∕2 −1∕2 + (𝒦(k) ) 𝒦 (𝒦 ) − (𝒦 ) . 12 1 2n 2

(10.16)

The first term is ( ) (k) −1∕2 −1∕2 (𝒦(k) ) − (𝒦 ) 𝒦12n (𝒦(k) )−1∕2 1n 1 2n ( ) { } (k) −1∕2 (k) 1∕2 (k) −1∕2 (k) −1∕2 −1∕2 = (𝒦(k) ) − (𝒦 ) (𝒦 ) (𝒦 ) 𝒦 (𝒦 ) , 12n 1n 1 1n 1n 2n which has the rate of the theorem as a result of Lemma 10.2.1 and the fact that (𝒦(k) )1∕2 and (𝒦(k) )−1∕2 𝒦12n (𝒦(k) )−1∕2 are bounded in probability. The 1n 1n 2n third term in (10.16) can be dealt with in the same manner. Finally, write the second term of (10.16) as ( ) (k) −1∕2 (k) −1∕2 −1∕2 (𝒦(k) ) − 𝒦 (𝒦 ) − (𝒦 ) (𝒦 ) 12n 12 1 2n 2 + (𝒦(k) )−1∕2 (𝒦12n − 𝒦12 ) (𝒦(k) )−1∕2 , 1 2 where both terms in the sum can be shown to be dominated by the rate in (10.15) due to Lemma 10.2.1. ◽ Theorem 10.2.3 Assume that 𝒞 is HS with singular values 𝜌1 ≥ 𝜌2 ≥ · · · ≥ 0. In addition, let the assumptions of Theorem 10.2.2 hold and let 𝜌1n ≥ 𝜌2n ≥ · · · ≥ 0 be the singular values of ℛ(k) n . Then, as n → ∞, ( ) ( ) 3∕2 sup |𝜌2jn − 𝜌2j | = op bn 𝜖n + Op 𝛿k + k(bn 𝜖n )1∕2 , (10.17) j

280

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

where 𝛿k is defined as in (10.8). In addition, if the nonzero singular values of 𝒞 , 𝒞 (k) and 𝒞(k) n are all distinct, for any fixed 𝜌j ≠ 0 and i = 1, 2, {( } )2 𝔼1∕2 Uij − Uijn | (𝜒1r , 𝜒2r ), r = 1, … , n ( ) ( ) 3∕2 = op bn 𝜖n + Op 𝛿k + k(bn 𝜖n )1∕2 , (10.18) where (𝜒1 , 𝜒2 ) is assumed to be independent of (𝜒1r , 𝜒2r ), r = 1, … , n, in this calculation. Proof: The fact that (10.17) holds is a consequence of (10.8), Theorem 10.2.2, and Theorem 4.2.8. We will show (10.18) in two steps. The first of these is to establish that ‖Uij(k) − Uij ‖𝕃(𝜒i ) = Op (𝛿k )

(10.19)

(k) (k) for i = 1, 2, where U1j , U2j denote the canonical variables computed using ( )2 (k) 𝒞 (k) . Recall that ‖Uij − Uij ‖2𝕃(𝜒 ) = 𝔼 Uij(k) − Uij . If we take i

Δ = 𝒞 (k) − 𝒞 , then Theorem 5.2.2 provides the justification for (10.19) once we observe that for any 𝒯 ∈ 𝔅(𝔾1 , 𝔾2 ) ‖∑ ‖2 𝜌l ⟨ f2l , 𝒯f1j ⟩2 + 𝜌j ⟨ f2j , 𝒯f1l ⟩2 ‖ ‖ ‖ f1l ‖ ‖ ‖ 𝜌2l − 𝜌2j ‖ l≠j ‖ ‖ ‖1 [ ] ∑ ∑ 2 ≤ 𝜌2l ⟨ f2l , 𝒯f1j ⟩22 + 𝜌2j ⟨ f2j , 𝒯f1l ⟩22 ( )2 j≠l j≠l min 𝜌2j − 𝜌l j≠l



(

2𝜌21

min 𝜌2j − 𝜌l j≠l ( ) = O ‖𝒯‖2 .

[ ] 2 ∗ 2 )2 ‖𝒯f1j ‖2 + ‖𝒯 f2j ‖1

Thus, the congruence between 𝔾i and 𝕃(𝜒i ) gives (10.19). The next step is to show that {( } )2 (k) (k) 1∕2 𝔼 Uijn − Uij |(𝜒1r , 𝜒2r ), r = 1, … , n ( ) ( ) 3∕2 = op bn 𝜖n + Op k(bn 𝜖n )1∕2

CANONICAL CORRELATION ANALYSIS

281

for i = 1, 2. The proof of this result is analogous to that of (10.19). As ℛ(k) n and ℛ(k) are finite dimensional, we let (k) Δ = ℛ(k) n −ℛ

in 𝔅(ℍ) and apply the above-mentioned argument together with Theorem 10.2.2. ◽

10.3

Prediction and regression

We know from Section 1.1 that cca provides the basis for other mva techniques such as MANOVA and discriminant analysis. This remains true more generally and, in particular, in the context of fda. For the following several sections, we will expand on this comment by working exclusively with situations where the two random elements correspond to zero mean stochastic processes X1 , X2 that are jointly measurable in t, 𝜔 and can also be viewed as random elements of 𝕃2 ∶= 𝕃2 (E, ℬ(E), 𝜇). Recall from Section 10.1 that canonical correlations defined from the process and random element perspective are identical in this instance. We aim to develop parallels of many of the ideas in Section 1.1 in this infinite-dimensional environment. Our first such foray proceeds in the direction of optimal prediction. Suppose that our interest is in the X1 process but only the X2 process will actually be observed. In that case, it may be of interest to assess the value of X1 (t) for t ∈ E using a best linear predictor based on the X2 process. By this, we mean that we want to find X 1 (t) ∈ 𝕃2 (X2 ) such that 𝔼|X1 (t) − X 1 (t)|2 =

inf

Y∈𝕃2 (X2 )

𝔼|X1 (t) − Y|2 .

(10.20)

In this regard, we have the following. Theorem 10.3.1 The best linear predictor of X1 (t) is X 1 (t) = Z2 (K12 (t, ⋅)). Proof: From Theorem 10.1.5, we know that K12 (t1 , ⋅) ∈ 𝔾2 so that X 1 (t1 ) is well defined. The result will follow once we have shown that X 1 (t1 ) is the orthogonal projection of X1 (t) onto the Hilbert space 𝕃2 (X2 ). Take Y ∈ 𝕃2 (X2 ) and select a sequence of random variables of the form Yn =

n ∑

ajn X2 (tjn )

j=1

for real numbers ajn and points tjn ∈ E such that 𝔼|Y − Yn |2 → 0 as n → ∞. Then,

282

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

𝔼

[(

n ) ] ∑ X1 (t1 ) − X 1 (t1 ) Yn = ajn 𝔼[X1 (t1 )X2 (tjn )] j=1



n ∑

ajn 𝔼[X2 (tjn )Z2 (K12 (t1 , ⋅))]

j=1

=

n ∑

ajn K12 (t1 , tjn )

j=1



n ∑

ajn ⟨K12 (t1 , ⋅), K2 (tjn , ⋅)⟩2

j=1

= 0. The continuity of the inner product along with (2.14) implies the result.



An application of Theorem 10.3.1 in combination with Theorem 10.1.6 produces an fda version of identity (1.16) that connects the prediction problem with cca. Corollary 10.3.2 If 𝒞12 is compact with singular system {(𝜌j , f1j , f2j )}, ∑∞ X 1 (⋅) = j=1 𝜌j Z2 ( f2j ) f1j (⋅). Example 10.3.3 Let us consider the application of Corollary 10.3.2 in the finite-dimensional case of Example 10.1.4. In that instance, the vector representations of the singular functions f1j (⋅), f2j (⋅) have the form f1j = 𝒦1 a1j , f2j = 𝒦2 a2j and Z2 ( f2j ) = aT2j X2 =∶ U2j 1∕2

1∕2

with 𝒦1 a1j , 𝒦2 a2j the singular vectors that correspond to the jth singular value 𝜌j of the matrix ℛ12 in (1.11). Thus, X1 =

∞ ∑

𝜌j Z2 ( f2j ) f1j

j=1

=

∞ ∑ j=1

which agrees with (1.16).

𝜌j U2j 𝒦1 a1j ,

CANONICAL CORRELATION ANALYSIS

283

An alternative take on the prediction problem derives from imposition of a linear model such as in (1.17). An analog of that finite-dimensional relation that could be used here is X1 (t) =

∫E

𝛽(t, s)X2 (s)d𝜇(s) + 𝜀(t)

(10.21)

with 𝛽(t, ⋅) ∈ 𝕃2 and 𝜀(⋅) a zero mean process that is uncorrelated with X2 (⋅). However, a bit of thought is needed to make such a formulation rigorous. In we view ∫E 𝛽(t, s)X2 (s)d𝜇(s) as a limit of weighted sums of finitely many values from the X2 process, this leads us to the assumption that it should be an element of 𝕃2 (X2 ). However, if that is true then it must be the image of some element from 𝔾2 under the isometric mapping that connects the two spaces. With that in mind, we can now advance a tenable form for 𝛽. Let 𝒯 be an element of 𝔅(𝔾2 , 𝔾1 ) with associated kernel R(⋅, t) = 𝒯∗ K1 (⋅, t)

(10.22)

from Section 4.7. We will refer to R as a regression kernel for reasons that will become clear shortly. If, for example, 𝒯 is HS, Theorem 4.7.2 tells us that we can write R(s, t) =

∞ ∞ ∑ ∑

𝜆1i 𝜆2j bij e1i (t)e2j (s)

i=1 j=1

with

∞ ∞ ∑ ∑

𝜆1i 𝜆2j b2ij < ∞.

i=1 j=1

An updated version of the regression model (10.21) now appears as X1 (t) = Z2 (R(⋅, t)) + 𝜀(t)

(10.23)

with R as defined in (10.22) and 𝜀(⋅) as before. Now Z2 (R(⋅, t)) =

∞ ∞ ∑ ∑

𝜆1i bij e1i (t)⟨X2 , e2j ⟩.

i=1 j=1

If

∑∞ ∑∞ i=1

2 2 j=1 𝜆1i bij

< ∞, (10.23) and (10.21) coincide with 𝛽(s, t) =

∞ ∞ ∑ ∑ i=1 j=1

𝜆1i bij e1i (t)e2j (s).

(10.24)

284

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

The fact that X2 (⋅) and 𝜀(⋅) are uncorrelated means that 𝜀(⋅) is uncorrelated with ⟨X2 , e2j ⟩ for all j due to Theorem 3.1.7. The (X1 , X2 ) cross-covariance kernel is therefore seen to be [ ] K12 (t1 , t2 ) = 𝔼 Z2 (R(⋅, t1 ))X2 (t2 ) = ⟨R(⋅, t1 ), K2 (⋅, t2 )⟩2 = R(t2 , t1 ). Canonical correlation now gives us a canonical form for the regression kernel as a result of Theorem 10.1.6. In addition, our operator 𝒯 coincides with the operator 𝒞12 in (10.7) because (𝒞12 f )(t) = ⟨R(⋅, t), f (⋅)⟩2 = ⟨K1 (⋅, t), (𝒯f )(⋅)⟩1 = (𝒯f )(t). Among other things, we can conclude here that, just as in the mva setting, a linear regression relationship can exist between X1 and X2 only when at least one of the canonical correlations is nonzero.

10.4

Factor analysis

In this section, we explore the problem of factor analysis under the same conditions as Section 10.3. For that purpose, we consider the signal-plusnoise model given by X1 (t) = X2 (t) + 𝜀(t) (10.25) for t ∈ E. Here X1 , X2 , and 𝜀 are all zero mean, 𝕃2 (E, ℬ(E), 𝜇) valued processes with covariance kernels K1 , K2 , and K𝜀 . The signal process, X2 , is assumed to be uncorrelated with the noise process 𝜀. With this latter specification model (10.25) can be viewed as the natural extension of the mva factor model (1.21) to continuous time. Following along the lines of our factor analysis development in Section 1.1, our first step should be canonical analysis of the X1 , X2 processes. In this instance, we find that the cross-covariance kernel (10.6) is K12 = K2 and the covariance kernel for X1 is K1 = K2 + K𝜀 . Theorem 2.7.10 tells us that 𝔾1 consists of sums of functions from the RKHSs 𝔾2 and ℍ(K𝜀 ) = 𝔾(𝒦𝜀 ) and, in particular, any f2 ∈ 𝔾2 is also an element of 𝔾1 . Thus, by (10.7), (𝒞12 f2 )(t) = ⟨K12 (t, ⋅), f2 ⟩2 = ⟨K2 (t, ⋅), f2 ⟩2 = f2 (t)

CANONICAL CORRELATION ANALYSIS

and

285

∗ (𝒞12 𝒞12 f2 )(t) = ⟨K21 (t, ⋅), f2 ⟩1 = ⟨K2 (t, ⋅), f2 ⟩1 .

A squared canonical correlations 𝜌2 and its associated singular function f2 for the X2 space must therefore satisfy ⟨K2 (t, ⋅), f2 ⟩1 = ⟨K1 (t, ⋅) − K𝜀 (t, ⋅), f2 ⟩1 = f2 (t) − ⟨K𝜀 (t, ⋅), f2 ⟩1 = 𝜌2 f2 (t) for all t ∈ E. This gives us ⟨K𝜀 (t, ⋅), f2 ⟩1 = (1 − 𝜌2 ) f2 (t). However, when we use Theorem 4.3.1 here we see that the singular functions for the X1 and X2 spaces are related by f1 = f2 ∕𝜌. Hence, we have the identity ⟨K𝜀 (t, ⋅), f1 ⟩1 = (1 − 𝜌2 ) f1 (t)

(10.26)

that characterizes the singular functions for the X1 space. Example 10.4.1 Consider the mva setting of Example 10.1.4 with 𝒦1 , 𝒦2 , and 𝒦𝜀 now representing the variance–covariance matrices for the random p-vectors X1 , X2 , and 𝜀. In this case, we know that f1 (⋅) ∈ 𝔾1 means that f1 = 𝒦1 a1 for some a1 ∈ ℝp and (10.26) becomes 𝒦𝜀 𝒦−1 1 𝒦1 a1 = 𝒦𝜀 a1 = (1 − 𝜌2 )𝒦1 a1 , thereby returning us to relation (1.22) that was used to develop factor analysis in the finite-dimensional case. As in the mva setting, to complete the factor model, we need to add structure to the X2 process. Specifically, we will assume that X2 admits the representation ∞ ∑ X2 (t) = Zj 𝜙j (t), j=1

where the Zj are zero mean, uncorrelated random variables with unit variance and {𝜙j } is an orthogonal sequence of functions in ℍ(K𝜀 ) with ‖𝜙j ‖2ℍ(K ) = 𝜀 ∑ 𝛾j , j = 1, … for positive values 𝛾j that satisfy ∞ j=1 𝛾j < ∞. In the language of Section 1.1, the Zj occupy the role of the factors while the 𝜙j play the part of factor loadings.

286

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

The covariance kernel for the X2 process and, hence, the cross-covariance kernel is now seen to have the form K12 (s, t) = K2 (s, t) =

∞ ∑

𝛾j 𝜙̃ i (s)𝜙̃ i (t)

(10.27)

j=1

with

√ 𝜙̃ j = 𝜙j ∕ 𝛾j

being an orthonormal sequence in ℍ(K𝜀 ). The function K2 (⋅, t) is a welldefined element of ℍ(K𝜀 ), which has the consequence that the elements of ℍ(K1 ) must be the same as those in ℍ(K𝜀 ). The kernels K1 and K𝜀 can be directly related using the operator ̃ ∈ 𝔅(ℍ(K𝜀 )) defined by Φ ̃ = Φ

∞ ∑

𝛾j 𝜙̃ j ⊗ℍ (K𝜀 ) 𝜙̃ j .

(10.28)

j=1

This is a nonnegative operator whose eigenvector 𝜙̃ j corresponds to the eigeñ is invertible. Keeping that in mind, value 𝛾j with the consequence that (I + Φ) we claim that ̃ −1 f , g⟩ℍ(K ) . ⟨ f , g⟩1 = ⟨(I + Φ) (10.29) 𝜀 To establish (10.29) observe that for f ∈ ℍ(K1 ) ̃ f , K1 (⋅, t)⟩1 = ⟨ f , K1 (⋅, t)⟩1 + ⟨Φ ̃ f , K1 (⋅, t)⟩1 ⟨(I + Φ) ̃ f )(t) = f (t) + (Φ ̃ f , K𝜀 (⋅, t)⟩ℍ(K ) = ⟨ f , K𝜀 (⋅, t)⟩ℍ(K𝜀 ) + ⟨Φ 𝜀 = ⟨ f , K𝜀 + K2 ⟩ℍ(K𝜀 ) = ⟨ f , K1 ⟩ℍ(K𝜀 ) because ̃ f , K𝜀 (⋅, t)⟩ℍ(K ) = ⟨Φ 𝜀

∞ ∑

𝛾j ⟨𝜙̃ j , f ⟩ℍ(𝒦𝜀 ) 𝜙̃ j (t)

j=1

= ⟨K2 (⋅, t), f ⟩ℍ(K𝜀 ) . As linear combinations of K1 are dense in 𝔾1 , we conclude that ̃ f , g⟩1 = ⟨ f , g⟩ℍ(𝒦 ) for every f , g ∈ 𝔾1 and the claim has been ⟨(I + Φ) 𝜀 verified.

CANONICAL CORRELATION ANALYSIS

287

As a result of (10.29), we see that for all f ∈ 𝔾1 ̃ −1 K1 (⋅, t), f ⟩ℍ(K ) f (t) = ⟨K1 (⋅, t), f ⟩1 = ⟨(I + Φ) 𝜀 ̃ −1 K1 (⋅, t) = K𝜀 (⋅, t). Using this in which has the implication that (I + Φ) (10.26) produces (1 − 𝜌2 ) f1 (t) = ⟨K𝜀 (t, ⋅), f1 ⟩1 ̃ −1 K1 (⋅, t), f1 ⟩1 = ⟨(I + Φ) ̃ −1 f1 ⟩1 = ⟨K1 (⋅, t), (I + Φ) ( ) ̃ −1 f (t) = (I + Φ) or

𝜌2 f1 . 1 − 𝜌2 ̃ 𝜙̃ j = 𝛾j 𝜙̃ j and it follows that 𝜙̃ j is, in fact, the singular function However, Φ for the X1 space with associated canonical correlation 𝜌j . The value of 𝛾j can be recovered from the relation 𝛾j = 𝜌2j ∕(1 − 𝜌2j ). The idealized prescription for factor analysis would now proceed something like this. If we knew both K1 and K𝜀 , we could determine the X1 singular vectors and canonical correlations using (10.26) and thereby create the cross-covariance kernel K12 = K2 in (10.27). Theorem 10.3.1 now gives us the best linear predictor of X2 (t) as Z1 (K2 (⋅, t)). However, the aim is to predict the factors Z1 , Z2 , …. That problem can be addressed similarly using the cross-covariance kernel between the Zi and X1 : namely, ̃ f1 = Φ

̃ Cov (Zi , X1 (t)) = 𝜙(t). Another application of Theorem 10.3.1 reveals that Z1 (𝜙̃ i ) is the best linear predictor of the ith factor. Example 10.4.2 Continuing with Example 10.4.1, the vector representation of the X2 (⋅) process under the factor model appears as X2 = ΦZ with Φ = {𝜙j (ti )} a p × r matrix of coefficients and Z a random r-vector with mean zero and an identity for its variance–covariance matrix. The operator in (10.28) becomes r ∑ ̃ = Φ 𝛾j 𝜙̃ j 𝜙̃ T j

j=1

with 𝜙̃ j ∶= 𝒦𝜀 𝜙j for 𝜙j the jth column of the matrix Φ. Thus, under the present formulation, the factor analysis constraint from Section 1.1 that 𝜙Ti 𝒦−1 𝜀 𝜙j = 0 for i ≠ j can be appreciated as an orthogonality condition for the loading vectors in the Hilbert space 𝔾(𝒦𝜀 ) that represents their home. −1∕2

288

10.5

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

MANOVA and discriminant analysis

One formulation of functional analysis of variance (ANOVA) and discriminant analysis can be developed by assuming that we are observing a stochastic process X1 whose sample paths can lie in one of J possible populations. If 𝜋j is the probability that the reading comes from population j, we can view the problem as seeing the process pair (X1 , X2 ), where X1 takes values in 𝕃2 (E, ℬ(E), 𝜇) and X2 = {X2 (j) ∶ j = 1, … , J} for dichotomous random variables X2 (j) that take the values 1 or 0 with probability 𝜋j and 1 − 𝜋j , respectively, depending on whether or not X1 derives from the jth population. Let mj (⋅) = 𝔼[X1 (⋅)|X2 = j] for j = 1, … , J be the conditional mean functions for X1 that correspond to the J different populations. The overall or grand mean for X1 is then m(⋅) = 𝔼[X1 (⋅)] =

J ∑

𝜋j mj (⋅).

j=1

Unlike the developments in previous sections, we can no longer assume that m or the mj vanish. To account for this, we now need to work with the X1 covariance kernel R1 (s, t) = 𝔼[(X1 (s) − m(s))(X1 (t) − m(t))] and cross-covariance kernel R12 (s, j) = 𝔼[(X1 (s) − m(s))(X2 (j) − 𝜋j )] ( ) = 𝜋j mj (s) − m(s) The kernel R1 generates an RKHS 𝔾1 with norm and inner product we will continue to denote by ⟨⋅, ⋅⟩1 and ‖ ⋅ ‖1 . Movement between the spaces 𝔾1 and 𝕃2 (X1 ) is then carried out via the mapping Z1 as before. For our more immediate purposes, it will suffice to work with the centered process X̃ 2 = {X̃ 2 (j) ∶ j = 1, … , J} with X̃ 2 (j) = X2 (j) − 𝜋j , j = 1, … , J. Its covariance kernel is K2 (i, j) = 𝔼[X̃ 2 (i)X̃ 2 (j)] = 𝔼[(X2 (i) − 𝜋i )(X2 (j) − 𝜋j )] = 𝛿ij 𝜋j − 𝜋i 𝜋j and, of course,

R12 (s, j) = 𝔼[(X1 (s) − m(s))X̃ 2 (j)].

CANONICAL CORRELATION ANALYSIS

289

The K2 kernel has a matrix representation as 𝒦2 = diag(𝜋1 , … , 𝜋J ) − 𝜋𝜋 T ∑J with 𝜋 = (𝜋1 , … 𝜋J )T and j=1 𝜋j = 1. Thus, from Example 2.7.8, we know that 𝔾2 ∶= ℍ(K2 ) consists of functions on {1, … , J} of the form f2 (⋅) =

J ∑

a2j K2 (⋅, j)

j=1

for a2 = (a21 , … a2J )T an element of the orthogonal complement of the null ∑J space of the matrix 𝒦2 : i.e., j=1 a2j = 0. A convenient choice for a generalized inverse of 𝒦2 is 𝒦−2 = diag(𝜋1−1 , … , 𝜋J−1 ) and using this option in (2.34) reveals that the inner product of two functions f2 , f2′ in 𝔾2 is J ∑ f2 (j) f2′ (j) ⟨ f2 , f2′ ⟩2 = . 𝜋j j=1 The isometric mapping that connects 𝕃2 (X̃ 2 ) and 𝔾2 is found to be Z2 ( f2 ) =

J ∑ f2 (j)[X2 (j) − 𝜋j ]

𝜋j

j=1

=

J ∑ f2 (j)X2 (j) 𝜋j j=1

for f2 ∈ 𝔾2 . As in (10.7), cca now revolves around the singular value expansion of the operator 𝒞12 defined by (𝒞12 f2 )(⋅) =

J ∑ R12 (⋅, j) f2 (j) j=1

=

J ∑

𝜋j f2 (j)mj (t).

j=1

Thus, 𝒞12 maps elements of 𝔾2 to contrasts among the population mean functions with the consequence that all the mean functions coincide if and only if the 𝒞12 singular values 𝜌1 , … , 𝜌J−1 vanish. Testing the functional ANOVA

290

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

hypothesis that m1 (⋅) = · · · = mJ (⋅) is therefore equivalent to the hypothesis that all the canonical correlations are zero. If {(𝜌k , f1k , f2k )}J−1 is the singular system for 𝒞12 , the canonical variables k=1 of the X1 and X̃ 2 spaces are U1k ∶= Z1 ( f1k ) and U2k = Z2 ( f2k ) =

J ∑ f2k (j)X2 (j) . 𝜋j j=1

These variables may be used for classification purposes. For example, suppose we define the centered X1 canonical variables U1jc = U1j − 𝔼[U1j ] = U1j − ⟨ f1j , m⟩1 . Then, 𝜌j U2j provides a linear predictor of U1jc and, in particular, U2j = f2j (k)∕𝜋k when X1 comes from the kth population. Thus, a regression argument suggests classification via minimization with respect to k of a distance measure such as r ∑ j=1

1 (U c − 𝜌j 𝜋k−1 f2j (k))2 1 − 𝜌2j 1j

(10.30)

for some r ≤ J − 1. Fisher’s approach to discriminant analysis described in Section 1.1 has also been generalized to the stochastic processes setting in work by Shin (2008). She defines the discriminant functions to be random variables 𝓁 ∈ 𝕃2 (X1 ) that maximize Var(𝔼[𝓁|X2 ]) . 𝔼[Var(𝓁|X2 )] Thus, the first linear discriminant function 𝓁1 satisfies Var(𝔼[𝓁1 |X2 ]) = sup Var(𝔼[𝓁|X2 ]), 𝓁∈𝕃2 (X1 )

where 𝓁 is subject to 𝔼[Var(𝓁|X2 )] = 1. The ith linear discriminant function for i > 1 is defined similarly subject to the additional restriction 𝔼[Cov(𝓁i , 𝓁k |X2 )] = 0, k < i. Assume that the within group covariance kernel RW is the same across the J populations: i.e., RW (s, t) = 𝔼[(X1 (s) − mj (s))(X1 (t) − mj (t))|X2 = j]

CANONICAL CORRELATION ANALYSIS

291

for all j = 1, … , J. Denote the RKHS generated by RW as 𝔾W with associated norm and inner product indicated by ‖ ⋅ ‖W and ⟨⋅, ⋅⟩W . In addition, define the kernel function RB (s, t) =

J ∑

𝜋j (mj (s) − m(s))(mj (t) − m(t))

j=1

for s, t ∈ E. Arguments in Shin (2008) establish that if mj ∈ 𝔾W for all j = 1, … , J, there exists a one-to-one linear mapping ZW from 𝔾W onto 𝕃2 (X1 ) defined by ZW ∶ RW (⋅, t) → X1 (t) for every t ∈ E with the properties that 1. 𝔼[ZW (h)] = ⟨h, m⟩W , 2. 𝔼[ZW (h)|X2 = j] = ⟨h, mj ⟩W , and 3. 𝔼[Var(ZW (h)|X2 )] = ‖h‖2W for h ∈ 𝔾W . Using ZW , we see that Var(𝔼[ZW (h)|X2 ]) = ⟨h, 𝒞B h⟩W with (𝒞B h)(t) ∶= ⟨RB (t, ⋅), h(⋅)⟩W

(10.31)

for h ∈ 𝔾W . The operator 𝒞B in (10.31) has the spectral decomposition 𝒞B =

J−1 ∑

𝛾j hj ⊗𝔾W hj ,

j=1

with 𝛾1 ≥ · · · ≥ 𝛾J−1 ≥ 0 the eigenvalues and hj , j = 1, … , J − 1, the associated eigenfunctions for the operator. The linear discriminant functions are then seen to be 𝓁j = ZW (hj ) for j = 1, … , J − 1 with classification obtained from squared Mahalanobis distance based on the first r ≤ J − 1 linear discriminant functions. That is, for the kth population, we employ the distance measure r ∑ ( )2 ZW (hj ) − ⟨hj , mk ⟩W j=1

(10.32)

292

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

that compares the process “score” vector (ZW (h1 ), … , ZW (hr ))T to its conditional mean (⟨h1 , mk ⟩W , … , ⟨hr , mk ⟩W )T for the kth population. In Section 1.1, we saw that cca and Fisher’s method were equivalent in the case of finite-dimensional mva. We now show that this equivalence continues to hold for fda situations. Theorem 10.5.1 Assume that mj ∈ 𝔾W , j = 1, … , J. Then, 𝛾j =

𝜌2j 1 − 𝜌2j

,

hj = (1 − 𝜌2j )1∕2 f1j 𝓁j = √

U1j 1 − 𝜌2j

for j = 1, … , J − 1. Proof: For the theorem to be true, it must be that both 𝒞12 and 𝒞B are defined on the same space: i.e., 𝔾W and 𝔾1 must consist of the same elements. To see that, this is so first use the identity Cov(Y, Z) = Cov(𝔼[Y|V], 𝔼[Z|V]) + 𝔼[Cov(Y, Z|V)] that holds for any random variables Y, Z, and V with finite second moments to obtain R1 (s, t) = RB (s, t) + RW (s, t). Thus, RW ≪ R1 and 𝔾W ⊂ 𝔾1 due to Theorem 2.7.11. To go in the other direction, define the operator ℒ ∶ 𝔾1 → 𝔾W by ℒ(R1 (⋅, t)) = RW (⋅, t) for t ∈ E. Then, ℒ is a one-to-one, onto and, for h ∈ 𝔾W and f ∈ 𝔾1 , ⟨h, f ⟩1 = ⟨h, ℒ f ⟩W because

⟨h, R1 (⋅, t)⟩1 = h(t) = ⟨h, RW (⋅, t)⟩W .

For f ∈ 𝔾1 , f (t) = ⟨RB (⋅, t), f ⟩1 + ⟨RW (⋅, t), f ⟩1 = ⟨RB (⋅, t), ℒ f ⟩W + ⟨RW (⋅, t), ℒ f ⟩W = (𝒞B ℒ f )(t) + (ℒ f )(t), which indicates that f is in the range of ℒ and therefore in 𝔾W .

CANONICAL CORRELATION ANALYSIS

293

Now 𝒞B induces an operator 𝒞̃B on 𝔾1 via (𝒞̃B f )(t) = ⟨RB (t, ⋅), f (⋅)⟩1 . One finds that 𝒞̃B = 𝒞B ℒ because (𝒞B ℒ f )(t) = ⟨RB (⋅, t), ℒ f ⟩W = ⟨RB (⋅, t), f ⟩1 = (𝒞B f )(t). As f = ℒ f + 𝒞B ℒ f , ℒ = I − 𝒞̃B and ‖ℒ f ‖2W = ⟨ℒ f , f ⟩1 = ⟨(I − 𝒞̃B ) f , f ⟩1 for f ∈ 𝔾1 . The linear mappings Z1 , ZW that connect 𝔾1 and 𝔾W to 𝕃2 (X1 ) are similarly related in that Z1 ( f ) = ZW (ℒ f )

(10.33)

for f ∈ 𝔾1 , which follows from Z1 (R1 (⋅, t)) = X1 (t) = ZW (RW (⋅, t)) = ZW (ℒ(R1 (⋅, t))). For f ∈ 𝔾1 , we have (𝒞12 𝒞21 f )(t) = ⟨R12 (t, ⋅), 𝒞21 f ⟩2 = ⟨𝒞12 R12 (t, ⋅), f ⟩1 . However,

R12 (⋅, j) = Cov(X1 (⋅), X2 (j)) = 𝜋j (mj (⋅) − m(⋅))

for j = 1, … J and 𝒞12 R12 (t, ⋅) = ⟨R12 (⋅, ⋆), R12 (t, ⋆)⟩2 =

J ∑ R12 (⋅, j)R12 (t, j)

𝜋j

j=1

= RB (t, ⋅).

Thus, 𝒞12 𝒞12 = 𝒞̃B . We can now use the fact that 𝒞12 𝒞21 f = 𝒞̃B f = 𝒞B ℒ f and f = ℒ f + 𝒞B ℒ f for f ∈ 𝔾1 to write 𝒞B ℒ f1j =

𝜌2j 1 − 𝜌2j

ℒ f1j .

In addition, ℒ f1j = (1 − 𝜌2j ) f1j because ℒ f1j = (I − 𝒞̃B ) f1j = (I − 𝒞12 𝒞21 ) f1j

294

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

and

‖ℒ f1j ‖2W =

⟨( ) ⟩ I − 𝒞̃B f1j , f1j 1 = 1 − 𝜌2j .

As the hj in 𝔾W corresponding to 𝓁j = ZW (hj ) satisfy ‖hj ‖2W = 1, hj =

ℒ f1j ‖ℒ f1j ‖W

=

ℒ f1j (1 −

𝜌2j )1∕2

= (1 − 𝜌2j )1∕2 f1j . ◽

The proof is completed using (10.33).

We note in passing that the results in Theorem 10.5.1 are the same as what we found for the mva case in Section 1.1. Along this same vein, we conclude the section by showing that, just as in the finite-dimensional case, classification with either Fisher’s method or cca produces the same results. Corollary 10.5.2 The distance measures (10.30) and (10.32) are equivalent. Proof: The relation ⟨ f1j , mk ⟩1 = ⟨ℒ f1j , mk ⟩W = (1 − 𝜌2j )1∕2 ⟨hj , mk ⟩W has the implication that r ∑

(ZW (hj ) − ⟨hj , mk ⟩W )2 =

j=1

r ∑ j=1

=

r ∑ j=1

As

1 (U1j − ⟨ f1j , mk ⟩1 )2 1 − 𝜌2j 1 (U c − ⟨ f1j , mk − m⟩1 )2 . 1 − 𝜌2j 1j

(𝒞21 f1j )(k) = ⟨ f1j (⋅), R12 (⋅, k)⟩1 = 𝜋k ⟨ f1j , mk − m⟩1 ,

we have ( ) ⟨ f1j , mk − m⟩1 = 𝜋k−1 𝒞21 f1j (k) = 𝜌j 𝜋k−1 f2j (k).

10.6



Orthogonal subspaces and partial cca

Theorem 10.1.2 represents the solution to our cca problem. There is, however, another way to obtain this result based on the orthogonal subspace decomposition detailed in Section 3.7. This approach has some advantages in that it is extensible to situations that involve more than just two random

CANONICAL CORRELATION ANALYSIS

295

elements as will be demonstrated subsequently. To pursue this alternative development, we focus on the random element setting and introduce the Hilbert space { } 2 ∑ 𝔾0 = h = ( f1 , f2 ) ∶ fi ∈ 𝔾i , i = 1, 2, ‖h‖20 = ‖ fi ‖2i < ∞ (10.34) i=1

with addition of elements of 𝔾0 being performed component-wise. For each h ∈ 𝔾0 , we then define an analog of Z1 and Z2 by Z0 (h) = Z1 ( f1 ) + Z2 ( f2 ). For h = ( f1 , f2 ), h′ = ( f1′ , f2′ ) in 𝔾0 , the Z0 function is Cov(Z0 (h), Z0 (h′ )) = Cov(Z1 ( f1 ), Z1 ( f1′ )) + Cov(Z2 ( f2 ), Z2 ( f2′ )) + Cov(Z1 ( f1 ), Z2 ( f2′ )) + Cov(Z1 ( f1′ ), Z2 ( f2 )) = ⟨ f1 , f1′ ⟩1 + ⟨ f2 , f2′ ⟩2 + ⟨ f1 , 𝒞12 f2′ ⟩1 + ⟨𝒞21 f1′ , f2 ⟩2 . The Hilbert space spanned by the Z0 process is { } 𝕃2 (Z0 ) = Z0 (h) ∶ h ∈ 𝔾0 , ‖Z0 (h)‖20 ∶= Var(Z0 (h)) < ∞ .

(10.35)

(10.36)

Note that when Var(Z1 ( f1 )) = 1 = Var(Z2 ( f2 )) 𝜌2 ( f1 , f2 ) = Cov(Z0 (( f1 , 0)), Z0 ((0, f2 ))). Thus, we have not lost track of our original problem and optimization of 𝜌2 ( f1 , f2 ) over fi ∈ 𝔾i can be recovered through, e.g., restricted optimization of the covariance functional (10.35) over 𝔾0 . However, the analysis becomes somewhat more tractable if we can work with a Hilbert space that is congruent to 𝕃2 (Z0 ) so that the objective function can be expressed in terms of its norms and inner products. For this purpose, we will require Assumption 1 There is a constant B ∈ [0, 1) such that |Corr(Z1 ( f1 ), Z2 ( f2 ))| ≤ B for all ( f1 , f2 ) ∈ 𝔾0 . ◽ Assumption 1 insures that there is no linear combination of the random variables {⟨𝜒2 , e2j ⟩} (with finite variance) that can exactly predict a similar linear combination of {⟨𝜒1 , e1j ⟩}. For the finite-dimensional multivariate analysis case, this would translate into the cross-covariance matrix for the vector representations of 𝜒1 , 𝜒2 having full rank: i.e., rank equal to the smaller of the dimensions for the two random vectors. This, in turn, avoids the degenerate

296

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

case where a component of one of the random vectors is just a translated and rescaled version of variables that already exist in the other. A similar interpretation can be made in the more abstract context of this chapter. Of more immediate import is the fact that ‖𝒞12 ‖ = ‖𝒞21 ‖ ≤ B

(10.37)

when Assumption 1 is in effect as can be seen by retracing the argument that was used to prove Theorem 10.1.1. Now, for h ∈ 𝔾0 , define 𝒬h = ( f1 + 𝒞12 f2 , f2 + 𝒞21 f1 ). It will be convenient to write this in matrix form as [ ][ ] I 𝒞12 f1 𝒬h = (10.38) 𝒞21 I f2 with the convention that the resulting vector is viewed as an element of ℋ0 . Observe that Cov(Z0 (h), Z0 (h′ )) = ⟨h, 𝒬h′ ⟩0 . Theorem 10.6.1 Under Assumption 1, 𝒬 ∶ 𝔾0 → 𝔾0 is invertible with inverse defined by −1 −1 −1 −1 𝒬 −1 (h) = (𝒞11.2 f1 − 𝒞12 𝒞22.1 f2 , 𝒞22.1 f2 − 𝒞21 C11.2 f1 )

(10.39)

for h = (f1 , f2 ) ∈ 𝔾0 and 𝒞ii.k = I − 𝒞ik 𝒞ki = (I − 𝒞ik 𝒞ki )∗ , i, k = 1, 2, i ≠ k. Analogous to (10.38), (10.39) will also be expressed as [ ][ ] −1 −1 𝒞11.2 −𝒞12 𝒞22.1 f1 −1 𝒬 h= . −1 −1 f2 −𝒞21 𝒞11.2 𝒞22.1 Proof: The form of the inverse as stated in 10.39 follows directly once we have shown all the relevant inverse operators exist. Thus, let us concentrate on the latter task. We can write 𝒬 = I − 𝒯 with [ ][ ] 0 𝒞12 f1 𝒯h = (−𝒞12 f2 , −𝒞21 f1 ) = − . 𝒞21 0 f2 Then, ‖𝒯h‖20 = ‖𝒞12 f2 ‖21 + ‖𝒞21 f1 ‖22 ≤ ‖𝒞12 ‖2 ‖| f2 ‖22 + ‖𝒞21 ‖2 ‖ f1 ‖21 = ‖𝒞12 ‖2 [‖ f1 ‖21 + ‖ f2 ‖22 ] = ‖𝒞12 ‖2 ‖h‖20 ≤ B‖h‖20 < ‖h‖20

CANONICAL CORRELATION ANALYSIS

297

due to (10.37) and Theorem 3.5.5 has the consequence that I − 𝒯 = 𝒬 is invertible. To complete the proof, we need to show that 𝒞11.2 and 𝒞22.1 are invertible. This again follows from Theorem 3.5.5 because, e.g., 𝒞11.2 = I − 𝒞12 𝒞21 with ‖𝒞21 ‖ = ‖𝒞12 ‖ < 1. ◽ Now define { [ ] } f1 2 −1∕2 ̃ 2 ̃ ̃ ̃ ℍ(𝒬) = h ∶ h = 𝒬 , fi ∈ 𝔾i , i = 1, 2, ‖h‖ℍ(𝒬) = ‖𝒬 h‖0 < ∞ . f2 This Hilbert space is the one that will now be the focus of our attention due to the next result. Theorem 10.6.2 ℍ(𝒬) is congruent to 𝕃2 (Z0 ) in (10.36) under the mapping ̃ = Z0 (𝒬 −1 h) ̃ for h̃ ∈ ℍ(𝒬). Ψ(h) Proof: Let h, h′ ∈ 𝔾0 with h̃ = 𝒬h, h̃ ′ = 𝒬h′ ∈ ℍ(𝒬). Then, ( ) ( ) ̃ Z0 (𝒬 −1 h̃ ′ ) Cov Z0 (h), Z0 (h′ = Cov Z0 (𝒬 −1 h), = ⟨h, 𝒬h′ ⟩0 ̃ 𝒬 −1∕2 h̃ ′ ⟩0 = ⟨𝒬 −1∕2 h, ̃ h̃ ′ ⟩ℍ(𝒬) . = ⟨h,



With Theorem 10.6.2 in hand, we can give our new formulation of cca. Specially, we seek elements fi ∈ 𝔾i of unit norm that maximize |Cov(Z1 ( f1 ), Z2 ( f2 ))|. However, ( ) Cov(Z1 ( f1 ), Z2 ( f2 )) = Cov Z0 (( f1 , 0)) , Z0 ((0, f2 )) ⟨ [ ] [ ]⟩ f 0 = 𝒬 1 ,𝒬 , 0 f2 ℍ(𝒬) which leads to the conclusion that it is equivalent to find fi ∈ 𝔾i to maximize the right-hand side of this last expression. The analysis from this point is driven by the results in Section 3.7. For that purpose, we decompose ℍ(𝒬) into a sum of the closed subspaces 𝕄1 and 𝕄2 with { [ ] } f 1 𝕄1 = h̃ ∈ ℍ(𝒬) ∶ h̃ = 𝒬 = ( f1 , 𝒞21 f1 ), f1 ∈ 𝔾1 , 0 { [ ] } 0 𝕄2 = h̃ ∈ ℍ(𝒬) ∶ h̃ = 𝒬 = (𝒞12 f2 , f2 ), f2 ∈ 𝔾2 . f2 Regarding 𝕄1 and 𝕄2 , we have the following result.

298

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Theorem 10.6.3 ℍ(𝒬) = 𝕄1 + 𝕄2 with “+” indicating an algebraic direct sum. Proof: Clearly, any element of 𝔾0 can be written as the sum of elements in 𝕄1 and 𝕄2 . We therefore need only show that 𝕄1 ∩ 𝕄2 = {0}. To the contrary, suppose that there exist fi ∈ 𝔾i , i = 1, 2, such that ( f1 , 𝒞21 f1 ) = (𝒞12 f2 , f2 ). Then, Var(Z1 ( f1 )) = ⟨ f1 , f1 ⟩1 = ⟨ f1 , 𝒞12 f2 ⟩1 and Var(Z2 ( f2 )) = ⟨ f2 , f2 ⟩2 = ⟨ f2 , 𝒞21 f1 ⟩2 = ⟨C12 f2 , f1 ⟩1 . However, these relations have the consequence that |Corr(Z1 ( f1 ), Z2 ( f2 ))| = 1, which contradicts Assumption 1. ◽ To relate Theorem 10.6.3 to the Section[ 3.7] development let ℕ1[ =]𝕄1 and f 0 ℕ2 = 𝕄2 ∩ 𝕄⊥1 = 𝕄⊥1 . Then, for h̃ 1 = 𝒬 1 ∈ 𝕄1 and h̃ 2 = 𝒬 ∈ 𝕄2 , 0 f2 the first canonical correlation satisfies 𝜌21 = = ≤

⟨h̃ 1 , h̃ 2 ⟩2ℍ(𝒬)

sup

h̃ 1 ∈𝕄1 ,h̃ 2 ∈𝕄2 ‖h̃ i ‖ℍ(𝒬) =1,i=1,2

sup

⟨h̃ 1 , 𝒯𝑣⟩2ℍ(𝒬)

h̃ 1 ∈ℕ1 ,𝑣∈ℕ2 ‖h̃ 1 ‖ℍ(𝒬) =1,‖𝑣+𝒯𝑣‖ℍ(𝒬) =1

sup

𝑣∈ℕ2 ‖𝑣+𝒯𝑣‖ℍ(𝒬) =1

‖𝒯𝑣‖2ℍ(𝒬)

for 𝒯 = 𝒫ℕ1 |𝕄2 (𝒫ℕ2 |𝕄2 )−1 . Taking h̃ 1 = 𝒯𝑣∕‖𝒯𝑣‖ℍ(𝒬) , we see that the bound is attainable and holds with equality. Thus, we have shown that 𝜌1 is obtained by maximizing ‖𝒯𝑣‖ℍ(𝒬) over 𝑣 ∈ ℕ2 subject to ‖𝒯𝑣 + 𝑣‖ℍ(𝒬) = 1. However, ‖𝒯𝑣 + 𝑣‖2ℍ(𝒬) = ‖𝑣‖2ℍ(𝒬) + 2⟨𝒯𝑣, 𝑣⟩ℍ(𝒬) + ‖𝒯𝑣‖2ℍ(𝒬) = ⟨𝑣, (I + 𝒯∗ 𝒯)𝑣⟩ℍ(𝒬) because 𝑣 ∈ ℕ2 is orthogonal to 𝒯𝑣 ∈ ℕ1 . Thus, we maximize ‖𝒯𝑣‖ℍ(𝒬) subject to ⟨𝑣, (I + 𝒯∗ 𝒯)𝑣⟩ℍ(𝒬) = 1.

CANONICAL CORRELATION ANALYSIS

299

The operator I + 𝒯∗ 𝒯 is self-adjoint, positive, Invertible, and has a self-adjoint square-root (I + 𝒯∗ 𝒯)1∕2 . We can therefore work with 𝑣′ = (I + 𝒯∗ 𝒯)1∕2 𝑣 and maximize ‖𝒯𝑣‖ℍ(𝒬) = ‖𝒯(I + 𝒯∗ 𝒯)−1∕2 𝑣′ ‖ℍ(𝒬) subject to 𝑣′ ∈ ℕ2 and ‖𝑣′ ‖2ℍ(𝒬) = 1. If we now assume that 𝒯∗ 𝒯 is compact, the maximizer is the eigenvector that corresponds to the largest eigenvalue of (I + 𝒯∗ 𝒯)−1∕2 𝒯∗ 𝒯(I + 𝒯∗ 𝒯)−1∕2 . Some algebra reveals that this eigenvalue problem is equivalent to finding a vector 𝑣 ∈ ℕ2 with ‖𝑣‖2ℍ(𝒬) = 1 such that 𝒯∗ 𝒯𝑣 = 𝛼 2 𝑣 (10.40) √ in which case the corresponding canonical correlation is 𝜌 = 𝛼∕ 1 + 𝛼 2 . Let 𝑣 ∈ ℕ2 be any vector that satisfies (10.40). Its 𝕄1 component is 𝒯𝑣 and its 𝕄2 component correspond to the canonical variables ( is 𝒯𝑣 + 𝑣.√These ) 2 Ψ (𝒯𝑣∕𝛼) and Ψ (𝑣 + 𝒯𝑣)∕ 1 + 𝛼 of the Z1 and Z2 spaces, respectively. We now turn to the task of characterizing 𝒯∗ 𝒯. To do so, we first need to appreciate the form of the elements of ℕ2 = 𝕄⊥1 . In this regard, we have the following useful result. Lemma 10.6.4 Every element 𝑣 ∈ ℕ2 can be expressed as 𝑣 = (0, ̃f 2 ) with ̃f 2 = 𝒞22.1 f2 for some f2 ∈ 𝔾2 . Proof: Let

[ ] f 𝑣=𝒬 1 f2

for fi ∈ 𝔾i , i = 1, 2, be any element of 𝕄⊥1 . Then, for every h̃ = ( f1′ , 𝒞21 f1′ ) ∈ 𝕄1 ̃ 𝑣⟩ℍ(𝒬) = ⟨𝒬 −1∕2 h, ̃ 𝒬 −1∕2 𝑣⟩0 ⟨h, = ⟨( f1′ , 0), 𝑣⟩0 = ⟨ f1′ , f1 ⟩1 + ⟨ f1′ , 𝒞12 f2 ⟩1 = 0. As this is true for all f1′ ∈ 𝔾1 , f1 = −𝒞12 f2 .



As a result of the previous lemma, the following theorem determines the form of 𝒯∗ 𝒯 in terms of the fundamental operators 𝒞12 and 𝒞21 .

300

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

−1 ̃ Theorem 10.6.5 For any 𝑣 = (0, ̃f 2 ) ∈ ℕ2 , 𝒯∗ 𝒯𝑣 = (0, 𝒞21 𝒞12 𝒞22.1 f2 ).

Proof: We will establish two results that imply the theorem statement: namely, that for 𝑣 = (0, f̃2 ) ∈ ℕ2 , −1 ̃ −1 ̃ 𝒯𝑣 = (𝒞12 𝒞22.1 f2 , 𝒞21 𝒞12 𝒞22.1 f2 )

(10.41)

and for h̃ = ( f1 , 𝒞21 f1 ) ∈ 𝕄1 𝒯∗ h̃ = (0, 𝒞21 f1 ).

(10.42)

As, 𝒯 = 𝒫ℕ1 |𝕄2 (𝒫ℕ2 |𝕄2 )−1 , we first need to obtain explicit expressions for 𝒫ℕ1 |𝕄2 and (𝒫ℕ2 |𝕄2 )−1 . Let h̃ 1 = ( f1 , 𝒞21 f1 ) ∈ 𝕄1 = ℕ1 with h̃ 2 = (𝒞12 f2 , f2 ) ∈ 𝕄2 . Then, ⟨𝒫ℕ1 |𝕄2 h̃ 2 , h̃ 1 ⟩ℍ(𝒬) = ⟨h̃ 2 , h̃ 1 ⟩ℍ(𝒬) for every h̃ 1 ∈ 𝕄1 . Writing 𝒫ℕ1 |𝕄2 h̃ 2 = ( f1′ , 𝒞21 f1′ ) for some f1′ ∈ 𝔾1 leads to ⟨𝒫ℕ1 |𝕄2 h̃ 2 , h̃ 1 ⟩ℍ(𝒬) = ⟨( f1′ , 𝒞21 f1′ ), ( f1 , 0)⟩0 = ⟨ f1′ , f1 ⟩1 = ⟨(𝒞12 f2 , f2 ), h̃ 1 ⟩ℍ(𝒬) = ⟨(𝒞12 f2 , f2 ), ( f1 , 0)⟩0 = ⟨𝒞12 f2 , f1 ⟩1 for every f1 ∈ 𝔾1 . So, f1′ = 𝒞12 f2 and 𝒫ℕ1 |𝕄2 h̃ 2 = (𝒞12 f2 , 𝒞21 𝒞12 f2 ).

(10.43)

Consequently, 𝒫ℕ2 |𝕄2 h̃ 2 = (I − 𝒫ℕ1 |𝕄2 )h̃ 2 = (0, 𝒞22.1 f2 ) from which we see that −1 ̃ −1 ̃ (𝒫ℕ2 |𝕄2 )−1 𝑣 = (𝒞12 𝒞22.1 f2 , 𝒞22.1 f2 )

(10.44)

for 𝑣 = (0, ̃f 2 ) ∈ ℕ2 . In combination, (10.43) and (10.44) give us (10.41). To verify (10.42), let h̃ = (f1 , 𝒞21 f1 ) ∈ 𝕄1 = ℕ1 and 𝑣 = (0, f̃2 ) ∈ ℕ2 . Then, ̃ 𝒯𝑣⟩ℍ(𝒬) = ⟨𝒬 −1 h, ̃ 𝒯𝑣⟩0 ⟨h, −1 ̃ −1 ̃ = ⟨( f1 , 0), (𝒞12 𝒞22.1 f2 , 𝒞21 𝒞12 𝒞22.1 f2 )⟩0 −1 ̃ −1 = ⟨ f1 , 𝒞12 𝒞22.1 f2 ⟩1 = ⟨𝒞22.1 𝒞21 f1 , f̃2 ⟩2 ⟨ [ ] ⟩ ⟨[ ] ⟩ 0 0 = 𝒬 −1 ,𝑣 = ,𝑣 𝒞21 f1 𝒞21 f1 0 ℍ(𝒬)

̃ 𝑣⟩ℍ(𝒬) . = ⟨𝒯∗ h,



CANONICAL CORRELATION ANALYSIS

301

Theorem 10.6.5 in conjunction with Lemma 10.6.4 entails that 𝒯∗ 𝒯(0, f̃2 ) = (0, 𝒞21 𝒞12 f2 ) for some f2 ∈ 𝔾2 . The eigenvalue problem (10.40) is therefore equivalent to 𝒞21 𝒞12 f2 = 𝛼 2 𝒞22.1 f2 or 𝒞21 𝒞12 f2 = 𝜌2 f2 .

(10.45)

By interchanging the roles of 𝕄1 and 𝕄2 , it follows that the optimal choice for f1 is the eigenvector corresponding to the same eigenvalue 𝜌2 in (10.45) except for the operator 𝒞12 𝒞21 . Thus, Theorem 4.3.1 now leads us to the same conclusion as Theorem 10.1.2. So far, we have succeeded only in reproducing our previous cca result. The benefits of the orthogonal subspace approach become evident only when we are interested in more than two random elements. We will direct our attention to the simplest case of three random elements here although it should be clear that the same development can be used in a substantially more general context. The specific, three-variable, extension of cca we will consider is the one suggested by Roy (1958) that has been termed partial canonical correlation analysis or pcca subsequently. The idea is that we have random vectors X1 , X2 , and X3 and wish to study the relationship between X2 and X3 after removing the influence of X1 . One way to accomplish this is by examination of partial canonical correlations that are the ordinary canonical correlations between X2 − 𝒫X1 X2 and X3 − 𝒫X1 X3 , where 𝒫X1 denotes projection onto the linear space spanned by X1 . The goal is to now formulate Roy’s idea for three Hilbert space valued random elements. The arguments that are required to accomplish this are rather immediate extensions of those for the two random element case. Thus, we will merely highlight the important differences. We now have three random elements 𝜒i , i = 1, 2, 3, with covariance operators 𝒦i , i = 1, 2, 3 and cross-covariance operators 𝒦12 , 𝒦13 , 𝒦23 . The Hilbert spaces 𝕃2 (Zi ) that are spanned by the Zi (⋅) processes that are indexed by their congruent Hilbert spaces 𝔾i = ℍ(𝒦i ), i = 1, 2, 3, are defined as in Section 10.1. One finds, as before, that there are bounded operators 𝒞ij ∶ 𝔾j → 𝔾i satisfying Cov(Zi ( fi ), Zj ( fj )) = ⟨ fi , 𝒞ij fj ⟩i for i, j = 1, 2, 3 and i ≠ j. Our previous definition of 𝔾0 in (10.34) now takes the form { } 3 ∑ 𝔾0 = h = ( f1 , f2 , f3 ) ∶ fi ∈ 𝔾i , i = 1, 2, 3, ‖h‖20 = ‖ fi ‖2i < ∞ i=1

302

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

for which a corresponding 𝔾0 indexed process is Z0 (h) =

3 ∑

Zi ( f i )

i=1

that has

( ) Cov Z0 (h), Z0 (h′ ) = ⟨h, 𝒬h′ ⟩0

with 𝒬 defined by

⎡ I 𝒞12 𝒞13 ⎤ ⎡ f1 ⎤ I 𝒞23 ⎥ ⎢ f2 ⎥ . 𝒬h = ⎢𝒞21 ⎢𝒞 I ⎥⎦ ⎢⎣ f3 ⎥⎦ ⎣ 31 𝒞32

As in Section 10.1, we need to rule out the case where certain types of exact prediction are possible. For this purpose, we require that Assumption 1 holds for both of the process pairs Z1 , Z2 and Z1 , Z3 as well as Assumption 2 There exist no f2 ∈ 𝔾2 or f3 ∈ 𝔾3 such that ◽

|Corr(Z2 ( f2 ) − 𝒫Z1 Z2 ( f2 ), Z3 ( f3 ) − 𝒫Z1 Z3 ( f3 ))| = 1.

Under this condition, 𝒬 is invertible. Its inverse, in matrix form, is given by [ ] I + ℰ𝒢 −1 ℱ −ℰ𝒢 −1 −1 𝒬 = (10.46) −𝒢 −1 ℱ 𝒢 −1 [ ] [ ] [ ] 𝒞21 I 𝒞23 𝒞 𝒞 for ℰ = 12 ,𝒟 = and 13 , ℱ = 𝒞31 𝒞32 I 𝒢 = 𝒟 1∕2 (I − 𝒱)𝒟 1∕2 with

[

−1∕2

−1∕2

0 −𝒞22.1 (𝒞23 − 𝒞21 𝒞13 )𝒞33.1 𝒱= −1∕2 −1∕2 −𝒞33.1 (𝒞32 − 𝒞31 𝒞12 )𝒞22.1 0

]

As in Theorem 10.6.2, 𝕃2 (Z0 ) = {Z0 (h) ∶ h ∈ 𝔾0 , ‖Z0 (h)‖2𝕃2 (Z ) = Var(Z0 (h)) < ∞} 0

is congruent to ⎧ ⎫ ⎡ f1 ⎤ ⎪̃ ̃ 2 = ‖𝒬 −1∕2 h‖ ̃ 2 < ∞⎪ ℍ(𝒬) = ⎨h = 𝒬 ⎢ f2 ⎥ ∶ fi ∈ 𝔾i , i = 1, 2, 3, ‖h‖ ⎬ 0 ℍ(𝒬) ⎢f ⎥ ⎪ ⎪ ⎣ 3⎦ ⎩ ⎭ ̃ = Z0 (𝒬 −1 h). ̃ under the mapping Ψ(h)

.

CANONICAL CORRELATION ANALYSIS

303

One finds that the projection of Z2 ( f2 ) onto 𝕃2 (Z1 ) is Z1 (𝒞12 f2 ) and the projection of Z3 ( f3 ) onto 𝕃2 (Z1 ) is Z1 (𝒞13 f3 ). Thus, for the pcca formulation, we wish to find f2 ∈ 𝔾2 and f3 ∈ 𝔾3 to maximize the correlation between Z2 ( f2 ) − Z1 (𝒞12 f2 ) = Z0 (−𝒞12 f2 , f2 , 0) and

Z3 ( f3 ) − Z1 (𝒞13 f3 ) = Z0 (−𝒞13 f3 , 0, f3 ).

However, Cov(Z0 (−𝒞13 f3 , 0, f3 )), Z0 (−𝒞13 f3 , 0, f3 )) ⟨ ⟩ ⎡−𝒞12 f2 ⎤ ⎡−𝒞13 f3 ⎤ = 𝒬 ⎢ f2 ⎥ , 𝒬 ⎢ 0 ⎥ . ⎢ 0 ⎥ ⎢ f ⎥ ⎣ ⎦ ⎣ ⎦ 3 ℍ(𝒬)

Thus, just as was the case for ordinary cca, we can formulate the problem of finding partial canonical correlations as one of restricted optimization in ℍ(𝒬). To proceed, we again draw on the results in Section 3.7 and write ℍ(𝒬) as the direct sum of three subspaces defined by ⎧ ⎫ ⎡ f1 ⎤ ⎪̃ ⎪ 𝕄1 = ⎨h ∈ ℍ(𝒬) ∶ h̃ = 𝒬 ⎢ 0 ⎥ ∶= ( f1 , 𝒞21 f1 , 𝒞31 f1 )⎬ ⎢ ⎥ ⎪ ⎪ ⎣0⎦ ⎩ ⎭ ⎧ ⎫ ⎡0⎤ ⎪̃ ⎪ 𝕄2 = ⎨h ∈ ℍ(𝒬) ∶ h̃ = 𝒬 ⎢ f2 ⎥ ∶= (𝒞12 f2 , f2 , 𝒞32 f2 )⎬ , ⎢0⎥ ⎪ ⎪ ⎣ ⎦ ⎩ ⎭ ⎧ ⎫ ⎡0⎤ ⎪̃ ⎪ 𝕄3 = ⎨h ∈ ℍ(𝒬) ∶ h̃ = 𝒬 ⎢ 0 ⎥ ∶= (𝒞13 f3 , 𝒞23 f3 , f3 )⎬ ⎢ ⎥ ⎪ ⎪ ⎣ f3 ⎦ ⎩ ⎭ with ℕ1 = 𝕄1 , ℕ2 = 𝕄2 ∩ 𝕄⊥1 , and ℕ3 = 𝕄3 ∩ (𝕄1 + 𝕄2 )⊥ . Then, one can ̃ 2 = 𝒫𝕄⊥ 𝕄2 and 𝕄 ̃ 3 = 𝒫𝕄⊥ 𝕄3 check that the elements of the subspaces 𝕄 1 1 have the form ⎡−𝒞12 f2 ⎤ h̃ 2 = 𝒬 ⎢ f2 ⎥ ⎢ 0 ⎥ ⎣ ⎦

304

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

and

⎡−𝒞13 f2 ⎤ h̃ 3 = 𝒬 ⎢ 0 ⎥ , ⎢ f ⎥ ⎣ ⎦ 3

respectively, for fi ∈ 𝔾i , i = 1, 2. Thus, the first squared partial canonical correlation is 𝜌2 = sup ⟨h̃ 2 , h̃ 3 ⟩2ℍ(𝒬) . ̃i h̃ i ∈𝕄 ‖h̃ i ‖ℍ(𝒬) =1,i=2,3

̃ 2 = ℕ2 and the elements of 𝕄 ̃ 3 can all be expressed as However, 𝕄 h̃ 3 = 𝑣 + 𝒯𝑣 ( )−1 for 𝒯 = 𝒫ℕ2 |𝕄3 𝒫ℕ3 |𝕄3 . Consequently, 𝜌2 = ≤

sup

⟨h̃ 2 , 𝒯𝑣⟩2ℍ(𝒬)

h̃ 2 ∈ℕ2 ,𝑣∈ℕ3 ‖h̃ 2 ‖ℍ(𝒬) =1,‖𝑣+𝒯𝑣‖ℍ(𝒬) =1

sup

𝑣∈ℕ3 ‖𝑣+𝒯𝑣‖ℍ(𝒬) =1

‖𝒯𝑣‖2ℍ(𝒬) .

This bound holds with equality when h̃ 2 = 𝒯h̃ 3 ∕‖𝒯√ h̃ 3 ‖ℍ(𝒬) , which leads to the conclusion that when 𝒯 is compact 𝜌 = 𝛼∕ 1 + 𝛼 2 with 𝛼 2 the largest eigenvalue of 𝒯∗ 𝒯. If 𝑣 is an eigenvector corresponding to 𝛼 2 , the( partial canonical variable for the Z2 and Z3 spaces is Ψ (𝒯𝑣∕𝛼) and ) √ 2 Ψ (𝑣 + 𝒯𝑣)∕ 1 + 𝛼 . To obtain a representation for 𝒯 and 𝒯∗ 𝒯 in terms of the 𝒞ij , first define −1 𝒞0 = 𝒞33.1 − (𝒞32 − 𝒞31 𝒞12 )𝒞22.1 (𝒞23 − 𝒞21 𝒞13 ).

Then, for 𝑣 = (0, 0, ̃f 3 ) ∈ ℕ3 , we find that −1 𝒯𝑣 = (0, (𝒞23 − 𝒞21 𝒞13 )𝒞0−1 ̃f 3 , (𝒞32 − 𝒞31 𝒞12 )𝒞22.1 (𝒞23 − 𝒞21 𝒞13 )𝒞0−1 ̃f 3 ),

which characterizes the 𝒯 operator. Similarly, −1 𝒯∗ 𝒯𝑣 = (0, 0, (𝒞32 − 𝒞31 𝒞12 )𝒞22.1 (𝒞23 − 𝒞21 𝒞13 )𝒞0−1 ̃f 3 )).

11

Regression In this final chapter, we focus on the fda-specific problem of functional linear regression. There are various ways to formulate this type of regression idea and we have chosen to study one of the more common situations that has appeared in the fda literature: namely, the case of a scalar-dependent variable and functional independent variable. This leads to a special case of the functional linear model introduced in Section 6.1 which makes it natural to investigate the performance of method of regularization estimation techniques in this setting. In Section 11.1, we describe the basic regression model that we pose for study and derive a penalized least-squares estimator for the corresponding coefficient function that was originally proposed in Crambes, Kneip, and Sarda (2009). Subsequent sections examine the large sample and optimality properties of this estimator.

11.1

A functional regression model

The basic premise is that we have a probability space (Ω, ℱ, ℙ) and an associated second-order stochastic process {X(t, 𝜔) ∶ t ∈ [0, 1], 𝜔 ∈ Ω} that is jointly measurable in t and 𝜔 with a square-integrable sample paths that can be viewed as a random element of 𝕃2 ∶= 𝕃2 [0, 1]. The X(t) all have zero mean and the covariance kernel K(s, t) = 𝔼[X(s)X(t)] is presumed to be continuous for all s, t ∈ [0, 1]. Then, the regression model we intend to study concerns the pairs (X1 , Y1 ), … , (Xn , Yn ) with X1 , … , Xn independent copies of X and Yi = ⟨𝛽, Xi ⟩ + 𝜀i , i = 1, … , n, Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators, First Edition. Tailen Hsing and Randall Eubank. © 2015 John Wiley & Sons, Ltd. Published 2015 by John Wiley & Sons, Ltd.

(11.1)

306

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

where ⟨⋅, ⋅⟩ is the 𝕃2 inner product, the 𝜀j are independent, zero mean random errors with common variance 𝜎 2 that are independent of the Xi , and 𝛽 is an unknown element of 𝕎q ∶= 𝕎q [0, 1]. Model (11.1) provides one possible generalization of multiple linear regression to the functional domain with 𝛽 playing the role of the vector of regression coefficients. Accordingly, we will refer to it as the coefficient function throughout the chapter. To connect (11.1) with the developments in Chapter 6, take 𝕐 = ℝn with 1∑ 2 y n i=1 i n

‖y‖2𝕐 =

for y = (y1 , … , yn )T ∈ 𝕐 . Then, take ℍ = 𝕎q and define 𝒯 ∈ 𝔅(ℍ, 𝕐 ) by 𝒯g = (𝒯1 g, … , 𝒯n g)T for 𝒯i g = ⟨Xi , g⟩,

1 ≤ i ≤ n,

(11.2)

when g ∈ 𝕎q . From the Cauchy–Schwarz inequality, |𝒯i g| ≤ ‖Xi ‖‖g‖ ≤ ‖Xi ‖‖g‖𝕎q , where ‖g‖2𝕎 = ‖g‖2 + ‖g(q) ‖2 q

for ‖ ⋅ ‖ the 𝕃2 norm. Thus, 𝒯 is, in fact, a bounded linear mapping from ℍ to 𝕐 . Note that our definition of 𝒯tacitly assumes that the Xi are fully observed. Section 11.4 relaxes that condition and allows the sample paths to only be sampled at discrete time points. If we define Y = (Y1 , … , Yn )T and 𝜀 = (𝜀1 , … , 𝜀n )T , (11.1) becomes Y = 𝒯𝛽 + 𝜀 which now fits into the framework of the functional linear model (6.2). This suggests using a penalized least-squares estimator for 𝛽 obtained by minimization of criterion (6.7). For our specific case, this entails minimization with respect to g ∈ 𝕎q of ‖Y − 𝒯g‖2𝕐 + 𝜂⟨g, g⟩ℍ = n−1

n ∑

(Yi − 𝒯i g)2 + 𝜂‖g‖2𝕎 . q

i=1

REGRESSION

307

Upon making the identification 𝒲 = I in (6.7), an application of Theorem 6.2.1 produces

with

𝛽𝜂 = 𝒢(𝜂)−1 𝒯∗ Y,

(11.3)

𝒢(𝜂) = 𝒯∗ 𝒯 + 𝜂I

(11.4)

as our estimator of the coefficient function 𝛽. The following theorem provides a first step toward understanding the squared error properties of 𝛽𝜂 . Theorem 11.1.1 Let 𝔼𝜀 represent expectation with respect to 𝜀. Then, ‖𝔼𝜀 𝒯(𝛽𝜂 − 𝛽)‖2𝕐 ≤ 𝜂‖𝛽‖2𝕎 , q

𝔼𝜀 ‖𝒯(𝛽𝜂 − 𝔼𝜀 𝛽𝜂 )‖2𝕐 =

𝜎2 n

trace 𝒯𝒢(𝜂)−1 𝒯∗ 𝒯𝒢(𝜂)−1 𝒯∗

and 𝔼𝜀 ‖𝛽𝜂 ‖2𝕎 ≤ 2‖𝛽‖2𝕎 + q

q

2𝜎 2 trace 𝒯𝒢(𝜂)−2 𝒯∗ . n

(11.5)

Proof: The first two bounds follow from Theorem 6.3.1. For the third note that under our model 𝛽𝜂 = 𝒢(𝜂)−1 𝒯∗ 𝒯𝛽 + 𝒢(𝜂)−1 𝒯∗ 𝜀. Hence, ‖𝛽𝜂 ‖2𝕎 ≤ 2‖𝒢(𝜂)−1 𝒯∗ 𝒯𝛽‖2𝕎 + 2‖𝒢(𝜂)−1 𝒯∗ 𝜀‖2𝕎 q

q

q

and we can analyze the two terms in this last expression separately. First ‖𝒢(𝜂)−1 𝒯∗ 𝒯𝛽‖2𝕎 ≤ ‖𝒢(𝜂)−1 𝒯∗ 𝒯‖2 ‖𝛽‖2𝕎 q

= ‖𝒢(𝜂)

−1∕2

𝒯 𝒯𝒢(𝜂) ∗

q

‖ ‖𝛽‖2𝕎 .

−1∕2 2

q

Now, 𝒯∗ 𝒯 is compact which means that Theorem 4.8.1 can be applied to conclude that the eigenvalues of 𝒢(𝜂)−1∕2 𝒯∗ 𝒯𝒢(𝜂)−1∕2 are at most one. Next, as the 𝜀j are uncorrelated, 𝔼𝜀 ‖𝒢(𝜂)−1 𝒯∗ 𝜀‖2𝕎 = 𝔼𝜀 ⟨𝜀, 𝒯𝒢(𝜂)−2 𝒯∗ 𝜀⟩𝕐 = q

𝜎2 trace 𝒯𝒢(𝜂)−2 𝒯∗ . n



308

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

The theorem quantifies the effect of only a portion of the random components that are present in 𝛽𝜂 : namely, those that arise from the additive random errors in the model. It remains to assess the contribution from sampling the X process. We address this issue from a large sample perspective in the following section.

11.2

Asymptotic theory

One way to assess the performance of 𝛽𝜂 as an estimator of 𝛽 is through the squared error (range) loss L(𝜂) = ‖𝒯(𝛽𝜂 − 𝛽)‖2𝕐 whose probabilistic behavior is influenced by that of both X and 𝜀. However, L(𝜂) = Op (𝔼𝜀 L(𝜂))

(11.6)

𝔼𝜀 L(𝜂) = ‖𝔼𝜀 𝒯(𝛽𝜂 − 𝛽)‖2𝕐 + 𝔼𝜀 ‖𝒯(𝛽𝜂 − 𝔼𝜀 𝛽𝜂 )‖2𝕐 ≤ 𝜂‖𝛽‖2𝕎 + 𝔼𝜀 ‖𝒯(𝛽𝜂 − 𝔼𝜀 𝛽𝜂 )‖2𝕐 .

(11.7)

and, from Theorem 11.1.1,

q

The first term in the last expression is independent of X, and, as a result, we can direct our efforts toward assessing the magnitude of the second entry. Recall that 𝕎q is a reproducing kernel Hilbert space with rk, denoted here by R, that is defined as in (2.46) and the sample covariance X kernel is given by Kn (s, t) = n

−1

n ∑

Xi (s)Xi (t).

i=1

Using these two functions, we can obtain a useful characterization for 𝒯∗ 𝒯. Theorem 11.2.1 Let ℛ and 𝒦n be the 𝕃2 integral operators with kernels R and Kn , respectively. Then, ∗

(𝒯 Y)(⋅) = n

−1

n ∑ i=1

and, for g ∈ 𝕎q ,

1

Yi

∫0

Xi (t)R(⋅, t)dt

(𝒯∗ 𝒯g)(⋅) = (ℛ𝒦n g)(⋅).

Proof: First, by the reproducing property, (𝒯∗ Y)(s) = ⟨𝒯∗ Y, R(s, ⋅)⟩𝕎q = ⟨Y, 𝒯R(s, ⋅)⟩𝕐 .

(11.8)

(11.9)

REGRESSION

309

To prove (11.9), one simply concludes from (11.8) that (𝒯∗ 𝒯g)(⋅) = n−1

n ∑

1

∫0

i=1

[

1

=

∫0

1

Xi (s)g(s)ds

R(⋅, t)

1

∫0

∫0

Xi (t)R(⋅, t)dtds

] Kn (s, t)g(s)ds dt.



It will be convenient to use expansions based on the 𝕎q functions {ej }∞ j=1 described in Theorem 2.8.3 that satisfy 1

ei (t)ej (t)dt = 𝛿ij

∫0 and

1

∫0

(q)

(q)

ei (t)ej (t)dt = 𝛾j 𝛿ij ,

where 𝛾1 = · · · = 𝛾q = 0 and C1 j2q ≤ 𝛾(j+q) ≤ C2 j2q for C1 , C2 ∈ (0, ∞). The −1∕2 ej provide a CONS for 𝕃2 while the functions 𝜈j ej with 𝜈j = 1 + 𝛾j represent a CONS for 𝕎q . Using these functions, we can, for example, write 𝒦n =

∞ ∞ ∑ ∑

rjk ej ⊗ ek ,

(11.10)

j=1 k=1

with ⊗ indicating 𝕃2 tensor product and rjk = n−1

n ∑

⟨Xi , ej ⟩⟨Xi , ek ⟩.

i=1

Now write g ∈ 𝕎q as g =

∑∞ j=1

𝒯∗ 𝒯g =

−1∕2

gj 𝜈 j

∞ ∞ ∑ ∑

ej and apply Theorem 11.2.1 to obtain −1∕2

rjk 𝜈j

gj ℛek .

(11.11)

j=1 k=1

This leads us to the following conclusion. Theorem 11.2.2 Let g1 , g2 ∈ 𝕎q have the representations gi = −1∕2 𝜈j gij ej for square summable coefficient sequences {gij }. Then, ⟨𝒯 𝒯g1 , g2 ⟩𝕎q = ∗

∞ ∞ ∑ ∑ j=1 k=1

−1∕2 −1∕2 𝜈k g1j g2k .

rjk 𝜈j

∑∞ j=1

310

THEORETICAL FOUNDATIONS OF FUNCTIONAL DATA ANALYSIS

Proof: The result follows from (11.11) once we realize that 1

1

dq (q) (ℛek )(s)ej (s)ds ∫0 ∫0 dsq ( 1[ 1 ] ) 𝜕q (q) = ek (t) R(s, t)ej (s) + q R(s, t)ej (s) ds dt ∫0 ∫0 𝜕s

⟨ℛek , ej ⟩𝕎q =

(ℛek )(s)ej (s)ds +

1

=

∫0

1

ek (t)⟨R(⋅, t), ej ⟩𝕎q dt =

∫0

ek (t)ej (t).



Theorem 2.8.4 has the consequence that the mapping from \(\mathbb{W}^q\) to the sequence space \(\ell^2\) of Example 2.2.12 defined by
\[
g = \sum_{j=1}^\infty \nu_j^{-1/2} g_j e_j \;\mapsto\; (g_1, g_2, \ldots) \tag{11.12}
\]
is an isometric isomorphism. Theorem 11.2.2 then provides the \(\ell^2\) incarnation of \(\mathcal{T}^*\mathcal{T}\) as
\[
\mathcal{L} = \sum_{j=1}^\infty\sum_{k=1}^\infty r_{jk}\,\nu_j^{-1/2}\nu_k^{-1/2}\,\phi_j \otimes_{\ell^2}\phi_k,
\]
where \(\phi_j\) is the element of \(\ell^2\) with all zero components except for a one as its jth entry. Observe that we can express \(\mathcal{L}\) as the composition \(\mathcal{L} = \mathcal{D}\mathcal{M}\mathcal{D}\), where
\[
\mathcal{D} = \sum_{j=1}^\infty \nu_j^{-1/2}\,\phi_j \otimes_{\ell^2}\phi_j \tag{11.13}
\]
and
\[
\mathcal{M} = \sum_{j=1}^\infty\sum_{k=1}^\infty r_{jk}\,\phi_j \otimes_{\ell^2}\phi_k.
\]
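As a purely numerical aside, finite sections of 𝒟 and ℳ can be formed from basis coefficients and used to evaluate trace quantities of the kind bounded in Theorem 11.2.3 below. In the sketch, a cosine basis stands in for the \(e_j\) of Theorem 2.8.3, the simulated curves are arbitrary, and the truncation level, penalty, and order q are all assumptions made only for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, q, eta, J = 201, 50, 2, 1e-3, 30    # grid, sample size, order, penalty, truncation (assumptions)
t = np.linspace(0.0, 1.0, m)
X = np.cumsum(rng.normal(scale=np.sqrt(1.0 / m), size=(n, m)), axis=1)

# Cosine basis as a stand-in for the e_j of Theorem 2.8.3 (an assumption).
j = np.arange(J)
E = np.where(j[:, None] == 0, 1.0, np.sqrt(2.0) * np.cos(np.pi * j[:, None] * t))
nu = 1.0 + (np.pi * j) ** (2 * q)         # nu_j = 1 + gamma_j, gamma_j of order j^{2q}

coef = X @ E.T / m                        # <X_i, e_j> by a crude rectangle rule
M = coef.T @ coef / n                     # M[j, k] = r_jk
D = np.diag(nu ** -0.5)
L = D @ M @ D                             # finite section of the l2 operator

dof = np.trace(np.linalg.solve(L + eta * np.eye(J), L))
print("trace (L + eta I)^{-1} L ≈", round(dof, 3))
```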

Let \(\lambda_{1n} \ge \lambda_{2n} \ge \cdots\) be the eigenvalues of the \(\mathbb{L}^2\) integral operator \(\mathcal{K}_n\), which are also the eigenvalues of \(\mathcal{M}\). Similarly, take \(\lambda_1 \ge \lambda_2 \ge \cdots\) to be the eigenvalues of the \(\mathbb{L}^2\) integral operator \(\mathcal{K}\) induced by the X covariance kernel. Then, for each k, define
\[
V_{kn} = \sum_{j>k}\lambda_{jn}
\]
and
\[
V_k = \sum_{j>k}\lambda_j.
\]

Theorem 11.2.3  For all j, k ≥ 1,
\[
\mathrm{trace}\,\mathcal{G}(\eta)^{-1}\mathcal{T}^*\mathcal{T} \le k + j + \frac{V_{kn}}{\eta\,\nu_{j+1}}.
\]

Proof: The isometric isomorphism in (11.12) entails that
\[
\mathrm{trace}\,\mathcal{G}(\eta)^{-1}\mathcal{T}^*\mathcal{T} = \mathrm{trace}\,(\mathcal{L} + \eta I)^{-1}\mathcal{L}.
\]
Let \(\mathcal{P}\) be the projection operator onto the span of the first k eigenfunctions of \(\mathcal{M}\) and set \(\mathcal{Q} = I - \mathcal{P}\). Then,
\[
\mathrm{trace}\,(\mathcal{L} + \eta I)^{-1}\mathcal{L}
= \mathrm{trace}\,(\mathcal{L} + \eta I)^{-1}\mathcal{D}\mathcal{M}\mathcal{P}\mathcal{D}
+ \mathrm{trace}\,(\mathcal{L} + \eta I)^{-1}\mathcal{D}\mathcal{M}\mathcal{Q}\mathcal{D}.
\]
Note that \(\mathcal{M}\) and \(\mathcal{P}\) (and, hence, \(\mathcal{Q}\)) commute. In addition, as \(\mathcal{L} \ge \mathcal{D}\mathcal{M}\mathcal{P}\mathcal{D}\) and \(\mathcal{L} \ge \mathcal{D}\mathcal{M}\mathcal{Q}\mathcal{D}\), we have
\[
\mathrm{trace}\,(\mathcal{L} + \eta I)^{-1}\mathcal{L}
\le \mathrm{trace}\,(\mathcal{D}\mathcal{M}\mathcal{P}\mathcal{D} + \eta I)^{-1}\mathcal{D}\mathcal{M}\mathcal{P}\mathcal{D}
+ \mathrm{trace}\,(\mathcal{D}\mathcal{M}\mathcal{Q}\mathcal{D} + \eta I)^{-1}\mathcal{D}\mathcal{M}\mathcal{Q}\mathcal{D}. \tag{11.14}
\]
Observe that
\[
(\mathcal{D}\mathcal{M}\mathcal{P}\mathcal{D} + \eta I)^{-1}\mathcal{D}\mathcal{M}\mathcal{P}\mathcal{D}
= I - \eta(\mathcal{D}\mathcal{M}\mathcal{P}\mathcal{D} + \eta I)^{-1}.
\]
The fact that \(\mathcal{M}\) and \(\mathcal{P}\) commute now has the consequence that \((\mathcal{D}\mathcal{M}\mathcal{P}\mathcal{D} + \eta I)^{-1}\mathcal{D}\mathcal{M}\mathcal{P}\mathcal{D}\) is self-adjoint. In addition,
\[
\|(\mathcal{D}\mathcal{M}\mathcal{P}\mathcal{D} + \eta I)^{-1}\mathcal{D}\mathcal{M}\mathcal{P}\mathcal{D}\|
\le \|\mathcal{D}\mathcal{M}\mathcal{P}\mathcal{D}\| / \|\mathcal{D}\mathcal{M}\mathcal{P}\mathcal{D} + \eta I\| \le 1,
\]
so that
\[
\mathrm{trace}\,(\mathcal{D}\mathcal{M}\mathcal{P}\mathcal{D} + \eta I)^{-1}\mathcal{D}\mathcal{M}\mathcal{P}\mathcal{D}
\le \dim(\mathrm{Im}(\mathcal{P})) = k. \tag{11.15}
\]
Similarly, \((\mathcal{D}\mathcal{M}\mathcal{Q}\mathcal{D} + \eta I)^{-1}\mathcal{D}\mathcal{M}\mathcal{Q}\mathcal{D}\) is self-adjoint with norm bounded by 1 and
\[
(\mathcal{D}\mathcal{M}\mathcal{Q}\mathcal{D} + \eta I)^{-1}\mathcal{D}\mathcal{M}\mathcal{Q}\mathcal{D}
\le \eta^{-1}\,\mathcal{D}\mathcal{M}\mathcal{Q}\mathcal{D}.
\]


Thus, for any j,
\[
\begin{aligned}
\mathrm{trace}\,(\mathcal{D}\mathcal{M}\mathcal{Q}\mathcal{D} + \eta I)^{-1}\mathcal{D}\mathcal{M}\mathcal{Q}\mathcal{D}
&= \sum_{i=1}^\infty \langle\phi_i, (\mathcal{D}\mathcal{M}\mathcal{Q}\mathcal{D} + \eta I)^{-1}\mathcal{D}\mathcal{M}\mathcal{Q}\mathcal{D}\,\phi_i\rangle_{\ell^2} \\
&\le j + \frac{1}{\eta}\sum_{i>j}\langle\phi_i, \mathcal{D}\mathcal{M}\mathcal{Q}\mathcal{D}\,\phi_i\rangle_{\ell^2} \\
&\le j + \frac{1}{\eta\,\nu_{j+1}}\sum_{i>j}\langle\phi_i, \mathcal{M}\mathcal{Q}\,\phi_i\rangle_{\ell^2} \\
&\le j + \frac{1}{\eta\,\nu_{j+1}}\,\mathrm{trace}\,\mathcal{M}\mathcal{Q}.
\end{aligned} \tag{11.16}
\]

The conclusion of the theorem is now a direct consequence of (11.15) and (11.16). ◽

The following result provides a tool we can use for working with the \(V_{kn}\).

Lemma 11.2.4  For any sequence \(k_n\), \(V_{k_n n} = O_p(V_{k_n})\).

Proof: Let \(\mathcal{P}_n\) be the projection operator onto the span of the first k sample eigenfunctions from \(\mathcal{K}_n\) and, similarly, let \(\mathcal{P}\) be the projection operator for the first k eigenfunctions of \(\mathcal{K}\). Then, \(V_{kn}\) and \(V_k\) can be expressed as
\[
V_{kn} = \mathrm{trace}\,\mathcal{K}_n(I - \mathcal{P}_n)
\quad\text{and}\quad
V_k = \mathrm{trace}\,\mathcal{K}(I - \mathcal{P}).
\]
Theorem 4.4.7 gives \(\mathrm{trace}\,\mathcal{K}_n\mathcal{P}_n \ge \mathrm{trace}\,\mathcal{K}_n\mathcal{P}\) and, hence, \(\mathrm{trace}\,\mathcal{K}_n(I - \mathcal{P}_n) \le \mathrm{trace}\,\mathcal{K}_n(I - \mathcal{P})\). Now, by Chebyshev's inequality and the fact that \(\mathbb{E}[\mathrm{trace}\,\mathcal{K}_n(I - \mathcal{P})] = \mathrm{trace}\,\mathcal{K}(I - \mathcal{P})\), we obtain
\[
\mathbb{P}\bigl(\mathrm{trace}\,\mathcal{K}_n(I - \mathcal{P}_n) > a\bigr)
\le \mathbb{P}\bigl(\mathrm{trace}\,\mathcal{K}_n(I - \mathcal{P}) > a\bigr)
\le \frac{\mathrm{trace}\,\mathcal{K}(I - \mathcal{P})}{a}. \qquad\Box
\]


When used in conjunction with Theorems 11.1.1 and 11.2.3, the previous lemma allows us to establish

Theorem 11.2.5  Assume that \(V_k = O(k^{-\alpha})\) for some α > 0. Then, the choice \(\eta = n^{-\frac{2q+\alpha+1}{2q+\alpha+2}}\) yields
\[
\mathbb{E}_\varepsilon\|\mathcal{T}(\beta_\eta - \beta)\|_{\mathbb{Y}}^2 = O_p\!\left(n^{-\frac{2q+\alpha+1}{2q+\alpha+2}}\right) \tag{11.17}
\]
and
\[
\mathbb{E}_\varepsilon\|\beta_\eta - \beta\|_{\mathbb{W}^q}^2 = O_p(1). \tag{11.18}
\]

It may be helpful to remember here that expression (11.17) is still a stochastic function of X1, … , Xn, which explains the form of the bound.

Proof: By Lemma 11.2.4, we conclude that \(V_{kn} = O_p(k^{-\alpha})\). Theorem 11.2.3, with \(k = j = [n^{\frac{1}{2q+\alpha+2}}]\), then gives
\[
\mathrm{trace}\,\mathcal{G}(\eta)^{-1}\mathcal{T}^*\mathcal{T} = O_p\!\left(n^{\frac{1}{2q+\alpha+2}}\right). \tag{11.19}
\]
Note that \(\mathcal{G}(\eta)^{-1}\mathcal{T}^*\mathcal{T}\) is self-adjoint and bounded by I. Thus,
\[
\mathrm{trace}\,\bigl(\mathcal{G}(\eta)^{-1}\mathcal{T}^*\mathcal{T}\bigr)^2 \le \mathrm{trace}\,\mathcal{G}(\eta)^{-1}\mathcal{T}^*\mathcal{T}
\]
and (11.17) follows from Theorem 11.1.1. From (11.5), we have
\[
\mathbb{E}_\varepsilon\|\beta_\eta\|_{\mathbb{W}^q}^2 \le 2\|\beta\|_{\mathbb{W}^q}^2 + \frac{2\sigma^2}{n}\,\mathrm{trace}\,\mathcal{G}(\eta)^{-2}\mathcal{T}^*\mathcal{T}.
\]
In view of (11.19) and the fact that \(\mathrm{trace}\,\mathcal{G}(\eta)^{-2}\mathcal{T}^*\mathcal{T} \le \eta^{-1}\,\mathrm{trace}\,\mathcal{G}(\eta)^{-1}\mathcal{T}^*\mathcal{T}\),
\[
\frac{1}{n}\,\mathrm{trace}\,\mathcal{G}(\eta)^{-2}\mathcal{T}^*\mathcal{T}
= O_p\!\left(\frac{1}{n\eta}\,n^{\frac{1}{2q+\alpha+2}}\right) = O_p(1).
\]
Thus, (11.18) is established. ◽

A case of some interest occurs when X is the Brownian motion process with covariance kernel \(K(s,t) = \min(s,t)\) for \(s, t \in [0,1]\). In that instance, we know from Example 4.6.3 that \(\lambda_j = 4/((2j-1)\pi)^2\). An integral estimate then shows that \(V_k = O(k^{-1})\). Thus, α = 1 in Theorem 11.2.5 and
\[
\|\mathcal{T}(\beta_\eta - \beta)\|_{\mathbb{Y}}^2 = O_p\!\left(n^{-\frac{2q+2}{2q+3}}\right).
\]
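The order statement for \(V_k\) in this example is easy to check numerically; the short sketch below is only a sanity check, with the series truncation and the values of q chosen arbitrarily.

```python
import numpy as np

# Brownian motion eigenvalues lambda_j = 4 / ((2j - 1) pi)^2 from Example 4.6.3.
J = 200000
lam = 4.0 / ((2.0 * np.arange(1, J + 1) - 1.0) * np.pi) ** 2

for k in (10, 100, 1000):
    V_k = lam[k:].sum()                   # V_k = sum_{j > k} lambda_j (truncated series)
    print(k, V_k, V_k * k)                # V_k * k settles near 1/pi^2, so V_k = O(1/k)

# With alpha = 1, the squared-error rate exponent is (2q + 2)/(2q + 3).
for q in (1, 2):
    print("q =", q, "rate exponent =", (2 * q + 2) / (2 * q + 3))
```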

Now let us examine the prediction properties of our estimator. For that purpose, let X be a new, independent observation and consider \(\mathbb{E}_{\varepsilon,X}(\langle X,\beta_\eta\rangle - \langle X,\beta\rangle)^2\), where \(\mathbb{E}_{\varepsilon,X}\) indicates expectation with respect to both X and ε. Note that for any \(g \in \mathbb{L}^2\),
\[
\|\mathcal{T} g\|_{\mathbb{Y}}^2 = \langle g, \mathcal{K}_n g\rangle
\]
and, from Theorem 7.2.5, \(\mathbb{E}_X\langle X, g\rangle^2 = \langle g, \mathcal{K} g\rangle\). Thus,
\[
\mathbb{E}_\varepsilon\|\mathcal{T}(\beta_\eta - \beta)\|_{\mathbb{Y}}^2
= \mathbb{E}_\varepsilon\langle\beta_\eta - \beta, \mathcal{K}_n(\beta_\eta - \beta)\rangle
\]
and
\[
\mathbb{E}_{\varepsilon,X}(\langle X,\beta_\eta\rangle - \langle X,\beta\rangle)^2
= \mathbb{E}_\varepsilon\langle\beta_\eta - \beta, \mathcal{K}(\beta_\eta - \beta)\rangle
= \mathbb{E}_\varepsilon\langle\beta_\eta - \beta, (\mathcal{K} - \mathcal{K}_n)(\beta_\eta - \beta)\rangle
+ \mathbb{E}_\varepsilon\|\mathcal{T}(\beta_\eta - \beta)\|_{\mathbb{Y}}^2. \tag{11.20}
\]

The second term in the last expression was already treated in Theorem 11.2.5. It therefore suffices to consider the first term. Let
\[
\mathcal{K} = \sum_{j=1}^\infty \lambda_j\,\psi_j \otimes \psi_j
\]
be the eigen-expansion for \(\mathcal{K}\) and define the scores \(Z_{ij} = \langle X_i, \psi_j\rangle\) for i = 1, … , n. Their associated empirical covariance is
\[
\tau_{jk} = \frac{1}{\sqrt{n\,\lambda_j\lambda_k}}\sum_{i=1}^n \bigl(Z_{ij}Z_{ik} - \lambda_j\delta_{jk}\bigr)
\]
and
\[
\mathcal{K}_n = \sum_{j=1}^\infty\sum_{k=1}^\infty\left(n^{-1}\sum_{i=1}^n Z_{ij}Z_{ik}\right)\psi_j\otimes\psi_k
= n^{-1/2}\sum_{j=1}^\infty\sum_{k=1}^\infty \sqrt{\lambda_j\lambda_k}\,\tau_{jk}\,\psi_j\otimes\psi_k + \mathcal{K}. \tag{11.21}
\]

Lemma 11.2.6  Suppose that there exists a fixed C < ∞ such that
\[
\mathrm{Var}(Z_{ij}Z_{ik}) \le C\,\lambda_j\lambda_k \tag{11.22}
\]
for all j, k. Then, for any \(g \in \mathbb{L}^2\) and any k ≥ 1,
\[
\begin{aligned}
\langle g, (\mathcal{K} - \mathcal{K}_n)g\rangle
&\le O_p(n^{-1})\,\|g\|^2\sum_{r=1}^k V_r
+ O_p(n^{-1/2})\,\|g\|\,\|\mathcal{T} g\|_{\mathbb{Y}}\left(\sum_{r=1}^k V_r\right)^{1/2} \\
&\quad + O_p(n^{-3/4})\,\|g\|^2\left(V_k\sum_{r=1}^k V_r\right)^{1/2}
+ O_p(n^{-1/2})\,\|g\|^2\,V_k,
\end{aligned} \tag{11.23}
\]

where the \(O_p\) terms are functions of X1, … , Xn only.

Proof: From (11.21),
\[
\langle g, (\mathcal{K}_n - \mathcal{K})g\rangle
= \frac{1}{\sqrt{n}}\sum_{j=1}^\infty\sum_{k=1}^\infty g_j g_k\sqrt{\lambda_j\lambda_k}\,\tau_{jk},
\]
where \(g_j = \langle g, \psi_j\rangle\). Then,
\[
\begin{aligned}
|\langle g, (\mathcal{K}_n - \mathcal{K})g\rangle|
&\le \frac{2}{\sqrt{n}}\sum_{r=1}^k\sum_{s=r+1}^\infty |g_r g_s\sqrt{\lambda_r\lambda_s}\,\tau_{rs}|
+ \frac{2}{\sqrt{n}}\sum_{r=k+1}^\infty\sum_{s=r+1}^\infty |g_r g_s\sqrt{\lambda_r\lambda_s}\,\tau_{rs}| \\
&\le \frac{2}{\sqrt{n}}\left(\sum_{r=1}^k\sum_{s=r+1}^\infty \lambda_r g_r^2 g_s^2\right)^{1/2}\left(\sum_{r=1}^k\sum_{s=r+1}^\infty \lambda_s\tau_{rs}^2\right)^{1/2} \\
&\quad + \frac{2}{\sqrt{n}}\left(\sum_{r=k+1}^\infty\sum_{s=r+1}^\infty g_r^2 g_s^2\right)^{1/2}\left(\sum_{r=k+1}^\infty\sum_{s=r+1}^\infty \lambda_r\lambda_s\tau_{rs}^2\right)^{1/2}.
\end{aligned}
\]


We will proceed by obtaining bounds for the terms in this last expression. Now, \(\sum_{r=1}^\infty g_r^2 \le \|g\|^2\) and \(\sum_{r=1}^\infty \lambda_r g_r^2 = \langle g, \mathcal{K} g\rangle\). Thus,
\[
\sum_{r=1}^k\sum_{s=r+1}^\infty \lambda_r g_r^2 g_s^2 \le \langle g, \mathcal{K} g\rangle\,\|g\|^2
\]
and
\[
\sum_{r=k+1}^\infty\sum_{s=r+1}^\infty g_r^2 g_s^2 \le \|g\|^4.
\]
Using assumption (11.22), \(\mathbb{E}\tau_{rs}^2 \le C\) for all r, s, so that
\[
\sum_{r=1}^k\sum_{s=r+1}^\infty \lambda_s\tau_{rs}^2 = O_p(1)\sum_{r=1}^k V_r
\]
and
\[
\sum_{r=k+1}^\infty\sum_{s=r+1}^\infty \lambda_r\lambda_s\tau_{rs}^2 = O_p(V_k^2).
\]

Upon combining all our bounds, we obtain
\[
\langle g, (\mathcal{K} - \mathcal{K}_n)g\rangle
\le \langle g, \mathcal{K}_n g\rangle
+ O_p(n^{-1/2})\,\|g\|\,\langle g, \mathcal{K} g\rangle^{1/2}\left(\sum_{r=1}^k V_r\right)^{1/2}
+ O_p(n^{-1/2})\,\|g\|^2\,V_k. \tag{11.24}
\]
Now, if \(x_0^2 \le bx_0 + c\) and \(x_0 \ge 0\), then it must be that \(x_0\) is less than or equal to the positive root of the quadratic equation \(x^2 - bx - c = 0\): namely,
\[
0 \le x_0 \le \frac{b + \sqrt{b^2 + 4c}}{2} \le \left(\frac{1}{2} + \frac{1}{\sqrt{2}}\right)b + \sqrt{2c}.
\]
Applying this to (11.24), we get
\[
\langle g, \mathcal{K} g\rangle^{1/2}
\le O_p(n^{-1/2})\,\|g\|\left(\sum_{r=1}^k V_r\right)^{1/2}
+ O_p\bigl(\langle g, \mathcal{K}_n g\rangle^{1/2}\bigr)
+ O_p(n^{-1/4})\,\|g\|\,V_k^{1/2}.
\]
Upon substituting this back in (11.24), we obtain (11.23). ◽

Our main result concerning prediction can now be stated as follows.


Theorem 11.2.7  Assume that (11.22) holds and \(V_k = O(k^{-\alpha})\) for some α > 0. Then, if \(\eta = n^{-\frac{2q+\alpha+1}{2q+\alpha+2}}\),
\[
\mathbb{E}_{\varepsilon,X}(\langle X,\beta_\eta\rangle - \langle X,\beta\rangle)^2
= O_p\!\left(n^{-\frac{2q+\alpha+1}{2q+\alpha+2}} + n^{-\frac{\alpha+1}{2}}\right). \tag{11.25}
\]

Proof: By (11.20) and (11.23), it is sufficient to consider
\[
\begin{aligned}
A_1 &= O_p(n^{-1})\left(\sum_{r=1}^k V_r\right)\mathbb{E}_\varepsilon\|\beta_\eta - \beta\|^2, \\
A_2 &= O_p(n^{-1/2})\left(\sum_{r=1}^k V_r\right)^{1/2}\bigl\{\mathbb{E}_\varepsilon(\|\beta_\eta - \beta\|^2)\,\mathbb{E}_\varepsilon(\|\mathcal{T}(\beta_\eta - \beta)\|_{\mathbb{Y}}^2)\bigr\}^{1/2}, \\
A_3 &= O_p(n^{-3/4})\left(V_k\sum_{r=1}^k V_r\right)^{1/2}\mathbb{E}_\varepsilon\|\beta_\eta - \beta\|^2, \\
A_4 &= O_p(n^{-1/2})\,V_k\,\mathbb{E}_\varepsilon\|\beta_\eta - \beta\|^2.
\end{aligned}
\]

Bounds for both \(\mathbb{E}_\varepsilon\|\beta_\eta - \beta\|^2\) and \(\mathbb{E}_\varepsilon\|\mathcal{T}(\beta_\eta - \beta)\|_{\mathbb{Y}}^2\) were obtained in Theorem 11.2.5, so that only routine calculations are required to establish the stated rates. For instance, if α > 1 so that \(\sum_{r=1}^\infty V_r < \infty\), then, with \(k = [n^{1/2}]\),
\[
\begin{aligned}
A_1 &= O_p(n^{-1}) = o\!\left(n^{-\frac{2q+\alpha+1}{2q+\alpha+2}}\right), \\
A_2 &= O_p\!\left(n^{-\frac{1}{2} - \frac{1}{2}\frac{2q+\alpha+1}{2q+\alpha+2}}\right) = o\!\left(n^{-\frac{2q+\alpha+1}{2q+\alpha+2}}\right), \\
A_3 &= O_p\!\left(n^{-\frac{3+\alpha}{4}}\right) = o(n^{-1}) = o\!\left(n^{-\frac{2q+\alpha+1}{2q+\alpha+2}}\right), \\
A_4 &= O_p\!\left(n^{-\frac{\alpha+1}{2}}\right).
\end{aligned}
\]
Bounds for α ≤ 1 can be established similarly, also using the choice \(k = [n^{1/2}]\). ◽

The value of α determines the dominant term in (11.25). Provided α ≥ 1,
\[
\mathbb{E}_{\varepsilon,X}(\langle X,\beta_\eta\rangle - \langle X,\beta\rangle)^2
= O_p\!\left(n^{-\frac{2q+\alpha+1}{2q+\alpha+2}}\right).
\]
As mentioned earlier, this condition holds when X is a Brownian motion process. Theorem 11.2.7 provides information about the large sample prediction ability of \(\beta_\eta\). However, in order to fully appreciate its message, we need to understand what is optimal in terms of convergence rates for this problem. That is the subject of Section 11.3.

11.3 Minimax optimality

In this section, we obtain results that speak to the rate optimality of our penalized least-squares estimator of β. We do so in the context of prediction and, as before, assume that data (Xi, Yi), i = 1, … , n, have been obtained from the regression model (11.1). Given some estimator \(\tilde\beta\) of the coefficient function obtained from this data and a new independent observation X, we can predict the value of ⟨X, β⟩ by that of \(\langle X, \tilde\beta\rangle\). We will now derive the minimax convergence rate of \(\langle X, \tilde\beta\rangle\) to ⟨X, β⟩ over all choices for \(\tilde\beta \in \mathbb{L}^2\).

Each functional regression model of the form (11.1) can be specified by a triple θ = (β, pX, pε) with β the coefficient function and pX, pε the probability distributions for X and ε. We will consider how our estimator performs over a class of models as determined by
\[
\Theta = \bigl\{\theta = (\beta, p_X, p_\varepsilon) : \|\beta\|_{\mathbb{W}^q} \le 1,\; X \in \mathbb{L}^2,\; V_k \le k^{-\alpha}\bigr\}.
\]
Then, given any specific choice for an estimator \(\tilde\beta\) of the coefficient function, the worst that can happen is if the real model is \(\theta_0 = (\beta_0, p_X^0, p_\varepsilon^0)\) for which
\[
\mathbb{E}_{\theta_0}\bigl(\langle X, \tilde\beta\rangle - \langle X, \beta_0\rangle\bigr)^2
= \sup_{\theta\in\Theta}\mathbb{E}_\theta\bigl(\langle X, \tilde\beta\rangle - \langle X, \beta\rangle\bigr)^2 \tag{11.26}
\]
with \(\mathbb{E}_\theta\) indicating expectation under the model corresponding to θ. A minimax estimator is a choice for \(\tilde\beta\) that makes (11.26) as small as possible. In lieu of determining an explicit form for such an estimator, we will consider the rate of convergence of the infimum of (11.26) to zero as the sample size grows large. Then, any estimator that attains this rate is minimax optimal in an asymptotic or rate of convergence sense.

Theorem 11.3.1  Assume that α ≥ 1. Then, for some constant C ∈ (0, ∞),
\[
\liminf_{n\to\infty}\;\inf_{\tilde\beta\in\mathbb{L}^2}\;\sup_{\theta\in\Theta}\;
\mathbb{E}_\theta\bigl(\langle X, \tilde\beta\rangle - \langle X, \beta\rangle\bigr)^2
\ge C\,n^{-\frac{2q+\alpha+1}{2q+\alpha+2}}.
\]

̃ 2 𝜃∈Θ 𝛽∈𝕃

[ 1 ] Proof: Let kn = n 2q+𝛼+2 and consider a 𝜃 for which p𝜀 is a standard normal distribution, 2kn ∑ −1∕2 −1∕2 𝛽= kn 𝛿j 𝜈j ej j=kn +1

and X=

2kn ∑ j=kn +1

− 𝛼+1 4q

𝜉j 𝜈j

ej ,

REGRESSION

319

where 𝛿j = 0 or 1, the ej ’s are, again, as in Theorem 2.8.4 and the 𝜉j are inde√ √ −1∕2 pendent uniform random variables on [− 3, 3]. As {𝜈j ej }∞ is a CONS j=1 for 𝕎q , this specification ensures that ‖𝛽‖𝕎q ≤ 1. If we denote the collection of all such 𝜃 by Θ0 , it is clearly true that ( ) ( ) ̃ − ⟨X, 𝛽⟩ 2 ≥ sup 𝔼𝜃 ⟨X, 𝛽⟩ ̃ − ⟨X, 𝛽⟩ 2 . sup 𝔼𝜃 ⟨X, 𝛽⟩ (11.27) 𝜃∈Θ

𝜃∈Θ0

Now, let \(\tilde\beta \in \mathbb{L}^2\) be an arbitrary estimator of β and write
\[
\tilde\beta = \sum_{j=1}^\infty k_n^{-1/2}\,\nu_j^{-1/2}\,\tilde\delta_j\,e_j
\]
for some coefficient sequence \(\{\tilde\delta_j\}\). Then, for any model θ ∈ Θ0,
\[
\langle X, \tilde\beta - \beta\rangle
= \sum_{j=k_n+1}^{2k_n} k_n^{-1/2}\,\nu_j^{-\frac{2q+\alpha+1}{4q}}\,(\tilde\delta_j - \delta_j)\,\xi_j,
\]
and it follows that
\[
\mathbb{E}_\theta\langle X, \tilde\beta - \beta\rangle^2
= \sum_{j=k_n+1}^{2k_n} k_n^{-1}\,\nu_j^{-\frac{2q+\alpha+1}{2q}}\,\mathbb{E}_\theta(\tilde\delta_j - \delta_j)^2. \tag{11.28}
\]

Combining (11.27) and (11.28) gives
\[
\begin{aligned}
\sup_{\theta\in\Theta_0}\mathbb{E}_\theta\langle X, \tilde\beta - \beta\rangle^2
&\ge \frac{1}{2^{k_n}}\sum_{\theta\in\Theta_0}\sum_{j=k_n+1}^{2k_n} k_n^{-1}\,\nu_j^{-\frac{2q+\alpha+1}{2q}}\,\mathbb{E}_\theta(\tilde\delta_j - \delta_j)^2 \\
&= \sum_{j=k_n+1}^{2k_n} k_n^{-1}\,\nu_j^{-\frac{2q+\alpha+1}{2q}}\,\frac{1}{2^{k_n}}\sum_{\theta\in\Theta_0}\mathbb{E}_\theta(\tilde\delta_j - \delta_j)^2.
\end{aligned}
\]

For each j = kn + 1, … , 2kn and θ ∈ Θ0, let \(\theta_{j0} = \theta\) and take \(\theta_{j1}\) to be the same as θ except for flipping the value of \(\delta_j\): i.e., \(\delta_{j1} = 1 - \delta_{j0}\). By symmetry,
\[
\begin{aligned}
\sum_{j=k_n+1}^{2k_n} k_n^{-1}\,\nu_j^{-\frac{2q+\alpha+1}{2q}}\,\frac{1}{2^{k_n}}\sum_{\theta\in\Theta_0}\mathbb{E}_\theta(\tilde\delta_j - \delta_j)^2
&= \sum_{j=k_n+1}^{2k_n} k_n^{-1}\,\nu_j^{-\frac{2q+\alpha+1}{2q}}\,\frac{1}{2^{k_n}}\sum_{\theta\in\Theta_0}\frac{1}{2}\Bigl[\mathbb{E}_{\theta_{j0}}(\tilde\delta_j - \delta_j)^2 + \mathbb{E}_{\theta_{j1}}(\tilde\delta_j - \delta_j)^2\Bigr] \\
&\ge \nu_{2k_n}^{-\frac{2q+\alpha+1}{2q}}\inf_{k_n<j\le 2k_n}\frac{1}{2^{k_n}}\sum_{\theta\in\Theta_0}\frac{1}{2}\Bigl[\mathbb{E}_{\theta_{j0}}(\tilde\delta_j - \delta_j)^2 + \mathbb{E}_{\theta_{j1}}(\tilde\delta_j - \delta_j)^2\Bigr].
\end{aligned}
\]
Thus, since \(\nu_{2k_n}^{-\frac{2q+\alpha+1}{2q}}\) is bounded below by a constant multiple of \(k_n^{-(2q+\alpha+1)}\) and \(k_n = [n^{\frac{1}{2q+\alpha+2}}]\), there is a constant C > 0 for which
\[
\sup_{\theta\in\Theta}\mathbb{E}_\theta\langle X, \tilde\beta - \beta\rangle^2
\ge C\,n^{-\frac{2q+\alpha+1}{2q+\alpha+2}}\,\inf_{\tilde\delta_j}\,\inf_{k_n<j\le 2k_n}\,\max_{\theta\in\{\theta_{j0},\theta_{j1}\}}\mathbb{E}_\theta(\tilde\delta_j - \delta_j)^2,
\]
and it therefore remains to show that the infimum on the right-hand side is bounded away from 0.

Now set \(\hat\delta_j = I\bigl(|\tilde\delta_j - \delta_{j0}| > |\tilde\delta_j - \delta_{j1}|\bigr)\) and observe that
\[
1 = |\delta_{j0} - \delta_{j1}|
\le |\tilde\delta_j - \delta_{j0}| + |\tilde\delta_j - \delta_{j1}|
\le \begin{cases} 2|\tilde\delta_j - \delta_{j1}|, & \hat\delta_j = 0, \\ 2|\tilde\delta_j - \delta_{j0}|, & \hat\delta_j = 1. \end{cases}
\]

Thus, \(\mathbb{P}_{\theta_{j0}}(|\tilde\delta_j - \delta_j| > 1/2) \ge \mathbb{P}_{\theta_{j0}}(\hat\delta_j = 1)\) and \(\mathbb{P}_{\theta_{j1}}(|\tilde\delta_j - \delta_j| > 1/2) \ge \mathbb{P}_{\theta_{j1}}(\hat\delta_j = 0)\). So,
\[
\begin{aligned}
\max_{\theta\in\{\theta_{j0},\theta_{j1}\}}\mathbb{P}_\theta(|\tilde\delta_j - \delta_j| > 1/2)
&\ge \max\bigl\{\mathbb{P}_{\theta_{j0}}(\hat\delta_j = 1),\,\mathbb{P}_{\theta_{j1}}(\hat\delta_j = 0)\bigr\} \\
&\ge \frac{1}{2}\bigl\{\mathbb{P}_{\theta_{j0}}(\hat\delta_j = 1) + \mathbb{P}_{\theta_{j1}}(\hat\delta_j = 0)\bigr\}.
\end{aligned}
\]
Now, let \(L_{\theta_{j0}}/L_{\theta_{j1}}\) be the likelihood ratio for the models corresponding to \(\theta_{j0}\) and \(\theta_{j1}\). By the Neyman–Pearson Lemma,
\[
\mathbb{P}_{\theta_{j0}}(\hat\delta_j = 1) + \mathbb{P}_{\theta_{j1}}(\hat\delta_j = 0)
\ge \mathbb{P}_{\theta_{j0}}\bigl(L_{\theta_{j0}}/L_{\theta_{j1}} \le 1\bigr) + \mathbb{P}_{\theta_{j1}}\bigl(L_{\theta_{j0}}/L_{\theta_{j1}} \ge 1\bigr)
\]
and, hence,
\[
\max_{\theta\in\{\theta_{j0},\theta_{j1}\}}\mathbb{P}_\theta(|\tilde\delta_j - \delta_j| > 1/2)
\ge \frac{1}{2}\bigl\{\mathbb{P}_{\theta_{j0}}\bigl(L_{\theta_{j0}}/L_{\theta_{j1}} \le 1\bigr) + \mathbb{P}_{\theta_{j1}}\bigl(L_{\theta_{j0}}/L_{\theta_{j1}} \ge 1\bigr)\bigr\}
\ge \bigl\{4\,\mathbb{E}_{\theta_{j0}}\bigl(L_{\theta_{j1}}^2/L_{\theta_{j0}}^2\bigr)\bigr\}^{-1}.
\]


Thus,
\[
\max_{\theta\in\{\theta_{j0},\theta_{j1}\}}\mathbb{E}_\theta(\tilde\delta_j - \delta_j)^2
\ge \frac{1}{4}\max_{\theta\in\{\theta_{j0},\theta_{j1}\}}\mathbb{P}_\theta\bigl(|\tilde\delta_j - \delta_j(\theta)| > 1/2\bigr)
\ge \bigl\{16\,\mathbb{E}_{\theta_{j0}}\bigl(L_{\theta_{j1}}^2/L_{\theta_{j0}}^2\bigr)\bigr\}^{-1}.
\]
As the \(\varepsilon_i\) are iid standard normals, it is straightforward to show that
\[
\mathbb{E}_{\theta_{j0}}\bigl(L_{\theta_{j1}}^2/L_{\theta_{j0}}^2\bigr)
= \Bigl(\mathbb{E}_{\theta_0}\Bigl[\exp\Bigl\{\bigl(k_n^{-1/2}\,\nu_j^{-\frac{2q+\alpha+1}{4q}}\,\xi_j\bigr)^2\Bigr\}\Bigr]\Bigr)^n
= \bigl(1 + O(n^{-1})\bigr)^n = O(1).
\]
The result follows from this as the constants in the derivation do not depend on j, \(\tilde\delta_j\), or θ. ◽

Let us conclude by relating Theorem 11.3.1 to the developments at the end of the previous section. Specifically, when using this result in combination with Theorem 11.2.7, we see that a rate optimal choice for η ensures that \(\langle X, \beta_\eta\rangle\) attains the minimax optimal rate of squared error convergence as a predictor of ⟨X, β⟩.

11.4 Discretely sampled data

Until now, the assumption has been that the \(X_i\) are observed in their entirety. To conclude this chapter, we briefly examine the case where the X data can only be realized at some discrete set of sampling points. Suppose that each \(X_i\) is observed at the time ordinates \(t_{i1}, \ldots, t_{iJ_i}\). Let \(I_{ij}\) be an interval containing \(t_{ij}\) such that \(I_{i1}, \ldots, I_{iJ_i}\) form a partition of [0, 1]. Then, we approximate \(X_i\) by
\[
\widetilde X_i(t) = \sum_{j=1}^{J_i} X_i(t_{ij})\,I(t \in I_{ij}).
\]
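A small sketch of this piecewise-constant construction is given below. The sampling times, the choice of left-closed intervals for the partition, and the example curve are assumptions made only for the illustration.

```python
import numpy as np

def step_approximation(x_vals, t_obs, t_grid):
    """Piecewise-constant approximation X~(t) = sum_j X(t_ij) I(t in I_ij),
    with I_ij taken (as one possible choice) to be the partition interval
    whose left endpoint is the sampling time t_ij."""
    idx = np.clip(np.searchsorted(t_obs, t_grid, side="right") - 1, 0, len(t_obs) - 1)
    return x_vals[idx]

rng = np.random.default_rng(2)
J = 20
t_obs = np.arange(J + 1) / J                   # t_j = (j - 1)/J in the text's indexing
x_obs = np.cumsum(rng.normal(scale=np.sqrt(1.0 / J), size=J + 1))  # illustrative curve values
t_grid = np.linspace(0.0, 1.0, 1001)
x_tilde = step_approximation(x_obs, t_obs, t_grid)
print(x_tilde[:5])
```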

The approximation \(\widetilde X_i\), in turn, produces approximations to the \(\mathcal{T}_i\) and \(\mathcal{T}\) that are given by
\[
\widetilde{\mathcal{T}}_i g = \langle \widetilde X_i, g\rangle
\]
and
\[
\widetilde{\mathcal{T}} g = \bigl(\widetilde{\mathcal{T}}_1 g, \ldots, \widetilde{\mathcal{T}}_n g\bigr)^T
\]
for \(g \in \mathbb{W}^q\). The resulting estimator of β is obtained by minimizing
\[
\|Y - \widetilde{\mathcal{T}} g\|_{\mathbb{Y}}^2 + \eta\|g\|_{\mathbb{W}^q}^2 \tag{11.29}
\]


over \(g \in \mathbb{W}^q\); i.e., we estimate β by
\[
\tilde\beta_\eta = \widetilde{\mathcal{G}}(\eta)^{-1}\widetilde{\mathcal{T}}^* Y,
\]
where \(\widetilde{\mathcal{G}}(\eta) = \widetilde{\mathcal{T}}^*\widetilde{\mathcal{T}} + \eta I\).

Our goal is to obtain an analog of Theorem 11.2.7. To do so, we proceed along the same lines as the developments in Section 11.2. First, take \(\widetilde{\mathcal{K}}_n\) to be the \(\mathbb{L}^2\) integral operator corresponding to the (discretized) sample covariance kernel
\[
\widetilde K(s, t) = n^{-1}\sum_{i=1}^n \widetilde X_i(s)\widetilde X_i(t).
\]
Next, define
\[
\widetilde{\mathcal{K}} = \mathbb{E}\widetilde{\mathcal{K}}_n,
\]
with associated eigenvalues \(\tilde\lambda_1 \ge \tilde\lambda_2 \ge \cdots\) and \(\widetilde V_k = \sum_{j>k}\tilde\lambda_j\). Similarly, define scores by \(\widetilde Z_{ik} = \langle \widetilde X_i, \tilde e_k\rangle\), where \(\tilde e_k\) is the eigenfunction for \(\widetilde{\mathcal{K}}\) that corresponds to the eigenvalue \(\tilde\lambda_k\).

Theorem 11.4.1  Assume that \(\widetilde V_k = O(k^{-\alpha})\) for some α > 0 and that there exists a fixed C < ∞ such that
\[
\mathrm{Var}\bigl(\widetilde Z_{ir}\widetilde Z_{is}\bigr) \le C\,\tilde\lambda_r\tilde\lambda_s \tag{11.30}
\]
for all r, s. Then, if \(\eta = n^{-\frac{2q+\alpha+1}{2q+\alpha+2}}\),
\[
\mathbb{E}_{\varepsilon,X}\bigl(\langle X, \tilde\beta_\eta\rangle - \langle X, \beta\rangle\bigr)^2
= O_p\!\left(n^{-\frac{2q+\alpha+1}{2q+\alpha+2}} + n^{-\frac{\alpha+1}{2}}
+ \frac{1}{n}\sum_{i=1}^n \mathbb{E}\|X_i - \widetilde X_i\|^2\right).
\]

Proof: First, we show that
\[
\mathbb{E}_\varepsilon\|\widetilde{\mathcal{T}}(\tilde\beta_\eta - \beta)\|_{\mathbb{Y}}^2
= O_p\!\left(n^{-\frac{2q+\alpha+1}{2q+\alpha+2}} + \frac{1}{n}\sum_{i=1}^n \mathbb{E}\|X_i - \widetilde X_i\|^2\right). \tag{11.31}
\]
Toward this goal, observe that
\[
\mathbb{E}_\varepsilon\|\widetilde{\mathcal{T}}(\tilde\beta_\eta - \beta)\|_{\mathbb{Y}}^2
= \mathbb{E}_\varepsilon\|\widetilde{\mathcal{T}}(\tilde\beta_\eta - \mathbb{E}_\varepsilon\tilde\beta_\eta + \mathbb{E}_\varepsilon\tilde\beta_\eta - \beta)\|_{\mathbb{Y}}^2
\le 2\,\mathbb{E}_\varepsilon\|\widetilde{\mathcal{T}}(\tilde\beta_\eta - \mathbb{E}_\varepsilon\tilde\beta_\eta)\|_{\mathbb{Y}}^2
+ 2\,\|\widetilde{\mathcal{T}}\mathbb{E}_\varepsilon(\tilde\beta_\eta - \beta)\|_{\mathbb{Y}}^2. \tag{11.32}
\]


The first term on the right of (11.32) is a variance expression, and computations similar to what we used for the completely observed case can be used to show that
\[
\mathbb{E}_\varepsilon\|\widetilde{\mathcal{T}}(\tilde\beta_\eta - \mathbb{E}_\varepsilon\tilde\beta_\eta)\|_{\mathbb{Y}}^2
= \frac{\sigma^2}{n}\,\mathrm{trace}\,\bigl(\widetilde{\mathcal{G}}(\eta)^{-1}\widetilde{\mathcal{T}}^*\widetilde{\mathcal{T}}\bigr)^2
= O_p\!\left(n^{-\frac{2q+\alpha+1}{2q+\alpha+2}}\right).
\]
The second term corresponds to squared bias and can be handled as follows. First observe that \(\mathbb{E}_\varepsilon\tilde\beta_\eta = \widetilde{\mathcal{G}}(\eta)^{-1}\widetilde{\mathcal{T}}^*\mathcal{T}\beta\) is the minimizer of
\[
\|\mathcal{T}\beta - \widetilde{\mathcal{T}} g\|_{\mathbb{Y}}^2 + \eta\|g\|_{\mathbb{W}^q}^2.
\]
Thus,
\[
\|\mathcal{T}\beta - \widetilde{\mathcal{T}}\mathbb{E}_\varepsilon\tilde\beta_\eta\|_{\mathbb{Y}}^2 + \eta\|\mathbb{E}_\varepsilon\tilde\beta_\eta\|_{\mathbb{W}^q}^2
\le \|\mathcal{T}\beta - \widetilde{\mathcal{T}}\beta\|_{\mathbb{Y}}^2 + \eta\|\beta\|_{\mathbb{W}^q}^2. \tag{11.33}
\]
As
\[
\|\widetilde{\mathcal{T}}\mathbb{E}_\varepsilon(\tilde\beta_\eta - \beta)\|_{\mathbb{Y}}^2
\le 2\|\widetilde{\mathcal{T}}\beta - \mathcal{T}\beta\|_{\mathbb{Y}}^2 + 2\|\mathcal{T}\beta - \widetilde{\mathcal{T}}\mathbb{E}_\varepsilon\tilde\beta_\eta\|_{\mathbb{Y}}^2,
\]
using (11.33), we arrive at the inequality
\[
\|\widetilde{\mathcal{T}}\mathbb{E}_\varepsilon(\tilde\beta_\eta - \beta)\|_{\mathbb{Y}}^2
\le 3\|\mathcal{T}\beta - \widetilde{\mathcal{T}}\beta\|_{\mathbb{Y}}^2 + 2\eta\|\beta\|_{\mathbb{W}^q}^2. \tag{11.34}
\]
By the Cauchy–Schwarz inequality,
\[
\|\mathcal{T}\beta - \widetilde{\mathcal{T}}\beta\|_{\mathbb{Y}}^2
\le \|\beta\|^2\,\frac{1}{n}\sum_{i=1}^n \|X_i - \widetilde X_i\|^2
= O_p\!\left(\frac{1}{n}\sum_{i=1}^n \mathbb{E}\|X_i - \widetilde X_i\|^2\right),
\]
and, from (11.34),
\[
\|\widetilde{\mathcal{T}}\mathbb{E}_\varepsilon(\tilde\beta_\eta - \beta)\|_{\mathbb{Y}}^2
= O_p\!\left(\eta + \frac{1}{n}\sum_{i=1}^n \mathbb{E}\|X_i - \widetilde X_i\|^2\right). \tag{11.35}
\]

Then, (11.31) follows from (11.32) and (11.35).

Next, let \(X_1', \ldots, X_n'\) be iid with the same distribution as \(X_1\), and let U be a discrete uniform random variable with possible values 1, … , n. Assume that \(X_1, \ldots, X_n, X_1', \ldots, X_n', U\) are all independent and define
\[
\widetilde X_i'(t) = \sum_{j=1}^{J_i} X_i'(t_{ij})\,I(t \in I_{ij})
\]
and
\[
\widetilde X = \sum_{i=1}^n I(U = i)\,\widetilde X_i'.
\]
Note that the covariance operator of \(\widetilde X\) is \(\widetilde{\mathcal{K}}\). Using (11.31) and following the lines of the proofs for Lemma 11.2.6 and Theorem 11.2.7, we see that
\[
\mathbb{E}_{\varepsilon,\widetilde X}\bigl(\langle \widetilde X, \tilde\beta_\eta\rangle - \langle \widetilde X, \beta\rangle\bigr)^2
= O_p\!\left(n^{-\frac{2q+\alpha+1}{2q+\alpha+2}} + n^{-\frac{\alpha+1}{2}}
+ \frac{1}{n}\sum_{i=1}^n \mathbb{E}\|X_i - \widetilde X_i\|^2\right).
\]
Of course, the object of interest is \(\mathbb{E}_{\varepsilon,X}\bigl(\langle X, \tilde\beta_\eta\rangle - \langle X, \beta\rangle\bigr)^2\). However, we can equivalently consider X′ defined by

\[
X' = \sum_{i=1}^n I(U = i)\,X_i',
\]
which clearly has the same distribution as \(X_1\). As
\[
\langle X', \tilde\beta_\eta - \beta\rangle
= \langle \widetilde X, \tilde\beta_\eta - \beta\rangle + \langle X' - \widetilde X, \tilde\beta_\eta - \beta\rangle,
\]
we have
\[
\bigl(\langle X', \tilde\beta_\eta\rangle - \langle X', \beta\rangle\bigr)^2
\le 2\langle \widetilde X, \tilde\beta_\eta - \beta\rangle^2 + 2\langle X' - \widetilde X, \tilde\beta_\eta - \beta\rangle^2.
\]
Thus,
\[
\begin{aligned}
\mathbb{E}_{\varepsilon,X'}\langle X', \tilde\beta_\eta - \beta\rangle^2
&= \mathbb{E}_{\varepsilon,X_1',\ldots,X_n',U}\langle X', \tilde\beta_\eta - \beta\rangle^2 \\
&\le 2\,\mathbb{E}_{\varepsilon,X_1',\ldots,X_n',U}\langle \widetilde X, \tilde\beta_\eta - \beta\rangle^2
+ 2\,\mathbb{E}_{\varepsilon,X_1',\ldots,X_n',U}\langle X' - \widetilde X, \tilde\beta_\eta - \beta\rangle^2 \\
&= 2\,\mathbb{E}_{\varepsilon,\widetilde X}\langle \widetilde X, \tilde\beta_\eta - \beta\rangle^2
+ 2\,\mathbb{E}_{\varepsilon,X',\widetilde X}\langle X' - \widetilde X, \tilde\beta_\eta - \beta\rangle^2.
\end{aligned}
\]

We already have a bound for the first term on the right-hand side of this last expression. For the second term, apply the Cauchy–Schwarz inequality to obtain
\[
\mathbb{E}_{\varepsilon,X',\widetilde X}\langle X' - \widetilde X, \tilde\beta_\eta - \beta\rangle^2
\le \mathbb{E}_\varepsilon\|\tilde\beta_\eta - \beta\|^2\;\mathbb{E}\|X' - \widetilde X\|^2.
\]
As in the proof of Theorem 11.2.7, the first term on the right-hand side is \(O_p(1)\). The second term is equal to \(n^{-1}\sum_{i=1}^n \mathbb{E}\|X_i - \widetilde X_i\|^2\), which completes the proof. ◽


Suppose, for example, that the X process has the covariance properties of Brownian motion and that the sampling occurs at a common set of points \(t_j = (j-1)/J\), j = 1, … , J + 1. Then, one may check that
\[
n^{-1}\sum_{i=1}^n \mathbb{E}\|X_i - \widetilde X_i\|^2
= \mathbb{E}\|X_1 - \widetilde X_1\|^2
= \int_0^1 t\,dt - J^{-1}\sum_{j=1}^J t_j \le J^{-1}.
\]
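This value can be confirmed by a short simulation. The Monte Carlo settings below are arbitrary, the Brownian paths are generated on an auxiliary fine grid, and the computation is only an approximate numerical check that the expected squared error equals \(1/(2J) \le J^{-1}\) in this case.

```python
import numpy as np

rng = np.random.default_rng(3)
J, m, reps = 20, 2001, 400                 # sampling points, fine grid, Monte Carlo size (assumptions)
t = np.linspace(0.0, 1.0, m)
t_obs = np.arange(J + 1) / J               # common sampling times t_j = (j - 1)/J
obs_idx = np.searchsorted(t, t_obs)        # their (approximate) locations on the fine grid

err = 0.0
for _ in range(reps):
    # Brownian path on the fine grid, started at zero
    x = np.concatenate(([0.0], np.cumsum(rng.normal(scale=np.sqrt(np.diff(t))))))
    hold = np.maximum(np.searchsorted(t_obs, t, side="right") - 1, 0)
    x_tilde = x[obs_idx][hold]             # piecewise-constant approximation X~
    err += np.mean((x - x_tilde) ** 2) / reps   # grid average approximates the squared L2 norm

print("simulated E||X1 - X1~||^2 ≈", round(err, 4), "; 1/(2J) =", 1.0 / (2 * J))
```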

Thus, as one might expect, for a fine sampling grid with, e.g., \(J \gg n\), the quadrature error entailed by approximating the \(X_i\) by the \(\widetilde X_i\) has a negligible influence on the performance of the estimator in this particular case.

Note that \(\beta_\eta\) and \(\tilde\beta_\eta\) are not natural splines. To compute these estimators, one can use the fact that \(\mathbb{W}^q\) is an RKHS. Let us focus on \(\tilde\beta_\eta\). Take \(\xi_i\) to be the representer of the functional \(g \mapsto \langle \widetilde X_i, g\rangle\), \(g \in \mathbb{W}^q\): namely, \(\langle\xi_i, g\rangle_{\mathbb{W}^q} = \langle \widetilde X_i, g\rangle\). If R denotes the rk for \(\mathbb{W}^q\), then, by the reproducing property,
\[
\xi_i(t) = \langle\xi_i, R(\cdot,t)\rangle_{\mathbb{W}^q} = \langle \widetilde X_i, R(\cdot,t)\rangle.
\]
Now define the matrix
\[
\mathcal{U} = \left\{\int_0^1\!\!\int_0^1 R(s,t)\,\widetilde X_i(s)\,\widetilde X_j(t)\,ds\,dt\right\}_{i,j=1:n}.
\]
An application of Theorem 6.4.1 then reveals that the estimator can be expressed as
\[
\tilde\beta_\eta = \sum_{i=1}^n b_i\,\xi_i
\]
with
\[
b = (\mathcal{U}^T\mathcal{U} + \eta\,\mathcal{U})^{-1}\mathcal{U}^T Y = (\mathcal{U} + \eta I)^{-1} Y,
\]
as \(\mathcal{U}\) is symmetric.
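The closed form for b translates directly into a short numerical routine. In the sketch below, the reproducing kernel is taken to be R(s, t) = 1 + min(s, t), which reproduces a first-order Sobolev space under one common choice of inner product; this kernel, the quadrature rule, the simulated data, and all tuning values are assumptions made only for the illustration and are not the kernel (2.46) or the inner product used in the text.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, eta = 40, 201, 1e-2                    # sample size, quadrature grid, penalty (assumptions)
t = np.linspace(0.0, 1.0, m)
w = np.full(m, 1.0 / m)                      # crude quadrature weights

def R(s, u):
    # Illustrative reproducing kernel 1 + min(s, u); an assumption, not the book's (2.46).
    return 1.0 + np.minimum.outer(s, u)

# Illustrative curves standing in for the step functions X~_i, plus toy responses.
X = np.cumsum(rng.normal(scale=np.sqrt(1.0 / m), size=(n, m)), axis=1)
beta_true = np.sin(2 * np.pi * t)
Y = X @ (beta_true * w) + 0.05 * rng.normal(size=n)

Rmat = R(t, t)                               # R(s, u) evaluated on the grid
U = (X * w) @ Rmat @ (X * w).T               # U_ij = ∫∫ R(s,u) X~_i(s) X~_j(u) ds du
b = np.linalg.solve(U + eta * np.eye(n), Y)  # b = (U + eta I)^{-1} Y

xi = (X * w) @ Rmat                          # xi_i(t) = ∫ X~_i(s) R(s, t) ds on the grid
beta_eta = b @ xi                            # beta~_eta = sum_i b_i xi_i
print(beta_eta.shape)
```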



WILEY SERIES IN PROBABILITY AND STATISTICS established by Walter A. Shewhart and Samuel S. Wilks Editors: David J. Balding, Noel A. C. Cressie, Garrett M. Fitzmaurice, Geof H. Givens, Harvey Goldstein, Geert Molenberghs, David W. Scott, Adrian F. M. Smith, Ruey S. Tsay, Sanford Weisberg Editors Emeriti: J. Stuart Hunter, Iain M. Johnstone, Joseph B. Kadane, Jozef L. Teugels The Wiley Series in Probability and Statistics is well established and authoritative. It covers many topics of current research interest in both pure and applied statistics and probability theory. Written by leading statisticians and institutions, the titles span both state-of-the-art developments in the field and classical methods. Reflecting the wide range of current research in statistics, the series encompasses applied, methodological and theoretical statistics, ranging from applications and new techniques made possible by advances in computerized practice to rigorous treatment of theoretical approaches. This series provides essential and invaluable reading for all statisticians, whether in academia, industry, government, or research. ABRAHAM and LEDOLTER ⋅ Statistical Methods for Forecasting AGRESTI ⋅ Analysis of Ordinal Categorical Data, Second Edition AGRESTI ⋅ An Introduction to Categorical Data Analysis, Second Edition AGRESTI ⋅ Categorical Data Analysis, Third Edition ALSTON, MENGERSEN and PETTITT (editors) ⋅ Case Studies in Bayesian Statistical Modelling and Analysis ALTMAN, GILL, and McDONALD ⋅ Numerical Issues in Statistical Computing for the Social Scientist AMARATUNGA and CABRERA ⋅ Exploration and Analysis of DNA Microarray and Protein Array Data AMARATUNGA, CABRERA, and SHKEDY ⋅ Exploration and Analysis of DNA Microarray and Other High-Dimensional Data, Second Edition ˇ ⋅ Mathematics of Chance ANDEL ANDERSON ⋅ An Introduction to Multivariate Statistical Analysis, Third Edition ∗ ANDERSON ⋅ The Statistical Analysis of Time Series ANDERSON, AUQUIER, HAUCK, OAKES, VANDAELE, and WEISBERG ⋅ Statistical Methods for Comparative Studies ANDERSON and LOYNES ⋅ The Teaching of Practical Statistics ARMITAGE and DAVID (editors) ⋅ Advances in Biometry ARNOLD, BALAKRISHNAN, and NAGARAJA ⋅ Records * ARTHANARI and DODGE ⋅ Mathematical Programming in Statistics AUGUSTIN, COOLEN, DE COOMAN and TROFFAES (editors) ⋅ Introduction to Imprecise Probabilities * BAILEY ⋅ The Elements of Stochastic Processes with Applications to the Natural Sciences †

† ∗

Now available in a lower priced paperback edition in the Wiley–Interscience Paperback Series. Now available in a lower priced paperback edition in the Wiley Classics Library.

BAJORSKI ⋅ Statistics for Imaging, Optics, and Photonics BALAKRISHNAN and KOUTRAS ⋅ Runs and Scans with Applications BALAKRISHNAN and NG ⋅ Precedence-Type Tests and Applications BARNETT ⋅ Comparative Statistical Inference, Third Edition BARNETT ⋅ Environmental Statistics BARNETT and LEWIS ⋅ Outliers in Statistical Data, Third Edition BARTHOLOMEW, KNOTT, and MOUSTAKI ⋅ Latent Variable Models and Factor Analysis: A Unified Approach, Third Edition BARTOSZYNSKI and NIEWIADOMSKA-BUGAJ ⋅ Probability and Statistical Inference, Second Edition BASILEVSKY ⋅ Statistical Factor Analysis and Related Methods: Theory and Applications BATES and WATTS ⋅ Nonlinear Regression Analysis and Its Applications BECHHOFER, SANTNER, and GOLDSMAN ⋅ Design and Analysis of Experiments for Statistical Selection, Screening, and Multiple Comparisons BEH and LOMBARDO ⋅ Correspondence Analysis: Theory, Practice and New Strategies BEIRLANT, GOEGEBEUR, SEGERS, TEUGELS, and DE WAAL ⋅ Statistics of Extremes: Theory and Applications BELSLEY Conditioning Diagnostics: Collinearity and Weak Data in Regression † BELSLEY, KUH, and WELSCH ⋅ Regression Diagnostics: Identifying Influential Data and Sources of Collinearity BENDAT and PIERSOL ⋅ Random Data: Analysis and Measurement Procedures, Fourth Edition BERNARDO and SMITH ⋅ Bayesian Theory BHAT and MILLER ⋅ Elements of Applied Stochastic Processes, Third Edition BHATTACHARYA and WAYMIRE ⋅ Stochastic Processes with Applications BIEMER, GROVES, LYBERG, MATHIOWETZ, and SUDMAN ⋅ Measurement Errors in Surveys BILLINGSLEY ⋅ Convergence of Probability Measures, Second Edition BILLINGSLEY ⋅ Probability and Measure, Anniversary Edition BIRKES and DODGE ⋅ Alternative Methods of Regression BISGAARD and KULAHCI ⋅ Time Series Analysis and Forecasting by Example BISWAS, DATTA, FINE, and SEGAL ⋅ Statistical Advances in the Biomedical Sciences: Clinical Trials, Epidemiology, Survival Analysis, and Bioinformatics BLISCHKE and MURTHY (editors) ⋅ Case Studies in Reliability and Maintenance BLISCHKE and MURTHY ⋅ Reliability: Modeling, Prediction, and Optimization BLOOMFIELD ⋅ Fourier Analysis of Time Series: An Introduction, Second Edition BOLLEN ⋅ Structural Equations with Latent Variables BOLLEN and CURRAN ⋅ Latent Curve Models: A Structural Equation Perspective BONNINI, CORAIN, MAROZZI and SALMASO ⋅ Nonparametric Hypothesis Testing: Rank and Permutation Methods with Applications in R BOROVKOV ⋅ Ergodicity and Stability of Stochastic Processes †

Now available in a lower priced paperback edition in the Wiley–Interscience Paperback Series.

BOSQ and BLANKE ⋅ Inference and Prediction in Large Dimensions BOULEAU ⋅ Numerical Methods for Stochastic Processes ∗ BOX and TIAO ⋅ Bayesian Inference in Statistical Analysis BOX ⋅ Improving Almost Anything, Revised Edition * BOX and DRAPER ⋅ Evolutionary Operation: A Statistical Method for Process Improvement BOX and DRAPER ⋅ Response Surfaces, Mixtures, and Ridge Analyses, Second Edition BOX, HUNTER, and HUNTER ⋅ Statistics for Experimenters: Design, Innovation, and Discovery, Second Editon BOX, JENKINS, and REINSEL ⋅ Time Series Analysis: Forcasting and Control, Fourth Edition BOX, LUCEÑO, and PANIAGUA-QUIÑONES ⋅ Statistical Control by Monitoring and Adjustment, Second Edition * BROWN and HOLLANDER ⋅ Statistics: A Biomedical Introduction CAIROLI and DALANG ⋅ Sequential Stochastic Optimization CASTILLO, HADI, BALAKRISHNAN, and SARABIA ⋅ Extreme Value and Related Models with Applications in Engineering and Science CHAN ⋅ Time Series: Applications to Finance with R and S-Plus‸ , Second Edition CHARALAMBIDES ⋅ Combinatorial Methods in Discrete Distributions CHATTERJEE and HADI ⋅ Regression Analysis by Example, Fourth Edition CHATTERJEE and HADI ⋅ Sensitivity Analysis in Linear Regression CHEN ⋅ The Fitness of Information: Quantitative Assessments of Critical Evidence CHERNICK ⋅ Bootstrap Methods: A Guide for Practitioners and Researchers, Second Edition CHERNICK and FRIIS ⋅ Introductory Biostatistics for the Health Sciences CHILÈS and DELFINER ⋅ Geostatistics: Modeling Spatial Uncertainty, Second Edition CHIU, STOYAN, KENDALL and MECKE ⋅ Stochastic Geometry and Its Applications, Third Edition CHOW and LIU ⋅ Design and Analysis of Clinical Trials: Concepts and Methodologies, Third Edition CLARKE ⋅ Linear Models: The Theory and Application of Analysis of Variance CLARKE and DISNEY ⋅ Probability and Random Processes: A First Course with Applications, Second Edition * COCHRAN and COX ⋅ Experimental Designs, Second Edition COLLINS and LANZA ⋅ Latent Class and Latent Transition Analysis: With Applications in the Social, Behavioral, and Health Sciences CONGDON ⋅ Applied Bayesian Modelling, Second Edition CONGDON ⋅ Bayesian Models for Categorical Data CONGDON ⋅ Bayesian Statistical Modelling, Second Edition CONOVER ⋅ Practical Nonparametric Statistics, Third Edition COOK ⋅ Regression Graphics COOK and WEISBERG ⋅ An Introduction to Regression Graphics ∗

Now available in a lower priced paperback edition in the Wiley Classics Library.

COOK and WEISBERG ⋅ Applied Regression Including Computing and Graphics CORNELL ⋅ A Primer on Experiments with Mixtures CORNELL ⋅ Experiments with Mixtures, Designs, Models, and the Analysis of Mixture Data, Third Edition COX ⋅ A Handbook of Introductory Statistical Methods CRESSIE ⋅ Statistics for Spatial Data, Revised Edition CRESSIE and WIKLE ⋅ Statistics for Spatio-Temporal Data CSÖRGÖ and HORVÁTH ⋅ Limit Theorems in Change Point Analysis DAGPUNAR ⋅ Simulation and Monte Carlo: With Applications in Finance and MCMC DANIEL ⋅ Applications of Statistics to Industrial Experimentation DANIEL ⋅ Biostatistics: A Foundation for Analysis in the Health Sciences, Eighth Edition * DANIEL ⋅ Fitting Equations to Data: Computer Analysis of Multifactor Data, Second Edition DASU and JOHNSON ⋅ Exploratory Data Mining and Data Cleaning DAVID and NAGARAJA ⋅ Order Statistics, Third Edition DAVINO, FURNO and VISTOCCO ⋅ Quantile Regression: Theory and Applications * DEGROOT, FIENBERG, and KADANE ⋅ Statistics and the Law DEL CASTILLO ⋅ Statistical Process Adjustment for Quality Control DEMARIS ⋅ Regression with Social Data: Modeling Continuous and Limited Response Variables DEMIDENKO ⋅ Mixed Models: Theory and Applications with R, Second Edition DENISON, HOLMES, MALLICK and SMITH ⋅ Bayesian Methods for Nonlinear Classification and Regression DETTE and STUDDEN ⋅ The Theory of Canonical Moments with Applications in Statistics, Probability, and Analysis DEY and MUKERJEE ⋅ Fractional Factorial Plans DILLON and GOLDSTEIN ⋅ Multivariate Analysis: Methods and Applications * DODGE and ROMIG ⋅ Sampling Inspection Tables, Second Edition * DOOB ⋅ Stochastic Processes DOWDY, WEARDEN, and CHILKO ⋅ Statistics for Research, Third Edition DRAPER and SMITH ⋅ Applied Regression Analysis, Third Edition DRYDEN and MARDIA ⋅ Statistical Shape Analysis DUDEWICZ and MISHRA ⋅ Modern Mathematical Statistics DUNN and CLARK ⋅ Basic Statistics: A Primer for the Biomedical Sciences, Fourth Edition DUPUIS and ELLIS ⋅ A Weak Convergence Approach to the Theory of Large Deviations EDLER and KITSOS ⋅ Recent Advances in Quantitative Methods in Cancer and Human Health Risk Assessment ∗ ELANDT-JOHNSON and JOHNSON ⋅ Survival Models and Data Analysis ENDERS ⋅ Applied Econometric Time Series, Third Edition † ETHIER and KURTZ ⋅ Markov Processes: Characterization and Convergence ∗ †

Now available in a lower priced paperback edition in the Wiley Classics Library. Now available in a lower priced paperback edition in the Wiley–Interscience Paperback Series.

EVANS, HASTINGS, and PEACOCK ⋅ Statistical Distributions, Third Edition EVERITT, LANDAU, LEESE, and STAHL ⋅ Cluster Analysis, Fifth Edition FEDERER and KING ⋅ Variations on Split Plot and Split Block Experiment Designs FELLER ⋅ An Introduction to Probability Theory and Its Applications, Volume I, Third Edition, Revised; Volume II, Second Edition FITZMAURICE, LAIRD, and WARE ⋅ Applied Longitudinal Analysis, Second Edition ∗ FLEISS ⋅ The Design and Analysis of Clinical Experiments FLEISS ⋅ Statistical Methods for Rates and Proportions, Third Edition † FLEMING and HARRINGTON ⋅ Counting Processes and Survival Analysis FUJIKOSHI, ULYANOV, and SHIMIZU ⋅ Multivariate Statistics: High-Dimensional and Large-Sample Approximations FULLER ⋅ Introduction to Statistical Time Series, Second Edition † FULLER ⋅ Measurement Error Models GALLANT ⋅ Nonlinear Statistical Models GEISSER ⋅ Modes of Parametric Statistical Inference GELMAN and MENG ⋅ Applied Bayesian Modeling and Causal Inference from ncomplete-Data Perspectives GEWEKE ⋅ Contemporary Bayesian Econometrics and Statistics GHOSH, MUKHOPADHYAY, and SEN ⋅ Sequential Estimation GIESBRECHT and GUMPERTZ ⋅ Planning, Construction, and Statistical Analysis of Comparative Experiments GIFI ⋅ Nonlinear Multivariate Analysis GIVENS and HOETING ⋅ Computational Statistics GLASSERMAN and YAO ⋅ Monotone Structure in Discrete-Event Systems GNANADESIKAN ⋅ Methods for Statistical Data Analysis of Multivariate Observations, Second Edition GOLDSTEIN ⋅ Multilevel Statistical Models, Fourth Edition GOLDSTEIN and LEWIS ⋅ Assessment: Problems, Development, and Statistical Issues GOLDSTEIN and WOOFF ⋅ Bayes Linear Statistics GRAHAM ⋅ Markov Chains: Analytic and Monte Carlo Computations GREENWOOD and NIKULIN ⋅ A Guide to Chi-Squared Testing GROSS, SHORTLE, THOMPSON, and HARRIS ⋅ Fundamentals of Queueing Theory, Fourth Edition GROSS, SHORTLE, THOMPSON, and HARRIS ⋅ Solutions Manual to Accompany Fundamentals of Queueing Theory, Fourth Edition * HAHN and SHAPIRO ⋅ Statistical Models in Engineering HAHN and MEEKER ⋅ Statistical Intervals: A Guide for Practitioners HALD ⋅ A History of Probability and Statistics and their Applications Before 1750 † HAMPEL ⋅ Robust Statistics: The Approach Based on Influence Functions HARTUNG, KNAPP, and SINHA ⋅ Statistical Meta-Analysis with Applications HEIBERGER ⋅ Computation for the Analysis of Designed Experiments ∗ †

Now available in a lower priced paperback edition in the Wiley Classics Library. Now available in a lower priced paperback edition in the Wiley–Interscience Paperback Series.

HEDAYAT and SINHA ⋅ Design and Inference in Finite Population Sampling HEDEKER and GIBBONS ⋅ Longitudinal Data Analysis HELLER ⋅ MACSYMA for Statisticians HERITIER, CANTONI, COPT, and VICTORIA-FESER ⋅ Robust Methods in Biostatistics HINKELMANN and KEMPTHORNE ⋅ Design and Analysis of Experiments, Volume 1: Introduction to Experimental Design, Second Edition HINKELMANN and KEMPTHORNE ⋅ Design and Analysis of Experiments, Volume 2: Advanced Experimental Design HINKELMANN (editor) ⋅ Design and Analysis of Experiments, Volume 3: Special Designs and Applications HOAGLIN, MOSTELLER, and TUKEY ⋅ Fundamentals of Exploratory Analysis of Variance ∗ HOAGLIN, MOSTELLER, and TUKEY ⋅ Exploring Data Tables, Trends and Shapes * HOAGLIN, MOSTELLER, and TUKEY ⋅ Understanding Robust and Exploratory Data Analysis HOCHBERG and TAMHANE ⋅ Multiple Comparison Procedures HOCKING ⋅ Methods and Applications of Linear Models: Regression and the Analysis of Variance, Third Edition HOEL ⋅ Introduction to Mathematical Statistics, Fifth Edition HOGG and KLUGMAN ⋅ Loss Distributions HOLLANDER, WOLFE, and CHICKEN ⋅ Nonparametric Statistical Methods, Third Edition HOSMER and LEMESHOW ⋅ Applied Logistic Regression, Second Edition HOSMER, LEMESHOW, and MAY ⋅ Applied Survival Analysis: Regression Modeling of Time-to-Event Data, Second Edition HUBER ⋅ Data Analysis: What Can Be Learned From the Past 50 Years HUBER ⋅ Robust Statistics † HUBER and RONCHETTI ⋅ Robust Statistics, Second Edition HUBERTY ⋅ Applied Discriminant Analysis, Second Edition HUBERTY and OLEJNIK ⋅ Applied MANOVA and Discriminant Analysis, Second Edition HUITEMA ⋅ The Analysis of Covariance and Alternatives: Statistical Methods for Experiments, Quasi-Experiments, and Single-Case Studies, Second Edition HUNT and KENNEDY ⋅ Financial Derivatives in Theory and Practice, Revised Edition HURD and MIAMEE ⋅ Periodically Correlated Random Sequences: Spectral Theory and Practice HUSKOVA, BERAN, and DUPAC ⋅ Collected Works of Jaroslav Hajek – with Commentary HUZURBAZAR ⋅ Flowgraph Models for Multistate Time-to-Event Data JACKMAN ⋅ Bayesian Analysis for the Social Sciences † JACKSON ⋅ A User’s Guide to Principle Components ∗ †

Now available in a lower priced paperback edition in the Wiley Classics Library. Now available in a lower priced paperback edition in the Wiley–Interscience Paperback Series.

JOHN ⋅ Statistical Methods in Engineering and Quality Assurance JOHNSON ⋅ Multivariate Statistical Simulation JOHNSON and BALAKRISHNAN ⋅ Advances in the Theory and Practice of Statistics: A Volume in Honor of Samuel Kotz JOHNSON, KEMP, and KOTZ ⋅ Univariate Discrete Distributions, Third Edition JOHNSON and KOTZ (editors) ⋅ Leading Personalities in Statistical Sciences: From the Seventeenth Century to the Present JOHNSON, KOTZ, and BALAKRISHNAN ⋅ Continuous Univariate Distributions, Volume 1, Second Edition JOHNSON, KOTZ, and BALAKRISHNAN ⋅ Continuous Univariate Distributions, Volume 2, Second Edition JOHNSON, KOTZ, and BALAKRISHNAN ⋅ Discrete Multivariate Distributions JUDGE, GRIFFITHS, HILL, LÜTKEPOHL, and LEE ⋅ The Theory and Practice of Econometrics, Second Edition JUREK and MASON ⋅ Operator-Limit Distributions in Probability Theory KADANE ⋅ Bayesian Methods and Ethics in a Clinical Trial Design KADANE AND SCHUM ⋅ A Probabilistic Analysis of the Sacco and Vanzetti Evidence KALBFLEISCH and PRENTICE ⋅ The Statistical Analysis of Failure Time Data, Second Edition KARIYA and KURATA ⋅ Generalized Least Squares KASS and VOS ⋅ Geometrical Foundations of Asymptotic Inference † KAUFMAN and ROUSSEEUW ⋅ Finding Groups in Data: An Introduction to Cluster Analysis KEDEM and FOKIANOS ⋅ Regression Models for Time Series Analysis KENDALL, BARDEN, CARNE, and LE ⋅ Shape and Shape Theory KHURI ⋅ Advanced Calculus with Applications in Statistics, Second Edition KHURI, MATHEW, and SINHA ⋅ Statistical Tests for Mixed Linear Models ∗ KISH ⋅ Statistical Design for Research KLEIBER and KOTZ ⋅ Statistical Size Distributions in Economics and Actuarial Sciences KLEMELÄ ⋅ Smoothing of Multivariate Data: Density Estimation and Visualization KLUGMAN, PANJER, and WILLMOT ⋅ Loss Models: From Data to Decisions, Third Edition KLUGMAN, PANJER, and WILLMOT ⋅ Loss Models: Further Topics KLUGMAN, PANJER, and WILLMOT ⋅ Solutions Manual to Accompany Loss Models: From Data to Decisions, Third Edition KOSKI and NOBLE ⋅ Bayesian Networks: An Introduction KOTZ, BALAKRISHNAN, and JOHNSON ⋅ Continuous Multivariate Distributions, Volume 1, Second Edition KOTZ and JOHNSON (editors) ⋅ Encyclopedia of Statistical Sciences: Volumes 1 to 9 with Index

† ∗

Now available in a lower priced paperback edition in the Wiley–Interscience Paperback Series. Now available in a lower priced paperback edition in the Wiley Classics Library.

KOTZ and JOHNSON (editors) ⋅ Encyclopedia of Statistical Sciences: Supplement Volume KOTZ, READ, and BANKS (editors) ⋅ Encyclopedia of Statistical Sciences: Update Volume 1 KOTZ, READ, and BANKS (editors) ⋅ Encyclopedia of Statistical Sciences: Update Volume 2 KOWALSKI and TU ⋅ Modern Applied U-Statistics KRISHNAMOORTHY and MATHEW ⋅ Statistical Tolerance Regions: Theory, Applications, and Computation KROESE, TAIMRE, and BOTEV ⋅ Handbook of Monte Carlo Methods KROONENBERG ⋅ Applied Multiway Data Analysis KULINSKAYA, MORGENTHALER, and STAUDTE ⋅ Meta Analysis: A Guide to Calibrating and Combining Statistical Evidence KULKARNI and HARMAN ⋅ An Elementary Introduction to Statistical Learning Theory KUROWICKA and COOKE ⋅ Uncertainty Analysis with High Dimensional Dependence Modelling KVAM and VIDAKOVIC ⋅ Nonparametric Statistics with Applications to Science and Engineering LACHIN ⋅ Biostatistical Methods: The Assessment of Relative Risks, Second Edition LAD ⋅ Operational Subjective Statistical Methods: A Mathematical, Philosophical, and Historical Introduction LAMPERTI ⋅ Probability: A Survey of the Mathematical Theory, Second Edition LAWLESS ⋅ Statistical Models and Methods for Lifetime Data, Second Edition LAWSON ⋅ Statistical Methods in Spatial Epidemiology, Second Edition LE ⋅ Applied Categorical Data Analysis, Second Edition LE ⋅ Applied Survival Analysis LEE ⋅ Structural Equation Modeling: A Bayesian Approach LEE and WANG ⋅ Statistical Methods for Survival Data Analysis, Fourth Edition LePAGE and BILLARD ⋅ Exploring the Limits of Bootstrap LESSLER and KALSBEEK ⋅ Nonsampling Errors in Surveys LEYLAND and GOLDSTEIN (editors) ⋅ Multilevel Modelling of Health Statistics LIAO ⋅ Statistical Group Comparison LIN ⋅ Introductory Stochastic Analysis for Finance and Insurance LINDLEY ⋅ Understanding Uncertainty, Revised Edition LITTLE and RUBIN ⋅ Statistical Analysis with Missing Data, Second Edition LLOYD ⋅ The Statistical Analysis of Categorical Data LOWEN and TEICH ⋅ Fractal-Based Point Processes MAGNUS and NEUDECKER ⋅ Matrix Differential Calculus with Applications in Statistics and Econometrics, Revised Edition MALLER and ZHOU ⋅ Survival Analysis with Long Term Survivors MARCHETTE ⋅ Random Graphs for Statistical Pattern Recognition MARDIA and JUPP ⋅ Directional Statistics

MARKOVICH ⋅ Nonparametric Analysis of Univariate Heavy-Tailed Data: Research and Practice MARONNA, MARTIN and YOHAI ⋅ Robust Statistics: Theory and Methods MASON, GUNST, and HESS ⋅ Statistical Design and Analysis of Experiments with Applications to Engineering and Science, Second Edition McCULLOCH, SEARLE, and NEUHAUS ⋅ Generalized, Linear, and Mixed Models, Second Edition McFADDEN ⋅ Management of Data in Clinical Trials, Second Edition ∗ McLACHLAN ⋅ Discriminant Analysis and Statistical Pattern Recognition McLACHLAN, DO, and AMBROISE ⋅ Analyzing Microarray Gene Expression Data McLACHLAN and KRISHNAN ⋅ The EM Algorithm and Extensions, Second Edition McLACHLAN and PEEL ⋅ Finite Mixture Models McNEIL ⋅ Epidemiological Research Methods MEEKER and ESCOBAR ⋅ Statistical Methods for Reliability Data MEERSCHAERT and SCHEFFLER ⋅ Limit Distributions for Sums of Independent Random Vectors: Heavy Tails in Theory and Practice MENGERSEN, ROBERT, and TITTERINGTON ⋅ Mixtures: Estimation and Applications MICKEY, DUNN, and CLARK ⋅ Applied Statistics: Analysis of Variance and Regression, Third Edition * MILLER ⋅ Survival Analysis, Second Edition MONTGOMERY, JENNINGS, and KULAHCI ⋅ Introduction to Time Series Analysis and Forecasting MONTGOMERY, PECK, and VINING ⋅ Introduction to Linear Regression Analysis, Fifth Edition MORGENTHALER and TUKEY ⋅ Configural Polysampling: A Route to Practical Robustness MUIRHEAD ⋅ Aspects of Multivariate Statistical Theory MULLER and STOYAN ⋅ Comparison Methods for Stochastic Models and Risks MURTHY, XIE, and JIANG ⋅ Weibull Models MYERS, MONTGOMERY, and ANDERSON-COOK ⋅ Response Surface Methodology: Process and Product Optimization Using Designed Experiments, Third Edition MYERS, MONTGOMERY, VINING, and ROBINSON ⋅ Generalized Linear Models. With Applications in Engineering and the Sciences, Second Edition NATVIG ⋅ Multistate Systems Reliability Theory With Applications † NELSON ⋅ Accelerated Testing, Statistical Models, Test Plans, and Data Analyses † NELSON ⋅ Applied Life Data Analysis NEWMAN ⋅ Biostatistical Methods in Epidemiology NG, TAIN, and TANG ⋅ Dirichlet Theory: Theory, Methods and Applications OKABE, BOOTS, SUGIHARA, and CHIU ⋅ Spatial Tesselations: Concepts and Applications of Voronoi Diagrams, Second Edition OLIVER and SMITH ⋅ Influence Diagrams, Belief Nets and Decision Analysis ∗ †

Now available in a lower priced paperback edition in the Wiley Classics Library. Now available in a lower priced paperback edition in the Wiley–Interscience Paperback Series.

PALTA ⋅ Quantitative Methods in Population Health: Extensions of Ordinary Regressions PANJER ⋅ Operational Risk: Modeling and Analytics PANKRATZ ⋅ Forecasting with Dynamic Regression Models PANKRATZ ⋅ Forecasting with Univariate Box-Jenkins Models: Concepts and Cases PARDOUX ⋅ Markov Processes and Applications: Algorithms, Networks, Genome and Finance PARMIGIANI and INOUE ⋅ Decision Theory: Principles and Approaches ∗ PARZEN ⋅ Modern Probability Theory and Its Applications PEÑA, TIAO, and TSAY ⋅ A Course in Time Series Analysis PESARIN and SALMASO ⋅ Permutation Tests for Complex Data: Applications and Software PIANTADOSI ⋅ Clinical Trials: A Methodologic Perspective, Second Edition POURAHMADI ⋅ Foundations of Time Series Analysis and Prediction Theory POURAHMADI ⋅ High-Dimensional Covariance Estimation POWELL ⋅ Approximate Dynamic Programming: Solving the Curses of Dimensionality, Second Edition POWELL and RYZHOV ⋅ Optimal Learning PRESS ⋅ Subjective and Objective Bayesian Statistics, Second Edition PRESS and TANUR ⋅ The Subjectivity of Scientists and the Bayesian Approach PURI, VILAPLANA, and WERTZ ⋅ New Perspectives in Theoretical and Applied Statistics † PUTERMAN ⋅ Markov Decision Processes: Discrete Stochastic Dynamic Programming QIU ⋅ Image Processing and Jump Regression Analysis * RAO ⋅ Linear Statistical Inference and Its Applications, Second Edition RAO ⋅ Statistical Inference for Fractional Diffusion Processes RAUSAND and HØYLAND ⋅ System Reliability Theory: Models, Statistical Methods, and Applications, Second Edition RAYNER, THAS, and BEST ⋅ Smooth Tests of Goodnes of Fit: Using R, Second Edition RENCHER and SCHAALJE ⋅ Linear Models in Statistics, Second Edition RENCHER and CHRISTENSEN ⋅ Methods of Multivariate Analysis, Third Edition RENCHER ⋅ Multivariate Statistical Inference with Applications RIGDON and BASU ⋅ Statistical Methods for the Reliability of Repairable Systems * RIPLEY ⋅ Spatial Statistics * RIPLEY ⋅ Stochastic Simulation ROHATGI and SALEH ⋅ An Introduction to Probability and Statistics, Second Edition ROLSKI, SCHMIDLI, SCHMIDT, and TEUGELS ⋅ Stochastic Processes for Insurance and Finance ROSENBERGER and LACHIN ⋅ Randomization in Clinical Trials: Theory and Practice ROSSI, ALLENBY, and MCCULLOCH ⋅ Bayesian Statistics and Marketing † ROUSSEEUW and LEROY ⋅ Robust Regression and Outlier Detection ∗ †

Now available in a lower priced paperback edition in the Wiley Classics Library. Now available in a lower priced paperback edition in the Wiley–Interscience Paperback Series.

ROYSTON and SAUERBREI ⋅ Multivariate Model Building: A Pragmatic Approach to Regression Analysis Based on Fractional Polynomials for Modeling Continuous Variables * RUBIN ⋅ Multiple Imputation for Nonresponse in Surveys RUBINSTEIN and KROESE ⋅ Simulation and the Monte Carlo Method, Second Edition RUBINSTEIN and MELAMED ⋅ Modern Simulation and Modeling RUBINSTEIN, RIDDER, and VAISMAN ⋅ Fast Sequential Monte Carlo Methods for Counting and Optimization RYAN ⋅ Modern Engineering Statistics RYAN ⋅ Modern Experimental Design RYAN ⋅ Modern Regression Methods, Second Edition RYAN ⋅ Sample Size Determination and Power RYAN ⋅ Statistical Methods for Quality Improvement, Third Edition SALEH ⋅ Theory of Preliminary Test and Stein-Type Estimation with Applications SALTELLI, CHAN, and SCOTT (editors) ⋅ Sensitivity Analysis SCHERER ⋅ Batch Effects and Noise in Microarray Experiments: Sources and Solutions ∗ SCHEFFE ⋅ The Analysis of Variance SCHIMEK ⋅ Smoothing and Regression: Approaches, Computation, and Application SCHOTT ⋅ Matrix Analysis for Statistics, Second Edition SCHOUTENS ⋅ Levy Processes in Finance: Pricing Financial Derivatives SCOTT ⋅ Multivariate Density Estimation: Theory, Practice, and Visualization * SEARLE ⋅ Linear Models † SEARLE ⋅ Linear Models for Unbalanced Data † SEARLE ⋅ Matrix Algebra Useful for Statistics † SEARLE, CASELLA, and McCULLOCH ⋅ Variance Components SEARLE and WILLETT ⋅ Matrix Algebra for Applied Economics SEBER ⋅ A Matrix Handbook For Statisticians † SEBER ⋅ Multivariate Observations SEBER and LEE ⋅ Linear Regression Analysis, Second Edition † SEBER and WILD ⋅ Nonlinear Regression SENNOTT ⋅ Stochastic Dynamic Programming and the Control of Queueing Systems * SERFLING ⋅ Approximation Theorems of Mathematical Statistics SHAFER and VOVK ⋅ Probability and Finance: It’s Only a Game! SHERMAN ⋅ Spatial Statistics and Spatio-Temporal Data: Covariance Functions and Directional Properties SILVAPULLE and SEN ⋅ Constrained Statistical Inference: Inequality, Order, and Shape Restrictions SINGPURWALLA ⋅ Reliability and Risk: A Bayesian Perspective SMALL and MCLEISH ⋅ Hilbert Space Methods in Probability and Statistical Inference SRIVASTAVA ⋅ Methods of Multivariate Statistics ∗ †


STAPLETON ⋅ Linear Statistical Models, Second Edition
STAPLETON ⋅ Models for Probability and Statistical Inference: Theory and Applications
STAUDTE and SHEATHER ⋅ Robust Estimation and Testing
STOYAN ⋅ Counterexamples in Probability, Second Edition
STOYAN and STOYAN ⋅ Fractals, Random Shapes and Point Fields: Methods of Geometrical Statistics
STREET and BURGESS ⋅ The Construction of Optimal Stated Choice Experiments: Theory and Methods
STYAN ⋅ The Collected Papers of T. W. Anderson: 1943–1985
SUTTON, ABRAMS, JONES, SHELDON, and SONG ⋅ Methods for Meta-Analysis in Medical Research
TAKEZAWA ⋅ Introduction to Nonparametric Regression
TAMHANE ⋅ Statistical Analysis of Designed Experiments: Theory and Applications
TANAKA ⋅ Time Series Analysis: Nonstationary and Noninvertible Distribution Theory
THOMPSON ⋅ Empirical Model Building: Data, Models, and Reality, Second Edition
THOMPSON ⋅ Sampling, Third Edition
THOMPSON ⋅ Simulation: A Modeler's Approach
THOMPSON and SEBER ⋅ Adaptive Sampling
THOMPSON, WILLIAMS, and FINDLAY ⋅ Models for Investors in Real World Markets
TIERNEY ⋅ LISP-STAT: An Object-Oriented Environment for Statistical Computing and Dynamic Graphics
TROFFAES and DE COOMAN ⋅ Lower Previsions
TSAY ⋅ Analysis of Financial Time Series, Third Edition
TSAY ⋅ An Introduction to Analysis of Financial Data with R
TSAY ⋅ Multivariate Time Series Analysis: With R and Financial Applications
UPTON and FINGLETON ⋅ Spatial Data Analysis by Example, Volume II: Categorical and Directional Data
† VAN BELLE ⋅ Statistical Rules of Thumb, Second Edition
VAN BELLE, FISHER, HEAGERTY, and LUMLEY ⋅ Biostatistics: A Methodology for the Health Sciences, Second Edition
VESTRUP ⋅ The Theory of Measures and Integration
VIDAKOVIC ⋅ Statistical Modeling by Wavelets
VIERTL ⋅ Statistical Methods for Fuzzy Data
VINOD and REAGLE ⋅ Preparing for the Worst: Incorporating Downside Risk in Stock Market Investments
WALLER and GOTWAY ⋅ Applied Spatial Statistics for Public Health Data
WEISBERG ⋅ Applied Linear Regression, Fourth Edition
WEISBERG ⋅ Bias and Causation: Models and Judgment for Valid Comparisons
WELSH ⋅ Aspects of Statistical Inference
WESTFALL and YOUNG ⋅ Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment


WHITTAKER ⋅ Graphical Models in Applied Multivariate Statistics
WINKER ⋅ Optimization Heuristics in Economics: Applications of Threshold Accepting
WOODWORTH ⋅ Biostatistics: A Bayesian Introduction
WOOLSON and CLARKE ⋅ Statistical Methods for the Analysis of Biomedical Data, Second Edition
WU and HAMADA ⋅ Experiments: Planning, Analysis, and Parameter Design Optimization, Second Edition
WU and ZHANG ⋅ Nonparametric Regression Methods for Longitudinal Data Analysis
YAKIR ⋅ Extremes in Random Fields
YIN ⋅ Clinical Trial Design: Bayesian and Frequentist Adaptive Methods
YOUNG, VALERO-MORA, and FRIENDLY ⋅ Visual Statistics: Seeing Data with Dynamic Interactive Graphics
ZACKS ⋅ Examples and Problems in Mathematical Statistics
ZACKS ⋅ Stage-Wise Adaptive Designs
∗ ZELLNER ⋅ An Introduction to Bayesian Inference in Econometrics
ZELTERMAN ⋅ Discrete Distributions – Applications in the Health Sciences
ZHOU, OBUCHOWSKI, and MCCLISH ⋅ Statistical Methods in Diagnostic Medicine, Second Edition



∗ Now available in a lower priced paperback edition in the Wiley Classics Library.
† Now available in a lower priced paperback edition in the Wiley–Interscience Paperback Series.

WILEY END USER LICENSE AGREEMENT
Go to www.wiley.com/go/eula to access Wiley's ebook EULA.