Optimal Bayesian Classification
ISBN 1510630694, 9781510630697

Table of contents:
Contents
Preface
Acknowledgments
1 Classification and Error Estimation
1.1 Classifiers
1.2 Constrained Classifiers
1.3 Error Estimation
1.4 Random Versus Separate Sampling
1.5 Epistemology and Validity
1.5.1 RMS bounds
1.5.2 Error RMS in the Gaussian model
2 Optimal Bayesian Error Estimation
2.1 The Bayesian MMSE Error Estimator
2.2 Evaluation of the Bayesian MMSE Error Estimator
2.3 Performance Evaluation at a Fixed Point
2.4 Discrete Model
2.4.1 Representation of the Bayesian MMSE error estimator
2.4.2 Performance and robustness in the discrete model
2.5 Gaussian Model
2.5.1 Independent covariance model
2.5.2 Homoscedastic covariance model
2.5.3 Effective class-conditional densities
2.5.4 Bayesian MMSE error estimator for linear classification
2.6 Performance in the Gaussian Model with LDA
2.6.1 Fixed circular Gaussian distributions
2.6.2 Robustness to falsely assuming identity covariances
2.6.3 Robustness to falsely assuming Gaussianity
2.6.4 Average performance under proper priors
2.7 Consistency of Bayesian Error Estimation
2.7.1 Convergence of posteriors
2.7.2 Sufficient conditions for consistency
2.7.3 Discrete and Gaussian models
2.8 Calibration
2.8.1 MMSE calibration function
2.8.2 Performance with LDA
2.9 Optimal Bayesian ROC-based Analysis
2.9.1 Bayesian MMSE FPR and TPR estimation
2.9.2 Bayesian MMSE ROC and AUC estimation
2.9.3 Performance study
3 Sample-Conditioned MSE of Error Estimation
3.1 Conditional MSE of Error Estimators
3.2 Evaluation of the Conditional MSE
3.3 Discrete Model
3.4 Gaussian Model
3.4.1 Effective joint class-conditional densities
3.4.2 Sample-conditioned MSE for linear classification
3.4.3 Closed-form expressions for functions I and R
3.5 Average Performance in the Gaussian Model
3.6 Convergence of the Sample-Conditioned MSE
3.7 A Performance Bound for the Discrete Model
3.8 Censored Sampling
3.8.1 Gaussian model
3.9 Asymptotic Approximation of the RMS
3.9.1 Bayesian–Kolmogorov asymptotic conditions
3.9.2 Conditional expectation
3.9.3 Unconditional expectation
3.9.4 Conditional second moments
3.9.5 Unconditional second moments
3.9.6 Unconditional MSE
4 Optimal Bayesian Classification
4.1 Optimal Operator Design Under Uncertainty
4.2 Optimal Bayesian Classifier
4.3 Discrete Model
4.4 Gaussian Model
4.4.1 Both covariances known
4.4.2 Both covariances diagonal
4.4.3 Both covariances scaled identity or general
4.4.4 Mixed covariance models
4.4.5 Average performance in the Gaussian model
4.5 Transformations of the Feature Space
4.6 Convergence of the Optimal Bayesian Classifier
4.7 Robustness in the Gaussian Model
4.7.1 Falsely assuming homoscedastic covariances
4.7.2 Falsely assuming the variance of the features
4.7.3 Falsely assuming the mean of a class
4.7.4 Falsely assuming Gaussianity under Johnson distributions
4.8 Intrinsically Bayesian Robust Classifiers
4.9 Missing Values
4.9.1 Computation for application
4.10 Optimal Sampling
4.10.1 MOCU-based optimal experimental design
4.10.2 MOCU-based optimal sampling
4.11 OBC for Autoregressive Dependent Sampling
4.11.1 Prior and posterior distributions for VAR processes
4.11.2 OBC for VAR processes
5 Optimal Bayesian Risk-based Multi-class Classification
5.1 Bayes Decision Theory
5.2 Bayesian Risk Estimation
5.3 Optimal Bayesian Risk Classification
5.4 Sample-Conditioned MSE of Risk Estimation
5.5 Efficient Computation
5.6 Evaluation of Posterior Mixed Moments: Discrete Model
5.7 Evaluation of Posterior Mixed Moments: Gaussian Models
5.7.1 Known covariance
5.7.2 Homoscedastic general covariance
5.7.3 Independent general covariance
5.8 Simulations
6 Optimal Bayesian Transfer Learning
6.1 Joint Prior Distribution
6.2 Posterior Distribution in the Target Domain
6.3 Optimal Bayesian Transfer Learning Classifier
6.3.1 OBC in the target domain
6.4 OBTLC with Negative Binomial Distribution
7 Construction of Prior Distributions
7.1 Prior Construction Using Data from Discarded Features
7.2 Prior Knowledge from Stochastic Differential Equations
7.2.1 Binary classification of Gaussian processes
7.2.2 SDE prior knowledge in the BCGP model
7.3 Maximal Knowledge-Driven Information Prior
7.3.1 Conditional probabilistic constraints
7.3.2 Dirichlet prior distribution
7.4 REMLP for a Normal-Wishart Prior
7.4.1 Pathway knowledge
7.4.2 REMLP optimization
7.4.3 Application of a normal-Wishart prior
7.4.4 Incorporating regulation types
7.4.5 A synthetic example
References
Index

Library of Congress Cataloging-in-Publication Data Names: Dalton, Lori A., author. | Dougherty, Edward R., author. Title: Optimal Bayesian classification / Lori A. Dalton and Edward R. Dougherty. Description: Bellingham, Washington : SPIE Press, [2019] | Includes bibliographical references and index. Identifiers: LCCN 2019028901 (print) | LCCN 2019028902 (ebook) | ISBN 9781510630697 (paperback) | ISBN 9781510630710 (pdf) | ISBN 9781510630703 (epub) | ISBN 9781510630727 (kindle edition) Subjects: LCSH: Bayesian statistical decision theory. | Statistical decision. Classification: LCC QA279.5 D355 2020 (print) | LCC QA279.5 (ebook) | DDC 519.5/42–dc23 LC record available at https://lccn.loc.gov/2019028901 LC ebook record available at https://lccn.loc.gov/2019028902

Published by SPIE P.O. Box 10 Bellingham, Washington 98227-0010 USA Phone: +1 360.676.3290 Fax: +1 360.647.1445 Email: [email protected] Web: http://spie.org Copyright © 2020 Society of Photo-Optical Instrumentation Engineers (SPIE) All rights reserved. No part of this publication may be reproduced or distributed in any form or by any means without written permission of the publisher. The content of this book reflects the work and thought of the author. Every effort has been made to publish reliable and accurate information herein, but the publisher is not responsible for the validity of the information or for any outcomes resulting from reliance thereon. Cover photographs: Matt Anderson Photography/Moment via Getty Images and Jackie Niam/iStock via Getty Images. Printed in the United States of America. First Printing. For updates to this book, visit http://spie.org and type “PM310” in the search field.

To our spouses, Yousef and Terry, For being our companions and sources of support and encouragement.


Preface

The most basic problem of engineering is the design of optimal (or close-to-optimal) operators. The design of optimal operators takes different forms depending on the random process constituting the scientific model and the operator class of interest. The operators might be filters, controllers, or classifiers, each having numerous domains of application. The underlying random process might be a random signal/image for filtering, a Markov process for control, or a feature-label distribution for classification. Here we are interested in classification, and an optimal operator is a Bayes classifier, which is a classifier minimizing the classification error. With sufficient knowledge we can construct the feature-label distribution and thereby find a Bayes classifier. Rarely, and in practice virtually never, do we possess such knowledge. On the other hand, if we had unlimited data, we could accurately estimate the feature-label distribution and obtain a Bayes classifier. Rarely do we possess sufficient data. Therefore, we must use whatever knowledge and data are available to design a classifier whose performance is hopefully close to that of a Bayes classifier.

Classification theory has historically developed mainly on the side of data, the classical case being linear discriminant analysis (LDA), where a Bayes classifier is deduced from a Gaussian model and the parameters of the classifier are estimated from data via maximum-likelihood estimation. The idea is that we have knowledge that the true feature-label distribution is Gaussian (or close to Gaussian) and the data can fill in the parameters, in this case, the mean vectors and common covariance matrix. Much contemporary work takes an even less knowledge-driven approach by assuming some very general classifier form such as a neural network and estimating the network parameters by fitting the network to the data in some manner. The more general the classifier form, the more parameters to determine, and the more data needed. Moreover, there is growing danger of overfitting the classifier form to the data as the classifier structure becomes more complex. Lack of knowledge presents us with model uncertainty, and hypothesizing a classifier form and then estimating the parameters is an ad hoc way of dealing with that uncertainty. It is ad hoc because the designer postulates a classification rule based on some heuristics and then applies the rule to the data.


This book takes a Bayesian approach to modeling the feature-label distribution and designs an optimal classifier relative to a posterior distribution governing an uncertainty class of feature-label distributions. In this way it takes full advantage of knowledge regarding the underlying system and the available data. Its origins lie in the need to estimate classifier error when there is insufficient data to hold out test data, in which case an optimal error estimate can be obtained relative to the uncertainty class. A natural next step is to forgo classical ad hoc classifier design and simply find an optimal classifier relative to the posterior distribution over the uncertainty class—this being an optimal Bayesian classifier. A critical point is that, in general, for optimal operator design, the prior distribution is not on the parameters of the operator (controller, filter, classifier), but on the unknown parameters of the scientific model, which for classification is the feature-label distribution. If the model were known with certainty, then one would optimize with respect to the known model; if the model is uncertain, then the optimization is naturally extended to include model uncertainty and the prior distribution on that uncertainty. Model uncertainty induces uncertainty on the operator parameters, and the distribution of the latter uncertainty follows from the prior distribution on the model. If one places the prior directly on the operator parameters while ignoring model uncertainty, then there is a scientific gap, meaning that the relation between scientific knowledge and operator design is broken. The first chapter reviews the basics of classification and error estimation. It addresses the issue that confronts much of contemporary science and engineering: How do we characterize validity when data are insufficient for the complexity of the problem? In particular, what can be said regarding the accuracy of an error estimate? This is the most fundamental question for classification since the error estimate characterizes the predictive capacity of a classifier. Chapter 2 develops the theory of optimal Bayesian error estimation: What is the best estimate of classifier error given our knowledge and the data? It introduces what is perhaps the most important concept in the book: effective class-conditional densities. Optimal classifier design and error estimation for a particular feature-label distribution are based on the class-conditional densities. In the context of an uncertainty class of class-conditional densities, the key role is played by the effective class-conditional densities, which are the expected densities relative to the posterior distribution. The Bayesian minimum-mean-square error (MMSE) theory is developed for the discrete multinomial model and several Gaussian models. Sufficient conditions for error-estimation consistency are provided. The chapter closes with a discussion of optimal Bayesian ROC estimation. Chapter 3 addresses error-estimation accuracy. In the typical ad hoc classification paradigm, there is no way to address the accuracy of a particular
error estimate. We can only quantify the mean-square error (MSE) of error estimation relative to the sampling distribution. With Bayesian MMSE error estimation, we can compute the MSE of the error estimate conditioned on the actual sample relative to the uncertainty class. The sample-conditioned MSE is studied in the discrete and Gaussian models, and its consistency is established. Because a running MSE calculation can be performed as new sample points are collected, one can do censored sampling: stop sampling when the error estimate and MSE of the error estimate are sufficiently small. Section 3.9 provides double-asymptotic approximations of the first and second moments of the Bayesian MMSE error estimate relative to the sampling distribution and the uncertainty class, thereby providing asymptotic approximation to the MSE, or, its square root, the root-mean-square (RMS) error. Double asymptotic convergence means that both the sample size and the dimension of the space increase to infinity at a fixed rate between the two. Even though we omit many theoretical details (referring instead to the literature), this section is rather long, on account of double asymptotics, and contains many complicated equations. Nevertheless, it provides an instructive analysis of the relationship between the conditional and unconditional RMS. Regarding the chapter as a whole, it is not a logical prerequisite for succeeding chapters and can be skipped by those wishing to move directly to classification. Chapter 4 defines an optimal Bayesian classifier as one possessing minimum expected error relative to the uncertainty class, this expectation agreeing with the Bayesian MMSE error estimate. Optimal Bayesian classifiers are developed for a discrete model and several Gaussian models, and convergence to a Bayes classifier for the true feature-label distribution is studied. The robustness of assumptions on the prior distribution is discussed. The chapter has a section on intrinsically Bayesian robust classification, which is equivalent to optimal Bayesian classification with a null dataset. It next has a section showing how missing values in the data are incorporated into the overall optimization without having to implement an intermediate imputation step, which would cause a loss of optimality. The chapter closes with two sections in which sampling is not random. Section 4.10 considers optimal sampling, and Section 4.11 examines the effect of dependent sampling. Chapter 5 extends the theory to multi-class classification via optimal Bayesian risk classification. It includes evaluation of the sample-conditioned MSE of risk estimation and evaluation of the posterior mixed moments for both the discrete and Gaussian models. Chapter 6 extends the multi-class theory to transfer learning. Here, there are data from a different (source) feature-label distribution, and one wishes to use this data together with whatever data are available from the (target) feature-label distribution of interest. The source and target are linked via a joint prior distribution, and an optimal Bayesian transfer learning classifier is
derived for the posterior distribution in the target domain. Both Gaussian and negative binomial distributions are considered.

The final chapter addresses the fundamental problem of prior construction: How do we transform prior knowledge into a prior distribution? The first two sections address special cases: using data from discarded features, and using knowledge from a partially known physical system. The heart of the chapter is the development of a general method for transforming scientific knowledge into a prior distribution by performing an information-theoretic optimization over a class of potential priors with the optimization constrained by a set of conditional probability statements characterizing our scientific knowledge.

In a sense, this book is the last of a trilogy. The Evolution of Scientific Knowledge: From Certainty to Uncertainty (Dougherty, 2016) traces the epistemology of modern science from its deterministic beginnings in the Seventeenth century up through the inherent stochasticity of quantum theory in the first half of the Twentieth century, and then to the uncertainty in scientific models that has become commonplace in the latter part of the Twentieth century. This uncertainty leads to an inability to validate physical models, thereby limiting the scope of valid science. The last chapter of the book presents, from a philosophical perspective, the structure of operator design in the context of model uncertainty. Optimal Signal Processing Under Uncertainty (Dougherty, 2018) develops the mathematical theory articulated in that last chapter, applying it to filtering, control, classification, clustering, and experimental design. In this book, we extensively develop the classification theory summarized in that book.

Edward R. Dougherty
Lori A. Dalton
December 2019

Acknowledgments

The material in this book has been developed over a number of years and involves the contributions of numerous students and colleagues, whom we would like to acknowledge: Mohammadmahdi Rezaei Yousefi, Amin Zollanvari, Mohammad Shahrokh Esfahani, Shahin Boluki, Alireza Karbalayghareh, Siamak Zamani Dadaneh, Roozbeh Dehghannasiri, Xiaoning Qian, Byung-Jun Yoon, and Ulisses Braga-Neto. We also extend our appreciation to Dara Burrows for her careful editing of the book.


Chapter 1

Classification and Error Estimation

A classifier operates on a vector of features, which are random variables, and outputs a decision as to which class the feature vector belongs. We consider binary classification, meaning that the decision is between two classes. Typically, a classifier is designed and its error estimated from sample data. Two basic questions arise. First, given a set of features, how does one design a classifier from the sample data that provides good classification over the general population? Second, how does one estimate the error of a designed classifier from the data? This book examines both classifier design and error estimation when one is given prior knowledge regarding the population. The first chapter provides basic background knowledge.

1.1 Classifiers

Classification involves a feature vector $X = [X_1, X_2, \ldots, X_D]^T$ composed of random variables (features) on a $D$-dimensional Euclidean space $\mathcal{X} = \mathbb{R}^D$, a binary random variable (label) $Y$, and a classifier $\psi : \mathbb{R}^D \to \{0, 1\}$ to predict $Y \in \{0, 1\}$. We assume a joint feature-label distribution $f_{X,Y}(x, y)$ for the random vector–label pair $(X, Y)$. The feature-label distribution characterizes the classification problem and may be a generalized function. $f_{X|Y}(x|0)$ and $f_{X|Y}(x|1)$, the distributions of $X$ given $Y = 0$ and $Y = 1$, respectively, are known as the class-conditional distributions. The classification error is relative to the feature-label distribution and equals the probability of incorrect classification, $\Pr(\psi(X) \neq Y)$. The error also equals the expected (mean) absolute difference between the label and the classification, $E[|Y - \psi(X)|]$. An optimal classifier is one having minimal error among all classifiers $\psi : \mathbb{R}^D \to \{0, 1\}$. An optimal classifier $\psi_{\mathrm{Bayes}}$ is called a Bayes classifier, and its error $\varepsilon_{\mathrm{Bayes}}$ is called the Bayes error. The Bayes error is intrinsic to the feature-label distribution; however, there may be more than one Bayes classifier.

The error of an arbitrary classifier $\psi$ can be expressed as

$$\varepsilon = \int_{\mathcal{X}} \Pr(\psi(X) \neq Y \mid X = x) f_X(x)\,dx = \int_{\psi(x)=0} h(x) f_X(x)\,dx + \int_{\psi(x)=1} [1 - h(x)] f_X(x)\,dx, \quad (1.1)$$

where $f_X(x)$ is the marginal density of $X$, and $h(x) = \Pr(Y = 1 \mid X = x)$ is the posterior probability that $Y = 1$. The posterior distribution of $Y$ is proportional to the product of the prior distribution of $Y$ and the class-conditional density for $x$ via Bayes' theorem:

$$\Pr(Y = y \mid X = x) = \frac{\Pr(Y = y) f_{X|Y}(x|y)}{f_X(x)}. \quad (1.2)$$

Hence, the error is decomposed as

$$\varepsilon = c\,\varepsilon_0 + (1 - c)\,\varepsilon_1, \quad (1.3)$$

where $c = \Pr(Y = 0)$ is the a priori probability of class 0,

$$\varepsilon_0 = \Pr(\psi(X) = 1 \mid Y = 0) = \int_{\psi(x)=1} f_{X|Y}(x|0)\,dx \quad (1.4)$$

is the probability of an element from class 0 being wrongly classified (the error contributed by class 0), and, similarly, $\varepsilon_1 = \Pr(\psi(X) = 0 \mid Y = 1)$. Since $0 \le h(x) \le 1$, the right-hand side of Eq. 1.1 is minimized by the classifier

$$\psi_{\mathrm{Bayes}}(x) = \begin{cases} 0 & \text{if } h(x) \le 0.5, \\ 1 & \text{otherwise}, \end{cases} \;=\; \begin{cases} 0 & \text{if } c f_{X|Y}(x|0) \ge (1 - c) f_{X|Y}(x|1), \\ 1 & \text{otherwise}. \end{cases} \quad (1.5)$$

Hence, the Bayes classifier $\psi_{\mathrm{Bayes}}(x)$ is defined to be 0 or 1 according to whether $Y$ is more likely to be 0 or 1 given $x$ (ties may be broken arbitrarily). It follows from Eqs. 1.1 and 1.5 that the Bayes error is given by

$$\varepsilon_{\mathrm{Bayes}} = \int_{h(x) \le 0.5} h(x) f_X(x)\,dx + \int_{h(x) > 0.5} [1 - h(x)] f_X(x)\,dx. \quad (1.6)$$
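
To make Eqs. 1.2–1.6 concrete, the following sketch (not from the book) computes the Bayes classifier and the Bayes error numerically for two univariate Gaussian class-conditional densities with equal variance; the means, variance, and prior are arbitrary illustrative choices.

```python
# Sketch (not from the book): Bayes classifier and Bayes error for two
# univariate Gaussian class-conditional densities, illustrating Eqs. 1.2-1.6.
# The means, variance, and prior below are arbitrary illustrative choices.
import numpy as np
from scipy.stats import norm

c = 0.5                            # prior probability of class 0, c = Pr(Y = 0)
mu0, mu1, sigma = 0.0, 1.0, 1.0    # class-conditional parameters

def h(x):
    """Posterior h(x) = Pr(Y = 1 | X = x) via Bayes' theorem (Eq. 1.2)."""
    f0 = norm.pdf(x, mu0, sigma)
    f1 = norm.pdf(x, mu1, sigma)
    return (1 - c) * f1 / (c * f0 + (1 - c) * f1)

def psi_bayes(x):
    """Bayes classifier (Eq. 1.5): decide class 1 when h(x) > 0.5."""
    return (h(x) > 0.5).astype(int)

# Bayes error (Eq. 1.6) by numerical integration over a fine grid.
x = np.linspace(-10.0, 11.0, 200001)
fX = c * norm.pdf(x, mu0, sigma) + (1 - c) * norm.pdf(x, mu1, sigma)
integrand = np.where(h(x) <= 0.5, h(x), 1.0 - h(x)) * fX
eps_bayes = np.sum(integrand) * (x[1] - x[0])

# For equal variances and c = 0.5, the Bayes classifier thresholds x at
# (mu0 + mu1)/2 and the Bayes error equals Phi(-|mu1 - mu0| / (2*sigma)).
print(psi_bayes(np.array([0.2, 0.8])))
print(eps_bayes, norm.cdf(-abs(mu1 - mu0) / (2 * sigma)))
```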

In practice, the feature-label distribution is typically unknown, and a classifier must be designed from sample data. We assume a dataset consisting of n points,

$$S_n = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}, \quad (1.7)$$

which is a realization of vector–label pairs drawn from the feature-label distribution. A classification rule is a mapping of the form $\Psi_n : [\mathbb{R}^D \times \{0, 1\}]^n \to \mathcal{F}$, where $\mathcal{F}$ is the family of $\{0, 1\}$-valued functions on $\mathbb{R}^D$. (Without further mention, all functions are assumed to be measurable.) The subscript $n$ emphasizes that a classification rule is actually a collection of classification rules depending on $n$. Given a sample $S_n$, we obtain a designed classifier $\psi_n = \Psi_n(S_n) \in \mathcal{F}$ according to the rule $\Psi_n$. Although perhaps it would be more complete to write $\psi_n(x; S_n)$ rather than $\psi_n(x)$, we use the simpler notation, keeping in mind that $\psi_n$ has been trained from a sample. Note that, relative to the sample, $\psi_n$ is a random function depending on the sampling distribution generating the sample. For a particular sample (dataset), $\psi_n$ is simply a binary function on $\mathbb{R}^D$. As a binary function, $\psi_n$ is determined by a decision boundary between two regions. Letting $\varepsilon_n$ denote the error of a designed classifier $\psi_n$, relative to the Bayes error there is a design cost $\Delta_n = \varepsilon_n - \varepsilon_{\mathrm{Bayes}}$, where $\varepsilon_n$ and $\Delta_n$ are sample-dependent random variables. The expected design cost is $E[\Delta_n]$, and the expected error of $\psi_n$ is decomposed as

$$E[\varepsilon_n] = \varepsilon_{\mathrm{Bayes}} + E[\Delta_n]. \quad (1.8)$$

The expectations $E[\varepsilon_n]$ and $E[\Delta_n]$ are relative to the sampling distribution, and these quantities measure the performance of the classification rule relative to the feature-label distribution, rather than the performance of an individual designed classifier. Nonetheless, a classification rule for which $E[\varepsilon_n]$ is small will also tend to produce designed classifiers possessing small error.

Asymptotic properties of a classification rule concern large samples (as $n \to \infty$). A rule is said to be consistent for a distribution of $(X, Y)$ if $\Delta_n \to 0$ in probability, meaning that $\Pr(|\Delta_n| > \delta) \to 0$ as $n \to \infty$ for any $\delta > 0$. A classification rule is universally consistent if $\Delta_n \to 0$ in probability for any distribution of $(X, Y)$. By Markov's inequality, consistency is assured if $E[\Delta_n] \to 0$. A classification rule is said to be strongly consistent for a distribution of $(X, Y)$ if $\Delta_n \to 0$ almost surely (a.s.) as $n \to \infty$, meaning that $\Delta_n \to 0$ with probability 1. Specifically, $\Delta_n \to 0$ as $n \to \infty$ for all sequences of datasets of increasing size $\{S_1, S_2, \ldots\}$, except on a set of sequences with probability 0. Strong consistency implies consistency. A rule is universally strongly consistent if $\Delta_n \to 0$ almost surely for any distribution of $(X, Y)$.

To illustrate consistency, suppose that $h_n(x)$ is an estimate of $h(x)$ based on a sample $S_n$. Then the plug-in classification rule is defined by replacing $h(x)$ with $h_n(x)$ in Eq. 1.5. For this rule,

$$\Delta_n = \varepsilon_n - \varepsilon_{\mathrm{Bayes}} = \int_{\psi_n(x)=0,\,\psi_{\mathrm{Bayes}}(x)=1} [2h(x) - 1] f_X(x)\,dx + \int_{\psi_n(x)=1,\,\psi_{\mathrm{Bayes}}(x)=0} [1 - 2h(x)] f_X(x)\,dx = \int_{\psi_n(x) \neq \psi_{\mathrm{Bayes}}(x)} |2h(x) - 1|\, f_X(x)\,dx. \quad (1.9)$$

Since $|h(x) - 0.5| \le |h(x) - h_n(x)|$ when $\psi_n(x) \neq \psi_{\mathrm{Bayes}}(x)$,

$$E[\Delta_n] \le 2E[|h(X) - h_n(X)|]. \quad (1.10)$$

Applying the Cauchy–Schwarz inequality yields

$$E[\Delta_n] \le 2E[|h(X) - h_n(X)|^2]^{1/2}. \quad (1.11)$$

Hence, if $h_n(X)$ converges to $h(X)$ in the mean-square sense, then the plug-in classifier is consistent. In fact, $E[\Delta_n]$ converges much faster than does $E[|h(X) - h_n(X)|^2]^{1/2}$, which in turn shows that classification is easier than density estimation, in the sense that it requires less data. Specifically, if $E[|h(X) - h_n(X)|^2] \to 0$ as $n \to \infty$, then for the plug-in rule (Devroye et al., 1996),

$$\lim_{n \to \infty} \frac{E[\Delta_n]}{E[|h(X) - h_n(X)|^2]^{1/2}} = 0. \quad (1.12)$$
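
A plug-in rule is easy to realize once an estimate $h_n$ is chosen. The sketch below (not from the book) uses a Nadaraya–Watson kernel estimate of $h(x)$ in one dimension and thresholds it at 0.5 as in Eq. 1.5; the Gaussian kernel and bandwidth are arbitrary illustrative choices.

```python
# Sketch (not from the book): a one-dimensional plug-in classification rule.
# h_n(x) is a Nadaraya-Watson kernel estimate of h(x) = Pr(Y = 1 | X = x),
# thresholded at 0.5 as in Eq. 1.5. Kernel and bandwidth are illustrative.
import numpy as np

def plug_in_rule(x_train, y_train, bandwidth=0.3):
    def h_n(x):
        # Gaussian kernel weight of each training point for the query x
        w = np.exp(-0.5 * ((x - x_train) / bandwidth) ** 2)
        return np.sum(w * y_train) / np.sum(w)
    return lambda x: 1 if h_n(x) > 0.5 else 0   # the designed classifier psi_n

rng = np.random.default_rng(0)
n = 50
y_train = rng.integers(0, 2, size=n)                          # labels
x_train = rng.normal(loc=y_train.astype(float), scale=1.0)    # X | Y = y ~ N(y, 1)
psi_n = plug_in_rule(x_train, y_train)
print([psi_n(x) for x in (-1.0, 0.5, 2.0)])
```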

As another example of consistency, we consider the cubic histogram rule, in which $\mathbb{R}^D$ is partitioned into hypercubes of side length $r_n$. For each point $x$ in $\mathbb{R}^D$, $\psi_n(x)$ is defined to be 0 or 1 according to which is the majority among the labels for points in the cube containing $x$. If the cubes are defined such that $r_n \to 0$ and $n r_n^D \to \infty$ as $n \to \infty$, then the rule is universally consistent (Gordon and Olshen, 1978; Devroye et al., 1996). Lastly, for the $k$-nearest-neighbor (kNN) rule with $k$ odd, the $k$ points closest to $x$ are selected, and $\psi_n(x)$ is defined to be 0 or 1 according to which is the majority among the labels of these points. For a fixed $k$, $\lim_{n \to \infty} E[\Delta_n] \le (ke)^{-1/2}$ (Devroye et al., 1996), which does not give consistency. However, if $k \to \infty$ and $k/n \to 0$ as $n \to \infty$, then the kNN rule is universally consistent (Stone, 1977).

While universal consistency is an appealing property, its practical application is dubious. For instance, the expected design cost $E[\Delta_n]$ cannot be made universally small for fixed $n$. Indeed, for any $t > 0$, fixed $n$, and designed classifier $\psi_n$, there exists a distribution of $(X, Y)$ such that $\varepsilon_{\mathrm{Bayes}} = 0$ and $E[\varepsilon_n] > 0.5 - t$ (Devroye, 1982). Moreover, even if a classifier is universally consistent, the rate at which $E[\Delta_n] \to 0$ is critical to application. If we desire a classifier whose expected error is within some tolerance of the Bayes error, consistency is not sufficient. Rather, we would like a statement of the following form: for any $t > 0$, there exists $n(t)$ such that, for $n > n(t)$, $E[\Delta_n] < t$ for any distribution of $(X, Y)$. Unfortunately, even if a classification rule is universally consistent, the design error converges to 0 arbitrarily slowly relative to all possible distributions. To wit, if $\{a_n\}$ is a decreasing sequence such that $1/16 \ge a_1 \ge a_2 \ge \cdots > 0$, then for any sequence of designed classifiers $\psi_n$, there exists a distribution of $(X, Y)$ such that $\varepsilon_{\mathrm{Bayes}} = 0$ and $E[\varepsilon_n] > a_n$ (Devroye, 1982). More generally, consistency is of little consequence for small-sample classifier design, which is a key focus of the current text.
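
A minimal sketch of the kNN rule just described (not from the book); choosing $k$ on the order of $\sqrt{n}$ (and odd) is one way to satisfy the conditions $k \to \infty$ and $k/n \to 0$.

```python
# Sketch (not from the book): the k-nearest-neighbor (kNN) classification rule.
import numpy as np

def knn_rule(x_train, y_train, k):
    """Return a classifier psi_n(x): majority vote among the k closest points."""
    def classifier(x):
        d = np.linalg.norm(x_train - x, axis=1)       # Euclidean distances to x
        nearest = np.argsort(d)[:k]                   # indices of the k closest points
        return int(np.sum(y_train[nearest]) > k / 2)  # majority label (k odd)
    return classifier

rng = np.random.default_rng(1)
n, D = 200, 2
y = rng.integers(0, 2, size=n)
x = rng.normal(size=(n, D)) + y[:, None]              # class-1 mean shifted to (1, 1)
k = 2 * (int(np.sqrt(n)) // 2) + 1                    # odd k growing with n (k ~ sqrt(n))
psi_n = knn_rule(x, y, k)
print(k, psi_n(np.array([0.0, 0.0])), psi_n(np.array([1.0, 1.0])))
```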

1.2 Constrained Classifiers

A common problem with small-sample design is that $E[\Delta_n]$ tends to be large. A classification rule may yield a classifier that performs well on the sample data. However, if the small sample does not represent the distribution sufficiently, then the designed classifier will not perform well on the distribution. Constraining classifier design means restricting the functions from which a classifier can be chosen to a class $\mathcal{C}$. Constraining the classifier can reduce the expected design error, but at the cost of increasing the error of the best possible classifier. Since optimization in $\mathcal{C}$ is over a subclass of classifiers, the error $\varepsilon_{\mathcal{C}}$ of an optimal classifier $\psi_{\mathcal{C}} \in \mathcal{C}$ will typically exceed the Bayes error, unless a Bayes classifier happens to be in $\mathcal{C}$. We call $\Delta_{\mathcal{C}} = \varepsilon_{\mathcal{C}} - \varepsilon_{\mathrm{Bayes}}$ the cost of constraint. A classification rule yields a classifier $\psi_{n,\mathcal{C}} \in \mathcal{C}$ with error $\varepsilon_{n,\mathcal{C}}$, where $\varepsilon_{n,\mathcal{C}} \ge \varepsilon_{\mathcal{C}} \ge \varepsilon_{\mathrm{Bayes}}$. Design error for constrained classification is $\Delta_{n,\mathcal{C}} = \varepsilon_{n,\mathcal{C}} - \varepsilon_{\mathcal{C}}$. The expected error of the designed classifier from $\mathcal{C}$ can be decomposed as

$$E[\varepsilon_{n,\mathcal{C}}] = \varepsilon_{\mathrm{Bayes}} + \Delta_{\mathcal{C}} + E[\Delta_{n,\mathcal{C}}]. \quad (1.13)$$

The constraint is beneficial if and only if the cost of constraint is less than the decrease in expected design cost. The dilemma is that strong constraint reduces $E[\Delta_{n,\mathcal{C}}]$ at the cost of increasing $\varepsilon_{\mathcal{C}}$.

Historically, discriminant functions have played an important role in classification. Keeping in mind that the logarithm is a strictly increasing function, if we define the discriminant function by

$$d_y(x) = \ln f_{X|Y}(x|y) + \ln \Pr(Y = y), \quad (1.14)$$

then the misclassification error is minimized by $\psi_{\mathrm{Bayes}}(x) = y$ if $d_y(x) \ge d_j(x)$ for $j = 0, 1$, or equivalently, $\psi_{\mathrm{Bayes}}(x) = 0$ if and only if $g(x) = d_1(x) - d_0(x) \le 0$. We are particularly interested in quadratic discrimination, where for a $D \times D$ matrix $A$, length-$D$ column vector $a$, and constant $b$,

$$g(x) = x^T A x + a^T x + b, \quad (1.15)$$

where superscript $T$ denotes matrix transpose, and the resulting classifier produces quadratic decision boundaries. An important special case is linear discrimination, for which

$$g(x) = a^T x + b, \quad (1.16)$$

and the resulting classifier produces a hyperplane decision boundary.

If the conditional densities $f_{X|Y}(x|0)$ and $f_{X|Y}(x|1)$ are Gaussian with positive definite covariance matrices, then

$$f_{X|Y}(x|y) = \frac{1}{(2\pi)^{D/2} |\Sigma_y|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_y)^T \Sigma_y^{-1} (x - \mu_y)\right), \quad (1.17)$$

where $\mu_y$ and $\Sigma_y$ are the mean vector and covariance matrix for class $y \in \{0, 1\}$, respectively, and $|\cdot|$ is the matrix determinant operator. The covariance matrix of a multivariate Gaussian distribution must be positive semi-definite, and it is positive definite unless one variable is a linear function of the other variables. Conversely, every symmetric positive semi-definite matrix is a covariance matrix. Recall that a symmetric real matrix $\Sigma$ is positive definite if $x^T \Sigma x > 0$ for any nonzero vector $x$, and it is positive semi-definite if $x^T \Sigma x \ge 0$. A symmetric matrix is positive definite if and only if all of its eigenvalues are positive. Inserting $f_{X|Y}(x|y)$ into Eq. 1.14 and dropping the constant terms yields

$$d_y(x) = -\frac{1}{2}(x - \mu_y)^T \Sigma_y^{-1} (x - \mu_y) - \frac{1}{2} \ln|\Sigma_y| + \ln \Pr(Y = y). \quad (1.18)$$

Assuming that the Gaussian distributions are known, and denoting the class-0 probability by $c = \Pr(Y = 0)$, the Bayes (optimal) classifier is given by the quadratic discriminant

$$g_{\mathrm{Bayes}}(x) = x^T A_{\mathrm{Bayes}} x + a_{\mathrm{Bayes}}^T x + b_{\mathrm{Bayes}}, \quad (1.19)$$

with constant matrix $A_{\mathrm{Bayes}}$, column vector $a_{\mathrm{Bayes}}$, and scalar $b_{\mathrm{Bayes}}$ given by

$$A_{\mathrm{Bayes}} = -\frac{1}{2}(\Sigma_1^{-1} - \Sigma_0^{-1}), \quad (1.20)$$

$$a_{\mathrm{Bayes}} = \Sigma_1^{-1} \mu_1 - \Sigma_0^{-1} \mu_0, \quad (1.21)$$

$$b_{\mathrm{Bayes}} = -\frac{1}{2}\left(\mu_1^T \Sigma_1^{-1} \mu_1 - \mu_0^T \Sigma_0^{-1} \mu_0\right) + \frac{1}{2} \ln\left(\frac{|\Sigma_0|}{|\Sigma_1|}\right) + \ln\left(\frac{1 - c}{c}\right). \quad (1.22)$$

If the classes possess a common covariance matrix $\Sigma$, then $\ln|\Sigma_y|$ can also be dropped from $d_y(x)$. The Bayes classifier is now given by the linear discriminant

$$g_{\mathrm{Bayes}}(x) = a_{\mathrm{Bayes}}^T x + b_{\mathrm{Bayes}}, \quad (1.23)$$

where

$$a_{\mathrm{Bayes}} = \Sigma^{-1}(\mu_1 - \mu_0), \quad (1.24)$$

$$b_{\mathrm{Bayes}} = -\frac{1}{2}(\mu_1 - \mu_0)^T \Sigma^{-1} (\mu_1 + \mu_0) + \ln\left(\frac{1 - c}{c}\right). \quad (1.25)$$
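
The discriminants in Eqs. 1.19–1.25 translate directly into code. The sketch below (not from the book) builds $g_{\mathrm{Bayes}}$ for known Gaussian parameters; the specific means, covariances, and prior are arbitrary illustrative choices.

```python
# Sketch (not from the book): the Bayes quadratic and linear discriminants of
# Eqs. 1.19-1.25, assuming the Gaussian parameters are known.
import numpy as np

def bayes_quadratic(m0, m1, S0, S1, c):
    """Return g_Bayes(x) = x^T A x + a^T x + b for the heteroscedastic model."""
    S0i, S1i = np.linalg.inv(S0), np.linalg.inv(S1)
    A = -0.5 * (S1i - S0i)                                       # Eq. 1.20
    a = S1i @ m1 - S0i @ m0                                      # Eq. 1.21
    b = (-0.5 * (m1 @ S1i @ m1 - m0 @ S0i @ m0)                  # Eq. 1.22
         + 0.5 * np.log(np.linalg.det(S0) / np.linalg.det(S1))
         + np.log((1 - c) / c))
    return lambda x: x @ A @ x + a @ x + b

def bayes_linear(m0, m1, S, c):
    """Return g_Bayes(x) = a^T x + b for the homoscedastic model (Eq. 1.23)."""
    Si = np.linalg.inv(S)
    a = Si @ (m1 - m0)                                           # Eq. 1.24
    b = -0.5 * (m1 - m0) @ Si @ (m1 + m0) + np.log((1 - c) / c)  # Eq. 1.25
    return lambda x: a @ x + b

m0, m1 = np.zeros(3), np.ones(3)
g_lin = bayes_linear(m0, m1, np.eye(3), c=0.5)
g_quad = bayes_quadratic(m0, m1, np.eye(3), 2.0 * np.eye(3), c=0.5)
x = np.array([0.2, 0.1, 0.4])
label = 0 if g_lin(x) <= 0 else 1     # decide class 0 when g(x) <= 0
print(g_lin(x), g_quad(x), label)
```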

A Gaussian model is called heteroscedastic if the class covariance matrices are unequal and homoscedastic if they are equal. When the parameters of the distributions are unknown, plug-in classification rules are formed by estimating the covariance matrices, mean vectors, and prior class probabilities from the data by using the standard sample estimators (sample covariance and sample mean). Quadratic discriminant analysis (QDA) substitutes the sample means for each class,

$$\hat{\mu}_y = \frac{1}{n_y} \sum_{i=1}^{n_y} x_i^y, \quad (1.26)$$

and the sample covariances for each class,

$$\hat{\Sigma}_y = \frac{1}{n_y - 1} \sum_{i=1}^{n_y} (x_i^y - \hat{\mu}_y)(x_i^y - \hat{\mu}_y)^T, \quad (1.27)$$

for the corresponding true parameters $\mu_y$ and $\Sigma_y$, where $n_y$ is the number of sample points in class $y \in \{0, 1\}$, and $x_1^y, \ldots, x_{n_y}^y$ are the sample points in class $y$. For the class probability $c$, QDA substitutes either a fixed value or a maximum-likelihood estimate $n_0/n$. Linear discriminant analysis (LDA) substitutes the sample means in the true means $\mu_0$ and $\mu_1$, the pooled sample covariance

$$\hat{\Sigma} = \frac{(n_0 - 1)\hat{\Sigma}_0 + (n_1 - 1)\hat{\Sigma}_1}{n - 2} \quad (1.28)$$

in the common true covariance $\Sigma$, and either a fixed value or estimated class-0 probability for the true class-0 probability $c$. If it is assumed that $c = 0.5$, then from Eq. 1.23 the LDA classifier produces $\psi_n(x) = 0$ if and only if $W(x) \le 0$, with discriminant

$$W(x) = (\hat{\mu}_1 - \hat{\mu}_0)^T \hat{\Sigma}^{-1} \left(x - \frac{\hat{\mu}_1 + \hat{\mu}_0}{2}\right). \quad (1.29)$$

$W(x)$ is known as the Anderson W statistic (Anderson, 1951). Much of the classical theory concerning LDA in the Gaussian model uses the Anderson W statistic because it greatly simplifies analysis. Practically, it is reasonable to use $W(x)$ if there is good reason to believe that $c \approx 0.5$, or when estimating $c$ is problematic and it is believed that $c$ is not too far from 0.5.

If we further assume that $\Sigma = \sigma^2 I_D$ for some scalar $\sigma^2$, where $I_D$ is the $D \times D$ identity matrix, then the Bayes classifier simplifies to

$$g_{\mathrm{Bayes}}(x) = (\mu_1 - \mu_0)^T \left(x - \frac{\mu_1 + \mu_0}{2}\right) + \sigma^2 \ln\left(\frac{1 - c}{c}\right). \quad (1.30)$$

The corresponding plug-in classification rule substitutes the sample means, $\mathrm{tr}(\hat{\Sigma})/D$, and either a fixed value or an estimate of the class-0 probability for the true parameters $\mu_0$, $\mu_1$, $\sigma^2$, and $c$, respectively, where $\mathrm{tr}(\cdot)$ denotes the matrix trace operator. If it is assumed that $c = 0.5$, then $g_{\mathrm{Bayes}}$ assigns each point the class label with closer mean. The corresponding plug-in rule is the nearest-mean-classification (NMC) rule, which substitutes the sample means for the corresponding true parameters $\mu_0$ and $\mu_1$ to arrive at

$$g_{\mathrm{NMC}}(x) = (\hat{\mu}_1 - \hat{\mu}_0)^T \left(x - \frac{\hat{\mu}_1 + \hat{\mu}_0}{2}\right). \quad (1.31)$$

QDA and LDA can perform reasonably well so long as the class-conditional densities are not too far from Gaussian and there are sufficient data to obtain good parameter estimates. Owing to the greater number of parameters (greater complexity) to be estimated for QDA as opposed to LDA, one can proceed with smaller samples with LDA than with QDA.

We have defined QDA and LDA in two possible ways, plugging in a fixed value for $c$, or employing the maximum-likelihood estimate $\hat{c} = n_0/n$, the default being the latter. For small samples, knowing $c$ provides significant benefit, the reason being that while $\hat{c} \to c$ in probability as $n \to \infty$ (by Bernoulli's weak law of large numbers), $\hat{c}$ possesses significant variance for small $n$. Indeed, for small $n$, as long as $|0.5 - c|$ is not too great, the Anderson W statistic, which uses $\hat{c} = 0.5$, can outperform the default LDA. These considerations are illustrated in Fig. 1.1, which is based on Gaussian class-conditional densities with class 0 having mean $0_D \equiv [0, 0, \ldots, 0]^T$, class 1 having mean $1_D \equiv [1, 1, \ldots, 1]^T$, and both classes having covariance matrix $I_D$, with $D = 3$. In the figure, the horizontal and vertical axes give $c$ and the average true error, respectively. Parts (a) and (b) are for $n = 20$ and $n = 50$, respectively, where each part shows the performance of LDA with $c$ known, $c$ estimated by $n_0/n$ (default LDA), and $c$ substituted by 0.5 (Anderson W statistic).

Figure 1.1 Average true error of LDA under Gaussian classes with respect to c: (a) n = 20; (b) n = 50. [Reprinted from (Esfahani and Dougherty, 2013).]

The advantage of knowing $c$ is evident, and this advantage is nonnegligible for small $n$. Estimating $c$ is sufficiently difficult with small samples that in the case of $n = 20$, for $0.38 \le c \le 0.62$, the Anderson W statistic outperforms the default LDA.

The true error for any classifier $\psi$ with linear discriminant $g(x) = a^T x + b$ on Gaussian distributions with known means $\mu_y$ and covariances $\Sigma_y$ ($\Sigma_0 \neq \Sigma_1$ allowable) is given in closed form by placing

$$\varepsilon_y = \Phi\left(\frac{(-1)^y\, g(\mu_y)}{\sqrt{a^T \Sigma_y a}}\right) \quad (1.32)$$

in Eq. 1.3 for $y = 0, 1$, where $\Phi$ is the standard normal cumulative distribution function (CDF) (Sitgreaves, 1961). Hence, the Bayes error when $\Sigma = \Sigma_0 = \Sigma_1$ is given by

$$\varepsilon_{\mathrm{Bayes}} = c\,\varepsilon_{\mathrm{Bayes},0} + (1 - c)\,\varepsilon_{\mathrm{Bayes},1}, \quad (1.33)$$

where

$$\varepsilon_{\mathrm{Bayes},y} = \Phi\left(\frac{(-1)^y\, g_{\mathrm{Bayes}}(\mu_y)}{\sqrt{a_{\mathrm{Bayes}}^T \Sigma\, a_{\mathrm{Bayes}}}}\right). \quad (1.34)$$
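
Equations 1.26–1.29 and 1.32 combine naturally into a small experiment. The sketch below (not from the book) fits the Anderson W discriminant from synthetic Gaussian data and evaluates its exact true error under the known model; all parameter values are arbitrary illustrative choices.

```python
# Sketch (not from the book): fit the LDA discriminant of Eqs. 1.26-1.29 from
# synthetic Gaussian data and evaluate its exact true error with Eq. 1.32.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
D, n0, n1 = 3, 10, 10
m0, m1, S = np.zeros(D), np.ones(D), np.eye(D)        # true (unknown) parameters
X0 = rng.multivariate_normal(m0, S, size=n0)
X1 = rng.multivariate_normal(m1, S, size=n1)

# Sample means (Eq. 1.26), sample covariances (Eq. 1.27), pooled covariance (Eq. 1.28)
mu0_hat, mu1_hat = X0.mean(axis=0), X1.mean(axis=0)
S0_hat, S1_hat = np.cov(X0, rowvar=False), np.cov(X1, rowvar=False)
S_hat = ((n0 - 1) * S0_hat + (n1 - 1) * S1_hat) / (n0 + n1 - 2)

# Anderson W statistic (Eq. 1.29): W(x) = a.x + b with c taken to be 0.5
Si = np.linalg.inv(S_hat)
a = Si @ (mu1_hat - mu0_hat)
b = -0.5 * (mu1_hat - mu0_hat) @ Si @ (mu1_hat + mu0_hat)

# Exact class-conditional errors of this linear classifier under the true
# Gaussian model (Eq. 1.32), combined as in Eq. 1.3 with c = 0.5.
def class_error(y, my, Sy):
    return norm.cdf(((-1) ** y) * (a @ my + b) / np.sqrt(a @ Sy @ a))

true_error = 0.5 * class_error(0, m0, S) + 0.5 * class_error(1, m1, S)
print(true_error)
```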

Generally speaking, the more complex a classifier class $\mathcal{C}$, the smaller the constraint cost and the greater the design cost. By this we mean that the more finely the functions in $\mathcal{C}$ partition the feature space $\mathbb{R}^D$, the better functions within it can approximate a Bayes classifier, and, concomitantly, the more they can overfit the data. This notion can be illustrated via a celebrated theorem that provides bounds for $E[\Delta_{n,\mathcal{C}}]$. It concerns the empirical error classification rule, which chooses the classifier in $\mathcal{C}$ that makes the least number of errors on the sample data. For this rule,

$$E[\Delta_{n,\mathcal{C}}] \le 4\sqrt{\frac{V_{\mathcal{C}} \log_2 n + 4}{2n}}, \quad (1.35)$$

where $V_{\mathcal{C}}$ is the Vapnik–Chervonenkis (VC) dimension of $\mathcal{C}$ (Vapnik, 1971) [see (Devroye et al., 1996) for the definition]. It is not our intent to pursue this well-studied topic, but only to note that the VC dimension provides a measure of classifier complexity and that $n$ must greatly exceed $V_{\mathcal{C}}$ for the preceding bound to be small. The VC dimension of a linear classifier is $D + 1$. For a neural network with an even number of neurons $k$, the VC dimension has the lower bound $V_{\mathcal{C}} \ge Dk$. If $k$ is odd, then $V_{\mathcal{C}} \ge D(k - 1)$. Thus, the bound exceeds $4\sqrt{D(k - 1)\log_2 n / (2n)}$, which is not promising for small $n$.

A common source of complexity is the number of features on which a classifier operates. This dimensionality problem motivates feature selection when designing classifiers. In general, feature selection is part of the overall classification rule and, relative to this rule, the number of variables is the number in the data measurements, not the final number used in the designed classifier. Feature selection results in a subfamily of the original family of classifiers and thereby constitutes a form of constraint. For instance, if there are $D$ features available and LDA is used directly, then the classifier family consists of all hyperplanes in a $D$-dimensional space, but if a feature-selection algorithm reduces the number of variables to $d < D$ prior to application of LDA, then the classifier family consists of all hyperplanes in the $D$-dimensional space that are confined to $d$-dimensional subspaces. Since the role of feature selection is constraint, assessing the worth of feature selection involves us in the standard dilemma: increasing constraint (decreasing the number of features selected) incurs a cost but reduces design error.

Given a sample, in principle one could consider all feature sets of sizes 1 through $D$, design the classifier corresponding to each feature set, and choose the feature set $D_{n,d}$ of size $d$ whose designed classifier $\psi_{n,d}$ has the smallest error $\varepsilon_{n,d}$. The first problem with this exhaustive search is computational: too many classifiers to design. Moreover, to select a subset of $d$ features from a set of $D$ features with minimum error among all feature sets of size $d$, all $d$-element subsets must be checked unless there is distributional knowledge that mitigates the search requirement (Cover and Van Campenhout, 1977). Thus, a full search cannot be avoided if we want to assure optimality. Second, since the errors of all designed classifiers over all feature sets have to be estimated, inaccurate estimation can lead to poor feature-set ranking—a problem exacerbated with small samples (Sima et al., 2005). Numerous suboptimal feature-selection algorithms have been proposed to address computational limitations. Those that do not employ error estimation are still impacted by the need to estimate certain parameters or statistics in the selection process.

Now, consider a classification problem in which there are $D$ potential features, listed as $x_1, x_2, \ldots, x_D$, the feature-label distribution is known for the full set of features, and a sequence of classifiers is designed by adding one feature at a time. Let $\psi_d$ be the optimal classifier for feature-set size $d$ and $\varepsilon_d$ be the true error of $\psi_d$ on the known distribution. Then, as features are added, the optimal error is non-increasing, that is, $\varepsilon_{d+1} \le \varepsilon_d$ for $d = 1, 2, \ldots, D - 1$. However, if $\psi_{n,d}$ is trained from a sample of size $n$, then it is commonplace to have the expected error (across the sampling distribution) decrease and then increase for increasing $d$. This behavior is known as the peaking phenomenon (Hughes, 1968; Hua et al., 2005). In fact, the situation can be more complicated and depends on the classification rule and feature-label distribution (Dougherty et al., 2009b; Hua et al., 2009). For instance, the optimal number of features need not increase as the sample size increases. Moreover, for fixed sample size the error curve need not be concave; for instance, the expected error may decrease, increase, and then decrease again as the feature-set size grows. Finally, if a feature-selection algorithm is applied, then the matter becomes far more complicated (Sima and Dougherty, 2008).
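
As a quick numerical check on Eq. 1.35 (not from the book), the following sketch evaluates the bound for the empirical-error rule over linear classifiers, for which $V_{\mathcal{C}} = D + 1$; the bound only becomes informative when $n$ greatly exceeds $V_{\mathcal{C}}$.

```python
# Sketch (not from the book): evaluating the VC bound of Eq. 1.35 for linear
# classifiers (V_C = D + 1) at several sample sizes.
import numpy as np

def vc_bound(V_C, n):
    return 4 * np.sqrt((V_C * np.log2(n) + 4) / (2 * n))

D = 10
for n in (100, 1000, 10000, 100000):
    print(n, vc_bound(D + 1, n))   # > 1 (vacuous) until n is quite large
```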

1.3 Error Estimation

Abstractly, any pair $(\psi, \varepsilon)$ composed of a function $\psi : \mathbb{R}^D \to \{0, 1\}$ and a real number $\varepsilon \in [0, 1]$ constitutes a classifier model, with $\varepsilon$ being simply a number, not necessarily specifying an actual error probability corresponding to $\psi$. $(\psi, \varepsilon)$ becomes a scientific model when it is understood in the context of a feature-label distribution. A designed classifier produces a classifier model, namely, $(\psi_n, \varepsilon_n)$. Relative to the sampling distribution generating the sample $S_n$, $(\psi_n, \varepsilon_n)$ consists of a random function $\psi_n = \Psi_n(S_n)$ and a corresponding error $\varepsilon_n$.

When applying a classification rule to a sample, we are not as interested in the classifier it produces as in the error of the classifier. Since this error is a random variable, it is characterized by its distribution. The first exact expression for the expected error of LDA (the Anderson W statistic) in the Gaussian model with a common known covariance was derived by Sitgreaves in 1961 (Sitgreaves, 1961). In the same year, John obtained a simpler expression as an infinite sum (John, 1961). Two years later, Okamoto obtained an asymptotic expansion ($n \to \infty$) for the expected error in the same model (Okamoto, 1968). Other asymptotic expansions followed, including an asymptotic expansion for the variance by McLachlan in 1973 (McLachlan, 1973). In Section 1.5 we will consider “double asymptotic” expansions for the expected error, where $n \to \infty$ and $D \to \infty$ such that $D/n \to \lambda$ for some positive constant $\lambda$ (Raudys, 1967, 1972; Serdobolskii, 2000).

Since $\varepsilon_n$ depends on the feature-label distribution, which is unknown, $\varepsilon_n$ is unknown. It must be estimated via an error estimation rule $\Xi_n$. Thus, a sample $S_n$ yields a classifier $\psi_n = \Psi_n(S_n)$ and an error estimate $\hat{\varepsilon}_n = \Xi_n(S_n)$, which together constitute a classifier model $(\psi_n, \hat{\varepsilon}_n)$. Overall, classifier design involves a classification rule model $(\Psi_n, \Xi_n)$ used to determine a sample-dependent classifier model $(\psi_n, \hat{\varepsilon}_n)$. Both $(\psi_n, \varepsilon_n)$ and $(\psi_n, \hat{\varepsilon}_n)$ are random pairs relative to the sampling distribution. Given a feature-label distribution, the relation between the true and estimated errors is completely characterized by the joint distribution of the random vector $(\varepsilon_n, \hat{\varepsilon}_n)$. Error estimation accuracy is commonly summarized by the mean-square error (MSE) defined by $\mathrm{MSE}(\hat{\varepsilon}_n) = E[(\hat{\varepsilon}_n - \varepsilon_n)^2]$ or, equivalently, by the square root of the MSE, known as the root-mean-square (RMS) error. The expectation used here is relative to the sampling distribution induced by the feature-label distribution. The MSE is decomposed into the deviation variance $\mathrm{var}(\hat{\varepsilon}_n - \varepsilon_n)$ and the bias $\mathrm{bias}(\hat{\varepsilon}_n) = E[\hat{\varepsilon}_n - \varepsilon_n]$ of the error estimator relative to the true error:

$$\mathrm{MSE}(\hat{\varepsilon}_n) = \mathrm{var}(\hat{\varepsilon}_n - \varepsilon_n) + \mathrm{bias}(\hat{\varepsilon}_n)^2. \quad (1.36)$$

When the sample size is large, it can be split into independent training and test sets, the classifier being designed on the training data and its error being estimated by the proportion of errors on the test data, which is the holdout error estimator. For holdout, we have the distribution-free bound

$$\mathrm{RMS}(\hat{\varepsilon}_{\mathrm{holdout}}) \le 1/\sqrt{4m}, \quad (1.37)$$

where $m$ is the size of the test sample (Devroye et al., 1996). But when data are limited, the sample cannot be split without leaving too little data to design a good classifier. Hence, training and error estimation must take place on the same dataset. We next consider several training-data-based error estimators.

The resubstitution error estimator $\hat{\varepsilon}_{\mathrm{resub}}$ is the fraction of errors made by the designed classifier $\psi_n$ on the training sample:

$$\hat{\varepsilon}_{\mathrm{resub}} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \psi_n(x_i)|. \quad (1.38)$$

The resubstitution estimator is typically (but not always) low-biased, meaning that $E[\hat{\varepsilon}_{\mathrm{resub}}] < E[\varepsilon_n]$. This bias can be severe for small samples, depending on the complexity of the classification rule (Duda et al., 2001; Devroye et al., 1996).

Cross-validation is a resampling strategy in which surrogate classifiers are designed from parts of the sample, each is tested on the remaining data, and $\varepsilon_n$ is estimated by averaging the errors (Lunts and Brailovsky, 1967; Stone, 1974). In $k$-fold cross-validation, the sample $S_n$ is randomly partitioned into $k$ folds (subsets) $S_{(i)}$ for $i = 1, 2, \ldots, k$, each fold is left out of the design process, a surrogate classifier $\psi_{n,i}$ is designed on the set difference $S_n \setminus S_{(i)}$, and the cross-validation error estimate is

$$\hat{\varepsilon}_{\mathrm{cv}(k)} = \frac{1}{n} \sum_{i=1}^{k} \sum_{(x, y) \in S_{(i)}} |y - \psi_{n,i}(x)|. \quad (1.39)$$

A $k$-fold cross-validation estimator is an unbiased estimator of $\varepsilon_{n - n/k}$, the error of a classifier trained on a sample of size $n - n/k$, meaning that

$$E[\hat{\varepsilon}_{\mathrm{cv}(k)}] = E[\varepsilon_{n - n/k}]. \quad (1.40)$$

(1.40)

The special case of n-fold cross-validation yields the leave-one-out estimator εˆ loo , which is an unbiased estimator of εn  1. While not suffering from severe bias, cross-validation has large variance in small-sample settings, the result being high RMS (Braga-Neto and Dougherty, 2004b). In an effort to reduce the variance, k-fold cross-validation can be repeated using different folds, the final estimate being an average of the estimates. Bootstrap is a general resampling strategy that can be applied to error estimation (Efron, 1979, 1983). A bootstrap sample consists of n equally likely draws with replacement from the original sample S n . Some points may be drawn multiple times, whereas others may not be drawn at all. For the basic bootstrap estimator εˆ boot , a classifier is designed on the bootstrap sample and tested on the points left out; this is done repeatedly, and the bootstrap estimate is the average error made on the left-out points. εˆ boot tends to be a high-biased estimator of εn since the number of points available for design is on average only 0.632n. The 0.632 bootstrap estimator tries to correct this bias via a weighted average of εˆ boot and resubstitution (Efron, 1979): εˆ 0.632 boot ¼ 0.632ˆεboot þ 0.368ˆεresub .

(1.41)

The 0.632 bootstrap estimator is a special case of a convex estimator, the general form of which is aˆεlow þ bˆεhigh , where a > 0, b > 0, and a þ b ¼ 1 (Sima and Dougherty, 2006a). Given a feature-label distribution, a classification rule, and low- and high-biased estimators, an optimal convex estimator is found by finding the weights a and b that minimize the RMS. In resubstitution there is no distinction between points near and far from the decision boundary; the bolstered-resubstitution estimator is based on the heuristic that, relative to making an error, more confidence should be attributed to points far from the decision boundary than points near it (Braga-Neto and Dougherty, 2004a). This is achieved by placing a distribution, called a bolstering kernel, at each point (rather than simply counting the points as with resubstitution). Specifically, for i ¼ 1, 2, . . . , n, consider a probability density function (PDF) f ⋄i (a bolstering kernel) and define a bolstered empirical distribution f ⋄ by

14

Chapter 1

f ⋄ ðx, yÞ ¼

n 1X f ⋄ ðx  xi ÞIy¼yi , n i¼1 i

(1.42)

where the indicator function IE equals 1 if E is true and 0 otherwise. The bolstered resubstitution estimator is defined by εˆ bol ¼ Ef ⋄ ½jY  cn ðXÞj  Z Z n  (1.43) 1X ⋄ ⋄ f i ðx  xi Þdx þ Iyi ¼1 f i ðx  xi Þdx . ¼ Iyi ¼0 n i¼1 cn ðxÞ¼1 cn ðxÞ¼0 The subscript in the expectation denotes the joint distribution of ðX, Y Þ . A key issue is the amount of bolstering (spread of the bolstering kernels). A simple method has been proposed to compute this spread based on the data (Braga-Neto and Dougherty, 2004a); however, except for small dimensions, this simple procedure fails and an optimization method must be used to find the spread (Sima et al., 2011). Figure 1.2 illustrates the error for linear classification when the bolstering kernels are uniform circular distributions. When resubstitution is heavily low-biased, it may not be good to spread incorrectly classified data points because that increases the optimism of the error estimate (low bias). The semi-bolstered-resubstitution estimator results from not bolstering (no spread) for incorrectly classified points. Bolstering can be applied to any error-counting estimation procedure. Bolstered leaveone-out estimation involves bolstering the resubstitution estimates on the surrogate classifiers.

Figure 1.2 Bolstered resubstitution for LDA, assuming uniform circular bolstering kernels. The error contribution made by a point equals the area of the associated shaded region divided by the area of the associated circle. The bolstered resubstitution error is the sum of all error contributions divided by the number of points. [Reprinted from (Braga-Neto and Dougherty, 2004a).]
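
The counting-based estimators of Eqs. 1.38, 1.39, and 1.41 are straightforward to implement against any classification rule exposed as a fit function. The sketch below (not from the book) uses an LDA fit with the Anderson W statistic as an illustrative rule; the data, fold count, and number of bootstrap replicates are arbitrary choices.

```python
# Sketch (not from the book): resubstitution (Eq. 1.38), k-fold cross-validation
# (Eq. 1.39), and the 0.632 bootstrap (Eq. 1.41) for a generic fit(X, y) rule.
import numpy as np

def fit_lda(X, y):
    """Anderson W discriminant (c assumed 0.5); returns a 0/1 classifier."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    n0, n1 = np.sum(y == 0), np.sum(y == 1)
    S = ((n0 - 1) * np.cov(X[y == 0], rowvar=False)
         + (n1 - 1) * np.cov(X[y == 1], rowvar=False)) / (n0 + n1 - 2)
    a = np.linalg.solve(S, mu1 - mu0)
    b = -0.5 * (mu1 - mu0) @ np.linalg.solve(S, mu1 + mu0)
    return lambda Xq: (Xq @ a + b > 0).astype(int)

def resub(fit, X, y):                          # resubstitution, Eq. 1.38
    return np.mean(fit(X, y)(X) != y)

def cv_kfold(fit, X, y, k, rng):               # k-fold cross-validation, Eq. 1.39
    idx = rng.permutation(len(y))
    errors = 0
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        psi = fit(X[train], y[train])          # surrogate classifier
        errors += np.sum(psi(X[fold]) != y[fold])
    return errors / len(y)

def boot632(fit, X, y, B, rng):                # 0.632 bootstrap, Eq. 1.41
    n = len(y)
    zero_boot = []
    for _ in range(B):
        draw = rng.integers(0, n, size=n)      # bootstrap sample (with replacement)
        out = np.setdiff1d(np.arange(n), draw) # points left out of the draw
        if len(out) == 0:
            continue
        psi = fit(X[draw], y[draw])
        zero_boot.append(np.mean(psi(X[out]) != y[out]))
    return 0.632 * np.mean(zero_boot) + 0.368 * resub(fit, X, y)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=40)
X = rng.normal(size=(40, 3)) + y[:, None]       # class-1 mean shifted by 1
print(resub(fit_lda, X, y),
      cv_kfold(fit_lda, X, y, k=5, rng=rng),
      boot632(fit_lda, X, y, B=100, rng=rng))
```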


Finally, the plug-in error estimator $\hat{\varepsilon}_{\text{plug-in}}$ requires that we parameterize the class-conditional distributions $f(x|y)$, where here and henceforth we do not use the subscript “$X|Y$” when denoting class-conditional distributions. By assuming that the estimates of the distribution parameters are the same as the true parameters, the “true” error of the classifier can be computed exactly to obtain this estimate.

The consequences of training-set error estimation are readily explained by the following formula for the deviation variance:

$$\mathrm{var}(\hat{\varepsilon}_n - \varepsilon_n) = \sigma_{\varepsilon_n}^2 + \sigma_{\hat{\varepsilon}_n}^2 - 2\rho\,\sigma_{\hat{\varepsilon}_n}\sigma_{\varepsilon_n}, \quad (1.44)$$

where $\sigma_{\varepsilon_n}^2$, $\sigma_{\hat{\varepsilon}_n}^2$, and $\rho$ are the variance of the error, the variance of the error estimate, and the correlation between the true and estimated errors, respectively. The deviation variance is driven down by small variances or a correlation coefficient near 1. A large deviation variance results in large RMS. The problem with cross-validation is characterized in Eq. 1.44: for small samples it has large variance and little correlation with the true error. Hence, although with small folds cross-validation does not significantly suffer from bias, it typically has large deviation variance. Figure 1.3 shows a scatter plot and linear regression line for a cross-validation estimate (horizontal axis) and true error (vertical axis) with a sample size of 50 using LDA. What we observe is typical for small samples: large variance and negligible regression between the true and estimated errors. Indeed, one even sees negatively sloping regression lines for cross-validation and bootstrap, and negative correlation between the true and cross-validation estimated errors has been mathematically demonstrated in some basic models (Braga-Neto and Dougherty, 2005).

Figure 1.3 Scatter plot and linear regression for cross-validation (horizontal axis) and the true error (vertical axis) for linear discrimination between two classes of 50 breast cancer patients. [Reprinted from (Dougherty, 2012).]
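
A small Monte Carlo experiment (not from the book) makes the decompositions in Eqs. 1.36 and 1.44 concrete: for repeated samples from a known Gaussian model, it computes the exact true error of LDA via Eq. 1.32 together with its leave-one-out estimate, then reports the deviation variance, bias, RMS, and the correlation between true and estimated errors. The model and sample size are arbitrary illustrative choices.

```python
# Sketch (not from the book): Monte Carlo illustration of Eqs. 1.36 and 1.44
# for LDA (Anderson W) with the leave-one-out error estimator.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
D, n, reps = 2, 30, 500
m0, m1, S = np.zeros(D), 0.8 * np.ones(D), np.eye(D)

def fit_w(X, y):
    """Anderson W discriminant: (a, b) with class 1 decided when a.x + b > 0."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    n0, n1 = np.sum(y == 0), np.sum(y == 1)
    Sp = ((n0 - 1) * np.cov(X[y == 0], rowvar=False)
          + (n1 - 1) * np.cov(X[y == 1], rowvar=False)) / (n0 + n1 - 2)
    a = np.linalg.solve(Sp, mu1 - mu0)
    b = -0.5 * (mu1 - mu0) @ np.linalg.solve(Sp, mu1 + mu0)
    return a, b

def true_error(a, b):
    """Exact error under the known Gaussian model (Eqs. 1.3 and 1.32), c = 0.5."""
    e0 = norm.cdf((a @ m0 + b) / np.sqrt(a @ S @ a))
    e1 = norm.cdf(-(a @ m1 + b) / np.sqrt(a @ S @ a))
    return 0.5 * e0 + 0.5 * e1

eps, eps_hat = [], []
for _ in range(reps):
    y = rng.integers(0, 2, size=n)
    X = rng.multivariate_normal(np.zeros(D), S, size=n) + np.where(y[:, None] == 0, m0, m1)
    a, b = fit_w(X, y)
    eps.append(true_error(a, b))
    errs = 0                                   # leave-one-out estimate
    for i in range(n):
        mask = np.arange(n) != i
        ai, bi = fit_w(X[mask], y[mask])
        errs += int((X[i] @ ai + bi > 0) != y[i])
    eps_hat.append(errs / n)

eps, eps_hat = np.array(eps), np.array(eps_hat)
dev_var = np.var(eps_hat - eps)
bias = np.mean(eps_hat - eps)
rms = np.sqrt(dev_var + bias ** 2)             # Eq. 1.36
rho = np.corrcoef(eps, eps_hat)[0, 1]          # correlation in Eq. 1.44
print(dev_var, bias, rms, rho)
```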

1.4 Random Versus Separate Sampling

Thus far, we have assumed random sampling, under which the dataset $S_n$ is drawn independently from a fixed distribution of feature–label pairs $(X, Y)$. In particular, this means that if a sample of size n is drawn for a binary classification problem, then the numbers of sample points in classes 0 and 1, $n_0$ and $n_1$, respectively, are binomial random variables such that $n_0 + n_1 = n$. An immediate consequence of the random-sampling assumption is that the prior probability $c = \Pr(Y = 0)$ can be consistently estimated by the sampling ratio $\hat{c} = n_0/n$, namely, $n_0/n \to c$ in probability. While random sampling is almost invariably assumed (often tacitly) in classification theory, it is quite common in real-world situations for sampling not to be random. Specifically, with separate sampling, the class ratios $n_0/n$ and $n_1/n$ are chosen prior to sampling. Here, $S_n = S_{n_0} \cup S_{n_1}$, where the sample points in $S_{n_0}$ and $S_{n_1}$ are selected randomly from class 0 and class 1, respectively, but, given n, the individual class counts $n_0$ and $n_1$ are not random. In this case, $n_0/n$ is not a meaningful estimate of c. When c is not known, both QDA and LDA involve the estimate $\hat{c} = n_0/n$ by default. Hence, in the case of separate sampling when c is unknown, they are problematic. More generally, most classification rules make no explicit mention of c; however, their behavior depends on the sampling ratio, and their expected performances can be significantly degraded by separate sampling. In the special case when c is known and $n_0/n \approx c$ for separate sampling, the sampling is said to be stratified.

Example 1.1. Consider a model composed of multivariate Gaussian distributions with a block covariance structure (Esfahani and Dougherty, 2013). The model has several parameters that can generate various covariance matrices. For example, a 3-block covariance matrix with block size 5 has the structure
$$\Sigma_y = \begin{bmatrix} B_y & 0_{5\times 5} & 0_{5\times 5} \\ 0_{5\times 5} & B_y & 0_{5\times 5} \\ 0_{5\times 5} & 0_{5\times 5} & B_y \end{bmatrix}, \tag{1.45}$$
where $0_{k\times l}$ is a matrix of size $k \times l$ with all elements being 0, and
$$B_y = \sigma_y^2 \begin{bmatrix} 1 & \rho & \rho & \rho & \rho \\ \rho & 1 & \rho & \rho & \rho \\ \rho & \rho & 1 & \rho & \rho \\ \rho & \rho & \rho & 1 & \rho \\ \rho & \rho & \rho & \rho & 1 \end{bmatrix}, \tag{1.46}$$


in which $\sigma_y^2$ is the variance of each variable, and $\rho$ is the correlation coefficient within a block. Two covariance matrix settings are considered: identical covariance matrices with $\sigma_0^2 = \sigma_1^2 = 0.4$ and unequal covariance matrices with $\sigma_0^2 = 0.4$ and $\sigma_1^2 = 1.6$. We use block size $l = 5$ and $\rho = 0.8$, which models tight correlation within a block. The means are $\mu_0 = 0.1 \cdot \mathbf{1}_D$ and $\mu_1 = 0.8 \cdot \mathbf{1}_D$, and the feature-set size is 15. For a given sample size n, sampling ratio $r = n_0/n$, and classification rule, the conditional expected true error rate $E[\varepsilon_n \mid r]$ is approximated via Monte Carlo simulation. The conditional expected true errors for common covariance matrices (the homoscedastic case) under LDA, QDA, and linear support vector machines (L-SVMs), all assuming that $\hat{c} = n_0/n$, are given in the top row of Fig. 1.4, where each plot gives $E[\varepsilon_n \mid r]$ for different sampling ratios and $c \in \{0.3, 0.4, 0.5, 0.6, 0.7\}$. Results for different covariance matrices (the heteroscedastic case) are shown in the bottom row of Fig. 1.4. We observe that the expected error is close to minimal when $r = c$ and that it can greatly increase when $r \neq c$. The problem is that r is often chosen deterministically without regard to c.

Figure 1.4 Expected true error rates for three classification rules ($n = 100$) when covariance matrices are homoscedastic (top row) and heteroscedastic (bottom row): (a) homoscedastic, LDA; (b) homoscedastic, QDA; (c) homoscedastic, linear support vector machine (L-SVM); (d) heteroscedastic, LDA; (e) heteroscedastic, QDA; (f) heteroscedastic, L-SVM. [Reprinted from (Esfahani and Dougherty, 2013).]

Can separate sampling be superior to random sampling? To answer this question, consider random sampling and suppose that $r_0 \approx c$. Then, typically,
$$E[\varepsilon_n \mid r_0] < E_r\big[E[\varepsilon_n \mid r]\big] = E[\varepsilon_n], \tag{1.47}$$
where $E[\varepsilon_n \mid r_0]$ is the expected error for separate sampling with $r = r_0$, and the inequality typically holds because $E[\varepsilon_n \mid r_0] < E[\varepsilon_n \mid r]$ for most of the probability mass of r (which is random under random sampling) when $r_0$ is close to c. Here, $E_r$ denotes an expectation over r, where r is drawn from the same distribution as the sampling ratio under random sampling. Consequently, separate sampling is beneficial if c is known and the sampling ratio is selected to reflect c. The problem is that in practice r is often chosen without regard to c, the result being the kind of poor performance shown in Fig. 1.4. An interesting characteristic of the error curves in Fig. 1.4 is that they appear to cross at a single value of r. In fact, this phenomenon is fairly common; however, we must be careful in examining it because the figure shows continuous curves, whereas r is actually a discrete variable. The issue is examined in detail in (Esfahani and Dougherty, 2013).
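The dependence of the expected error on the sampling ratio is easy to reproduce numerically. The following sketch is not from the text: it simulates separate sampling under the Example 1.1 block-covariance Gaussian model and a standard plug-in LDA (sample means, pooled covariance, prior estimate $\hat{c} = n_0/n$); all Monte Carlo settings are arbitrary choices, and the printed numbers are rough approximations rather than the values plotted in Fig. 1.4.

```python
import numpy as np

rng = np.random.default_rng(0)

def block_cov(sigma2, rho=0.8, block=5, n_blocks=3):
    # Covariance of Eqs. 1.45-1.46: block-diagonal with equicorrelated blocks.
    B = sigma2 * ((1 - rho) * np.eye(block) + rho * np.ones((block, block)))
    return np.kron(np.eye(n_blocks), B)

def lda_error(r, n=100, c=0.5, n_test=20000):
    D = 15
    mu0, mu1 = 0.1 * np.ones(D), 0.8 * np.ones(D)
    S0, S1 = block_cov(0.4), block_cov(0.4)            # homoscedastic setting
    n0 = int(round(r * n)); n1 = n - n0                # separate sampling: counts fixed by r
    X0 = rng.multivariate_normal(mu0, S0, n0)
    X1 = rng.multivariate_normal(mu1, S1, n1)
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sp = ((n0 - 1) * np.cov(X0.T) + (n1 - 1) * np.cov(X1.T)) / (n0 + n1 - 2)
    Sp_inv = np.linalg.inv(Sp)
    c_hat = n0 / n                                     # plug-in class probability
    def g(X, m, prior):                                # linear discriminant score
        return X @ Sp_inv @ m - 0.5 * m @ Sp_inv @ m + np.log(prior)
    # Approximate the true error on a large test set drawn with the true c.
    y_test = rng.random(n_test) >= c                   # label 1 with probability 1 - c
    X_test = np.where(y_test[:, None],
                      rng.multivariate_normal(mu1, S1, n_test),
                      rng.multivariate_normal(mu0, S0, n_test))
    pred = (g(X_test, m1, 1 - c_hat) > g(X_test, m0, c_hat)).astype(int)
    return np.mean(pred != y_test)

for r in [0.3, 0.5, 0.7]:
    errs = [lda_error(r) for _ in range(50)]
    print(f"r = {r:.1f}:  E[eps_n | r] ~= {np.mean(errs):.3f}")
```

With the true prior $c = 0.5$, the approximated expected error is smallest near $r = 0.5$ and grows as the chosen ratio moves away from c, mirroring the behavior in Fig. 1.4.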

1.5 Epistemology and Validity

From a scientific perspective, error estimation is the critical epistemological issue [see (Dougherty and Bittner, 2011) for a comprehensive discussion focused on biology and including classifier validity]. A scientific theory consists of two parts: (1) a mathematical model composed of symbols (variables and relations between the variables), and (2) a set of operational definitions that relate the symbols to data. A mathematical model alone does not constitute a scientific theory. The formal mathematical structure must yield experimental predictions in accord with experimental observations. As stated succinctly by Richard Feynman, "It is whether or not the theory gives predictions that agree with experiment. It is not a question of whether a theory is philosophically delightful, or easy to understand, or perfectly reasonable from the point of view of common sense" (Feynman, 1985). Model validity is characterized by predictive relations, without which the model lacks empirical content. Validation requires that the symbols be tied to observations by semantic rules that relate not necessarily to the general principles of the mathematical model themselves but to conclusions drawn from those principles. There must be a clearly defined tie between the mathematical model and experimental methodology. Percy Bridgman was the first to state precisely that these relations of coordination consist in the description of physical operations; he therefore called them operational definitions.


As we have written elsewhere, "Operational definitions are required, but their exact formulation in a given circumstance is left open. Their specification constitutes an epistemological issue that must be addressed in mathematical (including logical) statements. Absent such a specification, a purported scientific theory is meaningless" (Dougherty, 2008). With regard to a classifier model, predictive capacity is quantified by the error. Thus, when using an estimate in place of the true error, the accuracy of the estimate is epistemologically paramount. The validity of a scientific theory depends on the choice of validity criteria and the mathematical properties of those criteria. The observational measurements and the manner in which they are to be compared to the mathematical model must be formally specified. This necessarily takes the form of mathematical statements. As previously noted, the RMS (and equivalently the MSE) is the most commonly applied criterion for error estimation accuracy, and therefore for model validity. The validity of a theory is relative to the choice of validity criterion, but what is not at issue is the necessity of a set of relations tying the model to measurements.

Suppose that a sample is collected, a classification rule $\Psi_n$ is applied, and the classifier error is estimated by an error-estimation rule $\Xi_n$ to arrive at the classifier model $(\psi_n, \hat{\varepsilon}_n)$. If no assumptions are posited regarding the feature-label distribution (as is commonly the case), then the entire procedure is completely distribution-free. There are three possibilities. First, if no validity criterion is specified, then the classifier model is ipso facto epistemologically meaningless. Second, if a validity criterion is specified, say RMS, and no distribution-free results are known about the RMS for $\Psi_n$ and $\Xi_n$, then, again, the model is meaningless. Third, if there exist distribution-free RMS bounds concerning $\Psi_n$ and $\Xi_n$, then these bounds can, in principle, be used to quantify the performance of the error estimator and thereby quantify model validity.

To illustrate the third possibility, we consider multinomial discrimination, where the feature components are random variables with the discrete range $\{1, 2, \ldots, b\}$, corresponding to choosing a fixed partition in $\mathbb{R}^D$ with b cells, and the histogram rule assigns to each cell the majority label in the cell. The following is a distribution-free RMS bound for the leave-one-out error estimator with the discrete histogram rule and tie-breaking in the direction of class 0 (Devroye et al., 1996):
$$\mathrm{RMS}(\hat{\varepsilon}_{\mathrm{loo}}) \leq \sqrt{\frac{1 + 6e^{-1}}{n} + \frac{6}{\sqrt{\pi(n-1)}}}. \tag{1.48}$$
Although this bound holds for all distributions, it is useless for small samples: for $n = 200$ the bound is 0.506. In general, there are very few cases in which distribution-free bounds are known and, when they are known, they are useless for small samples.
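The severity of the bound is easy to check numerically. The following short sketch (not part of the text) simply evaluates the right-hand side of Eq. 1.48 for several sample sizes.

```python
import math

def loo_rms_bound(n):
    # Distribution-free RMS bound of Eq. 1.48 for leave-one-out with the
    # discrete histogram rule (Devroye et al., 1996).
    return math.sqrt((1 + 6 * math.exp(-1)) / n + 6 / math.sqrt(math.pi * (n - 1)))

for n in [200, 1000, 10000, 100000]:
    print(n, round(loo_rms_bound(n), 3))
# n = 200 gives roughly 0.506, matching the value quoted in the text; even
# n = 100000 still gives a bound above 0.10.
```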


1.5.1 RMS bounds

Distribution-based bounds are needed. These require knowledge of the RMS, which means knowledge concerning the second-order moments of the joint distribution between the true and estimated errors. More generally, for full characterization of the relationship between the true and estimated errors, we need to know their joint distribution. Oddly, this problem has historically been ignored, notwithstanding the fact that error estimation is the epistemological ground for classification. Going back to the 1970s, there were some results on the mean and variance of some error estimators for the Gaussian model using LDA. In 1966, Hills obtained the expected value of the resubstitution and plug-in estimators in the univariate model with known common variance (Hills, 1966). In 1972, Foley obtained the expected value of resubstitution in the multivariate model with known common covariance matrix (Foley, 1972). In 1973, Sorum derived results for the expected value and variance for both resubstitution and leave-one-out in the univariate model with known common variance (Sorum, 1971). In 1973, McLachlan derived an asymptotic representation for the expected value of resubstitution in the multivariate model with an unknown common covariance matrix (McLachlan, 1973). In 1975, Moran obtained new results for the expected value of resubstitution and plug-in for the multivariate model with a known covariance matrix (Moran, 1975). In 1977, Goldstein and Wolf obtained the expected value of resubstitution for multinomial discrimination (Goldstein and Wolf, 1977). Following the latter, there was a gap of 15 years before Davison and Hall derived asymptotic representations for the expected value and variance of bootstrap and leave-one-out in the univariate Gaussian model with unknown and possibly different covariances (Davison and Hall, 1992). None of these papers provided a representation of the joint distribution or of the second-order mixed moments, which are needed for the RMS.

This problem has only recently been addressed, beginning in 2005, in particular for the resubstitution and leave-one-out estimators. For the multinomial model, complete enumeration was used to obtain the marginal distributions of the error estimators (Braga-Neto and Dougherty, 2005), and then the joint distributions (Xu et al., 2006). Subsequently, exact closed-form representations for second-order moments, including the mixed moments, were obtained, thereby yielding exact RMS representations for both estimators (Braga-Neto and Dougherty, 2010). For the Gaussian model using LDA, the exact marginal distributions for both estimators in the univariate model with known but not necessarily equal class variances (the heteroscedastic model) and approximations in the multivariate model with known and equal class covariance matrices (the homoscedastic model) were obtained (Zollanvari et al., 2009). Subsequently, these were extended to the joint distributions for the true and estimated errors in a Gaussian model (Zollanvari et al., 2010).


More recently, exact closed-form representations for the second-order moments in the heteroscedastic univariate model were discovered, thereby providing exact expressions of the RMS for both estimators (Zollanvari et al., 2012). Moreover, double-asymptotic representations for the second-order moments in the homoscedastic multivariate model were found, thereby providing double-asymptotic expressions for the RMS (Zollanvari et al., 2011). For the most part, both the early papers and the later papers concern separate sampling.

An obvious way to proceed would be to say that a classifier model $(\psi_n, \hat{\varepsilon}_n)$ is valid for the feature-label distribution F to the extent that $\hat{\varepsilon}_n$ approximates the classifier error $\varepsilon_n$ on F, where the degree of approximation is measured by some distance between $\varepsilon_n$ and $\hat{\varepsilon}_n$, say $|\hat{\varepsilon}_n - \varepsilon_n|$. To do this we would have to know the true error and F. But if we knew F, then we would use the Bayes classifier and would not need to design a classifier from sample data. Since it is the precision of the error estimate that is of consequence, a natural way to proceed would be to characterize validity in terms of the precision of the error estimator $\hat{\varepsilon}_n = \Xi_n(S_n)$ as an estimator of $\varepsilon_n$, say by $\mathrm{RMS}(\hat{\varepsilon}_n)$. This makes sense because both the true and estimated errors are random functions of the sample, and the RMS measures their closeness across the sampling distribution. But again there is a catch: the RMS depends on F, which we do not know. Thus, given the sample without knowledge of F, we cannot compute the RMS. To proceed, prior knowledge is required, in the sense that we need to assume that the actual (unknown) feature-label distribution belongs to some uncertainty class $\Theta$ of feature-label distributions. Once RMS representations have been obtained for the feature-label distributions in $\Theta$, distribution-based RMS bounds follow:
$$\mathrm{RMS}(\hat{\varepsilon}_n) \leq \sup_{\theta \in \Theta} \mathrm{RMS}(\hat{\varepsilon}_n \mid \theta), \tag{1.49}$$
where $\mathrm{RMS}(\hat{\varepsilon}_n \mid \theta)$ is the RMS of the error estimator under the assumption that the feature-label distribution is identified with $\theta$. We do not know the actual feature-label distribution precisely, but prior knowledge allows us to bound the RMS.

Except in rare cases, classification requires prior knowledge in order to have scientific content. Regarding the feature-label distribution there are two extremes: (1) the feature-label distribution is known, in which case the entire classification problem collapses to finding a Bayes classifier and the Bayes error, so there is no classifier design or error estimation issue; and (2) the uncertainty class consists of all feature-label distributions, the distribution-free case, in which we typically have no bound, or one that is too loose for practice. In the middle ground, there is a trade-off between the size of the uncertainty class and the size of the sample. The uncertainty class must be sufficiently constrained


(equivalently, the prior knowledge must be sufficiently great) that an acceptable bound can be achieved with an acceptable sample size.

1.5.2 Error RMS in the Gaussian model

To illustrate the error RMS problem, we first consider the RMS for LDA in the one-dimensional heteroscedastic model (means $\mu_0$ and $\mu_1$, and variances $\sigma_0^2$ and $\sigma_1^2$) with separate sampling using fixed class sample sizes $n_0$ and $n_1$, and resubstitution error estimation. Here, LDA is given by $\psi_{\mathrm{LDA}}(x) = 0$ if and only if $W(x) < 0$, where $W(x)$ is the Anderson W statistic in Eq. 1.29; relative to our default definition, this variant of LDA assumes that $\hat{c} = 0.5$ and assigns the decision boundary to class 1 instead of class 0. The true error and resubstitution estimate for class y are denoted by $\varepsilon_n^y$ and $\hat{\varepsilon}_{\mathrm{resub}}^y$, respectively. From here forward we will typically denote the class associated with an error or error estimate using superscripts to avoid cluttered notation. The MSE is
$$\mathrm{MSE}(\hat{\varepsilon}_n) = E\left[\left(c\varepsilon_n^0 + (1-c)\varepsilon_n^1 - \left(\frac{n_0}{n}\hat{\varepsilon}_{\mathrm{resub}}^0 + \frac{n_1}{n}\hat{\varepsilon}_{\mathrm{resub}}^1\right)\right)^2\right]. \tag{1.50}$$
To evaluate the MSE we need expressions for the various second-order moments involved. These are derived in (Zollanvari et al., 2012) and are provided in (Braga-Neto and Dougherty, 2015). To illustrate the complexity of the problem, even in this very simple setting, we state the forms of the required second-order moments for the RMS. In all cases, expressions of the form $Z < 0$ and $Z \geq 0$ mean that all components of Z are negative and nonnegative, respectively. The second-order moments for $\varepsilon_n^y$ are
$$E[(\varepsilon_n^0)^2] = \Pr(Z_{\mathrm{I}} < 0) + \Pr(Z_{\mathrm{I}} \geq 0), \tag{1.51}$$
$$E[\varepsilon_n^0 \varepsilon_n^1] = \Pr(Z_{\mathrm{II}} < 0) + \Pr(Z_{\mathrm{II}} \geq 0), \tag{1.52}$$
$$E[(\varepsilon_n^1)^2] = \Pr(Z_{\mathrm{III}} < 0) + \Pr(Z_{\mathrm{III}} \geq 0), \tag{1.53}$$
where $Z_j$, for $j = \mathrm{I}, \mathrm{II}, \mathrm{III}$, are 3-variate Gaussian random vectors with means
$$\mu_{Z_{\mathrm{I}}} = \mu_{Z_{\mathrm{II}}} = \mu_{Z_{\mathrm{III}}} = \begin{bmatrix} \dfrac{\mu_0 - \mu_1}{2} \\ \mu_1 - \mu_0 \\ \dfrac{\mu_0 - \mu_1}{2} \end{bmatrix}, \tag{1.54}$$
and covariance matrices
$$\Sigma_{Z_{\mathrm{I}}} = \begin{bmatrix} s + \sigma_0^2 & 2\delta & s \\ 2\delta & 4s & 2\delta \\ s & 2\delta & s + \sigma_0^2 \end{bmatrix}, \tag{1.55}$$


$$\Sigma_{Z_{\mathrm{II}}} = \begin{bmatrix} s + \sigma_0^2 & 2\delta & s \\ 2\delta & 4s & 2\delta \\ s & 2\delta & s + \sigma_1^2 \end{bmatrix}, \tag{1.56}$$
$$\Sigma_{Z_{\mathrm{III}}} = \begin{bmatrix} s + \sigma_1^2 & 2\delta & s \\ 2\delta & 4s & 2\delta \\ s & 2\delta & s + \sigma_1^2 \end{bmatrix}, \tag{1.57}$$
where $s = \frac{\sigma_0^2}{4n_0} + \frac{\sigma_1^2}{4n_1}$ and $\delta = \frac{\sigma_0^2}{4n_0} - \frac{\sigma_1^2}{4n_1}$ (Zollanvari et al., 2012).
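As a concrete illustration of how such moments are evaluated numerically, the following sketch (not from the text) computes $E[(\varepsilon_n^0)^2]$ from Eqs. 1.51, 1.54, and 1.55 using SciPy's multivariate normal CDF; the parameter values passed at the bottom are arbitrary.

```python
import numpy as np
from scipy.stats import multivariate_normal

def second_moment_eps0(m0, m1, s0sq, s1sq, n0, n1):
    """E[(eps_n^0)^2] = Pr(Z_I < 0) + Pr(Z_I >= 0), per Eqs. 1.51, 1.54, 1.55."""
    s = s0sq / (4 * n0) + s1sq / (4 * n1)
    d = s0sq / (4 * n0) - s1sq / (4 * n1)
    mean = np.array([(m0 - m1) / 2.0, m1 - m0, (m0 - m1) / 2.0])
    cov = np.array([[s + s0sq, 2 * d,      s],
                    [2 * d,    4 * s,  2 * d],
                    [s,        2 * d, s + s0sq]])
    p_all_neg = multivariate_normal(mean=mean, cov=cov).cdf(np.zeros(3))
    # Pr(Z >= 0) = Pr(-Z <= 0), and -Z is Gaussian with negated mean.
    p_all_nonneg = multivariate_normal(mean=-mean, cov=cov).cdf(np.zeros(3))
    return p_all_neg + p_all_nonneg

print(second_moment_eps0(m0=0.0, m1=1.0, s0sq=1.0, s1sq=1.0, n0=10, n1=10))
```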



The second-order moments for $\hat{\varepsilon}_{\mathrm{resub}}^y$ are
$$E[(\hat{\varepsilon}_{\mathrm{resub}}^0)^2] = \frac{1}{n_0}\big[\Pr(Z_{\mathrm{I}} < 0) + \Pr(Z_{\mathrm{I}} \geq 0)\big] + \frac{n_0 - 1}{n_0}\big[\Pr(Z_{\mathrm{III}} < 0) + \Pr(Z_{\mathrm{III}} \geq 0)\big], \tag{1.58}$$
$$E[\hat{\varepsilon}_{\mathrm{resub}}^0 \hat{\varepsilon}_{\mathrm{resub}}^1] = \Pr(Z_{\mathrm{V}} < 0) + \Pr(Z_{\mathrm{V}} \geq 0), \tag{1.59}$$
$$E[(\hat{\varepsilon}_{\mathrm{resub}}^1)^2] = \frac{1}{n_1}\big[\Pr(Z_{\mathrm{II}} < 0) + \Pr(Z_{\mathrm{II}} \geq 0)\big] + \frac{n_1 - 1}{n_1}\big[\Pr(Z_{\mathrm{IV}} < 0) + \Pr(Z_{\mathrm{IV}} \geq 0)\big], \tag{1.60}$$
where $Z_j$ for $j = \mathrm{I}, \mathrm{II}$ are bivariate Gaussian random vectors, and $Z_j$ for $j = \mathrm{III}, \mathrm{IV}, \mathrm{V}$ are 3-variate Gaussian random vectors, whose means and covariance matrices are provided in (Zollanvari et al., 2012). Lastly, the mixed moments of $\varepsilon_n^y$ and $\hat{\varepsilon}_{\mathrm{resub}}^y$ are
$$E[\varepsilon_n^0 \hat{\varepsilon}_{\mathrm{resub}}^0] = \Pr(Z_{\mathrm{I}} < 0) + \Pr(Z_{\mathrm{I}} \geq 0), \tag{1.61}$$
$$E[\varepsilon_n^0 \hat{\varepsilon}_{\mathrm{resub}}^1] = \Pr(Z_{\mathrm{II}} < 0) + \Pr(Z_{\mathrm{II}} \geq 0), \tag{1.62}$$
$$E[\varepsilon_n^1 \hat{\varepsilon}_{\mathrm{resub}}^0] = \Pr(Z_{\mathrm{III}} < 0) + \Pr(Z_{\mathrm{III}} \geq 0), \tag{1.63}$$
$$E[\varepsilon_n^1 \hat{\varepsilon}_{\mathrm{resub}}^1] = \Pr(Z_{\mathrm{IV}} < 0) + \Pr(Z_{\mathrm{IV}} \geq 0), \tag{1.64}$$
where $Z_j$, for $j = \mathrm{I}, \ldots, \mathrm{IV}$, are 3-variate Gaussian random vectors whose means and covariance matrices are provided in (Zollanvari et al., 2012).

Even for LDA in the multi-dimensional homoscedastic Gaussian model the problem is more difficult, since exact expressions for the moments are unknown. However, in that model there exist double-asymptotic approximations for all of the second-order moments (Zollanvari et al., 2011). Double-asymptotic convergence means that $n \to \infty$ as the dimension $D \to \infty$ in a limiting proportional manner.

Specifically, $n_0 \to \infty$, $n_1 \to \infty$, $D \to \infty$, $D/n_0 \to \lambda_0$, and $D/n_1 \to \lambda_1$, where $\lambda_0$ and $\lambda_1$ are positive finite constants. It is also assumed that the Mahalanobis distance,
$$\delta_D = \sqrt{(\mu_{D,0} - \mu_{D,1})^T \Sigma_D^{-1} (\mu_{D,0} - \mu_{D,1})}, \tag{1.65}$$
converges to $\delta > 0$ as $D \to \infty$, where $\mu_{D,y}$ is the mean of class y under D dimensions, and $\Sigma_D$ is the covariance of both classes under D dimensions. An example of the kind of result one obtains is the approximation
$$E[(\varepsilon_n^0)^2] \approx \Phi\!\left( -\frac{\delta_D}{2} \cdot \frac{1 + \frac{D}{\delta_D^2}\left(\frac{1}{n_1} - \frac{1}{n_0}\right)}{f_0(n_0, n_1, D, \delta_D^2)} \,;\; \frac{1 + \frac{D}{\delta_D^2 n_1} + \frac{D}{2\delta_D^2}\left(\frac{1}{n_0^2} + \frac{1}{n_1^2}\right)}{f_0^2(n_0, n_1, D, \delta_D^2)} \right), \tag{1.66}$$
where $\Phi(a, b; \rho)$ is the bivariate CDF of standard normal random variables with correlation coefficient $\rho$, $\Phi(a; \rho) = \Phi(a, a; \rho)$, and
$$f_0(n_0, n_1, D, \delta_D^2) = \sqrt{1 + \frac{1}{n_1} + \frac{D}{\delta_D^2}\left(\frac{1}{n_0} + \frac{1}{n_1}\right) + \frac{D}{2\delta_D^2}\left(\frac{1}{n_0^2} + \frac{1}{n_1^2}\right)}. \tag{1.67}$$
It is doubly asymptotic in the sense that both the left- and right-hand sides of the approximation converge to the same value as $n \to \infty$ and $D \to \infty$. Since the limit of the right-hand side is determined by the double-asymptotic assumptions, the theoretical problem is to prove that $E[(\varepsilon_n^0)^2]$ converges to that limit. Corresponding doubly asymptotically exact approximations exist for the other second-order moments required for the RMS (Zollanvari et al., 2011).

These results show a practical limiting usefulness of Eq. 1.49. If the RMS is so complicated in a model as simple as the one-dimensional Gaussian model for LDA and resubstitution, and it is only known approximately in the multidimensional model even for the homoscedastic case, bounding the RMS does not hold much promise for validation, even if further results of this kind can be obtained. Thus, we must look in a different direction for error estimation.

Chapter 2

Optimal Bayesian Error Estimation

Given that a distributional model is needed to achieve useful performance bounds for classifier error estimation when using the training data, a natural course of action is to define a prior distribution over the uncertainty class of feature-label distributions and then find an optimal minimum-mean-square-error (MMSE) error estimator relative to the uncertainty class (Dalton and Dougherty, 2011b).

2.1 The Bayesian MMSE Error Estimator

Consider finding an MMSE estimator (filter) of a nonnegative function $g(X, Y)$ of two random variables based on observing Y; that is, minimize $E_{X,Y}[|g(X, Y) - h(Y)|^2]$ over all Borel measurable functions h. The optimal estimator,
$$\hat{g} = \arg\min_h E_{X,Y}\big[|g(X, Y) - h(Y)|^2\big], \tag{2.1}$$
is given by the conditional expectation
$$\hat{g}(Y) = E_X[g(X, Y) \mid Y]. \tag{2.2}$$
Moreover, $\hat{g}(Y)$ is an unbiased estimator over the distribution $f(x, y)$ of $(X, Y)$:
$$E_Y[\hat{g}(Y)] = E_{X,Y}[g(X, Y)]. \tag{2.3}$$
The fact that $\hat{g}(Y)$ is an unbiased MMSE estimator of $g(X, Y)$ over $f(x, y)$ does not tell us how well $\hat{g}(Y)$ estimates $g(\bar{x}, Y)$ for some specific value $X = \bar{x}$. This has to do with the expected difference
$$\begin{aligned} E_Y\big[|g(\bar{x}, Y) - \hat{g}(Y)|^2\big] &= E_Y\left[\left|g(\bar{x}, Y) - \int_{-\infty}^{\infty} g(x, Y)\,f(x \mid Y)\,dx\right|^2\right] \\ &= E_Y\left[\left|\int_{-\infty}^{\infty} g(x, Y)\big[\delta(x - \bar{x}) - f(x \mid Y)\big]\,dx\right|^2\right] \\ &\leq E_Y\left[\left(\int_{-\infty}^{\infty} g(x, Y)\,\big|\delta(x - \bar{x}) - f(x \mid Y)\big|\,dx\right)^2\right], \end{aligned} \tag{2.4}$$
where $\delta(\cdot)$ is the Dirac delta function. This inequality reveals that the accuracy of the estimate at a point depends on the degree to which the mass of the conditional distribution for X given Y is concentrated at $\bar{x}$ on average for Y. If we replace the single random variable Y by a sequence $\{Y_n\}_{n=1}^{\infty}$ of random variables such that $f(x \mid Y_n) \to \delta(x - \bar{x})$ in a suitable sense (relative to the convergence of generalized functions), then we are assured that $\hat{g}(Y_n) - g(\bar{x}, Y_n) \to 0$ in the mean-square sense.

We desire $\hat{g}(Y)$ to estimate $g(\bar{x}, Y)$ but are uncertain of $\bar{x}$. Instead, we can obtain an unbiased MMSE estimator for $g(X, Y)$, which means good performance across all possible values of X relative to the distribution of X and Y; however, the performance of that estimator for a particular value $X = \bar{x}$ depends on the concentration of the conditional mass of X relative to $\bar{x}$.

When applying MMSE estimation theory to error estimation, the uncertainty manifests itself in a Bayesian framework relative to a space of feature-label distributions and samples. The random variable X is replaced by a random vector $\theta$ governed by a specified prior distribution $\pi(\theta)$, where each $\theta$ corresponds to a feature-label distribution parameterized by $\theta$, and denoted by $f_\theta(x, y)$. The parameter space is denoted by $\Theta$. The random variable Y is replaced by a sample $S_n$, which is used to train a classifier $\psi_n$, and we set
$$g(X, Y) = \varepsilon_n(\theta, \psi_n), \tag{2.5}$$

which is the true error on $f_\theta$ of $\psi_n$. In this scenario, $\hat{g}(Y)$ becomes the error estimator
$$\hat{\varepsilon}_n(S_n, \psi_n) = E_\pi[\varepsilon_n(\theta, \psi_n) \mid S_n], \tag{2.6}$$
which we call the Bayesian MMSE error estimator. The conditional distribution $f(x \mid Y)$ becomes the posterior distribution $\pi(\theta \mid S_n) = f(\theta \mid S_n)$, which for simplicity we write as $\pi^*(\theta)$, tacitly keeping in mind conditioning on the sample. We will sometimes write the true error as $\varepsilon_n(\theta)$ or $\varepsilon_n$, and the Bayesian MMSE error estimator as
$$\hat{\varepsilon}_n = E_{\pi^*}[\varepsilon_n], \tag{2.7}$$

which is shorthand for $\hat{\varepsilon}_n(S_n, \psi_n)$ as expressed in Eq. 2.6. Throughout this book, we will use several notations to clarify the probability space for expectations and probabilities. In Eq. 2.7 we use a subscript with the name of the distribution of the random vector, in this case the posterior distribution $\pi^*$ on $\theta$. Occasionally, the prior distribution will be used instead of the posterior. We may also apply conditioning to this notation for emphasis; for example, we sometimes use notation similar to Eq. 2.6 rather than Eq. 2.7 to emphasize conditioning on the sample $S_n$. When the distribution is understood, we will sometimes drop subscripts, or list random variables in the subscript. For example, $E_{X,Y}$ in Eq. 2.1 denotes an expectation over the joint distribution between X and Y. From here on, we will denote an arbitrary classifier and its error by $\psi$ and $\varepsilon$, respectively. To emphasize that a classifier is trained from a sample $S_n$, we denote a trained classifier and its error by $\psi_n$ and $\varepsilon_n$, respectively. We will always refer to the Bayesian MMSE error estimator as $\hat{\varepsilon}_n$, which always depends on the sample $S_n$. Other error estimators will be denoted with a subscript naming the corresponding error estimation rule. For clarity, we will sometimes write the true error as a function of $\theta$ and $S_n$, or as a function of $\theta$ and the classifier. Similarly, we will sometimes write an error estimator as a function of $S_n$ only, or as a function of $S_n$ and the classifier.

In the context of classification, $\theta$ is a random vector composed of three parts: the parameters $\theta_0$ of the class-0 conditional distribution, the parameters $\theta_1$ of the class-1 conditional distribution, and the class probability c for class 0 (with $1 - c$ for class 1). We define $\Theta_y$ to be the parameter space for $\theta_y$ and write the class-conditional distributions as $f_{\theta_y}(x|y)$. The marginal prior densities of the class-conditional distributions are denoted by $\pi(\theta_y)$ for $y \in \{0, 1\}$. The marginal density for the a priori class probability is denoted by $\pi(c)$. Following common Bayesian terminology, we also refer to these prior distributions as "prior probabilities." It is up to the investigator to consider the nature of the problem at hand and to choose appropriate models for $\pi(\theta_y)$ and $\pi(c)$ (Jaynes, 1968). One might utilize theoretical knowledge to construct a prior or, alternatively, use an objective prior. Objective, or non-informative, priors attempt to characterize only basic information about the variable of interest, for instance, the support of the distribution, and are useful if one wishes to avoid using subjective data. An example would be the flat prior, which assumes that the prior distribution is constant over its support (or, more generally, that the prior is proportional to 1 over its support). In Chapter 7 we consider the construction of priors based on existing scientific knowledge. A conjugate prior is one for which the prior and posterior are in the same family of distributions. This permits a closed form for the posterior and avoids difficult integration. Even in many classical problems, there is no universal agreement on the "right" prior to use.

Based on the introductory filter analysis, we would like $\pi^*(\theta)$ to be close to $\delta(\theta - \bar{\theta})$, where $\bar{\theta}$ corresponds to the actual feature-label distribution from which the data have come, but we do not know $\bar{\theta}$, and an overzealous effort to concentrate the conditional mass at a particular value of $\theta$ can have detrimental effects if that value is far from $\bar{\theta}$. The Bayesian MMSE error estimate is not guaranteed to be the optimal error estimate for any particular feature-label distribution (the true error being the best estimate and perfect), but for a given sample, and assuming the parameterized model and prior distribution, it is both optimal on average with respect to MSE (and therefore RMS) and unbiased when averaged over all parameters and samples. These implications apply for any classification rule that is not a function of the true parameters and that produces a classifier that is fixed given the sample.

To facilitate analytic representations, we assume that c is independent of both $\theta_0$ and $\theta_1$ prior to observing the data. This assumption allows us to separate the Bayesian MMSE error estimator into components representing the error contributed by each class. We will maintain the assumption that c and $(\theta_0, \theta_1)$ are independent throughout this chapter and Chapter 4. At times, we will make the further assumption that c, $\theta_0$, and $\theta_1$ are mutually independent; in particular, this assumption is made in Chapter 3.

Given $\pi(\theta_y)$ and $\pi(c)$, data are used to find the joint posterior density. By the product rule,
$$\pi^*(c, \theta_y) = f(c, \theta_y \mid S_n) = f(c \mid S_n, \theta_y)\,f(\theta_y \mid S_n). \tag{2.8}$$

Given $n_y$, c is independent of the sample values and the distribution parameters for class y. Hence,
$$f(c \mid S_n, \theta_y) = f(c \mid n_y, \{x_i^0\}_1^{n_0}, \{x_i^1\}_1^{n_1}, \theta_y) = f(c \mid n_y), \tag{2.9}$$
where $\{x_i^y\}_1^{n_y}$ denotes the collection of training points in class $y \in \{0, 1\}$, and we assume that $n_y$ is fixed in this notation. We have that c remains independent of $\theta_0$ and $\theta_1$ posterior to observing the data:
$$\pi^*(c, \theta_y) = f(c \mid n_y)\,f(\theta_y \mid S_n) = \pi^*(c)\,\pi^*(\theta_y), \tag{2.10}$$
where $\pi^*(c)$ and $\pi^*(\theta_y)$ are the marginal posterior densities for c and $\theta_y$, respectively. By Bayes' theorem, $\pi^*(\theta_y) \propto \pi(\theta_y)\,f(S_n \mid \theta_y)$, where the constant of proportionality can be found by normalizing the integral of $\pi^*(\theta_y)$ to 1. The term $f(S_n \mid \theta_y)$ is called the likelihood function. Furthermore,
$$\pi^*(\theta) = f(c, \theta_0, \theta_1 \mid S_n) = \pi^*(c)\,\pi^*(\theta_0, \theta_1), \tag{2.11}$$

where $\pi^*(\theta_0, \theta_1) = f(\theta_0, \theta_1 \mid S_n)$. Owing to the posterior independence between c and $\theta_y$, and since $\varepsilon_n^y$ is a function of $\theta_y$ only, the Bayesian MMSE error estimator can be expressed as
$$\begin{aligned} \hat{\varepsilon}_n = E_{\pi^*}[\varepsilon_n] &= E_{\pi^*}\big[c\varepsilon_n^0 + (1 - c)\varepsilon_n^1\big] \\ &= E_{\pi^*}[c]\,E_{\pi^*}[\varepsilon_n^0] + (1 - E_{\pi^*}[c])\,E_{\pi^*}[\varepsilon_n^1], \end{aligned} \tag{2.12}$$
where $E_{\pi^*}[c]$ depends on our prior assumptions about the prior class probability. $E_{\pi^*}[\varepsilon_n^y]$ may be viewed as the posterior expectation of the error contributed by class y, that is, the Bayesian MMSE error estimate contributed by class y. If we let $\hat{\varepsilon}_n^y = E_{\pi^*}[\varepsilon_n^y]$, then we can rewrite the estimator as
$$\hat{\varepsilon}_n = E_{\pi^*}[c]\,\hat{\varepsilon}_n^0 + (1 - E_{\pi^*}[c])\,\hat{\varepsilon}_n^1. \tag{2.13}$$
With a fixed classifier and given $\theta_y$, the true error $\varepsilon_n^y(\theta_y)$ is deterministic, and
$$\hat{\varepsilon}_n^y = E_{\pi^*}[\varepsilon_n^y] = \int_{\Theta_y} \varepsilon_n^y(\theta_y)\,\pi^*(\theta_y)\,d\theta_y. \tag{2.14}$$
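When Eq. 2.14 cannot be integrated in closed form, it can always be approximated by averaging the deterministic class-y error over posterior draws. The sketch below is not from the text; the callables `sample_posterior` and `true_error` are hypothetical placeholders for whatever model-specific routines are available.

```python
import numpy as np

def bayesian_mmse_class_error(sample_posterior, true_error, n_draws=10000, seed=0):
    """Monte Carlo approximation of Eq. 2.14.

    sample_posterior(rng) -> one draw of theta_y from pi*(theta_y)  (assumed given)
    true_error(theta_y)   -> eps_n^y(theta_y), deterministic for a fixed classifier
    """
    rng = np.random.default_rng(seed)
    draws = [true_error(sample_posterior(rng)) for _ in range(n_draws)]
    return float(np.mean(draws))
```

Closed-form expressions derived later in this chapter replace this brute-force average whenever they are available.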

If c, $\theta_0$, and $\theta_1$ are mutually independent prior to observing the data, then the above still holds. In addition, given $n_y$ and the sample, the distribution parameters for class y are independent from sample points not from class y. Thus,
$$\begin{aligned} \pi^*(\theta_y) = f(\theta_y \mid S_n) &= f(\theta_y \mid n_y, \{x_i^0\}_1^{n_0}, \{x_i^1\}_1^{n_1}) \\ &= f(\theta_y \mid n_y, \{x_i^y\}_1^{n_y}) \\ &= f(\theta_y \mid \{x_i^y\}_1^{n_y}). \end{aligned} \tag{2.15}$$
Furthermore, c, $\theta_0$, and $\theta_1$ remain independent posterior to observing the data, that is,
$$\pi^*(\theta) = f(c \mid n_0)\,f(\theta_0 \mid \{x_i^0\}_1^{n_0})\,f(\theta_1 \mid \{x_i^1\}_1^{n_1}) = \pi^*(c)\,\pi^*(\theta_0)\,\pi^*(\theta_1). \tag{2.16}$$


Further, by Bayes' theorem,
$$\pi^*(\theta_y) = f(\theta_y \mid \{x_i^y\}_1^{n_y}) \propto \pi(\theta_y)\,f(\{x_i^y\}_1^{n_y} \mid \theta_y) = \pi(\theta_y) \prod_{i=1}^{n_y} f_{\theta_y}(x_i^y \mid y). \tag{2.17}$$

We assume that sample points are independent throughout, with the exception of Sections 4.10 and 4.11. As is common in Bayesian analysis, we often characterize a posterior as being proportional to a prior times a likelihood; the normalization constant will still be accounted for throughout our analysis. Although we call $\pi(\theta_y)$ the "prior probabilities," they are not required to be valid density functions. The priors are proper if the integral of $\pi(\theta_y)$ is finite, and they are improper if the integral of $\pi(\theta_y)$ is infinite, i.e., if $\pi(\theta_y)$ induces a $\sigma$-finite measure but not a finite probability measure. The flat prior over an infinite support is necessarily an improper prior. When improper priors are used, Bayes' theorem does not apply. However, as long as the product of the prior and the likelihood function is integrable, we define the posterior to be their normalized product; e.g., we take Eq. 2.17 (following normalization) as the definition of the posterior in the mutually independent case.

For the class prior probabilities, under random sampling we only need to consider the size of each class:
$$\pi^*(c) = f(c \mid n_0) \propto \pi(c)\,f(n_0 \mid c) \propto \pi(c)\,c^{n_0}(1 - c)^{n_1}, \tag{2.18}$$
where we have taken advantage of the fact that $n_0$ has a binomial$(n, c)$ distribution given c. We consider three models for the prior distributions of the a priori class probabilities: beta, uniform, and known. Suppose that the prior distribution for c follows a beta$(\alpha_0, \alpha_1)$ distribution,
$$\pi(c) = \frac{c^{\alpha_0 - 1}(1 - c)^{\alpha_1 - 1}}{B(\alpha_0, \alpha_1)}, \tag{2.19}$$
where
$$B(\alpha_0, \alpha_1) = \frac{\Gamma(\alpha_0)\Gamma(\alpha_1)}{\Gamma(\alpha_0 + \alpha_1)} \tag{2.20}$$
is the beta function, and
$$\Gamma(\alpha) = \int_0^{\infty} x^{\alpha - 1} e^{-x}\,dx \tag{2.21}$$


is the gamma function. We call the parameters of the prior model, in this case $\alpha_0$ and $\alpha_1$, hyperparameters. The posterior distribution for c can be obtained from Eq. 2.18. From this beta-binomial model, with random sampling the form of $\pi^*(c)$ is still a beta distribution with updated hyperparameters $\alpha_y^* = \alpha_y + n_y$:
$$\pi^*(c) = \frac{c^{\alpha_0^* - 1}(1 - c)^{\alpha_1^* - 1}}{B(\alpha_0^*, \alpha_1^*)}. \tag{2.22}$$
The expectation of this distribution is given by (Devore, 1995)
$$E_{\pi^*}[c] = \frac{\alpha_0^*}{\alpha_0^* + \alpha_1^*}. \tag{2.23}$$
In the special case of a uniform prior, $\alpha_0 = \alpha_1 = 1$, and
$$\pi^*(c) = \frac{(n + 1)!}{n_0!\,n_1!}\,c^{n_0}(1 - c)^{n_1}, \tag{2.24}$$
$$E_{\pi^*}[c] = \frac{n_0 + 1}{n + 2}. \tag{2.25}$$
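The conjugate update behind Eqs. 2.22-2.25 amounts to adding the class counts to the hyperparameters. The short sketch below is not from the text; it simply makes that bookkeeping explicit.

```python
from fractions import Fraction

def posterior_c(alpha0, alpha1, n0, n1):
    # Beta-binomial update of Eqs. 2.22-2.23: posterior hyperparameters and E[c | S_n].
    a0_star, a1_star = alpha0 + n0, alpha1 + n1
    return a0_star, a1_star, Fraction(a0_star, a0_star + a1_star)

# Uniform prior (alpha0 = alpha1 = 1) with n0 = 12, n1 = 8 gives E[c] = 13/22,
# which agrees with (n0 + 1)/(n + 2) in Eq. 2.25.
print(posterior_c(1, 1, 12, 8))
```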

In the case of separate sampling, there is no posterior for c since the data provide no information regarding c. Hence, for separate sampling we must assume that c is known; otherwise, we can only compute the Bayesian MMSE error estimates contributed by the individual classes. Bayesian error estimation for classification is not completely new. In the 1960s, two papers made small forays into the area. In (Sorum, 1968), a Bayesian error estimator is given for the univariate Gaussian model with known covariance matrices. In (Geisser, 1967), the problem is addressed in the multivariate Gaussian model for a particular linear classification rule based on Fisher’s discriminant for a common unknown covariance matrix and known class probabilities by using a specific prior on the means and the inverse of the covariance matrix. In neither case were the properties, optimality, or performance of these estimators extensively considered. We derive the Bayesian MMSE error estimator for an arbitrary linear classification rule in the multivariate Gaussian model with unknown covariance matrices and unknown class probabilities using a general class of priors on the means and an intermediate parameter that allows us to impose structure on the covariance matrices, and we extensively study the performance of Bayesian MMSE estimators in the Gaussian model.


2.2 Evaluation of the Bayesian MMSE Error Estimator

Direct evaluation of the Bayesian MMSE error estimator is accomplished by deriving $E_{\pi^*}[\varepsilon_n^y]$ for each class using Eq. 2.14, finding $E_{\pi^*}[c]$ according to the prior model for c, and referring to Eq. 2.13 for the complete Bayesian MMSE error estimator. This can be tedious owing to the need to evaluate a challenging integral. Fortunately, the matter can be greatly simplified. Relative to the uncertainty class $\Theta$ and the posterior distribution $\pi^*(\theta_y)$ for $y = 0, 1$, we define the effective class-conditional density as
$$f_\Theta(x \mid y) = \int_{\Theta_y} f_{\theta_y}(x \mid y)\,\pi^*(\theta_y)\,d\theta_y. \tag{2.26}$$

The next theorem shows that the effective class-conditional density, which is the average of the state-specific conditional densities relative to the posterior distribution, can be used to more easily obtain the Bayesian MMSE error estimator than by going through the route of direct evaluation.

Theorem 2.1 (Dalton and Dougherty, 2013a). Let $\psi$ be a fixed classifier given by $\psi(x) = 0$ if $x \in R_0$, and $\psi(x) = 1$ if $x \in R_1$, where $R_0$ and $R_1$ are measurable sets partitioning the feature space. Then the Bayesian MMSE error estimator can be found by
$$\begin{aligned} \hat{\varepsilon}_n(S_n, \psi) &= E_{\pi^*}[c] \int_{R_1} f_\Theta(x \mid 0)\,dx + (1 - E_{\pi^*}[c]) \int_{R_0} f_\Theta(x \mid 1)\,dx \\ &= \int_{\mathcal{X}} \big[E_{\pi^*}[c]\,f_\Theta(x \mid 0)\,I_{x \in R_1} + (1 - E_{\pi^*}[c])\,f_\Theta(x \mid 1)\,I_{x \in R_0}\big]\,dx. \end{aligned} \tag{2.27}$$

Proof. For a fixed distribution $\theta_y$ and classifier $\psi$, the true error contributed by class $y \in \{0, 1\}$ may be written as
$$\varepsilon_n^y(\theta_y, \psi) = \int_{R_{1-y}} f_{\theta_y}(x \mid y)\,dx. \tag{2.28}$$
Averaging over the posterior yields
$$\begin{aligned} E_{\pi^*}[\varepsilon_n^y(\theta_y, \psi)] &= \int_{\Theta_y} \varepsilon_n^y(\theta_y, \psi)\,\pi^*(\theta_y)\,d\theta_y \\ &= \int_{\Theta_y} \int_{R_{1-y}} f_{\theta_y}(x \mid y)\,dx\,\pi^*(\theta_y)\,d\theta_y \\ &= \int_{R_{1-y}} \int_{\Theta_y} f_{\theta_y}(x \mid y)\,\pi^*(\theta_y)\,d\theta_y\,dx \\ &= \int_{R_{1-y}} f_\Theta(x \mid y)\,dx, \end{aligned} \tag{2.29}$$
where switching the integrals in the third line is justified by Fubini's theorem. Finally,
$$\begin{aligned} \hat{\varepsilon}_n(S_n, \psi) &= E_{\pi^*}[c]\,E_{\pi^*}[\varepsilon_n^0(\theta_0, \psi)] + (1 - E_{\pi^*}[c])\,E_{\pi^*}[\varepsilon_n^1(\theta_1, \psi)] \\ &= E_{\pi^*}[c] \int_{R_1} f_\Theta(x \mid 0)\,dx + (1 - E_{\pi^*}[c]) \int_{R_0} f_\Theta(x \mid 1)\,dx \\ &= \int_{\mathcal{X}} \big(E_{\pi^*}[c]\,f_\Theta(x \mid 0)\,I_{x \in R_1} + (1 - E_{\pi^*}[c])\,f_\Theta(x \mid 1)\,I_{x \in R_0}\big)\,dx, \end{aligned} \tag{2.30}$$
which completes the proof. ▪

In the proof of the theorem, Eq. 2.30 was obtained by applying Eq. 2.29 to the $y = 0$ and $y = 1$ terms individually. Hence, the theorem could be expressed as
$$\hat{\varepsilon}_n^y = E_{\pi^*}[\varepsilon_n^y] = \int_{\mathcal{X}} f_\Theta(x \mid y)\,I_{x \in \mathcal{X} \setminus R_y}\,dx = \int_{\mathcal{X} \setminus R_y} f_\Theta(x \mid y)\,dx \tag{2.31}$$

for $y \in \{0, 1\}$. When a closed-form solution for the Bayesian MMSE error estimator is not available, Eq. 2.27 can be approximated by drawing a synthetic sample from the effective class-conditional densities $f_\Theta(x \mid y)$ and substituting the proportion of points from class y misclassified by $\psi$ for the integral $\int_{\mathcal{X} \setminus R_y} f_\Theta(x \mid y)\,dx$.
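That synthetic-sample approximation is only a few lines of code. The sketch below is not from the text; `sample_effective` and `psi` are hypothetical placeholders for a sampler of the effective densities and the fixed classifier, respectively.

```python
import numpy as np

def bayesian_mmse_estimate(psi, sample_effective, e_post_c, m=100000, seed=0):
    """Monte Carlo version of Eq. 2.27.

    psi(X)                      -> array of 0/1 labels for the rows of X
    sample_effective(y, m, rng) -> m synthetic points drawn from f_Theta(. | y)
    e_post_c                    -> E_{pi*}[c], e.g., from Eq. 2.23 or Eq. 2.25
    """
    rng = np.random.default_rng(seed)
    eps0_hat = np.mean(psi(sample_effective(0, m, rng)) == 1)  # mass of f_Theta(.|0) in R_1
    eps1_hat = np.mean(psi(sample_effective(1, m, rng)) == 0)  # mass of f_Theta(.|1) in R_0
    return e_post_c * eps0_hat + (1 - e_post_c) * eps1_hat
```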

2.3 Performance Evaluation at a Fixed Point

The error rate of a classifier is the probability of misclassifying a random point (vector) drawn from the feature-label distribution. The Bayesian MMSE error estimator produces an estimate of this probability. That being said, in some applications one may be interested not in the probability of misclassification for a random point, but in the probability of misclassification for a specific test point x drawn from the true feature-label distribution. When the true distribution is known precisely, the probability that x has label $y \in \{0, 1\}$ is simply
$$\Pr(Y = y \mid X = x, \theta) = \frac{c_y\,f_{\theta_y}(x \mid y)}{\sum_{i=0}^{1} c_i\,f_{\theta_i}(x \mid i)}, \tag{2.32}$$
where $c_y = c$ if $y = 0$, and $c_y = 1 - c$ if $y = 1$. If one has in mind a specific label for x, perhaps the label $\psi(x)$ produced by some classifier $\psi$, then one may be interested in the misclassification probability for this prediction. For known parameters, this is
$$\Pr(Y \neq \psi(x) \mid X = x, \theta) = \sum_{y \neq \psi(x)} \frac{c_y\,f_{\theta_y}(x \mid y)}{\sum_{i=0}^{1} c_i\,f_{\theta_i}(x \mid i)}, \tag{2.33}$$

where in binary classification the outer sum has a single term indexed by $y = 1 - \psi(x)$. The true error of a classifier under a known distribution can be expressed as
$$\begin{aligned} \varepsilon_n(\theta, \psi) = \Pr(Y \neq \psi(X) \mid \theta) &= \int_{\mathcal{X}} \Pr(Y \neq \psi(x) \mid X = x, \theta) \sum_{i=0}^{1} c_i\,f_{\theta_i}(x \mid i)\,dx \\ &= \sum_{y=0}^{1} \int_{\psi(x) \neq y} c_y\,f_{\theta_y}(x \mid y)\,dx. \end{aligned} \tag{2.34}$$

In the Bayesian framework, the probability corresponding to Eq. 2.32 is given by
$$\begin{aligned} \Pr(Y = y \mid X = x, S_n) &= \int_{\Theta} \Pr(Y = y \mid X = x, S_n, \theta)\,f(\theta \mid X = x, S_n)\,d\theta \\ &\propto \int_{\Theta} \Pr(Y = y \mid X = x, S_n, \theta)\,f(x \mid S_n, \theta)\,\pi^*(\theta)\,d\theta \\ &= \int_{\Theta} \Pr(Y = y \mid X = x, \theta)\,f(x \mid \theta)\,\pi^*(\theta)\,d\theta, \end{aligned} \tag{2.35}$$
where $f(\theta \mid X = x, S_n)$ is the posterior of $\theta$ given $S_n$ and the test point x, x is independent from the sample given $\theta$, and $f(x \mid \theta) = \sum_{i=0}^{1} c_i\,f_{\theta_i}(x \mid i)$. Now $\Pr(Y = y \mid X = x, \theta)$ is available in Eq. 2.32; hence,
$$\begin{aligned} \Pr(Y = y \mid X = x, S_n) &\propto \int_{\Theta} c_y\,f_{\theta_y}(x \mid y)\,\pi^*(\theta)\,d\theta \\ &= \int_0^1 c_y\,\pi^*(c_y)\,dc_y \int_{\Theta_0} \int_{\Theta_1} f_{\theta_y}(x \mid y)\,\pi^*(\theta_0, \theta_1)\,d\theta_0\,d\theta_1 \\ &= E_{\pi^*}[c_y]\,f_\Theta(x \mid y), \end{aligned} \tag{2.36}$$
where in the second line we have used the independence between $c_y$ and $(\theta_0, \theta_1)$ and Eq. 2.11. After normalizing,


$$\Pr(Y = y \mid X = x, S_n) = \frac{E_{\pi^*}[c_y]\,f_\Theta(x \mid y)}{\sum_{i=0}^{1} E_{\pi^*}[c_i]\,f_\Theta(x \mid i)}. \tag{2.37}$$
Not surprisingly, this is of the same form as Eq. 2.32 with the effective density and estimated class probability plugged in for the true values. Similarly, for a given label prediction $\psi(x)$ at x, the probability of misclassification is
$$\Pr(Y \neq \psi(x) \mid X = x, S_n) = \sum_{y \neq \psi(x)} \frac{E_{\pi^*}[c_y]\,f_\Theta(x \mid y)}{\sum_{i=0}^{1} E_{\pi^*}[c_i]\,f_\Theta(x \mid i)}. \tag{2.38}$$

The same result is derived in (Knight et al., 2014). Further, since
$$\begin{aligned} f(x \mid S_n) &= \int_{\Theta} f(x \mid S_n, \theta)\,f(\theta \mid S_n)\,d\theta \\ &= \int_{\Theta} f(x \mid \theta)\,\pi^*(\theta)\,d\theta \\ &= \int_{\Theta} \sum_{y=0}^{1} f(x \mid Y = y, \theta)\,\Pr(Y = y \mid \theta)\,\pi^*(\theta)\,d\theta \\ &= \sum_{y=0}^{1} \int_0^1 c_y\,\pi^*(c_y)\,dc_y \int_{\Theta_0} \int_{\Theta_1} f_{\theta_y}(x \mid y)\,\pi^*(\theta_0, \theta_1)\,d\theta_0\,d\theta_1 \\ &= \sum_{y=0}^{1} E_{\pi^*}[c_y]\,f_\Theta(x \mid y), \end{aligned} \tag{2.39}$$

it follows from Eq. 2.38 that the Bayesian MMSE error estimator can be expressed as
$$\begin{aligned} \hat{\varepsilon}_n(S_n, \psi) = \Pr(Y \neq \psi(X) \mid S_n) &= \int_{\mathcal{X}} \Pr(Y \neq \psi(x) \mid X = x, S_n)\,f(x \mid S_n)\,dx \\ &= \int_{\mathcal{X}} \Pr(Y \neq \psi(x) \mid X = x, S_n) \sum_{i=0}^{1} E_{\pi^*}[c_i]\,f_\Theta(x \mid i)\,dx \\ &= \sum_{y=0}^{1} \int_{\psi(x) \neq y} E_{\pi^*}[c_y]\,f_\Theta(x \mid y)\,dx. \end{aligned} \tag{2.40}$$
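Equations 2.37 and 2.38 are straightforward to evaluate once the effective densities are available. The sketch below is not from the text; `effective_densities` is a hypothetical pair of callables returning $f_\Theta(x \mid y)$ for $y = 0, 1$.

```python
import numpy as np

def posterior_label_probs(x, effective_densities, e_post_c):
    """Eq. 2.37: Pr(Y = y | X = x, S_n) from the effective densities.

    effective_densities[y](x) -> f_Theta(x | y)   (model-specific, assumed given)
    e_post_c                  -> E_{pi*}[c] for class 0
    """
    weights = np.array([e_post_c * effective_densities[0](x),
                        (1 - e_post_c) * effective_densities[1](x)])
    return weights / weights.sum()

def misclassification_prob(x, label, effective_densities, e_post_c):
    # Eq. 2.38 for a binary prediction: posterior mass on the other label.
    return posterior_label_probs(x, effective_densities, e_post_c)[1 - label]
```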


2.4 Discrete Model

Consider multinomial discrimination, in which there is a discrete sample space $\mathcal{X} = \{1, 2, \ldots, b\}$ with $b \geq 2$ bins. Let $p_i$ and $q_i$ be the class-conditional probabilities in bin $i \in \{1, 2, \ldots, b\}$ for classes 0 and 1, respectively, and define $U_i^0$ and $U_i^1$ to be the number of sample points observed in bin $i \in \{1, 2, \ldots, b\}$ from classes 0 and 1, respectively. The class sizes are thus $n_0 = \sum_{i=1}^{b} U_i^0$ and $n_1 = \sum_{i=1}^{b} U_i^1$. A general discrete classifier assigns each bin to a class, so $\psi : \{1, 2, \ldots, b\} \to \{0, 1\}$. The true error of an arbitrary classifier $\psi$ is given by Eq. 1.3, where
$$\varepsilon_n^0 = \sum_{i=1}^{b} p_i\,I_{\psi(i)=1}, \tag{2.41}$$
$$\varepsilon_n^1 = \sum_{i=1}^{b} q_i\,I_{\psi(i)=0}. \tag{2.42}$$
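These class-conditional errors are simple sums of bin masses. The following sketch is not from the text; it evaluates Eqs. 2.41 and 2.42 for arbitrary bin probabilities and a bin classifier, with an arbitrary two-bin example at the end.

```python
import numpy as np

def true_errors(p, q, psi):
    # Eqs. 2.41-2.42: class-conditional errors of a bin classifier psi (0/1 label per bin).
    p, q, psi = map(np.asarray, (p, q, psi))
    eps0 = float(np.sum(p[psi == 1]))   # class-0 mass assigned to label 1
    eps1 = float(np.sum(q[psi == 0]))   # class-1 mass assigned to label 0
    return eps0, eps1

# Two bins with p = (0.8, 0.2), q = (0.2, 0.8) and psi = (0, 1) give eps0 = eps1 = 0.2,
# and the overall true error is c*eps0 + (1 - c)*eps1.
print(true_errors([0.8, 0.2], [0.2, 0.8], [0, 1]))
```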

The discrete Bayesian model defines $\theta_0 = [p_1, p_2, \ldots, p_b]$ and $\theta_1 = [q_1, q_2, \ldots, q_b]$. The parameter space of $\theta_0$ is the standard $(b-1)$-simplex $\Theta_0 = \Delta^{b-1} \subset \mathbb{R}^b$, which is the set of all valid bin probabilities; e.g., $[p_1, p_2, \ldots, p_b] \in \Theta_0$ if and only if $0 \leq p_i \leq 1$ for $i \in \{1, 2, \ldots, b\}$ and $\sum_{i=1}^{b} p_i = 1$. The parameter space $\Theta_1$ is defined similarly. With the parametric model established, define Dirichlet priors:
$$\pi(\theta_0) = \frac{1}{B(\alpha_1^0, \ldots, \alpha_b^0)} \prod_{i=1}^{b} p_i^{\alpha_i^0 - 1}, \tag{2.43}$$
$$\pi(\theta_1) = \frac{1}{B(\alpha_1^1, \ldots, \alpha_b^1)} \prod_{i=1}^{b} q_i^{\alpha_i^1 - 1}, \tag{2.44}$$
where B is the multivariate beta function,
$$B(\alpha_1, \ldots, \alpha_b) = \frac{\Gamma(\alpha_1) \cdots \Gamma(\alpha_b)}{\Gamma(\alpha_1 + \cdots + \alpha_b)}. \tag{2.45}$$

For proper priors, the hyperparameters $\alpha_i^y$ for $i \in \{1, 2, \ldots, b\}$ and $y \in \{0, 1\}$ must be positive, and for uniform priors, $\alpha_i^y = 1$ for all i and y. For a Dirichlet distribution on $[p_1, p_2, \ldots, p_b]$ with concentration parameter $[\alpha_1^0, \alpha_2^0, \ldots, \alpha_b^0]$, the mean of $p_j$ is well known and given by
$$E[p_j] = \frac{\alpha_j^0}{\sum_{k=1}^{b} \alpha_k^0}. \tag{2.46}$$


Second-order moments are also well known:
$$E[p_j^2] = \frac{\alpha_j^0(\alpha_j^0 + 1)}{\sum_{k=1}^{b} \alpha_k^0 \left(1 + \sum_{k=1}^{b} \alpha_k^0\right)}, \tag{2.47}$$
$$E[p_i p_j] = \frac{\alpha_i^0\,\alpha_j^0}{\sum_{k=1}^{b} \alpha_k^0 \left(1 + \sum_{k=1}^{b} \alpha_k^0\right)}. \tag{2.48}$$

Similar equations hold for the moments of $q_1, q_2, \ldots, q_b$ in class 1.

2.4.1 Representation of the Bayesian MMSE error estimator

The next theorem provides the posterior distributions for the discrete model. This property of Dirichlet distributions is well known (Johnson et al., 1997).

Theorem 2.2. For the discrete model with Dirichlet$(\alpha_1^y, \ldots, \alpha_b^y)$ priors for class $y \in \{0, 1\}$, $\alpha_i^y > 0$ for all i and y, the posterior distributions are Dirichlet$(U_1^y + \alpha_1^y, \ldots, U_b^y + \alpha_b^y)$ and given by
$$\pi^*(\theta_0) = \frac{\Gamma\!\left(n_0 + \sum_{i=1}^{b} \alpha_i^0\right)}{\prod_{k=1}^{b} \Gamma(U_k^0 + \alpha_k^0)} \prod_{i=1}^{b} p_i^{U_i^0 + \alpha_i^0 - 1}, \tag{2.49}$$
$$\pi^*(\theta_1) = \frac{\Gamma\!\left(n_1 + \sum_{i=1}^{b} \alpha_i^1\right)}{\prod_{k=1}^{b} \Gamma(U_k^1 + \alpha_k^1)} \prod_{i=1}^{b} q_i^{U_i^1 + \alpha_i^1 - 1}. \tag{2.50}$$

Proof. Consider class 0. $f_{\theta_0}(x_i^0 \mid 0)$ equals the bin probability corresponding to class 0 and bin $x_i^0$. Thus, we have the likelihood function
$$\prod_{i=1}^{n_0} f_{\theta_0}(x_i^0 \mid 0) = \prod_{i=1}^{b} p_i^{U_i^0}. \tag{2.51}$$
By Eq. 2.17, the posterior of $\theta_0$ is still proportional to the product of the bin probabilities:
$$\pi^*(\theta_0) \propto \prod_{i=1}^{b} p_i^{U_i^0 + \alpha_i^0 - 1}. \tag{2.52}$$
We realize that this is a Dirichlet$(U_1^0 + \alpha_1^0, \ldots, U_b^0 + \alpha_b^0)$ distribution. The density function for this distribution is precisely Eq. 2.49. Class 1 is similar. ▪

Theorem 2.3 (Dalton and Dougherty, 2013a). For the discrete model with Dirichlet$(\alpha_1^y, \ldots, \alpha_b^y)$ priors for class $y \in \{0, 1\}$, $\alpha_i^y > 0$ for all i and y, the effective class-conditional densities are given by


$$f_\Theta(j \mid y) = \frac{U_j^y + \alpha_j^y}{n_y + \sum_{i=1}^{b} \alpha_i^y}. \tag{2.53}$$
Proof. The effective density for class $y = 0$ is
$$f_\Theta(j \mid 0) = \int_{\Theta_0} p_j\,\pi^*(p_1, p_2, \ldots, p_b)\,dp_1\,dp_2 \cdots dp_b = E_{\pi^*}[p_j]. \tag{2.54}$$

Plugging in the hyperparameters for the posterior $\pi^*(p_1, p_2, \ldots, p_b)$, we obtain Eq. 2.53 from Eq. 2.46. Class 1 is treated similarly. ▪

In the preceding theorem, $f_\Theta(j \mid 0)$ and $f_\Theta(j \mid 1)$ may be viewed as effective bin probabilities for each class after combining prior knowledge and observed data. Theorem 2.1 yields the Bayesian MMSE estimator:
$$\hat{\varepsilon}_n = \sum_{j=1}^{b} \left[ E_{\pi^*}[c]\,\frac{U_j^0 + \alpha_j^0}{n_0 + \sum_{i=1}^{b} \alpha_i^0}\,I_{\psi(j)=1} + (1 - E_{\pi^*}[c])\,\frac{U_j^1 + \alpha_j^1}{n_1 + \sum_{i=1}^{b} \alpha_i^1}\,I_{\psi(j)=0} \right]. \tag{2.55}$$
In particular, the first posterior moment of the true error for $y \in \{0, 1\}$ is given by
$$\hat{\varepsilon}_n^y = E_{\pi^*}[\varepsilon_n^y] = \sum_{j=1}^{b} \frac{U_j^y + \alpha_j^y}{n_y + \sum_{i=1}^{b} \alpha_i^y}\,I_{\psi(j)=1-y}. \tag{2.56}$$

In the special case where we have random sampling, uniform c, and uniform priors for the bin probabilities ($\alpha_i^y = 1$ for all i and y), the Bayesian MMSE error estimate is
$$\hat{\varepsilon}_n = \sum_{j=1}^{b} \left[ \frac{n_0 + 1}{n + 2} \cdot \frac{U_j^0 + 1}{n_0 + b}\,I_{\psi(j)=1} + \frac{n_1 + 1}{n + 2} \cdot \frac{U_j^1 + 1}{n_1 + b}\,I_{\psi(j)=0} \right]. \tag{2.57}$$
In addition, the Bayesian MMSE error estimator assuming random sampling and (improper) $\alpha^y = \alpha_i^y = 0$ for all i and y is equivalent to resubstitution.
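The closed-form estimator in Eqs. 2.55-2.57 is only a few lines of code. The sketch below is not from the text; it defaults to uniform Dirichlet priors and a uniform prior on c under random sampling, with the hyperparameters exposed so that informative priors can be swapped in. The counts in the example at the end are arbitrary.

```python
import numpy as np

def discrete_bee(U0, U1, psi, alpha0=None, alpha1=None):
    """Bayesian MMSE error estimate of Eq. 2.55 with a uniform prior on c (Eq. 2.25).

    U0, U1 : length-b arrays of bin counts for classes 0 and 1
    psi    : length-b array of 0/1 bin labels (the classifier)
    alpha0, alpha1 : Dirichlet hyperparameters (default: uniform priors, all ones)
    """
    U0, U1, psi = map(np.asarray, (U0, U1, psi))
    b = len(U0)
    alpha0 = np.ones(b) if alpha0 is None else np.asarray(alpha0, dtype=float)
    alpha1 = np.ones(b) if alpha1 is None else np.asarray(alpha1, dtype=float)
    n0, n1 = U0.sum(), U1.sum()
    n = n0 + n1
    eps0_hat = np.sum((U0 + alpha0)[psi == 1]) / (n0 + alpha0.sum())   # Eq. 2.56, y = 0
    eps1_hat = np.sum((U1 + alpha1)[psi == 0]) / (n1 + alpha1.sum())   # Eq. 2.56, y = 1
    e_c = (n0 + 1) / (n + 2)                                           # E[c], uniform prior on c
    return e_c * eps0_hat + (1 - e_c) * eps1_hat

# b = 2 bins, counts U0 = (4, 1), U1 = (1, 4), histogram classifier psi = (0, 1).
print(discrete_bee([4, 1], [1, 4], [0, 1]))
```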

2.4.2 Performance and robustness in the discrete model

We next examine the performance of discrete Bayesian error estimators with the histogram rule via examples involving simulation studies. The first example considers the performance of Bayesian MMSE error estimators for two bins and different beta prior distributions for the bin probabilities. By studying beta priors that target specific values for the bin probabilities, we will observe the benefits of informative priors and assess the robustness of discrete Bayesian error estimators to poor prior-distribution modeling. The second example treats the performance of Bayesian MMSE error estimators with uniform priors for an arbitrary number of bins. These simulations show how and when non-informative Bayesian error estimators improve on the resubstitution and leave-one-out error estimators, especially as the number of bins increases. The third example considers average performance under uniform priors, including bias and deviation variance.

A summary of the basic Monte Carlo simulation methodology is shown in Fig. 2.1, which lists the main steps and flow of information. At the start we have a single distribution and a prior. In each iteration, step A generates a labeled training sample from the class-conditional distributions via random sampling, stratified sampling, or separate sampling. These labeled sample points are used to update the prior to a posterior in step B and to train one or more classifiers in step C. Feature selection, if implemented, is considered part of step C. In step D, for each classifier we compute and store several statistics used in performance evaluation. The true error $\varepsilon_n$ is either found exactly or approximated by evaluating the proportion of misclassified points from a large synthetic testing set drawn from the true distributions parameterized by c, $\theta_0$, and $\theta_1$. We evaluate the Bayesian MMSE error estimator $\hat{\varepsilon}_n$ from the classifier and posterior using closed-form expressions if available, and otherwise by evaluating the proportion of misclassified points from a large synthetic testing set drawn from the effective density. We also find the MSE of $\hat{\varepsilon}_n$ conditioned on the sample, denoted by $\mathrm{MSE}(\hat{\varepsilon}_n \mid S_n)$, which we will study extensively in Chapter 3 and which is expressed in Theorem 3.1. Several classical training-data error estimators are also computed. The sample-conditioned MSE of the classical error estimators may be evaluated for each iteration according to Theorem 3.2. Steps A through D are repeated to obtain t samples and sets of output. Typically, for each sample and error estimator we evaluate the difference $\hat{\varepsilon}_n - \varepsilon_n$ and the squared difference $(\hat{\varepsilon}_n - \varepsilon_n)^2$, and average these to produce Monte Carlo approximations of the bias and RMS over the sampling distribution.

Figure 2.1 Simulation methodology for fixed sample sizes under a fixed feature-label distribution or a large dataset representing a population.


Example 2.1. For all simulations in this example, we use random sampling in a discrete model with $b = 2$, true a priori class probability $c = 0.5$, and fixed bin probabilities governed by the fixed class-0 bin-1 probability p and class-1 bin-1 probability q. The Bayes error, or the optimal true error obtained from the optimal classifier (not to be confused with Bayesian error estimators), is min(p, q). To generate a random sample we first determine the sample size for each class using a binomial(n, c) experiment and then assign each sample point a bin number according to the distribution of its class. The sample is then used to train a histogram classifier, where the class assigned to each bin is determined by a majority vote. The same samples are used for classifier design and error estimation. The true error of this histogram classifier is calculated via our known distribution parameters. The same sample is used to find resubstitution, leave-one-out, and Bayesian MMSE error estimates for the designed classifier. This process is repeated $t = 10{,}000{,}000$ times to find a Monte Carlo approximation of the RMS deviation from the true error for each error estimator. All results are presented in Fig. 2.2.

For all Bayesian MMSE error estimators, c is assumed to have a uniform prior, and the top row of Fig. 2.2 shows the different beta distributions used as priors for p. We set $q = 1 - p$ so that the priors for q are the same but flipped about 0.5, i.e., $\alpha_1^1 = \alpha_2^0$ and $\alpha_2^1 = \alpha_1^0$, and $E_\pi[q] = 1 - E_\pi[p]$. Part (a) of the figure shows five priors with varying means ($E_\pi[p] = 0.5, 0.6, 0.7, 0.8, 0.9$) and relatively low variance. Part (b) shows several beta distributions with $E_\pi[p] = 0.5$ (including the uniform prior) with varying weights between middle versus edge values of p. In all priors in part (b), the $\alpha_i^y$ are equal for all i. The graphs below these priors present the RMS deviation from the true error for the resubstitution and leave-one-out error estimators, and for the Bayesian MMSE error estimators corresponding to the priors at the top, with respect to both sample size (middle row) and the true distributions (bottom row). Each point on these graphs represents a fixed sample size and true distribution ($c = 0.5$, p, and q), and under these fixed conditions we evaluate the performance of each error estimator using Monte Carlo simulations.

In the second row of Fig. 2.2, we fix the true distributions at $p = 0.8$ and $q = 0.2$, and observe performance for increasing sample size. Naturally, these simulations show that priors with a high density around the true distributions have better performance and tend to converge more quickly to the true error. For example, in part (a) the prior with $E_\pi[p] = 0.8$ matches the distribution $p = 0.8$ very well, and this is reflected in the RMS deviation in Fig. 2.2(c), where we observe remarkable performance. On the other hand, when our priors have little mass around the true distributions, performance can be quite poor compared to resubstitution and leave-one-out, and converge very slowly with increasing sample size. See, for example, the prior with $E_\pi[p] = 0.5$ in part (c).


Figure 2.2 RMS deviation from the true error for discrete classification ($b = 2$, $c = 0.5$): (a) using low-variance priors; (b) using priors centered at $p = 0.5$; (c) versus sample size (low-variance priors, $p = 0.8$); (d) versus sample size (centered priors, $p = 0.8$); (e) versus p (low-variance priors, $n = 20$); (f) versus p (centered priors, $n = 20$). Lines without markers represent Bayesian MMSE error estimators with different beta priors, which are labeled and shown in the graph at the top of the corresponding column. [Reprinted from (Dalton and Dougherty, 2011b).]

The third row of Fig. 2.2 shows performance graphs with sample size $n = 20$, as a function of p. These illustrate how each prior performs as the true distributions vary. In all cases, performance is best in the ranges of p and q that are well represented in the prior distributions, but outside this range results can be poor. This is best seen in Fig. 2.2(e), where the RMS curves move to the right as the priors move right. High-information priors offer better performance if they are within the targeted range of parameters, but


performance outside this range is reciprocally worse. Meanwhile, low-information priors tend to give safer results by avoiding catastrophic behavior at the expense of performance. For example, in Fig. 2.2(f) the curves corresponding to priors with large $\alpha_i^y$ dip well below resubstitution (good performance) when p is near 0.5; however, away from this range performance rapidly deteriorates. Meanwhile, the curves corresponding to priors with $\alpha_i^y = 1$ have good performance over all states.

Example 2.2. We now consider the RMS performance of Bayesian MMSE error estimators with non-informative uniform priors for an arbitrary number of bins. This example treats performance relative to a fixed distribution. As noted in (Braga-Neto and Dougherty, 2005), discrete classification for these bin sizes corresponds to regulatory rule design in binary regulatory networks. The distributions are fixed with $c = 0.5$. We use Zipf's power law model, where $p_i \propto i^{-a}$ and $q_i = p_{b-i+1}$ for $i = 1, 2, \ldots, b$ [see (Braga-Neto and Dougherty, 2005)]. The parameter $a \geq 0$ is a free parameter used to target a specific Bayes error, with larger a corresponding to smaller Bayes error. We again use random sampling and the histogram classification rule throughout.

Figure 2.3 shows the RMS as a function of sample size for $b = 16$ and different Bayes errors, where increasing Bayes error corresponds to increasingly difficult classification. Figure 2.4 shows the RMS as a function of the Bayes error for sample size 20 for our fixed Zipf distributions, again for $b = 16$. In sum, these graphs show that the performance of the Bayesian MMSE error estimators (denoted by BEE in the legends) is superior to resubstitution and leave-one-out for most distributions and tends to be especially favorable


Figure 2.3 RMS deviation from the true error for discrete classification and uniform priors with respect to sample size ($c = 0.5$): (a) $b = 16$, Bayes error = 0.1; (b) $b = 16$, Bayes error = 0.2. [Reprinted from (Dalton and Dougherty, 2011b).]



Figure 2.4 RMS deviation from the true error for discrete classification and uniform priors with respect to the Bayes error ($c = 0.5$, $b = 16$, $n = 20$). [Reprinted from (Dalton and Dougherty, 2011b).]

with moderate to high Bayes errors and small sample sizes. From Fig. 2.4, it appears that the performance of the Bayesian MMSE error estimators tends to be favorable across a wide range of distributions, while the other error estimators, especially resubstitution, favor a small Bayes error. While the Bayesian MMSE error estimator is optimal with respect to MSE relative to the prior distribution and sampling distribution, it need not be optimal for a specific distribution. A clear weakness with uniform priors occurs when the Bayes error is very small. To explain this latter phenomenon, suppose that the true distributions are perfectly separated by the bins, for instance, $p_1 = 1$ and $q_b = 1$, thereby giving a Bayes error of zero. If we observe five sample points from each class, these will be perfectly separated into two bins with nonzero probability, and the histogram classifier will assign the correct class to each of these bins. Resubstitution and leave-one-out will both give estimates of 0, which is correct; however, since the true distribution is unknown, the Bayesian MMSE error estimator with $\alpha_i^y = 1$ for all i and y assigns a nonzero value to all bins in the effective density, even bins for which no points have been observed. This improves the average performance, but not for cases with zero (or very small) Bayes error. Of course, if it is suspected before the experiment that the Bayes error is very low, or if any additional information about the parameters is available to incorporate into the priors, then we can improve the Bayesian MMSE error estimator using informed priors, as demonstrated in Example 2.1 with beta priors and $b = 2$.


Example 2.3. This example considers average performance with uniform priors, where simulations assume that the feature-label distribution is generated randomly from a given prior. A summary of the methodology is shown in Fig. 2.5. We emulate the entire Bayesian model by specifying hyperparameters for priors over c, $\theta_0$, and $\theta_1$, drawing random parameters according to the prior distribution (step 1), generating a random sample for each fixed feature-label distribution (step 2A), updating the prior to a posterior (step 2B), training a classifier (step 2C), and evaluating the true error and estimates of the true error (step 2D). Steps 2A through 2D are essentially the same as steps A through D in the simulation methodology outlined in Fig. 2.1. Step 1 is repeated to produce T different feature-label distributions (corresponding to the randomly selected parameters), and steps 2A through 2D are repeated t times for each fixed feature-label distribution, producing a total of tT samples and sets of output results in each simulation. Typically, for each sample and error estimator, we evaluate the difference $\hat{\varepsilon}_n - \varepsilon_n$ and the squared difference $(\hat{\varepsilon}_n - \varepsilon_n)^2$, and average these to produce Monte Carlo approximations of the bias and RMS over the prior and sampling distributions. As long as the model and prior used for Bayesian MMSE error estimation are the same as those used to generate random parameters, the Bayesian MMSE error estimator is unbiased and has optimal RMS in these simulations.

We present results in the Bayesian framework for non-informative uniform priors and an arbitrary number of bins. We again use random sampling and histogram classification throughout. Figure 2.6 gives the average RMS deviation from the true error, as a function of sample size, over all distributions in the model with uniform priors for the bin probabilities and c. To generate these graphs, the true distributions and c are randomly selected, a random sample is generated according to the current distributions, and the performance of each error estimator is calculated. This procedure is repeated to obtain Monte Carlo approximations of the RMS deviation from the true error. This figure indicates that the Bayesian MMSE error estimator has excellent average performance for each fixed n. Indeed, it is optimal with respect to MSE relative to the prior distribution and sampling distribution. The Bayesian MMSE error estimator shows great improvement over resubstitution and leave-one-out, especially for small samples or a large number of bins.

Figure 2.5 Simulation methodology for fixed sample sizes under random feature-label distributions drawn from a Bayesian framework.
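To make the methodology concrete, the following minimal Python sketch runs steps 1 and 2A–2D for the discrete model with uniform Dirichlet priors and a histogram (majority-vote) classifier, and reports the Monte Carlo bias and RMS of resubstitution. The function and variable names are ours, chosen for illustration; the Bayesian MMSE estimator itself and the other estimators compared in Fig. 2.6 are omitted to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(b=16, n=20, T=2000, t=10):
    """Monte Carlo bias and RMS of resubstitution for the histogram classifier,
    averaging over feature-label distributions drawn from uniform priors."""
    diffs = []
    for _ in range(T):                         # step 1: random feature-label distribution
        c = rng.uniform()                      # class-0 probability, uniform prior
        p = rng.dirichlet(np.ones(b))          # class-0 bin probabilities
        q = rng.dirichlet(np.ones(b))          # class-1 bin probabilities
        for _ in range(t):
            n0 = rng.binomial(n, c)            # step 2A: random sampling
            U = rng.multinomial(n0, p)         # class-0 bin counts
            V = rng.multinomial(n - n0, q)     # class-1 bin counts
            label1 = U < V                     # step 2C: histogram classifier (ties to class 0)
            true_err = np.sum(c * p * label1 + (1 - c) * q * ~label1)   # step 2D
            resub = (U[label1].sum() + V[~label1].sum()) / n            # resubstitution
            diffs.append(resub - true_err)
    diffs = np.asarray(diffs)
    return diffs.mean(), np.sqrt(np.mean(diffs ** 2))   # bias, RMS

bias, rms = simulate()
print(f"resubstitution: bias = {bias:.4f}, RMS = {rms:.4f}")
```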


Figure 2.6 RMS deviation from true error for discrete classification and uniform priors with respect to sample size ($b = 16$, bin probabilities and $c$ uniform). [Reprinted from (Dalton and Dougherty, 2011b).]

been demonstrated analytically for discrete histogram classification, resubstitution is superior to leave-one-out for small numbers of bins but poorer for large numbers (on account of increasing bias) (Braga-Neto and Dougherty, 2005). The Bayesian MMSE error estimator is unbiased when averaged over all distributions in the model and all possible samples from these distributions; however, it can be quite biased for a fixed distribution. Figure 2.7 examines bias and deviation variance for a 16-bin problem using resubstitution, leave-one-out, and the Bayesian MMSE error estimator with uniform priors. The left column shows bias and deviation variance under random distributions drawn from a uniform prior. Unbiasedness is observed in Fig. 2.7(a), which shows the average bias over all distributions and samples with respect to sample size. Figure 2.7(c) shows the significant small-sample advantage in average deviation variance of the Bayesian MMSE error estimator with respect to leave-one-out and resubstitution. The right column in Fig. 2.7 shows bias and deviation variance under fixed Zipf distributions with $c = 0.5$ and sample size 20 as a function of the Bayes error of feature-label distributions governed by the prior distribution (not averaged across the prior distribution). Leave-one-out is nearly unbiased with a very large deviation variance, while resubstitution is quite optimistically biased with a much lower deviation variance. In contrast, the Bayesian MMSE error estimator is pessimistically biased when the true classes are well separated (low Bayes error), but tends to be optimistically biased when the true classes are highly mixed (high Bayes error). This correlates with our previous RMS graphs, where the performance of the error estimator is usually best with moderate Bayes error.

Figure 2.7 Bias and deviation variance from true error for discrete classification and uniform priors ($b = 16$): (a) bias versus sample size (bin probabilities and $c$ uniform); (b) bias versus Bayes error ($n = 20$, $c = 0.5$); (c) deviation variance versus sample size (bin probabilities and $c$ uniform); (d) deviation variance versus Bayes error ($n = 20$, $c = 0.5$). [Reprinted from (Dalton and Dougherty, 2011b).]

2.5 Gaussian Model

In the Gaussian model, each sample point is a column vector of $D$ multivariate Gaussian features; thus, the feature space is $\mathcal{X} = \mathbb{R}^D$. For $y \in \{0, 1\}$, assume a Gaussian distribution with parameters $\theta_y = (\mu_y, \lambda_y)$, where $\mu_y$ is the mean of the class-conditional distribution, and $\lambda_y$ is a collection of parameters that determine the covariance matrix $\Sigma_y$ of the class. We distinguish between $\lambda_y$ and $\Sigma_y$ to enable us to impose a structure on the covariance. In fact, we will consider four types of models: a fixed covariance ($\Sigma_y$ is known perfectly), a scaled identity covariance having uncorrelated features with equal variances ($\lambda_y = \sigma_y^2$ is a scalar and $\Sigma_y = \sigma_y^2 I_D$), a diagonal covariance ($\lambda_y = [\sigma_{y,1}^2, \sigma_{y,2}^2, \ldots, \sigma_{y,D}^2]^T$ is a length-$D$ vector corresponding to the diagonal elements of $\Sigma_y$), and a general (unconstrained, but valid) covariance matrix $\Sigma_y = \lambda_y$. The parameter space of $\mu_y$ is $\mathbb{R}^D$. The parameter space of $\lambda_y$, denoted by $\Lambda_y$, must be carefully defined to permit only valid covariance matrices. We write $\Sigma_y$ without explicitly showing its dependence on $\lambda_y$; that is, we write $\Sigma_y$ instead of $\Sigma_y(\lambda_y)$. A multivariate Gaussian distribution with mean $\mu$ and covariance $\Sigma$ is denoted by $f_{\mu,\Sigma}(x)$; therefore, the parameterized class-conditional distributions are $f_{\theta_y}(x|y) = f_{\mu_y,\Sigma_y}(x)$. In addition to the four covariance structures described above, we use two models to relate the covariances of each class: independent and homoscedastic. We next characterize the independent covariance and homoscedastic covariance models.

2.5.1 Independent covariance model

In the independent covariance model, we assume that $c$, $\theta_0 = (\mu_0, \lambda_0)$, and $\theta_1 = (\mu_1, \lambda_1)$ are all independent prior to observing the data so that $\pi(\theta) = \pi(c)\pi(\theta_0)\pi(\theta_1)$. Assuming that $\pi(c)$ and $\pi^*(c)$ have been established, we have to define priors $\pi(\theta_y)$ and find posteriors $\pi^*(\theta_y)$ for both classes; note that Eq. 2.16 applies in the independent covariance case. We begin by specifying conjugate priors for $\theta_0$ and $\theta_1$. Let $\nu$ be a real number, $m$ a length-$D$ real vector, $\kappa$ a real number (although we may restrict $\kappa$ to be an integer to guarantee nice properties for the Bayesian MMSE error estimator), and $S$ a symmetric $D \times D$ matrix. Define

\[
f_{\mathrm{m}}(\mu; \nu, m, \lambda) = |\Sigma|^{-\frac{1}{2}} \exp\left(-\frac{\nu}{2}(\mu - m)^T \Sigma^{-1} (\mu - m)\right), \tag{2.58}
\]

\[
f_{\mathrm{c}}(\lambda; \kappa, S) = |\Sigma|^{-\frac{\kappa + D + 1}{2}} \operatorname{etr}\left(-\frac{1}{2} S \Sigma^{-1}\right), \tag{2.59}
\]

where $\Sigma$ is a function of $\lambda$, and $\operatorname{etr}(\cdot) = \exp(\operatorname{tr}(\cdot))$. If $\nu > 0$, then $f_{\mathrm{m}}$ is an unnormalized Gaussian distribution with mean $m$ and covariance $\Sigma/\nu$. Likewise, $f_{\mathrm{c}}$ simplifies to many different forms, depending on the definition of $\Sigma(\lambda)$; for example, if $\Sigma = \lambda$, $\kappa > D - 1$, and $S$ is symmetric positive definite, then $f_{\mathrm{c}}$ is an unnormalized inverse-Wishart$(S, \kappa)$ distribution. In general, $f_{\mathrm{m}}$ and $f_{\mathrm{c}}$ are not necessarily normalizable and thus can be used for improper priors.

Considering one class $y \in \{0, 1\}$ at a time, we assume that $\Sigma_y$ is invertible (almost surely), and that for invertible $\Sigma_y$ our priors for $\theta_y$ are of the form

\[
\pi(\theta_y) = \pi(\mu_y | \lambda_y)\,\pi(\lambda_y), \tag{2.60}
\]


where

\[
\pi(\mu_y | \lambda_y) \propto f_{\mathrm{m}}(\mu_y; \nu_y, m_y, \lambda_y), \tag{2.61}
\]

\[
\pi(\lambda_y) \propto f_{\mathrm{c}}(\lambda_y; \kappa_y, S_y). \tag{2.62}
\]

Alternatively, $\pi(\lambda_y)$ may be set to a point mass at a fixed covariance. In this way, different covariance structures (fixed, scaled identity, diagonal, and general) may be imposed on each class. If $\nu_y > 0$, then the prior $\pi(\mu_y | \lambda_y)$ for the mean conditioned on the covariance is proper and Gaussian with mean $m_y$ and covariance $\Sigma_y/\nu_y$. The hyperparameter $m_y$ can be viewed as a target for the mean, where the larger $\nu_y$ is, the more localized the prior is about $m_y$. In the general covariance model where $\Sigma_y = \lambda_y$, $\pi(\lambda_y)$ is proper if $\kappa_y > D - 1$ and $S_y$ is symmetric positive definite. If in addition $\nu_y > 0$, then $\pi(\theta_y)$ is a normal-inverse-Wishart distribution, which is the conjugate prior for normal distributions with unknown mean and covariance (DeGroot, 1970; Raiffa and Schlaifer, 1961). When $\kappa_y > D + 1$,

\[
\mathrm{E}_{\pi}[\Sigma_y] = \frac{1}{\kappa_y - D - 1} S_y. \tag{2.63}
\]

The larger $\kappa_y$ is, the more certainty we have about $\Sigma_y$ in the prior. In the other covariance models, properties and constraints on $\kappa_y$ and $S_y$ may be different.

Theorem 2.4 (Dalton and Dougherty, 2013a). In the independent covariance model with unknown covariances, assuming that $\pi^*(\theta_y)$ is normalizable, the posterior distributions possess the same form as the priors and satisfy

\[
\pi^*(\theta_y) \propto f_{\mathrm{m}}(\mu_y; \nu_y^*, m_y^*, \lambda_y)\, f_{\mathrm{c}}(\lambda_y; \kappa_y^*, S_y^*), \tag{2.64}
\]

with updated hyperparameters

\[
\nu_y^* = \nu_y + n_y, \tag{2.65}
\]

\[
m_y^* = \frac{\nu_y m_y + n_y \hat{\mu}_y}{\nu_y + n_y}, \tag{2.66}
\]

\[
\kappa_y^* = \kappa_y + n_y, \tag{2.67}
\]

\[
S_y^* = S_y + (n_y - 1)\hat{\Sigma}_y + \frac{\nu_y n_y}{\nu_y + n_y}(\hat{\mu}_y - m_y)(\hat{\mu}_y - m_y)^T. \tag{2.68}
\]
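Before turning to the proof, the update in Eqs. 2.65 through 2.68 is simple to compute directly from a class-$y$ sample. The following hedged sketch (our own function and variable names, not the book's code) does so for the general covariance independent model:

```python
import numpy as np

def update_hyperparameters(X, nu, m, kappa, S):
    """X: (n_y, D) array of class-y points (n_y >= 2); (nu, m, kappa, S) are the
    prior hyperparameters of Eqs. 2.61-2.62. Returns (nu*, m*, kappa*, S*)."""
    n_y, D = X.shape
    mu_hat = X.mean(axis=0)                                  # sample mean
    Sigma_hat = np.atleast_2d(np.cov(X, rowvar=False))       # sample covariance
    nu_star = nu + n_y                                       # Eq. 2.65
    m_star = (nu * m + n_y * mu_hat) / (nu + n_y)            # Eq. 2.66
    kappa_star = kappa + n_y                                 # Eq. 2.67
    diff = (mu_hat - m).reshape(-1, 1)
    S_star = S + (n_y - 1) * Sigma_hat \
             + nu * n_y / (nu + n_y) * diff @ diff.T         # Eq. 2.68
    return nu_star, m_star, kappa_star, S_star

# Example with an improper prior of the form in Eq. 2.76 (S = 0, nu = 0):
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 1.0], np.eye(2), size=25)
print(update_hyperparameters(X, nu=0.0, m=np.zeros(2), kappa=0.0, S=np.zeros((2, 2))))
```

With $\nu_y = 0$ the posterior mean target collapses to the sample mean, illustrating the weighting effect discussed after the proof.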

Proof. For notational ease, in the proof we write $\kappa_y$, $S_y$, $\nu_y$, and $m_y$ as $\kappa$, $S$, $\nu$, and $m$, respectively. Then, for fixed $\kappa$, $S$, $\nu$, and $m$, the posterior probabilities of the distribution parameters are found from Eq. 2.17. After some simplification, we have

\[
\pi^*(\theta_y) \propto \pi(\theta_y)\, |\Sigma_y|^{-\frac{n_y}{2}} \operatorname{etr}\left(-\frac{n_y - 1}{2}\hat{\Sigma}_y \Sigma_y^{-1}\right) \exp\left(-\frac{n_y}{2}(\mu_y - \hat{\mu}_y)^T \Sigma_y^{-1}(\mu_y - \hat{\mu}_y)\right). \tag{2.69}
\]

Our prior has a similar form to this expression and can be merged with the rest of the equation, giving

\[
\pi^*(\theta_y) \propto |\Sigma_y|^{-\frac{\kappa + n_y + D + 2}{2}} \operatorname{etr}\left(-\frac{1}{2}\left[S + (n_y - 1)\hat{\Sigma}_y\right]\Sigma_y^{-1}\right) \exp\left(-\frac{M}{2}\right), \tag{2.70}
\]

where, as long as either $n_y > 0$ or $\nu > 0$,

\[
\begin{aligned}
M &= n_y(\mu_y - \hat{\mu}_y)^T \Sigma_y^{-1}(\mu_y - \hat{\mu}_y) + \nu(\mu_y - m)^T \Sigma_y^{-1}(\mu_y - m) \\
&= (\nu + n_y)\left(\mu_y - \frac{n_y\hat{\mu}_y + \nu m}{\nu + n_y}\right)^T \Sigma_y^{-1}\left(\mu_y - \frac{n_y\hat{\mu}_y + \nu m}{\nu + n_y}\right) + \frac{\nu n_y}{\nu + n_y}(\hat{\mu}_y - m)^T \Sigma_y^{-1}(\hat{\mu}_y - m) \\
&= (\nu + n_y)\left(\mu_y - \frac{n_y\hat{\mu}_y + \nu m}{\nu + n_y}\right)^T \Sigma_y^{-1}\left(\mu_y - \frac{n_y\hat{\mu}_y + \nu m}{\nu + n_y}\right) + \frac{\nu n_y}{\nu + n_y}\operatorname{tr}\left((\hat{\mu}_y - m)(\hat{\mu}_y - m)^T \Sigma_y^{-1}\right).
\end{aligned} \tag{2.71}
\]

This leads to Eq. 2.64. ▪

The parameters in Eqs. 2.65 through 2.68 may be viewed as the updated parameters after observing the data. Similar results are found in (DeGroot, 1970). Note that the choice of $\lambda_y$ will affect the proportionality constant in $\pi^*(\theta_y)$. Rewriting Eq. 2.66 as

\[
m_y^* = \frac{1}{1 + n_y/\nu_y}\, m_y + \frac{1}{\nu_y/n_y + 1}\, \hat{\mu}_y \tag{2.72}
\]

shows the weighting effect of the prior and the data. Moreover, as $n_y \to \infty$, the first summand converges to zero and the second summand converges to the sample mean (although here convergence must be characterized relative to the random vector $\hat{\mu}_y$, a topic we will take up in a general framework when discussing error-estimation consistency). Similar comments apply to $S_y^*$ and the sample covariance. For an unknown covariance, we may also write the posterior probability as

\[
\pi^*(\theta_y) = \pi^*(\mu_y | \lambda_y)\,\pi^*(\lambda_y), \tag{2.73}
\]


where

\[
\pi^*(\mu_y | \lambda_y) = f_{m_y^*,\, \frac{1}{\nu_y^*}\Sigma_y}(\mu_y), \tag{2.74}
\]

\[
\pi^*(\lambda_y) \propto |\Sigma_y|^{-\frac{\kappa_y^* + D + 1}{2}} \operatorname{etr}\left(-\frac{1}{2} S_y^* \Sigma_y^{-1}\right). \tag{2.75}
\]

For a fixed covariance matrix, the posterior density $\pi^*(\mu_y | \lambda_y)$ for the mean is still given by Eq. 2.74. Assuming that $\nu_y^* > 0$, $\pi^*(\mu_y | \lambda_y)$ is always normalizable and Gaussian. The validity of $\pi^*(\lambda_y)$ depends on the definition of $\lambda_y$.

The model allows for improper priors. Some useful examples of improper priors occur when $S_y = 0_{D \times D}$ and $\nu_y = 0$. In this case, our prior has the form

\[
\pi(\theta_y) \propto |\Sigma_y|^{-\frac{\kappa_y + D + 2}{2}}. \tag{2.76}
\]

If $\kappa_y + D + 2 = 0$, we obtain the flat prior used by Laplace (de Laplace, 1812). Alternatively, if $\lambda_y = \Sigma_y$, then with $\kappa_y = 0$ we obtain the Jeffreys rule prior, which is designed to be invariant to differentiable one-to-one transformations of the parameters (Jeffreys, 1946, 1961), and with $\kappa_y = -1$ we obtain the independent Jeffreys prior, which uses the same principle as the Jeffreys rule prior but treats the mean and covariance matrix as independent parameters. When improper priors are used, the posterior must always be a valid probability density; e.g., for the general covariance model we require that $\nu_y^* > 0$, $\kappa_y^* > D - 1$, and that $S_y^*$ be symmetric positive definite.

2.5.2 Homoscedastic covariance model

In this section, we modify the Gaussian Bayesian model to allow unknown means and a common (homoscedastic) unknown covariance matrix; that is, we assume that $\theta_y = (\mu_y, \lambda)$, where $\lambda$ (and the covariance $\Sigma$) is the same for class 0 and 1. The homoscedastic known covariance case is covered in Section 2.5.1. Thus, $\theta = (c, \mu_0, \mu_1, \lambda)$. We further assume that $c$ is independent so that Eq. 2.11 applies, and that $\mu_0$ and $\mu_1$ are independent given $\lambda$. Hence,

\[
\pi(\theta_0, \theta_1) = \pi(\mu_0 | \mu_1, \lambda)\,\pi(\mu_1 | \lambda)\,\pi(\lambda) = \pi(\mu_0 | \lambda)\,\pi(\mu_1 | \lambda)\,\pi(\lambda). \tag{2.77}
\]

Given the definitions in Eqs. 2.58 and 2.59, we assume that $\Sigma$ is invertible (almost surely) and that for invertible $\Sigma$, $\pi(\mu_y | \lambda) \propto f_{\mathrm{m}}(\mu_y; \nu_y, m_y, \lambda)$ for $y \in \{0, 1\}$ and $\pi(\lambda) \propto f_{\mathrm{c}}(\lambda; \kappa, S)$. Constraints and interpretations for the hyperparameters $\nu_y$, $m_y$, $\kappa$, and $S$ are similar to those of $\nu_y$, $m_y$, $\kappa_y$, and $S_y$ in the independent covariance model, except that now $\kappa$ and $S$ represent both classes. Any covariance structure may be assumed by $\pi(\lambda)$, with the same structure being used in both classes. Given a sample $S_n$, the posterior of $\theta_0$ and $\theta_1$ is given by

\[
\pi^*(\theta_0, \theta_1) = f(\mu_0, \mu_1, \lambda | S_n). \tag{2.78}
\]

Theorem 2.5 (Dalton and Dougherty, 2013a). In the homoscedastic covariance model, assuming that $\pi^*(\theta_0, \theta_1)$ is normalizable, the posterior distribution possesses the same form as the prior and satisfies

\[
\pi^*(\theta_0, \theta_1) \propto f_{\mathrm{m}}(\mu_0; \nu_0^*, m_0^*, \lambda)\, f_{\mathrm{m}}(\mu_1; \nu_1^*, m_1^*, \lambda)\, f_{\mathrm{c}}(\lambda; \kappa^*, S^*), \tag{2.79}
\]

where $\nu_y^*$ is given in Eq. 2.65, $m_y^*$ is given in Eq. 2.66, and

\[
\kappa^* = \kappa + n, \tag{2.80}
\]

\[
S^* = S + \sum_{y=0}^{1}\left[(n_y - 1)\hat{\Sigma}_y + \frac{\nu_y n_y}{\nu_y + n_y}(\hat{\mu}_y - m_y)(\hat{\mu}_y - m_y)^T\right]. \tag{2.81}
\]
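Numerically, the homoscedastic update differs from Theorem 2.4 only in that $\kappa^*$ and $S^*$ pool both classes. A short, hedged sketch (our own helper, with invented names) illustrating Eqs. 2.80–2.81:

```python
import numpy as np

def update_homoscedastic(X0, X1, nu, m, kappa, S):
    """X0, X1: (n_y, D) arrays with n_y >= 2; nu, m: per-class prior lists;
    kappa, S: shared covariance hyperparameters. Returns per-class (nu*, m*)
    plus the pooled kappa* and S* of Eqs. 2.80-2.81."""
    per_class = []
    S_star = S.astype(float).copy()
    n_total = 0
    for X, nu_y, m_y in ((X0, nu[0], m[0]), (X1, nu[1], m[1])):
        n_y = len(X)
        n_total += n_y
        mu_hat = X.mean(axis=0)
        Sigma_hat = np.atleast_2d(np.cov(X, rowvar=False))
        diff = (mu_hat - m_y).reshape(-1, 1)
        S_star += (n_y - 1) * Sigma_hat + nu_y * n_y / (nu_y + n_y) * diff @ diff.T
        per_class.append((nu_y + n_y, (nu_y * m_y + n_y * mu_hat) / (nu_y + n_y)))
    return per_class, kappa + n_total, S_star      # kappa* = kappa + n, Eq. 2.80
```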

Proof. From Bayes' theorem,

\[
\begin{aligned}
f(\mu_0, \mu_1, \lambda | S_n) &\propto f(S_n | \mu_0, \mu_1, \lambda)\, f(\mu_0, \mu_1, \lambda) \\
&= \left[\prod_{i=1}^{n_0} f_{\mu_0,\Sigma}(x_i^0)\right]\left[\prod_{i=1}^{n_1} f_{\mu_1,\Sigma}(x_i^1)\right] \pi(\mu_0 | \lambda)\,\pi(\mu_1 | \lambda)\,\pi(\lambda) \\
&\propto \left[\prod_{i=1}^{n_0} f_{\mu_0,\Sigma}(x_i^0)\right] f_{\mathrm{m}}(\mu_0; \nu_0, m_0, \lambda)\, f_{\mathrm{c}}(\lambda; \kappa, S) \times \left[\prod_{i=1}^{n_1} f_{\mu_1,\Sigma}(x_i^1)\right] f_{\mathrm{m}}(\mu_1; \nu_1, m_1, \lambda).
\end{aligned} \tag{2.82}
\]

As in the independent covariance model,

\[
f_{\mathrm{m}}(\mu_0; \nu_0, m_0, \lambda)\, f_{\mathrm{c}}(\lambda; \kappa, S) \prod_{i=1}^{n_0} f_{\mu_0,\Sigma}(x_i^0) \propto f_{\mathrm{m}}(\mu_0; \nu_0^*, m_0^*, \lambda)\, f_{\mathrm{c}}(\lambda; \kappa_0^*, S_0^*) \tag{2.83}
\]

with updated hyperparameters given by

\[
\nu_0^* = \nu_0 + n_0, \tag{2.84}
\]


\[
m_0^* = \frac{\nu_0 m_0 + n_0 \hat{\mu}_0}{\nu_0 + n_0}, \tag{2.85}
\]

\[
\kappa_0^* = \kappa + n_0, \tag{2.86}
\]

\[
S_0^* = S + (n_0 - 1)\hat{\Sigma}_0 + \frac{\nu_0 n_0}{\nu_0 + n_0}(\hat{\mu}_0 - m_0)(\hat{\mu}_0 - m_0)^T. \tag{2.87}
\]

These equations have the same form as the updated hyperparameters in Theorem 2.4. At this point we have

\[
f(\mu_0, \mu_1, \lambda | S_n) \propto f_{\mathrm{m}}(\mu_0; \nu_0^*, m_0^*, \lambda)\, f_{\mathrm{c}}(\lambda; \kappa_0^*, S_0^*)\, f_{\mathrm{m}}(\mu_1; \nu_1, m_1, \lambda) \prod_{i=1}^{n_1} f_{\mu_1,\Sigma}(x_i^1). \tag{2.88}
\]

Once again,

\[
f_{\mathrm{m}}(\mu_1; \nu_1, m_1, \lambda)\, f_{\mathrm{c}}(\lambda; \kappa_0^*, S_0^*) \prod_{i=1}^{n_1} f_{\mu_1,\Sigma}(x_i^1) \propto f_{\mathrm{m}}(\mu_1; \nu_1^*, m_1^*, \lambda)\, f_{\mathrm{c}}(\lambda; \kappa^*, S^*) \tag{2.89}
\]

with new updated hyperparameters given by

\[
\nu_1^* = \nu_1 + n_1, \tag{2.90}
\]

\[
m_1^* = \frac{\nu_1 m_1 + n_1 \hat{\mu}_1}{\nu_1 + n_1}, \tag{2.91}
\]

\[
\kappa^* = \kappa_0^* + n_1 = \kappa + n, \tag{2.92}
\]

\[
\begin{aligned}
S^* &= S_0^* + (n_1 - 1)\hat{\Sigma}_1 + \frac{\nu_1 n_1}{\nu_1 + n_1}(\hat{\mu}_1 - m_1)(\hat{\mu}_1 - m_1)^T \\
&= S + (n_0 - 1)\hat{\Sigma}_0 + \frac{\nu_0 n_0}{\nu_0 + n_0}(\hat{\mu}_0 - m_0)(\hat{\mu}_0 - m_0)^T + (n_1 - 1)\hat{\Sigma}_1 + \frac{\nu_1 n_1}{\nu_1 + n_1}(\hat{\mu}_1 - m_1)(\hat{\mu}_1 - m_1)^T.
\end{aligned} \tag{2.93}
\]

Combining these results yields the theorem. ▪

We require each component in Eq. 2.79 to be normalizable so that $\pi^*(\theta_0, \theta_1)$ is proper. The normalized version of $f_{\mathrm{m}}(\mu_y; \nu_y^*, m_y^*, \lambda)$ is $N(m_y^*, \Sigma/\nu_y^*)$ for $y \in \{0, 1\}$ (we require that $\nu_y^* > 0$), and we name the normalized distributions $\pi^*(\mu_y | \lambda)$. Similarly, we let $\pi^*(\lambda)$ be the normalized version of $f_{\mathrm{c}}(\lambda; \kappa^*, S^*)$. Therefore, we may write

\[
\pi^*(\theta_0, \theta_1) = \pi^*(\mu_0 | \lambda)\,\pi^*(\mu_1 | \lambda)\,\pi^*(\lambda). \tag{2.94}
\]

Furthermore, the marginal posterior of $\theta_y$ is

\[
\pi^*(\theta_y) = \pi^*(\mu_y | \lambda)\,\pi^*(\lambda). \tag{2.95}
\]

2.5.3 Effective class-conditional densities

This section provides effective class-conditional densities for different covariance models. Since the effective class-conditional densities are found separately, a different covariance model may be used for each class, and these results apply for both the independent and homoscedastic covariance models. To simplify notation, in this section and the next section, we denote hyperparameters without subscripts. The equations for $\nu^*$ and $m^*$ are given by Eqs. 2.65 and 2.66, respectively, in both independent and homoscedastic models. The equations for $\kappa^*$ and $S^*$ are given by Eqs. 2.67 and 2.68, respectively, in independent models, and by Eqs. 2.80 and 2.81, respectively, in homoscedastic models. We first consider a fixed covariance matrix.

Theorem 2.6 (Dalton and Dougherty, 2013a). Assuming that $\nu^* > 0$ and that $\Sigma_y$ is symmetric positive definite, the effective class-conditional density for the fixed covariance model is Gaussian with mean $m^*$ and covariance $[(\nu^* + 1)/\nu^*]\Sigma_y$:

\[
f_\Theta(x|y) = f_{m^*,\, \frac{\nu^*+1}{\nu^*}\Sigma_y}(x) = \frac{1}{(2\pi)^{D/2}\left|\frac{\nu^*+1}{\nu^*}\Sigma_y\right|^{1/2}} \exp\left(-\frac{1}{2}(x - m^*)^T\left[\frac{\nu^*+1}{\nu^*}\Sigma_y\right]^{-1}(x - m^*)\right). \tag{2.96}
\]

Proof. From the definition of the effective class-conditional distribution, and given that the posterior $\pi^*(\mu_y | \Sigma_y)$ is essentially $f_{\mathrm{m}}(\mu_y; \nu^*, m^*, \Sigma_y)$ (the latter is not normalized to have an integral equal to 1),

\[
f_\Theta(x|y) = \int_{\mathbb{R}^D} f_{\mu_y,\Sigma_y}(x)\,\pi^*(\mu_y | \Sigma_y)\, d\mu_y = \int_{\mathbb{R}^D} \frac{\exp\left(-\frac{1}{2}(x - \mu_y)^T\Sigma_y^{-1}(x - \mu_y)\right)}{(2\pi)^{D/2}|\Sigma_y|^{1/2}} \cdot \frac{(\nu^*)^{D/2}\exp\left(-\frac{\nu^*}{2}(\mu_y - m^*)^T\Sigma_y^{-1}(\mu_y - m^*)\right)}{(2\pi)^{D/2}|\Sigma_y|^{1/2}}\, d\mu_y. \tag{2.97}
\]

Some algebra shows that

\[
\begin{aligned}
(x - \mu_y)^T\Sigma_y^{-1}(x - \mu_y) &+ \nu^*(\mu_y - m^*)^T\Sigma_y^{-1}(\mu_y - m^*) \\
&= (\nu^* + 1)\left(\mu_y - \frac{x + \nu^* m^*}{\nu^* + 1}\right)^T\Sigma_y^{-1}\left(\mu_y - \frac{x + \nu^* m^*}{\nu^* + 1}\right) + \frac{\nu^*}{\nu^* + 1}(x - m^*)^T\Sigma_y^{-1}(x - m^*) \\
&= (\nu^* + 1)(\mu_y - x')^T\Sigma_y^{-1}(\mu_y - x') + \frac{\nu^*}{\nu^* + 1}(x - m^*)^T\Sigma_y^{-1}(x - m^*),
\end{aligned} \tag{2.98}
\]

where $x' = (x + \nu^* m^*)/(\nu^* + 1)$. Hence,

\[
\begin{aligned}
f_\Theta(x|y) &= \frac{(\nu^*)^{D/2}}{(2\pi)^D|\Sigma_y|}\exp\left(-\frac{\nu^*}{2(\nu^* + 1)}(x - m^*)^T\Sigma_y^{-1}(x - m^*)\right)\int_{\mathbb{R}^D}\exp\left(-\frac{\nu^* + 1}{2}(\mu_y - x')^T\Sigma_y^{-1}(\mu_y - x')\right)d\mu_y \\
&= \frac{(\nu^*)^{D/2}}{(\nu^* + 1)^{D/2}(2\pi)^{D/2}|\Sigma_y|^{1/2}}\exp\left(-\frac{\nu^*}{2(\nu^* + 1)}(x - m^*)^T\Sigma_y^{-1}(x - m^*)\right) = f_{m^*,\, \frac{\nu^*+1}{\nu^*}\Sigma_y}(x),
\end{aligned} \tag{2.99}
\]

as stated in the theorem. ▪

See (Geisser, 1964) for a special case of Theorem 2.6 assuming flat priors on the means.

We next consider a scaled identity covariance matrix, that is, $\lambda_y = \sigma_y^2$ and $\Sigma_y = \sigma_y^2 I_D$. In this model, the parameter space of $\sigma_y^2$ is $\Lambda_y = (0, \infty)$. The following lemma shows that the posterior on $\sigma_y^2$ is inverse-gamma distributed. This fact also holds for the prior, i.e., the special case where there is no data. Thus, the scaled identity model equivalently assumes an inverse-gamma prior on the variance shared between all features.

Lemma 2.1 (Dalton and Dougherty, 2011c). Under the scaled identity model, assuming that $\pi^*(\theta_y)$ is normalizable, $\pi^*(\sigma_y^2)$ has an inverse-gamma distribution,

\[
\pi^*(\sigma_y^2) = \frac{\beta^\alpha}{\Gamma(\alpha)}\frac{1}{(\sigma_y^2)^{\alpha+1}}\exp\left(-\frac{\beta}{\sigma_y^2}\right), \tag{2.100}
\]

with shape and scale parameters

\[
\alpha = \frac{(\kappa^* + D + 1)D}{2} - 1, \tag{2.101}
\]

\[
\beta = \frac{1}{2}\operatorname{tr}(S^*). \tag{2.102}
\]

For a proper posterior, we require that $\alpha > 0$ and $\beta > 0$, or, equivalently, that $(\kappa^* + D + 1)D > 2$ and $S^*$ is symmetric positive definite.

Proof. In the scaled identity covariance model, we set $\lambda_y = \sigma_y^2$ and $\Sigma_y = \sigma_y^2 I_D$. From Theorems 2.4 and 2.5, the posterior $\pi^*(\sigma_y^2)$ has the form given in Eq. 2.59. Plugging in $\Sigma_y = \sigma_y^2 I_D$ and the prior or posterior hyperparameters of either the independent or homoscedastic models, this results in a univariate function on $\sigma_y^2$ that is an inverse-gamma distribution, as in Eq. 2.100, for which the normalization is well-known. ▪

Theorem 2.7 (Dalton and Dougherty, 2013a). Assuming that $\nu^* > 0$, $\alpha > 0$, and $\beta > 0$, where $\alpha$ and $\beta$ are given by Eqs. 2.101 and 2.102, respectively, the effective class-conditional density for the scaled identity covariance model is a multivariate t-distribution with $2\alpha$ degrees of freedom, location vector $m^*$, and scale matrix $[\beta(\nu^* + 1)/(\alpha\nu^*)]I_D$:

\[
f_\Theta(x|y) = \frac{\Gamma\left(\alpha + \frac{D}{2}\right)}{\Gamma(\alpha)(2\alpha)^{D/2}\pi^{D/2}}\left|\frac{\beta(\nu^* + 1)}{\alpha\nu^*}I_D\right|^{-\frac{1}{2}}\left(1 + \frac{1}{2\alpha}(x - m^*)^T\left[\frac{\beta(\nu^* + 1)}{\alpha\nu^*}I_D\right]^{-1}(x - m^*)\right)^{-\left(\alpha + \frac{D}{2}\right)}. \tag{2.103}
\]

This distribution is proper, the mean exists and is $m^*$ as long as $\alpha > 0.5$, and the covariance exists and is $[\beta(\nu^* + 1)/((\alpha - 1)\nu^*)]I_D$ as long as $\alpha > 1$.

Proof. By definition,

\[
f_\Theta(x|y) = \int_0^\infty\int_{\mathbb{R}^D} f_{\mu_y,\sigma_y^2 I_D}(x)\,\pi^*(\mu_y | \sigma_y^2)\,\pi^*(\sigma_y^2)\, d\mu_y\, d\sigma_y^2 = \int_0^\infty f_{m^*,\, \frac{\nu^*+1}{\nu^*}\sigma_y^2 I_D}(x)\,\pi^*(\sigma_y^2)\, d\sigma_y^2, \tag{2.104}
\]

where we have applied Theorem 2.6 for fixed covariance modeling. Continuing,


\[
\begin{aligned}
f_\Theta(x|y) &= \int_0^\infty \frac{(\nu^*)^{D/2}}{(\nu^* + 1)^{D/2}(2\pi)^{D/2}(\sigma_y^2)^{D/2}}\exp\left(-\frac{\nu^*}{2\sigma_y^2(\nu^* + 1)}(x - m^*)^T(x - m^*)\right)\frac{\beta^\alpha}{\Gamma(\alpha)}\frac{1}{(\sigma_y^2)^{\alpha+1}}\exp\left(-\frac{\beta}{\sigma_y^2}\right)d\sigma_y^2 \\
&= \int_0^\infty \frac{(\nu^*)^{D/2}\beta^\alpha}{(\nu^* + 1)^{D/2}(2\pi)^{D/2}\Gamma(\alpha)(\sigma_y^2)^{\alpha + \frac{D}{2} + 1}}\exp\left(-\left[\beta + \frac{\nu^*}{2(\nu^* + 1)}(x - m^*)^T(x - m^*)\right]\frac{1}{\sigma_y^2}\right)d\sigma_y^2 \\
&= \frac{(\nu^*)^{D/2}\beta^\alpha\,\Gamma\left(\alpha + \frac{D}{2}\right)}{(\nu^* + 1)^{D/2}(2\pi)^{D/2}\Gamma(\alpha)\left[\beta + \frac{\nu^*}{2(\nu^* + 1)}(x - m^*)^T(x - m^*)\right]^{\alpha + \frac{D}{2}}},
\end{aligned} \tag{2.105}
\]

where the last line follows because the integrand is essentially an inverse-gamma distribution. This is a multivariate t-distribution as stated in the theorem. It is proper because $\beta(\nu^* + 1)/(\alpha\nu^*) > 0$ (so the scale matrix is positive definite) and $2\alpha > 0$. ▪

Next we consider the diagonal covariance model, where $\Sigma_y$ is unknown with a diagonal structure; i.e., $\lambda_y = [\sigma_{y,1}^2, \sigma_{y,2}^2, \ldots, \sigma_{y,D}^2]^T$ and $\sigma_{y,i}^2$ is the $i$th diagonal entry of $\Sigma_y$. In this model, the parameter space of $\sigma_{y,i}^2$ is $(0, \infty)$ for all $i = 1, 2, \ldots, D$; thus, $\Lambda_y = (0, \infty)^D$. The following lemma shows that the $\sigma_{y,i}^2$ are independent and inverse-gamma distributed. This holds for both the prior and posterior. Thus, the diagonal covariance model equivalently assumes independent inverse-gamma priors on the variance of each feature.

Lemma 2.2. Under the diagonal covariance model, assuming that $\pi^*(\theta_y)$ is normalizable, $\pi^*(\sigma_{y,1}^2, \sigma_{y,2}^2, \ldots, \sigma_{y,D}^2)$ is the joint density of $D$ independent inverse-gamma distributions,

\[
\pi^*(\sigma_{y,1}^2, \sigma_{y,2}^2, \ldots, \sigma_{y,D}^2) = \prod_{i=1}^D \frac{\beta_i^\alpha}{\Gamma(\alpha)}\frac{1}{(\sigma_{y,i}^2)^{\alpha+1}}\exp\left(-\frac{\beta_i}{\sigma_{y,i}^2}\right), \tag{2.106}
\]

with shape and scale parameters

\[
\alpha = \frac{\kappa^* + D - 1}{2}, \tag{2.107}
\]


\[
\beta_i = \frac{s_{ii}^*}{2}, \tag{2.108}
\]

where $s_{ii}^*$ is the $i$th diagonal element of $S^*$. For a proper posterior, we require that $\alpha > 0$ and $\beta_i > 0$ for all $i = 1, 2, \ldots, D$, or, equivalently, that $\kappa^* + D - 1 > 0$ and $s_{ii}^* > 0$.

Proof. From Theorems 2.4 and 2.5, the posterior $\pi^*(\sigma_{y,1}^2, \sigma_{y,2}^2, \ldots, \sigma_{y,D}^2)$ has the form given in Eq. 2.59. Plugging in the posterior hyperparameters of either the independent or homoscedastic models, this results in a function on $\sigma_{y,1}^2, \sigma_{y,2}^2, \ldots, \sigma_{y,D}^2$ of the form

\[
\left(\prod_{i=1}^D \sigma_{y,i}^2\right)^{-\frac{\kappa^* + D + 1}{2}}\exp\left(-\frac{1}{2}\sum_{j=1}^D \frac{s_{jj}^*}{\sigma_{y,j}^2}\right) = \prod_{i=1}^D (\sigma_{y,i}^2)^{-\frac{\kappa^* + D + 1}{2}}\exp\left(-\frac{s_{ii}^*}{2\sigma_{y,i}^2}\right). \tag{2.109}
\]

Each term in the product on the right-hand side is an inverse-gamma distribution with parameters given in Eqs. 2.107 and 2.108. ▪

Theorem 2.8. Assuming that $\nu^* > 0$, $\alpha > 0$, and $\beta_i > 0$ for all $i = 1, 2, \ldots, D$, where $\alpha$ and $\beta_i$ are given by Eqs. 2.107 and 2.108, respectively, the effective class-conditional density for the diagonal covariance model is the joint density between $D$ independent non-standardized Student's t-distributions. That is,

\[
f_\Theta(x|y) = \prod_{i=1}^D \frac{\Gamma\left(\alpha + \frac{1}{2}\right)}{\Gamma(\alpha)\pi^{1/2}(2\alpha)^{1/2}}\left(\frac{\beta_i(\nu^* + 1)}{\alpha\nu^*}\right)^{-\frac{1}{2}}\left(1 + \frac{1}{2\alpha}\left(\frac{\beta_i(\nu^* + 1)}{\alpha\nu^*}\right)^{-1}(x_i - m_i^*)^2\right)^{-\left(\alpha + \frac{1}{2}\right)}, \tag{2.110}
\]

where $x_i$ is the $i$th element in $x$, and $m_i^*$ is the $i$th element in $m^*$. The $i$th distribution has $2\alpha = \kappa^* + D - 1$ degrees of freedom, location parameter $m_i^*$, and scale parameter

\[
\frac{\beta_i(\nu^* + 1)}{\alpha\nu^*} = \frac{s_{ii}^*(\nu^* + 1)}{(\kappa^* + D - 1)\nu^*}. \tag{2.111}
\]

$f_\Theta(x|y)$ is proper, the mean exists as long as $\alpha > 0.5$ and equals $m^*$, and the covariance exists as long as $\alpha > 1$ and equals a diagonal matrix with $i$th diagonal entry

\[
\frac{\beta_i(\nu^* + 1)}{\alpha\nu^*}\cdot\frac{2\alpha}{2\alpha - 2} = \frac{s_{ii}^*(\nu^* + 1)}{(\kappa^* + D - 3)\nu^*}. \tag{2.112}
\]


Proof. By definition,

\[
f_\Theta(x|y) = \int_{\Lambda_y}\int_{\mathbb{R}^D} f_{\mu_y,\Sigma_y}(x)\,\pi^*(\mu_y | \lambda_y)\,\pi^*(\lambda_y)\, d\mu_y\, d\lambda_y = \int_{\Lambda_y} f_{m^*,\, \frac{\nu^*+1}{\nu^*}\Sigma_y}(x)\,\pi^*(\lambda_y)\, d\lambda_y, \tag{2.113}
\]

where we have applied Theorem 2.6 for fixed covariance modeling. Continuing,

\[
\begin{aligned}
f_\Theta(x|y) &= \int_0^\infty\!\!\cdots\!\int_0^\infty \frac{(\nu^*)^{D/2}}{(\nu^* + 1)^{D/2}(2\pi)^{D/2}\left(\prod_{i=1}^D\sigma_{y,i}^2\right)^{1/2}}\exp\left(-\frac{\nu^*}{2(\nu^* + 1)}\sum_{i=1}^D\frac{(x_i - m_i^*)^2}{\sigma_{y,i}^2}\right)\prod_{j=1}^D\frac{\beta_j^\alpha}{\Gamma(\alpha)}\frac{1}{(\sigma_{y,j}^2)^{\alpha+1}}\exp\left(-\frac{\beta_j}{\sigma_{y,j}^2}\right)d\sigma_{y,1}^2\cdots d\sigma_{y,D}^2 \\
&= \prod_{i=1}^D\int_0^\infty\left(\frac{\nu^*}{(\nu^* + 1)2\pi\sigma_{y,i}^2}\right)^{1/2}\exp\left(-\frac{\nu^*(x_i - m_i^*)^2}{2(\nu^* + 1)\sigma_{y,i}^2}\right)\frac{\beta_i^\alpha}{\Gamma(\alpha)}\frac{1}{(\sigma_{y,i}^2)^{\alpha+1}}\exp\left(-\frac{\beta_i}{\sigma_{y,i}^2}\right)d\sigma_{y,i}^2.
\end{aligned} \tag{2.114}
\]

Going further,

\[
\begin{aligned}
f_\Theta(x|y) &= \prod_{i=1}^D\left(\frac{\nu^*}{(\nu^* + 1)2\pi}\right)^{1/2}\frac{\beta_i^\alpha}{\Gamma(\alpha)}\int_0^\infty\left(\frac{1}{\sigma_{y,i}^2}\right)^{\alpha + \frac{3}{2}}\exp\left(-\left[\beta_i + \frac{\nu^*(x_i - m_i^*)^2}{2(\nu^* + 1)}\right]\frac{1}{\sigma_{y,i}^2}\right)d\sigma_{y,i}^2 \\
&= \prod_{i=1}^D\frac{\Gamma\left(\alpha + \frac{1}{2}\right)}{\Gamma(\alpha)}\left(\frac{\nu^*}{(\nu^* + 1)2\pi}\right)^{1/2}\beta_i^\alpha\left(\beta_i + \frac{\nu^*(x_i - m_i^*)^2}{2(\nu^* + 1)}\right)^{-\left(\alpha + \frac{1}{2}\right)},
\end{aligned} \tag{2.115}
\]

where the last line follows because the integrand is essentially an inverse-gamma distribution. We can also write

\[
f_\Theta(x|y) = \prod_{i=1}^D\frac{\Gamma\left(\alpha + \frac{1}{2}\right)}{\Gamma(\alpha)\pi^{1/2}(2\alpha)^{1/2}}\left(\frac{\beta_i(\nu^* + 1)}{\alpha\nu^*}\right)^{-\frac{1}{2}}\left(1 + \frac{1}{2\alpha}\left(\frac{\beta_i(\nu^* + 1)}{\alpha\nu^*}\right)^{-1}(x_i - m_i^*)^2\right)^{-\left(\alpha + \frac{1}{2}\right)}. \tag{2.116}
\]

Each term in the product is a non-standardized Student's t-distribution as stated in the theorem. The distribution is proper because $\beta_i(\nu^* + 1)/(\alpha\nu^*) > 0$ (so the scale parameter is positive) and $2\alpha > 0$. ▪

Lastly, we consider the general covariance model, $\Sigma_y = \lambda_y$, the parameter space being the set of all symmetric positive definite matrices, which we denote by $\Lambda_y = \{\Sigma_y : \Sigma_y \succ 0\}$. We require the multivariate gamma function, which is defined by the following integral over the space of $D \times D$ positive definite matrices (O'Hagan and Forster, 2004; Mardia et al., 1979):

\[
\Gamma_D(a) = \int_{\Sigma \succ 0} |\Sigma|^{a - \frac{D+1}{2}}\operatorname{etr}(-\Sigma)\, d\Sigma. \tag{2.117}
\]

An equivalent formulation, one better suited for numerical approximations, is given by

\[
\Gamma_D(a) = \pi^{D(D-1)/4}\prod_{j=1}^D \Gamma\left(a + \frac{1 - j}{2}\right). \tag{2.118}
\]
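Equation 2.118 is straightforward to evaluate on the log scale. As a quick, hedged check (our own code, not the book's), the product formula can be compared against SciPy's built-in multivariate log-gamma:

```python
# Cross-check of Eq. 2.118 against scipy.special.multigammaln (log scale).
import numpy as np
from scipy.special import gammaln, multigammaln

def log_multivariate_gamma(a, D):
    """log Gamma_D(a) via the finite product in Eq. 2.118."""
    j = np.arange(1, D + 1)
    return D * (D - 1) / 4.0 * np.log(np.pi) + np.sum(gammaln(a + (1 - j) / 2.0))

a, D = 7.3, 4
print(log_multivariate_gamma(a, D), multigammaln(a, D))   # the two values should agree
```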

Lemma 2.3 (Dalton and Dougherty, 2011c). Under the general covariance model, assuming that $\pi^*(\theta_y)$ is normalizable, $\pi^*(\Sigma_y)$ has an inverse-Wishart distribution,

\[
\pi^*(\Sigma_y) = \frac{|S^*|^{\frac{\kappa^*}{2}}}{2^{\frac{\kappa^* D}{2}}\Gamma_D\left(\frac{\kappa^*}{2}\right)}|\Sigma_y|^{-\frac{\kappa^* + D + 1}{2}}\operatorname{etr}\left(-\frac{1}{2}S^*\Sigma_y^{-1}\right). \tag{2.119}
\]

For a proper posterior, we require that $\kappa^* > D - 1$ and that $S^*$ be symmetric positive definite.

Proof. From Theorems 2.4 and 2.5, the posterior $\pi^*(\Sigma_y)$ has the form given in Eq. 2.59. Plugging in the posterior hyperparameters of either the independent or homoscedastic models, this results in a function on $\Sigma_y$ that is essentially an inverse-Wishart distribution, as in Eq. 2.119, for which the normalization is well-known (Muller and Stewart, 2006). ▪
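For readers who prefer to work numerically, Lemma 2.3 corresponds directly to SciPy's inverse-Wishart parameterization. The short, hedged sketch below (our own code, with illustrative hyperparameter values) instantiates $\pi^*(\Sigma_y)$ from posterior hyperparameters and draws a sample from it:

```python
# Posterior over the class-y covariance in the general covariance model
# (Lemma 2.3): inverse-Wishart with df = kappa* and scale matrix S*.
import numpy as np
from scipy.stats import invwishart

D = 2
kappa_star = 12.0                     # must exceed D - 1 for a proper posterior
S_star = np.array([[2.0, 0.3],
                   [0.3, 1.5]])       # symmetric positive definite

posterior = invwishart(df=kappa_star, scale=S_star)
Sigma_draw = posterior.rvs(random_state=0)     # one posterior sample of Sigma_y
print(posterior.mean())                        # equals S*/(kappa* - D - 1), cf. Eq. 2.63
```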


Theorem 2.9 (Dalton and Dougherty, 2013a). Assuming that $\nu^* > 0$, $\kappa^* > D - 1$, and $S^*$ is symmetric positive definite, the effective class-conditional density for the general covariance model is a multivariate t-distribution with $k = \kappa^* - D + 1$ degrees of freedom, location vector $m^*$, and scale matrix $[(\nu^* + 1)/((\kappa^* - D + 1)\nu^*)]S^*$. That is,

\[
f_\Theta(x|y) = \frac{\Gamma\left(\frac{k + D}{2}\right)}{\Gamma\left(\frac{k}{2}\right)k^{D/2}\pi^{D/2}\left|\frac{\nu^* + 1}{(\kappa^* - D + 1)\nu^*}S^*\right|^{1/2}}\left(1 + \frac{1}{k}(x - m^*)^T\left[\frac{\nu^* + 1}{(\kappa^* - D + 1)\nu^*}S^*\right]^{-1}(x - m^*)\right)^{-\frac{k + D}{2}}. \tag{2.120}
\]

This distribution is proper, the mean exists and is $m^*$ as long as $\kappa^* > D$, and the covariance exists and is $[(\nu^* + 1)/((\kappa^* - D - 1)\nu^*)]S^*$ as long as $\kappa^* > D + 1$.

Proof. By definition,

\[
f_\Theta(x|y) = \int_{\Lambda_y}\int_{\mathbb{R}^D} f_{\mu_y,\Sigma_y}(x)\,\pi^*(\mu_y | \Sigma_y)\,\pi^*(\Sigma_y)\, d\mu_y\, d\Sigma_y = \int_{\Lambda_y} f_{m^*,\, \frac{\nu^*+1}{\nu^*}\Sigma_y}(x)\,\pi^*(\Sigma_y)\, d\Sigma_y, \tag{2.121}
\]

where in the last line we have used Theorem 2.6 for the fixed covariance model. Continuing,

\[
\begin{aligned}
f_\Theta(x|y) &= \int_{\Lambda_y}\frac{(\nu^*)^{D/2}}{(\nu^* + 1)^{D/2}(2\pi)^{D/2}|\Sigma_y|^{1/2}}\exp\left(-\frac{\nu^*}{2(\nu^* + 1)}(x - m^*)^T\Sigma_y^{-1}(x - m^*)\right)\frac{|S^*|^{\frac{\kappa^*}{2}}}{2^{\frac{\kappa^* D}{2}}\Gamma_D\left(\frac{\kappa^*}{2}\right)}|\Sigma_y|^{-\frac{\kappa^* + D + 1}{2}}\operatorname{etr}\left(-\frac{1}{2}S^*\Sigma_y^{-1}\right)d\Sigma_y \\
&= \frac{(\nu^*)^{D/2}}{(\nu^* + 1)^{D/2}(2\pi)^{D/2}}\cdot\frac{|S^*|^{\frac{\kappa^*}{2}}}{2^{\frac{\kappa^* D}{2}}\Gamma_D\left(\frac{\kappa^*}{2}\right)}\int_{\Lambda_y}|\Sigma_y|^{-\frac{\kappa^* + D + 2}{2}}\operatorname{etr}\left(-\frac{1}{2}\left[S^* + \frac{\nu^*}{\nu^* + 1}(x - m^*)(x - m^*)^T\right]\Sigma_y^{-1}\right)d\Sigma_y.
\end{aligned} \tag{2.122}
\]

The integrand is essentially an inverse-Wishart distribution and, therefore,


\[
\begin{aligned}
f_\Theta(x|y) &= \frac{(\nu^*)^{D/2}}{(\nu^* + 1)^{D/2}(2\pi)^{D/2}}\cdot\frac{|S^*|^{\frac{\kappa^*}{2}}}{2^{\frac{\kappa^* D}{2}}\Gamma_D\left(\frac{\kappa^*}{2}\right)}\cdot\frac{2^{\frac{(\kappa^* + 1)D}{2}}\Gamma_D\left(\frac{\kappa^* + 1}{2}\right)}{\left|S^* + \frac{\nu^*}{\nu^* + 1}(x - m^*)(x - m^*)^T\right|^{\frac{\kappa^* + 1}{2}}} \\
&= \frac{\Gamma\left(\frac{\kappa^* + 1}{2}\right)(\nu^*)^{D/2}\,|S^*|^{\frac{\kappa^*}{2}}}{\Gamma\left(\frac{\kappa^* - D + 1}{2}\right)(\nu^* + 1)^{D/2}\pi^{D/2}\left|S^* + \frac{\nu^*}{\nu^* + 1}(x - m^*)(x - m^*)^T\right|^{\frac{\kappa^* + 1}{2}}}.
\end{aligned} \tag{2.123}
\]

This is a multivariate t-distribution as stated in the theorem. This distribution is proper because $(\nu^* + 1)/[(\kappa^* - D + 1)\nu^*] > 0$ and $S^*$ is symmetric positive definite (so the scale matrix is symmetric positive definite) and $\kappa^* - D + 1 > 0$. ▪

See (Geisser, 1964) for a special case of Theorem 2.9 assuming flat priors on the means and priors of the form $\pi(\Sigma_y^{-1}) \propto |\Sigma_y|^{K/2}$ for $K < n_y$ on independent and homoscedastic precision matrices.

2.5.4 Bayesian MMSE error estimator for linear classification

Suppose that the classifier discriminant is linear in form as in Eq. 1.16 with $g(x) = a^T x + b$, where vector $a$ and scalar $b$ are unrestricted functions of the sample. The classifier predicts class 0 if $g(x) \le 0$ and 1 otherwise. With fixed distribution parameters and nonzero $a$, the true error for a class-$y$ Gaussian distribution $f_{\mu_y,\Sigma_y}$ is given by Eq. 1.32. If $a = 0_D$, then $\varepsilon_n^0$ and $\varepsilon_n^1$ are deterministically 0 or 1, depending on the sign of $b$, and the Bayesian MMSE error estimator is either $\mathrm{E}_{\pi^*}[c]$ (for $b > 0$) or $1 - \mathrm{E}_{\pi^*}[c]$ (for $b \le 0$). Hence, from here forward we assume that $a \ne 0_D$. For a fixed invertible covariance $\Sigma_y$, we require that $\nu^* > 0$ to ensure that the posterior $\pi^*(\mu_y | \lambda_y)$ is proper.

Theorem 2.10 (Dalton and Dougherty, 2011c). In the Gaussian model with fixed invertible covariance matrix $\Sigma_y$, assuming that $\nu^* > 0$,

\[
\hat{\varepsilon}_n^y = \mathrm{E}_{\pi^*}[\varepsilon_n^y] = \Phi(d) \tag{2.124}
\]

for $y \in \{0, 1\}$, where

\[
d = \frac{(-1)^y g(m^*)}{\sqrt{a^T\Sigma_y a}}\sqrt{\frac{\nu^*}{\nu^* + 1}}. \tag{2.125}
\]
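As a quick sanity check of Theorem 2.10 before its proof (a hedged sketch with invented inputs, not the book's code), the closed form $\Phi(d)$ can be compared against a Monte Carlo integral of the effective class-conditional density over the half-space on which the classifier disagrees with class $y$:

```python
# Verify Eqs. 2.124-2.125 numerically for class y = 0 and a fixed covariance.
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(1)

a = np.array([1.0, -0.5]); b = -0.3            # linear discriminant g(x) = a^T x + b
m_star = np.array([0.4, 0.2]); nu_star = 8.0   # illustrative posterior hyperparameters
Sigma0 = np.array([[1.0, 0.2], [0.2, 0.8]])    # known class-0 covariance

# Closed form (Theorem 2.10), y = 0
d = (a @ m_star + b) / np.sqrt(a @ Sigma0 @ a) * np.sqrt(nu_star / (nu_star + 1))
bee_closed = norm.cdf(d)

# Monte Carlo: mass of the effective density N(m*, (nu*+1)/nu* Sigma0) on {g(x) > 0}
X = multivariate_normal(m_star, (nu_star + 1) / nu_star * Sigma0).rvs(200_000,
                                                                      random_state=rng)
bee_mc = np.mean(X @ a + b > 0)

print(bee_closed, bee_mc)   # should agree to Monte Carlo accuracy
```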


Proof. By Theorem 2.6, the effective class-conditional distribution $f_\Theta(x|y)$ is Gaussian with mean $m^*$ and covariance $[(\nu^* + 1)/\nu^*]\Sigma_y$, as given in Eq. 2.96. By Eq. 2.31,

\[
\hat{\varepsilon}_n^y(S_n, c) = \int_{(-1)^y(a^T x + b) > 0} f_\Theta(x|y)\, dx. \tag{2.126}
\]

This is the integral of a $D$-dimensional multivariate Gaussian distribution on one side of a hyperplane, which is equivalent to Eq. 1.32. Hence,

\[
\hat{\varepsilon}_n^y(S_n, c) = \Phi\left(\frac{(-1)^y g(m^*)}{\sqrt{a^T\left(\frac{\nu^* + 1}{\nu^*}\Sigma_y\right)a}}\right), \tag{2.127}
\]

which is equivalent to the claimed result. ▪

Before giving the Bayesian MMSE error estimators for the scaled covariance and general covariance Gaussian models, we require a technical lemma. The lemma utilizes a Student's t-CDF, which may be found using standard lookup tables, or equivalently, a special function involving an Euler integral called a regularized incomplete beta function defined by

\[
I(x; a, b) = \frac{1}{B(a, b)}\int_0^x t^{a-1}(1 - t)^{b-1}\, dt \tag{2.128}
\]

for $0 \le x \le 1$, $a > 0$, and $b > 0$, where the beta function $B(a, b)$ normalizes $I(\cdot; a, b)$ so that $I(1; a, b) = 1$. We only need to evaluate $I(x; 1/2, b)$ for $0 \le x < 1$ and $b > 0$. Although the integral does not have a closed-form solution for arbitrary parameters, in Chapter 4 we will provide exact expressions for $I(x; 1/2, N/2)$ for positive integers $N$. Restricting $b$ to be an integer or half-integer, which in all cases equivalently restricts $\kappa$ to be an integer, guarantees that these equations may be applied so that Bayesian MMSE error estimators for the Gaussian model with linear classification may be evaluated exactly using finite sums of common single-variable functions. Note that $\|\cdot\|_p$ denotes the $p$-norm for vectors and the entrywise $p$-norm for matrices. Moreover, the sign function $\operatorname{sgn}(x)$ equals 1 if $x$ is positive, $-1$ if $x$ is negative, and zero otherwise.

Lemma 2.4 (Dalton and Dougherty, 2011c). Let $X$ have a multivariate t-distribution with $D$ dimensions, location vector $\mu$, symmetric positive definite scale matrix $\Sigma$, and $\nu > 0$ degrees of freedom. To wit, its density is given by

\[
f(x) = \frac{\Gamma\left(\frac{\nu + D}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)\nu^{D/2}\pi^{D/2}|\Sigma|^{1/2}\left[1 + \frac{1}{\nu}(x - \mu)^T\Sigma^{-1}(x - \mu)\right]^{\frac{\nu + D}{2}}}. \tag{2.129}
\]

Then

\[
\begin{aligned}
\int_{a^T x + b \le 0} f(x)\, dx &= \Pr\left(Z < \frac{-(a^T\mu + b)}{\sqrt{a^T\Sigma a}}\right) \\
&= \frac{1}{2} - \frac{1}{2}\operatorname{sgn}(a^T\mu + b)\, I\left(\frac{(a^T\mu + b)^2}{(a^T\mu + b)^2 + \nu(a^T\Sigma a)}; \frac{1}{2}, \frac{\nu}{2}\right),
\end{aligned} \tag{2.130}
\]

where $Z$ is a Student's t-random variable with $\nu$ degrees of freedom.

Proof. If $X$ is governed by a multivariate t-distribution with the given parameters, and $a$ is a nonzero vector, then $Y = a^T X + b$ is a non-standardized Student's t-random variable with location parameter $a^T\mu + b$, positive scale parameter $a^T\Sigma a$, and $\nu$ degrees of freedom (Kotz and Nadarajah, 2004). Let $Z = [Y - (a^T\mu + b)]/\sqrt{a^T\Sigma a}$, which is now standard Student's t with $\nu$ degrees of freedom. The CDF for a Student's t-random variable is available in closed form (Shaw, 2004):

\[
\Pr(Z < z) = \frac{1}{2} + \frac{\operatorname{sgn}(z)}{2}\, I\left(\frac{z^2}{z^2 + \nu}; \frac{1}{2}, \frac{\nu}{2}\right). \tag{2.131}
\]

Our desired integral is obtained by plugging in $z = -(a^T\mu + b)/\sqrt{a^T\Sigma a}$. ▪

Theorem 2.11 (Dalton and Dougherty, 2011c). In the Gaussian model with scaled identity covariance matrix $\Sigma_y = \sigma_y^2 I_D$,

\[
\hat{\varepsilon}_n^y = \mathrm{E}_{\pi^*}[\varepsilon_n^y] = \frac{1}{2}\left(1 + \operatorname{sgn}(A)\, I\left(\frac{A^2}{A^2 + 2\beta}; \frac{1}{2}, \alpha\right)\right), \tag{2.132}
\]

where

\[
A = \frac{(-1)^y g(m^*)}{\|a\|_2}\sqrt{\frac{\nu^*}{\nu^* + 1}}, \tag{2.133}
\]

\[
\alpha = \frac{(\kappa^* + D + 1)D}{2} - 1, \tag{2.134}
\]

\[
\beta = \frac{1}{2}\operatorname{tr}(S^*), \tag{2.135}
\]

and it is assumed that $\nu^* > 0$, $\alpha > 0$, and $\beta > 0$.

Proof. By Theorem 2.7, the effective class-conditional density for the scaled identity covariance model is the multivariate t-distribution $f_\Theta(x|y)$ given in Eq. 2.103. By Theorem 2.1 and Lemma 2.4,

\[
\begin{aligned}
\hat{\varepsilon}_n^y(S_n, c) &= 1 - \int_{(-1)^y(a^T x + b) \le 0} f_\Theta(x|y)\, dx \\
&= \frac{1}{2} + \frac{\operatorname{sgn}\big((-1)^y(a^T m^* + b)\big)}{2}\, I\left(\frac{[(-1)^y(a^T m^* + b)]^2}{[(-1)^y(a^T m^* + b)]^2 + 2\alpha\,\frac{\beta(\nu^* + 1)}{\alpha\nu^*}\, a^T I_D a}; \frac{1}{2}, \alpha\right) \\
&= \frac{1}{2} + \frac{\operatorname{sgn}\big((-1)^y(a^T m^* + b)\big)}{2}\, I\left(\frac{(a^T m^* + b)^2}{(a^T m^* + b)^2 + \frac{2\beta(\nu^* + 1)}{\nu^*}\|a\|_2^2}; \frac{1}{2}, \alpha\right),
\end{aligned} \tag{2.136}
\]

which completes the proof. ▪

Theorem 2.12 (Dalton and Dougherty, 2011c). In the Gaussian model with general covariance matrix $\Sigma_y = \lambda_y$ and $\Lambda_y$ containing all symmetric positive definite matrices, assuming that $\nu^* > 0$, $\kappa^* > D - 1$, and $S^*$ is symmetric positive definite,

\[
\hat{\varepsilon}_n^y = \mathrm{E}_{\pi^*}[\varepsilon_n^y] = \frac{1}{2}\left(1 + \operatorname{sgn}(A)\, I\left(\frac{A^2}{A^2 + 2\beta}; \frac{1}{2}, \alpha\right)\right), \tag{2.137}
\]

where

\[
A = (-1)^y g(m^*)\sqrt{\frac{\nu^*}{\nu^* + 1}}, \tag{2.138}
\]

\[
\alpha = \frac{\kappa^* - D + 1}{2}, \tag{2.139}
\]

\[
\beta = \frac{a^T S^* a}{2}. \tag{2.140}
\]

Proof. By Theorem 2.9, the effective class-conditional density for the general covariance model is a multivariate t-distribution, given in Eq. 2.120. By Theorem 2.1 and Lemma 2.4,

\[
\begin{aligned}
\hat{\varepsilon}_n^y(S_n, c) &= 1 - \int_{(-1)^y(a^T x + b) \le 0} f_\Theta(x|y)\, dx \\
&= \frac{1}{2} + \frac{\operatorname{sgn}\big((-1)^y(a^T m^* + b)\big)}{2}\, I\left(\frac{(a^T m^* + b)^2}{(a^T m^* + b)^2 + \frac{\nu^* + 1}{\nu^*}a^T S^* a}; \frac{1}{2}, \frac{\kappa^* - D + 1}{2}\right),
\end{aligned} \tag{2.141}
\]

which completes the proof. ▪
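The closed form in Theorem 2.12 is easy to evaluate with SciPy's regularized incomplete beta function. The following hedged sketch (our own helper, with illustrative, made-up hyperparameters) computes the Bayesian MMSE error estimate for one class under the general covariance model:

```python
# Bayesian MMSE class-y error estimate under the general covariance model
# (Theorem 2.12). scipy.special.betainc(a, b, x) is the regularized I(x; a, b).
import numpy as np
from scipy.special import betainc

def bee_general_covariance(y, a, b, m_star, nu_star, kappa_star, S_star):
    """Expected class-y error of the linear classifier g(x) = a^T x + b."""
    g = a @ m_star + b
    A = (-1) ** y * g * np.sqrt(nu_star / (nu_star + 1))     # Eq. 2.138
    alpha = (kappa_star - len(m_star) + 1) / 2.0             # Eq. 2.139
    beta = (a @ S_star @ a) / 2.0                            # Eq. 2.140
    return 0.5 * (1 + np.sign(A) * betainc(0.5, alpha, A**2 / (A**2 + 2 * beta)))

# Illustrative posterior hyperparameters for D = 2:
a = np.array([1.0, -0.5]); b = -0.3
eps0 = bee_general_covariance(0, a, b, np.array([0.4, 0.2]), 8.0, 10.0,
                              np.array([[2.0, 0.3], [0.3, 1.5]]))
eps1 = bee_general_covariance(1, a, b, np.array([1.2, -0.1]), 8.0, 10.0,
                              np.array([[1.8, -0.2], [-0.2, 1.1]]))
print(eps0, eps1)
```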




Consider the Gaussian model with a diagonal covariance matrix. By Theorem 2.8 the effective class-conditional density $f_\Theta(x|y)$ for the diagonal covariance model is the joint density between independent non-standardized Student's t-distributions given in Eq. 2.110. By Theorem 2.1,

\[
\hat{\varepsilon}_n^y(S_n, c) = 1 - \int_{(-1)^y(a^T x + b) \le 0} f_\Theta(x|y)\, dx = 1 - \Pr\left((-1)^y\left(\sum_{i=1}^D a_i X_i + b\right) \le 0\right), \tag{2.142}
\]

where $a_i$ is the $i$th element of $a$, and $X_1, X_2, \ldots, X_D$ are independent non-standardized Student's t-random variables with the same degrees of freedom. The issue now is to find the CDF for a sum of independent Student's t-random variables. This has been found in closed form if $\kappa^* + D - 1$ is odd, in which case it can be written as a finite mixture of t-distributions (Walker and Saw, 1978; Kotz and Nadarajah, 2004). However, the procedure is quite involved, and we refer readers to the references for details.

2.6 Performance in the Gaussian Model with LDA

This section presents simulation studies examining various aspects of performance for the Bayesian MMSE error estimator in the Gaussian models. First, we provide performance results under circular Gaussian distributions, then under several non-circular Gaussian distributions. These demonstrate robustness relative to the covariance modeling assumptions. We then graph performance under Johnson distributions, which are outside the assumed Gaussian model. These simulations illustrate robustness relative to the Gaussian assumption. This is important in practice since we cannot guarantee Gaussianity. The results will show that performance does require nearly Gaussian distributions, but there is some degree of flexibility (in skewness and kurtosis).

2.6.1 Fixed circular Gaussian distributions

Here we evaluate the performance of error estimators under fixed circular Gaussian distributions with $c = 0.5$ and random sampling. In all simulations, the mean of class 0 is fixed at $\mu_0 = 0_D$ and the mean of class 1 at

\[
\mu_1 = \begin{bmatrix} 1 \\ 0_{D-1} \end{bmatrix}. \tag{2.143}
\]

The covariance of each class is chosen to make the distributions mirror images with respect to the hyperplane between the two means. This plane is the optimal linear classifier, and the classifier designed from the data is meant to approximate it. In this section, the covariances of both classes are scaled identity matrices, with the same scaling factor $\sigma^2$ in both classes, i.e., $\Sigma_0 = \Sigma_1 = \sigma^2 I_D$. The scale $\sigma^2$ of the covariance matrix is used to control the Bayes error (Eqs. 1.33 and 1.34), where a low Bayes error corresponds to small variance and high Bayes error to high variance.

The simulation is based on Fig. 2.1. A random sample is drawn from the fixed distribution (step A), the prior is updated to a posterior to use in Bayesian MMSE error estimation later (step B), and the sample is used, without feature selection, to train an LDA classifier defined by

\[
a = \hat{\Sigma}^{-1}(\hat{\mu}_1 - \hat{\mu}_0), \tag{2.144}
\]

\[
b = -\frac{1}{2}(\hat{\mu}_1 + \hat{\mu}_0)^T\hat{\Sigma}^{-1}(\hat{\mu}_1 - \hat{\mu}_0) + \ln\frac{n_1}{n_0} \tag{2.145}
\]

(see Eqs. 1.24 and 1.25), where the pooled covariance matrix $\hat{\Sigma}$ is given by Eq. 1.28 (step C). In step D, the true error of this classifier is calculated via Eq. 1.32 using the fixed true distribution parameters. The same sample is used to find five non-parametric error estimates: resubstitution, leave-one-out, cross-validation, 0.632 bootstrap, and bolstered resubstitution. We also find a parametric plug-in error estimate, which is computed via Eq. 1.32 using the sample mean and sample covariance in place of the real values, and Eq. 1.3 using the a priori class probability estimate $\hat{c} = n_0/n$ in place of $c$. Three Bayesian MMSE error estimators are also provided, using uniform priors on $c$ and the simple improper priors in Eq. 2.76 with $S_y = 0_{D \times D}$ and $\nu_y = 0$ ($m_y$ does not matter because $\nu_y = 0$). Two of these assume general independent covariances, one with $\kappa_y + D + 2 = 0$ (the flat prior) and one with $\kappa_y = 0$ (the Jeffreys rule prior), and the last assumes scaled identity independent covariances with $\kappa_y + D + 2 = 0$ (the flat prior). Using the closed-form expression in Eq. 2.14, these error estimates can be computed very quickly. For all Bayesian MMSE error estimators, in the rare event where the sample size for one class is so small that the posteriors used to find the Bayesian MMSE error estimator cannot be normalized, $\kappa_y$ is increased until the posterior is valid.

For each iteration and error estimator, the squared difference $|\hat{\varepsilon}_n - \varepsilon_n|^2$ is computed. The process is repeated over $t = 100{,}000$ samples to find a Monte Carlo approximation for the RMS deviation from the true error for each error estimator. Figure 2.8 shows the RMS of all error estimators with respect to the Bayes error for $D = 2$ and $n = 30$. We see that the Bayesian MMSE error estimator for general covariances using the flat prior is best for distributions with moderate Bayes error, but poor for very small or large Bayes error. A similar result was found in the discrete classification problem. Bolstered resubstitution is very competitive with the Bayesian MMSE error estimator for general covariances using the Jeffreys rule prior, and it is also very flexible

Figure 2.8 RMS deviation from true error for Gaussian distributions with respect to Bayes error ($D = 2$, $n = 30$). [Reprinted from (Dalton and Dougherty, 2011c).]

since it can be applied fairly easily to any classifier; however, keep in mind that bolstering is known to perform particularly well with circular densities (uncorrelated equal-variance features) like those in this example. The Bayesian MMSE error estimator for general covariances using the Jeffreys rule prior ($\kappa_y = 0$) shifts performance in favor of lower Bayes error. Recall from the form of the priors in Eq. 2.76 that a larger $\kappa_y$ will put more weight on covariances with a small determinant (usually corresponding to a small Bayes error) and less weight on those with a large determinant (usually corresponding to a large Bayes error). If the Bayes error is indeed very small, then the Bayesian MMSE error estimator using the Jeffreys rule prior is usually the best, followed by the plug-in rule, which performs exceptionally well because the sample mean and sample variance are accurate even with a small sample. Finally, regarding Fig. 2.8, note that the Bayesian MMSE error estimator assuming scaled identity covariances tends to be better than the one assuming general covariances with $\kappa_y = 0$ over the entire range of Bayes error. This makes clear the benefit of using more-constrained assumptions as long as the assumptions are correct.

RMS with respect to sample size is graphed in Fig. 2.9 for two dimensions and Bayes errors of 0.2 and 0.4. Graphs like these can be used to determine the sample size needed to guarantee a certain RMS. As the sample size increases, RMS for the parametrically based error estimators (the plug-in rule and Bayesian MMSE error estimators) tends to converge to zero much more quickly than for the distribution-free error estimators. This is not surprising since for a large sample the sample parameter estimates tend to be very accurate. Bayesian MMSE error estimators can greatly improve on traditional error


Figure 2.9 RMS deviation from true error for Gaussian distributions with respect to sample size: (a) $D = 2$, Bayes error = 0.2; (b) $D = 2$, Bayes error = 0.4. [Reprinted from (Dalton and Dougherty, 2011c).]

estimators. For only one or two features, the benefit is clear, especially for moderate Bayes error as in part (a) of Fig. 2.9. In higher dimensions, there are many options to constrain the covariance matrix and choose different priors, so the picture is more complex.

2.6.2 Robustness to falsely assuming identity covariances

The Bayesian MMSE error estimator assuming scaled identity covariances performs very well for many cases in the preceding simulation, where the scaled identity covariance assumption is correct. We consider two examples to investigate robustness relative to this assumption. For the first example, define $\rho$ to be the correlation coefficient for class 0 in a two-feature problem. The correlation coefficient for class 1 is $-\rho$ to ensure mirror-image distributions. Thus, the covariance matrices are given by

\[
\Sigma_y = \begin{bmatrix} \sigma^2 & (-1)^y\rho\sigma^2 \\ (-1)^y\rho\sigma^2 & \sigma^2 \end{bmatrix} \tag{2.146}
\]

for $y \in \{0, 1\}$. Illustrations of the distributions used in this experiment are shown in Fig. 2.10(a), and simulation results are shown in Fig. 2.10(b). For the simulations, we have fixed $\sigma = 0.7413$, which corresponds to a Bayes error of 0.25 when there is no correlation. The Bayesian MMSE error estimators assuming general covariances are not significantly affected by correlation, and the performance of the error estimator assuming identity covariances is also fairly robust to correlation in this particular model, although some degradation can be seen for $\rho > 0.8$. Meanwhile, bolstering also appears to be somewhat negatively affected by high correlation, probably owing to the use of spherical kernels when the true distributions are not spherical.

Figure 2.10 RMS deviation from the true error with respect to correlation ($D = 2$, $\sigma = 0.7413$, $n = 50$): (a) distributions used in RMS graphs; (b) RMS deviation from true error. [Reprinted from (Dalton and Dougherty, 2011c).]

Figure 2.11 RMS deviation from the true error with respect to $\sigma_0$ ($D = 2$, $\sigma = 0.7413$, $n = 50$): (a) distributions used in RMS graphs; (b) RMS deviation from true error. [Reprinted from (Dalton and Dougherty, 2011c).]

Figure 2.11 presents a second experiment using different variances for each feature. The covariances are given by

\[
\Sigma_0 = \Sigma_1 = \begin{bmatrix} \sigma_0^2 & 0 \\ 0 & \sigma_1^2 \end{bmatrix}, \tag{2.147}
\]

and we fix the average variance between the classes such that $0.5(\sigma_0^2 + \sigma_1^2) = 0.7413^2$. When $\sigma_0^2 = \sigma_1^2$, the Bayes error of the classification problem is again 0.25. These simulations show that the Bayesian MMSE error estimator assuming identity covariances can be highly sensitive to unbalanced features; however, this problem may be alleviated by normalizing the raw data.

2.6.3 Robustness to falsely assuming Gaussianity

Since Bayesian error estimators depend on parametric models of the true distributions, one may apply a Kolmogorov–Smirnov normality test or other hypothesis test to discern if a sample deviates substantially from being


Gaussian; nevertheless, the actual distribution is very unlikely to be truly Gaussian, so we need to investigate robustness relative to the Gaussian assumption. To explore this issue in a systematic setting, we apply Gaussian-derived Bayesian MMSE error estimators to Johnson distributions in one dimension. Johnson distributions are a flexible family of distributions with four free parameters, including mean and variance (Johnson, 1949; Johnson et al., 1994). There are two main classes in the Johnson system of distributions: Johnson SU (for unbounded) and Johnson SB (for bounded). The normal and log-normal distributions are also considered classes in this system, and in fact they are limiting cases of the SU and SB distributions. The Johnson system can be summarized as follows. If $Z$ is a standard normal random variable, then $X$ is Johnson if

\[
\frac{Z - \gamma}{\delta} = f\left(\frac{X - \eta}{\lambda}\right), \tag{2.148}
\]

where $f$ is a simple function satisfying some desirable properties such as monotonicity (Johnson, 1949; Johnson et al., 1994). For log-normal distributions $f(y) = \ln(y)$, for Johnson SU distributions $f(y) = \sinh^{-1}(y)$, and for Johnson SB distributions $f(y) = \ln(y/(1 - y)) = 2\tanh^{-1}(2y - 1)$. For reference, example graphs of these distributions are given in Fig. 2.12. Johnson SU distributions are always unimodal, while SB distributions can also be bimodal. In particular, an SB distribution is bimodal if $\delta < 1/\sqrt{2}$ and

\[
|\gamma| < \delta^{-1}\sqrt{1 - 2\delta^2} - 2\delta\tanh^{-1}\sqrt{1 - 2\delta^2}. \tag{2.149}
\]

The parameters $\gamma$ and $\delta$ control the shape of the Johnson distribution and together essentially determine its skewness and kurtosis, which are normalized third and fourth moments. In particular, skewness is $\mu_3/\sigma^3$ and kurtosis is $\mu_4/\sigma^4$, where $\mu_k$ is the $k$th mean-adjusted moment of a random variable, and $\sigma^2 = \mu_2$ is the variance. Skewness and kurtosis are very useful statistics to measure normality: Gaussian distributions always have a skewness of 0 and kurtosis of 3. For Johnson distributions, skewness is more influenced by $\gamma$ and kurtosis by $\delta$, but the relationship is not exclusive. Once the shape of the distribution is determined by $\gamma$ and $\delta$, $\eta$ and $\lambda$ determine the mean and variance.

Figure 2.13 illustrates the values of skewness and kurtosis obtainable within the Johnson family. The region below the log-normal line can be achieved with Johnson SU distributions, while the region above can be achieved with Johnson SB distributions. In fact, the normal, log-normal, SU, and SB systems partition the entire obtainable region of the skewness–kurtosis plane, so there is just one distribution corresponding to each skewness–kurtosis pair. All distributions satisfy kurtosis $\ge$ skewness$^2$ + 1, where equality


Figure 2.12 Example Johnson distributions with one parameter fixed and the other varying in increments of 0.1 ($\eta = 0$, $\lambda = 1$): (a) Johnson SU, $\delta = 0.9$; (b) Johnson SU, $\gamma = 0.0$; (c) Johnson SB, $\delta = 0.9$; (d) Johnson SB, $\gamma = 0.0$. [Reprinted from (Dalton and Dougherty, 2011c).]

corresponds to a distribution taking on two values, say one with probability $p$ and the other with probability $1 - p$. In Fig. 2.13, $\gamma = 0$ corresponds to points on the left axis. The gray diagonal lines represent skewness and kurtosis obtainable with SU distributions and fixed values of $\delta$. As we increase $\delta$, these lines move up in an almost parallel manner. As we increase $\gamma$, kurtosis increases along with skewness until we converge to a point on the log-normal line. As a quick example, if kurtosis is fixed at 4.0, then $\delta > 2.3$, which is limited by the worst-case scenario, where $\gamma = 0$. Also, for SU distributions with this kurtosis, the maximum squared skewness is about 0.57, which is achieved using $\delta \approx 4.1$.

The simulation procedure in this section is the same as that in Section 2.6.1, except that the sample points are each assigned a Johnson-distributed value rather than Gaussian. We use mirror images of the same Johnson distribution for both classes. In the following, the parameters $\delta$ and $\gamma$ refer to class 0, while class 1 has parameters $\delta$ and $-\gamma$. Meanwhile, for each class, $\eta$


Figure 2.13 Skewness–kurtosis-obtainable regions for Johnson distributions. [Reprinted from (Dalton and Dougherty, 2011c).]

and $\lambda$ are jointly selected to give the appropriate mean and covariance pair. The sample size is fixed at $n = 30$, the means are fixed at 0 and 1, and the standard deviations are fixed at $\sigma = 0.7413$, which corresponds to a Bayes error of 0.25 for the Gaussian distribution. With one feature, $n = 30$, and Gaussian distributions with a Bayes error of 0.25, the Bayesian MMSE error estimators using the flat prior and the Jeffreys rule prior perform quite well with RMSs of about 0.060 and 0.066, respectively. These are followed by the plug-in rule with an RMS of 0.070 and bolstering with an RMS of 0.073. We wish to observe how well this ordering is preserved as the skewness and kurtosis of the original Gaussian distributions are distorted using Johnson distributions. Figures 2.14(a) and (b) show the RMS of all error estimators for various Johnson SU distributions, and Figs. 2.14(c) and (d) show analogous graphs for Johnson SB distributions. In each subfigure, we fix either $\delta$ or $\gamma$ and vary the other parameter to observe a slice of the performance behavior. The scale for the RMS of all error estimators is provided on the left axis as usual, and a graph of either skewness (when $\delta$ is fixed) or kurtosis (when $\gamma$ is fixed) is added as a black dotted line with the scale shown on the right axis. These skewness


Figure 2.14 RMS deviation from true error for Johnson SU and SB distributions (1D, $\sigma = 0.7413$, $n = 30$): (a) SU, $\delta = 2.0$, with skewness; (b) SU, $\gamma = 0.0$, with kurtosis; (c) SB, $\delta = 0.7$, with skewness; (d) SB, $\gamma = 0.0$, with kurtosis. [Reprinted from (Dalton and Dougherty, 2011c).]

and kurtosis graphs help illustrate the non-Gaussianity of the distributions represented by each point. Figure 2.14(b) presents a simulation observing the effect of $\delta$ (which has more influence on kurtosis) with SU distributions and $\gamma = 0$. For $\gamma = 0$ there is no skewness, and this graph shows that the Bayesian MMSE error estimator with the flat prior requires that $\delta \ge 1.5$ before it surpasses all of the other error estimators (the last being bolstering). This corresponds to a kurtosis of about 7.0. A similar performance graph with Johnson SB distributions and $\gamma = 0$ is given in Fig. 2.14(d), in which the same error estimator is the best whenever $\delta > 0.4$, corresponding to kurtosis greater than about 1.5. So although Gaussian distributions have a kurtosis of 3.0, in this example the Bayesian MMSE error estimator is still better than all of the other error estimators whenever there is no skewness and kurtosis is between 1.5 and 7.0. Interestingly, performance can actually improve as we move away from Gaussianity. For example, in the SU system, for larger $\delta$ the RMS of the Bayesian estimators seems to monotonically decrease with $\gamma$, as in


Fig. 2.14(a), suggesting that they favor negative skewness (positive $\gamma$), where the classes have less overlapping mass. Simulations with Johnson SB distributions also appear to favor slight negative skewness (negative $\gamma$), although performance is not monotonic.

Finally, in Fig. 2.15 we present a graph summarizing the performance of Bayesian MMSE error estimators on Johnson distributions with respect to the skewness and kurtosis of class 0. The skewness–kurtosis plane shown in this figure is essentially the same as that illustrated in Fig. 2.13, but also shows two sides to distinguish between distributions with more overlapping mass (class 0 has positive skewness) and less overlapping mass (class 0 has negative skewness). For mirror-image distributions, the kurtosis of class 1 is the same but the skewness is negated. Each dot in Fig. 2.15 represents a fixed class-conditional Johnson distribution. As before, we fix $\sigma = 0.7413$, corresponding to a Bayes error of 0.25 for the Gaussian distribution (which has skewness 0 and kurtosis 3). Black dots in Fig. 2.15 represent distributions where the Bayesian MMSE error estimator with the flat prior performs better than all of the other six standard error estimators, while white dots pessimistically represent distributions where at least one other error estimator is better. With one feature, $n = 30$, and $\sigma = 0.7413$, the black dots cover a relatively large range of skewness and kurtosis (especially with negative skewness), indicating that Bayesian MMSE error estimators can be used relatively reliably even if the true distributions are not perfectly Gaussian. Similar graphs or studies may be

Figure 2.15 RMS deviation from true error for Johnson distributions, varying both skewness and kurtosis (1D, $\sigma = 0.7413$, $n = 30$). Black dots are where the Bayesian MMSE error estimator is best; white dots are where any other error estimator is best. [Reprinted from (Dalton and Dougherty, 2011c).]


used to determine an "acceptable" region for Gaussian modeling assumptions, which may be useful for designing hypothesis tests. However, performance in this graph depends heavily on the particular distributions. Recall that the Bayesian MMSE error estimator may not be the best error estimator for a specific Gaussian distribution, let alone a Johnson distribution.

2.6.4 Average performance under proper priors

To illustrate the average performance of Bayesian MMSE error estimators over all distributions in a model, we require proper priors, so for the sake of demonstration we will use a carefully designed proper prior in this section rather than the improper priors used previously. We assume the general independent covariance model for both classes, and define the prior parameters $\kappa_y = \nu_y = 5D$ and $S_y = 0.7413^2(\kappa_y - D - 1)I_D$. For class 0 we also define $m_0 = 0_D$, and for class 1 we define $m_1$ as in Eq. 2.143. For each class, this prior is always proper. In addition, we assume a uniform distribution for $c$.

The simulation procedure follows Fig. 2.5. In each iteration of step 1, and for each class independently, a random covariance $\Sigma_y$ is drawn from the inverse-Wishart distribution with parameters $\kappa_y$ and $S_y$ using methods in (Johnson, 1987). Conditioned on $\Sigma_y$, a random mean is generated using the Gaussian distribution $\pi(\mu_y | \Sigma_y) \sim N(m_y, \Sigma_y/\nu_y)$, resulting in a normal-inverse-Wishart distributed mean and covariance pair. Each feature-label distribution is determined by a class probability $c$ drawn from a uniform$(0, 1)$ distribution, and the set of means $\mu_y$ and covariances $\Sigma_y$ for $y \in \{0, 1\}$. Step 2A generates a random training sample from the realized class-conditional distributions $N(\mu_y, \Sigma_y)$, and the prior is updated to a posterior in step 2B. The labeled sample points are used to train an LDA classifier in step 2C, with no feature selection. In step 2D, since the classifier is linear, we compute the true error and Bayesian MMSE error estimators exactly using closed-form expressions. Resubstitution, leave-one-out, cross-validation, 0.632 bootstrap, bolstered resubstitution, and plug-in are also found. The difference $\hat{\varepsilon}_n - \varepsilon_n$ and the squared difference $|\hat{\varepsilon}_n - \varepsilon_n|^2$ are averaged to produce Monte Carlo approximations of bias and RMS, respectively, over the prior and sampling distributions. The whole process is repeated over $T = 100{,}000$ feature-label distributions and $t = 10$ samples per feature-label distribution.

Results for five features are given in Fig. 2.16. These graphs show (as they must) that Bayesian MMSE error estimators possess optimal RMS performance when averaged over all distributions in the parameterized family and are unbiased for each sample size. In fact, performance of the Bayesian MMSE error estimator improves significantly relative to the other error estimators as the number of features increases.
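A compressed, hedged sketch of one pass through this pipeline (steps 1, 2A, 2C, and 2D) is shown below. The helper names, sample size, and Monte Carlo settings are ours; the Bayesian MMSE estimate itself would be computed from the posterior hyperparameters via Theorem 2.12 and is omitted for brevity. The sketch assumes both classes receive at least two sample points.

```python
# One iteration of the Section 2.6.4 simulation: draw a distribution from the
# normal-inverse-Wishart prior, sample data, train LDA (Eqs. 2.144-2.145), and
# evaluate the true error of the resulting linear classifier in closed form.
import numpy as np
from scipy.stats import invwishart, norm

rng = np.random.default_rng(0)
D = 5
kappa = nu = 5 * D
S_prior = 0.7413**2 * (kappa - D - 1) * np.eye(D)
m_prior = [np.zeros(D), np.r_[1.0, np.zeros(D - 1)]]      # m_0 and m_1 (Eq. 2.143)

# Step 1: random feature-label distribution from the prior.
c = rng.uniform()
Sigma = [invwishart(df=kappa, scale=S_prior).rvs(random_state=rng) for _ in range(2)]
mu = [rng.multivariate_normal(m_prior[y], Sigma[y] / nu) for y in range(2)]

# Step 2A: random training sample of size n (random sampling of labels).
n = 60
n0 = rng.binomial(n, c); counts = [n0, n - n0]
X = [rng.multivariate_normal(mu[y], Sigma[y], size=counts[y]) for y in range(2)]

# Step 2C: LDA classifier (Eqs. 2.144-2.145) with the pooled sample covariance.
mu_hat = [X[y].mean(axis=0) for y in range(2)]
S_pooled = sum((counts[y] - 1) * np.cov(X[y], rowvar=False) for y in range(2)) / (n - 2)
a = np.linalg.solve(S_pooled, mu_hat[1] - mu_hat[0])
b = -0.5 * (mu_hat[1] + mu_hat[0]) @ a + np.log(counts[1] / counts[0])

# Step 2D: true error of the linear classifier under the drawn Gaussians (Eq. 1.32 style).
g = [a @ mu[y] + b for y in range(2)]
eps = [norm.cdf((-1) ** y * g[y] / np.sqrt(a @ Sigma[y] @ a)) for y in range(2)]
true_error = c * eps[0] + (1 - c) * eps[1]
print(true_error)
```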


Figure 2.16 RMS deviation from true error and bias for linear classification of Gaussian distributions, averaged over all distributions and samples using a proper prior: (a) RMS, D = 5; (b) bias, D = 5. [Reprinted from (Dalton and Dougherty, 2011c).]

2.7 Consistency of Bayesian Error Estimation

This section treats consistency of the Bayesian MMSE error estimator: as more data are collected, does the estimator converge to the true error? We are interested in the asymptotic behavior $(n \to \infty)$ with respect to a fixed true parameter and its sampling distribution. Specifically, suppose that $\bar\theta \in \Theta$ is the unknown true parameter. Let $S_\infty$ represent an infinite sample drawn from the true distribution and $S_n$ denote the first $n$ observations of this sample. The sampling distribution will be specified in the subscript of probabilities and expectations using notation of the form $S_\infty|\bar\theta$. Doob (Doob, 1948) proved that Bayesian estimates are consistent up to a null set of the prior measure; that is, the set of parameters where the estimator is not consistent has prior probability 0. If the true distribution is in the uncertainty class, and the true parameter is thought of as a random variable determined by a known random mechanism represented by our prior, this result is satisfactory, in the sense that consistency is guaranteed with probability 1. If the true parameter is a fixed state of nature and our prior is simply a model of our uncertainty, it is important to determine for that specific parameter whether the Bayesian MMSE error estimator is consistent. To wit, when we talk about a specific true distribution, unless the prior has a point mass on that specific true distribution (which is the case in the discrete model, making that problem easy), according to Doob we do not know whether the specific true distribution is in the null set. We need a stronger result, ensuring convergence for all distributions in the parameter space, not just all distributions except for an unspecified null set. Throughout this section, we will denote the true error by $\varepsilon_n(\bar\theta, S_n)$ rather than $\varepsilon_n(\bar\theta, \psi_n)$ to emphasize that it is a function of the sample.


A sequence of estimators $\hat\varepsilon_n(S_n, \psi_n)$ of a sequence of functions $\varepsilon_n(\theta, S_n)$ of the parameter is said to be weakly consistent at $\bar\theta$ if

$$\lim_{n\to\infty}\, \hat\varepsilon_n(S_n, \psi_n) - \varepsilon_n(\bar\theta, S_n) = 0 \quad (2.150)$$

in probability. $\hat\varepsilon_n(S_n, \psi_n)$ is weakly consistent, or simply consistent, if Eq. 2.150 is true for all $\bar\theta \in \Theta$. For any $r \ge 1$, $\hat\varepsilon_n(S_n, \psi_n)$ is consistent in the $r$th mean if there is convergence in the $r$th absolute moments,

$$\lim_{n\to\infty} E_{S_n|\bar\theta}\big[\,\big|\hat\varepsilon_n(S_n, \psi_n) - \varepsilon_n(\bar\theta, S_n)\big|^r\,\big] = 0, \quad (2.151)$$

for all $\bar\theta \in \Theta$. $r$th mean consistency always implies weak consistency. Of particular interest is the case $r = 2$, where there is convergence in the mean-square. If $|\hat\varepsilon_n(S_n, \psi_n) - \varepsilon_n(\bar\theta, S_n)|$ is bounded, which is always true for classifier error estimation, then weak consistency implies $r$th mean consistency for all $r \ge 1$; therefore, the two notions of consistency are equivalent. $\hat\varepsilon_n(S_n, \psi_n)$ is said to be strongly consistent if there is almost sure convergence:

$$\Pr_{S_\infty|\bar\theta}\big(\hat\varepsilon_n(S_n, \psi_n) - \varepsilon_n(\bar\theta, S_n) \to 0\big) = 1. \quad (2.152)$$

Strong consistency always implies weak consistency. For Bayesian MMSE error estimators, we will prove Eq. 2.152, assuming fairly weak conditions on the model and classification rule.

2.7.1 Convergence of posteriors

The salient point is to show that the posteriors of the parameters $c$, $\theta_0$, and $\theta_1$ converge in some sense to delta functions on the true parameters $\bar c \in [0, 1]$, $\bar\theta_0 \in \Theta_0$, and $\bar\theta_1 \in \Theta_1$. This is a property of the posterior distribution, whereas the preceding definitions of consistency are only properties of the estimator itself. Posterior convergence is formalized via the concept of weak* consistency, whose elucidation requires some comments regarding measure theory. Suppose that the sample space $\mathcal{X}$ and the parameter space $\Theta$ are Borel subsets of complete separable metric spaces, each being endowed with the induced $\sigma$-algebra from the Borel $\sigma$-algebra on its respective metric space. For instance, if $\mathcal{A}$ is the Borel $\sigma$-algebra for the space containing $\Theta$, then $\Theta \in \mathcal{A}$, and $\Theta$ is endowed with the induced $\sigma$-algebra $\mathcal{A}_\Theta = \{\Theta \cap A : A \in \mathcal{A}\}$. If $P_n$ and $P$ are probability measures on $\Theta$, then $P_n \to P$ weak* (that is, in the weak* topology on the space of all probability measures over $\Theta$) if and only if $\int_\Theta f\,dP_n \to \int_\Theta f\,dP$ for all bounded continuous functions $f$ on $\Theta$. This is the Helly–Bray theorem. Further, if $\delta_\theta$ is a point mass at $\theta \in \Theta$, then it can


be shown that $P_n \to \delta_\theta$ weak* if and only if $P_n(U) \to 1$ for every neighborhood $U$ of $\theta$. Our interest is in the convergence of posteriors, which are themselves random due to the sampling distribution. In particular, Bayesian modeling parameterizes a family of distributions or, equivalently, a family of probability measures $\{F_\theta : \theta \in \Theta\}$ on $\mathcal{X}$. For a fixed true parameter $\bar\theta$, we denote the infinite labeled sampling distribution under $\bar\theta$ by $F^\infty_{\bar\theta}$, which is a measure on the space of infinite samples on $\mathcal{X}$. We call the posterior of $\theta$ weak* consistent at $\bar\theta \in \Theta$ if the posterior probability of the parameter converges weak* to $\delta_{\bar\theta}$ for all infinite samples, excluding a subset with probability 0 under $F^\infty_{\bar\theta}$. To wit, for all bounded continuous functions $f$ on $\Theta$,

$$\Pr_{S_\infty|\bar\theta}\big(E_{\theta|S_n}[f(\theta)] \to f(\bar\theta)\big) = 1. \quad (2.153)$$

Equivalently, the posterior probability (given a fixed sample) of any neighborhood $U$ of the true parameter $\bar\theta$ converges to 1 almost surely with respect to the sampling distribution, i.e.,

$$\Pr_{S_\infty|\bar\theta}\big(\Pr_{\theta|S_n}(U) \to 1\big) = 1. \quad (2.154)$$

The posterior is called weak* consistent if it is weak* consistent at every $\bar\theta \in \Theta$. If there are only a finite number of possible outcomes and no neighborhood of the true parameter has prior probability 0, then, as has long been known, posteriors are weak* consistent (Freedman, 1963; Diaconis and Freedman, 1986). Returning to our classification problem, to establish weak* consistency for the posteriors of $c$, $\theta_0$, and $\theta_1$ in both the discrete and Gaussian models (in the usual topologies), we assume proper priors on $c$, $\theta_0$, and $\theta_1$. First consider the parameter $c$. The sample space for $c$ is the binary set $\{0, 1\}$, which is trivially a complete separable metric space, and the parameter space is the interval $[0, 1]$, which is itself a complete separable metric space in the usual topology. We use the $L^1$-norm. Suppose that $c$ is not known and labeled training points are drawn using random sampling, where $n_0 \to \infty$ when $\bar c > 0$ and $n_1 \to \infty$ when $\bar c < 1$. Since the number of possible outcomes for labels is finite (in fact binary), $\pi^*(c)$ is weak* consistent as $n \to \infty$ when using a beta prior on $c$, which has positive mass in every open interval in $[0, 1]$. This holds regardless of the class-conditional density models (whether they are discrete, Gaussian, or otherwise). By Eq. 2.153, $E_{\pi^*}[c] \to \bar c$ almost surely over the labeled sampling distribution. On the other hand, if $c$ is assumed to be known, then $E_{\pi^*}[c] = \bar c$ trivially. Now consider $\theta_0$ and $\theta_1$. In the discrete model, the feature space is $\mathcal{X} = \{1, 2, \ldots, b\}$, which is again trivially a complete separable metric space, and for bin probabilities $[p_1, \ldots, p_b]$ or $[q_1, \ldots, q_b]$ the parameter space


$\Theta_y = \Delta^{b-1}$ is a complete separable metric space in the $L^1$-norm. Since sample points again have a finite number of possible outcomes, the posteriors of $\theta_y$ are weak* consistent as $n_y \to \infty$. For other Bayesian models on $\theta_y$, we assume that the feature space $\mathcal{X}$ and the parameter space $\Theta_y$ are finite-dimensional Borel subsets of complete separable metric spaces. In the independent general covariance Gaussian models, the feature space $\mathcal{X} = \mathbb{R}^D$ is a complete separable metric space, while the parameter space for $(\mu_y, \Sigma_y)$ is

$$\Theta_y = \mathbb{R}^D \times \{\Sigma_y \in \mathbb{R}^{D \times D} : \Sigma_y \succ 0\}, \quad (2.155)$$

which is contained in the complete separable metric space $\mathbb{R}^D \times \mathbb{R}^{D \times D}$ with the $L^1$-norm, and is a Borel subset. Similar results hold for our other Gaussian models. As long as the parameter space is finite-dimensional, the true feature-label distribution is in the interior (not the boundary) of the parameterized family of distributions, the likelihood function is a bounded continuous function of the parameter that is not under-identified (not flat for a range of values of the parameter), and the prior is proper with all neighborhoods of the true parameter having positive probability, then the posterior distribution of the parameter approaches a normal distribution centered at the true parameter with variance proportional to $1/n$ as $n \to \infty$ (Gelman et al., 2004). For a Gaussian model on class $y \in \{0, 1\}$, and either random or separate sampling, these regularity conditions hold; hence, the posterior of $\theta_y$ is weak* consistent as $n_y \to \infty$. Weak* consistency of the posteriors for $\theta_y$ (which holds in the discrete and Gaussian models) and Eq. 2.153 imply that, as $n_y \to \infty$,

$$\Pr_{S_\infty|\bar\theta_y}\big(E_{\theta_y|S_n}[f(\theta_y)] \to f(\bar\theta_y)\big) = 1 \quad (2.156)$$

for all $\bar\theta_y \in \Theta_y$ and any bounded continuous function $f$ on $\Theta_y$.

2.7.2 Sufficient conditions for consistency

The following theorem decouples the issue of consistency in the prior for $c$ from consistency in the priors for $\theta_0$ and $\theta_1$. Hence, consistency in the Bayesian MMSE error estimator becomes an issue of consistency in the estimation of the error due to each class.

Theorem 2.13 (Dalton and Dougherty, 2012c). Let $c \in [0, 1]$, $\theta_0 \in \Theta_0$, and $\theta_1 \in \Theta_1$. Define $\theta = (c, \theta_0, \theta_1)$. Suppose that

1. $E_{\theta|S_n}[c] \to \bar c$ as $n \to \infty$ almost surely over the infinite labeled sampling distribution under $\bar\theta$ (henceforth we denote this type of limit by $\xrightarrow{\text{a.s.}}$).


2. $c$ and $(\theta_0, \theta_1)$ are independent in their priors (and all posteriors).
3. $n_0 \to \infty$ if $\bar c > 0$ and $n_1 \to \infty$ if $\bar c < 1$ almost surely.
4. If $n_y \to \infty$, then
$$E_{\theta_y|S_n}\big[\,\big|\varepsilon_n^y(\theta_y, S_n) - \varepsilon_n^y(\bar\theta_y, S_n)\big|\,\big] \xrightarrow{\text{a.s.}} 0. \quad (2.157)$$

Then for all finite $k \ge 1$,

$$E_{\theta|S_n}\big[\,\big|\varepsilon_n(\theta, S_n) - \varepsilon_n(\bar\theta, S_n)\big|^k\,\big] \xrightarrow{\text{a.s.}} 0. \quad (2.158)$$

In particular, $k = 1$ implies that the Bayesian MMSE error estimator is strongly consistent.

Proof. Note that

$$\begin{aligned}
\varepsilon_n(\theta, S_n) - \varepsilon_n(\bar\theta, S_n)
&= c\,\varepsilon_n^0(\theta_0, S_n) + (1 - c)\,\varepsilon_n^1(\theta_1, S_n) - \bar c\,\varepsilon_n^0(\bar\theta_0, S_n) - (1 - \bar c)\,\varepsilon_n^1(\bar\theta_1, S_n) \\
&= (c - \bar c)\,\varepsilon_n^0(\theta_0, S_n) + (\bar c - c)\,\varepsilon_n^1(\theta_1, S_n) \\
&\quad + \bar c\,[\varepsilon_n^0(\theta_0, S_n) - \varepsilon_n^0(\bar\theta_0, S_n)] + (1 - \bar c)\,[\varepsilon_n^1(\theta_1, S_n) - \varepsilon_n^1(\bar\theta_1, S_n)].
\end{aligned} \quad (2.159)$$

By condition (2) and the triangle inequality,

$$\begin{aligned}
0 \le E_{\theta|S_n}\big[|\varepsilon_n(\theta, S_n) - \varepsilon_n(\bar\theta, S_n)|\big]
&\le E_{\theta|S_n}[|c - \bar c|]\,E_{\theta|S_n}[\varepsilon_n^0(\theta_0, S_n)] + E_{\theta|S_n}[|\bar c - c|]\,E_{\theta|S_n}[\varepsilon_n^1(\theta_1, S_n)] \\
&\quad + \bar c\,E_{\theta|S_n}[e_n^0(\theta_0, S_n)] + (1 - \bar c)\,E_{\theta|S_n}[e_n^1(\theta_1, S_n)],
\end{aligned} \quad (2.160)$$

where

$$e_n^y(\theta_y, S_n) = |\varepsilon_n^y(\theta_y, S_n) - \varepsilon_n^y(\bar\theta_y, S_n)|. \quad (2.161)$$

Furthermore, by condition (1) and the fact that $E_{\theta|S_n}[\varepsilon_n^y(\theta_y, S_n)]$ is bounded,

$$E_{\theta|S_n}[|c - \bar c|]\,E_{\theta|S_n}[\varepsilon_n^0(\theta_0, S_n)] + E_{\theta|S_n}[|\bar c - c|]\,E_{\theta|S_n}[\varepsilon_n^1(\theta_1, S_n)] \xrightarrow{\text{a.s.}} 0. \quad (2.162)$$

Suppose that $0 < \bar c < 1$. Applying conditions (3) and (4), and the fact that any linear combination of almost surely convergent quantities is almost surely convergent,

$$\bar c\,E_{\theta|S_n}[e_n^0(\theta_0, S_n)] + (1 - \bar c)\,E_{\theta|S_n}[e_n^1(\theta_1, S_n)] \xrightarrow{\text{a.s.}} \bar c \cdot 0 + (1 - \bar c) \cdot 0 = 0. \quad (2.163)$$


If $\bar c = 0$, then since $E_{\theta|S_n}[e_n^0(\theta_0, S_n)]$ is bounded, $\bar c\,E_{\theta|S_n}[e_n^0(\theta_0, S_n)] \xrightarrow{\text{a.s.}} 0$. If $\bar c = 1$, likewise $(1 - \bar c)\,E_{\theta|S_n}[e_n^1(\theta_1, S_n)] \xrightarrow{\text{a.s.}} 0$. Again applying conditions (3) and (4), we have that Eq. 2.163 holds in all cases. Combining Eqs. 2.162 and 2.163, we have that the right-hand side of Eq. 2.160 converges to zero almost surely; thus $E_{\theta|S_n}[|\varepsilon_n(\theta, S_n) - \varepsilon_n(\bar\theta, S_n)|] \xrightarrow{\text{a.s.}} 0$. This implies Eq. 2.158 for $k = 1$. Applying Jensen's inequality,

$$0 \le \big|E_{\theta|S_n}[\varepsilon_n(\theta, S_n) - \varepsilon_n(\bar\theta, S_n)]\big| \le E_{\theta|S_n}\big[|\varepsilon_n(\theta, S_n) - \varepsilon_n(\bar\theta, S_n)|\big]. \quad (2.164)$$

Therefore, $|E_{\theta|S_n}[\varepsilon_n(\theta, S_n) - \varepsilon_n(\bar\theta, S_n)]| \xrightarrow{\text{a.s.}} 0$, which implies Eq. 2.152. Thus, the Bayesian MMSE error estimator is strongly consistent. For $k > 1$, note that $|\varepsilon_n(\theta, S_n) - \varepsilon_n(\bar\theta, S_n)| \le 1$. Hence,

$$0 \le |\varepsilon_n(\theta, S_n) - \varepsilon_n(\bar\theta, S_n)|^k \le |\varepsilon_n(\theta, S_n) - \varepsilon_n(\bar\theta, S_n)|, \quad (2.165)$$

and after taking the expectation,

$$0 \le E_{\theta|S_n}\big[|\varepsilon_n(\theta, S_n) - \varepsilon_n(\bar\theta, S_n)|^k\big] \le E_{\theta|S_n}\big[|\varepsilon_n(\theta, S_n) - \varepsilon_n(\bar\theta, S_n)|\big]. \quad (2.166)$$

Since $E_{\theta|S_n}[|\varepsilon_n(\theta, S_n) - \varepsilon_n(\bar\theta, S_n)|] \to 0$ almost surely, Eq. 2.158 must hold. ▪

Condition (1) is assured if $\pi^*(c)$ is weak* consistent, as is the case with beta priors and random sampling. Condition (2) is the usual assumption of independence used in the Bayesian framework. Condition (3) is a property of the sampling methodology. Essentially, we require that points from each class should be observed infinitely often (almost surely), unless one class actually has zero probability. Condition (4) will be proven under fairly broad conditions in the following theorem, which proves that the Bayesian MMSE error estimator is strongly consistent as long as the true error functions $\varepsilon_n^y(\theta_y, S_n)$ form equicontinuous sets for fixed samples and the posterior of $\theta_y$ is weak* consistent at $\bar\theta_y$. In the next section, we prove that this property holds for all classification rules in the discrete model and all linear classification rules in the general covariance Gaussian model.

Theorem 2.14 (Dalton and Dougherty, 2012c). Let $\bar\theta \in \Theta$ be an unknown true parameter and let $\mathcal{F}(S_\infty) = \{f(\cdot, S_n)\}_{n=1}^{\infty}$ be a uniformly bounded collection of measurable functions associated with the sample $S_\infty$, where $f(\cdot, S_n) : \Theta \to [0, \infty)$, $f(\cdot, S_n) \le M(S_\infty)$ for each $n = 1, 2, \ldots$, and $M(S_\infty)$ is a positive finite constant.


If $\mathcal{F}(S_\infty)$ is equicontinuous at $\bar\theta$ (almost surely with respect to the sampling distribution) and the posterior of $\theta$ is weak* consistent at $\bar\theta$, then

$$\Pr_{S_\infty|\bar\theta}\big(E_{\theta|S_n}\big[|f(\theta, S_n) - f(\bar\theta, S_n)|\big] \to 0\big) = 1. \quad (2.167)$$

Proof. Let $d_\Theta$ be the metric associated with $\Theta$. For fixed $S_\infty$ and $\epsilon > 0$, if equicontinuity holds for $\mathcal{F}(S_\infty)$, then there exists $\delta > 0$ such that $|f(\theta, S_n) - f(\bar\theta, S_n)| < \epsilon$ for all $f \in \mathcal{F}(S_\infty)$ whenever $d_\Theta(\theta, \bar\theta) < \delta$. Hence,

$$\begin{aligned}
E_{\theta|S_n}\big[|f(\theta, S_n) - f(\bar\theta, S_n)|\big]
&= E_{\theta|S_n}\big[|f(\theta, S_n) - f(\bar\theta, S_n)|\,I_{d_\Theta(\theta,\bar\theta)<\delta}\big] + E_{\theta|S_n}\big[|f(\theta, S_n) - f(\bar\theta, S_n)|\,I_{d_\Theta(\theta,\bar\theta)\ge\delta}\big] \\
&\le E_{\theta|S_n}\big[\epsilon\,I_{d_\Theta(\theta,\bar\theta)<\delta}\big] + E_{\theta|S_n}\big[2M(S_\infty)\,I_{d_\Theta(\theta,\bar\theta)\ge\delta}\big] \\
&= \epsilon\,E_{\theta|S_n}\big[I_{d_\Theta(\theta,\bar\theta)<\delta}\big] + 2M(S_\infty)\,E_{\theta|S_n}\big[I_{d_\Theta(\theta,\bar\theta)\ge\delta}\big] \\
&= \epsilon\,\Pr_{\theta|S_n}\big(d_\Theta(\theta,\bar\theta)<\delta\big) + 2M(S_\infty)\,\Pr_{\theta|S_n}\big(d_\Theta(\theta,\bar\theta)\ge\delta\big).
\end{aligned} \quad (2.168)$$

From the weak* consistency of the posterior of $\theta$ at $\bar\theta$, Eq. 2.154 holds, and

$$\begin{aligned}
\limsup_{n\to\infty} E_{\theta|S_n}\big[|f(\theta, S_n) - f(\bar\theta, S_n)|\big]
&\le \epsilon\,\limsup_{n\to\infty} \Pr_{\theta|S_n}\big(d_\Theta(\theta,\bar\theta)<\delta\big) + 2M(S_\infty)\,\limsup_{n\to\infty} \Pr_{\theta|S_n}\big(d_\Theta(\theta,\bar\theta)\ge\delta\big) \\
&\stackrel{\text{a.s.}}{=} \epsilon \cdot 1 + 2M(S_\infty) \cdot 0 = \epsilon.
\end{aligned} \quad (2.169)$$

Finally, since this is (almost surely) true for all $\epsilon > 0$,

$$E_{\theta|S_n}\big[|f(\theta, S_n) - f(\bar\theta, S_n)|\big] \xrightarrow{\text{a.s.}} 0, \quad (2.170)$$

which completes the proof. ▪

2.7.3 Discrete and Gaussian models

Equicontinuity essentially guarantees that the true errors for designed classifiers are somewhat “robust” near the true parameter. Loosely speaking, with equicontinuity there exists (almost surely) a neighborhood $U$ of the true parameter such that at any parameter in $U$, all errors, under any classifier produced by the given rule and truncated samples drawn from the fixed infinite sample, are as close as desired to the true error. This property is a sufficient condition for consistency and is not a stringent requirement. Indeed, the following two theorems prove that it holds for both the discrete and


Gaussian models herein. Combining these results with Theorems 2.13 and 2.14, the Bayesian MMSE error estimator is strongly consistent for both the discrete model with any classification rule and the general covariance Gaussian model with any linear classification rule.

Theorem 2.15 (Dalton and Dougherty, 2012c). In the discrete Bayesian model with any classification rule, $\mathcal{F}_y(S_\infty) = \{\varepsilon_n^y(\cdot, S_n)\}_{n=1}^{\infty}$ is equicontinuous at every $\bar\theta_y \in \Theta_y$ for $y \in \{0, 1\}$.

Proof. This is a slightly stronger result than required by Theorem 2.14, since equicontinuity holds for any sample, not only almost surely. Also, we may use any classification rule; any sequence of classifiers may be applied across each value of $n$. In a $b$-bin model, suppose that the sequence of classifiers $\psi_n : \{1, 2, \ldots, b\} \to \{0, 1\}$ is obtained from a given sample. The error of classifier $\psi_n$ contributed by class 0 at parameter $\theta_0 = [p_1, p_2, \ldots, p_b] \in \Theta_0$ is

$$\varepsilon_n^0(\theta_0, S_n) = \sum_{i=1}^{b} p_i\, I_{\psi_n(i)=1}. \quad (2.171)$$

For any fixed sample $S_\infty$, fixed true parameter $\bar\theta_0 = [\bar p_1, \bar p_2, \ldots, \bar p_b]$, and any $\theta_0 = [p_1, p_2, \ldots, p_b]$,

$$|\varepsilon_n^0(\theta_0, S_n) - \varepsilon_n^0(\bar\theta_0, S_n)| = \Big|\sum_{i=1}^{b} (p_i - \bar p_i)\, I_{\psi_n(i)=1}\Big| \le \sum_{i=1}^{b} |p_i - \bar p_i| = \|\theta_0 - \bar\theta_0\|_1. \quad (2.172)$$

Since $\bar\theta_0$ is arbitrary, $\mathcal{F}_0(S_\infty)$ is equicontinuous. A similar argument shows that $\mathcal{F}_1(S_\infty) = \{\sum_{i=1}^{b} q_i\, I_{\psi_n(i)=0}\}_{n=1}^{\infty}$ is equicontinuous, which completes the proof. ▪

Theorem 2.16 (Dalton and Dougherty, 2012c). In a general covariance Gaussian Bayesian model with $D$ features and any linear classification rule, $\mathcal{F}_y(S_\infty) = \{\varepsilon_n^y(\cdot, S_n)\}_{n=1}^{\infty}$ is equicontinuous at every $\bar\theta_y \in \Theta_y$ for $y \in \{0, 1\}$.

Proof. Given $S_\infty$, suppose that we obtain a sequence of linear classifiers $\psi_n : \mathbb{R}^D \to \{0, 1\}$ with discriminant functions $g_n(x) = a_n^T x + b_n$ defined by vectors $a_n$ and constants $b_n$. If $a_n = \mathbf{0}_D$ for some $n$, then the classifier and


classifier errors are constant (with respect to the feature space). In this case, $|\varepsilon_n^y(\theta_y, S_n) - \varepsilon_n^y(\bar\theta_y, S_n)| = 0$ for all $\theta_y, \bar\theta_y \in \Theta_y$, so this classifier does not affect the equicontinuity of $\mathcal{F}_y(S_\infty)$. Hence, without loss of generality, we assume that $a_n \ne \mathbf{0}_D$ for all $n$ so that the error of classifier $\psi_n$ contributed by class $y$ at parameter $\theta_y = [\mu_y, \Sigma_y]$ is given by

$$\varepsilon_n^y(\theta_y, S_n) = \Phi\!\left(\frac{(-1)^y\, g_n(\mu_y)}{\sqrt{a_n^T \Sigma_y a_n}}\right). \quad (2.173)$$

Since scaling $g_n$ does not affect the decision of classifier $\psi_n$ and $a_n \ne \mathbf{0}_D$, without loss of generality we also assume that $g_n$ is normalized such that $\max_i |a_{n,i}| = 1$ for all $n$, where $a_{n,i}$ is the $i$th element of $a_n$. Treating both classes at the same time, it is enough to show that $\{g_n(\mu)\}_{n=1}^{\infty}$ is equicontinuous at every $\bar\mu \in \mathbb{R}^D$, and $\{a_n^T \Sigma a_n\}_{n=1}^{\infty}$ is equicontinuous at every symmetric positive definite $\bar\Sigma$ (considering one fixed $\bar\Sigma$ at a time; by positive definiteness, $a_n^T \bar\Sigma a_n > 0$). For any fixed but arbitrary $\bar\mu = [\bar\mu_1, \bar\mu_2, \ldots, \bar\mu_D]^T$ and any $\mu = [\mu_1, \mu_2, \ldots, \mu_D]^T$,

$$|g_n(\mu) - g_n(\bar\mu)| = \Big|\sum_{i=1}^{D} a_{n,i}(\mu_i - \bar\mu_i)\Big| \le \max_i |a_{n,i}| \sum_{i=1}^{D} |\mu_i - \bar\mu_i| = \|\mu - \bar\mu\|_1. \quad (2.174)$$

This proves that $\{g_n(\mu)\}_{n=1}^{\infty}$ is equicontinuous. For any fixed $\bar\Sigma$, we denote its $i$th-row, $j$th-column element by $\bar\sigma_{ij}$ and use similar notation for an arbitrary matrix $\Sigma$. Then

$$|a_n^T \Sigma a_n - a_n^T \bar\Sigma a_n| = |a_n^T(\Sigma - \bar\Sigma)a_n| = \Big|\sum_{i=1}^{D}\sum_{j=1}^{D} a_{n,i} a_{n,j}(\sigma_{ij} - \bar\sigma_{ij})\Big| \le \max_i |a_{n,i}|^2 \sum_{i=1}^{D}\sum_{j=1}^{D} |\sigma_{ij} - \bar\sigma_{ij}| = \|\Sigma - \bar\Sigma\|_1. \quad (2.175)$$

Hence, $\{a_n^T \Sigma a_n\}_{n=1}^{\infty}$ is equicontinuous. ▪
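As a small numerical illustration (not from the book) of strong consistency in the discrete model, the sketch below assumes a uniform Dirichlet prior on each class's bin probabilities, a known class-0 probability, and a simple histogram classification rule; the posterior means of the bin probabilities follow from Dirichlet–multinomial conjugacy as in the discrete-model representation of Section 2.4.

```python
import numpy as np

rng = np.random.default_rng(1)
b = 8                                    # number of bins
c_bar = 0.5                              # true class-0 probability (assumed known here)
p_bar = rng.dirichlet(np.ones(b))        # true class-0 bin probabilities
q_bar = rng.dirichlet(np.ones(b))        # true class-1 bin probabilities
alpha = np.ones(b)                       # Dirichlet prior hyperparameters for both classes

for n in [50, 500, 5000, 50000]:
    n0 = rng.binomial(n, c_bar)
    U = rng.multinomial(n0, p_bar)       # class-0 bin counts
    V = rng.multinomial(n - n0, q_bar)   # class-1 bin counts
    # histogram rule: assign a bin to class 1 when its estimated class-1 mass is larger
    psi = (c_bar * U / max(n0, 1) < (1 - c_bar) * V / max(n - n0, 1)).astype(int)
    # posterior means of the bin probabilities (Dirichlet-multinomial conjugacy)
    Ep = (alpha + U) / (alpha.sum() + n0)
    Eq = (alpha + V) / (alpha.sum() + n - n0)
    bee = c_bar * Ep[psi == 1].sum() + (1 - c_bar) * Eq[psi == 0].sum()   # Bayesian MMSE estimate
    true = c_bar * p_bar[psi == 1].sum() + (1 - c_bar) * q_bar[psi == 0].sum()
    print(f"n={n:6d}  |BEE - true error| = {abs(bee - true):.4f}")
```

On a typical run the gap shrinks as $n$ grows, in line with strong consistency, although any single sample path can fluctuate.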


2.8 Calibration

When an analytical representation of the Bayesian MMSE error estimator is not available, it may be approximated using Monte Carlo methods—for instance, in the case of Gaussian models with nonlinear classification (Dalton and Dougherty, 2011a); however, approximating a Bayesian MMSE error estimator is much more computationally intensive than classical counting methods, such as cross-validation, and may be infeasible. Thus, we consider optimally calibrating arbitrary error estimators within a Bayesian framework (Dalton and Dougherty, 2012a). Assuming a fixed sample size, fixed classification and error estimation schemes, and a prior distribution over an uncertainty class, a calibration function mapping error estimates (from the specified error estimation rule) to their calibrated values is computed, and this calibration function is used as a lookup table to calibrate the final error estimates. A salient property of a calibrated error estimator is that it is unbiased relative to the true error.

2.8.1 MMSE calibration function

An optimal calibration function is associated with four assumptions: a fixed sample size $n$, a Bayesian model with a proper prior $\pi(\theta) = \pi(c, \theta_0, \theta_1)$, a fixed classification rule (possibly including a feature selection scheme), and a fixed uncalibrated error estimator $\hat\varepsilon_\bullet$. Given these assumptions, the optimal MMSE calibration function is the expected true error conditioned on the observed error estimate,

$$E[\varepsilon_n | \hat\varepsilon_\bullet] = \int_0^1 \varepsilon_n\, f(\varepsilon_n | \hat\varepsilon_\bullet)\, d\varepsilon_n = \frac{\int_0^1 \varepsilon_n\, f(\varepsilon_n, \hat\varepsilon_\bullet)\, d\varepsilon_n}{f(\hat\varepsilon_\bullet)}, \quad (2.176)$$

where $f(\varepsilon_n, \hat\varepsilon_\bullet)$ is the unconditional joint density between the true and estimated errors, and $f(\hat\varepsilon_\bullet)$ is the unconditional marginal density of the estimated error. Viewed as a function of $\hat\varepsilon_\bullet$, this expectation is called the MMSE calibration function. It may be used to calibrate any error estimator to have optimal MSE performance for the assumed model. Evaluated at a particular value of $\hat\varepsilon_\bullet$, it is called the MMSE calibrated error estimate and will be denoted by $\hat\varepsilon_*$. Owing to the basic conditional-expectation property $E[E[X|Y]] = E[X]$,

$$E[\hat\varepsilon_*] = E[E[\varepsilon_n | \hat\varepsilon_\bullet]] = E[\varepsilon_n]. \quad (2.177)$$

In the first and last terms, both expectations are over $\theta$ and $S_n$. In the middle term, the inner expectation is over $\theta$ and $S_n$ conditioned on $\hat\varepsilon_\bullet$, and the outer expectation is over $\hat\varepsilon_\bullet$. Therefore, the calibrated estimate is unbiased.
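In practice the calibration function can be tabulated from Monte Carlo output. The sketch below is a hypothetical, simplified illustration of that lookup-table construction: the (true, estimated) error pairs are synthetic stand-ins for the output of the simulation in Fig. 2.5, and the bin count, minimum bin occupancy, and the low-biased toy estimator are assumptions chosen only for the demonstration.

```python
import numpy as np

def build_calibration(est, true_err, n_bins=500, min_count=100):
    """Bin uncalibrated estimates and average the matching true errors (identity map for rare bins)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    table = centers.copy()                       # default: leave the estimate unchanged
    idx = np.clip(np.digitize(est, edges) - 1, 0, n_bins - 1)
    for k in range(n_bins):
        in_bin = idx == k
        if in_bin.sum() >= min_count:
            table[k] = true_err[in_bin].mean()   # MMSE calibration: average true error in bin k
    return edges, table

def calibrate(est, edges, table):
    idx = np.clip(np.digitize(est, edges) - 1, 0, len(table) - 1)
    return table[idx]

# toy synthetic (true, estimated) error pairs standing in for the Monte Carlo output of Fig. 2.5
rng = np.random.default_rng(2)
true_err = rng.beta(4, 12, size=200_000)                                           # hypothetical true errors
est = np.clip(true_err - 0.03 + 0.05 * rng.standard_normal(true_err.size), 0, 1)   # low-biased estimator
edges, table = build_calibration(est, true_err)
cal = calibrate(est, edges, table)
print("bias before:", np.mean(est - true_err), " after:", np.mean(cal - true_err))
print("RMS  before:", np.sqrt(np.mean((est - true_err)**2)),
      " after:", np.sqrt(np.mean((cal - true_err)**2)))
```

The same binned pairs can also be reused to approximate the error-estimate-conditioned RMS of Eq. 2.187 below by averaging squared deviations within each bin instead of the true errors themselves.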


We say that an arbitrary estimator $\hat\varepsilon_\circ$ (be it calibrated or not) has ideal regression if $E[\varepsilon_n | \hat\varepsilon_\circ] = \hat\varepsilon_\circ$ almost surely. This means that, given the estimate, the mean of the true error equals the estimate. Hence, the regression of the true error on the estimate is the 45° line through the origin. The next theorem is a special case of the tower property of conditional expectation and guarantees that both Bayesian MMSE and calibrated error estimators possess ideal regression. We first prove a lemma using a measure-theoretic definition of conditional expectation based on the Radon–Nikodym theorem (Loève, 1978). The measure-theoretic definition conditions on an entire sub-$\sigma$-algebra, so that the conditional expectation is viewed as a function or a random variable itself. This is one of those instances where it is easier to prove a general measure-theoretic result than to work with densities or probability distribution functions.

Lemma 2.5 (Dalton and Dougherty, 2012a). Consider a probability space $(\Omega, \mathcal{A}, P)$. Let $X$ be any $\mathcal{A}$-measurable function whose integral exists, and let $\mathcal{B}$ be a $\sigma$-algebra contained in $\mathcal{A}$. Then

$$E[X \mid E[X|\mathcal{B}]] = E[X|\mathcal{B}] \quad (2.178)$$

almost surely.

Proof. Let $P_\mathcal{B}$ be the restriction of $P$ to $\mathcal{B}$. By definition, $E[X|\mathcal{B}]$ is the conditional expectation of $X$ given $\mathcal{B}$, which is a $\mathcal{B}$-measurable function and is defined up to $P_\mathcal{B}$ measure zero by

$$\int_B E[X|\mathcal{B}]\, dP_\mathcal{B} = \int_B X\, dP \quad (2.179)$$

for any $B \in \mathcal{B}$. The existence of $E[X|\mathcal{B}]$ is guaranteed by the Radon–Nikodym theorem because $P_\mathcal{B}$ is absolutely continuous with respect to $P$. Since $E[X|\mathcal{B}]$ is $\mathcal{B}$-measurable, the $\sigma$-algebra $\mathcal{C}$ generated by $E[X|\mathcal{B}]$ is a sub-algebra of $\mathcal{B}$ and therefore a sub-algebra of $\mathcal{A}$. By definition, $E[X|\mathcal{C}]$ is the conditional expectation of $X$ given $\mathcal{C}$, which is a $\mathcal{C}$-measurable function defined up to $P_\mathcal{C}$ measure zero by

$$\int_C E[X|\mathcal{C}]\, dP_\mathcal{C} = \int_C X\, dP \quad (2.180)$$

for any $C \in \mathcal{C}$. Since $\mathcal{C} \subseteq \mathcal{B}$, Eqs. 2.179 and 2.180 imply that

$$\int_C E[X|\mathcal{B}]\, dP_\mathcal{C} = \int_C E[X|\mathcal{C}]\, dP_\mathcal{C} \quad (2.181)$$


for any $C \in \mathcal{C}$. Hence, $E[X|\mathcal{B}] = E[X|\mathcal{C}]$ almost surely relative to $P_\mathcal{C}$, which yields Eq. 2.178 because $E[X \mid E[X|\mathcal{B}]] = E[X|\mathcal{C}]$ by definition, $\mathcal{C}$ being the $\sigma$-algebra generated by $E[X|\mathcal{B}]$. ▪

Theorem 2.17 (Dalton and Dougherty, 2012a). Bayesian MMSE and MMSE calibrated error estimators possess ideal regression.

Proof. Consider a probability space $(\Omega, \mathcal{A}, P)$ and let $X$ be an integrable random variable and $Y$ be a random vector. Then from the lemma we deduce that

$$E[X \mid E[X|Y]] = E[X|Y] \quad (2.182)$$

almost surely. To see this, let $\mathcal{B}$ be the $\sigma$-algebra generated by $Y$. Then in the lemma, $E[X|\mathcal{B}]$ becomes $E[X|Y]$, $\mathcal{C}$ becomes the $\sigma$-algebra generated by $E[X|Y]$, and $E[X|\mathcal{C}]$ becomes $E[X \mid E[X|Y]]$. Now, $\varepsilon_n$ is an integrable random variable because the true error is bounded. In Eq. 2.182, letting $X = \varepsilon_n$ and $Y = \hat\varepsilon_\bullet$ yields

$$E[\varepsilon_n | \hat\varepsilon_*] = E[\varepsilon_n \mid E[\varepsilon_n | \hat\varepsilon_\bullet]] = E[\varepsilon_n | \hat\varepsilon_\bullet] = \hat\varepsilon_*. \quad (2.183)$$

Therefore, calibrated error estimators are ideal. If we now let $Y = S_n$ be the observed sample, then

$$E[\varepsilon_n | \hat\varepsilon_n] = E[\varepsilon_n \mid E[\varepsilon_n | S_n]] = E[\varepsilon_n | S_n] = \hat\varepsilon_n. \quad (2.184)$$

Therefore, Bayesian MMSE error estimators are ideal. ▪

If an analytical representation of the joint density between true and estimated errors for fixed distributions, $f(\varepsilon_n, \hat\varepsilon_\bullet | \theta)$, is available, then

$$f(\varepsilon_n, \hat\varepsilon_\bullet) = \int_\Theta f(\varepsilon_n, \hat\varepsilon_\bullet | \theta)\, \pi(\theta)\, d\theta. \quad (2.185)$$

$f(\hat\varepsilon_\bullet)$ may be found either directly from $f(\varepsilon_n, \hat\varepsilon_\bullet)$ or from analytical representations of $f(\hat\varepsilon_\bullet | \theta)$ via

$$f(\hat\varepsilon_\bullet) = \int_\Theta f(\hat\varepsilon_\bullet | \theta)\, \pi(\theta)\, d\theta. \quad (2.186)$$

From Eq. 2.185, it is clear that $f(\varepsilon_n, \hat\varepsilon_\bullet)$ utilizes all of our modeling assumptions, including the classification rule (because different classifiers will have different true errors), the error estimation rule, and the Bayesian prior. If analytical results for $f(\varepsilon_n, \hat\varepsilon_\bullet | \theta)$ and $f(\hat\varepsilon_\bullet | \theta)$ are not available, then $E[\varepsilon_n | \hat\varepsilon_\bullet]$ may be found via Monte Carlo approximation by simulating the model and classification procedure to generate a large collection of true and estimated error pairs. The MMSE calibration function may then be


approximated by estimating the joint density $f(\varepsilon_n, \hat\varepsilon_\bullet)$, or by partitioning the error estimates into bins and finding the corresponding average true error for estimated errors falling in each bin. Even though calibrated error estimation is suboptimal compared to Bayesian MMSE error estimation, it has several practical advantages:

1. Given a sampling strategy and sample size, a prior on $\theta$, a classification rule, and an error estimation rule $\hat\varepsilon_\bullet$, a calibration function may be found offline with straightforward Monte Carlo approximation.
2. Analytical solutions may be derived using independent theoretical work providing representations for $f(\varepsilon_n, \hat\varepsilon_\bullet | \theta)$ and $f(\hat\varepsilon_\bullet | \theta)$.
3. Once a calibration function has been established, it may be applied by post-processing a final error estimate with a simple lookup table.

2.8.2 Performance with LDA

We again use the simulation methodology shown in Fig. 2.5 to evaluate the performance of calibrated error estimators under a given prior model and stratified samples of fixed size. Consider the independent general covariance Gaussian model with known class-0 probability $c = 0.5$ and normal-inverse-Wishart priors on the mean and covariance pairs for each class. We use the hyperparameters $\kappa_y = 3D$, $\nu_0 = 6D$, $\nu_1 = D$, $S_y = 0.03(\kappa_y - D - 1)I_D$, $m_0 = \mathbf{0}_D$, and $m_1 = 0.1210 \cdot \mathbf{1}_D$. In general, the amount of information in each prior is reflected in the values of $\kappa_y$ and $\nu_y$, which increase as the amount of information in the prior increases. For class 1, $m_1$ has been adjusted to give an expected true error of about 0.25. In step 1 of Fig. 2.5, we generate, independently for each class, $\mu_y$ and $\Sigma_y$ from the normal-inverse-Wishart prior for class $y$. Step 2A generates a stratified training sample of size $n = 30$ from the realized class-conditional distributions $N(\mu_y, \Sigma_y)$, with the sample sizes of both classes fixed at $n_0 = n_1 = n/2$. These labeled sample points are used to find the posterior and to train an LDA classifier with no feature selection in steps 2B and 2C. In step 2D, we compute the true error $\varepsilon_n$, the Bayesian MMSE error estimator $\hat\varepsilon_n$, and the MSE of $\hat\varepsilon_n$ conditioned on the sample, $\mathrm{MSE}(\hat\varepsilon_n | S_n)$, which is discussed in detail in Chapter 3. Since the classifier is linear, $\varepsilon_n$, $\hat\varepsilon_n$, and $\mathrm{MSE}(\hat\varepsilon_n | S_n)$ may be computed exactly using closed-form expressions. Several classical training-data error estimators are also computed: 5-fold cross-validation, 0.632 bootstrap, and bolstered resubstitution. The sample-conditioned MSEs of these error estimators are also evaluated in each iteration using Theorem 3.2, which will be discussed in Chapter 3. For each fixed feature-label distribution, steps 2A through 2D (collectively, step 2) are repeated to obtain $t = 1000$ samples and sets of output. Steps 1 and 2 are repeated for $T = 10{,}000$ different feature-label distributions (corresponding to the randomly selected


parameters). In total, each simulation produces $tT = 10{,}000{,}000$ samples and sets of output results. After the simulation is complete, the synthetically generated true and estimated error pairs are used to find the expected true error $E[\varepsilon_n | \hat\varepsilon_\bullet]$ conditioned on each non-Bayesian error estimate, where $\hat\varepsilon_\bullet$ can be cross-validation, bootstrap, or bolstering. The conditional expectation is approximated by uniformly partitioning the interval $[0, 1]$ into 500 bins and averaging the true errors corresponding to error estimates that fall in each bin. Moreover, the average true error is only found for bins with at least 100 points; otherwise, the bin is considered “rare” and the lookup table simply leaves the error estimate unchanged (an identity mapping). The result is a calibration function (a lookup table) mapping each of the 500 error estimate bins to a corresponding expected true error. Once a lookup table has been generated for each error estimator, the entire experiment is repeated using the same prior model, classification rule, and classical training-data error estimators; however, at the end of each iteration in step 2D, this time we apply the corresponding MMSE calibration lookup table to each non-Bayesian error estimator to obtain $\hat\varepsilon_*$. We also report the exact true error and the Bayesian sample-conditioned MSEs again, but the Bayesian MMSE error estimator is not needed since it is not changed by calibration, and its performance would be theoretically identical to the original experiment. As before, the procedure is iterated $t = 1000$ times for each fixed feature-label distribution for $T = 10{,}000$ sets of feature-label distribution parameters.

Figure 2.17 illustrates the performance of calibrated error estimators for $D = 2$. Figure 2.17(a) shows the expected true error conditioned on the error estimate across iterations. The thin gray diagonal line represents an ideal error estimator equal to the true error. Part (b) of the same figure shows the RMS for each error estimator conditioned on the error estimate itself, which, by definition, is given by

$$\mathrm{RMS}(\hat\varepsilon_\circ | \hat\varepsilon_\circ) = \sqrt{E[(\varepsilon_n - \hat\varepsilon_\circ)^2 | \hat\varepsilon_\circ]}, \quad (2.187)$$

and part (c) shows the RMS conditioned on the true error,

$$\mathrm{RMS}(\hat\varepsilon_\circ | \varepsilon_n) = \sqrt{E[(\varepsilon_n - \hat\varepsilon_\circ)^2 | \varepsilon_n]}, \quad (2.188)$$

where $\hat\varepsilon_\circ \in \{\hat\varepsilon_\bullet, \hat\varepsilon_*, \hat\varepsilon_n\}$. These graphs indicate error estimation accuracy for fixed error estimates and fixed true errors. Part (d) has probability densities for the theoretically computed RMS conditioned on the sample for each error estimator.

Figure 2.17(a) demonstrates that calibrated error estimators and Bayesian MMSE error estimators have ideal regression with the true error, as they must


Figure 2.17 Performance of calibrated error estimators (Gaussian model, D = 2, n = 30, LDA): (a) $E[\varepsilon_n | \hat\varepsilon_n]$; (b) $\mathrm{RMS}(\hat\varepsilon_n | \hat\varepsilon_n)$; (c) $\mathrm{RMS}(\hat\varepsilon_n | \varepsilon_n)$; (d) probability densities of $\mathrm{RMS}(\hat\varepsilon_n | S_n)$. [Reprinted from (Dalton and Dougherty, 2012a).]

according to Theorem 2.17. Furthermore, the RMS conditioned on calibrated error estimators is significantly improved relative to their uncalibrated counterparts, usually tracking just above the Bayesian MMSE error estimator. The results illustrate good performance for calibrated error estimators relative to their uncalibrated classical counterparts. The RMS conditioned on uncalibrated error estimators tends to have a “V” shape, achieving a minimum RMS for a very small window of estimated errors. The RMS conditioned on a low estimated error tends to be high because the error estimator is usually low-biased. Conditioning on a high estimated error tends to result in a high RMS because the error estimator is high-biased. The error estimate where the RMS is minimized approximately corresponds to the point where the expected true error conditioned on the error estimate crosses the ideal dotted line. Note that in a small-sample setting, without modeling assumptions, this window where the estimated error


is most accurate is unknown, in contrast to Bayesian modeling where these graphs demonstrate how to find the optimal window. Furthermore, the error-estimate-conditioned RMS of calibrated error estimators and Bayesian MMSE error estimators tends to monotonically increase; therefore, the accuracy of error estimation is usually higher when the estimated error is low. Figure 2.17(c) is a very typical representative of the behavior of the RMS conditioned on true errors. Uncalibrated error estimators tend to be best for low true errors, which is consistent with many studies on error estimation accuracy (Glick, 1978; Zollanvari et al., 2011, 2012). As we have seen previously, Bayesian MMSE error estimators are usually best for moderate true errors, where small-sample classification is most interesting. This is also true for calibrated error estimators, which have true-error-conditioned RMS plots usually tracking just above the Bayesian MMSE error estimator. Although the unconditional RMS for Bayesian MMSE error estimators is guaranteed to be optimal (within the assumed model), in some cases the conditional RMS of calibrated error estimators can actually outperform that of the Bayesian MMSE error estimator for some small ranges of the true error. Furthermore, although the unconditional RMS for a calibrated error estimator is guaranteed to be lower than that of its uncalibrated counterpart, uncalibrated error estimators can even outperform Bayesian MMSE error estimators for some values of the true error. Nonetheless, the distribution of the RMS conditioned on the sample for calibrated error estimators tends to have more mass toward lower values of RMS than uncalibrated error estimators, with the Bayesian MMSE error estimator being even more shifted to the left.

2.9 Optimal Bayesian ROC-based Analysis

In this section, we apply optimal Bayesian estimation theory to receiver operating characteristic (ROC) curves, which are popular tools to evaluate classifier performance over a range of decision thresholds (Pepe et al., 2004; Spackman, 1989; Fawcett, 2006). By varying the threshold, one can control the trade-off between specificity and sensitivity in a test. The area under the ROC curve (AUC) maps the entire ROC curve into a single number that reflects the overall performance of the classifier over all thresholds. The false positive rate (FPR) and true positive rate (TPR) evaluate performance for a specific threshold. As with the classifier error, in practice the ROC, AUC, FPR, and TPR must be estimated from data. A simulation study based on both synthetic and real data by (Hanczar et al., 2010) concluded that popular resampling estimates of the FPR, TPR, and AUC have considerable RMS deviation from their true values over the sampling distribution for fixed class-conditional densities. Here, we define MMSE estimates of the FPR, TPR, and AUC relative to a posterior. The sample-conditioned MSE of any FPR or


TPR estimator, and an optimal Neyman–Pearson classifier based on the theory herein, can be found in (Dalton, 2016).

2.9.1 Bayesian MMSE FPR and TPR estimation

The FPR is the probability of erroneously classifying a class-0 point (a “negative” state), and the TPR is the probability of correctly classifying a class-1 point (a “positive” state). Letting $\varepsilon_n^{i,y}(\theta_y, \psi)$ denote the probability that a class-$y$ point under parameter $\theta_y$ is assigned class $i$ by classifier $\psi$ given a sample of size $n$, $\mathrm{FPR}(\theta_0, \psi) = \varepsilon_n^{1,0}(\theta_0, \psi)$ and $\mathrm{TPR}(\theta_1, \psi) = \varepsilon_n^{1,1}(\theta_1, \psi)$. Bayesian MMSE estimates of the FPR and TPR are called the expected FPR (EFPR) and expected TPR (ETPR), respectively. Given a classifier $\psi$ that does not depend on the parameters, the EFPR and ETPR are

$$\widehat{\mathrm{FPR}}(S_n, \psi) = E_{\pi^*}[\mathrm{FPR}(\theta_0, \psi)] = \hat\varepsilon_n^{1,0}(S_n, \psi), \quad (2.189)$$

$$\widehat{\mathrm{TPR}}(S_n, \psi) = E_{\pi^*}[\mathrm{TPR}(\theta_1, \psi)] = \hat\varepsilon_n^{1,1}(S_n, \psi), \quad (2.190)$$

respectively, where

$$\hat\varepsilon_n^{i,y}(S_n, \psi) = E_{\pi^*}[\varepsilon_n^{i,y}(\theta_y, \psi)] \quad (2.191)$$

is the posterior probability of assigning label $i$ to a class-$y$ point. Assuming that $c$ is independent of $\theta_0$ and $\theta_1$, in terms of the EFPR and ETPR estimates the Bayesian MMSE error estimate takes the form

$$\begin{aligned}
\hat\varepsilon_n(S_n, \psi) &= \hat c(S_n)\,\hat\varepsilon_n^{1,0}(S_n, \psi) + (1 - \hat c(S_n))\,\hat\varepsilon_n^{0,1}(S_n, \psi) \\
&= \hat c(S_n)\,\hat\varepsilon_n^{1,0}(S_n, \psi) + (1 - \hat c(S_n))\,[1 - \hat\varepsilon_n^{1,1}(S_n, \psi)] \\
&= \hat c(S_n)\,\widehat{\mathrm{FPR}}(S_n, \psi) + (1 - \hat c(S_n))\,[1 - \widehat{\mathrm{TPR}}(S_n, \psi)],
\end{aligned} \quad (2.192)$$

where $\hat c(S_n) = E_{\pi^*}[c]$ is the Bayesian MMSE estimate of $c$. Based on the reasoning of Theorem 2.1, $\widehat{\mathrm{FPR}}(S_n, \psi)$ and $\widehat{\mathrm{TPR}}(S_n, \psi)$ can be found from the effective class-conditional densities via

$$\hat\varepsilon_n^{i,y}(S_n, \psi) = \int_{R_i} f_\Theta(x|y)\, dx. \quad (2.193)$$

The EFPR can be approximated via Monte Carlo integral approximation by drawing a large synthetic sample from the effective class-0 density $f_\Theta(x|0)$ and evaluating the proportion of false positives. The ETPR may similarly be found using $f_\Theta(x|1)$ and evaluating the proportion of true positives.


2.9.2 Bayesian MMSE ROC and AUC estimation

Consider a classifier $\psi$ defined via a discriminant $g$ by $\psi(x) = 0$ if $g(x) \le t$ and $\psi(x) = 1$ otherwise. An ROC curve graphs classifier performance over a continuum of thresholds $t$. The horizontal axis is the FPR, the vertical axis is the TPR, and each point on the curve represents a classifier with some threshold $t$. As $t$ increases, the point converges to $[0, 0]$ in the ROC space, representing a classifier that always predicts class 0; as $t$ decreases, the point converges to $[1, 1]$, representing a classifier that always predicts class 1. The point $[0, 1]$ corresponds to perfect classification, while the diagonal line from $[0, 0]$ to $[1, 1]$ corresponds to a family of classifiers making random blind decisions. Given a discriminant $g$ and $N$ test points $\{x_i\}_{i=1}^N$, an empirical ROC curve can be constructed by first sorting all test points in increasing order according to their classification scores $g(x_i)$. Starting with a threshold below all scores, we obtain an empirical FPR of 1 and a TPR of 1. By increasing the threshold enough to cross over exactly one point, either the FPR will decrease slightly (if the crossed point is in the negative class) or the TPR will decrease slightly (if the crossed point is in the positive class). Proceeding in this way, we construct a staircase-shaped empirical ROC curve. When test data are not available, the Bayesian framework naturally facilitates an estimated ROC curve from training data. We define the expected ROC (EROC) curve to be a plot of EFPR and ETPR pairs across all thresholds. Since the EFPR and ETPR can be found as the FPR and TPR under the effective class-conditional densities, the EROC curve is equivalent to the ROC curve under the effective densities. Hence, the EROC may be found by applying the ROC approximation methods described above using a large pool of $M$ synthetic labeled points drawn from the effective class-conditional densities. Since as many points as desired may be drawn from the effective densities, the EROC curve can be made as smooth as desired. Although the AUC can be approximated by evaluating an integral, it is also the probability that a randomly chosen sample point from the positive class has a higher classification score than an independent randomly chosen sample point from the negative class (assuming that scores higher than the classifier threshold are assigned to the positive class) (Hand and Till, 2001); that is,

$$\mathrm{AUC} = \Pr(g(X_1) > g(X_0) \mid \theta_0, \theta_1), \quad (2.194)$$

where the $X_y$ are independent random vectors governed by the class-$y$ densities $f_{\theta_y}(x|y)$.


This interpretation leads to a simple way of approximating the AUC from a number of test points: sort each test point in increasing order according to its classification score, evaluate the sum $R_1$ of the ranks of the sorted test points in the positive class, and evaluate

$$\mathrm{AUC} \approx \frac{R_1 - \tfrac{1}{2}N_1(N_1 + 1)}{N_0 N_1}, \quad (2.195)$$

where $N_0$ and $N_1$ are the numbers of test points (not training points) in the negative and positive classes, respectively. In the absence of test data, the Bayesian framework naturally gives rise to an expected AUC (EAUC), defined to be the MMSE estimate $E_{\pi^*}[\mathrm{AUC}]$ of the AUC.

Lemma 2.6 (Dalton, 2016). If $\theta_0$ and $\theta_1$ are independent, then the EAUC is equivalent to the AUC under the effective densities.

Proof. Substituting Eq. 2.194 in the definition,

$$\widehat{\mathrm{AUC}} = E_{\pi^*}[\mathrm{AUC}] = E_{\pi^*}[\Pr(g(X_1) > g(X_0) \mid \theta_0, \theta_1)] = \int_{\Theta_0}\int_{\Theta_1} \Pr(g(X_1) > g(X_0) \mid \theta_0, \theta_1)\, \pi^*(\theta_0)\,\pi^*(\theta_1)\, d\theta_1\, d\theta_0. \quad (2.196)$$

Writing the probability as a double integral over the class-conditional densities yields

$$\widehat{\mathrm{AUC}} = \int_{\Theta_0}\int_{\Theta_1}\int_{\mathcal{X}}\int_{\mathcal{X}} I_{g(x_1) > g(x_0)}\, f_{\theta_0}(x_0|0)\, f_{\theta_1}(x_1|1)\, dx_1\, dx_0\; \pi^*(\theta_0)\,\pi^*(\theta_1)\, d\theta_1\, d\theta_0. \quad (2.197)$$

Interchanging the order of integration and applying the definition of the effective densities yields

$$\widehat{\mathrm{AUC}} = \int_{\mathcal{X}}\int_{\mathcal{X}} I_{g(x_1) > g(x_0)}\, f_\Theta(x_0|0)\, f_\Theta(x_1|1)\, dx_1\, dx_0 = \Pr(g(Z_1) > g(Z_0)), \quad (2.198)$$

where the $Z_y$ are independent random vectors governed by the effective class-$y$ conditional densities in Eq. 2.26. ▪

From Eq. 2.194, the AUC for a linear binary classifier $g(x) = a^T x + b$ that predicts class 0 if $g(x) \le 0$ and class 1 otherwise under $N(\mu_0, \Sigma_0)$ and $N(\mu_1, \Sigma_1)$ distributions is given by

$$\mathrm{AUC}(\mu_0, \Sigma_0, \mu_1, \Sigma_1) = \Pr(a^T(X_0 - X_1) < 0 \mid \mu_0, \Sigma_0, \mu_1, \Sigma_1) = \Phi\!\left(\frac{a^T(\mu_1 - \mu_0)}{\sqrt{a^T(\Sigma_0 + \Sigma_1)a}}\right), \quad (2.199)$$


where the last line follows because $a^T(X_0 - X_1) \sim N(a^T(\mu_0 - \mu_1),\, a^T(\Sigma_0 + \Sigma_1)a)$. Consider posteriors for the fixed covariance case, where we can have $\Sigma_0 \ne \Sigma_1$ or $\Sigma_0 = \Sigma_1$. By Lemma 2.6,

$$E_{\pi^*}[\mathrm{AUC}] = \Phi\!\left(\frac{a^T(m_1^* - m_0^*)}{\sqrt{a^T\!\left(\frac{\nu_0^* + 1}{\nu_0^*}\Sigma_0 + \frac{\nu_1^* + 1}{\nu_1^*}\Sigma_1\right)a}}\right). \quad (2.200)$$

In the independent scaled identity and independent general covariance models (and any other model with independent $\theta_0$ and $\theta_1$), by Lemma 2.6 the EAUC may be approximated in the same way that the true AUC is approximated, that is, by generating a large number of test points from the effective densities and evaluating Eq. 2.195. The EAUC has also been solved in closed form for homoscedastic covariance models with $\Sigma_0 = \Sigma_1$, although Lemma 2.6 does not apply in this case. The AUC for fixed parameters is given in Eq. 2.199, which is of the same form as $\varepsilon^y$ in Eq. 1.32 with $\Sigma \equiv \Sigma_0 = \Sigma_1$ in place of $\Sigma_y$, $\mu \equiv (\mu_1 - \mu_0)/\sqrt{2}$ in place of $\mu_y$, $(-1)^y = 1$, and $b = 0$. Thus, we can leverage equations for expectations of the true error to find expectations of the AUC. By Theorem 2.5, conditioned on $\Sigma$, $\mu$ is Gaussian with mean $(m_1^* - m_0^*)/\sqrt{2}$ and covariance $\Sigma/\nu^*$, where $\nu^* = 2\nu_0^*\nu_1^*/(\nu_0^* + \nu_1^*)$. Under the homoscedastic scaled identity covariance model, by Theorem 2.11 we have

$$E_{\pi^*}[\mathrm{AUC}] = \frac{1}{2}\left[1 + \mathrm{sgn}(B)\, I\!\left(\frac{B^2}{B^2 + \|a\|^2\,\mathrm{tr}(S^*)};\; \frac{1}{2},\; \frac{(\kappa^* + D + 1)D}{2} - 1\right)\right], \quad (2.201)$$

where

$$B = a^T(m_1^* - m_0^*)\sqrt{\frac{\nu_0^*\nu_1^*}{\nu_0^* + \nu_1^* + 2\nu_0^*\nu_1^*}}. \quad (2.202)$$

Under the homoscedastic general covariance model, by Theorem 2.12,

$$E_{\pi^*}[\mathrm{AUC}] = \frac{1}{2}\left[1 + \mathrm{sgn}(B)\, I\!\left(\frac{B^2}{B^2 + a^T S^* a};\; \frac{1}{2},\; \frac{\kappa^* - D + 1}{2}\right)\right], \quad (2.203)$$

where $B$ is again given by Eq. 2.202 (Hassan et al., 2019).

2.9.3 Performance study

To illustrate the ROC theory, we assume Gaussian class-conditional densities of dimension $D = 2$ with parameterized mean–covariance pairs


$\theta_y = (\mu_y, \Sigma_y)$, where $\Sigma_y$ is inverse-Wishart with $\kappa_y$ degrees of freedom and scale matrix $S_y$, and $\mu_y$ given $\Sigma_y$ is Gaussian with mean $m_y$ and covariance $\Sigma_y/\nu_y$. We use the following hyperparameters: $\nu_0 = \nu_1 = 1$, $m_0 = m_1 = \mathbf{0}_D$, $\kappa_0 = \kappa_1 = D + 2$, and $S_0 = S_1 = 0.3 I_D$. These hyperparameters represent a low-information prior, where $\nu_y$ and $\kappa_y$ have the minimal integer values necessary for the priors to be proper densities and for the expected means and covariances to exist (these are $\mathbf{0}_D$ and $0.3 I_D$ for both classes, respectively). From each mean and covariance pair we draw a stratified sample of size $n$ with $n/2$ points per class for classifier design and performance evaluation. We assume that $c = 1/2$ is known. We classify using a support vector machine (SVM) with a radial basis function kernel (RBF-SVM). The SVM classifiers are trained using the LIBSVM software package in MATLAB® (Chang and Lin, 2011). The “default” threshold for all of these classifiers is $t = 0$. To evaluate the performance of FPR and TPR estimators, we generate 10,000 random mean and covariance pairs from the prior, and from each pair we draw a training sample of size $n$ (varying from 20 to 100 points) with $n/2$ points independently drawn from each class to train the classifiers using the default threshold $(t = 0)$. The true FPR and TPR are found exactly for LDA. For nonlinear classifiers they are approximated by drawing a synthetic testing sample of size $N = 100{,}000$ with $N/2$ points in each class from the true density parameterized by $\mu_y$ and $\Sigma_y$, and evaluating the proportions of false and true positives. Similarly, the EFPR and ETPR are found exactly for LDA and are approximated for nonlinear classifiers by drawing a synthetic sample of size $M = 100{,}000$ with $M/2$ points per class from the effective densities, and evaluating the proportions of false and true positives. We also evaluate four standard FPR and TPR estimators: resubstitution (resub), leave-one-out (loo), 10-fold cross-validation with 10 repetitions (cv), and 0.632 bootstrap with 250 repetitions (boot). The resub FPR estimator $\widehat{\mathrm{FPR}}_{\mathrm{resub}}$ is the proportion of false positives among class-0 training points, and the resub TPR estimator $\widehat{\mathrm{TPR}}_{\mathrm{resub}}$ is the proportion of true positives among class-1 training points. There are multiple cross-validation- and bootstrap-based methods for FPR, TPR, AUC, and ROC estimation. In our implementation of cv for FPR and TPR estimation, in each repetition we randomly partition the training set into stratified folds, and for each fold we train a surrogate classifier using all points not in the fold and apply this classifier to each point in the fold. The final cv FPR and TPR estimates are the proportions of false positives and true positives, respectively, among held-out points from all folds and repetitions. loo is a special case of cv, where each fold contains precisely one point, and thus only one partition is possible and repetition is not necessary. In each repetition of boot, we train a surrogate classifier on a bootstrap sample and apply this classifier to each point not in the bootstrap sample. The bootstrap-zero estimates $\widehat{\mathrm{FPR}}_{\mathrm{boot}}$ and $\widehat{\mathrm{TPR}}_{\mathrm{boot}}$ are


the proportions of false positives and true positives among held-out points from all repetitions, and the 0.632 bootstrap estimates are given by

$$\widehat{\mathrm{FPR}}_{0.632\,\mathrm{boot}} = 0.368\,\widehat{\mathrm{FPR}}_{\mathrm{resub}} + 0.632\,\widehat{\mathrm{FPR}}_{\mathrm{boot}}, \quad (2.204)$$

$$\widehat{\mathrm{TPR}}_{0.632\,\mathrm{boot}} = 0.368\,\widehat{\mathrm{TPR}}_{\mathrm{resub}} + 0.632\,\widehat{\mathrm{TPR}}_{\mathrm{boot}}. \quad (2.205)$$

Figure 2.18(a) provides the mean of $\widehat{\mathrm{FPR}} - \mathrm{FPR}$, or the mean bias of the FPR estimators over random distributions and samples with respect to sample size, for RBF-SVM and each of the five FPR estimation methods. Figure 2.18(b) analogously shows the mean bias of the TPR estimators. The EFPR and ETPR not only appear unbiased, but they are theoretically unbiased when conditioned on the sample; that is, $E_{\pi^*}[\widehat{\mathrm{FPR}} - \mathrm{FPR}] = 0$ and $E_{\pi^*}[\widehat{\mathrm{TPR}} - \mathrm{TPR}] = 0$. In Fig. 2.18(c), we plot the square root of the mean, over random distributions and samples, of $(\widehat{\mathrm{FPR}} - \mathrm{FPR})^2 + (\widehat{\mathrm{TPR}} - \mathrm{TPR})^2$ for RBF-SVM and each estimator. This quantifies the accuracy of each FPR and TPR estimator pair, and we denote it by $\mathrm{RMS}(\widehat{\mathrm{FPR}}, \widehat{\mathrm{TPR}})$. Even with low-information priors, observe that the EFPR and ETPR have significantly superior performance relative to the other FPR and TPR estimators. Indeed, they are theoretically optimal in these graphs. To see why, recall that the EFPR and


Figure 2.18 Performance of FPR and TPR estimators for RBF-SVM with the default threshold: (a) FPR bias; (b) TPR bias; (c) $\mathrm{RMS}(\widehat{\mathrm{FPR}}, \widehat{\mathrm{TPR}})$. [Reprinted from (Dalton, 2016).]


ETPR minimize their respective sample-conditioned MSEs and thus also minimize the quantity

$$\mathrm{MSE}(\widehat{\mathrm{FPR}} \mid S_n) + \mathrm{MSE}(\widehat{\mathrm{TPR}} \mid S_n) = E_{\pi^*}\big[(\widehat{\mathrm{FPR}} - \mathrm{FPR})^2 + (\widehat{\mathrm{TPR}} - \mathrm{TPR})^2\big], \quad (2.206)$$

which is the mean-square Euclidean distance of the point $[\widehat{\mathrm{FPR}}, \widehat{\mathrm{TPR}}]$ from the point $[\mathrm{FPR}, \mathrm{TPR}]$ in the ROC space, with respect to the posterior. $\mathrm{RMS}(\widehat{\mathrm{FPR}}, \widehat{\mathrm{TPR}})$ approximates the square root of the expectation of Eq. 2.206 with respect to the sampling distribution. All of this holds for the default threshold and any other thresholding method that does not depend on knowing the parameters. To examine ROC and AUC estimation methods, we (1) generate 10,000 random mean and covariance pairs from the prior, (2) for each pair draw a training sample of size $n$ with $n/2$ independent points from each class, and (3) from each sample find the discriminant function for each classifier. We then use the methods outlined previously to approximate the true ROC and AUC using $N = 100{,}000$ test points with $N/2$ points from each class. We examine five ROC and AUC estimation methods, corresponding to the FPR and TPR estimation methods discussed above. For resubstitution, an empirical ROC curve and estimated AUC may be found by applying the previous methods with training data rather than independent test data. For leave-one-out, for each training point $x$, we design a surrogate classifier with the training data excluding $x$ and apply the surrogate classifier to the point $x$ to produce a classification score. The resulting score and label corresponding to each left-out point are pooled into a single dataset, which is again used in place of test data to find ROC curves and AUC estimates. Cross-validation and bootstrap are performed similarly by generating random re-sampled training sets in the usual manner (via folds in the case of cross-validation and a bootstrap sample in the case of bootstrap estimation), training a surrogate classifier for each re-sampled training set, and pooling the scores and labels from hold-out points in all folds and repetitions. This aggregated score–label dataset is used to estimate the ROC and AUC. The bootstrap ROC and AUC estimates are not averaged with resubstitution. Finally, we find the EROC and EAUC using $M = 100{,}000$ points drawn from the effective densities with $M/2$ points from each class.

Consider the true and estimated ROC curves for a single sample of size $n = 60$ and RBF-SVM in Fig. 2.19. The true ROC curve for the Bayes classifier represents a bound on achievable performance, while the true ROC curve for the RBF-SVM classifier is the curve we wish to estimate; it must always lie below the ROC curve for the Bayes classifier. Estimated ROC curves from each of the five estimation methods discussed are also shown.
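The following is a minimal sketch of the fold-based score pooling just described, using scikit-learn's `StratifiedKFold` and `SVC` as hypothetical stand-ins for the LIBSVM-based RBF-SVM; the synthetic training sample and the single-repetition, 10-fold setting are assumptions for the illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

rng = np.random.default_rng(5)
n = 60
X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), n // 2),
               rng.multivariate_normal([1, 1], np.eye(2), n // 2)])   # hypothetical training sample
y = np.repeat([0, 1], n // 2)

pooled_scores, pooled_labels = [], []
for tr, te in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    surrogate = SVC(kernel="rbf").fit(X[tr], y[tr])      # surrogate classifier per fold
    pooled_scores.append(surrogate.decision_function(X[te]))
    pooled_labels.append(y[te])
scores = np.concatenate(pooled_scores)
labels = np.concatenate(pooled_labels)

# the pooled (score, label) pairs now play the role of test data for ROC/AUC estimation,
# e.g. via the rank-based AUC of Eq. 2.195
ranks = np.argsort(np.argsort(scores)) + 1
n1, n0 = (labels == 1).sum(), (labels == 0).sum()
auc_cv = (ranks[labels == 1].sum() - n1 * (n1 + 1) / 2) / (n0 * n1)
print("cv-pooled AUC estimate:", auc_cv)
```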



Figure 2.19 Example true and estimated ROC curves for RBF-SVM with a training set of size n = 60. Circles (from left to right: loo, boot, cv, resub, EROC) indicate the true FPR and TPR for thresholds corresponding to the estimated FPR and TPR closest to the desired FPR. [Reprinted from (Dalton, 2016).]


In this example, the EROC curve is smooth and quite accurate. Indeed, the EROC curve is optimal in the sense that every point $(\widehat{\mathrm{FPR}}, \widehat{\mathrm{TPR}})$ on the curve minimizes Eq. 2.206 for the corresponding classifier and threshold. Suppose that we desire an FPR of 0.1. This is illustrated in Fig. 2.19 as a thin vertical gray line. Each ROC estimator crosses this line at coordinates $(\widehat{\mathrm{FPR}}, \widehat{\mathrm{TPR}})$, where $\widehat{\mathrm{FPR}}$ is as close as possible to 0.1. For each ROC


Figure 2.20 Performance of AUC estimators for RBF-SVM: (a) bias; (b) unconditional RMS. [Reprinted from (Dalton, 2016).]


estimator, this point corresponds to a specific threshold of the classifier, which also corresponds to a true FPR and TPR, marked as circles at the coordinates $(\mathrm{FPR}, \mathrm{TPR})$ on the true ROC curve. Using the EFPR to select a threshold results in a true FPR that is closer to the desired FPR than that obtained by any other FPR estimator in this example. Bias and unconditional RMS graphs for AUC estimation are shown in Fig. 2.20 for RBF-SVM. As expected, the EAUC appears unbiased with superior RMS performance, while all other estimators are biased.

Chapter 3

Sample-Conditioned MSE of Error Estimation

There are two sources of randomness in the Bayesian model. The first is the sample, which randomizes the designed classifier and its true error. Most results on error estimator performance are averaged over random samples, which demonstrates performance relative to a fixed feature-label distribution. The second source of randomness is uncertainty in the underlying feature-label distribution. The Bayesian MMSE error estimator addresses the second source of randomness. This gives rise to a practical expected measure of performance given a fixed sample and classifier. Up until this point, unless otherwise stated, we have assumed that $c$ and $(\theta_0, \theta_1)$ are independent to simplify the Bayesian MMSE error estimator. However, our interest is now in evaluating the MSE of an error estimator itself, which generally depends on the variances of and correlations between the errors contributed by both classes. For the sake of simplicity, throughout this chapter we avoid the need to evaluate correlations by making the stronger assumption that $c$, $\theta_0$, and $\theta_1$ are all mutually independent. See Chapter 5 and (Dalton and Yousefi, 2015) for derivations in the general case.

3.1 Conditional MSE of Error Estimators

For a fixed sample $S_n$, the sample-conditioned MSE of an arbitrary error estimator $\hat\varepsilon_\bullet$ is defined to be

$$\mathrm{MSE}(\hat\varepsilon_\bullet(S_n, \psi) \mid S_n) = E_{\pi^*}\big[(\varepsilon_n(\theta, \psi) - \hat\varepsilon_\bullet(S_n, \psi))^2\big]. \quad (3.1)$$

This is precisely the objective function optimized by the Bayesian MMSE error estimator. Also define the conditional MSE of the Bayesian MMSE error estimate for each class:

$$\mathrm{MSE}(\hat\varepsilon_n^y(S_n, \psi) \mid S_n) = E_{\pi^*}\big[(\varepsilon_n^y(\theta_y, \psi) - \hat\varepsilon_n^y(S_n, \psi))^2\big]. \quad (3.2)$$


The next theorem provides an expression for $\mathrm{MSE}(\hat\varepsilon_n \mid S_n)$, the sample-conditioned MSE of the Bayesian MMSE error estimator, in terms of the MSEs of the $\hat\varepsilon_n^y$, which may be decomposed into the first and second posterior moments of $\varepsilon_n^y$.

Theorem 3.1 (Dalton and Dougherty, 2012b). Let $c$, $\theta_0$, and $\theta_1$ be mutually independent given the sample. Then the conditional MSE of the Bayesian MMSE error estimator is given by

$$\mathrm{MSE}(\hat\varepsilon_n \mid S_n) = \mathrm{var}_{\pi^*}(c)\,(\hat\varepsilon_n^0 - \hat\varepsilon_n^1)^2 + E_{\pi^*}[c^2]\,\mathrm{MSE}(\hat\varepsilon_n^0 \mid S_n) + E_{\pi^*}[(1 - c)^2]\,\mathrm{MSE}(\hat\varepsilon_n^1 \mid S_n), \quad (3.3)$$

where

$$\mathrm{MSE}(\hat\varepsilon_n^y \mid S_n) = E_{\pi^*}\big[(\varepsilon_n^y(\theta_y))^2\big] - (\hat\varepsilon_n^y)^2. \quad (3.4)$$

Proof. According to MMSE estimation theory and applying the definition of the Bayesian MMSE error estimator, by the orthogonality principle,

$$\begin{aligned}
\mathrm{MSE}(\hat\varepsilon_n \mid S_n) &= E_\theta\big[(\varepsilon_n(\theta) - \hat\varepsilon_n)^2 \mid S_n\big] \\
&= E_\theta\big[(\varepsilon_n(\theta) - \hat\varepsilon_n)\varepsilon_n(\theta) \mid S_n\big] + E_\theta\big[(\varepsilon_n(\theta) - \hat\varepsilon_n)\hat\varepsilon_n \mid S_n\big] \\
&= E_\theta\big[(\varepsilon_n(\theta) - \hat\varepsilon_n)\varepsilon_n(\theta) \mid S_n\big] \\
&= E_\theta\big[(\varepsilon_n(\theta))^2 \mid S_n\big] - (\hat\varepsilon_n)^2 \\
&= \mathrm{var}_\theta(\varepsilon_n(\theta) \mid S_n);
\end{aligned} \quad (3.5)$$

that is, the conditional MSE of the Bayesian MMSE error estimator is equivalent to the posterior variance of the true error. Similarly, $\mathrm{MSE}(\hat\varepsilon_n^y \mid S_n) = \mathrm{var}_{\theta_y}(\varepsilon_n^y(\theta_y) \mid S_n)$. By the law of total variance,

$$\begin{aligned}
\mathrm{MSE}(\hat\varepsilon_n \mid S_n) &= \mathrm{var}_{c,\theta_0,\theta_1}\big(c\,\varepsilon_n^0(\theta_0) + (1 - c)\,\varepsilon_n^1(\theta_1) \mid S_n\big) \\
&= \mathrm{var}_c\big(E_{\theta_0,\theta_1}[c\,\varepsilon_n^0(\theta_0) + (1 - c)\,\varepsilon_n^1(\theta_1) \mid c, S_n] \mid S_n\big) \\
&\quad + E_c\big[\mathrm{var}_{\theta_0,\theta_1}(c\,\varepsilon_n^0(\theta_0) + (1 - c)\,\varepsilon_n^1(\theta_1) \mid c, S_n) \mid S_n\big].
\end{aligned} \quad (3.6)$$

Further decomposing the inner expectation and variance, while recalling that $\hat\varepsilon_n^y = E_{\pi^*}[\varepsilon_n^y(\theta_y)]$ and that $c$, $\theta_0$, and $\theta_1$ are mutually independent, yields

$$\begin{aligned}
\mathrm{MSE}(\hat\varepsilon_n \mid S_n) &= \mathrm{var}_c\big(c\,\hat\varepsilon_n^0 + (1 - c)\,\hat\varepsilon_n^1 \mid S_n\big) + E_c[c^2 \mid S_n]\,\mathrm{var}_{\theta_0}(\varepsilon_n^0(\theta_0) \mid S_n) + E_c[(1 - c)^2 \mid S_n]\,\mathrm{var}_{\theta_1}(\varepsilon_n^1(\theta_1) \mid S_n) \\
&= \mathrm{var}_{\pi^*}(c)\,(\hat\varepsilon_n^0 - \hat\varepsilon_n^1)^2 + E_{\pi^*}[c^2]\,\mathrm{var}_{\pi^*}(\varepsilon_n^0(\theta_0)) + E_{\pi^*}[(1 - c)^2]\,\mathrm{var}_{\pi^*}(\varepsilon_n^1(\theta_1)),
\end{aligned} \quad (3.7)$$


which is equivalent to Eq. 3.3. Furthermore,

$$\mathrm{MSE}(\hat{\varepsilon}_n^y \mid S_n) = \mathrm{var}_{\pi^*}(\varepsilon_n^y(\theta_y)) = \mathrm{E}_{\pi^*}[(\varepsilon_n^y(\theta_y))^2] - (\hat{\varepsilon}_n^y)^2, \quad (3.8)$$

which completes the proof. ▪

The variance and expectations related to the variable $c$ depend on the prior for $c$. For example, if $\pi^*(c)$ is beta with hyperparameters $a_y^*$, then

$$\mathrm{E}_{\pi^*}[c] = \frac{a_0^*}{a_0^* + a_1^*} \quad (3.9)$$

and

$$\mathrm{E}_{\pi^*}[c^2] = \frac{a_0^*(a_0^* + 1)}{(a_0^* + a_1^*)(a_0^* + a_1^* + 1)}. \quad (3.10)$$

Hence,

$$\begin{aligned}
\mathrm{MSE}(\hat{\varepsilon}_n \mid S_n) &= \frac{a_0^* a_1^*}{(a_0^* + a_1^*)^2 (a_0^* + a_1^* + 1)} (\hat{\varepsilon}_n^0 - \hat{\varepsilon}_n^1)^2 + \frac{a_0^*(a_0^* + 1)}{(a_0^* + a_1^*)(a_0^* + a_1^* + 1)}\,\mathrm{MSE}(\hat{\varepsilon}_n^0 \mid S_n) \\
&\quad + \frac{a_1^*(a_1^* + 1)}{(a_0^* + a_1^*)(a_0^* + a_1^* + 1)}\,\mathrm{MSE}(\hat{\varepsilon}_n^1 \mid S_n) \\
&= -\frac{(a_0^*)^2}{(a_0^* + a_1^*)^2}(\hat{\varepsilon}_n^0)^2 - \frac{(a_1^*)^2}{(a_0^* + a_1^*)^2}(\hat{\varepsilon}_n^1)^2 - \frac{2 a_0^* a_1^*}{(a_0^* + a_1^*)^2 (a_0^* + a_1^* + 1)}\,\hat{\varepsilon}_n^0 \hat{\varepsilon}_n^1 \\
&\quad + \frac{a_0^*(a_0^* + 1)}{(a_0^* + a_1^*)(a_0^* + a_1^* + 1)}\,\mathrm{E}_{\pi^*}[(\varepsilon_n^0(\theta_0))^2] + \frac{a_1^*(a_1^* + 1)}{(a_0^* + a_1^*)(a_0^* + a_1^* + 1)}\,\mathrm{E}_{\pi^*}[(\varepsilon_n^1(\theta_1))^2]. \quad (3.11)
\end{aligned}$$

Therefore, the conditional MSE for fixed samples is solved if we can find the first moment $\mathrm{E}_{\pi^*}[\varepsilon_n^y(\theta_y)]$ and second moment $\mathrm{E}_{\pi^*}[(\varepsilon_n^y(\theta_y))^2]$ of the true error for $y \in \{0, 1\}$. Having characterized the conditional MSE of the Bayesian MMSE error estimator, it is easy to find analogous results for an arbitrary error estimate.


Theorem 3.2 (Dalton and Dougherty, 2012b). Let $\hat{\varepsilon}_\bullet$ be an error estimate evaluated from a given sample. Then

$$\mathrm{MSE}(\hat{\varepsilon}_\bullet \mid S_n) = \mathrm{MSE}(\hat{\varepsilon}_n \mid S_n) + (\hat{\varepsilon}_n - \hat{\varepsilon}_\bullet)^2. \quad (3.12)$$

Proof. Since $\mathrm{E}_\theta[\varepsilon_n(\theta) \mid S_n] = \hat{\varepsilon}_n$,

$$\begin{aligned}
\mathrm{MSE}(\hat{\varepsilon}_\bullet \mid S_n) &= \mathrm{E}_\theta[(\varepsilon_n(\theta) - \hat{\varepsilon}_\bullet)^2 \mid S_n] \\
&= \mathrm{E}_\theta[(\varepsilon_n(\theta) - \hat{\varepsilon}_n + \hat{\varepsilon}_n - \hat{\varepsilon}_\bullet)^2 \mid S_n] \\
&= \mathrm{E}_\theta[(\varepsilon_n(\theta) - \hat{\varepsilon}_n)^2 \mid S_n] + 2(\hat{\varepsilon}_n - \hat{\varepsilon}_\bullet)\,\mathrm{E}_\theta[\varepsilon_n(\theta) - \hat{\varepsilon}_n \mid S_n] + (\hat{\varepsilon}_n - \hat{\varepsilon}_\bullet)^2 \\
&= \mathrm{MSE}(\hat{\varepsilon}_n \mid S_n) + (\hat{\varepsilon}_n - \hat{\varepsilon}_\bullet)^2, \quad (3.13)
\end{aligned}$$

which completes the proof. ▪

If we find $\mathrm{MSE}(\hat{\varepsilon}_n \mid S_n)$, it is trivial to evaluate the conditional MSE of any error estimator $\mathrm{MSE}(\hat{\varepsilon}_\bullet \mid S_n)$ under the Bayesian model. Furthermore, Theorem 3.2 clearly shows that the conditional MSE of the Bayesian MMSE error estimator lower bounds the conditional MSE of any other error estimator.
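To make the bookkeeping concrete, the following minimal sketch (Python; all inputs are placeholders supplied by the user, and the beta-posterior moments of Eqs. 3.9 and 3.10 are assumed for c) assembles the conditional MSE of the Bayesian MMSE error estimator from the class-wise posterior moments via Theorem 3.1, and then applies Theorem 3.2 to an arbitrary error estimate.

def beta_moments(a0_star, a1_star):
    # posterior moments of c under a beta(a0*, a1*) posterior (Eqs. 3.9 and 3.10)
    m1 = a0_star / (a0_star + a1_star)
    m2 = a0_star * (a0_star + 1) / ((a0_star + a1_star) * (a0_star + a1_star + 1))
    return m1, m2

def conditional_mse_bee(eps0, eps0_sq, eps1, eps1_sq, a0_star, a1_star):
    # Theorem 3.1: eps_y and eps_y_sq are the first and second posterior
    # moments of the class-y true error; Eq. 3.4 gives the class-wise MSEs.
    mse0 = eps0_sq - eps0 ** 2
    mse1 = eps1_sq - eps1 ** 2
    Ec, Ec2 = beta_moments(a0_star, a1_star)
    var_c = Ec2 - Ec ** 2
    E1c2 = 1 - 2 * Ec + Ec2          # E[(1 - c)^2]
    return var_c * (eps0 - eps1) ** 2 + Ec2 * mse0 + E1c2 * mse1

def conditional_mse_any(eps_hat, bee, mse_bee):
    # Theorem 3.2: conditional MSE of an arbitrary error estimate eps_hat
    return mse_bee + (bee - eps_hat) ** 2

With the Bayesian MMSE error estimate itself formed as in Eq. 2.13, the conditional MSE of, say, a cross-validation estimate follows by passing its value as eps_hat.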

3.2 Evaluation of the Conditional MSE

Having found the first posterior error moments for the models we have been considering, we now focus on evaluating the second posterior moments. In each case, when combined with Theorem 3.1, this will provide a representation of the sample-conditioned MSE. The following theorem shows how second posterior moments can be found via the effective joint class-conditional density, which is defined by

$$f_\Theta(x, z \mid y) = \int_{\Theta_y} f_{\theta_y}(x \mid y)\, f_{\theta_y}(z \mid y)\, \pi^*(\theta_y)\, d\theta_y = \mathrm{E}_{\pi^*}\big[ f_{\theta_y}(x \mid y)\, f_{\theta_y}(z \mid y) \big]. \quad (3.14)$$

Theorem 3.3 (Dalton and Yousefi, 2015). Let $\psi$ be a fixed classifier given by $\psi(x) = 0$ if $x \in R_0$ and $\psi(x) = 1$ if $x \in R_1$, where $R_0$ and $R_1$ are measurable sets partitioning the feature space. Then the second moment of the true error can be found by

$$\mathrm{E}_{\pi^*}[(\varepsilon_n^y(\theta_y))^2] = \int_{\mathcal{X} \setminus R_y} \int_{\mathcal{X} \setminus R_y} f_\Theta(x, z \mid y)\, dx\, dz. \quad (3.15)$$

Proof. From Eq. 2.28,

$$\begin{aligned}
\mathrm{E}_{\pi^*}[(\varepsilon_n^y(\theta_y))^2] &= \int_{\Theta_y} \int_{\mathcal{X} \setminus R_y} f_{\theta_y}(x \mid y)\, dx \int_{\mathcal{X} \setminus R_y} f_{\theta_y}(z \mid y)\, dz\, \pi^*(\theta_y)\, d\theta_y \\
&= \int_{\Theta_y} \int_{\mathcal{X} \setminus R_y} \int_{\mathcal{X} \setminus R_y} f_{\theta_y}(x \mid y)\, f_{\theta_y}(z \mid y)\, dx\, dz\, \pi^*(\theta_y)\, d\theta_y. \quad (3.16)
\end{aligned}$$

Since $0 \le \varepsilon_n^y(\theta_y) \le 1$ and hence $\mathrm{E}_{\pi^*}[(\varepsilon_n^y(\theta_y))^2]$ must be finite, and because $f_\Theta(x, z \mid y)$ is a valid density, we obtain the result by Fubini's theorem. ▪

Note that

$$\mathrm{E}_{\pi^*}[(\varepsilon_n^y(\theta_y))^2] = \int_{\mathcal{X}} \int_{\mathcal{X}} I_{x \notin R_y}\, I_{z \notin R_y}\, f_\Theta(x, z \mid y)\, dx\, dz = \Pr(X \notin R_y,\, Z \notin R_y), \quad (3.17)$$

where in the last line $X$ and $Z$ are random vectors drawn from the effective joint density. Moreover, recognizing that $f_\Theta(x, z \mid y)$ is a valid density, the marginals of the effective joint density are both precisely the effective density. For example:

$$\begin{aligned}
\int_{\mathcal{X}} f_\Theta(x, z \mid y)\, dz &= \int_{\mathcal{X}} \int_{\Theta_y} f_{\theta_y}(x \mid y)\, f_{\theta_y}(z \mid y)\, \pi^*(\theta_y)\, d\theta_y\, dz \\
&= \int_{\Theta_y} f_{\theta_y}(x \mid y) \int_{\mathcal{X}} f_{\theta_y}(z \mid y)\, dz\, \pi^*(\theta_y)\, d\theta_y \\
&= \int_{\Theta_y} f_{\theta_y}(x \mid y)\, \pi^*(\theta_y)\, d\theta_y \\
&= f_\Theta(x \mid y). \quad (3.18)
\end{aligned}$$

In the special case where $\pi^*$ is a point mass at the true distribution parameters $\theta_y$, the effective joint density becomes $f_\Theta(x, z \mid y) = f_{\theta_y}(x \mid y) f_{\theta_y}(z \mid y)$, indicating that $X$ and $Z$ would be independent and identically distributed (i.i.d.) with density $f_{\theta_y}$. According to Eq. 3.17,

$$\Pr(X \notin R_y,\, Z \notin R_y) = \Pr(X \notin R_y \mid \theta_y)\, \Pr(Z \notin R_y \mid \theta_y) = [\varepsilon_n^y(\theta_y)]^2, \quad (3.19)$$

and $\mathrm{MSE}(\varepsilon_n^y(\theta_y)) = [\varepsilon_n^y(\theta_y)]^2 - [\varepsilon_n^y(\theta_y)]^2 = 0$. In this case, there is no uncertainty in the true error of the classifier. When there is uncertainty in $\theta_y$, the effective joint density loses independence so that $X$ and $Z$ must be considered jointly. This leads to a positive MSE, or uncertainty in the true error of the classifier.


If $f_\Theta(x, z \mid y)$ can be found for a specific model, Theorem 3.3 provides an efficient method to evaluate or approximate the conditional MSE for any classifier. For instance, in the Gaussian model, although we have closed-form solutions for the conditional MSE under linear classifiers, the optimal Bayesian classifier and many other classification rules are generally not linear. To evaluate the conditional MSE for a classifier and model where closed-form solutions are unavailable, one may approximate the conditional MSE by generating a large number of synthetic sample pairs from $f_\Theta(x, z \mid y)$. Each sample pair can be realized by first drawing a realization of $X$ from the marginal effective density $f_\Theta(x \mid y)$, and then drawing a realization of $Z$ given $X$ from the conditional effective density $f_\Theta(z \mid x, y) = f_\Theta(x, z \mid y) / f_\Theta(x \mid y)$. Equation 3.17 is approximated by evaluating the proportion of pairs for which both points are misclassified, i.e., $\psi(x) \ne y$ and $\psi(z) \ne y$. Further, since $f_\Theta(x \mid y)$ is the effective density, the same realizations of $X$ can be used to approximate the Bayesian MMSE error estimator by evaluating the proportion of $x$ values that are misclassified. In the following sections, we derive closed-form expressions for $f_\Theta(x, z \mid y)$, $f_\Theta(z \mid x, y)$, and the second moment of the true error for each class in our discrete and Gaussian models.
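The scheme just described can be sketched generically as follows (Python with NumPy). Here sample_x_eff, sample_z_given_x, and classify are placeholders that must be supplied for the model at hand: draws from the effective density, draws from the conditional effective density, and the fixed classifier ψ, respectively.

import numpy as np

def posterior_error_moments_mc(sample_x_eff, sample_z_given_x, classify, y,
                               n_pairs=100000, rng=None):
    # Approximate the first and second posterior moments of the class-y true
    # error by sampling pairs (X, Z) from the effective joint density (Eq. 3.17).
    rng = np.random.default_rng() if rng is None else rng
    mis_x_count = 0
    mis_both_count = 0
    for _ in range(n_pairs):
        x = sample_x_eff(y, rng)            # X ~ f_Theta(x | y)
        z = sample_z_given_x(x, y, rng)     # Z | X = x ~ f_Theta(z | x, y)
        mis_x = classify(x) != y
        mis_z = classify(z) != y
        mis_x_count += mis_x
        mis_both_count += mis_x and mis_z
    first_moment = mis_x_count / n_pairs    # approximates the Bayesian MMSE error estimate for class y
    second_moment = mis_both_count / n_pairs  # approximates E[(eps_n^y)^2]
    return first_moment, second_moment

Plugging these moments into Eq. 3.4 and Theorem 3.1 then yields an approximation of the sample-conditioned MSE for an arbitrary, possibly nonlinear, classifier.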

3.3 Discrete Model

For the discrete model with Dirichlet priors, Theorem 2.2 shows that $\pi^*(\theta_0)$ and $\pi^*(\theta_1)$ are also Dirichlet distributions given by Eqs. 2.49 and 2.50 with updated hyperparameters $a_i^{0*} = a_i^0 + U_i^0$ and $a_i^{1*} = a_i^1 + U_i^1$.

Theorem 3.4 (Dalton and Yousefi, 2015). Suppose that class $y = 0$ is multinomial with bin probabilities $p_1, p_2, \ldots, p_b$ that have a Dirichlet$(a_1^{0*}, a_2^{0*}, \ldots, a_b^{0*})$ posterior, $a_i^{0*} > 0$ for all $i$. The effective joint density for class 0 is given by

$$f_\Theta(i, j \mid 0) = \mathrm{E}_{\pi^*}[p_i p_j] = \frac{a_i^{0*} a_j^{0*}}{\left( \sum_{k=1}^b a_k^{0*} \right)\left( 1 + \sum_{k=1}^b a_k^{0*} \right)} \quad (3.20)$$

for $i, j \in \{1, 2, \ldots, b\}$ with $i \ne j$, and

$$f_\Theta(i, i \mid 0) = \mathrm{E}_{\pi^*}[p_i^2] = \frac{a_i^{0*}(a_i^{0*} + 1)}{\left( \sum_{k=1}^b a_k^{0*} \right)\left( 1 + \sum_{k=1}^b a_k^{0*} \right)} \quad (3.21)$$

for $i \in \{1, 2, \ldots, b\}$. The effective joint density for class 1 is similar, except with $q_i$ in place of $p_i$ and $a_i^{1*}$ in place of $a_i^{0*}$ for $i = 1, \ldots, b$.
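In code, Theorem 3.4 amounts to filling a b × b matrix of posterior mixed moments. A minimal sketch (Python with NumPy; a_post is an assumed input holding the posterior Dirichlet hyperparameters of one class):

import numpy as np

def effective_joint_density_discrete(a_post):
    # Theorem 3.4: f_Theta(i, j | y) = E[p_i p_j] under a Dirichlet(a_post) posterior
    a = np.asarray(a_post, dtype=float)
    s = a.sum()
    F = np.outer(a, a)                        # a_i * a_j for i != j
    F[np.diag_indices_from(F)] = a * (a + 1)  # a_i * (a_i + 1) on the diagonal
    return F / (s * (s + 1))

Summing this matrix over the bins misclassified by ψ in both coordinates reproduces Eq. 3.24 below, and hence the second moment of Theorem 3.5.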


Proof. For bin indices $i, j \in \{1, 2, \ldots, b\}$, by definition,

$$f_\Theta(i, j \mid 0) = \mathrm{E}_{\pi^*}[p_i p_j]. \quad (3.22)$$

Equation 3.20 follows from the second-order moments for Dirichlet distributions given in Eq. 2.48. Similarly, Eq. 3.21 follows from Eq. 2.47. The case for class 1 is similar. ▪

Theorem 3.5 (Dalton and Dougherty, 2012b). For the discrete model with Dirichlet$(a_1^{y*}, a_2^{y*}, \ldots, a_b^{y*})$ posteriors for class $y \in \{0, 1\}$, $a_i^{y*} > 0$ for all $i$ and $y$,

$$\mathrm{E}_{\pi^*}[(\varepsilon_n^y(\theta_y))^2] = \frac{(\hat{\varepsilon}_n^y)^2 \sum_{k=1}^b a_k^{y*}}{1 + \sum_{k=1}^b a_k^{y*}} + \frac{\hat{\varepsilon}_n^y}{1 + \sum_{k=1}^b a_k^{y*}}. \quad (3.23)$$

Proof. By Theorems 3.3 and 3.4,

$$\mathrm{E}_{\pi^*}[(\varepsilon_n^y(\theta_y))^2] = \sum_{i=1}^b \sum_{j=1}^b \frac{a_i^{y*}(a_j^{y*} + \delta_{ij})}{\left( \sum_{k=1}^b a_k^{y*} \right)\left( 1 + \sum_{k=1}^b a_k^{y*} \right)} I_{\psi(i) \ne y}\, I_{\psi(j) \ne y}, \quad (3.24)$$

where $\delta_{ij}$ is the Kronecker delta function, which equals 1 if $i = j$ and 0 otherwise. Continuing,

$$\begin{aligned}
\mathrm{E}_{\pi^*}[(\varepsilon_n^y(\theta_y))^2] &= \sum_{i=1}^b \frac{a_i^{y*} I_{\psi(i) \ne y}}{\left( \sum_{k=1}^b a_k^{y*} \right)\left( 1 + \sum_{k=1}^b a_k^{y*} \right)} \left[ \sum_{j=1}^b a_j^{y*} I_{\psi(j) \ne y} + I_{\psi(i) \ne y} \right] \\
&= \frac{\sum_{k=1}^b a_k^{y*}}{1 + \sum_{k=1}^b a_k^{y*}} \left[ \sum_{i=1}^b \frac{a_i^{y*}}{\sum_{k=1}^b a_k^{y*}} I_{\psi(i) \ne y} \right]\left[ \sum_{j=1}^b \frac{a_j^{y*}}{\sum_{k=1}^b a_k^{y*}} I_{\psi(j) \ne y} \right] \\
&\quad + \frac{1}{1 + \sum_{k=1}^b a_k^{y*}} \left[ \sum_{i=1}^b \frac{a_i^{y*}}{\sum_{k=1}^b a_k^{y*}} I_{\psi(i) \ne y} \right]. \quad (3.25)
\end{aligned}$$

We write this using the first moment via Eq. 2.56. ▪

By Theorem 3.5 and Eq. 2.56, the conditional MSE in the discrete model is given by Eq. 3.3, where

$$\mathrm{MSE}(\hat{\varepsilon}_n^y \mid S_n) = \frac{(\hat{\varepsilon}_n^y)^2 \sum_{k=1}^b a_k^{y*}}{1 + \sum_{k=1}^b a_k^{y*}} + \frac{\hat{\varepsilon}_n^y}{1 + \sum_{k=1}^b a_k^{y*}} - (\hat{\varepsilon}_n^y)^2 = \frac{\hat{\varepsilon}_n^y (1 - \hat{\varepsilon}_n^y)}{1 + \sum_{k=1}^b a_k^{y*}}. \quad (3.26)$$
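Equations 2.56, 3.23, and 3.26 reduce the class-wise computations in the discrete model to a few vector operations. A sketch (Python with NumPy; psi holds the classifier label of each bin and a_post the posterior Dirichlet hyperparameters of class y, both assumed inputs):

import numpy as np

def discrete_class_quantities(a_post, psi, y):
    # Bayesian MMSE error estimate (Eq. 2.56), second posterior moment (Eq. 3.23),
    # and conditional MSE (Eq. 3.26) of the class-y true error in the discrete model.
    a = np.asarray(a_post, dtype=float)
    psi = np.asarray(psi)
    s = a.sum()
    eps_hat = a[psi != y].sum() / s                   # posterior mass of bins assigned to the other class
    second_moment = (eps_hat ** 2 * s + eps_hat) / (s + 1)
    cond_mse = eps_hat * (1 - eps_hat) / (s + 1)
    return eps_hat, second_moment, cond_mse

Combining the two classes through Theorem 3.1 (with the beta posterior moments of c) then gives the full sample-conditioned MSE.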


Work in (Berikov and Litvinenko, 2003) derives expressions for the characteristic function of the misclassification probability, $\phi(t) = \mathrm{E}_{\pi^*}[\exp(it\,\varepsilon_n(\theta, \psi))]$, where $i$ is the imaginary unit. The same discrete Bayesian model is used, allowing for an arbitrary number of classes $K$ with known prior class probabilities and the flat prior over the bin probabilities. Given the characteristic function, the first and second moments of the true error, equivalent to the Bayesian MMSE error estimator and conditional MSE of the Bayesian MMSE error estimator, respectively, are derived. Higher-order moments of the true error can also be found from the characteristic function, although the equations become tedious.

Example 3.1. To demonstrate how the theoretical conditional RMS provides practical performance results for small samples, in contrast with distribution-free RMS bounds, which are too loose to be useful for small samples, we use the simulation methodology outlined in Fig. 2.5 with a discrete model and fixed bin size $b$. Assume that $c = 0.5$ is known and fixed, and let the bin probabilities of class 0 and 1 have Dirichlet priors given by the hyperparameters $a_i^0 \propto 2b - 2i + 1$ and $a_i^1 \propto 2i - 1$, respectively, where the $a_i^y$ are normalized such that $\sum_{i=1}^b a_i^y = b$ for $y \in \{0, 1\}$. In step 1 of Fig. 2.5, we generate random bin probabilities from the Dirichlet priors by first generating $2b$ independent gamma-distributed random variables, $g_i^y \sim \mathrm{gamma}(a_i^y, 1)$ for $i = 1, 2, \ldots, b$ and $y \in \{0, 1\}$. The bin probabilities are then given by

$$p_i = \frac{g_i^0}{\sum_{k=1}^b g_k^0}, \quad (3.27)$$

$$q_i = \frac{g_i^1}{\sum_{k=1}^b g_k^1}. \quad (3.28)$$
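Step 1 is straightforward to reproduce. The following sketch (Python with NumPy) draws one pair of bin-probability vectors from the Dirichlet priors of this example via the gamma construction of Eqs. 3.27 and 3.28; NumPy's built-in Dirichlet sampler would serve equally well.

import numpy as np

def sample_bin_probabilities(b, rng=None):
    # Draw (p, q) from the Dirichlet priors of Example 3.1 using gamma variates
    rng = np.random.default_rng() if rng is None else rng
    a0 = 2 * b - 2 * np.arange(1, b + 1) + 1      # a_i^0 proportional to 2b - 2i + 1
    a1 = 2 * np.arange(1, b + 1) - 1              # a_i^1 proportional to 2i - 1
    a0 = b * a0 / a0.sum()                        # normalize so the hyperparameters sum to b
    a1 = b * a1 / a1.sum()
    g0 = rng.gamma(shape=a0, scale=1.0)           # Eq. 3.27 numerators
    g1 = rng.gamma(shape=a1, scale=1.0)           # Eq. 3.28 numerators
    return g0 / g0.sum(), g1 / g1.sum()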

In step 2A we generate a random sample of size $n$; that is, the number of class-0 points $n_0$ is determined using a binomial$(n, c)$ experiment. Then $n_0$ points are drawn from the discrete distribution $[p_1, \ldots, p_b]$, and $n_1 = n - n_0$ points are drawn from the discrete distribution $[q_1, \ldots, q_b]$. Although the classes are equally likely, the number of training points from each class may not be the same. In step 2B the prior is updated to a posterior given the full training sample. In step 2C we train a discrete histogram classifier that breaks ties toward class 0. Finally, in step 2D we evaluate the exact error of the trained classifier, the Bayesian MMSE error estimator with “correct” priors (used to generate the true bin probabilities), and a leave-one-out error estimator. The Bayesian MMSE error estimator is found by evaluating Eq. 2.13 with $\mathrm{E}_{\pi^*}[c] = 0.5$ and $\hat{\varepsilon}_n^y$ defined in Eq. 2.56. The sample-conditioned RMS is computed from Eq. 3.3. The sampling procedure is repeated t = 1000


times for each fixed feature-label distribution, with T = 10,000 feature-label distributions, for a total of 10,000,000 samples.

From these experiments, it is possible to approximate the unconditional MSE (averaged over both the feature-label distribution and the sampling distribution) for any error estimator $\hat{\varepsilon}_\bullet$ using one of two methods: (1) the “semi-analytical” unconditional MSE computes $\mathrm{MSE}(\hat{\varepsilon}_\bullet \mid S_n)$ using closed-form expressions for the sample-conditioned MSE for each sample/iteration and averages over these values, and (2) the “empirical” unconditional MSE computes the squared difference $(\varepsilon_n - \hat{\varepsilon}_\bullet)^2$ between the true error $\varepsilon_n$ and the error estimate $\hat{\varepsilon}_\bullet$ for each sample/iteration and averages over these values. The semi-analytical RMS and empirical RMS are the square roots of the semi-analytical MSE and empirical MSE, respectively. Throughout, we use the semi-analytical unconditional MSE unless otherwise indicated.

Figures 3.1(a) and (b) show the probability densities of the sample-conditioned RMS for both the Bayesian MMSE error estimator and the leave-one-out error estimator. The sample sizes for each experiment are chosen such that the expected true error of the trained classifier is 0.25. Within each plot, we also show the unconditional semi-analytical RMS of both the leave-one-out and Bayesian MMSE error estimators, as well as the distribution-free RMS bound on the leave-one-out error estimator for the discrete histogram rule with tie-breaking in the direction of class 0 given in Eq. 1.48. Jaggedness in part (a) is not due to poor density estimation or Monte Carlo approximation, but rather is caused by the discrete nature of the problem.

Figure 3.1 Probability densities for the conditional RMS of the leave-one-out and Bayesian MMSE error estimators with correct priors: (a) b = 8, n = 16; (b) b = 16, n = 30. The sample sizes for each experiment were chosen such that the expected true error is 0.25. The unconditional RMS for both error estimators is also shown (loo: 0.151 and BEE: 0.0698 in (a); loo: 0.1103 and BEE: 0.0518 in (b)), as well as Devroye's distribution-free bound (1.0366 and 0.8576, respectively). [Reprinted from (Dalton and Dougherty, 2012c).]


In particular, the expressions for $\hat{\varepsilon}_n^0$, $\hat{\varepsilon}_n^1$, $\mathrm{E}_{\pi^*}[(\varepsilon_n^0(\theta_0))^2]$, and $\mathrm{E}_{\pi^*}[(\varepsilon_n^1(\theta_1))^2]$ can take on only a finite set of values, which is especially small for a small number of bins or sample points. In both parts of Fig. 3.1 (as well as in other unshown plots for different values of $b$ and $n$), the density of the conditional RMS for the Bayesian MMSE error estimator is much tighter than that of leave-one-out. For example, in Fig. 3.1(b) the conditional RMS of the Bayesian MMSE error estimator tends to be very close to 0.05, whereas the leave-one-out error estimator has a long tail with substantial mass between 0.05 and 0.2. Furthermore, the conditional RMS for the Bayesian MMSE error estimator is concentrated on lower values of RMS, so much so that in all cases the unconditional RMS of the Bayesian MMSE error estimator is less than half that of the leave-one-out error estimator. Without any kind of modeling assumptions, distribution-free bounds on the unconditional RMS are too loose to be useful. In fact, the bound from Eq. 1.48 is greater than 0.85 in both subplots of Fig. 3.1. On the other hand, a Bayesian framework facilitates exact expressions for the RMS conditioned on the sample for both the Bayesian MMSE error estimator and any other error estimation rule.
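One inner iteration of this experiment (steps 2A-2D) can be sketched as follows (Python with NumPy), reusing sample_bin_probabilities and discrete_class_quantities from the earlier sketches. The leave-one-out estimator is omitted for brevity, and c = 0.5 is treated as known, so Theorem 3.1 reduces to an equally weighted combination of the class-wise MSEs.

import numpy as np

def one_iteration(p, q, a0_prior, a1_prior, n, rng):
    # Steps 2A-2D of Example 3.1 with c = 0.5 known and fixed
    n0 = rng.binomial(n, 0.5)                   # 2A: class sizes, then bin counts
    n1 = n - n0
    U0 = rng.multinomial(n0, p)
    U1 = rng.multinomial(n1, q)
    a0_post = np.asarray(a0_prior) + U0         # 2B: posterior hyperparameters
    a1_post = np.asarray(a1_prior) + U1
    psi = (U1 > U0).astype(int)                 # 2C: histogram classifier, ties toward class 0
    true_err = 0.5 * p[psi == 1].sum() + 0.5 * q[psi == 0].sum()   # 2D: exact true error
    e0, _, mse0 = discrete_class_quantities(a0_post, psi, 0)
    e1, _, mse1 = discrete_class_quantities(a1_post, psi, 1)
    bee = 0.5 * (e0 + e1)                       # Eq. 2.13 with E[c] = 0.5
    cond_rms = np.sqrt(0.25 * mse0 + 0.25 * mse1)   # Theorem 3.1 with c fixed
    return true_err, bee, cond_rms

Repeating this for many samples and many prior draws, and recording the conditional RMS values, produces density estimates like those in Fig. 3.1.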

3.4 Gaussian Model

We next consider the Gaussian model. For linear classifiers, we have closed-form Bayesian MMSE error estimators for four models: fixed covariance, scaled identity covariance, diagonal covariance, and general covariance. However, note that independence between $\theta_0$ and $\theta_1$ implies that the following results are not applicable in homoscedastic models. To avoid cluttered notation, in this section we denote hyperparameters without subscripts.

3.4.1 Effective joint class-conditional densities

Known Covariance

The following theorem provides a closed form for the effective joint density in the known covariance model.

Theorem 3.6 (Dalton and Yousefi, 2015). If $\nu^* > 0$ and $\Sigma_y$ is a fixed symmetric positive definite matrix, then

$$f_\Theta(x, z \mid y) \sim \mathcal{N}\left( \begin{bmatrix} m^* \\ m^* \end{bmatrix}, \begin{bmatrix} \frac{\nu^*+1}{\nu^*}\Sigma_y & \frac{1}{\nu^*}\Sigma_y \\ \frac{1}{\nu^*}\Sigma_y & \frac{\nu^*+1}{\nu^*}\Sigma_y \end{bmatrix} \right). \quad (3.29)$$


Proof. First note that f uy ðxjyÞf uy ðzjyÞp ðmy jly Þ   1 1 T 1 ¼ ðx  m exp  Þ S ðx  m Þ y y y D 1 2 ð2pÞ 2 jSy j2   1 1 T 1  ðz  my Þ Sy ðz  my Þ D 1 exp  2 ð2pÞ 2 jSy j2    D ðn Þ 2 n  T 1   ðm  m Þ Sy ðmy  m Þ : D 1 exp  2 y ð2pÞ 2 jSy j2

(3.30)

We can view this as the joint Gaussian density for X, Z, and my , which is currently of the form f ðx, z, my Þ ¼ f ðxjmy Þf ðzjmy Þf ðmy Þ. The effective joint density f U ðx, zjyÞ is precisely the marginal density of X and Z: ðn Þ 2

D

f U ðx, zjyÞ ¼

ð2pÞD ðn þ 2Þ 2 jSy j !  1 n þ 1  T 1  ðx  m Þ Sy ðx  m Þ  exp  2 n þ 2  ! 1 n þ 1   exp  ðz  m ÞT S1 y ðz  m Þ 2 n þ 2 !  1 2   exp  ðx  m ÞT S1 : y ðz  m Þ 2 n þ 2 D

(3.31)

Note that "

n þ1 n Sy 1 n Sy

1 n Sy  n þ1 n S y

"

#1 ¼

n þ1 1 n þ2 Sy  n1þ2 S1 y

 n1þ2 S1 y n þ1 1 n þ2 Sy

# :

(3.32)

It is straightforward to check that f U ðx, zjyÞ is Gaussian with mean and covariance as given in Eq. 3.29. ▪ As expected, the marginal densities of X and Z are both equivalent to the effective density (Dalton and Dougherty, 2013a), i.e., 

 n þ 1 f U ðxjyÞ  N m , Sy : n 

Further, conditioned on X ¼ x,

(3.33)




 n m þ x n þ 2 ,  S : f U ðzjx, yÞ  N n þ 1 n þ1 y

(3.34)

X and Z become independent as ny → `. Scaled Identity Covariance

For a model with unknown covariance, Z f uy ðxjyÞf uy ðzjyÞp ðuy Þduy f U ðx, zjyÞ ¼ Uy

Z Z ¼

Ly

(3.35) 

RD



f uy ðxjyÞf uy ðzjyÞp ðmy jly Þdmy p ðly Þdly ,

where ly parameterizes the covariance matrix. The inner integral was solved in Theorem 3.6. Hence, Z D ðn Þ 2 f U ðx, zjyÞ ¼ D D  Ly ð2pÞ ðn þ 2Þ 2 jSy j  ! 1 n þ 1  ðx  m ÞT S1  exp  y ðx  m Þ 2 n þ 2 !  1 n þ 1  T 1   exp  ðz  m Þ Sy ðz  m Þ 2 n þ 2 !  1 2  p ðly Þdly :  exp  ðx  m ÞT S1 y ðz  m Þ 2 n þ 2 (3.36) We introduce the following theorem, which will be used in the scaled identity, diagonal, and general covariance models. Theorem 3.7. In the independent covariance model, where Sy is parameterized by ly and the posterior p ðly Þ

∝ jSy

k j

 þDþ1 2

  1  1 etr  S Sy 2

(3.37)

is normalizable, the effective joint density for class y is given by Z f U ðx, zjyÞ ∝

Ly

k

jSy j

 þDþ3 2

  1 1 etr  hðx, z, yÞSy dly , 2

(3.38)


where n þ 1 n þ 1 ðx  m Þðx  m ÞT þ  ðz  m Þðz  m ÞT  n þ2 n þ2 2 ðz  m Þðx  m ÞT þ S   n þ2 (3.39) n   T ðx  m Þðx  m Þ ¼  n þ1    n þ 1 x  m x  m T   zm   zm   þ  þ S : n þ1 n þ1 n þ2

hðx, z, yÞ ¼

Proof. The proof follows from the definition   ! 1 n þ 1  f U ðx, zjyÞ ∝ ðx  m ÞT S1 jSy j1 exp  y ðx  m Þ 2 n þ 2 Ly !  1 n þ 1   exp  ðz  m ÞT S1 y ðz  m Þ 2 n þ 2 !  1 2   exp  ðx  m ÞT S1 y ðz  m Þ 2 n þ 2   1 k þDþ1  jSy j 2 etr  S S1 dly y 2   Z  1 k þDþ3 1 ¼ jSy j 2 etr  hðx, z, yÞSy dly : 2 Ly Z

(3.40) The normalization constant depends on the parameterization variable ly . ▪ In the case of a scaled identity covariance, ly ¼ s2y , and from Theorem 3.7,   Z ` ðk þDþ3ÞD 1 2  2 ðsy Þ etr  2 hðx, z, yÞ ds2y : (3.41) f U ðx, zjyÞ ∝ 2sy 0 The integrand is essentially an inverse-Wishart distribution, which has a known normalization constant; thus, f U ðx, zjyÞ ∝ trðhðx, z, yÞÞ

ðk þDþ3ÞD2 2

,

(3.42)

where the proportionality treats x and z as variables. In (Dalton and Yousefi, 2015), it is shown that f U ðx, zjyÞ is a multivariate t-distribution with k  D degrees of freedom, location vector ½ðm ÞT , ðm ÞT T , and scale matrix


trðS Þ  n ðk  DÞ



ðn þ 1ÞID ID

 ID , ðn þ 1ÞID

(3.43)

where k ¼ ðk þ D þ 2ÞD  2. Given X ¼ x, we have f U ðzjx, yÞ ∝ f U ðx, zjyÞ ∝ trðhðx, z, yÞÞ

ðk þDþ3ÞD2 2

,

(3.44)

where the first proportionality treats z as a variable and x as constant. Letting S ¼

n ðx  m Þðx  m ÞT þ S , n þ 1

(3.45)

we have trðhðx, z, yÞÞ     n þ 1 x  m T x  m   ¼  zm   zm   þ trðS Þ n þ1 n þ1 n þ2     n þ 1 x  m T x  m   ∝1þ  zm   zm   : ðn þ 2Þ trðS Þ n þ1 n þ1

(3.46)

Thus, f U ðzjx, yÞ " ∝

  #kþD   T  2 (3.47) 1 x  m x  m 1þ ðUID Þ1 z  m   , z  m   n þ1 n þ1 k

where U¼

ðn þ 2Þ trðS∗∗ Þ . ðn þ 1Þ½ðk∗ þ D þ 2ÞD  2 

(3.48)

Hence, f U ðzjx, yÞ is a multivariate t-distribution with k degrees of freedom, location vector ðn m þ xÞ∕ðn þ 1Þ, and scale matrix UID . Diagonal Covariance

Applying Theorem 3.7 in the diagonal covariance model, with ly ¼ ½s2y1 , s2y2 , : : : , s2yD ,


f U ðx, zjyÞ Z Z ` ∝ ···


 1 1 etr  hðx, z, yÞSy ds2y1 · · · s2yD 2 0 0 i¼1   Z Z Y D ` ` D  1X hi ðxi , zi , yÞ 2 k þDþ3 ¼ ··· ðsyi Þ 2 exp  ds2y1 · · · s2yD 2 i¼1 s 2yi 0 0 i¼1   D Z ` Y h ðx , z , yÞ k þDþ3 ds2yi , ¼ ðs 2yi Þ 2 exp  i i 2i 2syi i¼1 0 D `Y

 þDþ3 2



k ðs 2yi Þ

(3.49) where hi ðxi , zi , yÞ is the ith diagonal element of hðx, z, yÞ [note that hi ðxi , zi , yÞ depends on only xi and zi, the ith elements of x and z, respectively]. Similar to Eq. 3.41 in the scaled identity case, the integrand is an unnormalized inverse-Wishart distribution. Thus,

f U ðx, zjyÞ ∝

D Y

½hi ðxi , zi , yÞ

k þDþ1 2

:

(3.50)

i¼1

In (Dalton and Yousefi, 2015), it is shown that f U ðx, zjyÞ is the joint distribution of D independent pairs of bivariate t-random vectors, where the ith pair has k  1 degrees of freedom, location parameter ½mi , mi T , and scale parameter sii ðn I2 þ 122 Þ∕½n ðk  1Þ. Note that k ¼ k þ D, mi is the ith element of m , and sii is the ith diagonal element of S . The conditional density is given by f U ðzjx, yÞ ∝ f U ðx, zjyÞ ∝

D Y i¼1



D Y i¼1

½hi ðxi , zi , yÞ "

k þDþ1 2

(3.51)

 # kþ1  2  2

 1 ðn þ 1Þk xi  m i  1þ   zi  mi   n þ1 k ðn þ 2Þsii

,

   2  where s ii ¼ ½n ∕ðn þ 1Þðxi  mi Þ þ sii is the ith diagonal element of  S . Thus, f U ðzjx, yÞ is the joint distribution of D independent nonstandardized Student’s t-distributions, where the ith distribution has k degrees of freedom, location parameter ðn mi þ xi Þ∕ðn þ 1Þ, and scale parameter  ðn þ 2Þs ii ∕½ðn þ 1Þk.


General Covariance

For the general covariance model, f U ðx, zjyÞ can be found from Theorem 3.7, with ly ¼ Sy : Z f U ðx, zjyÞ ∝

Sy ≻0

jSy j

k þDþ3 2

  1 dSy : etr  hðx, z, yÞS1 y 2

(3.52)

The parameter space Sy ≻ 0 is the space of all symmetric positive definite matrices. The integrand is essentially an inverse-Wishart distribution, which has a known normalization constant. Hence, letting x ¼ m  þ

x  m , n þ 1

(3.53)

we have k þ2

f U ðx, zjyÞ ∝ jhðx, z, yÞj 2 k þ2     2  n n þ 1 ðx  m Þðx  m ÞT þ  ðz  x Þðz  x ÞT þ S  ¼   (3.54) n þ1 n þ2   k þ2 n þ 2  2 ∝   : S þ ðz  x Þðz  x ÞT  n þ1 Applying the property jaaT þ Aj ¼ jAjð1 þ aT A1 aÞ

(3.55)

yields f U ðx, zjyÞ #k þ2   k þ2 "   1 2 n þ2  2 n þ 2   T   ðz  x Þ S  S 1 þ ðz  x Þ ∝    n þ1 n þ1 kþD    2 n þ2 S  ∝   n þ1 !kþD   1 2 1 n þ 2   S  1 þ ðz  x ÞT ðz  x Þ , k ðn þ 1Þk

(3.56)

where k ¼ k  D þ 2. Although f U ðx, zjyÞ is not easy to characterize in terms of standard distributions, the conditional density is given by

Sample-Conditioned MSE of Error Estimation

117

f U ðzjx, yÞ ∝ f U ðx, zjyÞ ∝

!kþD   1 2 1 n þ 2   S ðz  x Þ : 1 þ ðz  x ÞT k ðn þ 1Þk

(3.57)

Conditioned on X ¼ x, Z is a multivariate t-distribution with k degrees of freedom, location parameter x, and scale matrix    n þ 2  n þ 2 n   T  S ¼  ðx  m Þðx  m Þ þ S : ðn þ 1Þk ðn þ 1Þðk  D þ 2Þ n þ 1 (3.58) 3.4.2 Sample-conditioned MSE for linear classification We will present closed-form expressions for the conditional MSE of Bayesian MMSE error estimators under Gaussian distributions with linear classification for three models: known, scaled identity, and general covariance. We assume a linear classifier of the form cðxÞ ¼ 0 if gðxÞ ≤ 0 and cðxÞ ¼ 1 otherwise, where gðxÞ ¼ aT x þ b. Rather than use the effective joint density, we take a direct approach, as in (Dalton and Dougherty, 2012b); see Section 5.7 for similar derivations based on the effective joint density. Note that the required first and second moments are given by h Ep

ðεny ðuy ÞÞk

i

Z ¼

Uy

½εny ðuy Þk p ðuy Þduy

Z Z ¼

Ly

RD

(3.59) ½εny ðmy ,





ly Þ p ðmy jl y Þdmy p ðly Þdly k

for k ¼ 1 and k ¼ 2, respectively, where Uy is the parameter space for uy, and Ly is the parameter space for ly . For a fixed invertible covariance Sy , we require that n . 0 to ensure that the posterior p ðmy jly Þ is proper. To find the desired moments, we require a technical lemma. Lemma 3.1 (Dalton and Dougherty, 2012b). Let y ∈ {0, 1}, n > 0, m ∈ RD , S be an invertible covariance matrix, and gðxÞ ¼ aT x þ b, where a ∈ RD is a nonzero length D vector, and b ∈ R is a scalar. Then


 ð1Þy gðmÞ pffiffiffiffiffiffiffiffiffiffiffiffi f m , 1 S ðmÞdm F n aT Sa RD

pffiffiffiffiffiffi   Z n þ2 1 tan1 pffiffiffi d2 n exp  ¼ Id.0 ½2FðdÞ  1 þ du, p 0 2sin2 u 

Z

2

(3.60)

where d is given in Eq. 2.125. Proof. Call this integral M. Then   Z ð1Þy gðmÞ F2 pffiffiffiffiffiffiffiffiffiffiffiffi M¼ RD aT Sa    D ðn Þ 2 n  T 1   ðm  m Þ S ðm  m Þ dm: D 1 exp  2 ð2pÞ 2 jSj2

(3.61)

Since S is an invertible covariance matrix, we can use singular value decomposition to write S ¼ WWT with jSj ¼ jWj2 . Using the linear change pffiffiffiffiffi of variables z ¼ n W1 ðm  m∗ Þ, ! Z ffi aT Wz þ aT m þ bÞ ð1Þy ðp1ffiffiffi n 2 pffiffiffiffiffiffiffiffiffiffiffiffi F M¼ RD aT Sa (3.62)  T  1 z z  dz: D exp  2 ð2pÞ 2 Define ð1Þy WT a a ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi , n aT Sa

(3.63)

ð1Þy gðm Þ pffiffiffiffiffiffiffiffiffiffiffiffi , aT Sa

(3.64)



and note that kak2 ¼ ðn Þ1 . Then  T  Z 1 z z 2 T M¼ F ða z þ bÞ dz D exp  D 2 ð2pÞ 2 R  2 Z T  2 Z Z aT zþb a zþb 1 1 x y pffiffiffiffiffiffi exp  pffiffiffiffiffiffi exp  ¼ dx dy D 2 2 2p 2p ` ` R  T  1 z z  dz D exp  2 ð2pÞ 2 ! Z Z aT zþbZ aT zþb 1 x2 þ y2 þ zT z dxdydz:  ¼ Dþ2 exp 2 ` RD ` ð2pÞ 2

(3.65)


Next consider a change of variables w ¼ Rz, where R rotates the vector a pffiffiffiffiffi to the vector ½1∕ n , 0TD1 T . Since R is a rotation matrix, jRj ¼ 1 and RT R is an identity matrix. Let the first element in the vector w be called w. Then the preceding integral simplifies to  2  Z Z aT RT wþb Z aT RT wþb 1 x þ y2 þ wT w exp  M¼ dxdydw Dþ2 2 ` RD ` ð2pÞ 2  2  Z Z p1ffiffiffiwþb Z p1ffiffiffiwþb ` 1 x þ y2 þ w2 n n dxdydw: ¼ 3 exp  2 ð2pÞ2 ` ` ` (3.66) This reduces the problem to a three-dimensional space. Now consider the following rotation of the coordinate system: 2 pffiffiffiffiffi pffiffiffiffiffi pffiffi 3 2n ffi 2n ffi pffiffiffiffiffiffiffi 2 ffi  2pffiffiffiffiffiffiffi  2pffiffiffiffiffiffiffi " # " 0#  þ2  þ2  þ2 n n n x 7 x 6 pffiffi pffiffi 7 6 0 2 y ¼ 6  22 0 7 y : 2 4 0 p ffiffiffiffi 5 w w n ffi 1 ffi 1 ffi ffiffiffiffiffiffiffi ffiffiffiffiffiffiffi pffiffiffiffiffiffiffi p p    n þ2

n þ2

(3.67)

n þ2

pffiffiffiffiffi This rotates the vector ½x, y, wT ¼ ½1, 1, n T to the vector ½x0 , y0 , w0 T ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffi ½0, 0, n þ 2T . To determine the new region of integration, note that in the ½x, y, wT coordinate system the region of integration is defined by two pffiffiffiffiffi pffiffiffiffiffi restrictions: x , ð n Þ1 w þ b and y , ð n Þ1 w þ b. In the new coordinate system, the first restriction is ! pffiffiffiffiffiffiffi pffiffiffi pffiffiffi pffiffiffiffiffi n 1 1 2n 2 2 x0 þ w0 , pffiffiffiffiffi pffiffiffiffiffiffiffiffiffiffiffiffiffi x0 þ pffiffiffiffiffiffiffiffiffiffiffiffiffi w0 þ b: y0 þ pffiffiffiffiffiffiffiffiffiffiffiffiffi  pffiffiffiffiffiffiffiffiffiffiffiffiffi 2 n þ 2 2 n þ 2 n þ 2 n þ 2 n (3.68) Equivalently, 0

y ,

! pffiffiffiffiffiffiffiffiffiffiffiffiffi n þ 2 0 pffiffiffi pffiffiffiffiffi x þ 2 b: n

And, similarly, for the other restriction, ! pffiffiffiffiffiffiffiffiffiffiffiffiffi  pffiffiffi þ 2 n pffiffiffiffiffi y0 , x0 þ 2 b: n

(3.69)

(3.70)

We have designed our new coordinate system to make the variable w0 independent from these restrictions. Hence, w0 may be integrated out of our original integral, which may be simplified to


  02 1 x þ y02 dx0 dy0 exp  M¼ pffiffiffi pffiffi n 0 j 2 bÞ 2p 2 ` pffiffiffiffiffiffi ðjy n þ2  02  Z Z ` ` 1 x þ y02 exp  ¼2 dx0 dy0 : pffiffiffi pffiffi n 0  2 bÞ 2p 2 pffiffiffiffiffiffi 0 ðy  Z

`

Z

`

(3.71)

n þ2

If b ≤ 0, then we convert to polar coordinates ðr, uÞ, using ! pffiffiffiffiffiffiffiffiffiffiffiffiffi   n þ 2 pffiffiffiffiffi x0 ¼ r cos tan1 u n

(3.72)

and !  pffiffiffiffiffiffiffiffiffiffiffiffiffi  n þ 2 pffiffiffiffiffi y0 ¼ r sin tan1 u n

(3.73)

to obtain 1 M¼ p

Z

tan1

pffiffiffiffiffiffi Z n þ2 pffiffiffi n

`

ffiffiffi

p

n b pffiffiffiffiffiffi 

0

n þ1 sin u

 2 r exp  rdrdu: 2

(3.74)

Let u ¼ 0.5r2. Then 1 M¼ p ¼

1 p

Z

tan1

pffiffiffiffiffiffi Z n þ2 pffiffiffi n

0

Z

tan1

pffiffiffiffiffiffi n þ2 pffiffiffi n

0

` n b2 2ðn þ1Þsin2 u

exp 

expðuÞdudu  2

!

(3.75)

nb du: 2ðn þ 1Þsin2 u 

On the other hand, if b . 0, then from Eq. 3.71,  02  Z Z 1 ` ` x þ y02 exp  M¼ dx0 dy0 pffiffiffi n 0 2 p 0 pffiffiffiffiffiffi y n þ2 ffiffiffi  02  Z Z ppffiffiffiffiffiffi n y0 1 ` x þ y02 n þ2 þ dx0 dy0 : pffiffiffi pffiffi exp  n 0  2 bÞ 2 p 0 pffiffiffiffiffiffi ðy 

(3.76)

n þ2

Using the result for b ≤ 0, the first double integral equals pffiffiffiffiffiffiffiffiffiffiffiffiffi pffiffiffiffiffi p1 tan1 ð n þ 2∕ n Þ. For the second integral, we use the same polar transformation and u-substitution as before:


1 p

ffiffiffi




 x02 þ y02 dx0 dy0 pffiffiffi pffiffi exp  n 0  2 bÞ 2 pffiffiffiffiffiffi ðy 0 n þ2 pffiffiffi  2 Z pffiffiffiffiffiffi Z 0 n b 1 r n þ1 sin u

pffiffiffiffiffiffi ¼ exp  rdrdu  þ2 n 2 p tan1 pffiffiffi p 0

Z

`

Z

p

n pffiffiffiffiffiffi y0 n þ2

n

1 ¼ p 1 ¼ p

Z

0

tan1

Z

0

tan1

pffiffiffiffiffiffi n þ2 pffiffiffi p n

pffiffiffiffiffiffi n þ2 pffiffiffi p n

Z

n b2 2ðn þ1Þsin2 u

(3.77) expðuÞdudu

0

"

 1  exp 

n b2 2ðn þ 1Þsin2 u

# du:

Thus, 1 M ¼1 p

  n b2

pffiffiffiffiffiffi exp  du: n þ2 2ðn þ 1Þsin2 u tan1 pffiffiffi p n

Z

0

(3.78)

This may be simplified by realizing that a component of this integral is equivalent to an alternative representation for the Gaussian CDF function (Craig, 1991). We first break the integral into two parts, and then use symmetry in the integrand to simplify the result:   Z 1 0 n b2 exp  M ¼1 du p p 2ðn þ 1Þsin2 u

  Z tan1 ppffiffiffiffiffiffi n þ2 ffiffiffi p 1 n b2 n exp  þ du p p 2ðn þ 1Þsin2 u   Z 1 p n b2 exp  du ¼1 p 0 2ðn þ 1Þsin2 u

pffiffiffiffiffiffi   Z n þ2 1 tan1 pffiffiffi n b2 n exp  þ du p 0 2ðn þ 1Þsin2 u

  pffiffiffiffiffi   Z tan1 ppffiffiffiffiffiffi n þ2 ffiffiffi 1 n b2 n b n exp  ¼ 2F pffiffiffiffiffiffiffiffiffiffiffiffiffi 1þ du, p 0 2ðn þ 1Þsin2 u n þ 1 (3.79) which completes the proof.




Theorem 3.8 (Dalton and Dougherty, 2012b). For the Gaussian model with fixed covariance matrix $\Sigma_y$, assuming that $\nu^* > 0$,

$$\mathrm{E}_{\pi^*}[(\varepsilon_n^y(\theta_y))^2] = I_{d > 0}\,[2\Phi(d) - 1] + \frac{1}{\pi}\int_0^{\tan^{-1}\sqrt{\frac{\nu^*+2}{\nu^*}}} \exp\left( -\frac{d^2}{2\sin^2 u} \right) du \quad (3.80)$$

for $y \in \{0, 1\}$, where $d$ is defined in Eq. 2.125.
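As a numerical cross-check of Theorem 3.8, the single integral in Eq. 3.80 is well behaved and can be evaluated directly. A sketch (Python with SciPy; the quantity d from Eq. 2.125 and the posterior hyperparameter ν* are assumed to be supplied by the caller):

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def second_moment_fixed_covariance(d, nu_star):
    # Eq. 3.80: second posterior moment of the class-y true error, fixed covariance
    B = np.arctan(np.sqrt((nu_star + 2.0) / nu_star))
    integral, _ = quad(lambda u: np.exp(-d ** 2 / (2.0 * np.sin(u) ** 2)), 0.0, B)
    head = 2.0 * norm.cdf(d) - 1.0 if d > 0 else 0.0
    return head + integral / np.pi

Combined with the first posterior moment from Eq. 2.124 and Theorem 3.1, this yields the sample-conditioned MSE in the fixed covariance model.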

Proof. The outer integral in the definition of the second moment in Eq. 3.59 is not necessary because the current model has a fixed covariance. Hence, we need only solve the inner integral, which is given by Z ½εny ðmy , ly Þ2 p ðmy jly Þdmy Ep ½ðεny ðuy ÞÞ2  ¼ D R ! Z (3.81) y ð1Þ gðm Þ y ¼ F2 qffiffiffiffiffiffiffiffiffiffiffiffiffi f m , 1 Sy ðmy Þdmy : n RD aT Sy a This integral is simplified to a well-behaved single integral in Lemma 3.1. ▪ Theorem 3.1 in conjunction with Eqs. 2.124 and 3.80 gives MSEðˆεn jS n Þ for the fixed covariance model. Continuing with the Gaussian model, now assume that Sy is a scaled identity covariance matrix with ly ¼ s2y and Sy ¼ s2y ID . The posterior distribution is given in Lemma 2.1 as Eq. 2.100, and the first moment is given in Eq. 2.132. The second posterior moment in this case involves two special functions, one being the regularized incomplete beta function given in Eq. 2.128. The other function Rðx, y; aÞ is defined via the Appell hypergeometric function F1. Specifically, Rðx, 0; aÞ ¼

1 1 pffiffiffi sin ð xÞ p

(3.82)

and pffiffiffi

x aþ1 y 2 Rðx, y; aÞ ¼ pð2a þ 1Þ x þ y   1 1 3 xðy þ 1Þ x ,  F 1 a þ ; , 1; a þ ; 2 2 2 xþy xþy for a > 0, 0 < x < 1, and y > 0, where

(3.83)


GðcÞ F 1 ða; b, b ; c; z, z Þ ¼ GðaÞGðc  aÞ 0

0

Z

1

123

0

ta1 ð1  tÞca1 ð1  ztÞb ð1  z0 tÞb dt

0

(3.84) is defined for jzj , 1, jz0 j , 1, and 0 < a < c. Closed-form expressions for both R and the regularized incomplete beta function I for integer or half-integer values of a are discussed in Section 3.4.3. Lemma 3.2 (Dalton and Dougherty, 2012b). Let A ∈ R, p/4 < B < p/2, a > 0, and b > 0. Let f ðx; a, bÞ be the inverse-gamma distribution with shape parameter a and scale parameter b. Then # "    !  Z Z ` A 1 B A2 du f ðz; a, bÞdz exp  IA.0 2F pffiffiffi  1 þ p 0 z 2z sin2 u 0     A2 1 A2 2 ¼ IA.0 I ;a : ; , a þ R sin B, 2b A2 þ 2b 2 (3.85) Proof. Call this integral M. When A ¼ 0, it is easy to show that M ¼ B∕p. For A ≠ 0,     Z ` A M¼ IA.0 2F pffiffiffi  1 f ðz; a, bÞdz z 0   Z Z B 1 ` A2 þ exp  duf ðz; a, bÞdz p 0 0 2z sin2 u (3.86)     Z ` A F pffiffiffi f ðz; a, bÞdz  1 ¼ IA.0 2 z 0   Z Z B ` 1 A2 þ exp  duf ðz; a, bÞdz: p 0 0 2z sin2 u The integral in the first term is solved in (Dalton and Dougherty, 2011c, Lemma D.1). We have   A2 1 M ¼ IA.0 sgnðAÞI ; ,a A2 þ 2b 2   Z Z 1 ` B A2 þ exp  duf ðz; a, bÞdz p 0 0 2z sin2 u     Z Z A2 1 1 B ` A2 ; ,a þ exp  ¼ IA.0 I f ðz; a, bÞdzdu: p 0 0 A2 þ 2b 2 2z sin2 u (3.87) This intermediate result will be used in Lemma 3.3.



We next focus on the inner integral in the second term. Call this integral N. We have    a A2 b 1 b · exp  N¼ dz z 2z sin2 u GðaÞ zaþ1 0     Z ` 1 ba A2 1 exp  b þ ¼ dz aþ1 2 GðaÞ 0 z 2sin u z ba GðaÞ a ·

¼ A2 GðaÞ b þ 2sin 2u !a sin2 u ¼ , 2 sin2 u þ A2b Z

`

 exp 

(3.88)

where we have solved this integral by noting that it is essentially an inversegamma distribution. Thus, !a   Z A2 1 1 B sin2 u ; ,a þ du: (3.89) M ¼ IA.0 I 2 p 0 A2 þ 2b 2 sin2 u þ A 2b

Under the substitution u ¼ ðsin2 uÞ∕ðsin2 BÞ, !a Z B sin2 u du 2 sin2 u þ A2b 0 !a Z 1 u sin2 B sin B 1 1 u 2 ð1  u sin2 BÞ2 du ¼ A2 2 2 u sin B þ 2b 0     Z sin2aþ1 B 2b a 1 a1 2b sin2 B a 1 2  u 2 ð1  u sin BÞ 2 1 þ u du: ¼ 2 A2 A2 0 (3.90) This is essentially a one-dimensional Euler-type integral representation of the Appell hypergeometric function F 1 . In other words, !a Z B sin2 u du 2 sin2 u þ A2b 0  (3.91)  a  2aþ1 2 sin B 2b 1 1 3 2b sin B F 1 a þ ; , a; a þ ; sin2 B,  ¼ : 2a þ 1 A2 2 2 2 A2 Finally, from the identity

Sample-Conditioned MSE of Error Estimation

125

  z  z0 z0 0 F 1 ða; b, b ; c; z, z Þ ¼ ð1  z Þ F 1 a; b, c  b  b ; c; ,  1  z0 1  z0 (3.92) 0

0

0 a

in (Slater, 1966), we obtain !a Z B sin2 u du 2 sin2 u þ A2b 0 qffiffiffiffi !aþ1 A2 2 sin2 B 2b ¼ 2a þ 1 sin2 B þ A2 2b

1 1 3  F 1 a þ ; , 1; a þ ; 2 2 2   A2 ;a : ¼ pR sin2 B, 2b

2 ðA2b

þ 1Þsin B 2

sin B þ 2

A2 2b

,

sin2

!

B

(3.93)

2

sin B þ A2b 2

Combining this result with Eq. 3.89 completes the proof.



Theorem 3.9 (Dalton and Dougherty, 2012b). In the Gaussian model with scaled identity covariance matrix Sy ¼ s2y ID ,      A2 1 n þ2 A2 y 2 , ; a , (3.94) Ep ½ðεn ðuy ÞÞ  ¼ IA.0 I ; , a þR 2ðn þ 1Þ 2b A2 þ 2b 2 where ð1Þy gðm Þ A¼ kak2 a¼

rffiffiffiffiffiffiffiffiffiffiffiffiffi n ,  n þ1

ðk þ D þ 1ÞD  1, 2

1 b ¼ trðS Þ, 2

(3.95) (3.96) (3.97)

and it is assumed that n . 0, a . 0, and b . 0. Proof. We require that n . 0 to ensure that p ðmy jly Þ is proper, and we require that a . 0 and b . 0 to ensure that p ðs2y Þ is proper. Use Lemma 3.1 for the inner integral so that Eq. 3.59 reduces to



Ep ½ðεny ðuy ÞÞ2  ! # "   ! Z Z ` A 1 B A2 IA.0 2F qffiffiffiffiffi 1 þ exp  2 2 du f ðs 2y ; a, bÞds2y , ¼ p 2sy sin u 2 0 0 sy (3.98) ffiffiffiffi ffi p p ffiffiffiffiffiffiffiffiffiffiffiffiffi where B ¼ tan1 ð n þ 2∕ n Þ, and f ð⋅; a, bÞ is the inverse-gamma distribution with shape parameter a and scale parameter b. This integral is solved in Lemma 3.2, leading to Eq. 3.94. ▪ Theorem 3.1 in conjunction with Eqs. 2.132 and 3.94 gives MSEðˆεn jS n Þ for the scaled identity model. We next consider the general covariance model, in which Sy ¼ ly , and the parameter space Ly contains all symmetric positive definite matrices. The posterior distribution is given in Eq. 2.119 and the first posterior moment in Eq. 2.137. Prior to the theorem we provide a technical lemma. Lemma 3.3 (Dalton and Dougherty, 2012b). Let A ∈ R, p∕4 , B , p∕2, a ∈ RD be a nonzero column vector, k > D1, and S be a symmetric positive definite D  D matrix. Also let f IW ðS; S , k Þ be an inverse-Wishart distribution with parameters S and k . Then       ! Z Z A 1 B A2 IA.0 2F pffiffiffiffiffiffiffiffiffiffiffiffi  1 þ exp  du p 0 ð2sin2 uÞaT Sa S≻0 aT Sa  f IW ðS; S , k ÞdS   A2 1 k  D þ 1 ¼ IA.0 I ; , 2 A2 þ aT S a 2   2  A k Dþ1 2 þ R sin B, T  ; : 2 a Sa (3.99) Proof. Call this integral M. If A ¼ 0, it is easy to show that M ¼ B∕p. If A ≠ 0, then     Z A IA.0 2F pffiffiffiffiffiffiffiffiffiffiffiffi  1 f IW ðS; S , k ÞdS M¼ S≻0 aT Sa   Z B Z 1 A2 þ exp  duf IW ðS; S , k ÞdS p S≻0 0 ð2sin2 uÞaT Sa (3.100)     Z A   F pffiffiffiffiffiffiffiffiffiffiffiffi f IW ðS; S , k ÞdS  1 ¼ IA.0 2 S≻0 aT Sa   Z B Z 1 A2 þ exp  duf IW ðS; S , k ÞdS: p S≻0 0 ð2sin2 uÞaT Sa



The integral in the first term is solved in (Dalton and Dougherty, 2011c, Lemma E.1). We have   A2 1 k  D þ 1 ; , M ¼ IA.0 sgnðAÞI 2 A2 þ aT S a 2   Z B Z 2 1 A exp  þ duf IW ðS; S , k ÞdS 2 p S≻0 0 ð2sin uÞaT Sa (3.101)   A2 1 k  D þ 1 ¼ IA.0 I ; , 2 A2 þ aT S a 2   Z BZ 1 A2 þ f IW ðS; S , k ÞdSdu: exp  p 0 S≻0 ð2sin2 uÞaT Sa Define the constant matrix  C¼

0D1

 aT . j ID1

(3.102)

Since a is nonzero, with a simple reordering of the dimensions we can guarantee that a1 ≠ 0. The value of aT S a is unchanged by such a redefinition, so, without loss of generality, assume that C is invertible. Consider the change of variables Y ¼ CSCT. Since C is invertible, Y is symmetric positive definite if and only if S is also. Furthermore, the Jacobean determinant of this transformation is jCjDþ1 (Arnold, 1981; Mathai and Haubold, 2008). Note that y11 ¼ aT Sa, where y11 is the upper-left element of Y, and we have   A2 1 k  D þ 1 ; , M ¼ IA.0 I 2 A2 þ aT S a 2   Z BZ 2 1 A exp  þ f IW ðC1 YðCT Þ1 ; S , k Þ p 0 Y≻0 2y11 sin2 u 

 jjCjD1 jdY du



(3.103)

A2 1 k  D þ 1 ; , 2 A2 þ aT S a 2   Z BZ 2 1 A þ exp  f IW ðY; CS CT , k ÞdY du: p 0 Y≻0 2y11 sin2 u

¼ IA.0 I

Since the inner integrand now depends on only one parameter in Y, namely y11, the other parameters can be integrated out. It can be shown that for any D  D inverse-Wishart random matrix X with density f IW ðX; V, mÞ, the upper-left element x11 of X is also inverse-Wishart with density f IW ðx11 ; v11 , m  D þ 1Þ, where v11 is the upper-left element of V (Muller and Stewart, 2006). x11 is also an inverse-gamma random variable with



density f ðx11 ; ðm  D þ 1Þ∕2, v11 ∕2Þ. Here, the upper-left element of CS CT is aT S a, so  A2 1 k  D þ 1 M ¼ IA.0 I ; , 2 A2 þ aT S a 2     Z BZ ` 1 A2 k  D þ 1 aT S a , f y exp  ; þ dy11 du 11 p 0 0 2 2 2y11 sin2 u   A2 1 ¼ IA.0 I ; , a A2 þ 2b 2   Z BZ ` 1 A2 þ exp  f ðy11 ; a, bÞdy11 du, p 0 0 2y11 sin2 u (3.104) 

where we have defined a¼

k  D þ 1 , 2

(3.105)



aT S a : 2

(3.106)

Note that a > 0, b > 0, and this integral is exactly the same as Eq. 3.87. Hence, we apply Lemma 3.2 to complete the proof. ▪ Theorem 3.10 (Dalton and Dougherty, 2012b). In the Gaussian model with general covariance matrix, assuming that n . 0, k . D  1, and S is symmetric positive definite,      A2 1 n þ2 A2 y 2  , ; a , (3.107) Ep ½ðεn ðuy ÞÞ  ¼ IA.0 I ; , a þR 2ðn þ 1Þ 2b A2 þ 2b 2 where rffiffiffiffiffiffiffiffiffiffiffiffiffi n A ¼ ð1Þ gðm Þ , n þ 1 y



(3.108)



k  D þ 1 , 2

(3.109)



aT S a : 2

(3.110)



Proof. Using the same method as in the case of a scaled covariance, note that 1 3 0 2 0 Z 7 B 6 B A C Ep ½ðεny ðuy ÞÞ2  ¼ @IA.0 42F@qffiffiffiffiffiffiffiffiffiffiffiffiffiA  15 T Sy ≻0 a Sy a 1   Z B 2 1 A C exp  þ duAf IW ðSy ; S , k ÞdSy , 2 p 0 ð2sin uÞaT Sy a (3.111) ffiffiffiffi ffi p p ffiffiffiffiffiffiffiffiffiffiffiffiffi where B ¼ tan1 ð n þ 2∕ n Þ, and f IW ðSy ; S , k Þ is the inverse-Wishart distribution with parameters S and k. Applying Lemma 3.3 yields the desired second moment. ▪ Theorem 3.1 in conjunction with Eqs. 2.137 and 3.107 gives MSEðˆεn jS n Þ for the general covariance model. 3.4.3 Closed-form expressions for functions I and R The solutions proposed in the previous sections utilize two Euler integrals. The first is the incomplete beta function I ðx; a, bÞ defined in Eq. 2.128. In our application, we only need to evaluate I ðx; 1∕2, aÞ for 0 ≤ x < 1 and a > 0. The second integral is the function Rðx, y; aÞ, defined in Eq. 3.83 via the Appell hypergeometric function F 1 . Although these integrals do not have closed-form solutions for arbitrary parameters, in this section we provide exact expressions for I ðx; 1∕2, N∕2Þ and Rðx, y; N∕2Þ for positive integers N. Restricting k to be an integer guarantees that these equations may be applied so that Bayesian MMSE error estimators and their conditional MSEs for the Gaussian model with linear classification may be evaluated exactly using finite sums of common single-variable functions. Lemma 3.4 (Dalton and Dougherty, 2011c). Let N be a positive integer and let pffiffiffi 0 ≤ x ≤ 1. Then I ðx; 1∕2, 1∕2Þ ¼ ð2∕pÞsin1 ð xÞ, N1   2 1 N 2 1 pffiffiffi 2 pffiffiffi X ð2k  2Þ!! 1 ¼ sin ð xÞ þ x ð1  xÞk2 I x; , 2 2 p p ð2k  1Þ!! k¼1 pffiffiffi for any odd N . 1, I ðx; 1∕2, 1Þ ¼ x, and N2   2 pffiffiffi pffiffiffi X 1 N ð2k  1Þ!! I x; , ð1  xÞk ¼ xþ x 2 2 ð2kÞ!! k¼1

for any even N > 2, where !! denotes the double factorial.

(3.112)

(3.113)

130

Chapter 3

Proof. I ð1; a, bÞ ¼ 1 is a property of the regularized incomplete beta function for all a, b > 0. For the remainder of this proof we assume that 0 ≤ x < 1. We have

Z   x G Nþ1 1 N 1 N2 2 ¼ 1 N t2 ð1  tÞ 2 dt: (3.114) I x; , 2 2 G 2 G 2 0 pffiffi Using the substitution sin u ¼ t gives

Z 1 pffiffi   sin x 1 G Nþ1 1 N 2 ðcosN2 uÞ2 sin u cos udu I x; , ¼ 1 N pffiffi 2 2 G 2 G 2 sin1 0 sin u

Z 1 pffiffi sin x G Nþ1 cosN1 udu: ¼ 2 1 2 N G 2 G 2 0

(3.115)

For 0 ≤ a , p∕2 and k > 1, define Z M k ðaÞ ¼

a

cosk udu:

(3.116)

0

Using integration by parts or integration tables, it is well known that 8 > a if k ¼ 0, > < sin a if k ¼ 1, M k ðaÞ ¼ (3.117) k1 k  1 sin a cos a > > : M k2 ðaÞ þ if k . 1: k k The claims for N ¼ 1 and N ¼ 2 are easy to verify using the cases for k ¼ 0 and k ¼ 1 above, respectively. For the other cases, we apply a recursion using the equation for k > 1. If the recursion is applied i > 0 times such that n  2i . 1, then M n ðaÞ ¼

ðn  1Þ!! ðn  2iÞ!! M n2i ðaÞ ðn  2i  1Þ!! n!! i X ðn  1Þ!! ðn  2kÞ!! þ sin a cosn2kþ1 a ðn  2k þ 1Þ!! n!! k¼1

ðn  1Þ!! ðn  2iÞ!! M n2i ðaÞ ¼ n!! ðn  2i  1Þ!! i X ðn  1Þ!! ðn  2kÞ!! þ sin a cosn2kþ1 a: n!! ðn  2k þ 1Þ!! k¼1 In particular, for even n, repeat the recursion i ¼ n∕2 times to obtain

(3.118)



n

2 X ðn  1Þ!! 0!! ðn  1Þ!! ðn  2kÞ!! M 0 ðaÞ þ sin a cosn2kþ1 a M n ðaÞ ¼ n!! ð1Þ!! n!! ðn  2k þ 1Þ!! k¼1 n   2 X ðn  2kÞ!! ðn  1Þ!! cosn2kþ1 a ¼ a þ sin a n!! ðn  2k þ 1Þ!! k¼1 n   2 X ðn  1Þ!! ð2k  2Þ!! 2k1 ¼ a þ sin a cos a , n!! ð2k  1Þ!! k¼1

(3.119) and for odd n, repeat the recursion i ¼ ðn  1Þ∕2 times to obtain n1

2 X ðn  1Þ!! 1!! ðn  1Þ!! ðn  2kÞ!! M 1 ðaÞ þ sin a cosn2kþ1 a M n ðaÞ ¼ n!! 0!! n!! ðn  2k þ 1Þ!! k¼1 n1   2 X ðn  1Þ!! ðn  2kÞ!! n2kþ1 sin a 1 þ cos a ¼ n!! ðn  2k þ 1Þ!! k¼1 n1   2 X ðn  1Þ!! ð2k  1Þ!! 2k ¼ sin a 1 þ cos a , n!! ð2kÞ!! k¼1

(3.120) where in each case we have redefined the indices of the sums in reverse order. Returning to the original problem, for odd N > 1,



N1

N1 ! 2 ! ! G Nþ1 2N1 N1 2 2 N1 ðN  1Þ!! 2 2 2 1 N ¼ ¼ ¼ (3.121) pðN  2Þ! pðN  2Þ!! pðN  2Þ!! G 2 G 2 and

  pffiffiffi G Nþ1 1 N 2 I x; , ¼ 2 1 N M N1 ðsin1 xÞ 2 2 G 2 G 2 ðN  1Þ!! ðN  2Þ!! pðN  2Þ!! ðN  1Þ!! N1   2 pffiffiffi pffiffiffi X ð2k  2Þ!! pffiffiffiffiffiffiffiffiffiffiffi 2k1 1  sin ð 1  xÞ xþ x ð2k  1Þ!! k¼1

¼2

N1

2 pffiffiffi 2 pffiffiffi X 2 ð2k  2Þ!! 1 ¼ sin1 x þ ð1  xÞk2 : x p p ð2k  1Þ!! k¼1

(3.122)



Finally, for even N . 2,

G Nþ1 N! ðN  1Þ!! ðN  1Þ!! 1 2 N ¼ N N N2 ¼ N ¼ N2 G 2 G 2 2 2 ! 2 ! 2 2 2 ! 2ðN  2Þ!!

(3.123)

and

  pffiffiffi G Nþ1 1 N 2 I x; , ¼ 2 1 N M N1 ðsin1 xÞ 2 2 G 2 G 2 # " N2 2 X ðN  1Þ!! ðN  2Þ!! pffiffiffi ð2k  1Þ!! pffiffiffiffiffiffiffiffiffiffiffi 2k ¼2 x 1þ 1x 2ðN  2Þ!! ðN  1Þ!! ð2kÞ!! k¼1 N2

2 pffiffiffi pffiffiffi X ð2k  1Þ!! ð1  xÞk , ¼ xþ x ð2kÞ!! k¼1

(3.124) which completes the proof.



Lemma 3.5 (Dalton and Dougherty, 2012b). Let N be a positive integer, 0 < x < 1, and y ≥ 0. Then  rffiffiffiffiffiffiffiffiffiffiffi  1 1 1 xþy 1 pffiffiffi (3.125) R x, y; ¼ sin  tan1 ð yÞ, 2 p 1þy p  rffiffiffiffiffiffiffiffiffiffiffi  N 1 1 xþy 1 pffiffiffi R x, y; ¼ sin  tan1 ð yÞ 2 p 1þy p N1  i    pffiffiffi X y 2 ð2i  2Þ!! 1 yð1  xÞ 1 ; ,i 1I  p i¼1 ð2i  1Þ!! y þ 1 xþy 2 (3.126) for any odd N > 1, and   pffiffiffi N 1 R x, y; ¼ sin1 ð xÞ 2 p N2  iþ1    pffiffiffi X y 2 ð2i  1Þ!! 2 1 yð1  xÞ 1 1 ; , iþ 1I  2 i¼0 ð2iÞ!! yþ1 xþy 2 2 (3.127) for any even N > 1. By definition, (1)!! ¼ 0!! ¼ 1. The regularized incomplete beta function in these expressions can be found using Lemma 3.4.



pffiffiffi Proof. If y ¼ 0, then Rðx, 0; aÞ ¼ p1 sin1 ð xÞ. This is consistent with all equations in the statement of this lemma. Going forward we assume that y . 0. To solve R for half-integer values, we first focus on the Appell function F 1 . Define w ¼ xðy þ 1Þ∕ðx þ yÞ and z ¼ x∕ðx þ yÞ, and note that 0 , z , w , 1. For any real number a,   Z 1 1 1 ua ð1  wuÞ2 ð1  zuÞ1 du: F 1 a þ 1; , 1; a þ 2; w, z ¼ ða þ 1Þ 2 0 (3.128) Some manipulation yields   1 F 1 a þ 1; , 1; a þ 2; w, z 2 Z a þ 1 1 a1 1 ¼ u ðzuÞð1  wuÞ2 ð1  zuÞ1 du z 0  Z Z 1 1 aþ1 a1 12 1 a1 12 u ð1  wuÞ ð1  zuÞ du  u ð1  wuÞ du : ¼ z 0 0 (3.129) In the last integral, let v ¼ wu. Then   Z 1 a þ 1 1 a1 1 u ð1  wuÞ2 ð1  zuÞ1 du F 1 a þ 1; , 1; a þ 2; w, z ¼ 2 z 0 Z a þ 1 a w a1 1  w v ð1  vÞ2 dv: z 0 (3.130) The last integral is an incomplete beta function, and the first is again an Appell function. Hence,       1 aþ1 1 1 B a, F 1 a þ 1; , 1; a þ 2; w, z ¼  I w; a, 2 zwa 2 2  (3.131)  aþ1 1 F 1 a; , 1; a þ 1; w, z : þ az 2 A property of the regularized incomplete beta function is I ðx; a, bÞ ¼ 1  I ð1  x; b, aÞ:

(3.132)

Hence,     1 aþ1 1 F 1 a þ 1; , 1; a þ 2; w, z ¼ a; F1 , 1; a þ 1; w, z 2 az 2     aþ1 1 1 1  I 1  w; , a : B a,  zwa 2 2 (3.133)



By induction, for any positive integer k,   1 F 1 a þ k; , 1; a þ k þ 1; w, z 2   aþk 1 ¼ F 1 a; , 1; a þ 1; w, z 2 azk    k 1  aþkX z i 1 1  a k B a þ i, 1  I 1  w; , a þ i : 2 2 w z i¼0 w

(3.134)

We apply this to the definition of R before the statement of Lemma 3.2 to decompose R into one of two Appell functions with known solutions. In particular,     pffiffiffi yz 1 1 R x, y; F 1; , 1; 2; w, z , ¼ 2p 1 2 2

(3.135)

    pffiffiffi yz N 1 F 1 1; , 1; 2; w, z R x, y; ¼ 2p 2 2 N3     pffiffiffi X yz 2 z i 1 1  B i þ 1, 1  I 1  w; , i þ 1 2pw i¼0 w 2 2 (3.136) for N > 1 odd, and    pffiffiffiffiffi  yz N 1 1 3 R x, y; F ; , 1; ; w, z ¼ p 1 2 2 2 2 N2       pffiffiffiffiffi X yz 2 z i 1 1 1 1  pffiffiffiffi B iþ , 1  I 1  w; , i þ 2 2 2 2 2p w i¼0 w (3.137) for N > 1 even. We can write Bði, 1∕2Þ ¼ ð2iÞ!!∕½ið2i  1Þ!! and Bði þ 1∕2, 1∕2Þ ¼ pð2i  1Þ!!∕ð2iÞ!! for integers i ≥ 1. Finally, to evaluate R, it can be shown that   1 F 1 1; , 1; 2; w, z 2 " (3.138) rffiffiffiffiffiffiffiffiffiffiffiffi rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi# 2 z zð1  wÞ 1 1 ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi tan  tan , wz wz zðw  zÞ



and  F1

rffiffiffiffiffiffiffiffiffiffiffiffi  1 1 3 1 wz 1 ; , 1; ; w, z ¼ pffiffiffiffiffiffiffiffiffiffiffiffi tan : 2 2 2 1w wz

(3.139)

Further simplification gives the result in the statement of the lemma.
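Because the regularized incomplete beta function is available in standard libraries, Lemma 3.4 is easy to verify numerically. A sketch (Python with SciPy; scipy.special.factorial2 supplies the double factorial):

import numpy as np
from scipy.special import betainc, factorial2

def I_half(x, N):
    # I(x; 1/2, N/2) via the finite sums of Lemma 3.4
    if N == 1:
        return (2 / np.pi) * np.arcsin(np.sqrt(x))
    if N == 2:
        return np.sqrt(x)
    if N % 2:                                    # odd N > 1
        k = np.arange(1, (N - 1) // 2 + 1)
        s = np.sum(factorial2(2 * k - 2) / factorial2(2 * k - 1) * (1 - x) ** (k - 0.5))
        return (2 / np.pi) * (np.arcsin(np.sqrt(x)) + np.sqrt(x) * s)
    k = np.arange(1, (N - 2) // 2 + 1)           # even N > 2
    s = np.sum(factorial2(2 * k - 1) / factorial2(2 * k) * (1 - x) ** k)
    return np.sqrt(x) * (1 + s)

# check against the library implementation
x = 0.37
for N in range(1, 8):
    assert np.isclose(I_half(x, N), betainc(0.5, N / 2, x))

An analogous finite-sum implementation of R(x, y; N/2) follows from Lemma 3.5, with I_half supplying the incomplete beta terms.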



3.5 Average Performance in the Gaussian Model

In this section we examine performance in Gaussian models under proper priors with fixed sample size, illustrating that different samples condition RMS performance to different extents and that models using more informative priors have better RMS performance. Consider an independent general covariance Gaussian model with known and fixed $c = 0.5$. Let $\nu_0 = 6D$, $\nu_1 = 3D$, $m_0 = 0_D$, $m_1 = 0.1719 \cdot 1_D$, $\kappa_y = 3D$, and $S_y = 0.03(\kappa_y - D - 1) I_D$. This is a proper prior, where $m_1$ has been calibrated to give an expected true error of 0.25 with $D = 1$. Following the procedure in Fig. 2.5, in step 1, $\mu_0$, $\Sigma_0$, $\mu_1$, and $\Sigma_1$ are generated according to the specified priors. This is done by generating a random covariance according to the inverse-Wishart distribution $p(\Sigma_y)$ using methods in (Johnson, 1987). Conditioned on the covariance, we generate a random mean from the Gaussian distribution $p(\mu_y \mid \Sigma_y) \sim \mathcal{N}(m_y, \Sigma_y / \nu_y)$, resulting in a normal-inverse-Wishart distributed mean and covariance pair. The parameters for class 0 are generated independently from those of class 1. In step 2A we generate a random sample of size $n$, in step 2B the prior is updated to a posterior, and in step 2C we train an LDA classifier. In step 2D, the true error of the classifier is computed exactly, the training data are used to evaluate the 5-fold cross-validation error estimator, a Bayesian MMSE error estimator is found exactly using the posterior, and the theoretical sample-conditioned RMS for the Bayesian MMSE error estimator is computed exactly. The sampling procedure is repeated t = 1000 times for each fixed feature-label distribution, with T = 10,000 feature-label distributions, for a total of 10,000,000 samples.

Table 3.1 shows the accuracy of the analytical formulas for conditional RMS using n = 60 with different feature sizes (D = 1, 2, and 5).

Table 3.1 Average true error, semi-analytical RMS, and absolute difference between the semi-analytical and empirical RMS for the Bayesian MMSE error estimator.

Simulation settings   Average true error   Semi-analytical RMS   Absolute difference between RMSs
n = 60, D = 1         0.2474               0.0377                5.208 x 10^-6
n = 60, D = 2         0.1999               0.0358                3.110 x 10^-5
n = 60, D = 5         0.1156               0.0262                8.971 x 10^-5



Figure 3.2 Probability densities for the conditional RMS of the cross-validation and Bayesian MMSE error estimators with correct priors. The unconditional RMS for both error estimators is also indicated (cv: 0.0425; BEE: 0.0262) (D = 5, n = 60). [Reprinted from (Dalton and Dougherty, 2012c).]

There is close agreement between the semi-analytical RMS and empirical RMS (both defined in Section 3.3) of the Bayesian MMSE error estimator. The table also provides the average true errors for each model. Figure 3.2 shows the estimated densities of the conditional RMS, found from the conditional RMS values recorded in each iteration of the experiment, for both the cross-validation and Bayesian MMSE error estimators. The conditional RMS for cross-validation is computed from Eq. 3.12. Results for D = 5 features and n = 60 sample points are shown. The semi-analytical unconditional RMS for each error estimator is also printed in each graph for reference. The high variance of these distributions illustrates that different samples condition the RMS to different extents. Meanwhile, the conditional RMS for cross-validation has a much higher variance and is shifted to the right, which is expected since the conditional RMS of the Bayesian MMSE error estimator is optimal.
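In code, the two estimates of the unconditional RMS are simply two reductions over the per-iteration records. A sketch (Python with NumPy; cond_mse, true_err, and est are assumed arrays accumulated over all iterations and feature-label distributions):

import numpy as np

def semi_analytical_rms(cond_mse):
    # average the closed-form sample-conditioned MSEs, then take the square root
    return np.sqrt(np.mean(cond_mse))

def empirical_rms(true_err, est):
    # average the realized squared deviations, then take the square root
    return np.sqrt(np.mean((np.asarray(true_err) - np.asarray(est)) ** 2))

The absolute differences reported in Table 3.1 are the differences between these two quantities for the Bayesian MMSE error estimator.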

3.6 Convergence of the Sample-Conditioned MSE

Section 2.7 discussed consistency of the Bayesian MMSE error estimator, defining weak, rth mean, and strong consistency. Throughout this section, we denote the true error by $\varepsilon_n(\theta, S_n)$ rather than $\varepsilon_n(\theta, \psi_n)$ to emphasize dependency on the sample. For the sample-conditioned MSE we are interested in showing that for all $\theta \in \Theta$, $\mathrm{MSE}(\hat{\varepsilon}_n(S_n, \psi_n) \mid S_n) \to 0$ (almost surely), i.e.,

$$\Pr\nolimits_{S_\infty \mid \theta}\Big( \mathrm{E}_{\theta \mid S_n}\big[ (\hat{\varepsilon}_n(S_n, \psi_n) - \varepsilon_n(\theta, S_n))^2 \big] \to 0 \Big) = 1. \quad (3.140)$$

We refer to this property as conditional MSE convergence. Since $\hat{\varepsilon}_n(S_n, \psi_n) = \mathrm{E}_{\theta \mid S_n}[\varepsilon_n(\theta, S_n)]$, for conditional MSE convergence it must be shown that


PrS

` ju


ðEujS n ½ðEujS n ½εn ðu, S n Þ  εn ðu, S n ÞÞ2  → 0Þ

¼ PrS ¼ PrS

` ju `

ðEujS n ½ðEujS n ½ f n ðu, S n Þ  f n ðu, S n ÞÞ2  → 0Þ

2 2 ju ðEujS n ½ f n ðu, S n Þ  ðEujS n ½ f n ðu, S n ÞÞ → 0Þ

(3.141)

¼ 1, where f n ðu, S n Þ ¼ εn ðu, S n Þ  εn ðu, S n Þ. Hence, both strong consistency and conditional MSE convergence are proved if, for any true parameter u and both k ¼ 1 and k ¼ 2,   h i PrS ju EujS n j f n ðu, S n Þjk → 0 ¼ 1: (3.142) `

Recall that Theorem 2.13 not only addresses strong consistency, but also addresses the stronger requirement in Eq. 3.142. In particular, Theorems 2.13 and 2.14 not only prove that the Bayesian MMSE error estimator is strongly consistent as long as the true error functions εn ðu, S n Þ form equicontinuous sets for fixed samples and the posteriors are weak consistent with an appropriate sampling methodology, but they also prove conditional MSE convergence under the same constraints. Indeed, the conclusion of Theorem 2.13 can be amended to include conditional MSE convergence by letting k ¼ 2.

3.7 A Performance Bound for the Discrete Model In the previous section we have seen that MSEðˆεn jS n Þ → 0 as n → ` (almost surely relative to the sampling process) for the discrete model. In fact, it is possible to derive an upper bound on MSEðˆεn jS n Þ as a function of only the sample size in some cases. In the discrete model, the conditional MSE is given by Eq. 3.26 and can be written as MSEðˆεny jS n Þ ¼

εˆ ny ð1  εˆ ny Þ P : ny þ bi¼1 ayi þ 1

(3.143)

Plugging this expression for both y ¼ 0 and y ¼ 1 into Eq. 3.3 and assuming a beta posterior model for c with hyperparameters ay yields MSEðˆεn jS n Þ ¼

a0 a1 ðˆεn0  εˆ n1 Þ2 ða0 þ a1 Þ2 ða0 þ a1 þ 1Þ þ

a0 ða0 þ 1Þ εˆ n0 ð1  εˆ n0 Þ P ⋅     ða0 þ a1 Þða0 þ a1 þ 1Þ n0 þ bi¼1 a0i þ 1

þ

a1 ða1 þ 1Þ εˆ n1 ð1  εˆ n1 Þ P ⋅ : ða0 þ a1 Þða0 þ a1 þ 1Þ n1 þ bi¼1 a1i þ 1

(3.144)

138

Chapter 3

From this, it is clear that MSEðˆεn jS n Þ → 0 (and these results apply for any P classification rule). In particular, as long as a0 ≤ n0 þ bi¼1 a0i and P a1 ≤ n1 þ bi¼1 a1i , which is often the case, then ða0 þ a1 þ 1Þ MSEðˆεn jS n Þ a a a a ≤  0 1  2 ðˆεn0  εˆ n1 Þ2 þ  0  εˆ n0 ð1  εˆ n0 Þ þ  1  εˆ n1 ð1  εˆ n1 Þ a0 þ a1 a0 þ a1 ða0 þ a1 Þ ¼ Ep ½cEp ½1  cðˆεn0  εˆ n1 Þ2 þ Ep ½cˆεn0 ð1  εˆ n0 Þ þ Ep ½1  cˆεn1 ð1  εˆ n1 Þ, (3.145) where we have used E p ½c ¼ a0 ∕ða0 þ a1 Þ and E p ½1  c ¼ a1 ∕ða0 þ a1 Þ. To help simplify this equation further, define x ¼ εˆ n0 , y ¼ εˆ n1 , and z ¼ E p ½c. Then MSEðˆεn j S n Þ ≤

zð1  zÞðx  yÞ2 þ zxð1  xÞ þ ð1  zÞyð1  yÞ a0 þ a1 þ 1

zx þ ð1  zÞy  ½zx þ ð1  zÞy2 . ¼ a0 þ a1 þ 1

(3.146)

From Eq. 2.13 note that εˆ n ¼ zx þ ð1  zÞy, and also note that 0 ≤ εˆ n ≤ 1. Hence, MSEðˆεn j S n Þ ≤

1 εˆ n  ðˆεn Þ2 ≤ , a0 þ a1 þ 1 4ða0 þ a1 þ 1Þ

(3.147)

where the last inequality utilizes the fact that w  w2 ¼ wð1  wÞ ≤ 1∕4 whenever 0 ≤ w ≤ 1. Thus, the conditional RMS of the Bayesian MMSE error estimator for any discrete classifier, averaged over all feature-label distributions with beta posteriors on c such that ay ≥ ny (which holds under proper beta priors and random sampling) and Dirichlet priors on the bin probabilities P such that ay ≤ ny þ bi¼1 ayi , satisfies rffiffiffiffiffi 1 RMSðˆεn j S n Þ ≤ . 4n

(3.148)

Since this bound is only a function of the sample size, it holds if we remove the conditioning on S n . For comparison, we consider a remarkably similar holdout bound. If the data are split between training and test data, where the classifier is designed on the training data and the classifier error is estimated on the test data, then we have the distribution-free bound


$$\mathrm{RMS}(\hat{\varepsilon}_{\mathrm{holdout}} \mid S_{n-m}, c, \theta_0, \theta_1) \le \sqrt{\frac{1}{4m}},$$


(3.149)

where $m$ is the size of the test sample, and $S_{n-m}$ is the training sample (Devroye et al., 1996). Note that uncertainty here stems from the sampling distribution of the test sample. In any case, the bound is still true upon removing the conditioning. The RMS bound on the Bayesian MMSE error estimator is always lower than that of the holdout estimate, which is a testament to the power of modeling assumptions. Moreover, as m → n for full holdout, the holdout bound converges down to the Bayes estimate bound.

Example 3.2. We next compare the performance of Bayesian MMSE error estimation versus holdout error estimation, inspired by the similarity between the performance bounds given in Eqs. 3.148 and 3.149. We use the simulation methodology outlined in Fig. 2.5 with a discrete model and fixed bin size $b$. Let the prior for $c$ be uniform(0, 1) ($a_0 = a_1 = 1$) and let the bin probabilities of classes 0 and 1 have Dirichlet priors given by the hyperparameters $a_i^0 \propto 2b - 2i + 1$ and $a_i^1 \propto 2i - 1$, respectively, where the $a_i^y$ are normalized such that $\sum_{i=1}^b a_i^y = b$ for $y \in \{0, 1\}$. These priors satisfy $a_y \le \sum_{i=1}^b a_i^y$; therefore, with random sampling $a_y^* \le n_y + \sum_{i=1}^b a_i^y$. In step 1, generate a random $c$ from the uniform distribution and generate random bin probabilities from the Dirichlet priors using the same methodology outlined in Example 3.1. In step 2A generate a random sample with fixed sample size $n$, and in step 2B update the prior to a posterior. For a fair comparison between the Bayesian MMSE error estimator, which is a full-sample error estimator, and the holdout estimator, which partitions the sample into training and testing datasets, we consider separate experiments: one trains a classifier with the full sample and evaluates the Bayesian MMSE error estimate; the other trains a classifier on the sample data without holdout points and evaluates the holdout error estimate on the held-out points. In step 2C, we design a discrete histogram classifier using the full training sample, breaking ties toward class 0. In step 2D, the true error is found exactly and the Bayesian MMSE error estimator for this classifier is found from the classifier and posterior by evaluating Eq. 2.13 with $\mathrm{E}_{\pi^*}[c] = (n_0 + 1)/(n + 2)$ and $\hat{\varepsilon}_n^y$ defined in Eq. 2.56. The sample-conditioned RMS of the Bayesian MMSE error estimator is computed from Eqs. 3.3 and 3.23. In step 2C, for each $m = 1, 2, \ldots, n$ the original sample is partitioned into $n - m$ training points and $m$ holdout points, where the proportion of points from each class in the holdout set is kept as close as possible to that of the original sample. Each training set is used to find a discrete histogram

140

Chapter 3 0.4

BEE holdout

RMS deviation from true error

average true error

0.5

0.4

0.3

0.2

0.1

2

4

6

8

10

12

holdout sample size

(a)

14

16

BEE holdout holdout upper bound

0.3

0.2

0.1

0

2

4

6

8

10

12

14

16

holdout sample size

(b)

Figure 3.3 Comparison of the holdout error estimator and Bayesian MMSE error estimator with correct priors with respect to the holdout sample size for a discrete model with b ¼ 8 bins and fixed sample size n ¼ 16: (a) average true error; (b) RMS performance. [Reprinted from (Dalton and Dougherty, 2012c).]

classifier. In step 2D, the true error is found exactly and the holdout estimate is evaluated as the proportion of classification errors on the holdout subset. The sampling procedure is repeated t ¼ 10,000 times for each fixed feature-label distribution, and T ¼ 10,000 feature-label distributions are generated (corresponding to randomly selected parameters), for a total of 100,000,000 samples. Results are shown in Fig. 3.3 for b ¼ 8 with n ¼ 16. This setting is also used in Fig. 3.1. The results are typical, where part (a) shows the expected true error, and part (b) shows the RMS between the true and estimated errors, both as functions of the holdout sample size m. As expected, the average true error of the classifier in the holdout experiment decreases and converges to the average true error of the classifier trained from the full sample as the holdout sample size decreases. In addition, the RMS performance of the Bayesian MMSE error estimator consistently surpasses that of the holdout error estimator, as suggested by the RMS bounds given in Eqs. 3.148 and 3.149. Thus, under a Bayesian model, not only does using the full sample to train the classifier result in a lower true error, but we can achieve better RMS performance using training-data error estimation than by holding out the entire sample for error estimation.

3.8 Censored Sampling As discussed in Section 1.5, it is necessary to bound the RMS (or some other criterion of estimation accuracy) of the error estimator in order for the classifier model, classifier, and error estimate to be epistemologically meaningful. In that section we considered a situation in which the unconditioned RMS for every feature-label distribution in an uncertainty class is known and therefore one can bound the unconditioned RMS over the

Sample-Conditioned MSE of Error Estimation

141

uncertainty class. In particular, one can compute a sample size to guarantee a desired degree of error estimation accuracy. This applies to the common setting in which the sample size is determined prior to sampling. With censored sampling, the size of the sample is determined adaptively based on some criterion determining when a sufficient number of sample points has been observed. For example, suppose that one desires a sample-conditioned MSE of at most r. Following the selection of a batch of sample points, the sample-conditioned MSE can be computed. If MSEðˆεn j S n Þ . r, then another batch (perhaps only one point) can be randomly drawn and adjoined to S n ; otherwise, the sampling procedure ends. Since the conditional MSE converges almost surely to 0 with increasing sample size, any desired threshold for the conditional MSE will almost surely eventually be achieved. When sample points are expensive, difficult, or time-consuming to obtain, censored sampling allows one to stop sampling once a desired degree of accuracy is achieved. 3.8.1 Gaussian model When applying the conditional RMS to censored sampling with synthetic data from the general covariance Gaussian model, step 1 shown in Fig. 2.5 remains the same; that is, we still define a fixed set of hyperparameters and use these priors to generate random feature-label distributions. However, the sampling procedure in step 2 is modified, as shown in Fig. 3.4. Instead of fixing the sample size ahead of time, sample points are collected in batches until a stopping criterion is satisfied. An example of censored sampling is provided in Fig. 3.5 for an independent general covariance Gaussian model with D ¼ 2 features, c ¼ 0.5 fixed and known, and the following hyperparameters for the priors of u0 and u1: n0 ¼ 36, n1 ¼ 18, m0 ¼ 02 , m1 ¼ 0.2281 ⋅ 12 , ky ¼ 18, and Sy ¼ 0.03ðky  3ÞI2 . We implement censored sampling with LDA. Under these hyperparameters, the average Bayes error is 0.158. In step A of each iteration, we draw a small initial training sample from the feature-label distribution. In our implementation the training sample is initialized with two sample points in each class, for a total of four sample points. In step B we update hyperparameters, and in step C we train an LDA

Figure 3.4 Simulation methodology for censored sampling under a fixed feature-label distribution or a large dataset representing a population.

Chapter 3

3000

1000

2500

800

2000

PDF

PDF

142

1500

600 400

1000 200

500 0

0

0.01

0.02

0.03

0.04

0.05

0

0

50

100

sample-conditioned RMS

censored sample size

(a)

(b)

150

Figure 3.5 Classification performance under fixed sample size versus censored sampling (D ¼ 2, c ¼ 0.5): (a) density of conditional RMS under fixed sample size, LDA, n ¼ 60; (b) density of censored sample size, LDA.

classifier using the initial training sample with no feature selection. Step D checks the current Bayesian MMSE error estimate as well as the conditional MSE for the initial training sample and designed classifier. For LDA, these may be found in closed form. If the stopping criteria, which we will outline in detail shortly, are not satisfied, then a random sample of four points is augmented to the current training set in step E. To do this, we first establish the labels of the new sample points from an independent binomialð4, cÞ experiment and then draw the sample points from the corresponding classconditional distributions. Hyperparameters are updated again (returning to step B), a new LDA classifier is designed (step C), and the Bayesian MMSE error estimate and conditional MSE are checked again (step D). This is repeated until the stopping criteria are satisfied, or a maximum sample size of 160 is reached, in which case the sampling procedure halts. Step F collects several outputs: the final censored sample size, the true error of the final classifier (from the true distribution parameters), a 5-fold cross-validation error estimate, the Bayesian MMSE error estimate, and the conditional MSE of the Bayesian MMSE error estimate (the latter two already having been found when checking the stopping criterion). All error estimates and the conditional MSE are found relative to the final censored sample and final classifier. This procedure is repeated t ¼ 1000 times for each fixed featurelabel distribution, for T ¼ 1000 random feature-label distributions, for a total of tT ¼ 1,000,000 censored samples. It remains to specify the stopping criteria. We desire RMSðˆεn j S n Þ ≤ 0.0295, which is the semi-analytical RMS (root of the average conditional MSE) reported for a fixed-sample-size experiment with LDA and n ¼ 60. Our interest is primarily in the overall conditional MSE, but we also implement five additional stopping conditions that must all be satisfied: RMSðˆεn0 j S n Þ ≤ 0.0295  2, RMSðˆεn1 j S n Þ ≤ 0.0295  2, εˆ n ≤ 0.3, εˆ n0 ≤ 0.35,

Sample-Conditioned MSE of Error Estimation

143

and εˆ n1 ≤ 0.35. The latter three conditions help avoid early stopping. The average Bayes error for this model is well below 0.3, so in most cases the conditional RMS determines the censored sample size. The sample size is different in each trial because the conditional MSE depends on the actual data obtained from sampling. Consistency guarantees that MSEðˆεn jS n Þ will eventually reach the stopping criterion; thus, relative to the conditional MSE, censored sampling may work to any degree desired. Figure 3.5(a) shows a histogram of the conditional MSE for a fixed-samplesize experiment with LDA and n ¼ 60. Here, the average true error is 0.170 and the root of the average MSE is 0.0295. For comparison, a histogram of the sample size obtained with LDA and censored sampling experiments is shown in Fig. 3.5(b). In this experiment, the average true error is 0.169, the root of the average MSE is 0.029, and the average sample size is 62.1. A key point is that the distribution in Fig. 3.5(a) has a wide support, illustrating that the sample significantly conditions the RMS. In all cases, the sample-conditioned RMS with censored sampling is close to the unconditioned RMS with fixed sampling, which is expected due to the stopping criteria in the censored sampling process. The difference is that, while the fixed-sample experiments have certainty in sample size with uncertainty in the conditional RMS, censored sampling experiments have certainty in the conditional RMS with uncertainty in the sample size. Hence, censored sampling provides approximately the same RMS and average sample size, while also guaranteeing a specified conditional RMS for each final sample in the censored sampling process. One needs to keep in mind that, when using a smaller sample size to obtain a desired RMS, the true error of the classifier may increase. This potential problem is alleviated by requiring that the Bayesian MMSE error estimate itself reach a desired threshold before halting the sampling procedure. In effect, one then places goodness criteria on both the error estimation and classifier design to determine sample size.

3.9 Asymptotic Approximation of the RMS The sample-conditioned MSE, which provides a measure of performance across the uncertainty class U for a given sample S n , involves various sampleconditioned moments for the error estimator: Eu ½ εˆ n j S n , Eu ½ðˆεn Þ2 j S n , and Eu ½εn εˆ n j S n . These simplify to εˆ n , ðˆεn Þ2 , and Eu ½εn j S n ˆεn ¼ ðˆεn Þ2 , respectively. One could also consider the MSE relative to a fixed feature-label distribution in the uncertainty class and randomness relative to the sampling distribution. This would yield the feature-label-distribution-conditioned MSE MSES n ðˆεn j uÞ and corresponding moments: ES n ½ εˆ n j u, ES n ½ðˆεn Þ2 j u, and ES n ½εn εˆ n j u. These moments help characterize estimator behavior relative to a given feature-label distribution and are the kind of moments considered

144

Chapter 3

historically. They facilitate performance comparison of the Bayesian MMSE error estimator to classical error estimators. To evaluate performance across both the uncertainty class and the sampling distribution requires the unconditioned MSE MSEu,S n ðˆεn Þ and corresponding moments: Eu,Sn ½ εˆ n , Eu,S n ½ðˆεn Þ2 , and Eu,Sn ½εn εˆ n . Until this point, we have examined MSESn ðˆεn j uÞ and MSEu,S n ðˆεn Þ via simulation studies in the discrete and Gaussian models. In this section, following (Zollanvari and Dougherty, 2014), we provide asymptotically exact approximations of these, along with the corresponding moments, of the Bayesian MMSE error estimator for LDA in the Gaussian model. These approximations are exact in a double-asymptotic sense, where both sample size n and dimensionality D approach infinity at a fixed rate between the two (Zollanvari et al., 2011). Finite-sample approximations from the double-asymptotic method have been shown to be quite accurate (Zollanvari et al., 2011; Wyman et al., 1990; Pikelis, 1976). There is a body of work on the use of double asymptotics for the analysis of LDA and its related statistics (Zollanvari et al., 2011; Raudys, 1972; Deev, 1970; Fujikoshi, 2000; Serdobolskii, 2000; Bickel and Levina, 2004). Raudys and Young provide a good review of the literature on the subject (Raudys and Young, 2004). Although the theoretical underpinning of both (Zollanvari et al., 2011) and the present section relies on double-asymptotic expansions, in which n, D → ` at a proportional rate, practical interest concerns finite-sample approximations corresponding to the asymptotic expansions. In (Wyman et al., 1990), the accuracy of such finite-sample approximations is investigated relative to the expected error of LDA in a Gaussian model. Several singleasymptotic expansions ðn → `Þ are considered, along with double-asymptotic expansions ðn, D → `Þ (Raudys, 1972; Deev, 1970). The results of (Wyman et al., 1990) show that the double-asymptotic approximations are significantly more accurate than the single-asymptotic approximations. In particular, even with n∕D , 3, the double-asymptotic expansions yield “excellent approximations,” while the others “falter.” The analysis of this section pertains to separate sampling under a Gaussian model with a known common covariance matrix S. We classify with a variant of LDA given by cn ðxÞ ¼ 0 if W ðxÞ þ w ≤ 0 and cn ðxÞ ¼ 1 otherwise, where ˆ ¼ S, w ¼ lnðc ∕c Þ, and c is W ðxÞ is the discriminant given in Eq. 1.29 with S 1 0 i the true prior class-i probability. This classifier assumes that the ci and S are known and assigns the decision boundary to class 1 instead of class 0. For known covariance S and a Gaussian prior distribution on mi possessing mean mi and covariance matrix S∕ni , i ∈ f0, 1g, the Bayesian MMSE error estimator is given by Eq. 2.13, with the true c0 in place of Ep ½c and εˆ ni given by Theorem 2.10. Letting m ¼ ½mT0 , mT1 T , our interest is with the moments ES n ½ εˆ n j m, ES n ½ðˆεn Þ2 j m, and ES n ½εn εˆ n j m used to obtain MSESn ðˆεn j mÞ, and the moments Em,Sn ½ εˆ n , Em,S n ½ðˆεn Þ2 , and Em,S n ½εn εˆ n  used to obtain MSEm,S n ðˆεn Þ.

Sample-Conditioned MSE of Error Estimation

145

3.9.1 Bayesian–Kolmogorov asymptotic conditions The Raudys–Kolmogorov asymptotic conditions (Zollanvari et al., 2011) are defined on a sequence of Gaussian  discrimination problems with ` a sequence of parameters and sample sizes: ðmD,0 , mD,1 , SD , nD,0 , nD,1 Þ D¼1 , where the means and the covariance matrix are arbitrary. The assumptions for Raudys– Kolmogorov asymptotics are nD,0 → `, nD,1 → `, D → `, D∕nD,0 → l0 , D∕nD,1 → l1 , and 0 , l0 , l1 , `. For notational simplicity, we denote the limit under these conditions by limRKac . For analysis related to LDA, we also assume that the Mahalanobis distance, dm,D

qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ¼ ðmD,0  mD,1 ÞT S1 D ðmD,0  mD,1 Þ,

(3.150)

is finite, and limRKac dm,D ¼ dm (Serdobolskii, 2000, p. 4). This condition assures the existence of limits of performance metrics of the relevant statistics (Zollanvari et al., 2011; Serdobolskii, 2000). To analyze the Bayesian MMSE error estimator, we modify the sequence of Gaussian discrimination problems to n

ðmD,0 , mD,1 , SD , nD,0 , nD,1 , mD,0 , mD,1 , nD,0 , nD,1 Þ

o

` D¼1

,

(3.151)

where mD,i and nD,i are hyperparameters of the fixed covariance Gaussian Bayesian model. In addition to the previous conditions, we assume that the following limits exist for i, j ∈ f0, 1g: T 1 lim mTD,i S1 D mD, j ¼ mi S mj ,

(3.152)

T 1 lim mTD,i S1 D mD, j ¼ mi S mj ,

(3.153)

T 1 lim mTD,i S1 D mD, j ¼ mi S mj ,

(3.154)

RKac

RKac

RKac

where mTi S1 mj , mTi S1 mj , and mTi S1 mj are symbolic representations of the constants to which the limits converge. In (Serdobolskii, 2000), fairly mild sufficient conditions are given for the existence of these limits. We refer to all of the aforementioned conditions, along with nD,i → `, nD,i ∕nD,i → g i , and 0 , g i , `, as the Bayesian–Kolmogorov asymptotic conditions (BKac). We denote the limit under these conditions by limBKac BKac

or → , which means that, for i, j ∈ f0, 1g,

146

Chapter 3

lim ð⋅Þ ¼

BKac

lim ð⋅Þ. D → `, nD,0 → `, nD,1 → `, nD,0 → `, nD,1 → ` nD,0 nD,1 D D nD,0 → l0 , nD,1 → l1 , nD,0 → g 0 , nD,1 → g 1 0 , l0 , `, 0 , l1 , `, 0 , g 0 , `, 0 , g 1 , ` T 1 T 1 mTD,i S1 D mD, j → mi S mj , 0 , mi S mj , ` T 1 T 1 mTD,i S1 D mD, j → mi S mj , 0 , mi S mj , ` T 1 T 1 mTD,i S1 D mD, j → mi S mj , 0 , mi S mj , `

(3.155)

T 1 T 1 mTD,i S1 D mD, j → mi S mj implies that mD,i SD mD, j ¼ Oð1Þ as D → `. SimiT 1 larly, mTD,i S1 D mD, j ¼ Oð1Þ and mD,i SD mD, j ¼ Oð1Þ. This limit is defined for the case where there is conditioning on a specific value of mD,i . In this case mD,i is not a random vector; rather, for each D, it is a vector of constants. To remove conditioning, we model mD,i as Gaussian with mean mD,i and covariance SD ∕nD,i ; i.e., the means are given the same prior as the Bayesian MMSE error estimator. The sequence of discrimination problems and the above limit reduce to o n ` (3.156) ðSD , nD,0 , nD,1 , mD,0 , mD,1 , nD,0 , nD,1 Þ D¼1

and lim ð⋅Þ ¼

BKac

ð⋅Þ, lim D → `, nD,i → `, nD,i → ` n → li , nD,i → g i , 0 , li , `, 0 , g i , ` D,i

D nD,i mTD,i S1 D mD, j

(3.157)

→ mTi S1 mj , 0 , mTi S1 mj , `

respectively. For notational simplicity we assume clarity from the context and do not explicitly differentiate between these conditions. Convergence in probability under the Bayesian–Kolmogorov asymptotic conditions is denoted by plimBKac . For notational ease, we henceforth omit the subscript “D” from the parameters. Define ha1 , a2 , a3 , a4 ¼ ða1  a2 ÞT S1 ða3  a4 Þ,

(3.158)

and for ease of notation write ha1 ;a2 , a1 ;a2 as ha1 ;a2 . There are two special cases: (1) the square of the Mahalanobis distance in the parameter space of unknown class-conditional densities, d2m ¼ hm0 ,m1 . 0, and (2) a distance measure for prior distributions, D2m ¼ hm0 ,m1 . 0, where m ¼ ½mT0 , mT1 T . The conditions in Eq. 3.155 assure the existence of limBKac ha1 ;a2 ;a3 ;a4 , where the aj s can be any combination of m0 , m1 , m0 , and m1 . ha1 ;a2 ;a3 ;a4 , d2m , and D2m denote limBKac ha1 ;a2 ;a3 ;a4 , limBKac d2m , and limBKac D2m , respectively.

Sample-Conditioned MSE of Error Estimation

147

The ratio D∕ni is an indicator of complexity for LDA (and any linear classification rule), the Vapnik–Chervonenkis dimension being D þ 1. Therefore, the conditions in Eq. 3.155 characterize the asymptotic complexity of the problem. The ratio ni ∕ni is a measure of relative uncertainty: the smaller ni ∕ni is, the more we rely on the data and the less we rely on the prior knowledge. Hence, the conditions in Eq. 3.155 characterize the asymptotic uncertainty. We let bi ¼ ni ∕ni so that bi ¼ ni ∕ni → g i . We also denote the sample mean for class i ∈ f0, 1g by xi . 3.9.2 Conditional expectation In this section, based on (Zollanvari and Dougherty, 2014), we use the Bayesian–Kolmogorov asymptotic conditions to characterize the conditional and unconditional first moments of the Bayesian MMSE error estimator. In each case, the BKac limit suggests a finite-sample approximation. These are tailored after the so-called Raudys-type Gaussian-based finite-sample approximation for the expected LDA classification error (Raudys, 1972; Raudys and Young, 2004): ESn ½ε0n  ¼ PrðW ðx0 , x1 , xÞ ≤ wjx ∈ Π0 Þ 1 0 E ½W ðx , x , xÞjx ∈ Π  þ w S n ,x 0 1 0 A, ≂ F@pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi varS n ;x ðW ðx0 , x1 , xÞjx ∈ Π0 Þ

(3.159)

with Π0 denoting the class-0 population and w being the discriminant threshold. Here we have written the Anderson W statistic explicitly as a function of the test point and the sample. The asymptotic conditional expectation of the Bayesian MMSE error estimator is characterized in the following theorem, where G B0 , GB1 , and L depend on m. Theorem 3.11 (Zollanvari and Dougherty, 2014). Consider the sequence of Gaussian discrimination problems defined by Eq. 3.151. Then for i ∈ f0, 1g, lim ES n ½ εˆ ni jm BKac

  B i þw i Gp ffiffiffiffi , ¼ F ð1Þ L

(3.160)

and therefore, 

 B   GB0 þ w G1  w pffiffiffiffi pffiffiffiffi lim ES n ½ εˆ n jm ¼ c0 F , þ c1 F BKac L L

(3.161)

148

Chapter 3

where i h 1 2 ¼  hm0 ;m0 Þ þ dm þ ð1  g 0 Þl0 þ ð1 þ g 0 Þl1 , g ðh 2ð1 þ g 0 Þ 0 m0 ,m1 (3.162) i h 1 GB1 ¼ g 1 ðhm1 ;m0  hm1 ;m1 Þ þ d2m þ ð1  g 1 Þl1 þ ð1 þ g 1 Þl0 , 2ð1 þ g 1 Þ (3.163)

G B0

L ¼ d2m þ l0 þ l1 . Proof. For i ∈ {0, 1}, let b B0 ¼ G

 m0

x þ x1  0 2

T

S1 ðx0  x1 Þ,

(3.164)

(3.165)

where m0 is defined in Eq. 3.66. Then n0 mT S1 ðx0  x1 Þ n0 þ n0 0 1

n  n0 T 1 x0 S x0  xT0 S1 x1 þ xT1 S1 x1  xT0 S1 x1 . þ 0 2ðn0 þ n0 Þ 2 (3.166)

b B0 ¼ G

For i, j ∈ f0, 1g, and i ≠ j, define the following random variables: yi ¼ mTi S1 ðx0  x1 Þ,

(3.167)

zi ¼ xTi S1 xi ,

(3.168)

zij ¼ xTi S1 xj .

(3.169)

Given m, the sample mean xi for class i is Gaussian with mean mi and covariance S/ni. The variance of yi given m does not depend on m. Therefore, under the Bayesian–Kolmogorov conditions stated in Eq. 3.155, mTi S1 mj and mTi S1 mj do not appear in the limit. Only mTi S1 mi matters, which vanishes in the limit as follows:  1  S1 T S þ varS n ðyi jmÞ ¼ mi mi n0 n1 (3.170) mTi S1 mi mTi S1 mi BKac → lim þ lim ¼ 0. n0 →` n1 → ` n0 n1

Sample-Conditioned MSE of Error Estimation

149

To find the variance of zi and zij , we first transform zi and zij to quadratic forms and then use the results of (Kan, 2008) to find the variance of quadratic functions of Gaussian random vectors. Specifically, from (Kan, 2008), for y  N ðm, SÞ and any symmetric positive definite matrix A, varðyT AyÞ ¼ 2 trðASASÞ þ 4mT ASAm.

(3.171)

Some algebraic manipulation yields

varS n ðzi jmÞ ¼ 2

D mTi S1 mi þ 4 ni n2i

l mT S1 mi → 2 lim i þ 4 lim i ¼ 0, ni →` ni ni → ` ni

(3.172)

BKac

varS n ðzij jm Þ ¼

D mT S1 mi mTj S1 mj þ i þ nj ni ni nj

mTj S1 mj l mT S1 mi → lim i þ lim i þ lim ¼ 0. nj →` nj nj →` ni →` nj ni

(3.173)

BKac

From

the

Cauchy–Schwarz BKac

inequality,

BKac

covSn ðyi , zk jmÞ → 0,

BKac

covS n ðyi , zij jmÞ → 0, and covS n ðzi , zij jmÞ → 0 for i, j, k ∈ f0, 1g with i ≠ j. Furthermore, ni  ni BKac 1  g i → , 2ðni þ ni Þ 2ð1 þ g i Þ

(3.174)

ni BKac g i → . ni þ ni 1 þ gi

(3.175)

b 1 (plug m in Eq. 3.165) yields With this and using the same approach for G 1 BKac B b varSn ðG i jm Þ → 0. Since varðX n Þ → 0 implies that X n → limn→` E½X n  in probability whenever limn→` E½X n  exists, for i, j ∈ {0, 1}, and i ≠ j, B

150

Chapter 3

b Bi jm ¼ lim ES ½G b Bi jm  plim G n BKac BKac   i1 T 1 ¼ ð1Þ mj S mj þ lj 2   gi i T 1 T 1 mi S mi  mi S mj þ ð1Þ 1 þ gi   i 1  gi T 1 þ ð1Þ mi S mi þ li 2ð1 þ g i Þ   1  gi 1 T 1 i þ m S mj  ð1Þ 2ð1 þ g i Þ 2 i

(3.176)

¼ G Bi . Now let   b i ¼ ni þ 1 ðx0  x1 ÞT S1 ðx0  x1 Þ ¼ ni þ 1 dˆ 2 , L ni ni

(3.177)

where dˆ 2 ¼ ðx0  x1 ÞT S1 ðx0  x1 Þ. Similar to deriving Eq. 3.173 via the variance of quadratic forms of Gaussian variables, we can show that varS n ðdˆ 2 jmÞ ¼ 4d2m



1 1 þ n0 n1



 þ 2D

1 1 þ n0 n1

2 .

(3.178)

Thus, b i jmÞ ¼ varS n ðL



ni þ 1 ni

2

varS n ðdˆ 2 jmÞ → 0. BKac

(3.179)

As before, from Chebyshev’s inequality it follows that b 0 jm ¼ plim L b 1 jm ¼ lim ES ½L b i jm ¼ L, plim L n BKac

BKac

BKac

(3.180)

with L being defined in Eq. 3.164. By the continuous mapping theorem (continuous functions preserve convergence in probability),

Sample-Conditioned MSE of Error Estimation

plim εˆ ni jm BKac

 b Bi þ w i G m ¼ plim F ð1Þ qffiffiffiffiffi  BKac bi L   b Bi þ w  i G ¼ F plimð1Þ qffiffiffiffiffi m BKac bi L   B i þw i Gp ffiffiffiffi ¼ F ð1Þ . L

151

(3.181)

Convergence in probability implies convergence in distribution. Thus, from Eq. 3.181,   b Bi þ w d  G i ð1Þ qffiffiffiffiffi m → Zi , (3.182)  b Li d

where → means convergence in distribution, and Zi is a random variable with pffiffiffiffi point mass at the constant ð1Þi ðG Bi þ wÞ∕ L. Boundedness and continuity of Fð⋅Þ along with Eq. 3.182 allow one to apply the Helly–Bray theorem (Sen and Singer, 1993) to write lim

BKac

ES n ½ˆεni

  b Bi þ w  i G m jm  ¼ lim ES n F ð1Þ qffiffiffiffiffi  BKac bi L ¼ EZi ½FðZi Þ   B i þw i Gp ffiffiffiffi ¼ F ð1Þ L i ¼ plim εˆ n jm ,

(3.183)

BKac

which completes the proof. ▪ 2 2 Since dm → dm , D∕n0 → l0 , and D∕n1 → l1 , Theorem 3.11 suggests the following finite-sample approximation: 1 B,f G þ w 0 ffiA, ESn ½ εˆ n0 jm  ≂ F@qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 2 dm þ nD0 þ nD1 0

(3.184)

where G B,f 0 is obtained by using the finite-sample parameters of the problem in Eq. 3.162, namely,

152

Chapter 3

GB,f 0

  1 D D 2 ¼  hm0 ;m0 Þ þ dm þ ð1  b0 Þ þ ð1 þ b0 Þ . b ðh 2ð1 þ b0 Þ 0 m0 ;m1 n0 n1 (3.185)

To obtain the corresponding approximation for ES n ½ εˆ n1 jm, it suffices to use Eq. 3.184 by changing the sign of w, and exchanging n0 with n1 , n0 with n1 B,f ðor b0 with b1 Þ, m0 with m1 , and m0 with m1 in GB,f 0 to produce G 1 . To obtain a Raudys-type finite-sample approximation, first note that the Gaussian distribution in Eq. 2.124 can be rewritten as εˆ n0 ¼ PrðU 0 ðx0 , x1 , zÞ ≤ wjx0 , x1 , z ∈ C0 Þ,

(3.186)

where z is independent of S n , Ci is the multivariate Gaussian distribution N ðmi , ½ðni þ ni þ 1Þðni þ ni Þ∕n2i SÞ, and  U i ðx0 , x1 , zÞ ¼

ni nx x þ x1 zþ i i  0 ni þ ni ni þ ni 2

T

S1 ðx0  x1 Þ.

(3.187)

Taking the expectation of εˆ n0 relative to the sampling distribution and then applying the standard normal approximation yields a Raudys-type approximation: ES n ½ˆεn0 jm ¼ PrðU 0 ðx0 , x1 , zÞ ≤ wjz ∈ C0 , mÞ   ES n ,z ½U 0 ðx0 , x1 , zÞjz ∈ C0 , m þ w ≂ F pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi . varS n ,z ðU 0 ðx0 , x1 , zÞjz ∈ C0 , mÞ

(3.188)

It is proven in (Zollanvari and Dougherty, 2014) via an equivalent representation of the right-hand side of Eq. 3.188 that ! B,R G þ w 0 qffiffiffiffiffiffiffiffiffiffi ES n ½ εˆ n0 jm ≂ F , LB,R 0

(3.189)

¼ G B,f GB,R 0 0 ,

(3.190)

where

with GB,f 0 defined in Eq. 3.185, and

Sample-Conditioned MSE of Error Estimation

153

2d2m d2m D D þ þ þ n0 n1 n0 ð1 þ b0 Þ2 n1 ð1 þ b0 Þ   hm0 ;m1  ð1  b0 Þhm0 ;m0 ð1 þ b0 Þhm0 ;m1  hm0 ;m0 b0 þ þ n0 n1 ð1 þ b0 Þ2

LB,R ¼ d2m þ 0

þ

ð3 þ b20 ÞD ð2 þ b0 ÞD D þ þ 2. 2 2 2 2n0 ð1 þ b0 Þ n0 n1 ð1 þ b0 Þ 2n1 (3.191)

The corresponding approximation for ES n ½ εˆ n1 jm is 1 0 B,R G  w 1 ffiffiffiffiffiffiffiffiffiffi A, ES n ½ εˆ n1 jm  ≂ F@ q LB,R 1

(3.192)

where LB,R and GB,R are obtained from LB,R and GB,R 1 1 0 0 , respectively, by exchanging n0 with n1 , n0 with n1 , m0 with m1 , and m0 with m1 . It is straightforward to show that BKac

→ GB0 , G B,R 0 BKac

→ d2m þ l0 þ l1 , LB,R 0

(3.193) (3.194)

with GB0 being defined in Theorem 3.11. Therefore, the approximation obtained in Eq. 3.189 is asymptotically exact, and Eqs. 3.184 and 3.189 are asymptotically equivalent. Similar limits hold for class 1. 3.9.3 Unconditional expectation The unconditional expectation of εˆ ni under Bayesian–Kolmogorov asymptotics is provided by the next theorem; for the proof of this theorem we refer to (Zollanvari and Dougherty, 2014). Theorem 3.12 (Zollanvari and Dougherty, 2014). Consider the sequence of Gaussian discrimination problems defined by Eq. 3.156. Then for i ∈ f0, 1g,   H þ w lim Em,S n ½ εˆ ni  ¼ F ð1Þi pi ffiffiffiffi , BKac F

(3.195)

and therefore,     H 0 þ w H1  w pffiffiffiffi pffiffiffiffi þ c1 F , lim Em,S n ½ εˆ n  ¼ c0 F BKac F F

(3.196)

154

Chapter 3

where   1 l0 l1 2 Dm þ l1  l0 þ þ H0 ¼ , g0 g1 2   1 l0 l1 2 Dm þ l0  l1 þ þ H1 ¼  , g0 g1 2 F ¼ D2m þ l0 þ l1 þ

l0 l1 þ . g0 g1

Theorem 3.12 suggests the finite-sample approximation ! Rþw H 0 ffi , Em,Sn ½ εˆ n0  ≂ F qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi D 2 Dm þ n0 þ nD1 þ nD0 þ nD1 where HR 0

  1 D D D D 2 ¼ . Dm þ  þ þ 2 n1 n0 n0 n1

(3.197) (3.198) (3.199)

(3.200)

(3.201)

From Eq. 3.186 we can get a Raudys-type approximation: Em,S n ½ˆεn0  ¼ Em ½PrðU 0 ðx0 , x1 , zÞ ≤ wjz ∈ C0 , mÞ   Em,S n ;z ½U 0 ðx0 , x1 , zÞjz ∈ C0  þ w ≂ F pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi . varm,S n ;z ðU 0 ðx0 , x1 , zÞjz ∈ C0 Þ

(3.202)

As proven in (Zollanvari and Dougherty, 2014) via an equivalent representation of the right-hand side of Eq. 3.202, ! H R 0 þw 0 qffiffiffiffiffiffiffi Em,S n ½ εˆ n  ≂ F , (3.203) FR 0 where FR 0

    1 1 1 1 1 1 1 2 ¼ 1þ þ þ þ þ þ D þD n0 n1 n1 m n0 n1 n0 n1     1 1 1 1 1 1 1 þ þ þ þ þ þD þD . n1 n0 n1 n1 n0 n1 2n20 2n21 2n20 2n21

(3.204)

It is straightforward to show that BKac

HR 0 → H 0,

(3.205)

Sample-Conditioned MSE of Error Estimation

BKac

2 FR 0 → Dm þ l0 þ l1 þ

155

l0 l1 þ , g0 g1

(3.206)

with H 0 defined in Eq. 3.197. Hence, the approximation obtained in Eq. 3.203 is asymptotically exact, and Eqs. 3.200 and 3.203 are asymptotically equivalent. Similar results hold for class 1. 3.9.4 Conditional second moments This section employs Bayesian–Kolmogorov asymptotic analysis to characterize the second and cross moments with the actual error, which provides an asymptotic expression for the MSE of error estimation. As with first moments, we will follow asymptotic theorems with suggested finite-sample approximations and then state Raudys-type approximations, referring the interested reader to (Zollanvari and Dougherty, 2014) for the derivations. Defining two i.i.d. random vectors z and z0 yields the second-moment representation ESn ½ðˆεn0 Þ2 jm ¼ ES n ½Pr2 ðU 0 ðx0 , x1 , zÞ ≤ wjx0 , x1 , z ∈ C0 , mÞ ¼ ES n ½PrðU 0 ðx0 , x1 , zÞ ≤ wjx0 , x1 , z ∈ C0 , mÞ  PrðU 0 ðx0 , x1 , z0 Þ ≤ wjx0 , x1 , z0 ∈ C0 , mÞ ¼ ES n ½PrðU 0 ðx0 , x1 , zÞ ≤ w, U 0 ðx0 , x1 , z0 Þ ≤ wjx0 , x1 , z ∈ C0 , z0 ∈ C0 , mÞ ¼ PrðU 0 ðx0 , x1 , zÞ ≤ w, U 0 ðx0 , x1 , z0 Þ ≤ wjz ∈ C0 , z0 ∈ C0 , mÞ, (3.207) where z and z0 are independent of S n , Ci is N ðmi , ½ðni þ ni þ 1Þðni þ ni Þ∕n2i SÞ, and U i ðx0 , x1 , zÞ is defined in Eq. 3.187. Theorem 3.13 (Zollanvari and Dougherty, 2014). Consider the sequence of Gaussian discrimination problems in Eq. 3.151. Then for i, j ∈ f0, 1g,     GBj þ w G Bi þ w j i i j pffiffiffiffi pffiffiffiffi lim ESn ½ εˆ n εˆ n jm ¼ F ð1Þ , (3.208) F ð1Þ BKac L L and therefore, 

  B   G B0 þ w G1  w 2 pffiffiffiffi pffiffiffiffi , lim ES n ½ðˆεn Þ jm ¼ c0 F þ c1 F BKac L L 2

(3.209)

where GB0 , G B1 , and L are defined in Eqs. 3.162 through 3.164, respectively.

156

Chapter 3

Proof. From Eq. 3.207, ES n ½ðˆεn0 Þ2 jm  ¼ ESn ½PrðU 0 ðx0 , x1 , zÞ ≤ w, U 0 ðx0 , x1 , z0 Þ ≤ w jx0 , x1 , z ∈ C0 , z0 ∈ C0 , mÞ.

(3.210)

We characterize the conditional probability inside ES n ½⋅. From the independence of z, z0 , x0 , and x1 , 0" # " #1   b B0 b G U 0 ðx0 , x1 , zÞ  x , x , z ∈ C0 , z0 ∈ C0 , m  N @ B , L0 b0 A, U 0 ðx0 , x1 , z0 Þ  0 1 b 0 L0 G 0

(3.211) b B0 and L b 0 are defined in Eqs. 3.166 and 3.177, respectively. Thus, where G 2 ! 3 b B0 þ w   G qffiffiffiffiffiffi ES n ½ðˆεn0 Þ2 jm  ¼ ES n 4F2 (3.212) m5.  b L0 Similar to the proof of Theorem 3.11, we get lim ES n ½ðˆεni Þ2 jm ¼ plim ðˆεni Þ2 jm BKac

2 ¼ lim ESn ½ εˆ ni jm  BKac   Bþw i 2 i Gp ffiffiffiffi . ¼ F ð1Þ L

(3.213)

lim ES n ½ εˆ n0 εˆ n1 jm ¼ plim εˆ n0 εˆ n1 jm,

(3.214)

BKac

Similarly, BKac

BKac

and the result follows via the decomposition of the error estimate in terms of the two class estimates. ▪ Theorem 3.13 suggests the finite-sample approximation 1 0 B,f G 0 þ w A ffi , (3.215) ESn ½ðˆεn0 Þ2 jm ≂ F2 @qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi d2m þ nD0 þ nD1 which is the square of the approximation in Eq. 3.184. Approximations for ES n ½ εˆ n0 εˆ n1 jm and ES n ½ðˆεn1 Þ2 jm are obtained analogously. A similar proof yields the next theorem.

Sample-Conditioned MSE of Error Estimation

157

Theorem 3.14 (Zollanvari and Dougherty, 2014). Consider the sequence of Gaussian discrimination problems in Eq. 3.151. Then for i, j ∈ f0, 1g,     Gj þ w GBi þ w j i i j pffiffiffiffi pffiffiffiffi F ð1Þ , (3.216) lim ES n ½ εˆ n εn jm ¼ F ð1Þ BKac L L and therefore, lim ES n ½ εˆ n εn jm ¼

BKac

1 X 1 X i¼0 j¼0

    B G j þ w i þw i Gp j ffiffiffiffi pffiffiffiffi F ð1Þ , ci cj F ð1Þ L L (3.217)

where 1 G0 ¼ ðd2m þ l1  l0 Þ, 2

(3.218)

1 G1 ¼  ðd2m þ l0  l1 Þ, 2

(3.219)

and G B0 , G B1 , and L are defined in Eqs. 3.162 through 3.164, respectively.

Theorem 3.14 suggests the finite-sample approximation 1 0 1 0 1 D D 2 B,f  ðd þ  Þ þ w G 0 þ w A @ 2 m n1 n0 A. qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi F ES n ½ εˆ n0 ε0n jm ≂ F@qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi d2m þ nD0 þ nD1 d2m þ nD0 þ nD1

(3.220)

This is a product of Eq. 3.184 and the finite-sample approximation for ES n ½ε0n jm in (Zollanvari et al., 2011). Class 1 and mixed class cases are similar. A consequence of Theorems 3.11, 3.13, and 3.14 is that the conditional variances and covariances are asymptotically zero: lim varSn ðˆεn jmÞ ¼ lim varSn ðεn jmÞ ¼ lim covSn ðεn , εˆ n jmÞ ¼ 0.

BKac

BKac

BKac

(3.221)

Hence, the deviation variance is also asymptotically zero: lim varS n ðˆεn  εn jmÞ ¼ 0.

BKac

(3.222)

Define the conditional bias as biasS n ðˆεn jm Þ ¼ ES n ½ εˆ n  εn jm . Then the asymptotic RMS reduces to

(3.223)

158

Chapter 3

lim RMSS n ðˆεn jm Þ ¼ lim jbiasS n ðˆεn jmÞj.

BKac

BKac

(3.224)

The asymptotic RMS is thus the asymptotic absolute bias. As proven in (Zollanvari et al., 2011),     G 0 þ w G1  w pffiffiffiffi pffiffiffiffi lim ES n ½εn jm ¼ c0 F þ c1 F . (3.225) BKac L L It follows from Theorem 3.11 and Eq. 3.225 that      GB0 þ w G 0 þ w p ffiffiffi ffi p ffiffiffi ffi F lim biasS n ðˆεn jmÞ ¼ c0 F BKac L L      B G1  w G1  w pffiffiffiffi pffiffiffiffi F . þ c1 F L L

(3.226)

Recall that the Bayesian MMSE error estimator is unconditionally unbiased: biasm,Sn ðˆεn Þ ¼ Em,Sn ½ εˆ n  εn  ¼ 0. To obtain Raudys-type approximations corresponding to Theorems 3.13 and 3.14, we utilize the joint distribution of U i ðx0 , x1 , zÞ and U j ðx0 , x1 , z0 Þ, defined in Eq. 3.187, with z and z0 being independently selected from populations C0 or C1 . We utilize the function   Z aZ b 1 ðx2 þ y2  2rxyÞ pffiffiffiffiffiffiffiffiffiffiffiffiffi exp Fða, b; rÞ ¼ dxdy, (3.227) 2ð1  r2 Þ ` ` 2p 1  r2 which is the bivariate CDF of standard normal random variables with correlation coefficient r. Note that Fða, `; rÞ ¼ FðaÞ, and Fða, b; 0Þ ¼ FðaÞFðbÞ. For notational simplicity, we write Fða, a; rÞ as Fða; rÞ. For any Gaussian vector ½X , Y T ,   x  mX y  mY PrðX ≤ x, Y ≤ yÞ ¼ F , ; rX Y , (3.228) sX sY pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi with mX ¼ E½X , mY ¼ E½Y , sX ¼ varðX Þ, s Y ¼ varðY Þ, and correlation coefficient rX Y . A Raudys-type approximation for the second conditional moment is given by 1 0 B,R B,R G þ w C 0 A ES n ½ðˆεn0 Þ2 jm ≂ F@ q0ffiffiffiffiffiffiffiffiffiffi ; B,R , (3.229) B,R L 0 L0 and LB,R given in Eqs. 3.190 and 3.191, respectively, and with GB,R 0 0

Sample-Conditioned MSE of Error Estimation

C B,R ¼ 0

159

d2m ð1  b0 Þd2m þ n1 ð1 þ b0 Þ n0 ð1 þ b0 Þ2   hm0 ,m1  ð1  b0 Þhm0 ,m0 ð1 þ b0 Þhm0 ,m1  hm0 ,m0 b0 þ þ n0 n1 ð1 þ b0 Þ2 ð1  b0 Þ2 D D D þ þ 2. 2 2 2 2n0 ð1 þ b0 Þ n0 n1 ð1 þ b0 Þ 2n1

þ

(3.230) Similarly,

0 ESn ½ðˆεn1 Þ2 jm ≂ F

B,R @G 1

w qffiffiffiffiffiffiffiffiffiffi ; LB,R 1

1

C B,R 1 A , LB,R 1

(3.231)

where LB,R and GB,R are defined immediately following Eq. 3.192, and C B,R is 1 1 1 B,R obtained from C 0 using Eq. 3.230 by exchanging n0 with n1 , n0 with n1 , m0 BKac

with m1 , and m0 with m1 . Having C B,R → 0 together with Eqs. 3.193 and 0 3.194 shows that Eq. 3.229 is asymptotically exact, that is, asymptotically equivalent to ES n ½ðˆεn0 Þ2 jm obtained in Theorem 3.13. Class 1 is similar. For the conditional mixed moments, first, 1 0 B,R B,R B,R G þ w G  w C 1 01 ffiffiffiffiffiffiffiffiffiffi ; qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi A, (3.232) ES n ½ εˆ n0 εˆ n1 jm ≂ F@ q0ffiffiffiffiffiffiffiffiffiffi , q B,R B,R B,R B,R L0 L0 L1 L1 where C B,R 01 ¼

b0 hm0 ,m0 ,m0 ,m1  b0 b1 hm0 ,m0 ,m1 ,m0 þ b1 hm1 ,m1 ,m1 ,m0 þ b1 d2m þ d2m n0 ð1 þ b0 Þð1 þ b1 Þ b1 hm1 ,m1 ,m1 ,m0  b0 b1 hm1 ,m1 ,m0 ,m1 þ b0 hm0 ,m0 ,m0 ,m1 þ b0 d2m þ d2m n1 ð1 þ b0 Þð1 þ b1 Þ D ð1  b0 ÞD ð1  b1 ÞD þ þ . þ n0 n1 ð1 þ b0 Þð1 þ b1 Þ 2n20 ð1 þ b0 Þ 2n21 ð1 þ b1 Þ þ

(3.233) BKac

Since C B,R → 0, Eq. 3.232 is asymptotically exact; i.e., Eq. 3.232 becomes 01 equivalent to the result of Theorem 3.13. Next, 1 0 B,R BT,R R G þ w G þ w C0 ffi A, (3.234) ES n ½ εˆ n0 ε0n jm ≂ F@ q0ffiffiffiffiffiffiffiffiffiffi , q0ffiffiffiffiffiffiffi ; qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi B,R B,R R R L0 L0 L0 L0

160

Chapter 3

where ¼ C BT,R 0



1 ð1  b0 ÞD D þ 2, d2m þ b0 d2m þ b0 hm0 ,m0 ,m0 ,m1  2 n1 ð1 þ b0 Þ 2n0 ð1 þ b0 Þ 2n1 (3.235)

and GR 0

LR 0

¼

d2m

  1 2 D D ¼ , d þ  2 m n1 n0

(3.236)

  d2m 1 1 1 1 þ þD þ þ þ . n1 n0 n1 2n20 2n21

(3.237)

Next,  ESn ½ εˆ n1 ε1n jm  ≂ F

 G B,R  w GR C BT,R 1 w 1 1 ffiffiffiffiffiffiffiffiffiffi , p ffiffiffiffiffiffiffi ; qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi q ffi , R LR 1 LB,R LB,R 1 1 L1

(3.238)

BT,R R and GB,R are obtained as in Eq. 3.231, and LR are where LB,R 1 , G 1 , and C 1 1 1 obtained by exchanging n0 with n1 , n0 with n1 , m0 with m1 , and m0 with m1 in BT,R R , respectively. Finally, LR 0 , G 0 , and C 0 1 0 B,R BT,R R G þ w G  w C 1 01 ffiffiffiffiffiffiffi ; qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffiA, ES n ½ εˆ n0 ε1n jm ≂ F@ q0ffiffiffiffiffiffiffiffiffiffi , p (3.239) R B,R B,R R L 1 L0 L0 L1

where ¼ C BT,R 01



1 ð1  b0 ÞD D  2, d2m þ b0 hm0 ,m0 ,m0 ,m1 þ 2 n0 ð1 þ b0 Þ 2n0 ð1 þ b0 Þ 2n1

(3.240)

and 0

B,R @G 1

ES n ½ εˆ n1 ε0n jm ≂ F

w qffiffiffiffiffiffiffiffiffiffi , LB,R 1

GR 0

C BT,R 10

1

þw qffiffiffiffiffiffiffi ; qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiA, R LR LB,R 0 1 L0

(3.241)

is obtained by exchanging n0 with n1 , n0 with n1 , m0 with m1 , and where C BT,R 10 m0 with m1 in C BT,R 01 .

Sample-Conditioned MSE of Error Estimation BKac

BKac

161 BKac

BKac

C BT,R → 0, C BT,R → 0, C BT,R → 0, and C BT,R → 0. Therefore, 0 1 01 10 from

Eq.

3.194

and

the

facts

that

BKac

2 GR 0 → dm þ l1  l0

and

BKac LR 0 →

d2m þ l0 þ l1 , we conclude that Eqs. 3.234, 3.238, 3.239, and 3.241 are all asymptotically exact (compare to Theorem 3.14). 3.9.5 Unconditional second moments We now consider the unconditional second and cross moments. The following theorems are proven similarly to Theorems 3.13 and 3.14. Theorem 3.15 (Zollanvari and Dougherty, 2014). Consider the sequence of Gaussian discrimination problems in Eq. 3.156. For i, j ∈ {0, 1},     H j þ w H i þ w j i i j pffiffiffiffi pffiffiffiffi lim Em,S n ½ εˆ n εˆ n  ¼ F ð1Þ F ð1Þ , (3.242) BKac F F and therefore,

   H 0 þ w H1  w 2 pffiffiffiffi pffiffiffiffi lim Em,Sn ½ðˆεn Þ  ¼ c0 F þ c1 F , BKac F F 



2

(3.243)

where H0, H1, and F are defined in Eqs. 3.197 through 3.199, respectively.

Theorem 3.16 (Zollanvari and Dougherty, 2014). Consider the sequence of Gaussian discrimination problems in Eq. 3.156. For i, j ∈ {0, 1}, lim Em,S n ½ εˆ ni εjn  ¼ lim Em,S n ½ εˆ ni εˆ nj  ¼ lim Em,S n ½εin εjn ,

BKac

BKac

and therefore, lim Em,Sn ½ εˆ n εn  ¼

BKac

XX i¼0 j¼0

BKac

(3.244)

    H j þ w i þw i H j pffiffiffiffi pffiffiffiffi F ð1Þ ci cj F ð1Þ , F F (3.245)

where H 0 , H 1 , and F are defined in Eqs. 3.197 through 3.199, respectively. Theorems 3.15 and 3.16 suggest the finite-sample approximations: Em,S n ½ðˆεn0 Þ2  ≂ Em,S n ½ˆεn0 ε0n  ≂ Em,S n ½ðε0n Þ2  1 0 D D D D 2 D þ  þ þ  2w 1 m n1 n0 n0 n1 qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi A. ≂ F2 @ 2 D2 þ D þ D þ D þ D m

n0

n1

n0

n1

(3.246)

162

Chapter 3

A consequence of Theorems 3.12, 3.15, and 3.16 is that lim varm,S n ðˆεn  εn Þ ¼ lim jbiasm,S n ðˆεn Þj

BKac

BKac

¼ lim varm,Sn ðˆεn Þ BKac

¼ lim varm,Sn ðεn Þ

(3.247)

BKac

¼ lim covm,S n ðεn , εˆ n Þ BKac

¼ lim RMSm,Sn ðˆεn Þ ¼ 0. BKac

Previously, we have shown that εˆ n is strongly consistent under rather general conditions and that MSEm ðˆεn jS n Þ → 0 almost surely as n → ` under BKac

similar conditions. Here, we have shown that MSEm,Sn ðˆεn Þ → 0 under conditions stated in Eq. 3.157. Intuitively, this means that MSEm,Sn ðˆεn Þ  0 for asymptotic and comparable dimensionality, sample size, and uncertainty parameter. A Raudys-type approximation for the unconditional second moment is given by 0 1 B,R R H þ w K (3.248) Em,Sn ½ðˆεn0 Þ2  ≂ F@ q0ffiffiffiffiffiffiffi ; 0R A, F0 FR 0

R with H R 0 and F 0 given in Eqs. 3.201 and 3.204, respectively, and   1 1 1 1 2 B,R K0 ¼ þ þ þ Dm n0 ð1 þ b0 Þ2 n1 n0 ð1 þ b0 Þ2 n1 D D D D D D þ 2 þ 2þ 2þ þ 2 2n0 n0 n0 2n0 2n1 n1 n1 2n1 D D D þ þ þ . 2 2 n0 n1 ð1 þ b0 Þ n0 n1 ð1 þ b0 Þ n1 n0 ð1 þ b0 Þ2

(3.249)

Analogously, Em,Sn ½ðˆεn1 Þ2 

 R  H 1  w K B,R 1 ffi ; ≂ F pffiffiffiffiffiffi , FR FR 1 1

(3.250)

B,R R R are obtained from F R where F R 1 , H 1 , and K 1 0 using Eq. 3.204, H 0 using B,R Eq. 3.201, and K 0 using Eq. 3.249, respectively, by exchanging n0 with n1 , n0 BKac

with n1 , m0 with m1 , and m0 with m1 . Having K B,R → 0 together with 0 Eq. 3.206 makes Eq. 3.248 asymptotically exact. Class 1 is similar.

Sample-Conditioned MSE of Error Estimation

163

Turning to unconditional mixed moments,   H R þ w HR K B,R 1 w 01 q0ffiffiffiffiffiffiffi , p ffi , ffiffiffiffiffiffiffi ; qffiffiffiffiffiffiffiffiffiffiffiffi Em,S n ½ εˆ n0 εˆ n1  ≂ F R FR 1 FR FR 0 0 F1

(3.251)

where K B,R 01 ¼

D ðn  n0 ÞD ðn  n1 ÞD þ 20 þ 21 ðn0 þ n0 Þðn1 þ n1 Þ 2n0 ðn0 þ n0 Þ 2n1 ðn1 þ n1 Þ n0 n1 D ðn  n0 ÞD ðn  n1 ÞD þ 20 þ 21 n0 n1 ðn0 þ n0 Þðn1 þ n1 Þ 2n0 ðn0 þ n0 Þ 2n1 ðn1 þ n1 Þ   1 n0 n0 D 1þ  þ n1 þ n1 n0 n0 n0 þ n0     1 n1 n1 D 1 1 1þ D2 . þ  þ þ n0 þ n0 n1 n1 n1 þ n1 n0 n1 m þ

(3.252)

BKac

→ 0, Eq. 3.251 is asymptotically exact (compare to Since K B,R 01 Theorem 3.15). Next, 1 0 BT,R R H þ w K Em,Sn ½ εˆ n0 ε0n  ≂ F@ q0ffiffiffiffiffiffiffi ; 0 R A, (3.253) F0 FR 0

where 

K BT,R 0

 n0 1 1 2 D D D þ þ ¼ þ 2 Dm þ 2 þ n0 ðn0 þ n0 Þ n1 n1 2n1 n1 n1 2n1 þ

n0 D ðn  n0 ÞD ðn  n0 ÞD n0 D  20 . þ 20 þ n1 n0 ðn0 þ n0 Þ 2n0 ðn0 þ n0 Þ 2n0 ðn0 þ n0 Þ n0 n1 ðn0 þ n0 Þ (3.254)

Finally, 1 BT,R R R H þ w H  w K 1 01 ffiffiffiffiffiffiffi ; qffiffiffiffiffiffiffiffiffiffiffiffi ffiA, Em,S n ½ εˆ n0 ε1n  ≂ F@ q0ffiffiffiffiffiffiffi , p R R R F1 F0 FR F 0 1 0

(3.255)

where  K BT,R 01

¼

 1 1 D D D D D þ  2 2. D2m þ 2 þ 2 þ n0 n1 2n0 2n1 n0 n1 2n0 2n1

(3.256)

164

Chapter 3 BKac

BKac

Having K BT,R → 0 and K BT,R → 0 along with Eq. 3.206 makes Eqs. 3.253 0 01 and 3.255 asymptotically exact (compare to Theorem 3.16). Approximations for Em,S n ½ εˆ n1 ε0n  and Em,S n ½ εˆ n1 ε1n  are obtained analogously. 3.9.6 Unconditional MSE Computation of the unconditional MSE for the Bayesian MMSE error estimator requires the second moments of the true error. The conditional second moments of the true error are (Zollanvari et al., 2011) 1 0 T,R R G þ w C (3.257) ES n ½ðε0n Þ2 jm ≂ F@ q0ffiffiffiffiffiffiffi ; 0R A, L0 LR 0

1 T,R R G  w C 1 ffiffiffiffiffiffiffi ; 1R A, ESn ½ðε1n Þ2 jm ≂ F@ p L1 LR 1 0

1 T,R R R G þ w G  w C 1 01 ffiffiffiffiffiffiffi ; qffiffiffiffiffiffiffiffiffiffiffiffi ffiA, ESn ½ε0n ε1n jm ≂ F@ q0ffiffiffiffiffiffiffi , p R R R R L 1 L0 L0 L1

(3.258)

0

(3.259)

R R R where GR 0 and L0 are defined in Eqs. 3.236 and 3.237, respectively, G 1 and L1 are defined immediately after Eq. 3.238, and

C T,R ¼ 0

d2m D D þ þ , n1 2n20 2n21

(3.260)

¼ C T,R 1

d2m D D þ 2þ 2, n0 2n0 2n1

(3.261)

C T,R 01 ¼ 

D D  2. 2 2n0 2n1

(3.262)

We have already seen Eq. 3.257 in Eq. 1.66. The unconditional second moments of the true error are obtained in (Zollanvari and Dougherty, 2014) for the purpose of obtaining the MSE. First, 1 0 T,R R H þ w K Em,S n ½ðε0n Þ2  ≂ F@ q0ffiffiffiffiffiffiffi ; 0R A, (3.263) F0 FR 0

Sample-Conditioned MSE of Error Estimation

165

R with H R 0 and F 0 given in Eqs. 3.201 and 3.204, respectively, and   1 1 1 D D D D D D D T,R þ þ þ 2þ 2þ þ . K0 ¼ D2m þ 2 þ 2 þ n0 n1 n1 2n0 2n1 n0 n1 2n0 2n1 n1 n0 n1 n1

(3.264) Second, 0 Em,Sn ½ðε1n Þ2  ≂ F

w pffiffiffiffiffiffiffi ; FR 1

R @H 1

1

K T,R 1 A , FR 1

(3.265)

with K T,R obtained from K T,R by exchanging n0 with n1 and n0 with n1 . 1 0 Finally, 0 1 T,R R R H þ w H 1  w K 01 A ffiffiffiffiffiffiffi ; qffiffiffiffiffiffiffiffiffiffiffiffi ffi , (3.266) Em,Sn ½ε0n ε1n  ≂ F@ q0ffiffiffiffiffiffiffi , p R R R F1 FR F F 0 0 1 R with H R 0 and F 0 given in Eqs. 3.201 and 3.204, respectively, and   1 1 D D D D D T,R K 01 ¼ þ  2 2. D2m þ 2 þ 2 þ n0 n1 2n0 2n1 n0 n1 2n0 2n1

(3.267)

All of the moments necessary to compute MSES n ðˆεn jmÞ and MSEm,S n ðˆεn Þ have now been expressed asymptotically. Example 3.3. Following (Zollanvari and Dougherty, 2014), we compare the asymptotically exact finite-sample approximations to Monte Carlo estimations in conditional and unconditional scenarios for 100 features. Throughout, let S have diagonal elements 1 and off-diagonal elements 0.1, and let v0 and v1 be defined such that vi has equal elements, v0 ¼ v1 , and hv0 ;v1 ¼ 4 (this squared Mahalanobis distance corresponds to a Bayes error of 0.1586). We also define the hyperparameters n0 ¼ n1 ¼ 50 and mi ¼ 1.01vi for the Bayesian model. In the conditional scenario, we set the class means to mi ¼ vi for i ∈ f0, 1g. Given the sample sizes, hyperparameters, ci , and mi , all conditional asymptotic and Raudys-type approximations may be computed. We also approximate conditional error moments and the conditional RMS using Monte Carlo simulations. In each iteration, we generate training data of size ni for each class i ∈ f0, 1g using m0 , m1 , and S, and use these to train an LDA classifier cn . We then find the true error of cn using the true means

166

Chapter 3

0.12

RMS

0.04

MC unconditional FSA unconditional asym unconditional MC conditional FSA conditional asym conditional

40

60

80

100

120

140

160

180

200

0.00

0.04

0.03 0.01

0.02

moments

0.05

MC unconditional second FSA unconditional second asym unconditional second MC conditional mixed FSA conditional mixed asym conditional mixed

0.08

0.06

and covariance, and the Bayesian MMSE error estimator using the sample and above hyperparameters. This process is repeated 10,000 times, and ðˆεn Þ2 , εˆ n εn , and ðˆεn  εn Þ2 are averaged over all iterations to obtain Monte Carlo approximations of ES n ½ðˆεn Þ2 jm, ES n ½ εˆ n εn jm, and RMS2S n ðˆεn jmÞ, respectively. In the unconditional case, all unconditional asymptotic and Raudys-type approximations may be found given only the sample size, hyperparameters, and ci . To produce Monte Carlo approximations, we generate random realizations of the means m0 and m1 from the Bayesian model. Given m0 , m1 , and S, as in the conditional scenario, we generate training data, train a classifier, compute the true error, and compute the Bayesian MMSE error estimate. We generate 300 pairs of means, and for each pair of means we generate 300 training sets. The appropriate quantities are averaged over all 90,000 samples to obtain Monte Carlo approximations of Em,Sn ½ðˆεn Þ2 , Em,Sn ½ εˆ n εn , and RMS2m,S n ðˆεn Þ. Figure 3.6(a) shows finite-sample second-order and mixed-moment approximations as a function of sample size. Each curve label has three fields: (1) MC (Monte Carlo), FSA (Raudys-type finite-sample approximation), or asym (Bayesian–Kolmorgorov-asymptotic finite-sample approximation); (2) unconditional or conditional; and (3) second moment ½of ðˆεn Þ2  or mixed moment ðof εˆ n εn Þ. For instance, “FSA unconditional second” corresponds to Em,Sn ½ðˆεn Þ2 , and “FSA conditional mixed” corresponds to ES n ½ εˆ n εn jm. Note the closeness of the curves for both the second and mixed moments. Also note that Raudys-type approximations have slightly better performance. Figure 3.6(b) shows RMS curves. There is extremely close agreement between the Raudys-type approximation and the MC curves for the conditional RMS, so much so that they appear as one. Moreover, the asymptotic finite-sample approximation curve is also very close to the MC curve. Regarding unconditional RMS, the Raudys-type approximation and the MC curves are practically identical. There is, however, a large difference

40

60

80

100

120

140

sample size

sample size

(a)

(b)

160

180

200

Figure 3.6 Comparison of conditional and unconditional finite-sample moment approximations versus sample size (D ¼ 100): (a) second moments with ðˆεn Þ2 ¼ εˆ n εˆ n and mixed moments with εˆ n εn ; (b) RMS.

Sample-Conditioned MSE of Error Estimation

167

between them and the finite-sample approximation curve; indeed the FSA curves are identically 0. In fact, this is an immediate consequence of Eq. 3.247. So we see that, while the Bayesian–Kolmogorov finite-sample approximations work reasonably well for the individual moments, when the latter approximations are combined to form the RMS approximation, the result is RMS ¼ 0.

Chapter 4

Optimal Bayesian Classification Since prior knowledge is required to obtain a good error estimate in smallsample settings, it is prudent to utilize that knowledge when designing a classifier. When large amounts of data are available, one can appeal to the Vapnik–Chervonenkis (VC) theory (and extensions), which bounds the difference between the error of the best classifier in a family of classifiers and the error of the designed classifier as a function of sample size, absent prior distributional knowledge, but this VC bound is generally useless for small samples. Rather than take a distribution-free approach, we can find a classifier with minimum expected error, given the data and an uncertainty class of feature-label distributions. This is accomplished in the framework of optimal operator design under uncertainty.

4.1 Optimal Operator Design Under Uncertainty Finding a Bayes classifier is an instance of the most basic problem of engineering, the design of optimal operators. Design takes different forms depending on the random process constituting the scientific model and the operator class of interest. The operators might be filters, controllers, classifiers, or clusterers. The underlying random process might be a random signal/image for filtering, a Markov process for control, a feature-label distribution for classification, or a random point set for clustering. Our interest concerns optimal classification when the underlying process model, the feature-label distribution, is not known with certainty, in the sense that it belongs to an uncertainty class of possible feature-label distributions; that is, rather than use a classification rule based solely on data, we want a rule based on the data and prior knowledge concerning the feature-label distribution— and we want the design to be optimal, not based merely on some heuristic. Optimal operator design (synthesis) begins with the relevant scientific knowledge constituted in a mathematical theory that is used to construct an operator for optimally accomplishing a desired transformation under the constraints imposed by the circumstances. A criterion called a cost function

169

170

Chapter 4

(objective function) is defined to judge the goodness of the response—the lower the cost, the better the operator. The objective is to find an optimal way of manipulating the system, which means minimizing the cost function. This kind of engineering synthesis originated with optimal time series filtering in the classic work of Andrey Kolmogorov (Kolmogorov, 1941) and Norbert Wiener (Wiener, 1949), an unpublished version having appeared in 1942. In the Wiener theory, the scientific model consists of two random signals, one being the true signal and the other being an observed “noisy” variant of the true signal. The engineering aim is to linearly operate on the observed signal so as to transform it to be more like the true signal. Being that a linear operator is formed by a weighted average, the synthesis problem is to find an optimal weighting function for the linear operator, and the goodness criterion is the MSE between the true and filtered signals. An optimal linear filter is found in terms of the cross-correlation function between the true and observed random signals. It is the solution of an integral equation known as the Wiener–Hopf equation. In the case of wide-sense stationary signals, an optimal filter is found in terms of the signal and noise power spectra and is known as the Wiener filter. The cross-correlation function and power spectra are called characteristics of the random processes, the term referring to any deterministic function of the processes. The synthesis scheme for optimal linear filtering has four components: (1) the mathematical model consists of two jointly distributed random signals; (2) the operator family consists of integral linear filters over an observation window; (3) optimization involves minimizing the MSE; and (4) the optimization is performed by solving the Wiener–Hopf equation in terms of certain process characteristics. In general, an optimal operator is given by copt ¼ arg min CðcÞ, c∈F

(4.1)

where F is the operator family, and CðcÞ is the cost of applying operator c on the model. The general synthesis scheme has four steps: (1) determine the mathematical model; (2) define a family of operators; (3) define the optimization problem via a cost function; and (4) solve the optimization problem. For binary classification, the four steps become: (1) determine the featurelabel distribution; (2) define a family of classifiers; (3) define the optimization problem as minimizing classification error; and (4) solve the optimization problem in terms of the feature-label distribution. For instance, for any feature-label distribution, if the classifier family consists of all classifiers, then an optimal classifier is given by Eq. 1.5. If the feature-label distribution is Gaussian and the classifier family consists of all classifiers, then an optimal classifier is given by the quadratic discriminant in

Optimal Bayesian Classification

171

Eq. 1.19. In the former case, the optimization is solved in terms of the classconditional distributions; in the latter, it is solved in terms of moments of the feature-label distribution, which serve as characteristics. The situation changes when the underlying model is not known with certainty but belongs to an uncertainty class of models parameterized by a vector u belonging to a parameter set U. In the case of linear filtering, the cross-covariance function may contain unknown parameters, or with classification, it may be known that the feature-label distribution is Gaussian, but the covariance matrices not fully known. Under such circumstances, we define an intrinsically Bayesian robust (IBR) operator by cU IBR ¼ arg min Eu ½C u ðcÞ, c∈F

(4.2)

where C_θ is the cost function for model θ (Dalton and Dougherty, 2014; Yoon et al., 2013; Dougherty, 2018). An IBR operator is robust in the sense that on average it performs well over the uncertainty class. Since each θ ∈ Θ corresponds to a model, π(θ) quantifies our prior knowledge that some models are more likely to be the actual model than are others. If there is no prior knowledge beyond the uncertainty class itself, then the prior can sometimes be taken to be uniform.

With linear filtering, the IBR filter is found in terms of the effective auto- and cross-covariance functions or effective power spectra (Dalton and Dougherty, 2014). These are examples of effective characteristics, which replace the original characteristics for a known model with characteristics representing the entire uncertainty class. IBR design has been applied in various engineering areas, including Kalman filtering, where the Kalman gain matrix is replaced by the effective Kalman gain matrix (Dehghannasiri et al., 2017a), and Karhunen–Loève compression, where the covariance matrix is replaced by the effective covariance matrix (Dehghannasiri et al., 2018a).

When a set S of sample data is available, the prior distribution can be updated to a posterior distribution π*(θ) = π(θ|S). The IBR operator relative to the posterior distribution is called an optimal Bayesian operator (OBO) and is denoted by ψ_OBO. An IBR operator is an OBO with S = ∅. Besides classification, which is our topic here, OBO design has been developed for Kalman filtering (Dehghannasiri et al., 2018b) and regression (Qian and Dougherty, 2016).

To reduce computation when an IBR operator is found via a search, for instance, when effective characteristics are not known, one can constrain the minimization to operators that are optimal for some state θ ∈ Θ. A model-constrained (state-constrained) Bayesian robust (MCBR) operator is defined by

\[
\psi^{\Theta}_{\mathrm{MCBR}} = \arg\min_{\psi \in \mathcal{F}_{\Theta}} E_{\theta}[C_{\theta}(\psi)], \tag{4.3}
\]


where F_Θ is the set of all operators in F that are optimal for some θ ∈ Θ. An MCBR operator is suboptimal relative to an IBR operator.

This chapter develops the optimal Bayesian classifier (and therefore the IBR classifier) for binary classification, and the following chapter extends this to the optimal Bayesian classifier for multi-class classification. Whereas in the case of Wiener and Kalman filtering, an IBR filter is determined in the same form as the standard filter with the characteristics replaced by effective characteristics, for classification we will see that an optimal IBR classifier is obtained similarly to the usual Bayes classifier with the class-conditional distributions being replaced by effective class-conditional distributions. This is an example of an IBR operator being found in the same manner as the standard optimal operator, but with the original random process replaced by effective processes, another example being IBR clustering, where the original underlying random labeled point sets are replaced by effective random labeled point sets (Dalton et al., 2018).

In the 1960s, control theorists formulated the problem of model uncertainty in a Bayesian manner by assuming an uncertainty class of models, positing a prior distribution on the uncertainty class, and then selecting the control policy that minimizes a cost function averaged over the uncertainty class (Silver, 1963; Gozzolino et al., 1965; Martin, 1967). Application was impeded by the extreme computational burden, the result being that adaptive control became prevalent. The computational cost remains a problem (Yousefi and Dougherty, 2014). In the 1970s, signal-processing theorists took a minimax approach by assuming an uncertainty class of covariance matrices (or power spectra) and choosing the linear filter that gave the best worst-case performance over the uncertainty class (Kuznetsov, 1976; Kassam and Lim, 1977; Poor, 1980). Later an MCBR approach to filtering was taken (Grigoryan and Dougherty, 2001), followed by use of the IBR methodology.

A critical point regarding IBR design is that the prior distribution is not on the parameters of the operator (controller, filter, classifier, clusterer), but on the unknown parameters of the scientific model. If the model were known with certainty, then one would optimize with respect to the known model; if the model is uncertain, then the optimization is naturally extended to include model uncertainty and the prior distribution on that uncertainty. Model uncertainty induces uncertainty on the model-specific optimal operator parameters. If one places the prior directly on the operator parameters while ignoring model uncertainty, then there is a scientific gap, meaning that the relation between scientific knowledge and operator design is broken. For instance, compare optimal Bayesian regression (Qian and Dougherty, 2016), which is a form of optimal Bayesian filtering, to standard Bayesian linear regression (Bernardo and Smith, 2000; Bishop, 2006; Hastie et al., 2009; Murphy, 2012). In the latter, the connection between the regression functions


and prior assumptions on the underlying physical system is unclear. Prior knowledge should be dictated by scientific knowledge.
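To make the difference between Eqs. 4.2 and 4.3 concrete, the following is a minimal numerical sketch (not from the book) for a finite uncertainty class and a finite operator family, with the cost of every operator under every model stored in a table; the array names and the toy cost values are assumptions of the sketch, not anything prescribed by the text.

```python
# Illustrative sketch: IBR (Eq. 4.2) and MCBR (Eq. 4.3) selection over a finite
# uncertainty class Theta = {theta_1, ..., theta_m} with prior pi and a finite
# operator family F, given a cost table cost[i, j] = C_{theta_i}(psi_j).
import numpy as np

def ibr_operator(cost, prior):
    """Index of the operator minimizing the prior-averaged cost E_theta[C_theta(psi)]."""
    expected_cost = prior @ cost              # one expected cost per operator
    return int(np.argmin(expected_cost))

def mcbr_operator(cost, prior):
    """Same objective, but restricted to operators optimal for some model theta."""
    candidates = np.unique(np.argmin(cost, axis=1))   # model-specific optima
    expected_cost = prior @ cost
    return int(candidates[np.argmin(expected_cost[candidates])])

# toy example: 3 models, 4 operators (values are arbitrary)
rng = np.random.default_rng(0)
cost = rng.random((3, 4))
prior = np.array([0.5, 0.3, 0.2])
print(ibr_operator(cost, prior), mcbr_operator(cost, prior))
```

Replacing the prior weights with posterior weights π*(θ) in the same computation yields the corresponding optimal Bayesian operator.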

4.2 Optimal Bayesian Classifier

Ordinary Bayes classifiers minimize the misclassification probability when the underlying distributions are known and utilize Bayes’ theorem according to Eq. 1.2. On the other hand, optimal Bayesian classification trains a classifier from data while assuming that the feature-label distribution is contained in an uncertainty class parameterized by θ ∈ Θ that is governed by a prior distribution π(θ). Via Eq. 4.2, we define an optimal Bayesian classifier (OBC) by

\[
\psi_{\mathrm{OBC}} = \arg\min_{\psi \in \mathcal{C}} E_{\pi^*}[\varepsilon_n(\theta, \psi)], \tag{4.4}
\]

where ε_n(θ, ψ) is the cost function, and C is an arbitrary family of classifiers. The family C in the definition allows one to control the classifiers considered, for instance, insisting on a linear form, restricting complexity, or constraining the search to cases with known closed-form error estimates or other desirable properties.

Although Bayes classifiers and OBCs both employ Bayes’ theorem to facilitate derivations, they are fundamentally different and apply Bayes’ theorem differently. Bayes classifiers place priors Pr(Y = y) on the class probabilities and evaluate posteriors Pr(Y = y | X = x) conditioned on the test point via Eq. 1.2. For a sample S_n consisting of n_y points x_1^y, x_2^y, …, x_{n_y}^y from class y ∈ {0, 1}, we see from Eqs. 2.17 and 2.18 that OBCs place priors and posteriors on the feature-label distribution parameters (including c), where posteriors are conditioned on the training data. Under the Bayesian framework,

\[
\Pr(\psi(X) \neq Y \mid S_n) = E_{\pi^*}[\Pr(\psi(X) \neq Y \mid \theta, S_n)] = E_{\pi^*}[\varepsilon_n(\theta, \psi)] = \hat{\varepsilon}_n(S_n, \psi). \tag{4.5}
\]

Thus, OBCs minimize the expected misclassification probability relative to the assumed model or, equivalently, minimize the Bayesian MMSE error estimate. An OBC can be found by brute force using closed-form solutions for the expected true error (the Bayesian MMSE error estimator), when available; however, if C is the set of all classifiers (with measurable decision regions), then the next theorem shows that an OBC can be found analogously to a Bayes classifier for a fixed distribution. In particular, we can realize an optimal solution without explicitly finding the error for every classifier because the solution can be found pointwise: choose the class with the highest density at this point, where instead of using the class-conditional density, as in the classical setting, we use the effective class-conditional density f_Θ(x|y). The density at other points in the space does not affect the classifier decision.


Theorem 4.1 (Dalton and Dougherty, 2013a). An OBC ψ_OBC satisfying Eq. 4.4 over all ψ ∈ C, the set of all classifiers with measurable decision regions, exists and is given pointwise by

\[
\psi_{\mathrm{OBC}}(\mathbf{x}) =
\begin{cases}
0 & \text{if } E_{\pi^*}[c]\, f_{\Theta}(\mathbf{x}|0) \ge (1 - E_{\pi^*}[c])\, f_{\Theta}(\mathbf{x}|1), \\
1 & \text{otherwise.}
\end{cases} \tag{4.6}
\]

Proof. For any classifier ψ with measurable decision regions R_0 and R_1, E_π*[ε_n(θ, ψ)] is given by Eq. 2.27. To minimize the integral, we minimize the integrand pointwise. This is achieved by the classifier in the theorem, which indeed has measurable decision regions. ▪

If E_π*[c] = 0, then this OBC is a constant and always assigns class 1, and if E_π*[c] = 1, then it always assigns class 0. Hence, we will typically assume that 0 < E_π*[c] < 1.

According to the theorem, to find an OBC we can average the class-conditional densities f_{θ_y}(x|y) relative to the posterior distribution to obtain the effective class-conditional density f_Θ(x|y), whereby an OBC is found by Eq. 4.6. Essentially, the optimal thing to do is to find the Bayes classifier using f_Θ(x|0) and f_Θ(x|1) as the true class-conditional distributions. This is like a plug-in rule using f_Θ(x|y). This methodology finds an OBC over all possible classifiers, which is guaranteed to have minimum expected error in the given model. Henceforth, we will only consider OBCs over the space of all classifiers given in Theorem 4.1. The OBC is defined by the optimization of Eq. 4.4, and Theorem 4.1 provides a general solution regardless of the feature-label distributions in the uncertainty class as long as the classifier family consists of all classifiers with measurable decision regions. The optimization fits into the general paradigm of optimization under model uncertainty, and therefore the classifier of Eq. 4.6 provides a solution to Eq. 4.2. In the Gaussian case, this classifier has a long history in Bayesian statistics (Jeffreys, 1961; Geisser, 1964; Guttman and Tiao, 1964; Dunsmore, 1966), with the effective density being called the predictive density.
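As a concrete rendering of Theorem 4.1, here is a minimal sketch (not the book's code) of the pointwise rule in Eq. 4.6. The callables f0 and f1, standing for the effective class-conditional densities f_Θ(x|0) and f_Θ(x|1), and the scalar Ec, standing for E_π*[c], are assumptions of the sketch to be supplied by the user, e.g., from the closed forms of Chapter 2 or from Monte Carlo averages of f_{θ_y}(x|y) over posterior samples of θ_y.

```python
def obc_label(x, f0, f1, Ec):
    """Pointwise OBC rule of Eq. 4.6: return 0 when
    E[c] * f_Theta(x|0) >= (1 - E[c]) * f_Theta(x|1), and 1 otherwise."""
    return 0 if Ec * f0(x) >= (1.0 - Ec) * f1(x) else 1
```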

4.3 Discrete Model

For the discrete model, the OBC is found by applying Theorem 4.1 to the effective class-conditional densities in Eq. 2.53:

\[
\psi_{\mathrm{OBC}}(j) =
\begin{cases}
0 & \text{if } E_{\pi^*}[c]\, \dfrac{U_j^0 + \alpha_j^0}{n_0 + \sum_{i=1}^{b} \alpha_i^0} \ge (1 - E_{\pi^*}[c])\, \dfrac{U_j^1 + \alpha_j^1}{n_1 + \sum_{i=1}^{b} \alpha_i^1}, \\
1 & \text{otherwise.}
\end{cases} \tag{4.7}
\]


The expected error of the optimal classifier is found via Theorem 2.1:

\[
\hat{\varepsilon}_n(S_n, \psi_{\mathrm{OBC}}) = \sum_{j=1}^{b} \min\!\left\{ E_{\pi^*}[c]\, \frac{U_j^0 + \alpha_j^0}{n_0 + \sum_{i=1}^{b} \alpha_i^0},\; (1 - E_{\pi^*}[c])\, \frac{U_j^1 + \alpha_j^1}{n_1 + \sum_{i=1}^{b} \alpha_i^1} \right\}. \tag{4.8}
\]

In this case it is easy to verify that the OBC minimizes the Bayesian MMSE error estimator by minimizing each term in the sum of Eq. 2.55. This is achieved by assigning ψ(j) the class with the smaller constant scaling the indicator function. In the special case where we have uniform priors for c (a_y = 1 for y ∈ {0, 1}), random sampling, and uniform priors for the bin probabilities (α_i^y = 1 for all i and y), the OBC is

\[
\psi_{\mathrm{OBC}}(j) =
\begin{cases}
0 & \text{if } \dfrac{n_0 + 1}{n_0 + b}\,(U_j^0 + 1) \ge \dfrac{n_1 + 1}{n_1 + b}\,(U_j^1 + 1), \\
1 & \text{otherwise,}
\end{cases} \tag{4.9}
\]

and the expected error of the optimal classifier is

\[
\hat{\varepsilon}_n(S_n, \psi_{\mathrm{OBC}}) = \sum_{j=1}^{b} \min\!\left\{ \frac{n_0 + 1}{n + 2} \cdot \frac{U_j^0 + 1}{n_0 + b},\; \frac{n_1 + 1}{n + 2} \cdot \frac{U_j^1 + 1}{n_1 + b} \right\}. \tag{4.10}
\]

Hence, when n_0 = n_1, the discrete histogram rule is equivalent to an OBC that assumes uniform priors and random sampling. Under arbitrary n_y, it is also equivalent to an OBC that assumes improper priors a_y = α_i^y = 0 for all i and y and random sampling. Otherwise, the discrete histogram rule is not necessarily optimal within an arbitrary Bayesian framework.

Example 4.1. To demonstrate the advantage of OBCs, consider a synthetic simulation where c and the bin probabilities are generated randomly according to uniform prior distributions. For each fixed feature-label distribution, a binomial(n, c) experiment determines the number of sample points in class 0, and the bin for each point is drawn from bin probabilities corresponding to its class, thus generating a random sample of size n. Both the histogram rule and the OBC from Eq. 4.9 are trained from the sample. The true error for each classifier is calculated via Eqs. 2.41 and 2.42. This is repeated 100,000 times to obtain the average true error for each classification rule, presented in Fig. 4.1 for b = 4 and 8 bins. The average performance of the OBC is superior to that of the discrete histogram rule, especially for larger bin sizes. However, OBCs are not guaranteed to be optimal for a specific distribution (the optimal classifier is the Bayes classifier), but are only optimal when averaged over all distributions relative to the posterior distribution.
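For the uniform-prior special case, the following is a minimal NumPy sketch (not from the book) of the discrete OBC of Eq. 4.9 next to the discrete histogram rule; the bin-count arrays and the toy counts at the end are assumptions made for illustration.

```python
import numpy as np

def histogram_rule(U0, U1):
    """Discrete histogram rule: bin j goes to class 1 only if it holds strictly
    more class-1 than class-0 training points (ties go to class 0)."""
    return (np.asarray(U1) > np.asarray(U0)).astype(int)

def discrete_obc_uniform(U0, U1):
    """Discrete OBC of Eq. 4.9 under uniform priors and random sampling."""
    U0, U1 = np.asarray(U0), np.asarray(U1)
    n0, n1, b = U0.sum(), U1.sum(), U0.size
    s0 = (n0 + 1) / (n0 + b) * (U0 + 1)   # class-0 score for each bin
    s1 = (n1 + 1) / (n1 + b) * (U1 + 1)   # class-1 score for each bin
    return (s1 > s0).astype(int)          # label 0 wherever s0 >= s1

U0, U1 = [5, 1, 0, 2], [1, 3, 2, 1]       # toy bin counts, b = 4
print(histogram_rule(U0, U1), discrete_obc_uniform(U0, U1))
```

With equal class sample sizes the two rules coincide, as noted above; they can differ when n_0 ≠ n_1.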

Figure 4.1 Average true errors on discrete distributions from known priors with uniform c and bin probabilities versus sample size: (a) b = 4; (b) b = 8. [Reprinted from (Dalton and Dougherty, 2013a).]

4.4 Gaussian Model

The four covariance models considered in Chapter 2 can be categorized into three effective density models: Gaussian (known covariance), multivariate t (scaled identity and general covariance structure), and independent non-standardized Student’s t (diagonal covariance). We will consider all combinations of these effective density models for the two classes: (1) covariances known in both classes, (2) covariances diagonal in both classes, (3) covariances scaled identity/general in both classes, and (4) mixed covariance models. The homoscedastic covariance model can be applied in the second and third cases, with the appropriate definition of the posterior hyperparameters. There is no feature selection in the OBC designed here; rather, these classifiers utilize all features in the model to minimize the expected true error. We also discuss the close relationships between OBCs and their plug-in counterparts, LDA, QDA, and NMC, in terms of analytic formulation, approximation, and convergence as n → ∞.

Although all of these models assume Gaussian class-conditional densities, in some cases the OBC is not linear or quadratic, but rather takes a higher-order polynomial form, although one can easily evaluate the classifier at a given test point x. When the OBC is linear, the Bayesian MMSE error estimator can be found in closed form using equations derived in Chapter 2; otherwise, it may be approximated via Theorem 2.1 by sampling the effective density.

4.4.1 Both covariances known

In the presence of uncertainty, when both covariances are known, the effective class-conditional distributions are Gaussian and are provided in Eq. 2.96 with ν = ν_y* and m = m_y*. The OBC is the optimal classifier between the effective


Gaussians with class-0 probability E_π*[c]. That is, ψ_OBC(x) = 0 if g_OBC(x) ≤ 0, and ψ_OBC(x) = 1 if g_OBC(x) > 0, where

\[
g_{\mathrm{OBC}}(\mathbf{x}) = \mathbf{x}^T A_{\mathrm{OBC}} \mathbf{x} + \mathbf{a}_{\mathrm{OBC}}^T \mathbf{x} + b_{\mathrm{OBC}} \tag{4.11}
\]

is quadratic with

\[
A_{\mathrm{OBC}} = -\frac{1}{2}\left[ \frac{\nu_1^*}{\nu_1^* + 1} \Sigma_1^{-1} - \frac{\nu_0^*}{\nu_0^* + 1} \Sigma_0^{-1} \right], \tag{4.12}
\]

\[
\mathbf{a}_{\mathrm{OBC}} = \frac{\nu_1^*}{\nu_1^* + 1} \Sigma_1^{-1} \mathbf{m}_1^* - \frac{\nu_0^*}{\nu_0^* + 1} \Sigma_0^{-1} \mathbf{m}_0^*, \tag{4.13}
\]

\[
b_{\mathrm{OBC}} = -\frac{1}{2}\left[ \frac{\nu_1^*}{\nu_1^* + 1} (\mathbf{m}_1^*)^T \Sigma_1^{-1} \mathbf{m}_1^* - \frac{\nu_0^*}{\nu_0^* + 1} (\mathbf{m}_0^*)^T \Sigma_0^{-1} \mathbf{m}_0^* \right] + \ln\!\left[ \frac{1 - E_{\pi^*}[c]}{E_{\pi^*}[c]} \left( \frac{\nu_1^*(\nu_0^* + 1)}{\nu_0^*(\nu_1^* + 1)} \right)^{\!D/2} \frac{|\Sigma_0|^{1/2}}{|\Sigma_1|^{1/2}} \right]. \tag{4.14}
\]

This classifier is simply a QDA plug-in rule with [(ν_y* + 1)/ν_y*] Σ_y in place of the covariance for class y, m_y* in place of the mean for class y (m_y* = μ̂_y if ν_0 = ν_1 = 0), and E_π*[c] in place of c (E_π*[c] = n_0/n under improper beta priors with a_y = 0 and random sampling). If

\[
\Sigma = \frac{\nu_0^* + 1}{\nu_0^*}\, \Sigma_0 = \frac{\nu_1^* + 1}{\nu_1^*}\, \Sigma_1, \tag{4.15}
\]

then the classifier is linear with

\[
g_{\mathrm{OBC}}(\mathbf{x}) = \mathbf{a}_{\mathrm{OBC}}^T \mathbf{x} + b_{\mathrm{OBC}}, \tag{4.16}
\]

\[
\mathbf{a}_{\mathrm{OBC}} = \Sigma^{-1}(\mathbf{m}_1^* - \mathbf{m}_0^*), \tag{4.17}
\]

and

\[
b_{\mathrm{OBC}} = -\frac{1}{2}(\mathbf{m}_1^* - \mathbf{m}_0^*)^T \Sigma^{-1} (\mathbf{m}_1^* + \mathbf{m}_0^*) + \ln \frac{1 - E_{\pi^*}[c]}{E_{\pi^*}[c]}. \tag{4.18}
\]

This is equivalent to a plug-in rule like LDA with Σ in place of the covariance for both classes, m_y* in place of the mean for class y, and E_π*[c] in place of c. The expected true error for this case is simply the true error for this linear classifier under the effective densities: Gaussian distributions with mean m_y* and covariance Σ. This is given by

\[
\hat{\varepsilon}_n(S_n, \psi_{\mathrm{OBC}}) = E_{\pi^*}[c]\, \hat{\varepsilon}_n^0(S_n, \psi_{\mathrm{OBC}}) + (1 - E_{\pi^*}[c])\, \hat{\varepsilon}_n^1(S_n, \psi_{\mathrm{OBC}}), \tag{4.19}
\]


where

\[
\hat{\varepsilon}_n^y(S_n, \psi_{\mathrm{OBC}}) = \Phi\!\left( \frac{(-1)^y g_{\mathrm{OBC}}(\mathbf{m}_y^*)}{\sqrt{\mathbf{a}_{\mathrm{OBC}}^T \Sigma\, \mathbf{a}_{\mathrm{OBC}}}} \right) = \Phi\!\left( \sqrt{\frac{\nu_y^*}{\nu_y^* + 1}}\; \frac{(-1)^y g_{\mathrm{OBC}}(\mathbf{m}_y^*)}{\sqrt{\mathbf{a}_{\mathrm{OBC}}^T \Sigma_y\, \mathbf{a}_{\mathrm{OBC}}}} \right). \tag{4.20}
\]

The last line also follows by plugging g_OBC and a_OBC into Eq. 2.124.
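For reference, here is a minimal NumPy sketch (not the book's code) that assembles the known-covariance OBC of Eqs. 4.11–4.14 from the posterior hyperparameters ν_y* and m_y*, the known covariances Σ_y, and E_π*[c]; the argument names are assumptions of the sketch.

```python
import numpy as np

def obc_known_cov(nu0, m0, S0, nu1, m1, S1, Ec):
    """Return a classifier x -> {0, 1} implementing Eqs. 4.11-4.14."""
    S0i, S1i = np.linalg.inv(S0), np.linalg.inv(S1)
    D = len(m0)
    A = -0.5 * (nu1 / (nu1 + 1) * S1i - nu0 / (nu0 + 1) * S0i)        # Eq. 4.12
    a = nu1 / (nu1 + 1) * S1i @ m1 - nu0 / (nu0 + 1) * S0i @ m0       # Eq. 4.13
    b = (-0.5 * (nu1 / (nu1 + 1) * m1 @ S1i @ m1
                 - nu0 / (nu0 + 1) * m0 @ S0i @ m0)
         + np.log((1 - Ec) / Ec
                  * (nu1 * (nu0 + 1) / (nu0 * (nu1 + 1))) ** (D / 2)
                  * np.sqrt(np.linalg.det(S0) / np.linalg.det(S1))))  # Eq. 4.14
    def classify(x):
        g = x @ A @ x + a @ x + b                                     # Eq. 4.11
        return 0 if g <= 0 else 1
    return classify
```

When Eq. 4.15 holds, the quadratic term vanishes and the same coefficients reduce to the linear discriminant of Eqs. 4.16–4.18, for which the closed-form error expression of Eqs. 4.19–4.20 applies.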

4.4.2 Both covariances diagonal

When both covariances are diagonal, whether they be independent or homoscedastic, the effective class-conditional distribution is given by Eq. 2.110, where ν_y* and m_y* are given by Eqs. 2.65 and 2.66, respectively. In addition, α and β_i are given by Eqs. 2.107 and 2.108, respectively, where κ_y* and S_y* are given by Eqs. 2.67 and 2.68, respectively, in the independent covariance model, and κ* and S* are given by Eqs. 2.80 and 2.81, respectively, in the homoscedastic covariance model. Let m_{y,i}* be the ith element in m_y*, α_y be α for class y, and β_{y,i} be β_i for class y. Hence, the effective densities take the form

\[
f_{\Theta}(\mathbf{x}|y) = \prod_{i=1}^{D} \frac{\Gamma(\alpha_y + \tfrac{1}{2})}{\Gamma(\alpha_y)\, \pi^{1/2} (2\alpha_y)^{1/2}} \left[ \frac{\beta_{y,i}(\nu_y^* + 1)}{\alpha_y \nu_y^*} \right]^{-1/2} \left[ 1 + \frac{\nu_y^* (x_i - m_{y,i}^*)^2}{2 \beta_{y,i} (\nu_y^* + 1)} \right]^{-(\alpha_y + 1/2)}. \tag{4.21}
\]

The discriminant of the OBC can be simplified to

\[
g_{\mathrm{OBC}}(\mathbf{x}) = K \prod_{i=1}^{D} \left[ 1 + \frac{\nu_0^* (x_i - m_{0,i}^*)^2}{2 \beta_{0,i} (\nu_0^* + 1)} \right]^{2\alpha_0 + 1} - \prod_{i=1}^{D} \left[ 1 + \frac{\nu_1^* (x_i - m_{1,i}^*)^2}{2 \beta_{1,i} (\nu_1^* + 1)} \right]^{2\alpha_1 + 1}, \tag{4.22}
\]

where

\[
K = \left( \frac{1 - E_{\pi^*}[c]}{E_{\pi^*}[c]} \right)^{\!2} \left[ \frac{(\nu_0^* + 1)\nu_1^*}{\nu_0^* (\nu_1^* + 1)} \right]^{\!D} \left[ \frac{\Gamma(\alpha_0)\, \Gamma(\alpha_1 + \tfrac{1}{2})}{\Gamma(\alpha_0 + \tfrac{1}{2})\, \Gamma(\alpha_1)} \right]^{\!2D} \prod_{i=1}^{D} \frac{\beta_{0,i}}{\beta_{1,i}}. \tag{4.23}
\]
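Since each factor in Eq. 4.21 is a location-scale Student's t density with 2α_y degrees of freedom, location m_{y,i}*, and scale [β_{y,i}(ν_y* + 1)/(α_y ν_y*)]^{1/2}, the effective density can be evaluated with standard routines. The following minimal SciPy sketch (illustrative, not from the book) does so and applies the pointwise rule of Eq. 4.6 directly rather than the expanded discriminant of Eq. 4.22; the parameter packing is an assumption of the sketch.

```python
import numpy as np
from scipy.stats import t as student_t

def effective_density_diag(x, alpha, beta, nu, m):
    """Eq. 4.21: product over features of location-scale t marginals.
    x, beta, m are length-D arrays; alpha and nu are the class's posterior scalars."""
    scale = np.sqrt(beta * (nu + 1) / (alpha * nu))
    return np.prod(student_t.pdf(x, df=2 * alpha, loc=m, scale=scale))

def obc_diag(x, Ec, params0, params1):
    """params_y = (alpha_y, beta_y, nu_y, m_y); label 0 if
    E[c] f_Theta(x|0) >= (1 - E[c]) f_Theta(x|1)."""
    f0 = effective_density_diag(x, *params0)
    f1 = effective_density_diag(x, *params1)
    return 0 if Ec * f0 >= (1 - Ec) * f1 else 1
```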


4.4.3 Both covariances scaled identity or general

In the scaled identity covariance model, the effective class-conditional distribution is a multivariate t-distribution given by Eq. 2.103, where α and β are given by Eqs. 2.101 and 2.102, respectively. In the general covariance model, the effective density is again a multivariate t-distribution given by Eq. 2.120. In all models, ν_y* and m_y* are given by Eqs. 2.65 and 2.66, respectively. In all independent covariance models, κ_y* and S_y* are given by Eqs. 2.67 and 2.68, respectively, and in all homoscedastic covariance models κ* and S* are given by Eqs. 2.80 and 2.81, respectively. If the covariances are: (1) independent and follow some combination of the scaled identity and general covariance models, (2) homoscedastic scaled identity, or (3) homoscedastic general, then the effective class-conditional densities take the form

\[
f_{\Theta}(\mathbf{x}|y) = \frac{\Gamma\!\left(\frac{k_y + D}{2}\right)}{\Gamma\!\left(\frac{k_y}{2}\right) k_y^{D/2} \pi^{D/2} |\mathbf{C}_y|^{1/2}} \left[ 1 + \frac{1}{k_y} (\mathbf{x} - \mathbf{m}_y^*)^T \mathbf{C}_y^{-1} (\mathbf{x} - \mathbf{m}_y^*) \right]^{-\frac{k_y + D}{2}}, \tag{4.24}
\]

where k_y and C_y denote the degrees of freedom and the scale matrix for class y, respectively (the location vector is always m_y*). The discriminant of the OBC can be simplified to

\[
g_{\mathrm{OBC}}(\mathbf{x}) = K \left[ 1 + \frac{1}{k_0} (\mathbf{x} - \mathbf{m}_0^*)^T \mathbf{C}_0^{-1} (\mathbf{x} - \mathbf{m}_0^*) \right]^{k_0 + D} - \left[ 1 + \frac{1}{k_1} (\mathbf{x} - \mathbf{m}_1^*)^T \mathbf{C}_1^{-1} (\mathbf{x} - \mathbf{m}_1^*) \right]^{k_1 + D}, \tag{4.25}
\]

where

\[
K = \left( \frac{1 - E_{\pi^*}[c]}{E_{\pi^*}[c]} \right)^{\!2} \left( \frac{k_0}{k_1} \right)^{\!D} \frac{|\mathbf{C}_0|}{|\mathbf{C}_1|} \left[ \frac{\Gamma\!\left(\frac{k_0}{2}\right) \Gamma\!\left(\frac{k_1 + D}{2}\right)}{\Gamma\!\left(\frac{k_0 + D}{2}\right) \Gamma\!\left(\frac{k_1}{2}\right)} \right]^{2}. \tag{4.26}
\]
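The following is a minimal NumPy/SciPy sketch (not the book's code) of the discriminant in Eqs. 4.25–4.26, with K assembled on the log scale for numerical stability; the argument names k0, m0, C0, k1, m1, C1, and Ec stand for k_y, m_y*, C_y, and E_π*[c] and are assumptions of the sketch. Class 0 is assigned when the returned value is at most zero.

```python
import numpy as np
from scipy.special import gammaln

def obc_t_discriminant(x, Ec, k0, m0, C0, k1, m1, C1):
    """g_OBC(x) of Eq. 4.25 for multivariate-t effective class-conditional densities."""
    D = len(x)
    q0 = 1 + (x - m0) @ np.linalg.solve(C0, x - m0) / k0   # Mahalanobis-type terms
    q1 = 1 + (x - m1) @ np.linalg.solve(C1, x - m1) / k1
    # log of K in Eq. 4.26
    logK = (2 * np.log((1 - Ec) / Ec)
            + D * np.log(k0 / k1)
            + np.linalg.slogdet(C0)[1] - np.linalg.slogdet(C1)[1]
            + 2 * (gammaln(k0 / 2) + gammaln((k1 + D) / 2)
                   - gammaln((k0 + D) / 2) - gammaln(k1 / 2)))
    return np.exp(logK) * q0 ** (k0 + D) - q1 ** (k1 + D)  # Eq. 4.25
```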

This classifier has a polynomial decision boundary if k_0 and k_1 are integers, which is satisfied for both scaled identity and general covariance models with independent covariances if κ_0 and κ_1 are integers, and with homoscedastic covariances if κ is an integer. In the special case where k = k_0 = k_1, which always occurs in a homoscedastic covariance model, the OBC can be further simplified to

\[
g_{\mathrm{OBC}}(\mathbf{x}) = (\mathbf{x} - \mathbf{m}_0^*)^T \left( \frac{\mathbf{C}_0}{K_0} \right)^{-1} (\mathbf{x} - \mathbf{m}_0^*) - (\mathbf{x} - \mathbf{m}_1^*)^T \left( \frac{\mathbf{C}_1}{K_1} \right)^{-1} (\mathbf{x} - \mathbf{m}_1^*) + k (K_0 - K_1), \tag{4.27}
\]

where

\[
K_0 = \left[ \frac{|\mathbf{C}_0|}{(E_{\pi^*}[c])^2} \right]^{\frac{1}{k + D}}, \tag{4.28}
\]

\[
K_1 = \left[ \frac{|\mathbf{C}_1|}{(1 - E_{\pi^*}[c])^2} \right]^{\frac{1}{k + D}}. \tag{4.29}
\]

Thus, the OBC is quadratic and given by Eq. 4.11, with

\[
A_{\mathrm{OBC}} = -\frac{1}{2}\left( K_1 \mathbf{C}_1^{-1} - K_0 \mathbf{C}_0^{-1} \right), \tag{4.30}
\]

\[
\mathbf{a}_{\mathrm{OBC}} = K_1 \mathbf{C}_1^{-1} \mathbf{m}_1^* - K_0 \mathbf{C}_0^{-1} \mathbf{m}_0^*, \tag{4.31}
\]

\[
b_{\mathrm{OBC}} = -\frac{1}{2}\left( K_1 (\mathbf{m}_1^*)^T \mathbf{C}_1^{-1} \mathbf{m}_1^* - K_0 (\mathbf{m}_0^*)^T \mathbf{C}_0^{-1} \mathbf{m}_0^* \right) + \frac{k (K_0 - K_1)}{2}. \tag{4.32}
\]

The optimal classifier becomes quadratic like QDA, except with K_y^{-1} C_y in place of the covariance for class y, m_y* in place of the mean for class y (if ν_0 = ν_1 = 0, this is μ̂_y), and the threshold b_OBC.

If k = k_0 = k_1 and C = K_0^{-1} C_0 = K_1^{-1} C_1, then the OBC is linear and given by Eq. 4.16, with

\[
\mathbf{a}_{\mathrm{OBC}} = \mathbf{C}^{-1}(\mathbf{m}_1^* - \mathbf{m}_0^*), \tag{4.33}
\]

\[
b_{\mathrm{OBC}} = -\frac{1}{2}\, \mathbf{a}_{\mathrm{OBC}}^T (\mathbf{m}_1^* + \mathbf{m}_0^*) + \frac{k (K_0 - K_1)}{2}. \tag{4.34}
\]

This occurs in the (scaled identity or general) homoscedastic covariance model if

\[
\left( \frac{\nu_0^* + 1}{\nu_0^*} \right)^{\!k} (E_{\pi^*}[c])^2 = \left( \frac{\nu_1^* + 1}{\nu_1^*} \right)^{\!k} (1 - E_{\pi^*}[c])^2, \tag{4.35}
\]

which is guaranteed if ν_0* = ν_1* and E_π*[c] = 0.5.

OBCs under these models can be shown to be approximately equivalent to plug-in counterparts, LDA, QDA, and NMC, under certain non-informative priors. We consider several cases: (1) independent general covariances with n_0 = n_1 and E_π*[c] = 0.5 (the OBC is similar to QDA with a modified


estimated covariance and modified threshold); (2) independent scaled identity covariances with n_0 = n_1 and E_π*[c] = 0.5 (similar to QDA with a modified estimated scaled identity covariance and modified threshold); (3) homoscedastic general covariances (closely approximated by and converging to LDA when plugging in E_π*[c] for the class-0 probability, and equivalent to LDA when n_0 = n_1 and E_π*[c] = 0.5); and (4) homoscedastic scaled identity covariances (closely approximated by and converging to NMC when we plug E_π*[c] in for the class-0 probability, and equivalent to NMC when n_0 = n_1 and E_π*[c] = 0.5).

There are three salient points. First, OBCs with these non-informative priors assuming independent or homoscedastic covariances have close, and in some cases nearly identical, performance to QDA and LDA, respectively. This suggests that the classical LDA and QDA classifiers are nearly optimal, given their respective covariance and prior assumptions. If informative priors are known, then OBC can further improve performance. Second, experiments with synthetic Gaussian data suggest that the performance of the OBC converges to that of Bayes classification, in which case a non-informative prior can be overcome when enough data are obtained as long as the true distribution is contained in the uncertainty class. Third, the OBC essentially unifies LDA and QDA in a general framework, showing how to optimally design classifiers under different Gaussian modeling assumptions.

Independent General Covariances

Consider a Gaussian independent general covariance model with noninformative priors having hyperparameters n0 ¼ n1 ¼ 0, k0 ¼ k1 ¼ 0, and b y and S0 ¼ S1 ¼ 0DD (the Jeffreys rule prior). It can be shown that my ¼ m  b b b is the sample mean, and S is the sample S ¼ ðn  1ÞS , where m y

y

y

y

y

covariance for class y. If we further assume that n0 ¼ n1 points are observed in each class, then n0 ¼ n1 ¼ n∕2 and k0 ¼ k1 ¼ n∕2. Finally, assume that Ep ½c ¼ 0.5. Define 2 b : e ¼ jS b jnþ2 S S (4.36) y

y

y

Then the OBC is quadratic as in Eq. 4.11, with AOBC ¼ 

1  e 1 e 1 S  S0 , 2 1

e 1 m e 1 b , aOBC ¼ S 1 b 1  S0 m 0 1  T e 1 e 1 m b1  m b 1 S1 m b 0T S m 0 b0 2    1 n n 2 2 e jn2Dþ2 e jn2Dþ2 þ þ1  1 jS  j S : 0 1 n 2 2

(4.37) (4.38)

bOBC ¼ 

(4.39)


This is like the QDA classifier, which plugs estimated parameters into the Bayes classifier for Gaussian distributions, except that it plugs in a modified e for S and uses a modified threshold b estimate S y y OBC . Example 4.2. We next illustrate performance of the OBC using noninformative priors under fixed heteroscedastic Gaussian distributions with D ¼ 2 features and known c ¼ 0.83. Assume that the true means are m0 ¼ 0D and m1 ¼ 1D , and that the true covariances are S0 ¼ 0.65ID and   1 0.5 : (4.40) S1 ¼ 0.35 0.5 1 Under these distributions, the Bayes classifier is quadratic. We employ stratified sampling, where the proportion of class-0 points is chosen to be as close as possible to c. From each sample, three classifiers are designed: LDA, QDA, and OBC. OBC uses an independent general covariance non-informative prior with ny ¼ 0, ky ¼ 0, and Sy ¼ 0DD . my need not be specified because ny ¼ 0. Once all classifiers have been trained, the true error for each is found under the known modeling assumptions. For LDA the true error is found exactly from Eq. 1.32, and for QDA and OBC (which is non-quadratic under the assumed model) the true error is approximated by generating a stratified test sample of size 100,000. For each sample size, this entire process is repeated for t ¼ 10,000 samples. Once this iteration is complete, for each classifier we find the average true error with respect to sample size, which is reported in Fig. 4.2. This figure contains two graphs: (a) a realization of a sample and classifiers (Bayes, LDA,

Figure 4.2 Classification of fixed Gaussian distributions with c = 0.83 and unequal covariances with respect to sample size: (a) example with n = 60; (b) average true error. OBC assumes independent general covariances and a non-informative prior.


QDA, and OBC) and (b) the average true error for each classifier with respect to sample size. In the realization of a sample and classifiers, points from class 0 are marked with circles and points from class 1 with x’s. To give an idea of the location of the true distributions, level curves of both class-conditional distributions are shown in thin gray lines. These have been found by setting the Mahalanobis distance to 1. Although LDA initially performs best for very small sample sizes, when more points are collected, the OBC has the best performance of the three classifiers and makes a noticeable improvement over QDA on average over the sampling distribution. Independent Scaled Identity Covariances

Now consider a Gaussian independent scaled identity covariance model with non-informative priors having hyperparameters n0 ¼ n1 ¼ 0, k0 ¼ k1 ¼ 0, b . Assume that b y and Sy ¼ ðny  1ÞS and S0 ¼ S1 ¼ 0DD . Again, my ¼ m y     n0 ¼ n1 so that n0 ¼ n1 ¼ n∕2 and k0 ¼ k1 ¼ n∕2, and also assume that Ep ½c ¼ 0.5. Define 

b e ¼ trðSy Þ S y D

ðnþ2Dþ2ÞD4 ðnþ2Dþ4ÞD4

ID :

Then the OBC is quadratic, of the form in Eq. 4.11, with 1  e 1 e 1 AOBC ¼  S  S0 , 2 1

bOBC

e 1 m e 1 b , aOBC ¼ S 1 b 1  S0 m 0  1 e 1 m e 1 m b TS b m b 0T S ¼ m 0 b0 2 1 1 1  h  i 2 2 D n n e jðnþ2Dþ2ÞD4 e jðnþ2Dþ2ÞD4 þ1  1 jS  jS þ : 0 1 n 2 2

(4.41)

(4.42) (4.43)

(4.44)

e of S and uses This is like QDA except that it plugs in a modified estimate S y y a modified threshold bOBC . Homoscedastic General Covariances

Next consider a Gaussian homoscedastic general covariance model with non-informative priors having hyperparameters n0 ¼ n1 ¼ 0, k ¼ 0, and S ¼ 0DD . Make no assumptions on Ep ½c or on n0 and n1 b b y and S ¼ ðn  2ÞS, ðn0 ≠ n1 being possibleÞ. It can be shown that my ¼ m b is the pooled sample covariance defined in Eq. 1.28. Define where S




nDþ1 n0 þ 1 nþ1 2 b ðEp ½cÞnþ1 S, n0  nDþ1 n1 þ 1 nþ1 2 e b S1 ¼ ð1  Ep ½cÞnþ1 S: n1 e ¼ S 0

Then the OBC is quadratic, given by Eq. 4.11, with 1  e 1 e 1 AOBC ¼  S  S0 , 2 1

bOBC

(4.45) (4.46)

(4.47)

e 1 m e 1 b , (4.48) aOBC ¼ S 1 b 1  S0 m 0 2 D  n  2 n þ 1nþ1 nþ1 1  T e 1 1 1 0 T e b 0 S0 m b m b0 þ b S m ¼ m 2 1 1 1 2 n0 Ep ½c D 2  nþ1 n  2 n1 þ 1 nþ1 1  :  2 n1 1  Ep ½c (4.49)

b and S e S b even for relatively small n and n . In addition, e S Note that S 0 1 0 1 the following lemma holds. Lemma 4.1 (Dalton and Dougherty, 2013b). Suppose that both n0 → ` and n1 → ` almost surely over the infinite labeled sampling distribution. Let f ðcÞ ¼ lnðð1  cÞ∕cÞ and  D 2 D  2 nþ1 n  2 n0 þ 1 nþ1 1 nþ1 n1 þ 1 nþ1 1 f n ðcÞ ¼  : (4.50) 2 n0 c n1 1c Then f n ðcÞ → f ðcÞ uniformly (almost surely) over any interval of the form ½d, 1  d with 0 , d , 0.5. Proof. Let e . 0. There exists N . 0 such that 1 , ðn0 þ 1Þ∕n0 , ee∕D and 1 , ðn1 þ 1Þ∕n1 , ee∕D for all n . N. Hence, for c ∈ ½0, 1,   2  e 2 nþ1 n  2 1 nþ1 e2 ≤ f n ðcÞ  1c 2 c (4.51)  e  2 2  nþ1 n  2 e2 nþ1 1 ≤ :  c 2 1c The limit of the right-hand side is lnðð1  cÞ∕cÞ þ e∕2 for all c ∈ ½0, 1, and convergence is uniform over c ∈ ½d, 1  d. Similarly, the limit of the left-hand


side is lnðð1  cÞ∕cÞ  e∕2 for all c ∈ ½0, 1, and convergence is uniform over c ∈ ½d, 1  d. Hence, there exists N 0 . N such that for all n . N 0 and c ∈ ½d, 1  d, 

   1c 1c ln  e ≤ f n ðcÞ ≤ ln þ e, c c

(4.52)

which completes the proof. ▪ By Lemma 4.1, if 0 , c , 1 is the true c, then any consistent estimator bc for c may be used in place of c, and we are guaranteed that f n ðbcÞ → f ðcÞ. When applied using Ep ½c, this shows that the last two terms of bOBC in Eq. 4.49 will track f ðEp ½cÞ, which resembles the analogous term in LDA classification, and this will ultimately converge to f ðcÞ, which is the analogous term in the Bayes classifier. Although the classifier in this case is quadratic, even for relatively small n e  S, b S e  S, b and f ðcÞ  f ðcÞ are very accurate. the approximations S 0 1 n Hence, even under small samples, the OBC is very closely approximated by a variant of LDA using Ep ½c rather than n0 ∕n as the plug-in estimate of c, and, under either random sampling or known c, both OBC and LDA converge to the Bayes classifier. If we assume that n0 ¼ n1 and Ep ½c ¼ 0.5, then the OBC is in fact b 1 ðb b 0 Þ and m1  m linear and of the form in Eq. 4.16 with aOBC ¼ S b 0 Þ. This is exactly the LDA classifier; thus, LDA bOBC ¼ 0.5aTOBC ðb m1 þ m classification is optimal in a Bayesian framework if we assume homoscedastic covariances, the Jeffreys rule prior, n0 ¼ n1 , and Ep ½c ¼ 0.5. Example 4.3. Next consider the performance of the OBC using noninformative priors under fixed homoscedastic Gaussian distributions with D ¼ 2 features and known c ¼ 0.83. Assume that the true means are m0 ¼ 0D and m1 ¼ 1D , and that the true covariances are S0 ¼ S1 ¼ 0.5ID . Under these distributions, the Bayes classifier is linear. We employ stratified sampling where the proportion of class-0 points is set as close as possible to c ¼ 0.5. From each sample, three classifiers are designed: LDA, QDA, and OBC. The OBC uses a homoscedastic general covariance non-informative prior with ny ¼ 0, k ¼ 0, and S ¼ 0DD . my need not be specified because ny ¼ 0. Once all classifiers have been trained, the true error is found exactly for LDA from Eq. 1.32, and for QDA and OBC (which is quadratic, though approximately linear, under the assumed model) the true error is approximated by generating a stratified test sample of size 100,000. For each sample size, this entire process is repeated for t ¼ 10,000 samples.


Figure 4.3 Classification of fixed Gaussian distributions with c = 0.83 and equal scaled identity covariances with respect to sample size: (a) example with n = 60; (b) average true error. OBC assumes homoscedastic general covariances and a non-informative prior.

Once this iteration is complete, for each classifier we find the average true error with respect to sample size, which is reported in Fig. 4.3. As in Fig. 4.2, this figure contains two graphs: (a) a realization of a sample and classifiers (Bayes, LDA, QDA, and OBC) and (b) the average true error for each classifier with respect to sample size. LDA and OBC happen to make a correct modeling assumption that the covariances are equal, and not surprisingly they have superior performance. Further, OBC makes a slight improvement over LDA on average over the sampling distribution. Homoscedastic Scaled Identity Covariances

Consider a Gaussian homoscedastic scaled identity covariance model with non-informative priors having hyperparameters n0 ¼ n1 ¼ 0, k ¼ 0, and S ¼ 0DD . Make no assumptions on Ep ½c or on n0 and n1 . As before, b Define b y and S ¼ ðn  2ÞS. my ¼ m e ¼ S 0 e ¼ S 1





n0 þ 1 n0

n1 þ 1 n1

ðnþDþ1ÞD2

ðnþDþ2ÞD2

2

ðEp ½cÞðnþDþ2ÞD2

ðnþDþ1ÞD2

ðnþDþ2ÞD2

2

b trðSÞ I , D D

ð1  Ep ½cÞðnþDþ2ÞD2

Then the OBC is quadratic, given by Eq. 4.11 with 1  e 1 e 1  S0 , AOBC ¼  S 2 1 e 1 m e 1 b , aOBC ¼ S 1 b 1  S0 m 0

b trðSÞ I : D D

(4.53)

(4.54)

(4.55) (4.56)

Optimal Bayesian Classification

1  T e 1 e 1 m b 0T S b1  m b b 1 S1 m m 0 0 2  D   2  ðnþDþ2ÞD2 Dðn  2Þ n0 þ 1 ðnþDþ2ÞD2 1 þ 2 n0 Ep ½c  D    2 ðnþDþ2ÞD2 Dðn  2Þ n1 þ 1 ðnþDþ2ÞD2 1  : 2 n1 1  Ep ½c

187

bOBC ¼ 

(4.57)

b e  D1 trðSÞI and Analogous to the general covariance case, S 0 D 1 b e S1  D trðSÞID even for relatively small n0 and n1 , and the following lemma holds. Lemma 4.2 (Dalton and Dougherty, 2013b). Suppose that both n0 → ` and n1 → ` almost surely over the infinite labeled sampling distribution. Let f ðcÞ ¼ lnðð1  cÞ∕cÞ and "  D   2 Dðn  2Þ n0 þ 1 ðnþDþ2ÞD2 1 ðnþDþ2ÞD2 f n ðcÞ ¼ 2 n0 c (4.58)   D   2 # ðnþDþ2ÞD2 n1 þ 1 ðnþDþ2ÞD2 1  : n1 1c Then f n ðcÞ → f ðcÞ uniformly (almost surely) over any interval of the form ½d, 1  d with 0 , d , 0.5.

Proof. The proof is nearly identical to that of Lemma 4.1. ▪ As before, if 0 , c , 1, then any consistent estimator bc for c may be used in place of c, and we are guaranteed that f n ðbcÞ → f ðcÞ. Although the classifier in this case is in general quadratic, even for relatively small n the b , S e  D1 trðSÞI b , and f ðcÞ  f ðcÞ are e  D1 trðSÞI approximations S 0 D 1 D n very accurate. Hence, the OBC is very closely approximated by and quickly tracks an NMC classifier with Ep ½c rather than n0 ∕n as the plug-in estimate of c, and, under random sampling or known c, both converge to the Bayes classifier. If we assume that n0 ¼ n1 and Ep ½c ¼ 0.5, then the OBC is linear, given b1  m b 0 and bOBC ¼  12 aTOBC ðb b 0 Þ. This is by Eq. 4.16 with aOBC ¼ m m1 þ m exactly the NMC classifier; thus, NMC classification is optimal in a Bayesian framework if we assume homoscedastic scaled identity covariances, the Jeffreys rule prior, n0 ¼ n1 , and Ep ½c ¼ 0.5.

188

Chapter 4

4.4.4 Mixed covariance models If the covariance is known for one class, say class 0, and diagonal for the other, the effective class-conditional distributions take the form ðn0 Þ 2

D

f U ðxj0Þ ¼

ðn0 þ 1Þ 2 ð2pÞ 2 jS0 j2   n0  T 1  ðx  m  exp  Þ S ðx  m Þ , 0 0 0 2ðn0 þ 1Þ D

D

1

   1 D G a1 þ 12 Y b1,i ðn1 þ 1Þ 2 f U ðxj1Þ ¼ 1 1 a1 n1 2 2 i¼1 Gða1 Þp ð2a1 Þ    ða þ1Þ 1 2 1 b1,i ðn1 þ 1Þ 1  2  1þ ðx  m Þ : i 1,i a1 n1 2a1

(4.59)

(4.60)

The discriminant of the OBC can be simplified to gOBC ðxÞ ¼

n0  ðx  m0 ÞT S1 0 ðx  m0 Þ þ1   D X n1  2 ðx  m1,i Þ þ K, ln 1 þ  ð2a1 þ 1Þ 2b1,i ðn1 þ 1Þ i i¼1 n0

(4.61)

where 0 B1  Ep ½c K ¼ 2 ln@ Ep ½c



ðn0 þ 1Þn1 n0 ðn1 þ 1Þ

D  2

3 1 2  1 G a þ 1 D 1 2 2 jS j 5 C 4 QD 0 A: Gða Þ b 1 i¼1 1,i

(4.62)

Now consider the case in which the covariance is known for only one class and is scaled identity/general for the other. The effective class-conditional distribution for the known class, say class 0, is Gaussian and again given by Eq. 4.59. In the other class, it is a multivariate t-distribution with k 1 ¼ ðk1 þ D þ 1ÞD  2 degrees of freedom, location vector m1 , and scale matrix C1 ¼ ½trðS1 Þðn1 þ 1Þ∕ðk 1 n1 ÞID in the scaled identity covariance model (by Eq. 2.103), or k 1 ¼ k1  D þ 1 degrees of freedom, location vector m1 , and scale matrix C1 ¼ ½ðn1 þ 1Þ∕ðk 1 n1 ÞS1 in the general covariance model (by Eq. 2.120). Hence, the effective class-conditional densities take the form

Optimal Bayesian Classification

189

ðn0 Þ 2

D

f U ðxj0Þ ¼

ðn0 þ 1Þ 2 ð2pÞ 2 jS0 j2   n0  T 1  ðx  m0 Þ S0 ðx  m0 Þ ,  exp  2ðn0 þ 1Þ D

D

1

 k1 þD  G k1 þD 2 2 1  T 1  : 1 þ ðx  m1 Þ C1 ðx  m1 Þ f U ðxj1Þ ¼  D D 1 k1 G k21 k 12 p 2 jC1 j2

(4.63)

(4.64)

The discriminant of the OBC can be simplified to gOBC ðxÞ ¼

n0  ðx  m0 ÞT S1 0 ðx  m0 Þ n0 þ 1   1  T 1   ðk 1 þ DÞ ln 1 þ ðx  m1 Þ C1 ðx  m1 Þ þ K, k1

(4.65)

where 0 K ¼ 2 ln@

1  Ep ½c Ep ½c



2ðn0 þ 1Þ n0 k 1

D  2

 1 1 G k1 þD 2 jS0 j 2  A: k jC1 j G 21

(4.66)

If the covariance is scaled identity/general for one class, say class 0, and diagonal for the other, where n0 , m0 , k 0 , C0 , n1 , m1 , a1 , and b1,i for i ¼ 1, : : : , D are defined appropriately, then the effective class-conditional distributions are  k0 þD  G k0 þD 2 2 1  T 1  f U ðxj0Þ ¼  D D , 1 þ ðx  m0 Þ C0 ðx  m0 Þ 1 k0 G k20 k 02 p 2 jC0 j2    1 D G a1 þ 12 Y b1,i ðn1 þ 1Þ 2 f U ðxj1Þ ¼ 1 1 a1 n1 2 2 i¼1 Gða1 Þp ð2a1 Þ    ða þ1Þ 1 2 1 b1,i ðn1 þ 1Þ 1  2  1þ ðx  m Þ : i 1,i a1 n1 2a1 The discriminant of the OBC can be simplified to

(4.67)

(4.68)

190

Chapter 4

 k þD 0 1  gOBC ðxÞ ¼ K 1 þ ðx  m0 ÞT C1 ðx  m Þ 0 0 k0  2a þ1 D Y 1 n1  2 ðxi  m1,i Þ  , 1þ  2b1,i ðn1 þ 1Þ i¼1

(4.69)

where  K¼

1  Ep ½c Ep ½c

2 

k 0 n1 2ðn1 þ 1Þ

D

2   32 k0 1 D G 2 G a1 þ 2 7 jC j QD 0 6 5: 4  D i¼1 b1,i G k0 þD ða Þ G 1 2

(4.70)

4.4.5 Average performance in the Gaussian model We next analyze the performance of OBCs via synthetic simulations with a proper prior over an uncertainty class of Gaussian distributions and c ¼ 0.5 known. Throughout, two proper priors are considered. The first assumes independent general covariances, with hyperparameters n0 ¼ 6D, n1 ¼ D, m0 ¼ 0D , m1 ¼ 0.5 ⋅ 1D , ky ¼ 3D, and Sy ∕ðky  D  1Þ ¼ 0.3ID . In this model, the Bayes classifier and OBC are both always quadratic. The second assumes a homoscedastic general covariance, with the same hyperparameters ðk ¼ ky and S ¼ Sy Þ. In the homoscedastic model, the Bayes classifier is linear while the OBC is quadratic. We will address performance with respect to sample size, Bayes error, and feature size. In both models, my has the form my ⋅ 1D for some scalar my , and Sy ∕ðky  D  1Þ has the form 0.3ID . These are the expected mean and covariance; the actual mean and covariance will not necessarily have this form. Examples of distributions, samples, and classifiers realized in these models are shown in Fig. 4.4. Level curves of the Gaussian 2 1.5 1

Bayes LDA QDA OBC

Bayes LDA QDA OBC

2

1

x2

x2

0.5 0 −0.5

0

−1

−1 −1.5 −2

−1

x1

(a)

0

1

−2

−1

0

x1

1

2

(b)

Figure 4.4 Examples of a distribution, sample, and classifiers from Bayesian models (c ¼ 0.5 known, D ¼ 2, n ¼ 60): (a) independent covariances; (b) homoscedastic covariances. [Reprinted from (Dalton and Dougherty, 2013a).]

Optimal Bayesian Classification

191

class-conditional distributions are shown as thin gray lines. These have been found by setting the Mahalanobis distance to 1. Points from class 0 are marked with circles and points from class 1 with x’s. The first series of experiments addresses performance under Gaussian models with respect to sample size. We follow the procedure in Fig. 2.5. Random distributions with D ¼ 2 features are drawn from the assumed prior, and for each we find the Bayes classifier and approximate the Bayes error by evaluating the proportion of 100,000 stratified testing points drawn from the true distribution that are misclassified. A stratified training set of a designated size n is then drawn independently from the corresponding class-conditional distribution (step 2A). Only even sample sizes are considered so that the proportion of points in class 0 is always exactly c. The sample is used to update the prior to a posterior (step 2B). From each sample, three classifiers are designed (without feature selection): LDA, QDA, and OBC (step 2C). The true error is found exactly for LDA from Eq. 1.32 and is found approximately for QDA and OBC by evaluating the proportion of the same 100,000 stratified testing points used to evaluate the Bayes error that are misclassified. For each covariance model and sample size, this entire process is repeated for T ¼ 10,000 random distributions and t ¼ 100 samples per distribution. Once this iteration is complete, the average true error and variance of the true error are found for the Bayes classifier and each trained classifier. Results are provided in Fig. 4.5. Experiments using independent covariance and homoscedastic covariance priors are shown in the top and bottom rows, respectively. The average true error and the variance of the true error for each classifier with respect to sample size are shown in the left and right columns, respectively. The optimality of OBCs is supported as we observe superior performance relative to LDA and QDA, that is, significantly lower expected true error. The variance of the true error is also significantly lower. Moreover, in all cases the performance of OBC appears to converge to that of the Bayes classifier as n → `. By combining prior modeling assumptions with observed data, these results show that significantly improved performance can be obtained. The next series of experiments addresses performance with respect to Bayes error for fixed sample size. The procedure is identical to that used in the first series, except that once the iteration is complete we partition the Bayes errors into ten uniform bins, which equivalently partitions the realized samples into bins. The average true error and variance of the true error for each classifier among all samples associated with each bin is found. Results are provided in Fig. 4.6 with the independent covariance model and n ¼ 30 shown in the top row and the homoscedastic covariance model and n ¼ 18 in the bottom row. The average difference between the true error and Bayes error, and the variance of the true error, for each classification rule with respect to the Bayes error, are shown in the left and right columns,

192

Chapter 4

0.3

0.25

0.2

0

20

40

60

80

20

40

60

80

(a)

(b) Bayes LDA QDA OBC

0.3 0.25 0.2

20

0.01

sample size

0.35

0

0.015

0.005 0

120

Bayes LDA QDA OBC

sample size

0.4

average true error

100

0.02

variance of true error

Bayes LDA QDA OBC

40

60

80

100

120

0.02

variance of true error

average true error

0.35

100

120

Bayes LDA QDA OBC

0.018 0.016 0.014 0.012 0.01 0

20

40

60

80

sample size

sample size

(c)

(d)

100

120

Figure 4.5 Performance of classifiers on Gaussian models with known proper priors versus sample size (c ¼ 0.5 known, D ¼ 2): (a) independent covariances, average true error; (b) independent covariances, true error variance; (c) homoscedastic covariances, average true error; (d) homoscedastic covariances, true error variance. [Reprinted from (Dalton and Dougherty, 2013a).]

respectively. Observe that the average deviation between the true and Bayes errors increases as the Bayes error increases, until it drops sharply for very high Bayes errors. By combining prior knowledge with observed data, significant improvement in performance over LDA and QDA is obtained over the whole range of Bayes errors. However, it is possible for a classifier to outperform OBC over some range of Bayes errors in these graphs (although LDA and QDA do not) because there is no guarantee that the average true error is minimized for any fixed distribution, or for any subset of the uncertainly class of distributions in the model (in this case, distributions having Bayes error in a given range). Optimality is only guaranteed when averaging over all distributions in the uncertainty class. Thus, OBC is guaranteed to be optimal in these graphs in the sense that the weighted average true error over the bins (weighted by the probabilities that a distribution should have Bayes error falling in that bin) is optimal. Hence, no classifier could outperform OBC over the entire range of Bayes errors.

Optimal Bayesian Classification

193 −3

0.08

LDA QDA OBC

0.06

variance of true error

average true error − Bayes error

4

0.04

0.02

0

0

0.1

0.2

0.3

0.4

x 10

Bayes LDA QDA OBC

3

2

1

0 0

0.5

0.1

Bayes error

(a)

0.2 0.3 Bayes error

0.4

0.5

0.4

0.5

(b) 3.5

0.07

LDA QDA OBC

0.06 0.05 0.04 0.03 0.02

Bayes LDA QDA OBC

2.5 2 1.5 1 0.5

0.01 0

x 10

3

variance of true error

average true error − Bayes error

−3

0

0.1

0.2

0.3

Bayes error

(c)

0.4

0.5

0

0

0.1

0.2

0.3

Bayes error

(d)

Figure 4.6 Performance of classifiers on Gaussian models with known proper priors and fixed sample size versus Bayes error (c ¼ 0.5 known, D ¼ 2): (a) independent covariance, n ¼ 30, average difference between true and Bayes error; (b) independent covariance, n ¼ 30, true error variance; (c) homoscedastic covariance, n ¼ 18, average difference between true and Bayes error; (d) homoscedastic covariance, n ¼ 18, true error variance. [Reprinted from (Dalton and Dougherty, 2013a).]

Finally, we consider an important beneficial property of OBCs, namely, that for any fixed sample size n, their average performance across the sampling distribution and the uncertainty class improves monotonically as the number of features increases. There is no peaking phenomenon. Indeed, since for a fixed sample an OBC and its expected error are equivalent to an optimal classifier and its error on an effective feature-label distribution, if one designs OBCs cdOBC on an increasing set of features and finds the expected true errors bεn ðS n , cdOBC Þ relative to the posterior distribution, then these expected εn ðS n , cdOBC Þ for true errors would be non-increasing, i.e., bεn ðS n , cdþ1 OBC Þ ≤ b d ¼ 1, 2, : : : , D  1. While OBCs are guaranteed to not peak relative to the Bayesian MMSE error estimator, we next present a synthetic example in which OBCs also do

194

Chapter 4

not peak relative to the true error. We assume the same two Gaussian models, this time with D ¼ 30 features. A few minor modifications of the experimental procedure are required. First, in the classification step, d ∈ f1, 2, : : : , Dg features are selected for training using a t-test. We find the Bayes classifier and train LDA, QDA, and OBC classifiers, all using only the selected features. LDA cannot be trained if the pooled covariance matrix is singular, and QDA may not be applied if either sample covariance is singular. OBC sets the prior on selected features to the marginal of the prior on all features. OBCs may only be applied if the posterior is proper, which is guaranteed with proper priors like the ones considered here. If a classifier cannot be evaluated for a given sample, then the sample is omitted in the analysis for this classifier (but not for all classifiers). Once all classifiers have been trained, the true error for each under the true distribution is found as before. This entire process is repeated over T ¼ 10,000 random distributions and a single sample per distribution. The average true error for each classifier is shown with respect to the selected feature size for n ¼ 18 in Fig. 4.7. Parts (a) and (b) show performance using the independent covariance and homoscedastic covariance models, respectively. Results are only included if the probability that the classifier can be trained is at least 90%. In general, QDA peaks first, followed by LDA, and OBC does not peak. For instance, in Fig. 4.7(a) for the independent covariance model, QDA peaks at around four features and LDA at around eight features. At only n ¼ 18 points with 9 in each class, QDA is not trainable for at least 10% of iterations with 7 or more features, and LDA for more than about 10 features. In contrast, OBC with a proper prior can be applied with the full 30 features, and its expected true error decreases monotonically up to and including 30 features, thereby facilitating much better performance.

Bayes LDA QDA OBC

average true error

0.3 0.25 0.2 0.15 0.1 0.05 0

0.35

Bayes LDA QDA OBC

0.3

average true error

0.35

0.25 0.2 0.15 0.1 0.05

0

5

10

15

20

feature set size

(a)

25

30

0

0

5

10

15

20

25

30

feature set size

(b)

Figure 4.7 Average true errors on Gaussian distributions from known proper priors with fixed c and n ¼ 18 versus feature size: (a) independent covariance; (b) homoscedastic covariance. [Reprinted from (Dalton and Dougherty, 2013a).]

Optimal Bayesian Classification

195

4.5 Transformations of the Feature Space Consider an invertible transformation t : X → X , mapping some original feature space X to a new space X (in the continuous case, assume that the inverse map is continuously differentiable). The next theorem shows that the OBC in the transformed space can be found by transforming the OBC in the original feature space pointwise, and that both classifiers have the same expected true error. The advantages of this fundamental property are at least twofold. First, the data can be losslessly preprocessed without affecting optimal classifier design or the expected true error, which is not true in general, for example, with LDA classification under a nonlinear transformation. Secondly, it is possible to solve or interpret optimal classification and error estimation problems by transforming to a more manageable space, similar to the “kernel trick” used to map features to a high-dimensional feature space having a meaningful linear classifier. We denote analogous constants and functions in the transformed space with an overline; for example, we write a point x in the transformed space as x. Theorem 4.2 (Dalton and Dougherty, 2013b). Consider a Bayesian model with posterior p ðuÞ in a feature space X that is either discrete or Euclidean. Suppose that cOBC is an OBC over all c ∈ C, where C is a family of classifiers (not necessarily all classifiers) with measurable decision regions. Moreover, suppose that the original feature space is transformed by an invertible mapping t and that in the continuous case t1 is continuously differentiable with an almost everywhere full rank Jacobian. Then the optimal classifier in the transformed space among the set of classifiers C ¼ fc : cðxÞ ¼ cðt1 ðxÞÞ for all x ∈ X and for some c ∈ Cg is cOBC ðxÞ ¼ cOBC ðt1 ðxÞÞ,

(4.71)

and both classifiers possess the same Bayesian MMSE error estimate Ep ½εn ðu, cOBC Þ. Proof. With JðxÞ being the Jacobian determinant of t1 evaluated at x, for a fixed class y ∈ f0, 1g, in the continuous case the class-conditional density parameterized by uy in the transformed space is f uy ðxjyÞ ¼ f uy ðt1 ðxÞjyÞjJðxÞj:

(4.72)

In the discrete case, f uy ðxjyÞ ¼ f uy ðt1 ðxÞjyÞ and, to unify the two cases, we say that jJðxÞj ¼ 1. Although each class-conditional density in the model uncertainty class will change with the transformation, each may still be indexed by the same parameter uy . Hence, the same prior and posterior may be used in both spaces. The effective class-conditional density is thus given by

196

Chapter 4

Z f U ðxjyÞ ¼

Uy

Z ¼

Uy

f uy ðxjyÞp ðuy Þduy (4.73)

f uy ðt1 ðxÞjyÞjJðxÞjp ðuy Þduy

¼ f U ðt1 ðxÞjyÞjJðxÞj: Let c ∈ C be an arbitrary fixed classifier given by cðxÞ ¼ Ix∈R1 , where R1 is a measurable set in the original feature space. Then cðxÞ ¼ Ix∈R1 , where

R1 ¼ ftðxÞ : x ∈ R1 g is the equivalent classifier in the transformed space, i.e., cðxÞ ¼ cðt1 ðxÞÞ. Noting that Ep ½c remains unchanged, by Theorem 2.1 the expected true error of c is given by Z Ep ½εn ðu, cÞ ¼ Ep ½c

R1

Z f U ðxj0Þdx þ ð1  Ep ½cÞ

Z

X \R1

f U ðxj1Þdx

f U ðt1 ðxÞj0ÞjJðxÞjdx Z f U ðt1 ðxÞj1ÞjJðxÞjdx þ ð1  Ep ½cÞ X \R1 Z Z ¼ Ep ½c f U ðxj0Þdx þ ð1  Ep ½cÞ f U ðxj1Þdx

¼ Ep ½c

R1

R1

X \R1

¼ Ep ½εn ðu, cÞ, (4.74) where the integrals in the second to last line have applied the substitution x ¼ t1 ðxÞ. If cOBC is an OBC in the original space and cOBC is the equivalent classifier in the transformed space, then cOBC also minimizes the expected true error and thus is an OBC in the transformed space. ▪

4.6 Convergence of the Optimal Bayesian Classifier As discussed in Section 2.7, we expect the posteriors of c, u0 , and u1 to converge in some sense to their true values, c ∈ ½0, 1, u0 ∈ U0 , and u1 ∈ U1 , respectively. This is true for c with random sampling, or if c is known. Posterior convergence also typically holds for uy as long as every neighborhood about the true parameter has nonzero prior probability and the number of sample points observed from class y tends to infinity. Convergence of the posterior, under typical conditions, leads to consistency in the Bayesian MMSE error estimator.

Optimal Bayesian Classification

197

In this section, we show that pointwise convergence holds for optimal Bayesian classification as long as the true distribution is contained in the parameterized family with mild conditions on the prior. To emphasize that the posteriors may be viewed as sequences, in the following theorem they are denoted by pn ðcÞ, pn ðu0 Þ, and pn ðu1 Þ. Similarly, the effective class-conditional distributions are denoted by f n ðxjyÞ for y ∈ f0, 1g. Throughout, we assume proper priors and that the priors assign positive mass on every open set. The latter assumption may be relaxed as long as the true distribution is included in the parameterized family of distributions and certain regularity conditions hold. Like Theorem 2.13, the proofs of Theorems 4.3 and 4.4 below treat the classes separately by decoupling the sampling procedure and the evaluation of the effective densities for class 0 and class 1. We assume weak convergence of the individual posteriors separately, with careful treatment of degenerate cases where c ¼ 0 or c ¼ 1. All that is required is to check certain regularity conditions on the class-conditional densities. In particular, Theorem 4.3 applies in the discrete model and Gaussian model with known covariance. Theorem 4.3 (Dalton and Dougherty, 2013b). Let c ∈ ½0, 1, u0 ∈ U0, and u1 ∈ U1. Let c and (u0, u1) be independent in their priors. Suppose that 1. Epn ½c → c as n → ` almost surely over the infinite labeled sampling distribution. 2. n0 → ` if c . 0 and n1 → ` if c , 1 almost surely over the infinite labeled sampling distribution. 3. For any fixed y ∈ f0, 1g, pn ðuy Þ is weak consistent if ny → `. 4. For any fixed x, y, and sample size n, the effective density f n ðxjyÞ is finite. 5. For fixed x and y, there exists M . 0 such that f uy ðxjyÞ , M for all uy . Then the OBC (over the space of all classifiers with measurable decision regions) constructed in Eq. 4.6 converges pointwise (almost surely) to a Bayes classifier.

Proof. Since we are interested in pointwise convergence, throughout this proof we fix x ∈ X . First assume that 0 , c , 1, so n0 , n1 → ` (almost surely). For fixed y ∈ f0, 1g and any sample of size ny , by definition Euy jS n ½ f uy ðxjyÞ equals the effective class-conditional density f n ðxjyÞ. Since the f uy ðxjyÞ are bounded across all uy for fixed x and y, by Eq. 2.153 f n ðxjyÞ converges pointwise to the true class-conditional density f uy ðxjyÞ (almost surely). Comparing the Bayes classifier

198

Chapter 4

 cBayes ðxÞ ¼

0 if cf u0 ðxj0Þ ≥ ð1  cÞ f u1 ðxj1Þ 1 otherwise

(4.75)

with the OBC given in Eq. 4.6 shows that the OBC converges pointwise (almost surely) to a Bayes classifier. If c ¼ 1, then f n ðxj0Þ → f u0 ðxj0Þ pointwise (almost surely), but f n ðxj1Þ may not converge to f u1 ðxj1Þ almost surely (under random sampling, the number of selected points from class 1 will almost surely be finite by the Borel–Cantelli lemma). For any sequence in which the number of points in class 1 is finite, let N be the sample size where the final point in class 1 is observed. Since f n ðxj1Þ is bounded after the last point, ð1  Epn ½cÞf n ðxj1Þ converges to ð1  cÞf N ðxj1Þ ¼ 0, and the OBC again converges pointwise ▪ (almost surely) to a Bayes classifier. The same argument holds if c ¼ 0. Theorem 4.3 does not apply to our Gaussian models with unknown covariances because, at any point x, there exists a class-conditional distribution that is arbitrarily large at x. Thus, even though there is weak consistency of the posteriors, we may not apply Eq. 2.153. The situation can be rectified by an additional constraint on the class-conditional distributions. A sequence of probability measures Pn on a measure space U with Borel R s-algebra B converges weak to a point mass du if and only if U f dPn → f ðuÞ for all bounded continuous functions f on U. A Borel measurable function f : U → R is almost uniformly (a.u.) integrable relative to fPn g`n¼0 if for all e . 0 there exists 0 , M e , ` and N e . 0 such that for all n . N e , Z jf ðuÞj≥M e

jf ðuÞjdPn ðuÞ , e:

(4.76)

To proceed, we use the main theorem provided in (Zapała, 2008), which R guarantees that if Pn converges weak to du , then U f dPn → f ðuÞ for any f that is a.u. integrable relative to fPn g`n¼0 . Theorem 4.4 (Dalton and Dougherty, 2013b). Suppose that the first four conditions in Theorem 4.3 hold and, in addition, 5. For fixed x and y, the class-conditional distributions considered in the Bayesian model are a.u. integrable over uy relative to the sequence of posteriors (almost surely). Then the OBC (over the space of all classifiers with measurable decision regions) constructed in Eq. 4.6 converges pointwise (almost surely) to a Bayes classifier.

Optimal Bayesian Classification

199

Proof. By the main theorem in (Zapała, 2008), Euy jS n ½ f uy ðxjyÞ → f uy ðxjyÞ (almost surely). Substituting this fact for Eq. 2.153, the proof is exactly the same as in Theorem 4.3. ▪ If the class-conditional distributions are bounded for a fixed x, then they are also a.u. integrable for any sequence of weak consistent measures. Thus, Theorem 4.4 is strictly stronger than Theorem 4.3. The next two lemmas show that the Gaussian class-conditional distributions in the scaled identity and general covariance models are a.u. integrable over uy relative to the sequence of posteriors (almost surely), and therefore the OBC converges to a Bayes classifier in both models. Lemma 4.3 (Dalton and Dougherty, 2013b). For fixed x and y, Gaussian classconditional densities in a scaled identity covariance model are a.u. integrable over uy relative to the sequence of posteriors (almost surely). Proof. We assume that the covariances are independent. The homoscedastic case is similar. Let 0 , K , ` and ny be a positive integer. We wish to bound Z j f uy ðxjyÞjpny ðuy Þduy : (4.77) LðKÞ ¼ D K 2 j f uy ðxjyÞj≥ð2p Þ

Since f uy ðxjyÞ ≤ ð2pÞD∕2 jSy j1∕2 ,   Z 1 1 ba 1 b K LðKÞ ≤ ⋅ exp  2 ds2y D 2 D2 GðaÞ ðs 2 Þaþ1 2 sy ð2pÞ ðs Þ 0 y y

  Z D 1 G a þ D2 baþ 2 1 b K ⋅

¼ exp  2 ds2y , D D D sy ð2pÞ 2 b 2 GðaÞ 0 G a þ D2 ðs2y Þaþ 2 þ1

(4.78)

where the hyperparameters a and b depend on ny . The integrand is the PDF for an inverse-gamma distribution. Thus,



G a þ D2 G a þ D2 , Kb

⋅ LðKÞ ≤ D D G a þ D2 ð2pÞ 2 b 2 GðaÞ (4.79)

G a þ D2 , Kb ¼ , D D ð2pÞ 2 b 2 GðaÞ Gða, bÞ being the upper incomplete gamma function. For a . 1, B . 1, and b . ða  1ÞB∕ðB  1Þ (Natalini and Palumbo, 2000), Gða, bÞ , Bba1 eb :

(4.80)

To apply this inequality here, note that a ¼ 0.5½ðny þ k þ D þ 1ÞD  2 is an increasing affine function of the sample size, so we can find a sample size N

200

Chapter 4

guaranteeing that a þ D∕2 . 1. Furthermore, by the strong law of large b → s 2 I (almost surely). Thus, there exists N . N such that for numbers, S y D y 1 all ny . N 1 , D b (4.81) Sy  s2y ID , s2y 1 4 almost surely. Likewise, there exists C ∈ R and N 2 . N 1 such that for all ny . N 2 , trðSÞ þ

ny ny ðb m  my ÞT ðb my  my Þ ≥ 2C ny þ ny y

(4.82)

almost surely. Thus, for ny large enough, 1 b ¼ trðS Þ 2 ny  1  b ≥Cþ tr Sy 2  i ny  1 h b  s2 I ¼Cþ trðs2y ID Þ þ tr S y D y 2 ny  1  2 b 2 Dsy  S ≥Cþ y  s y ID 1 2 ny  1 3D 2 ⋅ s : .Cþ 4 y 2

(4.83)

There exists N 0 . N 2 such that b . ðny  1ÞDs2y ∕4 for all ny . N 0 . To apply Eq. 4.80, we require some B . 1 such that b . ða þ D∕2  1ÞB∕½KðB  1Þ. Both bounds on b are equal when B equals Bðny , KÞ ¼

ðny  1ÞKDs2y : ðny  1ÞðKDs2y  2DÞ  2ðk þ D þ 3ÞD þ 8

(4.84)

Thus, this choice of B guarantees the desired bound on b for fixed ny > N 0 and K. Although this choice of B is always larger than 1 if ny > N3 ¼ (k þ D þ 4) þ 4/D, it only guarantees the desired bound for a specific pair of ny and K. Since B∕ðB  1Þ is a decreasing function for B . 1, any B larger than Eq. 4.84 for all ny and K in a given region also guarantees the desired bound for all ny and K in this region. It can be shown that there exist constants B > 1, N 00 > maxðN 0 , N 3 Þ, and K 00 > 0 such that for all ny . N 00 and K . K 00 , B . Bðny , KÞ, and this choice of B guarantees that Kb . ða þ D∕2  1ÞB∕ðB  1Þ for all ny > N 00 and K > K 00 . Hence, by Eqs. 4.79 and 4.80, BðKbÞaþ 2 1 eKb D

LðKÞ ≤

D

D

ð2pÞ 2 b 2 GðaÞ

(4.85)

Optimal Bayesian Classification

201

for ny . N 00 and K . K 00 . Rearranging terms a bit, for ny . N 00 and K . K 00 , ð2pÞ2 ðKbÞKb2 eKb GðKbÞ ⋅ LðKÞ ≤ ⋅ D 1 1 : þ GðKbÞ ðKbÞKba GðaÞb2 ð2pÞ 2 2 BK

D1 2

1

1

(4.86)

Since Kb . a  1, the second term converges uniformly to 1 as ny → ` for K . K 00 (Stirling’s formula). By a property of the G function, GðKbÞ ≤ ðKbÞKba GðaÞ; thus, the third term is bounded by b1∕2 , which is arbitrarily small for large enough ny by Eq. 4.83. It is straightforward to show that for any e . 0 there exists N e . N 00 and K 00 , K e , ` such that LðK e Þ , e for all n . N e. ▪ Lemma 4.4 (Dalton and Dougherty, 2013b). For fixed x and y, Gaussian classconditional densities in a general covariance model are a.u. integrable over uy relative to the sequence of posteriors (almost surely). Proof. We assume that the covariances are independent. The homoscedastic case is similar. Let 0 , K , ` and ny be a positive integer. Define Z LðKÞ ¼ jf uy ðxjyÞjpny ðuy Þduy : (4.87) D D jf uy ðxjyÞj≥K 2 ð2pÞ

2

Since f uy ðxjyÞ ≤ ð2pÞD∕2 jSy j1∕2 ,   Z k þDþ1 k jS j 2 jSy j 2 1  1 LðKÞ ≤

 etr  S Sy dSy D 1 k D 2 jSy j≤ 1D ð2pÞ 2 jSy j2 2 2 GD k 2 K

k þ1 Z GD ¼ D  1 2 k f IW ðSy ; S , k þ 1ÞdSy p 2 jS j2 GD 2 jSy j≤K1D

   G k 2þ1 1 ¼ D  1 k Dþ1 Pr jSy j ≤ D , K p 2 jS j2 G 2

(4.88)

where f IW is an inverse-Wishart distribution, and in the last line we view Sy as an inverse-Wishart distributed random matrix with parameters S and k þ 1 (we suppress the subscript y in the hyperparameters). b → S and m b y → my (almost By the strong law of large numbers, S y y  surely). Therefore, S ∕ðny  1Þ → Sy (almost surely). It can be shown that for any sequence of symmetric matrices An that converges to a symmetric positive definite matrix, there exists some N . 0 such that An is symmetric positive definite for all n . N. Applied here, there exists N . 0 such that S is symmetric positive definite for all ny . N (almost surely). Let S ¼ LLH be the Cholesky decomposition of S , where H is the conjugate transpose. Then

202

Chapter 4

for ny . N, X ¼ LH S1 y L is a Wishart random matrix with parameters LH ðS Þ1 L ¼ ID and k þ 1, and

 G k 2þ1 LðKÞ ≤ D  1 k Dþ1 PrðjXj ≥ K D jS jÞ: (4.89) p 2 jS j2 G 2 Further, jXj ≥ K D jS j implies that at least one of the diagonal elements in X must be greater than or equal to KjS j1∕D . Hence, 



LðKÞ ≤

Gðk 2þ1Þ 

Pr

 1 1 X 1 ≥ KjS jD ∪ · · · ∪ X D ≥ KjS jD

p 2 jS j2 Gðk Dþ1 Þ 2

k þ1 D  X G 1 ≤ D  1 2k Dþ1 Pr X i ≥ KjS jD , p 2 jS j2 G i¼1 2 D

1

(4.90) where X i is the ith diagonal element of X. The marginal distributions of the diagonal elements of Wishart random matrices are well known, and in this case X i  chi-squaredðk þ 1Þ. Thus,

  G k 2þ1 D 1 LðKÞ ≤ D  1 k Dþ1 Pr X i ≥ KjS jD p 2 jS j2 G 2 (4.91)

k þ1 K  1 DG 2 , 2 jS jD ≤ D  1 k Dþ1 : p 2 jS j2 G 2 We use the same bound in Eq. 4.80 to bound the equation above. Recall that k ¼ ny þ k in the independent covariance model, so it is a simple matter to find N 0 . N such that ðk þ 1Þ∕2 . 1 for all ny . N 0 . By Minkowski’s determinant inequality, for ny large enough,

1 1 b D1 ¼ ðn  1Þ S b D : jS jD ≥ ðny  1ÞS y y y

(4.92)

b j is a strongly consistent estimator of jS j, where S is the true value Since jS y y y 00 0 b of the covariance, there exists N . N such that jS j ≥ jS j∕2 (almost surely). y

y

Hence, for ny large enough, jS jD ≥ 1

ny  1 2

1 D

1

jSy jD :

(4.93)

The same procedure used in the previous lemma may be used to find constants B . 1, N 000 . N 00 , and K 000 . 0 such that for all ny . N 000 and K . K 000 we have 0.5KjS j1∕D . 0.5ðk  1ÞB∕ðB  1Þ, and thus

Optimal Bayesian Classification



LðKÞ ≤

203

k 1 1 1 K  jS jD 2 e 2 jS jD  D 1 p 2 jS j2 G k Dþ1 2

DB

K 2

K  D1 1 1  D1 2 jS j 2 K2 jS jD jS j e BDK  ¼ D Dþ1 ⋅ 1 22 p 2 G K2 jS jD  1 G K2 jS jD  : K  D1 k Dþ1   1 K k Dþ1  D1 2 jS j  2  2D G jS j 2 jS j 2 D1 2

1

ð2pÞ2



K 2

(4.94)

To simplify the third fraction, note that GðbÞ ≤ bba GðaÞ, and thus for ny . N 000 and K . K 000 ,

LðKÞ ≤

BDK D

22 p

D1 2

Dþ1 2

1



ð2pÞ2



K 2

K jS jD1 1 1 1 2 K jS jD jS jD 2 e 2 1  jS j2D : 1 G K2 jS jD

(4.95)

Since 0.5KjS j1∕D . 0.5ðk  1Þ, the second fraction converges uniformly to 1 as ny → ` over all K . K 000 by Stirling’s formula. The third term may be made ▪ arbitrarily small for large enough ny by Eq. 4.93.

4.7 Robustness in the Gaussian Model We next consider the important issue of robustness to false modeling assumptions, with emphasis on priors making false assumptions and possessing varying degrees of information. 4.7.1 Falsely assuming homoscedastic covariances To observe the effects of falsely assuming a homoscedastic model, assume that c ¼ 0.5 and that the underlying distributions are fixed Gaussian distributions with D ¼ 2 features, means m0 ¼ 02 and m1 ¼ 12 , and covariances  1 r , S0 ¼ 0.5 r 1 

 1 r S1 ¼ 0.5 : r 1

(4.96)



(4.97)

204

Chapter 4

With sample size n ¼ 18 we generate a stratified sample (9 points in each class) and design LDA, QDA, and several OBCs, each using different homoscedastic covariance priors of the form n0 ¼ n1 ¼ jD, m0 ¼ m0 ¼ 02 , m1 ¼ m1 ¼ 12 , k ¼ 2jD, and S ¼ 0.5ðk  3ÞI2 , where j ¼ 0, 1, : : : , 18. All priors with j . 0 are proper, and each contains a different amount of information, a larger j being more informative. These priors essentially assume correct means and variances of the features, but when r ≠ 0 they incorrectly assume homoscedastic covariances. For each r and each sample, the true errors from the known distributions are found exactly for linear classifiers and using 100,000 test points otherwise. The entire process is repeated for t ¼ 10,000 samples. Once this iteration is complete, we evaluate the average true error, which is illustrated in Fig. 4.8. Note that QDA cannot be trained when r ¼ 1, where the sample covariance in both classes is always singular. The OBCs with lowinformation priors ðj ¼ 0 and j ¼ 1Þ perform close to LDA. Even though they make the false assumption that the covariances are equal, their performance is still better than QDA, which does not assume that the covariances are equal, and which is a good approximation of the OBC with independent covariances and non-informative priors, up to the correlation r  0.5. This indicates that the homoscedastic OBC is somewhat robust to unequal covariances. Furthermore, as information is added to the prior ðj ¼ 2 and 18Þ the performance of the OBC actually improves over the whole range of r. The performance of the OBC with the prior indexed by j ¼ 18 is remarkable, nearly reaching that of the Bayes classifier when r ¼ 0, and beating QDA for correlations up to r  0.8, even though it incorrectly assumes homoscedastic covariances.

0.25

average true error

0.2 j=0

0.15 j = 18

j=1 j=2

0.1 Bayes LDA QDA OBC

0.05

0

0

0.2

0.4

0.6

0.8

1

Figure 4.8 Performance of classification on fixed heteroscedastic Gaussian distributions over a range of correlation r in class 0 and r in class 1. OBCs are shown with increasing information priors falsely assuming homoscedastic covariances with an expected correlation of zero (n ¼ 18, c ¼ 0.5, D ¼ 2). [Reprinted from (Dalton and Dougherty, 2013b).]

Optimal Bayesian Classification

205

4.7.2 Falsely assuming the variance of the features We now consider a Gaussian model parameterized by a scaling factor in the covariance (thereby controlling the Bayes error) and apply a prior tuned to the case where the Bayes error is 0.25 with varying degrees of confidence. We assume that c ¼ 0.5 is known and that the class-conditional distributions are fixed Gaussian distributions with D ¼ 2 features, means m0 ¼ 02 and m1 ¼ 12 , and covariances of the form S0 ¼ S1 ¼ s 2 I2 . The parameter s2 is chosen between about 0.18 and 26.0, corresponding to a Bayes error between 0.05 and 0.45. With sample size n ¼ 18, we generate a stratified sample and design LDA, QDA, and several OBCs, each using different priors with increasing information. All priors correctly assume homoscedastic covariances, with hyperparameters n0 ¼ n1 ¼ 2j, m0 ¼ m0 ¼ 02 , m1 ¼ m1 ¼ 12 , k ¼ 4j, and S ¼ 1.1231ðk  3ÞI2 , where j ¼ 0, 1, : : : , 18. Each prior contains a different amount of information, a larger j being more informative. All priors have correct information about the means but always assume that E½s2  ¼ 1.1231, corresponding to a Bayes error of 0.25, whereas in actuality the true value of s2 varies. An OBC might not be trainable when j ¼ 0 since with k ¼ 0 we have an improper prior and it is possible that the posterior is also improper. The true errors are found under the known distributions: exactly for linear classifiers and approximately with 100,000 test points for nonlinear classifiers. For each s2 , this entire process is repeated for t ¼ 10,000 samples. The average true error is shown in Fig. 4.9. We only report results for the OBC with j ¼ 0 if the probability that the classifier cannot be trained is less than 10%, and the average true error is evaluated only over samples for which

average true error − Bayes error

0.07 LDA QDA OBC

0.06 0.05 0.04

j=0 j=1 j=2 j=3

0.03 0.02

j=6

0.01 0

j = 12 j = 18

0

0.1

0.2 0.3 Bayes error

0.4

0.5

Figure 4.9 Performance of classification on fixed homoscedastic scaled identity covariance Gaussian distributions over a range of values for a common variance s 2 between all features and both classes (controlling Bayes error). OBCs are shown with increasing information priors correctly assuming homoscedastic covariances but with an expected variance corresponding to a Bayes error of 0.25 (n ¼ 18, c ¼ 0.5, D ¼ 2). [Reprinted from (Dalton and Dougherty, 2013b).]

206

Chapter 4

the classifier exists. OBCs are shown in black lines for j ¼ 0, 1, 2, 3, 6, 12, and 18. Even though information is increased in the prior using the wrong covariance, performance still improves, outperforms LDA and QDA, and even appears to converge to the Bayes classifier. We conjecture that this occurs because “classification is easier than density estimation.” To wit, in this case, changing the covariance of both classes at the same time tends not to change the optimal classifier very much. 4.7.3 Falsely assuming the mean of a class We next consider the effect of priors containing misinformation about the mean of a class, in the sense that the mean of one of the priors does not equal the mean of the corresponding class-conditional distribution. We assume known c ¼ 0.5 and that the class-conditional distributions are fixed Gaussian distributions with D ¼ 2 features, means m0 ¼ 02 and m1 ¼ m ⋅ 12 , and covariances S0 ¼ S1 ¼ 1.1231 ⋅ I2 . The parameter m takes on one of nine values, where mj corresponds to a Bayes error of 0.05j. For example, m1 ¼ 2.45, m5 ¼ 1.00, and m9 ¼ 0.19, giving a range of Bayes errors between 0.05 and 0.45. The scaling factor 1.1231 in the covariances has been calibrated to give a Bayes error of 0.25 when m ¼ 1. With sample size n ¼ 18 we generate a stratified sample, and from each sample design an LDA, a QDA, and several OBCs using different priors with increasing information. All priors correctly assume homoscedastic covariances and apply hyperparameters n0 ¼ n1 ¼ 2j, m0 ¼ 02 , m1 ¼ 12 , k ¼ 4j, and S ¼ 1.1231ðk  3ÞI2 , where j ¼ 0, 1, : : : , 18, a larger j being more informative. For each prior, the mean of m0 and the covariances are in fact the true values, but the mean of m1 is incorrectly assumed to be 12 , corresponding to a Bayes error of 0.25. The true errors are found under the known distributions: exactly for linear classifiers and approximately with 100,000 test points for nonlinear classifiers. For each mj , this entire process is repeated for t ¼ 10,000 samples. The average true error is shown in Fig. 4.10. OBCs are shown in black lines for j ¼ 0, 1, 2, 3, 6, 12, and 18. For a Bayes error of 0.25, the performance of the OBC improves as we increase the amount of correct information in the prior, with performance appearing to converge to that of the Bayes classifier. That being said, performance on the left and right sides of this graph exhibits quite different behavior. On the right edge, the actual Bayes error is higher than 0.25 for the means corresponding to each prior, a situation in which the means are closer together than suggested by the priors. In this range, adding information to the prior appears to improve performance, which is far better than LDA or QDA, even with low-information priors. On the left edge, where the actual Bayes error is lower than that corresponding to the means assumed in each prior, we see performance degrade as misinformation is added to the priors, although as we observe more data these kinds of bad priors will

Optimal Bayesian Classification

207

average true error − Bayes error

0.07 LDA QDA OBC

0.06 0.05 j=18 j=12

0.04

j=0

0.03

j=6

0.02

j=1

0.01

j=2 j=3

0

0

0.1

0.2 0.3 Bayes error

0.4

0.5

Figure 4.10 Performance of classification on fixed homoscedastic scaled identity covariance Gaussian distributions over a range of class-1 means (controlling Bayes error). OBCs are shown with increasing information priors correctly assuming homoscedastic covariances but with an expected class-1 mean corresponding to a Bayes error of 0.25 (n ¼ 18, c ¼ 0.5, D ¼ 2). [Reprinted from (Dalton and Dougherty, 2013b).]

eventually be overcome. At least in this example, the OBC is quite robust when the assumed Bayes error is lower than reality, that is, when it assumes that the means of the classes are farther apart than they should be, but sensitive to cases where the assumed Bayes error is higher than in reality. We next perform a similar experiment, but this time with a family of OBCs that assume priors with the same amount of information but targeting different values for the means. The setup is the same as in the previous experiment, except that we examine priors that correctly assume homoscedastic covariances, with n0 ¼ n1 ¼ 18, m0 ¼ m0 ¼ 02 , m1 ¼ mj ⋅ 12 , k ¼ 36, and S ¼ 1.1231ðk  3ÞI2 , where j ¼ 1, : : : , 9. Each prior now contains the same moderate amount of information and has correct information about m0 and the covariances, but assumes that m ¼ mj or, equivalently, that the Bayes error is 0.05j, whereas in actuality the true value of m can be mi for some i ≠ j (we use i to index the true mean and j to index the expected mean in the prior). The average true error for each classifier for different values of mi , corresponding to different Bayes errors, is provided in Fig. 4.11. OBCs are shown in black lines for j ¼ 1, 2, 3, 4, 5, 6, and 9. For each prior, indexed by j, as expected, the best performance is around 0.05j; however, at a fixed Bayes error of 0.05i, the best prior may not necessarily correspond to j ¼ i; that is, the best prior may not necessarily be the one that uses the exact mean for class 1 in place of the hyperparameter m1 . For instance, on the far right at Bayes error 0.45, corresponding to i ¼ 9, the OBC with prior j ¼ 9 decreases monotonically in the range shown until it reaches a point where it performs about 0.01 worse than the Bayes classifier. At the same time, the OBC corresponding to j ¼ 6 (for Bayes error 0.3) actually performs better than all other classifiers shown over the range of Bayes error between 0.30 and 0.45,

208

Chapter 4

average true error − Bayes error

0.1 LDA QDA OBC

j=9

0.08 0.06 j=6

0.04 j = 5 j=4

0.02 0

j=3 j=2 j=1

0

0.1

0.2 0.3 Bayes error

0.4

0.5

Figure 4.11 Performance of classification on fixed homoscedastic scaled identity covariance Gaussian distributions over a range of class-1 means (controlling Bayes error). OBCs are shown with priors correctly assuming homoscedastic covariances and a range of expected class-1 means (n ¼ 18, c ¼ 0.5, D ¼ 2). [Reprinted from (Dalton and Dougherty, 2013b).]

where it performs only about 0.005 above the Bayes classifier. This may, at first, appear to contradict the theory, but recall that the Bayesian framework does not guarantee optimal performance for any fixed distribution, even if it is the expected distribution in the prior. It is only optimal when averaged over all distributions. 4.7.4 Falsely assuming Gaussianity under Johnson distributions Utilizing Johnson distributions, we now address the robustness of OBCs using Gaussian models when applied to non-Gaussian distributions. The simulations assume that c ¼ 0.5 and that the class-conditional distributions contain D ¼ 2 features with means m0 ¼ 02 and m1 ¼ 12 , and covariances S0 ¼ 0.5 ⋅ I2 and S1 ¼ 1.5 ⋅ I2 . For class 0, the features are i.i.d. Johnson distributions with skewness between 3 and 3 in steps of 0.5, and kurtosis between 0 and 10 in steps of 0.5. For a fixed class-0 skewness and kurtosis pair, we first determine if the pair is impossible or the type of Johnson distribution from Fig. 2.13, and then find values of g 0 and d0 producing the desired skewness and kurtosis. With these known we may find values of h0 and l0 producing the desired mean and variance. Both features in class 0 are thus modeled by a Johnson distribution with parameters g 0 , d0 , h0 , and l0 . For class 1, essentially the parameters are set such that both classes are skewed either away from (less overlapping mass) or toward (more overlapping mass) each other. Features are again i.i.d., having the same kurtosis but opposite (negative) skewness relative to those in class 0, that is, g 1 ¼ g 0 and d1 ¼ d0 . The type of Johnson distribution used is always the same as that used in class 0, and the values of h1 and l1 producing the desired mean and variance for class 1 are then calculated using the same method as for class 0.

Optimal Bayesian Classification

209

The sample has n ¼ 18 points and is stratified with 9 points in each class. LDA and QDA classifiers are trained, along with two OBCs, each using independent general covariance priors. The first prior is non-informative, that is, n0 ¼ n1 ¼ 0, k0 ¼ k1 ¼ 0, and S is an all-zero matrix. The second is a proper prior with hyperparameters n0 ¼ n1 ¼ k0 ¼ k1 ¼ 6, m0 ¼ m0 ¼ 02 , m1 ¼ m1 ¼ 12 , and Sy ¼ ðky  3ÞSy . In a few cases with extreme skewness and kurtosis, LDA, QDA, and the OBC with non-informative priors may not be trainable. In such cases the current sample is ignored in the analysis for the untrainable classifiers (but not for all classifiers). OBCs with proper priors are always trainable. True errors are found under the appropriate given Johnson distributions using 100,000 test points. For each skewness and kurtosis pair, this entire process is repeated for t ¼ 10,000 samples to find the average true error for each classifier. As a baseline, for the Gaussian case (skewness 0 and kurtosis 3), the average true classifier errors are 0.249 for LDA, 0.243 for QDA, 0.240 for OBCs with non-informative priors, and 0.211 for OBCs with informative priors. Even with non-informative priors, the OBC beats LDA and QDA, although this is not guaranteed in all settings. As expected for the Gaussian case, the performance of OBCs improves significantly by adding a small amount of information to the priors. Simulation results over the skewness/kurtosis plane are shown in Fig. 4.12. Each subplot shows performance for LDA, QDA, and one of the OBCs over essentially the same skewness/kurtosis plane shown in Fig. 2.13, but now having two sides to distinguish between positive and negative skewness in class 0 (distributions have more or less overlapping mass, respectively). The log-normal line and the boundary for the impossible region are also shown. Each marker represents a specific Johnson distribution with the corresponding skewness and kurtosis. A white circle, white square, or black circle at a point means that LDA, QDA, or OBC has the lowest expected true error among the three classifiers, respectively. Similar graphs may be used to design a 2

OBC is best QDA is best LDA is best

2

impossible region

impossible region

4

kurtosis

kurtosis

4

OBC is best QDA is best LDA is best

6 8

6 8

10

SU

−5

0 2

10

SB 2

−skewness

skewness

(a)

SU

−5

5

0 2

SB

5

skewness2

−skewness

(b)

Figure 4.12 Performance under Johnson distributions (c ¼ 0.5, D ¼ 2, n ¼ 18): (a) noninformative prior; (b) informative prior. [Reprinted from (Dalton and Dougherty, 2013b).]

210

Chapter 4

hypothesis test to determine a region where it is “safe” to make Gaussian modeling assumptions. Performance in Fig. 4.12 depends on the simulation settings; for instance, in some cases the OBC does not outperform LDA or QDA, even on Gaussian distributions. The OBC with non-informative priors appears to be very robust to kurtosis given that most points near zero skewness are represented by black dots in Fig. 4.12(a). There is some robustness to skewness, although the left half of the plane (distributions having less overlap) is dominated by LDA and the right half (distributions having more overlap) is mostly covered by QDA. Although it is not immediately evident in the figure, the OBC with this noninformative prior actually has performance very close to that of QDA, which is consistent with our previous theoretical and empirical findings. So in this scenario, LDA tends to perform better when distributions have less overlap, QDA tends to perform better when they have more overlap, and the OBC with non-informative priors has best performance for very small skewness and tracks closely with QDA throughout, so it also has very good performance when the distributions have more overlap. Perhaps more interestingly, as a small amount of information is added to the priors, even though they assume a Gaussian model, the region where the OBC outperforms both LDA and QDA does not shrink but actually expands.

4.8 Intrinsically Bayesian Robust Classifiers An IBR classifier is defined similarly to an OBC, except that p is replaced by p in Eq. 4.4: cIBR ¼ arg min Ep ½εðu, cÞ: c∈C

(4.98)

A model-constrained Bayesian robust (MCBR) classifier cMCBR is defined in the same manner as an IBR classifier, except that C is replaced by CU , which is the family of classifiers in C that are optimal for some u ∈ U. In the absence of data the OBC reduces to the IBR classifier relative to the prior, and in the presence of data the OBC is the IBR classifier relative to the posterior distribution. Whether uncertainty is relative to the prior or posterior distribution, Eu ½εðu, cÞ is the Bayesian MMSE error estimator. The effective densities address the problem of finding an IBR classifier: simply apply the effective densities with p instead of p . In particular, when the prior and posterior possess the same form (as in the Gaussian model), we obtain an IBR classifier by simply replacing the posterior distribution by the prior distribution in our solutions. Given that an IBR classifier can be obtained from the effective densities relative to the prior, we are in a position to compare the MCBR and IBR classifiers. In particular, are there cases when the two are equivalent and,

Optimal Bayesian Classification

211

when they are not, what is the difference between them? First, let us highlight some cases in which they are the same. In a Gaussian model with independent scaled identity or general covariances, the set of model-specific optimal classifiers is the set of all quadratic classifiers. If k 0 ¼ k 1 , then the IBR classifier is quadratic as in Eq. 4.11 and determined by Eqs. 4.30 through 4.32, for instance, in the general covariance model with k0 ¼ k1 in the prior. If we assume known unequal covariances between the classes, then the set of state-specific optimal classifiers is the set of all quadratic classifiers with discriminant of the form 1 gðxÞ ¼ xT Ax þ aT x þ b, with A ∝ S1 1  S0 , and the IBR classifier for this model has the same form if n0 ¼ n1 ; if n0 ≠ n1 , then the IBR classifier is still quadratic, but it is generally not among the set of optimal classifiers for all states because the matrix AOBC in the discriminant will typically not have the same form as A. With homoscedastic scaled identity or general covariances, the set of optimal classifiers is the set of all linear classifiers, and the IBR classifier is also linear if n0 ¼ n1 and Ep ½c ¼ 0.5. Finally, if we assume known equal covariances between the classes, then the set of optimal classifiers is also the set of all linear classifiers, and the IBR classifier is linear if n0 ¼ n1 . While in the preceding cases the MCBR and IBR classifiers are equivalent, in most other cases this is not true. We compare performance in a case where the IBR classifier is not in the family of state-optimal classifiers and actually performs strictly better than the MCBR classifier. Consider a synthetic Gaussian model with D ¼ 2 features, independent general covariances, and a proper prior defined by known c ¼ 0.5 and hyperparameters n0 ¼ k0 ¼ 20D, m0 ¼ 0D , n1 ¼ k1 ¼ 2D, m1 ¼ 1D , and Sy ¼ ðky  D  1ÞID . Since k0 ≠ k1 , we have k 0 ≠ k 1 and the IBR classifier is polynomial but generally not quadratic. Based on our previous analysis of this model, it is given by cIBR ðxÞ ¼ 0 if gIBR ðxÞ ≤ 0 and cIBR ðxÞ ¼ 1 if gIBR ðxÞ . 0, where  k þD 0 1 T 1 gIBR ðxÞ ¼ K 1 þ ðx  m0 Þ C0 ðx  m0 Þ k0 (4.99)  k þD 1 1 T 1  1 þ ðx  m1 Þ C1 ðx  m1 Þ , k1 k y ¼ ky  D þ 1, Cy ¼ ½ðny þ 1Þ∕ðk y ny ÞSy , and 2   32 k0 k 1 þD  2  D G G 2 1c k0 jC0 j 6 2  7 K¼ 4  5: k1 c jC1 j G k0 þD G k1 2 2

(4.100)

On the other hand, the Bayes classifier for any particular state is quadratic. Using Monte Carlo methods we find the corresponding MCBR classifier

212

Chapter 4 3 2

x2

1 0 −1 −2 −2

plug-in MCBR IBR −1

0

1

2

3

x1

Figure 4.13 Classifiers for an independent general covariance Gaussian model with D ¼ 2 features and proper priors having k0 ≠ k1 . The intrinsically Bayesian robust classifier is polynomial with average true error 0.2007, whereas the state-constrained Bayesian robust classifier is quadratic with average true error 0.2061. [Reprinted from (Dalton and Dougherty, 2013b).]

cMCBR . We also consider a plug-in classifier cplug-in using the expected value of each parameter, which is the Bayes classifier assuming that c ¼ 0.5, m0 ¼ m0 , m1 ¼ m1 , and S0 ¼ S1 ¼ ID . This plug-in classifier is linear. The average true errors are Eu ½εðu, cplug-in Þ ¼ 0.2078, Eu ½εðu, cMCBR Þ ¼ 0.2061, and Eu ½εðu, cIBR Þ ¼ 0.2007. The IBR classifier is superior to the MCBR classifier. Figure 4.13 shows cIBR , cMCBR , and cplug-in . Level curves for the distributions corresponding to the expected parameters, which have been found by setting the Mahalanobis distance to 1, are shown in thin gray lines. cIBR is the only classifier that is not quadratic, and it is quite distinct from cMCBR .

4.9 Missing Values Missing values are commonplace in many applications, especially when feature values result from complex technology. Numerous methods have been developed to impute missing values in the framework of standard classification rules. OBCs are naturally suited for handling missing values. Once the mechanism of missing data is modeled using an uncertainty class of distributions, the effective class-conditional density can be adjusted to reflect this mechanism. Analytic solution may be possible, such as for a Gaussian model with independent features; however, typically, Markov chain Monte Carlo (MCMC) methods must be utilized. Throughout this section, following (Dadaneh et al., 2018a), we consider Gaussian feature-label distributions. Although the OBC framework possesses the capacity to incorporate the probabilistic modeling of arbitrary types of missing data, we assume that data are missing completely at random (MCAR) (Little and Rubin, 2014). In this

Optimal Bayesian Classification

213

scenario, the parameters of the missingness mechanism are independent of other model parameters. Thus, they vanish when taking the expectation in the calculation of the effective class-conditional density so we can ignore parameters related to the missing-data mechanism. For each class y ∈ f0, 1g, we assume a Gaussian distribution with parameters uy ¼ ðmy , ly Þ, where my is the mean of the class-conditional distribution, and ly is a collection of parameters that determines the covariance matrix Sy of the class y. Except when needed, we drop the class index y. We first consider a general covariance model. Let class y be fixed. Assuming n independent sample points x1 , x2 , : : : , xn under class y, each possessing D-dimensional Gaussian distribution N ðm, SÞ, partition the observations into G ≤ n groups, where all ng points in group g ∈ f1, : : : , Gg have the same set J g of observed features, with cardinality jJ g j ¼ Dg . Let I g denote the set of sample point indices in group g, and represent the pattern of missing data in group g by a Dg  D matrix Mg , where each row is a D-dimensional vector with a single nonzero element with value 1 corresponding to the index of an observed feature. The nonmissing portion of sample point xi in group g is Mg xi and has Gaussian distribution N ðMg m, Mg SMTg Þ. We define the group-g statistics by mg ¼

1X M x, ng i∈I g i

(4.101)

g

Sg ¼

X ðMg xi  mg ÞðMg xi  mg ÞT ,

(4.102)

i∈I g

where mg and Sg are the sample mean and scatter matrix, respectively, when employing only the observed data in group g. Let X be the portion of the full dataset ½x1 , x2 , : : : , xn  that is not missing. Given y and the corresponding matrices Mg , and assuming that the sample points are independent, we can write the data likelihood as   1 1 f ðXjm, SÞ ∝ jSg j etr  Sg Sg 2 g¼1   G 1X T 1 n ðm  Mg mÞ Sg ðmg  Mg mÞ ,  exp  2 g¼1 g g G Y

n

 2g

(4.103) where Sg ¼ Mg SMTg is the covariance matrix corresponding to group g. Suppose that we use Gaussian and inverse-Wishart priors for m and S:

214

Chapter 4

m  N ðm, S∕nÞ and S  inverse-WishartðS, kÞ, where n . 0, k . D  1, S is a symmetric positive definite matrix, and the PDF of the inverse-Wishart prior is proportional to Eq. 2.59. This is generally not a conjugate prior, and MCMC methods should be used to proceed with the analysis, as we will discuss shortly. Now consider independent features, where we can derive a closed-form representation for the OBC with missing values. Let class y be fixed. Let I k be the set of sample point indices under class y for which feature k is not missing and let jI k j ¼ l k . In a Gaussian model with a diagonal covariance matrix, feature k ∈ f1, : : : , Dg is distributed as N ðmk , s2k Þ. Hence, the likelihood can be expressed as f ðXjm1 , : : : , mD , s21 , : : : , s2D Þ    D Y X l 1 2  2k 2 2 , ∝ ðsk Þ exp  2 l k ðmk  xk Þ þ ðxki  xk Þ 2sk i∈I k k¼1

(4.104)

where xk is the sample mean of feature k including only nonmissing values, and xki is the value of the kth feature in sample point i in I k . We shall use a conjugate normal-inverse-gamma prior, where the parameters in class 0 are independent from those of class 1, the ðmk , s2k Þ pairs are independent, and mk js2k  N ðmk , s 2k ∕nk Þ, s2k  inverse-gammaðrk , ck Þ:

(4.105)

Theorem 4.5 (Dadaneh et al., 2018a). In the diagonal covariance model, assuming a normal-inverse-gamma prior for feature k ∈ f1, : : : , Dg with nk . 0, rk . 0, and ck . 0, the posterior distributions are given by   s 2k 2 , (4.106) mk jX, s k  N hk , l k þ nk   lk 2 sk jX  inverse-gamma rk þ , ck þ bk , (4.107) 2 where l k xk þ nk mk , l k þ nk   1 X 1 l k nk 2 ðxki  xk Þ þ ðx  mk Þ2 : bk ¼ 2 i∈I 2 l k þ nk k hk ¼

k

(4.108) (4.109)

Optimal Bayesian Classification

215

Proof. The PDF of an inverse-gammaðr, cÞ random variable is given by   c r1 f ðxÞ ∝ x (4.110) exp  : x Therefore, the normal-inverse-gamma prior in Eq. 4.105 can be expressed as   n ðm  mk Þ2 þ 2ck 3 pðmk , s2k Þ ∝ ðs 2k Þrk 2 exp  k k : 2s2k

(4.111)

Combining this prior with the likelihood in Eq. 4.104 yields the posterior density: l k 3

p ðmk , s2k Þ ∝ ðs2k Þrk  2 P   nk ðmk  mk Þ2 þ 2ck þ l k ðmk  xk Þ2 þ i∈I k ðxki  xk Þ2  exp  : 2s2k (4.112) To calculate mk jX, s 2k , complete the square on nk ðmk  mk Þ2 þ l k ðmk  xk Þ2 to obtain (omitting the constant term)   l k xk þ nk mk 2 ðnk þ l k Þ mk  l k þ nk

(4.113)

(4.114)

and hence the Gaussian distribution in Eq. 4.106. The posterior of s2k in Eq. 4.107 is obtained by marginalizing out mk in Eq. 4.112. ▪ Theorem 4.6 (Dadaneh et al., 2018a). Under the conditions of Theorem 4.5, the effective class-conditional density is given by   1 G r þ l k þ 1 D Y k 2 2 2 l k þ nk 1  ð2pÞ2 f U ðxjyÞ ¼ l l k þ nk þ 1 G rk þ 2k k¼1 (4.115) lk ðck þ bk Þrk þ 2 h i lk 1 , 2 rk þ 2 þ 2 k þnk ðx  h Þ ck þ bk þ 12 l klþn k k k þ1 where xk is the kth feature in x.

216

Chapter 4

Proof. Let y be fixed. Marginalizing out mk using the conditional posterior in Eq. 4.106 yields Z ` 2 f ðxk js k Þ ¼ f ðxk jmk , s 2k Þp ðmk js2k Þdmk `   Z ` 1 2 12 2 ¼ ð2psk Þ exp  2 ðxk  mk Þ 2sk `     1 2 s2k l k þ nk 2 exp  ðmk  hk Þ dmk  2p l k þ nk 2s2k   1  2 l k þ nk 1 ðl k þ nk Þðxk  hk Þ2 2 12 ¼ ð2ps k Þ exp  2 : l k þ nk þ 1 l k þ nk þ 1 2sk (4.116) Using the posterior of s2k in Eq. 4.107, the effective class-conditional density for xk is Z

`

f ðxk jyÞ ¼ 0

l

rk þ 2k 2 ðckþ bk Þ f ðxk jsk Þ G rk þ l2k

l

k ðs2k Þrk  2 1

  ck þ bk exp  ds2k s2k



l 1 r þk Z ` l 2 ðck þ bk Þ k 2 l k þ nk 2 rk  2k 32 ¼ ð2pÞ ðs Þ k l k þ nk þ 1 Gðrk þ l2k Þ 0    1 1 ðl k þ nk Þðxk  hk Þ2 ds2k  exp  2 ck þ bk þ l k þ nk þ 1 2 sk   1 G r þ l k þ 1 k 2 2 2 l k þ nk 1  ¼ ð2pÞ2 l k þ nk þ 1 G rk þ l2k

12

lk

h

ðck þ bk Þrk þ 2 ck þ bk þ

1 l k þnk 2 l k þnk þ1 ðxk

 hk Þ

i

l

k 1 2 rk þ 2 þ 2

, (4.117)

which completes the proof. ▪ The OBC is obtained from the preceding theorem and Eq. 4.6. If the test point is missing data, the OBC is obtained by omitting the features missing in the test point completely from the analysis. The hyperparameters nk , mk , rk , and ck should be specified for each feature k and each class, and hk and bk are found from Eqs. 4.108 and 4.109, respectively. Lemma 2.2 and Theorem 2.8 are special cases of Theorems 4.5 and 4.6, respectively, where there is no missing data, and nk and rk do not depend on k.

Optimal Bayesian Classification

217

For general covariance Gaussian models and general missing-data mechanisms, we do not have a closed-form posterior. Thus, MCMC inference of model parameters is employed, and then an OBC decision rule is obtained via Monte Carlo approximation. For inference of covariance matrices in highdimensional settings, traditional random walk MCMC methods such as Metropolis–Hastings suffer major limitations, for instance, high rejection rate. Thus, a Hamiltonian Monte Carlo OBC (OBC-HMC) is obtained for the corresponding OBC with missing values by employing the Hamiltonian Monte Carlo (HMC) method (Neal, 2011). HMC has the ability to make large changes to the system state while keeping the rejection rate small. We defer to (Dadaneh et al., 2018a) for the details. Example 4.4. First we compare OBC performance with missing-value imputation in the Gaussian model when features are independent. Feature dimensionality is D ¼ 10. Sample size is n0 ¼ n1 ¼ 12, where feature k ∈ f1, : : : , 10g values for class y ∈ f0, 1g are drawn independently from N ðmk,y , s2k,y Þ. The mean and variance parameters are generated according to a normal-inverse-gamma model: mk,y  N ð0, s 2k,y ∕ny Þ and s2k,y  inverse-gammaðry , cy Þ, where r0 ¼ r1 ¼ 101, c0 ¼ 60, and c1 ¼ 45. This choice of hyperparameters avoids severe violation of the LDA-classifier assumption that the two classes possess equal variances. The parameter ny is used to adjust the difficulty of classification. We consider Bayes errors 0.05 and 0.15. To obtain an average Bayes error of 0.05, n0 ¼ 3 and n1 ¼ 1. To obtain 0.15, n0 ¼ 12 and n1 ¼ 2.6. To approximate Bayes errors, in each run of the simulation, 1000 test points are generated with the classes being equiprobable. Missing-value percentages vary between 0% and 50%. The simulation setup is repeated 10,000 times, and in each iteration classifier accuracy is the proportion of correctly classified test points. Figure 4.14 shows the average accuracy of different classifiers applied to the test data for two levels of Bayes error. In addition to the OBC with missing data, standard SVM, LDA, and QDA classifiers are applied subsequent to a missing-data imputation scheme proposed in (Dadaneh et al., 2018a) based on Gibbs sampling. The OBC with missing data is theoretically optimal in this figure and indeed outperforms the imputationbased methods. To simulate data from general covariance Gaussian models, the training data are generated with n0 ¼ n1 ¼ 12 sample points in each class, with dimension D ¼ 10, drawn from N ðmy , Sy Þ, where a normal-inverse-Wishart prior is placed on the parameters of this Gaussian distribution as my  N ðmy , Sy ∕ny Þ and Sy  inverse-Wishartðk, SÞ. The hyperparameters are m0 ¼ 0D , m1 ¼ m1 ⋅ 1D , n0 ¼ 6D, n1 ¼ D, and S ¼ 0.3ðk  D  1ÞID , where m1 is a scalar. To assess the performance of the OBC-HMC, the

218

Chapter 4

OBC

SVM-Gibbs

LDA-Gibbs

QDA-Gibbs

Bayes

0.9

average accuracy

average accuracy

1 0.95 0.9 0.85 0.8 0.75 0

5

10

15

20

25

35

0.85 0.8 0.75 0.7 0.65

50

0

5

10

15

20

25

35

50

missing-value percentage

missing-value percentage

(a)

(b)

Figure 4.14 Accuracy of various classifiers versus missing-data percentage in a setting with independent features and two Bayes error rates: (a) Bayes error 0.05; (b) Bayes error 0.15. The dotted line indicates the Bayes classifier accuracy as an upper bound for classification accuracy, which is calculated using the true values of model parameters. [Reprinted from (Dadaneh et al., 2018a).]

OBC

OBC-HMC

SVM-Gibbs

LDA-Gibbs

0.9

0.8

0.7

0.6

Bayes

0.9

average accuracy

average accuracy

1

QDA-Gibbs

5

10

15

20

25

35

missing-value percentage

(a)

50

0.8

0.7

0.6

0.5

5

10

15

20

25

35

50

missing-value percentage

(b)

Figure 4.15 Accuracy of various classifiers versus missing-value percentage in a setting with general covariance structures and under two Bayes error rates: (a) Bayes error 0.05; (b) Bayes error 0.15. OBC is trained by the complete data without missing values. OBCHMC integrates missing-value imputation into OBC derivation by marginalization. [Reprinted from (Dadaneh et al., 2018a).]

hyperparameters are adjusted to yield different Bayes errors, the first setting m1 ¼ 0.38 and k ¼ 3D for Bayes error 0.05, and the second setting m1 ¼ 0.28 and k ¼ 15D for Bayes error 0.15. 10,000 simulations are performed, and in each simulation experiment, classifier accuracies are measured on 1000 test points. For OBC-HMC and Gibbs-sampling-imputation implementation, see (Dadaneh et al., 2018a). Figures 4.15(a) and (b) show the performance of OBC-HMC compared to SVM, LDA, and QDA following imputation, as well as the ordinary OBC trained on the complete data without missing values. Missing-value rates are between 5% and 50%, and are simulated similarly to the independent-feature case.

Optimal Bayesian Classification

219

As reported in (Dadaneh et al., 2018a), several other imputation methods were tried, but the small sample size and high missing-value rates in this simulation setup proved too challenging to achieve reasonable prediction accuracy. 4.9.1 Computation for application Missing values lead to the need for MCMC computation. In fact the need for such computational methods is common in applications because, except in special circumstances, one does not have a conjugate prior leading to closedform posteriors. Here, we briefly mention some examples in the literature. In (Knight et al., 2014), a hierarchical multivariate Poisson model is used to represent RNA-Seq measurements, and model uncertainty leads to an MCMC-based OBC for classification. Using an RNA-Seq dataset from The Cancer Genome Atlas (TCGA), the paper develops a classifier for tumor type that discriminates between lung adenocarcinoma and lung squamous cell carcinoma. The same basic setup is used in (Knight et al., 2018), where the OBC and Bayesian MMSE error estimator are used to examine gene sets that can discriminate between experimentally generated phenotypes, the ultimate purpose being to detect multivariate gene interactions in RNA-Seq data. In (Banerjee and Braga-Neto, 2017), the OBC is employed for classification of proteomic profiles generated by liquid chromatography-mass spectrometry (LC-MS). The method is likelihood-free, utilizing approximate Bayesian computation (ABC) implemented via MCMC. A similar computational approach for the OBC is used in (Nagaraja and Braga-Neto, 2018) for a model-based OBC for classification of data obtained via selected reaction monitoring-mass spectrometry (SRM-MS). Also, recall that for the OBC with missing values we employed the Hamiltonian Monte Carlo method. On account of the technology, standard gene-expression-based phenotype classification using RNA-Seq measurements utilizes no timing information, and measurements are averages across collections of cells. In (Karbalayghareh et al., 2018a), IBR classification is considered in the context of gene regulatory models under the assumption that single-cell measurements are sampled at a sufficient rate to detect regulatory timing. Thus, observations are expression trajectories. In effect, classification is performed on data generated by an underlying gene regulatory network. This approach is extended to optimal Bayesian classification in (Hajiramezanali et al., 2019), in which recursive estimation, in the form of the Boolean Kalman filter, is employed, for which the paper uses particle filtering (Imani and Braga-Neto, 2018).

4.10 Optimal Sampling In the context of optimal Bayesian classification, sample data are collected to update the prior distribution to a posterior via the likelihood function

220

Chapter 4

(Eq. 2.17). Thus far, we have focused on random sampling, where a fixed number of points are drawn from the feature-label distribution, and separate sampling, where a fixed number of points are drawn from each classconditional distribution. These methods can be inefficient if, for instance, uncertainty in one of the class prior distributions is contributing more to performance loss due to uncertainty than the other. In the extreme case, consider the situation in which the class-0 feature-label distribution is known and there is uncertainty only with regard to the class-1 feature-label distribution. If the application allows for samples to be drawn sequentially, and in particular if the final sample size can be variable, this can be alleviated to some extent using censored sampling, as we have discussed previously. As an alternative approach, in this section we consider optimal sampling in the framework of optimal experimental design grounded on objective-based uncertainty. This framework was first formulated in the context of intrinsically Bayesian robust operators (Dehghannasiri et al., 2015) and was later generalized into a framework not restricted to operator design (Boluki et al., 2019). Optimal sampling fits directly into the general approach, which we describe next. 4.10.1 MOCU-based optimal experimental design We assume a probability space U with probability measure p, a set C, and a function C : U  C → ½0, `Þ, where U, p, C, and C are called the uncertainty class, prior distribution, action space, and cost function, respectively. Elements of U and C are called uncertainty parameters and actions, respectively. For any u ∈ U, an optimal action is an element cu ∈ C such that Cðu, cu Þ ≤ Cðu, cÞ for any c ∈ C. An intrinsically Bayesian robust (IBR) U action is an element cU IBR ∈ C such that Eu ½Cðu, cIBR Þ ≤ Eu ½Cðu, cÞ for any c ∈ C. Whereas cU IBR is optimal over U, for u ∈ U, cu is optimal relative to u. The objective cost of uncertainty is the performance loss owing to the U application of cU IBR instead of cu on u: Cðu, cIBR Þ  Cðu, c u Þ. Averaging this cost over U gives the mean objective cost of uncertainty (MOCU): i h  M C ðUÞ ¼ Eu C u, cU Þ  Cðu, c u : IBR

(4.118)

The action space is arbitrary so long as the cost function is defined on U  C. Suppose that there is a set Ξ, called the experiment space, whose elements j, called experiments, are random variables. One j is chosen from Ξ based on some procedure; in a slight abuse of notation we also use j to denote the outcome of an experiment. In some applications, the experiment selection procedure itself may influence the distribution of u. Here we assume that, while u is jointly distributed with the experiment, the marginal distribution of

Optimal Bayesian Classification

221

u does not depend on which experiment j is selected; thus, Ej ½pðu, jÞ ¼ pðuÞ for all j ∈ Ξ. Given j ∈ Ξ, the conditional distribution pðujjÞ is the posterior distribution relative to j, and Ujj denotes the corresponding probability space, called the conditional uncertainty class. Relative to Ujj, we define IBR actions Ujj cIBR and the remaining MOCU, h  i Ujj M C ðUjjÞ ¼ Eujj C u, cIBR  Cðu, cu Þ , (4.119) where the expectation is with respect to pðujjÞ. Since the experiment j is a random variable, M C ðUjjÞ is a random variable. Taking the expectation over j gives the expected remaining MOCU, ii h h  Ujj DC ðU, jÞ ¼ Ej Eujj C u, cIBR  Cðu, cu Þ , (4.120) which is called the experimental design value. DC ðU, jÞ is a constant indexed by the experiment j in the experiment space. An optimal experiment j  ∈ Ξ minimizes DC ðU, jÞ: j  ¼ arg min DC ðU, jÞ: j∈Ξ

Evaluating the preceding equation gives h h  ii Ujj j  ¼ arg min Ej Eujj C u, cIBR , j∈Ξ

(4.121)

(4.122)

the latter double expectation being the residual IBR cost, which is denoted by RC ðU, jÞ. Hence, j  ¼ arg min RC ðU, jÞ: j∈Ξ

(4.123)

Proceeding iteratively results in sequential experimental design. This can be done in a greedy manner (greedy-MOCU), meaning that at each step we select the best experiment without looking ahead to further experiments. We stop when we reach a fixed budget of experiments, or, if the final number of experiments can be variable, when some other stopping condition (such as the MOCU, or the difference in MOCU between experiments) reaches some specified threshold. An optimal policy, i.e., an optimal strategy for selecting experiments, can also be found using dynamic programming over a finite horizon of experiments (DP-MOCU) (Imani et al., 2018). Because DP-MOCU reduces MOCU at the end of the horizon, whereas greedy-MOCU takes only the next step into account, DP-MOCU achieves the lowest cost at the end of the horizon; however, it typically requires much greater computation time. In its original formulation (Yoon et al., 2013), MOCU depends on a class of operators applied to a parameterized model in which u is a random vector

222

Chapter 4

whose distribution depends on a characterization of the uncertainty, as in Section 4.1. In that setting, U is an uncertainty class of system models parameterized by a vector u governed by a probability distribution pðuÞ, and C is a class of operators on the models whose performances are measured by the cost Cðu, cÞ of applying c on model u ∈ U. Lindley has proposed a general framework for Bayesian experimental design (Lindley, 1972). The general MOCU-based design fits within this framework. Moreover, knowledge gradient (Frazier et al., 2008) and efficient global optimization (Jones et al., 1998) are specific implementations of MOCU-based experimental design under their modeling assumptions; see (Boluki et al., 2019). MOCU-based optimal experimental design has been used in several settings, including structural intervention in gene regulatory networks (Dehghannasiri et al., 2015; Mohsenizadeh et al., 2018; Imani and Braga-Neto, 2018), canonical expansions (Dehghannasiri et al., 2017b), and materials discovery (Dehghannasiri et al., 2017c). 4.10.2 MOCU-based optimal sampling The cost function for MOCU-based optimal sampling for classification is classification error. Beginning with the prior and an IBR classifier, we determine the optimal experiment to obtain a sample point, derive the OBC from the posterior, treat this OBC as a new IBR classifier and the posterior as a new prior, and then repeat the procedure iteratively to generate a sample and final OBC. The following scenario is discussed in (Dougherty, 2018). It yields a sequential sampling procedure when applied iteratively. Here we restrict our attention to greedy-MOCU. Suppose that we can specify which class to sample, where, given the class, sampling is random. We assume known class-0 prior probability c and consider the multinomial model with Dirichlet priors possessing hyperparameter vector ½ay1 , : : : , ayb  for class y. The priors for classes 0 and 1 are given in Eqs. 2.43 and 2.44, respectively. The experiment space is fh0 , h1 g, where hy selects the feature value of a sample from class y, which for the multinomial Ujh distribution is the bin number. For h ∈ fh0 , h1 g, cIBR ¼ cOBCjh is the IBR classifier for the posterior distribution pðujhÞ, or, equivalently, it is the OBC given h, Cðu, · Þ is the classification error for state u, and the residual IBR cost is   RC ðU, hÞ ¼ Eh EpðujhÞ ½εðu, cOBCjh Þ ¼ Eh ½bεðh, cOBCjh Þ "  # b X U 0j þ a0j U 1j þ a1j P P , ¼ Eh min c , ð1  cÞ n0 þ bi¼1 a0i n1 þ bi¼1 a1i j¼1 (4.124)

Optimal Bayesian Classification

223

where the second equality follows because the inner expectation is the Bayesian MMSE error estimate of the OBC error, and the third equality applies Eq. 4.8. U yj and ny inside the last expectation are implicitly functions of h. For selecting a single data point from class 0, Eq. 4.124 reduces to  # Ih0 ¼j þ a0j a1j min c , ð1  cÞ 1 RC ðUjh0 Þ ¼ Eh0 a0 1 þ a00 j¼1   b b X X Ii¼j þ a0j a1j ¼ Prðh0 ¼ iÞ min c , ð1  cÞ 1 a0 1 þ a00 i¼1 j¼1   b X a0i 1 þ a0i a1i min c , ð1  cÞ ¼ a0 1 þ a00 a10 i¼1 0   a00  a0i a0i a1i þ min c , ð1  cÞ 1 , a00 1 þ a00 a0 "

b X

(4.125)

P where ay0 ¼ bi¼1 ayi , and Prðh0 ¼ iÞ ¼ a0i ∕a00 is the effective density of class 0 under the prior. An analogous expression gives RC ðUjh1 Þ. An optimal experiment is determined by minðRC ðUjh0 Þ, RC ðUjh1 ÞÞ. We have that h ¼ hz , where z ¼ arg min

y∈f0,1g

b X

b pyi d yþ pyi Þd y i þ ð1  b i :

(4.126)

i¼1

Given class y, b pyi ¼ ayi ∕ay0 is the probability that the new point is in bin i, is the error contributed by bin i if the new point is in bin i; and d yþ i in particular, d 0þ p0þ p1i Þ, d 1þ p0i , ð1  cÞb p1þ i ¼ minðcb i , ð1  cÞb i ¼ minðcb i Þ, and yþ y y y b pi ¼ ð1 þ ai Þ∕ð1 þ a0 Þ. Likewise, d i is the error contributed by bin i if the new point is not in bin i; in particular, d 0 p0 p1i Þ, i ¼ minðcb i , ð1  cÞb y y d 1 p0i , ð1  cÞb p1 py i ¼ minðcb i Þ, and b i ¼ ai ∕ð1 þ a0 Þ. When this procedure is y iterated, in each epoch the hyperparameters ai should be updated with the data from all past experiments. If it is impossible to change the OBC classifier by observing a single point [cb p0þ p1i and cb p0i , ð1  cÞb p1 i for all bin i i , ð1  cÞb 1þ 0 1 pi and ð1  cÞb pi , cb p0 where OBC outputs class 0, and ð1  cÞb pi , cb i for all bin i where OBC outputs class 1], it can be shown that the argument of the minimum in Eq. 4.126 is equal for both classes; ties may be broken with random sampling to ensure convergence to the Bayes error. A similar sampling procedure is applied in (Broumand et al., 2015) that is not MOCU-based.

224

Chapter 4

Example 4.5. Consider a discrete classification problem with b ¼ 8 bins in which c ¼ 0.5 is known and fixed, ½ p1 , : : : , pb  is drawn from a prior with low P uncertainty given by a0j ∝ 1∕j with bj¼1 a0j ¼ 100, and ½q1 , : : : , qb  is drawn from a prior with high uncertainty given by a1j ∝ 1∕ðb  j þ 1Þ with Pb 1 j¼1 aj ¼ 1. A budget of 20 samples is available, where the label of each sample can be requested sequentially, and given the labels each sample is drawn independently. Figure 4.16(a) presents the average of several classifier errors with respect to the number of samples drawn over 100,000 randomly drawn bin probabilities. For all classifiers, ties are broken by randomly selecting from the labels with equal probability. The Bayes error is constant at 0.104 with respect to sample size. The average error of a discrete histogram rule with random sampling decreases, as expected, but at 20 points it only reaches 0.159. The average error of OBC with “correct” priors (the same priors used to generate the true bin probabilities) and random sampling has less than half the average error of the discrete histogram rule after observing just one point (0.206 versus 0.418), then it continues to decrease until it reaches 0.117 after observing 20 points. Finally, OBC with correct priors and optimal (one-steplook-ahead) sampling performs even better, with an average error of 0.180 after one point and 0.112 after 20 points. The optimal one-step-look-ahead tends to draw more points from class 1, since class 1 starts with more uncertainty than class 0. This experiment is repeated in Fig. 4.16(b) with c ¼ 0.1, b ¼ 64 bins, both classes being drawn from priors having the same amount of uncertainty, given Pb y by a0j ∝ 1∕j and a1j ∝ 1∕ðb  j þ 1Þ with j¼1 aj ¼ 1 for both y ¼ 0, 1. Random sampling draws points from class 0 with probability c and ties in all classifiers that are still broken by selecting labels with equal probability. The difference between optimal and random is greater in part (a). 0.3

0.5 Bayes error DHR, random sample OBC, random sample OBC, optimal one-step-look-ahead

0.3 0.2 0.1 0

Bayes error DHR, random sample OBC, random sample OBC, optimal one-step-look-ahead

0.25

average error

average error

0.4

0.2 0.15 0.1 0.05

5

10

15

20

0 10

20

30

samples drawn

samples drawn

(a)

(b)

40

50

Figure 4.16 Performance of optimal sampling: (a) unequal initial uncertainty between classes; (b) equal initial uncertainty between classes.

Optimal Bayesian Classification

225

We close this section by noting that other online learning approaches have been proposed, for example, methods based on sequential measurements aimed at model improvement, such as knowledge gradient and active sampling (Cohn et al., 1994). Active sampling aims to control the selection of potential unlabeled training points in the sample space to be labeled and used for training. A generic active sampling algorithm is described in (Saar-Tsechansky and Provost, 2004), and a Bayesian approach is considered in (Golovin et al., 2010).

4.11 OBC for Autoregressive Dependent Sampling

In the preceding section, sample points are dependent because the sampling procedure is guided by evidence from previous samples. One may consider other forms of dependency in the sampling mechanism. For instance, sample points can be modeled as the output of a stochastic process (Tubbs, 1980; Lawoko and McLachlan, 1983, 1986; McLachlan, 2004; Zollanvari et al., 2013). In this section, we derive the OBC for data generated by an autoregressive process. Historically, the performance of LDA with dependent observations was studied in (Lawoko and McLachlan, 1986). A univariate stationary autoregressive (AR) model of order $p$ was used to model the intraclass correlation among observations. In (Lawoko and McLachlan, 1985), the same authors examined the performance of the classifier obtained by replacing the optimal parameters in the Bayes classifier by their maximum-likelihood estimators obtained under the same AR model. The performance of LDA under a general stochastic setting was considered in (Zollanvari et al., 2013), special cases being when observations are generated from AR or moving-average (MA) signal models of order 1. Here we construct the OBC under two conditions: (1) the training observations for class $y$ are generated from $\mathrm{VAR}_y(p)$, which denotes a multidimensional vector autoregressive process of order $p$; and (2) there exists uncertainty about the parameters governing the $\mathrm{VAR}_y(p)$ model (Zollanvari and Dougherty, 2019). We assume that the model parameters (coefficient matrices) are random with a prior distribution.

4.11.1 Prior and posterior distributions for VAR processes

Let $x_1^y, \ldots, x_{n_y}^y$ be length-$q$ column vectors for the $n_y$ training observations under class $y = 0, 1$, which are drawn from a $q$-dimensional (column) $\mathrm{VAR}_y(p)$ vector autoregressive process of order $p$ (Lütkepohl, 2005). In particular,

$$x_t^y = c_y + \sum_{i=1}^{p} A_i^y x_{t-i}^y + u_t^y, \qquad (4.127)$$


where $u_t^y$ is $q$-dimensional Gaussian white noise with $E[u_t^y] = 0_q$, $E[u_t^y (u_t^y)^T] = \Sigma_y$ invertible, $E[u_t^y (u_s^y)^T] = 0_{q \times q}$ for $s \neq t$, $A_i^y$ is a $q \times q$ matrix of model parameters, $c_y$ is the vector of $q$ intercept terms, and, to ease notation, the order $p$ of the VAR processes is assumed to be the same for both classes. Equivalently,

$$X_y = A_y Z_y + U_y, \qquad (4.128)$$
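As a concrete illustration of the data model in Eqs. 4.127 and 4.128, the following minimal sketch (our own code, not from the source) simulates training observations from a $\mathrm{VAR}_y(p)$ process, using the zero initial condition assumed in the text; the coefficient matrices in the usage example are arbitrary stable choices.

```python
import numpy as np

def simulate_var(c_y, A_list, Sigma_y, n_y, rng):
    """Draw n_y observations from a VAR_y(p) process (Eq. 4.127),
    with x_0 = ... = x_{1-p} = 0_q as assumed in the text."""
    q, p = len(c_y), len(A_list)
    x_hist = [np.zeros(q) for _ in range(p)]     # lagged values, most recent first
    X = np.empty((q, n_y))
    for t in range(n_y):
        u_t = rng.multivariate_normal(np.zeros(q), Sigma_y)
        x_t = c_y + sum(A_list[i] @ x_hist[i] for i in range(p)) + u_t
        X[:, t] = x_t
        x_hist = [x_t] + x_hist[:-1]
    return X                                      # the q x n_y matrix X_y of Eq. 4.128

rng = np.random.default_rng(0)
q = 3
X0 = simulate_var(np.zeros(q), [0.4 * np.eye(q), 0.2 * np.eye(q)], np.eye(q), 50, rng)
```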

where $X_y = [x_1^y, \ldots, x_{n_y}^y]$ and $U_y = [u_1^y, \ldots, u_{n_y}^y]$ are $q \times n_y$-dimensional matrices, $A_y = [c_y, A_1^y, \ldots, A_p^y]$ is a $q \times (1+pq)$ matrix of model parameters, and $Z_y = [z_1^y, \ldots, z_{n_y}^y]$ is a $(1+pq) \times n_y$ matrix; note that $z_t^y = [1, (x_{t-1}^y)^T, \ldots, (x_{t-p}^y)^T]^T$ and assume that $x_0^y = x_{-1}^y = \cdots = x_{1-p}^y = 0_q$. From the chain rule of probability, the likelihood function $f(X_y | A_y)$ can be expressed as (Lütkepohl, 2005, p. 223; Kilian and Lütkepohl, 2017, p. 152)

$$\begin{aligned} f(X_y | A_y) &= \prod_{t=1}^{n_y} f\big(x_t^y \,\big|\, x_{t-1}^y, \ldots, x_{t-p}^y, A_y\big) \\ &\propto |\Sigma_y|^{-\frac{n_y}{2}} \exp\!\left\{ -\frac{1}{2} \sum_{t=1}^{n_y} (x_t^y - A_y z_t^y)^T \Sigma_y^{-1} (x_t^y - A_y z_t^y) \right\} \\ &\propto \exp\!\left\{ -\frac{1}{2} \operatorname{tr}\!\big[ (X_y - A_y Z_y)^T \Sigma_y^{-1} (X_y - A_y Z_y) I_{n_y} \big] \right\} \\ &= \exp\!\left\{ -\frac{1}{2} \big[x_y - (Z_y^T \otimes I_q) a_y\big]^T (I_{n_y} \otimes \Sigma_y^{-1}) \big[x_y - (Z_y^T \otimes I_q) a_y\big] \right\} \\ &= \exp\!\left\{ -\frac{1}{2} (w_y - W_y a_y)^T (w_y - W_y a_y) \right\}. \end{aligned} \qquad (4.129)$$

The second-to-last equality of Eq. 4.129 follows from the definitions $x_y = \mathrm{vec}(X_y)$ and $a_y = \mathrm{vec}(A_y)$, and the identities

$$\operatorname{tr}(A^T B C D) = \mathrm{vec}(A)^T (D^T \otimes B)\, \mathrm{vec}(C), \qquad (4.130)$$

$$\mathrm{vec}(AB) = (B^T \otimes I_r)\, \mathrm{vec}(A), \qquad (4.131)$$

where $r$ is the number of rows in $A$, $\mathrm{vec}(\cdot)$ denotes the vectorization operator (stacking matrix columns), and $\otimes$ denotes the Kronecker product. The last equality follows from the definitions

$$w_y = \big( I_{n_y} \otimes \Sigma_y^{-\frac{1}{2}} \big) x_y, \qquad (4.132)$$

$$W_y = Z_y^T \otimes \Sigma_y^{-\frac{1}{2}}, \qquad (4.133)$$

the identity $(I_r \otimes A)^{1/2} = I_r \otimes A^{1/2}$, and the mixed-product property of the Kronecker product. Let

$$\Phi_y = (W_y^T W_y)^{-1} = \big( Z_y Z_y^T \otimes \Sigma_y^{-1} \big)^{-1}, \qquad (4.134)$$

$$\hat{m}_y = \Phi_y (Z_y \otimes \Sigma_y^{-1})\, x_y. \qquad (4.135)$$

Writing $a_y = a_y + \hat{m}_y - \hat{m}_y$ and noting that $w_y^T W_y = \hat{m}_y^T W_y^T W_y$ yields

$$f(X_y | A_y) \propto \exp\!\left\{ -\frac{1}{2} \big[ (a_y - \hat{m}_y)^T \Phi_y^{-1} (a_y - \hat{m}_y) + (w_y - W_y \hat{m}_y)^T (w_y - W_y \hat{m}_y) \big] \right\}. \qquad (4.136)$$

The prior distribution of $a_y$ is assumed to be a multivariate Gaussian with known $(1+pq)q$-dimensional mean vector $m_y$ and $(1+pq)q \times (1+pq)q$-dimensional invertible covariance matrix $C_y$:

$$p(a_y) \propto |C_y|^{-\frac{1}{2}} \exp\!\left\{ -\frac{1}{2} (a_y - m_y)^T C_y^{-1} (a_y - m_y) \right\}. \qquad (4.137)$$

The next theorem provides the posterior distribution. It appears as in (Zollanvari and Dougherty, 2019), but is essentially the same as in (Lütkepohl, 2005).

Theorem 4.7 (Lütkepohl, 2005). For the prior distribution of Eq. 4.137 and the $\mathrm{VAR}_y(p)$ model of Eq. 4.128, the posterior distribution is Gaussian and given by

$$\pi^*(a_y) = \frac{1}{(2\pi)^{\frac{(1+pq)q}{2}} |C_y^*|^{\frac{1}{2}}} \exp\!\left\{ -\frac{1}{2} (a_y - m_y^*)^T (C_y^*)^{-1} (a_y - m_y^*) \right\}, \qquad (4.138)$$

where

$$m_y^* = C_y^* \big( C_y^{-1} m_y + \Phi_y^{-1} \hat{m}_y \big), \qquad (4.139)$$

$$C_y^* = \big( C_y^{-1} + \Phi_y^{-1} \big)^{-1}. \qquad (4.140)$$
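The posterior of Theorem 4.7 can be computed directly from the training matrix and the prior hyperparameters. The following is a minimal numerical sketch under the assumption that $\Sigma_y$ is known; the construction of $Z_y$ uses the zero initial conditions of the text, and the function name is ours.

```python
import numpy as np

def var_posterior(X, p, Sigma, m_prior, C_prior):
    """Posterior mean and covariance of a_y = vec(A_y) for one class
    (Eqs. 4.134-4.135 and 4.139-4.140), with Sigma_y assumed known."""
    q, n = X.shape
    # Regressor matrix Z_y: column t is [1, x_{t-1}^T, ..., x_{t-p}^T]^T,
    # with zero initial conditions.
    Xpad = np.hstack([np.zeros((q, p)), X])
    Z = np.vstack([np.ones((1, n))] +
                  [Xpad[:, p - i:p - i + n] for i in range(1, p + 1)])
    Sigma_inv = np.linalg.inv(Sigma)
    Phi = np.linalg.inv(np.kron(Z @ Z.T, Sigma_inv))                     # Eq. 4.134
    m_hat = Phi @ np.kron(Z, Sigma_inv) @ X.flatten(order="F")           # Eq. 4.135
    C_post = np.linalg.inv(np.linalg.inv(C_prior) + np.linalg.inv(Phi))  # Eq. 4.140
    m_post = C_post @ (np.linalg.inv(C_prior) @ m_prior
                       + np.linalg.inv(Phi) @ m_hat)                     # Eq. 4.139
    return m_post, C_post
```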


Proof. The posterior is defined by

$$\pi^*(a_y) = \frac{p(a_y)\, f(X_y | A_y)}{\int_{\mathbb{R}^{(1+pq)q}} p(a_y)\, f(X_y | A_y)\, da_y}. \qquad (4.141)$$

Using Eqs. 4.136 and 4.137, and after some algebraic manipulations [e.g., see (Lütkepohl, 2005, p. 223)], we obtain

$$p(a_y) f(X_y | A_y) \propto \exp\!\left\{ -\frac{1}{2} (a_y - m_y^*)^T (C_y^*)^{-1} (a_y - m_y^*) \right\} \times \exp\!\left\{ -\frac{1}{2} (\bar{w}_y - \bar{W}_y m_y^*)^T (\bar{w}_y - \bar{W}_y m_y^*) \right\}, \qquad (4.142)$$

where

$$\bar{w}_y = \begin{bmatrix} C_y^{-\frac{1}{2}} m_y \\ w_y \end{bmatrix}, \qquad (4.143)$$

$$\bar{W}_y = \begin{bmatrix} C_y^{-\frac{1}{2}} \\ W_y \end{bmatrix}. \qquad (4.144)$$

Replacing Eq. 4.142 in Eq. 4.141, and noting that the second exponential term in Eq. 4.142 does not depend on $a_y$, we obtain Eq. 4.138. ▪

4.11.2 OBC for VAR processes

In constructing the OBC we assume that a future $q$-dimensional sample point $x$ is independent from all training observations, as in (Lawoko and McLachlan, 1985). We also assume that $f_{\theta_y}(x|y) \sim N(c_y, \Sigma_y)$, which is the marginal distribution of the training point $x_1^y$.

Theorem 4.8 (Zollanvari and Dougherty, 2019). For the prior distribution of Eq. 4.137 and the $\mathrm{VAR}_y(p)$ model of Eq. 4.128, the effective density is given by

$$f_\Theta(x|y) = \frac{|\breve{C}_y|^{\frac{1}{2}}}{(2\pi)^{\frac{q}{2}} |\Sigma_y|^{\frac{1}{2}} |C_y^*|^{\frac{1}{2}}} \cdot \frac{\exp\!\left\{ -\frac{1}{2} (\tilde{x} - m_y^*)^T (C_y^*)^{-1} (\tilde{x} - m_y^*) \right\}}{\exp\!\left\{ -\frac{1}{2} \big(\tilde{x} - \breve{m}_y(x)\big)^T \breve{C}_y^{-1} \big(\tilde{x} - \breve{m}_y(x)\big) \right\}}, \qquad (4.145)$$


where $\tilde{x} = \mathrm{vec}(\tilde{X})$, $\tilde{X} = [x, 0_{q \times pq}]$, $m_y^*$ and $C_y^*$ are defined in Eqs. 4.139 and 4.140, respectively,

$$\breve{m}_y(x) = \breve{C}_y \big[ (C_y^*)^{-1} m_y^* + (D \otimes \Sigma_y^{-1})\, \tilde{x} \big], \qquad (4.146)$$

$$\breve{C}_y = \big[ (C_y^*)^{-1} + D \otimes \Sigma_y^{-1} \big]^{-1}, \qquad (4.147)$$

$D$ is the $(1+pq) \times (1+pq)$ matrix $D = e_1 e_1^T$, and $e_1 = [1, 0, \ldots, 0]^T$.

Proof. Using $f_{\theta_y}(x|y) \sim N(c_y, \Sigma_y)$ together with Eq. 4.138 yields the effective density

$$\begin{aligned} f_\Theta(x|y) = \int_{\mathbb{R}^{(1+pq)q}} & \frac{1}{(2\pi)^{\frac{q}{2}} |\Sigma_y|^{\frac{1}{2}}} \cdot \frac{1}{(2\pi)^{\frac{(1+pq)q}{2}} |C_y^*|^{\frac{1}{2}}} \exp\!\left\{ -\frac{1}{2} (x - c_y)^T \Sigma_y^{-1} (x - c_y) \right\} \\ & \times \exp\!\left\{ -\frac{1}{2} (a_y - m_y^*)^T (C_y^*)^{-1} (a_y - m_y^*) \right\} da_y. \end{aligned} \qquad (4.148)$$

The exponent of the first exponential inside the integral can be expressed as

$$(x - c_y)^T \Sigma_y^{-1} (x - c_y) = (\tilde{x} - a_y)^T (D \otimes \Sigma_y^{-1}) (\tilde{x} - a_y). \qquad (4.149)$$

Then using Eqs. 4.146, 4.147, and 4.149, we can combine and rewrite the exponents of both exponentials in Eq. 4.148 as

$$\begin{aligned} (x - c_y)^T \Sigma_y^{-1} (x - c_y) &+ (a_y - m_y^*)^T (C_y^*)^{-1} (a_y - m_y^*) \\ &= \tilde{x}^T (D \otimes \Sigma_y^{-1}) \tilde{x} + m_y^{*T} (C_y^*)^{-1} m_y^* + a_y^T \breve{C}_y^{-1} a_y - 2\, a_y^T \breve{C}_y^{-1} \breve{m}_y \\ &= \tilde{x}^T (D \otimes \Sigma_y^{-1}) \tilde{x} + m_y^{*T} (C_y^*)^{-1} m_y^* - \breve{m}_y^T \breve{C}_y^{-1} \breve{m}_y + (a_y - \breve{m}_y)^T \breve{C}_y^{-1} (a_y - \breve{m}_y), \end{aligned} \qquad (4.150)$$

where we have suppressed the dependency on $x$ in the notation for $\breve{m}_y$. Substituting this into Eq. 4.148 yields


$$\begin{aligned} f_\Theta(x|y) &= \frac{\exp\!\left\{ -\frac{1}{2} \big[ \tilde{x}^T (D \otimes \Sigma_y^{-1}) \tilde{x} + m_y^{*T} (C_y^*)^{-1} m_y^* - \breve{m}_y^T \breve{C}_y^{-1} \breve{m}_y \big] \right\}}{(2\pi)^{\frac{q}{2}} |\Sigma_y|^{\frac{1}{2}} |C_y^*|^{\frac{1}{2}} |\breve{C}_y|^{-\frac{1}{2}}} \int_{\mathbb{R}^{(1+pq)q}} \frac{\exp\!\left\{ -\frac{1}{2} (a_y - \breve{m}_y)^T \breve{C}_y^{-1} (a_y - \breve{m}_y) \right\}}{(2\pi)^{\frac{(1+pq)q}{2}} |\breve{C}_y|^{\frac{1}{2}}}\, da_y \\ &= \frac{\exp\!\left\{ -\frac{1}{2} \big[ \tilde{x}^T (D \otimes \Sigma_y^{-1}) \tilde{x} + m_y^{*T} (C_y^*)^{-1} m_y^* - \breve{m}_y^T \breve{C}_y^{-1} \breve{m}_y \big] \right\}}{(2\pi)^{\frac{q}{2}} |\Sigma_y|^{\frac{1}{2}} |C_y^*|^{\frac{1}{2}} |\breve{C}_y|^{-\frac{1}{2}}}, \end{aligned} \qquad (4.151)$$

which reduces to Eq. 4.145. ▪

Under the assumption that the prior probability $c_y$ of class $y$ is known for $y = 0, 1$ (previously we used $c$ for class 0 and $1 - c$ for class 1), applying Theorem 4.1 and taking the logarithm in Eq. 4.145 gives the OBC:

$$\psi_{\mathrm{OBC}}^{\mathrm{VAR}}(x) = \begin{cases} 0 & \text{if } g_0(x) \geq g_1(x), \\ 1 & \text{otherwise}, \end{cases} \qquad (4.152)$$

where

$$g_y(x) = 2 \ln(c_y) + \ln\!\left( \frac{|\breve{C}_y C_y^{*-1}|}{|\Sigma_y|} \right) + D^2\big(\tilde{x}, \breve{m}_y, \breve{C}_y\big) - D^2\big(\tilde{x}, m_y^*, C_y^*\big), \qquad (4.153)$$

and

$$D(x, m, C) = \sqrt{(x - m)^T C^{-1} (x - m)} \qquad (4.154)$$

is the Mahalanobis distance of vector $x$ from a distribution with mean $m$ and covariance $C$. For independent sampling, the observations are generated from $\mathrm{VAR}_y(0)$ processes for $y = 0, 1$ ($p = 0$). Note that $p = 0$ implies that $Z_y = 1_{n_y}^T$ and $\tilde{x} = x$. If we assume that $C_y = \Sigma_y / \nu_y$ for $\nu_y > 0$, then $\hat{m}_y$ becomes the sample mean of class $y$, $m_y^*$ becomes the posterior mean of the mean, the effective density in Eq. 4.145 reduces to Eq. 2.96, and the preceding OBC reduces to that of Eqs. 4.11 through 4.14. For a second special case, consider the improper non-informative prior $p(a_y) = \beta > 0$. We require a matrix lemma given in (Zollanvari and Dougherty, 2019).


Lemma 4.5 (Zollanvari and Dougherty, 2019). Let $A$ be an invertible $r \times r$ matrix and $D = e_1 e_1^T$. Then

$$D\big[I_r - (A + D)^{-1} D\big] = kD, \qquad (4.155)$$

$$(A + D)^{-1} D = k A^{-1} D, \qquad (4.156)$$

$$A^{-1} D \big[I_r - (A + D)^{-1}\big] e_1 = (A + D)^{-1} e_1, \qquad (4.157)$$

$$\frac{|(A + D)^{-1}|}{|A^{-1}|} = k, \qquad (4.158)$$

where $k = (1 + e_1^T A^{-1} e_1)^{-1}$.

Proof. Right and left multiplication of a matrix by $D$ nullifies all elements except for those in the first column and row, respectively. Hence,

$$D\big[I_r - (A + D)^{-1} D\big] = D\big[I_r - (A + D)^{-1} (A + D - A)\big] = D (A + D)^{-1} A = D (I_r + A^{-1} D)^{-1} = kD. \qquad (4.159)$$

Therefore,

$$A^{-1} D \big[I_r - (A + D)^{-1} D\big] = k A^{-1} D. \qquad (4.160)$$

From the Sherman–Morrison inversion formula,

$$(A + D)^{-1} D = \big( A^{-1} - A^{-1} D (I_r + A^{-1} D)^{-1} A^{-1} \big) D = A^{-1} D \big[I_r - (A + D)^{-1} D\big] = k A^{-1} D, \qquad (4.161)$$

which completes the proof of Eqs. 4.155 through 4.157. Furthermore, by the matrix determinant lemma, we have

$$\frac{|(A + D)^{-1}|}{|A^{-1}|} = \frac{|A^{-1}|\, k}{|A^{-1}|} = k, \qquad (4.162)$$

which completes the proof of Eq. 4.158. ▪

Using the prior $p(a_y) = \beta$ in Eq. 4.141, the posterior has a Gaussian form similar to Eq. 4.138, where


$$m_y^* = \hat{m}_y = \big[ (Z_y Z_y^T)^{-1} Z_y \otimes I_q \big] x_y = \mathrm{vec}\big( X_y Z_y^T (Z_y Z_y^T)^{-1} \big), \qquad (4.163)$$

$$C_y^* = \Phi_y = (Z_y Z_y^T)^{-1} \otimes \Sigma_y, \qquad (4.164)$$

$\hat{m}_y$ and $\Phi_y$ are defined in Eq. 4.135 and Eq. 4.134, respectively, and we have used the fact that for any $s \times r$ matrix $A$ and $q \times r$ matrix $B$, $(A \otimes I_q)\,\mathrm{vec}(B) = \mathrm{vec}(B A^T)$. We also assume that $n_y > 1 + pq$ so that the inverse $(Z_y Z_y^T)^{-1}$ exists. Then

$$\breve{m}_y = \big[ (Z_y Z_y^T + D)^{-1} Z_y \otimes I_q \big] x_y + \big[ (Z_y Z_y^T + D)^{-1} D \otimes I_q \big] \tilde{x} = \mathrm{vec}\big( (X_y Z_y^T + \tilde{X})(Z_y Z_y^T + D)^{-1} \big), \qquad (4.165)$$

$$\breve{C}_y = (Z_y Z_y^T + D)^{-1} \otimes \Sigma_y. \qquad (4.166)$$

Substituting Eqs. 4.163 through 4.166 in Eq. 4.153 and applying the fact that

$$\mathrm{vec}(X)^T (A \otimes B)\, \mathrm{vec}(X) = \operatorname{tr}(A^T X^T B X) \qquad (4.167)$$

for any $q \times r$ matrix $X$, $r \times r$ matrix $A$, and $q \times q$ matrix $B$, yields

$$D^2\big(\tilde{x}, \breve{m}_y, \breve{C}_y\big) - D^2\big(\tilde{x}, m_y^*, C_y^*\big) = \operatorname{tr}\!\Big[ (Z_y Z_y^T + D)(\tilde{X} - \breve{M}_y)^T \Sigma_y^{-1} (\tilde{X} - \breve{M}_y) - Z_y Z_y^T (\tilde{X} - M_y)^T \Sigma_y^{-1} (\tilde{X} - M_y) \Big], \qquad (4.168)$$

where

$$M_y = X_y Z_y^T (Z_y Z_y^T)^{-1}, \qquad (4.169)$$

$$\breve{M}_y = (X_y Z_y^T + \tilde{X})(Z_y Z_y^T + D)^{-1}. \qquad (4.170)$$

Since

$$(\tilde{X} - \breve{M}_y)(Z_y Z_y^T + D) = (\tilde{X} - M_y)\, Z_y Z_y^T, \qquad (4.171)$$

we have that

$$D^2\big(\tilde{x}, \breve{m}_y, \breve{C}_y\big) - D^2\big(\tilde{x}, m_y^*, C_y^*\big) = \operatorname{tr}\!\Big[ Z_y Z_y^T (M_y - \breve{M}_y)^T \Sigma_y^{-1} (\tilde{X} - M_y) \Big]. \qquad (4.172)$$


By Eq. 4.171, we also have that

$$(\breve{M}_y - M_y)\, Z_y Z_y^T = (\tilde{X} - \breve{M}_y)\, D. \qquad (4.173)$$

Thus,

$$D^2\big(\tilde{x}, \breve{m}_y, \breve{C}_y\big) - D^2\big(\tilde{x}, m_y^*, C_y^*\big) = -\operatorname{tr}\!\Big[ D (\tilde{X} - \breve{M}_y)^T \Sigma_y^{-1} (\tilde{X} - M_y) \Big]. \qquad (4.174)$$

Using the fact that $\tilde{X} = \tilde{X} D$ and Lemma 4.5, and the fact that $\operatorname{tr}(DA) = e_1^T A e_1$ for any matrix $A$ of the appropriate size, we have that

$$\begin{aligned} D^2\big(\tilde{x}, \breve{m}_y, \breve{C}_y\big) - D^2\big(\tilde{x}, m_y^*, C_y^*\big) &= -\operatorname{tr}\!\Big[ k_y D (\tilde{X} - M_y)^T \Sigma_y^{-1} (\tilde{X} - M_y) \Big] \\ &= -k_y\, e_1^T (\tilde{X} - M_y)^T \Sigma_y^{-1} (\tilde{X} - M_y)\, e_1 \\ &= -k_y (x - g_y)^T \Sigma_y^{-1} (x - g_y), \end{aligned} \qquad (4.175)$$

where

$$k_y = \big[ 1 + e_1^T (Z_y Z_y^T)^{-1} e_1 \big]^{-1}, \qquad (4.176)$$

$$g_y = M_y e_1 = X_y Z_y^T (Z_y Z_y^T)^{-1} e_1. \qquad (4.177)$$

Thus,

$$\psi_{\mathrm{OBC}}^{\mathrm{VAR}}(x) = \begin{cases} 0 & \text{if } x^T H x + h^T x + b \geq 0, \\ 1 & \text{otherwise}, \end{cases} \qquad (4.178)$$

where

$$H = \frac{1}{2} \big( k_1 \Sigma_1^{-1} - k_0 \Sigma_0^{-1} \big), \qquad (4.179)$$

$$h = k_0 \Sigma_0^{-1} g_0 - k_1 \Sigma_1^{-1} g_1, \qquad (4.180)$$

$$b = \frac{1}{2} \big( k_1 g_1^T \Sigma_1^{-1} g_1 - k_0 g_0^T \Sigma_0^{-1} g_0 \big) + \ln\!\left[ \frac{c_0}{c_1} \left( \frac{k_0}{k_1} \right)^{\frac{q}{2}} \left( \frac{|\Sigma_1|}{|\Sigma_0|} \right)^{\frac{1}{2}} \right]. \qquad (4.181)$$
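The quadratic discriminant of Eqs. 4.178 through 4.181 is straightforward to evaluate once the training matrices are available. The following is a minimal sketch under the assumptions of this special case (known covariances and the non-informative prior); the function names are ours, and the regressor matrix is built as in the posterior sketch above.

```python
import numpy as np

def var_obc_noninformative(X0, X1, Z0, Z1, Sigma0, Sigma1, c0, c1):
    """Build the quadratic VAR-OBC of Eq. 4.178: returns a function x -> {0, 1}."""
    q = X0.shape[0]
    e1 = np.zeros(Z0.shape[0]); e1[0] = 1.0
    def k_g(X, Z):
        ZZt_inv = np.linalg.inv(Z @ Z.T)
        k = 1.0 / (1.0 + e1 @ ZZt_inv @ e1)       # Eq. 4.176
        g = X @ Z.T @ ZZt_inv @ e1                # Eq. 4.177
        return k, g
    k0, g0 = k_g(X0, Z0); k1, g1 = k_g(X1, Z1)
    S0inv, S1inv = np.linalg.inv(Sigma0), np.linalg.inv(Sigma1)
    H = 0.5 * (k1 * S1inv - k0 * S0inv)                                   # Eq. 4.179
    h = k0 * S0inv @ g0 - k1 * S1inv @ g1                                 # Eq. 4.180
    b = 0.5 * (k1 * g1 @ S1inv @ g1 - k0 * g0 @ S0inv @ g0) + np.log(
        (c0 / c1) * (k0 / k1) ** (q / 2)
        * np.sqrt(np.linalg.det(Sigma1) / np.linalg.det(Sigma0)))          # Eq. 4.181
    return lambda x: 0 if x @ H @ x + h @ x + b >= 0 else 1
```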

The classifier in Eq. 4.178 has a quadratic form similar to Eq. 4.11 except that $\nu_y/(\nu_y + 1)$ and $m_y$ are replaced by $k_y$ and $g_y$, respectively. Once again, we can assume that observations are independent, having been generated from $\mathrm{VAR}_y(0)$ processes for $y = 0, 1$. Then $Z_y = 1_{n_y}^T$ and, therefore, $k_y = n_y/(n_y + 1)$ and $g_y$ is the sample mean of class $y$. Using these in Eqs. 4.179 through 4.181 yields the OBC given in Eqs. 4.11 through 4.14,


which is similar to QDA. Assuming that $k_y = n_y/(n_y + 1) \approx 1$ leads to QDA with known covariance matrix. Further assuming that $\Sigma = \Sigma_0 = \Sigma_1$ leads to LDA with known covariance. In sum, the LDA and QDA classifiers are essentially special cases of the OBC for autoregressive processes under the assumptions of sample independence and a non-informative prior [except for a negligible Bayesian calibration factor $n_y/(n_y + 1)$].

Example 4.6. Assume that the class prior probabilities are equal ($c_0 = c_1$), $n_0 = n_1 \in [20, 100]$, $q = 3$, $p = 2$, the intercept terms are $c_0 = 0_q$ and $c_1 = 1_q$, $\Sigma_0 = \Sigma_1 = I_q$,

$$A_1^0 = \begin{bmatrix} 0.8 & 0 & 0 \\ 0.2 & 0.4 & 0.2 \\ 0 & 0 & 0.5 \end{bmatrix}, \qquad (4.182)$$

$$A_1^1 = \begin{bmatrix} 0.5 & 0.1 & 0.1 \\ 0 & 0.6 & 0 \\ 0 & 0 & 0.4 \end{bmatrix}, \qquad (4.183)$$

$$A_2^0 = \begin{bmatrix} 0.9 & 0 & 0 \\ 0 & 0.1 & 0 \\ 0.2 & 0.2 & 0 \end{bmatrix}, \qquad (4.184)$$

$$A_2^1 = \begin{bmatrix} 0.4 & 0.1 & 0 \\ 0.2 & 0.1 & 0 \\ 0 & 0 & 0.5 \end{bmatrix}. \qquad (4.185)$$

The parameters $A_i^y$ have been chosen to make the processes stable. The process $\mathrm{VAR}_y(p)$ is stable when all roots of the reverse characteristic polynomial lie outside the unit circle. To estimate the error of each classifier, we generate 1000 independent observations. The procedure of generating training observations and independent test observations is repeated 500 times to estimate the expected error rate of each classifier for each $n_y$. We consider three classifiers: (1) $\psi_{\mathrm{OBC}}^{\mathrm{NI}}(x)$, which is the OBC with a non-informative prior; (2) $\psi_{\mathrm{OBC}}^{\mathrm{SP}}(x)$, which is the OBC constructed from a "strong prior" using the general results in Eqs. 4.152 through 4.154, in which the prior means $m_y$ are set using the parameters of the $A_i^y$, namely, $m_y = a_y = \mathrm{vec}(A_y)$, where $A_y = [c_y, A_1^y, \ldots, A_p^y]$, and the covariance matrix of the prior is an identity matrix; and (3) $\psi_{\mathrm{LDA}}(x)$, which is the LDA classifier, and which approximates the OBC from a non-informative prior and independent observations. Figure 4.17 shows the performances of the three classifiers as functions of the sample size $n_0 = n_1$.
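The stability condition mentioned above can be checked numerically through the companion-form eigenvalues, which is equivalent to the reverse-characteristic-polynomial condition. The following is a minimal sketch (our own helper, with arbitrary stable coefficient matrices in the usage line); it can be applied to any candidate parameters, including those of Example 4.6.

```python
import numpy as np

def var_is_stable(A_list):
    """VAR(p) stability check: all eigenvalues of the companion matrix must lie
    strictly inside the unit circle (equivalently, all roots of the reverse
    characteristic polynomial lie outside it)."""
    q, p = A_list[0].shape[0], len(A_list)
    companion = np.zeros((q * p, q * p))
    companion[:q, :] = np.hstack(A_list)        # [A_1 A_2 ... A_p] in the top block row
    companion[q:, :-q] = np.eye(q * (p - 1))    # shifted identity below
    return np.max(np.abs(np.linalg.eigvals(companion))) < 1.0

print(var_is_stable([0.4 * np.eye(3), 0.2 * np.eye(3)]))  # True for this stable example
```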



Figure 4.17 Performances of three classifiers with respect to sample size.

Chapter 5

Optimal Bayesian Risk-based Multi-class Classification

In this chapter we consider classification under multiple classes and allow for different types of error to be associated with different levels of risk or loss. A few classical classification algorithms naturally permit multiple classes and arbitrary loss functions; for example, a plug-in rule takes the functional form for an optimal Bayes decision rule under a given modeling assumption and substitutes sample estimates of model parameters in place of the true parameters. This can be done with LDA and QDA for multiple classes with arbitrary loss functions, which essentially assume that the underlying class-conditional densities are Gaussian with equal or unequal covariances, respectively. Most training-data error estimation methods, for instance, cross-validation, can also be generalized to handle multiple classes and arbitrary loss functions. However, it is expected that the same difficulties encountered under binary classes with simple zero-one loss functions (where the expected risk reduces to the probability of misclassification) will carry over to the more general setting, as they have in ROC curve estimation (Hanczar et al., 2010). Support vector machines are inherently binary but can be adapted to incorporate penalties that influence risk by implementing slack terms or applying a shrinkage or robustifying objective function (Xu et al., 2009a,b). It is also common to construct multi-class classifiers from binary classifiers using the popular one-versus-all or all-versus-all strategies (Bishop, 2006). The former method builds several binary classifiers by discriminating one class, in turn, against all others, and at a given test point reports the class corresponding to the highest classification score. The latter discriminates between each combination of pairs of classes and reports a majority vote. However, it is unclear how one may assess the precise effect of these adaptations on the expected risk. Here we generalize the Bayesian MMSE error estimator, sample-conditioned MSE, and OBC to treat multiple classes with arbitrary loss functions. We will present the analogous concepts of the Bayesian risk estimator, sample-conditioned MSE for risk estimators, and optimal Bayesian risk classifier. We will show that


the Bayesian risk estimator and optimal Bayesian risk classifier can be represented in the same form as the expected risk and Bayes decision rule with unknown true densities replaced by effective densities. This approach is distinct from the simple plug-in rule since the form of the effective densities may not be the same as the individual densities represented in the uncertainty class.

5.1 Bayes Decision Theory

Consider a classification problem in which the aim is to assign one of $M$ classes $y \in \{0, \ldots, M-1\}$ to samples drawn from feature space $\mathcal{X}$. Let $f(y|c)$ be the probability mass function of $Y$, parameterized by a vector $c$, and for each $y$ let $f(x|y, \theta_y)$ be the class-$y$-conditional density of $X$, parameterized by a vector $\theta_y$. The full feature-label distribution is parameterized by $c$ and $\theta = (\theta_0, \ldots, \theta_{M-1})$. Let $L(i, y)$ be a loss function quantifying a penalty in predicting label $i$ when the true label is $y$. The conditional risk in predicting label $i$ for a given point $x$ is defined as

$$\begin{aligned} R(i, x, c, \theta) &= E[L(i, Y) \,|\, x, c, \theta] = \sum_{y=0}^{M-1} L(i, y)\, f(y | x, c, \theta) \\ &= \frac{\sum_{y=0}^{M-1} L(i, y)\, f(x | y, c, \theta)\, f(y | c, \theta)}{f(x | c, \theta)} \\ &= \frac{\sum_{y=0}^{M-1} L(i, y)\, f(y|c)\, f(x | y, \theta_y)}{\sum_{y=0}^{M-1} f(y|c)\, f(x | y, \theta_y)}. \end{aligned} \qquad (5.1)$$

In expectations we denote conditioning on the event $X = x$ by simply conditioning on $x$. Although $c$ and $\theta$ are fixed, here we denote them as conditioned variables too, since they will be assigned priors later in the Bayesian analysis. The expected risk of a given classification rule $\psi : \mathcal{X} \to \{0, \ldots, M-1\}$ is given by

$$\begin{aligned} R(\psi, c, \theta) &= E[R(\psi(X), X, c, \theta) \,|\, c, \theta] = \int_{\mathcal{X}} R(\psi(x), x, c, \theta)\, f(x | c, \theta)\, dx \\ &= \sum_{y=0}^{M-1} \sum_{i=0}^{M-1} L(i, y)\, f(y|c) \int_{R_i} f(x | y, \theta_y)\, dx \\ &= \sum_{y=0}^{M-1} \sum_{i=0}^{M-1} L(i, y)\, f(y|c)\, \varepsilon_n^{i,y}(\psi, \theta_y), \end{aligned} \qquad (5.2)$$

where the classification probability

$$\varepsilon_n^{i,y}(\psi, \theta_y) = \int_{R_i} f(x | y, \theta_y)\, dx \qquad (5.3)$$

$= \Pr(X \in R_i \,|\, y, \theta_y)$ is the probability that a class-$y$ point will be assigned to class $i$ by the classifier $\psi$, the $R_i = \{x \in \mathcal{X} : \psi(x) = i\}$ partition the feature space into decision regions, and the third equality follows from the third equality in Eq. 5.1. A Bayes decision rule (BDR) minimizes expected risk, or equivalently, the conditional risk at each fixed point $x$:

$$\psi_{\mathrm{BDR}}(x) = \arg\min_{i \in \{0, \ldots, M-1\}} R(i, x, c, \theta) = \arg\min_{i \in \{0, \ldots, M-1\}} \sum_{y=0}^{M-1} L(i, y)\, f(y|c)\, f(x | y, \theta_y), \qquad (5.4)$$

where the second equality follows from Eq. 5.1, whose denominator is not a function of $i$. By convention, we break ties with the lowest index $i \in \{0, \ldots, M-1\}$ minimizing $R(i, x, c, \theta)$. The zero-one loss function is given by $L(i, y) = 0$ if $i = y$ and $L(i, y) = 1$ if $i \neq y$. In the binary case with the zero-one loss, the expected risk reduces to the classification error and the BDR is a Bayes classifier.
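For concreteness, a minimal sketch of the BDR of Eq. 5.4 when the model parameters are known is given below. The Gaussian class-conditional densities and the particular loss matrix are illustrative assumptions, not part of the source; ties are broken at the lowest index, matching the convention above.

```python
import numpy as np
from scipy.stats import multivariate_normal

def bayes_decision_rule(x, class_probs, class_densities, loss):
    """Eq. 5.4: pick the label i minimizing sum_y L(i, y) f(y|c) f(x|y, theta_y).
    np.argmin returns the first minimum, so ties go to the lowest index."""
    M = len(class_probs)
    f_xy = np.array([class_densities[y](x) for y in range(M)])
    risks = loss @ (np.asarray(class_probs) * f_xy)   # risks[i] = sum_y L(i,y) c_y f(x|y)
    return int(np.argmin(risks))

# Illustrative three-class example with a non-zero-one loss matrix.
densities = [multivariate_normal(mean=[m, m], cov=np.eye(2)).pdf for m in (0.0, 1.0, 2.0)]
loss = np.array([[0, 1, 2],
                 [1, 0, 1],
                 [2, 1, 0]])
print(bayes_decision_rule(np.array([0.9, 1.1]), [1/3, 1/3, 1/3], densities, loss))
```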

5.2 Bayesian Risk Estimation

In the multi-class framework, we assume that $c$ is the probability mass function of $Y$, that is, $c = [c_0, \ldots, c_{M-1}] \in \Delta^{M-1}$, where $f(y|c) = c_y$ and $\Delta^{M-1}$ is the standard $(M-1)$-simplex defined by $c_y \in [0, 1]$ for $y \in \{0, \ldots, M-1\}$ and $\sum_{y=0}^{M-1} c_y = 1$. Also assume that $\theta_y \in \Theta_y$ for some parameter space $\Theta_y$, and $\theta \in \Theta = \Theta_0 \times \cdots \times \Theta_{M-1}$. Let $C = [C_0, \ldots, C_{M-1}]$ and $T = (T_0, \ldots, T_{M-1})$ denote random vectors for parameters $c$ and $\theta$. We assume that $C$ and $T$ are independent prior to observing data and assign prior probabilities $\pi(c)$ and $\pi(\theta)$. Note the change of notation: up until now, we had let $c$ and $\theta$ denote both the random variables and the parameters. The change is being made to avoid confusion regarding the expectations in this chapter. Let $S_n$ be a random sample, $x_i^y$ the $i$th sample point in class $y$, and $n_y$ the number of sample points observed from class $y$. Given a sample, the priors are updated to posteriors:

$$\pi^*(c, \theta) = f(c, \theta | S_n) \propto \pi(c)\, \pi(\theta) \prod_{y=0}^{M-1} \prod_{i=1}^{n_y} f(x_i^y, y | c, \theta_y), \qquad (5.5)$$


where we have assumed independent sample points, and the product on the right is the likelihood function. Since

$$f(x_i^y, y | c, \theta_y) = f(x_i^y | y, c, \theta_y)\, f(y | c, \theta_y) = c_y\, f(x_i^y | y, \theta_y), \qquad (5.6)$$

where the last equality applies some independence assumptions, we may write

$$\pi^*(c, \theta) = \pi^*(c)\, \pi^*(\theta), \qquad (5.7)$$

where

$$\pi^*(c) = f(c | S_n) \propto \pi(c) \prod_{y=0}^{M-1} (c_y)^{n_y} \qquad (5.8)$$

and

$$\pi^*(\theta) = f(\theta | S_n) \propto \pi(\theta) \prod_{y=0}^{M-1} \prod_{i=1}^{n_y} f(x_i^y | y, \theta_y) \qquad (5.9)$$

are marginal posteriors of $C$ and $T$. Independence between $C$ and $T$ is preserved in the posterior. When the prior density is proper, this all follows from Bayes' theorem; otherwise, Eqs. 5.8 and 5.9 are taken as definitions, where we require posteriors to be proper.

Given a Dirichlet prior on $C$ with hyperparameters $a = [a_0, \ldots, a_{M-1}]$, under random sampling the posterior on $C$ is still Dirichlet with hyperparameters $a_y^* = a_y + n_y$. Defining

$$a_+^* = \sum_{i=0}^{M-1} a_i^*, \qquad (5.10)$$

we have

$$E[C_y | S_n] = \frac{a_y^*}{a_+^*}, \qquad (5.11)$$

$$E[C_y^2 | S_n] = \frac{a_y^* (1 + a_y^*)}{a_+^* (1 + a_+^*)}, \qquad (5.12)$$

and, for $y \neq z$,

$$E[C_y C_z | S_n] = \frac{a_y^* a_z^*}{a_+^* (1 + a_+^*)}. \qquad (5.13)$$
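A minimal numerical sketch of the Dirichlet posterior moments in Eqs. 5.10 through 5.13 follows; the function name is ours, and the hyperparameter values in the usage line are illustrative.

```python
import numpy as np

def dirichlet_posterior_moments(a, n):
    """First and second posterior moments of the class probabilities C under a
    Dirichlet prior with hyperparameters a and class counts n (Eqs. 5.10-5.13)."""
    a_star = np.asarray(a, dtype=float) + np.asarray(n, dtype=float)
    a_plus = a_star.sum()
    mean = a_star / a_plus                                          # E[C_y | S_n]
    second = np.outer(a_star, a_star) / (a_plus * (1 + a_plus))      # E[C_y C_z | S_n], y != z
    np.fill_diagonal(second,
                     a_star * (1 + a_star) / (a_plus * (1 + a_plus)))  # E[C_y^2 | S_n]
    return mean, second

mean, second = dirichlet_posterior_moments(a=[1, 1, 1], n=[10, 4, 6])
```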

5.3 Optimal Bayesian Risk Classification

A Bayesian risk estimator (BRE) is an MMSE estimate of the expected risk, or equivalently, the conditional expectation of the expected risk given the observations. Given a sample $S_n$ and a classifier $\psi$ that is not informed by $\theta$, owing to posterior independence between $C$ and $T$, the BRE is given by

$$\hat{R}(\psi, S_n) = E[R(\psi, C, T) | S_n] = \sum_{y=0}^{M-1} \sum_{i=0}^{M-1} L(i, y)\, E[f(y|C) | S_n]\, E[\varepsilon_n^{i,y}(\psi, T_y) | S_n]. \qquad (5.14)$$

Keeping in mind that $f_{\theta_y}(x|y) = f(x | y, \theta_y)$, the effective density $f_\Theta(x|y)$ for class $y$ is defined in Eq. 2.26, where we assume that $(X, Y)$ and $S_n$ are independent given $C$ and $T$ (once the parameters are known, the sample does not add information about the test point). We can also consider the effective density

$$f_\Delta(y) = \int_{\Delta^{M-1}} f(y|c)\, \pi^*(c)\, dc. \qquad (5.15)$$

The effective densities are expressed via expectation by

$$f_\Delta(y) = E_C[f(y|C) | S_n] = E[C_y | S_n] = E_{\pi^*}[C_y], \qquad (5.16)$$

$$f_\Theta(x|y) = E_{T_y}[f(x | y, T_y) | S_n]. \qquad (5.17)$$

We may thus write the BRE in Eq. 5.14 as

$$\hat{R}(\psi, S_n) = \sum_{y=0}^{M-1} \sum_{i=0}^{M-1} L(i, y)\, f_\Delta(y)\, \hat{\varepsilon}_n^{i,y}(\psi, S_n), \qquad (5.18)$$

where

$$\hat{\varepsilon}_n^{i,y}(\psi, S_n) = E[\varepsilon_n^{i,y}(\psi, T_y) | S_n]. \qquad (5.19)$$

The latter can be expressed via the effective density $f_\Theta(x|y)$ as

$$\hat{\varepsilon}_n^{i,y}(\psi, S_n) = E\!\left[ \int_{R_i} f(x | y, T_y)\, dx \,\Big|\, S_n \right] = \int_{R_i} E[f(x | y, T_y) | S_n]\, dx = \int_{R_i} f_\Theta(x|y)\, dx, \qquad (5.20)$$

where the second equality follows from Fubini's theorem. Hence,

$$\hat{\varepsilon}_n^{i,y}(\psi, S_n) = \Pr(X \in R_i \,|\, y, S_n) \qquad (5.21)$$
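When no closed form is available, Eqs. 5.18 through 5.21 suggest a Monte Carlo approximation of the BRE by sampling from the effective densities, as discussed later in this section. The following is a minimal sketch of that idea under the assumption that a sampler for each effective density is available; the samplers, classifier, and parameter values in the usage lines are illustrative and not from the source.

```python
import numpy as np

def approx_bre(classifier, loss, f_delta, effective_samplers, n_draws=100_000, seed=0):
    """Approximate the BRE of Eq. 5.18: estimate each eps_hat_n^{i,y} (Eq. 5.21) as the
    proportion of synthetic class-y points, drawn from the effective density, that the
    classifier assigns to class i; then weight by the loss and f_Delta(y)."""
    rng = np.random.default_rng(seed)
    M = len(f_delta)
    eps_hat = np.zeros((M, M))                   # eps_hat[i, y]
    for y in range(M):
        x = effective_samplers[y](rng, n_draws)  # n_draws x D synthetic sample
        labels = classifier(x)                   # predicted classes
        for i in range(M):
            eps_hat[i, y] = np.mean(labels == i)
    f_delta = np.asarray(f_delta)
    return float(np.sum(loss * (f_delta[None, :] * eps_hat)))

# Example: two Gaussian effective densities, a linear classifier, zero-one loss.
samplers = [lambda rng, n: rng.normal(0.0, 1.0, size=(n, 1)),
            lambda rng, n: rng.normal(1.5, 1.0, size=(n, 1))]
clf = lambda x: (x[:, 0] > 0.75).astype(int)
print(approx_bre(clf, loss=1 - np.eye(2), f_delta=[0.5, 0.5], effective_samplers=samplers))
```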


is the posterior probability of assigning a class-y point to class i. Comparing Eqs. 5.2 and 5.18, observe that f D ðyÞ and f U ðxjyÞ play roles analogous to f ðyjcÞ and f ðxjy, uy Þ, respectively, in Bayes decision theory. Before proceeding, we wish to remark on the various notations being employed. We will be manipulating expressions involving various densities and conditional densities, and, as is common, we will generally denote them by f . For instance, we may write the prior and posterior as pðuÞ ¼ f ðuÞ and p ðuÞ ¼ f ðujS n Þ, respectively. We will also consider f ðyjS n Þ and f ðxjy, S n Þ. Regarding the former, Z f ðyjS n Þ ¼ ¼

ZD

M1

DM1

f ðy, cjS n Þdc f ðyjc, S n Þf ðcjS n Þdc

Z ¼

DM1

Z ¼

DM1

f ðyjcÞf ðcjS n Þdc

(5.22)

f ðyjcÞp ðcÞdc

¼ f D ðyÞ: Similarly, f ðxjy, S n Þ ¼ f U ðxjyÞ:

(5.23)

Whereas the BRE addresses overall classifier performance across the entire feature space X , we may also consider classification at a fixed point x ∈ X . This is analogous to the probability of misclassification PrðY ≠ cðxÞjX ¼ x, S n Þ derived in Section 2.3. We define the Bayesian conditional risk estimator (BCRE) for class i ∈ f0, : : : , M  1g at point x ∈ X to be the MMSE estimate of the conditional risk for label i given the sample S n and the current test point X ¼ x: b x, S n Þ ¼ E½Rði, x, C, TÞjS n , x Rði, ¼

M1 X

Lði, yÞE½ f ðyjx, C, TÞjS n , x

y¼0

¼

M1 X

(5.24)

Lði, yÞE½PrðY ¼ yjx, C, TÞjS n , x:

y¼0

The expectations are over a posterior on C and T updated with both S n and the unlabeled test point in question x. Now,

Optimal Bayesian Risk-based Multi-class Classification

E½PrðY ¼ yjx, C, TÞjS n , x Z Z PrðY ¼ yjx, c, uÞ f ðc, ujS n , xÞdcdu ¼ U DM1 Z Z f ðxjc, u, S n Þf ðc, ujS n Þ dcdu ¼ PrðY ¼ yjx, c, uÞ M1 f ðxjS n Þ U D Z Z ∝ PrðY ¼ yjx, c, uÞf ðxjc, u, S n Þf ðc, ujS n Þdcdu U DM1 Z Z PrðY ¼ yjx, c, uÞ f ðxjc, uÞp ðcÞp ðuÞdcdu, ¼ U

243

(5.25)

DM1

where the second equality follows from Bayes’ theorem, the proportionality in the third line follows because the denominator does not depend on C or T, and the fourth line follows (1) because X is independent of S n given C and T, and (2) from the definition of the posterior before we update with the test point x. Applying Bayes’ theorem again, E½PrðY ¼ yjx, C, TÞjS n , x Z Z cy f ðxjy, uy Þ f ðxjc, uÞp ðcÞp ðuÞdcdu ∝ M1 f ðxjc, uÞ U D Z Z  ¼ cy p ðcÞdc f ðxjy, uy Þp ðuy Þduy DM1

(5.26)

Uy

¼ Ep ½C y  f U ðxjyÞ ¼ f D ðyÞf U ðxjyÞ: The normalization constant is obtained by forcing the sum over y to be 1. Thus, f D ðyÞf U ðxjyÞ E½PrðY ¼ yjx, C, TÞjS n , x ¼ PM1 : i¼0 f D ðiÞf U ðxjiÞ Applying this to Eq. 5.24 yields PM1 Lði, yÞf D ðyÞf U ðxjyÞ b x, S n Þ ¼ y¼0 PM1 Rði, : k¼0 f D ðkÞf U ðxjkÞ

(5.27)

(5.28)

This is analogous to Eq. 5.1 in Bayes decision theory. Furthermore, given a classifier c with decision regions R0 , : : : , RM1 , 1 Z h i M X b b x, S n Þf ðxjS n Þdx, E RðcðXÞ, X, S n ÞjS n ¼ Rði, i¼0

Ri

(5.29)

244

Chapter 5

where the expectation is over X ðnot C or TÞ given S n . Note that f ðxjS n Þ ¼

M 1 X

f ðyjS n Þf ðxjy, S n Þ

y¼0

¼

M1 X

(5.30) f D ðyÞ f U ðxjyÞ

y¼0

is the marginal distribution of x given S n . Proceeding, PM1 i M1 h XZ y¼0 Lði, yÞf D ðyÞf U ðxjyÞ b PM1 E RðcðXÞ, X, S n ÞjS n ¼ f ðxjS n Þdx k¼0 f D ðkÞf U ðxjkÞ i¼0 Ri M1 X X Z M1 ¼ Lði, yÞf D ðyÞf U ðxjyÞdx i¼0

¼

Ri y¼0

M 1 M 1 X X

Z Lði, yÞf D ðyÞ

y¼0 i¼0

¼

M1 X X M1

Ri

f U ðxjyÞdx

Lði, yÞf D ðyÞˆεni,y ðc, S n Þ

y¼0 i¼0

b ¼ Rðc, S n Þ: (5.31) Hence, the BRE of c is the mean of the BCRE across the feature space. For binary classification, εˆ ni,y ðc, S n Þ has been solved in closed form as components of the Bayesian MMSE error estimator for both discrete models under arbitrary classifiers and Gaussian models under linear classifiers, so the BRE with an arbitrary loss function is available in closed form for these models. When closed-form solutions for εˆ ni,y ðc, S n Þ are not available, from Eq. 5.20, εˆ ni,y ðc, S n Þ may be approximated for all i and a given fixed y by drawing a large synthetic sample from f U ðxjyÞ and evaluating the proportion of points assigned to class i. The final approximate BRE can be found by plugging the approximate εˆ ni,y ðc, S n Þ for each y and i into Eq. 5.18. A number of practical considerations for Bayesian MMSE error estimators addressed under binary classification naturally carry over to multiple classes, including robustness to false modeling assumptions. Furthermore, classical frequentist consistency holds for BREs on fixed distributions in the parameterized family owing to the convergence of posteriors in both the discrete and Gaussian models.

Optimal Bayesian Risk-based Multi-class Classification

245

An optimal Bayesian risk classifier (OBRC) minimizes the BRE: b S n Þ, cOBRC ¼ arg min Rðc, c∈C

(5.32)

where C is a family of classifiers. If C is the set of all classifiers with measurable decision regions, then cOBRC exists and is given for any x ∈ X by cOBRC ðxÞ ¼ arg ¼ arg

min

i∈f0, : : : , M1g

min

i∈f0, : : : , M1g

b x, S n Þ Rði, M 1 X

Lði, yÞf D ðyÞf U ðxjyÞ:

(5.33)

y¼0

The OBRC minimizes the average loss weighted by f D ðyÞf U ðxjyÞ. For future reference we denote this weighted loss by AL ðiÞ. The OBRC has the same functional form as the BDR with f D ðyÞ substituted for the true class probability f ðyjcÞ and f U ðxjyÞ substituted for the true density f ðxjy, uy Þ for all y. Closed-form OBRC representation is available for any model in which f U ðxjyÞ has been found, including discrete and Gaussian models. A number of properties also carry over, including invariance to invertible transformations, pointwise convergence to the Bayes classifier, and robustness to false modeling assumptions. For binary classification and zero-one loss, b Rðc, SnÞ ¼

1 X 1 X

Lði, yÞf D ðyÞˆεni,y ðc, S n Þ

y¼0 i¼0

¼ f D ð0Þˆε1,0 ε0,1 n ðc, S n Þ þ f D ð1Þˆ n ðc, S n Þ Z Z f U ðxj0Þdx þ f D ð1Þ f U ðxj1Þdx, ¼ f D ð0Þ R1

(5.34)

R0

which is identical to Eq. 2.27 since f D ð0Þ ¼ Ep ½C 0 . Hence, the BRE reduces to the Bayesian MMSE error estimator, and the OBRC reduces to the OBC.

5.4 Sample-Conditioned MSE of Risk Estimation In a typical small-sample classification scenario, a classifier is trained from data and a risk estimate is found for the true risk of this classifier. We can measure the closeness of the risk estimate to the actual risk via the sampleconditioned MSE of the BRE relative to the true expected risk: b b MSEðRðc, S n ÞjS n Þ ¼ E½ðRðc, C, TÞ  Rðc, S n ÞÞ2 jS n  ¼ varðRðc, C, TÞjS n Þ:

(5.35)

246

Chapter 5

This MSE is precisely the quantity that the BRE minimizes, and it quantifies b as an estimator of R, conditioned on the actual sample in the accuracy of R hand. Owing to posterior independence between C and T, it can be decomposed as b MSEðRðc, S n ÞjS n Þ # " M1 X M1 X M1 X X M1 b 2 ðc, S n Þ, Lði, yÞLðj, zÞEC ½C y C z jS n mi,y,zj ðc, S n Þ  R ¼ y¼0 z¼0 i¼0

j¼0

(5.36) where   j,z mi,y,zj ðc, S n Þ ¼ E εi,y n ðc, Ty Þεn ðc, Tz ÞjS n

(5.37)

j,z is the posterior mixed moment of εi,y n ðc, Ty Þ and εn ðc, Tz Þ, and where we have applied Eq. 5.2 in Eq. 5.35, and used the fact that

E½ f ðyjCÞf ðzjCÞjS n  ¼ E½C y C z jS n :

(5.38)

Second-order moments of C y depend on the prior for C; for instance, under Dirichlet posteriors they are given by Eqs. 5.12 and 5.13. Hence, evaluating b the conditional MSE of the BRE reduces to evaluating the BRE Rðc, S n Þ and i, j evaluating posterior mixed moments my,z ðc, S n Þ. Furthermore, if we additionally assume that T0 , : : : , TM1 are pairwise independent, then when y ≠ z, mi,y,zj ðc, S n Þ ¼ εˆ ni,y ðc, S n Þˆεnj,z ðc, S n Þ,

(5.39)

where εˆ ni,y ðc, S n Þ, given in Eq. 5.20, is a component of the BRE. b • ðc, S n Þ may be The conditional MSE of an arbitrary risk estimate R found from the BRE and the MSE of the BRE: b • ðc, S n ÞjS n Þ ¼ E½ðRðc, C, TÞ  R b • ðc, S n ÞÞ2 jS n  MSEðR b b b • ðc, S n Þ2 : ¼ MSEðRðc, S n ÞjS n Þ þ ½Rðc, SnÞ  R (5.40) In this form, the optimality of the BRE is clear.

5.5 Efficient Computation i, j The following representation for my,z ðc, S n Þ is useful in both deriving analytic forms for, and approximating, the MSE. Via Eq. 5.3 and Fubini’s theorem,

Optimal Bayesian Risk-based Multi-class Classification

Z Z mi,y,zj ðc,

SnÞ ¼

U

Z f ðxjy, uy Þdx

Ri

Ri

Rj

Ri

Rj

Z Z ¼

f ðwjz, uz Þdwp ðuÞdu

Rj

Z Z Z ¼

247

U

f ðxjy, uy Þf ðwjz, uz Þp ðuÞdudwdx

(5.41)

f U ðx, wjy, zÞdwdx,

where the effective joint density in the last line is defined by Z f U ðx, wjy, zÞ ¼

U

f ðxjy, uy Þf ðwjz, uz Þp ðuÞdu:

(5.42)

In terms of probability, mi,y,zj ðc, S n Þ ¼ PrðX ∈ Ri , W ∈ Rj jy, z, S n Þ,

(5.43)

where X and W are random vectors drawn from f U ðx, wjy, zÞ. The marginal densities of X and W under f U ðx, wjy, zÞ are precisely the effective densities: Z Z

Z X

f ðxjy, uy Þf ðwjz, uz Þp ðuÞdudw Z Z f ðxjy, uy Þ f ðwjz, uz Þdwp ðuÞdu ¼ U X Z f ðxjy, uy Þp ðuy Þduy ¼

f U ðx, wjy, zÞdw ¼

X

U

(5.44)

Uy

¼ f U ðxjyÞ: Furthermore, we have an effective conditional density of W given X: f U ðx, wjy, zÞ f U ðxjyÞ Z Z f ðxjy, uy Þp ðuy , uz Þ f ðwjz, uz Þ R ¼ 0 0 duy duz :  0 Uz Uy Uy f ðxjy, uy Þp ðuy Þduy

f U ðwjx, y, zÞ ¼

(5.45)

As usual, p ðuÞ represents the posterior relative to S n , but, in addition, let pS n ∪fðx, yÞg ðuÞ represent the posterior relative to S n ∪ fðx, yÞg. Expanding the preceding expression yields

248

Chapter 5

f U ðwjx, y, zÞ Z Z ¼ f ðwjz, uz Þ R Uz

Z ¼

Uz

Z ¼

Uz

Uy

Z

Uy

R Uz

f ðxjy, uy Þp ðuy , uz Þ 0 0 0 0 duy duz  0 Uy f ðxjy, uy Þp ðuy , uz Þduy duz

f ðwjz, uz ÞpS n ∪fðx, yÞg ðuy , uz Þduy duz

(5.46)

f ðwjz, uz ÞpSn ∪fðx, yÞg ðuz Þduz

¼ f ðwjz, S n ∪ fðx, yÞgÞ, where we have used the fact that the fractional term in the integrand of the first equality is equivalent to the posterior updated with a new independent sample point with feature vector x and label y, and the expression in the last line is the effective density relative to S n ∪ fðx, yÞg. The effective joint density may be easily found once the effective density is known. Furthermore, from Eq. 5.43 we may approximate mi,y,zj ðc, S n Þ by drawing a large synthetic sample from f U ðxjyÞ, drawing a single point w from the effective conditional density f ðwjz, S n ∪ fðx, yÞgÞ for each x, and evaluating the proportion of pairs ðx, wÞ for which x ∈ Ri and w ∈ Rj . Additionally, since x is marginally governed by the effective density, from Eq. 5.21 we may approximate εˆ ni,y ðc, S n Þ by evaluating the proportion of x in Ri . Evaluating the OBRC, BRE, and conditional MSE requires obtaining E½C y jS n , E½C 2y jS n , and E½C y C z jS n  based on the posterior for C, and finding the effective density f U ðxjyÞ and the effective joint density f U ðx, wjy, zÞ based on the posterior for T. At a fixed point x, one may then evaluate the BCRE from Eq. 5.28. The OBRC is then found by choosing the class i that minimizes AL ðiÞ. For any classifier, the BRE is given by Eq. 5.18 with εˆ ni,y ðc, S n Þ given by Eq. 5.20 (or equivalently Eq. 5.21) using the effective density f U ðxjyÞ. The MSE of the BRE is then given by Eq. 5.36, where mi,y,zj ðc, S n Þ is given by Eq. 5.39 when U0 , : : : , UM1 are pairwise independent and y ≠ z, and mi,y,zj ðc, S n Þ is otherwise found from Eq. 5.41 (or equivalently Eq. 5.43) using the effective joint density f U ðx, wjy, zÞ. The MSE of an arbitrary risk estimator can also be found from Eq. 5.40 using the BRE and the MSE for the BRE.

5.6 Evaluation of Posterior Mixed Moments: Discrete Model Drawing on results from binary classification, in this section we evaluate the posterior mixed moments mi,y,zj ðc, S n Þ in the discrete model.

Optimal Bayesian Risk-based Multi-class Classification

249

Consider a discrete feature space X ¼ f1, 2, : : : , bg. Let pyx be the probability that a point from class y is observed in bin x ∈ X , and let U yx be the number of sample points observed from class y in bin x. Note that P ny ¼ bx¼1 U yx . The discrete Bayesian model defines Ty ¼ ½Py1 , : : : , Pyb  with parameter space Uy ¼ Db1 . For each y, we define Dirichlet priors on Ty with hyperparameters ay ¼ ½ay1 , : : : , ayb  by pðuy Þ ∝

b Y

y

ðpyx Þax 1 :

(5.47)

x¼1

Assume that the Ty are mutually independent. According to Eq. 2.52, the y y posteriors are again Dirichlet with updated hyperparameters ay x ¼ ax þ U x for all x and y. For proper posteriors, ay x . 0 for all x and y. According to Eq. 2.53, the effective density is f U ðxjyÞ ¼

ay x , ay þ

(5.48)

ay x :

(5.49)

where ay þ ¼

b X x¼1

Thus, εˆ ni,y ðc, S n Þ ¼

b X ay x x¼1

ay þ

IcðxÞ¼i :

(5.50)

The effective joint density f U ðx, wjy, zÞ for y ¼ z can be found from properties of Dirichlet distributions. For any y ∈ f0, : : : , M  1g and x, w ∈ X , by Theorem 3.4, f U ðx, wjy, yÞ ¼ E½Pyx Pyw jS n  ¼

y ay x ðaw þ dxw Þ : y ay þ ðaþ þ 1Þ

(5.51)

From Eq. 5.41, mi,y,yj ðc, S n Þ ¼ ¼

y b X b X ay x ðaw þ dxw Þ

IcðxÞ¼i IcðwÞ¼j y ay þ ðaþ þ 1Þ x¼1 w¼1 ˆ j,y εˆ ni,y ðc, S n Þ½ay n ðc, S n Þ þ dij  þε : y aþ þ 1

(5.52)

This generalizes Eq. 3.23. When y ≠ z, mi,y,zj ðcÞ may be found from Eq. 5.39.

250

Chapter 5

5.7 Evaluation of Posterior Mixed Moments: Gaussian Models We now consider Gaussian models with mean vectors my and covariance matrices Sy . All posteriors on parameters, effective densities, and effective joint densities found in this section apply under arbitrary multi-class classifiers. We also find analytic forms for posterior mixed moments, the BRE, and the conditional MSE under binary linear classifiers c of the form  cðxÞ ¼

0 if gðxÞ ≤ 0, 1 otherwise,

(5.53)

where gðxÞ ¼ aT x þ b for some vector a and scalar b. Under arbitrary classifiers, the BRE and conditional MSE may be approximated using techniques described in Section 5.5. 5.7.1 Known covariance Assume that Sy is known and is a valid invertible covariance matrix. Then Ty ¼ my with parameter space Uy ¼ RD . We assume that the my are mutually independent and use the prior in Eq. 2.58: pðmy Þ ∝ jSy j

12

 ny T 1 exp  ðmy  my Þ Sy ðmy  my Þ , 2

(5.54)

where ny ∈ R and my ∈ RD . The posterior is of the same form as the prior, with updated hyperparameters ny and my given in Eqs. 2.65 and 2.66, respectively. We require that ny . 0 for a proper posterior. The effective density is given in Eq. 2.96 of Theorem 2.6. To find the BRE we require εˆ ni,y ðc, S n Þ. Under linear classifiers, this is essentially given by Theorem 2.10, except that the exponent of 1 is altered to obtain 1 sffiffiffiffiffiffiffiffiffiffiffiffiffi iþ1  ny C Bð1Þ gðmy Þ εˆ ni,y ðc, S n Þ ¼ F@ qffiffiffiffiffiffiffiffiffiffiffiffiffi A: ny þ 1 aT S a 0

(5.55)

y

To find the MSE under linear classification, note that f U ðwjx, y, zÞ is of the same form as f U ðxjyÞ with posterior hyperparameters updated with ðx, yÞ as a new sample point. Hence, for y ¼ z,  x  my ny þ 2  f U ðwjx, y, yÞ  N my þ  , S , ny þ 1 ny þ 1 y

(5.56)

Optimal Bayesian Risk-based Multi-class Classification

251

and, as also shown in Theorem 3.6, the effective joint density is thus given by 31 0 2    ny þ1 1 S S   y y m n ny 5A: (5.57) f U ðx, wjy, yÞ  N @ y , 4 1y ny þ1 my S S   y y ny ny Now let P ¼ ð1Þi gðXÞ and Q ¼ ð1Þj gðWÞ. Since X and W are governed by the effective joint density in Eq. 5.57, the effective joint density of P and Q is 2  31 0 ny þ1 T   ð1Þiþj T i  a S a a S a y y ð1Þ gðmy Þ 4 ny ny 5A: (5.58) , ð1Þiþj f U ðp, qjy, yÞ  N @ ny þ1 T T ð1Þj gðmy Þ a S a a S a   y y ny ny Hence, from Eq. 5.43, i, j ðc, S n Þ ¼ PrðP ≤ 0, Q ≤ 0jy, y, S n Þ my,y 0 1 sffiffiffiffiffiffiffiffiffiffiffiffiffi sffiffiffiffiffiffiffiffiffiffiffiffiffi i  ny ð1Þj gðmy Þ ny ð1Þiþj C B ð1Þ gðmy Þ ¼ F@ qffiffiffiffiffiffiffiffiffiffiffiffiffi ,  qffiffiffiffiffiffiffiffiffiffiffiffiffi ,  A,   ny þ 1 ny þ 1 ny þ 1 aT S a aT S a y

y

(5.59) where Fð⋅, ⋅ , rÞ is the joint CDF of two standard normal random variables with correlation r. A special case of Eq. 5.59 is also provided in Theorem 3.8. When y ≠ z, mi,y,zj ðc, S n Þ is found from Eq. 5.39. 5.7.2 Homoscedastic general covariance Consider the homoscedastic model with general covariance, where Ty ¼ ½my , S, the parameter space of my is RD , and the parameter space of S consists of all symmetric positive definite matrices. Further, assume a conjugate prior in which the my are mutually independent given S so that " # M1 Y pðuÞ ¼ pðmy jSÞ pðSÞ, (5.60) y¼0

where pðmy jSÞ is as in Eq. 5.54 with hyperparameters ny ∈ R and my ∈ RD , and  1 1 kþDþ1 2 pðSÞ ∝ jSj exp  SS (5.61) 2 with hyperparameters k ∈ R and S, a symmetric D  D matrix. Note that Eq. 5.60 is the multi-class extension of Eq. 2.77 in which we are assuming the general-covariance model l ¼ S (although we need not have). All comments

252

Chapter 5

in Chapter 2 regarding these priors apply. In particular, if ny . 0, then ðmy jSÞ is Gaussian with mean my and covariance S∕ny , and if k . D  1 and S is positive definite, then pðSÞ is an inverse-Wishart distribution with hyperparameters k and S. Similar to Theorem 2.5, the posterior is of the same form as the prior with ny , my , and k given in Eqs. 2.65, 2.66, and 2.80, respectively, and S ¼ S þ

M1 X

b þ ny ny ðb ðny  1ÞS my  my Þðb m y  m y ÞT : y n þ n y y y¼0

(5.62)

b for b y and sample variance S Evaluating Eq. 5.62 requires the sample mean m y all classes y ¼ 0, : : : , M  1. The posteriors are proper if ny . 0, k . D  1, and S is positive definite. According to Theorem 2.9, the effective density for class y is multivariate t with k ¼ k  D þ 1 degrees of freedom, location vector my , and scale matrix ½ðny þ 1Þ∕ðkny ÞS :  ny þ 1   S,k : (5.63) f U ðxjyÞ  t my , kny To find the BRE under a binary linear classifier of the form in Eq. 5.53, let P ¼ ð1Þi gðXÞ. Since P is an affine transformation of a multivariate t-random vector, it has a non-standardized Student’s t-distribution (Kotz and Nadarajah, 2004):  ny þ 1 2 f U ðpjyÞ  t miy , g ,k , (5.64) kny where miy ¼ ð1Þi gðmy Þ, and g 2 ¼ aT S a. The CDF ϒ of a non-standardized Student’s t-distribution with d degrees of freedom, location parameter m, and scale parameter s2 is well known, and  1 sgnðmÞ m2 1 d I ; , ϒð0Þ ¼  , (5.65) 2 2 m2 þ ds2 2 2 where I ðx; a, bÞ is an incomplete regularized beta function (Johnson et al., 1995). Hence, ! m2iy 1 sgnðmiy Þ 1 k i,y : (5.66) I ; , εˆ n ðc, S n Þ ¼  n þ1 2 2 m2 þ y  g 2 2 2 iy

ny

This result is an extension of Theorem 2.12. The MSE found in Chapter 3 assumes independent covariances, and thus does not apply here. Instead, the effective conditional density for y ¼ z is

Optimal Bayesian Risk-based Multi-class Classification

253

solved by updating all of the hyperparameters associated with class y with the new sample point ðx, yÞ, resulting in  x  my ny þ 2   , ½S þ Sy ðxÞ, k þ 1 , f U ðwjx, y, yÞ  t my þ  ny þ 1 ðk þ 1Þðny þ 1Þ (5.67) where Sy ðxÞ ¼

ny ðx  my Þðx  my ÞT : ny þ 1

(5.68)

For y ≠ z, f U ðwjx, y, zÞ is of the same form as the effective density with only hyperparameters k and S updated:  f U ðwjx, y, zÞ  t mz ,

nz þ 1  ½S þ S ðxÞ, k þ 1 : y ðk þ 1Þnz

(5.69)

The next lemma is used to derive the effective joint density. Lemma 5.1 (Dalton and Yousefi, 2015). Suppose that X is a multivariate t-random vector with density  ny þ 1 2  f ðxÞ  t my , g ID , k : (5.70) kny Further, suppose that W conditioned on X ¼ x is multivariate t with density 0 1 h i ny 2  T  x  my J g þ ny þ1 ðx  my Þ ðx  my Þ B C f ðwjxÞ  t@mz þ I  , ID , k þ DA, ny þ 1 kþD (5.71) where either I ¼ 0 and J ¼ ðnz þ 1Þ∕nz , or jI j ¼ 1 and J ¼ ðny þ 2Þ∕ðny þ 1Þ. Then, the joint density is multivariate t: # !    2 " ny þ1 1 I I I   g my D D ny ny ,k , (5.72) f ðx, wÞ  t , mz k I n1 ID KI D y where K ¼ ðnz þ 1Þ∕nz when I ¼ 0 and K ¼ ðny þ 1Þ∕ny when jI j ¼ 1.

254

Chapter 5

Proof. After some simplification, one can show that f ðx, wÞ ¼ f ðxÞf ðwjxÞ  kþD ny 2  T 2 1  ðx  my Þ ðg ID Þ ðx  my Þ ∝ 1þ  ny þ 1 kþ2D1   2  ny 2  T   ðx  my Þ ðx  my ÞID    g ID þ  ny þ 1 kþ2D    2  n 1 y 2  T  T ðx  my Þ ðx  my ÞID þ vv   g ID þ  , ny þ 1 J where v ¼ w  mz  I

x  my : ny þ 1

(5.73)

(5.74)

Further simplification yields f ðx, wÞ  ∝ g2 þ

ny ðx  my ÞT ðx  my Þ ny þ 1    x  my T x  my kþ2D 2 1 þ : w  mz  I  w  mz  I  ny þ 1 ny þ 1 J

If I ¼ 0, then it can be shown that      kþ2D 2 x  my T 1 x  my f ðx, wÞ ∝ 1 þ L ,   w  mz w  mz

(5.75)

(5.76)

where " n þ1 y



ny

g 2 ID

0D

0D

# :

nz þ1 2 nz g ID

Similarly, if jI j ¼ 1, it can be shown that      kþ2D 2 x  my T 1 x  my L , f ðx, wÞ ∝ 1 þ w  mz w  mz

(5.77)

(5.78)

where " ny þ1 L¼ which completes the proof.

ny I ny

g 2 ID

g 2 ID

I 2 ny g ID  ny þ1 2 ny g ID

# ,

(5.79) ▪

Optimal Bayesian Risk-based Multi-class Classification

255

To find the conditional MSE of the BRE under binary linear classification, let P ¼ ð1Þi gðXÞ and Q ¼ ð1Þj gðWÞ. For y ¼ z, and p ¼ ð1Þi gðxÞ, f U ðqjx, y, yÞ    p  miy ny þ 2 ny iþj 2 2  t mjy þ ð1Þ , g þ  ðp  miy Þ , k þ 1 , ny þ 1 ðk þ 1Þðny þ 1Þ ny þ 1 (5.80) where we have used the fact that ð1Þi aT ðx  my Þ ¼ p  my . When y ≠ z,   ny nz þ 1 2 2 f U ðqjx, y, zÞ  t mjz , g þ  ðp  miy Þ , k þ 1 : ny þ 1 ðk þ 1Þnz 

(5.81)

Since dependency on X has been reduced to dependency on only P in both of the above distributions, we may write f U ðqjx, y, zÞ ¼ f U ðqjp, y, zÞ for all y and z. The lemma produces an effective joint density given an effective density and an effective conditional density of a specified form. Indeed, the distributions f U ðpjyÞ and f U ðqjp, y, yÞ are precisely in the form required by the lemma with D ¼ 1. Hence, ½P, QT follows a bivariate t-distribution when y ¼ z, 2  2 ny þ1  g m 4 ny iþj f U ðp, qjy, yÞ  t@ iy , ð1Þ mjy k  0



ny

ð1Þiþj ny ny þ1 ny

3

1

5 , k A,

(5.82)

and when y ≠ z,  f U ðp, qjy, zÞ  t

 2 " ny þ1 g miy ny , mjz k 0

0 nz þ1 nz

#

! ,k :

(5.83)

Thus, mi,y,yj ðc, S n Þ can be found from Eq. 5.43. In particular, when y ¼ z, mi,y,yj ðc, S n Þ ¼ PrðP ≤ 0, Q ≤ 0jy, y, S n Þ 1 0 sffiffiffiffiffiffiffiffiffiffiffiffiffi sffiffiffiffiffiffiffiffiffiffiffiffiffi   iþj m kn m kn ð1Þ y y iy jy A, ¼ T @ ,  , k,  g ny þ 1 g ny þ 1 ny þ 1 and when y ≠ z,

(5.84)

256

Chapter 5

i, j my,z ðc, S n Þ ¼ PrðP ≤ 0, Q ≤ 0jy, z, S n Þ 1 0 sffiffiffiffiffiffiffiffiffiffiffiffiffi sffiffiffiffiffiffiffiffiffiffiffiffiffi   miy kny mjz knz , k, 0A, ¼ T @ ,    g ny þ 1 g nz þ 1

(5.85)

where Tðx, y, d, rÞ is the bivariate CDF of standard Student’s t-random variables with d degrees of freedom and correlation coefficient r. 5.7.3 Independent general covariance In the independent general covariance model, Ty ¼ ½my , Sy , the parameter space of my is RD , the parameter space of Sy consists of all symmetric positive definite matrices, the Ty are independent, and pðuy Þ ¼ pðmy jSy ÞpðSy Þ,

(5.86)

where pðmy jSy Þ is of the same form as in Eq. 5.54 with hyperparameters ny ∈ R and my ∈ RD , and pðSy Þ is of the same form as in Eq. 5.61 with hyperparameters ky ∈ R and the symmetric D  D matrix Sy in place of k and S. Via Theorem 2.4, the posterior is of the same form as the prior with ny , my , ky , and Sy given in Eqs. 2.65 through 2.68, respectively. The posteriors are proper if ny . 0, ky . D  1, and Sy is positive definite. The effective density for class y is multivariate t as in Eq. 5.63 with k y ¼ ky  D þ 1 and Sy in place of k and S , respectively (Theorem 2.9). Further, Eq. 5.64 also holds with miy ¼ ð1Þi gðmy Þ and with k y and g 2y ¼ aT Sy a in place of k and g 2 , respectively. Under binary linear classification, εˆ ni,y ðc, S n Þ is given by Eq. 5.66 with k y and g 2y in place of k and g 2 , respectively (Theorem 2.12). mi,y,yj ðc, S n Þ is solved similarly to the homoscedastic case, resulting in Eqs. 5.67, 5.80, and 5.82, and ultimately Eq. 5.84, with k y , Sy , and g 2y in place of k, S , and g 2 , respectively. A special case of Eq. 5.84 is also i, j ðc, S n Þ for y ≠ z is found from Eq. 5.39. provided in Theorem 3.10. my,z

5.8 Simulations We consider five classification rules: OBRC (under Gaussian models), LDA, QDA, L-SVM, and RBF-SVM. Since SVM classifiers optimize relative to their own objective function (for example, hinge loss) rather than expected risk, we exclude them from our analysis when using a non–zero-one loss function. For all classification rules, we calculate the true risk defined in Eqs. 5.2 and 5.3. We find the exact value if a formula is available; otherwise, we use a test sample of at least 10,000 points generated from

Optimal Bayesian Risk-based Multi-class Classification

257

the true feature-label distributions, stratified relative to the true class prior probabilities. This will yield an approximation of the true risk with pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi RMS ≤ 1∕ 4  10,000 ¼ 0.005 by Eq. 1.37. Four risk estimation methods are considered: BRE, 10-fold crossvalidation (cv), leave-one-out (loo), and 0.632 bootstrap (boot). When there is no closed-form formula for calculating the BRE, we approximate it by drawing a sample of 1,000,000 points from the effective density of each class. In cv, the risk is estimated on the holdout data and the resulting risk values are averaged to get the cv estimate. For cv and boot risk estimates, we use 10 and 100 repetitions, respectively. Under linear classification, the sample-conditioned MSE from Eq. 5.36 is found analytically: for mi,y,yj ðc, S n Þ, use Eq. 5.84 and plug in the appropriate values for k and g 2 depending on the covariance model, and for mi,y,zj ðc, S n Þ with z ≠ y, employ Eq. 5.39 for independent and Eq. 5.85 for homoscedastic covariance models, and plug in appropriate values for k and g 2 . When analytic forms are not available, the sample-conditioned MSE is approximated as follows. In independent covariance models, for each y, and for each of the 1,000,000 sample points generated to approximate the BRE of class y, draw a single point from the effective conditional density f U ðwjx; y; yÞ, giving i, j 1,000,000 sample point pairs to approximate my,y ðc, S n Þ for all i and j. In homoscedastic covariance models, for each y and z pair we draw 1,000,000 synthetic points from the effective density of class y (there are already 1,000,000 points available from approximating the BRE, which we distribute between each z). For each of these points we draw a single point from the effective conditional density f U ðwjx, y, zÞ. For each y and z, the corresponding 1,000,000 point pairs are used to approximate mi,y,zj ðc, S n Þ for all i and j. In the simulations, all classes are equally likely, the data are stratified to give an equal number of sample points from each class, there are Gaussian feature-label distributions, and we employ a zero-one loss function. For each prior model and a fixed sample size, we evaluate classification performance in a Monte Carlo estimation loop with 10,000 iterations. In each iteration, a two-step procedure is followed for sample generation: (1) generate random feature-label distribution parameters from the prior (each serving as the true underlying feature-label distribution), and (2) generate a random sample of size n from this fixed feature-label distribution. The generated random sample is used to train classifiers and evaluate their true risk. We consider four Gaussian models. For Models 1 and 2, the number of features is D ¼ 2, the number of classes is M ¼ 5, n0 , : : : , n4 are 12, 2, 2, 2, 2, respectively, the means m0 , : : : , m4 are ½0, 0T , ½1, 1T , ½1,  1T , ½1,  1T , ½1, 1T , respectively, ky ¼ 6 for all y, ðky  D  1Þ1 Sy ¼ 0.3I2 for all y, and the covariances are independent general and homoscedastic general, respectively. Figures 5.1 and 5.2 provide examples of decision boundaries for Models 1 and 2, respectively. Under Model 1

258

Chapter 5 4

4

4

3

3

3

2

2

2

1

1

1

0

0

0

−1

−1

−1

−2

−2

−2

−3

−3

−3

−4 −4 −3 −2 −1

−4 −4 −3 −2 −1

0

1

2

3

4

(a)

0

1

2

3

−4 −4 −3 −2 −1

4

(b)

4

4

3

3

2

2

1

1

0

0

−1

−1

−2

−2

−3

−3

−4 −4 −3 −2 −1

0

1

2

3

1

2

3

4

(c)

−4 −4 −3 −2 −1

4

0

(d)

0

1

2

3

4

(e)

Figure 5.1 Example decision boundaries for Model 1 with multi-class classification: (a) LDA; (b) QDA; (c) OBRC; (d) L-SVM; (e) RBF-SVM. [Reprinted from (Dalton and Yousefi, 2015).] 4

4

4

3

3

3

2

2

2

1

1

1

0

0

0

−1

−1

−1

−2

−2

−2

−3

−3

−3

−4 −4 −3 −2 −1

−4 −4 −3 −2 −1

−4 −4 −3 −2 −1

0

1

2

3

4

(a)

0

1

2

3

4

(b) 4

4

3

3

2

2

1

1

0

0

−1

−1

−2

−2

−3

−3

−4 −4 −3 −2 −1

−4 −4 −3 −2 −1

0

(d)

1

2

3

4

0

1

2

3

4

(c)

0

1

2

3

4

(e)

Figure 5.2 Example decision boundaries for Model 2 with multi-class classification: (a) LDA; (b) QDA; (c) OBRC; (d) L-SVM; (e) RBF-SVM. [Reprinted from (Dalton and Yousefi, 2015).]

Optimal Bayesian Risk-based Multi-class Classification

259

LDA QDA OBRC L-SVM RBF-SVM

0.25 0.2 0.15 0.1

10

30

50

70

90

110

0.1

risk standard deviation

average risk

(independent covariances), the decision boundaries of OBRC are most similar to QDA, although they are in general of a polynomial order. Under Model 2 (homoscedastic covariances), OBRC is most similar to LDA, although the decision boundaries are not necessarily linear. Figure 5.3 presents the mean and standard deviation of the true risk with respect to all sample realizations as a function of sample size for Models 1 and 2. OBRC outperforms all other classification rules with respect to mean risk, as it must, since the OBRC is defined to minimize mean risk. Although there is no guarantee that OBRC should minimize risk variance, in these examples the risk variance is lower than in all other classification rules. The performance gain is particularly significant for small samples. Consider Figs. 5.3(a) and 5.3(b), where at sample size 10, the risk of OBRC has a mean of about 0.16 and standard deviation of about 0.065, whereas the risk of the next best classifier, RBF-SVM, has a mean of about 0.22 and standard deviation of about 0.09. For higher-dimensional examples, we consider Models 3 and 4, for which D ¼ 20 and M ¼ 5. For Model 3, n0 , : : : , n4 are 12, 2, 2, 2, 2, respectively, m0 , : : : , m4 are 020 , 0.1 ⋅ 120 , 0.1 ⋅ 120 , ½0.1 ⋅ 1T10 ,  0.1 ⋅ 1T10 T , ½0.1 ⋅ 1T10 , 0.1 ⋅ 1T10 T , respectively, ky ¼ 20.65, ð3Þ1 Sy ¼ 0.3I20 , and the covariances are independent scaled identity [see (Dalton and Yousefi, 2015) for details of this case]. Model 4 is the same except n0 ¼ · · · ¼ n4 ¼ 20 and

0.08 0.07 0.06 0.05

130

LDA QDA OBRC L-SVM RBF-SVM

0.09

10

30

sample size

50

0.25 0.2 0.15

30

50

70

110

130

(b) LDA QDA OBRC L-SVM RBF-SVM

10

90

90

110

130

0.14

risk standard deviation

average risk

(a)

0.1

70

sample size

LDA QDA OBRC L-SVM RBF-SVM

0.13 0.12 0.11 0.1 0.09 0.08

10

30

50

70

90

sample size

sample size

(c)

(d)

110

130

Figure 5.3 True risk statistics for Models 1 (top row) and 2 (bottom row), and five classification rules (LDA, QDA, OBRC, L-SVM, and RBF-SVM): (a) Model 1, mean; (b) Model 1, standard deviation; (c) Model 2, mean; (d) Model 2, standard deviation. [Reprinted from (Dalton and Yousefi, 2015).]

260

Chapter 5

average risk

0.4 0.3 0.2 0.1 0

20

60

100

140

180

220

260

0.08

risk standard deviation

LDA QDA OBRC L-SVM RBF-SVM

0.5

0.06 0.05 0.04 0.03 0.02

300

LDA QDA OBRC L-SVM RBF-SVM

0.07

20

60

100

140

(a) 0.7 0.6 0.5 0.4

60

100

140

180

260

300

220

260

300

220

260

300

0.1

risk standard deviation

average risk

LDA QDA OBRC L-SVM RBF-SVM

20

220

(b)

0.8

0.3

180

sample size

sample size

0.08 LDA QDA OBRC L-SVM RBF-SVM

0.06

0.04

0.02

20

60

100

140

180

sample size

sample size

(c)

(d)

Figure 5.4 True risk statistics for Models 3 (top row) and 4 (bottom row), and five classification rules (LDA, QDA, OBRC, L-SVM, and RBF-SVM): (a) Model 3, mean; (b) Model 3, standard deviation; (c) Model 4, mean; (d) Model 4, standard deviation. [Reprinted from (Dalton and Yousefi, 2015).]

m0 ¼ · · · ¼ m4 ¼ 020 . Figure 5.4 presents the mean and standard deviation of the true risk of all classifiers as a function of sample size for Models 3 and 4, where Model 3 is designed to produce a low mean risk and Model 4 a high mean risk. OBRC again outperforms all other classification rules with respect to mean risk, as it should. There is no guarantee that OBRC will minimize risk variance, and, although risk variance is lowest for OBRC in Fig. 5.4(b), in Fig. 5.4(d) it is actually highest. Performance gain is particularly significant for small samples.

Chapter 6

Optimal Bayesian Transfer Learning

The theory of optimal Bayesian classification assumes that the sample data come from the unknown true feature-label distribution, which is a standard assumption in classification theory. When data from the true feature-label distribution are limited, it is possible to use data from a related feature-label distribution. This is the basic idea behind transfer learning, where data from a source domain are used to augment data from a target domain, which may follow a different feature-label distribution (Pan and Yang, 2010; Weiss et al., 2016). The key issue is to quantify relatedness, which means providing a rigorous mathematical framework to characterize transferability. This can be achieved by extending the OBC framework so that transfer learning from the source to the target domain is via a joint prior distribution for the model parameters of the feature-label distributions of the two domains (Karbalayghareh et al., 2018b). In this way, the posterior distribution of the target model parameters can be updated via the joint prior probability distribution function in conjunction with the source and target data.

6.1 Joint Prior Distribution

We consider $L$ common classes (labels) in each domain. Let $S_s$ and $S_t$ denote samples from the source and target domains with sizes of $N_s$ and $N_t$, respectively. In practice one might expect $N_t$ to be substantially smaller than $N_s$. For $l = 1, 2, \ldots, L$, let $S_s^l = \{x_{s,1}^l, x_{s,2}^l, \ldots, x_{s,n_s^l}^l\}$ and $S_t^l = \{x_{t,1}^l, x_{t,2}^l, \ldots, x_{t,n_t^l}^l\}$, where $n_s^l$ and $n_t^l$ denote the number of points in the source and target domains, respectively, for the label $l$. $S_t^i \cap S_t^j = \varnothing$ and $S_s^i \cap S_s^j = \varnothing$ for $i \neq j$, $i, j \in \{1, \ldots, L\}$. Moreover, $S_s = \cup_{l=1}^L S_s^l$, $S_t = \cup_{l=1}^L S_t^l$, $N_s = \sum_{l=1}^L n_s^l$, and $N_t = \sum_{l=1}^L n_t^l$. Heretofore we have been considering priors on the covariance matrix and have employed the inverse-Wishart distribution. Here, we desire priors for the


precision matrices $\Lambda_s^l$ or $\Lambda_t^l$, and we use a Wishart distribution as the conjugate prior. A random $D \times D$ symmetric positive definite matrix $\Lambda$ has a nonsingular Wishart distribution $\mathrm{Wishart}(M, k)$ with $k$ degrees of freedom if $k > D - 1$, $M$ is a $D \times D$ symmetric positive definite matrix, and the density is

$$f_W(\Lambda; M, k) = 2^{-\frac{kD}{2}}\, \Gamma_D\!\left(\frac{k}{2}\right)^{-1} |M|^{-\frac{k}{2}}\, |\Lambda|^{\frac{k - D - 1}{2}}\, \mathrm{etr}\!\left( -\frac{1}{2} M^{-1} \Lambda \right), \qquad (6.1)$$

where $\Gamma_D(\cdot)$ is the multivariate gamma function. We assume that there are two datasets separately sampled from the source and target domains. Since the feature spaces are the same in both the source and target domains, let $x_s^l$ and $x_t^l$ be $D \times 1$ vectors for the $D$ features of the source and target domains under class $l$, respectively. We utilize a Gaussian model for the feature-label distribution in each domain:

$$x_z^l \sim N\big( \mu_z^l, (\Lambda_z^l)^{-1} \big) \qquad (6.2)$$
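As a small illustration of this model, the sketch below draws a precision matrix from a Wishart prior (Eq. 6.1), a class mean from the conditional Gaussian prior introduced in Eq. 6.5 below, and then observations from Eq. 6.2. It is our own illustrative code; the hyperparameter values are arbitrary, and SciPy's Wishart parameterization (scale matrix $M$, degrees of freedom $k$) matches Eq. 6.1.

```python
import numpy as np
from scipy.stats import wishart, multivariate_normal

def sample_class_data(rng, n, M_scale, k_dof, m_mean, nu):
    """Draw a precision matrix from Wishart(M, k) (Eq. 6.1), a mean from the
    conditional Gaussian prior (Eq. 6.5), and n observations from Eq. 6.2."""
    Lam = wishart.rvs(df=k_dof, scale=M_scale, random_state=rng)   # precision matrix
    Sigma = np.linalg.inv(Lam)
    mu = multivariate_normal.rvs(mean=m_mean, cov=Sigma / nu, random_state=rng)
    X = multivariate_normal.rvs(mean=mu, cov=Sigma, size=n, random_state=rng)
    return Lam, mu, X

rng = np.random.default_rng(0)
Lam_t, mu_t, X_t = sample_class_data(rng, n=10, M_scale=np.eye(2),
                                     k_dof=4, m_mean=np.zeros(2), nu=1.0)
```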

for l ∈ f1, : : : , Lg, where subscript z ∈ fs, tg denotes the source s or target t domain, mls and mlt are D  1 mean vectors in the source and target domains for label l, respectively, Lls and Llt are the D  D precision matrices in the source and target domains for label l, respectively, and a joint normal-Wishart distribution is employed as a prior for the mean and precision matrices of the Gaussian models. Under these assumptions, the joint prior distribution for mls , mlt , Lls , and Llt takes the form pðmls , mlt , Lls , Llt Þ ¼ f ðmls , mlt jLls , Llt ÞpðLls , Llt Þ:

(6.3)

To facilitate conjugate priors, we assume that, for any class l ∈ f1, : : : , Lg, mls and mlt are conditionally independent given Lls and Llt . Thus, pðmls , mlt , Lls , Llt Þ ¼ f ðmls jLls Þf ðmlt jLlt ÞpðLls , Llt Þ,

(6.4)

and both f ðmls jLls Þ and f ðmlt jLlt Þ are Gaussian, mlz jLlz  N ðmlz , ðnlz Llz Þ1 Þ,

(6.5)

where mlz is the D  1 mean vector of mlz , and nlz is a positive scalar hyperparameter. A key issue is the structure of the joint prior governing the target and source precision matrices. Here we employ a family of joint priors that falls out naturally from a collection of partitioned Wishart random matrices. Specifically, we consider D  D matrices L  WishartðM, kÞ, where

Optimal Bayesian Transfer Learning

263



 L12 , L22

L11 L¼ LT12

(6.6)

L11 and L22 are D1  D1 and D2  D2 submatrices, respectively, and   M11 M12 M¼ MT12 M22

(6.7)

is the corresponding partition of M, with M11 and M22 being D1  D1 and D2  D2 submatrices, respectively. For such L, we are assured that L11  WishartðM11 , kÞ and L22  WishartðM22 , kÞ (Muirhead, 2009). The next theorem gives the form of the joint distribution of the two submatrices of a partitioned Wishart matrix. It provides a class of joint priors for the target and source precision matrices. Theorem 6.1 (Halvorsen et al., 2016). If L in Eq. 6.6 is a ðD1 þ D2 Þ  ðD1 þ D2 Þ partitioned Wishart random matrix with k ≥ D1 þ D2 degrees of freedom and the positive definite scale matrix M in Eq. 6.7, where the two diagonal partitions of L and M are of sizes D1  D1 and D2  D2 , then the diagonal partitions L11 and L22 have the joint density function     1 1 1 1 T f ðL11 , L22 Þ ¼ K etr  ðM11 þ F C2 FÞL11 etr  C2 L22 2 2 (6.8)   kD2 1 kD1 1 k 1 ; GðL11 , L22 Þ ,  jL11 j 2 jL22 j 2 0 F 1 2 4 where C2 ¼ M22  MT12 M1 11 M12 ,

(6.9)

T 1 F ¼ C1 2 M12 M11 , 1

(6.10) 1

2 2 FL11 FT L22 , GðL11 , L22 Þ ¼ L22     ðD1 þD2 Þk k k k 1 2 K ¼2 GD1 G jMj2 , 2 D2 2

(6.11) (6.12)

1

X2 denotes the unique positive definite square root of the positive definite matrix X, and 0 F 1 is the generalized matrix-variate hypergeometric function. The generalized hypergeometric function of one matrix argument (Nagar and Mosquera-Benıtez, 2017) is defined by

264

Chapter 6

p F q ða1 , : : : ,

ap ; b1 , : : : , bq ; XÞ ¼

` X X ða1 Þt · · · ðap Þt C t ðXÞ , ðb1 Þt · · · ðbq Þt k! k¼0 t⊢k

(6.13)

where ai for i ¼ 1, : : : , p and bj for j ¼ 1, : : : , q are arbitrary complex (real in our case) numbers; C t ðXÞ is the zonal polynomial of a D  D symmetric matrix X corresponding to an ordered P partition t ¼ ðk 1 , : : : , k D Þ, k 1 ≥ · · · ≥ k D ≥ 0, k 1 þ · · · þ k D ¼ k; and t⊢k denotes a summation over all partitions t. The generalized hypergeometric coefficient ðaÞt is defined by  D  Y i1 , ðaÞt ¼ a 2 ki i¼1

(6.14)

where ðaÞk ¼ aða þ 1Þ · · · ða þ k  1Þ for k ¼ 1, 2, : : : , and ðaÞ0 ¼ 1. As in (Muirhead, 2009), we also define 1

1

C t ðXYÞ ¼ C t ðYXÞ ¼ C t ðX2 YX2 Þ

(6.15)

for symmetric positive definite X and symmetric Y. This effectively extends p F q to accept matrices of the form XY and YX. Conditions for convergence of the series in Eq. 6.13 are available in (Constantine, 1963). From Eq. 6.13 it follows that ` X X C t ðXÞ

(6.16)

` X X ðaÞt C t ðXÞ ¼ jID  Xja , kXk , 1, k! k¼0 t⊢k

(6.17)

k¼0 t⊢k 1 F 0 ða;

XÞ ¼

` X ½trðXÞk

¼ etrðXÞ,

0 F 0 ðXÞ ¼

0 F 1 ðb;

k!

¼

k¼0

XÞ ¼

` X X C t ðXÞ k¼0 t⊢k

1 F 1 ða; b; XÞ ¼

2 F 1 ða, b; c; XÞ ¼

k!

ðbÞt k!

,

` X X ðaÞt C t ðXÞ , ðbÞt k! k¼0 t⊢k

` X X ðaÞt ðbÞt C t ðXÞ , ðcÞt k! k¼0 t⊢k

(6.18)

(6.19)

(6.20)

where kXk , 1 means that the maximum of the absolute values of the eigenvalues of X is less than 1. 1 F 1 ða; b; XÞ and 2 F 1 ða, b; c; XÞ are called the confluent hypergeometric function and Gauss hypergeometric function of one matrix argument, respectively.

Optimal Bayesian Transfer Learning

265

Using Theorem 6.1, we define the joint prior distribution pðLls , Llt Þ in Eq. 6.4 of the precision matrices of the source and target domains for class l ∈ f1, : : : , Lg. Given a Wishart matrix of the form in Eq. 6.6, and replacing L11 and L22 with Llt and Lls , respectively, we obtain the joint prior  i  1 h l 1 kl D1 l l l l T l l pðLt , Ls Þ ¼ K etr  ðMt Þ þ ðF Þ C F Llt jLlt j 2 2   l   l 1 l 1 l k 1 l l k D1 ; G ðL11 , L22 Þ ,  etr  ðC Þ Ls jLs j 2 0 F 1 2 4 2 (6.21) where 

Mlt M ¼ ðMlts ÞT l

Mlts Mls

 (6.22)

is a 2D  2D positive definite scale matrix, kl ≥ 2D denotes degrees of freedom, and Cl ¼ Mls  ðMlts ÞT ðMlt Þ1 Mlts ,

(6.23)

Fl ¼ ðCl Þ1 ðMlts ÞT ðMlt Þ1 ,

(6.24)

1

1

Gl ðL11 , L22 Þ ¼ ðLls Þ2 Fl Llt ðFl ÞT ðLls Þ2 ,  l kl l 1 Dkl 2 k ðK Þ ¼ 2 GD jMl j 2 : 2

(6.25) (6.26)

Owing to the comment following Eq. 6.7, Llt and Lls possess Wishart marginal distributions: Llz  WishartðMlz , kl Þ

(6.27)

for l ∈ f1, : : : , Lg and z ∈ fs, tg.

6.2 Posterior Distribution in the Target Domain Having defined the prior distributions, we need to derive the posterior distribution of the parameters of the target domain upon observing the training source and target samples. The likelihood of the samples S t and S s is conditionally independent given the parameters of the target and source domains. The dependence between the two domains is due to the dependence of the prior distributions of the precision matrices, as shown in Fig 6.1. Within each domain, source or target, the likelihoods of the different classes are also conditionally independent given the parameters of the classes.

266

Chapter 6

Target Domain

Source Domain

Figure 6.1 Dependency of the source and target domains through their precision matrices for any class l ∈ {1, . . . , L}. [Reprinted from (Karbalayghareh et al., 2018b).]

Let mz ¼ ðm1z , : : : , mLz Þ, and Lz ¼ ðL1z , : : : , LLz Þ, where z ∈ fs, tg. Under these conditions, the joint likelihood of the samples S t and S s can be written as f ðS t , S s jmt , ms , Lt , Ls Þ ¼ f ðS t jmt , Lt Þf ðS s jms , Ls Þ ¼ f ðS 1t , : : : , S Lt jm1t , : : : , mLt , L1t , : : : , LLt Þ  f ðS 1s , : : : , S Ls jm1s , : : : , mLs , L1s , : : : , LLs Þ ¼

L Y

f ðS lt jmlt , Llt Þ

l¼1

L Y

f ðS ls jmls , Lls Þ:

l¼1

(6.28) The posterior of the parameters given S t and S s satisfies pðmt , ms , Lt , Ls jS t , S s Þ ∝ f ðS t , S s jmt , ms , Lt , Ls Þpðmt , ms , Lt , Ls Þ ¼

L Y

f ðS lt jmlt , Llt Þ

l¼1

L Y

f ðS ls jmls , Lls Þ

l¼1

L Y

pðmlt , mls , Llt , Lls Þ,

(6.29)

l¼1

where we assume that the priors of the parameters in different classes are independent: pðmt , ms , Lt , Ls Þ ¼

L Y

pðmlt , mls , Llt , Lls Þ:

(6.30)

l¼1

From Eqs. 6.4 and 6.29, pðmt , ms , Lt , Ls jS t , S s Þ ∝

L Y

f ðS lt jmlt , Llt Þf ðS ls jmls , Lls Þf ðmls jLls Þf ðmlt jLlt ÞpðLls , Llt Þ:

(6.31)

l¼1

We observe that the posterior of the parameters equals the product of the posteriors of the parameters of each class:

Optimal Bayesian Transfer Learning

267 L Y

pðmt , ms , Lt , Ls jS t , S s Þ ¼

pðmlt , mls , Llt , Lls jS lt , S ls Þ,

(6.32)

l¼1

where pðmlt , mls , Llt , Lls jS lt , S ls Þ ∝ f ðS lt jmlt , Llt Þf ðS ls jmls , Lls Þf ðmls jLls Þf ðmlt jLlt ÞpðLls , Llt Þ:

(6.33)

Since we are interested in the posterior of the parameters of the target domain, we integrate out the parameters of the source domain in Eq. 6.32: Z Z pðmt , Lt jS t , S s Þ ¼ pðmt , ms , Lt , Ls jS t , S s Þdms dLs L1s ≻0, : : : , LLs ≻0

¼

L Z Y

Lls ≻0

l¼1

¼

Z

L Y

RD

ðRD ÞL

pðmlt , mls , Llt , Lls jS lt , S ls Þdmls dLls

pðmlt , Llt jS lt , S ls Þ,

l¼1

(6.34) where Z

Z pðmlt ,

Llt jS lt ,

S ls Þ

¼

Lls ≻0

RD

pðmlt , mls , Llt , Lls jS lt , S ls Þdmls dLls

∝ f ðS lt jmlt , Llt Þf ðmlt jLlt Þ Z Z f ðS ls jmls , Lls Þf ðmls jLls ÞpðLls , Llt Þdmls dLls :  Lls ≻0

RD

(6.35) Theorem 6.4 derives the posterior for the target domain. To prove it, we require two theorems. The first one is essentially a restatement of Theorem 2.4 for the Wishart distribution instead of the inverse-Wishart distribution. Theorem 6.2 (Muirhead, 2009). If S ¼ fx1 , : : : , xn g, where xi is a D  1 vector and xi  N ðm, L1 Þ for i ¼ 1, : : : , n, and ðm, LÞ has a normal-Wishart prior such that mjL  N ðm, ðnLÞ1 Þ and L  W ishartðM, kÞ, then the posterior of ðm, LÞ upon observing S is also a normal-Wishart distribution: mjL, S  N ðmn , ðnn LÞ1 Þ,

(6.36)

LjS  WishartðMn , kn Þ,

(6.37)

268

Chapter 6

where nn ¼ n þ n,

(6.38)

kn ¼ k þ n,

(6.39)

mn ¼

nm þ nx¯ , nþn

(6.40)

nn 1 ¯ ¯ T, M1 ðm  xÞðm  xÞ þ ðn  1ÞSˆ þ n ¼ M nþn

(6.41)

x¯ is the sample mean, and Sˆ is the sample covariance matrix.

Theorem 6.3 (Gupta et al., 2016). Let Z be a symmetric positive definite matrix, X a symmetric matrix, and a . ðD  1Þ∕2. Then Z Dþ1 etrðZRÞjRja 2 p F q ða1 , : : : , ap ; b1 , : : : , bq ; XRÞdR R≻0 Z Dþ1 1 1 (6.42) ¼ etrðZRÞjRja 2 p F q ða1 , : : : , ap ; b1 , : : : , bq ; R2 XR2 ÞdR R≻0

¼ GD ðaÞjZja pþ1 F q ða1 , : : : , ap , a; b1 , : : : , bq ; XZ1 Þ:

Theorem 6.4 (Karbalayghareh et al., 2018b). Given the target S t and source S s samples, the posterior distribution of target mean mlt and target precision matrix Llt for the class l ∈ f1, : : : , Lg has a Gauss-hypergeometric-function distribution: 

pðmlt ,

Llt jS lt ,

S ls Þ

nlt,n l ðmt  mlt,n ÞT Llt ðmlt  mlt,n Þ ¼A exp  2   kl þnl D1 1 l 1 l t l  jLt j 2 etr  ðTt Þ Lt 2  l  l l k þ ns k 1 l l l T l  1F 1 ; ; F Lt ðF Þ Ts , 2 2 2 l



1 jLlt j2

where Al is the constant of proportionality, given by

(6.43)

Optimal Bayesian Transfer Learning

 l 1

ðA Þ

¼

2p nlt,n

D 2



 2F 1

2

269

Dðkl þnl Þ t 2

 GD

 kl þnl kl þ nlt t jTlt j 2 2

kl þ nls kl þ nlt kl , ; ; Tls Fl Tlt ðFl ÞT 2 2 2



(6.44)

when Fl is full rank or Fl ¼ 0DD , and nlt,n ¼ nlt þ nlt , mlt,n ¼

(6.45)

nlt mlt þ nlt x¯ lt , nlt þ nlt

(6.46)

ðTlt Þ1 ¼ ðMlt Þ1 þ ðFl ÞT Cl Fl nl nl þ ðnlt  1ÞSˆ lt þ l t t l ðmlt  x¯ lt Þðmlt  x¯ lt ÞT , nt þ nt nl nl ðTls Þ1 ¼ ðCl Þ1 þ ðnls  1ÞSˆ ls þ l s s l ðmls  x¯ ls Þðmls  x¯ ls ÞT , ns þ ns

(6.47)

(6.48)

with x¯ lz and Sˆ lz being the sample mean and covariance for z ∈ fs, tg and l ∈ f1, : : : , Lg, respectively. Proof. From Eq. 6.2, for each domain z ∈ fs, tg, f ðS lz jmlz ,

Llz Þ



¼ ð2pÞ

dnlz 2

nl

z jLlz j 2



 1 l exp  qz , 2

(6.49)

where qlz

¼

nlz X

ðxlz,i  mlz ÞT Llz ðxlz,i  mlz Þ:

(6.50)

i¼1

Moreover, from Eq. 6.5, for each domain z ∈ fs, tg,  l  nz l l l D2 l D2 l 12 l T l l l f ðmz jLz Þ ¼ ð2pÞ ðnz Þ jLz j exp  ðmz  mz Þ Lz ðmz  mz Þ : 2 From Eqs. 6.21, 6.35, 6.49, and 6.51,

(6.51)

270

Chapter 6

pðmlt , Llt jS lt , S ls Þ   l   nl 1 l nt l l 2t l 12 l T l l l ∝ jLt j exp  qt jLt j exp  ðmt  mt Þ Lt ðmt  mt Þ 2 2  i  h l 1 1 T l k D1 l l l l  jLt j 2 etr  ðMt Þ þ ðF Þ C F Llt 2   Z Z l n 1 l l 2s jLs j exp  qs  2 Lls ≻0 RD  l  ns l 1 T l l l l l  jLs j2 exp  ðms  ms Þ Ls ðms  ms Þ 2   l 1 l 1 l l k D1  jLs j 2 etr  ðC Þ Ls 2  l  k 1 l 1 l l l T l 1 ; ðL Þ2 F Lt ðF Þ ðLs Þ2 dmls dLls  0F 1 2 4 s  l  nt,n l 1 ðmt  mlt,n ÞT Llt ðmlt  mlt,n Þ ¼ jLlt j2 exp  2   kl þnl D1 1 l 1 l t l  jLt j 2 etr  ðTt Þ Lt 2  l  Z Z ns,n l l 12 l T l l l ðms  ms,n Þ Ls ðms  ms,n Þ  jLs j exp  2 Lls ≻0 RD   kl þnls D1 1 l 1 l l 2  jLs j etr  ðTs Þ Ls 2  l  k 1 l 1 l l l T l 1 2 2  0F 1 ; ðL Þ F Lt ðF Þ ðLs Þ dmls dLls , 2 4 s

(6.52)

where the reduction of the equality utilizes Theorem 6.2, and where nlt,n ¼ nlt þ nlt ,

(6.53)

nls,n ¼ nls þ nls ,

(6.54)

mlt,n ¼

nlt mlt þ nlt x¯ lt , nlt þ nlt

(6.55)

mls,n ¼

nls mls þ nls x¯ ls , nls þ nls

(6.56)

Optimal Bayesian Transfer Learning

271

ðTlt Þ1 ¼ ðMlt Þ1 þ ðFl ÞT Cl Fl nl nl þ ðnlt  1ÞSˆ lt þ l t t l ðmlt  x¯ lt Þðmlt  x¯ lt ÞT , nt þ nt nl nl ðTls Þ1 ¼ ðCl Þ1 þ ðnls  1ÞSˆ ls þ l s s l ðmls  x¯ ls Þðmls  x¯ ls ÞT , ns þ ns

(6.57)

(6.58)

with sample mean x¯ lz and sample covariance matrix Sˆ lz for z ∈ fs, tg. Using the equation   Z 1 D 1 T exp  ðx  mÞ Lðx  mÞ dx ¼ ð2pÞ 2 jLj2 (6.59) D 2 R and integrating out mls in Eq. 6.52 yields pðmlt , Llt jS lt , S ls Þ  l  nt,n l l 12 l T l l l ðmt  mt,n Þ Lt ðmt  mt,n Þ ∝ jLt j exp  2   kl þnl D1 1 l 1 l t l 2  jLt j etr  ðTt Þ Lt 2   Z kl þnls D1 1 l 1 l l 2  jLs j etr  ðTs Þ Ls 2 Lls ≻0  l  k 1 l 1 l l l T l 1 ; ðL Þ2 F Lt ðF Þ ðLs Þ2 dLls :  0F 1 2 4 s The integral I in Eq. 6.60 can be evaluated using Theorem 6.3:  l  l   kl þnl k þ nls k þ nls kl 1 l l l T l l 2 s ; ; F Lt ðF Þ Ts , I ¼ GD j2Ts j 1 F 1 2 2 2 2

(6.60)

(6.61)

where 1 F 1 ða; b; XÞ is the confluent hypergeometric function with the matrix argument X. As a result, Eq. 6.60 becomes  l  nt,n l 1 ðmt  mlt,n ÞT Llt ðmlt  mlt,n Þ pðmlt , Llt jS lt , S ls Þ ¼ Al jLlt j2 exp  2   kl þnl D1 1 l 1 l t l (6.62)  jLt j 2 etr  ðTt Þ Lt 2  l  k þ nls kl 1 l l l T l ; ; F Lt ðF Þ Ts ,  1F 1 2 2 2

272

Chapter 6

where the constant of proportionality Al makes the integration of the posterior pðmlt , Llt jS lt , S ls Þ with respect to mlt and Llt equal to 1. Hence,   Z kl þnl D1 1 l 1 l 1 t l 1 l 2 jLt j etr  ðTt Þ Lt jLlt j2 ðA Þ ¼ l 2 Lt ≻0  l  Z nt,n l l T l l l (6.63) ðmt  mt,n Þ Lt ðmt  mt,n Þ dmlt  exp  D 2 R  l  k þ nls kl 1 l l l T l ; ; F Lt ðF Þ Ts dLlt :  1F 1 2 2 2 Using Eq. 6.59, the inner integral equals  D 2p 2 l 1 D l l 12 2 ð2pÞ jnt,n Lt j ¼ jLt j 2 : nlt,n Hence, l 1

ðA Þ

 ¼

2p nlt,n

D Z 2

Llt ≻0

jLlt j

kl þnl D1 t 2

  1F 1

  1 l 1 l etr  ðTt Þ Lt 2

 kl þ nls kl 1 l l l T l ; ; F Lt ðF Þ Ts dLlt : 2 2 2

(6.64)

(6.65)

Suppose that Fl is full rank. With the variable change V ¼ Fl Llt ðFl ÞT , we have dV ¼ jFl jDþ1 dLlt ,

(6.66)

Llt ¼ ðFl Þ1 V½ðFl ÞT 1 :

(6.67)

The region Llt ≻ 0 corresponds to V ≻ 0. Furthermore, trðABCDÞ ¼ trðBCDAÞ ¼ trðCDABÞ ¼ trðDABCÞ, and jABCj ¼ jAjjBjjCj. Thus, Al can be derived as  D 2p 2 l ðkl þnl Þ l 1 t jF j ðA Þ ¼ nlt,n   Z kl þnl D1 1 l T 1 l 1 l 1 t  jVj 2 etr  ½ðF Þ  ðTt Þ ðF Þ V 2 V≻0  l  l k þ ns kl 1 l ; ; VTs dV  1F 1 2 2 2  l   D l l kl þnl 2p 2 Dðk þnt Þ k þ nlt l 2 t 2 2 G j jT ¼ D t 2 nlt,n  l  k þ nls kl þ nlt kl l l l l T , ; ; Ts F Tt ðF Þ ,  2F 1 2 2 2

(6.68)

(6.69)

Optimal Bayesian Transfer Learning

273

where the second equality follows from Theorem 6.3, and 2 F 1 ða, b; c; XÞ is the Gauss hypergeometric function with the matrix argument X. Hence, we have derived the closed-form posterior distribution of the target parameters ðmlt , Llt Þ in Eq. 6.43, where Al is given by Eq. 6.44. Suppose that Fl ¼ 0DD , so that the 1 F 1 term in Eq. 6.65 equals 1, and  l 1

ðA Þ

¼

2p nlt,n

D Z 2

Llt ≻0

jLlt j

kl þnl D1 t 2

  1 l 1 l etr  ðTt Þ Lt dLlt : 2

(6.70)

The integrand is essentially a Wishart distribution; hence,  ðAl Þ1 ¼

2p nlt,n

D 2

2

ðkl þnl ÞD t 2

jTlt j

kl þnl t 2

 GD

 kl þ nlt : 2

(6.71)



This is consistent with Eq. 6.44.

6.3 Optimal Bayesian Transfer Learning Classifier The optimal Bayesian transfer learning classifier (OBTLC) is the OBC in the target domain given data from both the target and source datasets. To find the OBTLC, we need to derive the effective class-conditional densities using the posterior of the target parameters derived from both the target and source datasets. Expressing the effective class-conditional density for class l explicitly in terms of the model parameters, it is given by Z f OBTLC ðxjlÞ ¼ U

Llt ≻0

Z RD

f ðxjmlt , Llt Þp∗ ðmlt , Llt Þdmlt dLlt ,

(6.72)

where p∗ ðmlt , Llt Þ ¼ pðmlt , Llt jS lt , S ls Þ

(6.73)

is the posterior of ðmlt , Llt Þ upon observation of S lt and S ls . Theorem 6.5 (Karbalayghareh et al., 2018b). Suppose that Fl is full rank or Fl ¼ 0DD . Then the effective class-conditional density in the target domain for class l is given by

274

Chapter 6

f OBTLC ðxjlÞ U  l D  l    kl þnl þ1 kl þnl nt,n 2 k þ nlt þ 1 1 kl þ nlt t l  2 t D2 l 2 G j jT j ¼p jT G x D t D 2 2 nlx  l  k þ nls kl þ nlt þ 1 kl l l l l T , ; Ts F Tx ðF Þ ;  2F 1 2 2 2  l  l l l l 1 k þ ns k þ nt k l l l l T  2F 1 , ; ; Ts F Tt ðF Þ , 2 2 2

(6.74)

where nlx ¼ nlt,n þ 1 ¼ nlt þ nlt þ 1, ðTlx Þ1 ¼ ðTlt Þ1 þ

nlt,n ðmlt,n  xÞðmlt,n  xÞT : nlt,n þ 1

(6.75) (6.76)

Proof. The likelihood f ðxjmlt , Llt Þ and posterior pðmlt , Llt jS lt , S ls Þ are given in Eqs. 6.2 and 6.43, respectively. Hence,   Z Z 1 D 1 T OBTLC l l l l  l ðxjlÞ ¼ ð2pÞ 2 A jLt j2 exp  ðx  mt Þ Lt ðx  mt Þ fU 2 Llt ≻0 RD  l  nt,n l l 12 l T l l l ðmt  mt,n Þ Lt ðmt  mt,n Þ  jLt j exp  2   kl þnl D1 1 t  jLlt j 2 etr  ðTlt Þ1 Llt 2  l  k þ nls kl 1 l l l T l ; ; F Lt ðF Þ Ts dmlt dLlt  1F 1 2 2 2  l  Z Z nx l l 12 l D2 l l T l l ¼ ð2pÞ A jLt j exp  ðmt  mx Þ Lt ðmt  mx Þ 2 Llt ≻0 RD   kl þnl þ1D1 1 l 1 l t l 2  jLt j etr  ðTx Þ Lt 2  l  l l k þ ns k 1 l l l T l ; ; F Lt ðF Þ Ts dmlt dLlt ,  1F 1 2 2 2 (6.77) where nlx ¼ nlt,n þ 1 ¼ nlt þ nlt þ 1,

(6.78)

Optimal Bayesian Transfer Learning

275

mlx ¼ ðTlx Þ1 ¼ ðTlt Þ1 þ

nlt,n mlt,n þ x , nt,n þ 1

(6.79)

nlt,n ðmlt,n  xÞðmlt,n  xÞT : nlt,n þ 1

(6.80)

The integration in Eq. 6.77 is similar to the one in Eq. 6.63. As a result, using Eq. 6.44,  ðxjlÞ f OBTLC U

D2

¼ ð2pÞ A 

jTlx j

l

kl þnl þ1 t 2

2p nlx

D 2



2F 1

2

Dðkl þnl þ1Þ t 2

 GD

kl þ nlt þ 1 2



 kl þ nls kl þ nlt þ 1 kl l l l l T , ; Ts F Tx ðF Þ : ; 2 2 2 (6.81)

By replacing the value of Al , we have the effective class-conditional density as stated in Eq. 6.74. ▪ A Dirichlet prior is assumed for the prior probabilities clt that the target sample belongs to class l: ct ¼ ½c1t , : : : , cLt   Dirichletðat Þ, where at ¼ ½a1t , : : : , aLt  is the vector of concentration parameters, and alt . 0 for l ∈ f1, : : : , Lg. As the Dirichlet distribution is a conjugate prior for the categorical distribution, upon observing nt ¼ ½n1t , : : : , nLt  sample points for class l in the target domain, the posterior p∗ ðct Þ ¼ pðct jnt Þ has a Dirichlet distribution, Dirichletðat þ nt Þ ¼ Dirichletða1t þ n1t , : : : , aLt þ nLt Þ,

(6.82)

with Ep∗ ½clt  ¼

alt þ nlt , a0t þ N t

(6.83)

P P where N t ¼ Ll¼1 nlt , and a0t ¼ Ll¼1 alt . The OBTLC in the target domain is given by cOBTLC ðxÞ ¼ arg

max

l∈f1, : : : , Lg

Ep∗ ½clt  f OBTLC ðxjlÞ: U

(6.84)

If we lack prior knowledge for class selection, then we use the same concentration parameter for all classes: at ¼ ½at , : : : , at . Hence, if the number of samples in each class is the same, i.e., n1t ¼ · · · ¼ nLt , then Ep∗ ½clt  is the same for all classes, and Eq. 6.84 reduces to

276

Chapter 6

cOBTLC ðxÞ ¼ arg

max

l∈f1, : : : , Lg

f OBTLC ðxjlÞ: U

(6.85)

The effective class-conditional densities are derived in closed forms in Eq. 6.74; however, deriving the OBTLC in Eq. 6.84 requires computing the Gauss hypergeometric function of one matrix argument. Computing the exact values of hypergeometric functions of a matrix argument using the series of zonal polynomials, as in Eq. 6.20, is time-consuming and is not scalable to high dimension. To facilitate computation, one can use the Laplace approximation of this function, as in (Butler and Wood, 2002), which is computationally efficient and scalable. The Laplace approximation is discussed in (Karbalayghareh et al., 2018b). 6.3.1 OBC in the target domain The beneficial effect of the source data can be seen by comparing the OBTLC with the OBC based solely on target data. From Eqs. 6.5 and 6.27, the priors for mlt and Llt are given by mlt jLlt  N ðmlt , ðnlt Llt Þ1 Þ and Llt  WishartðMlt , kl Þ, respectively. Using Theorem 6.2, upon observing the sample S lt , the posteriors of mlt and Llt will be mlt jLlt , S lt  N ðmlt,n , ðnlt,n Llt Þ1 Þ,

(6.86)

Llt jS lt  WishartðMlt,n , klt,n Þ,

(6.87)

nlt,n ¼ nlt þ nlt ,

(6.88)

klt,n ¼ kl þ nlt ,

(6.89)

where

mlt,n ¼

nlt mlt þ nlt x¯ lt , nlt þ nlt

nl nl ðMlt,n Þ1 ¼ ðMlt Þ1 þ ðnlt  1ÞSˆ lt þ l t t l ðmlt  x¯ lt Þðmlt  x¯ lt ÞT , nt þ nt

(6.90) (6.91)

and x¯ lt and Sˆ lt are the sample mean and covariance, respectively. The effective class-conditional densities for the OBC are given by  l D  l  2 nt,n k þ nlt þ 1 OBC D2 f U ðxjlÞ ¼ p GD 2 nlt,n þ 1 (6.92)  l  l l l k þn þ1 kl þnl t t 1 k þ nt l l   GD jMx j 2 jMt,n j 2 , 2

Optimal Bayesian Transfer Learning

277

where ðMlx Þ1 ¼ ðMlt,n Þ1 þ

nlt,n ðmlt,n  xÞðmlt,n  xÞT : nlt,n þ 1

(6.93)

Note that if L  WishartðM, kÞ, then S ¼ L1  inverse-WishartðM1 , kÞ. Thus, as expected, Eq. 6.92 is equivalent to Eq. 2.120 with nlt,n in place of n∗ , mlt,n in place of m∗ , kl þ nlt in place of k∗ , and ðMlt,n Þ1 in place of S∗ . The multi-class OBC under a zero-one loss function is given by cOBC ðxÞ ¼ arg

max

l∈f1, : : : , Lg

Ep∗ ½clt  f OBC U ðxjlÞ:

(6.94)

In the case of equal expected prior probabilities for the classes, cOBC ðxÞ ¼ arg

max

l∈f1, : : : , Lg

f OBC U ðxjlÞ:

(6.95)

The next two theorems state that, if there is no interaction between the source and target domains in all of the classes, or there is no source data, then the OBTLC reduces to the OBC in the target domain. Theorem 6.6 (Karbalayghareh et al., 2018b). If Mlts ¼ 0DD for all l ∈ f1, : : : , Lg, then cOBTLC ¼ cOBC . Proof. If Mlts ¼ 0DD for all l ∈ f1, : : : , Lg, then Fl ¼ 0DD . Since 2 F 1 ða, b; c; 0DD Þ ¼ 1 for any values of a, b, and c, the Gauss hypergeometric functions will disappear in Eq. 6.74. From Eqs. 6.47 and 6.91, Tlt ¼ Mlt,n . From Eqs. 6.76 and 6.93, Tlx ¼ Mlx . Thus, f OBTLC ðxjlÞ ¼ f OBC U U ðxjlÞ, and consequently, cOBTLC ðxÞ ¼ cOBC ðxÞ. ▪ Theorem 6.7. If nls ¼ 0 for all l ∈ f1, : : : , Lg, then cOBTLC ¼ cOBC . Proof. When nls ¼ 0, the posterior pðmlt , Llt jS lt , S ls Þ in Eq. 6.43 is given by  l  nt,n l l l l l l 12 l T l l l ðmt  mt,n Þ Lt ðmt  mt,n Þ pðmt , Lt jS t , S s Þ ∝ jLt j exp  2   kl þnl D1 1 l 1 l t l (6.96)  jLt j 2 etr  ðTt Þ Lt 2   1 l l l T l  etr F Lt ðF Þ Ts , 2

278

Chapter 6

where we have used the property 1 F 1 ða; a; XÞ ¼0 F 0 ðXÞ and Eq. 6.16. By Eqs. 6.47 and 6.91, ðTlt Þ1 ¼ ðMlt,n Þ1 þ ðFl ÞT Cl Fl . Further, in the absence of data, Tls ¼ Cl by Eq. 6.48. Thus,       1 l 1 l 1 l l l T l 1 l 1 l etr  ðTt Þ Lt etr F Lt ðF Þ Ts ¼ etr  ðMt,n Þ Lt : 2 2 2

(6.97)

Combining this with Eq. 6.96, we have that pðmlt , Llt jS lt , S ls Þ is precisely the ðxjlÞ ¼ f OBC posterior given by Eqs. 6.86 and 6.87. Thus, f OBTLC U U ðxjlÞ, and cOBTLC ðxÞ ¼ cOBC ðxÞ. ▪ Example 6.1. Following (Karbalayghareh et al., 2018b), the aim is to investigate the effect of the relatedness of the source and target domains. The simulation setup assumes that D ¼ 10, L ¼ 2, the number of source training data per class is ns ¼ nls ¼ 200, the number of target training data per class is nt ¼ nlt ¼ 10, k ¼ kl ¼ 25, nt ¼ nlt ¼ 100, ns ¼ nls ¼ 100, m1t ¼ 0D , m2t ¼ 0.05 ⋅ 1D , m1s ¼ m1t þ 1D , and m2s ¼ m2t þ 1D . For the scale matrix Ml in Eq. 6.22, Mlt ¼ k t ID , Mls ¼ k s ID , and Mlts ¼ k ts ID for l ¼ 1, 2. Choosing an identity matrix for Mlts makes sense when the order of the features in the two domains is the same. Ml must be positive definite for any class l. Hence, we have the following constraints on k t , k s , and k ts : k t . 0, k s . 0, and pffiffiffiffiffiffiffiffiffi pffiffiffiffiffiffiffiffiffi jk ts j , k t k s . Let k ts ¼ a k t k s , where jaj , 1. The value of jaj shows the amount of relatedness between the source and target domains. If jaj ¼ 0, then the two domains are unrelated; as jaj approaches 1, there is greater relatedness. We set k t ¼ k s ¼ 1 and plot the average classification error curves for different values of jaj. All of the simulations assume equal prior probabilities for the classes. Prediction performance is measured by average classification errors. To sample from the prior in Eq. 6.4, first sample from a 2D  2D WishartðMl , kl Þ distribution for each class l ¼ 1, 2. For a fixed l, call the upper-left D  D block of this matrix Llt , and the lower-right D  D block Lls . By Theorem 6.1, ðLlt , Lls Þ is a joint sample from pðLlt , Lls Þ in Eq. 6.21. Given Llt and Lls , sample from Eq. 6.5 to get samples of mlt and mls for l ¼ 1, 2. Having mlt , mls , Llt , and Lls , generate 100 different training and test sets from Eq. 6.2. Training sets contain samples from both the target and source domains, but the test set contains only samples from the target domain. Since the numbers of source and target training data per class are ns and nt , there are Lns and Lnt source and target training data in total, respectively. The size of the test set per class is 1000. For each training and test set, use the OBTLC and its target-only version, OBC, and calculate the error. Then average all of the errors for 100 different training and test sets. Repeat the process

Optimal Bayesian Transfer Learning

279

1000 times for different realizations of Llt , Lls , mlt , and mls for l ¼ 1, 2, and finally average all of the errors and return the average classification error. In all figures, the OBTLC hyperparameters are the same as those used for simulating data, except for the figures showing the sensitivity of the performance with respect to different hyperparameters, in which case it is assumed that the true values of the hyperparameters used for simulating data are unknown. Figure 6.2(a) demonstrates how the source data improve the classifier in the target domain. It shows the average classification error versus nt for the OBC and OBTLC with different values of a. When a is close to 1, the performance of the OBTLC is much better than that of the OBC, owing to strong relatedness. Performance improvement is especially noticeable when nt is small. Moreover, the errors of the OBTLC and OBC converge to a similar value when nt gets very large, meaning that the source data are redundant when there is a large amount of target data. When a is larger, the error curves converge faster to the optimal error, which is the average Bayes error of the target classifier. The corresponding Bayes error averaged over 1000 randomly generated distributions equals 0.122. When a ¼ 0, the OBTLC reduces to the OBC. By Eq. 6.74, the sign of a does not matter in the performance of the OBTLC. Hence, we can use jaj in all cases. Figure 6.2(b) depicts average classification error versus ns for the OBC and OBTLC with different values of a. The error of the OBC is constant for all ns . The error of the OBTLC equals that of the OBC when ns ¼ 0 and starts to decrease as ns increases. A key point is that having a very large collection of source data when the two domains are highly related can compensate for the lack of target data and lead to a target classification error approaching an error not much greater than the Bayes error in the target domain.

(a)

(b)

Figure 6.2 Average classification error in transfer learning (k ¼ 25): (a) versus the number of target training data per class nt (ns ¼ 200); (b) versus the number of source training data per class ns (nt ¼ 10). [Reprinted from (Karbalayghareh et al., 2018b).]

280

Chapter 6

(a)

(b)

(c)

(d)

Figure 6.3 Average classification error versus jaj (k ¼ 25): (a) atrue ¼ 0.3; (b) atrue ¼ 0.5; (c) atrue ¼ 0.7; (d) atrue ¼ 0.9. [Reprinted from (Karbalayghareh et al., 2018b).]

Figure 6.3 treats robustness of the OBTLC with respect to the hyperparameters. It shows the average classification error of the OBTLC with respect to jaj, under the assumption that the true value atrue of the amount of relatedness between source and target domains is unknown. The four parts show error curves for atrue ¼ 0.3, 0.5, 0.7, 0.9. In all cases, the OBTLC performance gain depends on the relatedness ðvalue of atrue Þ and the value of a used in the classifier. Maximum gain is achieved at jaj ¼ atrue (not exactly at atrue due to the Laplace approximation of the Gauss hypergeometric function). Performance gain is higher when the two domains are highly related. When the two domains are only modestly related, choosing large jaj leads to higher performance loss compared to the OBC. This means that exaggeration in the amount of relatedness between the two domains can result in “negative” transfer learning. Figure 6.4 shows the errors versus k, assuming unknown true value ktrue , for a ¼ 0.5, a ¼ 0.9, ktrue ¼ 25, and ktrue ¼ 50. The salient point here is that OBTLC performance is not very sensitive to k if it is chosen in its

Optimal Bayesian Transfer Learning

281

(a)

(b)

(c)

(d)

Figure 6.4 Average classification error versus k: (a) a ¼ 0.5, ktrue ¼ 25; (b) a ¼ 0.5, ktrue ¼ 50; (c) a ¼ 0.9, ktrue ¼ 25; (d) a ¼ 0.9, ktrue ¼ 50. [Reprinted from (Karbalayghareh et al., 2018b).]

allowable range, that is, k ≥ 2D. Figure 6.5 depicts average classification error versus nt for a ¼ 0.5 and a ¼ 0.9, where the true value of nt is ntrue ¼ 50. Note that the discrepancy between OBC in Figures 6.5(a) and (b) is only due to the small number of iterations. If nt ≥ 20, then performance is fairly constant. Figures 6.3, 6.4, and 6.5 reveal that, at least in the current simulation, OBTLC performance improvement depends on the value of a and true relatedness atrue between the two domains, and is robust with respect to the choices of the hyperparameters k and nt . A reasonable range of a provides improved performance, but a decent estimate of relatedness is important.

6.4 OBTLC with Negative Binomial Distribution The negative binomial distribution is used to model RNA sequencing (RNA-Seq) read counts owing to its suitability for modeling the over-dispersion

282

Chapter 6

(a)

(b)

Figure 6.5 Average classification error versus nt (b) a ¼ 0.9. [Reprinted from (Karbalayghareh et al., 2018b).]

(ntrue ¼ 50):

a ¼ 0.5;

(a)

of the counts, meaning that the variance can be much greater than the mean (Anders and Huber, 2010; Dadaneh et al., 2018b). In this section we will apply optimal Bayesian transfer learning to count data using the negative binomial distribution (Karbalayghareh et al., 2019). Although the theory is independent of the application, we use gene terminology because it gives an empirical flavor to the development. In the current setting, S ls ¼ fxls,i, j g contains the nls sample points with the counts of D genes in the source domain for the class l, where l ∈ f1, : : : , Lg, i ∈ f1, : : : , Dg, and j ∈ f1, : : : , nls g. Analogously, S lt ¼ fxlt,i, j g is the sample for the target domain, containing nlt sample points with the counts of D genes in the target domain for the class l. The counts xlz,i, j are modeled by a negative binomial distribution xlz,i, j  negative-binomialðmlz,i , rlz,i Þ, with the probability mass function Prðxlz,i, j

¼

kjmlz,i ,

rlz,i Þ

¼

Gðk þ rlz,i Þ



k 

mlz,i

Gðrlz,i ÞGðk þ 1Þ mlz,i þ rlz,i

rlz,i mlz,i þ rlz,i

 rl

z,i

,

(6.98)

where mlz,i and rlz,i are the mean and shape parameters, respectively, of gene i in domain z and class l. The mean and variance of xlz,i, j are E½xlz,i, j  ¼ mlz,i and varðxlz,i, j Þ ¼ mlz,i þ

ðmlz,i Þ2 rlz,i

,

(6.99)

respectively. Let the mean vector m contain all mls,i and mlt,i for l ¼ 1, : : : , L and i ¼ 1, : : : , D, and let the shape vector r contain all rls,i and rlt,i for l ¼ 1, : : : , L

Optimal Bayesian Transfer Learning

283

and i ¼ 1, : : : , D. Assuming that the priors for different genes and classes are independent in each domain, the prior is factored as pðm, rÞ ¼

L Y D Y

pðmls,i , mlt,i Þpðrls,i , rlt,i Þ:

(6.100)

l¼1 i¼1

To define a family of joint priors for two sets of positive parameters, mls,i and mlt,i , and rls,i and rlt,i , we shall use the following version of Theorem 6.1 that takes a specialized form with the matrices L and M being 2  2, and L and M now having real-valued components luv and muv , respectively, where u, v ∈ f1, 2g. Theorem 6.8 (Halvorsen et al., 2016). Let   l11 l12 L¼ l12 l22

(6.101)

be a 2  2 Wishart random matrix with k ≥ 2 degrees of freedom and positive definite scale matrix M. The joint distribution of the two diagonal entries l11 and l22 has density function       1 1 1 1 2 f ðl11 , l22 Þ ¼ K exp  m11 þ c2 f l11 exp  c2 l22 2 2 (6.102)   k 1 2 k k 1 1 2 2  ðl11 Þ ðl22 Þ 0 F 1 ; f l11 l22 , 2 4 1 1 1 k 2 k∕2 where c2 ¼ m22  m212 m1 . 11 , f ¼ c2 m12 m11 , and K ¼ 2 G ðk∕2ÞjMj

We shall also apply the following lemma. Lemma 6.1 (Muirhead, 2009). If 2  2 L  WishartðM, kÞ, then lii ∕mii  chi-squaredðkÞ for i ¼ 1, 2. In particular, E½lii  ¼ kmii , and varðlii Þ ¼ 2km2ii for i ¼ 1, 2. Furthermore, the covariance and correlation coefficient between l11 and l22 are covðl11 , l22 Þ ¼ 2km212 and 1 rl ¼ m212 m1 11 m22 , respectively. For each i ∈ f1, : : : , Dg and l ∈ f1, : : : , Lg, given L in Eq. 6.101, we define the joint prior for mls,i and mlt,i to take the form of the joint density of the diagonals given in Eq. 6.102. Without loss of generality, we also parameterize the prior using the correlation coefficient given in Lemma 6.1. In particular,

284

pðmls,i ,

Chapter 6

mlt,i Þ

¼

 exp 



mls,i





mlt,i

exp  l 2mls,i ð1  rlm,i Þ 2mt,i ð1  rlm,i Þ   rlm,i km km km l 1 l 1 l l ; m m :  ðms,i Þ 2 ðmt,i Þ 2 0 F 1 2 4mls,i mlt,i ð1  rlm,i Þ2 s,i t,i K lm,i

(6.103) The hyperparameters are mlt,i , mls,i , rlm,i , and km , and the normalization constant K lm,i depends on these parameters. By Lemma 6.1, note that E½mlz,i  ¼ km mlz,i and varðmlz,i Þ ¼ 2km ðmlz,i Þ2 for z ∈ fs, tg, and we have the correlation coefficient corrðmls,i , mlt,i Þ ¼ rlm,i . Similarly, for every i ∈ f1, : : : , Dg and l ∈ f1, : : : , Lg, we can define the joint prior for rls,i and rlt,i by  pðrls,i ,

rlt,i Þ

¼

rls,i





rlt,i



exp  l 2sls,i ð1  rlr,i Þ 2st,i ð1  rlr,i Þ   rlr,i kr l k2r 1 l k2r 1 l l ;  ðrs,i Þ ðrt,i Þ 0 F 1 r r , 2 4sls,i slt,i ð1  rlr,i Þ2 s,i t,i K lr,i

exp 

(6.104)

with hyperparameters slt,i , sls,i , rlr,i , and kr , and moments E½rlz,i  ¼ kr slz,i , varðrlz,i Þ ¼ 2kr ðslz,i Þ2 for z ∈ fs, tg, and corrðrls,i , rlt,i Þ ¼ rlr,i . Independence assumptions for all genes i ∈ f1, : : : , Dg, classes l ∈ f1, : : : , Lg, z ∈ fs, tg, and sample points j ∈ f1, : : : , nlz g are adopted for the joint data likelihood function: nlz L Y D Y Y Y

f ðS s , S t jm, rÞ ¼

f ðxlz,i, j jmlz,i , rlz,i Þ:

(6.105)

z∈fs,tg l¼1 i¼1 j¼1

Therefore, the posteriors of different genes in different classes can be factored as pðmls,i , mlt,i , rls,i , rlt,i jS ls , S lt Þ ∝

pðmls,i ,

mlt,i Þpðrls,i ,

rlt,i Þ

nlz Y Y

f ðxlz,i, j jmlz,i , rlz,i Þ

(6.106)

z∈fs,tg j¼1

for all i ∈ f1, : : : , Dg and l ∈ f1, : : : , Lg. Integrating out the source parameters in Eq. 6.106 yields

Optimal Bayesian Transfer Learning

285

pðmlt,i , rlt,i jS ls , S lt Þ Z Z nlz Y Y ` ` l l l l pðms,i , mt,i Þpðrs,i , rt,i Þ f ðxlz,i, j jmlz,i , rlz,i Þdmls,i drls,i : ∝ 0

0

z∈fs,tg j¼1

(6.107) Since the joint prior is not conjugate for the joint likelihood, the joint posteriors and the target posteriors will not have closed forms. MCMC methods are needed to obtain posterior samples in the target domain. Hamilton Monte Carlo (HMC) (Neal, 2011) is used in (Karbalayghareh et al., 2019). The effective class-conditional densities are given by Z Z OBTLC fU ðxjlÞ ¼ f ðxjmlt , rlt Þp∗ ðmlt , rlt Þdmlt drlt (6.108) rlt .0

mlt .0

for l ∈ f1, : : : , Lg, where mlt contains all mlt,i for i ¼ 1, : : : , D, rlt contains all Q l l l l rlt,i for i ¼ 1, : : : , D, and p∗ ðmlt , rlt Þ ¼ D i¼1 pðmt,i , rt,i jS t , S s Þ. Since we do not have closed form for the posterior p∗ ðmlt , rlt Þ, the posterior samples generated by HMC sampling are used to approximate the integration of Eq. 6.108. If there are N posterior samples from all of the D genes in L classes, then the approximation is f OBTLC ðxjlÞ U

N Y D 1X ¼ f ðxi jm¯ lt,i, j , r¯lt,i, j Þ, N j¼1 i¼1

(6.109)

where xi is the ith gene in x, and m¯ lt,i, j and r¯lt,i, j are the jth posterior samples of gene i in class l of the target domain for the mean and shape parameters, respectively. Alternatively, we may write Eq. 6.108 as D Z ` Z ` Y OBTLC ðxjlÞ ¼ f ðxi jmlt,i , rlt,i Þpðmlt,i , rlt,i jS lt , S ls Þdmlt,i drlt,i , (6.110) fU i¼1

0

0

which is a product of effective densities on each gene. We approximate this by ðxjlÞ ¼ f OBTLC U

D X N 1Y f ðxi jm¯ lt,i, j , r¯lt,i, j Þ, N i¼1 j¼1

(6.111)

where the MCMC posterior samples may be drawn for each gene independently. As previously done, we assume a Dirichlet prior for the prior class probabilities. The OBTLC is defined via Eq. 6.84, in this case for the target parameters clt , mlt,i , and rlt,i for l ¼ 1, : : : , L and i ¼ 1, : : : , D.

286

Chapter 6

For deriving the OBC in the target domain using only target data, the marginal priors for mlt,i and rlt,i are scaled chi-squared random variables: mlt,i ∕mlt,i  chi-squaredðkm Þ and rlt,i ∕slt,i  chi-squaredðkr Þ, with  pðmlt,i Þ

¼

k



m ð2mlt,i Þ 2 G

km 2

¼

mlt,i

km 2 1

 exp

mlt,i 2mlt,i

 ,

 1  l  rt,i kr l k2r 1 : rt,i exp 2 2slt,i

 pðrlt,i Þ

1

kr ð2slt,i Þ 2 G

(6.112)

(6.113)

Let the mean vector m contain mlt,i for i ¼ 1, : : : , D and l ¼ 1, : : : , L, and let the shape vector r contain rlt,i for i ¼ 1, : : : , D and l ¼ 1, : : : , L. The prior is factored as pðm, rÞ ¼

L Y D Y

pðmlt,i Þpðrlt,i Þ:

(6.114)

l¼1 i¼1

The likelihood of the target data is nt L Y D Y Y l

f ðS t jm, rÞ ¼

f ðxlt,i, j jmlt,i , rlt,i Þ:

(6.115)

l¼1 i¼1 j¼1

The posteriors of different genes in different classes can be factored as nt Y l

pðmlt,i ,

rlt,i jS lt Þ



pðmlt,i Þpðrlt,i Þ

f ðxlt,i, j jmlt,i , rlt,i Þ

(6.116)

j¼1

for i ∈ f1, : : : , Dg and l ∈ f1, : : : , Lg. HMC is used to get samples from the posteriors of the parameters mlt,i and rlt,i , and using Eq. 6.109 with these samples yields an approximation of the effective class-conditional densities for the OBC. The OBC is then defined using Eq. 6.94. If all correlations between the mean and shape parameters of the target and source domains are zero, that is, rlm,i ¼ 0 and rlr,i ¼ 0 for i ∈ f1, : : : , Dg and l ∈ f1, : : : , Lg, then the joint priors for the OBTLC in Eq. 6.103 become pðmls,i , mlt,i Þ ¼ pðmls,i Þpðmlt,i Þ and pðrls,i , rlt,i Þ ¼ pðrls,i Þpðrlt,i Þ so that all mean and shape parameters are independent between the two domains. Having independent priors and likelihoods leads to independent posteriors for the two domains, and the OBTLC reduces to the OBC. If, on the other hand, rlm,i and rlr,i are close to 1, then the OBTLC yields significantly better performance than the OBC.

Chapter 7

Construction of Prior Distributions Up to this point we have ignored the issue of prior construction, assuming that the characterization of uncertainty is known. For optimal Bayesian classification, the problem consists of transforming scientific knowledge into a probability distribution governing uncertainty in the feature-label distribution. Regarding prior construction in general, in 1968, E. T. Jaynes remarked, “Bayesian methods, for all their advantages, will not be entirely satisfactory until we face the problem of finding the prior probability squarely” (Jaynes, 1968). Twelve years later, he added, “There must exist a general formal theory of determination of priors by logical analysis of prior information—and that to develop it is today the top priority research problem of Bayesian theory” (Jaynes, 1980). Historically, prior construction has tended to utilize general methodologies not targeting any specific type of prior information and has usually been treated independently (even subjectively) of real available prior knowledge and sample data. Subsequent to the introduction of the Jeffreys rule prior (Jeffreys, 1946), objective-based methods were proposed, two early ones being (Kashyap, 1971) and (Bernardo, 1979). There appeared a series of information-theoretic and statistical approaches: non-informative priors for integers (Rissanen, 1983), entropic priors (Rodriguez, 1991), maximal data information priors (MDIP) (Zellner, 1995), reference (non-informative) priors obtained through maximization of the missing information (Berger and Bernardo, 1992), and least-informative priors (Spall and Hill, 1990) [see also (Bernardo, 1979; Kass and Wasserman, 1996; Berger et al., 2012)]. The principle of maximum entropy can be seen as a method of constructing leastinformative priors (Jaynes, 1957, 1968). Except in the Jeffreys rule prior, almost all of the methods are based on optimization: maximizing or minimizing an objective function, usually an information theoretic one. The least-informative prior in (Spall and Hill, 1990) is found among a restricted set of distributions, where the feasible region is a set of convex combinations of

287

288

Chapter 7

certain types of distributions. In (Zellner, 1996) several non-informative and informative priors for different problems are found. In all of these methods there is a separation between prior knowledge and observed sample data. Moreover, although these methods are appropriate tools for generating prior probabilities, they are quite general methodologies and do not target specific scientific prior information. Since, in the case of optimal Bayesian classification, uncertainty is directly on the feature-label distribution and this distribution characterizes our joint knowledge of the features and the labels, prior construction is at the scientific level. A number of prior construction methods will be discussed in this chapter. In the first two sections, we consider a purely data-driven method that uses discarded data via the method of moments to determine the hyperparameters (Dalton and Dougherty, 2011), and a prior distribution derived from a stochastic differential equation that incompletely describes the underlying system (Zollanvari and Dougherty, 2016). In the next two sections we present a general methodology for transforming scientific knowledge into a prior distribution via a constrained optimization where the constraints represent scientific knowledge (Esfahani and Dougherty, 2014, 2015; Boluki et al., 2017).

7.1 Prior Construction Using Data from Discarded Features If we possess a large set G containing D features from which we have chosen a subset H, containing d , D features to use for classification, the remaining D  d features in G \ H can be used for construction of a prior distribution governing the feature-label distribution of the selected features, the assumption being that they implicitly contain useful information. Since with the Bayesian approach we are usually concerned with small samples, D should be much larger than d. We use an independent general covariance Gaussian model with d features used for classification. We will focus on one class at a time without writing the class label y explicitly. Assume distribution parameters u ¼ ðm, SÞ with S invertible. The prior distribution is defined in Eq. 2.60: the mean conditioned on the covariance is Gaussian with mean m and covariance S∕n, and the marginal distribution of the covariance is an inverse-Wishart distribution. The hyperparameters of pðuÞ are a real number n, a length d real vector m, a real number k, and a symmetric d  d matrix S. We restrict k to be an integer to guarantee a closed-form solution. Given n observed sample points, the posterior has the same form as the prior, with updated hyperparameters given in Eqs. 2.65 through 2.68. To ensure a proper prior, we require the following: k . d  1, symmetric positive definite S, and n . 0. Following (Dalton and Dougherty, 2011a), we use a method-of-moments approach to calibrate the hyperparameters. Since estimating a vector m and

Construction of Prior Distributions

289

matrix S may be problematic for a small number of sample points, we limit the number of model parameters by assuming m ¼ m ⋅ 1d

(7.1)

and 2

1 r 6r 1 6 S ¼ s2 6 .. .. 4. . r r

··· ··· .. . ···

3 r r7 7 .. 7, .5 1

(7.2)

where m is a real number, s2 . 0, and 1 , r , 1. This structure is justified because, prior to observing the data, there is no reason to think that any feature, or pair of features, should have unique properties. In particular, the structure of S in Eq. 7.2 implies that the prior correlation between any pair of features does not depend on the indices of the features. Note that correlations between features are generated randomly, and this structure in hyperparameter S does not imply that correlations between features are assumed to be the same. There are five scalars to estimate for each class: n, m, k, s2 , and r. To apply the method of moments, we need the theoretical first and second moments of the random variables m and S in terms of the desired hyperparameters. Since S has a marginal prior possessing an inverse-Wishart distribution with hyperparameters k and S, E½S ¼

S . kd 1

(7.3)

Given the previously defined structure on S, s2 ¼ ðk  d  1ÞE½s11 , r¼

E½s12  , E½s11 

(7.4) (7.5)

where sij is the ith row, jth column element of S. Owing to our imposed structure, only E½s11  and E½s 12  are needed. The variance of the jth diagonal element in an inverse-Wishart distributed S may be expressed as varðs jj Þ ¼

2s2 2ðE½s11 Þ2 , ¼ ðk  d  1Þ2 ðk  d  3Þ k  d  3

where we have applied Eq. 7.4 in the second equality. Solving for k,

(7.6)

290

Chapter 7



2ðE½s11 Þ2 þ d þ 3. varðs11 Þ

(7.7)

Now consider the mean m that is parameterized by the hyperparameters n and m. The marginal distribution of the mean is a multivariate t-distribution (Rowe, 2003):   sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi G kþ1 nd jSj1 2  pðmÞ ¼ kdþ1 ⋅ . (7.8) pd ½1 þ nðm  mÞT S1 ðm  mÞkþ1 G 2 The mean and covariance of this distribution are E½m ¼ m, covðmÞ ¼

S E½S ¼ , ðk  d  1Þn n

(7.9) (7.10)

respectively. With the assumed structure on m, m ¼ E½m1 , n¼

E½s11  , varðm1 Þ

(7.11) (7.12)

where mi is the ith element of m. Our objective is to approximate the expectations in Eqs. 7.4 through 7.12 b C be the using the C ¼ D  d features in the calibration data G \ H. Let m C b be the sample covariance matrix of the calibration sample mean and S data. From these we wish to find several sample moments of m and S b 1 , varðm b 11 , E½s b 12 , and c 1 Þ, E½s for the d-feature classification problem: E½m c 11 Þ, where the hats indicate the sample moment of the corresponding varðs quantity. To compress the set G of D features to solve an estimation problem on just the d features in H, and to find these scalar sample moments in a balanced way, we assume that the selected features are drawn uniformly. Since each feature in G\H is equally likely to be selected as the ith feature, the sample b i  is computed as the average of the mean of the mean of the ith feature E½m C C C b C . This result sample means m b1 , m b2 , : : : , m b C , where m bC i is the ith element of m b 1  to represent all features: is the same for all i, and we use E½m C X b 1 ¼ 1 E½m m bC: C i¼1 i

(7.13)

Construction of Prior Distributions

291

Owing to uniform feature selection, all other moments are balanced over all features or any pair of distinct features. The remaining sample moments are obtained in a similar manner: c 1Þ ¼ varðm

C   1 X b 1 2,  E½m m bC i C  1 i¼1

C 1X b E½s11  ¼ s bC, C i¼1 ii

b 12  ¼ E½s

c 11 Þ ¼ varðs

C X i1 X 2 s bC, CðC  1Þ i¼2 j¼1 ij

C   1 X b 11  2 , s bC  E½s ii C  1 i¼1

(7.14)

(7.15)

(7.16)

(7.17)

bC c 1 Þ represents where s bC ij is the ith row, jth column element of S . Here, varðm b 12  b 11  and E½s the variance of each feature in the mean. We also have E½s representing the mean of diagonal elements and off-diagonal elements in S, c 11 Þ represents the variance of the diagonal elements respectively. Finally, varðs in S. Plugging the sample moments into Eqs. 7.4, 7.5, 7.7, 7.11, and 7.12 yields  b  ðE½s 11 Þ2 2 b s ¼ 2E½s 11  þ1 , (7.18) c 11 Þ varðs b 12  E½s , b 11  E½s

(7.19)

b 11 Þ2 2ðE½s þ d þ 3, c 11 Þ varðs

(7.20)





b 1 , m ¼ E½m n¼

b 11  E½s : c 1Þ varðm

(7.21) (7.22)

Note that Eq. 7.20 for k was plugged into Eq. 7.4 to obtain the final s2 . In sum, calibration for the prior hyperparameters is defined by Eqs. 7.18 through 7.22, the sample moments being given in Eqs. 7.13 through 7.17. The estimates of k and n can be unstable since they rely on second moments c 11 Þ and varðm c 1 Þ in a denominator. These parameters can be made more varðs

292

Chapter 7

stable by discarding outliers when computing the sample moments. Herein, bC we discard the 10% of the m bC i with largest magnitude and the 10% of the s ii with largest value. Many variants of the method are possible. For instance, to avoid an overdefined system of equations, we do not incorporate the covariance covðm1 , m2 Þ between distinct features in m, the variance varðs 12 Þ of offdiagonal elements in S, or the covariance between distinct elements in S, though it may be possible to use these. It may also be feasible to use other estimation methods, such as maximum likelihood. We illustrate the data-driven methodology using a synthetic highdimensional data model, originally proposed to model thousands of geneexpression measurements, that uses blocked covariance matrices to model groups of interacting variables with negligible interactions between groups (Hua et al., 2009). The model is employed to emulate a feature-label distribution with 20,000 total features. Features are categorized as either “markers” or “non-markers.” Markers represent features that have different class-conditional distributions in the two classes and are further divided into two subtypes: global markers and heterogeneous markers. Non-markers have the same distributions for both classes and thus have no discriminatory power. These are divided into two subtypes: high-variance non-markers and lowvariance non-markers. A graphical categorization of the feature types is shown in Fig. 7.1. There are four basic types:

class 0

class 1 subclass 0 subclass 1

1. Global (homogeneous) markers: each class has its own Gaussian distribution—these are the best features to discriminate the classes;

global markers

heterogeneous markers

high-variance non-markers

low-variance non-markers

Figure 7.1 Different feature types in the high-dimensional synthetic microarray data model. [Reprinted from (Hua et al., 2009).]

Construction of Prior Distributions

293

2. Heterogeneous markers: one class is Gaussian and the other is a mixture of Gaussians—these features can still discriminate the classes, but not as well; 3. High-variance non-markers: all of these features are independent and, for any given feature, all sample points come from the same mixture of Gaussian distributions with a randomly selected mixture proportion; 4. Low-variance non-markers: all of these features are independent and, for any given feature, all sample points come from the same univariate Gaussian distribution. We utilize 20 features as global markers that are homogeneous in each class. In particular, the set of all global markers in class i has a Gaussian gm distribution with mean mgm i and covariance matrix Si . Class 1 is divided into two subclasses, called subclass 0 and subclass 1. The 100 heterogeneous markers are also divided into two groups, with 50 features per group. Call these groups A0 and A1 . Features in A0 and A1 are independent of each other. Under features in group A0 , class-1-subclass-0 points are jointly Gaussian with the same mean and covariance structure as class-1 global markers, and class-1-subclass-1 points are jointly Gaussian with the same mean and covariance structure as class-0 global markers. Under features in group A1 , class-1-subclass-0 points are jointly Gaussian with the same mean and covariance structure as class-0 global markers, and class-1-subclass-1 points are jointly Gaussian with the same mean and covariance structure as class-1 global markers. The model is simplified by assuming that mgm ¼ mi ⋅ 120 for i 2 2 2 ¼ s S, where s and s are constants, fixed scalars mi . We assume that Sgm i i 0 1 and S possesses a block covariance structure, 2 3 Sr ·· · 0 6 .. .. 7, S ¼ 4 ... (7.23) . . 5 0 ·· · Sr with Sr being a 5  5 matrix with 1 on the diagonal and r ¼ 0.8 off the diagonal. We generate 2000 high-variance non-marker features that have independent mixed Gaussian distributions given by pN ðm0 , s20 Þ þ ð1  pÞN ðm1 , s21 Þ, where mi and s2i are the same scalars as were defined for markers. The random variable p is selected independently for each feature with a uniformð0, 1Þ distribution and is applied to all sample points of both classes. The remaining features are low-variance non-marker features having independent univariate Gaussian distributions with mean m0 and variance s20 . In this model, heterogeneous markers are Gaussian within each subclass, but the class-conditional distribution for class 1 is a mixed Gaussian distribution (mixing the distributions of the subclasses) and is thus not

294

Chapter 7

Gaussian. Furthermore, the high-variance features are also mixed Gaussian distributions, so this model incorporates both Gaussian and non-Gaussian features. To simplify the simulations, we set the a priori probability of both classes to 0.5, fix the parameters m0 ¼ 0 and m1 ¼ 1, and define a single parameter s2 ¼ s20 ¼ s 21 , which specifies the difficulty of the classification problem. A summary of the model parameters is given in Table 7.1. In all simulations, the values for s2 are chosen such that a single global marker has a specific Bayes error εBayes ¼ Fð1∕ð2sÞÞ. For instance, s ¼ 0.9537 yields Bayes error 0.3. Several Monte Carlo simulations are run. In each experiment we fix the training sample size n, the number of selected features d, and the difficulty of the classification problem via s. In each iteration the sample size of each class is determined by a binomialðn, 0.5Þ experiment, and the corresponding sample points are randomly generated according to the distributions defined for each class. Once the sample has been generated, a three-stage feature-selection scheme is employed: (1) apply a t-test to obtain 1000 highly differentially expressed features; (2) apply a Shapiro–Wilk Gaussianity test and eliminate features that do not pass the test with 95% confidence—the number of passing features is variable and, if there are not at least 30 passing features, then we return the 30 features with the highest sum of the Shapiro–Wilk test statistics for both classes; (3) use the same t-test values originally computed to obtain the final set of d highly differentially expressed Gaussian features with which to design the classifier. The C ¼ 1000  d features that pass the first stage of feature selection but are not used for classification are saved as calibration data. The performance of this feature-selection scheme is analyzed in (Dalton and Dougherty, 2011a). The feature-selected training data are used to train an LDA classifier. With the classifier fixed, 5000 test points are drawn from exactly the same distribution as the training data and used expressly to approximate the true error. Subsequently, several training-data error estimators are computed, Table 7.1

Synthetic high-dimensional data model parameters.

Parameters Total features Global markers Subclasses in class 1 Heterogeneous markers High-variance features Low-variance features Mean Variances Block size Block correlation a priori probability of class 0

Values/description 20,000 20 2 50 per subclass (100 total) 2000 17,880 m0 ¼ 0, m1 ¼ 1 s 2 ¼ s 20 ¼ s21 (controls Bayes error) 5 0.8 0.5

295 0.25

0.2

0.15

0.1

loo cv boot bolstering BEE, flat BEE, calibrated

RMS deviation from true error

RMS deviation from true error

Construction of Prior Distributions

0.05

0 0

0.1

0.2 0.3 0.4 average true error

(a)

0.5

0.2 0.15

loo cv boot bolstering BEE, flat BEE, calibrated

0.1 0.05 0 0

0.1

0.2 0.3 0.4 average true error

0.5

(b)

Figure 7.2 RMS deviation from true error for the synthetic high-dimensional data model with LDA classification versus average true error: (a) n ¼ 120, d ¼ 1; (b) n ¼ 120, d ¼ 7. [Reprinted from (Dalton and Dougherty, 2011a).]

including leave-one-out (loo), 5-fold cross-validation with 5 repetitions (cv), 0.632 bootstrap (boot), and bolstered resubstitution (bol). Two Bayesian MMSE error estimators are also applied, one with the flat prior defined by pðuÞ ¼ 1 (flat BEE), and the other with priors calibrated as described previously (calibrated BEE). Since the classifier is linear, these Bayesian MMSE error estimators are computed exactly. This entire process is repeated 120,000 times to approximate the RMS deviations from the true error for each error estimator. Figure 7.2 shows RMS deviation from true error for all error estimators with respect to the expected true error for LDA classification with 1 or 7 selected features and 120 sample points. Given the sample size, it is prudent to keep the number of selected features small to have satisfactory feature selection (Sima and Dougherty, 2006b) and to avoid the peaking phenomena (Hughes, 1968; Hua et al., 2005, 2009). Several classical error estimators are shown, along with Bayesian MMSE error estimators with the flat prior and calibrated priors. The calibrated-prior Bayesian MMSE error estimator has best performance in the middle and high range. For an average true error of about 0.25, the RMS for the calibrated Bayesian MMSE error estimator outperforms 5-fold cross-validation for d ¼ 1 and 7 by 0.0366 and 0.0198, respectively, representing 67% and 33% decreases in RMS, respectively. All other error estimators typically have best performance for low average true errors, with the flat-prior Bayesian MMSE error estimator having better performance than the classical error estimation schemes.

7.2 Prior Knowledge from Stochastic Differential Equations Stochastic differential equations (SDEs) are used to model phenomena in many domains of science, engineering, and economics. This section utilizes vector SDEs as prior knowledge in the classification of time series. Although

296

Chapter 7

we will confine ourselves to a Gaussian problem, the basic ideas can be extended by using MCMC methods. In the stochastic setting, training data are collected over time processes. Given certain Gaussian assumptions, classification in the SDE setting takes the same form as ordinary classification in the Gaussian model, and we can apply optimal classification theory after constructing a prior distribution based on known stochastic equations. Specifically, we consider a vector SDE in integral form involving a drift vector and dispersion matrix. 7.2.1 Binary classification of Gaussian processes A stochastic process X with state space RD is a collection fXt : t ∈ Ug of RD -valued random vectors defined on a common probability space ðV, F , PrÞ indexed by a (time) parameter t ∈ U ⊆ R, where V is the sample space, and F is a s-algebra on V. The time set U and state space RD are given the usual Euclidean topologies and Borel s-algebras. A stochastic process X is a multivariate Gaussian process if, for every finite set of indices t1 , : : : , tk ∈ U, ½XTt1 , : : : , XTtk T is a multivariate Gaussian random vector. Suppose that X is a multivariate Gaussian process, and consider the D-dimensional column random vectors Xt1 , Xt2 , : : : , XtN for the fixed observation time vector tN ¼ ½t1 , t2 , : : : , tN T . Then XtN ¼ ½XTt1 , XTt2 , : : : , XTtN T possesses a multivariate Gaussian distribution N ðmtN , StN Þ, where 2 3 m t1 6 m t2 7 6 7 mtN ¼ 6 . 7 (7.24) . 4 . 5 mtN is an ND  1 vector with mti ¼ E½Xti , and 2 St ,t St ,t 6S1 1 S1 2 6 t2 ,t1 t2 ,t2 StN ¼ 6 .. 6 .. . 4 . StN ,t1

StN ,t2

··· ··· .. . ···

3 St1 ,tN St2 ,tN 7 7 .. 7 7 . 5 StN ,tN

(7.25)

is an ND  ND matrix with Sti ,tj ¼ E½ðXti  E½Xti ÞðXtj  E½Xtj ÞT .

(7.26)

For any fixed v ∈ V, a sample path is a collection fXt ðvÞ : t ∈ Ug. A realization of X at sample path v and time vector tN is denoted by xtN ðvÞ. Following (Zollanvari and Dougherty, 2016), we consider a general framework, referred to as binary classification of Gaussian processes (BCGP).

Construction of Prior Distributions

297

Consider two independent multivariate Gaussian processes X0 and X1 , where, for fixed tN , X0tN possesses mean and covariance m0tN and S0tN , and X1tN possesses mean and covariance m1tN and S1tN . For y ¼ 0, 1, mytN is defined similarly to Eq. 7.24, where myti ¼ E½Xyti , and SytN is defined similarly to Eq. 7.25, with Syti , tj ¼ E½ðXyti  E½Xyti ÞðXytj  E½Xytj ÞT .

(7.27)

Let S ytN denote a set of ny sample paths from process Xy at tN : n o S ytN ¼ xytN ðv1 Þ, xytN ðv2 Þ, : : : , xytN ðvny Þ ,

(7.28)

where y ∈ f0, 1g is the label of the class-conditional process the sample path is coming from, X0 or X1 . Separate sampling is assumed. Stochastic-process classification is defined relative to a set of sample paths that can be considered as observations of ND-dimension. Let xytN ðvs Þ denote a test sample path observed on the same observation time vector as the training sample paths but with y unknown. We desire a discriminant ctN to predict y: y¼

0 if ctN ðxytN ðvs ÞÞ ≤ 0, 1 otherwise:

(7.29)

Other types of classification are possible in this stochastic-process setting. We could classify a test sample path xytNþM ðvs Þ whose observation time vector is obtained by augmenting tN by another vector ½tNþ1 , tNþ2 , : : : , tNþM T , M being a positive integer. In this case, the time of observation for the test sample path is extended. The test sample path’s time of observation might be a subset of time points in tN , or a set of time points totally or partially different from time points in tN . We confine ourselves to the problem defined in Eq. 7.29. We consider SDEs defined via a Wiener process. A one-dimensional Wiener process W ¼ fW t : t ≥ 0g is a Gaussian process satisfying the following properties: pffiffiffiffiffiffiffiffiffiffiffiffiffi 1. For 0 ≤ t1 , t2 , `, W t2  W t1 is distributed as t2  t1 N ð0, s2 Þ, where s . 0 (s ¼ 1 for the standard Wiener process). 2. For 0 ≤ t1 , t2 ≤ t3 , t4 , `, W t4  W t3 is independent of W t2  W t1 . 3. W 0 ¼ 0 almost surely. 4. The sample paths of W are almost surely everywhere continuous. In general, a d-dimensional Wiener process is defined using a homogeneous Markov process X ¼ fXt : t ≥ 0g. A stochastic process X on ðV, F , PrÞ is adapted to the filtration fF t : t ≥ 0g (a filtration is a family of non-decreasing

298

Chapter 7

sub-s-algebras of F ) if Xt is F t -measurable for each t ≥ 0. X is also a Markov process if it satisfies the Markov property: for each t1 , t2 ∈ ½0, `Þ with t1 , t2 and for each Borel set B ⊆ Rd , PrðXt2 ∈ BjF t1 Þ ¼ PrðXt2 ∈ BjXt1 Þ.

(7.30)

Thus, a Markov process X is characterized by a collection of transition probabilities, which we denote by Prðt1 , x; t2 , BÞ ¼ PrðXt2 ∈ BjXt1 ¼ xÞ for 0 ≤ t1 , t2 , `, x ∈ Rd , and Borel set B ⊆ Rd . For fixed values of t1 , x, and t2 , Prðt1 , x; t2 , ⋅Þ is a probability measure on the s-algebra B of Borel sets in Rd . A Markov process is homogeneous if its transition probability Prðt1 , x; t2 , BÞ is stationary: for 0 ≤ t1 , t2 , ` and 0 ≤ t1 þ u , t2 þ u , `, Prðt1 þ u, x; t2 þ u, BÞ ¼ Prðt1 , x; t2 , BÞ.

(7.31)

In this case, Prðt1 , x; t2 , BÞ is denoted by Prðt2  t1 , x; BÞ. A d-dimensional Wiener process is a d-dimensional homogeneous Markov process with stationary transition probability defined by a multivariate Gaussian distribution: Z Prðt, x; BÞ ¼ B

1

 d e

ð2ptÞ2

kyxk2 2 2t

dy:

(7.32)

Each dimension of a d-dimensional Wiener process is a one-dimensional Wiener process. Let W ¼ fWt : t ≥ 0g be a d-dimensional Wiener process. For each sample path and for 0 ≤ t0 ≤ t ≤ T, we consider a vector SDE in the integral form Z t Z t gðs, Xs ðvÞÞds þ Gðs, Xs ðvÞÞdWs ðvÞ, (7.33) Xt ðvÞ ¼ Xt0 ðvÞ þ t0

t0

where g : ½0, T  V → RD (the D-dimensional drift vector) and G : ½0, T  V → RDd (the D  d dispersion matrix). The first integral in Eq. 7.33 is an ordinary Lebesgue integral, and we assume an Itô integration for the second integral. Let L be the Lebesgue s-algebra on R. We say that a function h : ½0, T  V → R on the probability space ðV, F , PrÞ belongs to V L t -measurable for each t ∈ ½0, T, and R TT if2 it is L  F -measurable, hðt, ⋅Þ is F i h ðs, vÞds , almost surely. Let g and G i, j denote the components of g ` 0 pffiffiffiffiffiffiffi and G, respectively. If we assume that X0 ðvÞ is F 0 -measurable, jgi j ∈ LV T, i, j V and G ∈ LT , then each component of the D-dimensional process Xt is F t -measurable (Kloeden and Platen, 1995). Equation 7.33 is commonly written in a symbolic form as

Construction of Prior Distributions

299

dXt ¼ gðt, Xt Þdt þ Gðt, Xt ÞdWt ,

(7.34)

which is the representation of a vector SDE. 7.2.2 SDE prior knowledge in the BCGP model Prior knowledge in the form of a set of SDEs constrains the possible behavior of the dynamical system to an uncertainty class. In (Zollanvari and Dougherty, 2016), valid prior knowledge is defined as a set of SDEs with a unique solution that does not contradict the Gaussianity assumption of the dynamics of the model. For nonlinear gðt, Xt Þ and Gðt, Xt Þ, the solution of Eq. 7.34 is in general a non-Gaussian process; however, for a wide class of linear functions, the solutions are Gaussian and the SDEs become valid prior knowledge for each class-conditional process defined in the BCGP model. We focus on this type of SDE. For class label y ¼ 0, 1, the linear classes of SDEs we consider are defined by gy ðt, Xt Þ ¼ Ay ðtÞXyt þ ay ðtÞ,

(7.35)

Gy ðt, Xt Þ ¼ By ðtÞ

(7.36)

in Eq. 7.34, with Ay ðtÞ a D  D matrix, ay ðtÞ a D  1 vector, and By ðtÞ a D  d matrix, these being measurable and bounded on ½t0 , T and independent of v. Hence, dXyt ¼ ½Ay ðtÞXyt þ ay ðtÞdt þ By ðtÞdWyt ,

(7.37)

Xyt0 ðvÞ ¼ cy ðvÞ,

(7.38)

where cy is a d  1 vector. This initial value problem has a unique solution that is a Gaussian stochastic process if and only if the initial conditions cy are constant or Gaussian (Arnold, 1974, Theorem 8.2.10). Moreover, the mean (at time index ti ) and the covariance matrix (at ti and tj ) of the Gaussian process Xyt are given by (Arnold, 1974)

Z t i y y y y y 1 y mti ¼ E½Xti  ¼ F ðti Þ E½c  þ ½F ðsÞ a ðsÞds , (7.39) t0

Cyti ,tj

    y y y y T ¼ E Xti  E½Xti  Xtj  E½Xtj 

h i y ¼ F ðti Þ E ðcy  E½cy Þðcy  E½cy ÞT Z þ t0

ti

(7.40)

½F ðuÞ B ðuÞ½B ðuÞ ð½F ðuÞ Þ du ½Fy ðtj ÞT , y

1

y

y

T

y

1 T

300

Chapter 7

where t0 ≤ ti ≤ tj ≤ T, and Fy ðti Þ is the fundamental matrix of the deterministic differential equation ˙ yt ¼ Ay ðtÞXyt : X

(7.41)

If the SDE in Eqs. 7.37 and 7.38 precisely represents the dynamics of the underlying stochastic processes of the BCGP model, then no training sample paths are needed. In this case, Eqs. 7.24 and 7.25 take the form mytN ¼ mytN ,

(7.42)

SytN ¼ CytN ,

(7.43)

where 3 myt1 6 y 7 6 m t2 7 7 6 ¼6 . 7 6 .. 7 5 4 y mtN 2

mytN

(7.44)

is an ND  1 vector, and 2

CytN

Cyt1 ,t1 6 y 6 Ct2 ,t1 6 ¼6 . 6 .. 4 CytN ,t1

Cyt1 ,t2 Cyt2 ,t2 .. . y CtN ,t2

··· ··· .. . ···

3 Cyt1 ,tN 7 Cyt2 ,tN 7 7 .. 7 . 7 5 y CtN ,tN

(7.45)

is an ND  ND matrix, with myti and Cyti ,tj being obtained from Eqs. 7.39 and 7.40, respectively. The (approximately) exact values of the means and autocovariances used to characterize the Gaussian processes in the BCGP model can be obtained. If Eq. 7.41 can be solved analytically, then numerical methods can be used to evaluate the integrations in Eqs. 7.39 and 7.40. For example, if Ay ðtÞ ¼ Ay is independent of t, then the solution of Eq. 7.41 is given by the matrix exponential Fy ðtÞ ¼ expððt  t0 ÞAy Þ, which can be used in Eqs. 7.39 and 7.40. If one cannot analytically solve Eq. 7.41, then numerical methods can be used to obtain good approximations of myti and Cyti ,tj [see (Zollanvari and Dougherty, 2016) for details]. Assuming that the values of m0ti , m1ti , C0ti ,tj , and C1ti ,tj are (approximately) known, because BCGP classification reduces to discriminating independent observations of ND dimension generated from two multivariate Gaussian distributions, the optimal discriminant in Eq. 7.29 is quadratic:

Construction of Prior Distributions

301

 i   h i  1h y y y 0 T 0 1 0 x C x x cQDA ðv Þ ¼ ðv Þ  m ðv Þ  m s s tN tN tN tN tN tN 2 tN s i h i h 1 T  xytN ðvs Þ  m1tN ðC1tN Þ1 xytN ðvs Þ  m1tN 2 1 C0tN a1 þ ln 1 þ ln 0 , 2 CtN a

(7.46)

where ay ¼ PrðY ¼ yÞ. If the SDEs do not provide a complete description of the dynamics of each class-conditional process, then Eqs. 7.42 and 7.43 do not hold; however, the SDEs can be used as prior knowledge. In (Zollanvari and Dougherty, 2016), two assumptions are made on the nature of the prior information to which the set of SDEs corresponding to each class gives rise: (1) before observing the sample paths at an observation time vector, the SDEs characterize the only information that we have about the system; and (2) the statistical properties of all Gaussian processes that may generate the data are on average (over the parameter space) equivalent to the statistical properties determined from the SDEs. Thus, we will supply the classification problem with prior knowledge in the form of SDEs, which are used to specify the mean of a mean and the mean of a covariance matrix characterizing the data from each class. We then construct a conjugate prior consistent with this information. The effective densities may then be found and used to find the OBC, as explained below, or used to perform other analyses. Fixing tN , under uncertainty, the BCGP model is characterized by a random parameter uytN ¼ ðmytN , SytN Þ, with mytN and SytN defined in Eqs. 7.24 and 7.25, respectively. uytN has a conjugate normal-inverse-Wishart prior distribution ̮

̮

y

pðuytN Þ, parameterized by a set of hyperparameters ðmytN , CtN , nytN , kytN Þ, and given by 1ðky þNDþ1Þ 1 ̮ y y 2 tN y y 1 etr  CtN ðStN Þ pðutN Þ ∝ StN 2

y   1   1   (7.47) ntN ̮ y T ̮ y y 2 y y y mtN  mtN S tN mtN  mtN :  StN exp  2 We assume that u0tN and u1tN are independent. From Eq. 7.47 we obtain h i ̮ (7.48) Ep mytN ¼ mytN , h

Ep SytN

i

̮

y

Ct N , ¼ y ktN  ND  1

(7.49)

302

Chapter 7

the latter holding for kytN . ND þ 1. Hence, assumption (2) on the nature of the prior information implies that ̮

mytN ¼ mytN ,

(7.50)

CtN ¼ ðkytN  ND  1ÞCytN ,

(7.51)

̮

y

with mytN defined by Eqs. 7.39 and 7.44, and CytN defined by Eqs. 7.40 and 7.45. The greater our confidence in a set of SDEs representing the underlying stochastic processes, the larger we might choose the values of nytN and kytN , and the more concentrated become the priors of the mean and covariance about mytN and CytN , respectively. To ensure a proper prior and the existence of the ̮

y

expectation in Eq. 7.49, we assume that CtN is symmetric positive definite, kytN . ND þ 1, and nytN . 0. Assume that all test and training sample paths are independent. The OBC theory applies: (1) the effective class-conditional distributions of the processes result from Eq. 2.120; and (2) extending the dimensionality to ND and using ̮ ̮ y the parameter set ðmytN , CtN , nytN , kytN Þ in the discriminant of Eq. 4.25 yields   y cOBC ðv Þ x s tN tN

 T   h i k0 þND 1 y y 0∗ 0 1 0∗ ΠtN xtN ðvs Þ  mtN ¼ K 1 þ 0 xtN ðvs Þ  mtN (7.52) k

i   h i k1 þND 1 h y y 1∗ T 1 1 1∗ ΠtN xtN ðvs Þ  mtN  1 þ 1 xtN ðvs Þ  mtN , k where



2     32 0

k0 k 1 þND 2 a1 2 k 0 ND ΠtN 6G 2 G    7 5, 1 4 k1 k 0 þND a0 k1 G Π G tN

ΠytN Cy∗ tN

¼

̮

y Ct N

2

2

ny∗ tN þ 1 y∗ ¼ y∗ y∗ CtN , ðktN  ND þ 1ÞntN

  nytN ny  y ̮ y ̮ y T y y b b b þ ðn  1ÞStN þ y  m  m , m m tN tN tN tN ntN þ ny y

(7.53)

(7.54)

(7.55)

y y ny∗ tN ¼ ntN þ n ,

(7.56)

y y ky∗ tN ¼ ktN þ n ,

(7.57)

Construction of Prior Distributions

303

k y ¼ ky∗ tN  ND þ 1,

(7.58)

̮

my∗ tN ̮

̮

b ytN nyt myt þ ny m ¼ N yN , ntN þ ny

(7.59)

y

b ytN mytN and CtN are determined from the SDEs via Eqs. 7.50 and 7.51, and m b y are the sample mean and sample covariance matrix obtained by using and S tN

the sample path training set S ytN :

n 1X xy ðv Þ, ny i¼1 tN i

(7.60)

n h ih i 1 X y y y y T b b ðv Þ  m ðv Þ  m : x x i i t t t t N N N N ny  1 i¼1

(7.61)

y

b ytN ¼ m

y

by ¼ S tN

Since we are assuming separate sampling, there is no mechanism for estimating ay , and thus it is assumed to be known, as is typical for separate sampling.

7.3 Maximal Knowledge-Driven Information Prior In this section, we present a formal structure for prior formation involving a constrained optimization in which the constraints incorporate existing scientific knowledge augmented by slackness variables (Boluki et al., 2017). We will subsequently discuss instantiations of this formal structure. The constraints tighten the prior distribution in accordance with prior knowledge while at the same time avoiding inadvertent over-restriction of the prior. Prior construction involves two steps: (1) functional information quantification; (2) objective-based prior selection: combining sample data and prior knowledge, and building an objective function in which the expected mean log-likelihood is regularized by the quantified information in step 1. When there are no sample data or only one data point available for prior construction, this procedure reduces to a regularized extension of the maximum entropy principle. The next two definitions provide the general framework. Notice in the definitions the formalization of our prior knowledge. This means that, whereas prior scientific knowledge may come in various forms, for our use it must be mathematically formalized in a way that allows it to be precisely used in the cost function, for instance, as equalities, inequalities, or conditional probabilities.

304

Chapter 7

Definition 7.1. If Π is a family of proper priors, then a maximal knowledgedriven information prior (MKDIP) is a solution to the optimization arg min Ep ½C u ðj, p, SÞ,

(7.62)

p∈Π

where C u ðj, p, SÞ is a cost function depending on the random vector u parameterizing the uncertainty class, a formalization j of our prior knowledge, the prior p, and part of the sample data S. Alternatively, by parameterizing the prior probability as pðu; gÞ, with g ∈ G denoting the hyperparameters, an MKDIP can be found by solving arg min Epðu;gÞ ½C u ðj, g, SÞ:

(7.63)

g∈G

The MKDIP incorporates prior knowledge and part of the data to construct an informative prior. There is no known way of determining the optimal amount of observed data to use in the optimization of Definition 7.1; however, the issue is addressed via simulations in (Esfahani and Dougherty, 2014) and will be discussed in the next subsection. A special case arises when the cost function is additively decomposed into costs on the hyperparameters and the data so that it takes the form ð1Þ

ð2Þ

C u ðj, g, SÞ ¼ ð1  bÞgu ðj, gÞ þ bgu ðj, SÞ, ð1Þ

(7.64) ð2Þ

where b ∈ ½0, 1 is a regularization parameter, and gu and gu are cost functions. The next definition transforms the MKDIP into a constrained optimization. Definition 7.2. A maximal knowledge-driven information prior with constraints solves the MKDIP problem subject to the constraints i h ð3Þ Epðu;gÞ gu,i ðjÞ ¼ 0, i ¼ 1, 2, : : : , nc , (7.65) ð3Þ

where gu,i , i ¼ 1, 2, : : : , nc , are constraints resulting from the state j of our knowledge via a mapping  i i i h h h ð3Þ ð3Þ ð3Þ (7.66) T : j → Epðu;gÞ gu,1 ðjÞ , Epðu;gÞ gu,2 ðjÞ , : : : , Epðu;gÞ gu,nc ðjÞ : We restrict our attention to MKDIP with constraints and additive costs as in Eq. 7.64. In addition, while the MKDIP formulation allows prior

Construction of Prior Distributions

305

information in the cost function, we will restrict prior information to the constraints so that Eq. 7.64 becomes ð1Þ

ð2Þ

C u ðg, SÞ ¼ ð1  bÞgu ðgÞ þ bgu ðSÞ,

(7.67)

and the optimization of Eq. 7.63 becomes ð1Þ

ð2Þ

arg min Epðu;gÞ ½ð1  bÞgu ðgÞ þ bgu ðSÞ. g∈G

(7.68)

A nonnegative slackness variable can be considered for each constraint to make the constraint framework more flexible, thereby allowing potential error or uncertainty in prior knowledge (allowing potential inconsistencies in prior knowledge). When employed, slackness variables also become optimization parameters, and a linear function (summation of all slackness variables) times a regulatory coefficient is added to the cost function of the optimization in Eq. 7.63 so that the optimization in Eq. 7.63 relative to Eq. 7.67 becomes   nc X ð1Þ ð2Þ arg min Epðu;gÞ l1 ½ð1  bÞgu ðgÞ þ bgu ðSÞ þ l2 ei g∈G, e∈E (7.69) i¼1 ð3Þ

subject to  ei ≤ Epðu;gÞ ½gu,i ðjÞ ≤ ei , i ¼ 1, 2, : : : , nc , where l1 and l2 are nonnegative regularization parameters, and e ¼ ½e1 , : : : , enc  and E represent the vector of all slackness variables and the feasible region for slackness variables, respectively. Each slackness variable determines a range—the more uncertainty regarding a constraint, the greater the range for the corresponding slackness variable. We list three cost functions in the literature: 1. Maximum entropy: The principle of maximum entropy (Guiasu and Shenitzer, 1985) yields the least informative prior given the constraints in order to prevent adding spurious information. Under our general framework, this principle can be formulated by setting b ¼ 0 and ð1Þ

gu ðgÞ ¼ log pðu; gÞ.

(7.70)

Thus, ð1Þ

Epðu;gÞ ½gu ðgÞ ¼ HðuÞ,

(7.71)

where Hð⋅Þ denotes the Shannon entropy (in the discrete case) or differential entropy (in the continuous case). The base of the logarithm determines the unit of information. 2. Maximal data information: The maximal data information prior (MDIP) (Zellner, 1984) provides a criterion for the constructed

306

Chapter 7

probability distribution to remain maximally committed to the data, meaning that the cost function gives “the a priori expected information for discrimination between the data-generating distribution and the prior” (Ebrahimi et al., 1999). To achieve MDIP in our general framework, set b ¼ 0 and ð1Þ

gu ðgÞ ¼ log pðu; gÞ þ HðXÞ ¼ log pðu; gÞ  E½log f ðXjuÞju,

(7.72)

where X is a random vector with the u-conditioned density f ðxjuÞ. 3. Expected mean log-likelihood: This cost function (Esfahani and Dougherty, 2014), which utilizes part of the observed data for prior construction, results from setting b ¼ 1 and ð2Þ

gu ðSÞ ¼ lðu; SÞ,

(7.73)

where S consists of n independent points x1 , : : : , xn , and lðu; SÞ ¼

n 1X log f ðxi juÞ n i¼1

(7.74)

is the mean log-likelihood function of the sample points used for prior construction. The mean log-likelihood function lðu; SÞ is related to the Kullback– Leibler (KL) divergence. To see this, consider the true distribution utrue and an arbitrary distribution u. KL divergence provides a measure of the difference between the distributions: Z f ðxjutrue Þ KLðutrue , uÞ ¼ dx f ðxjutrue Þ log f ðxjuÞ X (7.75) Z h i ¼ f ðxjutrue Þ log f ðxjutrue Þ  f ðxjutrue Þ log f ðxjuÞ dx: X

Since KLðutrue , uÞ ≥ 0 and f ðxjutrue Þ is fixed, KLðutrue , uÞ is minimized by maximizing Z f ðxjutrue Þ log f ðxjuÞdx ¼ E½log f ðXjuÞjutrue , (7.76) rðutrue , uÞ ¼ X

which can therefore be treated as a similarity measure between utrue and u. Since the sample points are conditioned on utrue , rðutrue , uÞ has the samplemean estimate lðu; SÞ given in Eq. 7.74 (Akaike, 1973, 1978; Bozdogan, 1987).

Construction of Prior Distributions

307

As originally proposed, the preceding three approaches did not involve expectation over the uncertainty class. They were extended to the general prior construction form in Definition 7.1, including the expectation, to produce the regularized maximum entropy prior (RMEP) and the regularized maximal data information prior (RMDIP) in (Esfahani and Dougherty, 2014), and the regularized expected mean log-likelihood prior (REMLP) in (Esfahani and Dougherty, 2015). In all cases, optimization was subject to specialized constraints. Three MKDIP methods result from using these information-theoretic cost functions in the MKDIP prior construction optimization framework. 7.3.1 Conditional probabilistic constraints Knowledge is often in the form of conditional probabilities characterizing conditional relations: specifically, PrðUi ∈ Ai jVi ∈ Bi Þ, where Ui and Vi are vectors composed of variables in the system, and Ai and Bi are subsets of the ranges of Ui and Vi , respectively. For instance, if a system has m binary random variables X 1 , X 2 , : : : , X m , then there are m2m1 probabilities of the form PrðX i ¼ k i jX 1 ¼ k 1 , : : : , X i1 ¼ k i1 , X iþ1 ¼ k iþ1 , : : : , X m ¼ k m Þ ¼ aki i ðk 1 , : : : , k i1 , k iþ1 , : : : , k m Þ. (7.77) Note that a0i ðk 1 , : : : , k i1 , k iþ1 , : : : , k m Þ ¼ 1  a1i ðk 1 , : : : , k i1 , k iþ1 , : : : , k m Þ.

(7.78)

The chosen constraints will involve conditional probabilities whose values are approximately known. For instance, if X 1 ¼ 1 if and only if X 2 ¼ 1 and X 3 ¼ 0, regardless of other variables, then a11 ð1, 0, k 4 , : : : , k m Þ ¼ 1

(7.79)

and a11 ð1, 1, k 4 , : : : , k m Þ ¼ a11 ð0, 0, k 4 , : : : , k m Þ ¼ a11 ð0, 1, k 4 , : : : , k m Þ ¼ 0 (7.80) for all k 4 , : : : , k m . In stochastic systems it is unlikely that conditioning will be so complete that constraints take 0-1 forms; rather, the relation between variables in the model will be conditioned on the context of the system being modeled, not simply the activity being modeled. In this situation, conditional probabilities take the form

308

Chapter 7

a11 ð1, 0, k 4 , : : : , k m Þ ¼ 1  d1 ð1, 0, k 4 , : : : , k m Þ,

(7.81)

where d1 ð1, 0, k 4 , : : : , k m Þ ∈ ½0, 1 is referred to as a conditioning parameter, and a11 ð1, 1, k 4 , : : : , k m Þ ¼ h1 ð1, 1, k 4 , : : : , k m Þ,

(7.82)

a11 ð0, 0, k 4 , : : : , k m Þ ¼ h1 ð0, 0, k 4 , : : : , k m Þ,

(7.83)

a11 ð0, 1, k 4 , : : : , k m Þ ¼ h1 ð0, 1, k 4 , : : : , k m Þ,

(7.84)

where h1 ðr, s, k 4 , : : : , k m Þ ∈ ½0, 1 is referred to as a crosstalk parameter. The “conditioning” and “crosstalk” terminology comes from (Dougherty et al., 2009a), in which d quantifies loss of regulatory control based on the overall context in which the system is operating and, analogously, h corresponds to regulatory relations outside the model that result in the conditioned variable X 1 taking the value 1. We typically do not know the conditioning and crosstalk parameters for all combinations of k 4 , : : : , k m ; rather, we might just know the average, in which case d1 ð1, 0, k 4 , : : : , k m Þ reduces to d1 ð1, 0Þ and h1 ð1, 1, k 4 , : : : , k m Þ reduces to h1 ð1, 1Þ, etc. The basic scheme is very general and applies to the Gaussian and discrete models in (Esfahani and Dougherty, 2014, 2015). In this paradigm, the constraints resulting from the state of knowledge are of the form ð3Þ

gu,i ðjÞ ¼ PrðX i ¼ k i jX 1 ¼ k 1 , : : : , X i1 ¼ k i1 , X iþ1 ¼ k iþ1 , : : : , X m ¼ k m , uÞ  aki i ðk 1 , : : : , k i1 , k iþ1 , : : : , k m Þ. (7.85) The first term is a probability based on the model conditioned on u, and aki i represents our real-world prior knowledge about this probability. With slackness variables ei , the constraints take the form aki i ðk 1 , : : : , k i1 , k iþ1 , : : : , k m Þ  ei ðk 1 , : : : , k i1 , k iþ1 , : : : , k m Þ ≤ Epðu;gÞ ½PrðX i ¼ k i jX 1 ¼ k 1 , : : : , X i1 ¼ k i1 , X iþ1 ¼ k iþ1 , : : : , X m ¼ k m , uÞ ≤

aki i ðk 1 , : : : ,

k i1 , k iþ1 , : : : , k m Þ þ ei ðk 1 , : : : , k i1 , k iþ1 , : : : , k m Þ. (7.86)

Construction of Prior Distributions

309

The general conditional probabilities will not likely be used because they will likely not be known when there are many conditioning variables. When all constraints in the optimization are of this form, we obtain the optimizations MKDIP-E, MKDIP-D, and MKDIP-R, which correspond to using the same cost functions as RMEP, RMDIP, and REMLP, respectively. 7.3.2 Dirichlet prior distribution We consider the multinomial model where parameter vector p ¼ ½p1 , p2 , : : : , pb  has a Dirichlet prior distribution DirichletðaÞ with parameter vector a ¼ ½a1 , a2 , : : : , ab : P b Gð bk¼1 ak Þ Y pðp; aÞ ¼ Qb pkak 1 : (7.87) Gða Þ k k¼1 k¼1 P The sum a0 ¼ bk¼1 ak provides a measure of the strength of the prior knowledge (Ferguson, 1973). The feasible region for a given a0 is given by P Π ¼ fDirichletðaÞ : bk¼1 ak ¼ a0 and ak . 0 for all kg. In the binary setting, the state space consists of the set of all length-m binary vectors, or equivalently, integers from 1 to b ¼ 2m that index these vectors. Fixing the value of a single random variable, X i ¼ 0 or X i ¼ 1 for some i in {1, : : : , m}, corresponds to a partition of the state space X ¼ f1, : : : , bg. Denote the portions of X for which ðX i ¼ k 1 , X j ¼ k 2 Þ and ðX i ≠ k 1 , X j ¼ k 2 Þ for i, j ∈ f1, : : : , mg and k 1 , k 2 ∈ f0, 1g by X i, j ðk 1 , k 2 Þ and X i, j ðk c1 , k 2 Þ, respectively. For the Dirichlet distribution, the constraints on the expectation over the conditional probability in Eq. 7.86 can be explicitly written as functions of the hyperparameters. Define X ai, j ðk 1 , k 2 Þ ¼ ak : (7.88) k∈X i, j ðk 1 ,k 2 Þ

The notation is easily extended for cases having more than two fixed random variables. More generally, we can replace X j by a vector Ri composed of a subset of the random variables. In such a case, X i,Ri ðk i , ri Þ denotes the portion of X for which X i ¼ k i and Ri ¼ ri . Equation 7.88 holds with ai,Ri ðk i , ri Þ replacing ai, j ðk 1 , k 2 Þ and X i,Ri ðk i , ri Þ replacing X i, j ðk 1 , k 2 Þ. To evaluate the expectations of the conditional probabilities, we require a technical lemma proven in (Esfahani and Dougherty, 2015): for any nonempty disjoint subsets A, B ⊂ X ¼ f1, 2, : : : , bg, P P   i∈A pi i∈A ai P P P P Ep : (7.89) ¼ p þ p a i∈A i j∈B j i∈A i þ j∈B aj If, for any i, the vector of random variables other than X i and the vector ˜ i and x˜ i , respectively, then, of their corresponding values are denoted by X

310

Chapter 7

applying the preceding equation, the expectation over the conditional probability in Eq. 7.86 is Ep ½PrðX i ¼ k i jX 1 ¼ k 1 , : : : , X i1 ¼ k i1 , X iþ1 ¼ k iþ1 , : : : , X m ¼ k m , pÞ P   ˜ k∈X i,Xi ðk i ,x˜ i Þ pk P ¼ Ep P ˜ ˜ k∈X i,Xi ðk i ,x˜ i Þ pk þ k∈X i,Xi ðk c ,x˜ i Þ pk i

¼

˜i i,X

a ˜ ai,Xi ðk

i,

ðk i , x˜ i Þ

˜ x˜ i Þ þ ai,Xi ðk ci , x˜ i Þ

: (7.90)

Note that each summation in this equation has only one probability because ˜ ˜ X i,Xi ðk i , x˜ i Þ and X i,Xi ðk ci , x˜ i Þ each contain only one state. If, on the other hand, there exists a subset of random variables that completely determines the value of X i (or only a specific setup of their values that determines the value), then the constraints on the conditional probability conditioned on all of the random variables other than X i can be changed to be conditioned on that subset only. Specifically, let Ri denote the vector of random variables corresponding to such a set and suppose that there exists a specific setup of their values ri that completely determines the value of X i . Then the conditional probability can be expressed as P   k∈X i,Ri ðk i ,ri Þ pk P Ep ½PrðX i ¼ k i jRi ¼ ri Þ ¼ Ep P k∈X i,Ri ðk i ,ri Þ pk þ k∈X i,Ri ðk ci ,ri Þ pk (7.91) ai,Ri ðk i , ri Þ : ¼ i,R a i ðk i , ri Þ þ ai,Ri ðk ci , ri Þ For a multinomial model with a Dirichlet prior distribution, prior knowledge about the conditional probabilities PrðX i ¼ k i jRi ¼ ri Þ may take the form in Eq. 7.86 with expectations provided in Eq. 7.91. We next express the optimization cost functions for the three prior construction methods, RMEP, RMDIP, and REMLP, as functions of the Dirichlet parameter. For RMEP, the entropy of p is given by HðpÞ ¼

b X

½ln Gðak Þ  ðak  1Þcðak Þ  ln Gða0 Þ þ ða0  bÞcða0 Þ,

(7.92)

k¼1 d ln GðxÞ (Bishop, 2006). Assuming where c is the digamma function cðxÞ ¼ dx that a0 is given, it does not affect the minimization, and the cost function is given by

Construction of Prior Distributions

311

Ep ½C p ðaÞ ¼ 

b X ½ln Gðak Þ  ðak  1Þcðak Þ.

(7.93)

k¼1

For RMDIP, according to Eq. 7.72 and employing Eq. 7.92, after removing the constant terms, Ep ½C p ðaÞ ¼ 

b X

½ln Gðak Þ  ðak  1Þcðak Þ  Ep

k¼1

X b

 pk ln pk :

(7.94)

k¼1

To evaluate the second summand, first bring the expectation inside the sum. Given p  DirichletðaÞ with a0 defined as it is, it is known that pk is beta distributed with pk  betaðak , a0  ak Þ. Plugging the terms into the expectation and doing some manipulation of the terms inside the expectation integral yields Ep ½pk ln pk  ¼

ak E½ln w, a0

(7.95)

where w  betaðak þ 1, a0  ak Þ. It is known that for X  betaða, bÞ, E½ln X  ¼ cðaÞ  cða þ bÞ.

(7.96)

Hence, Ep ½pk ln pk  ¼

ak ½cðak þ 1Þ  cða0 þ 1Þ, a0

(7.97)

and the summand on the right in Eq. 7.94 becomes Ep

X b

 pk ln pk

k¼1

 ak cðak þ 1Þ  cða0 þ 1Þ. ¼ a k¼1 0 X b

(7.98)

Plugging this into Eq. 7.94 and dropping cða0 þ 1Þ yields the RMDIP cost function: Ep ½C p ðaÞ ¼ 

b  X k¼1

ln Gðak Þ  ðak  1Þcðak Þ þ

 ak cðak þ 1Þ : a0

(7.99)

P For REMLP, if there are np ¼ bk¼1 uk points for prior construction, there being uk points observed for bin k, then the mean-log-likelihood function for the multinomial distribution is

312

Chapter 7

lðp; SÞ ¼

b np ! 1X uk ln pk þ ln Qb : np k¼1 k¼1 uk !

Since p  DirichletðaÞ and a0 ¼

P i

(7.100)

ai ,

Ep ½ln pk  ¼ cðak Þ  cða0 Þ

(7.101)

(Bishop, 2006). Hence, b np ! 1X Ep ½lðp; SÞ ¼ uk ½cðak Þ  cða0 Þ þ ln Qb : np k¼1 k¼1 uk !

(7.102)

Finally, removing the constant parts in Eq. 7.102, the REMLP Dirichlet prior for known a0 involves optimization with respect to the cost function Ep ½C p ða, SÞ ¼ 

b 1X u cðak Þ. np k¼1 k

(7.103)

For application of the MKDIP construction for the multinomial model with a Dirichlet prior distribution, we refer to (Esfahani and Dougherty, 2015).

7.4 REMLP for a Normal-Wishart Prior In this section based on (Esfahani and Dougherty, 2014), we construct the REMLP for a normal-Wishart prior on an unknown mean and precision matrix using pathway knowledge in a graphical model. We denote the entities in a given set of pathways by xðiÞ (as the ith element of the feature vector x). Whereas the application in (Esfahani and Dougherty, 2014) is aimed at genes, from the perspective of prior construction, these entities are very general, with only their mathematical formulation being relevant to the construction procedure. In keeping with the original application, we refer to these entities as “genes.” A simplified illustration of the pathways highly influential in colon cancer is shown in Fig. 7.3. 7.4.1 Pathway knowledge Define the term activating pathway segment (APS) xðiÞ → xð jÞ to mean that, if xðiÞ is up-regulated (UR, or 1), then xð jÞ becomes UR (in some time steps). Similarly, the term repressing pathway segment (RPS) xðiÞ ⊣ xð jÞ means that, if xðiÞ is UR, then xð jÞ becomes down-regulated (DR, or 0). A pathway is defined to be an APS/RPS sequence, for instance xð1Þ → xð2Þ ⊣ xð3Þ. In this pathway, there are two pathway segments, the APS xð1Þ → xð2Þ and the RPS xð2Þ ⊣ xð3Þ. A set of pathways used as prior knowledge is denoted by G.

Construction of Prior Distributions

313

HGF

IL6

EGF RAS

PIK3CA

STAT3

EGF

IL6 TSC1/TSC2

MEK 1/2

mTOR

SPRY4

PKC

IL6

Figure 7.3 A simplified wiring diagram showing the key components of the colon cancer pathways used in (Hua et al., 2012). Dashed boxes are used to simplify the diagram and represent identical counterparts in solid boxes. [Reprinted from (Esfahani and Dougherty, 2014).]

GA and GR denote the sets of all APS and RPS segments in G, respectively. More specifically, GA contains pairs of features ½i, j such that xðiÞ → xðjÞ, and GR contains pairs of features ½i, j such that xðiÞ ⊣ xð jÞ. Regulations of the form xðiÞ → xð jÞ and xðiÞ ⊣ xð jÞ are called “pairwise regulations.” The regulatory set for gene xðiÞ is the set of genes that affect xðiÞ, i.e., genes that regulate xðiÞ through some APS/RPS. Denote this set by RxðiÞ for gene xðiÞ. Denote the union of gene xðiÞ with its regulatory set RxðiÞ by RxðiÞ . As an example, for the pathways shown in Fig. 7.4, Rxð1Þ ¼ fxð3Þg, Rxð2Þ ¼ ∅, Rxð3Þ ¼ fxð1Þg, Rxð4Þ ¼ fxð1Þ, xð2Þg, Rxð5Þ ¼ fxð2Þ, xð4Þg, and Rxð6Þ ¼ fxð5Þg. Assuming that the pathways convey complete information, that is, they are not affected by unspecified crosstalk or conflicting interaction, we quantify the pairwise regulations in a conditional probabilistic manner: APS : Prðxð jÞ ¼ URjxðiÞ ¼ URÞ ¼ 1,

(7.104)

RPS : Prðxð jÞ ¼ DRjxðiÞ ¼ URÞ ¼ 1.

(7.105)

This notation is shorthand for a directional relationship, where the state of gene xðiÞ influences the state of gene xð jÞ at some future time. For Gaussian joint distributions, the equalities are changed to simpler ones involving correlation:

Figure 7.4 An example of pathways with feedback containing six genes. This contains three RPSs and four APSs. [Reprinted from (Esfahani and Dougherty, 2014).]

314

Chapter 7

APS : rxðiÞ,xð jÞ ¼ 1,

(7.106)

RPS : rxðiÞ,xð jÞ ¼ 1,

(7.107)

where rxðiÞ,xð jÞ denotes the correlation coefficient between xðiÞ and xð jÞ. The definitions in Eqs. 7.104 and 7.105 are directional and asymmetric so that the flow of influence is preserved; however, the definitions in Eqs. 7.106 and 7.107 are symmetric and not directional. Moreover, the interpretation of Eqs. 7.104 and 7.105 as correlations in Eqs. 7.106 and 7.107 may not be appropriate. Specifically, in the case of a cycle (a directed loop regardless of the type of regulation), this two-way interpretation is inapplicable [see Fig. 7.4, where there is an APS from xð1Þ to xð3Þ and an RPS from xð3Þ to xð1Þ]. When using Eqs. 7.106 and 7.107 for the Gaussian case, acyclic pathways are assumed. We also employ the conditional entropy of a gene given the expressions of the genes in its regulatory set via the constraint Hu ðxðiÞjRxðiÞ Þ ¼ 0 for all i ∈ C,

(7.108)

where Hu ð⋅j⋅Þ is the conditional entropy obtained by a u-parameterized distribution, and C is the set of all features i corresponding to genes xðiÞ with non-empty regulatory sets. Hu ðxðiÞjRxðiÞ Þ is the amount of information needed to describe the outcome of xðiÞ given RxðiÞ . 7.4.2 REMLP optimization Although we have in mind classification between two (or more) classes, we omit the class index y and focus on prior construction for only one class. As previously noted, preliminary data are used in prior construction. A sample S n is partitioned into two parts: a set S prior used for prior construction with np np train points, and a set S nt used for classifier training with nt points, where n ¼ np þ nt . Taking the expectation over the uncertainty class in the constraints, including the slack variables ei, j and ji for the conditional probabilities and entropies, respectively, and splitting the regularization parameter for the slack variables between the probability and entropy slack variables, the optimization of Eq. 7.69 becomes i h X pREMLP ¼ arg min  ð1  l1  l2 ÞEp lðu; S prior ji np Þ þ l 1 p∈Π, ji ∈E i ei , j ≥0, ei , j ≥0 a a r r

2

þ l2 4

X ½ia ; j a ∈GA

eia ; ja þ

X ½ir ; j r ∈GR

3 ei r ; j r 5

i∈C

(7.109)

Construction of Prior Distributions

315

subject to the constraints h  i Ep Hu xðiÞjRxðiÞ ≤ j i , i ∈ C,

(7.110)

Ep ½Prðxð j a Þ ¼ URjxðia Þ ¼ UR,uÞ ≥ 1  eia , ja , ½ia , j a  ∈ GA ,

(7.111)

Ep ½Prðxð j r Þ ¼ DRjxðir Þ ¼ UR,uÞ ≥ 1  eir , j r , ½ir , j r  ∈ GR ,

(7.112)

where l1 , l2 ≥ 0, l1 þ l2 ≤ 1, E i is the feasible region for slackness variable j i , Π is the feasible region for the prior distribution, and lðu; S prior np Þ is the mean log-likelihood function defined in Eq. 7.74. In Eq. 7.109, Ep ½lðu; S prior np Þ reflects the expected similarity between the observed data and the true model. Prior averaging performs marginalization with respect to model parameterization, resulting in dependence only on the hyperparameters. Assuming Gaussian distributions, Eqs. 7.111 and 7.112 become h i Ep rxðia Þ,xð ja Þ ≥ 1  eia , ja , ½ia , j a  ∈ GA , (7.113) h i Ep rxðir Þ,xð jr Þ ≤ 1 þ eir , jr , ½ir , j r  ∈ GR :

(7.114)

7.4.3 Application of a normal-Wishart prior Assume a multivariate Gaussian feature distribution, N ðm, L1 Þ, with D genes, precision matrix L ¼ S1 , and u ¼ ðm, LÞ. For given n and k, define the feasible region for the prior as Π ¼ fnormal-Wishartðm, n, W, kÞ : m ∈ RD , W ≻ 0g,

(7.115)

i.e., the set of all normal-Wishart distributions. The normal-Wishart distribution is determined fully by four parameters, a D  1 vector m, a scalar n, a D  D matrix W, and a scalar k: mjL  N ðm,ðnLÞ1 Þ,

(7.116)

L ¼ S1  WishartðW, kÞ.

(7.117)

The form of the Wishart density is given in Eq. 6.1. Setting n . 0, W ≻ 0, and k . D  1 ensures a proper prior. The general optimization framework in Eqs. 7.109 through 7.112 does not yield a convex programming for which a guaranteed converging algorithm exists. To facilitate convergence, the full procedure can be split into two optimization problems (at the cost of yielding a suboptimal solution). The effect of prior knowledge can be assessed by deriving analytical expressions for the gradient and Hessian of the cost functions. First, assume that l2 ¼ 0

316

Chapter 7

and solve the optimization in Eqs. 7.109 and 7.110. The goal of the second optimization, which we treat in the next section, is to find a matrix close to the solution of the first optimization problem that better satisfies the constraints simplified to correlations in Eqs. 7.113 and 7.114. Setting l2 ¼ 0, the REMLP optimization reduces to Z min  ð1  l1 Þ lðu; S prior np ÞpðuÞdu þ l1 j m∈RD , W≻0, j∈E U (7.118) Z Z Hm,L ðxjRx Þpðm, LÞdmdL ≤ j, subject to L≻0

RD

where, for the sake of simplicity, we consider here the single-constraint optimization with feasible region E for slackness variable j, and we omit the gene index and simply denote the constrained gene by x. For the loglikelihood of the Gaussian distribution, p   1X ¼ ln jLj  tr Lðxi  mÞðxi  mÞT  D ln 2p: np i¼1

n

2lðm, L;

S prior np Þ

(7.119) Taking the expectation with respect to the mean and covariance matrix yields i h 2Ep lðm, L; S prior np Þ np   D kX ¼ E½ln jLj  tr Wðxi  mÞðxi  mÞT   D ln 2p np i¼1 n



D X kþ1d 1 ¼ ln jWj  k trðWVm Þ þ c  D ln p þ , 2 n d¼1

(7.120) where we assume that n and k are fixed, we have applied the fact that E½ln jLj ¼ ln jWj þ



D X kþ1d þ D ln 2, c 2 d¼1

(7.121)

and we define p 1X Vm ¼ ðx  mÞðxi  mÞT : np i¼1 i

n

(7.122)

We will see shortly that the constraint in Eq. 7.118 does not depend on m. Thus, the optimization of m is free of constraints and equivalent to minimizing trðWVm Þ. This gives

Construction of Prior Distributions

317

p 1X x: np i¼1 i

n

b REMLP ¼ m

(7.123)

We consider two cases for the covariance matrix and, consequently, for W. Throughout, it is assumed that x ∈ = Rx (no self-regulation), and, without loss of generality, that genes in the feature vector x are ordered with genes in Rx first, followed by the gene x, followed by all remaining genes. Suppose that x contains only Rx , so that x contains only the constrained gene x and genes in its regulating set. The precision matrix L for x can be written in block form, 

LRx L¼ L21 where, since L  WishartðW, kÞ, Lx  WishartðW x , kÞ, and 

WR x W¼ W21

 L12 , Lx we

have

(7.124) LRx  WishartðWRx , kÞ,

 W12 : Wx

(7.125)

Note that Lx and W x are scalars. Given features in Rx , x is Gaussian with 1 variance L1 x . Hence, Hu ðxjRx Þ ¼ 0.5 lnð2peLx Þ. By Eq. 7.121, Ep ½Hu ðxjRx Þ ¼ 0.5 lnðpeÞ  0.5 ln jW x j  0.5cðk∕2Þ.

(7.126)

Assuming that x contains only Rx , and letting V denote Vm with m replaced by Eq. 7.123, the optimization of Eq. 7.118 can be expressed as h i 1  ð1  l1 Þ ln jWj  k trðWVÞ þ l1 j 2 W≻0, j≥j

k subject to  lnðW x Þ  c ≤ j, 2

CP1 ðkÞ :

min

(7.127)

where j ¼  lnðpeÞ ensures that Ep ½Hu ðxjRx Þ is always upper bounded by a positive constant (Esfahani and Dougherty, 2014). From the inequalities in (Dembo et al., 1991), one can see that the parts containing ln jWj are concave, thereby making the optimization problem CP1 ðkÞ (the objective function and constraints) convex in the matrix W. Now suppose that x contains Rx along with one or more other entities, with the precision matrix and its prior represented in block format by

318

Chapter 7

3 LRx L12 L13 L ¼ 4 L21 Lx L23 5, L31 L32 L33 2 3 WRx W12 W13 W ¼ 4 W21 W x W23 5: W31 W32 W33 2

(7.128)

(7.129)

Similar to before, Lx and W x are scalars. Given features in Rx , x is Gaussian with variance B1 , where B ¼ Lx  L23 L1 33 L32 . Any diagonal block of a Wishart matrix is Wishart, and the Schur complement of any diagonal block of a Wishart matrix is also Wishart. Since B is the Schur complement of L33 in the matrix formed by removing the rows and columns in L corresponding to Rx in Eq. 7.128, we have that   B  Wishart W x  W23 W1 W , k  ðD  jR j  1Þ . (7.130) 32 x 33 Since Hu ðxjRx Þ ¼ 0.5 lnð2peB1 Þ, by Eq. 7.121, Ep ½Hu ðxjRx Þ ¼ 0.5 lnðpeÞ  0.5 lnðW x  W23 W1 33 W32 Þ  0.5cðAÞ, (7.131) where A ¼ ½k  ðD  jRx j  1Þ∕2. Hence, as shown in (Esfahani and Dougherty, 2014), the optimization problem of Eq. 7.118 can be expressed as CP2 ðkÞ :

min W≻0, j≥j

1  ð1  l1 Þ½ln jWj  k trðWVÞ þ l1 j 2

subject to  lnðW x 

W23 W1 33 W32 Þ

(7.132)

 cðAÞ ≤ j:

It is also shown in (Esfahani and Dougherty, 2014) that CP2 ðkÞ is a convex program and that the optimization problems CP1 ðkÞ and CP2 ðkÞ satisfy Slater’s condition. 7.4.4 Incorporating regulation types Given that the underlying feature distribution is jointly Gaussian, we incorporate the APS and RPS effects using Eqs. 7.113 and 7.114. Analogous to the development of CP1 ðkÞ and CP2 ðkÞ, we try to manipulate the expected correlation coefficients; however, instead of taking the expectation of the correlation coefficient, which yields a non-convex function, we fix the variances according to what we obtain from CP2 ðkÞ. From the properties of the Wishart distribution, S  inverse-WishartðC, kÞ, where C ¼ W1 . Define C∗ ¼ ðW∗ Þ1 , where W∗ is the optimal solution of CP2 ðkÞ. Denote the elements of S, C, and C∗ by sij , cij , and c∗ij , respectively, where i, j ¼ 1, 2, : : : , D. The first moments of sij are E½sij  ¼ cij ∕ðk  D  1Þ,

Construction of Prior Distributions

319

from which we obtain the approximation   sij E½sij  cij ¼ pffiffiffiffiffiffiffiffiffiffiffi E½rij  ¼ E pffiffiffiffiffiffiffiffiffiffiffi  1 pffiffiffiffiffiffiffiffiffiffiffi , ∗ ∗ sii sjj c∗ii c∗jj cii cjj kD1

(7.133)

where rij is the correlation coefficient between xðiÞ and xð jÞ. The goal of the second optimization paradigm is to satisfy the correlation-coefficient constraints according to the regulation types and be as close to the CP2 ðkÞ solution as possible. Thus, we introduce a penalty term based on the distance from the solution of CP2 ðkÞ and aim to find the closest, in the sense of the Frobenius norm, symmetric positive definite matrix C to the matrix C∗ . This leads to the following optimization problem, with optimization parameter C: 2 3 X X CP3 : min ð1  l2 ÞkC  C∗ k2F þ l2 4 ei a , j a þ ei r , j r 5 C≻0, ei , j ≥0, a a ei , j ≥0 r r

½ia , j a ∈GA

½ir , j r ∈GR

cia j a subject to 1  eia , ja ≤ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ≤ 1, ½ia , j a  ∈ GA , ∗ c ia ia c∗ j a j a cir jr 1  eir , jr ≤ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ≤ 1, ½ir , j r  ∈ GR , c ∗ i r ir c ∗ j r j r (7.134) where k ⋅ kF denotes the Frobenius norm. The regularization parameter l2 ∈ ð0, 1Þ provides a balance between the two functions. It can be readily shown that the optimization problem in Eq. 7.134 is convex. In sum, the REMLP problem in Eqs. 7.109 through 7.112 with given n and k is approximated by setting the hyperparameter m according to Eq. 7.123 and solving for the hyperparameter W using two sequential problems: (1) the optimization in Eq. 7.132 [CP2 ðkÞ], and then (2) the optimization in Eq. 7.134 (CP3 ). Once again, we only consider the one-constraint problem. The multiple-constraint problem can be treated similarly. The optimization problem CP2 ðkÞ is a nonlinear inequality constrained programming problem, and in (Esfahani and Dougherty, 2014) an algorithm employing the log-barrier interior point method is proposed for solving it. The optimization problem CP3 is a linearly constrained quadratic programming problem, which, absent the positive definiteness constraint, is easily solved; however, to find a proper prior distribution, a symmetric positive definite matrix is of interest. An algorithm is provided in (Esfahani and Dougherty, 2014). One issue is in setting the parameter l1 . Since larger k means that the prior is more centered about the scale matrix, k can be viewed as a measure of the total amount of information about the covariance in the prior. The

320

Chapter 7

regularization parameter l1 balances two sources of information: (1) data through the expected likelihood function, and (2) pathway knowledge through slackness variables bounding the conditional entropy. Thus, we may view k as a sum of the amount of information supplied by the prior construction training data np and the amount of information supplied by the pathway constraints npw . Heuristically, we may also view l1 as the ratio of pathway information to total information. Thus, l1 

npw : np þ npw

(7.135)

We are left with defining npw . In the simulations we let npw ¼ mD for different values of m ≥ 2 and see that the performance is not very sensitive to m; thus, a default value could simply be npw ¼ 2D. 7.4.5 A synthetic example We present a synthetic example from (Esfahani and Dougherty, 2014). Simulations require a fixed ground-truth model from which sample data are taken and pathways are built up. The ground-truth model is Gaussian with both covariance matrices having a blocked structure proposed in (Hua et al., 2005) to model the covariance matrix of gene-expression microarrays. Here, however, we place a small correlation between blocks. A 3-block covariance matrix with block size 3 has the structure 2 3 B1 C C S ¼ 4 C B2 C 5, (7.136) C C B3 where 3 s 2 ri s2 ri s2 Bi ¼ 4 ri s 2 s2 ri s2 5, ri s 2 ri s2 s2 3 2 rc s2 rc s 2 rc s2 C ¼ 4 rc s2 rc s 2 rc s2 5, rc s2 rc s 2 rc s2 2

(7.137)

(7.138)

s2 is the variance of each variable, ri for i ∈ f1, 2, 3g is the correlation coefficient inside block Bi , and rc is the correlation coefficient between elements of different blocks. In (Esfahani and Dougherty, 2014) a method for generating synthetic pathways to serve as the true model governing the stochastic regulations in the network is proposed. Sample points are generated

Construction of Prior Distributions

biological pathways

sample

321

REMLP prior construction

partition sample

train OBC

Figure 7.5 A prior construction methodology for OBC training. The sample is partitioned into two parts; one part is integrated with scientific knowledge to construct the prior, and the other is used for training as usual.

according to the Gaussian model. The generated pathways and sample points are then used for classifier design as depicted in Fig. 7.5. As shown in the figure, the fundamental principle is that prior knowledge about the pathways is integrated into the classification problem. As will be illustrated in this synthetic example, prior calibration is used to calibrate the mathematical equations to reflect the expert (pathway) knowledge. For the simulations, there may be one set of pathways G corresponding to one class, or two sets of pathways (G0 , G1 ) corresponding to two classes. The and S train for prior construction and sample data are partitioned into S prior np nt training the OBC, respectively. Whereas previously only one class was train considered for prior construction, here S n , S prior contain points np , and S nt to construct the from both classes. The pathways are combined with S prior np to train REMLP for each class. The constructed priors are utilized with S train nt the OBC. Denoting the error of a classifier c under feature-label distributions parameterized by u by εn ðu, cÞ, and denoting the OBC designed via the np and training points S train by cOBC,n , we are REMLP constructed using S prior np nt t n

p concerned with εn ðu, cOBC,n Þ for some true parameter u. If the solutions to t the optimization paradigms stated in CP2 ðky Þ and CP3 , pREMLP ðuy Þ for y ¼ 0, 1, produce good priors, that is, priors that have strong concentration np around the true parameter u, then it is likely that εn ðu, cOBC,n Þ ≤ εn ðu, cÞ, t where c is some other classifier, the exact relation depending on the featurelabel distribution, classification rule, and sample size. Fixing the true feature-label distribution, n points are generated that compose S n,i in the ith iteration for i ¼ 1, 2, : : : , M. These points are split train randomly into S prior np ,i and S nt ,i , where np þ nt ¼ n. Denote the given pathways

by G. Using G and S prior np ,i , construct prior distributions pREMLP,i ðuy Þ for y ∈ f0, 1g. These are updated using the remaining points S train nt ,i from which np np cOBC,nt ,i is trained. The expected true error εn ðu, cOBC,nt Þ is evaluated via Monte Carlo simulations:

322

Chapter 7

n

p E½εn ðu, cOBC,n Þ  t

M   1 X np εn u, cOBC,n , t ,i M i¼1

(7.139)

where the expectation is over the sampling distribution under the true parameter, we set M ¼ 15,000 repetitions, and each error term in the sum is estimated using 10,000 points. The overall strategy, repeated through Monte Carlo simulations, is partly shown in Fig. 7.5 and implemented step-wise as follows: 1. Fix the true parameterization for two classes: ðmy , Sy Þ, y ∈ f0, 1g. 2. Use the pathway-generation algorithm to generate two sets of pathways Gy for y ∈ f0, 1g. 3. Take observations from N ðmy , Sy Þ to generate S n . 4. Randomly choose np points from S n to form S prior and let np prior train S nt ¼ S n  S np . 5. Use S prior and Gy to construct the prior pREMLP ðuy Þ for y ∈ f0, 1g by np REMLP [CP2 ðky Þ and CP3 ]. np for the priors pREMLP ðuy Þ, y ∈ f0, 1g, and 6. Design the OBC cOBC,n t S train . nt We consider a setting with D ¼ 8 entities. The covariance matrices, S0 and S1 , are of the form in Eq. 7.136 with block sizes 3, 3, and 2 for the first, second, and third blocks, respectively. We set S0 with s2 ¼ 1, r1 ¼ r3 ¼ 0.3, r2 ¼ 0.3, rc ¼ 0.1, and we set S1 ¼ 2S0 . We also set m0 ¼ 0.5 ⋅ 1D , m1 ¼ 0.5 ⋅ 1D , and the prior class-0 probability to c ¼ 0.5. These settings correspond to the Bayes error εBayes ¼ 0.091. Given the feature-label distribution parameters, sample sizes n ¼ 30; 50; 70 are considered. The number of points in each class is fixed and is determined by n0 ¼ cn and n1 ¼ n  n0 . The average true error of the designed OBC using REMLP priors from CP2 and CP3 and the Jeffreys rule prior are computed. LDA and QDA classifiers are also trained for comparison. For REMLP priors, we vary the ratio rp of the number of sample points used for prior construction to the total sample size from 0.1 to 0.9. Thus, we keep at most 90% of the points for prior update and finding the posterior. The number of points available for prior construction under class y is given by np,y ¼ drp ny e. For example, when c ¼ 0.6, n ¼ 30, and 50% of the points are used for prior construction, np,0 ¼ 9 and np,1 ¼ 6. We set l1 according to Eq. 7.135 with npw ¼ 2D, and l2 ¼ 0.5. We also set ny ¼ np,y and ky ¼ 2D þ np,y . Average true errors for OBC with REMLP priors are shown in Fig. 7.6 with respect to the percentage rp  100% of points used for prior construction. OBC under Jeffreys rule prior, LDA, and QDA use all of the data directly for

323

0.28

0.2

0.26

0.19

0.24

LDA QDA OBC (REML prior)

average true error

average true error

Construction of Prior Distributions

0.22

0.2

0.18

0.16 10

LDA QDA OBC (REML prior)

0.18

0.17

0.16

0.15

0.14 20

30 40 50 60 70 80 prior construction sample size (%)

90

10

20

30 40 50 60 70 80 prior construction sample size (%)

(a)

90

(b) 0.16

average true error

0.15

0.14

0.13

LDA QDA OBC (REML prior) 0.12 10

20

30 40 50 60 70 80 prior construction sample size (%)

90

(c) Figure 7.6 Average true error as a function of the percentage of the sample points used for prior construction, r p  100%: (a) n ¼ 30; (b) n ¼ 50; (c) n ¼ 70. [Reprinted from (Esfahani and Dougherty, 2014).]

classifier training and thus appear as constants in the same graphs. In Fig. 7.6(a), n ¼ 30, and by increasing the number of sample points used for prior construction, the true error decreases. Thus, one should use at least 90% of the sample points for prior construction. However, when the total number of sample points increases, there is an optimal number of points that should be utilized for prior construction. For instance, as illustrated in Fig. 7.6(c), after about np ¼ 70  0.5 ¼ 35, the true error of the designed OBC increases. The simulations demonstrate that splitting the data provides better performance—that is, using np points (np , n) to design the prior and the

324

Chapter 7

remaining n  np points to train the OBC can provide lower expected error. To be precise, for a given n, the number of sample points for which the minimum average true error is achieved is expressed by h  i np n∗p ðnÞ ¼ arg min E εn u, cOBC,nn , (7.140) p np ∈f2,3, : : : , ng

where the expectation is over the sampling distribution under the true parameter. n∗p ðnÞ represents the optimal amount of data to be used in prior construction, after which the remaining points should be employed to update the constructed prior. Since there is no closed form for the true error of the OBC designed using an REMLP prior, the exact value of n∗p cannot be determined. As considered in (Esfahani and Dougherty, 2014) via Monte Carlo analysis, increasing n does not necessarily lead to a larger n∗p ; on the contrary, there is a saturation point after which increasing n does not significantly influence the optimal sample size for prior construction. Practically, one can find some number of points for prior construction that can be taken as preliminary data, no matter the sample size n.

References Akaike, H., “Information theory and an extension of the maximum likelihood principle,” in Proceedings of the 2nd International Symposium on Information Theory, pp. 267–281, 1973. Akaike, H., “A Bayesian analysis of the minimum AIC procedure,” Annals of the Institute of Statistical Mathematics, vol. 30, no. 1, pp. 9–14, 1978. Anders, S. and W. Huber, “Differential expression analysis for sequence count data,” Genome Biology, vol. 11, no. 10, art. R106, 2010. Anderson, T., “Classification by multivariate analysis,” Psychometrika, vol. 16, pp. 31–50, 1951. Arnold, L., Stochastic Differential Equations: Theory and Applications. John Wiley & Sons, New York, 1974. Arnold, S. F., The Theory of Linear Models and Multivariate Analysis. John Wiley & Sons, New York, 1981. Banerjee, U. and U. M. Braga-Neto, “Bayesian ABC-MCMC classification of liquid chromatography–mass spectrometry data,” Cancer Informatics, vol. 14(Suppl 5), pp. 175–182, 2017. Bao, Y. and A. Ullah, “Expectation of quadratic forms in normal and nonnormal variables with econometric applications,” Journal of Statistical Planning and Inference, vol. 140, pp. 1193–1205, 2010. Berger, J. O. and J. M. Bernardo, “On the development of reference priors,” Bayesian Statistics, vol. 4, no. 4, pp. 35–60, 1992. Berger, J. O., J. M. Bernardo, and D. Sun, “Objective priors for discrete parameter spaces,” Journal of the American Statistical Association, vol. 107, no. 498, pp. 636–648, 2012. Berikov, V. and A. Litvinenko, “The influence of prior knowledge on the expected performance of a classifier,” Pattern Recognition Letters, vol. 24, pp. 2537–2548, 2003. Bernardo, J. M., “Reference posterior distributions for Bayesian inference,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 41, no. 2, pp. 113–147, 1979. Bernardo, J. M. and A. F. Smith, Bayesian Theory. John Wiley & Sons, Chichester, UK, 2000.

Bickel, P. J. and E. Levina, “Some theory for Fisher’s linear discriminant function, ‘naive Bayes’, and some alternatives when there are many more variables than observations,” Bernoulli, vol. 10, no. 6, pp. 989–1010, 2004. Bishop, C. M., Pattern Recognition and Machine Learning. Springer-Verlag, New York, 2006. Boluki, S., M. S. Esfahani, X. Qian, and E. R. Dougherty, “Incorporating biological prior knowledge for Bayesian learning via maximal knowledgedriven information priors,” BMC Bioinformatics, vol. 18(Suppl 14), art. 552, 2017. Boluki, S., X. Qian, and E. R. Dougherty, “Experimental design via generalized mean objective cost of uncertainty,” IEEE Access, vol. 7, no. 1, pp. 2223–2230, 2019. Bozdogan, H., “Model selection and Akaike’s information criterion (AIC): The general theory and its analytical extensions,” Psychometrika, vol. 52, no. 3, pp. 345–370, 1987. Braga-Neto, U. M. and E. R. Dougherty, “Bolstered error estimation,” Pattern Recognition, vol. 37, no. 6, pp. 1267–1281, 2004a. Braga-Neto, U. M. and E. R. Dougherty, “Is cross-validation valid for small-sample microarray classification?” Bioinformatics, vol. 20, no. 3, pp. 374–380, 2004b. Braga-Neto, U. M. and E. R. Dougherty, “Exact performance of error estimators for discrete classifiers,” Pattern Recognition, vol. 38, no. 11, pp. 1799–1814, 2005. Braga-Neto, U. M. and E. R. Dougherty, “Exact correlation between actual and estimated errors in discrete classification,” Pattern Recognition Letters, vol. 31, no. 5, pp. 407–412, 2010. Braga-Neto, U. M. and E. R. Dougherty, Error Estimation for Pattern Recognition. Wiley-IEEE Press, New York, 2015. Broumand, A., B.-J. Yoon, M. S. Esfahani, and E. R. Dougherty, “Discrete optimal Bayesian classification with error-conditioned sequential sampling,” Pattern Recognition, vol. 48, no. 11, pp. 3766–3782, 2015. Butler, R. W. and A. T. A. Wood, “Laplace approximations for hypergeometric functions with matrix argument,” Annals of Statistics, vol. 30, no. 4, pp. 1155–1177, 2002. Carrasco, M. and J.-P. Florens, “Simulation-based method of moments and efficiency,” Journal of Business & Economic Statistics, vol. 20, no. 4, pp. 482–492, 2002. Chang, C.-C. and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Transactions on Intelligent Systems and Technology, vol. 2, art. 27, 2011. Cohn, D., L. Atlas, and R. Ladner, “Improving generalization with active learning,” Machine Learning, vol. 15, no. 2, pp. 201–221, 1994.

Constantine, A. G., “Some non-central distribution problems in multivariate analysis,” Annals of Mathematical Statistics, vol. 34, no. 4, pp. 1270–1285, 1963. Cover, T. M. and J. M. Van Campenhout, “On the possible orderings in the measurement selection problem,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 7, no. 9, pp. 657–661, 1977. Craig, J. W., “A new, simple and exact result for calculating the probability of error for two-dimensional signal constellations,” in Proceedings of the Military Communications Conference, pp. 571–575, 1991. Dadaneh, S. Z., E. R. Dougherty, and X. Qian, “Optimal Bayesian classification with missing values,” IEEE Transactions on Signal Processing, vol. 66, no. 16, pp. 4182–4192, 2018a. Dadaneh, S. Z., X. Qian, and M. Zhou, “BNP-Seq: Bayesian nonparametric differential expression analysis of sequencing count data,” Journal of the American Statistical Association, vol. 113, no. 521, pp. 81–94, 2018b. Dalton, L. A., “Application of the sample-conditioned MSE to non-linear classification and censored sampling,” in Proceedings of the 21st European Signal Processing Conference, art. 1569744691, 2013. Dalton, L. A., “Optimal ROC-based classification and performance analysis under Bayesian uncertainty models,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 13, no. 4, pp. 719–729, 2016. Dalton, L. A. and E. R. Dougherty, “Application of the Bayesian MMSE estimator for classification error to gene expression microarray data,” Bioinformatics, vol. 27, no. 13, pp. 1822–1831, 2011a. Dalton, L. A. and E. R. Dougherty, “Bayesian minimum mean-square error estimation for classification error—Part I: Definition and the Bayesian MMSE error estimator for discrete classification,” IEEE Transactions on Signal Processing, vol. 59, no. 1, pp. 115–129, 2011b. Dalton, L. A. and E. R. Dougherty, “Bayesian minimum mean-square error estimation for classification error—Part II: Linear classification of Gaussian models,” IEEE Transactions on Signal Processing, vol. 59, no. 1, pp. 130–144, 2011c. Dalton, L. A. and E. R. Dougherty, “Optimal mean-square-error calibration of classifier error estimators under Bayesian models,” Pattern Recognition, vol. 45, no. 6, pp. 2308–2320, 2012a. Dalton, L. A. and E. R. Dougherty, “Exact sample conditioned MSE performance of the Bayesian MMSE estimator for classification error— Part I: Representation,” IEEE Transactions on Signal Processing, vol. 60, no. 5, pp. 2575–2587, 2012b. Dalton, L. A. and E. R. Dougherty, “Exact sample conditioned MSE performance of the Bayesian MMSE estimator for classification error— Part II: Consistency and performance analysis,” IEEE Transactions on Signal Processing, vol. 60, no. 5, pp. 2588–2603, 2012c.

Dalton, L. A. and E. R. Dougherty, “Optimal classifiers with minimum expected error within a Bayesian framework—Part I: Discrete and Gaussian models,” Pattern Recognition, vol. 46, no. 5, pp. 1301–1314, 2013a. Dalton, L. A. and E. R. Dougherty, “Optimal classifiers with minimum expected error within a Bayesian framework—Part II: Properties and performance analysis,” Pattern Recognition, vol. 46, no. 5, pp. 1288–1300, 2013b. Dalton, L. A. and E. R. Dougherty, “Intrinsically optimal Bayesian robust filtering,” IEEE Transactions on Signal Processing, vol. 62, no. 3, pp. 657–670, 2014. Dalton, L. A. and M. R. Yousefi, “On optimal Bayesian classification and risk estimation under multiple classes,” EURASIP Journal on Bioinformatics and Systems Biology, vol. 2015, art. 8, 2015. Dalton, L. A., M. E. Benalcázar, and E. R. Dougherty, “Optimal clustering under uncertainty,” PLOS ONE, vol. 13, no. 10, art. e0204627, 2018. Davison, A. and P. Hall, “On the bias and variability of bootstrap and cross-validation estimates of error rates in discrimination problems,” Biometrika, vol. 79, pp. 274–284, 1992. de Laplace, P. S., Théorie Analytique des Probabilités. Courcier, Paris, 1812. Deev, A. D., “Representation of statistics of discriminant analysis and asymptotic expansion when space dimensions are comparable with sample size,” Doklady Akademii Nauk SSSR, vol. 195, pp. 759–762, 1970. DeGroot, M. H., Optimal Statistical Decisions. McGraw-Hill, New York, 1970. Dehghannasiri, R., B.-J. Yoon, and E. R. Dougherty, “Optimal experimental design for gene regulatory networks in the presence of uncertainty,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 14, no. 4, pp. 938–950, 2015. Dehghannasiri, R., M. S. Esfahani, and E. R. Dougherty, “Intrinsically Bayesian robust Kalman filter: An innovation process approach,” IEEE Transactions on Signal Processing, vol. 65, no. 10, pp. 2531–2546, 2017a. Dehghannasiri, R., X. Qian, and E. R. Dougherty, “Optimal experimental design in the context of canonical expansions,” IET Signal Processing, vol. 11, no. 8, pp. 942–951, 2017b. Dehghannasiri, R., D. Xue, P. V. Balachandran, M. R. Yousefi, L. A. Dalton, T. Lookman, and E. R. Dougherty, “Optimal experimental design for materials discovery,” Computational Materials Science, vol. 129, pp. 311–322, 2017c. Dehghannasiri, R., X. Qian, and E. R. Dougherty, “Intrinsically Bayesian robust Karhunen-Loève compression,” EURASIP Signal Processing, vol. 144, pp. 311–322, 2018a. Dehghannasiri, R., M. S. Esfahani, X. Qian, and E. R. Dougherty, “Optimal Bayesian Kalman filtering with prior update,” IEEE Transactions on Signal Processing, vol. 66, no. 8, pp. 1982–1996, 2018b.

Dembo, A., T. M. Cover, and J. A. Thomas, “Information theoretic inequalities,” IEEE Transactions on Information Theory, vol. 37, no. 6, pp. 1501–1518, 1991. Devore, J. L., Probability and Statistics for Engineering and the Sciences. Brooks/Cole, Pacific Grove, California, fourth edition, 1995. Devroye, L., “Necessary and sufficient conditions for the almost everywhere convergence of nearest neighbor regression function estimates,” Zeitschrift fur Wahrscheinlichkeitstheorie und verwandte Gebiete, vol. 61, pp. 467–481, 1982. Devroye, L., L. Gyorfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. Stochastic Modelling and Applied Probability, SpringerVerlag, New York, 1996. Diaconis, P. and D. Freedman, “On the consistency of Bayes estimates,” Annals of Statistics, vol. 14, no. 1, pp. 1–26, 1986. Dillies, M.-A., A. Rau, J. Aubert, C. Hennequet-Antier, M. Jeanmougin, N. Servant, C. Keime, G. Marot, D. Castel, J. Estelle, G. Guernec, B. Jagla, L. Jouneau, D. Laloë, C. Le Gall, B. Schaëffer, S. Le Crom, M. Guedj, and F. Jaffrézic, “A comprehensive evaluation of normalization methods for illumina high-throughput RNA sequencing data analysis,” Briefings in Bioinformatics, vol. 14, no. 6, pp. 671–683, 2013. Doob, J. L., “Application of the theory of martingales,” in Colloques Internationaux du Centre National de la Recherche Scientifique, pp. 22–28, Centre National de la Recherche Scientifique, Paris, 1948. Dougherty, E. R., “On the epistemological crisis in genomics,” Current Genomics, vol. 9, pp. 69–79, 2008. Dougherty, E. R., “Biomarker discovery: Prudence, risk, and reproducibility,” Bioessays, vol. 34, no. 4, pp. 277–279, 2012. Dougherty, E. R., The Evolution of Scientific Knowledge: From Certainty to Uncertainty. SPIE Press, Bellingham, Washington, 2016 [doi: 10.1117/3. 22633620]. Dougherty, E. R., Optimal Signal Processing Under Uncertainty. SPIE Press, Bellingham, Washington, 2018 [doi: 10.1117/3.2317891]. Dougherty, E. R. and M. L. Bittner, Epistemology of the Cell: A Systems Perspective on Biological Knowledge. IEEE Press Series on Biomedical Engineering, John Wiley & Sons, New York, 2011. Dougherty, E. R., M. Brun, J. M. Trent, and M. L. Bittner, “Conditioningbased modeling of contextual genomic regulation,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 6, no. 2, pp. 310–320, 2009a. Dougherty, E. R., J. Hua, and C. Sima, “Performance of feature selection methods,” Current Genomics, vol. 10, no. 6, pp. 365–374, 2009b. Duda, R. O., P. E. Hart, and D. G. Stork, Pattern Classification. John Wiley & Sons, New York, second edition, 2001.

Dunsmore, I., “A Bayesian approach to classification,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 28, no. 3, pp. 568–577, 1966. Ebrahimi, N., E. Maasoumi, and E. S. Soofi, “Measuring informativeness of data by entropy and variance,” in Advances in Econometrics, Income Distribution and Scientific Methodology, D. J. Slottje, Ed., PhysicaVerlag, Heidelberg, chapter 5, pp. 61–77, 1999. Efron, B., “Bootstrap methods: another look at the jackknife,” Annals of Statistics, vol. 7, no. 1, pp. 1–26, 1979. Efron, B., “Estimating the error rate of a prediction rule: improvement on cross validation,” Journal of the American Statistical Association, vol. 78, pp. 316–331, 1983. Esfahani, M. S. and E. R. Dougherty, “Effect of separate sampling on classification accuracy,” Bioinformatics, vol. 30, no. 2, pp. 242–250, 2013. Esfahani, M. S. and E. R. Dougherty, “Incorporation of biological pathway knowledge in the construction of priors for optimal Bayesian classification,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 11, no. 1, pp. 202–218, 2014. Esfahani, M. S. and E. R. Dougherty, “An optimization-based framework for the transformation of incomplete biological knowledge into a probabilistic structure and its application to the utilization of gene/protein signaling pathways in discrete phenotype classification,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 12, no. 6, pp. 1304–1321, 2015. Fawcett, T., “An introduction to ROC analysis,” Pattern Recognition Letters, vol. 27, pp. 861–874, 2006. Ferguson, T. S., “A Bayesian analysis of some nonparametric problems,” Annals of Statistics, vol. 1, no. 2, pp. 209–230, 1973. Feynman, R., QED: The Strange Theory of Light and Matter. Princeton University Press, Princeton, New Jersey, 1985. Foley, D., “Considerations of sample and feature size,” IEEE Transactions on Information Theory, vol. 18, no. 5, pp. 618–626, 1972. Frazier, P. I., W. B. Powell, and S. Dayanik, “A knowledge-gradient policy for sequential information collection,” SIAM Journal on Control and Optimization, vol. 47, no. 5, pp. 2410–2439, 2008. Freedman, D. A., “On the asymptotic behavior of Bayes’ estimates in the discrete case,” Annals of Mathematical Statistics, vol. 34, no. 4, pp. 1386–1403, 1963. Fujikoshi, Y., “Error bounds for asymptotic approximations of the linear discriminant function when the sample size and dimensionality are large,” Journal of Multivariate Analysis, vol. 73, pp. 1–17, 2000. Geisser, S., “Posterior odds for multivariate normal classifications,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 26, no. 1, pp. 69–76, 1964.

Geisser, S., “Estimation associated with linear discriminants,” Annals of Mathematical Statistics, vol. 38, no. 3, pp. 807–817, 1967. Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin, Bayesian Data Analysis. Chapman & Hall/CRC, Boca Raton, Florida, second edition, 2004. Ghaffari, N., M. R. Yousefi, C. D. Johnson, I. Ivanov, and E. R. Dougherty, “Modeling the next generation sequencing sample processing pipeline for the purposes of classification,” BMC Bioinformatics, vol. 14, art. 307, 2013. Gilks, W. R., S. Richardson, and D. Spiegelhalter, Markov Chain Monte Carlo in Practice. CRC Press, Boca Raton, Florida, 1995. Glick, N., “Additive estimators for probabilities of correct classification,” Pattern Recognition, vol. 10, no. 3, pp. 211–222, 1978. Goldstein, M. and E. Wolf, “On the problem of bias in multinomial classification,” Biometrics, vol. 33, pp. 325–31, 1977. Golovin, D., A. Krause, and D. Ray, “Near-optimal Bayesian active learning with noisy observations,” in Proceedings of the 23rd International Conference on Neural Information Processing Systems, pp. 766–774, 2010. Gordon, L. and R. Olshen, “Asymptotically efficient solutions to the classification problem,” Annals of Statistics, vol. 6, no. 3, pp. 515–533, 1978. Gozzolino, J. M., R. Gonzalez-Zubieta, and R. L. Miller, “Markovian decision processes with uncertain transition probabilities,” Technical report, Massachusetts Institute of Technology Operations Research Center, 1965. Grigoryan, A. M. and E. R. Dougherty, “Bayesian robust optimal linear filters,” Signal Processing, vol. 81, no. 12, pp. 2503–2521, 2001. Guiasu, S. and A. Shenitzer, “The principle of maximum entropy,” The Mathematical Intelligencer, vol. 7, no. 1, pp. 42–48, 1985. Gupta, A. K., D. K. Nagar, and L. E. Sánchez, “Properties of matrix variate confluent hypergeometric function distribution,” Journal of Probability and Statistics, vol. 2016, art. 2374907, 2016. Guttman, I. and G. C. Tiao, “A Bayesian approach to some best population problems,” The Annals of Mathematical Statistics, vol. 35, no. 2, pp. 825–835, 1964. Hajiramezanali, E., M. Imani, U. Braga-Neto, X. Qian, and E. R. Dougherty, “Scalable optimal Bayesian classification of single-cell trajectories under regulatory model uncertainty,” BMC Genomics, vol. 20, art. 435, 2019. Halvorsen, K. B., V. Ayala, and E. Fierro, “On the marginal distribution of the diagonal blocks in a blocked Wishart random matrix,” International Journal of Analysis, vol. 2016, art. 5967218, 2016. Hanczar, B., J. Hua, C. Sima, J. Weinstein, M. Bittner, and E. R. Dougherty, “Small-sample precision of ROC-related estimates,” Bioinformatics, vol. 26, no. 6, pp. 822–830, 2010.

Hand, D. J. and R. J. Till, “A simple generalisation of the area under the ROC curve for multiple class classification problems,” Machine Learning, vol. 45, pp. 171–186, 2001. Hansen, L. P. and K. J. Singleton, “Generalized instrumental variables estimation of nonlinear rational expectations models,” Econometrica: Journal of the Econometric Society, vol. 50, no. 5, pp. 1269–1286, 1982. Hassan, S. S., H. Huttunen, J. Niemi, and J. Tohka, “Bayesian receiver operating characteristic metric for linear classifiers,” Pattern Recognition Letters, vol. 128, pp. 52–59, 2019. Hastie, T., R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York, 2009. Hills, M., “Allocation rules and their error rates,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 28, no. 1, pp. 1–31, 1966. Hua, J., Z. Xiong, J. Lowey, E. Suh, and E. R. Dougherty, “Optimal number of features as a function of sample size for various classification rules,” Bioinformatics, vol. 21, no. 8, pp. 1509–1515, 2005. Hua, J., W. D. Tembe, and E. R. Dougherty, “Performance of featureselection methods in the classification of high-dimension data,” Pattern Recognition, vol. 42, no. 3, pp. 409–424, 2009. Hua, J., C. Sima, M. Cypert, G. C. Gooden, S. Shack, L. Alla, E. A. Smith, J. M. Trent, E. R. Dougherty, and M. L. Bittner, “Tracking transcriptional activities with high-content epifluorescent imaging,” Journal of Biomedical Optics, vol. 17, no. 4, art. 046008, 2012 [doi: 10. 1117/1.JBO.17.4.046008]. Hughes, G. F., “On the mean accuracy of statistical pattern recognizers,” IEEE Transactions on Information Theory, vol. 14, no. 1, pp. 55–63, 1968. Imani, M. and U. M. Braga-Neto, “Particle filters for partially-observed Boolean dynamical systems,” Automatica, vol. 87, pp. 238–250, 2018. Imani, M., R. Dehghannasiri, U. M. Braga-Neto, and E. R. Dougherty, “Sequential experimental design for optimal structural intervention in gene regulatory networks based on the mean objective cost of uncertainty,” Cancer Informatics, vol. 17, art. 1176935118790247, 2018. Jaynes, E. T., “Information theory and statistical mechanics,” Physical Review, vol. 106, no. 4, pp. 620–630, 1957. Jaynes, E. T., “Prior probabilities,” IEEE Transactions on Systems Science and Cybernetics, vol. 4, no. 3, pp. 227–241, 1968. Jaynes, E. T., “What is the question?” Bayesian Statistics, J. Bernardo, M. deGroot, D. Lindly, and A. Smith, Eds., Valencia University Press, Valencia, 1980. Jeffreys, H., “An invariant form for the prior probability in estimation problems,” Proceedings of the Royal Society of London: Series A (Mathematical and Physical Sciences), vol. 186, no. 1007, pp. 453–461, 1946.

Jeffreys, H., Theory of Probability. Oxford University Press, London, 1961. John, S., “Errors in discrimination,” Annals of Mathematical Statistics, vol. 32, no. 4, pp. 1125–1144, 1961. Johnson, M. E., Multivariate Statistical Simulation. Wiley Series in Applied Probability and Statistics, John Wiley & Sons, New York, 1987. Johnson, N. L., “Systems of frequency curves generated by methods of translation,” Biometrika, vol. 36, no. 1-2, pp. 149–176, 1949. Johnson, N. L., S. Kotz, and N. Balakrishnan, Continuous Univariate Distributions, Volume 1. Wiley Series in Probability and Mathematical Statistics, John Wiley & Sons, New York, second edition, 1994. Johnson, N. L., S. Kotz, and N. Balakrishnan, Continuous Univariate Distributions, Volume 2. Wiley Series in Probability and Mathematical Statistics, John Wiley & Sons, New York, second edition, 1995. Johnson, N. L., S. Kotz, and N. Balakrishnan, Discrete Multivariate Distributions. Wiley Series in Probability and Mathematical Statistics, John Wiley & Sons, New York, 1997. Jones, D. R., M. Schonlau, and W. J. Welch, “Efficient global optimization of expensive black-box functions,” Journal of Global Optimization, vol. 13, no. 4, pp. 455–492, 1998. Kan, R., “From moments of sum to moments of product,” Journal of Multivariate Analysis, vol. 99, pp. 542–554, 2008. Karbalayghareh, A., U. Braga-Neto, and E. R. Dougherty, “Intrinsically Bayesian robust classifier for single-cell gene expression trajectories in gene regulatory networks,” BMC Systems Biology, vol. 12(Suppl 3), art. 23, 2018a. Karbalayghareh, A., X. Qian, and E. R. Dougherty, “Optimal Bayesian transfer learning,” IEEE Transactions on Signal Processing, vol. 16, no. 14, pp. 3724–3739, 2018b. Karbalayghareh, A., X. Qian, and E. R. Dougherty, “Optimal Bayesian transfer learning for count data,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2019 [doi: 10.1109/TCBB.2019.2920981]. Kashyap, R., “Prior probability and uncertainty,” IEEE Transactions on Information Theory, vol. 17, no. 6, pp. 641–650, 1971. Kass, R. E. and L. Wasserman, “The selection of prior distributions by formal rules,” Journal of the American Statistical Association, vol. 91, no. 435, pp. 1343–1370, 1996. Kassam, S. A. and T. L. Lim, “Robust Wiener filters,” Journal of the Franklin Institute, vol. 304, no. 4-5, pp. 171–185, 1977. Kilian, L. and H. Lütkepohl, Structural Vector Autoregressive Analysis. Themes in Modern Econometrics, Cambridge University Press, Cambridge, UK, 2017. Kloeden, P. E. and E. Platen, Numerical Solution of Stochastic Differential Equations. Springer, New York, 1995.

Knight, J. M., I. Ivanov, and E. R. Dougherty, “MCMC implementation of the optimal Bayesian classifier for non-Gaussian models: Modelbased RNA-Seq classification,” BMC Bioinformatics, vol. 15, art. 401, 2014. Knight, J. M., I. Ivanov, K. Triff, R. S. Chapkin, and E. R. Dougherty, “Detecting multivariate gene interactions in RNA-Seq data using optimal Bayesian classification,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 15, no. 2, pp. 484–493, 2018. Knight, J. M., I. Ivanov, R. S. Chapkin, and E. R. Dougherty, “Detecting multivariate gene interactions in RNA-Seq data using optimal Bayesian classification,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 15, no. 2, pp. 484–493, 2019. Kolmogorov, A. N., “Stationary sequences in Hilbert space,” Bulletin of Mathematics, University of Moscow, vol. 2, no. 6, pp. 1–40, 1941. Kotz, S. and S. Nadarajah, Multivariate t Distributions and Their Applications. Cambridge University Press, New York, 2004. Kuznetsov, V. P., “Stable detection when the signal and spectrum of normal noise are inaccurately known,” Telecommunications and Radio Engineering, vol. 30/31, pp. 58–64, 1976. Lawoko, C. R. O. and G. J. McLachlan, “Some asymptotic results on the effect of autocorrelation on the error rates of the sample linear discriminant function,” Pattern Recognition, vol. 16, pp. 119–121, 1983. Lawoko, C. R. O. and G. J. McLachlan, “Discrimination with autocorrelated observations,” Pattern Recognition, vol. 18, pp. 145–149, 1985. Lawoko, C. R. O. and G. J. McLachlan, “Asymptotic error rates of the W and Z statistics when the training observations are dependent,” Pattern Recognition, vol. 19, pp. 467–471, 1986. Lindley, D. V., Bayesian Statistics: A Review. SIAM, Philadelphia, 1972. Little, R. J. A. and D. B. Rubin, Statistical Analysis with Missing Data. John Wiley & Sons, Hoboken, New Jersey, second edition, 2014. Loève, M., Probability Theory II. Graduate Texts in Mathematics, SpringerVerlag, New York, fourth edition, 1978. Lunts, A. and V. Brailovsky, “Evaluation of attributes obtained in statistical decision rules,” Engineering Cybernetics, vol. 3, pp. 98–109, 1967. Lütkepohl, H., New Introduction to Multiple Time Series Analysis. SpringerVerlag, Berlin, 2005. Mardia, K. V., J. T. Kent, and J. M. Bibby, Multivariate Analysis. Academic Press, London, 1979. Martin, J. J., Bayesian Decision Problems and Markov Chains. John Wiley, New York, 1967. Mathai, A. M. and H. J. Haubold, Special Functions for Applied Scientists. Springer, New York, 2008.

McLachlan, G. J., “An asymptotic expansion of the expectation of the estimated error rate in discriminant analysis,” Australian Journal of Statistics, vol. 15, pp. 210–214, 1973. McLachlan, G. J., Discriminant Analysis and Statistical Pattern Recognition. John Wiley & Sons, Hoboken, New Jersey, 2004. Mohsenizadeh, D. N., R. Dehghannasiri, and E. R. Dougherty, “Optimal objective-based experimental design for uncertain dynamical gene networks with experimental error,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 15, no. 1, pp. 218–230, 2018. Moran, M., “On the expectation of errors of allocation associated with a linear discriminant function,” Biometrika, vol. 62, no. 1, pp. 141–148, 1975. Muirhead, R. J., Aspects of Multivariate Statistical Theory. John Wiley & Sons, Hoboken, New Jersey, 2009. Muller, K. E. and P. W. Stewart, Linear Model Theory: Univariate, Multivariate, and Mixed Models. John Wiley & Sons, Hoboken, New Jersey, 2006. Murphy, K. P., Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge, Massachusetts, 2012. Nagar, D. K. and J. C. Mosquera-Benítez, “Properties of matrix variate hypergeometric function distribution,” Applied Mathematical Sciences, vol. 11, no. 14, pp. 677–692, 2017. Nagaraja, K. and U. Braga-Neto, “Bayesian classification of proteomics biomarkers from selected reaction monitoring data using an approximate Bayesian computation-Markov chain Monte Carlo approach,” Cancer Informatics, vol. 17, art. 1176935118786927, 2018. Natalini, P. and B. Palumbo, “Inequalities for the incomplete gamma function,” Mathematical Inequalities & Applications, vol. 3, no. 1, pp. 69–77, 2000. Neal, R. M., “MCMC using Hamiltonian dynamics,” in Handbook of Markov Chain Monte Carlo, S. Brooks, A. Gelman, G. L. Jones, and X.-L. Meng, Eds., Chapman & Hall/CRC, Boca Raton, Florida, Chapter 5, pp. 113–162, 2011. O’Hagan, A. and J. Forster, Kendall’s Advanced Theory of Statistics, Volume 2B: Bayesian Inference. Hodder Arnold, London, second edition, 2004. Okamoto, M., “An asymptotic expansion for the distribution of the linear discriminant function,” Annals of Mathematical Statistics, vol. 34, no. 4, pp. 1286–1301, 1968. Pan, S. J. and Q. Yang, “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010. Pepe, M. S., H. Janes, G. Longton, W. Leisenring, and P. Newcomb, “Limitations of the odds ratio in gauging the performance of a diagnostic,
prognostic, or screening marker,” American Journal of Epidemiology, vol. 159, pp. 882–890, 2004. Pikelis, V., “Comparison of methods of computing the expected classification errors,” Automation and Remote Control, vol. 5, no. 7, pp. 59–63, 1976. Poor, H., “On robust Wiener filtering,” IEEE Transactions on Automatic Control, vol. 25, no. 3, pp. 531–536, 1980. Qian, X. and E. R. Dougherty, “Bayesian regression with network prior: Optimal Bayesian filtering perspective,” IEEE Transactions on Signal Processing, vol. 64, no. 23, pp. 6243–6253, 2016. Raiffa, H. and R. Schlaifer, Applied Statistical Decision Theory. MIT Press, Cambridge, Massachusetts, 1961. Raudys, S., “On determining training sample size of a linear classifier,” Computing Systems, vol. 28, pp. 79–87, 1967. Raudys, S., “On the amount of a priori information in designing the classification algorithm,” Technical Cybernetics, vol. 4, pp. 168–174, 1972. Raudys, S. and D. M. Young, “Results in statistical discriminant analysis: A review of the former soviet union literature,” Journal of Multivariate Analysis, vol. 89, no. 1, pp. 1–35, 2004. Rissanen, J., “A universal prior for integers and estimation by minimum description length,” Annals of Statistics, vol. 11, no. 2, pp. 416–431, 1983. Rodriguez, C. C., “Entropic priors,” Technical report, State University of New York at Albany, Department of Mathematics and Statistics, Albany, New York, 1991. Rowe, D. B., Multivariate Bayesian Statistics: Models for Source Separation and Signal Unmixing. Chapman & Hall/CRC, Boca Raton, Florida, 2003. Saar-Tsechansky, M. and F. Provost, “Active sampling for class probability estimation and ranking,” Machine Learning, vol. 54, no. 2, pp. 153–178, 2004. Sen, P. K. and J. M. Singer, Large Sample Methods in Statistics. Chapman & Hall, New York, 1993. Serdobolskii, V., Multivariate Statistical Analysis: A High-Dimensional Approach. Springer, New York, 2000. Shaw, W. T., “Sampling Student’s T distribution - use of the inverse cumulative distribution function,” Journal of Computational Finance, vol. 9, no. 4, pp. 37–73, 2004. Silver, E. A., “Markovian decision processes with uncertain transition probabilities or rewards,” Technical report, Massachusetts Institute of Technology Operations Research Center, Cambridge, 1963. Sima, C. and E. R. Dougherty, “Optimal convex error estimators for classification,” Pattern Recognition, vol. 39, no. 6, pp. 1763–1780, 2006a. Sima, C. and E. R. Dougherty, “What should be expected from feature selection in small-sample settings,” Bioinformatics, vol. 22, no. 19, pp. 2430–2436, 2006b.

Sima, C. and E. R. Dougherty, “The peaking phenomenon in the presence of feature selection,” Pattern Recognition Letters, vol. 29, pp. 1667–1674, 2008. Sima, C., U. M. Braga-Neto, and E. R. Dougherty, “Superior feature-set ranking for small samples using bolstered error estimation,” Bioinformatics, vol. 21, no. 7, pp. 1046–1054, 2005. Sima, C., U. M. Braga-Neto, and E. R. Dougherty, “High-dimensional bolstered error estimation,” Bioinformatics, vol. 27, no. 21, pp. 3056– 3064, 2011. Sitgreaves, R., “Some results on the distribution of the W-classification statistic,” in Studies in Item Analysis and Prediction, H. Solomon, Ed., Stanford University Press, Stanford, California, pp. 241–251, 1961. Slater, L. J., Generalized Hypergeometric Functions. Cambridge University Press, Cambridge, UK, 1966. Sorum, M. J., Estimating the Probability of Misclassification. Ph.D. Dissertation, University of Minnesota, Minneapolis, 1968. Sorum, M. J., “Estimating the conditional probability of misclassification,” Technometrics, vol. 13, pp. 333–343, 1971. Spackman, K. A., “Signal detection theory: Valuable tools for evaluating inductive learning,” in Proceedings of the 6th International Workshop on Machine Learning, pp. 160–163, 1989. Spall, J. C. and S. D. Hill, “Least-informative Bayesian prior distributions for finite samples based on information theory,” IEEE Transactions on Automatic Control, vol. 35, no. 5, pp. 580–583, 1990. Stone, C. J., “Consistent nonparametric regression,” Annals of Statistics, vol. 5, no. 4, pp. 595–620, 1977. Stone, M., “Cross-validatory choice and assessment of statistical predictions,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 36, no. 2, pp. 111–147, 1974. Tubbs, J. D., “Effect of autocorrelated training samples on Bayes’ probability of misclassification,” Pattern Recognition, vol. 12, pp. 351–354, 1980. Vapnik, V. N., “On the uniform convergence of relative frequencies of events to their probabilities,” Theory of Probability and its Applications, vol. 16, pp. 264–280, 1971. Walker, G. A. and J. G. Saw, “The distribution of linear combinations of t-variables,” Journal of the American Statistical Association, vol. 73, no. 364, pp. 876–878, 1978. Weiss, K., T. M. Khoshgoftaar, and D. Wang, “A survey of transfer learning,” Journal of Big Data, vol. 3, art. 9, 2016. Wiener, N., Extrapolation, Interpolation and Smoothing of Stationary Time Series, with Engineering Applications. MIT Press, Cambridge, 1949. Wyman, F. J., D. M. Young, and D. W. Turner, “A comparison of asymptotic error rate expansions for the sample linear discriminant function,” Pattern Recognition, vol. 23, no. 7, pp. 775–783, 1990.

Xu, H., C. Caramanis, and S. Mannor, “Robustness and regularization of support vector machines,” Journal of Machine Learning Research, vol. 10, pp. 1485–1510, 2009a. Xu, H., C. Caramanis, S. Mannor, and S. Yun, “Risk sensitive robust support vector machines,” in Proceedings of the Joint 48th IEEE Conference on Decision and Control and 28th Chinese Control Conference, pp. 4655– 4661, 2009b. Xu, Q., J. Hua, U. M. Braga-Neto, Z. Xiong, E. Suh, and E. R. Dougherty, “Confidence intervals for the true classification error conditioned on the estimated error,” Technology in Cancer Research and Treatment, vol. 5, pp. 579–590, 2006. Yoon, B.-J., X. Qian, and E. R. Dougherty, “Quantifying the objective cost of uncertainty in complex dynamical systems,” IEEE Transactions on Signal Processing, vol. 61, no. 9, pp. 2256–2266, 2013. Yousefi, M. R. and E. R. Dougherty, “A comparison study of optimal and suboptimal intervention policies for gene regulatory networks in the presence of uncertainty,” EURASIP Journal on Bioinformatics and Systems Biology, vol. 2014, art. 6, 2014. Zapała, A. M., “Unbounded mappings and weak convergence of measures,” Statistics & Probability Letters, vol. 78, pp. 698–706, 2008. Zellner, A., Maximal Data Information Prior Distributions, Basic Issues in Econometrics. University of Chicago Press, Chicago, 1984. Zellner, A., Past and Recent Results on Maximal Data Information Priors. Working Paper Series in Economics and Econometrics, University of Chicago, Graduate School of Business, Department of Economics, Chicago, 1995. Zellner, A., “Models, prior information, and Bayesian analysis,” Journal of Econometrics, vol. 75, no. 1, pp. 51–68, 1996. Zollanvari, A. and E. R. Dougherty, “Moments and root-mean-square error of the Bayesian MMSE estimator of classification error in the Gaussian model,” Pattern Recognition, vol. 47, no. 6, pp. 2178–2192, 2014. Zollanvari, A. and E. R. Dougherty, “Incorporating prior knowledge induced from stochastic differential equations in the classification of stochastic observations,” EURASIP Journal on Bioinformatics and Systems Biology, vol. 2016, art. 2, 2016. Zollanvari, A. and E. R. Dougherty, “Optimal Bayesian classification with autoregressive data dependency,” IEEE Transactions on Signal Processing, vol. 67, no. 12, pp. 3073–3086, 2019. Zollanvari, A., U. M. Braga-Neto, and E. R. Dougherty, “On the sampling distribution of resubstitution and leave-one-out error estimators for linear classifiers,” Pattern Recognition, vol. 42, no. 11, pp. 2705–2723, 2009. Zollanvari, A., U. M. Braga-Neto, and E. R. Dougherty, “On the joint sampling distribution between the actual classification error and the

resubstitution and leave-one-out error estimators for linear classifiers,” IEEE Transactions on Information Theory, vol. 56, no. 2, pp. 784–804, 2010. Zollanvari, A., U. M. Braga-Neto, and E. R. Dougherty, “Analytic study of performance of error estimators for linear discriminant analysis,” IEEE Transactions on Signal Processing, vol. 59, no. 9, pp. 4238–4255, 2011. Zollanvari, A., U. M. Braga-Neto, and E. R. Dougherty, “Exact representation of the second-order moments for resubstitution and leave-one-out error estimation for linear discriminant analysis in the univariate heteroskedastic Gaussian model,” Pattern Recognition, vol. 45, no. 2, pp. 908–917, 2012. Zollanvari, A., J. Hua, and E. R. Dougherty, “Analytical study of performance of linear discriminant analysis in stochastic settings,” Pattern Recognition, vol. 46, pp. 3017–3029, 2013.

Index bolstered empirical distribution, 13 bolstered leave-one-out estimation, 14 bolstered-resubstitution estimator, 13 bolstering kernel, 13 bootstrap, 13 bootstrap sample, 13

0.632 bootstrap estimator, 13 A a priori probability, 2 action space, 220 activating pathway segment (APS), 312 almost surely, 3 almost uniformly (a.u.) integrable, 198 Anderson W statistic, 8 Appell hypergeometric function, 122 approximate Bayesian computation (ABC), 219 area under the ROC curve (AUC), 91 autoregressive process of order p, 225

C characteristics of random processes, 170 class-conditional distribution, 1 class-y-conditional density, 238 classification error, 1 classification probability, 239 classification rule, 3 classification rule model, 12 classifier, 1 classifier model, 11 conditional MSE, 101 conditional MSE convergence, 136 conditional risk, 238 conditional uncertainty class, 221 conditioning parameter, 308 confluent hypergeometric function, 264 conjugate prior, 27 consistent estimators, 77 consistent in the rth mean (estimators), 77 consistent rule, 3 convex estimator, 13

B Bayes classifier, 1 Bayes decision rule (BDR), 239 Bayes error, 1 Bayesian conditional risk estimator (BCRE), 242 Bayesian–Kolmogorov asymptotic conditions (BKac), 145 Bayesian MMSE error estimator, 26 Bayesian risk estimator (BRE), 240 beta distribution, 30 bias, 12 binary classification of Gaussian processes (BCGP), 296

cost function, 169–170 cost of constraint, 5 cross-validation, 12 crosstalk parameter, 308 cubic histogram rule, 4 cumulative distribution function (CDF), 9

F false positive rate (FPR) feature-label distribution, 1 feature selection, 10 feature vector, 1 features, 1 flat prior, 27

D d-dimensional Wiener process, 297 design cost, 3 deviation variance, 12 Dirichlet priors, 36 discriminant functions, 5 dispersion matrix, 298 double-asymptotic expansions, 144 down-regulated (DR, or 0), 312 drift vector, 298

G Gauss hypergeometric function, 264 Gauss-hypergeometric-function distribution, 268 Gaussian, 6 generalized hypergeometric function, 263–264 global (homogeneous) markers, 292

E effective characteristics, 171 effective class-conditional density, 32 effective conditional density, 247 effective density, 241 effective joint class-conditional density, 104 effective joint density, 247 effective processes, 172 empirical error classification rule, 9–10 equicontinuous function, 81 error estimation rule, 12 expected AUC (EAUC), 94 expected design cost, 3 expected FPR (EFPR), 92 expected mean log-likelihood, 306 expected risk, 238 expected ROC (EROC) curve, 93 expected TPR (ETPR), 92 experiment space, 220 experimental design value, 221 experiments, 220

H Hamilton Monte Carlo OBC (OBC-HMC), 217 Helly–Bray theorem, 77 heterogeneous markers, 293 heteroscedastic model, 7 high-variance non-markers, 293 histogram rule, 19 holdout error estimator, 12 homoscedastic model, 7 I IBR action, 220 IBR classifiers, 210–212 ideal regression, 86 improper priors, 30 independent Jeffreys prior, 50 intrinsically Bayesian robust (IBR) operator, 171 inverse-gamma distribution, 54–55 inverse-Wishart distribution, 47 J Jeffreys rule prior, 50 Johnson SB, 70

Johnson SU, 70 joint normal-Wishart distribution, 262 K k-fold cross-validation, 12 k-nearest-neighbor (kNN) rule, 4 Kullback–Liebler (KL) divergence, 306 L label, 1 leave-one-out estimator, 13 likelihood function, 29, 240 linear discriminant analysis (LDA), 7 loss function, 238 low-variance non-markers, 293 M Mahalanobis distance, 24 marginal prior densities, 27 Markov chain Monte Carlo (MCMC), 212 Markov property, 298 mathematical model, 18 maximal data information, 305 maximal data information prior (MDIP), 305 maximal knowledge-driven information prior (MKDIP), 304 maximum entropy, 305 principle of, 305 mean objective cost of uncertainty (MOCU), 220 mean-square error (MSE), 12 missing completely at random (MCAR), 212 MKDIP with constraints, 304 MMSE calibrated error estimate, 85 MMSE calibration function, 85 MMSE error estimator, 25 model-constrained Bayesian robust (MCBR) operator, 171–172, 210

multinomial discrimination, 19, 36 multivariate beta function, 36 multivariate gamma function, 59 multivariate Gaussian process, 296 multivariate t-distribution, 55 N nature of the prior information, 301 nearest-mean-classification (NMC) rule, 8 non-informative prior, 27 non-standardized Student’s t-distribution, 57 normal-inverse-Wishart distribution, 48 O objective cost of uncertainty, 220 objective function, 170 objective prior, 27 observation time vector, 296 one-dimensional Wiener process, 297 optimal action, 220 optimal Bayesian classifier (OBC), 173 optimal Bayesian operator (OBO), 171 optimal Bayesian risk classifier (OBRC), 245 optimal Bayesian transfer learning classifier (OBTLC), 273 optimal operator, 170 P peaking phenomenon, 11 plug-in classification rule, 3 plug-in error estimator, 15 posterior distribution, 26, 221 posterior distribution of target, 268 predictive density, 174 prior distribution, 26, 220 prior probabilities, 27

probability density function, 13 proper priors, 30 Q quadratic discriminant analysis (QDA), 7 R Radon–Nikodym theorem, 86 random sampling, 16 Raudys–Kolmogorov asymptotics, 145 Raudys-type Gaussian-based finitesample approximation, 147 receiver operator characteristic (ROC) curves, 91 regularized expected mean loglikelihood prior (REMLP), 307 regularized incomplete beta function, 62 regularized maximal data information prior (RMDIP), 307 regularized maximum entropy prior (RMEP), 307 regulatory set, 313 relatedness between source and target domains, 278 remaining MOCU, 221 repressing pathway segment (RPS), 312 residual IBR cost, 221 resubstitution error estimator, 12 RMS conditioned on the true error, 89 RNA sequencing, 281–282 root-mean-square (RMS) error, 12 S sample-conditioned MSE, 101, 245 sample path, 296 sampling distribution, 3 scientific gap, 172

semi-bolstered-redistribution estimator, 14 separate sampling, 14 slackness variable, 305 source domain, 261 stochastic differential equation (SDE), 295–296 stratified sampling, 16 strongly consistent estimators, 77 strongly consistent (rule), 3 surrogate classifiers, 12 T target domain, 261 test point, 173 transfer learning, 261 transition probabilities, 298 true positive rate (TPR), 91 U uncertainty actions, 220 uncertainty class, 21, 171, 220 uncertainty parameters, 220 universally consistent (rule), 3 up-regulated (UR, or 1), 312 V valid prior knowledge, 299 Vapnik–Chervonenkis (VC) dimension, 10 W weak* consistent posterior, 78 weak topology, 77 weakly consistent estimators, 77 Wiener filter, 170 Wiener–Hopf equation, 170 Wishart distribution, 262 Z zero-one loss function, 239 Zipf’s power law model, 42

Lori A. Dalton received the B.S., M.S., and Ph.D. degrees in electrical engineering at Texas A&M University, College Station, Texas, in 2001, 2002 and 2012, respectively. She is currently an Adjunct Professor of Electrical and Computer Engineering at The Ohio State University. Dr. Dalton was awarded an NSF CAREER Award in 2014. Her current research interests include pattern recognition, estimation, and optimization, genomic signal processing, systems biology, bioinformatics, robust filtering, and information theory. Edward R. Dougherty is a Distinguished Professor in the Department of Electrical and Computer Engineering at Texas A&M University in College Station, Texas, where he holds the Robert M. Kennedy ‘26 Chair in Electrical Engineering. He holds a Ph.D. in mathematics from Rutgers University and an M.S. in Computer Science from Stevens Institute of Technology, and was awarded an honoris causa doctorate by the Tampere University of Technology in Finland. His previous works include SPIE Press titles Optimal Signal Processing Under Uncertainty (2018), The Evolution of Scientific Knowledge: From Certainty to Uncertainty (2016), and Hands-on Morphological Image Processing (2003).