This book is a mathematically accessible and up-to-date introduction to the tools needed to address modern inference problems in engineering and data science.
English · Pages: 418 [421] · Year: 2018
Table of contents :
Contents
Preface
List of Acronyms
1 Introduction
1.1 Background
1.2 Notation
1.2.1 Probability Distributions
1.2.2 Conditional Probability Distributions
1.2.3 Expectations and Conditional Expectations
1.2.4 Unified Notation
1.2.5 General Random Variables
1.3 Statistical Inference
1.3.1 Statistical Model
1.3.2 Some Generic Estimation Problems
1.3.3 Some Generic Detection Problems
1.4 Performance Analysis
1.5 Statistical Decision Theory
1.5.1 Conditional Risk and Optimal Decision Rules
1.5.2 Bayesian Approach
1.5.3 Minimax Approach
1.5.4 Other Non-Bayesian Rules
1.6 Derivation of Bayes Rule
1.7 Link Between Minimax and Bayesian Decision Theory
1.7.1 Dual Concept
1.7.2 Game Theory
1.7.3 Saddlepoint
1.7.4 Randomized Decision Rules
Exercises
References
Part I Hypothesis Testing
2 Binary Hypothesis Testing
2.1 General Framework
2.2 Bayesian Binary Hypothesis Testing
2.2.1 Likelihood Ratio Test
2.2.2 Uniform Costs
2.2.3 Examples
2.3 Binary Minimax Hypothesis Testing
2.3.1 Equalizer Rules
2.3.2 Bayes Risk Line and Minimum Risk Curve
2.3.3 Differentiable V(π0)
2.3.4 Nondifferentiable V(π0)
2.3.5 Randomized LRTs
2.3.6 Examples
2.4 Neyman–Pearson Hypothesis Testing
2.4.1 Solution to the NP Optimization Problem
2.4.2 NP Rule
2.4.3 Receiver Operating Characteristic
2.4.4 Examples
2.4.5 Convex Optimization
Exercises
3 Multiple Hypothesis Testing
3.1 General Framework
3.2 Bayesian Hypothesis Testing
3.2.1 Optimal Decision Regions
3.2.2 Gaussian Ternary Hypothesis Testing
3.3 Minimax Hypothesis Testing
3.4 Generalized Neyman–Pearson Detection
3.5 Multiple Binary Tests
3.5.1 Bonferroni Correction
3.5.2 False Discovery Rate
3.5.3 Benjamini–Hochberg Procedure
3.5.4 Connection to Bayesian Decision Theory
Exercises
References
4 Composite Hypothesis Testing
4.1 Introduction
4.2 Random Parameter Θ
4.2.1 Uniform Costs Over Each Hypothesis
4.2.2 Nonuniform Costs Over Hypotheses
4.3 Uniformly Most Powerful Test
4.3.1 Examples
4.3.2 Monotone Likelihood Ratio Theorem
4.3.3 Both Composite Hypotheses
4.4 Locally Most Powerful Test
4.5 Generalized Likelihood Ratio Test
4.5.1 GLRT for Gaussian Hypothesis Testing
4.5.2 GLRT for Cauchy Hypothesis Testing
4.6 Random versus Nonrandom θ
4.7 Non-Dominated Tests
4.8 Composite m-ary Hypothesis Testing
4.8.1 Random Parameter Θ
4.8.2 Non-Dominated Tests
4.8.3 m-GLRT
4.9 Robust Hypothesis Testing
4.9.1 Robust Detection with Conditionally Independent Observations
4.9.2 Epsilon-Contamination Class
Exercises
References
5 Signal Detection
5.1 Introduction
5.2 Problem Formulation
5.3 Detection of Known Signal in Independent Noise
5.3.1 Signal in i.i.d. Gaussian Noise
5.3.2 Signal in i.i.d. Laplacian Noise
5.3.3 Signal in i.i.d. Cauchy Noise
5.3.4 Approximate NP Test
5.4 Detection of Known Signal in Correlated Gaussian Noise
5.4.1 Reduction to i.i.d. Noise Case
5.4.2 Performance Analysis
5.5 m-ary Signal Detection
5.5.1 Bayes Classification Rule
5.5.2 Performance Analysis
5.6 Signal Selection
5.6.1 i.i.d. Noise
5.6.2 Correlated Noise
5.7 Detection of Gaussian Signals in Gaussian Noise
5.7.1 Detection of a Gaussian Signal in White Gaussian Noise
5.7.2 Detection of i.i.d. Zero-Mean Gaussian Signal
5.7.3 Diagonalization of Signal Covariance
5.7.4 Performance Analysis
5.7.5 Gaussian Signals With Nonzero Mean
5.8 Detection of Weak Signals
5.9 Detection of Signal with Unknown Parameters in White Gaussian Noise
5.9.1 General Approach
5.9.2 Linear Gaussian Model
5.9.3 Nonlinear Gaussian Model
5.9.4 Discrete Parameter Set
5.10 Deflection-Based Detection of Non-Gaussian Signal in Gaussian Noise
Exercises
References
6 Convex Statistical Distances
6.1 Kullback–Leibler Divergence
6.2 Entropy and Mutual Information
6.3 Chernoff Divergence, Chernoff Information, and Bhattacharyya Distance
6.4 Ali–Silvey Distances
6.5 Some Useful Inequalities
Exercises
References
7 Performance Bounds for Hypothesis Testing
7.1 Simple Lower Bounds on Conditional Error Probabilities
7.2 Simple Lower Bounds on Error Probability
7.3 Chernoff Bound
7.3.1 Moment-Generating and Cumulant-Generating Functions
7.3.2 Chernoff Bound
7.4 Application of Chernoff Bound to Binary Hypothesis Testing
7.4.1 Exponential Upper Bounds on PF and PM
7.4.2 Bayesian Error Probability
7.4.3 Lower Bound on ROC
7.4.4 Example
7.5 Bounds on Classification Error Probability
7.5.1 Upper and Lower Bounds in Terms of Pairwise Error Probabilities
7.5.2 Bonferroni’s Inequalities
7.5.3 Generalized Fano’s Inequality
7.6 Appendix: Proof of Theorem 7.4
Exercises
References
8 Large Deviations and Error Exponents for Hypothesis Testing
8.1 Introduction
8.2 Chernoff Bound for Sum of i.i.d. Random Variables
8.2.1 Cramér’s Theorem
8.2.2 Why is the Central Limit Theorem Inapplicable Here?
8.3 Hypothesis Testing with i.i.d. Observations
8.3.1 Bayesian Hypothesis Testing with i.i.d. Observations
8.3.2 Neyman–Pearson Hypothesis Testing with i.i.d. Observations
8.3.3 Hoeffding Problem
8.3.4 Example
8.4 Refined Large Deviations
8.4.1 The Method of Exponential Tilting
8.4.2 Sum of i.i.d. Random Variables
8.4.3 Lower Bounds on Large-Deviations Probabilities
8.4.4 Refined Asymptotics for Binary Hypothesis Testing
8.4.5 Non-i.i.d. Components
8.5 Appendix: Proof of Lemma 8.1
Exercises
References
9 Sequential and Quickest Change Detection
9.1 Sequential Detection
9.1.1 Problem Formulation
9.1.2 Stopping Times and Decision Rules
9.1.3 Two Formulations of the Sequential Hypothesis Testing Problem
9.1.4 Sequential Probability Ratio Test
9.1.5 SPRT Performance Evaluation
9.2 Quickest Change Detection
9.2.1 Minimax Quickest Change Detection
9.2.2 Bayesian Quickest Change Detection
Exercises
References
10 Detection of Random Processes
10.1 Discrete-Time Random Processes
10.1.1 Periodic Stationary Gaussian Processes
10.1.2 Stationary Gaussian Processes
10.1.3 Markov Processes
10.2 Continuous-Time Processes
10.2.1 Covariance Kernel
10.2.2 Karhunen–Loève Transform
10.2.3 Detection of Known Signals in Gaussian Noise
10.2.4 Detection of Gaussian Signals in Gaussian Noise
10.3 Poisson Processes
10.4 General Processes
10.4.1 Likelihood Ratio
10.4.2 Ali–Silvey Distances
10.5 Appendix: Proof of Proposition 10.1
Exercises
References
Part II Estimation
11 Bayesian Parameter Estimation
11.1 Introduction
11.2 Bayesian Parameter Estimation
11.3 MMSE Estimation
11.4 MMAE Estimation
11.5 MAP Estimation
11.6 Parameter Estimation for Linear Gaussian Models
11.7 Estimation of Vector Parameters
11.7.1 Vector MMSE Estimation
11.7.2 Vector MMAE Estimation
11.7.3 Vector MAP Estimation
11.7.4 Linear MMSE Estimation
11.7.5 Vector Parameter Estimation in Linear Gaussian Models
11.7.6 Other Cost Functions for Bayesian Estimation
11.8 Exponential Families
11.8.1 Basic Properties
11.8.2 Conjugate Priors
Exercises
References
12 Minimum Variance Unbiased Estimation
12.1 Nonrandom Parameter Estimation
12.2 Sufficient Statistics
12.3 Factorization Theorem
12.4 Rao–Blackwell Theorem
12.5 Complete Families of Distributions
12.5.1 Link Between Completeness and Sufficiency
12.5.2 Link Between Completeness and MVUE
12.5.3 Link Between Completeness and Exponential Families
12.6 Discussion
12.7 Examples: Gaussian Families
Exercises
References
13 Information Inequality and Cramér–Rao Lower Bound
13.1 Fisher Information and the Information Inequality
13.2 Cramér–Rao Lower Bound
13.3 Properties of Fisher Information
13.4 Conditions for Equality in Information Inequality
13.5 Vector Parameters
13.6 Information Inequality for Random Parameters
13.7 Biased Estimators
13.8 Appendix: Derivation of (13.16)
Exercises
References
14 Maximum Likelihood Estimation
14.1 Introduction
14.2 Computation of ML Estimates
14.3 Invariance to Reparameterization
14.4 MLE in Exponential Families
14.4.1 Mean-Value Parameterization
14.4.2 Relation to MVUEs
14.4.3 Asymptotics
14.5 Estimation of Parameters on Boundary
14.6 Asymptotic Properties for General Families
14.6.1 Consistency
14.6.2 Asymptotic Efficiency and Normality
14.7 Nonregular ML Estimation Problems
14.8 Nonexistence of MLE
14.9 Non-i.i.d. Observations
14.10 M-Estimators and Least-Squares Estimators
14.11 Expectation-Maximization (EM) Algorithm
14.11.1 General Structure of the EM Algorithm
14.11.2 Convergence of EM Algorithm
14.11.3 Examples
14.12 Recursive Estimation
14.12.1 Recursive MLE
14.12.2 Recursive Approximations to Least-Squares Solution
14.13 Appendix: Proof of Theorem 14.2
14.14 Appendix: Proof of Theorem 14.4
Exercises
References
15 Signal Estimation
15.1 Linear Innovations
15.2 Discrete-Time Kalman Filter
15.2.1 Time-Invariant Case
15.3 Extended Kalman Filter
15.4 Nonlinear Filtering for General Hidden Markov Models
15.5 Estimation in Finite Alphabet Hidden Markov Models
15.5.1 Viterbi Algorithm
15.5.2 Forward-Backward Algorithm
15.5.3 Baum–Welch Algorithm for HMM Learning
Exercises
References
Appendix A Matrix Analysis
Appendix B Random Vectors and Covariance Matrices
Appendix C Probability Distributions
Appendix D Convergence of Random Sequences
Index
Statistical Inference for Engineers and Data Scientists
A mathematically accessible and up-to-date introduction to the tools needed to address modern inference problems in engineering and data science, ideal for graduate students taking courses on statistical inference and detection and estimation, and an invaluable reference for researchers and professionals. With a wealth of illustrations and examples to explain the key features of the theory and to connect with real-world applications, additional material to explore more advanced concepts, and numerous end-of-chapter problems to test the reader's knowledge, this textbook is the "go-to" guide for learning about the core principles of statistical inference and its application in engineering and data science. The password-protected Solutions Manual and the Image Gallery from the book are available at www.cambridge.org/Moulin.

Pierre Moulin is a professor in the ECE Department at the University of Illinois at Urbana-Champaign. His research interests include statistical inference, machine learning, detection and estimation theory, information theory, statistical signal, image, and video processing, and information security. Moulin is a Fellow of the IEEE, and served as a Distinguished Lecturer for the IEEE Signal Processing Society. He has received two best paper awards from the IEEE Signal Processing Society and the US National Science Foundation CAREER Award. He was founding Editor-in-Chief of the IEEE Transactions on Information Forensics and Security.
Venugopal V. Veeravalli is the Henry Magnuski Professor in the ECE Department at the University of Illinois at Urbana-Champaign. His research interests include statistical inference and machine learning, detection and estimation theory, and information theory, with applications to data science, wireless communications, and sensor networks. Veeravalli is a Fellow of the IEEE, and served as a Distinguished Lecturer for the IEEE Signal Processing Society. Among the awards he has received are the IEEE Browder J. Thompson Best Paper Award, the National Science Foundation CAREER Award, the Presidential Early Career Award for Scientists and Engineers (PECASE), and the Wald Prize in Sequential Analysis.
Statistical Inference for Engineers and Data Scientists

PIERRE MOULIN
University of Illinois, Urbana-Champaign

VENUGOPAL V. VEERAVALLI
University of Illinois, Urbana-Champaign
University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi – 110025, India
79 Anson Road, #06–04/06, Singapore 079906

Cambridge University Press is part of the University of Cambridge. It furthers the University's mission by disseminating knowledge in the pursuit of education, learning, and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/9781107185920
DOI: 10.1017/9781107185920
© Cambridge University Press 2019. This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press. First published 2019. Printed in the United Kingdom by TJ International Ltd, Padstow, Cornwall.
A catalogue record for this publication is available from the British Library. ISBN 978-1-107-18592-0 Hardback. Additional resources for this publication at www.cambridge.org/Moulin. Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.
We balance probabilities and choose the most likely – Sherlock Holmes
Brief Contents

Preface   xvii
List of Acronyms   xx
1  Introduction   1

Part I  Hypothesis Testing   23
2  Binary Hypothesis Testing   25
3  Multiple Hypothesis Testing   54
4  Composite Hypothesis Testing   71
5  Signal Detection   105
6  Convex Statistical Distances   145
7  Performance Bounds for Hypothesis Testing   160
8  Large Deviations and Error Exponents for Hypothesis Testing   184
9  Sequential and Quickest Change Detection   208
10  Detection of Random Processes   231

Part II  Estimation   257
11  Bayesian Parameter Estimation   259
12  Minimum Variance Unbiased Estimation   280
13  Information Inequality and Cramér–Rao Lower Bound   297
14  Maximum Likelihood Estimation   319
15  Signal Estimation   358

Appendix A  Matrix Analysis   384
Appendix B  Random Vectors and Covariance Matrices   390
Appendix C  Probability Distributions   391
Appendix D  Convergence of Random Sequences   393
Index   395
Preface
In the engineering context, statistical inference has traditionally been ubiquitous in areas as diverse as signal processing, communications, and control. Historically, one of the most celebrated applications of statistical inference theory was the development of radar systems, which was a major turning point during World War II. During the following decades, the theory has been expanded considerably and has provided solutions to an impressive variety of technical problems, including reliable detection, identification, and recovery of radio and television signals, of underwater signals, and of speech signals; reliable communication of data on point-to-point links and on information networks; and control of plants. In the last decade or so, the reach of this theory has expanded even further, finding applications in biology, security (detection of threats), and analysis of big data.

In a broad sense, statistical inference theory addresses problems of detection and estimation. The underlying theory is foundational for machine learning and data science, as it provides gold standards (fundamental performance limits), which, in some cases, can be approached asymptotically by learning algorithms. In order to develop a deep understanding of machine learning, where one does not assume a prior statistical model for the data, one first needs to thoroughly understand model-based statistical inference, which is the subject of this book.

This book is intended to provide a unifying and insightful view, and a fundamental understanding, of statistical inference for engineers and data scientists. It should serve both as a textbook and as a modern reference for researchers and practitioners. The core principles of statistical inference are introduced and illustrated with numerous examples that are designed to be accessible to the broadest possible audience, without relying on domain-specific knowledge.
The examples are designed to emphasize key features of the theory, the implications of the assumptions made (e.g., assuming prior distributions and cost functions), and the subtleties that arise when applying the theory. After an introductory chapter, the book is divided into two main parts. The first part (Chapters 2–10) covers hypothesis testing, where the quantity being inferred (the state) takes on a finite set of values. The second part (Chapters 11–15) covers estimation, where the state is not restricted to a finite set. A summary of the contents of the chapters is as follows:
• In Chapter 1, the problems of hypothesis testing and estimation are introduced through examples, and then cast in the general framework of statistical decision theory. Various approaches (e.g., Bayes, minimax) to solving statistical decision-making problems are defined and compared. The notation used in the book is also defined in this chapter.
• In Chapter 2, the focus is on binary hypothesis testing, where the state takes one of two possible values. The three basic formulations of the binary hypothesis testing problem, namely Bayesian, minimax, and Neyman–Pearson, are described along with illustrative examples.
• In Chapter 3, the methods developed in Chapter 2 are extended to the case of m-ary hypothesis testing, with m > 2. This chapter also includes a discussion of the problem of designing m binary tests simultaneously and obtaining performance guarantees for the collection of tests (rather than for each individual test).
• In Chapter 4, the problem of composite hypothesis testing is studied, where each hypothesis may be associated with more than one probability distribution. Uniformly most powerful (UMP), locally most powerful (LMP), generalized likelihood ratio (GLR), non-dominated, and robust tests are developed to address the composite nature of the hypotheses.
• In Chapter 5, the principles developed in the previous chapters are applied to the problem of detecting a signal, which is a finite sequence, observed in noise. Various models for the signal and noise are considered, along with a discussion of the structures of the optimal tests.
• In Chapter 6, various notions of distances between two distributions are introduced, along with their relationships. These distance metrics prove to be useful in deriving bounds on the performance of the tests for hypothesis testing problems. This chapter should also be of independent interest to researchers from other fields where such distance metrics find applications.
• In Chapter 7, analytically tractable performance bounds for hypothesis testing are derived. Of central interest are upper and lower bounds on the error probabilities of optimal tests. A key tool used in deriving these bounds is the Chernoff bound, which is discussed in great detail in this chapter.
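As a small illustration of the kind of bound discussed above, the following sketch (not from the book; the Gaussian setting, the signal level μ = 1, and all function names are illustrative assumptions) compares the exact Bayesian error probability of the optimal likelihood ratio test for two Gaussian means with the Chernoff (Bhattacharyya, s = 1/2) upper bound (1/2)·exp(−nμ²/8):

```python
import math

def q_function(x):
    # Gaussian tail probability Q(x) = P(Z > x) for Z ~ N(0, 1)
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def exact_error(mu, n):
    # Exact Bayesian error probability of the optimal likelihood ratio test
    # for H0: N(0,1) vs H1: N(mu,1), n i.i.d. samples, equal priors.
    # The test thresholds the sample mean at mu/2, giving Pe = Q(sqrt(n)*mu/2).
    return q_function(math.sqrt(n) * mu / 2.0)

def chernoff_bound(mu, n):
    # Chernoff (Bhattacharyya, s = 1/2) upper bound: Pe <= 0.5*exp(-n*mu^2/8)
    return 0.5 * math.exp(-n * mu * mu / 8.0)

if __name__ == "__main__":
    for n in (1, 10, 100):
        pe, ub = exact_error(1.0, n), chernoff_bound(1.0, n)
        print(f"n={n:3d}  exact Pe={pe:.3e}  Chernoff bound={ub:.3e}")
        assert pe <= ub  # the bound holds and both decay exponentially in n
```

Even in this simple setting the bound captures the correct exponential decay rate in n, which is the theme taken up again in Chapter 8.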
• In Chapter 8, large-deviations theory, whose basis is the Chernoff bound studied in Chapter 7, is used to derive performance bounds on hypothesis testing with a large number of independent and identically distributed observations under each hypothesis. The asymptotics of these methods are also studied, and tight approximations based on the method of exponential tilting are presented.
• In Chapter 9, the problem of hypothesis testing is studied in a sequential setting, where we are allowed to choose when to stop taking observations before making a decision. The related problem of quickest change detection is also studied, where the observations undergo a change in distribution at some unknown time and the goal is to detect the change as quickly as possible, subject to false-alarm constraints.
• In Chapter 10, hypothesis testing is studied in the setting where the observations are realizations of random processes. Notions of Kullback–Leibler and Chernoff divergence rates, and Radon–Nikodym derivatives between two distributions on random processes, are introduced and exploited to develop detection schemes.
• In Chapter 11, the Bayesian approach to parameter estimation is discussed, where the unknown parameter is modeled as random. The cases of scalar and vector-valued parameter estimation are studied separately to emphasize the similarities and differences between these two cases.
• In Chapter 12, several methods are introduced for constructing good estimators when prior probabilistic models are not available for the unknown parameter. The notions of unbiasedness and minimum-variance unbiased estimation are defined, along with notions of sufficient statistics and completeness. Exponential families are studied in detail.
• In Chapter 13, the information inequality is studied for both scalar and vector-valued parameters. This fundamental inequality, when applied to unbiased estimators, results in the powerful Cramér–Rao lower bound (CRLB) on the variance.
• In Chapter 14, the focus is on the maximum likelihood (ML) approach to parameter estimation. Properties of the ML estimator are studied in the asymptotic setting where the number of observations goes to infinity. Recursive ways to compute (approximations to) ML estimators are studied, along with the practically useful expectation-maximization (EM) algorithm.
• In Chapter 15, we shift away from parameter estimation and study the problem of estimating a discrete-time random signal using noisy observations of the signal. The celebrated Kalman filter is studied in detail, along with some extensions to nonlinear filtering. The chapter ends with a discussion of estimation in finite alphabet hidden Markov models (HMMs).
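The Kalman filter mentioned in the Chapter 15 summary admits a very compact scalar form. The sketch below is illustrative only: the model x_{k+1} = a·x_k + w_k, y_k = x_k + v_k, the function name `kalman_1d`, and all parameter values are our assumptions, not the book's development. It alternates the predict and update steps driven by the innovation y − x_pred:

```python
def kalman_1d(ys, a=1.0, q=0.01, r=1.0, x0=0.0, p0=1.0):
    # Scalar Kalman filter for the (hypothetical) model
    #   state:       x_{k+1} = a * x_k + w_k,   w_k ~ N(0, q)
    #   observation: y_k     = x_k + v_k,       v_k ~ N(0, r)
    # Returns the filtered estimates E[x_k | y_1, ..., y_k].
    x, p = x0, p0
    estimates = []
    for y in ys:
        # predict: propagate mean and variance through the state model
        x_pred = a * x
        p_pred = a * a * p + q
        # update: correct the prediction with the innovation y - x_pred
        gain = p_pred / (p_pred + r)
        x = x_pred + gain * (y - x_pred)
        p = (1.0 - gain) * p_pred
        estimates.append(x)
    return estimates

# With a static state (a=1, q=0) the filter reduces to a running average
# of the observations, so it converges toward the constant being observed.
est = kalman_1d([1.0] * 50, a=1.0, q=0.0, r=1.0)
```

The design choice worth noting is that the gain is computed from the predicted variance, so the filter automatically weighs the new observation against the accumulated confidence in the current estimate.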
The main audience for this book is graduate students and researchers who have completed a first-year graduate course in probability. The material in this book should be accessible to engineers and data scientists working in industry, assuming they have the necessary probability background. This book is intended for a one-semester graduate-level course, as it is taught at the University of Illinois at Urbana-Champaign. The core of such a course (about two-thirds) could be formed using the material from Chapter 1, Chapter 2, Sections 3.1–3.4, Sections 4.1–4.6, Sections 6.1–6.3, Chapter 7, Sections 8.1 and 8.2, Chapter 11, Chapter 12, Sections 13.1–13.5, Sections 14.1–14.6, Section 14.11, and Sections 15.1–15.3. The remaining third of the course could cover selected topics from the remaining material at the discretion of the instructor.
Acknowledgments
This book is the result of course notes developed by the authors over a period of more than 20 years as they alternated teaching a graduate-level course on the topic in the Electrical and Computer Engineering department at the University of Illinois at Urbana-Champaign. The authors gratefully acknowledge the invaluable feedback and help that they received from the students in this course over the years. Finally, the authors are thankful to their families for their love and support over the years.
Acronyms

a.s.    almost surely
ADD     average detection delay
AUC     area under the curve
BH      Benjamini–Hochberg
CADD    conditional average detection delay
cdf     cumulative distribution function
CFAR    constant false-alarm rate
CLT     Central Limit Theorem
cgf     cumulant-generating function
CRLB    Cramér–Rao lower bound
CuSum   cumulative sum
d.→     convergence in distribution
DFT     discrete Fourier transform
EKF     extended Kalman filter
EM      expectation-maximization
FAR     false-alarm rate
FDR     false discovery rate
FWER    family-wise error rate
GLR     generalized likelihood ratio
GLRT    generalized likelihood ratio test
GSNR    generalized signal-to-noise ratio
HMM     hidden Markov model
i.i.d.  independent and identically distributed
i.p.    in probability
JSB     joint stochastic boundedness
KF      Kalman filter
KL      Kullback–Leibler
LFD     least favorable distribution
LLRT    log-likelihood ratio test
LMMSE   linear minimum mean squared-error
LMP     locally most powerful
LMS     least mean squares
LRT     likelihood ratio test
MAP     maximum a posteriori
mgf     moment-generating function
ML      maximum likelihood
MLE     maximum likelihood estimator
MMAE    minimum mean absolute-error
MMSE    minimum mean squared-error
MOM     method of moments
MPE     minimum probability of error
m.s.    mean squares
MVUE    minimum-variance unbiased estimator
MSE     mean squared-error
NLF     nonlinear filter
NP      Neyman–Pearson
pdf     probability density function
PFA     probability of false alarm
pmf     probability mass function
QCD     quickest change detection
RLS     recursive least squares
ROC     receiver operating characteristic
SAGE    space-alternating generalized EM
SND     standard noncoherent detector
SNR     signal-to-noise ratio
SPRT    sequential probability ratio test
SR      Shiryaev–Roberts
UMP     uniformly most powerful
WADD    worst-case average detection delay
w.p.    with probability
1 Introduction

In this chapter we introduce the problems of detection and estimation and cast them in the general framework of statistical decision theory, using the notions of states, observations, actions, costs, and optimal decision making. We then introduce the Bayesian approach as the optimal approach in the case of random states. For nonrandom states, the minimax approach and other non-Bayesian approaches are introduced.

1.1 Background
This book is intended to be accessible to graduate students and researchers who have completed a first-year graduate course in probability and random processes. Familiarity with the notions of convergence of random sequences (in probability, almost surely, and in the mean-square sense) is needed in the sections of this text that deal with asymptotics. Elementary knowledge of measure theory is helpful but not required. An excellent reference for this background material is Hajek's book [1]. Familiarity with basic tools from matrix analysis [2] and optimization [3] is also assumed. Textbooks on detection and estimation include the classic book by Van Trees [4] and the more recent books by Poor [5], Kay [6], and Levy [7]. For a thorough treatment of the subject, the reader is referred to the classic books by Lehmann [8] and Lehmann and Casella [9]. While these books treat detection and estimation as distinct but loosely related topics, in this text we stress the tight connection between the underlying concepts via Wald's statistical decision theory [10, 11, 12]. The subject matter is also related to statistical learning theory and to information theory. Statistical learning theory deals with unknown probability distributions and with the construction of decision rules whose performance approaches that of rules that know the underlying distributions. Performance analysis of the latter rules is of central interest in this book, and many of the analytic tools are also useful in learning theory [13]. Finally, the connection to information theory appears via the role of the Kullback–Leibler divergence between probability distributions, Fisher information, sufficient statistics, and information geometry [14].

1.2 Notation
The following notation is used throughout this book. Random variables are denoted by uppercase letters (e.g., Y), their individual values by lowercase letters (e.g., y), and the set of values by script letters (e.g., Y). We use sans serif notation for matrices (e.g., A); the identity matrix is denoted by I. The indicator function of a set A is denoted by 1_A, i.e.,
    1_A(x) = 1 if x ∈ A, and 0 otherwise.    (1.1)
For simplicity of the exposition, we shall mostly be interested in discrete random variables and in continuous random variables over Euclidean spaces. In some cases where we need to distinguish between scalar and vector variables, we use boldface for vectors (e.g., Y = [Y1 Y2]^T). The reader is referred to [1] for a thorough introduction to the topics in this section.
1.2.1 Probability Distributions

Consider a random variable Y taking its values in a set Y endowed with a σ-algebra F. A probability measure P is a real-valued function on F that satisfies Kolmogorov's three axioms of probability: nonnegativity, unit measure, and σ-additivity. If Y is (finitely or infinitely) countable, F is the collection of all subsets of Y, and we denote by {p(y), y ∈ Y} a probability mass function (pmf) on Y. The probability of a set A ⊂ Y (hence A ∈ F) is P(A) = Σ_{y∈A} p(y). If Y is the n-dimensional Euclidean space R^n (with n ≥ 1), we choose F = B(R^n), the Borel σ-algebra, which contains all n-dimensional rectangles and all finite or infinite unions of such rectangles. We then assume that Y is a continuous random variable, i.e., it has a probability density function (pdf), which we denote by {p(y), y ∈ Y}. The probability of a Borel set A ∈ B(R^n) is then P(A) = ∫_A p(y) dy. The cumulative distribution function (cdf) of Y is given by

    F(y) = P(A_y),    (1.2)

where A_y = {y′ ∈ Y : y′_i ≤ y_i, 1 ≤ i ≤ n} is an n-dimensional orthant. To summarize:

    P(A) = Σ_{y∈A} p(y)  (discrete Y);  P(A) = ∫_A p(y) dy  (continuous Y).    (1.3)

1.2.2 Conditional Probability Distributions
A conditional pmf of a random variable Y is a collection of pmfs {p_x(y), y ∈ Y} indexed by a conditioning variable x ∈ X. Note that x is not necessarily interpreted as a realization of a random variable X, and in that sense, p_x(y) is different from the more traditional conditional distribution of a random variable Y given another random variable X. Similarly, a conditional pdf is a collection of pdfs {p_x(y), y ∈ Y} indexed by a conditioning variable x ∈ X. The conditional probability of a Borel set A ∈ B(R^n) is P_x(A) = ∫_A p_x(y) dy. The conditional cdf is denoted by F_x(y) = P_x(A_y), where A_y is the orthant defined in Section 1.2.1. For both the discrete and the continuous cases, we may write p(y|x) instead of p_x(y), especially in the case that x is a realization of a random variable X.
1.2.3 Expectations and Conditional Expectations
The expectation of a function g(Y) of a random variable Y is given by

    E[g(Y)] = Σ_{y∈Y} p(y)g(y)  (discrete Y);  E[g(Y)] = ∫_Y p(y)g(y) dy  (continuous Y).    (1.4)

The expectation may be viewed as a linear function of the pmf p in the discrete case, and as a linear functional of the pdf p in the continuous case.¹ The conditional expectation of g(Y) given x is denoted by E_x[g(Y)] and takes the form

    E_x[g(Y)] = Σ_{y∈Y} p_x(y)g(y)  (discrete Y);  E_x[g(Y)] = ∫_Y p_x(y)g(y) dy  (continuous Y).    (1.5)

1.2.4 Unified Notation
The cases of discrete and continuous Y can be handled in a unified way, avoiding duplication of formulas. Indeed, the probability of a set A ∈ F may be written as the Lebesgue–Stieltjes integral

    P(A) = ∫_A p(y) dμ(y),    (1.6)

where μ is a σ-finite measure on F, equal to the Lebesgue measure in the continuous case (dμ(y) = dy, or equivalently μ(A) = ∫_A dy for all A ∈ F) and to the counting measure in the discrete case.² The expectation of a function g(Y) is likewise given by

    E[g(Y)] = ∫_Y p(y)g(y) dμ(y).    (1.7)

For conditional expectations we have

    E_x[g(Y)] = ∫_Y p_x(y)g(y) dμ(y),  ∀x ∈ X.    (1.8)

1.2.5 General Random Variables
The expressions defined in the previous sections can be generalized to mixed discrete-continuous random variables using the Lebesgue–Stieltjes formalism. Indeed, let Y be a random variable over R^n with cdf F defined in (1.2). Then the probability of a Borel set A is

    P(A) = ∫_A dF(y)    (1.9)

(the Lebesgue–Stieltjes integral), and the expectation of a function g(Y) is

    E[g(Y)] = ∫_Y g(y) dF(y).    (1.10)

¹ Typical functions g(Y) of interest are polynomials of Y (the expected values thereof are moments of the distribution) and indicator functions of sets (the expected values thereof are probabilities of said sets).
² The counting measure on a set of points y1, y2, . . . is defined as μ(A) = Σ_i 1{y_i ∈ A}.

The random variable Y has a density with respect to a measure μ which is the sum of the Lebesgue measure and a counting measure, in which case the expressions (1.6) and (1.7) hold and allow simple calculations.³ The same concept applies to conditional probabilities and conditional expectations, with F replaced by the conditional cdf F_x. Finally, if n ≥ 2 and the distribution P assigns positive probabilities to subsets of several dimensions, then Y is not of the mixed discrete-continuous type, but (1.9) and (1.10) hold, and so do the expressions (1.6) and (1.7) using a suitable μ.

For any pair of random variables (X, Y), assuming f(y) = E[X | Y = y] exists for each y ∈ Y, let E[X | Y] ≜ f(Y), which is a function of Y. The law of iterated expectations states that E[X] = E(E[X | Y]), which will be very convenient throughout this text. In particular, this law may be applied to evaluate the expectation of a function g(X, Y) as E[g(X, Y)] = E(E[g(X, Y) | Y]).

The (scalar) Gaussian random variable with mean μ and variance σ² is denoted by N(μ, σ²). The Gaussian random vector with mean μ and covariance matrix C is denoted by N(μ, C).

1.3 Statistical Inference
The detection and estimation problems treated in this course are instances of the following general statistical inference problem. Given observations y taking their value in some data space Y, infer (estimate) some unknown state x taking its value in a state space X. The observations are noisy. More precisely, they are stochastically related to the state via a conditional probability distribution {P_x, x ∈ X}, i.e., a collection of distributions on Y indexed by x ∈ X. Note that no assumption of randomness of the state is made at this point. The key problems are to construct a good estimator and to characterize its performance. This raises the following central issues:

• What performance criteria should be used?
• Can we derive an estimator that is optimal under some desirable performance criterion?
• Is such an estimator computationally tractable?
• If not, can the optimization be restricted to a useful class of tractable estimators?

The theory applies to a remarkable variety of statistical inference problems covering applications in diverse areas (communications, signal processing, control, life sciences, economics, etc.).
³ For instance, let Y be the interval [0, 1] and consider the cdf F0(y) = y 1{0 ≤ y ≤ 1} corresponding to the uniform distribution on Y, the cdf F1(y) = 1{1/3 ≤ y ≤ 1} corresponding to the degenerate distribution that assigns all mass to the point y = 1/3, and the cdf F2(y) = (1/2)[F0(y) + F1(y)] corresponding to the mixture of the continuous and discrete random variables above. The means μ0 and μ1 of the continuous and discrete random variables are equal to 1/2 and 1/3, respectively. The mean of the mixture distribution is (1/2)[μ0 + μ1] = 5/12.
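The footnote's mixture calculation is easy to check numerically; a minimal sketch in plain Python (the Monte Carlo check is our illustration device, not from the text):

```python
import random

def sample_F2(rng):
    """Draw from the mixture F2 = (1/2)F0 + (1/2)F1: with probability 1/2
    a Uniform[0,1] draw (F0), otherwise the point mass at y = 1/3 (F1)."""
    return rng.random() if rng.random() < 0.5 else 1.0 / 3.0

rng = random.Random(0)
n = 200_000
mc_mean = sum(sample_F2(rng) for _ in range(n)) / n

# Exact mean of the mixture, as in the footnote: (1/2)(1/2) + (1/2)(1/3) = 5/12.
exact_mean = 0.5 * (1 / 2) + 0.5 * (1 / 3)
```

With this sample size the Monte Carlo estimate agrees with 5/12 ≈ 0.4167 to about two decimal places.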
1.3.1 Statistical Model

The statistical model is a triple (X, Y, {P_x, x ∈ X}) consisting of a state space X, an observation space Y, and a family of conditional distributions P_x on Y, indexed by x ∈ X. The state space may be a continuum or a discrete set, in which cases the inference problem is conventionally called estimation or detection, respectively.
Estimation: Often the state x ∈ X is a scalar parameter, e.g., the unknown attenuation factor of a transmission system, a temperature, or some other physical quantity. More generally, x could be a vector-valued parameter, e.g., x represents a location in two- or three-dimensional Euclidean space, or x represents a set of individual states associated with sensors at different physical locations. Another example of a vector-valued x is a discrete-time signal (speech, geophysical, etc.). Yet another version of this problem involves functions defined over some compact set, e.g., x is a continuous-time signal over a finite time window [0, T]. The notion of time can be extended to space, and so x could also be a two- or three-dimensional discrete-space signal, or a continuous-space signal. In all cases, the parameter space X is a continuum (an uncountably infinite set). In this text, we will limit our attention to the m-dimensional Euclidean space X = R^m. Many of the key insights are already apparent in the scalar case (m = 1).

Detection: In other problems, the state x is a discrete quantity, e.g., a bit in a communication system (X = {0, 1}), or a sequence of bits (X = {0, 1}^n), where n is the length of the binary sequence. Alternatively, x could take one of m values in a signal classification problem (e.g., x represents a word in a speech recognition system, or a person in a biometric verification system). The observed signal comes from one of m possible classes or categories.

Similarly, observations can come in a variety of formats: the data space Y could be

• a finite set, e.g., the observation is a single bit y ∈ Y = {0, 1} in a data communication system, or a sequence of bits;
• a countably infinite set, e.g., the set of all binary sequences, Y = {0, 1}*;
• an uncountably infinite set, such as R for scalar observations, R^m (with m > 1) for vector-valued observations, or the space L2[0, T] of finite-energy analog signals defined over the time window [0, T].

In many problems, the observations form a length-n sequence whose elements take their values in a common space Y1. Then Y = Y1^n is the n-fold product of Y1, and we use the boldface notation Y = (Y1, Y2, . . . , Yn) to represent its elements. The sequence Y often follows the product conditional distribution p_x(y) = ∏_{i=1}^n p_{Yi|X=x}(yi), given the state x.

The estimation problem consists of designing a function x̂: Y → X (an estimator) that produces an estimate x̂(y) ∈ X, given observations y ∈ Y. The detection problem admits the same formulation, the only difference being that X is a discrete set instead of a continuum. Constraints such as linearity can be imposed on the design of the estimator.
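When the observations are i.i.d. given the state, p_x(y) factorizes as the product above; a minimal sketch with a two-state Bernoulli observation model (the numbers are illustrative, not from the text):

```python
from math import prod

# Observation model {p_x, x in X}: given state x, each Y_i is Bernoulli(theta_x).
thetas = {0: 0.2, 1: 0.7}          # state space X = {0, 1}

def likelihood(x, y):
    """p_x(y) = prod_i p(y_i | x) for an i.i.d. sequence y in {0,1}^n."""
    th = thetas[x]
    return prod(th if yi == 1 else 1.0 - th for yi in y)

y = (1, 1, 0, 1)                   # an observed sequence in Y = {0,1}^4
lik = {x: likelihood(x, y) for x in thetas}
```

Here lik[1] = 0.7³ · 0.3 exceeds lik[0] = 0.2³ · 0.8, so the data favor state 1.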
1.3.2 Some Generic Estimation Problems
1. Parameter estimation: estimate the mean μ and variance σ² of a Gaussian distribution from a sequence of observations Yi, 1 ≤ i ≤ n, drawn i.i.d. (independent and identically distributed) N(μ, σ²). Here Y = R^n and X = R × R⁺. Simple estimators of the mean and variance are the so-called sample mean μ̂(Y) = (1/n) Σ_{i=1}^n Yi and sample variance σ̂²(Y) = (1/n) Σ_{i=1}^n (Yi − μ̂(Y))², which are respectively linear and quadratic functions of the data.
2. Probability mass function estimation: estimate a pmf p = {p(1), p(2), . . . , p(k)} from a sequence of observations Yi, 1 ≤ i ≤ n, drawn i.i.d. p. Here Y = {1, . . . , k}^n and X is the probability simplex in R^k. A simple estimator is the empirical pmf p̂(l) = (1/n) Σ_{i=1}^n 1{Yi = l} for l = 1, 2, . . . , k.
3. Estimation of signal in i.i.d. noise: Yi = si + Zi for i = 1, 2, . . . , n, where the Zi are i.i.d. N(0, σ²). Here X = Y = R^n. A simple estimator is Ŝi = cYi, where c ∈ R is a weight to be optimized.
4. Parametric estimation of signal in i.i.d. noise: Yi = si(θ) + Zi for i = 1, 2, . . . , n, where {si(θ)}_{i=1}^n is a sequence parameterized by θ ∈ X, and the Zi are i.i.d. N(0, σ²). The unknown parameter could be estimated using the nonlinear least-squares estimator θ̂ = arg min_{θ∈X} Σ_i (Yi − si(θ))².
5. Signal estimation, prediction, and smoothing (Figure 1.1): Yi = (h ⋆ s)i + Zi for i = 1, 2, . . . , n, where the Zi are i.i.d. N(0, σ²), and h ⋆ s denotes the convolution of the sequence s with the impulse response of the linear system. A candidate estimator is Ŝi = (g ⋆ Y)i, where g is the impulse response of an estimation filter. In a prediction problem, a filter is designed to estimate a future sample of the signal (e.g., estimate s_{i+1}). In a smoothing problem, a filter is designed to estimate past values of the signal.
6. Image denoising: Yij = sij + Zij for i = 1, 2, . . . , n1 and j = 1, 2, . . . , n2, where the Zij are i.i.d. N(0, σ²).
7. Estimation of a continuous-time signal: Y(t) = s(t) + Z(t) for 0 ≤ t ≤ T, where the signal s belongs to a separable Hilbert space such as L2[0, T], and Z is stationary Gaussian noise with mean zero and covariance function R(t), t ∈ R.
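The simple estimators in items 1 and 2 are one-liners; a minimal sketch in plain Python (the data values are illustrative):

```python
from collections import Counter

def sample_mean_var(ys):
    """Sample mean and (1/n) sample variance, as in item 1."""
    n = len(ys)
    mu = sum(ys) / n
    var = sum((y - mu) ** 2 for y in ys) / n
    return mu, var

def empirical_pmf(ys, k):
    """Empirical pmf over {1, ..., k}, as in item 2."""
    n = len(ys)
    counts = Counter(ys)
    return {l: counts.get(l, 0) / n for l in range(1, k + 1)}

mu_hat, var_hat = sample_mean_var([1.0, 2.0, 3.0, 4.0])
pmf_hat = empirical_pmf([1, 1, 2, 3], k=3)
```

Note that the (1/n) sample variance is biased; the unbiased version divides by n − 1 instead (see Chapter 12).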
1.3.3 Some Generic Detection Problems
1. Binary hypothesis testing: under hypothesis H0, the observations are a sequence Yi, i = 1, 2, . . . , n, drawn i.i.d. p0. Under the rival hypothesis H1, the observations are drawn i.i.d. p1.
2. Signal detection in i.i.d. Gaussian noise: Yi = θ si + Zi for i = 1, 2, . . . , n, where s ∈ R^n is a known sequence, θ ∈ {0, 1}, and the Zi are i.i.d. N(0, σ²).
3. M-ary signal classification in i.i.d. Gaussian noise: Yi = si(θ) + Zi for i = 1, 2, . . . , n, where s(θ) ∈ R^n is a sequence parameterized by θ ∈ {0, 1, . . . , M − 1}, and the Zi are i.i.d. N(0, σ²). This generalizes Problem 2 above. A simple detector is the correlation detector, whose output θ̂ maximizes the correlation score T(θ) = Σ_{i=1}^n Yi si(θ) over all θ ∈ {0, 1, . . . , M − 1}.
4. Composite hypothesis testing in i.i.d. Gaussian noise: Yi = α si(θ) + Zi for i = 1, 2, . . . , n, where the multiplier α ∈ {0, 1}, the sequence s(θ) ∈ R^n is a function of some nuisance parameter θ ∈ R^m, and the Zi are i.i.d. N(0, σ²). Here only α is of interest; it is possible (but not required) to estimate θ as a means to solve the detection problem.

Figure 1.1  Signal prediction and smoothing: the signal s passes through a filter h and is observed in additive noise Zi, yielding Yi.
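The correlation detector of item 3 can be sketched directly; the candidate signals and the noisy observation below are made up for illustration:

```python
def correlation_detector(y, signals):
    """Return the theta maximizing T(theta) = sum_i y_i * s_i(theta), plus all scores."""
    scores = {theta: sum(yi * si for yi, si in zip(y, s))
              for theta, s in enumerate(signals)}
    return max(scores, key=scores.get), scores

# M = 3 candidate signals in R^4 and a noisy observation of signals[1].
signals = [(1, 1, 1, 1), (1, -1, 1, -1), (-1, -1, 1, 1)]
y = (0.9, -1.2, 1.1, -0.8)          # roughly signals[1] plus noise
theta_hat, scores = correlation_detector(y, signals)
```

The detector picks θ̂ = 1 here, since the observation correlates most strongly with the second candidate signal.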
1.4 Performance Analysis

For estimation problems, various metrics can be used to measure the closeness of the estimator x̂ to the original x. For instance, let X = R^m and consider:

• squared error C(x, x̂) = Σ_{i=1}^m (x̂i − xi)²;
• absolute error C(x, x̂) = Σ_{i=1}^m |x̂i − xi|.

The squared-error criterion is more tractable but penalizes large errors heavily. This could be desirable in some problems but undesirable in others (e.g., in the presence of outliers in the data). In such cases, the absolute-error criterion might be preferable. For detection problems, one may consider:

• for X = {0, 1, . . . , m − 1}: the zero-one loss 1{x̂ ≠ x};
• for X = {0, 1}^d (the space of length-d binary sequences): the Hamming distance d_H(x̂, x) = Σ_{i=1}^d 1{x̂i ≠ xi}. The Hamming distance, normalized by d, is the bit error rate (BER) in a digital communication system.

In both cases, the performance metric can be evaluated on a specific x, on a specific y, or in some average sense. The average value of the performance metric is often used as a design criterion for the estimation (detection) system, as discussed next.
1.5 Statistical Decision Theory

Detection and estimation problems fall under the umbrella of Abraham Wald's statistical decision theory [10, 11], where the goal is to make a right (optimal) choice from a set of alternatives in a noisy environment. A general framework for statistical decision theory is depicted in Figure 1.2. The first block represents the observational model, which is governed by the conditional distribution p_x(y) of the observations given the state x. The second block is a decision rule δ that outputs an action a = δ(y). Each possible action a incurs a cost C(a, x), to be minimized in some sense.⁴

⁴ The cost is also called loss. Alternatively, one could take a more positive view and specify a utility function U(a, x), to be maximized. Both problem formulations are equivalent if one specifies U(a, x) = b − C(a, x) for some constant b.
Figure 1.2  Statistical decision-making framework: a state x enters the observation model P_x(y), which produces the observation y; the decision rule δ maps y to an action a = δ(y).
Formally, there are six basic ingredients in a typical decision theory problem:

1. X: the set of states. For detection problems, the number of states is finite, i.e., |X| = m < ∞. For binary detection problems, we usually let X = {0, 1}. We denote a typical state for detection problems by the variable j ∈ X, and a typical state for estimation problems by the variable x ∈ X.
2. A: the set of actions, or possible decisions about the state. In most detection and estimation problems, A = X. However, in problems such as decoding with an erasure option, A could be larger than X. In other problems, such as composite hypothesis testing, A is smaller than X. We denote a typical decision by the variable a ∈ A.
3. C(a, x): the cost of taking action a when the state of Nature is x. Typically C(a, x) ≥ 0, and the cost is to be minimized in some suitable sense. Quantifying the costs incurred from decisions allows us to optimize the decision rule. An example of a cost function that is relevant in many applications is the uniform cost function, for which |X| = |A| = m < ∞ and

    C(a, x) = 0 if a = x, and 1 if a ≠ x,  for x ∈ X, a ∈ A.    (1.11)

4. Y: the set of observations. The decision is made based on some random observation Y taking values in Y. The observations could be continuous (with Y ⊂ R^n) or discrete.
5. {P_x, x ∈ X}: the observational model, a family of probability distributions on Y conditioned on the state of Nature x.
6. D: the set of decision rules or tests. A decision rule is a mapping δ: Y → A that associates an action a = δ(y) with each possible observation y ∈ Y.

Detection problems are also referred to as hypothesis testing problems, with the understanding that each state corresponds to a hypothesis about the nature of the observations. The hypothesis corresponding to state j is denoted by H_j.

1.5.1 Conditional Risk and Optimal Decision Rules
The cost associated with a decision rule δ ∈ D is a random quantity (because Y is random), given by C(δ(Y), x). Therefore, to order decision rules according to their "merit" we use the quantity

    R_x(δ) ≜ E_x[C(δ(Y), x)] = ∫_Y p_x(y) C(δ(y), x) dμ(y),    (1.12)

which we call the conditional risk associated with δ when the state is x. The conditional risk function can be used to obtain a (partial) ordering of the decision rules in D, in the following sense.
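For finite X and Y, the integral in (1.12) is just the finite sum R_x(δ) = Σ_y p_x(y) C(δ(y), x); a minimal sketch in plain Python (the pmfs are those of Example 1.1 below, and the rule shown is δ4 of Table 1.1):

```python
def conditional_risk(delta, p, cost):
    """R_x(delta) = sum_y p_x(y) C(delta(y), x) for each state x, finite Y."""
    return {x: sum(p_x[y] * cost(delta[y], x) for y in p_x)
            for x, p_x in p.items()}

def uniform_cost(a, x):
    return 0 if a == x else 1

# Conditional pmfs of Example 1.1: X = {0, 1}, Y = {1, 2, 3}.
p = {0: {1: 1.0, 2: 0.0, 3: 0.0},
     1: {1: 0.0, 2: 0.5, 3: 0.5}}

delta4 = {1: 0, 2: 1, 3: 1}         # the rule delta_4 of Table 1.1
risks = conditional_risk(delta4, p, uniform_cost)
```

For δ4 both conditional risks are zero, which is why it comes out as the best rule in Example 1.1.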
DEFINITION 1.1  A decision rule δ is better⁵ than decision rule δ′ if

    R_x(δ) ≤ R_x(δ′)  for all x ∈ X,

and

    R_x(δ) < R_x(δ′)  for at least one x ∈ X.

Sometimes it may be possible to find a decision rule δ* ∈ D which is better than any other δ ∈ D. In this case, the statistical decision problem is solved. Unfortunately, this usually happens only in trivial cases, as in the following example.

Example 1.1  Suppose X = A = {0, 1} with the uniform cost function as in (1.11). Furthermore, suppose the observation Y takes values in the set Y = {1, 2, 3} and the conditional pmfs of Y are

    p0(1) = 1,  p0(2) = p0(3) = 0,
    p1(1) = 0,  p1(2) = p1(3) = 0.5.

Table 1.1  Decision rules and conditional risks for Example 1.1

δ      y = 1   y = 2   y = 3   R0(δ)   R1(δ)
δ1       0       0       0       0       1
δ2       0       0       1       0       0.5
δ3       0       1       0       0       0.5
δ4       0       1       1       0       0
δ5       1       0       0       1       1
δ6       1       0       1       1       0.5
δ7       1       1       0       1       0.5
δ8       1       1       1       1       0

Then it is easy to verify the conditional risks for the eight possible decision rules shown in Table 1.1. Clearly, δ4 is the best rule according to Definition 1.1, but this happens only because the conditional pmfs p0 and p1 have disjoint supports (see also Exercise 2.1).

Since conditional risks cannot be used directly to find optimal solutions to statistical decision-making problems except in trivial cases, there are two general approaches for finding optimal decision rules: Bayesian and minimax.

1.5.2 Bayesian Approach
If the state of Nature is random with a known prior distribution π(x), x ∈ X, then one can define the average risk or Bayes risk associated with a decision rule δ, which is given by

    r(δ) ≜ E[R_X(δ)] = E[C(δ(Y), X)].    (1.13)
⁵ If δ dominates δ′ as in this definition, the decision rule δ′ is sometimes said to be inadmissible [8] for the statistical inference problem.
More explicitly,

    r(δ) = ∫_X ∫_Y π(x) p_x(y) C(δ(y), x) dy dx   (continuous X, Y),
    r(δ) = Σ_{x∈X} Σ_{y∈Y} π(x) p_x(y) C(δ(y), x)   (discrete X, Y),    (1.14)

which may also be written using the unified notation

    r(δ) = ∫_X ∫_Y π(x) p_x(y) C(δ(y), x) dμ(y) dν(x)    (1.15)

for appropriate measures μ and ν on Y and X, respectively. The optimal decision rule, in the Bayesian framework, is the one that minimizes the Bayes risk:

    δ_B = arg min_{δ∈D} r(δ),    (1.16)

and is known as the Bayesian decision rule. The minimizer in (1.16) might not be unique; all such minimizers are Bayes rules.
1.5.3 Minimax Approach
What if we are not given a prior distribution on the set X? We could postulate a distribution on X (for example, a uniform distribution) and use the Bayesian approach. On the other hand, one may want to guarantee a certain level of performance for all choices of the state. In this case, we use a minimax approach. To this end, we define the maximum (or worst-case) risk (see Figure 1.3(a)):

    R_max(δ) ≜ max_{x∈X} R_x(δ).    (1.17)

The minimax decision rule minimizes the worst-case risk:

    δ_m ≜ arg min_{δ∈D} R_max(δ) = arg min_{δ∈D} max_{x∈X} R_x(δ).    (1.18)

The minimizer in (1.18) might not be unique; all such minimizers are minimax rules.

Figure 1.3  Minimax risk: (a) conditional risk and maximum risk for decision rule δ; (b) example where the minimax rule is pessimistic.
The minimax approach is closely related to game theory [11], and more specifically to a zero-sum game with payoff function R_x(δ), where the state x and the decision rule δ are selected by Nature and by the statistician, respectively. In game-theoretic parlance, the minimax approach is secure in the sense that it offers a performance guarantee even in the worst-case scenario. This property would be most useful if x were selected by an adversary. However, when x is selected by Nature, selecting δ based on the worst-case scenario could result in an overly conservative design, as illustrated in Figure 1.3(b), which compares δ_m with another rule δ. There, R_x(δ_m) is much larger than R_x(δ) for all values of x except in a small neighborhood where δ_m marginally outperforms δ. In this case it does not appear reasonable to select δ_m instead of δ.

Example 1.2  Consider the same setup as in Example 1.1 with the following conditional pmfs:

    p0(1) = p0(2) = 0.5,  p0(3) = 0,
    p1(1) = 0,  p1(2) = p1(3) = 0.5.

Table 1.2  Decision rules, and Bayes and minimax risks for Example 1.2

δ      y = 1   y = 2   y = 3   R0(δ)   R1(δ)   0.5(R0(δ) + R1(δ))   Rmax(δ)
δ1       0       0       0      0       1         0.5                 1
δ2       0       0       1      0       0.5       0.25                0.5
δ3       0       1       0      0.5     0.5       0.5                 0.5
δ4       0       1       1      0.5     0         0.25                0.5
δ5       1       0       0      0.5     1         0.75                1
δ6       1       0       1      0.5     0.5       0.5                 0.5
δ7       1       1       0      1       0.5       0.75                1
δ8       1       1       1      1       0         0.5                 1

We can compute the conditional risks for the eight possible decision rules as shown in Table 1.2. Clearly there is no "best" rule based on conditional risks alone in this case. However, δ2 is better than δ1, δ3, δ5, δ6, and δ7; and δ4 is better than δ3, δ5, δ6, δ7, and δ8. Now consider finding a Bayes rule for priors π0 = π1 = 0.5. It is clear from Table 1.2 that δ2 and δ4 are both Bayes rules. Also, δ2, δ3, δ4, and δ6 are all minimax rules, with minimax risk equal to 0.5.
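Example 1.2 is small enough to check by exhaustive enumeration. The sketch below recomputes the Bayes risks (for priors π0 = π1 = 0.5) and the worst-case risks of all eight rules, assuming the uniform cost of (1.11); the rule indices follow Table 1.2:

```python
from itertools import product

p = {0: {1: 0.5, 2: 0.5, 3: 0.0},   # conditional pmfs of Example 1.2
     1: {1: 0.0, 2: 0.5, 3: 0.5}}

def cost(a, x):                      # uniform cost (1.11)
    return 0 if a == x else 1

def risks(delta):
    """Conditional risk pair (R_0(delta), R_1(delta)) for delta: {1,2,3} -> {0,1}."""
    return tuple(sum(p[x][y] * cost(delta[y], x) for y in (1, 2, 3))
                 for x in (0, 1))

# Enumerate all 2^3 deterministic rules; the order matches delta_1, ..., delta_8.
rules = [dict(zip((1, 2, 3), acts)) for acts in product((0, 1), repeat=3)]
bayes = {i: 0.5 * sum(risks(d)) for i, d in enumerate(rules, start=1)}
worst = {i: max(risks(d)) for i, d in enumerate(rules, start=1)}

bayes_rules = {i for i, r in bayes.items() if r == min(bayes.values())}
minimax_rules = {i for i, r in worst.items() if r == min(worst.values())}
```

The enumeration confirms the conclusions of the example: the Bayes rules are δ2 and δ4, and the minimax rules are δ2, δ3, δ4, and δ6.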
1.5.4 Other Non-Bayesian Rules
A third approach, which does not require the specification of costs and is often convenient for parameter estimation (A = X), is to adopt an estimation rule of the form

    x̂(y) = arg max_{x∈X} ℓ(x, y),    (1.19)

where the objective function ℓ should be tractable and have some good properties. For instance, ℓ(x, y) = p_x(y), which, for fixed y, is called the likelihood function for x. The resulting x̂ is called the maximum-likelihood (ML) estimator (see Chapter 14). Often one also seeks decision rules that have some desired property, such as unbiasedness (see MVUEs in Chapter 12) or invariance with respect to some natural group operation. This defines a class D of feasible decision rules.
1.6 Derivation of Bayes Rule
Define the a posteriori cost (aka posterior cost) of taking action a ∈ A given y ∈ Y as

    C(a|y) ≜ E[C(a, X) | Y = y] = ∫_X C(a, x) π(x|y) dν(x),    (1.20)

where

    π(x|y) = p_x(y)π(x) / p(y) = p_x(y)π(x) / ∫_X p_x(y)π(x) dν(x)    (1.21)

is the posterior probability of the state given the observation, and

    p(y) = ∫_X π(x) p_x(y) dν(x)

is the marginal distribution on Y. The risk r(δ) can be written as

    r(δ) = E[C(δ(Y), X)] = ∫_X R_x(δ) π(x) dν(x) = E[R_X(δ)],    (1.22)

or equivalently as

    r(δ) = E[C(δ(Y), X)] = ∫_Y C(δ(y)|y) p(y) dμ(y) = E[C(δ(Y)|Y)],    (1.23)

where C(δ(y)|y) is the posterior risk, evaluated using (1.20). The problem is to minimize r(δ) over all δ = {δ(y), y ∈ Y} ∈ D. By (1.23), r(δ) is additive over y, and the minimizer is simply obtained by solving the minimization problem

    min_{a∈A} C(a|y)

independently for each y ∈ Y. The general procedure to find the solution is therefore as follows:

1. Compute the posterior distribution using Bayes rule:

    π(x|y) = p_x(y)π(x) / p(y).    (1.24)

2. Compute the posterior costs C(a|y) = ∫_X C(a, x) π(x|y) dν(x) for all a ∈ A.
3. Find the Bayes decision rule

    δ_B(y) = arg min_{a∈A} C(a|y).    (1.25)
Hence the solution depends on the costs and on the posterior distribution π(x|y), which is the fundamental distribution in all Bayesian inference problems.

Example 1.3  Consider the following simplified Bayesian model for the decision of buying or not buying health insurance. Say each person's health is represented by a binary variable x taking value 1 (healthy) or 0 (not healthy). Each person's belief about their own health is likewise represented by a binary random variable Y taking value 1 (believes healthy) or 0 (believes not healthy). A healthy person has no symptoms and believes she is healthy, but a non-healthy person believes she is healthy with probability 1/2. Thus p1(y) = 1{y = 1} and p0(y) = 1/2 for y = 0, 1. The prior belief that a person is healthy is π(1) = 0.8. Each person must decide whether or not to buy health insurance covering the cost of medical treatment for themselves (as well as for uninsured persons). The cost of the insurance premium is 1, and no further cost arises whether or not the person needs medical treatment. If a person does not buy health insurance, the cost is 0 if the person is healthy. However, the cost is 10 if the person is not healthy, reflecting the cost of future medical treatments. The question we would like to answer is: should people purchase insurance?
Solution The state space is X = {0, 1} and the action space is A = {buy, not}. The cost function is given by

C(a, x) = 1   if a = buy (for all x),
          10  if a = not, x = 0,
          0   if a = not, x = 1.

The marginal pmf of Y is given by

p(y) = p_0(0)π(0) + p_1(0)π(1) = 0.2/2 + 0 = 0.1  if y = 0,
       0.9                                         if y = 1.

Note that Y = 0 implies that X = 0. The posterior distribution of X is given by

π(x|0) = 1  if x = 0,
         0  if x = 1,

and

π(x|1) = p_0(1)π(0)/p(1) = 1/9  if x = 0,
         8/9                     if x = 1.

The posterior cost of taking action a given y is C(a|y) = Σ_{x=0,1} C(a, x)π(x|y). The posterior costs for Y = 0 (which implies X = 0) are

C(buy|0) = 1,   C(not|0) = 10 > 1,

hence the Bayes decision for a person who believes she is not healthy is δ_B(0) = buy. The posterior costs for Y = 1 are

C(buy|1) = 1,   C(not|1) = Σ_{x=0,1} π(x|1)C(not, x) = (1/9)·10 + (8/9)·0 = 10/9 > 1,

hence the Bayes decision for a person who believes she is healthy is δ_B(1) = buy.
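The three-step recipe of Section 1.6 can be checked numerically on Example 1.3. The sketch below is illustrative; the dictionary layout and function name are my own, not from the text.

```python
# Illustrative check of Example 1.3 (health insurance).
# States x: 1 = healthy, 0 = not healthy.  Actions: "buy", "not".
prior = {1: 0.8, 0: 0.2}                        # pi(x)
lik = {1: {0: 0.0, 1: 1.0},                     # p_1(y): healthy -> believes healthy
       0: {0: 0.5, 1: 0.5}}                     # p_0(y): not healthy -> 50/50 belief
cost = {("buy", 0): 1, ("buy", 1): 1,           # premium costs 1 in every state
        ("not", 0): 10, ("not", 1): 0}

def bayes_rule(y):
    # Step 1: posterior pi(x|y) by Bayes' rule.
    p_y = sum(lik[x][y] * prior[x] for x in (0, 1))
    post = {x: lik[x][y] * prior[x] / p_y for x in (0, 1)}
    # Steps 2-3: posterior costs C(a|y), then minimize over actions.
    pc = {a: sum(cost[(a, x)] * post[x] for x in (0, 1)) for a in ("buy", "not")}
    return min(pc, key=pc.get), pc

for y in (0, 1):
    action, pc = bayes_rule(y)
    print(y, action, pc["not"])   # y=0: C(not|0) = 10; y=1: C(not|1) = 10/9
```

Both observations lead to "buy", reproducing the posterior costs C(not|0) = 10 and C(not|1) = 10/9 computed above.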
1.7 Link Between Minimax and Bayesian Decision Theory
The Bayesian and minimax approaches are seemingly very different, but there exists a fundamental relation between them. When the state is random with prior π(x), x ∈ X, the Bayes risk of decision rule δ ∈ D is given by

r(δ, π) = ∫_X R_x(δ)π(x) dν(x)
        ≤(a) [max_{x∈X} R_x(δ)] ∫_X π(x) dν(x)
        = max_{x∈X} R_x(δ),    (1.26)

which is upper bounded by the maximum conditional risk. This inequality holds for all π. Moreover, equality in (a) is achieved by any prior π_δ that puts all its mass on the set of maximizer(s) of R_x(δ). Hence we obtain the fundamental equality

r(δ, π_δ) = max_π r(δ, π) = max_{x∈X} R_x(δ),   ∀δ,    (1.27)

which shows that the worst-case conditional risk for rule δ is the same as its Bayes risk under prior π_δ. Minimizing both sides of (1.27) over δ, we obtain the minimax risk, which is achieved by the minimax rule δ*:

min_δ max_π r(δ, π) = min_δ max_{x∈X} R_x(δ) = R_max(δ*).    (1.28)

1.7.1 Dual Concept
The Bayes rule achieves the minimal risk, min_δ r(δ, π), over all decision rules. Denote by δ_π a Bayes rule under prior π. The worst Bayes risk (over all priors) is given by

max_π r(δ_π, π) = max_π min_δ r(δ, π).    (1.29)

DEFINITION 1.2 A prior π* is said to be least favorable if it achieves the maximum risk in (1.29).
1.7.2 Game Theory
The decision models in this section are closely related to game theory. Assume a game is played between the decision maker and Nature, with risk r(δ, π) as the payoff function, minimizing variable δ, and maximizing variable π. The expressions min max and max min obtained in (1.28) and (1.29) reflect the different knowledge available to the two players. The player with more information (who knows the variable selected by the opponent) has a potential advantage, as reflected in the following fundamental result, which states that minimax risk (among all δ ∈ D) is at least as large as worst-case Bayes risk.

PROPOSITION 1.1

max_π min_δ r(δ, π) ≤ min_δ max_π r(δ, π).    (1.30)

Proof For all δ and π, we have

max_{π'} r(δ, π') ≥ r(δ, π).

In particular, this inequality holds for the δ* that achieves min_δ max_{π'} r(δ, π'). Thus, for all π, we have

min_δ max_{π'} r(δ, π') = max_{π'} r(δ*, π') ≥ r(δ*, π) ≥ min_δ r(δ, π).    (1.31)

In particular, for π = π* that achieves max_π min_δ r(δ, π), (1.31) becomes

min_δ max_π r(δ, π) ≥ min_δ r(δ, π*) = max_π min_δ r(δ, π).

This concludes the proof.
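Proposition 1.1 can be illustrated numerically on a small finite game. The risk matrix below is a made-up example, not from the text, and the maximization over priors is approximated by a grid over π_0.

```python
# Made-up 4-rule, 2-state risk matrix to illustrate Proposition 1.1:
# worst-case Bayes risk <= minimax risk.
R = [[1.0, 0.0],     # conditional risks (R_0(delta), R_1(delta)) per rule
     [0.5, 0.5],     # an equalizer rule
     [0.0, 1.0],
     [0.25, 0.9]]

def bayes_risk(row, pi0):
    return pi0 * row[0] + (1 - pi0) * row[1]

grid = [i / 100 for i in range(101)]
maxmin = max(min(bayes_risk(row, p) for row in R) for p in grid)
minmax = min(max(row) for row in R)   # worst prior concentrates on argmax_x R_x
print(maxmin, minmax)                 # 0.5 0.5
```

In this particular matrix the two values coincide, because the equalizer row is also a Bayes rule for π_0 = 1/2 (a saddlepoint, anticipating Proposition 1.2); in general only the inequality (1.30) is guaranteed for deterministic rules.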
1.7.3 Saddlepoint

In some cases, the min-max and max-min values are equal, and a saddlepoint of the game is obtained, as illustrated in Figure 1.4. The following result is fundamental and its proof follows directly from the definitions [12, Ch. 5].

THEOREM 1.1 Saddlepoint Theorem for Deterministic Decision Rules. If r(δ, π) admits a saddlepoint (δ*, π*), i.e.,

r(δ*, π) ≤(a) r(δ*, π*) ≤(b) r(δ, π*),   ∀δ, π,    (1.32)

then δ* is a minimax decision rule, π* is a least-favorable prior, and

max_π min_δ r(δ, π) = min_δ max_π r(δ, π) = r(δ*, π*).

A useful notion for identifying potential saddlepoints is introduced next.

DEFINITION 1.3 An equalizer rule δ is a decision rule such that R_x(δ) is independent of x ∈ X.
Figure 1.4 Saddlepoint of the risk function r(δ, π).
PROPOSITION 1.2 If there exists an equalizer rule δ* that is also a Bayes rule for some π*, then (δ*, π*) is a saddlepoint of the risk function r(δ, π).

Proof To prove the proposition, we need to show that (1.32) holds. Since δ* is by assumption a Bayes rule for π*, the second inequality (b) holds. Since δ* is also an equalizer rule, we have

r(δ*, π) = ∫_X R_x(δ*)π(x) dν(x) = R_x(δ*) ∫_X π(x) dν(x) = R_x(δ*),   ∀π, x,    (1.33)

hence the first inequality (a) holds with equality.
1.7.4 Randomized Decision Rules

As already discussed, the player who has more knowledge generally has an advantage, which is reflected by the fact that strict inequality might hold in (1.30). In such cases one can remedy the situation and construct a better decision rule by randomly choosing within a set of deterministic decision rules.

DEFINITION 1.4 A randomized decision rule δ̃ chooses randomly between a finite set of deterministic decision rules:⁶

δ̃ = δ_ℓ with probability γ_ℓ,   ℓ = 1, ..., L,

for some L and some {γ_ℓ, δ_ℓ ∈ D, 1 ≤ ℓ ≤ L} such that Σ_{ℓ=1}^{L} γ_ℓ = 1.

This definition implies that δ̃(y) is a discrete random variable taking values in A for each y ∈ Y with probabilities γ_1, ..., γ_L. Therefore, given an observation y ∈ Y and a state x ∈ X, the cost is now a random variable that can take values C(δ_ℓ(y), x) with probabilities γ_ℓ for ℓ = 1, ..., L. Performance of a decision rule is evaluated by taking the expected value of that random variable, as in the following definition.

⁶ We can define the randomized decision rule alternatively as a mapping from the observation space Y to a distribution on the action space A [11], where the distribution assigns probability mass γ_ℓ to the action δ_ℓ(y), ℓ = 1, ..., L.
DEFINITION 1.5 The expected cost for randomized decision rule δ̃ is

C(δ̃(y), x) ≜ Σ_{ℓ=1}^{L} γ_ℓ C(δ_ℓ(y), x),   for each x ∈ X and y ∈ Y.

One can define conditional risk and Bayes risk in exactly the same way as before, if cost is replaced with expected cost. In particular, the conditional risk for randomized decision rule δ̃ is R_x(δ̃) ≜ E_x[C(δ̃(Y), x)] for each x ∈ X, and the Bayes risk for δ̃ under prior π is r(δ̃, π) = ∫_X R_x(δ̃)π(x) dν(x). The following proposition follows directly from Definition 1.5.

PROPOSITION 1.3 The conditional risk for randomized decision rule δ̃ is

R_x(δ̃) = Σ_{ℓ=1}^{L} γ_ℓ R_x(δ_ℓ),   x ∈ X.

The Bayes risk for δ̃ under prior π is

r(δ̃, π) = Σ_{ℓ=1}^{L} γ_ℓ r(δ_ℓ, π).    (1.34)
Note that a deterministic decision rule is a degenerate randomized decision rule with L = 1. Hence the set D̃ of randomized decision rules contains the set D of deterministic decision rules, and optimizing over D̃ will necessarily result in at least as good a decision rule as that obtained by optimizing over D.

THEOREM 1.2 Randomization does not improve Bayes rules:

min_{δ∈D} r(δ, π) = min_{δ̃∈D̃} r(δ̃, π).    (1.35)
Proof See Exercise 1.8.

Theorem 1.2 shows that randomization is not useful in a Bayesian setting. However, we can see in Example 1.2 that randomizing between δ_2 and δ_4 with equal probability results in a rule with minimax risk equal to 0.25. Thus, we see that randomization can improve minimax rules. We will explore this further in the next chapter. We will also see that randomization can yield better Neyman–Pearson rules for binary detection problems. More importantly, finding an optimal rule within the class of randomized decision rules D̃ can often be more tractable than finding an optimal rule within the class of deterministic decision rules D. In particular, within the class of randomized decision rules, the minimax risk equals the worst-case Bayes risk, and the following result holds.

THEOREM 1.3 Existence of Saddlepoint. Suppose Y and A are finite sets. Then the function r(δ̃, π) admits a saddlepoint (δ̃*, π*), i.e.,

r(δ̃*, π) ≤ r(δ̃*, π*) ≤ r(δ̃, π*),   ∀δ̃, π,    (1.36)

and δ̃* is a minimax (randomized) decision rule, π* is a least-favorable prior, and
max_π min_{δ∈D} r(δ, π) = max_π min_{δ̃∈D̃} r(δ̃, π) = min_{δ̃} max_π r(δ̃, π) = r(δ̃*, π*).    (1.37)
Proof Let L = |A|^|Y| and let δ_ℓ, 1 ≤ ℓ ≤ L, range over all possible deterministic decision rules. Then the randomized decision rule δ̃ is parameterized by the probability vector γ ≜ {γ_ℓ, 1 ≤ ℓ ≤ L}. By Proposition 1.3, the risk function r(δ̃, π) of (1.34) is linear both in δ̃ (via γ) and in π. Moreover, the feasible sets for γ and π are probability simplices and are thus convex and compact. By the fundamental theorem of game theory, a saddlepoint exists [15]. The first equality in (1.37) follows from Theorem 1.1.

The notion of equalizer rules (Definition 1.3) also applies to randomized decision rules. The following proposition is proved the same way as Proposition 1.2 (see Exercise 1.9).

PROPOSITION 1.4 If there exists an equalizer rule δ̃* that is also a Bayes rule for some π*, then (δ̃*, π*) is a saddlepoint of the risk function r(δ̃, π).
Example 1.4 Assume the following simplified model for penalty kicks in football. The penalty shooter aims either left or right. The goalie dives left or right at the very moment the ball is kicked. If the goalie dives in the same direction as the ball, the probability of scoring is 0.55. Otherwise the probability of scoring is 0.95. We denote by x the direction selected by the penalty shooter (left or right) and by a the direction selected by the goalie (also left or right). We define the cost C(a, x) = 0.55 if a = x, and C(a, x) = 0.95 otherwise. In this problem setup, no observations are available. A deterministic decision rule for the goalkeeper reduces to an action, hence there exist only two possible deterministic rules: δ_1 (dive left) and δ_2 (dive right). The worst-case risk for both δ_1 and δ_2 is 0.95. (A goalie who systematically chooses one side will be systematically punished by his opponent, who would simply shoot in the opposite direction.) However, the randomized decision rule δ' that picks δ_1 and δ_2 with equal probability is an equalizer rule, since the conditional risk is (0.95 + 0.55)/2 = 0.75 for both x = left and x = right. Moreover, if x is random with uniform probability, we have r(δ_1, π) = r(δ_2, π) = 0.75 (both deterministic rules have the same risk) and therefore r(δ', π) = 0.75 as well. Since δ' is an equalizer rule as well as a Bayes rule, we conclude that δ' achieves the minimax risk and that the uniform distribution on x is the least-favorable prior.
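The penalty-kick computation can be verified directly; the encoding of directions as "L"/"R" below is my own.

```python
# Numerical check of Example 1.4 (penalty kicks).
# Cost 0.55 if the goalie matches the shot direction, 0.95 otherwise.
cost = {(a, x): 0.55 if a == x else 0.95
        for a in ("L", "R") for x in ("L", "R")}

# The two deterministic rules (always dive left / always dive right):
worst_det = {a: max(cost[(a, x)] for x in ("L", "R")) for a in ("L", "R")}
print(worst_det)   # {'L': 0.95, 'R': 0.95}: a predictable goalie is punished

# Randomized rule: dive left or right with probability 1/2 each.
R = {x: 0.5 * cost[("L", x)] + 0.5 * cost[("R", x)] for x in ("L", "R")}
print(R)           # conditional risk 0.75 for both x: an equalizer rule
```

Since the randomized rule is an equalizer rule and a Bayes rule for the uniform prior, it is minimax by Proposition 1.4, with minimax risk 0.75 versus 0.95 for either deterministic rule.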
Exercises

1.1 Binary Decision Making. Consider the following decision problem with X = A = {0, 1}. Suppose the cost function is given by

C(a, x) = 0   if a = x,
          1   if x = 0, a = 1,
          10  if x = 1, a = 0.

The observation Y takes values in the set Y = {1, 2, 3} and the conditional pmfs of Y are:

p_0(1) = 0.4, p_0(2) = 0.6, p_0(3) = 0;   p_1(1) = p_1(2) = 0.25, p_1(3) = 0.5.
(a) Is there a best decision rule based on conditional risks?
(b) Find Bayes (for equal priors) and minimax rules within the set of deterministic decision rules.
(c) Now consider the set of randomized decision rules. Find a Bayes rule (for equal priors). Also find a randomized rule whose maximum risk is smaller than that of the minimax rule of part (b).

1.2 Health Insurance. Find the minimax decision rule for the health insurance problem given in Example 1.3.

1.3 Binary Communication with Erasures. Let X = {0, 1}, and A = {0, 1, e}. This example models communication of a bit x ∈ X with an erasure option at the receiver. Now suppose

p_j(y) = (1/(√(2π)σ)) exp{−(y − (−1)^{j+1})²/(2σ²)},   j ∈ X, y ∈ R.

That is, Y has distribution N(−1, σ²) when the state is 0, and Y has distribution N(1, σ²) when the state is 1. Assume a cost structure

C(a, j) = 0  if a = 0, j = 0 or a = 1, j = 1,
          1  if a = 1, j = 0 or a = 0, j = 1,
          c  if a = e.

Furthermore, assume that the two states are equally likely.
(a) First assume that c < 0.5. Show that the Bayes rule for this problem has the form

δ_B(y) = 0  if y ≤ −t,
         e  if −t < y < t,
         1  if y ≥ t.

Also give an expression for t in terms of the parameters of the problem.
(b) Now find δ_B(y) when c ≥ 0.5.

1.4 Medical Diagnosis. Let X be the health state of a patient, who is healthy (x = 0) with probability 0.8 and sick (x = 1) with probability 0.2. A medical test produces a measurement Y following a Poisson law with parameter x + 1. The physician may take one of three actions: declare the patient healthy (a = 0), sick (a = 1), or order a new medical test (a = t). The cost structure is as follows:

C(a, x) = 0    if a = x,
          1    if a = 1, x = 0,
          e    if a = 0, x = 1,
          e/4  if a = t.

Derive the Bayes decision rule for the physician.
1.5 Optimal Routing. A driver needs to decide between taking the highway (h) or side streets (s) based on prior information and input from a map application on her phone about congestion on the highway. If the highway is congested, it would be worse than taking the side streets, but if the highway has normal traffic it would be better than taking the side streets. Let X denote the state of the highway, with value 0 denoting normal traffic and value 1 denoting congestion. The prior information that the driver has tells her that at the time she is planning to drive

π(0) = P{X = 0} = 0.3,   π(1) = P{X = 1} = 0.7.

The map application displays a recommendation (using colors such as blue and red), which is modeled as a binary observation Y ∈ {0, 1}, with

p_0(0) = 0.8, p_0(1) = 0.2;   p_1(0) = 0.4, p_1(1) = 0.6.

The cost function that penalizes the driver's actions is given by:

C(a, x) = 10  if a = h, x = 1,
          1   if a = h, x = 0,
          5   if a = s, x = 0,
          2   if a = s, x = 1.
Derive the Bayes decision rule for the driver.

1.6 Job Application. Don, Marco, and Ted are CEO candidates for their company. Each candidate incurs a cost if another candidate is selected for the job. If Don is selected, the cost to Marco and Ted is 10. If Marco or Ted is selected, the cost to the other two candidates is 2. The cost to a candidate is 0 if he is selected.
(a) The selection will be made by a committee; it is believed that the committee will select Don with probability 0.8 and Marco with probability 0.1. Compute the expected cost for each candidate.
(b) While Marco and Ted are underdogs, if either one of them drops out of contention, Don's probability of being selected will drop to 0.6. Marco and Ted discuss that possibility and have three possible actions (Marco drops out, Ted drops out, or both stay in the race). What decision minimizes their expected cost?

1.7 Wise Investor. Ann has received $1000 and must decide whether to invest it in the stock market or to safely put it in the bank, where it collects no interest. Assume the economy is either in recession or in expansion, and will remain in the same state for the next year. During that period, the stock market will go up 10 percent if the economy expands, and down 5 percent otherwise. The economy has a 40 percent prior probability of being in expansion. An investment guru claims to know the state of the economy and reveals it to Ann. Whatever the state is, there is a probability λ ∈ [0, 1] that the guru is incorrect. Ann will act on the binary information provided by the guru but is well aware of the underlying probabilistic model. Give the Bayesian and minimax decision rules and Ann's expected financial gains in terms of λ. Perform a sanity check to verify your answer is correct in the extreme cases λ = 0 and λ = 1.
1.8 Randomized Bayes Rules. Prove Theorem 1.2.

1.9 Equalizer Rule. Prove Proposition 1.4.
References

[1] B. Hajek, Random Processes for Engineers, Cambridge University Press, 2015.
[2] R. A. Horn and C. R. Johnson, Matrix Analysis, Cambridge University Press, 1990.
[3] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
[4] H. L. Van Trees, Detection, Estimation and Modulation Theory, Part I, Wiley, 1968.
[5] H. V. Poor, An Introduction to Signal Detection and Estimation, 2nd edition, Springer-Verlag, 1994.
[6] S. M. Kay, Fundamentals of Statistical Signal Processing: Detection Theory, Prentice-Hall, 1998.
[7] B. C. Levy, Principles of Signal Detection and Parameter Estimation, Springer-Verlag, 2008.
[8] E. L. Lehmann, Testing Statistical Hypotheses, 2nd edition, Springer-Verlag, 1986.
[9] E. L. Lehmann and G. Casella, Theory of Point Estimation, 2nd edition, Springer-Verlag, 1998.
[10] A. Wald, Statistical Decision Functions, Wiley, New York, 1950.
[11] T. S. Ferguson, Mathematical Statistics: A Decision Theoretic Approach, Academic Press, 1967.
[12] J. O. Berger, Statistical Decision Theory and Bayesian Analysis, Springer-Verlag, 1985.
[13] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer, 1996.
[14] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd edition, Wiley, 2006.
[15] J. von Neumann, "Sur la théorie des jeux," Comptes Rendus Acad. Sci., pp. 1689–1691, 1928.
Part I Hypothesis Testing

2 Binary Hypothesis Testing
In this chapter we cover the foundations of binary hypothesis testing or binary detection. The three basic formulations of the binary hypothesis testing problem, namely, Bayesian, minimax, and Neyman–Pearson, are described along with illustrative examples.
2.1 General Framework

We first specialize the general decision-theoretic framework of Section 1.5 to the binary hypothesis testing problem. In particular, the six ingredients of the framework specialize as follows:
1. The state space is X = {0, 1}, and its elements will be generically denoted by j ∈ X. We refer equivalently to "state j" or to "hypothesis H_j."
2. The action space A is the same as the state space X, and its elements will be generically denoted by i ∈ A.
3. The cost function C : A × X → R can be viewed as a square matrix whose element C_ij is the cost of declaring hypothesis H_i is true when H_j is true. The case of uniform costs, C_ij = 1{i ≠ j}, will be particularly insightful.
4. The decision is made based on observation Y taking values in Y. The observations could be continuous (with Y ⊂ R^n) or discrete.
5. The conditional pdf (or pmf) for the observations given state j (hypothesis H_j) is denoted by p_j(y), y ∈ Y.
6. We will consider both deterministic decision rules δ ∈ D and randomized decision rules δ̃ ∈ D̃. Note there are 2^|Y| possible deterministic decision rules when Y is finite.

Any deterministic decision rule δ partitions the observation space into disjoint sets Y_0 and Y_1, corresponding to decisions δ(y) = 0 and δ(y) = 1, respectively. The hypotheses H_0 and H_1 are sometimes referred to as the null hypothesis and the alternative hypothesis, respectively. The regions Y_0 and Y_1 are then referred to as the acceptance region and the rejection region for H_0, respectively. The conditional risk associated with deterministic decision rule δ when the state is j is
R_j(δ) = E_j[C(δ(Y), j)] = ∫_Y C(δ(y), j) p_j(y) dµ(y) = C_{0j} P_j(Y_0) + C_{1j} P_j(Y_1),   j = 0, 1.    (2.1)

We define the probability of false alarm P_F(δ) and the probability of miss P_M(δ) as

P_F(δ) ≜ P_0(Y_1),   P_M(δ) ≜ P_1(Y_0).    (2.2)

ASSUMPTION 2.1 The cost of a correct decision about the state is strictly smaller than that of a wrong decision:

C_00 < C_10,   C_11 < C_01.

Under this assumption, the following inequalities hold for any deterministic decision rule δ:¹

R_0(δ) = C_00 P_0(Y_0) + C_10 P_0(Y_1) ≥ C_00 P_0(Y_0) + C_00 P_0(Y_1) = C_00,    (2.3)
R_1(δ) = C_01 P_1(Y_0) + C_11 P_1(Y_1) ≥ C_11 P_1(Y_0) + C_11 P_1(Y_1) = C_11.    (2.4)

Consider the rules δ_0(y) = 0 for all y, and δ_1(y) = 1 for all y. It is clear that δ_0 achieves the lower bound in (2.3) and δ_1 achieves the lower bound in (2.4). However, R_1(δ_0) and R_0(δ_1) are typically unacceptably large because the rules δ_0 and δ_1 ignore the observations. Now, we explore under what conditions one can find a best rule based on Definition 1.1. For a rule δ' to be best, it must be better than both δ_0 and δ_1, and therefore must satisfy R_0(δ') = C_00 and R_1(δ') = C_11. It is easy to show that such a δ' exists if and only if p_0 and p_1 have disjoint supports, as in Example 1.1 (see Exercise 2.1).
2.2 Bayesian Binary Hypothesis Testing

Assuming priors π_0, π_1 for the states, the Bayes risk for a decision rule δ is given by

r(δ) = π_0 R_0(δ) + π_1 R_1(δ)
     = π_0(C_10 − C_00)P_0(Y_1) + π_1(C_01 − C_11)P_1(Y_0) + π_0 C_00 + π_1 C_11.    (2.5)

The posterior probability π(j|y) of state j, given observation y, is given by

π(j|y) = p_j(y)π_j / p(y).    (2.6)

The posterior cost of decision i ∈ X, given observation y, is given by

C(i|y) ≜ Σ_{j∈X} π(j|y) C_ij.    (2.7)

¹ These lower bounds can be shown to hold for randomized rules as well.
Using (1.25), we can find a Bayes decision rule for binary detection as:

δ_B(y) = arg min_{i∈{0,1}} C(i|y) = 1    if C(1|y) < C(0|y),
                                    any  if C(1|y) = C(0|y),
                                    0    if C(1|y) > C(0|y),    (2.8)

where "any" can be arbitrarily chosen to be 0 or 1, without any effect on the Bayes risk. Hence the Bayes solution need not be unique. By (2.7) we have

C(0|y) − C(1|y) = π(1|y)[C_01 − C_11] − π(0|y)[C_10 − C_00],    (2.9)

and thus (2.8) yields

δ_B(y) = 1    if π(1|y)[C_01 − C_11] > π(0|y)[C_10 − C_00],
         any  if π(1|y)[C_01 − C_11] = π(0|y)[C_10 − C_00],
         0    if π(1|y)[C_01 − C_11] < π(0|y)[C_10 − C_00].    (2.10)

Using (2.6), the posteriors can be replaced by p_1(y)π_1 and p_0(y)π_0, which turns (2.10) into a test on the likelihood ratio

L(y) ≜ p_1(y)/p_0(y).    (2.11)

If we further define the threshold η by

η = (π_0/π_1) · (C_10 − C_00)/(C_01 − C_11),    (2.12)

then we can write (2.10) as the likelihood ratio test (LRT)

δ_B(y) = 1    if L(y) > η,
         any  if L(y) = η,
         0    if L(y) < η.    (2.13)

Under uniform costs we have η = π_0/π_1, and the Bayes rule compares the posterior probabilities directly:

δ_B(y) = 1    if π(1|y) > π(0|y),
         any  if π(1|y) = π(0|y),
         0    if π(1|y) < π(0|y).
Example 2.1 Gaussian Hypothesis Testing. Let Y = R, and suppose the observation is Gaussian with variance σ² under both hypotheses, with mean 0 under H_0 and mean µ > 0 under H_1. The conditional pdfs for the observation Y are given by

p_0(y) = (1/√(2πσ²)) exp{−y²/(2σ²)},
p_1(y) = (1/√(2πσ²)) exp{−(y − µ)²/(2σ²)},   y ∈ R.

Hence the likelihood ratio is given by

L(y) = p_1(y)/p_0(y) = exp{−[(y − µ)² − y²]/(2σ²)} = exp{(µ/σ²)(y − µ/2)}.

Taking the logarithm or applying any other monotonically increasing function to both sides of the LRT (2.13) does not change the decision. Hence (2.13) is equivalent to a threshold test on the observation y:

δ_B(y) = 1  if y ≥ τ,
         0  if y < τ,    (2.16)

with threshold

τ = (σ²/µ) ln η + µ/2.

Note we have ignored the "any" case, which has zero probability under both H_0 and H_1 here. The acceptance and rejection regions for the test are the half-lines Y_1 = [τ, ∞) and Y_0 = (−∞, τ), respectively, as illustrated in Figure 2.1.
Performance Evaluation Denote by

Φ(t) = ∫_{−∞}^{t} (1/√(2π)) e^{−x²/2} dx,   t ∈ R,    (2.17)

the cdf of the N(0, 1) random variable, and by Q its complement, i.e., Q(t) = 1 − Φ(t) = Φ(−t), for t ∈ R. The Q function is plotted in Figure 2.2. Then the probability of false alarm and the probability of miss are respectively given by

P_F(δ_B) = P_0(Y_1) = P_0{Y ≥ τ} = Q(τ/σ),    (2.18)
P_M(δ_B) = P_1(Y_0) = P_1{Y < τ} = Q((µ − τ)/σ).    (2.19)

Figure 2.1 Gaussian binary hypothesis testing.

Figure 2.2 The Q function.
Table 2.1 All possible decision rules for Example 2.2, with the corresponding rejection and acceptance regions, conditional risks under uniform costs, Bayes risk, and worst-case risk.

δ     Y_0        Y_1        R_0(δ)   R_1(δ)   π_0 R_0(δ) + π_1 R_1(δ)   max{R_0(δ), R_1(δ)}
δ_1   {1,2,3}    ∅          0        1        π_1                       1
δ_2   {1,2}      {3}        0        0.5      π_1/2                     0.5
δ_3   {1,3}      {2}        0.5      0.5      1/2                       0.5
δ_4   {1}        {2,3}      0.5      0        π_0/2                     0.5
δ_5   {2,3}      {1}        0.5      1        π_1 + π_0/2               1
δ_6   {2}        {1,3}      0.5      0.5      1/2                       0.5
δ_7   {3}        {1,2}      1        0.5      π_0 + π_1/2               1
δ_8   ∅          {1,2,3}    1        0        π_0                       1
Hence, the average error probability is given by

P_e(δ_B) = π_0 P_F(δ_B) + π_1 P_M(δ_B)
         = π_0 Q(µ/(2σ) + (σ/µ) ln η) + π_1 Q(µ/(2σ) − (σ/µ) ln η).    (2.20)

The error probability depends on µ and σ only via their ratio, d ≜ µ/σ. If equal priors are assumed (π_0 = π_1 = 1/2), then η = 1, τ = µ/2, and P_e(δ_B) = Q(d/2), which is a decreasing function of d. The worst value of d is d = 0, in which case the two hypotheses are indistinguishable and P_e(δ_B) = 1/2. For large d, we have Q(d/2) ~ exp{−d²/8}/(d√(π/2)), hence P_e(δ_B) vanishes exponentially with d.
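The performance formulas (2.18)–(2.20) can be sketched numerically, writing the Gaussian tail as Q(t) = erfc(t/√2)/2; the parameter values below are arbitrary, not from the text.

```python
# Performance of the Gaussian LRT for equal priors and uniform costs.
import math

def Q(t):
    return 0.5 * math.erfc(t / math.sqrt(2.0))

mu, sigma = 2.0, 1.0          # signal level and noise standard deviation
eta = 1.0                     # uniform costs and equal priors
tau = (sigma ** 2 / mu) * math.log(eta) + mu / 2
PF = Q(tau / sigma)           # false-alarm probability, eq. (2.18)
PM = Q((mu - tau) / sigma)    # miss probability, eq. (2.19)
Pe = 0.5 * PF + 0.5 * PM      # average error probability, eq. (2.20)
d = mu / sigma
print(PF, PM, Pe, Q(d / 2))   # P_e = Q(d/2) for equal priors
```

With equal priors, τ = µ/2, so P_F = P_M and the error probability reduces to Q(d/2), as stated above.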
Example 2.2 Discrete Observations. We revisit Example 1.2 with X = A = {0, 1}, Y = {1, 2, 3}, and conditional pmfs:

p_0(1) = 1/2, p_0(2) = 1/2, p_0(3) = 0;   p_1(1) = 0, p_1(2) = 1/2, p_1(3) = 1/2.

As we saw before, there are 2³ = 8 possible decision rules. We rewrite these decision rules using the decision regions Y_0 and Y_1 in Table 2.1, and provide their conditional risks for uniform costs and Bayes risks for arbitrary priors. By inspection of Table 2.1, rules δ_1, δ_3, δ_5, δ_6, δ_7 cannot be better than δ_2 for any value of π_0, and are strictly worse for any π_0 ∉ {0, 1}. Similarly, rules δ_3, δ_5, δ_6, δ_7, δ_8 cannot be better than δ_4 for any value of π_0, and are strictly worse for any π_0 ∉ {0, 1}. This leaves only δ_2 = 1{3} and δ_4 = 1{2, 3} as admissible rules for π_0 ∉ {0, 1}. Observe that for both rules, one conditional risk is equal to 0 and the other one is 0.5. The rules have respective Bayes risks P_e(δ_2) = (1 − π_0)/2 and P_e(δ_4) = π_0/2. Hence δ_4 is a Bayes rule for 0 ≤ π_0 ≤ 1/2, and is unique for 0 < π_0 < 1/2; and δ_2 is a Bayes rule for 1/2 ≤ π_0 ≤ 1, and is unique for 1/2 < π_0 < 1. The Bayes risk is equal to P_e(δ_B) = (1/2) min{π_0, 1 − π_0}.

While generally inefficient, exhaustive listing and comparison of all decision rules is easy here because Y is small. A more principled approach is to derive the LRT. Here the likelihood ratio is given by

L(y) = p_1(y)/p_0(y) = 0  if y = 1,
                       1  if y = 2,
                       ∞  if y = 3.    (2.21)

Under uniform costs, the LRT threshold is η = π_0/(1 − π_0). Thus we see again that δ_2 is a Bayes rule for 1/2 ≤ π_0 ≤ 1 and δ_4 is a Bayes rule for 0 ≤ π_0 ≤ 1/2. For π_0 = 1/2 we have η = 1. By (2.13), the decision in the case y = 2 is immaterial, and again we see that both δ_2 and δ_4 are Bayes rules.
Example 2.3 Binary Communication Channel. Consider the binary communication channel with input j ∈ X = {0, 1} and output y ∈ Y = {0, 1}, viewed as transmitted and received bits, respectively. The channel has crossover error probabilities λ_0 and λ_1, shown in Figure 2.3; thus p_0(1) = λ_0 and p_1(0) = λ_1. There are 2² = 4 possible decision rules. Their conditional risks for uniform costs are shown in Table 2.2, where λ̄ = (λ_0 + λ_1)/2, λ_max = max(λ_0, λ_1), and λ_min = min(λ_0, λ_1). Clearly λ_min ≤ λ̄ ≤ λ_max. The likelihood ratio is given by

L(y) = p_1(y)/p_0(y) = λ_1/(1 − λ_0)  if y = 0,
                       (1 − λ_1)/λ_0  if y = 1.

The LRT is given by (2.13). Assume now uniform costs and equal priors, in which case η = 1 again.

Case I: λ_0 + λ_1 < 1. Then λ_1/(1 − λ_0) < 1 < (1 − λ_1)/λ_0, and the Bayes rule is given by δ_B(y) = y (output the received bit). This is rule δ_2 in Table 2.2. Then

P_F(δ_B) = p_0(1) = λ_0,   P_M(δ_B) = p_1(0) = λ_1,   P_e(δ_B) = (λ_0 + λ_1)/2.

Figure 2.3 Binary communication channel with input j, output y, and crossover error probabilities λ_0 and λ_1.
Table 2.2 All possible decision rules for Example 2.3, with the corresponding rejection and acceptance regions, conditional risks under uniform costs, Bayes risk under the uniform prior, and worst-case risk.

δ     Y_0      Y_1      R_0(δ)    R_1(δ)    0.5(R_0(δ) + R_1(δ))   max(R_0(δ), R_1(δ))
δ_1   {0,1}    ∅        0         1         0.5                    1
δ_2   {0}      {1}      λ_0       λ_1       λ̄                      λ_max
δ_3   {1}      {0}      1 − λ_0   1 − λ_1   1 − λ̄                  1 − λ_min
δ_4   ∅        {0,1}    1         0         0.5                    1

Figure 2.4 Risk (error probability) as a function of λ̄ for the binary communication channel.
Case II: λ_0 + λ_1 > 1. The same derivation yields δ_B(y) = 1 − y (flip the received bit, i.e., rule δ_3 in Table 2.2), with

P_F(δ_B) = p_0(0) = 1 − λ_0,   P_M(δ_B) = p_1(1) = 1 − λ_1,   P_e(δ_B) = 1 − (λ_0 + λ_1)/2.

Case III: λ_0 + λ_1 = 1. Here all possible decision rules have the same risk and are Bayes rules, and P_e(δ_B) = 1/2. The channel is so noisy that the observation does not convey any information about the transmitted bit.

Combining all three cases, we express the error probability as

P_e(δ_B) = min{λ̄, 1 − λ̄},

where λ̄ = (λ_0 + λ_1)/2. As shown in Figure 2.4, the error probability is a continuous function of λ̄, but the decision rule changes abruptly at λ̄ = 0.5, due to the transition from Case I to Case II.
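The three cases of Example 2.3 collapse into one closed form, sketched below for equal priors and uniform costs.

```python
# Bayes error of the binary channel under equal priors and uniform costs:
# P_e = min(lambda_bar, 1 - lambda_bar) with lambda_bar = (lambda0 + lambda1)/2.
def bayes_error(lam0, lam1):
    lam_bar = (lam0 + lam1) / 2
    # Case I (sum < 1): pass the bit through; Case II (sum > 1): flip it;
    # Case III (sum = 1): every rule has risk 1/2.
    return min(lam_bar, 1 - lam_bar)

print(bayes_error(0.1, 0.2))   # Case I, approx 0.15
print(bayes_error(0.9, 0.8))   # Case II, approx 0.15
print(bayes_error(0.3, 0.7))   # Case III, approx 0.5
```

Note the symmetry: a very noisy channel (λ̄ close to 1) is as useful as a clean one, since the receiver can simply flip the bit.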
2.3 Binary Minimax Hypothesis Testing

The deterministic minimax rule is obtained from (1.18) as

δ_m = arg min_{δ∈D} R_max(δ),    (2.22)

where R_max(δ) = max{R_0(δ), R_1(δ)}. Several approaches can be considered to find minimax rules. The most naive one is to evaluate every possible decision rule, as done in Tables 1.2 and 2.2. A second approach is to seek an equalizer rule that is also a Bayes rule; by Proposition 1.2 such a rule is a minimax rule. The third approach, which is generally an improvement over (2.22), consists of seeking a randomized decision rule:

δ̃_m = arg min_{δ̃∈D̃} R_max(δ̃),    (2.23)

where R_max(δ̃) = max{R_0(δ̃), R_1(δ̃)}.

2.3.1 Equalizer Rules
To develop intuition, we first solve the minimax problem for the three problems of Section 2.2.3. We have established in Example 2.1 that, for Gaussian hypothesis testing, Bayes rules are threshold rules of the form: decide H_1 if y ≥ τ and H_0 if y < τ, with

τ = µ/2 + (σ²/µ) ln(π_0/π_1).    (2.24)

The conditional risks are given by

R_0(δ) = Q(τ/σ)   and   R_1(δ) = Q((µ − τ)/σ).

To be not only a Bayes rule but also an equalizer rule, δ must be of the form (2.24) and satisfy R_0(δ) = R_1(δ), i.e., τ satisfies the equality τ/σ = (µ − τ)/σ. Therefore τ = µ/2, which corresponds to the least-favorable prior π_0* = 1/2. The corresponding Bayes rule δ* is given by (2.24) with threshold τ = µ/2. Since δ* is both an equalizer rule and a Bayes rule for prior π*, it follows from Proposition 1.2 that (δ*, π*) is a saddlepoint of the risk function and therefore δ* is a minimax rule. The minimax risk is R_max(δ*) = Q(µ/(2σ)), which is the same as the Bayes risk under the least-favorable prior.

For the binary communication problem (Example 2.3), in the symmetric case when λ_0 = λ_1, we have λ_min = λ̄ = λ_max = λ. For λ ≤ 1/2, decision rule δ*(y) = y is an equalizer rule and is a Bayes rule under the uniform prior π*. It follows again from Proposition 1.2 that δ* is a minimax rule and π* a least-favorable prior. Likewise, for λ ≥ 1/2, the deterministic decision rule δ*(y) = 1 − y is minimax. Hence the minimax risk is min(λ, 1 − λ) for all λ ∈ [0, 1]. The minimax rule is unique for λ ≠ 1/2, and there are two deterministic minimax rules for λ = 1/2.

For the other problem with discrete observations (Example 2.2) it is seen from Table 1.2 that no deterministic equalizer rule exists, and that the minimax risk over D is 1/2, which is twice the Bayes risk under the least-favorable prior (π_0* = 1/2). However, consider the randomized detection rule δ̃* that picks δ_2 or δ_4 with equal probability (γ_2 = γ_4 = 1/2). In this case we have R_j(δ̃*) = (1/2)[R_j(δ_2) + R_j(δ_4)] = 1/4 for both j = 0 and j = 1, hence δ̃* is a (randomized) equalizer rule. Moreover, δ̃* is also a Bayes rule for π_0* = 1/2. It follows from Proposition 1.4 that (δ̃*, π_0*) is a saddlepoint of the risk function and therefore a minimax rule. The minimax risk is R_max(δ̃*) = 1/4, which is the same as the Bayes risk under the least-favorable prior. Hence randomized decision rules outperform deterministic rules in the minimax sense here.

In the next subsections we present a systematic procedure for finding minimax rules, again by relating the minimax problem to a Bayesian problem.
2.3.2 Bayes Risk Line and Minimum Risk Curve

Denote by δ_{π_0} the Bayes rule under prior π_0 on H_0. We begin with the following definitions.

DEFINITION 2.2 For any δ ∈ D, the Bayes risk line is

r(π_0; δ) ≜ π_0 R_0(δ) + (1 − π_0) R_1(δ),   π_0 ∈ [0, 1].    (2.25)

DEFINITION 2.3 The Bayes minimum-risk curve is

V(π_0) ≜ min_{δ∈D} r(π_0; δ) = r(π_0; δ_{π_0}),   π_0 ∈ [0, 1].    (2.26)

Bayes risk lines and the minimum risk curve are illustrated in Figure 2.5. The end points of the Bayes risk line are r(0; δ) = R_1(δ) and r(1; δ) = R_0(δ). The following lemma states some useful properties of V(π_0).

LEMMA 2.1
(i) V is a concave (continuous) function on [0, 1].
(ii) If D is finite, then V is piecewise linear.
(iii) V(0) = C_11 and V(1) = C_00.

Proof The minimum of a collection of linear functions is concave; therefore, the concavity of V follows from the fact that each of the risk lines r(π_0; δ) is linear in π_0. The functions in the collection are indexed by δ ∈ D, hence V is piecewise linear if D is finite. As for the end point properties,

V(0) = min_{δ∈D} R_1(δ) = min_{δ∈D} [C_01 P_1(Y_0) + C_11 P_1(Y_1)] = C_11,

where the minimizing rule is δ*(y) = 1, for all y ∈ Y. Similarly V(1) = C_00.
Figure 2.5 Bayes risk lines and a differentiable minimum risk curve V(π0). The deterministic equalizer rule δπ0* is the minimax rule
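For a finite rule set D, the curve V(π0) of Definition 2.3 is the lower envelope of the Bayes risk lines, and its peak locates the least-favorable prior. A small numerical sketch in Python (the (R0(δ), R1(δ)) pairs are made up for illustration, not taken from the text):

```python
# Bayes risk lines r(pi0; delta) = pi0*R0 + (1 - pi0)*R1 for a finite rule set.
# The (R0, R1) pairs below are illustrative only.
rules = [(0.0, 1.0), (0.1, 0.6), (0.45, 0.25), (1.0, 0.0)]

def V(pi0):
    """Minimum Bayes risk: lower envelope of the risk lines (Definition 2.3)."""
    return min(pi0 * R0 + (1 - pi0) * R1 for R0, R1 in rules)

grid = [i / 1000 for i in range(1001)]
vals = [V(p) for p in grid]
pi0_star = grid[vals.index(max(vals))]  # least-favorable prior (to grid accuracy)
print(pi0_star, max(vals))
```

The concavity asserted in Lemma 2.1 is visible here: the envelope rises to a single peak and falls again.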
2.3 Binary Minimax Hypothesis Testing
We can write V(π0) in terms of the likelihood ratio L(y) and threshold η as:
V(π0) = π0 [C10 P0{L(Y) ≥ η} + C00 P0{L(Y) < η}] + (1 − π0)[C11 P1{L(Y) ≥ η} + C01 P1{L(Y) < η}].
If L(y) has no point masses under P0 or P1, then V is differentiable in π0 (since η is differentiable in π0).²

2.3.3 Differentiable V(π0)
Let us first consider the case where V is differentiable for all π0 ∈ [0, 1]. Then V(π0) achieves its maximum value at either the end point π0 = 0 or π0 = 1 or within the interior π0 ∈ (0, 1). Usually C00 = C11, in which case V(0) = V(1) and the maximum is achieved at an interior point π0*, as illustrated in Figure 2.5. In all cases, the Bayes risk line r(π0; δ) associated with any decision rule δ must lie entirely above the curve V(π0). Otherwise there would exist some π0 such that r(π0; δ) < V(π0) = min_δ r(π0; δ), contradicting the definition of the minimum. If the Bayes risk line r(π0; δ) is tangent to the curve V(π0) at some point π0′, then r(π0′; δ) = V(π0′), hence δ is a Bayes rule for π0′. Moreover, any δ whose Bayes risk line lies strictly above the curve V(π0) can be improved by a Bayes rule. The statement of Theorem 2.1 below follows directly.

THEOREM 2.1 Assume V is differentiable on [0, 1].
(i) If V has an interior maximum, then the minimax rule δm is the deterministic equalizer rule
δm = δπ0*, (2.27)
where π0* = arg max_{π0∈[0,1]} V(π0) is obtained by solving dV(π0)/dπ0 = 0, i.e., δm is a Bayes rule for the worst-case prior.
(ii) If the maximum of V(π0) occurs at π0* = 0 or 1, then (2.27) holds but δm is generally not an equalizer rule.
(iii) The minimax risk is equal to V(π0*).
(iv) Randomization cannot improve the minimax rule.
2.3.4 Nondifferentiable V(π0)
If V is not differentiable for all π0 , the conclusions of Theorem 2.1 still hold if V is differentiable at its maximum. If V is not differentiable at its (interior) maximum π0∗ , we have the scenario depicted in Figure 2.6. It is important to note that, in this case, ﬁnding the minimax rule within the class of deterministic decision rules D may be quite challenging except in simple examples like the one considered in Example 1.2. In particular, the minimax rule within D is not necessarily a deterministic Bayes rule (LRT), as the following example shows.
² This condition typically holds for continuous observations when p0(y) and p1(y) are pdfs with the same support, but not necessarily even in this case.
Figure 2.6 Minimax rule when V is not differentiable at π0*
Example 2.4 Consider Y = {1, 2, 3} with the following conditional pmfs:
p0(1) = 0.04, p0(2) = 0.95, p0(3) = 0.01,
p1(1) = 0.001, p1(2) = 0.05, p1(3) = 0.949.
Assuming uniform costs we can compute the conditional risks for the eight possible deterministic decision rules as shown in the table below.
δ     y=1   y=2   y=3   R0(δ)   R1(δ)   Rmax(δ)
δ1     0     0     0    0       1       1
δ2     0     0     1    0.01    0.051   0.051
δ3     0     1     0    0.95    0.95    0.95
δ4     0     1     1    0.96    0.001   0.96
δ5     1     0     0    0.04    0.999   0.999
δ6     1     0     1    0.05    0.05    0.05
δ7     1     1     0    0.99    0.949   0.99
δ8     1     1     1    1       0       1
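The table can be reproduced by brute force; the Python sketch below enumerates all eight deterministic rules for Example 2.4 and recomputes their conditional risks under uniform costs.

```python
from itertools import product

# Conditional pmfs of Example 2.4, Y = {1, 2, 3}
p0 = {1: 0.04, 2: 0.95, 3: 0.01}
p1 = {1: 0.001, 2: 0.05, 3: 0.949}

rows = []
for decisions in product([0, 1], repeat=3):
    rule = dict(zip([1, 2, 3], decisions))
    # Uniform costs: R0 = P0(decide H1), R1 = P1(decide H0)
    R0 = sum(p0[y] for y in rule if rule[y] == 1)
    R1 = sum(p1[y] for y in rule if rule[y] == 0)
    rows.append((rule, R0, R1, max(R0, R1)))

best = min(rows, key=lambda row: row[3])
print(best)  # delta_6 (decide H1 on y in {1, 3}) is the deterministic minimax rule
```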
Clearly δ6 is minimax within the class of deterministic rules for this problem. The likelihood ratio takes the values L(1) = 1/40, L(2) = 1/19, and L(3) = 94.9. Since the likelihood ratio strictly increases from y = 1 to y = 3, the only deterministic Bayes rules (likelihood ratio tests) are given by {δ1, δ2, δ4, δ8}, and this set does not include the deterministic minimax rule δ6. We therefore extend our search for minimax rules to the class of randomized decision rules D̃. Denote by δ− and δ+ the (deterministic) Bayes rules corresponding to priors approaching π0* from below and above, respectively. Their risk lines are shown in
Figure 2.6 and have slopes R0 (δ − ) − R1 (δ −) > 0 and R0 (δ +) − R1 (δ+ ) < 0, respectively. Both rules have the same Bayes risk V (π0∗), but unequal conditional risks: R 0(δ − ) > R0 (δ + ) while R 1(δ − ) < R1 (δ +). Consider now the randomized decision rule
δ̃ = { δ− with probability γ, δ+ with probability (1 − γ). (2.28)
The conditional risks of δ̃ are given by
R0(δ̃) = γ R0(δ−) + (1 − γ) R0(δ+),
R1(δ̃) = γ R1(δ−) + (1 − γ) R1(δ+). (2.29)
In particular, δ̃ reduces to the deterministic rules δ− and δ+ if γ = 1 and γ = 0, respectively. For any γ ∈ [0, 1], the risk line of δ̃ is a tangent to the V curve. All these risk lines go through the point (π0*, V(π0*)). The risk line can be made horizontal (i.e., the conditional risks can be made equal) by choosing
γ = [R1(δ+) − R0(δ+)] / {[R1(δ+) − R0(δ+)] + [R0(δ−) − R1(δ−)]} ≜ γ*. (2.30)
The resulting rule δ̃* is an equalizer rule and is a Bayes rule for prior π0*, hence is a minimax rule.

2.3.5 Randomized LRTs
Since Bayes rules are LRTs, the randomized minimax rule of (2.28) is a randomized LRT, which we now derive explicitly. Both δ− and δ+ are deterministic LRTs with threshold η determined by π0*:
δ−(y) = { 1 : L(y) > η, 1 : L(y) = η, 0 : L(y) < η },   δ+(y) = { 1 : L(y) > η, 0 : L(y) = η, 0 : L(y) < η }. (2.31)
The tests δ− and δ+ differ only in the decision made in the event L(Y) = η, respectively selecting H1 and H0 in that case. For δ− and δ+ to be different, L(Y) must have a point mass at η, i.e., Pj{L(Y) = η} ≠ 0, for j = 0, 1. This also implies that V is not differentiable at π0* and that neither δ− nor δ+ is an equalizer rule. A randomized LRT takes the form
δ̃(y) = { 1 : L(y) > η, 1 w.p. γ : L(y) = η, 0 : L(y) < η }. (2.32)
The randomized minimax rule is then
δ̃m(y) = { 1 : L(y) > η*, 1 w.p. γ* : L(y) = η*, 0 : L(y) < η* }, (2.33)
where the threshold η* is related to π0* via (2.12), and the randomization parameter γ* is given by (2.30). The minimax risk is equal to V(π0*).
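Applying (2.30) to Example 2.4: one can check that the lower envelope of the Bayes risk lines peaks at π0* = 0.05, where the lines of δ4 and δ2 cross, so δ− = δ4 and δ+ = δ2. A Python sketch of the equalizing randomization (risk values copied from the table in Example 2.4):

```python
# delta_minus = delta_4 and delta_plus = delta_2 from Example 2.4
R0_minus, R1_minus = 0.96, 0.001
R0_plus, R1_plus = 0.01, 0.051

# Equation (2.30): weight gamma* that makes the mixed risk line horizontal
gamma = (R1_plus - R0_plus) / ((R1_plus - R0_plus) + (R0_minus - R1_minus))

R0_mix = gamma * R0_minus + (1 - gamma) * R0_plus
R1_mix = gamma * R1_minus + (1 - gamma) * R1_plus
assert abs(R0_mix - R1_mix) < 1e-12  # equalizer rule
print(gamma, R0_mix)  # roughly 0.041 and 0.04895, below the deterministic minimax risk 0.05
```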
2.3.6 Examples

Example 2.5 Signal Detection in Gaussian Noise (continued). In this example we study the minimax solution to the detection problem described in Example 2.1 using the systematic procedure we have developed. We assume uniform costs. The minimum Bayes risk curve is given by
V(π0) = π0 P0{Y ≥ τ} + (1 − π0) P1{Y < τ} = π0 Q(τ/σ) + (1 − π0) Φ((τ − µ)/σ)
with
τ = (σ²/µ) ln(π0/(1 − π0)) + µ/2.
Clearly V is a differentiable function in π0, and therefore the deterministic equalizer rule is minimax. We can solve for the equalizer rule without explicitly maximizing V. In particular, if we denote the LRT with threshold τ by δτ, then
R0(δτ) = Q(τ/σ),   R1(δτ) = Φ((τ − µ)/σ) = Q((µ − τ)/σ).
Setting R0(δτ) = R1(δτ) yields
τ* = µ/2,
from which we can conclude that η* = 1 and π0* = 0.5. Thus the minimax decision rule is given by
δm(y) = δ0.5(y) = 1{y ≥ µ/2}
and the minimax risk is given by
Rmax(δm) = V(0.5) = Q(µ/(2σ)).
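The equalizer property at τ* = µ/2 is easy to confirm numerically; a Python sketch with illustrative values µ = 2 and σ = 1 (the Q function is implemented via the complementary error function):

```python
import math

def Q(x):
    """Standard normal tail probability Q(x) = P{N(0,1) > x}."""
    return 0.5 * math.erfc(x / math.sqrt(2))

mu, sigma = 2.0, 1.0  # illustrative signal level and noise standard deviation
tau = mu / 2          # equalizer threshold tau* = mu/2

R0 = Q(tau / sigma)           # false-alarm probability under H0
R1 = Q((mu - tau) / sigma)    # miss probability under H1
assert abs(R0 - R1) < 1e-12   # conditional risks are equalized
print(R0)                     # minimax risk Q(mu / (2 sigma))
```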
δ̃NP(y) = δ̃η,γ(y) = { 1 : L(y) > η*, 1 w.p. γ* : L(y) = η*, 0 : L(y) < η*, (2.38)
where the threshold η* and randomization parameter γ* are chosen so that³
PF(δ̃NP) = P0{L(Y) > η*} + γ* P0{L(Y) = η*} = α. (2.39)

Figure 2.8 Differentiable minimum-risk curve V(π0); under uniform costs, R0(δ) = PF(δ) and R1(δ) = PM(δ)

³ If α > V′(0) then δ̃NP need not satisfy PF = α and need not be a Bayes rule. See Exercise 2.11.
Figure 2.9 The minimum-risk curve V(π0) is not everywhere differentiable

Figure 2.10 Complementary cdf of L(Y) under H0: (a) P0{L(Y) > η} is continuous; (b) P0{L(Y) > η} is discontinuous
2.4.2 NP Rule
The procedure for finding the parameters η* and γ* of the NP solution is illustrated in Figure 2.10, which displays the complementary cdf P0{L(Y) > η} as a function of η. This is a right-continuous function.
Case I: P0{L(Y) > η} is a continuous function of η. Referring to Figure 2.10(a), the equation P0{L(Y) > η*} = α admits a solution for η*. The randomization parameter γ can be chosen arbitrarily (including 0) since P0{L(Y) = η} = 0.
Case II: P0{L(Y) > η} is not a continuous function of η. If there exists η* such that P0{L(Y) > η*} = α, the solution is as in Case I. Otherwise, we use the following procedure illustrated in Figure 2.10(b):
1. Find the smallest η such that P0{L(Y) > η} ≤ α:
η* = min{η : P0{L(Y) > η} ≤ α}. (2.40)
2. Then
γ* = (α − P0{L(Y) > η*}) / P0{L(Y) = η*}. (2.41)
2.4 Neyman–Pearson Hypothesis Testing
The probability of detection (power) of δ̃NP at significance level α can be computed as
PD(δ̃NP) = P1{L(Y) > η*} + γ* P1{L(Y) = η*}, (2.42)
while PF(δ̃NP) is given by (2.39).
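The two-step procedure above can be sketched in Python for a discrete observation space. The pmfs are borrowed from Example 2.4 and α = 0.05 is an arbitrary illustrative level; Case II applies here, so randomization is needed:

```python
p0 = {1: 0.04, 2: 0.95, 3: 0.01}
p1 = {1: 0.001, 2: 0.05, 3: 0.949}
alpha = 0.05

L = {y: p1[y] / p0[y] for y in p0}  # likelihood ratio L(y)

def P0_exceeds(eta):
    """Complementary cdf P0{L(Y) > eta}."""
    return sum(p0[y] for y in p0 if L[y] > eta)

# Step 1 (2.40): the minimizing eta is one of the point-mass locations of L(Y)
eta_star = min(eta for eta in L.values() if P0_exceeds(eta) <= alpha)
# Step 2 (2.41): randomize on the boundary event {L(Y) = eta*}
P_eq = sum(p0[y] for y in p0 if L[y] == eta_star)
gamma_star = (alpha - P0_exceeds(eta_star)) / P_eq

PF = P0_exceeds(eta_star) + gamma_star * P_eq
PD = (sum(p1[y] for y in p1 if L[y] > eta_star)
      + gamma_star * sum(p1[y] for y in p1 if L[y] == eta_star))
print(eta_star, gamma_star, PF, PD)
```

By construction PF comes out exactly equal to α, as required by (2.39).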
2.4.3 Receiver Operating Characteristic
A plot of PD(δ̃) versus PF(δ̃) = α is called the receiver operating characteristic (ROC) of a decision rule δ̃ ∈ D̃. The "best" ROC is achieved by δ̃NP, as illustrated in Figure 2.11 for a differentiable ROC. Specific examples of ROC are given in the next section. Each point on the ROC is indexed by α or, equivalently, by (η*, γ*). A "good" ROC plot is close to a reverse L, as one can simultaneously achieve high PD and low PF. Conversely, in the case of indistinguishable hypotheses (p0 = p1, hence L(Y) = 1 with probability 1), the ROC of δ̃NP is the straight-line segment PD = PF. Note that the error probability for the minimax rule can be obtained from the ROC curve of δ̃NP as the intersection of the curve with the straight line PD = 1 − PF.

THEOREM 2.4 The ROC of δ̃NP has the following properties:
(i) The ROC is a concave, nondecreasing function.
(ii) The ROC lies above the 45° line, i.e., PD(δ̃NP) ≥ PF(δ̃NP).
(iii) The slope of the ROC at a point of differentiability is equal to the value of the threshold η required to achieve the PD and PF at that point, i.e., dPD/dPF = η.
(iv) The right derivative of the ROC at a point of nondifferentiability is equal to the value of the threshold η required to achieve the PD and PF at that point.
Proof See Exercise 2.24.

Figure 2.11 Receiver operating characteristic (ROC) of δ̃NP
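Properties (i) and (ii) of Theorem 2.4 can be spot-checked numerically. For the Gaussian detection problem the ROC has the standard closed form PD = Q(Q⁻¹(PF) − µ/σ); the Python sketch below assumes µ/σ = 1 as an illustrative value:

```python
from statistics import NormalDist

N = NormalDist()
d = 1.0  # assumed detectability mu/sigma

def roc(pf):
    """PD as a function of PF for the Gaussian problem."""
    return 1.0 - N.cdf(N.inv_cdf(1.0 - pf) - d)

pfs = [i / 100 for i in range(1, 100)]
pds = [roc(pf) for pf in pfs]

# (i) nondecreasing and concave, (ii) above the 45-degree line
assert all(b >= a for a, b in zip(pds, pds[1:]))
assert all(pds[i + 1] - 2 * pds[i] + pds[i - 1] <= 1e-9
           for i in range(1, len(pds) - 1))
assert all(pd >= pf for pd, pf in zip(pds, pfs))
```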
2.4.4 Examples
Example 2.8 Signal Detection in Gaussian Noise. In this example we study the NP solution to the detection problem described in Example 2.1. As noted before, LRTs are equivalent to threshold tests on Y. Moreover, L(Y) does not have point masses under either H0 or H1, so Case I of Section 2.4.2 applies, and no randomization is needed. Thus
δ̃NP(y) = δτ(y) = 1{y ≥ τ}
with
τ = (σ²/µ) ln η + µ/2.
P0{L(Y) > η} = { 1/2 : η ∈ [0, 1), 0 : η ∈ [1, ∞).
Thus, for α ∈ [0, 1/2), we have η* = 1 and γ* = α/(1/2) = 2α, which yields
δ̃NP(y) = { 1 : y = 3, 1 w.p. 2α : y = 2, 0 : y = 1 }
and PD(δ̃NP) = p1(3) + 2α p1(2) = 1/2 + α.
= α1−/2 = 2α − 1, which yields µ : y = 2, 3 ˜δNP( y ) = 1 1 w.p. 2 α − 1 : y = 1
For α ∈ [ 21 , 1], η = 0 and γ
1 2
and P D (δ˜NP ) = 1. The ROC is plotted in Figure 2.13 and has a point of nondifferentiability at α = 21 . Example 2.10 Z Channel. We study the NP solution for the Z channel problem of Example 2.7. The likelihood ratio is given by
¶λ : y = 0 L ( y) = = ∞ : y=1 and P0{Y = 0} = 1. Hence P0{ L (y ) > η} = 1{η < λ}. We seek the smallest η such that α ≥ P0{ L( y ) > η}. First assume α < 1. From Figure 2.14 it can be seen that η = λ. Next, to ﬁnd γ , we use (2.41) and obtain γ = α. Hence the Neyman–Pearson rule at signiﬁcance level α is: p1 ( y) p0 ( y)
Figure 2.13 Example 2.9: (a) P0{L(Y) > η}; (b) ROC
Figure 2.14 Z channel: (a) P0{L(Y) > η}; (b) ROC
δ̃NP(y) = { 1 : L(y) > λ (y = 1), 1 w.p. α : L(y) = λ (y = 0) } (2.43)
(note the event L(Y) < λ has probability zero). For α = 1, we obtain η* = 0 and γ* = 1, corresponding to the deterministic rule δ(y) ≡ 1 (always say H1). Note that this is also the limit of (2.43) as α ↑ 1. Therefore, for all α ∈ [0, 1], the power of the NP test is
PD(δ̃NP) = P1{L(Y) > λ} + α P1{L(Y) = λ} = P1{Y = 1} + α P1{Y = 0} = 1 − λ + αλ.
The ROC for the Z channel is given by PD = 1 − λ + PF λ and is plotted in Figure 2.14(b). The straight-line segment with slope λ joins the two points with coordinates (PF, PD) = (0, 1 − λ) and (1, 1), which are respectively achieved by the deterministic rules δ(y) = y and δ(y) ≡ 1.
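The Z-channel ROC can be checked directly from the NP construction; a minimal Python sketch with an assumed crossover value λ = 0.3:

```python
lam = 0.3  # assumed Z-channel parameter
p1 = {0: lam, 1: 1 - lam}  # pmf under H1; under H0, Y = 0 with probability 1

def np_power(alpha):
    # eta* = lam, gamma* = alpha: decide H1 on y = 1, randomize on y = 0
    return p1[1] + alpha * p1[0]

# ROC is the straight line PD = 1 - lam + alpha*lam from (0, 1-lam) to (1, 1)
for alpha in (0.0, 0.25, 0.5, 1.0):
    assert abs(np_power(alpha) - (1 - lam + alpha * lam)) < 1e-12
```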
2.4.5 Convex Optimization
The NP problem (2.37) can also be solved using convex optimization. Indeed (2.37) may be written as
δ̃NP = arg min_{δ̃} PM(δ̃) subject to PF(δ̃) ≤ α. (2.44)
The expressions (2.34) and (2.36) for PF(δ̃) and PD(δ̃) are linear in δ̃, hence the constrained optimization problem (2.37) is linear and is equivalent to the Lagrangian optimization problem
δ̃λ = arg min_{δ̃} [PM(δ̃) + λ PF(δ̃)] = arg min_{δ̃} [ (1/(1 + λ)) PM(δ̃) + (λ/(1 + λ)) PF(δ̃) ],
where the Lagrange multiplier λ ≥ 0 is chosen so as to satisfy the condition PF(δ̃λ) = α. Equivalently, the NP rule is given by
δ̃λ = arg min_{δ̃} r(π0; δ̃),
where we have reparameterized the Lagrange multiplier:
λ = π0/(1 − π0) ≥ 0 ⇔ π0 = λ/(1 + λ) ∈ [0, 1].
If the equality PF(δ̃λ) = α cannot be achieved (because α > V′(0)) then the PF constraint is inactive, λ = 0, and δ̃NP is the Bayes rule corresponding to π0 = 0.
Exercises
2.1 Best Decision Rules. For the binary hypothesis testing problem, with C00 < C10 and C11 < C01, show there is no "best" rule based on conditional risks, except in the trivial case where p0(y) and p1(y) have disjoint supports.

2.2 Bayesian Test with Nonuniform Costs. Consider the binary detection problem where p0 and p1 are Rayleigh densities with scale parameters 1 and σ > 1, respectively:
p0(y) = y e^{−y²/2} 1{y ≥ 0} and p1(y) = (y/σ²) e^{−y²/(2σ²)} 1{y ≥ 0}.
(a) Find the Bayes rule for equal priors and a cost structure of the form C00 = C11 = 0, C10 = 1, and C01 = 4.
(b) Find the Bayes risk for the Bayes rule of part (a).
2.3 Minimum Bayes Risk Curve and Minimax Rule. For Exercise 1.1, ﬁnd the minimum Bayes risk function V (π0), and then ﬁnd a minimax rule in the set of randomized decision rules using V (π0). 2.4 Minimum Bayes Risk Curve. Suppose
V(π0) = { 2π0 if π0 ∈ [0, 1/4), 1/2 if π0 ∈ [1/4, 3/4), 2(1 − π0) if π0 ∈ [3/4, 1].
Is V(π0) a valid minimum Bayes risk function for uniform costs? Explain your reasoning.

2.5 Nonuniform Costs. Consider the binary hypothesis testing problem with cost function
Cij = { 0 if i = j, 1 if j = 0, i = 1, 10 if j = 1, i = 0.
The observation Y takes values in the set Y = {a, b, c} and the conditional pmfs of Y are:
p0(a) = p0(b) = 0.5 and p1(a) = p1(b) = 0.25, p1(c) = 0.5.
(a) Is there a best decision rule based on conditional risks? (b) Find Bayes (for equal priors) and minimax rules within the set of deterministic decision rules. (c) Now consider the set of randomized decision rules. Find a Bayes rule (for equal priors). Also construct a randomized rule whose maximum risk is smaller than that of the minimax rule of part (b). 2.6 Bayes and Minimax Decision Rules. Consider the binary detection problem with
p0(y) = ((y + 3)/18) 1{y ∈ [−3, 3]}, and p1(y) = (1/4) 1{y ∈ [0, 4]}.
(a) Find a Bayes rule for uniform costs and equal priors and the corresponding minimum Bayes risk.
(b) Find a minimax rule for uniform costs, and the corresponding minimax risk.

2.7 Binary Detection. Consider the binary detection problem with
p0(y) = (1/2) 1{y ∈ [−1, 1]}, and p1(y) = (3y²/2) 1{y ∈ [−1, 1]}.
(a) Find a Bayes rule for equal priors and uniform costs, and the corresponding Bayes risk. (b) Find a minimax rule for uniform costs, and the corresponding minimax risk. (You may leave your answer in terms of a solution to a polynomial equation.) (c) Find a NP rule for level α ∈ (0, 1), and plot its ROC. 2.8 Multiple Predictions. Consider a binary hypothesis problem where the decision maker is allowed to make n predictions of the hypothesis, and the cost is the number of prediction errors (hence is an integer between 0 and n). Assume Y = { a, b, c} and the conditional pmfs of Y satisfy p0 (a ) = p0(b) = p1(b) = p1 (c) = 0.5. Derive the deterministic minimax risk and the minimax risk. Explain why these risks are equal for even n. 2.9 Penalty Kicks. Assume the following simpliﬁed model for penalty kicks in football. The penalty shooter aims either left or right. The goalie dives left or right at the very moment the ball is kicked. If the goalie dives in the same direction as the ball, the probability of scoring is 0.55. Otherwise the probability of scoring is 0.95. We denote by x the direction selected by the penalty shooter (left or right) and by a the direction selected by the goalie (also left or right). We deﬁne the cost C (a, x ) = 0.55 if a = x , and C (a, x ) = 0.95 otherwise. In this problem setup, no observations are available. A deterministic decision rule for the goalkeeper reduces to an action, hence there exist only two possible deterministic rules: δ 1 (dive left) and δ 2 (dive right).
(a) Determine the minimax decision rule (and the minimax risk) for the goalie. (b) Now assume the shooter is more accurate and powerful when aiming at the left side. If the goalie dives in the same direction (left), the probability of scoring is 0.6. If the shooter aims at the right side and the goalie dives in the same direction, the probability of scoring is 0.5. In all other cases, the probability of scoring is 0.95. Determine the minimax decision rule for the goalie, the optimal strategy for the shooter, and the minimax risk. 2.10 Nonuniform Costs. The minimum Bayes risk for a binary hypothesis testing problem with costs C00 = 1, C 11 = 2, C10 = 2, C01 = 4 is given by
V(π0) = 2 − π0²,
where π0 is the prior probability of hypothesis H0.
(a) Find the minimax risk and the least favorable prior.
(b) What is the conditional probability of error given H0 for the minimax rule in (a)?
(c) Find the α-level NP rule, and compute its ROC.

2.11 Choice of Significance Level for NP Test.
(a) Show that the test δ(y) = 1{y ∈ supp(P1)} is a Bayes test for π0 = 0.
(b) Derive the risk line for δ.
(c) Show that the slope of this risk line at the origin is given by V′(0) = P0{supp(P1)}.
(d) Assume there exists a subset S of Y such that P0(S) = 0 and P1(S) > 0. Explain why there cannot be any advantage in choosing a significance level α > V′(0) for the NP test.
(e) If V′(0) < α < 1, find a deterministic Bayes rule that is an NP rule, and show there exist NP rules that are not Bayes rules.
2.12 Laplacian Distributions. Consider the binary detection problem with
p0(y) = (1/2) e^{−|y|}, and p1(y) = e^{−2|y|}, y ∈ R.
(a) Find the Bayes rule and Bayes risk for equal priors and a cost structure of the form C00 = C11 = 0, C10 = 1, and C01 = 2.
(b) Find an NP rule at significance level α = 1/4.
(c) Find the probability of detection for the rule of part (b).

2.13 Bayes and Neyman–Pearson Decision Rules. Consider the binary hypothesis test with
p0(y) = e^{−y} 1{y ≥ 0}, and p1(y) = (1/2) 1{0 ≤ y ≤ 1} + (1/2) e^{−(y−1)} 1{y > 1}.
(a) Assume uniform costs, and derive the Bayes rules and risk curve V(π0), 0 ≤ π0 ≤ 1.
(b) Give the NP rule at significance level 0.01.
2.14 Discrete Observations. Consider a binary hypothesis test between p0 and p1, where the observation space Y = {a, b, c, d} and the probability vectors p0 and p1 are equal to (1/2, 1/4, 1/4, 0) and (0, 1/4, 1/4, 1/2), respectively.
(a) Assume the cost matrix C00 = C11 = 0, C10 = 1, C01 = 2. Give the Bayes minimum-risk curve V(π0), 0 ≤ π0 ≤ 1, and identify the least-favorable prior.
(b) Prove or disprove the following claim: "The decision rule δ(y) = 1{y ∈ {c, d}} is an NP rule."

2.15 Receiver Operating Characteristic. Derive the ROC curve for the hypothesis test
H0 : Y ∼ p0(y) = Uniform[0, 1],
H1 : Y ∼ p1(y) = e^{−y}, y ≥ 0.

2.16 ROCs and Bayes Risk. Consider a binary hypothesis testing problem in which the ROC for the NP rule is given by PD = PF^{1/4}. For the same hypotheses, consider a Bayesian problem with cost structure C10 = 3, C01 = 2, C00 = 2, C11 = 0.
(a) Show that the decision rule δ(y) ≡ 1 is Bayes for π0 ≤ 1/3.
(b) Find the minimum Bayes risk for π0 = 1/2.
2.17 Radar. A radar sends out n short pulses. To decide whether a target is present at a given hypothesized distance, it correlates its received signal against delayed versions of these pulses. The correlator output corresponding to the ith pulse is denoted by Yi = Bi θ + Ni, where θ > 0 under hypothesis H1 (target present), and θ = 0 under H0 (target absent), Bi takes values ±1 with equal probability (a simple model of a random phase shift in the ith pulse), and {Ni}_{i=1}^n are i.i.d. N(0, σ²). Show that the NP rule can be written as
δ(y) = 1 if Σ_{i=1}^n g(θ|yi|/σ²) > τ, and 0 else,
where g (x ) = ln cosh(x ). 2.18 Faulty Sensors. A number n of sensors are deployed in a ﬁeld in order to detect an acoustic event. Each sensor transmits a bit whose value is 1 if the event has occurred and the sensor is operational, and is 0 otherwise. However, each sensor is defective with 50 percent probability, independently of the other sensors. The observation Y ∈ {0, 1, . . . , n} is the sum of bit values transmitted by the active sensors. Derive the ROC for this problem. 2.19 Photon Detection. The emission of photons follows a Poisson law. Consider the detection problem
H0 : Y ∼ Poi(1),
H1 : Y ∼ Poi(λ),
where Y ∈ {0, 1, 2, . . .} is the photon count, and Poi(λ) is the Poisson law with parameter λ. Assume λ > 1. The null hypothesis models background noise, and the alternative models some phenomenon of interest. This model arises in various modalities, including optical communications and medical imaging.
(a) Show that the LRT can be reduced to the comparison of Y with a threshold.
(b) Give an expression for the maximum achievable probability of correct detection at significance level α = 0.05. Hint: A randomized test is needed.
(c) Determine the value of λ needed so that the probability of correct detection (at significance level 0.05) is 0.95.

2.20 Geometric Distributions. Consider the detection problem for which Y = {0, 1, 2, . . .} and the pmfs of the observations under the two hypotheses are:
p0(y) = (1 − β0)β0^y and p1(y) = (1 − β1)β1^y, y = 0, 1, 2, . . . .
Assume that 0 < β0 < β1 < 1.
(a) Find the Bayes rule for uniform costs and equal priors.
(b) Find the NP rule with false-alarm probability α ∈ (0, 1). Also find the corresponding probability of detection as a function of α.

2.21 Discrete Observations. Consider the binary detection problem with Y = {0, 1, 2, 3} and
p0(y) = 1/4 and p1(y) = y/6, y ∈ Y.
(a) Find a Bayes rule for equal priors and uniform costs.
(b) Find the Bayes risk for the rule in part (a).
(c) Assuming uniform costs, plot the Bayes risk lines for the following two tests:
δ(1)(y) = 1 if y = 2, 3 and 0 if y = 0, 1;   δ(2)(y) = 1 if y = 3 and 0 if y = 0, 1, 2.
(d) Find a minimax rule for uniform costs.
(e) Find an NP rule for level α = 1/3.
(f) Find PD for the rule in part (e).

2.22 Exotic Optimality Criterion. Consider a binary detection problem, where the goal is to minimize the following risk measure
ρ(δ̃) = [PF(δ̃)]² + PM(δ̃).
(a) Show that the solution is a (possibly randomized) LRT.
(b) Find the solution for the observation model
p0(y) = 1, p1(y) = 2y, y ∈ [0, 1].
2.23 Random Priors. Consider the variation of the Bayesian detection problem where the prior probability π0 (and hence π1) is a random variable that is independent of the observation Y and the hypothesis. Further assume that π0 has mean π̄0 ∈ (0, 1). We consider the problem of minimizing the average risk
r(δ̃) = E[π0 R0(δ̃) + (1 − π0) R1(δ̃)],
where the expectation is over the distribution of π0. Find a decision rule that minimizes r. Comment on your solution.
2.24 Properties of ROC. Prove Theorem 2.4. Hints: One way to solve Part (ii) of this problem is to consider the cases η ≤ 1 and η > 1 separately and use the following change of measure argument: For any event A,
P1(A) = ∫_A p1(y) dµ(y) = ∫_A L(y) p0(y) dµ(y).
To solve Part (iii), start with
dPD/dPF = (dPD/dη) / (dPF/dη),
and then express PD and PF in terms of the pdf of L(Y) under H0 using a change of measure argument similar to that in Part (ii).
2.25 AUC. Since "best" tests exist only in trivial cases, the performance of different tests is often compared using the area under the curve (AUC), where the curve is the ROC. Derive the ROC and compute the AUC for the detection problems in Exercises 2.12 and 2.13.

2.26 Worst Tests.
(a) Find a test δ̃ that minimizes PD(δ̃) subject to the constraint PF(δ̃) ≤ α. Hint: Relate this problem to the NP problem at significance level 1 − α.
(b) Derive the set of all achievable probability pairs (PF(δ̃), PD(δ̃)). Hint: This set can be expressed in terms of the ROC for the NP family of tests.

2.27 Improving Suboptimal Tests. Consider the pdfs
p0(y) = (1/2) e^{−|y|} and p1(y) = e^{−2|y|}, y ∈ R.
Let δ 0 (y ) = 0 and δ 1 (y ) = 1 for all y , and consider a randomized test δ˜3 that returns 1 with probability 0.5 if the observation y < 0, and returns 1 {0 ≤ y ≤ τ } if y ≥ 0; the threshold τ ≥ 0. All three tests are suboptimal. (a) Derive and sketch the ROC for δ˜3 .
(b) Show that suitable randomization between δ0, δ1, and δ̃3 improves the ROC for PF ∈ [0, α0] ∪ [α1, 1], where α0 = √3/4 and α1 = 1 − α0. Show that the resulting ROC is a straight-line segment for PF ∈ [0, α0], coincides with the ROC of δ̃3 for PF ∈ [α0, α1], and is another straight-line segment for PF ∈ [α1, 1].

2.28 Neyman–Pearson Hypothesis Testing with Laplacian Observations. Consider the binary detection problem with
p0(y) = (1/2) e^{−|y|}, and p1(y) = (1/2) e^{−|y−2|}, y ∈ R.
This is a case where the likelihood ratio has point masses even though p0 and p1 have no point masses.
(a) Show that the Neyman–Pearson solution, which is a randomized LRT, is equivalent to a deterministic threshold test on the observation y.
(b) Find the ROC for the test in part (a).
3 Multiple Hypothesis Testing

The methods developed for binary hypothesis testing are extended to more than two hypotheses in this chapter. We first consider the problem of distinguishing between m > 2 hypotheses using a single test, which is called the m-ary detection problem. We study the Bayesian, minimax, and generalized Neyman–Pearson versions of the m-ary detection problem, and provide a geometric interpretation of the solutions. We then consider the problem of designing m binary tests and obtaining performance guarantees for the collection of tests (rather than for each individual test).
3.1 General Framework
In the basic multiple-hypothesis testing problem, both the state space and the action space have m elements each, say X = A = {0, 1, . . . , m − 1}. The problem is called m-ary detection, or classification. It is assumed that under hypothesis Hj, the observation is drawn from a distribution Pj with associated density pj(y), y ∈ Y. The problem is to test between the m hypotheses
Hj : Y ∼ pj, j = 0, 1, . . . , m − 1, (3.1)
given a cost matrix {C i j }0≤i , j ≤m −1 . The problem (3.1) arises in many types of applications ranging from handwritten digit recognition (m = 10, Y = image) to optical character recognition, speech recognition, biological and medical image analysis, sonar, radar, decoding, system identiﬁcation, etc. To any deterministic decision rule δ : Y → {0, 1, . . . , m − 1}, we may associate a partition Y i , i = 0, 1, . . . , m − 1 of the observation space Y such that
δ(y) = i ⇔ y ∈ Yi,
as illustrated in Figure 3.1. Hence we may write δ in the compact form
δ(y) = Σ_{i=0}^{m−1} i 1{y ∈ Yi} (3.2)
and refer to {Yi}_{i=0}^{m−1} as the decision regions. The conditional risks for δ are given by
Rj(δ) = Σ_{i=0}^{m−1} Cij Pj(Yi), j = 0, 1, . . . , m − 1, (3.3)
Figure 3.1 Ternary hypothesis testing: decision regions Y0, Y1, Y2 associated with the decision rule δ
where Pj(Yi) = P{Say Hi | Hj is true} is the probability of confusing hypothesis i with the actual hypothesis j when i ≠ j. In the case of uniform costs, Cij = 1{i ≠ j}, the conditional risks are conditional error probabilities:
Rj(δ) = 1 − Pj(Yj), j = 0, 1, . . . , m − 1, (3.4)
where Pj(Yj) is the probability that observation Y is correctly classified when Hj is true.
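Equations (3.3) and (3.4) amount to reading the conditional risks off the confusion matrix of the rule. A Python sketch with made-up ternary pmfs over Y = {0, 1, 2, 3}:

```python
# Hypothetical conditional pmfs p_j(y) for a ternary problem (illustrative only)
P = [
    [0.7, 0.2, 0.1, 0.0],  # p_0
    [0.1, 0.6, 0.2, 0.1],  # p_1
    [0.0, 0.1, 0.2, 0.7],  # p_2
]
delta = [0, 1, 1, 2]  # deterministic rule: decision-region label of each y

m = len(P)
# Confusion matrix: conf[j][i] = P_j(Y_i) = P{decide H_i | H_j true}
conf = [[sum(P[j][y] for y in range(len(delta)) if delta[y] == i)
         for i in range(m)] for j in range(m)]
# Uniform costs: conditional error probabilities (3.4)
R = [1.0 - conf[j][j] for j in range(m)]
print(R)
```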
3.2 Bayesian Hypothesis Testing
In the Bayesian setting, each hypothesis H j is assumed to have prior probability π j . The posterior probability distribution
π(j|y) = πj pj(y)/p(y), j = 0, 1, . . . , m − 1
plays a central role in the Bayesian problem. As seen in (2.7), the posterior cost of decision i is
C(i|y) = Σ_{j=0}^{m−1} Cij π(j|y) = Σ_{j=0}^{m−1} Cij πj pj(y)/p(y) (3.5)
and the Bayes decision rule is
δB(y) = arg min_{0≤i≤m−1} C(i|y) = arg min_{0≤i≤m−1} Σ_{j=0}^{m−1} Cij πj pj(y)/p(y). (3.6)
The Bayes risk for any decision rule δ is given by
r(δ) = Σ_{j=0}^{m−1} πj Rj(δ) = Σ_{i,j=0}^{m−1} Cij πj Pj(Yi). (3.7)
Uniform costs. For uniform costs, (3.4) and (3.7) yield
r(δ) = 1 − Σ_{j=0}^{m−1} πj Pj(Yj). (3.8)
The posterior cost of (3.5) is given by
C(i|y) = 1 − π(i|y),
hence
δB(y) = arg min_{0≤i≤m−1} C(i|y) = arg max_{0≤i≤m−1} π(i|y), (3.9)
which is the MAP decision rule. Its risk is the decision error probability,
Pe(δB) = r(δB) = E[min_i C(i|Y)] = 1 − E[max_i π(i|Y)]. (3.10)
3.2.1 Optimal Decision Regions
Observe that p(y) in the denominator of the right-hand side of (3.6) does not depend on i and therefore has no effect on the minimization. It will sometimes be convenient to use p0(y) as a normalization factor instead of p(y) and write
δB(y) = arg min_{0≤i≤m−1} Σ_{j=0}^{m−1} Cij πj pj(y)/p0(y) = arg min_{0≤i≤m−1} Σ_{j=0}^{m−1} Cij πj Lj(y), (3.11)
where we have introduced the likelihood ratios
Lj(y) ≜ pj(y)/p0(y), j = 0, 1, . . . , m − 1,
and L0(y) ≡ 1. In the sequel, we consider the case m = 3 (ternary hypothesis testing) for which the concepts can be easily visualized. Then
δB(y) = arg min_{i=0,1,2} [Ci0 π0 + Ci1 π1 L1(y) + Ci2 π2 L2(y)] = arg min_{i=0,1,2} fi(y), (3.12)
where we have introduced the discriminant functions
fi(y) = Ci0 π0 + Ci1 π1 L1(y) + Ci2 π2 L2(y), i = 0, 1, 2,
which are linear in the likelihood ratios L1(y) and L2(y). The two-dimensional (2D) random vector T ≜ [L1(Y), L2(Y)] contains all the information needed to make an optimal decision. We may express the minimization problem of (3.12) in the 2D decision space with coordinates L1(y) and L2(y), as shown in Figure 3.2. The decision regions for H0, H1, and H2 are separated by lines. For instance,
Figure 3.2 Typical decision regions for ternary hypothesis testing

Figure 3.3 Decision regions for ternary hypothesis testing with uniform costs
the decision boundary for H_0 versus H_1 satisfies the equality f_0(y) = f_1(y), which is the linear equation

$$(C_{00} - C_{10})\pi_0 + (C_{01} - C_{11})\pi_1 L_1(y) + (C_{02} - C_{12})\pi_2 L_2(y) = 0. \tag{3.13}$$

In the decision space of Figure 3.2, this is a line with slope equal to $\frac{C_{01} - C_{11}}{C_{12} - C_{02}} \frac{\pi_1}{\pi_2}$ and x-intercept equal to

$$\eta_{01} = \frac{C_{10} - C_{00}}{C_{01} - C_{11}}\, \frac{\pi_0}{\pi_1}.$$

In the special case of uniform costs, $C_{ij} = \mathbf{1}\{i \neq j\}$, we have

$$f_i(y) = \frac{p(y)}{p_0(y)} - \pi_i L_i(y), \quad i = 0, 1, 2,$$

hence

$$\delta_B(y) = \arg\max_{i=0,1,2} \{\pi_0,\ \pi_1 L_1(y),\ \pi_2 L_2(y)\}, \tag{3.14}$$
which is the MAP classification rule. The decision space is shown in Figure 3.3. For m > 3, the same approach can be used, with an (m − 1)-dimensional vector of likelihood ratios T = [L_1(Y), ..., L_{m−1}(Y)], m discriminant functions f_i(y), i = 0, 1, ..., m − 1, and m(m − 1)/2 hyperplanes separating each pair of hypotheses (comparing pairs of discriminant functions).
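A minimal numerical sketch of the rule (3.12) (the costs, priors, and likelihood-ratio values below are illustrative, not from the text): it evaluates the discriminant functions f_i and returns the minimizing index; with uniform costs this reduces to the MAP rule (3.14).

```python
# Bayes rule (3.12): choose the hypothesis with the smallest discriminant
# f_i = sum_j C[i][j] * pi[j] * L[j], where L[0] = 1 by convention.

def bayes_rule(C, pi, L):
    """Return arg min_i f_i(y) for given costs C, priors pi, and ratios L."""
    m = len(pi)
    f = [sum(C[i][j] * pi[j] * L[j] for j in range(m)) for i in range(m)]
    return min(range(m), key=lambda i: f[i])

# Uniform costs C_ij = 1{i != j}: the rule is the MAP rule (3.14).
C = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
pi = [0.5, 0.25, 0.25]   # illustrative priors

assert bayes_rule(C, pi, [1.0, 3.0, 1.0]) == 1   # pi_1 * L_1 = 0.75 dominates
assert bayes_rule(C, pi, [1.0, 1.0, 1.0]) == 0   # pi_0 = 0.5 dominates
```

Because the f_i are linear in (L_1, L_2), the induced decision regions in the 2D decision space are exactly the polygonal regions of Figure 3.2.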
3.2.2 Gaussian Ternary Hypothesis Testing
Assume uniform costs and consider the test

$$H_j : Y \sim \mathcal{N}(\mu_j, K_j), \quad j = 0, 1, 2,$$

where $Y \in \mathbb{R}^d$ follows a d-dimensional Gaussian distribution with mean $\mu_j = E_j[Y]$ and covariance matrix $K_j = \mathrm{Cov}_j[Y] \triangleq E_j[(Y - \mu_j)(Y - \mu_j)^\top]$ under $H_j$. Assume each $\pi_j > 0$ and each $K_j$ is full rank. The likelihood ratios of (3.11) are given by

$$L_i(y) = \frac{p_i(y)}{p_0(y)} = \frac{(2\pi)^{-d/2} |K_i|^{-1/2} \exp\{-\tfrac12 (y - \mu_i)^\top K_i^{-1} (y - \mu_i)\}}{(2\pi)^{-d/2} |K_0|^{-1/2} \exp\{-\tfrac12 (y - \mu_0)^\top K_0^{-1} (y - \mu_0)\}}, \quad i = 0, 1, 2.$$
It is more convenient to work with the log-likelihood ratios, which are quadratic functions of y for each i = 0, 1, 2:

$$\ln L_i(y) = -\frac12 \ln|K_i| + \frac12 \ln|K_0| - \frac12 (y - \mu_i)^\top K_i^{-1}(y - \mu_i) + \frac12 (y - \mu_0)^\top K_0^{-1}(y - \mu_0).$$

Since each $\pi_i > 0$ and the logarithm function is monotonically increasing, the Bayes rule of (3.14) can be expressed in terms of the test statistic $T = (\ln L_1(Y), \ln L_2(Y))$:
$$\delta_B(y) = \arg\max_{i=0,1,2} \{\ln \pi_0,\ \ln \pi_1 + \ln L_1(y),\ \ln \pi_2 + \ln L_2(y)\}.$$
Alternatively, for d = 2, we may view the decision boundaries directly in the observation space R², as shown in Figure 3.4. The decision boundaries are quadratic functions of y. If all three covariance matrices are identical (K_0 = K_1 = K_2), the decision boundaries are linear in y.
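The log-domain rule above is easy to implement for d = 2. The following sketch uses hypothetical means, covariances, and priors (not the text's); with equal priors and identical identity covariances the boundaries are linear and the rule reduces to nearest-mean classification, consistent with the remark above.

```python
import math

def quad_form_inv(K, d):
    """Compute d' K^{-1} d for a 2x2 matrix K and a 2-vector d."""
    (a, b), (c, e) = K
    det = a * e - b * c
    return (e * d[0] ** 2 - (b + c) * d[0] * d[1] + a * d[1] ** 2) / det

def log_lr(y, mu_i, K_i, mu_0, K_0):
    """ln L_i(y): quadratic log-likelihood ratio between N(mu_i,K_i) and N(mu_0,K_0)."""
    det = lambda K: K[0][0] * K[1][1] - K[0][1] * K[1][0]
    di = [y[0] - mu_i[0], y[1] - mu_i[1]]
    d0 = [y[0] - mu_0[0], y[1] - mu_0[1]]
    return (-0.5 * math.log(det(K_i)) + 0.5 * math.log(det(K_0))
            - 0.5 * quad_form_inv(K_i, di) + 0.5 * quad_form_inv(K_0, d0))

def map_rule(y, mus, Ks, pis):
    """Bayes rule (3.14) in log form: arg max_i { ln pi_i + ln L_i(y) }."""
    scores = [math.log(pis[i]) + log_lr(y, mus[i], Ks[i], mus[0], Ks[0])
              for i in range(3)]
    return max(range(3), key=lambda i: scores[i])

# Illustrative parameters: equal priors, identity covariances -> nearest mean.
mus = [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]]
Ks = [[[1.0, 0.0], [0.0, 1.0]]] * 3
pis = [1 / 3, 1 / 3, 1 / 3]
assert map_rule([2.9, 0.1], mus, Ks, pis) == 1
assert map_rule([0.1, 0.1], mus, Ks, pis) == 0
```

With unequal covariance matrices the same code traces out the quadratic boundaries of Figure 3.4.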
Figure 3.4  Quadratic decision boundaries for Gaussian hypothesis testing (Y ∈ R²)

3.3 Minimax Hypothesis Testing

We turn our attention to minimax hypothesis testing. As in the binary case, it will generally be useful to consider randomized decision rules. Recall from Definition 1.4 that a randomized decision rule δ̃ picks a rule at random from a set of deterministic decision rules {δ_l}_{l=1}^L with respective probabilities {γ_l}_{l=1}^L. The worst-case risk for a randomized decision rule δ̃ is

$$R_{\max}(\tilde\delta) = \max_{0 \le j \le m-1} R_j(\tilde\delta). \tag{3.15}$$
The sets of deterministic and randomized decision rules are denoted by D and D̃, respectively. Similarly to the notion of risk lines and risk curves in Section 2.3.2, we can gain intuition by considering a geometric representation of the problem.

DEFINITION 3.1  The risk vector of a decision rule δ̃ ∈ D̃ is R(δ̃) = [R_0(δ̃), R_1(δ̃), ..., R_{m−1}(δ̃)] ∈ R^m.

DEFINITION 3.2  The risk set for D is R(D) ≜ {R(δ), δ ∈ D} ⊂ R^m.

DEFINITION 3.3  The risk set for D̃ is R(D̃) ≜ {R(δ̃), δ̃ ∈ D̃} ⊂ R^m.

By Proposition 1.3, we have $R(\tilde\delta) = \sum_{l=1}^{L} \gamma_l R(\delta_l)$ for a randomized rule δ̃ that picks deterministic rule δ_l with probability γ_l. Hence the following proposition:

PROPOSITION 3.1  The risk set R(D̃) is the convex hull of R(D).

The risk set R(D) for a finite D is a collection of at most |D| points, and its convex hull R(D̃) is a convex polytope in R^m, as illustrated in Figure 3.5 for the binary communication Example 2.3 (m = 2). The polytope has at most |D| vertices. If D is a continuum, the risk set R(D̃) is convex but does not have vertices.¹
In the Bayesian setting, the probabilities assigned to the m hypotheses can be represented by a probability vector π on the probability simplex in R^m. The risk of a rule δ̃ under prior π is then given by

$$r(\pi, \tilde\delta) = \sum_{j} \pi_j R_j(\tilde\delta) = \pi^\top R(\tilde\delta). \tag{3.16}$$
Figure 3.5  Risk set R(D̃) for the binary hypothesis testing Example 2.3. The four possible deterministic rules are the vertices of the (shaded) risk set R(D̃). Geometric view of (a) the probability vector π and risk vector R(δ_π) for the Bayes rule δ_π under prior π; and (b) the probability vector π* and risk vector R(δ_m) for the minimax rule δ_m and least-favorable prior π*
¹ Recall that a set R is said to be convex if for all u, u′ ∈ R, the element λu + (1 − λ)u′ ∈ R for all λ ∈ [0, 1].
Figure 3.6  Two-dimensional section of the risk set R(D̃): geometric view of the Bayes rule δ_B
As illustrated in Figure 3.6, all the risk vectors that lie in a plane orthogonal to the probability vector π have the same Bayes risk, which is proportional to the distance from the plane to the origin.² The risk vector for the Bayes rule δ_B is therefore a point of tangency of such a plane (from below) to the risk set R(D̃). The minimum risk is always achieved at an extremal point of R(D̃), hence no randomization is needed.

In the minimax setting, all the risk vectors that lie in the closed set

$$O(r) \triangleq \{R \in \mathbb{R}^m : R_j \le r\ \forall j,\ \text{with equality for at least one } j\}$$

(the boundary of an m-dimensional orthant) have the same worst-case risk (3.15). The vector (r, r, ..., r) is the "corner point" of the orthant, as illustrated in Figure 3.7. The minimax solution δ_m is therefore a point of tangency of the risk set R(D̃) with O(r) that has the lowest possible value of r, and that value is the minimax risk. In some problems, the minimax solution is an equalizer rule, represented by a "corner point" of the set O(r). This situation occurs in Figures 3.5(b) and 3.7(a) but not in Figure 3.7(b). The least-favorable prior π* can also be found using geometric reasoning. If V(π) ≜ r(π; δ_B) is the minimum Bayes risk under prior π, then the point (V(π), ..., V(π)) with equal coordinates lies in the tangent plane to R(D̃) at R(δ_B). Hence the least-favorable prior is the one that maximizes V(π). Moreover, if the point (V(π), ..., V(π)) ∈ R(D̃), the corresponding rule is both a Bayes rule and an equalizer rule.
Example 3.1  Let X = Y = {0, 1, 2} and consider the following three distributions:

$$p_0(0) = \tfrac12, \quad p_0(1) = \tfrac12, \quad p_0(2) = 0,$$
$$p_1(0) = \tfrac13, \quad p_1(1) = \tfrac13, \quad p_1(2) = \tfrac13,$$
$$p_2(0) = 0, \quad p_2(1) = \tfrac12, \quad p_2(2) = \tfrac12.$$

Assume uniform costs. Find the minimax decision rule and the least-favorable prior.

² The proportionality constant is 1/‖π‖.
Table 3.1  All admissible decision rules for Example 3.1, with the corresponding decision regions, conditional error probabilities, worst-case risk, and Bayes risk under the least-favorable prior π* = [2/7, 3/7, 2/7]^⊤

δ      Y0       Y1         Y2        R0(δ)   R1(δ)   R2(δ)   rmax(δ)   r(π*, δ)
δ1     {0,1}    {2}        ∅         0       2/3     1       1         4/7
δ2     {0,1}    ∅          {2}       0       1       1/2     1         4/7
δ3     {0}      {1,2}      ∅         1/2     1/3     1       1         4/7
δ4     {0}      {1}        {2}       1/2     2/3     1/2     2/3       4/7
δ5     {0}      {2}        {1}       1/2     2/3     1/2     2/3       4/7
δ6     {0}      ∅          {1,2}     1/2     1       0       1         4/7
δ7     {1}      {0,2}      ∅         1/2     1/3     1       1         4/7
δ8     {1}      {0}        {2}       1/2     2/3     1/2     2/3       4/7
δ9     ∅        {0,1,2}    ∅         1       0       1       1         4/7
δ10    ∅        {0,1}      {2}       1       1/3     1/2     1         4/7
δ11    ∅        {0,2}      {1}       1       1/3     1/2     1         4/7
δ12    ∅        {0}        {1,2}     1       2/3     0       1         4/7

Figure 3.7  Two-dimensional section of the risk set R(D̃): geometric view of the minimax rule δ_m in the case where δ_m is (a) an equalizer rule; (b) not an equalizer rule
While there are |D| = 3³ = 27 possible decision rules, we may exclude any rule such that δ(0) = 2 or δ(2) = 0 (since p_2(0) = p_0(2) = 0, such rules are dominated). Hence there are two possible actions to consider when Y = 0 or Y = 2, and three actions when Y = 1, for a total of 12 decision rules. Their conditional risks (conditional error probabilities R_j(δ) = 1 − P_j(Y_j) for j = 0, 1, 2) are listed in Table 3.1. The best deterministic rules are δ4, δ5, and δ8, with a worst-case risk of 2/3; moreover, all three rules have the same risk vector R = [1/2, 2/3, 1/2]^⊤. However, the worst-case risk can be reduced using randomization. Consider δ4(y) = y and δ9(y) = 1; the latter has risk vector R(δ9) = [1, 0, 1]^⊤. If we pick δ4 with probability γ and δ9 with probability 1 − γ, we obtain a randomized rule δ̃ with risk vector
$$R(\tilde\delta) = \gamma R(\delta_4) + (1 - \gamma) R(\delta_9) = \left[1 - \frac{\gamma}{2},\ \frac{2\gamma}{3},\ 1 - \frac{\gamma}{2}\right]^\top.$$
An equalizer rule is obtained by choosing γ = 6/7. All three conditional risks are then equal to 4/7, hence Rmax(δ̃) = 4/7 is lower than the 2/3 achieved by the best deterministic rules of Table 3.1. While δ̃ constructed above is an equalizer rule, the question is whether it is also a Bayes rule. The Bayes risks of rules δ4 and δ9 are respectively

$$r(\pi; \delta_4) = \frac12 \pi_0 + \frac23 \pi_1 + \frac12 \pi_2 = \frac{3 + \pi_1}{6}, \qquad r(\pi; \delta_9) = \pi_0 + \pi_2 = 1 - \pi_1.$$

If δ̃ is a Bayes rule, we must have r(π; δ4) = r(π; δ9), hence π_1 = 3/7. Then r(π; δ4) = r(π; δ9) = 4/7 = Rmax(δ̃). By symmetry, we may guess that the least-favorable prior satisfies π_0* = π_2*, hence π* = [2/7, 3/7, 2/7]^⊤. Computing the risks of all decision rules under π* in the last column of Table 3.1, we see that none of them is lower than 4/7.³ We conclude that δ̃ is a minimax rule and that π* is a least-favorable prior. In this problem the minimax rule δ̃ is not unique: one could have randomized between δ5 and δ9, or between δ8 and δ9. One could even have randomized among any subset of {δ4, δ5, δ8, δ9}, provided δ9 is included in the mix and receives a probability weight of 1/7. However, there is no advantage in randomizing between more than two rules in this problem.
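The computations in Example 3.1 can be verified mechanically. A sketch in exact rational arithmetic: enumerate all 27 deterministic rules, confirm that the best deterministic worst-case risk is 2/3, and check that the γ = 6/7 mixture of δ4 and δ9 equalizes all three conditional risks at 4/7.

```python
from fractions import Fraction as F
from itertools import product

# Distributions of Example 3.1: p[j][y].
p = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 3), F(1, 3), F(1, 3)],
     [F(0),    F(1, 2), F(1, 2)]]

def risks(rule):
    """Conditional risks R_j(delta) = 1 - P_j(Y_j) for rule: {0,1,2} -> {0,1,2}."""
    return [1 - sum(p[j][y] for y in range(3) if rule[y] == j) for j in range(3)]

# Best worst-case risk over all 27 deterministic rules.
best_det = min(max(risks(r)) for r in product(range(3), repeat=3))
assert best_det == F(2, 3)

# Randomize delta_4(y) = y with prob gamma, delta_9(y) = 1 with prob 1 - gamma.
gamma = F(6, 7)
R4, R9 = risks((0, 1, 2)), risks((1, 1, 1))
R = [gamma * a + (1 - gamma) * b for a, b in zip(R4, R9)]
assert R == [F(4, 7)] * 3   # equalizer: worst-case risk 4/7 < 2/3
```

Exact fractions avoid any floating-point ambiguity when comparing 4/7 against 2/3.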
3.4 Generalized Neyman–Pearson Detection
The Neyman–Pearson optimality criterion can be extended to m-ary detection in several ways. For instance, when m ≥ 3 one may want to minimize the conditional risk R_{m−1}(δ̃) subject to the inequality constraints R_k(δ̃) ≤ α_k for k = 0, 1, ..., m − 2. Since the constraints are linear in δ̃, the problem can be formulated as a linear optimization problem as in Section 2.4.5. The solution admits a geometric interpretation: each constraint R_k(δ̃) ≤ α_k defines a half-space. The feasible set, which is the intersection of the (convex) risk set R(D̃) with those m − 1 half-spaces, is convex. The minimum of a linear function over a convex set is achieved at an extremal point of the feasible set. The solution is a randomized Bayes rule. Since there are m − 1 constraints, the Bayes rule randomizes among at most m deterministic rules.
3.5 Multiple Binary Tests
We now consider the problem of designing m binary tests

$$H_{0i} \quad \text{versus} \quad H_{1i}, \qquad i = 1, 2, \ldots, m. \tag{3.17}$$
³ In fact, r(π*, δ) = 4/7 for all 12 rules listed in Table 3.1. Note, however, that some of the other 15 rules are worse; for instance, δ(y) ≡ 0 has risk vector R = [0, 1, 1]^⊤ and therefore r(π*, δ) = 5/7.
Hence the state space and the action space are X = A = {0, 1}^m. This inference model occurs in a number of important applications, including data mining, medical testing, signal analysis, and genomics. Typically each test corresponds to a question about the observed dataset Y, e.g., does Y satisfy a certain property or not? The number of tests could be very large, each of them leading to a potential discovery about Y. As we have seen in Chapter 2, a central problem in binary hypothesis testing is to control the tradeoff between false alarms and missed detections (missed discoveries). The problem is clearly more complex for multiple binary tests. The classical analogue of the probability of false alarm is the familywise error rate

$$\mathrm{FWER} \triangleq P_0\left\{\bigcup_{i=1}^{m} \{\text{say } H_{1i}\}\right\}, \tag{3.18}$$
where P_0 is the distribution of the observations when all null hypotheses are true. The missed detections can be quantified by the average power of the tests, defined as

$$\beta_m \triangleq \frac{1}{m} \sum_{i=1}^{m} P_{1i}\{\text{say } H_{1i}\}, \tag{3.19}$$
where P_{1i} is the distribution of the test statistic for the i-th test under H_{1i}. As we shall see, a simple and classical solution is to pick a significance level α ∈ (0, 1) and apply the Bonferroni correction [1], which guarantees that FWER ≤ α. A more recent and powerful approach is the Benjamini–Hochberg procedure [2], which is adaptive and controls the false discovery rate, i.e., the expected fraction of false discoveries, to be defined precisely in Section 3.5.2. First we introduce the notion of the p-value of a binary test. Assume a single binary test H_0 versus H_1 taking the form δ(y) = 1{t(y) > η}, where t(y) is the test statistic and η is the threshold of the test. For instance, t(y) could be the likelihood ratio p_1(y)/p_0(y), but other choices are possible.

DEFINITION 3.4  The p-value of the test statistic is p ≜ P_0{T ≥ t(y)}, where T ∼ P_0.
In other words, the p-value is the right-tail probability of the distribution P_0, evaluated at t(y). If t(Y) follows the distribution P_0, the p-value is a random variable, and is uniformly distributed on [0, 1]. The smaller the p-value, the more likely it is that H_1 is true. To guarantee a probability of false alarm equal to α (as in an NP test), one can select the test threshold η such that P_0{t(Y) > η} = α. (For simplicity we assume that P_0 has no mass at {t(Y) = η}, hence no randomization is needed.) Equivalently, the test can be expressed in terms of the p-value as δ(p) = 1{p < α}. Now consider m tests δ_i : [0, 1] → {0, 1}, i = 1, 2, ..., m, each of which uses a p-value Q_i as its test statistic.⁴ Denote by P_0 the joint distribution of {Q_i}_{i=1}^m when all null hypotheses are true.
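As a concrete instance (the choice t(y) = y with P_0 = N(0, 1) is an assumption for illustration, not from the text), the p-value is the Gaussian right-tail (Q) function, and the threshold test 1{t(y) > η} at level α coincides with the p-value test 1{p < α}:

```python
import math

def p_value(t):
    """Right tail of N(0,1): P0{T >= t} = Q(t) = erfc(t / sqrt(2)) / 2."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

alpha = 0.05
eta = 1.6448536269514722      # eta = Q^{-1}(alpha), the 95% Gaussian quantile
t_obs = 2.0                   # illustrative observed statistic

# The two formulations of the same test agree:
assert (p_value(t_obs) < alpha) == (t_obs > eta)
assert abs(p_value(eta) - alpha) < 1e-9
```

The equivalence holds because p = Q(t) is a strictly decreasing function of t, so thresholding t from above is the same as thresholding p from below.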
3.5.1 Bonferroni Correction
If one selects α as the significance level of each individual test, then FWER equals α if m = 1 but increases with m. More precisely,

⁴ These random p-values are not denoted by P_i so as not to cause confusion with probability measures.
Table 3.2  Notation for multiple binary hypothesis testing

                          Declared true (accepted)    Declared false (rejected)    Total
True null hypotheses      Ac                          Re                           m0
False null hypotheses     Ae                          Rc                           m − m0
Total                     A = m − R                   R                            m

$$\mathrm{FWER} \le \sum_{i=1}^{m} P_0\{\text{say } H_{1i}\} = m\alpha,$$
where the inequality follows from the union bound (a.k.a. the Bonferroni inequality), and holds whether the test statistics {Q_i}_{i=1}^m are independent or not. In the special case of independent p-values {Q_i}_{i=1}^m, we have FWER = 1 − (1 − α)^m ≈ mα for small mα. To reduce the FWER, one could set the significance level of each test to α/m. This is the Bonferroni correction. Unfortunately, doing so reduces the average power of the tests.
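A quick numerical check of these statements (the values of α and m are arbitrary illustrative choices):

```python
# Union bound FWER <= m*alpha, the independent-case identity
# FWER = 1 - (1 - alpha)^m, and the effect of the Bonferroni level alpha/m.
alpha, m = 0.05, 20

fwer_indep = 1 - (1 - alpha) ** m       # independent tests, per-test level alpha
assert fwer_indep <= m * alpha          # union (Bonferroni) bound

fwer_bonf = 1 - (1 - alpha / m) ** m    # independent tests, per-test level alpha/m
assert fwer_bonf <= alpha               # Bonferroni correction controls FWER

# For small m*alpha the bound is nearly tight: with m = 4, 1-(1-alpha)^4 ~ 4*alpha.
assert abs((1 - (1 - alpha) ** 4) - 4 * alpha) < 0.02
```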
3.5.2 False Discovery Rate
Table 3.2 introduces some notation. Denote by δ = {δ_i}_{i=1}^m a collection of m binary tests and by m_0 the number of true null hypotheses. Of these, a number A_c are correctly accepted, and R_e are erroneously rejected. There are also m − m_0 false null hypotheses. Of these, a number A_e are erroneously accepted, and R_c are correctly rejected. Denote by R = R_c + R_e the total number of rejected null hypotheses, and by A = m − R the total number of accepted null hypotheses. The random variables A_c, A_e, R_e, R_c are unobserved, but A and R are. With this notation, we have FWER(δ) = P_0{R_e ≥ 1}. The fraction of errors occurring by erroneously rejecting null hypotheses is the random variable
$$F \triangleq \begin{cases} \dfrac{R_e}{R_e + R_c} = \dfrac{R_e}{R} & : R > 0 \\[2mm] 0 & : R = 0. \end{cases} \tag{3.20}$$
The false discovery rate (FDR) is defined as the expected value of F:

$$\mathrm{FDR}(\delta) \triangleq E[F],$$
where the expectation is with respect to the joint distribution P of {Q_i}_{i=1}^m, which depends on the underlying combination of true and false null hypotheses. Note that if m_0 = m (all null hypotheses are true) then R_e = R, implying that in this case FDR(δ) = P_0{R_e ≥ 1} = FWER(δ). The Benjamini–Hochberg procedure selects a target α ∈ (0, 1) and makes decisions on each hypothesis in a way that guarantees that FDR ≤ α for all P. Hence one might have to contend with several false alarms (on average), but this allows an increase in the average power of the tests (enabling more discoveries). This procedure is particularly attractive for problems involving large m, and has been found relevant in several types of problems:
• overall decision making based on multiple inferences, e.g., target detection in a network [6];
• multiple individual decisions (an overall decision is not required), e.g., voxel classification in brain imaging [7].

3.5.3 Benjamini–Hochberg Procedure
Fix α ∈ (0, 1). Given m p-values Q_1, ..., Q_m, reorder them in nondecreasing order as Q_(1) ≤ Q_(2) ≤ ··· ≤ Q_(m), and denote the corresponding null hypotheses by H_(1), H_(2), ..., H_(m). Let k be the largest i such that Q_(i) ≤ (i/m)α. Then reject the null hypotheses H_(1), H_(2), ..., H_(k) (report k discoveries). Denote by δ_BH the resulting collection of tests.

THEOREM 3.1  For independent Q_1, ..., Q_m and any configuration of false null hypotheses, FDR(δ_BH) ≤ α.
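The procedure is short to implement. A minimal sketch (the p-values in the usage example are made up for illustration):

```python
def benjamini_hochberg(p_values, alpha):
    """BH procedure: reject H_(1),...,H_(k), where k is the largest i with
    Q_(i) <= (i/m)*alpha. Returns the (sorted) indices of rejected hypotheses."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k = rank                      # remember the LARGEST qualifying rank
    return sorted(order[:k])

# With alpha = 0.05 and m = 5, the largest i satisfying Q_(i) <= i*alpha/m
# is i = 2 (since 0.012 <= 2*0.05/5 = 0.02), so two discoveries are reported.
rejected = benjamini_hochberg([0.003, 0.012, 0.20, 0.04, 0.5], alpha=0.05)
assert rejected == [0, 1]
```

Note that all hypotheses up to rank k are rejected even if some intermediate Q_(i) exceeds its own threshold; this is what makes the procedure a step-up method.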
The theorem has been extended to the case of positively correlated test statistics [3].

Example 3.2  The following example is taken from [2, Sec. 4]. Let m be a multiple of 4 and assume that

$$P_{0i} = \mathcal{N}(0, 1), \quad 1 \le i \le m; \qquad P_{1i} = \mathcal{N}(j\mu, 1), \quad \frac{(j-1)m}{4} < i \le \frac{jm}{4}, \quad j = 1, 2, 3, 4.$$

Let α = 0.05 and μ = 1.25. The center column of Figure 3.8 shows the average power β_m of the Benjamini–Hochberg (BH) procedure for m = 4, 8, 16, 32, 64. The simulations show that for large m, the average power is substantially larger than that obtained using the Bonferroni correction method.
Applications of the BH procedure reported in the literature include sparse signal analysis [4], detection of the number of incoming signals in array signal processing [5], target detection in networks [6], neuroimaging [7], and genomics [8].
Figure 3.8  Application of the BH procedure to Example 3.2 with 50 percent true null hypotheses: average power β_m versus log₂ m ∈ {2, 3, 4, 5, 6}
3.5.4 Connection to Bayesian Decision Theory
While the BH procedure may seem ad hoc, it admits an interesting connection to Bayesian decision theory. To see this, consider the simplified problem where the distributions P_{0i} = P_0 and P_{1i} = P_1 are the same for all i, and each null hypothesis H_{0i} is true with probability π_0. Denote by F_0 and F_1 the cdfs of the test statistic T under distributions P_0 and P_1, respectively, and by F(t) = π_0 F_0(t) + π_1 F_1(t) the cdf of the mixture distribution. The conditional probability of the null hypothesis given T ≥ t is given by
$$\pi(0 \mid T \ge t) = \pi_0\, \frac{P_0\{T \ge t\}}{P\{T \ge t\}} = \pi_0\, \frac{1 - F_0(t)}{1 - F(t)}, \tag{3.21}$$
where P = π_0 P_0 + π_1 P_1 is the mixture distribution. Now for T ∼ P_0, the p-values defined by p = 1 − F_0(t) are uniformly distributed, hence their cdf is given by G_0(γ) = γ for γ ∈ [0, 1]. For T ∼ P, the p-values follow the mixture distribution with cdf equal to G(γ) = 1 − F(F_0^{−1}(1 − γ)). Given the p-values {Q_i}_{i=1}^m, we may construct the following empirical cdf:
$$\hat{G}(\gamma) = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}\{Q_i \le \gamma\}, \quad \gamma \in [0, 1],$$
which converges uniformly to G(γ) as m → ∞ by the Kolmogorov–Smirnov limit theorem. Observe that Ĝ(Q_(i)) = i/m, hence the BH procedure may be written as: let k be the largest i such that Q_(i) ≤ Ĝ(Q_(i)) α. For large m, the ratio k/m converges to the solution of the nonlinear equation
$$\gamma = G(\gamma)\, \alpha. \tag{3.22}$$
Moreover, for large m, the conditional probability of (3.21) may be approximated as

$$\pi(0 \mid Q \le \gamma) \approx \pi_0\, \frac{\gamma}{G(\gamma)} = \pi_0 \alpha.$$

Since this conditional probability can be interpreted as the false discovery rate of the tests {Q_i ≤ γ}, we denote it by Fdr(γ). Moreover,

$$\mathrm{FDR}(\delta_{BH}) = E[F] \approx E\left[\frac{\sum_{i=1}^{m} \mathbf{1}\{Q_i \le \gamma,\ H_{0i}\ \text{true}\}}{\sum_{i=1}^{m} \mathbf{1}\{Q_i \le \gamma\}}\right] \approx \frac{\frac1m \sum_{i=1}^{m} E[\mathbf{1}\{Q_i \le \gamma,\ H_{0i}\ \text{true}\}]}{\frac1m \sum_{i=1}^{m} E[\mathbf{1}\{Q_i \le \gamma\}]} = \frac{\pi_0 \gamma}{G(\gamma)} = \mathrm{Fdr}(\gamma) = \pi_0 \alpha.$$
Hence one may interpret the BH procedure as an approximate Bayesian procedure that assumes very few discoveries (π_0 ≈ 1) and guarantees a false discovery rate of α.

Example 3.3  Let σ > 1 and assume the Bayesian setup where P_{0i} = P_0 = Exp(1) and P_{1i} = P_1 = Exp(σ) are the same for all i. Hence

$$p_0(t) = e^{-t}, \quad p_1(t) = \frac{1}{\sigma} e^{-t/\sigma}, \quad 1 - F_0(t) = e^{-t}, \quad 1 - F_1(t) = e^{-t/\sigma}, \quad t \ge 0.$$
The p-value associated with t is γ = 1 − F_0(t) = e^{−t}. Moreover,

$$G(\gamma) = \pi_0 \gamma + \pi_1 \left[1 - F_1(t)\right]\big|_{t = \ln(1/\gamma)} = \pi_0 \gamma + \pi_1 \gamma^{1/\sigma}.$$

Solving (3.22), we obtain

$$\gamma = \left( \frac{(1 - \pi_0)\alpha}{1 - \pi_0 \alpha} \right)^{\sigma/(\sigma - 1)}.$$

For large m, the average power of the BH procedure is therefore

$$\beta_m \approx 1 - F_1(t)\big|_{t = \ln(1/\gamma)} = \gamma^{1/\sigma} = \left( \frac{(1 - \pi_0)\alpha}{1 - \pi_0 \alpha} \right)^{1/(\sigma - 1)}.$$

Note that this expression is independent of m and is larger (closer to 1) for σ ≫ 1, as expected since the rival distributions are more dissimilar in this case. In contrast, the average power of the m NP tests at significance level α/m (guaranteeing that FWER approaches α as m → ∞) is (α/m)^{1/σ}, which vanishes polynomially with m (see Exercise 3.9).
Exercises
3.1 Ternary Poisson Hypothesis Testing. Derive the MPE rule for ternary hypothesis testing assuming uniform prior and

$$H_0 : Y \sim \mathrm{Poi}(\lambda), \quad H_1 : Y \sim \mathrm{Poi}(2\lambda), \quad H_2 : Y \sim \mathrm{Poi}(3\lambda),$$
where λ > 0. Infer a condition on λ under which the MPE rule never returns H1.

3.2 Ternary Hypothesis Testing. Consider the ternary hypothesis testing problem where p_0, p_1, and p_2 are uniform pdfs over the intervals [0, 1], [0, 1/2), and [1/2, 1], respectively. Assume uniform costs.
(a) Find the Bayes classification rule and the minimum Bayes classification error probability assuming a prior π = (π_0, (1 − π_0)/2, (1 − π_0)/2), for all values of π_0 ∈ [0, 1].
(b) Find the minimax classification rule, the least-favorable prior, and the minimax classification error.
(c) Find the classification rule that minimizes the probability of classification error under H0 subject to the constraint that the error probability under H1 does not exceed 1/5, and the error probability under H2 does not exceed 2/5.

3.3 5-ary Classification. A communication system uses multi-amplitude signaling to send one of five signals. After correlation of the received signal with the carrier, the correlation statistic Y follows one of the following hypotheses:
$$H_j : Y = (j - 2)\mu + Z, \quad j = 0, 1, 2, 3, 4,$$

where μ > 0 and Z ∼ N(0, σ²). Assume uniform costs.
(a) Assume the hypotheses are equally likely, and find the decision rule with minimum probability of error. Also find the corresponding minimum Bayes risk. Hint: Find the probability of correct decision making first.
(b) Find the minimax classification rule for d = μ/σ = 3. To do this, seek a Bayes rule which is an equalizer. Hint: Guess what the acceptance regions for each hypothesis are, up to two parameters which can be determined by numerical optimization.

3.4 Decision Regions for Ternary Hypothesis Testing. Show the decision regions for a ternary Bayesian hypothesis test with uniform priors and costs C_ij = |i − j|.

3.5 Ternary Gaussian Hypothesis Testing. Consider the following ternary hypothesis testing problem with two-dimensional observation Y = [Y_1 Y_2]^⊤:
$$H_0 : Y = N, \qquad H_1 : Y = s + N, \qquad H_2 : Y = -s + N,$$

where $s = \frac{1}{\sqrt{2}}[1\ 1]^\top$, and the noise vector N is Gaussian $\mathcal{N}(0, K)$ with covariance matrix

$$K = \begin{pmatrix} 1 & \tfrac14 \\ \tfrac14 & 1 \end{pmatrix}.$$
(a) Assuming that all hypotheses are equally probable, show that the minimum error probability rule can be written as

$$\delta_B(y) = \begin{cases} 1 & s^\top y \ge \eta \\ 0 & -\eta < s^\top y < \eta \\ 2 & s^\top y \le -\eta. \end{cases}$$

(b) Specify the value of η that minimizes the error probability and find the minimum error probability.
3.6 MPE Rule. Derive the MPE rule for the following ternary hypothesis testing problem with uniform prior and two-dimensional observation Y = (Y_1, Y_2)^⊤, where Y_1 and Y_2 are independent under each hypothesis and

$$H_j : \quad Y_1 \sim \mathrm{Poi}(2e^j), \quad Y_2 \sim \mathcal{N}(2j, 1), \quad j = 0, 1, 2.$$
Show the acceptance regions of the test in the Y plane.

3.7 LED Display. The following figure depicts a seven-segment LED display. Each segment can be turned on or off to display digits. For instance, digit 0 is displayed by turning on segments a through f and leaving segment g turned off. Some segments may die (can no longer be turned on), impairing the ability of the device to properly display digits. We assume that each of the seven segments may die independently, with probability θ. Hence there is a possibility that digits are confused: for instance, a displayed 0 might also be an 8 with dead segment g. In this problem, we analyze various error probabilities using Bayesian inference, uniform costs, and uniform prior.
Seven-segment display
Denote by y ∈ {0, 1}⁷ the displayed configuration, i.e., the sequence of on/off states for a particular digit. To each digit j ∈ {0, 1, ..., 9} we may associate a probability distribution p_j(y). For instance, if j = 8 and y = (1, 1, 1, 1, 1, 1, 1), we have p_j(y) = (1 − θ)⁷.
Derive a Bayes test and the Bayes error probability for testing H0 versus H8 . Repeat with H4 versus H8. Repeat with H0 versus H4. Give a lower and an upper bound on Bayes error probability for testing H0 versus H4 versus H8 .
3.8 False Discovery Rate. Show that FDR(δ) ≤ FWER(δ) for any δ. 3.9 Bonferroni Correction. Show that the average power of the m NP tests at signiﬁcance level α/ m in Example 3.3 is equal to (α/ m)1/σ . 3.10 Bayesian Version of BH Procedure. Derive the asymptotic FDR for the BH procedure at level α when the m test statistics are independent and m → ∞ . For each test there are 3 equiprobable hypotheses, with the following pdfs on [0,1]:
$$p_{0i}(t) = 1, \quad p_{1i}(t) = 2t, \quad p_{2i}(t) = 3t^2, \qquad 1 \le i \le m.$$
3.11 Computer Simulation of BH Procedure. Consider m binary tests. For each test i, the observation is the absolute value of a random variable Y_i drawn from either P_0 = N(0, 1) or P_1 = ½[N(−3, 1) + N(3, 1)]. The random variables {Y_i}_{i=1}^m are mutually independent.
(a) Give the asymptotic FWER (as m → ∞) that results when each test is an NP test at significance level α/m (i.e., the Bonferroni correction is used). Hint: This value is close to, but not exactly equal to, α.
(b) Determine the significance level ensuring that FWER = α (exactly) for each m.
(c) Fix α = 0.05 and implement the Benjamini–Hochberg procedure with FDR = α and the procedure of Part (b) with FWER = α. Let the first m/2 null hypotheses be true. To estimate the average power β_m of (3.19), perform 10 000 independent
simulations and average the number of true discoveries over the 10 000 m tests. Plot the estimated average power β m against m ranging between 1 and 512.
References
[1] O. J. Dunn, "Multiple Comparisons Among Means," J. Am. Stat. Assoc., Vol. 56, No. 293, pp. 52–64, 1961.
[2] Y. Benjamini and Y. Hochberg, "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing," J. Roy. Stat. Soc. Series B, Vol. 57, No. 1, pp. 289–300, 1995.
[3] Y. Benjamini and D. Yekutieli, "The Control of the False Discovery Rate in Multiple Testing under Dependency," Ann. Stat., Vol. 29, No. 4, pp. 1165–1188, 2001.
[4] F. Abramovich, Y. Benjamini, D. L. Donoho, and I. M. Johnstone, "Adapting to Unknown Sparsity by Controlling the False Discovery Rate," Ann. Stat., Vol. 34, No. 2, pp. 584–653, 2006.
[5] P. J. Chung, J. F. Böhme, M. F. Mecklenbräucker, and A. O. Hero, III, "Detection of the Number of Signals Using the Benjamini–Hochberg Procedure," IEEE Trans. Signal Process., 2006.
[6] P. Ray and P. K. Varshney, "False Discovery Rate Based Sensor Decision Rules for the Network-Wide Distributed Detection Problem," IEEE Trans. Aerosp. Electron. Syst., Vol. 47, No. 3, pp. 1785–1799, 2011.
[7] C. R. Genovese, N. A. Lazar, and T. Nichols, "Thresholding of Statistical Maps in Functional Neuroimaging Using the False Discovery Rate," Neuroimage, Vol. 15, No. 4, pp. 870–878, 2002.
[8] J. D. Storey and R. Tibshirani, "Statistical Significance for Genomewide Studies," Proc. Nat. Acad. Sci., Vol. 100, No. 6, pp. 9440–9445, 2003.
4 Composite Hypothesis Testing
Composite hypothesis testing is studied in this chapter, where each hypothesis may be associated with more than one probability distribution. We begin with the case of two hypotheses. The Bayesian approach to this problem is ﬁrst considered. Then the notions of uniformly most powerful test, locally most powerful test, generalized likelihood ratio test, and nondominated test are introduced and their advantages and limitations are identiﬁed. The approach is then extended to the m ary case. The robust hypothesis testing problem is also introduced, along with approaches to solving this problem.
4.1 Introduction
For the binary detection problem of Chapter 2 we have assumed that conditional densities p0 and p1 are speciﬁed completely. Under this assumption, we saw that all three classical formulations of the binary detection problem (Bayes, minimax, Neyman– Pearson) led to the same solution structure, the likelihood ratio test (LRT), which is a comparison of the likelihood ratio L (y ) to an appropriately chosen threshold. We now study the situation where p0 and p1 are not speciﬁed explicitly, but come from a parameterized family of densities { pθ , θ ∈ X }, with X being either a discrete set or a subset of a Euclidean space. The hypothesis H j corresponds to θ ∈ X j , j = 0, 1, where the sets X0 and X1 form a partition of the state space X , namely they are disjoint, and their union is X . Composite binary detection (hypothesis testing) may therefore be viewed as a statistical decision theory problem where the set of states X is nonbinary, but the set of actions A = {0, 1} is still binary. The cost function relating the actions and states will be denoted by C (i , θ), i ∈ {0, 1}, θ ∈ X . A deterministic decision rule δ : Y → {0, 1} is a function of the observations hence may not depend on θ . The same holds for a randomized decision rule. We formulate binary composite hypothesis testing as follows:
$$H_0 : Y \sim p_\theta, \quad \theta \in \mathcal{X}_0$$
$$H_1 : Y \sim p_\theta, \quad \theta \in \mathcal{X}_1. \tag{4.1}$$
The mapping from the parameter space X to the space P[Y] of probability distributions on Y is depicted in Figure 4.1. Note that it is possible to have p_θ = p_θ′ for some θ ≠ θ′. We illustrate the setting of composite hypothesis testing with an example before studying the general problem:
Figure 4.1  Mapping from the parameter set X to the set of probability distributions P[Y]

Figure 4.2  Gaussian hypothesis testing with X0 = {0} and X1 = (0, ∞)

$$H_0 : Y \sim \mathcal{N}(\theta, \sigma^2), \quad \theta \in \mathcal{X}_0 = \{0\}$$
$$H_1 : Y \sim \mathcal{N}(\theta, \sigma^2), \quad \theta \in \mathcal{X}_1 = (0, \infty). \tag{4.2}$$
This setup has classically been used to model a radar detection problem with Gaussian receiver noise. We try to determine whether a target is present, i.e., whether H1 is true. The state θ represents the strength of the target signature and depends on the distance to the target, the target type, and other physical factors. In (4.2), H0 is the so-called null hypothesis, which consists of a single distribution for the observations, so H0 is a simple hypothesis. Since H1 has more than one (here infinitely many) possible distributions, H1 is a composite hypothesis. As an example of a possible decision rule, consider a threshold rule for Y, in which case the threshold τ is to be determined, but is not allowed to depend on the unknown θ. As illustrated in Figure 4.2, the probability of false alarm is P_F = Q(τ/σ) and the probability of miss is P_M = Q((θ − τ)/σ). Different notions of optimality are considered in the following sections to select the decision rule.

4.2 Random Parameter Θ

Assume the state θ is a realization of a random variable Θ with known prior π(θ), θ ∈ X. From (1.25), we immediately have that the Bayes rule for composite hypothesis testing is given by
$$\delta_B(y) = \arg\min_{i \in \{0,1\}} C(i \mid y), \tag{4.3}$$

where, using the notation introduced in (1.20),

$$C(i \mid y) = \int_{\mathcal{X}} C(i, \theta)\, \pi(\theta \mid y)\, d\nu(\theta),$$

with

$$\pi(\theta \mid y) = \frac{p_\theta(y)\, \pi(\theta)}{p(y)}.$$

4.2.1 Uniform Costs Over Each Hypothesis
In the special case where costs are uniform over each hypothesis, i.e.,
$$C(i, \theta) = C_{ij}, \quad \forall\, \theta \in \mathcal{X}_j \tag{4.4}$$
(but it is not necessary that C_ij = 1{i ≠ j}), the posterior cost has the same form as in the Bayes rule for simple hypothesis testing. Indeed, we can expand C(i | y) as
C ( i  y) = C i 0
²
X0
π(θ  y)d ν(θ) +
Ci 1
²
θ ∈X1
π(θ y )d ν(θ),
from which we can easily conclude that
³ X C (1 y) ≤ C (0 y) ⇐⇒ ³
1
X0
pθ ( y )π(θ)d ν(θ)
pθ ( y )π(θ)d ν(θ)
≥ CC 10 −− CC 00 . 01
11
(4.5)
Now, we define the priors on the hypotheses as

$$\pi_j = \int_{\mathcal{X}_j} \pi(\theta)\,d\nu(\theta), \quad j = 0, 1, \qquad (4.6)$$

the conditional density of Θ given $H_j$ as

$$\pi_j(\theta) = \frac{\pi(\theta)}{\pi_j}, \quad \theta \in \mathcal{X}_j, \; j = 0, 1, \qquad (4.7)$$

and the conditional densities for the hypotheses as

$$p(y|H_j) = \int_{\mathcal{X}_j} p_\theta(y)\,\pi_j(\theta)\,d\nu(\theta), \quad j = 0, 1. \qquad (4.8)$$

With $\eta = \frac{C_{10}-C_{00}}{C_{01}-C_{11}}\,\frac{\pi_0}{\pi_1}$ as defined in (2.12) and $L(y) = p(y|H_1)/p(y|H_0)$, this means that $C(1|y) \le C(0|y) \iff L(y) \ge \eta$, so that

$$\delta_B(y) = \begin{cases} 1 & L(y) \ge \eta \\ 0 & \text{else}. \end{cases}$$
Example 4.1 Consider the following hypotheses:

$$H_0 : p_\theta(y) = \theta e^{-\theta y}\,1\{y \ge 0\}, \quad \theta \in \mathcal{X}_0 = \{1\},$$
$$H_1 : p_\theta(y) = \theta e^{-\theta y}\,1\{y \ge 0\}, \quad \theta \in \mathcal{X}_1 = (1, \infty),$$

with prior $\pi_1(\theta) = \beta e^{-\beta(\theta-1)}\,1\{\theta \ge 1\}$ under $H_1$. The exponential density $p_\theta$ and the prior $\pi_1$ are shown in Figure 4.3. Evaluating the conditional marginals (4.8), we have, for $y \ge 0$,

$$p(y|H_0) = e^{-y},$$

$$\begin{aligned}
p(y|H_1) &= \int_{\mathcal{X}_1} p_\theta(y)\,\pi_1(\theta)\,d\theta = \int_1^\infty \theta e^{-\theta y}\,\beta e^{-\beta(\theta-1)}\,d\theta \\
&= \beta e^{-y} \int_1^\infty \theta\, e^{-(\theta-1)(y+\beta)}\,d\theta \\
&= \beta e^{-y} \left[ \int_1^\infty (\theta-1)\,e^{-(\theta-1)(y+\beta)}\,d\theta + \int_1^\infty e^{-(\theta-1)(y+\beta)}\,d\theta \right] \\
&= \beta e^{-y} \left[ \frac{1}{(y+\beta)^2} + \frac{1}{y+\beta} \right].
\end{aligned}$$

The likelihood ratio is given by

$$L(y) = \frac{p(y|H_1)}{p(y|H_0)} = \beta \left[ \frac{1}{(y+\beta)^2} + \frac{1}{y+\beta} \right], \quad y \ge 0.$$
Figure 4.3 Example 4.1: (a) $p_\theta(y)$; and (b) $\pi_1(\theta)$
Observe that L ( y) → 1 as β → ∞, and thus H0 and H1 become indistinguishable in the limit as β → ∞. This should be expected because π1 (θ) concentrates around θ = 1 in that case. The decision rule is
$$\delta_B(y) = \begin{cases} 1 & \text{if } L(y) = \beta\left[\dfrac{1}{(y+\beta)^2} + \dfrac{1}{y+\beta}\right] \ge \eta = 1 \\ 0 & \text{otherwise}. \end{cases}$$

Since the function $L(y)$ is monotonically decreasing, we obtain, equivalently,

$$\delta_B(y) = \begin{cases} 1 & \text{if } y \le \tau \\ 0 & \text{if } y > \tau, \end{cases}$$

where

$$\tau = \left(-\tfrac{1}{2} + \sqrt{\tfrac{1}{4} + \beta^{-1}}\right)^{-1} - \beta.$$
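The closed-form threshold can be checked numerically. The sketch below (Python; the value β = 2 and the function names are illustrative, not from the text) verifies that $L$ is decreasing and crosses η = 1 exactly at τ:

```python
import math

def likelihood_ratio(y, beta):
    # L(y) = beta * (1/(y+beta)^2 + 1/(y+beta))  (Example 4.1)
    u = 1.0 / (y + beta)
    return beta * (u * u + u)

def bayes_threshold(beta):
    # tau = (-1/2 + sqrt(1/4 + 1/beta))^(-1) - beta, obtained by solving L(tau) = 1
    u = -0.5 + math.sqrt(0.25 + 1.0 / beta)
    return 1.0 / u - beta

beta = 2.0  # illustrative value
tau = bayes_threshold(beta)
# L is monotonically decreasing and equals eta = 1 exactly at tau:
assert likelihood_ratio(0.0, beta) > likelihood_ratio(tau, beta) > likelihood_ratio(2 * tau + 1, beta)
assert abs(likelihood_ratio(tau, beta) - 1.0) < 1e-12
```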
Example 4.2 Uniform Costs with Both Hypotheses Composite. Consider the composite detection problem in which $\mathcal{X} = [0,\infty)$, $\mathcal{X}_0 = [0,1)$, and $\mathcal{X}_1 = [1,\infty)$, with $C(i,\theta) = C_{ij} = 1\{i \neq j\}$, and

$$p_\theta(y) = \theta e^{-\theta y}\,1\{y \ge 0\}, \qquad \pi(\theta) = e^{-\theta}\,1\{\theta \ge 0\}.$$

To compute the Bayes rule for this problem, we first compute

$$\int_{\mathcal{X}_1} p_\theta(y)\,\pi(\theta)\,d\theta = \int_1^\infty \theta e^{-\theta(y+1)}\,d\theta = \frac{(y+2)\,e^{-(y+1)}}{(y+1)^2}$$

and

$$\int_{\mathcal{X}_0} p_\theta(y)\,\pi(\theta)\,d\theta = \int_0^1 \theta e^{-\theta(y+1)}\,d\theta = \frac{1}{(y+1)^2} - \frac{(y+2)\,e^{-(y+1)}}{(y+1)^2}.$$

Then, from (4.5), we get that

$$\delta_B(y) = \begin{cases} 1: & (y+2) \ge 0.5\,e^{(y+1)} \\ 0: & \text{else} \end{cases} \;=\; \begin{cases} 1: & y \le \tau \\ 0: & y > \tau, \end{cases}$$

where τ is a solution to the transcendental equation $(\tau + 2) = 0.5\,e^{(\tau+1)}$, which yields τ ≈ 0.68. We may also compute the Bayes risk for $\delta_B$ as

$$\begin{aligned}
r(\delta_B) &= \pi_0 R_0(\delta_B) + \pi_1 R_1(\delta_B) = \pi_0 P(\mathcal{Y}_1|H_0) + \pi_1 P(\mathcal{Y}_0|H_1) \\
&= \int_0^\tau \int_0^1 \theta e^{-\theta(y+1)}\,d\theta\,dy + \int_\tau^\infty \int_1^\infty \theta e^{-\theta(y+1)}\,d\theta\,dy \\
&= \int_0^\tau \frac{1 - (y+2)\,e^{-(y+1)}}{(y+1)^2}\,dy + \int_\tau^\infty \frac{(y+2)\,e^{-(y+1)}}{(y+1)^2}\,dy.
\end{aligned}$$
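Both the transcendental equation and the risk integrals are straightforward to evaluate numerically; the following sketch (bisection plus a simple trapezoidal rule; the grid size and the truncation point of the improper integral are illustrative choices) reproduces τ ≈ 0.68:

```python
import math

def excess(y):
    # Positive iff the Bayes rule of Example 4.2 decides H1: (y+2) >= 0.5*e^(y+1)
    return (y + 2.0) - 0.5 * math.exp(y + 1.0)

# Bisection for the transcendental equation (tau + 2) = 0.5 * e^(tau + 1).
lo, hi = 0.0, 2.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if excess(mid) > 0.0 else (lo, mid)
tau = 0.5 * (lo + hi)

def joint_h0(y):
    # pi_0 * p(y|H0) = integral over [0,1) of theta*e^(-theta(y+1)) dtheta
    return (1.0 - (y + 2.0) * math.exp(-(y + 1.0))) / (y + 1.0) ** 2

def joint_h1(y):
    # pi_1 * p(y|H1) = integral over [1,inf) of theta*e^(-theta(y+1)) dtheta
    return (y + 2.0) * math.exp(-(y + 1.0)) / (y + 1.0) ** 2

def trapz(f, a, b, n=4000):
    h = (b - a) / n
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + k * h) for k in range(1, n)))

# r(delta_B): false alarms on [0, tau], misses on (tau, inf), truncated at 200
bayes_risk = trapz(joint_h0, 0.0, tau) + trapz(joint_h1, tau, 200.0)
```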
4.2.2 Nonuniform Costs Over Hypotheses

Example 4.3 We now give an example of the case where the cost function depends on θ. Assume the same setting as Example 4.1, except that the cost function is now

$$C(i,\theta) = \begin{cases} \dfrac{1}{\theta} & \text{if } i = 0,\ \theta > 1 \quad (H_1) \\ 1 & \text{if } i = 1,\ \theta = 1 \quad (H_0) \\ 0 & \text{else}. \end{cases} \qquad (4.10)$$

The cost for saying $H_0$ when $H_1$ is true varies with θ and is lower than that in Example 4.2. The posterior risks are given by
$$C(1|y) = \int_{\mathcal{X}} \pi(\theta|y)\,C(1,\theta)\,d\nu(\theta) = \int_{\mathcal{X}_0} \pi(\theta|y)\,d\nu(\theta) = P(H_0|Y=y) = \frac{P(H_0)\,p(y|H_0)}{p(y)} = \frac{e^{-y}}{2\,p(y)}$$

and

$$\begin{aligned}
C(0|y) &= \int_{\mathcal{X}} \pi(\theta|y)\,C(0,\theta)\,d\nu(\theta) = \int_{\mathcal{X}_1} \pi(\theta|y)\,\frac{1}{\theta}\,d\theta \\
&= \frac{1}{2\,p(y)} \int_1^\infty \theta e^{-\theta y}\,\beta e^{-\beta(\theta-1)}\,\theta^{-1}\,d\theta = \frac{\beta e^{-y}}{2\,p(y)} \int_1^\infty e^{-(\theta-1)(y+\beta)}\,d\theta \\
&= \frac{e^{-y}}{2\,p(y)}\,\frac{\beta}{y+\beta} \;\le\; C(1|y) \quad \forall y \ge 0. \qquad (4.11)
\end{aligned}$$

Therefore we have that $\delta_B(y) = 0$ for all $y \ge 0$, i.e., the Bayes rule always picks $H_0$. Interestingly, if we change $\theta^{-1}$ into $k\theta^{-1}$ with $k > 1$ in the definition of the cost (4.10), the optimal decision rule becomes

$$\delta_B(y) = \begin{cases} 1: & y \le (k-1)\beta \\ 0: & y > (k-1)\beta. \end{cases}$$

This illustrates the potential sensitivity of the Bayes rule to the cost assignments.
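The sensitivity noted above is visible in one line of arithmetic: the ratio of posterior costs is $C(0|y)/C(1|y) = k\beta/(y+\beta)$, so the decision flips exactly at $y = (k-1)\beta$. A minimal check (the values of β and k are illustrative):

```python
def posterior_cost_ratio(y, beta, k=1.0):
    # C(0|y)/C(1|y) = k*beta/(y+beta) for the miss cost k/theta (Example 4.3;
    # k > 1 is the modified cost discussed at the end of the example).
    return k * beta / (y + beta)

beta = 3.0
# With k = 1 the ratio never exceeds 1, so the Bayes rule always decides H0.
assert all(posterior_cost_ratio(y, beta) <= 1.0 for y in [0.0, 0.5, 2.0, 10.0])
# With k = 2, the rule decides H1 exactly when y <= (k-1)*beta.
k = 2.0
assert posterior_cost_ratio((k - 1.0) * beta, beta, k) == 1.0
assert posterior_cost_ratio(0.9 * (k - 1.0) * beta, beta, k) > 1.0
```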
4.3 Uniformly Most Powerful Test

When a prior distribution for the parameter θ is not available, we may attempt to extend the Neyman–Pearson approach to select the decision rule. The probabilities of false alarm and correct detection depend on both the decision rule δ̃ and the state θ, and may be written as

$$P_F(\tilde\delta, \theta_0) = P_{\theta_0}\{\text{say } H_1\} = P_{\theta_0}\{\tilde\delta(Y) = 1\}, \quad \theta_0 \in \mathcal{X}_0,$$
$$P_D(\tilde\delta, \theta_1) = P_{\theta_1}\{\text{say } H_1\} = P_{\theta_1}\{\tilde\delta(Y) = 1\}, \quad \theta_1 \in \mathcal{X}_1.$$

Note that these are actually the same function, given a different name under $H_0$ and under $H_1$. A natural extension of the NP approach would be

$$\text{Maximize } P_D(\tilde\delta, \theta_1), \;\forall \theta_1 \in \mathcal{X}_1, \quad \text{subject to } P_F(\tilde\delta, \theta_0) \le \alpha, \;\forall \theta_0 \in \mathcal{X}_0. \qquad (4.12)$$

Unfortunately, in general no such δ̃ exists, because the same δ̃ cannot always simultaneously maximize different functions. If a solution does exist, it is said to be a uniformly most powerful (UMP) test at level α. Such a test, denoted by $\tilde\delta_{UMP}$, satisfies the following two properties. First,

$$\sup_{\theta_0 \in \mathcal{X}_0} P_F(\tilde\delta, \theta_0) \le \alpha \qquad (4.13)$$

for $\tilde\delta = \tilde\delta_{UMP}$. Second, for any $\tilde\delta \in \tilde{\mathcal{D}}$ that satisfies (4.13), we must have

$$P_D(\tilde\delta, \theta_1) \le P_D(\tilde\delta_{UMP}, \theta_1), \quad \forall \theta_1 \in \mathcal{X}_1. \qquad (4.14)$$
Consider the special case where $H_0$ is a simple hypothesis, i.e., $\mathcal{X}_0 = \{\theta_0\}$. Then we have only one constraint $P_F(\tilde\delta, \theta_0) \le \alpha$ in (4.12). Denote by

$$L_{\theta_1}(y) = \frac{p_{\theta_1}(y)}{p_{\theta_0}(y)}, \quad \theta_1 \in \mathcal{X}_1,$$

the likelihood ratio for $p_{\theta_1}$ versus $p_{\theta_0}$. The corresponding NP rule is given by

$$\tilde\delta_{NP}^{\theta_1}(y) = \begin{cases} 1 & L_{\theta_1}(y) > \eta \\ 1 \text{ w.p. } \gamma & L_{\theta_1}(y) = \eta \\ 0 & L_{\theta_1}(y) < \eta, \end{cases} \quad \theta_1 \in \mathcal{X}_1, \qquad (4.15)$$

where the test parameters η and γ (which could depend on θ1) are chosen to satisfy the constraint $P_F(\tilde\delta, \theta_0) \le \alpha$ with equality, as derived in Section 2.4. If a UMP test exists, then $\tilde\delta_{UMP}$ must coincide with the NP test $\tilde\delta_{NP}^{\theta_1}$ for each $\theta_1 \in \mathcal{X}_1$.¹
4.3.1 Examples

Example 4.4 Detection of One-Sided Composite Signal in Gaussian Noise. We revisit the Gaussian example of (4.2). We have $\theta_0 = 0$ and consider three scenarios for $\mathcal{X}_1$.

Case I: $\mathcal{X}_1 = (0, \infty)$, as illustrated in Figure 4.2. Then all NP rules $\tilde\delta_{NP}^{\theta_1}$ (where $\theta_1 > 0$) are of the form

$$\tilde\delta_{NP}^{\theta_1}(y) = \begin{cases} 1: & y \ge \tau \\ 0: & y < \tau, \end{cases}$$

where τ is positive and satisfies $P_F = Q(\tau/\sigma) = \alpha$. Hence $\tau = \sigma Q^{-1}(\alpha)$ does not depend on θ1, and $\tilde\delta_{NP}^{\theta_1}$ is UMP:

$$\delta^+(y) = \begin{cases} 1: & y \ge \sigma Q^{-1}(\alpha) \\ 0: & y < \sigma Q^{-1}(\alpha). \end{cases} \qquad (4.16)$$

Its power is $P_D(\delta_{UMP}, \theta_1) = Q(Q^{-1}(\alpha) - \theta_1/\sigma)$.

Case II: $\mathcal{X}_1 = (-\infty, 0)$, as illustrated in Figure 4.4. The NP rules are now of the form

$$\tilde\delta_{NP}^{\theta_1}(y) = \begin{cases} 1: & y \le \tau \\ 0: & y > \tau, \end{cases}$$

where τ is now negative and satisfies $P_F = Q(-\tau/\sigma) = \alpha$. Hence $\tau = -\sigma Q^{-1}(\alpha)$ does not depend on θ1, and the decision rule is UMP:

$$\delta^-(y) = \begin{cases} 1: & y \le -\sigma Q^{-1}(\alpha) \\ 0: & y > -\sigma Q^{-1}(\alpha). \end{cases} \qquad (4.17)$$

Its power is $P_D(\delta_{UMP}, \theta_1) = Q(Q^{-1}(\alpha) + \theta_1/\sigma)$.

Case III: $\mathcal{X}_1 = \mathbb{R}\setminus\{0\}$. Now the NP test depends on the sign of θ1. For example, the α-level tests corresponding to $\theta_1 = 1$ and $\theta_1 = -1$ are different, as in Cases I and II, respectively. Hence no UMP test exists.
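The Case I computation can be verified with the standard normal cdf; in the sketch below (the level α = 0.1 is illustrative), the threshold is computed once, independently of θ1, and the power formula is evaluated:

```python
from statistics import NormalDist

N = NormalDist()

def Q(x):
    # Gaussian tail function Q(x) = 1 - Phi(x)
    return 1.0 - N.cdf(x)

def power_delta_plus(theta1, alpha, sigma=1.0):
    # Case I UMP test delta+: decide H1 iff y >= tau, tau = sigma*Q^(-1)(alpha).
    # The threshold does not depend on theta1; power = Q(Q^(-1)(alpha) - theta1/sigma).
    tau = sigma * N.inv_cdf(1.0 - alpha)
    return Q((tau - theta1) / sigma)

alpha = 0.1  # illustrative level
# At theta1 = 0 the power equals the false-alarm level, and it grows with theta1:
assert abs(power_delta_plus(0.0, alpha) - alpha) < 1e-9
assert alpha < power_delta_plus(0.5, alpha) < power_delta_plus(1.5, alpha)
```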
Figure 4.4 Gaussian hypothesis testing with $\mathcal{X}_0 = \{0\}$ and $\mathcal{X}_1 = (-\infty, 0)$

Example 4.5 Detection of One-Sided Composite Signal in Cauchy Noise. From the Gaussian example (Example 4.4), we may be tempted to conclude that for problems involving signal detection in noise, UMP tests exist as long as $H_1$ is one-sided. To see that this is not true in general, we consider the example where the noise has a Cauchy distribution centered at θ, i.e.,

$$p_\theta(y) = \frac{1}{\pi[1 + (y - \theta)^2]},$$

where $\mathcal{X}_0 = \{0\}$ and $\mathcal{X}_1 = (0, \infty)$. Then

$$L_{\theta_1}(y) = \frac{1 + y^2}{1 + (y - \theta_1)^2}.$$

It can be verified that the α-level NP tests for $\theta_1 = 1$ and $\theta_1 = 2$ are different, and hence there is no UMP solution.
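The claim that the α-level NP tests differ can be checked numerically: for η > 1 the rejection region $\{y : L_{\theta_1}(y) \ge \eta\}$ is an interval whose Cauchy probability has a closed form, so η can be found by bisection. In the sketch below (α = 0.1 is illustrative), the regions for θ1 = 1 and θ1 = 2 come out different, confirming that no single test can be optimal for both:

```python
import math

def cauchy_np_region(theta1, eta):
    # {y : (1+y^2)/(1+(y-theta1)^2) >= eta}. For eta > 1 this is a finite
    # interval (the quadratic's leading coefficient 1-eta is negative).
    a = 1.0 - eta
    b = 2.0 * eta * theta1
    c = 1.0 - eta - eta * theta1 ** 2
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return None  # the likelihood ratio never reaches eta
    r = math.sqrt(disc)
    y1, y2 = (-b + r) / (2.0 * a), (-b - r) / (2.0 * a)
    return min(y1, y2), max(y1, y2)

def p0_mass(region):
    # P_0 probability of an interval under the standard Cauchy density
    lo, hi = region
    return (math.atan(hi) - math.atan(lo)) / math.pi

def np_region(theta1, alpha):
    # Bisect (geometrically) on eta > 1 so that P_0{L_{theta1}(Y) >= eta} = alpha
    lo, hi = 1.0 + 1e-9, 1e6
    for _ in range(300):
        mid = math.sqrt(lo * hi)
        reg = cauchy_np_region(theta1, mid)
        mass = p0_mass(reg) if reg else 0.0
        if mass > alpha:
            lo = mid
        else:
            hi = mid
    return cauchy_np_region(theta1, math.sqrt(lo * hi))

region_1 = np_region(1.0, 0.1)
region_2 = np_region(2.0, 0.1)
# Both regions have P_0 mass 0.1, but they are different sets.
```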
¹ To be UMP, the rule $\tilde\delta_{NP}^{\theta_1}$ must be independent of the unknown θ1. This may seem paradoxical because η and γ in (4.15) generally depend on θ1; however, so does $L_{\theta_1}(y)$. The apparent paradox is elucidated in Section 4.3.2 and illustrated in Figure 4.5.
4.3.2 Monotone Likelihood Ratio Theorem

THEOREM 4.1 If there exists a function $g: \mathcal{Y} \to \mathbb{R}$ and a collection of functions $F_{\theta_1}: \mathbb{R} \to \mathbb{R}^+$ that are strictly increasing, $\forall \theta_1 \in \mathcal{X}_1$, such that $L_{\theta_1}(y) = F_{\theta_1}(g(y))$, then the UMP test at significance level α exists for testing the simple hypothesis $H_0: \theta = \theta_0$ versus the composite hypothesis $H_1: \theta \in \mathcal{X}_1$, and is of the form

$$\tilde\delta_{UMP}(y) = \begin{cases} 1 & g(y) > \tau \\ 1 \text{ w.p. } \gamma & g(y) = \tau \\ 0 & g(y) < \tau, \end{cases} \qquad (4.18)$$

where τ and γ are such that

$$E_{\theta_0}[\tilde\delta(Y)] = P_{\theta_0}\{g(Y) > \tau\} + \gamma P_{\theta_0}\{g(Y) = \tau\} = \alpha.$$

Proof Since each function $F_{\theta_1}$ is strictly increasing,

$$L_{\theta_1}(y) > \eta_{\theta_1} \iff g(y) > \tau, \qquad L_{\theta_1}(y) = \eta_{\theta_1} \iff g(y) = \tau, \quad \forall \theta_1 \in \mathcal{X}_1,$$

for some τ. Hence the NP rule (4.15) can be written in the form given in (4.18). Now, to achieve a significance level of α, we set

$$\alpha = P_{\theta_0}\{g(Y) > \tau\} + \gamma P_{\theta_0}\{g(Y) = \tau\}.$$

The solution to this equation yields τ and γ that are independent of θ1, which implies that the rule in (4.18) is UMP.

Figure 4.5 Monotone likelihood ratio: for each θ, the likelihood ratio is an increasing function of $g(y)$
Example 4.6 Example 4.4, revisited. The likelihood ratio is given by

$$L_{\theta_1}(y) = \frac{p_{\theta_1}(y)}{p_0(y)} = \frac{\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(y-\theta_1)^2}{2\sigma^2}\right\}}{\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{y^2}{2\sigma^2}\right\}} = \exp\left\{-\frac{\theta_1^2}{2\sigma^2}\right\}\exp\left\{\frac{\theta_1 y}{\sigma^2}\right\}.$$

Case I: $\theta_1 > 0$. Choose $g(y) = y$, and thus

$$F_{\theta_1}(g) = \exp\left\{-\frac{\theta_1^2}{2\sigma^2}\right\}\exp\left\{\frac{\theta_1 g}{\sigma^2}\right\},$$

which is strictly increasing in g for all $\theta_1 > 0$. Alternatively, one could have taken $g(y) = e^y$, amongst other choices.

Case II: $\theta_1 < 0$. Choose $g(y) = -y$; the resulting $F_{\theta_1}(g) = \exp\{-\theta_1^2/(2\sigma^2)\}\exp\{-\theta_1 g/\sigma^2\}$ is strictly increasing in g, as in Case I.

Case III: $\theta_1 \neq 0$. We cannot find an appropriate g, and no UMP test exists, as we showed in Example 4.4.
4.3.3 Both Composite Hypotheses

We now consider the more general case where both hypotheses are composite. Finding a UMP solution in this case, i.e., a test that satisfies (4.13) and (4.14), is considerably more complicated. The following example illustrates a case where a UMP solution can be found, and a procedure to find it. Also see Exercises 4.14 and 4.15.

Example 4.7 Testing Between Two One-Sided Composite Signals in Gaussian Noise. This is an extension of the Gaussian example (Example 4.4) where $\mathcal{X}_0 = (-\infty, 0]$ and $\mathcal{X}_1 = (0, \infty)$. For fixed $\theta_0 \in \mathcal{X}_0$ and $\theta_1 \in \mathcal{X}_1$, $L_{\theta_0,\theta_1}(y)$ has no point masses under $P_{\theta_0}$ or $P_{\theta_1}$. Therefore $\tilde\delta_{NP}^{\theta_0,\theta_1}$ is a deterministic LRT, which can be reduced to a threshold test on Y:

$$\delta_{NP}^{\theta_0,\theta_1}(y) = \begin{cases} 1: & L_{\theta_0,\theta_1}(y) \ge \eta \\ 0: & L_{\theta_0,\theta_1}(y) < \eta \end{cases} \;=\; \begin{cases} 1: & y \ge \tau_{\theta_0,\theta_1} \\ 0: & y < \tau_{\theta_0,\theta_1}, \end{cases}$$

where the threshold $\tau_{\theta_0,\theta_1}$ is given by

$$\tau_{\theta_0,\theta_1} = \frac{\sigma^2 \ln\eta}{\theta_1 - \theta_0} + \frac{\theta_0 + \theta_1}{2}.$$

The false-alarm probability $Q((\tau - \theta_0)/\sigma)$ of a threshold test $\delta_\tau$ is largest at the boundary point $\theta_0 = 0$ of $\mathcal{X}_0$, so (4.13) is met with equality by $\delta_\tau$ with $\tau = \sigma Q^{-1}(\alpha)$. Any $\tilde\delta$ satisfying (4.13) is, in particular, an α-level test for the simple problem of testing $\theta = 0$ against $\theta = \theta_1$, for which $\delta_\tau$ is the NP rule at level α for every $\theta_1 > 0$, and it cannot be more powerful than $\delta_\tau$, i.e.,

$$P_D(\tilde\delta, \theta_1) \le P_D(\delta_\tau, \theta_1), \quad \forall \theta_1 > 0.$$

Therefore (4.14) holds and we have

$$\delta_{UMP}(y) = \delta_\tau(y) = \begin{cases} 1: & y \ge \sigma Q^{-1}(\alpha) \\ 0: & y < \sigma Q^{-1}(\alpha). \end{cases} \qquad (4.19)$$
4.4 Locally Most Powerful Test

When no UMP test exists, we may instead maximize the probability of correct detection in the neighborhood of θ0, because the two hypotheses are least distinguishable in that region. Stated explicitly:

$$\max_{\tilde\delta}\; P_D(\tilde\delta, \theta) \;\text{ as } \theta \downarrow \theta_0 \quad \text{subject to } P_F(\tilde\delta, \theta_0) \le \alpha. \qquad (4.20)$$

Assuming the first two derivatives of $P_D(\tilde\delta, \cdot)$ exist and are finite in the interval $[\theta_0, \theta]$, we write the first-order Taylor series expansion around θ0 as

$$P_D(\tilde\delta, \theta) = P_D(\tilde\delta, \theta_0) + (\theta - \theta_0)\,P_D'(\tilde\delta, \theta_0) + O((\theta - \theta_0)^2),$$

where $P_D(\tilde\delta, \theta_0) = P_F(\tilde\delta, \theta_0) = \alpha$. Thus we have the approximation

$$P_D(\tilde\delta, \theta) \cong \alpha + (\theta - \theta_0)\,P_D'(\tilde\delta, \theta_0). \qquad (4.21)$$

Since $\theta - \theta_0 \ge 0$, the optimization in (4.20) is replaced by

$$\max_{\tilde\delta}\; P_D'(\tilde\delta, \theta_0) \quad \text{subject to } P_F(\tilde\delta, \theta_0) \le \alpha. \qquad (4.22)$$

For a randomized decision rule as in Definition 1.4,

$$P_D(\tilde\delta, \theta) = \sum_{\ell=1}^{L} \gamma_\ell\, P_D(\delta_\ell, \theta) = \sum_{\ell=1}^{L} \gamma_\ell \int_{\mathcal{Y}} 1\{\delta_\ell(y) = 1\}\, p_\theta(y)\,dy$$

and

$$P_D'(\tilde\delta, \theta_0) = \sum_{\ell=1}^{L} \gamma_\ell \int_{\mathcal{Y}} 1\{\delta_\ell(y) = 1\}\, \left.\frac{\partial p_\theta(y)}{\partial\theta}\right|_{\theta=\theta_0} dy.$$

The solution to the locally optimal detection problem of (4.22) can be shown to be equivalent to NP testing between $p_{\theta_0}(y)$ and

$$\left.\frac{\partial p_\theta(y)}{\partial\theta}\right|_{\theta=\theta_0}.$$

Even though the latter quantity is not necessarily a pdf (or pmf), the steps that we followed in deriving the NP solution in Section 2.4 can be repeated to show that the solution to (4.22) has the form

$$\tilde\delta_{LMP}(y) = \begin{cases} 1 & L'(y) > \eta \\ 1 \text{ w.p. } \gamma & L'(y) = \eta \\ 0 & L'(y) < \eta, \end{cases}$$

where $L'(y) = \left.\frac{\partial p_\theta(y)}{\partial\theta}\right|_{\theta=\theta_0} \big/\, p_{\theta_0}(y)$.
Example 4.8 Consider again the Cauchy problem of Example 4.5, with $\theta_0 = 0$ and $\mathcal{X}_1 = (0, \infty)$. As indicated in Example 4.5, there is no UMP solution to this problem. We therefore examine the LMP solution:

$$p_\theta(y) = \frac{1}{\pi[1 + (y - \theta)^2]} \implies \left.\frac{\partial}{\partial\theta} p_\theta(y)\right|_{\theta=0} = \frac{2y}{\pi(1 + y^2)^2}.$$

Thus

$$L'(y) = \frac{2y}{1 + y^2}$$

and

$$\tilde\delta_{LMP}(y) = \begin{cases} 1: & L'(y) \ge \eta \\ 0: & L'(y) < \eta. \end{cases}$$

Since $|L'(y)| \le 1$, we have $P_0\{L'(Y) > \eta\} = 1$ for $\eta \le -1$, and $P_0\{L'(Y) > \eta\} = 0$ for $\eta \ge 1$. For $\eta \in [0, 1)$, it is easy to show that

$$P_0\{L'(Y) > \eta\} = \frac{1}{\pi}\int_a^b \frac{1}{1 + y^2}\,dy = \frac{1}{\pi}\left[\tan^{-1}(b) - \tan^{-1}(a)\right],$$

where $a = \eta^{-1} - \sqrt{\eta^{-2} - 1}$ and $b = \eta^{-1} + \sqrt{\eta^{-2} - 1}$. Similarly, for $\eta \in (-1, 0)$, it is easy to show that

$$P_0\{L'(Y) > \eta\} = 1 + \frac{1}{\pi}\left[\tan^{-1}(a) - \tan^{-1}(b)\right].$$

Using these expressions for $P_0\{L'(Y) > \eta\}$, we can find η to meet a false-alarm constraint of $\alpha \in (0, 1)$.
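The closed-form tail probability can be cross-checked by direct integration of the Cauchy density (the grid size below is an illustrative accuracy choice; at η = 0.5 the closed form evaluates to exactly 1/3):

```python
import math

def lmp_tail(eta):
    # Closed form for P_0{ L'(Y) > eta }, 0 < eta < 1, with L'(y) = 2y/(1+y^2):
    # the rejection region is the interval (a, b) below.
    s = math.sqrt(1.0 / eta ** 2 - 1.0)
    a, b = 1.0 / eta - s, 1.0 / eta + s
    return (math.atan(b) - math.atan(a)) / math.pi

def lmp_tail_numeric(eta, n=100000):
    # Check by integration: with y = tan(phi), phi is uniform on (-pi/2, pi/2)
    # under the standard Cauchy law, so we just count grid points in the region.
    hits = 0
    for k in range(n):
        phi = -math.pi / 2.0 + (k + 0.5) * math.pi / n
        y = math.tan(phi)
        if 2.0 * y / (1.0 + y * y) > eta:
            hits += 1
    return hits / n
```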
4.5 Generalized Likelihood Ratio Test

Since UMP tests often do not exist, and LMP tests are unsuitable for two-sided composite hypothesis testing, another technique is often needed. One approach is based on the generalized likelihood ratio (GLR), which is defined by

$$L_G(y) \triangleq \frac{\max_{\theta\in\mathcal{X}_1} p_\theta(y)}{\max_{\theta\in\mathcal{X}_0} p_\theta(y)} = \frac{p_{\hat\theta_1}(y)}{p_{\hat\theta_0}(y)}, \qquad (4.23)$$

where $\hat\theta_1(y)$ and $\hat\theta_0(y)$ are the maximum-likelihood estimates of θ given y under $H_1$ and $H_0$, respectively. Note that evaluation of $L_G(y)$ involves a maximization over θ0 and θ1, and so this test statistic can be considerably more complex than the LRT. Also, the result of the maximization generally does not produce a pdf (or pmf) in the numerator and denominator. We can use the statistic $L_G(y)$ to produce a test, which is called the generalized likelihood ratio test (GLRT):

$$\tilde\delta_{GLRT}(y) = \begin{cases} 1 & L_G(y) > \eta \\ 1 \text{ w.p. } \gamma & L_G(y) = \eta \\ 0 & L_G(y) < \eta. \end{cases} \qquad (4.24)$$

4.5.1 GLRT for Gaussian Hypothesis Testing

Consider again the Gaussian problem of (4.2), now with $\mathcal{X}_0 = \{0\}$ and $\mathcal{X}_1 = \mathbb{R}\setminus\{0\}$, for which no UMP test exists (Example 4.4, Case III). The GLRT reduces to a threshold test on |y| (see Figure 4.6): $\delta_{GLRT}(y) = 1\{|y| \ge \tau\}$, where the false-alarm constraint $P_F = 2Q(\tau/\sigma) = \alpha$ gives $\tau = \sigma Q^{-1}(\alpha/2)$. Its power is

$$P_D(\delta_{GLRT}, \theta) = P_\theta\{|Y| \ge \tau\} = Q\left(\frac{\tau + \theta}{\sigma}\right) + Q\left(\frac{\tau - \theta}{\sigma}\right) = Q\left(Q^{-1}\left(\frac{\alpha}{2}\right) + \frac{\theta}{\sigma}\right) + Q\left(Q^{-1}\left(\frac{\alpha}{2}\right) - \frac{\theta}{\sigma}\right).$$
Figure 4.7 compares the power of the UMP rules $\delta^+$ of (4.16) and $\delta^-$ of (4.17) with that of the GLRT rule, and Figure 4.8 compares their ROCs. As should be expected, the GLRT performs somewhat worse than the UMP test. However, it should be recalled that, unlike the GLRT, the UMP test requires knowledge of the sign of θ. The GLRT has found a great variety of applications. In general, the GLRT performs well when the estimates $\hat\theta_0$ and $\hat\theta_1$ are reliable.
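The power formula above, together with the UMP power from Example 4.4, reproduces the ordering visible in Figure 4.7; a short check (α = 0.05 is an illustrative level):

```python
from statistics import NormalDist

N = NormalDist()

def Q(x):
    return 1.0 - N.cdf(x)

def Qinv(p):
    return N.inv_cdf(1.0 - p)

def power_glrt(theta, alpha, sigma=1.0):
    # Two-sided GLRT |y| >= tau, with tau = sigma * Q^(-1)(alpha/2)
    return Q(Qinv(alpha / 2.0) + theta / sigma) + Q(Qinv(alpha / 2.0) - theta / sigma)

def power_ump(theta, alpha, sigma=1.0):
    # One-sided UMP test delta+ (requires knowing the sign of theta)
    return Q(Qinv(alpha) - theta / sigma)

alpha = 0.05  # illustrative level
assert abs(power_glrt(0.0, alpha) - alpha) < 1e-9  # the level is met with equality
for theta in [0.5, 1.0, 2.0]:
    # The GLRT gives up a little power relative to the sign-aware UMP test.
    assert power_glrt(theta, alpha) <= power_ump(theta, alpha)
```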
Figure 4.6 Thresholds for GLRT on Gaussian distributions
Figure 4.7 $P_D$ versus θ for UMP and GLRT
Figure 4.8 ROC for $(\delta^+, \delta^-)$ and $\delta_{GLRT}$ for fixed θ

4.5.2 GLRT for Cauchy Hypothesis Testing
Example 4.9 Detection of One-Sided Composite Signal in Cauchy Noise (continued). This problem was introduced in Example 4.5, and we saw that there is no UMP solution. We studied the LMP solution in Example 4.8, and now we examine the GLRT solution. The GLR statistic is given by

$$L_G(y) = \frac{\sup_{\theta>0} p_\theta(y)}{p_0(y)},$$

with

$$\sup_{\theta>0} p_\theta(y) = \sup_{\theta>0} \frac{1}{\pi[1 + (y - \theta)^2]} = \begin{cases} \dfrac{1}{\pi} & y \ge 0 \\ \dfrac{1}{\pi(1 + y^2)} & y < 0. \end{cases}$$

Thus

$$L_G(y) = \begin{cases} 1 + y^2 & y \ge 0 \\ 1 & y < 0. \end{cases}$$

To find an α-level test we need to evaluate $P_0\{L_G(Y) > \eta\}$. Clearly $P_0\{L_G(Y) > \eta\} = 1$ for $0 \le \eta < 1$. For $\eta \ge 1$,

$$P_0\{L_G(Y) > \eta\} = \int_{\sqrt{\eta-1}}^{\infty} \frac{1}{\pi}\,\frac{1}{1 + y^2}\,dy = \frac{1}{2} - \frac{\tan^{-1}(\sqrt{\eta - 1})}{\pi}.$$

There is a point of discontinuity in $P_0\{L_G(Y) > \eta\}$ at η = 1, as the value drops from 1 to the left to 1/2 to the right. For $\alpha \in (\frac12, 1]$, we would need to randomize to meet the $P_F$ constraint with equality. For $\alpha \in (0, \frac12]$, which would be more relevant in practice, the GLRT is a deterministic test:

$$\delta_{GLRT}(y) = \begin{cases} 1: & L_G(y) \ge \eta \\ 0: & L_G(y) < \eta, \end{cases}$$

where η is chosen so that $P_0\{L_G(Y) \ge \eta\} = \alpha$, i.e., $\eta = 1 + \tan^2\left(\pi\left(\tfrac12 - \alpha\right)\right)$.

4.8 Composite m-ary Hypothesis Testing

We now consider the m-ary composite hypothesis testing problem

$$H_j : \theta \in \mathcal{X}_j, \quad j = 0, 1, \ldots, m-1, \qquad (4.31)$$
where the sets $\{\mathcal{X}_j\}_{j=0}^{m-1}$ form a partition of the state space $\mathcal{X}$.

4.8.1 Random Parameter Θ

Assume a prior π on $\mathcal{X}$ and a cost function $C(i, \theta)$, $0 \le i \le m-1$, $\theta \in \mathcal{X}$. Then $\pi_j = \int_{\mathcal{X}_j} \pi(\theta)\,d\nu(\theta)$ is the prior probability that hypothesis $H_j$ is true. The Bayes rule takes the form

$$\delta_B(y) = \arg\min_{0 \le i \le m-1} C(i|y),$$

where

$$C(i|y) = \int_{\mathcal{X}} C(i,\theta)\,\pi(\theta|y)\,d\nu(\theta)$$

is the posterior cost of deciding in favor of $H_i$.
If the costs are uniform over each hypothesis, i.e.,

$$C(i,\theta) = C_{ij}, \quad \forall i, j, \; \forall\theta \in \mathcal{X}_j,$$

the classification problem reduces to simple m-ary hypothesis testing with conditional distributions

$$p(y|H_j) = \int_{\mathcal{X}_j} p_\theta(y)\,\pi_j(\theta)\,d\nu(\theta), \quad 0 \le j \le m-1,$$

where $\pi_j(\theta) = \pi(\theta)/\pi_j$ for $\theta \in \mathcal{X}_j$, as in (4.7). For uniform costs $C_{ij} = 1\{i \neq j\}$, this reduces to MAP detection:

$$\delta_B(y) = \arg\max_{0 \le i \le m-1}\,[\pi_i\, p(y|H_i)]. \qquad (4.32)$$

All this is a straightforward extension of the binary case (m = 2) in Section 4.2.

4.8.2 Non-Dominated Tests
If θ is not modeled as random, the choice of the decision rule implies some tradeoff between the various types of error. The exposition of nondominated tests in Section 4.7 extends directly to the m ary case: neither Theorem 4.2 nor its corollary required that m be equal to 2.
Example 4.12 Composite Ternary Hypothesis Testing. Let m = 3 and $P_\theta = \mathcal{N}(\theta, \sigma^2)$. Assume that $\mathcal{X}_0 = \{0\}$, $\mathcal{X}_1$ is a finite subset of $(0, \infty)$, $\mathcal{X}_2$ is a finite subset of $(-\infty, 0)$, and the costs are uniform. We claim that all nondominated tests are of the form

$$\delta(y) = \begin{cases} 0: & \tau_2 \le y \le \tau_1 \\ 1: & y > \tau_1 \\ 2: & y < \tau_2, \end{cases} \qquad (4.33)$$

where $\tau_2 \le \tau_1$. This defines a two-dimensional family of tests parameterized by $(\tau_1, \tau_2)$. The conditional risks for these tests are given by

$$R_0(\delta) = Q\left(\frac{\tau_1}{\sigma}\right) + Q\left(\frac{-\tau_2}{\sigma}\right),$$
$$R_\theta(\delta) = Q\left(\frac{\theta - \tau_1}{\sigma}\right), \quad \theta > 0,$$
$$R_\theta(\delta) = Q\left(\frac{\tau_2 - \theta}{\sigma}\right), \quad \theta < 0.$$

To show (4.33), we apply Theorem 4.2 and derive the MAP tests for any prior π on $\mathcal{X}$. We have $\delta_B(y) = \arg\max_{i=0,1,2} f_i(y)$, where

$$f_i(y) = \pi_i\,\frac{p(y|H_i)}{p(y|H_0)} = \begin{cases} \pi(0) & i = 0 \\ \sum_{\theta\in\mathcal{X}_i} \pi(\theta)\exp\{(y\theta - \theta^2/2)/\sigma^2\} & i = 1, 2. \end{cases}$$

Observe that $f_0$ is constant, $f_1$ is monotonically increasing, and $f_2$ is monotonically decreasing. Hence the decision regions $\mathcal{Y}_1$ and $\mathcal{Y}_2$ are semi-infinite intervals extending to $+\infty$ and $-\infty$, respectively. If $\pi(0)$ is small enough, the test never returns 0 and satisfies (4.33) with $\tau_1 = \tau_2$ satisfying $f_1(\tau_1) = f_2(\tau_1)$. Else the test satisfies (4.33) with $\tau_1 > \tau_2$ satisfying $f_1(\tau_1) = \pi(0) = f_2(\tau_2)$.
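The conditional risks of the two-parameter family (4.33) are immediate to tabulate; a minimal sketch (the symmetric thresholds τ1 = 1, τ2 = −1 are illustrative):

```python
from statistics import NormalDist

N = NormalDist()

def Q(x):
    return 1.0 - N.cdf(x)

def ternary_risks(tau1, tau2, sigma=1.0):
    # Conditional risks of the test (4.33): decide 0 on [tau2, tau1],
    # 1 above tau1, and 2 below tau2.
    r0 = Q(tau1 / sigma) + Q(-tau2 / sigma)
    r_pos = lambda theta: Q((theta - tau1) / sigma)  # theta > 0
    r_neg = lambda theta: Q((tau2 - theta) / sigma)  # theta < 0
    return r0, r_pos, r_neg

r0, r_pos, r_neg = ternary_risks(1.0, -1.0)
assert abs(r0 - 2.0 * Q(1.0)) < 1e-12
assert r_pos(2.0) < r_pos(0.5)  # stronger positive signals are easier to detect
assert abs(r_pos(1.5) - r_neg(-1.5)) < 1e-12  # symmetry of the symmetric test
```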
4.8.3 m-GLRT

There is no commonly used m-ary version of the GLRT. The following prescription seems natural enough in view of Corollary 4.1. Choose a probability vector π on $\{0, 1, \ldots, m-1\}$ and define

$$\delta_{GLRT}(y) = \arg\max_{0 \le i \le m-1}\,\left[\pi_i \max_{\theta\in\mathcal{X}_i} p_\theta(y)\right]. \qquad (4.34)$$

Note this reduces to the GLRT (4.24) in the case m = 2 if one chooses $\pi_0 = \frac{\eta}{1+\eta}$. As with the GLRT, good performance can be expected when the parameter estimates $\hat\theta_i = \arg\max_{\theta\in\mathcal{X}_i} p_\theta(y)$, $0 \le i \le m-1$, are reliable.
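For finite parameter sets $\mathcal{X}_i$, (4.34) is direct to implement; the sketch below uses Gaussian likelihoods and an illustrative ternary configuration in the spirit of Example 4.12 (all numerical values are assumptions for illustration):

```python
import math

def m_glrt(y, classes, priors, sigma=1.0):
    # (4.34) for finite parameter sets: decide argmax_i pi_i * max_{theta in X_i} p_theta(y),
    # here with Gaussian likelihoods p_theta = N(theta, sigma^2).
    def p(theta):
        z = (y - theta) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    scores = [pi * max(p(th) for th in X_i) for pi, X_i in zip(priors, classes)]
    return max(range(len(scores)), key=scores.__getitem__)

# Illustrative ternary problem: X0 = {0}, X1 = {1, 2}, X2 = {-1, -2}, uniform pi.
classes = [[0.0], [1.0, 2.0], [-1.0, -2.0]]
priors = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]
assert m_glrt(0.1, classes, priors) == 0
assert m_glrt(1.8, classes, priors) == 1
assert m_glrt(-1.5, classes, priors) == 2
```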
4.9
Robust Hypothesis Testing
Thus far we have studied the case where the distributions of the observations under the two hypotheses are not specified explicitly but come from a parametric family of densities $\{p_\theta, \theta \in \mathcal{X}\}$. We now generalize this notion of uncertainty to the case where the probability distributions under the two hypotheses belong to classes (sets) of distributions, often termed uncertainty classes, which are not necessarily parametric. We denote the uncertainty class for hypothesis $H_j$ by $\mathcal{P}_j$, $j = 0, 1$, and assume that the sets $\mathcal{P}_0$ and $\mathcal{P}_1$ are disjoint. The hypotheses can then be described by
$$H_0 : P_0 \in \mathcal{P}_0, \qquad H_1 : P_1 \in \mathcal{P}_1. \qquad (4.35)$$

We could look for a UMP solution to (4.35) by trying to find a test $\tilde\delta_{UMP}$ that satisfies the following two properties. First,

$$\sup_{P_0\in\mathcal{P}_0} P_F(\tilde\delta, P_0) \le \alpha \qquad (4.36)$$

for $\tilde\delta = \tilde\delta_{UMP}$. Second, for any $\tilde\delta \in \tilde{\mathcal{D}}$ that satisfies (4.36),

$$P_D(\tilde\delta, P_1) \le P_D(\tilde\delta_{UMP}, P_1), \quad \forall P_1 \in \mathcal{P}_1, \qquad (4.37)$$

where $P_F(\tilde\delta, P_0) = P_0\{\tilde\delta(Y) = 1\}$ and $P_D(\tilde\delta, P_1) = P_1\{\tilde\delta(Y) = 1\}$. However, as we saw in Section 4.3, such UMP solutions are rare even when the distributions under the two hypotheses belong to parametric families, and so proceeding along these lines is generally futile.

An alternative approach to dealing with model uncertainty, first proposed by Huber [3], is the minimax approach, where the goal is to optimize the worst-case detection performance over the uncertainty classes. The decision rules thus obtained are said to be robust to the uncertainties in the probability distributions. The minimax robust version of the Neyman–Pearson formulation is given by¹

$$\max_{\tilde\delta\in\tilde{\mathcal{D}}}\; \inf_{P_1\in\mathcal{P}_1} P_D(\tilde\delta, P_1) \quad \text{subject to} \quad \sup_{P_0\in\mathcal{P}_0} P_F(\tilde\delta, P_0) \le \alpha. \qquad (4.38)$$
One approach to finding a solution to the robust NP problem in (4.38) is based on the notion of stochastic ordering, described in the following definitions.

DEFINITION 4.1 A random variable U is said to be stochastically larger than a random variable V under probability distribution P if

$$P\{U > t\} \ge P\{V > t\}, \quad \text{for all } t \in \mathbb{R}.$$

DEFINITION 4.2 Given two candidate probability distributions P and Q for a random variable X, X is stochastically larger under distribution Q than under distribution P if

$$Q\{X > t\} \ge P\{X > t\}, \quad \text{for all } t \in \mathbb{R}.$$

These definitions lead to the following definition of joint stochastic boundedness (JSB) of a pair of classes of probability distributions.

DEFINITION 4.3 Joint Stochastic Boundedness [4]. A pair $(\mathcal{P}_0, \mathcal{P}_1)$ of classes of probability distributions defined on a measurable observation space $\mathcal{Y}$ is said to be jointly stochastically bounded by $(Q_0, Q_1)$ if there exist distributions $Q_0 \in \mathcal{P}_0$ and $Q_1 \in \mathcal{P}_1$ such that for any $(P_0, P_1) \in \mathcal{P}_0 \times \mathcal{P}_1$ and all $t \in \mathbb{R}$,

$$P_0\{\ln L_q(Y) > t\} \le Q_0\{\ln L_q(Y) > t\}$$
and
$$P_1\{\ln L_q(Y) > t\} \ge Q_1\{\ln L_q(Y) > t\},$$

where $L_q(y) = \frac{q_1(y)}{q_0(y)}$ is the likelihood ratio between $Q_1$ and $Q_0$. The pair $(Q_0, Q_1)$ is called the pair of least favorable distributions (LFDs) for the pair of uncertainty classes $(\mathcal{P}_0, \mathcal{P}_1)$. (See Figure 4.9.)
Figure 4.9 Least favorable distributions

¹ Minimax robust versions of the Bayesian and minimax hypothesis testing problems can similarly be defined.
REMARK 4.1 It is not difficult to see that Definitions 4.1–4.3 remain unchanged if we replace "> t" with "≥ t" throughout; this is relevant to the case where the random variables involved in the definitions have point masses at some points on the real line. For example, by taking convex combinations of the definitions with "> t" and "≥ t," we can see that Definition 4.2 is equivalent to the following condition:

$$Q\{X > t\} + \gamma\, Q\{X = t\} \ge P\{X > t\} + \gamma\, P\{X = t\}, \quad \forall t \in \mathbb{R}, \; \forall\gamma\in[0,1].$$

Definition 4.3 says that the log-likelihood ratio $\ln L_q(Y)$ is stochastically the smallest under $Q_1$ within class $\mathcal{P}_1$, and stochastically the largest under $Q_0$ within class $\mathcal{P}_0$. The following example, which considers the special case where $\mathcal{P}_0$ and $\mathcal{P}_1$ are parametric classes, illustrates the notion of joint stochastic boundedness.
Example 4.13 Least Favorable Distributions. Let $G_\theta$ denote the Gaussian probability distribution with mean θ and variance 1, and consider the parametric classes of distributions

$$\mathcal{P}_0 = \{G_\theta : \theta \in [-1, 0]\}, \qquad \mathcal{P}_1 = \{G_\theta : \theta \in [1, 2]\}.$$

We will establish that the pair $Q_0 = G_0$ and $Q_1 = G_1$ is a least favorable pair of distributions for the pair of uncertainty classes $(\mathcal{P}_0, \mathcal{P}_1)$. To see this, we first compute the log-likelihood ratio:

$$\ln L_q(Y) = \ln\frac{q_1(Y)}{q_0(Y)} = Y - \tfrac12.$$

Now consider any $P_0 \in \mathcal{P}_0$, say $P_0 = G_{\theta_0}$, $\theta_0 \in [-1, 0]$. Then for all $t \in \mathbb{R}$,

$$P_0\{\ln L_q(Y) > t\} = P_0\{Y > t + \tfrac12\} = 1 - \Phi(t + \tfrac12 - \theta_0) \le 1 - \Phi(t + \tfrac12) = Q_0\{Y > t + \tfrac12\} = Q_0\{\ln L_q(Y) > t\},$$

where Φ is the cdf of the $\mathcal{N}(0, 1)$ distribution. Similarly, it is easy to see that for any $P_1 \in \mathcal{P}_1$ and all $t \in \mathbb{R}$,

$$P_1\{\ln L_q(Y) > t\} \ge Q_1\{\ln L_q(Y) > t\}.$$
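The stochastic-ordering argument reduces to monotonicity of the Gaussian tail in its mean, which is easy to verify numerically (the grids of t and θ values below are illustrative):

```python
from statistics import NormalDist

Phi = NormalDist().cdf

def tail(theta, t):
    # P_{G_theta}{ Y > t } = 1 - Phi(t - theta): increasing in theta.
    return 1.0 - Phi(t - theta)

# Q0 = G_0 maximizes the tail over P0 = {G_theta : theta in [-1, 0]}, and
# Q1 = G_1 minimizes it over P1 = {G_theta : theta in [1, 2]}, for every t:
for t in [-2.0, -0.5, 0.0, 1.5, 3.0]:
    assert max(tail(th, t) for th in [-1.0, -0.5, 0.0]) == tail(0.0, t)
    assert min(tail(th, t) for th in [1.0, 1.5, 2.0]) == tail(1.0, t)
```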
A saddlepoint solution to the minimax robust NP testing problem of (4.38) can easily be constructed if the pair { 0, 1 } is jointly stochastically bounded, as the following result shows. Similar solutions can also be obtained for the minimax robust versions of the Bayesian and minimax binary hypothesis testing problems.
P P
P P
4.3 Suppose the pair of uncertainty classes { 0, 1 } is jointly stochastically bounded, with least favorable distributions given by (Q0, Q1 ). Then the solution to the minimax robust NP testing problem of (4.38) is given by the solution to the simple hypothesis testing problem
THEOREM
maximize PD (δ˜ , Q1 ) subject to PF (δ˜, Q0 ) ≤ α,
δ˜∈ D˜
4.9
Robust Hypothesis Testing
95
which is the randomized LRT
⎧ ⎪⎨1 ∗ δ˜ ( y ) = ⎪1 w.p. γ ∗ : ⎪⎩0
> ln L q ( y ) = η∗ , < where the threshold η∗ and randomization parameter γ ∗ are chosen so that P F (δ˜ ∗, Q0) = Q0 {ln L q (Y ) > η∗ } + γ ∗ Q0 {ln L q (Y ) = η∗ } = α. Proof By the JSB property (also see Remark 4.1), for all P0 ∈ P0, P F (δ˜ ∗, P0 ) = P0 {ln L q (Y ) > η∗ } + γ ∗ P 0{ln L q (Y ) = η ∗} ≤ Q0{ln L q (Y ) > η∗ } + γ ∗Q0 {ln Lq (Y ) = η∗ } = PF (δ˜∗ , Q0) ≤ α. Therefore δ˜∗ satisﬁes the constraint for the minimax robust NP testing problem of (4.38), i.e.,
sup PF (δ˜∗ , P0 )
P0 ∈P0
≤ α.
∈ P1 , and for all δ˜ ∈ D˜ such that supP ∈P PF (δ˜, P0 ) ≤ α , PD (δ˜∗ , P1 ) = P1 {ln L q (Y ) > η∗ } + γ ∗ P 1{ln L q (Y ) = η∗ } (a) ≥ Q1{ln L q (Y ) > η ∗} + γ ∗ Q1 {ln L q (Y ) = η∗ } = PD(δ˜∗ , Q1) (4.39) (b) ≥ PD(δ˜ , Q1 ) ≥ P inf P (δ˜ , P 1), ∈P D
Furthermore, for all P1
0
1
0
1
where (a) follows from the JSB property, and (b) follows from the optimality of δ˜ ∗ for NP testing between Q 0 and Q1 at level α. Taking the inﬁmum over P 1 ∈ P1 on the left hand side of (4.39) establishes that δ˜∗ is a solution to (4.38). The uncertainty classes considered in Huber’s original work [3] were neighborhood classes containing, under each hypothesis, a nominal distribution and distributions in its vicinity, in particular the µcontamination and the total variation classes. In a later paper, Huber and Strassen [5] generalized the notion of neighborhood classes to those that can be described in terms of alternating capacities of order 2. When the observation space is compact, several uncertainty models such as µcontaminated neighborhoods, total variation neighborhoods, bandclasses, and ppoint classes are special cases of this model with different choices of capacity. All of these uncertainty classes can be shown to possess the JSB property of Deﬁnition 4.3, and hence the corresponding minimax
96
Composite Hypothesis Testing
robust hypothesis testing problems can be shown to have saddlepoint solutions constructed from the LFDs in a uniﬁed way using Theorem 4.3 (or its Bayesian or minimax counterparts).
4.9.1 Robust Detection with Conditionally Independent Observations

Now consider the case where the observation Y is a vector (block) of observations $(Y_1, \ldots, Y_n)$, where each observation $Y_k$ takes values in a measurable space $\mathcal{Y}_k$ and has a distribution which belongs to the class $\mathcal{P}_j^{(k)}$ when the hypothesis is $H_j$, $j = 0, 1$. Let the set

$$\mathcal{P}_j = \mathcal{P}_j^{(1)} \times \cdots \times \mathcal{P}_j^{(n)}$$

represent a class of distributions on $\mathcal{Y} = \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_n$ which are products of distributions in $\mathcal{P}_j^{(k)}$, $k = 1, \ldots, n$. To further clarify this point, let $P_j^{(k)}$ denote a typical element of $\mathcal{P}_j^{(k)}$. Then $P_j \in \mathcal{P}_j$ represents the product distribution $P_j^{(1)} \times \cdots \times P_j^{(n)}$, i.e., the observations $(Y_1, \ldots, Y_n)$ are independent under $P_j$. We have the following result, which says that if the uncertainty classes corresponding to each of the observations satisfy the JSB property, and if the observations are conditionally independent given the hypothesis, then the uncertainty classes corresponding to the entire block of observations also satisfy the JSB property.

LEMMA 4.1 For each $k = 1, \ldots, n$, let the pair $(\mathcal{P}_0^{(k)}, \mathcal{P}_1^{(k)})$ be jointly stochastically bounded by $(Q_0^{(k)}, Q_1^{(k)})$. Then the pair $(\mathcal{P}_0, \mathcal{P}_1)$ is jointly stochastically bounded by $(Q_0, Q_1)$, where $Q_j = Q_j^{(1)} \times \cdots \times Q_j^{(n)}$, $j = 0, 1$.
The proof of Lemma 4.1 follows quite easily from the following result, whose proof is left to the reader as Exercise 4.16.

LEMMA 4.2 Let $Z_1, Z_2, \ldots, Z_n$ be independent random variables. Suppose that $Z_k$ is stochastically larger under distribution $Q^{(k)}$ than under $P^{(k)}$, i.e.,

$$Q^{(k)}\{Z_k > t_k\} \ge P^{(k)}\{Z_k > t_k\}, \quad \text{for all } t_k \in \mathbb{R}.$$

Let $P = P^{(1)} \times \cdots \times P^{(n)}$ and $Q = Q^{(1)} \times \cdots \times Q^{(n)}$. Then $\sum_{k=1}^n Z_k$ is stochastically larger under Q than under P, i.e.,

$$Q\left\{\sum_{k=1}^n Z_k > t\right\} \ge P\left\{\sum_{k=1}^n Z_k > t\right\} \quad \text{for all } t \in \mathbb{R}.$$
Proof of Lemma 4.1 Let $P_0$ be any (product) distribution in the set $\mathcal{P}_0$, and $P_1$ be any (product) distribution in the set $\mathcal{P}_1$. Let $L_q$ denote the likelihood ratio between $Q_1$ and $Q_0$, and let $L_q^{(k)}$ denote the likelihood ratio between $Q_1^{(k)}$ and $Q_0^{(k)}$. Then

$$\ln L_q(Y) = \sum_{k=1}^n \ln L_q^{(k)}(Y_k).$$

By the joint stochastic boundedness property of $(\mathcal{P}_0^{(k)}, \mathcal{P}_1^{(k)})$, $\ln L_q^{(k)}(Y_k)$ is stochastically larger under $Q_0^{(k)}$ than under $P_0^{(k)}$. Hence, by Lemma 4.2, $\ln L_q(Y)$ is stochastically larger under $Q_0$ than under $P_0$. This proves the first condition required for the joint stochastic boundedness of $(\mathcal{P}_0, \mathcal{P}_1)$ by $(Q_0, Q_1)$. The other condition is proved similarly.
4.9.2 Epsilon-Contamination Class

As we saw in Theorem 4.3, the solution to the minimax robust detection problem relies on the JSB property of the uncertainty classes. Establishing that a pair of uncertainty classes satisfies the JSB property, and finding the corresponding LFDs, is a nontrivial task. In this section, we focus on the most commonly studied pair of uncertainty classes, introduced in Huber's seminal work [3], the ε-contamination classes:

$$\mathcal{P}_j = \{P_j = (1 - \epsilon)\bar P_j + \epsilon M_j\}, \quad j = 0, 1, \qquad (4.40)$$

where ε is small enough so that $\mathcal{P}_0$ and $\mathcal{P}_1$ do not overlap.² Here $\bar P_0$ and $\bar P_1$ can be considered to be known nominal distributions for the observations under the hypotheses, but these nominal distributions are "contaminated" by unknown and arbitrary distributions $M_0$ and $M_1$, respectively. Such an uncertainty model might arise, for example, in communication systems where an extraneous unknown source of interference (e.g., a lightning strike) may occur with probability ε every time one tries to communicate information on the channel.

THEOREM 4.4 The uncertainty classes $(\mathcal{P}_0, \mathcal{P}_1)$ given by the ε-contamination model of (4.40) satisfy the JSB property of Definition 4.3, with LFDs given by

$$q_0(y) = \begin{cases} (1 - \epsilon)\,\bar p_0(y) & \text{if } \bar L(y) < b \\ \dfrac{(1 - \epsilon)\,\bar p_1(y)}{b} & \text{if } \bar L(y) \ge b \end{cases} \qquad (4.41)$$

and

$$q_1(y) = \begin{cases} (1 - \epsilon)\,\bar p_1(y) & \text{if } \bar L(y) > a \\ a\,(1 - \epsilon)\,\bar p_0(y) & \text{if } \bar L(y) \le a, \end{cases} \qquad (4.42)$$

where $\bar L(y) = \bar p_1(y)/\bar p_0(y)$ is the nominal likelihood ratio, and $a < 1 < b$ are chosen such that $q_0$ and $q_1$ are valid probability densities.

Proof We need to show that for all $(P_0, P_1) \in \mathcal{P}_0 \times \mathcal{P}_1$ and all $t \in \mathbb{R}$,

$$P_0\{L_q(Y) \le t\} \ge Q_0\{L_q(Y) \le t\} \qquad (4.43)$$
and
$$P_1\{L_q(Y) > t\} \ge Q_1\{L_q(Y) > t\}. \qquad (4.44)$$
Now if $t \le a$ or $t \ge b$, then both sides of these inequalities are equal to 0, or both sides are equal to 1, and the inequalities are met trivially. If $a < t < b$, then

$$P_0\{L_q(Y) \le t\} = (1 - \epsilon)\,\bar P_0\{L_q(Y) \le t\} + \epsilon\, M_0\{L_q(Y) \le t\} \ge (1 - \epsilon)\,\bar P_0\{L_q(Y) \le t\} \overset{(a)}{=} Q_0\{L_q(Y) \le t\},$$

where (a) follows because

$$L_q(y) \le t \implies \bar L(y) < b \implies q_0(y) = (1 - \epsilon)\,\bar p_0(y).$$

This establishes (4.43). A similar proof can be given for (4.44).

REMARK 4.2 It is easy to verify that for $q_0$ and $q_1$ as defined in (4.41) and (4.42), $Q_0 \in \mathcal{P}_0$ and $Q_1 \in \mathcal{P}_1$, assuming of course that a and b can be chosen such that $q_0$ and $q_1$ are valid densities. Indeed, we can write
$$q_0(y) = (1 - \epsilon)\,\bar p_0(y) + \epsilon\, m_0(y),$$

where

$$m_0(y) = \frac{1 - \epsilon}{\epsilon}\left[\frac{\bar p_1(y)}{b} - \bar p_0(y)\right] 1\{\bar L(y) \ge b\}.$$

A similar argument can be given for $q_1$.
REMARK 4.3 Note that the densities $q_0$ and $q_1$ can be rewritten as

$$q_0(y) = (1 - \epsilon)\max\left\{\frac{\bar p_1(y)}{b},\, \bar p_0(y)\right\}, \qquad q_1(y) = (1 - \epsilon)\max\{\bar p_1(y),\, a\,\bar p_0(y)\}.$$

Therefore, in order for $q_0$ and $q_1$ to be valid densities, i.e., $\int q_j(y)\,d\mu(y) = 1$, $j = 0, 1$, we must have

$$\int \max\left\{\frac{\bar p_1(y)}{b},\, \bar p_0(y)\right\} d\mu(y) = \int \max\{\bar p_1(y),\, a\,\bar p_0(y)\}\,d\mu(y) = \frac{1}{1 - \epsilon},$$

which implies that $a = 1/b$. Also, it is clear that the solution for a increases with ε. Therefore, for ε large enough, the solution a could be greater than 1. In this case, we can see that the corresponding $Q_1 \in \mathcal{P}_0$, i.e., $\mathcal{P}_1$ and $\mathcal{P}_0$ have an overlap and there is no robust solution to the hypothesis testing problem. In particular, if ε is such that $a = b = 1$, then it is clear that $q_0 = q_1$. See Exercise 4.17 for an example.
REMARK 4.4 The robust test for the ε-contamination model is a LRT based on clipping the nominal likelihood ratio:

$$L_q(y) = \begin{cases} b & \text{if } \bar L(y) \ge b \\ \bar L(y) & \text{if } a < \bar L(y) < b \\ a & \text{if } \bar L(y) \le a. \end{cases}$$
Some intuition for why such a test might be effective comes from looking at the hypothesis testing problem with i.i.d. observations under each hypothesis:

H0 : {Yk}_{k=1}^n ∼ i.i.d. P0,
H1 : {Yk}_{k=1}^n ∼ i.i.d. P1,

where P0 ∈ P0 and P1 ∈ P1 as in (4.40). Since the contaminating distributions M0 and M1 are not known to the decision-maker, a naive approach to constructing a test might be based on simply using the nominal likelihood ratio:

L̄^(n)(y) = ∏_{k=1}^n L̄(yk),

with the corresponding LRT given by:

δ̃_LRT(y) = 1 if L̄^(n)(y) > η;  1 w.p. γ if L̄^(n)(y) = η;  0 if L̄^(n)(y) < η.

Note that L̄^(n)(y) is sensitive to observations for which L̄(yk) ≫ 1 or L̄(yk) ≪ 1. The worst-case performance for the test based on L̄^(n)(Y) will occur when M0 has all of its mass at values of y where L̄(y) ≫ 1 and M1 has all of its mass at values of y where L̄(y) ≪ 1. With this choice:

sup_{P0 ∈ P0} P_F(δ̃_LRT, P0) ≈ 1 − (1 − ε)^n  and  sup_{P1 ∈ P1} P_M(δ̃_LRT, P1) ≈ 1 − (1 − ε)^n,

i.e., the worst-case performance of the LRT based on L̄^(n)(y) can be arbitrarily bad for large n. Using Lq(y) in place of L̄(y) limits the damage done by the contamination, and the worst-case performance is then simply the performance of the LRT between the LFDs Q1 and Q0 for testing between the simple hypotheses where the observations follow the LFD under each hypothesis.
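The effect is easy to see numerically. The sketch below is illustrative only: the nominal pair N(0, 1) vs. N(1, 1), the contamination level, the contamination point, and the clipping levels a and b are arbitrary assumptions, not the exact least-favorable solution.

```python
import numpy as np

rng = np.random.default_rng(0)
eps, n, trials = 0.1, 50, 2000
a, b = 0.5, 2.0          # assumed clipping levels, not the exact LFD values

def log_Lbar(y):
    # nominal log-likelihood ratio for N(0,1) vs N(1,1): ln(p1/p0) = y - 1/2
    return y - 0.5

def log_Lq(y):
    # clipped version, in the spirit of Remark 4.4
    return np.clip(log_Lbar(y), np.log(a), np.log(b))

# H0 data with eps-contamination placed where Lbar is huge (bad case for the naive LRT)
clean = rng.normal(0.0, 1.0, size=(trials, n))
mask = rng.random((trials, n)) < eps
y0 = np.where(mask, 8.0, clean)      # contaminating point mass at y = 8

naive = log_Lbar(y0).sum(axis=1)
robust = log_Lq(y0).sum(axis=1)
pf_naive = np.mean(naive > 0.0)      # threshold ln(eta) = 0 for illustration
pf_robust = np.mean(robust > 0.0)
print(pf_naive, pf_robust)
```

With the contamination concentrated where the nominal likelihood ratio is huge, the false-alarm rate of the naive test blows up, while the clipped statistic stays well behaved.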
Exercises
4.1 Bayesian Composite Hypothesis Testing with only H1 Composite. Consider the composite binary hypothesis testing problem in which

pθ(y) = θ e^{−θy} 1{y ≥ 0},

π0 = π1 = 1/2, and the conditional prior under H1 is π1(θ) = 2·1{0.5 ≤ θ ≤ 1}. Find a Bayes rule and corresponding minimum Bayes risk for the hypotheses

H0 : X0 = {0.5}
H1 : X1 = (0.5, 1].

Assume C(i, θ) = Cij = 1{i ≠ j}.
4.2 Bayesian Composite Hypothesis Testing with Both Hypotheses Composite. Consider the composite binary hypothesis testing problem in which

pθ(y) = θ e^{−θy} 1{y ≥ 0},  π(θ) = 1{0 ≤ θ ≤ 1},

and C(i, θ) = Cij = 1{i ≠ j}. Find a Bayes rule and corresponding minimum Bayes risk for the hypotheses

H0 : X0 = [0, 0.5]
H1 : X1 = (0.5, 1].
4.3 Composite Hypothesis Testing with Nonuniform Costs over Hypotheses. Rework Exercise 4.1 under the assumption that the cost function is:

C(i, θ) = 1/θ if i = 0, θ ∈ (0.5, 1]  (H1);
          1   if i = 1, θ = 0.5       (H0);
          0   else.
4.4 GLRT and UMP for Uniform Distributions. Consider the composite binary hypothesis test

H0 : Y ∼ Uniform[0, 1]
H1 : Y ∼ Uniform[0, θ], θ ∈ (0, 1).   (4.45)

(a) Derive the GLRT.
(b) Is there a UMP test for this problem? If so, give it. If not, explain why not.

4.5 UMP for Poisson Distributions. Let Y be Poisson(θ), i.e.,

pθ(y) = θ^y e^{−θ} / y!,  for y = 0, 1, 2, ....

Consider the composite hypothesis testing problem

H0 : Y ∼ Poisson(1)
H1 : Y ∼ Poisson(θ), θ > 1.

Does a UMP rule exist? Explain.

4.6 GLRT/UMP for Exponential Distributions. Consider the composite hypothesis test

H0 : Y ∼ p0(y) = e^{−y} 1{y ≥ 0}
H1 : Y ∼ pθ(y) = θ e^{−θy} 1{y ≥ 0}, θ > 1.

Is there a UMP test here? If so, what form does it take?
4.7 GLRT/UMP for One-Sided Gaussian Distributions. Consider the hypothesis test

H0 : Y = Z
H1 : Y = s + Z,

where s > 0, and Z is a random variable distributed as a "one-sided Gaussian":

p(z) = (2/√(2πσ²)) exp{−z²/(2σ²)} if z ≥ 0;  0 else.

(a) Give the Bayes rule for deciding between H0 and H1 assuming known s, equal priors, and uniform costs.
(b) Now assume s is unknown, and derive the GLRT for this problem.
(c) Is the GLRT a UMP test here?

4.8 GLRT/UMP for Gaussian Hypothesis Testing. Consider the hypothesis test

H0 : Y = Z
H1 : Y = s + Z,

where s is on the unit circle in R² (you do not know where), and the vector Z is normal.

(a) Is there a UMP test for this problem? Explain why.
(b) Derive the GLRT for this problem, and sketch the boundary of the decision region.

4.9 Composite Gaussian Hypothesis with Two Alternative Distributions. Let θ2 < 0 < θ1. Derive the GLRT for the composite detection problem with Y ∼ Pθ = N(θ, 1) and

H0 : θ = 0
H1 : θ ∈ {θ1, θ2}.
4.10 Composite Gaussian Hypothesis Testing. Consider the hypothesis testing problem:

H0 : θ = 0
H1 : θ > 0,

where, for y = (y1, y2),

pθ(y) = (1/(2π)) exp{ −[ (y1 − θ)² + (y2 − θ²)² ] / 2 }.

(a) Is there a UMP test between H0 and H1? If so, find it for a level of α ∈ (0, 1). If not, explain clearly why not.
(b) Find an α-level LMP test for this problem, for α ∈ (0, 1).
(c) Find PD as a function of α and θ for the LMP test.
4.11 Composite Hypothesis Testing. Consider the hypothesis testing problem:

H0 : Y = Z
H1 : Y = s(θ) + Z,

where θ is a deterministic but unknown parameter that takes the value 1 or 2, Z ∼ N(0, I), and

s(1) = (1, 0, 2, 0)ᵀ,  s(2) = (0, 1, 0, 2)ᵀ.

(a) Is there a UMP test between H0 and H1? If so, find it (for a level of α). If not, explain why not.
(b) Find an α-level GLRT for testing between H0 and H1.
(c) Argue that the probability of detection for the GLRT of part (b) is independent of θ, and then find the probability of detection as a function of α.

4.12 UMP for Two-Sided Gaussian Hypothesis Testing. Here we switch the roles of H0 and H1 in Section 4.5.1. Consider the composite detection problem with Y ∼ Pθ = N(θ, σ²) and

H0 : θ ≠ 0
H1 : θ = 0.

Does a UMP test exist for this problem? If so, find it for significance level α.

4.13 UMP/LMP Testing with Laplacian Observations. Consider the composite binary detection problem in which

pθ(y) = (1/2) e^{−|y − θ|},  y ∈ R,

and

H0 : θ = 0
H1 : θ > 0.

(a) Show that a UMP test exists for this problem, even though the Monotone Likelihood Ratio Theorem does not apply. Find an α-level UMP test, and find PD as a function of α for this test. Hint: Even though the likelihood ratio has point masses under H0 and H1, it can be shown that randomization can be avoided in the LRT by rewriting the test in terms of Y. This step is crucial to establishing the existence of a UMP solution.
(b) Find a locally most powerful α-level test, and find PD as a function of α for this test.
(c) Plot the ROCs for the tests in parts (a) and (b) on the same figure and compare them.

4.14 Monotone Likelihood Ratio Theorem with Two Composite Hypotheses. Consider the composite detection problem:

H0 : θ ∈ X0
H1 : θ ∈ X1.
Now suppose that for each fixed θ0 ∈ X0 and each fixed θ1 ∈ X1, we have

p_{θ1}(y) / p_{θ0}(y) = F_{θ0,θ1}( g(y) ),

where the (real-valued) function g does not depend on θ1 or θ0, and the function F_{θ0,θ1} is strictly increasing in its argument. Assume that for tests of the form

δ̃(y) = 1 if g(y) > τ;  1 w.p. γ if g(y) = τ;  0 if g(y) < τ,

the worst-case false-alarm probability

sup_{θ0 ∈ X0} ( Pθ0{ g(Y) > τ } + γ Pθ0{ g(Y) = τ } )

is achieved at a fixed θ0* ∈ X0, irrespective of the values of τ and γ. Show that for any level α ∈ (0, 1), a UMP test between H1 and H0 exists, and specify it.
4.15 UMP versus GLRT. Consider the composite binary detection problem in which

pθ(y) = θ e^{−θy} 1{y ≥ 0}.

(a) For α ∈ (0, 1), show that a UMP test of level α exists for testing the hypotheses

H0 : X0 = [1, 2]
H1 : X1 = (2, ∞).

Find this UMP test as a function of α.
(b) Find an α-level GLRT for testing between H0 and H1.

4.16 Stochastic Ordering. Prove Lemma 4.2.

4.17 Robust Hypothesis Testing with ε-Contamination Model. Consider the ε-contamination model, where the nominal distributions are P̄0 = N(0, 1) and P̄1 = N(µ, 1).

(a) Find an equation for the critical value of ε (call it ε_crit) as a function of µ for which q0 and q1 as defined in (4.41) and (4.42) are equal.
(b) Compute numerically the ε_crit for µ = 1.
(c) For ε = 0.5 ε_crit, find the least favorable distributions q0 and q1.
References
[1] E. L. Lehmann, Testing Statistical Hypotheses, 2nd edition, Springer-Verlag, 1986.
[2] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
[3] P. J. Huber, "A Robust Version of the Probability Ratio Test," Ann. Math. Statist., Vol. 36, No. 6, pp. 1753–1758, 1965.
[4] V. V. Veeravalli, T. Başar, and H. V. Poor, "Minimax Robust Decentralized Detection," IEEE Trans. Inf. Theory, Vol. 40, No. 1, pp. 35–40, 1994.
[5] P. J. Huber and V. Strassen, "Minimax Tests and the Neyman–Pearson Lemma for Capacities," Ann. Statist., Vol. 1, No. 2, pp. 251–263, 1973.
5
Signal Detection
In this chapter, we apply the basic principles developed so far to the problem of detecting a signal, consisting of a sequence of n values, observed with noise. The questions of interest are to determine the structure of the optimal detector, and to analyze the error probabilities. We first consider known signals corrupted by i.i.d. noise, and derive the matched filter for Gaussian noise and nonlinear detectors for non-Gaussian noise. We then introduce noise-whitening methods for signal detection problems involving correlated Gaussian noise, and present the problem of signal selection. We also analyze the problem of detecting (possibly correlated) Gaussian signals in Gaussian noise. We study m-ary signal detection (classification) and its relation to optimal binary hypothesis tests via the union-of-events bound. We introduce techniques for analyzing the performance of the GLRT and Bayes tests for problems of composite hypothesis testing involving a signal known up to some parameters. Finally, we consider an approach to detecting non-Gaussian signals in Gaussian noise via the performance metric of deflection.
5.1
Introduction
In the detection problems we have studied so far, we did not make any explicit assumptions about the observation space, although most examples were restricted to scalar observations. The theory that we have developed applies equally to scalar and vector observations. In particular, consider the binary detection problem:

H0 : Y ∼ p0,
H1 : Y ∼ p1,

where Y = (Y1, Y2, ..., Yn) ∈ Rⁿ and y = (y1, y2, ..., yn). The optimum detector for this problem, no matter which criterion (Bayes, Neyman–Pearson, or minimax) we choose, is of the form

δ̃_OPT(y) = 1 if ln L(y) > ln η;  1 w.p. γ if ln L(y) = ln η;  0 if ln L(y) < ln η.

In the Neyman–Pearson setting, the rule takes the form

δ̃(y) = 1 if L(y) > η;  1 w.p. γ if L(y) = η;  0 if L(y) < η,

where η and γ are derived from α using the procedure from Section 2.4.2:

α = ∫_{L(y)>η} p0(y) dµ(y) + γ ∫_{L(y)=η} p0(y) dµ(y).   (5.5)

However, the integrals are high-dimensional for large n, which makes it difficult to solve (5.5) for η.
5.3
Detection of Known Signal in Independent Noise
If the signal sj, j = 0, 1, is deterministic and known to the detector, the problem is often called coherent detection. In this case, the hypothesis test of (5.2) is written as

H0 : Y = s0 + Z
H1 : Y = s1 + Z.   (5.6)

Note that since s0 is known at the detector, we can subtract it out of the observations Y to get the following equivalent hypothesis test:

H0 : Y = Z
H1 : Y = s + Z,   (5.7)

with s = s1 − s0. Considerable simplifications arise when the noise samples Z1, ..., Zn are statistically independent. Then the noise pdf factors as pZ(z) = ∏_{k=1}^n p_{Zk}(zk), hence the likelihood ratio of (5.3) also factors,

L(y) = ∏_{k=1}^n p_{Zk}(yk − sk) / p_{Zk}(yk),   (5.8)

and the log-likelihood ratio test (LLRT) takes the additive form

ln L(y) = Σ_{k=1}^n ln [ p_{Zk}(yk − sk) / p_{Zk}(yk) ] = Σ_{k=1}^n ln Lk(yk)  ≷_{H0}^{H1}  ln η.   (5.9)

The general structure of the detector is shown in Figure 5.1. In the next section, we further specialize this structure to the case of typical noise distributions.
5.3.1
Signal in i.i.d. Gaussian Noise
If the Zk's are i.i.d. Gaussian random variables, i.e., Zk ∼ N(0, σ²) for 1 ≤ k ≤ n, then

ln Lk(yk) = ln [ (1/√(2πσ²)) exp{−(yk − sk)²/(2σ²)} / ( (1/√(2πσ²)) exp{−yk²/(2σ²)} ) ]
          = (1/σ²) sk (yk − sk/2) = (1/σ²) sk ỹk,   (5.10)

Figure 5.1  Optimal detector structure for known signal in independent noise
Figure 5.2  Correlator detector, optimal for i.i.d. Gaussian noise

where ỹ = y − s/2 are the centered observations. Combining (5.9) and (5.10) yields the optimal test

Σ_{k=1}^n sk ỹk  ≷_{H0}^{H1}  τ = σ² ln η.
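A minimal numerical sketch of this correlator test; the signal s, noise level σ, and threshold η below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 8, 1.0
s = np.ones(n)                  # assumed known signal
eta = 1.0                       # Bayes threshold for equal priors, uniform costs
tau = sigma**2 * np.log(eta)    # = 0 here

def decide(y):
    y_tilde = y - s / 2         # centered observations
    return int(np.dot(s, y_tilde) > tau)

# empirical detection / false-alarm rates over a few trials
h0 = [decide(rng.normal(0.0, sigma, n)) for _ in range(1000)]
h1 = [decide(s + rng.normal(0.0, sigma, n)) for _ in range(1000)]
print(np.mean(h0), np.mean(h1))
```

Under H0 the statistic is N(−‖s‖²/2, σ²‖s‖²), so false alarms are rare while detections dominate under H1.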
The detector consists of a correlator that computes the correlation between the known signal s and the centered observations ỹ. The resulting detector is known as a correlator detector and is shown in Figure 5.2. Alternatively, the observation y can be directly correlated with s, and the decision rule in this case can be written as

Σ_{k=1}^n sk yk  ≷_{H0}^{H1}  σ² ln η + (1/2) Σ_{k=1}^n sk².   (5.11)

The correlation statistic may also be viewed as the output at time n of a linear, time-invariant filter whose impulse response is the time-reversed signal

hk = s_{n−k},  0 ≤ k ≤ n − 1.

This filter is the celebrated matched filter. The correlation statistic can be written as

Σ_{k=1}^n sk ỹk = Σ_{k=1}^n h_{n−k} ỹk = (h ∗ ỹ)n,   (5.12)

where ∗ denotes the convolution of two sequences.

5.3.2
Signal in i.i.d. Laplacian Noise
The Laplacian pdf has heavier tails than the Gaussian pdf and is often used to model noise in applications including underwater communication and biomedical signal processing. If the Zk's are i.i.d. Laplacian random variables, the noise samples are distributed as

p_{Zk}(zk) = (a/2) e^{−a|zk|}.   (5.13)

In this case, the LLR for each observed sample yk is

ln Lk(yk) = ln [ (a/2) e^{−a|yk − sk|} / ( (a/2) e^{−a|yk|} ) ] = a ( |yk| − |yk − sk| ),   (5.14)

which is diagrammed in Figure 5.3. Equivalently,

ln Lk(yk) = 2a fk(ỹk) sgn(sk),  with ỹk = yk − sk/2,   (5.15)

where we have introduced the soft limiter

fk(x) = |sk|/2 if x > |sk|/2;  x if |x| ≤ |sk|/2;  −|sk|/2 if x < −|sk|/2,   (5.16)
which is diagrammed in Figure 5.4.

Figure 5.3  Component k of LLR for Laplacian noise: (a) sk > 0; and (b) sk < 0

Figure 5.4  Soft limiter

Figure 5.5  Detector structure for i.i.d. Laplacian noise
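The equivalence between the direct form of the Laplacian LLR and its soft-limited form can be checked numerically; a and s_k below are arbitrary assumed values (a negative s_k is chosen so the sign factor is exercised):

```python
import numpy as np

a = 1.3       # assumed Laplacian parameter
s_k = -0.8    # assumed signal sample

def llr_direct(y):
    # exact per-sample LLR: a(|y| - |y - s_k|), cf. (5.14)
    return a * (np.abs(y) - np.abs(y - s_k))

def soft_limiter(x, level):
    # clips x to the interval [-level, +level]
    return np.clip(x, -level, level)

def llr_limiter(y):
    # soft-limited form acting on the centered observation y - s_k/2
    y_tilde = y - s_k / 2
    return 2 * a * soft_limiter(y_tilde, abs(s_k) / 2) * np.sign(s_k)

y = np.linspace(-5.0, 5.0, 1001)
gap = np.max(np.abs(llr_direct(y) - llr_limiter(y)))
print(gap)   # the two forms agree everywhere
```

The clipped form makes explicit that samples far from the decision-relevant region contribute a bounded amount, which is the source of the robustness noted below.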
The structure of the optimal detector is shown in Figure 5.5. Note that large inputs are clipped by the soft limiter. Then the soft limiter output is correlated with the sequence of signs of sk (instead of the sequence sk itself). Clipping reduces the influence of large samples, which are less trustworthy than they were in the Gaussian case. Clipping makes the detector more robust against large noise samples (see also Remark 4.4).

5.3.3
Signal in i.i.d. Cauchy Noise
The Cauchy pdf has extremely heavy tails; its variance is infinite. Cauchy noise models have been used, e.g., in the area of software testing. If the noise is i.i.d. and follows the standard Cauchy distribution, we have

p_{Zk}(zk) = 1 / ( π(1 + zk²) )  for all k.   (5.17)

In this case, the LLR of each observed sample is

ln Lk(yk) = ln [ π(1 + yk²) / ( π(1 + (yk − sk)²) ) ]
          = ln [ (1 + (ỹk + sk/2)²) / (1 + (ỹk − sk/2)²) ]
          =: fk(ỹk),   (5.18)

with ỹk = yk − sk/2. Figure 5.6 shows a plot of fk. The detector structure in this case is shown in Figure 5.7. Note that large observed samples are heavily suppressed. This is because the Cauchy distribution has extremely heavy tails, and large observed values cannot be trusted.
Figure 5.6  Component k of LLR for Cauchy noise

Figure 5.7  Detector structure for Cauchy noise
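The suppression of large samples under the Cauchy model is easy to quantify; s_k below is an assumed value:

```python
import numpy as np

s_k = 2.0   # assumed signal sample

def llr_cauchy(y):
    # per-sample LLR (5.18), written directly in terms of y
    return np.log((1 + y**2) / (1 + (y - s_k)**2))

moderate = llr_cauchy(3.0)    # observation near the signal
huge = llr_cauchy(100.0)      # wild outlier
print(moderate, huge)
```

A moderate observation near the signal contributes ln 5 ≈ 1.6 to the log-likelihood ratio, while an outlier at y = 100 contributes almost nothing, exactly the behavior of Figure 5.6.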
5.3.4
Approximate NP Test
The probabilities of false alarm and correct detection are respectively given by

PF(δη) = P0{ Σ_{k=1}^n ln Lk(Yk) > η },
PD(δη) = P1{ Σ_{k=1}^n ln Lk(Yk) ≥ η },

where {ln Lk(Yk)}_{k=1}^n are independent random variables under each hypothesis. We defer the evaluation of these probabilities for arbitrary η to Section 5.4.2 (for the Gaussian case) and to Section 5.7.4 (for the general case). We focus our analysis here on a setting where η is chosen to approximately obtain a target probability of false alarm.

For i.i.d. Gaussian noise, (5.10) shows that ln Lk(Yk) is a linear function of Yk, hence is a Gaussian random variable under both H0 and H1. Under H0, the mean and variance of ln L(Y) are respectively given by

µ0 ≜ −‖s‖²/(2σ²)  and  v0 ≜ ‖s‖²/σ².

Hence the normalized log-likelihood ratio

U ≜ ( Σ_{k=1}^n ln Lk(Yk) − µ0 ) / √v0

is distributed as N(0, 1). Consider setting the threshold η so as to achieve a false-alarm probability of α:

PF(δη) = P0{ U > (η − µ0)/√v0 } = Q( (η − µ0)/√v0 ) = α,

hence

η = µ0 + √v0 Q^{−1}(α).   (5.19)

If the noise distribution is non-Gaussian, the mean and variance of ln L(Y) under H0 are respectively given by

µ0 ≜ Σ_{k=1}^n E0[ln Lk(Yk)]  and  v0 ≜ Σ_{k=1}^n Var0[ln Lk(Yk)].   (5.20)

If n is large and certain mild conditions on the signal are satisfied, then by application of the Central Limit Theorem [1, Sec. VIII.4], the normalized log-likelihood ratio U is approximately N(0, 1). Under these conditions, the choice (5.19) for η yields PF(δη) ≈ α. However, in Chapter 8 we shall see that such "Gaussian approximations" to PF(δη) are generally not valid when η − µ0 ≫ √v0.
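A short simulation of the threshold choice (5.19); the signal, noise level, and target α are assumed values, and Q/Qinv are small helper functions introduced here (not from the text):

```python
import numpy as np
from math import erf, sqrt

def Q(x):
    # Gaussian tail probability
    return 0.5 * (1 - erf(x / sqrt(2)))

def Qinv(alpha, lo=-10.0, hi=10.0):
    # inverse of Q by bisection (Q is strictly decreasing)
    for _ in range(100):
        mid = (lo + hi) / 2
        if Q(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

rng = np.random.default_rng(2)
n, sigma, alpha = 32, 1.0, 0.1
s = 0.5 * np.ones(n)                    # assumed known signal
E = float(np.dot(s, s))
mu0, v0 = -E / (2 * sigma**2), E / sigma**2
eta = mu0 + np.sqrt(v0) * Qinv(alpha)   # threshold from (5.19)

# Monte Carlo estimate of P_F under H0
z = rng.normal(0.0, sigma, size=(5000, n))
llr = (z - s / 2) @ s / sigma**2        # ln L(y) from (5.10), with y = z under H0
pf = np.mean(llr > eta)
print(pf)   # close to alpha
```

In the Gaussian case the approximation is exact up to Monte Carlo error, since ln L(Y) is exactly Gaussian.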
5.4
Detection of Known Signal in Correlated Gaussian Noise
Now consider the hypothesis test (5.7) where the signal s ∈ Rⁿ is known, the noise Z is N(0, CZ), and the noise covariance matrix CZ has full rank. The likelihood ratio of (5.3) takes the form

L(y) = pZ(y − s) / pZ(y)
     = [ (2π)^{−n/2} |CZ|^{−1/2} exp{ −(1/2)(y − s)ᵀ CZ^{−1} (y − s) } ] / [ (2π)^{−n/2} |CZ|^{−1/2} exp{ −(1/2) yᵀ CZ^{−1} y } ]
     = exp{ (1/2) [ yᵀ CZ^{−1} s + sᵀ CZ^{−1} y − sᵀ CZ^{−1} s ] }
     = exp{ sᵀ CZ^{−1} y − (1/2) sᵀ CZ^{−1} s },   (5.21)

where the last line holds because the inverse covariance matrix CZ^{−1} is symmetric. The optimal test (in the Bayes, minimax, or NP sense) compares L(y) to a threshold η. Taking logarithms in (5.21), we obtain the LLRT

sᵀ CZ^{−1} y  ≷_{H0}^{H1}  τ ≜ ln η + (1/2) sᵀ CZ^{−1} s.   (5.22)

We recover the detector of (5.11) if Z is i.i.d. N(0, σ²). Then CZ = σ² In, and the LRT is the correlation test

sᵀ y  ≷_{H0}^{H1}  τ = σ² ln η + (1/2) sᵀ s.   (5.23)

We can relate this special case to the general case by defining the pseudosignal

s̃ ≜ CZ^{−1} s ∈ Rⁿ   (5.24)

and writing (5.22) as a correlation test where the correlation is taken with the pseudosignal s̃ instead of the original signal s:

s̃ᵀ y  ≷_{H0}^{H1}  τ = ln η + (1/2) s̃ᵀ s.   (5.25)

The structure of this detector is shown in Figure 5.8.

Figure 5.8  Optimal detector structure for known signal in correlated Gaussian noise
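A minimal numerical sketch of the pseudosignal test (5.24)-(5.25); the covariance, signal, and threshold below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
# assumed AR(1)-style noise covariance and an alternating signal
C = np.array([[0.5**abs(i - j) for j in range(n)] for i in range(n)])
s = np.array([1.0, -1.0, 1.0, -1.0])
s_tilde = np.linalg.solve(C, s)          # pseudosignal (5.24)
eta = 1.0
tau = np.log(eta) + 0.5 * s_tilde @ s    # threshold in (5.25)

L = np.linalg.cholesky(C)                # used to draw correlated noise

def decide(y):
    return int(s_tilde @ y > tau)

h0 = [decide(L @ rng.normal(size=n)) for _ in range(2000)]
h1 = [decide(s + L @ rng.normal(size=n)) for _ in range(2000)]
print(np.mean(h0), np.mean(h1))
```

For this choice the GSNR sᵀC⁻¹s works out to 10, so the test separates the hypotheses cleanly.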
5.4.1
Reduction to i.i.d. Noise Case
Another widely applicable approach when dealing with correlated Gaussian noise is to reduce the problem to a detection problem involving i.i.d. Gaussian noise. This is done by applying a special linear transform to the observations. Specifically, given observations Y ∈ Rⁿ, design a matrix B ∈ Rⁿˣⁿ and define:

• the transformed observations Ỹ ≜ B Y;
• the transformed signal ŝ ≜ B s;
• the transformed noise Z̃ ≜ B Z.

We use the notation ŝ instead of s̃ to avoid confusion with the pseudosignal s̃ of (5.24). If B is invertible, the hypothesis test is equivalent to

H0 : Ỹ = Z̃,
H1 : Ỹ = ŝ + Z̃.   (5.26)

The matrix B is designed such that the transformed noise is i.i.d. N(0, 1):

C_Z̃ = B CZ Bᵀ = In.   (5.27)

This is the so-called whitening condition, and we now consider several ways of choosing B.

(a) Eigenvector decomposition of CZ: let

CZ = V Λ Vᵀ = Σ_{k=1}^n λk vk vkᵀ,   (5.28)

where V = [v1, v2, ..., vn] is the matrix of eigenvectors, and Λ = diag(λ1, ..., λn) is the diagonal matrix whose entry Λkk = λk is the eigenvalue associated with eigenvector vk. The eigenvalues are positive because CZ was assumed to have full rank and is thus positive definite. The eigenvector matrix V is orthonormal. Substituting (5.28) into (5.27), we obtain

In = B V Λ Vᵀ Bᵀ = (B V Λ^{1/2})(B V Λ^{1/2})ᵀ,

which holds (among other choices) if B V Λ^{1/2} = In, hence

B = Λ^{−1/2} Vᵀ.   (5.29)

For each k = 1, ..., n, component k of the transformed observation vector is given by

Ỹk = (B Y)k = λk^{−1/2} vkᵀ Y,

which is the projection of Y onto the kth eigenvector vk, normalized by λk^{−1/2}. Hence Ỹ is the representation of Y in the eigenvector basis for the noise covariance matrix, followed by scaling of each component by the reciprocal of the square root of the associated eigenvalue. Likewise, we have

ŝk = λk^{−1/2} vkᵀ s  and  Z̃k = λk^{−1/2} vkᵀ Z,

where Z̃k, k = 1, ..., n, are i.i.d. N(0, 1). Writing the test (5.26) as

H0 : Ỹk = Z̃k,
H1 : Ỹk = ŝk + Z̃k,  1 ≤ k ≤ n,

we see that the informative components of the observations are those corresponding to large signal components: they contribute an amount ŝk² to the generalized signal-to-noise ratio (GSNR)

GSNR ≜ ‖ŝ‖² = ‖B s‖² = sᵀ Bᵀ B s = sᵀ CZ^{−1} s.   (5.30)

(b) Positive square root of CZ^{−1}: choose B = CZ^{−1/2} = V Λ^{−1/2} Vᵀ. This is V multiplied by the whitening matrix of (5.29).

(c) Cholesky decomposition of CZ: this decomposition takes the form CZ = G Gᵀ where G is a lower triangular matrix. Then B = G^{−1} is a whitening transform, and B is also lower triangular. We obtain ỹk = Σ_{ℓ=1}^k B_{kℓ} yℓ for 1 ≤ k ≤ n. Hence this whitening transform can be implemented by the application of a causal, linear time-varying filter to the observed sequence {yk}_{k=1}^n.
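Option (c) is convenient in practice because numerical libraries expose the Cholesky factor directly; the covariance and signal below are arbitrary assumed examples:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
A = rng.normal(size=(n, n))
C_Z = A @ A.T + n * np.eye(n)    # assumed full-rank noise covariance

G = np.linalg.cholesky(C_Z)      # C_Z = G G^T, G lower triangular
B = np.linalg.inv(G)             # whitening matrix, also lower triangular

C_white = B @ C_Z @ B.T          # should equal the identity, cf. (5.27)
err = np.max(np.abs(C_white - np.eye(n)))
print(err)

# the whitened problem has GSNR ||B s||^2 = s^T C_Z^{-1} s, cf. (5.30)
s = rng.normal(size=n)
gsnr_white = float(np.sum((B @ s)**2))
gsnr_direct = float(s @ np.linalg.solve(C_Z, s))
print(gsnr_white, gsnr_direct)
```

Both checks confirm that whitening preserves the GSNR, which is why the three choices of B lead to detectors with identical performance.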
5.4.2
Performance Analysis
The LRT statistic of (5.25),

T ≜ s̃ᵀ Y = Σ_{k=1}^n s̃k Yk,

is a linear combination of jointly Gaussian random variables and is therefore also Gaussian. The distributions of T under H0 and H1 are determined by the means and variances of T:

E0[T] = E0[s̃ᵀ Y] = s̃ᵀ E0[Y] = 0,
E1[T] = E1[s̃ᵀ Y] = s̃ᵀ E1[Y] = s̃ᵀ s = sᵀ CZ^{−1} s ≜ µ,
Var0[T] = E0[T²] = E0[(s̃ᵀ Y)²] = s̃ᵀ E0[Y Yᵀ] s̃ = s̃ᵀ CZ s̃ = sᵀ CZ^{−1} s = µ,
Var1[T] = Var0[T].

Therefore the test (5.25) can be expressed as

H0 : T ∼ N(0, µ),
H1 : T ∼ N(µ, µ),

with µ = sᵀ CZ^{−1} s. The normalized distance between the two distributions is the ratio of the difference between the two means to the standard deviation:

d = µ/√µ = √µ = √( sᵀ CZ^{−1} s ),   (5.31)

which is the square root of the GSNR of (5.30). We note that d is a special case of the Mahalanobis distance [2], which is defined as follows:

DEFINITION 5.1 The Mahalanobis distance between two n-dimensional vectors x1 and x2, corresponding to a positive definite matrix A, is given by

dA(x1, x2) = √( (x2 − x1)ᵀ A (x2 − x1) ).

The probability of false alarm and the probability of correct detection of the LRT (5.25) are respectively given by

PF = P0{T ≥ τ} = Q(τ/d),   (5.32)
PD = P1{T ≥ τ} = Q( (τ − d²)/d ).   (5.33)

Hence the ROC is given by

PD = Q( Q^{−1}(α) − d ).   (5.34)

If the threshold is set to τ = µ/2 (as occurs in the Bayesian setting with equal priors, or in the minimax setting), the error probabilities are given by

PF = PM = Pe = Q(d/2).   (5.35)

For the case of i.i.d. noise (CZ = σ² In), we have d² = ‖s‖²/σ², which represents the signal-to-noise ratio (SNR). For general CZ, the GSNR d² = sᵀ CZ^{−1} s determines the performance of the LRT.
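The closed-form expressions above are easy to exercise numerically; CZ and s below are assumed values, and Q/Qinv are small helpers introduced for the sketch:

```python
import numpy as np
from math import erf, sqrt

def Q(x):
    return 0.5 * (1 - erf(x / sqrt(2)))

def Qinv(p, lo=-12.0, hi=12.0):
    for _ in range(200):
        mid = (lo + hi) / 2
        if Q(mid) > p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

C_Z = np.array([[1.0, 0.6], [0.6, 1.0]])     # assumed noise covariance
s = np.array([1.0, -1.0])                    # assumed signal
mu = float(s @ np.linalg.solve(C_Z, s))      # GSNR
d = np.sqrt(mu)

alpha = 0.05
tau = d * Qinv(alpha)                        # from P_F = Q(tau/d) = alpha, cf. (5.32)
pd_from_tau = Q((tau - d**2) / d)            # (5.33)
pd_roc = Q(Qinv(alpha) - d)                  # (5.34)
print(d, pd_from_tau, pd_roc)
```

For this example the GSNR is exactly 5 (the anticorrelated signal lines up with the small eigenvalue of CZ), and the two routes to PD agree, confirming that (5.34) is just (5.32)-(5.33) with τ eliminated.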
5.5
m-ary Signal Detection

Consider the following detection problem with m > 2 hypotheses:

Hj : Y = sj + Z,  j = 1, ..., m,   (5.36)

where Z ∼ N(0, CZ), and the signals si, i = 1, ..., m, are known.
5.5.1
Bayes Classiﬁcation Rule
Assume a prior distribution π on the hypotheses. Denoting by π(i|y) the posterior probability of Hi given Y = y, the MPE classification rule is the MAP rule

δMAP(y) = arg max_{1≤i≤m} π(i|y) = arg max_{1≤i≤m} πi pi(y)/p(y) = arg max_{1≤i≤m} πi pi(y)/p0(y),

where we have introduced the pdf p0 = N(0, CZ) corresponding to a dummy noise-only hypothesis

H0 : Y = Z.

Therefore,

δMAP(y) = arg max_{1≤i≤m} [ ln πi + ln( pi(y)/p0(y) ) ]
        = arg max_{1≤i≤m} [ ln πi + siᵀ CZ^{−1} y − (1/2) siᵀ CZ^{−1} si ],

where the last line follows from (5.21). In the special case where πi = 1/m (equal priors) and CZ = σ² In, we obtain

δMAP(y) = arg max_{1≤i≤m} [ (1/σ²) siᵀ y − (1/(2σ²)) ‖si‖² ].

Further assuming equal signal energy, which means ‖si‖² = Es for all i, the MAP rule reduces to the maximum-correlation rule

δMAP(y) = arg max_{1≤i≤m} siᵀ y.   (5.37)

This is also equivalent to the minimum-distance detection rule

δMD(y) = arg min_{1≤i≤m} ‖y − si‖²

because ‖y − si‖² = ‖y‖² + Es − 2 siᵀ y.
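A small sketch showing that the maximum-correlation and minimum-distance rules coincide; the equal-energy signal set and noise level below are assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
sigma = 0.5
S = np.eye(4)        # assumed equal-energy (orthonormal) signal set, m = 4

def classify_corr(y):
    # maximum-correlation rule (5.37)
    return int(np.argmax(S @ y))

def classify_dist(y):
    # minimum-distance rule
    return int(np.argmin(np.sum((S - y)**2, axis=1)))

agree = all(
    classify_corr(y) == classify_dist(y)
    for y in (S[rng.integers(4)] + sigma * rng.normal(size=4) for _ in range(500))
)
print(agree)
```

The agreement is exact here because every signal has the same energy, so the ‖y‖² + Es terms cancel in the distance comparison.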
5.5.2
Performance Analysis
Assume equal priors, equal-energy signals, and CZ = σ² In. The error probability of the LRT is given by

Pe = (1/m) Σ_{i=1}^m P{Error | Hi}.   (5.38)
We now analyze the conditional error probability P{Error | Hi}, which is somewhat unwieldy but can be upper bounded in terms of error probabilities for binary hypothesis tests. Specifically, define the correlation statistics Ti = siᵀ Y for i = 1, ..., m. By (5.6) and (5.11), an optimal statistic for testing Hi versus Hj (for any i ≠ j) is (si − sj)ᵀ Y = Ti − Tj. If hypotheses Hi and Hj have the same prior probability, the error probability for the optimal binary test between Hi and Hj is given by

Pi{Tj > Ti} = Pj{Ti > Tj} = Q( ‖si − sj‖ / (2σ) ).   (5.39)

The maximum-correlation rule (5.37) takes the form δMAP(Y) = arg max_{1≤i≤m} Ti. Therefore,

P{Error | Hi} = Pi{ ∃ j ≠ i : Tj > Ti }
             = Pi{ ∪_{j≠i} {Tj > Ti} }
             ≤ Σ_{j≠i} Pi{Tj > Ti}
             = Σ_{j≠i} Q( ‖si − sj‖ / (2σ) ),   (5.40)

where the inequality follows from the union-of-events bound, and the last equality from (5.39). In many problems, the power (energy per sample) of the signals is roughly independent of n, and so is the power of the difference between any two signals. Then the normalized Euclidean distance dij ≜ ‖si − sj‖/σ between signals si and sj grows like √n, and the argument of the Q functions is large. This is referred to as the high-SNR regime. While the Q function does not admit a closed-form expression, the upper bound Q(x) ≤ e^{−x²/2} holds for all x ≥ 0, and the asymptotic equality

Q(x) ∼ e^{−x²/2} / (√(2π) x),  as x → ∞,   (5.41)

is already accurate for moderately large x. Since the function Q(x) decreases exponentially fast, the sum of the Q functions in (5.40) is dominated by the largest term (corresponding to the smallest distance). Let dmin ≜ min_{i≠j} dij. We obtain the upper bound

Pe ≤ (1/m) Σ_{i=1}^m Σ_{j≠i} Q(dij/2) ≤ (m − 1) Q(dmin/2),   (5.42)

which is typically fairly tight in the high-SNR regime, when m is relatively small.
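A Monte Carlo sanity check of the union bound (5.40)-(5.42); the small signal set and noise level below are assumptions chosen so the bound is visibly close but not trivial:

```python
import numpy as np
from math import erf, sqrt

def Q(x):
    return 0.5 * (1 - erf(x / sqrt(2)))

rng = np.random.default_rng(6)
sigma, m, trials = 0.6, 3, 4000
S = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])   # assumed signal set

# Monte Carlo error probability of the minimum-distance rule
errors = 0
for _ in range(trials):
    i = int(rng.integers(m))
    y = S[i] + sigma * rng.normal(size=2)
    if int(np.argmin(np.sum((S - y)**2, axis=1))) != i:
        errors += 1
pe_mc = errors / trials

# union bound, averaged over equal priors as in (5.38)/(5.40)
pe_bound = 0.0
for i in range(m):
    for j in range(m):
        if j != i:
            pe_bound += Q(np.linalg.norm(S[i] - S[j]) / (2 * sigma)) / m
print(pe_mc, pe_bound)
```

At this moderate SNR the bound already tracks the simulated error rate closely; at higher SNR the smallest pairwise distance dominates, as the text argues.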
5.6
Signal Selection
Having derived expressions for error probability, applicable to signal detection in correlated Gaussian noise, we now ask how the signals can be designed to minimize error probability, a typical problem in digital communications and radar applications. For simplicity we consider the binary case, m = 2. The hypothesis test is given by (5.6):

H0 : Y = s0 + Z,
H1 : Y = s1 + Z,

where Z ∼ N(0, CZ). We first consider the case of i.i.d. noise, then the general case of correlated noise.

5.6.1
i.i.d. Noise
For i.i.d. noise of variance σ², we have CZ = σ² In. Let d = ‖s1 − s0‖/σ. For Bayesian detection with equal priors, the problem is to minimize

Pe = Q(d/2) = Q( ‖s1 − s0‖ / (2σ) )

over all signals s0 and s1. The solution would be trivial (arbitrarily small Pe) in the absence of any constraint on s0 and s1, since the distance ‖s1 − s0‖ between them could be made arbitrarily large. For physical reasons, we assume that the signals are subject to the energy constraints ‖s0‖² ≤ Es and ‖s1‖² ≤ Es. Minimizing Pe is equivalent to maximizing the squared distance

‖s1 − s0‖² = ‖s0‖² + ‖s1‖² − 2 s0ᵀ s1
           ≤(a) ‖s0‖² + ‖s1‖² + 2 ‖s0‖ ‖s1‖
           ≤(b) Es + Es + 2 Es = 4 Es,   (5.43)

where inequality (a) holds by the Cauchy–Schwarz inequality, and (b) holds because both s0 and s1 are subject to the energy constraint Es. The upper bound (5.43) holds with equality if and only if s1 = −s0 and ‖s0‖ = ‖s1‖ = √Es. Hence the optimal solution is antipodal signaling with maximum power. The SNR is

d² = 4 Es / σ²,

and the error probability is Pe = Q( √Es / σ ).

5.6.2
Correlated Noise
For general positive definite CZ, the signal selection problem takes the form

min_{s0, s1} Pe = Q(d/2),

where

d² = (s1 − s0)ᵀ CZ^{−1} (s1 − s0)

is the GSNR of (5.30), or squared Mahalanobis distance (see Definition 5.1), between s0 and s1, and the signals s0 and s1 are subject to the energy constraints ‖s0‖² ≤ Es and ‖s1‖² ≤ Es. Since the Q function is decreasing, the signal selection problem is equivalent to maximizing d² subject to the energy constraints. To solve this problem, consider the eigenvector decomposition CZ = V Λ Vᵀ of the noise covariance matrix. For j = 0, 1, denote by ŝj = Vᵀ sj the representation of the signal sj in the eigenvector basis. We may write

d² = (s1 − s0)ᵀ V Λ^{−1} Vᵀ (s1 − s0) = (ŝ1 − ŝ0)ᵀ Λ^{−1} (ŝ1 − ŝ0) = Σ_{k=1}^n (ŝ1k − ŝ0k)² / λk.

Since V is orthonormal, we have ‖ŝi‖² = ‖si‖² for i = 0, 1, and thus ‖ŝ1 − ŝ0‖² ≤ 4 Es as in (5.43). Defining λmin = min_{1≤k≤n} λk, we obtain the upper bound

d² = Σ_{k=1}^n (ŝ1k − ŝ0k)² / λk ≤ (1/λmin) ‖ŝ1 − ŝ0‖² ≤ 4 Es / λmin.
Equality is achieved if and only if: (i) ŝ1 = −ŝ0, (ii) ‖ŝ0‖² = Es, and (iii) ŝ0k = 0 for all k such that λk > λmin. Hence the optimal signals satisfy s1 = √Es vmin = −s0, where vmin is an eigenvector corresponding to λmin. Moreover,

d² = 4 Es / λmin.

Example 5.1  Consider n = 2 and

CZ = σ² [ 1  ρ ; ρ  1 ],

where |ρ| < 1. The eigenvector decomposition of CZ yields

Λ = σ² [ 1+ρ  0 ; 0  1−ρ ]   (5.44)

and

V = [ 1/√2  −1/√2 ; 1/√2  1/√2 ].   (5.45)

Figure 5.9  Signal selection for detection in correlated Gaussian noise

Hence λmin = σ²(1 − |ρ|) and

d² = 4 Es / ( σ²(1 − |ρ|) )

is strictly larger than the value obtained for i.i.d. noise (ρ = 0) of the same variance. Hence we can exploit the correlation of the noise by designing antipodal signals along the direction of the minimum eigenvector. The corresponding error probability is Pe = Q( √Es / (σ√(1 − |ρ|)) ), as illustrated in Figure 5.9. The worst design would be to choose signals in the direction of the maximum eigenvector. □
5.7
Detection of Gaussian Signals in Gaussian Noise
So far we have studied the problem of detecting known signals in noise. However, in various signal processing and communications applications, the signals are not known precisely. Only some general characteristics of the signals may be known, for instance, their probability distribution. Consider the following model for detection of Gaussian signals in Gaussian noise:

H0 : Y = S0 + Z,
H1 : Y = S1 + Z,   (5.46)

where Sj and Z are jointly Gaussian, for j = 0, 1. The problem in (5.46) is a special case of the following Gaussian hypothesis testing problem:

H0 : Y ∼ N(µ0, C0),
H1 : Y ∼ N(µ1, C1).   (5.47)

The log-likelihood ratio for this pair of hypotheses is given by

ln L(y) = (1/2) yᵀ (C0^{−1} − C1^{−1}) y + (µ1ᵀ C1^{−1} − µ0ᵀ C0^{−1}) y
        + (1/2) ln( |C0| / |C1| ) + (1/2) µ0ᵀ C0^{−1} µ0 − (1/2) µ1ᵀ C1^{−1} µ1.   (5.48)
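Equation (5.48) can be sanity-checked against direct evaluation of the two Gaussian log-densities; the covariances and means below are randomly generated assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 3

def rand_cov():
    A = rng.normal(size=(n, n))
    return A @ A.T + n * np.eye(n)    # well-conditioned positive definite matrix

C0, C1 = rand_cov(), rand_cov()
m0, m1 = rng.normal(size=n), rng.normal(size=n)

def log_gauss(y, m, C):
    # log of the N(m, C) density
    d = y - m
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(C, d))

def llr_eq_5_48(y):
    # the quadratic + linear + constant decomposition of (5.48)
    P0, P1 = np.linalg.inv(C0), np.linalg.inv(C1)
    quad = 0.5 * y @ (P0 - P1) @ y
    lin = (m1 @ P1 - m0 @ P0) @ y
    _, ld0 = np.linalg.slogdet(C0)
    _, ld1 = np.linalg.slogdet(C1)
    const = 0.5 * (ld0 - ld1) + 0.5 * m0 @ P0 @ m0 - 0.5 * m1 @ P1 @ m1
    return quad + lin + const

y = rng.normal(size=n)
gap = abs(llr_eq_5_48(y) - (log_gauss(y, m1, C1) - log_gauss(y, m0, C0)))
print(gap)
```

The decomposition matches ln p1(y) − ln p0(y) to machine precision, which also makes the linear-detector (5.50) and quadratic-detector (5.51) special cases easy to see: equal covariances kill the quadratic term, and equal means kill the linear term.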
The first term on the right-hand side is quadratic in y, and the second term is linear in y. The remaining three terms are constant in y. We denote their sum by

κ = (1/2) ln( |C0| / |C1| ) + (1/2) µ0ᵀ C0^{−1} µ0 − (1/2) µ1ᵀ C1^{−1} µ1.   (5.49)

If C0 = C1 = C and µ0 ≠ µ1, (5.48) reduces to

ln L(y) = (µ1 − µ0)ᵀ C^{−1} y + κ   (5.50)

and the optimum detector is a purely linear detector (with κ being absorbed into the
Detection of a Gaussian Signal in White Gaussian Noise
We now consider the following special case of (5.46):
H_0 : Y = Z
H_1 : Y = S + Z,        (5.52)

where S \sim N(\mu_S, C_S) is independent of Z \sim N(0, \sigma^2 I_n). The signal covariance matrix C_S need not be full rank. The test (5.52) can be written equivalently as

H_0 : Y \sim N(0, \sigma^2 I_n)
H_1 : Y \sim N(\mu_S, C_S + \sigma^2 I_n).        (5.53)

Specializing (5.48) to (5.53), we get

\ln L(y) = \tfrac12 y^\top \bigl[\sigma^{-2} I_n - (C_S + \sigma^2 I_n)^{-1}\bigr] y + \mu_S^\top (C_S + \sigma^2 I_n)^{-1} y + \kappa        (5.54)

with

\kappa = -\tfrac12 \mu_S^\top (C_S + \sigma^2 I_n)^{-1}\mu_S - \tfrac12 \ln|C_S + \sigma^2 I_n| + \tfrac{n}{2}\ln\sigma^2.
Denote by

Q \triangleq \sigma^{-2} I_n - (C_S + \sigma^2 I_n)^{-1} = \sigma^{-2}\bigl[(C_S + \sigma^2 I_n) - \sigma^2 I_n\bigr](C_S + \sigma^2 I_n)^{-1} = \sigma^{-2}\, C_S (C_S + \sigma^2 I_n)^{-1}        (5.55)

the difference between the inverse covariance matrices under H_0 and H_1, and note that Q \succeq 0 (nonnegative definite matrix). With this notation, (5.54) becomes

\ln L(y) = \tfrac12 y^\top Q y + \mu_S^\top (C_S + \sigma^2 I_n)^{-1} y + \kappa.        (5.56)
We first consider the case of a signal with zero mean: \mu_S = 0. Then the log-likelihood ratio in (5.56) simplifies to

\ln L(y) = \tfrac12 y^\top Q y + \kappa

and the optimum test has the form

\delta_{OPT}(y) = 1 if y^\top Q y \ge \tau, and 0 otherwise.
5.7.3 Diagonalization of Signal Covariance
The derivations in Section 5.7.2 were simple because both the signal covariance matrix and the noise covariance matrix were multiples of I_n. To handle a general signal covariance matrix C_S, we use the same type of whitening technique that was applied to the noise in Section 5.4.1. Consider the eigenvector decomposition

C_S = V \Lambda V^\top,        (5.63)

where V = [v_1, \dots, v_n] is the matrix of eigenvectors, \Lambda = \mathrm{diag}\{\lambda_k\}_{k=1}^n, and \lambda_k is the eigenvalue corresponding to eigenvector v_k. Consequently, C_S + \sigma^2 I_n = V(\Lambda + \sigma^2 I_n)V^\top and Q = V D V^\top, where D = \mathrm{diag}\{\lambda_k / (\sigma^2(\lambda_k + \sigma^2))\}_{k=1}^n. Now define

\tilde y = V^\top y,        (5.64)

i.e., \tilde y_k = v_k^\top y is the component of the observations in the direction of v_k, the kth eigenvector. The hypothesis test (5.53) may be written in the equivalent form

H_0 : \tilde Y \sim N(0, \sigma^2 I_n)
H_1 : \tilde Y \sim N(0, \Lambda + \sigma^2 I_n)        (5.65)
and the decision rule reduces to the diagonal form

y^\top Q y = \tilde y^\top D \tilde y = \sum_{k=1}^n \frac{\lambda_k \tilde y_k^2}{\sigma^2(\lambda_k + \sigma^2)} \gtrless_{H_0}^{H_1} \tau.        (5.66)

Note that if \lambda_k \ll \sigma^2, component k of the rotated observation vector \tilde y receives low weight.

Example 5.2  Let n = 2 and

C_S = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix} \sigma_S^2,        (5.67)

where |\rho| \le 1. The eigenvector decomposition of this matrix is given in (5.44)–(5.45). Here V is the 45° rotation matrix and the eigenvalues \lambda_1, \lambda_2 are (1 \pm \rho)\sigma_S^2. The rotated signal is given by

\tilde Y = \frac{1}{\sqrt 2}\begin{bmatrix} Y_1 + Y_2 \\ Y_1 - Y_2 \end{bmatrix} = \frac{1}{\sqrt 2}\begin{bmatrix} S_1 + S_2 \\ S_1 - S_2 \end{bmatrix} + \tilde Z,

where the rotated noise components \tilde Z_1 and \tilde Z_2 are i.i.d. N(0, \sigma^2). The equal-probability contours for Y are elliptic, and so is the decision boundary (see Figure 5.11).

Figure 5.11  Equal-probability contours for Y and decision regions for the 2 × 2 signal covariance matrix of (5.67).
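The diagonal form (5.66) is easy to verify numerically against the matrix form y^\top Q y; the sketch below (our own construction and variable names) does so for an arbitrary signal covariance:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2 = 4, 0.5
A = rng.normal(size=(n, n))
CS = A @ A.T                                   # an arbitrary signal covariance
lam, V = np.linalg.eigh(CS)                    # CS = V diag(lam) V^T, as in (5.63)
Q = CS @ np.linalg.inv(CS + sigma2 * np.eye(n)) / sigma2   # (5.55)

y = rng.normal(size=n)
ytil = V.T @ y                                 # rotated observations (5.64)
diag_stat = np.sum(lam * ytil**2 / (sigma2 * (lam + sigma2)))  # (5.66)
assert np.isclose(y @ Q @ y, diag_stat)
```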
5.7.4 Performance Analysis
Note that the rotated vector \tilde Y = V^\top Y of (5.64) is Gaussian with covariance matrix equal to \sigma^2 I_n under H_0 and to \Lambda + \sigma^2 I_n under H_1. Hence

H_0 : \tilde Y_k \sim N(0, \sigma^2)
H_1 : \tilde Y_k \sim N(0, \lambda_k + \sigma^2),    1 \le k \le n,

where \{\tilde Y_k\}_{k=1}^n are independent. We have

P_F = P_0\Bigl\{\sum_{k=1}^n W_k > \tau\Bigr\},
P_D = P_1\Bigl\{\sum_{k=1}^n W_k > \tau\Bigr\},

where

W_k \triangleq \frac{\lambda_k}{\sigma^2(\lambda_k + \sigma^2)}\, \tilde Y_k^2,    k = 1, 2, \dots, n,

are independent and have the following variance scales under H_0 and H_1:

\sigma_{0k}^2 = \frac{\lambda_k}{\lambda_k + \sigma^2},  \sigma_{1k}^2 = \frac{\lambda_k}{\sigma^2},    k = 1, 2, \dots, n.

Hence W_k follows a gamma distribution (see Appendix C) under H_j:

p_{j,W_k}(x) = \frac{1}{\sqrt{2\pi x\, \sigma_{jk}^2}} \exp\Bigl\{-\frac{x}{2\sigma_{jk}^2}\Bigr\},        (5.68)

i.e., W_k \sim \mathrm{Gamma}\bigl(\tfrac12, \tfrac{1}{2\sigma_{jk}^2}\bigr).
The case of i.i.d. Gaussian signals (\lambda_k \equiv \sigma_S^2) was studied in Section 5.7.2. Then \{W_k\}_{k=1}^n are i.i.d., and the sum of n i.i.d. Gamma(a, b) random variables is Gamma(na, b), which is a scaled \chi_n^2 random variable. In the correlated-signal scenario, there is no simple way to evaluate P_F and P_D. We outline a few approaches in the remainder of Section 5.7.4.
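When no closed form is available, a Monte Carlo estimate of P_F and P_D is a natural first resort. The sketch below (our own parameter choices, using the Example 5.2 covariance and an arbitrary threshold) estimates both probabilities:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma2, rho, sS2 = 2, 1.0, 0.8, 4.0
CS = sS2 * np.array([[1.0, rho], [rho, 1.0]])          # signal covariance (5.67)
Q = CS @ np.linalg.inv(CS + sigma2 * np.eye(n)) / sigma2   # (5.55)
tau = 2.0                                              # arbitrary threshold

N = 200_000
Y0 = rng.multivariate_normal(np.zeros(n), sigma2 * np.eye(n), size=N)
Y1 = rng.multivariate_normal(np.zeros(n), CS + sigma2 * np.eye(n), size=N)
T0 = np.einsum('ij,jk,ik->i', Y0, Q, Y0)               # y^T Q y under H0
T1 = np.einsum('ij,jk,ik->i', Y1, Q, Y1)               # y^T Q y under H1
PF, PD = np.mean(T0 > tau), np.mean(T1 > tau)
assert 0 < PF < PD < 1                                 # detector is better under H1
```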
Characteristic Function Approach.  Denote by

\phi_{j,W_k}(\omega) = E_j\bigl[e^{i\omega W_k}\bigr],  \omega \in \mathbb{R},  i = \sqrt{-1},

the characteristic function of W_k under H_j. If T_n = \sum_{k=1}^n W_k, then under H_j, T_n has characteristic function

\phi_{j,T_n}(\omega) = \prod_{k=1}^n \phi_{j,W_k}(\omega) = \prod_{k=1}^n \frac{1}{\sqrt{1 - 2i\omega\sigma_{jk}^2}}.

In principle one could apply the inverse Fourier transform and obtain the pdf p_{T_n} numerically. This is, however, not straightforward because the inverse Fourier transform is an integral, and evaluating it by means of the discrete Fourier transform requires truncation and sampling of the original \phi_{T_n}. Unfortunately the resulting inversion method is unstable and therefore unsuitable for computing small tail probabilities.
Saddlepoint Approximation.  If each W_k has a moment-generating function (mgf)

M_{j,W_k}(\sigma) = E_j\bigl[e^{\sigma W_k}\bigr],  \sigma \in \mathbb{R},

under H_j, then so does T_n:

M_{j,T_n}(\sigma) = \prod_{k=1}^n M_{j,W_k}(\sigma),  \sigma \in \mathbb{R},

and the definition of the mgf can be extended to the complex plane:

M_{j,T_n}(u) = E_j\bigl[e^{u T_n}\bigr],  u \in \mathbb{C}.

Thus M_{j,T_n}(i\omega) = \phi_{j,T_n}(\omega) for \omega \in \mathbb{R}. To alleviate the instability problem, one may compute p_{j,T_n} by evaluating the inverse Laplace transform of M_{j,T_n}. This is done by integrating along a vertical line in the complex plane, intersecting the real axis at a location u^* \in \mathbb{R}. (The inverse Fourier transform is obtained when u^* = 0.) Then

p_{j,T_n}(t) = \frac{1}{2\pi i} \int_{u^*-i\infty}^{u^*+i\infty} M_{j,T_n}(u)\, e^{-ut}\, du.

The optimal value of u^* (for which the inversion process is most stable) is the saddlepoint of the integrand on the real axis, which satisfies [3]

0 = \frac{d}{du}\bigl[-ut + \ln M_{j,T_n}(u)\bigr]\Big|_{u=u^*}.

Tail probabilities P_j\{T_n > t\} can also be approximated using saddlepoint methods, as developed by Lugannani and Rice [4]. For a detailed account of these methods, see [5, Ch. 14].
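For the present statistic the saddlepoint equation is one-dimensional and easy to solve numerically. The sketch below (our own; the variance scales are hypothetical values) finds u^* by bisection on the cumulant generating function K(u) = \ln M_{j,T_n}(u) = -\tfrac12 \sum_k \ln(1 - 2\sigma_{jk}^2 u):

```python
import numpy as np

s2 = np.array([0.88, 0.44, 1.50])     # hypothetical variance scales sigma^2_{jk}

def K(u):
    """Cumulant generating function of T_n: -0.5 * sum ln(1 - 2 s2 u)."""
    return -0.5 * np.sum(np.log1p(-2.0 * s2 * u))

def Kp(u):
    """Derivative K'(u); increasing on u < 1/(2 max s2)."""
    return np.sum(s2 / (1.0 - 2.0 * s2 * u))

def saddlepoint(t):
    """Solve the stationarity condition K'(u*) = t by bisection."""
    lo, hi = -50.0, 1.0 / (2.0 * s2.max()) - 1e-9
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if Kp(mid) < t:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

t = 5.0
u_star = saddlepoint(t)
assert np.isclose(K(0.0), 0.0)        # CGF vanishes at the origin
assert abs(Kp(u_star) - t) < 1e-6     # stationarity condition of the text
```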
Large Deviations.  Large-deviations methods (as developed in Chapter 8) can be used to upper-bound the tail probabilities P_F and P_M; these methods are particularly useful for small P_F and P_M. In fact, they are closely related to the saddlepoint approximation methods.
5.7.5 Gaussian Signals with Nonzero Mean
Up to this point, we assumed that the mean of S is zero. Now let \mu_S \ne 0. Whitening the observations again, we write (5.56) as

\ln L(\tilde y) = \tfrac12 \tilde y^\top D \tilde y + \tilde y^\top (\Lambda + \sigma^2 I_n)^{-1} \tilde\mu_S - \tfrac12 \tilde\mu_S^\top (\Lambda + \sigma^2 I_n)^{-1} \tilde\mu_S + \kappa,

where \tilde\mu_S = \{\tilde\mu_k\}_{k=1}^n = V^\top \mu_S. Hence

\sum_{k=1}^n \Bigl[ \frac{\lambda_k \tilde y_k^2}{2\sigma^2(\lambda_k + \sigma^2)} + \frac{\tilde y_k \tilde\mu_k}{\lambda_k + \sigma^2} - \frac{\tilde\mu_k^2}{2(\lambda_k + \sigma^2)} \Bigr] \gtrless_{H_0}^{H_1} \tau.

Performance analysis is greatly simplified in the special case of a white signal (C_S = \sigma_S^2 I_n, hence \lambda_k \equiv \sigma_S^2). Then V = I_n and the test reduces to comparing \sum_{i=1}^n (y_i - u_i)^2 with a threshold, where

u_i = -\frac{\sigma^2}{\sigma_S^2}\, \tilde\mu_{S,i},    1 \le i \le n.

After elementary algebraic manipulations, we obtain that the optimal test has the form

T \triangleq \sum_{i=1}^n (y_i - u_i)^2 \gtrless_{H_0}^{H_1} \tau' = \frac{\sigma^2(\sigma_S^2 + \sigma^2)}{\sigma_S^2}\Bigl( 2\tau + \frac{\|\mu_S\|^2}{\sigma_S^2} \Bigr).        (5.69)

Under both H_0 and H_1, the test statistic T is a scaled noncentral \chi_n^2 random variable. The scale factor is \sigma^2 under H_0 and \sigma_S^2 + \sigma^2 under H_1. We have

P_F(\delta) = P\{\chi_n^2(\lambda_0) > \tau'/\sigma^2\},  P_D(\delta) = P\{\chi_n^2(\lambda_1) > \tau'/(\sigma_S^2 + \sigma^2)\},        (5.70)

where the noncentrality parameters are respectively

\lambda_0 = \frac{\sigma^2 \|\mu_S\|^2}{\sigma_S^4}  and  \lambda_1 = \frac{(\sigma_S^2 + \sigma^2)\|\mu_S\|^2}{\sigma_S^4}.
5.8 Detection of Weak Signals
In many applications, the signal to be detected is weak and is known only up to a scaling constant. The detection problem may be formulated as

H_0 : Y = Z
H_1 : Y = \theta s + Z,  \theta > 0,        (5.71)

where s is known but \theta is not. We assume that the noise is i.i.d. with pdf p_Z. A locally optimum detector for this problem can be obtained by approximating the log-likelihood ratio as a linear function of \theta in the vicinity of \theta = 0:

\ln L_\theta(y) = \sum_{k=1}^n \bigl[\ln p_Z(y_k - \theta s_k) - \ln p_Z(y_k)\bigr] = \theta \sum_{k=1}^n \psi(y_k)\, s_k + o(\theta),
where \psi(y) \triangleq -\frac{d}{dy}\ln p_Z(y).¹ The locally optimum detector \delta_{LO} compares the statistic \sum_{k=1}^n \psi(y_k) s_k with a threshold \tau and decides H_1 when the statistic exceeds it. For Laplacian noise with s_k \equiv a > 0, \psi(\cdot) = \mathrm{sgn}(\cdot), and

P_F(\delta_{LO}) = P\{2\,\mathrm{Bi}(n, 1/2) - n > \tau/a\},        (5.73)
P_D^\theta(\delta_{LO}) = P\{2\,\mathrm{Bi}(n, 1 - e^{-\theta a}/2) - n > \tau/a\},  \theta > 0.        (5.74)
In order to achieve a probability of false alarm approximately equal to \alpha, one can use the Gaussian approximation of Section 5.3.4 to the distribution of the test statistic and set the value of the test threshold to \tau = a\sqrt{n}\, Q^{-1}(\alpha).
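For the Laplacian case, the sign-detector statistic and the Gaussian threshold approximation are easy to validate by simulation (a sketch with our own parameter choices):

```python
import numpy as np
from math import sqrt, erfc

rng = np.random.default_rng(3)
n, a, alpha = 400, 1.0, 0.05
s = a * np.ones(n)                     # constant signal template s_k = a

def Qfun(x):                           # Gaussian tail probability Q(x)
    return 0.5 * erfc(x / sqrt(2.0))

# Invert Q numerically by bisection: find x with Q(x) = alpha
lo, hi = 0.0, 10.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if Qfun(mid) > alpha else (lo, mid)
tau = a * sqrt(n) * 0.5 * (lo + hi)    # tau = a sqrt(n) Q^{-1}(alpha)

# Locally optimum statistic for Laplacian noise: psi = sgn
Z = rng.laplace(size=(20_000, n))      # observations under H0
stats = np.sign(Z) @ s
pf = np.mean(stats > tau)
assert abs(pf - alpha) < 0.02          # empirical false-alarm rate near alpha
```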
5.9 Detection of Signal with Unknown Parameters in White Gaussian Noise
We now study detection of a signal that is known up to some parameter \theta. The signal is denoted by s(\theta) \in \mathbb{R}^n. For detection in i.i.d. Gaussian noise, the problem is formulated as

H_0 : Y = Z
H_1 : Y = s(\theta) + Z,  \theta \in \mathcal{X}_1,        (5.75)

where the noise Z is i.i.d. N(0, \sigma^2). This model finds a variety of applications including communications (where \theta is an unknown phase, as in Example 5.4), template matching (as in Example 5.5), and radar signal processing (where \theta includes target location and velocity parameters). In linear models, the signal depends linearly on the unknown parameter:

s(\theta) = L\theta,        (5.76)

where L is a known n × d matrix, and the unknown \theta \in \mathcal{X}_1 = \mathbb{R}^d. Without loss of generality, we assume that the columns of L are orthonormal vectors, i.e., L^\top L = I_d.²

¹ The pdf p_Z might have points where the derivative does not exist, as in the Laplacian Example 5.3; however, the probability that a sample Y_i takes one of these values is zero.

5.9.1 General Approach
The Gaussian noise model (5.75) is a special case of the following general composite binary hypothesis test (Chapter 4) with vector observations:

H_0 : Y \sim P(\cdot \mid H_0)
H_1 : Y \sim P_\theta,  \theta \in \mathcal{X}_1.        (5.77)

Let L_\theta(y) \triangleq \frac{p_\theta(y)}{p(y \mid H_0)}. For the Gaussian problem of (5.75), we have

L_\theta(y) = \exp\Bigl\{ \frac{1}{\sigma^2} s(\theta)^\top y - \frac{\|s(\theta)\|^2}{2\sigma^2} \Bigr\},  \theta \in \mathcal{X}_1.        (5.78)

If \theta were known, the optimal test would be a Bayes rule \delta_B^\theta:

\delta_B^\theta(y) = 1 if L_\theta(y) \ge \eta_\theta, and 0 otherwise,        (5.79)

where the threshold \eta_\theta is selected according to the desired performance criterion (Bayes, minimax, or Neyman–Pearson). The corresponding probabilities of false alarm and miss are

P_F(\delta_B^\theta) = P\{L_\theta(Y) \ge \eta_\theta \mid H_0\},  P_M^\theta(\delta_B^\theta) = P_\theta\{L_\theta(Y) < \eta_\theta\}.        (5.80)

For unknown \theta, we consider two strategies. First, the GLRT setting, where the test takes the form

\delta_{GLRT}(y) = 1 if \max_{\theta \in \mathcal{X}_1} L_\theta(y) \ge \eta, and 0 otherwise.

Second, the Bayesian setting, where \theta is viewed as a realization of a random variable \Theta with prior \pi, and costs are uniform over each hypothesis (as studied in Section 4.2.1).

² If this was not the case, a linear reparameterization of \theta could be used to obtain L^\top L = I_d.

The likelihood ratio is given by
L(y) = \int_{\mathcal{X}_1} L_\theta(y)\, \pi(\theta)\, d\nu(\theta)

and the Bayes test takes the form

\delta_B(y) = 1 if L(y) \ge \eta, and 0 otherwise.

5.9.2 Linear Gaussian Model

P_F(\delta_{GLRT}) = P\{\ln L_G(Y) > \ln\eta \mid H_0\} = P\{\chi_d^2 > 2\ln\eta\}.

Under H_1, we have Y \sim N(L\theta, \sigma^2 I_n), hence \tilde Y = L^\top Y \sim N(\theta, \sigma^2 I_d). Thus \ln L_G(Y) is half a noncentral \chi^2 random variable with d degrees of freedom and noncentrality parameter \lambda = \|\theta\|^2/\sigma^2. The probability of correct detection is given by

P_D^\theta(\delta_{GLRT}) = P\{\ln L_G(Y) > \ln\eta\} = P\{\chi_d^2(\lambda) > 2\ln\eta\}.
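The \chi^2 expressions for the linear-model GLRT error probabilities can be evaluated directly (a sketch with hypothetical values of d, \sigma^2, \eta, and \theta):

```python
import numpy as np
from scipy.stats import chi2, ncx2

d, sigma2, eta = 3, 1.0, 20.0
theta = np.array([1.0, -2.0, 0.5])     # hypothetical true parameter under H1
lam = theta @ theta / sigma2           # noncentrality ||theta||^2 / sigma^2

PF = chi2.sf(2.0 * np.log(eta), df=d)            # central chi^2 tail under H0
PD = ncx2.sf(2.0 * np.log(eta), df=d, nc=lam)    # noncentral chi^2 tail under H1
assert 0 < PF < PD < 1
```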
5.9.3 Nonlinear Gaussian Model

Performance analysis in the case where the signal depends nonlinearly on \theta is more complicated than in the linear case. We present a well-known example for which the analysis is tractable.
Example 5.4  Sinusoid Detection. Assume the Gaussian noise model of (5.75) with \mathcal{X}_1 = [0, 2\pi], \omega = 2\pi k/n where k \in \{1, 2, \dots, n\}, and s_i(\theta) = \sin(\omega i + \theta) for i = 1, 2, \dots, n. Thus s(\theta) is a sinusoid with known angular frequency \omega and unknown phase \theta. Moreover \|s(\theta)\|^2 = n/2 for all \theta. Assume equal priors on H_0 and H_1, hence \eta = 1. Using the identity \sin(\omega i + \theta) = \sin(\omega i)\cos\theta + \cos(\omega i)\sin\theta, we write the GLR statistic as

T_{GLR} = \max_{0 \le \theta \le 2\pi} s(\theta)^\top Y = \max_{0 \le \theta \le 2\pi} [Y_s \cos\theta + Y_c \sin\theta] = \sqrt{Y_s^2 + Y_c^2} \triangleq R,

where Y_s = \sum_{i=1}^n \sin(\omega i)\, Y_i and Y_c = \sum_{i=1}^n \cos(\omega i)\, Y_i are the projections of the observation vector onto the sine vector and the cosine vector with angular frequency \omega and zero phase. We have

\delta_{GLRT}(Y) = 1 if R \ge \tau, and 0 otherwise,        (5.82)

where \tau = \sigma^2 \ln\eta + n/4 = n/4. Likewise define the noise vector projections Z_s \triangleq \sum_{i=1}^n \sin(\omega i)\, Z_i and Z_c \triangleq \sum_{i=1}^n \cos(\omega i)\, Z_i. Since Z_i, 1 \le i \le n, are i.i.d. N(0, \sigma^2), it follows from the trigonometric identities

\sum_{i=1}^n \sin^2(\omega i) = \sum_{i=1}^n \cos^2(\omega i) = \frac{n}{2}  and  \sum_{i=1}^n \sin(\omega i)\cos(\omega i) = \frac12 \sum_{i=1}^n \sin(2\omega i) = 0

that the random variables Z_s and Z_c are i.i.d. N(0, n\sigma^2/2). Therefore, under H_0, R^2 = Z_s^2 + Z_c^2 follows an exponential distribution with mean equal to n\sigma^2. Then we obtain a closed-form expression for the false-alarm probability of the GLRT:

P_F(\delta_{GLRT}) = P\{R^2 \ge \tau^2 \mid H_0\} = \exp\Bigl\{-\frac{\tau^2}{n\sigma^2}\Bigr\} = \exp\Bigl\{-\frac{n}{16\sigma^2}\Bigr\}.

Under H_1, the random variables Y_s and Y_c are independent with respective means

E_\theta(Y_s) = \sum_{i=1}^n \sin(\omega i)\sin(\omega i + \theta) = \frac{n}{2}\cos\theta,
E_\theta(Y_c) = \sum_{i=1}^n \cos(\omega i)\sin(\omega i + \theta) = \frac{n}{2}\sin\theta,

and the same variances as under H_0. Hence Y_s \sim N(\frac n2 \cos\theta, \frac{n\sigma^2}{2}), Y_c \sim N(\frac n2 \sin\theta, \frac{n\sigma^2}{2}), and R \sim \mathrm{Rice}(\frac n2, \frac{n\sigma^2}{2}) (for all \theta). (See Appendix C for the definition of the Ricean distribution.) The probability of correct detection is obtained using (C.7):

P_D^\theta(\delta_{GLRT}) = P\{R \ge \tau\} = \int_\tau^\infty \frac{2r}{n\sigma^2} \exp\Bigl\{-\frac{r^2 + (n/2)^2}{n\sigma^2}\Bigr\} I_0\Bigl(\frac{r}{\sigma^2}\Bigr)\, dr = Q\Bigl(\sqrt{\frac{n}{2\sigma^2}},\ \tau\sqrt{\frac{2}{n\sigma^2}}\Bigr),  \forall\theta \in \mathcal{X}_1,

where Q(\rho, t) \triangleq \int_t^\infty x \exp\{-\frac12(x^2 + \rho^2)\}\, I_0(\rho x)\, dx is Marcum's Q-function.

It is also instructive to derive the Bayes rule assuming a uniform prior on \Theta in this case. The likelihood ratio is then given by

L(Y) = \frac{1}{2\pi}\int_0^{2\pi} \exp\Bigl\{\frac{1}{\sigma^2} s(\theta)^\top Y - \frac{n}{4\sigma^2}\Bigr\}\, d\theta
     = \exp\Bigl\{-\frac{n}{4\sigma^2}\Bigr\} \frac{1}{2\pi}\int_0^{2\pi} \exp\Bigl\{\frac{Y_c\sin\theta + Y_s\cos\theta}{\sigma^2}\Bigr\}\, d\theta
     = \exp\Bigl\{-\frac{n}{4\sigma^2}\Bigr\}\, I_0\Bigl(\frac{\sqrt{Y_c^2 + Y_s^2}}{\sigma^2}\Bigr),

where I_0 is the modified Bessel function of the first kind. Hence an optimal test statistic is \sqrt{Y_c^2 + Y_s^2}, which is identical to the GLR statistic T_{GLR}, and the Bayes test \delta_B coincides with \delta_{GLRT}.
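A short simulation (our own sketch and parameter choices) confirms the closed-form false-alarm probability exp{−n/(16σ²)} for the sinusoid detector:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, sigma = 64, 5, 1.0
omega = 2.0 * np.pi * k / n
i = np.arange(1, n + 1)
sv, cv = np.sin(omega * i), np.cos(omega * i)   # sine and cosine templates
tau = n / 4.0                                   # threshold for eta = 1

PF_exact = np.exp(-n / (16.0 * sigma**2))       # closed form derived above
Z = sigma * rng.normal(size=(100_000, n))       # observations under H0
R = np.hypot(Z @ sv, Z @ cv)                    # GLR statistic R
PF_mc = np.mean(R >= tau)
assert abs(PF_mc - PF_exact) < 0.01
```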
5.9.4 Discrete Parameter Set
Next we turn our attention to the case of discrete \mathcal{X}_1, for which performance bounds can be conveniently derived. Again we consider the case of known \theta first. By (5.32) and (5.33), the probabilities of false alarm and miss for \delta_B^\theta are given by

P_F(\delta_B^\theta) = Q\Bigl(\frac{\tau_\theta}{\sigma\|s(\theta)\|}\Bigr),  P_M^\theta(\delta_B^\theta) = Q\Bigl(\frac{\|s(\theta)\|^2 - \tau_\theta}{\sigma\|s(\theta)\|}\Bigr).        (5.83)

In the special case of equal-energy signals, we have \|s(\theta)\|^2 = n E_S for each \theta \in \mathcal{X}_1. For equal priors \pi(H_0) = \pi(H_1) = \frac12, the test threshold is given by \tau_\theta = \frac12 n E_S, and thus

P_F(\delta_B^\theta) = P_M^\theta(\delta_B^\theta) = Q\Bigl(\frac{\sqrt{n E_S}}{2\sigma}\Bigr).        (5.84)
Performance Bounds for GLRT.  For unknown \theta, we derive upper bounds on P_F(\delta_{GLRT}) and P_M(\delta_{GLRT}), as well as a lower bound on P_F(\delta_{GLRT}). The first part of the derivation does not require a Gaussian observation model. The false-alarm probability for the GLRT with threshold \eta is given by

P_F(\delta_{GLRT}) = P\Bigl\{ \max_{\theta \in \mathcal{X}_1} L_\theta(Y) \ge \eta \,\Big|\, H_0 \Bigr\} = P\Bigl\{ \bigcup_{\theta \in \mathcal{X}_1} \{L_\theta(Y) \ge \eta\} \,\Big|\, H_0 \Bigr\}        (5.85)
  \overset{(a)}{\le} |\mathcal{X}_1| \max_{\theta \in \mathcal{X}_1} P\{L_\theta(Y) \ge \eta \mid H_0\},        (5.86)

where the inequality (a) follows from the union-of-events bound. Also, for each \theta \in \mathcal{X}_1 the probability of miss is given by

P_M^\theta(\delta_{GLRT}) = P_\theta\Bigl\{ \max_{\theta' \in \mathcal{X}_1} L_{\theta'}(Y) < \eta \Bigr\} \le P_\theta\{L_\theta(Y) < \eta\},        (5.87)

where the last inequality uses the fact that L_\theta(Y) is a lower bound on \max_{\theta' \in \mathcal{X}_1} L_{\theta'}(Y). Specializing (5.86) to the Gaussian problem of (5.75), and using (5.78), we obtain

P_F(\delta_{GLRT}) \le |\mathcal{X}_1| \max_{\theta \in \mathcal{X}_1} P\{s(\theta)^\top Z \ge \tau\} \le |\mathcal{X}_1| \max_{\theta \in \mathcal{X}_1} Q\Bigl(\frac{\tau}{\sigma\|s(\theta)\|}\Bigr).        (5.88)
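The union bound (5.88) is easy to compare with the simulated false-alarm probability. The sketch below (our own; it uses m orthogonal equal-energy signals, so the GLRT reduces to thresholding max_θ s(θ)ᵀy) checks that the simulated P_F sits below the bound:

```python
import numpy as np
from math import sqrt, erfc

rng = np.random.default_rng(5)
n, m, sigma, ES = 16, 4, 1.0, 1.0
S = np.sqrt(n * ES) * np.eye(n)[:m]     # m orthogonal signals with ||s||^2 = n E_S
tau = 0.5 * n * ES                      # equal-prior threshold

def Qfun(x):                            # Gaussian tail probability Q(x)
    return 0.5 * erfc(x / sqrt(2.0))

bound = m * Qfun(tau / (sigma * sqrt(n * ES)))      # right side of (5.88)
Z = sigma * rng.normal(size=(200_000, n))           # observations under H0
PF_mc = np.mean(np.max(Z @ S.T, axis=1) >= tau)     # GLRT false-alarm rate
assert PF_mc <= bound
```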
Applying (5.87), we also obtain a corresponding upper bound on the miss probability.

Exercises

5.1  … the randomized test

\tilde\delta(y) = 1 if T(y) > \tau; 1 w.p. \gamma if T(y) = \tau; 0 if T(y) < \tau,

… > 0 and b \in (0, 2]. Show that each observation is passed through a nonlinear function, and sketch this function for b = 1/2 and for b = 3/2.
5.2  Poisson Observations. Denote by P_\lambda the Poisson distribution with parameter \lambda > 0. Given two length-n sequences s and \lambda such that 0 < s_i < \lambda_i for all 1 \le i \le n, consider the binary hypothesis test

H_0 : Y_i \overset{\text{indep.}}{\sim} P_{\lambda_i + s_i},    1 \le i \le n,
H_1 : Y_i \overset{\text{indep.}}{\sim} P_{\lambda_i - s_i},

and assume equal priors. Derive the LRT and show that it reduces to a correlation test.

5.3  Additive and Multiplicative Noise. Consider a known signal s, a parameter \theta_0 > 0, and two i.i.d. N(0, 1) noise sequences Z and W. We test
H_0 : \theta = 0
H_1 : \theta = \theta_0,

where the observations are given by Y_i = \sqrt{\theta}\, s_i W_i + Z_i for 1 \le i \le n.

(a) Derive the NP detector at significance level \alpha.
(b) Let s_i = a for all 1 \le i \le n. Give an approximation to the threshold of the NP test of part (a) that is valid for large n.
(c) Now assume \theta is unknown and we need to test H_0: \theta = 0 versus H_1: \theta > 0. Give a necessary and sufficient condition on s such that a UMP test exists.
(d) Assume that \theta is unknown and we need to test H_0: \theta = 0 versus H_1: \theta \ne 0. Derive the GLRT and simplify it in the case s_i = a for all 1 \le i \le n.

5.4  CFAR Detector. Consider the following problem of detecting an unknown signal in exponential noise:
H_0 : Y_k = Z_k,    k = 1, \dots, n
H_1 : Y_k = s_k + Z_k,    k = 1, \dots, n,

where the signal is unknown except that we know that s_k > 0 for all k, and the noise variables Z_1, \dots, Z_n are i.i.d. Exp(1/\mu) random variables, i.e., p_{Z_k}(z) = \mu e^{-\mu z} 1\{z \ge 0\}, k = 1, \dots, n, with \mu > 0. The false-alarm probability of a LRT for this problem is a function of \mu. Now suppose \mu is unknown, and we wish to design a detector whose false-alarm rate is independent of \mu and is still able to detect the signal reasonably well. Such a detector is called a CFAR (constant false-alarm rate) detector.

(a) Let n \ge 3. Show that the following detector is CFAR:

\delta_\eta(y) = 1 if y_2 + y_3 + \cdots + y_n \ge \eta y_1, and 0 otherwise, where \eta > 0.
(b) Suppose n is a large integer. Use the Law of Large Numbers to find a good approximation for \eta as a function of n and \alpha, for an \alpha-level CFAR test.
(c) Now find a closed-form expression for \eta as a function of \alpha, for an \alpha-level CFAR test. Compare your answer to the approximation you found in part (b).

5.5  Signal Detection in Correlated Gaussian Noise. Derive the minimum error probability for detecting the length-3 signal s = [1\ 2\ 3]^\top in Gaussian noise with mean zero and covariance matrix

K = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 2 & 1 \\ 0 & 1 & 2 \end{bmatrix}.

5.6  Signal Detection in Correlated Gaussian Noise. Consider the detection problem with
H_0 : Y = \begin{bmatrix} -a \\ 0 \end{bmatrix} + Z
H_1 : Y = \begin{bmatrix} a \\ 0 \end{bmatrix} + Z,

where Z \sim N(0, C_Z) with

C_Z = \begin{bmatrix} 1 & \rho \\ \rho & 1 + \rho^2 \end{bmatrix}.

Assume that a > 0 and \rho \in (0, 1).

(a) For equal priors, show that the minimum-probability-of-error detector is given by

\delta_B(y) = 1 if y_1 - b y_2 \ge \tau, and 0 otherwise,
… > 0 is an unknown parameter.

(a) Show that the LMP test as \theta \downarrow 0 uses the quadratic test statistic T_{LMP}(y) = y^\top C_S\, y. Hint: spectral factorization of C_S may be useful here.
(b) Show that the quadratic statistic that maximizes the deflection is independent of \theta and equals T_{LMP}(y).

References
[1] W. Feller, An Introduction to Probability Theory and Its Applications, Vol. II, Wiley, 1971.
[2] P. C. Mahalanobis, "On the Generalised Distance in Statistics," Proc. Nat. Inst. Sci. India, Vol. 2, pp. 49–55, 1936.
[3] H. E. Daniels, "Tail Probability Approximations," Int. Stat. Review, Vol. 44, pp. 37–48, 1954.
[4] R. Lugannani and S. Rice, "Saddlepoint Approximation for the Distribution of a Sum of Independent Random Variables," Adv. Applied Prob., Vol. 12, pp. 475–490, 1980.
[5] A. DasGupta, Asymptotic Theory of Statistics and Probability, Springer, 2008.
[6] C. R. Baker, "Optimum Quadratic Detection of a Random Vector in Gaussian Noise," IEEE Trans. Commun. Tech., Vol. COM-14, No. 6, pp. 802–805, 1966.
[7] B. Picinbono and P. Duvaut, "Optimum Linear Quadratic Systems for Detection and Estimation," IEEE Trans. Inform. Theory, Vol. 34, No. 2, pp. 304–311, 1988.
[8] V. V. Veeravalli and H. V. Poor, "Quadratic Detection of Signals with Drifting Phase," J. Acoust. Soc. Am., Vol. 89, No. 2, pp. 811–819, 1991.
6 Convex Statistical Distances
As we saw in previous chapters, most notably Chapter 5, deriving exact expressions for the performance of detection algorithms is in general difficult, and we often have to rely on bounds on the probabilities of error. To derive such bounds, we first introduce convex statistical distances between distributions, termed f-divergences or Ali–Silvey distances. They include the Kullback–Leibler divergence, Chernoff divergences, and total variation distance as special cases. We identify the basic properties of and relations between those distances, and derive the fundamental data processing inequality (DPI). Ali–Silvey distances have also been used as criteria for signal selection or feature selection in pattern recognition systems [1, 2, 3, 4, 5, 6, 7, 8], and as surrogate loss functions in machine learning [9, 10].

The most fundamental inequality used in this chapter is Jensen's inequality. For any convex function f : \mathbb{R} \to \mathbb{R} and probability distribution P on \mathbb{R}, Jensen's inequality states that E[f(X)] \ge f(E[X]). We often use the shorthand \int_{\mathcal{Y}} f\, d\mu = \int_{\mathcal{Y}} f(y)\, d\mu(y) to denote the integral of a function f : \mathcal{Y} \to \mathbb{R}.

6.1 Kullback–Leibler Divergence
The notion of Kullback–Leibler divergence [11, 12] plays a central role in hypothesis testing and related fields, including probability theory, statistical inference, and information theory.

DEFINITION 6.1  Let P_0 and P_1 be two probability distributions on the same space \mathcal{Y}, and assume P_0 is dominated by P_1, i.e., supp\{P_0\} \subseteq supp\{P_1\}. The functional

D(p_0 \| p_1) \triangleq \int_{\mathcal{Y}} p_0(y) \ln\frac{p_0(y)}{p_1(y)}\, d\mu(y)        (6.1)

is known as the Kullback–Leibler (KL) divergence, or relative entropy, between the distributions P_0 and P_1. If P_0 is not dominated by P_1, then D(p_0 \| p_1) = \infty. In (6.1), we use the notational convention 0 ln 0 = 0.

PROPOSITION 6.1  The KL divergence D(p_0 \| p_1) satisfies the following properties:

(i) Nonnegativity. D(p_0 \| p_1) \ge 0, with equality if and only if p_0(Y) = p_1(Y) a.s.
(ii) No triangle inequality. There exist distributions p_0, p_1, p_2 such that D(p_0 \| p_2) > D(p_0 \| p_1) + D(p_1 \| p_2).
(iii) Asymmetry. D(p_0 \| p_1) \ne D(p_1 \| p_0) in general.
(iv) Convexity. The functional D(p_0 \| p_1) is jointly convex in (p_0, p_1).
(v) Additivity for product distributions. If both p_0 and p_1 are product distributions, i.e., the observations y = \{y_i\}_{i=1}^n and p_j(y) = \prod_{i=1}^n p_{ji}(y_i) for j = 0, 1, then D(p_0 \| p_1) = \sum_{i=1}^n D(p_{0i} \| p_{1i}).
(vi) Data-processing inequality. Consider a stochastic mapping from \mathcal{Y} to \mathcal{Z}, represented by a conditional pdf w(z|y). Define the marginals

q_0(z) = \int_{\mathcal{Y}} w(z|y)\, p_0(y)\, d\mu(y)  and  q_1(z) = \int_{\mathcal{Y}} w(z|y)\, p_1(y)\, d\mu(y).

Then D(q_0 \| q_1) \le D(p_0 \| p_1), with equality if and only if the mapping from Y to Z is invertible.

In view of properties (ii) and (iii) in Proposition 6.1, the KL divergence is not a distance in the usual mathematical sense. The additivity property (v) holds because

D(p_0 \| p_1) = E_0\Bigl[\ln\frac{p_0(Y)}{p_1(Y)}\Bigr] = \sum_{i=1}^n E_0\Bigl[\ln\frac{p_{0i}(Y_i)}{p_{1i}(Y_i)}\Bigr] = \sum_{i=1}^n D(p_{0i} \| p_{1i}).

The nonnegativity property (i), the convexity property (iv), and the data-processing inequality (vi) will not be proved here, as they are special cases of a more general result to be presented in Section 6.4.
Example 6.1  Gaussians with Same Variance. Let p_0 = N(0, 1) and p_1 = N(\theta, 1). We obtain

D(p_0 \| p_1) = E_0\Bigl[\ln\frac{p_0(Y)}{p_1(Y)}\Bigr] = E_0\Bigl[-\frac{Y^2}{2} + \frac{(Y - \theta)^2}{2}\Bigr] = E_0\Bigl[-\theta Y + \frac{\theta^2}{2}\Bigr] = \frac{\theta^2}{2}.        (6.2)

Example 6.2  Gaussians with Same Mean. Let p_0 = N(0, 1) and p_1 = N(0, \sigma^2). We obtain

D(p_0 \| p_1) = E_0\Bigl[\ln\frac{p_0(Y)}{p_1(Y)}\Bigr] = E_0\Bigl[-\frac12\ln(2\pi) - \frac{Y^2}{2} + \frac12\ln(2\pi\sigma^2) + \frac{Y^2}{2\sigma^2}\Bigr]
= \frac12\Bigl(\ln\sigma^2 - 1 + \frac{1}{\sigma^2}\Bigr) = \psi\Bigl(\frac{1}{\sigma^2}\Bigr),        (6.3)

where we have defined the function

\psi(x) = \frac12 (x - 1 - \ln x),  x > 0,        (6.4)

which is nonnegative and convex, and achieves its minimum at x = 1. The value of the minimum is \psi(1) = 0; see Figure 6.1.

Figure 6.1  The function \psi(x).
Example 6.3  Multivariate Gaussians. Let p_0 = N(\mu_0, K_0) and p_1 = N(\mu_1, K_1), where the covariance matrices K_0 and K_1 are d × d and have full rank. Then (see Exercise 6.2)

D(p_0 \| p_1) = \frac12\Bigl[\mathrm{Tr}(K_1^{-1} K_0) + (\mu_1 - \mu_0)^\top K_1^{-1} (\mu_1 - \mu_0) - d - \ln\frac{|K_0|}{|K_1|}\Bigr].        (6.5)

Example 6.4  Two Uniform Random Variables. Let \theta > 1 and let p_0 and p_1 be uniform distributions over the intervals [0, 1] and [0, \theta], respectively. We obtain

D(p_0 \| p_1) = \int_0^1 \ln\frac{p_0(y)}{p_1(y)}\, dy = \ln\theta.

However, D(p_1 \| p_0) = \infty.
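Closed forms like (6.2) are easily checked by Monte Carlo, since the KL divergence is the mean of the log-likelihood ratio under p_0 (a sketch with our own parameter choice):

```python
import numpy as np

rng = np.random.default_rng(6)
theta = 1.5
Y = rng.normal(size=500_000)             # samples from p0 = N(0, 1)
llr = -theta * Y + theta**2 / 2.0        # ln p0(Y)/p1(Y) for p1 = N(theta, 1)
D_mc = llr.mean()                        # Monte Carlo estimate of D(p0 || p1)
assert abs(D_mc - theta**2 / 2.0) < 0.01
```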
6.2 Entropy and Mutual Information
The information-theoretic notions of entropy and mutual information appear in the classification bounds of Section 7.5.3 and can be derived from KL divergence [13].
DEFINITION 6.2  The entropy of a discrete distribution \pi = (\pi_1, \dots, \pi_m) is

H(\pi) \triangleq \sum_{j=1}^m \pi_j \ln\frac{1}{\pi_j}.        (6.6)

Denote by U the uniform distribution over \{1, 2, \dots, m\}. Then entropy is related to KL divergence as follows:

H(\pi) = \ln m - \sum_{j=1}^m \pi_j \ln\frac{\pi_j}{1/m} = \ln m - D(\pi \| U),        (6.7)

from which we conclude that the function H(\pi) is concave and satisfies the inequalities 0 \le H(\pi) \le \ln m, where the lower bound is achieved by any degenerate distribution, and the upper bound is achieved by the uniform distribution.

DEFINITION 6.3  The binary entropy function

h_2(a) \triangleq -a \ln a - (1 - a)\ln(1 - a) = H([a, 1 - a]),  0 \le a \le 1,        (6.8)

is the entropy of a Bernoulli distribution with parameter a. It follows from the properties of the entropy function that h_2 is concave, symmetric around a = 1/2, and satisfies h_2(0) = h_2(1) = 0 and h_2(1/2) = \ln 2.

Given a pair (J, Y) with joint distribution \pi_j p_j(y) for j \in \{1, 2, \dots, m\} and y \in \mathcal{Y}, denote by

p(y) \triangleq \sum_j \pi_j p_j(y)  the marginal distribution on \mathcal{Y},
(\pi \times p_J)(j, y) \triangleq \pi_j p_j(y)  the joint distribution of (J, Y),
(\pi \times p)(j, y) \triangleq \pi_j p(y)  the product of its marginals.

DEFINITION 6.4  The mutual information between J and Y is the KL divergence between the joint distribution of (J, Y) and the product of its marginals:

I(J; Y) \triangleq D(\pi \times p_J \| \pi \times p) = \sum_{j=1}^m \pi_j \int_{\mathcal{Y}} p_j(y) \ln\frac{p_j(y)}{p(y)}\, d\mu(y) = \sum_{j=1}^m \pi_j D(p_j \| p).        (6.9)

LEMMA 6.1

I(J; Y) \le H(\pi) \le \ln m.        (6.10)

Proof  The second inequality was given following (6.7). To show the first inequality, write
I(J; Y) \overset{(a)}{=} \sum_{j=1}^m \pi_j \int_{\mathcal{Y}} p_j(y) \ln\frac{\pi(j|y)}{\pi_j}\, d\mu(y) = \sum_{j=1}^m \pi_j \ln\frac{1}{\pi_j} + \underbrace{E[\ln \pi(J|Y)]}_{\le 0} \le \sum_{j=1}^m \pi_j \ln\frac{1}{\pi_j} = H(\pi),

where equality (a) follows from the definition (6.9) and the identity \pi_j p_j(y) = \pi(j|y)\, p(y).
6.3 Chernoff Divergence, Chernoff Information, and Bhattacharyya Distance
In addition to the KL divergence, the notion of Chernoff divergence also plays a central role in hypothesis testing [14].

DEFINITION 6.5  The Chernoff divergence between two probability distributions P_0 and P_1 on a common space is defined by

d_u(p_0, p_1) \triangleq -\ln \int_{\mathcal{Y}} p_0^{1-u}(y)\, p_1^u(y)\, d\mu(y),  u \in [0, 1].        (6.11)

PROPOSITION 6.2  The Chernoff divergence d_u(p_0, p_1) satisfies the following properties:

(i) Nonnegativity. d_u(p_0, p_1) \ge 0, with equality if and only if p_0(Y) = p_1(Y) a.s.
(ii) No triangle inequality. d_u(p_0, p_1) does not satisfy the triangle inequality, even for u = 1/2.
(iii) Asymmetry. d_u(p_0, p_1) \ne d_u(p_1, p_0) in general, except for u = 1/2.
(iv) Convexity. The functional d_u(p_0, p_1) is jointly convex in (p_0, p_1) for each u \in (0, 1).
(v) Additivity for product distributions. If both p_0 and p_1 are product distributions, i.e., the observations y = \{y_i\}_{i=1}^n and p_j(y) = \prod_{i=1}^n p_{ji}(y_i) for j = 0, 1, then d_u(p_0, p_1) = \sum_{i=1}^n d_u(p_{0i}, p_{1i}).
(vi) Data-processing inequality. Consider a stochastic mapping from \mathcal{Y} to \mathcal{Z}, represented by a conditional pdf w(z|y). Define the marginals

q_0(z) = \int_{\mathcal{Y}} w(z|y)\, p_0(y)\, d\mu(y),  q_1(z) = \int_{\mathcal{Y}} w(z|y)\, p_1(y)\, d\mu(y).

Then d_u(q_0, q_1) \le d_u(p_0, p_1), with equality if and only if Z is an invertible function of Y.

In view of properties (ii) and (iii) in Proposition 6.2, the Chernoff divergence is not a distance in the usual mathematical sense. Properties (i), (iv), and (vi) are special cases of a more general result to be derived in Section 6.4. Property (v) holds because
d_u(p_0, p_1) = -\ln E_0\Bigl[\Bigl(\frac{p_1(Y)}{p_0(Y)}\Bigr)^u\Bigr] = -\ln E_0\Bigl[\prod_{i=1}^n \Bigl(\frac{p_{1i}(Y_i)}{p_{0i}(Y_i)}\Bigr)^u\Bigr] = -\ln \prod_{i=1}^n E_0\Bigl[\Bigl(\frac{p_{1i}(Y_i)}{p_{0i}(Y_i)}\Bigr)^u\Bigr] = \sum_{i=1}^n d_u(p_{0i}, p_{1i}),
where the third equality holds because the expectation of a product of independent random variables is equal to the product of their individual expectations.

In the special case u = 1/2, the Chernoff divergence is known as the Bhattacharyya distance:

B(p_0, p_1) \triangleq d_{1/2}(p_0, p_1) = -\ln \int_{\mathcal{Y}} \sqrt{p_0(y)\, p_1(y)}\, d\mu(y).        (6.12)

Related quantities of interest are the Chernoff u-coefficient

\rho_u \triangleq \int_{\mathcal{Y}} p_0^{1-u}\, p_1^u\, d\mu = e^{-d_u(p_0, p_1)},  u \in [0, 1],        (6.13)

and the Bhattacharyya coefficient

\rho_{1/2} = \int_{\mathcal{Y}} \sqrt{p_0\, p_1}\, d\mu = e^{-B(p_0, p_1)}.        (6.14)

Also, maximizing the Chernoff divergence over the parameter u yields a symmetric distance measure called the Chernoff information:

C(p_0, p_1) \triangleq \max_{u \in [0,1]} d_u(p_0, p_1) = \max_{u \in [0,1]} \Bigl\{ -\ln \int_{\mathcal{Y}} p_0^{1-u}(y)\, p_1^u(y)\, d\mu(y) \Bigr\}.        (6.15)

The Bhattacharyya distance and the Chernoff information, while both symmetric in p_0 and p_1, are still not distances in the usual mathematical sense: they do not satisfy the triangle inequality. Scaling the Chernoff divergence, one obtains the Rényi divergence [15]

D_u(p_0 \| p_1) \triangleq \frac{1}{1 - u}\, d_u(p_0, p_1),  u \ne 1,        (6.16)

which has the property that \lim_{u \to 1} D_u(p_0 \| p_1) = D(p_0 \| p_1).

There is another interesting connection between the Chernoff information and the KL divergence. If we define the "geometric mixture" of p_0 and p_1 by

p_u(y) \triangleq \frac{p_0^{1-u}(y)\, p_1^u(y)}{\int_{\mathcal{Y}} p_0^{1-u}(t)\, p_1^u(t)\, d\mu(t)},        (6.17)

then it can be shown that (see Exercise 6.4)

C(p_0, p_1) = D(p_{u^*} \| p_0) = D(p_{u^*} \| p_1),        (6.18)

where u^* achieves the maximum in (6.15).
Example 6.5  Multivariate Gaussians. Let p_0 = N(\mu_0, K_0) and p_1 = N(\mu_1, K_1), where the covariance matrices K_0 and K_1 are m × m and have full rank. Then (see Exercise 6.2)

d_u(p_0, p_1) = \frac{u(1 - u)}{2} (\mu_1 - \mu_0)^\top K^{-1} (\mu_1 - \mu_0) + \frac12 \ln\frac{|(1 - u)K_0 + u K_1|}{|K_0|^{1-u}\, |K_1|^u},        (6.19)

where K \triangleq (1 - u)K_0 + u K_1.

Example 6.6  Uniform Random Variables. Let \theta > 1 and let p_0 and p_1 be uniform distributions over the intervals [0, 1] and [0, \theta], respectively. We obtain

d_u(p_0, p_1) = -\ln \int_0^1 p_0^{1-u}(y)\, p_1^u(y)\, dy = u \ln\theta.
6.4 Ali–Silvey Distances

Consider two probability distributions p_0 and p_1 on a common space \mathcal{Y}. In Sections 6.1 and 6.3 we have defined the KL divergence

D(p_0 \| p_1) = E_0\Bigl[\ln\frac{p_0(Y)}{p_1(Y)}\Bigr]

and the Chernoff divergence of order u

d_u(p_0, p_1) = -\ln E_0\Bigl[\Bigl(\frac{p_1(Y)}{p_0(Y)}\Bigr)^u\Bigr],  u \in [0, 1].
Other statistical "distances" of interest include:

(i) Total variation (an actual distance):

d_V(p_0, p_1) = \frac12 \int_{\mathcal{Y}} |p_1(y) - p_0(y)|\, d\mu(y) = \frac12 E_0\Bigl|\frac{p_1(Y)}{p_0(Y)} - 1\Bigr|.        (6.20)

It may be shown (see Exercise 6.3) that

d_V(p_0, p_1) = \sup_{\mathcal{Y}_1 \subseteq \mathcal{Y}} [P_1(\mathcal{Y}_1) - P_0(\mathcal{Y}_1)]        (6.21)

(where \mathcal{Y}_1 ranges over the Borel subsets of \mathcal{Y}), which is actually the most general definition of d_V for arbitrary probability measures. We also write

d_V(\pi_0 p_0, \pi_1 p_1) = \frac12 \int_{\mathcal{Y}} |\pi_1 p_1(y) - \pi_0 p_0(y)|\, d\mu(y) = \sup_{\mathcal{Y}_1 \subseteq \mathcal{Y}} [\pi_1 P_1(\mathcal{Y}_1) - \pi_0 P_0(\mathcal{Y}_1)].        (6.22)

(ii) Pearson's \chi^2 divergence:

\chi^2(p_1, p_0) = \int_{\mathcal{Y}} \frac{[p_1(y) - p_0(y)]^2}{p_0(y)}\, d\mu(y) = E_0\Bigl[\Bigl(\frac{p_1(Y)}{p_0(Y)} - 1\Bigr)^2\Bigr] = E_0\Bigl[\Bigl(\frac{p_1(Y)}{p_0(Y)}\Bigr)^2\Bigr] - 1.        (6.23)
(iii) Squared Hellinger distance:
H
2
² ¾ ¾ ( p0 , p1 ) = [ p0 ( y) − p1( y)]2 dµ( y) = 2(1 − ρ1/2 ). Y
(iv) Symmetrized divergence: 1 [d( p0, p1 ) + d ( p1, p0 )], (6.24) 2 for which dsym ( p0, p1 ) = dsym ( p1 , p0). Note that dsym might not satisfy the triangle inequality and so might not be an actual distance. (v) Jensen–Shannon divergence:
dsym ( p0 , p1) =
dJS ( p0 , p1) =
´ ·
1 D 2
p0 + p1 p0 ± 2
¸
+D
·
p0 + p1 p1 ± 2
¸µ
,
(6.25)
which is the square of a distance [16]. It has been used in random graph theory [17] and has recently become popular in machine learning [18].

Denote by Pe(δ) the Bayesian error probability for a binary hypothesis test between p0 and p1 using decision rule δ : Y → {0, 1}. There is a remarkable connection between the minimum error probability and total variation. For any deterministic decision rule δ with acceptance region Y1, we have

Pe(δ) = π0 P0(Y1) + π1 P1(Y0)
      = π1 − [π1 P1(Y1) − π0 P0(Y1)]
      ≥ π1 − dV(π0 p0, π1 p1),

where the last line follows from the definition (6.22). The lower bound is achieved by the MPE rule, hence

Pe(δB) = π1 − dV(π0 p0, π1 p1).   (6.26)

In the case of a uniform prior, we obtain Pe(δB) = (1/2)[1 − dV(p0, p1)].
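For discrete distributions these relations are easy to check numerically. The following sketch (using numpy, with made-up pmfs p0, p1 and priors — none of these numbers come from the text) verifies (6.20), (6.21), and (6.26):

```python
import numpy as np

# Hypothetical discrete distributions on a 3-letter alphabet.
p0 = np.array([0.7, 0.2, 0.1])
p1 = np.array([0.1, 0.6, 0.3])

# Total variation: the half-L1 form (6.20) and the supremum form (6.21)
# agree; the optimal region is {y : p1(y) > p0(y)}.
d_V = 0.5 * np.abs(p1 - p0).sum()
assert np.isclose(d_V, np.maximum(p1 - p0, 0.0).sum())

# Uniform prior: minimum Bayes error P_e = (1/2)[1 - d_V(p0, p1)].
P_e_uniform = 0.5 * np.minimum(p0, p1).sum()
assert np.isclose(P_e_uniform, 0.5 * (1.0 - d_V))

# General priors: P_e = pi1 - sup_{Y1}[pi1 P1(Y1) - pi0 P0(Y1)], per (6.26).
pi0, pi1 = 0.4, 0.6
P_e = np.minimum(pi0 * p0, pi1 * p1).sum()
sup_term = np.maximum(pi1 * p1 - pi0 * p0, 0.0).sum()
assert np.isclose(P_e, pi1 - sup_term)
```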
As summarized in Table 6.1, the “distances” in Sections 6.1–6.3 are special cases of Ali–Silvey distances, which are of the general form [2]
d(p0, p1) = g( E0[ f(p1(Y)/p0(Y)) ] ),   (6.27)

where g is an increasing function on R and f is a convex function on R+. In the special case where g(t) = t, the Ali–Silvey distance is known as an f-divergence, and

d(p0, p1) = E0[ f(p1(Y)/p0(Y)) ].   (6.28)
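Several of the distances above arise from (6.28) simply by swapping the generator f. A minimal numpy sketch over hypothetical pmfs, cross-checked against the direct definitions:

```python
import numpy as np

p0 = np.array([0.5, 0.3, 0.2])
p1 = np.array([0.2, 0.5, 0.3])
L = p1 / p0  # likelihood ratio p1(y)/p0(y)

def f_div(f):
    """f-divergence (6.28): d(p0, p1) = E0[f(p1(Y)/p0(Y))]."""
    return float((p0 * f(L)).sum())

kl_10 = f_div(lambda x: x * np.log(x))             # D(p1||p0)
kl_01 = f_div(lambda x: -np.log(x))                # D(p0||p1)
chi2  = f_div(lambda x: (x - 1.0) ** 2)            # chi^2(p1, p0)
tv    = f_div(lambda x: 0.5 * np.abs(x - 1.0))     # d_V(p0, p1)
hell2 = f_div(lambda x: 2.0 * (1.0 - np.sqrt(x)))  # H^2(p0, p1)

assert np.isclose(kl_10, (p1 * np.log(p1 / p0)).sum())
assert np.isclose(chi2, ((p1 - p0) ** 2 / p0).sum())
assert np.isclose(tv, 0.5 * np.abs(p1 - p0).sum())
assert np.isclose(hell2, ((np.sqrt(p0) - np.sqrt(p1)) ** 2).sum())
```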
The fundamental properties of Ali–Silvey distances are as follows:

(i) Minimum value. By Jensen’s inequality we have

E0[ f(p1(Y)/p0(Y)) ] ≥ f( E0[ p1(Y)/p0(Y) ] ) = f(1)   (6.29)

with equality if p0 = p1. Moreover, the condition p0 = p1 is necessary for equality if f is strictly convex. Since g is increasing, the inequality (6.29) also implies
Table 6.1 Ali–Silvey distances

g(t)       f(x)                                    d(p0, p1)       Name
t          −ln x                                   D(p0‖p1)        KL divergence
t          x ln x                                  D(p1‖p0)        KL divergence
t          ½(x − 1) ln x                           Dsym(p0, p1)    J divergence
t          (x/2) ln x − ((x+1)/2) ln((x+1)/2)      dJS(p0, p1)     Jensen–Shannon divergence
−ln(−t)    −x^u                                    du(p0, p1)      Chernoff divergence
−ln(−t)    −√x                                     b(p0, p1)       Bhattacharyya distance
t          2(1 − √x)                               H²(p0, p1)      Squared Hellinger distance
t          (x − 1)²                                χ²(p1, p0)      χ² divergence
t          ½|x − 1|                                dV(p0, p1)      Total variation distance
d(p0, p1) ≥ g(f(1)), with equality if p0 = p1. Usually g and f are chosen such that g(f(1)) = 0.

(ii) Data grouping. Consider a many-to-one mapping Z = φ(Y) and denote by φ−1(z) ≜ {y ∈ Y : φ(y) = z} the inverse image of z. Define the marginal distributions q0(z) = ∫_{φ−1(z)} p0(y) dµ(y) and q1(z) = ∫_{φ−1(z)} p1(y) dµ(y). Then d(q0, q1) ≤ d(p0, p1). This property is a special case of the data processing inequality in (iii).

(iii) Data processing inequality. Consider a stochastic mapping from Y to Z, represented by a conditional pdf w(z|y). Define the marginals

q0(z) = ∫_Y w(z|y) p0(y) dµ(y)  and  q1(z) = ∫_Y w(z|y) p1(y) dµ(y).

Then¹

d(q0, q1) ≤ d(p0, p1),   (6.30)

with equality if the likelihood ratio p1(Y)/p0(Y) is a degenerate random variable (under P0) given Z.
Proof Using the law of iterated expectations we write

E0[ f(p1(Y)/p0(Y)) ] = E0[ E0[ f(p1(Y)/p0(Y)) | Z ] ].   (6.31)

For each z we have, by Jensen’s inequality applied to the convex function f,

E0[ f(p1(Y)/p0(Y)) | Z = z ] ≥ f( E0[ p1(Y)/p0(Y) | Z = z ] )
  = f( ∫_Y p(y | H0, Z = z) (p1(y)/p0(y)) dµ(y) )
  = f( ∫_Y (p0(y) w(z|y)/q0(z)) (p1(y)/p0(y)) dµ(y) )
  = f( q1(z)/q0(z) ).   (6.32)

Equality holds in (6.32) if the likelihood ratio p1(Y)/p0(Y) is a degenerate random variable (under P0) given Z = z. Combining (6.31) and (6.32) we obtain

E0[ f(p1(Y)/p0(Y)) ] ≥ E0[ f(q1(Z)/q0(Z)) ].

¹ For a deterministic mapping from y to z we have w(z|y) = 1{z = φ(y)}.
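The data processing inequality can be sanity-checked numerically. The sketch below (hypothetical pmfs and a randomly drawn channel w(z|y)) verifies (6.30) for KL divergence, and checks that an invertible relabeling loses nothing:

```python
import numpy as np

rng = np.random.default_rng(0)
p0 = np.array([0.5, 0.3, 0.2])
p1 = np.array([0.2, 0.5, 0.3])

def kl(p, q):
    return float((p * np.log(p / q)).sum())

# Random stochastic mapping w(z|y): each column is a pmf over z.
W = rng.random((2, 3))
W /= W.sum(axis=0, keepdims=True)

q0, q1 = W @ p0, W @ p1  # marginals of Z under the two hypotheses

# DPI (6.30): divergence cannot increase through the channel.
assert kl(q0, q1) <= kl(p0, p1) + 1e-12

# Degenerate case: a deterministic permutation of labels loses nothing.
P = np.eye(3)[[2, 0, 1]]
assert np.isclose(kl(P @ p0, P @ p1), kl(p0, p1))
```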
The inequality is preserved by application of the increasing function g, hence (6.30) follows.

(iv) Let θ ∈ [a, b] be a real parameter and pθ(y) a parametric family of distributions with the monotone likelihood ratio property in y. That is, the likelihood ratio pθ2(y)/pθ1(y) is strictly increasing in y for each a < θ1 < θ2 < b. Then if a < θ1 < θ2 < θ3 < b, we have

d(pθ1, pθ2) ≤ d(pθ1, pθ3).
Proof See [2].

(v) Joint convexity: f-divergences are jointly convex in (p0, p1).

Proof Define the function g(p, q) ≜ q f(p/q) for 0 ≤ p, q ≤ 1. For any ξ ∈ [0, 1] and 0 ≤ q, q′ ≤ 1, let

λ = ξq/(ξq + (1 − ξ)q′)  ⇒  1 − λ = (1 − ξ)q′/(ξq + (1 − ξ)q′) ∈ [0, 1].

Then

f is convex
⇔ λ f(p/q) + (1 − λ) f(p′/q′) ≥ f(λ(p/q) + (1 − λ)(p′/q′)) = f((ξp + (1 − ξ)p′)/(ξq + (1 − ξ)q′)),  ∀ξ, p, p′, q, q′ ∈ [0, 1]
⇔ ξq f(p/q) + (1 − ξ)q′ f(p′/q′) ≥ (ξq + (1 − ξ)q′) f((ξp + (1 − ξ)p′)/(ξq + (1 − ξ)q′)),  ∀ξ, p, p′, q, q′ ∈ [0, 1]
⇔ ξ g(p, q) + (1 − ξ) g(p′, q′) ≥ g(ξp + (1 − ξ)p′, ξq + (1 − ξ)q′)
⇔ g is convex
⇒ d(p0, p1) = E0[ f(p1(Y)/p0(Y)) ] = ∫_Y g(p1(y), p0(y)) dµ(y) is jointly convex in (p0, p1).
Product Distributions

If p0 = ∏_{i=1}^n p0i and p1 = ∏_{i=1}^n p1i, then for some choices of f, g we have

d(p0, p1) = Σ_{i=1}^n d(p0i, p1i).   (6.33)

We have seen in Sections 6.1 and 6.3 that (6.33) holds for KL divergence and Chernoff divergence. The additivity property also holds for ln(1 + χ²(p0, p1)) but not for χ²(p0, p1) or for dV(p0, p1).
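A quick numerical check of the additivity property (6.33), assuming numpy; the per-coordinate pmfs below are made up:

```python
import numpy as np
from itertools import product

p0i = np.array([0.6, 0.4])
p1i = np.array([0.3, 0.7])
n = 3

def kl(p, q):
    return float((p * np.log(p / q)).sum())

def chi2(p, q):
    return float(((p - q) ** 2 / q).sum())

# Build the product distributions on the n-fold alphabet explicitly.
P0 = np.array([np.prod([p0i[y] for y in ys]) for ys in product(range(2), repeat=n)])
P1 = np.array([np.prod([p1i[y] for y in ys]) for ys in product(range(2), repeat=n)])

# (6.33): KL is additive over independent coordinates.
assert np.isclose(kl(P0, P1), n * kl(p0i, p1i))

# chi^2 is not additive, but ln(1 + chi^2) is.
assert np.isclose(np.log1p(chi2(P1, P0)), n * np.log1p(chi2(p1i, p0i)))
```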
Bregman Divergences

These divergences generalize the notion of distances between pairs of points in a convex set [19]. Let F : R+ → R be a strictly convex function, and p0 and p1 be arbitrary pdfs. The Bregman divergence associated with the generator function F is

BF(p0, p1) ≜ ∫_Y { F(p0(y)) − F(p1(y)) − [p0(y) − p1(y)] F′(p1(y)) } dµ(y),   (6.34)

where F′ denotes the derivative of F. The integrand is nonnegative by convexity of F, and is zero if and only if p0(y) = p1(y). The function BF(p0, p1) is therefore nonnegative and convex in the first argument but not necessarily in the second argument. Examples include:

• F(x) = x², for which F′(x) = 2x and BF(p0, p1) = ‖p0 − p1‖².
• F(x) = x ln x, for which F′(x) = 1 + ln x and

BF(p0, p1) = ∫_Y [p0 ln p0 − p1 ln p1 − (p0 − p1)(1 + ln p1)] dµ = ∫_Y [p0 ln(p0/p1) − p0 + p1] dµ = D(p0‖p1).   (6.35)

• F(x) = −ln x, for which

BF(p0, p1) = ∫_Y [p0/p1 − ln(p0/p1) − 1] dµ   (6.36)

is called the Itakura–Saito distance between p0 and p1.

The KL divergence is the only f-divergence that is also a Bregman divergence.
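Definition (6.34) translates directly into code for discrete pmfs (counting measure). The sketch below, with hypothetical pmfs, recovers the squared distance and the KL divergence from their generators:

```python
import numpy as np

p0 = np.array([0.5, 0.3, 0.2])
p1 = np.array([0.2, 0.5, 0.3])

def bregman(F, dF, p, q):
    """B_F(p, q) per (6.34), for discrete pmfs."""
    return float((F(p) - F(q) - (p - q) * dF(q)).sum())

# F(x) = x^2  =>  B_F = ||p0 - p1||^2.
b_sq = bregman(lambda x: x**2, lambda x: 2 * x, p0, p1)
assert np.isclose(b_sq, ((p0 - p1) ** 2).sum())

# F(x) = x ln x  =>  B_F = D(p0||p1), per (6.35).
b_kl = bregman(lambda x: x * np.log(x), lambda x: 1 + np.log(x), p0, p1)
assert np.isclose(b_kl, (p0 * np.log(p0 / p1)).sum())
```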
6.5
Some Useful Inequalities

THEOREM 6.1 (Hoeffding and Wolfowitz [20]) The squared Bhattacharyya coefficient satisfies

ρ² ≥ exp{−D(p0‖p1)}.   (6.37)
Proof The proof is a direct application of Jensen’s inequality, ln E[X] ≥ E[ln X]:

ln ρ² = 2 ln ∫_Y √(p0 p1) dµ = 2 ln E0[√(p1(Y)/p0(Y))] ≥ 2 E0[ln √(p1(Y)/p0(Y))] = −D(p0‖p1).

COROLLARY 6.1

ρ² ≥ exp{−D(p1‖p0)},   (6.38)
ρ² ≥ exp{−Dsym(p0, p1)}.   (6.39)

The bound (6.38) is obtained the same way as (6.37). Averaging the two lower bounds on ln ρ² yields (6.39).

THEOREM 6.2

D(p0‖p1) ≤ ln(1 + χ²(p0, p1)) ≤ χ²(p0, p1).   (6.40)

Proof The first inequality again follows by application of Jensen’s inequality:

D(p0‖p1) = E0[ln(p0(Y)/p1(Y))] ≤ ln E0[p0(Y)/p1(Y)] = ln(1 + χ²(p0, p1)).

The second inequality holds because ln(1 + x) ≤ x for all x > −1 (by concavity of the log function).

THEOREM 6.3 (Vajda [21])

D(p0‖p1) ≥ ln[(1 + dV(p0, p1))/(1 − dV(p0, p1))] − 2 dV(p0, p1)/(1 + dV(p0, p1)).   (6.41)

This inequality is stronger than the famous Pinsker’s inequality (also known as the Csiszár–Kemperman–Kullback–Pinsker inequality)

D(p0‖p1) ≥ 2 dV²(p0, p1).   (6.42)
In particular, Vajda’s inequality is tight when dV(p0, p1) = 0 and when dV(p0, p1) = 1.

Exercises
6.1 Poisson Distributions. Derive the KL and Chernoff divergences between two Poisson distributions.

6.2 Prove (6.5) and (6.19).

6.3 Prove (6.21).

6.4 Prove (6.18). Hint: You may want to use the fact that pu(y) can be written as

pu(y) = p0(y) L(y)^u / E0[L(Y)^u],

where L(y) = p1(y)/p0(y) is the likelihood ratio.
6.5 Jensen–Shannon Divergence.

(a) Show that the Jensen–Shannon divergence (6.25) is an Ali–Silvey distance.
(b) Show that 0 ≤ dJS(p0, p1) ≤ ln 2, where the upper bound is achieved for p0 and p1 that have disjoint support sets.
(c) Show that dJS(p0, p1) = H((p0 + p1)/2) − (1/2)[H(p0) + H(p1)] when Y is a finite set.
6.6 Discard Parity Bit? Let Z be a random variable taking values in the set of even numbers {0, 2, 4, 6, 8}, and B a Bernoulli random variable with parameter θ, which is independent of Z. Let Y = Z + B, hence B represents a parity bit. Express Z as a function of Y and show that the DPI applied to this function is satisfied with equality.

6.7 Data Grouping. Consider the probability vectors P0 = [.7, .2, .1] and P1 = [.1, .6, .3] on Y = {1, 2, 3}. Find a data grouping function φ : Y → Z = {a, b} such that d(P0, P1) = d(Q0, Q1) for any f-divergence, where Q0 and Q1 are the distributions on Z induced by the mapping φ.

6.8 DPI and Binary Quantization. Let Y be distributed as P0 = N(0, 1) or some other distribution P1. Give a nontrivial condition on P1 such that the DPI applied to the mapping Z = sgn(Y) is achieved with equality.

6.9 Binary Quantization. Let Y be distributed as P0 = N(0, 1) or P1 = N(µ, 1).

(a) Give the corresponding distributions Q0 and Q1 on Z = sgn(Y), and express D(Q0‖Q1) as a function of µ.
(b) Graph D(P0‖P1) and D(Q0‖Q1) as a function of µ.
(c) Show that

lim_{µ→0} D(P0‖P1)/D(Q0‖Q1) = π/2  and  lim_{µ→∞} D(P0‖P1)/D(Q0‖Q1) = 2.
6.10 Ternary Quantization. Let Y be distributed as P0 = N(0, 1) or P1 = N(µ, 1). Consider t ≥ 0. We quantize Y and represent it by a ternary random variable

Z = 1 if Y − µ/2 > t,  Z = 0 if |Y − µ/2| ≤ t,  Z = −1 if Y − µ/2 < −t.

(a) Give the corresponding distributions Q0 and Q1 on Z, and express D(Q0‖Q1) as a function of µ and t.
(b) Numerically maximize D(Q0‖Q1) over t, for µ ∈ {0.1, 0.5, 1, 2, 3, 4, 5}. Evaluate the KL divergence ratio f(µ) = D(P0‖P1)/max_t D(Q0‖Q1) for µ ∈ {0.1, 0.5, 1, 2, 3}.
6.11 Linear Dimensionality Reduction. Consider the hypotheses H0 : Y ∼ P0 = N(0, In) and H1 : Y ∼ P1 = N(0, C1). We wish to select a direction vector u ∈ Rn such that the projection Z = u⊤Y ∈ R contains as much information for discrimination as possible, in the sense that u maximizes D(Q0‖Q1), where Q0 and Q1 are the distributions of Z under H0 and H1, respectively.

(a) Show that the maximum is achieved by any eigenvector associated with the largest eigenvalue of C1, and give an expression for the maximal D(Q0‖Q1).
(b) Repeat when P0 = N(0, C0), where C0 is an arbitrary full-rank covariance matrix. Hint: Exploit the invariance of KL divergence to invertible transformations to relate this problem to that of part (a).
(c) Assume again that C0 = In and fix some d ∈ {2, 3, …, n − 1}. Now we wish to select a d × n matrix A such that the projection Z = AY ∈ Rd has maximal D(Q0‖Q1), where Q0 and Q1 are the distributions of Z under H0 and H1, respectively. Argue that there is no loss of optimality in constraining AA⊤ = Id (again exploiting the invariance properties of KL divergence). Show that the maximum is achieved by any A whose rows are eigenvectors associated with the largest eigenvalues of C1, and give an expression for the maximal D(Q0‖Q1) in terms of these eigenvalues.

6.12 Boolean Function Selection. This may be viewed as a nonlinear dimensionality reduction problem. Consider two probability distributions P0 and P1 for a random variable Y taking values over the set {(0, 0), (0, 1), (1, 0), (1, 1)}. The distribution P0 is uniform, and P1 = [3/16, 9/16, 3/16, 1/16]. Let Q0 and Q1 be the corresponding distributions of Z = f(Y), where f is a Boolean function, i.e., Z ∈ {0, 1}.

(a) Evaluate the Bhattacharyya distance B(P0, P1).
(b) Find the f that maximizes B(Q0, Q1). Hint: There are 16 Boolean functions on Y but you need to consider only 8 of them.
6.13 Asymptotics of f-Divergences for Increasingly Similar Distributions. Assume Y is a finite set and p0(y) > 0 for all y ∈ Y. Fix any zero-mean vector {h(y)}_{y∈Y} and let p1 = p0 + εh, which is a probability vector for ε small enough. Let d be an f-divergence such that f(x) is twice differentiable at x = 1. Using a second-order Taylor series expansion of f, show that the following asymptotic equality holds:

d(p0, p1) ∼ (f″(1)/2) χ²(p0, p1)  as ε → 0.

6.14 Divergences. Fix an arbitrary integer n > 1 and evaluate D(p0‖p1), D(p1‖p0), ‖p0 − p1‖_TV, and χ²(p0, p1) for the following distributions on Y = {1, 2, 3, …}:

p0(y) = 2^−y,  p1(y) = (2^−y/(1 − 2^−n)) 1{1 ≤ y ≤ n},  y ∈ Y.

Give the limits of these expressions as n → ∞.
References
[1] I. Csiszár, “Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten,” Publ. Math. Inst. Hungarian Acad. Sci., Ser. A, Vol. 8, pp. 84–108, 1963. [2] S. M. Ali and S. D. Silvey, “A General Class of Coefficients of Divergence of One Distribution from Another,” J. Royal Stat. Soc. B, Vol. 28, pp. 132–142, 1966. [3] T. Kailath, “The Divergence and Bhattacharyya Distance Measures in Signal Selection,” IEEE Trans. Comm. Technology, Vol. 15, No. 1, pp. 52–60, 1967.
[4] H. Kobayashi and J. B. Thomas, “Distance Measures and Related Criteria,” Proc. Allerton Conf., Monticello, IL, pp. 491–500, 1967. [5] M. Basseville, “Distance Measures for Signal Processing and Pattern Recognition,” Signal Processing, Vol. 18, pp. 349–369, 1989. [6] L. K. Jones and C. L. Byrne, “General Entropy Criteria for Inverse Problems, with Applications to Data Compression, Pattern Classification, and Cluster Analysis,” IEEE Trans. Information Theory, Vol. 36, No. 1, pp. 23–30, 1990. [7] A. Jain, P. Moulin, M. I. Miller, and K. Ramchandran, “Information-Theoretic Bounds for Degraded Target Recognition,” IEEE Trans. PAMI, Vol. 24, No. 9, pp. 1153–1166, 2002. [8] F. Liese and I. Vajda, “On Divergences and Informations in Statistics and Information Theory,” IEEE Trans. Inf. Theory, Vol. 52, No. 10, pp. 4394–4412, 2006. [9] X. Nguyen, M. J. Wainwright, and M. I. Jordan, “On Information Divergence Measures, Surrogate Loss Functions and Decentralized Hypothesis Testing,” Proc. 43rd Allerton Conf., Monticello, IL, 2005. [10] X. Nguyen, M. J. Wainwright, and M. I. Jordan, “On Surrogate Loss Functions and f-Divergences,” Annal. Stat., Vol. 37, No. 2, pp. 876–904, 2009. [11] S. Kullback and R. Leibler, “On Information and Sufficiency,” Ann. Math. Stat., Vol. 22, pp. 79–86, 1951. [12] S. Kullback, Information Theory and Statistics, Wiley, 1959. [13] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd edition, Wiley, 2006. [14] H. Chernoff, “A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on the Sum of Observations,” Ann. Math. Stat., Vol. 23, pp. 493–507, 1952. [15] A. Rényi, “On Measures of Entropy and Information,” Proc. 4th Berkeley Symp. on Probability Theory and Mathematical Statistics, pp. 547–561, Berkeley University Press, 1961. [16] D. M. Endres and J. E. Schindelin, “A New Metric for Probability Distributions,” IEEE Trans. Inf. Theory, Vol. 49, pp. 1858–1860, 2003. [17] A. K. C. Wong and M. You, “Entropy and Distance of Random Graphs with Application to Structural Pattern Recognition,” IEEE Trans. Pattern Analysis Machine Intelligence, Vol. 7, pp. 599–609, 1985. [18] I. Goodfellow, “Generative Adversarial Networks,” NIPS tutorial, 2016. Available at https://arxiv.org/pdf/1701.00160.pdf. [19] L. M. Bregman, “The Relaxation Method of Finding the Common Points of Convex Sets and its Application to the Solution of Problems in Convex Programming,” USSR Comput. Math. Math. Phys., Vol. 7, No. 3, pp. 200–217, 1967. [20] W. Hoeffding and J. Wolfowitz, “Distinguishability of Sets of Distributions,” Ann. Math. Stat., Vol. 29, pp. 700–718, 1958. [21] I. Vajda, “Note on Discrimination Information and Variation,” IEEE Trans. Inf. Theory, Vol. 16, pp. 771–773, 1970.
7
Performance Bounds for Hypothesis Testing
We saw in Chapters 2 and 4 that decision rules for binary hypothesis problems take the form:

δ̃(y) = 1 if T(y) > τ,  1 w.p. γ if T(y) = τ,  0 if T(y) < τ,

with resulting error probabilities

PF(δ̃) = P0{T(Y) > τ} + γ P0{T(Y) = τ},
PM(δ̃) = P1{T(Y) < τ} + (1 − γ)P1{T(Y) = τ},
Pe(δ̃) = π0 PF(δ̃) + (1 − π0) PM(δ̃).
Barring special cases, it is generally impractical or impossible to compute the exact values of PF , P M , and Pe . However, it is often sufﬁcient to obtain good lower and upper bounds on these error probabilities. This chapter derives analytically tractable performance bounds for hypothesis testing. To derive such bounds, we exploit the convex statistical distances discussed in Chapter 6. We begin with simple lower bounds before introducing a family of upper bounds based on Chernoff divergence and deﬁning the largedeviations rate function.
7.1
Simple Lower Bounds on Conditional Error Probabilities
Theorem 7.1 presents some simple bounds on conditional error probabilities based on the data-processing inequality.

DEFINITION 7.1 The binary KL divergence function

d(a‖b) ≜ a ln(a/b) + (1 − a) ln[(1 − a)/(1 − b)] = D([a, 1 − a]‖[b, 1 − b]),  0 ≤ a, b ≤ 1,   (7.1)

is the KL divergence between two Bernoulli distributions with parameters a and b.
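The binary divergence function (7.1) is straightforward to implement; a small sketch with arbitrary arguments, checking its basic properties:

```python
import numpy as np

def d_binary(a, b):
    """Binary KL divergence d(a||b), (7.1), with the convention 0 ln 0 = 0."""
    total = 0.0
    for x, y in ((a, b), (1 - a, 1 - b)):
        if x > 0:
            total += x * np.log(x / y)
    return total

# d(a||b) = d(1-a||1-b), and the minimum value 0 is attained at a = b.
assert np.isclose(d_binary(0.2, 0.7), d_binary(0.8, 0.3))
assert d_binary(0.4, 0.4) == 0.0
assert d_binary(0.2, 0.7) > 0.0
```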
Again we use the notational convention 0 ln 0 = 0. The function d(a‖b) is not symmetric; however, d(a‖b) = d(1 − a‖1 − b). The function is jointly convex in (a, b) and attains its minimum (0) when a = b. For fixed a ∈ (0, 1), d(a‖b) tends to ∞ as b approaches 0 or 1. The following theorem provides an upper bound (parameterized by α) on the ROC.

THEOREM 7.1 (Kullback [1]) For all tests δ̃ such that PF(δ̃) = α and PM(δ̃) = β, the following inequality holds:

d(α‖1 − β) ≤ D(p0‖p1).   (7.2)

Proof Consider deterministic tests first. Let Y1 be the acceptance region for the test and define the binary random variable B = 1{Y ∈ Y1}, which is a deterministic function of Y. The probability of false alarm and the probability of correct detection are respectively equal to

α = P0{Y ∈ Y1} = P0{B = 1} = q0(1)

and

1 − β = P1{Y ∈ Y1} = P1{B = 1} = q1(1),

hence D(q0‖q1) = d(α‖1 − β). Then the claim follows from the data processing inequality (6.30) and the fact that B is a function of Y. For randomized tests, let B = 1{say H1}, which is a stochastic function of Y. Again the data processing inequality applies and (7.2) holds.

REMARK 7.1
By convexity of the binary divergence function, the set

R0 ≜ {(α, 1 − β) : d(α‖1 − β) ≤ D(p0‖p1)}   (7.3)

is a convex region inside [0, 1]². In particular, the line segment α = 1 − β, β ∈ [0, 1], belongs to R0, and R0 satisfies the symmetry condition (α, 1 − β) ∈ R0 ⇔ (1 − α, β) ∈ R0. By Theorem 7.1, R0 is a superset of

R = {(α, 1 − β) : ∃δ̃ : PF(δ̃) = α, PM(δ̃) = β}.   (7.4)

Some interesting points in R0 are given by

α = 0 ⇒ β ≥ e^{−D(p0‖p1)},
α = 1/2 ⇒ β(1 − β) ≥ (1/4) e^{−2D(p0‖p1)},
α → 1 ⇒ β ≥ e^{−D(p0‖p1)/(1−α)}.

The last line holds because we must have β → 0 for any sequence of (α, 1 − β) ∈ R0 such that α → 1. Hence d(α‖1 − β) ∼ (1 − α) ln(1/β).

REMARK 7.2 The lower bound in the case α = 1/2 is generally poor in the asymptotic setting with n i.i.d. observations, owing to the extraneous factor of 2 in the exponent (see the Chernoff–Stein Lemma 8.1 in the next chapter).
REMARK 7.3 Similarly to (7.2), a dual bound can be derived using D(p1‖p0). We have

d(β‖1 − α) ≤ D(p1‖p0),   (7.5)

resulting in the convex region

R1 ≜ {(α, 1 − β) : d(β‖1 − α) ≤ D(p1‖p0)},   (7.6)

and the following bounds:

β = 0 ⇒ α ≥ e^{−D(p1‖p0)},
β = 1/2 ⇒ α(1 − α) ≥ (1/4) e^{−2D(p1‖p0)},
β → 1 ⇒ α ≥ (1 − β) e^{−D(p1‖p0)}.

7.2
Simple Lower Bounds on Error Probability
Consider a binary test H0 : Y ∼ p0 versus H1 : Y ∼ p1 with priors π0 and π1 on H0 and H1, respectively. Denote by P e the minimum error probability. Theorem 7.2 provides a simple lower bound on Pe .
THEOREM 7.2 (Kailath Lower Bound on Pe [2]) Let ρ = ∫_Y √(p0 p1) dµ (Bhattacharyya coefficient). Then

Pe ≥ (1/2)[1 − √(1 − 4π0π1ρ²)].   (7.7)

The following successively weaker lower bounds also apply:

Pe ≥ π0π1ρ²,   (7.8)
Pe ≥ π0π1 exp{−Dsym(p0‖p1)}.   (7.9)

Proof First note the identity min(a, b) + max(a, b) = a + b. Hence

∫_Y min{π0p0, π1p1} dµ + ∫_Y max{π0p0, π1p1} dµ = ∫_Y (π0p0 + π1p1) dµ = 1.   (7.10)

We have

π0π1ρ² = ( ∫_Y √(π0p0 π1p1) dµ )²
       = ( ∫_Y √(min{π0p0, π1p1} max{π0p0, π1p1}) dµ )²
       ≤(a) ∫_Y min{π0p0, π1p1} dµ · ∫_Y max{π0p0, π1p1} dµ
       =(b) ∫_Y min{π0p0, π1p1} dµ · ( 1 − ∫_Y min{π0p0, π1p1} dµ )
       = Pe(1 − Pe),

where the inequality (a) is obtained by application of the Cauchy–Schwarz inequality, and (b) follows from (7.10). This establishes (7.7). The weaker bound (7.8) follows from the inequality √(1 + 2x) ≤ 1 + x for x ≥ −1/2 and is tight for small ρ. The even weaker bound (7.9) follows from (6.39).
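A numeric sketch of Theorem 7.2 on a hypothetical discrete test with uniform prior:

```python
import numpy as np

p0 = np.array([0.5, 0.3, 0.2])
p1 = np.array([0.2, 0.5, 0.3])
pi0, pi1 = 0.5, 0.5

rho = float(np.sqrt(p0 * p1).sum())                # Bhattacharyya coefficient
P_e = float(np.minimum(pi0 * p0, pi1 * p1).sum())  # exact minimum error probability

# Kailath lower bound (7.7) and the weaker bound (7.8).
lb_77 = 0.5 * (1 - np.sqrt(1 - 4 * pi0 * pi1 * rho ** 2))
lb_78 = pi0 * pi1 * rho ** 2

assert lb_78 <= lb_77 + 1e-12   # (7.8) is weaker than (7.7)
assert P_e >= lb_77 - 1e-12     # the bound really is a lower bound
```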
7.3
Chernoff Bound
The basic problem is to upper bound a probability P{X ≥ a}, where X is any real random variable and a > E[X]. Similarly, we may wish to upper bound P{X ≤ a} where a < E[X].

7.3.1
Moment-Generating and Cumulant-Generating Functions
The moment-generating function (mgf) for a real random variable X is defined as

MX(u) ≜ E[e^{uX}] = ∫_R p(x) e^{ux} dµ(x),  u ∈ R,   (7.11)

assuming the integral exists. The logarithm of the mgf is the cumulant-generating function (cgf)

κX(u) ≜ ln E[e^{uX}].   (7.12)

Recall some fundamental properties of the mgf and cgf:

• The kth moment of X is obtained by differentiating the mgf k times at the origin. For k = 0, 1, 2, we obtain MX(0) = 1, M′X(0) = E[X], M″X(0) = E[X²].
• The kth cumulant of X is obtained by differentiating the cgf k times at the origin. For k = 0, 1, 2, we obtain κX(0) = 0, κ′X(0) = E[X], κ″X(0) = Var(X).
• Both the mgf and the cgf are convex functions. The mgf is easily seen to be convex because M″X(u) = E[e^{uX} X²] ≥ 0. For the convexity of the cgf, we apply Hölder’s inequality, which states that

E[|XY|] ≤ (E[|X|^p])^{1/p} (E[|Y|^q])^{1/q}

for p, q ∈ [1, ∞) such that 1/p + 1/q = 1. In particular, if we consider α ∈ (0, 1), and set p = 1/α and q = 1/(1 − α), then

κX(αu1 + (1 − α)u2) = ln E[e^{αu1X + (1−α)u2X}]
                    = ln E[e^{u1Xα} e^{u2X(1−α)}]
                    ≤ α ln E[e^{u1X}] + (1 − α) ln E[e^{u2X}]
                    = ακX(u1) + (1 − α)κX(u2).
• The domain of κX is either the entire real line R, or a half-line, or an interval. Except for the case of the degenerate random variable X = 0, the supremum of κX over its domain is ∞.
• If X and Y are independent random variables, then the cgf for their sum Z = X + Y is the sum of the individual cgfs: κZ(u) = κX(u) + κY(u).
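These properties can be spot-checked numerically. Using the unit-exponential cgf κ(u) = −ln(1 − u) derived in Example 7.2 below, finite differences at the origin recover the first two cumulants, and discrete second differences confirm convexity:

```python
import numpy as np

# cgf of the unit exponential: kappa(u) = -ln(1 - u), u < 1.
def kappa(u):
    return -np.log1p(-u)

# Derivatives at the origin: kappa'(0) = E[X], kappa''(0) = Var(X).
h = 1e-5
d1 = (kappa(h) - kappa(-h)) / (2 * h)
d2 = (kappa(h) - 2 * kappa(0.0) + kappa(-h)) / h ** 2
assert abs(d1 - 1.0) < 1e-6   # unit exponential has mean 1
assert abs(d2 - 1.0) < 1e-3   # and variance 1

# Convexity: second differences on a uniform grid are nonnegative.
u = np.linspace(-2.0, 0.9, 60)
assert np.all(np.diff(kappa(u), 2) >= -1e-9)
```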
Example 7.1 Gaussian Random Variable. Let X ∼ N(0, 1). Then MX(u) = e^{u²/2} and κX(u) = u²/2 for all u ∈ R, as shown in Figure 7.1(a).

Example 7.2 Exponential Random Variable. Let X be the unit exponential distribution. The mgf is given by

MX(u) = 1/(1 − u) for u < 1 (and MX(u) = ∞ for u ≥ 1),

hence κX(u) = −ln(1 − u), u < 1.

Markov’s inequality states that P{Z ≥ b} ≤ E[Z]/b for any nonnegative random variable Z and b > 0. Chebyshev’s inequality: for any c > 0, we have

P{|X − E(X)| ≥ c√Var(X)} ≤ 1/c²,   (7.13)

which follows directly from Markov’s inequality with Z = (X − E(X))²/Var(X) and b = c².
Both Markov’s inequality and Chebyshev’s inequality are generally loose. Often better upper bounds on P{X ≥ a} are obtained by applying Markov’s inequality to the nonnegative random variable

Z = e^{uX} > 0
and letting b = e^{ua}, where u is a free parameter to be optimized. For u > 0, the function e^{ux} is increasing over x ∈ R, and we have

P{X ≥ a} = P{e^{uX} ≥ e^{ua}} ≤ E[e^{uX}]/e^{ua},  ∀u > 0, a ∈ R.   (7.14)

Combining (7.14) and the definition (7.12) of the cgf, we obtain a collection of exponential upper bounds indexed by u > 0:

P{X ≥ a} ≤ e^{−[ua − κX(u)]},  ∀u > 0, a ∈ R.   (7.15)

Similarly, to upper bound P{X ≤ a}, we observe that for u < 0, the function e^{ux} is decreasing over x ∈ R. Hence P{X ≤ a} = P{e^{uX} ≥ e^{ua}} and

P{X ≤ a} ≤ e^{−[ua − κX(u)]},  ∀u < 0, a ∈ R.   (7.16)

Note the family of bounds in (7.16) can be obtained from those in (7.15) by considering Y = −X in (7.15) and using κY(u) = κX(−u). Therefore only the upper bound in (7.15) needs to be considered. To find the tightest bound in (7.15), we seek u > 0 that maximizes the function

ga(u) ≜ ua − κX(u),  u ∈ dom(κX).

Performing this maximization, we obtain the large-deviations rate function

ΛX(a) ≜ sup_{u∈R} [ua − κX(u)],  a ∈ R.   (7.17)
This may be viewed as a transformation of the function κX into ΛX and is also called the convex conjugate transform, or the Legendre–Fenchel transform. It is widely used in convex optimization theory and related fields. If the supremum in (7.17) is achieved for u > 0, we have

P{X ≥ a} ≤ e^{−ΛX(a)}.

If the supremum is achieved for u < 0, we have

P{X ≤ a} ≤ e^{−ΛX(a)}.

The maximization problem (7.17) satisfies the following properties, which are illustrated in Figure 7.2:

• Since both ua and −κX(u) are concave functions, so is their sum ga(u), and finding the maximum u∗ of ga(u) is a simple concave optimization problem.
• We have ga(0) = 0 and g′a(0) = a − E[X]. Hence u∗ > 0 for a > E[X] and u∗ < 0 for a < E[X].
• The large-deviations rate function ΛX(a) is the supremum of a collection of linear functions of a and is therefore convex.

In summary, the following upper bounds hold:

P{X ≥ a} ≤ e^{−[ua−κX(u)]} if a > E[X] (u > 0);  P{X ≥ a} ≤ 1 if a ≤ E[X],   (7.18)
Figure 7.2 Plots of ga(u) and its maxima, for the cases a > E[X] and a < E[X]
and

P{X ≤ a} ≤ e^{−[ua−κX(u)]} if a < E[X] (u < 0);  P{X ≤ a} ≤ 1 if a ≥ E[X].   (7.19)

In particular, choosing the maximizing u in (7.17) yields

P{X ≥ a} ≤ e^{−ΛX(a)} if a > E[X];  P{X ≥ a} ≤ 1 if a ≤ E[X],   (7.20)

and

P{X ≤ a} ≤ e^{−ΛX(a)} if a < E[X];  P{X ≤ a} ≤ 1 if a ≥ E[X].   (7.21)
Example 7.3 Gaussian Random Variable, revisited. Let X ∼ N(0, 1), which has mean E(X) = 0. The cgf of X is κX(u) = u²/2 for all u ∈ R. Hence ga(u) = ua − u²/2. Since ga(u) is concave, we obtain its global maximum u∗ by setting the derivative of ga to zero:

0 = g′a(u∗) = a − u∗,

hence u∗ = a and ΛX(a) = a²/2 for all a ∈ R. Thus we have the following upper bounds:

P{X ≥ a} ≤ e^{−a²/2},  ∀a > 0,
P{X ≤ a} ≤ e^{−a²/2},  ∀a < 0.

The exact values of both probabilities are Q(|a|). For large |a|, recall from (5.41) that

Q(a) ∼ e^{−a²/2}/(√(2π) a),  a > 0.

Hence the upper bound is tight for large |a|.
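The Gaussian bound of Example 7.3 can be compared against the exact tail probability; a small sketch using only the standard library:

```python
from math import erfc, sqrt, exp

# Exact standard Gaussian tail P{X >= a}.
def Q(a):
    return 0.5 * erfc(a / sqrt(2.0))

# Chernoff bound from Example 7.3: P{X >= a} <= exp(-a^2/2), a > 0.
for a in (0.5, 1.0, 2.0, 4.0):
    assert Q(a) <= exp(-a * a / 2.0)

# The bound has the right exponent: at a = 4 the gap is only the
# polynomial factor ~ sqrt(2*pi)*a predicted by (5.41).
a = 4.0
ratio = exp(-a * a / 2.0) / Q(a)
assert 1.0 < ratio < 12.0
```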
Example 7.4 Exponential Random Variable, revisited. Let X follow the unit exponential distribution, which has mean E(X) = 1. The cgf κX(u) = −ln(1 − u), u < 1, was derived in Example 7.2. Hence ga(u) = ua + ln(1 − u) for u < 1. The maximum u∗ is obtained by setting the derivative of ga to zero:

0 = g′a(u∗) = a − 1/(1 − u∗),

hence u∗ = 1 − 1/a and ΛX(a) = a − 1 − ln a, which is depicted in Figure 7.3.

Figure 7.3 Large-deviations function ΛX(a) for the unit exponential random variable

For any a > 1 = E(X), we have u∗ > 0 and the large-deviations bound yields

P{X ≥ a} ≤ e^{−ΛX(a)} = e^{−a+1+ln a} = (ae) e^{−a}.

Comparing with the exact value P{X ≥ a} = e^{−a}, we see the bound is especially useful for large a because it captures the exponential decay of P{X ≥ a} with a. In contrast, the Chebyshev bound would yield, for a > 1,

P{X ≥ a} = P{X − 1 ≥ a − 1} ≤ 1/(a − 1)²,

which decays only quadratically with a.
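The three quantities in Example 7.4 compare as follows in a short sketch (the threshold values are chosen arbitrarily):

```python
import numpy as np

# Tail of the unit exponential: exact vs. Chernoff (Example 7.4) vs. Chebyshev.
a = np.array([2.0, 7.0, 10.0, 20.0])
exact = np.exp(-a)                     # P{X >= a} = e^{-a}
chernoff = a * np.e * np.exp(-a)       # (a e) e^{-a}
chebyshev = 1.0 / (a - 1.0) ** 2       # P{X - 1 >= a - 1} <= 1/(a-1)^2

# The Chernoff bound is valid and, past a moderate threshold, far
# tighter than the quadratically decaying Chebyshev bound.
assert np.all(exact <= chernoff)
assert np.all(chernoff[1:] < chebyshev[1:])
```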
7.4
Application of Chernoff Bound to Binary Hypothesis Testing
We now apply the large-deviations bounds to the binary hypothesis problem

H0 : Y ∼ p0,
H1 : Y ∼ p1,   (7.22)

for which decision rules take the form:

δ̃(y) = 1 if T(y) > τ,  1 w.p. γ if T(y) = τ,  0 if T(y) < τ,

where T(y) = ln L(y) = ln[p1(y)/p0(y)] is the log-likelihood ratio statistic. The error probabilities satisfy

PF(δ̃) = P0{T(Y) > τ} + γ P0{T(Y) = τ} ≤ P0{T(Y) ≥ τ},   (7.23)
PM(δ̃) = P1{T(Y) < τ} + (1 − γ)P1{T(Y) = τ} ≤ P1{T(Y) ≤ τ}.   (7.24)
Denote by κ0 and κ1 the cgfs of T under H0 and H1, respectively. Then the large-deviations bounds (7.18) and (7.19) applied to PF and PM respectively are given by

PF(δ̃) ≤ e^{−[uτ − κ0(u)]},  ∀u > 0,
PM(δ̃) ≤ e^{−[vτ − κ1(v)]},  ∀v < 0.

We now derive a relationship between κ0 and κ1. We have

κ0(u) = ln E0[e^{u ln L(Y)}] = ln ∫_Y p0(y) (p1(y)/p0(y))^u dµ(y) = ln ∫_Y p0^{1−u}(y) p1^u(y) dµ(y)

and

κ1(v) = ln E1[e^{v ln L(Y)}] = ln ∫_Y p1(y) (p1(y)/p0(y))^v dµ(y) = ln ∫_Y p0(y) (p1(y)/p0(y))^{v+1} dµ(y) = κ0(v + 1).

Hence the second cgf κ1 is obtained by simply shifting the first cgf κ0. We further observe that κ0 is the negative of the Chernoff divergence du(p0, p1) introduced in (6.11). Setting u = v + 1, we rewrite the upper bounds on PF and PM as

PF(δ̃) ≤ e^{−[uτ − κ0(u)]},  ∀u > 0,   (7.25)
PM(δ̃) ≤ e^{−[(u−1)τ − κ0(u)]},  ∀u < 1.   (7.26)

The functions uτ − κ0(u) and (u − 1)τ − κ0(u) differ only by a constant (τ) and their maximum over u ∈ R is therefore achieved by the same value u∗. Define

g(u) ≜ uτ − κ0(u).   (7.27)
Then by the convexity of κ0, g is concave and

u∗ = arg max_{u∈R} g(u)   (7.28)

satisfies g′(u∗) = 0, hence τ = κ0′(u∗). The derivative of the function κ0 is given by

κ0′(u) = d/du ln ∫_Y p0^{1−u}(y) p1^u(y) dµ(y)
       = e^{−κ0(u)} ∫_Y p0(y) d/du (p1(y)/p0(y))^u dµ(y)
       = e^{−κ0(u)} ∫_Y p0^{1−u}(y) p1^u(y) ln(p1(y)/p0(y)) dµ(y).

The large-deviations rate functions for T(Y) under P0 and P1 are respectively given by

Λ0(τ) = u∗τ − κ0(u∗),   (7.29)
Λ1(τ) = (u∗ − 1)τ − κ0(u∗) = Λ0(τ) − τ.   (7.30)

Recalling the definition of KL divergence in (6.1), the statistic T(Y) has mean

E0[T(Y)] = ∫_Y p0(y) ln(p1(y)/p0(y)) dµ(y) = −D(p0‖p1) = κ0′(0)   (7.31)

under H0, and mean

E1[T(Y)] = ∫_Y p1(y) ln(p1(y)/p0(y)) dµ(y) = D(p1‖p0) = κ1′(0)   (7.32)

under H1. We assume that both D(p1‖p0) and D(p0‖p1) are finite. We also recall that κ0(0) = κ1(0) = κ0(1) = 0. Based on these properties of κ0, we consider the following cases:

Case I: −D(p0‖p1) < τ < D(p1‖p0) (the case of primary interest);
Case II: τ ≤ −D(p0‖p1);
Case III: τ ≥ D(p1‖p0).

The maximization of the function g(u) = uτ − κ0(u) is illustrated in Figure 7.4. The line with slope τ is tangent to the graph of the cgf κ0 at u∗. The intercept of that line with the vertical axis is equal to −[u∗τ − κ0(u∗)] = −Λ0(τ), and the intercept with the vertical line at u = 1 is at −[(u∗ − 1)τ − κ0(u∗)] = −Λ1(τ). The slopes of κ0(u) at u = 0 and u = 1 are equal to −D(p0‖p1) and to D(p1‖p0), respectively. The best bounds on PF and PM are given in terms of Λ0(τ) and Λ1(τ) as in (7.33) and (7.34).

Figure 7.4 Chernoff bounds for Case I, i.e., −D(p0‖p1) < τ < D(p1‖p0)

The function g defined in (7.27) has the following properties:

g(0) = −κ0(0) = 0,
g(1) = τ − κ0(1) = τ − κ1(0) = τ,
g′(0) = τ − κ0′(0) = τ + D(p0‖p1),
g′(1) = τ − κ0′(1) = τ − κ1′(0) = τ − D(p1‖p0).

Case I: −D(p0‖p1) < τ < D(p1‖p0). Owing to the first inequality we have g′(0) > 0, hence the maximum of g(·) is achieved by u∗ > 0. Owing to the second inequality we
have g′(1) < 0, hence u∗ < 1. Thus u∗ ∈ (0, 1). The bounds in (7.25) and (7.26) are both optimized by u∗ to yield the tightest Chernoff bounds:

PF(δ̃) ≤ e^{−Λ0(τ)},   (7.33)
PM(δ̃) ≤ e^{−Λ1(τ)} = e^{−(Λ0(τ)−τ)}.   (7.34)

Case II: τ ≤ −D(p0‖p1). In this case g′(0) ≤ 0, which implies that u∗ ≤ 0. Then we have

sup_{u>0} g(u) = g(0) = 0,

which implies that the optimized bounds in (7.25) and (7.26) are

PF(δ̃) ≤ e^{−sup_{u>0} g(u)} = e^{−g(0)} = e^0 = 1,
PM(δ̃) ≤ e^{−Λ1(τ)},

i.e., the bound on PF becomes trivial.

Case III: τ ≥ D(p1‖p0). In this case g′(1) ≥ 0, which implies that u∗ ≥ 1. Then we have

sup_{u<1} g(u) = g(1) = τ,

which implies that the bound on PM becomes trivial, while PF(δ̃) ≤ e^{−Λ0(τ)}.

… > 3/8 = 0.375.

The Bhattacharyya distance can be computed as

B(p0, p1) = −κ0(0.5) = −(1/2) ln 2 + ln(1.5) = ln(1.5) − ln √2,

and the corresponding Bhattacharyya bound is given by

Pe ≤ (1/2) e^{−B(p0,p1)} = √2/3 ≈ 0.4714,

which is very close to the best bound based on the Chernoff information given earlier.
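The quantities κ0, u∗, Λ0(τ), and Λ1(τ) are easily computed by grid search. The sketch below (hypothetical pmfs, τ = 0, which falls in Case I) also checks that the Chernoff bounds (7.33)–(7.34) dominate the exact tail probabilities of T(Y):

```python
import numpy as np

p0 = np.array([0.5, 0.3, 0.2])
p1 = np.array([0.2, 0.5, 0.3])
T = np.log(p1 / p0)   # log-likelihood ratio statistic
tau = 0.0

# kappa_0(u) = ln sum_y p0^{1-u} p1^u, the negative Chernoff divergence.
def kappa0(u):
    return float(np.log((p0 ** (1 - u) * p1 ** u).sum()))

# Maximize g(u) = u*tau - kappa_0(u) on a grid; with tau = 0 this is
# the Chernoff information and u* lies in (0, 1) (Case I).
u = np.linspace(0.01, 0.99, 999)
g = u * tau - np.array([kappa0(x) for x in u])
ustar = float(u[g.argmax()])
Lambda0 = float(g.max())
Lambda1 = Lambda0 - tau   # (7.30)
assert 0.0 < ustar < 1.0

# Chernoff bounds dominate the exact tail probabilities of T(Y).
PF = float(p0[T >= tau].sum())   # P0{T(Y) >= tau}
PM = float(p1[T <= tau].sum())   # P1{T(Y) <= tau}
assert PF <= np.exp(-Lambda0) + 1e-12
assert PM <= np.exp(-Lambda1) + 1e-12
```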
7.5
Bounds on Classiﬁcation Error Probability
We now consider m-ary Bayesian hypothesis testing. There are m hypotheses Hj, j = 1, 2, …, m, with respective prior probabilities πj. The observational model is Y ∼ pj under Hj. Denote by Pe the minimum error probability, which is achieved by the MAP decoding rule.
7.5.1
Upper and Lower Bounds in Terms of Pairwise Error Probabilities
Evaluating the minimum error probability is usually difficult for m > 2, even when the conditional distributions are “nice” – e.g., Gaussians with the same covariance. However, upper and lower bounds on Pe can be derived in terms of the m(m − 1)/2 error probabilities for all pairwise binary hypothesis tests. Let Pe(i, j) denote the Bayesian error probability for a binary test between Hi and Hj with priors πi/(πi + πj) and πj/(πi + πj).
Figure 7.6 Conditional distributions p_1, p_2, p_3, p_4 for Example 7.5
lower bound on P_e in the following theorem is due to Lissack and Fu [4]; an alternate derivation may be found in [5]. The upper bound is due to Chu and Chueh [6].

THEOREM 7.4 The following lower and upper bounds on the minimum error probability P_e hold:

(2/m) Σ_{i≠j} π_i P_e(i, j) ≤ P_e ≤ Σ_{i≠j} π_i P_e(i, j).     (7.41)

The lower bound is achieved if the posterior probabilities satisfy

π(i|y) = π(j|y)   ∀ i, j, y ∉ Y_i ∪ Y_j,

where Y_j is the Bayes acceptance region for hypothesis H_j, 1 ≤ j ≤ m. The upper bound is achieved if the decision regions Y_j, 1 ≤ j ≤ m, are the same as for the binary hypothesis tests.
Observe that the lower and upper bounds coincide for m = 2, and that the gap is relatively small for moderate m if P_e is small. The proof of the theorem is given in Section 7.6. We now provide two examples in which the upper and lower bounds of (7.41), respectively, are tight. As the first example shows, the sufficient condition given in the theorem for achieving the upper bound is not necessary.
Example 7.5 Tight Upper Bound. Let Y = [0, 4], m = 4, uniform prior, λ ∈ [0, 0.5). Consider the following conditional distributions, which are shown in Figure 7.6:

p_j(y) = { 1 − λ : j − 1 ≤ y < j
           λ     : j ≤ y ≤ j + 1,     j = 1, 3,
Figure 7.7 Conditional distributions p_1, p_2, p_3, p_4 for Example 7.6
and

p_j(y) = { λ     : j − 2 ≤ y < j − 1
           1 − λ : j − 1 ≤ y ≤ j,     j = 2, 4.

The conditional pdfs p_1 and p_2 have common support [0, 2], and p_3 and p_4 have common support [2, 4]. The marginal for Y is uniform, hence π_j/p(y) ≡ 1, and π(j|y) = p_j(y). The decision regions are given by Y_j = [j − 1, j), and

P_e = E[1 − max_j π(j|Y)] = λ.

For the binary hypothesis tests, we have P_e(i, j) = (1/2) ∫_Y min{p_i, p_j} dμ = λ 1{j = i + 1} for odd i and P_e(i, j) = λ 1{j = i − 1} for even i. Hence the upper bound of (7.41) is λ, and is achieved. The lower bound is λ/2.

Example 7.6
Tight Lower Bound. Let Y = [0, 4], m = 4, uniform prior, λ ∈ [0, 3/4), and

p_j(y) = { 1 − λ : j − 1 ≤ y ≤ j
           λ/3   : else,     j = 1, 2, 3, 4.
The conditional pdfs are shown in Figure 7.7. They are obtained by shifts of p_1 and have common support [0, 4]. The marginal for Y is uniform, hence π_j/p(y) ≡ 1, and π(j|y) = p_j(y). The decision regions are given by Y_j = [j − 1, j), and the error probability is P_e = E[1 − max_j π(j|Y)] = λ. For the binary hypothesis tests, we have P_e(i, j) = (1/2) ∫_Y min{p_i, p_j} dμ = 2λ/3 for any i ≠ j. The lower bound of (7.41) is λ = P_e and is therefore achieved. The upper bound is 2λ. In many problems, the upper bound of Theorem 7.4 is reasonably tight because one worst pair (i, j) dominates the sum. This is especially true in an asymptotic
setting where the pairwise error probabilities vanish exponentially with the length n of the observation sequence, and the pair (i, j ) corresponding to the worst exponent dominates; see the Gaussian example of Section 5.5.
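The computations in Example 7.6 are simple enough to verify numerically. A sketch in Python, using the fact that each pdf is constant on cells of width 1 (so cell probabilities equal the pdf values), checks that the lower bound of (7.41) is tight and the upper bound equals 2λ:

```python
lam = 0.3                                  # any lambda in [0, 3/4)
m = 4
pi = [1.0 / m] * m
# Example 7.6 on cells of width 1: cell probabilities equal the pdf values
p = [[1 - lam if k == j else lam / 3 for k in range(m)] for j in range(m)]

# exact MAP error probability: Pe = 1 - sum_y max_j pi_j p_j(y)
Pe = 1 - sum(max(pi[j] * p[j][k] for j in range(m)) for k in range(m))

def pe_pair(i, j):
    """Bayes error of the binary test H_i vs H_j with renormalized priors."""
    return sum(min(pi[i] * p[i][k], pi[j] * p[j][k])
               for k in range(m)) / (pi[i] + pi[j])

pairs = [(i, j) for i in range(m) for j in range(m) if i != j]
lower = (2.0 / m) * sum(pi[i] * pe_pair(i, j) for i, j in pairs)
upper = sum(pi[i] * pe_pair(i, j) for i, j in pairs)
```

The same few lines work for any finite set of conditional pmfs, so they also reproduce the λ and λ/2 bounds of Example 7.5.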
7.5.2 Bonferroni's Inequalities
The union bound is a standard tool for deriving an upper bound on the probability of a union of events E_1, ..., E_m:

P(∪_{j=1}^m E_j) ≤ Σ_{j=1}^m P(E_j) ≜ S_1.

In particular, we applied this bound to the error probability for the m-ary signal detection problem of Section 5.5. However, a lower bound on that probability is sometimes useful too. Let

S_2 = Σ_{i<j} P(E_i ∩ E_j).

Then we have the lower bound

P(∪_{j=1}^m E_j) ≥ S_1 − S_2.     (7.42)
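Both Bonferroni inequalities are easy to sanity-check by exact enumeration on a small finite probability space. A toy sketch (the outcome space and events below are hypothetical):

```python
import itertools
from fractions import Fraction

n_out = 4                                   # four equally likely outcomes
E = [{0, 1}, {1, 2}, {2, 3}]                # three hypothetical events

def prob(A):
    """Exact probability of an event under the uniform distribution."""
    return Fraction(len(A), n_out)

S1 = sum(prob(A) for A in E)
S2 = sum(prob(E[i] & E[j]) for i, j in itertools.combinations(range(len(E)), 2))
p_union = prob(set().union(*E))
```

Using exact rationals avoids any floating-point slack when checking S_1 − S_2 ≤ P(∪E_j) ≤ S_1.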
7.5.3 Generalized Fano's Inequality
Another popular way to obtain lower bounds on classification error probability is via Fano's inequality, which plays a central role in information theory [7]. While Fano's inequality requires a uniform prior on the hypotheses, it can be generalized to the case of arbitrary priors [8]. We state the main result and then specialize it to the classical case of a uniform prior. The proof of the generalized Fano's inequality is obtained by application of the data-processing inequality. Denote by π_e = 1 − max_j π_j the minimum error probability in the absence of observations; in this case the MPE rule picks the a priori most likely symbol. The generalized Fano's inequality is stated in Theorem 7.5.
THEOREM 7.5 Generalized Fano's Inequality.

(i) The minimum error probability P_e satisfies

P_e ≥ P_e,LB,     (7.43)

where P_e,LB is the unique solution to the equation d(·‖π_e) = I(J; Y) on (0, π_e].

(ii) Equality is achieved in (7.43) if and only if there exist a Bayes rule δ_B and constants a_0 and a_1 such that

π_j p_j(y) = { a_0 π_j p(y) : δ_B(y) = j
               a_1 π_j p(y) : δ_B(y) ≠ j.     (7.44)
Proof Consider the MPE rule δ_B : Y → {1, 2, ..., m} and define the Bernoulli random variable B ≜ 1{δ_B(Y) ≠ J}, which is a function of (J, Y). We have P_e = P{B = 1}, where P is the joint distribution π × p_J. Similarly, π_e = P′{B = 1}, where P′ is the product π × p. By the data-processing inequality we have

I(J; Y) = D(π × p_J ‖ π × p) ≥ d(P_e‖π_e).     (7.45)

Since P_e ≤ π_e and the function d(·‖π_e) is monotone decreasing on (0, π_e], this proves (i). Next, we apply the condition for equality in the data-processing inequality, which is given following (6.30). The condition, applied to our problem, is that the likelihood ratio p(Y)/p_J(Y) be a degenerate random variable (under π × p_J) given B, i.e.,

p_j(y)/p(y) = { a_0 : δ_B(y) = j
                a_1 : δ_B(y) ≠ j.
This is equivalent to (7.44).

THEOREM 7.6 Fano's Inequality. Assume the prior π is uniform over {1, 2, ..., m}. For m > 2 we have P_e ≥ P_e,LB, where P_e,LB is the unique solution of

P_e = [ln m − I(J; Y) − h_2(P_e)] / ln(m − 1) ≜ f(P_e)     (7.46)

over [0, 1/2]. For m = 2 the lower bound P_e,LB of (7.43) is the unique solution to the equation h_2(·) = ln 2 − I(J; Y) on [0, 1/2].

Proof For a uniform prior we have π_e = 1 − 1/m and thus

d(P_e‖π_e) = P_e ln(P_e/π_e) + (1 − P_e) ln((1 − P_e)/(1 − π_e))
           = P_e ln(P_e/(1 − 1/m)) + (1 − P_e) ln[m(1 − P_e)]
           = −h_2(P_e) + P_e ln(m/(m − 1)) + (1 − P_e) ln m.     (7.47)

Combining (7.45) and (7.47), we obtain

I(J; Y) ≥ −h_2(P_e) + P_e ln(m/(m − 1)) + (1 − P_e) ln m = ln m − h_2(P_e) − P_e ln(m − 1)
⇒ P_e ≥ f(P_e).

Since f(P_e) is decreasing over [0, 1/2] and increasing over [1/2, 1], this proves the claim. When m = 2, we have π_e = 1/2, hence I(J; Y) ≥ d(P_e‖π_e) = ln 2 − h_2(P_e).

REMARK 7.4 Fano's inequality (7.46) is sometimes weakened by replacing the denominator of (7.46) with ln m and by replacing I(J; Y) with the upper bound
I(J; Y) ≤ Σ_{i≠j} π_i π_j D(p_i‖p_j),     (7.48)
which follows from (6.9) and the fact that D(p_j‖p) ≤ Σ_i π_i D(p_j‖p_i) by convexity of the KL divergence. However, in many classification problems π_i π_j D(p_i‖p_j) is larger than ln m for at least one pair (i, j), and the inequality (7.48) is loose.

Example 7.7 We apply Fano's inequality to Example 7.5. The mutual information is given by I(J; Y) = Σ_j π_j D(p_j‖p) = D(p_1‖p) by symmetry of the conditional pdfs {p_j, j = 1, 2, 3, 4}. Thus

I(J; Y) = D(p_1‖p) = (1 − λ) ln((1 − λ)/(1/4)) + λ ln(λ/(1/4)) + 0 = −h_2(λ) + ln 4.

Fano's inequality (7.46) yields

P_e ≥ [h_2(λ) − h_2(P_e)] / ln 3,

which is satisfied by P_e = λ but also by values below λ. Hence Fano's inequality is loose for this problem.
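The fixed-point equation in Fano's inequality has no closed form, but it is easy to solve by bisection since p_e − f(p_e) is monotone on [0, 1/2]. A sketch for the Example 7.7 bound P_e ≥ [h_2(λ) − h_2(P_e)]/ln 3, with an arbitrary λ, confirming that the Fano lower bound falls strictly below the true error probability P_e = λ:

```python
import math

def h2(p):
    """Binary entropy in nats."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log(p) - (1 - p) * math.log(1 - p)

lam = 0.2                                          # hypothetical lambda in (0, 1/2)
f = lambda pe: (h2(lam) - h2(pe)) / math.log(3)    # fixed-point map from (7.46)

lo, hi = 0.0, 0.5                                  # bisect pe - f(pe) = 0 on [0, 1/2]
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if mid - f(mid) < 0 else (lo, mid)
pe_lb = 0.5 * (lo + hi)
```

Since f(λ) = 0 and λ − f(λ) = λ > 0, the root is strictly below λ, which is the looseness noted in the example.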
Example 7.8 We apply Fano's inequality to Example 7.6. Here we have
I(J; Y) = D(p_1‖p) = (1 − λ) ln((1 − λ)/(1/4)) + 3 (λ/3) ln((λ/3)/(1/4)) = −h_2(λ) + ln 4 − λ ln 3.

Fano's inequality yields

P_e ≥ [h_2(λ) + λ ln 3 − h_2(P_e)] / ln 3,

which is satisfied with equality by P_e = λ. Hence Fano's inequality is tight for this problem. This result could also have been predicted from the condition (7.44) for equality in the data-processing inequality: we have δ_B(y) = ⌈y⌉ and

π_j p_j(y) = { (1 − λ)/4 = 4(1 − λ) π_j p(y) : δ_B(y) = j
               λ/12 = (4λ/3) π_j p(y)        : δ_B(y) ≠ j.

7.6 Appendix: Proof of Theorem 7.4
The minimum error probability is achieved by a deterministic MAP rule. Without loss of generality we assume that ties among the posterior probabilities are broken in favor of the lower index. Hence for the binary test between H_i and H_j with i < j, with respective priors π_i/(π_i + π_j) and π_j/(π_i + π_j), the Bayes rule returns i in the event π_i p_i(Y) = π_j p_j(Y), and

P_e(i, j) = [π_i/(π_i + π_j)] P_i{π_i p_i(Y) < π_j p_j(Y)} + [π_j/(π_i + π_j)] P_j{π_i p_i(Y) ≥ π_j p_j(Y)}.     (7.49)
For the m-ary test, recall that Y_i denotes the acceptance region for hypothesis H_i, 1 ≤ i ≤ m.

Upper Bound By our tie-breaking convention, for any i < j we have

y ∈ Y_i ⇒ π_i p_i(y) ≥ π_j p_j(y) ⇒ P_j(Y_i) ≤ P_j{π_i p_i(Y) ≥ π_j p_j(Y)},
y ∈ Y_j ⇒ π_i p_i(y) < π_j p_j(y) ⇒ P_i(Y_j) ≤ P_i{π_i p_i(Y) < π_j p_j(Y)}.

Hence from (7.49) we have the inequality

P_e(i, j) ≥ [π_i/(π_i + π_j)] P_i(Y_j) + [π_j/(π_i + π_j)] P_j(Y_i).     (7.50)
Equality holds if the decision regions for the binary test are Y_i and Y_j. Therefore

P_e = Σ_{i≠j} π_j P_j(Y_i)
    = Σ_{i<j} [π_j P_j(Y_i) + π_i P_i(Y_j)]
    ≤ Σ_{i<j} (π_i + π_j) P_e(i, j)     (a)
    = Σ_{i≠j} π_i P_e(i, j),            (b)

where inequality (a) follows from (7.50), and equality (b) holds because P_e(i, j) = P_e(j, i).
Lower Bound For the lower bound of (7.41), we show the following identity:

P_e = (2/m) Σ_{i≠j} π_i P_e(i, j) + (1/m) Σ_{i<j} ∫_{(Y_i ∪ Y_j)^c} |π_i p_i(y) − π_j p_j(y)| dμ(y).     (7.51)

Since each integral is nonnegative, the lower bound of (7.41) follows. To show (7.51), observe that for each i, j, we have

π_i P_i(Y_i) − π_i P_i(Y_j) + π_j P_j(Y_j) − π_j P_j(Y_i)
  = ∫_{Y_i} [π_i p_i(y) − π_j p_j(y)] dμ(y) + ∫_{Y_j} [π_j p_j(y) − π_i p_i(y)] dμ(y)
  = ∫_{Y_i ∪ Y_j} |π_i p_i(y) − π_j p_j(y)| dμ(y),     (7.52)

where the last equality holds because the integrands are nonnegative over their respective domains. Summing over all (i, j), we obtain
Σ_{i,j} [π_i P_i(Y_i) − π_i P_i(Y_j) + π_j P_j(Y_j) − π_j P_j(Y_i)]
  = 2 Σ_{i<j} ([π_i P_i(Y_i) − π_i P_i(Y_j)] + [π_j P_j(Y_j) − π_j P_j(Y_i)])
  = 2 Σ_{i<j} ∫_{Y_i ∪ Y_j} |π_i p_i(y) − π_j p_j(y)| dμ(y),     (7.53)

where the last equality follows from (7.52). But direct evaluation of the same sum by decomposition into four terms yields

Σ_{i,j} [π_i P_i(Y_i) − π_i P_i(Y_j) + π_j P_j(Y_j) − π_j P_j(Y_i)] = 2m(1 − P_e) − 2.     (7.54)
Equating (7.53) and (7.54), we obtain

2 Σ_{i<j} ∫_{Y_i ∪ Y_j} |π_i p_i(y) − π_j p_j(y)| dμ(y) = 2(m − 1) − 2m P_e

and therefore

P_e = (1/m) [m − 1 − Σ_{i<j} ∫_{Y_i ∪ Y_j} |π_i p_i(y) − π_j p_j(y)| dμ(y)]
    = (1/m) Σ_{i<j} [π_i + π_j − ∫_{Y_i ∪ Y_j} |π_i p_i(y) − π_j p_j(y)| dμ(y)],     (7.55)

where the last line holds because Σ_{i<j} [π_i + π_j] = Σ_{i≠j} π_i = m − 1. Now

(π_i + π_j) P_e(i, j) = ∫_Y min{π_i p_i(y), π_j p_j(y)} dμ(y) = (1/2) [π_i + π_j − ∫_Y |π_i p_i(y) − π_j p_j(y)| dμ(y)]   ∀ i ≠ j,

where the last equality follows from the identity min{a, b} = (1/2)(a + b − |a − b|). Substituting into (7.55), we obtain

P_e = (1/m) Σ_{i<j} [2(π_i + π_j) P_e(i, j) + ∫_{(Y_i ∪ Y_j)^c} |π_i p_i(y) − π_j p_j(y)| dμ(y)],

which proves (7.51). For each (i, j), the integral in the previous equation may be written as

∫_{(Y_i ∪ Y_j)^c} |π(i|y) − π(j|y)| p(y) dμ(y),

which is zero under the condition stated following (7.41).
Exercises
7.1 Acquiring Observations Cannot Increase Bayes Risk. Consider a binary hypothesis testing problem H_0 versus H_1 with costs C_10 = 2, C_01 = 1, and C_00 = C_11 = 0. Give the prior Bayes risk r_0, which is defined as the Bayes risk without observations. Use Jensen's inequality to show that the availability of observations cannot increase that risk. Under what condition on the observational model does the risk actually remain r_0?

7.2 Chernoff Upper Bound and Kailath Lower Bound on P_e (I). Evaluate the Chernoff upper bound (7.36) (for u ∈ {0.25, 0.5, 0.75}) and the Kailath lower bound (7.7) for P_0 = [0.1, 0.1, 0.8] and P_1 = [0, 0.5, 0.5] and a uniform prior.

7.3 Chernoff Upper Bound and Kailath Lower Bound on P_e (II). Evaluate the Chernoff upper bound (7.36) (for 0 < u < 1) and the Kailath lower bound (7.7) for the following distributions on [0, 1]:
p_0(y) = 2y,   p_1(y) = 3y²,

assuming a uniform prior. Evaluate the Chernoff information C(p_0, p_1) and the Chernoff bound in (7.40).

7.4 Chernoff and Bhattacharyya Bounds. Consider the binary hypothesis testing problem with

p_0(y) = (1/2) e^{−|y|}   and   p_1(y) = e^{−2|y|}.

Assume equal priors. Find the Chernoff and Bhattacharyya bounds on P_e.

7.5 Detection in Neural Networks. Consider the following Bayesian detection problem with equal priors:
H_0 : Y_k = −m + Z_k,   k = 1, ..., n
H_1 : Y_k = m + Z_k,    k = 1, ..., n,

where m > 0 and Z_1, Z_2, ..., Z_n are i.i.d. continuous-valued random variables with a symmetric distribution around zero, i.e., p_{Z_k}(x) = p_{Z_k}(−x) for all x ∈ R. Suppose we quantize each observation to one bit to form

U_k = g(Y_k) = { 1 : Y_k ≥ 0
                 0 : Y_k < 0.
(a) Show that the minimum-error-probability detector based on U_1, U_2, ..., U_n is a majority-logic detector, i.e.,

δ_opt(u) = { 1 if Σ_{k=1}^n u_k ≥ n/2
             0 otherwise.
(b) Now let us specialize to the case where m = ln 2 and Z_k has the Laplacian pdf

p_{Z_k}(x) = (1/2) e^{−|x|}.

Find the Chernoff bound on the probability of error of the detector of part (a).

7.6 Bounds on ROC. The binary Bhattacharyya distance

d_{1/2}(a, b) ≜ −ln(√(ab) + √((1 − a)(1 − b))) = B([a, 1 − a], [b, 1 − b]),   0 ≤ a, b ≤ 1,

is the Bhattacharyya distance between two Bernoulli distributions with parameters a and b. Let α and β be the probabilities of false alarm and miss for a binary hypothesis test between p_0 and p_1.

(a) Show that d_{1/2}(α, 1 − β) ≤ B(p_0, p_1).
(b) Give a lower bound on β when α = 0.
(c) Give a lower bound on β when α = β.
(d) Give a lower bound on β when α = 1/2 and B(p_0, p_1) ≥ ln √2.
(e) Give a lower bound on α when β = 0.
7.7 Ternary Gaussian Hypothesis Testing. Consider the hypotheses H_j : Y ∼ N(μ_j, I_2), j = 1, 2, 3, with uniform prior. The mean vectors μ_1 = [0, 1]ᵀ, μ_2 = [√3/2, −1/2]ᵀ, and μ_3 = [−√3/2, −1/2]ᵀ are the vertices of a simplex. Evaluate the bounds of (7.41). Do you expect P_e to be closer to the lower or to the upper bound? Justify your answer.

7.8 9-ary Gaussian Hypothesis Testing. Assume a uniform prior on the hypotheses H_j : Y ∼ N(3j − 15, 1), 1 ≤ j ≤ 9. Evaluate (7.41). Do you expect P_e to be closer to the lower or to the upper bound? Justify your answer.
7.9 Signal Design. Denote by P_λ the Poisson distribution with parameter λ > 0. Consider the sequence λ = {λ_i}_{i=1}^{20} whose element λ_i equals 10 for 1 ≤ i ≤ 10 and 20 for 11 ≤ i ≤ 20. We wish to design a sequence s and transmit either s or −s to a receiver which observes a sequence Y at the output of an optical communication channel. The corresponding probability distributions P_0 and P_1 are as follows:

H_0 : Y_i ∼ indep. P_{λ_i + s_i},   1 ≤ i ≤ n,
H_1 : Y_i ∼ indep. P_{λ_i − s_i},   1 ≤ i ≤ n.

Find a sequence s that maximizes D(P_0‖P_1) subject to the constraint Σ_i s_i² = 1. You may use the approximation ln(1 + ε) ≈ ε − ε²/2 for small ε.

7.10 Fano's Inequality. Let X = Y = {1, 2, 3, 4}. The following matrix gives p_j(y) for each row j:
[ 1    0    0    0
  0    1    0    0
  0    0    1    0
  1/4  1/4  1/4  1/4 ]

(a) Derive the MPE and Fano's lower bound on it, assuming a uniform prior.
(b) Derive the MPE and the generalized Fano lower bound on it, assuming a prior π = (1/6, 1/6, 1/6, 1/2).
References
[1] S. Kullback and R. Leibler, "On Information and Sufficiency," Ann. Math. Stat., Vol. 22, pp. 79–86, 1951.
[2] T. Kailath, "The Divergence and Bhattacharyya Distance Measures in Signal Selection," IEEE Trans. Comm. Technology, Vol. 15, No. 1, pp. 52–60, 1967.
[3] H. Chernoff, "A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on the Sum of Observations," Ann. Math. Stat., Vol. 23, pp. 493–507, 1952.
[4] T. Lissack and K. S. Fu, "Error Estimation in Pattern Recognition via L_α Distance Between Posterior Density Functions," IEEE Trans. Inf. Theory, Vol. 22, No. 1, pp. 34–45, 1976.
[5] F. D. Garber and A. Djouadi, "Bounds on Bayes Error Classification Based on Pairwise Risk Functions," IEEE Trans. Pattern Anal. Mach. Intell., Vol. 10, No. 2, pp. 281–288, 1988.
[6] J. T. Chu and J. C. Chueh, "Error Probability in Decision Functions for Character Recognition," Journal ACM, Vol. 14, pp. 273–280, 1967.
[7] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd edition, Wiley, 2006.
[8] T. S. Han and S. Verdú, "Generalizing the Fano Inequality," IEEE Trans. Inf. Theory, Vol. 40, No. 4, pp. 1247–1251, 1994.
8 Large Deviations and Error Exponents for Hypothesis Testing

In this chapter, we apply large-deviations theory, whose basis is the Chernoff bound studied in Section 7.3, to derive bounds on hypothesis testing with a large number of i.i.d. observations under each hypothesis [1, 2]. We study the asymptotics of these methods and present tight approximations that are based on the method of exponential tilting and can easily be evaluated.

8.1 Introduction
Consider the following hypothesis testing problem as motivation. The hypotheses are:

H_0 : Y_k ∼ i.i.d. p_0,   H_1 : Y_k ∼ i.i.d. p_1,   1 ≤ k ≤ n.     (8.1)

The LLRT is given by

ln L(Y) = Σ_{k=1}^n L_k ≷ τ,     (8.2)

where

L_k ≜ ln(p_1(Y_k)/p_0(Y_k)),   k = 1, ..., n.
Since Y_k are i.i.d., so are L_k for k = 1, ..., n, under both H_0 and H_1. For instance, the energy detector of Section 5.7 satisfies

L_k = Y_k² ∼ Gamma(1/2, 1/(2σ_j²))   under P_j, j = 0, 1.     (8.3)
We would like to upper bound the false-alarm and miss probabilities

P_F ≤ P_0{Σ_{k=1}^n L_k ≥ τ},     (8.4)
P_M ≤ P_1{Σ_{k=1}^n L_k ≤ τ}.     (8.5)
Both inequalities (8.4) and (8.5) hold with equality if randomization is not needed. The two probabilities on the right-hand side can be upper bounded using the tools developed in this chapter, resulting in

P_F ≤ P_F^UB = e^{−n e_F},   P_M ≤ P_M^UB = e^{−n e_M},     (8.6)

where e_F and e_M are two error exponents that we shall see are easy to compute. We will also use refinements of large-deviations theory to derive good asymptotic approximations P̂_F and P̂_M in the sense that

P_F ∼ P̂_F   and   P_M ∼ P̂_M   as n → ∞,     (8.7)

where the symbol ∼ means that the ratio of the left and right hand sides tends to 1 as n → ∞.

8.2 Chernoff Bound for Sum of i.i.d. Random Variables
Consider a sequence of n i.i.d. random variables X_i, 1 ≤ i ≤ n, with common cgf κ_X and large-deviations function Λ_X(a). We are interested in upper bounding the probability

P{S_n ≥ na},   a > E(X),     (8.8)

where we have defined the sum

S_n = Σ_{i=1}^n X_i.     (8.9)

PROPOSITION 8.1 The large-deviations bound takes the form

P{S_n ≥ na} ≤ e^{−n Λ_X(a)},   a > E(X).     (8.10)

Proof Since X_i, i = 1, ..., n, are i.i.d., the cgf of S_n is κ_{S_n}(u) = n κ_X(u). Hence

P{S_n ≥ na} ≤ e^{−n[ua − κ_X(u)]}   ∀ a > E(X), u > 0.     (8.11)
Optimizing u according to (7.17) proves the claim (8.10).

In comparison, Chebyshev's bound (applied to a zero-mean X) yields

P{S_n ≥ na} ≤ P{S_n² ≥ (na)²} ≤ n E(X²)/(na)² = E(X²)/(na²),

which is inversely proportional to n. Hence the Chebyshev bound is loose for large n.
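For a Bernoulli(q) sequence the rate function has the closed form Λ_X(a) = a ln(a/q) + (1 − a) ln((1 − a)/(1 − q)), the KL divergence between Bernoulli parameters, so both the exact tail and the bound (8.10) can be computed directly. A sketch with hypothetical parameters:

```python
import math

q, n, a = 0.3, 50, 0.5          # X_i ~ Bernoulli(q); bound P{S_n >= na}, a > E[X] = q

def rate(a, q):
    """Large-deviations function of Bernoulli(q): the divergence d(a||q)."""
    return a * math.log(a / q) + (1 - a) * math.log((1 - a) / (1 - q))

exact = sum(math.comb(n, k) * q**k * (1 - q)**(n - k)
            for k in range(math.ceil(n * a), n + 1))
bound = math.exp(-n * rate(a, q))
```

The bound overshoots the exact tail only by a factor that is polynomial in n, consistent with Cramér's theorem below.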
8.2.1 Cramér's Theorem
The following theorem states that the upper bound in Proposition 8.1 is tight in the exponent, by providing a lower bound on P{S_n ≥ na}.

THEOREM 8.1 Cramér [3]. For a > E[X],

P{S_n ≥ na} ≤ e^{−n Λ_X(a)}
and, given ε > 0, there exists n_ε such that

P{S_n ≥ na} ≥ e^{−n(Λ_X(a) + ε)}   for all n > n_ε.     (8.12)

That is:

lim_{n→∞} −(1/n) ln P{S_n ≥ na} = Λ_X(a),   ∀ a > E[X].     (8.13)
A proof of the lower bound (8.12) is given in [3] (also see Exercises 8.12 and 8.13), and is strengthened in Theorem 8.4 under a finiteness assumption on the third absolute moment of X. Cramér's theorem says that the tail probability P{S_n ≥ na} decays exponentially with n, and the exponent Λ_X(a) in the upper bound (8.10) cannot be improved. Note, however, that the ratio r_n of the exact tail probability to the upper bound e^{−n Λ_X(a)} does not necessarily converge to 1. In fact it generally converges to 0, as seen in the case where X_i are i.i.d. N(0, 1), in which case S_n ∼ N(0, n) and

P{S_n ≥ na} = Q(a√n) ∼ e^{−na²/2}/√(2πn a²) = e^{−n Λ_X(a)}/√(2πn a²)   as n → ∞.

Hence the ratio r_n decays as (2πna²)^{−1/2}. The large-deviations bound (8.10) is said to be tight in the exponent. Cramér's theorem can be generalized slightly in the following way, by exploiting the continuity of the rate function Λ_X (see Exercise 8.1).
THEOREM 8.2 Suppose a_n → a as n → ∞ and a > E[X]. Then

lim_{n→∞} −(1/n) ln P{S_n ≥ n a_n} = Λ_X(a).     (8.14)
8.2.2 Why is the Central Limit Theorem Inapplicable Here?
The partial sum S_n of (8.9) has mean nE(X) and variance nVar(X). The normalized random variable

T_n ≜ (S_n − nE[X]) / √(nVar(X))

has zero mean and unit variance. By the Central Limit Theorem (CLT), T_n converges in distribution to the normal random variable N(0, 1). Assume a > E[X]. Since

P{S_n ≥ na} = P{T_n > √n (a − E[X])/√(Var(X))},     (8.15)

it seems tempting to use the "Gaussian approximation"

P{S_n ≥ na} ≈ Q(√n (a − E[X])/√(Var(X))).

Using the bound Q(x) ≤ e^{−x²/2} for x > 0 yields the approximation

P{S_n ≥ na} ≈ exp{−n (a − E[X])²/(2Var(X))}.     (8.16)
Unfortunately (8.16) disagrees with (8.10), except when X is Gaussian to begin with! In general the exponent in (8.16) could be larger than Λ_X(a) or smaller. What went wrong? The problem lies in an incorrect application of the CLT. Convergence of T_n in distribution merely says that

lim_{n→∞} P{T_n > z} = Q(z)

for any fixed z ∈ R. However, in (8.15) z is not fixed; in fact it grows as √n as n → ∞. Unless X is Gaussian, the tails of the distribution of T_n are not Gaussian. For instance, if X is uniformly distributed over the unit interval [−1/2, 1/2], we have −n/2 ≤ S_n ≤ n/2. Hence

P{S_n > n/2} = 0,   ∀ n ≥ 1,

again showing the inadequacy of the Gaussian approximation (8.16). The Gaussian approximation holds in the small-deviations regime (also known as the central regime). It still provides the correct exponent in the so-called moderate-deviations regime but fails in the large-deviations regime.¹
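The uniform counterexample is easy to check by simulation: no sample sum ever exceeds n/2, while the Gaussian approximation (8.16) assigns that event a strictly positive probability. A small sketch:

```python
import math
import random

random.seed(0)
n, trials = 20, 10_000
# X uniform on [-1/2, 1/2]: S_n can never exceed n/2
exceed = sum(1 for _ in range(trials)
             if sum(random.uniform(-0.5, 0.5) for _ in range(n)) > n / 2)

# the Gaussian "approximation" (8.16) with a = 1/2 and Var(X) = 1/12 is positive
gauss_approx = math.exp(-n * 0.5**2 / (2 * (1 / 12)))
```

The approximation value is tiny (e^{−3n/2} here) but nonzero, whereas the true tail probability is exactly zero for every n.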
¹ Moderate deviations refers to the evaluation of tail probabilities such as P{S_n > n^β a}, where β ∈ (1/2, 1) is a fixed constant. Then it may be shown that lim_{n→∞} −(1/n^{2β−1}) ln P{S_n > n^β a} = (a − E[X])²/(2Var[X]).

8.3 Hypothesis Testing with i.i.d. Observations

We return to our detection problem with i.i.d. observations (8.1) and use large-deviations methods to upper bound P_F and P_M. The case of i.i.d. observations is a special case of (7.22), with Y replaced by an i.i.d. sequence Y = (Y_1, ..., Y_n) and p_0 and p_1 replaced by the product distributions p_0^n and p_1^n. For each n, LLRTs take the form

δ̃^{(n)}(y) = { 1           if S_n > nτ_n
               1 w.p. γ_n   if S_n = nτ_n
               0           if S_n < nτ_n,     (8.17)

where

S_n = Σ_{k=1}^n ln(p_1(Y_k)/p_0(Y_k)).     (8.18)

The cgf
for S_n is n times the cgf for the random variable L = ln(p_1(Y)/p_0(Y)). Applying (7.33) and (7.34), we obtain the following bounds on P_F and P_M:

P_F(δ̃^{(n)}) ≤ e^{−n Λ_0(τ_n)},     (8.19)
P_M(δ̃^{(n)}) ≤ e^{−n[Λ_0(τ_n) − τ_n]}.     (8.20)

If the sequence τ_n is bounded away from −D(p_0‖p_1) and D(p_1‖p_0), we see that both error probabilities vanish exponentially fast as n → ∞. If the sequence τ_n converges to a limit τ ∈ (−D(p_0‖p_1), D(p_1‖p_0)), then we may apply Theorem 8.2 along with the fact that (due to possible randomization)

P_0{S_n > nτ_n} ≤ P_F(δ̃^{(n)}) ≤ P_0{S_n ≥ nτ_n}

to conclude that the error exponent for P_F is given by

e_F(τ) = −lim_{n→∞} (1/n) ln P_F(δ̃^{(n)}) = Λ_0(τ) = u∗τ − κ_0(u∗).     (8.21)

Similarly we can show that the error exponent for P_M is given by

e_M(τ) = −lim_{n→∞} (1/n) ln P_M(δ̃^{(n)}) = Λ_1(τ) = −τ + e_F(τ).     (8.22)

It is clear from (8.21) and (8.22) that randomization (i.e., γ_n) plays no role in the error exponents.
8.3.1 Bayesian Hypothesis Testing with i.i.d. Observations

The Bayes LLRT for i.i.d. observations takes the form

δ_B^{(n)}(y) = { 1 if S_n/n ≥ τ_n
                0 otherwise,

where τ_n = (1/n) ln(π_0/π_1). Note that τ_n → 0 as n → ∞, and therefore from (8.21) and (8.22) it follows that

e_F = lim_{n→∞} −(1/n) ln P_F(δ_B^{(n)}) = Λ_0(0) = C(p_0, p_1)

and that

e_M = lim_{n→∞} −(1/n) ln P_M(δ_B^{(n)}) = Λ_1(0) = Λ_0(0) = C(p_0, p_1).

Combining these results, we get that P_e = π_0 P_F + π_1 P_M has an error exponent given by

lim_{n→∞} −(1/n) ln P_e(δ_B^{(n)}) = C(p_0, p_1).     (8.23)

Finally, the error exponent for minimax hypothesis testing can be shown to be the same as that for Bayesian hypothesis testing, and is given by the Chernoff information C(p_0, p_1) (see Exercise 8.5).
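Since Λ_0(0) = sup_u [−κ_0(u)] = −min_{0≤u≤1} κ_0(u), the Chernoff information for discrete distributions is easily obtained by a grid search over u. A sketch with hypothetical pmfs:

```python
import math

p0 = [0.8, 0.2]                 # hypothetical pmfs
p1 = [0.3, 0.7]

def kappa0(u):
    """cgf of the LLR under p0: kappa_0(u) = ln sum_y p0(y)^{1-u} p1(y)^u."""
    return math.log(sum(a ** (1 - u) * b ** u for a, b in zip(p0, p1)))

us = [k / 1000 for k in range(1001)]
C = -min(kappa0(u) for u in us)           # Chernoff information C(p0, p1)
```

Since κ_0(0) = κ_0(1) = 0 and κ_0 is convex, the minimizer lies in (0, 1), and C always dominates the Bhattacharyya exponent −κ_0(1/2).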
8.3.2 Neyman–Pearson Hypothesis Testing with i.i.d. Observations
The Neyman–Pearson LLRT for i.i.d. observations takes the form

δ̃_NP^{(n)}(y) = { 1           if S_n/n > τ_n
                  1 w.p. γ_n   if S_n/n = τ_n
                  0           if S_n/n < τ_n,

λ_0. Define the ratio

ξ = λ_1/λ_0.     (8.33)
Computing the cgf κ_0(u) The likelihood ratio is given by

L(y) = (λ_1/λ_0) exp(−(λ_1 − λ_0)y) 1{y ≥ 0}.

Then

κ_0(u) = ln E_0[L(Y)^u] = u ln(λ_1/λ_0) + ln E_0[exp(−(λ_1 − λ_0)Y u)]

with

E_0[exp(−(λ_1 − λ_0)Y u)] = ∫_0^∞ λ_0 exp(−(λ_1 − λ_0)yu) exp(−λ_0 y) dy = λ_0/(λ_1 u + λ_0(1 − u)).

Thus

κ_0(u) = u ln(λ_1/λ_0) + ln(λ_0/(λ_1 u + λ_0(1 − u))) = u ln ξ − ln(ξu + 1 − u).     (8.34)

Hence

κ_0′(u) = ln ξ − (ξ − 1)/(ξu + 1 − u)     (8.35)

and the error exponents for the Chernoff–Stein problem (Lemma 8.1) are given by

D(p_0‖p_1) = −κ_0′(0) = −ln ξ + ξ − 1 = 2ψ(ξ),
D(p_1‖p_0) = κ_0′(1) = ln ξ + 1/ξ − 1 = 2ψ(1/ξ),

where ψ is defined in (6.4); see also Figure 6.1.
Computing the Error Exponents e_F and e_M for an LLRT Consider an LLRT of the form (8.17) with threshold τ_n converging to τ ∈ (−D(p_0‖p_1), D(p_1‖p_0)). Note that e_F = Λ_0(τ) and e_M = Λ_1(τ), where Λ_0(τ) and Λ_1(τ) can be computed from the solution u∗ to the equation κ_0′(u∗) = τ, i.e.,

ln ξ − (ξ − 1)/(ξu∗ + 1 − u∗) = τ.

This yields

u∗ = 1/(ln ξ − τ) − 1/(ξ − 1).     (8.36)

After some straightforward algebra, we can compute

e_F(τ) = Λ_0(τ) = τu∗ − κ_0(u∗) = 2ψ((ln ξ − τ)/(ξ − 1))     (8.37)
and

e_M(τ) = Λ_1(τ) = Λ_0(τ) − τ = 2ψ((ln ξ − τ)/(ξ − 1)) − τ.     (8.38)

Figure 8.2 shows the tradeoff between e_M and e_F as τ is varied between the limits of −D(p_0‖p_1) and D(p_1‖p_0).

Figure 8.2 Tradeoff between the false-alarm and miss exponents for the hypothesis test of (8.32). Here ξ = 2, D(p_0‖p_1) = 2ψ(2), and D(p_1‖p_0) = 2ψ(1/2). The endpoints are (e_F, e_M) = (0, D(p_0‖p_1)) at τ = −D(p_0‖p_1), the Chernoff information point (C(p_0, p_1), C(p_0, p_1)) at τ = 0, and (D(p_1‖p_0), 0) at τ = D(p_1‖p_0).
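The closed-form expressions (8.36)–(8.38) are simple to verify numerically. The sketch below, with ξ = 2 as in Figure 8.2 and an arbitrary admissible τ, checks that u∗ indeed solves κ_0′(u∗) = τ and that τu∗ − κ_0(u∗) matches the 2ψ form:

```python
import math

xi = 2.0                                       # ratio lambda1 / lambda0
psi = lambda x: 0.5 * (x - 1 - math.log(x))    # psi function from (6.4)
kappa0 = lambda u: u * math.log(xi) - math.log(xi * u + 1 - u)    # (8.34)
dkappa0 = lambda u: math.log(xi) - (xi - 1) / (xi * u + 1 - u)    # (8.35)

tau = 0.1                                      # in (-2 psi(xi), 2 psi(1/xi))
u_star = 1 / (math.log(xi) - tau) - 1 / (xi - 1)                  # (8.36)
eF = tau * u_star - kappa0(u_star)             # Lambda_0(tau), cf. (8.37)
eM = eF - tau                                  # Lambda_1(tau), cf. (8.38)
```

Sweeping tau over the admissible interval traces out the (e_F, e_M) tradeoff curve of Figure 8.2.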
Computing the Chernoff Information C(p_0, p_1) The error exponent for Bayesian hypothesis testing C(p_0, p_1) can be computed using (8.37) as

C(p_0, p_1) = Λ_0(0) = 2ψ(ln ξ/(ξ − 1)).
Hoeffding Problem Finally, we consider the solution to the Hoeffding problem (8.25). To this end, we first compute the geometric mixture pdf

p_u(y) = p_0(y) L(y)^u / exp(κ_0(u)) = λ_0(ξu + (1 − u)) exp(−λ_0(ξu + (1 − u))y) 1{y ≥ 0}.

Then

D(p_u‖p_0) = E_u[ln(p_u(Y)/p_0(Y))] = ln(ξu + (1 − u)) + 1/(ξu + (1 − u)) − 1 = 2ψ(1/(ξu + (1 − u))).
We now need to solve for u_H that satisfies D(p_{u_H}‖p_0) = γ, which can be done numerically through the function ψ, keeping in mind that u_H ∈ (0, 1). Finally we can compute the LLRT threshold for the Hoeffding test as (see (8.26))

τ_H = κ_0′(u_H) = ln ξ − (ξ − 1)/(ξu_H + 1 − u_H).

8.4 Refined Large Deviations
We now study the problem stated in (8.7): find exact asymptotic expressions for P_F and P_M. To do so, we first revisit the generic large-deviations problem of Section 7.3.2 and introduce the method of exponential tilting. Then we consider sums of i.i.d. random variables as in Section 8.2 and derive the desired asymptotics. The results are then applied to binary hypothesis testing.
8.4.1 The Method of Exponential Tilting

Let p be a pdf on X (continuous, discrete, or mixed) with respect to a measure μ. Fix some u > 0 and define the tilted distribution (also known as the exponentially tilted distribution, or twisted distribution)

p̃(x) ≜ e^{ux} p(x) / ∫ e^{ux} p(x) dμ(x) = e^{ux − κ(u)} p(x),   u ∈ R.     (8.39)

This equation defines a mapping from p to p̃ named the Esscher transform [5]. Observe that p̃(x) = p(x) in the case u = 0. The first two cumulants of p̃ are given by

E_p̃(X) = ∫ x p̃(x) dμ(x) = ∫ x e^{ux} p(x) dμ(x) / ∫ e^{ux} p(x) dμ(x) = κ′(u),
Var_p̃(X) = κ″(u).

(This proves that κ″(u) ≥ 0, hence that the cgf is convex.) Consider a > E(X). Using the definition (8.39), we derive the identity

P{X ≥ a} = ∫_a^∞ p(x) dμ(x)
  = ∫_a^∞ e^{−[ux − κ(u)]} p̃(x) dμ(x)
  = e^{−[ua − κ(u)]} ∫_a^∞ e^{−u(x − a)} p̃(x) dμ(x)
  = e^{−[ua − κ(u)]} E_p̃[e^{−u(X − a)} 1{X ≥ a}].     (8.40)

Observe that the integral in (8.40) is the expectation of the function e^{−u(x − a)} 1{x ≥ a} ≤ 1 when the random variable X is distributed as p̃. For u ≥ 0, the value of the integral is at most 1, hence the large-deviations upper bound (7.15) follows directly and without applying Markov's inequality. Now we choose u = u∗ satisfying κ′(u∗) = a.
Figure 8.3 Original pdf p and tilted pdf p̃, with means E_p[X] and E_p̃[X] = a
The tilted distribution has mean a, as illustrated in Figure 8.3. Define the normalized random variable

V = (X − E_p̃(X))/√(Var_p̃(X)) = (X − κ′(u∗))/√(κ″(u∗)) = (X − a)/√(κ″(u∗)),     (8.41)

which has zero mean, unit variance, and cdf F_V. Using this notation and changing variables, we write (8.40) as

P{X ≥ a} = e^{−[u∗a − κ(u∗)]} ∫_0^∞ e^{−u∗ √(κ″(u∗)) v} dF_V(v).     (8.42)

The problem is now to evaluate the integral, which in many problems is much simpler than evaluating the original tail probability P{X ≥ a}.
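Exponential tilting is mechanical to carry out for a discrete distribution, and the identity E_p̃(X) = κ′(u) can be checked against a numerical derivative of the cgf. A sketch with a hypothetical pmf:

```python
import math

xs = [0, 1, 2]                  # hypothetical support and pmf
p = [0.5, 0.3, 0.2]
u = 0.7

Z = sum(math.exp(u * x) * px for x, px in zip(xs, p))        # e^{kappa(u)}
p_tilt = [math.exp(u * x) * px / Z for x, px in zip(xs, p)]  # tilted pmf (8.39)
mean_tilt = sum(x * q for x, q in zip(xs, p_tilt))

kappa = lambda v: math.log(sum(math.exp(v * x) * px for x, px in zip(xs, p)))
num_deriv = (kappa(u + 1e-6) - kappa(u - 1e-6)) / 2e-6       # approximates kappa'(u)
```

In the Esscher identity (8.40), one would choose u so that mean_tilt equals the target threshold a; here u is fixed just to exhibit the mean-shifting effect of the tilt.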
8.4.2 Sum of i.i.d. Random Variables

Now let X_i, i ≥ 1, be an i.i.d. sequence drawn from a distribution P. The method of exponential tilting can be applied to the sum Σ_{i=1}^n X_i. The cgf for Σ_{i=1}^n X_i is n times the cgf κ for X. Fix a > E[X] and let u∗ be the solution to κ′(u∗) = a. Analogously to (8.41), the normalized sum

V_n ≜ (Σ_{i=1}^n X_i − na)/√(n κ″(u∗))     (8.43)

has zero mean, unit variance, and cdf F_n. By application of (8.42), the following identity, due to Esscher, holds [5]:

P{Σ_{i=1}^n X_i ≥ na} = e^{−n[u∗a − κ(u∗)]} ∫_0^∞ e^{−u∗ √(n κ″(u∗)) v} dF_n(v).     (8.44)
As n → ∞, it follows from the Central Limit Theorem that Vn converges in distribution to a normal distribution, as illustrated in Figure 8.4. This property can be exploited to derive exact asymptotics of (8.44) as n → ∞ , as well as lower bounds.
Figure 8.4 Convergence of the pdf of V_n to a N(0, 1) random variable
The main result on exact asymptotics is the Bahadur–Rao theorem [6, 7], [3, Ch. 3.7], which applies to the case where X is a continuous random variable, or a discrete random variable of the nonlattice type.³

THEOREM 8.3 Bahadur–Rao. Let X be a random variable of the nonlattice type with cgf κ. Fix a > E[X] and let u∗ be the solution to κ′(u∗) = a. Let X_i, i ≥ 1, be an i.i.d. sequence of such random variables. Then the following asymptotic equality holds:

P{Σ_{i=1}^n X_i ≥ na} ∼ exp{−n[u∗a − κ(u∗)]}/(u∗ √(2π κ″(u∗) n)) = exp{−n Λ_X(a)}/(u∗ √(2π κ″(u∗) n))   as n → ∞.     (8.45)

We have used the standard notation f(n) ∼ g(n) for asymptotic equality of two functions f and g as n → ∞, i.e., lim_{n→∞} f(n)/g(n) = 1. Owing to the term √n in the denominator of (8.45), we now see that the upper bound of (8.11) is increasingly loose as n → ∞. (However, this is sometimes not a concern because an upper bound that is tight in the exponent may be satisfactory.)
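For the Gaussian special case treated at the end of this section, the exact tail Q(a√n) is available in closed form, so (8.45) can be checked directly: the ratio of the exact tail to the Bahadur–Rao approximation approaches 1, while the plain Chernoff bound overshoots by a factor growing like √n. A sketch:

```python
import math

a = 0.5
Q = lambda x: 0.5 * math.erfc(x / math.sqrt(2))   # standard normal tail

results = {}
for n in (10, 100, 1000):
    exact = Q(a * math.sqrt(n))                               # P{S_n >= na}
    bahadur_rao = math.exp(-n * a**2 / 2) / (a * math.sqrt(2 * math.pi * n))
    chernoff = math.exp(-n * a**2 / 2)                        # bound (8.10)
    results[n] = (exact, bahadur_rao, chernoff)
```

At n = 1000 the Bahadur–Rao ratio is already within about half a percent of 1, whereas the Chernoff bound is off by roughly a factor of 28.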
Sketch of Proof of Theorem 8.3. For simplicity the proof is sketched in the case where X is a continuous random variable. Since X is a continuous random variable, so is V_n for each n. The density of V_n is p_{V_n}(v) = dF_n(v)/dv, and

lim_{n→∞} p_{V_n}(0) = 1/√(2π).    (8.46)
While convergence might be slow in the tails of p_{V_n}, we now analyze (8.44) and show that only the value of p_{V_n} at v = 0 matters. Indeed let

ε = 1 / (u*√(n κ″(u*))),    (8.47)
³ A real random variable X is said to be of the lattice type if there exist numbers d and x_0 such that X belongs to the lattice {x_0 + kd, k ∈ Z} with probability 1. The largest d for which this holds is called the span of the lattice, and x_0 is the offset.
8.4 Reﬁned Large Deviations

Figure 8.5 The functions (1/ε) e^{−v/ε}, v ≥ 0, shown for ε = 0.1 and ε = 0.5
which tends to zero as n^{−1/2}. The function (1/ε) e^{−v/ε}, v ≥ 0, integrates to 1, and concentrates at v = 0 in the limit as ε ↓ 0 (see Figure 8.5). Hence the integral in (8.44) is of the form
∫_0^∞ e^{−v/ε} p_{V_n}(v) dv = ε ∫_0^∞ (1/ε) e^{−v/ε} p_{V_n}(v) dv  ∼(a)  ε p_{V_n}(0)  ∼(b)  ε/√(2π)  as ε → 0,    (8.48)

where the weight (1/ε) e^{−v/ε} is "Dirac-like," asymptotic equality (a) holds because p_{V_n} is continuous at v = 0, and (b) holds because of (8.46). A formal proof of this claim (taking into account that ε is actually a function of n) can be found, e.g., in [3]. Combining (8.44), (8.47), and (8.48), we obtain the desired asymptotic equality (8.45). If X is a discrete random variable of the lattice type, the constant (2π)^{−1/2} is replaced by a bounded oscillatory function of n, as derived by Blackwell and Hodges [6].
Special Case. If X_i, 1 ≤ i ≤ n are i.i.d. N(0, 1), then V_n is also N(0, 1) for each n. To evaluate (8.45), we note that κ(u) = u²/2 and that u* = a, as shown in Example 7.1. Thus (8.45) becomes

P{∑_{i=1}^n X_i ≥ na} ∼ e^{−na²/2} / (a√(2πn))  as n → ∞.    (8.49)
This expression could have been derived in a more direct way for this special case. Since ∑_{i=1}^n X_i ∼ N(0, n), we have

P{∑_{i=1}^n X_i ≥ na} = Q(a√n).

Using the asymptotic formula (5.41) for the Q function with x = a√n results immediately in the asymptotic expression (8.49).
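To see how sharp (8.49) is at moderate n, it can be compared numerically against the exact value Q(a√n); this is a small illustrative check, not part of the text, with Q computed via the complementary error function.

```python
import math

def Q(x):
    """Gaussian tail probability Q(x) = P{N(0,1) >= x}."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

a = 1.0
ratios = {}
for n in (10, 100, 1000):
    exact = Q(a * math.sqrt(n))   # exact P{sum X_i >= na} for i.i.d. N(0,1)
    approx = math.exp(-n * a * a / 2.0) / (a * math.sqrt(2.0 * math.pi * n))
    ratios[n] = exact / approx
    print(n, ratios[n])           # ratio approaches 1 as n grows
```

The ratio is already within about 10% at n = 10 and within 0.1% at n = 1000, consistent with the O(1/x²) correction in the Q-function asymptotics.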
8.4.3
Lower Bounds on Large-Deviations Probabilities
One can derive lower bounds on the large-deviations probability P{∑_{i=1}^n X_i ≥ na} using appropriate probability inequalities. Shannon and Gallager derived a lower bound in which the leading exponential term exp{−nΛ_X(a)} is multiplied by an exp{O(√n)} term. Theorem 8.4 provides a sharper lower bound in case the random variables have ﬁnite absolute third moment.
THEOREM 8.4 Let X be a random variable with cgf κ. Fix a > E(X) and let u* be the solution to κ′(u*) = a. Denote by ρ the normalized absolute third moment (the absolute third moment divided by Var(X)^{3/2}) of X under the tilted distribution. Let X_i, i ≥ 1 be an i.i.d. sequence of such random variables. Then
P{∑_{i=1}^n X_i ≥ na} ≥ (c/√n) e^{−nΛ_X(a)}  ∀n ≥ ( 1/(u*√(κ″(u*))) + ρ√(2πe) )²,    (8.50)

where the constant

c = exp{−1 − ρu*√(2πe κ″(u*))} / (u*√(2πe κ″(u*))).
Comparing with the Bahadur–Rao Theorem 8.3, we see that the lower bound (8.50) yields the same O(1/√n) subexponential correction, with a lower asymptotic constant.
Proof of Theorem 8.4. The starting point is the Esscher identity (8.44), which can be restated as

P{∑_{i=1}^n X_i ≥ na} = e^{−nΛ_X(a)} I_n,    (8.51)

where

I_n ≜ ∫_0^∞ e^{−u*√(nκ″(u*)) v} dF_n(v).    (8.52)
We now derive a lower bound on I_n. By the Central Limit Theorem, the normalized random variable V_n converges in distribution to N(0, 1). Since the X_i have ﬁnite normalized absolute third moment, the deviation of the cdf of V_n from the Gaussian cdf can be bounded using the Berry–Esséen theorem [8]:

∀n ≥ 1:  sup_{v∈R} |Φ(v) − F_n(v)| ≤ ρ/(2√n),

where F_n denotes the cdf of V_n. Fix any b > 0, let h = (b + 1)ρ√(2πe), and assume n ≥ h². Since the integrand of (8.52) is nonnegative, we can lower bound I_n by truncating the integral:
I_n ≥ ∫_0^{h/√n} e^{−u*√(nκ″(u*)) v} dF_n(v) ≥ e^{−h u*√(κ″(u*))} ∫_0^{h/√n} dF_n(v).
Now

∫_0^{h/√n} dF_n(v) = F_n(h/√n) − F_n(0)
  ≥ Φ(h/√n) − Φ(0) − ρ/√n
  ≥ (h/√n) φ(h/√n) − ρ/√n
  ≥ (hφ(1) − ρ)/√n = bρ/√n,
where the ﬁrst inequality follows from the Berry–Esséen theorem, the second from the fact that the Gaussian pdf φ(t) is decreasing for t > 0, and the third holds because n ≥ h². Hence

I_n ≥ (bρ/√n) e^{−(b+1)ρ√(2πe) u*√(κ″(u*))}.    (8.53)

This lower bound holds for all b > 0 and n ≥ 2πeρ²(b + 1)². To obtain the best bound, we maximize it over b and obtain

b = 1 / (ρu*√(2πe κ″(u*))).

Substituting into (8.53), we obtain I_n ≥ c/√n, which proves the claim.
8.4.4
Reﬁned Asymptotics for Binary Hypothesis Testing
Consider

H_0: Y ∼ p_0^n  versus  H_1: Y ∼ p_1^n.    (8.54)

The LLRT is given by

∑_{k=1}^n L_k ≷_{H_0}^{H_1} nτ,
where τ is the normalized threshold and L_k = ln[p_1(Y_k)/p_0(Y_k)] are i.i.d. random variables. Assuming that −D(p_0‖p_1) < τ < D(p_1‖p_0), the probability of false alarm P_F and the probability of miss P_M are upper bounded by (8.19) and (8.20), respectively. The reﬁned asymptotics of the previous section can be applied directly to yield
P_F ≤ P_0^n{∑_{k=1}^n L_k ≥ nτ} ∼ e^{−n[u*τ − κ_0(u*)]} / (u*√(2πκ_0″(u*)n)),    (8.55)

P_M ≤ P_1^n{∑_{k=1}^n L_k ≤ nτ} ∼ e^{−n[(u*−1)τ − κ_0(u*)]} / ((1 − u*)√(2πκ_0″(u*)n)).    (8.56)
Example 8.2 We revisit Example 8.1. The hypotheses are given by (8.32). Differentiating (8.35), we obtain

κ_0″(u) = (ξ − 1)² / [1 + u(ξ − 1)]².    (8.57)
Replacing in (8.55) and (8.56) yields

P_F ∼ exp{−n[u*τ − κ_0(u*)]} / ( [u*(ξ − 1)/(1 + u*(ξ − 1))] √(2πn) ),    (8.58)

P_M ∼ exp{−n[(u* − 1)τ − κ_0(u*)]} / ( [(1 − u*)(ξ − 1)/(1 + u*(ξ − 1))] √(2πn) ),    (8.59)

with u* given by (8.36) and the threshold τ in the range (−2ψ(ξ), 2ψ(1/ξ)).
The ROC using the upper bounds of (8.6), (8.37), and (8.38) on P_F and P_M is compared with the asymptotic approximation to the ROC using (8.58) and (8.59) in Figure 8.6 for ξ = 1 and n = 80. Since the error probabilities are small, the curve −(1/n) ln P_M versus −(1/n) ln P_F is given instead of the ROC. The asymptotic approximation is very accurate, as illustrated by the comparison with Monte Carlo simulations of Figure 8.6.

Figure 8.6 Monte Carlo simulation of the curve −(1/n) ln P_M versus −(1/n) ln P_F for ξ = 1 and n = 80, compared with the Chernoff lower bound on the ROC and with the asymptotic approximation

8.4.5
Non-i.i.d. Components

A variation of the hypothesis testing problem (8.54) involves non-i.i.d. data:

H_0: Y ∼ ∏_{k=1}^n p_{0k}  versus  H_1: Y ∼ ∏_{k=1}^n p_{1k}.    (8.60)
The log-likelihood ratio is still a sum of n independent random variables {L_k}, but they are no longer identically distributed:

ln[p_1(Y)/p_0(Y)] = ∑_{k=1}^n L_k,

where

L_k = ln[p_{1k}(Y_k)/p_{0k}(Y_k)],  1 ≤ k ≤ n.
Denote by κ_{0k}(u) the cgf for L_k under H_0. Since {L_k} are mutually independent, the cgf for ∑_{k=1}^n L_k under H_0 is given by

κ_0(u) = ∑_{k=1}^n κ_{0k}(u).
Using the same methods as in Section 8.3, one can derive the upper bounds

P_F ≤ (∏_{k=1}^n P_{0k}) {∑_{k=1}^n L_k ≥ τ} ≤ e^{−[u*τ − κ_0(u*)]},

P_M ≤ (∏_{k=1}^n P_{1k}) {∑_{k=1}^n L_k ≤ τ} ≤ e^{−[(u*−1)τ − κ_0(u*)]}.
The Gärtner–Ellis theorem [3], which is a generalization of Cramér's theorem, can be used to show that these upper bounds are tight in the exponent, if both the normalized cgf (1/n)∑_{k=1}^n κ_{0k}(u) and the normalized threshold τ/n converge to a ﬁnite limit as n → ∞. One may also ask whether reﬁned asymptotics may apply. To answer this question, we revisit the core problem of evaluating P{∑_{i=1}^n X_i ≥ na}, where X_i, i ≥ 1 are independent but not identically distributed random variables with respective cgfs κ_i(u), i ≥ 1. Deﬁne the normalized sum of cgfs
κ̄_n(u) = (1/n) ∑_{k=1}^n κ_k(u).

(In the i.i.d. case, κ̄_n(u) is independent of n.) Let u_n* be the solution of κ̄_n′(u_n*) = a. Analogously to (8.43), deﬁne

V_n ≜ (∑_{i=1}^n X_i − na) / √(n κ̄_n″(u_n*)).    (8.61)

From (8.44), we have

P{∑_{i=1}^n X_i ≥ na} = e^{−n[u_n* a − κ̄_n(u_n*)]} ∫_0^∞ e^{−u_n*√(n κ̄_n″(u_n*)) v} p_{V_n}(v) dv.    (8.62)
The question is whether the function κ n converges to a ﬁnite limit as n → ∞ (in which case u∗n also converges to a limit u∗ ) as in the Gärtner–Ellis theorem, and whether Vn of (8.61) converges in distribution to a normal distribution. If so, we have
P{∑_{i=1}^n X_i ≥ na} ∼ e^{−n[u_n* a − κ̄_n(u_n*)]} / (u_n* √(2π κ̄_n″(u_n*) n))  as n → ∞.    (8.63)
Conditions for convergence of V_n to a normal distribution are given by the Lindeberg conditions [8, p. 262], which are necessary and sufﬁcient. Let s_n² = ∑_{i=1}^n Var[X_i] and denote by F_i the cdf of the centered random variable X_i − E(X_i). The Lindeberg condition is

∀ε > 0:  lim_{n→∞} (1/s_n²) ∑_{i=1}^n ∫_{−εs_n}^{εs_n} t² dF_i(t) = 1.
This implies that no component of the sum dominates the others:

∀ε > 0, ∃n(ε): ∀n > n(ε):  max_{1≤i≤n} Var[X_i] < ε s_n².
Reﬁned large-deviations results can be obtained in even more general settings involving dependent random variables [9].

8.5 Appendix: Proof of Lemma 8.1
For any LLRT of the form in (8.17), we may bound P_M(δ̃^{(n)}) using a change of measure argument as

P_M(δ̃^{(n)}) ≤ P_1{S_n ≤ nτ_n} = E_1[1{S_n ≤ nτ_n}] = E_0[1{S_n ≤ nτ_n} e^{S_n}] ≤ e^{nτ_n},    (8.64)

since e^{S_n} is the likelihood ratio of the observations up to time n. This yields the upper bound

(1/n) ln P_M(δ̃^{(n)}) ≤ τ_n.    (8.65)
Given any ε > 0, consider the LLRT of the form in (8.17) that uses the constant threshold τ = −D(p_0‖p_1) + ε. Then, by (8.19), it follows that P_F(δ̃_τ^{(n)}) ≤ α for n sufﬁciently large. But the NP solution for level α must have a smaller P_M than any other test of level α, and therefore

(1/n) ln P_M(δ̃_NP^{(n)}) ≤ (1/n) ln P_M(δ̃_τ^{(n)}) ≤ τ = −D(p_0‖p_1) + ε

for n sufﬁciently large, where the second inequality follows from (8.65).
for n sufﬁciently large, where the last equality follows from (8.65). (n ) Let τn (α) denote the threshold of the αlevel test δ˜NP . It may be assumed that lim inf τn (α) ≥ − D ( p0 ± p1). n →∞
Otherwise since
( n) PF (δ˜NP ) ≥ P0
·S
n
n
¸ > τn (α)
(8.66)
and S_n/n → −D(p_0‖p_1) as n → ∞ by the law of large numbers, we have that

lim sup_{n→∞} P_F(δ̃_NP^{(n)}) = 1,

and the constraint of α ∈ (0, 1) will not be met for some large n. Furthermore, by the weak law of large numbers, we have that, for any given ε > 0,

P_0{−D(p_0‖p_1) − ε < S_n/n} → 1  as n → ∞.

Since ε > 0 was chosen arbitrarily, the claim of Lemma 8.1 follows.
Exercises
8.1 Slight Generalization of Cramér's Theorem. Prove Theorem 8.2. Hint: Use the bounds in Theorem 8.1 and the continuity of Λ_X.
8.2 Hypothesis Testing for Geometric Distributions. Let p_θ(y) = (θ − 1) y^{−θ} for θ > 1 and y ≥ 1.
(a) Derive an expression for the Chernoff divergence of order u ∈ (0, 1) between p_α and p_β, valid for all α, β > 1.
(b) Fix α > β and consider the binary hypothesis test with P{H_0} = 0.5:

H_0: Y_i ∼ i.i.d. p_α  versus  H_1: Y_i ∼ i.i.d. p_β,  1 ≤ i ≤ n.

Give the LRT in its simplest form.
(c) Show that the Bayes error probability for the problem of part (b) vanishes exponentially with n, and give the error exponent.

8.3 Hypothesis Testing for Bernoulli Random Variables. Let N_i, 1 ≤ i ≤ n, be i.i.d. Bernoulli random variables with P{N = 1} = θ ≤ 1/2. Consider the following binary hypothesis test:
H_0: Y_i = N_i  versus  H_1: Y_i = 1 − N_i,  1 ≤ i ≤ n.

(a) Write the LRT.
(b) Give a large-deviations upper bound on the Bayes error probability of the test, assuming a uniform prior.
(c) Give a good approximation to the threshold of the NP test at ﬁxed signiﬁcance level α.
(d) Give a good approximation to the threshold of the NP test at signiﬁcance level exp{−n^{1/3}}.

8.4 Large Deviations for Poisson Random Variables. Denote by P_λ the Poisson distribution with parameter λ > 0. The cumulant-generating function for P_λ is κ(u) = λ(e^u − 1), u ∈ R.

(a) Derive the large-deviations function for P_λ.
(b) Let Y_i, 1 ≤ i ≤ n be i.i.d. P_λ. Let S_n = ∑_{i=1}^n Y_i. Derive the large-deviations bound on P{S_n ≥ neλ}.
(c) Derive an exact asymptotic expression for P{S_n ≥ neλ}.

8.5 Error Exponent for Minimax Hypothesis Testing. Prove that the error exponent for binary minimax hypothesis testing with i.i.d. observations is the same as that for Bayesian hypothesis testing, and is given by the Chernoff information C(p_0, p_1).

8.6 Optical Communications. The following problem is encountered in optical communications. A device counts N photons during T seconds and decides whether the observations are due to background noise or to signal plus noise. The hypothesis test is of the form
H_0: N ∼ Poisson(T)  versus  H_1: N ∼ Poisson(eT).
We would like to design the detector to minimize the probability of miss P_M subject to an upper bound P_F ≤ e^{−αT} on the probability of false alarm. Here α > 0 is a trade-off parameter. When T is large, we can use large-deviations bounds to approximate the optimal test. Derive a large-deviations bound on P_M, and determine the range of α in which your bound is useful. Sketch the corresponding curve −(1/T) ln P_M versus −(1/T) ln P_F and identify its endpoints.

8.7 Hypothesis Testing for Poisson Random Variables. Denote by P_λ the Poisson distribution with parameter λ > 0. Consider the binary hypothesis test
H_0: Y_i ∼ indep. P_{1/ln i}  versus  H_1: Y_i ∼ indep. P_{e/ln i},  2 ≤ i ≤ n.
(a) Assume equal priors on the hypotheses. Derive the LRT.
(b) Derive a tight upper bound on the probability of false alarm.

Hint: A decreasing function x(t) is said to be slowly decreasing if

lim_{t→∞} x(at)/x(t) = 1  for all a > 0.
For such a function, the following asymptotic equality holds:

∑_{i=1}^n x(i) ∼ n x(n)  as n → ∞.
8.8 Signal Detection in Sparse Noise. The random variable Z is equal to 0 with probability 1 − ε, and to a N(0, 1) random variable with probability ε. Given a length-n i.i.d. sequence Z generated from this distribution and a known signal s, consider the binary hypothesis test

H_0: Y = −s + Z  versus  H_1: Y = s + Z.
Derive an upper bound on the probability of error for that test; the bound should be tight in the exponent as n → ∞.

8.9 Hypothesis Testing for Exponential Distributions. Let X_i, 1 ≤ i ≤ n, be i.i.d. random variables with unit exponential distribution.

(a) Use a large-deviations method to derive an upper bound on P{∑_{i=1}^n X_i > nα}, where α > 1.
(b) Use the Central Limit Theorem to approximate P{∑_{i=1}^n X_i > n + n^α}, where 1/2 ≤ α < 1.
(c) Use the large-deviations method of part (a) to derive an upper bound on the tail probability of part (b). You may use asymptotic approximations, e.g., ln(1 + ε) ∼ ε − ε²/2 as ε → 0. Compare your result with that of part (b).

8.10 Error Exponents for Hypothesis Testing with Gaussian Observations (I). Consider the detection problem where the observations {Y_k, k ≥ 1} are i.i.d. N(μ_j, σ²) random variables under H_j, j = 0, 1.

(a) Evaluate the cgf κ_0(u).
(b) Use κ_0(u) to ﬁnd D(p_0‖p_1) and D(p_1‖p_0).
(c) Find the rate functions Λ_0(τ) and Λ_1(τ).
(d) Find the Chernoff information C(p_0, p_1).
(e) Solve the Hoeffding problem when the constraint on the false-alarm exponent is equal to γ, for γ ∈ (0, D(p_1‖p_0)).
8.11 Error Exponents for Hypothesis Testing with Gaussian Observations (II). Repeat Exercise 8.10 for the case where the observations {Y_k, k ≥ 1} are i.i.d. N(0, σ_j²) random variables under H_j, j = 0, 1. Hint: This problem can be reduced to the one studied in Example 8.1. In particular,

κ_0(u) = ((1 − u)/2) ln(σ_1²/σ_0²) − (1/2) ln( 1 + (1 − u)(σ_1²/σ_0² − 1) ).
8.12 Lower Bound on Large-Deviations Probability (I). Here we derive a version of Theorem 8.4 which does not assume ﬁniteness of the absolute third moment of X. Start from the Esscher identity (8.51) and show that

lim_{n→∞} (1/√n) [ ln P{∑_{i=1}^n X_i ≥ na} + nΛ_X(a) ] = 0,

or equivalently, P{∑_{i=1}^n X_i ≥ na} = exp{−nΛ_X(a) + o(√n)}.

Hint: The integral I_n of (8.52) is upper bounded by 1, and lower bounded by ∫_0^ε e^{−u*√(nκ″(u*)) v} dF_n(v) for any ε > 0. Invoke the Central Limit Theorem to claim that lim_{n→∞} [F_n(ε) − F_n(0)] = Φ(ε) − Φ(0). Note that while ε can be as small as desired, ε may not depend on n in this analysis.
8.13 Lower Bound on Large-Deviations Probability (II). Here we derive a version of Theorem 8.4 which does not assume ﬁniteness of Var(X). Start from the Esscher identity (8.51) and show that

lim_{n→∞} (1/n) [ ln P{∑_{i=1}^n X_i ≥ na} + nΛ_X(a) ] = 0,

or equivalently, P{∑_{i=1}^n X_i ≥ na} = exp{−nΛ_X(a) + o(n)}.

Hint: Write I_n of (8.52) as

I_n = E_{p̃}[ e^{−u*(∑_i X_i − na)} 1{∑_i X_i ≥ na} ],

which is upper bounded by 1 and lower bounded by

E_{p̃}[ e^{−u*(∑_i X_i − na)} 1{na ≤ ∑_i X_i ≤ n(a + ε)} ]

for any ε > 0. Invoke the Weak Law of Large Numbers to claim that there exists δ > 0 such that lim_{n→∞} P̃{na ≤ ∑_i X_i ≤ n(a + ε)} > δ. Note that while ε and δ can be as small as desired, they may not depend on n in this analysis.

8.14 Bernoulli Random Variables. Denote by Be(θ) the Bernoulli distribution with success probability θ.
(a) Derive the cumulant-generating function for Be(θ).
(b) Show that the large-deviations function for Be(θ) is Λ(a) = d(a‖θ), a ∈ [0, 1].
(c) Use reﬁned large deviations to give a precise approximation to P{S < 10e/(e + 1)}, where S is the sum of 10 i.i.d. Be(e/(e + 1)) random variables.
References
[1] H. Chernoff, “Large Sample Theory: Parametric Case,” Ann. Math. Stat., Vol. 27, pp. 1–22, 1956.
[2] H. Chernoff, “A Measure of Asymptotic Efﬁciency for Tests of a Hypothesis Based on a Sum of Observations,” Ann. Math. Stat., Vol. 23, pp. 493–507, 1952.
[3] A. Dembo and O. Zeitouni, Large Deviations Techniques and Applications, Springer, 1998.
[4] W. Hoeffding, “Asymptotically Optimal Tests for Multinomial Distributions,” Ann. Math. Stat., Vol. 36, No. 2, pp. 369–401, 1965.
[5] F. Esscher, “On the Probability Function in the Collective Theory of Risk,” Skandinavisk Aktuarietidskrift, Vol. 15, pp. 175–195, 1932.
[6] D. Blackwell and J. L. Hodges, “The Probability in the Extreme Tail of a Convolution,” Ann. Math. Stat., Vol. 30, pp. 1113–1120, 1959.
[7] R. Bahadur and R. Ranga Rao, “On Deviations of the Sample Mean,” Ann. Math. Stat., Vol. 31, pp. 1015–1027, 1960.
[8] W. Feller, An Introduction to Probability Theory and Its Applications, Wiley, 1971.
[9] N. R. Chaganty and J. Sethuraman, “Strong Large Deviation and Local Limit Theorems,” Ann. Prob., Vol. 21, No. 3, pp. 1671–1690, 1993.
9
Sequential and Quickest Change Detection
In this chapter, we continue our study of the problem of hypothesis testing with multiple observations, but in a sequential setting where we are allowed to choose when to stop taking observations before making a decision. We ﬁrst study the sequential binary hypothesis testing problem, introduced by Wald [1], where the goal is to choose a stopping time at which a binary decision about the hypothesis is made. We then study the quickest change detection problem, where the observations undergo a change in distribution at some unknown time and the goal is to detect the change as quickly as possible, subject to false-alarm constraints. More details on the sequential detection problem are provided in [2, 3], and on the quickest change detection problem in [3, 4, 5].
9.1
Sequential Detection
In the detection problems we studied in Chapter 5, we are given a sample of n observations with which we need to make a decision about the hypothesis. The size of the sample n is assumed to be ﬁxed beforehand, i.e., before the observations are taken, and therefore this setting is referred to as the ﬁxed sample size (FSS) setting. In contrast, in the problems that we study in this chapter, the sample size is determined online, i.e., the observations come from a potentially inﬁnite stream of observations and we are allowed to choose when to stop taking observations based on a desired level of accuracy for decisionmaking about the hypothesis.
9.1.1
Problem Formulation
We will restrict our attention to binary hypothesis testing and i.i.d. observation models here. In particular we have:

H_0: Y_k ∼ i.i.d. p_0  versus  H_1: Y_k ∼ i.i.d. p_1,  k = 1, 2, . . . .    (9.1)
A sequential test has two components: (i) a stopping rule, which is used to determine when to stop taking additional observations and make a decision, and (ii) a decision rule , which is used to make a decision about the hypothesis using the observations up to the stopping time. There are two goals in sequential detection: (i) to make an accurate decision regarding the hypothesis (which might require a large number of observations);
and (ii) to use as few observations as possible. These are conﬂicting goals and hence the optimization problem we need to solve is a trade-off problem.

9.1.2
Stopping Times and Decision Rules
The number of observations taken by a sequential test is a random variable called a stopping time, which is deﬁned formally as follows.

DEFINITION 9.1 A stopping time on a sequence of random variables Z_1, Z_2, . . . is a random variable N such that {N = n} ∈ σ(Z_1, . . . , Z_n), where σ(Z_1, . . . , Z_n) denotes the σ-algebra generated by the random variables Z_1, . . . , Z_n. Equivalently, 1{N = n} is a function of only the observations Z_1, . . . , Z_n, i.e., it is a causal function of the observations.
A sequential test stops taking observations and makes a decision about the hypothesis at a stopping time N on the observation sequence Y_1, Y_2, . . . . We have the following deﬁnition.

DEFINITION 9.2 A sequential test is a pair¹ (N; δ), such that at the stopping time N on the observation sequence, the decision rule δ is used to make a decision about the hypothesis based on Y_1, . . . , Y_N.
9.1.3
Two Formulations of the Sequential Hypothesis Testing Problem
As in the case of ﬁxed sample size hypothesis testing, there are Bayesian and non-Bayesian approaches to sequential hypothesis testing. We begin with a Bayesian approach, for which we assume:

• uniform decision costs;²
• known priors on the hypotheses π_0, π_1;
• cost per observation c (to penalize the use of observations).

We may then associate a risk with the sequential test (N; δ).
DEFINITION 9.3 The Bayes risk associated with a sequential test (N; δ) is given by

r(N; δ) = c [π_0 E_0[N] + π_1 E_1[N]] + [π_0 P_F(N; δ) + π_1 P_M(N; δ)],    (9.2)

where the ﬁrst term is the average penalty for the stopping time N, the second is the average probability of error, and

P_F(N; δ) = P_0{δ(Y_1, . . . , Y_N) = 1},    (9.3)
P_M(N; δ) = P_1{δ(Y_1, . . . , Y_N) = 0}.    (9.4)

¹ Including the stopping time N (which is a random variable) in the deﬁnition of the test is slightly inconsistent with the way in which we have deﬁned decision rules in the previous chapters. We could have deﬁned a stopping rule η, which chooses the stopping time N, and use η in place of N in deﬁning the sequential test. However, such a stopping rule η and N would be equivalent. We therefore adopt the more standard way to deﬁne the sequential test using N, as introduced by Wald [1].
² This can easily be generalized to nonuniform decision costs.
A Bayesian sequential test is one that minimizes the risk in (9.2), i.e.,
(N; δ)_B = argmin_{(N;δ)} r(N; δ).    (9.5)
An alternative to the Bayesian formulation is a Neyman–Pearson-like non-Bayesian formulation. Here no priors are assumed for the hypotheses. Furthermore, we do not have any explicit costs on the decisions or the observations. In particular, the non-Bayesian formulation of the sequential hypothesis testing problem is

minimize E_0[N] and E_1[N]
subject to P_F(N; δ) ≤ α and P_M(N; δ) ≤ β,    (9.6)

with α, β ∈ (0, 1). Note that we require a simultaneous minimization of E_0[N] and E_1[N] under constraints on both of the error probabilities. One of the most celebrated results in sequential analysis, due to Wald and Wolfowitz [6], is that optimization problem (9.6) has a solution whose structure is the same as that of the solution to the Bayesian optimization problem given in (9.5). We discuss this test structure next.
9.1.4
Sequential Probability Ratio Test
The log-likelihood ratio of the observations Y_1, . . . , Y_n is given by the sum

S_n = ∑_{k=1}^n ln L(Y_k),    (9.7)

where as usual L(y) = p_1(y)/p_0(y). Let a < 0 < b. The stopping time and decision rule of the sequential probability ratio test (SPRT) are given by:
N_SPRT = inf{n ≥ 1 : S_n ∉ (a, b)}    (9.8)

and

δ_SPRT = 1 if S_{N_SPRT} ≥ b;  δ_SPRT = 0 if S_{N_SPRT} ≤ a.    (9.9)
The SPRT (N_SPRT; δ_SPRT) can be summarized as:

• stop and decide H_1 if S_n ≥ b;
• stop and decide H_0 if S_n ≤ a;
• continue observing if a < S_n < b.
Typical sample paths of the Sn , along with the decision boundaries are depicted in Figure 9.1. As mentioned previously, the SPRT test structure can be shown to be optimal for both the Bayesian and nonBayesian optimization problems introduced in Section 9.1.3 (see [1] and [6]). Therefore, an optimal solution to (9.5) is obtained by ﬁnding thresholds a and b that minimize the Bayes risk (9.2), and an optimal to solution to (9.6) is obtained by ﬁnding a and b that satisfy the constraints on the error probabilities in (9.6) with equality. (Note that randomization might be required for the latter problem.)
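The stopping and decision rules above can be sketched in a few lines of code. This is a minimal illustration, not from the text; the function name `sprt` and the N(0,1)-versus-N(1,1) example, for which the LLR increment is ln L(y) = y − 1/2, are my own choices.

```python
import math, random

def sprt(stream, llr, a, b):
    """Run the SPRT: accumulate S_n (sum of log-likelihood ratio increments)
    until it leaves (a, b); return (decision, number of samples used)."""
    s, n = 0.0, 0
    for y in stream:
        n += 1
        s += llr(y)
        if s >= b:
            return 1, n        # stop and decide H1
        if s <= a:
            return 0, n        # stop and decide H0
    raise RuntimeError("stream exhausted before a decision")

# Example: p0 = N(0,1), p1 = N(1,1), so llr(y) = y - 1/2.
random.seed(0)
def h0_stream():
    while True:
        yield random.gauss(0.0, 1.0)   # data generated under H0

decision, n_used = sprt(h0_stream(), lambda y: y - 0.5, a=-5.0, b=5.0)
print(decision, n_used)
```

With thresholds ±5, Wald's bounds (derived below) give error probabilities of at most e^{−5} ≈ 0.007, so under H0 the returned decision is almost always 0.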
Figure 9.1 Typical evolution of the SPRT statistic S_n with p_0 ∼ N(0, 1) and p_1 ∼ N(1, 1): (a) H_0 is the true hypothesis, (b) H_1 is the true hypothesis
We now provide some intuition as to why the SPRT structure provides a good trade-off between decision errors and sample size. The log-likelihood ratio sequence S_n that deﬁnes the SPRT is a random walk with components ln L(Y_k) that have the following means under H_0 and H_1, respectively:

E_0[ln L(Y_k)] = E_0[ ln(p_1(Y_k)/p_0(Y_k)) ] = −D(p_0‖p_1) < 0    (9.10)

and

E_1[ln L(Y_k)] = E_1[ ln(p_1(Y_k)/p_0(Y_k)) ] = D(p_1‖p_0) > 0.    (9.11)
This means that the random walk Sn has a positive drift under H1 and a negative drift under H0 . Therefore, under H0, Sn is more likely to cross the lower threshold a than the
upper threshold b, and the opposite is true under H1 . Furthermore, the decision accuracy can be improved by making the threshold b larger and the threshold a smaller.
Example 9.1 Gaussian Observations. Consider testing the hypotheses:

H_0: Y_1, Y_2, . . . i.i.d. ∼ N(μ_0, σ²)  versus  H_1: Y_1, Y_2, . . . i.i.d. ∼ N(μ_1, σ²),

with μ_1 > μ_0. In this case,

S_n = ((μ_1 − μ_0)/σ²) ∑_{k=1}^n Y_k − n (μ_1² − μ_0²)/(2σ²).

It is easy to see that S_n ∈ (a, b) is equivalent to ∑_{k=1}^n Y_k ∈ (a_n′, b_n′), where

a_n′ = σ²a/(μ_1 − μ_0) + n(μ_1 + μ_0)/2,
b_n′ = σ²b/(μ_1 − μ_0) + n(μ_1 + μ_0)/2.

Therefore, we can write our SPRT as:

• stop and decide H_1 if ∑_{k=1}^n Y_k ≥ b_n′;
• stop and decide H_0 if ∑_{k=1}^n Y_k ≤ a_n′;
• continue observing otherwise.

9.1.5
SPRT Performance Evaluation
The performance of the SPRT is characterized by values of P M , P F , E 0[ N ], E1 [ N ] achieved by the SPRT. An exact analysis of these quantities is impossible except in special cases, and therefore we rely on bounds and approximations. We begin by studying Wald’s bounds and approximations for the error probabilities P F and PM . The likelihood ratio of Y 1, . . . , Yn can be written as
X_n ≜ e^{S_n} = ∏_{k=1}^n L(Y_k) = ∏_{k=1}^n p_1(Y_k)/p_0(Y_k).
Error Probability Bounds. For the SPRT,

P_F = P_0{S_N ≥ b} = P_0{X_N ≥ e^b},

and the event {X_N ≥ e^b} can be rewritten as the following union of disjoint events:

{X_N ≥ e^b} = ∪_{n=1}^∞ ( {N = n} ∩ {X_n ≥ e^b} ).
Thus

P_0{X_N ≥ e^b} = ∑_{n=1}^∞ P_0({N = n} ∩ {X_n ≥ e^b})
  = ∑_{n=1}^∞ ∫_{{N=n}∩{X_n ≥ e^b}} ∏_{k=1}^n p_0(y_k) dμ(y_1) dμ(y_2) · · · dμ(y_n)
  = ∑_{n=1}^∞ ∫_{{N=n}∩{X_n ≥ e^b}} (1/X_n) ∏_{k=1}^n p_1(y_k) dμ(y_1) dμ(y_2) · · · dμ(y_n)
  ≤ e^{−b} ∑_{n=1}^∞ ∫_{{N=n}∩{X_n ≥ e^b}} ∏_{k=1}^n p_1(y_k) dμ(y_1) dμ(y_2) · · · dμ(y_n)
  = e^{−b} P_1{X_N ≥ e^b} = e^{−b}(1 − P_M).
n=1
That is, PF
≤ e−b (1 − PM ) ≤ e−b .
(9.12)
An analogous derivation for PM noting that PM yields PM
= P1 {X N ≤ ea }
≤ ea (1 − PF ) ≤ ea .
(9.13)
Together, (9.12) and (9.13) are known as Wald’s bounds on the error probabilities. Now, we consider setting the thresholds a and b to meet constraints PF ≤ α and PM ≤ β . An obvious way to do this is to set b = − ln α and a = ln β. An alternative suggested by Wald is to treat (9.12) and (9.13) as approximations, with the understanding that if the decision is in favor of H1, X n ≈ eb , and if the decision is in favor of H0 , X n ≈ ea , i.e.,
P_F ≈ e^{−b}(1 − P_M)  and  P_M ≈ e^a(1 − P_F).

Then setting P_F = α and P_M = β yields the following approximations for the thresholds:

a = ln( β/(1 − α) )  and  b = ln( (1 − β)/α ).
Although this choice of thresholds does not necessarily result in error probabilities that meet the constraints P_F ≤ α and P_M ≤ β, we can show using (9.12) and (9.13) that (see Exercise 9.1):

P_F + P_M ≤ α + β.    (9.14)
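The threshold-setting rule above is easy to package; the following is a minimal sketch (the function name is my own) computing Wald's approximate thresholds a = ln(β/(1 − α)) and b = ln((1 − β)/α):

```python
import math

def wald_thresholds(alpha, beta):
    """Wald's approximate SPRT thresholds for target error probabilities
    PF = alpha and PM = beta (equality assumed in the Wald bounds)."""
    a = math.log(beta / (1.0 - alpha))    # lower threshold: stop and decide H0
    b = math.log((1.0 - beta) / alpha)    # upper threshold: stop and decide H1
    return a, b

a, b = wald_thresholds(alpha=0.01, beta=0.01)
print(a, b)   # symmetric targets give a = -b = -ln(99)
```

By (9.14), tests using these thresholds satisfy P_F + P_M ≤ α + β even though each individual constraint may be slightly exceeded.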
Thus, setting thresholds based on assuming equality in the Wald bounds guarantees that the sum of the false-alarm and missed-detection probabilities is constrained by the sum of their individual constraints. It is important to note that we did not assume anything about the distributions of the observations when deriving the bounds. Nevertheless, the SPRT itself is a function of the distributions through the log-likelihood ratio statistic S_n. More accurate approximations for the error probabilities, derived using renewal theory, do indeed depend on the distributions P_0 and P_1 [2].
Expected Stopping Time Approximations. In order to calculate approximations for E_0[N] and E_1[N], we will need a few preliminary lemmas.

LEMMA 9.1 If Z is a nonnegative integer valued random variable, then

E[Z] = ∑_{k=0}^∞ P{Z > k}.

(If Z is a nonnegative continuous random variable, we replace the sum with an integral.)

Proof

E[Z] = ∑_{k=0}^∞ k P{Z = k} = ∑_{k=1}^∞ k P{Z = k}
  = ∑_{k=1}^∞ k (P{Z ≥ k} − P{Z ≥ k + 1})
  = ∑_{k=1}^∞ P{Z ≥ k} = ∑_{k=0}^∞ P{Z > k}.
LEMMA
Proof For k
≥ 0, using Markov’s inequality, P0 {N > k } ≤ P0 { X k ∈ (ea , eb )} ≤ P0 { Xk > ea } Â √ = P 0{ X k > ea } √ E0[ X k ] ≤ √ ea
= √ρ a , k
e
Ã √ where ρ = Y p0 (y ) p1( y )d µ( y ) is the Bhattacharyya coefﬁcient between P0 and P1 (see also (6.14)). Since ρ < 1, whenever P 0 ³ = P1 , P0 {N > k} goes to zero exponentially fast with k , thus establishing that N is ﬁnite with probability one under H0. A similar argument with b replacing the role of a establishes the corresponding result under H1. Now, we apply Lemma 9.1 to obtain that for integer m ≥ 1,
E0 [ N ] = m
∞ ¶ k =0
P 0{ N
m
> k} =
∞ ¶
k =0
P0 {N
>k
1/ m
1
}≤ √
ea
∞ ¶ k =0
ρ k / < ∞, 1 m
since ρ < 1. A similar argument with b replacing the role of a establishes the corresponding result under H1.
9.1
Sequential Detection
215
L E M M A 9.3 Wald’s Identity. Let Z 1 , Z 2 , . . . be i.i.d with mean µ . Let N be a stopping time with respect to Z 1, Z 2 , . . . such that N is ﬁnite w.p. 1. Then
E
Ä¶ Å N k =1
Zk
= E [ N ]µ.
Proof (Note: Since N is a function of the sequence {Z k }, this statement is not obvious. If N were independent of {Z k }, then a simple application of Law of Iterated Expectation after ﬁrst conditioning on N would lead to the result.) We begin by noting that
1 {N
≥ k } = 1 − 1{ N ≤ k − 1}.
By the deﬁnition of a stopping time (see Deﬁnition 9.1), the indicator function on the right hand side is a function of only Z 1 , . . . , Z k −1 , and is thus independent of Z k . Therefore, we get
E
Ä¶ Å N k=1
Zk
=E =
Ä¶ ∞
∞ ¶ k=1
k =1
Å ¶ ∞ Z k 1 {k ≤ N } = E [ Z k ] E [1{k ≤ N }] k =1
µP {N ≥ k } = µE[ N ],
where the last step follows from Lemma 9.1. We now apply Lemmas 9.2 and 9.3 to the analysis of the expected stopping time of the SPRT. At the stopping time N , the test statistic SN has just crossed the threshold b from below or just crossed the threshold a from above. Therefore, S N ≈ a or SN ≈ b. Then, under H0 ,
$$S_N \approx \begin{cases} a & \text{w.p. } 1 - P_F \\ b & \text{w.p. } P_F \end{cases}$$
and therefore
$$E_0[S_N] \approx a(1 - P_F) + b\, P_F. \tag{9.15}$$
Similarly, under H1,
$$S_N \approx \begin{cases} a & \text{w.p. } P_M \\ b & \text{w.p. } 1 - P_M \end{cases}$$
and therefore
$$E_1[S_N] \approx a\, P_M + b(1 - P_M). \tag{9.16}$$
Now recall that
$$S_N = \sum_{k=1}^{N} \ln L(Y_k).$$
Since the stopping time of the SPRT N is ﬁnite w.p. 1 under both H0 and H1 , we can apply Wald’s identity to SN to obtain
$$E_0[S_N] = E_0[N]\, E_0[\ln L(Y_k)] = -E_0[N]\, D(p_0 \| p_1)$$
and
$$E_1[S_N] = E_1[N]\, E_1[\ln L(Y_k)] = E_1[N]\, D(p_1 \| p_0).$$
Plugging these expressions into (9.15) and (9.16), respectively, we get the Wald approximations for the expected stopping time:
$$E_0[N] \approx \frac{a(1 - P_F) + b\, P_F}{-D(p_0 \| p_1)}, \tag{9.17}$$
$$E_1[N] \approx \frac{a\, P_M + b(1 - P_M)}{D(p_1 \| p_0)}. \tag{9.18}$$
Example 9.2  Comparison of Fixed Sample Size Hypothesis Test with SPRT. Consider the hypotheses
$$H_0 : Y_1, Y_2, \ldots \ \text{i.i.d.} \sim \mathcal{N}(0, \sigma^2),$$
$$H_1 : Y_1, Y_2, \ldots \ \text{i.i.d.} \sim \mathcal{N}(\mu, \sigma^2),$$
where µ > 0. Our goal is to be efficient in terms of the number of observations, while meeting the error probability constraints P_F ≤ α and P_M ≤ β. First, we consider the fixed sample size (FSS) case. Here we design a sample size n before we take observations such that the constraints are met. By the optimality of the Neyman–Pearson test, it is clear that for any fixed n, an α-level Neyman–Pearson test results in the smallest P_M while meeting the constraint on P_F. Therefore the most efficient FSS test that meets the constraints on P_M and P_F must be an α-level Neyman–Pearson test, for which P_M for a fixed n can be computed using (5.34):
$$P_M = 1 - P_D = 1 - Q\left(Q^{-1}(\alpha) - d\right),$$
where (using $\mathbf{1}$ to denote the vector of all ones of length n)
$$d^2 = \mu \mathbf{1}^T (\sigma^2 \mathbf{I})^{-1} \mu \mathbf{1} = \frac{\mu^2 n}{\sigma^2}.$$
Therefore
$$P_M = 1 - Q\left(Q^{-1}(\alpha) - \frac{\mu \sqrt{n}}{\sigma}\right).$$
Thus, to meet the P_M ≤ β constraint, we pick the sample size
$$n_{\text{FSS}} = \left\lceil \left(\frac{\sigma}{\mu}\left(Q^{-1}(1-\beta) - Q^{-1}(\alpha)\right)\right)^{2} \right\rceil.$$
For the sequential test, using the Wald approximations we set
$$b = \ln\left(\frac{1-\beta}{\alpha}\right), \qquad a = \ln\left(\frac{\beta}{1-\alpha}\right) \tag{9.19}$$
and compute the expected sample sizes using Wald's approximations:
$$E_0[N] \approx \frac{(1 - P_F)a + P_F\, b}{-D(p_0 \| p_1)} \approx \frac{(1-\alpha)\ln\frac{\beta}{1-\alpha} + \alpha\ln\frac{1-\beta}{\alpha}}{-D(p_0 \| p_1)},$$
$$E_1[N] \approx \frac{P_M\, a + (1 - P_M)b}{D(p_1 \| p_0)} \approx \frac{\beta\ln\frac{\beta}{1-\alpha} + (1-\beta)\ln\frac{1-\beta}{\alpha}}{D(p_1 \| p_0)},$$
with
$$D(p_1 \| p_0) = D(p_0 \| p_1) = \frac{\mu^2}{2\sigma^2}.$$
If α = β = 0.01 and σ/µ = 1, then we can compute n_FSS = 22 and E0[N] ≈ E1[N] ≈ 9. Therefore the SPRT takes fewer than half the number of observations of the FSS test on average.
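The quoted values n_FSS = 22 and E[N] ≈ 9 can be reproduced in a few lines (a sketch taking α = β = 0.01 and σ/µ = 1, so that D(p1‖p0) = 1/2; the quantile function comes from Python's standard statistics module):

```python
# Reproduce the numbers of Example 9.2: n_FSS and the Wald approximation
# of E1[N], for alpha = beta = 0.01 and sigma/mu = 1.
from math import ceil, log
from statistics import NormalDist

alpha = beta = 0.01
Qinv = lambda x: NormalDist().inv_cdf(1 - x)  # inverse of the Q-function

# FSS: deflection d = mu*sqrt(n)/sigma must satisfy d >= Qinv(alpha) + Qinv(beta).
n_fss = ceil((Qinv(alpha) + Qinv(beta)) ** 2)

# SPRT thresholds (9.19) and Wald approximation (9.18) with D(p1||p0) = 1/2.
a, b = log(beta / (1 - alpha)), log((1 - beta) / alpha)
E1N = (beta * a + (1 - beta) * b) / 0.5

print(n_fss, round(E1N))  # 22 9
```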
While the average number of observations taken is lower for the sequential test, there is no guarantee that the sequential test will take fewer observations for every realization of the observation sequence. Therefore, in practice, truncated sequential procedures are used, where the test stops at min{N, k} for some fixed integer k, so that at most k observations are used. The performance analysis of truncated sequential procedures is considerably more difficult than for the SPRT, but with a sufficiently large truncation parameter k, one can continue to use Wald's approximations for the performance.
9.2
Quickest Change Detection
In the quickest change detection (QCD) problem, we have a sequence of observations
{Y_k, k ≥ 1}, which initially have distribution P0. At some point of time λ, due to some event, the distribution of the random variables changes to P1. We restrict our attention to models where the observations are i.i.d. in the pre- and post-change regimes. Thus
$$Y_k \sim \text{i.i.d. } p_0 \ \text{for } k < \lambda, \qquad Y_k \sim \text{i.i.d. } p_1 \ \text{for } k \ge \lambda.$$
We assume that p0 and p1 are known but λ is unknown. We call λ the change point.
Our goal is to detect the change in distribution as quickly as possible (after it happens), subject to false-alarm constraints. To this end, a procedure for quickest change detection is constructed as a stopping time N (see Definition 9.1) on the observation sequence {Y_k}, with the understanding that at the stopping time it is declared that the change has happened. The notion of quick detection is then captured by the requirement that we wish to detect the change with the minimum possible delay, i.e., minimize some metric that captures the average of (N − λ)^+ (where we use the notation x^+ = max{x, 0}). Clearly, selecting N = 1 always minimizes this quantity; however, if λ > 1, we have a false-alarm event. In practice we want to avoid false alarms as much as possible. Thus, the QCD problem is to find a stopping time on {Y_k} that results in an optimal trade-off between detection delay and false-alarm rate.
Figure 9.2  Stochastic sequence with observations from p0 = N(0, 1) before the change (time slot 200), and with observations from p1 = N(0.1, 1) after the change.
Figure 9.3  Evolution of the CuSum algorithm when applied to the samples given in Figure 9.2. We see that the change is detected with a delay of roughly 400 time slots.
To motivate the need for quickest change detection algorithms, in Figure 9.2 we plot a sample path of a stochastic sequence whose samples are distributed as N(0, 1) before the change, and as N(0.1, 1) after the change. For illustration, we choose time slot 200 as the change point. As is evident from the figure, the change cannot be detected through manual inspection. In Figure 9.3, we plot the evolution of the CuSum test statistic (discussed in detail in Section 9.2.1), computed using the observations shown in Figure 9.2. As seen in Figure 9.3, the value of the test statistic stays close to zero before the change point, and starts to grow after the change point. The change is detected by using a threshold of 4. We also see from Figure 9.3 that it takes around 500 samples to detect the change after it occurs. Can we do better than that, at least on average? Clearly, declaring the change before the change point (time slot 200) will result in zero delay, but it will cause a false alarm. The theory of quickest change detection deals with finding algorithms that have provable optimality properties, in the sense of minimizing the average detection delay under a false-alarm constraint.
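A minimal sketch of the experiment behind Figures 9.2 and 9.3 (using the CuSum recursion (9.23) given in Section 9.2.1; the random seed and horizon are arbitrary choices):

```python
# Simulate the scenario of Figures 9.2-9.3: N(0,1) samples before the change
# at time 200, N(0.1,1) after, and run the CuSum recursion
# W_n = (W_{n-1} + ln L(Y_n))^+ with threshold b = 4.
import random

random.seed(1)
lam, mu, b = 200, 0.1, 4.0
llr = lambda y: mu * y - mu**2 / 2  # ln L(y) for unit-variance Gaussians

ys = [random.gauss(mu if n >= lam else 0.0, 1.0) for n in range(1, 20001)]

W, Ws = 0.0, []
for y in ys:
    W = max(W + llr(y), 0.0)
    Ws.append(W)

detect = next(n + 1 for n, w in enumerate(Ws) if w >= b)
print("change declared at time", detect)
```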
9.2.1

Minimax Quickest Change Detection
We begin by considering a non-Bayesian setting, where we do not assume a prior on the change point λ. We focus on first constructing good algorithms for this problem, and later comment on their optimality properties. A fundamental fact used in the construction of these algorithms is that the mean of the log-likelihood ratio ln L(Y_k) in the post-change regime, i.e., for k ≥ λ, equals D(p1‖p0) > 0, while its mean before the change, i.e., for k < λ, is −D(p0‖p1) < 0. A simple way to use this fact was first proposed by Shewhart [7], in which a statistic based on the current observation is compared with a threshold to make a decision about the change. That is, the Shewhart test is defined as
$$N_{\text{Shew}} = \inf\{n \ge 1 : \ln L(Y_n) > b\}.$$
Shewhart’s test is widely employed in practice due to its simplicity; however, signiﬁcant gain in performance can be achieved by making use of past observations (in addition to the current observation) to make the decision about the change. Page [8] proposed such an algorithm that uses past observations, which he called the Cumulative Sum (CuSum) algorithm. The idea behind the CuSum algorithm is based on the behavior of the cumulative loglikelihood ratio sequence:
$$S_n = \sum_{k=1}^{n} \ln L(Y_k).$$
Before the change occurs, the statistic S_n has a negative "drift," since E0[ln L(Y_k)] < 0 (see (9.10)), and diverges towards −∞. At the change point, the drift of S_n changes from negative to positive, since E1[ln L(Y_k)] > 0 (see (9.11)), and beyond the change point S_n starts growing towards ∞. Therefore S_n roughly attains a minimum at the change point (see Figure 9.4). The CuSum algorithm is constructed to detect this change in drift, and to stop the first time the growth of S_n after the change in drift is large enough (to avoid false alarms). Specifically, the stopping time for the CuSum algorithm is defined as
$$N_{\text{C}} = \inf\{n \ge 1 : W_n \ge b\}, \tag{9.20}$$
Figure 9.4  Sample paths of the statistics S_n and W_n before the change: S_n drifts downward with slope E0[ln L(Y_k)] while W_n remains close to zero.
220
Sequential and Quickest Change Detection
where
$$W_n = S_n - \min_{0 \le k \le n} S_k = \max_{0 \le k \le n} (S_n - S_k) \tag{9.21}$$
$$= \max_{0 \le k \le n} \sum_{\ell=k+1}^{n} \ln L(Y_\ell) = \max_{1 \le k \le n+1} \sum_{\ell=k}^{n} \ln L(Y_\ell), \tag{9.22}$$
with the understanding that $\sum_{\ell=n+1}^{n} \ln L(Y_\ell) = 0$. It is easily shown that W_n can be computed iteratively as follows (see Exercise 9.6):
$$W_n = (W_{n-1} + \ln L(Y_n))^+, \qquad W_0 = 0. \tag{9.23}$$
This recursion is useful from the viewpoint of implementing the CuSum test in practice. The CuSum test also has a generalized likelihood ratio test (GLRT) interpretation. We can consider the quickest change detection (QCD) problem as a dynamic decision making problem, where at each time n , we are faced with a composite binary hypothesis testing problem:
$$H_0^{(n)} : \lambda > n, \qquad H_1^{(n)} : \lambda \le n.$$
The joint distribution of the observations under $H_0^{(n)}$ is given by
$$\prod_{k=1}^{n} p_0(y_k)$$
and the joint distribution under $H_1^{(n)}$ is given by
$$p_\lambda(y_1, y_2, \ldots, y_n) = \prod_{k=1}^{\lambda-1} p_0(y_k) \prod_{k=\lambda}^{n} p_1(y_k), \qquad \lambda = 1, \ldots, n,$$
with the understanding that an empty product equals 1. Then, the GLRT statistic L_G computed at time n is equal to
$$L_G(y_1, \ldots, y_n) = \frac{\max_{1 \le \lambda \le n} p_\lambda(y_1, y_2, \ldots, y_n)}{\prod_{k=1}^{n} p_0(y_k)} = \max_{1 \le \lambda \le n} \prod_{k=\lambda}^{n} L(y_k). \tag{9.24}$$
Define the statistic
$$C_n = \ln L_G(Y_1, \ldots, Y_n) = \max_{1 \le k \le n} \sum_{\ell=k}^{n} \ln L(Y_\ell). \tag{9.25}$$
This statistic is closely related to the statistic W_n given in (9.22), except that the maximization does not include k = n + 1. This means that, unlike W_n, the statistic C_n can take on negative values. In particular, it is easy to show that C_n can be computed iteratively as (see Exercise 9.6)
$$C_n = (C_{n-1})^+ + \ln L(Y_n), \qquad C_0 = 0. \tag{9.26}$$
9.2
Quickest Change Detection
221
Figure 9.5  Comparison of the test statistics W_n and C_n.
Nevertheless, it is easy to see that W_n and C_n will cross a positive threshold b at the same time (see Figure 9.5), and hence the CuSum algorithm can be equivalently defined in terms of C_n as
$$N_{\text{C}} = \inf\{n \ge 1 : C_n \ge b\}. \tag{9.27}$$
The GLRT interpretation of the CuSum algorithm has a close connection with another popular algorithm in the literature, called the Shiryaev–Roberts (SR) algorithm. In the SR algorithm, the maximum in (9.25) is replaced by a sum . Speciﬁcally, let
$$T_n = \sum_{\lambda=1}^{n} \prod_{k=\lambda}^{n} L(Y_k).$$
Then, the SR stopping time is defined as
$$N_{\text{SR}} = \inf\{n \ge 1 : T_n > b\}. \tag{9.28}$$
The SR statistic, too, can be computed iteratively:
$$T_n = (1 + T_{n-1}) L(Y_n), \qquad T_0 = 0.$$
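The recursion can be checked against the sum-of-products definition directly (a sketch with arbitrary positive likelihood-ratio values):

```python
# Verify the Shiryaev-Roberts recursion T_n = (1 + T_{n-1}) L(Y_n) against
# the definition T_n = sum_{k=1}^{n} prod_{l=k}^{n} L(Y_l).
import math
import random

random.seed(2)
Ls = [random.uniform(0.5, 2.0) for _ in range(100)]  # likelihood-ratio values

T = 0.0
for n, L in enumerate(Ls, start=1):
    T = (1 + T) * L                                      # recursion
    direct = sum(math.prod(Ls[k:n]) for k in range(n))   # definition
    assert abs(T - direct) <= 1e-9 * max(1.0, direct)

print("recursion matches the definition for all n up to", len(Ls))
```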
We now comment on the optimality properties of the CuSum algorithm and the SR algorithm. Without a prior on the change point, a reasonable measure of false alarms is the mean time to false alarm, or its reciprocal, the false-alarm rate (FAR):
$$\text{FAR}(N) = \frac{1}{E_\infty[N]}, \tag{9.29}$$
where E∞ is the expectation with respect to the probability measure under which the change never occurs. Finding a uniformly powerful test that minimizes the delay over all possible values of the change point λ subject to a FAR constraint is generally not possible. Therefore, it is more appropriate to study the quickest change detection problem in what is known as a minimax setting in this case. There are two important minimax problem formulations, one due to Lorden [9] and the other due to Pollak [10].
222
Sequential and Quickest Change Detection
In Lorden's formulation, the objective is to minimize the supremum of the average delay conditioned on the worst possible realizations, subject to a constraint on the false-alarm rate. In particular, we define the worst-case average detection delay (WADD):
$$\text{WADD}(N) = \sup_{\lambda \ge 1}\ \operatorname{ess\,sup}\ E_\lambda\!\left[(N - \lambda + 1)^+ \mid Y_1, \ldots, Y_{\lambda-1}\right], \tag{9.30}$$
where E_λ denotes the expectation with respect to the probability measure when the change occurs at time λ. We have the following problem formulation.

PROBLEM 9.1  Lorden. Minimize WADD(N) subject to FAR(N) ≤ α.
For the i.i.d. setting, Lorden showed that the CuSum algorithm ((9.20) or (9.27)) is asymptotically optimal for Problem 9.1 as α → 0. Specifically, the following result is proved in [9].

THEOREM 9.1  Setting b = |ln α| in (9.20) or (9.27) ensures that FAR(N_C) ≤ α, and as α → 0,
$$\inf_{N : \text{FAR}(N) \le \alpha} \text{WADD}(N) \sim \text{WADD}(N_{\text{C}}) \sim \frac{|\ln \alpha|}{D(p_1 \| p_0)},$$
where the "∼" notation is used to denote that the ratio of the quantities on the two sides of the "∼" approaches 1 in the limit as α → 0.

It was later shown in [11] that the CuSum algorithm is actually exactly optimal for Problem 9.1. Although the CuSum algorithm enjoys such a strong optimality property under Lorden's formulation, it can be argued that WADD is a somewhat pessimistic measure of delay. A less pessimistic way to measure the delay, the conditional average detection delay (CADD), was suggested by Pollak [10]:
$$\text{CADD}(N) = \sup_{\lambda \ge 1} E_\lambda[N - \lambda \mid N \ge \lambda] \tag{9.31}$$
for all stopping times N for which the expectation is well defined. It can be shown that
$$\text{CADD}(N) \le \text{WADD}(N). \tag{9.32}$$
We then have the following problem formulation.

PROBLEM 9.2  Pollak. Minimize CADD(N) subject to FAR(N) ≤ α.
It was shown by Lai [12] that:

THEOREM 9.2  As α → 0,
$$\inf_{N : \text{FAR}(N) \le \alpha} \text{CADD}(N) \ge \frac{|\ln \alpha|}{D(p_1 \| p_0)} (1 + o(1)). \tag{9.33}$$
From Theorem 9.2 and (9.32), we have the following result:
9.2
Quickest Change Detection
223
Figure 9.6  Performance of the CuSum algorithm with p0 = N(0, 1), p1 = N(0.75, 1), and D(p1‖p0) = 0.2812.

COROLLARY 9.1  As α → 0,
$$\inf_{N : \text{FAR}(N) \le \alpha} \text{CADD}(N) \sim \text{CADD}(N_{\text{C}}) \sim \frac{|\ln \alpha|}{D(p_1 \| p_0)}.$$
Problem 9.2 has been studied in the i.i.d. setting in [10] and [13], where it is shown that the performance of some algorithms based on the Shiryaev–Roberts statistic, specifically the SR algorithm (see (9.28)), is within a constant of the best possible performance over the set of algorithms meeting a FAR constraint of α, as α → 0. We do not discuss the details here, but those results imply that the SR algorithm is asymptotically optimal, i.e., we have the following theorem.

THEOREM 9.3  Setting b = 1/α in (9.28) ensures that FAR(N_SR) ≤ α, and as α → 0,
$$\inf_{N : \text{FAR}(N) \le \alpha} \text{CADD}(N) \sim \text{CADD}(N_{\text{SR}}) \sim \frac{|\ln \alpha|}{D(p_1 \| p_0)}.$$
Theorems 9.1–9.3 and Corollary 9.1 show that both the CuSum algorithm and the SR algorithm are asymptotically optimal for both Problem 9.1 and Problem 9.2. The FAR in all cases decreases to zero exponentially with delay (CADD or WADD), with exponent D(p1‖p0), which is the same as the error exponent in the Chernoff–Stein lemma (see Lemma 8.1) with fixed P_M. In Figure 9.6, we plot the trade-off curve for the CuSum algorithm, obtained via simulations for Gaussian observations, i.e., we plot CADD as a function of −ln FAR. It can be checked that the curve has a slope of approximately 1/D(p1‖p0).

9.2.2

Bayesian Quickest Change Detection
In the Bayesian setting it is assumed that the change point is a random variable Λ taking values on the nonnegative integers, with π_λ = P{Λ = λ}. Define the average detection delay (ADD) and the probability of false alarm (PFA), respectively, as
$$\text{ADD}(N) = E\left[(N - \Lambda)^+\right] = \sum_{\lambda=1}^{\infty} \pi_\lambda\, E_\lambda\left[(N - \lambda)^+\right], \tag{9.34}$$
$$\text{PFA}(N) = P\{N < \Lambda\} = \sum_{\lambda=0}^{\infty} \pi_\lambda\, P_\lambda\{N < \lambda\}. \tag{9.35}$$
Then, the Bayesian quickest change detection problem is to minimize ADD subject to a constraint on PFA, and can be stated formally as follows.

PROBLEM 9.3  Minimize ADD(N) subject to PFA(N) ≤ α.
The prior on the change point Λ is usually assumed to be geometric with parameter ρ, i.e., for 0 < ρ < 1,
$$\pi_\lambda = P\{\Lambda = \lambda\} = \rho (1 - \rho)^{\lambda-1}, \qquad \lambda \ge 1.$$
The justification for this assumption is that it is a memoryless distribution; it also leads to a tractable solution to Problem 9.3, as seen in Theorem 9.4. Let Y_1^n = (Y_1, ..., Y_n) denote the observations up to time n. Also let
$$G_n = P\{\Lambda \le n \mid Y_1^n\} \tag{9.36}$$
be the a posteriori probability at time n that the change has taken place, given the observations up to time n. Using Bayes' rule, G_n can be shown to satisfy the recursion (see Exercise 9.8)
$$G_n = \Phi(Y_n, G_{n-1}), \tag{9.37}$$
where
$$\Phi(Y_n, G_{n-1}) = \frac{\tilde{G}_{n-1} L(Y_n)}{\tilde{G}_{n-1} L(Y_n) + (1 - \tilde{G}_{n-1})}, \tag{9.38}$$
$\tilde{G}_{n-1} = G_{n-1} + (1 - G_{n-1})\rho$, and $G_0 = 0$.
THEOREM 9.4 [14]  The optimal solution to Problem 9.3 has the following structure:
$$N_{\text{S}} = \inf\{n \ge 1 : G_n \ge a\}, \tag{9.39}$$
if a ∈ (0, 1) can be chosen such that
$$\text{PFA}(N_{\text{S}}) = \alpha. \tag{9.40}$$
The optimal change detection algorithm of (9.39) is referred to as the Shiryaev algorithm. We now discuss some alternative descriptions of the Shiryaev algorithm. Let
$$R_n = \frac{G_n}{1 - G_n}, \qquad R_{n,\rho} = \frac{G_n}{(1 - G_n)\rho}.$$
We note that R_n is the likelihood ratio between the hypotheses
$$H_1^{(n)} : \Lambda \le n, \qquad H_0^{(n)} : \Lambda > n,$$
averaged over the distribution of the change point. In particular,
$$R_n = \frac{G_n}{1 - G_n} = \frac{P(\Lambda \le n \mid Y_1^n)}{P(\Lambda > n \mid Y_1^n)} = \frac{\sum_{k=1}^{n} (1-\rho)^{k-1} \rho \prod_{i=1}^{k-1} p_0(Y_i) \prod_{i=k}^{n} p_1(Y_i)}{(1-\rho)^n \prod_{i=1}^{n} p_0(Y_i)} = \frac{1}{(1-\rho)^n} \sum_{k=1}^{n} (1-\rho)^{k-1} \rho \prod_{i=k}^{n} L(Y_i). \tag{9.41}$$
Also, R_{n,ρ} is a scaled version of R_n:
$$R_{n,\rho} = \frac{1}{(1-\rho)^n} \sum_{k=1}^{n} (1-\rho)^{k-1} \prod_{i=k}^{n} L(Y_i). \tag{9.42}$$
Like G_n, R_{n,ρ} can also be computed using a recursion (see Exercise 9.8):
$$R_{n,\rho} = \frac{1 + R_{n-1,\rho}}{1 - \rho}\, L(Y_n), \qquad R_{0,\rho} = 0. \tag{9.43}$$
We remark here that if we set ρ = 0 in (9.42) and (9.43), then the Shiryaev statistic R_{n,ρ} reduces to the SR statistic (see (9.28)) and the Shiryaev recursion reduces to the SR recursion. It is easy to see that R_n and R_{n,ρ} have one-to-one mappings with the Shiryaev statistic G_n.

ALGORITHM 9.1  Shiryaev Algorithm. The following three stopping times are equivalent and define the Shiryaev stopping time:
$$N_{\text{S}} = \inf\{n \ge 1 : G_n \ge a\}, \tag{9.44}$$
$$N_{\text{S}} = \inf\{n \ge 1 : R_n \ge b\}, \tag{9.45}$$
$$N_{\text{S}} = \inf\left\{n \ge 1 : R_{n,\rho} \ge \frac{b}{\rho}\right\},$$
with $b = \frac{a}{1-a}$.
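The mapping between G_n and R_{n,ρ} can be verified by running the recursions (9.37)–(9.38) and (9.43) side by side (a sketch; the Gaussian model and ρ = 0.1 are illustrative choices):

```python
# Run the Shiryaev recursion (9.37)-(9.38) for G_n and the recursion (9.43)
# for R_{n,rho} in parallel, and check R_{n,rho} = G_n / ((1 - G_n) rho).
import math
import random

random.seed(3)
rho, mu = 0.1, 0.5
L = lambda y: math.exp(mu * y - mu**2 / 2)  # N(0,1)-vs-N(mu,1) likelihood ratio

G, R = 0.0, 0.0
for _ in range(50):
    y = random.gauss(0.0, 1.0)                 # any data stream would do here
    Gt = G + (1 - G) * rho                     # G-tilde
    G = Gt * L(y) / (Gt * L(y) + (1 - Gt))     # (9.37)-(9.38)
    R = (1 + R) * L(y) / (1 - rho)             # (9.43)
    assert abs(R - G / ((1 - G) * rho)) < 1e-6 * max(1.0, R)

print("one-to-one mapping between G_n and R_{n,rho} confirmed")
```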
It is easy to design the Shiryaev algorithm to meet a constraint on PFA. In particular,
$$\text{PFA}(N_{\text{S}}) = P\{N_{\text{S}} < \Lambda\} = E\left[\mathbf{1}\{N_{\text{S}} < \Lambda\}\right] = E\left[E\left[\mathbf{1}\{N_{\text{S}} < \Lambda\} \mid Y_1, Y_2, \ldots, Y_{N_{\text{S}}}\right]\right] = E\left[P\left\{N_{\text{S}} < \Lambda \mid Y_1, Y_2, \ldots, Y_{N_{\text{S}}}\right\}\right] = E\left[1 - G_{N_{\text{S}}}\right] \le 1 - a.$$
Thus, for a = 1 − α, we have
$$\text{PFA}(N_{\text{S}}) \le \alpha.$$
Figure 9.7  Performance of the Shiryaev algorithm with ρ = 0.01, p0 = N(0, 1), p1 = N(0.75, 1), and |ln(1 − ρ)| + D(p1‖p0) = 0.2913.
We now present a first-order asymptotic analysis of the Shiryaev algorithm in the following theorem, which is proved in [15].

THEOREM 9.5  As α → 0,
$$\inf_{N : \text{PFA}(N) \le \alpha} E[(N - \Lambda)^+] \sim E[(N_{\text{S}} - \Lambda)^+] \sim \frac{|\ln \alpha|}{D(p_1 \| p_0) + |\ln(1 - \rho)|}. \tag{9.46}$$
In Figure 9.7 we plot simulation results for ADD as a function of |ln(PFA)| for Gaussian observations. For a PFA constraint α that is small, ln b ≈ |ln α|, and
$$\text{ADD} \approx E_1[N_{\text{S}}] \approx \frac{|\ln \alpha|}{|\ln(1-\rho)| + D(p_1 \| p_0)},$$
giving a slope of roughly $\frac{1}{|\ln(1-\rho)| + D(p_1 \| p_0)}$ to the trade-off curve, as we can verify in Figure 9.7. When |ln(1 − ρ)| ≪ D(p1‖p0), the observations contain more information about the change than the prior, and the trade-off slope is roughly $\frac{1}{D(p_1 \| p_0)}$, as in the minimax setting. On the other hand, when |ln(1 − ρ)| ≫ D(p1‖p0), the prior contains more information about the change than the observations, and the trade-off slope is roughly $\frac{1}{|\ln(1-\rho)|}$. The latter asymptotic slope is achieved by the stopping time that is based only on the prior:
$$N^* = \inf\{n \ge 1 : P(\Lambda > n) \le \alpha\}.$$
This is also easy to see from (9.38). With D(p1‖p0) small, L(Y) ≈ 1, and the recursion for G_n reduces to
$$G_n = G_{n-1} + (1 - G_{n-1})\rho, \qquad G_0 = 0.$$
Expanding, we get $G_n = \rho \sum_{k=0}^{n-1} (1 - \rho)^k = 1 - (1 - \rho)^n$. Setting $G_{N^*} = 1 - \alpha$, we get
$$\text{ADD} \approx N^* \approx \frac{|\ln \alpha|}{|\ln(1-\rho)|}.$$
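The closed form for this prior-only recursion, and the resulting stopping time N*, are easy to confirm (a sketch; ρ = 0.01 and α = 0.1 are illustrative choices):

```python
# With L(Y) ~ 1, the Shiryaev recursion reduces to the prior-only recursion
# G_n = G_{n-1} + (1 - G_{n-1}) rho, whose closed form is 1 - (1 - rho)^n.
from math import ceil, log

rho, alpha = 0.01, 0.1

G = 0.0
for n in range(1, 101):
    G = G + (1 - G) * rho
    assert abs(G - (1 - (1 - rho) ** n)) < 1e-12

# Prior-only stopping time: first n with P(Lambda > n) = (1 - rho)^n <= alpha.
N_star = ceil(log(alpha) / log(1 - rho))
assert (1 - rho) ** N_star <= alpha < (1 - rho) ** (N_star - 1)
print(N_star)  # 230, close to |ln alpha| / |ln(1 - rho)| ~ 229.1
```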
Exercises
9.1 Wald's Bound on Sum of Error Probabilities. Prove (9.14).

9.2 Sequential Hypothesis Testing. Consider the problem of sequential testing between the pmfs
$$p_1(y) = \begin{cases} \frac{3}{4} & \text{if } y = 1 \\ \frac{1}{4} & \text{if } y = 0 \end{cases}, \qquad p_0(y) = \begin{cases} \frac{1}{4} & \text{if } y = 1 \\ \frac{3}{4} & \text{if } y = 0 \end{cases}$$
using an SPRT with thresholds a < 0 < b.
(a) Let N1(n) denote the number of observations that are equal to 1 among the first n observations. Show that the SPRT can be written in terms of N1(n).
(b) Suppose a = −10 ln 3 and b = 10 ln 3. Show that Wald's approximations for the error probabilities and expected stopping times are exact in this case.
(c) For the thresholds given in part (b), find the error probabilities P_F and P_M.
(d) For the thresholds given in part (b), find E0[N_SPRT] and E1[N_SPRT].
(e) Give a good approximation for the number of samples that a fixed sample size (FSS) test requires to achieve the same error probabilities as in part (c).

9.3 SPRT versus FSS Simulation. Consider the detection problem:
$$H_1 : Y_1, Y_2, \ldots \text{ are i.i.d. } \mathcal{N}(0.25, 1),$$
$$H_0 : Y_1, Y_2, \ldots \text{ are i.i.d. } \mathcal{N}(-0.25, 1).$$
Your goal is to design tests that have error probabilities α = 0.05 and β = 0.05.
(a) Design a fixed sample size test that has the desired error probabilities. That is, find τ and n_FSS such that the LRT
$$\delta(y_1, \ldots, y_{n_{\text{FSS}}}) = \begin{cases} 1 & \text{if } \sum_{k=1}^{n_{\text{FSS}}} y_k \ge \tau \\ 0 & \text{if } \sum_{k=1}^{n_{\text{FSS}}} y_k < \tau \end{cases}$$
meets the error probability constraints.

9.4 Consider an SPRT with thresholds b > 0 > a.
(a) Show that the SPRT stopping rule is independent of the upper threshold b; in particular,
$$N = \min\left\{ \left\lceil \frac{a}{\ln(w_0/w_1)} \right\rceil,\ \min\left\{ n \ge 1 : \frac{w_0}{2} < |Y_n| < \frac{w_1}{2} \right\} \right\},$$
and we select H0 if $N = \lceil a/\ln(w_0/w_1) \rceil$ and $|Y_N| \le \frac{w_0}{2}$, and H1 otherwise.
(b) Evaluate the probabilities of false alarm and miss for the rule of part (a). Note that you should be able to evaluate these exactly, without using Wald's approximations.
(c) Evaluate the expected stopping times under H0 and H1. Once again you should be able to evaluate these exactly without resorting to Wald's approximations.

9.5 Sequential Testing with Three Hypotheses. Suggest a reasonable way to modify the SPRT for sequential testing with three hypotheses. Explain your logic.

9.6 CuSum Recursion. Prove that the CuSum statistics W_n and C_n can be computed iteratively as given in this chapter, i.e.,
$$W_n = (W_{n-1} + \ln L(Y_n))^+, \qquad W_0 = 0,$$
and
$$C_n = (C_{n-1})^+ + \ln L(Y_n), \qquad C_0 = 0.$$
Also use these recursions to conclude that W_n and C_n will cross a positive threshold b at the same time (sample-path wise).

9.7 CuSum Simulation. Simulate the performance of the CuSum algorithm with p0 = N(0, 1) and p1 = N(0.5, 1), and plot your results in a similar manner as in Figure 9.6.
9.8 Shiryaev and SR Recursions.
(a) Show that the Shiryaev statistic G_n satisfies the recursion
$$G_n = \Phi(Y_n, G_{n-1}),$$
where
$$\Phi(Y_n, G_{n-1}) = \frac{\tilde{G}_{n-1} L(Y_n)}{\tilde{G}_{n-1} L(Y_n) + (1 - \tilde{G}_{n-1})},$$
$\tilde{G}_{n-1} = G_{n-1} + (1 - G_{n-1})\rho$, and $G_0 = 0$.
(b) Show that the SR statistic can be computed iteratively as
$$T_n = (1 + T_{n-1}) L(Y_n), \qquad T_0 = 0.$$
9.9 Shiryaev Simulation. Simulate the performance of the Shiryaev algorithm with ρ = 0.01, p0 = N(0, 1), and p1 = N(0.5, 1), and plot your results in a similar manner as in Figure 9.7.
9.10 ML Detection of Change Point in QCD. The Cumulative Sum (CuSum) algorithm for quickest change detection uses the statistic
$$W_n = \max_{1 \le k \le n+1} \sum_{\ell=k}^{n} \ln L(Y_\ell),$$
where $L(y) = \frac{p_1(y)}{p_0(y)}$. Recall that the statistic W_n has the recursion
$$W_n = (W_{n-1} + \ln L(Y_n))^+, \qquad W_0 = 0.$$
The statistic W_n can be used to detect the change in distribution by comparison with an appropriate threshold. Now suppose that, in addition to detecting the occurrence of the change, we are also interested in obtaining a maximum likelihood estimate (MLE) of the actual change point λ at time n, with the understanding that an MLE value of n + 1 corresponds to λ > n. Let λ̂_n denote the MLE of λ at time n. Show that
$$\hat{\lambda}_n = \begin{cases} n + 1 & \text{if } W_n = 0, \\ \hat{\lambda}_{n-1} & \text{if } W_n > 0, \end{cases}$$
with λ̂_0 = 1.
References
[1] A. Wald, Sequential Analysis, Wiley, 1947.
[2] D. Siegmund, Sequential Analysis: Tests and Confidence Intervals, Springer Science & Business Media, 2013.
[3] A. G. Tartakovsky, I. Nikiforov, and M. Basseville, Sequential Analysis: Hypothesis Testing and Changepoint Detection, CRC Press, 2014.
[4] H. V. Poor and O. Hadjiliadis, Quickest Detection, Cambridge University Press, 2009.
[5] V. V. Veeravalli and T. Banerjee, "Quickest Change Detection," Academic Press Library in Signal Processing, Vol. 3, pp. 209–255, Elsevier, 2014.
[6] A. Wald and J. Wolfowitz, "Optimum Character of the Sequential Probability Ratio Test," Ann. Math. Stat., Vol. 19, No. 3, pp. 326–339, 1948.
[7] W. A. Shewhart, "The Application of Statistics as an Aid in Maintaining Quality of a Manufactured Product," J. Am. Stat. Assoc., Vol. 20, pp. 546–548, 1925.
[8] E. S. Page, "Continuous Inspection Schemes," Biometrika, Vol. 41, pp. 100–115, 1954.
[9] G. Lorden, "Procedures for Reacting to a Change in Distribution," Ann. Math. Stat., Vol. 42, pp. 1897–1908, 1971.
[10] M. Pollak, "Optimal Detection of a Change in Distribution," Ann. Stat., Vol. 13, pp. 206–227, 1985.
[11] G. V. Moustakides, "Optimal Stopping Times for Detecting Changes in Distributions," Ann. Stat., Vol. 14, pp. 1379–1387, 1986.
[12] T. L. Lai, "Information Bounds and Quick Detection of Parameter Changes in Stochastic Systems," IEEE Trans. Inf. Theory, Vol. 44, pp. 2917–2929, 1998.
[13] A. G. Tartakovsky, M. Pollak, and A. Polunchenko, "Third-order Asymptotic Optimality of the Generalized Shiryaev–Roberts Change-point Detection Procedures," Theory Prob. Appl., Vol. 56, pp. 457–484, 2012.
[14] A. N. Shiryayev, Optimal Stopping Rules, Springer-Verlag, 1978.
[15] A. G. Tartakovsky and V. V. Veeravalli, "General Asymptotic Bayesian Theory of Quickest Change Detection," SIAM Theory Prob. App., Vol. 49, pp. 458–497, 2005.
10
Detection of Random Processes
This chapter considers hypothesis testing when the observations are realizations of random processes. We first study discrete-time stationary processes and Markov processes, describe the structure of optimal detectors, and introduce the notions of Kullback–Leibler (KL) and Chernoff divergence rates. Then we consider two continuous-time processes: Gaussian processes, which are studied by means of the Karhunen–Loève transform; and inhomogeneous Poisson processes. The chapter ends with a general approach to detection based on Radon–Nikodym derivatives.

In Chapters 5 and 8 we studied the structure of optimal detectors and their performance when the observations form a length-n vector. We have focused on problems where the samples of the observation vector are conditionally independent given the hypothesis, or at least can be reduced to that case (using a whitening transform). The underlying information measures (Kullback–Leibler, Chernoff, etc.) are additive, which facilitates the performance analysis. This chapter considers several problems where the conditional-independence model does not hold. We show that for several classical models a tractable expression for the optimal tests can still be derived, and performance can still be accurately analyzed. We begin with the case of discrete-time processes (Section 10.1), then move to continuous-time Gaussian processes (Section 10.2) and inhomogeneous Poisson processes (Section 10.3), before concluding with a description of the general approach to such problems (Section 10.4).
10.1

Discrete-Time Random Processes

A random process is an indexed family of random variables $\{X(t), t \in \mathcal{T}\}$ defined in terms of a probability space $(\Omega, \mathcal{F}, P)$ [1]. Then $X(t) = x(t, \omega)$, $\omega \in \Omega$. In m-ary hypothesis testing, there are m probability measures $P_j$, $1 \le j \le m$, defined over $\mathcal{F}$. For discrete-time processes, the observations form a length-n sequence $Y = (Y_1, Y_2, \ldots, Y_n)$ whose conditional distribution given hypothesis $H_j$ will sometimes be denoted by $p_j^{(n)}$ to emphasize the dependency on n. We will have $\mathcal{T} = \mathbb{Z}$ and $Y_i = X(i)$ for $i = 1, 2, \ldots, n$. For many problems encountered in this section, the Kullback–Leibler and Chernoff information measures, normalized by $\frac{1}{n}$, converge to a limit as $n \to \infty$, as defined next.
DEFINITION 10.1  The Kullback–Leibler divergence rate between processes P0 and P1 is defined as the following limit, when it exists:
$$D(P_0 \| P_1) \triangleq \lim_{n \to \infty} \frac{1}{n} D(p_0^{(n)} \| p_1^{(n)}). \tag{10.1}$$

DEFINITION 10.2  The Chernoff divergence rate of order u ∈ (0, 1) between processes P0 and P1 is defined as the following limit, when it exists:
$$d_u(P_0, P_1) \triangleq \lim_{n \to \infty} \frac{1}{n} d_u(p_0^{(n)}, p_1^{(n)}). \tag{10.2}$$

The Kullback–Leibler and Chernoff divergence rates can be used to derive the asymptotics of error probability for binary hypothesis tests.
10.1.1
Periodic Stationary Gaussian Processes
A discrete-time stationary Gaussian process $Z = (Z_1, Z_2, \ldots)$ is said to be periodic with period n if its mean function and covariance function are periodic with period n [1]. We shall only be interested in zero-mean processes. The covariance sequence $c_k \triangleq E[Z_1 Z_{k+1}]$ is symmetric ($c_k = c_{-k}$) and periodic, hence $c_k = c_{n-k}$. The process is represented by an n-vector Z whose $n \times n$ covariance matrix $\mathbf{K}^{(n)} = E[Z Z^T]$ is circulant Toeplitz, i.e., each row is obtained from a circular shift of the first row $[c_0, c_1, \ldots, c_2, c_1]$:
$$\mathbf{K}^{(n)} = \begin{bmatrix} c_0 & c_1 & c_2 & \cdots & c_2 & c_1 \\ c_1 & c_0 & c_1 & \cdots & c_3 & c_2 \\ \vdots & & & \ddots & & \vdots \\ c_1 & c_2 & c_3 & \cdots & c_1 & c_0 \end{bmatrix}. \tag{10.3}$$
The matrix is constant along diagonals, symmetric, and nonnegative definite. The discrete Fourier transform (DFT) of the first row is the power spectral mass function
$$\lambda_k = \frac{1}{\sqrt{n}} \sum_{j=0}^{n-1} c_j e^{-i 2\pi jk/n} \ge 0, \qquad k = 0, 1, \ldots, n-1 \quad (i = \sqrt{-1}), \tag{10.4}$$
which is symmetric and nonnegative. The inverse DFT gives
$$c_j = \frac{1}{\sqrt{n}} \sum_{k=0}^{n-1} \lambda_k e^{i 2\pi jk/n}, \qquad j = 0, 1, \ldots, n-1. \tag{10.5}$$
Hence the process is characterized equivalently by (10.4) or by (10.5). Denote by $\mathbf{F} \triangleq \{\frac{1}{\sqrt{n}} e^{-i 2\pi jk/n}\}_{1 \le j,k \le n}$ the $n \times n$ DFT matrix, which is complex-valued, symmetric, and unitary. The eigenvectors of circulant Toeplitz matrices are complex exponentials at harmonically related frequencies. Specifically,
$$\mathbf{K}^{(n)} = \mathbf{F}^{\dagger} \boldsymbol{\Lambda} \mathbf{F},$$
where $\boldsymbol{\Lambda} = \mathrm{diag}[\lambda_0, \lambda_1, \ldots, \lambda_{n-1}]$ is a diagonal matrix comprised of the power spectral mass values, and the superscript † denotes the Hermitian transpose operator: $(\mathbf{F}^{\dagger})_{ij} = \mathbf{F}_{ji}^{*}$. Since the determinant of a matrix is invariant under unitary transforms, we have
$$|\mathbf{K}^{(n)}| = \prod_{k=0}^{n-1} \lambda_k. \tag{10.6}$$
Now denote by $\tilde{Z} \triangleq \mathbf{F} Z$ the DFT of Z, which satisfies the Hermitian-symmetry property $\tilde{Z}_k = \tilde{Z}_{n-k}^{*}$ for all k. Also
$$\mathrm{Cov}(\tilde{Z}) = E[\tilde{Z} \tilde{Z}^{\dagger}] = \mathbf{F}\, E[Z Z^T]\, \mathbf{F}^{\dagger} = \mathbf{F} \mathbf{K}^{(n)} \mathbf{F}^{\dagger} = \boldsymbol{\Lambda} \tag{10.7}$$
and
$$Z^T (\mathbf{K}^{(n)})^{-1} Z = \tilde{Z}^{\dagger} \boldsymbol{\Lambda}^{-1} \tilde{Z} = \sum_{k=0}^{n-1} \frac{|\tilde{Z}_k|^2}{\lambda_k}. \tag{10.8}$$
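The diagonalization identities (10.6) and (10.8) can be verified numerically for a small symmetric circulant covariance. The sketch below uses NumPy's unnormalized FFT for the eigenvalues (conventions for the 1/√n factor vary; with the unnormalized DFT, the eigenvalues of a circulant matrix are exactly the DFT of its first row):

```python
# Verify that a symmetric circulant covariance is diagonalized by the DFT:
# its eigenvalues are the DFT of the first row, |K| equals their product
# (cf. (10.6)), and y^T K^{-1} y = sum_k |y_tilde_k|^2 / lambda_k (cf. (10.8)),
# where y_tilde is the unitary DFT of y.
import numpy as np

n = 8
c = np.array([2.0, 0.5, 0.2, 0.1, 0.05, 0.1, 0.2, 0.5])  # c_k = c_{n-k}
K = np.array([[c[(j - i) % n] for j in range(n)] for i in range(n)])

lam = np.fft.fft(c).real  # real since c is symmetric; positive here
assert np.all(lam > 0)
assert np.isclose(np.linalg.det(K), np.prod(lam))  # (10.6)

rng = np.random.default_rng(0)
y = rng.standard_normal(n)
y_tilde = np.fft.fft(y) / np.sqrt(n)  # unitary DFT
quad = y @ np.linalg.solve(K, y)
assert np.isclose(quad, np.sum(np.abs(y_tilde) ** 2 / lam))  # (10.8)
print("circulant diagonalization identities verified")
```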
Binary Hypothesis Testing. Consider testing between two zero-mean periodic stationary Gaussian processes with respective power spectral mass functions $\lambda_{0,k}$ and $\lambda_{1,k}$, $k = 0, 1, \ldots, n-1$. Let $\tilde{Y} = FY$. The LLR is given by
$$ \ln \frac{p_1^{(n)}(Y)}{p_0^{(n)}(Y)} = \frac12 \ln |K_0^{(n)}| - \frac12 \ln |K_1^{(n)}| - \frac12 Y^T (K_1^{(n)})^{-1} Y + \frac12 Y^T (K_0^{(n)})^{-1} Y = \frac12 \sum_{k=0}^{n-1} \left[ \ln \frac{\lambda_{0,k}}{\lambda_{1,k}} - |\tilde{Y}_k|^2 \left( \frac{1}{\lambda_{1,k}} - \frac{1}{\lambda_{0,k}} \right) \right], \tag{10.9} $$
where the last equality was obtained using the identities (10.6) and (10.8). Thus the LLR is expressed simply in terms of the DFT of the observation vector. Taking the expectation of (10.9) under $H_1$ and using $E_1[|\tilde{Z}_k|^2] = \lambda_{1,k}$ per (10.7), we obtain the Kullback–Leibler divergence
$$ D(p_1^{(n)} \| p_0^{(n)}) = \frac12 \sum_{k=0}^{n-1} \left[ \frac{\lambda_{1,k}}{\lambda_{0,k}} - \ln \frac{\lambda_{1,k}}{\lambda_{0,k}} - 1 \right] = \sum_{k=0}^{n-1} \psi\!\left( \frac{\lambda_{1,k}}{\lambda_{0,k}} \right) \tag{10.10} $$
and likewise
$$ D(p_0^{(n)} \| p_1^{(n)}) = \sum_{k=0}^{n-1} \psi\!\left( \frac{\lambda_{0,k}}{\lambda_{1,k}} \right), \tag{10.11} $$
where the function $\psi(x) = \frac12 (x - 1 - \ln x)$, $x > 0$, was introduced in (6.4). For testing $P_0 = \mathcal{N}(0,1)$ against $P_1 = \mathcal{N}(0,\xi)$, we obtained $D(p_0 \| p_1) = \psi(1/\xi)$. The Chernoff divergence of order $u \in (0,1)$ is obtained from (6.19) as
$$ d_u(p_0^{(n)}, p_1^{(n)}) = \frac12 \sum_{k=0}^{n-1} \ln \frac{(1-u)\lambda_{0,k} + u\lambda_{1,k}}{\lambda_{0,k}^{1-u} \lambda_{1,k}^{u}} = \frac12 \sum_{k=0}^{n-1} \left[ u \ln \frac{\lambda_{0,k}}{\lambda_{1,k}} + \ln\left( 1 - u + u \frac{\lambda_{1,k}}{\lambda_{0,k}} \right) \right]. \tag{10.12} $$
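The equality of the matrix form and the DFT-domain form of the LLR in (10.9) is easy to verify numerically. A minimal sketch, with illustrative spectral masses (not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
k = np.arange(n)
lam0 = 1.0 + 0.5 * np.cos(2 * np.pi * k / n)   # spectral masses under H0
lam1 = 2.0 + 0.8 * np.cos(2 * np.pi * k / n)   # spectral masses under H1
F = np.fft.fft(np.eye(n)) / np.sqrt(n)
K0 = (F.conj().T @ np.diag(lam0) @ F).real     # circulant covariances
K1 = (F.conj().T @ np.diag(lam1) @ F).real

y = rng.standard_normal(n)
Yt = F @ y                                     # DFT of the observation

# Matrix form of the LLR (first expression in (10.9)).
llr_mat = 0.5 * (np.linalg.slogdet(K0)[1] - np.linalg.slogdet(K1)[1]
                 - y @ np.linalg.solve(K1, y) + y @ np.linalg.solve(K0, y))
# Spectral form (second expression in (10.9)).
llr_dft = 0.5 * np.sum(np.log(lam0 / lam1)
                       - np.abs(Yt) ** 2 * (1 / lam1 - 1 / lam0))
```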
Detection of Random Processes

10.1.2 Stationary Gaussian Processes
Consider now testing between two zero-mean stationary Gaussian processes with respective spectral densities $S_0(f)$ and $S_1(f)$, $|f| \le \frac12$. The respective covariance matrices are $K_0^{(n)}$ and $K_1^{(n)}$. The LLR is given by
$$ \ln \frac{p_1^{(n)}(Y)}{p_0^{(n)}(Y)} = \frac12 \ln \frac{|K_0^{(n)}|}{|K_1^{(n)}|} - \frac12 Y^T (K_1^{(n)})^{-1} Y + \frac12 Y^T (K_0^{(n)})^{-1} Y = \frac12 \ln \frac{|K_0^{(n)}|}{|K_1^{(n)}|} - \frac12 \mathrm{Tr}\left\{ Y Y^T \left[ (K_1^{(n)})^{-1} - (K_0^{(n)})^{-1} \right] \right\}, \tag{10.13} $$
where $\mathrm{Tr}[A] = \sum_i A_{ii}$ denotes the trace of a square matrix. (See Appendix A for basic properties of the trace.) In general the LLR is not so easy to evaluate due to the need to evaluate quadratic forms involving the inverse covariance matrices. Taking the expectation of (10.13) under $H_1$ and using $E_1[YY^T] = K_1^{(n)}$, we obtain the Kullback–Leibler divergence
$$ D(p_1^{(n)} \| p_0^{(n)}) = \frac12 \left[ \mathrm{Tr}\left[ (K_0^{(n)})^{-1} K_1^{(n)} \right] - n - \ln \frac{|K_1^{(n)}|}{|K_0^{(n)}|} \right] \tag{10.14} $$
and likewise
$$ D(p_0^{(n)} \| p_1^{(n)}) = \frac12 \left[ \mathrm{Tr}\left[ (K_1^{(n)})^{-1} K_0^{(n)} \right] - n - \ln \frac{|K_0^{(n)}|}{|K_1^{(n)}|} \right] \tag{10.15} $$
as given in (6.5). While these expressions are still cumbersome, significant simplifications are obtained in the asymptotic setting where $n \to \infty$. Assume the covariance sequence is absolutely summable: $\sum_k |c_k| < \infty$, which implies that $|c_k|$ decays faster than $\frac1k$ as $k \to \infty$. Then the KL divergence rate (10.1) exists and can be evaluated using the asymptotics of covariance matrices of stationary processes [2, 3]:
$$ \lim_{n\to\infty} |K^{(n)}|^{1/n} = \exp\left\{ \int_{-1/2}^{1/2} \ln S(f)\, df \right\}, $$
or equivalently,
$$ \lim_{n\to\infty} \frac1n \ln |K^{(n)}| = \int_{-1/2}^{1/2} \ln S(f)\, df. $$
Likewise,
$$ \lim_{n\to\infty} \frac1n \mathrm{Tr}\left[ (K_0^{(n)})^{-1} K_1^{(n)} \right] = \int_{-1/2}^{1/2} \frac{S_1(f)}{S_0(f)}\, df. $$
Combining these limits with (10.1) and (10.14), we obtain
$$ \bar{D}(P_1 \| P_0) = \int_{-1/2}^{1/2} \psi\!\left( \frac{S_1(f)}{S_0(f)} \right) df. \tag{10.16} $$
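The log-determinant limit above is a Szegő-type result and can be probed numerically. The sketch below uses an illustrative AR(1)-type covariance $c_k = \rho^{|k|}$, whose spectral density is $S(f) = (1-\rho^2)/(1 - 2\rho\cos(2\pi f) + \rho^2)$ (these choices are mine, not the book's):

```python
import numpy as np

rho = 0.5
# Riemann sum of the integral of ln S(f) over (-1/2, 1/2).
f = (np.arange(200000) + 0.5) / 200000 - 0.5
S = (1 - rho**2) / (1 - 2 * rho * np.cos(2 * np.pi * f) + rho**2)
integral = np.mean(np.log(S))

# Finite-n log-determinant rate (1/n) ln |K^(n)| for the Toeplitz covariance.
n = 400
idx = np.arange(n)
K = rho ** np.abs(np.subtract.outer(idx, idx))
logdet_rate = np.linalg.slogdet(K)[1] / n
```

For this covariance the agreement is already close at moderate $n$, since $c_k$ decays geometrically.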
For the Chernoff divergence of order $u \in (0,1)$, starting from (6.19) and applying (10.2), one similarly obtains the Chernoff divergence rate of order $u$ [4]:
$$ \bar{d}_u(p_0, p_1) \triangleq \lim_{n\to\infty} \frac1n d_u(p_0^{(n)}, p_1^{(n)}) = \frac12 \lim_{n\to\infty} \frac1n \ln \frac{|(1-u)K_0^{(n)} + uK_1^{(n)}|}{|K_0^{(n)}|^{1-u}\, |K_1^{(n)}|^{u}} = \frac12 \int_{-1/2}^{1/2} \ln \frac{(1-u)S_0(f) + uS_1(f)}{S_0(f)^{1-u} S_1(f)^{u}}\, df. \tag{10.17} $$
The intuition behind these results is that the $n \times n$ Toeplitz covariance matrix is well approximated by a circulant Toeplitz matrix if $n$ is large enough and $|c_k|$ decays faster than $\frac1k$. For instance, a tridiagonal Toeplitz covariance matrix $K$ and its circulant Toeplitz approximation $\hat{K}$ are given in (10.18):
$$ K = \begin{bmatrix} c_0 & c_1 & 0 & \cdots & 0 & 0 & 0 \\ c_1 & c_0 & c_1 & \cdots & 0 & 0 & 0 \\ 0 & c_1 & c_0 & \cdots & 0 & 0 & 0 \\ \vdots & & & \ddots & & & \vdots \\ 0 & 0 & 0 & \cdots & c_1 & c_0 & c_1 \\ 0 & 0 & 0 & \cdots & 0 & c_1 & c_0 \end{bmatrix}, \qquad \hat{K} = \begin{bmatrix} c_0 & c_1 & 0 & \cdots & 0 & 0 & c_1 \\ c_1 & c_0 & c_1 & \cdots & 0 & 0 & 0 \\ 0 & c_1 & c_0 & \cdots & 0 & 0 & 0 \\ \vdots & & & \ddots & & & \vdots \\ 0 & 0 & 0 & \cdots & c_1 & c_0 & c_1 \\ c_1 & 0 & 0 & \cdots & 0 & c_1 & c_0 \end{bmatrix}. \tag{10.18} $$
Only the upper-right and lower-left corners of the matrix are affected. More generally, replacing each entry $c_k$ of the covariance matrix by $\hat{c}_k = c_k + c_{n-k}$ produces a circulant approximation to the original Toeplitz covariance matrix. If such an approximation is valid, then the approach used for periodic stationary Gaussian processes in Section 10.1.1 can be extended to the nonperiodic case, and an additive expression for the KL divergence rate is obtained again. This approximation method can also be used to construct an approximate LRT, by computing the DFT $\tilde{Y}$ of the observations and applying the LRT (10.9) for periodic stationary Gaussian processes. Here the power spectral values $\lambda_{0,k}$ and $\lambda_{1,k}$ are obtained by taking the DFT of the periodicized correlation sequence $\hat{c}_k$ under $H_0$ and $H_1$, respectively.
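The periodicization $\hat{c}_k = c_k + c_{n-k}$ is a one-line operation. A small sketch (the covariance sequence $c_k = 0.5^k$ is illustrative) that builds $\hat{K}$ and its DFT-domain eigenvalues:

```python
import numpy as np

n = 64
k = np.arange(n)
c = 0.5 ** k                                     # absolutely summable sequence
K = c[np.abs(np.subtract.outer(k, k))]           # exact Toeplitz covariance

c_hat = c.copy()
c_hat[1:] += c[1:][::-1]                         # c_hat_k = c_k + c_{n-k}
K_hat = c_hat[np.subtract.outer(k, k) % n]       # circulant approximation

lam_hat = np.fft.fft(c_hat).real                 # its eigenvalues, via the DFT
```

Only the corner entries of `K_hat` differ materially from `K` here, since $c_{n-k}$ is negligible except for $k$ close to $n$.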
10.1.3 Markov Processes
Consider a homogeneous Markov chain $(\mathcal{S}, q_0, r_0)$ with finite state space $\mathcal{S}$, initial state probabilities
$$ q_0(y) = P_0\{Y_0 = y\}, \quad y \in \mathcal{S}, $$
and transition probabilities
$$ r_0(y|y') = P_0\{Y_{i+1} = y \,|\, Y_i = y'\} \quad \text{for all } i \in \mathbb{N} \text{ and } y, y' \in \mathcal{S} $$
[1]. Consider a second process $(\mathcal{S}, q_1, r_1)$ with the same state space $\mathcal{S}$ but potentially different initial state probabilities $q_1(y) = P_1\{Y_0 = y\}$ and transition probabilities $r_1(y|y') = P_1\{Y_i = y \,|\, Y_{i-1} = y'\}$. The probability of a sequence $y = (y_1, y_2, \ldots, y_n) \in \mathcal{S}^n$ under hypothesis $H_j$, $j = 0, 1$, is given by
$$ p_j^{(n)}(y) = q_j(y_1) \prod_{i=2}^{n} r_j(y_i | y_{i-1}). \tag{10.19} $$
The LLR is given by
$$ \ln \frac{p_1^{(n)}(y)}{p_0^{(n)}(y)} = \ln \frac{q_1(y_1)}{q_0(y_1)} + \sum_{i=2}^{n} \ln \frac{r_1(y_i | y_{i-1})}{r_0(y_i | y_{i-1})} \tag{10.20} $$
and is easily computed owing to the factorization property (10.19) of the two Markov chains. To compute the KL divergences between $p_0^{(n)}$ and $p_1^{(n)}$ we first need the following definition.

DEFINITION 10.3 Let $p$ and $q$ be two conditional distributions on $\mathcal{Y}$ given $\mathcal{X}$, and $r$ a distribution on $\mathcal{X}$. The conditional Kullback–Leibler divergence of $p$ and $q$ with reference $r$ is
$$ D(p \| q \,|\, r) \triangleq \sum_{x} r(x) \sum_{y} p(y|x) \ln \frac{p(y|x)}{q(y|x)} = \sum_{x} r(x)\, D(p(\cdot|x) \| q(\cdot|x)). \tag{10.21} $$
KL Divergence Rate. We have
$$ D(p_0^{(n)} \| p_1^{(n)}) = E_0\left[ \ln \frac{p_0^{(n)}(Y)}{p_1^{(n)}(Y)} \right] = E_0\left[ \ln \frac{q_0(Y_1)}{q_1(Y_1)} \right] + \sum_{i=2}^{n} E_0\left[ \ln \frac{r_0(Y_i|Y_{i-1})}{r_1(Y_i|Y_{i-1})} \right] = D(q_0 \| q_1) + \sum_{i=2}^{n} D(r_0 \| r_1 \,|\, f_{0,i-1}), \tag{10.22} $$
where $f_{0,i} = \{f_{0,i}(y),\ y \in \mathcal{S}\}$ is the distribution of $Y_i$ under $H_0$. Assume the Markov chain $(\mathcal{S}, q_0, r_0)$ is irreducible (there exists a path connecting any two states) and aperiodic. Then the Markov chain has a unique equilibrium distribution $f_0^*$, and $f_{0,i}$ converges to $f_0^*$ as $i \to \infty$. Hence $D(r_0 \| r_1 \,|\, f_{0,i-1})$ converges to $D(r_0 \| r_1 \,|\, f_0^*)$, and the KL divergence rate (10.1) exists and is equal to
$$ \bar{D}(P_0 \| P_1) = \lim_{n\to\infty} \frac1n D(p_0^{(n)} \| p_1^{(n)}) = D(r_0 \| r_1 \,|\, f_0^*). \tag{10.23} $$
Note how the log function and the additivity property (10.22) were used to show the existence of a limit and to evaluate that limit.
Chernoff Divergence Rate. Define the $|\mathcal{S}| \times |\mathcal{S}|$ matrix $R$ with (nonnegative) elements
$$ R(y, y') = r_0(y|y')^{1-u}\, r_1(y|y')^{u}, \quad y, y' \in \mathcal{S}, \tag{10.24} $$
the vector $v$ with (nonnegative) elements
$$ v(y) = q_0(y)^{1-u}\, q_1(y)^{u}, \quad y \in \mathcal{S}, \ u \in (0,1), \tag{10.25} $$
and the vector $\mathbf{1} = (1, \ldots, 1)$ of the same dimension $|\mathcal{S}|$. Both $R$ and $v$ depend on $u$. Then the Chernoff divergence of order $u \in (0,1)$ between $p_0^{(n)}$ and $p_1^{(n)}$ is given by
$$ \begin{aligned} d_u(p_0^{(n)}, p_1^{(n)}) &= -\ln \sum_{y \in \mathcal{S}^n} [p_0^{(n)}(y)]^{1-u}\, [p_1^{(n)}(y)]^{u} \\ &= -\ln \sum_{y \in \mathcal{S}^n} v(y_1) \prod_{i=2}^{n} R(y_i, y_{i-1}) \\ &\overset{(a)}{=} -\ln \sum_{y_1 \in \mathcal{S}} v(y_1) \sum_{y_2 \in \mathcal{S}} R(y_2, y_1) \cdots \sum_{y_n \in \mathcal{S}} R(y_n, y_{n-1}) \\ &= -\ln \sum_{y_1 \in \mathcal{S}} v(y_1) \sum_{y_n \in \mathcal{S}} R^{n-1}(y_n, y_1) \\ &= -\ln v^T R^{n-1} \mathbf{1}. \end{aligned} \tag{10.26} $$
Observe the passage from a sum of products to a product of sums in (a). Since $r_0$ and $r_1$ are irreducible, so is the matrix $R$. By the Perron–Frobenius theorem, the matrix $R$ has a unique largest real eigenvalue, which is positive and will be denoted by $\lambda$. Moreover, $\lim_{n\to\infty} (v^T R^{n-1} \mathbf{1})^{1/n}$ exists and is equal to $\lambda$.¹ Hence the Chernoff divergence rate (10.2) exists and is equal to
$$ \bar{d}_u(P_0, P_1) = \lim_{n\to\infty} \frac1n d_u(p_0^{(n)}, p_1^{(n)}) = -\ln \lambda. \tag{10.27} $$
Example 10.1 Let $\mathcal{S} = \{0, 1\}$. Consider two Markov processes with respective transition probability matrices
$$ r_0 = \begin{bmatrix} a & 1-a \\ 1-a & a \end{bmatrix} \quad \text{and} \quad r_1 = \begin{bmatrix} b & 1-b \\ 1-b & b \end{bmatrix}, \quad a, b \in (0,1). \tag{10.28} $$
Each process admits the uniform stationary distribution $f_0^* = f_1^* = (\frac12, \frac12)$. Typical sample paths of $Y$ are, for $a = 1 - b = 0.1$:
$$ H_0: 10101010100101010\ldots, \qquad H_1: 11111111000000000\ldots. $$
By (10.23), the KL divergence rates are given by
$$ \bar{D}(P_0 \| P_1) = d(a \| b) \quad \text{and} \quad \bar{D}(P_1 \| P_0) = d(b \| a), $$
where $d$ is the binary KL divergence function of (7.1). By (10.24) and (10.28), the matrix $R$ is given by
$$ R = \begin{bmatrix} a^{1-u} b^{u} & (1-a)^{1-u}(1-b)^{u} \\ (1-a)^{1-u}(1-b)^{u} & a^{1-u} b^{u} \end{bmatrix}, $$
which is circulant Toeplitz. Its largest eigenvalue is
$$ \lambda = a^{1-u} b^{u} + (1-a)^{1-u}(1-b)^{u} = e^{-d_u(a,b)}, $$
where
$$ d_u(a, b) = -\ln\left( a^{1-u} b^{u} + (1-a)^{1-u}(1-b)^{u} \right) $$
is the binary Chernoff divergence of order $u$, i.e., the Chernoff divergence between two Bernoulli distributions with parameters $a$ and $b$, respectively. Hence the Chernoff divergence rate of order $u \in (0,1)$ between the processes $P_0$ and $P_1$ is equal to $d_u(a, b)$. □

¹ This is reminiscent of the power method for finding the largest eigenvalue of a matrix $R$: starting from an arbitrary vector and recursively multiplying by $R$ and normalizing the result yields the corresponding eigenvector in the limit.
10.2 Continuous-Time Processes

In many signal detection problems, the observations are continuous-time waveforms. These observations are often sampled and processed numerically, which gives rise to the discrete-time detection problems considered so far. In fact, one might even question the need to study detection of continuous-time signals in this context. To motivate such study, consider the following issues:

• Since sampling generally discards information, one would like to know how much information is lost in the sampling process.
• High-resolution sampling produces long observation vectors, and finite-sample performance metrics such as the GSNR of (5.30) might be cumbersome.
• The use of limits (as sampling resolution increases) often provides considerable insight into the mathematical structure of the problem. In particular, problems of singular detection (zero error probability) arise in some continuous-time detection problems, even though their discrete-time counterparts have nonzero error probabilities.

By acquiring $n$ samples of a continuous-time waveform $y(t)$, $0 \le t \le T$, we mean either acquiring values $y(t_i)$ for $n$ distinct time instants $t_i$, $1 \le i \le n$, or more generally acquiring linear projections $\int_0^T y(t)\phi_i(t)\, dt$ for $n$ suitably designed measurement functions $\phi_i(\cdot)$, $1 \le i \le n$. Returning to the observed continuous-time waveform $Y = \{Y(t),\ t \in [0, T]\}$, the main technical difficulty is that $Y$ does not have a density, hence it is not clear at this point what the likelihood ratio should be. A variety of approaches can be developed to approach such detection problems; see [5] for a comprehensive survey. As described by Grenander [6], the most intuitive approach for Gaussian processes is the Karhunen–Loève decomposition, or eigenfunction expansion method. The Kullback–Leibler and Chernoff information measures can still be defined. When normalized by $\frac{1}{T}$, they often converge to a limit as $T \to \infty$.

Mathematical Preliminaries. The space of square-integrable functions over $[0, T]$ is denoted by $L^2([0,T])$. The $L^2$ norm of a function $f \in L^2([0,T])$ is $\|f\| \triangleq \left( \int_0^T f^2(t)\, dt \right)^{1/2}$. The Dirac impulse is a generalized function, denoted by $\delta(t)$ (not to be confused with a decision rule) and defined via the property $\int f(t)\delta(t)\, dt = f(0)$ for any function $f$ continuous at the origin. A Riemann–Stieltjes integral $\int_0^T f(t)\, dg(t)$ is the limit of the Riemann sum $\sum_{i=1}^{K} f(t_i)[g(t_i) - g(t_{i-1})]$ for a discretization $0 = t_0 < t_1 < \cdots < t_K = T$ of the interval $[0,T]$, as the mesh size of the discretization tends to zero. For a random waveform $Y = \{Y(t),\ t \in [0,T]\}$, the stochastic integral $\int_0^T f(t)\, dY(t)$ is defined as the limit in the mean-square sense of the random sum $\sum_{i=1}^{K} f(t_i)[Y(t_i) - Y(t_{i-1})]$.

10.2.1 Covariance Kernel
Let $Z = \{Z(t),\ t \in \mathcal{T}\}$ be a random (not necessarily Gaussian) process with mean zero and covariance kernel
$$ C(t, u) \triangleq E[Z(t)Z(u)], \quad t, u \in \mathcal{T}. \tag{10.29} $$
The kernel is symmetric and nonnegative definite. If $\mathcal{T} = \mathbb{R}$ and the process is weakly stationary (aka wide-sense stationary) then $C(t, u) = c(t - u)$ for all $t, u$, and the process has a spectral density
$$ S(f) = \int_{\mathbb{R}} c(t)\, e^{-i2\pi f t}\, dt, \tag{10.30} $$
which is real-valued, nonnegative, and symmetric. Classical Gaussian processes include:

Brownian Motion: $C(t, u) = \sigma^2 \min(t, u)$. This is a nonstationary Gaussian process with independent increments. Its sample paths are continuous with probability 1 but are nowhere differentiable; see Figure 10.1 for a sample path.

[Figure 10.1: Sample path of Brownian motion with σ = 1 (position versus time).]

White Gaussian Noise: $C(t, u) = \sigma^2 \delta(t - u)$, where $\delta$ denotes the Dirac impulse. White Gaussian noise is not physically realizable but serves as a powerful mathematical abstraction. It is also the "derivative" of Brownian motion, $W(t) = \frac{dZ(t)}{dt}$, where the notion of derivative holds in a distributional sense: the stochastic integral $\int f(t)W(t)\, dt = \int f(t)\, dZ(t)$ is well defined for every function $f \in L^2([0,T])$.² White Gaussian noise is stationary with spectral density $S(f) = \sigma^2$ for all $f \in \mathbb{R}$.

Bandlimited White Noise: This process is weakly stationary with spectral density function $S(f) = \sigma^2 \mathbb{1}\{|f| \le B\}$ for some finite $B$ (the bandwidth of the process), and therefore its covariance function is given by $c(t) = 2\sigma^2 B\, \frac{\sin(2\pi B t)}{2\pi B t}$.

Periodic Weakly Stationary Process: A weakly stationary process whose correlation function is periodic with period $T$. The process admits a power spectral mass function.

Ornstein–Uhlenbeck Process: $C(t, u) = \sigma^2 e^{-\gamma |t-u|}$, where $\gamma > 0$ is the inverse of the "correlation time" of the process. This is a stationary Gauss–Markov process, solution to the stochastic differential equation
$$ dZ(t) = -\gamma Z(t)\, dt + dW(t), $$
where $W$ is Brownian motion. The spectral density of the process is $S(f) = \frac{2\gamma\sigma^2}{\gamma^2 + (2\pi f)^2}$, $f \in \mathbb{R}$. This is one of the simplest processes with a rational spectrum (i.e., a ratio of polynomials in $f^2$).

² From a mathematical standpoint, white noise can be treated as a generalized random function, in the same way as a Dirac impulse is a generalized function. From an engineering standpoint, deterministic and random generalized functions are meaningful when multiplied by well-behaved functions and integrated. For instance, $\int f(t)\delta(t)\, dt = f(0)$ for any function $f$ that is continuous at 0. The random variable $\int_0^T f(t)W(t)\, dt$ has mean zero and finite variance $\sigma^2 \int_0^T f^2(t)\, dt$.

10.2.2 Karhunen–Loève Transform

The Karhunen–Loève transform decomposes a random (not necessarily Gaussian) continuous-time process $Z$ into a countable sequence of uncorrelated coefficients. If the process is Gaussian, these coefficients are also mutually independent. Let $\mathcal{T} = [0, T]$ and consider the eigenfunction decomposition of the covariance kernel:
$$ \int_0^T C(t, u)\, v_k(t)\, dt = \lambda_k v_k(u), \quad u \in [0, T], \quad k = 1, 2, \ldots, \tag{10.31} $$
where $v_k$, $k \in \mathbb{N}$, are the eigenfunctions of $C(t, u)$, and $\lambda_k \ge 0$ are the corresponding eigenvalues. The eigenfunctions are orthonormal:
$$ \int_0^T v_k(t)\, v_l(t)\, dt = \mathbb{1}\{k = l\}, \tag{10.32} $$
but do not necessarily form a complete orthonormal basis of $L^2([0,T])$. The coordinates of $Z$ in the eigenfunction basis are
$$ \tilde{Z}_k \triangleq \int_0^T v_k(t)\, Z(t)\, dt, \quad k = 1, 2, \ldots. \tag{10.33} $$
The random variables $\tilde{Z}_k$, $k \ge 1$, have zero mean, have respective variances $\lambda_k$, and are uncorrelated because
$$ E[\tilde{Z}_k \tilde{Z}_l] = E\left[ \int_0^T \!\!\int_0^T v_k(t)\, v_l(u)\, Z(t) Z(u)\, dt\, du \right] = \int_0^T \!\!\int_0^T v_k(t)\, v_l(u)\, C(t, u)\, dt\, du = \lambda_k \mathbb{1}\{k = l\}, \tag{10.34} $$
where the last line follows from (10.31) and (10.32). Thus the KLT provides both geometric orthogonality of the eigenfunctions $\{v_k\}$ and statistical orthogonality of the noise coefficients $\{\tilde{Z}_k\}$. This may be viewed as noise whitening in the continuous-time framework. For weakly stationary processes one sometimes prefers to work with complex-valued eigenfunctions. Then the coefficients $\tilde{Z}_k$ are complex-valued, and the covariance of (10.34) is replaced by $E[\tilde{Z}_k \tilde{Z}_l^*] = \lambda_k \mathbb{1}\{k = l\}$ (note the complex conjugation of $\tilde{Z}_l$).

Example 10.2 Periodic Weakly Stationary Process. The complex exponentials
$$ \frac{1}{\sqrt{T}}\, e^{i2\pi k t/T}, \quad t \in [0, T], \quad k \in \mathbb{Z} \quad (i = \sqrt{-1}) \tag{10.35} $$
are eigenfunctions of $C(t, u)$, and thus
$$ \int_0^T c(t - u)\, e^{i2\pi k u/T}\, du = \left[ \int_0^T c(t - u)\, e^{i2\pi k (u-t)/T}\, du \right] e^{i2\pi k t/T} = \lambda_k e^{i2\pi k t/T} \tag{10.36} $$
with
$$ \lambda_k = \int_0^T c(-u)\, e^{i2\pi k u/T}\, du = \int_0^T c(u) \cos(2\pi k u/T)\, du. $$
We obtain real-valued eigenfunctions by taking the real and imaginary parts of (10.35) and normalizing to unit energy. Thus
$$ v_0(t) = \frac{1}{\sqrt{T}}, \quad v_{2k-1}(t) = \sqrt{\frac{2}{T}} \sin(2\pi k t/T), \quad v_{2k}(t) = \sqrt{\frac{2}{T}} \cos(2\pi k t/T), \quad k = 1, 2, \ldots, \ t \in [0, T]. \tag{10.37} $$
The index $k$ represents a discrete frequency. The coefficients $\tilde{Z}_k$ of (10.33) are the Fourier coefficients of $Z$, and the sequence of eigenvalues $\lambda_k$ is the power spectral mass function of the process. By the inverse Fourier transform formula, we have
$$ c(t) = \frac{\lambda_0}{T} + \frac{2}{T} \sum_{k=1}^{\infty} \lambda_{2k} \cos(2\pi k t/T), \quad t \in [0, T]. \tag{10.38} $$
□
Example 10.3 Brownian Motion. The eigenfunctions and the corresponding eigenvalues are given by
$$ v_k(t) = \sqrt{\frac{2}{T}} \sin\left( \frac{(2k-1)\pi t}{2T} \right), \quad \lambda_k = \frac{4\sigma^2 T^2}{(2k-1)^2 \pi^2}, \quad k = 1, 2, \ldots. \tag{10.39} $$
The eigenfunctions form a complete orthonormal basis of $L^2([0,T])$. □

Example 10.4 White Noise. If $C(t, u) = \sigma^2 \delta(t - u)$ then any complete orthonormal basis of $L^2([0,T])$ is an eigenfunction basis for $C(t, u)$, and the eigenvalues $\lambda_k$ are all equal to $\sigma^2$. □

Example 10.5 Ornstein–Uhlenbeck Process. It will be convenient here to assume the observation interval is $[-T, T]$. The eigenvalues $\lambda_k$ are samples of the spectral density:
$$ \lambda_k = \frac{2\gamma\sigma^2}{\gamma^2 + b_k^2} = S\!\left( \frac{b_k}{2\pi} \right), \quad k = 1, 2, \ldots, $$
where the coefficients $b_k$ are the solutions to the nonlinear equation $\left( \tan(bT) - \frac{\gamma}{b} \right)\left( \tan(bT) + \frac{b}{\gamma} \right) = 0$. The eigenfunctions are given by
$$ v_k(t) = T^{-1/2} \left( 1 + \frac{\sin(2b_k T)}{2 b_k T} \right)^{-1/2} \cos(b_k t), \quad -T \le t \le T, \quad k \text{ odd}, $$
$$ v_k(t) = T^{-1/2} \left( 1 - \frac{\sin(2b_k T)}{2 b_k T} \right)^{-1/2} \sin(b_k t), \quad -T \le t \le T, \quad k \text{ even}. $$
The eigenfunctions are sines and cosines whose frequencies are not harmonically related. However, the coefficients $b_k$ form an increasing sequence, and $b_k \sim \frac{(k-1)\pi}{2T}$ as $k \to \infty$. □

Example 10.6 Bandlimited White Noise Process. The eigenvectors of the covariance kernel are prolate spheroidal wave functions. The number of significant eigenvalues is approximately $2BT$; the remaining eigenvalues decay exponentially fast. □
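The closed-form Brownian-motion eigenvalues of (10.39) can be checked against a discretization of the kernel. The sketch below uses a Nyström-style midpoint discretization (an illustrative numerical device, not a method from the book):

```python
import numpy as np

sigma, T, m = 1.0, 1.0, 1000
t = (np.arange(m) + 0.5) * T / m                 # midpoint grid on [0, T]
C = sigma**2 * np.minimum.outer(t, t)            # discretized kernel min(t, u)

# Eigenvalues of the integral operator ~ matrix eigenvalues times the grid step.
num = np.sort(np.linalg.eigvalsh(C))[::-1] * (T / m)

k = np.arange(1, 6)
exact = 4 * sigma**2 * T**2 / ((2 * k - 1) ** 2 * np.pi ** 2)   # (10.39)
```

The leading numerical eigenvalues agree with (10.39) to within the discretization error.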
Mercer's Theorem [1]. Assume that $C(t, u)$ is continuous over $[0,T]^2$; equivalently, $Z$ is mean-square continuous [1]. Then $C$ admits the representation
$$ C(t, u) = \sum_{k=1}^{\infty} \lambda_k v_k(t)\, v_k(u), \quad 0 \le t, u \le T, \tag{10.40} $$
where the sum converges uniformly over $[0,T]^2$, i.e.,
$$ \lim_{n\to\infty} \max_{0 \le t, u \le T} \left| C(t, u) - \sum_{k=1}^{n} \lambda_k v_k(t)\, v_k(u) \right| = 0. $$
Letting $t = u$ in (10.40) and integrating, we obtain $\int_0^T C(t, t)\, dt = \sum_k \lambda_k$. Since this quantity is finite, it follows that the eigenvalues $\lambda_k$ decay faster than $k^{-1}$.
Reconstruction of Z. The process $Z$ may be represented by
$$ Z(t) = \sum_{k=1}^{\infty} \tilde{Z}_k v_k(t), \quad t \in [0, T], \tag{10.41} $$
where the sum converges in the mean-square sense, uniformly over $[0, T]$:
$$ \lim_{n\to\infty} \max_{0 \le t \le T} E\left[ Z(t) - \sum_{k=1}^{n} \tilde{Z}_k v_k(t) \right]^2 = 0. $$
If $Z$ is white Gaussian noise, its covariance kernel $C(t, u) = \sigma^2 \delta(t - u)$ does not satisfy the conditions of Mercer's theorem, and the sum of (10.41) does not converge in the mean-square sense. However, (10.40) and (10.41) hold in a weaker sense: for any function $f \in L^2([0,T])$ with coefficients $f_k = \int_0^T f(t)\, v_k(t)\, dt$, there holds
$$ \int_0^T f(u)\, C(t, u)\, du = \int_0^T f(u) \left[ \sum_{k=1}^{\infty} \lambda_k v_k(t)\, v_k(u) \right] du = \sum_{k=1}^{\infty} \lambda_k f_k v_k(t) $$
and
$$ \int_0^T f(t)\, Z(t)\, dt = \sum_{k=1}^{\infty} f_k \tilde{Z}_k. \tag{10.42} $$

Infinite Observation Interval. For $\mathcal{T} = \mathbb{R}$, a spectral representation of weakly stationary processes may be obtained by extension of the spectral representation (10.37)–(10.41) of periodic weakly stationary processes. For processes that are mean-square continuous, Cramér's spectral representation theorem [7] gives
$$ Z(t) = \int_{\mathbb{R}} e^{i2\pi f t}\, d\tilde{Z}(f), \quad t \in \mathbb{R}, \tag{10.43} $$
where $\tilde{Z}$ is a process with orthogonal increments, i.e., for any integer $n$ and any $f_1 < f_2 < \cdots < f_n$, the increments $\tilde{Z}(f_{i+1}) - \tilde{Z}(f_i)$, $1 \le i < n$, are uncorrelated.
10.2.3 Detection of Known Signals in Gaussian Noise
Consider the binary hypothesis test
$$ \begin{cases} H_0: & Y(t) = Z(t) \\ H_1: & Y(t) = s(t) + Z(t), \end{cases} \quad t \in [0, T], \tag{10.44} $$
where $s = \{s(t),\ t \in [0,T]\}$ is a known signal with finite energy
$$ E_s = \int_0^T s^2(t)\, dt \tag{10.45} $$
and $Z = \{Z(t),\ t \in [0,T]\}$ is a Gaussian random process with mean zero and continuous covariance kernel $C(t, u)$. Denote the signal and noise coefficients in the kernel eigenfunction basis by
$$ \tilde{s}_k = \int_0^T v_k(t)\, s(t)\, dt, \quad k = 1, 2, \ldots, \tag{10.46} $$
$$ \tilde{Z}_k = \int_0^T v_k(t)\, Z(t)\, dt, \quad k = 1, 2, \ldots. \tag{10.47} $$
The coefficients of the observed $Y$ are given by
$$ \tilde{Y}_k = \int_0^T v_k(t)\, Y(t)\, dt, \quad k = 1, 2, \ldots. \tag{10.48} $$
A representation for $Z$ in terms of $\{\tilde{Z}_k\}_{k \ge 1}$ was given in (10.41). We also have the signal reconstruction formula
$$ s(t) = s_0(t) + \sum_{k=1}^{\infty} \tilde{s}_k v_k(t), \quad t \in [0, T], \tag{10.49} $$
where convergence of the sum is in the mean-square sense. The signal $s_0$ is the component of $s$ in the null space of the covariance kernel, i.e., $\int_0^T C(t, u)\, s_0(u)\, du = 0$. We have $\int_0^T s_0^2(t)\, dt + \sum_{k=1}^{\infty} \tilde{s}_k^2 = E_s$. If $s_0 \neq 0$, the detection problem is singular because there is no noise in the direction of $s_0$, hence perfect detection is possible using the test statistic $\int_0^T s_0(t)\, Y(t)\, dt$, which equals 0 and $\|s_0\|^2$ with probability 1 under $H_0$ and $H_1$, respectively. To avoid such trivial conclusions, we assume that $s_0 = 0$ in the following. The hypothesis test of (10.44) may then be written as
$$ \begin{cases} H_0: & \tilde{Y}_k = \tilde{Z}_k \\ H_1: & \tilde{Y}_k = \tilde{s}_k + \tilde{Z}_k, \end{cases} \quad k = 1, 2, \ldots. \tag{10.50} $$
This is the same formulation as in the discrete-time case, the only difference being that the sequence of observations $\tilde{Y}_k$ is not finite. If we use the first $n$ coefficients to construct a LRT, the LLR is given by
$$ \ln L_n(Y) = \sum_{k=1}^{n} \left( \frac{\tilde{s}_k \tilde{Y}_k}{\lambda_k} - \frac{\tilde{s}_k^2}{2\lambda_k} \right), \tag{10.51} $$
similarly to (5.26). For Bayesian hypothesis testing with equal priors, the minimum error probability of the LLRT $\ln L_n(Y) \underset{H_0}{\overset{H_1}{\gtrless}} 0$ is $P_e = Q(d_n/2)$, where
$$ d_n^2 = \sum_{k=1}^{n} \frac{\tilde{s}_k^2}{\lambda_k} \tag{10.52} $$
is a nondecreasing sequence. Now what happens as $n \to \infty$? Assume that the sum of (10.52) converges to a finite limit,
$$ d^2 = \sum_{k=1}^{\infty} \frac{\tilde{s}_k^2}{\lambda_k}. \tag{10.53} $$
Then the limit of the log-likelihood ratios,
$$ \ln L(Y) \triangleq \sum_{k=1}^{\infty} \left( \frac{\tilde{s}_k \tilde{Y}_k}{\lambda_k} - \frac{\tilde{s}_k^2}{2\lambda_k} \right), \tag{10.54} $$
has finite mean and variance under both $H_0$ and $H_1$ (see Exercise 10.6) and therefore the limit exists in the mean-square sense. For Bayesian hypothesis testing with equal priors, the minimum error probability of the test $\ln L(Y) \underset{H_0}{\overset{H_1}{\gtrless}} 0$ is $P_e = Q(d/2)$.

If only a finite number $n$ of coefficients $\tilde{Y}_k$ are evaluated and used in the log-likelihood ratio, the performance of the LRT is determined by the sum (10.52). The performance loss depends on $n$ as well as on the convergence rate of the sum. For instance, if the individual terms $\tilde{s}_k^2/\lambda_k$ decay as $k^{-\alpha}$ with $\alpha > 1$, then the loss in $d^2$ is $O(n^{1-\alpha})$. We now evaluate the limits (10.53) and (10.54) for classical noise processes.
White Noise. If $C(t, u) = \sigma^2 \delta(t - u)$ then any orthonormal basis of $L^2([0,T])$ is an eigenfunction basis for $C(t, u)$, with corresponding eigenvalues $\lambda_k$ all equal to $\sigma^2$. In this case (10.54) becomes
$$ \ln L(Y) = \sum_{k=1}^{\infty} \left( \frac{\tilde{s}_k \tilde{Y}_k}{\sigma^2} - \frac{\tilde{s}_k^2}{2\sigma^2} \right) = \frac{1}{\sigma^2} \int_0^T s(t)\, Y(t)\, dt - \frac{E_s}{2\sigma^2}, \tag{10.55} $$
where the second equality follows from (10.42). Also (10.53) becomes
$$ d^2 = \frac{E_s}{\sigma^2} < \infty, $$
hence $P_e = Q(d/2) > 0$.³

³ A simpler way to derive (10.55) in the white Gaussian noise case is to choose the first eigenfunction $v_1 = E_s^{-1/2}\, s$ of the covariance kernel to be aligned with the signal $s$. Then $\tilde{s}_1 = \sqrt{E_s}$ and $\tilde{s}_k = 0$ for all $k > 1$. The sum of (10.55) then reduces to a single nonzero term.
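A Monte Carlo sketch of the correlator LLR (10.55) in discretized white Gaussian noise, checking $P_e = Q(d/2)$ with $d^2 = E_s/\sigma^2$ (the signal, grid size, and noise level are illustrative choices, not from the book):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)
m, T, sigma = 200, 1.0, 0.5
dt = T / m
t = (np.arange(m) + 0.5) * dt
s = np.sin(2 * np.pi * 3 * t)                    # known finite-energy signal
Es = np.sum(s**2) * dt
d = np.sqrt(Es) / sigma

trials = 20000
errors = 0
for h in (0, 1):
    # Discretized white noise: per-sample variance sigma^2 / dt.
    Z = rng.standard_normal((trials, m)) * sigma / np.sqrt(dt)
    Y = Z + (s if h == 1 else 0.0)
    llr = (Y @ s) * dt / sigma**2 - Es / (2 * sigma**2)   # discretized (10.55)
    errors += np.sum(llr > 0) if h == 0 else np.sum(llr <= 0)
pe_hat = errors / (2 * trials)

Q = lambda x: 0.5 * (1 - erf(x / sqrt(2)))
pe_theory = Q(d / 2)
```

The empirical error probability matches $Q(d/2)$ to within Monte Carlo fluctuation.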
Colored Noise. In general the expression (10.54) does not reduce to a formula as simple as (10.55). In fact, even for finite-energy signals, the sum (10.53) might diverge, in which case $P_e = 0$. For a periodic stationary process, divergence of the sum (10.53) means that the noise variance $\lambda_k$ at high frequencies decays too fast relative to the signal energy components. The underlying observational model is mathematically valid but does not correspond to a physically plausible situation. The same conclusion applies to aperiodic stationary processes. It is therefore common to model colored noise by the superposition of colored noise with smooth covariance kernel $\tilde{C}(t, u)$ and an independent white noise component with variance $\sigma_W^2$: $C(t, u) = \tilde{C}(t, u) + \sigma_W^2 \delta(t - u)$.

Brownian Motion. Detection of a signal in Brownian motion can be analyzed by differentiating the observed signal and reducing the problem to one of detection in white Gaussian noise. Assume the signal $s$ satisfies $s(0) = 0$ and is differentiable on $[0, T]$. Then the process $Y$ may be represented by the sample $Y(0)$ together with the derivative process $Y'(t)$, $t \in [0, T]$. Since $Y(0) = Z(0)$ has the same distribution under $H_0$ and $H_1$, we may write the hypotheses as
$$ \begin{cases} H_0: & Y'(t) = W(t) \\ H_1: & Y'(t) = s'(t) + W(t), \end{cases} \quad t \in [0, T], \tag{10.56} $$
where $W$ is white Gaussian noise. Therefore $P_e = Q(d/2)$ where
$$ d^2 = \frac{1}{\sigma^2} \int_0^T [s'(t)]^2\, dt. $$
Note that $d^2$ might be infinite (perfect detection) even if the signal $s$ has finite energy. This occurs, for instance, with $s(t) = (T - t)^{-1/3}$, $0 \le t < T$. The case of a differentiable signal with $s(0) \neq 0$ always results in a singular detection problem. Indeed the test $\delta(Y) = \mathbb{1}\{Y(0) > \frac{s(0)}{2}\}$ has zero error probability because $\mathrm{Var}[Z(0)] = 0$.
10.2.4 Detection of Gaussian Signals in Gaussian Noise
Consider the binary hypothesis test
$$ \begin{cases} H_0: & Y(t) = Z(t) \\ H_1: & Y(t) = S(t) + Z(t), \end{cases} \quad t \in [0, T], \tag{10.57} $$
where $S = \{S(t),\ t \in [0,T]\}$ is a Gaussian signal with mean zero and covariance kernel $C_S(t, u) \triangleq E[S(t)S(u)]$, $t, u \in [0, T]$, and $Z = \{Z(t),\ t \in [0,T]\}$ is a Gaussian random process with mean zero and covariance kernel $C_Z(t, u) \triangleq E[Z(t)Z(u)]$, $t, u \in [0, T]$. Assume that $S$ and $Z$ are independent and that the covariance kernels $C_S$ and $C_Z$ have the same eigenfunctions $\{v_k,\ k = 1, 2, \ldots\}$, and respective eigenvalues $\lambda_{S,k}$ and $\lambda_{Z,k}$, $k = 1, 2, \ldots$. Then the test (10.57) can be equivalently stated as
$$ \begin{cases} H_0: & \tilde{Y}_k = \tilde{Z}_k \\ H_1: & \tilde{Y}_k = \tilde{S}_k + \tilde{Z}_k, \end{cases} \quad k = 1, 2, \ldots \tag{10.58} $$
or
$$ \begin{cases} H_0: & \tilde{Y}_k \sim \text{indep. } \mathcal{N}(0, \lambda_{Z,k}) \\ H_1: & \tilde{Y}_k \sim \text{indep. } \mathcal{N}(0, \lambda_{S,k} + \lambda_{Z,k}), \end{cases} \quad k = 1, 2, \ldots. \tag{10.59} $$
Again this is the same formulation as in the discrete-time case (5.65), the only difference being that the sequence of observations $\tilde{Y}_k$ is not finite. It will be notationally convenient to work with the normalized random variables $\hat{Y}_k = \tilde{Y}_k / \sqrt{\lambda_{Z,k}}$ and the variance ratios $\xi_k = \lambda_{S,k}/\lambda_{Z,k}$, in which case the test can be equivalently stated as
$$ \begin{cases} H_0: & \hat{Y}_k \sim \text{indep. } \mathcal{N}(0, 1) \\ H_1: & \hat{Y}_k \sim \text{indep. } \mathcal{N}(0, 1 + \xi_k), \end{cases} \quad k = 1, 2, \ldots. \tag{10.60} $$
The log-likelihood ratio for $n$ coefficients is given by
$$ \ln L_n(Y) = \frac12 \sum_{k=1}^{n} \ln \frac{1}{1 + \xi_k} + \frac12 \sum_{k=1}^{n} \frac{\xi_k}{1 + \xi_k}\, \hat{Y}_k^2. \tag{10.61} $$
Similarly to (5.66), we may write the LLRT for $n$ coefficients as
$$ \sum_{k=1}^{n} \frac{\xi_k}{1 + \xi_k}\, \hat{Y}_k^2 \underset{H_0}{\overset{H_1}{\gtrless}} \tau. \tag{10.62} $$
By (6.3), we have
$$ D(p_0^{(n)} \| p_1^{(n)}) = -E_0[\ln L_n(Y)] = \sum_{k=1}^{n} \psi\!\left( \frac{1}{1 + \xi_k} \right), \tag{10.63} $$
$$ D(p_1^{(n)} \| p_0^{(n)}) = E_1[\ln L_n(Y)] = \sum_{k=1}^{n} \psi(1 + \xi_k), \tag{10.64} $$
where $\psi(x) = \frac12 (x - 1 - \ln x)$ was defined in (6.4). Now consider the asymptotic scenario where $n \to \infty$. First assume the variance ratios $\{\xi_k\}$ are such that the two sequences of KL divergences (10.63) and (10.64) converge to finite limits, respectively denoted by
$$ D(P_0 \| P_1) \triangleq -E_0[\ln L(Y)] = \sum_{k=1}^{\infty} \psi\!\left( \frac{1}{1 + \xi_k} \right) \tag{10.65} $$
and
$$ D(P_1 \| P_0) \triangleq E_1[\ln L(Y)] = \sum_{k=1}^{\infty} \psi(1 + \xi_k). \tag{10.66} $$
Then the limit of the random sequence (10.61) exists in the mean-square sense:
$$ \ln L(Y) \triangleq \frac12 \sum_{k=1}^{\infty} \ln \frac{1}{1 + \xi_k} + \frac12 \sum_{k=1}^{\infty} \frac{\xi_k}{1 + \xi_k}\, \hat{Y}_k^2, \tag{10.67} $$
as it has finite mean and variance under both $H_0$ and $H_1$ (see Exercise 10.6). Then detection performance saturates as an increasing number $n$ of terms is evaluated.
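The convergence of the sums (10.63)–(10.64) is easy to examine numerically. A sketch with an illustrative sequence $\xi_k = k^{-3/4}$ (chosen by me so that $\sum_k \xi_k^2 < \infty$):

```python
import numpy as np

psi = lambda x: 0.5 * (x - 1 - np.log(x))   # psi of (6.4)
k = np.arange(1, 200001)
xi = k ** -0.75                             # decays faster than 1/sqrt(k)

terms01 = psi(1.0 / (1.0 + xi))             # summands of D(P0||P1), (10.63)
terms10 = psi(1.0 + xi)                     # summands of D(P1||P0), (10.64)
D01 = terms01.sum()
D10 = terms10.sum()
D10_half = terms10[:100000].sum()           # partial sum, to gauge the tail
```

The tail of the sum is already tiny, consistent with the quadratic behavior $\psi(1+\epsilon) \approx \epsilon^2/4$ near 1.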
In many problems, however, the KL divergence sequences of (10.63) and (10.64) do not converge to finite limits. For instance, if the sequence $\xi_k$ does not vanish, the sequences (10.63) and (10.64) are unbounded, and one may show that perfect detection is obtained. This may hold even if the sequence $\xi_k$ vanishes. The second-order Taylor series expansion of $\psi(x)$ near $x = 1$ is given by $\psi(1 + \epsilon) = \frac14 \epsilon^2 + O(\epsilon^3)$. Hence we need $\sum_{k=1}^{\infty} \xi_k^2 < \infty$ for the sums (10.63) and (10.64) to converge, hence $\xi_k$ must decay faster than $1/\sqrt{k}$. By application of Exercise 8.11, the Chernoff divergence is given by
$$ d_u(p_0^{(n)}, p_1^{(n)}) = \frac12 \sum_{k=1}^{n} \left[ \ln(1 + (1-u)\xi_k) - (1-u)\ln(1 + \xi_k) \right], \quad \forall u \in (0, 1), \tag{10.68} $$
which can be used to derive upper bounds on $P_F$, $P_M$, and $P_e$ as in (7.25), (7.26), (7.35) and lower bounds as in Theorem 8.4.

Example 10.7 Let $T = 1$ and $S$ be Brownian motion with parameter $\epsilon$. Then we have $\xi_k = \epsilon/\sigma^2$ for all $k$. The Chernoff divergence of (10.68) tends to $\infty$ as $n \to \infty$, and perfect detection is possible no matter how small $\epsilon$ is! Indeed arbitrarily small error probabilities are achieved using the test (10.62) with an arbitrarily large $n$. An even simpler test that has the same performance is the following one. Define the random variables $\hat{Y}_k = \sqrt{n}\left[ Y\!\left(\frac{k}{n}\right) - Y\!\left(\frac{k-1}{n}\right) \right]$, $1 \le k \le n$, which are mutually independent given either hypothesis (owing to the independent-increments property of Brownian motion) and have variances $\sigma^2$ and $\epsilon + \sigma^2$ under $H_0$ and $H_1$, respectively. Since these random variables have the same statistics as $\{\tilde{Y}_k,\ 1 \le k \le n\}$, the LRT based on $\{\hat{Y}_k,\ 1 \le k \le n\}$ has the same performance as (10.62). □
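A Monte Carlo sketch of the increment test in Example 10.7: an energy detector on $n$ normalized increments with a simple midpoint threshold (my choice, used only to show that the error probability vanishes as $n$ grows; parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
eps, sigma, trials = 0.5, 1.0, 5000

def error_prob(n):
    # Normalized increments: variance sigma^2 under H0, eps + sigma^2 under H1.
    y0 = rng.standard_normal((trials, n)) * sigma
    y1 = rng.standard_normal((trials, n)) * np.sqrt(sigma**2 + eps)
    stat0 = np.sum(y0**2, axis=1)
    stat1 = np.sum(y1**2, axis=1)
    thr = n * (sigma**2 + eps / 2)          # midpoint between the two means
    e = np.sum(stat0 > thr) + np.sum(stat1 <= thr)
    return e / (2 * trials)

p_small = error_prob(20)
p_large = error_prob(500)
```

With more increments the two chi-square statistics concentrate around their means and the error probability drops sharply.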
Periodic Stationary Processes. If both $S$ and $Z$ are Gaussian periodic stationary processes, the eigenvalues $\lambda_{S,k}$ and $\lambda_{Z,k}$ are samples of the discrete spectral densities of these two processes. Similarly to the known-signal case, colored noise whose spectral mass function falls off at the same rate or faster than the signal's energy results in perfect detection.
10.3 Poisson Processes

Recall the following definitions [1]. The Poisson random variable $N$ with parameter $\lambda > 0$ has pmf
$$ p(n) = \frac{\lambda^n e^{-\lambda}}{n!}, \quad n = 0, 1, 2, \ldots \tag{10.69} $$
and cgf $\kappa(u) = \ln E[e^{uN}] = \lambda(e^u - 1)$. A function $f$ on $\mathbb{R}_+$ is a counting function if $f(0) = 0$ and $f$ is nondecreasing, right-continuous, and integer-valued. A random process is a counting process if its sample paths are counting functions with probability 1. A random process $Y(t)$, $t \ge 0$, has independent increments if for any integer $n$ and any $0 \le t_1 \le t_2 \le \cdots \le t_n$, the increments $Y(t_{i+1}) - Y(t_i)$, $1 \le i < n$, are mutually independent.

DEFINITION 10.4 An inhomogeneous Poisson process with rate function $\lambda(t)$, $t \ge 0$, is a counting process $Y(t)$, $t \ge 0$, with the following two properties:
(i) $Y$ has independent increments;
(ii) for any $0 \le t < u$, the increment $Y(u) - Y(t)$ is a Poisson random variable with parameter $\int_t^u \lambda(v)\, dv$.

Denote by $N$ the number of counts in the interval $[0, T]$ and by $T_1 < T_2 < \cdots < T_N$ the arrival times in $[0, T]$. The sample path $Y(t)$ can be represented by the tuple $(N, T_1, T_2, \ldots, T_N)$. See Figure 10.2 for an illustration.

[Figure 10.2: (a) Intensity function λ(t) for an inhomogeneous Poisson process; (b) a sample path Y(t) with arrival times T₁, T₂, …, T_N.]

LEMMA 10.1 [8] The joint pmf for $N$ and density for $T_1 < T_2 < \cdots < T_N$ is given by
$$ p(t_1, \ldots, t_n, N = n) = \exp\left\{ \int_0^T \ln \lambda(t)\, dy(t) - \int_0^T \lambda(t)\, dt \right\}, \quad n = 0, 1, 2, \ldots, \quad 0 < t_1 < t_2 < \cdots < t_n < T, \tag{10.70} $$
where $y(t) = \sum_{i=1}^{n} \mathbb{1}\{t \ge t_i\}$ is the counting function with $n$ jumps located at times $t_1, \ldots, t_n$.

The expression (10.70) is sometimes called a sample-function density [8]. Now consider the problem of testing between two inhomogeneous Poisson processes $P_0$ and $P_1$ with known intensity functions $s_j(t)$, $0 \le t \le T$, $j = 0, 1$, given observations $Y = \{Y(t),\ 0 \le t \le T\}$:
$$ \begin{cases} H_0: & Y \text{ has intensity function } s_0(t), \quad 0 \le t \le T \\ H_1: & Y \text{ has intensity function } s_1(t), \quad 0 \le t \le T. \end{cases} \tag{10.71} $$
Denote by $N$ the number of counts in the interval $[0, T]$, by $T_1, \ldots, T_N$ the arrival times, and by $p_0$ and $p_1$ the sample-function densities under $H_0$ and $H_1$, respectively. Using (10.70), we obtain the likelihood ratio as
L(Y) = p1(T1, ..., T_N, N) / p0(T1, ..., T_N, N)
 = exp{ ∑_{i=1}^N ln[s1(T_i)/s0(T_i)] − ∫_0^T [s1(t) − s0(t)] dt }  if N > 0,
   exp{ −∫_0^T [s1(t) − s0(t)] dt }  if N = 0,
 = exp{ ∫_0^T ln[s1(t)/s0(t)] dY(t) − ∫_0^T [s1(t) − s0(t)] dt }.  (10.72)
Hence the LRT admits the simple implementation

∑_{i=1}^N ln[s1(T_i)/s0(T_i)]  ≷_{H0}^{H1}  τ = ln η + ∫_0^T [s1(t) − s0(t)] dt,  (10.73)

where the left-hand side is understood to be zero when N = 0.

The KL divergence between the two processes is given by

D(P0‖P1) = −E0[ln L(Y)]
 = ∫_0^T [ s0(t) ln( s0(t)/s1(t) ) + s1(t) − s0(t) ] dt
 = ∫_0^T s0(t) φ( s0(t)/s1(t) ) dt,

where we have used E0[dY(t)] = s0(t) dt in the first line and introduced the function φ(x) = ln x − 1 + 1/x, x > 0, in the second line, where φ(x) = 2ψ(1/x) and ψ was defined in (6.4).
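The LRT statistic (10.73) and the KL divergence formula above are easy to check by simulation. The sketch below is my own illustration (not from the text); the constant intensities s0 ≡ 1 and s1 ≡ 2 on [0, 1] are arbitrary choices, and since sampling is done under H0 with constant intensity, a homogeneous Poisson sampler suffices. The Monte Carlo average of −ln L(Y) under H0 should match the closed-form KL divergence, here 1 − ln 2 ≈ 0.307.

```python
import math
import random

def sample_poisson_path(rate, T, rng):
    """Arrival times of a homogeneous Poisson process with the given rate on [0, T]."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(rate)
        if t > T:
            return times
        times.append(t)

def log_lr(arrivals, s0, s1, integral):
    """Log of (10.72): sum of ln(s1/s0) at the arrival times minus integral of (s1 - s0)."""
    return sum(math.log(s1(t) / s0(t)) for t in arrivals) - integral

rng = random.Random(0)
s0 = lambda t: 1.0   # intensity under H0 (arbitrary choice)
s1 = lambda t: 2.0   # intensity under H1 (arbitrary choice)
T, n_grid = 1.0, 1000

# Midpoint-rule approximation of the integral of (s1 - s0) over [0, T], computed once
dt = T / n_grid
integral = sum((s1((i + 0.5) * dt) - s0((i + 0.5) * dt)) * dt for i in range(n_grid))

# Monte Carlo estimate of D(P0||P1) = E0[-ln L(Y)]; the closed form here is 1 - ln 2
runs = 20000
est = -sum(log_lr(sample_poisson_path(1.0, T, rng), s0, s1, integral)
           for _ in range(runs)) / runs
print(abs(est - (1 - math.log(2))) < 0.05)  # → True
```

With these constant intensities the check is exact in expectation: −ln L = ∫(s1 − s0) − N ln 2 and N is Poisson(1), so E0[−ln L] = 1 − ln 2.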
PROPOSITION 10.1 Assume the intensity ratio s1(t)/s0(t) is bounded away from zero and infinity for all t ∈ [0, T]. For each u ∈ (0, 1), the Chernoff distance between P0 and P1 is given by

d_u(P0, P1) = −ln E0[L(Y)^u] = ∫_0^T [ (1 − u)s0 + u s1 − s0^{1−u} s1^u ] dt.  (10.74)
Proof See Section 10.5.
10.4 General Processes
We conclude this chapter by introducing a general approach to hypothesis testing based on Radon–Nikodym derivatives. This section requires some notions of measure theory.
10.4.1 Likelihood Ratio
Consider a measurable space (Y, F) consisting of a sample space Y and a σ-algebra F of subsets of Y. Let P0 and P1 be two probability measures defined on F. Let µ be a probability measure on F such that both P0 and P1 are absolutely continuous with respect to µ (denoted by P0 ≪ µ and P1 ≪ µ), i.e., for any set A ∈ F such that µ(A) = 0 we must have P0(A) = 0 and P1(A) = 0. Both P0 and P1 are also said to be dominated by µ. This condition is met, for instance, if µ = ½(P0 + P1). If there exists a set E ∈ F such that P0(E) = 0 and P1(E) = 1, then P0 and P1 are said to be orthogonal probability measures (denoted by P0 ⊥ P1). The detection problem is singular in this case: indeed, the decision rule δ(y) = 1{y ∈ E} has zero error probability under both H0 and H1. Several of the problems treated in Sections 10.2.3 and 10.2.4 fall into that category. If P0 and P1 are dominated by each other, then P0 and P1 are said to be equivalent (denoted by P0 ≡ P1), and perfect detection cannot be achieved. Intermediate cases exist where P0 and P1 are neither orthogonal nor equivalent (e.g., P0 = Uniform[0, 2] and P1 = Uniform[1, 3]). By the Radon–Nikodym theorem [7], there exist functions p0(y) and p1(y), called generalized probability densities, such that

P0(A) = ∫_A p0(y) dµ(y),  P1(A) = ∫_A p1(y) dµ(y),  ∀A ∈ F.
These functions are also called Radon–Nikodym derivatives and denoted by

p0(y) = dP0(y)/dµ(y)  and  p1(y) = dP1(y)/dµ(y).

Let E_j = {y : p_j(y) = 0} for j = 0, 1, and E = {y : p0(y) > 0, p1(y) > 0}.

DEFINITION 10.5 The likelihood ratio is defined as

L(y) = ∞ if y ∈ E0;  p1(y)/p0(y) if y ∈ E;  0 if y ∈ E1,  (10.75)

where p1(y)/p0(y) is the Radon–Nikodym derivative of P1 relative to P0. The expressions (10.54), (10.67), and (10.72) are all Radon–Nikodym derivatives, whether the underlying p0 and p1 are densities or generalized densities.
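To make Definition 10.5 concrete, here is a small sketch of my own (not from the text), using Lebesgue measure as the dominating measure µ for the intermediate case P0 = Uniform[0, 2], P1 = Uniform[1, 3] mentioned above. The three branches of (10.75) correspond to the regions E0 = [2, 3], E = [1, 2], and E1 = [0, 1).

```python
def p0(y):
    """Generalized density of P0 = Uniform[0, 2] with respect to Lebesgue measure."""
    return 0.5 if 0 <= y <= 2 else 0.0

def p1(y):
    """Generalized density of P1 = Uniform[1, 3] with respect to Lebesgue measure."""
    return 0.5 if 1 <= y <= 3 else 0.0

def L(y):
    """Likelihood ratio (10.75): infinity on E0, p1/p0 on E, zero on E1."""
    if p0(y) == 0.0 and p1(y) > 0.0:
        return float("inf")      # y in E0: impossible under P0
    if p1(y) == 0.0 and p0(y) > 0.0:
        return 0.0               # y in E1: impossible under P1
    if p0(y) > 0.0 and p1(y) > 0.0:
        return p1(y) / p0(y)     # y in E: ordinary density ratio
    return 0.0                   # outside both supports (a mu-null set here)

print(L(0.5), L(1.5), L(2.5))  # → 0.0 1.0 inf
```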
Likelihood Ratio Convergence Theorem [7] We have seen in Section 10.2 that, for two signal detection problems in continuous-time Gaussian noise, the sequences of likelihood ratios (10.51) and (10.61) converge to a limit. In a more general setting, we may design a sequence of real-valued observables Ỹ_k, k ≥ 1, that faithfully represents the process Y as an increasing number of observations is taken. Corresponding to the first n observations is a likelihood ratio L_n(Ỹ_{1:n}), where Ỹ_{1:n} ≜ (Ỹ1, ..., Ỹn). The likelihood ratio convergence theorem states that, if the distribution of each Ỹ_{1:n} is absolutely continuous with respect to Lebesgue measure in R^n, then

L_n(Ỹ_{1:n}) → L(Y)  i.p. (P0)  and  L_n(Ỹ_{1:n}) → L(Y)  i.p. (P1),  (10.76)

where i.p. denotes convergence in probability.
10.4.2 Ali–Silvey Distances
Similarly to (6.1), we define the KL divergence between P0 and P1 as

D(P0‖P1) = ∫_Y p0(y) ln[ p0(y)/p1(y) ] dµ(y) = E0[ ln( dP0(Y)/dP1(Y) ) ].  (10.77)
f-divergences are likewise defined as

D_f(P0, P1) = ∫_Y p0(y) f( p1(y)/p0(y) ) dµ(y) = E0[ f( dP1(Y)/dP0(Y) ) ].  (10.78)
By the data processing inequality, we have

∑_i P0(A_i) f( P1(A_i)/P0(A_i) ) ≤ D_f(P0, P1)

for any partition {A_i} of Y. By definition of the Lebesgue integral (10.78),

D_f(P0, P1) = sup_{{A_i}} ∑_i P0(A_i) f( P1(A_i)/P0(A_i) ).
Stationary Processes The general approach can be applied to derive KL divergence rates and Chernoff divergence rates for detection of stationary processes. For instance, let S and Z be two independent continuous-time Gaussian stationary processes with absolutely integrable covariance functions and respective spectral densities S_S(f) and S_Z(f), f ∈ R. Consider the hypothesis test of (10.57) and denote by P0^(T) and P1^(T) the probability measures for {Y(t), 0 ≤ t ≤ T} under H0 and H1. Then the eigenvalues λ_{S,k}, k ≥ 1, of the covariance kernel C_S(t, u), 0 ≤ t, u ≤ T, converge to samples of the spectral density S_S(f) for increasing T, in the sense that

∀f ∈ R : λ_{S,⌈fT⌉} ∼ S_S(f)  as T → ∞,

and similarly

∀f ∈ R : λ_{Z,⌈fT⌉} ∼ S_Z(f)  as T → ∞.
Then the KL divergence rates are given by

D(P0‖P1) = lim_{T→∞} (1/T) D(P0^(T)‖P1^(T)) = ∫_R ψ( S_Z(f) / (S_Z(f) + S_S(f)) ) df,
D(P1‖P0) = lim_{T→∞} (1/T) D(P1^(T)‖P0^(T)) = ∫_R ψ( 1 + S_S(f)/S_Z(f) ) df,

which are the Itakura–Saito distances between the spectral densities S_Z and S_Z + S_S (and vice versa).
The Chernoff divergence rates are given by

d_u(P0, P1) = lim_{T→∞} (1/T) D_u(P0^(T), P1^(T))
 = ½ ∫_R [ (1 − u) ln( 1 + S_S(f)/S_Z(f) ) − ln( 1 + (1 − u) S_S(f)/S_Z(f) ) ] df,  ∀u ∈ (0, 1).
Other inference problems that can be treated using this approach include continuous-time Markov processes, processes with independent increments, and compound Poisson processes.

10.5 Appendix: Proof of Proposition 10.1
For simplicity, we first prove the claim in the case where the intensity ratio s1(t)/s0(t) is constant over the interval [0, T]. Denote this constant by ρ. The random variable N ≜ ∫_0^T dY(t) is Poisson with parameter ∫_0^T s0 under H0. From (10.72) we obtain

L(Y) = exp{ N ln ρ − ∫_0^T [s1(t) − s0(t)] dt }.

Then

d_u(P0, P1) = −ln E0[L(Y)^u]
 = −ln E0[ exp( uN ln ρ − u ∫_0^T [s1 − s0] ) ]
 (a)= −ln ∑_{n=0}^∞ e^{−∫_0^T s0} ( (∫_0^T s0)^n / n! ) exp( un ln ρ − u ∫_0^T [s1 − s0] )
 = ∫_0^T [ (1 − u)s0 + u s1 ] − ln ∑_{n=0}^∞ [ (∫_0^T s0) ρ^u ]^n / n!
 (b)= ∫_0^T [ (1 − u)s0 + u s1 − s0 ρ^u ],  (10.79)

where (a) uses the fact that N is Poisson with parameter ∫_0^T s0 under H0, and (b) uses the identity e^x = ∑_n x^n/n!. The claim (10.74) follows from (10.79) because s0 ρ^u = s0^{1−u} s1^u.
Next consider the more general case where the intensity ratio s1(t)/s0(t) is piecewise constant, taking values ρ_i over regions T_i, i ∈ I, forming a partition of [0, T]:

s1(t)/s0(t) = ∑_{i∈I} ρ_i 1{t ∈ T_i}.

By Definition 10.4, the random variables N_i ≜ ∫_{T_i} dY(t), i ∈ I, are mutually independent and are Poisson distributed with respective parameters ∫_{T_i} s0 under H0. It follows that

d_u(P0, P1) = −ln E0[L(Y)^u]
 = −ln E0[ exp( ∑_{i∈I} { u N_i ln ρ_i − u ∫_{T_i} [s1 − s0] } ) ]
 (a)= −∑_{i∈I} ln E0[ exp( u N_i ln ρ_i − u ∫_{T_i} [s1 − s0] ) ]
 (b)= ∑_{i∈I} ∫_{T_i} [ (1 − u)s0 + u s1 − s0 ρ_i^u ],  (10.80)
where (a) holds because {N_i} are mutually independent, and (b) follows from (10.79) with the interval [0, T] replaced by T_i and ρ by ρ_i. Hence (10.74) holds again.

Finally, in the general case where the intensity ratio s1(t)/s0(t) lies in a finite range [a, b), pick an arbitrarily large integer k and let ρ_i = a + i(b − a)/k for i = 0, 1, ..., k. The level sets

T_i = { t ∈ [0, T] : ρ_{i−1} ≤ s1(t)/s0(t) < ρ_i },  i ∈ I ≜ {1, 2, ..., k},

form a partition of [0, T]. Define the nonnegative random variables N_i = ∫_{T_i} dY(t), i ∈ I. Then d_u(P0, P1) may be sandwiched as follows:

−ln E0[ exp( ∑_{i∈I} { u N_i ln ρ_i − u ∫_{T_i} [s1 − s0] } ) ]
 ≤ d_u(P0, P1) = −ln E0[L(Y)^u]
 < −ln E0[ exp( ∑_{i∈I} { u N_i ln ρ_{i−1} − u ∫_{T_i} [s1 − s0] } ) ],

and the lower and upper bounds can be evaluated using (10.80). We obtain

∑_{i∈I} ∫_{T_i} [ (1 − u)s0 + u s1 − s0 ρ_i^u ] ≤ d_u(P0, P1) ≤ ∑_{i∈I} ∫_{T_i} [ (1 − u)s0 + u s1 − s0 ρ_{i−1}^u ].

Now letting k → ∞, the lower and upper bounds tend to the same limit (10.74). This proves the claim.  □
Exercises
10.1 Detection of Markov Process. Consider two Markov chains with state space Y = {0, 1}. The emission probabilities q0 and q1 are uniform, and the probability transition matrices under H0 and H1 are respectively given by

r0 = [ .1 .9 ; .5 .5 ]  and  r1 = [ .5 .5 ; .9 .1 ].

(a) Give the Bayes decision assuming a uniform prior on the hypotheses and an observed sequence y = (0, 0, 1, 1, 1).
(b) Give the Kullback–Leibler and Rényi divergence rates between the two Markov processes.

10.2 Detection of Markov Process. Repeat Exercise 10.1 when

r0 = [ .5 .5 ; .5 .5 ]  and  r1 = [ .9 .1 ; .1 .9 ].
10.3 Brownian Motion (I). Consider the noise process Z(t) = ∫_0^t B(u) du, t ∈ [0, T], where B(u) is Brownian motion with parameter σ². Consider the binary hypothesis test

H0 : Y(t) = Z(t)
H1 : Y(t) = s(t) + Z(t),  t ∈ [0, T],

where s is a known signal. Assume uniform priors on the hypotheses.
(a) Give the optimal test and the Bayes error probability when s(t) = t².
(b) Repeat Part (a) when s(t) = max{ t − T/2, 0 }.

10.4 Brownian Motion (II). Consider the following ternary hypothesis test:

H1 : Y(t) = t + Z(t)
H2 : Y(t) = t + t² + Z(t)
H3 : Y(t) = 1 + Z(t),  t ∈ [0, 1],

where Z is standard Brownian motion (σ² = 1). Give a Bayes decision rule and the Bayesian error probability assuming a uniform prior.
10.5 Detection of Poisson Process. Consider two Poisson processes P0 and P1 with respective intensity functions s0(t) = e^t and s1(t) = e^{2t} for t ∈ [0, ln 4]. Assume a uniform prior on the corresponding hypotheses H0 and H1.
(a) Give the Bayes decision when counts are observed at times 0.2, 0.9, and 1.2.
(b) Find the optimal exponent u in Proposition 10.1 in terms of the intensity functions s0 and s1.
(c) Give an exponential upper bound on the Bayes error probability.

10.6 Second-Order Statistics of Log-likelihood Ratios. Derive the mean and variance of the log-likelihood ratios of (10.54) and (10.67).

10.7 Signal Detection in Correlated Gaussian Noise. Consider the following detection problem:

H0 : Y_t = Z_t
H1 : Y_t = cos(4πt) + Z_t,  0 ≤ t ≤ 1,

where Z_t is a zero-mean Gaussian random process with covariance kernel

C_Z(t, u) = cos(2π(t − u)),  0 ≤ t, u ≤ 1.
Assuming equal priors, find the minimum error probability.

10.8 Composite Hypothesis Testing. Consider the hypothesis testing problem

H0 : Y_t = Z_t
H1 : Y_t = Θt + Z_t,  0 ≤ t ≤ T,
where {Z_t, 0 ≤ t ≤ T} is standard Brownian motion. The parameter Θ takes values +1 and −1 with equal probabilities and is independent of {Z_t, 0 ≤ t ≤ T}.
(a) Find the likelihood ratio.
(b) Find an α-level Neyman–Pearson test.
References
[1] B. Hajek, Random Processes for Engineers, Cambridge University Press, 2015.
[2] U. Grenander and G. Szegö, Toeplitz Forms and Their Applications, Chelsea, 1958.
[3] R. M. Gray, Toeplitz and Circulant Matrices: A Review, NOW Publishers, 2006. Available from http://ee.stanford.edu/~gray/toeplitz.pdf
[4] M. Gil, "On Rényi Divergences for Continuous Alphabet Sources," M.S. thesis, Dept. of Math. and Stat., McGill University, 2011.
[5] T. Kailath and H. V. Poor, "Detection of Stochastic Processes," IEEE Trans. Inf. Theory, vol. 44, no. 6, pp. 2230–2259, 1998.
[6] U. Grenander, "Stochastic Processes and Statistical Inference," Arkiv Matematik, vol. 1, no. 17, pp. 195–277, 1950.
[7] U. Grenander, Abstract Inference, Wiley, 1981.
[8] D. L. Snyder and M. I. Miller, Random Point Processes in Time and Space, 2nd edition, Springer-Verlag, 1991.
Part II Estimation

11 Bayesian Parameter Estimation
This chapter covers the Bayesian approach to estimation theory, where the unknown parameters are modeled as random. We consider three cost functions, namely squared error, absolute error, and the uniform cost. The cases of scalar and vectorvalued parameter estimation are studied separately to emphasize the similarities and differences between these two cases.
11.1 Introduction
For parameter estimation, the ﬁve ingredients of the statistical decision theory framework are as follows:
• X: The set of parameters. A typical parameter is denoted by θ ∈ X. For parameter estimation, X is generally a compact subset of R^m.
• A: The set of estimates of the parameter. A typical estimate (action) is denoted by a ∈ A.
• C(a, θ): The cost of action a when the parameter is θ, C : A × X → R+.
• Y: The set of observations. The observations could be continuous or discrete.
• D: The set of decision rules, or estimators, typically denoted by θ̂ : Y → A.

Often (but not always) we will have A = X. The set D of decision rules may be constrained. For instance, D could be the class of linear estimators, or a class of estimators that satisfy some unbiasedness or invariance property, as will be discussed later. The cases of scalar and vector-valued θ will sometimes require a different treatment. We use boldface notation (θ) to denote a vector-valued parameter whenever the vector nature of the parameter is used. As in the detection problem, we assume that conditional distributions for the observations {p_θ(y), θ ∈ X} are specified, based on which we can compute the conditional risk of the estimator θ̂:
R_θ(θ̂) = E_θ[ C(θ̂(Y), θ) ] = ∫_Y C(θ̂(y), θ) p_θ(y) dµ(y),  θ ∈ X.

11.2 Bayesian Parameter Estimation
To ﬁnd optimal solutions to the parameter estimation problem, we ﬁrst consider the Bayesian setting in which we assume that the parameter θ ∈ X is a realization of
a random parameter Θ with prior π(θ). Using the general approach to finding Bayes solutions that we introduced earlier in Section 1.6, we obtain the Bayesian estimator as

θ̂_B(y) = arg min_{a∈A} C(a|y),  y ∈ Y,  (11.1)

where

C(a|y) = E[ C(a, Θ) | Y = y ] = ∫_X C(a, θ) π(θ|y) dν(θ)

is the posterior risk of action a given observation y. Here dν(θ) typically denotes the Lebesgue measure dθ (for θ taking values in Euclidean spaces). The posterior distribution of Θ, given Y = y, is expressed as

π(θ|y) = p_θ(y)π(θ) / ∫_X p_θ(y)π(θ) dν(θ).
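As an illustrative numerical sketch (my own, not from the text), (11.1) can be implemented directly on a discretized parameter space for an arbitrary cost function; the Gaussian prior and N(θ, 1) likelihood below are arbitrary choices made for the example.

```python
import math

def posterior(y, thetas, prior):
    """Discretized posterior pi(theta | y) via Bayes' rule for a N(theta, 1) likelihood."""
    lik = [math.exp(-(y - t) ** 2 / 2) for t in thetas]
    w = [l * p for l, p in zip(lik, prior)]
    z = sum(w)
    return [wi / z for wi in w]

def bayes_estimate(y, thetas, prior, cost):
    """(11.1): arg min over actions a of the posterior risk sum_theta cost(a, theta) pi(theta|y)."""
    post = posterior(y, thetas, prior)
    return min(thetas, key=lambda a: sum(cost(a, t) * p for t, p in zip(thetas, post)))

thetas = [i * 0.02 for i in range(-250, 251)]          # grid on [-5, 5]
z = sum(math.exp(-t * t / 2) for t in thetas)
prior = [math.exp(-t * t / 2) / z for t in thetas]     # discretized N(0, 1) prior

# With squared-error cost this recovers the posterior mean, which is y/2 for this model
est = bayes_estimate(1.0, thetas, prior, lambda a, t: (a - t) ** 2)
print(round(est, 2))  # → 0.5
```

Swapping in the absolute-error or uniform cost in the same call yields the median- and mode-based estimators discussed in the next sections.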
We now specialize the cost function to three common choices, for which we can compute θ̂_B more explicitly in terms of the posterior distribution π(θ|y). In Sections 11.3–11.6 we restrict our attention to the scalar parameter case, where X = A ⊆ R. The vector case is considered in Section 11.7.

11.3 MMSE Estimation
Mean squared-error is a popular choice for the cost function because it gives rise to relatively tractable estimators (in fact linear in the case of Gaussian models) and strongly penalizes large errors. Assume thus C(a, θ) = (a − θ)², and E[Θ²] < ∞. In this case,

C(a|y) = E[ (a − Θ)² | Y = y ].

Consequently, the Bayes estimator is the MMSE estimator

θ̂_MMSE(y) = arg min_{a∈R} E[ (a − Θ)² | Y = y ],

where the subscript MMSE stands for minimum mean squared-error. The minimand is a convex quadratic function of a. Hence, the minimizer can be obtained by setting the derivative of C(a|y) with respect to a equal to zero:

0 = dC(a|y)/da = 2E[ (a − Θ) | Y = y ] = 2a − 2E[Θ | Y = y].

Consequently,

θ̂_MMSE(y) = E[Θ | Y = y] = ∫_X θ π(θ|y) dν(θ),  (11.2)

which is the conditional mean of π(θ|y). Hence the MMSE estimator may also be called the conditional-mean estimator. The MMSE solution (11.2) can also be obtained without differentiation using the orthogonality principle:

E[ (a − Θ)² | Y = y ] = E[ (Θ − E[Θ|Y = y])² | Y = y ] + E[ (a − E[Θ|Y = y])² | Y = y ],

which is minimized by setting a = E[Θ | Y = y].
Example 11.1 (See Figure 11.1.) Assume the prior
π(θ) = θ e^{−θ} 1{θ ≥ 0},

and assume Y is uniformly distributed over the interval [0, Θ]:

p_θ(y) = (1/θ) 1{0 ≤ y ≤ θ}.

The unconditional pdf of Y is equal to

p(y) = ∫_R p_θ(y) π(θ) dθ = ∫_y^∞ e^{−θ} dθ = e^{−y}

for y ≥ 0, and is equal to 0 otherwise. Thus, the posterior pdf of Θ given Y = y is

π(θ|y) = p_θ(y)π(θ)/p(y) = e^{−(θ−y)} 1{0 ≤ y ≤ θ},

as depicted in Figure 11.1. The MMSE estimator is obtained by evaluating the conditional mean of the posterior pdf:

θ̂_MMSE(y) = ∫_y^∞ θ π(θ|y) dθ = ∫_y^∞ θ e^{−(θ−y)} dθ
 = ∫_y^∞ (θ − y) e^{−(θ−y)} dθ + y ∫_y^∞ e^{−(θ−y)} dθ.
Figure 11.1 Pdfs for the parameter estimation problem: (a) prior pdf π(θ); (b) conditional pdf p_θ(y); (c) marginal pdf p(y); (d) posterior pdf π(θ|y).
Make the change of variables θ′ = θ − y. Then

θ̂_MMSE(y) = ∫_0^∞ θ′ e^{−θ′} dθ′ + y ∫_0^∞ e^{−θ′} dθ′ = 1 + y.
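The closed form θ̂_MMSE(y) = 1 + y can be sanity-checked by simulation. This is my own sketch (not from the text); the target observation y0, bin half-width, and sample count are arbitrary choices. It draws (θ, y) pairs from the joint model of Example 11.1 and averages θ over pairs whose y falls near y0.

```python
import random

rng = random.Random(1)
y0, h = 0.8, 0.02           # target observation value and bin half-width (arbitrary)
kept = []
for _ in range(400000):
    theta = rng.gammavariate(2.0, 1.0)   # draws from the prior pi(theta) = theta e^{-theta}
    y = rng.uniform(0.0, theta)          # y | theta ~ Uniform[0, theta]
    if abs(y - y0) < h:
        kept.append(theta)
est = sum(kept) / len(kept)
print(abs(est - (1 + y0)) < 0.1)  # theory: E[Theta | Y = y0] = 1 + y0; prints True
```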
11.4 MMAE Estimation
Another common cost function is the absoluteerror cost function C (a , θ) =  a − θ  . This cost function has higher tolerance for large errors relative to the squarederror cost function. Assume that E[± ] < ∞. Then the Bayes estimator is given by the MMAE estimator
³ ˆθMMAE (y ) = arg min π(θ y )a − θ d θ a∈R R µ³ a ¶ ³∞ = arg min π(θ y)(a − θ)dθ − π(θ  y)(a − θ)d θ , a∈R −∞
a
where the subscript MMAE stands for minimum mean absoluteerror. Since the minimand is a convex function of a , we may try to ﬁnd the minimum by setting the derivative equal to zero, 0=
=
³
a d π(θ  y)(a − θ)dθ da −∞
³a
−∞
π(θ  y)d θ −
³∞ a
−
d da
³∞ a
π(θ y )(a − θ)dθ
π(θ y )d θ,
where the last line holds by application of Leibnitz’s rule for differentiation of integrals. Hence ³a ³∞ π(θ y )d θ = π(θ y )dθ = 12 , −∞ a which means that a is the median of the posterior pdf. Consequently,
· ¸ θˆMMAE (y ) = med π(θ y )
(11.3)
is the conditionalmedian estimator. It is convenient to compute the MMAE estimate using the identity ³∞ 1 π(θ  y )dθ = . 2 θˆMMAE Example 11.2
For the estimation problem introduced in Example 11.1, we ﬁnd that
³∞ a
e−(θ − y) dθ
= 12 ⇒ e y −a = 12 ⇒ y = − ln 2 + a ⇒ θˆMMAE ( y) = y + ln 2.
11.5
11.5
263
MAP Estimation
MAP Estimation
Consider now the uniform cost, deﬁned by
C (a, θ)
=
¹
0
1
for some small ²
if a − θ  ≤
² if a − θ  > ²
> 0. The Bayes estimator under this cost function is given by ³ ³ θ(ˆ y ) = arg min π(θ  y)1{a − θ  > ² }d θ = arg min π(θ y )d θ a a R  a−θ  >² µ ³ ¶ = arg min 1− π(θ y )dθ a  a −θ ≤² ³ π(θ y )dθ. (11.4) = arg max a a −θ ≤²
In order to solve this problem, we assume that the posterior pdf
θ = a and approximate the integral in (11.4) with1 ³ π(θ y )dθ ≈ 2²π(a y).
π(θ y ) is continuous at
a −θ ≤²
Therefore,
θˆMAP (y ) = arg max π(θ  y) = arg max pθ ( y)π(θ), θ ∈R θ∈R
(11.5)
which is the conditionalmode estimator, more commonly called the maximum a posteriori (MAP) estimator. Example 11.3
For the estimation problem introduced in Example 11.1, we ﬁnd that
θˆMAP (y ) = arg max e−(θ − y ) = y. θ ≥0 Example 11.4
= (0, ∞), Y = [0, ∞), and pθ ( y ) = θ e−θ y 1{ y ≥ 0}, π(θ) = αe−α θ 1 {θ ≥ 0},
Consider the estimation problem in which X
where α > 0 is ﬁxed. ´∞ Using the fact that 0 x m e−β x dx
= m !/βm+1 for β > 0 and m ∈ Z +, we obtain π(θ  y) = (α + y)2 θ e−(α+ y )θ 1{ y ≥ 0} 1{θ ≥ 0}.
Therefore
Also,
θˆMMSE (y ) =
³∞ θˆMMAE ( y )
³∞ 0
θ π(θ y ) d θ = α +2 y .
π(θ y ) d θ = 0.5 =⇒ θˆMMAE( y ) ≈ α1.+68y .
θ = a , the integral can be approximated by ² [π(a + y ) + π(a − y )] which results in the same MAP estimator as in (11.5).
1 If the posterior pdf is discontinuous at
264
Bayesian Parameter Estimation
Finally
θˆMAP ( y) = arg max θ e−(α + y)θ = α +1 y . θ> 0 Note that for the MAP estimator, we did not use the conditional distribution of y explicitly.
θ given
We see that θˆMAP ( y) < θˆMMAE (y ) < θˆMMSE ( y) in this example. In general, since the median of a distribution usually lies between the mean and the mode, we expect such a relationship or the one with inequalities reversed to hold. Here we modify Example 11.3, assuming now a sequence of n observations Y = (Y1 , Y 2, . . . , Yn ) drawn i.i.d. from the uniform distribution on the interval [0, θ ]. The parameter θ is also random with exponential prior distribution
Example 11.5
π(θ) = θ e−θ 1 {θ ≥ 0}. We ﬁrst derive the posterior pdf (see Figure 11.2)
ºn p ( y ) π(θ) pnθ ( y) π(θ) i =1 θ i π(θ  y) = p( y) = p ( y) θ e−θ ( 1θ )n ºni=1 1 {0 ≤ yi ≤ θ } 1{θ ≥ 0} = p ( y) −θ = p(1y) θen−1 1{θ ≥ 1max y }. ≤i ≤n i
Hence the MAP estimator is given by
−θ
θˆMAP ( y) = arg max π(θ  y) = arg max θen−1 = 1max y, ≤i ≤n i θ ≥ max yi
θ ≥0
1 ≤i ≤n
π(θ  y) π (θy )
0
y
θ
0
(a ) Figure 11.2
The posterior pdf, π(θ  y) : (a) n
max
1≤ i≤n
yi
(b)
= 1; (b) n = 10
θ
11.6
(
Parameter Estimation for Linear Gaussian Models
265
)
C a, θ
θ
±
θ
θ
+±
a
The three cost functions used in the scalar parameter estimation problems, resulting in MMSE, MMAE, and MAP estimators
Figure 11.3
where the last equality holds because the function θ 1−n e−θ is strictly decreasing over its domain, hence its maximum is achieved by the smallest feasible θ . Note again that we avoided unnecessary evaluation of the marginal p( y). Also note that the posterior pdf is sharply peaked for large n, indicating that the residual uncertainty upon seeing n observations is small. (Compare Figures 11.2 (a) and (b).) Finally, note that the computation of the MMSE and MMAE estimators would be more involved here. Figure 11.3 compares the three different cost functions considered in MMSE, MMAE, and MAP estimation. In all cases, the estimator is derived from the posterior pdf – by ﬁnding its mean, its median, or its mode. The posterior pdf is fundamental in all Bayesian inference problems. 11.6
Parameter Estimation for Linear Gaussian Models
The case of linear Gaussian models is frequently encountered in practice and is insightful. Assume the parameter space = R, and = Rn , and
X Y Y = ± s + Z, where s is a known (signal) vector, ± ∼ N (µ, σ 2 ), Z ∼ N ( , Z ), and ± and Z are 0
C
independent. Using the fact that ± has a conditional distribution that is Gaussian, given Y , and the fact that the MMSE estimate is the same as the linear MMSE estimate for this Gaussian problem (see Section 11.7.4), we get that ± , given Y = y, has a Gaussian distribution with mean
E [±Y = y] = E[±] + Cov(±, Y ) Cov(Y )−1 ( y − E [Y ]) and variance Var (±Y
= y) = Var (±) − Cov(±, Y ) Cov(Y )−1 Cov(Y , ±).
266
Bayesian Parameter Estimation
We can compute
E[±] = µ, E[Y ] = µs
Var (±)
= σ 2 , Cov(±, Y ) = σ 2 s³ , Cov (Y , ±) = σ 2 s , Cov(Y ) = σ 2 ss³ + Z , C
and use these to conclude that
E[±Y = y] = µ + σ 2 s ³ (σ 2 ss ³ + Z )−1( y − µs )
=
Var (± Y
=
1 µ s³ − Z y + σ2 C
C
,
(11.6)
d 2 + σ12 y) = σ 2 − σ 4 s³ (σ 2 ss³ + C Z )−1 s 1
=
d 2 + σ12
,
(11.7)
1 where d 2 = s³ C− Z s is the squared Mahalanobis distance (see Deﬁnition 5.1) of s from 0 . Note that due to the symmetry of the Gaussian distribution,
θˆMMSE ( y) = θˆMMAE ( y) = θˆMAP ( y) = E[±Y = y].
(11.8)
´ σ1 , i.e., if the observations are more informative than the prior, then the esti1 2 mated parameter is θˆ ( y) ≈ s³ − Z y/d , and we essentially ignore the prior. On the other hand, if d 2 µ σ1 , i.e., the prior is more informative than the observations, then θ(ˆ y) ≈ µ and we ignore the observations. In the special case s = 1 and Z = σ Z2 n (the model reduces to {Y i }in=1 being conditionally i.i.d. N (θ, σ Z2 ) given θ ), we obtain σ 1 ∑n ˆθMMSE ( y) = θˆMMAE ( y) = θˆMAP ( y) = n i =1 yi + µ n σ . σ 1 + nσ
If d 2
2
C
2
C
I
2 Z 2
2 Z 2
11.7
Estimation of Vector Parameters
The cost functions introduced in Sections 11.3–11.5 for scalar parameters can be extended to the vector case. Here we assume that X = A ⊆ Rm , and Y = Rn . Consider the special case of an additive cost function, where
C (a, θ ) =
m » i =1
C i (ai , θi ).
Then the posterior cost of action a, given observation y, can be written as
· ¸ =y m » · ¸ = E C (ai , ±i )Y = y .
C ( a y) = E C (a, ±) Y i =1
Minimizing C (a y) is equivalent to minimizing
C i (ai  y) =
E
·C (a , ± )Y = y¸ i
i
i
(11.9)
11.7
Estimation of Vector Parameters
267
for each i , which implies that the Bayes solution θˆ B ( y) has components that are given by solving m independent minimization problems:
ˆ ( y) = arg min C i (ai  y), a
θ B ,i
11.7.1
= 1, 2, . . . , m .
i
i
Vector MMSE Estimation
Assume that E(¶±¶2) < ∞ . The squarederror cost function is additive:
C (a, θ ) = ¶a − θ ¶2
m »
=
i =1
(ai − θi )2.
Therefore the components of θˆ MMSE ( y) are given by
· ¸ θˆMMSE,i ( y) = E ±i Y = y .
By the linearity of the expectation operator, we obtain
ˆ
θ MMSE
11.7.2
· ¸ ( y) = E ±Y = y .
(11.10)
Vector MMAE Estimation
The absoluteerror cost function ( ³1 norm of the error vector) is also additive:
C (a, θ ) = ¶a − θ ¶ 1 =
m » i =1
ai − θi 
and therefore the components of θˆ MMAE ( y) are given by
· ¸ θˆMMAE,i ( y) = med π(θi  y) ,
11.7.3
i
= 1, 2, . . . , m .
(11.11)
Vector MAP Estimation
There are two ways to deﬁne the uniform cost function that lead to MAP solutions. In the ﬁrst, we consider an additive cost function with components:
¹
− θi  ≤ ² 1 if ai − θi  > ². This leads to a componentwise MAP estimate as ² → 0 with θˆMAP,i ( y) = arg max π(θi  y). θ Ci (ai , θi ) =
0
if ai
i
A more natural way to deﬁne the uniform cost function is through a nonadditive cost function:
C (a, θ ) =
¹
0
if maxi  ai
1
otherwise.
− θi  ≤ ²
268
Bayesian Parameter Estimation
In this case, using the same argument as in the scalar case, we obtain as optimum estimator converges to
ˆ
θ MAP
² → 0, the
( y) = arg max π(θ  y).
(11.12)
θ
Note that in this case θˆMAP,i ( y) is not necessarily equal to arg max θi π(θi  y). However, the equality will hold if
π(θ  y) = Example 11.6
m ¼
i =1
π(θi  y).
= (0, ∞) and X = [0, ∞)2, with ¹ 2 −yθ if 0 ≤ θ2 ≤ θ1 π(θ y ) = y e
Suppose Y
1
0
Then
ˆ
otherwise.
θ MAP
( y) = 0 .
However, it is easy to see that
π(θ1 y ) = y2 θ1 e− y θ 1 {θ1 ≥ 0}, and π(θ2 y ) = ye− y θ 1{θ2 ≥ 0}, 1
2
from which we get that arg max π(θ1 y ) =
θ1
Thus θˆMAP,i ( y) · = arg maxθi
11.7.4
1 , and arg max π(θ2 y) = 0. y θ2
π(θi y) for i = 1.
Linear MMSE Estimation
We saw in Section 11.7.2 that the (unconstrained) MMSE estimator of conditional mean:
ˆ
θ MMSE
±, given
Y is the
( y) = E [±Y = y].
For general models, this conditional expectation is nonlinear in y and may be difﬁcult to compute. To simplify the MMSE estimation problem, we add the constraint that the estimator should be an afﬁne function of the observations of the form:
ˆ (Y ) =
θ
Y
A
+ b.
As in the unconstrained MMSE estimation problem we ask to minimize the mean squarederror MSE = E[¶θˆ (Y ) − ± ¶2 ]
but with the restriction that the minimization is over the set of linear estimators. This problem can be solved using orthogonality principle (see, e.g., [1]). If W = θˆ (Y ) − ±
11.7
269
Estimation of Vector Parameters
denotes the estimation error, then the orthogonality principle states that the optimal satisﬁes
E[ W ] = 0
ˆ
θ
(11.13)
Cov (W , Y ) = 0.
(11.14)
These equations can solved to yield the linear MMSE (or LMMSE) estimate [1]:
(Y ) =´ Eˆ [±Y ] = E[±] + Cov(±, Y )Cov (Y )−1 (Y − E[Y ]),
(11.15)
Cov(W ) = Cov(±) − Cov(±, Y )Cov(Y )−1Cov (Y , ±),
(11.16)
ˆ
θ LMMSE
ˆ for the LMMSE estimate is used to denote the fact that the LMMSE where the notation E estimate can be considered to be a linear approximation to the conditional expectation, which is the MMSE estimate. The covariance of the error can also be obtained: from which we can compute the MSE of the LMMSE estimator as
½ ¾ ± ½ ¾² E[¶W ¶2 ] = E Tr ¶W ¶2 ± ½ ¾² ½ ± ²¾ = E Tr W ³ W = Tr E W W ³ = Tr (Cov(W )) ,
MSE (θˆ LMMSE) = E [¶W ¶2 ] = Tr
where, as in Section 5.10, we exploited the linearity of the expectation operation, the linearity of the trace operation, and invariance of the trace to circular rotations of matrix products (see also Appendix A). Furthermore, for (± , Y ) jointly Gaussian it can be shown that
ˆ
θ MMSE
( y) = E[±Y = y] = Eˆ [±Y = y] = θˆ LMMSE( y)
(11.17)
and that Cov(W ) = Cov(θˆ (Y ) − ±) = Cov(θˆ (Y ) − ± Y
= y) = Cov (±Y = y).
(11.18)
It is to be noted that, unlike the MMSE estimate, the computations of the LMMSE estimate and the covariance of the estimation error do not require the entire joint distribution of ± and Y ; moments up to second order are sufﬁcient. 11.7.5
Vector Parameter Estimation in Linear Gaussian Models
This is the vector parameter version of the problem considered in Section 11.6. Assume
X = Rm, Y = Rn , and
= ± + Z, (11.19) is a known n × m matrix, ± ∼ N (µ, ), Z ∼ N ( , Z ), and ± and Z are Y
H
where H K 0 C independent. Using (11.17) and (11.18), we get that ±, given Y = y, has a Gaussian distribution with mean and covariance matrix respectively given by
E [±Y = y] = Eˆ [±Y = y] = E[±] − Cov (±, Y ) Cov (Y )−1 ( y − E[Y ])
(11.20)
270
Bayesian Parameter Estimation
and Cov(± Y
= y) = Cov(±) − Cov(±, Y ) Cov(Y )−1 Cov (Y , ±).
(11.21)
We can compute
E [±] = µE[Y ] =
µ
H
,
Cov(±, Y ) = KH³ ,
Cov(±) = K,
Cov(Y , ±) = HK, Cov(Y ) =
³+
HKH
C
Z
,
and use these to conclude that
E[±Y = y] = µ + ³ ( KH
and
Cov (± Y
= y) = − K
HKH
KH
³ + )−1( y −
³(
K
HKH
³ + )−1 K
)
µ
H
HK
.
Note again that due to the symmetry of the Gaussian distribution,
ˆ
( y) = θˆ MMAE ( y) = θˆ MAP ( y) = E[±Y = y].
θ MMSE
11.7.6
(11.22)
Other Cost Functions for Bayesian Estimation
Other cost functions are sometimes encountered. For instance, choose q
C (a, θ ) =
» m
i=1
ai − θi q
≥ 1 and let (11.23)
be the ³q norm ¶a − θ ¶q of the estimation error, raised to the power q. This is an additive cost function, and the special cases q = 1 and q = 2 lead to the MMAE and MMSE estimators, respectively. For large q, large estimation errors are strongly penalized. Moreover the cost function is dominated by the largest component of the estimation error vector because lim
q →∞
¶a − θ ¶q = 1max a − θi . ≤i ≤m i
Another possible cost function is the Kullback–Leibler (KL) divergence loss (see Section 6.1 for the deﬁnition of KL divergence),
C (a, θ ) = D( pa ¶ pθ ),
(11.24)
which is encountered in some theoretical analyses. For the Gaussian family pθ
N (θ , n ), this reduces to the squarederror cost.
=
I
11.8
Exponential Families
D E FI N I T I O N 11.1 An m dimensional exponential family { pθ , distributions with densities of the form
pθ ( y) = h ( y) C (θ ) exp
¹» m k=1
¿
ηk (θ )Tk ( y) ,
y
θ
∈ X } is a family of
∈ Y,
(11.25)
11.8
where {ηk (θ ), 1 ≤ k ≤ m } and {Tk ( y), 1 ≤ k is a normalization constant.
Exponential Families
271
≤ m } are realvalued functions, and C (θ )
Note that m is not necessarily the dimension of θ in Deﬁnition 11.1. Also, if we let η(θ ) = [ η1 (θ ) . . . ηm (θ )], then η maps the parameter space X to some subset of Rm , which we denote by µ . Exponential families are widely used and encompass many common distributions such as Gaussian, Gamma, Poisson, and multinomial. However, the family of uniform distributions over [0, θ ] is not in the exponential family, and neither is the family of Cauchy distributions with location parameter θ . It is often simpler to view η = {ηk }m k=1 as parameters of the distribution. The exponential family of (11.25) is then written in the following canonical form :
qη ( y) = h( y) exp
¹» m k =1
¿ ηk Tk ( y) − A (η ) ,
y
∈ Y,
(11.26)
where A(η) = − ln C (θ ). In order for the normalization constant to be well deﬁned, we must have
³
Y
h( y) exp
¹» m k =1
¿ ηk Tk ( y ) d μ( y ) < ∞.
(11.27)
Since the integral is a convex function of η, the set of η satisfying the inequality (11.27) is a convex set, called the natural parameter space. Usually (but not always), this set is open.
11.8.1
Basic Properties
The following proposition gives some basic properties of a canonical exponential family. In what follows:

• the m-vector ∇A(η) ≜ {∂A(η)/∂η_k, 1 ≤ k ≤ m} is the gradient of the function A(η), and
• the m × m matrix ∇²A(η) ≜ {∂²A(η)/∂η_j ∂η_k, 1 ≤ j, k ≤ m} is the Hessian of A(η).

The identities given at the end of Appendix A are used in the derivations. The following technical result is also used (see e.g. [2, p. 27]): for any integrable function f : Y → R and η in the interior of X_nat, the expectation E_η[f(Y)] = ∫_Y q_η(y) f(y) dμ(y) is continuous and has derivatives of all orders with respect to η, and these derivatives can be obtained by switching the order of integration and differentiation.

PROPOSITION 11.1 The following statements hold:
(i) The mean of T(Y) is given by

    E_η[T(Y)] = ∇A(η).   (11.28)

(ii) The covariance of T(Y) is given by

    Cov_η[T(Y)] = ∇²A(η).   (11.29)
272
Bayesian Parameter Estimation
(iii) The cumulant-generating function of q_η is given by

    κ_η(u) = ln E_η[e^{u^⊤ T}] = A(η + u) − A(η),  u ∈ R^m.   (11.30)
(iv) The KL divergence between q_η and q_{η′} is given by

    D(q_η ‖ q_{η′}) = (η − η′)^⊤ ∇A(η) − A(η) + A(η′),   (11.31)

which is also a Bregman divergence with generator function −A(η).

Proof (i) Taking the gradient of the identity 1 = ∫_Y q_η dμ with respect to η, we obtain

    0 = ∇ ∫_Y h(y) exp{ Σ_{k=1}^m η_k T_k(y) − A(η) } dμ(y)
      = ∫_Y h(y) exp{ Σ_{k=1}^m η_k T_k(y) − A(η) } [T(y) − ∇A(η)] dμ(y)
      = E_η[T(Y)] − ∇A(η).
(ii) Similarly, taking the Hessian of the identity 1 = ∫_Y q_η dμ with respect to η, we obtain

    0 = ∇² ∫_Y h(y) exp{ Σ_{k=1}^m η_k T_k(y) − A(η) } dμ(y)
      = ∇ ∫_Y h(y) exp{ Σ_{k=1}^m η_k T_k(y) − A(η) } [T(y) − ∇A(η)] dμ(y)
      = ∫_Y h(y) exp{ Σ_{k=1}^m η_k T_k(y) − A(η) } { [T(y) − ∇A(η)][T(y) − ∇A(η)]^⊤ − ∇²A(η) } dμ(y)
      = Cov_η[T(Y)] − ∇²A(η).
(iii) The moment-generating function of q_η is given by

    M_η(u) = E_η[e^{u^⊤ T}]
           = ∫_Y h(y) exp{ Σ_{k=1}^m η_k T_k(y) − A(η) } e^{ Σ_{k=1}^m u_k T_k(y) } dμ(y)
           = e^{A(η+u) − A(η)} ∫_Y h(y) exp{ Σ_{k=1}^m (η_k + u_k) T_k(y) − A(η + u) } dμ(y)
           = e^{A(η+u) − A(η)}.

The claim follows by taking the logarithm of both sides.
(iv) The KL divergence between q_η and q_{η′} is given by

    D(q_η ‖ q_{η′}) = E_η[ ln( q_η(Y) / q_{η′}(Y) ) ]
                    = E_η[ (η − η′)^⊤ T(Y) − A(η) + A(η′) ]
                    = (η − η′)^⊤ ∇A(η) − A(η) + A(η′).
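These identities are easy to check numerically. The sketch below (an illustration, not part of the original derivation) writes the Bernoulli family in canonical form — η = ln(p/(1−p)), T(y) = y, A(η) = ln(1 + e^η) — and compares ∇A and ∇²A with the Monte Carlo mean and variance of T(Y); the parameter value and sample size are arbitrary choices.

```python
import math
import random

# Bernoulli(p) as a one-parameter canonical exponential family:
#   q_eta(y) = exp(eta*y - A(eta)), y in {0,1}, with T(y) = y,
#   A(eta) = ln(1 + e^eta), so A'(eta) = p and A''(eta) = p(1-p).
random.seed(0)
p = 0.3
eta = math.log(p / (1 - p))              # natural parameter
A_prime = math.exp(eta) / (1 + math.exp(eta))
A_second = A_prime * (1 - A_prime)

n = 200_000
samples = [1 if random.random() < p else 0 for _ in range(n)]
mean_T = sum(samples) / n
var_T = sum((s - mean_T) ** 2 for s in samples) / n

print(abs(mean_T - A_prime) < 0.01)   # True: E[T(Y)] = grad A(eta)
print(abs(var_T - A_second) < 0.01)   # True: Var[T(Y)] = Hessian of A
```

The same check works for any canonical family for which A and its derivatives are available in closed form.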
Mean-Value Parameterization
It will be convenient to view

    ξ = g(η) ≜ ∇A(η) = E_η[T(Y)]   (11.32)

as a transformation of the vector η. In view of the last equality in (11.32), ξ is also called the mean-value parameterization of the exponential family [2, p. 126].
i.i.d. Observations
If {Y_i}_{i=1}^n are drawn i.i.d. from p_θ of (11.25), then the pdf of the observation sequence Y = {Y_i}_{i=1}^n is given by

    p_θ^{(n)}(y) = Π_{i=1}^n p_θ(y_i) = [ Π_{i=1}^n h(y_i) ] [C(θ)]^n exp{ Σ_{k=1}^m η_k(θ) Σ_{i=1}^n T_k(y_i) },   (11.33)

which is again an exponential family.
11.8.2
Conjugate Priors
For a given distribution p_θ(y), there often exist prior distributions π(θ) that are "well matched" to p_θ(y) in the sense that the posterior distribution π(θ|y) is tractable. For instance, if both p_θ(y) and π(θ) are Gaussian, then so is π(θ|y). More generally, a similar concept holds when p_θ is taken from an arbitrary exponential family [3]. Consider the canonical family {q_η} of (11.26) and the following conjugate prior, which belongs to an (m + 1)-dimensional exponential family parameterized by c > 0 and µ ∈ R^m:
    π̃(η) = h̃(c, µ) exp{ c η^⊤ µ − c A(η) }.   (11.34)

Assuming the mapping η(θ) is invertible, the prior on η is related to the prior on θ as follows:

    π̃(η) = π(θ) / |J(θ)|,   (11.35)

where the m × m Jacobian matrix J(θ) has components J_{kl} = ∂η_k(θ)/∂θ_l for 1 ≤ k, l ≤ m. Consider again Y = (Y_1, ..., Y_n), drawn i.i.d. from (11.26). Then

    q_η(y) π̃(η) = [ Π_{i=1}^n h(y_i) ] h̃(c, µ) exp{ η^⊤ [ Σ_{i=1}^n T(y_i) + cµ ] − (n + c) A(η) }.
Therefore the posterior distribution of η given Y = y is given by

    π̃(η|y) ∝ exp{ η^⊤ [ Σ_{i=1}^n T(y_i) + cµ ] − (n + c) A(η) } = exp{ c′ η^⊤ µ′ − c′ A(η) }

and belongs to the same exponential family as the prior (11.34), with parameters

    c′ = n + c  and  µ′ = (1/(n + c)) [ Σ_{i=1}^n T(y_i) + cµ ].
This greatly facilitates inference on η or some transformations of η. In particular, the MAP estimator η̂_MAP is the solution to

    0 = ∇ ln π̃(η|y) |_{η = η̂_MAP},

hence

    ∇A(η̂_MAP) = (1/(n + c)) [ Σ_{i=1}^n T(y_i) + cµ ].
Inversion of the mapping ∇A yields η̂_MAP. Note that the ML estimator (see Chapter 14) is similarly obtained by maximizing the log-likelihood (11.33). Hence η̂_ML is the solution to

    0 = ∇ ln q_η^{(n)}(y) |_{η = η̂_ML},

which is²

    ξ̂_ML = ∇A(η̂_ML) = (1/n) Σ_{i=1}^n T(y_i).   (11.36)
Observe the remarkable similarity between the ML and MAP estimators. It is convenient to think of c as a number of pseudo-observations, and of µ as a prior mean for T(Y).
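The update c′ = n + c, µ′ = (Σ_i T(y_i) + cµ)/(n + c) can be sketched in a few lines. In the snippet below the Bernoulli family (where T(y) = y and ξ = θ) and the prior values c = 4, µ = 0.5 are illustrative choices, not taken from the text.

```python
import random

# Conjugate-prior update in a canonical exponential family, sketched
# for the Bernoulli case: the prior parameters (c, mu) act like c
# pseudo-observations with mean mu, and the posterior parameters are
#   c' = n + c,   mu' = (sum_i T(y_i) + c*mu) / (n + c).
def conjugate_update(data, c, mu):
    n = len(data)
    return n + c, (sum(data) + c * mu) / (n + c)

random.seed(1)
theta_true = 0.7                                  # illustrative value
data = [1 if random.random() < theta_true else 0 for _ in range(1000)]

c_post, mu_post = conjugate_update(data, c=4.0, mu=0.5)
# For Bernoulli, xi = grad A(eta) = theta, so mu' is the posterior
# mean-value estimate (11.38); it shrinks the sample mean toward mu.
print(c_post)   # 1004.0
print(mu_post)
```

With n = 1000 observations, the c = 4 pseudo-observations barely move the estimate; with small n, the prior mean µ dominates.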
Conditional Expectation
Similarly to Proposition 11.1(i), one can show

    E[ξ] = E[∇A(η)] = µ,   (11.37)

where the expectation is taken with respect to the prior π̃(η). Similarly, for the problem with i.i.d. data Y = (Y_1, ..., Y_n), we obtain

    ξ̂_MMSE(y) = E[ξ|y] = E[∇A(η)|y] = (1/(n + c)) [ Σ_{i=1}^n T(y_i) + cµ ] = ξ̂_MAP(y).   (11.38)
Example 11.7 Precision Estimation with Gamma Prior. Consider Y = (Y_1, ..., Y_n) drawn i.i.d. from the Gaussian distribution N(0, σ²). Let η = 1/σ² (called the precision parameter); then

    q_η(y) = (2π)^{−1/2} η^{1/2} exp{ −½ η y² }

is in canonical form with T(y) = −½ y² and A(η) = −½ ln η. The n-fold product distribution is

    q_η(y) = (2π)^{−n/2} η^{n/2} exp{ −½ η Σ_{i=1}^n y_i² }.

Consider the Γ(a, b) prior for η:

    π̃(η) = 1/(Γ(a) b^a) η^{a−1} e^{−η/b},  a, b, η > 0,

whose mode is at b(a − 1). Hence

    π̃(η|y) ∝ η^{n/2 + a − 1} exp{ −η [ 1/b + ½ Σ_{i=1}^n y_i² ] }

is a Γ(a′, b′) distribution with

    a′ = n/2 + a  and  1/b′ = 1/b + ½ Σ_{i=1}^n y_i².

The MAP estimator of η given y is the mode of Γ(a′, b′), i.e.,

    η̂_MAP = b′(a′ − 1) = (2a − 2 + n) / (2/b + Σ_{i=1}^n y_i²).

For this problem one can also derive (see Exercise 11.12)

    σ̂²_MAP = (1/b + Σ_{i=1}^n y_i²) / (2a + 2 + n).   (11.39)

Observe that σ̂²_MAP ≠ 1/η̂_MAP, i.e., MAP estimation generally does not commute with nonlinear reparameterization.

² The first equality in (11.36) will be shown in Proposition 14.1.
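A small numerical illustration of this non-commutation, using the two closed-form expressions above as given; the prior parameters, sample size, and seed are arbitrary choices.

```python
import random

# Example 11.7 sketch: MAP estimate of the precision eta versus MAP
# estimate of sigma^2 under a Gamma(a, b) prior. The two estimates
# are not reciprocals of each other.
random.seed(2)
sigma_true = 2.0                                   # illustrative value
y = [random.gauss(0.0, sigma_true) for _ in range(50)]
a, b = 3.0, 0.5                                    # illustrative prior
n = len(y)
s = sum(v * v for v in y)                          # sum of squares

eta_map = (2 * a - 2 + n) / (2 / b + s)            # mode of Gamma(a', b')
sigma2_map = (1 / b + s) / (2 * a + 2 + n)         # equation (11.39)

print(abs(sigma2_map - 1 / eta_map) > 1e-6)        # True: they disagree
```

As n grows, both estimates are dominated by the data term Σ y_i² and the discrepancy shrinks, which is consistent with the prior mattering less and less.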
Example 11.8 Multinomial Distribution and Dirichlet Prior. This model is frequently used in machine learning. Let Y = (Y_1, ..., Y_m) be a multinomial distribution with m categories, n trials, and probability vector θ; then

    p_θ(y) = [ n! / Π_{i=1}^m y_i! ] Π_{i=1}^m θ_i^{y_i}.   (11.40)

This is the pmf that results from drawing a category n times i.i.d. from the distribution θ, and recording the count for each category. Now assume that θ was drawn from a Dirichlet probability distribution

    π(θ) = [ Γ(Σ_{i=1}^m α_i) / Π_{i=1}^m Γ(α_i) ] Π_{i=1}^m θ_i^{α_i − 1},   (11.41)

where Γ(x) = ∫_0^∞ e^{−t} t^{x−1} dt, the vector α = (α_1, ..., α_m) has positive components, θ_m = 1 − Σ_{i=1}^{m−1} θ_i, and ν is the Lebesgue measure on R^{m−1}. For m = 2, (11.40) and (11.41) respectively reduce to the binomial and the beta distributions. Note that π of (11.41) is the uniform distribution if α = 1 (all-ones vector), and π concentrates near the vertices of the probability simplex if the α_i are large. The distribution of (11.41) is denoted by Dir(m, α). Its mean is equal to (Σ_{i=1}^m α_i)^{−1} α and its mode is at (Σ_{i=1}^m α_i − m)^{−1} (α − 1) when α_i > 1 for all i. The posterior distribution of θ given Y is

    π(θ|y) ∝ p_θ(y) π(θ) ∝ Π_{i=1}^m θ_i^{y_i + α_i − 1},

hence is Dir(m, y + α). We conclude that the Dirichlet distribution is the conjugate prior for the multinomial distribution. The MMSE and MAP estimators of θ are respectively the mean and the mode of the posterior distribution:

    θ̂_MMSE,i(y) = E[Θ_i | Y = y] = (y_i + α_i) / (n + Σ_{i=1}^m α_i),  1 ≤ i ≤ m,

and

    θ̂_MAP,i(y) = (y_i + α_i − 1) / (n + Σ_{i=1}^m α_i − m),  1 ≤ i ≤ m.

If the prior is uniform (α_i ≡ 1), the MAP estimator coincides with the ML estimator,

    θ̂_ML,i(y) = y_i / n,  1 ≤ i ≤ m.

Otherwise {α_i − 1}_{i=1}^m may be viewed as pseudo-counts that are added to the actual counts {y_i}_{i=1}^m to obtain the MAP estimator of θ.
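The two estimators above can be sketched in a few lines; the counts and Dirichlet parameters below are hypothetical values chosen for illustration.

```python
# Example 11.8 sketch: Dirichlet-multinomial point estimates.
# With counts y and Dirichlet parameters alpha, the text gives
#   MMSE_i = (y_i + alpha_i) / (n + sum(alpha))
#   MAP_i  = (y_i + alpha_i - 1) / (n + sum(alpha) - m)
def dirichlet_estimates(y, alpha):
    n, m = sum(y), len(y)
    s = sum(alpha)
    mmse = [(yi + ai) / (n + s) for yi, ai in zip(y, alpha)]
    map_ = [(yi + ai - 1) / (n + s - m) for yi, ai in zip(y, alpha)]
    return mmse, map_

y = [3, 0, 7]                       # hypothetical counts, n = 10 trials
mmse, map_ = dirichlet_estimates(y, alpha=[1.0, 1.0, 1.0])
print(map_)                         # [0.3, 0.0, 0.7] -- uniform prior: MAP = ML
print(mmse)                         # MMSE pulls every component off the boundary
```

Note how the MMSE estimate assigns nonzero probability to the unobserved category, while the MAP/ML estimate does not; this is the usual smoothing effect of the pseudo-counts.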
Exercises
11.1 Gaussian Parameter Estimation. Let θ ∼ N(0, 1), Z_1 ∼ N(0, 1), and Z_2 ∼ N(0, 2) be independent Gaussian variables. Find the MMSE estimator of θ given the observations

    Y_1 = θ + Z_1 + Z_2  and  Y_2 = −θ − Z_1 + Z_2.

11.2 Bayesian Parameter Estimation. Suppose Θ is a random parameter with exponential prior distribution

    π(θ) = e^{−θ} 1{θ ≥ 0},

and suppose that given Θ = θ, the observation Y has a Rayleigh distribution with

    p_θ(y) = 2θy e^{−θy²} 1{y ≥ 0}.

(a) Find the MAP estimate of Θ given Y.
(b) Find the MMSE estimate of Θ given Y.
11.3 Bayesian Parameter Estimation. Suppose Λ is a random parameter with prior given by the Gamma density

    π(λ) = e^{−λ} λ^{α−1} / Γ(α) 1{λ ≥ 0},

where α is a known positive real number, and Γ is the Gamma function defined by the integral

    Γ(x) = ∫_0^∞ t^{x−1} e^{−t} dt,  for x > 0.

Our observation Y is Poisson with rate Λ, i.e.,

    p_λ(y) = P({Y = y} | {Λ = λ}) = λ^y e^{−λ} / y!,  y = 0, 1, 2, ....

(a) Find the MAP estimate of Λ given Y.
(b) Now suppose we wish to estimate a parameter Θ that is related to Λ as Θ = e^{−Λ}. Find the MMSE estimate of Θ given Y.
11.4 MAP/MMSE/MMAE Estimation for Exponential Distribution and Uniform Prior. Suppose Θ is uniformly distributed on [0, 1] and that we observe

    Y = Θ + Z,

where Z is a random variable, independent of Θ, with exponential pdf p_Z(z) = e^{−z} 1{z ≥ 0}. Find θ̂_MMSE(y), θ̂_MMAE(y), and θ̂_MAP(y).
11.5 MMSE/MAP Estimation for Exponential Observations and Prior. Let Y_1, ..., Y_n be i.i.d. observations drawn from an exponential distribution with mean 1/θ, i.e.,

    p_θ(y_k) = θ e^{−θ y_k} 1{y_k ≥ 0}.

The parameter θ is modeled as a unit exponential random variable, i.e.,

    π(θ) = e^{−θ} 1{θ ≥ 0}.

Find the MMSE and MAP estimates of θ based on the n observations.
11.6 MAP Estimator for Uniform Distribution and Uniform Prior. Let Y_1, ..., Y_n be i.i.d. observations drawn from a uniform distribution over [1, θ], i.e.,

    p_θ(y_k) = 1/(θ − 1) 1{y_k ∈ [1, θ]}.

The prior distribution of θ is given by a uniform distribution over [1, 10], i.e.,

    π(θ) = (1/9) 1{θ ∈ [1, 10]}.
(a) Find θ̂_MAP(y).
(b) Show that θ̂_MAP(y) converges in distribution to the true value of θ as n → ∞, i.e., show that

    lim_{n→∞} P{θ̂_MAP(Y) ≤ t} = P{Θ ≤ t}  for all t ∈ R.
11.7 Sunrise Probability. In the 18th century, Laplace asked what is the probability that the Sun will rise tomorrow. He assumed that the Earth was formed n days ago, and that the Sun rises each day with probability θ. While the value of θ is unknown, we might model our uncertainty by viewing θ as a random parameter with uniform prior over [0, 1]. Under these assumptions, show that the estimate of the probability that the Sun will rise tomorrow is (n + 1)/(n + 2).

11.8 Distributions over Finite Sets. Show that the set of probability distributions over a finite set Y = {y_1, y_2, ..., y_{m+1}} forms an m-dimensional exponential family whose sufficient statistics are indicator functions:

    T_k(y) = 1{y = y_k},  1 ≤ k ≤ m.
Write this family in canonical form.

11.9 Estimation for Mixed Discrete/Continuous Random Variables. Let Y take value 0 with probability 1 − ε and be N(0, σ²) otherwise. Define θ = (ε, σ²), where 0 < ε < 1 and σ² > 0.
(a) Show that the distribution P_θ of Y belongs to a two-dimensional exponential family with respect to the measure μ which is the sum of the Lebesgue measure and the measure that puts a mass of 1 at y = 0; more explicitly, μ(A) = |A| + 1{0 ∈ A} for any (open, closed, semi-infinite) interval A of the real line.
(b) Assume we are given observations {Y_i}_{i=1}^n ∼ i.i.d. P_θ and assume a prior π(ε, σ²) = e^{−σ²} over (0, 1) × R_+. Derive the MAP estimators of ε and σ².

11.10 Beta Distribution as Conjugate Prior. Assume Y is a binomial random variable with n trials and parameter θ ∈ (0, 1):

    p_θ(y) = n!/(y!(n − y)!) θ^y (1 − θ)^{n−y},  y = 0, 1, ..., n.

(a) Suppose C(a, θ) = (a − θ)². Let the prior distribution of Θ be the beta distribution β(a, b) with density

    π(θ) = Γ(a + b)/(Γ(a)Γ(b)) θ^{a−1} (1 − θ)^{b−1},  0 < θ < 1,

where a > 0 and b > 0. Show that the posterior distribution of Θ given Y = y is β(a + y, b + n − y), and obtain the Bayes estimator.
(b) Repeat when the prior distribution of Θ is the uniform distribution on [0, 1].
(c) Now, let C(a, θ) = (a − θ)²/(θ(1 − θ)). Find the Bayes rule with respect to the uniform distribution on [0, 1].
11.11 Expectations in Conjugate Prior Family. Prove (11.37) and (11.38).

11.12 MAP Variance Estimation. Derive the MAP estimator (11.39). Hint: First express the prior for σ² in terms of the prior for η = 1/(2σ²), then evaluate the derivative of the log posterior density with respect to σ², and set it to zero.
References
[1] B. Hajek, Random Processes for Engineers , Cambridge University Press, 2015. [2] E. L. Lehmann and G. Casella, Theory of Point Estimation, Springer, 1998. [3] P. Diaconis and D. Ylvisaker, “Conjugate Priors for Exponential Families,” Ann. Stat., pp. 269–281, 1979.
12
Minimum Variance Unbiased Estimation
In this chapter we introduce several methods for constructing good estimators when prior probabilistic models are not available for the unknown parameter. First we define the notions of unbiasedness and minimum-variance unbiased estimators (MVUE). We then present the concept of sufficient statistics, followed by the factorization theorem and the Rao–Blackwell theorem. Parameter estimation for complete families of distributions and exponential families is studied in detail.
12.1
Nonrandom Parameter Estimation
In Chapter 11, we considered the Bayesian approach to the parameter estimation problem, where we assumed that the parameter θ ∈ X is a realization of a random parameter Θ with prior π(θ). Now we consider the situation where θ is unknown but cannot be modeled as random. The conditional risk of estimator θ̂ is given by

    R_θ(θ̂) = E_θ[ C(θ̂(Y), θ) ],  θ ∈ X.
We could try the minimax approach (as we did in the case of hypothesis testing):
    θ̂_m = arg min_{θ̂ ∈ D} max_{θ ∈ X} R_θ(θ̂).   (12.1)
However, the minimax estimation rule is often hard to compute in practice. It is also often too conservative for many applications. We would like to explore estimators that are simple to compute and less conservative. Consider first the case where θ is a scalar parameter, and we use the squared-error cost function C(a, θ) = (a − θ)². Then
    R_θ(θ̂) = E_θ[ (θ̂(Y) − θ)² ]   (12.2)
is the mean squared error (MSE). Unfortunately, it is generally difficult to construct an estimator that is good in terms of R_θ(θ̂) for all θ ∈ X, let alone the "best" estimator. For instance, the trivial estimator θ̂ that returns a fixed value θ_0 for all observations has minimum squared-error risk (zero) if θ happens to be equal to θ_0. This is a silly estimator which performs poorly if the true θ is far from θ_0. However, this example shows one cannot expect to have a uniformly best estimator, since the trivial estimator cannot be
improved upon at θ = θ_0. This concept was already apparent in simple binary hypothesis testing, where a trivial detector that returns the same decision (say H_0) irrespective of the observations has zero conditional risk under H_0. To avoid such trivialities, it is usually preferable to design an estimator that does not strongly favor one or more values of θ at the expense of others. We introduce two important properties that help us design good estimators.

DEFINITION 12.1 An estimator θ̂ is said to be unbiased if

    E_θ(θ̂) = θ,  ∀θ ∈ X.   (12.3)
Note that for unbiased estimators the squared-error risk R_θ(θ̂) defined in (12.2) is the same as the variance Var_θ[θ̂(Y)]. This leads to the following definition.

DEFINITION 12.2 θ̂ is said to be a minimum-variance unbiased estimator (MVUE) if:

1. θ̂ is unbiased.
2. If θ̃ is another unbiased estimator, then

    Var_θ[θ̂(Y)] ≤ Var_θ[θ̃(Y)],  ∀θ ∈ X.
An MVUE is a "good" estimator and is optimal in the sense that it has minimum variance among all unbiased estimators over all possible θ. Note that the existence of an MVUE is generally not guaranteed — in some problems, even an unbiased estimator does not exist. Also note that an MVUE is not the same as an MMSE estimator. There could be biased estimators that have lower MSE than MVUE estimators (see Section 13.7). Thus far in this chapter we have assumed that θ is a scalar. The definition of an MVUE extends to the case where θ is a vector, i.e., X ⊆ R^m for some m > 1, if instead of estimating θ, we estimate a scalar function g(θ). If we are interested in estimating the individual components of the vector θ, then we can allow the function to equal each of the components and estimate them separately. We have the following definition in this more general setting.

DEFINITION 12.3 ĝ(Y) is said to be an MVUE of g(θ) if:

1. E_θ[ĝ(Y)] = g(θ) for all θ ∈ X (unbiasedness).
2. If g̃(Y) is another unbiased estimator, then

    Var_θ[ĝ(Y)] ≤ Var_θ[g̃(Y)],  ∀θ ∈ X.
Our goal in the remainder of this chapter is to develop a procedure to ﬁnd MVUE solutions, when they exist. The ﬁrst step in this procedure requires the concept of a sufﬁcient statistic.
12.2
Sufﬁcient Statistics
DEFINITION 12.4 A sufficient statistic T(Y) is any function of the data such that the conditional distribution of Y given T(Y) is independent of θ.
Informally a sufﬁcient statistic compresses the observations without losing any information that is relevant for estimating θ .
Example 12.1 Empirical Frequency. Let Y = {0, 1}^n. We estimate the parameter θ ∈ [0, 1] of a Bernoulli distribution from n i.i.d. observations Y_i ∼ i.i.d. Be(θ), 1 ≤ i ≤ n. The pmf of Y given θ is

    p_θ(y) = Π_{i=1}^n p_θ(y_i) = θ^{n_1} (1 − θ)^{n − n_1},

where n_1 = Σ_{i=1}^n 1{y_i = 1} is the number of 1's that occur in the sequence y.

Claim: T(y) ≜ n_1 is a sufficient statistic for θ. Each sequence with n_1 1's and n − n_1 0's has equal probability, and the additional knowledge of θ does not change this probability. Specifically,

    p_θ(y | n_1) = P_θ{Y = y | N_1 = n_1} = 1 / C(n, n_1),

where C(n, n_1) = n!/(n_1!(n − n_1)!) is the binomial coefficient, for all such sequences y and all θ ∈ [0, 1]. Therefore Y is independent of θ conditioned on T(Y). Note that T(Y) is not a unique sufficient statistic. Assume n ≥ 2 and consider
    n_{11} = Σ_{i=1}^{⌊n/2⌋} 1{Y_i = 1}  and  n_{12} = Σ_{i=⌊n/2⌋+1}^n 1{Y_i = 1}.

Then T′(y) ≜ (n_{11}, n_{12}) is also a sufficient statistic for θ.
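The claim P_θ{Y = y | N_1 = n_1} = 1/C(n, n_1) can be verified numerically; the sketch below (plain Python, with an arbitrary sample sequence and θ values) evaluates the conditional probability for several θ and confirms it never changes.

```python
from math import comb

# Example 12.1 check: P_theta(Y = y | N1 = n1) equals 1 / C(n, n1)
# for every theta, so the conditional law of Y given T(Y) = N1 does
# not depend on theta.
def cond_prob(y, theta):
    n, n1 = len(y), sum(y)
    p_y = theta ** n1 * (1 - theta) ** (n - n1)                # P(Y = y)
    p_t = comb(n, n1) * theta ** n1 * (1 - theta) ** (n - n1)  # P(N1 = n1)
    return p_y / p_t

y = (1, 0, 1, 1, 0)                     # arbitrary sequence: n = 5, n1 = 3
vals = [cond_prob(y, th) for th in (0.2, 0.5, 0.9)]
print(all(abs(v - 1 / comb(5, 3)) < 1e-12 for v in vals))  # True
```

The θ-dependent factors cancel exactly in the ratio, which is precisely what Definition 12.4 requires.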
Example 12.2 Trivial Sufficient Statistic. Here the parameter is a vector θ = [θ_1, ..., θ_n] ∈ [0, 1]^n, with Y_i ∼ independent Be(θ_i), 1 ≤ i ≤ n. In this example the Y_i's are independent but not identically distributed, and n_1, as defined in Example 12.1, is no longer a sufficient statistic. However, T(y) = y is a trivial sufficient statistic since Y conditioned on Y is trivially independent of θ. A trivial sufficient statistic does not compress the information content available in the observations. This motivates Definition 12.5.
Example 12.3 Largest Observation. Here we revisit Example 11.5 but drop the assumption that θ is random. Let p_θ be the uniform distribution over [0, θ], where θ ∈ X = R_+. Let Y_i, 1 ≤ i ≤ n, be i.i.d. observations drawn from p_θ. Hence

    p_θ(y) = θ^{−n} Π_{i=1}^n 1{0 ≤ y_i ≤ θ} = θ^{−n} 1{ max_{1≤i≤n} y_i ≤ θ },

hence T(Y) ≜ max_{1≤i≤n} Y_i is a sufficient statistic for θ, as can be shown by verifying that the distribution of Y conditioned on T(Y) is independent of θ, or by applying the factorization theorem in the next section.
DEFINITION 12.5 T(Y) is a minimal sufficient statistic if T(Y) is a function of any other sufficient statistic.
In Example 12.1, T′(Y) is not a minimal sufficient statistic because it is not a function of T(Y).

12.3
Factorization Theorem
THEOREM 12.1 T(Y) is a sufficient statistic if and only if there exist functions f_θ and h such that

    p_θ(y) = f_θ(T(y)) h(y),  ∀θ, y.   (12.4)
Return to Example 12.1, where p_θ(y) = θ^{n_1} (1 − θ)^{n − n_1}. We can select T(y) = n_1, choose the functions h(y) = 1 and f_θ(t) = θ^t (1 − θ)^{n−t}, and apply the factorization theorem to verify that T(y) is a sufficient statistic.
Example 12.4 Let Y_i be i.i.d. N(θ, 1), 1 ≤ i ≤ n, and θ ∈ R. Then

    p_θ(y) = Π_{i=1}^n p_θ(y_i) = (2π)^{−n/2} exp{ −½ Σ_{i=1}^n (y_i − θ)² }
           = (2π)^{−n/2} exp{ −½ Σ_{i=1}^n y_i² + θ Σ_{i=1}^n y_i − nθ²/2 }.

Now let T(y) = Σ_{i=1}^n y_i and select the functions

    h(y) = (2π)^{−n/2} e^{−½ Σ_{i=1}^n y_i²},  f_θ(t) = e^{θt − nθ²/2},

and apply the factorization theorem to see that T(y) is a sufficient statistic for θ.

Example 12.5 Sufficient Statistic for m-ary Hypothesis Testing. Let θ ∈ X = {0, 1, ..., m − 1}. Write

    p_θ(y) = [ p_θ(y) / p_0(y) ] p_0(y),  θ = 0, 1, ..., m − 1.

We obtain the desired factorization of (12.4) by choosing T(y) as the vector of m − 1 likelihood ratios {p_i(y)/p_0(y), 1 ≤ i ≤ m − 1} and letting

    h(y) = p_0(y),  f_θ(t) = 1 if θ = 0, and f_θ(t) = t_θ if θ = 1, 2, ..., m − 1.
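The factorization in Example 12.4 is easy to check numerically. The sketch below (plain Python; the parameter value, sample size, and seed are illustrative) evaluates the joint Gaussian density directly and via f_θ(T(y)) h(y), and confirms the two expressions agree.

```python
import math
import random

# Numerical check of p_theta(y) = f_theta(T(y)) h(y) from Example 12.4
# (i.i.d. N(theta, 1) samples, with T(y) = sum of the y_i).
random.seed(4)
theta, n = 1.5, 6                                   # illustrative values
y = [random.gauss(theta, 1.0) for _ in range(n)]

# Joint density computed directly.
p = math.prod(math.exp(-(v - theta) ** 2 / 2) / math.sqrt(2 * math.pi)
              for v in y)

# Factorized form: h(y) does not involve theta; f_theta depends on y
# only through t = T(y).
t = sum(y)
h = (2 * math.pi) ** (-n / 2) * math.exp(-sum(v * v for v in y) / 2)
f = math.exp(theta * t - n * theta ** 2 / 2)

print(abs(p - f * h) < 1e-12 * p)   # True: both expressions agree
```

Since all θ-dependence sits inside f_θ(t), any inference about θ can be carried out from t = Σ y_i alone.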
Proof of Theorem 12.1
We prove the result for discrete Y (see [1] for the general case).
(i) We first prove the "only if" condition. Assume T(Y) is a sufficient statistic for θ. Then

    p_θ(y) = P_θ{Y = y} = P_θ{Y = y, T(Y) = T(y)}
           = P_θ{Y = y | T(Y) = T(y)} P_θ{T(Y) = T(y)}
           = h(y) f_θ(T(y)),

with h(y) ≜ P_θ{Y = y | T(Y) = T(y)} (independent of θ, by sufficiency) and f_θ(T(y)) ≜ P_θ{T(Y) = T(y)}.

(ii) We now prove the "if" condition. Assume there exist functions f_θ, T, and h such that the factorization (12.4) holds. By Bayes' rule, we have

    p_θ(y|t) ≜ P_θ{Y = y | T(Y) = t} = P_θ{T(Y) = t | Y = y} p_θ(y) / P_θ{T(Y) = t}.   (12.5)

Since T(Y) is a function of Y, we have

    P_θ{T(Y) = t | Y = y} = 1{T(y) = t}.   (12.6)

Substituting (12.4) and (12.6) into (12.5), we obtain

    p_θ(y|t) = [ f_θ(t) h(y) / P_θ{T(Y) = t} ] 1{T(y) = t}.   (12.7)

Now

    P_θ{T(Y) = t} = Σ_{y: T(y)=t} p_θ(y) = Σ_{y: T(y)=t} f_θ(T(y)) h(y) = f_θ(t) Σ_{y: T(y)=t} h(y).   (12.8)

Substituting into (12.7), we obtain

    p_θ(y|t) = [ h(y) / Σ_{y′: T(y′)=t} h(y′) ] 1{T(y) = t},   (12.9)

which does not depend on θ, and therefore T(Y) is a sufficient statistic.
12.4
Rao–Blackwell Theorem
The Rao–Blackwell theorem [2, 3] provides a general framework for reducing the variance of unbiased estimators. Given a sufﬁcient statistic T (Y ), any unbiased estimator can be potentially improved by averaging it with respect to the conditional distribution of Y given T (Y ). This procedure is known as Rao–Blackwellization.
THEOREM 12.2 Assume ĝ(Y) is an unbiased estimator of g(θ) and T(Y) is a sufficient statistic for θ. Then the estimator

    g̃(T(Y)) ≜ E[ĝ(Y) | T(Y)]   (12.10)

satisfies:

(i) g̃(T(Y)) is also an unbiased estimator of g(θ);
(ii) Var_θ[g̃(T(Y))] ≤ Var_θ[ĝ(Y)], with equality if and only if P_θ{g̃(T(Y)) = ĝ(Y)} = 1 for all θ ∈ X.
COROLLARY 12.1 If T(Y) is a sufficient statistic for θ and g*(T(Y)) is the only function of T(Y) that is an unbiased estimator of g(θ), then g* is an MVUE for g(θ).
Example 12.6 We use the Rao–Blackwell theorem to design an estimator for the setup in Example 12.1 with Y_i ∼ i.i.d. Be(θ). Let g(θ) = θ. Consider θ̂(Y) = Y_1, which is a fairly silly estimator (it uses only one out of n observations and has variance equal to θ(1 − θ)). Nevertheless θ̂ is unbiased and can be improved dramatically using Rao–Blackwellization. Recall the sufficient statistic T(Y) = N_1 = Σ_{i=1}^n 1{Y_i = 1}. Using Rao–Blackwellization we construct the estimator

    θ̃(N_1) = E[θ̂(Y) | N_1] = E[Y_1 | N_1] = E[Y_i | N_1],  1 ≤ i ≤ n,

where the last equality holds by symmetry. Averaging over all 1 ≤ i ≤ n, we obtain

    θ̃(N_1) = (1/n) Σ_{i=1}^n E[Y_i | N_1] = (1/n) E[ Σ_{i=1}^n Y_i | N_1 ] = N_1 / n,

which has variance equal to θ(1 − θ)/n, i.e., n times lower than that of the original simple estimator. In fact, we shall soon see that θ̃ is an MVUE.

Proof of Theorem 12.2 (i) For each θ ∈ X, we have

    E_θ[g̃(T(Y))] = E_θ[ E[ĝ(Y) | T(Y)] ] = E_θ[ĝ(Y)] = g(θ),

where the first equality holds by definition of g̃, the second by the law of iterated expectations, and the third by the unbiasedness of ĝ.
(ii) For each θ ∈ X, we have

    Var_θ[g̃(T(Y))] = E_θ[(g̃(T(Y)))²] − (E_θ[g̃(T(Y))])² = E_θ[(g̃(T(Y)))²] − g²(θ).

Similarly, Var_θ[ĝ(Y)] = E_θ[(ĝ(Y))²] − g²(θ). Now

    E_θ[(g̃(T(Y)))²] = E_θ[ (E[ĝ(Y) | T(Y)])² ] ≤ E_θ[ E[(ĝ(Y))² | T(Y)] ] = E_θ[(ĝ(Y))²],

where the inequality follows by Jensen's inequality: E²[X] ≤ E[X²] for any random variable X, with equality if and only if X is equal to a constant with probability 1. Similarly E²[X | T] ≤ E[X² | T], with equality if and only if X is a function of T with probability 1. This proves the claim.
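The variance reduction guaranteed by the theorem, and illustrated in Example 12.6, is easy to see by simulation. A minimal sketch (plain Python; θ, n, and the trial count are arbitrary choices):

```python
import random

# Example 12.6 by Monte Carlo: Rao-Blackwellizing the naive estimator
# theta_hat = Y1 with the sufficient statistic N1 = sum(Y) gives
# theta_tilde = N1/n, whose variance is n times smaller.
random.seed(3)
theta, n, trials = 0.4, 20, 50_000

naive, rb = [], []
for _ in range(trials):
    y = [1 if random.random() < theta else 0 for _ in range(n)]
    naive.append(y[0])            # theta_hat(Y) = Y1
    rb.append(sum(y) / n)         # theta_tilde = E[Y1 | N1] = N1/n

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

print(var(naive) > 5 * var(rb))   # True: variance drops by roughly n = 20
```

Both estimators are unbiased, so the entire difference in mean squared error comes from the variance term, exactly as part (ii) of the theorem predicts.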
Proof of Corollary 12.1 Let ĝ(Y) be any MVUE for g(θ). Then g̃(T(Y)) = E[ĝ(Y) | T(Y)] is unbiased and its variance cannot exceed that of ĝ(Y), by the Rao–Blackwell theorem. But ĝ(Y) is an MVUE, so equality is achieved in (ii). Since g* is unique, we must have g* = g̃, and g* must be an MVUE.

12.5
Complete Families of Distributions
DEFINITION 12.6 A family of distributions P = {p_θ, θ ∈ X} is complete if for any function f : Y → R, we have

    E_θ[f(Y)] = 0  ∀θ ∈ X  ⇒  P_θ{f(Y) = 0} = 1  ∀θ ∈ X.   (12.11)
The latter statement may be written more concisely as f(Y) = 0 a.s. P. A geometric interpretation of completeness is the following. Consider finite Y, with |Y| = k, for simplicity, and write E_θ[f(Y)] = Σ_{y∈Y} f(y) p_θ(y) as the dot product f^⊤ p_θ of the two k-dimensional vectors f = {f(y), y ∈ Y} and p_θ = {p_θ(y), y ∈ Y}. Condition (12.11) may be written as

    f^⊤ p_θ = 0  ∀θ ∈ X  ⇒  f = 0.

In other words, the only vector that is orthogonal to all the vectors in the set {p_θ, θ ∈ X} is the 0 vector.
Figure 12.1 (a) Incomplete family of Example 12.7; (b) complete family of Example 12.8. [Figure: the two families {p_θ} drawn in the probability simplex over Y = {0, 1, 2}, with axes p(0), p(1), p(2); the family in (a) traces a straight line, the family in (b) a curve.]
Example 12.7 Let Y = {0, 1, 2} and consider the family

    p_θ(y) = θ/3 if y = 0,  2θ/3 if y = 1,  1 − θ if y = 2,   (12.12)

where θ ∈ X = [0, 1]. The vector f = (2, −1, 0) is orthogonal to each vector p_θ, θ ∈ [0, 1]. Since f ≠ 0, the family is not complete. A geometric representation of the problem is given in Figure 12.1(a), where the family is represented by a straight line in the probability simplex, and therefore there exists a nonzero vector f that is orthogonal to it.
Example 12.8 Let Y = {0, 1, 2} and consider the family

    p_θ(y) = θ² if y = 0,  2θ if y = 1,  1 − 2θ − θ² if y = 2,   (12.13)

where θ ∈ X = [0, √2 − 1]. Here the family is represented by a curve in the simplex (Figure 12.1(b)), and there exists no f such that f^⊤ p_θ = 0 for all θ ∈ X, other than the trivial f = 0.
The family of binomial distributions is another example of a complete family with discrete observations:
    Y ∼ Bi(n, θ)  ⇒  p_θ(y) = C(n, y) θ^y (1 − θ)^{n−y},  y = 0, 1, ..., n,  θ ∈ [0, 1],

where C(n, y) denotes the binomial coefficient.
We now provide an example of a complete family with continuous observations.
Example 12.9 Let X = (0, ∞), Y = R, and consider the family of distributions

    p_θ(y) = θ e^{−θy} 1{y ≥ 0},  θ ∈ X.

Then

    E_θ[f(Y)] = ∫_0^∞ f(y) θ e^{−θy} dy.

Therefore,

    E_θ[f(Y)] = 0  ∀θ ∈ X  ⟹  ∫_0^∞ f(y) e^{−θy} dy = 0  ∀θ ∈ X,

which means that the Laplace transform of f equals 0 on the positive real line. Now, by using the fact that the Laplace transform is an analytic function, and invoking analytic continuation, we can conclude that f(y) = 0 for almost all y ∈ Y. This establishes the completeness of {p_θ, θ ∈ X}. More general results for exponential families are given in Section 12.5.3.
12.5.1
Link Between Completeness and Sufﬁciency
THEOREM 12.3 If P = {p_θ, θ ∈ X} is a complete family and T(Y) is a sufficient statistic, then the mapping T is invertible, i.e., T is a trivial sufficient statistic.
Proof Let f(Y) = Y − E[Y | T(Y)]. Then

    E_θ[f(Y)] = E_θ[Y] − E_θ[ E[Y | T(Y)] ] = E_θ[Y] − E_θ[Y] = 0,  ∀θ ∈ X.

The completeness assumption (12.11) implies that f(Y) = 0 a.s. P. By the definition of f, we conclude that

    Y = E[Y | T(Y)] a.s. P  ⇒  Y is a function of T(Y) a.s. P.

Since Y and T(Y) are functions of each other, the mapping T is invertible.

Theorem 12.3 implies that the completeness of P is not that useful, since that would imply that all sufficient statistics are trivial. We will now see that it is the completeness of the family of distributions that P induces on a sufficient statistic that is more useful, as it allows us to identify minimal sufficient statistics. Denote by
    q_θ(t) = ∫_{y: T(y)=t} p_θ(y) dμ(y)   (12.14)

the pdf of the sufficient statistic T(Y) given θ.

DEFINITION 12.7 A sufficient statistic T(Y) is complete if the family {q_θ, θ ∈ X} is complete.

THEOREM 12.4 A complete sufficient statistic T(Y) is always minimal.
Proof Suppose T(Y) is not minimal. Then there exists a noninvertible function h such that h(T(Y)) is a sufficient statistic. Let

    f(T(Y)) = T(Y) − E[T(Y) | h(T(Y))].

Then it is clear by the law of iterated expectations that E_θ[f(T(Y))] = 0. Therefore, by completeness of T(Y), we conclude that f(T(Y)) = 0, which implies that T(Y) = E[T(Y) | h(T(Y))]. Therefore T(Y) is a function of h(T(Y)), i.e., h is invertible. This is a contradiction, and so T(Y) must be minimal.
12.5.2
Link Between Completeness and MVUE
If T (Y ) is a complete sufﬁcient statistic, and g˜ (T (Y )) and g ∗ (T (Y )) are two unbiased estimators of g (θ), then T H E O R E M 12.5
is the unique MVUE for g (θ) .
g˜ (T (Y ))
= g∗ (T (Y ))
= g˜(T ( y)) − g∗ (T ( y )). Then Eθ [ f (T (Y ))] = Eθ [g˜(T (Y ))] − Eθ [g∗ (T (Y ))] = g(θ) − g(θ) = 0, ∀θ ∈ X . The completeness assumption implies that f (T (Y )) = 0 a.s. P . By the deﬁnition of f , we conclude that g˜ (T (Y )) = g∗ (T (Y )) a.s. P . Thus g∗ (T (Y )) is the unique unbiased estimator of g(θ) . By the corollary to the Rao–Blackwell theorem, this estimator is an
Proof Let f (T ( y))
MVUE.
General Procedure for Finding MVUEs
The following procedure can be applied to find an MVUE of g(θ) whenever an unbiased estimator exists and a complete sufficient statistic can be found:

1. Find some unbiased estimator ĝ(Y) of g(θ).
2. Find a complete sufficient statistic T(Y).
3. Apply Rao–Blackwellization and obtain g*(T(Y)) = E[ĝ(Y) | T(Y)], which is an MVUE of g(θ).

12.5.3
Link Between Completeness and Exponential Families
The notions of sufficiency, completeness, and MVUEs are more straightforward if one deals with an exponential family. Recall the definition (11.26) of a family of m-dimensional exponential distributions in canonical form:

    p_θ(y) = h(y) exp{ Σ_{k=1}^m θ_k T_k(y) − A(θ) },  y ∈ Y.
Denote by q_θ the pdf of the m-vector T(Y) = [T_1(Y) ... T_m(Y)]^⊤ under p_θ.

THEOREM 12.6 Sufficient Statistics for an Exponential Family.

(i) T(Y) is a sufficient statistic for θ.
(ii) The family {q_θ} is an m-dimensional exponential family.
Proof (i) This follows directly from the factorization theorem applied to (11.26).
(ii) Integrating p_θ over the set {y : T(y) = t} yields

    q_θ(t) = h_T(t) exp{ Σ_{k=1}^m θ_k t_k − A(θ) }

with

    h_T(t) = ∫_{y: T(y)=t} h(y) dμ(y).
THEOREM 12.7 Completeness Theorem for Exponential Families. Consider the family of distributions defined in (11.25), with η mapping the parameter set X to the set Λ ⊂ R^m. Then the vector T(Y) = [T_1(Y) ... T_m(Y)]^⊤ is a complete sufficient (hence minimal) statistic for {p_θ, θ ∈ X} if Λ contains an m-dimensional rectangle.¹

Proof The proof is a generalization of the argument given in Example 12.9. First of all, by the factorization theorem, T(Y) is a sufficient statistic for P. Using an argument similar to that used in Part (ii) of Theorem 12.6, we see that the joint distribution of T(Y) is given by

    q_θ(t) = C(θ) exp{ Σ_{k=1}^m η_k(θ) t_k } h_T(t)

for an appropriately defined h_T(t). Therefore

    E_θ[f(T(Y))] = C(θ) ∫_{R^m} f(t) exp{ Σ_{k=1}^m η_k(θ) t_k } h_T(t) dt.

If Λ contains an m-dimensional rectangle, then E_θ[f(T(Y))] = 0, for all θ ∈ X, implies that the m-dimensional Laplace transform of f(t) h_T(t) is equal to 0 on an m-dimensional rectangle of real parts of the complex variables defining the Laplace transform. By analytic continuation, we can conclude that f(t) h_T(t) = 0 for almost all t, which implies that f(t) = 0 for almost all t [4]. Thus, {q_θ, θ ∈ X} is a complete family, and T(y) is a complete sufficient statistic.

COROLLARY 12.2 Completeness Theorem for Canonical Exponential Families. For the canonical exponential family defined in (11.26), the vector T(Y) = [T_1(Y) ... T_m(Y)]^⊤ is a complete sufficient (hence minimal) statistic for {p_θ, θ ∈ X} if X contains an m-dimensional rectangle.

¹ For m = 1 this is a one-dimensional interval.
Figure 12.2 The quest for MVUEs: the factorization theorem identifies a sufficient statistic; Rao–Blackwell averaging improves any unbiased estimator; completeness (easily checked for exponential families) yields the unique MVUE.

12.6 Discussion

Figure 12.2 summarizes the key results in this chapter. The factorization theorem helps to identify sufficient statistics. The Rao–Blackwell theorem can be applied to reduce the variance of any unbiased estimator by averaging it over $Y$ conditioned on the sufficient statistic $T(Y)$. If the sufficient statistic is complete, this produces the MVUE, which is unique. When the parametric family is exponential, it is easy to identify a sufficient statistic and to check whether it is complete.

12.7 Examples: Gaussian Families
Examples 12.10–12.12 involve Gaussian families.
Example 12.10 Consider the length-$n$ sequence $Y = \alpha s + Z$, where $Z\sim\mathcal N(0,\sigma^2 I_n)\in\mathbb R^n$. Thus $Y\sim\mathcal N(\alpha s, \sigma^2 I_n)$. Here $\sigma^2$ and $s\in\mathbb R^n$ are known, and the unknown parameter is $\alpha\in\mathbb R$. We identify $\theta$ with $\alpha$ and write
$$p_\theta(y) = (2\pi\sigma^2)^{-n/2}\exp\Big\{-\frac{1}{2\sigma^2}\|y-\alpha s\|^2\Big\} = (2\pi\sigma^2)^{-n/2}\exp\Big\{-\frac{1}{2\sigma^2}\big(\|y\|^2 + \alpha^2\|s\|^2 - 2\alpha\,y^\top s\big)\Big\}$$
$$= \underbrace{(2\pi\sigma^2)^{-n/2}\exp\Big\{-\frac{\alpha^2}{2\sigma^2}\|s\|^2\Big\}}_{=C(\theta)}\;\underbrace{\exp\Big\{-\frac{1}{2\sigma^2}\|y\|^2\Big\}}_{=h(y)}\;\exp\Big\{\frac{\alpha}{\sigma^2}\,y^\top s\Big\}.$$
We see that this is a one-dimensional (1D) canonical exponential family, where
$$C(\theta) \triangleq (2\pi\sigma^2)^{-n/2}\exp\Big\{-\frac{\alpha^2}{2\sigma^2}\|s\|^2\Big\},\qquad h(y) \triangleq \exp\Big\{-\frac{1}{2\sigma^2}\|y\|^2\Big\},$$
$$\eta_1(\alpha) \triangleq \frac{\alpha}{\sigma^2},\qquad T_1(y) \triangleq y^\top s \quad\text{(sufficient statistic)}.$$
Because $\{p_\theta, \theta\in\mathcal X\}$ is a 1D exponential family and $\Gamma = \mathbb R$ includes a 1D rectangle, $T_1(Y)$ is a complete sufficient statistic by Theorem 12.7. Now we seek an MVUE for $\alpha$. Observe that
$$\mathbb E_\theta[T_1(Y)] = \mathbb E_\theta[Y^\top s] = \alpha\, s^\top s = \alpha\|s\|^2.$$
Hence
$$\hat\alpha(Y) \triangleq \frac{T_1(Y)}{\|s\|^2} = \frac{Y^\top s}{\|s\|^2} \tag{12.15}$$
is an unbiased estimator of $\alpha$. But $T_1(Y)$ is a complete sufficient statistic, so this estimator is also the unique MVUE. Note that the estimator $\hat\alpha(Y)$ does not depend on $\sigma^2$.
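A quick Monte Carlo sanity check of this example can be sketched in a few lines (not part of the text; the signal $s$, the seed, and the parameter values are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, sigma = 8, 1.5, 0.5        # arbitrary illustrative values
s = rng.standard_normal(n)           # known signal
trials = 200_000

# Generate Y = alpha*s + Z for many independent experiments.
Y = alpha * s + sigma * rng.standard_normal((trials, n))

# MVUE of alpha from (12.15): alpha_hat = (Y's) / ||s||^2.
alpha_hat = Y @ s / (s @ s)

print(alpha_hat.mean())                      # close to alpha: unbiased
print(alpha_hat.var(), sigma**2 / (s @ s))   # empirical variance vs sigma^2/||s||^2
```

The empirical variance matches $\sigma^2/\|s\|^2$, the quantity that reappears in Section 13.2 as the Cramér–Rao bound for this model.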
Example 12.11 Consider the same situation as in Example 12.10, except that here $\alpha$ is known and $\sigma^2$ is not. Thus we identify $\theta$ with $\sigma^2$ and write
$$p_\theta(y) = (2\pi\sigma^2)^{-n/2}\exp\Big\{-\frac{1}{2\sigma^2}\|y-\alpha s\|^2\Big\},$$
which is again a 1D canonical exponential family with
$$C(\theta) = (2\pi\sigma^2)^{-n/2},\quad \eta(\theta) = -\frac{1}{2\sigma^2},\quad T(y) = \|y-\alpha s\|^2,\quad h(y) = 1.$$
We have
$$\mathbb E_\theta[T(Y)] = \mathbb E_\theta\big[\|Y-\alpha s\|^2\big] = n\sigma^2,$$
hence
$$\hat\sigma^2(Y) \triangleq \frac{T(Y)}{n} = \frac 1n\|Y-\alpha s\|^2 \tag{12.16}$$
is an unbiased estimator of $\sigma^2$. Since $\Gamma = \mathbb R_-$ includes a 1D rectangle, $T(Y)$ is a complete sufficient statistic, and the estimator (12.16) is the unique MVUE.

Example 12.12 Consider the same example again, except that both $\alpha$ and $\sigma^2$ are unknown: $\theta = (\alpha, \sigma^2)\in\mathcal X = \mathbb R\times\mathbb R_+$. We factor $p_\theta$ as
$$p_\theta(y) = \underbrace{(2\pi\sigma^2)^{-n/2}\exp\Big\{-\frac{\alpha^2}{2\sigma^2}\|s\|^2\Big\}}_{=C(\theta)}\,\exp\Big\{\frac{\alpha}{\sigma^2}\,y^\top s - \frac{1}{2\sigma^2}\|y\|^2\Big\}.$$
This is a two-dimensional (2D) exponential family where
$$C(\theta) \triangleq (2\pi\sigma^2)^{-n/2}\exp\Big\{-\frac{\alpha^2}{2\sigma^2}\|s\|^2\Big\},\qquad h(y) \triangleq 1,$$
$$\eta_1(\theta) = \frac{\alpha}{\sigma^2},\quad T_1(y) \triangleq y^\top s,\qquad \eta_2(\theta) \triangleq -\frac{1}{2\sigma^2},\quad T_2(y) = \|y\|^2.$$
The sufficient statistic is the two-vector $T(Y) = \big[Y^\top s \;\; \|Y\|^2\big]^\top$. Moreover, $T(Y)$ is complete because $\Gamma$ contains a two-dimensional rectangle. We already have an unbiased estimator of $\alpha$ in the form of (12.15). We now seek an unbiased estimator of $\sigma^2$. For notational simplicity, we shorten $\mathbb E_\theta[\cdot]$ to $\mathbb E[\cdot]$. We have
$$\mathbb E[T_2(Y)] = \mathbb E[\|Y\|^2] = \mathbb E[\|\alpha s + Z\|^2] = \alpha^2\|s\|^2 + \mathbb E[\|Z\|^2] + 2\alpha\,\mathbb E[s^\top Z] = \alpha^2\|s\|^2 + n\sigma^2. \tag{12.17}$$
To construct an unbiased estimator of $\sigma^2$, we need to combine $T_2(Y)$ with $T_1^2(Y)$:
$$\mathbb E[T_1^2(Y)] = \mathbb E\big[(Y^\top s)^2\big] = \mathbb E\big[(\alpha\|s\|^2 + Z^\top s)^2\big] = \alpha^2\|s\|^4 + \mathbb E\big[(Z^\top s)^2\big] = \alpha^2\|s\|^4 + \sigma^2\|s\|^2,$$
and therefore
$$\mathbb E\bigg[\frac{T_1^2(Y)}{\|s\|^2}\bigg] = \alpha^2\|s\|^2 + \sigma^2. \tag{12.18}$$
Subtracting (12.18) from (12.17), we obtain
$$\mathbb E\bigg[T_2(Y) - \frac{T_1^2(Y)}{\|s\|^2}\bigg] = (n-1)\sigma^2.$$
Hence
$$\hat\sigma^2(Y) \triangleq \frac{1}{n-1}\bigg(T_2(Y) - \frac{T_1^2(Y)}{\|s\|^2}\bigg) = \frac{1}{n-1}\bigg(\|Y\|^2 - \frac{(Y^\top s)^2}{\|s\|^2}\bigg) = \frac{1}{n-1}\big(\|Y\|^2 - \hat\alpha^2\|s\|^2\big) = \frac{1}{n-1}\|Y - \hat\alpha s\|^2 \tag{12.19}$$
is an unbiased estimator of $\sigma^2$. Again, this is the unique MVUE. Note that the normalization factor is $\frac{1}{n-1}$ instead of $\frac 1n$, as might have been expected. The reason is that $\alpha$ is unknown, and estimating it by the least-squares estimate $\hat\alpha$ minimizes the residual sum of squares $\|Y - \hat\alpha s\|^2$, making its expectation slightly smaller than that of $\|Y - \alpha s\|^2$.
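The $\frac{1}{n-1}$ normalization in (12.19) can be checked numerically; the sketch below (parameter values arbitrary, not from the text) compares it with the naive $\frac 1n$ normalization:

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha, sigma2 = 10, 2.0, 0.81     # arbitrary illustrative values
s = rng.standard_normal(n)
trials = 200_000

Y = alpha * s + np.sqrt(sigma2) * rng.standard_normal((trials, n))

alpha_hat = Y @ s / (s @ s)                       # (12.15)
resid = Y - alpha_hat[:, None] * s                # Y - alpha_hat * s
sigma2_hat = (resid**2).sum(axis=1) / (n - 1)     # (12.19)
sigma2_naive = (resid**2).sum(axis=1) / n         # 1/n version, biased

print(sigma2_hat.mean())     # close to sigma2: unbiased
print(sigma2_naive.mean())   # close to (n-1)/n * sigma2: biased low
```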
Exercises
12.1 Minimal Sufficient Statistics. Consider the following families of distributions. In each case, find a minimal sufficient statistic, and explain why it must be minimal.
(a) $Y_i$, $1\le i\le n$, are i.i.d. samples from a Poisson distribution with parameter $\theta$, where $\theta > 0$.
(b) $Y_i$, $1\le i\le n$, are i.i.d. samples from a uniform distribution over the interval $[-\theta,\theta]$, where $\theta > 0$.
(c) $Y_i$ are independent $\mathcal N(\theta s_i, \sigma_i^2)$ random variables, for $1\le i\le n$. Assume that $s_i, \sigma_i^2 \neq 0$ for all $i$.
12.2 Minimal Sufficient Statistics. Let $Y_i = X_i S_i$, $i\in\{1,2,3\}$, be three random variables generated as follows. $X_1, X_2, X_3$ are i.i.d. random variables with distribution $P\{X=1\} = 1 - P\{X=2\} = \theta$, where $0 < \theta < 1$. $S_1, S_2, S_3$ are three independent (but not identically distributed) random variables with distributions $P\{S_i = 1\} = 1 - P\{S_i = -1\} = a_i$, where $a_1 = 0.5$, $a_2 = 1$, $a_3 = 0$. Find minimal sufficient statistics for estimating $\theta$ from the data $Y = [Y_1\,Y_2\,Y_3]^\top$.

12.3 Factorization Theorem. Let $q_1$ and $q_2$ be two pdfs with disjoint supports $\mathcal Y_1$ and $\mathcal Y_2$. Consider the mixture distribution
$$p_\theta(y) = \theta\,q_1(y) + (1-\theta)\,q_2(y),$$
where $\theta\in[0,1]$. Use the factorization theorem to show that $T(y) = \mathbb 1\{y\in\mathcal Y_1\}$ is a sufficient statistic for $\theta$.
12.4 Optimization of Unbiased Estimators. $Y_i$ are independent $\mathcal N(\theta, \sigma_i^2)$ random variables, for $1\le i\le n$. Assume that $\sigma_i^2 > 0$ for all $i$. Show that
$$\hat\theta(Y) = \sum_{i=1}^n a_i Y_i$$
is an unbiased estimator of $\theta$ for any sequence $a = \{a_i\}_{i=1}^n$ summing to 1. Derive the $a$ that minimizes the variance of this estimator over all such $a$. Is this estimator an MVUE?

12.5 Complete Sufficient Statistic. Suppose that given a parameter $\theta > 1$, $Y_1, Y_2, \ldots, Y_n$ are i.i.d. observations each with pdf
$$p_\theta(y) = (\theta-1)\,y^{-\theta}\,\mathbb 1\{y \ge 1\}.$$
That is,
$$p_\theta(y) = \prod_{k=1}^n (\theta-1)\,y_k^{-\theta}\,\mathbb 1\{y_k \ge 1\}.$$
Find a complete sufficient statistic for $\{p_\theta, \theta > 1\}$.
12.6 MVUE for Exponential Family. Show that the family $\{p_\theta\}$ of (12.12) is a one-dimensional exponential family, and derive the MVUE of $\theta$ given $n$ i.i.d. random variables drawn from $P_\theta$.

12.7 MVUE for Uniform Family. Suppose $\theta > 0$ is a parameter of interest and that given $\theta$, $Y_1, \ldots, Y_n$ are i.i.d. observations with marginal densities
$$p_\theta(y) = \frac 1\theta\,\mathbb 1\{y\in[0,\theta]\}.$$
(a) Prove that $T(y) = \max\{y_1, \ldots, y_n\}$ is a sufficient statistic for $\theta$.
(b) Prove that $T(y)$ is also complete. Note that you cannot apply the Completeness Theorem for Exponential Families. Hint: Use the fact that
$$\int_0^\theta f(t)\,\frac{t^{n-1}}{\theta^n}\,dt = 0,\ \forall\theta > 0 \;\Rightarrow\; f(t) = 0 \text{ on } [0,\infty).$$
(c) Find an MVUE of $\theta$ based on $y$.
12.8 Order Statistics. Consider data $Y = [Y_1\,Y_2\ldots Y_n]^\top$ and the reordered data $T(Y) = [Y_{(1)}\,Y_{(2)}\ldots Y_{(n)}]^\top$, where $Y_{(1)} \ge Y_{(2)} \ge \cdots \ge Y_{(n)}$. Clearly $T$ is a many-to-one map. Show that if $Y_1, Y_2, \ldots, Y_n$ are i.i.d. from a family of univariate distributions $\mathcal P = \{p_\theta, \theta\in\mathcal X\}$, then $T(Y)$ is a sufficient statistic for the i.i.d. extension of the family $\mathcal P$.

12.9 Unbiased Estimators. Consider the family of distributions $p_\theta(y) = 1 + 2\theta y$, where $|y|\le 1/2$ and $|\theta|\le 1/2$.
(a) Let $k$ be an arbitrary odd number. Show that $\hat\theta_k(Y) \triangleq 2^k(k+2)Y^k$ is an unbiased estimator of $\theta$, and derive its variance.
(b) Show that $v_k \triangleq \max_\theta \mathrm{Var}_\theta[\hat\theta_k(Y)]$ is achieved at $\theta = 0$.
(c) Show that among all the unbiased estimators of Part (a), $\hat\theta_1(Y) = 6Y$ achieves $\inf\{v_1, v_3, \ldots\}$.
(d) Consider the convex combination $\hat\theta_\lambda(Y) \triangleq \lambda\hat\theta_1(Y) + (1-\lambda)\hat\theta_3(Y)$, where $\lambda\in[0,1]$. Show that $\hat\theta_\lambda(Y)$ is an unbiased estimator of $\theta$, and derive its variance. Show that the variance is maximized when $\theta = 0$, and derive the $\lambda$ that minimizes $\max_\theta \mathrm{Var}_\theta[\hat\theta_\lambda(Y)]$.
12.10 MVUE for Poisson Family. Suppose $\{Y_k\}_{k=1}^n$ (with $n\ge 2$) are i.i.d. Poisson random variables with parameter (mean) $\theta > 0$, i.e.,
$$p_\theta(y_k) = \frac{e^{-\theta}\theta^{y_k}}{y_k!},\qquad y_k = 0, 1, 2, \ldots,$$
and $p_\theta(y) = \prod_{k=1}^n p_\theta(y_k)$. We wish to estimate the parameter $\lambda$ given by $\lambda = e^{-\theta}$.
(a) Show that $T(y) = \sum_{k=1}^n y_k$ is a complete sufficient statistic for this estimation problem.
(b) Show that
$$\hat\lambda_1(y) = \frac 1n \sum_{k=1}^n \mathbb 1\{y_k = 0\}$$
is an unbiased estimator of $\lambda$.
(c) Show that
$$\hat\lambda_2(y) = \Big(\frac{n-1}{n}\Big)^{T(y)}$$
is an MVUE of $\lambda$. Hint: You may use the fact that $T(y)$ is Poisson with parameter $n\theta$.

References
[1] E. L. Lehmann and G. Casella, Theory of Point Estimation, 2nd edition, Springer-Verlag, 1998.
[2] C. R. Rao, "Information and the Accuracy Attainable in the Estimation of Statistical Parameters," Bull. Calcutta Math. Soc., pp. 81–91, 1945.
[3] D. Blackwell, "Conditional Expectation and Unbiased Sequential Estimation," Ann. Math. Stat., pp. 105–110, 1947.
[4] L. Hörmander, An Introduction to Complex Analysis in Several Variables, Elsevier, 1973.
13
Information Inequality and Cramér–Rao Lower Bound
In this chapter, we introduce the information inequality, which provides a lower bound on the variance of any estimator of a scalar parameter. When applied to unbiased estimators, this lower bound is known as the Cramér–Rao lower bound (CRLB). The bound is expressed in terms of the Fisher information, which quantifies the information present in the data about the (nonrandom) parameter. We extend the bound to vector-valued parameters and to random parameters.

13.1 Fisher Information and the Information Inequality
In Chapter 12, we studied a systematic approach to finding estimators using the MVUE criterion. Unfortunately, in many cases, with the exception of exponential families, it may be difficult to find an MVUE, or an MVUE may not even exist. One approach in these cases is to design good suboptimal estimators and compare their performance to pick the best one. The performance measure of interest is the conditional risk $R_\theta(\hat\theta)$ for squared-error cost. If $\theta$ is real-valued, then
$$R_\theta(\hat\theta) = \mathrm{MSE}_\theta(\hat\theta) = \mathbb E_\theta\big[(\hat\theta(Y)-\theta)^2\big] = \mathrm{Var}_\theta\big(\hat\theta(Y)\big) + \big(b_\theta(\hat\theta)\big)^2, \quad \theta\in\mathcal X, \tag{13.1}$$
where the quantity
$$b_\theta(\hat\theta) \triangleq \mathbb E_\theta\big[\hat\theta(Y) - \theta\big] \tag{13.2}$$
is called the bias of the estimator $\hat\theta$. For a vector parameter $\theta\in\mathcal X\subseteq\mathbb R^m$, the bias (13.2) is an $m$-vector $b_\theta(\hat\theta)$. We are interested in the correlation matrix of the estimation error,
$$R_\theta(\hat\theta) \triangleq \mathbb E_\theta\big[(\hat\theta(Y)-\theta)(\hat\theta(Y)-\theta)^\top\big] = \mathrm{Cov}_\theta(\hat\theta) + \mathbb E_\theta\big[\hat\theta(Y)-\theta\big]\,\mathbb E_\theta\big[\hat\theta(Y)-\theta\big]^\top = \mathrm{Cov}_\theta(\hat\theta) + b_\theta(\hat\theta)\,b_\theta(\hat\theta)^\top, \tag{13.3}$$
and especially in its diagonal elements, which are the mean-squared values of the individual components of the estimation error vector. The total mean-squared error is given by the sum of these terms, i.e., the trace of the correlation matrix:
$$\mathrm{MSE}(\hat\theta) = \mathbb E_\theta\|\hat\theta(Y)-\theta\|^2 = \mathrm{Tr}\big[R_\theta(\hat\theta)\big] = \mathrm{Tr}\big[\mathrm{Cov}_\theta(\hat\theta)\big] + \|b_\theta(\hat\theta)\|^2. \tag{13.4}$$
It is often difficult, if not impossible, to derive closed-form expressions for the MSE. One approach then is to perform Monte Carlo simulations, in which $n$ i.i.d. observations $Y^{(i)}$, $i = 1, \ldots, n$, are generated from $p_\theta$. The estimator $\hat\theta(Y^{(i)})$ can be computed for each of these $n$ experiments, and the MSE can be estimated as
$$\widehat{\mathrm{MSE}}(\hat\theta) = \frac 1n \sum_{i=1}^n \big(\hat\theta(Y^{(i)}) - \theta\big)^2.$$
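The recipe above can be sketched as follows (a minimal illustration; the model $Y\sim\mathcal N(\theta,1)$ and the estimator $\hat\theta(Y) = Y$ are arbitrary choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)
theta = 1.0        # true (nonrandom) parameter, fixed for the simulation
n = 100_000        # number of Monte Carlo experiments

# Draw n i.i.d. observations Y^{(i)} from p_theta; here p_theta = N(theta, 1).
Y = theta + rng.standard_normal(n)

theta_hat = Y      # estimator theta_hat(Y) = Y

# Empirical MSE: average squared error over the n experiments.
mse_hat = np.mean((theta_hat - theta)**2)
print(mse_hat)     # close to the true MSE, Var(Y) = 1
```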
The main drawback of this approach is that it may not yield much insight, and that Monte Carlo simulations can be computationally expensive. This motivates a different approach, where a lower bound on the MSE is derived, which can be more easily evaluated. The main result is the information inequality, which is given in Theorem 13.1 for scalar parameters and is specialized to unbiased estimators in the next section. A key quantity that appears in the lower bound is Fisher information, which is deﬁned next. We make the following assumptions, which are known as regularity conditions. In particular, Assumptions (iv) and (v) state that the order of differentiation with respect to θ and integration with respect to Y may be switched. These assumptions are generally satisﬁed when Assumption (ii) holds.
Regularity Assumptions
(i) $\mathcal X$ is an open interval of $\mathbb R$ (possibly semi-infinite or infinite).
(ii) The support set $\mathrm{supp}\{p_\theta\} = \{y\in\mathcal Y : p_\theta(y) > 0\}$ is the same for all $\theta\in\mathcal X$.
(iii) $p'_\theta(y) = \frac{\partial p_\theta(y)}{\partial\theta}$ exists and is finite for all $\theta\in\mathcal X$ and $y\in\mathcal Y$.
(iv) $\int_{\mathcal Y} p'_\theta(y)\,d\mu(y) = \frac{\partial}{\partial\theta}\int_{\mathcal Y} p_\theta(y)\,d\mu(y) = 0,\ \forall\theta\in\mathcal X$.
(v) If $p''_\theta(y) = \frac{\partial^2 p_\theta(y)}{\partial\theta^2}$ exists and is finite $\forall\theta\in\mathcal X$, $\forall y\in\mathcal Y$, then $\int_{\mathcal Y} p''_\theta(y)\,d\mu(y) = \frac{\partial^2}{\partial\theta^2}\int_{\mathcal Y} p_\theta(y)\,d\mu(y) = 0$.
The notion of information contained in the observations about an unknown parameter was introduced by Edgeworth [1] and later fully developed by Fisher [2].

DEFINITION 13.1 The quantity
$$I_\theta \triangleq \mathbb E_\theta\bigg[\Big(\frac{\partial\ln p_\theta(Y)}{\partial\theta}\Big)^2\bigg] \tag{13.5}$$
is called the Fisher information for $\theta$.

LEMMA 13.1 Under Assumptions (i)–(iv) we have
$$\mathbb E_\theta\Big[\frac{\partial\ln p_\theta(Y)}{\partial\theta}\Big] = 0 \tag{13.6}$$
and
$$I_\theta = \mathrm{Var}_\theta\Big[\frac{\partial\ln p_\theta(Y)}{\partial\theta}\Big]. \tag{13.7}$$
If in addition Assumption (v) holds, then
$$I_\theta = \mathbb E_\theta\Big[-\frac{\partial^2\ln p_\theta(Y)}{\partial\theta^2}\Big]. \tag{13.8}$$

Proof The left-hand side of (13.6) is equal to $\int_{\mathcal Y} p'_\theta\,d\mu$, which is zero by Assumption (iv). Then (13.7) follows directly from (13.5) and (13.6). Finally, if Assumption (v) holds, then $\int_{\mathcal Y} p''_\theta\,d\mu = 0$ and
$$\frac{\partial^2\ln p_\theta(Y)}{\partial\theta^2} = \frac{\partial}{\partial\theta}\Big(\frac{p'_\theta(Y)}{p_\theta(Y)}\Big) = \frac{p''_\theta(Y)\,p_\theta(Y) - [p'_\theta(Y)]^2}{[p_\theta(Y)]^2} = \frac{p''_\theta(Y)}{p_\theta(Y)} - \Big(\frac{\partial\ln p_\theta(Y)}{\partial\theta}\Big)^2.$$
Taking the expectation of both sides with respect to $p_\theta$ establishes (13.8).
THEOREM 13.1 Information Inequality. Assume (i)–(iv) above hold. Let $\hat\theta(Y)$ be any estimator of $\theta\in\mathcal X$ satisfying the regularity condition¹
(vi) $\frac{d}{d\theta}\int_{\mathcal Y}\hat\theta(y)\,p_\theta(y)\,d\mu(y) = \int_{\mathcal Y}\hat\theta(y)\,p'_\theta(y)\,d\mu(y),\ \forall\theta\in\mathcal X$.
Then
$$\mathrm{Var}_\theta\big[\hat\theta(Y)\big] \ge \frac{\big(\frac{d}{d\theta}\mathbb E_\theta[\hat\theta(Y)]\big)^2}{I_\theta} = \frac{\big(\frac{d}{d\theta}b_\theta(\hat\theta) + 1\big)^2}{I_\theta}. \tag{13.9}$$

Proof Without loss of generality, assume $\mathcal Y = \mathrm{supp}\{p_\theta\}$. The proof is an application of the Cauchy–Schwarz inequality: for any two random variables $A$ and $B$ with finite variances, we have
$$\mathrm{Cov}(A,B) \le \sqrt{\mathrm{Var}(A)\,\mathrm{Var}(B)}. \tag{13.10}$$
We apply this inequality with $A = \hat\theta(Y)$ and $B = \frac{\partial\ln p_\theta(Y)}{\partial\theta}$, both of which are functions of $Y$. For any $\theta\in\mathcal X$, we have
$$\mathrm{Var}_\theta(A) = \mathrm{Var}_\theta\big[\hat\theta(Y)\big], \qquad \mathrm{Var}_\theta(B) = I_\theta,$$
where the second equality is the same as (13.7). By (13.6), $\mathbb E_\theta(B) = 0$, and thus
$$\mathrm{Cov}_\theta(A,B) = \mathbb E_\theta(AB) = \mathbb E_\theta\Big[\hat\theta(Y)\,\frac{\partial\ln p_\theta(Y)}{\partial\theta}\Big].$$

¹ Similarly to Assumption (iv), the order of differentiation with respect to $\theta$ and integration with respect to $Y$ may be switched.
Therefore (13.10) yields
$$\mathrm{Var}_\theta\big[\hat\theta(Y)\big]\,I_\theta \ge \bigg(\mathbb E_\theta\Big[\hat\theta(Y)\,\frac{\partial\ln p_\theta(Y)}{\partial\theta}\Big]\bigg)^2 = \bigg(\int_{\mathcal Y}\hat\theta(y)\,p'_\theta(y)\,d\mu(y)\bigg)^2 = \bigg(\frac{d}{d\theta}\mathbb E_\theta\big[\hat\theta(Y)\big]\bigg)^2,$$
where the last equality follows from Assumption (vi). This proves the information inequality (13.9).
13.2 Cramér–Rao Lower Bound

Observe that $\hat\theta$ appears on both sides of the information inequality (13.9). In particular, the right-hand side might be difficult to evaluate if no closed-form expression for the bias of $\hat\theta$ is available. For unbiased estimators, however, the information inequality takes a simpler form: then $b_\theta(\hat\theta) = 0$ for all $\theta\in\mathcal X$, and (13.9) reduces to the following lower bound on the variance, generally known as the Cramér–Rao lower bound [3, 4], although it was discovered earlier by Fréchet [5]:
$$\mathrm{Var}_\theta\big[\hat\theta(Y)\big] \ge \frac{1}{I_\theta}. \tag{13.11}$$
DEFINITION 13.2 Efficient Estimator. An unbiased estimator whose variance achieves the CRLB (13.11) with equality is said to be efficient.

An efficient estimator is necessarily an MVUE.
Example 13.1 Known Signal with Unknown Scale Factor. Let $Y = \theta s + Z$, where $Z\sim\mathcal N(0, \sigma^2 I_n)$, $s$ is a known signal, and $\theta\in\mathbb R$. Thus
$$\ln p_\theta(y) = -\frac n2\ln(2\pi\sigma^2) - \frac{\|y-\theta s\|^2}{2\sigma^2}$$
is twice differentiable with respect to $\theta$, and
$$\frac{\partial^2\ln p_\theta(y)}{\partial\theta^2} = -\frac{\|s\|^2}{\sigma^2}.$$
Thus we obtain the Fisher information from (13.8) as
$$I_\theta = \frac{\|s\|^2}{\sigma^2}, \tag{13.12}$$
which represents the SNR in the observations. We have seen in (12.15), (12.18) that the variance of the MVUE
$$\hat\theta_{\mathrm{MVUE}}(Y) = \frac{s^\top Y}{\|s\|^2}$$
is equal to
$$\mathrm{Var}_\theta\big[\hat\theta_{\mathrm{MVUE}}\big] = \frac{\sigma^2}{\|s\|^2}.$$
In view of (13.12), the MVUE achieves the CRLB (13.11) and is thus efficient. In other problems, the MVUE need not achieve the CRLB.
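Efficiency of this MVUE can be confirmed by simulation (a sketch; the values below are arbitrary illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(3)
n, theta, sigma = 6, 0.7, 1.2          # arbitrary illustrative values
s = rng.standard_normal(n)
trials = 200_000

I_theta = (s @ s) / sigma**2           # Fisher information (13.12)
crlb = 1.0 / I_theta                   # CRLB (13.11)

Y = theta * s + sigma * rng.standard_normal((trials, n))
theta_hat = Y @ s / (s @ s)            # the MVUE

print(theta_hat.var(), crlb)           # the two agree: the MVUE is efficient
```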
Example 13.2 Nonlinear Signal Model. Let $s(\theta)\in\mathbb R^n$ be a signal that is a twice-differentiable function of a scalar parameter $\theta\in\mathbb R$. The observations are the sum of this signal and i.i.d. Gaussian noise: $Y = s(\theta) + Z$, where $Z\sim\mathcal N(0,\sigma^2 I_n)$. Then
$$\ln p_\theta(y) = -\frac n2\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n\big(y_i - s_i(\theta)\big)^2$$
is twice differentiable with respect to $\theta$, and
$$\frac{\partial^2\ln p_\theta(Y)}{\partial\theta^2} = -\frac{1}{\sigma^2}\sum_{i=1}^n\Big[\big(s'_i(\theta)\big)^2 - \big(Y_i - s_i(\theta)\big)\,s''_i(\theta)\Big]. \tag{13.13}$$
Since $Y_i - s_i(\theta) = Z_i$ has zero mean, we obtain the Fisher information from (13.8) and (13.13) as
$$I_\theta = \frac{1}{\sigma^2}\sum_{i=1}^n\big[s'_i(\theta)\big]^2, \tag{13.14}$$
which may be thought of as an SNR, where the signal power is the energy in the derivative.

A particularly insightful application of this model is estimation of the angular frequency of a sinusoid: $s_i(\theta) = \cos(i\theta+\phi)$, where $\theta\in(0,\pi)$ is the angular frequency and $\phi\in[0,2\pi)$ is a known phase. Then $s'_i(\theta) = -i\sin(i\theta+\phi)$ and
$$I_\theta = \frac{1}{\sigma^2}\sum_{i=1}^n i^2\sin^2(i\theta+\phi). \tag{13.15}$$
Note that $I_\theta$ is generally large for large $n$, but the case of $\theta \ll 1/n$ (in which the observation window is much less than one cycle of the sinusoid) and $\phi = 0$ is rather different: then $\sin(i\theta+\phi)\approx i\theta$, and thus $I_\theta \approx \frac{\theta^2}{\sigma^2}\sum_{i=1}^n i^4$ could be very small. Intuitively, a good estimator of $\theta$ cannot exist because the signal $s_i(\theta)\approx 1 - i^2\theta^2/2$ depends only weakly on $\theta$ over the observation window.
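For the sinusoid example, the closed form (13.15) can be cross-checked against the generic formula (13.14) with the derivative $s'_i(\theta)$ computed numerically (a sketch; the parameter values are arbitrary):

```python
import numpy as np

n, theta, phi, sigma2 = 50, 0.3, 0.7, 1.0   # arbitrary illustrative values
i = np.arange(1, n + 1)

# Closed form (13.15): I = (1/sigma^2) * sum i^2 sin^2(i*theta + phi).
I_closed = (i**2 * np.sin(i * theta + phi)**2).sum() / sigma2

# Generic form (13.14) with s_i'(theta) from a central finite difference.
eps = 1e-6
s = lambda t: np.cos(i * t + phi)
s_prime = (s(theta + eps) - s(theta - eps)) / (2 * eps)
I_numeric = (s_prime**2).sum() / sigma2

print(I_closed, I_numeric)   # agree to high precision
```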
Example 13.3 Gaussian Observations with Parameterized Covariance Matrix. Assume that the observations are an $n$-vector $Y\sim\mathcal N(0, C(\theta))$, where the covariance matrix $C(\theta)$ is twice differentiable in $\theta$. Then the Fisher information is given by (see the derivation in Section 13.8)
$$I_\theta = \frac 12\,\mathrm{Tr}\Big[C^{-1}(\theta)\,\frac{dC(\theta)}{d\theta}\,C^{-1}(\theta)\,\frac{dC(\theta)}{d\theta}\Big]. \tag{13.16}$$
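A minimal numerical check of (13.16) can be sketched as follows; the choice $C(\theta) = \theta I_n$ is an assumption made here for illustration, because its Fisher information, $n/(2\theta^2)$, also appears later in (13.27):

```python
import numpy as np

n, theta = 5, 2.0                       # arbitrary illustrative values

C = theta * np.eye(n)                   # C(theta) = theta * I_n
dC = np.eye(n)                          # dC(theta)/dtheta
Cinv = np.linalg.inv(C)

# Formula (13.16): I = (1/2) Tr[C^{-1} C' C^{-1} C'].
I_theta = 0.5 * np.trace(Cinv @ dC @ Cinv @ dC)
print(I_theta)                          # n / (2 theta^2) = 0.625
```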
Figure 13.1 (a) A sharply peaked log-likelihood function; (b) a fairly flat log-likelihood function; (c) a completely flat log-likelihood function, for which the parameter $\theta$ is not identifiable.
13.3 Properties of Fisher Information

Relation to Estimation Accuracy. Both the information inequality and the Cramér–Rao bound suggest that the higher the Fisher information, the higher the potential estimation accuracy. (The caveat does not apply to efficient estimators, for which the Cramér–Rao bound holds with equality.) In Example 13.1, the Fisher information was directly related to the SNR. More generally, (13.8) shows that the Fisher information is the average curvature of the log-likelihood function. Figure 13.1 depicts different scenarios. In Figure 13.1(a), the log-likelihood function is sharply peaked, the Fisher information is large, and high estimation accuracy can be expected. In Figure 13.1(b), the log-likelihood function is fairly flat, the Fisher information is low, and poor estimation accuracy can be expected. In Figure 13.1(c), as an extreme case, the log-likelihood function is completely flat, the Fisher information is zero, and $\theta$ is not identifiable.
Relation to KL Divergence. Assume $\ln p_\theta(Y)$ is twice differentiable with respect to $\theta$, a.s. $P_\theta$. Consider a small variation $\epsilon$ of the parameter $\theta$. Assuming $\theta+\epsilon\in\mathcal X$ and taking the limit as $\epsilon$ tends to zero, we obtain
$$D(p_\theta\|p_{\theta+\epsilon}) = \mathbb E_\theta\Big[\ln\frac{p_\theta(Y)}{p_{\theta+\epsilon}(Y)}\Big] = \mathbb E_\theta\Big[-\epsilon\,\frac{\partial\ln p_\theta(Y)}{\partial\theta} - \frac{\epsilon^2}{2}\,\frac{\partial^2\ln p_\theta(Y)}{\partial\theta^2} + o(\epsilon^2)\Big] = \frac{\epsilon^2}{2}\,I_\theta + o(\epsilon^2) \quad\text{as }\epsilon\to 0, \tag{13.17}$$
where the last equality follows from (13.6) and (13.8). Hence the Fisher information also represents the local curvature of the KL divergence.
Bound on MSE. Combining the expressions for bias and variance, we have
$$\mathrm{MSE}_\theta(\hat\theta) = \mathrm{Var}_\theta(\hat\theta) + \big(b_\theta(\hat\theta)\big)^2 \ge \frac{\big(\frac{\partial}{\partial\theta}b_\theta(\hat\theta) + 1\big)^2}{I_\theta} + \big(b_\theta(\hat\theta)\big)^2.$$
Bounds on Estimation of Functions of θ. If we wish to estimate $g(\theta)$ using an estimator $\hat g(Y)$, then following the same steps that were used in proving Theorem 13.1, we get
$$\mathrm{Var}_\theta\big[\hat g(Y)\big] \ge \frac{\big(\frac{d}{d\theta}\mathbb E_\theta[\hat g(Y)]\big)^2}{I_\theta}. \tag{13.18}$$
If $\hat g(Y)$ is unbiased, i.e., if $\mathbb E_\theta[\hat g(Y)] = g(\theta)$, then
$$\mathrm{Var}_\theta\big[\hat g(Y)\big] \ge \frac{\big(\frac{d}{d\theta}g(\theta)\big)^2}{I_\theta} = \frac{(g'(\theta))^2}{I_\theta}. \tag{13.19}$$
We can consider the ratio $\frac{I_\theta}{(g'(\theta))^2}$ to be the Fisher information for the estimation of $g(\theta)$. In fact, if $g$ is a one-to-one (invertible) and differentiable function, this notion can be made more precise, as shown next.
Reparameterization. Let $g$ be a one-to-one continuously differentiable function and suppose we wish to estimate $\eta = g(\theta)$. Then the family of distributions $\{p_\theta, \theta\in\mathcal X\}$ is the same as the family $\{q_\eta, \eta\in g(\mathcal X)\}$, where $g(\mathcal X)$ is the image of $\mathcal X$ under the mapping $g$. The family $\{q_\eta, \eta\in g(\mathcal X)\}$ is simply a reparameterization of $\{p_\theta, \theta\in\mathcal X\}$. Using the chain rule for differentiation, we see that
$$\frac{\partial}{\partial\theta}\ln p_\theta(y) = \frac{\partial\eta}{\partial\theta}\,\frac{\partial}{\partial\eta}\ln q_\eta(y) = g'(\theta)\,\frac{\partial}{\partial\eta}\ln q_\eta(y). \tag{13.20}$$
Squaring and taking expectations on both sides, we get
$$I_\theta = \big(g'(\theta)\big)^2\,I_\eta. \tag{13.21}$$
Thus the Fisher information is not invariant to reparameterization via invertible functions.² Moreover, reparameterization does not generally preserve efficiency. For instance, as a special case of Example 13.1, we have that $\hat\theta = Y$ is an efficient estimator of $\theta$ when $Y\sim\mathcal N(\theta, 1)$. However, $Y^2$ is not an efficient estimator of $\theta^2$, since it is biased: $\mathbb E_\theta(Y^2) = \theta^2 + 1$.
Exponential Families. Consider the one-dimensional exponential family
$$p_\theta(y) = h(y)\,C(\theta)\,e^{g(\theta)T(y)}, \tag{13.22}$$
where the function $g(\cdot)$ is assumed to be invertible and continuously differentiable. To derive $I_\theta$, it is convenient to parameterize the family in canonical form,
$$q_\eta(y) = h(y)\,e^{\eta T(y) - A(\eta)},$$
with $\eta = g(\theta)$. The Fisher information for $\eta$ is given by
$$I_\eta = \mathbb E_\eta\Big[-\frac{\partial^2\ln q_\eta(Y)}{\partial\eta^2}\Big] \overset{(a)}{=} A''(\eta) = \mathrm{Var}_\eta\big[T(Y)\big] = \mathrm{Var}_\theta\big[T(Y)\big], \tag{13.23}$$
where equality (a) follows from (11.29). Applying the reparameterization formula (13.21), we conclude that
$$I_\theta = \big(g'(\theta)\big)^2\,\mathrm{Var}_\theta\big[T(Y)\big]. \tag{13.24}$$
The Gaussian example of Section 13.2 is a special case of (13.22) with $T(Y) = Y^\top s$ (hence $\mathrm{Var}_\theta[T(Y)] = \sigma^2\|s\|^2$) and $g(\theta) = \theta/\sigma^2$, thus $I_\theta = \|s\|^2/\sigma^2$.

² However, the KL divergence is invariant to reparameterization: $D(p_\theta\|p_{\theta'}) = D(q_\eta\|q_{\eta'})$ for $\eta = g(\theta)$, $\eta' = g(\theta')$.
i.i.d. Observations. Suppose the observations are a sequence $Y = (Y_1, \ldots, Y_n)$ whose elements are i.i.d. $\sim p_\theta$. The Fisher information for the $n$-fold product pdf $p_\theta^n$ is $n$ times the Fisher information for a single observation:
$$I_\theta^{(n)} = \mathbb E_\theta\Big[-\frac{\partial^2\ln p_\theta^n(Y)}{\partial\theta^2}\Big] = n\,I_\theta. \tag{13.25}$$
We shall see in Chapter 14 (ML estimation) that the loss of unbiasedness and efficiency under nonlinear reparameterization is mitigated in the setting with many i.i.d. observations. This phenomenon is observed with other estimators too, as illustrated in Exercise 13.14.
Local Bound. In some problems, the Cramér–Rao bound may be very loose. Consider the Gaussian family $\{p_\theta = \mathcal N(\sin\theta, \sigma^2),\ \theta\in\mathbb R\}$. Let $\eta = \sin\theta$. Using (13.12), we have $I_\eta = 1/\sigma^2$. From (13.21), we obtain $I_\theta = (\cos^2\theta)/\sigma^2$. As might be expected, $I_\theta$ is maximal for $\theta = 0$ and vanishes for $\theta = \pi/2$. However, note that accurate estimation performance cannot be expected for any value of $\theta$, since the values $\theta + 2\pi k$, $k\in\mathbb Z$, are indistinguishable. Even if $\sigma^2 \ll 1$ and the Fisher information is large, no estimator can perform well, unless it has some prior information about $\theta$, which is not available here.

Example 13.4 Variance Estimation. Let $Y\sim\mathcal N(0, \theta I_n)$, where $\theta > 0$. This is an exponential family where
$$\ln p_\theta(y) = -\frac n2\ln(2\pi\theta) - \frac{1}{2\theta}\sum_{i=1}^n y_i^2. \tag{13.26}$$
Direct evaluation of (13.8) yields
$$I_\theta = \mathbb E_\theta\Big[-\frac{n}{2\theta^2} + \frac{1}{\theta^3}\sum_{i=1}^n Y_i^2\Big] = \frac{n}{2\theta^2}. \tag{13.27}$$
Alternatively, the Fisher information could have been derived using the canonical representation
$$\ln q_\eta(y) = \frac n2\ln\frac{\eta}{2\pi} + \eta\,T(y)$$
of this exponential family, with $\eta = g(\theta) \triangleq \theta^{-1}$ and $T(Y) = -\frac 12\sum_{i=1}^n Y_i^2$. Then $I_\eta = \mathrm{Var}[T(Y)] = \frac{n}{2\eta^2}$ by application of (13.23). Since $[g'(\theta)]^2 = \theta^{-4}$, application of (13.24) yields (13.27) again. Another interesting parameterization of this family is given by $\xi = \ln\theta\in\mathbb R$, in which case $I_\xi = \frac n2$, independently of $\xi$ (the proof is left to the reader).
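Identity (13.7) offers a direct numerical check of (13.27): the empirical variance of the score should equal $n/(2\theta^2)$. A sketch (the values below are arbitrary, not from the text):

```python
import numpy as np

rng = np.random.default_rng(4)
n, theta = 5, 2.0                # arbitrary illustrative values
trials = 400_000

# Y ~ N(0, theta * I_n); score = d/dtheta ln p_theta(Y)
#   = -n/(2 theta) + (1/(2 theta^2)) * sum_i Y_i^2.
Y = np.sqrt(theta) * rng.standard_normal((trials, n))
score = -n / (2 * theta) + (Y**2).sum(axis=1) / (2 * theta**2)

print(score.mean())                      # close to 0, cf. (13.6)
print(score.var(), n / (2 * theta**2))   # empirical vs exact Fisher information
```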
13.4 Conditions for Equality in Information Inequality

THEOREM 13.2 Let $\mathcal X$ be an open interval $(a, b)$ and assume the regularity conditions of Theorem 13.1 hold. The information inequality (13.9) holds with equality for some estimator $\hat\theta(Y)$ if and only if $\{p_\theta, \theta\in\mathcal X\}$ is a one-dimensional exponential family and $\hat\theta(Y)$ is a sufficient statistic for $\theta$. Moreover, if $\hat\theta(Y)$ is an unbiased estimator of $\theta$, then $\hat\theta(Y)$ is an MVUE and
$$p_\theta(y) = h(y)\exp\Big\{\int_a^\theta I_{\theta'}\,\big(\hat\theta(y) - \theta'\big)\,d\theta'\Big\}. \tag{13.28}$$
Proof The Cauchy–Schwarz inequality (13.10) holds with equality if and only if there exists a constant $c$ independent of $(A, B)$ such that $B - \mathbb E(B) = c\,[A - \mathbb E(A)]$. Hence the information inequality (13.9) holds with equality if and only if there exists a function $c(\theta)$, $\theta\in\mathcal X$, such that
$$\frac{\partial\ln p_\theta(Y)}{\partial\theta} - \underbrace{\mathbb E_\theta\Big[\frac{\partial\ln p_\theta(Y)}{\partial\theta}\Big]}_{=0} = c(\theta)\,\big[\hat\theta(Y) - \mathbb E_\theta[\hat\theta(Y)]\big] \quad\text{a.s. } p_\theta. \tag{13.29}$$
The function $c(\theta)$ may be identified by taking the variance of both sides of this equation and using the expression (13.7) for the Fisher information:
$$I_\theta = c^2(\theta)\,\mathrm{Var}_\theta\big[\hat\theta(Y)\big].$$
Substituting into (13.9), we obtain
$$c(\theta) = \frac{I_\theta}{\frac{d}{d\theta}\mathbb E_\theta[\hat\theta(Y)]}, \tag{13.30}$$
where the denominator is equal to 1 for unbiased estimators. Let $m(\theta) \triangleq \mathbb E_\theta[\hat\theta(Y)]$. Integrating (13.29) over $\theta$, we obtain
$$\ln p_\theta(y) = \int_a^\theta c(\theta')\,\big(\hat\theta(y) - m(\theta')\big)\,d\theta' + b(y) = \int_a^\theta \frac{I_{\theta'}}{m'(\theta')}\,\big(\hat\theta(y) - m(\theta')\big)\,d\theta' + b(y) = \hat\theta(y)\int_a^\theta \frac{I_{\theta'}}{m'(\theta')}\,d\theta' - \int_a^\theta \frac{I_{\theta'}\,m(\theta')}{m'(\theta')}\,d\theta' + b(y) \quad\text{a.s. } p_\theta, \tag{13.31}$$
where $b(y)$ is an arbitrary function of $y$ but not of $\theta$. Equivalently,
$$p_\theta(y) = C(\theta)\,h(y)\,e^{\hat\theta(y)\,\eta(\theta)}, \qquad \theta\in\mathcal X,$$
is an exponential family, where $\hat\theta(Y)$ is a sufficient statistic and
$$C(\theta) = \exp\Big\{-\int_a^\theta \frac{I_{\theta'}\,m(\theta')}{m'(\theta')}\,d\theta'\Big\}, \qquad h(y) = e^{b(y)}, \qquad \eta(\theta) = \int_a^\theta \frac{I_{\theta'}}{m'(\theta')}\,d\theta'.$$
If $\hat\theta(Y)$ is an unbiased estimator, then $m(\theta) = \theta$, and therefore (13.28) follows from (13.31).

13.5 Vector Parameters
The notions developed so far in this chapter can be extended to vector parameters. Consider a parametric family $p_\theta(Y)$, $\theta\in\mathcal X\subseteq\mathbb R^m$, where $\theta$ is the unknown $m$-vector. Assume the pdf $p_\theta(y)$ is differentiable with respect to $\theta$, and denote by $\nabla_\theta f(\theta, y) = \big[\frac{\partial f(\theta,y)}{\partial\theta_i}\big]_{i=1}^m$ the gradient of a real-valued function $f(\theta, y)$ with respect to $\theta$, and by $\nabla^2_\theta f(\theta, y) = \big[\frac{\partial^2 f(\theta,y)}{\partial\theta_i\partial\theta_j}\big]_{i,j=1}^m$ the Hessian of $f$ with respect to $\theta$. The subscript $\theta$ on the gradient and the Hessian will be omitted when the function $f$ depends on $\theta$ only.

DEFINITION 13.3 The $m\times m$ Fisher information matrix of $\theta$ is
$$\mathbf I_\theta \triangleq \mathbb E_\theta\big[(\nabla_\theta\ln p_\theta(Y))\,(\nabla_\theta\ln p_\theta(Y))^\top\big]. \tag{13.32}$$

Note that $\mathbf I_\theta \succeq 0$, i.e., it is a nonnegative definite (positive semidefinite) matrix. Its entries are given by
$$(\mathbf I_\theta)_{ij} = \mathbb E_\theta\Big[\frac{\partial\ln p_\theta(Y)}{\partial\theta_i}\,\frac{\partial\ln p_\theta(Y)}{\partial\theta_j}\Big], \quad 1\le i, j\le m. \tag{13.33}$$
Similarly to (i)–(v) in Section 13.1, we assume the following regularity conditions hold:
(i) $\mathcal X$ is an open hyperrectangle in $\mathbb R^m$.
(ii) The support set $\mathrm{supp}\{p_\theta\} = \{y\in\mathcal Y : p_\theta(y) > 0\}$ is the same for all $\theta\in\mathcal X$.
(iii) The partial derivatives $\frac{\partial p_\theta(y)}{\partial\theta_i}$, $i = 1, 2, \ldots, m$, exist and are finite for all $\theta\in\mathcal X$ and $y\in\mathcal Y$.
(iv) $\int_{\mathcal Y}\nabla_\theta p_\theta(y)\,d\mu(y) = \nabla_\theta\int_{\mathcal Y}p_\theta(y)\,d\mu(y) = 0,\ \forall\theta\in\mathcal X$.
(v) If $\nabla^2_\theta p_\theta(y)$ exists and is finite $\forall\theta\in\mathcal X$, $\forall y\in\mathcal Y$, then $\int_{\mathcal Y}\nabla^2_\theta p_\theta(y)\,d\mu(y) = \nabla^2_\theta\int_{\mathcal Y}p_\theta(y)\,d\mu(y) = 0$.
The following lemma is an extension of Lemma 13.1 to the vector case. The lemma is used to prove Theorem 13.4, which is an extension of the CRLB (13.11) to the vector case.

LEMMA 13.2 Under Assumptions (i)–(iv) we have
$$\mathbb E_\theta\big[\nabla_\theta\ln p_\theta(Y)\big] = 0 \tag{13.34}$$
and
$$\mathbf I_\theta = \mathrm{Cov}_\theta\big[\nabla_\theta\ln p_\theta(Y)\big]. \tag{13.35}$$
If in addition Assumption (v) holds, then
$$\mathbf I_\theta = \mathbb E_\theta\big[-\nabla^2_\theta\ln p_\theta(Y)\big]. \tag{13.36}$$
Analogously to (13.18) and (13.19), the following result holds by application of the Cauchy–Schwarz inequality.

THEOREM 13.3 Let $g(\theta)$ be a real-valued function and $\hat g(Y)$ an estimator of $g(\theta)$. Assume the following regularity condition holds:
(vi) $\nabla_\theta\int_{\mathcal Y}\hat g(y)\,p_\theta(y)\,d\mu(y) \overset{(a)}{=} \int_{\mathcal Y}\hat g(y)\,\nabla_\theta p_\theta(y)\,d\mu(y)$.
Define $d \triangleq \nabla_\theta\mathbb E_\theta[\hat g(Y)]$ (which generally depends on $\theta$, but the dependency is not explicitly shown here, to simplify the notation). Then
$$\mathrm{Var}_\theta\big[\hat g(Y)\big] \ge d^\top\,\mathbf I_\theta^{-1}\,d. \tag{13.37}$$
If $\hat g(Y)$ is an unbiased estimator, then $d = \nabla g(\theta)$ and the right-hand side of (13.37) does not depend on $\hat g$.

Proof We have
$$d = \nabla_\theta\int_{\mathcal Y}\hat g(y)\,p_\theta(y)\,d\mu(y) \overset{(a)}{=} \int_{\mathcal Y}\hat g(y)\,\nabla_\theta p_\theta(y)\,d\mu(y) = \mathbb E_\theta\big[\hat g(Y)\,\nabla_\theta\ln p_\theta(Y)\big],$$
where equality (a) holds by the regularity assumption (vi). Fix any $m$-vector $a$. By (13.34) and (13.35), the random variable $Z \triangleq a^\top\nabla_\theta\ln p_\theta(Y)$ has mean $\mathbb E_\theta(Z) = 0$ and variance $\mathrm{Var}_\theta(Z) = a^\top\mathbf I_\theta\,a$. By (13.10), we have
$$\mathrm{Var}_\theta\big[\hat g(Y)\big]\,\mathrm{Var}_\theta(Z) \ge \big(\mathrm{Cov}_\theta(\hat g(Y), Z)\big)^2.$$
Moreover, $\mathrm{Cov}_\theta(\hat g(Y), Z) = \mathbb E_\theta[\hat g(Y)\,Z] = a^\top d$, and therefore
$$\mathrm{Var}_\theta\big[\hat g(Y)\big] \ge \frac{(a^\top d)^2}{a^\top\mathbf I_\theta\,a}.$$
This inequality holds for all $a\in\mathbb R^m$. We now show that the supremum of the right-hand side over all $a\in\mathbb R^m$ is equal to $d^\top\mathbf I_\theta^{-1}d$, from which (13.37) follows.
Let $b = \mathbf I_\theta^{1/2}a$. By the Cauchy–Schwarz inequality applied to vectors in $\mathbb R^m$, we have
$$\frac{(a^\top d)^2}{a^\top\mathbf I_\theta\,a} = \frac{\big(b^\top\mathbf I_\theta^{-1/2}d\big)^2}{b^\top b} \le \big\|\mathbf I_\theta^{-1/2}d\big\|^2 = d^\top\mathbf I_\theta^{-1}d,$$
with equality if $b = c\,\mathbf I_\theta^{-1/2}d$ (hence $a = c\,\mathbf I_\theta^{-1}d$) for some constant $c\neq 0$.

THEOREM 13.4 CRLB for Vector Parameters. For any unbiased estimator $\hat\theta(Y)$,
$$\mathrm{Cov}_\theta\big(\hat\theta(Y)\big) \succeq \mathbf I_\theta^{-1}, \tag{13.38}$$
i.e., $\mathrm{Cov}_\theta(\hat\theta(Y)) - \mathbf I_\theta^{-1}$ is positive semidefinite.

Proof Fix any $a\in\mathbb R^m$ and let $\hat g(Y) = a^\top\hat\theta(Y)$. Then
$$d = \nabla_\theta\mathbb E_\theta\big[\hat g(Y)\big] = \nabla_\theta(a^\top\theta) = a,$$
and (13.37) yields
$$a^\top\mathrm{Cov}_\theta\big(\hat\theta(Y)\big)\,a \ge a^\top\mathbf I_\theta^{-1}a,$$
which implies that
$$a^\top\Big(\mathrm{Cov}_\theta\big(\hat\theta(Y)\big) - \mathbf I_\theta^{-1}\Big)a \ge 0.$$
Since this inequality holds for any $a\in\mathbb R^m$, the claim is proved.

By carefully following the proof of this theorem, one can also obtain the following result (see Exercise 13.5).

COROLLARY 13.1 Attainment of CRLB in Exponential Families. Equality holds in (13.38) if $p_\theta$ satisfies the vector identity
$$\nabla_\theta\ln p_\theta(y) = \mathbf I_\theta\,\big(\hat\theta(y) - \theta\big) \quad \forall y\in\mathcal Y.$$
DEFINITION 13.4 Efficient Estimator. An unbiased estimator whose covariance matrix achieves the CRLB (13.38) with equality is said to be efficient.

Of special interest are the variances of the individual components of $\hat\theta$, which are also the diagonal terms of the covariance matrix. Since the diagonal entries of a nonnegative definite matrix are nonnegative, we have
$$\mathrm{Var}_\theta[\hat\theta_i] \ge \big(\mathbf I_\theta^{-1}\big)_{i,i}, \quad i = 1, \ldots, m. \tag{13.39}$$
On the other hand, the CRLB for scalar parameters gives
$$\mathrm{Var}_\theta[\hat\theta_i] \ge \frac{1}{(\mathbf I_\theta)_{i,i}}, \tag{13.40}$$
where
$$(\mathbf I_\theta)_{i,i} = \mathbb E_\theta\Big[-\frac{\partial^2\ln p_\theta(Y)}{\partial\theta_i^2}\Big].$$
13.5
Since
Vector Parameters
± −1² I
θ
1 ≥ ¿ À i ,i Iθ
309
(13.41)
i,i
for a positive deﬁnite matrix, the CRLB for the vector parameter gives a tighter bound (see also Exercise 13.7). If Iθ is a diagonal matrix, the bounds given in (13.39 and 13.40) are equal. In this case, the components of θ are said to be orthogonal parameters.
Rank  The Fisher information matrix need not have full rank. For instance, let m = 2 and consider a variation of Example 13.1 where θ is replaced by θ₁ + θ₂, and 𝒳 = ℝ². Then the 2 × 2 Fisher information matrix

I_θ = (‖s‖²/σ²) [1 1; 1 1]

has rank one. In this problem the parameter θ appears only through the sum θ₁ + θ₂ of its components, and the difference θ₁ − θ₂ is not identifiable. Another way one can have a rank-deficient Fisher information matrix is via redundancy in the parameterization. Consider for instance 𝒳 = {θ ∈ ℝ² : θ₁ = θ₂}, which describes a line in ℝ². Then all partial derivatives in (13.33) are equal, and I_θ is rank-deficient.
Relation to KL Divergence  Assume ∇²_θ ln p_θ(Y) exists so that the expression (13.36) for the Fisher information matrix holds. Consider a small variation εa of the parameter θ in direction a. Assuming θ + εa ∈ 𝒳, we obtain, similarly to (13.17),

D(p_θ ‖ p_{θ+εa}) = E_θ[ln (p_θ(Y)/p_{θ+εa}(Y))]
  = E_θ[−ε aᵀ∇_θ ln p_θ(Y) − (ε²/2) aᵀ(∇²_θ ln p_θ(Y)) a] + o(ε²)
  = (ε²/2) aᵀ I_θ a + o(ε²)    (13.42)

as ε → 0.
The quadratic form aᵀ I_θ a is positive definite and therefore induces a metric

d_F(θ, θ′) ≜ √((θ − θ′)ᵀ I_θ (θ − θ′))    (13.43)

on the parameter space 𝒳; this distance is called the Fisher information metric.
Reparameterization  Let η = g(θ), where the mapping g is assumed to be continuously differentiable with Jacobian J = [∂η_j/∂θ_i]_{1≤i,j≤m}. By the chain rule for differentiation, we have

∂ ln p_θ(Y)/∂θ_i = Σ_{j=1}^m (∂η_j/∂θ_i) ∂ ln q_η(Y)/∂η_j,    i = 1, 2, …, m,

or equivalently, in vector form,

∇_θ ln p_θ(Y) = J ∇_η ln q_η(Y).    (13.44)

Taking the outer product of this vector with itself and taking expectations, we see that the Fisher information matrix for η satisfies

I_θ = J I_η Jᵀ.    (13.45)

Exponential Family  For an m-dimensional exponential family in canonical form, combining (11.26), (11.29), and (13.36) we obtain

I_η = ∇² A(η) = Cov_η[T(Y)] = Cov_θ[T(Y)].    (13.46)
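The reparameterization rule (13.45) can be checked numerically in the scalar case m = 1. In the sketch below (the family and all constants are our own illustrative choices), Y ~ 𝒩(0, θ) with η = ln θ, for which I_θ = 1/(2θ²), I_η = 1/2, and J = dη/dθ = 1/θ; a Monte Carlo estimate of the score variance is compared against the closed form.

```python
import numpy as np

# Scalar sanity check of I_theta = J I_eta J^T (13.45) for Y ~ N(0, theta),
# eta = ln(theta). Known closed forms (our illustrative family):
#   I_theta = 1/(2 theta^2),  I_eta = 1/2,  J = d eta / d theta = 1/theta.
theta = 0.7
I_theta = 1.0 / (2.0 * theta**2)
I_eta = 0.5
J = 1.0 / theta
assert np.isclose(I_theta, J * I_eta * J)

# Monte Carlo estimate of I_theta as the variance of the score
# d/dtheta ln p_theta(Y) = -1/(2 theta) + Y^2/(2 theta^2).
rng = np.random.default_rng(1)
y = rng.normal(0.0, np.sqrt(theta), size=200_000)
score = -1.0 / (2.0 * theta) + y**2 / (2.0 * theta**2)
assert abs(score.var() - I_theta) < 0.05 * I_theta
```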
Example 13.5  Fisher Information Matrix for Covariance Estimation in Gaussian Model. Let Y_i ~ i.i.d. 𝒩(0, C) for 1 ≤ i ≤ n, where C is an r × r covariance matrix. Viewing C as the parameter of the family, we have

ln p_C(Y) = −(rn/2) ln(2π) − (n/2) ln|C| − (1/2) Σ_{i=1}^n Y_iᵀ C⁻¹ Y_i
  = −(rn/2) ln(2π) − (n/2) ln|C| − (n/2) Tr[S C⁻¹],

where S = (1/n) Σ_{i=1}^n Y_i Y_iᵀ is the r × r sample covariance matrix. Again it is convenient to use the canonical representation of the family, where T(Y) = −(n/2) S and η = C⁻¹ is the inverse covariance matrix, or precision matrix. Then

ln q_η(S) = −(rn/2) ln(2π) + (n/2) ln|η| − (n/2) Σ_{i,j=1}^r S_{ji} η_{ij}.

By symmetry of the precision matrix, this is an exponential family of order r(r+1)/2. We have

E_C[T(Y)] = −(n/2) C,

and the Fisher information matrix for η is given by (proof left to the reader)

(I_η)_{ij,kl} = Cov_C[T_{ij}(Y), T_{kl}(Y)]
  = (n/2) C_{ij}²                          if (ij) = (kl),
    (n/2) C_{ik} C_{jl}                    if i = k, j ≠ l or i ≠ k, j = l,
    (n/4) [C_{ik} C_{jl} + C_{il} C_{jk}]  if i ≠ k, j ≠ l,

for 1 ≤ i, j, k, l ≤ r. Alternatively, this result could have been derived using the formulas for first and second partial derivatives of the log-determinant function ln|η|. Finally, note that since η has r² entries, the Fisher information matrix I_η has dimensions r² × r². However, η is symmetric and has only r(r+1)/2 degrees of freedom, hence the rank of I_η is at most r(r+1)/2.
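Fisher information computations for covariance parameters can be sanity-checked by simulation. The sketch below (the one-parameter family C(θ) = θC₀ and all constants are our own choices) uses the general formula I_θ = (1/2)Tr[(C⁻¹ dC/dθ)²] of (13.16), which here reduces to r/(2θ²), and compares it against the empirical variance of the score.

```python
import numpy as np

# Monte Carlo check of Fisher information for a covariance parameter:
# Y ~ N(0, C(theta)) with C(theta) = theta * C0 (our own choice), so
# (13.16) gives I_theta = 0.5 * Tr[(C^{-1} dC/dtheta)^2] = r / (2 theta^2).
rng = np.random.default_rng(4)
r, theta = 3, 1.5
A = rng.normal(size=(r, r))
C0 = A @ A.T + r * np.eye(r)          # fixed positive definite "shape" matrix
C = theta * C0
Cinv = np.linalg.inv(C)
dC = C0                               # dC(theta)/dtheta

I_formula = 0.5 * np.trace(Cinv @ dC @ Cinv @ dC)
assert np.isclose(I_formula, r / (2 * theta**2))

# The same number appears as the variance of the score.
m = 200_000
Y = rng.multivariate_normal(np.zeros(r), C, size=m)
M = Cinv @ dC @ Cinv
quad = np.einsum('ij,jk,ik->i', Y, M, Y)        # Y_i^T M Y_i for each sample
score = -0.5 * np.trace(Cinv @ dC) + 0.5 * quad
assert abs(score.mean()) < 0.01                 # score has zero mean
assert abs(score.var() - I_formula) < 0.05 * I_formula
```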
13.6 Information Inequality for Random Parameters
If θ is a random vector, a lower bound on the correlation matrix of the estimation error vector can also be derived. Denote by π(θ) the prior density and by p(θ, y) = p_θ(y) π(θ) the joint density, assumed to be twice differentiable in θ, for each y ∈ 𝒴. Define the total Fisher information matrix

I_tot ≜ E[(∇_Θ ln p(Θ, Y))(∇_Θ ln p(Θ, Y))ᵀ] = E[−∇²_Θ ln p(Θ, Y)],    (13.47)
where the expectation is with respect to both Y and Θ. Since ln p(θ, Y) = ln p_θ(Y) + ln π(θ), the total Fisher information matrix can be split into two parts

I_tot = I_D + I_P,    (13.48)

where

I_D = E[−∇²_Θ ln p_Θ(Y)] ⪰ 0    (13.49)

is the average Fisher information matrix associated with the observations, and

I_P = E[−∇²_Θ ln π(Θ)] ⪰ 0    (13.50)

is the Fisher information matrix associated with the prior. Using a derivation analogous to the information inequality, one may show that the correlation matrix for the estimation error vector satisfies the following inequality [6, Ch. 2.4.3]:
E[(θ̂ − Θ)(θ̂ − Θ)ᵀ] ⪰ I_tot⁻¹    (13.51)

with equality if and only if

∇²_Θ ln p(Θ, Y) = −Q    a.s.

for some constant nonnegative definite matrix Q. Since ln p(θ, y) = ln π(θ|y) + ln p(y), this implies

∇²_θ ln π(θ|y) = −Q.

Integrating twice over θ, we obtain

ln π(θ|y) = −(1/2) θᵀQθ + a(y)ᵀθ + b(y)

for some vector a(y) and scalar b(y). Hence the posterior distribution π(θ|y) must be Gaussian for equality to hold in (13.51) and therefore for the estimator to be efficient.
Example 13.6  Let Y ~ 𝒩(θ, C) be a Gaussian vector with unknown mean θ ∈ ℝᵐ and known covariance C. Assume θ is also random with Gaussian prior 𝒩(0, C_Θ). The marginal distribution of Y is 𝒩(0, C_Θ + C), and the posterior distribution of θ given Y is 𝒩(θ̂_MMSE(Y), C_E), where

θ̂_MMSE(Y) = C_Θ (C_Θ + C)⁻¹ Y    (13.52)

is the (linear) MMSE estimator, and

C_E = C_Θ − C_Θ (C_Θ + C)⁻¹ C_Θ = (C_Θ⁻¹ + C⁻¹)⁻¹

is its covariance matrix. We have

ln p_θ(Y) = −(m/2) ln(2π) − (1/2) ln|C| − (1/2)(Y − θ)ᵀ C⁻¹ (Y − θ),

and thus (13.49) yields

I_D = C⁻¹.

Likewise

ln π(θ) = −(m/2) ln(2π) − (1/2) ln|C_Θ| − (1/2) θᵀ C_Θ⁻¹ θ,

and thus (13.50) yields

I_P = C_Θ⁻¹.

The total Fisher information (13.47) is the sum of the inverse covariance matrices:

I_tot = C⁻¹ + C_Θ⁻¹.

Then

E[(θ̂ − θ)(θ̂ − θ)ᵀ] ⪰ (C⁻¹ + C_Θ⁻¹)⁻¹

for any estimator θ̂. Equality is achieved by the MMSE estimator (13.52).
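The equality claim in Example 13.6 can be checked by simulation. The sketch below (dimensions, covariances, and sample counts are our own choices) first verifies the matrix identity C_E = (C⁻¹ + C_Θ⁻¹)⁻¹ algebraically, then confirms by Monte Carlo that the empirical error covariance of the MMSE estimator (13.52) matches the Bayesian bound.

```python
import numpy as np

# Sanity check of Example 13.6 with a small random 2x2 setup (our own choices).
rng = np.random.default_rng(2)
A = rng.normal(size=(2, 2)); C  = A @ A.T + np.eye(2)   # observation noise covariance
B = rng.normal(size=(2, 2)); CT = B @ B.T + np.eye(2)   # prior covariance C_Theta

CE = CT - CT @ np.linalg.inv(CT + C) @ CT
bound = np.linalg.inv(np.linalg.inv(C) + np.linalg.inv(CT))
assert np.allclose(CE, bound)                            # the two forms of C_E agree

# Monte Carlo: draw theta ~ N(0, CT), Y = theta + noise, apply (13.52).
n = 200_000
theta = rng.multivariate_normal(np.zeros(2), CT, size=n)
Y = theta + rng.multivariate_normal(np.zeros(2), C, size=n)
W = CT @ np.linalg.inv(CT + C)                           # MMSE gain matrix
err = (Y @ W.T) - theta
emp = err.T @ err / n                                    # empirical error covariance
assert np.allclose(emp, bound, atol=0.05)                # equality in (13.51), up to MC noise
```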
13.7 Biased Estimators

Recall that an MVUE is said to be efficient if its variance achieves the CRLB. However, a biased estimator might achieve lower mean squared error (MSE) than does the MVUE, as illustrated in Example 13.7.
Example 13.7  Given Y_i ~ i.i.d. 𝒩(0, θ), 1 ≤ i ≤ n, estimate the unknown variance θ. The likelihood function is given by

p_θ(y) = (2πθ)^{−n/2} exp(−(1/(2θ)) Σ_{i=1}^n y_i²).    (13.53)

We have seen in Example 12.11 that the MVUE of θ is

θ̂_MVUE(Y) = (1/n) Σ_{i=1}^n Y_i².    (13.54)

Now, consider a biased estimator of the form

θ̂_a(Y) = a θ̂_MVUE(Y) = (a/n) Σ_{i=1}^n Y_i²,    (13.55)
where a ∈ ℝ is a scaling factor to be optimized. The MSE of the estimator θ̂_a is given by

MSE_θ(θ̂_a) = E_θ[(θ̂_a(Y) − θ)²]
  = E_θ[((a/n) Σ_{i=1}^n Y_i² − θ)²]
  = E_θ[(a²/n²)(Σ_{i=1}^n Y_i²)² − (2θa/n) Σ_{i=1}^n Y_i² + θ²]
  = [a²(1 + 2/n) − 2a + 1] θ²,    (13.56)

where the last line follows from

E_θ[(Σ_{i=1}^n Y_i²)²] = E_θ[Σ_{i,j=1}^n Y_i² Y_j²] = Σ_{i=1}^n E_θ[Y_i⁴] + Σ_{i≠j} E_θ[Y_i²] E_θ[Y_j²] = 3nθ² + n(n−1)θ² = (n² + 2n)θ².

Using (13.56) with a = 1, we obtain MSE_θ(θ̂_MVUE) = 2θ²/n. However, the value of a that minimizes (13.56) is

a_opt = 1/(1 + 2/n) = n/(n + 2) < 1,    (13.57)

and the resulting MSE is obtained by substituting (13.57) into (13.56):

MSE_θ(θ̂_{a_opt}) = [1/(1 + 2/n) − 2/(1 + 2/n) + 1] θ² = (2/(n + 2)) θ² = (n/(n + 2)) MSE_θ(θ̂_MVUE).

It is remarkable that, for n = 1, we have a_opt = 1/3, and θ̂_{a_opt}(Y) = Y²/3. The MSE of this biased estimator is three times smaller than that of the MVUE. More generally, owing to the bias–variance decomposition (13.1), we can consider the two extreme cases of the bias–variance tradeoff:

• For any unbiased estimator, MSE_θ(θ̂) = Var_θ(θ̂).
• Any estimator returning a constant c has zero variance, and MSE_θ(θ̂) = (b_θ(θ̂))² = (c − θ)².
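A short simulation (sample size, θ, and trial count are our own choices) confirms the MSE formulas of Example 13.7: shrinking the MVUE by a_opt = n/(n + 2) reduces the MSE from 2θ²/n to 2θ²/(n + 2).

```python
import numpy as np

# Monte Carlo illustration of Example 13.7: for Y_i ~ i.i.d. N(0, theta),
# the shrunk estimator a_opt * (1/n) sum Y_i^2 with a_opt = n/(n+2)
# has MSE (n/(n+2)) * MSE(MVUE) = 2 theta^2 / (n+2).
rng = np.random.default_rng(3)
n, theta, trials = 10, 2.0, 200_000
Y = rng.normal(0.0, np.sqrt(theta), size=(trials, n))

mvue = (Y**2).mean(axis=1)          # theta_hat_MVUE per trial
a_opt = n / (n + 2)
shrunk = a_opt * mvue               # biased, lower-MSE estimator

mse_mvue = ((mvue - theta)**2).mean()
mse_shrunk = ((shrunk - theta)**2).mean()

assert abs(mse_mvue - 2 * theta**2 / n) < 0.05 * (2 * theta**2 / n)
assert abs(mse_shrunk - 2 * theta**2 / (n + 2)) < 0.05 * (2 * theta**2 / (n + 2))
assert mse_shrunk < mse_mvue
```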
13.8 Appendix: Derivation of (13.16)
To derive (13.16) we make use of the following differentiation formulas:

d ln|C(θ)|/dθ = Tr[C⁻¹(θ) dC(θ)/dθ],    (13.58)

dC⁻¹(θ)/dθ = −C⁻¹(θ) (dC(θ)/dθ) C⁻¹(θ),    (13.59)

and the expectation formulas

E[Yᵀ A Y] = Tr[CA],    (13.60)

which holds for any matrix A and any random vector Y with mean zero and covariance matrix C, and

E[Yᵀ A Y Yᵀ B Y] = Tr[AC] Tr[BC] + 2 Tr[ACBC],    (13.61)

which holds for any symmetric matrices A and B and any zero-mean Gaussian random vector Y with covariance matrix C.

Differentiating the log-likelihood
ln p_θ(Y) = −(n/2) ln(2π) − (1/2) ln|C(θ)| − (1/2) Yᵀ C⁻¹(θ) Y

with respect to θ, we obtain

∂ ln p_θ(Y)/∂θ = −(1/2) d ln|C(θ)|/dθ − (1/2) Yᵀ (dC⁻¹(θ)/dθ) Y
  = −(1/2) Tr[C⁻¹(θ) dC(θ)/dθ] + (1/2) Yᵀ C⁻¹(θ) (dC(θ)/dθ) C⁻¹(θ) Y,    (13.62)

where the last equality results from the differentiation formulas (13.58) and (13.59). Squaring both sides of (13.62), we obtain
(∂ ln p_θ(Y)/∂θ)²
  = (1/4) Tr²[C⁻¹(θ) dC(θ)/dθ] − (1/2) Tr[C⁻¹(θ) dC(θ)/dθ] · Yᵀ A Y + (1/4) (Yᵀ A Y)²,

where A ≜ C⁻¹(θ) (dC(θ)/dθ) C⁻¹(θ). Taking expectations and applying the formulas (13.60) and (13.61) with A = B = C⁻¹(θ) (dC(θ)/dθ) C⁻¹(θ), we obtain the Fisher information as

I_θ = E_θ[(∂ ln p_θ(Y)/∂θ)²]
  = (1/4) Tr²[C⁻¹(θ) dC(θ)/dθ] − (1/2) Tr²[C⁻¹(θ) dC(θ)/dθ]
    + (1/4) (Tr²[C⁻¹(θ) dC(θ)/dθ] + 2 Tr[C⁻¹(θ) (dC(θ)/dθ) C⁻¹(θ) (dC(θ)/dθ)]).

The three terms involving squares of traces cancel out, and (13.16) follows.

Exercises
13.1 Fisher Information and CRLB for Poisson Family. Let Y be Poisson with parameter θ > 0. Find the Fisher information and CRLB for estimation of θ and √θ.

13.2 Fisher Information and CRLB for i.i.d. Observations. Let 𝒳 ⊂ ℝ and I_θ be the Fisher information for p_θ, θ ∈ 𝒳. Derive an expression for the Fisher information and the CRLB for the i.i.d., n-variate extension of that family.
13.3 Fisher Information for Independent Observations. Let θ be a vector parameter, and {Y_i}_{i=1}^n be independent random variables with respective pdfs p_{θ,i} with respect to a common measure μ. Show that the Fisher information matrix for θ is the sum of the Fisher information matrices for the individual components: I_θ^{(n)} = Σ_{i=1}^n I_{θ,i}.

13.4 Fisher Information for Location-Scale Family. Let q be a pdf with support set equal to ℝ. Show that for the so-called location-scale family

p_θ(y) = (1/θ₂) q((y − θ₁)/θ₂),    y, θ₁ ∈ ℝ, θ₂ > 0,

the elements of the Fisher information matrix are

I₁₁ = (1/θ₂²) ∫_ℝ (q′(y)/q(y))² q(y) dy,
I₂₂ = (1/θ₂²) ∫_ℝ (y q′(y)/q(y) + 1)² q(y) dy,
I₁₂ = (1/θ₂²) ∫_ℝ y (q′(y)/q(y))² q(y) dy.

If the prototype distribution q(y) is symmetric about zero, the cross-term I₁₂ = 0.
13.5 Attainment of CRLB in Exponential Families. Prove Corollary 13.1.
13.6 Efficient Estimator. Let X₁, X₂, X₃ be i.i.d. 𝒩(θ, 1) where θ ∈ ℝ. You are given the partial sums Y₁ = X₁ + X₂ and Y₂ = X₂ + X₃. Derive the MVUE of θ and show it is an efficient estimator.
13.7 Vector CRLB. Suppose Y₁ and Y₂ are independent random variables, with Y₁ ~ 𝒩(θ₁, 1) and Y₂ ~ 𝒩(θ₁ + θ₂, 1).

(a) Find the Fisher information matrix I_θ and the vector CRLB for the estimation of θ = [θ₁ θ₂]ᵀ from Y = [Y₁ Y₂]ᵀ.
(b) Now compute the scalar CRLB individually for each of θ₁ and θ₂ (see (13.40)). Compare these bounds with those obtained from part (a).

13.8 Linear Regression.
Y_k = θ₁ + θ₂ k + Z_k,    k = 1, 2, …, n,

where {Z_k} are i.i.d. 𝒩(0, 1) random variables.

(a) Find the Fisher information matrix for the vector parameter θ = [θ₁ θ₂]ᵀ and its inverse. You may leave your answer in terms of the sums

s₁(n) = Σ_{k=1}^n k = n(n+1)/2,    s₂(n) = Σ_{k=1}^n k² = n(n+1)(2n+1)/6.

(b) Now suppose n = 2. Explain why unique MVUE estimators of θ₁ and θ₂ exist, and give their variances.

13.9 Estimating a 2D Sinusoid in Gaussian Noise. Consider an image patch consisting of a 2D sine pattern
s_i(θ) = cos(iᵀθ),    i = (i₁, i₂) ∈ {0, 1, …, 7}²,

at angular frequency θ = (θ₁, θ₂) ∈ [0, π)², corrupted by i.i.d. 𝒩(0, 1) noise.

(a) Derive and plot the CRLB for an unbiased estimator of θ.
(b) Comment on the dependency of the CRLB on θ, and identify the frequencies for which the Fisher information matrix is singular.

13.10 Estimating the Parameters of a General Gaussian Model. In (13.14) and (13.16), we derived the Fisher information for a scalar θ that parameterizes the mean and the covariance of a Gaussian observation vector Y, respectively. In this problem, we generalize those results.

(a) Show that the Fisher information matrix for an m-vector θ in the model Y ~ 𝒩(s(θ), C) (for a fixed n × n covariance matrix C) is given by

(I_θ)_{jk} = (∂s(θ)/∂θ_j)ᵀ C⁻¹ (∂s(θ)/∂θ_k),    1 ≤ j, k ≤ m.

(b) Show that the Fisher information matrix for an m-vector θ in the model Y ~ 𝒩(s, C(θ)) (for a fixed mean s) is given by

(I_θ)_{jk} = (1/2) Tr[C⁻¹(θ) (∂C(θ)/∂θ_j) C⁻¹(θ) (∂C(θ)/∂θ_k)],    1 ≤ j, k ≤ m,

and is therefore independent of s.

(c) Show that the Fisher information matrix for an m-vector θ in the general model Y ~ 𝒩(s(θ), C(θ)) is given by the sum of the two Fisher information matrices in parts (a) and (b).
13.11 Estimating Linear Filter Parameters. You observe a random signal whose samples {Y_i}_{i=1}^n satisfy the equation Y_i = θ₀ Z_i + θ₁ Z_{i−1}, where Z₀ = 0 and {Z_i}_{i=1}^n are i.i.d. 𝒩(0, 1). Hence the observations are filtered white Gaussian noise.

(a) Derive the mean and covariance matrix for the observations.
(b) Using a result of Exercise 13.10, derive the Fisher information matrix for (θ₀, θ₁).
(c) Give the CRLB on the variance of any unbiased estimator of θ₀ and θ₁.
13.12 Estimation of Log-Variance. Derive the Fisher information of the log-variance ξ = ln θ for the estimation problem of (13.26).
13.13 Efficiency and Linear Reparameterization. Show that if θ̂ is an efficient estimator of θ ∈ ℝᵐ, then Mθ̂ + b is an efficient estimator of Mθ + b, for any m × m matrix M and m-vector b.

13.14 Efficiency and Nonlinear Reparameterization. Assume that the observations {Y_i}_{i=1}^n are drawn i.i.d. 𝒩(θ, 1). Recall that the sample mean

θ̂ = (1/n) Σ_{i=1}^n Y_i

is an efficient estimator of θ. Now consider estimating θ².
(a) Compute the CRLB for any unbiased estimator of θ².
(b) Derive the bias and the variance of θ̂² as an estimator of θ². Evaluate them for n = 1.
(c) Now consider large n. Show that θ̂² is asymptotically unbiased (its bias tends to 0 as n → ∞) and asymptotically efficient (the ratio of its variance to the CRLB tends to 1 as n → ∞).

13.15 Averaging Unbiased Estimators. Assume θ̂₁ and θ̂₂ are unbiased estimators of a scalar parameter θ. The estimators have respective variances σ₁² and σ₂² and normalized correlation ρ = Cov(θ̂₁, θ̂₂)/(σ₁σ₂), all independent of θ.

(a) Show that the averaged estimator θ̂_λ ≜ λθ̂₁ + (1 − λ)θ̂₂ is unbiased for all λ ∈ ℝ.
(b) Find λ that minimizes the variance of θ̂_λ in part (a).
(c) Specialize your answer to the case of uncorrelated estimators (ρ = 0).
References

[1] F. Y. Edgeworth, "On the Probable Errors of Frequency-Constants," J. Royal Stat. Soc., Vol. 71, pp. 381–397, 499–512, 651–678; Vol. 72, pp. 81–90, 1908 and 1909.
[2] R. A. Fisher, "On the Mathematical Foundations of Theoretical Statistics," Philos. Trans. Roy. Soc. London, Series A, Vol. 222, pp. 309–368, 1922.
[3] H. Cramér, "A Contribution to the Theory of Statistical Estimation," Skand. Akt. Tidskr., Vol. 29, pp. 85–94, 1946.
[4] C. R. Rao, "Information and the Accuracy Attainable in the Estimation of Statistical Parameters," Bull. Calc. Math. Soc., Vol. 37, pp. 81–91, 1945.
[5] M. Fréchet, "Sur l'Extension de Certaines Évaluations Statistiques au cas de Petits Échantillons," Rev. Int. Stat., Vol. 11, No. 3/4, pp. 182–205, 1943.
[6] H. L. Van Trees, Detection, Estimation and Modulation Theory, Part I, Wiley, 1968.
14
Maximum Likelihood Estimation
Maximum likelihood (ML) is one of the most popular approaches to parameter estimation. We formulate the problem as a maximization over the parameter set X and identify closedform solutions when the parametric family is an exponential family. We identify conditions under which the ML estimator is also an MVUE. For the problem of estimation given n conditionally i.i.d. observations, we show the ML estimator to be asymptotically consistent, efﬁcient, and Gaussian as n tends to inﬁnity, under certain regularity conditions. We then study recursive ways to compute (approximations to) ML estimators, along with the practically useful expectationmaximization (EM) algorithm.
14.1 Introduction
Denote by P = {Pθ }θ ∈X the parametric family of distributions on the observations. Given the observation Y = y , one may view { pθ (y ), θ ∈ X } as a function of θ , which we deﬁned earlier as the likelihood function. The maximumlikelihood (ML) estimator seeks θˆML (y ) that achieves the supremum of this function over X , i.e.,
p_{θ̂_ML(y)}(y) = sup_{θ∈𝒳} p_θ(y).    (14.1)
(Note that this is not a probability distribution over Y .) The likelihood function may have multiple local maxima over the feasible set X . In case several global maxima exist, all of them are ML estimates. The deﬁnition (14.1) is often casually replaced by
θ̂_ML(y) = arg max_{θ∈𝒳} p_θ(y).    (14.2)
However, there exist problems where the supremum of (14.1) is not achieved, in which case the ML estimator does not exist; see Section 14.8. The ML approach can be motivated in several ways:
• If 𝒳 is a bounded subset of ℝᵐ, ML estimation is equivalent to MAP estimation with a uniform prior. Indeed

θ̂_MAP(y) = arg max_{θ∈𝒳} [p_θ(y) π(θ)],

where π(θ) = 1/|𝒳| for all θ ∈ 𝒳. Still, justifying ML on those grounds may be dubious because the assumptions of a uniform prior and uniform cost may be questionable. It should also be noted that the equivalence to MAP estimation with
uniform prior does not formally hold if 𝒳 is an unbounded set such as the real line, because the uniform distribution does not exist on such sets. One may then think of ML estimation as a limiting case of MAP estimation with a prior whose flatness increases – for instance, a Gaussian prior with variance increasing to infinity.
• One could think of p_θ(y) as a probability that should be made as large as possible so that the observations are deemed more "likely." It is unclear whether such a heuristic approach should work well.
• More fundamentally, we shall see that strong performance guarantees can be given for the ML estimator in asymptotic settings where the number of observations tends to infinity.

14.2 Computation of ML Estimates
Maximizing the likelihood function or any increasing function thereof are equivalent operations. It is often convenient to maximize ln p_θ(y), which is called the log-likelihood function. For simplicity we consider scalar θ first. The roots of the so-called likelihood equation

0 = d ln p_θ(y)/dθ    (14.3)

give the local extrema of p_θ(y) in the interior of 𝒳. These local extrema could be maxima, minima, or saddlepoints. The type of extremum can be identified by examining the second derivative of the log-likelihood function at the root (respectively negative, positive, and zero). In many problems, the ML estimator is a root of the likelihood equation. In other problems, the maximum of the likelihood function is achieved on the boundary of 𝒳 (assuming such a boundary exists); see Section 14.5.
Example 14.1  Let Y ~ 𝒩(0, θ) where θ > 0. The log-likelihood function is given by

ln p_θ(y) = −(1/2) ln(2πθ) − y²/(2θ),    θ > 0,

as illustrated in Figure 14.1. Its derivative is given by

d ln p_θ(y)/dθ = −1/(2θ) + y²/(2θ²).

The root of the likelihood equation (14.3) is unique and is equal to y². The second derivative of the log-likelihood function is given by

d² ln p_θ(y)/dθ² = 1/(2θ²) − y²/θ³.

Evaluating it at the root, we obtain

d² ln p_θ(y)/dθ² |_{θ=y²} = −1/(2y⁴),
Figure 14.1  Log-likelihood function ln p_θ(y) for Example 14.1, plotted versus θ; the maximum occurs at θ = y².
which is a negative number. Hence the root is a maximizer of the likelihood function, and
θˆML (y ) = y2 .
Example 14.2  Assume as in Example 14.1 that Y ~ 𝒩(0, θ), but now θ ≥ 1. If the root y² of the likelihood equation satisfies y² ≥ 1, we again have θ̂_ML(y) = y². However, if y² < 1, the likelihood function is monotonically decreasing over the feasible set [1, ∞), and the maximum is achieved on the boundary: θ̂_ML(y) = 1. Combining both cases, we conclude that

θ̂_ML(y) = max{1, y²}.
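The closed form for the constrained MLE of Example 14.2 can be confirmed by brute force. The sketch below (grid bounds and test points are our own choices) maximizes the log-likelihood over a fine grid on the feasible set and compares the grid maximizer against max{1, y²}.

```python
import numpy as np

# Constrained MLE from Example 14.2: Y ~ N(0, theta) with theta >= 1,
# so theta_hat = max(1, y^2). A grid search over [1, 10] confirms this
# for one interior case (y^2 > 1) and one boundary case (y^2 < 1).
def loglik(theta, y):
    return -0.5 * np.log(2 * np.pi * theta) - y**2 / (2 * theta)

grid = np.linspace(1.0, 10.0, 100_001)
for y in (0.5, 2.0):                   # y^2 < 1 and y^2 > 1
    theta_grid = grid[np.argmax(loglik(grid, y))]
    assert abs(theta_grid - max(1.0, y**2)) < 1e-3
```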
Consider now vector parameters. If θ ∈ 𝒳 ⊂ ℝᵐ is an m-vector, the likelihood equation is given in vector form by

0 = ∇_θ ln p_θ(y) = [∂ ln p_θ(y)/∂θ₁, …, ∂ ln p_θ(y)/∂θ_m]ᵀ ∈ ℝᵐ.    (14.4)

To verify that a solution to (14.4) is indeed a local maximum of ln p_θ(y), we would need to verify that the Hessian ∇²_θ ln p_θ(y) is nonpositive definite at the solution.
Example 14.3  Let Y = αs + Z, where Z ~ 𝒩(0, σ² I_n), n ≥ 2, s ∈ ℝⁿ is a known signal, α ∈ ℝ, and σ > 0. The unknown parameters are α and σ². Let θ = [α, σ²]ᵀ ∈ 𝒳 = ℝ × ℝ₊. The MVUEs of α and σ² were obtained in (12.15) and (12.19):

α̂_MVUE(y) = sᵀy/‖s‖²,    σ̂²_MVUE(y) = (1/(n−1)) ‖y − α̂_MVUE(y) s‖².
The log-likelihood function is

ln p_θ(y) = −(n/2) ln(2πσ²) − ‖y − αs‖²/(2σ²),    θ = [α, σ²]ᵀ.    (14.5)

The roots of the likelihood equation are obtained from (14.4) as follows. First,

0 = ∂ ln p_θ(y)/∂α = (1/σ²)(y − αs)ᵀ s    ⟹    α̂_ML = sᵀy/‖s‖² = α̂_MVUE,    (14.6)

and so the ML estimator of α is an MVUE. Second,

0 = ∂ ln p_θ(y)/∂(σ²) = −n/(2σ²) + ‖y − α̂_ML s‖²/(2(σ²)²)    ⟹    σ̂²_ML = (1/n) ‖y − α̂_ML s‖² = ((n−1)/n) σ̂²_MVUE,    (14.7)

and so the ML estimator of σ² is biased. Note that it is easy to verify that ∇²_θ ln p_θ(y) is negative definite at the solutions to the likelihood equation for α and σ² given in (14.6) and (14.7), respectively.

Let us compare the MSEs of σ̂²_MVUE and σ̂²_ML:

MSE_θ(σ̂²_MVUE) = Var_θ(σ̂²_MVUE) = 2σ⁴/(n−1),
b_θ(σ̂²_ML) = E_θ[σ̂²_ML] − σ² = −σ²/n,
Var_θ(σ̂²_ML) = σ⁴ · 2(n−1)/n²,
MSE_θ(σ̂²_ML) = Var_θ(σ̂²_ML) + b²_θ(σ̂²_ML) = σ⁴ (2n−1)/n².

Since (2n−1)/n² < 2/(n−1) for all n ≥ 2, we conclude that MSE_θ(σ̂²_ML) < MSE_θ(σ̂²_MVUE), showing again the potential benefits of a biased estimator.
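A quick simulation (signal, dimensions, and trial count are our own choices) illustrates Example 14.3: the ML estimate of α coincides with the least-squares projection sᵀy/‖s‖², while the ML variance estimate is biased by the factor (n−1)/n.

```python
import numpy as np

# Numerical illustration of Example 14.3 (alpha and sigma^2 both unknown).
# Checks E[alpha_ML] = alpha (unbiased) and E[sigma2_ML] ~ ((n-1)/n) sigma^2 (biased).
rng = np.random.default_rng(5)
n, alpha, sigma2 = 8, 1.0, 2.0
s = rng.normal(size=n)                     # known signal (our own choice)
trials = 100_000
Y = alpha * s + rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))

alpha_ml = Y @ s / (s @ s)                 # projection onto s, per trial
resid = Y - np.outer(alpha_ml, s)
sigma2_ml = (resid**2).sum(axis=1) / n     # ML variance estimate (divide by n)

assert abs(alpha_ml.mean() - alpha) < 0.01
assert abs(sigma2_ml.mean() - (n - 1) / n * sigma2) < 0.02
```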
14.3
Fix any ε > 0 and define the subset 𝒳_ε ≜ {θ′ ∈ 𝒳 : |θ′ − θ| > ε}. Let δ_ε ≜ inf_{θ′∈𝒳_ε} D(p_θ ‖ p_{θ′}), which is positive by the identifiability assumption (A1). By Lemma 14.2, we have

P_θ { lim_{n→∞} sup_{θ′∈𝒳_ε} [ (1/n) ln (p_{θ′}(Y)/p_θ(Y)) + D(p_θ ‖ p_{θ′}) ] = 0 } = 1,

hence

P_θ { limsup_{n→∞} sup_{θ′∈𝒳_ε} (1/n) ln (p_{θ′}(Y)/p_θ(Y)) ≤ −δ_ε } = 1.    (14.31)

Hence, almost surely, there exists some n₀(ε) such that for all n > n₀(ε),

sup_{θ′∈𝒳_ε} ln (p_{θ′}(Y)/p_θ(Y)) ≤ −(n/2) δ_ε.    (14.32)
Figure 14.3  Consistency of MLEs. (a) Continuous likelihood function. (b) Upper semicontinuous likelihood function. In both panels, ln p_{θ′}(y) is plotted versus θ′, and the maximizer θ̂_n lies within the interval [θ − ε, θ + ε].
Since θ̂_n satisfies

ln (p_{θ̂_n}(Y)/p_θ(Y)) = sup_{θ′∈𝒳} ln (p_{θ′}(Y)/p_θ(Y)) ≥ 0,    (14.33)

it follows from (14.32) that θ̂_n ∉ 𝒳_ε, or equivalently |θ̂_n − θ| ≤ ε, for all n > n₀(ε). Since this statement holds for arbitrarily small ε > 0, the claim follows.

Figure 14.3 depicts the likelihood function for some large n and some typical observation sequence y. The event that |θ̂_n − θ| < ε for all n ≥ n₀(ε) has probability 1. This holds whether the likelihood function is continuous (a) or upper semicontinuous (b). The scenario depicted in Figure 14.3(b) corresponds to the family of Uniform[0, θ] distributions, which is treated in detail in Section 14.7.

Finally, note that almost-sure convergence of θ̂_n does not necessarily imply asymptotic unbiasedness (lim_{n→∞} E_θ[θ̂_n] = θ). However, by the dominated convergence theorem, existence of a random variable X such that

|θ̂_n(Y)| ≤ X a.s.-P_θ,    E_θ[X] < ∞ ∀θ ∈ 𝒳,

ensures asymptotic unbiasedness.

14.6.2 Asymptotic Efficiency and Normality
Next we study the asymptotic distribution of the MLE. We first consider the case of scalar θ. Define the score function

ψ(y, θ) = (∂/∂θ) ln p_θ(y)    (14.34)

and denote by ψ′(y, θ) = (∂²/∂θ²) ln p_θ(y) and ψ″(y, θ) = (∂³/∂θ³) ln p_θ(y) its second and third derivatives with respect to θ, which are all well defined provided regularity conditions apply. The likelihood equation (14.3) may be written as

0 = Σ_{i=1}^n ψ(Y_i, θ).    (14.35)
The main idea in the following theorem is to apply a Taylor series expansion of the score function around θ and to show that θˆn (properly scaled) converges to an average of i.i.d. scores. The proof is given in Section 14.13.
THEOREM 14.2  [Cramér] Asymptotic Normality of MLE. Assume that:

(A1) p_θ ≠ p_{θ′} for all θ ≠ θ′ (identifiability).
(A2) θ ∈ 𝒳, an open subset of ℝ.
(A3) {p_θ, θ ∈ 𝒳} have the same support

𝒴 = supp{p_θ} = {y : p_θ(y) > 0}    ∀θ ∈ 𝒳.

(A4) ψ(Y, θ) is twice differentiable with respect to θ a.s.
(A5) There exists a function M(Y) such that |ψ″(Y, θ)| ≤ M(Y) a.s. and E_θ M(Y) < ∞ ∀θ ∈ 𝒳.
(A6) The Fisher information I_θ = E_θ[ψ²(Y, θ)] = E_θ[−ψ′(Y, θ)] satisfies 0 < I_θ < ∞.
(A7) (∂/∂θ) ∫_𝒴 p_θ(y) dμ(y) = ∫_𝒴 ∂p_θ(y)/∂θ dμ(y) and (∂²/∂θ²) ∫_𝒴 p_θ(y) dμ(y) = ∫_𝒴 ∂²p_θ(y)/∂θ² dμ(y).

Then there exists a strongly consistent sequence θ̂_n of roots of the likelihood equation such that

√n (θ̂_n − θ) →d 𝒩(0, I_θ⁻¹).    (14.36)

REMARK 14.2  Equation (14.36) does not necessarily imply that

lim_{n→∞} n Var_θ(θ̂_n) = I_θ⁻¹,

which is the property required for asymptotic efficiency. Some mild additional technical assumptions are needed to show that the MLE is asymptotically efficient and asymptotically unbiased.

REMARK 14.3  For any estimator θ̂_n, the property lim_{n→∞} n Var_θ(θ̂_n) = O(1) implies that the estimation errors are O_P(n^{−1/2}) and is known as √n-consistency.

REMARK 14.4  Theorem 14.2 does not guarantee that θ̂_n is an MLE.

REMARK 14.5  Assumption (A5) on ψ″ can be replaced by a weaker condition on the modulus of continuity of ψ′ [4].
Example 14.6  Let p_θ(y) = (1/2) e^{−|y−θ|}, y ∈ ℝ, be the Laplacian distribution with mean θ and variance equal to 2. The log-likelihood function is given by

ln p_θ(y) = −n ln 2 − Σ_{i=1}^n |θ − y_i|

or equivalently by

ln p_θ(y) = −n ln 2 − Σ_{i=1}^n |θ − y_{(i)}|,

where {y_{(i)}}_{i=1}^n denote the order statistics, i.e., the observations reordered such that y_{(1)} ≤ y_{(2)} ≤ ⋯ ≤ y_{(n)}. The log-likelihood function is maximized by the median estimator

θ̂_ML(y) = med{y_i}_{i=1}^n.

For odd n, the solution θ̂_ML(y) = y_{((n+1)/2)} is the unique MLE. For even n, any point in the interval [y_{(n/2)}, y_{(n/2+1)}] is an MLE.

Consistency, efficiency, and asymptotic Gaussianity of the median estimator can be inferred from Theorem 14.2, but also from asymptotic properties of the median estimator in a much broader family of distributions [5, p. 92]. Consider a cdf F whose median shall be assumed to be 0 without loss of generality. Assume the associated pdf is continuous at 0, where it takes the value p(0). Denote by Y_{(k)} the median of n = 2k − 1 i.i.d. random variables with common cdf F. Then by application of [5, Theorem 7.1] (also see [6]), the normalized median estimator satisfies

√n (Y_{(k)} − med(F)) →d 𝒩(0, 1/(4 p(0)²)).

Moreover, the median estimator is unbiased for all n and has asymptotic variance lim_n n Var[Y_{(k)}] = 1/(4 p(0)²). For the Laplace distribution, p(0) = 1/2 and thus lim_n n Var[Y_{(k)}] = 1. This coincides with the CRLB because the Fisher information for θ in the Laplace family {p_θ} is I_θ = E_θ(1) = 1. In contrast, the sample mean estimator Ȳ = (1/n) Σ_{i=1}^n Y_i, while also unbiased and √n-consistent, is not efficient because n Var_θ[Ȳ] = Var[Y] = 2 for all n ≥ 1.
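The variance claims of Example 14.6 are easy to check by simulation. In the sketch below (sample size and trial count are our own choices), Laplace samples with scale 1 give n·Var(median) ≈ 1, the CRLB, while n·Var(mean) = 2 for every n.

```python
import numpy as np

# Monte Carlo check of Example 14.6: for Laplace samples (scale 1, variance 2),
# the sample median (the MLE) satisfies n * Var(median) -> 1/(4 p(0)^2) = 1,
# while the sample mean has n * Var(mean) = Var(Y) = 2 exactly.
rng = np.random.default_rng(6)
n, trials = 201, 20_000                      # odd n: median is the unique MLE
Y = rng.laplace(0.0, 1.0, size=(trials, n))

med = np.median(Y, axis=1)
mean = Y.mean(axis=1)

assert abs(n * mean.var() - 2.0) < 0.15      # exact theory, up to MC noise
assert abs(n * med.var() - 1.0) < 0.15       # close to the CRLB 1/I_theta = 1
assert med.var() < mean.var()                # median beats the mean here
```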
Vector Parameters  Consider a parametric family {p_θ}_{θ∈𝒳} where θ ∈ 𝒳 ⊆ ℝᵐ. We define the score function as the gradient vector ψ(y, θ) ≜ ∇_θ ln p_θ(y) ∈ ℝᵐ and the m × m Fisher information matrix as

I_θ = E_θ[ψ(Y, θ) ψᵀ(Y, θ)] = E_θ[−∇_θ ψ(Y, θ)],

assuming the derivatives exist. The counterpart of Theorem 14.2 in the vector case is given in Theorem 14.3.

THEOREM 14.3  Asymptotic Normality of MLE [4, p. 121]. Assume:

(B1) p_θ ≠ p_{θ′} for all θ ≠ θ′ (identifiability).
(B2) 𝒳 is an open subset of ℝᵐ.
(B3) {p_θ}_{θ∈𝒳} have the same support

𝒴 = supp{p_θ} = {y : p_θ(y) > 0}    ∀θ ∈ 𝒳.

(B4) p_θ(y) is twice continuously differentiable with respect to θ.
(B5) There exists a function M(Y) such that E_θ[M(Y)] < ∞ ∀θ ∈ 𝒳 and all third-order partial derivatives of p_θ(Y) are bounded in magnitude by M(Y).
(B6) The Fisher information matrix I_θ is finite and has full rank.
(B7) ∇²_θ ∫_𝒴 p_θ(y) dμ(y) = ∫_𝒴 ∇²_θ p_θ(y) dμ(y).

Then there exists a strongly consistent sequence θ̂_n of roots of the likelihood equation such that

√n (θ̂_n − θ) →d 𝒩(0, I_θ⁻¹).    (14.37)

14.7 Nonregular ML Estimation Problems
Consider the case where the distributions p_θ do not have common support, so Assumption (B3) of Theorem 14.3 does not hold. Theorem 14.1 still applies, but Theorem 14.3 does not. As an example, consider

Y_i ~ i.i.d. Uniform[0, θ],    θ ∈ 𝒳 = (0, ∞).

Then

p_θ(Y) = Π_{i=1}^n p_θ(Y_i) = θ^{−n} Π_{i=1}^n 1{0 ≤ Y_i ≤ θ} = θ^{−n} 1{Z_n ≤ θ},

where

Z_n ≜ max_{1≤i≤n} Y_i

is a complete sufficient statistic for θ. The MLE is given by

θ̂_n = arg max_{θ>0} p_θ(Y) = arg max_{θ≥Z_n} θ^{−n} = Z_n.

The cdf of θ̂_n = Z_n is given by

P_θ{θ̂_n ≤ z} = P_θ{Z_n ≤ z} = P_θ{∀i : Y_i ≤ z} = (P_θ{Y ≤ z})ⁿ = (z/θ)ⁿ,    0 ≤ z ≤ θ.

The estimation error is O_P(n^{−1}) in this problem, as can be seen from the following analysis. Define the normalized error

X_n = n(θ − θ̂_n) ≥ 0

and denote by F_n its cdf. We have

1 − F_n(x) = P_θ{X_n > x} = P_θ{Z_n < θ − x/n} = (1 − (x/θ)/n)ⁿ,    x ≥ 0.
Hence

lim_{n→∞} F_n(x) = 1 − e^{−x/θ},    x ≥ 0,

i.e., X_n converges in distribution to an Exp(θ) random variable. The asymptotics of the estimation error are thus completely different from those in the regular case, both in the scaling factor (1/n instead of 1/√n) and in the asymptotic distribution (exponential instead of Gaussian).
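The 1/n rate and the exponential limit can be checked by simulation (θ, n, and trial count below are our own choices): the normalized error n(θ − Z_n) should have mean ≈ θ and median ≈ θ ln 2, matching an Exp(θ) distribution.

```python
import numpy as np

# Monte Carlo check of the O(1/n) rate: for Y_i ~ Uniform[0, theta], the MLE is
# Z_n = max_i Y_i, and Z_n/theta ~ Beta(n, 1), so Z_n can be sampled directly.
# The normalized error X_n = n*(theta - Z_n) is approximately Exp(theta).
rng = np.random.default_rng(7)
theta, n, trials = 2.0, 1000, 200_000
Zn = theta * rng.beta(n, 1, size=trials)   # same law as max of n Uniform[0, theta]
X = n * (theta - Zn)

assert abs(X.mean() - theta) < 0.03                   # Exp(theta) mean is theta
assert abs(np.median(X) - theta * np.log(2)) < 0.03   # Exp(theta) median is theta*ln 2
```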
14.8 Nonexistence of MLE
Since the MLE is defined as any θ̂ that achieves the supremum of the likelihood function over 𝒳, one must be concerned about problems in which the supremum is not achieved, hence an MLE does not exist. This is typically the case when the likelihood function is unbounded, and may happen even when all the regularity conditions are satisfied. While the ML approach technically fails, it can often be salvaged in the sense that the likelihood equation admits a "good root," as illustrated in Example 14.7.

Example 14.7  Gaussian Observations. Let Y₁ ~ 𝒩(μ, σ²) and Y_i ~ i.i.d. 𝒩(μ, 1) for i = 2, …, n. The unknown parameter is θ = (μ, σ²) ∈ ℝ × ℝ₊. Note that σ² influences Y₁ but not Y₂, …, Y_n. Before attempting to derive the MLE, note there exists a simple and reasonable estimator for this problem, i.e.,
1
µ˜ = n − 1
n ¿ i =2
σ˜ 2 = (Y1 − µ) ˜ 2,
Yi ,
(14.38)
where µ˜ ignores Y1 but is asymptotically unbiased, consistent, and efﬁcient as n → ∞ , m .s . and σ˜ 2 is asymptotically MVUE (albeit not consistent) since µ˜ → µ as n → ∞. The loglikelihood function for (µ, σ 2) may be written as 1 (Y1 − µ)2 l (µ, σ 2) = − ln(2πσ 2) − 2 2σ 2
− (n −2 1) ln(2π) −
n ¿ (Yi − µ)2 i =2
n 2 2 ¿ = − n2 ln(2π) − 21 ln σ 2 − (Y12−σ 2µ) − (Yi −2 µ) i =2
2
(14.39)
to be maximized over $\mu$ and $\sigma^2$. Unfortunately $l(\mu, \sigma^2)$ is unbounded. To see this, consider $\mu_\epsilon = Y_1 - \epsilon$ and $\sigma_\epsilon^2 = \epsilon^2$. We have

$$l(\mu_\epsilon, \sigma_\epsilon^2) = \ln\frac{1}{\sqrt{2\pi}} - \frac{1}{2}\ln\epsilon^2 - \frac{1}{2} - \frac{1}{2}\sum_{i=2}^n (Y_i - Y_1 + \epsilon)^2 = C_\epsilon - \ln\epsilon,$$

where $C_\epsilon$ tends to a finite limit as $\epsilon \downarrow 0$, but $-\ln\epsilon$ tends to $+\infty$. Hence

$$\sup_{\mu,\sigma^2} l(\mu, \sigma^2) \ge \sup_{\epsilon > 0} l(\mu_\epsilon, \sigma_\epsilon^2) = \infty.$$
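The divergence can be checked numerically. A minimal sketch (the observations below are made up) evaluates the log-likelihood (14.39) along the path $\mu_\epsilon = Y_1 - \epsilon$, $\sigma_\epsilon^2 = \epsilon^2$:

```python
import math

def loglik(mu, sigma2, y):
    """Log-likelihood (14.39): Y_1 ~ N(mu, sigma2), Y_2..Y_n ~ N(mu, 1)."""
    n = len(y)
    return (-0.5 * n * math.log(2 * math.pi)
            - 0.5 * math.log(sigma2)
            - (y[0] - mu) ** 2 / (2 * sigma2)
            - sum((yi - mu) ** 2 for yi in y[1:]) / 2)

y = [1.3, 0.2, -0.5, 0.9, 0.1]          # hypothetical observations
for eps in [1e-1, 1e-3, 1e-6]:
    print(eps, round(loglik(y[0] - eps, eps ** 2, y), 2))
# The values grow without bound (like -ln(eps)) as eps shrinks.
```

Each decade of shrinkage in $\epsilon$ adds roughly $\ln 10 \approx 2.3$ to the log-likelihood, matching the $-\ln\epsilon$ behavior derived above.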
Maximum Likelihood Estimation
An iterative optimization algorithm will often produce a sequence with unbounded likelihood that converges to $(Y_1, 0)$, which is infeasible and hence not an MLE. For instance, a gradient-based algorithm may instead converge to a root of the likelihood equation

$$0 = \frac{\partial l(\mu, \sigma^2)}{\partial \mu} = \frac{Y_1 - \mu}{\sigma^2} + \sum_{i=2}^n (Y_i - \mu),$$

$$0 = \frac{\partial l(\mu, \sigma^2)}{\partial \sigma^2} = -\frac{1}{2\sigma^2} + \frac{(Y_1 - \mu)^2}{2\sigma^4}.$$
Hence any solution $(\hat\mu, \hat\sigma^2)$ to the likelihood equation satisfies

$$\hat\mu = \frac{Y_1/\hat\sigma^2 + \sum_{i=2}^n Y_i}{1/\hat\sigma^2 + n - 1}, \qquad \hat\sigma^2 = (Y_1 - \hat\mu)^2. \qquad (14.40)$$
This nonlinear system admits a solution that converges a.s. to the infeasible point $(Y_1, 0)$ as $n \to \infty$. However, the system (14.40) also admits a feasible solution that converges a.s. to the "simple" estimator $(\tilde\mu, \tilde\sigma^2)$ of (14.38) as $n \to \infty$ (see Exercise 14.8). It may also be verified that the Hessian of the log-likelihood at that point is negative definite, hence the root is a local maximum of the log-likelihood.²
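A small numerical sketch (with hypothetical data): treating the two equations of (14.40) as a fixed-point map and starting from the simple estimator (14.38) lands on a feasible root:

```python
# Fixed-point iteration on the likelihood equations (14.40).
# Data are hypothetical; y[0] plays the role of Y_1.
y = [2.7, 1.1, 0.8, 1.4, 0.9, 1.2, 1.0, 1.3]
n = len(y)
mu = sum(y[1:]) / (n - 1)          # mu-tilde of (14.38)
s2 = (y[0] - mu) ** 2              # sigma2-tilde of (14.38)
for _ in range(200):
    mu = (y[0] / s2 + sum(y[1:])) / (1.0 / s2 + n - 1)
    s2 = (y[0] - mu) ** 2
print(round(mu, 4), round(s2, 4))  # a feasible root of (14.40)
```

At convergence the pair satisfies both equations of (14.40) and stays close to $(\tilde\mu, \tilde\sigma^2)$, as the text predicts; whether such an iteration contracts depends on the data, so this is only an illustration.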
²These derivations are simplified if one works with the reparameterization $\xi = 1/\sigma^2$. Then maximization of (14.39) is equivalent to maximization of

$$l(\mu, \xi) = \ln\frac{1}{\sqrt{2\pi}} + \frac{1}{2}\ln\xi - \frac{1}{2}(Y_1 - \mu)^2\xi - \frac{1}{2}\sum_{i=2}^n (Y_i - \mu)^2,$$

$$\Rightarrow\; \nabla l(\mu, \xi) = \begin{pmatrix} (Y_1 - \mu)\xi + \sum_{i=2}^n (Y_i - \mu) \\ \frac{1}{2\xi} - \frac{1}{2}(Y_1 - \mu)^2 \end{pmatrix}, \qquad \nabla^2 l(\mu, \xi) = \begin{pmatrix} -\xi - n + 1 & Y_1 - \mu \\ Y_1 - \mu & -\frac{1}{2\xi^2} \end{pmatrix},$$

which is negative definite in a neighborhood of $(\tilde\mu, \tilde\xi)$ for $n$ sufficiently large.

Example 14.8 Mixture of Gaussians. Here we consider i.i.d. observations $\{Y_i\}_{i=1}^n \overset{i.i.d.}{\sim} p_\theta$, where $\theta = (\mu, \sigma^2)$ and $p_\theta$ is a mixture of two Gaussians with common mean $\mu$:

$$P_\theta = \frac{1}{2}\mathcal{N}(\mu, 1) + \frac{1}{2}\mathcal{N}(\mu, \sigma^2).$$

The log-likelihood function for $(\mu, \sigma^2)$ is given by

$$l(\mu, \sigma^2) = \sum_{i=1}^n \ln p_\theta(Y_i) = \sum_{i=1}^n \ln\left[\frac{1}{2\sqrt{2\pi}}\exp\left(-\frac{(Y_i - \mu)^2}{2}\right) + \frac{1}{2\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(Y_i - \mu)^2}{2\sigma^2}\right)\right]$$

and is again unbounded. To see this, pick any $i \in \{1, 2, \ldots, n\}$ and consider $\mu_\epsilon = Y_i - \epsilon$ and $\sigma_\epsilon^2 = \epsilon^2$. Then

$$l(\mu_\epsilon, \sigma_\epsilon^2) = \sum_{j=1}^n \ln\left[\frac{1}{2\sqrt{2\pi}}\exp\left(-\frac{(Y_j - Y_i + \epsilon)^2}{2}\right) + \frac{1}{2\sqrt{2\pi}\,\epsilon}\exp\left(-\frac{(Y_j - Y_i + \epsilon)^2}{2\epsilon^2}\right)\right]$$

$$= C_\epsilon + \ln\left[\frac{1}{2\sqrt{2\pi}}\,e^{-\epsilon^2/2} + \frac{1}{2\sqrt{2\pi}\,\epsilon}\,e^{-1/2}\right],$$

where $C_\epsilon$ is the sum of the terms indexed by $j \ne i$ and tends to a finite limit as $\epsilon \downarrow 0$. Hence $l(\mu_\epsilon, \sigma_\epsilon^2) \sim -\ln\epsilon$ as $\epsilon \downarrow 0$, and

$$\sup_{\mu,\sigma^2} l(\mu, \sigma^2) \ge \sup_{\epsilon > 0} l(\mu_\epsilon, \sigma_\epsilon^2) = \infty.$$

Thus the likelihood function is unbounded as $(\mu, \sigma^2)$ approaches $(Y_i, 0)$ for each $1 \le i \le n$, and the MLE does not exist.
Example 14.9 General Mixtures. The same problem of an unbounded likelihood function arises with more general mixture models

$$P_\theta = \sum_{j=1}^k \pi_j\, \mathcal{N}(\mu_j, \sigma_j^2), \qquad (14.41)$$

where $k$ is the number of mixture components, $\pi$ is a pmf over $\{1, 2, \ldots, k\}$, and the parameter $\theta = \{\pi_j, \mu_j, \sigma_j^2\}_{j=1}^k$. Yet it may be shown that if the Fisher information matrix $I_\theta$ has full rank, the following result holds: for all large $n$, with probability 1, there exists a unique root $\hat\theta_n$ of the likelihood equation such that $\sqrt{n}(\hat\theta_n - \theta) \overset{d}{\to} \mathcal{N}(0, I_\theta^{-1})$ [5, Ch. 33]. The same result applies to mixtures of Gaussian vectors

$$P_\theta = \sum_{j=1}^k \pi_j\, \mathcal{N}(\mu_j, C_j), \qquad (14.42)$$

where $\mu_j$ and $C_j$ are respectively the mean vector and covariance matrix for mixture component $j$, and $\theta = \{\pi_j, \mu_j, C_j\}_{j=1}^k$. The same result applies to even more general mixture models of the form

$$p_\theta = \sum_{j=1}^k \pi_j\, p_{\xi_j},$$

where $\{p_\xi\}_{\xi \in \Xi}$ is a parametric family that satisfies some regularity conditions, and $\theta = \{\pi_j, \xi_j\}_{j=1}^k$.
14.9 Non-i.i.d. Observations
The methods of Section 14.6 can be used to study asymptotics in problems where the observations are not i.i.d. Consider for instance the problem of estimating the scale parameter $\theta \in \mathbb{R}$ of a signal received in white Gaussian noise:

$$Y \sim \mathcal{N}(\theta s, I_n),$$

where $s \in \mathbb{R}^n$ is a known sequence. As we saw in Example 14.3, the MLE is an MVUE for this problem:

$$\hat\theta_{ML} = \hat\theta_{MVUE} = \frac{s^\top Y}{\|s\|^2}.$$

The estimator is Gaussian with mean $\theta$ and variance $1/\|s\|^2$, the reciprocal of the Fisher information; hence the estimator is unbiased and efficient. The estimator is consistent if and only if $\lim_{n \to \infty} \|s\|^2 = \infty$. Note that the estimation error need not scale as $1/\sqrt{n}$. For instance, if $s_i = i^{-a}$ with $a < \frac{1}{2}$, the estimation-error variance is $O(n^{2a-1})$; for $a = \frac{1}{2}$ it is $O(1/\ln n)$; and for $a > \frac{1}{2}$, $\|s\|^2$ converges, so the estimator is not consistent.
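A brief simulation sketch of this estimator (the exponent $a = 1/4$ and the parameter values are arbitrary choices in the consistent regime):

```python
import random

def scale_mle(s, y):
    """MLE/MVUE of theta in Y = theta*s + white Gaussian noise: s'y / ||s||^2."""
    return sum(si * yi for si, yi in zip(s, y)) / sum(si * si for si in s)

rng = random.Random(1)
theta, n = 0.7, 4000
s = [i ** -0.25 for i in range(1, n + 1)]    # s_i = i^{-a} with a = 1/4 < 1/2
y = [theta * si + rng.gauss(0.0, 1.0) for si in s]

est = scale_mle(s, y)
var = 1.0 / sum(si * si for si in s)         # variance 1/||s||^2, shrinks with n
print(round(est, 2), var < 0.01)             # estimate typically near 0.7
```

Here $\|s\|^2 \approx 2\sqrt{n}$, so the error variance decays like $n^{-1/2}$ rather than $n^{-1}$, illustrating the non-standard scaling discussed above.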
14.10 M-Estimators and Least-Squares Estimators
Often the pdf $p_\theta$ is not known precisely, or the likelihood maximization problem is difficult. It is common in such problems to use an M-estimator [5]

$$\hat\theta \triangleq \arg\min_{\theta \in \mathcal{X}} \sum_{i=1}^n \varphi(Y_i, \theta),$$
where $\varphi$ is a function that possesses nice properties, such as convexity. This approach reduces to ML if $\varphi(Y, \theta) = -\ln p_\theta(Y)$ is the negative log-likelihood; one could also let $\varphi(Y, \theta) = -\ln q_\theta(Y)$ for some other pdf $q_\theta$. If $q_\theta$ is Gaussian, the method is known as least-squares estimation. The asymptotic properties of M-estimators can be derived using the same approach as for MLEs, and often result in estimators that are asymptotically unbiased and consistent, but not efficient: while the estimation error scales as $1/\sqrt{n}$ under regularity conditions, the asymptotic variance is larger than the reciprocal of the Fisher information. This is the penalty for using a mismatched model ($q_\theta$ instead of the actual $p_\theta$). We explore the special case of least-squares estimation in more detail now. Consider the estimation of a parameter $\theta$ from a sequence of observations, where each observation is a linear transformation of $\theta$ in zero-mean additive noise, i.e., the observation sequence is given by
$$Y_k = H_k \theta + V_k, \qquad k = 1, 2, \ldots,$$
where Hk is known, and V k is assumed to be zero mean. If we obtain a block of n observations, then by concatenation
$$Y = [Y_1^\top \cdots Y_n^\top]^\top, \qquad H = [H_1^\top \cdots H_n^\top]^\top, \qquad V = [V_1^\top \cdots V_n^\top]^\top,$$

we obtain the block model

$$Y = H\theta + V.$$
The least-squares estimate of $\theta$ is defined as

$$\hat\theta_{LS}(y) \triangleq \arg\min_\theta \|y - H\theta\|^2, \qquad (14.43)$$
which equals $\hat\theta_{ML}(y) = \hat\theta_{MVUE}(y)$ if $V$ has i.i.d. zero-mean Gaussian components. The optimization problem in (14.43) is a convex optimization problem in $\theta$, the solution to which is easily seen to be (see Exercise 14.9):

$$\hat\theta_{LS}(y) = (H^\top H)^{-1} H^\top y. \qquad (14.44)$$
The matrix $(H^\top H)^{-1} H^\top$ is called the pseudoinverse of $H$. In applications where the components of the noise vector are known to be correlated with covariance matrix $C$, the following modification of the least-squares problem, called the weighted least-squares problem, is of interest:
$$\hat\theta_{WLS}(y) \triangleq \arg\min_\theta\, (y - H\theta)^\top C^{-1} (y - H\theta), \qquad (14.45)$$
which equals $\hat\theta_{ML}(y) = \hat\theta_{MVUE}(y)$ if $V$ is zero-mean and Gaussian with covariance matrix $C$. The solution to the problem in (14.45), which is also a convex optimization problem, is easily seen to be (see Exercise 14.9):
$$\hat\theta_{WLS}(y) = (H^\top C^{-1} H)^{-1} H^\top C^{-1} y. \qquad (14.46)$$
While the solutions to the least-squares and weighted least-squares problems can be written in closed form, computing the required pseudoinverses can be computationally expensive, especially when the number of observations is large. For this reason, recursive solutions such as the ones described in Section 14.12.2 are often preferable in practice.
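A minimal pure-Python sketch of the least-squares solution (14.44) for a two-column $H$ (the data are made up): it solves the normal equations $(H^\top H)\theta = H^\top y$ with the explicit $2\times 2$ inverse and verifies the first-order condition, namely that the residual is orthogonal to the columns of $H$.

```python
def lstsq_2col(H, y):
    """Solve (H'H) theta = H'y for a two-column design H via the 2x2 inverse."""
    a = sum(h[0] * h[0] for h in H)             # (H'H)[0,0]
    b = sum(h[0] * h[1] for h in H)             # (H'H)[0,1] = (H'H)[1,0]
    c = sum(h[1] * h[1] for h in H)             # (H'H)[1,1]
    u = sum(h[0] * yi for h, yi in zip(H, y))   # (H'y)[0]
    v = sum(h[1] * yi for h, yi in zip(H, y))   # (H'y)[1]
    det = a * c - b * b
    return ((c * u - b * v) / det, (a * v - b * u) / det)

H = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0)]   # intercept + slope design
y = [0.1, 1.9, 4.2, 5.8]
t0, t1 = lstsq_2col(H, y)
# Residual must be orthogonal to both columns of H (first-order condition).
r = [yi - (t0 * h[0] + t1 * h[1]) for h, yi in zip(H, y)]
print(all(abs(sum(h[j] * ri for h, ri in zip(H, r))) < 1e-9 for j in range(2)))
```

The same orthogonality check, with $C^{-1}$-weighted inner products, verifies a weighted least-squares solution (14.46).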
14.11 Expectation-Maximization (EM) Algorithm
Numerical evaluation of maximum-likelihood (ML) estimates is often difficult. The likelihood function may have multiple extrema, and the parameter $\theta$ may be multidimensional, all of which are problematic for any numerical algorithm. In this section we introduce the expectation-maximization (EM) algorithm, a powerful optimization method that has been used with great success in many applications. The idea can be traced back to Hartley [7] and was fully developed in a landmark paper by Dempster, Laird, and Rubin [8]. The EM algorithm relies on the concept of a complete data space, to be selected in some convenient way. The EM algorithm is iterative and alternates
between conditional expectation and maximization steps. If the complete data space is properly chosen, this method can be very effective, as it exploits the inherent statistical structure of the parameter estimation problem. The EM algorithm is particularly adept at dealing with problems involving missing data, which arise in many applications in the areas of data communications, signal processing, imaging, and biology. Comprehensive surveys include the tutorial paper by Meng and van Dyk [9] and the book by McLachlan and Krishnan [10].

14.11.1 General Structure of the EM Algorithm
The EM algorithm is an iterative algorithm that starts from an initial estimate of the unknown parameter $\theta$ and produces a sequence of estimates $\hat\theta^{(k)}$, $k = 1, 2, \ldots$. The EM algorithm uses the following ingredients:

• an incomplete-data space $\mathcal{Y}$ (measurement space);
• a complete-data space $\mathcal{Z}$ (many choices are possible);
• a many-to-one mapping $h : \mathcal{Z} \to \mathcal{Y}$.
With some abuse of notation, we denote the pdf for the complete data by $p_\theta(z)$. The pdf for the incomplete data is given by

$$p_\theta(y) = \int_{h^{-1}(y)} p_\theta(z)\, dz,$$

where $h^{-1}(y) \subset \mathcal{Z}$ denotes the set of $z$ such that $h(z) = y$. The incomplete-data and complete-data log-likelihood functions are respectively denoted by
$$\ell_{id}(\theta, y) \triangleq \ln p_\theta(y), \qquad \ell_{cd}(\theta, z) \triangleq \ln p_\theta(z).$$

The complete-data space should be chosen in such a way that maximization of $\ell_{cd}(\theta, z)$ is easy. However, the complete data are not available, so one uses a surrogate cost function: the expectation of $\ell_{cd}(\theta, Z)$ conditioned on $Y$ and the current estimate of $\theta$. Maximizing this surrogate function results in an updated estimate of $\theta$, and these steps are repeated until convergence. Upon selection of an initial estimate $\hat\theta^{(0)}$, the EM algorithm alternates between expectation (E) and maximization (M) steps for updating the estimate $\hat\theta^{(k)}$ of the unknown parameter $\theta$ at iterations $k = 1, 2, 3, \ldots$.
E-Step: Compute

$$Q(\theta \mid \hat\theta^{(k)}) \triangleq E_{\hat\theta^{(k)}}\left[\ell_{cd}(\theta, Z) \mid Y = y\right], \qquad (14.47)$$

where the conditional expectation is evaluated assuming that the true parameter is equal to $\hat\theta^{(k)}$.

M-Step: Compute

$$\hat\theta^{(k+1)} = \arg\max_{\theta \in \mathcal{X}} Q(\theta \mid \hat\theta^{(k)}). \qquad (14.48)$$
If the EM sequence $\hat\theta^{(k)}$, $k \ge 0$, converges, the limit $\theta^*$ is called a stable point of the algorithm. In practice, the iterations are terminated when the estimates are stable enough (using some stopping criterion).

14.11.2 Convergence of EM Algorithm
The convergence properties of the EM algorithm were first studied by Dempster, Laird, and Rubin [8] and by Wu [11]. The main result is as follows; the proof is given in Section 14.14.

THEOREM 14.4 The sequence of log-likelihoods of the incomplete data $\ell_{id}(\hat\theta^{(k)}, y)$, $k = 0, 1, 2, \ldots$, is nondecreasing.

While this theorem guarantees monotonicity of the log-likelihood sequence $\ell_{id}(\hat\theta^{(k)}, y)$, convergence to an extremum of $\ell_{id}(\theta, y)$, i.e., a solution to the likelihood equation, requires additional differentiability assumptions (see, e.g., [11]).³
REMARK 14.6 Some remarks about the EM algorithm are in order.

1. Even though the sequence $\ell_{id}(\hat\theta^{(k)}, y)$ may converge, there is in general no guarantee that $\hat\theta^{(k)}$ converges; e.g., the EM algorithm may jump back and forth between two attractors.
2. Even if the algorithm has a stable point, there is in general no guarantee that this stable point is a global maximum of the likelihood function, or even a local maximum. For many problems, however, convergence to a local maximum has been demonstrated.
3. The solution in general depends on the initialization.
4. Several variants of the EM algorithm have been developed. One of the most promising is Fessler and Hero's SAGE algorithm (space-alternating generalized EM), whose convergence rate is often far superior to that of the EM algorithm [12].
5. There is a close relationship between the convergence rate of the EM algorithm and Fisher information, as described in [8, 12].
6. Often the choice for $Z$ is a pair $(Y, U)$, where $U$ is thought of as missing data, or latent data. Such is the case with the estimation of mixtures in Example 14.11, where the missing data are the labels associated with each observation.
7. The E-step is relatively simple to implement when $Z$ conditioned on $Y$ is from an exponential family. Then the E-step reduces to evaluating the conditional expectation of sufficient statistics, and the M-step is often tractable.

14.11.3 Examples
We now provide some examples of applications of the EM algorithm.

³To illustrate why such differentiability assumptions are necessary, consider the function $f(u, v) = \phi(u - v)\cos(\pi(u + v)/2)$, where $\phi(u) = \max\{0, 1 - |u|\}$ and $-1 \le u, v \le 1$. This function admits a unique local maximum (which is therefore also a global maximum) at $(u, v) = (0, 0)$. However, the graph of the function has a ridge along the direction $v = u$. A coordinate-ascent algorithm (that would successively maximize $f$ along the $u$ and $v$ directions) would reach the ridge after just one step of the algorithm and remain stuck there.
Example 14.10 Convergence of EM Algorithm. Suppose the observation $Y = X + W$, where $X \sim \mathcal{N}(\theta, 1)$ and $W \sim \mathcal{N}(0, 1)$ are independent, with $\theta$ being an unknown parameter to be estimated from $Y$. It is easy to see that

$$p_\theta(y) = \frac{1}{\sqrt{4\pi}}\, e^{-(y-\theta)^2/4}$$
and that the log-likelihood is

$$\ell_{id}(\theta, y) = \ln p_\theta(y) = -\frac{(y - \theta)^2}{4} - \frac{1}{2}\ln(4\pi).$$

Maximizing over $\theta$, we obtain

$$\hat\theta_{ML}(y) = y.$$
While the MLE is easy to obtain in this example, it is nevertheless instructive to study the solution produced by the EM algorithm applied to this problem with complete data $Z = (X, W)$. Note that the mapping from $Z$ to $Y$ is many-to-one (noninvertible). By the independence of $X$ and $W$, we obtain

$$p_\theta(z) = p_\theta(x, w) = \frac{1}{2\pi}\, e^{-(x-\theta)^2/2}\, e^{-w^2/2},$$

and the log-likelihood of the complete data is given by

$$\ell_{cd}(\theta, z) = -\frac{(x - \theta)^2}{2} - \frac{w^2}{2} - \ln(2\pi).$$
E-Step:

$$Q(\theta \mid \hat\theta^{(k)}) = E_{\hat\theta^{(k)}}\left[\ell_{cd}(\theta, Z) \mid Y = y\right].$$

In computing $Q(\theta \mid \hat\theta^{(k)})$, we can ignore additive terms in $\ell_{cd}(\theta, Z)$ that are not functions of $\theta$, since they are irrelevant in the M-step. In particular,

$$\ell_{cd}(\theta, Z) = -\frac{\theta^2}{2} + \theta X + \text{terms that are not functions of } \theta.$$

Therefore

$$Q(\theta \mid \hat\theta^{(k)}) = -\frac{\theta^2}{2} + \theta\, E_{\hat\theta^{(k)}}\left[X \mid Y = y\right].$$

This is a (concave) quadratic function of $\theta$, which is maximized by setting the derivative with respect to $\theta$ to 0.
M-Step:

$$\hat\theta^{(k+1)} = \arg\max_{\theta \in \mathcal{X}} Q(\theta \mid \hat\theta^{(k)}) = E_{\hat\theta^{(k)}}\left[X \mid Y = y\right].$$

By the LMMSE formula (11.15), we obtain the conditional expectation as

$$E_{\hat\theta^{(k)}}\left[X \mid Y = y\right] = \hat\theta^{(k)} + \frac{1}{2}\left(y - \hat\theta^{(k)}\right) = \frac{1}{2}\left(y + \hat\theta^{(k)}\right),$$
hence

$$\hat\theta^{(k+1)} = \frac{1}{2}\left(y + \hat\theta^{(k)}\right).$$

As $k \to \infty$, $\hat\theta^{(k)}$ converges to $\theta^\infty$ satisfying

$$\theta^\infty = \frac{1}{2}\left(y + \theta^\infty\right),$$

which can be solved to yield

$$\theta^\infty(y) = y = \hat\theta_{ML}(y).$$

That is, the EM algorithm indeed converges to the ML solution.
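The iteration just derived is easy to run; a short sketch (the observation and starting point are arbitrary) also illustrates the monotonicity of the incomplete-data log-likelihood guaranteed by Theorem 14.4:

```python
import math

def incomplete_loglik(theta, y):
    """l_id(theta, y) = -(y - theta)^2/4 - (1/2) ln(4 pi), as derived above."""
    return -(y - theta) ** 2 / 4.0 - 0.5 * math.log(4 * math.pi)

y, theta = 3.0, -5.0                         # arbitrary observation and start
liks = [incomplete_loglik(theta, y)]
for _ in range(50):
    theta = 0.5 * (y + theta)                # EM update derived above
    liks.append(incomplete_loglik(theta, y))

print(abs(theta - y) < 1e-9)                         # converged to the MLE y
print(all(a <= b for a, b in zip(liks, liks[1:])))   # nondecreasing (Thm 14.4)
```

The gap $|\hat\theta^{(k)} - y|$ halves at every step, so convergence is geometric with rate $1/2$.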
Example 14.11 Estimation in Gaussian Mixture Models. This is one of the most famous problems to which the EM algorithm has been (successfully) applied. Assume the data $Y = \{Y_i, 1 \le i \le n\} \in \mathbb{R}^n$ are drawn i.i.d. from a pdf $p_\theta(y)$ which is a mixture of two Gaussian distributions with means $\mu$ and $\nu$ and variances equal to 1, with probabilities $\rho$ and $(1 - \rho)$, respectively:

$$p_\theta(y) = \rho\,\phi(\mu, y) + (1 - \rho)\,\phi(\nu, y),$$

where

$$\phi(m, y) \triangleq \frac{1}{\sqrt{2\pi}}\, e^{-(y-m)^2/2}$$
is the Gaussian pdf with mean $m$ and variance 1. Let $\theta = (\mu, \nu, \rho)$. The log-likelihood function is given by

$$\ln p_\theta(y) = \sum_{i=1}^n \ln\left(\rho\,\phi(\mu, y_i) + (1 - \rho)\,\phi(\nu, y_i)\right),$$

which is nonconvex in $\theta$. The likelihood equation $\nabla_\theta \ln p_\theta(y) = 0$ takes the form

$$0 = \frac{\partial}{\partial\rho}\ln p_\theta(y) = \sum_{i=1}^n \frac{\phi(\mu, y_i) - \phi(\nu, y_i)}{\rho\,\phi(\mu, y_i) + (1 - \rho)\,\phi(\nu, y_i)},$$

$$0 = \frac{\partial}{\partial\mu}\ln p_\theta(y) = \sum_{i=1}^n \frac{\rho\,(y_i - \mu)\,\phi(\mu, y_i)}{\rho\,\phi(\mu, y_i) + (1 - \rho)\,\phi(\nu, y_i)}, \qquad (14.49)$$

$$0 = \frac{\partial}{\partial\nu}\ln p_\theta(y) = \sum_{i=1}^n \frac{(1 - \rho)(y_i - \nu)\,\phi(\nu, y_i)}{\rho\,\phi(\mu, y_i) + (1 - \rho)\,\phi(\nu, y_i)}.$$
This nonlinear system of three equations with three unknowns cannot be solved in closed form, and may have multiple local maxima, or even local minima and stationary points. The EM approach to this problem adds auxiliary variables to $y$ in such a way as to make $Q(\theta \mid \hat\theta^{(k)})$ easy to maximize. There is some guesswork involved here, but a good choice for the auxiliary variables is the set of binary random variables that determine which
of $\mathcal{N}(\mu, 1)$ or $\mathcal{N}(\nu, 1)$ is chosen as the distribution for the observations. In particular, let $\{X_i\}_{i=1}^n$ be i.i.d. Bernoulli($\rho$) random variables such that

$$Y_i \sim \begin{cases} \mathcal{N}(\mu, 1) & \text{if } X_i = 1 \\ \mathcal{N}(\nu, 1) & \text{if } X_i = 0. \end{cases}$$

The complete data is set to be $Z = (Y, X)$. Then

$$\ell_{cd}(\theta, z) = \ln p_\theta(z) = \sum_{i=1}^n \ln p_\theta(z_i),$$

where

$$p_\theta(z_i) = p_\theta(y_i, x_i) = p_\theta(y_i \mid x_i)\, p_\theta(x_i) = \begin{cases} \phi(\mu, y_i)\,\rho & \text{if } x_i = 1 \\ \phi(\nu, y_i)\,(1 - \rho) & \text{if } x_i = 0, \end{cases}$$

and so

$$\ln p_\theta(z_i) = \begin{cases} \ln(\phi(\mu, y_i)\,\rho) & \text{if } x_i = 1 \\ \ln(\phi(\nu, y_i)\,(1 - \rho)) & \text{if } x_i = 0 \end{cases} = x_i \ln(\phi(\mu, y_i)\,\rho) + (1 - x_i)\ln(\phi(\nu, y_i)\,(1 - \rho)).$$
E-Step:

$$Q(\theta \mid \hat\theta^{(k)}) = E_{\hat\theta^{(k)}}\left[\ell_{cd}(\theta, Z) \mid Y = y\right] = \sum_{i=1}^n x_i^{(k)}\ln(\phi(\mu, y_i)\,\rho) + (1 - x_i^{(k)})\ln(\phi(\nu, y_i)\,(1 - \rho)), \qquad (14.50)$$

where

$$x_i^{(k)} \triangleq E_{\hat\theta^{(k)}}\left[X_i \mid Y = y\right].$$

Since $X_i$ is independent of $Y_k$ for $k \ne i$, we have

$$x_i^{(k)} = E_{\hat\theta^{(k)}}\left[X_i \mid Y_i = y_i\right] = P_{\hat\theta^{(k)}}(X_i = 1 \mid Y_i = y_i) = \frac{\hat\rho^{(k)}\,\phi(\hat\mu^{(k)}, y_i)}{\hat\rho^{(k)}\,\phi(\hat\mu^{(k)}, y_i) + (1 - \hat\rho^{(k)})\,\phi(\hat\nu^{(k)}, y_i)}.$$
M-Step: Note that

$$\ln(\phi(\mu, y_i)\,\rho) = \ln\rho - \frac{(y_i - \mu)^2}{2} + (\text{constant in } \theta)$$

and

$$\ln(\phi(\nu, y_i)\,(1 - \rho)) = \ln(1 - \rho) - \frac{(y_i - \nu)^2}{2} + (\text{constant in } \theta).$$

Therefore $Q(\theta \mid \hat\theta^{(k)})$ in (14.50) is a concave function of $\theta = (\mu, \nu, \rho)$. Taking derivatives and setting them equal to 0, we obtain $\hat\theta^{(k+1)}$ as follows:
$$\frac{\partial}{\partial\rho} Q(\theta \mid \hat\theta^{(k)}) = 0 \;\Longrightarrow\; \frac{1}{\rho}\sum_{i=1}^n x_i^{(k)} - \frac{1}{1-\rho}\sum_{i=1}^n (1 - x_i^{(k)}) = 0 \;\Longrightarrow\; \hat\rho^{(k+1)} = \frac{1}{n}\sum_{i=1}^n x_i^{(k)},$$

$$\frac{\partial}{\partial\mu} Q(\theta \mid \hat\theta^{(k)}) = 0 \;\Longrightarrow\; \sum_{i=1}^n x_i^{(k)}(y_i - \mu) = 0 \;\Longrightarrow\; \hat\mu^{(k+1)} = \frac{\sum_{i=1}^n x_i^{(k)} y_i}{\sum_{i=1}^n x_i^{(k)}},$$

$$\frac{\partial}{\partial\nu} Q(\theta \mid \hat\theta^{(k)}) = 0 \;\Longrightarrow\; \sum_{i=1}^n (1 - x_i^{(k)})(y_i - \nu) = 0 \;\Longrightarrow\; \hat\nu^{(k+1)} = \frac{\sum_{i=1}^n (1 - x_i^{(k)}) y_i}{\sum_{i=1}^n (1 - x_i^{(k)})}.$$

The solutions for $\hat\rho^{(k+1)}$, $\hat\mu^{(k+1)}$, and $\hat\nu^{(k+1)}$ have the following intuitive interpretation. The quantity $x_i^{(k)}$ is an estimate of the probability that $Y_i$ has mean $\mu$ at iteration $k$. The average of the $\{x_i^{(k)}\}_{i=1}^n$ is then a reasonable candidate for $\hat\rho^{(k+1)}$. The weighted averages of the observations corresponding to those estimated to have mean $\mu$ and those estimated to have mean $\nu$ are reasonable candidates for $\hat\mu^{(k+1)}$ and $\hat\nu^{(k+1)}$, respectively.
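The E- and M-steps above translate directly into code. A sketch on synthetic data (the true values $\mu = -2$, $\nu = 2$, $\rho = 0.3$, the sample size, and the initialization are all arbitrary choices):

```python
import math, random

def phi(m, y):
    """Unit-variance Gaussian pdf with mean m."""
    return math.exp(-(y - m) ** 2 / 2) / math.sqrt(2 * math.pi)

def em_gmm(y, mu, nu, rho, iters=200):
    """EM for the two-component, unit-variance Gaussian mixture above."""
    for _ in range(iters):
        # E-step: posterior probability x_i that y_i came from N(mu, 1)
        x = [rho * phi(mu, yi) / (rho * phi(mu, yi) + (1 - rho) * phi(nu, yi))
             for yi in y]
        # M-step: weighted averages, exactly as derived above
        rho = sum(x) / len(y)
        mu = sum(xi * yi for xi, yi in zip(x, y)) / sum(x)
        nu = sum((1 - xi) * yi for xi, yi in zip(x, y)) / sum(1 - xi for xi in x)
    return mu, nu, rho

rng = random.Random(7)
y = [rng.gauss(-2.0, 1.0) if rng.random() < 0.3 else rng.gauss(2.0, 1.0)
     for _ in range(2000)]
mu_hat, nu_hat, rho_hat = em_gmm(y, mu=-1.0, nu=1.0, rho=0.5)
print(round(mu_hat, 2), round(nu_hat, 2), round(rho_hat, 2))  # near (-2, 2, 0.3)
```

With well-separated components and a sensible initialization, the iterates settle near the true parameters; as noted in Remark 14.6, a poor initialization can lead to a different stable point.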
Example 14.12 Joint Estimation of Signal Amplitude and Noise Variance. Consider

$$Y = aX + W,$$

where $X \sim \mathcal{N}(0, C)$ and $W \sim \mathcal{N}(0, vI)$ are independent Gaussian vectors, and $a, v$ are unknown parameters. With $\theta = (a, v)$, the log-likelihood function of the observations is given by
$$\ln p_\theta(y) = -\frac{1}{2}\, y^\top (a^2 C + vI)^{-1} y - \frac{1}{2}\ln\left|a^2 C + vI\right| - \frac{n}{2}\ln(2\pi),$$
and the corresponding likelihood equation is given by

$$0 = \frac{\partial}{\partial a}\ln p_\theta(y) = a\, y^\top (a^2 C + vI)^{-1} C\, (a^2 C + vI)^{-1} y - a\,\mathrm{Tr}\left[(a^2 C + vI)^{-1} C\right],$$

$$0 = \frac{\partial}{\partial v}\ln p_\theta(y) = \frac{1}{2}\, y^\top (a^2 C + vI)^{-1}(a^2 C + vI)^{-1} y - \frac{1}{2}\,\mathrm{Tr}\left[(a^2 C + vI)^{-1}\right],$$

where we have used the results at the end of Appendix A. While the log-likelihood function is easily seen to be concave in $\theta$, the likelihood equation does not have a closed-form solution. Therefore one has to resort to iterative techniques, such as gradient ascent, to obtain $\hat{a}_{ML}$ and $\hat{v}_{ML}$. The EM algorithm, which we discuss next, provides an iterative solution for $\hat{a}_{ML}$ and $\hat{v}_{ML}$, with an intuitive interpretation.
For the EM approach, we set the complete data to be $Z = (Y, X)$. Then

$$\ell_{cd}(\theta, z) = \ln p_\theta(z) = \ln p_\theta(y \mid x) + \ln p_\theta(x) = -\frac{\|y - ax\|^2}{2v} - \frac{n}{2}\ln v + (\text{constant in } \theta).$$
E-Step:

$$Q(\theta \mid \hat\theta^{(k)}) = -\frac{1}{2v}\, E_{\hat\theta^{(k)}}\left[\|Y - aX\|^2 \mid Y = y\right] - \frac{n}{2}\ln v.$$
To compute the expectation we need the pdf of $X$ given $Y$, under $\hat\theta^{(k)}$. Since $X$ and $Y$ are jointly Gaussian,

$$p_{\hat\theta^{(k)}}(x \mid y) \sim \mathcal{N}(\hat{x}^{(k)}, \hat{C}^{(k)}),$$
where $\hat{x}^{(k)}$ and $\hat{C}^{(k)}$ are computed using the LMMSE formulas (see Section 11.7.4) as follows:

$$\hat{x}^{(k)} = \mathrm{Cov}_{\hat\theta^{(k)}}(X, Y)\,\mathrm{Cov}_{\hat\theta^{(k)}}(Y)^{-1}\, y = \hat{a}^{(k)} C\left[(\hat{a}^{(k)})^2 C + \hat{v}^{(k)} I\right]^{-1} y = \frac{\hat{a}^{(k)}}{\hat{v}^{(k)}}\left[\frac{(\hat{a}^{(k)})^2}{\hat{v}^{(k)}}\, C + I\right]^{-1} C\, y$$

and

$$\hat{C}^{(k)} = C - \mathrm{Cov}_{\hat\theta^{(k)}}(X, Y)\,\mathrm{Cov}_{\hat\theta^{(k)}}(Y)^{-1}\,\mathrm{Cov}_{\hat\theta^{(k)}}(Y, X) = C - (\hat{a}^{(k)})^2 C\left[(\hat{a}^{(k)})^2 C + \hat{v}^{(k)} I\right]^{-1} C = \left[\frac{(\hat{a}^{(k)})^2}{\hat{v}^{(k)}}\, I + C^{-1}\right]^{-1},$$

where the last expression follows from the Matrix Inversion Lemma (see Lemma A.1 in Appendix A). Note that we have

$$\hat{x}^{(k)} = \frac{\hat{a}^{(k)}}{\hat{v}^{(k)}}\,\hat{C}^{(k)} y.$$

We are now ready to compute $Q(\theta \mid \hat\theta^{(k)})$ as follows:

$$E_{\hat\theta^{(k)}}\left[\|Y - aX\|^2 \,\middle|\, Y = y\right] = E_{\hat\theta^{(k)}}\left[\|(y - a\hat{x}^{(k)}) - a(X - \hat{x}^{(k)})\|^2 \,\middle|\, Y = y\right] = \|y - a\hat{x}^{(k)}\|^2 + a^2\, E_{\hat\theta^{(k)}}\left[\|X - \hat{x}^{(k)}\|^2 \,\middle|\, Y = y\right] = \|y - a\hat{x}^{(k)}\|^2 + a^2\,\mathrm{Tr}\left[\hat{C}^{(k)}\right],$$
where the second line follows from the orthogonality principle (see Section 11.7.4). Thus

$$Q(\theta \mid \hat\theta^{(k)}) = -\frac{1}{2v}\left[\|y - a\hat{x}^{(k)}\|^2 + a^2\,\mathrm{Tr}\left[\hat{C}^{(k)}\right]\right] - \frac{n}{2}\ln v.$$
M-Step: It is easy to see that $Q(\theta \mid \hat\theta^{(k)})$ is concave in $(a, v)$. Taking derivatives and setting them equal to 0, we obtain the iterates $\hat{a}^{(k+1)}$ and $\hat{v}^{(k+1)}$ as follows:

$$\frac{\partial}{\partial a} Q(\theta \mid \hat\theta^{(k)}) = 0 \;\Longrightarrow\; 2a\,\mathrm{Tr}\left[\hat{C}^{(k)}\right] + 2a\|\hat{x}^{(k)}\|^2 - 2\, y^\top \hat{x}^{(k)} = 0 \;\Longrightarrow\; \hat{a}^{(k+1)} = \frac{y^\top \hat{x}^{(k)}}{\mathrm{Tr}\left[\hat{C}^{(k)}\right] + \|\hat{x}^{(k)}\|^2},$$

$$\frac{\partial}{\partial v} Q(\theta \mid \hat\theta^{(k)}) = 0 \;\Longrightarrow\; \frac{1}{2v^2}\left[\|y - a\hat{x}^{(k)}\|^2 + a^2\,\mathrm{Tr}\left[\hat{C}^{(k)}\right]\right] - \frac{n}{2v} = 0 \;\Longrightarrow\; \hat{v}^{(k+1)} = \frac{1}{n}\left[\|y - \hat{a}^{(k+1)}\hat{x}^{(k)}\|^2 + (\hat{a}^{(k+1)})^2\,\mathrm{Tr}\left[\hat{C}^{(k)}\right]\right].$$
14.12 Recursive Estimation
In many applications, the observations used for estimation are obtained sequentially in time, and it is desirable to produce a parameter estimate at each time $k$ based only on the observations obtained up to time $k$, which can easily be updated to produce a parameter estimate at time $k+1$ based on the estimate at time $k$ and the observation at time $k+1$. Such estimation procedures are called recursive estimation procedures. Note that recursive estimation procedures should not be confused with iterative estimation procedures such as the EM algorithm, where an estimate based on one scalar or vector observation is refined over multiple steps to approach an optimal estimate.

14.12.1 Recursive MLE
In this subsection we use $\hat\theta_k$ to denote the MLE of $\theta$ obtained from $k$ i.i.d. observations $Y_1, Y_2, \ldots, Y_k$, each with distribution $p_\theta$. We start with an example illustrating that, in some cases, the exact MLE can be computed recursively, without the need for any approximation.
Example 14.13 For each $i$, let

$$p_\theta(y_i) = \frac{1}{\theta}\, e^{-y_i/\theta}\, 1\{y_i \ge 0\}.$$
Then

$$\ln p_\theta(y_1, \ldots, y_k) = \sum_{i=1}^k \ln p_\theta(y_i) = -k\ln\theta - \frac{1}{\theta}\sum_{i=1}^k y_i,$$

from which we can easily conclude that the MLE with $k$ observations is given by

$$\hat\theta_k = \frac{1}{k}\sum_{i=1}^k y_i.$$

Extending this solution to $k+1$ observations, we see that

$$\hat\theta_{k+1} = \frac{1}{k+1}\sum_{i=1}^{k+1} y_i = \frac{k\hat\theta_k + y_{k+1}}{k+1},$$
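In code, this recursion amounts to a running sample mean (a trivial sketch with made-up numbers):

```python
ys = [0.5, 2.1, 1.7, 0.3, 1.4]              # hypothetical observations
theta = ys[0]                               # MLE after the first observation
for k, y_next in enumerate(ys[1:], start=1):
    theta = (k * theta + y_next) / (k + 1)  # recursive MLE update above
print(abs(theta - sum(ys) / len(ys)) < 1e-12)   # matches the batch MLE
```

Only the current estimate and the count $k$ need to be stored, not the past observations.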
which implies that the MLE can be computed recursively. Such a recursion does not always hold for the MLE, but fortunately a good approximation to the MLE always admits one. For simplicity we assume here that $\theta$ is a scalar, but the development can easily be extended to the case where $\theta$ is a vector. Define

$$\psi(y_i, \theta) \triangleq \frac{\partial}{\partial\theta}\ln p_\theta(y_i).$$
Then $\hat\theta_k$ satisfies the likelihood equation

$$\sum_{i=1}^k \psi(y_i, \hat\theta_k) = 0.$$

Now, assuming that $\hat\theta_k$ is consistent (see Section 14.6), we have

$$\hat\theta_{k+1} - \hat\theta_k \to 0 \quad \text{as } k \to \infty, \text{ i.p. or a.s.}$$

This justifies the Taylor-series-based approximation

$$\psi(y_i, \hat\theta_{k+1}) \approx \psi(y_i, \hat\theta_k) + (\hat\theta_{k+1} - \hat\theta_k)\,\psi'(y_i, \hat\theta_k). \qquad (14.51)$$

Note that

$$\psi'(y_i, \theta) = \frac{\partial^2}{\partial\theta^2}\ln p_\theta(y_i).$$

Since $\hat\theta_{k+1}$ is the MLE obtained using $(k+1)$ i.i.d. observations, it must satisfy the likelihood equation

$$\sum_{i=1}^{k+1} \psi(y_i, \hat\theta_{k+1}) = 0.$$
Based on the approximation for $\psi(y_i, \hat\theta_{k+1})$ in (14.51), we have

$$\sum_{i=1}^{k+1}\left[\psi(y_i, \hat\theta_k) + (\hat\theta_{k+1} - \hat\theta_k)\,\psi'(y_i, \hat\theta_k)\right] \approx 0,$$

which can be rewritten as

$$\hat\theta_{k+1} \approx \hat\theta_k + \frac{\sum_{i=1}^{k+1}\psi(y_i, \hat\theta_k)}{-\sum_{i=1}^{k+1}\psi'(y_i, \hat\theta_k)}. \qquad (14.52)$$
This approximation does not yet yield a recursive estimator, since the right-hand side depends explicitly on all of the observations up to time $k+1$, but we can simplify it further to obtain a recursive estimator as follows. First, since $\sum_{i=1}^k \psi(y_i, \hat\theta_k) = 0$, the numerator on the right-hand side of (14.52) reduces to $\psi(y_{k+1}, \hat\theta_k)$. For the denominator, note that by the Strong Law of Large Numbers,

$$-\frac{1}{k+1}\sum_{i=1}^{k+1}\psi'(y_i, \theta) \overset{a.s.}{\longrightarrow} -E_\theta\left[\frac{\partial^2}{\partial\theta^2}\ln p_\theta(Y_1)\right] = I_\theta.$$

This allows us to make the approximation

$$-\sum_{i=1}^{k+1}\psi'(y_i, \hat\theta_k) \approx (k+1)\, I_{\hat\theta_k}. \qquad (14.53)$$
Plugging (14.53) into (14.52) yields the following recursive approximation to the MLE:

$$\hat\theta_{k+1} \approx \hat\theta_k + \frac{\psi(y_{k+1}, \hat\theta_k)}{(k+1)\, I_{\hat\theta_k}}. \qquad (14.54)$$

The estimator obtained by treating this approximation as an equality is called the recursive MLE. It can be shown that the recursion in (14.54) is a special case of a stochastic approximation procedure [13], and yields an estimate with the same asymptotic properties as the MLE (see, e.g., [14]).
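As an illustration of (14.54) in a family where no exact recursion exists, consider the standard Cauchy location model $p_\theta(y) = [\pi(1 + (y - \theta)^2)]^{-1}$, for which $\psi(y, \theta) = 2(y - \theta)/(1 + (y - \theta)^2)$ and $I_\theta = 1/2$ (standard facts, used here without derivation; the data are simulated and the initializer is a crude median):

```python
import math, random, statistics

def psi(y, t):
    """Score function of the standard Cauchy location family."""
    d = y - t
    return 2.0 * d / (1.0 + d * d)

rng = random.Random(3)
theta_true = 1.5
# Cauchy(theta_true) samples via the inverse cdf: theta + tan(pi*(U - 1/2))
ys = [theta_true + math.tan(math.pi * (rng.random() - 0.5)) for _ in range(20000)]

theta = statistics.median(ys[:200])           # rough but consistent initializer
for k, y in enumerate(ys, start=1):
    theta += psi(y, theta) / ((k + 1) * 0.5)  # recursion (14.54) with I = 1/2
print(round(theta, 2))                        # typically close to 1.5
```

Despite the heavy Cauchy tails (for which the sample mean fails entirely), the bounded score $\psi$ keeps each update small and the recursion homes in on the true location.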
14.12.2 Recursive Approximations to the Least-Squares Solution

We now develop a recursive approximation to the least-squares solution $\hat\theta_{LS}(y)$ given in (14.44). First note that the function being optimized in least squares can be rewritten as
$$\|y - H\theta\|^2 = \sum_{k=1}^n \|y_k - H_k\theta\|^2,$$
and its gradient with respect to $\theta$ is given by

$$\nabla_\theta \|y - H\theta\|^2 = \sum_{k=1}^n 2 H_k^\top (H_k\theta - y_k).$$
The terms in the summation on the right-hand side can be used in a stochastic gradient descent procedure to produce a recursive (online) estimator for $\theta$, which is called the least mean squares (LMS) algorithm:

$$\hat\theta_{k+1} = \hat\theta_k - \mu_k H_{k+1}^\top (H_{k+1}\hat\theta_k - y_{k+1}), \qquad (14.55)$$

where $\mu_k > 0$ is called the step size at stage $k$. Under appropriate conditions on the step sizes $\{\mu_k\}$ and the distribution of the observations, it can be shown that the LMS update has the desirable property [15]:

$$\hat\theta_k \overset{i.p.}{\longrightarrow} \hat\theta_{LS}. \qquad (14.56)$$
The rate of convergence in (14.56) can be improved by using a modification of LMS akin to Newton's method [16] for gradient descent. The resulting algorithm is called the recursive least squares (RLS) algorithm, which has the following update equation:

$$\hat\theta_{k+1} = \hat\theta_k - C_k H_{k+1}^\top (H_{k+1}\hat\theta_k - y_{k+1}), \qquad (14.57)$$
where $C_k$ can be interpreted as the covariance of the estimation error at time $k$, and is updated using the Riccati equation (which we will study in more detail in Chapter 15, (15.28)):

$$C_{k+1} = C_k - C_k H_{k+1}^\top \left(I + H_{k+1} C_k H_{k+1}^\top\right)^{-1} H_{k+1} C_k.$$
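A scalar sketch of the RLS recursion (hypothetical data; one convention is an assumption flagged in the comments): with one unknown $\theta$ and scalar regressors $h_k$, the matrix updates collapse to a few arithmetic operations.

```python
# Scalar sketch of RLS: one unknown theta, scalar H_k = h_k.
# Convention (assumption): the error covariance C is refreshed via the Riccati
# update before the gain is applied; under this ordering, and with a large
# initial C, the scalar recursion reproduces the batch least-squares solution
# theta_LS = sum(h*y) / sum(h*h) up to rounding.
hs = [1.0, 2.0, 0.5, 1.5, 3.0, 1.0, 2.5]
ys = [2.1, 3.9, 1.2, 3.0, 6.2, 1.8, 5.1]    # hypothetical noisy h*theta values

theta, C = 0.0, 1e8                          # vague prior: huge initial covariance
for h, y in zip(hs, ys):
    C = C - C * h * (1.0 + h * C * h) ** -1 * h * C   # Riccati update
    theta = theta - C * h * (h * theta - y)           # RLS update (14.57)

batch = sum(h * y for h, y in zip(hs, ys)) / sum(h * h for h in hs)
print(abs(theta - batch) < 1e-6)             # prints True: matches batch LS
```

A finite initial $C$ corresponds to a regularized (ridge-type) solution; letting it grow recovers plain least squares, which is why no pseudoinverse ever needs to be formed.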
14.13 Appendix: Proof of Theorem 14.2
By (14.31) and (14.33), there exists a strongly consistent sequence $\hat\theta_n$ of local maxima of $p_{\theta'}(Y)$ in the set $\{\theta' \in \mathcal{X} : |\theta - \theta'| \le \epsilon\}$. Such a sequence satisfies the likelihood equation (14.35). By the differentiability assumption (A4), Taylor's remainder theorem may be applied to the likelihood equation around $\theta$. We obtain

$$0 = \frac{1}{\sqrt{n}}\sum_{i=1}^n \psi(Y_i, \hat\theta_n) = \frac{1}{\sqrt{n}}\sum_{i=1}^n\left[\psi(Y_i, \theta) + (\hat\theta_n - \theta)\,\psi'(Y_i, \theta) + \frac{1}{2}(\hat\theta_n - \theta)^2\,\psi''(Y_i, \overline\theta_n)\right]$$

$$= \underbrace{\frac{1}{\sqrt{n}}\sum_{i=1}^n \psi(Y_i, \theta)}_{=U_n} + \sqrt{n}\,(\hat\theta_n - \theta)\Bigg[\underbrace{\frac{1}{n}\sum_{i=1}^n \psi'(Y_i, \theta)}_{=V_n} + \underbrace{(\hat\theta_n - \theta)}_{=W_n}\underbrace{\frac{1}{2n}\sum_{i=1}^n \psi''(Y_i, \overline\theta_n)}_{=X_n}\Bigg] \qquad (14.58)$$

for some $\overline\theta_n \in [\theta, \hat\theta_n]$. By Assumptions (A3), (A4), and (A7), (13.6) holds, and

$$E_\theta[\psi(Y, \theta)] = E_\theta\left[\frac{\partial \ln p_\theta(Y)}{\partial\theta}\right] = 0.$$
By Assumption (A6) and the Central Limit Theorem,

$$U_n \triangleq \frac{1}{\sqrt{n}}\sum_{i=1}^n \psi(Y_i, \theta) \overset{d}{\longrightarrow} \mathcal{N}(0, I_\theta).$$
By the WLLN,

$$V_n \triangleq \frac{1}{n}\sum_{i=1}^n \psi'(Y_i, \theta) \overset{i.p.}{\longrightarrow} E_\theta[\psi'(Y, \theta)] = -I_\theta$$

and

$$X_n \triangleq \frac{1}{2n}\sum_{i=1}^n \psi''(Y_i, \overline\theta_n) \overset{i.p.}{\longrightarrow} \frac{1}{2}\, E_\theta[\psi''(Y, \theta)],$$

which is finite by Assumption (A5) since
$$\left|E_\theta[\psi''(Y, \theta)]\right| \le E_\theta[M(Y)] < \infty.$$

Since $\hat\theta_n \overset{a.s.}{\longrightarrow} \theta$, this implies that $W_n \overset{i.p.}{\longrightarrow} 0$. Since $X_n$ converges i.p. to a constant, it follows from Slutsky's theorem (see Appendix D) that the product $W_n X_n$ converges i.p. to 0. Since $V_n \overset{i.p.}{\longrightarrow} -I_\theta$, it also follows from Slutsky's theorem that the sum $V_n + W_n X_n$ converges i.p. to $-I_\theta$. Then (14.58) yields

$$\sqrt{n}(\hat\theta_n - \theta) = \frac{U_n}{-V_n - W_n X_n} \overset{d}{\longrightarrow} \mathcal{N}\left(0, \frac{1}{I_\theta}\right).$$

This completes the proof.
14.14 Appendix: Proof of Theorem 14.4

Since $p_\theta(z) = p_\theta(z \mid y)\, p_\theta(y)$, we have that
$$\ell_{cd}(\theta, z) = \ell_{id}(\theta, y) + \ln p_\theta(z \mid y). \qquad (14.59)$$

Now using the fact that

$$\int p_{\hat\theta^{(k)}}(z \mid y)\, d\mu(z) = 1,$$

we get that

$$\ell_{id}(\theta, y) = \int \ell_{id}(\theta, y)\, p_{\hat\theta^{(k)}}(z \mid y)\, d\mu(z). \qquad (14.60)$$
Plugging (14.59) into (14.60), we get

$$\ell_{id}(\theta, y) = \int \ell_{cd}(\theta, z)\, p_{\hat\theta^{(k)}}(z \mid y)\, d\mu(z) - \int \ln p_\theta(z \mid y)\, p_{\hat\theta^{(k)}}(z \mid y)\, d\mu(z). \qquad (14.61)$$

The KL divergence between the distributions $p_{\hat\theta^{(k)}}(z \mid y)$ and $p_\theta(z \mid y)$ is given by

$$\int \ln p_{\hat\theta^{(k)}}(z \mid y)\, p_{\hat\theta^{(k)}}(z \mid y)\, d\mu(z) - \int \ln p_\theta(z \mid y)\, p_{\hat\theta^{(k)}}(z \mid y)\, d\mu(z).$$
Since this KL divergence is nonnegative, we get from (14.61) that

$$\ell_{id}(\theta, y) \ge \int \ell_{cd}(\theta, z)\, p_{\hat\theta^{(k)}}(z \mid y)\, d\mu(z) - \int \ln p_{\hat\theta^{(k)}}(z \mid y)\, p_{\hat\theta^{(k)}}(z \mid y)\, d\mu(z). \qquad (14.62)$$
Therefore if

$$\int \ell_{cd}(\hat\theta^{(k+1)}, z)\, p_{\hat\theta^{(k)}}(z \mid y)\, d\mu(z) \ge \int \ell_{cd}(\hat\theta^{(k)}, z)\, p_{\hat\theta^{(k)}}(z \mid y)\, d\mu(z), \qquad (14.63)$$
then, by (14.62), we obtain

$$\ell_{id}(\hat\theta^{(k+1)}, y) \ge \int \ell_{cd}(\hat\theta^{(k+1)}, z)\, p_{\hat\theta^{(k)}}(z \mid y)\, d\mu(z) - \int \ln p_{\hat\theta^{(k)}}(z \mid y)\, p_{\hat\theta^{(k)}}(z \mid y)\, d\mu(z)$$

$$\ge \int \ell_{cd}(\hat\theta^{(k)}, z)\, p_{\hat\theta^{(k)}}(z \mid y)\, d\mu(z) - \int \ln p_{\hat\theta^{(k)}}(z \mid y)\, p_{\hat\theta^{(k)}}(z \mid y)\, d\mu(z)$$

$$= \int \ell_{id}(\hat\theta^{(k)}, y)\, p_{\hat\theta^{(k)}}(z \mid y)\, d\mu(z) = \ell_{id}(\hat\theta^{(k)}, y),$$
where the last two lines follow from (14.59) and (14.60). Therefore the statement of the theorem holds if (14.63) can be met. One way to meet the condition in (14.63) is by choosing

$$\hat\theta^{(k+1)} = \arg\max_{\theta \in \mathcal{X}} \int \ell_{cd}(\theta, z)\, p_{\hat\theta^{(k)}}(z \mid y)\, d\mu(z) = \arg\max_{\theta \in \mathcal{X}} E_{\hat\theta^{(k)}}\left[\ell_{cd}(\theta, Z) \mid Y = y\right],$$

which can be broken up into the two steps of the EM algorithm given in (14.47) and (14.48).

Exercises
14.1 Point on a Unit Circle. Consider a point $s$ on the unit circle in $\mathbb{R}^2$. You are given $n$ i.i.d. points

$$Y_i = s + Z_i, \qquad 1 \le i \le n,$$

where $\{Z_i\}_{i=1}^n$ are i.i.d. $\mathcal{N}(0, C)$, where the covariance matrix $C = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$ and $|\rho| < 1$.

(a) Determine the ML estimator of $s$.
(b) Is the ML estimator unbiased?
(c) Is the ML estimator consistent (in the a.s. and m.s. senses) as $n \to \infty$?

14.2 MLE in Gaussian Noise.

(a) Derive the ML estimator of $\theta \ge 0$ given the observations $Y_i = \theta/2^i + Z_i$, $1 \le i \le n$, where $\{Z_i\}_{i=1}^n$ are i.i.d. $\mathcal{N}(0, 1)$.
(b) Is the ML estimator unbiased?
(c) Is the ML estimator consistent (in the a.s. and m.s. senses) as $n \to \infty$?
14.3 MLE in Poisson Family. Suppose $\{Y_k\}_{k=1}^n$ are i.i.d. Poisson random variables with parameter $\theta > 0$, i.e.,

$$p_\theta(y_k) = \frac{e^{-\theta}\,\theta^{y_k}}{y_k!}, \qquad y_k = 0, 1, 2, \ldots,$$

and $p_\theta(y) = \prod_{k=1}^n p_\theta(y_k)$. Hint: You may use the fact that under $P_\theta$, both the mean and the variance of $Y_k$ equal $\theta$.

(a) Find $\hat\theta_{ML}(y)$.
(b) Is $\hat\theta_{ML}$ unbiased? If not, find its bias.
(c) Find the variance of $\hat\theta_{ML}$.
(d) Find the CRLB on the variance of unbiased estimators of $\theta$.
(e) Find an MVUE for $\theta$ based on $y$.
14.4 MLE in Rayleigh Family. Suppose $\{Y_k\}_{k=1}^n$ are i.i.d. Rayleigh random variables with parameter $\theta > 0$, i.e.,

$$p_\theta(y_k) = \frac{2 y_k}{\theta}\, e^{-y_k^2/\theta}\, 1\{y_k \ge 0\},$$

and $p_\theta(y) = \prod_{k=1}^n p_\theta(y_k)$. We wish to estimate $\phi = \theta^2$. Hint: You may use the fact that, under $P_\theta$, $X = \sum_{k=1}^n Y_k^2$ has a Gamma pdf given by

$$p_\theta(x) = \frac{x^{n-1}\, e^{-x/\theta}}{(n-1)!\,\theta^n}\, 1\{x \ge 0\}.$$

(a) Find the MLE of $\phi$ and compute its bias.
(b) Find the MVUE of $\phi$.
(c) Which of these two estimates has a smaller MSE?
(d) Compute the CRLB on the variance of unbiased estimators of $\phi$.
(e) Is the MLE asymptotically unbiased as $n \to \infty$?
(f) Is the MLE asymptotically efficient as $n \to \infty$?
14.5 Continuation of Exercise 13.8.

(a) Suppose $n = 2$. Find the MLE of $\theta$.
(b) Show that the MLE of part (a) is unbiased.
(c) Show that the MLE of part (a) is efficient, i.e., that the covariance of the MLE equals the inverse of the Fisher information matrix.
14.6 End Point Estimation. Denote by $p_\theta$ the uniform pdf over the union of two disjoint intervals $\mathcal{Y} = [0, \theta_1] \cup [\theta_2, 1]$, where $0 \le \theta_1 \le \theta_2 < 1$. The parameters $\theta_1$ and $\theta_2$ are unknown and are to be determined from $n$ i.i.d. samples $Y_1, \ldots, Y_n$ drawn from the distribution $p_\theta$. Determine the ML estimator of $\theta_1$ and $\theta_2$.

14.7 Randomized Estimator. Let $Y$ be a Bernoulli random variable with $P\{Y = 1\} = \theta = 1 - P\{Y = 0\}$. Consider the randomized estimator

$$\hat\theta = \begin{cases} 1/2 & \text{with prob. } \gamma \\ Y & \text{with prob. } 1 - \gamma, \end{cases}$$

whose performance will be assessed using the squared-error cost function $C(\theta, \hat\theta) = (\theta - \hat\theta)^2$. Note that the MLE is obtained as a special case when $\gamma = 0$ (no randomization).

(a) Give an expression for the conditional risk of this estimator, in terms of $\theta$ and $\gamma$.
(b) Find the value of $\gamma$ that minimizes the worst-case risk over $\theta \in [0, 1]$.
(c) Can you argue that the estimator $\hat\theta$ for the value of $\gamma$ calculated in part (b) is minimax?

14.8 Solutions to Likelihood Equation. Prove that the nonlinear system (14.40) also admits a feasible solution that is a perturbation of (14.38) and converges a.s. to the "simple" estimator $(\tilde\mu, \tilde\sigma^2)$ of (14.38) as $n \to \infty$.

14.9 Least-Squares Estimation.

(a) Prove the result in (14.44).
(b) Prove the result in (14.46).

14.10 EM Algorithm for Estimating the Parameter of an Exponential Distribution. Suppose that given a parameter $\theta > 0$, the random variable $X$ is exponential with parameter $\theta$, i.e.,
pθ (x ) = θ e−θ x
1{x ≥ 0}.
The random variable X is not observed directly; rather, a noisy version Y = X + W is observed, where W is exponential with parameter β and independent of X. Our goal is to estimate θ from Y.
(a) Find the likelihood function p_θ(y) and write down the corresponding likelihood equation.
(b) Can you find θ̂_ML(y) in closed form? If your answer is yes, provide the solution.
(c) Now consider computing θ̂_ML(y) iteratively using the EM algorithm based on Z = (Y, X). Compute the E-step of the algorithm, i.e., find Q(θ | θ̂^(k)).
(d) Compute the M-step in closed form, i.e., find θ̂^(k+1) in terms of θ̂^(k) and y.
Exercises
355
14.11 EM Algorithm for Interference Cancellation. Consider the observation model
Y_1 = X_1 + θ X_2 + W_1,
where X_1 ∼ N(0, C_1) is the desired signal, X_2 ∼ N(0, C_2) is additive interference, and W_1 ∼ N(0, σ² I) is white noise. We wish to cancel X_2 by placing a sensor close to the interferer that produces the observation
Y_2 = X_2 + W_2,
where W_2 ∼ N(0, σ² I) is white noise. Assume that X_1, X_2, W_1, and W_2 are independent, and that C_1, C_2, and σ² are known. To cancel X_2 in Y_1, we need to estimate θ. Develop an EM algorithm to estimate θ from Y_1, Y_2.
14.12 EM Algorithm Convergence. Consider a random variable W that has the Gamma density parameterized by θ:
p_θ(w) = (θ^α w^{α−1} e^{−wθ} / Γ(α)) 1{w ≥ 0},
where α > 0 is known and Γ is the Gamma function defined by the integral
Γ(x) = ∫_0^∞ t^{x−1} e^{−t} dt,  for x > 0.
Now, given W = w, suppose Y_1, Y_2, ..., Y_n are i.i.d. Exp(1/w) random variables, i.e.,
p_θ(y | w) = ∏_{k=1}^n w e^{−y_k w} 1{y_k ≥ 0}.
(a) Find p_θ(y) and use it to compute θ̂_ML(y). (b) Now, even though we can compute θ̂_ML(y) directly, develop an EM algorithm for computing θ̂_ML(y) using as complete data (Y, W). (c) Show that the θ̂_ML(y) you found in part (a) is indeed a stationary point (fixed point) of the EM iteration you developed in part (b).
14.13 James–Stein Estimator. Consider the estimation of a vector parameter θ ∈ R^m with m ≥ 3, from observations Y ∈ R^m with conditional distribution
p_θ(y) ∼ N(θ, I).
(a) Find θ̂_ML(y) and MSE_θ(θ̂_ML(Y)).
(b) Show that θ̂_MVUE(y) = θ̂_ML(y).
(c) Now consider the following James–Stein estimator [17]:
θ̂_S(y) = (1 − (m − 2)/‖y‖²) y.
Show that
MSE_θ(θ̂_S) = MSE_θ(θ̂_ML) − (m − 2)² E_θ[1/‖Y‖²],
i.e., θ̂_S is strictly better than θ̂_ML in terms of MSE.
Hint: First show using integration by parts that
E[Y_i(Y_i − θ_i)/‖Y‖²] = E[1/‖Y‖² − 2Y_i²/‖Y‖⁴],  for i = 1, 2, ..., m.
(d) Now consider the special case where θ = 0, and find the value of MSE_0(θ̂_S). Compare this value with MSE_0(θ̂_ML).
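Part (d) can also be checked numerically. The following Python sketch (the dimension, trial count, and seed are our own choices, not from the text) compares the empirical MSE of the MLE and of the James–Stein estimator at θ = 0:

```python
import numpy as np

def js_estimate(y):
    """James-Stein estimate (1 - (m-2)/||y||^2) y, applied row-wise."""
    m = y.shape[-1]
    return (1.0 - (m - 2) / np.sum(y**2, axis=-1, keepdims=True)) * y

rng = np.random.default_rng(0)
m, trials = 10, 20000
theta = np.zeros(m)                 # the special case theta = 0 of part (d)
Y = rng.standard_normal((trials, m)) + theta
mse_ml = np.mean(np.sum((Y - theta)**2, axis=1))              # MLE is Y itself
mse_js = np.mean(np.sum((js_estimate(Y) - theta)**2, axis=1))
print(mse_ml, mse_js)               # shrinkage should win at theta = 0
```

For θ = 0 the gap is largest; for ‖θ‖ large, the two MSEs nearly coincide.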
14.14 Method of Moments. Suppose {Y_k}_{k=1}^n are i.i.d. random variables with the Gamma pdf
p_θ(y) = (1/(Γ(θ_1) θ_2^{θ_1})) y^{θ_1−1} e^{−y/θ_2},
where θ = [θ_1 θ_2]^T is an unknown parameter vector. It is easy to see that the MLE for θ cannot be computed in closed form for this example. An alternative to the MLE, which is often more computationally feasible, is the so-called method of moments (MOM) estimator, in which we compute empirical approximations to a set of moments of the random variable Y_1 and use them to get estimates of the unknown parameter θ. In particular, for the problem at hand, we set
Ȳ = (1/n) ∑_{k=1}^n Y_k ≈ E_θ[Y_1] = g_1(θ_1, θ_2),   (14.64)
Y̅² = (1/n) ∑_{k=1}^n Y_k² ≈ E_θ[Y_1²] = g_2(θ_1, θ_2).   (14.65)
We treat the approximations in the above equations as equalities, and solve them jointly to obtain MOM estimates for θ_1 and θ_2 in terms of Ȳ and Y̅².
(a) Find g_1 and g_2 in (14.64) and (14.65), and solve the equations to find the MOM estimates θ̂_1^MOM and θ̂_2^MOM in terms of Ȳ and Y̅².
(b) Prove that the MOM estimates are strongly consistent, i.e., show that for i = 1, 2:
θ̂_i^MOM → θ_i a.s., as n → ∞.
References
[1] P. W. Zehna, “Invariance of Maximum Likelihood Estimators,” Ann. Math. Stat., Vol. 37, No. 3, p. 744, 1966. [2] B. A. Coomes, “On Conditions Sufﬁcient for Injectivity of Maps,” Institute for Mathematics and its Applications (USA), Preprint #544, 1989. [3] E. L. Lehmann and G. Casella, Theory of Point Estimation, Springer, 1998. [4] T. S. Ferguson, A Course in Large Sample Theory, Chapman and Hall, 1996. [5] A. DasGupta, Asymptotic Theory of Statistics and Probability, Springer, 2008. [6] R. Serﬂing, Approximation Theorems of Mathematical Statistics, Wiley, 1980. [7] H. Hartley, “Maximum Likelihood Estimation from Incomplete Data,” Biometrics, Vol. 14, pp. 174–194, 1958.
[8] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," (with discussion) J. Roy. Stat. Soc. B, Vol. 39, No. 1, pp. 1–38, 1977. [9] X.-L. Meng and D. van Dyk, "The EM Algorithm: An Old Folk Song Sung to a Fast New Tune," (with discussion) J. Roy. Stat. Soc. B, Vol. 59, No. 3, pp. 511–567, 1997. [10] G. McLachlan and T. Krishnan, The EM Algorithm and Extensions, 2nd edition, Wiley, 2008. [11] C. F. J. Wu, "On the Convergence Properties of the EM Algorithm," Ann. Stat., Vol. 11, No. 1, pp. 95–103, 1983. [12] J. A. Fessler and A. O. Hero, "Space-Alternating Generalized Expectation-Maximization Algorithm," IEEE Trans. Sig. Proc., Vol. 42, No. 10, pp. 2664–2677, 1994. [13] H. Robbins and S. Monro, "A Stochastic Approximation Method," Ann. Math. Stat., pp. 400–407, 1951. [14] D. Sakrison, "Efficient Recursive Estimation of the Parameters of a Radar or Radio Astronomy Target," IEEE Trans. Inf. Theory, Vol. 12, No. 1, pp. 35–41, 1966. [15] A. H. Sayed, Adaptive Filters, Wiley, 2011. [16] D. P. Bertsekas, Nonlinear Programming, Athena Scientific, 1999. [17] B. Efron and T. Hastie, Computer Age Statistical Inference, Cambridge University Press, 2016.
15
Signal Estimation
In this chapter, we consider the problem of estimating a discrete-time random signal from noisy observations of the signal. We begin by adopting the MMSE estimation setting that was introduced in Chapter 11, focusing more specifically on estimators with linear constraints (see Section 11.7.4), which allow for solutions to the estimation problem with fewer assumptions on the joint distribution of the observation and the state. In this context, we discuss the important notion of linear innovations, which is a key tool in the development of the discrete-time Kalman filter, where the signal is a sequence of states that evolves as a linear dynamical system driven by process noise, and the observation sequence is a linear, causal, and noisy version of the state. We then consider the more general problem of nonlinear filtering, where we drop the linearity assumptions on the state evolution and observation equations, and discuss some approaches to solving this problem. We end the chapter with a discussion of estimation in finite alphabet hidden Markov models (HMMs).

15.1 Linear Innovations
We first consider the computation of the LMMSE estimate described in Section 11.7.4 for the case where we wish to estimate the state X ∈ R^p from a sequence of observations Y_1, ..., Y_n ∈ R^q, where p and q are positive integers. If the observations are uncorrelated and all the random vectors are zero mean, then the LMMSE estimate of X can be obtained as a sum of the LMMSE estimates of X from the individual observations, as shown in the following result.

PROPOSITION 15.1 Let X, Y_1, ..., Y_n be zero-mean random vectors with finite second moments. If Cov(Y_k, Y_ℓ) = 0 for k ≠ ℓ, then the linear MMSE estimate of X, given Y_1, ..., Y_n, satisfies

Ê[X | Y_1, ..., Y_n] = ∑_{k=1}^n Ê[X | Y_k].   (15.1)
Proof  To prove this result it is enough to verify that the right-hand side of (15.1) satisfies the orthogonality principle condition for LMMSE estimation of X from Y_1, ..., Y_n, i.e., if
Z = X − ∑_{k=1}^n Ê[X | Y_k],
then we need to show that
E[Z] = 0,   (15.2)
Cov(Z, Y_k) = 0,  for k = 1, ..., n.   (15.3)
We first note that the right-hand side of (15.1) is linear in the observations. Next, because the observations are zero mean, we have that (see (11.15))
Ê[X | Y_k] = B_k Y_k
for some deterministic matrix B_k. Now, exploiting the fact that X is also zero mean,
E[Z] = E[X − ∑_{k=1}^n B_k Y_k] = 0,
and hence (15.2) holds. Furthermore, for each k ∈ {1, 2, ..., n},
Cov(Z, Y_k) = E[Z Y_k^T] = E[(X − ∑_{j=1}^n B_j Y_j) Y_k^T]
(a)= E[X Y_k^T] − B_k E[Y_k Y_k^T] = E[(X − B_k Y_k) Y_k^T] (b)= 0,
where (a) holds because the n vectors {Y_j}_{j=1}^n are uncorrelated, and (b) holds due to the orthogonality principle applied to the estimation of X using the vector Y_k. This verifies (15.3), thus proving the result.
If the observations Y_1, Y_2, ..., Y_n are not zero mean and pairwise uncorrelated as in the statement of Proposition 15.1, we can create an innovation sequence that has the properties of being zero mean and uncorrelated as follows:
Ỹ_1 = Y_1 − E[Y_1],
Ỹ_2 = Y_2 − Ê[Y_2 | Y_1],
⋮
Ỹ_k = Y_k − Ê[Y_k | Y_1, ..., Y_{k−1}].
There is a one-to-one mapping between Y_1, ..., Y_k and Ỹ_1, ..., Ỹ_k, for each k:
Y_1 = Ỹ_1 + E[Y_1],
Y_2 = Ỹ_2 + Ê[Y_2 | Y_1],   (15.4)
⋮
Y_k = Ỹ_k + Ê[Y_k | Y_1, ..., Y_{k−1}].
Note that in (15.4), Ê[Y_k | Y_1, ..., Y_{k−1}] is an affine function of Y_1, ..., Y_{k−1}. Also, any affine combination of Y_1, ..., Y_k is an affine combination of Ỹ_1, ..., Ỹ_k, for each k. Hence, using Proposition 15.1 we have
Ê[X | Y_1, ..., Y_n] = Ê[X | Ỹ_1, ..., Ỹ_n] = ∑_{k=1}^n Ê[X | Ỹ_k].   (15.5)
Using the same arguments, the innovation sequence can also be computed recursively:
Ỹ_1 = Y_1 − E[Y_1],
⋮
Ỹ_k = Y_k − Ê[Y_k | Y_1, ..., Y_{k−1}] = Y_k − Ê[Y_k | Ỹ_1, ..., Ỹ_{k−1}] = Y_k − ∑_{i=1}^{k−1} Ê[Y_k | Ỹ_i].
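For zero-mean observations with a known covariance matrix, the recursion above is exactly a Gram–Schmidt orthogonalization, and the innovations can be obtained from the unit-lower-triangular factor of the covariance. A numpy sketch (scalar observations stacked into a vector; all names and the test covariance are our own):

```python
import numpy as np

def innovations(y, C):
    """Map zero-mean observations y = (y_1,...,y_n) with covariance C to
    innovations via the unit-lower-triangular factor of C = L D L^T, so
    that ytilde = L^{-1} y is uncorrelated with diagonal covariance D."""
    Lchol = np.linalg.cholesky(C)        # C = Lchol @ Lchol.T
    d = np.diag(Lchol)
    L = Lchol / d                        # unit lower triangular factor
    return np.linalg.solve(L, y), L

rng = np.random.default_rng(2)
n = 4
M = rng.standard_normal((n, n))
C = M @ M.T + n * np.eye(n)              # a valid covariance matrix
samples = np.linalg.cholesky(C) @ rng.standard_normal((n, 100000))
ytilde, L = innovations(samples, C)
emp_cov = ytilde @ ytilde.T / samples.shape[1]
print(np.round(emp_cov, 2))              # approximately diagonal
```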
We note that if Y_1, ..., Y_n are jointly Gaussian, then Ỹ_1, ..., Ỹ_n are independent. Finally, if both the observations Y_1, Y_2, ..., Y_n and X are not zero mean, we have the following corollary to Proposition 15.1, which follows easily from (15.5) (see Exercise 15.1).

COROLLARY 15.1 Let X, Y_1, ..., Y_n be random vectors with possibly nonzero means and finite second moments. Then
Ê[X | Y_1, ..., Y_n] = E[X] + ∑_{k=1}^n Ê[(X − E[X]) | Ỹ_k]   (15.6)
= ∑_{k=1}^n Ê[X | Ỹ_k] − (n − 1) E[X].   (15.7)

15.2 Discrete-Time Kalman Filter
Consider a sequence of states {X_k} evolving as a linear dynamical system, where the observation sequence {Y_k} is a linear, causal, and noisy version of the state, i.e.,
State:  X_{k+1} = F_k X_k + U_k,   (15.8)
Observations:  Y_k = H_k X_k + V_k,  k = 0, 1, 2, ...,   (15.9)
where U_k and V_k are, respectively, the state noise (or process noise) and the measurement noise. Assume that for each k, X_k ∈ R^p and Y_k ∈ R^q, where p and q are positive integers.
Linear dynamical systems arise in a large variety of applications, including object tracking, navigation, and robotics. The general problem is to produce an optimal linear estimator of the state sequence given the measurements. This problem comes in two main flavors: (1) estimation of X_k given observations up to time k, and (2) one-step prediction of X_{k+1} given the same observations. These problems are closely related and can be solved efficiently and recursively, in such a way that the computational and storage complexity of the algorithm is independent of k. As an integral part of this computation, one obtains covariance matrices for the estimation error and the prediction error vectors at each time k. If the joint process (X_k, Y_k)_{k≥0} is Gaussian, the linear MMSE estimators are also (unconstrained) MMSE estimators.
We use the compact notation Y_0^k ≜ (Y_0, Y_1, ..., Y_k) to denote all observations until time k; the sequences U_0^k and V_0^k are defined similarly. We denote by X̄_k the mean of X_k. The assumptions about the dynamical system are as follows:
1. X_0, U_0, U_1, ..., V_0, V_1, ... are uncorrelated.
2. The following moments are known:
E[X_0] = X̄_0,  Cov(X_0) = Σ_0,  E[U_k] = 0,  E[V_k] = 0,  Cov(U_k) = Q_k,  Cov(V_k) = R_k,  k ≥ 0.
3. The sequence of matrices F_k, H_k, k ≥ 0 is known.
The objective is to solve for the following two quantities recursively in time:
Estimator:  X̂_{k|k} ≜ Ê[X_k | Y_0^k],   (15.10)
Predictor:  X̂_{k+1|k} ≜ Ê[X_{k+1} | Y_0^k].   (15.11)
The covariance matrix of the estimation error is given by
Σ_{k|k} ≜ Cov(X_k − X̂_{k|k})   (15.12)
and the covariance matrix of the prediction error by
Σ_{k+1|k} ≜ Cov(X_{k+1} − X̂_{k+1|k}).   (15.13)
The recursive algorithm is comprised of a measurement update, where the new measurement at time k is used to update the state estimate, and a time update, where X̂_{k+1|k} is obtained in terms of X̂_{k|k}.

Measurement Update  We express X̂_{k|k} in terms of X̂_{k|k−1} and the new measurement Y_k. We denote by
Ỹ_k = Y_k − Ŷ_{k|k−1}   (15.14)
the innovation at time k, which by construction is orthogonal to Ỹ_0^{k−1}. Hence
X̂_{k|k} = Ê[X_k | Y_0^k] = Ê[X_k | Ỹ_0^k] = Ê[X_k | Ỹ_0^{k−1}, Ỹ_k]
(a)= X̄_k + Ê[X_k − X̄_k | Ỹ_0^{k−1}] + Ê[X_k − X̄_k | Ỹ_k]
= X̂_{k|k−1} + Ê[X_k − X̄_k | Ỹ_k]
(b)= X̂_{k|k−1} + Cov(X_k, Ỹ_k) Cov(Ỹ_k)^{−1} Ỹ_k,   (15.15)
where equality (a) follows from Corollary 15.1, and (b) from (11.15). To evaluate (15.15), we derive expressions for Ỹ_k, Cov(X_k, Ỹ_k), and Cov(Ỹ_k) and substitute them into (15.15). We first apply the linear operator Ê[· | Y_0^{k−1}] to the measurement equation (15.9) to obtain
Ŷ_{k|k−1} = Ê[Y_k | Y_0^{k−1}] = Ê[H_k X_k + V_k | Y_0^{k−1}] = H_k X̂_{k|k−1}.
Hence the innovation (15.14) can be expressed as
Ỹ_k = Y_k − Ŷ_{k|k−1} = Y_k − H_k X̂_{k|k−1} = H_k (X_k − X̂_{k|k−1}) + V_k.   (15.16)
From (15.16) we obtain Cov(X_k, Ỹ_k) and Cov(Ỹ_k) as follows. First,
Cov(X_k, Ỹ_k) (a)= Cov(X_k, H_k (X_k − X̂_{k|k−1}) + V_k)
(b)= Cov(X_k, X_k − X̂_{k|k−1}) H_k^T
(c)= Cov(X_k − X̂_{k|k−1}, X_k − X̂_{k|k−1}) H_k^T = Σ_{k|k−1} H_k^T,   (15.17)
where (a) follows from (15.16), (b) holds because Cov(X_k, V_k) = 0, and likewise (c) holds because X̂_{k|k−1} is orthogonal to X_k − X̂_{k|k−1} by the orthogonality principle. Next,
Cov(Ỹ_k) = Cov(H_k (X_k − X̂_{k|k−1}) + V_k) = H_k Σ_{k|k−1} H_k^T + R_k,   (15.18)
where the last equality holds because V_k is orthogonal to X_k − X̂_{k|k−1}, which is a linear function of (X_0, U_0^{k−1}, V_0^{k−1}). Substituting the covariance matrices (15.17) and (15.18) into (15.15), we obtain
X̂_{k|k} = X̂_{k|k−1} + K_k Ỹ_k = X̂_{k|k−1} + K_k (Y_k − H_k X̂_{k|k−1}),
where we have defined the Kalman gain matrix
K_k ≜ Cov(X_k, Ỹ_k) Cov(Ỹ_k)^{−1} = Σ_{k|k−1} H_k^T (H_k Σ_{k|k−1} H_k^T + R_k)^{−1}.   (15.19)
Time Update  We apply the operator Ê[· | Y_0^k] to the state equation (15.8). Since Y_0^k is a function of (X_0, U_0^{k−1}, V_0^k), which are uncorrelated with U_k, we have Ê[U_k | Y_0^k] = 0, hence the predicted state estimate is
X̂_{k+1|k} = Ê[X_{k+1} | Y_0^k] = Ê[F_k X_k + U_k | Y_0^k] = F_k Ê[X_k | Y_0^k] + Ê[U_k | Y_0^k] = F_k X̂_{k|k}.   (15.20)
Thus, the Kalman filter updates are given by
X̂_{k|k} = X̂_{k|k−1} + K_k (Y_k − H_k X̂_{k|k−1}),   (15.21)
X̂_{k+1|k} = F_k X̂_{k|k},   (15.22)
and the filter is initialized with X̂_{0|−1} = X̄_0. These equations can be combined to form a single equation for the predictor update,
X̂_{k+1|k} = F_k X̂_{k|k−1} + F_k K_k (Y_k − H_k X̂_{k|k−1}),  X̂_{0|−1} = X̄_0,   (15.23)
and a single equation for the estimator update,
X̂_{k|k} = F_{k−1} X̂_{k−1|k−1} + K_k (Y_k − H_k F_{k−1} X̂_{k−1|k−1}),   (15.24)
with X̂_{0|0} computed from (15.21) with X̂_{0|−1} = X̄_0.
Kalman Gain Calculation  To evaluate these recursions efficiently we need an efficient way to update the Kalman gain K_k in (15.19). This is achieved by obtaining a recursion for Σ_{k|k−1} as follows. From (15.15), the estimation and prediction errors are related by
X_k − X̂_{k|k} = X_k − X̂_{k|k−1} − Cov(X_k, Ỹ_k) Cov(Ỹ_k)^{−1} Ỹ_k.
Taking covariances, we obtain the following relation between the estimation and prediction error covariance matrices:
Σ_{k|k} = Σ_{k|k−1} − Cov(X_k, Ỹ_k) Cov(Ỹ_k)^{−1} Cov(Ỹ_k, X_k).   (15.25)
Substituting (15.17) and (15.18) into (15.25), we obtain
Σ_{k|k} = Σ_{k|k−1} − Σ_{k|k−1} H_k^T (H_k Σ_{k|k−1} H_k^T + R_k)^{−1} H_k Σ_{k|k−1}.   (15.26)
The time update for the error covariance matrix is given by
Σ_{k+1|k} = Cov(X_{k+1} − X̂_{k+1|k}) (a)= Cov(F_k (X_k − X̂_{k|k}) + U_k) (b)= Cov(F_k (X_k − X̂_{k|k})) + Cov(U_k) = F_k Σ_{k|k} F_k^T + Q_k,   (15.27)
where (a) follows from (15.8) and (15.20), and (b) holds because U_k is orthogonal to X_k − X̂_{k|k}, which is a linear function of (X_0, U_0^{k−1}, V_0^k). Substituting (15.26) into (15.27) yields the desired covariance update
Σ_{k+1|k} = F_k (Σ_{k|k−1} − Σ_{k|k−1} H_k^T (H_k Σ_{k|k−1} H_k^T + R_k)^{−1} H_k Σ_{k|k−1}) F_k^T + Q_k,   (15.28)
which is called the Riccati equation, and is initialized with Σ_{0|−1} = Σ_0.
The Kalman ﬁlter equations can be summarized as follows:
Estimator Update:  X̂_{k|k} = F_{k−1} X̂_{k−1|k−1} + K_k (Y_k − H_k F_{k−1} X̂_{k−1|k−1})
Predictor Update:  X̂_{k+1|k} = F_k X̂_{k|k−1} + F_k K_k (Y_k − H_k X̂_{k|k−1})
Kalman Gain:  K_k = Σ_{k|k−1} H_k^T (H_k Σ_{k|k−1} H_k^T + R_k)^{−1}
Covariance Update:  Σ_{k+1|k} = F_k (Σ_{k|k−1} − Σ_{k|k−1} H_k^T (H_k Σ_{k|k−1} H_k^T + R_k)^{−1} H_k Σ_{k|k−1}) F_k^T + Q_k

REMARK 15.1
We make the following observations regarding the Kalman filtering equations:
1. The covariance updates are data-independent and can be performed offline to produce the Kalman gain matrices K_k for all k, which can then be used in the filtering equations.
2. The Kalman filter updates have a recursive structure, which means that there is no need to store all the observations.
3. The Kalman gain matrix optimally weighs the innovation to update the state estimate.
4. A variant of the estimation and prediction problems considered here is the fixed-lag smoothing problem, where an estimate X̂_{k−ℓ|k} of the state at time k − ℓ is desired given observations till time k. Here ℓ ≥ 1 is the lag.
5. The performance of the Kalman filter may be unsatisfactory for heavily non-Gaussian noise, in which case the optimal MMSE estimator could be heavily nonlinear.

Example 15.1  A vehicle is moving along a straight road. Samples of the trajectory g(t) at multiples of a sampling period τ are denoted by G_k = g(kτ), k = 0, 1, 2, .... The velocity and acceleration of the vehicle are the first and second derivatives of g(t), and their samples are denoted by S_k = g′(kτ) and g″(kτ), k = 0, 1, 2, .... Define the state X_k as the two-dimensional vector [G_k, S_k]^T. Newtonian dynamics suggest the following state model:
X_{k+1} = [[1, τ], [0, 1]] X_k + [τ²/2, τ]^T A_k,
where, by Taylor's remainder theorem, A_k is an approximation to g″(kτ). We model the A_k as zero-mean uncorrelated random variables with common variance σ_A². We further assume noisy measurements of the vehicle's position (obtained, e.g., using a radar). This motivates the observation model
Y_k = [1, 0] X_k + V_k,  k = 0, 1, 2, ...,
where the V_k are zero-mean uncorrelated random variables with common variance σ_V².
15.2.1 Time-Invariant Case
The case of a system whose dynamics do not change over time, and in which the second-order statistics of the state and measurement noises do not change over time, is of interest in many applications. Specifically,
State:  X_{k+1} = F X_k + U_k,   (15.29)
Observations:  Y_k = H X_k + V_k,  k = 0, 1, 2, ...,
where Cov(U_k) = Q and Cov(V_k) = R for all k.
In this case, the estimator and predictor equations simplify to
X̂_{k|k} = F X̂_{k−1|k−1} + K_k (Y_k − H F X̂_{k−1|k−1})   (15.30)
and
X̂_{k+1|k} = F X̂_{k|k−1} + F K_k (Y_k − H X̂_{k|k−1}).   (15.31)
Furthermore, the various covariance matrices converge to a limit value, assuming the system is stable. For instance, taking the covariance matrix of both sides of (15.8), we have
Cov(X_{k+1}) = F Cov(X_k) F^T + Q,  for k ≥ 0.
If this recursion converges, the limit C ≜ lim_{k→∞} Cov(X_k) satisfies the so-called discrete Lyapunov equation
C = F C F^T + Q,
which can be expressed as a linear system. Convergence is guaranteed if the singular values of F are all less than 1 in magnitude (see, e.g., [1]). In this case, the prediction error covariance matrix Σ_{k+1|k} converges to a limit Σ_∞ as k → ∞, where Σ_∞ satisfies the steady-state Riccati equation (see (15.28)):
Σ_∞ = F (Σ_∞ − Σ_∞ H^T (H Σ_∞ H^T + R)^{−1} H Σ_∞) F^T + Q   (15.32)
and the Kalman gain K_k (see (15.19)) converges to a limit K_∞ as k → ∞, where
K_∞ = Σ_∞ H^T (H Σ_∞ H^T + R)^{−1}.   (15.33)
Furthermore, it can be shown that the singular values of F(I − K_∞ H) are all less than 1 in magnitude (see, e.g., [1]), so that the linear systems defining the estimator and predictor updates in (15.30) and (15.31) are asymptotically stable.
Example 15.2  Steady-State Kalman Filter – Scalar Case. The equations for the steady-state Kalman filter are easiest to describe and solve in the scalar case, for which we replace the matrices F, H, Q, and R by the corresponding scalar variables f, h, q, and r, and the system model becomes
State:  X_{k+1} = f X_k + U_k,
Observations:  Y_k = h X_k + V_k,  k = 0, 1, 2, ...,
with U_k and V_k being zero-mean random variables with variances q and r, respectively, for all k. The estimator and predictor updates are scalar versions of (15.30) and (15.31). In particular, the estimator update can be written as
X̂_{k|k} = f(1 − hκ_k) X̂_{k−1|k−1} + κ_k Y_k,   (15.34)
where κ_k is the (scalar) Kalman gain given by
κ_k = σ²_{k|k−1} h (h² σ²_{k|k−1} + r)^{−1} = σ²_{k|k−1} h / (h² σ²_{k|k−1} + r).   (15.35)
The update of the prediction-error variance (which is the scalar version of the Riccati equation in (15.28)) is given by
σ²_{k+1|k} = f² σ²_{k|k−1} (1 − h² σ²_{k|k−1} / (h² σ²_{k|k−1} + r)) + q = f² r σ²_{k|k−1} / (h² σ²_{k|k−1} + r) + q.   (15.36)
Suppose this sequence of variances converges to a limit σ_∞² as k → ∞; then σ_∞² satisfies the equation
σ_∞² = f² r σ_∞² / (h² σ_∞² + r) + q,   (15.37)
which can be written as the following quadratic equation in σ² (writing σ² for σ_∞²):
h² σ⁴ + (r(1 − f²) − q h²) σ² − q r = 0.   (15.38)
There is only one positive solution to this quadratic equation, which is given by
σ_∞² = [q h² − r(1 − f²) + √((r(1 − f²) − q h²)² + 4 h² q r)] / (2 h²).   (15.39)
In order to prove that the sequence of variances σ²_{k+1|k} indeed converges to a limit as k → ∞ under some conditions, we first see from (15.36) and (15.37) that
|σ²_{k+1|k} − σ_∞²| = f² |r σ²_{k|k−1}/(h² σ²_{k|k−1} + r) − r σ_∞²/(h² σ_∞² + r)|
= f² r² |σ²_{k|k−1} − σ_∞²| / ((h² σ²_{k|k−1} + r)(h² σ_∞² + r))
≤ f² |σ²_{k|k−1} − σ_∞²|.
Therefore, if |f| < 1, σ²_{k+1|k} converges to σ_∞² given in (15.39).
With σ²_{k+1|k} converging to σ_∞², the Kalman gain of (15.35) converges to
κ_∞ = σ_∞² h / (h² σ_∞² + r).
For very noisy observations (r ≫ h² σ_∞²), we have κ_∞ ≈ 0, in which case the measurement update is X̂_{k+1|k} ≈ f X̂_{k|k−1}, and the current observation receives very little weight. For very clean observations (r ≪ h² σ_∞²), we have κ_∞ ≈ 1/h, in which case the measurement update is X̂_{k+1|k} ≈ (f/h) Y_k: the past is ignored, and only the current observation matters.
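The fixed point (15.37) is easy to check numerically: iterating the scalar Riccati recursion (15.36) should land on the closed-form root. A small sketch (the parameter values f, h, q, r are our own choices):

```python
def steady_state_var(f, h, q, r, iters=10000):
    """Iterate the scalar Riccati recursion (15.36) to its fixed point."""
    s = 1.0
    for _ in range(iters):
        s = f**2 * r * s / (h**2 * s + r) + q
    return s

def steady_state_var_closed(f, h, q, r):
    """Positive root of the fixed-point quadratic obtained from (15.37)."""
    b = r * (1 - f**2) - q * h**2
    return (-b + (b**2 + 4 * h**2 * q * r) ** 0.5) / (2 * h**2)

f, h, q, r = 0.9, 1.0, 0.1, 1.0
print(steady_state_var(f, h, q, r), steady_state_var_closed(f, h, q, r))
```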
15.3 Extended Kalman Filter
In developing the Kalman filter we assumed that the sequence of states {X_k} evolves as a linear dynamical system with additive noise, and the observation sequence {Y_k} is a linear, causal version of the state with additive noise. Interestingly, the Kalman filtering approach can be used to derive a good approximation to the optimal estimator and predictor even when the state evolution and the observation equations are nonlinear. In particular, consider the following system model:
State:  X_{k+1} = f_k(X_k) + U_k,   (15.40)
Observations:  Y_k = h_k(X_k) + V_k,  k = 0, 1, 2, ...,   (15.41)
where f_k : R^p → R^p and h_k : R^p → R^q are nonlinear mappings, for some positive integers p, q, and the state noise {U_k} and the measurement noise {V_k} have the same second-order statistics as in the Kalman filter described in Section 15.2.
The key idea behind the extended Kalman filter (EKF) is to linearize f_k and h_k around the current estimate X̂_{k|k} and predictor X̂_{k|k−1} of the state at time k, respectively, assuming that the estimation errors (X_k − X̂_{k|k}) and (X_k − X̂_{k|k−1}) are small. In particular, using a Taylor series approximation, we can write
f_k(X_k) ≈ f_k(X̂_{k|k}) + F_k (X_k − X̂_{k|k}),
h_k(X_k) ≈ h_k(X̂_{k|k−1}) + H_k (X_k − X̂_{k|k−1}),
where
F_k = ∂f_k(x)/∂x |_{x = X̂_{k|k}}  and  H_k = ∂h_k(x)/∂x |_{x = X̂_{k|k−1}}.
A linear approximation to the system model given in (15.40) and (15.41) can then be written as
X_{k+1} = F_k X_k + U_k + f_k(X̂_{k|k}) − F_k X̂_{k|k},   (15.42)
Y_k = H_k X_k + V_k + h_k(X̂_{k|k−1}) − H_k X̂_{k|k−1},  k = 0, 1, 2, ....   (15.43)
The EKF equations can then be derived by mimicking the corresponding (linear) Kalman filter updates in (15.21) and (15.22) to get
X̂_{k|k} = X̂_{k|k−1} + K_k (Y_k − h_k(X̂_{k|k−1})),   (15.44)
X̂_{k+1|k} = f_k(X̂_{k|k}).   (15.45)
The Kalman gain and covariance updates for the EKF are identical to those for the Kalman filter, i.e., the terms that depend on the estimates X̂_{k|k} and X̂_{k|k−1} in (15.42) and (15.43) are ignored in developing these updates:
K_k = Σ_{k|k−1} H_k^T (H_k Σ_{k|k−1} H_k^T + R_k)^{−1},   (15.46)
Σ_{k+1|k} = F_k (Σ_{k|k−1} − Σ_{k|k−1} H_k^T (H_k Σ_{k|k−1} H_k^T + R_k)^{−1} H_k Σ_{k|k−1}) F_k^T + Q_k.   (15.47)
While the covariance and Kalman gain update equations for the Kalman filter and the EKF are identical in form, one major difference between the updates for the two filters is that in the Kalman filter the updates can be performed offline before collecting the observations, whereas in the EKF these updates have to be performed online, since F_k and H_k are functions of the estimates X̂_{k|k} and X̂_{k|k−1}.
Also, unlike the Kalman filter, which is optimal for the linear dynamical system model, the EKF is in general not optimal for the nonlinear system model described in (15.40) and (15.41). Nevertheless, the EKF does provide reasonable performance in many applications and has become the de facto standard in navigation systems and global positioning systems [2].
Example 15.3  Tracking the Phase of a Sinusoid. The state in this example is the (random) phase of a sinusoid, and the observation is a noisy version of the sinusoid. The system model is as follows:
State:  X_{k+1} = X_k + U_k,
Observations:  Y_k = sin(ω_0 k + X_k) + V_k,  k = 0, 1, 2, ...,
where ω_0 is the angular frequency of the sinusoid, Var(U_k) = q, and Var(V_k) = r for all k. Referring to (15.40) and (15.41), we see that f(x) = x and h(x) = sin(ω_0 k + x) for this example, with
F_k = ∂f(x)/∂x |_{x = X̂_{k|k}} = 1,  H_k = ∂h(x)/∂x |_{x = X̂_{k|k−1}} = cos(ω_0 k + X̂_{k|k−1}).
This leads to the following equations for the EKF:
X̂_{k+1|k} = X̂_{k|k} = X̂_{k|k−1} + κ_k (Y_k − sin(ω_0 k + X̂_{k|k−1})),
κ_k = σ²_{k|k−1} cos(ω_0 k + X̂_{k|k−1}) / (cos²(ω_0 k + X̂_{k|k−1}) σ²_{k|k−1} + r),
σ²_{k+1|k} = r σ²_{k|k−1} / (cos²(ω_0 k + X̂_{k|k−1}) σ²_{k|k−1} + r) + q.
Nonlinear Filtering for General Hidden Markov Models
369
Figure 15.1  Simulation of the EKF for tracking the phase of a sinusoid, with parameters r = q = 0.1 and ω_0 = 1. [The plot shows phase versus time (0 to 500) for the true state X_k and the predictor X̂_{k|k−1}.]
In Figure 15.1, we simulate the performance of the EKF for speciﬁc values of the parameters r, q , and ω0 , with the state and measurement noise being Gaussian. We can see that the EKF tracks the true phase reasonably well, but the ﬁner variations in the phase are “smoothed out.” See also Exercise 15.5, where some variants of this simulation example are explored.
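The recursions of Example 15.3 can be sketched as follows; the simulation setup mirrors Figure 15.1, though seeds and initial conditions are our own choices, and note that phase tracking can occasionally slip by a multiple of 2π, which this sketch does not guard against:

```python
import numpy as np

def ekf_phase(y, omega0, q, r, x0=0.0, s0=1.0):
    """EKF of Example 15.3: track the phase of sin(omega0*k + X_k) from
    noisy samples, using the scalar gain and variance recursions above."""
    x, s = x0, s0                      # predictor X_hat_{k|k-1} and variance
    xs = []
    for k, yk in enumerate(y):
        h = np.cos(omega0 * k + x)                 # linearized H_k
        kappa = s * h / (h**2 * s + r)             # scalar Kalman gain
        x = x + kappa * (yk - np.sin(omega0 * k + x))
        s = r * s / (h**2 * s + r) + q             # variance update
        xs.append(x)
    return np.array(xs)

rng = np.random.default_rng(4)
omega0, q, r, n = 1.0, 0.1, 0.1, 500
X = np.cumsum(rng.normal(scale=np.sqrt(q), size=n))    # random-walk phase
Y = np.sin(omega0 * np.arange(n) + X) + rng.normal(scale=np.sqrt(r), size=n)
Xhat = ekf_phase(Y, omega0, q, r)
print(np.mean((X - Xhat)**2))          # tracking MSE
```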
15.4 Nonlinear Filtering for General Hidden Markov Models
The extended Kalman filtering problem discussed in the previous section is a special case of the more general nonlinear filtering problem, which can be described as follows:
State:  X_{k+1} = f_k(X_k, U_k),   (15.48)
Observations:  Y_k = h_k(X_k, V_k),  k = 0, 1, 2, ...,   (15.49)
where the noise sequences {U_k, k = 0, 1, ...} and {V_k, k = 0, 1, ...} satisfy the following Markov property with respect to the state sequence: U_k, conditioned on X_k, is independent of U_0^{k−1} and X_0^{k−1}; likewise, V_k, conditioned on X_k, is independent of V_0^{k−1} and X_0^{k−1}. A consequence of this Markov property is that the state evolves as a Markov process, with X_{k+1}, conditioned on X_k, being independent of X_0^{k−1}. Furthermore, the distribution of X_{k+1} conditioned on X_k, denoted by p(x_{k+1} | x_k), is determined by the distribution of U_k conditioned on X_k; and the distribution of Y_k conditioned on X_k, denoted by p(y_k | x_k), is determined by the distribution of V_k conditioned on X_k. This model is also referred to as a hidden Markov model (HMM), and is illustrated in Figure 15.2.
The general goal in nonlinear filtering is to find at each time k the posterior distribution of the state X_k conditioned on the observations Y_0^k (estimation), which we denote
Figure 15.2  Hidden Markov model (HMM). [Graphical model: state chain ... → x_{k−1} → x_k → x_{k+1} → ..., with each state x_k emitting an observation y_k.]
by p(x_k | y_0^k), or the posterior distribution of the state X_{k+1} conditioned on the observations Y_0^k (prediction), which we denote by p(x_{k+1} | y_0^k). In practice, we may be interested in the lesser goal of determining the posterior means E[X_k | Y_0^k] and E[X_{k+1} | Y_0^k] of these distributions, which would correspond to MMSE estimation, or determining their maximizers, which would correspond to MAP estimation.
Similar to the Kalman filter, the construction of the nonlinear filter (NLF) update consists of two steps. The first is a measurement update that takes us from p(x_k | y_0^{k−1}) to p(x_k | y_0^k), and the second is a time update that takes us from p(x_k | y_0^k) to p(x_{k+1} | y_0^k). The measurement update essentially follows from Bayes' rule:
p(x_k | y_0^k) = p(y_0^k | x_k) p(x_k) / p(y_0^k)
= p(y_k, y_0^{k−1} | x_k) p(x_k) / p(y_k, y_0^{k−1})
= p(y_k | y_0^{k−1}, x_k) p(y_0^{k−1} | x_k) p(x_k) / (p(y_k | y_0^{k−1}) p(y_0^{k−1}))
= p(y_k | y_0^{k−1}, x_k) p(x_k | y_0^{k−1}) / p(y_k | y_0^{k−1})
= p(y_k | x_k) p(x_k | y_0^{k−1}) / p(y_k | y_0^{k−1}),   (15.50)
where the last equality uses the HMM property p(y_k | y_0^{k−1}, x_k) = p(y_k | x_k). From the HMM property of the system, the denominator in (15.50) can be written as
p(y_k | y_0^{k−1}) = ∫ p(y_k, x_k | y_0^{k−1}) dν(x_k)
= ∫ p(y_k | x_k, y_0^{k−1}) p(x_k | y_0^{k−1}) dν(x_k)
= ∫ p(y_k | x_k) p(x_k | y_0^{k−1}) dν(x_k).   (15.51)
The time update follows from the Markov property of the state evolution. In particular,
p(x_{k+1} | y_0^k) = ∫ p(x_{k+1}, x_k | y_0^k) dν(x_k)
= ∫ p(x_{k+1} | x_k, y_0^k) p(x_k | y_0^k) dν(x_k)
= ∫ p(x_{k+1} | x_k) p(x_k | y_0^k) dν(x_k).   (15.52)
In general the NLF update equations (15.50)–(15.52) are difﬁcult to calculate in closed form, and ﬁnding good approximations to these updates has been the subject of intense research over the last few decades – a popular approach is the particle ﬁlter [3]. An exception is the (linear) Kalman ﬁltering problem with the random variables being jointly Gaussian. In the following we provide an example of a nonlinear ﬁltering problem for which it is easy to compute the update equations. In Section 15.5, we study the special case of ﬁnitestate HMMs in more detail, for which the NLF update equations (15.50)–(15.52) can be computed in closed form.
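A minimal bootstrap particle filter in the spirit of [3] approximates (15.50)–(15.52) with a resampled particle cloud. The sketch below (model, names, and parameter values are our own) is run on a scalar linear-Gaussian model, precisely so that its output can be sanity-checked against the exact Kalman filter:

```python
import numpy as np

def bootstrap_pf(y, f_sample, lik, n_part, x0_sample, rng):
    """Bootstrap particle filter: particles approximate p(x_k | y_0^{k-1});
    reweighting by p(y_k | x_k) is the measurement update (15.50), and
    propagating through the state equation is the time update (15.52)."""
    x = x0_sample(n_part)                 # particles drawn from the prior
    means = []
    for yk in y:
        w = lik(yk, x)                    # unnormalized weights p(y_k | x_k)
        w = w / w.sum()
        means.append(np.sum(w * x))       # approximate E[X_k | y_0^k]
        idx = rng.choice(n_part, size=n_part, p=w)   # resample
        x = f_sample(x[idx])              # propagate to the next time step
    return np.array(means)

rng = np.random.default_rng(5)
f, q, r, n = 0.9, 0.5, 0.5, 200
X = np.zeros(n)
xk = 0.0
for k in range(n):                        # simulate the scalar state
    X[k] = xk
    xk = f * xk + rng.normal(scale=np.sqrt(q))
Y = X + rng.normal(scale=np.sqrt(r), size=n)
est = bootstrap_pf(
    Y,
    f_sample=lambda x: f * x + rng.normal(scale=np.sqrt(q), size=x.shape),
    lik=lambda yk, x: np.exp(-(yk - x)**2 / (2 * r)),
    n_part=2000,
    x0_sample=lambda m: rng.normal(scale=1.0, size=m),
    rng=rng,
)
print(np.mean((est - X)**2))   # should be well below the raw noise variance r
```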
Bayesian Quickest Change Detection (QCD). The Bayesian QCD problem introduced in Section 9.2.2 can be considered to be a nonlinear ﬁltering problem with a binary valued state, and 0 denoting the prechange state and 1 denoting the postchange state. The state (evolution) is completely determined by the changepoint: Example 15.4
X k +1 = X k 1{³ > k + 1} + 1 {³ where the change point 0 < ρ < 1,
≤ k + 1},
X0
= 0,
(15.53)
³ is a geometric random variable with parameter ρ , i.e., for
πλ = P{³ = λ} = ρ(1 − ρ)λ−1, λ ≥ 1.
The observation equation can be described as
Y_k = V_k^{(0)} 1{X_k = 0} + V_k^{(1)} 1{X_k = 1},  (15.54)

where {V_k^{(j)}, k = 1, 2, . . .} are i.i.d. with pdf p_j, j = 0, 1. Note that there are no observations at time 0. The NLF equations involve computing recursions for

G_k ≜ P(X_k = 1 | Y_1^k)  and  G̃_{k+1} ≜ P(X_{k+1} = 1 | Y_1^k),

since the posterior distributions are completely characterized by these two probabilities, the state being binary. In this case the NLF update equations (15.50)–(15.52) can be computed straightforwardly as
G_k = G̃_k L(Y_k) / ( G̃_k L(Y_k) + (1 − G̃_k) )

and

G̃_{k+1} = G_k + (1 − G_k) ρ,

with G_0 = 0 and L(y) = p_1(y) / p_0(y) being the likelihood ratio. See also Exercise 9.8 in Chapter 9.
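The two recursions above are straightforward to implement. The sketch below is ours, not the book's; it assumes, purely for illustration, N(0, 1) pre-change and N(1, 1) post-change observations, so that L(y) = exp(y − 1/2):

```python
import numpy as np

def qcd_filter(y, rho, lik_ratio):
    """Compute G_k = P(X_k = 1 | Y_1^k) recursively for k = 1, 2, ...."""
    G = 0.0                                  # G_0 = 0: the system starts pre-change
    post = []
    for yk in y:
        G_pred = G + (1.0 - G) * rho         # time update: G~_{k+1} = G_k + (1 - G_k) rho
        L = lik_ratio(yk)                    # likelihood ratio L(y) = p1(y) / p0(y)
        G = G_pred * L / (G_pred * L + (1.0 - G_pred))   # measurement update
        post.append(G)
    return np.array(post)

# Illustrative data: change point at k = 51 (our choice of distributions)
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(1.0, 1.0, 50)])
G = qcd_filter(y, rho=0.01, lik_ratio=lambda v: np.exp(v - 0.5))
```

After the change point, the posterior probability G_k climbs rapidly toward 1, which is the basis of the Bayesian QCD stopping rules of Chapter 9.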
Signal Estimation
15.5
Estimation in Finite Alphabet Hidden Markov Models
Consider the HMM described in Figure 15.2, where both the state and the observation variables take on values in finite sets (alphabets) X and Y, respectively, and where the system dynamics do not change over time. The state sequence {X_k}_{k≥0} is a Markov chain and the observed sequence {Y_k}_{k≥0} follows an HMM. Due to the finite alphabet and time-invariance assumptions, we have the following:

1. The probability distribution p(x_0) on the initial state X_0 can be represented by a vector, which we denote by π.
2. The distribution of X_{k+1}, conditioned on X_k, i.e., p(x_{k+1} | x_k), is completely described by the transition probability matrix of the Markov chain:

   A(x, x′) = P(X_{k+1} = x′ | X_k = x),  x, x′ ∈ X.  (15.55)

3. The distribution of Y_k, conditioned on X_k, i.e., p(y_k | x_k), is completely described by the emission probability matrix:

   B(x, y) = P(Y_k = y | X_k = x),  x ∈ X, y ∈ Y.  (15.56)
A graphical representation of the finite alphabet time-invariant HMM is given in Figure 15.3. For this HMM, the NLF update equations (15.50)–(15.52) can be computed in closed form as follows. Starting with p(x_0 | y_0^{−1}) = π(x_0), the measurement update is given by
p(x_k | y_0^k) = B(x_k, y_k) p(x_k | y_0^{k−1}) / p(y_k | y_0^{k−1}),  (15.57)

with

p(y_k | y_0^{k−1}) = Σ_{x_k ∈ X} B(x_k, y_k) p(x_k | y_0^{k−1}).  (15.58)

The time update is given by

p(x_{k+1} | y_0^k) = Σ_{x_k ∈ X} A(x_k, x_{k+1}) p(x_k | y_0^k).  (15.59)
Figure 15.3  Finite alphabet time-invariant HMM: the states x_0 → x_1 → x_2 → ⋯ → x_n evolve according to A (with initial distribution π), and each x_k emits y_k according to B.

Using (15.57)–(15.59), we may compute MMSE filter and predictor estimates:

x̂^MMSE_{k|k} = E[X_k | Y_0^k = y_0^k] = Σ_{x_k ∈ X} x_k p(x_k | y_0^k),
x̂^MMSE_{k+1|k} = E[X_{k+1} | Y_0^k = y_0^k] = Σ_{x_{k+1} ∈ X} x_{k+1} p(x_{k+1} | y_0^k).  (15.60)
We may also compute MAP filter and predictor estimates:

x̂^MAP_{k|k} = arg max_{x_k ∈ X} p(x_k | y_0^k),
x̂^MAP_{k+1|k} = arg max_{x_{k+1} ∈ X} p(x_{k+1} | y_0^k).  (15.61)
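The closed-form updates (15.57)–(15.59) translate directly into code. The following minimal NumPy sketch is ours (not from the text); it computes the filtered posteriors, from which the MMSE and MAP estimates of (15.60)–(15.61) follow immediately:

```python
import numpy as np

def hmm_filter(pi, A, B, y):
    """Filtered posteriors p(x_k | y_0^k) for a finite-alphabet HMM.
    pi: initial distribution; A: transition matrix; B: emission matrix;
    y: observed symbols as integer indices."""
    pred = pi.copy()                  # p(x_0 | y_0^{-1}) = pi(x_0)
    post = []
    for yk in y:
        num = B[:, yk] * pred         # B(x_k, y_k) p(x_k | y_0^{k-1})
        post.append(num / num.sum())  # measurement update (15.57)-(15.58)
        pred = post[-1] @ A           # time update (15.59)
    return np.array(post)

# Two-state example (our numbers, for illustration only)
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.8, 0.2], [0.3, 0.7]])
post = hmm_filter(pi, A, B, [0, 1, 0])
map_est = post.argmax(axis=1)         # MAP filter estimates, (15.61)
mmse_est = post @ np.arange(2)        # MMSE filter estimates, (15.60)
```

Each row of `post` sums to one; the normalization in the measurement update is exactly the division by p(y_k | y_0^{k−1}) in (15.57).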
In addition to the filtering and prediction problems, which can be considered online (causal) estimation problems, there are three other offline estimation problems that are of great interest in applications of HMMs, particularly in error control coding and speech signal processing. These are:

1. MAP estimation of the entire state sequence X_0^n from the entire observation sequence Y_0^n, which is solved by the Viterbi algorithm (see Section 15.5.1).
2. Determining the conditional distribution of X_k, for k ∈ {0, 1, . . . , n}, given the entire observation sequence Y_0^n, which is solved by the forward-backward algorithm (see Section 15.5.2).
3. Estimation of the HMM parameters π, A, and B from the entire observation sequence Y_0^n, which is solved by the Baum–Welch algorithm (see Section 15.5.3).

15.5.1  Viterbi Algorithm
The MAP estimation of X_0^n from Y_0^n arises in a variety of applications, and Viterbi derived a remarkably practical algorithm for solving it exactly [4]. The probability of the state sequence x_0^n ∈ X^{n+1} is given by

p(x_0^n) = π(x_0) ∏_{k=0}^{n−1} A(x_k, x_{k+1}),  (15.62)
and the conditional probability of the observed sequence y_0^n given the state sequence x_0^n is p(y_0^n | x_0^n) = ∏_{k=0}^{n} B(x_k, y_k). Hence the joint probability of x_0^n and y_0^n is

p(x_0^n, y_0^n) = p(x_0^n) p(y_0^n | x_0^n) = π(x_0) ∏_{k=0}^{n−1} A(x_k, x_{k+1}) ∏_{k=0}^{n} B(x_k, y_k).  (15.63)
The MAP estimate of interest can be computed as

x̂_0^{n,MAP} = arg max_{x_0^n ∈ X^{n+1}} p(x_0^n | y_0^n)
            = arg max_{x_0^n ∈ X^{n+1}} p(x_0^n, y_0^n)
            = arg max_{x_0^n ∈ X^{n+1}} [ ln π(x_0) + Σ_{k=0}^{n−1} ln A(x_k, x_{k+1}) + Σ_{k=0}^{n} ln B(x_k, y_k) ].  (15.64)
This maximization apparently requires a search of exponential complexity. Note that the problem does not reduce to (n + 1) independent problems of computing arg max x k ∈X p (xk  y0n ) for 0 ≤ k ≤ n. However, a global maximum of (15.64) can be found in linear time using the Viterbi algorithm. The algorithm was originally designed for decoding of convolutional codes [4], but was soon recognized to be an instance of dynamic programming [5, 6]. The algorithm was previously known in the ﬁeld of operations research [6] and applies to a variety of problems with similar mathematical structure [7]. Observe that the maximization problem (15.64) may be written in the form
max_{x_0^n ∈ X^{n+1}} E(x_0^n) ≜ f(x_0) + Σ_{k=0}^{n−1} g_k(x_k, x_{k+1}),  (15.65)

with

f(x) ≜ ln π(x) + ln B(x, y_0),  g_k(x, x′) ≜ ln A(x, x′) + ln B(x′, y_{k+1}),  0 ≤ k ≤ n − 1.
The sequence x0n = (x 0, x1 , . . . , xn ), which is a realization of states of a Markov chain, is sometimes called a path, and the diagram representing all possible paths is called the state trellis. Figure 15.4(a) shows an example, with three possible states (X = {0, 1, 2}), ﬁve time instants ( n = 4), initial value vector f = (1, 1, 0), and six possible transitions with weights g(0, 0) = 1, g(0, 2) = 2, g(1, 1) = 0, g(1, 2) = 1, g(2, 0) = 3, and g (2, 1) = 1. We set g(0, 1) = g(1, 0) = g(2, 2) = −∞ to disallow these three transitions. In this example, gk does not depend on k . Denote the value function for state x at time k = 0 as V (0, x ) = f (x ) and at times k = 1, 2, . . . , n by
V(k, x) ≜ max_{x_0^k : x_k = x} E(x_0^k),  where E(x_0^k) ≜ f(x_0) + Σ_{u=0}^{k−1} g_u(x_u, x_{u+1}).  (15.66)
Since E(x_0^k) = E(x_0^{k−1}) + g_{k−1}(x_{k−1}, x_k), we have the forward recursion

V(k, x) = max_{x′ ∈ X} [ V(k − 1, x′) + g_{k−1}(x′, x) ],  k ≥ 1.  (15.67)

For k = n we have

max_{x_0^n ∈ X^{n+1}} E(x_0^n) = max_{x ∈ X} V(n, x).
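The forward recursion (15.67), combined with bookkeeping of the maximizing predecessors, is the full Viterbi algorithm. The sketch below is our own implementation; run on the trellis example of Figure 15.4 (three states, five time instants, with the f and g values stated in the text) it recovers the optimal path and value given there:

```python
import numpy as np

NEG = -np.inf

def viterbi(f, g, n):
    """Maximize E(x_0^n) = f(x_0) + sum_k g[x_k, x_{k+1}] by dynamic programming."""
    V = np.array(f, dtype=float)           # value function V(0, x) = f(x)
    back = []                              # back[k][x]: optimal predecessor of x at time k+1
    for _ in range(n):
        cand = V[:, None] + g              # cand[x', x] = V(k-1, x') + g(x', x)
        back.append(np.argmax(cand, axis=0))
        V = np.max(cand, axis=0)           # forward recursion (15.67)
    x = int(np.argmax(V))                  # best terminal state
    path = [x]
    for bk in reversed(back):              # backtrack along survivor path
        x = int(bk[x])
        path.append(x)
    return path[::-1], float(np.max(V))

# Trellis example of Figure 15.4
f = [1.0, 1.0, 0.0]
g = np.full((3, 3), NEG)                   # -inf disallows a transition
g[0, 0], g[0, 2], g[1, 1], g[1, 2], g[2, 0], g[2, 1] = 1, 2, 0, 1, 3, 1
path, value = viterbi(f, g, n=4)
print(path, value)   # → [0, 2, 0, 2, 0] 11.0
```

The storage is O(n|X|) (the predecessor table) and the computation O(n|X|²), matching the complexity discussion below.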
Denote by x̂_{k−1}(x) a value of x′ that achieves the maximum in (15.67). Observe that if a state x at time k is part of an optimal path x̂_0^n, then x̂_k = x, and x̂_0^k must achieve V(k, x) = max_{x_0^k : x_k = x} E(x_0^k). Otherwise there would exist another path x_0^n such that E(x_0^k) > E(x̂_0^k) and x_k^n = x̂_k^n, and hence E(x_0^n) − E(x̂_0^n) > 0, which would contradict the optimality of x̂_0^n. By keeping track of the optimal predecessor state x̂_{k−1}(x) and tracing back from time k to time 0, we obtain a path that we call the survivor path. Once the forward recursive procedure is completed, the final state achieving max_x V(n, x) is identified, and backtracking yields the optimal path x̂_0^n. The complexity of the algorithm is O(n|X|) in storage (keeping track of the predecessors x̂_{k−1}(x) for all k, x, and the most recent value function) and O(n|X|²) in computation (evaluating all possible state transitions at all times). Figure 15.4(b)–(e) displays the value function V(k, x), x ∈ X, at successive times k = 0, 1, 2, 3, 4, together with the corresponding surviving paths.

15.5.2  Forward-Backward Algorithm
We now consider the problem of determining the conditional distribution of X_k, for k ∈ {0, 1, . . . , n}, given Y_0^n, i.e., that of determining the posterior probabilities P(X_k = x | Y_0^n = y_0^n), for k = 0, 1, . . . , n and x ∈ X. The solution to the problem is instrumental in solving the problem of estimating the HMM parameters from Y_0^n, as we shall see in Section 15.5.3, for which we will also need to determine P(X_k = x, X_{k+1} = x′ | Y_0^n = y_0^n) for each k = 0, 1, . . . , n and x, x′ ∈ X. Define the shorthand notation:

γ_k(x) ≜ P(X_k = x | Y_0^n = y_0^n),  (15.68)
ξ_k(x, x′) ≜ P(X_k = x, X_{k+1} = x′ | Y_0^n = y_0^n),  x, x′ ∈ X.  (15.69)

Note that γ_k is the first marginal of ξ_k. The forward-backward algorithm allows efficient computation of these probabilities. We begin with
γ_k(x) = P(X_k = x | Y_0^n = y_0^n) = P(X_k = x, Y_0^n = y_0^n) / Σ_{s ∈ X} P(X_k = s, Y_0^n = y_0^n),  0 ≤ k ≤ n.  (15.70)
Write the numerator as a product of two conditional distributions,

P(Y_0^n = y_0^n, X_k = x) =(a) P(Y_0^k = y_0^k, X_k = x) P(Y_{k+1}^n = y_{k+1}^n | X_k = x)
                         = μ_k(x) ν_k(x),  0 ≤ k < n,  (15.71)

where (a) follows from the Markov chain Y_0^k → X_k → Y_{k+1}^n. For k = n, we let ν_n(x) ≡ 1. Combining (15.70) and (15.71) we have

γ_k(x) = μ_k(x) ν_k(x) / Σ_{s ∈ X} μ_k(s) ν_k(s).  (15.72)

The first factor in the product of (15.71) is

μ_k(x) = P(Y_0^k = y_0^k, X_k = x),  x ∈ X, 0 ≤ k ≤ n,
Figure 15.4  Viterbi algorithm: (a) trellis diagram; (b)–(e) evolution of the Viterbi algorithm, showing surviving paths and values V(k, x) at times k = 0, 1, 2, 3; (f) optimal path x̂_0^4 = (0, 2, 0, 2, 0) and its value E(x̂_0^4) = 11.
for which we derive a forward recursion.¹ The recursion is initialized with

μ_0(x) = P(Y_0 = y_0, X_0 = x) = π(x) B(x, y_0).

¹ Also note that μ_n(x) = P(Y_0^n = y_0^n, X_n = x); hence p(y_0^n) = Σ_{x ∈ X} μ_n(x).
For k ≥ 0 we express μ_{k+1} in terms of μ_k as follows:

μ_{k+1}(x) = P(Y_0^{k+1} = y_0^{k+1}, X_{k+1} = x)
 =(a) P(Y_0^k = y_0^k, X_{k+1} = x) P(Y_{k+1} = y_{k+1} | X_{k+1} = x)
 = B(x, y_{k+1}) Σ_{x′ ∈ X} P(Y_0^k = y_0^k, X_{k+1} = x, X_k = x′)
 =(b) B(x, y_{k+1}) Σ_{x′ ∈ X} P(Y_0^k = y_0^k, X_k = x′) P(X_{k+1} = x | X_k = x′)
 = B(x, y_{k+1}) Σ_{x′ ∈ X} μ_k(x′) A(x′, x),  k = 0, 1, . . . , n − 1,  (15.73)
where (a) holds because Y0k → X k +1 → Yk +1 forms a Markov chain, and (b) because Y 0k → X k → X k +1 forms a Markov chain. The second factor in the product of (15.71) is
ν_k(x) = P(Y_{k+1}^n = y_{k+1}^n | X_k = x),  x ∈ X, 0 ≤ k < n.  (15.74)

Starting from ν_n(x) ≡ 1, we have the following backward recursion, expressing ν_{k−1} in terms of ν_k for 1 ≤ k ≤ n:

ν_{k−1}(x) = P(Y_k^n = y_k^n | X_{k−1} = x)
 = Σ_{x′ ∈ X} P(Y_k^n = y_k^n, X_k = x′ | X_{k−1} = x)
 =(a) Σ_{x′ ∈ X} P(Y_k^n = y_k^n | X_k = x′) P(X_k = x′ | X_{k−1} = x)
 =(b) Σ_{x′ ∈ X} P(Y_{k+1}^n = y_{k+1}^n | X_k = x′) P(Y_k = y_k | X_k = x′) P(X_k = x′ | X_{k−1} = x)
 = Σ_{x′ ∈ X} ν_k(x′) B(x′, y_k) A(x, x′),  k = n, n − 1, . . . , 1,  (15.75)
where (a) holds because X k−1 → X k → Ykn forms a Markov chain, and (b) because Y kn+1 → X k → Yk forms a Markov chain. Next we derive an expression for
ξ_k(x, x′) = P(X_k = x, X_{k+1} = x′ | Y_0^n = y_0^n)
          = P(Y_0^n = y_0^n, X_k = x, X_{k+1} = x′) / Σ_{s,s′ ∈ X} P(Y_0^n = y_0^n, X_k = s, X_{k+1} = s′).

We have

P(Y_0^n = y_0^n, X_k = x, X_{k+1} = x′)
 =(a) P(Y_0^{k+1} = y_0^{k+1}, X_k = x, X_{k+1} = x′) P(Y_{k+2}^n = y_{k+2}^n | X_{k+1} = x′)
 =(b) P(Y_0^k = y_0^k, X_k = x) P(X_{k+1} = x′ | X_k = x) P(Y_{k+1} = y_{k+1} | X_{k+1} = x′) ν_{k+1}(x′)
 = μ_k(x) A(x, x′) B(x′, y_{k+1}) ν_{k+1}(x′),

where (a) holds because (Y_0^{k+1}, X_k) → X_{k+1} → Y_{k+2}^n forms a Markov chain, and (b) because Y_0^k → X_k → X_{k+1} → Y_{k+1} forms a Markov chain. Hence, for 0 ≤ k < n and x, x′ ∈ X,

ξ_k(x, x′) = μ_k(x) A(x, x′) B(x′, y_{k+1}) ν_{k+1}(x′) / Σ_{s,s′ ∈ X} μ_k(s) A(s, s′) B(s′, y_{k+1}) ν_{k+1}(s′).  (15.76)
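The unscaled recursions are easy to implement for short sequences. The sketch below is our own code (function name and example numbers are ours); it computes μ_k, ν_k, and the posterior marginals γ_k:

```python
import numpy as np

def forward_backward(pi, A, B, y):
    """Posterior state marginals gamma_k(x) via the unscaled recursions
    (15.73) and (15.75). Observations are indexed 0..n-1 here."""
    n, S = len(y), len(pi)
    mu = np.zeros((n, S))            # mu_k(x) = P(Y_0^k = y_0^k, X_k = x)
    nu = np.ones((n, S))             # nu_k(x) = P(Y_{k+1}^n | X_k = x), nu at the end = 1
    mu[0] = pi * B[:, y[0]]          # initialization
    for k in range(n - 1):
        mu[k + 1] = B[:, y[k + 1]] * (mu[k] @ A)      # forward recursion (15.73)
    for k in range(n - 2, -1, -1):
        nu[k] = A @ (B[:, y[k + 1]] * nu[k + 1])      # backward recursion (15.75)
    gamma = mu * nu                                   # numerator of (15.72)
    return gamma / gamma.sum(axis=1, keepdims=True)   # normalize as in (15.72)

# Two-state example (our numbers)
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.8, 0.2], [0.3, 0.7]])
gamma = forward_backward(pi, A, B, [0, 0, 1, 1])
```

For long sequences this unscaled version underflows, which is exactly the numerical issue addressed by the scaling factors discussed next.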
Scaling Factors.  Unfortunately the recursions in (15.73) and (15.75) are numerically unstable for large n, because the probabilities μ_k(x) and ν_k(x) vanish exponentially with n and are sums of many small terms of different sizes. The following approach is more stable. Define

α_k(x) = P(X_k = x | Y_0^k = y_0^k),  (15.77)
β_k(x) = P(Y_{k+1}^n = y_{k+1}^n | X_k = x) / P(Y_{k+1}^n = y_{k+1}^n | Y_0^k = y_0^k),  (15.78)
c_k = P(Y_k = y_k | Y_0^{k−1} = y_0^{k−1}).  (15.79)

Then

γ_k(x) = α_k(x) β_k(x),  ξ_k(x, x′) = (1 / c_{k+1}) α_k(x) A(x, x′) B(x′, y_{k+1}) β_{k+1}(x′).

A forward recursion can be derived for α_k and c_k, and a backward recursion for β_k (see Exercise 15.6). The time and storage complexity of the algorithm is O(n|X|²).

15.5.3
Baum–Welch Algorithm for HMM Learning
Finally, we address the learning problem of estimating the HMM parameters θ ≜ (π, A, B). The solution to this problem was developed by Baum and Welch in the 1960s and was later recognized to be an instance of the EM algorithm [9]. We derive the Baum–Welch algorithm using the EM framework described in Section 14.11. We take (X_0^n, Y_0^n) to be the complete data; hence the state sequence X_0^n is viewed as the latent (missing) data. The marginal probability of y_0^n is obtained from (15.63) as

p_θ(y_0^n) = Σ_{x_0^n ∈ X^{n+1}} π(x_0) ∏_{k=0}^{n−1} A(x_k, x_{k+1}) ∏_{k=0}^{n} B(x_k, y_k).  (15.80)
Indicating the dependence on θ explicitly, we obtain the incomplete-data and complete-data log-likelihoods respectively as

ℓ_id(θ) = ln p_θ(y_0^n) = ln Σ_{x_0^n ∈ X^{n+1}} π(x_0) ∏_{k=0}^{n−1} A(x_k, x_{k+1}) ∏_{k=0}^{n} B(x_k, y_k)

and

ℓ_cd(θ) = ln p_θ(x_0^n, y_0^n) = ln π(x_0) + Σ_{k=0}^{n−1} ln A(x_k, x_{k+1}) + Σ_{k=0}^{n} ln B(x_k, y_k).
The function ℓ_id(θ) is nonconcave and typically has many local extrema. The E-step in the m-th stage of the EM algorithm is given by

Q(θ | θ̂^(m)) = E_{θ̂^(m)}[ ℓ_cd(θ) | Y_0^n = y_0^n ]
 = E_{θ̂^(m)}[ ln π(X_0) | Y_0^n = y_0^n ] + E_{θ̂^(m)}[ Σ_{k=0}^{n−1} ln A(X_k, X_{k+1}) | Y_0^n = y_0^n ]
   + E_{θ̂^(m)}[ Σ_{k=0}^{n} ln B(X_k, Y_k) | Y_0^n = y_0^n ].  (15.81)
To perform the M-step, we note that the cost function in (15.81) is the sum of three terms involving the variables π, A, and B separately. Hence the problem can be split into three independent problems. To solve each of these problems, we use properties of the cross-entropy and entropy of pmfs. Recalling the definition (6.6), the entropy of a pmf p(z), z ∈ Z, where Z is a finite set, is defined as

H(p) ≜ − Σ_{z ∈ Z} p(z) ln p(z).  (15.82)

The cross-entropy of a pmf p(z), z ∈ Z, relative to another pmf q(z), z ∈ Z, is defined as

H(p, q) ≜ − Σ_{z ∈ Z} p(z) ln q(z).  (15.83)

The cross-entropy satisfies the following extremal property. For any two pmfs p and q, we have

Σ_{z ∈ Z} p(z) ln q(z) = Σ_{z ∈ Z} p(z) ln p(z) − D(p‖q) ≤ Σ_{z ∈ Z} p(z) ln p(z),

where D(p‖q) is the Kullback–Leibler divergence between the pmfs p and q (see Section 6.1). The upper bound is achieved by q = p. The inequality can be rewritten as

−H(p, q) ≤ −H(p),  (15.84)

with equality when q = p; i.e., −H(p, q) is maximized over q when q = p.
We employ (15.84) as follows. The first term of the cost function,

E_{θ̂^(m)}[ ln π(X_0) | Y_0^n = y_0^n ] = Σ_{x ∈ X} P_{θ̂^(m)}(X_0 = x | Y_0^n = y_0^n) ln π(x),

is a negative cross-entropy and is maximized by

π̂^(m+1)(x) = P_{θ̂^(m)}(X_0 = x | Y_0^n = y_0^n),  ∀x ∈ X,

which is the posterior distribution of X_0, given Y_0^n = y_0^n and the current estimate of θ.
The second term in (15.81) is

E_{θ̂^(m)}[ Σ_{k=0}^{n−1} ln A(X_k, X_{k+1}) | Y_0^n = y_0^n ]
 = Σ_{k=0}^{n−1} Σ_{x,x′ ∈ X} P_{θ̂^(m)}(X_k = x, X_{k+1} = x′ | Y_0^n = y_0^n) ln A(x, x′)
 = Σ_{x ∈ X} Σ_{x′ ∈ X} [ Σ_{k=0}^{n−1} P_{θ̂^(m)}(X_k = x, X_{k+1} = x′ | Y_0^n = y_0^n) ] ln A(x, x′).

Since, for each x ∈ X, both the normalized vector (1/Z(x)) Σ_{k=0}^{n−1} P_{θ̂^(m)}(X_k = x, X_{k+1} = · | Y_0^n = y_0^n) and A(x, ·) are probability distributions, the second term is maximized by applying the extremal property of cross-entropy for each x ∈ X. The solution is

Â^(m+1)(x, x′) = (1/Z(x)) Σ_{k=0}^{n−1} P_{θ̂^(m)}(X_k = x, X_{k+1} = x′ | Y_0^n = y_0^n)
 = [ Σ_{k=0}^{n−1} P_{θ̂^(m)}(X_k = x, X_{k+1} = x′ | Y_0^n = y_0^n) ] / [ Σ_{k=0}^{n−1} P_{θ̂^(m)}(X_k = x | Y_0^n = y_0^n) ],  ∀x, x′ ∈ X,

where Z(x) is the normalization factor that ensures Â^(m+1)(x, ·) is a valid probability distribution for each x ∈ X.

The third term of (15.81),

E_{θ̂^(m)}[ Σ_{k=0}^{n} ln B(X_k, Y_k) | Y_0^n = y_0^n ]
 = Σ_{x ∈ X} Σ_{y ∈ Y} [ Σ_{k=0}^{n} P_{θ̂^(m)}(X_k = x, Y_k = y | Y_0^n = y_0^n) ] ln B(x, y)
 = Σ_{x ∈ X} Σ_{y ∈ Y} [ Σ_{k=0}^{n} P_{θ̂^(m)}(X_k = x | Y_0^n = y_0^n) 1{y_k = y} ] ln B(x, y),

is similarly maximized by

B̂^(m+1)(x, y) = (1/Z(x)) Σ_{k=0}^{n} P_{θ̂^(m)}(X_k = x | Y_0^n = y_0^n) 1{y_k = y}
 = [ Σ_{k=0}^{n} P_{θ̂^(m)}(X_k = x | Y_0^n = y_0^n) 1{y_k = y} ] / [ Σ_{k=0}^{n} P_{θ̂^(m)}(X_k = x | Y_0^n = y_0^n) ],  ∀x ∈ X, y ∈ Y,

where Z(x) is the normalization factor that ensures B̂^(m+1)(x, ·) is a valid probability distribution for each x ∈ X. The M-step can be written more compactly as
π̂^(m+1)(x) = γ_0^(m)(x),  (15.85)

Â^(m+1)(x, x′) = Σ_{k=0}^{n−1} ξ_k^(m)(x, x′) / Σ_{k=0}^{n−1} γ_k^(m)(x),  (15.86)

B̂^(m+1)(x, y) = Σ_{k=0}^{n} γ_k^(m)(x) 1{y_k = y} / Σ_{k=0}^{n} γ_k^(m)(x),  (15.87)
where

γ_k^(m)(x) ≜ P_{θ̂^(m)}(X_k = x | Y_0^n = y_0^n),  (15.88)
ξ_k^(m)(x, x′) ≜ P_{θ̂^(m)}(X_k = x, X_{k+1} = x′ | Y_0^n = y_0^n),  x, x′ ∈ X,  (15.89)

are the current estimates (at iteration m) of the posterior probabilities γ_k(x) and ξ_k(x, x′) of (15.68) and (15.69), respectively. The probabilities (15.88) and (15.89) can be evaluated in a computationally efficient way using the forward-backward algorithm of Section 15.5.2.
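One full Baum–Welch iteration (15.85)–(15.87) can be sketched as below. This is our own implementation using the unscaled forward-backward quantities, so it is suitable only for short sequences; all names and the example numbers are ours:

```python
import numpy as np

def baum_welch_step(pi, A, B, y):
    """One Baum-Welch (EM) iteration, implementing (15.85)-(15.87)."""
    n, S = len(y), len(pi)
    mu = np.zeros((n, S))
    nu = np.ones((n, S))
    mu[0] = pi * B[:, y[0]]
    for k in range(n - 1):
        mu[k + 1] = B[:, y[k + 1]] * (mu[k] @ A)          # forward (15.73)
    for k in range(n - 2, -1, -1):
        nu[k] = A @ (B[:, y[k + 1]] * nu[k + 1])          # backward (15.75)
    gamma = mu * nu
    gamma /= gamma.sum(axis=1, keepdims=True)             # gamma_k(x), (15.68)
    # xi_k(x, x') proportional to mu_k(x) A(x, x') B(x', y_{k+1}) nu_{k+1}(x'), per (15.76)
    xi = mu[:-1, :, None] * A[None, :, :] * B[:, y[1:]].T[:, None, :] * nu[1:, None, :]
    xi /= xi.sum(axis=(1, 2), keepdims=True)
    pi_new = gamma[0]                                     # (15.85)
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # (15.86)
    onehot = np.eye(B.shape[1])[y]                        # indicator 1{y_k = y}
    B_new = (gamma.T @ onehot) / gamma.sum(axis=0)[:, None]    # (15.87)
    return pi_new, A_new, B_new

pi0 = np.array([0.6, 0.4])
A0 = np.array([[0.7, 0.3], [0.4, 0.6]])
B0 = np.array([[0.9, 0.1], [0.2, 0.8]])
pi1, A1, B1 = baum_welch_step(pi0, A0, B0, [0, 1, 1, 0, 0, 1])
```

By the EM monotonicity property, each such iteration cannot decrease the incomplete-data log-likelihood ℓ_id(θ).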
Exercises
15.1 Linear Innovations. Prove Corollary 15.1.

15.2 Kalman Filtering with Identical State and Observation Noise. Suppose in the Kalman filter model of Section 15.2, the noise in the state equation and the noise in the observation equation are identical at each time, i.e.,

U_k = V_k,  k = 0, 1, . . . .

Rederive the Kalman filter/predictor equations, and the Riccati equation for the predictor covariance update, under this model.

15.3 Kalman Filtering with Control. Suppose we modify the state equation in the Kalman filtering model as

X_{k+1} = F_k X_k + G_k U_k + W_k,
where {G_k}_{k=0}^∞ is a known sequence of matrices. Find appropriate modifications to the Kalman filtering recursions for the case where, for each k, the control U_k is an affine function of the observations Y_0^k.

15.4 Kalman Smoothing. For the Kalman filtering model, suppose we wish to estimate X_j from Y_0^k, for 0 ≤ j ≤ k. Consider the estimator defined recursively (in k) by

X̂_{j|k} = X̂_{j|k−1} + K_k^s (Y_k − H_k X̂_{k|k−1}),

where

K_k^s = Σ^s_{k|k−1} H_k′ [ H_k Σ^s_{k|k−1} H_k′ + R_k ]^{−1}

and

Σ^s_{k+1|k} = Σ^s_{k|k−1} [ F_k − K_k H_k ]′,

with

Σ^s_{j|j−1} = Σ_{j|j−1},

where X̂_{k|k−1}, Σ_{k|k−1}, and K_k are as in the one-step prediction problem. Show that

Σ^s_{k|k−1} = Cov((X_j − X̂_{j|k−1}), X_k),

and that

X̂_{j|k} = E[X_j | Y_0^k].
15.5 Extended Kalman Filter Simulation. Redo the simulation example described in Figure 15.1 for the following cases:
(a) The state and measurement noise variables are Gaussian, with q = 0.1, r = 0.3, and ω_0 = 1.
(b) The state noise is Gaussian but the measurement noise is non-Gaussian, with mixture pdf

V_k ∼ 0.5 N(√(r/2), r/2) + 0.5 N(−√(r/2), r/2).

Note that Var(V_k) = r. Assume that r = q = 0.1 and ω_0 = 1.
(c) Compare the tracking performance in parts (a) and (b).
15.6 Forward-Backward Recursions. Derive forward recursions for α_k and c_k, and a backward recursion for β_k, as defined in (15.77)–(15.79).
15.7 Baum–Welch. Show that if the state transition probability matrix A(x, x′) is initialized to 0 for some entry (x, x′), then that entry will remain zero for all iterations of the Baum–Welch algorithm.

15.8 Baum–Welch with Symmetric Transition Matrix. In some problems the state transition probability matrix A is known to be symmetric. Derive the appropriate modifications to the Baum–Welch algorithm for estimating symmetric A.

15.9 Viterbi Algorithm. Find the sequence x ∈ {0, 1, 2}^5 that maximizes

E(u) = { Σ_t u_t  if ∏_t u_t ≠ 0;  −∞  otherwise },

where u_t = |t − 3 + x_{t+1} − x_t| for t = 1, 2, 3, 4. Show the trellis diagram and the surviving paths at times t = 2, 3, 5.
15.10 Two HMMs. Consider the following extension of HMMs. Let U = (U_1, U_2, . . . , U_n) and V = (V_1, V_2, . . . , V_n) be two binary Markov chains with the same alphabet X = {0, 1} and respective transition probability matrices

A = [ 2/3 1/3 ; 1/3 2/3 ]  and  Q = [ 1/2 1/2 ; 1/2 1/2 ].

Assume U_1 = V_1 = 1. A fair coin is flipped. If the result is "heads," U will be selected as the state sequence X for an HMM with output alphabet Y = {0, 1} and emission probability matrix B = A. If the result is "tails," V will be selected as the state sequence X for an HMM with output alphabet Y = {0, 1} and emission probability matrix B = A. Let n = 3 and Y = (1, 1, 1). What is the posterior probability that the coin flip was "heads"?
References
[1] B. D. O. Anderson and J. B. Moore, Optimal Filtering, Prentice Hall, 1979.
[2] M. S. Grewal, L. R. Weill, and A. P. Andrews, Global Positioning Systems, Inertial Navigation, and Integration, John Wiley & Sons, 2007.
[3] A. Doucet and A. M. Johansen, "A Tutorial on Particle Filtering and Smoothing: Fifteen Years Later," in Handbook of Nonlinear Filtering, pp. 656–704, 2009.
[4] A. J. Viterbi, "Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm," IEEE Trans. Inf. Theory, Vol. 13, pp. 260–269, 1967.
[5] J. K. Omura, "On the Viterbi Decoding Algorithm," IEEE Trans. Inf. Theory, Vol. 15, No. 1, pp. 177–179, 1969.
[6] M. Pollack and W. Wiebenson, "Solutions of the Shortest-Route Problem: A Review," Oper. Res., Vol. 8, pp. 224–230, 1960.
[7] G. D. Forney, "The Viterbi Algorithm," Proc. IEEE, Vol. 61, No. 3, pp. 268–278, 1973.
[8] R. Bellman, "The Theory of Dynamic Programming," Proc. Nat. Acad. Sci., Vol. 38, pp. 716–719, 1952.
[9] L. R. Welch, "The Shannon Lecture: Hidden Markov Models and the Baum–Welch Algorithm," IEEE Inf. Theory Society Newsletter, Vol. 53, No. 4, pp. 1, 10–13, 2003.
Appendix A
Matrix Analysis
This appendix includes a number of basic results from matrix analysis that are used in this book. For a comprehensive study of this subject, the reader is referred to the book by Horn and Johnson [1]. A (deterministic) m × n matrix A is a rectangular array of scalar (real-valued¹) elements:

A = [ A_{11} A_{12} ⋯ A_{1n} ;
      A_{21} A_{22} ⋯ A_{2n} ;
        ⋮      ⋮         ⋮   ;
      A_{m1} A_{m2} ⋯ A_{mn} ].
Special cases are a square matrix, for which m = n, a row vector, for which m = 1, and a column vector, for which n = 1. (Deterministic) vectors are denoted by lowercase boldfaced symbols, e.g., a and b. The sum of m × n matrices A and B is an m × n matrix C whose elements are given by

C_{ij} = A_{ij} + B_{ij},  i = 1, 2, . . . , m,  j = 1, 2, . . . , n.

Matrix addition is commutative, i.e., A + B = B + A.
The product AB of two matrices A and B is only defined if the number of columns of A is equal to the number of rows of B. In particular, if A is an m × n matrix and B is an n × k matrix, then C = AB is an m × k matrix with elements

C_{ij} = Σ_{ℓ=1}^{n} A_{iℓ} B_{ℓj},  i = 1, 2, . . . , m,  j = 1, 2, . . . , k.
Matrix multiplication is not necessarily commutative (e.g., BA may not be defined even if AB is), but it is associative, i.e., A(BC) = (AB)C, and distributive, i.e., A(B + C) = AB + AC. The transpose of an m × n matrix A is an n × m matrix denoted by A′, whose elements are given by

(A′)_{ij} = A_{ji},  i = 1, 2, . . . , n,  j = 1, 2, . . . , m.
A matrix A is said to be symmetric if A′ = A. The transpose (AB)′ of the product AB is the product of the transposes in reverse order, i.e., B′A′.

¹ We restrict our attention to real-valued matrices here, with the understanding that all of the results in this appendix are easily generalized to complex-valued matrices [1].
An n × n diagonal matrix D, denoted by diag[D_{11}, D_{22}, . . . , D_{nn}], is a matrix whose off-diagonal entries are equal to zero, i.e.,

D = [ D_{11} 0 ⋯ 0 ; 0 D_{22} ⋯ 0 ; ⋮ ⋱ ⋮ ; 0 ⋯ 0 D_{nn} ].

A special case is an n × n identity matrix, denoted by I or I_n, whose diagonal elements are all equal to 1, i.e.,

I = [ 1 0 ⋯ 0 ; 0 1 ⋯ 0 ; ⋮ ⋱ ⋮ ; 0 ⋯ 0 1 ].
A square matrix A is said to be unitary if A′A = I. The inverse of a square matrix A (when it exists) is denoted by A^{−1}, and satisfies

A A^{−1} = A^{−1} A = I.

If A and B are invertible matrices, then

(AB)^{−1} = B^{−1} A^{−1}.
An important result regarding matrix inverses is the following:

LEMMA A.1  Matrix Inversion Lemma [1]. Assuming all inverses used exist, we have

(A + BCD)^{−1} = A^{−1} − A^{−1} B (D A^{−1} B + C^{−1})^{−1} D A^{−1}
              = A^{−1} − A^{−1} B C (I + D A^{−1} B C)^{−1} D A^{−1}
              = A^{−1} − A^{−1} B (I + C D A^{−1} B)^{−1} C D A^{−1}.
(
A
+ b c d± )−1 = −1
´
A
I
−
c d±
1 b
+ d±
−1 c
A
−1
µ
A
and
( + )−1 = −1 − −1( −1 + −1)−1 −1 . The determinant of an n × n matrix is denoted by det ( ) or  , and is given by   = 11 ˜ 11 + 12 ˜ 12 + · · · + 1n ˜ 1n , where the cofactor ˜ i j of equals (−1) j +i times the determinant of the (n − 1) × (n − 1) A
B
A
A
A
B
A
A
A
A
A
A
A
A
A
A
A
A
A
matrix formed by deleting the i th row and j th column of A. A necessary and sufﬁcient
condition for the existence of the inverse of a matrix is that its determinant is nonzero. The determinant of a square matrix is the same as that of its transpose, and for two n × n matrices A and B
|AB| = |A| |B|.

The trace of an n × n matrix A is denoted by Tr(A), and equals the sum of its diagonal elements:

Tr(A) = Σ_{i=1}^{n} A_{ii}.
The trace is a linear operation on matrices. In particular Tr(A + B) = Tr (A) + Tr (B). An interesting and useful property of the trace is the following. Given two matrices A and B whose product AB is a square matrix (which implies that BA is welldeﬁned and is also square), Tr (AB ) = Tr (BA ).
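These determinant and trace identities, together with Lemma A.1, are easy to sanity-check numerically. The sketch below is ours; the random matrices are arbitrary (full rank with probability one):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
B4 = rng.normal(size=(4, 4))
# |AB| = |A||B| and Tr(AB) = Tr(BA)
assert np.isclose(np.linalg.det(A @ B4), np.linalg.det(A) * np.linalg.det(B4))
assert np.isclose(np.trace(A @ B4), np.trace(B4 @ A))

# Matrix inversion lemma (Lemma A.1, first form), with non-square B and D
Bm = rng.normal(size=(4, 2))
C = rng.normal(size=(2, 2))
D = rng.normal(size=(2, 4))
Ai = np.linalg.inv(A)
lhs = np.linalg.inv(A + Bm @ C @ D)
rhs = Ai - Ai @ Bm @ np.linalg.inv(D @ Ai @ Bm + np.linalg.inv(C)) @ D @ Ai
assert np.allclose(lhs, rhs)
```

Note that B and D here are 4 × 2 and 2 × 4, illustrating that the lemma lets one invert a large matrix by inverting a much smaller one.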
A symmetric matrix A is said to be positive definite (denoted by A ≻ 0) if

x′Ax > 0  for all nonzero x,

and is said to be positive semidefinite or nonnegative definite (denoted by A ⪰ 0) if

x′Ax ≥ 0  for all x.

For two symmetric matrices A and B of the same size, we use A ≻ B to denote that A − B ≻ 0 (similarly, A ⪰ B). A necessary and sufficient condition for a symmetric matrix A to be positive definite is that its leading principal minors are all positive. The k-th leading principal minor of a matrix M is the determinant of its upper-left k × k submatrix. A Toeplitz matrix has constant values along all of its diagonals, e.g.,

A = [ a       u_1     u_2  ⋯  u_{n−1} ;
      ℓ_1     a       u_1  ⋯  u_{n−2} ;
      ℓ_2     ℓ_1     a    ⋯    ⋮     ;
       ⋮                   ⋱   u_1    ;
      ℓ_{n−1} ℓ_{n−2}  ⋯       a      ].
A special case of a Toeplitz matrix is a circulant matrix, for which each row is a circular rotation by one of the previous row.
Eigen Decomposition.  Let A be an n × n matrix. A nonzero vector u is an eigenvector of A with eigenvalue λ if

Au = λu.

When a matrix is applied to an eigenvector, it simply scales it. Rearranging this, we see

(A − λI)u = 0.  (A.1)

This implies that if A has eigenvalue λ, then the matrix A − λI is not invertible (otherwise, we would have u = 0, which would be a contradiction), i.e.,

det(A − λI) = 0.  (A.2)
λ1 ≥ λ2 ≥ · · · ≥ λn . By the above properties, for symmetric A, we can ﬁnd corresponding eigenvectors u1 , . . . , u n that are orthonormal, i.e., they are orthogonal and unit norm. One of the most important results in matrix analysis is the following. A.2 Eigen Decomposition. An n terms of its eigenvectors and eigenvalues as LEMMA
A
=
n ± i =1
× n symmetric matrix
A
can be written in
λi ui u±i .
The form of the eigen decomposition given in Lemma A.2 is often called the outerproduct from the eigen decomposition. In many cases, we work with a matrix representation of the eigen decomposition. Let U
= [ u1 u2 . . . un ]
and D be the n × n diagonal matrix with diagonal entries λ1 , . . . , λ n , i.e.,
D
⎡λ1 0 . . . ⎢⎢ .. = ⎢⎢ 0. .λ2 . . ⎣ .. . . . . 0 ... 0
⎤ .. ⎥⎥ . ⎥. ⎥ 0⎦ λn 0
388
Matrix Analysis
Then, the eigen decomposition of A can be written as
=
A
Note that
U
±.
(A.3)
UDU
⎡ u±⎤ 1 ±⎥ ⎢ u ±=⎢ ⎢⎣ ..2 ⎥⎥⎦ . u± n
and that U± U = I, i.e., U is a unitary matrix. Note that U−1 relationship given in (A.3) and write
=
D
±
U
AU
=
U
± . We can invert the
.
Thus eigen decomposition can also be regarded as a way to diagonalize the matrix A. Finally, we note that the trace and determinant of an n ×n matrix A can be conveniently written in terms of it eigenvalues (λ1, λ2 , . . . , λn ) as n ±
Tr (A) =
i =1
λi ,   = A
n ¶
i =1
λi .
Also, a necessary and sufﬁcient condition for positive deﬁniteness of a symmetric matrix is that all of its eigenvalues are positive.
Matrix Calculus.  The gradient with respect to a matrix A of a scalar-valued function f(A) is a matrix of the partial derivatives of f with respect to the elements of A:

[∇_A f(A)]_{ij} = ∂f / ∂A_{ij}.

This definition obviously applies even in the case where A is a row or column vector. Some examples of gradient calculations that follow from this definition are:

∇_A Tr(A) = I,
∇_A Tr(BA) = B′,
∇_A Tr(AB) = B′,
∇_A Tr(BA^{−1}) = −(A^{−1} B A^{−1})′,
∇_A exp(x′Ax) = x x′ exp(x′Ax),
∇_A |A| = |A| (A^{−1})′,
∇_A ln |A| = (A^{−1})′,
∇_x x′y = ∇_x y′x = y,
∇_x x′x = 2x,
∇_x Tr(x y′) = ∇_x Tr(y x′) = y.

In these equations we have assumed that matrix inverses exist wherever they are used.

References
[1] R. A. Horn and C. R. Johnson, Matrix Analysis, Cambridge University Press, 1990.
Appendix B  Random Vectors and Covariance Matrices
We use the following notation. The mean of a random vector Y ∈ ℝ^n is denoted by

μ_Y = E(Y) ∈ ℝ^n,

and its covariance matrix by

C_Y = E[(Y − μ_Y)(Y − μ_Y)′] = Cov(Y, Y) ∈ ℝ^{n×n}.

The cross-covariance matrix of two random vectors X ∈ ℝ^m and Y ∈ ℝ^n is denoted by

C_{XY} = E[(X − μ_X)(Y − μ_Y)′] = Cov(X, Y) ∈ ℝ^{m×n}.

The following properties hold:

• Any covariance matrix is nonnegative definite (see Appendix A for the definition): C_Y ⪰ 0. If C_Y has full rank, then it is positive definite, i.e., C_Y ≻ 0.
• Any covariance matrix is symmetric: C_Y(i, j) = C_Y(j, i) for all i, j = 1, . . . , n.
• If C_Y has full rank, the inverse covariance matrix C_Y^{−1} exists and is also positive definite and symmetric.
• Given two random vectors X ∈ ℝ^m and Y ∈ ℝ^n and two matrices A ∈ ℝ^{l×m} and B ∈ ℝ^{p×n}, we have the linearity property Cov(AX, BY) = A Cov(X, Y) B′.
• The distribution of a Gaussian random vector Y is determined by its mean and covariance matrix. If the covariance matrix has full rank, then Y has a pdf, which is given by

p_Y(y) = (2π)^{−n/2} |C_Y|^{−1/2} exp( −(1/2) (y − μ_Y)′ C_Y^{−1} (y − μ_Y) ),

where |·| denotes the matrix determinant.
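The Gaussian pdf above can be evaluated directly. The helper function below is our own sketch; as a correctness check, for a diagonal covariance the joint density must factor into a product of univariate normal densities:

```python
import numpy as np

def gauss_pdf(y, mu, C):
    """Multivariate Gaussian density with mean mu and full-rank covariance C."""
    n = len(mu)
    d = y - mu
    return float((2 * np.pi) ** (-n / 2) * np.linalg.det(C) ** (-0.5)
                 * np.exp(-0.5 * d @ np.linalg.solve(C, d)))

# Diagonal-covariance example (our numbers): density factorizes per coordinate
mu = np.array([0.0, 1.0])
C = np.diag([1.0, 4.0])
y = np.array([0.5, 2.0])
p = gauss_pdf(y, mu, C)
```

Using `np.linalg.solve` rather than forming C^{−1} explicitly is the usual numerically preferable choice.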
Appendix C
Probability Distributions
In this book we encounter several classical probability distributions. Their properties are summarized below.

1. Exponential distribution with scale parameter σ > 0, denoted by Exp(σ):

   p_σ(x) = (1/σ) e^{−x/σ},  x ≥ 0.  (C.1)

   The mean and standard deviation of X are both equal to σ.

2. Laplace distribution with scale parameter σ > 0, denoted by Lap(σ):

   p_σ(x) = (1/(2σ)) e^{−|x|/σ},  x ∈ ℝ.  (C.2)
   The mean of X is zero and its variance is equal to 2σ².

3. Multivariate Gaussian distribution with mean μ and covariance matrix C, denoted by N(μ, C). Its pdf takes the form

   p(x) = (2π)^{−d/2} |C|^{−1/2} exp( −(1/2) (x − μ)′ C^{−1} (x − μ) ),  x ∈ ℝ^d.  (C.3)
N
³∞
p(x ) =
2 d/ 2
1
±(d /2) x
d /2−1
e− x /2 ,
x
≥0
(C.4)
where ±( t ) ± 0 x t −1e− x dx is the Gamma function, which satisﬁes ±(n) = (n − √ 1)! for any integer n ≥ 1, ±(1/2) = π , and the recursion ±(t + 1) = t ±(t ) for all t > 0. The mean and variance of X are equal to d and 2d , respectively. For d = 2, X ∼ Exp(2). Using integration by parts, the tail probability of the χn2 distribution can be expressed as
⎧ √ ⎪⎪⎨2 Q( t ) :d = 1 √ e√− / ∑(d− 1)/2 (k −1)!(2t ) − / 2 : d = 3, 5, 7, . . . P{χd > t } = 2 Q( t ) + π ( 2k −1)! ⎪⎩⎪ −t /2 ∑d/2−1 (t /2) k=1 e : d = 2, 4, 6, . . . . k =0 k! k 1 2
t 2
(C.5)
k
5. Noncentral chisquared distribution with d degrees of freedom and noncentrality ∑ parameter λ , denoted by χd2 (λ). The random variable X = di=1 Z i2 where { Z i }di=1
392
Probability Distributions
N
are i.i.d. (µi , 1), and the squares of the means {µi }di=1 sum to λ. The pdf of X takes the form √ 1 ´ x µ(d −2)/4 −(x +λ)/2 pλ (x ) = e Id / 2−1 ( λ x ), x ≥ 0, (C.6) 2 λ where In (t ) is the modiﬁed Bessel function of the ﬁrst kind and order n:
In (t ) ±
¶π n √π±((t /n2+) 1/ 2) et cos φ sin2n φ d φ. 0
³∞ In particular, I0(t ) = π 0 et cos φ dφ . The mean and variance of X are equal to d + λ and 2d + 4λ, respectively. As a special case, χd2 (0) is the (central) chisquared 1
distribution of (C.4). 6. · Rice (or Rician) distribution, denoted by Rice( ρ, σ 2 ). The random variable X
Z 12 + Z 22 where Z 1 and Z 2 are independent
variables, and µ 21 + µ22
7.
N (µ1 , σ 2) and N (µ2 , σ 2 ) random
= ρ2 . The pdf of X is ± x 2 + ρ2 ² ´ ρx µ x (C.7) pρ,σ (x ) = 2 exp − I0 σ 2σ 2 σ 2 , x ≥ 0. √ If σ = 1, then X 2 ∼ χ22(ρ 2 ). If ρ = 0, then X ∼ Rayleigh(σ) and X 2 ∼ Exp ( 2σ). Gamma distribution, denoted by ±(a, b). The random variable X has pdf p a,b ( x ) =
ba a−1 −bx ±(a) x e ,
8. Rayleigh distribution with scale parameter σ
x
> 0.
(C.8)
> 0:
y − y 22 e 2σ , 2
y ≥ 0. σ Poisson distribution with parameter λ > 0, denoted by Poi(λ): λx e−λ, x = 0, 1, 2,. . . . pλ (x ) = x! p σ ( y) =
9.
=
(C.9)
(C.10)
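The closed-form chi-squared tail expression (C.5) can be checked numerically. The sketch below (our own illustration; the helper names are assumptions, not from the text) implements the even-d branch of (C.5) and compares it against brute-force integration of the pdf (C.4).

```python
import math

def chi2_pdf(x, d):
    """Chi-squared pdf with d degrees of freedom, as in (C.4)."""
    return x ** (d / 2 - 1) * math.exp(-x / 2) / (2 ** (d / 2) * math.gamma(d / 2))

def chi2_tail_even(t, d):
    """Closed-form tail P{chi^2_d > t} for even d, as in (C.5)."""
    return math.exp(-t / 2) * sum((t / 2) ** k / math.factorial(k) for k in range(d // 2))

def chi2_tail_numeric(t, d, upper=200.0, n=200_000):
    """Brute-force trapezoidal integration of the pdf over [t, upper]."""
    h = (upper - t) / n
    s = 0.5 * (chi2_pdf(t, d) + chi2_pdf(upper, d))
    s += sum(chi2_pdf(t + i * h, d) for i in range(1, n))
    return s * h
```

For d = 2 the sum in (C.5) collapses to the single term k = 0, recovering the Exp(2) tail e^{−t/2}, consistent with the special case noted under (C.4).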
Appendix D
Convergence of Random Sequences
This appendix reviews classical notions of convergence for random sequences and some important properties. Let X_n, n = 1, 2, . . . , be a sequence of random variables in Rᵐ.

The sequence X_n converges in probability to X if

∀ε > 0 :  lim_{n→∞} P{‖X_n − X‖ < ε} = 1,   (D.1)

where the choice of the norm is immaterial.

The sequence X_n converges a.s. to X if

P{ lim_{n→∞} X_n = X } = 1,   (D.2)

or equivalently,

∀ε > 0 :  lim_{n→∞} P{∀k ≥ n : ‖X_k − X‖ < ε} = 1,   (D.3)

or equivalently,

∀ε > 0 :  P( ⋃_{n=1}^∞ ⋂_{k=n}^∞ {‖X_k − X‖ < ε} ) = 1.   (D.4)

The sequence X_n converges in the mean square to X if

lim_{n→∞} E‖X_n − X‖² = 0,   (D.5)

where ‖·‖ is the Euclidean norm.
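Convergence in probability can be illustrated with a small Monte Carlo experiment (our own sketch; the weak law of large numbers guarantees that sample means of Uniform(0, 1) draws converge in probability to 1/2). The probability that the sample mean lies within ε of the limit approaches one as n grows.

```python
import random

def empirical_prob_close(n, eps, trials=2000, seed=0):
    """Estimate P{|mean of n Uniform(0,1) draws - 1/2| < eps} by simulation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(n)) / n
        if abs(mean - 0.5) < eps:
            hits += 1
    return hits / trials

p_small = empirical_prob_close(10, 0.05)    # modest n: the event often fails
p_large = empirical_prob_close(1000, 0.05)  # large n: the event almost always holds
```

Since the standard deviation of the sample mean scales as 1/√n, the deviation probability in (D.1) shrinks rapidly, which is what the comparison of `p_small` and `p_large` exhibits.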
Continuous Mapping Theorem [1, p. 8] Assume the random sequence X_n ∈ Rᵐ converges to X in probability, almost surely, or in distribution, and let g(·) be a continuous mapping. Then g(X_n) converges to g(X) in probability, almost surely, or in distribution, respectively.

Delta Theorem [2, Ch. 2.6; 3, p. 136] Fix µ ∈ Rᵐ and assume the random sequence X_n satisfies √n (X_n − µ) → N(0, C) in distribution. Let g : Rᵐ → Rᵐ be a mapping with nonzero Jacobian matrix J. Then √n (g(X_n) − g(µ)) → N(0, J C J⊤) in distribution.
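The delta theorem can be illustrated numerically in the scalar case. The sketch below (our own illustration; the choice g(x) = x², Uniform(0.5, 1.5) samples, and the helper name are assumptions made for the example) checks that the empirical variance of √n (g(X̄_n) − g(µ)) is close to the predicted g′(µ)² σ² = 4 · (1/12) = 1/3.

```python
import random

def delta_method_variance_check(n=400, trials=4000, seed=1):
    """Empirical variance of sqrt(n)*(g(Xbar_n) - g(mu)) for g(x) = x**2,
    with X_i ~ Uniform(0.5, 1.5), so mu = 1 and sigma^2 = 1/12.  The delta
    theorem predicts the limiting variance g'(mu)**2 * sigma**2 = 4/12."""
    rng = random.Random(seed)
    vals = []
    for _ in range(trials):
        xbar = sum(rng.uniform(0.5, 1.5) for _ in range(n)) / n
        vals.append(n ** 0.5 * (xbar ** 2 - 1.0))
    m = sum(vals) / trials
    return sum((v - m) ** 2 for v in vals) / (trials - 1)
```

Here J is simply g′(µ) = 2, so J C J⊤ reduces to 4σ², matching the variance the simulation estimates.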
Slutsky’s Theorem [1, p. 4]
(a) If X_n ∈ R converges to X in distribution and Y_n ∈ R converges to c in probability, then X_n Y_n converges in distribution to cX.
(b) If X_n ∈ R converges to X in distribution and Y_n ∈ R converges to c ≠ 0 in probability, then X_n / Y_n converges in distribution to X/c.
(c) If X_n ∈ R converges to X in distribution and Y_n ∈ R converges to c in probability, then X_n + Y_n converges in distribution to X + c.

References
[1] A. DasGupta, Asymptotic Theory of Statistics and Probability, Springer, 2008.
[2] T. S. Ferguson, A Course in Large Sample Theory, Chapman and Hall, 1996.
[3] P. K. Sen and J. M. Singer, Large Sample Methods in Statistics, Chapman and Hall, 1993.
Index
action, 1 space, 13, 54, 63, 71 algorithm Baum–Welch, 373, 378 coordinate ascent, 341 CuSum, 219, 222 dynamic programming, 374 expectationmaximization, 339, 378 forwardbackward, 373, 375 gradient descent, 336 least mean squares, 350 recursive least squares, 350 SAGE, 341 stochastic gradient descent, 350 Viterbi, 373 area under the curve, 52 asymptotic equality, 196 Bayes, 1, 12 Benjamini–Hochberg procedure, 63, 65 Berry–Esséen theorem, 198 Bessel function, 132 Bhattacharyya coefﬁcient, 155, 162, 214 biology, 340 Bonferroni correction, 63 Central Limit Theorem, 111, 186, 195, 198, 351 Lindeberg condition, 202 change point, 217, 224, 371 character recognition, 54 characteristic function, 125 Chernoff–Stein lemma, 189 classiﬁcation, 54, 65, 67, 90, 173 communications, 340 continuous mapping theorem, 325, 393 convergence, 1, 365, 393 almost surely, 326, 330, 336 in probability, 251, 326, 350 meansquare, 239, 244, 245, 326 correlation detector, 7 correlation statistic, 117 correlator detector, 108 cost, 1, 7, 259 a posteriori, 12, 26, 55, 266 absoluteerror, 267
additive, 266, 267, 270 function, 71 matrix, 54 nonuniform, 76 posterior, 89 squarederror, 267, 280, 297 uniform, 28, 38, 41, 56, 60, 73, 91, 129, 209, 263, 267 covariance kernel, 239, 244, 252 Cramér–Rao lower bound, 297, 300 in exponential family, 308 vector case, 307 cumulantgenerating function, 163, 272 CuSum algorithm, 219, 222 test statistic, 218 data mining, 63 decision boundary, 56 decision region, 54, 56 decision rule, 1, 208, 259 admissible, 60, 88 Bayesian, 10, 55, 89, 105, 130 complete class, 88 correlation, 112 deterministic, 17, 54, 71 equalizer, 15, 33, 62 linear, 121 locally most powerful, 82, 127, 138 MAP, 28, 57, 89, 116 maximumcorrelation, 116 minimax, 10, 32, 43, 93, 105 ML, 28 MPE, 28, 116, 152 Neyman–Pearson, 62, 77, 93 nondominated, 88 power, 43, 63, 78, 85 quadratic, 121, 122 randomized, 16, 17, 36, 37, 40, 58, 71, 88 sensitivity, 76 signiﬁcance level, 63, 80 uniformly most powerful, 77, 92 decision theory, 1 delta theorem, 325, 393 detection, 1
m ary, 54 average delay, 223 coherent, 107 quickest, 208 quickest change, 217 random process, 231 sequential, 208 singular, 238, 244, 246, 251 detector deﬂectionbased, 136 locally optimum, 127 digital communications, 28, 39, 44, 118 dimensionality reduction, 157 Dirac impulse, 239 discriminant function, 56 distance, 145 Ali–Silvey, 145, 152, 252 Bhattacharyya, 149, 171, 173, 182 Chernoff information, 150, 171 Euclidean, 118 Hamming, 7 Hellinger, 152 Itakura–Saito, 155, 252 Mahalanobis, 115, 119, 266 total variation, 145, 151 divergence binary Chernoff function, 238 binary KL, 160 Bregman, 155, 272 Chernoff, 145, 149, 168, 171, 233, 248 conditional KL, 236 f, 145, 152, 252 Jensen–Shannon, 152, 162 Kullback–Leibler, 1, 145, 161, 169, 233, 247, 272, 302, 309, 379 Pearson’s χ 2 , 151 Rényi, 150 symmetrized, 152 divergence rate Chernoff, 231, 252 Kullback–Leibler, 231, 252 dominated convergence theorem, 326, 331 eigenfunction, 239, 241, 245 eigenvalue, 114, 119, 123 eigenvector, 113, 119, 123, 158 entropy, 148, 379 cross, 379 epsilon contamination, 97 equalizer rule, 15 error probability Bayesian, 152, 170 conditional, 41, 55, 117, 160 false alarm, 208, 223 familywise rate, 63 lower bound, 134, 160, 162
pairwise, 173 upper bound, 134, 160, 172 Wald’s bounds, 213 estimation, 1 MAP, 373 estimator, 7 absoluteerror, 7, 262 asymptotically efﬁcient, 326, 329, 331 asymptotically Gaussian, 326, 329, 331 asymptotically unbiased, 326, 329 asymptotics, 325 bias, 297 biased, 281, 312, 322 conditionalmean, 260 conditionalmedian, 262 conditionalmode, 263 consistent, 319, 326, 329 efﬁcient, 300, 303, 308, 319 invariant, 259, 322 James–Stein, 355 leastsquares, 338, 349 linear, 259, 265, 268 linear MMSE, 268, 346, 358 M, 338 MAP, 263, 267, 276, 319, 370, 373 median, 333 minimax, 280 ML, 12, 319 MMAE, 262 MMSE, 122, 260, 276, 361, 364, 370, 372 MVUE, 280, 281, 297, 305, 324 randomized, 354 recursive, 347 recursive ML, 347 samplemean, 333 squarederror, 7, 260 unbiased, 259, 281, 284, 289, 297, 300, 305, 307 weighted leastsquares, 339 exponential family, 270, 303, 305, 341 canonical form, 271, 310 completeness, 289 meanvalue parameter, 273, 324 ML estimation, 323 natural parameter space, 271 exponential tilting, 194 factorization theorem, 280, 283, 290 false alarm rate, 221 false discoveries, 63 false discovery rate, 64 family of distributions complete, 286, 288 Gaussian, 291 Gaussian mixture, 336 general mixture, 337, 341 incomplete, 287
locationscale, 315 Poisson, 315 reparameterization, 303, 309 ﬁlter causal, 358 Kalman, 358, 360 linear, 317, 358 nonlinear, 358, 367, 369 Fisher information, 1, 297, 298, 341 matrix, 306, 311, 324, 333, 337 metric, 309
Markov, 164, 194, 214 Pinsker, 156 Vajda, 156 information theory, 1, 145, 176 innovation sequence, 358 integral Riemann–Stieltjes, 239 stochastic, 239
game theory, 11, 15 genomics, 63
Kalman ﬁlter, 358, 360 extended, 367 scalar case, 365 steady state, 365 Karhunen–Loève decomposition, 239
handwritten digit recognition, 54 Hoeffding problem, 189 hypothesis alternative, 25, 101 composite, 72 dummy, 116 null, 25, 72 simple, 72 hypothesis test, 6 acceptance region, 25 Bayesian, 25, 54, 73, 188, 245 binary, 6, 160, 167, 208, 233 Cauchy, 78, 83, 86, 110 composite, 7, 71, 105, 106, 220 Gaussian, 58, 68, 71, 77, 84, 101 Laplacian, 102, 108 locally most powerful, 71 m ary, 90, 105, 115, 231, 283 minimax, 25, 32, 54, 58 multiple, 54 Neyman–Pearson, 25, 40, 54, 189, 210 nondominated, 71, 91 Poisson, 67, 100 power, 40 rejection region, 25 robust, 71, 92 sequential, 208 signiﬁcance level, 40, 43 ternary, 56, 67, 91, 182 uniformly most powerful, 71 imaging, 340 inequality Bonferroni, 176 Cauchy–Schwarz, 118, 137, 163, 299, 305, 307 Chebyshev, 164, 167 data processing, 145, 153, 160, 176, 252 Fano, 176 generalized Fano, 176 information, 297, 299, 311 Jensen, 145, 152, 286
Jacobian, 273, 309, 323, 324, 393 joint stochastic boundedness, 93
large deviations, 126, 160, 184 Bahadur–Rao theorem, 196 Cramér’s theorem, 185 Gärtner–Ellis theorem, 201 rate function, 165 reﬁned, 194 upper bound, 167 law of iterated expectations, 289 likelihood equation, 320, 336, 348 likelihood function, 12, 302, 319 unbounded, 336 likelihood ratio, 56, 63, 87, 106, 107, 233, 250 clipped, 97 generalized, 84 generalized test, 71, 130, 220 monotone, 79, 154 test, 27, 77, 99, 107 limiter soft, 110 loss, 7 KL divergence, 270 loss function surrogate, 145 zeroone, 7 Lyapunov equation, 365 machine learning, 1, 152, 275 marginal distribution, 12 matched ﬁlter, 105, 108 matrix circulant Toeplitz, 232 determinant, 232 DFT, 232 Hermitian transpose, 232 Hessian, 321, 336 trace, 137, 234, 297 measure, 1, 4, 92 counting, 4
Lebesgue, 4, 260 Mercer’s theorem, 243 minimax, 1 mismatched model, 338 momentgenerating function, 126, 163 MonteCarlo simulations, 298 mutual information, 148, 178 navigation, 360, 364 noise additive, 139 Cauchy, 110 colored, 246, 248 correlated, 105, 112, 118 Gaussian, 105 generalized Gaussian, 139 i.i.d., 105, 118 Laplacian, 108, 128 multiplicative, 139 nonGaussian, 105, 111, 364 whitening, 105, 113 observations, 1, 259, 304 asymptotic setup, 325 missing, 340, 378 sequence, 360 optimization complexity, 374 convex, 46, 165, 339 linear, 62 orthogonality principle, 260, 268, 347, 358, 362
p value, 63 parameter, 71 estimation, 84 identiﬁability, 302, 309, 330, 331, 333 nonrandom, 87, 91, 280, 297 on boundary, 320, 328 orthogonality, 309 precision, 274 random, 72, 87, 90, 260, 297, 311 set, 259 space, 71 vector, 297, 306, 321, 333 parameter estimation, 6 Bayesian, 259 for linear Gaussian models, 265 payoff function, 11, 15 performance metrics, 7 deﬂection criterion, 105, 135 prior, 9 beta, 278 conjugate, 273 Dirichlet, 275 exponential, 276 gamma, 274, 277
least favorable, 14, 33, 62, 94 uniform, 118, 132, 152, 177, 277, 319 probability a posteriori, 26, 55 conditional, 2 density function, 2, 196 detection, 115, 131 false alarm, 40, 72, 111, 115, 128, 129, 168 mass function, 2 miss, 26, 40, 72, 129, 168 nominal distribution, 97 of detection, 40 space, 231 tail, 126 probability distribution, 2 binomial, 287 exponential, 287 geometric mixture, 190 mixed, 278 multinomial, 275 nominal, 97 probability measure, 2 absolutely continuous, 251 dominated, 251 equivalent, 251 orthogonal, 251 pseudoinverse, 339 Q function asymptotics, 117, 133, 134 quantization, 157 radar, 28, 40, 54, 72, 118, 129, 364 Radon–Nikodym derivative, 231, 251 random process bandlimited, 240, 242 Brownian motion, 239, 242, 246, 248 continuoustime, 231, 238 counting process, 248 discretetime, 231 dynamical system, 358, 367 Gauss–Markov, 240 Gaussian, 231, 234, 239, 248, 252, 317 hidden Markov model, 358, 369, 372 independent increments, 239 Markov, 231, 235, 369, 372 meansquare continuous, 243 Ornstein–Uhlenbeck, 240, 242 orthogonal increments, 243 periodic stationary, 232, 240, 241, 248 Poisson, 231, 248 power spectral mass, 242 spectral density, 234, 239, 252 spectral representation, 243 stationary, 231, 234, 252 weakly stationary, 239
white Gaussian, 239 random variable continuous, 2, 196 discrete, 2 lattice type, 196, 197 orthogonal, 241 Poisson, 276 randomization parameter, 105 Rao–Blackwell theorem, 280, 284, 289 Receiver Operating Characteristic, 43, 200 lower bound, 172 upper bound, 160 regularity conditions, 298, 306, 319, 329 relative entropy, 145 reparameterization, 303, 317, 322, 324 Riccati equation, 350, 363, 365 risk, 8 Bayes, 9, 55, 209 conditional, 8, 25, 54, 88, 91, 259, 280, 297 curve, 34, 41 line, 34, 41 minimax, 40 posterior, 12, 28, 76, 260 set, 59, 88 vector, 59, 88 worstcase, 10, 58, 61 robotics, 360 saddlepoint, 15, 33, 94, 126, 320 sample size, 208 score function, 331, 333 sequential test Bayesian, 210 nonBayesian, 210 SPRT, 210 signal, 6 antipodal, 118 classiﬁcation, 7, 105, 115 correlated, 105 detection, 38, 105, 176 deterministic, 106, 244 energy, 132 energy constraint, 118 estimation, 6, 361 ﬁxedlag smoothing, 364 Gaussian, 120, 246 impulse, 135 nonGaussian, 136 Poisson, 139
prediction, 6, 361 pseudo, 112 random, 106 reconstruction, 244 sampling, 238 selection, 105, 117, 145 sinusoid, 131, 138, 301, 316, 368 smoothing, 6 weak, 127 signal to noise ratio, 117, 300, 301 generalized, 114 Slutsky’s theorem, 393 software testing, 110 sonar, 54 speech recognition, 54 state, 1 nonrandom, 1 of Nature, 8 random, 1 sequence, 360 space, 13, 54, 71 trellis, 374 stochastic ordering, 93 stopping rule, 208 Shewhart, 219 stopping time, 208 sufﬁcient statistic complete, 334 sufﬁcient statistics, 1, 280, 281, 341 minimal, 283, 289 trivial, 282, 288 target detection, 65 Taylor series, 82, 248, 331, 348, 350, 364, 367 template matching, 129, 135 threshold, 27, 63 trace of a matrix, 137 tracking, 360, 364 transform convex conjugate, 165 discrete Fourier, 126, 232 Esscher, 194, 198 Fourier, 125, 126, 242 Karhunen–Loève, 240 Laplace, 126, 288, 290 Legendre–Fenchel, 165 unitary, 232 utility function, 7