Multivariate Exponential Families: A Concise Guide to Statistical Inference (SpringerBriefs in Statistics) 3030818993, 9783030818999

This book provides a concise introduction to exponential families. Parametric families of probability distributions and their properties are extensively studied in the literature on statistical modeling and inference.


English Pages 152 [147] Year 2021


Table of contents :
Preface
Contents
Symbols
1 Introduction: Aims and Outline
2 Parametrizations and Basic Properties
2.1 Definition and Examples
2.2 Order and Minimal Representation
2.3 Structural Properties and Natural Parametrization
2.4 Analytical Properties and Mean Value Parametrization
3 Distributional and Statistical Properties
3.1 Generating Functions
3.2 Marginal and Conditional Distributions
3.3 Product Measures
3.4 Sufficiency and Completeness
3.5 Score Statistic and Information Matrix
3.6 Divergence and Distance Measures
4 Parameter Estimation
4.1 Maximum Likelihood Estimation
4.2 Constrained Maximum Likelihood Estimation
4.3 Efficient Estimation
4.4 Asymptotic Properties
5 Hypotheses Testing
5.1 One-Sided Test Problems
5.2 Two-Sided Test Problems
5.3 Testing for a Selected Parameter
5.4 Testing for Several Parameters
6 Exemplary Multivariate Applications
6.1 Negative Multinomial Distribution
6.2 Dirichlet Distribution
6.3 Generalized Order Statistics
References
Index


SPRINGER BRIEFS IN STATISTICS

Stefan Bedbur Udo Kamps

Multivariate Exponential Families: A Concise Guide to Statistical Inference

SpringerBriefs in Statistics

SpringerBriefs present concise summaries of cutting-edge research and practical applications across a wide spectrum of fields. Featuring compact volumes of 50 to 125 pages, the series covers a range of content from professional to academic. Typical topics might include:

• A timely report of state-of-the art analytical techniques
• A bridge between new research results, as published in journal articles, and a contextual literature review
• A snapshot of a hot or emerging topic
• An in-depth case study or clinical example
• A presentation of core concepts that students must understand in order to make independent contributions

SpringerBriefs in Statistics showcase emerging theory, empirical research, and practical application in Statistics from a global author community. SpringerBriefs are characterized by fast, global electronic dissemination, standard publishing contracts, standardized manuscript preparation and formatting guidelines, and expedited production schedules.

More information about this series at http://www.springer.com/series/8921

Stefan Bedbur • Udo Kamps

Multivariate Exponential Families: A Concise Guide to Statistical Inference

Stefan Bedbur Institute of Statistics RWTH Aachen University Aachen, Germany

Udo Kamps Institute of Statistics RWTH Aachen University Aachen, Germany

ISSN 2191-544X ISSN 2191-5458 (electronic)
SpringerBriefs in Statistics
ISBN 978-3-030-81899-9 ISBN 978-3-030-81900-2 (eBook)
https://doi.org/10.1007/978-3-030-81900-2

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

Parametric families of probability distributions and their properties are extensively studied in the literature on statistical modeling and inference. Exponential families of distributions comprise density functions of a particular form, which enables general assertions and leads to nice features. A literature review of books in the field of Mathematical Statistics discussing exponential families and their properties is given in the first chapter.

With a focus on parameter estimation and hypotheses testing, we introduce the reader to properties of multivariate and multiparameter exponential families and show a variety of detailed examples. The material is widely self-contained and written in a mathematical setting. It may serve as a concise course on exponential families in a systematic structure. Supplements and comments can be found on the website https://www.isw.rwth-aachen.de/EF_SpringerBriefs.php.

Dear reader, we would very much appreciate receiving your remarks, criticism, and hints! We hope you will benefit from our introduction to exponential families.

Aachen, Germany
May 2021

Stefan Bedbur Udo Kamps


Contents

1 Introduction: Aims and Outline ................................................ 1
2 Parametrizations and Basic Properties ......................................... 5
  2.1 Definition and Examples ................................................... 5
  2.2 Order and Minimal Representation .......................................... 17
  2.3 Structural Properties and Natural Parametrization ......................... 23
  2.4 Analytical Properties and Mean Value Parametrization ...................... 32
3 Distributional and Statistical Properties ..................................... 43
  3.1 Generating Functions ...................................................... 43
  3.2 Marginal and Conditional Distributions .................................... 47
  3.3 Product Measures .......................................................... 49
  3.4 Sufficiency and Completeness .............................................. 54
  3.5 Score Statistic and Information Matrix .................................... 57
  3.6 Divergence and Distance Measures .......................................... 60
4 Parameter Estimation .......................................................... 65
  4.1 Maximum Likelihood Estimation ............................................. 66
  4.2 Constrained Maximum Likelihood Estimation ................................. 71
  4.3 Efficient Estimation ...................................................... 78
  4.4 Asymptotic Properties ..................................................... 84
5 Hypotheses Testing ............................................................ 93
  5.1 One-Sided Test Problems ................................................... 93
  5.2 Two-Sided Test Problems ................................................... 101
  5.3 Testing for a Selected Parameter .......................................... 105
  5.4 Testing for Several Parameters ............................................ 109
6 Exemplary Multivariate Applications ........................................... 117
  6.1 Negative Multinomial Distribution ......................................... 117
  6.2 Dirichlet Distribution .................................................... 125
  6.3 Generalized Order Statistics .............................................. 131
References ...................................................................... 137
Index ........................................................................... 141

Symbols

N                      {1, 2, 3, . . . }, set of natural numbers
N0                     {0, 1, 2, . . . }, set of natural numbers including zero
Z                      {. . . , −2, −1, 0, 1, 2, . . . }, set of integers
R                      (−∞, ∞), set of real numbers
C                      {a + ib : a, b ∈ R}, set of complex numbers
An                     ×ni=1 A, n-fold Cartesian product of the set A
1A                     indicator function of the set A; 1A(x) = 1 if x ∈ A and zero otherwise
int(A)                 interior of the set A
at, At                 transpose of the vector a and of the matrix A
A > 0 (A ≥ 0)          matrix A is positive (semi)definite
A ≥ B                  matrix A − B is positive semidefinite
∇g                     (∂g/∂x1, . . . , ∂g/∂xk)t, vector of partial derivatives (gradient) of g : A ⊂ Rk → R
Hg                     (∂2g/∂xi∂xj)1≤i,j≤k, Hessian matrix of g : A ⊂ Rk → R
Dg                     (∇g1, . . . , ∇gn)t, Jacobian matrix of g = (g1, . . . , gn) : A ⊂ Rk → Rn
(X, B)                 measurable space of observations; sample space X endowed with σ-algebra B
T : (X, B) → (Y, C)    mapping T : X → Y is (B − C)-measurable
Bk                     Borel sets of Rk
μX                     counting measure on the power set of X
λX                     Lebesgue measure on the Borel sets of X

poi(•)                 {poi(a) : a ∈ (0, ∞)}, family of Poisson distributions
b(n0, •)               {b(n0, p) : p ∈ (0, 1)}, family of binomial distributions with n0 ∈ N fixed
nb(r0, •)              {nb(r0, p) : p ∈ (0, 1)}, family of negative binomial distributions with r0 ∈ N fixed
geo(•)                 {nb(1, p) : p ∈ (0, 1)}, family of geometric distributions
kn(•, •)               {kn(α, p) : α ∈ (0, ∞), p ∈ (0, 1)}, family of Kemp discrete normal distributions
g(•, •)                {g(p, τ) : p ∈ (0, 1), τ ∈ R}, family of Good distributions
m(n0, •)               {m(n0, p) : p = (p1, . . . , pk) ∈ (0, 1)k, Σki=1 pi = 1}, family of multinomial distributions with n0 ∈ N fixed
nm(r0, •)              {nm(r0, p) : p = (p1, . . . , pk) ∈ (0, 1)k, Σki=1 pi < 1}, family of negative multinomial distributions with r0 ∈ N fixed
mps(h, •)              {mps(h, ϑ) : ϑ ∈ Θ}, family of multivariate power series distributions with Θ ⊂ (0, ∞)k and function h fixed
N(•, •)                {N(μ, σ2) : μ ∈ R, σ2 ∈ (0, ∞)}, family of univariate normal distributions
N(•, σ02)              {N(μ, σ02) : μ ∈ R}, family of univariate normal distributions with σ02 ∈ (0, ∞) fixed
N(μ0, •)               {N(μ0, σ2) : σ2 ∈ (0, ∞)}, family of univariate normal distributions with μ0 ∈ R fixed
Nk(•, Σ0)              {Nk(μ, Σ0) : μ ∈ Rk}, family of multivariate normal distributions with Σ0 ∈ Rk×k fixed
IN(•, •)               {IN(μ, η) : μ, η ∈ (0, ∞)}, family of inverse normal distributions
G(•, •)                {G(α, β) : α, β ∈ (0, ∞)}, family of gamma distributions
E(•)                   {G(α, 1) : α ∈ (0, ∞)}, family of exponential distributions
D(•)                   {D(β) : β ∈ (0, ∞)k+1}, family of Dirichlet distributions
B(•, •)                {D(β1, β2) : β1, β2 ∈ (0, ∞)}, family of beta distributions
L(h, •)                {L(h, a) : a ∈ (0, ∞)k}, family of Liouville distributions with function h fixed

Chapter 1

Introduction: Aims and Outline

The study of parametric families of probability distributions is a main foundation of statistical modeling and inference. Although some probability distributions may be quite different at first sight with respect to their density functions, they can be observed to share a similar structure and common nice properties. Such findings always motivate to look for explanations and to examine the reasons behind the similarities. Let us take a look at three distributions to illustrate a starting point for the topic of exponential families. A basic discrete, one-parameter distribution is the binomial distribution with counting density function

$$f_p(x) \;=\; \binom{n}{x}\, p^x (1-p)^{n-x}, \qquad x \in \{0, 1, \ldots, n\}, \tag{1.1}$$

for some fixed natural number n and parameter p ∈ (0, 1). An important continuous two-parameter distribution is the gamma distribution with Lebesgue density function

$$f_{\alpha,\beta}(x) \;=\; \frac{1}{\Gamma(\beta)\,\alpha^{\beta}}\, x^{\beta-1} \exp\Big\{-\frac{x}{\alpha}\Big\}, \qquad x \in (0, \infty), \tag{1.2}$$

and parameters α ∈ (0, ∞) and β ∈ (0, ∞). As a third distribution, we consider a multivariate normal distribution with Lebesgue density function

$$f_{\mu}(x) \;=\; \frac{1}{(2\pi)^{k/2}}\,\exp\Big\{-\frac{1}{2}\,(x-\mu)^t (x-\mu)\Big\}, \qquad x \in \mathbb{R}^k, \tag{1.3}$$

for k ∈ N, its covariance matrix given by the k-dimensional unity matrix, and parameter vector μ = (μ1, . . . , μk)t ∈ Rk. These distributions are different with respect to their support, dimension, and number of parameters. On the other hand, by rewriting the density functions in formulas (1.1)–(1.3) according to

$$f_p(x) \;=\; (1-p)^n \exp\Big\{\ln\Big(\frac{p}{1-p}\Big)\, x\Big\}\,\binom{n}{x}, \tag{1.4}$$

$$f_{\alpha,\beta}(x) \;=\; \frac{1}{\Gamma(\beta)\,\alpha^{\beta}}\,\exp\Big\{-\frac{x}{\alpha} + \beta \ln(x)\Big\}\,\frac{1}{x}, \tag{1.5}$$

and

$$f_{\mu}(x) \;=\; \exp\Big\{-\frac{\mu^t \mu}{2}\Big\}\,\exp\Bigg\{\sum_{j=1}^{k} \mu_j x_j\Bigg\}\,\frac{\exp\{-x^t x/2\}}{(2\pi)^{k/2}}, \tag{1.6}$$

all of the density functions exhibit a common multiplicative structure

$$f_{\vartheta}(x) \;=\; C(\vartheta)\,\exp\Bigg\{\sum_{j=1}^{k} Z_j(\vartheta)\, T_j(x)\Bigg\}\, h(x), \qquad x \in \mathcal{X}, \tag{1.7}$$

where ϑ denotes the respective parameter (vector) and X is the support of the distribution. Here, the first factor C(ϑ) is a normalizing constant depending only on ϑ. In the exponent of the second factor, Z1(ϑ), . . . , Zk(ϑ) are constants depending only on ϑ, and T1, . . . , Tk are mappings on X, which are free of ϑ (with k = 1 in formula (1.4) and k = 2 in formula (1.5)). The third factor h is a function on X not depending on ϑ. Distribution families with density functions of the form (1.7) are called exponential families.

Having seen that the aforementioned distributions constitute exponential families (see formulas (1.4)–(1.6)), several questions arise at this point, such as the following:

• Are there other distributions (uni- and multivariate, one- and multiparameter distributions) possessing exponential family structure?
• What are special properties of an exponential family, and are there advantages when having statistical inference in mind?
• What is the use of a unified approach and a general theory of exponential families?

It will turn out that the exponential family structure comprises many well known distributions and that there are major benefits when considering exponential families of distributions in statistical inference. Various properties are found to hold true

under fairly general assumptions on the underlying exponential family, which gives rise to a unified approach.

Exponential families are usually discussed in textbooks and monographs on Mathematical Statistics. There are respective examples, recurring examples, or chapters, such as in Bickel and Doksum [16], Ferguson [24], Keener [35], Kotz et al. [37, Chapter 54 by M. Casalis], Lehmann and Casella [39], Lehmann and Romano [40], Liese and Miescke [42], Lindsey [44], Pfanzagl [50], Shao [55], and Witting [63], or books dedicated to exponential families, such as the fundamental accounts of Brown [20], Jørgensen [29], Letac [41], Barndorff-Nielsen [4], and Sundberg [56]. In particular, the books of Witting [63] and Brown [20] influenced the present text on exponential families.

In the sequel, we focus on properties of multivariate and multiparameter exponential families along with a variety of detailed examples as well as on the use of exponential families in parameter estimation and hypotheses testing, mainly based on the maximum-likelihood principle. Nevertheless, the present introductory material also serves as a solid basis for other methods of statistical inference within exponential families.

Chapters 2–5 offer a widely self-contained, mathematically rigorous and concise course on exponential families in a systematic structure for Master and Ph.D. students. The text can be considered an introduction to Mathematical Statistics restricted to using exponential families. As prerequisites, the reader should have a solid knowledge of calculus, linear algebra, and of an introduction to probability and measure theory. For the chapter on hypotheses testing, a former introductory course on statistics may be helpful.

In Chaps. 2 and 3 on parametrizations and on distributional and statistical properties of exponential families, the proofs of assertions are not detailed; however, ideas and references are given throughout. Thereafter, in Chaps. 4 and 5 on parameter estimation and hypotheses testing within exponential families, proofs and derivations are worked out. In Chap. 6, three multiparameter and multivariate families of distributions are exemplarily examined in order to utilize and illustrate the results on exponential families, and, by this, the benefit of a general approach is demonstrated.

In the present concise introduction to multiparameter exponential families, several topics have been left out, such as the field of variance functions, cuts, advanced structural results, confidence regions, approaches other than maximum likelihood, and Bayesian inference. For further reading on the theory and applications of exponential families, we refer to, for example, Brown [20], Lehmann and Casella [39], and Liese and Miescke [42] regarding Bayesian methods; Brown [20], Letac [41], Barndorff-Nielsen [4], Witting [63], Kotz et al. [37, Chapter 54], and Liese and Miescke [42] regarding structural considerations and statistical inference; as well as to the recent monograph Sundberg [56] highlighting different fields of applications in statistics.

Chapter 2

Parametrizations and Basic Properties

2.1 Definition and Examples

Motivated by the examples in the introduction, we start with the formal definition of an exponential family by specifying the form of a probability density function.

Definition 2.1 Let (X, B) be a measurable space and P = {Pϑ : ϑ ∈ Θ} be a parametric family of probability distributions on B, where Θ ≠ ∅ denotes a set of real parameters or parameter vectors. If there exist

• a σ-finite measure μ on B dominating P,
• mappings C, Z1, . . . , Zk : Θ → R, and
• mappings h, T1, . . . , Tk : (X, B) → (R1, B1) with h ≥ 0

for some integer k ∈ N in such a way that

$$f_\vartheta(x) \;=\; \frac{dP_\vartheta}{d\mu}(x) \;=\; C(\vartheta)\,\exp\Bigg\{\sum_{j=1}^{k} Z_j(\vartheta)\,T_j(x)\Bigg\}\, h(x), \qquad x \in \mathcal{X}, \tag{2.1}$$

is a μ-density of Pϑ for every ϑ ∈ Θ, then P is said to form an exponential family (EF).

By the definition of the mappings involved in formula (2.1), the functions C, Z1, . . . , Zk do not depend on x ∈ X, and the functions h, T1, . . . , Tk do not depend on ϑ ∈ Θ.


It is clear from the definition that an EF may have various different representations. The mappings C, Z1, . . . , Zk and h, T1, . . . , Tk as well as the number k and the dominating measure μ in formula (2.1) are not uniquely determined. For instance, for some j ∈ {1, . . . , k}, we may replace Zj(ϑ) by Zj(ϑ) − a and h by h exp{aTj} for some constant a ∈ R to obtain another representation of the densities. Likewise, a shift of Tj and a respective modification of C gives a new representation. Moreover, although being artificial, setting Zk+1 = 0 or Tk+1 = 0 yields another representation with k + 1 summands in the exponent.

Since ∫ fϑ dμ = 1 for all ϑ ∈ Θ, C(ϑ) is just a normalizing constant.

Lemma 2.2 In the situation of Definition 2.1, we have for every ϑ ∈ Θ

$$0 \;<\; C(\vartheta) \;=\; \Bigg[\int \exp\Bigg\{\sum_{j=1}^{k} Z_j(\vartheta)\,T_j(x)\Bigg\}\, h(x)\, d\mu(x)\Bigg]^{-1} \;<\; \infty.$$
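A small numerical illustration of Lemma 2.2, written as a Python sketch (assuming NumPy and SciPy are available; the parameter values are arbitrary choices for the check): the normalizing constant of the gamma density (1.2), written in the form (2.1) with Z1 = −1/α, Z2 = β, T1(x) = x, T2(x) = ln(x) and h(x) = 1/x, is computed by numerical integration and compared with the closed form 1/(Γ(β) α^β).

```python
import numpy as np
from scipy import integrate, special

# Gamma density (1.2) in the form (2.1):
#   Z_1 = -1/alpha, Z_2 = beta, T_1(x) = x, T_2(x) = ln(x), h(x) = 1/x.
# Lemma 2.2: C = [ integral of exp{Z_1 T_1(x) + Z_2 T_2(x)} h(x) dx ]^{-1}.
def C_numeric(alpha, beta):
    integrand = lambda x: np.exp(-x / alpha + beta * np.log(x)) / x
    value, _ = integrate.quad(integrand, 0.0, np.inf)
    return 1.0 / value

def C_closed_form(alpha, beta):
    return 1.0 / (special.gamma(beta) * alpha ** beta)

alpha, beta = 2.5, 1.7   # arbitrary values, only used for the comparison
print(C_numeric(alpha, beta), C_closed_form(alpha, beta))   # both approx. 0.2318
```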

Many well known univariate and multivariate distributions form EFs when varying one or more of their parameters. Here, the dominating measure μ is usually the counting measure (in the discrete case) or the Lebesgue measure (in the continuous case). We show some prominent examples; in particular, the distributions in the introduction turn out to form EFs when varying the parameters. Throughout the text, superscript t at a vector or matrix denotes transposition.

Examples: poi(•), b(n0, •), nb(r0, •), geo(•), kn(•, •), g(•, •), m(n0, •)

In the following examples and throughout, let μX denote the counting measure on the power set of X.

Example 2.3 (Poisson Distribution) The counting density of the Poisson distribution poi(a), a ∈ (0, ∞), is given by

$$f_a(x) \;=\; e^{-a}\,\frac{a^x}{x!} \;=\; e^{-a}\, e^{\ln(a)\,x}\,\frac{1}{x!}, \qquad x \in \mathbb{N}_0.$$

Hence, poi(•) = {poi(a) : a ∈ (0, ∞)} forms an EF with representation (2.1) given by μ = μN0, k = 1,

$$C(a) = e^{-a}, \qquad Z_1(a) = \ln(a), \qquad a \in (0, \infty),$$

and

$$T_1(x) = x, \qquad h(x) = \frac{1}{x!}, \qquad x \in \mathbb{N}_0.$$
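As an illustration of representation (2.1) for Example 2.3 (a sketch, not part of the original text; it assumes NumPy and SciPy and uses an arbitrary parameter value), the factors C, Z1, T1, and h can be multiplied together and compared with a reference implementation of the Poisson probability mass function:

```python
import numpy as np
from scipy import stats, special

# Density (2.1) for poi(a) with C(a) = exp(-a), Z_1(a) = ln(a),
# T_1(x) = x and h(x) = 1/x!  (Example 2.3).
def ef_pmf(x, a):
    return np.exp(-a) * np.exp(np.log(a) * x) / special.factorial(x)

a = 3.2                          # arbitrary parameter value for the check
x = np.arange(0, 15)
print(np.allclose(ef_pmf(x, a), stats.poisson.pmf(x, a)))   # True
```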

Example 2.4 (Binomial Distribution) The counting density of the binomial distribution b(n, p), n ∈ N, p ∈ (0, 1), is given by

$$f_{n,p}(x) \;=\; \binom{n}{x}\, p^x (1-p)^{n-x} \;=\; (1-p)^n \exp\Big\{\ln\Big(\frac{p}{1-p}\Big)\, x\Big\}\,\binom{n}{x}, \qquad x \in \{0, 1, \ldots, n\} = \mathcal{X}.$$

Hence, for fixed n = n0 ∈ N, b(n0, •) = {b(n0, p) : p ∈ (0, 1)} forms an EF with representation (2.1) given by μ = μX, k = 1,

$$C(p) = (1-p)^{n_0}, \qquad Z_1(p) = \ln\Big(\frac{p}{1-p}\Big), \qquad p \in (0, 1),$$

and

$$T_1(x) = x, \qquad h(x) = \binom{n_0}{x}, \qquad x \in \mathcal{X}.$$

Example 2.5 (Negative Binomial Distribution) The counting density of the negative binomial distribution nb(r, p), r ∈ N, p ∈ (0, 1), is given by

$$f_{r,p}(x) \;=\; \binom{x+r-1}{x}\, p^r (1-p)^x \;=\; \binom{x+r-1}{x}\, p^r\, e^{\ln(1-p)\,x}, \qquad x \in \mathbb{N}_0.$$

Hence, for fixed r = r0 ∈ N, nb(r0, •) = {nb(r0, p) : p ∈ (0, 1)} forms an EF with representation (2.1) given by μ = μN0, k = 1,

$$C(p) = p^{r_0}, \qquad Z_1(p) = \ln(1-p), \qquad p \in (0, 1),$$

and

$$T_1(x) = x, \qquad h(x) = \binom{x+r_0-1}{x}, \qquad x \in \mathbb{N}_0.$$

In particular, for r0 = 1, the EF geo(•) = nb(1, •) of geometric distributions results.

Example 2.6 (Kemp Discrete Normal Distribution) The counting density of the Kemp discrete normal distribution kn(α, p), α ∈ (0, ∞), p ∈ (0, 1), is given by

$$f_{\alpha,p}(x) \;=\; \Bigg(\sum_{j=-\infty}^{\infty} \alpha^j p^{j(j-1)/2}\Bigg)^{-1} \alpha^x p^{x(x-1)/2} \;=\; \Bigg(\sum_{j=-\infty}^{\infty} \alpha^j p^{j(j-1)/2}\Bigg)^{-1} \exp\Big\{\ln(\alpha)\,x + \ln(p)\,\frac{x(x-1)}{2}\Big\}, \qquad x \in \mathbb{Z}.$$

Hence, kn(•, •) = {kn(α, p) : α ∈ (0, ∞), p ∈ (0, 1)} forms an EF with representation (2.1) given by μ = μZ, k = 2,

$$C(\alpha, p) = \Bigg(\sum_{j=-\infty}^{\infty} \alpha^j p^{j(j-1)/2}\Bigg)^{-1}, \qquad Z_1(\alpha, p) = \ln(\alpha), \qquad Z_2(\alpha, p) = \ln(p), \qquad \alpha \in (0, \infty),\ p \in (0, 1),$$

and

$$T_1(x) = x, \qquad T_2(x) = \frac{x(x-1)}{2}, \qquad h(x) = 1, \qquad x \in \mathbb{Z}.$$
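The normalizing series in C(α, p) converges very quickly because its terms decay roughly like p to the power j²/2, so it can be evaluated by simple truncation. The following Python sketch (illustrative only; the truncation bound and parameter values are arbitrary choices) computes C(α, p) in this way and checks that the resulting counting density nearly sums to one:

```python
import numpy as np

# Normalizing constant of the Kemp discrete normal kn(alpha, p), Example 2.6:
#   C(alpha, p) = ( sum_{j=-inf}^{inf} alpha^j p^{j(j-1)/2} )^{-1}.
# The terms vanish super-geometrically, so truncating the sum is harmless.
def kemp_C(alpha, p, trunc=200):
    j = np.arange(-trunc, trunc + 1)
    return 1.0 / np.sum(alpha ** j * p ** (j * (j - 1) / 2.0))

def kemp_pmf(x, alpha, p):
    return kemp_C(alpha, p) * alpha ** x * p ** (x * (x - 1) / 2.0)

alpha, p = 1.3, 0.4              # arbitrary parameter values
x = np.arange(-10, 11)
print(kemp_pmf(x, alpha, p).sum())   # close to 1
```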

Example 2.7 (Good Distribution) The counting density of the Good distribution g(p, τ), p ∈ (0, 1), τ ∈ R, is given by

$$f_{p,\tau}(x) \;=\; \Bigg(\sum_{j=1}^{\infty} p^j j^{\tau}\Bigg)^{-1} p^x x^{\tau} \;=\; \Bigg(\sum_{j=1}^{\infty} p^j j^{\tau}\Bigg)^{-1} \exp\{\ln(p)\,x + \tau \ln(x)\}, \qquad x \in \mathbb{N}.$$

Hence, g(•, •) = {g(p, τ) : p ∈ (0, 1), τ ∈ R} forms an EF with representation (2.1) given by μ = μN, k = 2,

$$C(p, \tau) = \Bigg(\sum_{j=1}^{\infty} p^j j^{\tau}\Bigg)^{-1}, \qquad Z_1(p, \tau) = \ln(p), \qquad Z_2(p, \tau) = \tau, \qquad p \in (0, 1),\ \tau \in \mathbb{R},$$

and

$$T_1(x) = x, \qquad T_2(x) = \ln(x), \qquad h(x) = 1, \qquad x \in \mathbb{N}.$$

Example 2.8 (Multinomial Distribution) The counting density of the multinomial distribution m(n, p), n ∈ N, p = (p1, . . . , pk)t ∈ Θ = {(q1, . . . , qk)t ∈ (0, 1)k : Σki=1 qi = 1}, where k ≥ 2, is given by

$$f_{n,p}(x) \;=\; \frac{n!}{\prod_{i=1}^{k} x_i!}\,\prod_{j=1}^{k} p_j^{x_j} \;=\; \exp\Bigg\{\sum_{j=1}^{k} \ln(p_j)\,x_j\Bigg\}\,\frac{n!}{\prod_{i=1}^{k} x_i!}$$

for x = (x1, . . . , xk)t ∈ X = {(y1, . . . , yk)t ∈ N0k : Σki=1 yi = n}. Hence, for fixed n = n0 ∈ N, m(n0, •) = {m(n0, p) : p ∈ Θ} forms an EF with representation (2.1) given by μ = μX,

$$C(p) = 1, \qquad Z_j(p) = \ln(p_j), \qquad p \in \Theta,$$

and

$$T_j(x) = x_j, \qquad h(x) = \frac{n_0!}{\prod_{i=1}^{k} x_i!}, \qquad x \in \mathcal{X}, \qquad 1 \le j \le k.$$
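A short numerical check of the representation in Example 2.8 (again a sketch assuming NumPy and SciPy, with arbitrarily chosen x and p): evaluating C(p) exp{Σj Zj(p)Tj(x)} h(x) reproduces the multinomial probability mass function.

```python
import numpy as np
from scipy import stats, special

# Counting density of m(n0, p) in the form (2.1), Example 2.8:
#   C(p) = 1, Z_j(p) = ln(p_j), T_j(x) = x_j, h(x) = n0!/(x_1! ... x_k!).
def multinomial_ef_pmf(x, p):
    n0 = int(np.sum(x))
    h = special.factorial(n0) / np.prod(special.factorial(x))
    return np.exp(np.sum(np.log(p) * x)) * h

x = np.array([2, 1, 4])                  # arbitrary count vector
p = np.array([0.2, 0.5, 0.3])            # arbitrary probability vector
print(np.isclose(multinomial_ef_pmf(x, p),
                 stats.multinomial.pmf(x, n=x.sum(), p=p)))   # True
```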

More examples of discrete multivariate EFs can be found in [28, pp. 12–16] including, among others, the family of multivariate power series distributions. The negative multinomial distribution, which forms a particular multivariate power series distribution, is studied in Sect. 6.1.

Remark 2.9 A general class of discrete multivariate distributions is given by multivariate power series distributions with counting densities of the form

$$f_\vartheta(x) \;=\; C(\vartheta)\, h(x)\, \prod_{j=1}^{k} \vartheta_j^{x_j}, \qquad x = (x_1, \ldots, x_k)^t \in \mathcal{X} \subset \mathbb{N}_0^k, \tag{2.2}$$

where h is an appropriate function, ϑ = (ϑ1, . . . , ϑk)t ∈ Θ ⊂ (0, ∞)k is a parameter vector, and C(ϑ) is a normalizing constant (see [28, Chapter 38]). We denote the multivariate power series distribution with density function (2.2) by mps(h, ϑ). For fixed function h, mps(h, •) = {mps(h, ϑ) : ϑ ∈ Θ} forms an EF.


Examples of particular distributions contained in the class of multivariate power series distributions are the multinomial distribution (see Example 2.8), the negative multinomial distribution (see Sect. 6.1), the multivariate logarithmic series distribution (with C(ϑ) = −1/ln(1 − Σkj=1 ϑj), ϑ ∈ (0, 1)k with Σkj=1 ϑj < 1, and h(x) = (Σkj=1 xj − 1)!/Πkj=1 xj!, x ∈ N0k \ {(0, . . . , 0)t}), and the joint distribution of k independent Poisson-distributed random variables (see Example 2.3).

Examples: N(•, •), N(μ0, •), N(•, σ02), IN(•, •), G(•, •), E(•), D(•), B(•, •)

In the following examples and throughout, let λX denote the Lebesgue measure on the Borel sets of X.

Example 2.10 (Normal Distribution) The Lebesgue density of the normal distribution N(μ, σ2) with mean μ ∈ R and variance σ2 ∈ (0, ∞) is given by

$$f_{\mu,\sigma^2}(x) \;=\; \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\Big\{-\frac{(x-\mu)^2}{2\sigma^2}\Big\} \;=\; \frac{1}{\sqrt{\sigma^2}}\,\exp\Big\{-\frac{\mu^2}{2\sigma^2}\Big\}\,\exp\Big\{\frac{\mu x}{\sigma^2} - \frac{x^2}{2\sigma^2}\Big\}\,\frac{1}{\sqrt{2\pi}}, \qquad x \in \mathbb{R}.$$

Hence, N(•, •) = {N(μ, σ2) : μ ∈ R, σ2 ∈ (0, ∞)} forms an EF with representation (2.1) given by μ = λR, k = 2,

$$C(\mu, \sigma^2) = \frac{1}{\sqrt{\sigma^2}}\,\exp\Big\{-\frac{\mu^2}{2\sigma^2}\Big\}, \qquad Z_1(\mu, \sigma^2) = \frac{\mu}{\sigma^2}, \qquad Z_2(\mu, \sigma^2) = -\frac{1}{2\sigma^2}, \qquad \mu \in \mathbb{R},\ \sigma^2 \in (0, \infty),$$

and

$$T_1(x) = x, \qquad T_2(x) = x^2, \qquad h(x) = \frac{1}{\sqrt{2\pi}}, \qquad x \in \mathbb{R}.$$

Moreover, when fixing either μ = μ0 or σ2 = σ02, the classes N(μ0, •) = {N(μ0, σ2) : σ2 ∈ (0, ∞)} and N(•, σ02) = {N(μ, σ02) : μ ∈ R} are easily seen to form one-parameter EFs. By choosing α = exp{−(1 − 2μ)/(2σ2)} and p = exp{−1/σ2} in Example 2.6 (cf. [26, p. 475]), the Kemp distribution kn(α, p) is seen to be a discrete version of the normal distribution N(μ, σ2).
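The rewriting in Example 2.10 can be verified numerically. The following Python sketch (illustrative only, with arbitrary μ and σ²; NumPy and SciPy assumed) assembles the density from C, Z1, Z2, T1, T2, and h and compares it with a standard normal-density routine:

```python
import numpy as np
from scipy import stats

# Density (2.1) for N(mu, sigma^2) with the mappings from Example 2.10:
#   C = exp(-mu^2/(2 s2))/sqrt(s2), Z = (mu/s2, -1/(2 s2)),
#   T(x) = (x, x^2), h(x) = 1/sqrt(2 pi).
def normal_ef_pdf(x, mu, s2):
    C = np.exp(-mu ** 2 / (2 * s2)) / np.sqrt(s2)
    Z = np.array([mu / s2, -1.0 / (2 * s2)])
    T = np.stack([x, x ** 2])
    return C * np.exp(Z @ T) / np.sqrt(2 * np.pi)

x = np.linspace(-4.0, 6.0, 101)
mu, s2 = 1.5, 2.0                # arbitrary parameter values
print(np.allclose(normal_ef_pdf(x, mu, s2),
                  stats.norm.pdf(x, loc=mu, scale=np.sqrt(s2))))   # True
```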

Example 2.11 (Inverse Normal Distribution) The Lebesgue density of the inverse normal distribution IN(μ, η), μ, η ∈ (0, ∞), is given by

$$f_{\mu,\eta}(x) \;=\; \sqrt{\frac{\eta}{2\pi x^3}}\,\exp\Big\{-\frac{\eta(x-\mu)^2}{2\mu^2 x}\Big\} \;=\; \sqrt{\eta}\,\exp\Big\{\frac{\eta}{\mu}\Big\}\,\exp\Big\{-\frac{\eta x}{2\mu^2} - \frac{\eta}{2x}\Big\}\,\sqrt{\frac{1}{2\pi x^3}}, \qquad x \in (0, \infty).$$

Hence, IN(•, •) = {IN(μ, η) : μ, η ∈ (0, ∞)} forms an EF with representation (2.1) given by μ = λ(0,∞), k = 2,

$$C(\mu, \eta) = \sqrt{\eta}\,\exp\Big\{\frac{\eta}{\mu}\Big\}, \qquad Z_1(\mu, \eta) = -\frac{\eta}{2\mu^2}, \qquad Z_2(\mu, \eta) = -\frac{\eta}{2}, \qquad \mu, \eta \in (0, \infty),$$

and

$$T_1(x) = x, \qquad T_2(x) = \frac{1}{x}, \qquad h(x) = \sqrt{\frac{1}{2\pi x^3}}, \qquad x \in (0, \infty).$$

Example 2.12 (Gamma Distribution) The Lebesgue density of the gamma distribution G(α, β), α, β ∈ (0, ∞), is given by  x 1 β−1 x exp − fα,β (x) =

(β) α β α =

 1  x 1 + β ln(x) , exp −

(β) α β α x

x ∈ (0, ∞) .

Hence, G(•, •) = {G(α, β) : α, β ∈ (0, ∞)} forms an EF with representation (2.1) given by μ = λ(0,∞) , k = 2, C(α, β) = and

1 1 , Z1 (α, β) = − ,

(β) α β α

T1 (x) = x ,

T2 (x) = ln(x) ,

Z2 (α, β) = β , α, β ∈ (0, ∞) , h(x) =

1 , x

x ∈ (0, ∞) .

Fixing β = 1, the resulting exponential distributions E(α), α ∈ (0, ∞), form the one-parameter EF E(•) = {G(α, 1) : α ∈ (0, ∞)} with representation (2.1) given by μ = λ(0,∞) , k = 1, C(α) =

1 , α

Z1 (α) = −

1 , α

α ∈ (0, ∞) ,

12

2 Parametrizations and Basic Properties

and

T1 (x) = x ,

h(x) = 1 ,

x ∈ (0, ∞) .

By choosing α = 2 and β = n/2, we arrive at the EF {χ 2 (n) : n ∈ N} = {G(2, n/2) : n ∈ N} of chi-square distributions. For p = exp{−1/α} and τ = β − 1, the Good distribution g(p, τ ) in Example 2.7 (cf. [26, p. 530]) is seen to be a discrete version of the gamma distribution G(α, β). Example 2.13 (Dirichlet Distribution) The Lebesgue density of the Dirichlet distribution D(β), β = (β1 , . . . , βk+1 )t ∈ (0, ∞)k+1 , is given by ⎛ ⎞⎛ ⎞βk+1 −1 k k  1 ⎝  βj −1 ⎠ ⎝ fβ (x) = xj 1− xj ⎠ B(β) j =1

j =1

⎧⎛ ⎞ ⎛ ⎞⎫ ⎡⎛ ⎞ ⎤−1 k k k k ⎬ ⎨     1 = βj ln(xj )⎠ + βk+1 ln ⎝1 − xj ⎠ ⎣⎝1 − xj ⎠ xj ⎦ exp ⎝ ⎭ ⎩ B(β) j =1

j =1

j =1

j =1

 for x = (x1 , . . . , xk )t ∈ X = {(y1 , . . . , yk )t ∈ (0, 1)k : ki=1 yi < 1}, where  k+1 B(β) = [ k+1 i=1 (βi )]/ ( i=1 βi ) denotes the beta function evaluated at β. Hence, D(•) = {D(β) : β ∈ (0, ∞)k+1 } and, in particular, by setting k = 1, the family B(•, •) = {D(β1 , β2 ) : β1 , β2 ∈ (0, ∞)} of beta distributions are seen to form EFs. The EF of Dirichlet distributions is studied in Sect. 6.2.

Further continuous univariate EFs are given by the family of log-normal distributions and the family of inverse gamma distributions. Other examples of continuous multivariate distributions, which lead to multiparameter EFs, are multivariate normal distributions, particular multivariate Liouville distributions, and Wishart distributions (for the latter, see, e.g., [37, pp. 662/663] and [36, Section 2.4]). Remark 2.14 A general class of continuous multivariate distributions is given by so-called Liouville distributions with Lebesgue density functions proportional to ⎛ ⎞ k k   a −1 h⎝ xj ⎠ xj j , j =1

x1 , . . . , xk > 0 ,

(2.3)

j =1

where h is an appropriate non-negative function and a1 , . . . , ak are positive parameters (see [37, Chapter 50] and [47, Section 8.5] for details). We denote the Liouville distribution with density function (2.3) by L(h, (a1 , . . . , ak )t ). Obviously, for fixed h, L(h, •) = {L(h, a) : a ∈ (0, ∞)k } then forms an EF. An EF structure

2.1 Definition and Examples

13

may also arise if h depends on additional parameters as the following two examples show. For h(y) = y β−1 exp{−y/α}, y > 0, with positive parameters α and β, the resulting Liouville distribution (of the first kind) has density function ⎛

(a• + β − 1) (

(a• ) k

a• +β−1 j =1 (aj )) α



k 

⎞β−1 xj ⎠

j =1

k 

a −1 −xj /α

xj j

e

j =1

k for x1 , . . . , xk > 0, where a• = i=1 ai . The particular case β = 1 yields the joint density function of k independent gamma-distributed random variables with common scale parameter α (see Example 2.12). Choosing h(y) = (1 − y)β−1 for 0 < y < 1 (zero otherwise) and some β > 0, we obtain the Dirichlet distribution D(a1 , . . . , ak , β) defined on the open simplex  {(x1 , . . . , xk )t ∈ (0, 1)k : ki=1 xi < 1} (see Example 2.13) as a particular Liouville distribution (of the second kind). The function h in Definition 2.1 is free of ϑ and, indeed, as we will later see, it does not play a role in statistical inference. As a (B − B1 )-measurable non-negative mapping, h may be considered a μ-density of a measure ν, i.e., we have ν(A) =  h1A dμ, A ∈ B, where 1A denotes the indicator function of the set A. For short, we write ν = hμ in that case. Since μ is σ -finite and h is real-valued, ν is then σ -finite as well (see, e.g., [6, pp. 104/105]). We may therefore absorb h into the dominating measure to obtain a shorter representation of the EF. Lemma 2.15 Let P be an EF with representation (2.1), and let ν = hμ be the measure with μ-density h. Then, for every ϑ ∈ , a ν-density of Pϑ is given by gϑ (x) =

⎧ k ⎨

dPϑ (x) = C(ϑ) exp ⎩ dν

j =1

⎫ ⎬ Zj (ϑ)Tj (x)



,

x ∈ X.

(2.4)

Remark 2.16 In formula (2.4), we have gϑ > 0 for every ϑ ∈ , which implies that ν and Pϑ are equivalent measures, i.e., ν(N) = 0 if and only if Pϑ (N) = 0 is valid for every N ∈ B. As a consequence, all distributions in an EF are necessarily equivalent. In particular, the support of Pϑ , defined by supp(Pϑ ) = {x ∈ X : fϑ (x) > 0} for the discrete case and by supp(Pϑ ) = {x ∈ X : Pϑ (V ) > 0 for all open sets V ⊂ X with x ∈ V }

14

2 Parametrizations and Basic Properties

for the continuous case, does not depend on ϑ. By fixing some parameter ϑ0 ∈ , this allows for a further representation of the EF given by the Pϑ0 -densities ⎧ ⎫ k ⎨ ⎬ dPϑ C(ϑ) exp (x) = (Zj (ϑ) − Zj (ϑ0 )) Tj (x) , ⎩ ⎭ dPϑ0 C(ϑ0 )

x ∈ X,

j =1

for ϑ ∈ . By Remark 2.16, a necessary condition for a family of distributions to form an EF is that all distributions have the same support. Counterexamples are therefore near at hand. For instance, the continuous uniform distribution on the interval [a, b], −∞ < a < b < ∞, does not form an EF when a or b are supposed to vary. Another counterexample is provided by the two-parameter exponential distribution with varying location parameter. Both distribution families are examples for socalled truncated EFs, which have also been studied in the literature. For a recent work on estimation in truncated EFs, we refer to [1]. For the sake of a short notation, we introduce the (column) vectors Z = (Z1 , . . . , Zk )t and T = (T1 , . . . , Tk )t of parameter functions k and statistics, respectively. Then, for ϑ ∈  and x ∈ X, Z(ϑ)t T (x) = j =1 Zj (ϑ)Tj (x) is the scalar product of Z(ϑ) and T (x). Moreover, let PϑT denote the distribution of T under Pϑ being defined on the Borel sets Bk of Rk , i.e., PϑT (A) = Pϑ (T −1 (A)), A ∈ Bk . By an integral transformation, it is easily seen that if P = {Pϑ : ϑ ∈ } forms an EF so do the image distributions PϑT , ϑ ∈ , of T (see, e.g., [39, pp. 31/32] or [63, p. 149]). Lemma 2.17 Let P be an EF with representation (2.4), and let ν T (A) = ν(T −1 (A)), A ∈ Bk , denote the image measure of ν under T . Then, for every ϑ ∈ , a ν T -density of PϑT is given by gϑT (t) =

 dPϑT (t) = C(ϑ) exp Z(ϑ)t t , T dν

t = (t1 , . . . , tk )t ∈ Rk ,

and PT = {PϑT : ϑ ∈ } forms an EF. that the measure ν T in Lemma 2.17 is σ -finite, since 0 < gϑT < ∞ and  Note T T gϑ dν = 1 < ∞ for ϑ ∈  (see, e.g., [6, pp. 98/99]). Example: Nk (•,  0 )

Example 2.18 (Multivariate Normal Distribution) The Lebesgue density of the multivariate normal distribution Nk (μ, ) with mean vector μ ∈ Rk and positive

2.1 Definition and Examples

15

definite covariance matrix  ∈ Rk×k is given by

1 1 2 exp − ||x − μ|| , √ fμ, (x) = x ∈ Rk , 2 (2π)k/2 det() ! where ||y|| = y t  −1 y , y ∈ Rk , denotes the Mahalanobis norm with weight matrix  −1 . For fixed  =  0 , P = Nk (•,  0 ) = {Nk (μ,  0 ) : μ ∈ Rk } forms an EF with representation (2.1) given by μ = λRk = λk , say, " C(μ) = exp −

||μ||2 0 2

Z(μ) = μ ,

,

μ ∈ Rk ,

  exp −||x||20 /2 , h(x) = √ (2π)k/2 det( 0 )

T (x) =  −1 0 x,

and

#

x ∈ Rk .

Now, let ν = hλk = Nk (0,  0 ). For A ∈ Bk , an integral transformation then yields T

ν (A) = ν(T

−1

 (A)) =

T −1 (A)

h(x) dλk (x)





=

h( 0 u) |det( 0 )| dλ (u) = k

A

A

f0, −1 (u) dλk (u) . 0

Hence, ν T = Nk (0,  −1 0 ). By Lemma 2.17, the distribution of T under Pμ is given by PμT (A)

 =

C(μ) eZ(μ) t dν T (t) t

A

 = A

C(μ) eμ t f0,−1 (t) dλk (t) = t

0

 A

f −1 μ, −1 (t) dλk (t) , 0

0

−1 −1 k T i.e., PμT = Nk ( −1 0 μ,  0 ), for μ ∈ R . Hence, P = Nk (•,  0 ), here.

In general, however, the dominating measure ν T of PT in Lemma 2.17 will have a complicated structure. For instance, in case of the EF N(•, •) in Example 2.10, we have T (x) = (x, x 2 )t , x ∈ R, and, since ν T and PϑT are equivalent by Remark 2.16, supp(ν T ) = {(x, x 2 )t : x ∈ R}, i.e., the support of ν T is given by the graph of the

16

2 Parametrizations and Basic Properties

standard parabola in the plane. In particular, supp(ν T ) has no inner points, and ν T does not have a Lebesgue density. Remark 2.19 As the preceding examples show, the concept of the EF is rich and allows for a unifying study of various different distributions. It comprises univariate, one-parameter distributions, such as poi(•) and E(•) shown in Examples 2.3 and 2.12, as well as multivariate, multiparameter distributions as, for instance, m(n0 , •) and D(•) in Examples 2.8 and 2.13. Examples for univariate EFs with more than one parameter are provided by kn(•, •) and N(•, •) in Examples 2.6 and 2.10. Finally, multivariate, one-parameter EFs appear in a natural way when sampling from a univariate, one-parameter EF (see Sect. 3.3); a particular example is the family {Nk ((μ, . . . , μ)t , Ek ) : μ ∈ R}, where Ek denotes the k-dimensional unit matrix. Many of the aforementioned examples of EFs will be considered again and studied in more detail in the sequel. An overview is provided in Table 2.1.

Table 2.1 EFs discussed in this book listed by type and number of parameters

Type        EF           Parameters   Discussed in . . .
Discrete    poi(•)       1            Examples 2.3, 2.37, 2.64, 3.2, 3.14, 3.22, 3.31, 4.7, 4.28, 5.6
            b(n0, •)     1            Examples 2.4, 2.38, 2.57, 3.3, 3.22, 4.8, 4.28
            nb(r0, •)    1            Examples 2.5, 2.51, 3.22, 3.32, 4.28
            geo(•)       1            Examples 2.5, 2.51, 3.22, 3.32, 4.28
            kn(•, •)     2            Example 2.6
            g(•, •)      2            Examples 2.7, 3.23
            m(n0, •)     k (≥2)       Examples 2.8, 2.46, 3.4, 3.24
            nm(r0, •)    k (≥2)       Section 6.1
            mps(h, •)    k (≥2)       Remark 2.9
Continuous  E(•)         1            Examples 2.12, 3.10, 3.22, 4.28, 5.9
            N(•, σ02)    1            Examples 2.10, 3.16, 5.7
            N(μ0, •)     1            Example 2.10
            N(•, •)      2            Examples 2.10, 2.27, 2.31, 2.39, 2.52, 2.59, 3.5, 3.15, 3.25, 3.33, 4.9, 4.15
            IN(•, •)     2            Examples 2.11, 2.25, 2.41, 2.58
            G(•, •)      2            Examples 2.12, 2.24, 2.40, 2.53, 3.6, 3.23, 3.34, 4.24
            B(•, •)      2            Example 2.13, Section 6.2
            Nk(•, Σ0)    k (≥2)       Examples 2.18, 2.28, 2.54, 2.65, 3.7, 3.16, 3.26, 3.35, 3.41, 4.10, 4.25, 5.14, 5.18
            D(•)         k (≥2)       Example 2.13, Section 6.2
            L(h, •)      k (≥2)       Remark 2.14


2.2 Order and Minimal Representation

For both mathematical and practical reasons, a representation (2.1) of the EF in Definition 2.1 is desired in which the number k of summands is minimal. On the one hand, this will ensure that the parameter ϑ ∈ Θ is identifiable, meaning that different parameters correspond to different distributions in the EF. On the other hand, the vector T of statistics, which plays an important role in statistical inference, should have a dimension as small as possible. First, we define what is meant by a minimal representation of the EF.

Definition 2.20 Let P be an EF with representation (2.1). Then, P is said to have order k if the following implication holds true: If μ̃ is a σ-finite measure on B, k̃ ∈ N, and if there are mappings C̃, Z̃1, . . . , Z̃k̃ : Θ → R and h̃, T̃1, . . . , T̃k̃ : (X, B) → (R1, B1) with h̃ ≥ 0 such that

$$\tilde f_\vartheta(x) \;=\; \frac{dP_\vartheta}{d\tilde\mu}(x) \;=\; \tilde C(\vartheta)\,\exp\Bigg\{\sum_{j=1}^{\tilde k} \tilde Z_j(\vartheta)\,\tilde T_j(x)\Bigg\}\,\tilde h(x), \qquad x \in \mathcal{X}, \tag{2.5}$$

is a μ̃-density of Pϑ for every ϑ ∈ Θ, then k̃ ≥ k. In that case, we write ord(P) = k, and representation (2.1) is said to be minimal.

Definition 2.20 is not suitable to determine the order of an EF in a given situation. Aiming for necessary and sufficient conditions for an EF to have order k, we introduce affine independence of mappings.

Definition 2.21 Let Y ≠ ∅ be a set, k ∈ N, and u1, . . . , uk : Y → R be mappings.

(a) u1, . . . , uk are called affinely independent if, for any a0, a1, . . . , ak ∈ R,

$$a_0 + \sum_{j=1}^{k} a_j\, u_j(y) \;=\; 0 \qquad \text{for all } y \in \mathcal{Y}$$

implies that a0 = a1 = · · · = ak = 0.

(b) Let (Y, C) be a measurable space, Q be a measure on C, and u1, . . . , uk be (C − B1)-measurable. Then, u1, . . . , uk are called Q-affinely independent if ũ1, . . . , ũk : Nc → R are affinely independent in the sense of (a) for every N ∈ C with Q(N) = 0.

The following lemma connects the concepts of affine independence of mappings and linear independence of vectors in Rk. A proof can be found in the web appendix.

Lemma 2.22 Let Y ≠ ∅ be a set, k ∈ N, and u1, . . . , uk : Y → R be mappings. Moreover, let u = (u1, . . . , uk) : Y → Rk and y0 ∈ Y. Then, the following assertions are equivalent:

(a) u1, . . . , uk are affinely independent.
(b) There exist y1, . . . , yk ∈ Y such that u(y1) − u(y0), . . . , u(yk) − u(y0) are linearly independent (in Rk).

Given representation (2.4) of an EF, it is clear that if Z1, . . . , Zk are not affinely independent or if T1, . . . , Tk are not ν-affinely independent, the exponent in the densities can be reduced by a summand under respective modifications of C or ν. Hence, affine independence of Z1, . . . , Zk and ν-affine independence of T1, . . . , Tk is a necessary condition for the EF to have order k. Indeed, the condition is also sufficient (see, e.g., [4, p. 113]). A detailed proof can be found, for instance, in [63, pp. 145/146].

Theorem 2.23 Let P be an EF with representation (2.4). Then, the following assertions are equivalent:

(a) ord(P) = k,
(b) Z1, . . . , Zk are affinely independent and T1, . . . , Tk are ν-affinely independent.

Note that, by Remark 2.16, ν-affine independence coincides with Pϑ-affine independence for every ϑ ∈ Θ.

Examples: G(•, •), IN(•, •)

Example 2.24 Consider the representation of G(•, •) in Example 2.12 with Z1 (α, β) = −1/α, Z2 (α, β) = β for α, β ∈ (0, ∞), and T1 (x) = x, T2 (x) = ln(x) for x ∈ (0, ∞). Let a0 −

a1 + a2 β = 0 α

for all α, β ∈ (0, ∞) .

For α → ∞ and β → 0, this gives a0 = 0. Then, choosing (α, β) = (1, 1) and (α, β) = (1, 2) yields that a2 = 0 and, thus, a1 = 0. Hence, Z1 and Z2 are

2.2 Order and Minimal Representation

19

affinely independent. Now, suppose that T1 and T2 are not ν-affinely independent. Then, there exists a vector (a0 , a1 , a2 ) = (0, 0, 0) with a0 + a1 x + a2 ln(x) = 0

ν-almost everywhere (ν-a.e.) .

For x → ∞, the left-hand side tends to either −∞ or ∞ and is therefore not equal to 0 on the interval (M, ∞) for some M > 0. Since ν((M, ∞)) > 0, this forms a contradiction. Hence, T1 and T2 are ν-affinely independent, and we have ord(G(•, •)) = 2 by Theorem 2.23. In particular, the representation is minimal. Example 2.25 Consider the representation of IN(•, •) in Example 2.11 with Z1 (ϑ) = −

η , 2μ2

Z2 (ϑ) = −

η , 2

ϑ = (μ, η)t ∈ (0, ∞)2 ,

√ and T1 (x) = x, T2 (x) = 1/x for x ∈ (0, ∞). Setting ϑ 0 = ( 2, 4)t , ϑ 1 = (1, 2)t , and ϑ 2 = (1, 4)t , we obtain that Z(ϑ 1 ) − Z(ϑ 0 ) = (−1, −1)t − (−1, −2)t = (0, 1)t and

Z(ϑ 2 ) − Z(ϑ 0 ) = (−2, −2)t − (−1, −2)t = (−1, 0)t

are linearly independent in R2 . Hence, Z1 and Z2 are affinely independent by Lemma 2.22. Now, suppose that T1 and T2 are not ν-affinely independent. Then, there exists a vector (a0 , a1 , a2 ) = (0, 0, 0) with a0 + a1 x +

a2 = 0 x

ν-a.e. ,

or, equivalently, a1 x 2 + a0 x + a2 = 0

ν-a.e. .

This quadratic equation has at most two positive solutions x1 and x2 , say. Since ν({x1 , x2 }) = 0, we have a contradiction. Hence, T1 and T2 are νaffinely independent, and we obtain ord(IN(•, •)) = 2 from Theorem 2.23. In particular, the representation is minimal.
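Lemma 2.22 also suggests a simple numerical device for checking affine independence: pick points y0, y1, . . . , yk and test whether the difference vectors u(yi) − u(y0) have full rank. The sketch below (not from the book; written in Python with NumPy, and only a sufficient check for the chosen points) applies this to T1(x) = x, T2(x) = 1/x from Example 2.25 and to an affinely dependent counterexample:

```python
import numpy as np

# Lemma 2.22 in matrix form: u_1, ..., u_k are affinely independent iff there
# are points y_0, ..., y_k such that u(y_1)-u(y_0), ..., u(y_k)-u(y_0) are
# linearly independent, i.e. the matrix of differences has full rank k.
def affinely_independent(u, points):
    vals = np.array([u(y) for y in points])   # rows u(y_0), ..., u(y_k)
    diffs = vals[1:] - vals[0]
    return np.linalg.matrix_rank(diffs) == diffs.shape[1]

# T = (T_1, T_2) with T_1(x) = x, T_2(x) = 1/x as in Example 2.25
T = lambda x: np.array([x, 1.0 / x])
print(affinely_independent(T, [1.0, 2.0, 3.0]))       # True

# counterexample: u_1(x) = x and u_2(x) = 2x + 1 are affinely dependent
u = lambda x: np.array([x, 2.0 * x + 1.0])
print(affinely_independent(u, [1.0, 2.0, 3.0]))       # False
```

Note that the check only certifies independence for the supplied points; failing to find suitable points does not by itself prove affine dependence.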

To check whether a representation of P is minimal, the following equivalence is often useful. The assertion is obvious from the general rule at Cov(U)a = Var(at U) for a random vector U and a constant vector a, which implies that at Cov(U)a = 0 if and only if at U is almost surely (a.s.) constant (see also [63, pp. 145/146]).

Lemma 2.26 Let (Y, C) be a measurable space, Q be a probability measure on C, and U = (U1, . . . , Uk)t : (Y, C) → (Rk, Bk) a random vector with E(Uj2) < ∞, 1 ≤ j ≤ k. Then, the following assertions are equivalent:

(a) U1, . . . , Uk are Q-affinely independent.
(b) Cov(U) > 0, i.e., the covariance matrix of U is positive definite.

Examples: N(•, •), Nk(•, Σ0)

Example 2.27 Consider the representation of N(•, •) in Example 2.10 with

$$Z_1(\vartheta) = \frac{\mu}{\sigma^2}, \qquad Z_2(\vartheta) = -\frac{1}{2\sigma^2}, \qquad \vartheta = (\mu, \sigma^2)^t \in \mathbb{R} \times (0, \infty),$$

and T1(x) = x, T2(x) = x2 for x ∈ R. Setting ϑ0 = (0, 1)t, ϑ1 = (0, 1/2)t, and ϑ2 = (1/2, 1/2)t, we obtain that

$$Z(\vartheta_1) - Z(\vartheta_0) = (0, -1/2)^t \qquad \text{and} \qquad Z(\vartheta_2) - Z(\vartheta_0) = (1, -1/2)^t$$

are linearly independent in R2. Hence, Z1 and Z2 are affinely independent by Lemma 2.22. Moreover, the covariance matrix of T(X) = (X, X2)t for X ∼ Q = N(0, 1) is given by

$$\mathrm{Cov}_Q(T) \;=\; \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix} \;>\; 0.$$

Thus, T1 and T2 are ν-affinely independent by Lemma 2.26 and Remark 2.16. Applying Theorem 2.23 then yields that ord(N(•, •)) = 2. In particular, the representation is minimal.

Example 2.28 Consider the representation of Nk(•, Σ0) in Example 2.18 with Z(μ) = μ, μ ∈ Rk, and T(x) = Σ0−1 x, x ∈ Rk. If a0 ∈ R and a ∈ Rk are such that a0 + at μ = 0 for all μ ∈ Rk, it directly follows that a0 = 0 and a = 0. Hence, Z1, . . . , Zk are affinely independent. Moreover, for every μ ∈ Rk, Covμ(T) = Σ0−1 > 0 such that T1, . . . , Tk are ν-affinely independent by Lemma 2.26. Applying Theorem 2.23 then yields that ord(Nk(•, Σ0)) = k. In particular, the representation is minimal.
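The covariance matrix in Example 2.27 can also be checked by simulation. A brief Python sketch (illustrative only; sample size and seed are arbitrary) estimates the covariance matrix of (X, X²) for X ∼ N(0, 1) and shows that it is close to diag(1, 2):

```python
import numpy as np

# Example 2.27: Cov of T(X) = (X, X^2) under Q = N(0, 1) is diag(1, 2),
# which is positive definite, so T_1, T_2 are Q-affinely independent
# by Lemma 2.26.  Quick Monte Carlo check:
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
T = np.column_stack([x, x ** 2])
print(np.cov(T, rowvar=False))    # approximately [[1, 0], [0, 2]]
```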

2.2 Order and Minimal Representation

21

In a minimal representation of an EF, the vectors Z and T of parameter functions and statistics, respectively, are uniquely determined up to non-degenerated affine transformations (see [4, p. 112]); a detailed proof is provided in [63, pp. 145–147]. Theorem 2.29 Let P be an EF with minimal representations (2.1) and (2.5), i.e., ˜ Then, there exist regular matrices L, M ∈ Rk×k and vectors k = ord(P) = k. k l, m ∈ R with Z = LZ˜ + l

and

T = MT˜ + m ν-a.e. ,

where Z˜ = (Z˜ 1 , . . . , Z˜ k )t and T˜ = (T˜1 , . . . , T˜k )t . Finding a minimal representation for a given EF is an important step when aiming at statistical inference. Among others, it may ensure that the parameter ϑ ∈  is identifiable. Alternatively, in the literature, the identifiability property is ascribed to the set  of parameters or to the parametric family P of distributions. Lemma 2.30 Let P be an EF with representation (2.4). If Z is one-to-one and T1 , . . . , Tk are ν-affinely independent, then the parameter ϑ ∈  is identifiable, i.e., the mapping ϑ → Pϑ , ϑ ∈ , is one-to-one. Proof Let Pϑ1 = Pϑ2 for ϑ1 , ϑ2 ∈ . Then, we have gϑ1 = gϑ2 ν-a.e., which is equivalent to  (Z(ϑ1 ) − Z(ϑ2 ))t T = ln

C(ϑ2 ) C(ϑ1 )

 ν-a.e. .

Hence, we have Z(ϑ1 ) = Z(ϑ2 ) by using that T1 , . . . , Tk are ν-affinely independent. Since Z is one-to-one, it follows that ϑ1 = ϑ2 . 

In many applications, the parameters of some EF are supposed to satisfy certain constraints. This assumption leads to subfamilies which, of course, form EFs. The order of these embedded EFs may be smaller than or equal to the order of the original EF. We give a classical example (see [35, pp. 85–87]).

Example: Subfamilies of N(•, •)

Example 2.31 Let P = N(•, •) with representation as in Example 2.10, and recall that ord(P) = 2 (see Example 2.27). (a) First, we consider the subfamily P1 = {N(μ, μ) : μ ∈ (0, ∞)} of P. Setting σ 2 = μ in the density of N(μ, σ 2 ), it is easily seen that P1 forms an EF with representation (2.1) given by k = 1,  μ 1 C(μ) = √ exp − , μ 2

Z1 (μ) = −

1 , 2μ

μ ∈ (0, ∞) ,

22

2 Parametrizations and Basic Properties

and

T1 (x) = x 2 ,

1 h(x) = √ ex , 2π

x ∈ R.

Hence, ord(P1) = 1. (b) Now, let P2 = {N(μ, μ2 ) : μ ∈ R \ {0}}. Replacing σ 2 by μ2 in the density of N(μ, σ 2 ), we arrive at a representation (2.1) of P2 given by k = 2, 1 C(μ) = $ , μ2 and

h(x) = √

1 2πe

1 , μ

and

Z2 (μ) = −

T1 (x) = x ,

and

T2 (x) = x 2 ,

Z1 (μ) =

,

1 , 2μ2

μ ∈ R \ {0} ,

x ∈ R.

Since Z(1) − Z(−1) = (2, 0)t and Z(1/2) − Z(−1) = (3, −3/2)t , Z1 and Z2 are affinely independent by Lemma 2.22. As shown in Example 2.27, T1 and T2 are N(0, 1)-affinely independent and thus also N(1, 1)-affinely independent. Hence, we have ord(P2 ) = 2 by Theorem 2.23.

Although both EFs, P1 and P2, in Example 2.31 are one-parameter subfamilies of N(•, •), their orders are different. When imposing the constraint on the parameters, the mappings Z1 and Z2 in Example 2.27 are no longer affinely independent in case (a), but they are in case (b). P2 is an example of a so-called curved EF. We give a fairly general definition of the concept in order to keep it short here (see [50, p. 28]); other definitions in the literature impose additional regularity conditions such as, e.g., continuous second derivatives of Z on Θ, where Θ is assumed to be an open subset of Rm with m < k.

Definition 2.32 Let P be an EF with representation (2.1). If Z1, . . . , Zk are affinely independent and int(Z(Θ)) = ∅, i.e., the interior of Z(Θ) is empty, then P is called a curved EF.

The statistical literature on curved EFs is vast; among others, geometrical aspects, inference, and applications, e.g., in graphical models and networks, have widely been studied. In the sequel, several properties of curved EFs are addressed while the main focus is on EFs with parameter spaces containing at least some inner points. More results on curved EFs can be found, for instance, in [20, 2, 5, 39, 44, 35, 56].

2.3 Structural Properties and Natural Parametrization

In Lemma 2.2, we have pointed out that the term C(ϑ) in representation (2.1) is just a normalizing constant. Moreover, it depends on ϑ only via Z(ϑ) ∈ Z(Θ) = {Z(ϑ) : ϑ ∈ Θ}. This finding allows for a more natural parametrization of the EF.

Definition 2.33 Let P be an EF with representation (2.1). The natural parameter space is defined by

$$\Theta^* \;=\; \Bigg\{\zeta = (\zeta_1, \ldots, \zeta_k)^t \in \mathbb{R}^k \,:\; 0 \;<\; \int \exp\Bigg\{\sum_{j=1}^{k} \zeta_j\, T_j(x)\Bigg\}\, h(x)\, d\mu(x) \;<\; \infty\Bigg\}.$$
0, for parameter λ > 0, then the respective EF is already given in natural parametrization with λ ∈  = (0, ∞). The natural parametrization for the class Nk (•,  0 ) of multivariate normal distributions with fixed covariance matrix  0 is already given in Example 2.18. Some other simple examples for EFs in canonical representations are shown in what follows, and the corresponding natural parameter spaces are determined. Examples: poi(•), b(n0 , •), N(•, •), G(•, •)

Example 2.37 Consider the representation of P = poi(•) in Example 2.3 with k = 1, Z1 (a) = ln(a) for a ∈ (0, ∞) and T1 (x) = x, h(x) = 1/x! for x ∈ N0 . Since  e

ζx

∞   1 1 dμN0 (x) = = exp eζ , eζ x x! x!

ζ ∈ R,

x=0

we have ∗ = R = Z1 ((0, ∞)). Hence, P is regular, and the canonical representation of P is given by the counting densities fζ∗ (x) = C ∗ (ζ ) eζ x

1 , x!

x ∈ N0 ,

2.3 Structural Properties and Natural Parametrization

25

with  C ∗ (ζ ) = exp −eζ for ζ ∈ R. Example 2.38 Consider the representation of P = b(n0 , •) in Example 2.4 for fixed n0 ∈ N with k = 1, Z1 (p) = ln(p/(1 − p)) for p ∈ (0, 1) and T1 (x) = x, h(x) = nx0 for x ∈ {0, 1, . . . , n0 } = X. Since  e

ζx

    n0  n0 ζ x n0 dμX (x) = = (1 + eζ )n0 , e x x

ζ ∈ R,

x=0

we have ∗ = R = Z1 ((0, 1)). Hence, P is regular, and the canonical representation of P is given by the counting densities fζ∗ (x) = C ∗ (ζ ) eζ x

  n0 , x

x ∈ X,

with −n0  C ∗ (ζ ) = 1 + eζ for ζ ∈ R. Example 2.39 Consider the representation of P = N(•, •) in Example 2.10 with k = 2,

1 μ2 C(μ, σ 2 ) = √ exp − 2 , 2σ σ2 Z1 (μ, σ 2 )√= μ/σ 2 , and Z2 (μ, σ 2 ) = −1/(2σ 2 ) for μ ∈ R, σ 2 ∈ (0, ∞), and h(x) = 1/ 2π, T1 (x) = x, and T2 (x) = x 2 for x ∈ R. Here, we have R × (−∞, 0) = Z(R × (0, ∞)) ⊂ ∗ . To establish that ∗ = R × (−∞, 0), it is sufficient to show that (ζ1 , 0)t ∈ ∗ for ζ1 ∈ R, since ∗ is convex by Theorem 2.34. Indeed, for every ζ1 ∈ R, we have  ∞ 1 eζ1 x √ dx = ∞ . 2π −∞

26

2 Parametrizations and Basic Properties

Hence, P is regular, and, by means of the parameter transformation 

1 μ ,− 2 σ2 2σ

 = (ζ1 , ζ2 )



  ζ1 1 , (μ, σ 2 ) = − ,− 2ζ2 2ζ2

the canonical representation of P turns out to be 1 2 fζ∗ (x) = C ∗ (ζ ) eζ1 x+ζ2 x √ , 2π

x ∈ R,

with " #   $ ζ12 ζ1 1 C (ζ ) = C − ,− = −2ζ2 exp 2ζ2 2ζ2 4ζ2 ∗

for ζ = (ζ1 , ζ2 )t ∈ R × (−∞, 0). Example 2.40 Consider the representation of P = G(•, •) in Example 2.12 with k = 2, C(α, β) = 1/[ (β)α β ], Z1 (α, β) = −1/α, Z2 (α, β) = β for α, β ∈ (0, ∞), and h(x) = 1/x, T1 (x) = x, T2 (x) = ln(x) for x ∈ (0, ∞). Here, we have (−∞, 0) × (0, ∞) = Z((0, ∞)2 ) ⊂ ∗ . To establish that ∗ = (−∞, 0) × (0, ∞), it is sufficient to show that (ζ1 , 0)t ∈ ∗ and (0, ζ2 )t ∈ / ∗ for ζ1 ≤ 0 and ζ2 > 0, since ∗ is convex by Theorem 2.34. Indeed, for ζ1 ≤ 0 and ζ2 > 0, we have 

∞ 0

eζ1 x

1 dx ≥ x



1

eζ1 x

0





and 0

1 dx ≥ eζ1 x

eζ2 ln(x)



1 0

1 dx = ∞ x

x ζ2 %%∞ 1 dx = % = ∞. x ζ2 0

Hence, P is regular, and, by means of the parameter transformation ζ1 = −1/α and ζ2 = β, the canonical representation of P turns out to be fζ∗ (x) = C ∗ (ζ ) eζ1 x+ζ2 ln(x)

1 , x

x ∈ (0, ∞) ,

2.3 Structural Properties and Natural Parametrization

27

with   1 (−ζ1 )ζ2 C ∗ (ζ ) = C − , ζ2 = ζ1

(ζ2 ) for ζ = (ζ1 , ζ2 )t ∈ (−∞, 0) × (0, ∞).

The EFs considered in the preceding examples are all full and moreover regular, since the natural parameter space turns out to be open. Indeed, this is the case for most of the common EFs. An example for a full but non-regular EF is provided by the full family of inverse normal distributions (see, e.g., [4, p. 117] or [20, p. 72]). Example: IN(•, •)

Example 2.41 Consider the representation of P = IN(•, •) in Example 2.11 with k = 2,

η √ C(μ, η) = η exp , μ Z1 (μ, η) = −η/(2μ2 ), and Z2 (μ, η) = −η/2 for μ, η ∈ (0, ∞), and h(x) = $ 1/(2πx 3), T1 (x) = x, and T2 (x) = 1/x for x ∈ (0, ∞). Here, we have (−∞, 0)2 = Z((0, ∞)2 ) ⊂ ∗ . However, it turns out that ∗ = (−∞, 0] × (−∞, 0), which can be seen as follows. First, we have (0, ζ2 )t ∈ ∗ for ζ2 < 0, since by substituting z = 1/x 





e 0

ζ2 /x

1 dx = 2πx 3



 =

1 2π





eζ2 z z−1/2 dz

0

& 1 (1/2) = 2π (−ζ2 )1/2

1 < ∞. −2ζ2

28

2 Parametrizations and Basic Properties

Moreover, it holds that (ζ1 , 0)t ∈ / ∗ for ζ1 ≤ 0, since 





e

ζ1 x

0

1 dx ≥ 2πx 3

 



1 2π





1

e

ζ1 x

0

1 ζ1 e 2π



1 0

1 dx x3

1 dx = ∞ . x

Finally, we have (ζ1 , ζ2 )t ∈ / ∗ for ζ1 > 0 and ζ2 < 0, since by using the series representation of exp{ζ1 x} 





e

ζ1 x+ζ2 /x

0

1 dx ≥ 2πx 3

 

≥  ≥

1 2π







e

ζ1 x+ζ2 /x

1

1 ζ2 e 2π 1 ζ2 e 2π





eζ1 x

1



∞ 1

1 dx x3

1 dx x2

(ζ1 x)2 1 dx = ∞ . 2! x 2

The convexity of ∗ as pointed out in Theorem 2.34 now ensures that ∗ = (−∞, 0] × (−∞, 0). In particular, P is not full. By using the parameter transformation ( '&   η ζ2 η − 2,− = (ζ1 , ζ2 ) ⇔ (μ, η) = , −2ζ2 2μ 2 ζ1 the canonical representation of P is given by  fζ∗ (x)



= C (ζ ) e

ζ1 x+ζ2 /x

1 , 2πx 3

x ∈ (0, ∞) ,

with '& ∗

C (ζ ) = C

ζ2 , −2ζ2 ζ1

( =

$

−2ζ2 e2

√ ζ1 ζ2

for ζ = (ζ1 , ζ2 )t ∈ (−∞, 0)2 . Note that the associated full EF {Pζ∗ : ζ ∈ √ (−∞, 0] × (−∞, 0)} with C ∗ (0, ζ2 ) = −2ζ2 for ζ2 < 0 is not regular.

2.3 Structural Properties and Natural Parametrization

29

Having seen that the natural parameter space of the EF is not open in general, we address a simple sufficient condition in the following lemma guaranteeing that it has at least some inner points. The proof makes use of the convexity property of ∗ and Lemma 2.22 (for details, see [63, p. 150]). Lemma 2.42 Let P be an EF with representation (2.1). If Z1 , . . . , Zk are affinely independent, then int (∗ ) = ∅. To address further properties related to the natural parametrization, we introduce a function, which is of high relevance in statistical inference.

Definition 2.43 Let P be an EF with canonical representation (2.6). The mapping κ :  → R defined by κ = − ln(C ∗ ) is called the cumulant function of P.

For the motivation of the term cumulant function in Definition 2.43 and for remarks on so-called cumulants, we refer to Sects. 2.4 and 3.1. The following properties of the cumulant function are immediately obtained from the proof of Theorem 2.34 and Lemma 2.30.

Theorem 2.44 Let P be a full EF with canonical representation (2.6) and cumulant function κ = −ln(C*) defined on Ξ*. Then, we have:
(a) κ is convex, i.e., if ζ, η ∈ Ξ* and α ∈ (0, 1), then

κ(αζ + (1 − α)η) ≤ ακ(ζ) + (1 − α)κ(η) .   (2.8)

Here, equality holds true if and only if P*_ζ = P*_η.
(b) If T1, . . . , Tk are ν-affinely independent, then κ is strictly convex on Ξ*, i.e., equality in Eq. (2.8) holds true if and only if ζ = η.

Note that, since κ is convex and finite on Ξ*, it is continuous on the interior int(Ξ*) of Ξ* (see, e.g., [51, Section 10]). For maximum likelihood estimation in a subsequent section, it will be convenient to extend κ in a natural way to a function on R^k. The following property then directly follows from Fatou's lemma (see [20, pp. 19/20]). Fatou's lemma along with its proof can be found, e.g., in [17, p. 209].

Lemma 2.45 Let P be a full EF with canonical representation (2.6) and cumulant function κ = −ln(C*) on Ξ*. Moreover, let κ̃ : R^k → R ∪ {∞} be the extension of κ defined by κ̃(ζ) = ∞ for ζ ∈ R^k \ Ξ*. Then, κ̃ is lower semi-continuous, i.e.,

lim inf_{η→ζ} κ̃(η) ≥ κ̃(ζ)   for every ζ ∈ R^k .

Moreover, κ̃ is continuous on int(Ξ*).


When aiming for inference with a given EF, it is advisable first to find a minimal representation before transforming it to a canonical representation. Otherwise, statistical problems may arise, as is demonstrated in the following example, where we discuss the structure of the multinomial EF (see [20, pp. 4–6]).

Example: m(n0, •)

Example 2.46 Consider the representation of P = m(n0, •) in Example 2.8 for fixed n0 ∈ N with

C(p) = 1,  Zj(p) = ln(pj),  p ∈ Θ = {(q1, . . . , qk)^t ∈ (0, 1)^k : Σ_{i=1}^k qi = 1},

and

h(x) = n0!/(Π_{i=1}^k xi!),  Tj(x) = xj,  x ∈ X = {(y1, . . . , yk)^t ∈ N0^k : Σ_{i=1}^k yi = n0},

for j ∈ {1, . . . , k}. For ζ = (ζ1, . . . , ζk)^t ∈ R^k, it follows by the multinomial theorem that

∫ exp{ Σ_{j=1}^k ζj Tj(x) } h(x) dμX(x) = Σ_{x∈X} ( n0!/(Π_{i=1}^k xi!) ) Π_{j=1}^k ( e^{ζj} )^{xj} = ( Σ_{j=1}^k e^{ζj} )^{n0} .

Hence, we have Ξ* = R^k, and, since Z(Θ) ⊂ (−∞, 0)^k, P is not full. The full EF is given by P* = {P*_ζ : ζ ∈ R^k}, where P*_ζ has counting density

f*_ζ(x) = ( Σ_{j=1}^k e^{ζj} )^{−n0} exp{ Σ_{j=1}^k ζj xj } h(x),  x ∈ X,

for ζ ∈ R^k. Nevertheless, it holds that P = P*, which can be seen as follows. First, for ζ ∈ R^k, 1(k) = (1, . . . , 1)^t ∈ R^k, and a ∈ R, we have

f*_{ζ+a1(k)}(x) = ( Σ_{j=1}^k e^{ζj+a} )^{−n0} exp{ Σ_{j=1}^k (ζj + a) xj } h(x)
 = e^{−n0 a} exp{ a Σ_{j=1}^k xj } f*_ζ(x)
 = f*_ζ(x),  x ∈ X.

Hence, P*_ζ = P*_{ζ+a1(k)}. Now, let ζ ∈ R^k. We have to show that there exists some p ∈ Θ with m(n0, p) = P*_ζ. To this end, let p = (p1, . . . , pk)^t with

pi = exp{ ζi − ln( Σ_{j=1}^k e^{ζj} ) },  1 ≤ i ≤ k.

Obviously, pi > 0 for 1 ≤ i ≤ k, and Σ_{j=1}^k pj = 1, such that p ∈ Θ. Moreover, since f_{n0,p} = f*_{Z(p)} by definition, it follows for a = −ln( Σ_{j=1}^k exp{ζj} ) that

m(n0, p) = P*_{Z(p)} = P*_{ζ+a1(k)} = P*_ζ .

Hence, P = P* is shown. In particular, this yields that the parameter ζ ∈ Ξ* is not identifiable. This problem occurs, since the representation of P in Example 2.8 is not minimal. Indeed, we have ord(P) = k − 1, which is shown in the following. First, we rewrite

f_{n0,p}(x) = exp{ Σ_{j=1}^{k−1} ln(pj) xj + ln(pk) ( n0 − Σ_{j=1}^{k−1} xj ) } h(x)
 = pk^{n0} exp{ Σ_{j=1}^{k−1} (ln(pj) − ln(pk)) xj } h(x),  x ∈ X.   (2.9)

This is another representation of P with parameter functions Z̃j(p) = ln(pj) − ln(pk) for p ∈ Θ and statistics Tj(x) = xj for x ∈ X, 1 ≤ j ≤ k − 1. Hence,


ord(P) ≤ k − 1. Since ∅ is the only set N ⊂ X with ν(N) = 0, we show that T1, . . . , Tk−1 are affinely independent. Let a0, a1, . . . , ak−1 ∈ R with

0 = a0 + Σ_{j=1}^{k−1} aj Tj(x) = a0 + Σ_{j=1}^{k−1} aj xj   for all x ∈ X.   (2.10)

Moreover, we denote by ej^{(k)} the j-th unit vector in R^k. Then, setting x = n0 ek^{(k)} ∈ X in Eq. (2.10) gives a0 = 0, and the choice x = n0 ej^{(k)} ∈ X leads to aj = 0, 1 ≤ j ≤ k − 1. To show that Z̃1, . . . , Z̃k−1 are affinely independent, let p^{(0)} = 1(k)/k ∈ Θ and p^{(j)} = (1(k) + ej^{(k)})/(k + 1) ∈ Θ, 1 ≤ j ≤ k − 1. By setting Z̃ = (Z̃1, . . . , Z̃k−1)^t, we have

Z̃(p^{(j)}) − Z̃(p^{(0)}) = ln(2) ej^{(k−1)} ∈ R^{k−1},  1 ≤ j ≤ k − 1.

From Lemma 2.22, we obtain that Z̃1, . . . , Z̃k−1 are affinely independent, and applying Theorem 2.23 gives ord(P) = k − 1. In particular, representation (2.9) is minimal. Now, by using the multinomial theorem again, Ξ* = R^{k−1} = Z̃(Θ), such that P is full and regular. Finally, by setting ζj = ln(pj) − ln(pk), 1 ≤ j ≤ k − 1, in density (2.9) and by noticing that then

pk = 1 − Σ_{j=1}^{k−1} pj = 1 − pk Σ_{j=1}^{k−1} e^{ζj}   and thus   pk = ( 1 + Σ_{j=1}^{k−1} e^{ζj} )^{−1},

a minimal canonical representation of P is given by the counting densities

f̃*_ζ(x) = ( 1 + Σ_{j=1}^{k−1} e^{ζj} )^{−n0} exp{ Σ_{j=1}^{k−1} ζj xj } h(x),  x ∈ X,

for ζ ∈ R^{k−1}.
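The minimal canonical representation can also be checked numerically. The sketch below (all function names are ad hoc, and the small sample space with k = 3 is chosen only for illustration) evaluates the counting densities f̃*_ζ, confirms that they sum to one, and recovers the probability vector p from ζ.

```python
import numpy as np
from itertools import product
from math import factorial

def pmf_canonical(x, zeta, n0):
    # minimal canonical counting density of m(n0, .): zeta in R^(k-1), x in N0^k with sum n0
    k = len(x)
    c_star = (1.0 + np.sum(np.exp(zeta))) ** (-n0)                 # normalizing constant
    h = factorial(n0) / np.prod([factorial(int(xi)) for xi in x])  # multinomial coefficient
    return c_star * np.exp(np.dot(zeta, x[: k - 1])) * h

def p_from_zeta(zeta):
    # original parameter: p_j = exp(zeta_j) * p_k and p_k = (1 + sum_j exp(zeta_j))^(-1)
    denom = 1.0 + np.sum(np.exp(zeta))
    return np.append(np.exp(zeta), 1.0) / denom

n0, zeta = 4, np.array([0.3, -0.7])
sample_space = [np.array(x) for x in product(range(n0 + 1), repeat=3) if sum(x) == n0]
print(sum(pmf_canonical(x, zeta, n0) for x in sample_space))  # approximately 1.0
print(p_from_zeta(zeta))                                      # corresponding probability vector p
```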

2.4 Analytical Properties and Mean Value Parametrization

We now turn to useful analytical properties of the EF in natural parametrization. Essentially, they are consequences of the following lemma, which can be shown by using the Lebesgue dominated-convergence theorem; a detailed proof is provided, e.g., in [40, pp. 49/50].


Lemma 2.47 Let P be an EF with canonical representation (2.6), and let A ⊂ Ξ be an open set. Moreover, let the function ϕ : (X, B) → (R^1, B^1) be P*_ζ-integrable for all ζ = (ζ1, . . . , ζk)^t ∈ A. Then, the function δ : A → R defined by

δ(ζ) = ∫ ϕ(x) exp{ Σ_{j=1}^k ζj Tj(x) } h(x) dμ(x),  ζ ∈ A,

is infinitely often differentiable, and the partial derivatives can be obtained by differentiating under the integral sign, i.e.,

∂^{l1}/∂ζ1^{l1} · · · ∂^{lk}/∂ζk^{lk} δ(ζ) = ∫ ϕ(x) T1^{l1}(x) · · · Tk^{lk}(x) exp{ Σ_{j=1}^k ζj Tj(x) } h(x) dμ(x)   (2.11)

for ζ ∈ A and l1, . . . , lk ∈ N0.

Remark 2.48 Lemma 2.47 can be extended to functions δ with complex arguments: If δ is defined on the complex set {z = ζ + iη ∈ C^k : ζ ∈ A}, then δ is a holomorphic function of every variable zj for fixed other components zi, i ≠ j. As in Lemma 2.47, its (complex) derivatives can be obtained by differentiating under the integral sign. For a proof, see, e.g., [40, pp. 49/50].

The usefulness of Lemma 2.47 is that it yields existence and simple representations for the moments of T. Setting ϕ ≡ 1, which is P*_ζ-integrable for all ζ ∈ Ξ* as a constant function, and multiplying both sides of Eq. (2.11) by C*(ζ), we directly obtain the following theorem by noticing that Eζ[T1^{l1} · · · Tk^{lk}] = ∫ T1^{l1} · · · Tk^{lk} f*_ζ dμ.

Theorem 2.49 Let P be an EF with canonical representation (2.6). Then, for ζ ∈ int(Ξ), T = (T1, . . . , Tk)^t has finite multivariate moments of any order with respect to P*_ζ, which are given by

Eζ[T1^{l1} · · · Tk^{lk}] = C*(ζ) ∂^{l1}/∂ζ1^{l1} · · · ∂^{lk}/∂ζk^{lk} ∫ exp{ Σ_{j=1}^k ζj Tj } h dμ

for l1 , . . . , lk ∈ N0 . As another important finding, the mean and covariance matrix of T can be obtained as derivatives of the cumulant function κ = − ln(C ∗ ) introduced in Definition 2.43; the proof gives insight into the convenient structure of the canonical representation.


Corollary 2.50 Let P be an EF with canonical representation (2.6). Then, the functions C* and κ = −ln(C*) are infinitely often differentiable at every ζ ∈ int(Ξ), where

∇κ(ζ) = Eζ[T],  Hκ(ζ) = Covζ(T).

Here, ∇κ(ζ) and Hκ(ζ) denote the gradient and the Hessian matrix of κ evaluated at ζ.

Proof By Lemma 2.47 with ϕ ≡ 1, the function 1/C* and thus C* and κ = −ln(C*) are infinitely often differentiable at ζ ∈ int(Ξ). Moreover, with ν = hμ, we obtain that

∂/∂ζi κ(ζ) = ∂/∂ζi ln( ∫ e^{ζ^t T} dν ) = ( ∫ Ti e^{ζ^t T} dν )/( ∫ e^{ζ^t T} dν ) = Eζ[Ti]

for 1 ≤ i ≤ k. Applying Lemma 2.47 with ϕ = Ti, which is admissible by Theorem 2.49, then yields

∂²/(∂ζi ∂ζj) κ(ζ) = ∂/∂ζj [ ( ∫ Ti e^{ζ^t T} dν )/( ∫ e^{ζ^t T} dν ) ]
 = [ ( ∫ Ti Tj e^{ζ^t T} dν )( ∫ e^{ζ^t T} dν ) − ( ∫ Ti e^{ζ^t T} dν )( ∫ Tj e^{ζ^t T} dν ) ] / ( ∫ e^{ζ^t T} dν )²
 = Eζ[Ti Tj] − Eζ[Ti] Eζ[Tj] = Covζ(Ti, Tj)

for 1 ≤ i, j ≤ k.  □



Moments in EFs with canonical representation are easily determined by the respective normalizing constant C*(ζ).

Examples: nb(r0, •), N(•, •), G(•, •), Nk(•, Σ0)

Example 2.51 From Example 2.5, it is readily seen that the EF P = nb(r0, •) is regular with canonical representation

f*_ζ(x) = C*(ζ) e^{ζx} \binom{x + r0 − 1}{x},  x ∈ N0,

with C*(ζ) = (1 − exp{ζ})^{r0} for ζ ∈ Ξ* = (−∞, 0). Hence,

κ(ζ) = −r0 ln(1 − e^ζ),  ζ ∈ (−∞, 0),

and, by using Corollary 2.50, the mean and variance of the negative binomial distribution P*_ζ are given by

κ'(ζ) = r0 exp{ζ}/(1 − exp{ζ})   and   κ''(ζ) = r0 exp{ζ}/(1 − exp{ζ})².

In the original parametrization, this yields the expressions r0(1 − p)/p and r0(1 − p)/p² for the mean and variance of Pp = nb(r0, p).

Example 2.52 In Example 2.39, the canonical representation of P = N(•, •) yields

C*(ζ) = √(−2ζ2) exp{ζ1²/(4ζ2)},  ζ = (ζ1, ζ2)^t ∈ R × (−∞, 0),

and T(x) = (T1(x), T2(x))^t = (x, x²)^t, x ∈ R. Hence, by Corollary 2.50 with κ(ζ) = −ln(−2ζ2)/2 − ζ1²/(4ζ2),

Eζ[T1] = ∂κ(ζ)/∂ζ1 = −ζ1/(2ζ2)   and   Eζ[T2] = ∂κ(ζ)/∂ζ2 = −1/(2ζ2) + ζ1²/(4ζ2²)

for ζ = (ζ1, ζ2)^t ∈ R × (−∞, 0). In terms of the common parametrization of the univariate normal distribution (see Example 2.10), we find for a random variable X ∼ N(μ, σ²) with parameters (μ, σ²)^t = (−ζ1/(2ζ2), −1/(2ζ2))^t that the means of X = T1(X) and X² = T2(X) are given by μ and σ² + μ²; the variance of X is then equal to σ².

Example 2.53 For P = G(•, •), we conclude from Example 2.40 that

κ(ζ) = ln(Γ(ζ2)) − ζ2 ln(−ζ1),  ζ = (ζ1, ζ2)^t ∈ (−∞, 0) × (0, ∞).

Hence, by using Corollary 2.50, we have for T(x) = (x, ln(x))^t, x ∈ (0, ∞), that

Eζ[T] = ∇κ(ζ) = ( −ζ2/ζ1, ψ(ζ2) − ln(−ζ1) )^t,  ζ ∈ (−∞, 0) × (0, ∞),

where ψ(y) = [ln(Γ(y))]' = Γ'(y)/Γ(y), y ∈ (0, ∞), denotes the digamma function (see [22, p. 258]). Moreover, it follows that

Covζ(T) = Hκ(ζ) = \begin{pmatrix} ζ2/ζ1² & −1/ζ1 \\ −1/ζ1 & ψ1(ζ2) \end{pmatrix},  ζ ∈ (−∞, 0) × (0, ∞),

where ψ1 = ψ' denotes the trigamma function (see [22, p. 260]). In particular, the mean and variance of the gamma distribution P*_ζ are given by −ζ2/ζ1 and ζ2/ζ1².

Example 2.54 For the EF P = Nk(•, Σ0) of multivariate normal distributions as in Example 2.18, which is already given in natural parametrization (ζ = μ, C* = C), we have

C(μ) = exp{ −||μ||²_{Σ0}/2 } = exp{ −μ^t Σ0^{−1} μ/2 },  μ ∈ R^k.

Then, by Corollary 2.50 with κ(μ) = μ^t Σ0^{−1} μ/2,

Eμ[T] = ∇κ(μ) = Σ0^{−1}μ,  Covμ(T) = Hκ(μ) = Σ0^{−1},  μ ∈ R^k,

where T(x) = Σ0^{−1}x, x ∈ R^k. Hence, for a random vector X ∼ Nk(μ, Σ0), we find with the identity X = Σ0 T(X) that the mean and covariance matrix of X are given by Σ0 Eμ[T] = μ and Σ0 Covμ(T) Σ0 = Σ0.
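Corollary 2.50 lends itself to a quick numerical illustration: for the gamma family, finite differences of κ should reproduce the closed-form gradient and Hessian entries from Example 2.53. The following sketch is purely illustrative; the test point and step sizes are arbitrary choices.

```python
import numpy as np
from scipy.special import gammaln, digamma, polygamma

def kappa(z1, z2):
    # cumulant function of G(.,.): kappa(zeta) = ln Gamma(zeta2) - zeta2 * ln(-zeta1)
    return gammaln(z2) - z2 * np.log(-z1)

z1, z2, eps, eps2 = -2.0, 3.0, 1e-6, 1e-4

# gradient of kappa by central differences vs. closed form (Example 2.53)
d1 = (kappa(z1 + eps, z2) - kappa(z1 - eps, z2)) / (2 * eps)
d2 = (kappa(z1, z2 + eps) - kappa(z1, z2 - eps)) / (2 * eps)
print(d1, -z2 / z1)                           # E[X]    = -zeta2/zeta1
print(d2, digamma(z2) - np.log(-z1))          # E[ln X] = psi(zeta2) - ln(-zeta1)

# diagonal Hessian entries vs. the variances in Cov(T)
d11 = (kappa(z1 + eps2, z2) - 2 * kappa(z1, z2) + kappa(z1 - eps2, z2)) / eps2**2
d22 = (kappa(z1, z2 + eps2) - 2 * kappa(z1, z2) + kappa(z1, z2 - eps2)) / eps2**2
print(d11, z2 / z1**2)                        # Var(X)    = zeta2/zeta1^2
print(d22, polygamma(1, z2))                  # Var(ln X) = psi_1(zeta2)
```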

Lemma 2.47 and Corollary 2.50 may be applied to establish the following result. It is important for the derivation of optimal statistical tests.

Corollary 2.55 Under the assumptions of Lemma 2.47, the mapping

βϕ : A → R : ζ ↦ Eζ[ϕ]

is infinitely often differentiable with gradient

∇βϕ(ζ) = Eζ[ϕ T] − Eζ[ϕ] Eζ[T] = Covζ(ϕ, T),  ζ ∈ A.


Proof Since

βϕ(ζ) = C*(ζ) ∫ ϕ e^{ζ^t T} h dμ,  ζ ∈ A,

βϕ is infinitely often differentiable by Lemma 2.47 and Corollary 2.50. Moreover, for 1 ≤ i ≤ k, we obtain that

∂/∂ζi C*(ζ) = ∂/∂ζi ( ∫ e^{ζ^t T} h dμ )^{−1} = −( ∫ Ti e^{ζ^t T} h dμ ) / ( ∫ e^{ζ^t T} h dμ )² = −C*(ζ) Eζ[Ti]

and, hence,

∂/∂ζi βϕ(ζ) = ( ∂/∂ζi C*(ζ) ) ∫ ϕ e^{ζ^t T} h dμ + C*(ζ) ∫ ϕ Ti e^{ζ^t T} h dμ = Eζ[ϕ Ti] − Eζ[ϕ] Eζ[Ti].  □

There is an important function which describes the relationship between the natural parameter space Ξ* and the expectation space of the vector T of statistics. Its properties allow for another parametrization of a regular EF, and they are also useful in point estimation.

Theorem 2.56 Let P be a full EF with minimal canonical representation (2.6). Then, the so-called mean value function

π : int(Ξ*) → π(int(Ξ*)) : ζ ↦ Eζ[T]

has the following properties:
(a) π(ζ) = ∇κ(ζ) for every ζ ∈ int(Ξ*).
(b) π is continuously differentiable with Jacobian matrix

Dπ(ζ) = Hκ(ζ) = Covζ(T) > 0,  ζ ∈ int(Ξ*).

(c) π is bijective and has a continuously differentiable inverse function π^{−1} : π(int(Ξ*)) → int(Ξ*).

Proof Assertions (a) and (b) directly follow from Corollary 2.50 together with Theorem 2.23 and Lemma 2.26. For statement (c), let ζ, η ∈ int(Ξ*) with ζ ≠ η. By Theorems 2.23 and 2.44(b), κ is strictly convex on Ξ*. Hence, by using part (a),

(η − ζ)^t π(ζ) < κ(η) − κ(ζ),


and, for symmetry reasons, (ζ − η)^t π(η) < κ(ζ) − κ(η) (see, for instance, [19, pp. 69/70]). Summation of both inequalities leads to

(ζ − η)^t (π(η) − π(ζ)) < 0.

This yields π(ζ) ≠ π(η), and π is shown to be one-to-one. By definition, π is onto and thus bijective. The remaining part follows by assertion (b) and the inverse function theorem (see, e.g., [52, p. 221]).  □

Examples: b(n0 , •), IN(•, •), N(•, •)

Example 2.57 For P = b(n0, •), we obtain from Example 2.38 that

κ(ζ) = n0 ln(1 + e^ζ),  ζ ∈ R.

Hence, the mean value function is

π : R → (0, n0) : ζ ↦ n0 exp{ζ}/(1 + exp{ζ})

with inverse function

π^{−1} : (0, n0) → R : t ↦ ln( t/(n0 − t) ).

Example 2.58 For P = IN(•, •), we conclude from Example 2.41 that

κ(ζ) = −ln(−2ζ2)/2 − 2√(ζ1 ζ2),  ζ = (ζ1, ζ2)^t ∈ (−∞, 0)².

Hence, the mean value function is

π : (−∞, 0)² → π((−∞, 0)²) : ζ ↦ ( √(ζ2/ζ1), −1/(2ζ2) + √(ζ1/ζ2) )^t

with inverse function

π^{−1} : π((−∞, 0)²) → (−∞, 0)² : t ↦ ( 1/(2t1(1 − t1 t2)), t1/(2(1 − t1 t2)) )^t.


Example 2.59 For P = N(•, •), we obtain from Example 2.52 that the mean value function is given by

π : R × (−∞, 0) → π(R × (−∞, 0)) : ζ ↦ ( −ζ1/(2ζ2), (ζ1² − 2ζ2)/(4ζ2²) )^t

with inverse function

π^{−1} : π(R × (−∞, 0)) → R × (−∞, 0) : t ↦ ( t1/(t2 − t1²), 1/(2(t1² − t2)) )^t.

Moreover, we have

π(R × (−∞, 0)) = {(x, y)^t ∈ R² : x² < y}.
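Since π is a bijection onto {(x, y)^t : x² < y}, composing π and π^{−1} must return the original natural parameter. A short numerical round-trip check for the normal EF (values chosen arbitrarily, purely for illustration):

```python
import numpy as np

def pi(z1, z2):
    # mean value function of N(.,.) in natural coordinates: (E[X], E[X^2])
    return np.array([-z1 / (2 * z2), (z1**2 - 2 * z2) / (4 * z2**2)])

def pi_inv(t1, t2):
    # inverse mean value function; requires t1^2 < t2
    return np.array([t1 / (t2 - t1**2), 1.0 / (2 * (t1**2 - t2))])

zeta = np.array([1.5, -0.25])            # any point in R x (-inf, 0)
print(pi_inv(*pi(*zeta)), zeta)          # round trip recovers zeta
m = pi(*zeta)
print(m, m[0]**2 < m[1])                 # the image point satisfies x^2 < y
```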

In Theorem 2.56, the image of the mapping π can be stated explicitly provided that the EF is steep (see [4, p. 117] and [20, p. 71] for the definition and an illustration). The concept of steepness is originally introduced in a more general context for convex functions. Here, for simplicity, we give a definition directly for the full EF with minimal canonical representation.

Definition 2.60 Let P be a full EF with minimal canonical representation (2.6). P is called steep if, for all η ∈ Ξ* \ int(Ξ*) and ζ ∈ int(Ξ*), we have

lim_{α↗1} ∂/∂α κ(ζ^{(α)}) = lim_{α↗1} (η − ζ)^t ∇κ(ζ^{(α)}) = ∞,   (2.12)

where ζ^{(α)} = ζ + α(η − ζ), α ∈ (0, 1).

Roughly speaking, Eq. (2.12) ensures that the cumulant function of the EF has infinite slope when moving towards boundary points in ∗ on straight lines. A regular EF trivially meets the condition, since such boundary points do not exist in that case.


Lemma 2.61 Let P be an EF with minimal canonical representation (2.6). Then, the following implications are true:

P is regular  ⟹  P is steep  ⟹  P is full.

As pointed out in Example 2.41, the full EF of inverse normal distributions is not regular, but it can be shown to be steep (see [4, p. 117] or [20, p. 72]). An example for a full EF which is not steep is presented in [20, p. 86]. Remark 2.62 In Theorem 2.56, we have π(int (∗ )) ⊂ int (M) ,

(2.13)

where M denotes the convex support of ν T , i.e., the closed convex hull of the support of ν T (see, e.g., [37, p. 668]). If P is steep, then equality holds true in formula (2.13); for a proof, see, e.g., [20, pp. 74/75]. Recall that the support of ν T coincides with the support of (Pζ∗ )T for every ζ ∈ ∗ , since these measures are all equivalent; this finding may ease to derive M. The properties of the mean value function π in Theorem 2.56 allow for the following parametrization of a regular EF.

Definition 2.63 Let P be a regular EF with minimal canonical representation (2.6). Then,

P̃m = P*_{π^{−1}(m)},  m ∈ π(Ξ*) = int(M),

is called the mean value parametrization of P, where M is as defined in Remark 2.62.

Some examples illustrate the mean value parametrization of an EF. Examples: poi(•), Nk (•,  0 )

Example 2.64 According to Example 2.37, P = poi(•) is regular with canonical representation given by the counting densities

f*_ζ(x) = C*(ζ) e^{ζx} (1/x!),  x ∈ N0,

with C*(ζ) = exp{−e^ζ} for ζ ∈ Ξ* = R. Hence, the cumulant function is

κ(ζ) = e^ζ,  ζ ∈ R,

and we conclude by Corollary 2.50 that the mean and variance of the statistic T(x) = x, x ∈ N0, under P*_ζ are given by

Eζ[T] = κ'(ζ) = e^ζ,  Varζ(T) = κ''(ζ) = e^ζ.

Here, the mean value function π is

π : R → (0, ∞) : ζ ↦ e^ζ

with inverse function

π^{−1} : (0, ∞) → R : t ↦ ln(t).

To obtain the mean value parametrization, we set ζ = π^{−1}(m) = ln(m) in f*_ζ and arrive at the counting densities

f̃m(x) = e^{−m} m^x / x!,  x ∈ N0,

for m ∈ (0, ∞), which is the common representation of poi(•) as introduced in Example 2.3. In this parametrization, mean and variance of T under P̃m are both equal to m.

Example 2.65 For P = Nk(•, Σ0), we obtain from Example 2.54 that the mean value function π and its inverse function π^{−1} are given by

π : R^k → R^k : μ ↦ Σ0^{−1}μ   and   π^{−1} : R^k → R^k : t ↦ Σ0 t.

To find the mean value parametrization, we set μ = π^{−1}(m) = Σ0 m in f_{μ,Σ0} (see Example 2.18) and obtain the Lebesgue densities

f̃m(x) = exp{ −||m||²_{Σ0^{−1}}/2 } e^{m^t x} h(x),  x ∈ R^k,

for m ∈ R^k. In this parametrization, mean vector and covariance matrix of T under P̃m are given by m and Σ0^{−1}, respectively. Note that the parametrizations coincide if Σ0 = Ek, i.e., if Σ0 is the k-dimensional unit matrix.

Related to the mean value parametrization is the so-called variance function, which leads to a wide range of publications with structural considerations of EFs.

Remark 2.66 An EF P = {P*_ζ : ζ ∈ int(Ξ*)} with minimal canonical representation (2.6) and T(x) = x, x ∈ X, i.e., with ν-densities

dP*_ζ/dν (x) = C*(ζ) e^{ζ^t x} = e^{ζ^t x − κ(ζ)},  x ∈ R^k,   (2.14)

for ζ ∈ int(Ξ*), forms a natural EF with generating measure ν as introduced in [37, Chapter 54] and Remark 2.36. Here, minimality of representation (2.14) ensures that int(Ξ*) ≠ ∅ and that the support of ν is not contained in an affine hyperplane of R^k. If P is moreover steep, the variance function of P is defined by

VP(m) = Cov_{P̃m}(X),  m ∈ int(M),

i.e., VP maps the mean m of the random vector X to the covariance matrix of X under P̃m, where int(M) is the domain of the means as defined in Remark 2.62 (see, e.g., [37, pp. 669/670]). By Theorem 2.56, we have the identity

VP(m) = Hκ(π^{−1}(m)),  m ∈ int(M).

In Example 2.64, P forms a natural EF with linear variance function VP (m) = m, m ∈ (0, ∞). For  0 = Ek , the EF P in Example 2.65 is natural and has constant variance function VP (m) = Ek , m ∈ Rk . One motivation for considering variance functions is that it enables to characterize, classify, and construct natural EFs. The interested reader is referred to, e.g., [41], [37, pp. 669/670 and pp. 678–691], and the references therein. A recent overview on the topic is provided in [3].

Chapter 3

Distributional and Statistical Properties

In the third chapter, some further distributional and structural properties of an EF are examined, and basic results for statistical inference in EFs are provided.

3.1 Generating Functions

At the beginning of Sect. 2.2, we have mentioned that the vector T = (T1, . . . , Tk)^t of statistics is fundamental in statistical inference. Theorem 2.49 already shows how to derive the moments of T in the EF with natural parametrization. To find the distribution of T under P*_ζ and for several other applications, it is useful to have simple representations of the moment generating function mT(a) = Eζ[exp{a^t T}], a ∈ R^k, and the characteristic function ϕT(a) = Eζ[exp{i a^t T}], a ∈ R^k, available. These are stated in the following lemma.

Lemma 3.1 Let P be an EF with canonical representation (2.6), and let ζ ∈ Ξ.
(a) If A ⊂ R^k is such that ζ + A = {ζ + a : a ∈ A} ⊂ Ξ, then

mT(a) = C*(ζ)/C*(ζ + a),  a ∈ A,

where integration has been carried out with respect to P*_ζ.
(b) If the function C* is extended to the complex set {ζ + ia ∈ C^k : a ∈ R^k} in a canonical way, then

ϕT(a) = C*(ζ)/C*(ζ + ia),  a ∈ R^k,

where integration has been carried out with respect to P*_ζ.


Proof (a) Let ζ ∈ Ξ and ζ + A ⊂ Ξ. Obviously, we have for a ∈ A

mT(a) = Eζ[e^{a^t T}] = ∫ e^{a^t T(x)} f*_ζ(x) dμ(x)
 = ( C*(ζ)/C*(ζ + a) ) ∫ C*(ζ + a) e^{(ζ+a)^t T(x)} h(x) dμ(x)
 = C*(ζ)/C*(ζ + a).

(b) The representation is obtained by proceeding as in the proof of statement (a).  □

Examples: poi(•), b(n0 , •), m(n0 , •), N(•, •), G(•, •), Nk (•,  0 )

Example 3.2 For P = poi(•), we have by Example 2.37 that C ∗ (ζ ) = exp{−eζ }, ζ ∈ R, and T (x) = x, x ∈ X = N0 . Application of Lemma 3.1(a) then yields   mT (b) = exp eζ (eb − 1) ,

b ∈ R,

where integration is with respect to Pζ∗ . The transformation ζ = ln(a) then gives the moment generating function of T in the original parametrization (see Example 2.3), i.e.,   mT (b) = exp a(eb − 1) ,

b ∈ R,

where integration is with respect to Pa = poi(a). Example 3.3 For P = b(n0 , •), we have by Example 2.38 that C ∗ (ζ ) = (1 + exp{ζ })−n0 , ζ ∈ R, and T (x) = x, x ∈ X = {0, 1, . . . , n0 }. Application of Lemma 3.1(a) then yields  mT (a) =

1 + exp{ζ + a} 1 + exp{ζ }

n0 ,

a ∈ R,

where integration is with respect to Pζ∗ . The transformation ζ = ln(p/(1 − p)) then gives the moment generating function of T in the original parametrization


(see Example 2.4), i.e., n  mT (a) = 1 − p + pea 0 ,

a ∈ R,

where integration is with respect to Pp = b(n0 , p). Example 3.4 For P = m(n0 , •) with minimal canonical representation f˜ζ∗ , ζ ∈  −n0 for Rk−1 , as shown in Example 2.46, we have C ∗ (ζ ) = (1 + k−1 j =1 exp{ζj }) k−1 t t and T (x) = (x1 , . . . , xk−1 ) for x ∈ X = {(y1 , . . . , yk ) ∈ Nk0 : ζ ∈ R k i=1 yi = n0 }. Application of Lemma 3.1(a) then yields ' mT (a) =

k−1

j =1 exp{ζj + aj }  1 + k−1 j =1 exp{ζj }

1+

(n0 ,

a = (a1 , . . . , ak−1 )t ∈ Rk−1 ,

where integration is with respect to P˜ζ∗ with counting density f˜ζ∗ . For k = 2, we are in the case of Example 3.3. Example 3.5 In Example 2.39, the canonical representation of P = N(•, •) yields ∗

C (ζ ) =

$

"

ζ12 −2ζ2 exp 4ζ2

# ,

ζ = (ζ1 , ζ2 )t ∈ R × (−∞, 0) ,

and T (x) = (T1 (x), T2 (x))t = (x, x 2 )t , x ∈ R. Then, by Lemma 3.1(b), the characteristic function of T1 is C ∗ (ζ ) C ∗ (ζ + i(a1 , 0)t ) # " # " a12 ζ12 iζ1 a1 (ζ1 + ia1 )2 = exp − , − + = exp 4ζ2 4ζ2 2ζ2 4ζ2

ϕT1 (a1 ) = ϕT ((a1 , 0)t ) =

a1 ∈ R ,

where integration is with respect to Pζ∗ . In terms of the common parametrization of the univariate normal distribution (see Example 2.10), we find for the characteristic function of a random variable X = T1 (X) ∼ N(μ, σ 2 ) with (ζ1 , ζ2 ) = (μ/σ 2 , −1/(2σ 2 )) the well known expression

σ 2t 2 ϕT1 (t) = exp iμt − , 2

t ∈ R,

where integration is with respect to Pμ,σ 2 = N(μ, σ 2 ).
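Lemma 3.1(b) can be checked directly in this example: evaluating C*(ζ)/C*(ζ + i(a1, 0)^t) with complex arithmetic should reproduce exp{iμa1 − σ²a1²/2}. The following sketch does this for one (arbitrary) parameter choice; the function name is ad hoc.

```python
import numpy as np

def C_star(z1, z2):
    # normalizing constant of N(.,.) in natural coordinates (also accepts complex z1)
    return np.sqrt(-2 * z2) * np.exp(z1**2 / (4 * z2))

mu, sigma2, a1 = 1.0, 2.0, 0.7
z1, z2 = mu / sigma2, -1.0 / (2 * sigma2)

phi_via_ratio = C_star(z1, z2) / C_star(z1 + 1j * a1, z2)   # Lemma 3.1(b)
phi_closed = np.exp(1j * mu * a1 - sigma2 * a1**2 / 2)      # well-known closed form
print(phi_via_ratio, phi_closed)                            # the two values agree
```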


Example 3.6 For P = G(•, •), we have by Example 2.40 that C ∗ (ζ ) = (−ζ1 )ζ2 / (ζ2 ), ζ = (ζ1 , ζ2 )t ∈ (−∞, 0) × (0, ∞), and T (x) = (x, ln(x))t , x ∈ (0, ∞). Application of Lemma 3.1(a) then yields mT (a) =

(−ζ1 )ζ2

(ζ2 + a2 )

(ζ2 ) (−ζ1 − a1 )ζ2 +a2

for a = (a1 , a2 )t ∈ (−∞, −ζ1 ) × (−ζ2 , ∞), where integration is with respect to Pζ∗ . The transformation ζ1 = −1/α, ζ2 = β then gives the moment generating function of T in the original parametrization (see Example 2.12), i.e., mT (a) =

α a2

(β + a2 ) ,

(β) (1 − αa1 )β+a2

a ∈ (−∞, 1/α) × (−β, ∞) ,

where integration is with respect to Pα,β = G(α, β). Example 3.7 For the EF P = Nk (•,  0 ) of multivariate normal distributions as in Example 2.18, we have " C(μ) = exp −

||μ||2 0 2

#

1 t −1 = exp − μ  0 μ , 2

μ ∈ Rk ,

k ∗ and T (x) =  −1 0 x, x ∈ R . Hence, by Lemma 3.1(b) with ζ = μ and C = C (cf. Example 2.54), the characteristic function of T is given by



1 t −1 1 t −1 ϕT (a) = exp − μ  0 μ + (μ + ia)  0 (μ + ia) 2 2

1 t −1 t −1 = exp iμ  0 a − a  0 a , a ∈ Rk , 2 where integration is with respect to Pμ, 0 = Nk (μ,  0 ). This yields the well known expression for the characteristic function of a random vector X =  0 T (X) ∼ Nk (μ,  0 ) as

1 ϕ0 T (t) = ϕT ( 0 t) = exp iμt t − t t  0 t , 2

t ∈ Rk .


It is known in general that if the moment generating function mT of a random vector T with values in R^k is finite in a neighborhood of 0 ∈ R^k, then all partial derivatives of mT exist finitely, and moments of T can be obtained by calculating respective partial derivatives at 0. The logarithm of a moment generating function is called a cumulant generating function, and its partial derivatives at 0 are termed cumulants of T, which are applied, for instance, in Edgeworth expansions (see, e.g., [36] and [35]). For further details on moments and cumulants as well as their interrelation, we refer to the literature.

Remark 3.8 In the situation of Lemma 3.1(a) with k = 1 (one-parameter case), let KT = ln(mT) denote the cumulant generating function of T. Then, we find the relation

KT(a) = ln( C*(ζ)/C*(ζ + a) ) = κ(ζ + a) − κ(ζ),  a ∈ A,

to the cumulant function κ = −ln(C*) of the EF (see Definition 2.43). If 0 ∈ int(A), the n-th cumulant KT^{(n)} of T is defined as the n-th derivative of KT at 0, i.e., we have

KT^{(n)} = ∂^n/∂a^n KT(a)|_{a=0} = ∂^n/∂ζ^n κ(ζ).

In particular, we obtain the first two cumulants KT^{(1)} and KT^{(2)} of T, which are known to give the mean and variance of T under P*_ζ, as respective derivatives of κ (as already shown in Corollary 2.50).
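For a one-parameter EF the cumulants of T are therefore just the derivatives of κ at ζ. A small numerical illustration (helper names and the finite-difference scheme are ad hoc) for the Poisson family, where κ(ζ) = e^ζ and hence all cumulants equal e^ζ = m:

```python
import numpy as np
from math import comb, log

def kappa(z):
    # cumulant function of poi(.): kappa(zeta) = exp(zeta)
    return np.exp(z)

def nth_cumulant(zeta, n, eps=1e-3):
    # n-th derivative of K_T(a) = kappa(zeta + a) - kappa(zeta) at a = 0 via central differences;
    # the additive constant kappa(zeta) drops out for n >= 1
    ks = np.arange(n + 1)
    coeffs = np.array([(-1.0) ** k * comb(n, k) for k in range(n + 1)])
    return np.sum(coeffs * kappa(zeta + (n / 2.0 - ks) * eps)) / eps**n

zeta = log(2.5)                                               # Poisson mean m = e^zeta = 2.5
print([round(nth_cumulant(zeta, n), 4) for n in (1, 2, 3)])   # all approximately 2.5
```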

3.2 Marginal and Conditional Distributions

It is natural to ask whether the EF structure is preserved by considering marginal and conditional distributions. To begin with, the restriction of the sample space to a subset Y ⊂ X yields an EF, again (see, e.g., [44, p. 37]).

Lemma 3.9 Let P be an EF with representation (2.1) and ν = hμ as in Lemma 2.15. Moreover, let Y ∈ B with ν(Y) > 0, and let Qϑ(·) = Pϑ(·|Y) for ϑ ∈ Θ. Then, the family {Qϑ : ϑ ∈ Θ} of conditional distributions forms an EF with μ-density

dQϑ/dμ (y) = C_Y(ϑ) exp{ Σ_{j=1}^k Zj(ϑ) Tj(y) } h(y),  y ∈ Y,

and normalizing constant C_Y(ϑ) for ϑ ∈ Θ.


Proof Let ϑ ∈ Θ. Since ν(Y) > 0, we have Pϑ(Y) > 0 by Remark 2.16. Then, for A ∈ B ∩ Y, we find

Qϑ(A) = Pϑ(A ∩ Y)/Pϑ(Y) = ( ∫_A fϑ dμ )/( ∫_Y fϑ dμ ),

and thus

dQϑ/dμ = fϑ / ∫_Y fϑ dμ,

where ∫_Y fϑ dμ = C(ϑ)/C_Y(ϑ), say.  □

Example: E(•)

Example 3.10 Let P = E(•), Pα = E(α) for α ∈ (0, ∞), and Y = [b, ∞) for some fixed constant b ∈ (0, ∞). Then, for α ∈ (0, ∞), Pα(·|Y) has Lebesgue density

C_Y(α) exp{−y/α},  y ∈ [b, ∞),

where

C_Y(α) = ( ∫_b^∞ exp{−y/α} dy )^{−1} = exp{b/α}/α,  α ∈ (0, ∞).

Hence, {Pα(·|Y) : α ∈ (0, ∞)} is the EF of exponential distributions with location parameter b.
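A quick numerical confirmation that C_Y(α) = exp{b/α}/α indeed normalizes the conditional density (illustrative sketch, arbitrary parameter values):

```python
import numpy as np
from scipy.integrate import quad

def conditional_density(y, alpha, b):
    # Lebesgue density of E(alpha) conditioned on [b, infinity): C_Y(alpha) * exp(-y/alpha)
    return np.exp(b / alpha) / alpha * np.exp(-y / alpha) if y >= b else 0.0

alpha, b = 1.7, 0.5
mass, _ = quad(conditional_density, b, np.inf, args=(alpha, b))
print(mass)   # approximately 1.0
```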

The conditional EF in Lemma 3.9 is not to be mixed up with the truncated EF, which does not form an EF in the proper sense (see Sect. 2.1). For the EF PT of distributions of T , the following results for marginal and conditional distributions can be shown; for the proof, see, e.g., [35, pp. 255–257]. Theorem 3.11 Let P be an EF with canonical representation (2.6) and k ≥ 2. Moreover, let U = (T1 , . . . , Tr )t and V = (Tr+1 , . . . , Tk )t for some fixed r ∈ {1, . . . , k − 1}. (a) Let ζ1 , . . . , ζr be fixed. Then, there exists a σ -finite measure Qζ1 ,...,ζr on Bk−r depending on ζ1 , . . . , ζr , such that the distributions of V form an EF with


densities

d(P*_ζ)^V / dQ_{ζ1,...,ζr} (t_{r+1}, . . . , tk) = C(ζ) exp{ Σ_{j=r+1}^k ζj tj },  t_{r+1}, . . . , tk ∈ R.   (3.1)

(b) There exists a σ-finite measure Q̃v on B^r, such that the conditional distributions of U given V = v form an EF with densities

d(P*_ζ)^{U|V=v} / dQ̃v (t1, . . . , tr) = Cv(ζ1, . . . , ζr) exp{ Σ_{j=1}^r ζj tj },  t1, . . . , tr ∈ R,

where Cv(ζ1, . . . , ζr) denotes a normalizing constant. Q̃v and Cv(ζ1, . . . , ζr) are independent of ζ_{r+1}, . . . , ζk.

We shall highlight the findings in Theorem 3.11 from an inferential perspective. Suppose that we are only interested in estimating the parameters ζj, j ∈ J, for some index set J ⊊ {1, . . . , k}. If the parameters ζj, j ∉ J, are assumed to be known, inference may be carried out in the marginal model based on the distribution of (Tj)_{j∈J} forming an EF by varying ζj, j ∈ J. However, if ζj, j ∉ J, are unknown, the dominating measure and normalizing constant in the marginal density will be unknown as well, such that inference based on the marginal model will not be feasible or at least gets more involved. In contrast, conditioning (Tj)_{j∈J} on (Tj)_{j∉J} will eliminate the unknown nuisance parameters ζj, j ∉ J, while maintaining the EF structure of the model.

Remark 3.12 If the density in formula (3.1) does not depend on ζ1, . . . , ζr, V is called a cut for P (see, e.g., [37, pp. 673–676], and the references cited therein). Necessary and sufficient conditions for a statistic to be a cut for a (natural) EF can be found in [4, Chapter 10]. For cuts and the related topic of mixed parametrizations, the reader is also referred to [44, pp. 240–245] and [20, pp. 78–81].

3.3 Product Measures We are aiming for statistical inference with EFs in later sections, where we assume to have a sample of independent random variables. Hence, we shall now focus on the corresponding family of product measures. From the respective product densities it becomes clear +n that the EF structure is preserved for the product case. In what follows, let i=1 Bi denote the product σ -algebra of the σ -algebras B1 , . . . , Bn + and ni=1 Qi denote the product measure of the σ -finite measures n for +n Q1 , . . . , Q(n) n = n ∈ N. Moreover, for n ∈ N, we use the notation B B and Q = i=1 +n Q for the n-fold product σ -algebra of a σ -algebra B and the n-fold product i=1 measure of a σ -finite measure Q, respectively.


Theorem 3.13
(a) For 1 ≤ i ≤ n, let (Xi, Bi) be a measurable space and Pi = {P_{ϑi;i} : ϑi ∈ Θi} be an EF of distributions on Bi dominated by a σ-finite measure μi. Then, the class

{ ⊗_{i=1}^n P_{ϑi;i} : ϑi ∈ Θi, 1 ≤ i ≤ n }

of product measures on ⊗_{i=1}^n Bi forms an EF dominated by ⊗_{i=1}^n μi.
(b) Let P be an EF with representation (2.1). For n ∈ N, let P_ϑ^{(n)} = ⊗_{i=1}^n P_ϑ for P_ϑ ∈ P. Then, the class

P^{(n)} = { P_ϑ^{(n)} : ϑ ∈ Θ }

of product measures on B^n forms an EF dominated by μ^{(n)}, where, for ϑ ∈ Θ, a μ^{(n)}-density of P_ϑ^{(n)} is given by

f_ϑ^{(n)}(x̃) = C(ϑ)^n exp{ Σ_{j=1}^k Zj(ϑ) Tj^{(n)}(x̃) } ( Π_{i=1}^n h(x^{(i)}) )

with statistics

Tj^{(n)}(x̃) = Σ_{i=1}^n Tj(x^{(i)}),  1 ≤ j ≤ k,   (3.2)

for x˜ = (x (1), . . . , x (n) ) ∈ Xn . Consequently, the joint distribution of independent random variables X1 , . . . , Xn forms an EF if the distribution of Xi , 1 ≤ i ≤ n, forms an EF. In Theorem 3.13, case (a) corresponds to a sample of independent not necessarily identically distributed (inid) random variables; case (b) then refers to the situation of an independent and identically distributed (iid) sample. Examples: poi(•), N(•, •), Nk (•,  0 )

Example 3.14 Let X1 , . . . , Xn be independent random variables with Xi ∼ ˜ = poi(ai ) for 1 ≤ i ≤ n as in Example 2.3. Then, the distribution of X (X1 , . . . , Xn ) forms a multiparameter EF with representation (2.1) given by μ = μ(n) N0 , k = n, "

C(a) = exp{ −Σ_{i=1}^n ai },  Zj(a) = ln(aj),  1 ≤ j ≤ n,

for a = (a1, . . . , an)^t ∈ (0, ∞)^n, and

Tj(x̃) = xj,  1 ≤ j ≤ n,  h(x̃) = Π_{i=1}^n (1/xi!)

for x˜ = (x1 , . . . , xn ) ∈ Nn0 . Clearly, this EF has order n and is regular. On the other hand, by assuming that a1 = · · · = an , we are in the iid situation with an EF of order 1. Obviously, we can generate EFs with one parameter up to n parameters, here. Now, suppose that the parameters a1 , . . . , an are connected via a log-linear link function, i.e., we set ln(aj ) = ζ1 + ζ2 vj , 1 ≤ j ≤ n, where ζ1 and ζ2 are real-valued parameters and v1 , . . . , vn are fixed real numbers, which ˜ turns out to form a regular are not all identical. By inserting, the distribution of X EF of order 2 with canonical representation (2.6), where " n #  ∗ ζ1 +ζ2 vi C (ζ ) = exp − e , ζ = (ζ1 , ζ2 )t ∈ R2 , i=1

and T(x̃) = ( Σ_{i=1}^n xi, Σ_{i=1}^n vi xi )^t

for x˜ = (x1 , . . . , xn ) ∈ Nn0 (see [44, p. 35]).
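The reduction from n parameters to the two canonical statistics under the log-linear link can be verified directly: the joint pmf of independent poi(a_i) observations with ln(a_i) = ζ1 + ζ2 v_i depends on the data only through (Σ x_i, Σ v_i x_i). A small sketch (function names, covariate values, and the data point are arbitrary illustrations):

```python
import numpy as np
from math import factorial
from scipy.stats import poisson

def joint_pmf_product(x, zeta, v):
    # product of Poisson pmfs with log-linear means a_i = exp(zeta1 + zeta2 * v_i)
    a = np.exp(zeta[0] + zeta[1] * v)
    return np.prod(poisson.pmf(x, a))

def joint_pmf_canonical(x, zeta, v):
    # canonical form: C*(zeta) * exp(zeta1 * sum(x) + zeta2 * sum(v*x)) * h(x)
    a = np.exp(zeta[0] + zeta[1] * v)
    c_star = np.exp(-np.sum(a))
    h = 1.0 / np.prod([factorial(int(xi)) for xi in x])
    return c_star * np.exp(zeta[0] * np.sum(x) + zeta[1] * np.sum(v * x)) * h

zeta, v = np.array([0.2, -0.5]), np.array([1.0, 2.0, 4.0])
x = np.array([3, 1, 0])
print(joint_pmf_product(x, zeta, v), joint_pmf_canonical(x, zeta, v))   # identical values
```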

Example 3.15 Let X1 , . . . , Xn1 be an iid sample from N(μ1 , σ12 ) and Y1 , . . . , Yn2 be an iid sample from N(μ2 , σ22 ) as introduced in Example 2.10. Moreover, let the samples be independent. Then, the distribution of (X1 , . . . , Xn1 , Y1 , . . . , Yn2 ) forms an EF with representation (2.1) given by μ = λnR1 +n2 , k = 4, # " - .−n1 /2 - .−n2 /2 n1 μ21 n2 μ22 2 2 , C(ϑ) = σ1 σ2 exp − − 2σ12 2σ22 ' Z(ϑ) =

μ1 μ2 1 1 , 2,− 2,− 2 2 σ1 σ2 2σ1 2σ2

(t ,

for ϑ = (μ1 , μ2 , σ12 , σ22 )t ∈ R2 × (0, ∞)2 , and ⎛ ⎞t n1 n2 n1 n2     T (x) ˜ = ⎝ xi , yj , xi2 , yj2 ⎠ , i=1

j =1

i=1

j =1

 h(x) ˜ =

1 √ 2π

n1 +n2 ,

for x˜ = (x1 , . . . , xn1 , y1 , . . . , yn2 ) ∈ Rn1 +n2 . Clearly, the EF has order 4 and is regular. Now, suppose that σ12 = σ22 = σ 2 , say, i.e., we have a common-scale

52

3 Distributional and Statistical Properties

˜ turns out to form an EF with mappings model. By inserting, the distribution of X ˜ = C(ϑ) ˜ = Z(ϑ)

# " - .−(n1 +n2 )/2 n1 μ21 + n2 μ22 2 σ , exp − 2σ 2 

μ1 μ 2 1 , ,− 2 σ2 σ2 2σ

t ,

for ϑ˜ = (μ1 , μ2 , σ 2 )t ∈ R2 × (0, ∞), and ⎛ T (x) ˜ = ⎝

n1  i=1

xi ,

n2  j =1

yj ,

n1  i=1

xi2 +

n2 

⎞t yj2 ⎠ ,

j =1

for x˜ = (x1 , . . . , xn1 , y1 , . . . , yn2 ) ∈ Rn1 +n2 . This is a regular EF of order 3. In contrast, the common-location assumption μ1 = μ2 leads to a curved EF of order 4 (see also [35, pp. 87/88]). Example 3.16 Let X1 , . . . , Xn be independent random variables with Xi ∼ N(μi , σ02 ), i ∈ {1, . . . , n}, for some fixed variance σ02 > 0. Then, by Theorem 3.13(a), the joint distributions of X1 , . . . , Xn form an EF of distributions on (Rn , Bn ) with parameters μ1 , . . . , μn ∈ R. Now, let X(1) , . . . , X(m) be iid random vectors following a multivariate normal distribution with mean vector μ = (μ1 , . . . , μn )t ∈ Rn and fixed positive definite covariance matrix  0 ∈ Rn×n , i.e., X(i) ∼ Nn (μ,  0 ), i ∈ {1, . . . , m}. Then, by Theorem 3.13(b), the joint distributions of X (1) , . . . , X (m) form an EF of distributions on (Rmn , Bmn ) with parameters μ1 , . . . , μn ∈ R.

Based on Theorem 3.13, the latter examples show the flexibility of constructing EFs. Suppose that, in the situation of Example 3.16, we want to have a 3-parameter EF of distributions on (R6 , B6 ). This is achieved, for instance, by choosing n = 6 in the independent case while assuming μ2 = μ3 and μ4 = μ5 = μ6 , or by setting n = 3 and m = 2 in the iid case. Given the iid situation, several properties of the EF are transferred to the product EF. In what follows, let T (n) = (T1(n) , . . . , Tk(n) )t with Tj(n) , 1 ≤ j ≤ k, as defined in formula (3.2), be the vector of statistics for the n-fold product case, n ∈ N. Corollary 3.17 Let P be an EF with representation (2.1), and let P(n) , n ∈ N, denote the corresponding EF of n-fold product measures. Then, ord(P(n)) = ord(P). Moreover, if P is full (regular, steep), so is P(n) .


Proof Considering T^{(n)} as a function of iid random variables X1, . . . , Xn ∼ Pϑ, it follows that

Cov_{Pϑ^{(n)}}(T^{(n)}) = n Cov_{Pϑ}(T).

Hence, P and P^{(n)} have the same order by applying Lemma 2.26 and Theorem 2.23. Moreover, by Fubini's theorem, we have for ζ ∈ R^k that

∫ e^{ζ^t T^{(n)}} h_n dμ^{(n)} = ( ∫ e^{ζ^t T} h dμ )^n,

where h_n(x̃) = Π_{i=1}^n h(x^{(i)}) for x̃ = (x^{(1)}, . . . , x^{(n)}) ∈ X^n, such that the natural parameter spaces of P^{(n)} and P coincide. Trivially, P^{(n)} is then full (regular) if P is full (regular). Finally, since the cumulant function of P^{(n)} is given by nκ, P^{(n)} is steep if P is steep.  □

Note that Lemma 2.17 and Theorem 3.13(b) with T (x) = x, x ∈ X, imply that the distribution of a sum of iid random variables forms an EF if this is the case for the distribution of the summands (for more details, see [37, pp. 670–673]). In other words, the EF structure is preserved for the family of n-fold convolutions. Here, the cumulant function for the n-fold convolution family is given by nκ, n ∈ N. For the steep natural EF P with ν-densities (2.14), this finding suggests the construction of a more general family of distributions. Let  ⊂ (0, ∞) denote the set of all parameters λ, for which λκ is the cumulant function of a natural EF generated by some measure νλ on Bk . Then, D = {Pζ∗,λ : ζ ∈ int (∗ ), λ ∈ } is called the exponential dispersion model associated with P, where a νλ -density of Pζ∗,λ is given by dPζ∗,λ dνλ

(x) = eζ

t x−λκ(ζ )

,

x ∈ Rk ,

for ζ ∈ int (∗ ) and λ ∈ . D is the union of the natural EFs {Pζ∗,λ : ζ ∈ int (∗ )} with variance functions Vλ (m) = λVP (m/λ), m ∈ int (λM), for λ ∈  (see Remark 2.66). The parameter 1/λ is called the dispersion parameter, and  is also referred to as the Jørgensen set in the literature. For an extensive account on the model, see [29]. Applications of exponential dispersion models can be found, for instance, in [3] and the references cited therein. For their particular use in generalized linear models and actuarial risk theory, see, e.g., [61, 23, 56, 30].


3.4 Sufficiency and Completeness

The concepts of sufficiency and completeness of a statistic for a family of distributions are fundamental in various aspects of statistical inference as, for example, the derivation of optimal point estimators and statistical tests. Roughly speaking, a sufficient statistic contains and summarizes all information in the data about the underlying true distribution, and it comprises no redundant information if it is complete, too. In general, sufficiency is introduced via conditional expectations and probabilities (see, e.g., [40, pp. 36–44]). Here, for short, we introduce sufficiency by means of the form of the density functions in the family, which is known as the Neyman criterion (see, e.g., [40, pp. 44–46]).

Definition 3.18 Let (X, B) be a measurable space and P = {Pϑ : ϑ ∈ } be a family of distributions on B dominated by a σ -finite measure μ. Moreover, let T : (X, B) → (Y, C) be a statistic. (a) T is called sufficient for ϑ ∈  if there exist a function h : (X, B) → (R1 , B1 ) and functions qϑ : (Y, C) → (R1 , B1 ), ϑ ∈ , such that a μ-density of Pϑ is of the form dPϑ (x) = qϑ (T (x)) h(x) , dμ

x ∈ X,

(3.3)

for all ϑ ∈ . (b) T is called minimal sufficient for ϑ ∈  if T is sufficient for ϑ ∈  and factorizes over any other sufficient statistic, i.e., for every statistic ˜ C) ˜ being sufficient for ϑ ∈ , there exist a function T˜ : (X, B) → (Y, ˜ ˜ q : (Y, C) → (Y, C) and a set N ∈ B with Pϑ (N) = 0 for all ϑ ∈ , such that T = q ◦ T˜

on N c .

A sufficient statistic allows for reduction of the data without a loss of information meaning that T (x) contains all information in x about ϑ. The main issue in formula (3.3) is that the term qϑ (T (x)) depends on the observation x only through T (x). A minimal sufficient statistic is as simple as possible in the sense of Definition 3.18(b).


Definition 3.19 Let (X, B) be a measurable space and P = {Pϑ : ϑ ∈ } be a family of distributions on B. (a) P is called (boundedly) complete if, for every (bounded) function r : (X, B) → (R1 , B1 ), the following implication holds:  r dPϑ = 0 for all ϑ ∈  ⇒

there exists a set N ∈ B with Pϑ (N) = 0 for all ϑ ∈  and r(x) = 0 for all x ∈ N c .

(b) A statistic T : (X, B) → (Y, C) is called (boundedly) complete for ϑ ∈  if PT = {PϑT : ϑ ∈ } is (boundedly) complete.

The completeness of a statistic refers to the richness of the underlying family P of distributions. By definition, every complete statistic is boundedly complete. In an EF, a (minimal) sufficient and complete statistic is easily found, which means an important advantage in statistical inference.

Theorem 3.20 Let P be an EF with representation (2.1). Then, we have the following properties:
(a) The vector T = (T1, . . . , Tk)^t is sufficient for ϑ ∈ Θ.
(b) If Z1, . . . , Zk are affinely independent, then T is minimal sufficient for ϑ ∈ Θ.
(c) If int(Z(Θ)) ≠ ∅, then T is complete for ϑ ∈ Θ.

Statement (a) in Theorem 3.20 directly follows by setting qϑ(t) = C(ϑ) exp{Z(ϑ)^t t}, t ∈ R^k, in Eq. (3.3), and it holds true in any EF without imposing further conditions on Z1, . . . , Zk. A proof of statement (b) can be found, for instance, in [39, p. 39] and [50, pp. 25/26]. Statement (c) can be shown by using characteristic functions and the uniqueness theorem (see, e.g., [40, pp. 116/117], [50, pp. 26/27], or [20, p. 43]). The property of an EF to be full along with a minimal representation (see Definitions 2.35 and 2.20) has direct applications for the statistic T, namely, it is minimal sufficient and complete.

Corollary 3.21 Let P be a full EF with minimal representation (2.1). Then, T = (T1, . . . , Tk)^t is minimal sufficient and complete for ϑ.

Proof The assertion follows from Theorem 3.20, Lemma 2.42, and Theorem 2.23.  □


Note that the dimension of the vector T (n) of statistics related to the product case is the same for all n ∈ N (see Theorem 3.13(b)). Hence, in the iid case, sampling from an EF does not increase the dimension of the sufficient statistic. Examples: poi(•), b(n0 , •), nb(r0 , •), m(n0 , •), E(•), g(•, •), G(•, •), N(•, •), Nk (•,  0 )

Example 3.22 Let P be any of the EFs poi(•), b(n0 , •), nb(r0 , •), or E(•) introduced in Examples 2.3, 2.4, 2.5, and 2.12. Moreover, let X1 , . . . , Xn be an iid sample from Pϑ ∈ P. Then, by Theorem 3.20, ˜ = T (n) (x)

n 

xi ,

x˜ = (x1 , . . . , xn ) ∈ Xn ,

i=1

is minimal sufficient and complete for ϑ. Example 3.23 Let P be the EF g(•, •) or G(•, •) defined in Examples 2.7 and 2.12. Moreover, let X1 , . . . , Xn be an iid sample from Pϑ ∈ P. Then, by Theorem 3.20, ' T

(n)

(x) ˜ =

n  i=1

xi ,

n 

(t ln(xi )

,

x˜ = (x1 , . . . , xn ) ∈ Xn ,

i=1

is minimal sufficient and complete for ϑ. Example 3.24 Let X(1) , . . . , X (n) be an iid sample from m(n0 , p) ∈ m(n0 , •) as introduced in Example 2.8. Then, by Theorem 3.20 and Example 2.46, (n) T˜ (x) ˜ =

n  (i) (i) (x1 , . . . , xk−1 )t ,

x˜ = (x (1) , . . . , x (n) ) ∈ Xn ,

i=1 (i)

(i)

x (i) = (x1 , . . . , xk )t ∈ X , 1 ≤ i ≤ n , is minimal sufficient and complete for p.

3.5 Score Statistic and Information Matrix

57

Example 3.25 Let X1 , . . . , Xn be an iid sample from N(μ, σ 2 ) ∈ N(•, •) as introduced in Example 2.10. Then, by Theorem 3.20 and Example 2.27, T

(n)

(x) ˜ =

' n 

xi ,

i=1

n 

(t xi2

,

x˜ = (x1 , . . . , xn ) ∈ Rn ,

i=1

is minimal sufficient and complete for (μ, σ 2 )t . Example 3.26 Let X (1), . . . , X(n) be an iid sample from Nk (μ,  0 ) ∈ Nk (•,  0 ) as introduced in Example 2.18. Then, by Theorem 3.20 and Example 2.28, ˜ = T (n) (x)

n 

x (i) ,

x˜ = (x (1) , . . . , x (n) ) ∈ (Rk )n ,

i=1

is minimal sufficient and complete for μ.

Remark 3.27 For the curved EF introduced in Definition 2.32, statements (a) and (b) of Theorem 3.20 are true. Completeness of T , however, may or may not be the case. In [50, pp. 28–31], examples for complete and non-complete curved EFs are shown and sufficient conditions are given for T to be not complete or not even boundedly complete.

3.5 Score Statistic and Information Matrix The Fisher information matrix is a measure for the amount of information a random variable/vector contains about the true underlying parameter. It appears, for instance, in the context of point estimation and allows for defining efficient and asymptotically efficient estimators.

Definition 3.28 Let (X, B) be a measurable space and P = {Pϑ = fϑ μ : ϑ ∈ },  ⊂ Rk open, be a family of distributions on B dominated by a σ finite measure μ. If all appearing derivatives (with respect to ϑ) and integrals (continued)

58

3 Distributional and Statistical Properties

Definition 3.28 (continued) (with respect to Pϑ ) exist, then I(ϑ) = EPϑ [U ϑ U tϑ ]

with

U ϑ = ∇ ln(fϑ )

is called the Fisher information matrix of P at ϑ ∈ . The random vector U ϑ is referred to as the score statistic at ϑ ∈ .

In an EF with natural parametrization, differentiability of the densities with respect to ζ is guaranteed by Corollary 2.50, and the score statistic and Fisher information matrix are readily obtained. Theorem 3.29 Let P be an EF with canonical representation (2.6), and let ζ ∈ int (). Then, U ζ = T − Eζ [T ] , I(ζ ) = Covζ (T ) . In particular, all entries in I(ζ ) are finite, i.e., [I(ζ )]i,j < ∞, 1 ≤ i, j ≤ k. If ord(P) = k, then I(ζ ) > 0. Proof Let ζ ∈ int (). By Corollary 2.50, U ζ = ∇ ln(exp{ζ t T − κ(ζ )} h) = T − Eζ [T ] . Hence, I(ζ ) = Covζ (T ). Moreover, from Theorem 2.49, all entries in I(ζ ) are finite. If ord(P) = k, then I(ζ ) > 0 by Theorem 2.23 and Lemma 2.26. 

Moreover, the following simple formula for the Fisher information matrix results from Corollary 2.50. Corollary 3.30 Let P be an EF with canonical representation (2.6). Then, I(ζ ) = Hκ (ζ ) ,

ζ ∈ int () .

For one-parameter EFs with canonical representation and ζ ∈ int (), the Fisher information is simply given by the second derivative of κ at ζ .

3.5 Score Statistic and Information Matrix

59

Examples: poi(•), nb(r0 , •), N(•, •), G(•, •), Nk (•,  0 )

Example 3.31 For P = poi(•), Corollary 3.30 and Example 2.64 directly yield I (ζ ) = eζ for ζ ∈ R. Example 3.32 For P = nb(r0 , •), Corollary 3.30 and Example 2.51 directly yield I (ζ ) =

r0 exp{ζ } , (1 − exp{ζ })2

ζ ∈ (−∞, 0) .

Example 3.33 For P = N(•, •), Corollary 3.30 and Example 2.52 yield ⎛ 1 − 2ζ2 I(ζ ) = ⎝ ζ 1 2ζ22

ζ1 2ζ22 1 2ζ22



⎞ ζ12 2ζ23

⎠ =

1 2ζ23

'

−ζ22

ζ1 ζ2

(

ζ1 ζ2 ζ2 − ζ12

for ζ = (ζ1 , ζ2 )t ∈ R × (−∞, 0). Example 3.34 For P = G(•, •), Corollary 3.30 and Example 2.53 directly yield ' I(ζ ) =

ζ2 ζ12 − ζ11

− ζ11 ψ1 (ζ2 )

( ,

ζ = (ζ1 , ζ2 )t ∈ (−∞, 0) × (0, ∞) ,

where ψ1 = ψ  denotes the trigamma function. Example 3.35 For P = Nk (•,  0 ), Corollary 3.30 and Example 2.54 directly yield I(μ) =  −1 0 ,

μ ∈ Rk .

The Fisher information matrix generally depends on the parametrization of the distribution family. In the situation of Definition 3.28, let I(ϑ) exist for ϑ ∈ . Moreover, let  ⊂ Rq , q ≤ k, be open, and let  :  →  be a one-to-one and continuously differentiable function with Jacobian matrix D (η) ∈ Rk×q of full rank for all η ∈ . Then, the score statistic and Fisher information matrix at η ∈ 

60

3 Distributional and Statistical Properties

of the family {P˜η : η ∈ } ⊂ P with parametrization P˜η = P(η) , η ∈ , are given by ˜ η = D (η)t U(η) , U ˜ I(η) = D (η)t I((η)) D (η) .

(3.4)

Example

Example 3.36 Let P be a regular EF with minimal canonical representation (2.6), and let π : ∗ → π(∗ ) be the mean value function as defined in Theorem 2.56 with Jacobian matrix Dπ (ζ ) = Covζ (T ) for ζ ∈ ∗ . Then, by using Theorem 3.29, formula (3.4) with  = π −1 , and Dπ −1 (m) = [Dπ (π −1 (m))]−1 , the Fisher information matrix of P corresponding to the mean value parametrization P˜m = Pπ∗−1 (m) , m ∈ π(∗ ), is given by ˜ I(m) = Dπ −1 (m)t I(π −1 (m)) Dπ −1 (m) = [Covπ −1 (m) (T )]−1 Covπ −1 (m) (T ) [Covπ −1 (m) (T )]−1 = [Covπ −1 (m) (T )]−1 = [CovP˜m (T )]−1 ,

m ∈ π(∗ ) .

3.6 Divergence and Distance Measures Entropy measures as, for instance, the Shannon entropy defined by H (P ) = −EP [ln(f )] for a distribution P with μ-density f , measure the disorder and uncertainty of distributions. Closely related are divergence measures, which quantify the ‘distance’ of two distributions in some family P in a certain sense (see, e.g., [57] and [48]). In statistical inference, they serve to construct statistical tests or confidence regions for underlying parameters. If the distributions in P are dominated by a σ -finite measure μ, various divergence measures have been introduced by means of the corresponding densities. Some important examples are the Kullback-Leibler divergence  DKL (P1 , P2 ) =



f1 f1 ln f2

 dμ ,

3.6 Divergence and Distance Measures

61

the Jeffrey distance DJ (P1 , P2 ) = DKL (P1 , P2 ) + DKL (P2 , P1 ), the Rényi divergence DRq (P1 , P2 ) =

1 ln q(q − 1)

 q

1−q

f1 f2

 dμ

q ∈ R \ {0, 1} ,

,

the Cressie-Read divergence 1 DCRq (P1 , P2 ) = q(q − 1)

'

 f1

f1 f2

(

q−1 −1

dμ ,

q ∈ R \ {0, 1} ,

and the Hellinger metric DH (P1 , P2 ) =

 $

$ .2 f1 − f2 dμ

1/2

for Pi = fi μ ∈ P, i = 1, 2, provided that the respective expressions are defined and the integrals finitely exist. If P = {Pϑ : ϑ ∈ } is a parametric family of distributions, it is convenient to use the parameters as arguments and briefly write, e.g., DKL (ϑ1 , ϑ2 ) for DKL (Pϑ1 , Pϑ2 ), ϑ1 , ϑ2 ∈ ; note that then DKL (ϑ1 , ϑ2 ) = 0 implies ϑ1 = ϑ2 if the parameter ϑ ∈  is identifiable (see Lemma 2.30). The EF structure admits simple formulas for various divergence measures as indicated in Lemma 3.37 and Corollary 3.39. Note that the Kullback-Leibler divergence, which can be considered as an expected log-likelihood ratio, is a starting point for many statistical procedures. Under common assumptions, we obtain a simple representation in EFs (see [20, pp. 174–178]). This representation is used in [33] in a two-class classification procedure within an EF, where classes are defined by means of left-sided Kullback-Leibler balls. Lemma 3.37 Let P be a full EF with minimal canonical representation (2.6). Then, for ζ , η ∈ int (∗ ), we have DKL (ζ , η) = κ(η) − κ(ζ ) + (ζ − η)t π(ζ ) , DJ (ζ , η) = (ζ − η)t (π(ζ ) − π(η)) . Proof For ζ , η ∈ int (∗ ), we have by Theorem 2.56 DKL (ζ , η) =

 . ln(fζ∗ ) − ln(fη∗ ) fζ∗ dμ 

=

  (ζ − η)t T − κ(ζ ) + κ(η) fζ∗ dμ

62

3 Distributional and Statistical Properties

= κ(η) − κ(ζ ) + (ζ − η)t Eζ [T ] = κ(η) − κ(ζ ) + (ζ − η)t π(ζ ) . From this, the representation of DJ is obvious.



The Rényi divergence, Cressie-Read divergence, Hellinger metric, and other  q 1−q divergences are functions of the quantity Aq (P1 , P2 ) = f1 f2 dμ, which has a nice representation in EFs for q ∈ (0, 1) (see, e.g., [43, 60]). Lemma 3.38 Let P be a full EF with minimal canonical representation (2.6). Then, for ζ , η ∈ ∗ and q ∈ (0, 1), we have Aq (ζ , η) = exp {κ(qζ + (1 − q)η) − (qκ(ζ ) + (1 − q)κ(η))} . Proof Let ζ , η ∈ ∗ and q ∈ (0, 1). Then,  Aq (ζ , η) =  =

(fζ∗ )q (fη∗ )1−q dμ  exp (qζ + (1 − q)η)t T − (qκ(ζ ) + (1 − q)κ(η)) h dμ

= exp {κ(qζ + (1 − q)η) − (qκ(ζ ) + (1 − q)κ(η))} . 

Recall that the cumulant function κ is convex (see Theorem 2.44). The logarithm of Aq (ζ , η) in Lemma 3.38 is just the difference of κ evaluated at a convex combination of ζ and η and the respective convex combination of κ(ζ ) and κ(η). Corollary 3.39 Let P be a full EF with minimal canonical representation (2.6). Then, for ζ , η ∈ ∗ and q ∈ (0, 1), we have DRq (ζ , η) = DCRq (ζ , η) =

κ(qζ + (1 − q)η) − (qκ(ζ ) + (1 − q)κ(η)) , q(q − 1) exp {κ(qζ + (1 − q)η) − (qκ(ζ ) + (1 − q)κ(η))} − 1 , q(q − 1) 

DH (ζ , η) =

 

 ζ +η κ(ζ ) + κ(η) 1/2 2 − 2 exp κ . − 2 2

Proof The assertion immediately follows from Lemma 3.38, since DRq =  (ln Aq )/[q(q − 1)], DCRq = (Aq − 1)/[q(q − 1)], and DH = (2 − 2A1/2)1/2 .

3.6 Divergence and Distance Measures

63

Remark 3.40 Every divergence measure D, say, of DKL , DJ , DRq , DCRq , and DH in Lemma 3.37 and Corollary 3.39 satisfies the properties that, for any ζ , η ∈ int (∗ ), D(ζ , η) ≥ 0 and D(ζ , η) = 0



ζ = η.

Moreover, DJ and DH are symmetric. Beyond that, DH meets the triangle inequality and therefore defines a metric on int (∗ ) × int (∗ ). Example: Nk (•,  0 )

Example 3.41 In the minimal canonical representation of Nk (•,  0 ) in Example 2.65, the cumulant function and mean value function are given by κ(μ) = k ||μ||20 /2 and π(μ) =  −1 0 μ for μ ∈ R . Hence, by Lemma 3.37, DKL (μ, η) =

||μ − η||20 2

,

DJ (μ, η) = ||μ − η||20 , for μ, η ∈ Rk . Moreover, we have for q ∈ (0, 1) that κ(qμ + (1 − q)η) − (qκ(μ) + (1 − q)κ(η)) =

q(q − 1)||μ − η||2 0 2

,

which, by using Corollary 3.39 gives DRq (μ, η) =

||μ − η||2 0 2

,

' " # ( q(q − 1)||μ − η||20 1 DCRq (μ, η) = exp −1 , q(q − 1) 2 and, by setting q = 1/2, ' DH (μ, η) =

"

2 − 2 exp −

||μ − η||20 8

#(1/2 ,

for μ, η ∈ Rk . Hence, DKL (μ, η), DJ (μ, η), DRq (μ, η), DCRq (μ, η), and DH (μ, η) each admit a representation as h(||μ − η||20 ) for some increasing function h : [0, ∞) → [0, ∞).

64

3 Distributional and Statistical Properties

The above representations of divergences can be applied to develop statistical procedures such as confidence regions (see, e.g., [60]). By using the EF structure, simple formulas for other related quantities may be derived as well. In the full EF with minimal canonical representation (2.6) , the weighted Matusita affinity ρn (ζ

(1)

,...,ζ

(n)

) =

 ' n

(wi fζ∗(i)

ζ (1), . . . , ζ (n) ∈ ∗ ,

dμ ,

i=1

with positive weights w1 , . . . , wn satisfying " ' ρn (ζ

(1)

,...,ζ

(n)

) = exp κ

n

n  i=1

i=1 wi

= 1 and n ≥ 2 is given by

( wi ζ

(i)



n 

# wi κ(ζ

(i)

)

i=1

(cf. Lemma 3.38); it can be applied, e.g., to construct a homogeneity test with null hypothesis ζ (1) = · · · = ζ (n) or to decide in a discriminant problem (see [25, 42, 34]).

Chapter 4

Parameter Estimation

In this chapter, parameter estimation is discussed when the underlying parametric model forms an EF. For this, we focus on the natural parametrization of the EF according to Definition 2.35, i.e., P = {Pζ∗ : ζ ∈ } with μ-densities fζ∗ (x) =

dPζ∗ dμ

= eζ

(x) = C ∗ (ζ ) exp

t T (x)−κ(ζ )

h(x) ,

⎧ k ⎨ ⎩

⎫ ⎬ ζj Tj (x)

j =1



h(x)

x ∈ X,

(4.1)

for ζ ∈ , where ∅ =  ⊂ ∗ . To estimate ζ or, more generally, a parameter γ (ζ ), ζ ∈ , we will assume to have observations x (i) considered realizations of iid random variables X(i) with distribution Pζ∗ ∈ P for i ∈ I ⊂ N. Every measurable function of X(i) , i ∈ I , with values in  is then called an estimator of ζ (in  based on X(i) , i ∈ I ); an estimator of γ (ζ ), ζ ∈ , or, for short, of γ is a measurable function of X(i) , i ∈ I , with values in γ (). Here, all random variables X(i) , i ∈ I , are formally defined on the probability space (, A, Pζ ), i.e., we have PX ζ

(i)

= Pζ∗ ,

i∈I.

Moreover, the notation Eζ [·], Varζ (·), and Covζ (·) for the mean, variance, and covariance (matrix) is used when integration is with respect to Pζ .

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 S. Bedbur, U. Kamps, Multivariate Exponential Families: A Concise Guide to Statistical Inference, SpringerBriefs in Statistics, https://doi.org/10.1007/978-3-030-81900-2_4

65

66

4 Parameter Estimation

4.1 Maximum Likelihood Estimation The maximum likelihood method, where, given some observation x of a random variable X with density function fζ∗ , we want to assign a global maximum point of fζ∗ (x), as a function of ζ , to an estimate of ζ , is closely connected to the study of EFs. Under mild conditions and in the case of Lebesgue density functions, the unique existence of the maximum likelihood estimator can be guaranteed. For counting densities, existence is an issue, while uniqueness is ensured.

Definition 4.1 Let x ∈ X be a realization of X ∼ Pζ∗ . Every parameter in  that maximizes the likelihood function L(ζ ; x) = fζ∗ (x) ,

ζ ∈ ,

is called a maximum likelihood estimate (ML estimate) of ζ in  based on x.

Equivalently and usually analytically simpler, one may maximize the loglikelihood function  = ln(L) to find the ML estimate(s). From formula (4.1), the log-likelihood function in our model is (ζ ; x) = ζ t T (x) − κ(ζ ) + ln(h(x)) ,

ζ ∈ .

Hence, the aim is to maximize the expression ζ t T (x) − κ(ζ ) with respect to ζ in . For this, it will be convenient to consider the natural extension κ˜ of κ with domain Rk as introduced in Lemma 2.45 and the respective extension ˜ ; x) = −∞ for ζ ∈ Rk \ ∗ . For a simple notation, ˜ of  to Rk by setting (ζ these extensions are again denoted by κ and  in the following. Theorem 2.44, Lemma 2.45, and Theorem 2.56 directly yield some useful properties of . Lemma 4.2 Let P be a full EF with minimal canonical representation (4.1). Then, we have for every x ∈ X: (a) (·; x) is strictly concave on ∗ .

4.1 Maximum Likelihood Estimation

67

(b) (·; x) is upper semi-continuous, i.e., lim sup (η; x) ≤ (ζ ; x) for every ζ ∈ Rk . η→ζ

(c) (·; x) is continuously differentiable on int (∗ ) with derivative ∇(ζ ; x) = T (x) − π(ζ ) ,

ζ ∈ int (∗ ) ,

where π is the mean value function introduced in Theorem 2.56. Note that in Lemma 4.2 the mean value function π : int (∗ ) → π(int (∗ )), defined by π(ζ ) = Eζ [T ] for ζ ∈ int (∗ ), is bijective with π(int (∗ )) ⊂ int (M), where M denotes the convex support of ν T (see Theorem 2.56 and Remark 2.62). If P is moreover steep, we have π(int (∗ )) = int (M) . Now, we have the following fundamental theorem for steep EFs (see [4, p. 151]). Theorem 4.3 Let P be a steep EF with minimal canonical representation (4.1), and let x ∈ X be a given realization of the random variable X with distribution Pζ∗ ∈ P. Then, the ML estimate of ζ in ∗ based on x exists if and only if T (x) ∈ π(int (∗ ))

(= int (M)) .

If it exists, it is uniquely determined and given by π −1 (T (x)). Proof First, we show that the maximum of  = (·; x) cannot be attained at a boundary point of ∗ . Let η ∈ ∗ \int (∗ ), ζ ∈ int (∗ ), and ζ (α) = ζ +α(η−ζ ) ∈ int (∗ ), α ∈ [0, 1). By Lemma 4.2, lim inf (ζ (α)) ≥ lim inf[(ζ ) + α((η) − (ζ ))] α1

α1

= (η) ≥ lim sup (ζ (α)) , α1

i.e., limα1 (ζ (α)) = (η). Moreover, since P is steep, / 0 ∂ ∂ (α) t (α) (ζ ) = lim (η − ζ ) T (x) − κ(ζ ) = −∞ , lim α1 ∂α α1 ∂α ˜ ) > (η). Hence, if  has a such that there exists some α˜ ∈ (0, 1) with (ζ (α) maximum, it is attained on int (∗ ) and necessary a solution of the likelihood equation

0 = ∇(ζ ) = T (x) − π(ζ ) .

68

4 Parameter Estimation

In case of T (x) ∈ π(int (∗ )), this equation has the only solution π −1 (T (x)), which is then also the unique ML estimate of ζ in ∗ based on x. Otherwise, an ML estimate in ∗ based on x does not exist. 

Remark 4.4 In Theorem 4.3, the ML estimate of ζ in ∗ based on a single observation of X ∼ Pζ∗ exists with probability Pζ∗ (T ∈ π(int (∗ ))) = Pζ∗ (T ∈ int (M)) . Since ν T (M c ) = 0 as well as ν T and (Pζ∗ )T are equivalent measures by Remark 2.16 and Lemma 2.17, it follows that Pζ∗ (T ∈ / M) = 0

for ζ ∈ ∗ .

Hence, the ML estimate of ζ in ∗ exists with Pζ∗ -probability 1 if and only if T lies on the boundary of M with Pζ∗ -probability 0. In particular, this is the case if the distribution of T has a Lebesgue density, since the boundary of any convex set in Rk has Lebesgue measure 0. Based on an iid sample from Pζ∗ , we arrive at the following version of Theorem 4.3. Corollary 4.5 Let P be a steep EF with minimal canonical representation (4.1). Moreover, let X(1) , . . . , X(n) be an iid sample from the distribution Pζ∗ ∈ P, and let x (1), . . . , x (n) denote realizations of X(1) , . . . , X(n) . Then, the ML estimate of ζ in ∗ based on x (1), . . . , x (n) exists if and only if 1 T (x (i) ) ∈ π(int (∗ )) n n

(= int (M)) .

(4.2)

i=1

If the ML estimate exists, it is uniquely determined and given by ' π

−1

1 T (x (i) ) n n

( .

i=1

Proof By using Theorem 3.13 and Corollary 3.17, we may apply Theorem 4.3 to the EF P(n) of distributions of (X(1) , . . . , X(n) ). By doing so, T and π have to be replaced by T (n) and nπ. The assertion then follows from T (n) (x (1), . . . , x (n) ) ∈ (nπ)(int (∗ ))



n 1  T (x (i) ) ∈ π(int (∗ )) . n i=1



4.1 Maximum Likelihood Estimation

69

Notation 4.6 If, in the situation of Corollary 4.5, condition (4.2) is true with (Pζ∗ )(n) -probability 1, the maximum likelihood estimator (MLE) of ζ in ∗ based on X(1) , . . . , X(n) , which then uniquely exists Pζ -a.s., is denoted by ' (n) ζˆ = π −1

1 T (X(i) ) n n

( (4.3)

.

i=1

Examples: poi(•), b(n0 , •), N(•, •), Nk (•,  0 )

Example 4.7 From Example 2.64 for P = poi(•), we have T (x) = x for x ∈ N0 and π(ζ ) = exp{ζ } for ζ ∈ ∗ = R. Here, π(int (∗ )) = (0, ∞). Now, let X1 , . . . , Xn be an iid sample from Pζ∗ , and let x1 , . . . , xn ∈ N0 denote realizations of X1 , . . . , Xn . Then, 1 xi ∈ / (0, ∞) n n



x1 = · · · = xn = 0 .

i=1

Hence, by Corollary 4.5, an ML estimate of ζ based on x1 , . . . , xn ∈ N0 does not exist if x1 = · · · = xn = 0. In any other case, '

1 ln xi n n

(

i=1

is the unique ML estimate. The probability for the non-existence of the ML estimate is given by  Pζ (X1 = 0, . . . , Xn = 0) = exp −neζ ,

ζ ∈ R,

and converges to 0 for n → ∞. Example 4.8 From Example 2.57 for P = b(n0 , •), we have T (x) = x for x ∈ {0, 1, . . . , n0 }, ∗ = R, and π(∗ ) = (0, n0 ). Now, let X1 , . . . , Xn be an iid sample from Pζ∗ , and let x1 , . . . , xn ∈ {0, 1, . . . , n0 } denote realizations of

70

4 Parameter Estimation

X1 , . . . , Xn . Then, 1 xi ∈ / (0, n0 ) n n



( x1 = · · · = xn = 0 ∨ x1 = · · · = xn = n0 )

i=1

Hence, by Corollary 4.5, an ML estimate of ζ based on x1 , . . . , xn ∈ {0, 1 . . . , n0 } does not exist if x1 = · · · = xn = 0 or x1 = · · · = xn = n0 . In any other case, ' π

−1

1 xi n n

(

 = ln

i=1

n  i=1 xi n n0 n − i=1 xi

is the unique ML estimate. The probability for the non-existence of the ML estimate is given by Pζ ({X1 = · · · = Xn = 0} ∪ {X1 = · · · = Xn = n0 })  =

1 1 + exp{ζ }

n0 n

 +

exp{ζ } 1 + exp{ζ }

n0 n

and converges to 0 for n → ∞. Example 4.9 From Example 2.59 for P = N(•, •), we have T (x) = (x, x 2 )t for x ∈ R, ∗ = R × (−∞, 0), and π(∗ ) = {(x, y)t ∈ R2 : x 2 < y} . By Corollary 4.5, the probability for the existence of the ML estimate of ζ in ∗ based on n observations is given by '" (Pζ∗ )(n)

1 T (xi ) ∈ π(∗ ) (x1 , . . . , xn ) ∈ Rn : n n

i=1

#( =

" 0,

n = 1,

1,

n ≥ 2,

since (x1 , x12 )t ∈ / π(∗ ) for all x1 ∈ R, and '

1 xi n n

i=1

(2

1 2 xi n n



i=1

for all x1 , . . . , xn ∈ R and n ≥ 2 (the empirical variance is non-negative) with equality if and only if x1 = · · · = xn . Hence, for n ≥ 2, the MLE of ζ based on

4.2 Constrained Maximum Likelihood Estimation

71

an iid sample X1 , . . . , Xn from Pζ∗ uniquely exists (Pζ -a.s.) and is given by ' (n) ζˆ = π −1

1 1 2 Xi , Xi n n n

n

i=1

i=1

(

with inverse function π −1 as stated in Example 2.59. Example 4.10 From Example 2.65 for P = Nk (•,  0 ), we have T (x) =  −1 0 x k . Here, π(∗ ) = Rk such μ for μ ∈ R for x ∈ Rk , ∗ = Rk , and π(μ) =  −1 0 that condition (4.2) is trivially met. Hence, by Corollary 4.5, the MLE of μ based on an iid sample X (1) , . . . , X (n) from Pμ uniquely exists and is given by ' μ ˆ

(n)

= π

−1

1 T (X (i) ) n n

'

( = 0

i=1

1  −1 (i) 0 X n n

(

1  (i) X . n n

=

i=1

i=1

4.2 Constrained Maximum Likelihood Estimation We now consider maximum likelihood estimation in non-full EFs. Any such nonfull family is represented by a subfamily {Pζ∗ : ζ ∈ } of the full family {Pζ∗ : ζ ∈ ∗ } for some proper subset  of ∗ (with  = ∗ ). Thus, the aim is now to find all parameters within the constrained set  at which the likelihood function has a global maximum on . For every x ∈ X and every compact set  ⊂ int (∗ ), it is clear that there exists an ML estimate of ζ in  based on x, since (·; x) is continuous on  by Lemma 4.2 and thus attains its maximum on . On the other hand, the strict concavity of (·; x) on ∗ ensures uniqueness of the ML estimate in every convex set  ⊂ ∗ provided that it exists. Moreover, in the steep EF and if T (x) ∈ π(), then π −1 (T (x)) must be the ML estimate in , since it already maximizes (·; x) over ∗ . In [20, pp. 152– 155], under different assumptions on the constrained set , sufficient and necessary conditions for the existence and uniqueness of ML estimates in non-full EFs are stated. The findings are presented in this section. First, as a preliminary result, we address the following lemma. Lemma 4.11 Let P be a full EF with minimal canonical representation (4.1). Moreover, let x ∈ X and ∅ = A ⊂ Rk be closed. If (ζ ) = (ζ ; x) → −∞

for ||ζ ||2 → ∞ ,

then there exists some η ∈ A with (η) = supζ ∈A (ζ ).

(4.4)

72

4 Parameter Estimation

Proof Let m = supζ ∈A (ζ ), and ζ n ∈ A, n ∈ N, be a sequence with (ζ n ) → m for n → ∞. If m = −∞, the assertion is trivially met. For m > −∞, it holds by assumption that ζ n , n ∈ N, is bounded. Hence, there exists a convergent subsequence ζ nk , k ∈ N, with ζ nk → η ∈ A for k → ∞. By Lemma 4.2,  is upper semi-continuous, which yields m ≥ (η) ≥ lim sup (ζ nk ) = m , k→∞

i.e., (η) = m.



To obtain a sufficient condition for property (4.4) to hold true, we introduce the half spaces H + (y, a) = {z ∈ Rk : y t z > a} , H − (y, a) = {z ∈ Rk : y t z < a} , for y ∈ Rk and a ∈ R. Lemma 4.12 Let P be a full EF with minimal canonical representation (4.1), and let 0 ∈ int (M). Then, κ(ζ ) → ∞ for ||ζ ||2 → ∞. Proof First, we show that there exist some a > 0 and c > 0 such that ν T (H + (y, a)) > c

for all y ∈ B = {z ∈ Rk : ||z||2 = 1} .

Suppose the contrary assertion is true. Then, there exist a sequence an , n ∈ N, with limn→∞ an = 0, and a sequence y n ∈ B, n ∈ N, with lim ν T (H + (y n , an )) = 0 .

n→∞

Since B is compact, there exist y ∈ B and a convergent subsequence y nk , k ∈ N, with y nk → y for k → ∞. Note that, for every z ∈ Rk with y t z > 0, there exists an index k0 ∈ N such that y tnk z > ank for all k ≥ k0 . Together with Fatou’s lemma (see, e.g., [17, p. 209]), this yields T

+



ν (H (y, 0)) =

1H + (y,0) dν



T



lim inf 1H + (y n

 ≤ lim inf k→∞

1H + (y n

k

k

k→∞

,ank ) dν

T

,ank ) dν

T

= lim ν T (H + (y nk , ank )) = 0 . k→∞

Hence, we have supp(ν T ) ⊂ H + (y, 0)c and thus M ⊂ H + (y, 0)c , since H + (y, 0)c is convex and closed. By assumption, 0 ∈ int (M), which gives the contradiction 0 ∈ int (H + (y, 0)c ) = H − (y, 0).

4.2 Constrained Maximum Likelihood Estimation

73

Now, for ζ ∈ Rk \ {0},  κ(ζ ) = ln

e 

≥ ln

ζtt



T

dν (t)

 = ln

 t

 ζ T exp ||ζ ||2 t dν (t) ||ζ ||2

  t

ζ exp ||ζ ||2 t dν T (t) ||ζ ||2 H + (ζ /||ζ ||2 ,a)

. ≥ ln e||ζ ||2 a c = a ||ζ ||2 + ln(c) . For ||ζ ||2 → ∞, this yields the assertion.



For a relatively closed subset  of ∗ , we have the following theorem (see [20, pp. 152/153]). Theorem 4.13 Let P be a steep EF with minimal canonical representation (4.1), and let x ∈ X be a realization of the random variable X with distribution Pζ∗ ∈ P. Moreover, let  = ∅ be relatively closed in ∗ , i.e.,  = B ∩ ∗ for a closed set B ⊂ Rk . Then, an ML estimate of ζ in  based on x exists if one of the following conditions is met: (a) T (x) ∈ int (M), (b) T (x) ∈ M \ int (M) and there exist t 1 , . . . , t m ∈ int (M) and a1 , . . . , am ∈ R with  ⊂

m 1

H − (T (x) − t i , ai ) .

(4.5)

i=1

Proof Let t = T (x). First, let (a) be true, i.e., t ∈ int (M). Since for ζ ∈ ∗ fζ∗ = eζ

t T −κ(ζ )

h = eζ

t (T −t)−(κ(ζ )−ζ t t)

h

we have another representation of P with statistic Tˇ = T − t and cumulant function ˇ κ(ζ ˇ ) = κ(ζ ) − ζ t t, ζ ∈ Rk . The corresponding convex support Mˇ of ν T is then Mˇ = M − t ˇ by assumption. Hence, by Lemma 4.12, we have and satisfies 0 ∈ int (M) (ζ ; x) = ln(h(x)) − κ(ζ ˇ ) → −∞ for ||ζ ||2 → ∞. Applying Lemma 4.11 then yields that there exists some η ∈ B with (η; x) = supζ ∈B (ζ ; x). Necessarily holds that η ∈  = B ∩ ∗ , since η ∈ / ∗ would imply that (η; x) = −∞, which forms a contradiction to ∅ =  ⊂ ∗ .

74

4 Parameter Estimation

Now, let assertion (b) be true. Since t 1 , . . . , t m ∈ int (M), we have κi (ζ ) = κ(ζ ) − ζ t t i → ∞ for ||ζ ||2 → ∞ and 1 ≤ i ≤ m by the same arguments as above. Moreover, for ζ ∈ , there exists an index j ∈ {1, . . . , m} with (ζ ; x) = ln(h(x)) + ζ t (T (x) − t j ) − κj (ζ ) ≤ ln(h(x)) + aj − κj (ζ ) . Thus, (ζ ; x) ≤ sup {ln(h(x)) + ai − κi (ζ )} → −∞ 1≤i≤m

for ||ζ ||2 → ∞. Again, applying Lemma 4.11 completes the proof.



Theorem 4.13 may be useful to ensure existence of ML estimates in curved EFs as indicated in the following remark. Remark 4.14 Let P be a steep EF with minimal canonical representation (4.1). Moreover, let  ⊂ Rm for some m < k, Z = (Z1 , . . . , Zk )t :

 → ∗

be a one-to-one mapping with int (Z()) = ∅, and Z1 , . . . , ZK be affinely independent. Then, the non-full subfamily ∗ PZ = {PZ(ϑ) : ϑ ∈ }

of P forms a curved EF according to Definition 2.32. Now, let x be a realization of ∗ ∈ PZ . If Z() is relatively closed in ∗ , Theorem 4.13 is applicable X ∼ PZ(ϑ) and may be used to ensure existence of ML estimates of ϑ in  based on x. For example, an ML estimate of ϑ in  exists with probability 1 if the distribution of T has a Lebesgue density (see Remark 4.4). Under the additional assumptions that  is open, Z() ⊂ int (∗ ), and Z is differentiable, any ML estimate of ϑ in  based on x is necessarily a solution of the likelihood equation DZ (ϑ)t [T (x) − π(Z(ϑ))] = 0

(4.6)

with respect to ϑ ∈  (see Theorem 2.56). Note, however, that the likelihood function may have several local maxima and thus Eq. (4.6) more than one solution in  (see [5, p. 69]). Uniqueness of the ML estimate may also be guaranteed by imposing further assumptions on both Z and T . For instance, in case of Lebesgue density functions, the MLE of ϑ in  is unique with probability 1 if X ⊂ Rn for some n ≥ k and X is open,  is open with Z() ⊂ int (∗ ), T is continuously differentiable and Z is twice continuously differentiable with Jacobian matrix DT (x) of full rank k at every

4.2 Constrained Maximum Likelihood Estimation

75

x ∈ X and Jacobian matrix DZ (ϑ) of full rank m at every ϑ ∈ , respectively (see [49] and [38]). Example: Curved EF in N(•, •)

Example 4.15 We consider the curved EF P2 = {N(μ, μ2 ) : μ ∈ R \ {0}} in Example 2.31(b) forming a subfamily of N(•, •) with ord(P2) = 2. By using the canonical representation of N(•, •) introduced in Example 2.39, we have P2 = ∗ {PZ(μ) : μ ∈ } with  = R\{0} and Z = (Z1 , Z2 )t :  → ∗ = R×(−∞, 0), where Z1 (μ) =

1 μ

Z2 (μ) = −

and

1 2μ2

for μ ∈  .

The corresponding statistic is given by T (x) = (x, x 2 )t , x ∈ R. Obviously, Z is one-to-one, and Z() = {(η, −η2 /2)t : η ∈ R \ {0}} is relatively closed in ∗ , since Z() = B ∩ ∗ for B = {(η, −η2 /2)t : η ∈ R}, which is closed in R2 . From Remark 2.62 and Example 2.59, we have that int (M) = π(∗ ) = {(x, y)t ∈ R2 : x 2 < y} . Now, let us assume to have realizations x1 , . . . , xn of an iid sample ∗ X1 , . . . , Xn of size n ≥ 2 from PZ(μ) for some μ ∈ . Since 1 T (xi ) = n n

i=1

'

1 1 2 xi , xi n n n

n

i=1

i=1

(t ∈ int (M)

if x1 , . . . , xn are not all identical (see Example  4.9), Theorem 4.13 applied to the product EF P(n) with statistic T (n) (x) = ni=1 T (xi ) for x = (x1 , . . . , xn ) ∈ Xn = Rn ensures that an MLE of μ in  based on X1 , . . . , Xn exists with probability 1. Moreover, the MLE is uniquely determined with probability 1 by Remark 4.14. To see this, we formally replace the sample space Xn = Rn by Y = Rn \ N, where N = {(x, . . . , x) ∈ Rn : x ∈ R} has probability 0. Then, Y is open and T (n) is continuously differentiable on Y with Jacobian matrix  DT (n) (y) =

1 2y

 ,

y ∈ Y,

76

4 Parameter Estimation

with 1 = (1, . . . , 1) ∈ Rn , which has full rank k = 2 ≤ n at every y ∈ Y. Since  = R \ {0} is open and Z is twice continuously differentiable with Jacobian matrix   1 1 t DZ (μ) = − 2 , 3 μ μ of rank m = 1 at every μ ∈ , all conditions in Remark 4.14 for the uniqueness of the MLE are met (see also [49, Ex. 3]).

If  in Theorem 4.13 is moreover convex, the ML estimate of ζ in  based on x is unique provided that it exists. Furthermore, we have the following theorem (see [20, pp. 153–155]). Theorem 4.16 In the situation of Theorem 4.13, let  be moreover convex and  ∩ int (∗ ) = ∅. Then, there exists a unique ML estimate of ζ in  based on x if and only if one of the following conditions is met: (a) T (x) ∈ int (M), (b) T (x) ∈ M \ int (M) and there exist t 1 ∈ int (M) and a1 ∈ R with  ⊂ H − (T (x) − t 1 , a1 ) . If an ML estimate of ζ in  based on x exists, it is the unique parameter ζ ∗ ∈  ∩ int (∗ ) satisfying (T (x) − π(ζ ∗ ))t (ζ ∗ − η) ≥ 0

for all η ∈  .

(4.7)

Proof First, let us show that if inequality (4.7) has a solution in  ∩ int (∗ ), it is uniquely determined. For this, let ζ ∗1 , ζ ∗2 ∈  ∩ int (∗ ) be solutions of inequality (4.7). Then, in particular, (T (x) − π(ζ ∗1 ))t (ζ ∗1 − ζ ∗2 ) ≥ 0 , (T (x) − π(ζ ∗2 ))t (ζ ∗2 − ζ ∗1 ) ≥ 0 . Summation of both inequalities and applying Lemma 3.37 yields 0 ≤ (π(ζ ∗2 ) − π(ζ ∗1 ))t (ζ ∗1 − ζ ∗2 ) = −DJ (ζ ∗1 , ζ ∗2 ) By Remark 3.40, it follows that DJ (ζ ∗1 , ζ ∗2 ) = 0 and ζ ∗1 = ζ ∗2 . Next, let us show that ζ ∗ is an ML estimate of ζ in  based on x if and only if ∗ ζ ∈  ∩ int (∗ ) and ζ ∗ satisfies inequality (4.7).

4.2 Constrained Maximum Likelihood Estimation

77

First, if ζ ∗ ∈ /  ∩ int (∗ ), ζ cannot be an ML estimate of ζ in  based on x, since the steepness of P ensures the existence of some η ∈  ∩ int (∗ ) with (η; x) > (ζ ; x), as shown in the proof of Theorem 4.3. Now, let ζ ∗ ∈ ∩int (∗ ) and suppose that there exists some η ∈  with (T (x) − π(ζ ∗ ))t (ζ ∗ − η) < 0 . Since  is convex by assumption, η(α) = ζ ∗ + α(η − ζ ∗ ) ∈ , α ∈ [0, 1], and we obtain from Lemma 4.2 % ∂ % (η(α) ; x)% = (T (x) − π(ζ ∗ ))t (η − ζ ∗ ) > 0 . α=0 ∂α ˜ ; x) > (ζ ∗ ; x), which implies that ζ ∗ is Hence, there exists α˜ ∈ (0, 1) with (η(α) not an ML estimate of ζ in  based on x. Now, let ζ ∗ ∈  ∩ int (∗ ) satisfy inequality (4.7). Then, by using Lemma 3.37, it holds for η ∈  ∩ int (∗ ) that

(ζ ∗ ; x) − (η; x) = (ζ ∗ − η)t T (x) − κ(ζ ∗ ) + κ(η) = (ζ ∗ − η)t (T (x) − π(ζ ∗ )) + DKL (ζ ∗ , η) ≥ 0, since ζ ∗ satisfies inequality (4.7) and DKL is non-negative by Remark 3.40. Moreover, by the same arguments as in the proof of Theorem 4.3, the steepness of P ensures that for every η˜ ∈  ∩ (∗ \ int (∗ )) there exists some η ∈  ∩ int (∗ ) with (η; x) > (η; ˜ x). Hence, ζ ∗ is an ML estimate of ζ in  based on x. Finally, we prove the equivalence assertion. If either statement (a) or (b) is met, then existence of an ML estimate of ζ in  based on x follows from Theorem 4.13. Since (·; x) is strictly concave on  by Lemma 4.2, this ML estimate is uniquely determined. For the other implication, let ζ ∗ denote the unique ML estimate of ζ in  based on x. Then, as already shown above, ζ ∗ ∈  ∩ int (∗ ) and ζ ∗ satisfies inequality (4.7), which is equivalent to ηt (T (x) − π(ζ ∗ )) ≤ (ζ ∗ )t (T (x) − π(ζ ∗ ))

for all η ∈  .

Hence, assertion (b) is true with t 1 = π(ζ ∗ ) ∈ int (M) and some arbitrary a1 > (ζ ∗ )t (T (x) − π(ζ ∗ )). 

For more details on maximum likelihood estimation in EFs including, in particular, geometrical interpretations and illustrations of formulas (4.5) and (4.7), see [20, Chapter 5]. So far in this chapter, we have considered maximum likelihood estimation for the natural parameter of the EF. For the sake of completeness, we finally address a result and a comment for other parametrizations.

78

4 Parameter Estimation

Lemma 4.17 Let P be an EF with minimal canonical representation (4.1). More∗ , η ∈ , over, let  ⊂ Rk ,  :  →  be a bijective mapping, and P˜η = P(η) ∗ of P˜η for denote the corresponding parametrization of P with μ-density f˜η = f(η) η ∈ . Let x be a realization of the random variable X with distribution in P, and let ζˇ denote an ML estimate of ζ in  based on x. Then, ηˇ =  −1 (ζˇ ) is an ML estimate of η in  based on x. Moreover, if ζˇ is uniquely determined, so is ηˇ . Proof Since  is bijective, we have f˜ −1 (ζˇ ) (x) ≥ f˜η (x)

for all η ∈ 



∗ fζˇ∗ (x) ≥ f(η) (x)



fζˇ∗ (x) ≥ fζ∗ (x)

for all η ∈ 

for all ζ ∈  ,

which is true by assumption. Hence, ηˇ =  −1 (ζˇ ) is an ML estimate of η in  based on x. If ζˇ is uniquely determined, then the above equivalences hold true with strict inequalities when replacing  by  \ {η} ˇ and  by  \ {ζˇ }, and ηˇ is uniquely determined as well. 

Simply put, Lemma 4.17 states that an ML estimate for a (bijectively) transformed parameter can be obtained by transforming an ML estimate for the original parameter in the same way. This property holds true in parametric models, in general. From Lemma 4.17, respective statements for the iid case and for the existence and uniqueness of the MLE of η can readily be deduced. As an important example for minimal regular EFs, these results apply to the mean value parametrization P˜m , m ∈ int (M), by setting  = π −1 (see Theorem 2.56). For instance, provided that (n) the MLE ζˆ of ζ in ∗ based on an iid sample X(1) , . . . , X(n) from Pζ∗ exists, the MLE of m in int (M) based on X(1) , . . . , X(n) is given by m ˆ (n) = π(ζˆ

(n)

1 T (X(i) ) , n n

) =

i=1

(n) where we still use the representation of ζˆ in Definition 4.6.

4.3 Efficient Estimation We now turn to properties of point estimators and their connections to EFs. First of all, unbiasedness of an estimator means that, in expectation, we estimate correctly.

4.3 Efficient Estimation

79

Definition 4.18 Let γ :  → Rq be a function. An estimator γˆ : (, A, Pζ ) → (Rq , Bq ) is called unbiased for γ if Eζ [γˆ ] = γ (ζ )

for every ζ ∈  .

In particular, an estimator ζˆ is called unbiased for ζ if Eζ [ζˆ ] = ζ

for every ζ ∈  .

(n) Remark 4.19 If under the assumptions of Corollary 4.5 the MLE ζˆ in formula (4.3) exists with probability 1, it is a measurable function of the sufficient and complete statistic T (n) (see Corollary 3.21 and Theorem 3.13). For the oneparameter EF, the Lehmann-Scheffé theorem then yields that any estimator cζˆ (n) being unbiased for ζ for some constant c ∈ R has minimum variance among all unbiased estimators for ζ . In that case, cζˆ (n) is called a uniformly minimum variance unbiased estimator (UMVUE) for ζ .

Estimators with such a minimality property may also be defined in the multivariate case. The Cramér-Rao inequality provides a lower bound for the covariance matrix of any unbiased estimator. Theorem 4.20 Let P be an EF with minimal canonical representation (4.1),  ⊂ ∗ be open, and γ :  → Rq be a differentiable function. Moreover, let X be a random variable with distribution Pζ∗ for some ζ ∈ . Then, for every unbiased estimator γˆ for γ based on X, the Cramér-Rao inequality Covζ (γˆ ) ≥ Dγ (ζ ) [I(ζ )]−1 Dγ (ζ )t

(4.8)

holds true in the sense of the Löwner order for matrices defined by A ≥ B if and only if A − B ≥ 0, i.e., A − B is positive semidefinite. In particular, any unbiased estimator ζˆ for ζ based on X satisfies that Covζ (ζˆ ) ≥ [I(ζ )]−1 . Here, Dγ (ζ ) ∈ Rq×k denotes the Jacobian matrix of γ and I(ζ ) the Fisher information matrix of P at ζ . Proof Let ζ ∈ . Note that then ζ ∈ int (∗ ), since  is open by assumption. Recall that by Theorems 3.29, 2.23, and Lemma 2.26 U ζ = T − Eζ [T ]

and

I(ζ ) = Covζ (T ) > 0 .

80

4 Parameter Estimation

In particular, we have Eζ [U ζ ] = 0

and

Covζ (U ζ ) = I(ζ ) .

(4.9)

Since γˆ = (γˆ1 , . . . , γˆq )t is unbiased for γ = (γ1 , . . . , γq )t , Corollary 2.55 then yields for 1 ≤ j ≤ q that ∇γj (ζ ) = ∇Eζ [γˆj ] = Covζ (γˆj , T (X)) = Covζ (γˆj , U ζ (X)) = Eζ [γˆj U ζ (X)] , i.e., Dγ (ζ ) = Eζ [γˆ U ζ (X)t ] . Now, for y ∈ Rq , it follows by inserting and by using formula (4.9) that y t Dγ (ζ ) [I(ζ )]−1 Dγ (ζ )t y = Eζ [y t γˆ U ζ (X)t [I(ζ )]−1 Dγ (ζ )t y] = Eζ [[y t (γˆ − γ (ζ ))] [U ζ (X)t [I(ζ )]−1 Dγ (ζ )t y]] = Covζ (y t (γˆ − γ (ζ )) , U ζ (X)t [I(ζ )]−1 Dγ (ζ )t y).

From the unbiasedness of γˆ and formula (4.9), the last expression is seen to be the covariance of two random variables with mean 0. Applying the Cauchy-Schwarz inequality then leads to [y t Dγ (ζ ) [I(ζ )]−1 Dγ (ζ )t y]2 ≤ Varζ (y t (γˆ − γ (ζ ))) Varζ (U ζ (X)t [I(ζ )]−1 Dγ (ζ )t y) = [y t Covζ (γˆ ) y] [y t Dγ (ζ ) [I(ζ )]−1 Covζ (U ζ (X)) [I(ζ )]−1 Dγ (ζ )t y] = [y t Covζ (γˆ ) y] [y t Dγ (ζ ) [I(ζ )]−1 Dγ (ζ )t y] . Hence, y t Dγ (ζ ) [I(ζ )]−1 Dγ (ζ )t y ≤ y t Covζ (γˆ ) y , which yields the assertion.



Note that in the proof of Theorem 4.20 we use properties of the EF but not the explicit structure of the densities. Under respective assumptions such as those in formula (4.9), the Cramér-Rao inequality can be shown to hold true in more general models.

4.3 Efficient Estimation

81

Efficient estimators are now introduced as follows, where we implicitly assume that the Cramér-Rao inequality is valid.

Definition 4.21 Let  ⊂ ∗ be open, γ :  → Rq be a differentiable function, and γˆ : (, A, Pζ ) → (Rq , Bq ) be an unbiased estimator for γ . Then, γˆ is called efficient for γ if its covariance matrix attains the lower bound of the Cramér-Rao inequality in formula (4.8) for every ζ ∈ , i.e., Covζ (γˆ ) = Dγ (ζ ) [I(ζ )]−1 Dγ (ζ )t ,

ζ ∈ .

In particular, any unbiased estimator ζˆ for ζ with Covζ (ζˆ ) = [I(ζ )]−1 , ζ ∈ , is called efficient for ζ .

An efficient estimator in the sense of Definition 4.21 simultaneously solves different optimization problems. Remark 4.22 Let A = [aij ]1≤i,j ≤k , B = [bij ]1≤i,j ≤k ∈ Rk×k be two matrices with A, B ≥ 0 and A ≥ B in the sense of the Löwner order. Then, (a) aii ≥ bii , 1 ≤ i ≤ k, (b) trace(A) ≥ trace(B), (c) det(A) ≥ det(B). Note that (a) is obvious from et Ae ≥ et Be for any unit vector e ∈ Rk , and (b) directly follows from (a); assertion (c) can be obtained from Weyl’s Monotonicity Theorem stating that the j -th largest eigenvalue of A is not smaller than that of B for 1 ≤ j ≤ k if A, B ≥ 0 and A − B ≥ 0 (see [15, p. 63]). An efficient estimator therefore minimizes the marginal variances (a), the total variation (b), and the generalized variance (c) among all unbiased estimators. Here, the total variation and generalized variance of a random vector are defined as the trace and determinant of its covariance matrix, respectively (see, e.g., [45, p. 13 and pp. 30/31]). Remark 4.23 For k = 1, an efficient estimator is always a UMVUE. The reverse implication is not true in general, since the lower bound of the Cramér-Rao inequality need not to be attained by the UMVUE. Indeed, it can be shown that efficient estimators essentially only exist in EFs (see, e.g., [16, pp. 182/183]). Analogously, for k > 1, efficiency of an estimator ensures that it has minimum covariance matrix in the sense of the Löwner order among all unbiased estimators, but not vice versa.

82

4 Parameter Estimation

Examples: G(•, •), Nk (•,  0 )

Example 4.24 Let P = G(•, •) and X1 , . . . , Xn be an iid sample from Pζ∗ ∈ P  (see Example 2.40). Evidently, ni=1 Xi /n is an unbiased estimator for the mean Eζ [X1 ], which is given by π1 (ζ ) = −ζ2 /ζ1 in the natural parametrization as shown in Example 2.53. From Example 3.34, the Fisher information matrix of P(n) is seen to be ( ' ζ2 − ζ11 ζ12 , ζ = (ζ1 , ζ2 )t ∈ (−∞, 0) × (0, ∞) . nI(ζ ) = n − ζ11 ψ1 (ζ2 ) Hence, by setting γ = π1 , the lower bound of the Cramér-Rao inequality in formula (4.8) is obtained as ⎛ . ψ1 (ζ2 ) 1 1 2 ⎝ , − 1 ζ1 n det(I(ζ )) ζ12 ζ1



⎞'

1 ζ1 ⎠ ζ2 2 ζ1

ζ2 ζ12 − ζ11

-ζ ζ12 1. 2 = , − n(ζ2 1 (ζ2 ) − 1) ζ12 ζ1

(



2 1 (ζ2 )−1 ζ12

0

( =

ζ2 nζ12

for ζ = (ζ1 , ζ2 )t ∈ (−∞, 0) × (0, ∞). nFrom Example 2.53, we have Varζ (X1 ) = ζ2 /ζ12 such that the variance of i=1 Xi /n attains the lower bound  of the Cramér-Rao inequality. Thus, ni=1 Xi /n is an efficient estimator (and a UMVUE) for π1 . Example 4.25 Let P = Nk (•,  0 ) and X(1) , . . . , X (n) be an iid sample from Pμ ∈ P. By Example 3.35, the Fisher information matrix of P(n) is given by k n −1 0 at every μ ∈ R . Accordingly, by Theorem 4.20, the covariance matrix of any unbiased estimator for μ is bounded below by  0 /n. This is the n from (i) covariance matrix of the MLE μ ˆ (n) = i=1 X /n derived in Example 4.10, which is unbiased and therefore also efficient for μ.

Example 4.24 provides an efficient estimator for the mean of a gamma distribution. This finding is a particular case of a general result for EFs. Theorem 4.26 Let P be an EF with minimal canonical representation (4.1),  ⊂ ∗ be open, and X be a random variable with distribution in P. Then, T (X) is an

4.3 Efficient Estimation

83

efficient estimator for π(ζ ), ζ ∈ , where π is the mean value function as defined in Theorem 2.56. Proof Trivially, T (X) is an unbiased estimator for π(ζ ) = Eζ [T (X)], ζ ∈ . Moreover, from Theorems 2.56 and 3.29, we obtain that Dπ (ζ ) [I(ζ )]−1 Dπ (ζ )t = Covζ (T ) [Covζ (T )]−1 Covζ (T ) = Covζ (T (X)) ,

ζ ∈ . 

Based on an iid sample, we arrive at the following version of Theorem 4.26. Corollary 4.27 Let P be an EF with minimal canonical representation (4.1),  ⊂ ∗ be open, and π be the corresponding mean value function as defined in Theorem 2.56. let X(1) , . . . , X(n) be an iid sample from a distribution in n Moreover, (i) P. Then, i=1 T (X )/n is an efficient estimator for π(ζ ), ζ ∈ . Proof Since nI(ζ ) is the Fisher information matrix of the product EF P(n) at ζ , the lower bound of the Cramér-Rao inequality is here Covζ (T (X(1) )) n ( ' n 1 = Covζ T (X(i) ) , n

Dπ (ζ ) [nI(ζ )]−1 Dπ (ζ )t =

ζ ∈ ,

i=1

by using the same arguments  as in the proof of Theorem 4.26 and that X(1) , . . . , X(n) are iid. Evidently, ni=1 T (X(i) )/n is unbiased for π(ζ ), ζ ∈ , and the assertion follows. 

In particular, Theorem 4.26 and Corollary 4.27 apply to (minimal) regular EFs, in which the mean value parameter can therefore always be estimated efficiently. Examples: poi(•), b(n0 , •), nb(r0 , •), E(•)

Example 4.28 Let P be any of the EFs poi(•), b(n0 , •), nb(r0 , •), or E(•) in Examples 2.37, 2.38, 2.51, and 2.12. Moreover, let X1 , . . . , Xn be an iid sample  from Pζ∗ ∈ P. Then, by Corollary 4.27, ni=1 Xi /n is an efficient estimator (and thus a UMVUE) for the mean of Pζ∗ .

84

4 Parameter Estimation

4.4 Asymptotic Properties In the examples of Sect. 4.1, the probability that the ML estimate of ζ in ∗ based on n observations exists is equal to 1 or at least converges to 1 when the sample size tends to infinity. This finding holds true in general for minimal regular EFs. It is a consequence of the following theorem. Theorem 4.29 Let P be an EF with minimal canonical representation (4.1),  ⊂ ∗ be open, and X(1) , X(2) , . . . be a sequence of iid random variables with distribution Pζ∗ ∈ P for some ζ ∈ . Then, ' lim Pζ

n→∞

1 T (X(i) ) ∈ π() n n

( = 1,

i=1

i.e., the probability that the ML estimate of ζ in  based on realizations x (1), . . . , x (n) of X(1) , . . . , X(n) exists converges to 1 for n → ∞. Proof Let ζ ∈ . By Theorem 2.49, all moments of T with respect to Pζ∗ are finite. The strong law of large numbers together with the definition of π in Theorem 2.56 then yields that 1 n→∞ T (X(i) ) −→ Eζ [T (X(1) )] = π(ζ ) n n

Pζ -a.s. ,

i=1

 which implies that ni=1 T (X(i) )/n → π(ζ ) in Pζ -probability (see, e.g., [55, p. 51]). Moreover, by Theorem 2.56, π −1 is continuous, and we conclude that π() is open as the pre-image (π −1 )−1 () = {m ∈ π(int (∗ )) : π −1 (m) ∈ } of the open set  under π −1 . Hence, there exist some ε > 0 and an Euclidean ball Bε (π(ζ )) = {η ∈ Rk : ||η − π(ζ )||2 < ε} ⊂ π(). Now, it follows that ' Pζ

1 T (X(i) ) ∈ π() n n

(

' ≥



i=1

1 T (X(i) ) ∈ Bε (π(ζ )) n n

(

i=1

=

( ' n %% %% 1  % %% % 1 − Pζ %% T (X(i) ) − π(ζ )%% ≥ ε 2 n i=1

n→∞

−→ 1 .

Finally, since  is open, Lemma 4.2 yields that an ML estimate of ζ in  based on realizations x (1), . . . , x (n) of X(1) , . . . , X(n) exists (and is then uniquely determined) if and only if the corresponding likelihood equation π(·) = ni=1 T (x (i) )/n has a solution in . This completes the proof. 

4.4 Asymptotic Properties

85

In Sect. 4.3, we have shown that the parameter m = π(ζ ) of the mean value parametrization can be estimated efficiently in minimal regular EFs. Based on a single random variable X ∼ P˜m = Pπ∗−1 (m) , for example, T (X) was proven to form an efficient estimator for m. In contrast, provided that it exists in ∗ with probability 1, the MLE π −1 (T (X)) of the natural parameter does not have to be efficient, not even unbiased for ζ in general. Asymptotically, however, it possesses prominent properties, which are introduced in what follows. First, the consistency of an estimator, more precisely of a sequence of estimators, is the property that the estimator approaches the true parameter when the sample size increases.

Definition 4.30 Let γ :  → Rq be a function and γˆ (n) : (, A, Pζ ) → (Rq , Bq ) be an estimator of γ for every n ∈ N. Then, the sequence γˆ (n) , n ∈ N, is called (strongly) consistent for γ if n→∞

γˆ (n) −→ γ (ζ ) in Pζ -probability (Pζ -a.s.) for every ζ ∈ . (n) In particular, a sequence ζˆ , n ∈ N, of estimators is called (strongly) consistent for ζ if (n) n→∞ ζˆ −→ ζ

in Pζ -probability (Pζ -a.s.) for every ζ ∈ .

Note that strong consistency implies consistency by definition. Theorem 4.31 Let P be a regular EF with minimal canonical representation (4.1) and X(1) , X(2) , . . . be a sequence of iid random variables with distribution Pζ∗ ∈ P. (n) Moreover, let the MLE ζˆ of ζ in ∗ based on X(1) , . . . , X(n) exist for every n ∈ N. (n) Then, the sequence ζˆ , n ∈ N, is strongly consistent for ζ .

Proof Let ζ ∈ ∗ . As in the proof of Theorem 4.29, we have 1 T (X(i) ) −→ π(ζ ) n n

i=1

Pζ -a.s. .

86

4 Parameter Estimation

Since π −1 is continuous by Theorem 2.56, we obtain from formula (4.3) that ' ζˆ

(n)

= π

−1

1 T (X(i) ) n n

( n→∞

−→ π −1 (π(ζ )) = ζ

Pζ -a.s. .

i=1



Another main concept for sequences of estimators is asymptotic efficiency, which is motivated by Definition 4.21.

Definition 4.32 Let  ⊂ ∗ be open, γ :  → Rq be a differentiable function with Jacobian matrix Dγ (ζ ) ∈ Rq×k of rank q at every ζ ∈ , and γˆ (n) : (, A, Pζ ) → (Rq , Bq ) be an estimator of γ for every n ∈ N. Then, the sequence γˆ (n) , n ∈ N, is called asymptotically efficient for γ if √ d n (γˆ (n) − γ (ζ )) −→ Nq (0, Dγ (ζ )[I(ζ )]−1 Dγ (ζ )t )

for ζ ∈  ,

√ i.e., n(γˆ (n) − γ (ζ )) is asymptotically multivariate normal distributed with mean vector 0 and covariance matrix Dγ (ζ )[I(ζ )]−1 Dγ (ζ )t . (n) In particular, a sequence ζˆ , n ∈ N, of estimators is called asymptotically efficient for ζ if √ (n) d n (ζˆ − ζ ) −→ Nk (0, [I(ζ )]−1 )

for ζ ∈  .

Here, I(ζ ) denotes the Fisher information matrix of P at ζ , which is assumed to be regular for all ζ ∈ .

Remark 4.33 We shall address some motivational aspects for Definition 4.32 (see (n) also [39, pp. 437–443]). Suppose that ζˆ , n ∈ N, is asymptotically efficient for ζ according to Definition 4.32. If we further assume that the covariance matrix of √ (n) n(ζˆ −ζ ) converges to that of the asymptotic distribution for n → ∞, this yields for sufficiently large n that [I(ζ )]−1 ≈ Covζ

-√

n(ζˆ

(n)

. - (n) . − ζ ) = n Covζ ζˆ ,

ζ ∈ .

(n) That is, the covariance matrix of ζˆ approximately attains the lower bound [I(ζ )]−1 /n of the Cramér-Rao inequality for P(n) related to an iid sample of size n (n) (see Sect. 4.3). Note, however, that ζˆ is not necessarily unbiased for ζ .

4.4 Asymptotic Properties

87

(n) In a minimal regular EF, it can moreover be shown that for every sequence ζ˜ , n ∈ N, of estimators for ζ with

√ d (n) n (ζ˜ − ζ ) −→ Nk (0, (ζ ))

for every ζ ∈ ∗ ,

we necessarily have (ζ ) ≥ [I(ζ )]−1

λ∗ -a.e.

in the sense of the Löwner order (see [39, pp. 437–443] and the references cited therein). If  and I−1 are continuous on ∗ , then the inequality holds even everywhere on ∗ . Hence, an asymptotically efficient sequence of estimators has minimum asymptotic covariance matrix within a certain class of (sequences of) estimators. Note that asymptotic efficiency implies consistency. This can be seen by rewriting 1 √ (n) (n) ζˆ − ζ = √ n(ζˆ − ζ ) , n which converges to 0 in distribution by applying the multivariate Slutsky theorem (see, e.g., [53, p. 130]). Since the limit is a constant, this convergence holds true even in Pζ -probability (see, for instance, [55, p. 51]). Theorem 4.34 Under the assumptions of Theorem 4.31, the sequence ζˆ is asymptotically efficient for ζ .

(n)

, n ∈ N,

Proof Let ζ ∈ ∗ . First, recall that by Theorems 2.56 and 3.29, it holds that Eζ [T (X(1) )] = π(ζ ) ,

Covζ (T (X(1) )) = I(ζ ) .

The multivariate central limit theorem (see, e.g., [55, Section 1.5]) then yields that √ n

'

1 T (X(i) ) − π(ζ ) n n

( d

−→ N

(4.10)

i=1

for a random vector N ∼ Nk (0, I(ζ )). Now, to ease notation, let ρ = (ρ1 , . . . , ρk ) denote the inverse function of π, i.e., we have ρ = π −1 . Then, formula (4.3) along with the mean value theorem for real-valued functions of several variables (e.g., in [52, p. 174]) applied to every of

88

4 Parameter Estimation

the mappings ρj , 1 ≤ j ≤ k, yields √ √ (n) n (ζˆ − ζ ) = n

' ' ρ

( ( n 1 (i) T (X ) − ρ(π(ζ )) n i=1

= Gn



' n

1 T (X(i) ) − π(ζ ) n n

( ,

i=1

where Gn is a random matrix with rows t [∇ρj (V (n) j )] ,

1 ≤j ≤ k,

 (n) (n) and the random vectors V 1 , . . . , V k are convex combinations of ni=1 T (X(i) )/n and π(ζ ) for n ∈ N. For n → ∞, the strong law of large numbers ensures that n (n) (i) i=1 T (X )/n → π(ζ )Pζ -a.s. and with it V j → π(ζ )Pζ -a.s. for 1 ≤ j ≤ k. From Corollary 2.56 and Theorem 3.29, we conclude n→∞

Gn −→ Dρ (π(ζ )) = [Dπ (ζ )]−1 = [I(ζ )]−1

Pζ -a.s. .

By the multivariate Slutsky theorem (see, e.g., [53, p. 130]), it finally follows that √ (n) d n (ζˆ − ζ ) −→ [I(ζ )]−1 N , and, since [I(ζ )]−1 N ∼ Nk (0, [I(ζ )]−1 ), the proof is completed.



Note that if P is not assumed to be regular, Theorems 4.31 and 4.34 remain valid when replacing ∗ by some open set  ⊂ ∗ . Remark 4.35 The assumption in Theorems 4.31 and 4.34 saying that the MLE (n) ζˆ of ζ in ∗ based on X(1) , . . . , X(n) exists for n ∈ N may be dropped in a certain sense. Evidently, we may weaken the assumption from all n ∈ N to eventually all n ∈ N, i.e., for all n ≥ n0 for some n0 ∈ N, without affecting the findings. If even the latter property is not met, we may proceed as follows to obtain a strongly consistent and asymptotically efficient estimator for ζ . Let t ∈ π(∗ ) be arbitrary. Moreover, for n ∈ N, we introduce the set " An =

# n 1 T (X(i) ) ∈ π(∗ ) ∈ A n i=1

and the estimator ' ζ˜

(n)

= π

−1

(W n )

with

Wn =

1 T (X(i) ) n n

i=1

( 1An + t 1Acn .

4.4 Asymptotic Properties

89

 Now, for ζ ∈ ∗ , we have that ni=1 T (X(i) )/n → π(ζ )Pζ -a.s. for n → ∞ by the strong law of large numbers. Since π(∗ ) = int (M) is open, this implies that 1An → 1Pζ -a.s. and 1Acn → 0Pζ -a.s. for n → ∞. Hence, n→∞

W n −→ π(ζ )

Pζ -a.s. ,

(4.11) (n)

and, since π −1 is continuous by Theorem 2.56, it follows that ζ˜ → ζ Pζ -a.s. for (n) n → ∞, i.e., the sequence ζ˜ , n ∈ N, is strongly consistent for ζ . (n) To establish that the sequence ζ˜ , n ∈ N, is also asymptotically efficient for ζ , note first that for n ∈ N ' ( n n 1 1 (i) (i) T (X ) + t − T (X ) 1Acn Wn = n n i=1

i=1

and thus √ √ n(W n −π(ζ )) = n

'

( ' ( n n √ 1 1 (i) (i) T (X ) − π(ζ ) + n t − T (X ) 1Acn . n n i=1

i=1

Since 1Acn → 0Pζ -a.s. for n → ∞, there exists Pζ -a.s. an index m ∈ N with 1Acn = 0 for all n ≥ m. Hence, ' ( n √ 1 n→∞ (i) n t− T (X ) 1Acn −→ 0 n

Pζ -a.s. .

i=1

Formula (4.10) together with the multivariate Slutsky theorem then yields that √ d n (W n − π(ζ )) −→ Nk (0, I(ζ )) .

(4.12)

As in the proof of Theorem 4.34, it now follows with ρ = π −1 that √

√ n (ρ(W n ) − ρ(π(ζ ))) √ ˜ n n (W n − π(ζ )) , = G

(n) n (ζ˜ − ζ ) =

˜ n is a random matrix with rows where G (n) [∇ρj (V˜ j )]t ,

1 ≤j ≤ k,

(n) (n) and the random vectors V˜ 1 , . . . , V˜ k are convex combinations of W n and π(ζ ) (n) for n ∈ N. From formula (4.11), we obtain for n → ∞ that V˜ j → π(ζ )Pζ -a.s.

90

4 Parameter Estimation

˜ n → [I(ζ )]−1 Pζ -a.s. as in the proof of Theorem 4.34. The for 1 ≤ j ≤ k and thus G latter finding together with formula (4.12) then yields that √ d (n) n (ζ˜ − ζ ) −→ Nk (0, [I(ζ )]−1 ) by using the multivariate Slutsky theorem, again. Thus, asymptotic efficiency of (n) ζ˜ , n ∈ N, is shown. Asymptotic properties for curved EFs are stated in the following remark. Remark 4.36 Let PZ ⊂ P with Z :  → ∗ be a curved EF as introduced in Remark 4.14, where we assume that Z() ⊂ int (∗ ) and that Z() is relatively closed in ∗ . Moreover, suppose that the MLE ϑˆ n of ϑ in  ⊂ Rm based on a sample of size n uniquely exists for eventually all n ∈ N. Then the sequence ϑˆ n , n ∈ N, is strongly consistent provided that Z has a continuous inverse Z −1 : Z() → . If, in addition,  is open and Z is continuously differentiable with √ Jacobian matrix DZ (ϑ) of full rank m at every ϑ ∈ , then n(ϑˆ n −ϑ) converges in distribution to Nm (0, [DZ (ϑ)t I(Z(ϑ))DZ (ϑ)]−1 ). Here, I(Z(ϑ)) denotes the Fisher information matrix of P at Z(ϑ). For a detailed discussion along with proofs of the above results, see, e.g., [14]. Having proposed strongly consistent and asymptotically efficient sequences of estimators for the natural parameter ζ of the EF, we finally state sufficient conditions under which these properties transfer to a sequence of estimators for γ (ζ ). Lemma 4.37 Let P be an EF with minimal canonical representation (4.1),  ⊂ ∗ (n) be open, and γ :  → Rq be a mapping. Moreover, let ζ˜ , n ∈ N, be a sequence of estimators for ζ . Then, the following assertions hold true: (n)

(a) If ζ˜ , n ∈ N, is strongly consistent for ζ and γ is continuous, then the sequence (n) γ (ζ˜ ), n ∈ N, is strongly consistent for γ . (n)

(b) If ζ˜ , n ∈ N, is asymptotically efficient for ζ and γ is continuously differentiable with Jacobian matrix Dγ (ζ ) ∈ Rq×k of rank q at every ζ ∈ , (n) then the sequence γ (ζ˜ ), n ∈ N, is asymptotically efficient for γ . Proof Assertion (a) is clear, since a.s.-convergence is preserved under continuous mappings. To show assertion (b), let γ be continuously differentiable and ζ ∈ . Then, applying the mean value theorem for real-valued functions of several variables (e.g., in [52, p. 174]) to γj for 1 ≤ j ≤ q gives . √ √ (n) ˇ n n ζ˜ (n) − ζ , n (γ (ζ˜ ) − γ (ζ )) = G ˇ n is a random matrix with rows where G (n)

[∇γj (Vˇ j )]t ,

1≤j ≤q,

4.4 Asymptotic Properties

91

(n) (n) (n) and the random vectors Vˇ 1 , . . . , Vˇ q are convex combinations of ζ˜ and ζ (n) for n ∈ N. Since ζ˜ , n ∈ N, is assumed to be asymptotically efficient, it is (n) also consistent for ζ , i.e., we have ζ˜ → ζ in Pζ -probability for n → ∞.

(n) Consequently, for 1 ≤ j ≤ q, we also have Vˇ j → ζ in Pζ -probability for n → ∞. Moreover, for 1 ≤ j ≤ q, ∇γj is continuous by assumption, which implies that

ˇ n n→∞ −→ Dγ (ζ ) G

in Pζ -probability

by using that convergence in probability is preserved under continuous mappings (n) (see, e.g., [55, p. 59]). Since ζ˜ , n ∈ N, is asymptotically efficient for ζ , applying the multivariate Slutsky theorem finally gives √ d (n) n (γ (ζ˜ ) − γ (ζ )) −→ Nq (0, Dγ (ζ )[I(ζ )]−1 Dγ (ζ )t ) , i.e., γ (ζ˜

(n)

), n ∈ N, is asymptotically efficient for γ .



Chapter 5

Hypotheses Testing

As a basic concept of statistical inference, a statistical test allows for a data-based decision between two contrary hypotheses about some population, where two error probabilities with opposing trends are fundamental. In a one-parameter EF, uniformly most powerful tests are easily found for particular, but important hypotheses. Moreover, uniformly most powerful unbiased tests can be established in other cases and in multiparameter EFs as well via conditioning, which reduces the problem to the one-parameter case. Finally, statistical tests for several parameters simultaneously are considered with a focus on likelihood-ratio tests. For a rigorous introduction in hypotheses testing, we refer to the rich textbook literature on Mathematical Statistics. In order to become familiar with the topic, a basic knowledge of statistical testing with some applications is helpful.

5.1 One-Sided Test Problems

Definition 5.1 Let P = {Pζ∗ : ζ ∈ } be a family of distributions on the measurable space (X, B). (a) Every (B − B1 )-measurable mapping ϕ : X → [0, 1] is called a (statistical) test, and the function βϕ defined by (continued)

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 S. Bedbur, U. Kamps, Multivariate Exponential Families: A Concise Guide to Statistical Inference, SpringerBriefs in Statistics, https://doi.org/10.1007/978-3-030-81900-2_5

93

94

5 Hypotheses Testing

Definition 5.1 (continued)



ϕ dPζ∗ ,

βϕ (ζ ) = Eζ [ϕ] =

ζ ∈ ,

is its power function. If ϕ only takes on the values 0 and 1, then it is called non-randomized. (b) For non-empty, disjoint subsets 0 and 1 of , the test problem with null hypothesis H0 : ζ ∈ 0 and alternative hypothesis H1 : ζ ∈ 1 is denoted by H0 : ζ ∈ 0



H1 : ζ ∈ 1 .

(5.1)

A test ϕ is then said to have (significance) level α ∈ (0, 1) for test problem (5.1) if βϕ (ζ ) ≤ α

for all ζ ∈ 0 ,

and a test ϕ with level α ∈ (0, 1) for test problem (5.1) is called unbiased if βϕ (ζ ) ≥ α

for all ζ ∈ 1 .

Roughly speaking, ϕ(x) is interpreted as the probability to decide for the alternative hypothesis H1 if x ∈ X has been observed. ϕ(x) = 0 therefore corresponds to a decision for H0 , whereas ϕ(x) = 1 yields a decision for H1 . If there exists some x ∈ X for which ϕ(x) ∈ (0, 1), then ϕ is called a randomized test, in the case of which the decision is indeed random and has to be implemented by some additional rule; one may think of flipping a fair coin in case of ϕ(x) = 1/2 as a simple example. Allowing for randomization of the test decision is useful in the derivation of optimal tests; in practice, however, they do not play an important role and are usually avoided by adjusting the level of the test. For ζ ∈ 0 , βϕ (ζ ) is called the error probability of the first kind of ϕ at ζ ; for a non-randomized test, it is the probability of deciding in favor of H1 and the true parameter is in 0 . Accordingly, for ζ ∈ 1 , 1−βϕ (ζ ) is called the error probability of the second kind of ϕ at ζ ; for a non-randomized test, it is the probability of deciding in favor of H0 and the true parameter is in 1 . In the latter case, βϕ (ζ ) is also said to be the power of ϕ at ζ . Usually, it is not possible to simultaneously minimize the error probabilities of the first and second kind. The derivation of optimal tests is therefore based on a non-symmetric procedure. One minimizes the error probability of the second kind—and thus maximizes power—in a restricted family of tests, for example, among all tests with the same level.

5.1 One-Sided Test Problems

95

Definition 5.2 Let  be a set of statistical tests. A test ϕ ∈  is called uniformly most powerful (UMP) in  for test problem (5.1) if βϕ (ζ ) = sup βψ (ζ )

for all ζ ∈ 1 .

ψ∈

For the particular case  = {ψ : ψ is a test with level α for test problem (5.1)} , ϕ is called a UMP (uniformly most powerful) level-α test, and for  = {ψ : ψ is an unbiased test with level α for test problem (5.1)} , ϕ is called a UMPU (uniformly most powerful unbiased) level-α test for test problem (5.1).

Note that a UMP level-α test is necessarily unbiased, since its power is not smaller than that of the constant level-α test ψ ≡ α at every ζ ∈ 1 . As a consequence, every UMP level-α test is also a UMPU level-α test. For one- and multiparameter regular EFs, UMP or UMPU level-α tests can be provided for various test problems concerning a single parameter. These optimal tests are readily deduced from the (generalized) Neyman-Pearson lemma, which is presented in what follows. For the proof, see, e.g., [63, Section 2.4.]. Lemma 5.3 Let μ be a σ -finite measure and f0 , f1 , . . . , fn be real-valued, μintegrable functions on the measurable space (X, B) for some n ∈ N. Moreover, assume that (α0 , α1 , . . . , αn−1 ) ∈ int (Q) with Q =

(q0 , q1 , . . . , qn−1 ) ∈ Rn : there exists a test ψ with



ψfi dμ = qi for all i = 0, 1, . . . , n − 1 . Then, the following properties hold true:  (a) A test ϕ maximizes ψfn dμ among all tests ψ in   = {ψ : ψ is test with

ψfi dμ = αi , i = 0, 1, . . . , n − 1}

96

5 Hypotheses Testing

if and only if there exist constants k0 , k1 , . . . , kn−1 ∈ R with "  1 , fn (x) > n−1 i=0 ki fi (x) , ϕ(x) = n−1 0 , fn (x) < i=0 ki fi (x) ,

μ-a.e. .

(5.2)

(b) There exists a maximizing test ϕ according to (a).  (c) If αi = α fi dμ for i = 0, 1, . . . , n − 1 and some α ∈ (0, 1),  then a test ϕ satisfying formula (5.2) with k0 , k1 , . . . , kn−1 ≥ 0 maximizes ψfn dμ even among all tests ψ in ˜ = {ψ : ψ is test with 

 ψfi dμ ≤ αi , i = 0, 1, . . . , n − 1} .

First, we focus on optimal tests in one-parameter EFs, where UMP level-α tests for one-sided test problems are derived by applying Lemma 5.3. In what follows, as commonly used, a hypothesis ζ ∈ (−∞, ζ0 ] ∩ ∗ , say, is denoted by ζ ≤ ζ0 , for short. Theorem 5.4 Let P = {Pζ∗ = fζ∗ μ : ζ ∈ ∗ } be a regular EF with μ-densities fζ∗ (x) = C ∗ (ζ ) eζ T (x) h(x) ,

x ∈ X,

for ζ ∈ ∗ ⊂ R. Moreover, let ζ0 ∈ ∗ and α ∈ (0, 1). Then, we have: (a) For the one-sided test problem H 0 : ζ ≤ ζ0



H 1 : ζ > ζ0 ,

(5.3)

a UMP level-α test is given by ϕ(x) = 1(c,∞) (T (x)) + γ 1{c} (T (x)) ,

x ∈ X.

(b) For the one-sided test problem H 0 : ζ ≥ ζ0



H 1 : ζ < ζ0 ,

a UMP level-α test is given by ϕ(x) = 1(−∞,c) (T (x)) + γ 1{c} (T (x)) ,

x ∈ X.

In (a) and (b), respectively, the constants c ∈ R and γ ∈ [0, 1] are each determined by the equation Eζ0 [ϕ] = α .

5.1 One-Sided Test Problems

97

Proof (a) Let ϕ be as stated. Without loss of generality, we may assume that h(x) > 0 for all x ∈ X, since Pζ∗ (h = 0) = 0 for ζ ∈ ∗ . We first show that, for every fixed ζ1 ∈ ∗ with ζ1 > ζ0 , ϕ is a UMP level-α test for the test problem H˜ 0 : ζ = ζ0

H˜ 1 : ζ = ζ1



(5.4)

with two simple hypotheses (0 = {ζ0 } and 1 = {ζ1 } are singletons). Obviously, the likelihood ratio fζ∗1 (x)

fζ∗0 (x)

=

C ∗ (ζ1 ) (ζ1 −ζ0 ) T (x) e C ∗ (ζ0 )

is a strictly increasing function in T (x). ϕ can therefore equivalently be represented as ⎧ ∗ ∗ ⎪ ⎪ ⎨1 , fζ1 (x) > k0 fζ0 (x) , ϕ(x) =

γ, ⎪ ⎪ ⎩0 ,

fζ∗1 (x) = k0 fζ∗0 (x) ,

fζ∗1 (x) < k0 fζ∗0 (x) ,

for some constant k0 ∈ R such that Eζ0 [ϕ] = α. Necessarily, it holds that k0 ≥ 0. Applying Lemma 5.3(c) with n = 1, fi = fζ∗i , i = 0, 1, and α0 = α (Q = [0, 1]), ϕ then maximizes  βψ (ζ1 ) = Eζ1 [ψ] = ψ fζ∗1 dμ among all tests ψ with  βψ (ζ0 ) = Eζ0 [ψ] =

ψ fζ∗0 dμ ≤ α .

By Definitions 5.1 and 5.2, this means that ϕ is a UMP level-α test for test problem (5.4). Moreover, since ϕ does not depend on the special value of ζ1 > ζ0 , it is clear from Definition 5.2 that ϕ is even a UMP level-α test for the test problem H˜ 0 : ζ = ζ0



H 1 : ζ > ζ0 .

(5.5)

Now, note that every level-α test for test problem (5.3) is a level-α test for test problem (5.5), too. Hence, what is left to verify for ϕ to be a UMP level-α test for test problem (5.3) is that ϕ has level α for test problem (5.3). For this, it is sufficient to show that βϕ is non-decreasing, since then βϕ (ζ ) ≤ βϕ (ζ0 ) = α for ζ ≤ ζ0 by assumption.

98

5 Hypotheses Testing

By Corollary 2.55, βϕ is differentiable with derivative βϕ (ζ ) = Eζ [(T − Eζ [T ]) ϕ] ,

ζ ∈ ∗ .

Hence, we have for all ζ ∈ ∗ with c − Eζ [T ] ≥ 0 that βϕ (ζ ) =

 (c,∞)

(t − Eζ [T ]) d(Pζ∗ )T (t) + γ (c − Eζ [T ]) Pζ∗ (T = c) ≥ 0 ,

and for all ζ ∈ ∗ with c − Eζ [T ] < 0 βϕ (ζ ) ≥

 [c,∞)

(t − Eζ [T ]) d(Pζ∗ )T (t)

 = (−∞,∞)

(t − Eζ [T ]) d(Pζ∗ )T (t) −

 = − (−∞,c)

 (−∞,c)

(t − Eζ [T ]) d(Pζ∗ )T (t)

(t − Eζ [T ]) d(Pζ∗ )T (t) ≥ 0 .

The proof of assertion (b) is analogue to that of (a) and therefore omitted.



Obviously, the UMP tests in Theorem 5.4 can be chosen as non-randomized tests if T has a continuous distribution. Remark 5.5 The UMP tests in Theorem 5.4 have even strictly monotone power functions on {ζ ∈ ∗ : 0 < βϕ (ζ ) < 1}, which can be seen as follows. Let ϕ be the UMP test in Theorem 5.4(a), and let ζ1 , ζ2 ∈ ∗ with ζ1 < ζ2 and α0 = βϕ (ζ1 ) ∈ (0, 1). Then, by inspecting the proof of Theorem 5.4(a) and by Lemma 5.3(a), ϕ also maximizes Eζ2 [ψ] among all tests ψ with Eζ1 [ψ] = α0 . In particular, by setting ψ ≡ α0 , it follows that Eζ2 [ϕ] ≥ α0 . Here, if equality would be true, then ψ ≡ α0 would also be a solution of this maximizing problem and Lemma 5.3(a) then would ensure that there exists a constant k0 ∈ R with fζ∗2 = k0 fζ∗1 μ-a.e., which would yield k0 = 1 and thus the contradiction Pζ∗2 = Pζ∗1 . Hence, we have βϕ (ζ2 ) = Eζ2 [ϕ] > α0 = βϕ (ζ1 ). By the same arguments, the assertion can be shown for the UMP test in Theorem 5.4(b). Examples: poi(•), N(•, σ02 )

Example 5.6 Let X1 , . . . , Xn be an iid sample from Pa = poi(a) for some a ∈ (0, ∞), and consider the test problem H0 : a ≤ a0



H1 : a > a0 ,

5.1 One-Sided Test Problems

99

for some fixed a0 ∈ (0, ∞). By transforming to the canonical representation of poi(•) via Pζ∗ = Pexp{ζ } , ζ ∈ R, according to Example 2.37, the test problem becomes H 0 : ζ ≤ ζ0



H 1 : ζ > ζ0

with ζ0 = ln(a0 ). Applying Theorem 5.4(a) to the corresponding product family P(n) then yields that ϕ(x) = 1(c,∞) (Tn (x)) + γ 1{c} (Tn (x)) ,

x = (x1 , . . . , xn ) ∈ Nn0 ,

 forms a UMP level-α test for this test problem, where Tn (x) = ni=1 xi , x ∈ Nn0 , and the constants c ∈ R and γ ∈ [0, 1] are determined by the equation α = (Pζ∗0 )(n) (Tn > c) + γ (Pζ∗0 )(n) (Tn = c) (Tn > c) + γ Pa(n) (Tn = c) . = Pa(n) 0 0 By using Example 3.2, the moment generating function of Tn is given by   mTn (b) = mnT (b) = exp na(eb − 1) ,

b ∈ R,

(n)

where integration is with respect to Pa . This yields Tn ∼ Pna = poi(na). Hence, since 0 ≤ γ ≤ 1, c must be the minimum integer m ∈ N0 such that α ≥ Pa(n) (Tn > m) = e−na0 0

∞  (na0 )j , j!

j =m+1 (n)

(n)

and γ is then already specified as γ = [α − Pa0 (Tn > c)]/Pa0 (Tn = c). Example 5.7 Let X1 , . . . , Xn be an iid sample from N(μ, σ02 ) for some μ ∈ R and fixed σ02 ∈ (0, ∞), and consider the test problem H 0 : μ ≥ μ0



H 1 : μ < μ0 ,

(5.6)

for some fixed μ0 ∈ R. Obviously, the product family P(n) of P = N(•, σ02 ) = {N(μ, σ02 ) : μ ∈ R} forms an EF with canonical representation fμ(n) (x) = C(μ) eμTn (x) h(x) ,

x = (x1 , . . . , xn ) ∈ Rn ,

100

5 Hypotheses Testing

where C(μ) = exp{−nμ2 /(2σ02 )}, μ ∈ R, and n 1  Tn (x) = 2 xi σ0 i=1

' h(x) =

and

1 2πσ02

(n/2

"

n 1  2 exp − 2 xi 2σ0 i=1

#

for x = (x1 , . . . , xn ) ∈ Rn (see Example 2.10). Applying Theorem 5.4(b) to the EF P(n) then yields that x = (x1 , . . . , xn ) ∈ Rn ,

ϕ(x) = 1(−∞,c) (Tn (x)) ,

forms a UMP level-α test for test problem (5.6), where the constant c ∈ R is determined by Pμ(n) (Tn < c) = α . 0 Here, since Tn is continuously distributed, we have set γ = 0 and thus chosen a non-randomized test. Since Tn ∼ N(nμ0 /σ02 , n/σ02 ) under Pμ(n) 0 , we conclude that c fulfils c − nμ0 /σ02 ! = uα , n/σ02 where uα denotes the α-quantile of the standard normal distribution N(0, 1). Equivalently, we have & c =

nμ0 n uα + 2 , 2 σ0 σ0

and thus, in terms of the arithmetic mean of x1 , . . . , xn , a representation of ϕ as ' ϕ(x) =

1

! −∞,μ0 +uα σ02 /n



1 xi n n

(

i=1

This test is commonly known as Gauss test.

,

x = (x1 , . . . , xn ) ∈ Rn .

5.2 Two-Sided Test Problems

101

5.2 Two-Sided Test Problems For two-sided test problems, a UMP level-α test does not exist, in general; however, optimal tests for these test problems can be established for one-parameter EFs by farther restricting the set  of admissible tests, i.e., from all level-α tests to all unbiased level-α tests for the test problem. Theorem 5.8 Under the assumptions of Theorem 5.4 a UMPU level-α test for the two-sided test problem H 0 : ζ = ζ0



H1 : ζ = ζ0

(5.7)

is given by ⎧ ⎪ ⎪1 , ⎪ ⎨ ϕ(x) = γi , ⎪ ⎪ ⎪ ⎩ 0,

T (x) < c1 or T (x) > c2 , T (x) = ci , i = 1, 2 ,

for x ∈ X ,

(5.8)

c1 < T (x) < c2 ,

where the constants c1 , c2 ∈ R and γ1 , γ2 ∈ [0, 1] are determined by the equations Eζ0 [ϕ] = α

and

Eζ0 [ϕ T ] = α Eζ0 [T ] .

Proof We first show that every unbiased test with level α for test problem (5.7) is contained in the set  = {ψ : ψ is test with Eζ0 [ψ] = α and Eζ0 [ψT ] = α Eζ0 [T ]} . Let ψ be an unbiased test with level α for test problem (5.7). From Corollary 2.55, we have that the power function βψ (ζ ) = Eζ [ψ], ζ ∈ ∗ , of ψ is continuously differentiable with derivative βψ (ζ ) = Eζ [ψT ] − Eζ [ψ]Eζ [T ] ,

ζ ∈ ∗ .

Since unbiasedness of ψ for test problem (5.7) yields that βψ (ζ0 ) ≤ α and βψ (ζ ) ≥ α for ζ ∈ ∗ \ {ζ0 }, it necessarily holds that βψ (ζ0 ) = α by using the continuity of βψ . Moreover, unbiasedness of ψ implies that βψ has a global minimum at ζ0 . Since ψ is differentiable, it follows that 0 = βψ (ζ0 ) = Eζ0 [ψT ] − Eζ0 [ψ]Eζ0 [T ] = Eζ0 [ψT ] − αEζ0 [T ] . Hence, ψ ∈  is shown. Next, note that every UMP test in  for test problem (5.7) has level α and must be unbiased for test problem (5.7), as a power comparison with the constant test

102

5 Hypotheses Testing

ψ ≡ α in Ψ shows. Hence, it also forms a UMPU level-α test for this test problem. We therefore aim to find a UMP test in Ψ for test problem (5.7). For this, we derive a UMP test in Ψ for test problem (5.4) with two simple hypotheses, where ζ1 ∈ Θ∗ with ζ1 ≠ ζ0 is fixed; this will lead to the test ϕ as stated and, since ϕ does not depend on the special value of ζ1, it is UMP in Ψ even for test problem (5.7) by Definition 5.2.

Let ζ1 ∈ Θ∗ with ζ1 ≠ ζ0 be fixed. As in the proof of Theorem 5.4, let h > 0 without loss of generality. We apply Lemma 5.3 with n = 2, f0 = f∗_{ζ0}, f1 = T f∗_{ζ0} and f2 = f∗_{ζ1}, α0 = α and α1 = α E_{ζ0}[T]. Here, the condition that (α, α E_{ζ0}[T]) ∈ int(Q) can be verified as follows. First, by choosing ψ ≡ α + ε and ψ ≡ α − ε for a sufficiently small ε > 0, we obtain that

    (α + ε, (α + ε) E_{ζ0}[T]) ∈ Q   and   (α − ε, (α − ε) E_{ζ0}[T]) ∈ Q .      (5.9)

By Lemma 5.3(b) with n = 1, f0 = f∗_{ζ0}, f1 = T f∗_{ζ0}, and α0 = α, there exists a test ϕ1 which maximizes E_{ζ0}[ψ T] among all tests ψ with E_{ζ0}[ψ] = α. In particular, the choice ψ ≡ α yields that E_{ζ0}[ϕ1 T] ≥ α E_{ζ0}[T]. If this were an equality, ψ ≡ α would also be a solution of the maximizing problem, and Lemma 5.3(a) would then ensure the existence of a constant k0 ∈ R with T = k0 μ-a.e., which cannot be the case. Hence,

    E_{ζ0}[ϕ1] = α   and   E_{ζ0}[ϕ1 T] > α E_{ζ0}[T] .      (5.10)

By the same arguments, there exists a test ϕ2 with E_{ζ0}[ϕ2] = 1 − α and E_{ζ0}[ϕ2 T] > (1 − α) E_{ζ0}[T]. The test 1 − ϕ2 therefore meets

    E_{ζ0}[1 − ϕ2] = α   and   E_{ζ0}[(1 − ϕ2) T] < α E_{ζ0}[T] .      (5.11)

Equations (5.9)–(5.11) together with the convexity of Q then guarantee that (α, α E_{ζ0}[T]) ∈ int(Q). Now, applying Lemma 5.3(a, b) yields that there exists a UMP test ϕ in Ψ for test problem (5.4) with

    ϕ(x) = 1 ,   if f∗_{ζ1}(x) > (k0 + k1 T(x)) f∗_{ζ0}(x) ,
           0 ,   if f∗_{ζ1}(x) < (k0 + k1 T(x)) f∗_{ζ0}(x) ,

         = 1 ,   if (C∗(ζ1)/C∗(ζ0)) exp{(ζ1 − ζ0) T(x)} > k0 + k1 T(x) ,       μ-a.e. ,
           0 ,   if (C∗(ζ1)/C∗(ζ0)) exp{(ζ1 − ζ0) T(x)} < k0 + k1 T(x) ,

with constants k0, k1 ∈ R such that ϕ ∈ Ψ.

Fig. 5.1 Illustration of the solutions c1 and c2 of Eq. (5.12) in t as intersections of a line and a strictly convex function

Finally, we consider all solutions of the equation

    (C∗(ζ1)/C∗(ζ0)) e^{(ζ1 − ζ0) t} = k0 + k1 t      (5.12)

with respect to t ∈ R. If Eq. (5.12) had no solution in t, then it would follow that ϕ ≡ 0 or ϕ ≡ 1, in contradiction to E_{ζ0}[ϕ] = α. On the other hand, since the left-hand side of Eq. (5.12) is strictly convex as a function of t, there exist at most two solutions of Eq. (5.12) in t. ϕ therefore admits a representation as in formula (5.8) with constants c1, c2 ∈ R and γ1, γ2 ∈ [0, 1] such that ϕ ∈ Ψ (with c1 = c2 and γ1 = γ2 if Eq. (5.12) has only one solution). See Fig. 5.1 for an illustration. □

Example: E(•)

Example 5.9 Let X1, . . . , Xn be an iid sample from E(λ) ∈ P = E(•) for some λ ∈ (0, ∞), and consider the test problem

    H0 : λ = λ0   versus   H1 : λ ≠ λ0 ,

for some fixed λ0 ∈ (0, ∞). By transforming to the canonical representation of E(•) via P∗_ζ = E(−1/ζ), ζ ∈ (−∞, 0), according to Example 2.12, the test problem becomes

    H0 : ζ = ζ0   versus   H1 : ζ ≠ ζ0

with ζ0 = −1/λ0. Applying Theorem 5.8 to the corresponding product family P^(n) then yields that

    ϕ(x) = 1 − 1_{[c1, c2]}(Tn(x)) ,    x = (x1, . . . , xn) ∈ (0, ∞)^n ,

forms a UMPU level-α test for this test problem, where Tn(x) = ∑_{i=1}^n xi, x ∈ (0, ∞)^n, and the constants c1, c2 ∈ R are determined by the equations

    ∫ [1 − 1_{[c1, c2]}(Tn)] d(P∗_{ζ0})^(n) = α   and   ∫ [1 − 1_{[c1, c2]}(Tn)] Tn d(P∗_{ζ0})^(n) = α ∫ Tn d(P∗_{ζ0})^(n) .

Here, since Tn is continuously distributed, we have set γi = 0, i = 1, 2. Moreover, we have that Tn ∼ G(−1/ζ0, n) = G(λ0, n) with corresponding density f_{λ0,n} and mean nλ0 under (P∗_{ζ0})^(n). Hence, we may rewrite the above equations as

    ∫_{c1}^{c2} f_{λ0,n}(t) dt = 1 − α   and   ∫_{c1}^{c2} t f_{λ0,n}(t) dt = nλ0 (1 − α) .

It can be shown analytically that these equations have a unique solution in (c1, c2) ∈ R²; after the substitution t = λ0 s, the equations no longer involve λ0, so that the standardized constants c1/λ0 and c2/λ0 do not depend on λ0. For a given sample size n and a desired significance level α, the solution can be obtained numerically, for instance, via the Newton-Raphson method (see, e.g., [7]).
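A small numerical sketch of this step (Python with SciPy; illustrative only) solves the standardized equations for λ0 = 1, using the identity t f_{1,n}(t) = n f_{1,n+1}(t) to express both equations through gamma distribution functions; for a general λ0, the constants are then obtained by rescaling with λ0:

```python
# Sketch: solve the two equations determining c1, c2 for lambda_0 = 1.
import numpy as np
from scipy.stats import gamma
from scipy.optimize import fsolve

def two_sided_constants(n, alpha):
    def equations(c):
        c1, c2 = c
        return [gamma.cdf(c2, n) - gamma.cdf(c1, n) - (1 - alpha),          # level condition
                gamma.cdf(c2, n + 1) - gamma.cdf(c1, n + 1) - (1 - alpha)]  # unbiasedness condition
    start = [gamma.ppf(alpha / 2, n), gamma.ppf(1 - alpha / 2, n)]          # equal-tailed start
    return fsolve(equations, start)

print(two_sided_constants(n=10, alpha=0.05))
```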

Besides one- and two-sided test problems, choosing a bounded interval as (either null or alternative) hypothesis may also be of interest in applications; such test problems are known as hypotheses of (non-)equivalence in the literature (see [62] for an extensive account on the topic). For one-parameter EFs, optimal tests for these test problems can be derived in a similar way as in the proofs of Theorems 5.4 and 5.8. They are stated in the following theorem; a proof can be found, for instance, in [63, Section 2.4].

Theorem 5.10 Let the assumptions of Theorem 5.4 be given, and let ζ1, ζ2 ∈ Θ∗ with ζ1 < ζ2. Then, the following assertions are true:

(a) For the test problem

    H0 : ζ ∈ [ζ1, ζ2]   versus   H1 : ζ ∉ [ζ1, ζ2] ,

a UMPU level-α test is given by ϕ in formula (5.8).

(b) For the test problem

    H0 : ζ ∉ (ζ1, ζ2)   versus   H1 : ζ ∈ (ζ1, ζ2) ,

a UMP level-α test is given by

    ϕ(x) = 1 ,    if c1 < T(x) < c2 ,
           γi ,   if T(x) = ci , i = 1, 2 ,            for x ∈ X .
           0 ,    if T(x) < c1 or T(x) > c2 ,

In (a) and (b), respectively, the constants c1, c2 ∈ R and γ1, γ2 ∈ [0, 1] are each determined by the equations

    E_{ζ1}[ϕ] = α   and   E_{ζ2}[ϕ] = α .

Remark 5.11 Note that, since every UMP level-α test is unbiased, the tests in Theorems 5.4 and 5.10(b) are also UMPU level-α tests for the given test problem.

5.3 Testing for a Selected Parameter

In multiparameter EFs (k > 1), UMPU level-α tests for test problems concerning a single parameter ζ1, say, may be derived as well. We illustrate the key arguments by considering the regular EF P = {P∗_ζ : ζ = (ζ1, ζ2)^t ∈ Θ∗} with minimal canonical representation

    f∗_ζ(x) = C∗(ζ) exp{ζ1 T1(x) + ζ2 T2(x)} h(x) ,    x ∈ X ,      (5.13)

and the test problem (5.1) with Θ0 = {ζ ∈ Θ∗ : ζ1 ≤ ζ0} and Θ1 = {ζ ∈ Θ∗ : ζ1 > ζ0} or, for short,

    H0 : ζ1 ≤ ζ0   versus   H1 : ζ1 > ζ0 ,      (5.14)

where ζ0 is fixed. Since T = (T1, T2)^t is sufficient for P by Corollary 3.21, we may restrict ourselves to tests ψ = ψ(T) that depend on x only through T(x). Let B = {ζ ∈ Θ∗ : ζ1 = ζ0} denote the common boundary of Θ0 and Θ1. Since the power function of any test is continuous by Corollary 2.55, every unbiased test ψ with level α for test problem (5.14) is necessarily α-similar on B, i.e.,

    Eζ[ψ(T)] = α   for all ζ ∈ B .      (5.15)

Hence, if we find a UMP test among all tests with level α for test problem (5.14) satisfying condition (5.15), then this is also a UMPU level-α test for test problem (5.14), where unbiasedness of the test is obtained by a power comparison with the constant test ψ ≡ α.

Next, note that the power function of every test ψ at ζ ∈ Θ∗ can be represented as

    βψ(ζ) = Eζ[ψ(T)] = ∫ ∫ ψ(t1, t2) d(P∗_ζ)^{T1|T2=t2}(t1) d(P∗_ζ)^{T2}(t2) .      (5.16)

A crucial point is now that T2 is a sufficient and (boundedly) complete statistic for the EF {P∗_ζ : ζ ∈ B}, which follows by setting ζ1 = ζ0 in formula (5.13) and then applying Corollary 3.21. The tests satisfying condition (5.15) are therefore exactly the tests having so-called Neyman structure with respect to T2, i.e., the tests ψ that meet

    ∫ ψ(t1, t2) d(P∗_ζ)^{T1|T2=t2}(t1) = α    (P∗_ζ)^{T2}-a.s. , ζ ∈ B

(see, for instance, [40, p. 118/119] and [55, pp. 405/406]). This suggests to consider the inner integral in representation (5.16) separately for every fixed t2, which reduces the maximizing problem to the one-parameter case, since the conditional distribution of T1 given T2 = t2 does not depend on ζ2 and forms a one-parameter EF in ζ1 (see Theorem 3.11).

Detailed proofs of the following theorems can be found, e.g., in [40, pp. 119–122] and [55, pp. 406–408]. Without loss of generality, the results are formulated in terms of the first parameter.

Theorem 5.12 Let P = {P∗_ζ = f∗_ζ μ : ζ ∈ Θ∗} be a regular EF with minimal canonical representation of the μ-densities given by

    f∗_ζ(x) = C∗(ζ) exp{ζ1 U(x) + (ζ2, . . . , ζk) V(x)} h(x) ,    x ∈ X ,

for ζ = (ζ1, . . . , ζk)^t ∈ Θ∗ ⊂ R^k, where U : (X, B) → (R¹, B¹) is a random variable and V : (X, B) → (R^{k−1}, B^{k−1}) is a column random vector. Moreover, let η = (η1, . . . , ηk)^t ∈ Θ∗ and α ∈ (0, 1). Then, we have:

(a) For the one-sided test problem

    H0 : ζ1 ≤ η1   versus   H1 : ζ1 > η1 ,

a UMPU level-α test is given by ϕ = ψ ∘ (U, V) with mapping

    ψ(u, v) = 1_{(c(v), ∞)}(u) + γ(v) 1_{{c(v)}}(u) ,    u ∈ R , v ∈ R^{k−1} .

(b) For the one-sided test problem

    H0 : ζ1 ≥ η1   versus   H1 : ζ1 < η1 ,

a UMPU level-α test is given by ϕ = ψ ∘ (U, V) with mapping

    ψ(u, v) = 1_{(−∞, c(v))}(u) + γ(v) 1_{{c(v)}}(u) ,    u ∈ R , v ∈ R^{k−1} .

In (a) and (b), respectively, the mappings c, γ : (R^{k−1}, B^{k−1}) → (R¹, B¹) with 0 ≤ γ ≤ 1 are each determined by the equations

    Eη[ϕ | V = v] = α ,    v ∈ R^{k−1} .      (5.17)

Note that the expected value in formula (5.17) is independent of η2, . . . , ηk, since

    Eη[ϕ | V = v] = Eη[ψ(U, v) | V = v] = ∫ ψ(u, v) d(P∗_η)^{U|V=v}(u) ,

and the conditional distribution of U given V = v under P∗_η depends only on η1.

Theorem 5.13 Under the assumptions of Theorem 5.12, a UMPU level-α test for the two-sided test problem

    H0 : ζ1 = η1   versus   H1 : ζ1 ≠ η1

is given by ϕ = ψ ∘ (U, V) with mapping

    ψ(u, v) = 1 ,       if u < c1(v) or u > c2(v) ,
              γi(v) ,   if u = ci(v) , i = 1, 2 ,            u ∈ R¹ , v ∈ R^{k−1} .
              0 ,       if c1(v) < u < c2(v) ,

Here, the mappings ci, γi : (R^{k−1}, B^{k−1}) → (R¹, B¹) with 0 ≤ γi ≤ 1, i = 1, 2, are determined by the equations

    Eη[ϕ | V = v] = α   and   Eη[ϕ U | V = v] = α Eη[U | V = v]      (5.18)

for v ∈ R^{k−1}. The expected values in formula (5.18) are all independent of η2, . . . , ηk.

Respective statements hold true for the test problems in Theorem 5.10 with interval hypotheses.

Example: N2(•, Σ0)

Example 5.14 Let P = N2(•, Σ0) = {N2(μ, Σ0) : μ ∈ R²} be the EF of bivariate normal distributions according to Example 2.18 with k = 2 and fixed covariance matrix

    Σ0 = ( σ1²   σ12 ; σ12   σ2² ) ,

where σ1, σ2 > 0 and σ12 ∈ R with σ12² < σ1² σ2². We assume to have an iid sample X^(1), . . . , X^(n) from Pμ and consider the test problem

    H0 : μ1 ≤ η1   versus   H1 : μ1 > η1      (5.19)

for fixed η1 ∈ R. From Example 2.18, we know that P_μ^T = N2(Σ0^{-1} μ, Σ0^{-1}) for T(x) = Σ0^{-1} x, x ∈ R². For the product family P^(n), it therefore follows that T^(n)(x̃) = ∑_{i=1}^n T(x^(i)), x̃ = (x^(1), . . . , x^(n)) ∈ (R²)^n, has a N2(n Σ0^{-1} μ, n Σ0^{-1})-distribution under P_μ^(n), where

    n Σ0^{-1} = n/(σ1² σ2² − σ12²) ( σ2²   −σ12 ; −σ12   σ1² ) = ( τ1²   τ12 ; τ12   τ2² ) , say .      (5.20)

The conditional distribution of T1^(n) given T2^(n) = v under P_μ^(n) is a univariate normal distribution with mean μ(v) and variance σ²(v) given by

    μ(v) = τ1² μ1 + τ12 μ2 + (τ12/τ2²) [ v − (τ12 μ1 + τ2² μ2) ]
         = (1/τ2²) [ (τ1² τ2² − τ12²) μ1 + τ12 v ] = (n μ1 − σ12 v)/σ1²

and

    σ²(v) = τ1² ( 1 − τ12²/(τ1² τ2²) ) = (1/τ2²) (τ1² τ2² − τ12²) = n/σ1² ,

where the latter is free of v. In the derivation, we have used the identity

    τ1² τ2² − τ12² = n²/(σ1² σ2² − σ12²) ,

which is obtained by considering determinants in formula (5.20), as well as the equations τ12/τ2² = −σ12/σ1² and τ2² = n σ1²/(σ1² σ2² − σ12²). Note that the conditional distribution does not depend on μ2. Denoting by c_{1−α}(v) the (1 − α)-quantile of this distribution, it holds for v ∈ R that

    u_{1−α} = (c_{1−α}(v) − μ(v))/σ(v)

and thus

    c_{1−α}(v) = u_{1−α} σ(v) + μ(v) = u_{1−α} √(n/σ1²) + (n μ1 − σ12 v)/σ1² ,

where u_{1−α} is the (1 − α)-quantile of the standard normal distribution N(0, 1). Hence, by Theorem 5.12(a), a UMPU level-α test for test problem (5.19) is given by

    ϕ(x̃) = 1_{(c(T2^(n)(x̃)), ∞)}(T1^(n)(x̃)) ,    x̃ ∈ (R²)^n ,

with

    c(v) = u_{1−α} √(n/σ1²) + (n η1 − σ12 v)/σ1² ,    v ∈ R .
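A small computational sketch of this test (Python with NumPy/SciPy; function and variable names are illustrative and not part of the text):

```python
# Sketch: UMPU level-alpha test of Example 5.14 for H0: mu_1 <= eta_1
# with known covariance matrix Sigma_0.
import numpy as np
from scipy.stats import norm

def umpu_test_mu1(x_tilde, eta1, sigma0, alpha=0.05):
    """x_tilde: (n, 2) array of iid bivariate normal observations."""
    x_tilde = np.asarray(x_tilde, dtype=float)
    n = x_tilde.shape[0]
    sigma0 = np.asarray(sigma0, dtype=float)
    s11, s12 = sigma0[0, 0], sigma0[0, 1]
    t = np.linalg.solve(sigma0, x_tilde.sum(axis=0))   # T^(n) = Sigma_0^{-1} * sum of x^(i)
    t1, v = t[0], t[1]
    c_v = norm.ppf(1 - alpha) * np.sqrt(n / s11) + (n * eta1 - s12 * v) / s11
    return int(t1 > c_v)                               # 1 = reject H0

rng = np.random.default_rng(1)
sigma0 = np.array([[1.0, 0.3], [0.3, 2.0]])
data = rng.multivariate_normal(mean=[0.8, 0.0], cov=sigma0, size=40)
print(umpu_test_mu1(data, eta1=0.0, sigma0=sigma0))
```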

5.4 Testing for Several Parameters

In the preceding sections, we have proposed optimal tests for test problems concerning a single parameter of a one- or multiparameter EF. Here, we consider tests for test problem (5.1) with a parameter set Θ0 ⊂ R^k making assumptions on more than one component of ζ, where we focus on likelihood-ratio tests. On the one hand, these tests enable us to check conjectures on several parameters simultaneously as, for instance, a simple null hypothesis by choosing Θ0 = {η} for some fixed η ∈ Θ. On the other hand, they may be used to test for relations between the parameters; a homogeneity test, for example, is obtained by setting Θ0 = {ζ ∈ Θ : ζ1 = · · · = ζk}.

Definition 5.15 Let μ be a σ-finite measure and P = {f∗_ζ μ : ζ ∈ Θ}, Θ ⊂ R^k, be a family of probability distributions on a measurable space (X, B). Moreover, let ∅ ≠ Θ0 ⊊ Θ and α ∈ (0, 1). For the test problem

    H0 : ζ ∈ Θ0   versus   H1 : ζ ∈ Θ \ Θ0 ,      (5.21)

the likelihood-ratio (LR) test with level α is defined as

    ϕ_Λ(x) = 1_{(c, ∞)}(Λ(x)) + γ 1_{{c}}(Λ(x)) ,    x ∈ X ,

provided that the test statistic

    Λ(x) = −2 ln( sup_{ζ∈Θ0} f∗_ζ(x) / sup_{ζ∈Θ} f∗_ζ(x) ) ,    x ∈ X ,

is well-defined μ-a.e. Here, the constants c ≥ 0 and γ ∈ [0, 1] are determined by the equation

    sup_{ζ∈Θ0} Eζ[ϕ_Λ] = α .

The LR test compares the maximum 'probability' f∗_ζ(x) of observing x under Θ0 with the one under Θ; the null hypothesis is then rejected if the ratio of both values is (too) small (see the representations of Λ and ϕ_Λ in Definition 5.15). The numerator and the denominator in the expression of Λ are nothing but the density f∗_ζ(x), where the parameter is chosen as the ML estimate of ζ in Θ0 and in Θ, respectively. In minimal regular EFs, Λ is closely related to the Kullback-Leibler divergence.

Theorem 5.16 Let P be a regular EF with minimal canonical representation (4.1), and let x be a realization of the random variable X with distribution P∗_ζ ∈ P. If the ML estimate ζ̂(x) of ζ in Θ∗ and the ML estimate ζ̂0(x) of ζ in Θ0 based on x exist, then the test statistic of the LR test for test problem (5.21) with Θ = Θ∗ is given by

    Λ(x) = 2 D_KL(ζ̂(x), Θ0) = 2 D_KL(ζ̂(x), ζ̂0(x)) ,      (5.22)

where

    D_KL(ζ, Θ0) = inf_{η∈Θ0} D_KL(ζ, η) ,    ζ ∈ Θ∗ .

Proof Without loss of generality, we assume that h(x) > 0. Then, since the logarithm is strictly increasing, we have

    Λ(x) = 2 [ ln(f∗_{ζ̂(x)}(x)) − ln( sup_{ζ∈Θ0} f∗_ζ(x) ) ]
         = 2 [ ζ̂(x)^t T(x) − κ(ζ̂(x)) − sup_{ζ∈Θ0} {ζ^t T(x) − κ(ζ)} ]
         = 2 inf_{ζ∈Θ0} [ ζ̂(x)^t T(x) − κ(ζ̂(x)) − (ζ^t T(x) − κ(ζ)) ]
         = 2 inf_{ζ∈Θ0} [ κ(ζ) − κ(ζ̂(x)) + (ζ̂(x) − ζ)^t T(x) ] .

Now, by Theorem 4.3, T(x) = π(ζ̂(x)), and the first equality in formula (5.22) follows from Lemma 3.37. Since the supremum of ζ^t T(x) − κ(ζ), ζ ∈ Θ0, is attained at ζ̂0(x), the equation Λ(x) = 2 D_KL(ζ̂(x), ζ̂0(x)) is also valid. □

In particular, for fixed η ∈ Θ∗ and the test problem

    H0 : ζ = η   versus   H1 : ζ ≠ η ,      (5.23)

with a simple null hypothesis, Theorem 5.16 yields that Λ = 2 D_KL(ζ̂, η) provided that the ML estimate ζ̂(x) of ζ exists for x ∈ X.

Remark 5.17 The EF structure may also be useful for deriving other well-known test statistics for multivariate test problems. For test problem (5.21), if all appearing quantities exist, the Rao score statistic and the Wald statistic are defined as

    R = U^t_{ζ̂0} I(ζ̂0)^{-1} U_{ζ̂0}   and   W = g(ζ̂)^t [ D_g(ζ̂) I(ζ̂)^{-1} D_g(ζ̂)^t ]^{-1} g(ζ̂) ,

where U_ζ is the score statistic and I(ζ) is the Fisher information matrix at ζ, and g : Θ → R^q, q ≤ k, is a differentiable function with Jacobian matrix D_g(ζ) of full rank q at every ζ ∈ Θ, which satisfies that g(ζ) = 0 if and only if ζ ∈ Θ0 (see, e.g., [53, pp. 239/240]). Note that R depends only on the MLE ζ̂0 of ζ in Θ0, whereas W depends only on the MLE ζ̂ of ζ in Θ. The Rao score test and the Wald test then reject H0 for (too) large values of R and W, respectively. These decision rules are based on the fact that, if ζ ∈ Θ0 is true, then, on the one hand, U_{ζ̂0} will tend to be close to U_{ζ̂} = 0 and, on the other hand, g(ζ̂) is expected to be small as an estimator of g(ζ) = 0. Now, if the underlying model forms a minimal regular EF, we may utilize the representations for U_ζ and I(ζ) as derived in Theorem 3.29 and also make use of the results in Sects. 4.1 and 4.2 to find the MLEs ζ̂ and ζ̂0.

Example: Nk(•, Σ0)

Example 5.18 Let X^(1), . . . , X^(n) be an iid sample from Nk(μ, Σ0) for some μ ∈ R^k and fixed Σ0 ∈ R^{k×k} with Σ0 > 0, and consider the test problem

    H0 : μ = η   versus   H1 : μ ≠ η ,      (5.24)

for some fixed η ∈ R^k. From Example 4.10, the ML estimate of μ based on realizations x^(1), . . . , x^(n) of X^(1), . . . , X^(n) is given by μ̂^(n)(x^(1), . . . , x^(n)) = ∑_{i=1}^n x^(i)/n. Hence, by using Theorem 5.16 and the representation of D_KL derived in Example 3.41, the test statistic of the LR test for test problem (5.24) is given by

    Λn = 2 n D_KL(μ̂^(n), η) = n ||μ̂^(n) − η||²_{Σ0} .

Here, note that n D_KL is the Kullback-Leibler divergence corresponding to the EF P^(n) of product measures. Under H0, we have

    Σ0^{-1/2} √n (μ̂^(n) − η) ∼ Nk(0, Ek) ,

where Ek is the k-dimensional unit matrix and Σ0^{-1/2} ∈ R^{k×k} is a matrix satisfying Σ0^{-1/2} > 0 and Σ0^{-1/2} Σ0^{-1/2} = Σ0^{-1}, the existence of which is ensured by the spectral theorem. Thus, under H0, Λn ∼ χ²(k), i.e., Λn has a chi-square distribution with k degrees of freedom. The LR test with level α is therefore given by

    ϕ(x̃) = 1_{(χ²_{1−α}(k), ∞)}( n ||μ̂^(n)(x̃) − η||²_{Σ0} ) ,    x̃ = (x^(1), . . . , x^(n)) ∈ (R^k)^n ,

where χ²_{1−α}(k) denotes the (1 − α)-quantile of χ²(k) (see, e.g., [45, p. 124]).

Now, let us derive the Rao score statistic and the Wald statistic for test problem (5.24) as introduced in Remark 5.17, where we choose g(μ) = μ − η, μ ∈ R^k. First, note that by Example 2.18 and Theorem 3.13(b) the statistic T^(n) for the product EF P^(n) can be represented as T^(n) = n Σ0^{-1} μ̂^(n). According to Theorem 3.29 and Example 3.35, the score statistic and the Fisher information matrix belonging to P^(n) are therefore given by U^(n)_μ = n Σ0^{-1} (μ̂^(n) − μ) and I^(n)(μ) = n Σ0^{-1} for μ ∈ R^k. Hence, the Rao score statistic is

    Rn = [U^(n)_η]^t [I^(n)(η)]^{-1} U^(n)_η = n ||μ̂^(n) − η||²_{Σ0} .

Moreover, we have D_g(μ) = Ek, μ ∈ R^k, and the Wald statistic is

    Wn = g(μ̂^(n))^t I^(n)(μ̂^(n)) g(μ̂^(n)) = n ||μ̂^(n) − η||²_{Σ0} .

Hence, in the actual case, all three test statistics Λn, Rn, and Wn coincide and thus yield the same test.
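Numerically, the test of Example 5.18 amounts to comparing n ||μ̂^(n) − η||²_{Σ0} with the χ² quantile, as in the following sketch (Python with NumPy/SciPy; illustrative only):

```python
# Sketch: LR test of Example 5.18 for H0: mu = eta with known covariance Sigma_0
# (equivalently the Rao score and Wald tests, since all three statistics coincide here).
import numpy as np
from scipy.stats import chi2

def lr_test_mean(x_tilde, eta, sigma0, alpha=0.05):
    """x_tilde: (n, k) array of iid N_k(mu, Sigma_0) observations."""
    x_tilde = np.asarray(x_tilde, dtype=float)
    n, k = x_tilde.shape
    diff = x_tilde.mean(axis=0) - np.asarray(eta, dtype=float)
    lam_n = n * diff @ np.linalg.solve(np.asarray(sigma0, dtype=float), diff)
    return lam_n, int(lam_n > chi2.ppf(1 - alpha, df=k))   # (statistic, 1 = reject H0)

rng = np.random.default_rng(2)
sigma0 = np.array([[2.0, 0.5], [0.5, 1.0]])
data = rng.multivariate_normal(mean=[0.0, 0.0], cov=sigma0, size=30)
print(lr_test_mean(data, eta=[0.0, 0.0], sigma0=sigma0))
```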

In general, for finite sample size n ∈ N, the distribution of Λn under the null hypothesis is not explicitly available. However, for sufficiently large n, we may employ the asymptotic distribution of Λn under H0 instead, which is easily derived by using the properties of the EF. Choosing the (1 − α)-quantile of the asymptotic distribution as critical value for Λn, the actual level of the test ϕ_{Λn} will converge to the nominal level α when n tends to infinity. For a simple null hypothesis, the asymptotic distribution of Λn can be obtained from a Taylor series expansion of the log-likelihood function and by using (asymptotic) properties of the MLE of ζ (see, e.g., [54, pp. 151–156]).

Theorem 5.19 Let P be a regular EF with minimal canonical representation (4.1), and let X^(1), X^(2), . . . be a sequence of iid random variables with distribution P∗_ζ ∈ P. Moreover, for eventually all n ∈ N, let the MLE ζ̂^(n) of ζ in Θ∗ based on X^(1), . . . , X^(n) exist and Λn = Λn(X^(1), . . . , X^(n)) denote the test statistic of the LR test for test problem (5.23) based on X^(1), . . . , X^(n). Then, under H0,

    Λn →_d χ²(k)

for n → ∞, i.e., Λn is asymptotically chi-square distributed with k degrees of freedom if n tends to infinity.

Proof Without loss of generality, let ζ̂^(n) exist for all n ∈ N. Let ℓn(ζ) = ln( ∏_{i=1}^n f∗_ζ(X^(i)) ), ζ ∈ Θ∗, denote the log-likelihood function based on X^(1), . . . , X^(n) for n ∈ N. Moreover, suppose that H0 is true, i.e., ζ = η. Since ℓn has a maximum at ζ̂^(n) and Θ∗ is open, it necessarily holds that ∇ℓn(ζ̂^(n)) = 0 for all n ∈ N. Hence, the Taylor series expansion of ℓn about ζ̂^(n) at ζ is given by

    ℓn(ζ) = ℓn(ζ̂^(n)) + (1/2) (ζ̂^(n) − ζ)^t H_{ℓn}(ζ̃^(n)) (ζ̂^(n) − ζ) ,

where ζ̃^(n) is a convex combination of ζ and ζ̂^(n), n ∈ N. For the test statistic Λn = Λn(X^(1), . . . , X^(n)) of the LR test based on X^(1), . . . , X^(n), we thus obtain

    Λn = −2 (ℓn(ζ) − ℓn(ζ̂^(n))) = √n (ζ̂^(n) − ζ)^t [ −(1/n) H_{ℓn}(ζ̃^(n)) ] √n (ζ̂^(n) − ζ) .

From Corollary 3.30 and Theorem 2.56, we have that

    −(1/n) H_{ℓn}(ζ) = H_κ(ζ) = I(ζ) > 0 ,    ζ ∈ Θ∗ ,

which allows for the representation

    Λn = || I(ζ̃^(n))^{1/2} √n (ζ̂^(n) − ζ) ||²₂ ,      (5.25)

with a matrix I(ζ̃^(n))^{1/2} > 0 satisfying I(ζ̃^(n))^{1/2} I(ζ̃^(n))^{1/2} = I(ζ̃^(n)) for n ∈ N; here, existence of such a matrix is guaranteed by the spectral theorem. From Theorem 4.31, we have that ζ̂^(n) → ζ Pζ-a.s. and thus also ζ̃^(n) → ζ Pζ-a.s. for n → ∞. Since the mapping I(·)^{1/2} is continuous on Θ∗, it follows that I(ζ̃^(n))^{1/2} → I(ζ)^{1/2} Pζ-a.s. for n → ∞. Moreover, by Theorem 4.34,

    √n (ζ̂^(n) − ζ) →_d N ∼ Nk(0, [I(ζ)]^{-1}) ,

and we conclude from the multivariate Slutsky theorem that

    I(ζ̃^(n))^{1/2} √n (ζ̂^(n) − ζ) →_d I(ζ)^{1/2} N ∼ Nk(0, Ek) ,

where Ek ∈ R^{k×k} denotes the k-dimensional unit matrix. In virtue of the continuous mapping || · ||²₂, the assertion now follows from formula (5.25) together with the continuous mapping theorem (see, e.g., [18, Chapter 1, Section 5]). □

Theorem 5.19 is a special case of the following, more general result; its proof can be found, for instance, in [53, p. 240] and [54, pp. 156–160].

Theorem 5.20 In the situation of Theorem 5.19, let the test problem be replaced by test problem (5.21) with Θ = Θ∗ and

    Θ0 = {ζ ∈ Θ∗ : g(ζ) = 0} ,

where g : Θ∗ → R^q, q ≤ k, is a differentiable function with Jacobian matrix D_g(ζ) of full rank q at every ζ ∈ Θ∗. If the MLE ζ̂0^(n) of ζ in Θ0 based on X^(1), . . . , X^(n) exists for eventually all n ∈ N, then

    Λn →_d χ²(q)

under H0 for n → ∞.

The function g in Theorem 5.20 characterizes the null hypothesis in the same way as in Remark 5.17 for the Wald test. Setting g(ζ) = ζ − η, ζ ∈ Θ∗, in Theorem 5.20 gives the result of Theorem 5.19. The choice

    g(ζ) = (ζ_{i1} − η_{i1}, . . . , ζ_{iq} − η_{iq})^t ,    ζ ∈ Θ∗ ,

for fixed η = (η1, . . . , ηk)^t ∈ Θ∗ and i1, . . . , iq ∈ {1, . . . , k} with i1 < · · · < iq corresponds to the null hypothesis H0 : ζ_{ij} = η_{ij} for all 1 ≤ j ≤ q, which may therefore be used to check only particular components of ζ. Moreover, we may test for homogeneity of the parameters, i.e., for H0 : ζ1 = · · · = ζk, by defining, e.g.,

    g(ζ) = (ζ2 − ζ1, . . . , ζk − ζ_{k−1})^t ,    ζ ∈ Θ∗ ,

in the case of which Λn →_d χ²(k − 1) under H0 for n → ∞.

Remark 5.21 Let the iid setup and the test problem in Theorem 5.20 be given. If the respective estimators ζ̂0^(n) and ζ̂^(n) exist for eventually all n ∈ N, the Rao score statistic Rn and the Wald statistic Wn (defined on X^n according to Remark 5.17) are asymptotically χ²(q)-distributed under H0 for n → ∞. For the proofs, see, e.g., [53, p. 240] and [54, pp. 156–160].

For further reading on testing hypotheses in EFs, we refer, for example, to [63, 20, 40, 42, 34].

Chapter 6

Exemplary Multivariate Applications

In this chapter, we demonstrate the usefulness of the proposed results for EFs by means of three multivariate examples of different kinds. First, for the discrete case, the family of negative multinomial distributions is considered. Then, we turn to the continuous case and study properties of Dirichlet distributions. Finally, the results are applied to joint distributions of generalized order statistics, which form a unifying approach to various models of ordered random variables. In the joint density function of generalized order statistics, an absolutely continuous baseline distribution function can be arbitrarily chosen.

6.1 Negative Multinomial Distribution

The negative multinomial distribution is a natural multivariate extension of the negative binomial distribution (see, for example, [28, Chapter 36]). Suppose that we have an experiment with k + 1 possible outcomes O1, . . . , O_{k+1}, which is independently repeated until a given number r of a particular outcome O_{k+1}, say, is observed. Then, the negative multinomial distribution describes the probabilities of observing exactly xi times outcome Oi for 1 ≤ i ≤ k up to the r-th observation of O_{k+1}.

Definition 6.1 The negative multinomial distribution nm(r, p) with r ∈ N and p = (p1, . . . , pk)^t ∈ (0, 1)^k, k ∈ N, satisfying ∑_{j=1}^k pj < 1 is defined as the probability distribution on N0^k with counting density

    f_p(x) = [ (r − 1 + ∑_{j=1}^k xj)! / ( (r − 1)! ∏_{j=1}^k xj! ) ] ( ∏_{j=1}^k pj^{xj} ) ( 1 − ∑_{j=1}^k pj )^r      (6.1)

for x = (x1, . . . , xk)^t ∈ N0^k.

The negative multinomial distribution is a member of the more general family of multivariate power series distributions introduced in Remark 2.9. Rewriting density function (6.1) as

    f_p(x) = C(p) exp{ ∑_{j=1}^k Zj(p) Tj(x) } h(x) ,    x = (x1, . . . , xk)^t ∈ N0^k ,

with Zj(p) = ln(pj), 1 ≤ j ≤ k,

    C(p) = ( 1 − ∑_{j=1}^k pj )^r ,

Tj(x) = xj, 1 ≤ j ≤ k, and

    h(x) = (r − 1 + ∑_{j=1}^k xj)! / ( (r − 1)! ∏_{j=1}^k xj! ) ,    x = (x1, . . . , xk)^t ∈ N0^k ,      (6.2)

we directly find that, for fixed r = r0 ∈ N, nm(r0, •) = {nm(r0, p) : p ∈ Θ} forms an EF with parameter space

    Θ = { (p1, . . . , pk)^t ∈ (0, 1)^k : ∑_{j=1}^k pj < 1 } .

Aiming for inferential results for the negative multinomial distribution, we turn to a minimal canonical representation of nm(r0, •).

Theorem 6.2 Let P = nm(r0, •) for some fixed r0 ∈ N as introduced above. Then, we find:

(a) ord(P) = k.

(b) The natural parameter space of P is given by

    Θ∗ = { ζ = (ζ1, . . . , ζk)^t ∈ (−∞, 0)^k : ∑_{j=1}^k e^{ζj} < 1 } .

(c) P is regular with minimal canonical representation given by the counting densities

    f∗_ζ(x) = dP∗_ζ/dμ_{N0^k}(x) = e^{ζ^t T(x) − κ(ζ)} h(x) ,    x ∈ N0^k ,      (6.3)

for ζ ∈ Θ∗, where T(x) = x, x ∈ N0^k, h is as in formula (6.2), and the cumulant function is given by

    κ(ζ) = −r0 ln( 1 − ∑_{j=1}^k e^{ζj} ) ,    ζ = (ζ1, . . . , ζk)^t ∈ Θ∗ .

Proof

(a) Let Z = (Z1, . . . , Zk)^t and T = (T1, . . . , Tk)^t with Zj(p) = ln(pj), p ∈ Θ, and Tj(x) = xj, x ∈ N0^k, for 1 ≤ j ≤ k. Moreover, let ej denote the j-th unit vector in R^k, 1 ≤ j ≤ k. Since T(ej) − T(0) = ej, 1 ≤ j ≤ k, are linearly independent in R^k, T1, . . . , Tk are affinely independent by Lemma 2.22. Setting p^(0) = (exp{−k}, . . . , exp{−k})^t and p^(j) = (p1^(j), . . . , pk^(j))^t with pi^(j) = exp{−k}, i ≠ j, and pj^(j) = exp{−k − 1} for 1 ≤ i, j ≤ k, we find that Z(p^(j)) − Z(p^(0)) = −ej, 1 ≤ j ≤ k, are linearly independent in R^k. Again, by Lemma 2.22, Z1, . . . , Zk are affinely independent. The assertion then follows from Theorem 2.23.

(b) Let ζ = (ζ1, . . . , ζk)^t ∈ R^k. Then,

    ∑_{x∈N0^k} exp{ ∑_{j=1}^k ζj xj } h(x)
        = ∑_{n=0}^∞ ∑_{x∈N0^k : x1+···+xk = n} exp{ ∑_{j=1}^k ζj xj } h(x)
        = ∑_{n=0}^∞ [ (n + r0 − 1)! / ((r0 − 1)! n!) ] ∑_{x∈N0^k : x1+···+xk = n} [ n! / ∏_{j=1}^k xj! ] ∏_{j=1}^k (e^{ζj})^{xj}
        = ∑_{n=0}^∞ binom(n + r0 − 1, r0 − 1) ( ∑_{j=1}^k e^{ζj} )^n ,

where the last equation follows from the multinomial theorem. By d'Alembert's criterion, this series converges if ∑_{j=1}^k exp{ζj} < 1 and diverges if ∑_{j=1}^k exp{ζj} ≥ 1. This gives Θ∗ as stated.

(c) From assertion (b), we have that Z(Θ) = Θ∗ and Θ∗ is open, such that P is regular. By setting ζj = ln(pj), 1 ≤ j ≤ k, in density function f_p, the canonical representation is obtained, which is minimal according to statement (a). □

Utilizing this finding, the mean vector and covariance matrix of the negative multinomial distribution are directly obtained.

Theorem 6.3 Let P = nm(r0, •) for some fixed r0 ∈ N with minimal canonical representation (6.3). Then, the mean vector and the covariance matrix of X ∼ P∗_ζ are given by

    Eζ(X) = ( r0 e^{ζ1}/(1 − ∑_{j=1}^k e^{ζj}) , . . . , r0 e^{ζk}/(1 − ∑_{j=1}^k e^{ζj}) )^t      (6.4)

and

    Covζ(Xi, Xj) = [ r0/(1 − ∑_{l=1}^k e^{ζl}) ] [ e^{ζi+ζj}/(1 − ∑_{l=1}^k e^{ζl}) + δij e^{ζi} ] ,    1 ≤ i, j ≤ k ,      (6.5)

where δij is the Kronecker delta, i.e., δij = 1 for i = j, and δij = 0 for i ≠ j.

Proof Applying Corollary 2.50, we have Eζ(X) = Eζ[T] = ∇κ(ζ) and Covζ(X) = Covζ(T) = Hκ(ζ) for ζ ∈ Θ∗, which leads to the stated formulas. □

In terms of the original parameters p1, . . . , pk, the mean vector and covariance matrix of nm(r0, p) can be obtained by setting ζj = ln(pj), 1 ≤ j ≤ k, in formulas (6.4) and (6.5). Noticing that the one-dimensional marginals are negative binomial distributions (see [28, Chapter 36]), the expressions in Theorem 6.3 for one-dimensional marginal expectations and variances are known from Example 2.51. More precisely, the parameters of the j-th marginal distribution are r0 and (1 − ∑_{i=1}^k pi)/(1 − ∑_{i=1, i≠j}^k pi).

As two by-products of Theorem 6.3, the mean value function of P turns out to be

    π(ζ) = ( r0 e^{ζ1}/(1 − ∑_{j=1}^k e^{ζj}) , . . . , r0 e^{ζk}/(1 − ∑_{j=1}^k e^{ζj}) )^t ,    ζ ∈ Θ∗ ,      (6.6)

and the Fisher information matrix of P at ζ ∈ Θ∗ is given by

    [I(ζ)]_{i,j} = [ r0/(1 − ∑_{l=1}^k e^{ζl}) ] [ e^{ζi+ζj}/(1 − ∑_{l=1}^k e^{ζl}) + δij e^{ζi} ] ,    1 ≤ i, j ≤ k      (6.7)

(see Theorem 2.56 and Theorem 3.29).

Let us now develop inferential results for the negative multinomial distribution. For simplicity, we restrict ourselves to the case of a single observation from P∗_ζ. First, maximum likelihood estimation of the natural parameters ζ1, . . . , ζk is considered.

Theorem 6.4 Let P = nm(r0, •) for some fixed r0 ∈ N with minimal canonical representation (6.3), and let x ∈ N0^k be a realization of X ∼ P∗_ζ for some ζ ∈ Θ∗. Then, an ML estimate of ζ in Θ∗ based on x exists if and only if x ∈ (0, ∞)^k. If it exists, it is uniquely determined and given by

    ( ln( x1/(r0 + ∑_{i=1}^k xi) ) , . . . , ln( xk/(r0 + ∑_{i=1}^k xi) ) )^t .

Proof From Theorem 4.3, we obtain that an ML estimate of ζ in Θ∗ based on x exists if and only if x ∈ int(M), where M = [0, ∞)^k is the convex support of X. Hence, the first statement is clear. Now, let us compute the ML estimate π^{-1}(x) of ζ for x ∈ (0, ∞)^k, where π^{-1} is the inverse function of π : Θ∗ → (0, ∞)^k in formula (6.6). Suppose that π(ζ) = x for some ζ ∈ Θ∗. Summing up all k equations of this system then gives

    r0 ∑_{j=1}^k e^{ζj} / ( 1 − ∑_{i=1}^k e^{ζi} ) = ∑_{j=1}^k xj ,

which is equivalent to

    ∑_{i=1}^k e^{ζi} = ∑_{j=1}^k xj / ( r0 + ∑_{j=1}^k xj ) .

Thus, for 1 ≤ j ≤ k, since r0 exp{ζj}/(1 − ∑_{i=1}^k exp{ζi}) = xj, we find

    ζj = ln( (xj/r0) ( 1 − ∑_{i=1}^k e^{ζi} ) ) = ln( xj/(r0 + ∑_{i=1}^k xi) ) . □


In particular, Theorem 6.4 yields that the probability of non-existence of an ML estimate based on a single observation of X is positive, since

    Pζ(X ∉ (0, ∞)^k) ≥ Pζ(X = 0) = ( 1 − ∑_{j=1}^k e^{ζj} )^{r0} > 0 .

If an estimator for the mean vector of nm(r0, p) is desired, then one may choose X itself.

Theorem 6.5 Under the assumptions of Theorem 6.4, X is an efficient estimator of π, i.e., its covariance matrix attains the lower bound of the Cramér-Rao inequality given by

    [I(ζ)^{-1}]_{i,j} = [ (1 − ∑_{l=1}^k e^{ζl}) / r0 ] ( e^{−ζi} δij − 1 ) ,    1 ≤ i, j ≤ k .

Proof X is an efficient estimator of π by Theorem 4.26. To find the inverse of I(ζ) in formula (6.7), note that

    I(ζ) = (r0/a) ( P + p p^t / a )

for the diagonal matrix P with positive diagonal entries pj = e^{ζj}, 1 ≤ j ≤ k, the positive number a = 1 − ∑_{j=1}^k pj, and the vector p = (p1, . . . , pk)^t. A well-known matrix identity (see, e.g., [45, p. 458]) then yields that

    I(ζ)^{-1} = (a/r0) [ P^{-1} − P^{-1} p ( a + p^t P^{-1} p )^{-1} p^t P^{-1} ]
              = (a/r0) [ P^{-1} − (1, . . . , 1)^t (1, . . . , 1) ] ,

noticing that p^t P^{-1} p = ∑_{j=1}^k pj. Inserting for p1, . . . , pk and a completes the proof. □
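Returning to the ML estimate of Theorem 6.4: since ζj = ln(pj), the invariance of maximum likelihood estimation under this one-to-one reparametrization also yields the estimate p̂j = xj/(r0 + ∑_{i=1}^k xi) of pj. A minimal computational sketch (Python; the function and its name are illustrative and not part of the text):

```python
# Sketch: closed-form ML estimate of Theorem 6.4 from a single observation x
# of nm(r0, p), together with the induced estimate of p.
import numpy as np

def ml_estimate_nm(x, r0):
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("an ML estimate exists only if all components of x are positive")
    zeta_hat = np.log(x / (r0 + x.sum()))   # natural parameters zeta_j
    p_hat = np.exp(zeta_hat)                # original parameters p_j = exp(zeta_j)
    return zeta_hat, p_hat

print(ml_estimate_nm([3, 1, 4], r0=2))
```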

Next, suppose we are interested in statistically confirming that the probability p1 of outcome O1 in a single experiment exceeds a certain threshold p0 ∈ (0, 1), say. The corresponding test problem is

    H0 : p1 ≤ p0   versus   H1 : p1 > p0      (6.8)

or, equivalently and in terms of the natural parameter ζ1,

    H0 : ζ1 ≤ ζ0   versus   H1 : ζ1 > ζ0      (6.9)

with ζ0 = ln(p0). An optimal test for these test problems can be obtained from Theorem 5.12(a) by using that the conditional distribution of X1 given (X2, . . . , Xk) = y is a negative binomial distribution with respective parameters.

Theorem 6.6 Let the assumptions of Theorem 6.4 be given, and let α ∈ (0, 1). Then, for test problems (6.8) and (6.9),

    ϕ(x) = 1_{(c(y), ∞)}(x1) + γ(y) 1_{{c(y)}}(x1) ,    x = (x1, y) ∈ N0^k ,

forms a UMPU level-α test, where, for y = (x2, . . . , xk) ∈ N0^{k−1}, c(y) is the minimum number m ∈ N0 satisfying

    ∑_{i=m+1}^∞ g_y(i) ≤ α ,

and

    γ(y) = ( α − ∑_{i=c(y)+1}^∞ g_y(i) ) / g_y(c(y)) ,

where g_y denotes the counting density of nb(r0 + ∑_{i=2}^k xi, 1 − e^{ζ0}) as defined in Example 2.5.
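For a numerical illustration, the constants c(y) and γ(y) can be computed directly from the conditional negative binomial distribution. The following sketch (Python with SciPy) assumes that nb(r, p) from Example 2.5 has the same parametrization as scipy.stats.nbinom; function names are illustrative:

```python
# Sketch: constants c(y) and gamma(y) of Theorem 6.6 via the conditional
# negative binomial distribution of X1 given (X2, ..., Xk) = y at the boundary.
import numpy as np
from scipy.stats import nbinom

def umpu_constants(y, r0, p0, alpha):
    s = r0 + int(np.sum(y))              # parameter r0 + sum_{i >= 2} x_i
    cond = nbinom(s, 1.0 - p0)           # conditional distribution of X1 under zeta_1 = ln(p0)
    c = int(cond.ppf(1.0 - alpha))       # smallest m with P(X1 > m) <= alpha
    while cond.sf(c) > alpha:            # sf(m) = P(X1 > m); guard against rounding
        c += 1
    gamma = (alpha - cond.sf(c)) / cond.pmf(c)
    return c, gamma

def umpu_test(x, r0, p0, alpha=0.05, rng=None):
    x1, y = x[0], x[1:]
    c, gamma = umpu_constants(y, r0, p0, alpha)
    if x1 > c:
        return 1
    if x1 == c:
        if rng is None:
            rng = np.random.default_rng()
        return int(rng.random() < gamma)   # randomization on the boundary value
    return 0

print(umpu_test([7, 1, 2], r0=3, p0=0.3))
```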

Let us now consider the multivariate test problem

    H0 : p = q   versus   H1 : p ≠ q      (6.10)

with a simple null hypothesis and fixed q = (q1, . . . , qk)^t ∈ Θ or, equivalently and in terms of the natural parameter ζ,

    H0 : ζ = η   versus   H1 : ζ ≠ η      (6.11)

with η = (η1, . . . , ηk)^t ∈ Θ∗ defined by ηj = ln(qj), 1 ≤ j ≤ k. Since the ML estimate of ζ in Θ∗ based on X fails to exist with positive probability, we focus on the Rao score test, which, in contrast to the LR test and the Wald test, does not depend on the MLE in Θ∗.

Theorem 6.7 Under the assumptions of Theorem 6.4, the test statistic of the Rao score test for test problems (6.10) and (6.11) is given by

    R(x) = [ (1 − ∑_{i=1}^k e^{ηi}) / r0 ] { ∑_{j=1}^k e^{−ηj} ( xj − r0 e^{ηj}/(1 − ∑_{i=1}^k e^{ηi}) )² − [ ∑_{j=1}^k ( xj − r0 e^{ηj}/(1 − ∑_{i=1}^k e^{ηi}) ) ]² }

for x = (x1, . . . , xk)^t ∈ N0^k.

Proof By using the same notation as in the proof of Theorem 6.5, we have

    U_ζ^t I(ζ)^{-1} U_ζ = (a/r0) [ U_ζ^t P^{-1} U_ζ − ( (1, . . . , 1) U_ζ )² ] .

Moreover, we conclude from Theorem 3.29 together with formula (6.4) that the score statistic at ζ is given by U_ζ(x) = x − r0 p/a, x ∈ N0^k. Inserting for a, p1, . . . , pk, and replacing ζ by η then gives the representation for R defined in Remark 5.17. □

The Rao score test then rejects the null hypothesis if R in Theorem 6.7 exceeds a certain critical value. Since X is discrete, so is R, and guaranteeing that the test has exact level α ∈ (0, 1) therefore requires a randomized test based on R, in general. A non-randomized level-α test, which rejects H0 if and only if R exceeds the critical value, can be obtained if the critical value is determined as the (1 − α)-quantile r_{1−α}, say, of R under H0, i.e., if ζ = η is true. By doing so, the exact level of the test may be smaller than α, in the case of which the test is also called conservative. To compute a numerical value for r_{1−α}, Monte-Carlo simulation can be used, where a sufficiently large number of realizations of R are generated under the assumption that ζ = η, and then the empirical (1 − α)-quantile of the resulting data vector is chosen for r_{1−α}. Here, note that the convergence of the empirical (1 − α)-quantile to the true (1 − α)-quantile for increasing simulation size is only guaranteed if the quantile function of R is continuous at 1 − α (see, e.g., [58, pp. 304/305]). Finally, inverting the acceptance region of the test also yields a confidence region for ζ, i.e.,

    C(x) = { ζ ∈ Θ∗ : R(x) ≤ r_{1−α} } ,    x ∈ X ,

with (at least) level 1 − α. A confidence region for the original parameter p with (at least) level 1 − α is then given by

    C̃(x) = { (e^{ζ1}, . . . , e^{ζk})^t : (ζ1, . . . , ζk)^t ∈ C(x) } ,    x ∈ X .
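A Monte-Carlo approximation of r_{1−α} might be sketched as follows (Python; the decomposition of nm(r0, p) into a negative binomial total and a multinomial split is used only to simulate observations, and all names are illustrative):

```python
# Sketch: Monte Carlo approximation of the critical value r_{1-alpha} for the
# Rao score test of H0: zeta = eta based on a single observation from nm(r0, p).
import numpy as np

def rao_score(x, eta, r0):
    p = np.exp(np.asarray(eta, dtype=float))
    a = 1.0 - p.sum()
    u = np.asarray(x, dtype=float) - r0 * p / a            # score statistic at eta
    return (a / r0) * (np.sum(u**2 / p) - np.sum(u)**2)

def sample_nm(rng, r0, p):
    p = np.asarray(p, dtype=float)
    n_other = rng.negative_binomial(r0, 1.0 - p.sum())     # outcomes O_1,...,O_k in total
    return rng.multinomial(n_other, p / p.sum())           # split them over O_1,...,O_k

def mc_critical_value(eta, r0, alpha=0.05, reps=20_000, seed=0):
    rng = np.random.default_rng(seed)
    p = np.exp(np.asarray(eta, dtype=float))
    stats = np.array([rao_score(sample_nm(rng, r0, p), eta, r0) for _ in range(reps)])
    return np.quantile(stats, 1.0 - alpha)

eta = np.log([0.2, 0.3])
print(mc_critical_value(eta, r0=3, alpha=0.05))
```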


6.2 Dirichlet Distribution

The Dirichlet distribution is a multivariate extension of the beta distribution, and it is widely used in the literature (see, e.g., [37, Chapter 49] and [47]).

Definition 6.8 The Dirichlet distribution D(β) with parameter vector β = (β1, . . . , β_{k+1})^t ∈ (0, ∞)^{k+1}, k ∈ N, is defined as the distribution on the Borel sets of X = {(y1, . . . , yk)^t ∈ (0, 1)^k : ∑_{i=1}^k yi < 1} with Lebesgue density function

    fβ(x) = (1/B(β)) ( ∏_{j=1}^k xj^{βj − 1} ) ( 1 − ∑_{j=1}^k xj )^{β_{k+1} − 1}      (6.12)

for x = (x1, . . . , xk)^t ∈ X, where B denotes the beta function, i.e.,

    B(β) = ∏_{i=1}^{k+1} Γ(βi) / Γ( ∑_{i=1}^{k+1} βi ) ,    β ∈ (0, ∞)^{k+1} .      (6.13)

By setting k = 1 in Definition 6.8, D(β1, β2) coincides with the beta distribution B(β1, β2), β1, β2 ∈ (0, ∞). According to Example 2.13, we may rewrite density function (6.12) as

    fβ(x) = (1/B(β)) exp{ ∑_{j=1}^{k+1} βj Tj(x) } h(x)      (6.14)

with statistics

    Tj(x) = ln(xj) , 1 ≤ j ≤ k ,    T_{k+1}(x) = ln( 1 − ∑_{i=1}^k xi ) ,      (6.15)

and function

    h(x) = [ ( 1 − ∑_{j=1}^k xj ) ∏_{j=1}^k xj ]^{-1}      (6.16)

6 Exemplary Multivariate Applications

for x ∈ X. Hence, P = D(•) = {D(β) : β ∈ (0, ∞)k+1 } forms an EF. Note that formula (6.14) is already a canonical representation of P. Some properties of D(•) are stated in the following theorem, where the proof of assertion (a) is technical and therefore omitted. Theorem 6.9 For P = D(•) as introduced above, we find: (a) ord(P) = k + 1. (b) The natural parameter space of P is given by ∗ = (0, ∞)k+1 . (c) P is regular with minimal canonical representation given by the Lebesgue density functions fβ (x) =

dPβ t (x) = eβ T (x)−κ(β) h(x) , dλX

x ∈ X,

(6.17)

for β ∈ (0, ∞)k+1 , where T = (T1 , . . . , Tk+1 )t with Tj , 1 ≤ j ≤ k + 1, as in formula (6.15), h as in formula (6.16), and cumulant function κ = ln(B) on (0, ∞)k+1 with B as in formula (6.13). Proof (b) Evidently, we have (0, ∞)k+1 ⊂ ∗ . Let β2 , . . . , βk+1 ∈ [0, ∞) and note that h(x) ≥ 1/x1 for x ∈ X. By introducing the k-dimensional rectangle Y = I × J k−1 ⊂ X with I = (0, 1/(k + 1)) and J = (1/(k + 2), 1/(k + 1)), we have for x ∈ Y that

exp

⎧ k+1 ⎨ ⎩

j =2

⎫ ⎛ ⎞⎛ ⎞βk+1 k k ⎬   1 β βj Tj (x) h(x) ≥ ⎝ xj j ⎠ ⎝1 − xj ⎠ ⎭ x1 j =2

j =1



⎞ β j  βk+1 k   d 1 1 k ⎠ ⎝ ≥ = , 1− k+2 k+1 x1 x1 j =2

say, where d > 0. Hence,  X

exp

⎧ k+1 ⎨ ⎩

j =2

⎫ ⎬ βj Tj



 h dλX ≥

Y

exp

 = d

⎧ k+1 ⎨ ⎩

j =2

⎫ ⎬ βj Tj



1 1 − k+1 k+2

 h dλX ≥ d

k−1 

1/(k+1) 0

Y

1 dλX (x) x1

1 dx1 = ∞ . x1

Hence, (0, β2 , . . . , βk+1 )t ∈ / ∗ for β2 , . . . , βk+1 ∈ [0, ∞). For symmetry reasons, we even have that (β1 , . . . , βk+1 )t ∈ / ∗ for βi ∈ [0, ∞), 1 ≤ i ≤ k+1,

6.2 Dirichlet Distribution

127

if βj = 0 for some j ∈ {1, . . . , k + 1}. Since ∗ is convex by Theorem 2.34, it follows that ∗ = (0, ∞)k+1 . (c) The assertion follows directly from (a) and (b). 

The mean vector and the covariance matrix of T are now readily derived. Theorem 6.10 Let P = D(•) with minimal canonical representation (6.17). Then, the mean vector and the covariance matrix of T under Pβ are given by  t Eβ (T ) = ψ(β1 ) − ψ(β • ) , . . . , ψ(βk+1 ) − ψ(β • ) and

Covβ (Ti , Tj ) = ψ  (βi ) δij − ψ  (β • ) ,

1 ≤ i, j ≤ k + 1 ,

where ψ = [ln( )] =  / denotes the digamma function, δij is the Kronecker  delta, i.e., δij = 1 for i = j , and δij = 0 for i = j , and the notation β • = k+1 i=1 βi is used for β ∈ (0, ∞)k+1 . Proof Applying Corollary 2.50, we have for 1 ≤ j ≤ k + 1 that Eβ (Tj ) = ∂/∂βj κ(β) = [∂/∂βj B(β)]/B(β) and Covβ (T ) = Hκ (β), from which the assertions follow by elementary calculus. 

As two by-products of Theorem 6.10, the mean value function of P turns out to be t  π(β) = ψ(β1 ) − ψ(β • ) , . . . , ψ(βk+1 ) − ψ(β • ) ,

β ∈ (0, ∞)k+1 , (6.18)

and the Fisher information matrix of P at β ∈ (0, ∞)k+1 is given by [I(β)]i,j = ψ  (βi ) δij − ψ  (β • ) ,

1 ≤ i, j ≤ k + 1

(6.19)

(see Theorems 2.56 and 3.29).

Now, we turn to inferential results for the Dirichlet distribution based on an iid sample from Pβ. We start with maximum likelihood estimation of the natural parameters β1, . . . , β_{k+1}. To apply the results of Sect. 4.1, note first that the boundary of the convex support of T is nothing but the support {T(x) : x ∈ X} of T, i.e.,

    ∂M = { ( ln(x1), . . . , ln(xk), ln(1 − ∑_{j=1}^k xj) )^t : x ∈ (0, 1)^k , ∑_{j=1}^k xj < 1 } .      (6.20)

Theorem 4.3 then yields that an ML estimate based on a single observation x from Pβ does not exist, since T(x) ∈ ∂M Pβ-a.s. However, for a sample size larger than 1, we have the following theorem (see also [20, pp. 150–152] for the particular case of the beta distribution).


Theorem 6.11 Let P = D(•) with minimal canonical representation (6.17), and let x^(1), . . . , x^(n) ∈ X be realizations of iid random vectors X^(1), . . . , X^(n) with distribution Pβ for some β ∈ (0, ∞)^{k+1}. If n ≥ 2 and x^(1), . . . , x^(n) are not all identical, then an ML estimate of β in (0, ∞)^{k+1} based on x^(1), . . . , x^(n) uniquely exists and is given by the only solution of the equations

    ψ(βj) − ψ(β•) = (1/n) ∑_{i=1}^n ln(xj^(i)) ,    1 ≤ j ≤ k ,

and

    ψ(β_{k+1}) − ψ(β•) = (1/n) ∑_{i=1}^n ln( 1 − ∑_{l=1}^k xl^(i) )

with respect to β ∈ (0, ∞)^{k+1}. Moreover, the MLE β̂^(n) of β in (0, ∞)^{k+1} based on X^(1), . . . , X^(n) uniquely exists (a.s.) for all n ≥ 2.

Proof Let x^(1), . . . , x^(n) ∈ X with x^(i) = (x1^(i), . . . , xk^(i))^t, 1 ≤ i ≤ n, be not all identical. By Corollary 4.5 and Remark 4.4, we have to show that

    (1/n) ∑_{i=1}^n T(x^(i)) ∉ ∂M ,

where ∂M is given by formula (6.20). For this, suppose that there exists some y = (y1, . . . , yk)^t ∈ X with

    (1/n) ∑_{i=1}^n Tj(x^(i)) = (1/n) ∑_{i=1}^n ln(xj^(i)) = ln(yj) ,    1 ≤ j ≤ k .

Then, two-fold application of the inequality of arithmetic and geometric means yields that

    (1/n) ∑_{i=1}^n T_{k+1}(x^(i)) = (1/n) ∑_{i=1}^n ln( 1 − ∑_{j=1}^k xj^(i) ) = ln( [ ∏_{i=1}^n ( 1 − ∑_{j=1}^k xj^(i) ) ]^{1/n} )
        ≤ ln( (1/n) ∑_{i=1}^n [ 1 − ∑_{j=1}^k xj^(i) ] ) = ln( 1 − ∑_{j=1}^k (1/n) ∑_{i=1}^n xj^(i) )
        < ln( 1 − ∑_{j=1}^k [ ∏_{i=1}^n xj^(i) ]^{1/n} ) = ln( 1 − ∑_{j=1}^k yj ) ,

i.e., ∑_{i=1}^n T(x^(i))/n ∉ ∂M, and an ML estimate of β based on x^(1), . . . , x^(n) uniquely exists. Since x^(1) = · · · = x^(n) occurs with Pβ-probability 0, the proof is done. □

For n ≥ 2, the MLE β̂^(n) of β in Theorem 6.11 cannot be stated explicitly, and an ML estimate based on (non-identical) realizations x^(1), . . . , x^(n) has to be obtained via numerical procedures (see [37, pp. 505/506] and also [27, pp. 223–225] in case of the beta distribution). In contrast, we have the explicit MLE ∑_{i=1}^n T(X^(i))/n of π(β) based on X^(1), . . . , X^(n), which is moreover efficient by Corollary 4.27. Since the MLE of the natural parameter β is not available in explicit form, we cannot derive its exact distribution analytically, which however is required, e.g., for the construction of statistical tests and confidence regions for β. We therefore now assume to have a sequence of iid random vectors following a Dirichlet distribution and focus on asymptotic inferential results.
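The likelihood equations of Theorem 6.11 can be solved numerically, for instance as in the following sketch (Python with SciPy; the reparametrization β = exp(θ), which merely keeps the iterates positive, is an implementation choice and not part of the text):

```python
# Sketch: numerical ML estimation of the Dirichlet parameters via the
# likelihood equations of Theorem 6.11.
import numpy as np
from scipy.special import digamma
from scipy.optimize import root

def dirichlet_mle(x, start=None):
    """x: (n, k) array of observations in the open simplex; returns beta_hat of length k + 1."""
    x = np.asarray(x, dtype=float)
    t_bar = np.column_stack([np.log(x),
                             np.log(1.0 - x.sum(axis=1, keepdims=True))]).mean(axis=0)

    def equations(theta):                  # beta = exp(theta) keeps beta > 0
        beta = np.exp(theta)
        return digamma(beta) - digamma(beta.sum()) - t_bar

    theta0 = np.zeros(t_bar.size) if start is None else np.log(start)
    return np.exp(root(equations, theta0).x)

rng = np.random.default_rng(3)
data = rng.dirichlet([2.0, 3.0, 5.0], size=200)[:, :2]   # keep the first k = 2 coordinates
print(dirichlet_mle(data))
```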

Theorem 6.12 Let P = D(•) with minimal canonical representation (6.17), and let X^(1), X^(2), . . . be a sequence of iid random vectors with distribution Pβ for some β ∈ (0, ∞)^{k+1}. Then, the sequence β̂^(n), n ∈ N, n ≥ 2, of MLEs for β is strongly consistent and asymptotically efficient for β, i.e., we have

    β̂^(n) → β   Pβ-a.s.   and   √n (β̂^(n) − β) →_d N_{k+1}(0, I(β)^{-1})

for n → ∞, where the inverse of the Fisher information matrix at β is given by

    [I(β)^{-1}]_{i,j} = δij/ψ'(βi) − [ ∑_{l=1}^{k+1} 1/ψ'(βl) − 1/ψ'(β•) ]^{-1} ( 1/(ψ'(βi) ψ'(βj)) )

for 1 ≤ i, j ≤ k + 1.

Proof Strong consistency and asymptotic efficiency of the sequence β̂^(n), n ≥ 2, are obtained from Theorems 4.31, 4.34, and Remark 4.35. To obtain the inverse of I(β) given by formula (6.19), note that

    I(β) = D + b v v^t

for the diagonal matrix D with diagonal entries dj = ψ'(βj) > 0, 1 ≤ j ≤ k + 1, real number b = −ψ'(β•) < 0, and vector v = (1, . . . , 1)^t ∈ R^{k+1}. Applying a commonly known matrix identity (see, e.g., [45, p. 458]) and setting w = (1/d1, . . . , 1/d_{k+1})^t then yields

    (D + b v v^t)^{-1} = D^{-1} − D^{-1} v ( 1/b + v^t D^{-1} v )^{-1} v^t D^{-1}
                       = D^{-1} − ( 1/b + ∑_{l=1}^{k+1} 1/dl )^{-1} w w^t .

Inserting for d1, . . . , d_{k+1} and b then gives the representation. □


Now, let us consider the multivariate test problem

    H0 : β = η   versus   H1 : β ≠ η      (6.21)

for some fixed η = (η1, . . . , η_{k+1})^t ∈ (0, ∞)^{k+1}.

Theorem 6.13 Under the assumptions of Theorem 6.12, the test statistics Λn = Λn(X^(1), . . . , X^(n)) and Wn = Wn(X^(1), . . . , X^(n)) of the LR test and the Wald test (with g(β) = β − η, β ∈ (0, ∞)^{k+1}) for test problem (6.21) based on X^(1), . . . , X^(n), n ≥ 2, are given by

    Λn = 2n [ ln( B(η)/B(β̂^(n)) ) + ∑_{j=1}^{k+1} (β̂j^(n) − ηj) ( ψ(β̂j^(n)) − ψ(β̂•^(n)) ) ]

and

    Wn = n [ ∑_{j=1}^{k+1} ψ'(β̂j^(n)) (β̂j^(n) − ηj)² − ψ'(β̂•^(n)) (β̂•^(n) − η•)² ] .

Under H0, both test statistics are asymptotically χ²(k + 1)-distributed for n → ∞.

Proof From Theorem 5.19 and Remark 5.21, Λn and Wn both have an asymptotic χ²(k + 1)-distribution if H0 is true, such that we only have to verify the representations of the test statistics. From Theorem 5.16 and Lemma 3.37, we obtain that

    Λn = 2n D_KL(β̂^(n), η) = 2n [ κ(η) − κ(β̂^(n)) + (β̂^(n) − η)^t π(β̂^(n)) ] ,

and the representation follows from κ = ln(B) and formula (6.18). Here, note that n D_KL is the Kullback-Leibler divergence corresponding to the product family P^(n). To obtain the formula for Wn as introduced in Remark 5.17, note first that by Theorem 3.29 the Fisher information matrix at β for the product family P^(n) is given by n I(β). Hence, by using the same notation as in the proof of Theorem 6.12, we find

    (β − η)^t [n I(β)] (β − η) = n [ (β − η)^t D (β − η) + b ((β − η)^t v)² ] .      (6.22)

Inserting for d1, . . . , d_{k+1} and b and finally setting β = β̂^(n) in formula (6.22) leads to the representation for Wn. □

For α ∈ (0, 1), the LR test and the Wald test with asymptotic level α then reject null hypothesis (6.21) if and only if the respective test statistic exceeds the critical value χ²_{1−α}(k + 1), i.e., the (1 − α)-quantile of χ²(k + 1). For fixed n ∈ N, the actual level of every test can be obtained by Monte-Carlo simulation. Here, a sufficiently large number of realizations of the test statistic under H0, i.e., under the assumption that β = η is true, are generated. The actual level of the test is then estimated by the relative frequency of exceeding χ²_{1−α}(k + 1), i.e., of (falsely) rejecting H0.
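For a given ML estimate β̂^(n) and hypothesized η, the two statistics of Theorem 6.13 can be evaluated as in the following sketch (Python with SciPy; β̂^(n) would be obtained via the numerical ML step sketched after Theorem 6.11, and all names are illustrative):

```python
# Sketch: LR and Wald statistics of Theorem 6.13 for the Dirichlet distribution.
import numpy as np
from scipy.special import gammaln, digamma, polygamma

def log_beta(beta):
    # ln B(beta) = sum ln Gamma(beta_i) - ln Gamma(sum beta_i)
    return np.sum(gammaln(beta)) - gammaln(np.sum(beta))

def lr_and_wald(beta_hat, eta, n):
    beta_hat, eta = np.asarray(beta_hat, float), np.asarray(eta, float)
    diff = beta_hat - eta
    pi_hat = digamma(beta_hat) - digamma(beta_hat.sum())          # mean value function at beta_hat
    lam = 2 * n * (log_beta(eta) - log_beta(beta_hat) + diff @ pi_hat)
    wald = n * (np.sum(polygamma(1, beta_hat) * diff**2)
                - polygamma(1, beta_hat.sum()) * diff.sum()**2)
    return lam, wald

print(lr_and_wald([2.1, 2.9, 5.3], [2.0, 3.0, 5.0], n=200))
```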


Since the above tests are non-randomized, they may also serve to construct confidence regions for β, which asymptotically meet a desired confidence level. For instance, in case of the LR test,

    Cn(x̃) = { β ∈ (0, ∞)^{k+1} : Λn(x̃) ≤ χ²_{1−α}(k + 1) } ,    x̃ ∈ X^n ,

forms a confidence region for β with asymptotic level 1 − α (for n → ∞).

6.3 Generalized Order Statistics

Generalized order statistics have been introduced as a general semi-parametric model for ordered random variables, which contains, for instance, common order statistics, record values and k-record values, sequential order statistics, Pfeifer record values, and progressively type-II censored order statistics (see [31, 32]). What is remarkable here is the fact that joint density functions of generalized order statistics with a fixed but arbitrarily chosen absolutely continuous baseline distribution function F (see formula (6.23)) lead to a multiparameter EF, the support of which is a cone of R^n.

Definition 6.14 Let F denote a distribution function with Lebesgue density function f and quantile function F^{-1}, and let γ1, . . . , γn be positive parameters. Then, the random variables X1, . . . , Xn are called generalized order statistics based on F and γ1, . . . , γn if their joint density function is given by

    g_{F,γ}(x) = ( ∏_{j=1}^n γj ) ( ∏_{j=1}^{n−1} (1 − F(xj))^{γj − γ_{j+1} − 1} f(xj) ) (1 − F(xn))^{γn − 1} f(xn)      (6.23)

for x ∈ X = { (x1, . . . , xn)^t ∈ R^n : F^{-1}(0+) < x1 < · · · < xn < F^{-1}(1) }.

[...] > η for some fixed η ∈ (0, ∞) is given by

    ϕ(x̃) = 1_{(0, χ²_α(2s))}( −2η T1^(s)(x̃) ) ,    x̃ ∈ X^s

(see [7]). A UMPU test for the two-sided test problem with null hypothesis H0 : γ1 = η can be obtained by applying Theorem 5.13, leading to a similar situation as in Example 5.9. The corresponding confidence intervals for γ1 with minimum coverage probabilities of false parameters are shown in [12]. Multivariate test problems with a simple and composite null hypothesis have been studied in the literature. Tests for homogeneity of the parameters including the LR test are shown in [21]. In [9], the Rao score statistic and the Wald statistic have been addressed, the test statistics of which are easily computed, since the Fisher information matrix has diagonal shape (see also [13]). Moreover, confidence regions for γ are obtained in [12] by inverting the acceptance regions of tests for a simple null hypothesis. Other confidence regions are constructed in [60] via divergence measures for generalized order statistics. In particular, for P1, P2 ∈ P with parameter vectors γ^(1) = (γ1^(1), . . . , γn^(1))^t and γ^(2) = (γ1^(2), . . . , γn^(2))^t, say, the Kullback-Leibler divergence D_KL(γ^(1), γ^(2)) is determined by applying

Lemma 3.37:

    D_KL(γ^(1), γ^(2)) = κ(γ^(2)) − κ(γ^(1)) + (γ^(1) − γ^(2))^t π(γ^(1))
        = − ∑_{j=1}^n ln(γj^(2)) + ∑_{j=1}^n ln(γj^(1)) − ∑_{j=1}^n (γj^(1) − γj^(2))/γj^(1)
        = ∑_{j=1}^n [ γj^(2)/γj^(1) − ln( γj^(2)/γj^(1) ) − 1 ] .

Obviously, the latter expression depends on component-wise ratios of the parameters only. This finding can be used, e.g., to obtain an exact LR test for the test problem

    H0 : γ = γ^(0)   versus   H1 : γ ≠ γ^(0)

for fixed γ^(0) ∈ Θ∗, which has test statistic Λs = 2 D_KL(γ̂^(s), γ^(0)) by Theorem 5.16. Based on Cressie-Read divergences, multivariate tests with a simple null hypothesis are examined in [60].

An important submodel of generalized order statistics results from the general setting by assuming that successive differences of the parameters are identical, i.e.,

    γ1 − γ2 = · · · = γ_{n−1} − γn = m + 1 , say .      (6.24)

The model P0 ⊂ P thus obtained is called m-generalized order statistics, and any distribution in P0 is determined by specifying m and γn. The particular settings m = γn − 1 and m = −1 correspond to order statistics and record values, respectively. By utilizing formula (6.24) to rewrite density function g_{F,γ} in formula (6.23), P0 turns out to form a regular EF of order 2 with minimal canonical representation

    g_{F,m,γn}(x) = ( ∏_{j=1}^n [(n − j)(m + 1) + γn] ) e^{m U(x) + γn V(x)} ( ∏_{j=1}^n f(xj) ) / (1 − F(xn))

and minimal sufficient and complete statistic (U, V) defined by U(x) = ∑_{j=1}^{n−1} ln(1 − F(xj)) and V(x) = ln(1 − F(xn)) for x ∈ X. Here, the natural parameter space is given by

    Θ∗0 = { (m, γn)^t ∈ R × (0, ∞) : (n − 1)(m + 1) + γn > 0 } .

More results on this two-parameter EF concerning, in particular, existence, uniqueness, and asymptotic properties of the MLE of (m, γn)^t can be found in [10]. While the restriction to m-generalized order statistics again results in a regular EF, other assumed relations between the parameters may lead to curved EFs. This is the case, for instance, when connecting γ1, . . . , γn via the log-linear link function

    ln(γj) = τ1 + τ2 zj ,    1 ≤ j ≤ n ,

where τ1, τ2 ∈ R are link function parameters and z1, . . . , zn are known real numbers, which are not all identical. Properties of curved EFs may then be applied to establish inferential results for (τ1, τ2)^t. For this special link function as well as for other types of link functions, maximum likelihood estimation of the link function parameters is discussed in [59].

References

1. Akahira, M. (2017). Statistical estimation for truncated exponential families. Singapore: Springer. 2. Amari, S. I., Barndorff-Nielsen, O. E., Kass, R. E., Lauritzen, S. L., & Rao, C. R. (1987). Differential geometry in statistical inference. Hayward: Institute of Mathematical Statistics. 3. Bar-Lev, S. K., & Kokonendji, C. C. (2017). On the mean value parametrization of natural exponential families-A revisited review. Mathematical Methods of Statistics, 26(3), 159–175. 4. Barndorff-Nielsen, O. (2014). Information and exponential families in statistical theory. Chichester: Wiley. 5. Barndorff-Nielsen, O. E., & Cox, D. R. (1994). Inference and asymptotics. Boca Raton: Chapman & Hall. 6. Bauer, H. (2001). Measure and integration theory. Berlin: de Gruyter. 7. Bedbur, S. (2010). UMPU tests based on sequential order statistics. Journal of Statistical Planning and Inference, 140(9), 2520–2530. 8. Bedbur, S., Beutner, E., & Kamps, U. (2012). Generalized order statistics: An exponential family in model parameters. Statistics, 46(2), 159–166. 9. Bedbur, S., Beutner, E., & Kamps, U. (2014). Multivariate testing and model-checking for generalized order statistics with applications. Statistics, 48(6), 1297–1310. 10. Bedbur, S., & Kamps, U. (2017). Inference in a two-parameter generalized order statistics model. Statistics, 51(5), 1132–1142. 11. Bedbur, S., & Kamps, U. (2019). Testing for equality of parameters from different load-sharing systems. Stats, 2(1), 70–88. 12. Bedbur, S., Lennartz, J. M., & Kamps, U. (2013). Confidence regions in models of ordered data. Journal of Statistical Theory and Practice, 7(1), 59–72. 13. Bedbur, S., Müller, N., & Kamps, U. (2016). Hypotheses testing for generalized order statistics with simple order restrictions on model parameters under the alternative. Statistics, 50(4), 775– 790. 14. Berk, R. H. (1972). Consistency and asymptotic normality of MLE’s for exponential models. The Annals of Mathematical Statistics, 43(1), 193–204. 15. Bhatia, R. (1997). Matrix analysis. New York: Springer. 16. Bickel, P. J., & Doksum, K. A. (2015). Mathematical statistics: Basic ideas and selected topics (2nd ed., Vol. I). Boca Raton: Taylor & Francis. 17. Billingsley, P. (1995). Probability and measure (3rd ed.). New York: Wiley. 18. Billingsley, P. (1999). Convergence of probability measures (2nd ed.). New York: Wiley.

19. Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press. 20. Brown, L. D. (1986). Fundamentals of statistical exponential families. Hayward: Institute of Mathematical Statistics. 21. Cramer, E., & Kamps, U. (2001). Sequential k-out-of-n systems. In N. Balakrishnan & C. R. Rao (Eds.), Advances in reliability, handbook of statistics (Vol. 20, pp. 301–372). Amsterdam: Elsevier. 22. Davis, P. J. (1965). Gamma functions and related functions. In M. Abramowitz & I. A. Stegun (Eds.), Handbook of mathematical functions: With formulas, graphs, and mathematical tables (pp. 253–294). New York: US Government Printing Office, Dover. 23. Dunn, P. K., & Smyth, G. K. (2018). Generalized linear models with examples in R. New York: Springer. 24. Ferguson, T. S. (1967). Mathematical statistics: A decision theoretic approach. New York: Academic. 25. Garren, S. T. (2000). Asymptotic distribution of estimated affinity between multiparameter exponential families. Annals of the Institute of Statistical Mathematics, 52(3), 426–437. 26. Johnson, N. L., Kemp, A. W., & Kotz, S. (2005). Univariate discrete distributions (3rd ed.). New York: Wiley. 27. Johnson, N. L., Kotz, S., & Balakrishnan, N. (1995). Continuous univariate distributions (2nd ed., Vol. 2). New York: Wiley. 28. Johnson, N. L., Kotz, S., & Balakrishnan, N. (1997). Discrete multivariate distributions. New York: Wiley. 29. Jørgensen, B. (1997). The theory of dispersion models. London: Chapman & Hall. 30. Kaas, R., Goovaerts, M., Dhaene, J., & Denuit, M. (2008). Modern actuarial risk theory (2nd ed.). Berlin: Springer. 31. Kamps, U. (1995). A concept of generalized order statistics. Journal of Statistical Planning and Inference, 48(1), 1–23. 32. Kamps, U. (2016). Generalized order statistics. In N. Balakrishnan, P. Brandimarte, B. Everitt, G. Molenberghs, W. Piegorsch, & F. Ruggeri (Eds.), Wiley StatsRef: Statistics reference online (pp. 1–12). Chichester: Wiley. 33. Katzur, A., & Kamps, U. (2016). Classification into Kullback-Leibler balls in exponential families. Journal of Multivariate Analysis, 150, 75–90. 34. Katzur, A., & Kamps, U. (2016). Homogeneity testing via weighted affinity in multiparameter exponential families. Statistical Methodology, 32, 77–90. 35. Keener, R. W. (2010). Theoretical statistics: Topics for a core course. New York: Springer. 36. Kollo, T., & von Rosen, D. (2005). Advanced multivariate statistics with matrices. Dordrecht: Springer. 37. Kotz, S., Balakrishnan, N., & Johnson, N. L. (2000). Continuous multivariate distributions, models and applications (2nd ed.). New York: Wiley. 38. Krätschmer, V. (2007). The uniqueness of extremum estimation. Statistics & Probability Letters, 77(10), 942–951. 39. Lehmann, E. L., & Casella, G. (1998). Theory of point estimation (2nd ed.). New York: Springer. 40. Lehmann, E. L., & Romano, J. P. (2005). Testing statistical hypotheses (3rd ed.). New York: Springer. 41. Letac, G. (1992). Lectures on natural exponential families and their variance functions. Rio de Janeiro: Instituto de Matemática Pura e Aplicada (IMPA). 42. Liese, F., & Miescke, K. J. (2008). Statistical decision theory: Estimation, testing, and selection. New York: Springer. 43. Liese, F., & Vajda, I. (1987). Convex statistical distances. Leipzig: Teubner. 44. Lindsey, J. K. (2006). Parametric statistical inference. Oxford: Clarendon Press. 45. Mardia, K. V., Kent, J. T., & Bibby, J. M. (1979). Multivariate analysis. London: Academic. 46. Mies, F., & Bedbur, S. (2020). 
Exact semiparametric inference and model selection for load-sharing systems. IEEE Transactions on Reliability, 69(3), 863–872.

47. Ng, K. W., Tian, G. L., & Tang, M. L. (2011). Dirichlet and related distributions: Theory, methods and applications. Chichester: Wiley.
48. Pardo, L. (2006). Statistical inference based on divergence measures. Boca Raton: Chapman & Hall/CRC.
49. Pázman, A. (1986). On the uniqueness of the m.l. estimate in curved exponential families. Kybernetika, 22(2), 124–132.
50. Pfanzagl, J. (1994). Parametric statistical theory. Berlin: de Gruyter.
51. Rockafellar, R. T. (1970). Convex analysis. Princeton: Princeton University Press.
52. Rudin, W. (1976). Principles of mathematical analysis (3rd ed.). Auckland: McGraw-Hill.
53. Sen, P. K., & Singer, J. M. (1993). Large sample methods in statistics: An introduction with applications. Boca Raton: Chapman & Hall/CRC.
54. Serfling, R. J. (1980). Approximation theorems of mathematical statistics. New York: Wiley.
55. Shao, J. (2003). Mathematical statistics (2nd ed.). New York: Springer.
56. Sundberg, R. (2019). Statistical modelling by exponential families. Cambridge: Cambridge University Press.
57. Vajda, I. (1989). Theory of statistical inference and information. Dordrecht: Kluwer.
58. van der Vaart, A. W. (1998). Asymptotic statistics. Cambridge: Cambridge University Press.
59. Volovskiy, G., Bedbur, S., & Kamps, U. (2021). Link functions for parameters of sequential order statistics and curved exponential families. Probability and Mathematical Statistics, 41(1), 115–127.
60. Vuong, Q. N., Bedbur, S., & Kamps, U. (2013). Distances between models of generalized order statistics. Journal of Multivariate Analysis, 118, 24–36.
61. Wei, B. C. (1998). Exponential family nonlinear models. Singapore: Springer.
62. Wellek, S. (2010). Testing statistical hypotheses of equivalence and noninferiority (2nd ed.). Boca Raton: Chapman & Hall/CRC.
63. Witting, H. (1985). Mathematische Statistik I. Stuttgart: Teubner.

Index

A
Affinely independent, 17

C
Characteristic function, 43
Convex
  function, 29
  hull, 40
  set, 23
Cramér-Rao inequality, 79
Cumulant, 47
Cumulant generating function, 47

D
Distribution
  beta, 12
  binomial, 7
  bivariate normal, 107
  chi-square, 12
  conditional, 47
  Dirichlet, 12
  exponential, 11
  gamma, 11
  geometric, 7
  Good, 8
  inverse gamma, 12
  inverse normal, 11
  joint, 50
  Kemp discrete normal, 8
  Liouville, 12
  log-normal, 12
  marginal, 47
  multinomial, 9
  multivariate normal, 14
  multivariate power series, 9
  negative binomial, 7
  negative multinomial, 117
  normal, 10
  Poisson, 6
  support of, 13
  Wishart, 12
Divergence measure, 60
  Cressie-Read, 61
  Hellinger, 61
  Jeffrey, 61
  Kullback-Leibler, 60
  Rényi, 61

E
Entropy measure, 60
  Shannon, 60
Estimator, 65
  asymptotically efficient, 86
  consistent, 85
  efficient, 81
  maximum likelihood, 69
  strongly consistent, 85
  unbiased, 79
  uniformly minimum variance unbiased, 79
Exponential dispersion model, 53
Exponential family
  canonical representation of, 23
  cumulant function of, 29
  curved, 22
  cut of, 49
  full, 23
  generating measure of, 24
  mean value function of, 37
  mean value parametrization of, 40
  minimal representation of, 17
  natural, 24
  natural parameter space of, 23
  natural parametrization of, 23
  order of, 17
  regular, 23
  steep, 39
  truncated, 14
  variance function of, 42

F
Fisher information matrix, 58

H
Hypothesis
  alternative, 94
  of equivalence, 104
  null, 94
  one-sided, 96
  simple, 109
  two-sided, 101

L
Likelihood function, 66
Log-likelihood function, 66

M
Maximum likelihood estimate, 66
Moment generating function, 43

N
Neyman-Pearson lemma, 95

P
Parameter
  dispersion, 53
  identifiable, 21
Product measure, 49

S
Score statistic, 58
Statistic
  complete, 55
  minimal sufficient, 54
  sufficient, 54

T
Test, 93
  error probabilities of the first/second kind of, 94
  homogeneity, 109
  likelihood-ratio, 110
  non-randomized, 94
  power function of, 94
  randomized, 94
  Rao score, 111
  significance level of, 94
  unbiased, 94
  uniformly most powerful, 95
  uniformly most powerful level-α, 95
  uniformly most powerful unbiased level-α, 95
  Wald, 111