Solutions manual to accompany pattern classification [2nd. ed.]

22,228 2,181 5MB

English Pages 444 [446] Year 2001

Report DMCA / Copyright

DOWNLOAD FILE

Polecaj historie

Solutions manual to accompany pattern classification [2nd. ed.]

Citation preview

Solution Manual to accompany

Pattern Classification (2nd ed.)

David G. Stork

Solution Manual to accompany

Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork David G. Stork Copyright 2001. All rights reserved. This Manual is for the sole use of designated educators and must not be distributed to students except in short, isolated portions and in conjunction with the use of Pattern Classification (2nd ed.)

June 18, 2003

Preface In writing this Solution Manual I have learned a very impor tant lesson. As a student, I thought that the best way to master a subject was to go to a superb university and study with an established expert. Later, I realized instead that the best way was to teach a course on the sub ject. Yet later, I was convinced that the best way was to write a detailed and extensive textbook. Now I know that all these years I have been wron g: in fact the best way to master a subject is to write the Solution Manual. In solving the problems for this Manual I have been forced to confront myriad technical details that might have tripped up the unsuspecting student. Students and teachers can thank me for simplifying or screening out problems that required pages of unenlightening calculations. Occasionally I had to go back to the text and delete the word “easily” from problem references that read “it can easily be shown (Problem ...).” Throughout, I have tried to choose data or problem conditions that are particularly instructive. In solving these I have found in early drafts of this text (as well as errors in books byproblems, other authors and evenerrors in classic refereed papers), and thus the accompanying text has been improved for the writing of this Manual. I have tried to make the problem solutions self-co ntained and self-explanatory. I have gone to great lengths to ensure that the solutions are correct and clearly presented — many have been reviewed by students in several classes. Surely there are errors and typos in this manuscript, but rather than editing and rechecking these solutions over months or even years, I thought it best to distribute the Manual, however flawed, as early as possible. I accept responsib ility for these inevit able errors, and humbly ask anyone finding them to contact me directly. (Please, however, do not ask me to explain a solution or help you solv e a problem!) It should be a small matter to change the Manual for future printings, and you should contact the publisher to check that you have the most recent version. Notice, too, that this Manual contains a list of known typos and errata in the text which you might wish to photocopy and distribute to students. I have tried to be thorough in order to help students, even to the occassional fault of verbosity. You will notice that several problems have the simple “explain your answer in words” and “graph your resu lts.” These were added for student s to gain intuition and a deeper understanding. Graphing per se is hardly an intellectual challenge, but if the student graphs functions, he or she will develop intuition and remember the problem and its results b etter. Furthermore, when the student later sees graphs of data from dissertation or research work, the link to the homework problem and the material in the text will be more readily apparent. Note that due to the vagaries of automatic typesetting, figures may appear on pages after their reference in this Manual; be sure to consult the full solution to any problem. I have also included worked examples and so sample final exams with solutions to 1

2 cover material in text. I distribute a list of important equations (without descriptions) with the exam so students can focus understanding and using equations, rather than memorizing them. I also include on every final exam one problem verbatim from a homework, taken from the book. I find this motivates students to review careful ly their homework assignments, and allows somewhat more difficult problems to be included. These will be updated and expand ed; thus if you have exam questions you find particularly appropriate, and would like to share them, please send a copy (with solutions) to me. It should be noted, too, that a set of overhead transparency masters of the figures from the text are available to faculty adopt ers. I have found these to b e invaluable for lecturing, and I put a set on reserve in the library for students. The files can be accessed through a standard web browser or an ftp client program at the Wiley STM ftp area at: ftp://ftp.wiley.com/public/sci tech med/pattern/

or from a link on the Wiley Electrical Engineering software supplements page at: http://www.wiley.com/products/subject/engineering/electrical/ software supplem elec eng.html

I have taught from the text (in various stages of completion) at the University of California at Berkeley (Extension Division) and in three Departments at Stanford University: Electrical Engineering, Statistics and Computer Science. Numerous students and colleague s have made suggestions. Especially noteworthy in this regard are Sudes hna Adak, Jian An, Sung-Hyuk Cha, Koichi Ejiri, Rick Guadette, John Heumann, Travis Kopp, Yaxin Liu, Yunqian Ma, Sayan Mukherjee, Hirobumi Nishida, Erhan Oztop, Steven Rogers, Charles Roosen, Sergio Bermejo Sanchez, Godfried Mohammed Yousuf and Yu Zhong. Thanks too go to Toussaint, Dick DudaNamrata who gaveVaswani, several excellent suggestions. I would greatly appreciate notices of any errors in this Manual or the text itself. I would be especially grateful for solutions to problems not yet solved. Please send any such information to me at the below address. I will incorporate them into subsequent releases of this Manual. This Manual is for the use of educators and must not be distributed in bulk to students in any form. Short excerpts may be photocopied and distribut ed, but only in conjunction with the use of Pattern Classification (2nd ed.). I wish you all the best of luck in teaching and research.

Ricoh Innovations, Inc.

2882 Hill Suite 115 MenloSand Park, CARoad 94025-7022 USA [email protected]

David G. Stork

Contents Preface 1Introduction

5

2 Bayesiandecisiontheory Problem Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Computer Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

7 7 74

3 Maximum likelihood and Bayesian parameter estimation Problem Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 Computer Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

77

4 Nonparametrictechniques 131 Problem Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 Computer Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174 5 Linear discriminant functions 177 Problem Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177 Computer Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217 6 Multilayerneuralnetworks 219 Problem Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219 Computer Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254 7 Stochasticmethods 255 Problem Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255 Computer Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276 8 Nonmetricmethods 277 Problem Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277 Computer Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294 9 Algorithm-independent machine learning Problem Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295 Computer Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304

295

10 Unsupervised learning and clustering 305 Problem Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305 Computer Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355 Sample final exams and solutions

357 3

4 Workedexamples

CONTENTS

415

Errata and ammendations in the text 417 First and second printings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417 Fifth printing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443

Chapter 1

Introduction

Problem Solutions There are neither problems nor computer exercises in Chapter 1.

5

6

CHAPTER 1. INTRODUCTION

Chapter 2

Bayesian decision theory

Problem Solutions Section 2.1 1. Equation 7 in the text states

|

|

|

P (error x) = min[ P (ω1 x), P (ω2 x)]. (a) We assume, without loss of generality, that for a given particular x we have P (ω2 x) P (ω1 x), and thus P (error x) = P (ω1 x). We have, moreover, the normalization condition P (ω1 x) = 1 P (ω2 x). Together these imply P (ω2 x) > 1/2 or 2 P (ω2 x) > 1 and

| ≥

|

|

|

|



|

|

|

|

|

|

|

2P (ω2 x)P (ω1 x) > P (ω1 x) = P (error x). This is true at every x, and hence the integrals obey



|

|

2P (ω2 x)P (ω1 x)dx

|

|





|

P (error x)dx.

|

In short, 2 P (ω2 x)P (ω1 x) provides an upper bound for P (error x).

|

(b) From part (a), we have that P (ω2 x) > 1/2, but in the current conditions not greater than 1 /α for α < 2. Take as an example , α = 4/3 and P (ω1 x) = 0.4 and hence P (ω2 x) = 0.6. In this case, P (error x) = 0.4. Moreover, we have

|

|

|

|

|

αP (ω1 x)P (ω2 x) = 4/3

× 0.6 × 0.4 < P (error |x). P (ω |x).

This does not provide an upper bound for all values of

|

| |

1

(c) Let P (error x) = P (ω1 x). In that case, for all x we have



|

|

|

P (ω2 x)P (ω1 x) < P (ω1 x)P (error x)

|

|

P (ω2 x)P (ω1 x)dx


1,

≤1

p(x|ω1) p(x|ω2) 4 3.5 3 2.5 2 1.5 1 0.5 -2

-1

0

1

2 x

Section 2.3 3. We are are to use the standard zero-one classification cost, that is and λ 12 = λ 21 = 1.

λ 11 = λ 22 = 0

PROBLEM SOLUTIONS

9

(a) We have the priors P (ω1 ) and P (ω2 ) = 1 Eqs. 12 and 13 in the text: R(P (ω1 )) = P (ω1 )



R

− P (ω ). The Bayes risk is given b y 1

|

p(x ω1 )dx + (1

− P (ω )) 1



R

2

|

p(x ω2 )dx.

1

To obtain the prior with the minimum risk, we take the derivative with respect to P (ω1 ) and set it to 0, that is d R(P (ω1 )) = dP (ω1 )



R

|

p(x ω1 )dx



2



R

|

p(x ω2 )dx = 0,

1

which gives the desired result:



|

p(x ω1 )dx =

R



R

2

|

p(x ω2 )dx.

1

(b) This solution is not always unique, as shown in this simple count erexample. Let P (ω1 ) = P (ω2 ) = 0.5 and

|

p(x ω1 ) =



|

p(x ω2 ) =

− ≤ ≤ 0.5

1 0.5 x 0 otherwise

≤ ≤

1 0 x 1 0 otherwise.



R



It is easy to verify that the decision regions 1 = [ 0.5, 0.25] and satisfy the equations in part (a); thus the solution is not unique.

R

1

= [0, 0.5]

4. Consider the minimax criterion for a two-category classification problem. (a) The total risk is the inte gral over the two regio ns their costs: R

=



R

i

|

of the posteriors times

|

[λ11 P (ω1 )p(x ω1 ) + λ12 P (ω2 )p(x ω2 )] dx

R

1

+



|

|

[λ21 P (ω1 )p(x ω1 ) + λ22 P (ω2 )p(x ω2 )] dx.

R

2

We use find: R

|

p(x ω2 ) dx = 1 2

R





|

p(x ω2 ) dx and P (ω2 ) = 1 1

− P (ω ), regroup to 1

R

= λ22 + λ12

 | −  |  −  | −   |  |  p(x ω2 ) dx

R

λ22

p(x ω2 ) dx

R

1

1

+ P (ω1 ) (λ11

λ22 ) + λ11 p(x ω1 ) dx

R

R

2

+ λ21 p(x ω1 ) dx + λ22 p(x ω2 ) dx

R

2

R

1

|

λ12 p(x ω2 ) dx 1

10

CHAPTER 2. BAYESIAN DECISION THEORY = λ22 + (λ12

−λ

22 )



|

p(x ω2 ) dx

R

1



+P (ω1 ) (λ11

−λ

22 ) + (λ11

+ λ21 )



|

p(x ω1 ) dx

R

2

+ (λ22

−λ

|

12 )

p(x ω2 ) dx .

R





1

(b) Consider an arbitrary prior 0 < P ∗ (ω1 ) < 1, and assume the decision boundary has been set so as to achieve the minimal (Bayes) error for that prior. If one holds the same decision boundary, but changes the prior probabilities (i.e., P (ω1 ) in the figure), then the error changes linearly, as given by the formula in part (a). The true Bayes error, however, must be less than or equal to that (linearly bounded) value, since one has the freedom to change the decision boundary at each value of P (ω1 ). Moreover, we note that the Bayes error is 0 at P (ω1 ) = 0 and at P (ω1 ) = 1, since the Bayes decision rule under those conditions is to always decide ω2 or ω1 , respectively, and this gives zero error. Thus the curve of Bayes error rate is concave down for all prior probabilities. E(P(ω1))

0.2

0.1

P*(ω1)

0.5

1

P(ω1)

(c) According to the general minim ax equation in part (a), for our case (i.e., λ 11 = λ22 = 0 and λ 12 = λ 21 = 1) the decision boundary is chosen to satisfy



R

2

|

p(x ω1 ) dx =



R

|

p(x ω2 ) dx.

1

We assume that a single decision point suffices, and thus we seek to find x ∗ such that x∗



−∞

N (µ1 , σ12 ) dx =





N (µ2 , σ22 ) dx,

x∗

where, as usual, N (µi , σi2 ) denotes a Gaussian. We assume for definiteness and without loss of generality that µ2 > µ1 , and that the single decision point lies between the means. Recall the definition of an error functio n, given by Eq. 96

PROBLEM SOLUTIONS

11

in the Appendix of the text, that is, erf(x) =

√2π

x



2

e−t dt.

0

We can rewrite the above as erf[(x∗

− µ )/σ ] = erf[( x∗ − µ )/σ ]. 1

1

2

2

If the values of the error function are equal, then their corresponding arguments must be equal, that is (x∗

− µ )/σ 1

1

= (x∗

− µ )/σ 2

2

and solving for x ∗ gives the value of the decision point x∗ =



µ2 σ1 + µ1 σ2 σ1 + σ2



.

(d) Because the mimimax error rate is independent of the prior probabilities, we can choose a particularly simple case to evaluate the error, for instance, P (ω1 ) = 0. In that case our error becomes E = 1/2

− erf[(x∗ − µ )/σ ] = 1/2 − erf 1

1



µ2 σ1 µ1 σ2 σ1 (σ1 + σ2 ) .





(e) We substitute the values given in the problem int o the formula in part (c) and find x∗ =

µ2 σ1 + µ1 σ2 1/2 + 0 = = 1/3. σ1 + σ2 1 + 1/2

The error from part (d) is then E = 1/2

− erf

 −

1/3 0 = 1/2 1+0

− erf[1/3] = 0 .1374.

(f) Note that the distributio ns have the same form (in particular, the same variance). Thus, by symmetry the Bayes error for P (ω1 ) = P ∗ (for some value P ∗ ) must be the same as for P (ω2 ) = P ∗ . Because P (ω2 ) = 1 P (ω1 ), we know that the curve, analogous to the one in part (b), is symmetric around the point P (ω1 ) = 0.5. Because the curve is concave down, therefore it must peak at P (ω1 ) = 0.5, that is, equal priors. The tangent to the graph of the error versus P (ω1 ) is thus horizontal at P (ω1 ) = 0.5. For this case of equal prio rs, the Bayes decision point for this problem can be stated simply: it is the point midway between the means of the two distributions, that is, x ∗ = 5.5.



5. We seek to generalize the notion of minimax criteria to the case where pendent prior probabilities are set.

two inde-

CHAPTER 2. BAYESIAN DECISION THEORY

12

p(x|ωi)P(ωi)

p(x|ω2)P(ω2)

p(x|ω3)P(ω3) p(x|ω1)P(ω1)

µ2

µ1 x1*

x

x2* µ3

δ3

δ2

δ1

(a) We use the triangle distributions and conventions in the figure. We solve for the decision points as follows (being sure to keep the signs correct, and assuming that the decision boundary consists of just two points): P (ω1 )



δ1

− (x∗ − µ ) 1 δ12

1



= P (ω2 )



δ2

− (µ − x∗) 2 δ22

1



,

which has solution x =

P (ω1 )δ22 δ1 + P (ω1 )δ22 µ1

∗1

P (ω1 )δ22

P (ω2 )δ12 δ2 + P (ω2 )µ2 δ12



.

+ P (ω2 )δ12 An analogous derivation for the other decision point gives: P (ω2 )



δ2

− (x∗ − µ ) 2 δ22

2



= P (ω3 )



δ3

− (x∗ − µ ) 3

2 δ32



,

which has solution x∗2 =

2 3 2

−P (ω )δ µ 2

+ P (ω2 )δ32 δ2 + P (ω3 )δ22 δ3 + P (ω3 )δ22 µ3 . P (ω2 )δ32 + P (ω3 )δ22 3

(b) Note that from our normalization condition,



P (ωi ) = 1, we can express all

i=1

priors in terms of just two independent ones, which we choose to be P (ω1 ) and P (ω2 ). We could substitute the values for x ∗i and integrate, but it is just a bit simpler to go directly to the calculation of the error, E , as a function of priors P (ω1 ) and P (ω2 ) by considering the four contributions: E

= P (ω1 ) 12 [µ1 + δ1 x∗1 ]2 2δ1 1 +P (ω2 ) 2 [δ2 µ2 + x∗1 ]2 2δ2 1 +P (ω2 ) 2 [µ2 + δ2 x∗2 ]2 2δ2 1 + 1 P (ω1 ) P (ω2 ) [δ3 2δ32







  − −   P (ω3 )

−µ

3

+ x∗2 ]2 .

PROBLEM SOLUTIONS

13

To obtain the minimax solution, we take the two partial and set them to zero. The first of the derivative equations, ∂E = 0, ∂P (ω1 ) yields the equation 2



µ1 + δ1 x∗1 δ1 µ1 + δ1 x∗1 δ1





δ3

2

µ3 + x∗2 δ3 µ3 + x∗2 . δ3

  −−  = =

δ3

or

Likewise, the second of the derivative equations, ∂E = 0, ∂P (ω2 ) yields the equation



δ2

−µ

2

+ x∗1

δ2

  2

+

µ2 + δ2 δ2

− x∗ 2

  2

=

−µ

δ3

3

+ x∗2

δ3



2

.

These simultaneous quadratic equations have solutions of the general form: x∗i =



bi + ci ai

i = 1, 2.

After a straightforward, but very tedious calculation, we find that: a1 b1 c1

= = =

2 1

2 2

2 3

δ −δ + δ , −δ δ − δ δ − δ δ 2 1 2

2 δ22 µ1 + δ32 µ1 + δ12 µ2 δ1 δ3 µ2 + δ1 δ3 µ3 , 1 2 1 2 δ3 δ12 2δ1 δ23 + 2δ24 + 2δ1 δ22 δ3 + 2δ23 δ3 + δ1 δ22 µ1 + 2δ23 µ1







+2 δ1 δ2 δ3 µ1 2 δ2 δ3 2 µ1 + δ2 2 µ1 2 δ3 2 µ1 2 2 δ1 2 δ2 µ2 2 δ1 δ2 2 µ2 +2 δ2 2 δ3 µ2 + 2 δ2 δ3 2 µ2 2 δ2 2 µ1 µ2 + 2 δ1 δ3 µ1 µ2 + 2 δ3 2 µ1 µ2 δ1 2 µ2 2 + 2 δ2 2 µ2 2 2 δ1 δ3 µ2 2 δ3 2 µ2 2 + 2 δ1 2 δ2 µ3 2 δ2 3 µ3 2 δ1 δ2 δ3 µ3 2 δ2 2 δ3 µ3 2 δ1 δ3 µ1 µ3 + 2 δ1 2 µ2 µ3 2 δ2 2 µ2 µ3 +2 δ1 δ3 µ2 µ3 δ1 2 µ3 2 + δ2 2 µ3 2 .



− −

− −







= = =



δ22 δ32 + 2 δ2 δ32 µ1 + δ32 µ21

+ 2 δ22 δ3

{





δ12 δ22 + δ32 , δ δ δ + δ2 δ + 2 δ δ2 + δ δ µ 12 2 3 2 2 3 2 3 1 3 1 δ1 δ2 + δ32

 − ×

µ3 + 2 δ1 δ3 µ1

} { }{

}{ }







An analogous calculation gives: a2 b 2 c2





δ δ µ + δ2 µ + δ 2 µ



2 3

1 3

2

3 2 3

2

2 2

1

δ2 µ , 3



2

3

− 2δ µ µ +2δ µ +2 δ δ δ µ µ − 2δ δ µ µ + δ µ − δ µ . 3

1

1 3

2

2

3

2 1

2 3

1 2 3 2 2 2 3



3

(c) For µi , δi = 0, 1 , .5, .5 , 1, 1 , for i = 1, 2, 3, respectively, we substitute into the above equations to find x∗1 = 0.2612 and x∗2 = 0.7388. It is a sim ple matter to confirm that indeed these two decision points suffice for the classification problem, that is, that no more than two points are needed.

CHAPTER 2. BAYESIAN DECISION THEORY

14

p(x|ωi)P(ωi)

p(x|ω1)P(ω1)

p(x|ω2)P(ω2)

E1 -2

x

0 x*

-1

µ1

1

2

µ2

6. We let x ∗ denote our decision boundary and µ 2 > µ1 , as shown in the figure. (a) The error for class ifying a pattern that is actually in ω 1 as if it were in ω 2 is:



1 p(x ω1 )P (ω1 ) dx = 2

|

R

2





N (µ1 , σ12 ) dx

x∗

≤E . 1

Our problem demands that this error be less than or equal to E1 . Thus the bound on x ∗ is a function of E 1 , and could be obtained by tables of cumulative normal distributions, or simple numerical integration. (b) Likewise, the error for categ orizing a pattern that is in ω 2 as if it were in ω 1 is:

E2 =

1 p(x ω2 )P (ω2 ) dx = 2

|



R

1

x∗ 2



N (µ2 , σ2 ) dx.

−∞

(c) The total error is simply the sum of these two contributions: E

=

E1 + E2

=

1 2





N (µ1 , σ12 )

x∗

| ∼ −

1 dx + 2

x∗



N (µ2 , σ22 ) dx.

−∞

| ∼

(d) For p(x ω1 ) N ( 1/2, 1) and p(x ω2 ) N (1/2, 1) and E 1 = 0.05, we have (by simple numerical integration) x ∗ = 0.2815, and thus E

=

= =

0.05 +

1 2

1 0.05 + 2 0.168.

0.2815



N (µ2 , σ22 ) dx

−∞

0.2815



−∞

1

(x

0.5)2

√2π0.05 exp − 2(0.5) −



2



dx

(e) The decision boundary for the (minimum error) Bayes case is clearly at x ∗ = 0. The Bayes error for this problem is:



EB

= 2

 0

1 N (µ1 , σ12 ) dx 2

PROBLEM SOLUTIONS

15





=

N (1, 1) dx = erf[1] = 0 .159,

0

which of course is lower than the error for the Neyman-Pearson criterion case. Note that if the Bayes error were lower than 2 0.05 = 0 .1 in this problem, we would use the Bayes decision point for the Neyman-Pearson case, since it too would ensure that the Neyman-Pearson criteria were obeyed and would give the lowest total error.

×

7. We proceed as in Problem 6, with the figure below. p(x|ωi)P(ωi) 0.3 0.25

p(x|ω1)P(ω1)

p(x|ω2)P(ω2)

0.2 0.15 0.1 0.05

E1 x

-4

-2

0 x*

2

4

(a) Recall that the Cauc hy density is 1

1

|

p(x ωi ) = πb 1 +



x ai 2 . b

 

If we denote our (single) decision boundary point as x ∗ , and note that P (ωi ) = 1/2, then the error for misclassifying a ω 1 pattern as ω 2 is:



E1

=



p(x ω1 )P (ω1 ) dx

1 2



|

x∗

= We substitute (x



1 1 πb 1 + x−a1

  

1

E1

=

θ=0



dx.

1 + y 2 to get:

− a )/b = y, and sin θ = 1/ 1 2π

2

b

x∗



θ=θ˜

= where θ˜ = sin −1

√

b





b2

b + (x∗

−a

1

)2



,

∗ −a1 )2 . Solving for the decision point gives

b2 +(x

x∗ = a 1 + b

1 sin−1 2π



1 sin2 [2πE 1 ]

−1= a

1

+ b/tan[2πE 1 ].

16

CHAPTER 2. BAYESIAN DECISION THEORY

(b) The error for the conv erse case is found similarly: E2

1 πb

=

=

1 2π

=

1 2π

x∗

   1

P (ω2 ) x a2 2 b



1+

−∞

dx

θ=θ˜





θ= π

   

b b2 + (x∗

sin−1

1 1 + sin−1 2 π

=



2

−a ) 2

b b2 + (x∗

−a ) 2

   ,

2

where θ˜ is defined in part (a).

(c) The total error is merely the sum of the component errors: E = E 1 + E2 = E 1 +

1 1 + sin−1 2 π



b b2 + (x∗

2

−a ) 2



,

where the numerical value of the decision point is x∗ = a 1 + b/tan[2πE 1 ] = 0.376. (d) We add the errors (for b = 1) and find E = 0.1 +

1 1 + sin−1 2 π



b b2 + (x∗

−a ) 2

2



= 0.2607.

(e) For the Bayes case, the decision point is midway between the peaks of the two distributions, i.e., at x ∗ = 0 (cf. Problem 6). The Bayes error is then



EB = 2

   1

1+

0

P (ω2 ) x a 2 b



dx = 0.2489.

This is indeed lower than for the Neym an-Pearson case, as it must be. Note that if the Bayes error were lower than 2 0.1 = 0.2 in this problem, we would use the Bayes decision point for the Neyman-Pearson case, since it too would ensure that the Neyman-Pearson criteria were obeyed and would give the lowest total error.

×

8. Consider the Cauchy distribution.

|

(a) We let k denote the integral of p(x ωi ), and check the normalization condition, that is, whether k = 1:



k=



−∞

|

p(x ωi ) dx =

1 πb



  

−∞

1

1+

x ai 2 b



dx.

PROBLEM SOLUTIONS We substitute (x

17

− a )/b = y into the above and get i

1 π

k=





−∞

1 dy, 1 + y2

and use the trigonometric substition 1 / 1 + y2 = sin θ, and hence dy = dθ/sin2 θ to find

  θ=0

1 k= π



θ= π

sin2 θ dθ = 1. sin2 θ

Indeed, k = 1, and the distribution is normalized. (b) We let x∗ denote the decision boundary (a single point) and find its value by setting p(x∗ ω1 )P (ω1 ) = p(x∗ ω2 )P (ω2 ). We have then

|

|

1 πb 1 +

1

1 1 = 2 πb 1 +

 

x∗ a1 2 b



1

 

x∗ a2 2 b



1 , 2

or (x∗ a1 ) = (x∗ a2 ). For a 1 = a 2 , this implies that x ∗ = (a1 + a2 )/2, that is, the decision boundary is midway between the means of the two distributions.



± −



(c) For the values a 1 = 3, a2 = 5 and b = 1, we get the graph shown in the figure. P(ω1|x) 1

0.8

0.6

0.4

0.2

x -4

-2

0

2

|

4

6

8

10

|

(d) We substitute the form of P (ωi x) and p(x ωi ) and find 1 2

|

lim P (ωi x) = x→∞

lim x→∞

=

lim

 1 2

1 πb

1

2

−ai

x

b

1+(

)

 

1 1 πb 1+( x−a1 )2 b

b2 + (x

+

1 2

2

−a ) i

1 1 πb 1+( x−a2 )2 b

→∞ b2 + (x − a1 )2 + b2 + (x − a2 )2

x

|

=

1 , 2

and likewise, lim P (ωi x) = 1/2, as can be confirmed in the figure.

→−∞

x

9. We follow the terminology in Section 2.3 in the text.



CHAPTER 2. BAYESIAN DECISION THEORY

18

(a) Without loss of generality, we assume that a2 > a1 , note that the deci sion boundary is at ( a1 + a2 )/2. The probability of error is given by (a1 +a2 )/2



P (error) =



|

p(ω2 x)dx +

−∞ (a1 +a2 )/2

1

1



dx + πb

1

1 dx = π

x a2 2 b

1+

x a2 2 b



1/2



x a1 2 b

   

(a1 +a2 )/2



1+

dx

(a1 a2 )/2

2

−∞



1/2

     

−∞ 1 + (a −a )/2

1 = πb

|

p(ω1 x)dx

(a1 +a2 )/2

1

= πb



−∞

1 dy, 1 + y2



where for the last step we have used the trigonometric substitution y = (x a2 )/b as in Problem 8. The integral is a standard form for tan −1 y and thus our solution is: P (error) = = (b) See figure.

  −  −  −  −

1 a1 a2 tan−1 tan−1 [ π 2b 1 1 a2 a1 tan−1 . 2 π 2b

−∞]



P(error) 0.5

0.4

0.3

0.2

0.1

|a2 - a1|/(2b) 12345

−a1 ) = 1/2, which (c) The maximum value of the probability of error is Pmax ( a22b −a1 = 0. This occurs when eithe r the two distrib utions are the occurs for a22b same, which can happen because a 1 = a 2 , or even if a 1 = a 2 because b = and

|

|



both distributi ons are flat. 10. We use the fact that the conditional error is

|

P (error x) =



| |

P (ω1 x) if we decide ω 2 P (ω2 x) if we decide ω 1 .

(a) Thus the decision as stated lead s to:



P (error) =



−∞

|

P (error x)p(x)dx.



PROBLEM SOLUTIONS

19

Thus we can write the probability of error as P (error) =

P (x < θ and ω 1 is the true state) +P (x > θ and ω 2 is the true state) = P (x < θ ω1 )P (ω1 ) + P (x > θ ω2 )P (ω2 )

|

|



θ

= P (ω1 )

|

|

p(x ω1 ) dx + P (ω2 ) p(x ω2 ) dx.

−∞

θ





(b) We take a derivative with respect to θ and set it to zero to find an extremum, that is, dP (error) = P (ω1 )p(θ ω1 ) dθ

| − P (ω )p(θ|ω ) = 0, 2

2

which yields the condition

|

|

P (ω1 )p(θ ω1 ) = P (ω2 )p(θ ω2 ),

|

where we have used the fact that p(x ωi ) = 0 at x

→ ±∞.

(c) No, this conditi on does not uniquely define θ.

|

|

1 )p(θ ω1 ) = P (ω2 )p(θ ω2 ) over a range of θ, then θ would be unspec1. Ifified P (ω throughout such a range.

2. There can easily be multiple values of x for which the condition hold, for instance if the distributions have the appropriate multiple peaks.

|



| ∼ −

(d) If p(x ω1 ) N (1, 1) and p(x ω2 ) N ( 1, 1) with P (ω1 ) = P (ω2 ) = 1/2, then we have a maximum for the error at θ = 0. 11. The deterministic risk is given by Bayes’ Rule and Eq. 20 in the text R=



|

R(αi (x) x) dx.

|

(a) In a random decision rule, we have the probability P (αi x) of deciding to take action αi . Thus in order to comput e the full probabilistic or randomized risk, Rran , we must integrate over all the conditional risks weighted by their probabilities, i.e.,

  a

Rran =

|

|



R(αi (x) x)P (αi x) p(x) dx.

i=1

(b) Consider a fixed point x and note that the (deterministic) Bayes minimum risk decision at that point obeys

| ≥ R(α

R(αi (x) x)

max (x)

|x).

CHAPTER 2. BAYESIAN DECISION THEORY

20

Therefore we have the risk in the randomized case

  | |      ≥ | |  | a

Rran

=

R(αi (x) x)P (αi x) p(x)dx

i=1

a

R(αmax x)

P (αi x) p(x)dx

i=1

= R(αmax x)p(x)dx = RB ,

|

the Bayes risk. Equality holds if and only if P (αmax (x) x) = 1. 12. We first note the normalization condition c



|

P (ωi x) = 1

i=1

for all x.

|

| |

|

|

(a) If P (ωi x) = P (ωj x) for all i and j, then P (ωi x) = 1/c and hence P (ωmax x) = 1/c. If one of the P (ωi x) < 1/c, then by our normalization condition we must have that P (ωmax x) > 1/c.

|

(b) The probability of error is simply 1.0 minus the probabil ity of being correct, that is, P (error) = 1





|

P (ωmax x)p(x) dx.

(c) We simply substitu te the limit from part (a) to get P (error) = 1

= 1





  |  − g

P (ωmax x) p(x) dx



=g 1/c

p(x) dx = 1

− g.

≤ 1 − 1/c = (c − 1)/c.

Therefore, we have P (error)

(d) All categories have the same prior probability and each distribut ion has the same form, in other words, the distributions are indistinguishable. Section 2.4 13. If we choose the category ω max that has the maximum posterior probability, our risk at a point x is: λs



|

P (ωj x) = λ s [1



j =max

− P (ω |x)], max

whereas if we reject, our risk is λ r . If we choose a non-maxim al category ωk (where k = max), then our risk is



λs

 

j =k

|

P (ωj x) = λ s [1

− P (ω |x)] ≥ λ [1 − P (ω |x)]. k

s

max

PROBLEM SOLUTIONS

21

This last inequality shows that we should never decide on a category other than the one that has the maximum posterior probability, as we know from our Bayes analysis. Consequently, we should either choose ωmax or we should reject, depending upon which is smaller: λs [1 P (ωmax x)] or λ r . We reject if λ r λ s [1 P (ωmax x)], that is, if P (ωmax x) 1 λr /λs . 14. Consider the classification problem with rejection option.

− | ≥ −

|





|

(a) The minimum-risk decision rule is given by:

| ≥ P (ω |x), for all j λ and if P (ω |x) ≥ 1 − . λ

Choose ω i if P (ωi x)

j

r

i

s

This rule is equivalent to

| ≥ and if p(x|ω )P (ω ) ≥

Choose ω i if p(x ωi )P (ωi ) i

i

 −| 

p(x ωj )P (ωj ) for all j λr 1 p(x), λs

where by Bayes’ formula

|

p(x ωi )P (ωi ) . p(x)

|

p(ωi x) =

The optimal discriminant function for this problem is given by Choose ω i if gi (x)

g j (x) for all i = 1,...,c,

and j = 1,...,c

≥ Thus the discriminant functions are: p(x|ω )P (ω ),

   |  

p(x),

i = 1,...,c i = c + 1,

p(x ωi )P (ωi ),

i = 1,...,c

i λs λr λs

gi (x) =



=

c



λs λr λs

|

i

+ 1.

|

p(x ωj )P (ωj ), i = c + 1.

j=1



|

∼ −

(b) Consider the case p(x ω1 ) N (1, 1), p(x ω2 ) N ( 1, 1), P (ω1 ) = P (ω2 ) = 1/2 and λ r /λs = 1/4. In this case the discriminant functions in part (a) give

|

g1 (x) = p(x ω1 )P (ω1 ) =

1 e−(x−1) 2 2π

2

/2



2

1 e−(x+1) /2 2 2π [p(x ω1 )P (ω1 ) + p(x ω2 )P (ω2 )]



g2 (x) = p(x ω2 )P (ω2 ) = g3 (x) = = =

 −|   −   √ 1 1

λr λs 1 4

|

|

1 e−(x−1) 2 2π



2

/2

+

1 e−(x+1) 2 2π





2

/2



2 2 3 3 e−(x−1) /2 + e−(x+1) /2 = [g1 (x) + g2 (x)]. 4 8 2π

as shown in the figure.

CHAPTER 2. BAYESIAN DECISION THEORY

22

gi(x) 0.2

g2(x)

g1(x)

g3(x)

0.15

0.1

0.05

-3

-2

-1

0

1

2

3

x

(c) If λr /λs = 0, there is no cost in rejec ting as unrec ognizable. Furthermore, P (ωi x) 1 λr /λs is never satisfied if λr /λs = 0. In that cas e, the decis ion rule will alw ays reject as unrec ognizable. On the other hand, as λr /λs 1, P (ωi x) 1 λr /λs is always satisfied (there is a high cost of not recognizing) and hence the decision rule is the Bayes decision rule of choosing the class ωi that maximizes the posterior probability P (ωi x).

| ≥ − | ≥ −



| p(x|ω ) ∼ N (0, 1/4), P (ω ) = 1/3, P (ω ) =

| ∼

(d) Consider the case p(x ω1 ) N (1, 1), 2 1 2/3 and λ r /λs = 1/2. In this case, the discriminant functions of part (a) give

|

2 e−(x−1) 3 2π

|

1 2e−2x 3 2π

g1 (x) =

p(x ω1 )P (ω1 ) =

g2 (x) =

p(x ω2 )P (ω2 ) =

2

/2



2

g3 (x) = = =

−  √ ·  √ 1

λr λs



|

|

[p(x ω1 )P (ω1 ) + p(x ω2 )P (ω2 )]

1 2 e−(x−1) 2 3 2π

2

2

/2

+

e −2x 2π



 

2 2 1 1 e−(x−1) /2 + e−2x = [g1 (x) + g2 (x)]. 2 3 2π

Note from the figure that for this problem we should never reject. gi(x) 0.25

g2(x)

g1(x)

0.2

g3(x)

0.15 0.1 0.05

-3

-2

-1

0

1

2

3

x

2

PROBLEM SOLUTIONS

23

Section 2.5 15. We consider the volume of a d-dimensional hypersphere of radius 1.0, and more generally radius x, as shown in the figure. z Vd xd 1

dz

z 1

1

x

x

1 (a) We use Eq. 47 in the text for d = odd, that is, V d = 2d π(d−1)/2 ( d− 2 )!/d!. When applied to d = 1 (a line) we have V1 = 21 π0 1 = 2. Indeed, a line se gment 1 x +1 has generalized volume (length) of 2. More generally, a line of “radius” x has volume of 2 x.

− ≤ ≤

(b) We use Eq. 47 in the text for d = even, that is, Vd = π (d/2) /(d/2)!. When applied to d = 2 (a disk), we have V2 = π 1 /1! = π. Indeed, a disk of radi us 1 has generalized volume (area) of π. More generally, a disk of radius x has volume of πx2 . (c) Given the volume of a line in d = 1, we can derive the volume of a disk by straightforward integration. As shown in the figure, we have 1

V2 = 2

− 1

z 2 dz = π,

0

as we saw in part (a). (d) As can be seen in the figure, to find the volume of a general ized hypersphere in d + 1 dimensions, we merely integrate along the z (new) dimension the volume of a generalized hypersphere in the d-dimensional space, with proper factors and limits. Thus we have: 1

Vd+1 = 2

z 2 )d/2 dz =

Vd (1

 0



Vd πΓ(d/2 + 1)



,

Γ(d/2 + 3/2)

where for integer k the gamma function obeys



Γ(k + 1) = k! and Γ( k + 1/2) = 2 −2k+1 π(2k

− 1)!/(k − 1)!.

(e) Using this formula for d = 2k even, and Vd given for even dimensions, we get that for the next higher (odd) dimension d ∗ : Vd∗

= Vd+1 =

2π d/2 (d/2)!

√

π (d/2)! 2 Γ(d/2 + 3/2)



CHAPTER 2. BAYESIAN DECISION THEORY

24

π d/2 k! 22k+1 (2k + 1)!

=

∗ −1)/2 d∗ −1 ∗ ( )! 2d

π (d

=

2

(d∗ )!

,

where we have used 2 k = d for some integer k and d∗ = d + 1. This confirms Eq. 47 for odd dimension given in the text. (f) We repeat the above steps, but now use Vd for d odd, in order to derive the volume of the hypersphere in an even dimension: Vd+1

=

√π Γ(

d + 1) 2 d 2 Γ( 2 + 32 ) 2d π(d 1)/2 ( d 2 1 )!

√π

d!

2

Vd





=

Γ((k + 1) + 12 ) (k + 1)!



π d /2 , (d∗ /2)!

=

where we have used that for odd dimension d = 2k + 1 for some integer k , and d∗ = d + 1 is the (even) dimension of the higher space . This confirms Eq. 47 for even dimension given in the text. 16. We approach the problem analogou sly to problem 15, and use the same figure. z d

Vd x

1

dz

z 1

1

x

x

−1 ≤ x ≤ 1 is indeed V

(a) The “volume” of a line from

1

= 2.

(b) Integrating once for the general case (according to the figure) give s



1

Vd+1

=2



Vd (1

0



z 2 )d/2 dz

Vd πΓ(d/2 + 1) = Γ(d/2 + 3/2) ,

where for integer k the gamma function obeys



Γ(k + 1) = k! and Γ( k + 1/2) = 2 −2k+1 π(2k

− 1)!/(k − 1)!.

Integrating again thus gives: Vd+2

=

 √    Vd

√

πΓ(d/2 + 1) Γ(d/2 + 3/2) Vd+1

πΓ((d + 1)/2 + 1) Γ((d + 1)/2 + 3/2)



PROBLEM SOLUTIONS

25

Γ(d/2 + 1) Γ((d + 1)/2 + 3/2) πΓ(d/2 + 1) π = Vd = Vd . (d/2 + 1)Γ(d/2 + 1) d/2 + 1 = Vd π

This is the central result we will use below. (c) For d odd, we rewrite π/(d/2 + 1) as 2 π/(d + 2). Thus we have Vd

= V1

 



    · · ·  −  ×···×−   − −  −    ×···×    − 

2π 2π 2π = V d−4 = (d 2) + 2 (d 4) + 2 (d 2) + 2 2π 2π 2π (d (d 1)) + 2 (d (d 3)) + 2 (d 4) + 2

= V d −2



− −



2π (d 2) + 2



(d 1)/2 terms

= π (d−1)/2 2(d+1)/2 = π (d−1)/2 2(d+1)/2

1 3

1 5

1

d

2

1 d

 

1 π(d−1)/2 2d = . d!! d!

We have used the fact that V1 = 2, from part (a), and the notation d (d 2) (d 4) , read “ d double factorial.”

d!! =

× − × − ×···

(d) Analogously to part (c), for d even we have Vd

  ··· −     ×···×   −  −   − −   − ×···×

= V d −2



π

(d

− 2)/2 + 1

V2 =π

=

(d

π d)/2 + 1

 

(d

= V d −4

π

(d

− 4)/2 + 1

π (d 2))/2 + 1

(d

(d

π = 2)/2 + 1 π 4)/2 + 1

(d

π 2)/2 + 1

d/2 terms

= π d/2 =

1 1

πd/2 . (d/2)!

1 2

1 d/2 1



1 d/2

(e) The central mathem atical reason that we express the formulas separately for even and for odd dimensions comes from the gamma function, which has different expressions as factorials whether the argument is integer valued, or half-integer valued. depending One couldupon express the volum e of hyperspheres in all dimensions by using gamma functions, but that would be computationally a bit less elegant. 17. Consider the minimal rectangular bounding box in d dimensions that contains the hype rellipsoid, as shown in the figure. We can work in a coordinate syst em in which the princ ipal axes of the ellip soid are paral lel to the box axes, that is, ˜ = diag[1 /σ2 , 1/σ 2 ,..., 1/σ 2 ]). The squared the covariance matrix is diagonal ( Σ 1 2 d Mahalanobis distance from the srcin to the surface of the ellipsoid is given by r2 ,

 

CHAPTER 2. BAYESIAN DECISION THEORY

26

σ3

σ2 σ1

that is, points x that satisfy r2

˜ = xt Σx = xt

 

1/σ12 0 .. .

0 1/σ22 .. .

0

0

... ... .. .

0 0 .. .

...

1/σd2

 

x.

Thus, along each of the principal axes, the distance obeys x2i = σi2 r2 . Because the distance across the rectangular volume is twice that amount, the volume of the rectangular bounding box is Vrect = (2x1 )(2x2 )

d d

· · · (2x ) = 2 r d

d



˜ 1/2 . σi = 2 d r d Σ

i=1

| |

We let V be the (unknown) volume of the hyperellipsoid, Vd the volume of the unit hypersphere in d dimension, and Vcube be the volume of the d-dimensional cube having length 2 on each side. Then we have the following relation: V Vd = . Vrect Vcube We note that the volume of the hypercube is V cube = 2d , and substitute the above to find that V =

Vrect Vd ˜ 1/2 Vd , = rd Σ Vcube

| |

where Vd is given by Eq. 47 in the text. Recall that the determin ant of a matrix is

| |

| |

unchanged by rotation of axes ( Σ ˜ 1/2 = Σ 1/2 ), and thus the value can be written as V = r d Σ 1/2 Vd .

| |

18. Let X 1 ,...,X n be a random sample of size n from N (µ1 , σ12 ) and let Y 1 ,...,Y be a random sample of size m from N (µ2 , σ22 ).

···

···

m

(a) Let Z = (X1 + + Xn ) + (Y1 + + Ym ). Our goal is to show that Z is also normally distributed. From the discussion in the text, if Xd×1 N (µd×1 , Σd×d )



PROBLEM SOLUTIONS and A is a k

27 t

t

t

× d matrix, then A X ∼ N (A µ, A ΣA). Here, we take

X(n+m)×1 =

  

X1 X2 .. . Xn Y1 Y ..2 . Ym

  

.

×

Then, clearly X is normally distributed in (n + m) 1 dimensions. We can write Z as a particular matrix A t operating on X: Z = X1 +

··· + X

n

+ Y1 +

··· + Y

m

= 1 t X,

where 1 denotes a vector of length n + m consisting solely of 1’s. By the above fact, it follows that Z has a univariate normal distribution. (b) We let µ 3 be the mean of the new distribution. Then, we have µ3

= = =

E (Z ) E [(X + · · · + X ) + (Y + · · · + Y )] (X ) + + (X ) + (Y ) + + (Y ) E(since X ·,...,X · · E , Y ,...,YE are· · independent) · E n

1

m

1

n

1

n

1

m

1

m

1

= nµ1 + mµ2 . (c) We let σ 32 be the variance of the new distribution. Then, we have σ32 = Var( Z ) = Var(X1 ) + + Var(Xn ) + Var(Y1 ) + + Var(Ym ) (since X 1 ,...,X n , Y1 ,...,Y m are independent) = nσ12 + mσ22

···

···

(d) Define a column ve ctor of the sample s, as:

X=

×

  

X1 .. . Xn Y1 .. . Ym

  

.

Then, clearly X is [(nd + md) 1]-dimensional random variable that is normally distributed. Consider the linear projection operator A defined by At = (Id×d Id×d

  · · · 

Id×d ).

(n+m) times

CHAPTER 2. BAYESIAN DECISION THEORY

28 Then we have

Z = At X = X 1 +

···+ X

n

··· + Y

+ Y1 +

m,

which must therefore be normally distributed. Furthermore, the mean and variance of the distribution are µ3 =

E (Z)

=

E (X ) + · · · + E (X ) + E (Y ) + · · · + E (Y 1

1

n

= nµ1 + mµ2 . Σ3 = Var(Z) = Var( X1 ) + + Var(Xn ) + Var(Y1 ) + = nΣ1 + mΣ2 .

···

m)

· · · + Var(Y

m)

19. The entropy is given by Eq. 37 in the text: H (p(x)) =





p(x) lnp(x)dx

with constraints



bk (x)p(x)dx = a k

for k = 1,...,q.

(a) We use Lagrange factors and find q

Hs

=

p(x)lnp(x)dx +

bk (x)p(x)dx k=1 q

=

ak

     −  − − −  p(x) lnp(x)

q

λk bk (x)

k=0

From the normalization condition all x.

ak λ k .

k=0

p(x)dx = 1, we know that a 0 = b 0 = 1 for

(b) In order to find the maximum or minimum value for H (having constraints), we take the derivative of H s (having no constraints) with respect to p(x) and set it to zero: ∂H s = ∂p(x)

− 



q

lnp(x)

 −

λk bk (x) + 1 dx = 0.

k=0

The argument of the integral must vanish, and thus lnp(x) =

q

 

λk bk (x)

k=0

− 1.

We exponentiate both sides and find

q

p(x) = exp

k=0

λk bk (x)



−1

,

where the q + 1 parameters are determined by the constraint equations.

PROBLEM SOLUTIONS

29

20. We make use of the result of Problem 19, that is, the maximum-entropy distribution p(x) having constraints of the form

 for k = 1,...,q

bk (x)p(x)dx = a k

is q

p(x) = exp

 

k =0

λk bk (x)

−1



.

(a) In the current case, the normaliza tion constraint is xh

p(x)dx = 1,

xl

and thus

 q

p(x) = exp

k=0

λ b (x) − 1 k k



= exp(λ0

− 1).

In order to satisfy this constraint equation, we demand xh

p(x)dx = exp(λ0

1)(xu





xl

xl ) = 1,



and thus the distribution is uniform, p(x) =



| −x |

1/ xu

≤ ≤

xl x x u 0 otherwise.

l

(b) Here the constrain t equations are





p(x)dx =

1

xp(x)dx =

µ.

0



 0

Thus, using the above result, the density is of the form p(x) = exp[ λ0 + λ1 x

− 1]

and we need to solve for λ 0 and λ 1 . The normalization constraint gives



 0

exp[λ0 + λ1 x

− 1]dx



= eλ0 −1

eλ1 x λ1

= eλ0 −1

− 

∞ 0

1 λ1

= 1.

CHAPTER 2. BAYESIAN DECISION THEORY

30

Likewise, the mean constraint gives





eλ1 eλ0 −1 xdx = e λ0 −1

0

Hence λ 1 =

−1/µ and λ

0

=1

 1 λ21

= µ.

− lnµ, and the density is x/µ

p(x) =





(1/µ)e− 0 otherwise. x 0

(c) Here the density has three free parameters, and is of the general form p(x) = exp[ λ0

2

− 1 + λ x + λ x ], 1

2

and the constraint equations are







p(x)dx = 1

( )

xp(x)dx = µ

( )

−∞ ∞



∗∗

−∞ ∞



x2 p(x)dx = σ 2 .

(

−∞



We first substitute the general form of p(x) into ( ) and find

√−1λ

2

√π 2

Since λ 2 < 0, erf(

2

eλ0 −1−λ1 /(4λ2 ) erf

 −

λ2 x

− 2√λ−λ

∞) = 1 and erf( −∞) = −1, we have √πexp[λ − 1 − λ /(4λ )] √ −λ = 1. 2 1

0





1

2

= 1.

−∞

2

2

∗∗) and find √ ∞ − λ πexp[λ4λ −√ 1 −λ λ /(4λ )] erf −λ x − λ /(2 −λ ) = µ, −∞ − which can be simplified to yield √ λ π √ exp[λ − 1 − λ /(4λ )] = −µ. 2λ −λ Finally, we substitute the general form of p(x) into ( ∗ ∗ ∗) and find √π √ exp[λ − 1 − λ /(4λ )] = −σ . 2λ −λ Likewise, next substitute the general form of p(x) into ( 1

2 1

0

2

2

2

2

2

2



1

2

0

2

0

2

2

 

2 1

2 1

2

2

2

∗ ∗ ∗)

PROBLEM SOLUTIONS

31

We combine these three results to find the constants:



2

λ0

=

λ1 λ2

= =

1

− 2σµ

+ ln[1/( 2πσ)]

2

µ/σ2 1/(2σ 2 ).



We substitute these values back into the general form of the density and find



2

−(x − µ) 1 √2πσ exp 2σ

p(x) =

2



,

that is, a Gaussian. 21. A Gaussian centered at x = 0 is of the form 1 √2πσ exp[−x /(2σ )]. 2

p(x) =

2

The entropy for this distribution is given by Eq. 37 in the text:



H (p(x)) =





p(x)lnp(x)dx

−∞ ∞

= =

1

2

1

2

2

2

−−∞ √2πσ exp[−x /(2σ )]ln √2πσ exp[−x /(2σ )] √ √ ln[ 2πσ] + 1/2 = ln[ 2πeσ].







dx

For the uniform distribution, the entropy is xu

H (p(x)) =

 −

xl

1

ln



1

|x − x | |x − x | u

l

u

l



dx =



−ln |x −1 x | u

l



= ln[xu

− x ].

Since we are given that the mean of the distribution is 0, we know that Further, we are told that the variance is σ 2 , that is

l

xu =

−x . l

xh



x2 p(x)dx = σ 2

xl

which, after integration, implies x2u + xu xl + x2l = 3σ 2 .



We put these results together and find for the uniform distribution H (p(x)) = ln[2 3σ]. We are told that the variance of the triangle distribution centered on 0 having half-width w is σ 2 , and this implies w



−w

x2

w

− |x| dx =

w2

w

 0

x2



w x dx + w2



0

−w

x2

w+x dx = w 2 /6 = σ 2 . w2

CHAPTER 2. BAYESIAN DECISION THEORY

32

The entropy of the triangle distribution is then w





H (p(x)) =

w

w2

−w

w

   −  −

− |x| ln w − |x|

dx

w2

0

 

=



=

− = ln[ 6eσ], lnw + 1/2 = ln[ 6σ] + 1/2



w x w x ln dx w2 w2

0

w

w+x w+x ln dx w2 w2



√ √ w = 6σ from the variance condition.

where we used the result Thus, in order of decreasing entropy, these equal-variance distributions are Gaussian, uniform then triangle, as illustrated in the figur e, where each has the same variance σ 2 . p(x)

0.4/σ

0.3/σ

0.2/σ

0.1/σ

-3σ

-2σ



σ





x

22. As usual, we denote our multidim ensional Gaussian distribu tion by p(x) N (µ, Σ), or p(x) =

1

exp 1/2

(2π)d/2 Σ

| |

−

1 (x 2

− µ) Σ− (x − µ) 1

t



.

According to Eq. 37 in the text, the entropy is H (p(x)) =

=

=

=

=

=

  −



1 2 1 2 1 2 1 2

p(x)lnp(x)dx

p(x)

 −

1 (x 2

− µ) Σ− (x t

 −  −   d

µi )[Σ−1 ]ij (xj

i=1 j =1 d

(xj

µj )(xi

i=1 j=1 d

d

[Σ]ji [Σ−1 ]ij +

i=1 j=1 d

[ΣΣ−1 ]jj +

j=1

ln (2π)d/2 Σ 1/2

µ)

indep. of x

d

(xi

d

1

 || − −   −    

−µ ) i

 

dx

1 µj ) dx + ln[(2π)d Σ ] 2

| |

1 [Σ−1 ]ij dx + ln[(2π)d Σ ] 2

indep. of x

1 ln[(2π)d Σ ] 2

| |

1 ln[(2π)d Σ ] 2

| |

| |



PROBLEM SOLUTIONS 1 2

=

33

d

   

1 [I]jj + ln[(2π)d Σ ] 2 j=1

| |

d

d 1 + ln[(2π)d Σ ] 2 2 1 ln[(2πe)d Σ ], 2

=

| |

| |

=

where we used our common notation of I for the d-by-d identity matrix. 23. We have p(x ω) N (µ, Σ), where

| ∼

µ=

  1 2 2

and Σ =

(a) The density at a test point x o is 1

|

p(xo ω) =

| |

(2π)3/2 Σ

For this case we have

| Σ| Σ−1

=

=

 

1 0 0 0 5 2 0 2 5

     =1

=

−



1 0 0 0 5 2 0 2 5

1 (xo 2

   

5 2 2 5

−1

1 0 0 0 5 2 0 2 5

exp 1/2



.

t

1

o

.

= 21,

1 0 0

0 5 2

0 2 5

−1

  

and the squared Mahalanobis distance from the mean to (xo − µ) Σ−1 (xo − µ)



− µ) Σ− (x − µ)

=

1 0 0

0 5/21 2/21



0

−2/21 5/21

x o = (.5, 0, 1)t is

t

=

     −  −   −  .5 0 1

t

0.5

=

t

1 2 2

1 0 0

0 5/21 2/21

−2 −1





0.5

−8/21 −1/21

0 2/21 5/21

= 0.25 +

     − −1

.5 0 1

1 2 2

16 1 + = 1.06. 21 21

We substitute these values to find that the density at x o is: 1

p(xo ω) =

3/2

1/2

1

exp

|

10−3 .

(1.06) = 8.16

− 

×

(2π) (21) 2 Aw = ΦΛ−1/2 , where Φ contains the (b) Recall from Eq. 44 in the text that normalized eigenvectors of Σ and Λ is the diagonal matrix of eigenvalues. The characteristic equation, Σ λI = 0, in this case is

 

| − |

1

−λ 0 0

0 5

−λ 2

0 2 5

−λ

 

= =



2



− λ) (5 − λ) − 4 (1 − λ)(3 − λ)(7 − λ) = 0. (1



,

CHAPTER 2. BAYESIAN DECISION THEORY

34

The three eigenvalues are then λ = 1, 3, 7 can be read immediately from the factors. The (diagonal) Λ matrix of eigenvalues is thus Λ=



1 0 0 0 3 0 0 0 7



.

To find the eigenvectors, we solve Σx = λ i x for (i = 1, 2, 3): Σx =



10 05 0 2

02 5

1 5x2 x+ 2x3 2x2 + 5x3

1 x x2 x3

           ⇒      √  ⇒ √ −      √  ⇒ √  √√ √√  −  √ √     − √√ √√  x=

= λi

i = 1, 2, 3.

The three eigenvectors are given by: λ1

= 1:

λ2

= 3:

λ3

= 7:

x1 5x2 + 2x3 2x2 + 5x3

=

x1 x2 x3

x1 5x2 + 2x3 2x2 + 5x3

=

3x1 3x2 3x3

φ2 =

x1 5x2 + 2x3 2x2 + 5x3

=

7x1 7x2 7x3

φ3 =

φ1 =

1 0 0

,

0 1/ 2 1/ 2

0 1/ 2 1/ 2

,

.

Thus our final Φ and A w matrices are: Φ=

and

Aw = ΦΛ −1/2

We have then, Y = A tw (x

1 0 0

0 0 1/ 2 1/ 2 1/ 2 1/ 2

=

1 0 0

0 0 1/ 2 1/ 2 1/ 2 1/ 2

=

1 0 0

0 0 1/ 6 1/ 14 1/ 6 1/ 14

− √

1 0 0 0 3 0 0 0 7

.



− µ) ∼ N (0, I).

(c) The transformed point is found by applying A, that is, xw

= Atw (xo µ) 1 0 0 = 0 1/ 6 1/ 14 0 1/ 6 1/ 14



−√ − √

√ √

  −−   −− √  − − √ 0.5 2 1

=

0.5 1/ 6 3/ 14

.

(d) From part (a), we have that the squared Mahalan obis distance from x o to µ in the srcinal coordinates is r 2 = (xo µ)t Σ−1 (xo µ) = 1.06. The Mahalanobis distance from x w to 0 in the transformed coordinates is x tw xw = (0.5)2 + 1/6 + 3/14 = 1 .06. The two distances are the same, as they must be under any linear transformation.





PROBLEM SOLUTIONS

35

(e) A Gaussian distr ibution is written as p(x) =

1

exp 1/2

(2π)d/2 Σ

| |

−

1 (x 2

− µ) Σ− (x − µ) 1

t



.

Under a general linear transformation T, we have that x = Tt x. The transformed mean is n

µ =

n

xk =

n

Tt xk = T t

xk = T t µ.

    − −  − − 

k=1

k=1

k=1

Likewise, the transformed covariance matrix is n

Σ

(xk

=

µ )(xk

µ )t

k=1

n

= Tt

(xk

µ)(xk

µ) T

k=1

= Tt ΣT.

We note that Σ = Tt ΣT = Σ for transformations such as translation and rotation, and thus

| | |

| | |

p(xo N (µ, Σ)) = p(Tt xo N (Tt µ, Tt ΣT)).

|

|

| |

The volume element is proportial to T and for transformations such as scaling, the transformed covariance is proportional to T 2 , so the transformed normalization contant contains 1 / T , which exactly compensates for the change in volume.

| |

| |

(f) Recall the defini tion of a whitening transformation given by Eq. 44 in the text: Aw = ΦΛ −1/2 . In this case we have y = A tw x

∼ N (A

t t w µ, Aw ΣAw ),

and this implies that Var(y) = =

Atw (x µ)(x Atw ΣA



t

− µ) A

w

=

(ΦΛ−1/2 )t ΦΛΦt (ΦΛ−1/2 )

= = =

Λ−1/2 Φt ΦΛΦt ΦΛ−1/2 Λ−1/2 ΛΛ−1/2 I,

the identity matrix. 24. Recall that the general multivariate normal density in d-dimensions is: p(x) =

1 (2π)d/2 Σ 1/2

| |

exp

−

1 (x 2

− µ) Σ− (x − µ) t

1



.

CHAPTER 2. BAYESIAN DECISION THEORY

36

(a) Thus we have if σ ij = 0 and σ ii = σ i2 , then Σ = diag( σ12 ,...,σ σ12 .. . . = . .

  

0

2 d)

···

0 .. .

···

σd2

 

.

Thus the determinant and inverse matrix are particularly simple: d

| Σ| Σ −1

=

σi2 ,

i=1

= diag(1/σ12 ,..., 1/σd2 ).

This leads to the density being expressed as: p(x) = =

1 (2π)d/2 Σ 1/2

| |

1 d



√2πσ

exp i

i=1

− −  −   −   1 (x 2

exp

1 2

−

µ)t diag(1/σ12 ,..., 1/σd2 ) (x

d

xi

2

µi

σi

i=1

µ)

.

(b) The contours of constant density are concentric ellipses in d dimensions whose centers are at ( µ1 ,...,µ d )t = µ, and whose axes in the ith direction are of length 2σi c for the density p(x) held constant at



e−c/2

√ d

.

2πσi

i=1

The axes of the ellipses are parallel to the coor dinate axes. The plot in 2 dimensions (d = 2) is shown: x2

/2 1

c 2

(µ1 , σ2)

σ

2

2σ1 c1/2 x1

(c) The squared Mahal anobis distance from x to µ is: (x

− µ) Σ− (x − µ) t

1

=

(x

1/σ12 .. .

···

0

···

 −  d

=

t

− µ)

 

i=1

xi

µi

σi

2

.

..

.

0 .. . 1/σd2

 

(x

− µ)

PROBLEM SOLUTIONS

37

Section 2.6 25. A useful discriminant function for Gaussians is given by Eq. 52 in the text, gi (x) =

− 12 (x − µ ) Σ− (x − µ ) + ln P (ω ). 1

t

i

i

i

We expand to get

− 12 − 12

gi (x) = =

x t Σ −1 x

µti Σ−1 x

   −− x t Σ −1 x

− x Σ− µ + µ Σ− µ t

1

t i

i

2µt Σ−1 x + µt Σ−1 µ i

i

i

indep. of i



1

i



+ ln P (ωi )

+ ln P (ωi ).

We drop the term that is independent of i, and this yields the equivalent discriminant function: gi (x) = =

µti Σ−1 x

− 12 µ Σ− µ + ln P (ω ) 1

t i

i

i

wit x + wio ,

where wi wio

= Σ−1 µi 1 t −1 µ Σ µi + ln P (ωi ). = 2 i



The decision boundary for two Gaussians is given by µti Σ−1 x

g i (x) = g j (x) or

− 12 µ Σ− µ + ln P (ω ) = µ Σ− x − 12 µ Σ− µ 1

t i

1

t j

i

i

1

t j

j

+ ln P (ωj ).

We collect terms so as to rewrite this as: (µi (µi

− µ ) Σ− x − 12 µ Σ− µ + 12 µ Σ− µ j

− µ ) Σ− j

t

1

t

1

− x

t i

1 (µ 2 i

1

1

t j

i

j

+ ln

P (ωi ) =0 P (ωj )

)/P (ω )](µ − µ ) − µ ) + ln(µ[P (ω − µ ) Σ− µ − µ ) i

i

− This is the form of a linear discriminant wt (x

j

j



j

i

1

t

i





=0

− x ) = 0, o

w = Σ −1 (µi

−µ ) j

and 1 (µ + µj ) 2 i

− ln(µ[P−(ωµ)/P) Σ(ω− )]((µµ −−µµ) ) , i

i

j



1 t −1 1 µ Σ µi + µti Σ−1 µj = 0. 2 j 2

where the weight and bias (offset) are

xo =

j

j

t

j 1

i

i

j

j

CHAPTER 2. BAYESIAN DECISION THEORY

38

respectively. 26. The densities and Mahalanobis distances for our two distributions with the same covarance obey

|



N (µi , Σ), (x µi )t Σ−1 (x

p(x ωi ) ri2 =



−µ ) i

for i = 1, 2. (a) Our goal is to sho w that ri2 = 2Σ−1 (x µi ). Here ri2 is the gradient of r i2 , that is, the (column) vector in d-dimensions given by:





 

 

∂r i2 ∂x 1

.. .

∂r i2 ∂x d

.

We carefully write out the components of ∂r i2 ∂x j

= =

 



2 i

∇r

for j = 1,...,d , and have:

  −      −        −      −     −    − 

∂ (x µi )t Σ−1 (x µi ) ∂x j ∂ xΣ−1 x 2µti Σ−1 x + µ ti Σ−1 µi ∂x j

=

∂ ∂x j

=

∂ ∂x j





indep. of x

d

d

d

k=1 l=1

xk xl ξkl

2

d

d

d

k=1 l=1

d



xk ξkj +

k=1 k =j

d

µi,k ξkj

k=1



= 2 xj ξjj +

xl ξkl

l=1

d

2

xl ξjl

l=1 l =j



µi,k

k=1

l=1 k =l

d

= 2xj ξjj +

d

2

xk xl ξkl

k=1

(where Σ −1 = [ξkl ]d×d )

µi,k ξkl xl d

x2k ξkk +

k=1

d

d

xk ξkj

2

(note that Σ −1 is symmetric)

µi,k ξkj

k=1

k=1 k =j



d

= 2

(xk

µi,k )ξkj

k=1 th

1

×



= 2 j component of Σ − (x µi ). Since this is true for each component j , we have

∇r = 2Σ− (x − µ ). (b) From part (a) we know that ∇r = 2Σ− (x − µ). We can work in the translated 1

2 i

2 i

i

1

frame of reference where the mean is at the srcin. This does not change the derivatives. Moreover, since we are deali ng with a single Gauss ian, we can dispense with the needless subscript indica ting the category. For notational

PROBLEM SOLUTIONS

39 x2 tx a

=

b

x1

simplicity, we set the constant matrix Σ −1 = A. In this translated frame, the derivative is then r2 = 2Ax. For point s along a line through the srcin, we have a t x = b for some contant vector a, which specifies the direction. Different points have different offsets, b. Thus for all points x on the line, we can write



x=

ba

 a

2

.

Thus our derivative is

∇r

2

= 2Ax = 2A

ba

 a

2

.

We are interested in the direction of this derivative vector, not its magnitude. Thus we now normalize the vector to find a unit vector: 2A aba2



A aba2

=

[2Ax]t [2Ax]

ba

  2

a

t

At A aba2

Aa . at At Aa

=

Note especially that this normalized vector is independent of b, and thus independent of the point x on the line. Note too that in the very special case where the Gaussian in hyperspherical (i.e., Σ = q I, the identity matrix and q a scalar), then Aa = 1/q Ia = 1/q a, and At A = 1/q 2 It I = 1/q 2 I. In this case, the derivative depends only upon a and the scalar q . (c) We seek now to show that r12 and r22 point in opposite direction along the line from µ1 to µ2 . As we saw above, ri2 points in the same direction along any line. Hence it is enough to show that r12 and r22 point in the opposite directions at the point µ1 or µ2 . From above we know that µ1 and µ2 are parallel to each other along the line from µ1 to µ2 . Also, ri2 points in the same direction along any line through µ 1 and µ 2 . Thus, r12 points in the same







2

∇ ∇









2



direction along a line through µ1 as r2 along a line through 2µ2 . Thus r1 points in the same direction along the line from µ 1 to µ 2 as r2 along the line from µ2 to µ1 . Therefore r12 and r22 point in opposite directions along the line from µ 1 to µ 2 .





(d) From Eqs. 60 and 62 the text , we know that the separating hyperplane has normal vector w = Σ −1 (µ1 µ2 ). We shall show that the vector perpendicular to surfaces of constant probabilities is in the same direction, w, at the point of intersecton of the line connecting µ1 and µ2 and the discriminanting hyperplane. (This point is denoted x o .)



CHAPTER 2. BAYESIAN DECISION THEORY

40

As we saw in part (a), the gradient obeys r12 = 2Σ−1 (xo µ1 ), Thus, because of the relationship between the Mahalanobis distance and probability density in a Gaussian, this vector is also in the direction normal to surfaces of constant density. Equation 65 in the text states



xo =

Thus

2 1

1 (µ + µ2 ) 2 1



)/P (ω )] − (µ ln− µ[P (ω (µ − µ ). ) Σ− (µ − µ ) 1

1

2

2

1

1

t

1

2

2

∇r evaluated at x is: ∇r = 2Σ− (x − µ ) 2 1

o



1

x=x0

2Σ−1

=

o



Σ −1 ( µ

=

1

1 (µ + µ2 ) 2 1

2

−µ )

=

1

−  1

1

2Σ−1 (µ2 w.



− ln(µ[P −(ωµ)/P) Σ(ω− )]((µµ −−µµ) ) 1

2

t

1

2

1

2

2ln [ P (ω1 )/P (ω2 )] (µ1 µ2 )t Σ−1 (µ1 µ2 )







=scalar constant

−µ ) 1

2 1

 



Indeed, these two vectors are in the same direction. (e) True. We are given that P (ω1 ) = P (ω2 ) = 1/2. As described in the text, for these conditions the Bayes decision boundary is given by: wt (x

−x )

= 0

o

where w = Σ −1 (µ1

−µ ) 2

and =0

xo

1 (µ + µ2 ) 2 1 1 (µ + µ2 ). 2 1



= =

  

ln [P (ω1 )/P (ω2 )] (µ1 (µ1 µ2 )t Σ−1 (µ1 µ2 )





−µ ) 2

Therefore, we have µ2 )t Σ−1 x

(µ1



1

=

(µ1

µ2 )t Σ−1 (µ1

 −

2 1 µt1 Σ−1 µ1 2

=

µ2 )



− µ Σ− µ t 2

1

t −1 µ2 2 + µ1 Σ



 

− µ Σ− µ



=0

t 2

1

1

.

Consequently the Bayes decision boundary is the set of points x that satisfy the following equation: (µ1

− µ ) Σ− x = 12 2

t

1



µt1 Σ−1 µ1

− µ Σ− µ t 2

1

2



.

PROBLEM SOLUTIONS

41

Equal squared Mahalanobis distances imply the following equivalent equations: r12

r22 (x µ2 )t Σ−1 (x µ2 ) xt Σ−1 x + µt2 Σ−1 µ2 2µt2 Σ−1 x 2(µ1 µ2 )t Σ−1 x 1 1 2 2 1 t −1 (µ1 µ2 )t Σ−1 x = (µ1 Σ µ1 µt2 Σ−1 µ2 ). 2 From these results it follows that the Bayes decision boundary consists of points of equal Mahalanobis distance. (x − µ1 ) Σ−1 (x − µ1 ) t −1 x Σ x + µt1 Σ−1 µ1 − 2µt1 Σ−1 x µt Σ−1 µ − µt Σ−1 µ t

= = = =













| ∼

27. Our Gaussian distributions are p(x|ωi) ∼ N(µi, Σ) and the prior probabilities are P(ωi). The Bayes decision boundary for this problem is linear, given by Eqs. 63–65 in the text, wᵗ(x − xo) = 0, where

w = Σ⁻¹(µ1 − µ2)
xo = ½(µ1 + µ2) − [ ln[P(ω1)/P(ω2)] / ((µ1 − µ2)ᵗΣ⁻¹(µ1 − µ2)) ](µ1 − µ2).

The Bayes decision boundary does not pass between the two means, µ1 and µ2, if and only if wᵗ(µ1 − xo) and wᵗ(µ2 − xo) have the same sign, that is, either

wᵗ(µ1 − xo) > 0 and wᵗ(µ2 − xo) > 0

or

wᵗ(µ1 − xo) < 0 and wᵗ(µ2 − xo) < 0.

Substituting the expressions for w and xo gives

wᵗ(µ1 − xo) = ½(µ1 − µ2)ᵗΣ⁻¹(µ1 − µ2) + ln[P(ω1)/P(ω2)]
wᵗ(µ2 − xo) = −½(µ1 − µ2)ᵗΣ⁻¹(µ1 − µ2) + ln[P(ω1)/P(ω2)].

Thus wᵗ(µ1 − xo) > 0 and wᵗ(µ2 − xo) > 0 if and only if

(µ1 − µ2)ᵗΣ⁻¹(µ1 − µ2) < 2 ln[P(ω1)/P(ω2)],

which requires P(ω1) > P(ω2). Likewise, wᵗ(µ1 − xo) < 0 and wᵗ(µ2 − xo) < 0 if and only if

(µ1 − µ2)ᵗΣ⁻¹(µ1 − µ2) < 2 ln[P(ω2)/P(ω1)],

which requires P(ω1) < P(ω2). In sum, the condition that the Bayes decision boundary does not pass between the two means can be stated as follows:

Case 1: P(ω1) ≤ P(ω2). Condition: (µ1 − µ2)ᵗΣ⁻¹(µ1 − µ2) < 2 ln[P(ω2)/P(ω1)]; this ensures wᵗ(µ1 − xo) < 0 and wᵗ(µ2 − xo) < 0.

Case 2: P(ω1) > P(ω2). Condition: (µ1 − µ2)ᵗΣ⁻¹(µ1 − µ2) < 2 ln[P(ω1)/P(ω2)]; this ensures wᵗ(µ1 − xo) > 0 and wᵗ(µ2 − xo) > 0.
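The two cases are easy to verify numerically. The sketch below (not part of the original solution; the means, covariance and priors are assumed illustrative values) evaluates the condition and compares it with the signs of wᵗ(µi − xo).

% Check whether the Bayes boundary passes between the means (illustrative).
mu1 = [0; 0]; mu2 = [1; 0];
Sigma = eye(2);
P1 = 0.95; P2 = 1 - P1;                 % strongly unequal priors
D  = (mu1 - mu2)'*(Sigma\(mu1 - mu2));  % squared Mahalanobis separation
w  = Sigma \ (mu1 - mu2);
x0 = 0.5*(mu1 + mu2) - (log(P1/P2)/D)*(mu1 - mu2);
s1 = w'*(mu1 - x0);
s2 = w'*(mu2 - x0);
fprintf('w''(mu1-x0) = %.4f, w''(mu2-x0) = %.4f\n', s1, s2);
if P1 > P2
    fprintf('condition D < 2 ln(P1/P2) holds: %d\n', D < 2*log(P1/P2));
else
    fprintf('condition D < 2 ln(P2/P1) holds: %d\n', D < 2*log(P2/P1));
end

With these values both signs come out positive and the condition of Case 2 holds, so the boundary lies outside the segment joining the means.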

28. We use Eqs. 42 and 43 in the text for the mean and covariance.

(a) The covariance obeys

σij² = E[(xi − µi)(xj − µj)]
  = ∫_{−∞}^{∞} ∫_{−∞}^{∞} p(xi, xj)(xi − µi)(xj − µj) dxi dxj
  = ∫_{−∞}^{∞} (xi − µi)p(xi) dxi · ∫_{−∞}^{∞} (xj − µj)p(xj) dxj
  = 0,

where in the second step we used the statistical independence of the features, p(xi, xj) = p(xi)p(xj), and in the last step the facts that

∫_{−∞}^{∞} xi p(xi) dxi = µi  and  ∫_{−∞}^{∞} p(xi) dxi = 1.

(b) Suppose we had a two-dimensional Gaussian distribution, that is,

(x1, x2)ᵗ ∼ N( (µ1, µ2)ᵗ, Σ ),  where  Σ = [ σ1²  σ12 ; σ21  σ2² ]

and σ12 = E[(x1 − µ1)(x2 − µ2)]. Furthermore, the joint density is Gaussian, that is,

p(x1, x2) = (1/(2π|Σ|^{1/2})) exp[ −½ (x − µ)ᵗΣ⁻¹(x − µ) ].

If σ12 = 0, then |Σ| = σ1²σ2² and the inverse covariance matrix is diagonal, that is,

Σ⁻¹ = [ 1/σ1²  0 ; 0  1/σ2² ].

In this case, we can write

p(x1, x2) = (1/(2πσ1σ2)) exp[ −½ ( ((x1 − µ1)/σ1)² + ((x2 − µ2)/σ2)² ) ]
  = (1/(√(2π)σ1)) exp[ −½ ((x1 − µ1)/σ1)² ] · (1/(√(2π)σ2)) exp[ −½ ((x2 − µ2)/σ2)² ]
  = p(x1)p(x2).

Although we have derived this for the special case of two dimensions and σ12 = 0, the same method applies to the fully general case in d dimensions and two arbitrary coordinates i and j.

(c) Consider the following discrete distribution:

x1 = +1 with probability 1/2, and x1 = −1 with probability 1/2,

and a random variable x2 conditioned on x1 by

If x1 = +1, then x2 = +1/2 with probability 1/2 and x2 = −1/2 with probability 1/2.
If x1 = −1, then x2 = 0 with probability 1.

It is simple to verify that µ1 = E(x1) = 0; we use that fact in the following calculation:

Cov(x1, x2) = E[(x1 − µ1)(x2 − µ2)] = E[x1x2] − µ1µ2
  = (+1)(+1/2) Pr(x1 = +1, x2 = +1/2) + (+1)(−1/2) Pr(x1 = +1, x2 = −1/2) + 0 · Pr(x1 = −1)
  = ½ · ¼ − ½ · ¼ + 0
  = 0.

Now we consider

Pr(x1 = +1, x2 = +1/2) = Pr(x2 = +1/2 | x1 = +1) Pr(x1 = +1) = ½ · ½ = ¼.

However, we also have

Pr(x2 = +1/2) = Pr(x2 = +1/2 | x1 = +1) Pr(x1 = +1) = 1/4.

Thus we note the inequality

Pr(x1 = +1, x2 = +1/2) = 1/4 ≠ Pr(x1 = +1) Pr(x2 = +1/2) = 1/8,

which verifies that the features are not independent. This shows that even if the covariance is zero, the features need not be independent.


29. Figure 2.15 in the text shows a decision region that is a mere line in three dimensions. We can understand that unusual case by analogy to a one-dimensional problem that has a decision region consisting of a single point, as follows.

(a) Suppose we have two one-dimensional Gaussians, of possibly different means and variances. The posteriors are also Gaussian, and the priors P(ωi) can be adjusted so that the posteriors are equal at a single point, as shown in the figure. That is to say, the Bayes decision rule is to choose ω1 except at that single point, where we should choose ω2.

[Figure: the posteriors P(ωi|x) for ω1 and ω2 versus x; the decision regions are R1, a single-point region R2, and R1 again.]

(b) Likewise, in the three-dimensional case of Fig. 2.15, the priors can be adjusted so that the decision region is a single line segment.

30. Recall that the decision boundary is given by Eq. 66 in the text, i.e., g1(x) = g2(x).

(a) The most general case of a hyperquadric can be written as

xᵗW1x + w1ᵗx + w10 = xᵗW2x + w2ᵗx + w20,

where the Wi are positive definite d-by-d matrices, the wi are arbitrary vectors in d dimensions, and the wi0 are scalar offsets. The above equation can be written in the form

½ xᵗ(Σ1⁻¹ − Σ2⁻¹)x + (Σ1⁻¹µ1 − Σ2⁻¹µ2)ᵗx + (w10 − w20) = 0.

If we have p(x|ωi) ∼ N(µi, Σi) and P(ωi), then we can always choose the Σi of the classifier to match the quadratic terms, and choose means µi such that Σi⁻¹µi = wi.

(b) We can still assign these values if P(ω1) ≠ P(ω2).

Section 2.7

31. Recall that from the Bayes rule, to minimize the probability of error, the decision boundary is given according to:

Choose ω1 if P(ω1|x) > P(ω2|x); otherwise choose ω2.

(a) In this case, since P(ω1) = P(ω2) = 1/2, we have P(ω1|x) ∝ p(x|ω1); and since the class-conditional densities p(x|ωi) have the same σ, the decision rule becomes:

Choose ω1 if |x − µ1| < |x − µ2|; otherwise choose ω2.

Without loss of generality, we assume µ1 < µ2, and thus

P(error) = P(|x − µ1| > |x − µ2| | ω1)P(ω1) + P(|x − µ2| > |x − µ1| | ω2)P(ω2)
  = (1/√(2π)) ∫_a^∞ e^{−u²/2} du,

where a = |µ2 − µ1|/(2σ).

(b) We note that a Gaussian decay overcomes an inverse function for sufficiently large argument, that is,

lim_{a→∞} (1/(√(2π)a)) e^{−a²/2} = 0,

and

Pe = (1/√(2π)) ∫_a^∞ e^{−u²/2} du ≤ (1/√(2π)) ∫_a^∞ (u/a) e^{−u²/2} du = (1/(√(2π)a)) e^{−a²/2}.

With these results we can conclude that Pe goes to 0 as a = |µ2 − µ1|/(2σ) goes to infinity.

32. We note first that the categories have equal priors, P(ω1) = P(ω2) = 1/2, and spherical normal distributions,

p(x|ωi) ∼ N(µi, σ²I).

(a) The Bayes decision rule for this problem is:

Choose ω1 if p(ω1|x) > p(ω2|x); otherwise choose ω2.

Equal priors, P(ω1) = P(ω2) = 1/2, imply P(ωi|x) ∝ p(x|ωi) for i = 1, 2. So the decision rule is to choose ω1 under any of the following functionally equivalent conditions:

p(x|ω1) > p(x|ω2)
exp[ −½(x − µ1)ᵗ(σ²I)⁻¹(x − µ1) ] > exp[ −½(x − µ2)ᵗ(σ²I)⁻¹(x − µ2) ]
(x − µ1)ᵗ(x − µ1) < (x − µ2)ᵗ(x − µ2)
xᵗx − 2µ1ᵗx + µ1ᵗµ1 < xᵗx − 2µ2ᵗx + µ2ᵗµ2
(µ2 − µ1)ᵗx < ½(µ2ᵗµ2 − µ1ᵗµ1),

and otherwise choose ω2; the normalizing factors (2π)^{d/2}|σ²I|^{1/2} are equal for the two densities and cancel. The minimum probability of error Pe can be computed directly from the Bayes decision rule, but an easier way is the following. Since the Bayes decision rule for the d-dimensional problem is determined by the one-dimensional random variable Y = (µ2 − µ1)ᵗX, it is sufficient to consider the two-category one-dimensional problem

p(y|ωi) ∼ N(µ̃i, σ̃²),

with P(ω1) = P(ω2) = 1/2. In this case the mean and variance are

µ̃i = E(Y|ωi) = (µ2 − µ1)ᵗµi
σ̃² = Var(Y|ωi) = (µ2 − µ1)ᵗσ²I(µ2 − µ1) = σ²‖µ2 − µ1‖².

As shown in Problem 31, it follows that

Pe = (1/√(2π)) ∫_a^∞ e^{−u²/2} du,

where

a = |µ̃2 − µ̃1|/(2σ̃) = |(µ2 − µ1)ᵗµ2 − (µ2 − µ1)ᵗµ1| / (2σ‖µ2 − µ1‖) = ‖µ2 − µ1‖² / (2σ‖µ2 − µ1‖) = ‖µ2 − µ1‖/(2σ).

(b) We work in a translated coordinate system in which µ1 = 0 and µ2 = (µ1, ..., µd)ᵗ. We use the result

Pe ≤ (1/(√(2π)a)) e^{−a²/2} → 0 as a → ∞,

where for this case we have

a = ‖µ2 − µ1‖/(2σ) = ‖µ2 − 0‖/(2σ) = (1/(2σ)) ( Σ_{i=1}^d µi² )^{1/2}.

We conclude, then, that Pe → 0 if Σ_{i=1}^d µi² → ∞ as d → ∞. This implies that as d → ∞ the two points µ1, µ2 in d dimensions become well separated (unless the components µi approach 0 so quickly that the sum converges).

→ ∞. This implies that as d → ∞, ∞ the two points µ , µ in d-dimensions become well separated (unless µ 0 (and A A









(a) We are to show that adding a (d+1)st feature cannot increase the Bhattacharyya bound, that is, P_{d+1} ≤ P_d, where P_d = √(P(ω1)P(ω2)) e^{−k(1/2)} and

k(1/2) = (1/8)(µ2 − µ1)ᵗ[(Σ1 + Σ2)/2]⁻¹(µ2 − µ1) + ½ ln[ |(Σ1 + Σ2)/2| / √(|Σ1||Σ2|) ].

Write the augmented quantities as

µ̃i = (µi, µi,d+1)ᵗ,  Σ̃i = [ Σi  ui ; uiᵗ  ci ],  i = 1, 2,

and let

θ = µ2 − µ1,  φ = µ2,d+1 − µ1,d+1,  A = (Σ1 + Σ2)/2,  Ã = (Σ̃1 + Σ̃2)/2 = [ A  u ; uᵗ  c ],

so that u = (u1 + u2)/2 and c = (c1 + c2)/2. For the quadratic term, the standard identity for a partitioned matrix gives

(µ̃2 − µ̃1)ᵗ Ã⁻¹ (µ̃2 − µ̃1) = θᵗA⁻¹θ + (φ − θᵗA⁻¹u)²/(c − uᵗA⁻¹u) ≥ (µ2 − µ1)ᵗ[(Σ1 + Σ2)/2]⁻¹(µ2 − µ1),

since c − uᵗA⁻¹u > 0 for the positive definite Ã; equality holds if and only if φ = θᵗA⁻¹u. For the determinant term, let

αi = ci − uiᵗΣi⁻¹ui, so that |Σ̃i| = |Σi| αi, and α = c − uᵗA⁻¹u, so that |Ã| = |A| α.

In the Gaussian interpretation αi = Var(Xi,d+1 | Xi) is the conditional variance of the added feature for X̃i ∼ N(µ̃i, Σ̃i), and one can show that α ≥ (α1 + α2)/2. Therefore

ln[ |Ã| / √(|Σ̃1||Σ̃2|) ] = ln[ |A| / √(|Σ1||Σ2|) ] + ln[ α/√(α1α2) ] ≥ ln[ |A| / √(|Σ1||Σ2|) ],

since α ≥ (α1 + α2)/2 ≥ √(α1α2), with equality only if α1 = α2. We put these results together and find k̃(1/2) ≥ k(1/2). This implies

P_{d+1} = √(P(ω1)P(ω2)) e^{−k̃(1/2)} ≤ √(P(ω1)P(ω2)) e^{−k(1/2)} = P_d.

That is, P_{d+1} ≤ P_d, as desired.

(b) The above result was derived by adding the (d+1)st dimension and does not depend upon which dimension is added. This is because, by a permutation of the coordinates, any dimension that is added can be designated as the (d+1)st. Such a permutation merely permutes the coordinates of µ̃i, Σ̃i and of µi, Σi, i = 1, 2, and does not change the error bound.

(c) The "pathological case" where equality holds (that is, P_{d+1} = P_d) is when

φ = θᵗA⁻¹u and α1 = α2,

where θ = µ2 − µ1, A = (Σ1 + Σ2)/2, Ã = (Σ̃1 + Σ̃2)/2 = [A u; uᵗ c], φ = µ2,d+1 − µ1,d+1, and αi = ci − uiᵗΣi⁻¹ui. For example, the condition of equality is satisfied if µ̃1 = µ̃2 and Σ̃1 = Σ̃2.

(d) No, the true error can never increase as we go to a higher dimension, as illustrated in Fig. 3.3 in the text.

(e) For non-pathological distributions, k(1/2) → ∞ as d → ∞. Because P_d ∝ e^{−k(1/2)}, the bound P_d goes to zero as d → ∞.

(f) No. First note that

P_d(error) ≤ P_d = √(P(ω1)P(ω2)) e^{−k(1/2)},
P_{d+1}(error) ≤ P_{d+1} = √(P(ω1)P(ω2)) e^{−k̃(1/2)}.

But there is no clear relation between the true error P_{d+1}(error) and the bound P_d. So, even if k̃(1/2) > k(1/2), it is not guaranteed that P_{d+1}(error) < P_d(error).
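The monotonicity of the bound can also be checked numerically. The following sketch (not part of the original solution; the pair of 3-dimensional Gaussians is an assumed illustrative example) computes the Bhattacharyya bound P_d using only the first d features and verifies that it never increases with d.

% Check P_{d+1} <= P_d for the Bhattacharyya bound (illustrative values).
mu1 = [0; 0; 0];  mu2 = [1; 0.5; 0.2];
S1  = [2 0.3 0.1; 0.3 1 0.2; 0.1 0.2 1.5];
S2  = [1 0.2 0.0; 0.2 2 0.1; 0.0 0.1 1.0];
P1 = 0.6; P2 = 0.4;
for d = 1:3
    m1 = mu1(1:d); m2 = mu2(1:d);
    A  = (S1(1:d,1:d) + S2(1:d,1:d))/2;
    k  = (1/8)*(m2-m1)'*(A\(m2-m1)) + ...
         0.5*log(det(A)/sqrt(det(S1(1:d,1:d))*det(S2(1:d,1:d))));
    fprintf('d = %d  k(1/2) = %.4f  bound P_d = %.4f\n', ...
            d, k, sqrt(P1*P2)*exp(-k));
end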

36. First note the definition of k(β) given by Eq. 75 in the text:

k(β) = (β(1 − β)/2) (µ2 − µ1)ᵗ[βΣ1 + (1 − β)Σ2]⁻¹(µ2 − µ1) + ½ ln[ |βΣ1 + (1 − β)Σ2| / (|Σ1|^β |Σ2|^{1−β}) ].

(a) Recall from Eq. 74 we have

e^{−k(β)} = ∫ p^β(x|ω1) p^{1−β}(x|ω2) dx
  = exp[ −(β/2)µ1ᵗΣ1⁻¹µ1 − ((1 − β)/2)µ2ᵗΣ2⁻¹µ2 ] / ( |Σ1|^{β/2} |Σ2|^{(1−β)/2} (2π)^{d/2} )
    × ∫ exp{ −½ [ xᵗ(βΣ1⁻¹ + (1 − β)Σ2⁻¹)x − 2xᵗ(βΣ1⁻¹µ1 + (1 − β)Σ2⁻¹µ2) ] } dx.

(b) Again from Eq. 74, this can be written as

e^{−k(β)} = exp[ −(β/2)µ1ᵗΣ1⁻¹µ1 − ((1 − β)/2)µ2ᵗΣ2⁻¹µ2 ] / ( |Σ1|^{β/2} |Σ2|^{(1−β)/2} )
    × ∫ exp{ −½ (xᵗA⁻¹x − 2xᵗA⁻¹θ) } / (2π)^{d/2} dx,

where

A⁻¹ = βΣ1⁻¹ + (1 − β)Σ2⁻¹  and  A⁻¹θ = βΣ1⁻¹µ1 + (1 − β)Σ2⁻¹µ2.

Thus we conclude that the vector θ is

θ = A( βΣ1⁻¹µ1 + (1 − β)Σ2⁻¹µ2 ).

(c) For the conditions given, we have

∫ exp[ −½ (xᵗA⁻¹x − 2xᵗA⁻¹θ) ] dx = e^{½ θᵗA⁻¹θ} ∫ exp[ −½ (x − θ)ᵗA⁻¹(x − θ) ] dx = (2π)^{d/2} |A|^{1/2} e^{½ θᵗA⁻¹θ},

since

g(x) = exp[ −½ (x − θ)ᵗA⁻¹(x − θ) ] / ( (2π)^{d/2} |A|^{1/2} )

has the form of a d-dimensional Gaussian density. So it follows that

e^{−k(β)} = exp{ −½ [ −θᵗA⁻¹θ + β µ1ᵗΣ1⁻¹µ1 + (1 − β) µ2ᵗΣ2⁻¹µ2 ] } × |A|^{1/2} / ( |Σ1|^{β/2} |Σ2|^{(1−β)/2} ).

Problem not yet solved
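Although the final simplification above is left unsolved, the Chernoff bound itself is straightforward to evaluate numerically from the definition of k(β). The sketch below (not part of the original solution; the two Gaussians and priors are assumed illustrative values) scans β over (0, 1), reports the minimizing β, and compares it with the Bhattacharyya value at β = 1/2.

% Numerical Chernoff bound min over beta of P1^beta P2^(1-beta) e^{-k(beta)}.
mu1 = [0; 0];  mu2 = [2; 1];
S1  = eye(2);  S2 = [3 1; 1 2];
P1 = 0.5; P2 = 0.5;
betas = 0.01:0.01:0.99;
bound = zeros(size(betas));
for i = 1:length(betas)
    b  = betas(i);
    Sb = b*S1 + (1-b)*S2;
    k  = (b*(1-b)/2)*(mu2-mu1)'*(Sb\(mu2-mu1)) + ...
         0.5*log(det(Sb)/(det(S1)^b * det(S2)^(1-b)));
    bound(i) = P1^b * P2^(1-b) * exp(-k);
end
[bmin, idx] = min(bound);
fprintf('Chernoff bound %.4f at beta = %.2f; Bhattacharyya (beta = 0.5): %.4f\n', ...
        bmin, betas(idx), bound(50));   % betas(50) = 0.50

As expected, the minimum over β is never larger than the value at β = 1/2.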

37. We are given that P(ω1) = P(ω2) = 0.5 and

p(x|ω1) ∼ N(0, I),  p(x|ω2) ∼ N(1, I),

where 1 is the two-component vector of 1s.

(a) The inverse matrices are simple in this case: Σ1⁻¹ = Σ2⁻¹ = I. We substitute these into Eqs. 53–55 in the text and find

g1(x) = w1ᵗx + w10 = 0ᵗx − ½·0ᵗ0 + ln(1/2) = ln(1/2)

and

g2(x) = w2ᵗx + w20 = (1, 1)(x1, x2)ᵗ − ½(1, 1)(1, 1)ᵗ + ln(1/2) = x1 + x2 − 1 + ln(1/2).

We set g1(x) = g2(x) and find that the decision boundary is x1 + x2 = 1, which passes through the midpoint of the two means, that is, through (µ1 + µ2)/2 = (0.5, 0.5)ᵗ. This result makes sense because these two categories have the same prior and identical conditional distributions except for their means.

(b) We use Eq. 76 in the text and substitute the values given to find

k(1/2) = (1/8)(1, 1)[ (I + I)/2 ]⁻¹(1, 1)ᵗ + ½ ln[ |(I + I)/2| / √(|I||I|) ] = (1/8)·2 + ½ ln 1 = 1/4.

Equation 77 in the text gives the Bhattacharyya bound as

P(error) ≤ √(P(ω1)P(ω2)) e^{−k(1/2)} = √(0.5 · 0.5) e^{−1/4} = 0.3894.

(c) Here we have P(ω1) = P(ω2) = 0.5 and

µ1 = (0, 0)ᵗ, µ2 = (1, 1)ᵗ, Σ1 = [2 0.5; 0.5 2], Σ2 = [5 4; 4 5].

The inverse matrices are

Σ1⁻¹ = [8/15 −2/15; −2/15 8/15], Σ2⁻¹ = [5/9 −4/9; −4/9 5/9].

We use Eqs. 66–69 and find

g1(x) = −½ xᵗΣ1⁻¹x − ½ ln|Σ1| + ln(1/2)
  = −(4/15)x1² + (2/15)x1x2 − (4/15)x2² − 0.66 + ln(1/2)

and

g2(x) = −½ (x − µ2)ᵗΣ2⁻¹(x − µ2) − ½ ln|Σ2| + ln(1/2)
  = −(5/18)x1² + (4/9)x1x2 − (5/18)x2² + (1/9)x1 + (1/9)x2 − 1/9 − 1.1 + ln(1/2).

The Bayes decision boundary is the solution to g1(x) = g2(x), or

x1² + x2² − 28x1x2 − 10x1 − 10x2 + 50 = 0,

which consists of two hyperbolas, as shown in the figure.

[Figure: the decision boundary in the (x1, x2) plane; the two hyperbola branches separate the central region R1 from the two outer regions R2.]

We use Eqs. 76 and 77 in the text and find

k(1/2) = (1/8)(1, 1)[ (Σ1 + Σ2)/2 ]⁻¹(1, 1)ᵗ + ½ ln[ |(Σ1 + Σ2)/2| / √(|Σ1||Σ2|) ]
  = (1/8)(1, 1)[3.5 2.25; 2.25 3.5]⁻¹(1, 1)ᵗ + ½ ln[ 7.1875/5.8095 ]
  = 0.1499.

Equation 77 in the text gives the Bhattacharyya bound as

P(error) ≤ √(P(ω1)P(ω2)) e^{−k(1/2)} = √(0.5 · 0.5) e^{−0.1499} = 0.4304.
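The numbers above can be reproduced directly. The following sketch (not part of the original solution) computes k(1/2) and the Bhattacharyya bound of Eq. 77 for the distributions of part (c); running the same code with Σ1 = Σ2 = I and the same means gives k(1/2) = 1/4 and the bound 0.3894 of part (b).

% Bhattacharyya bound for the distributions of Problem 37(c).
mu1 = [0; 0];  mu2 = [1; 1];
S1  = [2 0.5; 0.5 2];  S2 = [5 4; 4 5];
P1 = 0.5;  P2 = 0.5;
A  = (S1 + S2)/2;
k  = (1/8)*(mu2-mu1)'*(A\(mu2-mu1)) + 0.5*log(det(A)/sqrt(det(S1)*det(S2)));
fprintf('k(1/2) = %.4f, bound = %.4f\n', k, sqrt(P1*P2)*exp(-k));
% prints k(1/2) = 0.1499 and bound = 0.4304, matching the values above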

38. We derive the Bhattacharyya error bound without first examining the Chernoff bound as follows.

(a) We wish to show that min[a, b] ≤ √(ab). We suppose, without loss of generality, that a ≤ b, or equivalently b = a + δ for some δ ≥ 0. Thus

√(ab) = √(a(a + δ)) ≥ √(a²) = a = min[a, b].

(b) Using the above result and the formula for the probability of error given by Eq. 7 in the text, we have

P(error) = ∫ min[ P(ω1)p(x|ω1), P(ω2)p(x|ω2) ] dx
  ≤ √(P(ω1)P(ω2)) ∫ √( p(x|ω1) p(x|ω2) ) dx
  ≤ ρ/2,

where ρ = ∫ √(p(x|ω1)p(x|ω2)) dx, and where for the last step we have used the fact that √(P(ω1)P(ω2)) ≤ 1/2,

39. We assume the underlying distributions are Gaussian.

(a) Based on the Gaussian assumption, we can calculate (x* − µ2)/σ2 from the hit rate Phit = P(x > x* | x ∈ ω2). We can also calculate (x* − µ1)/σ1 from the false alarm rate Pfalse = P(x > x* | x ∈ ω1). Since σ1 = σ2 = σ, the discriminability is simply

d′ = (µ2 − µ1)/σ = (x* − µ1)/σ − (x* − µ2)/σ.

(b) Recall the error function from Eq. 96 in the Appendix of the text, in terms of which we can express the tail probability of a Gaussian as

P(x > x*) = 1/2 − erf[(x* − µ)/σ],

and thus

(x* − µ)/σ = erf⁻¹[1/2 − P(x > x*)].

We let Phit = P(x > x* | x ∈ ω2) and Pfalse = P(x > x* | x ∈ ω1). The discriminability can then be written as

d′ = (µ2 − µ1)/σ = (x* − µ1)/σ − (x* − µ2)/σ = erf⁻¹[1/2 − Pfalse] − erf⁻¹[1/2 − Phit].

We substitute the values for this problem and find

d′1 = erf⁻¹[0.2] − erf⁻¹[−0.3] = 0.52 + 0.84 = 1.36
d′2 = erf⁻¹[0.1] − erf⁻¹[−0.2] = 0.26 + 0.52 = 0.78.

(c) According to Eq. 70 in the text, we have

Case 1: P(error) = ½[0.3 + (1 − 0.8)] = 0.25
Case 2: P(error) = ½[0.4 + (1 − 0.7)] = 0.35.

(d) Because of the symmetry property of the ROC curve, the point (Phit, Pfalse) and the point (1 − Phit, 1 − Pfalse) lie on the same curve corresponding to some fixed d′. For case B, (0.1, 0.3) is therefore a point on the same ROC curve as (0.9, 0.7). Comparing this point with case A, whose curve goes through (0.8, 0.3), and using Fig. 2.20 in the text, we see that case A has the higher discriminability d′.
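A short computation reproduces the discriminability values in part (b). The sketch below (not part of the original solution) uses MATLAB's built-in erfinv; with that convention the inverse used above corresponds to the standard normal quantile, here written as z(q) = √2·erfinv(2q − 1) — an assumption about the text's erf convention, chosen so the numbers come out as printed.

% Discriminability d' from hit and false-alarm rates (Problem 39(b)).
z = @(q) sqrt(2)*erfinv(2*q - 1);        % standard normal quantile
dprime = @(Phit, Pfalse) z(1 - Pfalse) - z(1 - Phit);
fprintf('case A: d'' = %.2f\n', dprime(0.8, 0.3));   % about 1.36
fprintf('case B: d'' = %.2f\n', dprime(0.7, 0.4));   % about 0.78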







40. We are to assume that the two Gaussians underlying the ROC curve have different variances. (a) From the hit rate P hit = P (x > x∗ x ω2 ) we can calculate (x∗ µ2 )/σ2 . From the false alarm rate P false = P (x > x∗ x ω 1 ) we can calculate ( x∗ µ1 )/σ1 . Let us denote the ratio of the standard deviations as σ 1 /σ2 = K . Then we can write the discriminability in this case as

| ∈

µ2 da =

µ1

x∗

µ2

 √−   √− σ1 σ2

=

σ1 σ2



| ∈

x∗

x∗

µ1

− √−

σ1 σ2

=



µ2

x∗

µ1

− − − − | ∈

σ2 /K



Kσ 1



.

Because we cannot determine K from (µ2 x∗ )/σ2 and (x∗ µ1 )/σ1 , we cannot determine d  uniquely with only Phit = P (x > x∗ x ω 2 ) and Pfalse = P (x > x∗ x ω1 ).

| ∈



(b) Suppose we are given the following four experime ntal rates: Phit1 = P (x > x∗1 ω2 ) Phit2 = P (x > x∗2 ω2 )

| |

Pfalse 1 = P (x > x∗1 ω1 ) Pfalse 2 = P (x > x∗2 ω1 ).

| |

PROBLEM SOLUTIONS

57

Then we can calculate the four quantities

a2

x∗1

−µ x∗ − µ =

a1 =

2

σ2

2

2

σ2

= erf −1 [1/2

−P = erf − [1/2 − P 1

x∗1

−µ x∗ − µ =

hit1 ]

b1 =

hit2 ]

b2

1

σ1

2

1

σ1

= erf −1 [1/2

false 1 ]

for x ∗1

false 2 ]

for x ∗2 .

−P = erf − [1/2 − P 1

Then we have the following relations: a1 b1

−a −b

2

=

2

=

K

=

x∗1

− x∗ x∗ − x∗ σ a −a σ = b −b σ 2

σ2

1

2

1

1

2

1

1

2

2

.

Thus, with K we can calculate d a as da

= =

 √ −   −  − −− − µ2

 

µ1 µ2 x∗1 x ∗1 µ1 = σ1 σ2 σ2 /K Kσ 1 (a1 a2 )a1 (b1 b2 )b1 . b1 b2 a1 a2

− − − −

(c) For all those x ∗1 and x ∗2 that satisfy x∗1

−µ

=

−µ

=

2

σ2

− x∗ σ− µ 2

2

2

or x∗1

1

σ1

∗ − x σ− µ 2

1

.

1

That is, the two different thresh olds do not provide any additional informa tion and conveys the same information as only one obser vation. As explained in part (a), this kind of result would not allow us to determine d a . (d) See figure. 1

p(x|ωi) )1 0.8

0.4

ω ε

ω1

|x 0.6 *

0.3

0.2

< x x ( 0.4 P

ω2

0.2

0.1

-2

0

x*

24

41. We use the notation shown in the figure.

x

0.2

0.4

0.6

0.8

P(x x∗ ω2 )dx, false alarm rate F = x∗ ω1 )dx, and miss rate M = 1 F . The hit rate is

|

|

x∗

H

= 1

T (µ2 , δ )dx = 1



µ2 δ

µ1 +δ

F

=

T (µ1 , δ )dx =

x∗

∗ − (x −2δµ

2+ 2

δ )2

(µ1 + δ x∗ )2 2δ 2



− √2F − 2(1 √ − H) (2 − d − 2F ) H = 1− 2  which is valid for 0 ≤ F ≤ (2 − d ) /2. For the second case µ − δ ≤ x∗ < µ , that is, 0 ≤ d ∗ ≤ 1, we have (x∗ − µ + δ ) H = 1− 2δ (x∗ − µ + δ ) F = 1− 2δ 2(1 − F ) − 2(1 − H ) d = ( 2(1 − F ) − d ) H = 1− 2 dT

= 1

T

T

2



2

2

1

T

1 2 1 2

T

 

2

2

T

2



P (x >

PROBLEM SOLUTIONS

59

≤ F ≤ 1 − d /2. For the third case, µ ≤ x∗ ≤ µ + δ and 0 ≤ d  ≤ 1 we have (µ + δ − x∗ ) F = 2δ (µ + δ − x∗ ) H = √2H 2δ√2F , d =  − valid for 0 ≤ F ≤ (1 − d ) /2. 2 T

which is valid for 1 /2

2

1

T

2

1

2

2

2

2

T

T

2

(b) If dT = 2.0, then the two densiti es do not overlap. If x∗ is closer to the mean of ω1 than to the mean of ω2 , then the modifi ed ROC curve goes through P (x > x∗ x ω 1 ) = 1, P (x > x∗ x ω 2 ) = 0. Alternatively, if x∗ is closer to the mean of ω2 than to the mean of ω1 , then we get the con verse. In short, there is no meaningful value of d T greater than 2.0.

| ∈

| ∈

1

0.8 )1

ω0.6 ε

dT'=1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1

|x * x 0. First, note that b L (p) is symmetry with respect to the interchange p (1 p), and thus it is symmetric with respect to the value p = 0.5. We need consider only the range 0 p 1/2 as the limit properties we prove there will also hold for 1 /2 p 1. In the range 0 p 1/2, we have min[ p, 1 p] = p.

↔ −

≤ ≤

≤ ≤

≤ ≤



CHAPTER 2. BAYESIAN DECISION THEORY

60

The simplest way to prove that bL (p) is a lower bound for p in the range 0 p 1/2 is to note that bL (0) = 0, and the derivative is always less than the derivative of min[p, 1 p]. Indeed, at p = 0 we have





bL (0) =

 

1 1 + e−β ln β 1 + e−β



1 ln[1] = 0. β

=

Moreover the derivative is



eβ e2βp eβ + e2βp ∂ < 1= min[p, 1 ∂p

∂ b (p) = L ∂p

in the range 0

− p]

≤ p ≤ 1/2 for β < ∞.

(b) To show that b L (p) is an arbitrarily tight bound, we need show only that in the limit β , the derivative, ∂ bL (p)/∂p approaches 1, the same as

→∞

∂ min[p, 1 ∂p

− p]

in this range. Using the results from part (a) we find lim

β

in the range 0



→∞ ∂p

bL (p) =

lim β



2βp

−e

=1

→∞ eβ + e2βp

≤ p < 1/2.

(c) Our candidate upper bound is specified by bU (p) = b L (p) + [1

− 2b

L (0.5)]bG (p),

where g U (p) obeys several simple conditions, restated in part (d) below. We let bL (p) = p θ(p), where from part (a) we know that θ(p) is non-negative and in fact is at least linear in p. By the conditions given, we can write bG (p) = p+φ(p), where φ(p) is non-negative and φ(0) = φ(1/2) = 0. Then our candidate upper limit obeys



bU (p) = p = p

− θ(p) + [1 − 2(1/2 − θ(1/2))](p + φ(p)) − θ(p) + θ(1/2)(p + φ(p)).

We show that this is an upper bound by calculating the difference between this bound and the Bayes limit (which is min[ p, 1 p] = p in the range 0 p 1/2).



≤ ≤

Thus we have bU (p)

−p



= θ(p) + pθ(1/2) + θ(1/2)φ(p) > 0.

(d) We seek to confirm that b G (p) = 1/2sin[πp] has the following four properties:

•b







≤ ≤

min[p, 1 p]: Indeed, 1 /2 sin[πp] p for 0 p 1/2, with equality G (p) holding at the extremes of the interval (that is, at p = 0 and p = 1/2). By symmetry (see below), the relation holds for the interval 1 /2 p 1.

≤ ≤

PROBLEM SOLUTIONS

•b

61



p): Indeed, the sine function is symmetric about the point G (p) = b G (1 π/2, that is, 1 /2 sin[ π/2 + θ] = 1/2 sin[ π/2 θ]. Hence by a simple substitution we see that 1 /2 sin[πp] = 1/2 sin[π(1 p)].



•b

·



·

G (0) = b G (1) = 0: Indeed, 1 /2 sin[π 0] = 1 /2 sin[ π 1] = 0 — a special case of the fact b G (p) = b G (1 p), as shown immediately above.



•b

G (0.5) = 0 .5:

·

Indeed, 1 /2 sin[π0.5] = 1 /2 1 = 0.5.

(e) See figure. min[p,1-p]

0.5

0.5 bL(p;β=50)

0.4

bU(p;β=1) bU(p;β=10)

0.4

min[p,1-p]

bL(p;β=10) 0.3

0.3

bU(p;β=50) 0.2

0.2 bL(p;β=1)

0.1

0.1

0.2

0.4

0.6

0.8

1

p

0.2

0.4

0.6

0.8

1

p

Section 2.9 43. Here the components of the vector x = (x1 ,...,x and

d)

t

are binary-valued (0 or 1),

i = 1,...,d j = 1,...,c.

|

pij = Pr[xi = 1 ωj ]

(a) Thus pij is simply the probability we get a 1 in feature xi given that the category is ω j . This is the kind of probability structure we find when each categ ory has a set of independent binary features (or even real-valued features, thresholded in the form “ yi > yi0 ?”). (b) The discriminant functions are then

|

gj (x) = ln p(x ωj ) + ln P (ωj ). The components of x are statistically independent for all x in ωj , then we can write the density as a product:

|

t d)

p(x ωj ) = p((x1 ,...,x d

=

|ω ) j

d

pxiji (1

p(xi ωj ) =

|



i=1



i=1

Thus, we have the discriminant function

pij )1−xi .



d

gj (x) =

 

[xi ln p ij + (1

i=1 d

=

i=1

− x ) ln (1 − p i

ij )] + ln

P (ωj )

d

xi ln



pij + ln (1 1 pij i=1



−p

ij ) + ln

P (ωj ).

CHAPTER 2. BAYESIAN DECISION THEORY

62

44. The minimum probability of error is achieved by the following decision rule: Choose ω k if g k (x)

≥ g (x) for all j = k, j

where here we will use the discriminant function

|

gj (x) = ln p(x ωj ) + ln P (ωj ). The components of x are statistically independent for all x in ω j , and therefore, d

|

p(x ωj ) = p((x1 ,...,x

t d)

|ω ) = j



|

p(xi ωj ),

i=1

where

| | − |

pij = Pr[ xi = 1 ωj ], qij = Pr[ xi = 0 ωj ], rij = Pr[ xi = 1 ωj ]. As in Sect. 2.9.1 in the text, we use exponents to “select” the proper probability, that is, exponents that have value 1.0 when xi has the value corresponding to the particular probability and value 0.0 for the other values of x i . For instance, for the pij term, we seek an exponent that has value 1.0 when x i = +1 but is 0.0 when xi = 0 and when xi = 1. The simplest such exponent is 12 xi + 12 x2i . For the q ij term, the simplest exponent is 1 x2i , and so on. Thus we write the class-conditional probability for a single component x i as:





1

|

p(xi ωj ) =

i = 1,...,d j = 1,...,c

1 xi + 12 x2i 1 x2i x + 1 x2 qij rij 2 i 2 i



pij2



and thus for the full vector x the conditional probability is d



|

p(x ωj ) =

1

1 xi + 12 x2i 1 x2i x + 1 x2 qij rij 2 i 2 i .





pij2

i=1

Thus the discriminant functions can be written as

 |

gj (x) = ln p(x ωj ) + ln P (ωj ) d

=

i=1 d

x2i ln

=



1 1 xi + x2i ln p ij + (1 2 2

√p

ij rij

+

i=1

d

1

xi ln

2 i

− x )ln q pij

ij

+

−

d

+

ln q ij + ln P (ωj ),

i+1 qij 2 i=1 rij which are quadratic functions of the components x i . 45. We are given that P (ω1 ) = P (ω2 ) = 1/2 and





pi1 pi2

= =

1 1 xi + x2i ln r ij 2 2



p > 1/2 1 p i = 1,...,d,



where d is the dimension, or number of features.



+ ln P (ωj )

PROBLEM SOLUTIONS

63

(a) The minimum error decision rule is d



Choose ω 1 if

d



xi ln

i=1

This rule can be expressed as

+



i=1

>

d

p 1 p

xi ln





1 p p

   − −   −  × − d

xi

p

ln

1

i=1

ln

p

d

xi

ln

i=1

1

p

2

p



i=1

ln p + ln 12 .

> d ln p

p

p

1

d

+

1 2

− p) + ln

ln (1

i=1

− d ln (1 − p) p

> d ln

1

−p

or simply d



xi > d/2.

i=1

(b) We denote the minimum probability of error as P e (d, p). Then we have: Pe (d, p) =

|

|

P (error ω1 )P (ω1 ) + P (error ω2 )P (ω2 ) d

=

 ≤   ×   ≤ −   ×     −

P

i=1

xi

d/2

ω1

d

1/2 + P

d

As d is odd, and

is an integer, we have

i=1

d

Pe (d, p) = P

d

xi

2

i=1



(d 1)/2

= 1/2 We substitute d d−k  , and find

 

= d



=

k =0



(d 1)/2

=

k=0

(c) We seek the limit for p

d k k p (1

d k p (1 k

k=(d+1)/2

d+1 ω2 2

d (1 k



(d 1)/2 d k

p)



+ 1/2  k =0

e

  −

k=0

d k

lim pk (1

→1/2

p

− p) −

d k

1/2

k k

− p) p .

d k k  p (1

 

→ 1/2 of P (d, p). Formally, we write

lim Pe (d, p) =

→1/2

d

p)d−k .

(d 1)/2 p

xi

i=1

p)d−k + 1/2

   −  −

1/2

d

1/2 + P

1/2.

− k in the second summation, use the fact that

(d 1)/2

Pe (d, p)

ω1

d k p (1 k

k=0

k

1

i=1

 ×   × 

xi > d/2 ω1

  ≥  

 d k

d k

− p) −

=

CHAPTER 2. BAYESIAN DECISION THEORY

64

                 × −

(d 1)/2

=

d k

k=0

d

1 2

=

d

1 2

d

1 2

d k

k=0



d (d 1)/2

1 2

=

k=0

d

1 2

=

d k

2d 1 = . 2 2

→ 1/2, the probability of error will be 1 /2. (d) Note that as d → ∞, the binomial probabilit y p (1 − p) − can be approximated by the normal density with mean dp and variance dp q . Thus we have Indeed, in the case p

d k



d

lim

→∞

→∞

d

=

lim P (0

→∞

=

lim P

→∞

d

=

lim P

→∞

d

As P > 1/2, lim

→∞

d

d k p (1 k

k=0

d

k

d k

  −  √≤− ≤≤ −≤ −√ − ∼ − ≤ ≤ √ √− − 

(d 1)/2

lim Pe (d, p) =



X

p)d−k

(d

dp dp q

1)/2) where X (d

Z

dp q

Z

−√dp q = −∞ and

limit of very large dimension, we have

lim

→∞

d

lim Pe (d, p) = Pr(

1)/2 dp dp q z(1/2 p) dp q

where Z

d/2

∼ N (0, 1)

.

√d (1/2 − p) = −∞.

Z

d

N (dp, dp q )

Thus in t he

) = 0.

−∞ ≤ ≤ −∞

→∞

46. The general minimum-risk discriminant rule is given by Eq. 16 in the text: Choose ω 1 if (λ11

−λ

21 )p(x

|ω )P (ω ) < (λ − λ 1

1

|ω )P (ω ); otherwise choose ω .

12 )p(x

22

2

2

Under the assumption λ 21 > λ11 and λ 12 > λ22 , we thus have Choose ω 1 if

(λ21 (λ12

or

−λ −λ

11 )p(x

| |

ω1 )P (ω1 ) > 1, 22 )p(x ω2 )P (ω2 )

| |

p(x ω1 ) P (ω1 ) λ21 + ln + ln p(x ω2 ) P (ω2 ) λ12

Choose ω 1 if ln

−λ −λ

11

> 0.

22

Thus the discriminant function for minimum risk, derived by taking logarithms in Eq. 17 in the text, is: g(x) = ln

| + ln P (ω ) + ln λ − λ p(x|ω ) P (ω ) λ −λ p(x ω1 ) 2

1

21

11

2

12

22

.

For the case of independent features given in this problem, we have ln

| |

p(x ω1 ) = p(x ω2 )

d

d

  w i xi +

i=1

i=1

ln

1 1

−p , −q

where wi = ln

pi (1 qi (1

−q ) − p ), i i

i = 1,...,d.

i

i

2

PROBLEM SOLUTIONS

65

Therefore, the discriminant function can be written as: d

g(x) =

d

  w i xi +

i=1 t

ln

i=1

1 1

−p −q

i

P (ω1 ) λ21 + ln P (ω2 ) λ12

+ ln

i

−λ −λ

11 22

= w x + w0 .

47. Recall the Poisson distribution for a discrete variable x = 0, 1, 2,... , is given by P (x λ) = e −λ λx . x!

|

(a) The mean, or expectatio n of this distribution is defined as

E [x] =





|

xP (x λ) =

x=0





xe−λ

x=0





λx λx−1 = λe−λ = λe −λ eλ = λ. x! (x 1)! x=1



(b) The variance of this distribut ion is 2

2

E − E [x]) ] = E [x ] − (E [x])

Var[x] = [(x

2

  

.

λ2

To evaluate this variance, we need [x2 ]:

E

2

E [x ]

=

E [x(x − 1) + x] = E[x(x − 1)] + E [x] ∞

=

λλ

x

− x! x=0(x(x − 1)e ∞ λx−2 2 −λ

 

= λ e

x=2

= λ2 e−λ

(x





− 2)! + λ 

λx +λ x ! x =0

= λ2 e−λ eλ + λ = λ2 + λ, where we made the substitution x above results together and find 2

← (x − 2) in the fourt h step. 2

2

2

We put the

2

E − E [x]) ] = E [x ] − (E [x]) = λ + λ − λ = λ. (c) First recall the notation λ, read “floor of λ,” indicates the greatest integer less than λ. For all x < λ ≤ λ, then, we have λ λ − λ λ − = > . x! (x − 1)! x (x − 1)! Var[x] = [(x

x

x 1

x 1

 >1

 ≤

That is, for x < λ λ, the probability increases as x increases. Conversely, for x > λ λ we have

≥ 

λx λx−1 λ λx−1 = > . x! (x 1)! x (x 1)!



 λ λ , the probability decreases as x increases. Thus, the probability is maximized for integers x [ λ , λ]. In short, if λ is not an integer, then x = λ is the mode; if λ is an integer, then both λ and λ are modes.

∈





(d) We denote the two distributions P (x λi ) = e −λi

|

λxi x!

for i = 1, 2, and by assumption λ 1 > λ2 . The likelihood ratio is x P (x λ1 ) e−λ1 λx λ1 = −λ2 1x = eλ2 −λ1 . P (x λ2 ) e λ2 λ2

 

| |

Thus the Bayes decision rule is Choose ω 2

x

λ1 > 1, λ2 (λ2 λ1 ) x< ; ln[λ1 ] ln[λ2 ] eλ2 −λ1

if

− −

or equivalently if otherwise choose ω 1 ,

as illustrated in the figure (where the actual x values are discrete). P(x|ωi)

ω2

0.2

0.15

ω1

0.1

0.05

2

4

6

8

λ2

x*

λ1

10

12

x

14

(e) The conditional Bayes error rate is



P (error x) = min e−λ1

|



λx1 −λ2 λx2 ,e . x! x!

The Bayes error, given the decision rule in part (d) is x∗

eλ2

PB (error) =

λx2

x=0

where x ∗ = (λ2

 − λ )/(ln[λ ] 1

1

x! ln[λ2 ]) .

 −





+

e−λ1

x=x∗

λx1

,

x!



Section 2.10 48. In two dimension s, the Gaussian distribution is

|

p(x ωi ) =

1 2π Σi

exp 1/2

| |

−

1 (x 2

− µ ) Σ− (x − µ ) i

t

i

1

i



.

PROBLEM SOLUTIONS

67

(a) By direct calcula tion using the densitie s stated in the problem, we find that for x = .3 .3 that p(x ω1 )P (ω1 ) = 0.04849, p(x ω2 )P (ω2 ) = 0.03250 and p(x ω3 )P (ω3 ) = 0.04437, and thus the pattern should be classified as category ω 1 .



|

|

|



∗ , that is, a vector whose first component is missing and its second (b) To classify .3 component is 0.3, we need to marginalize over the unknow n feature. Thus we compute numerically ∞ P (ωi )p

 ∗   ωi

.3

= P (ωi )

p

x .3

ωi

   

−∞

dx

and find that P (ω1 )p(( , .3)t ω1 ) = 0.12713, P (ω1 )p(( , .3)t ω2 ) = 0.10409, and P (ω1 )p(( , .3)t ω3 ) = 0.13035. Thus the pattern should be categorized as ω 3 .





|

|



|

(c) As in part (a), we calculat e numerically P (ωi )˜ p

   .3

ωi



    ∞

= P (ωi )

.3 y

p

−∞

ωi

dy

and find that P (ω1 )p((.3, )t ω1 ) = 0.12713, P (ω1 )p((.3, )t ω2 ) = 0.10409, and P (ω1 )p((.3, )t ω3 ) = 0.11346. Thus the pattern should be categorized as ω 1 .

∗ |

∗ |

∗ |

(d) We follow the procedur e in part (c) above: x = (.2, .6)t

• P (ω )p(x|ω ) = 0.04344. • P (ω )p(x|ω ) = 0.03556. • P (ω )p(x|ω ) = 0.04589. 1

1

2

2

3

3

Thus x = (.2, .6)t should be categorized as ω 3 . x = ( , .6)t



• P (ω )p(x|ω ) = 0.11108. • P (ω )p(x|ω ) = 0.12276. • P (ω )p(x|ω ) = 0.13232. Thus x = (∗, .6) should be categorized as ω . x = (.2, ∗) • P (ω )p(x|ω ) = 0.11108. • P (ω )p(x|ω ) = 0.12276. • P (ω )p(x|ω ) = 0.10247. Thus x = ( , .6) should be categorized as ω . ∗ 49. Equation 95 in the text states g (x)p(x)p(x |x )dx P (ω |x , x ) = . p(x)p(x |x )dx 1

1

2

2

3

3 t

3

t

1

1

2

2

3

3 t

i

2

g

b



i

b

b

t

t

t

t

We are given that the “true” features, x t , are corrupted by Gaussian noise to give us the measured “bad” data, that is,

|

p(xb xt ) =

1 (2π)d/2 Σ

| |

exp

−

1 (xt 2



− µ ) Σ− (x − µ ) i

t

1

t

i

.

CHAPTER 2. BAYESIAN DECISION THEORY

68

We drop the needless subscript i indicating category membership and substitute this Gaussian form into the above. The constant terms, (2 π)d/2 Σ , cancel, and we then find that the probability of the category given the good and the measured bad data is

| |

 

|

P (ω xg , xb ) =

g(x)p(x)exp[ 12 (xt µ)t Σ−1 (xt µ)]dxt . p(x)exp[ 12 (xt µ)t Σ−1 (xt µ)]dxt ]dxt













After integration, this gives us the final answer,

|

P (ω xg , xb ) =

|

P (ω)p(xg , xb ω) , p(xg , xb )

which is the result from standard Bayesian considerations. Section 2.11 50. We use the values from Example 4 in the text. (a) For this case, the probabilities are: P (a1 ) P (a2 ) P (b1 ) P (b ) 2 P (d1 ) P (d2 )

= = = = = =

P (a4 ) = 0.5 P (a3 ) = 0 1 0 0 1.

Then using Eq. 99 in the text we have PP (x1 )



|

|

P (x1 a1 , b1 )P (a1 )P (b1 ) + 0 + 0 + 0 + 0 + 0 + P (x1 a4 , b1 )P (a4 )P (b1 ) + 0 0.9 0.65 0.8 0.65 = 0.5 1 + 0.5 1 0.9 0.65 + 0.1 0.35 0.8 0.65 + 0.2 0.35 = 0.472 + 0.441 = 0.913.

·

·

· ·

·

·

·

·

·

A similar calculation gives PP (x2 )



|

|

P (x2 a1 , b1 )P (a1 )P (b1 ) + 0 + 0 + 0 + 0 + 0 + P (x2 a4 , b1 )P (a4 )P (b1 ) + 0 0.35 0.1 0.2 0.35 = 0.5 1 + 0.5 1 0.9 0.65 + 0.1 0.35 0.8 0.65 + 0.2 0.35 = 0.87.

·

·

·

·

·

·

·

·

Since the lightness is not measured, we can consider only thickness PC (x1 )



| |

P (eD x2 ) = P (eD d1 )P (d1 x1 ) + P (eD d2 )P (d2 x1 ) = 0 0.4 + 1 0.6 = 0.6.

·

·

|

|

|

PROBLEM SOLUTIONS

69

We normalize these and find

| P (x |e)

P (x1 e) = =

2

· ·

0.913 0.6 = 0.992 0.913 0.6 + 0.087 0.05 0.087 0.05 = 0.008. 0.913 0.6 + 0.087 0.05

· ·

· ·

Thus, given all the evidence e throughout the belief net, the most probable outcome is x , that is, salmon. The expected error is the probability of finding 1 a sea bass, that is, 0 .008. (b) Here we are given that the fish is thin and medium ligh tness, which implies

| |

|

P (eD d1 ) = 0, P (eD d2 ) = 1 P (eC c1 ) = 0, P (eC c2 ) = 1, P (eC c3 ) = 0.

|

|

and as in part (a) we have



PC (x1 )

|

| |

P (eC x1 )P (eD x1 ) = [P (eC c1 )P (c1 x1 ) + P (eC c2 )P (c2 x1 ) + P (eC c3 )P (c3 x1 )] [P (eD d1 )P (d1 x1 ) + P (eD d2 )P (d2 x1 )] = [0 + 1 0.33 + 0][0 + 1 0.6] = 0.198.

×

|

·

|

|

|

·

|

|

|

|

|

Likewise we have



PC (x2 )

|

| |

P (eC x2 )P (eD x2 ) = [P (eC c1 )P (c2 x2 ) + P (eC c2 )P (c2 x2 ) + P (eC c3 )P (c3 x2 )] [P (eD d1 )P (d1 x2 ) + P (eD d2 )P (d2 x2 )] = [0 + 1 0.1 + 0][0 + 1 0.05] = 0.005.

×

|

·

|

|

|

·

|

|

|

|

|

We normalize and find

| P (x |eC D )

0.198 = 0.975 0.198 + 0.005 0.005 = 0.025. 0.198 + 0.005

P (x1 eC ,D ) = 2

,

=

Thus, from the evidence of the children nodes, we classify the fish as

x1 , that

is, salmon. Now we infer the probability P (ai x1 ). We have P (a1 ) = P (a2 ) = P (a3 ) = P (a4 ) = 1/4. Then we normalize and find

|

|

P (a1 x1 ) = = =

|

P (x1 a1 )P (a1 ) P (x1 a1 ) P (a1 ) + P (x1 a2 ) P (a2 ) + P (x1 a3 )P (a3 ) + P (x1 a4 )P (a4 ) 0.9 0.25 0.9 0.25 + 0.3 0.25 + 0.4 0.25 + 0.8 0.25 0.375.

| ·

·

·

| · · ·

|

·

|

CHAPTER 2. BAYESIAN DECISION THEORY

70 We also have

| | |

P (a2 x1 ) = 0.125 P (a3 x1 ) = 0.167 P (a4 x1 ) = 0.333. Thus the most probable season is a1 , winter. The probability of being correct is

|

|

·

P (x1 eC,D )P (a1 x1 ) = 0.975 0.375 = 0.367. (c) Fish is thin and medium lightness, so from part (b) we have

|

|

P (x1 eC,D ) = 0.975, P (x2 eC ,D ) = 0.025, and we classify the fish as salmon.

|

|

For the fish caught in the north Atlantic, we have P (b1 eB ) = 1 and P (b2 eB ) = 0.

|



|



P (a3 x1 , b1 )

|



|



P (a1 x1 , b1 )

|

| · |

P (x1 a1 )P (x1 b1 )P (a1 )P (b1 ) = 0.9 0.65 0.25 = 0.146 P (a2 x1 , b1 ) P (x1 a2 )P (x1 b1 )P (a2 )P (b1 ) = 0.3 0.65 0.25 = 0.049

·

|

·

· | · |· · · · | | · · ·

P (x1 a3 )P (x1 b1 )P (a3 )P (b1 ) = 0.4 0.65 0.25 = 0.065 P (a4 x1 , b1 ) P (x1 a4 )P (x1 b1 )P (a4 )P (b1 ) = 0.8 0.65 0.25 = 0.13. So the most probable season is a 1 , winter. The probability of being correct is

|

|

·

P (x1 eC,D )P (a1 x1 ) = 0.975 0.375 = 0.367. 51. Problem not yet solved Section 2.12 52. We have the priors P (ω1 ) = 1/2 and P (ω2 ) = P (ω3 ) = 1/4, and the Gaussian densities p(x ω1 ) N (0, 1), p(x ω2 ) N (0.5, 1), and p(x ω3 ) N (1, 1). We use the general form of a Gaussian, and by straightforward calculation find:

| ∼

| ∼

x 0.6 0.1 0.9 1.1

|

p(x ω1 ) 0.333225 0.396953 0.266085 0.217852

|

|

p(x ω2 ) 0.396953 0.368270 0.368270 0.333225



|

p(x ω3 ) 0.368270 0.266085 0.396953 0.396953

We denote X = (x1 , x2 , x3 , x4 ) and ω = (ω(1), ω(2), ω(3), ω(4)). Using the notation in Section 2.12, we have n = 4 and c = 3, and thus there are cn = 34 = 81 possible

PROBLEM SOLUTIONS

71

values of ω , such as (ω1 , ω1 , ω1 , ω1 ), (ω1 , ω1 , ω1 , ω2 ), (ω1 , ω1 , ω1 , ω3 ), (ω1 , ω1 , ω2 , ω1 ), (ω1 , ω1 , ω2 , ω2 ), (ω1 , ω1 , ω2 , ω3 ), (ω1 , ω1 , ω3 , ω1 ), (ω1 , ω1 , ω3 , ω2 ), (ω1 , ω1 , ω3 , ω3 ), .. .. .. . . . (ω3 , ω3 , ω3 , ω1 ), (ω3 , ω3 , ω3 , ω2 ), (ω3 , ω3 , ω3 , ω3 )

|

For each possible value of ω , we calculate P (ω ) and p(X ω ) using the following, which assume the independences of x i and ω (i): 4

 

|

p(X ω ) =

|

p(xi ω(i))

i=1 4

P (ω ) =

P (ω(i)).

i=1

For example, if ω = (ω1 , ω3 , ω3 , ω2 ) and X = (0.6, 0.1, 0.9, 1.1), then we have

|

p(X ω ) = = = =

|

p((0.6, 0.1, 0.9, 1.1) (ω1 , ω3 , ω3 , ω2 )) p(0.6 ω1 )p(0.1 ω3 )p(0.9 ω3 )p(1.1 ω2 ) 0.333225 0.266085 0.396953 0.333225

|

|

×

×

|

| ×

0.01173

and P (ω ) = P (ω1 )P (ω3 )P (ω3 )P (ω2 ) 1 1 1 1 = 2 4 4 4 1 = = 0.0078125. 128

× × ×

(a) Here we have X = (0.6, 0.1, 0.9, 1.1) and ω = (ω1 , ω3 , ω3 , ω2 ). Thus, we have p(X) = p(x1 = 0.6, x2 = 0.1, x3 = 0.9, x4 = 1.1) =



|

p(x1 = 0.6, x2 = 0.1, x3 = 0.9, x4 = 1.1 ω )P (ω )

ω

|

= p((x1 = 0.6, x2 = 0.1, x3 = 0.9, x4 = 1.1) (ω1 , ω1 , ω1 , ω1 ))P (ω1 , ω1 , ω1 , ω1 ) +p((x1 = 0.6, x2 = 0.1, x3 = 0.9, x4 = 1.1) (ω1 , ω1 , ω1 , ω2 ))P (ω1 , ω1 , ω1 , ω2 )

|

. +p((x1 = 0.6, x2 = 0.1, x3 = 0.9, x4 = 1.1) (ω3 , ω3 , ω3 , ω3 ))P (ω3 , ω3 , ω3 , ω3 ) = p(0.6 ω1 )p(0.1 ω1 )p(0.9 ω1 )p(1.1 ω1 )P (ω1 )P (ω1 )P (ω1 )P (ω1 ) +p(0.6 ω1 )p(0.1 ω1 )p(0.9 ω1 )p(1.1 ω2 )P (ω1 )P (ω1 )P (ω1 )P (ω2 ) .. . +p(0.6 ω3 )p(0.1 ω3 )p(0.9 ω3 )p(1.1 ω3 )P (ω3 )P (ω3 )P (ω3 )P (ω3 ) = 0.012083,

|

| |

|

| |

|

| |

|

|

| |

CHAPTER 2. BAYESIAN DECISION THEORY

72

where the sum of the 81 terms can be computed by a simple program, or somewhat tediously by hand. We also have

|

|

|

|

|

|

p(X ω ) = p(0.6, 0.1, 0.9, 1.1 ω1 , ω3 , ω3 , ω2 ) = p(0.6 ω1 )p(0.1 ω3 )p(0.9 ω3 )p(1.1 ω2 ) = 0.33325 0.266085 0.396953 0.333225 = 0.01173.

×

×

×

Also we have P (ω ) = P (ω1 )P (ω3 )P (ω3 )P (ω2 ) 1 1 1 1 = 2 4 4 4 1 = = 0.0078125. 128

× × ×

According to Eq. 103 in the text, we have

|

|

P (ω X) = P (ω1 , ω3 , ω3 , ω2 0.6, 0.1, 0.9, 1.1) p(0.6, 0.1, 0.8, 1.1 ω1 , ω3 , ω3 , ω2 )P (ω1 , ω3 , ω3 , ω2 ) = p(X) 0.01173 0.0078125 = = 0.007584. 0.012083

|

·

(b) We follow the procedur e in part (a), with the value s X = (0.6, 0.1, 0.9, 1.1) and ω = (ω1 , ω2 , ω2 , ω3 ). We have

|

p(X ω ) = = =

|

p(0.6, 0.1, 0.9, 1.1 ω1 , ω2 , ω2 , ω3 ) p(0.6 ω1 )p(0.1 ω2 )p(0.9 ω2 )p(1.1 ω3 ) 0.33225 0.368270 0.368270 0.396953 = 0 .01794.

|

|

×

×

|

| ×

Likewise, we have P (ω ) = P (ω1 )P (ω2 )P (ω2 )P (ω3 ) 1 1 1 1 = 2 4 4 4 1 = = 0.0078125. 128

× × ×

Thus

|

|

P (ω X) = P (ω1 , ω2 , ω2 , ω3 0.6, 0.1, 0.9, 1.1) p(0.6, 0.1, 0.9, 1.1 ω1 , ω2 , ω2 , ω3 )P (ω1 , ω2 , ω2 , ω3 ) = P (X) 0.01794 0.0078125 = = 0.01160. 0.012083

|

·

(c) Here we have X = (0.6, 0.1, 0.9, 1.1) and ω = (ω(1), ω(2), ω(3), ω(4)). According to Eq. 103 in the text, the sequence ω that maximizes p(X ω )P (ω ) has the maximum probability, since p(X) is fixed for a given observed X. With the simplifications above, we have

|

|

|

|

|

|

p(X ω )P (ω ) = p(x1 ω(1))p(x2 ω(2))p(x3 ω(3))p(x4 ω(4))P (ω(1))P (ω(2))P (ω(3))P (ω(4)).

PROBLEM SOLUTIONS

73

For X = (0.6, 0.1, 0.9, 1.1), we have

|

|

|

max[ p(0.6 ω1 )P (ω1 ), p(0.6 ω2 )P (ω2 ), p(0.6 ω3 )P (ω3 )] = max[0 .333225 0.5, 0.396953 0.25, 0.368270 0.25] = 0.333225 0.5 = 0.1666125.

×

×

×

×

Likewise, for the second step we have

|

|

|

max[ p(0.1 ω1 )P (ω1 ), p(0.1 ω2 )P (ω2 ), p(0.1 ω3 )P (ω3 )] = max[0 .396953 0.5, 0.368270 0.25, 0.266085 0.25] = 0.396953 0.5 = 0.1984765.

×

×

×

×

For the third step we have

|

|

|

max[ p(0.9 ω1 )P (ω1 ), p(0.9 ω2 )P (ω2 ), p(0.9 ω3 )P (ω3 )] = max[0 .266085 0.5, 0.368270 0.25, 0.396953 0.25] = 0.266085 0.5 = 0.133042.

×

×

×

×

For the final step we have

| × ×

| ×

| ×

max[ p(1.1 ω1 )P (ω1 ), p(1.1 ω2 )P (ω2 ), p(1.1 ω3 )P (ω3 )] = max[0 .217852 0.5, 0.333225 0.25, 0.396953 0.25] = 0.217852 0.5 = 0.108926. Thus the sequence ω = (ω1 , ω1 , ω1 , ω1 ) has the maximum probability to observe X = (0.6, 0.1, 0.9, 1.1). This maximum probability is 0.166625

1 × 0.1984765 × 0.133042 × 0.108926 × 0.012083 = 0.03966.

CHAPTER 2. BAYESIAN DECISION THEORY

74

Computer Exercises Section 2.5 1. Computer exercise not yet solved 2. MATLAB program

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

load samples.mat; the data [n,m] = size(samples); for i=1:3 mu i = mean(samples(:, (i-1)*3+1;i*3))’; sigma i = zeros(3); for j=1:n sigma i = sigma i + ... The ... continues the line (samples(j,(i-1)*3+1:i*3)’ - mu i ) ... * (samples(j,(i-1)*3+1;i*3)’ - mu i )’; end sigma i = sigma i ./n; end s = [1 2 1; 5 3 2; 0 0 0; 1 0 0]’ for j=1:size(s,2) for i=1;3 d = sqrt((s:,j)-mu i )’*inv(sigma i )*(s(:,j)-mu i ));

{}

{}

{}

{}

{}

{} {}

{}

{}

18

{}

{}

endfprintf(’Mahal. dist. for class %d and point %d: %: %f end 21 pw(1:) =[1/3 0.8]; 22 pw(2:) =[1/3 0.1]; 23 pw(3:) =[1/3 0.1]; 24 for p=1:2 25 fprintf(’ n n n n’); 26 for j=1:size(s,2) 27 class = 0; max gi = -1000000; 19 20

\n’, i, j, d);

\\\\

for i=1:3

28

{}

30 31 32 33 34 35

{}

\

36 37 38

{}

d i = (s(:,j)-mu i )’*inv(sigma i )*(s(:,j)-mu i ); g i = -0.5*d i - 1.5*log(2*pi) - 0.5*log(det(sigma i )) + log(pw(i,p)); if gi > max gi, max gi = gi; class = i; end end fprintf(’Point %d classified in category %d n’, j, class); end

29

end

Output Mahal. dist . for clas s 1 and poin t 1: 1.069873 Mahal. dist . for clas s 2 and poin t 1: 0.904465 Mahal. dist . for clas s 3 and poin t 1: 2.819441

{}

COMPUTER EXERCISES Mahal. Mahal. Mahal. Mahal. Mahal. Mahal. Mahal. Mahal.

dist. dist. dist. dist. dist. dist. dist. dist.

for for for for for for for for

class class class class class class class class

1 2 3 1 2 3 1 2

and and and and and and and and

75 point point point point point point point point

2: 2: 2: 3: 3: 3: 4: 4:

1.641368 1.850650 0.682007 0.516465 0.282953 2.362750 0.513593 0.476275

Mahal. dist. for class 3 and point 4: 1.541438 Point Point Point Point

1 2 3 4

classified classified classified classified

in in in in

category category category category

2 3 1 1

Point Point Point Point

1 2 3 4

classified classified classified classified

in in in in

category category category category

1 1 1 1

3. Computer exercise not yet solved 4. Computer exercise not yet solved 5. Computer exercise not yet solved Section 2.8 6. Computer exercise not yet solved 7. Computer exercise not yet solved 8. Computer exercise not yet solved Section 2.11 9. Computer exercise not yet solved

76

CHAPTER 2. BAYESIAN DECISION THEORY

Chapter 3

Maximum likelihood and Bayesian parameter estimation

Problem Solutions Section 3.2 1. Our exponential function is:

|

p(x θ) =



θe−θx x 0 0 otherwise.



|

(a) See Figure. Note that p(x = 2 θ) is not maximized when θ = 2 but instead for a value less than 1.0. p(x|θ =1)

p(x=2|θ ) 0.2

1

0.1

0.5

x 0

1 θˆ =1

2

θ

3

0

12345

(part c)

(b) The log-likelihood function is n

l(θ) =



k=1

n

|

ln p(xk θ) =



n

[ln θ

k=1

77

− θx ] = nln θ − θ k



k=1

xk .

78CHAPTER 3. MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION We solve

θˆ as

∇ l(θ) = 0 to find θ

∇ l(θ) θ

 −   −

∂ n ln θ ∂θ

=

xk

k=1

n

n θ

=

n

θ

xk = 0.

k=1

Thus the maximum-likelihood solution is 1 θˆ = . n 1 xk n

 

k=1

(c) Here we approximate the mean

n

1 xn n k=1 by the integral





xp(x)dx,

0

which is valid in the large n limit. Noting that

∞ −x xe dx = 1,

 0

we put these results together and see that part (a).

θˆ = 1, as shown on the figure in

2. Our (normalized) distribution function is

|

p(x θ) =



≤ ≤

1/θ 0 x θ 0 otherwise.

·

(a) We will use the notation of an indicator function I ( ), whose value is equal to 1.0 if the logical value of its argument is true, and 0.0 otherwise. We can write the likelihood function using I ( ) as n

p(

D|θ)

=

·

|

p(xk θ) k=1 n



=

1 I (0 θ k=1

=

1 I θ θn

 ≥≤ ≤   xk

max xk k

θ)

I min xk k

≥0



.

We note that 1 /θ n decreases monotonically as θ increases but also that I (θ max xk ) is 0.0 if θ is less than the maximum value of xk . Therefore, our likelihood



k

function is maximized at θˆ = max xk . k

PROBLEM SOLUTIONS

79

p(D|θ ) 12 10 8 6 4 2 θ

0.2

0.4

0.6

0.8

1

(b) See Figure . 3. We are given that zik =



1 if the state of nature for the k th sample is ω i 0 otherwise.

(a) The samples are drawn by successive independent selection of a state of nature ωi with probability P (ωi ). We have then

|

Pr[zik = 1 P (ωi )] = P (ωi ) and

|

Pr[zik = 0 P (ωi )] = 1

− P (ω ). i

These two equations can be unified as P (zik P (ωi )) = [ P (ωi )]zik [1

|

− P (ω )] − i

1 zik

.

By the independence of the successive selections, we have n

P (zi1 ,

· · · , z |P (ω )) in

=

i

 

|

P (zik P (ωi ))

k=1 n

=

[P (ωi )]zik [1

k=1

− P (ω )] − i

1 zik

.

(b) The log-likelihood as a function of P (ωi ) is l(P (ωi )) = ln P (zi1 , n

· · · , z |P (ω )) in

i

zik

= ln

 n

=

zik

k=1

1 zik

− P (ω )] − ln P (ω ) + (1 − z ) ln (1 − P (ω )) .

[P (ωi )]

k=1

i

[1

i



ik

i



Therefore, the maximum-likelihood values for the P (ωi ) must satisfy



P (ωi ) l(P (ωi )) =

1 P (ωi )

n



k=1

zik

− 1 − P1 (ω ) i

n



(1

k=1

−z

ik ) =

0.

80CHAPTER 3. MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION We solve this equation and find n

− Pˆ (ω ))

(1

i



n

= Pˆ (ωi )

zik

k=1



(1

k=1

−z

ik ),

which can be rewritten as n

n

zik

n

= Pˆ (ωi )

zik + nPˆ (ωi )







k =1

Pˆ (ωi )

k =1

The final solution is then

1 n

Pˆ (ωi ) =

k =1

n



zik .



zik .

k=1

That is, the estimate of the probability of category ω i is merely the probability of obtaining its indicatory value in the training data, just as we would expect.
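The estimate has a direct frequency interpretation, which a few lines of code make explicit. The sketch below (not part of the original solution; the true priors and sample size are assumed) draws category labels and forms the estimate of each P(ωi) as the mean of the corresponding indicator variables zik.

% ML estimate of priors as the mean of the indicator variables (sketch).
truePriors = [0.5 0.3 0.2];              % assumed true P(omega_i)
n = 10000;
u = rand(1, n);
labels = 1 + (u > 0.5) + (u > 0.8);      % category 1, 2 or 3
for i = 1:3
    z = (labels == i);                   % indicator z_ik for category i
    fprintf('P_hat(omega_%d) = %.3f (true %.2f)\n', i, mean(z), truePriors(i));
end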

{

}

4. We have n samples x1 ,..., xn from the discrete distribution d

|

P (x θ ) =



θixi (1

i=1

−θ ) −

1 xi

i

.

The likelihood for a particular sequence of n samples is n

d



|

P (x1 ,..., xn θ) =

θixki (1

k=1 i=1

−θ ) −

1 xki

i

,

and the log-likelihood function is then n

l(θ ) =

d



xki ln θ i + (1

k=1 i=1

To find the maximum of l(θ ), we set nent (i = 1,...,d ) and get



[ θ l(θ)]i

ki ) ln (1

− θ ). i

∇θ l(θ) = 0 and evaluate component by compo-

=



=

1 θi

=

−x

0.

θi l(θ) n

n

− −  k=1

1

1

θi

(1

k=1

−x

ki )

This implies that for any i n



xki =

− θˆ )



1 θˆi

k=1

1

which can be rewritten as n

(1

i

k=1

n

1



θˆi

−  −  (1

xki ),

k=1

n

xki

= θˆi n

xki

k=1

.

PROBLEM SOLUTIONS

81

The final solution is then n



1 θˆi = n

xki .

k=1

Since this result is valid for all i = 1,...,d , we can write this last equation in vector form as n

1 θ ˆ= n



k=1

xk .

Thus the maximum-likelihood value of θ is merely the sample mean, just as we would expect. 5. The probability of finding feature x i to be 1.0 in category ω 1 is denoted p:

|

p(xi = 1 ω1 ) = 1

− p(x = 0|ω ) = p 1

i

i1

=p>

1 , 2

|

for i = 1,...,d . Moreover, the normalization condition gives p i2 = p(xi ω2 ) = 1 (a) A single observation x = (x1 ,...,x d

|

p(x ω1 ) =



−p

d ) is drawn from class ω 1 , and thus have d



|

p(xi ω1 ) =

i=1

pxi (1

i=1

− p) −

1 xi

,

and the log-likelihood function for p is d

|

l(p) = ln p(x ω1 ) =



[xi ln p + (1

i=1

− x ) ln (1 − p)]. i

Thus the derivative is

∇ l(p)

=

p

1 p

d



xi

i=1

− (1 −1 p)

d



(1

i=1

− x ). i

We set this derivative to zero, which gives d



d

 − −

1 1 xi = pˆ i=1 1 pˆ

(1

xi ),

i=1

which after simple rearrangement gives d

(1 Thus our final solution is

d

p) ˆ



xi



= pˆ d

i=1

pˆ =

xi

.

 −   i=1

1 d

d

xi .

i=1

That is, the maximum-likelihood estimate of the probability of obtaining a 1 in any position is simply the ratio of the number of 1’s in a single sample divided by the total number of features, given that the number of features is large.

i1 .

82CHAPTER 3. MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION d

(b) We define T = (1/d)

 E

xi to be the proportion of 1’s in a single observation x.

i=1

As the number of dimensions d approaches infinity, we have 1 d

T =

d

|

× p + 0 × (1 − p)] = p.

(xi ω1 ) = [1

i=1

Likewise, the variance of T , given that we’re considering just one class, ω 1 , is 1 d

|

Var(T ω1 ) =

d

 

1 d2

=

d

[12

i=1

p(1

=

|

Var(xi ω1 )

i=1

2

× p + 0 × (1 − p) − p × p]

− p) ,

d

→∞

which vanishes as d . Clearly, for minimum error, we choos e ω1 if T > T ∗ = 1/2 for p > 1/2. Since the variance vanishes for large d, the probability of error is zero for a single sample having a sufficiently large d. (c) See figure . P(T|ωi)

d = 111

ω2

ω1

d = 11

0.2

T

T* 0.6

0.4

0.8

1

6. The d-dimensional multivariate normal density is given by

|

p(x µ, Σ) =

1

We choose x 1 , x2 ,..., xn is

|

−

p(x1 , x2 ,..., xn µ, Σ) =

1/2

exp

=



| |

− n2 ln (2π) − n2 ln |Σ| − 12 − n2 ln (2π) − n2 ln |Σ| − 12

1

t

1 n (x k 2 k=1

1 exp (2π)n/2 Σ n/2

The log-likelihood function of µ and Σ is l(µ, Σ) =

1 (x 2

− µ) Σ− (x − µ) . |Σ| independent observations from p(x|µ, Σ). The joint density

(2π)d/2

−  −  −  −

µ)t Σ−1 (xk

− µ)



.

n

(xk

µ)t Σ−1 (xk

k=1 n

xtk xk

k=1

− µ)



2µt Σ−1 xk + nµt Σ−1 µ .

PROBLEM SOLUTIONS

83

We set the derivative of the log-likelihood to zero, that is, ∂l(µ, Σ) ∂µ

=

and find that

  − − 2Σ−1

xk + n2Σ−1 µ

k=1

n

ˆ −1 Σ



n

1 2

ˆ xk = n Σ

−1

= 0,

ˆ. µ

k=1



This gives the maximum-likelihood solution, µ ˆ=

n



1 xk , n k=1

ˆ we temporarily substitute as expect ed. In order to simplify our calculation of Σ, A = Σ −1 , and thus have l(µ, Σ) =

n



(xk

(xk

− µ)(x − µ)

− n2 ln (2π) + n2 ln |A| − 12

k=1

t

− µ) A(x − µ). k

We use the above results and seek the solution to ∂l(µ, Σ) ∂A

=

n −1 A 2

− 12

We now replace A by Σ −1 and find that nΣ ˆ = 1 n (xk 2 2 k=1

 

n



k=1

k

t

= 0.

t

− µˆ )(x − µˆ ) , k

and then multiply both sides by 2 /n and find our solution: n

ˆ= 1 Σ (xk n k=1

t

− µˆ )(x − µˆ ) . k

As we would expect, the maximum-likelihood estimate of the covariance matrix is merely the covariance of the samples actually found. 7. The figure shows the model for ω 1 , the incorrect model for ω 2 , that is, p(x ω2 ) N (0, 1) and the true model for ω 2 (dashed, and scaled by 30000 for visibility).

|



(a) According to Eq. 18 in the text, the maxim um-likelihood estimat e of the mean (for the incorrect model) is merely the sample mean of the data from the true model. In this case, then, we have ˆµ = 1.0. (b) Given two equal-variance Gaussians and equal priors, the decision boundary point is midway between the two means, here, x ∗ = 0.5.

R

(c) For the incorrect case, according to part (b), we have 1 is the line segment x < 0.5 and for 2 the line segment x > 0.5. For the corre ct case, we must solve numerically

R

1 1 exp[ x2 /2] = exp[ 1(x 1)2 /(2 106 )], 2π 1 2π 106 which gives the values x = 3.717. The Bayesian decisi on boundaries and regions are shown along the bottom of the figure above.







±

− −

·

84CHAPTER 3. MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION p(x|ωi) 0.4 p(x|ω1)

p(x|ω2) as incorrect model

0.3 0.2 true p(x|ω2) model (shown 30000x)

0.1

-6

-4

-3.717

-2

0 x*

2

4 +3.717

R1

R2

6 x R2

(d) For the poor model, which has equal varia nces for the two Gaussians , the only decision boundary possible is a single point, as was illustrated in part (a). The best we can do with the incorrect model is to adjust the mean µ so as to match the rightmost of the decision boundaries given in part (c), that is, x∗ = 3.717. This decision point is midway between the means of the two Gaussians — that is, at 0 and at µ. As such, we need to choo se µ such that (0 + µ)/2 = 3.717. This leads directly to the solution, µ = 7.43.

|



(e) The maximum-likelihood solution in the incorrect model — here p(x ω2 ) N (µ, 1) — does not yield the minimum class ification error. A different value of the parameter (here, µ = 7.43) approximates the Bayes decision boundary

better andmodel. gives aAslower than the maximum-likelihood solution in which the incorrect such,error the faulty prior information about the model, did not permit the true model to be expressed, could not be overcome with even an infinite amount of training data. 8. Consider a case in which the maximum-likelihood solution gives the worst possible classifier.

↔−

|



|

(a) In this case, the symmetry operation x x takes p(x ω1 ) p(x ω2 ), and thus we are assured that the estimated distributions have this same symmetry property. For that reason, we are guaranteed that these distribution have the same value at x = 0, and thus x∗ = 0 is a decision boundary. Since the Gaussian estimates must have the same variance, we are assured that there will be only a single intersection, and hence a single decision point, at x ∗ = 0. (b) See Figure . (c) We have the estima te of the mean as µ ˆ1 =



|

p(x ω1 )dx = (1

− k)1 + k(−X ) = 1 − k(1 + X ).

We are asked to “switch” the mean, that is, have ˆ µ1 < 0 and ˆµ2 > 0. This can be assured if X > (1 k)/k. (We get the symmetric answer for ω 2 .)



(d) Since the decision boundary is at x ∗ = 0, the error is 1 k, that is, the value of the distribution spikes on the “wrong” side of the decision boundary.



PROBLEM SOLUTIONS

85 p(x|ωi) 1-k=.8

1-k=.8

ω2

ω1

k=.2

k=.2

ω1

ω1

ω2

ω2

x -6

-4

-2

0

2

4

6

X

x*



(e) Thus for the error to approach 1, we must have k → 0, and this, in turn, requires X to become arbitrarily large, that is, X > (1 − k)/k → ∞. In this way, a tiny "spike" in density sufficiently far away from the origin is sufficient to "switch" the estimated mean and give error equal 100%.
(f) There is no substantive difference in the above discussion if we constrain the variances to have a fixed value, since they will always be equal to each other, that is, σ̂1² = σ̂2². The equality of variances preserves the property that the decision point remains at x* = 0; it consequently also ensures that for sufficiently small k and large X the error will be 100%.
(g) All the maximum-likelihood methods demand that the optimal solution exists in the model space. If the optimal solution does not lie in the solution space, then the proofs do not hold. This is actually a very strong restriction. Note that obtaining the error equal 100% solution is not dependent upon limited data, or getting caught in a local minimum — it arises because of the error in assuming that the optimal solution lies in the solution set. This is model error, as described on page 101 of the text.
9. The standard maximum-likelihood solution is θ̂ = arg maxθ p(x|θ).
Now consider a mapping x → τ(x), where τ(·) is continuous. Then we can write p(τ|θ)dτ = p(x|θ)dx, and
p(τ|θ) = p(x|θ) / (dτ/dx).
Then we find the value of θ maximizing p(τ(x)|θ) as
arg maxθ p(τ(x)|θ) = arg maxθ [ p(x|θ) / (dτ/dx) ] = arg maxθ p(x|θ) = θ̂,
where we have assumed dτ/dx ≠ 0 at θ = θ̂. In short, then, the maximum-likelihood value of θ based on τ(x) is indeed θ̂. In practice, however, we must check whether the value of θ̂ derived this way gives a maximum or a minimum (or possibly an inflection point) for p(τ|θ).
10. Consider the novel method of estimating the mean of a set of points as taking its first value, which we denote M = x1.
(a) Clearly, this unusual estimator of the mean is unbiased, that is, the expected value of this statistic is equal to the true value. In other words, if we repeat the selection of the first point of a data set we have
bias = E[M] − µ = lim_{K→∞} (1/K) Σ_{k=1}^{K} M(k) − µ = 0,
where M(k) is the first point in data set k drawn from the given distribution.
(b) While the unusual method for estimating the mean may indeed be unbiased, it will generally have large variance, and this is an undesirable property. Note that E[(M − µ)²] = E[(x1 − µ)²] = σ², and the RMS error, σ, is independent of n. This undesirable behavior is quite different from that of the measurement of
x̄ = (1/n) Σ_{i=1}^{n} xi,
where we see
E[(x̄ − µ)²] = E[ ((1/n) Σ_{i=1}^{n} (xi − µ))² ] = (1/n²) Σ_{i=1}^{n} E[(xi − µ)²] = σ²/n.
Thus the RMS error, σ/√n, approaches 0 as 1/√n. Note that there are many superior methods for estimating the mean, for instance the sample mean. (In Chapter 9 we shall see other techniques — ones based on resampling — such as the so-called "bootstrap" and "jackknife" methods.)
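A quick simulation makes the contrast above concrete. The following sketch (illustrative only; the Gaussian parameters, sample size, and trial count are arbitrary choices, not taken from the problem) estimates the bias and RMS error of the first-sample estimator M = x1 and of the sample mean over repeated data sets.

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, n_trials = 2.0, 1.0, 100, 20000   # arbitrary illustrative values

data = rng.normal(mu, sigma, size=(n_trials, n))
first_point = data[:, 0]            # the estimator M = x_1
sample_mean = data.mean(axis=1)     # the estimator x-bar

# Both estimators are (approximately) unbiased ...
print("bias of M:    %.4f" % (first_point.mean() - mu))
print("bias of mean: %.4f" % (sample_mean.mean() - mu))

# ... but their RMS errors differ: sigma versus sigma/sqrt(n)
print("RMS error of M:    %.4f (theory %.4f)" % (np.sqrt(((first_point - mu)**2).mean()), sigma))
print("RMS error of mean: %.4f (theory %.4f)" % (np.sqrt(((sample_mean - mu)**2).mean()), sigma / np.sqrt(n)))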



|



11. We assume p2(x) ≡ p(x|ω2) ∼ N(µ, Σ), but that p1(x) ≡ p(x|ω1) is arbitrary. The Kullback-Leibler divergence from p1(x) to p2(x) is
DKL(p1, p2) = ∫ p1(x) ln p1(x) dx + (1/2) ∫ p1(x) [ d ln(2π) + ln|Σ| + (x − µ)ᵗ Σ⁻¹ (x − µ) ] dx,
where we used the fact that p2 is a Gaussian, that is,
p2(x) = 1/((2π)^{d/2} |Σ|^{1/2}) exp[ −(x − µ)ᵗ Σ⁻¹ (x − µ)/2 ].
We now seek µ and Σ to minimize this "distance." We set the derivative to zero and find
∂DKL(p1, p2)/∂µ = − ∫ Σ⁻¹ (x − µ) p1(x) dx = 0,
and this implies
Σ⁻¹ ∫ p1(x)(x − µ) dx = 0.
We assume Σ is non-singular, and hence this equation implies
∫ p1(x)(x − µ) dx = E1[x − µ] = 0,
or simply, E1[x] = µ. In short, the mean of the second distribution should be the same as that of the Gaussian.
Now we turn to the covariance of the second distribution. Here for notational convenience we denote A = Σ⁻¹. Again, we take a derivative of the Kullback-Leibler divergence and find
∂DKL(p1, p2)/∂A = 0 = (1/2) ∫ p1(x) [ −A⁻¹ + (x − µ)(x − µ)ᵗ ] dx,
and thus
E1[(x − µ)(x − µ)ᵗ] = A⁻¹, or E1[(x − µ)(x − µ)ᵗ] = Σ.
In short, the covariance of the second distribution should indeed match that of the Gaussian. Note that in taking the derivative above,
∂|A|/∂A = |A| A⁻¹,
we relied on the fact that A = Σ⁻¹ is symmetric since Σ is a covariance matrix. More generally, for an arbitrary non-singular matrix M we would use
∂|M|/∂M = |M| (M⁻¹)ᵗ.
Section 3.3
12. In the text we saw the following results:
1. The posterior density can be computed as p(x|D) = ∫ p(x, θ|D) dθ.
2. p(x, θ|D) = p(x|θ, D) p(θ|D).
3. p(x|θ, D) = p(x|θ), that is, the distribution of x is known completely once we know the value of the parameter vector, regardless of the data D.
4. p(x|D) = ∫ p(x|θ) p(θ|D) dθ.
These are justified as follows:
1. This statement reflects the conceptual difference between the maximum-likelihood estimator and the Bayesian estimator. The Bayesian learning method considers the parameter vector θ to be a random variable rather than a fixed value, as in maximum-likelihood estimation. The posterior density p(x|D) also depends upon the probability density p(θ) distributed over the entire θ space instead of a single value. Therefore, p(x|D) is the integration of p(x, θ|D) over the entire parameter space. The maximum-likelihood estimator can be regarded as a special case of the Bayesian estimator, where p(θ|D) is uniformly distributed so that its effect disappears after the integration.
2. The term p(x, θ|D) implies two steps in computation. One is the computation of the probability density of θ given the data set D, that is, p(θ|D). The other is the computation of the probability density of x given θ, that is, p(x|θ, D). The above two steps are independent of each other, and thus p(x, θ|D) is the product of the results of the two steps.
3. As mentioned in the text, the selection of x and that of the training samples in D is done independently, that is, the selection of x does not depend upon D. Therefore we have p(x|θ, D) = p(x|θ).
4. We substitute the above relations into Eq. 24 in the text and get Eq. 25.
Section 3.4
13. We seek a novel approach for finding the maximum-likelihood estimate for Σ.
(a) We first inspect the forms of a general vector a and matrix A:
a = (a1, a2, ..., an)ᵗ   and   A = (Aij), an n-by-n matrix with entries A11, ..., A1n through An1, ..., Ann.
Consider the scalar
aᵗAa = Σ_{i=1}^{n} Σ_{j=1}^{n} ai Aij aj.
The (i, i)th element of the matrix Aaaᵗ is Σ_{j=1}^{n} Aij aj ai, and the trace of Aaaᵗ is the sum of these diagonal elements, that is,
tr(Aaaᵗ) = Σ_{i=1}^{n} Σ_{j=1}^{n} Aij aj ai = aᵗAa.
(b) We seek to show that the likelihood function can be written as
p(x1, ..., xn|Σ) = |Σ⁻¹|^{n/2}/(2π)^{nd/2} exp[ −(1/2) Σ_{k=1}^{n} tr( Σ⁻¹ (xk − µ)(xk − µ)ᵗ ) ].
We note that p(x|Σ) ∼ N(µ, Σ) where µ is known and x1, ..., xn are independent observations from p(x|Σ). Therefore the likelihood is
p(x1, ..., xn|Σ) = Π_{k=1}^{n} 1/((2π)^{d/2}|Σ|^{1/2}) exp[ −(1/2)(xk − µ)ᵗ Σ⁻¹ (xk − µ) ]
= |Σ|^{−n/2}/(2π)^{nd/2} exp[ −(1/2) Σ_{k=1}^{n} (xk − µ)ᵗ Σ⁻¹ (xk − µ) ].
From the results in part (a), with a = xk − µ and A = Σ⁻¹, we have
p(x1, ..., xn|Σ) = |Σ⁻¹|^{n/2}/(2π)^{nd/2} exp[ −(1/2) tr( Σ⁻¹ Σ_{k=1}^{n} (xk − µ)(xk − µ)ᵗ ) ],
where we used Σ_{k=1}^{n} tr(Ak) = tr( Σ_{k=1}^{n} Ak ) with Ak = Σ⁻¹(xk − µ)(xk − µ)ᵗ, and |Σ|^{−n/2} = |Σ⁻¹|^{n/2}.
(c) Recall our definition of the sample covariance matrix:
Σ̂ = (1/n) Σ_{k=1}^{n} (xk − µ)(xk − µ)ᵗ.
Here we let A = Σ⁻¹Σ̂, which easily leads to the following equalities:
|Σ⁻¹| = |AΣ̂⁻¹| = |A| |Σ̂⁻¹| = |A| / |Σ̂| = λ1λ2···λd / |Σ̂|,
where λ1, ..., λd are the eigenvalues of A. We substitute these into our result in part (b) to get
p(x1, ..., xn|Σ) = |Σ⁻¹|^{n/2}/(2π)^{nd/2} exp[ −(1/2) tr( Σ⁻¹ Σ_{k=1}^{n} (xk − µ)(xk − µ)ᵗ ) ]
= (λ1···λd)^{n/2} / ((2π)^{nd/2} |Σ̂|^{n/2}) exp[ −(1/2) tr( nΣ⁻¹Σ̂ ) ].
Note, however, that tr[nΣ⁻¹Σ̂] = n[tr(A)] = n(λ1 + ··· + λd), and thus we have
p(x1, ..., xn|Σ) = (λ1···λd)^{n/2} / ((2π)^{nd/2} |Σ̂|^{n/2}) exp[ −(n/2)(λ1 + ··· + λd) ].
(d) The expression for p(x1, ..., xn|Σ) in part (c) depends on Σ only through λ1, ..., λd, the eigenvalues of A = Σ⁻¹Σ̂. We can write our likelihood, then, as
p(x1, ..., xn|Σ) = 1/((2π)^{nd/2}|Σ̂|^{n/2}) Π_{i=1}^{d} (λi e^{−λi})^{n/2}.
Maximizing p(x1, ..., xn|Σ) with respect to Σ is equivalent to maximizing λi e^{−λi} with respect to each λi. We do this by setting the derivative to zero, that is,
∂[λi e^{−λi}]/∂λi = e^{−λi} + λi(−e^{−λi}) = 0,
which has solution λi = 1. In short, p(x1, ..., xn|Σ) is maximized by choosing λ1 = λ2 = ··· = λd = 1. This means that A = Σ⁻¹Σ̂ = I, or Σ̂ = Σ, as expected.
14. First we note that p(x|µi, Σ, ωi) ∼ N(µi, Σ). We have also let lk = i if the state of nature for xk was ωi.
(a) From Bayes' rule we can write
p(x1, ..., xn, l1, ..., ln | µ1, ..., µc, Σ) = p(x1, ..., xn | µ1, ..., µc, l1, ..., ln, Σ) p(l1, ..., ln).
Because the distribution of l1, ..., ln does not depend on µ1, ..., µc or Σ, we can write
p(x1, ..., xn | µ1, ..., µc, Σ, l1, ..., ln) = Π_{k=1}^{n} p(xk | µ1, ..., µc, Σ, lk)
= Π_{k=1}^{n} 1/((2π)^{d/2}|Σ|^{1/2}) exp[ −(1/2)(xk − µ_{lk})ᵗ Σ⁻¹ (xk − µ_{lk}) ].
The li are independent, and thus the probability density of the l's is a product,
p(l1, ..., ln) = Π_{k=1}^{n} p(lk) = Π_{k=1}^{n} P(ω_{lk}).
We combine the above equations and get
p(x1, ..., xn, l1, ..., ln | µ1, ..., µc, Σ) = [ Π_{k=1}^{n} P(ω_{lk}) / ((2π)^{nd/2}|Σ|^{n/2}) ] exp[ −(1/2) Σ_{k=1}^{n} (xk − µ_{lk})ᵗ Σ⁻¹ (xk − µ_{lk}) ].
(b) We sum the result of part (a) over the n samples to find
Σ_{k=1}^{n} (xk − µ_{lk})ᵗ Σ⁻¹ (xk − µ_{lk}) = Σ_{i=1}^{c} Σ_{k: lk=i} (xk − µi)ᵗ Σ⁻¹ (xk − µi).
The likelihood function for µi is
l(µi) = [ Π_{k=1}^{n} P(ω_{lk}) / ((2π)^{nd/2}|Σ|^{n/2}) ] exp[ −(1/2) Σ_{j≠i} Σ_{k: lk=j} (xk − µj)ᵗ Σ⁻¹ (xk − µj) ] × exp[ −(1/2) Σ_{k: lk=i} (xk − µi)ᵗ Σ⁻¹ (xk − µi) ].
Thus, the maximum-likelihood estimate of µi is a function of the xk restricted to lk = i. But these xk's are an independent, identically distributed (i.i.d.) sample from N(µi, Σ). Thus, by the result for samples drawn from a single normal population, it follows that
µ̂i = (1/ni) Σ_{k: lk=i} xk,
where ni = |{xk : lk = i}| = Σ_{k: lk=i} 1. Therefore, our solution is
µ̂i = ( Σ_{k: lk=i} xk ) / ( Σ_{k: lk=i} 1 ).
We also know that if we have a maximum-likelihood estimate for µ, then the maximum-likelihood estimate for Σ is
Σ̂ = (1/n) Σ_{k=1}^{n} (xk − µ̂)(xk − µ̂)ᵗ.
But the maximum-likelihood estimate µ̂ corresponds to the distribution of xk, and hence the estimate to use for sample xk is µ̂_{lk}. Thus we have
Σ̂ = (1/n) Σ_{k=1}^{n} (xk − µ̂_{lk})(xk − µ̂_{lk})ᵗ.
15. Consider the problem of learning the mean of a univariate normal distribution.
(a) From Eqs. 34 and 35 in the text, we have
µn = ( nσo²/(nσo² + σ²) ) mn + ( σ²/(nσo² + σ²) ) µo
and
σn² = σo²σ²/(nσo² + σ²),
where the sample mean is
mn = (1/n) Σ_{k=1}^{n} xk.
Here µo is formed by averaging no fictitious samples xk for k = −no + 1, ..., 0. Thus we can write
µo = (1/no) Σ_{k=−no+1}^{0} xk,
and
µn = [ Σ_{k=1}^{n} xk ] / (n + σ²/σo²) + [ (σ²/σo²)/(n + σ²/σo²) ] (1/no) Σ_{k=−no+1}^{0} xk
= [ Σ_{k=1}^{n} xk ] / (n + no) + [ no/(n + no) ] (1/no) Σ_{k=−no+1}^{0} xk.
We can use the fact that no = σ²/σo² to write
µn = (1/(n + no)) Σ_{k=−no+1}^{n} xk.
Likewise, we have
σn² = σ²σo²/(nσo² + σ²) = σ²/(n + σ²/σo²) = σ²/(n + no).
(b) The result of part (a) can be interpreted as follows: For a suitable choice of the prior density p(µ) ∼ N(µo, σo²), maximum-likelihood inference on the "full" sample of n + no observations coincides with Bayesian inference on the "second sample" of n observations. Thus, by suitable choice of prior, Bayesian learning can be interpreted as maximum-likelihood learning, and here the suitable choice of prior in Bayesian learning is
µo = (1/no) Σ_{k=−no+1}^{0} xk,   σo² = σ²/no.
Here µo is the sample mean of the first no observations and σo² is the variance based on those no observations.
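To illustrate this equivalence numerically (a sketch only; the values of µo, σ², σo² and the data are invented), the snippet below computes µn and σn² from Eqs. 34 and 35 and checks that µn equals the plain sample mean of the n real observations pooled with no = σ²/σo² fictitious observations whose average is µo.

import numpy as np

rng = np.random.default_rng(2)
sigma2 = 4.0          # known variance of the likelihood
sigma0_2 = 1.0        # prior variance, so n_o = sigma2 / sigma0_2 = 4 fictitious samples
mu0 = 0.5             # prior mean
n_o = int(sigma2 / sigma0_2)

x = rng.normal(3.0, np.sqrt(sigma2), size=25)   # the n "real" observations
n, m_n = len(x), x.mean()

# Eqs. 34 and 35: Bayesian posterior mean and variance for the Gaussian mean
mu_n = (n * sigma0_2 * m_n + sigma2 * mu0) / (n * sigma0_2 + sigma2)
var_n = sigma0_2 * sigma2 / (n * sigma0_2 + sigma2)

# Equivalent view: sample mean of the n real plus n_o fictitious samples averaging mu0
mu_pooled = np.concatenate([np.full(n_o, mu0), x]).mean()

print("mu_n = %.4f, pooled mean = %.4f" % (mu_n, mu_pooled))
print("sigma_n^2 = %.4f, sigma^2/(n + n_o) = %.4f" % (var_n, sigma2 / (n + n_o)))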

16. We assume that A and B are non-singular matrices of the same order.
(a) Consider Eq. 44 in the text. We write
A(A + B)⁻¹B = A[(A + B)⁻¹(B⁻¹)⁻¹] = A[B⁻¹(A + B)]⁻¹ = A[B⁻¹A + I]⁻¹
= (A⁻¹)⁻¹(B⁻¹A + I)⁻¹ = [(B⁻¹A + I)A⁻¹]⁻¹ = (B⁻¹ + A⁻¹)⁻¹.
We interchange the roles of A and B in this equation to get our desired answer:
B(A + B)⁻¹A = (A⁻¹ + B⁻¹)⁻¹.
(b) Recall Eqs. 41 and 42 in the text:
Σn⁻¹ = nΣ⁻¹ + Σo⁻¹,
Σn⁻¹µn = nΣ⁻¹mn + Σo⁻¹µo.
We have solutions
µn = Σo( Σo + (1/n)Σ )⁻¹ mn + (1/n)Σ( Σo + (1/n)Σ )⁻¹ µo,
and
Σn = Σo( Σo + (1/n)Σ )⁻¹ (1/n)Σ.
Taking the inverse on both sides of Eq. 41 in the text gives
Σn = ( nΣ⁻¹ + Σo⁻¹ )⁻¹.
We use the result from part (a), letting A = (1/n)Σ and B = Σo, to get
Σn = (1/n)Σ( (1/n)Σ + Σo )⁻¹ Σo = Σo( Σo + (1/n)Σ )⁻¹ (1/n)Σ,
which proves Eqs. 41 and 42 in the text. We also compute the mean as
µn = Σn( nΣ⁻¹mn + Σo⁻¹µo )
= Σn nΣ⁻¹ mn + Σn Σo⁻¹ µo
= Σo( Σo + (1/n)Σ )⁻¹ (1/n)Σ nΣ⁻¹ mn + (1/n)Σ( Σo + (1/n)Σ )⁻¹ Σo Σo⁻¹ µo
= Σo( Σo + (1/n)Σ )⁻¹ mn + (1/n)Σ( Σo + (1/n)Σ )⁻¹ µo.
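A short numerical check of the identity in part (a) (a sketch; the random symmetric positive-definite matrices are arbitrary and serve only to make the inverses exist):

import numpy as np

rng = np.random.default_rng(3)
d = 4
# Random symmetric positive-definite matrices, so all inverses below exist.
A = rng.normal(size=(d, d)); A = A @ A.T + d * np.eye(d)
B = rng.normal(size=(d, d)); B = B @ B.T + d * np.eye(d)

lhs = A @ np.linalg.inv(A + B) @ B
rhs = np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(B))
print("max |lhs - rhs| =", np.abs(lhs - rhs).max())   # should be near machine precision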

Section 3.5
17. The Bernoulli distribution is written
p(x|θ) = Π_{i=1}^{d} θi^{xi} (1 − θi)^{1−xi}.
Let D be a set of n samples x1, ..., xn independently drawn according to p(x|θ).
(a) We denote s = (s1, ..., sd)ᵗ as the sum of the n samples. If we denote xk = (xk1, ..., xkd)ᵗ for k = 1, ..., n, then si = Σ_{k=1}^{n} xki for i = 1, ..., d, and the likelihood is
P(D|θ) = P(x1, ..., xn|θ) = Π_{k=1}^{n} P(xk|θ)   [the xk are independent]
= Π_{k=1}^{n} Π_{i=1}^{d} θi^{xki}(1 − θi)^{1−xki}
= Π_{i=1}^{d} θi^{Σ_k xki} (1 − θi)^{Σ_k (1 − xki)}
= Π_{i=1}^{d} θi^{si}(1 − θi)^{n−si}.
(b) We assume an (unnormalized) uniform prior for θ, that is, p(θ) = 1 for 0 ≤ θi ≤ 1 and i = 1, ..., d, and have by Bayes' Theorem
p(θ|D) = p(D|θ)p(θ)/p(D).
From part (a), we know that p(D|θ) = Π_{i=1}^{d} θi^{si}(1 − θi)^{n−si}, and therefore the probability density of obtaining data set D is
p(D) = ∫ p(D|θ)p(θ) dθ = ∫₀¹ ··· ∫₀¹ Π_{i=1}^{d} θi^{si}(1 − θi)^{n−si} dθ1 dθ2 ··· dθd
= Π_{i=1}^{d} ∫₀¹ θi^{si}(1 − θi)^{n−si} dθi.
Now si = Σ_{k=1}^{n} xki takes values in the set {0, 1, ..., n} for i = 1, ..., d, and if we use the identity
∫₀¹ θ^m (1 − θ)^n dθ = m!n!/(m + n + 1)!
and substitute into the above equation, we get
p(D) = Π_{i=1}^{d} ∫₀¹ θi^{si}(1 − θi)^{n−si} dθi = Π_{i=1}^{d} si!(n − si)!/(n + 1)!.
We consolidate these partial results and find
p(θ|D) = p(D|θ)p(θ)/p(D) = Π_{i=1}^{d} θi^{si}(1 − θi)^{n−si} / [ si!(n − si)!/(n + 1)! ]
= Π_{i=1}^{d} (n + 1)!/(si!(n − si)!) θi^{si}(1 − θi)^{n−si}.
(c) We have d = 1, n = 1, and thus
p(θ1|D) = [2!/(s1!(n − s1)!)] θ1^{s1}(1 − θ1)^{n−s1} = [2/(s1!(1 − s1)!)] θ1^{s1}(1 − θ1)^{1−s1}.
Note that s1 takes the discrete values 0 and 1. Thus the densities are of the form
s1 = 0:  p(θ1|D) = 2(1 − θ1)
s1 = 1:  p(θ1|D) = 2θ1,
for 0 ≤ θ1 ≤ 1, as shown in the figure.
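The following sketch (illustrative only; the sample counts are invented and are not those of part (c)) evaluates the posterior of part (b) for a single Bernoulli parameter and confirms that it integrates to one.

import numpy as np
from math import factorial

def bernoulli_posterior(theta, s, n):
    """p(theta|D) for one component under a uniform prior:
    (n+1)!/(s!(n-s)!) * theta^s * (1-theta)^(n-s)."""
    coef = factorial(n + 1) / (factorial(s) * factorial(n - s))
    return coef * theta**s * (1.0 - theta)**(n - s)

n, s = 6, 4                                   # invented: 6 binary samples, 4 ones
theta = np.linspace(0.0, 1.0, 1001)
post = bernoulli_posterior(theta, s, n)
dtheta = theta[1] - theta[0]

print("integral of posterior ~ %.4f" % (post.sum() * dtheta))    # ~1
print("posterior mode at theta = %.3f" % theta[post.argmax()])   # near s/n = 0.667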

[Figure: the posterior densities p(θ1|D) on 0 ≤ θ1 ≤ 1 for the two cases, p(θ1|D) = 2θ1 for s1 = 1 and p(θ1|D) = 2(1 − θ1) for s1 = 0.]

18. Consider how knowledge of an invariance can guide our choice of priors. (a) We are given that s is actually the number of times that x = 1 in the first n tests. Consider the ( n + 1)st test. If again x = 1, then there are n+1 s+1 permutations of 0s and 1s in the ( n + 1) tests, in which the number of 1s is ( s + 1). Given the assumption of invariance of exchangeablility (that is, all permutations have the same chance to appear), the probability of each permutation is

 

Pinstance =

1

  n+1 s+1

.

Therefore, the probability of x = 1 after n tests is the product of two probabilities: one is the probability of having ( s + 1) number of 1s, and the other is the probability for a particular instance with ( s + 1) number of 1s, that is, Pr[xn+1 = 1

|D

n

] = Pr[ x1 +

··· + x

n

·

= s + 1] Pinstance =

p(s + 1)

  n+1 s+1

.

96CHAPTER 3. MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION The same analysis yields P (xn+1 = 0

|D

n

)=

Therefore, the ratio can be written as

|D = 0|D

Pr[xn+1 = 1 Pr[xn+1

     

(b) Given p(s)

]

n

=

n+1 s+1 n+1 s

n+1 s n+1 s+1

= =

n+1 s

.

n+1 s+1 n+1 s

p(s + 1)/

]

p(s)/

p(s + 1) for large s, we have

p(s + 1)/ p(s)/

n

p(s)

 

.

 

− − 1)! − (s + 1)!(n + 1 − s − 1)! s!(s + 1)(n + 1 − s − 1)! = s!(n + 1 − s)! s!(n + 1 − s − 1)!(n + 1 − s) s+1 . n+1−s (n + 1)! (s + 1)!(n + 1 s s!(n + 1 s)! (n + 1)!

=

We can see that for some n, the ratio will be small if s is small, and that the ratio will be large if s is large. This implies that with n increasing, if x is unlikely to be 1 in the first n tests (i.e., small s), it would be unlikely to be 1 in the (n + 1)st test, and thus s remains small. On the contrary, if there are more 1s than 0s in the first n tests (i.e., large s), it is very likely that in the ( n + 1)st test the x remains 1, and thus retains s large. (c) If p(θ) U (0, 1) for some n, then p(θ) is a constant, which we call c. Thus



1

p(s) =

 −    − θs (1

θ)n−s p(θ)dθ

0

= c

1

n s

θs (1

0

n = c s

− θ) − dθ n s

θs+1 (1 θ)n−s+1 s+1 n s+1

− −

= 0,

  

1

0

which of course does not depend upon s. 19. Consider MAP estimators, that is, ones that maximize l(θ )p(θ). (a) In this problem, the parameter needed to be estimated is µ . Given the training data, we have l(µ)p(µ) = ln[ p(

D|µ)p(µ)]

where for the Gaussian

 |  n

D|µ)]

ln[p(

= ln

p(xk µ)

k=1

PROBLEM SOLUTIONS

97 n



=

|

ln[p(xk µ)]

k=1

− n2 ln

=



n

(2π)d Σ



| |−

1 (xk 2 k=1

− µ) Σ− (x − µ) 1

t

k

and 1 p(µ) = (2π)d/2 Σ0

1/2 exp

| |

−

1 2 (µ

1

t

− m ) Σ− (µ − m ) o

0

0

The MAP estimator for the mean is then

− 

n ln (2π)d Σ 2

ˆ = arg max µ µ

 ×

1

(2π)d/2 Σ0

| |

1/2

exp

.

n



| |−

−



1 (µ 2

1 (xk 2 k=1

− µ) Σ− (x − µ) 1

t

k

− m0) Σ−1(µ − m0) t

o

 



.

(b) After the linear transfo rm governed by the matrix A, we have µ =

E[x ] = E [Ax] = AE [x] = Aµ,

and Σ

= = = = =

E [(x − µ)(x − µ ) ]  − Aµ ) ] E [(Ax − Aµ )(Ax E [A(x − µ )(x − µ ) A ] AE [(x − µ )(x − µ ) ]A t

t

t t

t

t

t

AΣA .

Thus we have the log-likelihood

   −  −  −  −  n

D |µ )]

ln[p(

= ln

k=1 n

= ln

p(xk µ )

|



|

p(Axk Aµ)

k=1



n

=

= = = =

|

ln[p(Axk Aµ)]

k=1

n d t 2 ln (2π) AΣA

|

n ln (2π)d AΣAt 2

|

n ln (2π)d AΣAt 2

|

n ln (2π)d AΣAt 2

|

n

|−

  | −  | −  | −

k=1 n k=1 n k=1 n

1 k 2 (Ax

− Aµ) (AΣA )− (Ax − Aµ) t

t

1

k

1 ((x 2

− µ) A )((A− ) Σ− A− )(A(x − µ))

1 (xk 2

− µ) (A (A− ) )Σ− (A− A)(x − µ)

1 (xk 2 k=1

t

t

t

1 t

t

1 t

− µ) Σ− (x − µ). t

1

k

1

1

1

k

1

k

98CHAPTER 3. MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION Likewise we have that the density of µ  is a Gaussian of the form 1

p(µ ) =

(2π)d/2

|Σ |1/2 exp 0

1

=

exp 1/2

(2π)d/2 Σ0 1 (2π)d/2 Σ 0 1 d/2 (2π) Σ0

| |

=

1/2

| | | |

=

1/2

exp exp

Thus the new MAP estimator is ˆ µ

= arg max µ

n

 −

k=1

− 

− − −−











|

− µ) Σ−1(xk − µ) t









 

n ln (2π)d AΣAt 2

1 (xk 2



1  1  m0 ) (µ m0 )t Σ− 0 (µ 2 1 (Aµ Am0 )t (AΣ0 At )−1 (Aµ Am0 ) 2 1 1 −1 A(µ m0 ) (µ m0 )t At (A−1 )t Σ− 0 A 2 1 1 m0 ) . (µ m0 )t Σ− 0 (µ 2





|

1

(2π)d/2 Σ0

| |

1/2

exp

−

1 (µ 2

 

− m0 ) Σ −1 ( µ − m 0 ) t

0

ˆ and see that the two equations are the same, up to a constant. We compare µ Therefore the estimator gives the appropriate estimate for the transformed mean µ ˆ . 20. Consider the problem of noninformative priors. (a) Assu me that ˜σ has a uniform distribution, that is, ˜p(˜σ ) = c, where c is a constant. We make the correspond ence ˜σ = lnσ and thus σ = exp[˜σ] = f (˜σ), for some f ( ). Since ˜p(˜ σ ) = c, we have

·

df −1 (σ) dσ dln(σ) p(˜ ˜ σ) dσ c . σ p(f ˜ −1 (σ))

p(σ) = = =

For the case c = 1, we have p(σ) = 1/σ. (b) The noninformative priors here are very simple, p(θ0 ) = p(σθ ) = 21. Note that 0 that



≤ p(θ|D) ≤ 1. In order to converge while n → ∞, it must be true

p(x |θ)p(θ |D − ) − |D) = lim →∞ p(x |θ)p(θ|D − )dθ = lim →∞ p(θ|D ). lim p(θ |D ) → p(θ ). We assume lim p(θ|D ) → p(θ)  = 0 and →∞ →∞

lim p(θ

→∞

n

1 1 = 2π 0 2π 1/σθ .

n



n 1

k

n 1

k

n Note that n ∗ x . Then the above equation implies

p(x∗ θ ) =

|

n

|

lim p(xn θ )

→∞

n

n 1

n

lim xn

n

→∞



.

PROBLEM SOLUTIONS

99 =

→∞

 

= =



| lim p(x |θ)p(θ )dθ →∞ p(x∗ |θ)p(θ )dθ

lim

n

p(xn θ)p(θ )dθ n

n

p(x∗ ).

= In summary, we have the conditions: lim p(θ n ) p(θ ) = 0

• →∞ |D →  ∗ • lim →∞ x → x • p(x∗|θ) = p(x∗), that is, p(x∗|θ) is independent of θ . n

n

n

22. Consider the Gibbs algorithm.

|



| − |

(a) Note that p(x ω2 , µ) = 0 for x µ < 1 and this implies x p(µ) = 0 for 0 µ 2. We have the following cases:



≤ ≤

− 1 < µ < x + 1 and

• x − 1 > 2 and thus x > 3 or x + 1 < 0 and thus x < −1. • 0 ≤ x − 1 ≤ 2 and thus 1 ≤ x ≤ 3. In that case, 2

|

p(x ω2 ) =



2

|

p(x ω2 , µ)p(µ)dµ =



x 1







11 3 x dµ = . 22 4

x 1

• 0 ≤ x + 1 ≤ 2 and thus −1 ≤ x ≤ 1. In that case, x+1

|

p(x ω2 ) =



x+1

|

p(x ω2 , µ)p(µ)dµ =

0



11 x+1 dµ = . 22 4

0

Therefore, the class-conditional density is:

|

p(x ω2 ) =

(b) The decision point is at

 

− − ≤ ≤ ≤ ≤

0 x< 1 (x + 1)/4 1 x 1 (3 x)/4 1 x 3 0 x > 3.



x∗ = arg [p(x ω1 )p(ω1 ) = p(x ω2 )p(ω2 )]. x

(c) We let P

|

|

≡ P (ω ) and note the normalization condition 1

P (ω1 ) + P (ω2 ) = 1.

There are two cases: (1 + x)P (x + 1)/4 (1 P ) and thus 1 /5 P 1, as shown in the figure. In this case, (1 x ∗ )P = (x∗ + 1)/4 (1 P ) and this implies x∗ = (5P 1)/(3P + 1), and the error is







P (error) =

· −

≤ ≤



· −

x∗



−1



1

|

p(x ω2 )P (ω2 )dx +

x∗

|

p(x ω1 )P (ω1 )dx =

24P 3 + 8P 2 . (3P + 1)2

[Figure: p(x|ωi)P(ωi) versus x for Problem 22, showing the ω1 and ω2 curves and the shaded error region over roughly −1 ≤ x ≤ 3.]

• 0 ≤ P ≤ 1/5, as shown in the figure. The error is

1

P (error) =



|

p(x ω1 )P (ω1 )dx = P.

−1

|

|

(d) Again, we are considering p(x ω1 )P (ω1 ) and p(x ω2 )P (ω2 ), but this time, we first use a single value of µ as the true one, that is, p(µ) = 1. There are tw o cases for different relationships between P and µ.

• p(x|ω )P (ω ) ≤ p(x|ω , µ)P (ω ) which implies P ≤ 1/2 · (1 − P ) or 0 ≤ P ≤ 1/3. This case can be further divided into two sub-cases: subcase 1: 0 ≤ µ ≤ 1 1

1

2

2

The error is



1

P (error) =

1 (1 2

0

subcase 2: 1 < µ The error is

− (−1)) ·

−  µ2 2

P

dµ = P

− P/6 = 5P/6.

≤2



2

P2 (error) =

(2

− µ) 2

2

P dµ = P /6

1

and the total error for this case is 5P P + = P. 6 6 1. This case can be further subdivided into three

PGibbs(error) = P 1 (error) + P2 (error) =

• Other case: 1 /3 ≤ P ≤ subcases. subcase 1: 0

µ

(1

P )/(2P ) as shown in the figure.

≤ ≤− −

The error is

 − − −   − −  − −

(1 P )/(2P )

P1 (error) =

1

P

4

9 2



1 2P

0

1 2

=

(1

2

1

P

2

− P ) (31P − 7) . 48P 2



1

3P 2P

−µ+1

 



PROBLEM SOLUTIONS

[Figure: p(x|ωi)P(ωi) versus x for Problem 22, with the ω1 and ω2 curves and the error region indicated.]

subcase 2: (1 P )/(2P ) The error is

3

≤ µ ≤ (5P − 1)/(2P ), as shown in the figure. (5P

P2 (error) =



−1)/(2P )



1

−P 4

(1 P )/(2P )

(5P

= subcase 3: (5P The error is

x

− 9 2



− 2P1





− 1)(1 − P )(3P − 1) . 8P 2

− 1)/(2P ) ≤ µ ≤ 2, as shown in the figure. 2

P (error) = 3 (5P

P 6

=

P

−1)/(2P ) 1

P 2P

µ)2 dµ

(2



2

 − 

3

.

Thus the probability of error under Gibbs sampling is PGibbs(error) = =

P1 (error) + P2 (error) + P3 (error) 5P 2 6P + 1 4P





(e) It can be easily confir med that P Gibbs(error) < 2PBayes (error). Section 3.6

| D

|

| 

23. Let s be a sufficient statistic for which p(θ s, ) = p(θ s); we assume p(θ s) = 0. In that case, we can write Bayes’ law p( as p(

D|s, θ) = p(θ|s,p(Dθ)|s)p(D|s)

D|s, θ)

= =

D

p( , s, θ ) p(s, θ) p(θ s, )P ( , s) p(θ s)p(s)

| D D |

102CHAPTER 3. MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION p(x|ωi)P(ωi) 0.5 0.4

ω2 0.3 0.2

ω1 0.1

error -1

0

12

3

x

| D D| | | D D| |

p(θ s, )p( s)p(s) p(θ s)p(s) p(θ s, )p( s) . p(θ s)

= =

Note that the probability density of the parameter θ is fully specified by the sufficient statistic; the data gives no further information, and this implies

| D

|

p(θ s, ) = p(θ s).

| 

Since p(θ s) = 0, we can write p(

s, θ) =

D|

= =

| D D|s) p(θ|s) p(θ|s)p(D|s) p(θ |s) p(D|s), p(θ s, )p(

D|

which does not involve θ . Thus, p( s, θ ) is indeed independent of θ . 24. To obtain the maximum-likelihood estimate, we must maximize the likelihood function p( θ) = p(x1 ,..., xn θ) with respect to θ . However, by the Factorization Theorem (Theorem 3.1) in the text, we have

D|

|

p(

D|θ) = g(s, θ)h(D),

where s is a sufficient statistic for θ. Thus, if we max imize g(s, θ ) or equivalently [g(s, θ)]1/n , we will have the maximum-likelihoood solution we seek. For the Rayleigh distribu tion, we have from Table 3.1 in the text, [g(s, θ)]1/n = θe −θs for θ > 0, where n

s=



1 x2 . n k=1 k

Then, we take the derivative with respect to θ and find 1/n

∇ [g(s, θ)] θ

= e −θs

− sθe−

θs

.

PROBLEM SOLUTIONS

103 p(x|ωi)P(ωi) 0.5

ω2

0.4 0.3

ω1 0.2 0.1

x

error -1

0

12

3

We set this to 0 and solve to get ˆ

e−θs

ˆ

ˆ −θs , sθe

=

which gives the maximum-likelihood solution, 1 θˆ = = s

 n

1 x2 n k=1 k

−1 .

We next evaluate the second derivative at this value of θˆ to see if the solution represents a maximum, a minimum, or possibly an inflection point: 2 θ

1/n

∇ [g(s, θ)]

θ=θˆ



−se− − se− + s θe− e− (s θˆ − 2s) = −se− < 0. θs

=

θs

θs

2

θ=θˆ

θˆs

=

2

1



Thus θˆ indeed gives a maximum (and not a minimum or an inflection point). 25. The maximum-likelihood solution is obtained by maximizing [ g(s, θ)]1/n . From Table 3.1 in the text, we have for a Maxwell distribution [g(s, θ)]1/n where s =

1 n

n



k=1

=

θ3/2 e−θs

x2k . The derivative is 1/n

∇ [g(s, θ)] θ

=

3 1/2 −θs θ e 2

− sθ

3/2

e−θs .

We set this to zero to obtain 3 1/2 −θs θ e = sθ 3/2 e−θs , 2 and thus the maximum-likelihood solution is 3/2 3 θˆ = = s 2

 n

1 x2 n k=1 k

−1 .

We next evaluate the second derivative at this value of θˆ to see if the solution represents a maximum, a minimum, or possibly an inflection point: 2 θ

1/n

∇ [g(s, θ)]



θ=θˆ

=

3 1 1/2 −θs θ e 22

− 32 θ

1/2

se−θs

− 32 θ

1/2

se−θs + s2 θ3/2 e−θs



θ =θˆ

104CHAPTER 3. MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION p(x|ωi)P(ωi) 0.5 0.4

ω1

0.3

ω2 0.2 0.1

-1

= =

error 0

12

x

3

3 ˆ−1/2 −θs ˆ ˆ ˆ θ e 3sθˆ1/2 e−θs + s2 θˆ3/2 e−θs 4 3 3 9 ˆ−1/2 3 ˆ−1/2 −3/2 e−3/2 3 + θ = θ e < 0. 4 2 4 2

 −− 



Thus θˆ indeed gives a maximum (and not a minimum or an inflection point). 26. We fine the maximum-likelihood solution by maximizing [g(s, θ)]1/n . In this case, we have d

[g(s, θ)]1/n =



θisi

i=1

and t d)

s = (s1 ,...,s

=

1 n

n



xk .

k=1

Our goal is to maximize [ g(s, θ)]1/n with respect to θ over the set 0 < θi < 1, i = d

1,...,d

subject to the constraint



θi = 1. We set up the objective function

i=1

 −   −  d

l(θ, λ s) = ln [ g(s, θ)]1/n + λ

|

θi

1

i=1

d



=

d

si ln θ i + λ

i=1

θi

i=1

The derivative is thus (

∇ l) θ

i

δl si = δθ i = θi + λ

for i = 1,...,d . We set this derivative to zero and get si = θˆi

−λ,

which yields our intermediate solution: θˆi =

− sλ . i

1 .

PROBLEM SOLUTIONS

105 p(x|ωi)P(ωi) 0.5 0.4

ω1

0.3

ω2 0.2 0.1

-1

error 0

12

d

We impose the normalization condition,

 

x

3

θˆi = 1, to find λ, which leads to our final

i=1

solution θˆi =

sj

d

si

i=1

for j = 1,...,d . 27. Consider the notion that sufficiency is an integral concept. (a) Using Eq. 51 in the text, we ha ve n

p(

D|

θ)

=

p(x θ )

 √ | √  √  √  √  k=1 n

=

k=1

= = =

1 exp 2πσ

1 2πσ

n

1 2πσ

n

1 2πσ

n

− −    −   −  −    −  −  − − 

exp

1 (xk 2σ 2

1 2σ 2

n

x2k

2µxk + µ2

n

n

x2k

k=1



n

µ2

xk +

k=1

1 ns2 2σ 2

exp

2

k=1

1 2σ 2

exp

µ)2

2ns1 + nµ2

k=1

.

Now let g(s, θ) be a Gaussian of the form g(s, θ ) =

1 2πσ

n

1 (ns2 2σ2

exp

2µns1 + nµ2 )

D

and h( ) = 1. Then we can write p(

D|θ) = g(s, θ)h(D).

According to Theorem 3.1 in the text, the statistic s is indeed sufficient for θ . (b) We apply the result from part (a) to the case of g (s1 ,µ,σ 2 ), that is, g(s1 ,µ,σ 2 ) =

√   1 2πσ

n

exp

1 ( 2µns1 + nµ2 ) 2σ2





106CHAPTER 3. MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION p(x|ωi)P(ωi) 0.5 0.4

ω1

0.3

ω2 0.2 0.1

error -1

0

12

3

x

and h( , σ 2 ) = exp

D

 

1 ns2 . 2σ2

In the general case, the g(s1 ,µ,σ 2 ) and h( , σ2 ) are dependent on σ 2 , that is, there is o way to havep( µ, σ2 ) = g(s1 , µ)h( ). Thus according to Theorem 3.1 in the text, s 1 is not sufficient for µ. However, if σ 2 is known, then g(s1 ,µ,σ 2 ) = g(s1 , µ) and h( , σ 2 ) = h( ). Then we indeed have p( µ) = g(s1 , µ)h( ), and thus s 1 is sufficient for µ.

D D

D|

D

D

D|

D

(c) As in part (b), we let g(s2 ,µ,σ 2 ) =

1

n

exp

√   2πσ

1

(ns2

2σ 2



2µns1 + nµ2 )



and h( ) = 1. In the gen eral case, the g(s2 ,µ,σ 2 ) is dependent upon µ, that is, there is no way to have p( µ, σ 2 ) = g(s2 , σ2 )h( ). Thus according to Theorem 3.1 in the text, s 2 is not sufficient for σ 2 . However, if µ is known, then g(s2 ,µ,σ 2 ) = g(s2 , σ2 ). Then we have p( σ 2 ) = g(s2 , σ 2 )h( ), and thus s2 is sufficient for σ 2 .

D

D|

D

D|

D

| D

|

28. We are to suppose that s is a statistic for which p(θ x, ) = p(θ s), that is, that s does not depend upon the data set .

D

| D

|

| 

(a) Let s be a sufficient statistic for which p(θ s, ) = p(θ s); we assume p(θ s) = 0. In that case, we can write Bayes’ law p(

D|s, θ) = p(θ|s,p(Dθ)|s)p(D|s)

as p(

D|s, θ)

= = = =

D

p( , s, θ ) p(s, θ) p(θ s, )P ( , s) p(θ s)p(s) p(θ s, )p( s)p(s) p(θ s)p(s) p(θ s, )p( s) . p(θ s)

| D D | | D D| | | D D| |

PROBLEM SOLUTIONS

107

Note that the probability density of the parameter θ is fully specified by the sufficient statistic; the data gives no further information, and this implies

| D

|

p(θ s, ) = p(θ s).

| 

Since p(θ s) = 0, we can write p(

| D D|s) p(θ |s) p(θ |s)p(D|s) p(θ|s) p(D|s), p(θ s, )p(

s, θ ) =

D|

= =

D|s, θ) is indeed independent of θ . (b) Assume the variable x comes from a uniform distribution, p(x) ∼ U (µ, 1). Let s= x be a statistic of a sample data set D of x, that is, s is the sample mean of D . Now assume that we estimate µ = µ = 0, but from a particular sample data set D we have s = 5. Note that we esti mate p(x) ∼ U (0, 1), we have p(µ |s , D ) = p(µ |s ) = 0, that is, it is impossible that given the data set D whose mean is s = 5 that the distribution is p(x) ∼ U (0, 1). Therefore, if given µ = µ = 0 and s = 5, there does not exist a data set D whose mean is s = 5 but whose elements are distributed according to p(x) ∼ U (0, 1). Said another way, we have p( s , µ ) = 0 for these particular values of s and µ . However, p(D |s ) is notDnecessarily zero. It is easy to set up a data set whose |  p(D |s ). Therefore, p(θ|s) = 0, as required mean is s , giving p(D |s , µ ) = which does not involve θ . Thus, p(

1 n

n



k

k=1

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

for the proof.

29. Recall the Cauchy distribution, p(x) =

1 1 πb 1 + x−a b

·

(a) We integrate p(x) over all x, and have





p(x)dx =

−∞ We substitute y = (x 1 πb



  

−∞

1

1+

dx x a 2 b



1 πb

 



2.

  

−∞

1

1+

dx. x a 2 b



− a)/b and dx = b dy and find =

1 π





−∞

1 1 dy = tan−1 [y] 1 + y2 π



∞ −∞ =

We find, then, that indeed p(x) is normalized. (b) According to the definition of the mean, we have



µ=



−∞

xp(x)dx =

1 x πb 1 + x−a

  b

2

dx.

1 π

 − −  π 2

π 2

= 1.

108CHAPTER 3. MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION

− a)/b and dx = b dy. Then the mean is

As in part (a), we let y = (x 1 πb

µ =

by + a b dy 1 + y2

−∞



1

=



      

π



y

b

1 + y2

−∞

1

dy +a

−∞

b dy

1 + y2

0

y 2)

 

.

Note that f (y) = y/(1 + is an odd function, and thus the first integral vanishes. We substitute our result from part (a) and find µ=

a π





−∞

1 b dy = a. 1 + y2

This conclusion is satisfying, since when x = a, the distribution has its maximum; moreover, the distribution dies symmetrically around this peak. According to the definition of the standard deviatio n, σ2

∞ =

x2 p(x)dx

−∞ ∞ = We substitute y = (x σ2

=

= = =

   ∞

1 πb 1 π 1 π



−∞ ∞

1 πb

x2

−∞

1

1+

dx x a 2 b

  −

2

−µ .

− a)/b and dx = bdy, and have

b2 y 2 + 2aby + a2 b dy 1 + y2



2

b dy + 2ab

−∞

∞.



2

−µ



−∞

+ 2ab 0 + (a2

·

2

−µ

y dy + (a2 1 + y2 2



2



−b )



−∞

2

− b )π − µ

1 dy 1 + y2

 −

µ2

This result is again as we would expect, since according to the above calculation, 2

E [(x − µ) ] = ∞.

(c) Since we know θ = (µ )t = (a )t , for any defined s = (s1 ,...,s n ), si < , it is impossible to have p( s, (a )t ) = p( s). Thus the Cauchy distribution has no sufficient statistics for the mean and standard deviation.



D|

∞ ∞



D|

30. We consider the univariate case. Let µ and σ 2 denote the mean and variance of the Gaussian, Cauchy and binomial distributions and µ ˆ = 1/n ni=1 xi be the sample mean of the data points x i , for i = 1,...,n .



PROBLEM SOLUTIONS

109

(a) According to Eq. 21 in the text, we hav e σn2

= = = =

Note that

E [(x − µ)(ˆµ i

  −  E  − − −  − −  − E   − − − −  E − − E − − E −  −     − E − −   −   E − −    E − E − − −      n

E n −1 1 1

n n

1

1

µ ˆ )2

((xi

µ)

µ))2

(ˆ µ

i=1 n

µ)2

(xi

1

2(xi

µ)(ˆ µ

µ)2

µ) + (ˆµ

i=1

n

1

n

(xi

i=1 n

1

µ)2 ]

[(xi

2 [(xi

µ)(ˆ µ

µ)2 ] .

µ)] + [(ˆ µ

i=1

µ)] =

(xi

=

(xi

xj

µ

j=1

xi

µ)

1 (xi n

=

n

1 n

µ)

µ

n

n

+

1 n k=1; k

1 (xi n

µ)2 +

xk

µ

=i

n

µ)

xk



k=1; k =i

=

1 2 σ + 0 = σ 2 /n. n

Similarly, we have the variance, 2

2

E [(ˆµ − µ) ] = σ /n, and furthermore σn2

  − n

1

=

n n n

=

1

σ2

i=1

− 1σ −1

2

− n2 σ

2

+

1 2 σ n



= σ 2.

Thus indeed the estimator in Eq. 21 in the text is unbiased. (b) The result in part (a) applies to a Cauchy distribution. (c) The result in part (a) applies to a Binomial distrib ution. (d) From Eq. 20 in the text, we have lim

→∞

n

that is, asymptotically unbiased.

n

− 1σ n

2

= σ 2,

(n

1)µ

Section 3.7
31. We assume that a and b are positive constants and n a variable. Recall the definition on page 633 in the text that
O(g(x)) = { f(x) : there exist positive constants c and x0 such that 0 ≤ f(x) ≤ c g(x) for all x ≥ x0 }.
(a) Is a^{n+1} = O(a^n)? Yes. If we let c = a and x0 = 1, then c·a^n = a^{n+1} ≥ a^{n+1} for x ≥ 1.
(b) Is a^{bn} = O(a^n)? No. No matter what constant c we choose, we can always find a value of n for which a^{bn} > c·a^n.
(c) Is a^{n+b} = O(a^n)? Yes. If we choose c = a^b and x0 = 1, then c·a^n = a^{n+b} ≤ a^{n+b} for x > 1.
(d) Clearly f(n) = O(f(n)). To prove this, we let c = 1 and x0 = 0. Then of course f(x) ≤ f(x) for x ≥ 0. (In fact, we have f(n) = f(n).)
32. We seek to evaluate f(x) = Σ_{i=0}^{n−1} ai x^i at a point x where the n coefficients ai are given.
(a) A straightforward Θ(n²) algorithm is:
Algorithm 0 (Basic polynomial evaluation)
 1  begin initialize x, a0, ..., a_{n−1}
 2    f ← a0
 3    i ← 0
 4    for i ← i + 1
 5      x0 ← 1
 6      j ← 0
 7      for j ← j + 1
 8        xj ← x_{j−1} · x
 9      until j = i
10      f ← f + ai xi
11    until i = n − 1
12  end
(b) A Θ(n) algorithm that takes advantage of Horner's rule is:
Algorithm 0 (Horner evaluation)
 1  begin initialize x, a0, ..., a_{n−1}
 2    f ← a_{n−1}
 3    i ← n
 4    for i ← i − 1
 5      f ← f · x + a_{i−1}
 6    until i = 1
 7  end
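For concreteness, here is a runnable version of the two evaluation schemes (a sketch in Python rather than the manual's pseudocode; the test polynomial is arbitrary). The naive version performs Θ(n²) multiplications because it rebuilds each power of x, while Horner's rule performs Θ(n).

def poly_naive(a, x):
    """Evaluate sum_i a[i]*x**i by recomputing each power: Theta(n^2) multiplications."""
    total = 0.0
    for i, coef in enumerate(a):
        power = 1.0
        for _ in range(i):          # rebuild x**i from scratch
            power *= x
        total += coef * power
    return total

def poly_horner(a, x):
    """Evaluate the same polynomial with Horner's rule: Theta(n) multiplications."""
    f = 0.0
    for coef in reversed(a):        # f = (...((a_{n-1} x + a_{n-2}) x + ...) x + a_0
        f = f * x + coef
    return f

a = [3.0, -1.0, 0.5, 2.0]           # a_0 + a_1 x + a_2 x^2 + a_3 x^3, arbitrary coefficients
print(poly_naive(a, 1.7), poly_horner(a, 1.7))   # identical values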

33. The complexities are as follows:
(a) O(N)
(b) O(N)
(c) O(J(I + K))
34. We assume that the uniprocessor can perform one operation per nanosecond (10⁻⁹ second). There are 3600 seconds in an hour, 24 × 3600 = 86,400 seconds in a day, and 31,557,600 seconds in a year. Given one operation per nanosecond, the total numbers of basic operations in the periods are: 10⁹ in one second; 3.6 × 10¹² in one hour; 8.64 × 10¹³ in one day; 3.156 × 10¹⁶ in one year. Thus, for the table below, we find n such that f(n) is equal to these total numbers of operations. For instance, for the "1 day" entry in the n log₂n row, we solve for n such that n log₂n = 8.64 × 10¹³.
f (n) operations:

1sec 10 9 9

log2 n

√n

n nlog2 n * n2 n3 2n en n! **

1hour 3.6

12



× ×



12

3.6×10 1.9×10 √3.6 ×10 1.53×10 log (3.6×10 ) 41.71 ln(3.6×10 ) 16.1 28.91 6

3

3

9

13

3.6 10 10 9.86 √ 10

√ × × √10 = 10

log2 109 = 29.90 ln109 = 10.72 13 .1

12

2

1year 13

× 10

3.156

4 12

12

8.64 2.11 √

23.156×10 16 1010 (3.156 1016 )2 9.96 1032 16



× ×



× 10 × 10

13

8.64×10 9.3×10 √8.64 ×10 4.42×10 log (8.64×10 ) 46.30 ln(8.64×10 ) 17.8 32.09

× ×

× ×

16

3.156×10 1.78×10 √3.156 ×10 3.16×10 log (3.156×10 ) 54.81 ln(3.156×10 ) 19.3 38.0

6

8

3

13

4 13

2



3.156 10 6.41 1014 √

12

3

16

× 10 16

28.64×10 13 1010 (8.64 1013 )2 7.46 1027 13



×  ×

(10 ) = 10 109 3.9 107 109 3.16 104 3

1day 8.64

12

23.6×10 12 1010 (3.6 1012 )2 1.3 1025 12

210 4.6 10301029995 9 2 18

×

× 10

16

5

2

13

16

16

* For entries in this row, we solve numerically for n such that, for instance, n log₂n = 3.9 × 10⁷, and so on.
** For entries in this row, we use Stirling's approximation — n! ≃ (n/e)ⁿ for large n — and solve numerically.
35. Recall first the sample mean and sample covariance based on n samples:
mn = (1/n) Σ_{k=1}^{n} xk,
Cn = (1/(n − 1)) Σ_{k=1}^{n} (xk − mn)(xk − mn)ᵗ.
(a) The computational complexity for mn is O(dn), and for Cn is O(dn²).
(b) We can express the sample mean for n + 1 observations as
m_{n+1} = (1/(n + 1)) Σ_{k=1}^{n+1} xk = (1/(n + 1)) [ Σ_{k=1}^{n} xk + x_{n+1} ]
= (1/(n + 1)) [ n mn + x_{n+1} ]
= mn − (1/(n + 1)) mn + (1/(n + 1)) x_{n+1}
= mn + (1/(n + 1)) (x_{n+1} − mn).
For the sample covariance, we start from
C_{n+1} = (1/n) Σ_{k=1}^{n+1} (xk − m_{n+1})(xk − m_{n+1})ᵗ
= (1/n) [ Σ_{k=1}^{n} (xk − m_{n+1})(xk − m_{n+1})ᵗ + (x_{n+1} − m_{n+1})(x_{n+1} − m_{n+1})ᵗ ].
Substituting m_{n+1} = mn + (x_{n+1} − mn)/(n + 1), we have for k ≤ n that xk − m_{n+1} = (xk − mn) − (x_{n+1} − mn)/(n + 1), and x_{n+1} − m_{n+1} = (n/(n + 1))(x_{n+1} − mn). Using Σ_{k=1}^{n} (xk − mn) = 0, the first sum becomes
Σ_{k=1}^{n} (xk − mn)(xk − mn)ᵗ + (n/(n + 1)²)(x_{n+1} − mn)(x_{n+1} − mn)ᵗ = (n − 1)Cn + (n/(n + 1)²)(x_{n+1} − mn)(x_{n+1} − mn)ᵗ,
while the last term equals (n²/(n + 1)²)(x_{n+1} − mn)(x_{n+1} − mn)ᵗ. Combining, and noting that n/(n + 1)² + n²/(n + 1)² = n/(n + 1), we find
C_{n+1} = ((n − 1)/n) Cn + (1/(n + 1)) (x_{n+1} − mn)(x_{n+1} − mn)ᵗ.
(c) To compute mn from scratch requires O(dn) operations. To update mn according to
m_{n+1} = mn + (1/(n + 1))(x_{n+1} − mn)
requires O(d) operations, one for each component. To update Cn according to
C_{n+1} = ((n − 1)/n) Cn + (1/(n + 1)) (x_{n+1} − mn)(x_{n+1} − mn)ᵗ,
given Cn, mn and x_{n+1}, requires O(d) operations for computing x_{n+1} − mn and O(d²) operations for computing (x_{n+1} − mn)(x_{n+1} − mn)ᵗ. Thus C_{n+1} is computed in O(d²) operations, given Cn, mn and x_{n+1}. If we must compute C_{n+1} from scratch, then we need O(dn²) operations.
(d) The recursive method is on-line, and the classifier can be used as the data is coming in. One possible advantage of the non-recursive method is that it might avoid compounding roundoff errors in the successive additions of new information from samples, and hence be more accurate.

(d) The recursive method is on-line, and the classifier can be used as the data is coming in. One possib le advantage of the non- recursive method is that it might avoid compounding roundoff errors in the successive additions of new information from samples, and hence be more accurate. 1 36. We seek a recursive method for computing C − n+1 .

(a) We first seek to show (A + xxt )−1 = A −1

A−1 xxt A−1 . 1 + xt A−1 x

−m

t n)

−m ) n



t



PROBLEM SOLUTIONS

113

We prove this most simply by verifying the inverse as:



(A + xxt ) A−1

−1

−1

t

− A1 + xxxAA− x t

1



AA −1 xxt A−1 xx t A−1 xxt A−1 = AA−1 + xxt A−1 1 + xt A−1 x 1 + xt A−1 x xxt A−1 xxt A−1 t −1 t −1 =I xA x t t 1 + xx A 1





1 + x A− x − 1 + x A− x − 1 x A− x − = I + x xA− 1 − − 1 + x A x 1 + x A− x − = I + x xA · 0

   a scalar

t

1

t

1

t

1

1

1

t

   a scalar

t

= I.

Note the left-hand-side and the right-hand-side of this equation and see (A + xt x)−1

A −1

=

−1

−1

t

− A1 + xxxAA− x . t

1

(b) We apply the result in part (a) to the result discusse d in Problem 35 and see that Cn+1

− 1 C + 1 (x − m )(x − m ) n n+1 n−1 n C + − m )(x − m ) (x n n −1 n−1 (A + xx ). n

=

n

= =

n

We set A = C n and x = 1 C− n+1

= =

=

=

− n

1

   −

n 1 C− n n+1 n

1

n

1 C− n



n

n+1

n+1

2

n

n

n+1

t

n

t

n n2 1 (xn+1



−1

t



− m ) and substitute to get n



n (A + xxt )−1 n 1 A −1 xxt A−1 1 + xt A−1 x

(A + xxt )

n n A−1 n+1

n

 

n+1

1 C− n

=









− m ) − (x − m ) C− − − 1+ − (x − m ) C − (x − m ) − C−− +(x(x −−mm)(x) C− −(xm ) −Cm− ) ,



n1 n2 1 n

n n2 1 (xn+1

n

n n2 1

n



n+1

n+1

n+1

n

n

t

n+1 1 t n

where we used our result from part (a).

n n2 1

n

1

n+1

n n2 1

n t

n

n+1

t

n

n

1



  n1

n+1

n

(c) We determine the computat ional complexity as follows. Consider the formula 1 for C − n+1 : 1 C− n+1 =

n n

−1



1 C− n



n2 1 n



uut t + u (xn+1

−m ) n



,

114CHAPTER 3. MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION 1 1, x where u = C − mn ) is of O (d2 ) complexity, given that C − and n (xn+1 −1 n n+1 1 2 mn are known. Hence, clearly C − n can be computed from C n−1 in O(d ) operations, as uu t , ut (xn+1 mn ) is computed in O(d2 ) operations. The complexity 1 2 associated with determining C − n is O(nd ).

− −
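The following sketch (illustrative only; the data and initialization size are arbitrary) maintains Cn⁻¹ with the update of part (b) and compares the result with directly inverting the batch sample covariance.

import numpy as np

rng = np.random.default_rng(5)
d, n_total, n0 = 3, 300, 10
data = rng.normal(size=(n_total, d))

m = data[:n0].mean(axis=0)                        # m_{n0}
C_inv = np.linalg.inv(np.cov(data[:n0], rowvar=False))   # C_{n0}^{-1}

for n in range(n0, n_total):                      # n samples absorbed so far
    v = data[n] - m                               # x_{n+1} - m_n
    u = C_inv @ v                                 # u = C_n^{-1} (x_{n+1} - m_n)
    denom = (n**2 - 1) / n + v @ u
    C_inv = n / (n - 1) * (C_inv - np.outer(u, u) / denom)   # Problem 36(b) update
    m = m + v / (n + 1)

C_batch_inv = np.linalg.inv(np.cov(data, rowvar=False))
print("max difference:", np.abs(C_inv - C_batch_inv).max())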

37. We assume the symmetric non-negative covariance matrix is of otherwise general form:
Σ =
( σ11 σ12 ··· σ1n )
( σ21 σ22 ··· σ2n )
(  ⋮    ⋮   ⋱   ⋮  )
( σn1 σn2 ··· σnn ).
To employ shrinkage of an assumed common covariance toward the identity matrix, Eq. 77 requires
Σ(β) = (1 − β)Σ + βI = I,
and this implies (1 − β)σii + β·1 = 1, and thus
σii = (1 − β)/(1 − β) = 1
for all 0 < β < 1. Therefore, we must first normalize the data to have unit variance. Section 3.8 38. Note that in this problem our densities need not be normal. (a) Here we have the criterio n function J1 (w) =

(µ1 µ2 )2 . σ12 + σ22



We make use of the following facts for i = 1, 2: y µi

= wt x 1 = ni y

   −  1 ni

y=

∈Y

σi2

=

(y

y

Σi

i

wt x = w t µi

∈D

x

2

µi ) = w

i

t

∈Y

∈D

x

i

=

(x x

(x

i

i

− µ )(x − µ ) i

i

t

− µ )(x − µ ) . i

i

∈D



We define the within- and between-scatter matrices to be SW SB

= Σ1 + Σ2 = (µ1 µ2 )(µ1



2

Then we can write σ12 + σ22 (µ1 µ2 )2



t

−µ ) .

= w t SW w = w t SB w.

t



w

PROBLEM SOLUTIONS

115

The criterion function can be written as w t SB w . w t SW w

J1 (w) =

For the same reaso n Eq. 103 in the text is maximized, we have that J1 (w) 1 µ 2 ). In sum, that J1 (w) is maximized at w = is maximized at wS− W ( µ1 (Σ1 + Σ2 )−1 (µ1 µ2 ).





(b) Consider the criterion function (µ1 µ2 )2 . P (ω1 )σ12 + P (ω2 )σ22



J2 (w) =

Except for letting SW = P (ω1 )Σ1 + P (ω2 )Σ2 , we retain all the notations in part (a). Then we write the criterion function as a Rayleigh quotient w t SB w . w t SW w

J2 (w) =

For the same reason Eq. 103 is maximized, we have that J 2 (w) is maximized at w = (P (ω1 )Σ1 + P (ω2 )Σ2 )−1 (µ1

− µ ). 2

(c) Equation 96 of the text is more closely related to the criterion function in part (a) above. In Eq. 96 in th e text, we let ˜mi = µi , and ˜s2i = σi2 and the statistical meanings are unchanged. Then we see the exact correspondence between J (w) and J 1 (w). 39. The expression for the criterion function 1 n1 n2

J1 =



yi

∈Y

1

yj

(yi

∈Y

2

−y ) j

2

clearly measures the total within-group scatter. (a) We can rewrite J 1 by expanding J1

= =

1 n1 n2 1 n1 n2

   ∈Y

yi

∈Y

1 yj

1

+2(yi =

1 n1 n2

=

1

∈Y

2

(yi

∈Y

2

−m ) 1

j

1

1 n1 n2

1

yj

1 n1 n2

+ (yj

−m )

1

2

∈Y

1

i

2

1

2

2

+ (m1

2

2(yi

yi

∈Y

1

yj

yj

−m ) 2

∈Y

2

1

m2 )(m1

2

2

−m ) , 2

1

2

 ∈Y

1

yj

j

j

∈Y

2

−m ) 2

j

2

2

2

2

1

1

2

2

2

2

− m )(y − m ) + n 1n

∈Y

1 2 1 2 s + s + (m1 n1 1 n2 2

1

1 2 y i

∈Y

2(yj

yi

2

2

i

∈Y

2

j

− m )(y − m ) + 2(y − m )(m − m ) + 2(y − m )(m − m ) (y − m ) + 1 (y − m ) + (m − m ) n n

   −

yi

+ +

yj

2

− m ) − (y − m ) + (m − m )]

[(yi

yi



1 2 y i

∈Y

1

yj

2(yi

∈Y

2



− m )(m − m ) 1

1

2

116CHAPTER 3. MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION where m1 m2

 

=

∈Y

yj

∈Y

=

s21

=

s22

=

yi

yi

1

yj 2

yi

∈Y

yj

∈Y

2

(yi

−m )

(yj

−m ) .

1

1



2

2

2

(b) The prior probability of class one is P (ω1 ) = 1/n1 , and likewise the prior probability of class two is P (ω2 ) = 1/n2 . Thus the total within-class scatter is 1 2 1 2 s + s . n1 1 n2 2

J2 = P (ω1 )s21 + P (ω2 )s22 = (c) We write the criterion funct ion as J (w) =

(m1

2 + 2 1 2 n1 s1 +

−m )

1 2 1 2 n1 s1 + n2 s2 1 2 n2 s2

(m1 m2 )2 1 2 1 2 + 1, n1 s1 + n2 s2



=

then J (w) = J1 /J2 . Thus optimizing J1 subject to the constraint J2 = 1 is equivalent to optimizing J (w), except that the optimal solution of the latter is not unique. We let y = w t x, and this implies (m1

−m ) 2

= w t (m1 m2 )(m1 m2 )t w 1 2 1 2 1 1 = s + s = wt S1 + S2 w n1 1 n2 2 n1 n2

2



J2







where the mean and scatter matrices are mi Si

  −

1 ni

= =

x

∈D

x

i

(x

i

∈D

x

t

−m ) ,

mi )(x

i

respectively. Thus the criterion function is J (w) =

wt (m1 wt

t

− m )(m − m ) w + 1. 2

1 n 1 S1

1

+



2

1 n2 S2

w



For the same reason Eq. 103 in the text is optimized, we have that maximized at w=λ



1 1 S1 + S2 n1 n2



−1

(m1

− m ).

We need to guarantee J 2 = 1, and this requires wt





1 1 S1 + S2 w = 1, n1 n2

2

J (w) is

PROBLEM SOLUTIONS

117

which in turn requires λ2 (m1

t

−m ) 2



1 1 S1 + S2 n1 n2



−1

(m1

− m ) = 1. 2

This equation implies that the parameter λ has the value



1

m2 ) t

λ = (m1





S1 +

n1

1

−1

S2

n2

1/2

m2 )

(m1





.



40. In the below, W is a d-by-n matrix whose columns correspond to n distinct eigenvectors.

{ }

(a) Suppose that the set ei are normalized eigenvectors, then we have eti SB ej eti SW ej

= λi δij = λi δij ,

where δij is the Kronecker symbol. We denote the matrix consisting of eigenvectors as W = [e1 e2 . . . en ]. Then we can write the with in scatter matrix in the new representation as

˜W S

= W t SW W =

=

 

  · · ·    ··· et1 et2 .. .

SW (e1 e2

···e

n)

etn

et1 SW e1 .. .

..

etn SW e1

.

et1 SW en .. .

= I.

etn SW en

Likewise we can write the between scatter matrix in the new representation as

˜B S

= W t SB W =

=

 

et1 SB e1 .. .

 

et1 et2 .. . etn

··· ..

.

etn SB e1

 

SB (e1 e2

et1 SB en .. . etn SB en

···

 

=

···e

 

n)

λ1 0 .. .

0 λ2 .. .

0

0

··· ··· ..

.

0 0 .. . λ

···

n

 

.

˜ W is an n-by-n matrix and S ˜ B is a diagonal matrix whose elements are Thus S the corresponding eigenvalues. (b) Here we have the determinant of the transformed between-scatter matrix is just the product of the eigenvalues, that is,

|S˜ | = λ λ · · · λ B

1 2

n,

˜ W = 1. Then the criterion function is simply J = λ 1 λ2 and S

| |

···λ

n.

118CHAPTER 3. MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION (c) We make the following definitions: ˜t W

  

˜W S

=

QDWt

=

˜ t SW W ˜ = QDWt SW WDQt . W

˜ W = D 2 and Then we have S

| |

˜B = W ˜ t SB W ˜ = QDWt SB WDQt = QD S ˜ B DQt , S

   | |

˜ B = D 2 λ1 λ2 then S

···λ

n.

This implies that the criterion function obeys J=

    ˜B S

˜W S

,

and thus J is invariant to this transformation.

| ∼ D ˜ ) = √ 1 exp[−(y − µ p(y|θ ˜ ) /(2˜s )], 2π˜ s

41. Our two Gaussian distributions are p(x ωi ) N (µi , Σ) for i = 1, 2. We denote the samples after projection as ˜ i and the distributions 2

i

˜i = and θ

r



=

µ ˜i s˜

for i = 1, 2. The log-likelihood ratio is n

ln ˜˜ 1 lnp( ˜ θ ˜ 2 )) = lnp( θ ln

D| D|

ln

k=1 n

ln

k=1

=

=

c1 +

(yk µ ˜ 1 )2 2˜ s2

1 √2π˜ exp s

(yk µ ˜ 2 )2 s2 2˜

∈D

yk

∈D

1 2

+

c1 +

1 2

c1 +

1 2

(yk µ ˜ 1 )2 2˜ s2 (yk µ ˜ 2 )2 s2 2˜



ln

k=1

∈D

yk

∈D

2

yk

∈D

2

1 s2 2˜

+

1 2˜ s2

=



c1 +

1 2

(yk

2

1 2

∈D

∈D˜

yk

∈D

1 2

2

(yk µ ˜ 2 )2 s2 2˜



µ ˜1 =

c + n2 J (w) . c + n1 J (w)

+ (˜ µ1

Thus we can write the criterion function as rc c J (w) = . n2 rn1

− −



µ2 µ (yk µ ˜ 2 )+(˜ ˜ 1 ))2 s2 2˜





2

µ ˜ 1 ) + 2(yk

)2

2

− µ˜ ) − µ˜ )



1

µ ˜2

1

1 µ2 2˜ s2 n2 (˜ 1 n (˜ µ1 2 1 2˜ s

k=1

(yk µ ˜ 1 )2 2˜ s2

2

+

µ ˜ 2 ) + (˜ µ2

((yk

+



yk

2

∈D˜

k=1 n

µ2 µ (yk µ ˜ 2 )+(˜ ˜ 1 ))2 s2 2˜

+

2

yk

n

+



c1 +

(yk µ ˜ 2 )2 s2 2˜

 

(yk µ ˜ 2 )2 s2 2˜

+

2

+

1 √2π˜ s



∈D

(yk µ ˜ 1 )2 s2 2˜

+

1 √2π˜ s

(yk µ ˜ 1 )2 2˜ s2

yk



yk

k=1 n

=

+

1

yk

=

ln



1

c1 + 1 + c1 + 1 +

n





yk

1 2

˜2) p(yk θ

1 √2π˜ exp s

c1 + c1 +

k =1 n k=1

c1 + =

˜1) p(yk θ

  ||                        −− −− n

=

2

)2

+ 2(yk

− µ˜ )(˜µ − µ˜ ) − µ˜ )(˜µ − µ˜ )) 2

2

1

1

1

2



PROBLEM SOLUTIONS

119

This implies that the Fisher linear discriminant can be derived from the negative of the log-likelihood ratio. 42. Consider the criterion function J (w) required for the Fisher linear discrimi nant. (a) We are given Eqs. 96, 97, and 98 in the text:

|m˜ − m˜ | 1

J1 (w) =

2

2

(96)

s˜21 + s˜22

t

Si

=

SW

∈D(x − mi )(x − mi )



x

= S1 + S2

where y = w t x, m ˜ i = 1/ni

   

(98)

y = w t mi . From these we can writ e Eq. 99 in

y

the text, that is, s˜2i

(97)

=

∈Y

i

y

=

(y

∈Y

i

∈D

− m˜ ) i

−w m )

wt (x

− m )(x − m ) w

∈D

x

=

t

2

(wt x

x

=

2

i

i

i

t

wt Si w.

Therefore, the sum of the scatter matrixes can be written as s˜21 + s˜22 = w t SW w (m ˜1 m ˜ 2 )2 = (wt m1 wt m2 )2 = w t (m1 m2 )(m1 m2 )t w = w t SB w,







(100) (101)



where SB = (m1 m2 )(m1 m2 )t , as given by Eq. 102 in the text. Putting these together we get Eq. 103 in the text,





J (w) =

w t SB w . w t SW w

(103)

(b) Part (a) ga ve us Eq. 103. It is easy to see tha t the w that optimizes Eq. 103 is not uniqu e. Here we optim ize J1 (w) = wt SB w subject to the constraint that J2 (w) = wt SW w = 1. We use the meth od of Lagrange undetermined multipliers and form the functional g(w, λ) = J 1 (w)

− λ(J (w) − 1). 2

We set its derivative to zero, that is, ∂g(w, λ) ∂w i

=



uti SB w + wt SB ui

= 2uti (SB w

···

···

− λS

− 

W w) =

t

λ uti SW w + wt Sw ui

0,



where u i = (0 0 1 0 0) is the n-dimensional unit vector in the ith direction. This equation implies SB w = λSW w.

120CHAPTER 3. MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION (c) We assume that J (w) reaches the extremum at w and that ∆ w is a small deviation. Then we can expand the criterion function around w as J (w + ∆w) =

(w + ∆w)t SB (w + ∆w) . (w + ∆w)t SW (w + ∆w)

From J (w) = J (w + ∆w) = λ, to first order we have w t SB w w t SB w + 2∆wt SB w λ = wt SW w = wt SW w + 2∆wt SW w , that is, λ=

∆wt SB w . ∆wt SW w

Therefore the following equation holds: ∆wt (SB w

− λS

W w) =

0.

Because ∆w is arbitrary, we conclude that S B w = λSW w. 43. Here we have the between-scatter matrix is c

SB =



ni (mi

i=1

t

− m)(m − m) , i

where the group and full means are mi m

= =

1 ni



x

∈D

x

i

c

  x=

∈D

x

1 n

ni mi ,

i=1

and for this case of equal covariances, c

SW =



∈D

i=1 x

c

(x

t

− m )(x − m ) i

i

=



Σ = cΣ.

i=1

In order to maximize J (W) =

t

BW

t

WW

|W S |W S

|, |

the largest columnseigenvalues of an optimal the in W must be generalized eigenvectors that correspond to SB wi = λ i SW wi = cλ i Σwi , with W = [w1 w2 Section 3.9

···

wc−1 ] in terms of Σ and d-dimensional mean vectors.

44. Consider the convergence of the expectation-maximization algorithm, where as usual b and g are the “bad”and the “good” data, as described in the text.

D

D

PROBLEM SOLUTIONS

121

(a) The expected value of θ as determined in the primed coordinates is

 

D )p(D |D ; θ)dD = (lnp(D , D ; θ) − lnp(D |D ; θ)) p(D |D ; θ  )dD = lnp(D , D ; θ )p(D |D ; θ  ) dD − lnp(D |D ; θ )p(D |D ; θ  )dD  [lnp(D : D ; θ)] = ED [lnp(D , D ; θ )|D ; θ ] − ED   = Q(θ ; θ ) − ED [lnp(D : D ; θ)]. (b) If we define φ(D ) = p(D |D ; θ)/p(D |D ; θ  ), then E [φ(D )] − 1 = p(p(DD )|D|D ;; θθ) p(D )|D ; θ)dD − 1 = p(D |D ; θ )dD − 1 = 1 − 1 = 0. According to Jensen’s inequality for this case — E  [ln(·)] ≤ E  [·] − 1. Thus we have in our particular case E  [lnφ(D )] ≤ E  [φ(D )] − 1 = 0. E  (θ ; D ) g

=

l(θ ;

g

b

g

g

g

b



g

b

b

b

g

b

b

b

b

g

b

b

g

b

b

g

g

b

b

b

  b

b

g

b

 

g

g

b

g

b

g

g

b

g

b

b

g

b

g

b

g

b

b

b

(c) The expected value of the log-like lihood at step t + 1 and at step t are t+1

t+1

E [l(θ ; D )] E [l(θ ; D )]

t

t+1

t

− ED [lnp(D |D ; θ − ED [lnp(D |D ; θ )],

g

t

= Q(θ ; θ ) = Q(θt ; θ t )

g

b

b

t

b

b

g

)]

t

g

respectively. We take the difference and find

E [l(θ

t+1

;

t

D )] − E [l(θ ; D )] g

From part (b), we have t

and thus that

E

t b

−ED

[l(θ t+1 ;

b



[lnφ(

ln

b )]

t+1

D |D ; θ D |D ; θ

p( b p(

g

b

t

t

  

− Q(θ ; θ ) D [lnp(D |D ; θ )] − ED [lnp(D |D ; θ )] p(D |D ; θ ) Q(θ ; θ ) − Q(θ ; θ ) − ED ln p(D |D ; θ )

=

ED

 −E  

Q(θ t+1 ; θt )

=

g

g

t

)

t

b

b

t+1

t+1

g

t

t

b

t

t

t

b

=

t

b

b

b

ED [lnφ(D )] < 0, b

b

> 0 . Given the fact that Q(θ t+1 ; θ t ) > Q(θ t ; θt ) and t

g

t+1

t

g

g

and thus l(θt+1 ;

t

D ) > l(θ ; D ) g

g

for increasing t. In short, the log-likelihood increases through the application of the expectation-maximization algorithm.



t+1

g

D )] D− E [l(θ ; D )] > 0, we conclude E [l(θ ; D )] > E [l(θ ; D )], g

t

g

g

t

.

122CHAPTER 3. MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION 45. We consider an iterative algorithm in which the maximum-likelihood value of missing val ues is calculated, then assumed to be correct for the purposes of reestimating θ and the process iterated. (a) This is a generalized expectation-maximization algorithm, but not necessarily a full expectation-maximization algorithm because there is no guarantee that the optimal likelihood value is achieved on each step. (b) We have Q(θ; θ i ) =

b

[lnp(

b;

θ)

θi )].

g;

ED D |D 46. Recall first a special case Jensen’s inequality, that is E [lnx] < ln[E [x]]. Our data set is D = { , , , ∗ , } sampled from a two-dimensional uniform distribution bounded by x ≤ x ≤ x and x ≤ x ≤ x .

 2 3

3 1 l1

5 4 1

4

8 6 u1

2

l2

l2

(a) According to the definitio n of Q(θ , θ 0 ) of Eq. 129 in the text, we have Q(θ ; θ0 ) = =

E

∞ ∞

−∞ −∞

3

k=1

θ θ0 ;

| D )] lnp(x |θ) + ln p(x |θ ) + lnp(x |θ)

x42 x51 [lnp(xg , xb ;

  

k

0

×p(x |θ ; x 42



3

=



lnp(xk θ ) +

41

g

4

= 4)p(x51 θ0 ; x52 = 6)dx42 dx51

|

lnp(x4 θ )p(x42 θ0 ; x41 = 4)dx42

 |  | |  | |     | |                           k=1

5

−∞



+



≡K

p(x5 θ )p(x51 θ 0 ; x52 = 6)dx51 .

−∞

≡L

We now calculate the second term, K :



K

=

lnp(x4 θ)p(x42 θ 0 ; x41 = 4)dx42

−∞ ∞

=

lnp

−∞



=

4 x42

ln

−∞

p

θ



10

−∞

lnp

p

−∞ 4

p

θ

10 0



=

4 x42

4 x42



θ

x42

4 x42

4 x42

θ0

θ0 dx42

dx42

θ0

dx42

1 10 10 dx42

×

p



4 x42

θ0

dx42 .

There are four cases we must consider when calculating K :

PROBLEM SOLUTIONS 1. 0

≤x

l2

< xu2

123

≤ 10 and x ≤ 4 ≤ x l1

xu2

K

=

10

xl2

= 2. xl2

(xu2

xl2 )ln

−∞

u2

l1

= 10

< 10

≤x

u2

l1

u2

l2

θ

1

× 10 dx

10

42

1

| − x | · |x − x | .

≤ ≤

and x l1 = 10

4

lnp

1 (10 10

l1

u2

l2

xu1 :

xu1

=

1

u1

4 x42

lnp

10

K

42

|x − x | · |x − x | .

1 xu2 ln 10 xu1

= l2

× 10 dx

u1 :

0

≤x

10

    xu2

3. 0

1

θ



≤ 0 < x ≤ 10 and x ≤ 4 ≤ x K

which gives:

4 x42

lnp

1 10

u1 ,

     −

−x

4 x42

u1 )ln

 

1

θ

10

× 10 dx

42

1

|x − x | · |x − x | . u1

l1

u2

l2

4. Otherwise, K = 0. Similarly, there are four cases we must consider when calculating L:



≤ ≤ ≤

1. xl1 < xu1 0 < 10 or 0 < 10 xl1 < xu1 or 6 2. 0 xl1 < xu1 10 and x l2 6 xu2 , then





    xu1

L

= 10

x41 6

lnp

xl1

1 (xu1 10

= 3. xl1

≤ 0 < x ≤ 10 and x u1

L =

l2

θ

≤x

l2

or x u2

≤ 6: then L = 0.

1

10

× 10 dx

41

− x )ln |x − x | 1· |x − x | . ≤ 6 ≤ x , then l1

u1

l1

u2

l2

u2

    × | − | · | − ≤ ≤     × xu1

10

x41 6

lnp

1

θ

10

10

dx41

0

= 4. 0

≤x

l1

< 10

≤x

u1

1 u1 10 x ln xu1

and x l2

6

10

L

= 10

lnp

xl1

=

1 (10 10

−x

1

xl1

xl2 .

|

xu2

xu2 :

x41 6

l1 )ln

θ

1

10

10

dx41

1

|x − x | · |x − x | . u1

l1

u2

l2

124CHAPTER 3. MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION

 3

Therefore Q(θ; θ 0 ) =

|

lnp(xk θ ) +K +L has different forms depending upon

k=1

the different values of θ , as shown above. (b) Here we have θ = (2 1 5 6) t . (c) See figure. x2 10

2 = 1 x

8

5 = 1 x

x2 = 6

6

4

2

x2 = 1

x1

0 0

2

4

6

8

10

(d) Here we have θ = (2 1 5 6) t .

      ∗  | D

47. Our data set is = 11 , 33 , ∗2 , where x2 component of the third data point.

D

(a) For the E step we have: 0 Q(θ ; θ0 ) = x32 lnp(xg , xb ; θ ) θ ,

E

=





=

g

(lnp(x1 θ) + ln p(x2 θ ) + lnp(x3 θ )) p(x32 θ 0 , x31 = 2)dx32

|

−∞ =

represents an unknown value for the

|

|



|

|

|

|

lnp(x1 θ ) + ln p(x2 θ) +

 |· |     ·             ·    lnp(x3 θ) p(x32 θ 0 , x31 = 2)dx32

−∞ ∞

lnp(x1 θ ) + ln p(x2 θ) +

|

p

−∞

2 x32

p

θ



p

−∞

2 x32

θ0

2 x32

dx32

θ0

d32

θ0

dx32

1/(2e)



= =

lnp | | −∞ lnp(x |θ ) + ln p(x |θ) + K. lnp(x1 θ ) + ln p(x2 θ) + 2e 1

2

2 x32

θ

There are three cases for K : 1. 3

≤ θ ≤ 4: 2

K

=

1 4

  θ2

ln

0

1 −2θ1 1 e θ1 θ2



dx32

p

2 x32

PROBLEM SOLUTIONS

2. θ2

125 =

1 θ2 ln 4

=

1 4



1 −2θ1 1 e θ1 θ2



.

≥ 4: K

       4

ln

0

1 −2θ1 1 e θ1 θ2

dx32

1 4ln 1 e−2θ1 1 4 θ1 θ2 1 −2θ1 1 ln e . θ1 θ2

= = 3. Otherwise K = 0. Thus we have

Q(θ; θ 0 ) = ln p(x1 θ) + ln p(x2 θ ) + K 1 −θ1 1 1 −3θ1 1 = ln e + ln e +K θ1 θ2 θ1 θ2 = θ1 ln(θ1 θ2 ) 3θ1 ln(θ1 θ2 ) + K = 4θ1 2ln(θ1 θ2 ) + K,

 |  |

− − − −







∞ according to the different cases of 1, or





θ2 .

Note the normalization condition −∞ p(x1 )dx1 =



1 −θ1 x1 e dx1 = 1. θ1

−∞

This equation yields the solution θ 1 = 1. (b) There are two cases: 1. 3

≤ θ ≤ 4: 2

Q(θ ; θ 0 ) =

−4 −





1 2lnθ2 + θ2 (2 + lnθ2 ) . 4

Note that this is a monotonic function and that arg max θ2 Q(θ; θ0 ) = 3, 0

− −−

which leads to max Q(θ ; θ ) 8.5212. 2. θ2 4: In this case Q(θ; θ0 ) = 6 3lnθ2 . Note that this is a monotomic function and that arg max θ2 Q(θ; θ 0 ) = 4, which leads to max Q(θ; θ 0 ) 10.1589.

≥ −

In short, then, θ = (1 3) t . (c) See figure. 48. Our data set is

D=

      1 1

.

3 3

,



2

.



126CHAPTER 3. MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION 0

θ

θ

x2

p(x1 , x2) 0.3 0.2 0.1 0 0

t

= (2 4)

0.3 0.2 0.1 0 0

5

3 2

2 3

x2

p(x1 , x2) 4

1

t

= (1 3)

4 1

1

2

2

3

4

1 4

x1

5

5

3

x1

5

(a) The E step is: Q(θ ; θ0 ) = =

E



lnp(xg , xb ; θ ) θ0 ,



1

g

2

3

−∞ =



| D ∞ (lnp(x |θ) + ln p(x |θ ) + lnp(x |θ )) p(x |θ , x x31

|

0

32

= 2)dx31

    ·             ·    ∞

|

lnp(x1 θ ) + ln p(x2 θ) +

31

−∞

x31 2

θ0

x31 2

θ0

p

x31 2

lnp

θ



p

−∞

dx31

1/16



= =

x31 2

lnp | | −∞ lnp(x |θ ) + ln p(x |θ) + K. lnp(x1 θ ) + ln p(x2 θ) + 16 1

2

θ

p

x31 2

There are two cases for K : 1. θ2

≥ 2:



K

=

 

2e−2x31 ln

0



=



2e−2x31 ( θ1 x31



0



= = =

1 −θ1 x31 1 e θ1 θ2

−ln(θ θ ) 1 2





dx31

− ln(θ θ ))dx 1 2

2e−2x31 dx31

0



− 2θ

1



1 2θ1 4

−ln(θ θ ) − 1 θ − ln(θ θ ). 2 1 2

1

1 2

2. Otherwise K = 0. Therefore we have Q(θ; θ 0 ) = ln p(x1 θ ) + lnp(x2 θ ) + K

|

|

31

0

x31 e−2x31 dx31

θ0

dx31

dx31

PROBLEM SOLUTIONS

127



 



1 −θ1 1 1 −3θ1 1 e + ln e +K θ1 θ2 θ1 θ2 θ1 ln(θ1 θ2 ) 3θ1 ln(θ1 θ2 ) + K 4θ1 2ln(θ1 θ2 ) + K,

= ln

− − − −

= =









according to the cases for K , directly above. Note that

p(x1 )dx1 = 1, and

−∞

thus





−∞

1 −θ1 x e dx1 = 1, θ1

and thus θ 1 = 1. The above results can be simplified. (b) When θ 2

≥ 2, we have

−4 − 2lnθ + 12 − lnθ − 72 − 3lnθ .

Q(θ ; θ0 ) =

2

=

2

2

Note that this is a monotonic function, and thus arg max θ2 Q(θ; θ 0 ) = 2. Thus the parameter vector is θ = (1 2) t . (c) See figure.

t

0

θ

= (2 4) x2

p(x1 , x2) 0.4 0.3 0.2 0.1 0 0

1

θ

2

3

1

4

2

3

5

0.4 0.3 0.2 0.1 0 0

x1

5

1

= (1 2) x2

p(x1, x2) 4

t

2

3

4

1 5

2

x1

Section 3.10 49. A single revision of ˆaij ’s and ˆbij ’s involves computing (via Eqs. ?? – ??) T

γij (t) a ˆij

=

t=1 T

  

t=1 k

γik (t)

??

γij

ˆbij

=

??

T

γij (t)

t=1

γij (t) =

αi (t

− 1)a

ij bij βi (t) . P (V T M )

|

3

4

5

128CHAPTER 3. MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION αi (t)’s and P (V T M ) are computed by the Forward Algorithm , which requires O(c2 T ) operations. The β i (t)’s can be computed recursively as follows: For t=T to 1 (by -1) For i=1 to c βi (t) = j a ij bjk v(t + 1)βj (t + 1)

|

End



This requires O(c2 T ) operations. Similarly, γij ’s can be computed by O(c2 T ) operations given αi (t)’s, aij ’s, bij ’s, βi (t)’s and P (V T M ). So, γ ij (t)’s are computed by

|

O(c2 T ) + O(c2 T ) + O(c2 T ) = O(c2 T )operations.

         αi (t)’s

βi (t)’s

γij (t)’s

Then, given ˆγij (t)’s, the ˆaij ’s can be computed by O(c2 T ) operations and ˆbij ’s by O(c2 T ) operations. Therefore, a single revision requires O(c2 T ) operations. 50. The standard method for calculating the probabil ity of a sequence in a given HMM is to use the forward probabilities α i (t). (a) In the forward algorithm, for t = 0, 1,...,T , we have

    



0 1

αj (t) =

t = 0 and j = initial status t = 0 and j = initial status

c

i=1

αi (t

− 1)a

ij bjk v(t)

otherwise.

In the backward algorithm, we use for t = T , T

βj (t) =

0 1

− 1,...,

0,



t = T and j = final status t = T and j = final status

c

βi (t + 1)aij bjk v(t + 1) otherwise.

i=1

Thus in the forward algorithm, if we first reverse the observed sequence VT (that is, set bjk v(t) = bjk (T + 1 t) and then set βj (t) = αj (T t), we can obtain the backward algorithm.





(b) Consider splitting the sequence VT into two parts — V1 and V2 — before, during, and after each time step T  where T  < T . We know that αi (T  ) represents the probability that the HMM is in hidden state ω i at step T  , having generated the firt T  elements of VT , that is V1 . Likewise, βi (T  ) represents the probability that the HMM given that it is in ωi at step T  generates the remaining elements of VT , that is, V2 . Hence, for the complete sequence we have

c

P (VT ) = P (V1 , V2 ) =



P (V1 , V2 , hidden state ω i at step T  )

i=1

c

=

 

P (V1 , hidden state ω i at step T  )P (V2 hidden state ω i at step T  )

|

i=1 c

=

i=1

αi (T  )βi (T  ).

PROBLEM SOLUTIONS

129

(c) At T  = 0, the above reduces to

P (VT ) =

c



αi (0)βi (0) = βj (0), where j is

i=1

the known initial state. This is the same as line 5 in Algorithm 3. Likewise, at T  = T , the above reduces to P (VT ) =

c



αi (T )βi (T ) = α j (T ), where j is the

i=1

known final state. This is the same as line 5 in Algorithm 2. 51. From the learning algorithm in the text, we have for a giveen HMM with model parameters θ : γij (t) =

αi (t

− 1)a

ij bjk v(t)βj (t) P (VT θ )

T

 



( )

|

γij (t)

a ˆij

=

t=1 T c

t=1 k=1

. γik (t)

∗∗

( )



For a new HMM with ai j  = 0, from ( ) we have γi j  = 0 for all t. Substituting γi j  (t) into ( ), we have ˆai j  = 0. Therefore, keeping this substitution throughout the iterations in the learning algorithm, we see that ˆ ai j  = 0 remains unchanged. 52. Consider the decoding algorithm (Algorithm 4).

∗∗

(a) the algorithm is: Algorithm 0 (Modified decoding) 1 2 3 4 5

← ←

1 i c

ij

jk v(t)]

until j = c j arg min[δj (t)]

6



7

j

Append ω j  to Path until t = T return Path

8 9 10 11

← {} ← ← ← ← ≤ ≤ − 1) − ln(a )] − ln[b

begin initialize Path ,t 0 for t t+1 j 0; δ0 0 for j j+1 δj (t) min [δi (t

end

(b) Taking the logari thm is an O(c2 ) computation since we only need to calculate lnaij for all i, j = 1, 2,...,c , and ln[ bjk v(t)] for j = 1, 2,...,c . Then, the whole complexity of this algorithm is O(c2 T ).

130CHAPTER 3. MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION

Computer Exercises Section 3.2 1. Computer exercise not yet solved Section 3.3 2. Computer exercise not yet solved Section 3.4 3. Computer exercise not yet solved Section 3.5 4. Computer exercise not yet solved Section 3.6 5. Computer exercise not yet solved Section 3.7 6. Computer exercise not yet solved 7. Computer exercise not yet solved 8. Computer exercise not yet solved Section 3.8 9. Computer exercise not yet solved 10. Computer exercise not yet solved Section 3.9 11. Computer exercise not yet solved 12. Computer exercise not yet solved Section 3.10 13. Computer exercise not yet solved

Chapter 4

Nonparametric techniques

Problem Solutions Section 4.3 1. Our goal is to show that Eqs. 19–22 are sufficient to assure convergence of Eqs. 17 and 18. We first ne ed to show that nlim →∞ p¯n (x) = p(x), and this requires us to show that 1 ϕ n→∞ Vn lim

− x

V hn

= δ n (x

− V) → δ(x − V).

Without loss of generality, we set V = 0 and write 1 ϕ Vn

 x hn

=

   d

1

d

xi

xi ϕ h i=1 n

x hn

= δ n (x),

i=1

where we used the fact that Vn = hdn . We also se e that δn (x) normalization

→ 0 if x = 0, with

δn (x)dx = 1.



→∞



Thus we have in the limit n , δ n (x) δ (x) as required. We substitute our definition of ¯pn (x) and get

E

lim p¯n (x) = [¯ p(x)] =

n

→∞



δ (x

− V)p(V)dV = p(x).

Furthermore, the variance is σn2 (x) =

1 nVn

 − 1 2 ϕ Vn

x

V hn

131

p(V)dv

− n1 [¯p (x)] n

2

CHAPTER 4. NONPARAMETRIC TECHNIQUES

132

 −  −

1 1 2 x V ϕ p(V)dV nVn Vn hn Sup(ϕ) 1 x V ϕ p(V)dV nVn Vn hn Sup(ϕ)¯ pn (x) . nVn

≤ ≤ ≤

We note that Sup( ϕ) < and that in the limit n we have ¯pn (x) nVn . We put these results together to conclude that

→∞



→∞

p(x) and



lim σn2 (x) = 0,

→∞

n

for all x. 2. Our normal distribution is p(x) N (0, 1), or more explicitly,

2

∼ N (µ, σ ) and our Parzen window is √12π e−

ϕ(x) =

x2 /2

ϕ(x)



,

and our estimate is

−  n

1 nhn

pn (x) =

x

ϕ

i=1

xi hn

.

(a) The expected value of the probabil ity at x, based on a window width parameter hn , is p¯n (x) =

=

=

=

=

 E   −   −  √ −  −   √ −  −   −    −   − 

E [p (x)] = nh1 n

1 hn 1 hn



−∞

1

v

hn

1 exp 2π

1 exp 2πh n σ exp

2πh n σ where we have defined

x

ϕ

n i=1

x

ϕ

−∞ ∞

n

p(v)dv 1 2

x

v

x2 µ 2 + h2n σ 2

1

x2

− 

hn

2

θ2 =

2

1 exp 2πσ

hn

1 2

2

xi hn

+

µ2 2

σ



1 2 v 2

exp

−∞

+

1 α2

∞ exp

2



−∞

and



v

µ

2

dv

σ

1 1 + h2n σ2 1

v

−α

2v

x µ + h2n σ 2



.

2

θ

x µ + 2 h2n σ

2

   −   

1 h2 σ 2 = 2n 2 1/h2n + 1/σ2 hn + σ

α = θ2

1 2

dv,



dv

PROBLEM SOLUTIONS

133

We perform the integration and find

√2πθ

p¯n (x) =

2πhn σ 1 2πh n σ



=

−     −  −    − 1 x2 µ 2 1 α2 + 2 + 2 2 hn σ 2 θ2 hn σ 1 x2 µ 2 exp + 2 2 2 2 h2n σ hn + σ

exp

α2 θ2

.

The argument of the exponentiation can be expressed as follows x2 µ 2 + h2n σ 2

− αθ

2

2

x2 µ 2 θ 4 x µ + + h2n σ 2 θ 2 h2n σ 2 x2 h2n µ2 σ2 + (h2n + σ 2 )h2n (h2n + σ 2 )σ 2 (x µ)2 . h2n + σ2

=

2

=

− h 2xµ +σ 2 n

2



= We substitute this back to find

√2π

p¯n (x) =

1 exp h2n + σ 2



−



1 (x µ)2 , 2 h2n + σ 2



which is the form of a Gaussian, denoted N (µ, h2n + σ2 ).

p¯n (x)



(b) We calculate the variance as follows:

   −     −    −  E   −  − E   −  

=

n

1 nhn

Var[pn (x)] = Var

x

ϕ

n

1

n2 h2n i=1

xi hn

i=1

x

Var ϕ

=

1 Var ϕ nh2n

=

1 nh2n

ϕ2

x

xi hn

v

hn

x

v

x

ϕ

hn

2

v

hn

,

where in the first step we used the fact that x 1 ,..., xn are independent samples drawn according to p(x). We thus now need to calculate the expecte d value of the square of the kernel function

E

  −    −   −  −   −  −   √  √ −  −√   −  −   √ √ ϕ2

x

v

hn

=

ϕ2



=

−∞

=

x

v

hn

1 exp 2π

1 2π



−∞

p(v)dv x

v

hn

1 exp 2π

1 2

2

exp

x v hn / 2

1 2

v

µ

2

σ

2

exp

1 2

1 dv 2πσ

v

µ

σ

2

1 dv. 2πσ

CHAPTER 4. NONPARAMETRIC TECHNIQUES

134



From part (a) by a similar argument with h n / 2 replacing h n , we have

         − − √ √ √ √ − − − −  √ E  −   √ − −    −  √ √  − −  E     −   √ √ −  −   E

1 hn / 2



1 exp 2π

−∞

1 2

2

x v hn / 2 =

1 2

exp

v

µ

We make the substitution and find x v hn / 2 ϕ2 = exp hn 2π h2n /2 + σ2

1 dv 2πσ

σ

1 (x µ)2 . 2 h2n /2 + σ2

1 exp h2n /2 + σ 2



2

1 (x µ)2 , 2 h2n /2 + σ 2

and thus conclude 1 nh2n

ϕ2

x

v

1 1 nhn 2 π

h2n /2 + σ 2

For small hn , mated as 1 nh2n

=

hn

x

ϕ2

1 (x µ)2 . 2 h2n /2 + σ2

1 exp h2n /2 + σ 2



σ, and thus the above equation can be approxi-

v

1 1 exp nhn 2 π 2πσ 1 p(x). 2nhn π

hn

1 2

x

2

µ

σ



=



( )

Similarly, we have 1 nh2n

E   −   x

ϕ

v

hn

2

= =



−

1 2 1 1 (x µ)2 h exp nh2n n 2π h2n + σ 2 2 h2n + σ 2 hn 1 1 (x µ)2 exp nhn 2π h2n + σ 2 2 h2n + σ 2





hn nhn



  −  −  − −  ∗∗ 1 (x µ)2 2 σ2

1 √2πσ exp



∗∗) we have, (still for small p(x) √ . Var[P (x)]  2nh π

valid for small h n . From ( ) and ( n





0, ( )

hn)

n

(c) We write the bias as p(x)

− p¯ (x) n

1

= =

1 (x

µ)2

√2πσ exp − 2 −σ − √2π 1 1 (x − µ) √2πσ exp − 1− 2 h +σ

  −  −

2

2

2 n

2

1 (x µ)2 2 h2n + σ 2

1 h2n + σ 2 exp

   − −   − − − −  −    − 

1 (x µ)2 1 (x µ)2 + 2 h2n + σ2 2 σ2

σ exp h2n + σ2 µ)2

= p(x) 1

1 exp 1 + (hn /σ)2

(x

= p(x) 1

1 1 h2n (x µ)2 exp 2 2 h2n + σ 2 1 + (hn /σ)

2

1 h2n + σ 2 .

1 σ2





PROBLEM SOLUTIONS

135

For small h n we expand to second order:



and

1

 1 − 12 (h /σ) n

1 + ( hσn )2

h2n (x µ)2 1 + 2σ 2 h2n + σ2 .

h 2n (x µ)2 exp 2σ2 h2n + σ 2





2

 −  −  −    − − −   − −     − −  

We ignore terms of order higher than h 2n and find p(x)

− p¯ (x) 

p(x) 1

1

1 2



p(x) 1

1+

1 h2n 2 σ2



1 2



1 2

n

hn σ

2

hn σ

2

hn σ

2

1+

h2n (x µ)2 2σ 2 h2n + σ2





h2n (x µ)2 2σ2 h2n + σ 2

(x µ)2 p(x) h2n + σ 2

1

x

1

µ

2

p(x).

σ

3. Our (normalized) distribution is p(x) =



≤ ≤

1/a 0 x a 0 otherwise ,

and our Parzen window is



ϕ(x) =

e−x 0



x 0 x < 0.

(a) The expected value of the Parzen window estimate is p¯n (x) = =

E   −    −  1 nhn

1 hn

n

x

ϕ

xi hn

i=1

e−



(x v) hn

=

1 hn

ϕ

p(v)dv



x v

=

− 

exp[ x/hn ] hn

     

1 exp[v/hn ]dv a



x v, 00 for k> 1

(c) We have in this case Pn (e) =

1 2n

  −

(k 1)/2



j=0

= Pr Y1 +

···

n j

= Pr [B(n, 1/2)

· · · + Y ≤ k −2 1 n



≤ (k − 1)/2]

,

··

where Y 1Y, = ,0]Yn=are independent, B( , to ) isincrease a binomial and Pr[ Yiby = 1] = Pr[ 1 /2. If k is allowed withdistribution n, but is restricted i k < a/ n, then we have



Pn (e)



√ · · · + Y ≤ a/ 2n − 1 + · · · + Y ≤ 0) for n sufficiently large



Pr Y1 +

= =

P r(Y1 0,

and this guarantees P n (e)

n

n

→ 0 as n → ∞.



PROBLEM SOLUTIONS

141

Section 4.5 7. Our goal is to show that Voronoi cells induced by the nearest-neighbor algorithm are convex, that is, given any two points in the cell, the line connecting them also lies in the cell. We let x∗ be the labeled sample point in the Voronoi cell, and y be

x(0)

y x*

x(λ)

x(1)

any other labeled sample point. A unique hyperplane separates the space into those that are closer to x∗ than to y, as shown in the figure . Consider any tw o points x(0) and x(1) inside the Voronoi cell of x ∗ ; these points are surely on the side of the hyperplane nearest x . Now consider the line connecting those points, parameterized ∗ + λx(1) for 0 λ 1. Because the half -space defined by as x(λ) = (1 λ)x(0) the hyperplane is convex, all the points x(λ) lie nearer to x ∗ than to y. This is true for every pair of points x(0) and x(1) in the Voronoi cell. Furthermore, the result holds for every other sample point y i . Thus x(λ) remains closer to x ∗ than any other labeled point. By our definition of convexity, we have, then, that the Voronoi cell is convex. 8. It is indeed possible to have the nearest-neighbor error rate P equal to the Bayes error rate P ∗ for non-trivial distributions.



≤ ≤

(a) Consider uniform priors over c categories, that is, P (ωi ) = 1/c, and onedimensional distributions

|

p(x ωi ) =

The evidence is c

p(x) =

 i=1

|

 ≤ ≤ ≤ ≤ −    ≤≤ ≤≤ 0 x ccr −1 i x i+1 elsewhere .

1 1 0

p(x ωi )P (ωi ) =

1 1/c 0

≤ c cr− 1 ≤ 1.



0 x ccr −1 i x (i + 1) elsewhere .

Note that this automatically imposes the restriction 0

cr c 1



cr c 1



CHAPTER 4. NONPARAMETRIC TECHNIQUES

142

| ∝ p(x|ω ) and thus 0≤x≤ −

Because the P (ωi )’s are constant, we have P (ωi x)

|

P (ωi x) =

  

P (ωi ) p(x)

1/c p(x)

=

i

cr c 1





0 if i = j 1 if i = j

j

0

cr c 1

≤x ≤j +1 −



otherwise ,

as shown in the figure for c = 3 and r = 0.1. P(ωi|x)

1/c = 1/3 3

0.3

ω1

ω

,

ω2

ω3

2

ω

,

0.25

1

ω

0.2 0.15 0.1 0.05

x

.1 0 = r = x

1234

The conditional Bayesian probabili ty of error at a point x is P ∗ (e x) = 1

|

=

− −

|

P (ωmax x)

1 1



1/c p(x)

if 1 = 0 if

0 i

cr c 1

≤x≤ − ≤x ≤i+1−

cr c 1,



and to calculate the full Bayes probability of error, we integrate as P∗

=



P ∗ (e x)p(x)dx

|



cr/(c 1)

 − 

=

1

1/c p(x)dx p(x)

0

=

1

1

cr

= r.

−  −   − |   −   −    c

c

1

(b) The nearest-neighbor error rate is c

P

=

P 2 (ωi x) p(x)dx

1

i=1



cr/(c 1)

=

c

1

0



j+1

c( 1c )2 p(x)dx + p2 (x) j=1

cr c 1



[1

j

0

1] p(x)dx

PROBLEM SOLUTIONS

143



cr/(c 1)

 −   −  − 

=

1

1/c p2 (x)

1

1 c

p(x)dx

0



cr/(c 1)

=

dx =

1

1 c

cr c

0

− 1 = r.

Thus we have demonstrated that P = P = r in this nontrivial case.



9. Our data are given in the table. ω1 (10,0) (0,-10) (5,-2)

ω2 (5,10) (0,5) (5,5)

ω3 (2,8) (-5,2) (10,-4)

Throughout the figures below, we let dark gray represent category ω 1 , white represent category ω2 and light gray represent category ω 3 . The data points are labeled by the ir category numbers, the means are shown as small black dots, and the dashed straight lines separate the means. a)

b)

15

15

10

2

10

2

5

3 2

5

3 1

0

0

1

1

1

-5

-5

-10

3

-10

1

-15

1

-15 -10

-5

0

5

10

15

-10

-5

c)

0

5

10

15

d)

15

15

10

2

10

2

5

2

3 5

2

3 2

3

2

3

0

0

1

1 3

-5

3

-5

-10

-10

-15

1

-15 -10

-5

0

5

10

15

-10

-5

0

5

10

15

10. The Voronoi diagram of n points in d-dimensional space is the same as the convex hull of those same points projected orthogonally to a hyperboloid in ( d + 1)dimensional space. So the editi ng algorithm can be solved either with a Voronoi diagram algorithm in d-space or a convex hull algorithm in (d + 1)-dimensional space. Now there are scores of algorithms available for both problems all with different complexities. A theorem in the book by Preparata and Shamos refers to the complexity of the Voronoi diagram itself, which is of course a lower bound on the complexity of computing it. This complexity was solved by Victo r Klee, “On the complexity of d-dimensional Voronoi diagrams,” Archiv. de Mathematik. , vol. 34, 1980, pp. 7580. The complexity formula given in this problem is the complexity of the conve x hull algorithm of Raimund Seidel, “Constructing higher dimensional convex hulls at

144

CHAPTER 4. NONPARAMETRIC TECHNIQUES

logarithmic cost per face,” Proc. 18th ACM Conf. on the Theory of Computing , 1986, pp. 404-413. So here d is one bigge r than for Voronoi diagrams. If we substitute d in the formula in this problem with ( d 1) we get the complexity of Seidel’s algorithm for Voronoi diagrams, as discussed in A. Okabe, B. Boots and K. Sugihara, Spatial Tessellations: Concepts and Applications of Voronoi Diagrams , John Wiley, 1992. 11. Consider the “curse of dimensionality” and its relation to the separation of points



randomly selected in a high-dimensional space. (a) The sampling density goes as n1/d , and thus if we need n1 samples in d = 1 dimensions, an “equivalent” number samples in d dimensions is n d1 . Thus if we needed 100 points in a line (i.e., n1 = 100), then for d = 20, we would need n20 = (100)20 = 10 40 points — roughly the number of atoms in the universe. (b) We assume roughly uniform distribution, and thus the typical inter- point Euclidean (i.e., L 2 ) distance δ goes as δ d volume, or δ (volume)1/d .





≤ ≤

(c) Consider points uniformly distributed in the unit interval 0 x 1. The length containing fraction p of all the points is of course p. In d dimensions, the width of a hypercube containing fraction p of points is l d (p) = p1/d . Thus we have l5 (0.01) = (0 .01)1/5 l5 (0.1) = (0 .1)1/5 l20 (0.01) = (0 .01)1/20 l20 (0.1) = (0 .1)1/20

= 0.3910 = 0.7248 = 0.8609 = 0.8609.

(d) The L ∞ distance between two points in d-dimensional space is given by Eq. 57 in the text, with k :

→∞

 | d

L∞ (x, y) = = =

1/k

k

−y | max[ |x − y |, |x − y |,..., |x − y |] max |x − y |.

lim k

→∞

1

i

xi

i

i=1

i

1

2

2

d

d

i

In other words, consider each axis separately, i = 1,...,d . There is a separation between two points x and y along each individual direction i, that is, xi yi . One of these distanc es is the greatest. The L ∞ distance between two points is merely this maximum distance.

| − |

Informally we can see that for two points randomly selected in the unit ddimensional hypercube [0, 1]d , this L distance approaches 1.0 as we can nearly always find an axis i for which the ∞ separation is large. In contrast, the L∞ distance to the closest of the faces of the hypercube approaches 0.0, because we can nearly always find an axis for which the distance to a face is small. Thus, nearly every point is closer to a face than to another randomly selected point. In short, nearly eve ry point is on the “outside” (tha t is, on the “convex hull”) of the set of points in a high-dimensional space — nearly every point is an “outlier.” We now demonstrate this result formally. Of course, x is always closer to a wall than 0.5 — even for d = 1 — and thus we consider distances l∗ in the

PROBLEM SOLUTIONS

145

range 0 l∗ < 0.5. The figure at the lef t shows the xi and yi coordinates plotted in a unit square. There are d points, one corresponding to each axis in the hypercube. The probability that a particular single coordina te i will have xi y i l∗ is equal to 1.0 minus the area in the small triangles, that is,



| − |≤

| − y | ≥ l∗ for a particular i] = (1 − l∗) . The probability that al l of the d points will have |x − y | ≥ l ∗ is then Pr[|x − y | ≥ l∗ for all i] = (1 − (1 − l∗ ) ) . Pr[ xi

2

i

i

i

2 d

i

i

The probability that at least one coordinate has separation less than 1.0 minus the above, that is, Pr[at least one coordinate is closer than l ∗ ] = 1

l ∗ is just

− (1 − (1 − l∗) ) . 2 d

Now consider the distance of a point x to the faces of the hypercube . The figure L-infinity distance between x and y |i -y

yi

=

0

d points

|x i

1

L-infinity distance between x and any hypercube face

.4

|i -y

=

0.

4

|x i

0.5 l*

l* 0 0

xi

l* 0.5

1

1-l* xi

0

0.5

1

at the right shows d points corresponding to the coordinates of x. The probability that a single particular coordinate value xi is closer to a corresponding face (0 or 1) than a distance l ∗ is clearly 2 l∗ since the position of x i is uniformly distributed in the unit interval. The probability that any of the coordinates is closer to a face than l ∗ is Pr[at least one coordinate is closer to a face than l ∗ ] = 1

− (1 − 2l∗) . d

We put these two results together to find that for a given l∗ and d, the probability that x is closer to a wall than l ∗ and that L∞ (x, y) ≤ l ∗ is the product of the two independent probabilities, that is, d

− −

− −

2d

[1 (1 2l∗ ) ][1 (1 l∗ ) ], which approaches 1.0 quickly as a function of d, as shown in the graph (for the case l∗ = 0.4). As shown in Fig. 4.19 in the text, the unit “hypersphere” in the L∞ distance always encloses the unit hypersphere in the L2 or Euclidean metric. Therefore, our discussion above is “pessimistic,” that is, if x is closer to a face than to point y according to the L∞ metric, than it is “even closer” to the wall than to a face in the L2 metric. Thus our conclusion above holds for the Euclidean metric too. Consequently in high dimensions, we generally must extrapolate from sample points, rather than interpolate between them.

CHAPTER 4. NONPARAMETRIC TECHNIQUES

146

probability x is closer to a face than l* = 0.4 and x is farther from y (in L-infinity distant) than l* 1 0.8 0.6

l* = 0.4 0.4 0.2

d

23456

12. The “curse of dimensionality” can be “overcome” if we know the form of the target function. (a) Our target function is linear , d



f (x) =

aj xj = a t x,

j=1

and y = f (x) + N (0, σ2 ), that is, we have Gaussian noise. The approximation is fˆ(x) =

d



a ˆj xj , where

j=1

  −   n

a ˆj = arg min aj

d

yi

i=1

2

aj xij

,

j=1

where xij is the jth component of point i, for j = 1,...,d and i = 1,...,n . In short, these are the best fit coefficients in a sum-squared-error sense. The expected value (over the data set) of the difference between the target and fit functions (squared) is

      − E −  d

E [f (x) − fˆ(x)]

2

=

d

n

a j xj

j=1

arg min aj

j=1

yi

i=1

j=1

    2

d

aj xij

xj

2

.

For some set of sample points x i , we let y i be the corresponding sample values yi = f (xi ) + N (0, σ2 ). Consider the n-by-d matrix X = [xij ] where xij is the jth component of x i , that is

X=

 

x1 x2 .. . xn

  d

where x i is a d-dimensional vector. As stated above, we have f (x) =



j=1

a j xj .

PROBLEM SOLUTIONS

147

We define the weight vector

a=

then we can write

Xa =

      

a1 a2 .. . an

 

f (x1 ) f (x2 ) .. .

f (xn )

and hence y = Xa + N (0, σ 2 )n where

y=

We approximate f (x) with fˆ(x) =

y1 y2 .. . yn

d

 

 

.

a ˆj xj where the a ˆj are chosen to minimize

j=1 n

2

d

yi

a ˆj xij

= y

  −   i=1

j =1

Xˆ a

2

 − 

ˆ is self-evident. Thus we choose a ˆ such that Xˆ where the definition of a a is as close to y as possible. However, Xˆ a is in CX , the column space of X, that is, the d-dimensional subspace of R n spanned by the columns of X. (Of course, we assume n > d.) It is clear that we wish to choose ˆa such that Xˆ a is the projection of y onto C X , which we denote Proj C y. Now y is distributed according to an n-dimensional Gaussian distribution with mean Xn. Projecting an n-dimensional Gaussian onto a d-dimensional subspace yields a d-dimensional Gaussian, that is, X

Xˆ a = Proj C y = Proj C [Xn + N (0, σ 2 )n ] = Xa + N (0, σ2 )d , X

X

where N (0, σ 2 )d is rotated to lie in C X . Thus we have



E Xˆa − Xa But Xˆ a

2

Xa

=

n

(fˆ(xi )

2

 

f (xi ))2 . Since the terms ( fˆ(xi )

 −  have− E Xˆa − Xa = nE 2

(fˆ(x)

2

f (x))

f (xi ))2 are



=1we eachii,

    −   −  E

independent for

and thus



= Var N (0, σ2 )d = dσ 2 .

2

= dσ ,

dσ 2 fˆ(x))2 = . n In short, the squared error increases linearly with the dimension nentially as we might expect from the general curse of dimension. (f (x)

d, not expo-

148

CHAPTER 4. NONPARAMETRIC TECHNIQUES d

(b) We follow the logic of part (a). Now our target function is f (x) =



ai Bi (x)

i=1

where each member in the basis set of M basis functions B i (x) is a function of the d-component vector x. The approximation function is M



fˆ(x) =

a ˆm Bm (x),

m=1

and, as before, the coefficients are least-squares estimates

 − 

ai

yi

am Bm (x)

m=1

i=1



2

q

n

a ˆi = arg min )+ N (0, σ 2 ).

and yi = f (xi Now y will be approximated by Bˆ a, the pro jection of y onto the column space of B j , that is, the subspace spanned by the M vectors

 

Bi (x1 ) Bi (x2 ) .. . Bi (xn )

 

.

As in part (a), we have

E [(f (x) − fˆ(x)) ] = Mnσ 2

2

,

which is independent of d, the dimensionality of the srcinal space. 13. We assume P (ω1 ) = P (ω2 ) = 0.5 and the distributions are as given in the figure. p(x|ωi)

2 ω2

ω1

1

x2

0

x1

x

0.5x*

0

1

(a) Clearly, by the symmetry of the problem, the Bayes decision boundary is x ∗ = 0.5. The error is then the area of the dark shading in the figure, divided by the total possible area, that is



1

P∗

=

0

|

|

min[P (ω1 )p(x ω1 ), P (ω2 )p(x ω2 )] dx

PROBLEM SOLUTIONS

149



0.5

=

P (ω1 )



1

2x dx + P (ω2 )

0

(2

0.5

− 2x) dx

1 1 0.5 + 0.5 = 0.25. 4 4

=

|

(b) Suppose a point is randoml y selected from ω 1 according to p(x ω1 ) and another point from ω 2 according to p(x ω2 ). We have that the error is





1

|

1

|

dx1 p(x1 ω1 )

0



|

dx2 p(x2 ω2 )

0



(x1 x2 )/2

1



|

dx p(x ω2 )



|

dx p(x ω1 ).

0

(x1 x2 )/2

(c) From part (d), below, we have for the special case n = 2, 1 1 1 51 + + = = 0.425. 3 (2 + 1)(2 + 3) 2(2 + 2)(2 + 3) 120

P2 (e) =

(d) By symmetry, we may assume that the test point b elongs to category ω 2 . Then the chance of error is the chance that the nearest sample point ot the test point is in ω 1 . Thus the probability of error is 1

|

Pn (e) =

P (x ω2 )Pr[nearest y i to x is in ω 1 ]dx

 0

1

=

n

|

P (x ω2 )

0



Pr[yi

i=1

∈ω

1

∀ 

and y i is closer to x than y j , j = i]dx.

By symmetry the summands are the same for all i, and thus we can write

  

1

Pn (e) =

|

P (x ω2 )nPr[y1

0

1

=

1

| − x| > |y − x|, ∀i > 1]dx

and y1

i

1

|

P (x ω2 )n

0

 

|

| − x| > |y − x|, ∀i > 1]dy

|

| − x| > |y − x|] − dy dx,

P (ω1 y1 )Pr[ yi

0

1

=

∈ω

1

1

dx

1

|

P (x ω2 )n

0

P (ω1 y1 )Pr[ y2

0

1

n 1

where the last step again relies on the symmetry of the problem.

| − x| > |y − x|], we divide the integral into six cases, as shown

To evalute Pr[ y2 in the figure.

1

We substitute these values into the above integral, and break the integral into the six cases as 1/2

Pn (e) =

 0



x

|

P (x ω2 )n

0

|

P (ω1 y1 )(1 + 2y1

− 2x) − dy n 1

1

CHAPTER 4. NONPARAMETRIC TECHNIQUES

150

denotes possible locations of y2 with |y2-x|>|y1-x|

x [0,1/2]

Case 1.1:

y1 x 0 Pr[

1

x [0,1/2] x
|y1-x|] = 1 + 2x - 2y1

Case 1.3:

x

0

y < 2x

y1

y1 < 2x - 1

0
|y1-x|] = 1 + 2 y1 - 2x

Case 1.2:

Pr[

y1 < x

0
|y1-x|] = 1 + 2x - 2 y1

2x

+



− 2y ) − dy

|

P (ω1 y1 )(1 + 2x

x

n 1

1

1

1

+



+

   

2x 2x 1

1

|



− y ) − dy n 1

1

0

dx

1



P (ω1 y1 )y1n 1 dy1

|

P (x ω2 )n

1/2

|

P (ω1 y1 )(1



x

|

+

P (ω1 y1 )(1 + 2y1



2x 1

− 2x) − dy n 1

1

+

|

− 2y ) − dy

|



P (ω1 y1 )(1 + 2x

x

1

n 1

1

1



dx.

|

Our density and posterior are given as p(x ω2 ) = 2(1 x) and P (ω1 y) = y for x [0, 1] and y [0, 1]. We substitute these forms into the above large integral and find





1/2

Pn (e) =

 0

2n(1

x

− x)

  

− 2x) − dy n 1

y1 (1 + 2y1

0

1

2x

y1 (1 + 2x

x

− 2y ) − dy 1

1

2x

y1 (1

−y ) 1



n 1

n 1



dy1 dx

1

PROBLEM SOLUTIONS

151



 

1

+



2x 1

2n(1

− x)

1/2

y1n dy1

0

x

y1 (1 + 2y1



2x 1

− 2x) − dy n 1

1

1



y1 (1 + 2x

x

− 2y ) − dy 1 n 1

1

dx.



There are two integrals we must do twice with different bounds. The first is: b



y1 (1 + 2y1

a

− 2x) − dy . n 1

We define the function u(y1 ) = 1 + 2 y1 dy1 = du/2. Then the integral is

− 2x, and thus y

1

= (u + 2x

− 1)/2 and

u(b)

b



1

y(1 + 2y1

a

− 2x) − dy n 1

1

=

1 4



(u + 2x

− 1)u − du n 1

u(a)

=

1 2x 1 n 1 n+1 4 n u + n + 1u





The second general integral is:

u(b)



u(a)

.

b



− 2y ) − dy .

y1 (1 + 2x

1

a

We define the function u(y1 ) = 1 + 2 x dy1 = du/2. Then the integral is



1

− 2y , and thus y 1

1

= (1 + 2 x

− u)/2 and

u(b)

b



n 1

y1 (1 + 2x

a

− 2y ) − dy 1

n 1

=

1

− 14



(1 + 2x + u)un−1 du

u(a)

1 2x + 1

=

−4



u(b)

1

un

un+1

− n+1

n

.



u(a)

We use these general forms to evaluate three of the six components of our full integral for P n (e): x

 0

y1 (1 + 2y1

− 2x) − dy n 1

1

= =

 



1 2x 1 n u 4 n 1 4

− n +1 1 u

2x + 1 1 + n n+1



n+1





1 2x=u(0) 1=u(x)

1 + (1 4

n+1

− 2x)

−  1 n

1 n+1

CHAPTER 4. NONPARAMETRIC TECHNIQUES

152



2x

y1 (1 + 2x

x

− 2y ) − dy n 1

1

1

=

− 14

=

1 4





2x + 1 n u n

2x + 1 n



1



− n +1 1 u

1 n+1

1 2x

y1 (1

− y ) − dy n 1

1

1



=

(1

=

−

2x

n 1

2x)n

−

− n +1 1 (1



1 (1 2n

1 n u n

− u)u − du =

0

1 (1 n

−

n+1



1 2x=u(2x) 1=u(x) n

− 2x)

1 + (1 4

− n +1 1 u

− 2x)

n+1



1 1 + n n+1





1 2x n+1

0



2x)n+1 .

We add these three integrals together and find, after a straightforward calculation that the sum is x 1 + (1 n 2n

n

− 2x)

+



1 2n

− n +1 1



(1

− 2x)

n+1



.

( )

We next turn to the three remaining three of the six components of our full integral for P n (e):





2x 1

y1n dy1

1 (2x n+1

=

0

x

 

+

2y1



2x 1

− 2x)



dy1

=

1

x

− 2y ) − dy 1

n 1

1



− 2x − 1 + n

1 n+1

1 2x + 1 n u 4 n

=

1 4

=

1=u(x)

−  −   −  −  − − − −   − − − 1 4

= y1 (1 + 2x

n+1

1 2x 1 n 1 n+1 4 n u + n + 1u

n 1

y1 (1

− 1)

2x + 1 n

1 (2x 4

1 un+1 n+1

1 n+1





2x 1=u(2x 1)

1)n+1

1 1 + n n+1





2x 1=u(1) 1=u(x)

1 (2x 2n

1)n

1 (2x 4

n+1

− 1)

The sum of these latter three terms is x n

− 2n1 (2x − 1)

n

1 2n

1 n+1

(2x

Thus the sum of the three Case 1 terms is 1/2

 0

2n(1

   x

− x)

y1 (1 + 2y1

0

− 2x) − dy n 1

1

2x

+

y1 (1 + 2x

x

− 2y ) − dy 1

n 1

1

+

2x

y1 (1

− y ) − dy 1

n 1

1



dx

1

1)n+1 .

∗∗

( )

− 1 n

1 n+1

PROBLEM SOLUTIONS

153

1/2

=

 −    − 2n(1

0 1

=

x 1 + (1 n 2n

x)

1+u 2

n

1

u

2n

+

n

+ (1

− 2x)

1 n u + 2n



− n +1 1

0

where we used the substitution u = 1 dx = du/2 and 1 x = (1 + u)/2.





 −    1 2n

n+1

− 2x)

1 2n

  



1

dx

un+1 du,

2x, and thus x = (1



u)/2, and



Likewise, we have for the three Case 2 terms



1 n+1

2x 1

2n(1

− x)

1/2

y1n dy1

0

x

+

y1 (1 + 2y1



2x 1

− 2x) − dy n 1

1

+

y1 (1 + 2x

x

1

=

2n(1

=

1

1

n

1

n

n 1



dx

− x) nx − 2n1 (2x − 1) − (2x − 1)

  − 

1/2 1

− 2y ) − dy

1

u

2

1+u 2n

0



1 n u 2n

1 2n

n+1

 −

1 2n



1 n+1

− n +1 1

dx

   un+1 du,



where we used the substitution u = 2x 1, and thus x = (1+ u)/2 and dx = du/2 and 1 x = (1 u)/2.





Now we are ready to put all these results together:

   −   −   −  −   1

Pn (e) =

1+u 2

n

1

2n

u

+

0

1

+

n

1

u

2

0 1

=

1 (1 2n

n

1+u 2n

− 2n1 u

u2 ) + 1 un+1 + 2n

0

= = =

     − −  −    −    − 

1 n u + 2n



1 2n

− n +1 1

1 2n

n

1 2n

1 n+1

1 n+1

1 u3 1 1 1 u un+2 + 2n 3 2n(n + 2 n + 3 2n 1 2 1 1 1 n + + n 2n 3 2n(n + 2) n + 3 2n(n + 1) 1 (n + 1)(n + 3) (n 1)(n + 2) + 3 2(n + 1)(n + 2)(n + 3) n

− −

un+1 du

un+1 du

un+2 du 1 n+1

1

un+3

0

CHAPTER 4. NONPARAMETRIC TECHNIQUES

154 = =

1 3n + 5 + 3 2(n + 1)(n + 2)(n + 3) 1 1 1 + + , 3 (n + 1)(n + 3) 2(n + 2)(n + 3)

which decays as 1 /n2 , as shown in the figure. Pn(e) 0.5 0.4 1/3 0.2 2

4

6

8

10

n

We may check this result for the case n = 1 where there is only one sample. Of course, the error in that case is P1 (e) = 1/2, since the true label on the test point may either match or mismatch that of the single sample point with equal probability. The above formula above confirms this P1 (e) = =

1 1 1 + + 3 (1 + 1)(1 + 3) 2(1 + 2)(1 + 3) 1 1 1 1 + + = . 3 8 24 2

(e) The limit for infi nite data is simpl y lim Pn (e) =

→∞

n

1 , 3

which is larger than the Bayes error, as indeed it must be. In fact, this solution also illustrates the bounds of Eq. 52 in the text: P∗ 1 4

≤ ≤

P 1 3

≤ P ∗(2 − 2P ∗) ≤ 38 .

14. We assume P (ω1 ) = P (ω2 ) = 0.5 and the distributions are as given in the figure. p(x|ωi) ω1

ω2

1

0 0

x1

x*

0.5

x2

x 1

(a) This is a somewhat unusual problem in that the Bayes decision can be any point 1/3 x ∗ 2/3. For simplicity, we can take x ∗ = 1/3. Then the Bayes error is

≤ ≤

PROBLEM SOLUTIONS

155

then simply

 

1

P∗

=

|

|

min[P (ω1 )p(x ω1 ), P (ω2 )p(x ω2 )]dx

0

2/3

=

|

P (ω1 )p(x ω1 )dx

1/3

=

0.5(1/3)(3/2) = 0 .25.

(b) The shaded area in the figure shows the possible (and equally likely ) values of a point x 1 chosen from p(x ω1 ) and a point x 2 chosen from p(x ω2 ). x2

|

|

1

1

2/3

2 1/3 0

x1

0

1/3

2/3

1

There are two functionally separate cases, as numbered, corresponding to the position of the decision boundary x∗ = (x1 + x 2 )/2. (Note that we will als o have to consider which is larger, x 1 or x 2 . We now turn to the decision rule and probability of error in the single nearest-neighbor classifier in these two cases: case 1 : x2 x1 and 1 /3 (x1 + x 2 )/2 2/3: Here the decision point x∗ is between 1 /3 and 2 /3, with 2 at large values of x. This is ju st the Bayes case described in part (a) and the error rate is thus 0.25, as we saw. The relative probability of case 2 occuring is the relative area of the gray region, that is, 7 /8. case 2 : x1 x2 and 1 /3 (x1 + x2 )/2 2/3: Here the decision boundary is between 1 /3 and 2 /3 (in the Bayes region) but note especially that 1 is for large values of x, that is, the decision is the opposite of the Bayes decision. Thus the error is 1.0 minus the Bayes error, or 0 .75. The relative probability of case 2 occuring is the relative area of the gray region, that









R





R

is, 1 /8. We calculate the average error rate in the case of one point from each category by merely adding the probability of occurrence of each of the three cases (proportional to the area in the figure), times the expected error given that case, that is, P1 =

7 1 5 0.25 + 0.75 = = 0.3125, 8 8 16

which is of course greater than the Bayes error.

CHAPTER 4. NONPARAMETRIC TECHNIQUES

156

(c) Problem not yet solved (d) Problem not yet solved

→∞

≤ ≤ ≤ ≤ ≤ ≤

(e) In the limit n , every test point x in the range 0 x 1/3 will be properly classified as ω1 and every point in the range 2 /3 x 1 will be properly classified as ω2 . Test points in the rang e 1 /3 x 2/3 will be misclassified half of the time, of course. Thus the expected error in the n case is P∞

=

→∞ P (ω )Pr[0 ≤ x ≤ 1/3|ω ] · 0 + P (ω )Pr[1/3 ≤ x ≤ 2/3|ω ] · 0.5 +P (ω )Pr[1/3 ≤ x ≤ 2/3|ω ] · 0.5 + P (ω )Pr[2/3 ≤ x ≤ 1 |ω ] · 0 1

1

2

1

2

1

2

2

= 0.5 0.5 0.5 + 0.5 0.5 0.5 = 0.25.

Note that this is the same as the Bayes rate. This problem is closely related to the “zero information” case, where the posterior probabilities of the two categories are equal over a range of x. If the problem specified that the distributions were equal throughout the full range of x, then the Bayes error and the P∞ errors would equal 0 .5. 15. An faster version of Algorithm 3 in the text deletes prototypes as follows: Algorithm 0 (Faster nearest-neighbor) 1 2 3 4 5 6 7 8 9

← D

begin initialize j 0, , n = number of prototypes Construct the full Voronoi diagram of do j j + 1 (for each prototype x j )

D



if xj is not marked then find the Voronoi neighbors of xj if any neighbor is not from the xj class then mark x j and its neighbors in other classes until j = n Discard all unmarked prototypes returnVoronoi diagram of the remaining (marked) prototypes end

If we have k Voronoi neighbors on average of any point x j , then the probability that i out of these k neighbors are not from the same class as x j is given by the binomial law: P (i) =



k (1 i

− 1/c) (1/c) − , k

k 1

where we have assumed that all the classes have the same prior probability. Then the expected number of neighbors of any point x j belonging to different class is E (i) = k(1

1/c).

− Since each time we find a prototype to delete we will remove k(1−1/c) more prototypes on average, we will be able to speed up the search by a factor k(1 − 1/c). 16. Consider Algorithm 3 in the text.

(a) In the figure, the training points (blac k for ω1 , white for ω2 ) are constrained to the intersections of a two-dimensional grid. Note that protot ype f does not contribute to the class boundary due to the existence of points e and d. Hence f should be removed from the set of prototypes by the editing algorithm (Algorithm 3 in the text). However, this algorithm detects that f has a prototype

PROBLEM SOLUTIONS

157

from another class (prototype c) as a neighbor, and thus f is retained according to step 5 in the algorithm. Consequently, the editing algorithm does not only select the points a, b, c, d and e which are are the minimum set of points, but it also retains the “useless” prototype f .

R1

R2

a

c

b

d

e

f l

(b) A sequential editing algorithm in which each point is considered in turn and retained or rejected before the next point is considered is as follows: Algorithm 0 (Sequential editing) 1 2 3



D

if until xj is not J =well n classified by return

4 5 6 7

← D

begin initialize j 0, , n = number of prototypes do j j+1 Remove x j from

D

end

D, then restore x  to D j

This sequential editing algorithm picks up one training sample and checks whether it can be correctly classified with the remaining prototypes. If an error is detected, the prototype under consideration must be kept, since it may contribute to the class boundarie s. Otherwise it may be removed. This procedure is repeated for all the training patterns. Given a set of training points, the solution computed by the sequential editing algorithm is not unique, since it clearly depends upon the order in which the data are presented. This can be seen in the following simp le example in the figure, where black points are in ω1 , and white points in ω2 . If d is the first point b

a

d

f

c

e

presented to the editing algorithm, then it will be removed, since its nearestneighbor is f so d can be correctly classified. Then, the other points are kept , except e, which can be removed since f is also its nearest neighbor. Suppose now that the first point to be considered is f . Then, f will be removed since d is its nearest neighbor. However, point e will not be deleted once f has been

CHAPTER 4. NONPARAMETRIC TECHNIQUES

158

removed, due to c, which will be its nearest neighbor. According to the above considerations, the algorithm will return points a, b, c, f in the first ordering, and will return points a, b, c, d, e for the second ordering.

|

17. We have that P (ωi ) = 1/c and p(x ωi ) = p(x) for i = 1,...,c . Thus the probability density of finding x is c

c

p(x) =

|

p(x ωi )P (ωi ) = i=1

and accordingly

|



|

p(x ω) i=1



|

|

1 = p(x ω), c

|

p(x ωi )P (ωi ) p(x ω)1/c 1 = = . p(x) p(x ω) c

P (ωi x) =



( )

|



We use ( ) in Eq. 45 in the text to find that the asymptotic nearest-neighbor error rate is P = lim n

→∞ Pn (e)

  − |    −   −  − c

=

P 2 (ωi x) p(x)dx

1

i=1 c

=

1

i=1

=

1 c

1

1 p(x)dx c2

p(x)dx = 1

1 c.

On the other hand, the Bayes error is P∗

=

 

P ∗ (error x)p(x)dx

|

c

=

i=1

R

[1

− P (ω |x)]p(x)dx,

∗∗

( )

i

i

R



where i denotes the Bayes decision regions for class ω i . We now substitute ( ) into ( ) and find

∗∗

  −  c

P∗ =

1

i=1

1 c

p(x)dx =

R

i

 −  1

1 c

p(x)dx = 1

− 1c . c





Thus have P = P ∗ , which in thiswe “no information” case. agrees with the upper bound P ∗ (2 c−1 P ∗ ) = 1 1/c 18. The probability of error of a classifier, and the k-nearest-neighbor classifier in particular, is P (e) =



|

p(e x)dx.

|

|

In a two-class case, where P (ω1 x) + P (ω2 x) = 1, we have

|

|

|

|

|

P (e x) = P (e x, ω1 )P (ω1 x) + P (e x, ω2 )P (ω2 x).

PROBLEM SOLUTIONS

159

The probability of error given a pattern x which belongs to class ω 1 can be computed as the probability that the number of nearest neighbors which belong to ω1 (which we denote by a) is less than k/2 for k odd. Thus we have

|

P (e x, ω1 ) = P r[a

≤ (k − 1)/2]. − 1)/2,

The above probability is the sum of the probabilities of finding a = 0 to a = (k that is,



(k 1)/2

Pr[a



≤ (k − 1)/2] =

Pr[a = i].

i=0

The probability of finding i nearest prototopyes which belong to ωi among k is k P (ω1 x)i P (ω2 x)k−i multiplied by the number of possible combinations, i ; that is,

|



|

  −

(k 1)/2

|

P (e x, ω1 ) =

k P (ω1 x)i P (ω2 x)k−i . i

i=0

By a simple interchange ω 1

|

|

↔ ω , we find 2



(k 1)/2

|

P (e x, ω2 ) =

i=0





k P (ω2 x)i P (ω1 x)k−i . i

|

|

We put these results together and find

   −

(k 1)/2

|

P (e x) =

i=0

k i

P (ω1 x)i+1 [1

|

− P (ω |x)] −

k i

1

+ P (ω1 x)k−i [1

|

i+1

− P (ω |x)] 1



.

Recall that the conditional Bayes error rate is P ∗ (e x) = min[ P (ω1 x), P (ω2 x)].

|

|

|

|

Hence we can write P (e x) as

   −

(k 1)/2

|

P (e x) =

i=0

k i

P ∗ (e x)i+1 [1

|

− P ∗(e|x))] −

k i

+ P ∗ (e x)k−i [1

|

− P ∗(e|x)]

i+1

|

= fk [P (e x)],



where we have defined f k to show the dependency upon k. This is Eq. 54 in the text. Now we denote Ck [P ∗ (e x)] as the smallest concave function of P ∗ (e x) greater than f k [P ∗ (e x)], that is,

|

|

|

fk [P ∗ (e x)]

| ≤ C [P ∗(e|x)]. k

Since this is true at any x, we can integrate and find P (e) =



fk [P ∗ (e x)]p(x)dx

|





Ck [P ∗ (e x)]p(x)dx.

|

CHAPTER 4. NONPARAMETRIC TECHNIQUES

160

Jensen’s inequality here states that under very broad conditions, We apply Jensen’s inequalit y to the expression for P (e) and find P (e)





Ck [P ∗ (e x)]p(x)dx

|

≤C



k

E [u(x)] ≤ u[E [x]]. x



P ∗ (e x)p(x)dx = C k [P ∗ ],

|

where again P ∗ is the Bayes error. In short, we see P (e) fk [P ∗ ] C k [P ∗ ].



≤ C [P ∗], and from above k

Section 4.6 19. We must show that the use of the distance measure

 d

D(a, b) =

αk (ak

k=1

2

−b ) k

for α k > 0 generates a metric space. This happens if and only if D(a, b) is a metric. Thus we first check whether D obeys the prope rties of a metric. First, D must be non-negative, that is, D(a, b) 0, which indeed holds because the sum of squares is non-negative. Next we consider reflexivity, that is, D(a, b) = 0 if and only if a = b. Clearly, D = 0 if and only if each term in the sum vanishes, and thus indeed a = b. Conversely, if a = b, then D = 0. Next we conside r symmetry, that is, D(a, b) = D(b, a) for all a and b. We can alter the orde r of the elements in the squared sum without affecting its value, that is, ( ak bk )2 = (bk ak )2 . Thus







symmetry holds. The last property is the triangle inequal ity, that is, D(a, b) + D(b, c) D(a, c)



for all a, b, and c. Note that D can be written



D(a, b) = A(a where A = diag[ α1 ,...,α d ]. We let x = a inequality test can be written

− b)

2

− c and y = c − b, and then the triangle

Ax + Ay ≥ A(x + y) . 2

2

2

We denote x  = Ax and y  = Ay, as the feature vectors computed using the matrix A from the input space. Then, the above inequality gives

x + y  ≥ x + y  . 2

2

2

We can square both sides of this equation and see ( x

  + y   ) 2

2

2

d

= x

  + y   2 2

2 2

+ 2 x

  y  ≥ x + y  2

2

which is equivalent to the simple test d

x   y   ≥ 2

2

 i=1

xi yi .

2 2

2 2

+2

 i=1

xi yi = x + y 2 ,





PROBLEM SOLUTIONS

161

Observe that the above inequality if fulfilled, since the Cauchy-Schwarz inequality states that

  

 

d

x y ≥ 2

2

xi yi .

i=1

If we work with metric spaces in the nearest-neighbor method, we can ensure the existence of best approximations in this space. In other words, there will always be a stored prototype p of minimum distance from an input pattern x ∗ . Let δ = min D(p, x), x

S

S

where x belongs to a set of the metric space which is compact. (A subset is said to be compact if every sequence of points in has a sequence that converges to a point in .) Suppse we define a sequence of points x1 , x2 ,... xn with the property

S

S

{ } D(p, x ) → δ as n → ∞, → x∗ using the compactness of S. By the triangle inequality, we have that D(p, x ) + D(x , x∗ ) ≥ D(p, x∗ ). n

where x n

n

n

The right-hand side of the inequality does not depend on n, and the left side approches δ as n . Thus we have δ D(p, x∗ ). Nevertheless, we also have D(p, x∗ ) δ

→∞



because x ∗ belongs to



S

D(p, x∗ ) = δ .

. Hencefor wenearest-neighbor have that The import of this property classifiers is that given enough prototypes, we can find a prototype p very close to an input pattern x ∗ . Then, a good approximation of the posterior class probabilit y of the input pattern can be achieved with that of the stored prototype, that is, p x ∗ implies

 P (ω |p)  P (ω |x∗ ). i

i

20. We must prove for

 | − |  d

Lk (a, b) =

ai

bi

1/k

k

 − b

= a

i=1

k

the four properties a metric are obeyed.

• The first states that L must be positive definite. Since |a − b | ≥ 0 for all i, L (a, b) ≥ 0. • The second states that L (a, b) = 0 if and only if a = b. If a = b, then each term of the sum |a − b | is 0, and L (a, b) = 0. Conversely, if L (a, b) = 0, then each term |a − b | must be 0, and thus a = b for all i, that is, a = b. • The third condition is symmetry, that is L (a, b) = L (b, a). This follows k

i

i

k

k

i

i

i

k

k

i

i

i

k

k

directly from the fact

|a − b | = | − (b − a )| = | − 1| |b − a | = |b − a | i

i

i

i

i

i

i

i

for i = 1,...,d.

CHAPTER 4. NONPARAMETRIC TECHNIQUES

162

• The last property is the triangle inequality, that is L (a, b) + L (b, c) ≥ L (a, c) k

k

k

or

a − b + b − c ≥ a − c k

k



( )

k

− b = x and b − c = y, then ( ∗) can be x + y ≥ x + y . (∗∗)

for arbitrary a, b and c. We define a written k

k

k

We exponentiate both sides and find an equivalent condition k

k

  + y  ) ≥ ( x + y  ) for k > 1. We expand the left-hand side of ( ∗ ∗ ∗) to find ( x

k

k

k

 | |  | |     d

  + y )

( x

k

k

k

=

(

+

k

yi

i=1



k 1

d

k

xi

+

i=1

j=1

k j

x



k j k

j k

· y  ,

and the right-hand side to find d k k



d

k 1

k

k

x + y  = | | + | | In short, then ( ∗ ∗ ∗) can be written as

k j

d k j

j

   | | | |  ·  ≥ | | | | − xi

i=1

bi

i=1

+

j =1



xi

i=1

bi .

d

x −

k j k

y

j k



k j

xi

bi

j

j = 1,...,k

1.

i=1

Note that d

x − ≥ k j k

| | xi

d



k j

and

i=1

j k

y ≥

| | yi

j

i=1

because q

a

aq



for a > 0.

     | |    ≥ | |    ≥ 

Then we have for the above special case d

x



k j k

y

j k

d

xi



k j

i=1

Since

ai

i

bi

i

yi

j

.

i=1

ai bi

i

for a i , bi > 0,

∗ ∗ ∗)

PROBLEM SOLUTIONS

163

we have

 | |  | |  ≥  | | d

x − y ≥ k j k

We replace x = a

j k

d

xi



k j

d

yi

i=1

j

xi

i=1



k j

i=1

j

|b | . i

− b and y = b − c and conclude Lk (a, b) + Lk (b, c)

Lk (a, c),

≥ and hence the triangle inequality is obeyed. 21. Problem not yet solved 22. Problem not yet solved 23. Problem not yet solved 24. Problem not yet solved 25. Consider the computational complexity of calculating distances under different metrics. (a) The Euclidean distanc e between two prototypes is given by d

x − x 

2

=



(xi

i=1

− x ) , i

2

where x denotes the transformed image, x is a stored prototype, and d is the number of pixels in the handwritte n digit. As we can see, d subtractions, d multiplications and d additions are needed. Consequently, this is an O(d) process. (b) Given a text sample x and a prototype x  , we must compute r non-linear transforms, which depend on a set of parameters subject to optimization. If an iterative algorithm is used for optimizing these parameters, we will have to compute the Euclidean distance of the transformed image and the stored prototype x for each step. Since the computation of the transformed image invlves r non-linear transformations (with a i k 2 operations per transform), the number of operations required is ai k 2 r + 3k 2 = (ai r + 3)k2 . (c) For each prototype, we will have to perform A searches, and this implies (ai r + 3)k2 A operations. Accordingly, the total number of operations is 2

(ai r + 3)k An. (d) If n = 106 , r = 6, a i 10, A 5, and basic operations on our computer require 10−9 second, then the classification of a single point is performed in





(10 6 + 3)k2 5 106 operations

·

· ·

−9

seconds · 10operation = 0.315k seconds. 2

Suppose that k = 64. Then a test sample will be classified in 21 minutes and 30 seconds.

CHAPTER 4. NONPARAMETRIC TECHNIQUES

164

26. Explore the effect of r on the accuracy of nearest-neighbor search based on partial distance. (a) We make some simplifications, specifically that points are uniformly distributed in a d-dimensional unit hypercube. The closest point to a given test point in, on average, enclosed in a small hypercube whose volume is 1/n times the volume of the full space. The length of a side of such a hypercube is thus 1 /n1/d . Consider a single dimension . The probability that any point falls within the pro jection 1/d



of the hypercube is thus 1 /n Now in r d dimensions, the probability of falling inside the projection of the. small hypercube onto r-dimensional subspace r/d is 1/n , i.e., the product of r indpendendent events, each having probability 1/n1/d . So, for each poin t the probability it will not fall inside the volume in the r-dimensional space is 1 1/nr/d . Since we have n indpendent such points, the probability that the partial distance in r dimensions will give us the true nearest neighbor is





P r[nearest neighbor in d dim. is found in r < d dim.] =

−  1

1

nr/d

n

,

≤ d = 10 and n = 5, 10 and 100.

as shown in the figure for 0 < r P 1

100

n=5

0.8 0.6 0.4

n=10

0.2 2

4

6

8

r

10

(b) The Bayes decision boundary in one dimenion is clearly x∗1 = 0.5. In two dimensions it is the line x ∗2 = 1 x1 . In three dimension the boundardy occurs when



x1 x2 x3 = (1

− x )(1 − x )(1 − x ) 1

2

3

which implies

x∗3 =

(x − 1) (x − 1) − x − x + 2x x 1

1

2

1

2

,

1 2

as shown in the figure. (c) The Bayes error in one dimensio n is clearly



1

EB

=

0

|

|

Min[P (ω1 )p(x ω1 ), P (ω2 )p(x ω2 )]p(x) dx

PROBLEM SOLUTIONS d=1

165

2=d

3=d x3

x2

0

0.5

0.8

1

R2

0.8

R2 R2

0.6

0 1

1

R1

0.4

0.2

x2

0.6

x1 R1

1

R1

0.4

0

0.2

x1

0 1

0 0.2

0.4 0.6 0.8

x1

1

x∗ =0.5



·

= 2 0.5

(1

0

− x) dx

= 0.1875. The Bayes error in two dimensions is x∗ 2 =1 x1

  1

EB

·

=

2 0.5

=

dx1

x1 =0



dx2 (1

0

− x )(1 − x ) 1

2

0.104167.

The Bayes error in three dimensions is 1

EB

=

1

dx1 dx2

0

=

x∗ 3 (x1 ,x2 )

   0

− x )(1 − x )(1 − x )

dx3 (1

1

0

2

3

0.0551969,

where x ∗3 (x1 , x2 ) is the position of the decision boundary, found by solving x1 x2 x3 = (1

− x )(1 − x )(1 − x ) 1

2

3

for x 3 , that is, x∗3 (x1 , x2 ) =

(x − 1) (x − 1) − x − x + 2x x 1

1

2

1

2

.

1 2

The general decision boundary in d dimensions comes from solving d

d

  xi =

i=1

(1

i=1

−x ) i

or



d 1

xd



i=1



d 1

xi = (1

−x ) d



i=1

(1

−x ) i

CHAPTER 4. NONPARAMETRIC TECHNIQUES

166

for x d , which has solution



d 1

x∗d (x1 , x2 ,...,x

d 1) = d 1







i

−x )+



1



dx1

0

dx2

dxd−1

0

d

dxd

0

xk

k=1

x∗ d (x1 ,...,xd−1 )

1

.

d 1

j

  ···      1

EB =

−x )

(1

i=1

(1

j=1

The Bayes error is



(1

j=1

0

− x ). i

d integrals

We first compute the integral(x[1] - 1)) over x d , since it has the other variables implicit. This integral I is x∗ d (x1 ,...,xd−1 )



I=

(1

0

− x ) dx d

d



= xd

 ∗ − 12

= x∗d

− 12 (x∗) .

xd 0

d

x2d

x∗ d 0

2

Substituting x ∗ from above we have d

  −   −  −  −   −   −  −  −   −

d 1

I

=



d 1

(1

xi )

i=1



d 1

(1

xj ) +

j=1



d 1

1/2 =

(1

xi ) 2 +

i=1





xj ) +

j=1



d 1

(1

xi )

i=1

2

xk

k=1



d 1

xi (1

xi )

i=1

d 1



2

d 1

(1

1/2

d 1

(1



d 1

xk

k=1

xj ) +

j=1

.

xk

k=1

(d) Problem not yet solved (e) Problem not yet solved 27. Consider the Tanimoto metric described by Eq. 58 in the text. (a) The Tanimoto metric measures the distance between two sets ing to

AB

DT animoto ( , ) =

na + nb na + nb

A and B, accord-

− 2n −n

ab

ab

where n a and n b are the sizes of the two sets, and n ab is the number of elements common to both sets.

PROBLEM SOLUTIONS

167

We must check whether the four properties of a metric are always fulfilled. The first property is non-negativity, that is,

AB

DT animoto ( , ) = Since the sum of the elements in elements, we can write

na + nb na + nb

− 2n ≥ 0. −n ab

ab

A and B is always greater than their common −

na + nb nab > 0. Furthermore, the term n a + nb 2nab gives account of the number of different elements in sets and , and thus

A



B

na + nb

− 2n ≥ 0. ab

A B ≥ 0.

Consequently, D Tanimoto ( , )

The second property, reflexivity, states that

AB

DT animoto ( , ) = 0

if and only if

A = B. AB

From the definition of the Tanimoto measure above, we see that DTanimoto ( , ) will be 0 if and only if na + nb 2nab = 0. This numerator is the number of different elements in and , so it will yield 0 only when and have no different elements, that is, when = . Conversely, if = , then na + nb 2nab = 0, and hence D T animoto ( , ) = 0.

− B A B AB

A

A A B

B



The third property is symmetry, or

AB

BA

DTanimoto ( , ) = D T animoto ( , )

A

B

for all and . Since the terms na and nb appear in the nume rator and denominator in the same way, only the term n ab can affect the fulfilment of this property. However, n ab and n ba give the same measure: the number of elements common to both and . Thus the Tanimoto metric is indeed symmetric.

A

B

The final property is the triangle inequality:

D_Tanimoto(A, B) + D_Tanimoto(B, C) ≥ D_Tanimoto(A, C).        (∗)

We substitute the definition of the Tanimoto metric into (∗) and find

(n_a + n_b − 2 n_ab)/(n_a + n_b − n_ab) + (n_b + n_c − 2 n_bc)/(n_b + n_c − n_bc) ≥ (n_a + n_c − 2 n_ac)/(n_a + n_c − n_ac).

After some simple manipulation, the inequality can be expressed as

3BDE − 2ADE − 2BCE + ACE − 2BDF + ADF + BCF ≥ 0,

where

A = n_a + n_b
B = n_ab
C = n_b + n_c
D = n_bc
E = n_a + n_c
F = n_ac.

We factor to find

E[C(A − 2B) + D(3B − 2A)] + F[D(A − 2B) + BC] ≥ 0.

Since A − 2B is the number of different elements in A and B and hence non-negative, the triangle inequality is fulfilled if any of the following conditions is met:

3B − 2A = 3 n_ab − 2(n_a + n_b) ≥ 0        (∗∗)
D = n_bc = 0                               (∗∗∗)
C(A − 2B) + D(3B − 2A) ≥ 0.

This last condition can be rewritten after some rearrangement as

D_Tanimoto(A, B) ≥ n_bc / (n_b + n_c − n_bc).

On the other hand, if we take other common factors in the above equation, we have

B[E(3D − 2C) + F(C − 2D)] + A[E(C − 2D) + DF] ≥ 0.

The term C − 2D is the number of different elements in B and C, so it is non-negative. Accordingly, the triangle inequality can also be fulfilled if any of the following additional conditions is met:

3D − 2C = 3 n_bc − 2(n_b + n_c) ≥ 0
E = n_a + n_c = 0
E(3D − 2C) + F(C − 2D) ≥ 0.

After some rearrangement, this last inequality can be written as

D_Tanimoto(B, C) ≥ n_bc / (n_bc − (n_b + n_c)).

Since the denominator of the right-hand side is always negative, the inequality is met. Thus we can conclude that the Tanimoto metric also obeys the triangle inequality.

(b) Since the Tanimoto metric is symmetric, we consider here only 15 out of the 30 possible pairings.

A          B            n_a   n_b   n_ab   D_Tanimoto(A,B)   rank
pattern    pat           7     3     3         0.57            2
pattern    pots          7     4     2         0.77            7
pattern    stop          7     4     2         0.77            7
pattern    taxonomy      7     8     3         0.75            6
pattern    elementary    7    10     5         0.58            3
pat        pots          3     4     2         0.6             4
pat        stop          3     4     2         0.6             4
pat        taxonomy      3     8     2         0.77            7
pat        elementary    3    10     2         0.81            9
pots       stop          4     4     4         0               1
pots       taxonomy      4     8     2         0.8             8
pots       elementary    4    10     1         0.92           10
stop       taxonomy      4     8     2         0.8             8
stop       elementary    4    10     1         0.92           10
taxonomy   elementary    8    10     5         0.61            5
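The table can be regenerated, and part (c) below checked directly, with a short script; it assumes the tanimoto helper sketched after the definition above is available on the MATLAB path, and it tests the triangle inequality over every ordered triple of the six words, which should confirm the conclusion of part (c).

words = {'pattern', 'pat', 'pots', 'stop', 'taxonomy', 'elementary'};
n = numel(words);
D = zeros(n);                                % pairwise Tanimoto distances
for i = 1:n
    for j = 1:n
        D(i,j) = tanimoto(words{i}, words{j});
    end
end
disp(D);                                     % compare with the table above

% Part (c): check the triangle inequality for every triple (A, B, C).
ok = true;
for i = 1:n
    for j = 1:n
        for k = 1:n
            ok = ok && (D(i,j) + D(j,k) >= D(i,k) - 1e-12);
        end
    end
end
fprintf('Triangle inequality holds for all triples: %d\n', ok);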


(c) Yes, the Tanimoto metric fulfills the triangle inequality for all the sets given above.

Section 4.7

28. As shown in Fig. 23 in the text, there is a big difference between “categories” of a feature and the (true) categories or classes for the patterns. For this problem there is a fuzzy feature that we can call “temperature,” and a class named “hot.” Given a pattern, e.g., a cup, we can observe a degree of membership of this object in the fuzzy feature “temperature,” for instance “temperature = warm.” However, “temperature = warm” does not necessarily imply that the membership in the “hot” class is less than 1.0, because one could argue that the fuzzy feature “temperature” is not very hot but merely warm. We could reasonably suppose that if the “hot” class has something to do with temperature, then its membership function would take the fuzzy feature “temperature” into account. Accordingly, values of “temperature” such as “warm” or “hot” might be active to some degree for the class “hot,” though the exact dependence between them is unknown to us without more information about the problem.

29. Consider “fuzzy” classification.

(a) We first fit the designer's subjective feelings into fuzzy features, as suggested by the table below.

Feature      designer's feeling   fuzzy feature
lightness    medium-light         light
lightness    medium-dark          dark
lightness    dark                 dark
length       short                short
length       long                 large

If we use Min[a, b] as the conjunction rule between fuzzy sets a and b, then the discriminant functions can be written as

d1 = Min[ĉ(length, 13, 5), ĉ(lightness, 70, 30)]
d2 = Min[ĉ′(length, 5, 5), ĉ′(lightness, 30, 30)]
d3 = Min[ĉ(length, 13, 5), ĉ′(lightness, 30, 30)],

where

ĉ(x, µ, δ) =  { 1                 if x > µ
              { 1 + (x − µ)/δ     if µ − δ ≤ x ≤ µ
              { 0                 otherwise

ĉ′(x, µ, δ) = { 0                 if x > µ + δ
              { 1 + (µ − x)/δ     if µ ≤ x ≤ µ + δ
              { 1                 otherwise.
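As a quick illustration (not part of the problem statement), the MATLAB sketch below encodes these two membership functions as anonymous functions and evaluates the three discriminants at the test pattern used in part (c) below, reproducing d1 = d2 = d3 = 0. The handle and variable names are our own choices.

% c_hat saturates at 1 for x > mu and ramps up over [mu - delta, mu];
% c_hatp saturates at 1 for x < mu and ramps down over [mu, mu + delta].
c_hat  = @(x, mu, delta) (x > mu) + (x >= mu - delta & x <= mu) .* (1 + (x - mu)/delta);
c_hatp = @(x, mu, delta) (x < mu) + (x >= mu & x <= mu + delta) .* (1 + (mu - x)/delta);

length_x    = 7.5;            % the test pattern of part (c)
lightness_x = 60;

d1 = min(c_hat(length_x, 13, 5),  c_hat(lightness_x, 70, 30));
d2 = min(c_hatp(length_x, 5, 5),  c_hatp(lightness_x, 30, 30));
d3 = min(c_hat(length_x, 13, 5),  c_hatp(lightness_x, 30, 30));
fprintf('d1 = %.3f, d2 = %.3f, d3 = %.3f\n', d1, d2, d3);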

(b) If every “category membership function” were rescaled by a constant α, the discriminant functions would be

d1 = Min[αĉ(length, 13, 5), αĉ(lightness, 70, 30)]
d2 = Min[αĉ′(length, 5, 5), αĉ′(lightness, 30, 30)]
d3 = Min[αĉ(length, 13, 5), αĉ′(lightness, 30, 30)].

Since Min[αf1, αf2] = αMin[f1, f2], the above functions can be written

d1 = αMin[ĉ(length, 13, 5), ĉ(lightness, 70, 30)]
d2 = αMin[ĉ′(length, 5, 5), ĉ′(lightness, 30, 30)]
d3 = αMin[ĉ(length, 13, 5), ĉ′(lightness, 30, 30)].

Hence the classification borders would be unaffected, since all the discriminant functions are rescaled by the same constant α (assumed positive), and thus

arg max_i (α d_i) = arg max_i (d_i).

(c) Given the pattern x = (7.5, 60)^t and the above equations, the values of the discriminant functions can be computed as follows:

d1 = Min[ĉ(7.5, 13, 5), ĉ(60, 70, 30)] = Min[0, 0.66] = 0
d2 = Min[ĉ′(7.5, 5, 5), ĉ′(60, 30, 30)] = Min[0.5, 0] = 0
d3 = Min[ĉ(7.5, 13, 5), ĉ′(60, 30, 30)] = Min[0, 0] = 0.

Since all the discriminant functions are equal to zero, the classifier cannot assign the input pattern to any of the existing classes.

(d) In this problem we deal with a handcrafted classifier. The designer has selected lightness and length as features that characterize the problem and has defined several membership functions for them without any theoretical or empirical justification. Moreover, the conjunction rule that fuses the membership functions of the features to define the true discriminant functions is also imposed without any principle. Consequently, we cannot know where the sources of error come from.

30. Problem not yet solved

Section 4.8

31. If all the radii have been reduced to a value less than λ_m, Algorithm 4 in the text can be rewritten as:

Algorithm 0 (Modified RCE)

1  begin initialize j ← 0, n ← number of patterns, ε ← small parameter
2    do j ← j + 1
3      w_ji ← x_ji for i = 1, ..., d
4      x̂ ← arg min_{x ∉ ω_i} D(x, x_j)
5      λ_j ← D(x̂, x_j) − ε
6      if x_j ∈ ω_k, then a_jk ← 1 for k = 1, ..., c
7    until j = n
8  end
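A minimal MATLAB sketch of this modified training loop is given below. The matrix/label representation of the training data, the Euclidean distance, and the function name are our own assumptions for the illustration; the essential step is that each radius is set to the distance to the nearest sample of another class, minus ε.

function [W, lambda, labels] = rce_train(X, y, epsilon)
% Modified RCE training: each pattern becomes a prototype whose radius is
% the distance to its nearest sample of another class, minus epsilon.
% X: n-by-d matrix of training patterns; y: n-by-1 class labels.
n = size(X, 1);
W      = X;                    % prototype centers w_j (one per pattern)
lambda = zeros(n, 1);          % prototype radii
labels = y;                    % record the class of each prototype
for j = 1:n
    enemy = find(y ~= y(j));                         % samples not in omega_i
    dist  = sqrt(sum((X(enemy,:) - X(j,:)).^2, 2));  % Euclidean distances
    lambda(j) = min(dist) - epsilon;                 % lambda_j = D(x_hat, x_j) - eps
end
end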


According to Algorithm 5 in the text, the decision boundaries of the RCE classifier depend on the λ_j: an input pattern x is assigned to a class if and only if all of the prototypes w_j that satisfy D(x, w_j) ≤ λ_j belong to that same class. Since λ_j = D(x̂, w_j) − ε, where x̂ is the nearest training sample to w_j belonging to another class, we will have different class boundaries when we use different training points.

Section 4.9

32. We seek a Taylor series expansion for

p_n(x) = (1/(n h_n)) Σ_{i=1}^{n} ϕ((x − x_i)/h_n).

(a) If the Parzen window is of the form ϕ(x) ∼ N(0, 1), or

ϕ(x) = (1/√(2π)) e^{−x²/2},

we can then write the estimate as

p_n(x) = (1/(n h_n)) Σ_{i=1}^{n} (1/√(2π)) exp[ −(x − x_i)²/(2 h_n²) ]
       = (1/(√(2π) h_n)) (1/n) Σ_{i=1}^{n} exp[ −(x² + x_i² − 2 x x_i)/(2 h_n²) ]
       = (1/(√(2π) h_n)) (1/n) Σ_{i=1}^{n} exp[ −x²/(2 h_n²) ] exp[ x x_i/h_n² ] exp[ −x_i²/(2 h_n²) ].

We express this result in terms of the normalized variables u = x/h_n and u_i = x_i/h_n as

p_n(x) = (e^{−u²/2}/(√(2π) h_n)) (1/n) Σ_{i=1}^{n} e^{−u_i²/2} e^{u u_i}.

We expand e^{x x_i/h_n²} = e^{u u_i} about the origin,

e^{u u_i} = Σ_{j=0}^{∞} (u^j u_i^j)/j!,

and therefore our density can be written as

p_n(x) = (e^{−u²/2}/(√(2π) h_n)) (1/n) Σ_{i=1}^{n} Σ_{j=0}^{∞} (u^j u_i^j/j!) e^{−u_i²/2}
       = (e^{−u²/2}/(√(2π) h_n)) Σ_{j=0}^{∞} [ (1/n) Σ_{i=1}^{n} (u_i^j e^{−u_i²/2})/j! ] u^j
       = (e^{−u²/2}/(√(2π) h_n)) Σ_{j=0}^{∞} b_j u^j
       = lim_{m→∞} p_nm(x),

where we have defined

b_j = (1/n) Σ_{i=1}^{n} (u_i^j e^{−u_i²/2})/j!.

Thus we can express p_nm(x) as the m-term approximation

p_nm(x) = (e^{−u²/2}/(√(2π) h_n)) Σ_{j=0}^{m−1} b_j u^j.
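The expansion is easy to verify numerically. The sketch below compares the exact Gaussian-window Parzen estimate at a test point with its m-term approximation p_nm(x) for increasing m; the sample values and window width are arbitrary choices for the illustration.

xi = [0.2 0.5 0.9 1.3];          % arbitrary one-dimensional samples
hn = 1.0;                        % window width
x  = 0.7;                        % test point
n  = numel(xi);
u  = x/hn;  ui = xi/hn;

p_exact = sum(exp(-(x - xi).^2/(2*hn^2))) / (n*hn*sqrt(2*pi));

for m = [1 2 4 8 16]
    j  = 0:(m-1);
    bj = mean(ui(:).^j .* exp(-ui(:).^2/2), 1) ./ factorial(j);   % b_j coefficients
    pm = exp(-u^2/2)/(sqrt(2*pi)*hn) * sum(bj .* u.^j);           % m-term approximation
    fprintf('m = %2d: p_nm = %.6f (exact %.6f)\n', m, pm, p_exact);
end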

(b) If the n samples are extremely tightly clustered about u = u_o (i.e., u_i ≃ u_o for i = 1, ..., n), then the two-term approximation is

p_n2(x) = (e^{−u²/2}/(√(2π) h_n)) (b_o + b_1 u),

where

b_o = (1/n) Σ_{i=1}^{n} e^{−u_i²/2} ≃ e^{−u_o²/2}

and

b_1 = (1/n) Σ_{i=1}^{n} u_i e^{−u_i²/2} ≃ u_o e^{−u_o²/2}.

The peak of p_n2(x) is given by the solution of ∇_u p_n2 = 0, that is,

∇_u p_n2(x) = −u e^{−u²/2} (b_o + b_1 u)/(√(2π) h_n) + b_1 e^{−u²/2}/(√(2π) h_n) = 0.

This equation then gives

0 = −u(b_o + b_1 u) + b_1
  = −u(e^{−u_o²/2} + u_o e^{−u_o²/2} u) + u_o e^{−u_o²/2},

which reduces to

u² + u/u_o − 1 = 0.

Thus the peaks of p_n2(x) are the solutions of u² + u/u_o − 1 = 0.

(c) We have from part (b) that u is the solution to u² + u/u_o − 1 = 0, and thus the solution is

u = ( −1/u_o ± √(1/u_o² + 4) ) / 2.

In the case u_o ≪ 1, the square root can be approximated as √(1/u_o² + 4) ≃ (1/u_o) √(1 + 4 u_o²), and thus

u = ( −1 ± √(1 + 4 u_o²) ) / (2 u_o) ≃ ( −1 ± (1 + (1/2) 4 u_o²) ) / (2 u_o),

and this implies that one of the peaks of the probability is at

u ≃ ( −1 + (1 + 2 u_o²) ) / (2 u_o) = u_o.

In the case u_o ≫ 1, we have

u = ( −1/u_o ± √(1/u_o² + 4) ) / 2
  = ( −1/u_o ± 2 √(1/(4 u_o²) + 1) ) / 2
  ≃ ( −1/u_o ± 2 (1 + (1/2) · 1/(4 u_o²)) ) / 2.

Thus, one of the peaks occurs at

u ≃ ( −1/u_o + 2 (1 + 1/(8 u_o²)) ) / 2 ≃ 1.

(d) In these conditions, we find

p_n2(u) = (e^{−u²/2}/(√(2π) h_n)) (b_o + b_1 u),

where b_o ≃ e^{−u_o²/2} and b_1 ≃ u_o e^{−u_o²/2}, as shown in the figure. (The curves have been rescaled so that their structure is apparent.) Indeed, as our calculation above showed, for small u_o, p_n2(u) peaks near u = 0, and for large u_o, it peaks near u = 1.

[Figure: rescaled p_n2(u) versus u for u_o = 0.01, 1, and 10.]
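These limiting cases are easy to confirm numerically: solving u² + u/u_o − 1 = 0 for several values of u_o shows the positive root approaching u_o when u_o is small and approaching 1 when u_o is large, as in the sketch below.

for uo = [0.01 0.1 1 10 100]
    r = roots([1, 1/uo, -1]);          % coefficients of u^2 + u/uo - 1
    u_peak = max(r);                   % the positive root locates the peak
    fprintf('u_o = %6.2f : peak at u = %.4f\n', uo, u_peak);
end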


Computer Exercises

Section 4.2

1. Computer exercise not yet solved

Section 4.3

2. Computer exercise not yet solved

Section 4.4

3. Computer exercise not yet solved

Section 4.5

4. xxx

MATLAB program

x = [0.58 0.27 0.0055 0.53 0.47 0.69 0.55 0.61 0.49 0.054]';
mu = mean(x);
delta = std(x);
mu_arr = linspace(min(x), max(x), 100);
delta_arr = linspace(2*(min(x)-mu), 8*(max(x)-mu), 100);
[M, D] = meshgrid(mu_arr, delta_arr);
p_D_theta = zeros(size(M));
for i=1:size(M,2)
  for j=1:size(M,2)
    x_minus_mu = abs(x - M(i,j));
    if max(x_minus_mu) > D(i,j)
      p_D_theta(i,j) = 0;
    else
      % p(x_k | mu, delta) for a triangular density of half-width delta centered at mu
      a = (D(i,j) - x_minus_mu)/D(i,j)^2;
      p_D_theta(i,j) = prod(a);      % likelihood of the full sample
    end
  end
end
% rescale p(D|theta); the posterior over x is normalized again below
p_D_theta = p_D_theta./(sum(p_D_theta(:))/...
    ((mu_arr(2) - mu_arr(1))*(delta_arr(2) - delta_arr(1))));
p_theta_D = p_D_theta;
ind = find(p_D_theta > 0);
x_vect = linspace(min(x)-10*delta, max(x)+10*delta, 100);
post = zeros(size(x_vect));
for i=1:length(x_vect)
  for k=1:length(ind)
    if abs(x_vect(i) - M(ind(k))) < D(ind(k))
      p_x_theta = (D(ind(k)) - abs(x_vect(i) - M(ind(k))))./D(ind(k)).^2;
    else
      p_x_theta = 0;
    end
    % accumulate p(x | theta) p(theta | D) over grid points with nonzero posterior
    post(i) = post(i) + p_x_theta*p_theta_D(ind(k));
  end
end
post = post./sum(post);
figure;
plot(x_vect, post, '-b');
hold on;
plot([mu mu], [min(post) max(post)], '-r');   % mark the sample mean
plot(x, zeros(size(x)), '*m');                % mark the data points
grid;
hold off;

[[FIGURE TO BE INSERTED]]

5. Computer exercise not yet solved

Section 4.6

6. Computer exercise not yet solved
7. Computer exercise not yet solved
8. Computer exercise not yet solved

Section 4.7

9. Computer exercise not yet solved

Section 4.8

10. Computer exercise not yet solved

Section 4.9

11. Computer exercise not yet solved



Chapter 5

Linear discriminant functions

Problem Solutions Section 5.2 1. Consider the problem of linear discriminants with unimodal and multimodal distributions. (a) The figure shows two bimodal distributions that obey reflective symmetry about a vertical axis. Thus their densities have the same val ue along this vertical line, and as a result this is the Bayesian decision boundary, giving minimum classification error. R1

R2

ω1

ω2

(b) Of course, if two unimodal distributions overlap significantly, the Bayes error will be high. As shown Figs 2.14(rather and 2.5than in Chapter the (unimodal) Gaussian case generally has inquadratic linear)2, Bayes decision boundaries. Moreover, the two unimodal distributions shown in the figure below would be better classified using a curved boundary than the straight one shown. (c) Suppose we have Gaussian s of different variances, σ12 = σ 22 and for definiteness (and without loss of generality) σ2 > σ1 . Then at large distances from th e means, the density P (ω2 x) > P (ω1 x). Moreover, the position of the mean of ω1 , that is, µ1 , will always be categorized as ω1 . No (single) straight line can separate the position of µ1 from all distant points; thus, the optimal decision



|

|

177

CHAPTER 5. LINEAR DISCRIMINANT FUNCTIONS

178

R1

R2

ω1 ω2

boundary cannot be a straight line. In fact, for the case shown, the optimal boundary is a circle. R2

R1

µ1

ω2

ω1 µ2

R2

2. Consider a linear machine for c categories with discriminant functions gi (x) = wit x + ωi0 for i = 1,...,c . Our goal is to show that the decision region s are convex, that is, if x(1) and x(0) are in ωi , then λx(1) + (1 λ)x(0) is also in ωi , for all 0 λ 1. The fact that x(1) and x(0) are in ω i implies



≤ ≤

max gj (x(1)) = g i (x(1)) = w it x(1) + wi0 j

and max gj (x(0)) = g i (x(0)) = w it x(0) + wi0 . j

For any j and for 0 wjt [λx(1) + (1

≤ λ ≤ 1, the above imply

− λ)x(0)] + w

j0

=

λ[wjt x(1) + wj0 ] + (1



λ[wit x(1) +

wi0

We next determine the category of a point between maximum discriminant function:



max wjt [λx(1) + (1 j

− λ)x(0)] + w

j0



= = =

t j t i

− λ)[w x(0) + w ] ] + (1 − λ)[w x(0) + w ]. j0

i0

x(0) and x(1) by finding the

λ max[wj t x(1) + wj0 ] + (1 j

t

t

− λ)max [wj x(1) + w j

t

λ[w x(1) + w ] + (1 λ)[w x(1) + w ] i i0 i i0 wit [λx(1) + (1 λ)x(0)] + wi0 .







This shows that the point λx(1)+(1 λ)x(0) is in ω i . Thus any point between x(0) and x(1) is also in category ωi ; and since this is true for any two points in ωi , the region is convex. Furthermore, it is true for any category ω j — all that is required in the proof is that x(1) and x(0) be in the category region of interest, ω j . Thus every decision region is convex. 3. A counterexample will serve as proof that the decision regions need not be convex. In the example below, there are c = 5 classes and 2c = 10 hyperplanes separating



j0 ]

PROBLEM SOLUTIONS

179

the pairs of classes; here H ij means the hyperplane separating class ω i from class ω j . To show the non-convexity of the solution retion for class ω 1 , we take points A and B, as shown in the figure. H15

H14 H24

ω4

ω3

H13

B

ω1 H12 H35

C

ω2

ω5 A

H25

H H23 34

H45

The table b elow summarizes the voting result s. Each row lists a hyperplane H ij , used to disginguish category ω i from ω j . Each entry under a point states the category decision according to that hyperplane. For instance, hyperplane H12 would place A in ω2 , point B in ω 1 , and point C in ω 2 . The underlining indicates the winning votes in the full classifier system. Thus point A gets three ω 1 votes, only one ω 2 vote, two ω 3 votes, and so forth. The bottom row shows the final voting result of the full classifier system. Thus points A and B are classified as category ω1 , while point C (on the line between A and B) is category ω 2 . Thus the solution region for ω 1 is not convex. hyperplane H12 H13 H14 H15 H23 H24 H25 H34 H35 H45 Vote result:

A ω2 ω1 ω1 ω1 ω3 ω2 ω5 ω3 ω5 ω4 ω1

B ω1 ω1 ω1 ω1 ω2 ω2 ω2 ω4 ω3 ω4 ω1

C ω2 ω3 ω1 ω1 ω2 ω2 ω2 ω3 ω5 ω4 ω2

4. Consider the hyperplane used for discriminant functions. 2

 −x 

(a) In order to minim ize x

a

subject to the constraint g(x) = 0, we form the

objective function

 −x 

f (x, λ) = x

a

2

+ 2λ[g(x)],

where λ is a Lagrange undetermined multiplier; the point varies on the hyperplane. We expand f ( ) and find

xa is fixed while x

·

f (x, λ) = x xa 2 + 2λ[wt x + w0 ] = (x xa )t (x xa ) + 2λ(wt x + w0 ) = xt x 2xt xa + xta xa + 2λ(xt w + w0 ).

 −  − − −

CHAPTER 5. LINEAR DISCRIMINANT FUNCTIONS

180

We set the derivatives of f (x, λ) to zero and find ∂f (x, λ) ∂x ∂f (x, λ) ∂λ

= x

−x

a

+ λw = 0

= w t x + w0 = 0,

and these equations imply x = xa and wt x + w0

− λw,

= wt (xa λw) + w0 = wt xa + w0 λwt w = 0.





These can be solved for λ as λwt w = w t xa + w0



or (so long as w = 0) λ=

w t x a + w0 . wt w

The value of x is then

 − −

x = xa λw xa = xa

wt xa +w0 wt w





w

if w = 0 if w = 0.

The minimum value of the distance is obtained by substituting the value of into the distance:

x − x 

=

a

= =

 −   −       | |  | | w t x a + w0 w wt w

xa

w t x a + w0 wt w

w

xa

x

 

g(xa ) w g(xa ) = . w 2 w

 

 

(b) From part (a) we have that the minimum dista nce from the hyperplane to the point x is attained uniquely at the point x = x a λw on the hyperplane. Thus the projection of x a onto the hyperplane is



xo

=

xa

=

xa

− λw ) − g(x w, w a 2

where λ=

w t x a + w0 g(xa ) = . wt w w 2

 

PROBLEM SOLUTIONS

181

5. Consider the three category linear machine with discriminant functions wit x wio , for i = 1, 2, 3.

gi (x) =



(a) The decision boundari es are obtained when wit x = w jt x or equivalently (wi wj )t x = 0. Thus, the relevant decision boundaries are the lines thro ugh the srcin that are orthogonal to the vectors w1 w2 , w1 w3 and w 2 w3 , and illustrated by the lines marked ω i ωj in the figure.







x2

ω

ω

2

1

3



ω

4

ω

ω



2







w1

1

w2

ω 3

2 w3

-4

-2

2

x1

4

-2

-4

(b) If a constant vector c is added to the weight vectors, w 1 , w2 and w 3 , then the decision boundaries are given by [( wi + c) (wj + c)]t x = (wi wj )t x = 0, just as in part (a); thus the decision boundaries do not change. Of course, the triangle formed by joining the heads of w1 , w2 , and w3 changes, as shown in





the figure. x2

ω 2

1

4

ω

ω

3



ω

2





ω

w1

1

ω

+

3

w2

c

w1

2 w3

w2 + c

-4

-2

2

w3 + c

4

x1

-2 c

-4

6. We first show that totally linearly separable samples must be linearly separable. For samples that are totally linearly separable, for each i there exists a hyperplane gi (x) = 0 that separates ω i samples from the rest, that is, gi (x) 0 for x ωi gi (x) < 0 for x ωi .



∈ ∈

The set of discriminant functions g i (x) form a linear machine that classifies correctly; the samples are linearly separable and each category can be separated from each of the others. We shall show that by an example that, conversely, linearly separable samples need not be totally linearly separable . Consider three categorie s in two dimension s

CHAPTER 5. LINEAR DISCRIMINANT FUNCTIONS

182 specified by

ω1 ω2 ω3

{x : x < −1} {x : −1 ≤ x < 1} {x : x > 1}.

= = =

1

1

1

Clearly, these categories are linearly separable by vertical lines, but not totally linearly separable since there can be no linear function g 2 (x) that separates the category ω2 samples from the rest. 7. Consider the four categories in two dimensions corresponding to the quadrants of the Euclidean plane: ω1 ω2 ω3 ω4

{x : x {x : x {x : x {x : x

= = = =

} } } }

> 0, x2 > 0 > 0, x2 < 0 1 < 0, x2 < 0 1 < 0, x2 > 0 . 1 1

Clearly, these are pairwise linearly separable but not totally linearly separable. 8. We employ proof by contra diction. Suppose there were two distinct minimum length solution vectors a1 and a2 with at1 y > b and at2 y > b. Then necessarily we would have a1 = a2 (otherwise the longer of the two vectors would not be a minimum length solution). Next consider the average vector ao = 12 (a1 + a 2 ). We note that

   

1 1 t 1 t t t ao yi = 2 (a1 + a2 ) yi = 2 a1 yi + 2 a2 yi and thus a o is indeed a solution vector. Its length is

a  = 1/2(a o

1





+ a2 ) = 1/2 a1 + a2

≥ b,

 ≤ 1/2(a  + a ) = a  = a , 1

2

1

2

where we used the triangle inequality for the Euclidean metric . Thus a o is a solution a1 = a2 . But by our hypothesis, a 1 and a 2 are minimum vector such that ao length solution vectors. Thus we must have ao = a1 = a2 , and thus

 ≤   

1 a1 + a2 2





=

      a  = a . 1

2

We square both sides of this equation and find 1 a1 + a2 4



2

 = a  1

2

or 1( a 1 4

2

  + a  2

2

+ 2at1 a2 ) = a1 2 .

 

We regroup and find 0 = =

2

2

t 1 2

a  + a  − 2a a a − a  , 1 1

2

2

2

and thus a1 = a2 , contradicting our hypothe sis. Therefore, the minimum -length solution vector is unique.

PROBLEM SOLUTIONS

183

S

{

}

S

{

}

9. Let the two sets of vectors be 1 = x1 ,..., xn and 2 = y1 ,..., ym . We assume S1 and S2 are linearly seperable, that is, there exists a linear discriminant function g(x) = w t x + w0 such that

∈S ∈S .

g(x) > 0 implies x g(x) < 0 implies x

and

1 2

S , or

Consider a point x in the convex hull of

1

n

x=



αi x i ,

i=1

where the αi ’s are non-negative and sum to 1. The discriminant function evaluated at x is g(x) =

w t x + w0

   n

=

w

t

αi xi

+ w0

i=1

n

=

αi (wt xi + w0 )

i=1

> 0,

where we used the fact that w t xi + w0 > 0 for 1

≤ i ≤ n and

n

αi = 1. i=1

Now let us assume that our point x is also in the convex hull of S 2 , or



m

x=



βj yj ,

j=1

where the β j ’s are non-negative and sum to 1. We follow the approach immediately above and find g(x) =

w t x + w0

      m

=

w

t

βj yj

+ w0

j=1

m

=

βj (w t yj + w0 )

j=1

g(yj ) 0 and g(x) < 0, and hence clearly the inter section is empty. In short, either two sets of vectors are either linearly separable or their convex hulls intersects. 10. Consider a piecewise linear machine.

S

(a) The discriminiant functions have the form gi (x) = max

j=1,...,ni

gij (x),

184

CHAPTER 5. LINEAR DISCRIMINANT FUNCTIONS where the components are linear t gij (x) = w ij x + wij0 ,

and the decision rule is to classify x in category ω i , if max gk (x) = g i (x), k

for j = 1,...,c . We can expand this discriminant function as max gk (x) = max k

k

max gkj (x),

j=1,...,nk

t x+wkjo , are linear functions. Thus our decision rule implies where gkj (x) = w kj

max max gkj (x) = max gij (x) = g ij(i) (x). k

≤≤

j=1,...,nk

1 j ni

Therefore, the piecewise linear machine can be viewed as a linear machine for classifying subclasses of patterns, as follows: Classify x in category ω i if max max gkj (x) = g ij (x). j=1,...,nk

k

(b) Consider the following two categor ies in one dimension: ω1 ω2

= =

{x : |x| > 1}, { x : | x| < 1 }

which clearly are not linearly separable. However, a classifier that is a piecewise linear machine with the following discriminant functions



g11 (x) = 1 x g12 (x) = 1+ x g1 (x) = max g1j (x) j=1,2

g2 (x) = 2 can indeed classify the patterns, as shown in the figure. g(x) 3

g11(x)

g12(x) 2

g2(x)

1

-2

-1

1

2

x

-1

11. We denote the number of non-zero components in a vector by components of x be either 0 or 1 and the categories given by ω1 ω2

= =

{x : b is odd } {x : b is even }.

b. We let th e d

PROBLEM SOLUTIONS

185

(a) Our goal is to show that ω 1 and ω 2 are not linearly separable if d > 1. We prove this by contradiction. Suppose that there exists a linear discrim inant function g(x) = w t x + w0 such that



g(x) 0 for x g(x) < 0 for x

∈ ω and ω . ∈ 1 2

Consider the unique point having b = 0, i.e., x = 0 = (0, 0,..., 0)t . This pattern is in ω2 and hence clearly g (0) = w t 0 + w0 = w 0 < 0. Next consider patterns in ω 1 having b = 1. Such patterns are of the general form x = (0,..., 0, 1, 0,..., 0), and thus g(x) = w i + w0 > 0 for any i = 1,...,d . Next consider patterns with b = 2, for instance x = (0,..., 0, 1, 0,..., 0, 1, 0,..., 0). Because such patterns are in ω 2 we have g(x) = w i + wj + w0 < 0



for i = j. We summarize our results up to here as three equations: w0 wi + w0 wi + wj + w0

< 0 > 0 < 0,



for i = j. The first two of these imply

      − 

(wi + w0 ) + (wj + w0 ) + ( w0 ) > 0, >0

>0

>0

or

wi + wj + w0

> 0,

which contradicts the third equation. Thus our premise is false, and we have proven that ω 1 , ω2 are not linearly separable if d > 1. (b) Consider the discriminant functions g1 (x) = max g1j (x) and j

g2 (x) = max g2j (x), j

where gij (x) = αij (1,..., 1)t x + wijo = αij b + wijo ,

CHAPTER 5. LINEAR DISCRIMINANT FUNCTIONS

186

and, as in part (a), we let We can set

b denote the number of non-zero components of α1j α2j wijo w2jo

for j = 0, 1,...,d

= = = =

x.

j + 1/2 j j2 j j 2,

− − − 1/4 −

+ 1 and hence write our discriminant function as g1 (x) = (j + 1/2)b g2 (x) = jb j 2 .



2

− j − j − 1/4

We now verify that these discriminant functions indeed solve the problem. Suppose x is in ω 1 ; then the number of non-zero components of x is b = 2m + 1, 0 m (d 1)/2 for m an integer. The discriminant function is



≤ −

g1j (x) = = = =

(j + 1/2)(2 m + 1) j 2 j 1/4 j(2m + 1) j 2 j + 1/2(2m + 1) j(2m) j 2 + 1/2(2m + 1) 1/4 j(2m j) + 1/2(2m + 1) 1/4.

− −

− − −

− −

− −

− 1/4

For patterns in ω 2 , we have b = 2m, and thus g2j (x) = =

− −

j(2m + 1) j 2 j(2m + 1 j).



It is a simple matter to verify that j (2m j) is maximized at j = m and that j(2m + 1 j) is maximized at j = m + 1/2. But note that j is an integer and the maximum occurs at j = m and j = m + 1. Thus we have



g1 (x) = max g1j (x) j

− j)] + 1/2(2m + 1) − 1/4 m(2m − m) + m + 1/2 − 1/4

= max [j(2m j

= = m2 + m + 1/4 = m(m + 1) + 1/4, g2 (x) = max g2j (x)

and

j

= max j(2m + 1 j

=

m(m + 1).

j)





Thus if x ω1 and the number of non-zero components of x is b = 2m + 1 (i.e., odd), we have g1 (x) = m(m + 1) + 1/4 > m(m + 1) = g 2 (x), that is, g1 (x) > g2 (x), as desired. Conversely, if x is in ω2 , then the number of non-zero components of x is b = 2m, 0 m d/2 where m is an integer.

≤ ≤

PROBLEM SOLUTIONS

187

Following the above logic, we have g1j (x) = (j + 1/2)(2m) j 2 j 1/4 = j(2m 1) j 2 + m 1/4 = j(2m j 1) + m 1/4

− − − −

− − − − −

which is maximized at j = m. Likewise, we have 2

g2j (x) = =

−−

j(2m) j j(2m j)

which is maximized at j = m. Thus our discriminant functions are g1 (x) = max g1j (x) j

− 1 − j) + m − 1/4 m(m − 1) + m − 1/4 = m − 1/4, max g (x) = max j(2m − j) = m .

= max j(2m j

= g2 (x) =

2

2

2j

j

j

Indeed, if x is in ω2 and the number of non-zero components of x is b = 2m (i.e., even), we have m2

g1 (x) =

2

− 1/4 < g (x) = m , 2

or g 1 (x) < g2 (x), and hence our piecewise linear machine solves the problem. Section 5.3 12. Consider the quadratic discriminant function given by Eq. 4 in the text: d

g1 (x)

− g (x) = g(x) = w 2

0

+

d

d

  wi xi +

i=1

wij xi xj ,

i=1 i=1

which for convenience we write in vector and matrix form as g(x) = w 0 + wt x + xt Wx. In this two-category case, the decision boundary is the set of x for which g(x) = 0. If we did not have the wt x term, we could easily describe the boundary in the form xW0 x = k where k is some constant. We can, in fact, eliminate the w t x terms with a proper translat ion of axes. We seek a translation, described by a vector m, that eliminates the linear term. Thus we have g(x

− m)

= =

w0 + wt (x m) + (x m)t W(x m) (w0 w t m + mt Wm) + (wt mt W t











(∗) − mW)x + x Wx, t

where we used the fact that W is symmetric (i.e., Wij = Wji ) and that taking the transpose of a scalar does not change its value. We can solve for m by setting the coefficients of x to zero, that is, 0 = wt

t

t

t

− m W − mW = w − 2m W.

CHAPTER 5. LINEAR DISCRIMINANT FUNCTIONS

188

This implies m t W = w t /2 or m = W −1 wt /2. Now we substitute this translation vector into the right-hand side of ( g(x

− m)

∗) and find

= w 0 − w (W −1 w/2) + wt W −1 /2 · W · W −1 w/2 + xt Wx = w 0 − wt W−1 w/4 + xt Wx. t

The decision boundary is given by g(x

− m) = 0, and then ( ∗∗) implies

W

t

x w t W −1 w and thus the matrix

∗∗

( )

t

− 4w

0x

= x Wx = 1/4,

W w t W −1 w

W=

− 4w

0

shows the basic character of the decision boundary. (a) If W = I, the identity matrix, then x t Ix = k/4 is the equation of the decision d

boundary, where k is some constant. Then we have

√ a hypersphere of radius k/2.



i=1

x2i = k/4, which defines

(b) Since W is non-singular and symmetric, we can write W in terms of ΦΛΦt , where Φ is the matrix formed by the (linearly independent) eigenvectors of W and Λ is the diagonal matrix formed by the eigenvalues of W. In this way, xt Wx = 1/4 can be written as x t ΦΛΦt x = 1/4. We let y t = x t Φ; then we get yt Λy = 1/4, or λi y 2 = 1/4 where the eigenvalues λ i > 0 when W is positive i i for non-pathological data distribu tions. Because the definite, which it will be eigenvalues are positive and not necessarily all equal to each other, in this case the decision boundary is a hyperellipsoid.



(c) With an argument following that of part (b), we can represent the characteristics d

of the decision boundary with



i=1

λi yi2 = 1/4. Since, in this cas e we have at

least one λ i < 0, the geometric shape defined is a hyperhyperboloid. (d) Our matrix and vector in this case are W=



=

−6 −2 1 −−62 31 −11 · 4 ;

Thus we have

1 2 2 5 0 1

 − 0 1 3

;

w=

  − 5 2 3

.

16

1

W− and



W W= = 82.75

{





0.0121 0.0242 0.0242 0.0604 0 0.0121

} {

1

t

w W− w = 82.75

0 0 .0121 0.0363







}

The eigenvalues are λ1 , λ2 , λ3 = 0.0026, 0.0716, 0.0379 . Because there is a negative eigenvalue, the decision boundary is a hyperhyperboloid.

PROBLEM SOLUTIONS

189

(e) Our matrix and ve ctor in this case are W= Thus we have W −1 = and



0.3077 0.4231 0.1538

 −

1 2 2 0 3 4

3 4 5

0 .4231

 −

;

w=

0 .1538 0 .0385 0.0769

−0.2692





0 .0385

 − − −

  −

;

2 1 3

.

wt W −1 w =

−2.2692



−0.8814 −1.3220 0 −1.7627 W= − −1.7627 2 .2034 The eigenvalues are {λ , λ , λ } = {0.6091, −2.1869, 3.3405}. Because of a negW = 2.2692 1

2

0.4407 0.8814 1.3220

3

tative eigenvalue, the decision boundary is a hyperhyperboloid. Section 5.4

13. We use a second-order Taylor series expansion of the criterion function at the point a(k): J (a)

 J (a(k)) + ∇J (a(k))(a − a(k)) + 12 (a − a(k)) H(a(k))(a − a(k)), t

t

where ∇J (a(k)) is the derivative and H(a(k)) the Hessian matrix evaluated at the current point. The update rule is a(k + 1) = a(k)

− η(k)∇J (a(k)).

We use the expansion to write the criterion function as J (a(k + 1)

 J (a(k)) + ∇J (a(k)[η(k)∇J (a(k)] + 12 [η(k)∇J (a(k))]H(a(k))[η(k)∇J (a(k))]. t

t

We minimize this with respect to η (k) and find 0=

t

t

−∇J (a(k))∇J (a(k)) + η(k)∇J (a(k))H(a(k))∇J (a(k))

which has solution η(k) =

J (a(k))

2

∇J t (a(k))H(a(k))∇J (a(k))

.

We must, however, verify that this is indeed a minimum, and that calculate the second derivative ∂2J = ( ∇J )t H(∇J ). ∂η 2

η(k) > 0. We

CHAPTER 5. LINEAR DISCRIMINANT FUNCTIONS

190

If a is in a neighborhood of a local minimum of J (a), the H will be positive definite and hence ( ∇J )t H(∇J ) > 0. Thus J indeed attains minima at the optimal η(k) and η(k) > 0 because ( ∇J )t H∇J > 0. 14. We are given samples from ω 1 :

    −  −   −    1 , 5

and from ω 2 :

2

−3

2 , 9

5 3

1

,

0 . 2

,

−4

Augmenting the samples with an extra dimension, and inverting the sign of the samples from ω 2 define

Y=

  −  −−

1 1 1 1 1 1

1 2 5 2 1 0

− −

     −       − 5 9 3 3 4 2

b b b b b b

a=

a1 a2 a3

= (Ya

− b)2(Ya − b) .

b=

.

Then the sum-of-squares error criterion is n

Js (a) = 12

t

 −   at y i

b

2

i=1

(a) We differentiate the criterion function and find ∇Js (a) =

and

t

Y (Ya



6a1 a1 6a1

− b) = −

H = Y tY =

+ +

 −

a2 + 6a3 35a2 + 36a3 36a2 + 144 a3



6 1 6 1 35 36 6 36 144



+ +



0 3b 16b



.

Notice that for this criterion, the Hessian is not a function of a quadratic form throughout.

a since we assume

(b) The optimal step size varies with a, that is, η=

[∇Js (a)]t ∇Js (a) . [∇Js (a)]t H∇Js (a)

The range of optimal step sizes, however, will vary from the inverse of the largest eigenvalue of H to the inverse of the smallest eigenvalue of H. By solving the characteristic equation or using singular value decomposition, the eigenvalues of H are found to be 5.417, 24.57, and 155.0, so 0 .006452 η 0.1846.

≤ ≤

PROBLEM SOLUTIONS

191

Section 5.5 15. We consider the Perceptron procedure (Algorithm 5.3 and Theorem 5.1 in the text). (a) Equation 20 in the text gives the update rule a(1) = arbitrary a(k + 1) = a(k) + yk ,

k

1.

≥ It has been shown in the convergence proof in the text that a(k + 1) − αˆa ≤ a(k) − αˆa − 2αν + β , 2

2

2

ˆ is a solution vector, α is a scale factor and β 2 where a(k) is the kth iterate, a and ν are specified by Eqs. 21 and 22 in the text: β2

2

 

= max yi i

ˆ t yi > 0. ν = min a i

Thus we have 2

2

2

a(k + 1) − αˆa ≤ a(k) − αˆa − (2αν − β ). Because α > β 2 /ν , we have the following inequality: 2αν



− 

Thus a(k) αˆ a and this implies

2

2

2

− β 2β − β

2

= β 2 > 0.



decreases (strictly) by an amount (2αν β 2 ) at each iteration, 2

2

2

a(k + 1) − αˆa ≤ a(1) − αˆa − k(2αν − β) . We denote the maximum number of correction convergence by k o , we have 0

2

2

2

≤ a(k) − αˆa ≤ a(1) − αˆa − k(2αν − β ),

and thus 2

k

− αˆa ≤ a(1) 2αν − β

,

2

and thus the maximum number of corrections is ko =

a(1) − αˆa 2αν − β

2

.

2

(b) If we start out with a zero weigh t vector, that is, a(1) = 0, we have 2

ko

=

a(1) − αˆa 2αν − β αˆa 2αν − β ˆ α a 2αν − β 2

2

=

2

2

=

2

2

2

=

α 2αν β 2



CHAPTER 5. LINEAR DISCRIMINANT FUNCTIONS

192

In order to find the maximum of k 0 with respect to α, we set



β 2 )2α

2

α2

(2αν

to 0,

2 2ν

  − aˆ −   −   − 2αβ aˆ

ˆ a (2αν β 2 )2 ˆ 2 2ν a ˆ 2) α2 (4ν a 0,

∂ ko (α) = ∂α = =

∂ ∂α ko (α)

2

2

and this implies



2

2

2

  − 2β aˆ

ˆ α α2ν a

2



= 0.

We note that α > β /ν > 0, which minimizes k o if a(1) = 0. We assume b and therefore

≥ 0,

ν = min [ˆ at y i ] > b

≥ 0, and ν > 0. We next use the fact that 0 < η ≤ η ≤ η < ∞, and k ≥ 1 to find a(k + 1) − αˆa ≤ a(k) − αˆa + η β + 2η b − 2η αν. i

a

2

k

2

b

2 2 b

b

a

We choose the scale factor α to be α=

ηb2 β 2 + 2ηb b ηa ν

and find

a(k + 1) − αˆa ≤ a(k) − αˆa − (η β 2

η b2 β 2

2

b2 2

+ 2ηb b),

where + 2ηb b > 0. This means the difference between the weight vector at a obeys: iteration k + 1 and αˆ 2

2

2 2 b

a(k + 1) − αˆa ≤ a(1) − αˆa − k(η β

+ 2ηb b).

Since a squared distance cannot become negative, it follows that the procedure must converge after almost k o iterations where ko =

a(1) − αˆa ηb2 β 2

2

+ 2ηb b

.

In the case b < 0 there is no guarantee that ν > 0 and thus there is no guarantee of convergence of the algorithm. 16. Our algorithm is given by Eq. 27 in the text: a(ka(1) + 1) == arbitrary a(k) + ηk yk where in this case at (k)yk b, for all k. The ηk ’s are bounded by 0 < ηa ηk ηb < , for k 1. We shall modify the proof in the text to prove convergence given the above conditions. ˆ be a solution vector. Then a ˆ t yi > b, for all i. For a scale factor α we have Let a







a(k + 1)

≤ ≤

− αˆa

= =

a(k) + ηk yk αˆ a (a(k) αˆ a) + ηk yk ,





PROBLEM SOLUTIONS

193

and thus

a(k + 1) − αˆa

2

k

2

t

As y was misclassified, a (k)y

k

k 2

k

2

2 k 2 k

2

k

k 2

(a(k) − αˆa) + η y  a(k) − αˆa + η y  a(k) − αˆa + η y  ≤ a(k) − αˆa + η y  = = =

+ 2ηk (a(k) αˆ a)t yk t k + 2ηk ak y 2ηk αˆ at y k t k 2 ay . + 2ηk b 2ηk αˆ

− −

k 2 k



b, for all k. Now we let

β≤= 2

max yi 2 , i

 

ν = min [ˆ at yi ] > b, i

ˆ is a solution vector. Thus we substitute into the above since a 2

a(k + 1) − αˆa ≤ a(k) − αˆa

2

+ ηk2 β 2 + 2ηk b

− 2η αν, k

and convergence is assured. 17. If y 1 , y2 ,..., yn are linearly separable then by definition there exists a separating hyperplane g(x) = w t x + w0 such that

∈ω ∈ω .

g(yi ) > 0 if y i g(yi ) < 0 if y i

1 2

We augment every y i by 1, that is, ˜i = y



1 , yi

i = 1,...,n

and augment the weight vector by appending a bias weight w 0 , written as ˜= w



w0 . w

Then our linear descriminant function can be written as ˜ ty ˜i w w ˜ ty ˜i

> 0 < 0

if yi if yi

∈ω ∈ω . 1 2

˜ i by 1 if y i ω2 , and thus there exists a vector w ˜ such that w ˜ ty ˜i > 0 We multiply y for all i. ˜ which requires We use the logic of the Perceptron convergence to find such an w at most k o operations for convergence, where





2

ko =

a(1) − αw˜  β2

,

˜ is the first iterate and where w β2 ν = max yi

α = β2

i

 

2

˜t y ˜ i > 0. ν = min a i

CHAPTER 5. LINEAR DISCRIMINANT FUNCTIONS

194

The remainder follows from the Perceptron convergence Theorem (Theorem 5.1) in the text. 18. Our criterion function is Jq (a) = y



(at y

∈Y (a)

2

− b) ,

t

Y (a) is the set of samples for which a y ≤ b. Suppose y is the only sample in Y (a(k)). Then we have J (a(k)) = ( a (k)y − b) = (a (k)y ) + b − 2ba (k)y

where

1

t

q

2

1

2

1

2

d

=

t

2

   −   −

1

d

+ b2

akj y1j

t

2b

j=1

akj y1j .

j=1

The derivative of the criterion function with respect to the components of ∂J q (a(k)) ∂a ki

a is

d

=

2

akj y1j

y1i

2by1i

j=1

=

2(at y1

− b)y

1i ,

for i = 1,...,d . Thus we have ∇Jq (a(k))

Dii

= = =

∂J q (a(k))

= 2(at y1

∂ ∂a ki

b)y1



∂ a(k) q (a(k)) ∂a ki ∂a ki

∂2J

   d

2

akj y1j

y1i

j=1

− 2by

1i

= 2y1i y1i



for i, i = 1,...,d . This implies D = 2y1 yit . From Eq. 34 in the text, we have that the basic gradient descent algorithm is given by a(1) = arbitrary at (k)y

b a(k + 1) =

a(k) + ηk

y

∈Y (a(k))

and thus

Y (a(k))



=

−y

y1 ,

2

{y }, 1

which implies a(k + 1) =

a(k) + ηk

b

t

− a (k)y y . y 2

1

PROBLEM SOLUTIONS

195

We also have at (k + 1)y1

t

− b = (1 − η )(a (k)y − b). k

1

If we choose η k = 1, then a(k +1) is moved exactly to the hyperplane a t y1 = b. Thus, if we optimally choose η k = 1, we have Eq. 35 in the text: t

− a (k)y y . y   19. We begin by following the central equation and proof of Perceptron Convergence a(k + 1) = a(k) +

b

1

2

given in the text: 2

a(k + 1) − αˆa = a(k) − αˆa

2

+ 2(a(k)

t

− αˆa) ηy

k

+ η 2 yk 2 ,

 

where yk is the kth training presen tation. (We suppress the k dependence of η for clarity.) An update occurs when y k is misclassified, that is, when a t (k)yk b. We define β 2 = max yi 2 and γ = min [ˆ at yi ] > b > 0. Then we can write i



 

i

2

2

a(k + 1) − αˆa ≤ a(k) − αˆa

+ 2(b

2 2

− αγ )η + η β ,

where α is a positive free parameter. We can define 1 α(k) = γ

 −



m 1

η(k) + b

k=1

at the mth update. Then we have



m 1 2

2

a(k + 1) − αˆa ≤ a(k) − αˆa − 2η(k)



η(l) + η 2 β 2 .

l=1

Summing all such equations for the mth update gives m

2

2

a(k + 1) − α(k)ˆa ≤ a(0) − α(0)ˆa − 2

k

m

   η(k)

k=1

η(l) +

l=1

η 2 (k)β 2 ,

k=1

or equivalently m

2

2

a(k + 1) − α(k)ˆa ≤ a(0) − α(0)ˆa − 2



m

η(k)η(l) + β 2



k,l=k

On the other hand, we are given that m

       lim

→∞

m

η2

k=1

2

m

=0

η(k)

k=1

and we note

2

m

η(k)

k=1

m

m

η 2 (k) + 2

=

k=1

(kl)

η(k)η(l).



k=1

η 2 (k).



( )

CHAPTER 5. LINEAR DISCRIMINANT FUNCTIONS

196 We also have lim

→∞

m

   m

1

η (k) + 2

2

m

k=1

η(k)

  −

m

2



m

η(k)η(l)

2

(kl)

k=1

η(k)η(l) = 0

(kl)

which implies

m

 −           2

lim

→∞

m

η(k)η(l)

(kl)

1

=0

2

m

η(k)

k=1

or

m

2

lim

η(k)η(l)

(kl)

→∞

m

∗∗

= 1. ( )

2

m

η(k)

k=1



Now we can reconsider ( ):

2

a(k + 1) − α(k)ˆa m

2 2

≤ a(0) − α(0)ˆa −

m

n

η 2 (k)

η(k)η(l)

2

      −    (kl)

η(k)

2

m

k=1

β2

2

n

η(k)

.

η(k)

k=1

∗∗

k=1

k=1

But we know from ( ) that the first term in the brackets goes to 1 and the coefficient of β 2 goes to 0, and all the η(k) terms (and their sum) are positiv e. Moreover, the M

corrections will never cease, since



η(k)

k=1

→ ∞, as long as

there are incorrectly

classified samples. But the distance term on the left cannot be negative, and thus we conclude that the correc tions must cease. This implies that all the samples will be classified correctly. Section 5.6 20. As shown in the figure, in this case the initial weight vector is marked 0, and after the successive updates 1, 2,... 12, where the updat es cease. The sequence of presentations is y 1 , y 2 , y 3 , y 1 , y 2 , y 3 , y 1 , y 3 , y 1 , y 3 , y 2 , and y 1 . Section 5.7 21. From Eq. 54 in the text we have the weight vector 1 w = αnS− w (m1

− m ), 2

where w satisfies



1 n 1 n2 Sw + (m1 n n2

− m )(m − m ) 2

1

2

t



w = m1

−m . 2

PROBLEM SOLUTIONS

197 a2 ||

2

|y /| b

y1

n tio lu n so gio e r

|y b/|

||

1

y3

12 910

9

11

a1

8 7 4

6

y2

5 1

3 2

b

0

/| |y 3

||

We substitute w into the above and find



1 n 1 n2 Sw + 2 (m1 n n

− m )(m − m ) 2

1

2

t



1 αnS− w (m1

−m )= m −m , 2

1

2

which in turn implies



α m1

−m

2

2 + n 1n (m1 n2

− m )(m − m ) S− (m − m ) 2

1

2

t

1

w

1



2

Thus we have αθ(m1

−m )= m −m 2

1

= m1

−m . 2

2

where θ= 1+

n 1 n2 (m1 n2

− m ) S− (m − m ). t

2

w

1

1

2

This equation is valid for all vectors m 1 and m 2 , and therefore we have αθ = 1, or



α= 1+

n 1 n2 (m1 n2



− m ) S− (m − m ) − 2

t

w

1

1

22. We define the discriminant function



| −

2

1

.



|

(at y + (λ12

−λ

g0 (x) = (λ21 λ11 )P (ω1 x) (λ12 λ22 )P (ω2 x). Our goal is to show that the vector a that minimizes Js (a) =



y

∈Y

1

(at y

− (λ − λ 21

2 11 ))

+



y

∈Y

2

is asymptotically equivalent to the vector a that minimizes ε2 =



[at y

2

− g (x)] p(x)dx, o

2 22 ))

CHAPTER 5. LINEAR DISCRIMINANT FUNCTIONS

198

as given in Eq. 57 in the text. We proceed by calculating the criterion function

 − −      − − 

Js (a) =

at y

y

λ11 )

2

+

∈Y

y

1

n1 1 n n n

=

(λ21

at y

y

(λ21

at y

∈Y

2

λ11 )

2

∈Y

1

− (λ − λ 12

n2 1 + n n



22 )

2



at y + (λ12

y

∈Y

2

−λ

22 )

  2

.

→∞

By the law of large numbers, as n we have with probability 1 1 lim Js (a) = J  (a) n→∞ n = P (ω1 ) 1 (at y (λ21 λ11 ))2 + P (ω2 ) 2 (at y + (λ12



E







We expand the terms involving expectations as

E E

1

2

 

(at y

     − −  −    − −    −     − −  −   21

2 11 ))

=

21

2 11 ))

=

− (λ − λ (a y + (λ − λ t

E

at y

− (λ − λ (a y + (λ − λ

11 )

21

t

 

2

2 22 ))



.

|

p(x ω1 )dx,

2 22 )) p(x

12

−λ

|ω )dx. 2

We substitute these into the above equations and find that our criterion function is J  (a) =

at y

at y

at y

2

+(λ21

=

(λ21

2

λ11 )

2

λ22 )

|

p(x ω1 )dx 2

|

p(x ω2 )dx

p(x, ω1 )dx 2

p(x, ω2 )dx

[p(x, ω1 ) + p(x, ω2 )]dx

at y [(λ21 λ11 )2

λ11 )p(x, ω1 ) + (λ12

p(x, ω1 )dx + (λ12

(at y)2 p(x)dx + 2

+(λ21

2

λ22 )

at y + (λ12

+

=

λ11 )

at y + (λ12

+

=

(λ21

−λ

22 )p(x, ω2 )] dx

2 22 )

−λ



p(x, ω2 )dx

at yg0 (x)p(x)dx

2 11 ) P (ω1 ) + (λ12

−λ

−λ

22 )

2

P (ω2 ),

where we have used the fact that p(x) = p(x, ω1 ) + p(x, ω2 ) and (λ21

−λ

11 )p(x, ω1 ) + (λ12

−λ

22 )p(x, ω2 )



|

= (λ21 λ11 )p(x ω1 ) + (λ12 = g0 (x)p(x).

−λ

Furthermore, the probability of finding a particular category ω i is simply P (ωi ) =



p(x, ωi )dx,

|ω )

22 )p(x

2

PROBLEM SOLUTIONS

199

and thus our criterior function can be written J  (a) =

  

at y

+



2

− g (x) p(x)dx [(λ − λ ) P (ω ) + (λ − λ o

21

11

2

1

12

22 )

2

2 o

− g (x)]p(x)dx .

P (ω2 )





independent of a

Since the second term in the above expression is independent of with respect to a is equivalent to minimizing ε2 =



[at y

a, minimizing J (a)



2

− g (x)] p(x)dx. 0

In summary, the vector a that minimizes Js (a) provides asymptotically a minimum mean-square error approximation to the Bayes discriminant function g 0 (x). 23. Given Eqs. 66 and 67 in the text, a(k + 1) = a(k) + η(k)[θ(k) ˆ = a [yyt ]−1 [θy]

E

E

t

− a (k)y ]y k

k

we need to show

→∞ E [a(k) − aˆ

lim

2

n

] =0

given the two conditions η(k) =



η 2 (k)
M for some finite M , then we will have proved our needed result. This is because that the update rule 2

E

whereas



2

− ˆa ] for k > R for some finite R since η (k) is finite ∞ ˆ  ] ≥ 0, and thus η(k) is infinite. We know that E [a(k + 1) − a ˆ ] lim E [a(k + 1) − a →∞

must reduce [ a(k + 1)



k=1

2

k=1

2

k

CHAPTER 5. LINEAR DISCRIMINANT FUNCTIONS

200

must vanish. Now we seek to show that indeed  3 < 0. We find 3

t

t

E [(a(k) − aˆ) (θ(k) − a (k)y)y ] E [a (k)θ(k)y ] − E [(a (k)y ) ] − E [θ(k)ˆa y ] + E [ˆa y a y ] E [θ(k)a (k)y ] + E [ˆa y a y ] − E [(a (k)y ) ] − E [(a y ) ] −E [(a y ) + (ˆa y ) − 2a (k)y aˆ y ] −E [(a (k)y − aˆ y ) ] −E [(a (k)y − aˆ y ) ] ≤ 0.

= = = =

t

k

t

t t t

= =

t

k

k

2

k

t

t

k

t

k

2

t

2

k

t

k

t

t

k

k

t

k 2

t

k

t

k

k

2

k

t k 2 t 2 k

k k

In the above, we used the fact that y1 , y2 ,..., yk is determined, so we can consider at (k) as a non-random variable. This ena bles us to write [θ(k)at (k)yk ] as ˆ ]. [at (k)yk ykt a Thus we have proved our claim that  3 < 0, and hence the full proof is complete. 24. Our criterion function here is

E

E

Jm (a) = [(at y

E

2

− z) ].

(a) We expand this criterion function as t

2

E [(a y − z) ] E [{(a y − g (x)) − (z − g (x))} ]

Jm (a) = = = =

t

o

E

[(a y [(at y



2

o

2

t

go (x)) 2(a y go (x))(z go (x)) + (z go (x))2 ] 2 go (x)) ] 2 [(at y go (x))(z go (x))] + [(z go (x))2 ]. t

−− E − −

−−

−E −

E [z|x] = g (x) to note that E [(a y − g (x))(z − g (x))] = E [(a y(x) − g (x))(z − g (x))].

(b) We use the fact that t

o

o

t

o

o

o

This leads to t

t

E {E [(a y − g (x))(z − g (x))|x]} = E{(a y − g (x))E [(z − g (x))|x]}, since conditioned on x, a y(x) − g (x) is a constant. Thus we have E {(a y − g (x))[E (z|x) − g (x)]} = 0 as E [z |x] = g (x). We put these results together to get J (a) = E [(a y − g (x)) ] + E [(z − g (x)) ], o

o

t

o

o

o

t

o

o

o

m

t

2

o

o

2

   independent of a

where the second term in the expression does not depend on a that minimizes J m also minimizes [(at y go (x))2 ].

E

25. We are given that

−1 = η −1 + y 2 . ηk+1 k k



a. Thus the vector

PROBLEM SOLUTIONS

201

(a) We solve for η k−1 as ηk−1

ηk−−11 + yk2−1

=

ηk−−12 + yk2−2 + yk2−1

=

η1−1 + y12 + y22 + . . . + yk2−1

=



k 1

1 + η1

i=1

=

yi2 ,

η1



and thus ηk

=

η1



k 1



1 + η1

i=1

for η 1 > 0 and 0 < a

2 i

≤ y ≤ b.

yi2

(b) With these range s we have



k 1

a(k

− 1) ≤

which implies ηk



yi2

i=1

≤ b(k − 1), η1

=



k 1

1 + η1

yi2

i=1 η1

−

≤ ≥

1 + a(k η1 1 + b(k

1)

− 1) .

This is true for each k, so we can sum to find η1 ηk 1 + b(k 1)

 ≥  ∞ k

as

− →∞

k



k=1

1 = k

and b > 0.

Moreover, we have η12

ηk2 k

as

L1
0.

Consequently, the learning rates obey ηk

k

and

k

ηk2

→ L < ∞.



CHAPTER 5. LINEAR DISCRIMINANT FUNCTIONS

202

26. We rewrite the LMS rule using the following notation:

Y=

 

 

y1t y2t .. . ynt

 

b=

and

Then the LMS rule before Eq. 61 in the text is a(1)

 

b1 b2 .. . bn

.

arbitrary n

a(k + 1) =

a(k) + η(k)



t k i

− a y ).

yi (bi

i=1

ˆa now reads

Note that the condition we are looking for the limiting vector n



ˆ yi (yit a

i=1

− b ) = 0, i

or equivalently n



n

yi (ˆ at y i ) =

i=1



yi bi .

i=1

Now we consider how the distance between a(k) and ˆa changes during an update:

a(k + 1) − aˆ

2

a(k) − aˆ

=

2

+

  −    −  

at (k)yi )

yi (bi

i=1

Ck

2η(1) + (a(k) k



t

at (k)yi )

yi (bi

i=1

Dk

a(k) − aˆ

=

n

− aˆ)

  2

n

η 2 (1) k2

2

η 2 (1) 2η(1) + Ck + Dk , k2 k

where for convenience we have defined C k and D k . Clearly C k non-negative values. Consider now D k , which we expand as

≥ 0 as it is the sum of

n t

t

t

t

t

t

−a (k)y a (k)y + a (k)y b + ˆa y a (k)y − aˆ y b

Dk =

i

i

i i

i

i

i i

i=1

   −  − n

We can substitute

n

n

Dk

=

(at (k)yi )2

i=1

n

=

(at (k)yi )2

i=1



ˆt

yi a yi , from our definition of ˆa. Then D k can be

yi bi with

i=1

written

  − i=1 n

n

at (k)yi )2 + at (k) (ˆ

i=1

t

 i=1

2

− (ˆa y ) i

n

y i bi + a ˆt

  i=1

ˆ t yi + a ˆ t yi at (k)yi + at (k)yi a

yi at (k)yi

PROBLEM SOLUTIONS

203

and thus n

Dk =





(at (k)yi

t

2

− aˆ y ) ≤ 0.

i=1

i

Adding the m update equations for the LMS rule we find m

a(m + 1) − aˆ

2



= a(1)

− aˆ

2

+ η 2 (1) k =1

→ ∞ for both sides: ˆ = a(1) − a ˆ lim a(m + 1) − a →∞ 2

a(1) − aˆ

=











Ck Dk + 2η(1) k2 k k=1 k=1

2

+ η 2 (1)

2

+ η 2 (1)CL + 2η(1)

m

Dk . k

k =1



Now we take the limit m

m

Ck + 2η(1) k2





k=1

Dk , k

where if C k is bounded we have CL =





Ck < k2 k=1

∞.

We also know that if D k < 0



Dk k

→ −∞

k=1



for all k. But D k < 0 cannot be true for all k (except for finite occassions), otherwise the right-hand side of the equation will be negative while the left-hand side will be non-negative. Thus Dk must be arbitrarily close to zero for some choice N , for all k > N . This tells us that lim at (k)yik = ˆat yik

k

→∞

for i = 1, 2,...,N . To see this, we start with arbitrarily smally positive for all k > N for some integer N . This implies

| |

λ. Then, we know Dk < λ

n

|D | = k

in particular at (k)yi

|



(at (k)yi

i=1

t

2

− aˆ y ) ≤ λ, i

− aˆ y | ≤ √λ will also be true. Now we let t

i

n

v(k) =

 −  − ∓  − ∓ ∓  yi (bi

yit a(k))

yi (bi

[ˆ at y i

yi (bi

a ˆ t yi )

i=1 n

=

λi ])

i=1 n

=

i=1 n

=

yi

i=1

λi

yi

λi

CHAPTER 5. LINEAR DISCRIMINANT FUNCTIONS

204



ˆ t yi . where λi = at (k)yi a Since λ was arbitrarily close to zero and

|



|

|λ | ≤ |λ|, we have i

lim v(k) = 0,

k

→∞

proving that a(k) satisfies the limit condition. Section 5.9 27. The six points are shown in the figure, labeled by their category membership. 6

ω2

4

ω1

2

ω2

0

ω1

-2

ω1

-4

ω2

-6 -4

-2

0

2

4

6

(a) By inspec ting the plot of the poin ts, we see that ω1 and ω2 are not linearly separable, that is, there is no line that can separate all the the ω 2 points.

ω1 points from all

(b) Equation 85 in the text shows how much the e 2 value is reduced from step k to step k + 1. For faster convergence we want the maximum possible reduction. We can find an expression for the learning rate η by solving the maxima problem for the right-hand side of Eq. 85,



1 4



e(k)

2

2

 − e(k + 1)



= η(1

− η)e

+



(k)

2

+ η 2 e+t (k)YY † e+ (k).



( )

We take the derivative with respect to η of the right-hand side, set it to zero and solve:

e

+

2

 − 2ηe

(k)

+



(k)

2

+ 2ηe+ (k)YYt e+ (k) = 0,

which gives the optimal learning rate ηopt (k) =

e  − e (k)YY e 2

2 [ e+ (k)



2

+

t + (k)]

.

The maximum eigenvalue of Y t Y is λ max = 66.4787. Since all the η values between 0 and 2/λmax ensure convergence to take the largest step in each iteration we should choose η as large as possi ble. That is, since 0 < η < 2/66.4787, we can choose η opt = 0.0301 , where  is an arbitrarily small positive constant.





However, ( ) is not suitable for fixed learning rate Ho-Kashyap algorithms. To get a fixed optimal η, be proceed as discussed in pages 254– 255 in the text: If

PROBLEM SOLUTIONS

205

we use the algorithm expressed in Eq. 90, b(1) > 0 but otherwise arbitrary a(1) arbitrary b(k + 1) = b(k) + (e(k) + e(k) ) a(k + 1) = a(k) + ηRY t e(k) ,

|

|

|

|

where R is arbitrary posit ive definite. As discussed on page 255 of the text , if R = I, then A = 2ηI η 2 Y t Y will be positive definite, thereby ensuring convergence if 0 < η < 2/λmax , where λmax is the largest eigenvalue of Y t Y. For our case, we have



Y=

  −  −−

1 1 1 1 1 1

− − −

 − −  −  

1 2 3 2 1 5

2 4 1 4 5 0

t

and Y Y =

 −



6 6 4 6 44 10 4 10 62



.

The eigenvalues of Y t Y are 4.5331, 40.9882, 66.4787 .

{

}

Section 5.10 28. The linear programming problem on pages 256–257 in the text involved finding t

{ ≥ 0, a y + t > b

min t : t

i

i

}

for all i .

Our goal here is to show that the resulting weight vector function t

a minimizes the criterion

− a y ).

Jt (a) = max (bi t



i:a yi bi

i

There are two cases to consider: linearly separable and non-linearly separable. Case 1 : Suppose the samples are linearly separable. Then there exists a vector, say ao , such that ato yi = b i . Then clearly a to yi + t > bi , for any t > 0 and for all i. Thus we have for all i 0

≤ ≤

t

{ ≥ 0, a y + t > b } min{t : t ≥ 0, a y + t > b } = 0. min t : t

i

i

t i

i

Therefore, we have

{

min t : t

t

≥ 0, a y + t > b } = 0, i

i

and the resulting weight vector is a o . The fact that J t (a) implies arg min Jt (a) = a o . a

≥ 0 for all a and J (a ) = 0 t

o

CHAPTER 5. LINEAR DISCRIMINANT FUNCTIONS

206

This proves that the a minimizes J t (a) is the same as the one for solving the modified variable problem. Case 2: If the samples are not linearly separable, then there is no vector a o such that ato yi = b i . This means that for all i 0, at yi + t > bi

min t : t t,a

{



= min t : t

}

at y i

0, t > b i

{ ≥ − } min {t : t ≥ 0, t > max(b − a y )} min {t : t > max (b − a y )} ≤ min { max (b − a y )} ≤ t,a

= = =

t,a t,a a

i

i:at yi bi

i:at yi bi

i

t

t

i

i

t

i

i

i

= min Jt (a). a

Section 5.11 29. Given n patterns x k , in d-dimensional space, we associate z k where



If xk ω1 , then z k = 1 If xk ω2 , then z k = 1.



We also define a mapping φ : R d



Rd+1 as



φ(x) = (xk , zk )

where

 − x).

xk = arg min( xk xk

In other words, φ returns the nearest-neighbor prototype xk . Then clearly φ(x) = (x, 0) represents a separating hyperplane with normal a = (0, 0,..., 0, 1)

   d zeros

We verify this by considering any pattern in ω 1 . For such a pattern at φ(xk ) = (0 , 0,..., 0, 1)t (xk , zk ) = z k = +1. Conversely, for any pattern in ω 2 , we have at φ(xk ) = (0 , 0,..., 0, 1)t (xk , zk ) = z k =

−1.

Intuitively speaking, this construction tells us that if we can label the samples unambiguously, then we can always find a mapping to transform class ω 1 and class ω2 to points into the half spaces. 30. The points are mapped to ω1 ω2

√ √ √ −√2,√−√√2, √2,√1, 1) √ √ √ (1, 2, − 2, − 2, 1, 1) , (1, − 2, 2, − 2, 1, 1)

:

(1, 2, 2, 2, 1, 1)t , (1,

:

t

t

t

PROBLEM SOLUTIONS

207 z2 = x12

z2 =

√ 2 x2

8

g=-1 g=1

3

6

g=-1 2

4

√2 1

2 z1 =

-3

-2 - √ 2 -1

1

√2

-1

2

√ 2 x1

3

z1 = -3

- √ 2-1

-2

1√ 2

0

2

√ 2 x1

3

g=-1

- √2 -2

-3 z2 = x22 g=-1 g=1

3

z2 = x12 3

g=-1 g=1 g=-1

2

g=1

2

√2 1

1 z1 =

-3

1 √2

-2 - √ 2 -1

2

3

√ 2 x1

- √2 -3

-2

√2 -1

1

2

z1 = √ 2 x1x2 3

-1 -2 -3

as shown in the figure. The margins are not the same, simply because the real margin is the distance of the support vectors to the optimal hyperplane in R 6 space, and their projection to lower dimensional subspaces does not necessarily preserve the margin. 31. The Support Vector Machine algorithm can be written: Algorithm 0 (SVM)

←∞

←∞ ←∞

begin initialize a ; worst1 ; worst2 ; b i 0 do i i+1 if z i = 1 and a t yi zi < worst 1, then worst1 at yi zi ; kworst1 k if z i = 1 and a t yi zi < worst 2, then worst2 at yi zi ; kworst2 k until i = n a a + ykworst2 ykworst1 a0 at (ykworst2 + ykworst1 )/2 oldb b; b at ykworst1 / a until b oldb <  return a 0 , a end

1



2 3 4 5 6





← ←

← − ← ← ← | − |

7 8 9 10 11 12

← ←

 

Note that the algorithm picks the worst classified patterns from each class and adjusts a such that the hyperplane moves toward the center of the worst patterns and rotates so that the angle between the hyperplane and the vector connecting the worst points increases. Once the hyper plane separates the classes, all the update s will involve support vectors, since the if statements can only pick the vectors with the smallest at y i . 32. Consider Support Vector Machines for classification.

|

|

(a) We are given the following six points in two categories: ω1

: x1 =

   1 , x2 = 1

2 , x3 = 2

2 0

CHAPTER 5. LINEAR DISCRIMINANT FUNCTIONS

208

ω2 with z 1 = z 2 = z 3 = y

1

+ y

2

: x4 =

−1 and z

4

   0 , x5 = 0

1 , x6 = 0

0 1

= z 5 = z 6 = +1.

y2 = 3/ 2

2

x2

x1

1

x6

x3

x4

x51

0

2

y1

√ 2 /4

The optimal hyperplane is y1 + y2 = 3/2, or (3 /2 1 1)t (1 y1 y2 ) = 0. To ensure zk at y 1, we have to scale (3 /2 1 1)t by 2, and thus the weight vector is (3 2 2)t . The optimal margin is the shortest distance from the patterns to the optimal hyperplane, which is 2/4, as can be seen in the figure.

≥ − −

− − √

− −

(b) Support vectors are the samples on the margin, that is, the ones with the shortest distance to the separ ating hyperplane. In this case, the suppor t vectors are x1 , x3 , x5 , x6 = (1, 1)t , (2, 0)t , (1, 0)t , (0, 1)t .

{

} {

}

(c) We seek to maximize the criterion given in Eq. 109 in the text,

L(α) = Σ_{k=1}^n αk − (1/2) Σ_{k,j} αk αj zk zj yjt yk ,

subject to the constraints

Σ_{k=1}^n zk αk = 0

for αk ≥ 0. Using the constraint, we can substitute α6 = α1 + α2 + α3 − α4 − α5 in the expression for L(α). Then we can get a system of linear equations by setting the partial derivatives ∂L/∂αi to zero. This yields

    [ −1 −2 −2  0  1 ] [α1]   [−2]
    [  2  5  2  1  1 ] [α2]   [ 2]
    [ −2 −2 −5 −1  1 ] [α3] = [−2] .
    [  0 −1  1 −1 −1 ] [α4]   [ 0]
    [  1  1  3 −1 −2 ] [α5]   [ 0]

Unfortunately, this is an inconsistent set of equations. Therefore, the maximum must be achieved on the boundary (where some αi vanish). We try each αi = 0 and solve ∂L/∂αi = 0. First,

∂L(0, α2 , α3 , α4 , α5 )/∂αi = 0

implies α = (1/5)(0, 2, 28, 8, 4)t , which violates the constraint αi ≥ 0. Next, both of the following vanishing derivatives,

∂L(α1 , 0, α3 , α4 , α5 )/∂αi = ∂L(α1 , α2 , 0, α4 , α5 )/∂αi = 0,

lead to inconsistent equations. Then the derivative

∂L(α1 , α2 , α3 , 0, α5 )/∂αi = 0

implies α = (1/5)(16  0  4  0  14  6)t , which does not violate the constraint αi ≥ 0. In this case the criterion function is L(α) = 4. Finally, we have

∂L(α1 , α2 , α3 , α4 , 0)/∂αi = 0,

which implies α = (1/5)(2  2  2  0  0  6)t , and the constraint αi ≥ 0 is obeyed. In this case the criterion function is L(α) = 1.2.

Thus α = (1/5)(16  0  4  0  14  6)t is where the criterion function L reaches its maximum within the constraints. Now we seek the weight vector a. We seek to minimize L(a, α) of Eq. 108 in the text,

L(a, α) = (1/2) ‖a‖² − Σ_{k=1}^n αk [zk at yk − 1],

with respect to a. We take the derivative of the criterion function,

∂L/∂a = a − Σ_{k=1}^n αk zk yk = 0,

which for the αk found above has solution

a = −(16/5)y1 − 0 y2 − (4/5)y3 + 0 y4 + (14/5)y5 + (6/5)y6 = (0 −2 −2)t .

Note that ∂L/∂a = 0 here is not sufficient to allow us to find the bias a0 directly, since the ‖a‖² term does not include the augmented vector a and Σ_k αk zk = 0. We determine a0 , then, by using one of the support vectors, for instance y1 = (1 1 1)t . Since y1 is a support vector, at y1 z1 = 1 holds, and thus

−(a0 − 2 − 2) = −a0 + 4 = 1.

This, then, means a0 = 3, and the full weight vector is a = (3 −2 −2)t .
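The multipliers and the resulting weight vector can be checked with a few lines of code (a sketch; the array names are ours). The check also shows that zk at yk = 1 exactly at the patterns with αk > 0, which anticipates the Kuhn-Tucker condition discussed in the next problem.

    import numpy as np

    Y = np.array([[1, 1, 1], [1, 2, 2], [1, 2, 0],      # augmented y_k = (1, x_k)
                  [1, 0, 0], [1, 1, 0], [1, 0, 1]], float)
    z = np.array([-1, -1, -1, +1, +1, +1], float)
    alpha = np.array([16, 0, 4, 0, 14, 6], float) / 5.0

    w = (alpha * z) @ Y                     # sum_k alpha_k z_k y_k = (0, -2, -2)
    L = alpha.sum() - 0.5 * w @ w           # dual criterion value = 4.0
    a = np.array([3.0, w[1], w[2]])         # bias a0 = 3 fixed from support vector y1
    print(w, L, z * (Y @ a))                # margins are exactly 1 where alpha_k > 0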

33. Consider the Kuhn-Tucker theorem and the conversion of a constrained optimization problem for support vector machines to an unconstrained one.


(a) Here the relevant functional is given by Eq. 108 in the text, that is,

L(a, α) = (1/2) ‖a‖² − Σ_{k=1}^n αk [zk at yk − 1].

We seek to maximize L with respect to α to guarantee that all the patterns are correctly classified, that is zk at yk ≥ 1, and we want to minimize L with respect to the (un-augmented) a. This will give us the optimal hyperplane. This solution corresponds to a saddle point in α–a space, as shown in the figure.

[Figure: the criterion L plotted over the α–a space, with the saddle point marked.]

(b) We write the augmented vector a as (a0  ar )t , where a0 is the augmented bias. Then we have

L(ar , α, a0 ) = (1/2) ‖ar ‖² − Σ_{k=1}^n αk [zk art yk + zk a0 − 1].

At the saddle point, ∂L/∂a0 = 0 and ∂L/∂ar = 0. The first of these derivatives vanishing implies

Σ_{k=1}^n αk* zk = 0.

(c) The second derivative vanishing implies

∂L/∂ar = ar − Σ_{k=1}^n αk* zk yk = 0,

and thus

ar = Σ_{k=1}^n αk* zk yk .

Since Σ_{k=1}^n αk zk = 0, we can thus write the solution in augmented form as

a = Σ_{k=1}^n αk* zk yk .

(d) If αk* (zk a*t yk − 1) = 0 and if zk a*t yk ≠ 1, then αk* must be zero. Respectively, call these predicates h, NOT p and q; then the above states h AND NOT p → q, which is equivalent to h AND NOT q → p. In other words, given the expression

αk* (zk a*t yk − 1) = 0,

αk* is non-zero only if zk a*t yk = 1.

(e) Here we have

L̄ = (1/2) ‖ Σ_{k=1}^n αk zk yk ‖² − Σ_{k=1}^n αk zk ( Σ_{l=1}^n αl zl yl )t yk + Σ_{k=1}^n αk
  = (1/2) Σ_{kl} αk αl zk zl ykt yl − Σ_{kl} αk αl zk zl ykt yl + Σ_{k=1}^n αk .

Thus we have

L̄ = Σ_{k=1}^n αk − (1/2) Σ_{kl} αk αl zk zl ykt yl .

(f) See part (e).
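The identity derived in part (e) is easy to sanity-check numerically. The sketch below (our own construction, with random data) evaluates both forms of L̄ and confirms that they agree.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 8, 3
    Y = rng.normal(size=(n, d))             # augmented patterns y_k as rows
    z = rng.choice([-1.0, 1.0], size=n)     # labels z_k
    alpha = rng.uniform(size=n)             # arbitrary multipliers

    s = (alpha * z) @ Y                     # sum_k alpha_k z_k y_k
    form1 = 0.5 * s @ s - (alpha * z) @ (Y @ s) + alpha.sum()
    G = (z[:, None] * Y) @ (z[:, None] * Y).T          # z_k z_l y_k^t y_l
    form2 = alpha.sum() - 0.5 * alpha @ G @ alpha
    print(np.isclose(form1, form2))         # True: both equal sum(alpha) - ||s||^2 / 2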

34. We repeat Example 2 in the text but with the following four points:

y1 = (1  √2  5√2  5√2  1  25)t ,   y2 = (1  −2√2  −4√2  8√2  4  16)t ,   z1 = z2 = −1
y3 = (1  2√2  3√2  6√2  4  9)t ,   y4 = (1  −√2  5√2  −5√2  1  25)t ,   z3 = z4 = +1.

We seek the optimal hyperplane, and thus want to maximize the functional given by Eq. 109 in the text:

L(α) = α1 + α2 + α3 + α4 − (1/2) Σ_{kl} αl αk zk zl ykt yl ,

with constraints α1 + α2 = α3 + α4 and αi ≥ 0. We substitute α4 = α1 + α2 − α3 into L(α), take the partial derivatives with respect to α1 , α2 and α3 , and set the derivatives to zero:

∂L/∂α1 = 2 − 208α1 − 256α2 + 232α3 = 0
∂L/∂α2 = 2 − 256α1 − 592α2 + 469α3 = 0
∂L/∂α3 =     232α1 + 469α2 − 533α3 = 0.

The solution to these equations, α1 = 0.0154, α2 = 0.0067, α3 = 0.0126 (and hence α4 = α1 + α2 − α3 = 0.0095), indeed satisfies the constraint αi ≥ 0, as required. Now we compute a using Eq. 108 in the text:

∂L/∂a = a − Σ_{k=1}^4 αk zk yk = 0,

which has solution

a = −0.0154 y1 − 0.0067 y2 + 0.0126 y3 + 0.0095 y4 = (0  0.0194  0.0496  −0.145  0.0177  −0.1413)t .

Note that this process cannot determine the bias term a0 directly; we can use a support vector for this in the following way. We note that at yk zk = 1 must hold for each support vector. We pick y1 and then

−(a0 + 0.0194·√2 + 0.0496·5√2 − 0.145·5√2 + 0.0177·1 − 0.1413·25) = 1,

which gives a0 = 3.1614, and thus the full weight vector is a = (3.1614  0.0194  0.0496  −0.145  0.0177  −0.1413)t . Now we plot the discriminant function in x1–x2 space:

g(x1 , x2 ) = at (1  √2x1  √2x2  √2x1 x2  x1²  x2²)t
           = 0.0272x1 + 0.0699x2 − 0.2054x1 x2 + 0.0177x1² − 0.1415x2² + 3.17.

The figure shows the hyperbola corresponding to g(x1 , x2 ) = 0 as well as the margins g(x1 , x2 ) = ±1, along with the support vectors y2 , y3 and y4 .

[Figure: the curves g(x1 , x2 ) = 0 and g(x1 , x2 ) = ±1 in the x1–x2 plane, with the training points marked.]

Section 5.12
35. Consider the LMS algorithm and Eq. 61 in the text.
(a) The LMS algorithm minimizes Js (a) = ‖Ya − b‖², and we are given the problem

Js (a) = ‖ Ya − b ‖²  with  Y = [ 1n  Y1 ; −1n  Y1 ],  a = (a0 , ar )t ,  b = (b1 , b2 )t .

We assume we have 2n data points. Moreover, we let bi = (bi1  bi2 … bin )t for i = 1, 2 be arbitrary margin vectors of size n each, Y1 is an n-by-2 matrix containing the data points as its rows, and a = (a0 , a1 , a2 )t = (a0 , ar )t . Then we have

Js (a) = ‖ a0 1n + Y1 ar − b1 ‖² + ‖ −a0 1n + Y1 ar − b2 ‖²
       = Σ_{i=1}^n (a0 + yit ar − b1i )² + Σ_{i=1}^n (−a0 + yit ar − b2i )²
       = Σ_{i=1}^n (a0 − b1i )² + Σ_{i=1}^n (a0 + b2i )² + 2 Σ_{i=1}^n (yit ar )² + 2 Σ_{i=1}^n yit ar (a0 − b1i ) − 2 Σ_{i=1}^n yit ar (a0 + b2i )
       = Σ_{i=1}^n (a0 − b1i )² + Σ_{i=1}^n (a0 + b2i )² + 2 Σ_{i=1}^n (yit ar )² − 2 Σ_{i=1}^n yit ar (b1i + b2i ).

Thus ∂Js /∂a0 = 0 does not involve ar , and for equal margin vectors b1 = b2 it implies a0 = 0, so the minimum of Js must be at (0, ar )t for some ar . Hence we have shown that a0 = 0, which tells us that the separating plane must go through the origin.
(b) We know from part (a) that the LMS hyperplane must pass through the origin. Thus the ω1 samples must all lie on one side of a separating hyperplane through the origin, with the ω2 samples in the other half-space. This is guaranteed in the shaded (union) region, and we are asked to describe the region where there is no shading. We can state this as: if y = (x1 , x2 )t lies in the region where 2|x1 | < |x2 | (the unshaded region in the figure), then the LMS solution to separate ω1 and ω2 will not give a separating hyperplane, where

ω1 :  (1, 2)t ,  (2, 4)t ,  y
ω2 :  −(1, 2)t ,  −(2, 4)t ,  −y,

as shown in the figure.
(c) Part (b) can be generalized by noting that the feasible region for y (so that LMS will give a separating hyperplane) is the union of the half-spaces Hi determined by the yi as

H1 = {half space induced by the separating hyperplane y1 and containing y2 }
H2 = {half space induced by the separating hyperplane y2 and containing y1 }.

The LMS procedure will give a separating hyperplane if and only if y3 ∈ H1 ∪ H2 . Thus, to ensure that the LMS solution does not separate {y1 , y2 , y3 } from {−y1 , −y2 , −y3 }, we must have y3 ∈ H̄1 ∩ H̄2 .

36. The algorithm is as follows:

Algorithm 0 (Fixed increment multicategory Perceptron)
 1  begin initialize ai ← 0 for i = 1, …, c;  k ← 0;  n ← number of samples;  c ← number of classes
 2    do k ← (k + 1) mod n
 3      i ← class of yk ;  j ← 0
 4      do j ← j + 1  (j ≠ i)
 5      until ait yk ≤ ajt yk ∨ j > c
 6      if j ≤ c then ai ← ai + yk
 7        aj ← aj − yk
 8    until no more misclassifications
 9    return a1 , a2 , …, ac
10  end

[Figure (for part (b) of the previous problem): the x1–x2 plane with the lines x2 = 2x1 and x2 = −2x1 and the points y1 and y2 marked.]
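A compact Python sketch of this multicategory update follows (our own variable names; ys holds the augmented samples and labels their class indices). Here the strongest competing class is used, a common variant of the first-competitor scan in the pseudocode.

    import numpy as np

    def multicategory_perceptron(ys, labels, c, max_epochs=100):
        """Fixed-increment multicategory Perceptron, as sketched above."""
        a = np.zeros((c, ys.shape[1]))
        for _ in range(max_epochs):
            errors = 0
            for y, i in zip(ys, labels):
                scores = a @ y
                j = int(np.argmax(scores))
                if j != i and scores[j] >= scores[i]:
                    a[i] += y          # reward the correct class
                    a[j] -= y          # punish the best competing class
                    errors += 1
            if errors == 0:
                break
        return a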

Some of the advantages are that the algorithm is simple and admits a simple parallel implementation. Some of the disadvantages are that there may be numerical instability and that the search is not particularly efficient. 37. Equation 122 in the text gives the Bayes discriminant function as c

g0i =





|

λij P (ωi x).

j=1

The definition of λ ij from Eq. 121 in the text ensures that for a given i only one λ ij will be non-zero. Thus we have g 0i = P (ωi x). We apply Bayes rule to find

|

|



p(x)g0i (x) = p(x ωi )P (ωi ).

( )

On the other hand, analogous to Eq. 58 in the text, the criterion function written as

 −    − (ati y

Js1 i (a) =

y

1)2 +

∈Y

n1 1 n n n1

=

(ati y)2

∈Y

y/

i

y

i

n2 1 1) + n n2

(ati y

∈Y

i

J s1 i can be

2

 

(ati y)2 .

∈Y

y/

i

By the law of large numbers, as n , we have that 1 /nJs1 i (ai ) approaches J¯i (ai ) with probability 1, where J¯i (ai ) is given by

→∞

J¯i (ai ) = P (ωi ) i [(ati y

E

2

where t i

2

=

2

=

E [(a y − 1) ] E [(a y) ] 1

2

Thus we have J¯i (ai ) =



(ati y

t i

2

t i

2

− 1) ] + P (NOT ω )E [(a y) ],

 

i

(ati y

− 1) p(x|ω )dx (a y) p(x|NOT ω )dx.

− 1) p(x, ω )dx + i

2

2

t i



2

i

i

(ati y)2 p(x,NOT ω i )dx.

PROBLEM SOLUTIONS

215

We expand the square to find: J¯(ai ) =



(ati y)2 p(x, ωi )dx

−2



(ati y)p(x, ωi )dx +



p(x, ωi )dx +



(ati y)2 p(x,NOT ω i )dx.

We collect the terms containing (ati y)2 and find J¯(ai ) =

(ati y)2 p(x)dx



(ati y)p(x, ωi )dx +

−2



p(x, ωi )dx.

 



We use the fact that p(x)g0i (x) = p(x, ωi ) from ( ), above, we find J¯(ai ) =



(ati y)2 p(x)dx

−2



(at y)p(x)g0i (x)dx +

p(x, ωi )dx,

which can be written as J¯(ai ) =

 

[ati y

  

2 0i (x)] p(xdx)

−g





2 [g0i (x)p(x)

0i (x)p(x)]dx .



2i ,the mean squared approx. error



−g

independent of ai



The second term in the sum is independent of weight vector ai . Hence the ai that minimizes Js1 i also minimizes 2i . Thus the MSE solution A = Y t B (where A = ai a2 ac and Y = [Y1 Y2 Yc ]t , and B is defined by Eq. 119 in the text) for the multiclass case yields discriminant functions a ti y that provide a minimum-meansquare error approximation to the Bayes discriminant functions g 0i . 38. We are given the multi-category classification problem with sets c 1 , 2 ,..., to be classified as ω1 , ω2 ,...,ω c , respectively. The aim is to find a1 , a2 ,..., ac such t a tj y for all j = i. We transform this problem int o a binary that if y i , then a i y classification problem using Kesler’s construc tion. We define G = G1 G2 Gc where

···

Y Y

∈Y





Y

∪ ∪···

{ | ∈ Y , j = i}

Gi = ηij y

i

and η ij is as given in Eq. 115 in the text. Moreover we define α = (at1 a2

···

atc )t .

Perceptron case We wish to show what the Fixed-increment single sample Perceptron algorithm given in Eq. 20 in the text does to our transformed problem. We rewrite Eq. 20 as α(1)

= arbitrary

α(k + 1) =

α(k) + gk

where g k is misclassified. The condition g k being misclassified at step k implies αt (k)gk 0. Since the Gi s are disjoint, gk must belong to one and only one of the G i . We shall use a subscript to denote whic h G i that g k belongs to; for instance, gik is in Gi . Given this notation, the inequalit y αt (k)gik 0 implies ati y atj y 0 for some j = i. Thus there is an equivalence between αt (k)gik 0 and a ti (k)y a tj (k)y.





≤ ≤







Consider the update rule α (k + 1) = α (k) + gik . At once we see the equivalence:

216

CHAPTER 5. LINEAR DISCRIMINANT FUNCTIONS Multi-category α(k + 1) = α (k) +

gik



Two-category ai (k + 1) = a i (k) + yik aj (k + 1) = a j (k) yjk ,



where, as defined above, α is the concatenation of as. Relaxation rule case The single sample relaxation rule becomes α(1)

= arbitrary

α(k + 1) =

α(k) + η

b

t k

−α g g  k 2

gk ,

where b is the margin. An update takes place when the sample g k is incorrectly classified, that is, when α t (k)gk < b. We use the same definition of g as in the Perceptron case above, and can then write αt (k)η ij < b

t i

t j

⇔ (a (k)y − a (k)y) < b.

In the update rule, α can be decomposed into its sub-component as, yielding the following equivalence:

Multi-category α(k + 1) = α (k) + η

Two-category

−α (k)g

b

t

gk

2

 

k

b (ati (k) atj (k))yk k y 2 yk 2 b (ati (k) atj (k))yk k 2 yk 2 a j (k) + y

ai (k + 1) = a i (k) +

gk



aj (k + 1) =



−   − −

.

217

5.0. COMPUTER EXERCISES

Computer Exercises

Section 5.4
1. Computer exercise not yet solved

Section 5.5
2. Computer exercise not yet solved
3. Computer exercise not yet solved
4. Computer exercise not yet solved
5. Computer exercise not yet solved
6. Computer exercise not yet solved

Section 5.6
7. Computer exercise not yet solved

Section 5.8
8. Computer exercise not yet solved

Section 5.9
9. Computer exercise not yet solved

Section 5.10
10. Computer exercise not yet solved

Section 5.11
11. Computer exercise not yet solved

Section 5.12
12. Computer exercise not yet solved

218

CHAPTER 5. LINEAR DISCRIMINANT FUNCTIONS

Chapter 6

Multilayer neural networks

Problem Solutions

Section 6.2
1. Consider a three-layer network with linear units throughout, having input vector x, vector at the hidden units y, and output vector z. For such a linear system we have y = W1 x and z = W2 y for two matrices W1 and W2 . Thus we can write the output as

z = W2 y = W2 W1 x = W3 x

for some matrix W3 = W2 W1 . But this equation is the same as that of a two-layer network having connection matrix W3 . Thus a three-layer network with linear units throughout can be implemented by a two-layer network with appropriately chosen connections.
Clearly, a non-linearly separable problem cannot be solved by a three-layer neural network with linear hidden units. To see this, suppose a non-linearly separable problem could be solved by a three-layer neural network with linear hidden units. Then, equivalently, it could be solved by a two-layer neural network, and the problem would therefore be linearly separable. But, by assumption, the problem is not linearly separable. Hence there is a contradiction and the above conclusion holds.
2. Fourier's theorem shows that a three-layer neural network with sigmoidal hidden units can act as a universal approximator. Consider a two-dimensional input and a single output z(x1 , x2 ) = z(x). Fourier's theorem states

z(x) ≈ Σ_{f1} Σ_{f2} A_{f1 f2} cos(f1 x1 ) cos(f2 x2 ).

(a) Fourier’s theorem, as stated above, can be rewritten with the trigonometric identity: 1 1 cos(α)cos(β ) = cos(α + β ) + cos(α β ), 2 2



219

CHAPTER 6. MULTILAYER NEURAL NETWORKS

220 to give

z(x1 , x2 )



 f1

f2

Af1 f2 [cos(f1 x1 + f2 x2 ) + cos(f1 x1 z2

− f x )] . 2 2

(b) We want to show that cos( x), or indeed any continuous function, can be approximated by the following linear combination n

f (x)

 f (x ) + 0

 i=0

[f (xi+1

− f (x )] i



Sgn[x xi ] . 2





The Fourier series of a function f (x) at a point x0 converges to f (x0 ) if f (x) is of bounded variation in some interval ( x0 h, x0 + h) centered on x0 . A function of bounded variation is a follows, given a partition on the interval a = x0 < x1 < x2 < .. . < x n−1 < xn = b form the sum



n

|

f (xk )

k=1

− f (x − )|. k 1

The least upper bound of these sums is called the total varia tion. For a point f (x) in the neighborhood of f (x0 ), we can rewrite the variation as n



[f (xi+1

i=1

− f (x )] i







Sgn[x xi ] , 2

which sets the interval to be ( x, x + 2h). Note the func tion has to be either continuous at f (x0 ) or have a discontinuity of the first kind. f(x)

approximation function

x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14

x



(c) As the effect ive width of the sigmo id vanishes, i.e., as σ 0, the sigmoids become step functions. Then the functions cos(f1 x1 +f2 x2 ) and cos(f1 x1 f2 x2 )



can be approximated as cos(f1 x1 + f2 x2 ) cos(f1 x10 + f2 x20 )



  × n

+

[cos(x1i+1 f1 + x2i+1 f2 )

i=0

and similarly for cos[ f1 x1

Sgn[x1

− f x ]. 2 2

−x

1i ]Sgn[x2

2

− cos(x

−x

2i ]

1i f1



,

+ x2i f2 )]

PROBLEM SOLUTIONS

221

(d) The construction does not necessarily guarantee that the derivative is approximated, since there might be discontinuities of the first order, that is, noncontinuous first derivative. Nevertheless, the one-sided limits of f (x0 + 0) and f (x0 0) will exist.



Section 6.3 3. Consider a d

nH

c network trained with n patterns for m e epochs.

− space − complexity of this problem. The total numbe r of adjustable (a) Consider the weights is dn H + nH c. The amount of storage for the n patterns is nd.

(b) In stochastic mode we choose patte rn randomly and compute w(t + 1) = w(t) + ∆w(t). until stopping criterion is met. Each iteration involves computing ∆ w(t) and then adding it to w(t). From Eq. 17 in the text, for hidden-to-output unit weights we have ∆wjk = η(tk where netk =

nH



− z )f (net )y , k

k

j

wjk yj is computed by nH multiplications and nH additions.

j=1

So, ∆ wjk is computed with c(2nH + 1) operations. From Eq. 17 in the text, for input-to-hidden unit weights we have c

∆wji = ηx i f  (netk )



wkj δk

k=1

where, netj is computed in 2 d operations. Moreover,



wjk δk is computed in

k

2c operations. Thus we have w ij ’s are computed in [2 d + 2c]nH time. Thus, the time for one iteration is the time to compute ∆ add w to ∆w, that is, T

= =

w plus the time to

c(2nH + 10 + 2(d + c + 1)nH + (dnH + nH c) 3dnH + 5nH c + c + 2nH .

In summary, the time complexity is (3dnH + 5nH c + c + 2nH )me . (c) Here, the number of iterations = nm e and thus the time complexity is (3dnH + 5nH c + c + 2nH )n me . 4. Equation 20 in the text gives the sensitivity at a hidden unit c

δj = f  (netj )



wkj δk .

k=1

For a four-layer or higher-layer neural network, the sensitivity of a hidden unit is likewise given by δj

∂E ≡ − ∂net

= j







∂E ∂O j . ∂o j ∂net j

CHAPTER 6. MULTILAYER NEURAL NETWORKS

222

By the chain rule we have δj =

−f (net ) j



∂E ∂net k . ∂net k ∂o j

k

Now, for a unit k in the next higher layer netk =

wj  k oj  = j

where

wj  k oj  + wjk oj , j







δnetk = w jk . δo j

Thus we have

 −     c

δj = f  (netj )

∂E = f  (netj ) ∂net k

wjk

k=1

δk

c



wjk δk .

k=1

5. From Eq. 21 in the text, the backpropagation rule for training input-to-hidden weights is given by ∆wji = ηx i f  (netj )

wkj δk = ηx i δj . k=1



For a fixed hidden unit j , ∆wji ’s are determined by x i . The larger the magnitude of xi , the larger the weight change. If xi = 0, then there is no input from the ith unit and hence no change in w ji . On the othe r hand, for a fixed input unit i, ∆wji δ j , the sensitivity of unit j . Change in weight w ji determined how the overall error changes with the activation of unit j. For a large ma gnitude of δj (i.e., large sensitivity), ∆wji is large. Also note that the sensitivities of the subsequent layers propagate to the layer j through weighted sum of sensitivities. 6. There is no reason to expect that the backpropagation rules should be inversely related to f  (net). Note, the backpropagation rules are defined to provide an algorithm that minimizes 1 2

J=



c



(tk

k=1

2

−z ) , k

where zk = f (netk ). Now, the weights are chosen to minim ize J . Clearly, a large change in weight small shouldchanges occur only to significantly decrease J to ensure convergence in the algorithm, in weight should correspond to, small changes in J. But J can be significantly decreased only by a large change in the output ok . So, large changes in weight should occur where the output varies the most and least where the output varies the least. Thus a change in weight should not be inversely related to f  (net). 7. In a three-layer neural network with bias, the output can be written zk = f

   wjk f

j

wji xi

i

PROBLEM SOLUTIONS

223

can be equivalently expressed as zk = f

   wkj f

wji xi

j

i

by increasing the number of input units by 1 and the number of hidden inputs by 1, xo = 1, ωoj = α j , j = o, ωio = 0,



and f h is adjusted such that o bias = f h (o) = 1 and w ok = α k . 8. We denote the number of input units as d, the number of hidden units as n H and the number of category (output) units as c. There is a single bias unit. (a) The number of weights is dn H + (nH + 1)c, where the first term is the number of input-to-hidden weights and the second term the number of hidden-to-output weights, including the bias unit. (b) The output at output node k is given by Eq. 7 in the text:

   nH

zk =

d

f

wkj f

j=1

wji xi + wj0

i=1

  + wk0

.

·

We assume that the activation function f ( ) is odd or antisymmetric, giving d

d

 −  ↔ −         − f

wji xi

j =1

f

j =1

wji xi

,

but since the weighting on this summation also flips, we have d



( wkj )

f

d

wji xi

= (wkj ) f

j=1

wji xi

,

j=1

and the srcinal output is unchanged.

(c) The hidden units can be exchanged along with corresponding weights and this can leave the network unaffected. The number of subsets that can be constructed over the set nH is of course 2 nH . Now, because the corre sponding weights for each subset nH different weight orderings can be constructed, so the total hidden unit symmetry is nH !2nH . For the case nH = 10, this is a factor of 3,715,891,200. 9. The on-line version of the backpropagation algorithm is as follows: Algorithm 0 (On-line backpropagation) 1 2 3 4 5 6 7

begin initialize n H , w, η do x next input pattern wji wji + ηδj xi ; wkj wkj + ηδk yi until no more patterns available return w end

← ←



CHAPTER 6. MULTILAYER NEURAL NETWORKS

224

10. We express the derivative of a sigmoid in terms of the sigmoid itself for positive constants a and b for the following cases. (a) For f (net) = 1/(1 + ea net ), the derivative is df (net) d net





2

−a 1 + e1 e −af (net)(1 − f (net)).

= =

a net

a net

(b) For f (net) = atanh(b net), the derivative is df (net) d net

=

2

2

−2b atanh(b net)(1 − tanh (b net )).

11. We use the following notation: The activations at the first (input), second, third, and fourth (output) layers are xi , yj , vl , and zk , respectively, and the indexes are clear from usage. The number of units in the first hidden layer is nH1 and the number in the second hidden layer is n H2 .

output

z

v

v1

y

x

x1

z1

z2

v2

vl

y1

y2

x2

xi

zq

zc

vlmax

yj

yjmax

xd

input Algorithm 0 (Four-layer backpropagation) 1 2 3

begin initialize xxx xxx xxx

4 5

end

return xxx

Section 6.4 12. Suppose the input to hidden weights are set equal to the same value, say then w ij = w o . Then we have d

netj = f (netj ) =

 i=1

wji xi = w o

 i

xi = w o x.

wo ,

PROBLEM SOLUTIONS

225

This means that oj = f (netj ) is constant, say y o . Clearly, whatever the topology of the srcinal network, setting the wji to be a constant is equivalent to changing the topology so that there is only a single input unit, whose input to the next layer is xo . As, a result of this loss of one-layer and number of input units in the next layer, the network will not train well. 13. If the labels on the two hidden units are exchanged, the shape of the error surface is unaffected. Consider a d nH c three-layer network. From Problem 8 part (c), we know that there are n H !2nH equivalent relabelings of the hidden units that do not





affect the network output. One can also flip weights for each of these configurations, Thus there should be nH !2nH +1 possible relabelings and weight changes that leave the error surface unaffected. 14. Consider a simple 2 1 network with bias. Suppose the training data come from two Gaussians, p(x ω1 ) N ( 0.5, 1) and p(x ω2 ) N (0.5, 1). The teaching values are 1.

±

|

− ∼ −

|



·

(a) The error is a sum over the n patterns as a function of the transfer function f ( ) as n

J (w) =



(tk

k=1

t

− f (w x

k

+ w0 ))2 ,

where t k is the teaching signal and w 0 the bias weight. (b) We differentiate twice with respect to the weights to compute the Hessian matrix, which has components Hij =

∂ 2 J (w) . ∂w i ∂w j

We use the outer product approximation of the Hessian and find n

Hij =



f  (wt xk + w0 )xki f  (wt xk + w0 )xkj

k=1

where x kj is the j th component of sample x k .

|

(c) Consider two data sets drawn from p(x ω1 ) then has components

∼ N (u , I). i

The Hessian matrix

Hij = n2 (µ1i + µ2i )(µ1j + µ2j ), where µ1i is the ith element of the mean fector of class 1, and analogously for class 2. (d) Problem not yet solved (e) Problem not yet solved (f) Problem not yet solved 15. We assume that the error function can be well described by a Hessian matrix H having d eigenvalues and where λmax and λmin are the maximum and minimum of these eigenvalues.

CHAPTER 6. MULTILAYER NEURAL NETWORKS

226

(a) The optimimum rate is η < 2/λmax .

| − ηλ | < 1 for all eigenvalues i = 1,...,d

(b) The convergence criterion is 1

i

.

(c) The time for the system to meet the convergence criterion θ is θ = (1

− ηλ ) i

T

where T is the total number of steps. This factor is dominated by the smallest eigenvalue, so we seek the value T such that θ = (1 ηλ min )T .



This, in turn, implies that lnθ . ln(1 ηλ)

T =



16. Assume the criterion function J (w) is well described to second order by a Hessian matrix H. (a) Stepping along the gradien t gives at step T αTi = (1 To get a local minimum, we need (b) Consider (1

0 i

T

− ηλ ) α . |1 − η | < 1, and this implies i

i

η < 2/λmax .

(2/λmax )λi ). More steps will be required along the direction cor-



responding to the smallest eigenvalue λ i , and thus the learning rate is governed by 1

− 2λλ

min

.

max

Standardization helps reduce the learning time by making the ratio λmax /λmin = 1. (c) Standardization is, in essence, a whitening transform. Section 6.6 17. From Eq. 25 in the text, we have

  

nk 1 J (w) = n n nk

→∞

[gk (x, w)



x ωk

2

− 1]

+

n

−n

1

k

n

n

−n

k

 

gk (x, w)2 .

∈

x ωk



As n , the proportion of all samples that are in ωk approaches P (ωk ). By the law of large numbers, then, 1 nk



[gk (x, w)



x ωk

2

− 1]

approaches 2

E ([g (x, w) − 1] |x ∈ ω ) = k

k



[gk (x, w)

2

− 1] p(x|ω )dx. k

PROBLEM SOLUTIONS

227

Likewise, we have 1 n

−n

k



[Fk (x, ω)]2 ,

∈

x ωk

which approaches ([gk (x, w)]2 x

E



Thus we see, in the limit of infinite data J (w) = = = = =

P (ωk )

   



[gk (x, w)

[gk (x, w)]2 p(x ωi=k )dx.

ωi=k ) =

| ∈

2

|



− 1] p(x|ω )dx + P (ω 

2

i =k )

k





 [gk (x, w)]2 p(x ωi =k )dx

2

|

− 1] p(x, ω )dx + [g (x, w)] p(x|ω  )dx [g (x, w) + 1 − 2g (x, w)]p(x, w)]p(x, ω )dx + [g (x, w)] p(x|ω  [g (x, w) − P (ω |x)] p(x)dx + P (ω |x)[1 − P (ω |x)]p(x)dx [g (x, w) − P (ω |x)] p(x)dx + P (ω |x)P (ω  |x)p(x)dx. [gk (x, w)

k

2 k

k

k

k

k

k

k

i =k



k

 

2

2

k

2

k

i=k )dx

k

i =k

k

18. Consider how one of the solutions to the minimum-squared-error condition indeed yields outputs that are estimates of posterior probabilities. (a) From Eq. 25 in the text, we have J (w) = ∂J (w) ∂w We set

=

∂J (w) ∂w



 2

gk2 (x, w)p(x)dx



gk (x, w)

−2



gk (x, w)p(x, ωk )dx +

∂g k (x, w) p(x)dx ∂w

 − 2



p(x, ωk )

∂g k (x, w) p(x, ωk )dx. ∂w

= 0 and find

Fk (x, ω)

∂F k (x, ω ) p(x)dx = ∂ω



∂F k (x, ω ) p(x, ωk )dx ∂ω

Clearly, w∗ w∗ (p(x)), the solution of the above equation, depends on the choice of p(x). But, for all p(x), we have



gk (x, w∗ (p(x))



Thus we have

∂g k (x, w) ∂w

∂g k (x, w)

p(x)dx = w=w∗ (p(x))



∂w



gk (x, w∗ (p(x)))p(x) = p(x, ωk )

with probability 1 on the set of x : p(x, ωk ) > 0. This implies gk (x, w∗ ) =

p(x, ωk ) = P (ωk x). p(x)

|

p(x, ωk )dx.



w=w∗ (p(x))

CHAPTER 6. MULTILAYER NEURAL NETWORKS

228

(b) Already shown above. 19. The assumption that the network can represent the underlying true distribution is not used before Eq. 28 in the text. For Eq. 29, however, we invoke gk (x; w) p(ωk x), which is used for





|

c

[gk (x; w)

k=1

− P (ω |x)] p(x)dx = 0. k

This is true only when the above assu mption is met. If the assumption is not met, the gradient descent procedure yields the closest projection to the posterior probability in the class spanned by the network. 20. Recall the equation t

˜ k )+B(y,φ)+w ˜ y p(y ωk ) = e A(w .

|

|

(a) Given p(y ωk ), we use Bayes’ Theorem to write the posterior as

|

p(ωk y) =

|

p(y ωk )P (ωk ) . p(y)

·

˜ k and φ as follows: (b) We interpret A( ), w P (ωk ) = p(ωk y) =

|

˜ k) e−A(w ˜ k ) + B(y, φ) + w ˜ kt y] P (ωk ) exp[ A(w c

 

m=1

=

t y] P (ωm ) exp[ A(w ˜ m ) + B(y, φ) + w ˜m

netk

c

,

enetm

m=1

˜ t y. Thus, B(y, φ) is the bias, w ˜ is the weight vector where netk = b(y, φ) + w ˜ k) describing the separating plane, and e −A(w is P (ωk ). 21. Backpropagation with softmax is done in the usual manner; all that must be evaluated differently are the sensitivities at the output and hidden layer units, that is, zh =

eneth eneth



∂z h = z h (1 ∂net h

and

h

(a) We are given the following terms: d

netj

=

 

wji xi

i=1 nH

netk

=

wkj yj

j=1

yj

=

zk

=

f (netj ) enetk c



m=1

enetm

,

− z ). h

PROBLEM SOLUTIONS

229

and the error function c



1 (tk 2 k=1

J=

2

−z ) . k

To derive the learning rule we have to compute ∂J/∂w kj and ∂J/∂w ji . We start with the former: ∂J ∂J ∂net k ∂w kj = netk ∂w kj . We next compute c

∂J ∂net k

=

 s=1

∂J ∂z s . ∂z s ∂net k

We also have

         −   

enets ( 1)enetk



c

∂z s = ∂net k

=

−z z



s k

if s = k

2 k

if s = k

m=1

enets

+ ( 1)

c

enets enets c

enetm

2

= zk

enetm

m=1

and finally

2

enetm

−z

m=1

∂J = ( 1)(ts ∂z s



− z ). s

Putting these together we get c

−

∂J = ∂net k s

( 1)(ts

=k

2 k

− z )(−z z ) + (−1)(t − z )(z − z ). s

s k

k

k

k

We use ∂net k /∂w kj = yj and obtain c



∂J = yj ∂w kj s

(ts

=k

s

s k

j

k

k

k

Now we have to find input-to-hidden weight contribution to rule we have ∂J

∂J ∂y j ∂net j

∂w ji = ∂y j ∂net j ∂w ji . We can at once find out the last two partial derivatives. ∂y j ∂net j = f  (netj ) and = xi. ∂net j ∂w ji Now we also have ∂J ∂y j

c

=

 s=1

2 k

− z )(z z ) − y (t − z )(z − z ).

∂J ∂z s ∂z s ∂y j

J . By the chain

230

CHAPTER 6. MULTILAYER NEURAL NETWORKS c

= =

  −



(ts

s=1 c

∂z − z ) ∂y s

−z )

(ts

zs )

s

c



  −



s=1

(ts

s=1

∂z s ∂net r ∂net r ∂y j r=1 ∂z s ∂net s

c

zs zs2





r =s

wsj

c

∂net r ∂y j

−z

s zr

c

zs2 )wsj +

− z )(z s

∂z s ∂net r

+

∂net s ∂y j

c

=

j

                  − − c

(ts

s=1

=

s

s

wrj

zs zr wrj (ts

zs ).

s=1 r =s



We put all this together and find ∂J ∂w ji

c

= xi f  (netj )

c

−  − (ts

zs )

s=1

c

−x f  (net ) i

wrj zs zr



r =s

(ts

j

zs )wsj (zs

s=1

2 s

− z ).

Of course, the learning rule is then ∆wji

=

∆wkj

=

η

∂J

− ∂w∂J −η ∂w

ji

,

kj

where the derivatives are as given above. (b) We are given the cross-entropy criterion function c

JCE =



tk ln

k=1

tk . zk

The learning rule derivation is the same as in part (a) except that we need to replace ∂J/∂z with ∂ Jce /∂z. Note that for the cross-entropy criterion we have ∂J ce /∂z k = tk /zk . Then, following the steps in part (a), we have



∂J ce ∂w kj

c

= yj

  

s=k

= yj

c

tk zs zk zk

∂J ce ∂w ji

xi f  (netj )

2 k

−z )

k

c

   ts z s=1 s c

−x f (net ) i

tk (zk zk

j k

c

=

j

− y t (1 − z ).

tk zk



s=k

Similarly, we have

−y

j

s=1

wrj zs zr



r =x

ts wsj (zs zs

2 s

− z ).

PROBLEM SOLUTIONS

231

Thus we find c

∂J ce ∂w ji

xi f  (netj )

=

c

  ts

s=1

c

−x f  (net ) i

wrj zr



r =s

ts wsj (1

j

s=1

− z ). s

Of course, the learning rule is then ∆wji ∆wkj

∂J −η ∂w

ce

=



=

ji

∂Jce η , ∂w kj

where the derivatives are as given above.



|

− 

|

22. In the two-category case, if g1 P (ω1 x), then 1 g1 P (ω2 x), since we can assume the categ ories are mutually exclusive and exhaustive. But 1 g 1 can be computed by a network with input-to-hidden weights identical to those used to compute g 1 . From Eq. 27 in the text, we know



[gk1 (x, w)

k1

2

− P (ω |x)] k1

dx +



[gk2 (x, w)

k2



− P (ω |x)] dx k2

is a minimum. this implies that every term in the above equation is minimized. Section 6.7 23. Consider the weight update rules given by Eqs. 12 and 23 in the text. (a) The weight updates are have the factor ηf  (net). If we take the sigmoid f b (h) = tanh(bh), we have fb (h) = 2b

e−bh (1 + e−bh )2

− 1 = 2b e

1 + e−bh + 2

   − bh

1 = 2bD = 1.

D

Clearly, 0 < D < 0.25 for all b and h. If we assume D is constant, then clearly  (h) will be equal to ηf  (h), which tells us the increment the product η/γfγb b in the weight values will be the same, preser ving the convergence time. The assumption will be approximately true as long as bh is very small or kept constant in spite of the change in b.

| |

(b) If the inpu t data is scale d by 1 /α, then the increment of weights at each step will be exactly the same. That is, η  f (h/γ ) = ηf  (h). γ γb Therefore the convergence time will be kept the same. (However, if the network is a multi-layer one, the input scaling should be applied to the hidden units’ outputs as well.)

CHAPTER 6. MULTILAYER NEURAL NETWORKS

232

24. A general additive model is given by Eq. 32 in the text:

 d

zk = f

fjk (xj ) +

wok

j=1



.

The functions fk act on single components, namely the xi . This actually restricts the additive models. To have the full power of three-layer neural networks we must assume that fi is multivariate func tion of the inputs. In this case the model will become:

 d

zk = f

fjk (x1 , x2 ,...,x

d) +

w0

j=1



.

With this modification it is trivial to implement any three-layer neural network. We let

 d

fjk

= w kj g

wji xi + wj0

j=1



and w0k = w k0 , where g , wkj , wji , w j0 , wk0 are defined as in Eq. 32 in the text . We substitute fjk into the above equation and at once arrive at the three-layer neural network function shown in Eq. 7 in the text. 25. Let p x (x) and p y (y) be probability density functions of x and y, respectively. (a) From the definition of entropy given by Eq. 37 in Chapter 1 of the text (or Eq. 118 in Section A.7.1 in the Appendix), and the one-dimensional Gaussian,

  2

px (x) =

√12π exp − 2σx

2

.

we have the entropy



H (px (x)) =





px (x)logpx (x)dx =

−∞



1 + log 2πσ 2

 1.447782 + logσ bits.

(b) From Sect. 6.8.2 in the text, if a = 2/(3σ), then we have f  (x) x < σ.

 1 for −σ
0. At

net =

a

f( ) = = =

 ∞, we have

2a a = 2a a = a 1 + e−b net b 2 b 2 [a (f (net))2 ] = (a a2 ) = 0 2a 2a b f (net)f  (net) = 0. a











At net = 0, we have f (0) =

2a 1 + e−b net b 2

f  (0) = 2a (a f  (0) = 0 . At net =

−∞, we have f (−∞) = f  (−∞) = f  (−∞) =

− a = 2aa − a = 0 b

2

ab

2

− (f (net)) ) = 2a a

= 2

2a a=0 a= a 1 + e−b net b 2 b 2 (a (f (net))2 ) = (a a2 ) = 0 2a 2a 0.







− −

CHAPTER 6. MULTILAYER NEURAL NETWORKS

234

27. We consider the computational burden for standardizing data, as described in the text. (a) The data should be shifted and scaled to ensure zero mean and unit variance (standardization). To compute the mean, the variance we need 2 nd steps, and thus the complexity is thus O(nd). (b) A full forward pass of the network activations requires nH d + n H c steps. A backpropagation of the error nH d is + for nH dc steps. The first te rm isThe for the output-to-hidden and the requires second term hidden-to-input weights. extra c factor comes from the backpropagation formula given in Eq. 21 in the text. Thus nd epochs of training require (nd)n[nh d + nH c + nH c + nH dc] steps. Assuming a single output netowrk we can set c = 1 and use the approximation given in Sec. 6.8.7 to get n/10 = n H d + nH . We use this result in the above and find n2 d[n/10 + n/10] =

n3 d , 5

and thus is O(n3 d) complexity. (c) We use the results of parts (a) and (b) above and see that the ratio of steps can be approximated as nd 5 = 2. n3 d/5 n This tells us that the burden of standardizing data gets negligible as larger. 28. The Minkowski error per pattern is defined by c

J=

|

c

zk

k=1

−t | k

R

=

|

tk

k=1

R

−z | k

.

We proceed as in page 290 of the text for the usual sum-square-error function: ∂J ∂J ∂net k = ∂w kj ∂net k ∂w kj where z k = f

nH

 

wkj yj . We also have

j=1

∂J = ∂net k

k

k

− f  (netk )Sgn(tk − zk ),

R 1

−R|t − z |

||

where the signum function can be written Sgn( x) = x/ x . We also have ∂net k = yj . ∂w kj

n geets

PROBLEM SOLUTIONS

235

Thus we have ∂J = ∂w kj

− f  (netk )yj Sgn(tk − zk ),

R 1

−R|t − z | k

k

and so the update rule for w kj is k

− f  (netk )yj Sgn(tk − zk ).

R 1

| −z |

∆wkj = η tk

Now we compute ∂J/∂w ji by the chain rule: ∂J ∂J ∂y j ∂net j = . ∂w ji ∂y j ∂net j ∂w ji The first term on the right-hand side is c

∂J ∂y j

 − |

=

k=1 c

=

∂J ∂z k ∂z k ∂y j R tk

k=1

− f  (netk )wkj Sgn(tk − zk ).

R 1

−z | k

The second term is simply ∂yj /∂net j = f  (netj ). The third term is ∂net j /∂w ji = xi . We put all this together and find c

∂J = ∂w ji



 | 

R tk

k=1

k



R 1 Sgn(tk

−z |

− z )f (net )w k

k

and thus the weight update rule is c

∆wji = η

wkj f  (netk )Sgn(tk

k=1

k

k

k



R 1

− z )R|t − z |

 

kj

f  (netj )xi .

f  (netj )xi .

For the Euclidean distance measure, we let R = 2 and obtain Eqs. 17 and 21 in the text, if we note the identity Sgn(tk

− z ) · |t − z | = t − z . k

k

k

k

k

− −

29. Consider a d nH c three-layer neural network whose input units are liner and output units are sigmoidal but each hidden unit implements a particular polynomial function, trained on a sum-square error criterion. Here the output of hidden unit j is given by oj = w ji xi + wjm xm + qj xi xm



for two prespecified inputs, i and m = i. (a) We write the network function as

 nH

zk = f

j=1

wkj oj (x)



236

CHAPTER 6. MULTILAYER NEURAL NETWORKS where o j (x) is oj (x) = w ji j xij + wj mj xmj + qj xij wmj . Since each hidden unit has prespecifie d inputs i and m, we use a subscript on i and m to denote their relation to the hidden unit. The criterion function is c

J=

zk )2 .

(tk





k=1

Now we calculate the derivatives:

∂J ∂J ∂net k = = ∂w kj ∂net k ∂w kj

−(t − z )f (net )o (x). k

k

k

j

Likewise we have ∂J ∂w ji

=

      −  −   c

∂J ∂o j = ∂o j ∂w ji

∂J ∂z k ∂z k ∂o j k=1

∂o j ∂w ji

∂J/∂o j

c

=

zk )f  (netk )wkj

(tk

k=1

∂o j . ∂w ji

Now we turn to the second term: ∂o j = ∂w ji

xi + qj xmj if i = i j xi + qj xij if i = m j 0 otherwise.

Now we can write ∂J/∂w ji for the weights connecting inputs to the hidden layer. Remember that each hidden unit j has three weights or parameters for the two specified input units i j and m j , namely w ji j , w jm j , and q j . Thus we have

  − c

∂J ∂w ji j

=

∂J ∂w jmj

=



(tk

k=1 c

(tk

k=1

− z )f  (net )w k

k

kj

− z )f  (net )w k

k

kj

 

(xij + qj xmj ) (xim + qj xim )

∂J = 0 for r = i j and r = m j . ∂wjr





We must also compute ∂J ∂J ∂o j = , ∂q j ∂o j ∂q j where ∂J/∂q j = x ij xmj and ∂J = ∂o j

c





(tk

k=1

− z )f  (net )w k

k

kj .

PROBLEM SOLUTIONS

237

Thus we have ∂J = ∂q j

−  c

(tk − zk )f  (netk )wkj

k=1



xij xmj .

Thus the gradient descent rule for the input-to-hidden weights is c

∆wji j

= η

∆wjmj

= η

  

(tk

− z )f  (net )w

(tk

− z )f  (net )w

kj

− z )f  (net )w

kj

k=1 c k=1 c

∆qj

= η

(tk

k=1

k

k

k

k

k

k

kj

 

(xij + qj xmj ) (xim + qj xij ) xij xmj .

(b) We observe that y i = o i , from part (a) we see that the hidden-to-output weight update rule is the same as the standard backpropagation rule. (c) The most obvious weakness is that the “rece ptive fields” of the network are a mere two units. Another weakness is that the hidden lay er outputs are not bounded anymore and hence create problems in convergence and numerical stability. A key fundamental weakness is that the network is no more a universal approximator/classifier, the mials function space two F that unitsa span is merely a subset ofbecause all polyno of degree and the lesshidden . Consider classification task. In essence the network does a linear classification followed by a sigmoidal non-linearity at the output. In order for the network to be able to perform the task, the hidden unit outputs must be linearly separable in the F space. However, this cannot be guaranteed; in general we need a much larger function space to ensure the linear separability. The primary advantage of this schem e is due to the task domain. If the task domain is known to be solvable with such a network, the convergence will be faster and the computational load will be much less than a standard backpropagation learning network, since the number of input-to-hidden unit weights is 3nH compared to dn H of standard backpropagation. 30. The solution to Problem 28 gives the back propagation learning rule for the Minkowski error. In this problem we will derive the learning rule for the Manhattan metric directly. The error per pattern is given by c

J=

|

k=1

tk

− z |. k

We proceed as in page 290 of the text for the sum-squared error criterion function: ∂J ∂w kj ∂J ∂net k

=

∂J ∂net k ∂net k ∂w kj

=

−f  (net )Sgn(t − z ), k

k

k

CHAPTER 6. MULTILAYER NEURAL NETWORKS

238

||

where Sgn( x) = x/ x . Finally, we have ∂net k /∂w kj = y j . Thus we have ∂J = ∂w kj

−f  (net )y Sgn(t − z ), k

j

k

k

and so the update rule for w kj is: ∆wkj = ηf  (netk )yj Sgn(tk

− z ). k

Now we turn to ∂J/∂w ji . By the chain rule we have ∂J ∂J ∂y j ∂net j = . ∂w ji ∂y j ∂net j ∂w ji We compute each of the three factors on the right-hand-side of the equation: c

∂J ∂y j

=

 − k=1 c

=

∂J ∂z k ∂z k ∂y j f  (netk )wkj Sgn(tk

k=1

∂y j ∂net j ∂net j ∂w ji

=

f  (netj )

=

xi .

−z ) k

Thus we put this together and find ∂J = ∂w ji





f  (netj )xi .



f  (netj )xi .

c



Sgn(tk

k=1

− z )f  (net )w k

k

kj

Thus the weight update rule for w ji is:

 c

∆wji = η

k=1

wkj f  (netk )Sgn(tk − zk )

The above learning rules define the backpropagation learning rule with the Manhattan metric. Note that the Manhattan metric uses the direction (signum) of t k zk instead of the actual t k zk as in the sum-squared-error criterion.





Section 6.9 31. We are given the criterion function n

J=



1 (tm 2 m=1

2 m) .

−z

The Hessian matrix is then ∂2J 1 = ∂w ji ∂w lk 2

   n

n

 −  

∂z m ∂z m + (zm ∂w ji ∂w lk m=1 m=1

tm )

0

  

∂ 2 zm ∂w ji ∂w lk

.

PROBLEM SOLUTIONS

239

We drop the last term in the outer-product approximation, as given in Eq. 45 in the text. Let Wj be any hidden-to-output weight in a single output network. Then we have ∂z/∂W j = f  (net)yj . Now we let w ji be any input-to-hidden weight. Then since z=f

    Wj f

j

wji xi

,

i

the derivative is ∂z = f  (net)Wj f  (netj )xi . ∂w ji Thus the derivatives with respect to the hidden-to-output weights are: Xtv = (f  (net)y1 ,...,f

 (net)yn

H

),

as in Eq. 47 in the text. Further, the derivatives with respect to the intput-to-hidden weights are Xtu

= (f  (net)f  (netj )W1 x1 ,...,f  (net)f  (netnH )WnH x1 , f  (net)f  (net1 )W1 xd ,...,f  (net)f  (netnH )WnH xd ). c

32. We are given that JCE =

tk ln(tk /zk ). For the calcu lation of the Hessian k=1

matrix, we need to find the expressions for the second deriva tives: ∂ 2 JCE ∂w ji ∂w lk

   −   −  ∂J CE ∂w ji



= =



=

∂w lk tk ∂ zk ∂z k zk2 ∂w ji ∂w lk

tk ∂z zk ∂w ji

∂w lk zk

∂ 2 zk . ∂w ji ∂w lk

We arrive at the outer product approximation by dropping the second-order terms. We compare the above formula to Eq. 44 in the text and see that the Hessian matrix, HCE differs from the Hessian matrix in the traditional sum-squared error case in Eq. 45 by a scale factor of t k /zk2 . Thus, for a d nH 1 network, the Hessian matrix can be approximated by

− −

n

HCE =



1 t1 [m]t [m] X X , n m=1 z12

where X [m] is defined by Eqs. 46, 47 and 48 in the text, as well as in Problem 31. 33. We assume that the current weight value is w(n 1) and we are doing a line search.



(a) The next w according to the line search will be found via w(n) = w(n

− 1) + λ∇J (w(n − 1))

by minimizing J (w) using only λ. Since we are given that the Hessian matrix is proportional to the identity, H I, the Taylor expansion of J (w) up to second



CHAPTER 6. MULTILAYER NEURAL NETWORKS

240

degree will give an exact representation of J . We expand around w(n find J (w) = J (w(n 1)) + (w w(n 1))t J (w(n +1/2(w w(n 1))t H(w w(n 1)).

− −





− ∇ − −

− 1) and

− 1)) −



We know that H = kI for some scalar k. Further, we can write w w(n 1) = λ J (w(n 1)), which can be plugged into the expression for J (w) to give



−J (w)

=

=

J (w(n 1)) + λ J t (w(n 1)) J (w(n 1)) +(k/2)λ2 J t (w(n 1)) J (w(n 1)) k J (w(n 1)) + λ + λ2 ) J (w(n 1)) 2 . 2











− ∇ − − ∇ − ∇ − 



Now we see the result of the line search in is, ∂J = ∂λ

J by solving ∂J/∂λ = 0 for λ, that

2

∇J (w(n − 1)) (1 + kλ) = 0, which has solution λ = −1/k. Thus the next w after the line search will be w(n) = w(n − 1) − 1/k∇J (w(n − 1)). We can also rearrange terms and find

− w(n − 1) = ∆ w(n − 1) = − k1 ∇J (w(n − 1)). Using the Taylor expansion for ∇J (w(n) around w(n − 1) we can write ∇J (w(n)) = ∇J (w(n − 1)) + k(w(n) − w(n − 1)), then substitute w(n) − w(n − 1) from above into the right-hand-side and find ∇J (w(n)) = ∇J (w(n − 1)) + k(−1/k∇J (w(n − 1))) = 0. Indeed, Eqs. 56 and 57 in the text give β = 0, given tha t ∇ J (w ) = 0, as shown. Equation 56 in the text is trivially satisfi ed since the numer ator is ∇J (w(n)) , and thus the Fletcher-Reeves equation gives β = 0. Equation 57 also vanishes because the numerator is the inner product of two vectors, one of which is ∇J (w(n)) = 0, and thus the Polak-Ribiere equation gives β = 0. (b) The above proves that applic ation of a line search reult with w(n − 1) takes us w(n)

n

2

n

n

n

to the minimum of J , and hence no further weight update is necessary.

34. Problem not yet solved 35. The Quickprop algorithm assume s the weights are independent and the error surface is quadratic. Whith these assump tions, ∂J/∂ w is linear. The graph shows ∂J/∂ w versus w and a possible error minimiza tion step. For finding a minimum in the criterion function, we search for ∂J/∂ w = 0. That is the value w where the line crosses the w axis.

PROBLEM SOLUTIONS

241

∂J/∂w

er

ro

r

m

in

im

a iz

o ti

n

w

0

wn wn-1

w*

We can easilty compute the intercept in this one-dimensional illustration. The equation of the line passing through (x0 , y0 and x1 , y1 ) is (x x1 )/(x1 x0 ) = (y y1 )/(y1 y0 ). The x axis crossing is found by setting y = 0 and solving for x, that is



x

−x

1

= (x1

− x ) y −−y y 0

1

1

= (x1

0



− x ) y y− y 0

1

0





. 1

| −

Now we can plug in our point shown on the graph, with x0 = w n−1 , y 0 = ∂J/∂ w n−1 , x1 = w n , y1 = ∂J/∂ w n , and by noting ∆w(m) = (x1 x0 ) and ∆w(n+1) = ( x x1 ). The final result is ∂J/∂ w n ∆w(n + 1) = ∆w(n). ∂J/∂ w n−1 ∂J/∂ w n

|



|

| −

|

Section 6.10 36. Consider the matched filter described by Eq. 66 in the text. (a) The trial weight funct ion w(t) = w∗ (t) + h(t) constrained of Eq. 63 must be obeyed:







w2 (t)dt =

−∞



w ∗2 (t)dt,

−∞

and we expand to find





w∗2 dt +

−∞





h2 (t)dt + 2

−∞





w ∗ (t)h(t)dt =

−∞





w ∗2 (t)dt

−∞

and this implies

−2





h(t)w ∗ (t)dt =

−∞





h2 (t)dt

−∞

≥ 0.

(b) We have



∞ ∗ x(t)w(t)dt = x(t)w (t)dt + x(t)h(t)dt. −∞ −∞ −∞





     output z ∗

CHAPTER 6. MULTILAYER NEURAL NETWORKS

242

From Eq. 66 in the text we know that w∗ (t) = λ/2x(t), and thus x(t) = 2w ∗ (t)/( λ). We plug thi s form of x(t) into the right-hand side of the above equation and find









−∞



x(t)w(t)dt =



x(t)[w ∗ (t) + h(t)]dt = z ∗ +

−∞

From part (a) we can substitute





2

w∗ (t)h(t)dt =

−∞







2 λ







w ∗ (t)h(t)dt.

−∞

h2 (t)dt,

−∞

and thus the output z of the matched filter is given by:



z=



x(t)w(t)dt = z ∗

−∞

− −1λ





h2 (t)dt.

−∞

(c) Plugging the above into Eq. 66 in the text,





z∗ =

w∗ (t)x(t)dt,

−∞ we get



−

z∗ =

−∞ Now we check on the value of λz ∗ : λz ∗ =



2

− λ2

λ x(t)x(t)dt. 2



x2 (t)dt < 0.

−∞

Thus z ∗ and λ have opposite signs. (d) Part (b) ensures z = z ∗ if and only if



1 λ

h2 (t)dt = 0,

−∞



which occurs if and only if h(t) = 0 for all t. That is, w ∗ (t) ensures the maximum output. Assume w 1 (t) also assures maximum output z ∗ as well, that is



 

x(t)w ∗ (t)dt

= z∗

x(t)w1 (t)dt

= z∗ .

−∞ ∞ −∞

PROBLEM SOLUTIONS

243

We subtract both sides and find





x(t)[w1 (t)

−∞

− w∗(t)]dt = 0.

Now we can let h(t) = w1 (t) w ∗ (t). We have just seen that giv en the trial weight function w 1 (t) = w ∗ (t) + h(t), that h(t) = 0 for all t. This tells us that w1 (t) w ∗ (t) = 0 for all t, or equivalently w 1 (t) = w ∗ (t) for all t.





37. Consider OBS and OBD algorithms. (a) We have from Eq. 49 in the text δJ

=

 ·    ∂J ∂w

t

1 ∂2J δ w + δ wt δ w + O( δ w3 ) 2 ∂ w2

·

·

   0

0

=

1 t δ w Hδ w. 2

So, it is required to minimize δ wt Hδ w, subject to the constraint of deleting one weight, say weight q . Let u q be the unit vector along the q th direction in weight space. Then, pre- and post-operat ing “picks out” the qq entry of a matrix, in particular, utq H−1 uq = [H−1 ]qq . Now we turn to the problem of minimizing 12 δ wt Hδ w subject to δ wt uq = wq . By the method of Lagrangian multipliers, this is equivalent to minimizing



f (δ w) =

1 t δ w Hδ w 2

t

− λ (δw u

q

+ wq ) .

   0

This in turn implies ∂f = Hδ w ∂δ w

− λ(u ) = 0. q

Now since the derivative of matrix products obey ∂ t x Ax = 2Ax ∂x and ∂ xt b = b, ∂x we have δ w = +λH−1 uq . To solve for λ we compute δf = 0, δλ

CHAPTER 6. MULTILAYER NEURAL NETWORKS

244

and this implies δ wt uq + wq = 0. This implies λutq H−1 uq =

−ω , q

which in turn gives the value of the Lagrange undetermined multiplier, λ=





wq wq = . utq H−1 uq [H−1 ]qq

So, we have δ w = +λH−1 uq =

− [Hw− ] q 1

qq

H −1 u q .

Moreover, we have the saliency of weight q is Lq

= = =

1 wq wq [ H−1 uq ]t H[ H −1 u q ] 2 [H−1 ]qq [H−1 ]qq wq2 1 ut H−1 HH−1 uq − 2 ([H 1 ]qq )2 q 1 wq2 1 wqq ut H−1 uq = [H−1 ]qq 2 (H−1 )2qq q 2 ([H−1 ]qq )2





2

=

12 [Hw −q1 ]qq .

(b) If H is diagonal, then [H−1 ]qq =

1 . Hqq

So, from part (a), the saliency of weight q for OBD is Lq =

1 wq2 1 = wq2 Hqq . 2 [H−1 ]qq 2

38. The three-layer radial basis function network is characterized by Eq. 61 in the text:

  

 

nH

zk = f

wkj ϕj (x)

j =0

,

where we can let y j = ϕ(x). The quadratic error function we use is J=

1 2

c

(tk

k=1

2

−z ) . k

We begin by finding the contribution of the hidden-to-output weights to the error: ∂J ∂J ∂net k = , ∂w kj ∂net k ∂w kj

PROBLEM SOLUTIONS

245

where ∂J ∂J ∂z k = = ∂net k ∂z k ∂net k

−(t − z )f (net ). k

k

k

We put these together and find ∂J = ∂w kj

−(t − z )f (net )ϕ (x). k

k

k

j

Now we turn to the update rule for µ j and b j . Here we take ϕj (x) = e −bj x−µj  . 2

Note that µj is a vector which can be written ∂J/∂µ ij :

µ1j , µ2j ,...,µ

dj .

We now compute

∂J ∂J ∂y j = . ∂µ ij ∂y j ∂µ ij We look at the first term and find ∂J ∂y j

c

=

−

(tk

∂z − z ) ∂net



(tk

− z )f  (net )w

k

k

k=1

k

∂net k ∂y j

c

=



The second term is ∂y j ∂µ ji

k

k=1

k

= e−bx−µj  ( 1)( 1)2(xi 2

− −

kj .

−µ

ij )

= 2ϕ(x). We put all this together to find ∂J = ∂µ ji

c

−2ϕ(x)(x − µ i

ij )



(tk

k=1

− z )f  (net )w k

k

kj .

Finally, we find ∂J/∂b j such that the size of the spherical Gaussians be adaptive: ∂J ∂J ∂y j = . ∂b j ∂y j ∂b j While the first term is given above, the second can be calculated directly: ∂y j ∂b j

= =

2 ∂ϕ(x) = e −bx−µj  x ∂b j

2

 − µ  (−1) j

2

−ϕ(x)x − µ  . j

Thus the final derivative is ∂J = ∂b j

c

−ϕ(x)x − µ  j

2



(tk

k=1

− z )f  (net )w k

k

kj .

CHAPTER 6. MULTILAYER NEURAL NETWORKS

246

Now it is a simple matter to write the learning rules: ∆wkj

=

−η ∂w∂J

∆µij

=

−η ∂µ∂J

kj

ij

∆bj



=

∂J η , ∂b j

with the derivatives as given above. Section 6.11 39. Consider a general d-by-d matrix K and variable d-dimensional vector x. (a) We write in component form: d

d

  

xt Kx =

xi Kij xj

j=1 i=1 d

d

x2i Kii +

=

xi xj Kij .



i=1

i=j

The derivatives are, then,

  

d t x Kx = dxr

d dxr

=

d dxr

d

i=1

xi xj Kij



i =j

x2i Kii + x2r Krr



i =r

d

+

  d

x2i Kii +

i

d

xi xj Kij + xr



xj Krj + xr



j

d



xj Krj +



j =r

d

=



d

xj Krj +

j=1

=



i =r

d



xi Kir



i =r

xi Kir

i=1

(Kx)rth element + (Kt x)tth element

This implies d t x Kx = Kx + Kt x = (K + Kt )x. dx (b) So, when K is symmetric, we have d t x Kx = (K + Kt )x = 2Kx. dx

xi Kir



j =r

= 0 + 2 xr Krr + 0 +

 

PROBLEM SOLUTIONS

247

40. Consider weight decay and regularization. (a) We note that the error can be written



E = Epat + γ

2 wij .

ij

Then, gradient descent in the error function is new

wij

δE

old

− η δw

= w ij

ij

where η is the learning rate. We also have ∂E ∂w ij



∂ ∂w ij

= =

Epat + γ

  2 wij

ij

γ 2wij .

This implies new wij

=

old wij

=

old wij (1

new old (b) In this case we have wij = wij (1 Thus

old ij

− 2γηw − 2γη).

− ξ) where ξ is a weight decay constant.

new old wij = w ij (1

2γη)



from part (a). This implies 2 γη = ξ or ξ . 2η

γ= (c) Here we have the error function E

= Epat +

2 γw ij 2 1 + wij

−  −  1 2 1 + wij

= Epat + γ 1 ∂E ∂w ij

= =

∂E pat ∂ +γ 1 ∂w ij ∂w ij 2γw ij 2 2 wij )

1 2 1 + wij

=0 +

γ 2wij 2

2 ) (1 + wij

.

(1 + The gradient descent procedure gives new wij

old = wij

∂E − η ∂w

ij

−

2γη

old = wij 1

=

old wij (1

old = w ij

2) (1 + wij

−ξ

ij )

2

old ij

− η (12γw +w



2 2 ij )

CHAPTER 6. MULTILAYER NEURAL NETWORKS

248 where

ξij =

2γη

2.

2 ) (1 + wij

This implies 2

γ=

2 ) ξij (1 + wij

.

2η (d) Now consider a network with a wide range of magnitudes for weights. Then, the new old constant weight decay method wij = w ij (1 ξ ) is equivalent to regularization 2 via a penalty on ωij . Thus, large magnitude weights are penalized and will



 ij

be pruned to a greater degree than wei ghts of small magnitude. On the other new old hand, the variable weight decay method wij = wij (1 ξ ij ) penalized on



 −  1

ij

1 2 1+wij

, which is less susceptible to large weights. Therefore the pruning

will not be as severe on large magnitude weights as in the first case. 41. Suppose that the error E pat is the negative log-likelihood up to a constant, that is, Epat =

−lnp(x|w) + constant.

Then we have

| ∝ e−

p(x w)

Epat

.

Consider the following prior distribution over weights p(ω)

∝e

−λ



2 wij

i,j

where λ > 0 is some parameter. Clearly this prior favors small weights, that is, 2 is large if wij is small. The joint density of x, w is

 ij

|

p(x, w) = p(x w)p(w)

∝ exp

The p osterior density is

|

p(w x) =

p(x, w) p(x)

 −



Epat

  −

∝ exp −E − λ pat

2 wij .

λ

ij

 

2 wij .

ij

So, maximizing the posterior density is equivalnet to minimizing E = E pat + λ

 ij

which is the E of Eq. 41 in the text.

2 wij ,

p(w)

PROBLEM SOLUTIONS

249

42. Given the definition of the criterion function  t w w, 2η

Jef = J (w) +

we can compute the derivative ∂ Jef /∂w i , and thus determine the learning rule. This derivative is ∂J ef ∂J  ∂w i = ∂w i + η wi for any w i . Thus the learning rule is wi (k + 1) =

 −    ∂J η ∂w i

wi (k)

∂J −η ∂w

= This can be written in two steps as

+ (1

i

wi (k)

winew (k) = wiold (k)(1 wiold (k + 1) =



∂J η ∂w i

Continuing with the next update we have winew (k + 1) = wiold (k + 2) =

− 

i

− )w (k). i

) + winew (k).

wiold (k)

wiold (k + 1)(1 ∂J −η ∂w

− w (k) i

wi (k)

 

− ) + winew (k + 1).

wiold (k+1)

The above corresponds to the description of weight decay, that is, the weights are updated by the standard learning rule and then shrunk by a factor, as according to Eq. 38 in the text. 43. Equation 70 in the text computes the inverse of ( H + αI) when H is initialized to αI. When the value of α is chosen small, this serves as a good approximation [H + αI]−1

 H− . 1

43. Equation 70 in the text computes the inverse of (H + αI) when H is initialized to αI. When the value of α is chosen small, this serves as a good approximation:

[H + αI]^{-1} ≈ H^{-1}.

We will write the Taylor expansion of the error function around w* up to second order. The approximation will be exact if the error function is quadratic:

J(w) ≈ J(w*) + (∂J/∂w)^t (w − w*) + (w − w*)^t H (w − w*).

We plug in H + αI for H and get

J′(w) ≈ J(w*) + (∂J/∂w)^t (w − w*) + (w − w*)^t H (w − w*) + (w − w*)^t αI (w − w*).

Thus we have

J′(w) = J(w) + α w^t w − α w*^t w*,

where the last term is a constant. Thus our calculation of H modifies the error function to be

J′(w) = J(w) + α ‖w‖² − constant.

This constant is irrelevant for minimization, and thus our error function is equivalent to J(w) + α ‖w‖², which is equivalent to a criterion leading to weight decay.

 12 δw Hδw. t

44. The predicted functional increase in the error for a weight change δw is given by Eq. 68 in the text:

δJ ≈ (1/2) δw^t H δw.

We want to find which component of w to set to zero that will lead to the smallest increase in the training error. If we choose the q-th component, then δw must satisfy u_q^t δw + w_q = 0, where u_q is the unit vector parallel to the q-th axis. Since we want to minimize δJ subject to the constraint above, we write the Lagrangian

L(δw, λ) = (1/2) δw^t H δw + λ (u_q^t δw + w_q),

and then solve the system of equations ∂L/∂λ = 0 and ∂L/∂δw = 0. These become

∂L/∂λ = u_q^t δw + w_q = 0
∂L/∂δw_j = Σ_k H_jk δw_k = 0        for j ≠ q
∂L/∂δw_q = Σ_k H_qk δw_k + λ = 0.

These latter equations imply Hδw = −λ u_q, and thus δw = H^{-1}(−λ) u_q. We substitute this result into the constraint above and solve for the undetermined multiplier λ:

−λ u_q^t H^{-1} u_q + w_q = 0,

which yields

λ = w_q / [H^{-1}]_qq.

Putting these results together gives us Eq. 69 in the text:

δw = − (w_q / [H^{-1}]_qq) H^{-1} u_q.

For the second part of Eq. 69, we just need to substitute the expression for δw we just found into the expression for δJ given in Eq. 68 in the text:

L_q = (1/2) (w_q / [H^{-1}]_qq)² (H^{-1} u_q)^t H (H^{-1} u_q) = (1/2) (w_q² / [H^{-1}]_qq²) u_q^t H^{-1} u_q = (1/2) w_q² / [H^{-1}]_qq.
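As an added numerical illustration of Eq. 69 (the Hessian and weight vector below are assumed toy values, not from the text), the sketch computes each weight's saliency L_q = w_q²/(2[H^{-1}]_qq) and applies the compensating change δw that zeroes the least salient weight.

```python
import numpy as np

# OBS-style pruning sketch: saliency L_q = w_q**2 / (2 [H^-1]_qq), and the
# compensating update delta_w = -(w_q / [H^-1]_qq) H^-1 u_q (Eq. 69).
H = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 0.5],
              [0.0, 0.5, 2.0]])               # assumed Hessian
w = np.array([0.8, -0.1, 0.5])                # assumed weight vector

H_inv = np.linalg.inv(H)
saliency = w**2 / (2.0 * np.diag(H_inv))
q = int(np.argmin(saliency))                  # weight whose removal costs least

delta_w = -(w[q] / H_inv[q, q]) * H_inv[:, q] # H_inv[:, q] is H^-1 u_q
w_pruned = w + delta_w                        # component q is (numerically) zero
print(q, saliency, w_pruned)
```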

45. Consider a simple 2-1 network with bias. First let w = (w1, w2)^t be the weights, x = (x1, x2)^t the inputs, and w0 the bias.

(a) Here the error is

E = Σ_{p=1}^n E_p = Σ_{p=1}^n [t_p(x) − z_p(x)]²,

where p indexes the patterns, t_p is the teaching or desired output and z_p the actual output. We can rewrite this error as

E = Σ_{x∈D1} [f(w1 x1 + w2 x2 + w0) − 1]² + Σ_{x∈D2} [f(w1 x1 + w2 x2 + w0) + 1]².

(b) To compute the Hessian matrix H = ∂²E/∂w², we proceed as follows:

∂E/∂w_i = 2 Σ_{x∈ω1} [f(w1 x1 + w2 x2 + w0) − 1] f′(w1 x1 + w2 x2 + w0) x_i
        + 2 Σ_{x∈ω2} [f(w1 x1 + w2 x2 + w0) + 1] f′(w1 x1 + w2 x2 + w0) x_i,   i = 1, 2,

∂²E/∂w_j ∂w_i = 2 Σ_{x∈ω1} { [f′(w1 x1 + w2 x2 + w0)]² + [f(w1 x1 + w2 x2 + w0) − 1] f″(w1 x1 + w2 x2 + w0) } x_j x_i
              + 2 Σ_{x∈ω2} { [f′(w1 x1 + w2 x2 + w0)]² + [f(w1 x1 + w2 x2 + w0) + 1] f″(w1 x1 + w2 x2 + w0) } x_j x_i.

We let the net activation at the output unit be denoted

net = w1 x1 + w2 x2 + w0 = (w0 w1 w2)(1 x1 x2)^t = w̃^t x̃ = w^t x + w0.

Then we have

∂²E/∂w_j ∂w_i = 2 Σ_{x∈ω1} [ (f′(net))² + (f(net) − 1) f″(net) ] x_j x_i + 2 Σ_{x∈ω2} [ (f′(net))² + (f(net) + 1) f″(net) ] x_j x_i = Σ_x u(x) x_i x_j,

where

u(x) = 2 [ (f′(net))² + (f(net) − 1) f″(net) ]   if x ∈ ω1,
u(x) = 2 [ (f′(net))² + (f(net) + 1) f″(net) ]   if x ∈ ω2.

So we have

H = ( Σ_x u(x) x1²      Σ_x u(x) x1 x2
      Σ_x u(x) x1 x2    Σ_x u(x) x2²  ).

(c) Consider two data sets with p(x|ωi) ~ N(µi, 1), i = 1, 2. Then H in terms of µ1, µ2 is H = ((H_ij))_{2×2}, where

H_ij = 2 [ (f′(net1))² + (f(net1) − 1) f″(net1) ] µ_{1i} µ_{1j} + 2 [ (f′(net2))² + (f(net2) + 1) f″(net2) ] µ_{2i} µ_{2j}

for i, j = 1, 2, where net1 = w^t µ1 + w0 and net2 = w^t µ2 + w0. Clearly, the Hessian matrix is symmetric, that is, H12 = H21.

(d) We have the Hessian matrix

H = ( H11  H12
      H12  H22 ).

Then the two eigenvalues of H are the roots of |H − λI| = 0, that is,

(H11 − λ)(H22 − λ) − H12² = 0,

or

λ² − λ(H11 + H22) + H11 H22 − H12² = 0,

and this implies the eigenvalues are

λ = [ H11 + H22 ± √( (H11 + H22)² − 4(H11 H22 − H12²) ) ] / 2 = [ H11 + H22 ± √( (H11 − H22)² + 4 H12² ) ] / 2.

Thus the minimum and maximum eigenvalues are

λ_min = [ H11 + H22 − √( (H11 − H22)² + 4 H12² ) ] / 2,
λ_max = [ H11 + H22 + √( (H11 − H22)² + 4 H12² ) ] / 2.

(e) Suppose µ1 = (1, 0) and µ2 = (0, 1). Then from part (c) we have

H12 = 0,
H11 = 2 [ (f′(w1 + w0))² + (f(w1 + w0) − 1) f″(w1 + w0) ],
H22 = 2 [ (f′(w2 + w0))² + (f(w2 + w0) + 1) f″(w2 + w0) ].

From part (d) we have

λ_min = ( H11 + H22 − |H11 − H22| ) / 2,
λ_max = ( H11 + H22 + |H11 − H22| ) / 2,

and therefore in this case the minimum and maximum eigenvalues are λ_min = min(H11, H22) and λ_max = max(H11, H22), and thus their ratio is

λ_max/λ_min = max(H11, H22)/min(H11, H22), that is, H11/H22 or H22/H11.
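The expressions in parts (c)-(e) are easy to check numerically; the sketch below (an added illustration, not the text's) builds H for f = tanh with assumed weights and the means of part (e), and compares the eigenvalue ratio with H11/H22.

```python
import numpy as np

# Numerical check of parts (c)-(e) with f = tanh; the weights and bias are
# arbitrary assumed values, and mu1, mu2 are the means of part (e).
f   = np.tanh
fp  = lambda s: 1.0 - np.tanh(s)**2                       # f'
fpp = lambda s: -2.0 * np.tanh(s) * (1.0 - np.tanh(s)**2) # f''

w, w0 = np.array([0.7, -0.4]), 0.2
mu1, mu2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])

def u(net, target):
    # u(x) of part (b), with target +1 for omega_1 and -1 for omega_2
    return 2.0 * (fp(net)**2 + (f(net) - target) * fpp(net))

net1, net2 = w @ mu1 + w0, w @ mu2 + w0
H = u(net1, +1) * np.outer(mu1, mu1) + u(net2, -1) * np.outer(mu2, mu2)

evals = np.linalg.eigvalsh(H)
print(H)                                    # off-diagonal terms vanish for these means
print(evals.max() / evals.min(),            # eigenvalue ratio ...
      H[0, 0] / H[1, 1])                    # ... equals H11/H22 or its reciprocal
```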


Computer Exercises

Section 6.2
1. Computer exercise not yet solved

Section 6.3
2. Computer exercise not yet solved
3. Computer exercise not yet solved
4. Computer exercise not yet solved
5. Computer exercise not yet solved
6. Computer exercise not yet solved
7. Computer exercise not yet solved

Section 6.4
8. Computer exercise not yet solved

Section 6.5
9. Computer exercise not yet solved

Section 6.6
10. Computer exercise not yet solved

Section 6.7
11. Computer exercise not yet solved

Section 6.8
12. Computer exercise not yet solved

Chapter 7

Stochastic methods

Problem Solutions

Section 7.1

1. First, the total number of typed characters in the play is approximately

m = 50 pages × 80 lines/page × 40 characters/line = 160000 characters.

We assume that each character has an equal chance of being typed. Thus, the probability of typing a specific character is r = 1/30. We assume that the typing of each character is an independent event, and thus the probability of typing any particular string of length m is r^m. Therefore, a rough estimate of the length of the total string that must be typed before one copy of Hamlet appears is 1/r^m = 30^160000 ≈ 2.52 × 10^236339 characters. One year is

365.25 days × 24 hours/day × 60 minutes/hour × 60 seconds/minute = 31557600 seconds.

We are given that the monkey types two characters per second. Hence the expected time needed to type Hamlet under these circumstances is

2.52 × 10^236339 characters × (1 second / 2 characters) × (1 year / 31557600 seconds) ≈ 4 × 10^236331 years,

which is much larger than 10^10 years, the age of the universe.
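The exponents above can be verified with base-10 logarithms; the following short sketch is an added check, not part of the original solution.

```python
import math

# Check of the monkey-typing estimate via base-10 logarithms,
# so that no astronomically large integers are ever formed.
m = 50 * 80 * 40                          # characters in the play
log10_strings = m * math.log10(30)        # log10 of 30^m, the expected string length
log10_seconds = log10_strings - math.log10(2)         # typing 2 characters/second
log10_years   = log10_seconds - math.log10(31557600)  # seconds per year

print(round(log10_strings, 1))   # ~236339.4, i.e. about 2.5 x 10^236339 characters
print(round(log10_years, 1))     # ~236331.6, i.e. about 4 x 10^236331 years
```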

 ×

Section 7.2

2. Assume we have an optimization problem with a non-symmetric connection matrix W, where w_ij ≠ w_ji. Then

E = (1/2)(2E) = (1/2) [ −(1/2) Σ_{i,j=1}^N w_ij s_i s_j − (1/2) Σ_{i,j=1}^N w_ji s_i s_j ] = −(1/2) Σ_{i,j=1}^N (1/2)(w_ij + w_ji) s_i s_j.

We define a new weight matrix

ŵ_ij = (w_ij + w_ji)/2,

which has the property ŵ_ij = ŵ_ji. So the original optimization problem is equivalent to an optimization problem with a symmetric connection matrix Ŵ.
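As an added numerical check (the matrix and states below are assumed values), symmetrizing the connection matrix leaves the energy of every configuration unchanged:

```python
import numpy as np

# The energy -1/2 sum_ij w_ij s_i s_j is unchanged when W is replaced by its
# symmetric part W_hat = (W + W^T)/2, for any configuration s of +/-1 states.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 5))             # non-symmetric connection matrix (toy)
np.fill_diagonal(W, 0.0)
s = rng.choice([-1.0, 1.0], size=5)     # an arbitrary configuration

W_hat = 0.5 * (W + W.T)                 # symmetrized weights
E     = -0.5 * s @ W @ s
E_hat = -0.5 * s @ W_hat @ s
print(np.isclose(E, E_hat))             # True: the two energies agree
```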

3. Consider Fig. 7.2 in the text.

(a) In the discrete optimization problem, the variables take only the values ±1, yielding isolated points in the space. Therefore the energy is defined only on these points, and there is no continuous landscape as shown in Fig. 7.2 in the text.

(b) In the discrete space, all variables take ±1 values. In other words, all of the feasible solutions are at the corners of a hypercube, and are of equal distance to the “middle” (the center) of the cube. Their distance from the center is √N.

(c) Along any “cut” in the feature space parallel to one of the axes the energy will be monotonic; in fact, it will be linear. This is because all other features are held constant, and the energy depends monotonically on a single feature. However, for cuts not parallel to a single axis (that is, when two or more features are varied), the energy need not be monotonic, as shown in Fig. 7.6.

4. First, we recall that each variable can take two values, ±1. The total number of configurations is therefore 2^N. An exhaustive search must calculate the energy for all configurations. For N = 100, the time for an exhaustive search is

2^100 configurations × 10^−8 second/configuration ≈ 1.27 × 10^22 seconds.

We can express this time as

1.27 × 10^22 seconds × (1 minute/60 seconds) × (1 hour/60 minutes) × (1 day/24 hours) × (1 year/365 days) ≈ 4.0 × 10^14 years,

which is ten thousand times larger than 10^10 years, the age of the universe. For N = 1000, a similar argument gives the time as 3.4 × 10^285 years.

5. We are to suppose that it takes 10^−10 seconds to perform a single multiply-accumulate.

(a) Suppose the network is fully connected. Since the connection matrix is symmetric and w_ii = 0, we can compute E as −Σ_{i>j} w_ij s_i s_j, and thus the number of multiply-accumulate operations needed for calculating E for a single configuration is N(N − 1)/2. We also know that the total number of configurations is 2^N for a network of N binary units. Therefore, the total time t(N) for an exhaustive search on a network of size N (in seconds) can be expressed as

t(N) = 10^−10 second/operation × N(N − 1)/2 operations/configuration × 2^N configurations = 10^−10 N(N − 1) 2^{N−1} seconds.

[Figure: log t(N) versus log N]

(b) See figure.

(c) We need to find the largest possible N such that the total time is less than or equal to the specified time. Unfortunately, we cannot solve the inverse of t(N) in closed form. However, since t(N) is monotonically increasing with respect to N, a simple search can be used to solve the problem; we merely increase N until the corresponding time exceeds the specified time. A straightforward calculation for a reasonable N would overflow on most computers if no special software is used. So instead we use the logarithm of the expression, specifically

t(N) = 10^{log10 [ 10^−10 N(N−1) 2^{N−1} ]} = 10^{(N−1) log 2 + log N + log(N−1) − 10}.

A day is 86400 seconds, within which an exhaustive search can solve a network of size 20 (because 5.69 × 10^4 ≈ t(20) < 86400 < t(21) ≈ 6.22 × 10^5). A year is 31557600 ≈ 3.16 × 10^7 seconds, within which an exhaustive search can solve a network of size 22 (because 7.00 × 10^6 ≈ t(22) < 3.16 × 10^7 < t(23) ≈ 8.10 × 10^7). A century is 3155760000 ≈ 3.16 × 10^9 seconds, within which an exhaustive search can solve a network of size 24 (because 9.65 × 10^8 ≈ t(24) < 3.16 × 10^9 < t(25) ≈ 1.18 × 10^10).
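The log-domain search described in part (c) can be written compactly; the following sketch is an added illustration (not the text's program) that implements the t(N) of part (a) and returns the largest N whose time fits a given budget.

```python
import math

# Largest network size N whose exhaustive-search time
#   t(N) = 1e-10 * N * (N - 1) * 2**(N - 1)  seconds
# fits within a time budget; the comparison is done in log10 to avoid overflow.
def log10_t(n):
    return (n - 1) * math.log10(2) + math.log10(n) + math.log10(n - 1) - 10

def largest_n_within(budget_seconds):
    n = 2
    while log10_t(n + 1) <= math.log10(budget_seconds):
        n += 1
    return n

for name, budget in [("day", 86400), ("year", 31557600), ("century", 3155760000)]:
    print(name, largest_n_within(budget))
```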



6. Notice that for any configuration γ, E_γ does not change with T. It is sufficient to show that for any pair of configurations γ and γ′, the ratio of the probabilities P(γ) and P(γ′) goes to 1 as T goes to infinity:

lim_{T→∞} P(γ)/P(γ′) = lim_{T→∞} e^{−E_γ/T} / e^{−E_{γ′}/T} = lim_{T→∞} e^{(E_{γ′} − E_γ)/T}.

Since the energy of a configuration does not change with T, E_γ and E_{γ′} are constants, so we have

lim_{T→∞} (E_{γ′} − E_γ)/T = 0   and therefore   lim_{T→∞} e^{(E_{γ′} − E_γ)/T} = 1,

which means the probabilities of being in different configurations are the same if T is sufficiently high.
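This limit is easy to see numerically; the sketch below is an added illustration (the energies are arbitrary assumed values) showing the Boltzmann probabilities flattening toward the uniform distribution as T grows.

```python
import numpy as np

# Boltzmann probabilities P(gamma) = exp(-E_gamma / T) / Z approach uniformity
# as T increases; the energies below are assumed toy values.
E = np.array([0.0, 1.0, 2.5, 4.0])

for T in (0.5, 5.0, 50.0, 500.0):
    p = np.exp(-E / T)
    p /= p.sum()
    print(T, np.round(p, 3))   # spread shrinks toward 1/len(E) as T increases
```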


7. Let k_uN denote the number of magnets pointing up in the subsystem of N magnets and k_dN the number pointing down.

(a) The number of configurations is given by the number of ways of selecting k_uN magnets out of N and letting them point up, that is, the binomial coefficient

K(N, E_N) = C(N, k_uN).

We immediately have the following set of equations for k_uN and k_dN:

k_uN + k_dN = N
k_uN − k_dN = E_N.

The solution is given by k_uN = (1/2)(N + E_N) and k_dN = (1/2)(N − E_N). Therefore we have

K(N, E_N) = C(N, k_uN) = C(N, (1/2)(N + E_N)).

(b) We follow the approach in part (a), and find

K(M, E_M) = C(M, k_uM) = C(M, (1/2)(M + E_M)).

(c) Problem not yet solved
(d) Problem not yet solved
(e) Problem not yet solved
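For concreteness, the count in part (a) can be evaluated directly with a binomial coefficient; the sketch below is an added illustration, with N = 10 and the values of E_N chosen arbitrarily.

```python
from math import comb

# Number of configurations of N two-state magnets with a given "spin excess"
# E_N = (#up - #down), as in part (a): K(N, E_N) = C(N, (N + E_N)/2).
def K(N, E_N):
    if (N + E_N) % 2:                # N and E_N must have the same parity
        return 0
    return comb(N, (N + E_N) // 2)

print(K(10, 2))    # 6 up and 4 down -> C(10, 6) = 210
print(K(10, 0))    # balanced case  -> C(10, 5) = 252
```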



− −

8. Consider a single magnet in an applied field.

(a) The two states γ′ = 0 and γ′ = 1 correspond to the two states of the binary magnet, s = +1 and s = −1, respectively. For γ′ = 0 or s = +1, E_γ′ = E0; for γ′ = 1 or s = −1, E_γ′ = −E0. Therefore, according to Eq. 3 in the text, the partition function is

Z(T) = e^{−E0/T} + e^{−(−E0)/T} = e^{−E0/T} + e^{E0/T}.

(b) According to Eq. 2 in the text, the probabilities of the magnet pointing up and pointing down are

P(s = +1) = e^{−E0/T}/Z(T)   and   P(s = −1) = e^{E0/T}/Z(T),

respectively. We use the partition function obtained in part (a) to calculate the expected value of the state of the magnet, that is,

E[s] = P(s = +1) − P(s = −1) = (e^{−E0/T} − e^{E0/T}) / (e^{−E0/T} + e^{E0/T}) = tanh(−E0/T),

which is indeed in the form of Eq. 5 in the text.
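A quick numerical check of part (b) follows; this is an added sketch, and E0 and T are arbitrary assumed values.

```python
import numpy as np

# With Z(T) = exp(-E0/T) + exp(E0/T), the expectation P(s=+1) - P(s=-1)
# equals tanh(-E0/T), as derived in part (b).
E0, T = 1.3, 0.7
Z = np.exp(-E0 / T) + np.exp(E0 / T)
expectation = (np.exp(-E0 / T) - np.exp(E0 / T)) / Z
print(np.isclose(expectation, np.tanh(-E0 / T)))   # True
```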


E

(c) We want to calculate the expected value of the state, [s]. We assume the other N 1 magnets produce an average magnetic field. In another word, we consider the magnet in question, i, as if it is in an external magnetic field of the same strength as the average field. Therefore, we can calculate the probabilities in a similar way to that in part (b). First, we need to calculate the energy associated with the states of i. Consider the states of the other N 1 magnets are fixed. Let γi+ and γ i− denote the configuration in which s i = +1 and s i = 1, respectively, while the other N 1 magnets take the fixed state values. Therefore, the energy









for configuration γ i+ is Eγ + = i



1 2

N



wjk sj sk =

j,k=1

  − 1 2

N

wkj sj sk +



k,j =i



  N

wki sk +

wij sj

j=1

k=1

since si = +1. Recall that the weig ht matrix is symmetric, that is, wki = w ik , and thus we can write N

N



wki sk =

k=1



N

wik sk =



wij sj ,

j=1

k=1

where the last equality is by renaming dummy summation indices. Therefore the energy is N



Eγ + = i

wij sj

− 12

 

wjk sj sk =

−l + C, i



j =1

j,k=i

 

where, as given by Eq. 5 in the text, l i =



wij sj , and C =

j

1 2



a constant independent of i. Similarly, we can obtain N

Eγ − = i

wij sj

j=1

− 12

wjk sj sk is



j,k=i

wjk sj sk = li + C



j,k=i

Next, we can apply the results from par ts (a) and (b). The partition function is thus Z (T ) = e

−E

+ /T γ i

−E

+e

γ

− /T

i

= e (li −C )/T + e(−li −C )/T = (eli /T + e−li /T )/E C/T .

Therefore, according to Eq. 2 in the text, the expected value of the state of magnet i in the average field is Eγ + /T

E [s ] i

= P r(si = +1)(+1) + P r(si = =

li /T e(li −C )/T e(−li −C )/T eli /T = l /T (eli /T + e−li /T )/E C/T e i + e−li /T



Eγ − /T

−1)(−1) = e− Z−(Te)− − e− = tanh(l /T ). i

i

i

9. The two input nodes are indexed as 1 and 2, as shown in the figure. Since the input units are supposed to have fixed state values, we can ignore the connection between the two input nodes. For this assignment, we use a version of Boltzmann networks with biases, in another word, there is a node with constant value +1 in the network.

CHAPTER 7. STOCHASTIC METHODS

260

Suppose the constant node is numbered as 0. The representation power of Boltzmann networks with biases is the same as those without biases. This can be shown as follows. First, each network without the constant unit can be converted to a network with the constant unit by including the constant node 0 and assigning the connection weights w0j a zero value. Second, each network with the constant unit can also be converted to a network without the constant unit by replacing the constant unit with a pair of units, numbered 1 and 2. Node 1 assumes the same connection weights to other units as the constant unit 0, while node 2 is only connected to node 1. The









− −



connection weights between nodes 1 and 2 are a vary large positive number M . In this way, the nodes 1 and 2 are forced to have the same value in the configurations with the lowest energy, one of which requires the value be +1.





3 3

3

4

1

2

4

5 1

2

1

2

Next, we are going to determine the set of connection weights between pairs of units. In the follo ws, we use a vector of state values of all units to indicate the corresponding configuration. (a) For the exclusive-OR problem, the input-output relationship is as follows: Input 1 +1 +1 1 1

− −

Input 2 +1 1 +1 1

Output (3) 1 +1 +1 1



− −



Since the Boltzmann network always converges to a minimum energy configuration, it solves the exclusive-OR problem if and only if the following inequalit ies hold: E{+1,+1,−1} E{+1,−1,+1} E{−1,+1,+1} E{−1,−1,−1}

< < < ln 2. Thus if x21 > ln 2, the maximum-likelihood estimate of P (ω1 ) is Pˆ (ω1 ) = 1. (c) If there is a single sample, we are forced to infer that the maximu m-likelihood estimate for the prior of one of the categories is 1.0, and the other is 0.0, depending upon which density is greater at the sampled point. 13. We assume that ther real line R 1 can be divided into c non-overlapping intervals Ω1 ,..., Ωc such that cj=1 Ωj = R 1 (the real line) and Ω j Ωj  = for j = j  . We are given that at any point x, only one of the Gaussian functions differs significantly from zero, and thus





|

p(x µ1 ,...,µ

c)



) 1  P√(ω exp − (x − µ ) 2σ 2πσ j

j

2



2





.

We thus have for some particular j that depends upon x n

1 ln p(x1 ,...,x n

n

|µ ,...,µ 1

 |  |    √  −   − c)

=

1 ln p(xk µ1 ,...,µ n k=1

=

1 n

c

ln p(xk µ1 ,...,µ

c)



j=1 k:xk Ωj c

1 n

c)

P (ωj ) e−(xk −µj )2 /(2σ2 ) . 2πσ

ln



j=1 k:xk Ωj



We take logarithms on the right-hand side, rearrange, and find 1 ln p(x1 ,...,x n

n

|µ ,...,µ 1

c)

1 n

=

1 n

c

ln P (ωj )



j=1 k:xk Ωj c

ln P (ωj )

j=1

1



k:xk Ωj

1 ln (2πσ 2 ) 2

− 2σ1

−µ )

(xk 2

j

c



11 ln (2πσ 2 ) n2 j=1 k:x

k

∈Ω

1 j

2



PROBLEM SOLUTIONS

317 c



− n1 1 n

=

j=1 k:xk Ωj



c



P (ωj )nj

j=1

1 (xk 2σ 2

−µ ) j

2

− 12 ln (2πσ ) − n1 2

c



(xk



j=1 k:xk Ωj



k:xk Ωj



max

µ1 ,...,µc c

1 n



1 ln p(x1 ,...,x n



nj ln P (ωj )

j=1

n

|µ ,...,µ

c)

1

− 12 ln (2πσ ) + n1 2

c

 − max µj

j=1

[ (xk



k:xk Ωj

2

− µ ) ]. j

However, we note the fact that max µj

−

2

−µ ) ]

[ (xk

j



k:xk Ωj

occurs at µ ˆj



xk



k:xk Ωj

=

1



k:xk Ωj

 ∈

k:xk Ωj

=

xk

nj

= x¯j , for some interval, j say, and thus we have max

µ1 ,...,µc



1 n

=

1 n

1 p(x1 ,...,x n

n

n

 

nj ln P (ωj )

j=1 c

nj ln P (ωj )

j=1

|µ ,...,µ

c)

1

− 12 ln (2 πσ ) − 2σ1 2

− 12 ln (2 πσ ) − 2σ1

2

2

2

1 n 1 n

c

 −   (xk

x ¯ j )2



j=1 k:xk Ωj c

nj

j=1

1 (xk nj k :xk ∈Ωj

2

− x¯ ) . j

→∞

Thus if n (i.e., the number of independently drawn samples is very large), we have nj /n = the proportion of total samples which fall in Ω j , and this implies (by the law of large numbers) that we obtain P (ωj ). 14. We let the mean value be denoted ¯= x

1 n

n



xk .

k=1

Then we have n



1 (xk n k=1

j

1 is the number of points in the interval Ωj . The result above implies

where nj =

n

− x) Σ− (x − x) t

1

k

=



1 (xk n k=1

− x¯ + x¯ − x) Σ− (x − x¯ + x¯ − x) t

1

k

2

−µ ) ,

318CHAPTER 10. UNSUPERVISED LEARNING AND CLUSTERING

 n

1 n

=

− x¯) Σ− (x − x¯ ) k

n

− x) Σ− t

+2(¯ x

1

t

(xk

k=1

1

n



(xk

k=1

− x¯) + n(¯x − x) Σ− )(¯x − x) t

=

1 ¯ )t Σ−1 (xk x ¯ ) + (¯ x (xk x n k=1



1 n

n



(xk

k=1

1

− x) Σ− (¯x − x) 1

t

− x¯ ) Σ− (x − x¯), 1

t

k

where we used n



n

(xk

k=1

− x¯) =



xk

k=1

− n¯x = n¯x − n¯x = 0.

Since Σ is positive definite, we have

− x) Σ− (¯x − x) ≥ 0,  ¯x. Thus with strict inequality holding if and only if x = 1

t

x (¯

n

1 n

(xk

− x) Σ− (x − x) 1

t

k

k=1



is minimized at x = ¯x, that is, at x = ¯x =

1 n

n



xk .

k=1

15. Problem not yet solved
16. The basic operation of the algorithm is the computation of the distance between a sample and the center of a cluster, which takes O(d) time since each dimension must be compared separately. During each iteration of the algorithm, we have to classify each sample with respect to each cluster center, which amounts to a total of O(nc) distance computations, for a total complexity of O(ncd). Each cluster center then needs to be updated, which takes O(d) time for each cluster; therefore the update step takes O(cd) time. Since we have T iterations of the classification and update steps, the total time complexity of the algorithm is O(Tncd). 17. We derive the equations as follows. (a) From Eq. 14 in the text, we have

|

ln p(xk ωi , θ i ) = ln

|Σ− | − 1 (x − µ ) Σ− (x − µ ). 2 (2π) i

1 1/2 d/2

k

i

t

i

1

k

i

It was shown in Problem 11 that

|

∂ ln p(xk ωi , θi ) = ∂σ pq (i)

−  1

δ pq 2

[σpq (i)

− (x (k) − µ (i))(x (k) − µ (i))]. p

p

q

q



PROBLEM SOLUTIONS

319

We write the squared Mahalanobis distance as d

(xk

− µ ) Σ− (x − µ ) i

t

1

i

k

 − 

=

i

(xp (k)

 

µp (i))2 σ p p (i)

p=1

d

d

+

(xp (k)

p =1 q =1 p= q

− µ  (i))σ   (i)(x  (k) − µ  (i)).

The derivative of the likelihood with respect to the mean is

|

∂ ln p(xk ωi , θi ) ∂µ p (i)

−   − ∂ ∂µ p (i)

=

1 2 p

d

1 2 p

d

(xp (k)

=1

d

(xp (k)

 

=1

q =1 p=q

p

pq

p

(i)

pq

− µ  (i))σ

1 (xp (k) 2 p =1

− µ  (i))σ

q

pp

q



(i)( 1)

d

p

−1)

pq (i)(



p =p

d

=

(xq (k)

q=1

− µ (i))σ q

pq

(i).

ˆ i must satisfy From Eq. xxx in the text, we know that Pˆ (ωi ) and θ n



1 Pˆ (ωi ) = Pˆ (ωi xk , θˆi ) n k=1

|

and n

ˆi) = 0 Pˆ (ωi xk , θˆi ) θi ln p(xk ωi , θ

|

k=1

or

 



n

|



|

∂ ln p(xk ωi , θ i ) Pˆ (ωi xk , θˆi ) ˆ = 0. ∂µ p (i) θ i =θ i k=1

|

We evaluate the derivative, as given above, and find n



k=1

d

Pˆ (ωi xk , θˆi )

|

 q=1

(xq (k)



− µ  (i))σ   (i)(x  − µ  (i))

(xq (k)

 

2 pp

p

p

q =1 q =p

q

− µ  (i)) σ   (i)

d

 −   − 12

q

pth coordinate of the ith

− 12 2(x (k) − µ (i))(−1)σ

=

pq

p

− µˆ (i))ˆσ q

pq

(i) = 0,

q

320CHAPTER 10. UNSUPERVISED LEARNING AND CLUSTERING which implies n



ˆ i )Σ−1 (xk Pˆ (ωi xk , θ i

|

k=1

− µˆ ) = 0. i

ˆ i and find We solve for µ n k=1

µ ˆi =



Furthermore, we have n

 | − 

ˆ i )xk Pˆ (ωi xk , θ

|

n k =1 P ˆ (ωi

.

|x , θˆ ) k

i



|

ˆ i ) ∂ ln p(xk ωi , θ i ) Pˆ (ωi xk , θ ˆ = 0, ∂σ pq θ i =θ i k=1 and this gives n



δ pq 2

ˆi) 1 Pˆ (ωi xk , θ

|

k=1

[ˆσpq (i)

− (x (k) − µˆ (i))(x (k) − µˆ (i))] = 0. p

p

q

q

We solve for ˆσpq (i) and find σ ˆpq (i) =



n k=1

and thus ˆi = Σ

ˆ i )[(xp (k) µ Pˆ (ωi xk , θ ˆ p (i))(xq (k) n ˆi) ˆ (ωi xk , θ P k=1

|

− |

 |



n k=1

ˆ i )(xk µ ˆ i )(xk Pˆ (ωi xk , θ n ˆi) Pˆ (ωi xk , θ

|

k=1

− µˆ (i))] q

t

− µˆ ) . i

(b) For the conditions of this part, we have σ i = σ i2 I and thus

|

ln p(xk ωi , θi ) = ln

1 d

(2πσi2 ) 2

− 2σ1

2 (xk i

t

− µ ) (x − µ ), i

k

and thus it is easy to verify that

∇µ ln p(x |ω , θ ) = σ1 (x − µ ). k

i

i

i

2 i

k

i

and hence n

ˆ i ) 1 (xk Pˆ (ωi xk , θ 2 k =1 σi Thus the mean

|



µ ˆi =

does not change. This implies

|



µ ˆ i ) = 0.



n ˆ ˆ k=1 P (ωi xk , θ i )xk n ˆ ˆ k=1 P (ωi xk , θ i )

∂ ln p(xk ωi , θi ) = ∂σ i2

|

− 2σd

2 i

|

+

1 xk 2σi4

2

|| − µ || . i

i

PROBLEM SOLUTIONS

321

We have, moreover,



n



 

|

ˆ i ) ∂ ln p(xk ωi , θi ) Pˆ (ωi xk , θ ˆ ∂σ i2 θ i =θ i k=1

|

and thus n

d

ˆi) Pˆ (ωi xk , θ

|



k =1

and this gives

σ ˆi2

1

+

− 

1 d

2σˆi4 xk



i 2

n ˆ ˆ k=1 P (ωi xk , θ i ) xk n ˆ ˆ k=1 P (ωi xk , θ i )

|

|

0

= 0

|| − µˆ ||

2σˆi2

=

=

|| − µˆ || i

2

.

·· ·

(c) In this case, the covariances are equal: Σ1 = Σ 2 = = Σ c = Σ. Clearly the maximum-likelihood estimate of µ i , which does not depend on the choice of Σ i , will remain the same. Therefore, we have ˆi = µ The log-likelihood function is



n ˆ ˆ k=1 P (ωi xk , θ i )xk . n ˆ ˆ k=1 P (ωi xk , θ i )

|

|

n

l

=

   n

=



c

ln

|

p(xk ωj , θ j )P (ωj )

j=1

k=1 n

=

| 

ln p(xk θ)

k=1

c

ln

k=1

| Σ− | P (ω )

1 1/2

j

j=1

(2π)d/2

exp

−

1 (xk 2

−µ

j

) Σ−1 (xk − µ t

j)

and hence the derivative is ∂l ∂σ pq

n

=

 |  |  |  | −  k=1 n

=

k=1

= =

∂ ln p(xk θ ) ∂σ pq 1 ∂ p(xk θ) p(xk θ ) ∂σ pq

n

c

n

j=1 c

1 p(xk θ ) k=1

k=1

1 p(xk θ )

|

P (ωj ) ∂pq p(xk ωj , θj ) ∂σ

|

|

P (ωj )p(xk ωj , θj )

j=1

∂ ln p(xk ωj , θ j ) ∂σ pq

|

However, we have from Problem 17,

|

∂ ln p(xk ωj , θ j ) = ∂σ pq

1

δ pq 2

[σpq

− (x (k) − µ (j))(x (k) − µ (j))] . p

p

q

q



322CHAPTER 10. UNSUPERVISED LEARNING AND CLUSTERING We take derivatives and set



∂l =0 ∂σ pq θ =θˆ and consequently n

ˆ k=1 p(xk θ )

|



− 

c

1

δ pq 2

ˆ) 1 P (ωj )p(xk ωj , θ

|

j=1



[σˆpq

− (x (k) − µˆ (j))(x (k) − µˆ (j))] = 0 . p

p

q

q

This gives the equation n



k=1

n

=



k=1

c

1 ˆ) p(xk θ

|

 j=1

j

We have, therefore, σ ˆ pq = c



ˆ) = P (ωj )p(xk ωj , θ

ˆ) p(xk θ

k



j

p

p

n k=1 1,

j

k

j

j

j=1

q

q

since

ˆ) p(xk , θ

|

j=1

c



1

ˆ ) P (ω )p(x |ω , θ | ˆ ) [(x (k) − µˆ (i))(x (k) − µˆ (i))] . P (ω )p(x |ω , θ σ ˆpq

n

c



=

ˆ j )[(xp (k) Pˆ (ωj xk , θ

|

k=1 j=1

− µˆ (i))(x (k) − µˆ (i))] q

q

q

and this implies that the estimated covariance matrix is n

c



ˆ j )(xk ˆ = 1 Σ Pˆ (ωj xk , θ n k=1 j=1

|

t

− µˆ )(x − µˆ ) . j

k

j

Section 10.5 18. We shall make use of the following figure, where the largest circle repres ents all possible assignments of n points to c clusters, and the smaller circle labeled C1 represents assignments of the n points that do not have points in cluster C1 , and likewise for C2 up through C c (the case c = 4 is shown). The gray region repre sents those assigments that are valid, that is, assignments in which none of the Ci are empty. Our goal is then to compute the number of such valid assignments, that is the cardinality of the gray region. (a) First the card inality of invalid assigments, that is the number of assignmentswe in find the white region: c

c

c

 | | −  | ∩ |  | ∩ ∩ |−···±  | ∩ ∩ · · · |   −     −···±     − Ci

Ci



i=1

=

Cj +

c N1 1

c

=

i=1

Ci

Cj

Ck



i=j

c c N2 + N3 2 3

c ( 1)c−i+1 Ni , i

Ci



i =j =k

i=j =k...

c Nc c

Cj

Ck

c terms

Cq

PROBLEM SOLUTIONS

323 S

assignments that leave no points in C 1 C1

C2

C3

C4

valid assignments: ones that have points in every Ci

where we define Ni to be the number of distinct assignments of n points that leave exactly i (labeled) clusters empty. Since each of the n points can be assigned to one of i 1 clusters, we have Ni = (i 1)n . We put these intermediate results together and find that the number of invalid assignments (where the clusters are labeled) is





  − c

i=1

c ( 1)c−i+1 (i i

n

− 1) .

But the above assumed that the clusters were labeled, that is, that we could distinguish between C1 and C2 , for instance. In clustering, no such labeling is given. Thus the above overcounts by the number of ways we can assign c labels to c clusters, that is, by c!. Therefore, the cardinality of the white area in the figure (the number of invalid assignments) is the above result divided by c!, that is 1 c!

  − c

i=1

c ( 1)c−i+1 (i i

n

− 1) .

The number of valid assignments (the gray region in the figure) is thus the total number of assignments (indicated by S in the figure) minus the number of invalid assignments (the white region). This total number is cc in . To find the total number of valid assignments, we subtract:



  −   − c n i c

We perform a substitution i

c

c ( 1)c−i+1 (i i

i=1

n

− 1) .

← i − 1 and regroup to obtain our final answer: 1 c!

c

−

( 1)c−i in .

i=1

(b) For the case c = 5 and n = 100 the number of clusterings is 1 5!

  − · 5

i=1

5 ( 1)5−i i100 i



1 5 1100 + 10 2100 + 10 3100 + 5 4100 + 1 5100 5! = 65738408701461898606895733752711432902699495364788241645840659777500 6.57 1067 . =



×

·

·

·

·

324CHAPTER 10. UNSUPERVISED LEARNING AND CLUSTERING (c) As given in part (a), the number of distinct clust erings of 1000 points into 10 clusters is 1 10!

  − 10

i=1

10 ( 1)10−i i1000 . i

Clearly, the term that dominates this is the case i = 10, since the 10 1000 is larger than any other in the sum. We use Stirling’s aproximation, x! x x e−x 2πx, which is valid for large x. Thus we approximate





1 1000 10 10!

1010 990 10 10! 10 e 10990 2π10 2778 10990 2.778 10993 .

=

 √  ×  ×

Incidentally, this result approximates quite well the exact number of clusterings, 275573192239858906525573192239858906525573191758192446064084102276 241562179050768703410910649475589239352323968422047450227142883275 301297357171703073190953229159974914411337998286379384317700201827 699064518600319141893664588014724019101677001433584368575734051087 113663733326712187988280471413114695165527579301182484298326671957 439022946882788124670680940407335115534788347130729348996760498628 758235529529402633330738877542418150010768061269819155260960497017 554445992771571146217248438955456445157253665923558069493011112420 464236085383593253820076193214110122361976285829108747907328896777 952841254396283166368113062488965007972641703840376938569647263827 074806164568508170940144906053094712298457248498509382914917294439 593494891086897486875200023401927187173021322921939534128573207161 833932493327072510378264000407671043730179941023044697448254863059 191921920705755612206303581672239643950713829187681748502794419033 667715959241433673878199302318614319865690114473540589262344989397 5880 2.756 10993 .



×

Section 10.6 19. We show that the rankin g of distances betw een samples is invariant to any monotonic transformation of the dissimilarity values.

D D D

D D

(a) We define the value vk at level k to be min δ ( i , j ), where δ ( i , j ) is the dissimilarity between pairs of clusters i and j . Recall too that by definition in hierarchical clustering if two clusters are joined at some level (say k), they

D

PROBLEM SOLUTIONS

325

remain joined for all subsequent levels. (Below, we assume that only two clusters are merged at any level.) Suppose at level k that two clusters, Then we have vk+1 =

min [δ ( D D

i, j distinct at k

= min

= min

 

D and D are merged into D∗ at level k + 1.

D , D )] i

j

D , D )],

min [δ (

D D  D∗

i, j = distinct at k

i

min

D D  D D

i, j = , are distinct at k

j

[δ ( −1

min [δ (

D  D∗

i= is cluster at k

D , D )], i

j

D , D )] i

min

D  D D

i= , is cluster at k

j





[δ (Di , D ∪ D )]

−1

If δ = δ min , then we have min

D  D D

i= , is cluster at k

[δ ( −1

D , D ∪ D )] i

=

min

D  D D

i= , is cluster at k

=

min δ (x, x )

x∈Di −1 x ∈D∪D 

min

D  D D

i= , is cluster at k

=

min −1

min[δ ( −1

min

, i j are distance at



D D



[δ ( k

min δ (x, x ), ∈Di  ∈D

x x

δ (x, x )

min ∈D

x x

 ∈Di

D , D), δ(D , D )]

min

D  D D

i= , is cluster at k

.

−1

i,

i

i

j] .

D D

Therefore we can conclude that vk+1 = min



min

D D  D D

i, j = , are distinc t at k

−1

≥ v ≥ v for all k . A similar argument goes through for and thus v ≥ v for all k, and thus (b) From part (a) we have v 0 = v ≤ v ≤ v ≤ ··· ≤ v , k

k+1

k

k+1



[δ (Di , Dj ), δ (Di , D), δ (Di , D  )]

δ = δ max .

k

1

2

3

n

so the similarity values retain their ordering. Section 10.7 20. It is sufficient to derive the equation for one single cluster, which allows us to drop the cluster index. We have

 − x

∈D

x

m



2

=

  −      −  x

∈D

x

=

∈D

x

1 x n  x ∈D

1 x n x ∈D

x

2

2



326CHAPTER 10. UNSUPERVISED LEARNING AND CLUSTERING =

1 n2

=

1 n2

=

   −    − −    − −   − − − − −    −  − −    −     − − −        −  x

∈D

x

2

x

x

∈D

x )t (x

(x

x )

∈D x ∈D x ∈D

x

1 1 n2 2

x )t (x

(x

x )

∈D x ∈D x ∈D

x

t

+

=

1 1 n2 2

x )t (x

(x

1 1 n2 2 +

x ) + (x

x )]

x )

∈D x ∈D x ∈D

x

x

x

+

=

x ) [(x

(x x∈D x ∈D x ∈D

2

+ (x

x )t (x

x )

x∈D x ∈D x ∈D

x

x

2

x∈D x ∈D x ∈D

1 1 n2 2

x )t (x

(x

∈D x ∈D x ∈D

x )

∈D x ∈D x ∈D

x

x

(x

− x) (x − x) t

=0

= =

1 1 n 2 n2

x

x

2

∈D x ∈D

x

1 n¯ s. 2

21. We employ proof by contradiction. From Problem 14 with Σ = I, we know that for a non-empty set of samples i that

D

 −    − x

αi

2

∈D

x

i

D . Now our criterion function is m .

is minimized at x = m i , the mean of the points in x

Je =

D =φ x∈D i

i

2

i

i



Suppose that the partition minimizing J e has an empty subset. Then n c and there must exist at least one subset in the partition containing two or more samples; we call that subset j . We write this as

D

D where

D

j1

and

D

j2

D   −  − j

=

D

j1

are disjoint and non-empty. Then we conclude that Je =

D =φ x∈D i

can be written as

j2 ,

Je = A +

x

∈D

j

x

mi

x

mj

i



2

2

,

 

PROBLEM SOLUTIONS

327

where A is some constant. However, we also have

 − x

∈D

x

mj

j



2

 −  − x

=

mj

∈D

x

j1

x

>



mj1

∈D

x

Thus, replacing

j

by

D

j1

and

j2

D

j1

2

 −  − x

+

mj

∈D

x

2



+

j2

x



mj2 2 .



∈D

x

2

j2

and removing the empty subset yields a partition

D

into c disjoint subsets that reduces J e . But by assumptio n the partition chosen minimized J e , and hence we have a contradiction. Thus there can be no empty subsets in a partition that minimizes J e (if n c). In short, we should part ition into as many subsets are allowed, since fewer subsets means larger dispersion within subsets. 22. The figure shows our data and terminology.



D11

Partition 1

D12

k

k 1 -2

Partition 2

-1

0

1

D21

a

2

x

D22

(a) We have n = 2k + 1 points to be placed in c = 2 clusters and of course the clusters should be non-empty. Thus the only two cases to consider are those shown in the figure above. In Partition 1 , we have

−2k + 0 · k = −1,

m11 m12

= = a,

2k

and thus the value of our cluster criterion function is Je1

 −  − 

=

(x

m11 )2 +

∈D

x

11

k

k

2

=

i=1

(x

12

−m

i=1

k + k + 0 = 2 k.

In Partition 2 , we have m21

=

m22

=

−2, k·0+a = k+1

a , k+1

and thus the value of our cluster criterion function is Je2

=



∈D

x

21

(x

2 21 )

−m

+

 ∈D

x

12 )

(0 + 1)2 + (a

( 2 + 1) +

=

∈D

x

22

(x

−m

22 )

2

2

2

− a)

328CHAPTER 10. UNSUPERVISED LEARNING AND CLUSTERING ( 2 + 2)2 +

∈D

x

 −   −  k

−

=

a k+1

0

i=1

21

2

a k+1

+ a

2

a2 a2 k 2 0+ k+ (k + 1) 2 (k + 1) 2 a2 k(k + 1) a2 k = . 2 (k + 1) k+1

= =

Thus if J e2 < Je1 , that is, if a 2 /(k + 1) < 2k or equivalently a 2 < 2(k + 1), then the partition that minimizes Je is Partition 2, which groups the k samples at x = 0 with the one sample at x = a. (b) If Je1 < Je2 , i.e., 2 k < a 2 /(k+1) or equivalently 2(k+1) > a2 , then the partition that minimzes J e is Partition 1, which groups the k-samples at x = 2 with the k samples at x = 0.



23. Our sum-of-square (scatter) criterion is tr[ SW ]. We thus need to calculat e SW , that is, c

SW

=

 − −  − −   −  −  − (x

m i )t

mi )(x

∈D

i=1 x c

=

i

[xxt

m i xt

∈D

i=1 x c

=

c

c

xxt

∈D

i=1 x 4

=

xmti + mi mt ]

i

m1 ni mti

i=1

i

c

ni mi mti +

i=1

ni mi mti

i=1

c

xk xtk



ni mi mti ,

i=1

k=1

D

where n i is the number of samples in

i

mi =

1 ni

and

 ∈D

x

x.

i

For the data given in the problem we have the following: 4



xk xtk

=

k=1

= =









4 (4 5) 1 (1 4) 0 (0 1) 5 (5 0) + + + 5 4 1 0 16 20

+

20 25 42 24 . 24 42

1

4

4

16

+

0

0

0

1

+

25

0

0

0

                

Partition 1: Our means are m1

=

m2

=

1 2 1 2

4 1 + 5 4 0 5 + 1 0

= =

5/2 , 9/2 5/2 , 1/2

PROBLEM SOLUTIONS

329

and thus the matrix products are m1 mt1 =





25/4 45/4

45/4 and m 2 mt2 = 81/4





25/4 5/4 . 5/4 1/4

Our scatter matrix is therefore SW =

42 24 24 42

25/4 45/4

−2

45/4 81/4

25/4 5/4 5/4 1/4

−2

17 1

1



17 1

−1 1

,

 − 

      −

and thus our criterion values are the trace tr[SW ] = tr

=

= 17 + 1 = 18 ,

1

and the determinant

|S | = 17 · 1 − (−1) · (−1) = 16. W

Partition 2: Our means are m1

=

m2

=

           

1 2 1 2

4 5 + 5 0 1 0 + 4 1

= =

9/2 , 5/2 1/2 , 5/2

and thus our matrix products are m1 mt1 =



81/4 45/4



45/4 and m 2 mt2 = 25/4





1/4 5/4 . 5/4 25/4

Our scatter matrix is therefore SW =

 −  42 24 24 42

2

81/4 45/4

−   − 45/4 25/4

2

1/4 5/4 5/4 25/4

  − =

1

−1

and thus our criterion values are the trace tr[SW ] = tr

17

−1

1 1

= 1 + 17 = 18 ,

and the determinant

|S | = 1 · 17 − (−1) · (−1) = 16. W

Partition 3: Our means are 1 m1 = 3 m2

            

=

4 1 0 + + 5 4 1

=

5/3 , 3

5 . 0

and thus our matrix products are m1 mt1 =

25/9 5 5 9

and m 2 mt2 =

25 0 . 0 0

1 , 17

330CHAPTER 10. UNSUPERVISED LEARNING AND CLUSTERING Our scatter matrix is therefore SW =

 −  −    42 24 24 42

25/9 5 5 9

3

25 0 0 0

1

=



26/3 22/3

22/3 , 26/3

and thus our criterion value s are tr S W = tr

−1

17 1

= 26/3 + 26/3 = 17 .33,

1

− 

and

|S | = 26/3 · 26/3 − 22/3 · 22/3 = 21 .33. W

We summarize our results as

| |

Partition 1 2 3

tr[SW ] SW 18 16 18 16 17.33 2 1.33

| |

Thus for the tr[ SW ] criterion Partition 3 is favored; for the SW criterion Partitions 1 and 2 are equal, and are to be favored over Partition 3. 24. Problem not yet solved 25. Consider a non-singular transformation of the feature space: y = Ax where A is a d-by-d non-singular matrix. (a) If we let ˜ i = Ax : x i denote the data set transformed to the new space, then the scatter matrix in the transformed domain can be written as

D {

∈D}

c

SyW

=

 −  −  −

=

myi )(y

(y

i=1 y c

˜i

∈D

(Ax

i=1 y c

y t i

−m )

Ami )(Ax

˜i

∈D

= A

(x

i=1 y

∈D˜

mi )(x

− Am ) i

t

t

−m ) i

i

where A t = AS W At . We also have the between-scatter matrix SyB

=

c

  

ni (myi

i=1 c

=

ni (Ami

i=1

y

y t

− m )(m − m ) i

t

− Am)(Am − Am) i

c

=

A

ni (mi

i=1

=

ASB At .

− m)(m − m) i

t



At

PROBLEM SOLUTIONS

331

The product of the inverse matrices is [SyW ]−1 [SyB ]−1

= = =

(ASW At )−1 (ASB At ) 1 −1 (At )−1 S− ASB At WA t −1 −1 (A ) SW S B At .

1 denote the eigenvalues of S − W S B . There exist vectors

We let λ i for i = 1,...,d zi ,..., zd such that

1 S− W S B zi = λ i zi ,

for i = 1,...,d , and this in turn implies 1 t t −1 (At )−1 S− zi W SB A A )

= λi (At )−1 zi ,

or SyW−1 SyB ui

=

λi ui ,

where ui = (At )−1 zi . This im plies that λ1 ,...,λ d are the eigenvalues of SyW−1 SyB , and finally that λ1 ,...,λ d are invariant to non-singular linear transformation of the data. (b) Our total scatter matrix is S T = S B + SW , and thus 1 S− T SW

= (SB + SW )−1 SW 1 −1 = [S− W (SB + SW )] − 1 − 1 = [I + SW S B ] .

1 If λ1 ,...,λ d are the eigenvalues of S− W S B and the u1 ,..., ud are the corre1 sponding eigenvectors, then S − and hence W S B ui = λ i ui for i = 1,...,d 1 ui + S− W S B u i = u i + λi u i .

This equation implies 1 [I + S− W S B ]ui = (1 + λi )ui . 1 −1 and find We multiply both sides of the equation by (1 + λi )−1 [I + S− W SB ] 1 −1 u i (1 + λi )−1 ui = [I + S− W SB ] 1 and this implies ν i = 1/(1 + λi ) are eigenvalues of I + S− W SB .

(c) We use our result from part (a) and find Jd =

|S | = |S− S | = |S | W T

1

T

W

d

d

  νi =

i=1

1 , 1 + λi i=1

which is invariant to non-singular linear transformations described in part (a).

332CHAPTER 10. UNSUPERVISED LEARNING AND CLUSTERING 26. Consider a non-singular transformation of the feature space: y = Ax, where A x is a d-by-d non-singular matrix. We let i = Ax : x . We have c

SyW



=

i=1 y c

∈D

D { ∈D } (y − m )(y − m ) y

i

=

(Ax i=1 x

i

y t

x i

x t i

− Am )(Ax − Am )

= ASxW∈D At .



In a similar way, we have

SyB Syt y −1 y (St ) SW (SyW )−1 SyB

= = = = = =

ASxB At ASxt At (At )−1 (Sxt )−1 A−1 ASxW At (At )−1 (Sxt )−1 SxW At (At )−1 (SxW )−1 A−1 ASxB At (At )−1 (SxW )−1 SxB At .

(a) From problem 25 (b), we know that d

tr[(Sxt )−1 SxW ] =

 

νi

i=1 d

=

i=1

1 1 + λi

B as well as tr[ B−1 SB], because they have the same eigenvalues so long as is non-singular. This is becaus e if Sx = νi x, then SBB−1 x = νi x, then also B−1 SBB−1 x = B −1 νi x = ν i B−1 x. Thus we have B−1 SB(B−1 x) = ν i (B−1 x). We put this together and find tr[(Syt )−1 SyW ] = tr[( At )−1 (Sxt )−1 SxW At ] = tr[( Sxt )−1 SxW ] d



=

i=1

1 . 1 + λi

(b) See Solution to Problem 25 part (c). (c) Here we have the determ inant

|(S

y 1 y SB W)



|

= = =

|(A )− (S t

  

i=1



x 1 x SB A t W)

|

eigenvalues of [(At )−1 (SxW )−1 SxB At ] eigenvalues of [(SxW )−1 SxB ]

d

=

1

λi .

PROBLEM SOLUTIONS

333

(d) The typical value of the criterion is zero or close to zero. This is because S B is often singular, even if samples are not from a subspace . Even when SB is not singular, some λi is likely to be very small, and this makes the product small. Hence the criterion is not always useful.

| |

27. Equation 68 in the text defines the criterion J d = SW = t

Si =

∈D (x − mi )(x − mi )



x

i

  

 

c

Si , where

i=1

is the scatter matrix for category ωi , defined in Eq. 61 in the text. We let T be a non-singular matrix and consider the change of variables x  = Tx. (a) From the conditions stated, we have



1 x ni  x ∈Di

mi =

where n i is the number of points in category ω i . Thus we have the mean of the transformed data is



1 ni

mi =

Tx = Tm i .

∈D

x

i

Furthermore, we have the transformed scatter matrix is Si

=

 −  −  − (x

mi )(x

(Tx

Tmi )(Tx

x ∈D

− m ) i

t

i

=

∈D

x

=

i

T

(x

x

mi )(x

∈D

i

− Tm ) i

t

−m ) i



t

Tt = TS i Tt .

(b) From the conditions state d by the problem, the criterion functi on of the transformed data must obey

  c

J  = |S d

W

|

=

= Therefore, J 

d

(c) Since J 

c

i

i=1

=

   |  

S

i=1

c

T Tt

||T||| J . 2

=

d

Si

t

 

   c

TSi T = T

i=1

Si



T

t

 

i=1

differs from J d only by an overall non-negative scale factor T 2 .

| |

| |

2

d differs from J d only by a scale factor of T (which does not depend on the partitioning into clusters) J d and J d will rank partitions in the same order. Hence the optimal clustering based on J d is always the optimal clustering based on J d . Optimal clustering is invariant to non-singular linear transformations of the data .

334CHAPTER 10. UNSUPERVISED LEARNING AND CLUSTERING 28. Consider a non-singular transformation of the feature space y = Ax, where A is x a d-by-d non-singular matrix. We let i = Ax : x be the transformed data i set. We then have the within scatter matrix as

D {

∈D }

c

SyW

=

 i=1 y c

∈D

y i

i

=

∈D





i

= ASxW At . In a similar way, we have

Amxi )(y

(Ax i=1 y

SyB Syt y −1 y (SW ) SB

y t i

− m )(y − m )

(y

Amxi )t



ASxB At ASxt At (At )−1 (SxW )−1 A−1 ASxB At (At )−1 (SxW )−1 SxB At .

= = = =

If λ is an eigenvalue of (SxW )−1 SxB with corresponding eigenvector x, that is, ( SxW )−1 SxB x = λx, then we have (SxW )−1 SxB (At (At )−1 ) x = λx,

   I

which is equivalent to

(SxW )−1 SxB At ((At )−1 x) = λx. We multiply both sides on the left by ( At )−1 and find (At )−1 (SxW )−1 SxB At ((At )−1 x) = (At )−1 λx, which yields (SyW )−1 SyB ((At )−1 x) = λ((At )−1 x). Thus we see that λ is an eigenvalue of ( SyW )−1 SyB with corresponding eigenvector (At )−1 x. 29. We consider the problems that might arise when using the determinant criterion for clustering. (a) Here the within-cl uster matrix for category i is



SiW =

(x

∈D

x

i

t

− m )(x − m ) , i

i

and is the sum of ni matrices, each of rank at most 1. Thus rank[ SiW ] These matrices are not independent; they satisfy the constraint



(x

∈D

x

i

− m ) = 0. i

This implies that c

rank[SW ]



 i=1

(ni

− 1) = n − c.

≤n. i

PROBLEM SOLUTIONS

335

(b) Here we have the betwee n scatter matrix c

SB =



ni (mi

i=1

− m)(m − m) i

t

is the sum of c matrices of rank at most 1, and thus rank[SB ] are constrained by c



− m)(m − m)

ni (mi

i

i=1

t

≤ c. The matrices

= 0,

≤ −

and thus rank[ SB ] c 1. The between-cluster scatter matrix, SB , is always singular if c d, the dimension of the space. The determinant criterion for clustering is not useful under such conditions because at least one of the eigenvalues is 0, since as a result the determinant is also 0.



Section 10.8 30. Our generalization of the basic minimum-squared-error criterion function of Eq. 54 in the text is: c

JT =

(x i=1 x

i

− m ) S− (x − m ). i

t

1

i

T

∈D



(a) We consider a non-singular transformation of the feature space of the form y = Ax, where A is a d-by-d non-singular matrix. We let ˜ i = Ax : x i denote the transformed data set. Then, we have the criterion in the transformed space is

D {

∈D }

c

JTy

=

  i=1 y c

=

∈D˜

∈D

i=1 x

(y

− m ) S − (y − m ) y t y 1 i T

y i

i

(Ax

i

− Am ) S − (Ax − Am ). i

t y 1 T

i

As mentioned in the solution to Problem 25, the within- and between-scatter matrices transform according to: SyW SyB

= =

ASW At and ASB At ,

respectively. Thus the scatter matrix in the transformed coordinates is SyT

= SyW + SyB = A(SW + SB )At = AS T At ,

and this implies [SyT ]−1

1 −1 = (AST At )−1 = (At )−1 S− . T A

336CHAPTER 10. UNSUPERVISED LEARNING AND CLUSTERING Therefore, we have c

JTy

=

 − −

(A(x

∈D

i=1 x c

=

i

− m )) (A )− S− A− (A(x − m )) t

i

t

1

(x

1 mi )t (At )(At )−1 S− T (x

(x

mi )S− T 1 (x

i=1

−m )= J i

1

i

−m ) i

i=1 c

=

1

T

T.

In short, then, J T is invariant to non-singular linear transformation of the data.

D

D

ˆ being transferred from i to j . Recall that the tota l (b) We consider sample x scatter matrix S T = x (x m)(x m)t , given by Eq. 64 in the text, does not change as a result of changing the partition. Therefore the criterion is







c

 − − D  D −{ } D  D { }   −  −

JT∗ =

(x

1 m∗k )t S− T (x

m∗k ),

∈D∗

k=1 x

where

k

if if if

k

∗=

ˆ x ˆ j+ x i

k

k = i, j k=i k = j.

We note the following values of the means after transfer of the point: m∗k m∗i

= =

mk if k = i,j, x ˆ x x x∈Di∗ x∈Di = 1 ni 1

∈D∗

x

= = m∗j

− − − −−

=

− 1)m − (ˆx − m ) n −1 i

i

i





x Hi

=

=

i

ˆ ni mi x (ni = ni 1 ˆ mi x mi , ni 1 x+x ˆ ni + 1 ˆ nj mj + x

ˆ (nj + 1)mj + ( x

nj + 1 = ˆ mj x mj + . nj + 1

nj + 1



mj )



Thus our criterion functio n is JT∗

c

=





k=1,k =i,j

(x



− m ) S− (x − m ) + ∗ (x − m∗) S− (x − m∗) ∈D + (x − m∗ ) S− (x − m ). (∗) ∈D∗ k

t

x

1

k

T



x

j

j

t

1

T

i

i

j

t

1

T

i

PROBLEM SOLUTIONS

337

We expand the sum:

 −  −  − − − − −  −

1 m∗i )t S− T (x

(x

x∈D∗

i

i

=

1 x t S− T x



n∗i m∗i t +

x∈D∗

=

∈D∗

x

=

∈D

1

i

j

j

t

1

T

j

t

mi

i

1

T

1

(ˆ x

(x



j

T

x ˆ

1

j

T

i

j

j

j

mi i

t

1

T

1

t

i

i T

T

1

 j

j

j

1 t i T

i

−m ) ˆ S− x ˆ − m S− m m ) S− (x − m ) + x

1 t i T

t

j

T

1

j

T

1

T

j

1 1 ˆ + 2mtj S− + 2mtj S− T x T mj

− n 1+ 1 (ˆx − m ) S− (ˆx − m ) n (x − m ) S− (x − m ) − (ˆ x − m ) S− (ˆ x−m ) n +1

 ∈D

x

i

t

1

t

j

j

j

T

1

i

i

T

i

i

i

+ nj (ˆ x nj + 1

t

1

i

T

− m ) S− (ˆx − m ). t

j

T

1

j



We substitute this result in ( ) and find c

JT∗

=

   ∈D

k=1 x

k

(x

− m ) S− (x − m ) k

t

1

k

T



nj ni 1 1 (x mj )t S− mj ) (ˆ x m i )t S− x mi ) T (x T (ˆ nj + 1 ni 1 nj ni 1 1 = JT + (ˆ x mj )t S− x mj ) (ˆ x mi )t S− x mi ) . T (ˆ T (ˆ nj + 1 ni 1 +

(c) If we let







D denote the data set and

− − − − − − −





n the number of points, the algorithm is:

Algorithm 0 (Minimize JT ) 1 2 3 4 5 6 7 8 9

D

begin initialize ,c Compute c means m 1 ,..., mc Compute J T ˆ do Randomly select a sample; call it x ˆ ; call it m j Determine closest mean to x if n i = 1 then go to line 10 nj if j = i then ρ j x mj )t ST−1 (ˆ x mj ) nj +1 (ˆ ni if j = 1 then ρ j x mi )t ST−1 (ˆ x mi ) ni −1 (ˆ ˆ to k if ρ k ρj for all j then transfer x







i

i

1

t

i

 

− m ) − xˆ S− xˆ + m S− m + 2m S− xˆ − 2m S− m

1 mi )t S− x T (ˆ j

∈D

x

− n∗m∗ S− m∗ x ˆ

x ˆ ST− x ˆ

1 mi ) t S − T (x

1

+

=

1 x t S− T x

1

i

ni

j

T

j

j

(x

x

1

t

j

− (n − 1) m − n−− 1 S− m − n−− 1 x ˆ−m x ˆ−m ˆ S− x ˆ − (n + 1) m + x S− x + x S− m + n +1 n +1 t

+

− m∗) S− (x − m∗)

j

1

−T x x∈D∗ x S i

(x

x∈D∗

x∈D∗

i

t



− m∗ ) +









− D



338CHAPTER 10. UNSUPERVISED LEARNING AND CLUSTERING Update J T , mi , and m k until J T has not changed in n tries

10 11

end

12

n



31. The total scatter matrix, S T =

(xi

i=1

t

− m)(x − m) , given by Eq. 64 in the text, i

does not change. Thus our criterion function is





Je = tr[SW ] = tr[ST SB ] = tr[ST ] tr[SB ]. We let Je∗ be the criterion function which results from transferring a sample i to j . Thus we have

D

x ˆ from

D

Je∗ = tr[ ST ] = tr[ ST

− tr[S∗ ] ]− n∗ m∗ − m

  −  −



( )

B

k

k

2

k

= tr[ ST ]

n∗k m∗k



k =i,j

= tr[ ST ]

2

 − m −



n∗k m∗k

 − m

k=i,j

2

 − m − n∗m∗ − m − n∗m∗ − m . 2

nk mk



k =i,j

i

2

i

j

2

j

Therefore we have n∗i m∗i

2

 − m

+ n∗j m∗j

 − m

2

= (ni

− 1)

− xˆn− m1 − m ˆ−− m − m m +x



+(nj + 1)

2

i

mi

i



j

j

nj + 1

as shown in Problem 30. We thus find that the means change by m∗i

= mi

m∗j

= mj

2

,

− xˆn−−m1 , ˆ−m x + . i

i

j

nj + 1



We substitute these into ( ) above and find through a straightforward but tedious calculation: Je∗



− 1) m − m i

j

=

i

i

i

j

i



=

i

2

2

j

2

i

i

2

j

j

j

j

i

2

i

2

t

t

i

j

j

i

j

2

j

i

i

t

2

j

2

j

2

2

i

i

i

i

2

i

j

2

i

+ 2

j

j

=

2

1



xˆ − m  − n 2− 1 (ˆx − m ) (m − m) (n − 1) 1 +(n − 1) m − m + xˆ − m  − n 2+ 1 (ˆx − m ) (m − m) (n + 1) ˆ−m  n m − m − m − m − 2(ˆ x − m ) (m − m) + 1 x n −1 1 x − m ) (m − m) + x ˆ−m  +n m − m − m − m + 2(ˆ n +1 ˆ − m  − x ˆ−m  n  m − m  + n  m − m +  x 1 1 + ||xˆ − m || + n + 1 ||xˆ − m || n −1 n n  m − m  + n  m − m + xˆ − m  − n n+ 1 xˆ − m  . n −1

= (ni

t

j

i

i

j

j

j

2

j

2

2

2

i

j

2

j

j

2

2



PROBLEM SOLUTIONS

339

Therefore our criterion function is Je∗ = tr[ ST ] =

− tr[S∗ ] n xˆ − m  tr[ S ] − n  m − m − n −1 n J + xˆ − m  − n n− 1 xˆ − m  . n +1



T

B

k

i

2

k

k

=

e

j

j

j

i

i

i

2

+

nj ˆ x nj + 1

 −m  j

2

2

i

i

2

Section 10.9 32. Our similarity measure is given by Eq. 50 in the text: s(x, x ) =

xt x

xx .

(a) We have that x and x  are d-dimensional vectors with x i = 1 if x possesses the ith feature and x i = 1 otherwise. The Euclidean length of the vectors obeys



 d

x = x  = and thus we can write

 d

x2i =

i=1

1=



d,

i=1

d

s(x, x ) = = = =

√xd√xd = d1 t



xi xi

i=1

1 [number of common features number of features not common] d 1 [number of common features (d number of common features) ] d 2 (number of common features) 1. d

− − − −

(b) The length of the differe nce vector is

x − x 

2

= = = = =

where, from part (a), we used

(x x )t (x xt x + xt x x 2 + x

− x ) − 2x x     − 2s(x, x )xx  √ √ d + d 2s(x, x ) d d 2d[1 −−s(x, x )], x = x  = √d. −

t

2

33. Consider the following candidates for metrics or pseudometrics. (a) Squared Euclidean distance : d

s(x, x ) = x

 − x 

2

=

 i=1

(xi

− x ) . i

2

340CHAPTER 10. UNSUPERVISED LEARNING AND CLUSTERING Clearly we have s(x, x ) 0 (non-negativity) s(x, x ) = 0 x = x  (uniqueness) s(x, x ) = s(x , x) (symmetry) .





Now consider the triangle inequality for the particular case 0, x = 1/2 and x  = 1.

d = 1, and x =

s(x, x ) = (0 1)2 = 1  s(x, x ) = (0 1/2)2 = 1/4   s(x , x ) = (1 /2 1)2 = 1/4.

− −



Thus we have s(x, x ) + s(x , x ) = 1/2 < s(x, x ), and thus s(x, x ) is not less than or equal to s(x, x ) + s(x , x ). In short, the squared Euclidean distance does not obey the triangle inequality and is not a metric. Hence it cannot be an ultrametric either. (b) Euclidean distance: d

s(x, x ) = x

x =

 − 

xi )2 .

(xi



i=1



Clearly symmetry, non-negativity and uniqueness hold, as in part (a); now we turn to the triangle inequali ty. First consider two vectors x and x . We will x + x ; we do this as follows: need to show that x + x

 ≤    x + x  = (x + x ) (x + x ) 2

t

x x + xt x + 2xt x x2 + x 2 + 2xtx x2 + x 2 + 2xx . t

= =



We use the Cauchy-Schwarz Inequality (a special case of the H¨older inequality, see part (c) below), i.e., xt x x x , and thus find

| | ≤    x + x ≤ (x + x ) 2

2

and thus

x + x ≤ x + x . We put this together to find s(x, x ) = x

 − x

(x − x) + (x − x) ≤ x − x  + x − x  = =

s(x, x ) + s(x , x ),

PROBLEM SOLUTIONS

341

and thus the Eucl idean metric is indeed a metric. We use the same samp le points as in the example of part (a) to test whether the Euclidean metric is an ultrametric: s(x, x )  s(x, x ) = s(x, x )

= 1 = 1/4

so max[s(x, x ), s(x , x )] = 1 /4 < 1 = s(x, x ). Thus the Euclidean metric is not an ultrametric. (c) Minkowski metric:

 | d

s(x, x ) =

xi

i=1

− x |

q

i



1/q

.

It is a simple matter to show that the properties of non-negativity, uniqueness and symmetry hold for the Minkowski metric. In order to prove that the triangle inequality also holds, we will first need to prove H¨ older’s inequality, that is, for p and q positive numbers such that 1/p + 1/q = 1 1/q d

d

1/p d

 | | ≤  | |  ·  | |        | |    | |  xi xi

i=1

xi

q

i=1

xi

p

,

i=1

with equality holding if and only if

1/q

xj

=

1/p

d

1/p

xj

d

xi

i=1

i=1

1/q

xi



for all j. The limiting case of p = 1 and q = can be easily verified directly. We thus turn to the case 1 < p , q,< . Consider two real, non-negative numbers a and b, and 0 λ 1; we have



≤ ≤

aλ b(1−λ)

≤ λa + (1 − λ)b,

with equality if and only if a = b. To show this, we consider the function f (t) = t λ

− λt + λ − 1 for t ≥ 0. Thus we have f  (t) = λ(t − − 1) ≥ 0. Furthermore, f (t) ≤ f (1) = 0 (λ 1)

with equality only for t = 1. Thus we have tλ

≤ λt + 1 − λ.

We let t = a/b, substitute above and our intermediate inequality is thus proven.

342CHAPTER 10. UNSUPERVISED LEARNING AND CLUSTERING We apply this inequality to the particular case

       || ||   −      | | |  | | |  ≤    || ||      || ||  a=

    || || 

p

q

xj

d

xi

,

1/p

xj

b=

i=1

with λ = 1/p and 1

1/q

d

p

i=1

xi

q

λ = 1/q . Thus for each component j we have p



d

xi

xj xj 1/p d

p

i=1

1 p

1/q

i=1

xi

q

xj

+

1/p

d

p

xi

xj

1 q

d

i=1

i=1

We sum this inequality over all j = 1,...,d d

| |  | |   | |  d

xi

d

p

i=1

xi

i=1

1/q

q

q

.

and find

xj xj

j=1 1/p

xi

 

1/q

q

≤ 1p + 1q = 1,

and the H¨older inequality is thereby proven. We now use the H¨older inequality to prove that the triangle inequality holds for the Minkowski metric, a result known as Minkowski’s inequality. We follow the logic above and have d

d

|

xi + xi

i=1

p

| ≤

d

|

xi + xi

i=1

| − |x | + p 1

i

|

xi + xi

i=1

| − |x |. p 1

i

We apply H¨older’s inequality to each summation on the right hand side and find d

|

p

|

xi + xi

i=1

1/p

d



(p 1)q

+

1/p

d

xi + xi

i=1

p

xi

p

1/p

d

xi + xi

+ xi

i=1

, recall that 1

1/p

d

p

xi

i=1

.

p

1/p

d

xi

+

1/q = 1/p, and

p

.

i=1

Thus, using the notation of the Minkowski metric above, we have s(x + x , 0)

p

i=1

    −   | |  ≤  | |   | |  i=1

1/p

xi

+

1/q

p

xi

p

i=1

i=1

d

We divide both side by thereby obtain

xi

i=1

d

1/p

d

p

xi

1/q

d

=

      | | | | | |  | |   | |   1/q

d

xi + xi

j=1

 | ≤  |

≤ s(x, 0) + s(x, 0),

PROBLEM SOLUTIONS

343

or with simple substitution and rearrangement s(x, x )

≤ s(x, x ) + s(x , x ),

for arbitrary x, x and x . Thus our triangle inequa lity is thereby proven and the Minkowski measure is a true metric. Ultrametric [[more hereProblem not yet solved]] (d) Cosine: xt x

s(x, x ) =

xx  .

We investigate the condition of uniqueness in the particular case x = 11 , x = −11 . For these points we have

  

xt x = 1

d = 2 and

− 1 = 0,

but note that x = x  here. Therefore s(x, x ) does not possess the uniqueness property, and thus is not a metric and cannot be an ultrametric either.



(e) Dot product: s(x, x ) = x t x . We use the same counterexample as in part (c) to see that the dot product is neither a metric nor an ultrametric. (f) One-sided tangent distance: 2



− 

α x + αT(x) x . s(x, x ) = min There is no reason why the symmetry property will be obeyed, in general, that is, s(x, x ) = s(x , x), or





min x + α1 T1 (x) α1

− x  = min x + α T (x ) − x , α 2

2

2

2

2

and thus the one-sided tangent distance is not not a metric and not an ultrametric.

34. Consider merging two clusters D_i and D_j and whether various values of the parameters in the function

d_{hk} = \alpha_i d_{hi} + \alpha_j d_{hj} + \beta d_{ij} + \gamma |d_{hi} - d_{hj}|

can be used for a range of distance or similarity measures.

(a) We consider d_min defined by

d_{\min}(D_i, D_j) = \min_{x \in D_i,\, x' \in D_j} \|x - x'\|.

Then

d_{hk} = \min_{x \in D_h,\, x' \in D_i \cup D_j} \|x - x'\|
       = \min\Bigl(\min_{x \in D_h,\, x' \in D_i} \|x - x'\|,\ \min_{x \in D_h,\, x' \in D_j} \|x - x'\|\Bigr)
       = \min(d_{hi}, d_{hj})
       = \frac{1}{2} d_{hi} + \frac{1}{2} d_{hj} - \frac{1}{2} |d_{hi} - d_{hj}|
       = \alpha_i d_{hi} + \alpha_j d_{hj} + \beta d_{ij} + \gamma |d_{hi} - d_{hj}|

for α_i = α_j = 1/2, β = 0 and γ = −1/2.
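A small numerical check of part (a), added here for illustration and assuming NumPy (the helper dmin and the random clusters are inventions of this sketch), confirms the update rule on randomly drawn clusters:

import numpy as np

def dmin(A, B):
    # d_min(A, B) = minimum Euclidean distance over all pairs
    return min(np.linalg.norm(a - b) for a in A for b in B)

rng = np.random.default_rng(1)
Dh, Di, Dj = (rng.normal(size=(m, 2)) for m in (4, 5, 6))

dhi, dhj = dmin(Dh, Di), dmin(Dh, Dj)
dhk_direct = dmin(Dh, np.vstack([Di, Dj]))                   # distance to the merged cluster
dhk_update = 0.5 * dhi + 0.5 * dhj - 0.5 * abs(dhi - dhj)    # alpha_i = alpha_j = 1/2, gamma = -1/2

print(np.isclose(dhk_direct, dhk_update))   # True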

(b) We consider d_max defined by

d_{\max}(D_i, D_j) = \max_{x \in D_i,\, x' \in D_j} \|x - x'\|.

Then

d_{hk} = \max_{x \in D_h,\, x' \in D_i \cup D_j} \|x - x'\|
       = \max\Bigl(\max_{x \in D_h,\, x' \in D_i} \|x - x'\|,\ \max_{x \in D_h,\, x' \in D_j} \|x - x'\|\Bigr)
       = \max(d_{hi}, d_{hj})
       = \frac{1}{2} d_{hi} + \frac{1}{2} d_{hj} + \frac{1}{2} |d_{hi} - d_{hj}|
       = \alpha_i d_{hi} + \alpha_j d_{hj} + \beta d_{ij} + \gamma |d_{hi} - d_{hj}|

for α_i = α_j = 1/2, β = 0 and γ = 1/2.

(c) We consider d_ave defined by

d_{\text{ave}}(D_i, D_j) = \frac{1}{n_i n_j} \sum_{x \in D_i} \sum_{x' \in D_j} \|x - x'\|.

Then

d_{hk} = \frac{1}{n_h n_k} \sum_{x \in D_h} \sum_{x' \in D_k} \|x - x'\|
       = \frac{1}{n_h (n_i + n_j)} \sum_{x \in D_h} \Bigl[\sum_{x' \in D_i} \|x - x'\| + \sum_{x' \in D_j} \|x - x'\|\Bigr]
       = \frac{1}{n_h (n_i + n_j)} [n_h n_i d_{hi} + n_h n_j d_{hj}]
       = \frac{n_i}{n_i + n_j} d_{hi} + \frac{n_j}{n_i + n_j} d_{hj}
       = \alpha_i d_{hi} + \alpha_j d_{hj} + \beta d_{ij} + \gamma |d_{hi} - d_{hj}|

for α_i = n_i/(n_i + n_j), α_j = n_j/(n_i + n_j) and β = γ = 0.

(d) We consider d_2mean defined by

d_{2\text{mean}}(D_i, D_j) = \|m_i - m_j\|^2, \qquad d_{hk} = \|m_h - m_k\|^2,

where the mean of the merged cluster D_k = D_i \cup D_j is

m_k = \frac{1}{n_i + n_j} \sum_{x \in D_i \cup D_j} x = \frac{1}{n_i + n_j} \Bigl[\sum_{x \in D_i} x + \sum_{x \in D_j} x\Bigr] = \frac{n_i m_i + n_j m_j}{n_i + n_j}.

Thus

d_{hk} = \Bigl\|m_h - \frac{n_i}{n_i + n_j} m_i - \frac{n_j}{n_i + n_j} m_j\Bigr\|^2
       = \Bigl\|\frac{n_i}{n_i + n_j} (m_h - m_i) + \frac{n_j}{n_i + n_j} (m_h - m_j)\Bigr\|^2
       = \frac{n_i^2}{(n_i + n_j)^2} \|m_h - m_i\|^2 + \frac{n_j^2}{(n_i + n_j)^2} \|m_h - m_j\|^2 + \frac{2 n_i n_j}{(n_i + n_j)^2} (m_h - m_i)^t (m_h - m_j).

Using 2(m_h - m_i)^t (m_h - m_j) = \|m_h - m_i\|^2 + \|m_h - m_j\|^2 - \|m_i - m_j\|^2, this becomes

d_{hk} = \frac{n_i^2 + n_i n_j}{(n_i + n_j)^2} \|m_h - m_i\|^2 + \frac{n_j^2 + n_i n_j}{(n_i + n_j)^2} \|m_h - m_j\|^2 - \frac{n_i n_j}{(n_i + n_j)^2} \|m_i - m_j\|^2
       = \frac{n_i}{n_i + n_j} \|m_h - m_i\|^2 + \frac{n_j}{n_i + n_j} \|m_h - m_j\|^2 - \frac{n_i n_j}{(n_i + n_j)^2} \|m_i - m_j\|^2
       = \alpha_i d_{hi} + \alpha_j d_{hj} + \beta d_{ij} + \gamma |d_{hi} - d_{hj}|,

where

\alpha_i = \frac{n_i}{n_i + n_j}, \qquad \alpha_j = \frac{n_j}{n_i + n_j}, \qquad \beta = -\frac{n_i n_j}{(n_i + n_j)^2} = -\alpha_i \alpha_j, \qquad \gamma = 0.

35. The sum-of-squared-error criterion is given by Eq. 72 in the text:

J_e = \sum_{i'=1}^{c} \sum_{x \in D_{i'}} \|x - m_{i'}\|^2 = \sum_{i'=1}^{c} \sum_{x \in D_{i'}} x^t x - \sum_{i'=1}^{c} n_{i'} m_{i'}^t m_{i'}.

We merge D_i and D_j into D_k and find an increase in the criterion function J_e of

J_e^* - J_e = \Bigl[\sum_{x} x^t x - \sum_{i' \ne k} n_{i'} m_{i'}^t m_{i'} - n_k m_k^t m_k\Bigr] - \Bigl[\sum_{x} x^t x - \sum_{i'=1}^{c} n_{i'} m_{i'}^t m_{i'}\Bigr]
           = n_i m_i^t m_i + n_j m_j^t m_j - n_k m_k^t m_k.

Since n_k = n_i + n_j and m_k = (n_i m_i + n_j m_j)/(n_i + n_j), this reduces to

J_e^* - J_e = \frac{n_i n_j}{n_i + n_j} \|m_i - m_j\|^2.
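The identity just derived is easy to confirm numerically; the following sketch is an illustrative addition (assuming NumPy; the helper Je is an invention of this sketch) that compares the direct change in J_e with the closed-form expression for two random clusters:

import numpy as np

def Je(clusters):
    # sum of squared distances of each point to its own cluster mean
    return sum(((C - C.mean(axis=0)) ** 2).sum() for C in clusters)

rng = np.random.default_rng(2)
Di, Dj = rng.normal(size=(5, 3)), rng.normal(loc=2.0, size=(8, 3))

increase_direct = Je([np.vstack([Di, Dj])]) - Je([Di, Dj])
ni, nj = len(Di), len(Dj)
increase_formula = ni * nj / (ni + nj) * np.sum((Di.mean(0) - Dj.mean(0)) ** 2)

print(np.isclose(increase_direct, increase_formula))   # True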

Thus we have

\mathcal{E}\Bigl[\frac{n_1 n_2}{n_1 + n_2} \|m_1 - m_2\|^2\Bigr] = \mathcal{E}\Bigl[\frac{n_1 n_2}{n_1 + n_2} \sum_{i=1}^d (m_{1i} - m_{2i})^2\Bigr] \simeq \mathcal{E}\Bigl[\frac{n_1 n_2}{n_1 + n_2} (m_{11} - m_{21})^2\Bigr],

as x comes from a symmetric normal distribution of the form N(µ, σ²I). Thus we have m_{1i} − m_{2i} → 0 for i ≠ 1, independent of m_{11} and m_{21}, by the Strong Law of Large Numbers. Now we have, in the limit of large n,

m_{11} → \mathcal{E}(x_1 | x_1 > µ_1)
m_{21} → \mathcal{E}(x_1 | x_1 < µ_1).

Without loss of generality we let µ_1 = 0. Then we have p(x_1) ∼ N(0, σ²) and thus

\mathcal{E}(x_1 | x_1 > 0) = \sqrt{2/\pi}\,\sigma, \qquad \mathcal{E}(x_1 | x_1 < 0) = -\sqrt{2/\pi}\,\sigma.

Therefore we have m_{11} − m_{21} → 2\sqrt{2/\pi}\,\sigma for n large. We also have n_1 n_2/(n_1 + n_2) ≃ (n/2)(n/2)/((n/2) + (n/2)) = n/4, and thus

\mathcal{E}(J_e(2)) \simeq n d \sigma^2 - \frac{n}{4} \cdot 4 \cdot \frac{2}{\pi} \sigma^2 = n\Bigl(d - \frac{2}{\pi}\Bigr)\sigma^2.

Thus J_e(2) is approximately independent of

J_e(1) - J_e(2) = \frac{n_1 n_2}{n_1 + n_2} \|m_1 - m_2\|^2.

We write J_e(1) = J_e(2) + [J_e(1) − J_e(2)], take the variance of both sides, and find

Var[J_e(1)] = Var[J_e(2)] + Var[∆], \qquad where \quad ∆ = \frac{n_1 n_2}{n_1 + n_2} \|m_1 - m_2\|^2.

Problem not yet solved
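Returning to the expected value computed above, the approximation \mathcal{E}(J_e(2)) ≈ n(d − 2/π)σ² can be checked by simulation. The sketch below is an illustrative addition, assuming NumPy and splitting the data at x_1 = 0 along the first coordinate, as in the derivation:

import numpy as np

def Je_two_clusters(X):
    # split along the first coordinate at the (true) mean 0, as in the derivation above
    A, B = X[X[:, 0] > 0], X[X[:, 0] <= 0]
    return sum(((C - C.mean(axis=0)) ** 2).sum() for C in (A, B) if len(C) > 0)

n, d, sigma = 1000, 5, 1.5
rng = np.random.default_rng(3)
estimate = np.mean([Je_two_clusters(rng.normal(scale=sigma, size=(n, d)))
                    for _ in range(200)])
prediction = n * (d - 2.0 / np.pi) * sigma ** 2
print(estimate, prediction)   # the two values should be close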

Section 10.11

42. Consider a simple greedy algorithm for creating a spanning tree based on the Euclidean distance between points.


(a) The following is known as Kruskal's minimal spanning tree algorithm.

Algorithm 0 (Kruskal's minimal spanning tree)

1  begin initialize i ← 0, T ← {}
2    do i ← i + 1
3      c_i ← x_i
4    until i = n
5    e_ij ← ‖x_i − x_j‖; sort in non-decreasing order
6    E ← ordered set of e_ij
7    do for each e_ij in E in non-decreasing order
8      if x_i and x_j belong to disjoint clusters
9        then Append e_ij to T; Append c_j to c_i
10   until all e_ij considered
11   return T
12 end
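For readers who prefer running code, here is a compact Python sketch of the same greedy procedure (an illustrative addition, not the manual's pseudocode), using a simple union-find structure for the disjoint clusters discussed in parts (b) and (c) below:

import numpy as np

def kruskal_mst(X):
    """Return the edges (i, j) of a minimal spanning tree of the points in X."""
    n = len(X)
    parent = list(range(n))                # union-find forest: one cluster per point

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # all pairwise Euclidean distances, sorted in non-decreasing order
    edges = sorted(((np.linalg.norm(X[i] - X[j]), i, j)
                    for i in range(n) for j in range(i + 1, n)))
    T = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                       # x_i and x_j are in disjoint clusters
            parent[rj] = ri                # merge the clusters
            T.append((i, j))
    return T

X = np.random.default_rng(4).normal(size=(6, 2))
print(kruskal_mst(X))                      # five edges for six points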



(b) The space complexity is n(n − 1) if we pre-compute all point-to-point distances. If we compute all distances on demand, we only need to store the spanning tree T and the clusters, which are both at most of length n.

(c) The time complexity is dominated by the initial sorting step, which takes O(n² log n) time. We have to examine each edge in the for loop, and that means we have O(n²) iterations, where each iteration can be done in O(log n) time if we store the clusters as disjoint-set forests, in which searches and unions can be done in logarithmic time. Thus the total time complexity of the algorithm is O(n² log n).

Section 10.12

43. Consider the XOR problem and whether it can be implemented by an ART network as illustrated in the text.

(a) Learning the XOR problem means that the ART network is supposed to learn the following input-output relation. We augment all the input vectors and normalize them to unit length so that they lie on the unit sphere in R^3. Then we have

    input                         output
    (0, 0, 1)^t                     0       (∗)
    (1/√2, 0, 1/√2)^t               1       (∗∗)
    (0, 1/√2, 1/√2)^t               1       (∗∗∗)
    (1/√3, 1/√3, 1/√3)^t            0       (∗∗∗∗)

Associate the weight vector w^0 with cluster 0 and w^1 with cluster 1, with subscripts indexing the three components; here cluster 0 is to hold the two patterns with output 1 and cluster 1 the two patterns with output 0 (by symmetry the reverse assignment leads to the same contradiction). For each pattern to be assigned correctly it must have a larger inner product with its own cluster's weight vector than with the other's, which gives the following conditions:

    w_3^0 < w_3^1                                                            (∗)
    w_1^0 + w_3^0 > w_1^1 + w_3^1,  or  (w_1^0 − w_1^1) > (w_3^1 − w_3^0)     (∗∗)
    w_2^0 + w_3^0 > w_2^1 + w_3^1,  or  (w_2^0 − w_2^1) > (w_3^1 − w_3^0)     (∗∗∗)
    w_1^0 + w_2^0 + w_3^0 < w_1^1 + w_2^1 + w_3^1.                            (∗∗∗∗)

From (∗∗∗∗) we see

(w_1^0 − w_1^1) + (w_2^0 − w_2^1) < (w_3^1 − w_3^0).

We combine this with (∗), (∗∗) and (∗∗∗) and find

0 < 2(w_3^1 − w_3^0) < (w_1^0 − w_1^1) + (w_2^0 − w_2^1) < (w_3^1 − w_3^0).

This is a contradiction. Thus there is no set of weights that the ART network could converge to that would implement the XOR solution.

(b) Given a weight vector w for cluster C_1, the vigilance parameter ρ, and two inputs x_1 and x_2 such that w^t x_1 > ρ and w^t x_2 < ρ, but also (w + µx_1)^t x_2 > ρ ‖w + µx_1‖. Since the top-down feedback tries to push the input vector to be closer to w, we can use the angle between w and x_1 in this case to make our point, since we cannot forecast the value of y. If we present x_1 before x_2, then x_1 will be classified as belonging to cluster C_1. The weight vector w is then slightly adjusted, and now x_2 is also close enough to w to be classified as belonging to cluster C_1; thus we will not create another cluster. In contrast, if we present x_2 before x_1, the match between w and x_2 falls below the vigilance ρ, and thus we will introduce a new cluster C_2 containing x_2. Due to the fixed value of ρ but weights that change in dependence on the input, the number of clusters depends on the order in which the samples are presented.
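To make the contradiction of part (a) concrete, the brief sketch below (an illustrative addition, assuming NumPy) searches randomly for a pair of weight vectors satisfying all four starred conditions and, as the algebra predicts, finds none:

import numpy as np

rng = np.random.default_rng(5)
found = False
for _ in range(100000):
    w0, w1 = rng.normal(size=(2, 3))      # candidate weight vectors; index k holds component k+1
    if (w0[2] < w1[2]                                  # (*)
            and w0[0] + w0[2] > w1[0] + w1[2]          # (**)
            and w0[1] + w0[2] > w1[1] + w1[2]          # (***)
            and w0.sum() < w1.sum()):                  # (****)
        found = True
        break
print(found)   # False: the four conditions are mutually inconsistent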





(c) In a stationary environment the ART network is able to classify inputs robustly, because the feedback connections will drive the network into a stable state even when the input is corrupted by noise. In a non-stationary environment the feedback mechanism will delay the adaptation to the changing input vectors, because the feedback connections will interpret the changing inputs as noisy versions of the original input and will try to force the input to stay the same. Without the feedback, the adaptation of the weights would account more quickly for changing sample configurations.

Section 10.13

44. Let e be a vector of unit length and x the input vector.

(a) We can write the variance of the projection a = x^t e as

\sigma^2 = \mathcal{E}_x[a^2] = \mathcal{E}_x[(x^t e)^t (x^t e)] = \mathcal{E}_x[e^t x x^t e] = e^t \mathcal{E}_x[x x^t]\, e = e^t \Sigma e,

since e is fixed and thus can be taken out of the expectation operator.

(b) We use the Taylor expansion

\sigma^2(e + \delta e) = (e + \delta e)^t \Sigma (e + \delta e) = e^t \Sigma e + 2\, \delta e^t \Sigma e + O(\|\delta e\|^2).

If we disregard the second-order term, we see that the requirement σ²(e + δe) = σ²(e), that is, (e + δe)^t Σ (e + δe) = e^t Σ e, implies (δe)^t Σ e = 0.

(c) Since δe is perpendicular to e, we have (δe)^t e = 0. We can now combine this with the condition from part (b) and get the equation

(\delta e)^t \Sigma e - \lambda (\delta e)^t e = 0,

which can be rewritten as (δe)^t (Σe − λe) = 0. For this equation to hold it is sufficient that Σe = λe for some λ. It is also a necessary condition, since (δe)^t Σe must vanish for all δe perpendicular to e; thus Σe is itself perpendicular to every such δe, in other words parallel to e. In other words, we have Σe = λe.

(d) We denote the d eigenvalues of the covariance matrix Σ by λ_i and the corresponding eigenvectors by e_i. Since Σ is symmetric and positive semi-definite, we know that all eigenvalues are real and greater than or equal to zero. Since the eigenvectors e_i can be chosen orthonormal, they span a basis of the d-dimensional space and we can write each vector x as

x = \sum_{i=1}^d (e_i^t x)\, e_i.

We define the error E_k(x) as the sum-squared error between the vector x and its projection onto a k-dimensional subspace, that is,

E_k(x) = \Bigl(x - \sum_{i=1}^k (e_i^t x) e_i\Bigr)^t \Bigl(x - \sum_{i=1}^k (e_i^t x) e_i\Bigr)
       = \Bigl(\sum_{i=k+1}^d (e_i^t x) e_i\Bigr)^t \Bigl(\sum_{i=k+1}^d (e_i^t x) e_i\Bigr)
       = \sum_{i=k+1}^d (e_i^t x)^2,

because e_i^t e_j = \delta_{ij}, where \delta_{ij} is the Kronecker delta, which has value 1 if i = j and 0 otherwise. If we now take the expected value of the error over all x, we get, using the definitions in part (a),

\mathcal{E}_x[E_k(x)] = \sum_{i=k+1}^d \mathcal{E}_x[(x^t e_i)^t (x^t e_i)] = \sum_{i=k+1}^d e_i^t \Sigma e_i;

since all e_i are eigenvectors of the covariance matrix Σ, we find

\mathcal{E}_x[E_k(x)] = \sum_{i=k+1}^d \lambda_i e_i^t e_i = \sum_{i=k+1}^d \lambda_i.

This then gives us the expected error. Thus, to minimize the squared-error criterion, the k-dimensional subspace should be spanned by the k eigenvectors with the largest eigenvalues, since the error is the sum of the remaining d − k eigenvalues of the covariance matrix.
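The result of part (d) is easy to verify numerically. The sketch below is an illustrative addition (assuming NumPy) that compares the mean squared reconstruction error of a k-dimensional projection with the sum of the discarded eigenvalues:

import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(5000, 4)) @ rng.normal(size=(4, 4))   # correlated data, d = 4
X -= X.mean(axis=0)

Sigma = X.T @ X / len(X)                  # sample covariance matrix
lam, E = np.linalg.eigh(Sigma)            # eigenvalues ascending; columns of E are eigenvectors

k = 2
top = E[:, -k:]                           # eigenvectors with the k largest eigenvalues
residual = X - (X @ top) @ top.T          # error of the projection onto their span
print(np.mean(np.sum(residual ** 2, axis=1)))    # approximately ...
print(lam[:-k].sum())                            # ... the sum of the d-k smallest eigenvalues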



45. Problem not yet solved
46. Problem not yet solved
47. Problem not yet solved
48. Problem not yet solved
49. Problem not yet solved

Section 10.14

50. Consider the three points

x_1 = (1, 0)^t, \qquad x_2 = (0, 0)^t, \qquad x_3 = (0, 1)^t.

Then we have the distances

\delta_{12} = \|x_1 - x_2\| = \sqrt{(1-0)^2 + (0-0)^2} = 1
\delta_{13} = \|x_1 - x_3\| = \sqrt{(1-0)^2 + (0-1)^2} = \sqrt{2}
\delta_{23} = \|x_2 - x_3\| = \sqrt{(0-0)^2 + (0-1)^2} = 1.

We assume for definiteness, and without loss of generality, that the transformed points obey 0 = y_1 < y_2 < y_3. We define Δ_{21} = y_2 − y_1 > 0 and Δ_{32} = y_3 − y_2 > 0, so that y_2 = Δ_{21} and y_3 = Δ_{21} + Δ_{32}, and therefore we have

d_{12} = y_2 − y_1 = Δ_{21} − 0 = Δ_{21}
d_{13} = y_3 − y_1 = Δ_{21} + Δ_{32} − 0 = Δ_{21} + Δ_{32}
d_{23} = y_3 − y_2 = Δ_{32}.

(a) From the definition in Eq. 107 in the text we have

J_{ee} = \frac{1}{\sum_{i<j} \delta_{ij}^2} \sum_{i<j} (d_{ij} - \delta_{ij})^2.
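The remainder of this solution calls for minimizing the criterion over one-dimensional configurations. As a practical illustration (an addition to the solution, assuming the form of Eq. 107 quoted above and using NumPy), one can evaluate J_ee over candidate gap values directly:

import numpy as np

delta = {(0, 1): 1.0, (0, 2): np.sqrt(2.0), (1, 2): 1.0}   # the delta_ij computed above

def Jee(d21, d32):
    y = np.array([0.0, d21, d21 + d32])                    # candidate 1-D configuration
    num = sum((abs(y[j] - y[i]) - dij) ** 2 for (i, j), dij in delta.items())
    den = sum(dij ** 2 for dij in delta.values())
    return num / den

# crude grid search over the two gaps
grid = np.linspace(0.01, 2.0, 400)
best = min(((Jee(a, b), a, b) for a in grid for b in grid))
print(best)   # minimal J_ee found and the corresponding gaps Delta_21, Delta_32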

page 76 Problem 41: ... “Case A: P (x > x∗ | x ∈ ω1 ) = 0.8, P (x > x∗ | x ∈ ω2 ) = 0.3 or Case B: P (x > x∗ | x ∈ ω1 ), P (x > x∗ | x ∈ ω2 ) = 0.7.”

| ∈

| ∈

| ∈

| ∈

page 76 Problem 41, first line after the equation: Change “( µ2 µ1 )/δ ”

− µ )/δ

i

1

” to “( µ2



page 76 Problem 41, part (b): Change “ dT = 1.0.” to “ dT = 1.0 and 2 .0.”

| ∈ ω ) = .2.” to “ P (x > x∗|x ∈

page 76 Problem 41, part (c): Change “ P (x > x∗ x ω1 ) = .7.”

1

page 76 Problem 41, part (e): Change “measure P (x > x∗ | x ∈ ω2 ) = .9 and P (x > x∗ | x ∈ ω1 ) = .3.” to “measure P (x > x∗ | x ∈ ω2 ) = .3 and P (x > x∗ | x ∈ ω1 ) = .9.”

| ∈

| ∈

| ∈

| ∈

page 81 Computer exercise 6, part (b), line +1 : Change “Consider” to “Consider the normal distributions”

|

page 81 Computer exercise 6, part (b), equation: Change “and p(x ω2 )” to “and p(x ω2 )” (i.e., add space)

|

page 81 Computer exercise 6, part (b), equation: Move “with P (ω1 ) = P (ω2 ) = 1/2.” out of the centered equation, and into the following line of text. page 83 first column, entry for [21], lines +3 – 4 : Change “Silverman edition, 1963.” to “Silverman, 1963.”

Chapter 3

page 88 Equation 9: Change “∇θµ” to “∇µ”

page 91 Ninth line after Eq. 22: Change “shall consider), the samples” to “shall consider) the samples”

page 99 Caption to first figure: Change “starts our as a flat” to “starts out as a flat”

page 100 line -5: Change “are equivalent to” to “are more similar to”

page 100 lines -5 – -4: Change “If there are much data” to “If there is much data”

page 102 line -5: Change “(Computer exercise 22)” to “(Problem 22)”

page 103 line -2: Change “choice of an prior” to “choice of a prior”

page 104 line +6: Change “if” to “only if”

page 104 line +16: Change “only if” to “if”



page 104 Equation 62: Make the usage and style of the summation sign ( ) uniform in this equation. Specifically, in two places put the argum ents beneath the summation sign, that is, change “ ” D∈D¯ ” to “





D∈D¯

page 105 first line after Eq 63: Change “to this kind of scaling.” to “to such scaling.”

Chapter 3

421

|

page 111 lines +9 – 10 : Change “constants c0 and x0 such that f (x) for all” to “constants c and x 0 such that f (x) c h(x) for all”

|

|≤ |

|

| ≤ c |h(x)| 0

page 111 line +14 : Change “proper choice of c 0 and x 0 .” to “proper choice of c and x0 .” page 116 Equation 86: Change “ λet e” to “ λ(et e

− 1)”

page 125 Algorithm 1 , line 1 : Change “ i = 0” to “ i

0”

← : Change page 126 first line of the equation at the middle of the page Q(θ ; θ ) = E [lnp(x , x ; θ|θ ; D )] 0

x41

g

0

b

g

to Q(θ ; θ0 ) =

E

x41 [lnp(xg , xb ;

θ) θ 0 ;

| D] g

page 128 line +6, (second line after the Example ): Change “the EM algorithm, and they” to “the EM algorithm as they” page 129 second line aboove Section 3.10.3: Change “while the ω i are unobservable” to “while the ω j are unobservable” page 132 line +3 : Change “ bkj , and thus” to “ bjk , and thus” page 132 Equation 136: Replace by: αj (t) =

   0 1

i

αi (t

− 1)a

ij



t = 0 and j = initial state t = 0 and j = initial state bjk v(t) otherwise.



page 132 third line after Eq. 136: Change “Consequently, αi (t) represents” to “Consequently, α j (t) represents” page 132 fourth line after Eq. 136: Change “hidden state ω i ” to “hidden state ω j ” page 132 Algorithm 2 , line 1 : Delete “ ω(1),” page 132 Algorithm 2 , line 1 : Change “ t = 0” to “ t

← 0”

page 132 Algorithm 2 , line 1 : Change “ α(0) = 1” to “ αj (0)” page 132 Algorithm 2 , line 3 : Replace entire line by “ αj (t) 1)aij ”

←b

jk v(t)



c i=1 αi (t



page 132 Algorithm 3 : Somehow the line numbering became incorrect. Change the line number to be sequential, 1 , 2,... 6. page 132 Algorithm 3 , line 1 : Change “ ω(t), t = T ” to “ βj (T ), t

← T”

page 132 Algorithm 3 , old line 4 , renumbered to be line 3 : Replace entire line by c “βi (t) j=1 βj (t + 1)aij bjk v(t + 1)”





page 132 Algorithm 3, old line 7, renumbered line 5: Change “ P (V T )” to “P (VT )” (i.e., make the “ V ” bold)


422

page 133 Figure 3.10, change the label on the horizontal arrow from “ a12 ” to “ a22 ”. page 133 Figure 3.10, caption, line +5 : Change “was in state ω j (t = 2)” to “was in state ω i (t = 2)” page 133 Figure 3.10, caption, line +6 : Change “is αj (2) for j = 1, 2” to “is αi (t) for i = 1, 2” page 133 Figure 3.10, caption, line -1 : Change “ α2 (3) = b2k “α2 (3) = b 2k



c i=1 αi (2)ai2 ”

c j=1

αj (2)aj2 ” to

 {

page 133 line -6 : Change “ V5 = v3 , v1 , v3 , v2 , v0 ” to “ V4 = v1 , v3 , v2 , v0 ”

{

}

}

page 133 line -44 : Change “is shown above,” to “is shown at the top of the figure” page 134 Figure in Example 3, caption, line +4 : Change “ αi (t) — the probability” to “αj (t) — the probability”



page 134 Figure in Example 3, caption, line +6 : Change “and α i (0) = 0 for i = 1.” to “and α j (0) = 0 for j = 1.”



page 134 Figure in Example 3, caption, line +6 : Change “calculation of α i (1).” to “calculation of α j (1).” page 134 Figure in Example 3, caption, line -6 : Change “calculation of αi (1)” to “calculation of α j (1)” page 134 igure in Example 3, caption, line -4 : Change “contribution to α (1).”to i “contribution to α j (1).” page 135 Algorithm 4: somehow the line numbering became incorrect. Change the line numbering to be sequential, 1, 2, 3, . . ., 11. In the old li ne 4 (no w renumbered 3): Change “ k = 0, α0 = 0” to “ j 1”

←−

page 135 Algorithm 4 old line 5 (now renumbered 4): Change “ k “j j + 1”



← k + 1” to

page 135 Algorithm 4 old line 7 (now renumbered 5): Change “ αk (t)” to “ αj (t)” page 135 Algorithm 4 , old line 8 (now renumbered 6): Change “k = c” to “ j = c” page 135 Algorithm 4, old line 11 (now renumbered 8): Change “AppendTo Path ωj  ” to “Append ωj  to Path” page 135 line -5 : Change “The red line” to “The black line” page 135 line -4 : Change “value of α i at each step” to “value of α j at each step” page 137 Equation 138: Replace equation by βi (t) =

   0 1

page 137 seventh line after Eq. 138: Change “ βi (T “βi (T 1) = j aij bjk v(T )βj (T ).”





ωi (t) = ω 0 and t = T ωi (t) = ω 0 and t = T j βj (t + 1)aij bjk v(t + 1) otherwise.

− 1) =



j

aij bij v(T )βi (T ).” to

Chapter 3

423

page 137 fourth line before Eq. 139: Change “probabilities a ij and b ij ” to “probabilities a ij and b jk ” page 137 Equation 139: Change “ bij ” to “ bjk ” page 138 line +3 : Change “whereas at step t it is” to “whereas the total expected number of any transitions from ω i is” page 138 first line after Eq. 140: Change “ ˆbij ” to “ ˆbjk ” page 138 Equation 141: Replace equation by: T

ˆbjk =

 

t=1 v(t)=vk

γjl (t)

l

T

t=1 l

γjl (t)

page 138 Algorithm 5 , line 1 : Change “criterion θ”to “criterion θ, z

← 0”

page 143 Problem 11, second and third lines after first equation: Change “ p2 (x) by a normal p 1 (x) N (µ, Σ)” to “ p1 (x) by a normal p 2 (x) N (µ, Σ)”





page 143 Problem 11: Second equations: Change “

E ” to “ E ” in two places 2

1

page 143 Problem 11, last line: Change “over the density p2 (x)” to “over the density p1 (x)” page 147 Problem 22, line between the two equation s: Change “has a uniform” to “has a uniform distribution” page 148 Problem 27, part (a), line after the equat ion: Change “as given in Table 3.1.” to “as in Table 3.1.1.” page 149 Problem 31, line +1: Change “suppose a and b are constants” to “suppose a and b are positive constants” page 149 Problem 32, line +1 : Change “where the n coefficients” to “at a point x, where the n coefficients” page 150 Problem 34, line +4 : Change “the number of operations n” to “the maximum size n” page 151 Problem 38, line +1 : Change “ px (x ωi )” to “ px (x ωi )”

|

|

page 151 Problem 38 (b) bottom equation on page: Change “( µ1 µ 2 )2 ”

2

− µ ) ” to “( µ − 2

1

page 152 line +1 : Change “and” to “is maximized by” page 153 Problem 43, line +4 : Change “and the d mean vectors.” to “and the c mean vectors.” page 154 Problem 46, (top equation): Change “0

otherwise .” to “

otherwise.”


424

page 154 Problem 46, line +1 (after the equation): Change “missing feature values.” to “missing feature values and  is a very small positive constant that can be neglected when normalizing the density within the above bounds.” page 154 Problem 46, part (b): Add “You may make some simplifying assumptions.” page 154 Problem 47, first equation: Change “ e−θ1 x1 ” to “ e−x1 /θ1 ” page 156 Computer exercise 2, after the equation: Change “calculate a density” to “calculate the density” page 156 Computer exercise 2. After “x2 feature of category ω2 .” add “Assume your priors on the parameters are uniform throughout the range of the data.” page 157 Computer exercise 4, line -3: Change “apply it to the x1 – x2 components” to “apply it to the x 1 –x2 components” (i.e., eliminate spaces and note that the dash is not subtraction sign, but an n-dash) page 157 Computer exercise 6, part (a), line +2 : Change “in the Table above.” to “in the table above.” page 159 Computer exercise 13 table, sample 4 under ω 1 : Change “ AD” to “ ADB”

Chapter 4

page 172 Figure 4.9, caption, line -1 : Change “where I is the d to “where I is the d-by-d identity matrix.” page 173 Algorithm 1 , line 1 : Change “ j = 0” to “ j

× d idenity matrix.”

← 0”

page 173 Algorithm 2 , line 1 : Change “test pattern,” to “test pattern” page 178 line +3 : Change “Becuase” to “Because” page 179 Equation 41: Change “ the integral sign) page 184 lines +2 – 3 : Keep “ k break in this equation)



x



∈S ” to “ x ∈S

− i > i” on the same line (i.e., do not put a line

page 186 Algorithm 3 , line 1 : Change “ j = 0, “j 0, data set, n # prototypes”

← D←



” (i.e., place the limits underneath

= data set , n = # prototypes” to

D

page 187 Figure 4.18, caption, line -4 : Change “by a facto r 1/3” to “by a factor α = 1/3” page 188 First margin note: Change “ minkowski metric” to “ Minkowski metric” (i.e., capitalize the M in Minkowski) page 188 Second margin note: Change “ manhattan distance” to “ Manhattan distance” (i.e., capitalize the M in Manhattan)

Chapter 4

425

page 188 Third margin note: Change “ tanimoto metric” to “ Tanimoto metric” (i.e., capitalize the T in Tanimoto) page 189 line +5 : Change “the relative shift is a mere” to “the relative shift s is a mere” page 192 Figure 4.23, caption, lines -4– -2 : Eliminate the second-to-last sentence. page 193 Equation 61: Replace the full equation with “ µx (x) µy (y)”

·

page 194 line +19 : Change “beief about memberships” to “belief about mem berships” page 195 line +9 : Change in the title “ 4.8 REDUCED COULOMB ” to “ *4.8 REDUCED COULOMB” page 196 Algorithm 4, line 1: Change “ j = 0, n = # patterns ,  = small parameter, λm = max radius” to “j 0, n # patterns,  small parameter, λm max radius”









page 197 Algorithm 5 , line 1 : Change “ j = 0, k = 0, x = test pattern , to “j 0, k 0, x test pattern, t ”







D ← {}

D

t

=

{}”

page 199 line -7 : Change “prior knowledge.” to “prior beliefs.” page 202 Problem 6c, first line: Change “increases” to “increase” page 203 Problem 11, part (d), line +1 : Change “close to an edge” to “clos e to a face” page 203 Problem 11, part (d) line +4 : Change “closer to an edge” to “closer to a face” page 203 Problem 11, part (d), line +5: Change “even though it is easier to calculate here” to “and happens to be easier to calculate” page 204 Problem 13 line +1 : Change “fro m the distr ibutions” to “with priors P (ω1 ) = P (ω2 ) = 0.5 and the distributions” page 205 Problem 21, part (a), line +1 : Change “As giv en in the text, take” to “Follow the treatment in the text and take” page 206 Problem 23 part (d) line +2 : Change “space, then the b” to “space, then in the b” page 206 Problem 26, line +4 : Change “and find its nearest” to “and seek its nearest” page 207 Problem 26 part (d), first line: change “Calculate the proba bility” to “Estimate through simulation the probability” page 207 Problem 27, part (a) equation: Replace current equation with “DTanimoto ( n1 +n2 −2n12 n1 +n2 −n12 ,” page 207 Problem 29, first equation: Change “ δi ” to “ δ1 ” in all three places page 207 Problem 29, second equation: Change “ Cˆ (x, µi ” to “ Cˆ (x; µi ”

S ,S )= 1

2


426

page 207 Problem 29 second equation: Change “ δi ” to “ δ2 ” in all three places page 207 Problem 29 line -5 : Change “we have for the length δ i = 5” to “we have for the length δ 1 = 5” page 207 Problem 29 line -4 : Change “and for lightn ess δj = 30” to “and for the lightness δ 2 = 30”

Chapter 5 page 218 line +6 : Change “the problem to c page 218 Equation 2: Change “

wt x

i”

to “

− 1” to “the problem to

c”

wit x”

page 219 second line after the second (unnumbered) equation: Change “is given by (gi gj )/ wi wj ” to “is given by ( gi (x) gj (x))/ wi wj ”



 − 



 − 

page 220 sentence before Eq. 5, Change “this in turn suggest” to “this in turn suggests” pages 221–222 (across the page break): Change “mul-tilayer” to “multi-layer” (i.e., hyphenate as “multi-layer”)

← 0” page 228 Algorithm 3 , line 1 : Change “ k = 0” to “ k ← 0” page 225 Algorithm 1 , line 1 : Change “ k = 0” to “ k

page examination” 229 line +5 : Change “We shall begin our exam ination” to “We begin our page 229 line -3 : Change “Thus we shall denote” to “Thus we denote” page 230 Algorithm 4 , line 1 : Change “ k = 0” to “ k

← 0”

page 230 fourth line before Theorem 5.1 : Change “correction is clearly moving” to “correction is hence moving” page 230 line -2 : Change “From Eq. 20,” to “From Eq. 20 we have” page 232 first line after Eq. 24: Change “Because the squared distance” to “Because this squared distance”

← 0” page 233 Algorithm 6 , line 1 : Change “ k = 0” to “ k ← 0” page 233 Algorithm 5 , line 1 : Change “ k = 0” to “ k

page 234 line +22 (counting from the end of the Algorithm): Change “that it will have little effect at all.” to “that it will have little if any effect.” page 235 lines +3 –4 (counting from the end of the Algorithm): Change “and this means the “gap,” determined by these two vectors, can never increase in size for separable data.” to “and this means that for separable data the “gap,” determined by these two vectors, can never increase in size.” page 235 second and third lines after the secti on title 5.6 RELAXATION PROCEDURES: Change “in so-called “relaxation procedures” to include” to “in so-called “relaxation procedures,” to include”

Chapter 5

427

page 235 first and second lines after Eq. 32: Change “misclassified by a, as do Jp , Jq focus attention” to “misclassified by a. Both J p and J q focus attention” page 235 second line after Eq. 32: Change “Its chief difference is that its gradient” to “The chief difference is that the gradient of J q is” page 236 Algorithm 8 , line 1 : Change “ k = 0” to “ k

← 0”

page 236 Algorithm 8 , line 4 : Change “ j = 0” to “ j

0”

page 236 Algorithm 9 , line 1 : Change “ k = 0” to “ k ← ← 0” page 238 line -5 : Change “ procedures, because” to “ procedures because” page 242 line -2 : Change “We begin by writing Eq. 47” to “We begin by writing Eq. 45” page 246 lines +1 – 2 : Don’t split the equation by the line break



page 246 first (unnumbered) equation, second line: Change “(Yak b).” to (Ya(k) b).”



page 246 margin note: Change “lms rule” to “LMS rule” (i.e., capitalize “ LMS”) t k

t

k

− a(k) y )” to “( b(k) − a (k)y )” page 246 Algorithm 10 , line 1 : Change “ k = 0” to “ k ← 0” page 246 Equation 61, second line: Change “( bk

page the 248MSE-optimal” second line after Eq. 66: Change “obtain the MSE optimal” to “obtain page 250 Equation 79: Co-align vertically the “>” in the top equation with the “=” in the bottom equation page 250 Equation 79: Change “ a(k)” to “ b(k)” page 251 Algorithm 11, line 5 : Change “ a” to “ b” page 252 top equation: Change “= 0 =” to “= 0 =” (i.e., de-bold the “ 0”) page 253 line -7 : Change “requires that e + (k) = 0 for” to “requires that e + (k) = 0 for” page 254 line after Eq. 90: Change “constant, positive-definite” to “constant, symmetric, positive-definite” page 255 First (unnumbered) equation on page: Change the first term in parentheses 2

t

2

t

t

on the right-hand side from “ η YRY YRY” to “ η YRY YRY ” page 255 Eq. 92: Last term on the right-hand-side, change “η 2 RY t R” to “η 2 RY t YR” page 257 Figure 5.18, caption, line +2 : Change “form Au β ” to “form Au = β ” page 262 fourth line after Eq. 105: Change “with the largest margin” to “with the largest margin” (i.e., italicize “largest”) page 263 fourth line after Eq. 107: Change “equation in Chapter 9,” to “topic in Chapter 9,”


428 page 266 Equation 114: Change “ ati (k)yk

t k

t i

k

t j

k

≤ a (k) y .” to “ a (k)y ≤ a (k)y .” j

page 270 twelfth line under BIBLIOGRAPHICAL AND HISTORICAL REMARKS: Change “error-free case [7] and [11] and” to “error-free case [7,11] and” page 270 line -13: Change “support vector machines” to “Support Vector Machines” page 270 line -8 : Change “support vector machines” to “Support Vector Machines”

≤ λ ≤ 1.” to “for 0 ≤ λ ≤ 1.” Problem 8, line +2 : Change “if a y ≥ 0” to “if a y ≥ b”

page 271 Problem 2, line +3 : Change “if 0 page 272

t

i

t

i

page 274 Problem 22, second term on the right-hand side change “( at y λ22 ))2 ” to “( at y + (λ12 λ22 ))2 ”



− (λ − 12

page 275 Problem 27, last line: Change “by Eq. 85.” to “by Eq. 95.” page 277 lines +2 – 3 : Change “satisfies z k at yk = 0” to “satisfies z k at yk = 1” page 277 Problem 38, line -2 : Change “procedures Per ceptron” to “procedures. Generalize the Perceptron” page 278 Problem 1, part (a): change “data in in order” to “data in order” page 278 Computer exercise 2, part (a): Change “Starting with a = 0,” to “Starting with a = 0,” (i.e., make bold the “0”) page 278 First heading after the table: Change “ Section 5.4 ” to “ Section 5.5 ” page 278 Computer Exercise 1: Change “(Algorithm 1) and Newton’s algorit hm (Algorithm 2) applied” to “(Algorithm 1) and the Perceptron criterion (Eq. 16)” page 278 Second heading after the table: Delete “ Section 5.5 ” page 279 line +1 : Change “length is greater than the pocket” to “length is greater than with the pocket” page 279 Computer exercise 4, part (a), line +2 : Change “and µ1 = 0,” to “and µ1 = 0,” (i.e., make bold the “0”)

Chapter 6 page 286 Equation 5: Change “ zk = f (netk ).” to “ zk = f (netk ),” page 287 line +2 : Change “all identical.” to “all the same.” page 288 line -3 : Change “depend on the” to “depends on the” page 291 Two lines before Eq. 3: Change “hidden-to-output weights, wij ” to “hiddento-output weights, w kj ” page 292 Equation 19, first line (inside brackets): Change “1/2” to “ 12 ” (i.e., typeset as a full fraction) page 292 After Eq. 19, line +3 : Change “activiation” to “activation”

Chapter 6

429

page 293 Figure 6.5: Change “ wij ” to “ wji ” page 294 Algorithm 2, line 3 : Change “∆ wkj

←” to “∆ w ← 0” kj

page 295 line -5 : Change “In addi tion to the use of the training set, here are” to “In addition to the use of the training set, there are” page 299 line -1 : Change “and can be linearly separable” to “and are linearly separable” page 305 line 10: Change “ratio of such priors.” to “ratio of such priors, though this need not ensure minimal error.” page 306 sixth and seventh line after Eq. 33: Change “in a sum squared error sense” to “in a sum-squared-error sense” page 307 lines +2 – 3 : Change “been found useful” to “found to be useful” page 307 line +6 : Change “as continuity of f and its derivative” to “as continuity of f ( ) and its derivative”

·

page 308 line +10 : Change “that is,” to “or is an “odd” function, that is,” page 308 line +19 : Change “values that ensure f  (0) f  (0) 0.5”



 1” to “values that ensure

page 315 first line after Eq. 38: Change “reducing the criter ion” to “reducing the error” page 316 Figure 6.19 caption, line -1 : Change “trained network.” to “trained network (red).” page 318 line +8 : Change “to compute is nonnegative” to “to compute, is nonnegative” page 318 margin note: Change “ minkowski error” to “ Minkowski error” (i.e., captialize “M” in “ Minkowski”) page 320 line between Eqs. 50 and 51: Change “The optimimum change” to “Therefore, the optimum change” page 319 Equation 48: Change last entry from “f  (net)ynH xd ” to “f  (net)f  (netnH xd ” page 322 Equation 56: Replace the current equation by t

βm =

∇J (w(m))∇J (w(m)) ∇J (w(m − 1))∇J (w(m − 1)) t

(56)

page 322 Equation 57: Replace the current equation by t

βm =

∇J (w(m))[∇J (w(m)) − ∇J (w(m − 1))] ∇J (w(m − 1))∇J (w(m − 1)) t

(57)

page 323 Fourth (unnumbered) equation on the page: Replace the left portion by t

β1 =

∇J (w(1))∇J (w(1)) = ∇J (w(0))∇J (w(0)) t

430


page 323 Caption to bottom figure, line +2 : Change “shown in the conto ur plot,” to “shown in the density plot,” page 325 last line of the section Special Bases : Add “This is very close ly related to model-dependent maximum-likelihood techniques we saw in Chapter 3.” page 326 first line after Eq. 63: Change “of the filter in analogy” to “of the filter, in analogy” page 326 line -5 : Add a red margin note “ time delay neural network” page 330 three lines above Eq. 67: Change “write the error as the sum” to “write the new error as the sum” page 332 Figure 6.28, caption, line +1: Change “a functio n of weights, J (w)” to “a function of weights, J (w),” page 337 Problem 8, pagt (b) line +2 : Change “if the sign if flipped” to “if the sign is flipped” page 337 Problem 14, part (c), line +2 : Change “the 2 identity”

× 2 identity” to “the 2-by-2

page 339 Problem 22, line +4: Add “Are the discriminant functions independent?” page 341 Problem 31, lines +1 – 2 : Change “for a sum squared error criterion” to “for a sum-square-error criterion” page 344 Computer exercise 2, line +2 : Change “backpropagation to (Algorithm 1)” to “backpropagation (Algorithm 1)” page 345 Computer exercise 7, line +2 : Change “on a random problem.” to “on a two-dimensional two-category problem with 2k patterns chosen randomly from the unit square. Estimate k such that the expected error is 25% . Discuss your results. page 346 Computer exercise 10, part (c), line +1 : Change “Use your network” to “Use your trained network” page 347 Column 2, entry for [14], line +5: Change “volume 3. Morgan Kaufmann” to “volume 3, pages 853–859. Morgan Kaufmann” page 348 column 2, entry for [43], line +4 : Change “2000.” to “2001.”

Chapter 7 page 351 fourteenth line after Eq. 1: Change “of the magnets with the most stable configuration” to “of the magnets that is the most stable” page 351 footnote, line -1 : Change “in a range of proble m domains.” to “in many problem domains.” page 352 Figure 7.1, caption, line +7 : Change “While our convention” to “While for neural nets our convention”

Chapter 7

431

page 352 Figure 7.1, caption, line +8 : Change “Boltzmann networks is” to “Boltzmann networks here is” page 352 Caption to Figure 7.1, line -1 : Change “0

≤ α ≤2

10 ”

to “0

≤α