# Mathematics For Machine Learning (MML) Official Solutions (Instructor's Solution Manual) 9781108455145, 9781108470049, 9781108569323, 1108470041, 1108569323, 110845514X

the official solution manual for https://mml-book.com from https://www.cambridge.org/us/academic/subjects/computer-scien

27,301 2,057 838KB

English Pages 74 Year 2020

##### Citation preview

2 Linear Algebra

Exercises 2.1

We consider (R\{−1}, ?), where a ? b := ab + a + b,

a, b ∈ R\{−1}

(2.1)

a. Show that (R\{−1}, ?) is an Abelian group. b. Solve 3 ? x ? x = 15

in the Abelian group (R\{−1}, ?), where ? is defined in (2.1). a. First, we show that R\{−1} is closed under ?: For all a, b ∈ R\{−1}: a ? b = ab + a + b + 1 − 1 = (a + 1) (b + 1) −1 6= −1

| {z } | {z } 6=0

6=0

⇒ a ? b ∈ R\{−1}

Next, we show the group axioms Associativity: For all a, b, c ∈ R\{−1}: (a ? b) ? c = (ab + a + b) ? c = (ab + a + b)c + (ab + a + b) + c = abc + ac + bc + ab + a + b + c = a(bc + b + c) + a + (bc + b + c) = a ? (bc + b + c) = a ? (b ? c)

Commutativity: ∀a, b ∈ R\{−1} : a ? b = ab + a + b = ba + b + a = b ? a

Neutral Element: n = 0 is the neutral element since ∀a ∈ R\{−1} : a ? 0 = a = 0 ? a

Inverse Element: We need to find a ¯, such that a ? a ¯=0=a ¯ ? a. a ¯ ? a = 0 ⇐⇒ a ¯a + a + a ¯=0 ⇐⇒ a ¯(a + 1) = −a a 1 a6=−1 ⇐⇒ a ¯=− = −1 + 6= −1 ∈ R\{−1} a+1 a+1

468 This material will be published by Cambridge University Press as Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. This pre-publication version is free to view and download for personal use only. Not for re-distribution, re-sale or use in derivac tive works. by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2020. https://mml-book.com.

469

Exercises b. 3 ? x ? x = 15 ⇐⇒ 3 ? (x2 + x + x) = 15 ⇐⇒ 3x2 + 6x + 3 + x2 + 2x = 15 ⇐⇒ 4x2 + 8x − 12 = 0 ⇐⇒ (x − 1)(x + 3) = 0 ⇐⇒ x ∈ {−3, 1}

2.2

Let n be in N\{0}. Let k, x be in Z. We define the congruence class k¯ of the integer k as the set k = {x ∈ Z | x − k = 0 (modn)} = {x ∈ Z | ∃a ∈ Z : (x − k = n · a)} .

We now define Z/nZ (sometimes written Zn ) as the set of all congruence classes modulo n. Euclidean division implies that this set is a finite set containing n elements: Zn = {0, 1, . . . , n − 1} For all a, b ∈ Zn , we define a ⊕ b := a + b

a. Show that (Zn , ⊕) is a group. Is it Abelian? b. We now define another operation ⊗ for all a and b in Zn as a ⊗ b = a × b,

(2.2)

where a × b represents the usual multiplication in Z. Let n = 5. Draw the times table of the elements of Z5 \{0} under ⊗, i.e., calculate the products a ⊗ b for all a and b in Z5 \{0}. Hence, show that Z5 \{0} is closed under ⊗ and possesses a neutral element for ⊗. Display the inverse of all elements in Z5 \{0} under ⊗. Conclude that (Z5 \{0}, ⊗) is an Abelian group. c. Show that (Z8 \{0}, ⊗) is not a group. d. We recall that the B´ezout theorem states that two integers a and b are relatively prime (i.e., gcd(a, b) = 1) if and only if there exist two integers u and v such that au + bv = 1. Show that (Zn \{0}, ⊗) is a group if and only if n ∈ N\{0} is prime. a. We show that the group axioms are satisfied: Closure: Let a, b be in Zn . We have: a⊕b=a+b = (a + b)

mod n

by definition of the congruence class, and since [(a + b) mod n] ∈ {0, . . . , n − 1}, it follows that a ⊕ b ∈ Zn . Thus, Zn is closed under ⊕. c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

470

Linear Algebra Associativity: Let c be in Zn . We have: (a ⊕ b) ⊕ c = (a + b) ⊕ c = (a + b) + c = a + (b + c) = a ⊕ (b + c) = a ⊕ (b ⊕ c)

so that ⊕ is associative. Neutral element: We have a+0=a+0=a=0+a

so 0 is the neutral element for ⊕. Inverse element: We have a + (−a) = a − a = 0 = (−a) + a

and we know that (−a) is equal to (−a) mod n which belongs to Zn and is thus the inverse of a. Commutativity: Finally, the commutativity of (Zn , ⊕) follows from that of (Z, +) since we have a ⊕ b = a + b = b + a = b ⊕ a,

which shows that (Zn , ⊕) is an Abelian group. b. Let us calculate the times table of Z5 \{0} under ⊗: ⊗

1

2

3

4

1 2 3 4

1 2 3 4

2 4 1 3

3 1 4 2

4 3 2 1

We can notice that all the products are in Z5 \{0}, and that in particular, none of them is equal to 0. Thus, Z5 \{0} is closed under ⊗. The neutral element is 1 and we have (1)−1 = 1, (2)−1 = 3, (3)−1 = 2, and (4)−1 = 4. Associativity and commutativity are straightforward and (Z5 \{0}, ⊗) is an Abelian group. c. The elements 2 and 4 belong to Z8 \{0}, but their product 2 ⊗ 4 = 8 = 0 does not. Thus, this set is not closed under ⊗ and is not a group. d. Let us assume that n is not prime and can thus be written as a product n = a × b of two integers a and b in {2, . . . , n − 1}. Both elements a and b belong to Zn \{0} but their product a ⊗ b = n = 0 does not. Thus, this set is not closed under ⊗ and (Zn \{0}, ⊗) is not a group. Let n be a prime number. Let a and b be in Zn \{0} with a and b in {1, . . . , n − 1}. As n is prime, we know that a is relatively prime to n, and so is b. Let us then take four integers u, v, u0 and v 0 , such that au + nv = 1 bu0 + nv 0 = 1 .

We thus have that (au + nv)(bu0 + nv 0 ) = 1, which we can rewrite as ab(uu0 ) + n(auv 0 + vbu0 + nvv 0 ) = 1

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

471

Exercises

By virtue of the B´ezout theorem, this implies that ab and n are relatively prime, which ensures that the product a ⊗ b is not equal to 0 and belongs to Zn \{0}, which is thus closed under ⊗. The associativity and commutativity of ⊗ are straightforward, but we need to show that every element has an inverse. First, the neutral element is 1. Let us again consider an element a in Zn \{0} with a in {1, . . . , n − 1}. As a and n are coprime, the B´ ezout theorem enables us to define two integers u and v such that (2.3)

au + nv = 1 ,

which implies that au = 1 − nv and thus au = 1

(2.4)

mod n ,

which means that a ⊗ u = au = 1, or that u is the inverse of a. Overall, (Zn \{0}, ⊗) is an Abelian group. Note that the B´ezout theorem ensures the existence of an inverse without yielding its explicit value, which is the purpose of the extended Euclidean algorithm. 2.3

Consider the set G of 3 × 3 matrices defined as follows:      1 x z G = 0 1 y  ∈ R3×3 x, y, z ∈ R   0 0 1 We define · as the standard matrix multiplication. Is (G, ·) a group? If yes, is it Abelian? Justify your answer. Closure: Let a, b, c, x, y and z be in R and let us define A and B in G as     1 A = 0 0

x 1 0

z y , 1



a 1 0

1 B = 0 0

a 1 0

c b . 1

Then, 

1 A · B = 0 0

x 1 0

z 1 y  0 1 0

c 1 b  = 0 1 0

a+x 1 0

c + xb + z b+y  . 1

Since a + x, b + y and c + xb + z are in R we have A · B ∈ G . Thus, G is closed under matrix multiplication. Associativity: Let α, β and γ be in R and let C in G be defined as   1 C = 0 0

α 1 0

γ β . 1

It holds that 

a+x 1 0

α+a+x 1 0

1 (A · B) · C = 0 0 1 = 0 0



c + xb + z 1 b + y  0 1 0

α 1 0

γ β 1

γ + αβ + xβ + c + xb + z . β+b+y 1

c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

472

Linear Algebra Similarly, 

x 1 0

α+a+x 1 0

1 A · (B · C) = 0 0 1 = 0 0

 

z 1 y  · 0 1 0

α+a 1 0

γ + αβ + c β+b  1

γ + αβ + xβ + c + xb + z  = (A · B) · C . β+b+y 1

Therefore, · is associative. Neutral element: For all A in G , we have: I 3 · A = A = A · I 3 and thus I 3 is the neutral element. Non-commutativity: We show that · is not commutative. Consider the matrices X, Y ∈ G , where     1 X = 0 0

1 1 0

0 0 , 1

1 Y = 0 0

0 1 0

0 1 . 1

(2.5)

Then, 

1 1 0

1 1 , 1

1 1 0

0 1 6= X · Y . 1

1 X · Y = 0 0 1 Y · X = 0 0

Therefore, · is not commutative. Inverse element: Let us look for a right inverse A−1 r of A. Such a matrix should satisfy AA−1 = I . We thus solve the linear system [A|I 3 ] that 3 r we transform into [I 3 |A−1 ] : r       

1 0 0

x 1 0

z y 1

1 0 0

0 1 0

0 −zR3 0  −yR3 1

1 0 0

x 1 0

0 0 1

1 0 0

0 1 0

−z −xR2 −y  1

1 0 0

0 1 0

0 0 1

1 0 0

−x 1 0

Therefore, we obtain the right inverse  A−1 r

1 = 0 0

−x 1 0

xy − z −y  . 1

xy − z −y  ∈ G 1

Because of the uniqueness of the inverse element, if a left inverse A−1 l exists, then it is equal to the right inverse. But as · is not commutative, Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

473

Exercises

we need to check manually that we also have AA−1 r = I 3 , which we do next:     −x 1 0

1  A−1 A = 0 r 0

−z + xy 1 −y  · 0 1 0

z y 1

x−x 1 0

1 = 0 0

x 1 0

z − xy − z + xy  = I3 . y+y 1

Thus, every element of G has an inverse. Overall, (G, ·) is a non-Abelian group. 2.4

Compute the following matrix products, if possible: a. 

1 4 7

2 1 5 0 8 1

1 1 0

0 1 1

This matrix product is not defined. Highlight that the neighboring dimensions have to fit (i.e., m × n matrices need to be multiplied by n × p (from the right) or k × m matrices (from the left).) b. 

3 9 15

5 11 17

7 13 10

9 15 12

1 4 7

2 5 8

1 3 6 0 1 9



1 1 0

4 0 1 = 10 16 1

1 1 0

1 0 1 4 7 1



2 5 8

5 3 6 = 11 8 9

c. 1 0 1

d. 

1 4

2 1

  0 2  1 −4 2

1 −1

3  −1  = 14 1 −21 2

5



6 2

e. 

0 1  2 5

2.5

3  −1  1 1 4 2

2 1

1 −1

12 −3 2 = 6 −4 13



−3 2 1 3

3 1 5 12

−12 6   0  2

Find the set S of all solutions in x of the following inhomogeneous linear systems Ax = b, where A and b are defined as follows: a. 

1 2 A= 2 5

1 5 −1 2

−1 −7 1 −4

−1 −5 , 3 2

1 −2  b= 4 6

c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

474

Linear Algebra We apply Gaussian elimination to the augmented matrix  

5

1 −1 −1 1 5 −7 −5 −2   −2R1 4  −2R1 −1 1 3 −5R1 2 −4 2 6

1  0   0 0

1 −1 −1 1 − 13 R2 1 3 −5 −3 −4   |· 3 2  +R2 −3 3 5 −3 1 7 1 +R2

1

 2   2 

1  0   0 0

0 1 0 0

2 3

7 0 3 −1 − 43   −2 2 −2  −4 4 −3 −2R3

1  0   0 0

− 35

0 1 0 0

7 0 3 −1 − 43   −2 2 −2  0 0 1 2 3

− 53

The last row of the final linear system shows that the equation system has no solution and thus S = ∅. b. 

1 1 A= 2 −1

−1 1 −1 2

0 0 0 0

0 −3 1 −2

1 0 , −1 −1

3 6  b= 5 −1

We start again by writing down the augmented matrix and apply Gaussian elimination to obtain the reduced row echelon form:   −1 1 −1 −1 2

0 0 0 0

3 0 1 −3 0 6   −R1 1 −1 5  −2R1 +R1 −2 −1 −1

−1 2 1 1

0 0 0 0

0 1 3 −3 −1 3   1 −3 −1  −2 0 2

0

0 1 0 0

0 0 0 0

1  0   0 0

0 1 0 0

0 0 0 0

1 1 1 0

−2 2 −R3 −3 −1   −R3 −1 −1  0 0

0 1 0

0 0 0

0 0 1

−1 3 −2 0  −1 −1

1

 1   2 

1  0   0 0

1

 0   0

1  0 0

+R3 −2R3

swap with R2 −R3

1 −2 2 1 −3 −1   −5 5 5  ·(− 51 ) 3 + 53 R3 −3 3

From the reduced row echelon form we apply the “Minus-1 Trick” in order to get the following system: Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

475

Exercises

     

1 0 0 0 0

0 1 0 0 0

0 0 −1 0 0

−1 3 −2 0   0 0   −1 −1  −1 0

0 0 0 1 0

The right-hand side of the system yields us a particular solution while the columns corresponding to the −1 pivots are the directions of the solution space. We then obtain the set of all possible solutions as         3 0 −1        0 0 −2          6  + λ1 −1 + λ2  0  λ1 , λ2 ∈ R . S := x ∈ R : x =  0           −1 0 −1       0

2.6

−1

0

Using Gaussian elimination, find all solutions of the inhomogeneous equation system Ax = b with     0 A = 0 0

1 0 1

0 0 0

0 1 0

1 1 0

0 0 , 1

2 b = −1 . 1

We start by determining the reduced row echelon form of the augmented matrix [A|b].   0

1 0 1

0 0 0

0 1 0

1 1 0

0 0 1

2 −1  1 −R1

0  0 0

1 0 0

0 0 0

0 1 0

1 1 −1

0 0 1

2 +R3 −1  +R3 ·(−1) −1

1 0 0

0 0 0

0 1 0

0 0 1

0

 0

0  0 0

1 1 1 −2  −1 1

This augmented matrix is now in reduced row echelon form. Applying the “Minus-1 trick” gives us the following augmented matrix: −1  0   0   0   0 0

0 1 0 0 0 0

0 0 −1 0 0 0

0 0 0 1 0 0

0 0 0 0 1 0

0 0 1 1   0 0   1 −2   −1 1  −1 0

The right-hand side of this augmented matrix gives us a particular solution while the columns corresponding to the −1 pivots are the span the solution c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

476

Linear Algebra space S . Therefore, we obtain the general solution                    0 −1 0 0                 1 0 0 1               0 −1  0  0         S =   + λ1   + λ2   + λ3   λ1 , λ2 , λ3 ∈ R −2     0 0  1      1 0 0 −1           0 0 0 −1       {z } | solves Ax=0

Now, we find a particular solution that solves the inhomogeneous equation system. From the RREF, we see that x2 = 1, x4 = −2, x5 = 1 + x6 and x1 , x3 , x6 ∈ R are free variables (they correspond to variables that belong to the non-pivot columns of the augmented matrix). Therefore, a particular solution is 

0 1   0 6  xp =  −2 ∈ R .   1 0

The general solution adds solutions from the homogeneous equation system Ax = 0. We can use the RREF of the augmented system to read out these solutions by using the Minus-1 Trick. Padding the RREF with rows containing -1 on the diagonal gives us −1 0  0  0  0 0

0 1 0 0 0 0

0 0 −1 0 0 0

0 0 0 1 0 0

0 0 0 0 1 0

0 1  0 . 1  −1 −1

This yields the general solution −1 0 0 0 0 1       0 −1 0      xp + λ 1   + λ 2   + λ 3   1 , 0 0       0 0 −1 0 0 −1

|

{z

solves Ax=0

2.7

λ1 , λ2 , λ3 ∈ R .

}

x1 Find all solutions in x = x2  ∈ R3 of the equation system Ax = 12x, x3

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

477

Exercises where 

6 A = 6 0

4 0 8

3 9 0

P and 3i=1 xi = 1. We start by rephrasing the problem into solving a homogeneous system of linear equations. Let x be in R3 . We notice that Ax = 12x is equivalent to (A − 12I)x = 0, which can be rewritten as the homogeneous system ˜ = 0, where we define Ax   −6 ˜ = 6 A 0

4 −12 8

3 9 . −12

P The constraint 3i=1 xi = 1 can be transcribed as a fourth equation, which leads us to consider the following linear system, which we bring to reduced row echelon form:   −6

            

4

3

0

 1 0   ·3  1 0   ·4

6

−12

9

0

8

−12

1

1

1

1

0

−8

12

0

2

−4

3

0

2

−3

+R2

+4R3

 0   +2R3  0  

1

1

1

1

0

0

0

0

     

2

0

−3

0

2

−3

 

1 0   ·2

1 0   ·2

1

1

1

1

1

0

− 23

0

  

0

1

− 23

0  

1

1

1

  −R1 − R2

1

1

0

− 32

0

+( 83 )R3

1

0

0

  

0

1

− 23

3 0   +( 8 )R3

  

0

1

0

0

0

4

0

0

1

1

· 14

3 8 3 8 1 4

   

Therefore, we obtain the unique solution 3   8

x =  38  = 1 4

3 1  3 . 8 2

c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

478 2.8

Linear Algebra Determine the inverses of the following matrices if possible: a. 

2 A = 3 4

3 4 5

4 5 6

To determine the inverse of a matrix, we start with the augmented matrix [A | I] and transform it into [I | B], where B turns out to be A−1 : 

2

3

4

1

0

0

  

3

4

5

0

1

3 0   − 2 R1

4

5

6

0

0

1

       

−2R1

2

3

4

1

0

0

0

− 21

−1

− 23

1

1 0   − 2 R3

0

−1

−2

−2

0

1

2

3

4

1

0

0

1

− 12

0

−1

0

0

0

− 21

0

1

2

2

·(−1)

  . 

Here, we see that this system of linear equations is not solvable. Therefore, the inverse does not exist. b. 

1 0 A= 1 1

0 1 1 1

1 1 0 1

0 0  1 0

1

0

1

0

1

0

0

0

     

0

1

1

0

0

1

0

0  

1

1

0

1

0

0

1

0   −R1

1

1

1

0

0

0

0

1

 

−R1

1

0

1

0

1

0

0

0

     

0

1

1

0

0

1

0

0   −R4

0

1

−1

1

−1

0

1

0   −R4

0

1

0

0

−1

0

0

1

 

swap with R2

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

479

Exercises       

1

0

1

1

0

0

0

0

1

0

0

−1

0

0

0

0

−1

1

0

0

1

0

−R4

 1    −1   +R4 swap with R3 −1 

0

0

1

0

1

1

0

1

0

0

0

0

−1

0

     

0

1

0

0

−1

0

0

1  

0

0

1

0

1

1

0

−1  

0

0

0

1

1

1

1

−2

1

 

Therefore, 

A−1

2.9

0 −1 = 1 1

−1 0 1 1

0 0 0 1

1 1   −1 −2

Which of the following sets are subspaces of R3 ? a. A = {(λ, λ + µ3 , λ − µ3 ) | λ, µ ∈ R} b. B = {(λ2 , −λ2 , 0) | λ ∈ R} c. Let γ be in R. C = {(ξ1 , ξ2 , ξ3 ) ∈ R3 | ξ1 − 2ξ2 + 3ξ3 = γ} d. D = {(ξ1 , ξ2 , ξ3 ) ∈ R3 | ξ2 ∈ Z} As a reminder: Let V be a vector space. U ⊆ V is a subspace if 1. U 6= ∅. In particular, 0 ∈ U . 2. ∀a, b ∈ U : a + b ∈ U Closure with respect to the inner operation 3. ∀a ∈ U, λ ∈ R : λa ∈ U Closure with respect to the outer operation The standard vector space properties (Abelian group, distributivity, associativity and neutral element) do not have to be shown because they are inherited from the vector space (R3 , +, ·). Let us now have a look at the sets A, B, C, D. a. 1. We have that (0, 0, 0) ∈ A for λ = 0 = µ. 2. Let a = (λ1 , λ1 + µ31 , λ1 − µ31 ) and b = (λ2 , λ2 + µ32 , λ2 − µ32 ) be two elements of A, where λ1 , µ1 , λ2 , µ2 ∈ R. Then, a + b = (λ1 , λ1 + µ31 , λ1 − µ31 ) + (λ2 , λ2 + µ32 , λ2 − µ32 ) = (λ1 + λ2 , λ1 + µ31 + λ2 + µ32 , λ1 − µ31 + λ2 − µ32 ) = (λ1 + λ2 , (λ1 + λ2 ) + (µ31 + µ32 ), (λ1 + λ2 ) − (µ31 + µ32 )) ,

which belongs to A. 3. Let α be in R. Then, α(λ, λ + µ3 , λ − µ3 ) = (αλ, αλ + αµ3 , αλ − αµ3 ) ∈ A .

Therefore, A is a subspace of R3 . c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

480

Linear Algebra b. The vector (1, −1, 0) belongs to B , but (−1) · (1, −1, 0) = (−1, 1, 0) does not. Thus, B is not closed under scalar multiplication and is not a subspace of R3 . c. Let A ∈ R1×3 be defined as A = [1, −2, 3]. The set C can be written as: C = {x ∈ R3 | Ax = γ} .

We can first notice that 0 belongs to B only if γ = 0 since A=0. Let thus consider γ = 0 and ask whether C is a subspace of R3 . Let x and y be in C . We know that Ax = 0 and Ay = 0, so that A(x + y) = Ax + Ay = 0 + 0 = 0 .

Therefore, x + y belongs to C . Let λ be in R. Similarly, A(λx) = λ(Ax) = λ0 = 0

Therefore, C is closed under scalar multiplication, and thus is a subspace of R3 if (and only if) γ = 0. d. The vector (0, 1, 0) belongs to D but π(0, 1, 0) does not and thus D is not a subspace of R3 . 2.10 Are the following sets of vectors linearly independent? a. 

2 x1 = −1 , 3

1 x2 =  1  , −2

3 x3 = −3 8

To determine whether these vectors are linearly independent, we check if the 0-vector can be non-trivially represented as a linear combination of x1 , . . . , x3 . Therefore, we try to solve the homogeneous linear equation P system 3i=1 λi xi = 0 for λi ∈ R. We use Gaussian elimination to solve Ax = 0 with   2 A = −1 3

1 1 −2

3 −3 , 8

which leads to the reduced row echelon form   1

0 0

0 1 0

2 −1 . 0

This means that A is rank deficient/singular and, therefore, the three vectors are linearly dependent. For example, with λ1 = 2, λ2 = −1, λ3 = P3 −1 we have a non-trivial linear combination i=1 λi xi = 0. b.  

1 2    x1 =  1 , 0 0

 

1 1    x2 =  0 , 1 1

  1

0     x3 =  0  1  1

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

481

Exercises

Here, we are looking at the distribution of 0s in the vectors. x1 is the only vector whose third component is non-zero. Therefore, λ1 must be 0. Similarly, λ2 must be 0 because of the second component (already conditioning on λ1 = 0). And finally, λ3 = 0 as well. Therefore, the three vectors are linearly independent. An alternative solution, using Gaussian elimination, is possible and would lead to the same conclusion. 2.11 Write 

1 y = −2 5

as linear combination of  

 

1 x1 = 1 , 1

1 x2 = 2 , 3

2 x3 = −1 1

P We are looking for λ1 , . . . , λ3 ∈ R, such that 3i=1 λi xi = y . Therefore, we need to solve the inhomogeneous linear equation system   1

 1 1

1 2 3

1 2 −1 −2  1 5

Using Gaussian elimination, we obtain λ1 = −6, λ2 = 3, λ3 = 2. 2.12 Consider two subspaces of R4 :            1 1  U1 = span[ −3 1

2 −1  , 0 −1

−1 1  , −1] , 1

−1 −2  U2 = span[ 2 1

2 −2  , 0 0

−3 6  , −2] . −1

Determine a basis of U1 ∩ U2 . We start by checking whether there the vectors in the generating sets of U1 (and U2 ) are linearly dependent. Thereby, we can determine bases of U1 and U2 , which will make the following computations simpler. We start with U1 . To see whether the three vectors are linearly dependent, we need to find a linear combination of these vectors that allows a nontrivial representation of 0, i.e., λ1 , λ2 , λ3 ∈ R, such that         1

−1

2

0

1 −1  1  0         λ1  −3 + λ2  0  + λ3 −1 = 0 . −1

1

1

0

We see that necessarily: λ3 = −3λ1 (otherwise, the third component can never be 0). With this, we get       1+3

2

0

 1−3  −1 0      λ1  −3 + 3 + λ2  0  = 0 1−3

−1

0

c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

482

Linear Algebra 

4

2

  0

−1 0 −2      ⇐⇒ λ1   0  + λ2  0  = 0 −1

−2

0

and, therefore, λ2 = −2λ1 . This means that there exists a non-trivial linear combination of 0 using spanning vectors of U1 , for example: λ1 = 1, λ2 = −2 and λ3 = −3. Therefore, not all vectors in the generating set of U1 are necessary, such that U1 can be more compactly represented as 

 

1 2  1  −1    U1 = span[ −3 ,  0 ] . 1 −1

Now, we see whether the generating set of U2 is also a basis. We try again whether we can find a non-trivial linear combination of 0 using the spanning vectors of U2 , i.e., a triple (α1 , α2 , α3 ) ∈ R3 such that 

 

0 −3 2 −1  6  0 −2 −2        α1   2  + α2  0  + α3 −2 = 0 . 0 −1 0 1

Here, we see that necessarily α1 = α3 . Then, α2 = 2α1 gives a non-trivial representation of 0, and the three vectors are linearly dependent. However, any two of them are linearly independent, and we choose the first two vectors of the generating set as a basis of U2 , such that 

 

−1 2 −2 −2    U2 = span[  2  ,  0 ] . 1 0

Now, we determine U1 ∩ U2 . Let x be in R4 . Then, x ∈ U1 ∩ U2 ⇐⇒ x ∈ U1 ∧ x ∈ U2





−1 2  −2 −2     ⇐⇒ ∃λ1 , λ2 , α1 , α2 ∈ R :  x = α1  2  + α2  0  1 0



1 2  1 −1     ∧ x = λ1 −3 + λ2  0  1 −1

−1 2  −2 −2     ⇐⇒ ∃λ1 , λ2 , α1 , α2 ∈ R :  x = α1  2  + α2  0  1 0 Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

483

Exercises 

1

2

−1

2



−1 −2 −2  1         ∧ λ1 −3 + λ2  0  = α1  2  + α2  0  −1

1

1

0

A general approach is to use Gaussian elimination to solve for either λ1 , λ2 or α1 , α2 . In this particular case, we can find the solution by careful inspection: From the third component, we see that we need −3λ1 = 2α1 and thus α1 = − 32 λ1 . Then: 

x ∈ U1 ∩ U2



−1 2 −2 −2  3      ⇐⇒ ∃λ1 , λ2 , α2 ∈ R : x = − 2 λ1   + α2   2 0  1 0



−1 2 2 1 −1 −2   1  3 −2         ∧ λ1 −3 + 2 λ1  2  + λ2  0  = α2  0  1 −1 0 1



−1 2      −2 −2 3     ⇐⇒ ∃λ1 , λ2 , α2 ∈ R :  x = − 2 λ1  2  + α2  0  1 0 2 2 − 21 −1 −2   −2        ∧ λ1  0  + λ2  0  = α2  0  5 −1 0 2



The last component requires that λ2 = 52 λ1 . Therefore, 

x ∈ U1 ∩ U2





−1 2      −2 −2 3     ⇐⇒ ∃λ1 , α2 ∈ R :  x = − 2 λ1  2  + α2  0  1 0

9  2 − 9   2



2  −2   ∧ λ1  0  = α2  0  0 0

−1 2      −2 −2 3 9     ⇐⇒ ∃λ1 , α2 ∈ R :  x = − 2 λ1  2  + α2  0  ∧ (α2 = 4 λ1 ) 1 0



−1 2      −2 −2 3   9   ⇐⇒ ∃λ1 ∈ R :  x = − 2 λ1  2  + 4 λ1  0  1 0 c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

484

Linear Algebra 

−1

2



−2 −2      ⇐⇒ ∃λ1 ∈ R :  x = −6λ1  2  + 9λ1  0  0

1

24

(mutiplied by 4)



 −6     ⇐⇒ ∃λ1 ∈ R :  x = λ1 −12 −6



4 −1    ⇐⇒ ∃λ1 ∈ R :  x = λ1 −2 −1

Thus, we have    

4 4    −1 −1    U1 ∩ U2 = λ1  −2 λ1 ∈ R = span[−2 , ]      −1 −1

i.e., we obtain vector space spanned by [4, −1, −2, −1]> . 2.13 Consider two subspaces U1 and U2 , where U1 is the solution space of the homogeneous equation system A1 x = 0 and U2 is the solution space of the homogeneous equation system A2 x = 0 with     1

1 A1 =  2 1

0 −2 1 0

1 −1 , 3 1

3

1 A2 =  7 3

−3 2 −5 −1

0 3 . 2 2

a. Determine the dimension of U1 , U2 . We determine U1 by computing the reduced row echelon form of A1 as   1

0  0 0

0 1 0 0

1 1 , 0 0

which gives us 

1 U1 = span[ 1 ] . −1

Therefore, dim(U1 ) = 1. Similarly, we determine U2 by computing the reduced row echelon form of A2 as   1

0  0 0

0 1 0 0

1 1 , 0 0

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

485

Exercises which gives us 

1 U2 = span[ 1 ] . −1

Therefore, dim(U2 ) = 1. b. Determine bases of U1 and U2 . The basis vector that spans both U1 and U2 is   1

1. −1

c. Determine a basis of U1 ∩ U2 . Since both U1 and U2 are spanned by the same basis vector, it must be that U1 = U2 , and the desired basis is   1 U1 ∩ U2 = U1 = U2 = span[ 1 ] . −1

2.14 Consider two subspaces U1 and U2 , where U1 is spanned by the columns of A1 and U2 is spanned by the columns of A2 with     1

1 A1 =  2 1

0 −2 1 0

1 −1 , 3 1

3

1 A2 =  7 3

−3 2 −5 −1

0 3 . 2 2

a. Determine the dimension of U1 , U2 We start by noting that U1 , U2 ⊆ R4 since we are interested in the space spanned by the columns of the corresponding matrices. Looking at A1 , we see that −d1 + d3 = d2 , where di are the columns of A1 . This means that the second column can be expressed as a linear combination of d1 and d3 . d1 and d3 are linearly independent, i.e., dim(U1 ) = 2. Similarly, for A2 , we see that the third column is the sum of the first two columns, and again we arrive at dim(U2 ) = 2. Alternatively, we can use Gaussian elimination to determine a set of linearly independent columns in both matrices. b. Determine bases of U1 and U2 A basis B of U1 is given by the first two columns of A1 (any pair of columns would be fine), which are independent. A basis C of U2 is given by the second and third columns of A2 (again, any pair of columns would be a basis), such that           1 0  −3 0                  1 −2 2  ,   , C =   , 3 B=          2 1  −5 2            1

0

−1

2

c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

486

Linear Algebra c. Determine a basis of U1 ∩ U2 Let us call b1 , b2 , c1 and c2 the vectors of the bases B and C such that B = {b1 , b2 } and C = {c1 , c2 }. Let x be in R4 . Then, x ∈ U1 ∩ U2 ⇐⇒ ∃λ1 , λ2 , λ3 , λ4 ∈ R : (x = λ1 b1 + λ2 b2 ) ∧ (x = λ3 c1 + λ4 c2 ) ⇐⇒ ∃λ1 , λ2 , λ3 , λ4 ∈ R : (x = λ1 b1 + λ2 b2 ) ∧ (λ1 b1 + λ2 b2 = λ3 c1 + λ4 c2 ) ⇐⇒ ∃λ1 , λ2 , λ3 , λ4 ∈ R : (x = λ1 b1 + λ2 b2 ) ∧ (λ1 b1 + λ2 b2 − λ3 c1 − λ4 c2 = 0)

Let λ := [λ1 , λ2 , λ3 , λ4 ]> . The last equation of the system can be written as the linear system Aλ = 0, where we define the matrix A as the concatenation of the column vectors b1 , b2 , −c1 and −c2 .   1

1 A= 2 1

0 −2 1 0

3 −2 5 1

0 −3 . −2 −2

We solve this homogeneous linear system using Gaussian elimination.   1

 1   2 1

1  0   0 0

1  0   0 0

0 3 0 −2 −2 −3   −R1 1 5 −2  −2R1 −R1 0 1 −2

0 3 0 −2 −5 −3   +2R3 1 −1 −2  swap with R2 ·(− 12 ) 0 −2 −2 0 1 0 0

−3R4 3 0 −1 −2   +R4 −7 −7  +7R4 1 1 swap with R3

1  0   0 0

0 1 0 0

0 0 1 0

−3 −1   1  0

From the reduced row echelon form we find that the set   −3

−1  S := span[  1 ] −1

describes the solution space of the system of equations in λ. We can now resume our equivalence derivation and replace the homogeneous system with its solution space. It holds x ∈ U1 ∩ U2 ⇐⇒ ∃λ1 , λ2 , λ3 , λ4 , α ∈ R : (x = λ1 b1 + λ2 b2 ) ∧ ([λ1 , λ2 , λ3 , λ4 ]> = α[−3, −1, 1, −1]> ) ⇐⇒ ∃α ∈ R : x = −3αb1 − αb2 ⇐⇒ ∃α ∈ R : x = α[−3, −1, −7, −3]> Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

487

Exercises Finally, 

−3 −1  U1 ∩ U2 = span[ −7] . −3

Alternatively, we could have expressed the solutions of x in terms of b1 and c2 with the condition on λ being ∃α ∈ R : (λ3 = α) ∧ (λ4 = −α) to obtain [3, 1, 7, 3]> . 2.15 Let F = {(x, y, z) ∈ R3 | x+y−z = 0} and G = {(a−b, a+b, a−3b) | a, b ∈ R}. a. Show that F and G are subspaces of R3 . We have (0, 0, 0) ∈ F since 0 + 0 − 0 = 0. Let a = (x, y, z) ∈ R3 and b = (x0 , y 0 , z 0 ) ∈ R3 be two elements of F . We have: (x + y − z) + (x0 + y 0 − z 0 ) = 0 + 0 = 0 so a + b ∈ F . Let λ ∈ R. We also have: λx + λy − λz = λ0 = 0 so λa ∈ F and thus F is a subspace of R3 . Similarly, we have (0, 0, 0) ∈ G by setting a and b to 0. Let a, b, a0 and b0 be in R and let x = (a − b, a + b, a − 3b) and y = (a0 − b0 , a0 + b0 , a0 − 3b0 ) be two elements of G. We have x + y = ((a + a0 ) − (b + b0 ), (a + a0 ) + (b + b0 ), (a + a0 ) − 3(b + b0 )) and (a + a0 , b + b0 ) ∈ R2 so x + y ∈ G. Let λ be in R. We have (λa, λb) ∈ R2 so λx ∈ G and thus G is a subspace of R3 . b. Calculate F ∩ G without resorting to any basis vector. Combining both constraints, we have: F ∩ G = {(a − b, a + b, a − 3b) | (a, b ∈ R) ∧ [(a − b) + (a + b) − (a − 3b) = 0]} = {(a − b, a + b, a − 3b) | (a, b ∈ R) ∧ (a = −3b)} = {(−4b, −2b, −6b) | b ∈ R} = {(2b, b, 3b) | b ∈ R}

 

2 = span[1] 3

c. Find one basis for F and one for G, calculate F ∩G using the basis vectors previously found and check your result with the previous question. We can see that F is a subset of R3 with one linear constraint. It thus has dimension 2, and it suffices to find two independent vectors in F to construct a basis. By setting (x, y) = (1, 0) and (x, y) = (0, 1) successively, we obtain the following basis for F :      0   1 0 , 1   1

1

Let us consider the set G. We introduce u, v ∈ R and perform the following variable substitutions: u := a + b and v := a − b. Note that then c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

488

Linear Algebra a = (u + v)/2 and b = (u − v)/2 and thus a − 3b = 2v − u, so that G can

be written as G = {(v, u, 2v − u) | u, v ∈ R} .

The dimension of G is clearly 2 and a basis can be found by choosing two independent vectors of G, e.g.,     0   1 0 ,  1    2

−1

Let us now find F ∩ G. Let x ∈ R3 . It holds that   

 

1 0 x ∈ F ∩ G ⇐⇒ ∃λ1 , λ2 , µ1 , µ2 ∈ R : x = λ1 0 + λ2 1 1 1

 



1 0 ∧ x = µ1 0 + µ2  1  2 −1

 

 

1 0 ⇐⇒ ∃λ1 , λ2 , µ1 , µ2 ∈ R : x = λ1 0 + λ2 1 1 1

 

 

 

0 1 0 1 ∧ λ1 0 + λ2 1 + µ1 0 + µ2  1  = 0 −1 2 1 1

Note that for simplicity purposes, we have not reversed the sign of the coefficients for µ1 and µ2 , which we can do since we could replace µ1 by −µ1 . The latter equation is a linear system in [λ1 , λ2 , µ1 , µ2 ]> that we solve next. 

1  0 1

0 1 1

1 0 2

0 1 (· · · ) −1

1  0 0

0 1 0

0 0 1

2 1  −2

The solution space for (λ1 , λ2 , µ1 , µ2 ) is therefore   2

1  span[ −2] , −1

and we can resume our equivalence  

 

1 0 x ∈ F ∩ G ⇐⇒ ∃α ∈ R : x = 2α 0 + α 1 1 1

 

2 ⇐⇒ ∃α ∈ R : x = α 1 , 3 Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

489

Exercises which yields the same result as the previous question, i.e.,   2 F ∩ G = span[1] . 3

2.16 Are the following mappings linear? Recall: To show that Φ is a linear mapping from E to F , we need to show that for all x and y in E and all λ in R: Φ(x + y) = Φ(x) + Φ(y) Φ(λx) = λΦ(x)

a. Let a, b ∈ R. Φ : L1 ([a, b]) → R

Z

b

f 7→ Φ(f ) =

f (x)dx , a

where L1 ([a, b]) denotes the set of integrable functions on [a, b]. Let f, g ∈ L1 ([a, b]). It holds that Z b Z b Φ(f ) + Φ(g) =

f (x)dx + a

Z

b

g(x)dx = a

f (x) + g(x)dx = Φ(f + g) . a

For λ ∈ R we have Z

b

Φ(λf ) =

Z

b

λf (x)dx = λ a

f (x)dx = λΦ(f ) . a

Therefore, Φ is a linear mapping. (In more advanced courses/books, you may learn that Φ is a linear functional, i.e., it takes functions as arguments. But for our purposes here this is not relevant.) b. Φ : C1 → C0 f 7→ Φ(f ) = f 0 ,

where for k > 1, C k denotes the set of k times continuously differentiable functions, and C 0 denotes the set of continuous functions. For f, g ∈ C 1 we have Φ(f + g) = (f + g)0 = f 0 + g 0 = Φ(f ) + Φ(g)

For λ ∈ R we have Φ(λf ) = (λf )0 = λf 0 = λΦ(f )

Therefore, Φ is linear. (Again, Φ is a linear functional.) From the first two exercises, we have seen that both integration and differentiation are linear operations. c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

490

Linear Algebra c. Φ:R→R x 7→ Φ(x) = cos(x)

We have cos(π) = −1 and cos(2π) = 1m which is different from 2 cos(π). Therefore, Φ is not linear. d. Φ : R 3 → R2

 x 7→

1 1

2 4



3 x 3

We define the matrix as A. Let x and y be in R3 . Let λ be in R. Then: Φ(x + y) = A(x + y) = Ax + Ay = Φ(x) + Φ(y) .

Similarly, Φ(λx) = A(λx) = λAx = λΦ(x) .

Therefore, this mapping is linear. e. Let θ be in [0, 2π[ and Φ : R2 → R2

 x 7→

cos(θ) − sin(θ)



sin(θ) x cos(θ)

We define the (rotation) matrix as A. Then the reasoning is identical to the previous question. Therefore, this mapping is linear. The mapping Φ represents a rotation of x by an angle θ. Rotations are also linear mappings. 2.17 Consider the linear mapping Φ : R3 → R4

 3x1 + 2x2 + x3 x1  x1 + x2 + x3   Φ x2  =   x1 − 3x2  x3 2x1 + 3x2 + x3 

Find the transformation matrix AΦ . Determine rk(AΦ ). Compute the kernel and image of Φ. What are dim(ker(Φ)) and dim(Im(Φ))? The transformation matrix is 

3 1 AΦ =  1 2

2 1 −3 3

−1 1  0 1

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

491

Exercises

The rank of AΦ is the number linearly independent rows/columns. We use Gaussian elimination on AΦ to determine the reduced row echelon form (Not necessary to identify the number of linearly independent rows/columns, but useful for the next questions.):   1

0  0 0

0 1 0 0

0 0 . 1 0

From here, we see that rk(AΦ ) = 3. ker(Φ) = 0 and dim(ker(Φ)) = 0. From the reduced row echelon form, we see that all three columns of AΦ are linearly independent. Therefore, they form a basis of Im(Φ), and dim(Im(Φ)) = 3. 2.18 Let E be a vector space. Let f and g be two automorphisms on E such that f ◦ g = idE (i.e., f ◦ g is the identity mapping idE ). Show that ker(f ) = ker(g ◦ f ), Im(g) = Im(g ◦ f ) and that ker(f ) ∩ Im(g) = {0E }. Let x ∈ ker(f ). We have g(f (x)) = g(0) = 0 since g is linear. Therefore, ker(f ) ⊆ ker(g ◦ f ) (this always holds). Let x ∈ ker(g ◦ f ). We have g(f (x)) = 0 and as f is linear, f (g(f (x))) = f (0) = 0. This implies that (f ◦ g)(f (x)) = 0 so that f (x) = 0 since f ◦ g = idE . So ker(g ◦ f ) ⊆ ker(f ) and thus ker(g ◦ f ) = ker(f ). Let y ∈ Im(g ◦ f ) and let x ∈ E so that y = (g ◦ f )(x). Then y = g(f (x)), which shows that Im(g ◦ f ) ⊆ Im(g) (which is always true). Let y ∈ Im(g). Let then x ∈ E such that y = g(x). We have y = g((f ◦ g)(x)) and thus y = (g ◦ f )(g(x)), which means that y ∈ Im(g ◦ f ). Therefore, Im(g) ⊆ Im(g ◦ f ). Overall, Im(g) = Im(g ◦ f ). Let y ∈ ker(f ) ∩ Im(g). Let x ∈ E such that y = g(x). Applying f gives us f (y) = (f ◦ g)(x) and as y ∈ ker(f ), we have 0 = x. This means that ker(f ) ∩ Im(g) ⊆ {0} but the intersection of two subspaces is a subspace and thus always contains 0, so ker(f ) ∩ Im(g) = {0}. 2.19 Consider an endomorphism Φ : R3 → R3 whose transformation matrix (with respect to the standard basis in R3 ) is   1 AΦ = 1 1

1 −1 1

0 0 . 1

a. Determine ker(Φ) and Im(Φ). The image Im(Φ) is spanned by the columns of A. One way to determine a basis, we need to determine the smallest generating set of the columns of AΦ . This can be done by Gaussian elimination. However, in this case, it is quite obvious that AΦ has full rank, i.e., the set of columns is already minimal, such that       1 1 0 Im(Φ) = [1 , −1 , 0] = R3 1 1 1

c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

492

Linear Algebra We know that dim(Im(Φ)) = 3. Using the rank-nullity theorem, we get that dim(ker(Φ)) = 3 − dim(Im(Φ)) = 0, and ker(Φ) = {0} consists of the 0-vector alone. ˜ Φ with respect to the basis b. Determine the transformation matrix A       1 1 1 B = (1 , 2 , 0) , 1 1 0

i.e., perform a basis change toward the new basis B . Let B the matrix built out of the basis vectors of B (order is important):   1 B = 1 1

1 2 1

1 0 . 0

˜ Φ = B −1 AΦ B . The inverse is given by Then, A   −1 1 0

0 B −1 = 0 1

2 −1 , −1

and the desired transformation matrix of Φ with respect to the new basis B of R3 is

0 0 1

−1 1 0



2 1 −1 1 −1 1

1 −1 1



0 1 0  1 1 1

1 2 1

1 1 0  = 0 0 0

6 = −3 −1

3 −2 0



2 1 −1 1 −1 1

9 −5 −1

1 2 1

1 0 0

1 0 . 0

2.20 Let us consider b1 , b2 , b01 , b02 , 4 vectors of R2 expressed in the standard basis of R2 as         b1 =

2 , 1

b2 =

−1 , −1

b01 =

2 , −2

b02 =

1 1

and let us define two ordered bases B = (b1 , b2 ) and B 0 = (b01 , b02 ) of R2 . a. Show that B and B 0 are two bases of R2 and draw those basis vectors. The vectors b1 and b2 are clearly linearly independent and so are b01 and b02 . b. Compute the matrix P 1 that performs a basis change from B 0 to B . We need to express the vector b01 (and b02 ) in terms of the vectors b1 and b2 . In other words, we want to find the real coefficients λ1 and λ2 such that b01 = λ1 b1 + λ2 b2 . In order to do that, we will solve the linear equation system   0 b1

b2

b1

−1 −1

2 −2

i.e., 

2 1



Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

493

Exercises and which results in the reduced row echelon form   1 0

4 6

0 1

.

This gives us b01 = 4b1 + 6b2 . Similarly for b02 , Gaussian elimination gives us b02 = −1b2 . Thus, the matrix that performs a basis change from B 0 to B is given as   P1 =

4 6

0 . −1

c. We consider c1 , c2 , c3 , three vectors of R3 defined in the standard basis of R as       1 c1 =  2  , −1

0 c2 = −1 , 2

1 c3 =  0  −1

and we define C = (c1 , c2 , c3 ). (i) Show that C is a basis of R3 , e.g., by Section 4.1). We have: 1 0 det(c1 , c2 , c3 ) = 2 −1 −1 2

using determinants (see

1 0 = 4 6= 0 −1

Therefore, C is regular, and the columns of C are linearly independent, i.e., they form a basis of R3 . (ii) Let us call C 0 = (c01 , c02 , c03 ) the standard basis of R3 . Determine the matrix P 2 that performs the basis change from C to C 0 . In order to write the matrix that performs a basis change from C to C 0 , we need to express the vectors of C in terms of those of C 0 . But as C 0 is the standard basis, it is straightforward that c1 = 1c01 + 2c02 − 1c03 for example. Therefore,   1 P 2 :=  2 −1

0 −1 2

1 0. −1

simply contains the column vectors of C (this would not be the case if C 0 was not the standard basis). d. We consider a homomorphism Φ : R2 −→ R3 , such that Φ(b1 + b2 ) Φ(b1 − b2 )

= =

c2 + c3 2c1 − c2 + 3c3

where B = (b1 , b2 ) and C = (c1 , c2 , c3 ) are ordered bases of R2 and R3 , respectively. Determine the transformation matrix AΦ of Φ with respect to the ordered bases B and C . c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

494

Linear Algebra Adding and subtracting both equations gives us  Φ(b1 + b2 ) + Φ(b1 − b2 ) Φ(b1 + b2 ) − Φ(b1 − b2 )

= =

2c1 + 4c3 −2c1 + 2c2 − 2c3

As Φ is linear, we obtain 

Φ(2b1 ) Φ(2b2 )

= 2c1 + 4c3 = −2c1 + 2c2 − 2c3

And by linearity of Φ again, the system of equations gives us 

Φ(b1 ) Φ(b2 )

= =

c1 + 2c3 −c1 + c2 − c3

.

Therefore, the transformation matrix of AΦ with respect to the bases B and C is   −1 1. −1

1 AΦ = 0 2

e. Determine A0 , the transformation matrix of Φ with respect to the bases B 0 and C 0 . We have:       0 2 1 0 1 1 −1  4 0 0 = −10 A = P 2 AP 1 =  2 3. −1 0  0 1 −1

−1

2

2

6

−1

−1

12

−4

f. Let us consider the vector x ∈ R2 whose coordinates in B 0 are [2, 3]> . In other words, x = 2b01 + 3b02 . (i) Calculate the coordinates of x in B . By definition of P 1 , x can be written in B as        P1

2 4 = 3 6

0 −1

2 8 = 3 9

(ii) Based on that, compute the coordinates of Φ(x) expressed in C . Using the transformation matrix A of Φ with respect to the bases B and C , we get the coordinates of Φ(x) in C with       1 −1   −1 8 8 A = 0 = 9  1 9

2

−1

9

7

(iii) Then, write Φ(x) in terms of c01 , c02 , c03 . Going back to the basis C 0 thanks to the matrix P 2 gives us the expression of Φ(x) in C 0        −1 1 P2  9  =  2 7 −1

0 −1 2

1 −1 6 0   9  = −11 −1 7 12

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

495

Exercises

In other words, Φ(x) = 6c01 − 11c02 + 12c03 . (iv) Use the representation of x in B 0 and the matrix A0 to find this result directly. We can calculate Φ(x) in C directly:       0 2   6 2 2 A0 = −10 = −11 3 3

12

−4

3

12

c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

3 Analytic Geometry

Exercises 3.1

Show that h·, ·i defined for all x = [x1 , x2 ]> ∈ R2 and y = [y1 , y2 ]> ∈ R2 by hx, yi := x1 y1 − (x1 y2 + x2 y1 ) + 2(x2 y2 )

is an inner product. We need to show that hx, yi is a symmetric, positive definite bilinear form. Let x := [x1 , x2 ]> , y = [y1 , y2 ]> ∈ R2 . Then, hx, yi = x1 y1 − (x1 y2 + x2 y1 ) + 2x2 y2 = y1 x1 − (y1 x2 + y2 x1 ) + 2y2 x2 = hy, xi ,

where we exploited the commutativity of addition and multiplication in R. Therefore, h·, ·i is symmetric. It holds that hx, xi = x21 − (2x1 x2 ) + 2x22 = (x1 − x2 )2 + x22 .

This is a sum of positive terms for x 6= 0. Moreover, this expression shows that if hx, xi = 0 then x2 = 0 and then x1 = 0, i.e., x = 0. Hence, h·, ·i is positive definite. In order to show that h·, ·i is bilinear (linear in both arguments), we will simply show that h·, ·i is linear in its first argument. Symmetry will ensure that h·, ·i is bilinear. Do not duplicate the proof of linearity in both arguments. Let z = [z1 , z2 ]> ∈ R2 and λ ∈ R. Then, 

hx + y, zi = (x1 + y1 )z1 − (x1 + y1 )z2 + (x2 + y2 )z2 + 2 x2 + y2 )z2



= x1 z1 − (x1 z2 + x2 z1 ) + 2(x2 z2 ) + y1 z1 − (y1 z2 + y2 z1 ) + 2(y2 z2 ) = hx, zi + hy, zi hλx, yi = λx1 y1 − (λx1 y2 + λx2 y1 ) + 2(λx2 y2 )



= λ x1 y1 − (x1 y2 + x2 y1 ) + 2(x2 y2 ) = λ hx, yi

Thus, h·, ·i is linear in its first variable. By symmetry, it is bilinear. Overall, h·, ·i is an inner product.

496 This material will be published by Cambridge University Press as Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. This pre-publication version is free to view and download for personal use only. Not for re-distribution, re-sale or use in derivac tive works. by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2020. https://mml-book.com.

497

Exercises 3.2

Consider R2 with h·, ·i defined for all x and y in R2 as   hx, yi := x>

2 1

0 y. 2

| {z } =:A

Is h·, ·i an inner product? Let us define x and y as  

 

1 x= , 0

3.3

y=

0 . 1

We have hx, yi = 0 but hy, xi = 1, i.e., h·, ·i is not symmetric. Therefore, it is not an inner product. Compute the distance between     −1 y = −1 0

1 x = 2 , 3

using a. hx, yi := x> y 

b. hx, yi := x> Ay ,

2 A := 1 0

1 3 −1

0 −1 2

The difference vector is  

2 z = x − y = 3  . 4 √

a. kzk = √z > z = 29 √ b. kzk = z > Az = 55 3.4

Compute the angle between  



−1 y= −1

1 x= , 2



using a. hx, yi := x> y b. hx, yi := x> By ,

 B :=

2 1

1 3



It holds that cos ω =

hx, yi , kxk kyk

where ω is the angle between x and y . a. −3 3 cos ω = √ √ = − √ ≈ 2.82 rad = 161.5◦ 5 2 10

b. cos ω = p

x> By x> By

p

y > By

−11 11 = √ √ =− ≈ 1.66 rad = 95◦ 126 18 7

c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

498 3.5

Analytic Geometry Consider the Euclidean vector space R5 with the dot product. A subspace U ⊆ R5 and x ∈ R5 are given by           0

−1    U = span[  2 , 0 2

−3

1

−3    1 ,   −1

−1

−1

 4  −3      1  ,  5 ] ,     2 0

2

1

−9    x= −1 4

7

1

a. Determine the orthogonal projection πU (x) of x onto U First, we determine a basis of U . Writing the spanning vectors as the columns of a matrix A, we use Gaussian elimination to bring A into (reduced) row echelon form:   1

0 1 0 0 0

0  0  0 0

0 0 1 0 0

1 2  1  0 0

From here, we see that the first three columns are pivot columns, i.e., the first three vectors in the generating set of U form a basis of U :       −3

1

0

−1    U = span[  2 , 0

−3    1 ,   −1

4    1 ] .   2 1

2

2

Now, we define 

0 −1  B= 2 0 2

1 −3 1 −1 2

−3 4  1 , 2 1

where we define three basis vectors bi of U as the columns of B for 1 6 i 6 3. We know that the projection of x on U exists and we define p := πU (x). Moreover, we know that p ∈ U . We define λ := [λ1 , λ2 , λ3 ]> ∈ R3 , such P that p can be written p = 3i=1 λi bi = Bλ. As p is the orthogonal projection of x onto U , then x − p is orthogonal to all the basis vectors of U , so that B > (x − Bλ) = 0 .

Therefore, B > Bλ = B > x .

Solving in λ the inhomogeneous system B > Bλ = B > x gives us a single Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

499

Exercises solution 

−3 λ= 4  1

and, therefore, the desired projection 

1 −5    p = Bλ =  −1 ∈ U . −2 3

b. Determine the distance d(x, U ) The distance is simply the length of x − p:

 

2

 4 

  √   kx − pk =

 0  = 60

−6

2 3.6

Consider R3 with the inner product 

2 hx, yi := x> 1 0

1 2 −1

0 −1 y . 2

Furthermore, we define e1 , e2 , e3 as the standard/canonical basis in R3 . a. Determine the orthogonal projection πU (e2 ) of e2 onto U = span[e1 , e3 ] .

Hint: Orthogonality is defined through the inner product. Let p = πU (e2 ). As p ∈ U , we can define Λ = (λ1 , λ3 ) ∈ R2 such that p can be written p = U Λ. In fact, p becomes p = λ1 e1 +λ3 e3 = [λ1 , 0, λ3 ]> expressed in the canonical basis. Now, we know by orthogonal projection that p = πU (e2 ) =⇒ (p − e2 ) ⊥ U

 =⇒



 =⇒

 

hp − e2 , e1 i 0 = hp − e2 , e3 i 0



 

hp, e1 i − he2 , e1 i 0 = hp, e3 i − he2 , e3 i 0

We compute the individual components as    2 1 hp, e1 i = λ1 0 λ3 1 2 0

−1

 

0 1 −1 0 = 2λ1 2 0

c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

500

Analytic Geometry 



hp, e3 i = λ1

0

 

0 2 he2 , e1 i = 1 1 0 0



he2 , e3 i = 0

1

 

 2 λ3 1 0

1 2 −1

0 0 −1 1 = 2λ3 2 0

1 2 −1

 

0 1 −1 0 = 1 2 0

  2 0 1 0

 

1 2 −1

0 0 −1 0 = −1 2 1

This now leads to the inhomogeneous linear equation system 2λ1 = 1 2λ3 = −1

This immediately gives the coordinates of the projection as  1  2

πU (e2 ) =  0  − 12

b. Compute the distance d(e2 , U ). The distance of d(e2 , U ) is the distance between e2 and its orthogonal projection p = πU (e2 ) onto U . Therefore, q d(e2 , U ) =

hp − e2 , p − e2 i2 .

However, hp − e2 , p − e2 i =

1

2

−1

 2  − 1 1 2

0

1 2 −1

1 0 2   −1  = 1 , −1 − 12 2



p

which yields d(e2 , U ) = hp − e2 , p − e2 i = 1 c. Draw the scenario: standard basis vectors and πU (e2 ) See Figure 3.1. 3.7

Let V be a vector space and π an endomorphism of V . a. Prove that π is a projection if and only if idV − π is a projection, where idV is the identity endomorphism on V . b. Assume now that π is a projection. Calculate Im(idV −π) and ker(idV −π) as a function of Im(π) and ker(π). a. It holds that (idV − π)2 = idV − 2π + π 2 . Therefore, (idV − π)2 = idV − π ⇐⇒ π 2 = π ,

which is exactly what we want. Note that we reasoned directly at the endomorphism level, but one can also take any x ∈ V and prove the same results. Also note that π 2 means π ◦ π as in “π composed with π ”. Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

501

Exercises

e3

1 U 1 e2 1 πU (e2 ) e1

ure 3.1 ojection πU (e2 ).

b. We have π ◦ (idV − π) = π − π 2 = 0V , where 0V represents the null endomorphism. Then Im(idV − π) ⊆ ker(π). Conversely, let x ∈ ker(π). Then (idV − π)(x) = x − π(x) = x ,

which means that x is the image of itself by idV −π . Hence, x ∈ Im(idV − π). In other words, ker(π) ⊆ Im(idV − π) and thus ker(π) = Im(idV − π). Similarly, we have (idV − π) ◦ π = π − π 2 = π − π = 0V

so Im(π) ⊆ ker(idV − π). Conversely, let x ∈ ker(idV − π). We have (idV − π)(x) = 0 and thus x − π(x) = 0 or x = π(x). This means that x is its own image by π , and therefore ker(idV − π) ⊆ Im(π). Overall, ker(idV − π) = Im(π) .

3.8

Using the Gram-Schmidt method, turn the basis B = (b1 , b2 ) of a twodimensional subspace U ⊆ R3 into an ONB C = (c1 , c2 ) of U , where     1 b1 := 1 , 1

−1 b2 :=  2  . 0

c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

502

Analytic Geometry We start by normalizing b1  

1 b 1 c1 := 1 = √ 1 . kb1 k 3 1

(3.1)

To get c2 , we project b2 onto the subspace spanned by c1 . This gives us (since kc1 = 1k)   c> 1 b2 c1 =

1 1  1 ∈U. 3 1

By subtracting this projection (a multiple of c1 ) from b2 , we get a vector that is orthogonal to c1 :       −1

−4

1

 2  − 1 1 = 1  5  = 1 (−b1 + 3b2 ) ∈ U . 3

0

1

3

3

−1

Normalizing c˜2 yields 

−4 ˜ c 3 c2 = 2 = √  5  . k˜ c2 k 42 −1

3.9

We see that c1 ⊥ c2 and that kc1 k = 1 = kc2 k. Moreover, c1 , c2 ∈ U it follows that (c1 , c2 ) are a basis of U . Let n ∈ N and let x1 , . . . , xn > 0 be n positive real numbers so that x1 + · · · + xn = 1. Use the Cauchy-Schwarz inequality and show that Pn 2 1 a. i=1 xi > n Pn 2 1 b. i=1 xi > n Hint: Think about the dot product on Rn . Then, choose specific vectors x, y ∈ Rn and apply the Cauchy-Schwarz inequality. Recall Cauchy-Schwarz inequality expressed with the dot product in Rn . Let x = [x1 , . . . , xn ]> and y = [y1 , . . . , yn ]> be two vectors of Rn . CauchySchwarz tells us that hx, yi2 6 hx, xi · hy, yi ,

which, applied with the dot product in Rn , can be rephrased as !2 ! ! n n n X X X 2 2 xi yi 6 xi · yi . i=1

i=1

i=1

a. Consider x = [x1 , . . . , xn ]> as defined in the question. Let us choose y = [1, . . . , 1]> . Then, the Cauchy-Schwarz inequality becomes !2 ! ! n n n X X X 2 2 xi · 1 6 xi · 1 i=1

i=1

i=1

and thus 16

n X

! x2i

· n,

i=1

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

503

Exercises

which yields the expected result. b. Let us now choose both vectors differently to obtain the expected result. √ √ Let x = [ √1x1 , . . . , √1xn ]> and y = [ x1 , · · · , n]> . Note that our choice is legal since all xi and yi are strictly positive. The Cauchy-Schwarz inequality now becomes ! !2 ! n n n  2 X X X √ √ 2 1 1 √ √ · · xi 6 xi xi xi i=1

i=1

i=1

so that 2

n 6

n X

! 1 xi

·

i=1

This yields n2 6

Pn

1 i=1 xi

n X

! xi

.

i=1

· 1, which gives the expected result.

3.10 Rotate the vectors   x1 :=

2 , 3

 x2 :=

0 −1



by 30◦ . Since 30◦ = π/6 rad we obtain the rotation matrix   π π A=

cos( 6 ) sin( π6 )

− sin( 6 ) cos( π6 )

and the rotated vectors are 

2 cos( π6 ) − 3 sin( π6 ) 0.23 Ax1 = ≈ 3.60 2 sin( π6 ) + 3 cos( π6 )







sin( π6 ) 0.5 Ax2 = ≈ . 0.87 − cos( π6 )









c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

4 Matrix Decompositions

Exercises 4.1

Compute the determinant using the Laplace expansion (using the first row) and the Sarrus rule for 

1 A= 2 0

3 4 2

5 6 . 4

The determinant is 1 3 5 |A| = det(A) = 2 4 6 0 2 4 4 6 − 3 2 6 + 5 2 = 1 0 0 4 2 4

4 2

Laplace expansion

=1·4−3·8−5·4=0

Sarrus’ rule

= 16 + 20 + 0 − 0 − 12 − 24 = 0

4.2

Compute the following determinant efficiently: 

2 2  0  −2 2

0 −1 1 0 0

1 0 2 2 0

2 1 1 −1 1

0 1  2 . 2 1

This strategy shows the power of the methods we learned in this and the previous chapter. We can first apply Gaussian elimination to transform A into a triangular form, and then use the fact that the determinant of a triangular matrix equals the product of its diagonal elements. 2 2 0 −2 2

0 −1 1 0 0

1 0 2 2 0

2 1 1 −1 1

0 2 1 0 2 = 0 2 0 1 0

0 −1 1 0 0

1 −1 2 3 −1

2 −1 1 1 −1

0 2 1 0 2 = 0 2 0 1 0

0 −1 0 0 0

1 −1 1 3 −1

2 −1 0 1 −1

0 1 3 2 1

504 This material will be published by Cambridge University Press as Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. This pre-publication version is free to view and download for personal use only. Not for re-distribution, re-sale or use in derivac tive works. by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2020. https://mml-book.com.

505 0 1 2 0 −1 −1 −1 1 0 1 0 3 = 6 . 0 0 1 −7 0 0 0 −3

Exercises 2 0 = 0 0 0 Alternatively, solution: 2 2 0 −2 2

0 −1 0 0 0

1 −1 1 0 0

0 2 1 0 3 = 0 −7 0 4 0

2 −1 0 1 −1

we can apply the Laplace expansion and arrive at the same 0 −1 1 0 0

1 0 2 2 0

2 1 1 −1 1

0 2 1 0 2 = 0 2 0 1 0

0 −1 1 0 0

1 −1 2 3 −1

2 −1 1 1 −1

−1 1 1st col. 1+1 = (−1) 2 · 0 0

0 1 2 2 1 −1 2 3 −1

−1 1 1 −1

1 2 . 2 1

If we now subtract the fourth row from the first row and multiply (−2) times the third column to the fourth column we obtain −1 0 0 0 2 1 0 1st row 1 2 1 0 3rd=col. (−2) · 3(−1)3+3 · 2 1 = 6 . = −2 3 2 1 0 3 1 3 1 0 −1 −1 3 0 0 −1 −1 3 4.3

Compute the eigenspaces of a. 

1 1

A :=

0 1



b. 



−2 B := 2

2 1

a. For  A=

1 1

0 1



1 − λ (i) Characteristic polynomial: p(λ) = |A − λI 2 | = 1

0 = 1 − λ

(1 − λ)2 . Therefore λ = 1 is the only root of p and, therefore, the only eigenvalue of A (ii) To compute the eigenspace for the eigenvalue λ = 1, we need to compute the null space of A − I :

 (A − 1 · I)x = 0 ⇐⇒

0 1



0 x=0 0

  ⇒ E1 = [

0 ] 1

c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

506

Matrix Decompositions E1 is the only eigenspace of A.

b. For B , the corresponding eigenspaces (for λi ∈ C) are     Ei = [

4.4

1 ], i

E−i = [

1 ]. −i

Compute all eigenspaces of 

0 −1 A= 2 1

−1 1 −1 −1

1 −2 0 1

1 3 . 0 0

(i) Characteristic polynomial: −λ 1 1 −1 1 1 −λ −1 −1 1 − λ −2 −λ −1 3 − λ 3 0 p(λ) = = 0 1 −2 − λ 2λ 2 −1 −λ 0 1 −1 1 −λ −1 1 −λ 1 −λ −1 − λ 0 1 0 −λ −1 − λ 3 − λ = 0 1 −1 −λ 2λ 1 0 0 −λ −λ −1 − λ 3 − λ −1 − λ 0 1 = (−λ) 1 −1 − λ 3 − λ −1 − λ 2λ − −λ 0 −1 − λ 2λ 0 −λ 1 0 1 −λ −1 − λ −1 − λ − −λ = (−λ)2 −1 − λ 3 − λ 1 −1 − λ 1 −1 − λ 2λ = (1 + λ)2 (λ2 − 3λ + 2) = (1 + λ)2 (1 − λ)(2 − λ)

Therefore, the eigenvalues of A are λ1 = −1, λ2 = 1, λ3 = 2. (ii) The corresponding eigenspaces are the solutions of (A − λi I)x = 0, i = 1, 2, 3, and given by       0

E−1

1

1   = span[ 1],

1   E1 = span[ 1],

0

4.5

1

0   E2 = span[ 1] .

1

1

Diagonalizability of a matrix is unrelated to its invertibility. Determine for the following four matrices whether they are diagonalizable and/or invertible         1 0

0 , 1

1 0

0 , 0

1 0

1 , 1

0 0

1 . 0

In the four matrices above, the first one is diagonalizable and invertible, the second one is diagonalizable but is not invertible, the third one is invertible but is not diagonalizable, and, finally, the fourth ione s neither invertible nor is it diagonalizable. Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

507

Exercises 4.6

Compute the eigenspaces of the following transformation matrices. Are they diagonalizable? a. For 

2 A = 1 0

3 4 0

0 3 1

(i) Compute the eigenvalue as the roots of the characteristic polynomial 2 − λ 3 0 2 − λ 3 p(λ) = det(A − λI) = 1 4−λ 3 = (1 − λ) 1 4 − λ 0 0 1 − λ  2 = (1 − λ) (2 − λ)(4 − λ) − 3 = (1 − λ)(8 − 2λ − 4λ + λ − 3)

= (1 − λ)(λ2 − 6λ + 5) = (1 − λ)(λ − 1)(λ − 5) = −(1 − λ)2 (λ − 5) .

Therefore, we obtain the eigenvalues 1 and 5. (ii) To compute the eigenspaces, we need to solve (A − λi I)x = 0, where λ1 = 1, λ2 = 5:     1 E1 : 1 0

3 3 0

0 3 0

1

0 0

3 0 0

0 1 , 0

where we subtracted the first row from the second and, subsequently, divided the second row by 3 to obtain the reduced row echelon form. From here, we see that   3 E1 = span[−1] 0

Now, we compute E5 by solving (A − 5I)x = 0:   1 | · (− 3 ) −3 3 0  1 −1 3  + 1 R1 + 3 R3 3 4 0 0 −4 | · (− 14 ), swap with R2

1  0 0

−1 0 0

0 1  0

Then,  

1 E5 = span[1] . 0

(iii) This endomorphism cannot be diagonalized because dim(E1 ) + dim(E5 ) 6= 3. Alternative arguments: dim(E1 ) does not correspond to the algebraic multiplicity of the eigenvalue λ = 1 in the characteristic polynomial rk(A − I) 6= 3 − 2.

c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

508

Matrix Decompositions b. For 

1 0 A= 0 0

1 0 0 0

0 0 0 0

0 0  0 0

(i) Compute the eigenvalues as the roots of the characteristic polynomial p(λ) = det(A − λI) = (1 − λ)λ3

1 − λ 0 = 0 0

1 −λ 0 0

0 0 −λ 0

0 0 = −(1 − λ)λ3 . 0 −λ

It follows that the eigenvalues are 0 and 1 with algebraic multiplicities 3 and 1, respectively. (ii) We compute the eigenspaces E0 and E1 , which requires us to determine the null spaces A − λi I , where λi ∈ {0, 1}. (iii) We compute the eigenspaces E0 and E1 , which requires us to determine the null spaces A − λi I , where λi ∈ {0, 1}. For E0 , we compute the null space of A directly and obtain       1

0

−1  E0 = span[  0 ,

0  , 1

0

0

0

0   ] . 0  1

To determine E1 , we need to solve (A − I)x = 0:    0  0   0 0

1 0 0 −1 0 0   +R1 | move to R4 0 −1 0  ·(−1) 0 0 −1 ·(−1)

0  0   0 0

1 0 0 0

0 1 0 0

From here, we see that   1

0  E1 = span[ 0] . 0

(iv) Since dim(E0 )+dim(E1 ) = 4 = dim(R4 ), it follows that a diagonal form exists. 4.7

Are the following matrices diagonalizable? If yes, determine their diagonal form and a basis with respect to which the transformation matrices are diagonal. If no, give reasons why they are not diagonalizable. a.  A=

0 −8

0 0   1  0



1 4

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

509

Exercises We determine the characteristic polynomial as p(λ) = det(A − λI) = −λ(4 − λ) + 8 = λ2 − 4λ + 8 .

The characteristic polynomial does not decompose into linear factors over R because the roots of p(λ) are complex and given by λ1,2 = √ 2 ± −4. Since the characteristic polynomial does not decompose into linear factors, A cannot be diagonalized (over R). b. 

1 A = 1 1

1 1 1

1 1 1

(i) The characteristic polynomial is p(λ) = det(A − λI). 1 − λ 1 − λ 1 1 subtr. R1 from R2 ,R3 λ = p(λ) = 1 1−λ 1 λ 1 1 1 − λ 1 − λ 1 develop last row 1 1 − λ = λ λ −λ −λ 0

1 −λ 0

1 0 −λ

= λ2 + λ(λ(1 − λ) + λ) = λ(−λ2 + 3λ) = λ2 (λ − 3) .

Therefore, the roots of p(λ) are 0 and 3 with algebraic multiplicities 2 and 1, respectively. (ii) To determine whether A is diagonalizable, we need to show that the dimension of E0 is 2 (because the dimension of E3 is necessarily 1: an eigenspace has at least dimension 1 by definition, and its dimension cannot exceed the algebraic multiplicity of its associated eigenvalue). Let us study E0 = ker(A − 0I) :     1 A − 0I = 1 1

1 1 1

1 1 1

1

0 0

1 0 0

1 0 . 0

Here, dimE0 = 2, which is identical to the algebraic multiplicity of the eigenvalue 0 in the characteristic polynomial. Thus A is diagonalizable. Moreover, we can read from the reduced row echelon form that     1 1 E0 = span[−1 ,  0 ] . 0 −1

(iii) For E3 , we obtain 

−2 A − 3I =  1 1

1 −2 1

1 1 −2

1 0 0

0 1 0

−1 −1 , 0

which has rank 2, and, therefore (using the rank-nullity theorem), c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

510

Matrix Decompositions E3 has dimension 1 (it could not be anything else anyway, as justified above) and   −1 E1 = span[−1] . −1

(iv) Therefore, we can find a new basis P as the concatenation of the spanning vectors of the eigenspaces. If we identify the matrix   −1 P = −1 −1

1 −1 0

1 0, −1

whose columns are composed of the basis vectors of basis P , then our endomorphism will have the diagonal form: D = diag[3, 0, 0] with respect to this new basis. As a reminder, diag[3, 0, 0] refers to the 3 × 3 diagonal matrix with 3, 0, 0 as values on the diagonal. Note that the diagonal form is not unique and depends on the order of the eigenvectors in the new basis. For example, we can define another matrix Q composed of the same vectors as P but in a different order:   −1 −1 −1

1 Q = −1 0

1 0. −1

If we use this matrix, our endomorphism would have another diagonal form: D 0 = diag[0, 3, 0]. Sceptical students can check that Q−1 AQ = D 0 and P −1 AP = D . c. 

5  0 A=  −1 1

4 1 −1 1

2 −1 3 −1

1 −1  0 2

p(λ) = (λ − 1)(λ − 2)(λ − 4)2

−1 1  E1 = span[  0 ] , 0

1 −1  E2 = span[  0 ] , 1

1 0  E4 = span[ −1] . 1

Here, we see that dim(E4 ) = 1 6= 2 (which is the algebraic multiplicity of the eigenvalue 4). Therefore, A cannot be diagonalized. d. 

5 A = −1 3

−6 4 −6

−6 2 −4

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

511

Exercises

(i) We compute the characteristic polynomial p(λ) = det(A − λI) as 5 − λ −6 −6 p(λ) = −1 4−λ 2 3 −6 −4 − λ = (5 − λ)(4 − λ)(−4 − λ) − 36 − 36 + 18(4 − λ) + 12(5 − λ) − 6(−4 − λ) = −λ3 + 5λ2 − 8λ + 4 = (1 − λ)(2 − λ)2 ,

where we used Sarrus rule. The characteristic polynomial decomposes into linear factors, and the eigenvalues are λ1 = 1, λ2 = 2 with (algebraic) multiplicity 1 and 2, respectively. (ii) If the dimension of the eigenspaces are identical to multiplicity of the corresponding eigenvalues, the matrix is diagonalizable. The eigenspace dimension is the dimension of ker(A − λi I), where λi are the eigenvalues (here: 1,2). For a simple check whether the matrices are diagonalizable, it is sufficient to compute the rank ri of A − λi I since the eigenspace dimension is n − ri (rank-nullity theorem). Let us study E2 and apply Gaussian elimination on A − 2I :   3

−1 3

−6 2 −6

−6 2. −6

(4.1)

(iii) We can immediately see that the rank of this matrix is 1 since the first and third row are three times the second. Therefore, the eigenspace dimension is dim(E2 ) = 3 − 1 = 2, which corresponds to the algebraic multiplicity of the eigenvalue λ = 2 in p(λ). Moreover, we know that the dimension of E1 is 1 since it cannot exceed its algebraic multiplicity, and the dimension of an eigenspace is at least 1. Hence, A is diagonalizable. (iv) The diagonal matrix is easy to determine since it just contains the eigenvalues (with corresponding multiplicities) on its diagonal:   1 D = 0 0

0 2 0

0 0 . 2

(v) We need to determine a basis with respect to which the transformation matrix is diagonal. We know that the basis that consists of the eigenvectors has exactly this property. Therefore, we need to determine the eigenvectors for all eigenvalues. Remember that x is an eigenvector for an eigenvalue λ if they satisfy Ax = λx ⇐⇒ (A − λI)x = 0. Therefore, we need to find the basis vectors of the eigenspaces E1 , E2 . For E1 = ker(A − I) we obtain:     1 4

−1 3

−6 −6 +4R2 3 2  −6 −5 +3R2

0

−1 0

6 3 3

2 2  1

·( 6 ) ·(−1)|swap with R1 − 12 R1

c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

512 

1

 0 0

−3 −2 +3R2 1  1 3 0 0

Matrix Decompositions 

1

0 1 0

 0 0

−1 1 3

0

The rank of this matrix is 2. Since 3 − 2 = 1 it follows that dim(E1 ) = 1, which corresponds to the algebraic multiplicity of the eigenvalue λ = 1 in the characteristic polynomial. (vi) From the reduced row echelon form we see that   3 E1 = span[−1] , 3

and our first eigenvector is [3, −1, 3]> . (vii) We proceed with determining a basis of E2 , which will give us the other two basis vectors that we need (remember that dim(E2 ) = 2). From (4.1), we (almost) immediately obtain the reduced row echelon form   −2 0 0

1

0 0

−2 0, 0

and the corresponding eigenspace    

2 2 E2 = span[1 , 0] . 0 1

(viii) Overall, an ordered basis with respect to which A has diagonal form D consists of all eigenvectors is       3 2 2 B = (−1 , 1 , 0) . 3 0 1

4.8

Find the SVD of the matrix  A=

" A=

|

4.9

√1 2 √1 2

− √1

2 √1 2

{z

=U

#

3 2



2 3

2 . −2

 5 0

}|

0 3

{z

0 0

  

√1 2 √1 2

0

} |

1 √ 3 2 1 √ 3 √ 2 −232

{z

=V

− 23 2 3 1 3

   }

Find the singular value decomposition of   A=

2 −1

2 . 1

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

513

Exercises

(i) Compute the symmetrized matrix A> A. We first compute the symmetrized matrix      2 −1 2 2 5 3 A> A = = . (4.2) 2

−1

1

1

3

5

(ii) Find the right-singular vectors and singular values from A> A. The characteristic polynomial of A> A is (5 − λ)2 − 9 = λ2 − 10λ + 16 = 0 .

(4.3)

This yields eigenvalues, sorted from largest absolute first, λ1 = 8 and λ2 = 2. The associated normalized eigenvectors are respectively " v1 =

− √1

#

2

" and v 2 =

√1 2

√1 2 √1 2

# (4.4)

.

We have thus obtained the right-singular orthogonal matrixx " 1 # 1 V = [v 1 , v 2 ] =

√ 2 − √1 2

2 √1 2

(4.5)

.

(iii) Determine the singular values. We obtain the two singular values from the square root of the eigen√ √ √ √ √ values σ1 = λ1 = 8 = 2 2 and σ2 = λ2 = 2. We construct the singular value diagonal matrix as    √  σ 0 2 2 0 √ . Σ= 1 = (4.6) 0

σ2

0

2

(iv) Find the left-singular eigenvectors. We have to map the two eigenvectors v 1 , v 2 using A. This yields two self-consistent equations that enable us to find orthogonal eigenvectors u1 , u2 : > Av 1 = (σ1 u1 v > 1 )v 1 = σ1 u1 (v 1 v 1 ) = σ1 u1 =

 √  2 2 0



0 > Av 2 = (σ2 u2 v > 2 )v 2 = σ2 u2 (v 2 v 2 ) = σ2 u2 = √ 2



We normalize the left-singular vectors by  diving them  by their respective singular values and obtain u1 =

1 0 and u2 = , which yields 0 1



1 U = [u1 , u2 ] = 0



0 . 1

(4.7)

(v) Assemble the left-/right-singular vectors and singular values. The SVD of A is #   √  " √1 √1 A = U ΣV > =

1 0

0 1

8 0

0 √ 2

2

− √1

2

2 √1 2

.

c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

514

Matrix Decompositions

4.10 Find the rank-1 approximation of 

3 2

A=

2 3

2 −2



To find the rank-1 approximation we apply the SVD to A (as in Exercise 4.7) to obtain # " 1 1 −√

U=

V =

2 √1 2  1 √ 2  √1  2

2 √1 2 1 − √ 3 2 1 √ 3 √ 2 2 2 − 3

0

− 32 2 3 1 3

  .

We apply the construction rule for rank-1 matrices Ai = ui v > i .

We use the largest singular value (σ1 = 5, i.e., i = 1 and the first column vectors of the U and V matrices, respectively: " 1 # h i 1 1 1 0 √ > 2 √1 √1 A1 = u1 v 1 =

√1 2

2

2

0 =

2 1

1

0

.

To find the rank-1 approximation, we apply the SVD to A to obtain " 1 # 1 √

U=

V =

2 √1 2  1 √ 2  √1  2

0

−√

2 √1 2 1 − √ 3 2 1 √ 3 √ 2 2 2 − 3

− 32 2 3 1 3

  .

We apply the construction rule for rank-1 matrices Ai = ui v > i .

We use the largest singular value (σ1 = 5, i.e., i = 1) and therefore, the first column vectors of U and V , respectively, which then yields " 1 #   h i √ A1 = u1 v > 1 =

2 √1 2

√1 2

√1 2

0 =

1 1 2 1

1 1

0 . 0

4.11 Show that for any A ∈ Rm×n the matrices A> A and AA> possess the same nonzero eigenvalues. Let us assume that λ is a nonzero eigenvalue of AA> and x is an eigenvector belonging to λ. Thus, the eigenvalue equation (AA> )x = λx

can be manipulated by left multiplying by A> and pulling on the right-hand side the scalar factor λ forward. This yields A> (AA> )x = A> (λx) = λ(A> x) Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

515

Exercises

and we can use matrix multiplication associativity to reorder the left-hand side factors (A> A)(A> x) = λ(A> x) .

This is the eigenvalue equation for A> A. Therefore, λ is the same eigenvalue for AA> and A> A. 4.12 Show that for x 6= 0 Theorem 4.24 holds, i.e., show that max x

kAxk2 = σ1 , kxk2

where σ1 is the largest singular value of A ∈ Rm×n . (i) We compute the eigendecomposition of the symmetric matrix A> A = P DP >

for diagonal D and orthogonal P . Since the columns of P are an ONB of Rn , we can write every y = P x as a linear combination of the eigenvectors pi so that y = Px =

n X

xi pi ,

(4.8)

x∈ R .

i=1

Moreover, since the orthogonal matrix P preserves lengths (see Section 3.4), we obtain kyk22 = kP xk22 = kxk22 =

n X

x2i .

(4.9)

i=1

(ii) Then, kAxk22

>

>

>

= x (P DP )x = y Dy =

* n Xp

λi xi pi ,

i=1

n p X

+ λi xi pi

,

i=1

where we used h·, ·i to denote the dot product. (iii) The bilinearity of the dot product gives us kAxk22 =

n X

λi hxi pi , xi pi i =

i=1

n X

λi x2i

i=1

where we exploited that the pi are an ONB and p> i pi = 1. (iv) With (4.8) we obtain kAxk22 6 ( max λj ) 16j6n

n X

(4.9)

x2i =

i=1

max λj kxk22

16j6n

so that kAxk22 kxk22

6 max λj , 16j6n

where λj are the eigenvalues of A> A. c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

516

Matrix Decompositions (v) Assuming the eigenvalues of A> A are sorted in descending order, we get p kAxk2 6 λ1 = σ1 , kxk2

where σ1 is the maximum singular value of A.

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

5 Vector Calculus

Exercises 5.1

0

Compute the derivative f (x) for f (x) = log(x4 ) sin(x3 ) .

f 0 (x) =

5.2

5.3

4 sin(x3 ) + 12x2 log(x) cos(x3 ) x

Compute the derivative f 0 (x) of the logistic sigmoid f (x) =

1 . 1 + exp(−x)

f 0 (x) =

exp(x) (1 + exp(x))2

Compute the derivative f 0 (x) of the function f (x) = exp(− 2σ1 2 (x − µ)2 ) ,

where µ, σ ∈ R are constants. f 0 (x) = − σ12 f (x)(x − µ)

5.4

Compute the Taylor polynomials Tn , n = 0, . . . , 5 of f (x) = sin(x) + cos(x) at x0 = 0. T0 (x) = 1 T1 (x) = T0 (x) + x x2 2 x3 T3 (x) = T2 (x) − 6 x4 T4 (x) = T3 (x) + 24 x5 T5 (x) = T4 (x) + 120

T2 (x) = T1 (x) −

517 This material will be published by Cambridge University Press as Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. This pre-publication version is free to view and download for personal use only. Not for re-distribution, re-sale or use in derivac tive works. by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2020. https://mml-book.com.

518 5.5

Vector Calculus Consider the following functions: x ∈ R2

f1 (x) = sin(x1 ) cos(x2 ) , f2 (x, y) = x> y ,

x, y ∈ Rn

f3 (x) = xx> ,

x ∈ Rn

a. What are the dimensions of b. Compute the Jacobians.

∂fi ∂x

?

f1 ∂f1 = cos(x1 ) cos(x2 ) ∂x1 ∂f1 = − sin(x1 ) sin(x2 ) ∂x2 =⇒ J =

h

∂f1 ∂x2

∂f1 ∂x1

i

− sin(x1 ) sin(x2 ) ∈ R1×2





= cos(x1 ) cos(x2 )

f2 x> y =

X

xi yi

i

∂f2 = ∂x

h

∂f2 ∂x1

···

∂f2 ∂xn

i

∂f2 = ∂y

h

∂f2 ∂y1

···

∂f2 ∂yn

i

=⇒ J =

h

∂f2 ∂x

∂f2 ∂y

i

= y1



···

yn = y > ∈ Rn



···

x n = x> ∈ R n

= x1

= y>







1×2n x> ∈ R



f3 : Rn → Rn×n x1 x>  x2 x>   

 xx> = 

  = xx1 

.. .

···

xx2

xxn ∈ Rn×n



x n x> x>  0>   n

 =⇒

 ∂f3 =  . + x ∂x1  ..  | 0> n

···

0n

{z

∈Rn×n

0n ∈ Rn×n



}

| {z }

∈Rn×n

 ∂f3 =⇒ = ∂xi

|

0(i−1)×n   + 0n×(i−1) x> 0(n−1+1)×n

{z

∈Rn×n

|

x

{z

0n×(n−i+1) ∈ Rn×n

∈Rn×n



}

}

To get the Jacobian, we need to concatenate all partial derivatives and obtain h i ∂f3 ∂f3 J = ∂x ∈ R(n×n)×n · · · ∂x 1 n

∂f3 ∂xi

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

519

Exercises 5.6

Differentiate f with respect to t and g with respect to X , where f (t) = sin(log(t> t)) ,

t ∈ RD A ∈ RD×E , X ∈ RE×F , B ∈ RF ×D ,

g(X) = tr(AXB) ,

where tr(·) denotes the trace. ∂f 1 = cos(log(t> t)) · > · 2t> ∂t t t

The trace for T ∈ RD×D is defined as tr(T ) =

D X

Tii

i=1

A matrix product ST can be written as X (ST )pq =

Spi Tiq

i

The product AXB contains the elements (AXB)pq =

E X F X

Api Xij Bjq

i=1 j=1

When we compute the trace, we sum up the diagonal elements of the matrix. Therefore we obtain,   D D E X F X X X  tr(AXB) = (AXB)kk = Aki Xij Bjk  k=1

k=1

i=1 j=1

X ∂ tr(AXB) = Aki Bjk = (BA)ji ∂Xij k

We know that the size of the gradient needs to be of the same size as X (i.e., E × F ). Therefore, we have to transpose the result above, such that we finally obtain ∂ tr(AXB) = |{z} A> |{z} B> ∂X E×D D×F

5.7

Compute the derivatives df /dx of the following functions by using the chain rule. Provide the dimensions of every single partial derivative. Describe your steps in detail. a. f (z) = log(1 + z) ,

z = x> x ,

x ∈ RD

b. f (z) = sin(z) ,

z = Ax + b ,

A ∈ RE×D , x ∈ RD , b ∈ RE

where sin(·) is applied to every element of z .

c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

520

Vector Calculus a. ∂f ∂z df ∈ R1×D = dx ∂z ∂x |{z} |{z} ∈R ∈R1×D

∂f 1 1 = = ∂z 1+z 1 + x> x ∂z = 2x> ∂x df 2x> =⇒ = dx 1 + x> x

b. ∂z ∈ RE×D ∂x |{z}

df ∂f = dx ∂z |{z}

∈RE×E ∈RE×D

sin z1

.. .

 

sin(z) = 

sin zE

0

 .   ..     0    ∂ sin z   = cos(zi ) ∈ RE ∂zi    0     ..   .  0 ∂f =⇒ = diag(cos(z)) ∈ RE×E ∂z ∂z = A ∈ RE×D : ∂x ci =

D X j=1

Aij xj =⇒

∂ci = Aij , ∂xj

i = 1, . . . , E, j = 1, . . . , D

Here, we defined ci to be the ith component of Ax. The offset b is constant and vanishes when taking the gradient with respect to x. Overall, we obtain df = diag(cos(Ax + b))A dx

5.8

Compute the derivatives df /dx of the following functions. Describe your steps in detail. a. Use the chain rule. Provide the dimensions of every single partial derivative. f (z) = exp(− 12 z) z = g(y) = y > S −1 y y = h(x) = x − µ

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

521

Exercises where x, µ ∈ RD , S ∈ RD×D . The desired derivative can be computed using the chain rule: ∂f ∂g ∂h df = ∈ R1×D dx ∂z ∂y ∂x |{z} |{z} |{z} 1×1 1×D D×D

Here ∂f = − 12 exp(− 21 z) ∂z ∂g = 2y > S −1 ∂y ∂h = ID ∂x

so that  df = − exp − 21 (x − µ)> S −1 (x − µ) (x − µ)> S −1 dx b. f (x) = tr(xx> + σ 2 I) ,

x ∈ RD

Here tr(A) is the trace of A, i.e., the sum of the diagonal elements Aii . Hint: Explicitly write out the outer product. Let us have a look at the outer product. We define X = xx> with Xij = xi xj

The trace sums up all the diagonal elements, such that D

X ∂Xii + σ 2 ∂ tr(X + σ 2 I) = = 2xj ∂xj ∂xj i=1

for j = 1, . . . , D. Overall, we get ∂ tr(xx> + σ 2 I) = 2x> ∈ R1×D ∂x

c. Use the chain rule. Provide the dimensions of every single partial derivative. You do not need to compute the product of the partial derivatives explicitly. f = tanh(z) ∈ RM z = Ax + b,

x ∈ RN , A ∈ RM ×N , b ∈ RM .

Here, tanh is applied to every component of z . ∂f = diag(1 − tanh2 (z)) ∈ RM ×M ∂z ∂z ∂Ax = = A ∈ RM ×N ∂x ∂x c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

522

Vector Calculus We get the latter result by defining y = Ax, such that X ∂yi ∂yi yi =

Aij xj =⇒

j

=⇒

∂xk

= Aik =⇒

∂x

= [Ai1 , ..., AiN ] ∈ R1×N

∂y =A ∂x

The overall derivative is an M × N matrix. 5.9

We define g(z, ν) := log p(x, z) − log q(z, ν) z := t(, ν)

for differentiable functions p, q, t. By using the chain rule, compute the gradient d g(z, ν) . dν d d d d g(z, ν) = (log p(x, z) − log q(z, ν)) = log p(x, z) − log q(z, ν) dν dν dν dν ∂t(, ν) ∂t(, ν) ∂ ∂ ∂ = log p(x, z) − log q(z, ν) − log q(z, ν) ∂z ∂ν ∂z ∂ν ∂ν   ∂t(, ν) ∂ ∂ ∂ log p(x, z) − log q(z, ν) − log q(z, ν) = ∂z ∂z ∂ν ∂ν

 =

1 ∂ 1 ∂ p(x, z) − q(z, ν) p(x, z) ∂z q(z, ν) ∂z



∂t(, ν) ∂ − log q(z, ν) ∂ν ∂ν

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

6 Probability and Distributions

Exercises 6.1

Consider the following bivariate distribution p(x, y) of two discrete random variables X and Y .

y1 0.01 0.02 0.03 0.1

0.1

y2 0.05 0.1 0.05 0.07 0.2

Y

y3 0.1 0.05 0.03 0.05 0.04 x1

x2

x3

x4

x5

X Compute: a. The marginal distributions p(x) and p(y). b. The conditional distributions p(x|Y = y1 ) and p(y|X = x3 ). The marginal and conditional distributions are given by p(x) = [0.16, 0.17, 0.11, 0.22, 0.34]> p(y) = [0.26, 0.47, 0.27]> p(x|Y = y1 ) = [0.01, 0.02, 0.03, 0.1, 0.1]> p(y|X = x3 ) = [0.03, 0.05, 0.03]> .

6.2

Consider a mixture of two Gaussian distributions (illustrated in Figure 6.4),         0.4 N

10 1 , 2 0

0 1

+ 0.6 N

0 8.4 , 0 2.0

2.0 1.7

.

a. Compute the marginal distributions for each dimension. b. Compute the mean, mode and median for each marginal distribution. c. Compute the mean and mode for the two-dimensional distribution. Consider the mixture of two Gaussians,       p

x1 x2

= 0.4 N

10 1 , 2 0

0 1

   + 0.6 N

0 8.4 , 0 2.0

2.0 1.7

 .

523 This material will be published by Cambridge University Press as Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. This pre-publication version is free to view and download for personal use only. Not for re-distribution, re-sale or use in derivac tive works. by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2020. https://mml-book.com.

524

Probability and Distributions a. Compute the marginal distribution for each dimension. 

Z p(x1 ) =

  

10 1 , 2 0

0.4 N

  

Z = 0.4

N

10 1 , 2 0

0 1 0 1



  

0 8.4 , 0 2.0

+ 0.6N



  

Z N

dx2 + 0.6

2.0 1.7



0 8.4 , 0 2.0

dx2

(6.1) 

2.0 1.7

dx2

(6.2) 



= 0.4N 10, 1 + 0.6N 0, 8.4 .

From (6.1) to (6.2), we used the result that the integral of a sum is the sum of the integrals. Similarly, we can get   p(x2 ) = 0.4N 2, 1 + 0.6N 0, 1.7 .

b. Compute the mean, mode and median for each marginal distribution. Mean: Z E(x1 ) = x1 p(x1 )dx1 Z   = x1 0.4N x1 | 10, 1 + 0.6N x1 | 0, 8.4 dx1 (6.3) Z Z   = 0.4 x1 N x1 | 10, 1 dx1 + 0.6 x1 N x1 | 0, 8.4 dx1 (6.4) = 0.4 · 10 + 0.6 · 0 = 4 .

 From step (6.3) to step (6.4), we use the fact that for Y ∼ N µ, σ , where E[Y ] = µ. Similarly, E(x2 ) = 0.4 · 2 + 0.6 · 0 = 0.8 . Mode: In principle, we would need to solve for dp(x1 ) =0 dx1

and

dp(x2 ) = 0. dx2

However, we can observe that the modes of each individual distribution are the peaks of the Gaussians, that is the Gaussian means for each dimension. Median: Z a p(x1 )dx1 = −∞

1 . 2

c. Compute the mean and mode for the 2 dimensional distribution. Mean: From (6.30), we know that       x E[x1 ] 4 E[ 1 ] = = . x2 E[x2 ] 0.8 Mode: The two dimensional distribution has two peaks, and hence there are Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

525

Exercises

two modes. In general, we would need to solve an optimization problem to find the maxima. However, in this particular case we can observe that the two modes correspond to the individual Gaussian means, that is     10 0 and . 2

6.3

0

You have written a computer program that sometimes compiles and sometimes not (code does not change). You decide to model the apparent stochasticity (success vs. no success) x of the compiler using a Bernoulli distribution with parameter µ: p(x | µ) = µx (1 − µ)1−x ,

x ∈ {0, 1} .

Choose a conjugate prior for the Bernoulli likelihood and compute the posterior distribution p(µ | x1 , . . . , xN ). The posterior is proportional to the product of the likelihood (Bernoulli) and the prior (Beta). Given the choice of the prior, we already know that the posterior will be a Beta distribution. With a conjugate Beta prior p(µ) = Beta(a, b) obtain N

p(µ | X) =

Y x p(µ)p(X | µ) ∝ µa−1 (1 − µ)b−1 µ i (1 − µ)1−xi p(X) i=1

∝µ

P a−1+ i xi

(1 − µ)

b−1+N −

P

i xi

∝ Beta(a +

X

xi , b + N −

i

X

xi ) .

i

Alternative approach: We look at the log-posterior. Ignoring all constants that are independent of µ and xi , we obtain log p(µ | X) = (a − 1) log µ + (b − 1) log(1 − µ) + log µ

X

xi + log(1 − µ)

i

= log µ(a − 1 +

X

(1 − xi )

i

X

xi ) + log(1 − µ)(b − 1 + N −

i

6.4

X

xi ) .

i

P P Therefore, the posterior is again Beta(a + i xi , b + N − i xi ). There are two bags. The first bag contains four mangos and two apples; the second bag contains four mangos and four apples. We also have a biased coin, which shows “heads” with probability 0.6 and “tails” with probability 0.4. If the coin shows “heads”. we pick a fruit at random from bag 1; otherwise we pick a fruit at random from bag 2. Your friend flips the coin (you cannot see the result), picks a fruit at random from the corresponding bag, and presents you a mango. What is the probability that the mango was picked from bag 2? Hint: Use Bayes’ theorem. We apply Bayes’ theorem and compute the posterior p(b2 | m) of picking a mango from bag 2. p(b2 | m) =

p(m | b2 )p(b2 ) p(m)

c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

526

Probability and Distributions where p(m) = p(b1 )p(m | b1 ) + p(b2 )p(m | b2 ) = p(b2 ) = p(m | b2 ) =

2 5 1 2

21 2 1 3 32 + = + = 53 52 5 5 5

Evidence

Prior Likelihood

Therefore, p(b2 | m) =

6.5

p(m | b2 )p(b2 ) = p(m)

21 52 3 5

1 . 3

=

Consider the time-series model xt+1 = Axt + w , y t = Cxt + v ,

w ∼ N 0, Q





v ∼ N 0, R ,

where w, v are i.i.d. Gaussian noise variables. Further, assume that p(x0 ) = N µ0 , Σ0 . a. What is the form of p(x0 , x1 , . . . , xT )? Justify your answer (you do not have to explicitly compute the joint distribution).  b. Assume that p(xt | y 1 , . . . , y t ) = N µt , Σt . 1. Compute p(xt+1 | y 1 , . . . , y t ). 2. Compute p(xt+1 , y t+1 | y 1 , . . . , y t ). ˆ . Compute the conditional 3. At time t+1, we observe the value y t+1 = y distribution p(xt+1 | y 1 , . . . , y t+1 ). 1. The joint distribution is Gaussian: p(x0 ) is Gaussian, and xt+1 is a linear/affine transformations of xt . Since affine transformations leave the Gaussianity of the random variable invariant, the joint distribution must be Gaussian. 2. 1. We use the results from linear transformation of Gaussian random variables (Section 6.5.3). We immediately obtain  p(xt+1 | y 1:t ) = N xt+1 | µt+1 | t , Σt+1 | t

µt+1 | t := Aµt Σt+1 | t := AΣt A> + Q .

2. The joint distribution p(xt+1 , y t+1 | y 1:t ) is Gaussian (linear transformation of random variables). We compute every component of the Gaussian separately: E[y t+1 | y 1:t ] = Cµt+1 | t =: µyt+1 | t V[y t+1 | y 1:t ] = CΣt+1 | t C > + R =: Σyt+1 | t Cov[xt+1 , y t+1 | y 1:t ] = Cov[xt+1 , Cxt+1 | y 1:t ] = Σt+1 | t C > =: Σxy

Therefore, " p(xt+1 , y t+1 | y 1:t ) = N

# "

µt+1 | t Σt+1 | t , Σyx µyt+1 | t

Σxy Σyt+1 | t

# ! .

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

527

Exercises

3. We obtain the desired distribution (again: Gaussian) by applying the rules for Gaussian conditioning of the joint distribution in B):  p(xt+1 | y 1:t+1 ) = N µt+1 | t+1 , Σt+1 | t+1

µt+1 | t+1 = µt+1 | t + Σxy (Σyt+1 | t )−1 (ˆ y − µyt+1 | t ) Σt+1 | t+1 = Σt+1 | t − Σxy (Σyt+1 | t )−1 Σyx

6.6

Prove the relationship in (6.44), which relates the standard definition of the variance to the raw-score expression for the variance. The formula in (6.43) can be converted to the so called raw-score formula for variance N N   1 X 2 1 X (xi − µ)2 = xi − 2xi µ + µ2 N N i=1

i=1

N 1 X

=

N 1 N

=

N

x2i −

i=1 N X

2 X µ xi + µ2 N i=1

x2i − µ2

i=1

N 1 X 2 = xi − N i=1

6.7

N 1 X xi N

!2 .

i=1

Prove the relationship in (6.45), which relates the pairwise difference between examples in a dataset with the raw-score expression for the variance. Consider the pairwise squared deviation of a random variable: N N 1 X 1 X 2 2 (x − x ) = xi + x2j − 2xi xj i j N2 N2 i,j=1

(6.5)

i,j=1

=

N N N 1 X 2 1 X 2 1 X xi xj xi + xj − 2 2 N N N i=1

N N N 2 X 2 1 X 1 X  = xi − 2 xi xj N N N i=1

N 1 X 2 = 2 xi − N i=1

(6.6)

i,j=1

j=1

i=1

(6.7)

j=1

N 1 X xi N

!2  .

(6.8)

i=1

Observe that the last equation above is twice of (6.44). Going from (6.6) to (6.7), we need to make two observations. The first two terms in (6.6) are sums over all examples, and the index is not important, therefore they can be combined into twice the sum. The last term is obtained by “pushing in” the sum over j . This can be seen by a small example by considering N = 3: x1 x1 + x1 x2 + x1 x3 + x2 x1 + x2 x2 + x2 x3 + x3 x1 + x3 x2 + x3 x3 =x1 (x1 + x2 + x3 ) + x2 (x1 + x2 + x3 ) + x3 (x1 + x2 + x3 ) . c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

528

Probability and Distributions Going from (6.7) to (6.8), we consider the sum over j as a common factor, as seen in the following small example: x1 (x1 + x2 + x3 ) + x2 (x1 + x2 + x3 ) + x3 (x1 + x2 + x3 ) =(x1 + x2 + x3 )(x1 + x2 + x3 ) .

6.8

Express the Bernoulli distribution in the natural parameter form of the exponential family, see (6.107). p(x | µ) = µx (1 − µ)1−x

h



= exp log µx (1 − µ)1−x

i

= exp [x log µ + (1 − x) log(1 − µ)]

 = exp x log

µ + log(1 − µ) 1−µ



= exp [xθ − log(1 + exp(θ))] ,

6.9

µ where the last line is obtained by substituting θ = log 1−µ . Note that this is in exponential family form, as θ is the natural parameter, the sufficient statistic φ(x) = x and the log partition function is A(θ) = log(1 + exp(θ)). Express the Binomial distribution as an exponential family distribution. Also express the Beta distribution is an exponential family distribution. Show that the product of the Beta and the Binomial distribution is also a member of the exponential family. Express Binomial distribution as an exponential family distribution:

Recall the Binomial distribution from Example 6.11 ! p(x|N, µ) =

N µx (1 − µ)(N −x) x

x = 0, 1, ..., N .

This can be written in exponential family form ! p(x|N, µ) =

N x

!

=

N x

!

=

N x

exp [log(µx (1 − µ)(N −x) )]

exp [x log µ + (N − x) log(1 − µ)]

exp [x log

µ + N log(1 − µ)] . 1−µ

The last line can be identified as being in exponential family form by observing that (  N h(x) =

x

0

x = 0, 1, ..., N.

otherwise

µ θ = log 1−µ Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

529

Exercises φ(x) = x A(θ) = −N log(1 − µ) = N log(1 + eθ ) .

Express Beta distribution as an exponential family distribution: Recall the Beta distribution from Example 6.11 Γ(α + β) α−1 µ (1 − µ)(β−1) Γ(α)Γ(β) 1 = µα−1 (1 − µ)(β−1) . B(α, β)

p(µ|α, β) =

This can be written in exponential family form p(µ|α, β) = exp [(α − 1) log µ + (β − 1) log(1 − µ) − log B(α, β)] .

This can be identified as being in exponential family form by observing that h(µ) = 1 θ|(α, β) = [α − 1 φ(µ) = [log µ

β − 1]> log(1 − µ)]

A(θ) = log B(α + β) .

Show the product of the Beta and Binomial distribution is also a member of exponential family: We treat µ as random variable here, x as an known integer between 0 and N . For x = 0, . . . , N , we get !

1 N p(x|N, µ)p(µ|α, β) = µx (1 − µ)(N −x) µα−1 (1 − µ)(β−1) x B(α, β) N = x

! exp [x log

µ + N log(1 − µ)] 1−µ

· exp [(α − 1) log µ + (β − 1) log(1 − µ) − log B(α, β)]

h

= exp (x + α − 1) log µ + (N − x + β − 1) log(1 − µ) + log

N x

!

i

− log B(α, β) .

The last line can be identified as being in exponential family form by observing that h(µ) = 1 θ = [x + α − 1 φ(µ) = [log µ A(θ) = log

N x

N − x + β − 1]>

log(1 − µ)]

! + log B(α + β) .

c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

530

Probability and Distributions

6.10 Derive the relationship in Section 6.5.2 in two ways: a. By completing the square b. By expressing the Gaussian in its exponential family form   The product of two Gaussians N x | a, A N x | b, B is an unnormalized Gaussian distribution c N x | c, C with C = (A−1 + B −1 )−1 c = C(A−1 a + B −1 b) D

1

c = (2π)− 2 | A + B | − 2 exp − 12 (a − b)> (A + B)−1 (a − b) .



Note that the normalizing constant c itself can be considered a (normalized) Gaussian distribution either in a or in b with an “inflated” covariance matrix  A + B , i.e., c = N a | b, A + B = N b | a, A + B . (i) By completing the square, we obtain  D 1 1 N x | a, A ) =(2π)− 2 |A|− 2 exp [− (x − a)> A−1 (x − a))], 2  1 −D − 12 2 N x | b, B ) =(2π) |B| exp [− (x − b)> B −1 (x − b))] . 2 For convenience, we consider the log of the product:    log N x | a, A N x | b, B





= log N x | a, A + log N x | b, B 1 = − [(x − a)> A−1 (x − a) + (x − b)> B −1 (x − b)] + const 2 1 = − [x> A−1 x − 2x> A−1 a + a> A−1 a + x> B −1 x 2 − 2x> B −1 b + b> B −1 b] + const 1 = − [x> (A−1 + B −1 x − 2x(A−1 a + B −1 b) + a> A−1 a + b> B −1 b] 2 + const h 1 =− (x − (A−1 + B −1 )−1 (A−1 a + B −1 b))> 2

i

· (A−1 + B −1 )(x − (A−1 + B −1 )−1 (A−1 a + B −1 b) + const .

 Thus, the corresponding product is N x | c, C with C = (A−1 + B −1 )−1 −1

c = (A

+B

−1 −1

)

(6.9) −1

(A

a+B

−1

−1

b) = C(A

a+B

−1

b) . (6.10)

(ii) By expressing Gaussian in its exponential family form: Recall from the Example 6.13 we get the exponential form of univariate Gaussian distribution, similarly we can get the exponential form of multivariate Gaussian distribution as  1 1 p(x|a, A) = exp

2

x> A−1 x + x> A−1 a −

2

a> A−1 a

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

531

Exercises −

1 1 log(2π) − log |A| . 2 2



This can be identified as being in exponential family form by observing that h(µ) = 1 − 12 vec A−1 A−1 a

 θ|(a, A) =

" φ(x) =



vec xx>



# .

x

Then the product of Gaussian distribution can be expressed as p(x|a, A)p(x|b, B) = exp(hθ|(a, A), φ(x)i + hθ|(b, B), φ(x)i) + const = exp(hθ|(a, A, b, B), φ(x)i) + const,

>  where θ | (a, A, b, B) = − 21 (A−1 + B −1 ) A−1 a + B −1 b . Then we end up with the same answer as shown in (6.9)–(6.10). 6.11 Iterated Expectations. Consider two random variables x, y with joint distribution p(x, y). Show that   EX [x] = EY EX [x | y] . Here, EX [x | y] denotes the expected value of x under the conditional distribution p(x | y).   EX [x] = EY EX [x | y] Z ⇐⇒

Z ⇐⇒

xp(x | y)p(y)dxdy

ZZ xp(x)dx =

Z ⇐⇒

ZZ

xp(x)dx =

xp(x, y)dxdy

Z xp(x)dx =

Z

Z

x

p(x, y)dydx =

xp(x)dx ,

which proves the claim. 6.12 Manipulation of Gaussian Random Variables.  Consider a Gaussian random variable x ∼ N x | µx , Σx , where x ∈ RD . Furthermore, we have y = Ax + b + w ,

 where y ∈ RE , A ∈ RE×D , b ∈ RE , and w ∼ N w | 0, Q is independent Gaussian noise. “Independent” implies that x and w are independent random variables and that Q is diagonal. a. Write down the likelihood p(y | x). 

p(y | x) = N y | Ax + b, Q

= Z exp − 21 (y − Ax − b)> Q−1 (y − Ax − b)



where Z = (2π)−E/2 | Q | −1/2 c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

532

Probability and Distributions R b. The distribution p(y) = p(y | x)p(x)dx is Gaussian. Compute the mean µy and the covariance Σy . Derive your result in detail. p(y) is Gaussian distributed (affine transformation of the Gaussian random variable x) with mean µy and covariance matrix Σy . We obtain µy = EY [y] = EX,W [Ax + b + w] = AEX [x] + b + EW [w] = Aµx + b,

and

>

Σy = EY [yy ] − EY [y]EY [y]> = EX,W (Ax + b + w)(Ax + b + w)> − µy µ> y





= EX,W Axx> A> + Axb> + Axw> + bx> A> + bb> + bw>



+ w(Ax + b)> + ww> ) − µy µ> y .



We use the linearity of the expected value, move all constants out of the expected value, and exploit the independence of w and x: Σy = AEX [xx> ]A> + AEX [x]b> + bEX [x> ]A> + bb> + EW [ww> ] − (Aµx + b)(Aµx + b)> ,

where we used our previous result for µy . Note that EW [w] = 0. We continue as follows: > > Σy = AEX [xx> ]A> + Aµx b> + bµ> x A + bb + Q > > > > > − Aµx µ> x A − Aµx b − bµx A − bb > = A EX [xx> ] − µx µ> x A +Q



|

{z

=Σx

}

= AΣx A> + Q .

Alternatively, we could have exploited i.i.d.

Vy [y] = Vx,w [Ax + b + w] = Vx [Ax + b] + Vw [w] = AVx A> + Q = AΣx A> + Q .

c. The random variable y is being transformed according to the measurement mapping z = Cy + v ,

 where z ∈ RF , C ∈ RF ×E , and v ∼ N v | 0, R is independent Gaussian (measurement) noise. Write down p(z | y). p(z | y) = N z | Cy, R



= (2π)−F/2 | R | −1/2 exp − 21 (z − Cy)> R−1 (z − Cy) .



Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

533

Exercises

Compute p(z), i.e., the mean µz and the covariance Σz . Derive your result in detail. Remark: Since y is Gaussian and z is a linear transformation of y , p(z) is Gaussian, too. Let’s compute its moments (similar to the “time update”): µz = EZ [z] = EY,V [Cy + v] = C EY [y] + EV [v]

| {z } =0

= Cµy .

For the covariance matrix, we compute Σz = EZ [zz > ] − EZ [z]EZ [z]> = EY,V (Cy + v)(Cy + v)> − µz µ> z





= EY,V Cyy > C > + Cyv > + v(Cy)> + vv > ) − µz µ> z .





We use the linearity of the expected value, move all constants out of the expected value, and exploit the independence of v and y : > Σz = C EY [yy > ]C > + EV [vv > ] − Cµy µ> yC ,

where we used our previous result for µz . Note that EV [v] = 0. We continue as follows:  > Σy = C EY [yy > ] − µy µ> y C +R | {z } =Σx

= CΣy C

>

+ R.

ˆ is measured. Compute the posterior distribution p(x | y ˆ ). d. Now, a value y Hint for solution: This posterior is also Gaussian, i.e., we need to determine only its mean and covariance matrix. Start by explicitly computing the joint Gaussian p(x, y). This also requires us to compute the cross-covariances Covx,y [x, y] and Covy,x [y, x]. Then apply the rules for Gaussian conditioning. We derive the posterior distribution following the second hint since we do not have to worry about normalization constants: Assume, we know the joint Gaussian distribution       x µx Σx Σxy p(x, y) = N , , (6.11) y

µy

Σyx

Σy

where we defined Σxy := Covx,y [x, y]. Now, we apply the rules for Gaussian conditioning to obtain  ˆ ) = N x | µx | y , Σx | y p(x | y

µx | y = µx + Σxy Σ−1 y − µy ) y (ˆ Σx | y = Σx − Σxy Σ−1 y Σyx . c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

534

Probability and Distributions Looking at (6.11), it remains to compute the cross-covariance term Σxy (the marginal distributions p(x) and p(y) are known and Σyx = Σ> xy ): Σxy = Cov[x, y] = EX,Y [xy > ] − EX [x]EY [y]> = EX,Y [xy > ] − µx µ> y ,

where µx and µy are known. Hence, it remains to compute EX,Y [xy > ] = EX [x(Ax + b + w)> ] = EX [xx> ]A> + EX [x]b> = EX [xx> ]A> + µx b> .

With > > > µx µ> y = µx µx A + µx b

we obtain the desired cross-covariance > > Σxy = EX [xx> ]A> + µx b> − µx µ> x A − µx b > > = EX [xx> ] − µx µ> x A = Σx A .



And finally µx | y = µx + Σx A> (AΣx A> + Q)−1 (ˆ y − Aµx − b) , Σx | y = Σx − Σx A> (AΣx A> + Q)−1 AΣx .

6.13 Probability Integral Transformation Given a continuous random variable x, with cdf Fx (x), show that the random variable y = Fx (x) is uniformly distributed. Proof We need to show that the cumulative distribution function of Y defines a distribution of a uniform random variable. Recall that by the axioms of probability (Section 6.1) probabilities must be non-negative and sum/ integrate to one. Therefore, the range of possible values of Y = FX (x) is −1 the interval [0, 1]. For any FX (x), the inverse FX (x) exists because we assumed that FX (x) is strictly monotonically increasing, which we will use in the following. Given any continuous random variable X , the definition of a cdf gives FY (y) = P (Y 6 y) = P (FX (x) 6 y) = =

−1 P (X 6 FX (y)) −1 FX (FX (y))

transformation of interest inverse exists definition of cdf

= y,

where the last line is due to the fact that FX (x) composed with its inverse results in an identity transformation. The statement FY (y) = y along with the fact that y lies in the interval [0, 1] means that FY (x) is the cdf of the uniform random variable on the unit interval.

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

7 Continuous Optimization

Exercises 7.1

Consider the univariate function f (x) = x3 + 6x2 − 3x − 5.

Find its stationary points and indicate whether they are maximum, minimum, or saddle points. Given the function f (x), we obtain the following gradient and Hessian, df = 3x2 + 12x − 3 dx d2 f = 6x + 12 . dx2

We find stationary points by setting the gradient to zero, and solving for x. One option is to use the formula for quadratic functions, but below we show how to solve it using completing the square. Observe that (x + 2)2 = x2 + 4x + 4

and therefore (after dividing all terms by 3), (x2 + 4x) − 1 = ((x + 2)2 − 4) − 1 .

By setting this to zero, we obtain that (x + 2)2 = 5,

and hence x = −2 ±

Substituting the solutions of

df dx

5.

= 0 into the Hessian gives

√ d2 f (−2 − 5) ≈ −13.4 dx2

7.2

and

√ d2 f (−2 + 5) ≈ 13.4 . dx2

This means that the left stationary point is a maximum and the right one is a minimum. See Figure 7.1 for an illustration. Consider the update equation for stochastic gradient descent (Equation (7.15)). Write down the update when we use a mini-batch size of one. θ i+1 = θ i − γi (∇L(θ i ))> = θ i − γi ∇Lk (θ i )>

7.3

where k is the index of the example that is randomly chosen. Consider whether the following statements are true or false:

535 This material will be published by Cambridge University Press as Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. This pre-publication version is free to view and download for personal use only. Not for re-distribution, re-sale or use in derivac tive works. by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2020. https://mml-book.com.

536

Continuous Optimization

Figure 7.1 A plot of the function f (x) along with its gradient and Hessian.

100

f (x) df dx

80

d2 f dx2

60 40 20 0 −20 −40

−6

−4

−2

0

2

a. The intersection of any two convex sets is convex. b. The union of any two convex sets is convex. c. The difference of a convex set A from another convex set B is convex. a. true b. false c. false 7.4

7.5

Consider whether the following statements are true or false: a. b. c. d.

The sum of any two convex functions is convex. The difference of any two convex functions is convex. The product of any two convex functions is convex. The maximum of any two convex functions is convex.

a. b. c. d.

true false false true

Express the following optimization problem as a standard linear program in matrix notation max

x∈R2 , ξ∈R

p> x + ξ

subject to the constraints that ξ > 0, x0 6 0 and x1 6 3. The optimization target and constraints must all be linear. To make the optimization target linear, we combine x and ξ into one matrix equation:  >   p0

max

x∈R2 ,ξ∈R

x0

p1  x1  1

ξ

We can then combine all of the inequalities into one matrix inequality:      1

subject to 0 0

7.6

0 1 0

0 x0 0 0  x1  − 3 6 0 −1 ξ 0

Consider the linear program illustrated in Figure 7.9,  >   min −

x∈R2

5 3

x1 x2

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

537

Exercises 

2

2  subject to  −2 0 0

2 33   8 −4    x1   1  x2 6  5  −1  −1 1 8

Derive the dual linear program using Lagrange duality. Write down the Lagrangian:    2  2  >  5  L(x, λ) = − x + λ>  −2 3  0 0

Rearrange and factorize x: 



2 33  8  −4      1  x −  5   −1 −1 1 8



33 2 8  −4    >   1  x − λ  5  −1 −1 1 8

2 2   >   5 > L(x, λ) =  − 3 + λ −2 0  0

Differentiate with respect to x and set to zero:  2 2  >  5 + λ>  ∇x L = − −2 3 0 0

2 −4  1 =0 −1 1

Then substitute back into the Lagrangian to obtain the dual Lagrangian:   33

8   > D(λ) = −λ  5   −1 8

The dual optimization problem is therefore   33

8    maxm − λ>  5 λ∈R −1 8

2 2  >  5 subject to − + λ>  −2 3 0 0

2 −4  1 =0 −1 1

λ > 0.

c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

538 7.7

Continuous Optimization Consider the quadratic program illustrated in Figure 7.4,  >      >   min

2 1

1 x1 x2

x∈R2 2

1 −1  subject to  0 0

1 4

5 x1 + x2 3

x1 x2

 

0 1     0  x1 6 1 1 1  x2 −1 1

Derive the dual quadratic program duality.  using Lagrange    

2 Let Q = 1

1    −1 1 5 ,c = ,A =  0 4 3 0

0 1 1  0  , and b =  . Then by (7.45) 1  1  −1 1

and (7.52) the dual optimization problem is  >  >  1    −1 1 5  + max −  2 3 0 λ∈R4 0

0   0 2  λ  1   1 −1

1 −1   −1 1  5  +  4 0  3 0

> 

subject to λ > 0. which expands to 

>

4 33  −4 −35 1  >   max − 88 +   1  λ + λ −1 14  λ∈R4 1 −3

−4 4 1 −1

 

−1 1 2 −2

1  −1  λ  −2  2

 

subject to λ > 0.

Alternative derivation Write down the Lagrangian: 



L(x, λ) =

 >



1 > 2 x 2 1

1 5 x+ 4 3

1   −1  x + λ>   0 0

Differentiate with respect to x and set to zero:  1    >  2 1 5 −1 ∇ x L = x> + + λ>   1

4

3

0 0

0 1 1 0  x −   1 1  −1 1

0 0 =0 1  −1

Solve for x: 

 >

 5

x> = −  

3

1  > −1 +λ  0 0



0   0  2 1  1 −1

1 4

 

0 1    0 > 1  λ −λ   1  1  −1 1

−1

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

539

Exercises Then substitute back to get the dual Lagrangian:  1    >  5 1 > 2 1 > −1 D(λ) = x x+ x + λ  2 1 4 3 0 0

7.8

 

0 1 1 0  x −   1 1  −1 1

Consider the following convex optimization problem 1 > w w 2

min

w∈RD

w> x > 1 .

subject to

Derive the Lagrangian dual by introducing the Lagrange multiplier λ. First we express the convex optimization problem in standard form, 1

min

w∈RD 2

w> w

subject to 1 − w> x 6 0 .

By introducing a Lagrange multiplier λ > 0, we obtain the following Lagrangian L(w) =

1 > w w + λ(1 − w> x) 2

Taking the gradient of the Lagrangian with respect to w gives dL(w) = w> − λx> . dw

Setting the gradient to zero and solving for w gives w = λx .

Substituting back into L(w) gives the dual Lagrangian λ2 > x x + λ − λ2 x> x 2 λ2 = − x> x + λ . 2

D(λ) =

Therefore the dual optimization problem is given by max − λ∈R

λ2 > x x+λ 2

subject to λ > 0 .

7.9

Consider the negative entropy of x ∈ RD , f (x) =

D X

xd log xd .

d=1

Derive the convex conjugate function f ∗ (s), by assuming the standard dot c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

540

Continuous Optimization product. Hint: Take the gradient of an appropriate function and set the gradient to zero. From the definition of the Legendre Fenchel conjugate D X

f ∗ (s) = sup

x∈RD d=1

sd xd − xd log xd .

Define a function (for notational convenience) g(x) =

D X

sd xd − xd log xd .

d=1

The gradient of g(x) with respect to xd is dg(x) x = sd − d − log xd dxd xd = sd − 1 − log xd .

Setting the gradient to zero gives xd = exp(sd − 1) .

Substituting the optimum value of xd back into f ∗ (s) gives f ∗ (s) =

D X

= sd exp(sd − 1) − (sd − 1) exp(sd − 1)

d=1

= exp(sd − 1) .

7.10 Consider the function f (x) =

1 > x Ax + b> x + c , 2

where A is strictly positive definite, which means that it is invertible. Derive the convex conjugate of f (x). Hint: Take the gradient of an appropriate function and set the gradient to zero. From the definition of the Legendre Fenchel transform, f ∗ (s) = sup s> x − x∈RD

1 > x Ax − b> x − c . 2

Define a function (for notational convenience) g(x) = s> x −

1 > x Ax − b> x − c . 2

The gradient of g(x) with respect to x is dg(x) = s > − x> A − b > . dx

Setting the gradient to zero gives (note that A> = A) sA> x − b = 0 A> x = s − b x = A−1 (s − b) .

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

541

Exercises Substituting the optimum value of x back into f ∗ (s) gives f ∗ (s) = s> A−1 (s − b) − =

1 (s − b)A−1 (s − b) − b> A−1 (s − b) − c 2

1 (s − b)> A−1 (s − b) − c . 2

7.11 The hinge loss (which is the loss used by the support vector machine) is given by L(α) = max{0, 1 − α} ,

If we are interested in applying gradient methods such as L-BFGS, and do not want to resort to subgradient methods, we need to smooth the kink in the hinge loss. Compute the convex conjugate of the hinge loss L∗ (β) where β is the dual variable. Add a `2 proximal term, and compute the conjugate of the resulting function L∗ (β) +

γ 2 β , 2

where γ is a given hyperparameter. Recall that the hinge loss is given by L(α) = max{0, 1 − α}

The convex conjugate of L(α) is L∗ (β) = sup {αβ − max{0, 1 − α}} α∈R

( =

− 1 6 β 6 0,

β

if

otherwise

The smoothed conjugate is L∗γ (β) = L∗ (β) +

γ 2 β . 2

The corresponding primal smooth hinge loss is given by o n γ Lγ (α) =

sup

αβ − β −

−16β60

=

  1 − α 2−

γ 2

(α−1)  2γ

0

2

β2

if

α < 1 − γ,

if

1 − γ 6 α 6 1,

if

α > 1.

Lγ (α) is convex and differentiable with the derivative L0γ (α) =

  −1

α−1  γ

0

if

α < 1 − γ,

if

1 − γ 6 α 6 1,

if

α > 1.

c

2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.