Probabilistic Graphical Models Lecture Notes (CMU 10-708)


10-708: Probabilistic Graphical Models 10-708, Spring 2014

1 : Introduction Lecturer: Eric P. Xing

Scribes: Yuxing Zhang, Tianshu Ren

1 Logistics

There is no exam in this course, and the grade breakdown is as follows:

• Homework assignments (40%)
  – There will be 5 assignments in total
• Scribe duties (10%)
  – Notes must be submitted within one week after the lecture; people should sign up on the spreadsheet first
• Short reading summary (10%)
  – Should be handed to the TA on paper at the beginning of each lecture
• Final project (40%)
  – Could be theoretical or algorithmic research, or an application of PGM algorithms to real problems
  – Group project with team size 3
  – Final presentation will be in class, oral, around 30 minutes

2 Abstract Definition of Graphical Models

A graphical model is characterized by the following three components:

• Graph: G is a graphical representation that can be used to describe the relationship between variables. In the context of graphical models, it characterizes the dependencies/correlations between random variables.

• Model: MG provides a mathematical description of the graph. It is the basis for quantitative analysis of a graphical model.

• Data: D ≡ {X1^(i), X2^(i), ..., Xm^(i)}_{i=1}^{N} are our observations of the variables in the graph. Specifically, Xj^(k) corresponds to the kth observation of random variable Xj.

A graphical model MG, where G is a graph, connects the data to the hypothesis and the instantiations to the random variables, and it compactly encodes the dependencies among the random variables, especially in high-dimensional problems with many random variables.


3 Fundamental Questions

3.1 Representation

• How to describe the problem in a proper way?
• How to encode domain knowledge into a graph? (good/bad hypothesis)

3.2 Inference

• How to answer queries (especially about some hidden variables) using the model and the data?
• The queries are not deterministic; rather, we perform probabilistic inference, e.g. asking for the probability P(X|D).

3.3 Learning

• How to obtain the “best” model from the data? Normally, we will assume a tree structure for the graph, then maximize a score function to obtain the parameters of the model, such as

M̂ = arg max_{M∈M} F(D; M)

4 A Heuristic Example of Graphical Model

Consider a scenario where we want to specify the probability distribution of 8 random variables. Without any prior knowledge, we can specify the joint pdf by a table with 2^8 row entries, and that is all we can do. This is not only expensive, but also almost impossible because we may not have past observations for all 2^8 configurations. However, sometimes we have prior knowledge about the relationship between the random variables. For instance, the 8 random variables may be the states of parts of a biological system and we may have prior knowledge about their relationships, e.g. X1 and X7 may be biologically separated so that they cannot directly influence each other. These dependency relationships can be characterized by a directed graph where random variable X depends on Y if Y is its parent. Figure 1 shows one such scenario.

Figure 1: Dependencies among variables


Note that with the known dependencies and the graph, we can easily use the factorization rule to simplify our description of the joint distribution. Specifically in this case, P(X1, X2, ..., X8) = P(X1)P(X2)P(X3|X1)P(X4|X2)P(X5|X2)P(X6|X3, X4)P(X7|X6)P(X8|X5, X6). For this joint distribution, we only need to specify probabilities for 18 cases, i.e. P(X1 = 0), P(X2 = 0), P(X3 = 0|X1 = 0), P(X3 = 0|X1 = 1), . . . , etc. This number is significantly smaller than the 2^8 cases required in a naive specification of the joint distribution.
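To make the parameter counting concrete, here is a minimal Python sketch of this factorized joint for binary variables. The 18 CPT values below are hypothetical, chosen only for illustration; storing them replaces the 2^8-entry table, and the factored product still sums to one.

```python
# A minimal sketch (hypothetical CPT values) of the factorized joint
# P(X1,...,X8) = P(X1)P(X2)P(X3|X1)P(X4|X2)P(X5|X2)P(X6|X3,X4)P(X7|X6)P(X8|X5,X6)
# for binary variables. Only 18 numbers are stored instead of a 2^8-entry table.

import itertools

# Each CPT maps a parent configuration to P(child = 1 | parents).
p_x1 = 0.6                                   # P(X1 = 1)
p_x2 = 0.3                                   # P(X2 = 1)
p_x3 = {0: 0.2, 1: 0.7}                      # P(X3 = 1 | X1)
p_x4 = {0: 0.5, 1: 0.1}                      # P(X4 = 1 | X2)
p_x5 = {0: 0.4, 1: 0.8}                      # P(X5 = 1 | X2)
p_x6 = {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.3, (1, 1): 0.9}   # P(X6 = 1 | X3, X4)
p_x7 = {0: 0.25, 1: 0.75}                    # P(X7 = 1 | X6)
p_x8 = {(0, 0): 0.2, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.95}  # P(X8 = 1 | X5, X6)

def bern(p, x):
    """Probability of a binary outcome x under success probability p."""
    return p if x == 1 else 1.0 - p

def joint(x):
    """Evaluate the factorized joint at one configuration x = (x1, ..., x8)."""
    x1, x2, x3, x4, x5, x6, x7, x8 = x
    return (bern(p_x1, x1) * bern(p_x2, x2) *
            bern(p_x3[x1], x3) * bern(p_x4[x2], x4) * bern(p_x5[x2], x5) *
            bern(p_x6[(x3, x4)], x6) * bern(p_x7[x6], x7) *
            bern(p_x8[(x5, x6)], x8))

# Sanity check: the 2^8 configurations still sum to 1.
print(sum(joint(x) for x in itertools.product([0, 1], repeat=8)))
```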

5 Why Having a Graph

From the example above, there are some clear advantages of using graphical models over a table.

5.1 Simpler joint distribution

As we have discussed above, even for only 8 binary random variables, the table for the joint distribution is huge (exponential in the number of random variables), and it would be way too expensive to do that for hundreds of random variables. Besides, using a table to represent the joint distribution does not provide any insight for the problem, nor does it use any domain knowledge in the representation. This is one of the reasons why people use graphical models to represent joint distributions.

Figure 2: Graphical model leads to simpler joint distribution

5.2 Easier learning and inference

For a table representation of a joint distribution, it is often impossible to learn the probabilities of every outcome from the data, since many instantiations of the random variables don’t even show up with limited amount of data, especially for some combinations with extremely low probability. Then there is also a problem with doing inference, because if we have a giant table and we want to ask about some marginal distributions, we need to sum up a subset of all the rows in the table to compute the probability, which is expensive when the number of random variables is large.

5.3 Enables integration of domain knowledge

In some cases, we assume no prior knowledge of the data, but incorporating domain knowledge lets us make more reasonable assumptions and simplifies learning and inference. Graphical models provide a platform for inter-disciplinary communication and a simple way to integrate domain knowledge.

5.4 Enables data integration

Sometimes data come from different sources or are of different types. It is generally hard to combine such data in a reasonable way, but graphical models make it possible and easy: we specify the dependencies and join the distributions together by taking the product of their individual distributions. For instance, we can combine text, image and network data into holistic social media data using graphical models.

5.5 Enables rational statistical inference

With graphical models, we can do Bayesian learning and inference by adding a prior to the model parameters, which captures the uncertainty about the model itself. Computing the posterior distribution of the parameters and taking the posterior mean gives us a Bayes estimator.

So informally, a probabilistic graphical model is a smart way to write, specify, compose and design exponentially-large probability distributions without paying an exponential cost, and at the same time endow the distributions with structured semantics. Formally speaking, it refers to a family of distributions on a set of random variables that are compatible with all the probabilistic independence propositions encoded by a graph that connects these variables.

6 Types of Graphical Model

6.1 Bayesian Network (Directed Probabilistic Graphical Models)

Figure 3: An example of a Bayesian Network

The structure of a Bayesian Network is that of a directed acyclic graph (DAG). Such a representation describes causality relationships between variables and facilitates a generative process for the random variables in the graph. Figure 3 illustrates one such graph. One important observation is that X is conditionally independent of the orange random variables given the green random variables. The green random variables are called the Markov blanket of X. Note that X and Y1, Y2 should form a well-defined conditional probability distribution such that P(x|y1, y2) ≥ 0 for all x, y1, y2 and Σ_x P(x|y1, y2) = 1. To write out the joint distribution of random variables represented by a Bayesian network, we usually use the factorization rule, e.g. for Figure 1:

P(X1, X2, ..., X8) = P(X1)P(X2)P(X3|X1)P(X4|X2)P(X5|X2)P(X6|X3, X4)P(X7|X6)P(X8|X5, X6)

6.2 Markov Random Fields (Undirected Probabilistic Graphical Models)

Figure 4: An example of a Markov Random Field

The structure of a Markov Random Field is that of an undirected graph. Such a representation describes correlations between variables, but not an explicit way to generate samples. Figure 4 illustrates one such graph. Here, a node (X) is conditionally independent of every other node (orange) given its direct neighbors (green). The joint distribution is completely determined by local contingency functions (potentials) and cliques. In contrast, the joint distribution of a Bayesian network is determined by local conditional distributions and a DAG. The joint distribution is thus factored as, e.g. for Figure 1 (replacing directed arrows with undirected lines):

P(X1, X2, ..., X8) = (1/Z) exp{E(X1) + E(X2) + E(X3, X1) + E(X4, X2) + E(X5, X2) + E(X6, X3, X4) + E(X7, X6) + E(X8, X5, X6)}

6.3 Equivalence Theorem

For a graph G, let D1 denote the family of all distributions that satisfy I(G), and let D2 denote the family of all distributions that factor according to G; then D1 ≡ D2.


6.4 Relations to ML tasks

• Density estimation (parametric and nonparametric methods)

Figure 5: A graphical model representation of the density estimation problem.

X is the observed variable generated from an unknown distribution with parameters m and s. The goal is to estimate m and s from X.

• Regression (linear, conditional mixture, nonparametric)

Figure 6: A graphical model representation of the regression problem.

X and Y are two random variables. The goal is to estimate Y from X.

• Classification (generative and discriminative approaches)

Figure 7: A graphical model representation of the classification problem.

The left of Figure 7 represents the classification problem where the model is generative. Q is the class variable that determines the generation of X. The goal is to estimate the Q from which X is generated. The right of Figure 7 represents the classification problem where the model is discriminative. The goal is to assign each observation of X a class Q.


• Clustering: similar to classification, but we do not observe Q in the training set.

7 Summary

Probabilistic graphical model is not a model but a language for communication, computation and development. It provides the glue whereby the parts are combined, ensuring that the system as a whole is consistent, and providing ways to interface models to data. The graph theoretic side of graphical models provides both an intuitively appealing interface by which humans can model highly-interacting sets of variables as well as a data structure that lends itself naturally to the design of efficient general-purpose algorithms. Many of the classical multivariate probabilistic systems studied in fields such as statistics, systems engineering, information theory, pattern recognition and statistical mechanics are special cases of the general graphical model formalism. So the graphical model framework provides a way to view all of these systems as instances of a common underlying formalism.

10-708: Probabilistic Graphical Models 10-708, Spring 2016

2 : Directed GMs: Bayesian Networks Lecturer: Eric P. Xing

Scribes: Lidan Mu, Lanxiao Xu

1 Notation

The notation used in this course is defined as follows:

Variable, value and index: Variables are names representing some quantities, denoted by upper-case letters such as V, S; the value of a variable is a realization of that variable, usually denoted by lower-case letters such as v, s. An index corresponds to conditions used to distinguish between different scenarios, and can be a subscript or a superscript, such as Vi and vi.

Random variable: Random variables are usually represented by upper-case letters such as X, Y, Z.

Random vector and matrix: Random vectors are multivariate distributions represented by vectors, usually denoted by bold upper-case letters such as X = (X1, ..., Xn)^T. Similarly, random matrices are denoted by bold upper-case letters such as X = [Xij], the n × n matrix with entries X11, X12, ..., Xnn.

Parameters: Parameters are in general represented by Greek letters such as α, β.

2 Example: The Dishonest Casino

Suppose that you are in a casino. The casino has two dice: a fair die with

P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6,

and a loaded die with

P(1) = P(2) = P(3) = P(4) = P(5) = 1/10,  P(6) = 1/2.

The casino player switches back and forth between the fair and loaded die once every 20 turns.

You first bet 1 dollar and roll with the fair die. The casino player also rolls a die, but it may be the fair die or the loaded die. The one with the highest number wins 2 dollars from the game. Given a sequence X of rolls by the casino player, there are a few questions that we are interested in:

Evaluation problem: How likely is this sequence, given our model of how the casino works? This should be given by P(X|θ), where θ represents the parameters of our model.


Decoding problem: What portion of the sequence was generated with the fair die, and what portion with the loaded die?

Learning problem: How “loaded” is the loaded die? How “fair” is the fair die? How often does the casino player change from fair to loaded, and back? The model we learn should give us this information.

We can actually convert this casino game scenario into a graphical model problem using knowledge engineering. First we need to pick the variables, which can be observed or hidden, discrete or continuous. Denote the result of the ith roll as Xi and the indicator of which die is used as Yi. Notice that the Xi's are observed during the game while the values of the Yi's are hidden. Then we pick the structure to represent the relationship between the random variables we just chose. Since the probability of choosing a loaded die depends on the previous choice, we can use a so-called hidden Markov model. Finally we need to pick the probability distributions in the model. As we can easily see, the probability of seeing a particular result depends on the die we pick, and thus the model is shown in Figure 1.

Figure 1: A hidden Markov model representation of the casino game.

Given a sequence X = X1 · · · XT and a parse Y = Y1 · · · YT, the graphical model helps us to answer the questions we previously mentioned. If we want to find out how likely the parse is, given our hidden Markov model and the observed sequence, we can compute the joint probability (also called the complete probability) as

p(X, Y) = p(X1, · · · , XT, Y1, · · · , YT)
        = p(Y1) p(X1|Y1) p(Y2|Y1) p(X2|Y2) · · · p(YT|Y_{T−1}) p(XT|YT)
        = p(Y1) p(Y2|Y1) · · · p(YT|Y_{T−1}) × p(X1|Y1) · · · p(XT|YT)
        = p(Y1, · · · , YT) p(X1, · · · , XT | Y1, · · · , YT).

We can also get the marginal probability

p(X) = Σ_Y p(X, Y) = Σ_{Y1} Σ_{Y2} · · · Σ_{YT} π_{Y1} ∏_{t=2}^{T} a_{Y_{t−1}, Y_t} ∏_{t=1}^{T} p(Xt | Yt),

but it takes exponential time to do this summation. We will be talking about how to do this efficiently in future chapters.
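For small T, this marginal can be checked by brute force. The following is a minimal Python sketch; the emission probabilities follow the casino description, while the initial distribution and the 0.95/0.05 transition probabilities are hypothetical values chosen to match “switches once every 20 turns”. The loop over all 2^T parses makes the exponential cost explicit.

```python
# A minimal sketch of the brute-force marginal p(X) = sum_Y p(X, Y) for the
# casino HMM, with partly hypothetical parameters. The loop over all 2^T hidden
# sequences makes the exponential cost explicit.

import itertools

states = ["fair", "loaded"]
pi = {"fair": 0.5, "loaded": 0.5}                        # initial distribution pi_{Y1}
a = {("fair", "fair"): 0.95, ("fair", "loaded"): 0.05,   # transitions a_{Y_{t-1}, Y_t}
     ("loaded", "fair"): 0.05, ("loaded", "loaded"): 0.95}
emit = {"fair":   {k: 1 / 6 for k in range(1, 7)},
        "loaded": {**{k: 0.1 for k in range(1, 6)}, 6: 0.5}}   # p(X_t | Y_t)

def joint(x, y):
    """Complete probability p(X = x, Y = y) for one hidden parse y."""
    p = pi[y[0]] * emit[y[0]][x[0]]
    for t in range(1, len(x)):
        p *= a[(y[t - 1], y[t])] * emit[y[t]][x[t]]
    return p

def marginal(x):
    """p(X = x) by summing over all |states|^T parses -- exponential in T."""
    return sum(joint(x, y) for y in itertools.product(states, repeat=len(x)))

print(marginal([6, 6, 6, 1, 2, 6]))
```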

3 Bayesian Network

A Bayesian network (BN) is a directed graph whose nodes represent the random variables and whose edges represent the direct influence of one variable on another. When a node is connected to another node with a directed arrow such as X → Y, it means that Y is caused by X.


It is a data structure that provides the skeleton for representing a joint distribution compactly in a factorized way and also offers a compact representation for a set of conditional independence assumptions about a distribution. We can view the graph as encoding a generative sampling process executed by nature where the value for each variable is selected by nature using a distribution that depends only on its parents. In other words, each variable is a stochastic function of its parents.

3.1 Factorization Theorem

Theorem: Given a DAG, the most general form of the probability distribution that is consistent with the graph factors according to “node given its parents”:

P(X) = ∏_{i=1}^{d} P(Xi | Xπi),

where Xπi is the set of parents of Xi and d is the number of nodes (random variables) in the graph.

Figure 2: An example of the factorization theorem.

For example, the following joint probability is derived from Figure 2:

P(X1, X2, X3, X4, X5, X6, X7, X8) = P(X1)P(X2)P(X3|X1)P(X4|X2)P(X5|X2)P(X6|X3, X4)P(X7|X6)P(X8|X5, X6)

3.2 Local Structures and Independencies

There are three types of local structures in graphical models (Figure 3).

3.2.1 Common parent

In this structure, two nodes A and C share the same parent B. Fixing B decouples A and C, that is, A and C are independent given B.


This can be justified by

P(A, C|B) = P(A, B, C) / P(B) = P(B)P(A|B)P(C|B) / P(B) = P(A|B)P(C|B)

3.2.2 Cascade

In this structure, node A has an edge to node B, which has an edge to node C. Fixing B again decouples A and C, that is, A and C are independent given B. We can justify this by

P(A, C|B) = P(A, B, C) / P(B) = P(A)P(B|A)P(C|B) / P(B) = P(A, B)P(C|B) / P(B) = P(A|B)P(C|B)

3.2.3 V-structure

In this structure, node C has two parents A and B. Knowing C would couple A and B, meaning that A and B are originally independent if we don’t know C. This can be justified by thinking of a real world example. Let A denote the fact that the clock in the classroom is 5 minutes fast, B represent the statement that there is a traffic jam on Highland Park Bridge, and C denote the observation that Eric is late for class. Apparently having a traffic jam is independent of any problem with the clock. However, if we know that Eric comes to class late, knowing that there is no traffic jam means a higher probability of the clock being fast. Therefore, the two events now become dependent.

Figure 3: Three types of local structures.

3.3 I-maps

Definition: Let P be a distribution over X. We define I(P) to be the set of independence assertions of the form X ⊥ Y | Z that hold in P (however we set the parameter values).

Definition: Let K be any graph object associated with a set of independencies I(K). We say that K is an I-map for a set of independencies I if I(K) ⊆ I. We then say that G is an I-map for P if G is an I-map for I(P), where I(G) is the set of independencies associated with G.

3.3.1 Facts about I-map

For G to be an I-map of P, any independence that G asserts must also hold in P. Conversely, P may have additional independencies that are not reflected in G.

3.3.2 Local Markov assumptions

Given a Bayesian network structure G whose nodes represent random variables X1, · · · , Xn:

Definition: Let Pa_{Xi} denote the parents of Xi in G, and NonDescendants_{Xi} denote the variables in the graph that are not descendants of Xi. Then G encodes the following set of local conditional independence assumptions Il(G):

Il(G) : {Xi ⊥ NonDescendants_{Xi} | Pa_{Xi} : ∀i}.

In other words, each node Xi is independent of its non-descendants given its parents.

3.4 Global Markov assumptions

The global Markov assumptions are related to the concept of D-separation (D stands for Directed edges). Let X, Y, Z be three sets of nodes in G. We say that X and Y are d-separated given Z when Z blocks them from each other in the graph, which implies that they are conditionally independent given Z. There are two equivalent ways to define d-separation:

(1) D-separation by Moralized Ancestral Graph

Definition: Variables x and y are D-separated (conditionally independent) given z if they are separated in the moralized ancestral graph. There are two steps to generate a moralized ancestral graph from the original BN (illustrated in Figure 4). In the first step we construct an ancestral graph from the original graph by keeping only the query nodes and their ancestors (in the case of Figure 4, x, y and z are the query nodes). In the second step we moralize the ancestral graph by creating undirected edges between nodes that are not yet connected but have at least one common child, and by making all remaining edges undirected.

Figure 4: An Illustration of Constructing Moralized Ancestral Graph


(2) D-separation by Bayes Ball Algorithm

Definition: X is D-separated from Y given Z if we cannot send a ball from any node in X to any node in Y using the “Bayes ball” algorithm, which means that there is no active trail between any node x ∈ X and y ∈ Y given Z.

Figure 5: An Illustration of the Bayes Ball Algorithm

Figure 5 shows an illustration of the Bayes ball algorithm. An undirected trail is active if a Bayes ball can travel along it without encountering a stop symbol. D-separation can be used as an approach to reveal the conditional independencies characterized by a graph. Let I(G) denote all independence properties that correspond to d-separation. In the example shown in Figure 6, the elements of I(G) are: x1 ⊥ x2

x1 ⊥ x2 | x4

x2 ⊥ x4

x2 ⊥ x4 | x1

x3 ⊥ x4 | x1

From the example above, we reach a conclusion that separation properties in the graph imply independence properties about the associated variables.
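The moralized-ancestral-graph test described above is mechanical enough to sketch in code. The following is a minimal Python sketch on a hypothetical DAG (a simple v-structure with one descendant); it keeps the query nodes and their ancestors, moralizes, drops edge directions, removes the observed nodes, and then checks connectivity.

```python
# A minimal sketch of the moralized-ancestral-graph test for d-separation.
# The DAG below is hypothetical; the procedure follows the two steps above.

from collections import deque

def ancestors(dag, nodes):
    """All nodes in `nodes` plus their ancestors in the DAG (child -> parents)."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in dag.get(stack.pop(), []):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(dag, x, y, z):
    keep = ancestors(dag, {x, y} | set(z))
    # Moralize: undirected edges between co-parents, then drop all directions.
    edges = set()
    for child, parents in dag.items():
        if child not in keep:
            continue
        ps = [p for p in parents if p in keep]
        edges.update(frozenset((child, p)) for p in ps)
        edges.update(frozenset((p, q)) for p in ps for q in ps if p != q)
    # Remove observed nodes Z and test connectivity between x and y with BFS.
    adj = {n: set() for n in keep if n not in z}
    for e in edges:
        u, v = tuple(e)
        if u not in z and v not in z:
            adj[u].add(v)
            adj[v].add(u)
    frontier, seen = deque([x]), {x}
    while frontier:
        for nb in adj[frontier.popleft()]:
            if nb == y:
                return False          # an active path survives: not d-separated
            if nb not in seen:
                seen.add(nb)
                frontier.append(nb)
    return True

# Hypothetical DAG written as child -> list of parents: x1 -> x3 <- x2, x3 -> x4.
dag = {"x1": [], "x2": [], "x3": ["x1", "x2"], "x4": ["x3"]}
print(d_separated(dag, "x1", "x2", []))        # True: marginally independent
print(d_separated(dag, "x1", "x2", ["x4"]))    # False: observing a descendant of the v-structure couples them
```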

3.5 Quantitative Specification of Probability Distributions

Figure 6

The Equivalence Theorem: For a graph G, let D1 denote the family of all distributions that satisfy I(G), and let D2 denote the family of all distributions that factor according to G,

P(X) = ∏_{i=1}^{d} P(Xi | Xπi).

Then D1 ≡ D2.

Then D1 ≡ D2 . For the graph to be useful, any conditional independence properties we can derive from the graph should hold for the probability distribution that the graph represents. According to the equivalence theorem, for one distribution, we only need to test whether it can be factored according to G instead of testing every independence conditions in I(G) against it. For a Bayesian network (G, P ), where P factorizes over G, we only need to specify P as a set of conditional probability tables (CPTs) for discrete random variables or a set of conditional probability density functions (CPDs) for continuous random variables.

3.6 Soundness and Completeness

D-separation is sound and “complete” w.r.t. the BN factorization law.

The soundness property states that: if a distribution P factorizes according to G, then I(G) ⊆ I(P). That is, the graph G is an I-map for I(P) and cannot generate more independencies than those implied by the distribution.

The completeness property states that: for any distribution P that factorizes over G, if (X ⊥ Y | Z) ∈ I(P), then d-sep_G(X; Y | Z).

And naturally, we need to ask about the validity of the contrapositive of the completeness statement: if X and Y are not d-separated given Z in G, then are X and Y dependent in all distributions P that factorize over G? The answer is no. Even if a distribution factorizes over G, it can still contain additional independencies that are not reflected in the structure. For example, consider a graph with two nodes A and B and a directed edge from A to B. The graph suggests that A and B are dependent. However, there exists a distribution P(A, B), shown in Table 1, which satisfies P(A, B) = P(A)P(B), so this distribution implies that A and B are not dependent.

       b0     b1
a0     0.08   0.32
a1     0.12   0.48

Table 1

Theorem: Let G be a BN graph. If X and Y are not d-separated given Z in G, then X and Y are dependent in some distribution P that factorizes over G.

Theorem: For almost all distributions P that factorize over G, i.e., for all distributions except for a set of “measure zero” in the space of CPD parametrizations, we have that I(P) = I(G).

3.7 I-equivalence

It should be noted that very different BN graphs can actually be equivalent, in that they encode precisely the same set of conditional independence assertions. As shown in Figure 7, all of the BNs encode the same conditional independence: X ⊥ Y | Z.

Figure 7

Definition: Two BN graphs G1 and G2 over X are I-equivalent if I(G1) = I(G2).


Recalling the previous discussions about I-maps, we can see that a complete graph is a (trivial) I-map for any distribution, yet it does not reveal any of the independence structure in the distribution. Naturally, we want to find an I-map of I(P) that reveals as many independence relationships as possible while still being a subset of I(P).

Definition: A graph object G is a minimal I-map for a set of independencies I if it is an I-map for I, and if the removal of even a single edge from G renders it not an I-map.

Note: A minimal I-map is not unique; there can exist multiple minimal I-maps for a single set of independencies I.

4 Summary

• Definition: A Bayesian network is a pair (G, P) where P factorizes over G, and where P is specified as a set of local conditional probability distributions (CPDs) associated with G's nodes.
• A BN captures “causality”, “generative schemes”, “asymmetric influences”, etc., between entities.
• Local and global independence properties are identifiable via the d-separation criterion (Bayes ball).
• Computing the joint likelihood amounts to multiplying CPDs, but computing marginals and conducting inference can be hard.
• True: Graphical Models require a localist semantics for the nodes.
• False: Graphical Models require a causal semantics for the edges.
• False: Graphical Models are necessarily Bayesian.

10-708: Probabilistic Graphical Models 10-708, Spring 2016

3 : Representation of Undirected GM Lecturer: Eric P. Xing

Scribes: Longqi Cai, Man-Chia Chang

1 MRF vs BN

There are two types of graphical models: one is the Bayesian Network, which uses directed edges to model causality relationships with a Directed Acyclic Graph (DAG); the other is the Markov Random Field (MRF), which uses undirected edges to model correlations between random variables. There are two differences between these two models.

Factorization rule: A Bayesian Network (BN) uses the chain rule, with each local conditional distribution represented as a factor:

P(X1, X2, X3, X4, X5, X6, X7, X8) = P(X1)P(X2)P(X3|X1)P(X4|X2)P(X5|X2)P(X6|X3, X4)P(X7|X6)P(X8|X5, X6)

A Markov Random Field (MRF) uses the exponential of sums of energy functions, rather than simply a product of factors:

P(X1, X2, X3, X4, X5, X6, X7, X8) = (1/Z) exp{E(X1) + E(X2) + E(X3, X1) + E(X4, X2) + E(X5, X2) + E(X6, X3, X4) + E(X7, X6) + E(X8, X5, X6)}

Partition Function: Unlike in a BN, each term of an MRF does not have a direct probabilistic meaning, so the sum of the exponential terms is not guaranteed to be one; a partition function Z, which serves as a normalization factor, is needed to make this a valid probability distribution.

2 Independence in Graphical Model

2.1 I-map

An independence map (I-map) is defined by the independence properties encoded in the graph. Let I(G) be the set of local independence properties encoded by a DAG G; then a DAG G is an I-map for a distribution P if I(G) ⊆ I(P). Thus, a fully connected DAG G is an I-map for any distribution, since I(G) is the empty set and is a subset of all possible I(P). A DAG G is a minimal I-map for a distribution P if it is an I-map for P, and if the removal of even a single edge from G would make it not an I-map for P.


Figure 1: Bayes net 1.

Figure 2: Bayes net 2.

2.2 P-map

A DAG G is a perfect map (P-map) for a distribution P if I(G) = I(P). Not every distribution has a P-map as a DAG. This can be proved by a counterexample. Example: Suppose we have four random variables A, B, C, and D, and a set of conditional independence properties I = {A ⊥ C|{B, D} and B ⊥ D|{A, C}}. These conditional independence properties cannot be represented by any DAG. As shown in Figure 1, that Bayes net can represent A ⊥ C|{B, D} and B ⊥ D|A but cannot represent B ⊥ D|C. As for Figure 2, that Bayes net can imply A ⊥ C|{B, D}, but cannot imply B ⊥ D|{A, C}. In this example, we can see that neither DAG can represent the conditional independencies, which proves that not every distribution has a perfect map as a DAG. Also, the fact that a graph G is a minimal I-map for a distribution P does not guarantee that G captures all the independence properties of P. Nevertheless, we can find an MRF graph, shown in Figure 3, that represents all the conditional independencies in this example.

3 Applications of Undirected Graphical Model

Unlike a DAG, an undirected graphical model, also known as an MRF, illustrates pairwise relationships rather than parental relationships. We can write down the model and score specific configurations of the graph, but there is no explicit way to generate samples from an undirected graphical model. Undirected graphical models are widely used in the information retrieval and data mining realms. The grid model shown in Figure 4 is a canonical example of an undirected graph. This model can be applied in image processing or lattice physics. Each node in the grid model could represent a single pixel or an atom. Due to continuity, adjacent or nearby nodes may be related. For example, in Figure 5, it is really hard to say whether the small block at the upper-right side of the image is air or water. But if we look at the whole image, we can tell that the block is water. This is because we use the information from the nearby blocks.


Figure 3: The MRF.

Figure 4: The grid model.

Figure 5: The Scene.

Besides, undirected graphical models can also be applied in social networks or protein interactive networks.

4 Markov Random Field (MRF)

• Gibbs distribution: P(x1, ..., xn) = (1/Z) ∏_{c∈C} ψc(xc)

• Potential function: ψc(xc), where c indexes a clique.

• Partition function: Z = Σ_{x1,...,xn} ∏_{c∈C} ψc(xc)
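The following is a minimal Python sketch of these definitions on a tiny chain-structured MRF over three binary variables; the two pairwise potentials are hypothetical, and the partition function Z is computed by brute-force enumeration, which is only feasible for very small models.

```python
# A minimal sketch of a Gibbs distribution with two (hypothetical) pairwise
# clique potentials, psi_12 and psi_23. Z is computed by brute force.

import itertools

psi_12 = {(0, 0): 3.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 3.0}   # favors agreement
psi_23 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}

def unnormalized(x1, x2, x3):
    return psi_12[(x1, x2)] * psi_23[(x2, x3)]

Z = sum(unnormalized(*x) for x in itertools.product([0, 1], repeat=3))

def p(x1, x2, x3):
    """P(x1, x2, x3) = (1/Z) * prod_c psi_c(x_c)."""
    return unnormalized(x1, x2, x3) / Z

print(Z, p(1, 1, 1))
```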


Figure 6: Clique demo

Figure 7: Counter example showing that potentials can neither be viewed as marginals nor as conditionals.

4.1 Clique

• Clique: for G = {V, E}, a clique is a complete subgraph G′ = {V′ ⊆ V, E′ ⊆ E} such that the nodes in V′ are fully connected. For example, in Figure 6, {A, B, D} is a valid clique because the edges (A, B), (B, D), (D, A) connect all of its nodes. However, {A, B, C, D} is not a clique, because A and C are not connected.
• Maximal clique: a complete subgraph such that any superset V″ ⊇ V′ is not complete. For example, {A, B, D} is a maximal clique, because adding C would make the subgraph incomplete. {A, B} is not maximal, because adding D keeps the subgraph complete.
• Sub-clique: any subset of a maximal clique forms a sub-clique. The minimal sub-cliques are edges and singletons.

4.2 Clique potentials

• Potentials can neither be viewed as marginals nor as conditionals; Figure 7 is a counterexample.

P(x, y, z) = P(y)P(x|y)P(z|y) = P(x, y)P(z|y) = P(x|y)P(z, y)

Given the conditional independences revealed by Figure 7, we can factorize the joint distribution as above. However, there is no way to factorize it such that each term consistently corresponds to a potential function: in each case one term is a joint and the other is a conditional, or vice versa.

• Potentials should instead be thought of as measures of “compatibility”, “goodness”, or “happiness” over the assignment of a clique of variables. For example, if ψ(1, 1, 1) = 100 and ψ(0, 0, 0) = 1, it means the system is more compatible with (X, Y, Z) = (1, 1, 1).


Figure 8: Max clique representation of Figure 6.

Figure 9: Sub-clique representation of Figure 6.

4.3 Max clique vs sub-clique vs canonical representation

Note: here x1, x2, x3, x4 are aliases for xA, xB, xC, xD.

• Canonical representation of Figure 6: factorize over every possible clique.

P(x1, x2, x3, x4) = (1/Z) ψ124(x124) ψ234(x234) ψ12(x12) ψ14(x14) ψ23(x23) ψ24(x24) ψ34(x34) ψ1(x1) ψ2(x2) ψ3(x3) ψ4(x4)

• Max clique representation (Figure 8): factorize over max cliques.

P′(x1, x2, x3, x4) = (1/Z) ψ124(x124) ψ234(x234)

• Sub clique representation (Figure 9): factorize over sub cliques.

P″(x1, x2, x3, x4) = (1/Z) ψ12(x12) ψ14(x14) ψ23(x23) ψ24(x24) ψ34(x34)

• I-map: they represent the same graphical model, so the I-map should be the same.

I(P) = I(P′) = I(P″)

• Distribution family: the canonical form has the finest granularity, then comes the sub-clique form, and then the max-clique form, as can be seen from the number of parameters. Another interpretation is that the sub-clique form can be marginalized into the max-clique form, but cannot be factorized back, so the sub-clique form encodes richer information than the max-clique form. A similar analysis also applies to the canonical form.

D(P) ⊇ D(P″) ⊇ D(P′)


5 Independence in MRF

The independence properties of an MRF follow from its Markov properties. There are two kinds of Markov properties in an MRF: the global Markov property and the local Markov property.

5.1 Global Markov Independence

In Figure 10, it can be seen that the set XB separates all active paths from XA to XC, which is denoted sepH(A; C|B). This means every path from a node in XA to a node in XC must pass through a node in XB. A probability distribution satisfies the global Markov independencies if for any disjoint A, B, C such that B separates A and C, A is independent of C given B, which can be written as

I(H) = {A ⊥ C|B : sepH(A; C|B)}

There are two theorems about global Markov independence. Let H be an MRF and P a distribution.

Soundness: If P is a Gibbs distribution over H, then H is an I-map for P.

Completeness: If ¬sepH(X; Z|Y), then there is some P that factorizes over H such that X and Z are dependent given Y under P.

5.2 Local Markov Independence

There is a unique Markov blanket for each node in an MRF. Take Figure 11 for example. The Markov blanket of Xi ∈ V, denoted MB_{Xi}, is the set of direct neighbors of Xi, i.e. the nodes that share an edge with Xi. The local Markov independencies of the graph in Figure 11 are defined as:

Il(H) = {Xi ⊥ V − {Xi} − MB_{Xi} | MB_{Xi} : ∀i}

That is, Xi is independent of all other nodes in the graph given its direct neighbors.

5.3 Relation Between Global and Local Markov Independence

For an MRF we can also define the local pairwise Markov independencies associated with H as follows:

Ip(H) = {X ⊥ Y | V \ {X, Y} : {X, Y} ∉ E}

For example, a pairwise Markov independence in Figure 12 can be described as X1 ⊥ X5 | {X2, X3, X4}. There are several relationships between the global and local Markov properties.

• If P |= Il(H), then P |= Ip(H).
• If P |= I(H), then P |= Il(H).
• If P > 0 and P |= Ip(H), then P |= Il(H).
• The following statements are equivalent for a positive distribution P:
  P |= Il(H), P |= Ip(H), P |= I(H).

Figure 10: The global Markov properties.

Figure 11: The local Markov properties.

Figure 12: The pairwise Markov properties.

5.4 Hammersley-Clifford Theorem

We have so far represented undirected graphical models by the Gibbs distribution, which is a product factorization of potential functions. The conditional independence properties of an undirected graphical model can be described by Markov properties. The Hammersley-Clifford theorem states that a positive probability distribution can be factorized over the cliques of the undirected graph. That is, any joint probability distribution that satisfies the Markov independencies can be written as potentials over maximal cliques, and given a Gibbs distribution, its joint distribution satisfies the Markov independencies. The formal theorem is as follows:

Theorem: Let P be a positive distribution over V, and H a Markov network graph over V. If H is an I-map for P, then P is a Gibbs distribution over H.

5.5 Perfect Map

A Markov network H is a perfect map for a distribution P if for any X, Y, Z we have that sepH(X; Z|Y) ⇔ P |= (X ⊥ Z|Y). Not every distribution has a perfect map as an undirected graphical model, just as we described earlier in Section 2. A counterexample can be given: the independencies encoded in a v-structure X → Z ← Y cannot be captured by any undirected graphical model. There are some distributions that can be captured by a DAG and some that can be captured by an MRF. Figure 13 is a Venn diagram which illustrates the distributions that DAGs and MRFs can capture, provided with example graphs.

Figure 13: The distribution coverage of Bayesian networks and MRFs.

5.6 Exponential Form

Constraining clique potentials to be positive could be inconvenient, for example when the interactions between atoms can be attractive or repulsive. To address this problem, we can represent a clique potential ψc(xc) in an unconstrained form using a real-valued energy function φc(xc) with an exponential form

ψc (xc ) = exp{−φc (xc )}

The exponential form provides a nice additive property, in that we can write the distribution as:

p(x) = (1/Z) exp{ −Σ_{c∈C} φc(xc) } = (1/Z) exp{ −H(x) }

where the sum in the exponent is called the “free energy”:

H(x) = Σ_{c∈C} φc(xc)

The form of distribution p(x) has different names in different realms. In physics, this is called the ”Boltzmann distribution”, while in statistics, this is called a log-linear model.


Figure 14: Boltzmann Machine example

Figure 15: Ising model

6 Examples

6.1 Boltzmann Machine

Definition: fully connected graph with pairwise potentials on binary-valued nodes (xi ∈ {−1, +1} or {0, 1})

P(x1, x2, x3, x4) = (1/Z) exp{ Σ_{ij} φij(xi, xj) } = (1/Z) exp{ Σ_{ij} θij xi xj + Σ_i αi xi + C }

Energy function in matrix form:

H(x) = Σ_{ij} (xi − µ) Θij (xj − µ) = (x − µ)^T Θ (x − µ)

6.2 Ising model

Description: Nodes are arranged in grid and connected only to geometric neighbors (Figure 15).

P(X) = (1/Z) exp{ Σ_{i, j∈Ni} θij Xi Xj + Σ_i θi0 Xi }

Potts model: when each node has multiple states.
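The Ising model above can be sketched in a few lines of Python; the couplings θij and fields θi0 below are hypothetical (shared across edges and nodes for simplicity), and the partition function is computed by brute force over a small grid, which is only possible for tiny models.

```python
# A minimal sketch of the (unnormalized) Ising model on a small grid, under
# hypothetical couplings theta_ij and fields theta_i0. Spins take values in
# {-1, +1} and each node interacts only with its grid neighbors.

import itertools
import math

rows, cols = 2, 3
nodes = [(r, c) for r in range(rows) for c in range(cols)]
edges = [((r, c), (r, c + 1)) for r in range(rows) for c in range(cols - 1)] + \
        [((r, c), (r + 1, c)) for r in range(rows - 1) for c in range(cols)]

theta_pair = 0.5     # same coupling on every edge (hypothetical)
theta_node = 0.1     # same field on every node (hypothetical)

def score(x):
    """exp{ sum_ij theta_ij X_i X_j + sum_i theta_i0 X_i } for one spin assignment."""
    s = sum(theta_pair * x[i] * x[j] for i, j in edges)
    s += sum(theta_node * x[i] for i in nodes)
    return math.exp(s)

# Brute-force partition function over all 2^(rows*cols) spin configurations.
configs = [dict(zip(nodes, vals)) for vals in itertools.product([-1, 1], repeat=len(nodes))]
Z = sum(score(x) for x in configs)
print(Z, score({n: 1 for n in nodes}) / Z)   # probability of the all-+1 state
```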

Figure 16: Restricted Boltzmann Machine

6.3 Restricted Boltzmann Machine (RBM)

Restricted in the way that it’s a bipartite graph from hidden units to visible units.

p(x, h | θ) = exp{ Σ_i θi φi(xi) + Σ_j θj φj(hj) + Σ_{i,j} θi,j φi,j(xi, hj) − A(θ) }

6.3.1 Properties of RBM

• Factors are marginally dependent. • Factors are conditionally independent given visible nodes.

P(h|x) = ∏_i P(hi | x)

• Property of conditional independence makes it possible for iterative Gibbs sampling.
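The following is a minimal Python sketch of such a block Gibbs sampler, assuming the common binary-RBM parametrization p(x, h) ∝ exp(x^T W h + b^T x + c^T h) with random placeholder weights; it illustrates the conditional-independence structure rather than being a reference implementation.

```python
# A minimal sketch of block Gibbs sampling in a binary RBM with hypothetical
# (random placeholder) weights. Because hidden units are conditionally
# independent given the visibles (and vice versa), each layer is resampled
# in one block.

import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
b = np.zeros(n_visible)          # visible biases
c = np.zeros(n_hidden)           # hidden biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_step(x):
    # p(h_j = 1 | x) factorizes over j, so the whole hidden layer is sampled at once.
    h = rng.random(n_hidden) < sigmoid(x @ W + c)
    # Likewise p(x_i = 1 | h) factorizes over i.
    x_new = rng.random(n_visible) < sigmoid(W @ h + b)
    return x_new.astype(float), h.astype(float)

x = rng.integers(0, 2, size=n_visible).astype(float)
for _ in range(100):
    x, h = gibbs_step(x)
print(x, h)
```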

6.4 Conditional Random Field (CRF)

Figure 17 shows a transition from a Hidden Markov Model (HMM) to a CRF. First, change the directed edges into undirected ones. Second, since we do not assume independences among features, we may merge all the features together, so that when labeling Xi we take all the features into account. Unlike an HMM, a CRF is a discriminative model, as it models the posterior directly.

Pθ(y|x) = (1/Z(θ, x)) exp{ Σ_c θc fc(x, yc) }


Figure 17: Conditional Random Field

Figure 18: Structure learning

7 Structure Learning

The problem of structure learning is, given a set of independent samples, to find the best graphical model topology. The goal is to optimize the following likelihood function:

ℓ(θG, G; D) = log p̂(D | θG, G) = M Σ_i Î(xi, xπi(G)) − M Σ_i Ĥ(xi)

The problem is difficult because there are O(2^{n²}) graphs and O(n!) trees over n nodes. However, we are able to find the exact solution for an optimal tree under MLE. It is solved by the Chow-Liu algorithm. The trick is based on the fact that each node has only one parent in a tree.

10-708: Probabilistic Graphical Models 10-708, Spring 2015

4: Parameter Estimation in Fully Observed BNs Lecturer: Eric P. Xing

Scribes: Purvasha Chakravarti, Natalie Klein, Dipan Pal

1 Learning Graphical Models

Learning a graphical model involves both learning the structure of the model and estimating the parameters involved in the structure of the graph. This lecture mostly focuses on estimating the parameters of a completely observed graphical model from the available data and performing inference on them. We also discuss a few structural learning algorithms for completely observed graphical models.

Figure 1: Given a set of variables (B, E, A, C and R) and independent samples (on the left), we want to learn a structure and the parameters involved (given on the right) from the data.

The goal of learning a graphical model is to find the best (or most likely) Bayesian Network (both DAG and CPDs) given a set of independent samples (assignments of random variables). To find the best model, we first either learn the structure of the model or assume a structure given by an expert. Structural learning has multiple limitations and hence there is not too much literature on it. The second step involves learning or estimating the parameters involved in the model. While learning a model we come across different scenarios: we could have a directed or undirected completely observed graphical model, or we could have a directed or undirected partially observed (or unobserved) graphical model. Learning an undirected partially observed graphical model is an open research topic. We could also use different estimation principles in order to find the best Bayesian network that fits the data. We could use:

• Maximum likelihood estimation
• Bayesian estimation
• Maximum conditional likelihood (a lot of recent work has been done on this as it is more flexible)
• Maximal “margin” graphical models
• Maximum entropy graphical models


We generally use learning as a name for the process of estimating the parameters, and in some cases, the topology of the network, from data.

2 ML Structural Learning for completely observed GMs

In this section we discuss two “optimal” algorithms (vehicles to get to the optimum) for learning the structure of completely observed GMs that are guaranteed to return a structure maximizing the objective function (for example, the log likelihood function). That is, we want

G* ≡ arg max_G L(D, G),

where D is the data and L is the objective function to be maximized. Many heuristics used to be popular, but most of them do not guarantee attainment of optimality, interpretability, or even an explicit objective function. For example, structured Expectation-Maximization (EM), module networks, greedy structural search, deep learning via auto-encoders, gradient ascent methods, sampling methods for MLE, etc. do not provide these guarantees and hence are not “optimal”. We learn two classes of algorithms for guaranteed structural learning. These are likely to be the only known methods enjoying such a guarantee, but they only apply to certain families of graphs. The two classes are:

• The Chow-Liu algorithm, which holds only for trees, that is, for graphs where every node has only one parent. We discuss this algorithm in this lecture.
• Covariance selection and neighborhood selection, which hold for continuous or discrete pairwise Markov Random Fields (MRFs).

Learning the structure of a graphical model is a very difficult task, which is the reason there is not too much literature on it. To search for the best structure, let us first count how many structures are possible given n nodes. The number of graphs possible over n nodes is of the order O(2^{n²}). To see this, first realize that any graph over n nodes can be represented by its adjacency matrix of order n × n. Each entry in the matrix could either be a 0 or a 1, hence there are 2^{n²} options. Since it is computationally difficult to consider so many graphs, it is easier to focus only on trees. There are O(n!) trees possible over n nodes: one possibility is that every node has exactly one parent and one child, so the first generation gets n options, the second generation gets n − 1 options, and so on, giving n! orderings. It turns out that we can find an exact solution for an optimal tree (under MLE)! The first trick is to decompose the MLE score into edge-related elements. This is possible due to the product form of the likelihood function of a graphical model. As a result of this trick, every change in an edge changes the likelihood function in a systematic way. The second trick is to realize that every node has only one parent, so we only need to search for the single parent that increases the likelihood function the most. This is computationally easier than finding multiple parents. Applying these tricks, we finally use the Chow-Liu algorithm.

2.1 Information Theoretic Interpretation of ML

Taking the log likelihood function as the objective function, for any graph G, parameters θG and data D, the objective function is

ℓ(θG) = log p(D | θG, G)


Let there be M i.i.d. samples (assignments of random variables) and K random variables or nodes in the data D, such that D = {x1 , ..., xM } where xn is a vector of K values, one per node, (xn1 , ..., xnK ) for every n = 1, 2, ..., M . We denote πi (G) to be the parent node of the ith node in graph G. Hence (xi , xπi (G) ) denotes the value at node i and its parents. Then the objective function becomes,

ℓ(θG) = log p(D | θG, G)
      = log ∏_{n=1}^{M} ∏_{i=1}^{K} p(x_{n,i} | x_{n,πi(G)}, θ_{i|πi(G)})                    (by the factorization rule)
      = Σ_{i=1}^{K} Σ_{n=1}^{M} log p(x_{n,i} | x_{n,πi(G)}, θ_{i|πi(G)})                    (taking the log inside and interchanging sums)
      = M Σ_{i=1}^{K} Σ_{xi, xπi(G)} (count(xi, xπi(G)) / M) log p(xi | xπi(G), θ_{i|πi(G)})  (counting configurations; count(xi, xπi(G)) denotes the number of times (xi, xπi(G)) appears in the M samples)
      = M Σ_{i=1}^{K} Σ_{xi, xπi(G)} p̂(xi, xπi(G)) log p(xi | xπi(G), θ_{i|πi(G)})           (p̂ is the empirical distribution obtained from D)

Since we do not know the true p, we replace it by p̂, the empirical distribution. Therefore the objective function we now consider is

ℓ(θG) = log p̂(D | θG, G)
      = M Σ_{i=1}^{K} Σ_{xi, xπi(G)} p̂(xi, xπi(G)) log p̂(xi | xπi(G), θ_{i|πi(G)})
      = M Σ_{i=1}^{K} Σ_{xi, xπi(G)} p̂(xi, xπi(G)) log [ p̂(xi, xπi(G)) / (p̂(xπi(G)) p̂(xi)) · p̂(xi) ]          (by the definition of conditional probability)
      = M Σ_{i=1}^{K} Σ_{xi, xπi(G)} p̂(xi, xπi(G)) log [ p̂(xi, xπi(G)) / (p̂(xπi(G)) p̂(xi)) ] + M Σ_{i=1}^{K} Σ_{xi} p̂(xi) log p̂(xi)
      = M Σ_{i=1}^{K} Î(xi, xπi(G)) − M Σ_{i=1}^{K} Ĥ(xi),

where Î(xi, xπi(G)) = Σ_{xi, xπi(G)} p̂(xi, xπi(G)) log [ p̂(xi, xπi(G)) / (p̂(xπi(G)) p̂(xi)) ] is the decomposable mutual information score between node i and its parents, and Ĥ(xi) = −Σ_{xi} p̂(xi) log p̂(xi) is the entropy of node i.

2.2 Chow-Liu tree learning algorithm

As shown in the previous section, the objective function of structure learning for a graph can be written as

ℓ(θG) = log p̂(D | θG, G) = M Σ_{i=1}^{K} Î(xi, xπi(G)) − M Σ_{i=1}^{K} Ĥ(xi).

Since the second term does not depend on the graph and only the first term depends on the graph structure, the objective function for finding the best graph structure becomes

C(G) = M Σ_{i=1}^{K} Î(xi, xπi(G)).

The Chow-Liu algorithm can now be given in three steps:

1. For each pair of variables xi and xj:
   • Compute the empirical distribution: p̂(xi, xj) = count(xi, xj)/M, where count(xi, xj) is the number of times (xi, xj) occurs together in the M samples.
   • Compute the mutual information: Î(xi, xj) = Σ_{xi, xj} p̂(xi, xj) log [ p̂(xi, xj) / (p̂(xi) p̂(xj)) ].
   This works because in a tree every node i has exactly one parent, so πi(G) = {j} for some j, and the score decomposes into edge scores Î(xi, xj).

2. Define a graph with nodes x1, ..., xK.
   • Edge (i, j) gets weight Î(xi, xj).

3. Find the optimum Bayesian net given the edge scores.
   • For an undirected graphical model, compute the maximum weight spanning tree.
   • For a directed graphical model, after finding the maximum weight spanning tree, pick any node as root, then do a breadth-first search to define directions. That is, any node that shares an edge with the root in the maximum weight spanning tree becomes its child, and so on.

We notice that depending on the root node we choose in the last step, we can obtain equivalent trees, i.e. we observe I-equivalence between trees. An example of this can be seen in Figure 2; a code sketch of the algorithm follows the figure.

Figure 2: Given a set of variables (A, B, C, D and E) and the value of objective function C(G) the following graphs are I-equivalent.
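A minimal Python sketch of the algorithm is given below; the toy dataset is hypothetical, mutual information is computed from empirical counts, and the maximum weight spanning tree is built with Kruskal's algorithm over a union-find structure.

```python
# A minimal sketch of Chow-Liu tree learning on discrete data, assuming `data`
# is a list of length-K tuples of categorical values. Pairwise mutual
# informations are the edge weights; Kruskal's algorithm builds the tree.

import itertools
import math
from collections import Counter

def mutual_information(data, i, j):
    M = len(data)
    p_ij = Counter((row[i], row[j]) for row in data)
    p_i = Counter(row[i] for row in data)
    p_j = Counter(row[j] for row in data)
    return sum((n / M) * math.log((n / M) / ((p_i[a] / M) * (p_j[b] / M)))
               for (a, b), n in p_ij.items())

def chow_liu(data, K):
    # Score every candidate edge by empirical mutual information.
    weights = {(i, j): mutual_information(data, i, j)
               for i, j in itertools.combinations(range(K), 2)}
    parent = list(range(K))                      # union-find for Kruskal
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    tree = []
    for (i, j), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:                              # adding (i, j) keeps it a tree
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Hypothetical toy data: x0 copies x1, x2 is nearly a copy of x1, x3 is noise.
data = [(1, 1, 1, 0), (1, 1, 1, 1), (0, 0, 0, 0), (0, 0, 1, 1),
        (1, 1, 1, 0), (0, 0, 0, 1), (1, 1, 1, 1), (0, 0, 0, 0)]
print(chow_liu(data, K=4))
```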


Therefore the structure of trees can be found using the Chow-Liu algorithm but unfortunately if there are more parents the problem of structure learning becomes very hard. The following theorem formalizes it further.

Theorem: The problem of learning a BN structure with at most d parents is NP-hard for any (fixed) d ≥ 2. Most structure learning approaches use heuristics that exploit the score decomposition. Two heuristics that exploit the decomposition in different ways are greedy search through the space of node orderings and local search over graph structures.

3 ML Parameter Estimation for completely observed GMs of given structure

We assume the structure of G is known and fixed in order to do parameter estimation. The structure can be known from either an expert’s design or an intermediate outcome of iterative structure learning. The goal of parameter estimation is to estimate parameters from a data set of M independent, identically distributed (iid) training cases D = {x1 , ..., xM }. In general, each training case xn = (xn,1 , ..., xn,K ) is a vector of K values, one per node for each n = 1, ..., M. The model can be completely observed, i.e., every element in xn is known (no missing values, no hidden variables). Or the model can be partially observed, i.e., ∃ i, s.t. Xn, i is not observed.

Figure 3: The graphical model representation of density estimation.

In this lecture we only consider estimating parameters for a BN given a structure and a completely observable model. In particular, we notice that density estimation can be viewed as a single node graphical model as seen in Figure 3. Also density estimation forms the building block of general GM. The next two sections deal with estimating the parameters of different distributions given the data. We discuss some instances of exponential family distributions. We also consider the MLE and Bayesian estimates of the parameters in order to estimate the density.


4 Discrete Distributions

We first assume a parametric distribution of the data in order to perform parameter estimation. Here we review common discrete distributions and examine both MLEs and Bayesian estimators for their parameters.

4.1 Definition of discrete distributions

• Bernoulli distribution: Ber(p)
The random variable X takes values in {0, 1}, where Pr(X = 1) = p and Pr(X = 0) = 1 − p. The probability mass function can be written: p(x) = p^x (1 − p)^{1−x}

• Multinomial distribution (over indicators): Mult(1, θ)
Also called the categorical distribution, this is the generalization of the Bernoulli distribution to more than two outcomes. Let there be K total outcomes. Suppose each outcome k = 1, ..., K has probability θk of occurring in a single trial, where Σ_k θk = 1. A convenient way to represent the outcome of a single trial is as a vector X = [X1, ..., XK] where each Xk ∈ {0, 1} and Σ_k Xk = 1. In other words, element k of the vector (Xk) is 1 (with probability θk) and the rest of the elements are 0. The probability mass function can be written:

p(x) = ∏_k θk^{xk} = θ^x

A simple example is a single roll of a six-sided die; let k = 1, ..., 6 index the die face, where each face/outcome has underlying probability θk. Then the observation X = [1, 0, 0, 0, 0, 0] corresponds to face 1 occurring after a single roll. If the die is fair, θk = 1/6 for all k = 1, ..., 6.

• Multinomial distribution (over counts): Mult(N, θ)
The previous distribution could be thought of as the outcome of a single trial; we now generalize to N trials. Now we think of a vector of outcomes n = [n1, ..., nK] where each nk is the number of occurrences of outcome k, so Σ_k nk = N is the total number of trials. The probability mass function can be written:

p(n) = ( N! / (n1! n2! · · · nK!) ) ∏_k θk^{nk} = ( N! / (n1! n2! · · · nK!) ) θ^n
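As a quick illustration, this pmf can be evaluated directly from the formula; the face probabilities and counts below are hypothetical.

```python
# A minimal sketch of the Mult(N, theta) probability mass function for a die,
# p(n) = N!/(n_1! ... n_K!) * prod_k theta_k^{n_k}; the counts are hypothetical.

from math import factorial, prod

theta = [1 / 6] * 6                      # fair die
n = [2, 1, 0, 3, 1, 5]                   # counts for faces 1..6, N = 12
N = sum(n)

coef = factorial(N) // prod(factorial(nk) for nk in n)
pmf = coef * prod(t ** nk for t, nk in zip(theta, n))
print(pmf)
```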

4.2 Parameter estimation in a multinomial model using MLE

Suppose the data comprises N iid draws from Mult(1, θ), so a single observation xn is the vector xn = [x_{n,1}, ..., x_{n,K}] where x_{n,k} ∈ {0, 1} and Σ_{k=1}^{K} x_{n,k} = 1. Then the likelihood of one observation is:

L(θ | xn) = p(xn | θ) = ∏_{k=1}^{K} θk^{x_{n,k}}

Using the fact that each draw is iid, the likelihood of the entire dataset D = {x1, ..., xN} is

L(θ | x1, ..., xN) = p(x1, ..., xN | θ) = ∏_{n=1}^{N} p(xn | θ) = ∏_{n=1}^{N} ∏_{k=1}^{K} θk^{x_{n,k}} = ∏_{k=1}^{K} θk^{Σ_n x_{n,k}} = ∏_{k=1}^{K} θk^{nk}

where nk counts the number of occurrences of state k across all trials. The MLE is the estimate of θ which maximizes the likelihood (or log likelihood):

ℓ(θ | D) = log L(θ | x1, ..., xN) = Σ_k nk log θk

The multinomial model is an exponential family distribution, but it is not full rank because the parameters must satisfy the linear constraint Σ_k θk = 1. If we view ℓ(θ | D) as an objective function, we can naturally incorporate the linear constraint using a Lagrange multiplier:

ℓ̃(θ | D) = Σ_k nk log θk + λ (1 − Σ_k θk)

Differentiating with respect to θk and setting equal to zero gives the following system of equations:

∂ℓ̃/∂θk = nk/θk − λ = 0

Then since nk = λθk, we see that N = Σ_k nk = λ Σ_k θk = λ, so the MLE is θ̂_{k,MLE} = nk/N = (1/N) Σ_n x_{n,k}. This corresponds to the notion that the counts nk are sufficient statistics for the data D; in other words, the distribution depends on the data only through the counts, and therefore we can compute the MLE directly from the counts without need for the original data.
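In code, the MLE is nothing more than normalized counts; the die rolls below are hypothetical.

```python
# A minimal sketch of the multinomial MLE: the estimate of each theta_k is just
# the empirical frequency n_k / N of outcome k. The die rolls are hypothetical.

from collections import Counter

rolls = [1, 6, 3, 6, 6, 2, 4, 6, 5, 6, 6, 1]     # N = 12 observed outcomes
N = len(rolls)
counts = Counter(rolls)                          # sufficient statistics n_k
theta_mle = {k: counts[k] / N for k in range(1, 7)}
print(theta_mle)                                 # e.g. theta_6 = 6/12 = 0.5
```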

4.3 Parameter estimation in a multinomial model using Bayesian estimation

While MLE treats the parameter as a fixed, unknown constant that is estimated from the data, Bayesian estimation treats the parameter as random with a given prior distribution. The prior distribution along with the distribution of the data allows calculation of a posterior distribution, and the mean of the posterior distribution can be used as a point estimator for the parameter. In practice, this estimator will generally be similar to the MLE for large sample sizes, but for smaller samples it will blend the influence of the prior with the influence of the data. For this reason, Bayesian estimators can be helpful with small sample sizes, particularly when certain outcomes are not actually observed and it is not desirable to set the probability of those outcomes to zero. In general, if p(θ) is the prior distribution of the parameter, Bayes' rule states

p(θ | x1, ..., xN) = p(x1, ..., xN | θ) p(θ) / p(x1, ..., xN)

where p(θ | x1, ..., xN) is the posterior distribution we wish to calculate. Arbitrary choices of prior will not be tractable mathematically, so it is convenient to use special conjugate priors, so that the prior and posterior distributions are easy to work with mathematically and take similar forms. For example, if the data is multinomial (as in the last example), the parameter vector θ lives in a simplex, since Σ_k θk = 1 and θk ≥ 0 for all k. Instead of a point estimate we can create a distribution of θ over the simplex to inform our estimation; see examples in Figure 4.

Figure 4: Depiction of the model and the simplex where θ is defined.

4.3.1 The Dirichlet prior

A convenient prior is the Dirichlet distribution, Dir(α = (α1, ..., αK)):

p(θ) = ( Γ(Σ_k αk) / ∏_k Γ(αk) ) ∏_k θk^{αk−1} = C(α) ∏_k θk^{αk−1}

where Γ(·) is the gamma function; for integers, Γ(n + 1) = n!, but in general Γ(α) = ∫_0^∞ t^{α−1} e^{−t} dt. Note the gamma function also has the convenient property that Γ(t + 1) = tΓ(t).


Note that the normalization constant C(α) is given by:

1/C(α) = ∫ · · · ∫ θ1^{α1−1} · · · θK^{αK−1} dθ1 · · · dθK = ∏_k Γ(αk) / Γ(Σ_k αk)

where the integration can be done using integration by parts. With the Dirichlet prior, the posterior distribution becomes (up to proportionality constants):

p(θ | x1, ..., xN) ∝ ∏_k θk^{nk} ∏_k θk^{αk−1} = ∏_k θk^{αk+nk−1}

So the Dirichlet is a conjugate prior for the multinomial because the posterior distribution takes a form similar to the prior. This form also explains why the Dirichlet parameters are referred to as pseudo-counts: they behave similarly to how the counts derived from the data behave in the likelihood function. To find the exact closed-form posterior distribution, first note that the marginal likelihood can be written:

p(x1, ..., xN | α) = p(n | α) = ∫ p(n | θ) p(θ | α) dθ = C(α) / C(α + n)

so the posterior is

p(θ | x1, ..., xN, α) = p(n | θ) p(θ | α) / p(n | α) = C(n + α) ∏_k θk^{αk+nk−1}

Now we recognize this is Dir(n + α), again underscoring the conjugacy of the prior with the posterior. Therefore if we observe N′ samples with sufficient statistics/counts vector n′, the posterior distribution is p(θ | α, n′) = Dir(α + n′). If we then observe another N″ samples with statistics n″, the posterior becomes p(θ | α, n′, n″) = Dir(α + n′ + n″), allowing for sequential Bayesian updating as we receive new information, which is similar to online learning. To get a point estimate for θ we can take the mean of the posterior distribution:

θ̂k = ∫ θk p(θ | D) dθ = C ∫ θk ∏_j θj^{αj+nj−1} dθ = (nk + αk) / (N + |α|)

Notice that this estimate is essentially a weighted combination of the actual observed frequencies and the pseudo-counts from the prior distribution. As the number of observations N grows large, this estimate will be essentially the same as the MLE. Another quantity of interest is the posterior predictive, which in essence predicts the outcome of the event X_{N+1} given the first N observations:

p(x_{N+1} = i | x1, ..., xN, α) = ∫ θi C(n + α) ∏_k θk^{αk+nk−1} dθ = (ni + αi) / (|n| + |α|)

10

4.3.2

4: Parameter Estimation in Fully Observed BNs

Hierarchical Bayesian Models

When we specified a Dirichlet prior in the previous section, we needed to specify the parameters α = (α1 , ..., αk ), which are also called the psuedo-counts. Hierarchical Bayesian models put another prior on these parameters rather than specifying exact values. So while we have parameters θ for the likelihood p(x|θ) and parameters α for the prior p(θ|α), we could continue to add more ‘layers’ by putting a prior distribution on the α values with its own parameters known as hyperparamters. Adding layers could be done indefinitely, but typically adding more layers does not have too much influence on the final results, particularly if there is enough data. While our choice of Dirichlet prior for θ was motivated by conjugacy with the multinomial distribution, how do we choose a prior for α? One approach is to make an intelligent guess, or use a uniform or other noninformative prior, so that the prior should not have too much influence but still allows us to avoid making arbitrary parameter choices. Another approach is called empirical Bayes (or Type-II maximum likelihood). The idea rests on the following equation, which integrates θ out of the model to directly get a distribution of n given α: Z p(n|α) = p(n|θ)p(θ|α)dθ In the case of the Dirichlet prior, this function will be a gamma function. Then we can select the following estimator for α: α ˆ M LE = arg max p(n|α) α

Typically this estimation is done using some data, then applied to further data; ‘treat yesterday as a prior for today’. (Eric mentioned during class that he may post further notes for a more full treatment of this approach.)

4.3.3

The Logistic Normal Prior

While the Dirichlet prior is convenient, it has some drawbacks. In particular, it can only give rise to certain kinds of distributions of θ over the simplex, which are symmetric or concentrated in one corner, as shown in Figure 5. An alternative prior is the logistic Normal distribution (or logit Normal distribution), which is more difficult to use because it is no longer conjugate (though we will talk more later in the class about how to get around this issue). We say θ is logistic Normal if θ ∼ LNk (µ, Σ). To define thedistribution, first let γ ∼ P γi NK−1 (µ, Σ) with γK = 0. Then θi = exp γi log 1 + K−1 . One difficulty with this i=1 e distribution is the log partition function/normalization constant which takes the form C(γ) =  PK−1 γi  log 1 + i=1 e .

4: Parameter Estimation in Fully Observed BNs

Figure 5: Examples of distributions of θ over the simplex with the Dirichlet prior.

Figure 6: Plate diagram of logistic Normal prior.

11

12

4: Parameter Estimation in Fully Observed BNs

Figure 7: Examples of distributions of θ over the simplex with the logistic Normal prior.

One benefit of the logistic Normal prior is it involves a covariance structure that we can exploit. It also gives rise to different kinds of distributions on the simplex, as shown in Figure 7.

5

Continuous Distribution

Parametric distributions can also be of the continuous form. We review some continuous distributions below. 5.1

Some continuous distributions

1. Uniform Distribution: Uniform distributions are basically ”flat” distributions. Parametrized as 1 p(x) = ∀a ≤ x ≤ b (b − a) and 0 elsewhere 2. Normal Density Function: The Normal distribution is parametrized as 1 p(x) = √ exp −(x − µ)2 /2σ 2 2πσ The distribution is symmetric, and has two moments characterized by the mean µ and the variance σ.

4: Parameter Estimation in Fully Observed BNs

13

3. Multivariate Gaussian: Multivariate Gaussian distribution is simply a high dimensional Gaussian distribution. It is parametrized as p(x) =

5.2

1

1 exp {− (X − µ)T Σ−1 (X − µ)} 2 2π |Σ| n 2

1 2

Bayesian estimation of parameters for the Gaussian

We look at the following cases: 1. Known µ and unknown λ = σ12 : The conjugate prior for λ with shape a and rate (inverse scale) b. p(λ | a, b) =

1 a a−1 b λ exp −bλ Γ(a)

The conjugate prior for inverse σ is the Gamma-inverse IG(σ 2 | a, b) =

−b 1 a 2 −(a+1) b (σ ) exp 2 Γ(a) σ

2. Unknown µ and unknown σ: The conjugate prior is P (µ, σ 2 ) = P (µ|σ 2 )P (σ 2 ) = N (µ | m, σ 2 , V )IG(σ 2 | a, b) 3. Multivariate case : The conjugate prior is P (µ, Σ) = P (µ | Σ)P (Σ) = N (µ | µ0 ,

5.3

1 Σ)IW (Σ | Σ, Λ−1 0 , v0 ) κ0

Estimation of conditional densities

1. Estimation of individual conditional densities can be viewed as two-node graphical models. The parameters of the child are then estimated given a configuration of the parent. 2. The two-node models are the fundamental building blocks of general larger graphical models. However, for parameter estimation, they can be considered separately when all nodes are observed. 3. Parameter estimation can be carried out through Maximum Likelihood estimates or Bayesian estimation. Given enough data, MLE estimates are usually preferred. However, Bayesian estimation is useful in the case when few samples are available.

14

4: Parameter Estimation in Fully Observed BNs

Decomposability of the log-likelihood Under the global independence assumption, if all nodes are observed, then the log-likelihood of the network decomposes into a sum of local likelihoods. ! ! Y Y X X l(θ, D) = log p(D | θ) = log p(xn,i | xn,πi , θi ) = log p(xn,i | xn,πi , θi ) n

i

i

n

This makes the parameter estimation much easier, tractable and parallelizable. The problem basically becomes, to individually estimate the CPDs for each node separately and then combine the resulting parameters together into one model. We now illustrate this phenomenon. Consider the general form of the distribution represented by a directed acyclic graphical model. p(x |θ) = p(x1 | θ1 )p(x2 | x1 , θ2 )p(x3 | x2 , x3 , θ3 )p(x4 | x1 , x2 , x3 , θ4 ) As we saw in the equation above this, equivalently we have four independent graphical models each with just a single child and fully observed parents. 5.4

MLE for Bayesian Networks with tabular CPDs

In the case of tabular CPDs, i.e. the CPDs are in the form of a table (possibly multinomial), the MLE estimate is straight forward. The parameter we need to estimate is θijk = P (Xi = j | Xπi = k) In the case of a single parent, the table is a two dimensional one. Higher dimensional tables result in the case of multiple parents. The difficulty in estimation increases as the number of parents goes up, since we need to be able to observe enough samples for each configuration of the parents in order to have a good estimate. The counts of family (joint configurations of parents and child) serve as sufficient statistics of the distribution. nijk =

X

xjn,i xkn,πi

n

Thus the log likelihood becomes l(θ, D) = log

Y i,j,k

Maximizing under the condition

P

j

n

θijkijk =

X i,j,k

θijk = 1, we have nijk M LE θijk =P j 0 nij 0 k

nijk log θijk

4: Parameter Estimation in Fully Observed BNs

15

Thus, the MLE is simply the fraction of counts of a particular value of a child upon the total number of values the child took for a particular configuration of values of its parents. 5.5

Defining a parameter prior

Recall that the joint density can be factorized as P (X = x) =

M Y

p(xi | xπi )

i=1

Each of the terms p(xi | xπi ) is in fact a local distribution p(xki | xjπi ) = θxk i

| xjπi

Geiger and Heckerman state a set of assumptions under which the parameter priors can be defined for a large class of directed acyclic graphs. These are: 1. Complete Model Equivalence: Given a data distribution or data X, any two complete DAG model which describe X, describe the same set of joint probability distributions. 2. Global Parameter Independence: For every DAG model, we have p(θm | G) =

M Y

p(θi | G)

i=1

The equation basically shows that the priors of every node are independent. 3. Local Parameter Independence: For every DAG node, we have p(θi | G) =

qi Y

p(θxk i

| xjπi

| G)

j=1

The equation basically shows that, for every configuration of the parents of a child, the priors on the child’s resultant distribution are independent. 4. Likelihood and Prior Modularity: For every two DAG models of X with the same structure (same parents for a given node in both models), then the local distributions of every node and the prior distributions for every parameter are the same. 5.6

Parameter Sharing

To illustrate parameter sharing, we consider a stationary (time-invariant) first-order Markov model defined by two parameters. First, the initial state probability πk , and second, the

16

4: Parameter Estimation in Fully Observed BNs

i state transition probabilities parameterized by Aij = p(Xtj = 1 | Xt−1 = 1). The parameter A is shared by all future states. Q The joint distribution becomes p(X | θ) = p(x1 | π) Tt=2 P (Xt | Xt−1 ) P P P Whereas the log-likelihood is l(θ, D) = n log p(xn,1 | π) + n Tt=2 log p(xn,t | xn,t−1 )

With optimize each parameter separately due to local independence. π is estimated easily using techniques described before, since it is simply a multinomial frequency vector. We now P discuss the estimation of A. We have the constraint that j Aij = 1, each row of A is a multinomial distribution. Thus, the MLE of A becomes

LE AM ij

#(i → j) = = #(i → ·)

P PT

j i t=2 xn,t−1 xn,j P PT i t=2 xn,t−1 n n

10-708: Probabilistic Graphical Models 10-708, Spring 2016

5 : Exponential Family and Generalized Linear Models Lecturer: Matthew Gormley

1

Scribes: Yuan Li, Yichong Xu, Silun Wang

Exponential Family

Probability density functions that are in exponential family can be expressed in the following form. p(x|η) = h(x)exp{η T T (x) − A(η)} Z A(η) = log

h(x)exp{η T T (x)}dx

One example of exponential family is multinomial distribution. For given data x = (x1 , x2 , ..., xk ) , xi ∼ Multi(1, πi ) and Σ πi = 1, we can write the probability of the data as follows.

p(x|π) = π1x1 π2x2 ...πkxk x1

= elog(π1

x

x

π2 2 ...πk k )

k

= eΣi=1 xi logπi Thus, in the corresponding exponential form, η = [log π1 , logπ2 , ..., logπk ], x = [x1 , x2 , ..., xk ], T(x) = x, A(η) = 0 and h(x) = 1. Another example is Dirichlet distribution. Let α1 , α2 , ... , αk > 0. The probability function of such distribution can be represented as an exponential function.

p(π|α) =

1 ΠK π αi −1 B(α) i=1 i K

= eΣi=1 (αi −1)logπi −logB(α) where A(η) = logB(α), η = [α1 , α2 , ..., αk ], T (x) = [logπ1 , logπ2 , ..., πk ] and h((x)) = 1.

2

Cumulant Generating Property

Notice that one appealing feature of the exponential family is that we can easily compute moments of the distribution by taking derivatives of the log normalizer A(η). 1

2

2.1

5 : Exponential Family and Generalized Linear Models

First cumulant a.k.a Mean

The first derivative of A(η) is equal to the mean of sufficient statistics T (X). dA dη

= = = = =

d logZ(η) dη 1 d Z(η) Z(η) dη Z  1 d T h(x)exp{η T (x)}dx Z(η) dη Z h(x)exp{η T T (x)} T (x) dx Z(η) Z T (x)p(x|η)dx

= E[T (X)]

2.2

Second cumulant a.k.a Variance

The second derivative of A(η) is equal to the variance or first central moment of sufficient statistics T (X). d2 A dη 2

Z

=

dA T (x)exp{η T T (x) − A(η)}(T (x) − )h(x)dx dη Z Z dA h(x)exp{η T T (x)} h(x)exp{η T T (x)} dx − T (x) dx T 2 (x) Z(η) dη Z(η) Z Z dA T 2 (x)p(x|η)dx − T (x)p(x|η)dx dη E[T 2 (X)] − (E[T (X)])2

=

V ar[T (X)]

= = =

2.3

Moment estimation

Accordingly, the q th derivative gives the q th centered moment. When the sufficient statistic is a stacked vector, partial derivatives need to be considered.

2.4

Moment vs canonical parameters

Since the moment parameter µ can be derived from the natural (canonical) parameter η by: dA(η) def = E[T (x)] = µ η Also notice that A(η) is convex since: d2 A(η) = V ar[T (x)] > 0 dη 2

5 : Exponential Family and Generalized Linear Models

3

Hence we can invert the relationship and infer the canonical parameter from the moment parameter (1-to-1) by: ∆ η = ψ(µ) which means a distribution in the exponential family can be parameterized not only by η (the canonical parameterization), but also by µ (the moment parameterization).

3

Sufficiency

For p(x|θ), T(x) is sufficient for θ if there is no information in X regarding θ beyond that in T(x). θ⊥ ⊥ X|T (X) However, it is defined in different ways in the Bayesian and frequentist frameworks. Bayesian view θ as a random variable. To estimate θ, T (X) contains all the essential information in X. p(θ|T (x), x) = p(θ|T (x))

Frequentist view θ as a label rather than a random variable. T (X) is sufficient for θ if the conditional distribution of X given T (X) is not a function of θ. p(x|T (x), θ) = p(x|T (x))

For undirected models, we have p(x, T (x), θ) = ψ1 (T (x), θ)ψ2 (x, T (x))

Since T (x) is function of x, we can drop T (x) on the left side, and then divide it by p(θ). p(x|θ) = g(T (x), θ)h(x, T (x)) Another important feature of the exponential family is that one can obtain the sufficient statistics T (X) simply by inspection. Once the distribution function is expressed in the standard form, p(x|η) = h(x)exp{η T T (x) − A(η)} we can directly see T (X) is sufficient for η.

4

5 : Exponential Family and Generalized Linear Models

4

MLE for Exponential Family

The reduction obtained by using a sufficient statistic T (X) is particularly notable in the case of IID sampling. Suppose the dataset D is composed of N independent random variables, characterized by the same exponential family density. For these i.i.d data, the log-likelihood is

l(η; D)

N Y

= log

h(xn )exp{η T T (xn ) − A(η)}

n=1

=

N X

log(h(xn )) + (η T

n=1

N X

T (xn )) − N A(η)

n=1

Take derivative and set it to zero, we can get N X

∂l ∂η

=

∂A(η) ∂η

=

N 1 X T (xn ) N n=1

µ ˆM LE

=

N 1 X T (xn ) N n=1

ηˆM LE

= ψ(ˆ µM LE )

T (xn ) − N

n=1

∂A(η) =0 ∂η

PN Our formula involves the data only via the sufficient statistic n=1 T (Xn ). This means that to estimate MLE of η, we only need to maintain fixed dimensions of data. For Bernouilli, Poisson and multinomial distributions, it suffices to maintain a single value, the sum of the observations. Individual data PNpoints can be thrown away. While for the univariate Gaussian distribution, we need to maintain the sum n=1 xn and PN the sum of squares n=1 x2n .

4.1

Examples

1. Gaussian distribution: We have η

=

T (x)

=

A(η)

=

h(x)

=

  1 −1 −1 Σ µ; − vec(Σ ) 2   x; vec(xxT ) 1 T −1 1 µ Σ µ + log |Σ| 2 2 −k/2 (2π) .

So µM LE =

N 1 X 1 X T (xn ) = xn . N n N n=1

5 : Exponential Family and Generalized Linear Models

5

2. Multinomial distribution: We have   πk ln ;0 πK = [x]

η

=

T (x)

= − ln 1 −

A(η)

K−1 X

! πk

k=1

h(x)

=

1.

So µM LE =

N 1 X xn . N n=1

3. Poisson distribution: We have η T (x)

=

log λ

= x

= λ = eη 1 . h(x) = x!

A(η)

So µM LE =

5 5.1

N 1 X xn . N n=1

Bayesian Estimation Conjugacy

Prior p(η|φ) ∝ exp{φT T (η) − A(η)} Likelihood p(x|η) ∝ exp{η T T (x) − A(η)} suppose η = T (η), posterior p(η|x, φ) ∝ p(x|η)p(η|φ) ∝ exp{η T T (x) + φT T (η) − A(η) − A(φ)} ∝ exp{

T (η)T | {z }

(

T (x) + φ | {z }

) − (A(η) + A(φ))} | {z }

sufficient func natural parameter

A(η,φ)

6

6

5 : Exponential Family and Generalized Linear Models

Generalized Linear Model

GLIM is a generalized form of traditional linear regression. As in linear regression, the observed input x is assumed to enter the model via a linear combination of its elements ξ = θT x. The output of the model, on the other hand, is assumed to have an exponential family distribution with conditional mean µ = f (ξ), where f is known as the response function. Note that for linear regression f is simply the identity function. Figure 1 is a graphical representation of GLIM. And Table 1 lists some correspondence between usual regression types and choice of f and Y .

Figure 1: Graphical model of GLIM. . Regression Type Linear Regression Logistic Regression Probit regression Multivariate Regression

f identity logistic cumulative Gaussian logistic

distribution of Y N (µ, σ 2 ) Bernoulli Bernoulli Multivariate

Table 1: Examples of regression types and choice of f and Y .

6.1

Why GLIMs?

As a generalization of linear regression, logistic regression, probit regression, etc., GLIM provides a framework for creating new conditional distributions that comes with some convenient properties. Also GLIMs with the canonical response functions are easy to train with MLE. However, Bayesian estimation of GLIMs doesn’t have a closed form of posterior, so we have to turn to approximation techniques.

6.2

Properties of GLIM

Formally, we assume the output of GLIM has the following form:   1 T p(y|η, φ) = h(y, φ) exp (η (x)y − A(η)) . φ

5 : Exponential Family and Generalized Linear Models

7

This is slightly different from the traditional definition of EF, where we include a new scale parameter φ; most distributions are naturally expressed in this form. Note that η = ψ(µ) and µ = f (ξ) = f (θT x), so we have η = ψ(f (θT x)). So the conditional distribution of y given x, θ and φ is   1 T p(y|x, θφ) = h(y, φ) exp (y ψ(f (θT x)) − A(ψ(f (θT x)))) . φ There’re mostly 2 design choices of GLIM: the choice of exponential family and the choice of f . The choice of the exponential family is largely constrained by the nature of the data y. E.g., for continuous y we use multivariate Gaussian, where for discrete class labels we use Bernoulli or multinomial. Response function is usually chosen with some mild constraints, e.g., between [0, 1] and being positive. There’s a so-called canonical response function where we use f = ψ −1 ; in this case the conditional probability is simplified to  p(y|x, θφ) = h(y, φ) exp

 1 T (θ x · y − A(θT x)) . φ

Figure 2 lists canonical response function for several distributions. Figure 3 and table 2 lists the relationship

Figure 2: Canonical response function for several distributions. . between variables and canonical functions.

Figure 3: Relationship between variables and functions. .

Regression Type Linear Regression Logistic Regression Probit regression

Canonical Response Y Y N

µ = f (ξ) µ=ξ 1 µ = 1+exp(−ξ) µ = φ(ξ)

η = f −1 (µ) η=µ µ η = log 1−µ η = φ−1 (µ)

distribution of Y N (µ, σ 2 ) Bernoulli(µ) Bernoulli(µ)

Table 2: Some regression types and their response/link functions.

8

6.3

5 : Exponential Family and Generalized Linear Models

MLE estimation for GLIMs with canonical response

Now we can compute the MLE estimation for canonical response functions: the log likelihood function is X X l= log h(yn ) + (θT xn yn − A(ηn )). n

n

Take derivative with respect to θ(note that θ is the only parameter for GLIMs with canonical response):  X X dl dA(ηn ) dηn = xn yn − = (yn − µn )xn = X T (y − µ). dθ dηn dθ n n So we can do stochastic gradient ascent with update rule θ(t+1) = θ(t) + ρ(yn − (θ(t) )T xn )xn where ρ is the step size. Another method is to use Newton-Raphson methods to obtain a batch-learning algorithm: The update rule is θ(t+1) = θ(t) − H −1 ∇θ J where J is the cost function and H is the Hessian matrix (second derivative). We have ∇θ J = X T (y − µ), and H

X ∂µn ∂ X ∂2l = (y − µ )x = xn T n n n ∂θ∂θT ∂θT n ∂θ n X ∂µn ∂ηn = − xn ∂ηn ∂θT n X ∂µn = − xn xTn ∂η n n =

= −X T W X,   dµ1 dµ2 dµN where W = diag , ,··· , . So the update rule is dη1 dη2 dηN θ(t+1) = θ(t) − H −1 ∇θ J = (X T W (t) X)−1 X T W (t) z (t) where the adjusted response is z (t) = Xθ(t) + (W (t) )−1 (y − µ(t) ).

10-708: Probabilistic Graphical Models 10-708, Spring 2014

Learning fully observed graphical models Lecturer: Matthew Gormley

1

Scribes: Akash Bharadwaj, Devendra Chaplot, Sumeet Kumar

Parameter estimation for fully observed directed graphical models

In the case of fully observed directed graphs, the product form of the joint distribution can be used to decompose the log-likelihood function into a sum of local terms, one per node: l(θ; D) = log p(D|θ)  YY = log p(xn,i |xπi , θi ) n

i

=

XX

=

XX

n

 log p(xn,i |xπi , θi )

i

 log p(xn,i |xπi , θi )

n

i

Since the joint probability decomposes into sum of local terms,the maximum likelihood problem decomposes into separate terms such that parameters θi appear in different terms, and thus each parameter can be estimated independently:

θ∗ = arg max log p(D|θ) θ  YY = arg max log p(xn,i |xπi , θi ) θ

= arg max θ

θi∗

= arg max θi

n

i

XX i

X

 log p(xn,i |xπi , θi )

n

log p(xn,i |xπi , θi )

n

This is exactly like learning parameters of several separate small BNs, each of which consists of a node and its parents.

1.1

Example

Consider a bayesian network with 4 nodes as shown in Figure 1(a). It has the following joint probability distribution: 1

2

Learning fully observed graphical models

Figure 1: Maximum Likelihood parameter estimation in Bayesian Networks. (a) A Bayesian Network with 4 nodes. (b) Maximum likelihood problem in (a) can be broken down into separate maximum likelihood problem for each node conditioned on its parents

p(x|θ) = p(x1 |θ1 )p(x2 |x1 , θ1 )p(x3 |x1 , θ1 )p(x4 |x2 , x3 θ1 ) Maximum Likelihood estimate of the parameters are calculated as follow: θ∗ = arg max log p(x1 , x2 , x3 , x4 ) θ

= arg max log p(x1 |θ1 ) + log p(x2 |x1 , θ2 ) + log p(x3 |x1 , θ3 ) + log p(x4 |x2 , x3 θ4 ) θ

Once it is expressed as a sum, it is possible to estimate one parameter at a time.

θ1∗ = arg max log p(x1 |θ1 ) θ1

θ2∗

= arg max log p(x2 |x1 , θ2 ) θ2

θ3∗ = arg max log p(x3 |x1 , θ3 ) θ3

θ4∗

= arg max log p(x4 |x2 , x3 , θ4 ) θ4

This decomposition is equivalent to splitting the bayesian network into four small bayesian network corresponding to a node and its parent as shown in Figure 1(b).

Learning fully observed graphical models

3

Note that marginal distributions such as p(x1 |θ1 ) are often represented by exponential family distributions, while conditional distributions such as p(x2 |x1 , θ2 ), p(x3 |x1 , θ3 ) and p(x4 |x2 , x3 , θ4 ) are conveniently represented using Generalized Linear Models.

2

Parameter estimation for fully observed Undirected Graphical Models

The previous section described how MLE estimates for parameters can be obtained for fully observed directed graphical models (bayes nets). In that case, we see that the log likelihood breaks down into separate terms for each set of local parameters (one per node) i.e. there is no parameter sharing between different terms in the log likelihood formulation. This however is not the case for even fully observed Undirected Graphical Models (UGMs). The source of our trouble is the partition function as usual, because of which we no longer get disparate terms in the log likelihood. This section presents two approaches to estimate parameters for undirected graphical models; one for a special sub-class of UGMs called decomposable UGMs and another approach called Iterative Proportional Fitting for arbitrary UGMs. We restrict this discussion to UGMs involving discrete random variables for simplicity. While these techniques can be adapted for continuous random variables, readers are recommended to refer to [3] for further details.

2.1

Notation

First we clarify some notation to be used in the rest of this section. XV indicates the random vector corresponding to the entire graph G associated with the UGM. Let C be the set of cliques in this graph. Then XC for some C ∈ C refers to the subset of random variables associated with the nodes in the clique C. xC refers to a specific instantiaion (value assignment) of the random variables in XC . The UGM is parametrized by potential functions ψC (xC ) associated with each clique C in the UGM. The joint probability of all the random variables in the graph then defined as: p(xV |θ) =

Y 1 ∗ (ψC (xC )) Z

(1)

C ∈ C

where θ = {ψC (xC ) ∀ C ∈ C}. Assuming each data sample is i.i.d, the nth data sample is associated with its own replica of the UGM Gn with random variables XV,n . Parameters for each node are shared across replicas. Since we are dealing with UGMs involving discrete random variables, we define the following marginal counts:

m(xV ) =

X

δ(xV , xV,n )

(number of times xV occurs in data set)

(2)

(marginal count for a value assignment xC to clique C)

(3)

(total number of samples in the data set)

(4)

n

m(xC ) =

X

m(xV )

xV\C

N=

X xV

m(xV )

4

Learning fully observed graphical models

Having defined the notation we shall use, we proceed to formulate the log likelihood: Y p(xV,n |θ) = p(xV |θ)δ(xV ,xV,n )

(5)

xV

p(D|θ) =

Y

p(xV,n ) =

n

Y Y n

l(D|θ) = log p(D|θ) =

p(xV |θ)δ(xV ,xV,n )

X X n

(6)

xV

(δ(xV , xV,n ) ∗ log (p(xV |θ)))

(7)

xV

By rearranging the order of the summation, applying the summation over n and using eqn 1 and 2 and, we get: X X X l(D|θ) = m(xV ) ∗ log(ψC (xC )) − m(xV ) ∗ log(Z) (8) xV

Observe that

P

xV

m(xV ) = N =

P

C∈C

xV

m(xC ). By using this, we get: XX l(D|θ) = m(xC ) ∗ log(ψC (xC )) − N ∗ log(Z) xC

(9)

C∈C xC

2.2

MLE for UGMs

Using the formulation of log likelihood in eqn (9), we use the standard technique of finding the derivative of the log likelihood and setting it to 0 to find the MLE estimates. As we shall see shortly, this doesn’t give us a closed form solution for the MLE parameters as we would have hoped, but rather gives us a condition involving the parameters, that must hold for them to be MLE estimates. We now proceed to obtain derivates with respect to each of our parameters in θ. Remember that θ = {ψC (xC ) ∀ C ∈ C}.

m(xC ) ∂(m(xC ) ∗ log(ψC (xC ))) = ∂ψC (xC ) ψC (xC ) P Q xD )) 1 ∂ ( x˜ D ψD (˜ ∂ log Z = ∗ ∂ψC (xC ) Z ∂ψC (xC )

(10) (using definition of Z)

By applying the differentiation, all terms where x ˜C 6= xC are eliminated. Consequently: ! Y ∂ log Z 1 X ∂ = ∗ δ(˜ xC , xC ) ∗ ψD (˜ xD ) ∂ψC (xC ) Z ∂ψC (xC ) x ˜ D∈C   Y ∂ log Z 1 X = ∗ δ(˜ xC , xC ) ∗  ψD (˜ xD ) ∂ψC (xC ) Z x ˜ D6=C ! X Y ∂ log Z 1 1 = δ(˜ xC , xC ) ∗ ∗ ∗ ψD (˜ xD ) ∂ψC (xC ) ψC (˜ xC ) Z x ˜ D ! X Y ∂ log Z 1 1 = ∗ δ(˜ xC , xC ) ∗ ∗ ψD (˜ xD ) ∂ψC (xC ) ψC (xC ) Z x ˜

Note that the model’s marginal distribution of x ˜ is defined as p(˜ x) =

(11)

(12)

(13)

(14)

(15)

D

1 Z

∗(

Q

D

ψD (˜ xD )). Using this definition:

Learning fully observed graphical models

X ∂ log Z 1 δ(˜ xC , xC ) ∗ p(˜ x) = ∗ ∂ψC (xC ) ψC (xC )

5

(equivalent to marginalizing out all Xi∈C / )

(16)

x ˜

∂ log Z 1 = ∗ p(xC ) ∂ψC (xC ) ψC (xC ) ∂l m(xC ) p(xC ) ⇒ = −N ∗ ∂ψC (xC ) ψC (xC ) ψC (xC )

(17) (18)

Note that without loss of generality, we can assume the potential functions are positive valued, since negative score functions can always be exponentiated to ensure the potential value has strictly non-negative range (0 being an extremal case). Consequently, when the gradient is 0, likelihood is maximized as follows: m(xC ) p(xC ) ∂l = −N ∗ =0 ∂ψC (xC ) ψC (xC ) ψC (xC ) m(xC ) ⇒ pM LE (xC ) = N

(19) (20)

By defining m(x) as the empirical marginal distribution p˜(x), we see that we have obtained a condition N constraining the MLE model marginal distribution to be equal to the empirical distribution. However, as mentioned before, we have not obtained a closed form solution for each of the parameters themselves since each such constraint involves all the parameters in it. This impasse leads use to two approaches to obtaining MLE estimates for the parameters.

2.3

Decomposable Models

As seen in the previous section, the appearance of the partition function means that equating the derivative of the log likelihood to 0 doesn’t give us MLE estimates for the parameters. This is mainly because the log likelihood doesn’t decompose into disparate terms as was the case in Bayes Nets. However, for a special subset of UGMs, the likelihood does indeed factor out conveniently enough to enable us to obtain the MLE estimates by inspection. This special subset of UGMs is the set of decomposable UGMs. They are defined as follows:

2.3.1

Definition of Decomposable models

An undirected graphical model is said to be decomposable if it can be recursively sub-divided into three subsets of nodes A, S and B such that A,S,B are disjoint, S separates nodes in A from nodes in B and S is complete. There are several alternate definitions of decomposable models as well [1]. They can be defined as: 1. Markov random fields whose underlying graph is chordal i.e. all cycles with 4 or more vertices have at least one edge (a chord) that is not a part of the cycle that connects two vertices on the cycle. 2. Bayes nets with no V-structures (common child). 3. Bayes nets with a Markov field perfect map. 4. Graphical models whose underlying (hyper)graph is a junction tree.

6

Learning fully observed graphical models

Figure 2: Decomposable graph with A = {X1 }, B = {X4 } and S = {X2 , X3 } Indeed, decomposable models are the intersection between directed graphical models and undirected graphical models. It is not surprising then that a simple technique exists, that can be used to obtain MLE estimates for parameters of such decomposable UGMs simply by inspection. 2.3.2

MLE for Decomposable models

MLE estimates can be easily obtained by parametrizing decomposable UGMs using potential functions associated with maximal cliques only. Let C be the set of maximal cliques in the UGM. Given this constraint, use the folloqing procedure to obtain MLE estimates for the potential functions: 1. For each clique C ∈ C, set the clique parameter θC (xC ) (= ψC (xC )) to be the empirical clique C) marginal p˜(xC ) = m(x N . 2. For each non-empty intersection between cliques, let the empirical marginal associated with that intersection be ψS (xS ). Divide this potential into the parameter of one of the intersecting cliques involved C) (say θC (xC )) and set the parameter of that clique to the quotient i.e. θC (xC ) = θψCS(x (xS ) . As an example, consider the decomposable graph in figure 3. Its decomposition has been provided in the image caption. Applying the above procedure for MLE via inspection, we ge the following estimates for it: pM LE (x1 , x2 , x3 ) =˜ p(x1 , x2 , x3 ) p˜(x2 , x3 , x4 ) p˜(x2 , x3 ) p˜(x2 , x3 , x4 ) ∗ p˜(x1 , x2 , x3 ) ⇒ pM LE (x1 , x2 , x3 , x4 ) = p˜(x2 , x3 ) pM LE (x2 , x3 , x4 ) =

2.4

(21) (22) (23)

Iterative Proportional Fitting for arbitrary UGMs

The simple procedure described in the previous sections for MLE by inspection works only for decomposable fully observed UGMs. To deal with arbitrary fully observed UGMs, we use eqn (19) along with a technique called fixed point iteration to develop an algorithm called Iterative Proportional Fitting which can be applied to arbitray UGMs to obtain MLE parameters. For decomposable UGMs, IPF converges in a single iteration (through all parameters) and ends up performing the same operations as the MLE by inspection method.

Learning fully observed graphical models

7

The procedure is as follows. We use eqn 19 and the definitions of empirical and model marginals, we get: p˜(xC ) p(xC ) = ψC (xC ) ψC (xC )

(24) (t)

Fixed point iteration suggests that we hold the parameter ψC (xC ) constant on the RHS (say ψC (xC )) and (t+1) solve for the free parameter (say ψC (xC )) on the LHS. Thus we get:

(t+1)

ψC

(t)

(xC ) =ψC (xC ) ∗

p˜(xC ) p(t) (xC )

(25)

(t)

Note that ψC (xC ) appears in p(t) (xC ) internally. IPF performs this operation by iterating through parameters associated with all maximal cliques C ∈ C cyclically. 2.4.1

IPF as coordinate ascent

In general, fixed point iteration is not guaranteed to converge and is not guaranteed to be well behaved (monotonic). However, IPF both converges and is well behaved in the sense that log likelihood is guaranteed not to decrease at any step. This can be justified by showing that IPF is actually a form of coordinate ascent, where the coordinates are the potential functions.This shown by using eqn 13 and plugging it into the derivative of the log likelihood. This gives:

Y ∂l m(xC ) NX = − δ(˜ xC , xC ) ψD (˜ xD ) ∂ψC (xC ) ψC (xC ) Z x ˜

(26)

D6=C

This can be viewed as a maximization of the parameter ψC (xC ) while holding the rest of the parameters ψD6=C (xD ) constant. We annotate these constant parameters with a timestamp (t). This gives us: Y (t) ∂l m(xC ) NX = − δ(˜ xC , xC ) ψD (˜ xD ) ∂ψC (xC ) ψC (xC ) Z x ˜

(27)

D6=C

Now we make use of an insight provided in [2], were it is shown that by updating the free parameter ψC (xC ) as per the IPF update equation (eqn 25), the value of the partition function doesn’t change. Thus, (t) (t+1) (t) p(x ˜ C) . We make use of this property, update ψC (xC ) and Z (t+1) = Z (t) when ψC (xC ) = ψC (xC ) ∗ p(t) (xC ) (t)

(t+1)

both multiply and divide eqn (27) by ψC (xC ). We evaluate the new derivative at this new value ψC ∂l (t+1) ∂ψC (xC )

∂l (t+1) ∂ψC (xC )

= =

m(xC ) (t+1) ψC (xC )



m(xC ) (t+1) ψC (xC )



(t+1)

N X (t) ψC

N (t) ψC

δ(˜ xC , xC ) ∗

x ˜



p˜(x)C p(t) (xC )

Y (t) 1 ∗ ψD (˜ xD ) (t) Z D

(xC ):

(28) (29)

By substituting the actual value of ψC as per the IPF update rule in eqn (25), we see that the updated value forces the new gradient to be 0. In this sense, IPF is a coordinate ascent algorithm where the coordinates are the parameters associated with the maximal cliques in the graph.

8

Learning fully observed graphical models

Figure 3: Visualization of IPF. Each step of IPF is a projection onto a manifold such that one of the cliques has the correct marginal. The point of convergence is where all such manifolds intersect.

3

Generalized Iterative Scaling (GIS)

GIS is one of the ways to estimate parameter of an Undirected Graphical Model (UGM) and is particularly useful for non decomposable models. GIS like IPF is an iterative model, but it can be broadly applied to exponential family potentials. As we saw in the previous section, IPF maximizes log likelihood by maximizing clique potential function by differentiation. Instead of optimizing the log liklihood directly, GIS uses the lower bound of log-likelihood to find the optima. In a general case in which the clique potentials are parameterized by arbitrary collection of features, we could use a general exponential family model.

p(x|θ) =

X 1 exp θi fi (x) Z(θ) i

(30)

The scaled likelihood function could be written as: X ˜l(θ|D) = p˜(x)logp(x), where ˜l(θ|D) = l(θ|D)/N and p˜(x)is the empirical distribution.

(31)

x

˜l(θ|D) =

X x

p˜(x)log

X

θi fi (x) − logZ(θ) =

i

X i

θi

X

p˜(x)fi (x) − logZ(θ)

(32)

x

If the following two constraints are satisfied, GIS could be used to find the maximum likelihood parameter estimate. X fi (x) ≥ 0 and fi (x) = 1 (33) i

We use the convexity property to design another function which is a lower bound to log likelihood. Then we increase the lower bound to increase the log likelihood.

Learning fully observed graphical models

9

Using convexity property: logz(θ) ≤ µZ(θ) − log(µ) − 1, where µ = Z −1 (θ(t) ) X X z(θ) => ˜l(θ|D) ≥ p˜(x) θi fi (x) − − logZ(θ(t) ) + 1 (t) ) Z(θ x i (t)

(t)

Lets define: ∆i θi = θi − θi

˜l(θ|D) ≥

X x

p˜(x)

X

θi fi (x) −

i

(35)

X X X  X 1 (t) (t) exp θ f (x) exp ∆θ f (x) − logZ(θ(t) ) + 1 (36) i i i i Z(θ(t) ) x x i i

Using Convexity and Jensen’s inequality, we get:  X  X X πi xi ≤ πi exp(xi ) for πi = 1 exp i

i

x

(37)

i

In the above equation, f ’s are positive and sum to one, so it can play the role of πi , that gives   X X X X (t) ˜l(θ|D) ≥ θi p˜(x)fi (x) − p(x|θ(t) fi (x)exp ∆θi − logZ(θ(t) ) + 1 = Λ(θ) i

(34)

x

(38)

i

Note the above equation is a lower bound and parameters are decoupled. Taking the derivative of the above equation, wrt theta i and setting it to zero, gives:

  P p˜(x)f (x) i (t) exp∆θi = P x (t) (Z(θ(t) ) x p (x)fi (x)

(39)

We have a relationship between the update function of theta and total distribution:

(t+1)

(t)

θi = θi + ∆θi (t) Y p(t+1) (x) = p(t) exp(∆θi (t)fi (x))

(40)

i

Using the above equations, the final update equation could be written as: P fi (x) ˜(x)fi (x) p(t) x Y  xp P p(t+1) (x) = (Z(θ(t) ) (t) (t) Z(θ x p (x)fi (x) i P  Y p˜(x)fi (x) fi (x) P x (t = p(t) x x p )(x)fi (x) i (t+1) θi

=

(t) θi

 P p˜(x)f (x)  i + log P x (t) x p (x)fi (x)

(41) (42)

(43)

10

Learning fully observed graphical models

As seen in the derivation above, the key idea is to first define a function that lower bonds the log-liklihood. Since the bound is tight means we can increase lower-bound by fixed-point iteration in order to increase log-likelihood. Please check [3] chapter 20 for details. Comparison Between GIS and IPF: GIS is a fully parallel algorithm, where as IPF is parallel at the level of a single clique. GIS is an iterative algorithm (like IPF), but more broadly applies to exponential family potentials GIS have been largely surpassed by the gradient based methods which is discussed in the next section.

4

Gradient Based Methods (GBMs)

If potential function could be described as: ψc (xc ) = θcT fc (xc ) Its log likelihood function could be written as: XX l(θ) = θk fk (xn ) − N logZ(θ) n

The derivative is:

(44)

(45)

k

X ∂l ∂ = logZ fi (xn ) − N ∂θj ∂θj n

(46)

X ∂l = fi (xn ) − N E[fj (X)] ∂θj n

(47)

Which could further be resolved as:

Any gradient-based optimization algorithm could be used for finding the global MLE by passing the above derivative. Steps involved in GBMs are: 1. Design the objective function 2. Compute partial derivatives of the objective function 3. Feed the objective function and derivatives to an optimization algorithm. A number of optimization algorithms like Newton’s Method, Quasi-Newton’s methods or Stochastic gradient methods could be used. 4. Get back the optimized parameters from the optimization algorithm.

5

Summary • Maximum Likelihood Parameter estimation in completely observed Bayesian Networks is easy thanks to decomposability.

Learning fully observed graphical models

11

• MLE estimation for fully observed UGMs is easy for decomposable UGMs and can be achieved by inspection. • MLE estimation for arbitrary fully observed UGMs is possible using the IPF algorithm, which is a form of coordinate ascent that is guaranteed to converge and to be well behaved. • GIS uses fixed point iteration over the derivative of a lower-bound of the likelihood objective to estimate maximum likelihood parameter. • Gradient Based Methods uses simple algorithms like SGD, have usually a faster convergence than GIS and applies to arbitrary potentials. Note: A lot of the materials in these scribe notes have been adapted from the citations below. More in-depth reading of these materials is highly recommended.

References [1] Decomposable graphical models, triangulation and the Junction Tree, Marina Meila (Available at: http: //www.stat.washington.edu/courses/stat535/fall11/Handouts/l5-decomposable.pdf) [2] Chapter 9, Probabilistic Graphical Models, Michael I. Jordan, pg 17 [3] Chapter 19,20, Probabilistic Graphical Models, Michael I. Jordan

10-708: Probabilistic Graphical Models 10-708, Spring 2014

Lecture 7: Learning Fully Observed UGMs (Cont’d) & Exact Inference Lecturer: Matthew Gormley

1

Scribes: Keith Maki, Anbang Hu, Jining Qin

Learning Fully Observed UGMs (Cont’d)

1.1

Fixed point iteration for optimization

Fixed point iteration is a general tool for solving systems of equations. It is also applicable to optimization. In order to maximize a given objective function J(θ), we can do the following: 1. Compute derivative derivative.

∂J(θ) ∂θi

and set it to zero:

∂J(θ) ∂θi

= 0 = f (θ), where we introduce f to denote the

2. Rearrange the equation such that one of the parameters appear on the LHS: θi = g(θ). (t+1)

3. Rewrite the equation with timestamp: θi

= g(θ (t) )

4. Initialize the parameters θ. 5. Update parameters and increment t. 6. Repeat step 5 until convergence.

1.2

Iterative proportional fitting (IPF)

In iterative proportional fitting (IPF), our goal is to maximize log-likelihood `(θ; D) =

N X

log p(x(n) |θ)

(1)

n=1

Q 1 where p(x|θ) = Z(θ) C ψC (xC ). Following the suit of fixed point iteration algorithm, after letting the derivative of Eq.1 equal zero, we get m(xC ) p(xC ) ∂`(θ; D) = −N =0 ∂ψC (xC ) ψC (xC ) ψC (xC )

(2)

After rearranging one of the parameters to the LHS in Eq.2 and define p˜(xC ) , m(xC )/N , we obtain ψC (xC ) = ψC (xC ) 1

p˜(xC ) p(xC )

(3)

2

Lecture 7: Learning Fully Observed UGMs (Cont’d) & Exact Inference

So we can set up the update rule: (t+1)

ψC

(t)

(xC ) = ψC (xC )

p˜(xC ) p(t) (xC )

(4)

Loop through each clique C in C and update each potential function while incrementing t until the potential functions converge to the maximum likelihood estimates. IPF requires the potentials to be fully parameterized: ψC (xC ) = θC,xC . IPF iterates a set of fixed-point equations Eq.4. IPF is also a coordinate ascent algorithm (coordinates are parameters of clique potentials). IPF guarantees that the log-likelihood Eq.1 increases at each step and converges to a global maximum.

1.3

Generalized iterative scaling (GIS)

General potentials for large cliques are exponentially costly for inference and have exponential numbers of parameters that we must learn from limited data. One solution is to change the graphical model to make cliques smaller. But this changes the dependencies, and may force us to make more independence assumptions than we would like. Another solution is to keep the same graphical model, but use a less general parameterization of the clique potentials. This motivates the feature based potential ( ) X 1 exp θi fi (x) p(x|θ) = Z(θ) i

(5)

which is exponential family potential. For MRF with feature based potential, we apply generalized iterative scaling (GIS), because it allows more general parameterization and can be seen as IPF extended to other settings. In generalized iterative scaling (GIS), instead of optimizing the log-likelihood function directly, we repeatedly increase a tight lower bound. Given the average log-likelihood function: N X ˜ D) = 1 `(θ; log p(x(n) |θ) N n=1

Denoting p˜(x) =

m(x) N

(6)

as before, and substituting Eq.5 into Eq.6 we have

X ˜ D) = 1 `(θ; m(x) log p(x(n) |θ) N x X m(x) = log p(x(n) |θ) N x X = p˜(x) log p(x(n) |θ) x

=

X x

p˜(x)

X

θi fi (x) − log Z(θ)

i

We would like to attack the lower bound of the average log-likelihood. Because log Z(θ) ≤ µZ(θ) − log µ − 1, ∀µ. We choose µ = Z −1 (θ (t) ), then

Lecture 7: Learning Fully Observed UGMs (Cont’d) & Exact Inference

˜ D) ≥ `(θ;

X

p˜(x)

x

(t)

X

θi fi (x) −

i

3

Z(θ) − log Z(θ (t) ) + 1 Z(θ (t) )

(t)

Let ∆θi = θi − θi , we can get

( ) X X 1 p˜(x) θi fi (x) − exp θi fi (x) − log Z(θ (t) ) + 1 (t) ) Z(θ x x i i ( ) ( ) X X X X X (t) 1 (t) exp = p˜(x) θi fi (x) − θi fi (x) exp ∆θi fi (x) − log Z(θ (t) ) + 1 (t) ) Z(θ x x i i i ( ) X X X X (t) = p˜(x) θi fi (x) − p(x|θ (t) ) exp ∆θi fi (x) − log Z(θ (t) ) + 1

˜ D) ≥ `(θ;

X

X

x

i

x

i

Due to the convexity of exponential function, we can apply Jensen’s inequality (exp n o P (t) ) to get i fi (x) exp ∆θi ˜ D) ≥ `(θ;

X

p˜(x)

X

x

θi fi (x) −

X

p(x|θ (t) )

x

i

X

nP

(t)

i fi (x)∆θi

n o (t) fi (x) exp ∆θi − log Z(θ (t) ) + 1

o



(7)

i

We denote the RHS of Eq.7 by Λ(θ). Λ(θ) is the lower bound of which we should take derivative: n oX ∂Λ(θ) X (t) = p˜(x)fi (x) − exp ∆θi p(x|θ (t) )fi (x) ∂θi x x

(8)

Setting Eq.8 to zero, we obtain the fixed point iteration update rule:

(t) ∆θi

 = log

P   P  p˜(x)fi (x) p˜(x)fi (x) (t) x x P = log P (t) Z(θ ) (t) x p(x|θ )fi (x) x p (x)fi (x)

(9)

 P  p˜(x)fi (x) (t) x + log P (t) Z(θ ) x p (x)fi (x)

(10)

or

(t+1) θi

=

(t) θi

where p(t) (x) is the unnormalized version of p(x|θ (t) ). This update rule is applied for each dimension of θ as t is incremented until convergence to the maximum likelihood estimates for MRF.

4

Lecture 7: Learning Fully Observed UGMs (Cont’d) & Exact Inference

Let’s take another look at Eq.6: N

X ˜ D) = 1 `(θ; log p(x(n) |θ) N n=1 X = p˜(x) log p(x(n) |θ) x

=

XX C

p˜(xC ) log ψC (xC ) − log Z(θ)

xC

Differentiating it, we get

∂`(θ, D) = ∂θk =

! XX C

E

where f·,k = distribution.

P

C



p˜(xC )fC,k (xC )

xC

C

[f·,k (x)] −

x∼p(·|D) ˜

! XX

E

p(xC )fC,k (xC )

xC

[f·,k (x)]

x∼p(·|D)

fC,k (xC ). Here p˜(·|D) is the empirical distribution of data points x and p(·|D) is the model

Note that neither IPF nor GIS works well in practical settings. The reason they are used in learning of conditional random field is mostly historical. People used these models for log-linear models and exponential family models and they naturally tried to adapt them for conditional random field. Later people realized we have a lot of optimization methods available. Using the log-likelihood function as objective, together with its gradient and sometimes Hessian, we can get the maximum likelihood estimate for the model by optimizing the log-likelihood directly. And that corresponds to the gradient-based learning methods.

1.4

Gradient-based learning method

In a gradient-based learning algorithm. We follow the steps below: 1. Write down the objective function. 2. Compute the partial derivatives of the objective (i.e. gradient, and sometimes also Hessian). 3. Feed objective function and derivatives into black box optimization algorithms. 4. Retrieve optimal parameters from black box. The black box optimization algorithms include: 1. Newton’s method. It requires the gradient and Hessian of the objective function. 2. Hessian-free/Quasi-Newton methods. When the gradient or Hessian is hard to obtain either symbolically or numerically, there are ways to get around it by calculating an approximate partial derivative. They include conjugate gradient method and L-BFGS method, etc. 3. Stochastic gradient methods. They work better when the objective function is expressed as the sum of a lot of differentiable functions. They include stochastic gradient descent, stochastic meta-descent, AdaGrad, etc.

Lecture 7: Learning Fully Observed UGMs (Cont’d) & Exact Inference

1.5

5

Regularization

When we are learning a model using training data set, we are really trying to find a model that maximizes the probability density of the training data set given a set of parameters. Sometimes that leads to fitting the noise but not signal present in the data set. This phenomenon is called overfitting. Overfitting can greatly undermine the generalizability of a model. Overfitting can be avoided by reasonable regularization. In order to avoid overfitting in learning with gradient-based methods, we add a penalty term (a.k.a regularization term) to the objective log-likelihood function: J(θ) = `(θ) + r(θ)

(11)

For L2 regularization, K

r(θ) =

λX 2 λ (kθk2 )2 = θk 2 2

(12)

k=1

It can be shown L2 regularization is equivalent to MAP estimation with prior θ ∼ N (0, λ1 I). For L1 regularization, r(θ) = λ kθk1 = λ

K X

|θk |

(13)

k=1

It can be shown L1 regularization is equivalent to MAP estimation with prior θ ∼ Laplace(λ). The L1 regularization is more likely to produce sparse parameter estimates so it is often used for model selection.

2 2.1

Exact Inference Factor graphs

Although the techniques discussed so far have focused on either directed or undirected graphical models, these two representations may be unified by a third representation known as a factor graph. Factor graphs unify the representation of Markov Random Fields and Bayes Nets by explicitly encoding relations between the random variables in a GM with arbitrary functions called factors. This representation naturally supports inference techniques which apply to both directed and undirected graphical models, such as the class of belief propagation algorithms introduced in section 2.4 and 2.5 below. Visually, factor graphs are represented using a bipartite graph, with two types of vertices, representing the random variables and the factors, respectively. Random variables which share a factor are connected to that factor by undirected edges. In a factor graph, the factors serve as potential tables, assigning weights locally to each possible configuration of the connected variables. Indeed, for undirected graphical models, the factors have a natural one-to-one mapping onto the clique potentials, as shown in Figure 1. For directed graphical models, factors represent the marginal and conditional distributions in the factored joint distribution as shown in Figure 2. It is important to note also that in addition to unifying the representation between directed and undirected graphical models, the structure of the underlying factor graph may be simplified from that of the original model. In particular, we emphasize the fact that a non-tree graphical model may have a factor graph which is a tree, referred to as a factor tree. For example, notice that maximal cliques in the undirected case (e.g. Figure 1) and polytrees in the directed case (e.g. Figure 3) correspond to factor trees.

6

Lecture 7: Learning Fully Observed UGMs (Cont’d) & Exact Inference

X1

fb

X2

X1

X2

fa

X3

X4

X3

(a) P (x) ∝ ψ134 (x1 , x3 , x4 )ψ12 (x1 , x2 )

X4

(b) P (x) ∝ fa (x1 , x3 , x4 )fb (x1 , x2 )

Figure 1: Markov Random Field with equivalent factor graph representation

X1

fa

X5

X1

X3

fc

X2

fb

X4

X2

(a) P (x) = P (x1 )P (x2 )P (x3 |x1 , x2 ) P (x5 |x1 , x3 )P (x4 |x2 , x3 )

X5

fd

X3

fe

X4

(b) P (x) = fa (x1 )fb (x2 )fc (x3 , x1 , x2 ) fd (x5 , x1 , x3 )fe (x4 , x2 , x3 )

Figure 2: Bayes net with equivalent factor graph representation

X1

fa

X2

X3

X4

X5 (a) P (x) = P (x1 )P (x3 |x1 , x2 ) P (x5 |x3 , x4 )P (x2 )P (x4 )

fd

X1

fb

X2

X3

fc

fe

X4

X5 (b) P (x) = fa (x1 )fb (x3 , x1 , x2 ) fc (x5 , x3 , x4 )fd (x2 )fe (x4 )

Figure 3: Bayes net polytree with corresponding factor tree representation

However, note that for an MRF, there may be more than one factor graph which represent the same graphical model. This is a result of the ambiguity inherent in the clique structure of undirected graphical models. The structure of the factor graph implicitly identifies the set of cliques used to factor the joint distribution of the MRF. As shown in Figures 1 and 4, this choice of factorization may determine whether the factor graph is a factor tree, which can have implications for inference.

Lecture 7: Learning Fully Observed UGMs (Cont’d) & Exact Inference

X1

7

fd

X2

X1 fa

X3

X2 fc

fb

X4

X3

(a) P (x) ∝ ψ13 (x1 , x3 )ψ34 (x3 , x4 ) ψ14 (x1 , x4 )ψ12 (x1 , x2 )

X4

(b) P (x) ∝ fa (x1 , x3 )fb (x3 , x4 ) fc (x1 , x4 )fd (x1 , x2 )

Figure 4: Markov Random Field with equivalent factor graph representation

2.2

Inference problems

In inference, there are three task, all of which are NP-hard in general case. The three tasks are listed below: 1. Marginal Inference. Compute marginals of variables and cliques: X p(xi ) = p(x0 |θ) x0 :x0i =xi

p(xC ) =

X

p(x0 |θ)

x0 :x0C =xC

2. Partition Function. Compute the normalization constant: XY Z(θ) = ψC (xC ) x C∈C

3. MAP Inference. Compute the variable assignment with highest probability: ˆ = arg max p(x|θ) x x

Computing marginals by sampling on factor graph is NP-hard in general. In practice, we use MCMC to draw an approximate sample fast. In addition, sampling finds the high-probability value xi efficiently, but it takes too many samples to see a low-probability ones.

2.3

Variable elimination

Variable elimination is a simple and general exact inference algorithm for graphical models. During the following discussion, we omit the parameters θ for simplicity. Let X = {X1 , . . . , Xn } denote the set of variable nodes, each of which can range from 1 to k, F = {ψα1 , . . . , ψαs } denote the set of factor nodes, where α1 , . . . , αs ⊆ {1, . . . , n}. To compute marginal p(x1 ), naively, we do the following: p(x1 ) =

1 Z

s X Y x2 ,...,xn i=1

ψαi (xαi )

8

Lecture 7: Learning Fully Observed UGMs (Cont’d) & Exact Inference

This involves O(k n−1 ) additions. Instead, we can choose an ordering I (in which x1 appears at the end) and capitalize on the factorization of p(x). But first, we need to know several concepts. A potential can be a factor ψαi , or an “intermediate” tablemi (xSi ), where mi (xSi ) results from elimination of variable xi . An active list stores all these potentials. The variable elimination algorithm runs as follows: 1. Initialize active list with all potentials in the factor graph and choose an ordering I in which x1 appears last. 2. For each i ∈ I, do the following • find all potentials from the active list that reference xi and remove them from the active list • let φi (xTi ) denote the product of these potentials P • let mi (xSi ) = xi φi (xTi ) • place mi (xSi ) on the active list 3. eventually, we get p(x1 ) =

1 mj (x1 ) Z

(14)

where xj is the second last in I. The above algorithm produces one dimensional table that summarize the k different values of p(x1 ). This process easily generalize to computation of other marginals. Note that for directed graphs, Z = 1. For undirected graphs, if we compute each (unnormalized) value on the LHS of Eq.14, we can sum them to get the value of partition function Z. The complexity of variable elimination in the above case of computing marginals is O((n − 1)k r ), where n is the number of variables, k is the maximum value a variable can take and r is the number of variables participating in largest “intermediate” table. In other words, the overall complexity of variable elimination is determined by the number of the largest elimination clique. However, this number is relevant to elimination ordering I. Tree-width t is one less than the smallest achievable value of the cardinality of the largest elimination clique, ranging over all possible elimination orderings. “good” elimination orderings lead to small cliques and hence reduce the complexity. That finding the best elimination ordering of a graph is NPhard, which implies that inference is also NP-hard. But there often exist “obvious” optimal or near-optimal elimination ordering. Before closing this section, it is reasonable to illustrate variable elimination algorithm with a small example and see how it relates to node elimination in the graph: For all i, suppose that the range of Xi is {0, 1, 2}. The factor graph is shown in Fig.5 ψ12

ψ24

X1

X2

X4

ψ234 ψ13

ψ45 ψ5

X3

X5 Figure 5: Example factor graph

Lecture 7: Learning Fully Observed UGMs (Cont’d) & Exact Inference

9

Let’s compute p(x1 ) using variable elimination (corresponding node elimination is illustrated in following figures).

p(x1 ) = = = = = = = =

1 Z 1 Z 1 Z

X

ψ12 (x1 , x2 )ψ13 (x1 , x3 )ψ24 (x2 , x4 )ψ234 (x2 , x3 , x4 )ψ45 (x4 , x5 )ψ5 (x5 )

x2 ,x3 ,x4 ,x5

X

X ψ12 (x1 , x2 )ψ13 (x1 , x3 )ψ24 (x2 , x4 )ψ234 (x2 , x3 , x4 ) ψ45 (x4 , x5 )ψ5 (x5 )

x2 ,x3 ,x4

X

Fig.6a

x5

ψ12 (x1 , x2 )ψ13 (x1 , x3 )ψ24 (x2 , x4 )ψ234 (x2 , x3 , x4 )m5 (x4 )

Fig.6b

x2 ,x3 ,x4

X 1 X ψ12 (x1 , x2 )ψ13 (x1 , x3 ) ψ24 (x2 , x4 )ψ234 (x2 , x3 , x4 )m5 (x4 ) Z x ,x x4 2 3 1 X ψ12 (x1 , x2 )ψ13 (x1 , x3 )m4 (x2 , x3 ) Z x ,x 2 3 X 1 X ψ12 (x1 , x2 ) ψ13 (x1 , x3 )m4 (x2 , x3 ) Z x x3 2 1 X ψ12 (x1 , x2 )m3 (x1 , x2 ) Z x 2 1X ψ12 (x1 , x2 )m3 (x1 , x2 ) Z x

Fig.7a Fig.7b Fig.8a Fig.8b Fig.9a

2

1 = m2 (x1 ) Z

Fig.9b

ψ12

X1

ψ24

X2

ψ12

X4

ψ13

X1

ψ45

ψ24

X2

X4

ψ13

ψ234

ψ234 ψ5

X3

X5 (a) Node elimination step 1.1

X3

m5

(b) Node elimination step 1.2

Figure 6: Elimination of node X5

10

Lecture 7: Learning Fully Observed UGMs (Cont’d) & Exact Inference

ψ12

ψ24

X1

ψ12

X4

X2

X1

ψ13

m4

X2

ψ13 ψ234

X3

X3

m5

(a) Node elimination step 2.1

(b) Node elimination step 2.2

Figure 7: Elimination of node X4 ψ12

X1

m4

ψ12

X2

X1

X2

ψ13

X3

m3

(b) Node elimination step 3.2

(a) Node elimination step 3.1

Figure 8: Elimination of node X3

2.4

Sum-product belief propagation

Message passing is a great idea in machine learning. For a given node in an acyclic graph, based on counting information passed from all directions, the node is able to compute the global count. Message passing is fundamental in belief propagation. In a factor graph, neighbors of any variable nodes are factor nodes and neighbors of any factor nodes are variable nodes. If a variable node Xi wants to send a message µi→α to one of its neighbors ψα , it has to wait until it receives messages µα0 →i , ∀α0 ∈ N (i)\α. Then it computes and sends the message: Y

µi→α (xi ) =

µα0 →i (xi )

(15)

α0 ∈N (i)\α

Similarly, if a factor node ψα wants to send a message µα→i to one of its neighbors Xi , it has to wait until it receives messages µj→α , ∀j ∈ N (α)\i. Then it computes and sends the message: µα→i (xi ) =

X

ψα (xα )

xα :xα [i]=xi

Y

µj→α (xα [i])

(16)

j∈N (α)\i

where xα [i] is the component corresponding to variable node Xi . Belief at variable node Xi is bi (xi ) =

Y α∈N (i)

µα→i (xi )

(17)

Lecture 7: Learning Fully Observed UGMs (Cont’d) & Exact Inference

11

ψ12

X1

X2 m2

X1 (b) Node elimination step 4.2 m3

(a) Node elimination step 4.1

Figure 9: Elimination of node X2

Belief at factor node ψα is bα (xα ) = ψα (xα )

Y

µi→α (xα [i])

(18)

i∈N (α)

Fig. 10 illustrates beliefs of and messages sent by variable nodes and factor nodes. With the above information, we can introduce the sum-product belief propagation algorithm as follows: Input: An acyclic factor graph Output: Exact marginals for each variable and factor Algorithm: 1. Initialize the messages to the uniform distribution µi→α (xi ) = 1,

µα→i (xi ) = 1

2. Choose a root node 3. Send messages from the leaves to root using Eq.15,16 Send messages from the root to leaves using Eq.15,16 4. Compute the beliefs (unnormalized marginals) using Eq.17,18 5. Normalize beliefs and return the exact marginals: pi (xi ) ∝ bi (xi ), pα (xα ) ∝ bα (xα ) Note that a node computes an outgoing message along an edge only after it has received incoming messages along all its other edges. (Acyclic) Belief propagation can be viewed as dynamic programming in the following way: If you want the marginal pi (xi ) where Xi is a variable node of degree k, you can think of that summation as a product of k marginals computed on smaller subgraphs. Each subgraph is obtained by cutting some edge of the tree. The message passing algorithm uses dynamic programming to compute the marginals on all such subgraphs, working from smaller to larger. So you can compute all the marginals.

2.5

Max-product belief propagation

Sum-product belief propagation can be used to compute the marginals: pi (xi ). Max-product belief propagation is a variant of Sum-product belief propagation that can be used to compute the most likely assignment: ˆ = arg maxx p(x). If we replace the summation in Eq.15 with max function and hold everything else unx changed in sum-product belief propagation, we can get max-product belief propagation with a new message

12

Lecture 7: Learning Fully Observed UGMs (Cont’d) & Exact Inference

Figure 10: Beliefs and messages of variables and factors.

passed from factor nodes to variable nodes: µα→i (xi ) =

max xα :xα [i]=xi

ψα (xα )

Y

µj→α (xα [i])

(19)

j∈N (α)\i

The max-marginal bi (xi ) from max-product belief propagation is the (unnormalized) probability of the MAP assignment under the constraint Xi = xi . Assuming no ties, the MAP assignment for an acyclic graph is given by x ˆi = arg max bi (xi ) (20) xi

In order for smooth transition from sum-product to max-product, we introduce deterministic annealing: 1. Incorporate inverse temperature parameter into each factor, from which we get the annealed joint distribution: 1 1 Y p(x) = ψα (xα ) T (21) Z α 2. Send messages as usual for sum-product belief propagation 3. Anneal T from 1 to 0: T =1 T →0

Sum-product Max-product

Lecture 7: Learning Fully Observed UGMs (Cont’d) & Exact Inference

13

4. Take resulting beliefs to power T The reason that the algorithm works when “sum” is replaced by “max” is that both the “sum-product” pair and the “max-product” pair are examples of an algebraic structure known as a commutative semiring. A commutative semiring is a set endowed with two operations - generically referred to as “addition” and “multiplication” - that obey certain laws. In particular, addition and multiplication are both required to be associative and commutative. Moreover, multiplication is distributive over addition: a · b + a · c = a · (b + c). This distributive law plays a key role in sum-product belief propagation, in which “sum” operator repeatedly migrates across the “product” operator. The ability to group and reorder intermediate factors is also required. It can be shown that associativity, commutativity and distributivity are all that are needed for sum-product algorithm. We can also check that “product” distributes over “max”: max(a · b, a · c) = a · max(b, c). A practical problem is that multiplying many small numbers together can underflow. Thus, instead of using sum-product, we use log-add-sum; and instead of using max-product, we use max-sum.

10-708: Probabilistic Graphical Models 10-708, Spring 2015

8 : Learning Partially Observed GM: EM Algorithm Lecturer: Eric P. Xing

1

Scribes: Ankit Laddha, Anirudh Vemula, Cuong Nguyen

Inference as Subroutine for Learning

Many a times inference is used as subroutine for learning or parameter estimation in graphical models. In the following we describe two such scenarios.

1.1

Partially Observed Directed GM

In case of partially observed directed graphical models the log-likelihood l(θ; D) involves a marginalization over the unobserved variables. We calculate l(θ; D) as: X l(θ; D) = log p(x, z|θ) (1) z

= log

X

p(z|θz )p(x|z, θx )

(2)

z

Therefore, to calculate l we need to do inference to find the conditional probability of x given z.

1.2

Fully Observed Undirected GM

In fully observed undirected graphical models the log-likelihood (l)is calculates as XX l= m(xc ) log ψc (xc ) − N log Z x

(3)

xc

Any method for parameter estimation requires the gradient of l which in tern requires the the gradient of the term log Z, which is calculated as ∂ log Z p(xc ) = ∂ψc (xc ) ψc (xc )

(4)

Therefore, to calculate p(xc ) we need to find the p(xc ) for every clique in the model.

2 2.1

Examples: Partially Observed Models Speech Recognition

As shown in the Figure 1, we can model the speech recognition problem as a HMM where the observed variables Xi are the sounds and the unobserved variables Yi are the phonetics or words spoken. 1

2

8 : Learning Partially Observed GM: EM Algorithm

Figure 1: A Graphical Model for Speech Recognition

2.2

Biological Evolution

We could also model the biological evolution as a directed GM as shown in Figure 2. In this, the leaf nodes representing the various organisms are observed. The hidden nodes represent a common ancestor from which the organisms derive a particular trait.

Figure 2: A Graphical Model for Biological Evolution

3 3.1

Unobserved Variables Why we need them?

Latent variables are extensively used in graphical models as they provide a simplified and abstract view of the data generation process. They can be used to model real-world objects or phenomena which are difficult/impossible to measure or can be only measured with noise (e.g. through faulty sensors) In case of data where we have clusters (or some kind of grouping), discrete latent variables can be used to model the membership of the data. Continuous latent variables are used in dimensionality reduction

8 : Learning Partially Observed GM: EM Algorithm

3

techniques like factor analysis.

3.2

Why is learning hard?

As we have seen before, the log-likelihood for fully-observable directed models decomposes ”neatly” into a sum of local terms. In case of undirected models, we observe the same property of log-likelihood in the case of tree-like models and decomposable models. lc (θ; D) = log p(x, z|θ) = log p(z|θz ) + log p(x|z, θx )

(5)

But in the presence of latent variables, the situation becomes ”tricky”. When some variables are not observed the likelihood is not a joint probability but a marginal probability obtained by summing out all the latent variables. This summation leads to all the parameters getting coupled and the log-likelihood doesn’t decompose like before. lc (θ; D) = log

X

p(x, z|θ) = log

X

z

p(z|θz )p(x|z, θx )

(6)

z

This coupling of parameters makes the learning task harder in the presence of latent variables. The usual gradient-based approaches to get maximum-likelihood estimates cannot be used efficiently in this case and we must resort to EM-like approaches.

4 4.1

Mixture Models GMMs

Mixture models arise naturally from data clustering task, where data may have multi-modal density distribution. A mixture model comprises of a number of components which are uni-modal density functions. In such a setup, each uni-modal density function corresponds to a sub-population of the data. Gaussian mixture model (GMM) is one of the most mature mixture model. A GMM defines the overall probability density of data as a weighted sum of Gaussian distributions. Figure 3 illustrates the idea of using a GMM for data clustering. More formally, we can define a GMM of k Gaussian components as follows: p(x|µ, Σ) =

X

πk N (x|µk , Σk )

(7)

k

Where πk are the model weights to specify mixture proportion, each N (x|µk , Σk ) is a Gaussian component. Another way to think of GMM is to have a latent discrete variable Z to indicate which Gaussian component is being selected: p(zn ) = multi(zn : π) =

Y

k

(πk )zn

(8)

k

Given Z, X is a conditional Gaussian variable with specific uni-modal parameters (mean and variance): p(xn |znk = 1, µ, Σ) =

1 1/2 (2π)m/2 |Σk |

1 exp(− (xn − µk )T Σ− k 1(xn − µk )) 2

(9)

4

8 : Learning Partially Observed GM: EM Algorithm

Figure 3: GMM for clustering

The likelihood of a single sample xn is computed as the product of the probability distribution of Z and the conditional distribution of X given Z: p(xn |µ, Σ) =

X

p(z k = 1|π)p(x|zk = 1, µ, Σ)

(10)

XY k k ((πk )zn N (xn : µk , Σk )zn )

(11)

k

=

zn

=

k

X

πk N (x|µk , Σk )

(12)

k

In a scenario where Z is revealed, i.e. for completely observed data, the MLE solution can be found analytically. The data log-likelihood is given by: l(θ, D) = log

Y

p(zn , xn ) = log

n

Y

p(zn |π)p(xn |zn , µ, σ)

(13)

n

=

X

log

n

=

zk

πkn +

X

Y

log

n

k

XX n

Y

znk log πk −

(14)

k

XX n

k

k

N (xn ; µk , σ)zn

k

znk

1 (xn − µk )2 + C 2σ 2

(15)

The MLE solution for πk , µk , and σk is given by: πk,M LE = arg max l(θ, D)

(16)

µk,M LE = arg max l(θ, D)

(17)

σk,M LE = arg max l(θ, D)

(18)

π

µ

σ

The closed-form solution requires the true value of zn , for example: P µk,M LE =

k n zn xn znk

(19)

8 : Learning Partially Observed GM: EM Algorithm

4.2

5

EM algorithm for GMMs

In cases where Z is unobserved, notice that the expected complete log-likelihood comprises the expected value of zn : X X hlc (θ; x, z)i = hlog p(zn |π)ip(z|x) + hlog p(xn |zn , µ, Σ)ip(z|x) (20) n

n

XX 1 XX k = hznk i log πk − hzn i((xn − µk )T Σ−1 k (xn − µk ) + log |Σk | + C) 2 n n k

(21)

k

Therefore, if hznk i can be estimated, the expected log-likelihood can be computed. This task is completed in the E-step of EM algorithm. The overall strategy of EM algorithm is to run E-steps followed by M-steps in an iterative manner to maximize the expected complete log-likelihood hlc (θ)i. In the E-step (expectation step), the expected value of the sufficient statistics of the latent variable is computed. Essentially, we are doing an inference for the sufficient statistics. In this case, the sufficient statistics is Z: (t)

π N (xn |µ(t) , Σ(t) ) τnk(t) = hznk iq(t) = p(znk = 1|xn , µ(t) , Σ(t) ) = P k (t) (t) (t) i πi N (xn |µ , Σ )

(22)

In the M-step (maximization step), we maximize the expected log-likelihood by computing the MLE for the parameters πk , µk , Σk , using the plug-in value of hznk i obtained from E-step: πk∗ = arg maxhlc (θ)i

(23)

µ∗k = arg maxhlc (θ)i

(24)

Σ∗k = arg maxhlc (θ)i

(25)

π

µ

Σ

4.3

Compare K-means algorithm and EM

Recall that K-means is a clustering algorithm which iteratively runs 2 steps: 1. Assignment step: assign each data sample to its nearest cluster centroid. 2. Update step: recompute the new centroid for each new cluster. Figure 4 shows a case where K-means converges after 8 iterations, and we are able to identify 2 clusters. In K-means, the assignment step is doing hard assignment, in which a data sample either belong to a cluster (with probability equal to 1) or not (0 probability): −1(t)

zn(t) = arg max(xn − µk )T Σk

(xn − µk )

(26)

In EM, the E-step does soft assignment: for each data sample xn , it computes the probabilities of xn belonging to each of the k clusters, as seen in equation (25). In K-means, the update step recomputes the mean as the weighted sum of the data, where all weights are either 1 or 0: P (t) (t+1) n δ(zn , k)xn µk = P (27) (t) n δ(zn , k) In EM, the M-step also recomputes the mean, but with soft weights: P k(t) (t+1) n τn xn µk = P k(t) n τn Therefore, K-means can be considered as a hard-assignment version of EM.

(28)

6

8 : Learning Partially Observed GM: EM Algorithm

Figure 4: K-means algorithm for clustering

5 5.1

EM Complete and Incomplete Log-likelihood

Lets denote X as observable variable(s) and Z as hidden variable(s). If we could observe Z, then the complete log-likelihood is defined as: lc (θ; x, z) = log p(x, z|θ)

(29)

Usually optimizing lc () given both x and z is straightforward because it decomposes into a sum of local factors. The parameters for each factor can be estimated separately. However, given that Z is not observed, lc () is a random quantity which cannot be maximized directly. With unobserved Z, our objective becomes the log of marginal probability: X X lc (θ; x) = log p(x, z|θ) = log p(z|θz )p(x|z, θx ) z

(30)

z

This is called incomplete log-likelihood. Now, the objective won’t decouple which makes the parameter estimation problem very hard.

5.2

Expected Complete Log-Likelihood

To make the parameter estimation tractable in presence of unobserved variables we define a surrogate function called expected complete log-likelihood. We will also show that it is a lower bound on the incomplete loglikelihood and thus we hope that maximizing this yield a maximizer for the likelihood. For any distribution q(z|x, θz ), we define the expected complete log-likelihood hlc (θ; x, z)i as: X hlc (θ; x, z)i = q(z|x, θz )p(x, z|θx ) z

(31)

8 : Learning Partially Observed GM: EM Algorithm

7

It is a deterministic function of θ because we are taking an expectation over the unobserved random z. It is also linear in lc () which implies that it inherits its factorizability. We could also show that it is a lower bound on the original incomplete log-likelihood using the following arguments: l(θ; x) = log p(x|θ) X = log p(x, z|θ)

(32) (33)

z

= log

X

q(z|x)

z

p(x, z|θ) q(z|x)

(34) (35)

Note that log is a concave function. Thus, by using Jensen’s inequality we get

l(θ; x) ≥

X

q(z|x) log

z

=

X

p(x, z|θ) q(z|x)

q(z|x) log p(x, z|θ) + Hq

(36) (37)

z

5.3

= hlc (θ; x, z)i + Hq

(38)

≥ hlc (θ; x, z)i

(39)

EM as Coordinate-Ascent on Free Energy

For a fixed data x, we could define a function called the free energy as: F (q, θ) =

X

q(z|x) log

z

p(x, z|θ) ≤ l(θ; x) q(z|x)

(40)

Now, the EM algorithm can be seen as coordinate ascent on F , where we alternatively minimize q and θ.

5.4

E-step q t+1 = argmax F (q, θt ) q

(41)

The solution to the E-step is the posterior distribution over the latent variable given the data and the parameters. q t+1 = p(z|x, θt )

(42)

We could prove this easily by substituting this into F (q, θ) and showing that it attains the bound l(θ; x) ≥

8

8 : Learning Partially Observed GM: EM Algorithm

F (q, θ). X

F (p(z|x, θt ), θt ) =

p(z|x, θt ) log

z

X

=

p(x, z|θt ) p(z|x, θt )

p(z|x) log p(x, |θt )

(43) (44)

z

= log p(x, |θt )

(45)

t

= l(θ ; x)

(46)

Before looking at M-step lets define the form of p(x, z|θ). Without loss of generality we can assume that p(x, z|θ) is a generalized family distribution: ! X 1 h(x, z) exp θi fi (x, z) (47) p(x, z|θ) = Z(θ) i Now we can write the hlc (θ; x, z)i under q t+1 as X hlc (θt ; x, z)iqt+1 = q(z|x, θt ) log p(x, z|θt ) − A(θ)

(48)

z

=

X

θit hfi (x, z)iq(z|x,θt ) − A(θ)

(49)

i

Under the special case of that P (x|z) are GLIMs, then fi (x, z) = η T (z)ξi (x)

(50)

Therefore, hlc (θt ; x, z)iqt+1 =

X

θit hηiT iq(z|x,θt ) ξi (x) − A(θ)

(51)

i

5.5

M-step θt+1 = argmax F (q t+1 , θ) θ

(52)

Note that F (q, θ) breaks into two terms F (q, θ) =

X

q(z|x) log

z

=

X

p(x, z|θ) q(z|x)

q(z|x) log p(x, z|θ) −

z

(53) X

q(z|x) log q(z|x)

(54)

z

= hlc (θ; x, z)iq − Hq

(55)

The first term is the expected complete log likelihood and the second term is entropy which does not depend on θ. Thus in M-step we only need to consider the first term. So, X θt+1 = argmax q(z|x) log p(x, z|θ) (56) θ

z

When q is optimal, this is the same as MLE of the fully observed p(x, z|θ), but instead of the sufficient statistics for z, their expectations with respect to p(z|x, θ) are used.

8 : Learning Partially Observed GM: EM Algorithm

9

Figure 5: A hidden markov model

6 6.1

More examples HMMs

We can look at learning Hidden-Markov models in an EM-like framework. Learning in HMMs can be done in both ways: Supervised and unsupervised. In the supervised learning setting, we are given annotated data with the correct labels for the sequences. In such a case, it is a fully-observed model and we can directly apply MLE techniques to learn the parameters of the model. Contrastingly, in the unsupervised learning setting we have unannotated data. We can only observe some of the variables and the remaining variables are unobserved (latent variables). In such a case, it is a partially-observed model and requires us to use a EM-like approach to learn the parameters. 6.1.1

Baum-Welch algorithm

Baum-Welch algorithm is an EM-framework approach to learning the parameters of a HMM. The complete log-likelihood of a HMM can be written as: lc (θ; x, Y ) = log p(x, Y ) = log

T T Y Y Y (p(Yn,1 ) p(Yn,t |Yn,t−1 ) p(xn,t |Yn,t )) n

t=2

(57)

t=1

Similarly, the expected complete log-likelihood can be written as: hlc (θ; x, Y )i =

X

i (hYn,1 ip(Yn,1 |xn ) log πi ) +

n

+

n

T XX n

T XX

j i (hYn,t−1 Yn,t ip(Yn,t−1 ,Yn,t |xn ) log ai,j )

(58)

t=2

i (xkn,t hYn,t ip(Yn,t |xn ) log bi,k )

(59)

t=1

where A = ai,j is the transition matrix which gives p(Yt = j|Yt−1 = i) and B = bi,k is the emission matrix which gives p(xt = k|Yt = i). The E-step of the EM algorithm is then given by: i i i γn,t = hYn,t i = p(Yn,t = 1|xn ) i,j ξn,t

=

j i hYn,t−1 Yn,t i

=

i p(Yn,t−1

=

j 1, Yn,t

= 1|xn )

(60) (61)

10

8 : Learning Partially Observed GM: EM Algorithm

The M-step is given by: i γn,1 N i,j n t=2 ξn,t L aM PT −1 i i,j = sumn t=1 γn,t P PT i k n t=1 γn,t xn,t L bM = P PT −1 i i,k t=1 γn,t n

P

πiM L

= P PT

n

(62) (63)

(64)

In the unsupervised setting, Baum-welch algorithm proceeds as follows 1. Start with an initial best guess of parameters θ for the model 2. Estimate ai,j and bi,k from the training data ai,j =

X

j i hYn,t−1 Yn,t i

(65)

n,t

bi,k =

X

i hYn,t ixkn,t

(66)

n,t

3. Update θ according to estimated ai,j and bi,k . This is the plain old MLE estimation problem, which can be done by any previously explored technique. 4. Repeat steps 2 and 3 until convergence It can be proven that we get a more likely set of parameters θ at the end of each iteration compared to the previous one.

6.2

EM for BNs

For general bayesian networks, the EM algorithm can be given as: 1. For each node i, reset the expected sufficient statistics ESSi = 0 2. For each data sample n, do inference with Xn,H and for each node i, update the expected sufficient statistics correspondingly ESSi + = hSSi (Xn,i , Xn,πi )ip(Xn,H ,Xn,−H ) 3. For each node i, obtain the MLE parameters θi = M LE(ESSi ) 4. Go to step 1 until convergence

6.3

Conditional mixture model

To model p(y|x), we can use a set of different experts each responsible for different regions of the input space. We can use a latent variable Z to choose the expert by a softmax gating function

p(z k = 1|x) = sof tmax(ξ T x)

(67)

8 : Learning Partially Observed GM: EM Algorithm

11

Figure 6: Conditional Mixture Model The model for a conditional mixture model looks like: X P (y|x) = p(z k = 1|x, ξ)p(y|z k = 1, x, θi , σ)

(68)

k

The loss function in this case has the following form X X hlc (θ; x, y, z)i = hlog p(zn |xn , ξ)ip(z|x,y) + hlog p(yn |xn , zn , θ, σ)ip(z|x,y) n

(69)

n

The E-step in the EM algorithm is p(z k = 1|xn )pk (yn |xn , θk , σk2 ) τnk(t) = p(z k = 1|xn , yn , θ) = P n k 2 j p(zn = 1|xn )pj (yn |xn , θj , σj )

(70)

The M-step in the EM algorithm uses the standard normal equation for linear regression θ = (X T X)−1 X T Y but with the data reweighted by τ or using the weighted IRLS algorithm to update ξk , θk , σk based on data k(t) points (xn , yn ) with weights τn .

6.4

Other variants of EM

There are several variants of the EM algorithm, few of which are: • Sparse EM: This algorithm does not recompute the posterior probability on each data point under all data models, because it is almost zero (hence, the sparsity). Instead, it keeps an ”active” list which it updates every once in a while • Generalized (incomplete) EM: In some cases, it is intractable to get the maximum likelihood estimates of the parameters in the M-step, even with complete data. This algorithm still makes progress by doing an M-step that improves the likelihood a bit in a way similar to the IRLS step in the conditional mixture model parameter estimation

12

7

8 : Learning Partially Observed GM: EM Algorithm

Summary

In summary, EM is a family of algorithms that help in maximizing the likelihood of latent variable models. It computes the MLE of the parameters in two steps: 1. Estimating ”unobserved” or ”missing” data from observed data and current parameters. Can also be seen as filling-in the values of the latent variables based on the current best guess 2. Using this ”complete” data to get the current MLE parameters. Can also be seen as updating the parameters based on the guesses (or filling-in) that we did in the previous step Essentially, EM maximizes likelihood by optimizing a lower bound on the log-likelihood in the M-step and closing the gap between the bound and the log-likelihood in the E-step. EM is currently the most popular method for parameter estimation in partially-observed models. It does not require any learning-rate parameter (unlike many gradient based approaches) and is very fast for lowdimensions. It also ensures convergence as the likelihood increases after each iteration. The disadvantages of EM are mostly that it can lead to a local optima instead of a global optima. EM can also be slower than conjugate gradient especially near convergence. We can also observe that it needs an expensive inference step (in the E-step) that might be intractable in some situations.

10-708: Probabilistic Graphical Models 10-708, Spring 2014

Discrete sequential models and CRFs Lecturer: Eric P. Xing

1

Scribes: Pankesh Bamotra, Xuanchong Li

Case Study: Supervised Part-of-Speech Tagging

The supervised part-of-speech tagging is a supervised task that tags each part of speech sequence with predefined labels, such as noun, verb, and so on. As shown in Figure 1, given the data D = {x(n) , y (n) }, x represent the speech word and y represents the tag.

Figure 1: Supervised Part-of-Speech Tagging

This problem can be approached with many different methods. Here we discuss three of them: Markov random field, Bayes network, and conditional random field. • Markov Random Field (Figure 2): it models the joint distribution over the tags Yi and words Xi . The individual factors are not probabilities. Thus a normalization Z is needed. • Bayes Network (Figure 3): it also models the joint distribution over the tags Yi and words Xi . Note that here the individual factors are probabilities. So Z = 1. • Conditional Random Field (Figure 4): it models conditional distribution over tags Yi given words Xi . The factors and Z are specific to sentence X. 1

2

Discrete sequential models and CRFs

Figure 2: Supervised Part-of-Speech Tagging with Markov Random Field

Figure 3: Supervised Part-of-Speech Tagging with Bayes Network

Figure 4: Supervised Part-of-Speech Tagging with Conditional Random Field

Discrete sequential models and CRFs

2

3

Review of Inference Algorithm

The forward-backward algorithm (marginal inference) and viterbi algorithm (MAP inference) are two major inference algorithms we have seen so far. It turns out they are all belief propagation algorithms.

Figure 5: Belief Propagation in Forward-backward Algorithm For example, in the forward-backward algorithm, as shown in Figure 5, α is the belief from forward pass, the β is the belief from the backward pass, and the ψ is the belief from the xi . Then the belief of of Y2 = n is the product of the belief from the three directions: α(n)β(n)ψ(n).

3

3.1

Hidden Markov Model (HMM) and Maximal Entropy Markov Model (MEMM) HMM

Figure 6: Hidden Markov Model

4

Discrete sequential models and CRFs

HMM (Figure 6) is a graphical model with latent variable Y and observed variable X. It is a simple model for sequential data such as language, speech, and so force. But, there are two issues with HMM. • Locality of feature: HMM models only capture dependencies between each state and its corresponding observation. But in real world problem like NLP, each segmental state may depend not just on a single word (and the adjacent segmental stages), but also on the (non-local) features of the whole line such as line length, indentation, amount of white space, etc. • Mismatch of objective function: HMM learns a joint distribution of states and observations P (Y, X), but in a prediction task, we need the conditional probability P (Y |X)

3.2

MEMM

Figure 7: Maximal Entropy Markov Model MEMM (Figure 7) is for solving the problems of HMM. It gives full observation of sequence to every state, which make the model more experssive than HMM. It is also a discriminative model, since it completely ignores modeling P (X) and the learning objective function consistent with predictive function: P (Y |X). But MEMM has the label bias problem, which means a preference for states with lower number of transitions over others. To avoid this problem, one solution is not normalizing probabilities locally. This improvement gives us the conditional random field (CRF).

4

Comparison between Markov Random Field and Conditional Random Field

4.1

Data

• CRF: D = {X (n) , Y (n) }N n=1 • MRF: D = {Y (n) }N n=1

4.2

Model

• CRF: P (Y |X, θ) = • MRF: P (Y |θ) =

4.3

1 Z(X,θ)

1 Z(θ)

Q

Q

c∈C

c∈C

ψc (Yc , X), where ψc (Yc , X) = exp(θfc (Yc , X))

ψc (Yc ), where ψc (Yc ) = exp(θfc (Yc ))

Log likelihood

• CRF: l(θ, D) =

1 N

PN

log(y n |x(n) , θ)

• MRF: l(θ, D) =

1 N

PN

log(y n |θ)

n=1 n=1

Discrete sequential models and CRFs

4.4

Derivatives (n)

=

1 N

PN

P

c

fc,k (yc , x(n) ) −

∂l = • MRF: ∂θ k

1 N

PN

P

c

fc,k (yc , x(n) ) −

• CRF:

5

5

∂l ∂θk

n=1

n=1

(n)

(n)

(n)

1 N

PN

P P

yc

p(yc |x(n) )fc,k (yc , x(n) )

1 N

PN

P P

yc

p(yc )fc,k (yc )

n=1

n=1

c

c

(n)

(n)

Generative Vs. Discriminative Models - Recap

In simple terms, the difference between generative and discriminative models is the generative models are based on joint distribution p(y, x), while discriminative models are based on the conditional distribution p(y|x). Typical example of generative-discriminative pair [1] is the Naive Bayes classifier and Logistic regression. The principle advantage of using discriminative models is that they can incorporate rich features which can have long range dependencies. For example, in POS tagging we can incorporate features like capitalization of the word, syntactic properties of the neighbour words, and others like location of the word in the sentence. Having such features in generative models generally leads to poor performance of the models.

Figure 8: Relationship between some generative-discriminative pairs

6

CRF formulation

Conditional random fields are formulated as below: P (y1:n |x1:n ) =

n n Y Y 1 1 φ(yi , yi−1 , x1:n ) = exp(wT f (yi , yi−1 , x1:n )) Z(x1:n ) i=1 Z(x1:n , w) i=1

An important difference to note here is that CRF formulation looks more or less like the MEMM formulation. However, here the partition function is global and lies outside of the exponential product term.

7

Properties of CRFs • CRFs are partially directed models.

6

Discrete sequential models and CRFs

• CRFs are discriminative models like MEMMs. • CRF formulation has global normalizer Z that helps in overcoming the label bias problem of the MEMMs. • Being a discriminative model, CRFs can model rich set of features over the entire observation sequence.

8

Linear Chain CRFs

Figure 9: A Linear chain CRF In a linear chain conditional random field, we model the potentials between adjacent nodes as a Markov Random Field conditioned on input x. Thus, we can formulate linear chain CRF as: P (y|x) =

n X X X 1 exp( ( λk fk (yi , yi−1 , x) + µl gl (yi , x))) Z(x, λ, µ) i=1 k

where Z(x, λ, µ) =

X y

exp(

l

n X

X

i=1

k

(

λk fk (yi , yi−1 , x) +

X

µl gl (yi , x)))

l

fk and gl in the above equations are referred to as feature functions that can encode rich set of features over the entire observation sequence.

9 9.1

CRFs Vs. MRFs Model CRF :

P (y|x, θ) =

M RF :

9.2

Y 1 exp(θ, f (yc , x)) Z(x, θ) c

P (y|x, θ) =

1 Y exp(θ, f (yc )) Z(θ) c

Average Likelihood CRF :

M RF :

N X ˜l(θ; D) = 1 log p(y(n) |x(n) , θ) N n=1 N X ˜l(θ; D) = 1 log p(y(n) |θ) N n=1

Discrete sequential models and CRFs

9.3

7

Derivatives CRF :

N N d˜l(θ; D) 1 XX 1 XXX (n) = fc,k (yc(n) , x(n) ) − p(yc |x(n) c c )fc,k (yc , xc ) dθk N n=1 c N n=1 c y c

M RF :

N N 1 XX 1 XXX d˜l(θ; D) = fc,k (yc(n) ) − p(yc )fc,k (yc ) dθk N n=1 c N n=1 c y c

10

Parameter estimation

As we saw in the previous sections CRFs have parameters λk and µk which we estimate from the training N data D = (x(n) , y (n )i=1 having empirical distribution p˜(x, y). We can use iterative scaling algorithms and gradient descent to maximizes the log-likelihood function.

11

Performance

Figure 10: CRFs vs. other sequential models for POS tagging on Penn treebank

12

Minimum Bayes Risk Decoding

Decoding in CRFs refers to choosing a particular y from p(y|x) such that some form of loss function is minimized. A minimum Bayes risk decoder returns an assignment that minimizes expected loss under model distribution. This can be represented as: X hθ (x) = argminy pθ (y|x)L(ˆ y , y) y

Here L(ˆ y , y) represents a loss function like 0-1 loss or Hamming loss.

13

Applications - Computer Vision

A few applications of CRFs in computer vision are: • Image segmentation

8

Discrete sequential models and CRFs

• Handwriting recognition • Pose estimation • Object recognition

13.1

Image segmentation

Image segmentation can be modelled by conditional random fields. We exploit the fact that foreground and background portions of an image have pixel-wise and local characteristics that can be incorporated as CRF parameters. Thus, image segmentation can be formulated as:   X XX Y ∗ = argmaxy∈(0,1)n  Vi (yi , X) + Vi,j (yi , yj ) i∈S

i∈S j∈Ni

Here, Y refers to image label as foreground or background, Xs are data features, S are pixels, and Ni refers to the neighbours of pixel i.

10-708: Probabilistic Graphical Models 10-708, Spring 2016

10 : Gaussian graphical models and Ising models: modeling networks Lecturer: Eric P. Xing

1 1.1

Scribes: Xiongtao Ruan, Kirthevasan Kandasamy

Introduction Networks in real world

This lecture introduced methods for modeling networks. Gaussian graphical models and Ising models were introduced in the lecture. These methods are popular in learning the structure of networks. In real word, networks are important and interested to researchers. Networks come from lots of areas, such as the Jesus network which represents relationships of characters in Jeses, social networks that represent assiciation among users; Internet networks that represents the connection between different nodes; gene regulatory networks and so on. Network can also evolves through time. Figure 1 shows a network of genes in the development of an embryo. In different time, the gene regulatory networks are different in corresponding to the activation and regulation of different developmental genes.

Figure 1: Evolving of Gene regulatory networks in embryo development.

In sum, in real world there are lots of networks and some may evolves through time. It is necessary to develop some statistical methods to study them. 1

2

10 : Gaussian graphical models and Ising models: modeling networks

1.2

Two optimal approaches for structure learning

An important question in modeling networks is to decide the structures of networks. The general idea is to sort to ’optimal’ approaches, that is, utilize algorithms that guarantee to return a structure that maximizes the objectives (usually likelihood). For structure learning, two approaches that is suitable for different kind of networks have been learned: • The Chow-Liu algorithm: this algorithm is appropriate for tree-structured graph; • Pairwise Markov random fields: it is used for undirected graphs

2

Pairwise Markov Random Fields

The basic idea for pairwise Markov random fields is to model the edges as parameters, where if there is connection between two nodes, then there is non-zero parameters for the pair. In other words, we use a matrix of parameters to encode the graph structure. As is shown in Figure 2, θij represents the parameter for edge (i, j). The structure of the network can be inferred through the distribution below the network.

Figure 2: Relationship network and the method for structure learning

The states of nodes can be either discrete, which is called Ising /Potts model, or continuous, which is called Gaussian graphical model, or even heterogeneous.

2.1

Ising Models

Assuming an exponential family distribution, the joint probability of a pairwise MRF, parametrised by Θ is given by,   X X > p(x|Θ) = exp  θii xi + θij xi xj − A(Θ) . i∈V

(i,j)∈E

10 : Gaussian graphical models and Ising models: modeling networks

3

If the variables are binary, it is straightforward to argue that the conditional probability of xk given x−k is given by a logistic regression model. Concretely, pΘ (xk |x−k ) = logistic (2xk hΘ−k , x−k i) . This suggests that we could use an L1 regularised logistic regression approach to learn the neighbors of xk . The approach also extends to vector valued nodes provided we use an appropriate group lasso penalty on elements of Θ that correspond to one variable.

2.2

Gaussian Graphical Model

As is mentioned above, gaussian graphical models (GGMs) are continuous form of pairwise MRFs. The basic assumption for GGMs is that the variables in the network follows multivariate Gaussian distribution. The distribution for GGMs is p(x|µ, Σ) =

1 1 exp{− (x − µ)T Σ−1 (x − µ)} 2 (2π)n/2 |Σ|1/2

where µ is the mean, Σ is the covariance matrix, n is the dimension of data (number of variables). Here we can show how to convert a GGM to the form of pairwise MRF. WLOG, let µ = 0, precision matrix Q = Σ−1 , then the above distribution can be written as: p(x1 , ..., xn |µ = 0, Q) =

X |Q|1/2 1X exp{− qii (xi )2 − qij xi xj } n/2 2 i (2π) i 0} There are three assumptions for LASSO: • Dependency Condition: Relevant Covariates are not overly dependent • Incoherence Condition: Large number of irrelevant covariates cannot be too correlated with relevant covariates • Strong concentration bounds: Sample quantities converge to expected values quickly If the above assumptions are met, LASSO will asymptotically recover correct subset of covariates that relevant. In papers Meinshausen and Buhlmann q 2006, and Wainwright 2009, the conclusion above were ˆ → S(β ∗ ) proved. The mathematical form is: if λs > C log p , then with high probability, S(β) S

10 : Gaussian graphical models and Ising models: modeling networks

5

Neighborhood selection with LASSO is intuitively simple and has theoretical guarantee. However, it it is only suitable for iid-sampled networks. For non-iid sampled networks or time evolving networks, other kinds of algorithms are needed.

3 3.1

Gaussian Graphical models Variables Dependencies in GGMs

The conditional dependencies in GGMs are basis for the algorithms for network structure inference. Here we will show the joint Gaussian distribution and related conditional distributions. If x1 and x2 follows p(

       x1 x µ Σ |µ, Σ) = N ( 1 | 1 , 11 x2 x2 µ2 Σ21

 Σ12 ) Σ22

Obviously the marginal distributions for x1 and x2 are also Gaussian distributions, and the expressions are as follows respectively.

m p(x1 ) = N (x1 |mm 1 , V1 )

m p(x2 ) = N (x2 |mm 2 , V2 )

mm 1 = µ1

mm 2 = µ2

V1m = Σ11

V2m = Σ22

The conditional distribution of x1 given x2 or x2 given x1 are also Gaussian distributions. The expressions of the distributions are as follows:

p(x1 |x2 ) = N (x1 |m1|2 , V1|2 ) m1|2 = V1|2 = 3.1.1

µ1 + Σ12 Σ−1 22 (x2 − Σ11 − Σ12 Σ−1 22 Σ21

p(x2 |x1 ) = N (x2 |m2|1 , V2|1 ) m2|1 = µ2 + Σ21 Σ−1 11 (x1 − µ1 )

µ2 )

V2|1 = Σ22 − Σ21 Σ−1 11 Σ12

The matrix inverse lemma

Here we present the trick for matrix inverse, which is useful for the prove of conditional Gaussian distribution. For a block-partitioned matrix:   E F M= G H First, we diagonalize M:  I 0

−F H −1 I

 E G

F H



I −H −1 G

   0 E − F H −1 G 0 = I 0 H

To simplify the expression, a value called Schur complement is defined as: M/H = E − F H −1 G Then, we inverse the matrix, using this formula: XY Z = W



Y −1 = ZW −1 X

6

10 : Gaussian graphical models and Ising models: modeling networks

M −1 = =

 E G 

F H

−1

I −H −1 G

0 I

 (M/H)−1 0

(M/H)−1 −1 −H G(M/H)−1

 =

=

0 H −1

 I 0

−F H −1 I



−(M/H)−1 F H −1 −1 H + H −1 G(M/H)−1 F H −1

 −1 E + E −1 F (M/E)−1 GE −1 −(M/E)−1 GE −1

−E −1 F (M/E)−1 (M/E)−1





By above derivation, we have the matrix inverse lemma: (E − F H −1 G)−1 = E −1 + E −1 F (H − GE −1 F )−1 GE −1 3.1.2

Conversion between covariance and the precision matrices

For covariance matrix, we do the partition for the first variable with all other variables like the following expression.  σ Σ = 11 ~σ1

 ~σ1T = Q−1 Σ−1

So by matrix inverse procedure above, we can get the partitioned precision matrix as:     q11 −q11~σ1T Σ−1−1 q11 ~q1T Q= = ~q1 Q−1 −q11 Σ−1−1~σ1 Σ−1−1 (I + q11~σ1~σ1T Σ−1−1 ) where q11 = σ11−1 . 3.1.3

Single-node Conditional

Based on above section, we can write the conditional distribution of a single node i given the rest of node as: −1 p(Xi |X−i ) = N (µi + ΣXi X−i Σ−1 X−i X−i (X−i − µX−i ), ΣXi Xi − ΣXi X−i ΣX−i X−i ΣX−i Xi ) Without loss of generality, let µ = 0, we have: −1 p(Xi |X−i ) = N (ΣXi X−i Σ−1 X−i X−i X−i , ΣXi Xi − ΣXi X−i ΣX−i X−i ΣX−i Xi )

= N (~σi Σ−1 −i X−i , qi|−i ) = N(

~qiT X−i , qi|−i ) −qii

From this equation, for each node we can write the following conditional auto-regression function: Xi =

~qiT X−i + , −qii

 ∼ N (0, qi|−i )

10 : Gaussian graphical models and Ising models: modeling networks

7

This could be interpreted as a node can be expressed as the linear combination of all other nodes with a Gaussian noise. And we can estimate the neighborhood for each node based on the auto-regression coefficient. The neighborhood of node i is defined as: Si = {j :, j 6= i, θij 6= 0} Si defines the Markov blanket of node i, we have p(Xi |Xs ) = p(Xi |X−i )

3.2

Some Recent Trends in Gaussian Graphical Models

One of the most classical methods for estimating structure in a GGM is due to Arther P. Dempster who sequentially prunes the smallest elements in a precision matrix. Drton & Perlman 2007 refines this approach by providing improved statistical tests for pruning. However, this approach has serious limitations in practice, particularly when the covariance matrix is not invertible. Recently, there has been a flurry of activity in L1 regularised methods for structure learning in GGMs. Meinshausen & B¨ uhlmann 2006 estimates the neighborhood of a variable in a Gaussian model via Lasso regression. Friedman et al. 2008 adopts a maximum likelihood approach subject to an L1 penalty on the coefficients of the precision matrix. Banerjee et al. 2008 uses a block sub-gradient algorithm for finding the precision matrix. Below, we review them in more detail. 3.2.1

Graphical Lasso

Pn Let the sample covariance constructed using the data be S = n1 i=1 (Xi − (X))(Xi − (X))> . Here X = P n 1 i=1 Xi is the sample mean. Then, the Gaussian log likelihood of the inverse covariance Q can be shown to n be equal to log det Q−tr(SQ). In the Graphical Lasso, we maximise this likelihood subject to an element-wise L1 norm penalty on Q. Precisely, we solve ˆ = arg min log det Q − tr(SQ) − ρkQk1,1 . Q Q

ˆ The estimated neighborhood is then the non-zero elements of Q. 3.2.2

Coupled Lasso

Here, we focus on just one row and column at each step.  Q Q= > `

First we write  ` . λ

At each step we solve for `. The difference with the MB algorithm is essentially that the resulting Lasso problems are coupled since we solve them iteratively. This coupling is essential for stability under noise.

4

Time Varying Networks

Consider the time varying network below. At each time step we have a process which is characterised by a network whose structure changes with time.

8

10 : Gaussian graphical models and Ising models: modeling networks

Our goal is to infer the structure at each time step using just one datum (or a limited amount of data) for each time step. However, the change in structure from one time step to another is assumed to be smooth so we wish to use data from other time steps too when estimating structure at one step. We review a series of methods in time varying networks.

4.1

KELLER

KELLER, for Kernel Weighted L1 -regularised Logistic Regression, solves the following optimisation problem over θit ∈ Rp−1 to estimate the neighborhood of i at time step t, θˆit = arg min `w (θit ) + λkθit k1 . t θi

PT s s Here `w (·) is the time weighted conditional log likelihood, `w (θ) = s=1 w(xt , xs ) log p(xi |x−i ; θ). The conditional likelihood is given by a logistic regression model. The weighting w is chosen so that neighboring time steps to t have higher weight than time steps far away from t. For instance, if each time step corresponds to an actual time instant t0 , the authors recommend using the following weighting scheme. Kh (t0 − s0 ) wt (s) = PT T . 0 0 t=1 KhT (t − s ) Here, t0 , s0 are the times corresponding to the time steps t, s and Kh is a smoothing kernel with bandwidth h. The dependence of h on T is made explicit since we might want to adjust the bandwidth depending on how regularly we observe the data. The authors also show that under certain regularity conditions on the problem, we also have a consistent estimator for the structure at time t. Precisely,      CnhT 0 t b , P G(λ, hT , t) 6= G ∈ O exp − 3 + C log p sT where C, C 0 are constants. The right hand side goes to zero as we have data from more time steps, i.e. as T increases.

4.2

TESLA

TESLA, or Temporally Smoothed L1 -regularised Logistic Regression, solves the following optimisation problem for estimating time-varying networks. It optimises over the parameters of one node over all times steps jointly. Precisely, it solves the following optimisation problem. θˆi1 , . . . , θˆiT = argmin

T X

θˆi1 ,...,θˆiT t=1

`avg (θit ) + λ1

T X i=1

kθit k1 + λ2

T X i=1

kθit − θit−1 kqq .

10 : Gaussian graphical models and Ising models: modeling networks

9

PN t

PT log p(xtd,i |xtd,−i , θit ) is the conditional log likelihood. The i=1 kθit k1 penalty term PT encourages sparse solutions while i=1 kθit − θit−1 kqq encourages smoothness across different time steps. It is easy to see that the above problem can be cast as a constrained convex optimisation problem. However, the authors also recommend the following alternative estimation scheme. (i) First estimate the block partition on which the coefficient functions are constant via the following,

Here, `avg (θit ) =

1 Nt

d=1

min β

T X

(Yi − Xi β(ti ))2 + 2λ2

i=1

p X

kβk kT V .

k=1

(ii) Then, estimate the coefficient functions on each block of the partition, X minp (Yi − Xi γ)2 + 2λ1 kγk1 . γ∈R

i∈nbd(j)

The advantage of the two step procedure is that choosing the parameters λ1 , λ2 is now easier and the optimisation problem is also faster. The authors show that structure estimation in TESLA using the alternative procedure is consistent under certain regularity conditions. In contrast to KELLER, it does not need any smoothness assumptions and can accommodate abrupt changes.

4.3

Other Graph estimation Scenarios

Recently, there has been a lot of interest in structure estimation in more complex scenarios. Some examples include time varying Bayesian networks (Song et al. 2009), estimation with missing data (Kolar & Xing 2012)and multi-attribute data (Kolar et al. 2013)

10-708: Probabilistic Graphical Models 10-708, Spring 2016

11 : Factor Analysis and State Space Models Lecturer: Eric P. Xing

1

Scribes: Rahul Nallamothu, Syed Zahir Bokhari, Yu Zhang

Background Review

In this section, we will review some of mathematical concepts that will be used later in the material.

1.1

Multivariate Gaussian

The pdf of joint Gaussian distribution of x1 , x2 can be written in block form as    ! x1 µ p | =N x2 Σ

     x1 µ Σ11 ; 1 , x2 µ2 Σ21

Σ12 Σ22

!

The joint probability can also be written as: p(x1 , x2 ) =

  Ω11 T T exp [(x − µ ) , (x − µ ) ] 1 1 2 2 Ω21 (2π)N/2 |E|1/2 1

  Ω12 [(x1 − µ1 ), (x2 − µ2 )]T Ω22

where Ω = Σ−1 =

 Ω11 Ω21

Ω12 Ω22



−1 −1 T −1 T Ω11 = Σ−1 Σ12 Σ−1 11 + Σ11 Σ12 (Σ22 − Σ12 Σ11 Σ12 ) 11 −1 T −1 T −1 Ω22 = Σ−1 Σ12 Σ−1 22 + Σ22 Σ12 (Σ11 − Σ12 Σ22 Σ12 ) 22 −1 T −1 Ω12 = ΩT21 = −Σ−1 11 Σ12 (Σ22 − Σ12 Σ11 Σ12 )

We can show that p(x1 , x2 ) can also be written as: p(x1 , x2 ) = N (x1 ; µ1 , Σ11 )N (x2 ; m2|1 , V2|1 ) = N (x2 ; µ2 , Σ22 )N (x1 ; m1|2 , V1|2 )

1

2

11 : Factor Analysis and State Space Models

where, m1|2 = µ1 + Σ12 Σ−1 22 (x2 − µ2 ), V1|2 = Σ11 − Σ12 Σ−1 22 Σ21 m2|1 = µ2 + ΣT12 Σ−1 11 (x1 − µ1 ), V2|1 = Σ22 − ΣT12 Σ−1 11 Σ12

Now, using the above form we can write the marginal and conditional probabilities as the following: M p(x1 ) = N (x1 |mM 1 , v1 )

mM 1 = µ1 v1M = Σ11 M p(x2 ) = N (x2 |mM 2 , v2 )

mM 2 = µ2 v2M = Σ22 p(x1 |x2 ) = N (x1 |m1|2 , V1|2 ) m1|2 = µ1 + Σ12 Σ−1 22 (x2 − µ2 ) V1|2 = Σ11 − Σ12 Σ−1 22 Σ21

1.2

Matrix Inversion

It is also useful to remember the inversion of matrices written in block form. Consider such a matrix M to be:   E F M= G H Then we can write the inverse of Matrix M as

 −1 E + E −1 F (M/E)−1 GE −1 M= −(M/E)−1 GE −1

−E −1 F (M/E)−1 (M/E)−1



The following matrix inversion lemma will also be used further in this material:

(E − F H −1 G)−1 = E −1 + E −1 F (H − GE −1 F )−1 GE −1

1.3

Matrix Algebra

In this section we will look at some formulae involving, traces, determinants and derivatives

11 : Factor Analysis and State Space Models

3

tr[A] =

X

aii

i

The cyclical property of trace: tr[ABC] = tr[CAB] = tr[BCA] Derivatives involving trace: ∂tr[BA] = BT ∂A ∂tr[xxT A] ∂tr[xT Ax] = = xxT ∂A ∂A

Derivatives of determinants: ∂log|A| = A−1 ∂A

2

Factor Analysis

Factor analysis is a latent variable model where the latent variable is a continuous random vector. So, the model essentially is X → Y where X is continuous, hidden and Y is continuous, observed. Geometrically, it can be interpreted as sampling X from a Gaussian in low-dimensional subspace and then generating Y by sampling a normal distribution conditioned on X. The following figure from slides illustrates that.

Figure 1: Illustration of Factor Analysis Model The problem of estimating X based on Y is a dimensionality reduction problem.

4

2.1

11 : Factor Analysis and State Space Models

Inference

X is a p-dimensional variable, Y is a q-dimensional variable where p < q and we begin with X and Y |X.

X ∼ N (0, I) Y |X ∼ N (µ + ΛX, Ψ) Since, the distributions of X and Y |X are both gaussian, the marginals, conditional and joint probabilities are all gaussian. So, these distributions are characterized by their mean and variance. To calculate the marginals:

Y = µ + ΛX + W E[Y ] = E[µ + ΛX + W ] = E[µ] + ΛE[X] + E[W ] =µ+0+0 =µ V ar[Y ] = E[(Y − µ)(Y − µ)T ] = E[(µ + ΛX + W − µ)(µ + ΛX + W − µ)T ] = E[(ΛX + W )(ΛX + W )T ] = E[ΛXX T ΛT + W W T ] = ΛE[XX T ]ΛT + E[W W T ] = ΛΛT + Ψ

To write the joint distribution we also need the covariance of X and Y. Cov[X, Y ] = E[(X − 0)(Y − µ)T ] = E[(X)(µ + ΛX + W − µ)T ] = E[XX T ΛT + XW T ] = ΛT Simillarly, Cov[Y,X] = Λ. Therfore the joint probability can be written as:  ! X p =N Y

     X 0 I ; , Y µ Λ

ΛT ΛΛT + Ψ

!

Applying Gaussian Conditioning formulae shown in section 1, we get the following result for the posterior of latent variable X, given Y.

p(X|Y ) = N (X|m1|2 , V1|2 )

11 : Factor Analysis and State Space Models

5

where, m1|2 = µ1 + Σ12 Σ−1 22 (Y − µ2 ) = ΛT (ΛΛT + Ψ)−1 (Y − µ) V1|2 = Σ11 − Σ12 Σ−1 22 Σ21 = I − ΛT (ΛΛT + Ψ)−1 Λ

Inverting (ΛΛT + Ψ) involves inverting a |y| × |y| matrix, So we use matrix lemma to replace it by (I + ΛT Ψ−1 Λ)−1 . So, the final equations for m1|2 and V1|2 are: V1|2 = (I + ΛT Ψ−1 Λ)−1 m1|2 = V1|2 ΛT Ψ−1 (Y − µ)

2.2

Learning

In this section, learning strategy of Factor Analysis will be discussed. In previous sections we have known that there are three parameters to be learned: • Loading matrix Λ • Manifold center µ • Variance Ψ So we are able to formalize the problem as a log likelihood function: [Λ∗ , µ∗ , Ψ∗ ] = argmax(Loglikelihood(Y ))

The incomplete log likelihood If we consider the incomplete data log likelihood function, which in factor analysis is the marginal density of y, we have:

1 X N log|ΛΛT + Ψ| − { (yn − µ)T (ΛΛT + Ψ)−1 (yn − µ)} 2 2 n X N 1 = − log|ΛΛT + Ψ| − tr[(ΛΛT + Ψ)−1 S], where S = (yn − µ)T (yn − µ) 2 2 n

l(θ|D) = −

Obviously, estimating µ is trivial, but parameters Λ and Ψ are still coupled non-linearly in the expression. To decouple the parameters and obtain a simple algorithm for MLE, we consider EM algorithm in the following part.

6

11 : Factor Analysis and State Space Models

EM algorithm As we have learned a few classes earlier, complete log likelihood would be the objective function that we consider after taking the expectation. Suppose here that we have ”complete data”, which means X and Y are both observed, it’s clear that the estimation of the distribution of X would reduce to a Gaussian density estimation problem. So, in E step, we will try to fill in X by calculating the expected complete log likelihood and identify the expected sufficient statistics. Then, in M step, we will reduce to just estimating Λ and Ψ using linear regression. E step

The complete likelihood is simply a product of Gaussian distributions. 1X T 1X N log|Ψ| − xn xn − (yn −n )T Ψ−1 (yn − Λn ) 2 2 n 2 n N 1X 1X = − log|Ψ| − tr[xn xTn ] − tr[(yn − Λxn )(yn − Λxn )T Ψ−1 ] 2 2 n 2 n

lc (θ|Dc ) = −

=−

N N log|Ψ| − tr(SΨ−1 ) 2 2

where we have: S=

1 X (yn − Λxn )(yn − Λxn )T N n

Take the expectation

Q(θ|θ(t) ) = −

N N log|Ψ| − tr(hSiΨ−1 ) 2 2

Here the conditional expectation hsi is

1 X hyn ynT − yn XnT ΛT − ΛXn ynT + ΛXn XnT ΛT i N n 1 X = hyn ynT − yn hXnT iΛT − ΛhXn iynT + ΛhXn XnT iΛT ) N n

hsi =

To draw a conclusion, the expected sufficient statistics that we need are actually conditional expectations hXn i and hXn xTn i. We’ve already have these expectation derived in the previous sections. Thus hXn i = E(Xn |Yn ) hXn xTn i = V ar(Xn |yn ) + E(Xn |yn )E(Xn |yn )T M step Since we have ”filled in” X in the E step by calculating sufficient statistics, we are able to compute parameters by means of taking the derivative of expected complete log likelihood Q with respect to parameters.

11 : Factor Analysis and State Space Models

7

∂ N ∂ 1X 1X T hl i = (− log|Ψ| − tr[x x ] − tr[(yn − Λxn )(yn − Λxn )T Ψ−1 ]) c n n ∂Ψ−1 ∂Ψ−1 2 2 n 2 n =

N N Ψ − hSi 2 2

Here we have Ψt+1 = hsi

∂ N 1X 1X ∂ hlc i = (− log|Ψ| − tr[xn xTn ] − tr[(yn − Λxn )(yn − Λxn )T Ψ−1 ]) ∂Λ ∂Λ 2 2 n 2 n N −1 ∂ Ψ hSi 2 ∂Λ N ∂ 1 X ( = − Ψ−1 hyn ynT − yn hXnT iΛT − ΛhXn iynT + ΛhXn XnT iΛT )) 2 ∂Λ N n X X = Ψ−1 yn hXnT i − Ψ−1 Λ hXn XnT i =−

n

n

P P Here we have Λt+1 = ( n yn hXnT i)( n hXn XnT i)−1

Model Invariance and Identifiability Since Λ only appear as outer product ΛΛT , the model is invariant to rotation and axis flips of the latent space: (ΛQ)(ΛQ)T = Λ(QQT )ΛT = ΛΛT This means there is no optimal solution of parameter estimations. Such models are called un-identifiable since multiple sets of parameters would be obtained when filling in same set of parameters.

3 3.1

State Space Models Introduction

We have just learned Factor Analysis, whose latent and observed variables are both continuous Gaussians. If we connect multiple factor analysis models as what we do to mixture model, we can get a HMM-like graphical model, which is called State Space Model as shown in Fig. 2

Figure 2: Graphical model for SSM

8

11 : Factor Analysis and State Space Models

Here we have: xt = Axt−1 + Gwt yt = Cxt−1 + vt wt ∼ N (0; Q), vt N (0; R) x0 ∼ N (0; Σ0 )

3.2

Inference

There are two interesting inference problems worth mentioning in SMM model: filtering and smoothing.

Filtering is a way to perform exact inference in an Linear Dynamic System, to infer the current latent variable based on current as well as previous observed variables. The problem of filtering is formalized as computing P (xt |y1:t ): p(Xt = i|y1:t ) = αti ∝ p(yt |Xt = i)

X

j p(Xt = i|Xt−1 = j)αt−1

j

Fig. 3 is an example graph for this problem.

Figure 3: Graphical model for SSM

Smoothing is another inference problem in which we compute the current latent variable given observables at all time steps, which can be formalized as computing P (xt |y1:T ): p(xt |y1:T ) = γti ∝

X

j j |Xij )γt+1 αti P (Xt+1

j

Fig. 4 is an example graph for this problem.

Figure 4: Graphical model for SSM

11 : Factor Analysis and State Space Models

4 4.1

9

Kalman Filtering Overview of derivation

Kalman filtering is an online filtering algorithm for use on state space models. It is widely used and particularly efficient. Since the state space models we are dealing with all have conditional probability distributions that are linear Gaussian, the system defines a large multivariate Gaussian. This means that all the marginals are Gaussian, and we can represent the belief state p(X|y1:t ) as a Gaussian with mean E[Xt |y1:t ] = µt|t and covariance E[(Xt − µt|t )(Xt − µt|t )T ] = Pt|t It is common to work with the inverse of the covariance matrix, called the precision matrix. This is known as the information form. Kalman filter is a recursive procedure to update the belief state. it has two main phases: the predict step, and the update step. Essentially, instead of trying to solve for p(Xt+1 |y1:t+1 ) directly, we break it into two parts. In the predict step, we seek to compute p(Xt+1 |y1:t ) from the prior belief p(Xt |y1:t ) and the dynamics model p(Xt+1 |Xt ). This is called the time update. In the update step, we compute our goal p(Xt+1 |y1:t+1 ) from the prediction p(Xt+1 |y1:t ), the observation yt+1 , and the observation model p(yt+1 |Xt+1 ). This is called the measurement update. The advantage of this process is that, since the variables are Gaussian, then everything else ends up being Gaussian. Recall from the beginning of the notes that if we have      z1 µ1 Σ ∼N , 11 z2 µ2 Σ21

Σ12 Σ22



This means that, given that the marginal z1 is Gaussian, then the joint z1 z2 is Gaussian, the marginal z2 is Gaussian, and finally the conditional z2 |z1 must also be Gaussian. The Kalman filter essentially follows this process. Given p(Xt |y1:t ) is Gaussian, we can get p(Xt+1 |y1:t ), with xt+1 = f (xt ) = Axt + w. Then from this result and the observation model yt+1 = Cxt+1 + v, we can finally get p(Xt+1 , yt+1 |y1:t ).

4.2

Predict Step

For the dynamical model, we have x_{t+1} = A x_t + G w_t, where w_t ∼ N(0, Q). We wish to find the parameters of the distribution p(X_{t+1} | y_{1:t}). Since everything here is Gaussian, this means we want the mean and covariance of this distribution. For the one-step-ahead prediction of the state:

E[X_{t+1} | y_{1:t}] = E[A X_t + G w_t | y_{1:t}] = A µ_{t|t} + 0 = x̂_{t+1|t},  so x̂_{t+1|t} = A µ_{t|t}

E[(X_{t+1} − x̂_{t+1|t})(X_{t+1} − x̂_{t+1|t})^T | y_{1:t}]
= E[(A X_t + G w_t − A µ_{t|t})(A X_t + G w_t − A µ_{t|t})^T | y_{1:t}]
= E[(A X_t − A µ_{t|t})(A X_t − A µ_{t|t})^T | y_{1:t}] + E[(A X_t − A µ_{t|t}) w_t^T G^T | y_{1:t}] + E[G w_t (A X_t − A µ_{t|t})^T | y_{1:t}] + E[G w_t w_t^T G^T | y_{1:t}]
= A E[(X_t − µ_{t|t})(X_t − µ_{t|t})^T | y_{1:t}] A^T + 0 + 0 + G E[w_t w_t^T | y_{1:t}] G^T
= A P_{t|t} A^T + G Q G^T = P_{t+1|t}

And so the prediction for the dynamical model has mean x̂_{t+1|t} = A µ_{t|t} and covariance P_{t+1|t} = A P_{t|t} A^T + G Q G^T.

For the observation model, we have y_t = C x_t + v_t, where v_t ∼ N(0, R). We now wish to find the parameters of the predicted observation. Once again it is Gaussian, since all of the pieces are Gaussian. For the one-step-ahead prediction of the observation:

E[Y_{t+1} | y_{1:t}] = E[C X_{t+1} + v_{t+1} | y_{1:t}] = C x̂_{t+1|t} + 0,  so ŷ_{t+1|t} = C x̂_{t+1|t}

E[(Y_{t+1} − ŷ_{t+1|t})(Y_{t+1} − ŷ_{t+1|t})^T | y_{1:t}] = E[(C X_{t+1} + v_{t+1} − C x̂_{t+1|t})(C X_{t+1} + v_{t+1} − C x̂_{t+1|t})^T | y_{1:t}] = C E[(X_{t+1} − x̂_{t+1|t})(X_{t+1} − x̂_{t+1|t})^T] C^T + E[v_{t+1} v_{t+1}^T] = C P_{t+1|t} C^T + R

E[(Y_{t+1} − ŷ_{t+1|t})(X_{t+1} − x̂_{t+1|t})^T | y_{1:t}] = E[(C X_{t+1} + v_{t+1} − C x̂_{t+1|t})(X_{t+1} − x̂_{t+1|t})^T | y_{1:t}] = C E[(X_{t+1} − x̂_{t+1|t})(X_{t+1} − x̂_{t+1|t})^T] = C P_{t+1|t}

And so, for the observation variable, the mean is C x̂_{t+1|t}, the variance is C P_{t+1|t} C^T + R, and the covariance with the state variable is C P_{t+1|t}.

4.3

Update Step

Will be continued in the next lecture

10-708: Probabilistic Graphical Models 10-708, Spring 2016

12 : Variational Inference I Lecturer: Eric P. Xing

Scribes: Jing Chen, Yulan Huang, Yu-Fang Chang

1 Kalman Filtering

In the last lecture, we introduced Kalman filtering as a recursive procedure to update the belief state. Each iteration has two steps: a Predict Step and an Update Step. In the Predict Step, we compute the latent state distribution P(X_{t+1} | y_{1:t}) from the prior belief P(X_t | y_{1:t}) and the dynamics model p(X_{t+1} | X_t); this step is also called the time update. In the Update Step, we compute the new belief over the latent state, p(X_{t+1} | y_{1:t+1}), from the prediction p(X_{t+1} | y_{1:t}) and the observation y_{t+1}, using the observation model p(y_{t+1} | X_{t+1}); this step is also called the measurement update, since it uses the measured information y_{t+1}. The reason for doing so is that under a joint multivariate Gaussian distribution we can compute conditional and marginal distributions easily. Since all distributions are Gaussian, their linear combinations are also Gaussian. Hence we just need the mean and covariance, which can be computed easily in this case, to describe each distribution.

1.1

Derivation

Our goal is to compute p(x_{t+1} | y_{1:t+1}). We first utilize the dynamics model to find the parameters of the distribution p(X_{t+1} | y_{1:t}). The dynamics model defines x_{t+1} as x_{t+1} = A x_t + G w_t, where w_t is noise with zero mean and covariance matrix Q. We can then predict the mean x̂_{t+1|t} and covariance P_{t+1|t} of the distribution p(X_{t+1} | y_{1:t}) as follows:

x̂_{t+1|t} = E[x_{t+1} | y_{1:t}] = E[A x_t + G w_t | y_{1:t}] = A x̂_{t|t} + 0 = A x̂_{t|t}

P_{t+1|t} = E[(x_{t+1} − x̂_{t+1|t})(x_{t+1} − x̂_{t+1|t})^T | y_{1:t}]
= E[(A x_t + G w_t − x̂_{t+1|t})(A x_t + G w_t − x̂_{t+1|t})^T | y_{1:t}]
= E[(A x_t + G w_t − A x̂_{t|t})(A x_t + G w_t − A x̂_{t|t})^T | y_{1:t}]
= E[(A x_t − A x̂_{t|t})(A x_t − A x̂_{t|t})^T + G w_t (A x_t − A x̂_{t|t})^T + (A x_t − A x̂_{t|t}) w_t^T G^T + G w_t w_t^T G^T | y_{1:t}]
= A P_{t|t} A^T + 0 + 0 + G Q G^T = A P_{t|t} A^T + G Q G^T

For the observation model, we have y_t = C x_t + v_t, with v_t ∼ N(0, R). We can then derive the mean and covariance of the observation and its covariance with the state variable:

E[Y_{t+1} | y_{1:t}] = E[C x_{t+1} + v_{t+1} | y_{1:t}] = C x̂_{t+1|t} = ŷ_{t+1|t}

Var(y_{t+1} | y_{1:t}) = E[(Y_{t+1} − ŷ_{t+1|t})(Y_{t+1} − ŷ_{t+1|t})^T | y_{1:t}] = E[(C X_{t+1} + v_{t+1} − C x̂_{t+1|t})(C X_{t+1} + v_{t+1} − C x̂_{t+1|t})^T | y_{1:t}] = C P_{t+1|t} C^T + R

Cov(y_{t+1}, x_{t+1} | y_{1:t}) = E[(Y_{t+1} − ŷ_{t+1|t})(X_{t+1} − x̂_{t+1|t})^T | y_{1:t}] = E[(C X_{t+1} + v_{t+1} − C x̂_{t+1|t})(X_{t+1} − x̂_{t+1|t})^T | y_{1:t}] = C P_{t+1|t}

Cov(x_{t+1}, y_{t+1} | y_{1:t}) = Cov(y_{t+1}, x_{t+1} | y_{1:t})^T = P_{t+1|t} C^T

Next, we can combine the above results to get the joint distribution p(X_{t+1}, Y_{t+1} | y_{1:t}) ∼ N(m_{t+1}, V_{t+1}) with

m_{t+1} = [x̂_{t+1|t}; C x̂_{t+1|t}],   V_{t+1} = [P_{t+1|t}, P_{t+1|t} C^T; C P_{t+1|t}, C P_{t+1|t} C^T + R]

The same blocks keep reappearing, so we introduce the Kalman gain matrix K to avoid repeating this computation:

K = Cov(x_{t+1}, y_{t+1} | y_{1:t}) Var(y_{t+1} | y_{1:t})^{−1} = P_{t+1|t} C^T (C P_{t+1|t} C^T + R)^{−1}

Since K does not require the new observation (in other words, it is independent of the data), it can be precomputed. For the measurement update, by the formula for the conditional Gaussian distribution, we have

x̂_{t+1|t+1} = x̂_{t+1|t} + Cov(x_{t+1}, y_{t+1} | y_{1:t}) Var(y_{t+1} | y_{1:t})^{−1} (y_{t+1} − ŷ_{t+1|t}) = x̂_{t+1|t} + K(y_{t+1} − C x̂_{t+1|t})

P_{t+1|t+1} = P_{t+1|t} − K C P_{t+1|t}

and we are finally done with the derivation.
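The predict and measurement updates derived above fit in a few lines of NumPy. The sketch below is illustrative only: the matrices A, G, Q, C, R and the observation values are made-up, not taken from the lecture.

```python
import numpy as np

def kalman_step(mu, P, y, A, G, Q, C, R):
    """One predict + measurement-update cycle of the Kalman filter."""
    # Predict (time update): parameters of p(x_{t+1} | y_{1:t})
    mu_pred = A @ mu                         # x_hat_{t+1|t} = A mu_{t|t}
    P_pred = A @ P @ A.T + G @ Q @ G.T       # P_{t+1|t} = A P A^T + G Q G^T

    # Innovation statistics and Kalman gain
    y_pred = C @ mu_pred                     # y_hat_{t+1|t}
    S = C @ P_pred @ C.T + R                 # Var(y_{t+1} | y_{1:t})
    K = P_pred @ C.T @ np.linalg.inv(S)      # K = P C^T (C P C^T + R)^{-1}

    # Measurement update: parameters of p(x_{t+1} | y_{1:t+1})
    mu_new = mu_pred + K @ (y - y_pred)      # mean shifted by gain times innovation
    P_new = P_pred - K @ C @ P_pred          # covariance shrinks after observing y
    return mu_new, P_new

# Illustrative 2D position/velocity example (made-up numbers).
A = np.array([[1.0, 1.0], [0.0, 1.0]])
G = np.eye(2); Q = 0.01 * np.eye(2)
C = np.array([[1.0, 0.0]]); R = np.array([[0.25]])
mu, P = np.zeros(2), np.eye(2)
for y in [np.array([0.9]), np.array([2.1]), np.array([2.8])]:
    mu, P = kalman_step(mu, P, y, A, G, Q, C, R)
print(np.round(mu, 3))
```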

1.2

Example

Now we look at a simple example of the Kalman filter. We consider noisy observations of a 1D particle moving randomly:

x_t = x_{t−1} + w,  w ∼ N(0, σ_x)
z_t = x_t + v,  v ∼ N(0, σ_z)

Plugging into the Kalman filter equations gives us the new mean and variance:

P_{t+1|t} = A P_{t|t} A^T + G Q G^T = σ_t + σ_x
x̂_{t+1|t} = A x̂_{t|t} = x̂_{t|t}
K = P_{t+1|t} C^T (C P_{t+1|t} C^T + R)^{−1} = (σ_t + σ_x)/(σ_t + σ_x + σ_z)
x̂_{t+1|t+1} = x̂_{t+1|t} + K_{t+1}(z_{t+1} − C x̂_{t+1|t}) = ((σ_t + σ_x) z_{t+1} + σ_z x̂_{t|t}) / (σ_t + σ_x + σ_z)
P_{t+1|t+1} = P_{t+1|t} − K C P_{t+1|t} = (σ_t + σ_x) σ_z / (σ_t + σ_x + σ_z)

We can see that, with this initial setting, the noise does not generate a non-zero shift in the mean, since the noise is zero-centered, but the variance does change. We begin with P(x_0) and, based on the transition model, predict the distribution at the next time step. The result is a Gaussian with a wider variance: the particle moves due to the noise, which increases the uncertainty about its position. Once we have a new observation z_{t+1}, the mean is shifted toward it and the uncertainty is reduced, which induces a new belief about the distribution of the particle. This gives the intuition that, once we have an observation, we can update the mean as follows:

x̂_{t+1|t+1} = x̂_{t+1|t} + K_{t+1}(z_{t+1} − C x̂_{t+1|t}) = ((σ_t + σ_x) z_{t+1} + σ_z x̂_{t|t}) / (σ_t + σ_x + σ_z),

where the term (z_{t+1} − C x̂_{t+1|t}) is called the innovation.

2 Background Review

2.1

Inference Problem

In this lecture, we look deeply into variational inference in graphical models. Given a graphical model G and a corresponding distribution P defined over a set of variables (nodes) V, the general inference problem involves the computation of:
• the likelihood of observed data
• the marginal distribution p(x_A) over a particular subset of nodes A ⊂ V
• the conditional distribution p(x_A | x_B) for disjoint subsets A and B
• a mode of the density, x̂ = arg max_{x∈X^m} p(x)

Previous lectures have covered several exact inference algorithms such as brute-force enumeration, variable elimination, sum-product, and the junction tree algorithm. More specifically, while brute-force enumeration sums over all variables excluding the query nodes, variable elimination fully utilizes the graphical structure by taking ordered summations over variables. Both algorithms treat individual computations independently, which is to some degree wasteful. In contrast, message passing approaches such as sum-product and belief propagation are considered more efficient as a result of sharing intermediate terms. However, the above approaches typically apply to trees and can hardly converge on non-tree structures, in other words, loopy graphs. One exception is the junction tree algorithm, which provides a way to convert an arbitrary loopy graph to a clique tree and then perform exact inference on the clique tree by passing messages. Nevertheless, few people actually use this approach because, though the junction tree is locally and globally consistent, it is too expensive, with computational cost exponential in the maximum number of nodes in each clique. In order to solve the inference problem on loopy graphs, we introduce approximate inference approaches, and focus on loopy belief propagation in this lecture.

2.2

Review of Belief Propagation

We first give a brief review of Belief Propagation (BP), a classical message passing algorithm. Figure 1a demonstrates the message update rules in BP, where node i passes a message to each of its


neighboring nodes j once it has received messages from all its neighbors excluding j. The message update rule can be formulated as:

m_{i→j}(x_j) ∝ Σ_{x_i} ψ_{ij}(x_i, x_j) φ_i(x_i) Π_{k∈N(i)\j} m_{k→i}(x_i)

where ψ_{ij}(x_i, x_j) is called the compatibility (interaction) and φ_i(x_i) is called the external evidence.

Further, the marginal probability of each node in the graph can then be computed via the procedure shown in Figure 1b, which can be formulated as:

b_i(x_i) ∝ φ_i(x_i) Π_{k∈N(i)} m_{k→i}(x_i)

In particular, it is worth noticing that BP on trees always converges to the exact marginals, as the junction tree algorithm reveals.

(a) BP message passing procedure

(b) BP node marginal

Figure 1: BP Message-update Rules

Moreover, we can generalize the BP model to factor graphs, where square nodes denote factors and circle nodes denote variables, as in Figure 2. Similarly, we can compute the marginal probabilities for variable i and factor a as Figure 2 shows:

b_i(x_i) ∝ f_i(x_i) Π_{a∈N(i)} m_{a→i}(x_i)
b_a(X_a) ∝ f_a(X_a) Π_{i∈N(a)} m_{i→a}(x_i)

where we call b_i(x_i) "beliefs" and m_{a→i} "messages". Messages are passed in two ways: (1) from variable i to factor a; (2) from factor a to variable i, which are written as:

m_{i→a}(x_i) = Π_{c∈N(i)\a} m_{c→i}(x_i)
m_{a→i}(x_i) = Σ_{X_a\x_i} f_a(X_a) Π_{j∈N(a)\i} m_{j→a}(x_j)


Figure 2: BP on Factor Graph

3

Loopy Belief Propagation

Now we consider inference on an arbitrary graph. As mentioned, most of the previously discussed algorithms apply to tree-structured graphs, and even those that can run on arbitrary (loopy) graphs suffer from expensive computational cost; it is also hard to guarantee their behavior on such graphs, with the junction tree algorithm being an exact but expensive exception. Therefore, the Loopy Belief Propagation (LBP) algorithm is proposed. LBP can be viewed as a fixed-point iterative procedure that tries to minimize F_Bethe. More specifically, LBP starts with a random initialization of messages and beliefs, and iterates the following steps until convergence:

b_i(x_i) ∝ Π_{a∈N(i)} m_{a→i}(x_i)
b_a(X_a) ∝ f_a(X_a) Π_{i∈N(a)} m_{i→a}(x_i)
m^{new}_{i→a}(x_i) = Π_{c∈N(i)\a} m_{c→i}(x_i)
m^{new}_{a→i}(x_i) = Σ_{X_a\x_i} f_a(X_a) Π_{j∈N(a)\i} m_{j→a}(x_j)

However, it is not clear whether such a procedure will converge to a correct solution. In the late 90s, much research was devoted to investigating the theory behind this algorithm. Murphy et al. (1999) revealed empirically that a good approximation is still achievable if we:
• stop after a fixed number of iterations
• stop when there is no significant change in beliefs
• note that if the solution is not oscillatory but converges, it usually is a good approximation
However, is the good performance of LBP just a dirty hack? In the following sections, we try to understand the theoretical characteristics of this algorithm and show that LBP can lead to an almost optimal approximation to the actual distribution.
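Before analyzing LBP theoretically, here is a small runnable sketch of the message-passing loop on a hypothetical three-node binary MRF containing a single loop; the pairwise compatibilities ψ and evidence φ are made-up values, and the loop simply runs a fixed number of synchronous sweeps of the update equations above.

```python
import numpy as np

# Hypothetical 3-node binary MRF with a loop: edges (0,1), (1,2), (0,2).
edges = [(0, 1), (1, 2), (0, 2)]
psi = {e: np.array([[1.0, 0.5], [0.5, 1.0]]) for e in edges}   # compatibilities favor agreement
phi = [np.array([0.7, 0.3]), np.array([0.4, 0.6]), np.array([0.5, 0.5])]  # external evidence

def neighbors(i):
    return [j for (a, b) in edges for j in ((b,) if a == i else (a,) if b == i else ())]

# msgs[(i, j)] is the message from node i to node j, initialized uniform.
msgs = {}
for i, j in edges:
    msgs[(i, j)] = np.ones(2) / 2
    msgs[(j, i)] = np.ones(2) / 2

for _ in range(50):                              # fixed number of sweeps
    new_msgs = {}
    for (i, j) in list(msgs):
        pair = psi[(i, j)] if (i, j) in psi else psi[(j, i)].T   # pair[x_i, x_j]
        # m_{i->j}(x_j) ∝ sum_{x_i} psi_{ij}(x_i, x_j) phi_i(x_i) prod_{k in N(i)\j} m_{k->i}(x_i)
        prod_in = phi[i].copy()
        for k in neighbors(i):
            if k != j:
                prod_in *= msgs[(k, i)]
        m = pair.T @ prod_in
        new_msgs[(i, j)] = m / m.sum()
    msgs = new_msgs

# Beliefs: b_i(x_i) ∝ phi_i(x_i) prod_{k in N(i)} m_{k->i}(x_i)
for i in range(3):
    b = phi[i].copy()
    for k in neighbors(i):
        b *= msgs[(k, i)]
    print(i, np.round(b / b.sum(), 3))
```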

3.1 Approximating a Probability Distribution

Let us denote the actual distribution of an arbitrary graph G as P:

P(X) = (1/Z) Π_{f_a∈F} f_a(X_a)

We wish to find a distribution Q such that Q is a "good" approximation to P. To achieve this, we first recall the definition of the KL-divergence:

KL(Q_1 || Q_2) = Σ_X Q_1(X) log( Q_1(X) / Q_2(X) )

satisfying:

KL(Q_1 || Q_2) ≥ 0
KL(Q_1 || Q_2) = 0 ⟺ Q_1 = Q_2
KL(Q_1 || Q_2) ≠ KL(Q_2 || Q_1)

Therefore, our goal of finding an optimal approximation can be converted to minimizing the KL-divergence between P and Q. However, the computation of KL(P || Q) requires inference steps on P while KL(Q || P) does not. Thus we adopt KL(Q || P):

KL(Q || P) = Σ_X Q(X) log( Q(X) / P(X) )
= Σ_X Q(X) log Q(X) − Σ_X Q(X) log P(X)
= −H_Q(X) − E_Q log( (1/Z) Π_{f_a∈F} f_a(X_a) )
= −H_Q(X) − Σ_{f_a∈F} E_Q log f_a(X_a) + log Z

Therefore we can write the objective as:

KL(Q || P) = −H_Q(X) − Σ_{f_a∈F} E_Q log f_a(X_a) + log Z = F(P, Q) + log Z,

where H_Q denotes the entropy of the distribution Q, and F(P, Q) is called the "free energy":

F(P, Q) = −H_Q(X) − Σ_{f_a∈F} E_Q log f_a(X_a)

In particular, we have F(P, P) = −log Z and F(P, Q) ≥ F(P, P). More specifically, we see that while Σ_{f_a∈F} E_Q log f_a(X_a) can be computed based on marginals over each f_a, the entropy H_Q = −Σ_X Q(X) log Q(X) is hard, since it requires a summation over all possible values of X. This makes the computation of F hard. One possible solution is to approximate F(P, Q) with some F̂(P, Q) that is easy to compute.


Figure 3: A tree graph

3.2

Deriving the Bethe Approximation

As shown in Figure 3, for a tree-structured graph the joint probability can be written as b(x) = Π_a b_a(x_a) Π_i b_i(x_i)^{1−d_i}, where a enumerates all edges in the graph, i enumerates all nodes in the graph, and d_i is the degree of vertex i. We can then calculate the entropy as well as the free energy:

H_tree = −Σ_a Σ_{x_a} b_a(x_a) ln b_a(x_a) + Σ_i (d_i − 1) Σ_{x_i} b_i(x_i) ln b_i(x_i)

F_Tree = Σ_a Σ_{x_a} b_a(x_a) ln( b_a(x_a) / f_a(x_a) ) + Σ_i (1 − d_i) Σ_{x_i} b_i(x_i) ln b_i(x_i)

Since the entropy and free energy only involve summations over edges and vertices, they are easy to compute.

Figure 4: An arbitrary graph

However, for an arbitrary graph such as the one in Figure 4, the entropy and free energy are hard to write down. Therefore, we use the Bethe approximation, which has exactly the same formula as the free energy of a tree-structured graph:

H_Bethe = −Σ_a Σ_{x_a} b_a(x_a) ln b_a(x_a) + Σ_i (d_i − 1) Σ_{x_i} b_i(x_i) ln b_i(x_i)

F_Bethe = Σ_a Σ_{x_a} b_a(x_a) ln( b_a(x_a) / f_a(x_a) ) + Σ_i (1 − d_i) Σ_{x_i} b_i(x_i) ln b_i(x_i)

As we can see, the advantage of the Bethe approximation is that it is easy to compute, since the entropy term only involves sums over pairwise and single variables. However, the approximation F̂(P, Q) may or may not be well connected to the true F(P, Q): there is no guarantee that it will be greater than, equal to, or less than the true F(P, Q). We want to find the beliefs b_a(x_a) and b_i(x_i) that minimize the KL divergence while still satisfying local consistency. For the discrete case, the local consistency constraints are:

∀i, x_i:  Σ_{x_i} b_i(x_i) = 1
∀a, i ∈ N(a), x_i:  Σ_{X_a\x_i} b_a(X_a) = b_i(x_i)

where N(a) denotes the variable nodes neighboring factor a. Thus, using Lagrange multipliers, the objective function becomes:

L = F_Bethe + Σ_i γ_i [1 − Σ_{x_i} b_i(x_i)] + Σ_a Σ_{i∈N(a)} Σ_{x_i} λ_{ai}(x_i) [b_i(x_i) − Σ_{X_a\x_i} b_a(X_a)]

We can find the stationary points by setting the derivatives to zero:

∂L/∂b_i(x_i) = 0  ⟹  b_i(x_i) ∝ exp( (1/(d_i − 1)) Σ_{a∈N(i)} λ_{ai}(x_i) )
∂L/∂b_a(X_a) = 0  ⟹  b_a(X_a) ∝ exp( log f_a(X_a) + Σ_{i∈N(a)} λ_{ai}(x_i) )

3.3 Bethe Energy Minimization using Belief Propagation

By setting the derivatives of the objective function to zero, in the previous section we obtained the update formulas for b_i(x_i) and b_a(X_a) in a factor graph. For the variables in the factor graph, with λ_{ai}(x_i) = log m_{i→a}(x_i) = log Π_{b∈N(i)\a} m_{b→i}(x_i), we have:

b_i(x_i) ∝ f_i(x_i) Π_{a∈N(i)} m_{a→i}(x_i)

For the factors, we have:

b_a(X_a) ∝ f_a(X_a) Π_{i∈N(a)} Π_{c∈N(i)\a} m_{c→i}(x_i)

Using b_{a→i}(x_i) = Σ_{X_a\x_i} b_a(X_a), we get

m_{a→i}(x_i) = Σ_{X_a\x_i} f_a(X_a) Π_{j∈N(a)\i} Π_{b∈N(j)\a} m_{b→j}(x_j)

As we can see, the Belief Propagation-update is in a sum product form and is easy to compute.

4

Theory Behind Loopy Belief Propagation

For a distribution p(X | θ) associated with a complex graph, computing the marginal (or conditional) probability of arbitrary random variable(s) is intractable. Thus, the variational methods instead optimize over an easier (tractable) distribution q. We want to find q* = argmin_{q∈S} F_Bethe(p, q). In principle we could optimize the exact entropy H_q, but that is still difficult, so we do not optimize q(X) explicitly; instead we relax the optimization problem to an approximate objective, the Bethe free energy of the beliefs F(b), together with a loose feasible set defined by local constraints. Loopy belief propagation can be viewed as a fixed-point iteration procedure that tries to optimize F(b). Loopy belief propagation often does not converge to the correct solution, although empirically it often performs well.

5

Generalized Belief Propagation

Instead of considering single nodes as in normal loopy belief propagation, Generalized Belief Propagation considers regions of the graph. This enables us to use a more accurate H_q for the approximation and achieve better results. In generalized belief propagation, instead of the Bethe free energy, one uses the Gibbs free energy, which is more general and achieves better results at the cost of an increased computational complexity. More precisely, Generalized Belief Propagation defines the belief in a region as the product of the local information (factors in the region), messages from parent regions, and messages into descendant regions from parents that are not descendants. Moreover, the message-update rules are obtained by enforcing marginalization constraints.

Figure 5: Example: Generalized Belief Propagation

Here is an example of this region construction. As shown in Figure 5, the graph contains four regions, each of which contains four nodes. When we compute beliefs, we integrate them hierarchically, which increases the computational cost. As shown in Figure 6, when we want to compute the belief for a particular region, we consider nodes that share regions and calculate their energies from their parent regions. As we can see, more regions can make the approximation more accurate, but also increase the computational cost.


Figure 6: Example: Hierarchically compute belief in Generalized Belief Propagation

10-708: Probabilistic Graphical Models 10-708, Spring 2016

13 : Mean Field Approximation and Topic Models Lecturer: Eric P. Xing

Scribes: Shichao Yang, Mengtian Li, Haoqi Fan

1 Mean field

1.1 Recall Loopy Belief Propagation

For a distribution p(X | θ) associated with a complex graph, especially one with loops, it is intractable to compute the approximating distribution q(X) (from the KL-divergence view) directly, or the marginal (or conditional) probability of arbitrary random variables. So instead, the variational methods optimize an approximate objective: q* = arg min_{q∈M} F_Bethe(p, q). However, optimizing F_Bethe(p, q) with respect to q(X) is still difficult. So we do not explicitly optimize q(X), but optimize b of the following form:

b = {b_{i,j} = τ(x_i, x_j), b_i = τ(x_i)}

namely a set of beliefs over singletons and edges. The constraint on b is local consistency:

M_o = {τ ≥ 0 | Σ_{x_i} τ(x_i) = 1, Σ_{x_i} τ(x_i, x_j) = τ(x_j)}

M_o is an over-approximation of the original feasible set M, namely M_o ⊇ M. LBP can be viewed as a fixed-point iteration procedure that tries to optimize F(b). LBP often does not converge to the correct solution, although empirically it often performs well.

1.2

Mean Field Introduction

The core idea behind MF is to optimize the posterior distribution q(x_H) within a space of tractable families to make it easier to compute. For example, we can approximate q(x_H) as:

q(z_1, ..., z_m) = Π_{j=1}^m q(z_j)

This assumes a complete factorization of the distribution over the individual latent variables, which is often referred to as "naive mean field". From a graphical view, we remove all the edges between variables. For more general settings, we can also assume that the variational distribution factorizes into R groups z_{G_1}, ..., z_{G_R}, which is referred to as "generalized mean field":

q(z_1, ..., z_m) = Π_{r=1}^R q(z_{G_r})


So there are different ways of choosing a subgraph representation to approximate the true distribution. The problem thus becomes: q* = arg min_{q∈T} (⟨E⟩_q − H_q). We are optimizing the exact objective, including the exact entropy H_q, but over a tightened feasible set T ⊆ Q. In the next lecture, we will introduce a unified point of view based on the variational principle; here we briefly summarize: the mean field method is a non-convex inner bound with the exact form of the entropy, while loopy belief propagation is a polyhedral outer bound with a non-convex Bethe approximation.

1.3

Naive Mean Field

We approximate p(x) by the fully factorized q(X) = Π_i q_i(X_i). For a Boltzmann distribution p(X) ∝ exp{Σ_{ij} θ_ij X_i X_j + Σ_i θ_i0 X_i}, the naive mean field update for each factor is

q_i(X_i) ∝ exp{ θ_i0 X_i + X_i Σ_{j∈N_i} θ_ij ⟨X_j⟩_{q_j} },

where ⟨X_j⟩_{q_j} resembles a "message" sent from node j to i. The set {⟨X_j⟩_{q_j} : j ∈ N_i} forms the "mean field" applied to X_i from its neighborhood, as shown in Figure 1(a).


Figure 1: (a) Naive mean field update. (b) Generalized mean field update.
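As a concrete illustration, here is a short sketch of the naive mean field fixed point for a small Boltzmann machine with binary variables X_i ∈ {0, 1}; the couplings and biases are made-up values, and the sigmoid form of the update is the standard one implied by the factorized q above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical 4-node Boltzmann machine on a cycle (made-up couplings and biases).
theta = np.array([[0.0, 1.0, 0.0, 1.0],
                  [1.0, 0.0, 1.0, 0.0],
                  [0.0, 1.0, 0.0, 1.0],
                  [1.0, 0.0, 1.0, 0.0]])     # pairwise weights theta_ij (symmetric, zero diagonal)
theta0 = np.array([0.5, -0.2, 0.1, -0.4])     # singleton weights theta_i0

# Naive mean field: q_i(X_i = 1) = mu_i, with fixed point
# mu_i = sigmoid(theta_i0 + sum_j theta_ij * mu_j),
# i.e. each node feels the "mean field" <X_j>_{q_j} = mu_j of its neighbors.
mu = np.full(4, 0.5)
for _ in range(100):
    for i in range(4):                        # coordinate-wise updates until convergence
        mu[i] = sigmoid(theta0[i] + theta[i] @ mu)
print(np.round(mu, 3))
```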

1.4

Generalized Mean Field

We can also apply more general forms of the mean field approximation, i.e. clusters of disjoint latent variables are treated as independent, while the dependencies of the latent variables within each cluster are preserved, as shown in Figure 1(b). Figure 2 shows the naive mean field for the Ising model, while Figure 3 shows the original Ising model, the generalized mean field with 2×2 clusters, and the generalized mean field with 4×4 clusters. The generalized mean field theorem shows that the optimum GMF approximation to a cluster marginal is isomorphic to the cluster posterior of the original distribution given the internal evidence and its generalized mean fields:

q*(X_{H,C_i}) = p(X_{H,C_i} | X_{E,C_i}, ⟨X_{H,MB_i}⟩_{q_{j≠i}})

where ⟨X_{H,MB_i}⟩_{q_{j≠i}} involves the neighboring clusters. The convergence theorem shows that GMF is guaranteed to converge to a local optimum and provides a lower bound on the likelihood of the evidence. The comparison of inference accuracy and computation between NMF, GMF, and BP is shown in Figure 4. We can see that mean field, especially GMF with larger clusters, has the lowest error, but also a much higher computational cost. GMF with a 2×2 grid is better than BP in both accuracy and speed.


Figure 2: Naive mean field for Ising models.

Figure 3: General mean field for Ising models

1.5

Factorial HMMs

Let's take a look at another example: the factorial HMM. We can make naive mean field approximations such that all variables are assumed to be independent in the posterior. We can also make generalized mean field approximations such that each disjoint cluster contains one, two, or three hidden Markov chains. The original factorial HMM and a generalized mean field approximation based on clusters of two hidden Markov chains are shown in Figure 5. Figure 5 also shows the singleton marginal error and CPU time for naive mean field, generalized mean field with clusters of different numbers of Markov chains, and the exact inference (BP) algorithm. From the singleton marginal error histogram, we can see that, as expected, naive mean field drops all edges in the posterior and thus has the highest error, while generalized mean field with clusters of more chains generally behaves better in terms of singleton marginal error. From the CPU time histogram, we can see that the exact inference method takes the most time, while mean field approximations generally take less time. From this example, we can conclude that mean field approximation makes inference tractable or cheap compared with exact inference methods, at the cost of additional independence assumptions, which leads to a bias in the inference result.


Figure 4: Inference comparison on Ising models.

Figure 5: Mean Field Approximation for HMM.

2

Topic Model

2.1

General framework

The general framework for solving problems with graphical models can be decomposed into these major steps:
• Task: embedding, classification, clustering, topic extraction.
• Data representation: input and output, data types (continuous, binary, counts).
• Model: belief networks, Markov random fields, regression, support vector machines.
• Inference: exact inference, MCMC, variational.
• Learning: MLE, MCLE, max-margin.
• Evaluation: visualization, human interpretability, perplexity, predictive accuracy.
It is best to consider one step at a time.

2.2

Task and Data Representation

The motivation underlying probabilistic topic models is the inability of humans to process a huge number of text documents (e.g., to search, browse, or measure similarity). We need new computational tools

Figure 6: Document Embedding.

to help organize, search, and understand these vast amounts of information. To this end, probabilistic topic models help to organize documents according to their topics, so that we can perform a variety of tasks with a large number of documents. This can be done by finding a mapping from each document to a vector space, i.e. a document embedding (Fig 6). Formally, it reads D → R^d, where D is the space of documents and R^d is a Euclidean space. Document embedding enables us to compare the similarity of two documents, classify contents, group documents into clusters, distill semantics and perspectives, etc. One common representation is bag of words (Fig 7). Bag of words is an orderless, high-dimensional, sparse representation, where each document is represented by the frequencies of words over a fixed vocabulary. There are a few drawbacks of the bag of words representation: it is not very informative; it is not very efficient for text processing tasks such as search, document classification, or similarity measures; and it is also not effective for browsing.

2.3

The Big Picture of Topic Models

The big picture is presented in Fig 8. The blue crosses denote topics while the red crosses denote documents. Each topic is modeled by a distribution over words, and thus can be viewed as a point in the word simplex. Similarly, each document is modeled by a distribution over topics, and thus can be viewed as a point in the topic simplex.

2.4

Latent Dirichlet Allocation

Below is the scheme to generate a document from a topic model. If the prior is a Dirichlet distribution, the model is called Latent Dirichlet Allocation (LDA).
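The generative scheme can be written as a short sampling routine. The sketch below is purely illustrative: the vocabulary size, number of topics, document length, and hyperparameters (written here as alpha and zeta) are made-up values, not anything specified in the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, N = 3, 8, 20          # topics, vocabulary size, words per document (illustrative)
alpha, zeta = 0.5, 0.1      # Dirichlet hyperparameters for theta and beta

beta = rng.dirichlet(zeta * np.ones(V), size=K)   # beta_k ~ Dir(zeta): topic-word distributions

def generate_document():
    theta = rng.dirichlet(alpha * np.ones(K))     # theta_d ~ Dir(alpha): topic proportions
    words, topics = [], []
    for _ in range(N):
        z = rng.choice(K, p=theta)                # z_{d,n} ~ Mult(theta_d): pick a topic
        w = rng.choice(V, p=beta[z])              # w_{d,n} ~ Mult(beta_z): pick a word
        topics.append(z); words.append(w)
    return words, topics

print(generate_document()[0])
```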


Figure 7: The Bag of Words Representation of a Document

Figure 8: The Big Picture of Probabilistic Topic Models

2.5 Inference for LDA

Posterior inference is the fundamental problem in revealing the mysteries behind LDA. The joint distribution of the network can easily be computed by the chain rule, as shown in Fig. 9:

Figure 9: The Latent Dirichlet allocation (LDA) Model.

P(β, θ, z, w) = Π_{k=1}^K P(β_k | ζ) Π_{d=1}^D [ P(θ_d; α) Π_{n=1}^N P(z_{d,n} | θ_d) P(w_{d,n} | z_{d,n}, β) ]

In order to utilize the LDA model, there are some conditional probabilities we may need in order to calculate the posterior; for example, we may want to know p(θ_n | D) as well as p(z_{n,m} | D). To calculate p(θ_n | D) = p(θ_n, D)/p(D), we would need to integrate over θ and β, which is intractable. We can, however, get an acceptable result by approximate inference. There are several inference algorithms for the posterior:
• Mean field approximation
• Expectation propagation
• Variational 2nd-order Taylor approximation
• Markov Chain Monte Carlo (Gibbs sampling)


Since the lecture ran out of time when introducing the inference approaches, barely any detail of inference was covered. We are going to cover some important ones in order to reflect the main points of LDA.

2.6

Mean-Field Approximation

When we do the mean field approximation, the variational distribution over the latent variables is asserted to factorize as:

q(β, θ, z) = Π_k q(β_k) Π_d q(θ_d) Π_n q(z_{d,n})

From this formulation, it is clear that under the variational approximation q, the factors over β, θ, and z are independent. Using the mean field, we can optimize a lower bound on the true posterior of the following form, given q(β, θ, z) = Π_k q(β_k) Π_d q(θ_d) Π_n q(z_{d,n}):

L(q(β, θ, z)) = E_q log p(w, β, θ, z) + H(q(β, θ, z))

Then it is easy to derive the coordinate ascent algorithm from

L(q(β_i, θ_i, z_i)) = ∫ q(β_i, θ_i, z_i) E_{q_{−i}}[log p(w, β, θ, z)] dβ_i dθ_i dz_i + H(q(β, θ, z))

To avoid confusion, we clarify that E_{q_{−i}} is the expectation over all latent variables except the i-th one. From this we can directly get the update rule. We can get a better view of the entire procedure by going through Algorithm 1.

Algorithm 1: Coordinate ascent algorithm for LDA
  initialize variational topics q(β_k);
  while the lower bound L(q) has not converged do
    for each document d ∈ {1, 2, 3, ..., D} do
      initialize variational topic assignments q(z_dn);
      while the change in q(θ) is not small enough do
        update variational topic proportions q(θ_d);
        update variational topic assignments q(z_dn);
      end
      update variational topics q(β_k);
    end
  end

Looking at the algorithm, we can see there are three nested loops. If we have millions of documents, it is going to be very slow.

2.7 Variational Inference for LDA

The general idea of variational inference is to minimize the KL divergence between the variational distribution q(θ, z | γ, φ) and the true posterior distribution p(θ, z | w, α, β), where γ and φ are the variational parameters of q. It is worth noticing that q(θ | γ) follows a Dirichlet distribution parameterized by γ, and likewise q(z_n | φ_n) follows a multinomial distribution parameterized by φ_n. We omit some of the derivation and state the conclusion:

L(γ, φ; α, β) can be optimized by a variational EM algorithm, which maximizes the lower bound with respect to the variational parameters γ and φ in the E step, and then maximizes the lower bound with respect to the model parameters, for fixed values of the variational parameters, in the M step.

2.8

Gibbs Sampling for LDA

We are actually going to cover Gibbs sampling in Lecture 16, but in order to make these notes self-contained, we give a brief introduction to Gibbs-sampling-based approximate inference for LDA, as a special case of Markov chain Monte Carlo. The latent variables in the graphical model are sampled iteratively, each given the rest, based on its conditional distribution. Using z to denote the concatenation of the topic assignments for all words and z_−n the topics for all words except w_n, given all other variables the conditional probability of w_n being assigned to topic k is

where n_k^w denotes the count of appearances of word w in topic k, and n_d^k denotes the number of words assigned to topic k in document d.
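The conditional update itself is not reproduced in these notes, so the sketch below uses the standard collapsed Gibbs update for LDA with symmetric priors (alpha for topic proportions, eta for topics); treat the exact form, the hyperparameter values, and the toy documents as illustrative assumptions rather than the lecture's own derivation.

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.1, eta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (standard update, symmetric priors)."""
    rng = np.random.default_rng(seed)
    n_kw = np.zeros((K, V))                  # n_k^w: count of word w in topic k
    n_dk = np.zeros((len(docs), K))          # n_d^k: words in document d assigned to topic k
    n_k = np.zeros(K)                        # total words assigned to topic k
    z = [[rng.integers(K) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):           # initialize counts from the random assignments
        for n, w in enumerate(doc):
            k = z[d][n]; n_kw[k, w] += 1; n_dk[d, k] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]                  # remove the current assignment from the counts
                n_kw[k, w] -= 1; n_dk[d, k] -= 1; n_k[k] -= 1
                # p(z = k | z_-n, w) ∝ (n_k^w + eta)/(n_k + V*eta) * (n_d^k + alpha)
                p = (n_kw[:, w] + eta) / (n_k + V * eta) * (n_dk[d] + alpha)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k                  # add the new assignment back
                n_kw[k, w] += 1; n_dk[d, k] += 1; n_k[k] += 1
    return z, n_kw

docs = [[0, 1, 2, 1], [3, 4, 3, 5], [0, 2, 4, 5]]   # toy word-id documents
print(gibbs_lda(docs, K=2, V=6)[1])
```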

2.9

Conclusion

In this section, we reviewed the essential background of Latent Dirichlet Allocation. Then we introduced the general framework. After that we reviewed three standard inference methods for LDA. It is worth mentioning that topic models are among the most active models in cutting-edge research, especially in multimedia and unsupervised data mining areas. This can easily be seen by reviewing Fig. 10.


Figure 10: Topic Model zoo.

10-708: Probabilistic Graphical Models 10-708, Spring 2016

14 : Theory of Variational Inference: Inner and Outer Approximation Lecturer: Eric P. Xing

Scribes: Qi Guo, Chieh Lo, Wei-Chiu Ma

1 Introduction

So far we have learned two families of approximate inference algorithms:
1. Loopy belief propagation (sum-product)
2. Mean-field approximation
We will re-examine them together and form a unified point of view based on the variational principle:
1. Loopy belief propagation: outer approximation
2. Mean-field approximation: inner approximation

2

Exponential Families

Before digging deeper into variational inference, we first slightly review the exponential families and their parameterization. Recall that exponential families refer to arbitrary set of probability distributions that can be represented in the following form: p(x1 , . . . , xm ; θ) = exp{θT φ(x) − A(θ)},

(1)

where A(θ) is the log partition function, θ are the canonical parameters, and φ(x) are the sufficient statistics. This form is well known as the canonical parameterization. Note that θ and φ(x) can be either scalars or vectors, and the log partition function A(θ) is always a convex function. In addition, the joint probability distribution over Markov Random Fields (MRFs) can be expressed as the normalized product of clique potentials, i.e.

p(x; θ) = (1/Z(θ)) Π_C ψ(x_C; θ_C).    (2)

By re-formulating Equation 2 into the canonical representation defined in Equation 1, we get:

p(x; θ) = exp{ Σ_C log ψ(x_C; θ_C) − log Z(θ) }    (3)

Figure 1: Conjugate Dual

From Equation 3, we can observe that computing the expectation of the sufficient statistics yields the following:

µ_{s;j} = E_p[I_j(X_s)] = P[X_s = j], ∀j ∈ X_s    (4)
µ_{st;jk} = E_p[I_{st;jk}(X_s, X_t)] = P[X_s = j, X_t = k], ∀(j, k) ∈ (X_s, X_t)    (5)

We can think of the first expectation here as the marginal mean value for a node, with the second expectation representing the marginal mean value for a pair of nodes.

3

Conjugate Dual

Given any function f(·), its conjugate dual function is defined as:

f*(µ) = sup_θ { ⟨θ, µ⟩ − f(θ) }    (6)

A convenient property of the dual function is that it is always convex. Additionally, when the original function f is both convex and lower semi-continuous, the dual of the dual is f . We now step through a simple example of computing the mean parameters for a Bernoulli distribution.

3.1

Example: Bernoulli

The Bernoulli distribution takes the following form:

p(x; θ) = exp{θx − A(θ)},    (7)

where A(θ) = log[1 + exp{θ}]. The conjugate dual function is defined as:

A*(µ) = sup_{θ∈R} { µθ − log[1 + exp(θ)] }    (8)

Taking the partial derivative with respect to θ, we obtain the following stationary point:

0 = µ − e^θ/(1 + e^θ)    (9)
µ = e^θ/(1 + e^θ)    (10)

From Equation 9, we can then solve for θ:

µ = e^θ/(1 + e^θ)    (11)
1/µ = 1 + e^{−θ}    (12)
e^{−θ} = 1/µ − 1    (13)
θ = log[µ/(1 − µ)]    (14)

From the above, we can observe that if µ < 0 (or µ > 1), the supremum is unbounded and A*(µ) → +∞. We thus obtain the following formulation:

A*(µ) = µ log µ + (1 − µ) log(1 − µ),  if µ ∈ [0, 1]    (15)
A*(µ) = +∞,  otherwise    (16)

A(θ) can thus be recovered as:

A(θ) = max_{µ∈[0,1]} { µθ − A*(µ) }    (17)

To maximize Equation 17, we take the partial derivative with respect to µ and set it to 0. The derivation is as follows:

θ − log µ − 1 + log(1 − µ) + 1 = 0    (18)
log µ − log(1 − µ) = θ    (19)
1/µ − 1 = e^{−θ}    (20)
µ = e^θ/(1 + e^θ)    (21)

From the above we can observe that this is the mean. In general this will be true: the value of µ that maximizes the expression in our formulation of A(θ) will be the mean parameter. Additionally, just as our mean parameter was restricted to the range [0, 1] above, the mean parameter in general will be restricted to some range of values. Note also that the dual function A*(µ) is equal to the negative entropy of a Bernoulli distribution; the fact that the dual is equal to the negative entropy holds true in general and will be useful later. We have shown that the mean computation for a Bernoulli distribution can be cast as an optimization problem over a restricted set of values. Does this methodology work in general? Unfortunately, computing the conjugate dual function over arbitrary graphs is intractable, and the constraint set of possible mean values can be hard to determine. Thus, we turn to approximation methods.
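A quick numeric sanity check of Equations 8 and 15–17 can be done by brute force over a grid; the sketch below simply confirms that the supremum over θ matches the closed-form negative entropy, and that maximizing back over µ recovers A(θ). The grid resolution and test values are arbitrary.

```python
import numpy as np

def A(theta):                      # log partition of the Bernoulli, A(theta) = log(1 + e^theta)
    return np.log1p(np.exp(theta))

def A_star(mu):                    # closed-form dual: the negative entropy (Eq. 15)
    return mu * np.log(mu) + (1 - mu) * np.log(1 - mu)

mu = 0.3
thetas = np.linspace(-20, 20, 200001)
dual_by_sup = np.max(mu * thetas - A(thetas))         # sup_theta {mu*theta - A(theta)} (Eq. 8)
print(dual_by_sup, A_star(mu))                        # the two values should agree closely

theta = 1.2
mus = np.linspace(1e-6, 1 - 1e-6, 200001)
A_by_max = np.max(mus * theta - A_star(mus))          # max_mu {mu*theta - A*(mu)} (Eq. 17)
print(A_by_max, A(theta))                             # recovers the log partition function
```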

3.2 Conjugate Dual for Exponential Family

Given

p(x_1, . . . , x_m; θ) = exp{ Σ_{i=1}^d θ_i φ_i(x) − A(θ) }.

By definition, the dual function of A(θ) is

A*(µ) = sup_{θ∈Ω} ⟨µ, θ⟩ − A(θ).

The stationary condition (one of the KKT conditions) is µ − ∇A(θ) = 0. The derivatives of A yield the mean parameters:

∂A/∂θ_i (θ) = E_θ[φ_i(X)] = ∫ φ_i(x) p(x; θ) dx.

So the stationary condition becomes µ = E_θ[φ(X)]. If we have the dual solution µ, can we get the primal solution θ(µ), and how? Let us assume there is a solution θ(µ) such that µ = E_{θ(µ)}[φ(X)]; then the dual of the exponential family is

A*(µ) = ⟨θ(µ), µ⟩ − A(θ(µ))    (22)
= E_{θ(µ)}[ ⟨θ(µ), φ(X)⟩ − A(θ(µ)) ]    (23)
= E_{θ(µ)}[ log p(X; θ(µ)) ]    (24)

The entropy is defined as H(p(x)) = −∫ p(x) log p(x) dx, so the dual is A*(µ) = −H(p(x; θ(µ))). It has this nice form not by coincidence, but because of the properties of the exponential family. The domain of A*(µ) is a marginal polytope, which we explain now. First define a vector of mean parameters, as mentioned before. Given any distribution p(x) and a set of sufficient statistics φ(x), the mean parameters are

µ_i = E_p[φ_i(X)] = ∫ φ_i(x) p(x) dx,

where p(x) need not belong to an exponential family. For an exponential family, the set of all realizable mean parameters is

M := {µ ∈ R^d | ∃p s.t. E_p[φ(X)] = µ}.

This is a convex set. For discrete exponential families, it is called the marginal polytope:

M = conv{φ(x), x ∈ X^m}    (25)


Figure 2: Marginal Polytope (Half-plane Representation)

According to the Minkowski-Weyl theorem, any non-empty convex polytope can be characterized by a finite collection of linear inequality constraints. So we have a half-plane representation:

M = {µ ∈ R^d | a_j^T µ ≥ b_j, ∀j ∈ J},

where |J| is finite (Figure 2).

4

Variational Method

The exact variational formulation of the log partition function is

A(θ) = sup_{µ∈M} { θ^T µ − A*(µ) },

where M := {µ ∈ R^d | ∃p s.t. E_p[φ(X)] = µ} and A*(µ) = −H(p_{θ(µ)}) if µ ∈ M° (and +∞ otherwise). We have taken a long way to get here, combining all the former efforts involving convex optimization and exponential families. This is THE optimization problem we aim to solve. Two difficulties in optimizing it are:
• M: the marginal polytope, difficult to characterize
• A*: the negative entropy function, with no explicit form, involving an integral

5

Mean Field Method

Mean field tackles the hard optimization problem with a non-convex inner bound and the exact form of the entropy. For a general graph G, the marginal polytope M(G; φ) (Equation 25) is hard to characterize. We find a tractable subgraph F to approximate it, for example a tree or a graph with no edges at all. This is the essence of the mean-field approximation.


Figure 3: Mean field optimization is always non-convex for any exponential family in which the state space X is finite. M_F(G) contains all the extreme points.

For a tractable subgraph F,

M(F; φ)° := {τ ∈ R^d | ∃θ ∈ Ω(F) s.t. E_θ[φ(X)] = τ}    (26)

This is an inner approximation: M(F; φ)° ⊆ M(G; φ)° (Figure 3). Mean field then solves the relaxed problem

max_{τ∈M_F(G)} { ⟨τ, θ⟩ − A*_F(τ) },

where A*_F is the exact dual function with respect to M_F(G).

6 Bethe Approximation and Sum-Product

6.1 Recap: Sum-Product/Belief Propagation Algorithm

The update for each node in the sum-product algorithm is given by:

M_ts(x_s) ← κ Σ_{x'_t} { φ_st(x_s, x'_t) φ_t(x'_t) Π_{u∈N(t)\s} M_ut(x'_t) },

where φ denotes the potential functions. The marginal for node s is given by:

µ_s(x_s) = κ φ_s(x_s) Π_{t∈N(s)} M*_ts(x_s).

6.2

Variational Inference for Sum-Product/Belief Propagation Algorithm

The sum-product algorithm can do exact inference on trees, but can only approximate on loopy graphs. Hence, by using variational inference methods, we can formulate an optimization problem to estimate the mean parameters on trees. We describe the algorithm explicitly in the following paragraphs.

6.3 Exact Variational Principle for Trees

Let's begin by computing the mean parameters for tree graphical models. For a discrete tree with variables X_s ∈ {0, 1, . . . , m_s − 1}, the sufficient statistics of a tree are given by I_j(x_s) and I_jk(x_s, x_t) for s = 1, . . . , n and (s, t) ∈ E. The mean parameters are µ_s(x_s) = P(X_s = x_s) and µ_st(x_s, x_t) = P(X_s = x_s, X_t = x_t). We can construct the marginal polytope for the tree, and by the junction tree theorem

M(T) = {µ ≥ 0 | Σ_{x_s} µ_s(x_s) = 1, Σ_{x_t} µ_st(x_s, x_t) = µ_s(x_s)}.

If µ ∈ M(T), then

p_µ(x) := Π_{s∈V} µ_s(x_s) Π_{(s,t)∈E} µ_st(x_s, x_t) / (µ_s(x_s) µ_t(x_t)).

For trees, the entropy decomposes as:

H(p_µ(x)) = Σ_{s∈V} H_s(µ_s) − Σ_{(s,t)∈E} I_st(µ_st).

Hence, the variational function is given by:

A(θ) = max_{µ∈M(T)} { ⟨θ, µ⟩ + Σ_{s∈V} H_s(µ_s) − Σ_{(s,t)∈E} I_st(µ_st) }.

Next, by utilizing the Lagrangian method, we can get the general variational principle of the belief propagation algorithm for trees:

µ_s(x_s) ∝ exp(θ_s(x_s)) Π_{t∈N(s)} exp(λ_ts(x_s)),
µ_st(x_s, x_t) ∝ exp(θ_s(x_s) + θ_t(x_t) + θ_st(x_s, x_t)) Π_{u∈N(s)\t} exp(λ_us(x_s)),

where λ_ss and λ_ts(x_s) are the Lagrange multipliers. This yields:

M_ts(x_s) ← Σ_{x_t} exp(θ_t(x_t) + θ_st(x_s, x_t)) Π_{v∈N(t)\s} exp(λ_vt(x_t)) Π_{u∈N(t)\s} M_ut(x_t).

6.4 Belief Propagation on Arbitrary Graphs

There are two main difficulties for belief propagation on arbitrary graphs: (1) determination of the marginal polytope M and (2) computation of the exact entropy. Suppose that, for an arbitrary connected graph G, we use the tree-based approximation for M. Since G contains additional constraints (via additional edges) compared to any tree formed with its edges, this tree approximation is an outer bound. We formulate it as follows. The outer bound is:

L(G) = {τ | Σ_{x_s} τ_s(x_s) = 1, Σ_{x_t} τ_st(x_s, x_t) = τ_s(x_s)}


where the τ are pseudo-marginals. Hence, we can approximate the exact entropy as:

−A*(τ) ≈ H_Bethe(τ) := Σ_{s∈V} H_s(τ_s) − Σ_{(s,t)∈E} I_st(τ_st).

The Bethe Variational Problem is given by:

max_{τ∈L(G)} { ⟨θ, τ⟩ + Σ_{s∈V} H_s(τ_s) − Σ_{(s,t)∈E} I_st(τ_st) }.

This problem is differentiable, with each constraint set being a simple convex polytope, and message passing on trees provides an analytical solution. However, there are no guarantees on the convergence of the algorithm on loopy graphs. Also, the Bethe variational problem is usually non-convex, so there are no guarantees on reaching the global optimum.

7

Summary

In these scribe notes, we discussed the following: (1) a recap of the exponential family, (2) the mean-field approximation, and (3) the belief propagation approximation. Variational methods can be thought of as turning the inference problem into an optimization problem by using exponential families and convex duality. In the mean field approximation, we use an inner approximation of the realizable space. This allows us to use the exact formulation of the entropy; however, it may cause the true maximum to fall outside our approximated space. In the Bethe approximation for belief propagation, we use an outer approximation to the marginal polytope M. By solving the Lagrangian of the Bethe Variational Problem, we equivalently solve the message-passing/sum-product algorithm.

10-708: Probabilistic Graphical Models 10-708, Spring 2016

15 : Approximate Inference: Monte Carlo Methods Lecturer: Eric P. Xing

Scribes: Binxuan Huang, Yotam Hechtlinger, Fuchen Liu

1 Introduction to Sampling Methods

1.1 General Overview

We have so far studied variational methods to address the problem of inference. Variational methods turn the problem of inference into an optimization problem in which everything is deterministic. The drawback of variational methods is that they only offer an approximation and not the correct answer to the problem. In this class we study Monte Carlo sampling methods, which offer two advantages over variational methods. The first is that the solution they provide is consistent, in the sense that it is guaranteed to converge to the right solution given a sufficient amount of data. The second is that sampling methods are usually easier to derive than variational methods. There are two classes of Monte Carlo methods: stochastic sampling methods, which we discuss in this lecture, and Markov Chain Monte Carlo (MCMC), a special class of Monte Carlo enabling more flexible sampling, which will be discussed in future lectures.

1.2

Monte Carlo Sampling Methods

Suppose x ∼ p is a high-dimensional random vector from a distribution p. Often during the process of inference there is a need to compute the quantity

E_p(f(x)) = ∫ f(x) p(x) dx,

for some function f (if f is the identity, this corresponds to the mean of the distribution). The expected value might be hard to calculate directly, either because it is high dimensional or because there is no closed-form solution. Sampling methods approximate the expected value by drawing a random sample x^(1), . . . , x^(N) from the distribution p and using the asymptotic guarantees provided by the Law of Large Numbers to estimate:

E_p(f(x)) ≈ (1/N) Σ_{n=1}^N f(x^(n)).

The challenges with sampling methods are:
• Sampling from the distribution p might not be trivial.
• How to make better use of the samples? Not all samples are equally useful.
• How many samples are required for the asymptotics to be sufficiently close?

1.3 Example: Naive Sampling

It is sometimes possible to naively sample from the graphical model by drawing Bernoulli draws according to the graph distribution. Although tempting, it will not be useful under many scenarios. To demonstrate the complications that might be encountered when directly sampling the distribution, suppose we sample from the Bayesian network presented in Figure 1 in a naive way, according to the values given in the figure. In many cases it will be interesting to calculate the conditional distribution of rare events, which is estimated by the sample counts. Accurate estimation of rare events requires a large number of samples even in a simple network such as the one in the figure. As the networks become more complicated, naive sampling methods become less and less efficient.

Figure 1: Example of a 5-variable Bayesian network. When naively sampled, P(J = 1 | B = 1) = P(J = 1, B = 1)/P(B = 1), for example, cannot be defined with the current sample, and would require a significant number of samples to be accurately estimated.

Naive sampling estimates, from sample counts, the probabilities of some marginal and conditional distributions. This implies we need more samples to accurately estimate a probability; in many situations the number of samples required is exponential in the number of variables. Furthermore, the naive sampling method would be too costly for high-dimensional cases, and therefore we introduce other alternatives, from rejection sampling to weighted resampling.

2 Rejection Sampling

2.1 The Rejection Sampling Process

Rejection sampling is useful in a situation where we want to sample from a distribution Π(X) = Π'(X)/Z, where the normalizing constant Z is unknown, so it is hard to sample directly from Π, but it is easy to evaluate Π'. In order to apply rejection sampling, we use a proposal distribution Q(x), which we can easily sample from directly, and we also assume there exists a constant K such that, for all x in the support, K Q(x) ≥ Π'(x). The rejection sampling algorithm is then:
1. Draw x0 ∼ Q(x).
2. Accept x0 with probability proportional to Π'(x0)/(K Q(x0)).



Figure 2 presents an intuitive explanation of the process.

Figure 2: The rejection sampling algorithm can be thought of as a uniform sample from the area under the distribution graph. The first step is to use Q to draw x0, and in the second step the observation is accepted with probability proportional to Π'(x0)/(K Q(x0)). This is equivalent to drawing a point u0 uniformly from the interval [0, K Q(x0)] and accepting the observation if u0 ≤ Π'(x0), that is, if it falls under the graph of Π'.

The correctness of the procedure can be shown using Bayesian analysis:

p(x) = Q(x) [Π'(x)/(K Q(x))] / ∫ [Π'(x)/(K Q(x))] Q(x) dx = Π'(x) / ∫ Π'(x) dx = Π'(x)/Z = Π(x).

2.2 High Dimensional Drawback

A crucial step in the process is the selection of Q and K. The fraction of samples accepted equals the ratio between the areas under Π' and K Q. It is therefore important to keep the rejection area as small as possible. In high dimensions this becomes a major drawback due to the curse of dimensionality, effectively limiting the method to low dimensions. Figure 3 further explains the problem using an example.
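To make the accept/reject step concrete, here is a small NumPy sketch; the unnormalized target Π', the Gaussian proposal Q, and the envelope constant K are made-up choices for illustration, with K chosen loosely so that K·Q dominates Π' for this particular target.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target known only up to a constant: Pi'(x) = exp(-x^2/2) * (1 + 0.9*sin(3x))  (made-up).
def pi_unnorm(x):
    return np.exp(-0.5 * x**2) * (1 + 0.9 * np.sin(3 * x))

# Proposal Q = N(0, 2^2); K is a loose envelope constant so that K*Q(x) >= Pi'(x) everywhere.
def q_pdf(x):
    return np.exp(-0.5 * (x / 2.0)**2) / (2.0 * np.sqrt(2 * np.pi))

K = 12.0

samples = []
for _ in range(20000):
    x0 = rng.normal(0.0, 2.0)             # draw from the proposal Q
    u0 = rng.uniform(0.0, K * q_pdf(x0))  # uniform point under the envelope K*Q(x0)
    if u0 <= pi_unnorm(x0):               # keep it only if it falls under Pi'
        samples.append(x0)

print(len(samples), np.mean(samples))
```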

3 Importance Sampling

3.1 Unnormalized importance sampling

The finite sum approximation to the expectation depends on being able to draw samples from the distribution P (x). However, it could be impractical to sample directly from P (x). Suppose we have a proposal distribution


Figure 3: As a thought experiment, suppose we sample from P = N(µ, σ_p²) in d dimensions with the proposal distribution Q = N(µ, σ_q²), where σ_q exceeds σ_p by 1%. The figure demonstrates the densities when d = 1. This example is just for instructional purposes: since we know how to sample one Gaussian, we could sample the other also. When the dimension is d = 1000, the optimal envelope constant is K = (σ_q/σ_p)^d ≈ 20,000, so the acceptance rate is about 1/20,000; only 1 sample out of 20,000 will be accepted.

Q(x), which can be sampled from more easily and which dominates P (i.e. Q(x) > 0 whenever P(x) > 0). Then we can sample from Q and reweight each sample by the importance weight w(x) = P(x)/Q(x). The procedure of unnormalized importance sampling is as follows:
1. Sample x^m from Q for m = 1, 2, . . . , M.
2. Compute Ê(f) = (1/M) Σ_{m=1}^M f(x^m) P(x^m)/Q(x^m).
This is because


E_P(f) = ∫ f(x) P(x) dx = ∫ f(x) [P(x)/Q(x)] Q(x) dx ≈ (1/M) Σ_{m=1}^M f(x^m) P(x^m)/Q(x^m) = (1/M) Σ_{m=1}^M f(x^m) w(x^m).

One advantage of unnormalized importance sampling over rejection sampling is that it uses all the samples and avoids waste. However, we need to be able to compute the exact value of P(x) (i.e. P needs to be available in closed form) in unnormalized importance sampling.

3.2

Normalized importance sampling

But sometimes we can only evaluate P'(x) = αP(x) (e.g. for an MRF) with an unknown scaling factor α > 0. In this case, we can get around the unknown normalization constant α as follows: let the ratio be r(x) = P'(x)/Q(x);


then

E_Q[r(x)] = ∫ [P'(x)/Q(x)] Q(x) dx = ∫ P'(x) dx = α.

Now

E_P[f(x)] = ∫ f(x) P(x) dx
= (1/α) ∫ f(x) [P'(x)/Q(x)] Q(x) dx
= ∫ f(x) r(x) Q(x) dx / ∫ r(x) Q(x) dx
≈ Σ_{m=1}^M f(x^m) r(x^m) / Σ_{m=1}^M r(x^m),  x^m ∼ Q(x)
= Σ_{m=1}^M f(x^m) w^m,  where w^m = r(x^m) / Σ_{l=1}^M r(x^l).

The procedure of normalized importance sampling is then:
1. Sample x^m ∼ Q(x) for m = 1, 2, . . . , M.
2. Compute the scaling factor α̂ = (1/M) Σ_{m=1}^M r(x^m).
3. Compute Ê_P(f) = Σ_{m=1}^M f(x^m) r(x^m) / Σ_{m=1}^M r(x^m).

Normalized importance sampling allows us to use a scaled approximation of P(x), but it is biased. Notice that for unnormalized importance sampling:

E_Q[f(X)w(X)] = ∫ f(x) w(x) Q(x) dx = ∫ f(x) [P(x)/Q(x)] Q(x) dx = ∫ f(x) P(x) dx = E_P(f),

so unnormalized importance sampling is unbiased. But for normalized importance sampling, e.g. with M = 1:

Figure 4: Examples of weight functions in unnormalized and normalized importance sampling


E_Q[ f(x^1) r(x^1) / r(x^1) ] = ∫ f(x) Q(x) dx ≠ E_P(f) in general.

However, in practice, the variance of the estimator in the normalized case is usually lower than in the unnormalized case. Also, it is common that we can evaluate P'(x) but not P(x). For example, in Bayes nets it is more reasonable to assume that P'(x|e) = P(x|e)P(e) is computable, where P(e) is the scaling factor. And in an MRF, P(x) = P'(x)/Z and Z is generally hard to compute.
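Here is a minimal sketch of normalized importance sampling; the unnormalized target P' (a scaled Gaussian), the proposal Q, and the test function f are made-up choices, and the last line also illustrates that the mean of r(x^m) estimates the unknown scaling factor α.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target P'(x) = alpha * P(x): a Gaussian N(2, 1) scaled by an unknown constant.
def p_unnorm(x):
    return 5.0 * np.exp(-0.5 * (x - 2.0)**2)     # here alpha = 5*sqrt(2*pi), unknown to the estimator

def q_pdf(x):                                    # proposal Q = N(0, 3^2)
    return np.exp(-0.5 * (x / 3.0)**2) / (3.0 * np.sqrt(2 * np.pi))

M = 100000
xs = rng.normal(0.0, 3.0, size=M)                # x^m ~ Q
r = p_unnorm(xs) / q_pdf(xs)                     # r(x^m) = P'(x^m)/Q(x^m)
w = r / r.sum()                                  # normalized weights w^m

f = lambda x: x                                  # estimate E_P[X]
print("normalized IS estimate:", np.sum(w * f(xs)))   # should be close to 2
print("estimated alpha:", r.mean())                   # (1/M) sum r(x^m) ≈ alpha
```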

3.3

Normalized importance sampling for a BN

We now apply normalized importance sampling to a Bayes net. The objective is to estimate the conditional probability of a variable given some evidence, P(X_i = x_i | e). We rewrite the probability P(X_i = x_i | e) as the expectation E_{P(X_i | e)}[f(X_i)], where f(X) := δ(X_i = x_i). We then obtain the proposal distribution from the mutilated BN, where we clamp the evidence nodes and cut off their incoming arcs. Figure 5 gives an illustration of this procedure. Define Q = P_M and P'(x) = P(x, e); then we get

P̂(X_i = x_i | e) = Σ_{m=1}^M w(x^m) δ(x_i^m = x_i) / Σ_{m=1}^M w(x^m),  where w(x^m) = P'(x^m, e)/P_M(x^m)

Figure 5: Illustration of how the proposal density is constructed in likelihood weighting. The evidence consists of e = (G = g2, I = i1)

Likelihood weighting is a special case of normalized importance sampling used to sample from a BN. This part was skipped by Eric; the pseudocode and efficiency of the likelihood weighting method can be found in the lecture slides.

3.4

Weighted resampling

The performance of importance sampling depends on how well Q matches P. As Figure 6 shows, if P(x)f(x) is strongly varying and has a significant proportion of its mass concentrated in a small region, r(x) will be dominated by a few samples. If the high-probability mass region of Q falls into the low-probability mass region of P, there will be many samples with low weight, like the star points shown in Figure 6, and in the high-probability region of P there may be few or no samples. The problem is that there is no way to diagnose this within an importance sampling procedure. We need to draw more samples to see whether they change the estimator, but the variance of r^m = P(x^m)/Q(x^m) can be small even if the samples come from a low-probability region of P and are potentially erroneous.

Figure 6: An example of the problem of importance sampling: the high-probability mass region of Q falls into the low-probability mass region of P.

There are two possible solutions to this problem. First, we can use a heavy-tailed Q to make sure there are enough samples in all regions. The second solution is to apply a weighted resampling method. Sampling importance resampling (SIR) is one such resampling method based on the weights of the samples:
1. Draw N samples from Q: X^1, · · · , X^N.
2. Construct weights w(x^1), · · · , w(x^N), where
w(x^m) = [P(x^m)/Q(x^m)] / Σ_{l=1}^N P(x^l)/Q(x^l) = r(x^m) / Σ_{l=1}^N r(x^l).
3. Sub-sample x from X^1, · · · , X^N with probabilities w(x^1), · · · , w(x^N).
Another approach is particle filtering, which will be shown in the next section.

4

Particle Filter

Particle Filter is a sampling method used to estimate the posterior distribution P (Xt |Y1:t ) in a state space model (SSM) with known transition probability distribution P (Xt+1 |Xt ) and emission probability P (Yt |Xt ). In the previous lectures, we have studied some algorithms like Kalman Filtering to solve SSM. However, Kalman Filtering assumes that the transition probabilities are Gaussian distributions, which is a big constraint. That’s why we need Particle Filter. Particle Filter can be viewed as an online algorithm. At time t + 1, a new observation Yt+1 is recieved as input, and the algorithm output is P (Xt+1 |Y1:t+1 ) based on previous estimation P (Xt |Y1:t ). Notice we assume have already have P (Xt |Y1:t ) which can be represented by ( ) P (Yt |Xtm ) m m Xt ∼ P (Xt |Y1:t−1 ), wt = PM , m m m=1 P (Yt |Xt ) where {Xtm } are M samples we drew from the prediction at time t − 1, P (Xt |Y1:t−1 ), and {wtm } are the

8

15 : Approximate Inference: Monte Carlo Methods

Figure 7: Schematic illustration of the operation of the particle filter. At time step t the posterior p(xt |ytm ) is

represented as a mixture distribution, shown schematically as circles whose sizes are proportional to the weights wt, . m A set of M samples is then drawn from this distribution, and the new weights wt+1 evaluated using p(yt+1 |xm t+1 ).

corresponding weights of the samples. This representation suffices because

P(X_t | Y_{1:t}) = P(X_t | Y_t, Y_{1:t-1}) = P(X_t | Y_{1:t-1}) P(Y_t | X_t) / ∫ P(X_t | Y_{1:t-1}) P(Y_t | X_t) dX_t
= P(X_t | Y_{1:t-1}) · [ P(Y_t | X_t) / ∫ P(X_t | Y_{1:t-1}) P(Y_t | X_t) dX_t ],

where the right-hand factor in the equation above is just the weight, approximated by the M samples. Next, at time step t + 1, we calculate P(X_{t+1} | Y_{1:t+1}) using two updates: the Time Update and the Measurement Update.

In the Time Update, we draw M new samples {X_{t+1}^m} from P(X_{t+1} | Y_{1:t}), which is given by

P(X_{t+1} | Y_{1:t}) = ∫ P(X_{t+1} | X_t) P(X_t | Y_{1:t}) dX_t ≈ Σ_{m=1}^M w_t^m P(X_{t+1} | X_t^m).

Here we can see that P(X_{t+1} | Y_{1:t}) is a mixture model with M weights and M known components given by the transition probability P(X_{t+1} | X_t^m).

In the Measurement Update step, we update the weights {w_{t+1}^m} by

w_{t+1}^m = P(Y_{t+1} | X_{t+1}^m) / Σ_{m=1}^M P(Y_{t+1} | X_{t+1}^m).

The desired posterior probability at time t + 1, P(X_{t+1} | Y_{1:t+1}), follows from the two updates because it can be represented in the same manner by

X_{t+1}^m ∼ P(X_{t+1} | Y_{1:t}),    w_{t+1}^m = P(Y_{t+1} | X_{t+1}^m) / Σ_{m=1}^M P(Y_{t+1} | X_{t+1}^m).
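Below is a minimal bootstrap particle filter sketch for a one-dimensional SSM. The linear-Gaussian transition and emission models, and the resampling-at-every-step choice, are illustrative assumptions, not a model specified in the lecture.

import numpy as np

def particle_filter(y, M=1000, trans_std=1.0, emit_std=0.5):
    """Bootstrap particle filter: time update through P(X_{t+1}|X_t), then
    measurement update re-weighting by P(Y_{t+1}|X_{t+1})."""
    particles = np.random.randn(M)                   # initial samples for X_0
    weights = np.full(M, 1.0 / M)
    means = []
    for obs in y:
        # Time update: resample by the current weights, then propagate through the transition.
        idx = np.random.choice(M, size=M, p=weights)
        particles = particles[idx] + trans_std * np.random.randn(M)
        # Measurement update: weight by the emission likelihood P(Y|X).
        loglik = -0.5 * ((obs - particles) / emit_std) ** 2
        weights = np.exp(loglik - loglik.max())
        weights /= weights.sum()
        means.append(np.sum(weights * particles))    # posterior mean E[X_t | Y_{1:t}]
    return np.array(means)

y = np.cumsum(np.random.randn(50))                   # synthetic observations from a random walk
print(particle_filter(y)[:5])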

5

Rao-Blackwellised sampling

Sampling in high-dimensional probability spaces can sometimes result in high variance in the estimate. In class, the lecturer gave an example with a multivariate Gaussian distribution: with high dimension n, making small changes to the standard deviation σ in every dimension can cause the estimate to change a lot. To avoid this drawback, we can utilize the law of total variance:

var[τ(x_p, x_d)] = var[E[τ(x_p, x_d) | x_p]] + E[var[τ(x_p, x_d) | x_p]].

From the equation above, we can see var[E[τ(x_p, x_d) | x_p]] ≤ var[τ(x_p, x_d)]. There is a simple proof at: https://en.wikipedia.org/wiki/Law_of_total_variance

Hence, when computing E_{p(X|e)}[f(X_p, X_d)], instead of sampling x_p, x_d directly from p(x_p, x_d | e) as in E_{p(X|e)}[f(X_p, X_d)] ≈ (1/M) Σ_m f(x_p^m, x_d^m), we can first sample the variables X_p and then compute the expected value over X_d conditioned on X_p:

E_{p(X|e)}[f(X_p, X_d)] = ∫ p(x_p, x_d | e) f(x_p, x_d) dx_p dx_d
= ∫_{x_p} p(x_p | e) [ ∫_{x_d} p(x_d | x_p, e) f(x_p, x_d) dx_d ] dx_p
= ∫_{x_p} p(x_p | e) E_{p(X_d | x_p, e)}[f(x_p, X_d)] dx_p
≈ (1/M) Σ_m E_{p(X_d | x_p^m, e)}[f(x_p^m, X_d)],    x_p^m ∼ p(x_p | e).

Basically, this sampling process transforms sampling in a space of high dimension p + d into sampling in a space of lower dimension p.
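A toy sketch of the Rao-Blackwellised estimator for a bivariate Gaussian, where the inner expectation over x_d given x_p is available in closed form; the joint distribution and the function f are illustrative assumptions.

import numpy as np

# Toy joint: (x_p, x_d) ~ N(0, [[1, rho], [rho, 1]]); estimate E[x_p + x_d^2] = 1.
rho, M = 0.8, 5000
xp = np.random.randn(M)                        # sample only the x_p block

# Naive estimator: also sample x_d | x_p and average f(x_p, x_d).
xd = rho * xp + np.sqrt(1 - rho**2) * np.random.randn(M)
naive = np.mean(xp + xd**2)

# Rao-Blackwellised estimator: replace x_d^2 by its conditional expectation
# E[x_d^2 | x_p] = (rho * x_p)^2 + (1 - rho^2), so only x_p is sampled.
rb = np.mean(xp + (rho * xp)**2 + (1 - rho**2))

print(naive, rb)    # both are near 1; the RB estimate typically has lower variance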

10-708: Probabilistic Graphical Models 10-708, Spring 2014

16 : Markov Chain Monte Carlo (MCMC) Lecturer: Matthew Gormley

1

Scribes: Yining Wang, Renato Negrinho

Sampling from low-dimensional distributions

1.1

Direct method

Consider the one-dimensional case x ∈ R where we would like to sample x from a complex distribution P. Let h : R → [0, 1] be the cumulative distribution function (CDF) of P and suppose the inverse function h⁻¹ is known. We can then sample x from P via the following procedure: 1. Sample u from U[0, 1], the uniform distribution over [0, 1]. 2. Output x = h⁻¹(u). It can easily be proved that x ∼ P because for any t ∈ R we have that Pr[x ≤ t] = Pr[h⁻¹(u) ≤ t] = Pr[u ≤ h(t)] = h(t). This method is exact and is highly efficient when h⁻¹ can be easily computed. However, when h is very complicated its inverse might not admit an easily computable form. In addition, the method is difficult to generalize to high-dimensional cases when x has more than one covariate.
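A minimal sketch of the direct (inverse-CDF) method for an exponential distribution, whose CDF h(x) = 1 − e^{−λx} has the closed-form inverse h⁻¹(u) = −ln(1 − u)/λ; the rate parameter is an arbitrary choice for the example.

import numpy as np

def sample_exponential(lam, size):
    """Inverse-CDF sampling: u ~ U[0,1], then x = h^{-1}(u)."""
    u = np.random.rand(size)
    return -np.log(1.0 - u) / lam

x = sample_exponential(lam=2.0, size=100000)
print(x.mean())   # should be close to 1/lam = 0.5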

1.2

Rejection sampling

Suppose we want to sample x from P. Let Q be a distribution that is easy to sample from (e.g., a uniform or Gaussian distribution) and let k be a constant such that kQ(x) ≥ P(x) for all x. The following procedure then produces a sample x that is exactly sampled from P: 1. Sample y from Q. 2. Sample u from U[0, kQ(y)], the uniform distribution over the interval [0, kQ(y)]. 3. If u > P(y), discard y and repeat from the first step; otherwise, return x = y as the sample. Rejection sampling is justified by the envelope principle: (y, u) are jointly sampled from the uniform distribution over the subgraph of kQ(x). Thus accepting pairs with u < P(y) produces samples uniformly distributed over the subgraph of P(x). As a result, the marginal distribution of y in the accepted pairs is exactly P.
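A short rejection-sampling sketch targeting an unnormalized density with a uniform envelope; the particular target, envelope, and constant k are illustrative assumptions.

import numpy as np

def rejection_sample(p, q_sample, q_pdf, k, n):
    """Accept y ~ Q when a uniform draw on [0, k*Q(y)] falls below P(y)."""
    out = []
    while len(out) < n:
        y = q_sample()
        u = np.random.uniform(0.0, k * q_pdf(y))
        if u <= p(y):                                 # otherwise discard and repeat
            out.append(y)
    return np.array(out)

# Target: unnormalized truncated density p(x) = exp(-x^2 / 2) on [-3, 3].
p = lambda x: np.exp(-0.5 * x**2)
q_sample = lambda: np.random.uniform(-3, 3)           # envelope Q = Uniform(-3, 3)
q_pdf = lambda x: 1.0 / 6.0
k = 6.0                                               # ensures k*Q(x) = 1 >= p(x)
print(rejection_sample(p, q_sample, q_pdf, k, 1000).std())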


Rejection sampling is also exact and does not need to invert the CDF of P , which might be too difficult to evaluate. However, the rejection sampling procedure might reject a large number of samples before finally producing one, when the envelope distribution Q does not align well with the target distribution P . This disadvantage is even more serious in high-dimensional settings, thanks to the notorious curse of dimensionality.

1.3

Importance sampling

Unlike rejection sampling, which aims at sampling from a particular distribution P, the importance sampling method concerns evaluating statistics under P, e.g., E_P[f(x)]. Suppose Q is another distribution that is easy to sample from. The importance sampling procedure is as follows:

1. Draw S samples i.i.d. from the distribution Q, denoted x^(1), ..., x^(S).

2. Produce (1/S) Σ_{s=1}^S [P(x^(s)) / Q(x^(s))] f(x^(s)) as an estimate of E_P[f(x)].

It is not difficult to prove that the estimate is unbiased; more specifically,

E_Q[ (1/S) Σ_{s=1}^S (P(x^(s))/Q(x^(s))) f(x^(s)) ] = E_Q[ (P(x)/Q(x)) f(x) ] = ∫_X Q(x) (P(x)/Q(x)) f(x) dx = ∫_X f(x) P(x) dx = E_P[f(x)].

Furthermore, the variance of the estimate decreases as we increase the number of samples S, rendering the resulting estimate more accurate. The estimation accuracy also depends on how well the proposal distribution Q is aligned with the true distribution P to be evaluated.

1.4

Curse of dimensionality

Rejection/importance sampling usually behaves very poorly as the dimensionality of X increases. Consider X = R^D with the true and the proposal (envelope) distributions defined as

P = N_D(0, I),    Q = N_D(0, σ² I).

Rejection sampling requires σ ≥ 1 to work. In this case, the probability of accepting a proposed sample is σ^{−D}, which decays exponentially with the dimension D. The importance sampling procedure requires σ ≤ 1/√2 to have a well-defined finite variance of the resulting estimate. The variance (for S = 1) of the resulting estimator is

( σ² / (2 − 1/σ²) )^{D/2} − 1,

which again scales exponentially with the dimensionality D.

2

Markov Chain Monte Carlo

In this section we study MCMC methods to obtain a sequence of samples {x^(t)}_{t=1}^T from an underlying distribution P. Suppose p is the pdf associated with P; we assume that

p = p̃ / Z,

where Z is a deterministic normalization constant that is difficult to compute and p̃ is a function of X that is easy to compute.

2.1

Metropolis Algorithm

In the Metropolis algorithm, we choose a simple proposal distribution q(·|x′) that is easy to sample from and an initial sample x^(1). Here q must be symmetric, meaning that

q(x|x′) = q(x′|x),    ∀x, x′.

We then perform the following steps T times, each time obtaining a new sample x^(t+1):

1. Propose a new sample x ∼ q(x|x^(t)).

2. Compute the acceptance probability

a = min( 1, p̃(x) / p̃(x^(t)) ).

Accept the new sample x with probability a. 3. If x is accepted, set x(t+1) = x and continue to the next iteration; otherwise repeat from the first step until x gets accepted.
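A compact sketch of a standard Metropolis random-walk sampler with a symmetric Gaussian proposal; the target p̃, step size, and burn-in length are illustrative assumptions (and, on rejection, this sketch keeps the current state as the next sample, as is conventional).

import numpy as np

def metropolis(log_p_tilde, x0, n_steps, step=0.5):
    """Metropolis algorithm with symmetric proposal q(x|x') = N(x', step^2)."""
    x, samples = x0, []
    for _ in range(n_steps):
        proposal = x + step * np.random.randn()
        # a = min(1, p_tilde(proposal) / p_tilde(x)), computed in log space
        if np.log(np.random.rand()) < log_p_tilde(proposal) - log_p_tilde(x):
            x = proposal
        samples.append(x)
    return np.array(samples)

log_p_tilde = lambda x: -0.5 * x**2                 # unnormalized standard normal
chain = metropolis(log_p_tilde, x0=5.0, n_steps=20000)
print(chain[2000:].mean(), chain[2000:].std())      # roughly 0 and 1 after burn-in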

2.2

Metropolis-Hastings Algorithm

In the Metropolis-Hastings algorithm, the proposal distribution q is no longer required to be symmetric. The sampling procedure is essentially the same as in the Metropolis algorithm, except the acceptance probability a′ is computed differently as

a′ = min( 1, [ p̃(x) q(x^(t)|x) ] / [ p̃(x^(t)) q(x|x^(t)) ] ).

Needless to say, the Metropolis algorithm is a special case of Metropolis-Hastings with a symmetric proposal distribution q.

2.3

Gibbs sampling

Suppose x can be decomposed as x = (x1 , · · · , xN ). The Gibbs sampling procedure iteratively samples x(t+1) based on x(t) by performing the following steps: 1. Set y (t+1) = x(t) .

2. For all i in {1, ..., N}, sample y_i^(t+1) from its conditional distribution p(y_i^(t+1) | y_{-i}^(t+1)).

3. Produce x^(t+1) = y^(t+1).

It can be shown that Gibbs sampling is a special case of the Metropolis-Hastings algorithm with proposal distribution q(x_i | x_{-i}) = p(x_i | x_{-i}). In particular, it can be shown that the acceptance probability a is always 1, as in the following derivation:

p̃(x) q(x^(t)|x) / [ p̃(x^(t)) q(x|x^(t)) ] = p̃(x) p(x_i^(t) | x_{-i}) / [ p̃(x^(t)) p(x_i | x_{-i}^(t)) ]
= p(x) p(x_i^(t) | x_{-i}) / [ p(x^(t)) p(x_i | x_{-i}^(t)) ]
= [ p(x_i | x_{-i}) p(x_{-i}) p(x_i^(t) | x_{-i}) ] / [ p(x_i^(t) | x_{-i}^(t)) p(x_{-i}^(t)) p(x_i | x_{-i}^(t)) ]
= 1,

where the last equality holds because only coordinate i changes, so x_{-i} = x_{-i}^(t) and all factors cancel.
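A small Gibbs sampler sketch for a bivariate Gaussian, where each conditional p(x_i | x_{-i}) is itself Gaussian in closed form; the correlation value is an illustrative assumption.

import numpy as np

def gibbs_bivariate_normal(n_steps, rho=0.8):
    """Gibbs sampling for (x1, x2) ~ N(0, [[1, rho], [rho, 1]]):
    each coordinate is resampled from its exact conditional."""
    x1, x2 = 0.0, 0.0
    samples = []
    cond_std = np.sqrt(1.0 - rho**2)
    for _ in range(n_steps):
        x1 = rho * x2 + cond_std * np.random.randn()   # p(x1 | x2) = N(rho*x2, 1-rho^2)
        x2 = rho * x1 + cond_std * np.random.randn()   # p(x2 | x1) = N(rho*x1, 1-rho^2)
        samples.append((x1, x2))
    return np.array(samples)

s = gibbs_bivariate_normal(20000)
print(np.corrcoef(s[5000:, 0], s[5000:, 1])[0, 1])     # should be close to rho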

2.4

Markov Chain Monte Carlo in General

All three examples above are instantiations of MCMC algorithms. An MCMC algorithm involves the specification of a stochastic process defined through a Markov chain. If some conditions are satisfied, it can be shown that in the limit, after we run the Markov chain for a long time, we get an independent sample from the desired probability distribution p. A Markov chain over some set of states S is specified by an initial probability distribution p^(0) : S → R_{≥0} over states and a transition probability operator T : S × S → R_{≥0}. The operator T determines the probability of transitioning from state x to state x′. It is assumed that the transition operator is the same across all time steps; the Markov chain is then said to be homogeneous. Given some probability distribution over states p^(t) at time-step t, the chain induces a probability distribution p^(t+1) at time-step t + 1 through the transition operator T as

p^(t+1)(x′) = ∫ T(x′|x) p^(t)(x) dx.

To guarantee that the Markov chain does in fact define a process that samples from the desired probability distribution p in the limit, the following properties must hold:

1. The desired distribution p must be invariant under the transitioning process of the Markov chain, i.e., p is a stationary distribution of the chain. Formally,

p(x′) = ∫ T(x′|x) p(x) dx.

2. The Markov chain must be ergodic. This means that irrespective of the choice of the initial probability distribution p^(0), the chain converges to p: p^(t)(x) → p(x) as t → ∞, for any p^(0). Some of the reasons why a chain may fail to be ergodic are the existence of periodic (i.e., cyclic) behaviour, or the existence of two or more subsets of the state space that cannot be reached from each other.

The convergence of the distribution of states to the stationary distribution is usually referred to as mixing.


Detailed balance is a sufficient condition on the transition operator T under which the desired distribution p is an invariant of the chain. The condition is

T(x′|x) p(x) = T(x|x′) p(x′).

Detailed balance means that, for each pair of states x and x′, arriving at x and then transitioning to x′ is equiprobable to arriving at x′ and then transitioning to x. It is easy to verify that the condition implies the desired invariance. By integrating both sides over x′, we get

∫ T(x′|x) p(x) dx′ = ∫ T(x|x′) p(x′) dx′.

We then notice that ∫ T(x′|x) dx′ = 1, therefore

p(x) = ∫ T(x|x′) p(x′) dx′,

which is the desired invariance condition. Furthermore, if p is invariant under the chain and

ν = min_{x, x′ : p(x′) > 0} T(x′|x) / p(x) > 0,

then p is the stationary distribution of the chain.

Going back to Metropolis-Hastings, it is easy to show that detailed balance is satisfied by its transition operator. The transition operator for Metropolis-Hastings is

T(x′|x) = q(x′|x) a(x′; x) = q(x′|x) min( 1, p(x′) q(x|x′) / [ p(x) q(x′|x) ] ),

where q is the proposal distribution and a(x′; x) is the probability of accepting state x′ given that we are now in state x. We first propose a new state x′ according to the proposal distribution q, and accept it with probability a(x′; x). We can now verify that detailed balance holds:

T(x′|x) p(x) = p(x) q(x′|x) min( 1, p(x′) q(x|x′) / [ p(x) q(x′|x) ] )
= min( p(x) q(x′|x), p(x′) q(x|x′) )
= p(x′) q(x|x′) min( p(x) q(x′|x) / [ p(x′) q(x|x′) ], 1 )
= p(x′) T(x|x′).

Another way of constructing a transition operator T that satisfies the invariance property for p is by mixing or concatenating other transition operators T_1, ..., T_n that also satisfy the invariance property. For the mixture case,

T(x′|x) = Σ_{i=1}^n α_i T_i(x′|x),

with Σ_{i=1}^n α_i = 1 and α_i ≥ 0 for all i ∈ [n]; the invariance is easily shown:

∫ T(x′|x) p(x) dx = ∫ ( Σ_{i=1}^n α_i T_i(x′|x) ) p(x) dx = Σ_{i=1}^n α_i ∫ T_i(x′|x) p(x) dx = Σ_{i=1}^n α_i p(x′) = p(x′),

where we used the invariance of the individual T_1, ..., T_n and Σ_{i=1}^n α_i = 1.

2.5

Practical considerations of Markov Chain Monte Carlo

Markov chain Monte Carlo is widely applicable, but it is not without its problems. While in theory it is guaranteed to converge in the limit to the desired probability distribution p, in practice getting sufficiently close to convergence (i.e., mixing) may take a very large number of steps. Even worse, it is notoriously difficult to assess whether the chain has converged or not. This means that answers computed using samples from a chain that has not mixed properly may be completely wrong in ways that are hard to diagnose. MCMC methods usually have some number of hyperparameters, and the behaviour of the methods depends critically on their values. One example is the scale parameter of the transition operator T: if chosen too small, the process will move with very small steps through the state space; if chosen too large, most of the proposed steps will be rejected and the process will stay a long time in the same place; in both cases the chain will take a very long time to converge. Setting the parameters incorrectly leads to large mixing times, making the methods impractical. Even with appropriate settings, mixing may still be very slow due to the random-walk type of exploration of the space. There is also a significant number of design decisions and hyperparameters that are hard to set: which transition operator to use; what scale parameter for the transition operator; how many steps for the initial burn-in period; how many samples to collect; how to assess convergence; whether to run one long chain versus multiple shorter chains; whether to use all samples of the chain for inference versus just a few independent ones, and if just a few, how many steps apart samples should be before we consider them independent. Some of these issues are addressed heuristically to some degree. For example, a common heuristic for the scale parameter of the transition operator T is to set it such that we accept about half of the proposed steps. For assessing the convergence of the chain, it is common to run several chains in parallel with different initial conditions and gather statistics of the process; if the different chains look similar after some number of steps, we can be reasonably confident that the process has converged. Nonetheless, these heuristics do not address the fundamental slowness of exploring the state space by means of a random walk. A rule of thumb is that we need on the order of (σ_max / σ_min)² steps to get independent samples from the chain, where σ_max and σ_min are the maximum and minimum length scales of the desired distribution p. Roughly speaking, the more correlated the random variables in p are, the larger the mixing time of the chain.
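As a sketch of the multiple-chains heuristic for assessing convergence, the following computes a simple between/within-chain statistic in the spirit of the Gelman-Rubin diagnostic; the chains here are synthetic, and the exact formula is one common variant rather than something specified in the lecture.

import numpy as np

def potential_scale_reduction(chains):
    """chains: array of shape (num_chains, num_samples). Values near 1 suggest
    the chains are sampling from similar distributions (i.e., have mixed)."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)               # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()         # within-chain variance
    var_plus = (n - 1) / n * W + B / n
    return np.sqrt(var_plus / W)

# Synthetic example: chains started from different initial conditions.
offsets = np.array([[5.0], [0.0], [-5.0], [1.0]]) * np.linspace(1, 0, 5000)
chains = np.random.randn(4, 5000) + offsets
print(potential_scale_reduction(chains[:, 2500:]))   # closer to 1 on the later samples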

10-708: Probabilistic Graphical Models 10-708, Spring 2016

17 : MCMC (cont’d) and Intro to Topic Modeling Lecturer: Matthew Gormley

1

Scribes: Chun-Liang Li, Yanyu Liang, Mengxin Li

MCMC with Auxiliary Variables

In Gibbs sampling, we try to sample "fewer" variables each time: we only sample one variable by conditioning on the remaining ones. However, we can also go the reverse way by introducing more variables to be sampled. For any distribution p(x), we know that p(x) = ∫_u p(x, u) du. Suppose we want to sample from a certain distribution p(x) that is hard to sample from. We then introduce an "auxiliary" variable u, a random variable that does not exist in the model but is introduced into the model to facilitate sampling. We hope the joint distribution p(x, u) is easy to navigate and the conditional distributions p(x|u) and p(u|x) are easy to sample. Then sampling from p(x, u) is easier than sampling from p(x), and we can marginalize out u to get p(x) back. Next, we discuss two such approaches: slice sampling and Hamiltonian Monte Carlo.

1.1

Slice Sampling

Figure 1: Slice Sampling.

Assume we want to sample from P(x), and P̃(x) ∝ P(x), where we can evaluate P̃(x). We then define u as the auxiliary variable, and p(u|x) is uniform between 0 and P̃(x). Also, we define p(x|u) as uniform over {x′ | P̃(x′) ≥ u}. The samples (x, u) are uniformly distributed under the area of P̃(x). We can then obtain p(x) by marginalizing out u.


Figure 2: Example of the sampling procedure of slice sampling.

The algorithm is shown below and illustrated in Figure 2.
• Initialize x_0.
• Sample u ∼ Unif(0, p̃(x_0)).
• Sample x uniformly from {z | P̃(z) ≥ u}.
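A minimal slice-sampling sketch that locates the slice with the bracketing (stepping-out and shrinking) approach mentioned in the next subsection; the target density and bracket width are illustrative assumptions.

import numpy as np

def slice_sample(p_tilde, x0, n_steps, width=2.0):
    """Slice sampling: u ~ Unif(0, p_tilde(x)), then x ~ Unif over {z : p_tilde(z) >= u},
    where the slice is located by stepping out a bracket and shrinking on rejection."""
    x, samples = x0, []
    for _ in range(n_steps):
        u = np.random.uniform(0.0, p_tilde(x))
        # Step out a bracket [lo, hi] around x until it covers the slice.
        lo = x - width * np.random.rand()
        hi = lo + width
        while p_tilde(lo) > u:
            lo -= width
        while p_tilde(hi) > u:
            hi += width
        # Shrink the bracket until a point inside the slice is found.
        while True:
            z = np.random.uniform(lo, hi)
            if p_tilde(z) >= u:
                x = z
                break
            lo, hi = (z, hi) if z < x else (lo, z)
        samples.append(x)
    return np.array(samples)

p_tilde = lambda x: np.exp(-0.5 * x**2)       # unnormalized standard normal
s = slice_sample(p_tilde, x0=0.0, n_steps=10000)
print(s.mean(), s.std())                      # roughly 0 and 1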

1.1.1

Computational Concern

Finding the set {z | P̃(z) ≥ u} can be computationally expensive or infeasible. In that case, bracket slice sampling can be applied: we use a horizontal bracket containing x_0, then extend or shrink the bracket to search for a proper bracket approximating {z | P̃(z) ≥ u}.

1.1.2

Discussion

The advantages of slice sampling include:
• No tuning parameters
• No rejection
However, this method still suffers from random-walk behaviour, as Gibbs sampling does.

1.2

Hamiltonian Monte Carlo

Hamiltonian Monte Carlo is another auxiliary variable method, which makes use of not only the probability distribution but also the gradient of the probability for sampling. Consider a Boltzmann distribution of x ∈ R^N; it can be written as p(x) = Z_x⁻¹ exp{−E(x)}. By introducing an independent auxiliary variable q ∈ R^N, q ∼ N(0, I), the joint distribution is p(x, q) = Z_x⁻¹ Z_q⁻¹ exp{−E(x) − qᵀq/2}. The sampling is then divided into two steps. The first step is similar to block Gibbs sampling: given x^(t), sample q^(t+1). The second step is to update x. Since x and q are independent, p(q|x) is Gaussian and sampling from it is simple. The update of x follows the Metropolis algorithm, and more specifically, the transition is proposed by Hamiltonian dynamics. The following first introduces Hamiltonian dynamics and its properties, and then talks about Hamiltonian Monte Carlo in detail.


Figure 3: Euler’s method and Leapfrog algorithm for approximating Hamiltonian Dynamics

1.2.1

Hamiltonian Dynamics

Given two physical quantities x, q ∈ R^N, let H(x, q) be the Hamiltonian of the system. Hamiltonian dynamics describes their evolution over time through the following relationships [1]:

dx_i/dt = ∂H/∂q_i    (1)
dq_i/dt = −∂H/∂x_i    (2)

A simple example is motion in one dimension under a constant force, where x is the position, q is the momentum, and the Hamiltonian is kx + q²/(2m). From Eqs. (1) and (2), we get:

dx/dt = q/m,    dq/dt = −k.

From Newton's Law, the mechanical energy (namely the Hamiltonian) of the above motion is unchanged, and it turns out that Hamiltonian dynamics always guarantees that the Hamiltonian is invariant over time. Additionally, the evolution over time is reversible in Hamiltonian dynamics: namely, if from (x, q) the evolution runs for time ∆t and reaches (x′, q′), it takes exactly the same time for the system to evolve from (x′, −q′) to (x, −q). By discretizing time, Hamiltonian dynamics can be simulated on a computer. Euler's method and the Leapfrog algorithm are two ways to do so; their update rules are slightly different, which gives different performance. Figure 3 [1] shows the simulation of a system with x²/2 + q²/2 as the Hamiltonian. Sometimes Euler's method does not converge and its accuracy is low. By contrast, Leapfrog is more stable and still satisfies reversibility. The update rule of the Leapfrog method is shown as follows:


q_i(t + ε/2) = q_i(t) − (ε/2) ∂E/∂x_i |_t
x_i(t + ε) = x_i(t) + ε q_i(t + ε/2)/m_i
q_i(t + ε) = q_i(t + ε/2) − (ε/2) ∂E/∂x_i |_{t+ε},

where ε is the step size. In practice, once we fix the step size and the number of steps, from a starting state (x, q) the Leapfrog method proposes a new state (x′, q′) where H(x, q) is approximately equal to H(x′, q′), and such a transition is reversible.

1.2.2

Algorithm and Properties of HMC

In our case, let K(q) = qᵀq/2 and set the Hamiltonian of the system to H(x, q) = E(x) + K(q). The algorithm of Hamiltonian Monte Carlo (HMC) is as follows:
Step 1: Draw a new sample q′ ∼ p(q|x).
Step 2: Run Leapfrog on H(x, q′) with step size ε for L steps, obtaining (x′, q″), where H(x, q′) ≈ H(x′, q″).
Step 3: Use (x′, q″) as the proposed sample and accept it with probability min{1, exp{H(x, q′) − H(x′, q″)}}.
Step 1 is a block Gibbs sampling step, where the proposal distribution is a conditional distribution. By construction, p(q|x) = p(q) ∼ N(0, I), so it satisfies detailed balance. Steps 2 and 3 are a Metropolis algorithm. The transition from (x, q′) to (x′, q″) is defined by Hamiltonian dynamics. Since the Leapfrog method is reversible, R(x′, q″ ← x, q′) = R(x, −q′ ← x′, −q″) = 1. If we artificially reverse the direction of q″ to −q″ after running Leapfrog, we have R(x′, −q″ ← x, q′) = R(x, q′ ← x′, −q″) = 1, which satisfies the Metropolis algorithm's condition. Using the fact that K(q) = K(−q), the acceptance rule can be simplified as:

Since H(x′, −q″) = H(x′, q″) and H(x, −q′) = H(x, q′), we have

min{1, p(x′, −q″)/p(x, −q′)} = min{1, exp{H(x, −q′) − H(x′, −q″)}} = min{1, exp{H(x, q′) − H(x′, q″)}},

which is exactly what is checked in Step 3. The Metropolis algorithm satisfies detailed balance, so HMC satisfies detailed balance. Ideally, Hamiltonian dynamics leaves H invariant, which leads to 100% acceptance. In reality, the Leapfrog method may introduce some error, so Step 3 is necessary, but the acceptance rate is very high. Fig. 4 [1]


Figure 4: The trajectory for a two-dimensional Gaussian distribution sampled by HMC

Figure 5: HMC vs. M-H

shows an example of sampling a 2-D Gaussian using HMC. Unlike a random walk, the trajectory goes from the lower left-hand side to the upper right-hand side because of the Leapfrog step. Fig. 5 [1] is a comparison between the results of HMC and the M-H algorithm. Thanks to Hamiltonian dynamics, samples are more likely to make big jumps across the space while still holding a high probability of being accepted.
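A compact HMC sketch using the Leapfrog updates above for the quadratic energy E(x) = xᵀx/2; the step size and number of leapfrog steps are illustrative assumptions.

import numpy as np

def hmc(grad_E, E, x0, n_samples, eps=0.1, L=20):
    """Hamiltonian Monte Carlo: resample momentum q, run Leapfrog, then
    accept with probability min(1, exp(H(x, q) - H(x', q')))."""
    x = np.array(x0, dtype=float)
    samples = []
    for _ in range(n_samples):
        q = np.random.randn(*x.shape)                     # Step 1: q ~ N(0, I)
        x_new, q_new = x.copy(), q.copy()
        q_new -= 0.5 * eps * grad_E(x_new)                # Step 2: Leapfrog integration
        for _ in range(L):
            x_new += eps * q_new
            q_new -= eps * grad_E(x_new)
        q_new += 0.5 * eps * grad_E(x_new)                # correct the last update to a half step
        H_old = E(x) + 0.5 * q @ q
        H_new = E(x_new) + 0.5 * q_new @ q_new
        if np.random.rand() < np.exp(H_old - H_new):      # Step 3: Metropolis correction
            x = x_new
        samples.append(x.copy())
    return np.array(samples)

E = lambda x: 0.5 * np.sum(x**2)
grad_E = lambda x: x
s = hmc(grad_E, E, x0=np.zeros(2), n_samples=5000)
print(s.mean(axis=0), s.std(axis=0))                      # roughly [0, 0] and [1, 1]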

2

Introduction to Topic Modeling

Topic modeling is a method (usually unsupervised) for discovering latent or hidden structure in a corpus. Suppose you are given a massive corpus and asked to carry out the following tasks:
• Organize the documents into thematic categories.
• Describe the evolution of those categories over time.
• Enable a domain expert to analyze and understand the content.
• Find relationships between the categories.
• Understand how authorship influences the content.
Topic modeling provides a modeling toolbox for these tasks. Although it is applied primarily to text corpora, the techniques can be generalized to solve problems in other fields, including computer vision and bioinformatics.


2.1

Beta Bernoulli Model

The Beta-Bernoulli Model is a simple Bayesian model that can be used to model a corpus in which the words are binary random variables whose prior is modeled by the Beta distribution. The Beta distribution is a conjugate distribution that can be written as:

f(φ | α, β) = (1 / B(α, β)) φ^{α−1} (1 − φ)^{β−1}.

The Beta distribution is illustrated in Figure 6.

Figure 6: Beta distribution

The generative process for the Beta-Bernoulli model can be described as:
• Draw φ ∼ Beta(α, β)
• For each word n ∈ {1, ..., N}: draw x_n ∼ Bernoulli(φ)

2.2

Dirichlet-Multinomial Model

Similar to the Beta-Bernoulli Model, the Dirichlet-Multinomial Model can also model a corpus, but one in which the words are multinomial random variables whose prior is modeled by the Dirichlet distribution. The Dirichlet distribution is a conjugate distribution that can be written as:

p(φ⃗ | α) = (1 / B(α)) Π_{k=1}^K φ_k^{α_k − 1},    where B(α) = Π_{k=1}^K Γ(α_k) / Γ(Σ_{k=1}^K α_k).

The Dirichlet distribution is illustrated in Figure 7. The generative process for the Dirichlet-multinomial model can be described as:
• Draw φ ∼ Dir(β)
• For each word n ∈ {1, ..., N}: draw x_n ∼ Mult(1, φ)


Figure 7: Dirichlet distribution

2.3

Dirichlet-Multinomial Mixture Model

When we take one step further from the Dirichlet-multinomial model, instead of just generating words, we also want to generate independent documents so that each document has a particular topic. The generative process for the Dirichlet-multinomial mixture model can be described as below:
• For each topic k ∈ {1, ..., K}: draw φ_k ∼ Dir(β)
• Draw θ ∼ Dir(α)
• For each document m ∈ {1, ..., M}:
  – Draw z_m ∼ Mult(1, θ)
  – For each word n ∈ {1, ..., N_m}, draw x_n ∼ Mult(1, φ_{z_m})

2.4

Latent Dirichlet Allocation

The problem with the Dirichlet-multinomial mixture model is that it cannot model documents which have more than one topic. Latent Dirichlet Allocation (LDA) was proposed to tackle this drawback. In LDA, each document has its own probability distribution over the topics. The generative process for LDA can be described as below:
• For each topic k ∈ {1, ..., K}: draw φ_k ∼ Dir(β)
• For each document m ∈ {1, ..., M}:
  – Draw θ_m ∼ Dir(α)
  – For each word n ∈ {1, ..., N_m}:
    ∗ Draw z_mn ∼ Mult(1, θ_m)
    ∗ Draw x_mn ∼ Mult(1, φ_{z_mn})
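A short sketch that simulates the LDA generative process above with numpy; the vocabulary size, topic count, document length, and hyperparameters are arbitrary illustrative choices.

import numpy as np

def generate_lda_corpus(M=5, K=3, V=20, N_m=50, alpha=0.5, beta=0.1, seed=0):
    """Simulate LDA: topics phi_k ~ Dir(beta); per-document theta_m ~ Dir(alpha);
    each word gets a topic z ~ Mult(theta_m) and a word x ~ Mult(phi_z)."""
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet(np.full(V, beta), size=K)          # K topics over V word types
    docs = []
    for _ in range(M):
        theta = rng.dirichlet(np.full(K, alpha))           # topic proportions for this doc
        z = rng.choice(K, size=N_m, p=theta)               # topic assignment per word
        words = np.array([rng.choice(V, p=phi[k]) for k in z])
        docs.append(words)
    return docs, phi

docs, phi = generate_lda_corpus()
print(docs[0][:10])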


References [1] R. M. Neal et al., “Mcmc using hamiltonian dynamics,” Handbook of Markov Chain Monte Carlo, vol. 2, pp. 113–162, 2011.

10-708: Probabilistic Graphical Models 10-708, Spring 2016

18 : Dirichlet Process and Dirichlet Process Mixtures Lecturer: Matt Gormley

1

Scribes: Chiqun Zhang, Hsu-Chieh Hu

Wrap up topic modeling

1.1

Latent Dirichlet Allocation

Latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics. The plate diagram of the LDA model is given in Figure 1.

Figure 1: Plate diagram for the LDA model.

The generative story of LDA begins with only a Dirichlet prior over topics. Each topic is defined as a Multinomial distribution over the vocabulary, parameterized by φ_k. Since LDA is unsupervised learning, a topic in LDA is visualized by its high-probability words, and a pedagogical label is used to identify the topic. LDA can be decomposed into two parts: the distributions over words and the distributions over topics. Before we step into inference, it is natural to ask: "Is this a believable story for the generation of a corpus of documents?" or "Why might it work well anyway?"

The answer is that LDA is trading off two goals:


• For each document, allocate its words to as few topics as possible.
• For each topic, assign high probability to as few terms as possible.
Because putting a document in a single topic requires all of its words to have probability under that topic, it makes the second goal hard. On the other hand, putting very few words in each topic requires assigning many topics to a document in order to cover its words, which makes the first goal hard. LDA trades off these goals to find groups of tightly co-occurring words.

1.2

LDA Inference

The standard EM algorithm cannot be applied to LDA because α and β in LDA are also latent. In addition, it is intractable to do exact inference for all of z, θ and φ, because exact MAP inference in LDA is NP-hard for a large number of topics. For example, to compute the posterior in LDA we could apply the Junction tree algorithm, which can be summarized in three steps:
• "Moralization" converts the directed graph to an undirected graph.
• "Triangulation" breaks the 4-cycles by adding edges.
• Cliques are arranged into a junction tree.
For this algorithm, the time complexity is exponential in the size of the cliques, which in LDA involve the topics. Since LDA cliques will be large, at least O(topics), the complexity is O(2^topics). Also, since the parameters are highly coupled, naively sampling all of z, θ and φ is intractable as well.

To handle this problem, we apply the collapsed Gibbs sampler method, whose general idea is given in Figure 2. In this method, θ and φ are integrated out, and the partially marginalized distribution of z given α and β becomes a Dirichlet-Multinomial.

Figure 2: General illustration of the collapsed Gibbs sampler method


1.3

Gibbs sampling for LDA

First, we need to derive the full conditionals:

p(z_i = k | Z_{−i}, X, α, β) = p(X, Z | α, β) / p(X, Z_{−i} | α, β) ∝ p(X, Z | α, β)

p(X, Z | α, β) = p(X | Z, β) p(Z | α) = ∫_Φ p(X | Z, Φ) p(Φ | β) dΦ · ∫_Θ p(Z | Θ) p(Θ | α) dΘ
= ( Π_{k=1}^K B(n_k + β) / B(β) ) ( Π_{m=1}^M B(n_m + α) / B(α) )

p(z_i = k | Z_{−i}, X, α, β) ∝ (n_{kt}^{−i} + β_t) / (Σ_{v=1}^T n_{kv}^{−i} + β_v) · (n_{mk}^{−i} + α_k) / (Σ_{j=1}^K n_{mj}^{−i} + α_j)

where t and m are given by i: n_kt is the number of times topic k appears with term t, and n_mk is the number of times topic k appears in document m.

A property of Gibbs sampling for LDA is that the Dirichlet is conjugate to the Multinomial. In LDA, we draw the distribution over words from the Dirichlet distribution; then we draw every word from the Multinomial. The posterior of φ can then be written as p(φ|X) = P(X|φ)p(φ)/p(X), which turns out to be a Dirichlet distribution Dir(β + n), where the vector n counts the number of times every word appears.

The Gibbs sampling for LDA algorithm (Figure 3) can be divided into two parts: the initialization and the sampling. In the initialization, every word is assigned a random topic, and the document-topic counts n_m^(k), the topic-term counts n_k^(t), and their sums n_m, n_k are set accordingly. In the sampling loop, over the burn-in and sampling periods, for each word w_{m,n} the counts of its current topic assignment are decremented, a new topic k̃ is sampled from the full conditional p(z_i | z_{¬i}, w) above, and the counts are incremented for the new assignment. After convergence, and every L sampling iterations since the last read-out, the parameter sets θ and φ are read out from the counts, and the different read-outs are averaged.

Figure 3: The Gibbs sampling for LDA algorithm.

Also, we should recall that Gibbs sampling is a special case of the Metropolis-Hastings method with a special proposal distribution, which ensures that the Hastings ratio is always 1.0. The observed w_{m,n} and the corresponding z_{m,n} are the state variables of the Markov chain. The strategy of integrating out some of the parameters for model inference is often referred to as a "collapsed" [Neal00] or Rao-Blackwellised [CaRo96] approach, which is often used in Gibbs sampling. The target of inference is the distribution p(z|w), which is directly proportional to the joint distribution:

p(z|w) = p(z, w) / p(w) = Π_{i=1}^W p(z_i, w_i) / Π_{i=1}^W Σ_{k=1}^K p(z_i = k, w_i),    (62)

where the hyperparameters are omitted. This distribution covers a large space of discrete random variables, and the difficult part of evaluating it is its denominator, which represents a summation over K^W terms. At this point, the Gibbs sampling procedure comes into play: the desired Gibbs sampler runs a Markov chain over the topic assignments z.


1.4

Extensions of LDA

1.4.1

Correlated topic models

The Dirichlet is a distribution on the simplex, i.e., positive vectors that sum to 1, and it assumes that the components are nearly independent. However, in real data, an article about fossil fuels is more likely to also be about geology than about genetics. In correlated topic models, we apply the logistic normal distribution, which can model dependence between components (Aitchison, 1980). In this distribution, the log of the parameters of the multinomial are drawn from a multivariate Gaussian distribution:

X ∼ N_K(µ, Σ),    θ_i ∝ exp{x_i}.

Figure 4 shows a plate diagram for the correlated topic model (Blei & Lafferty, 2004). The basic properties are:
1. Topic proportions are drawn from a logistic normal, which allows topic occurrences to exhibit correlation.
2. It provides a "map" of topics and how they are related.
3. It provides a better fit to text data, but computation is more complex.
We should also notice that the correlated topic model does not use the Dirichlet distribution, so it is no longer conjugate.


Figure 4: Plate diagram for correlated topic models.

1.4.2

Dynamic topic models

In dynamic topic models, we also consider the effect of time evolution on the model. For example, in a document classification problem, the documents are divided up by year; we start with a separate topic model for each year and then add a dependence of each year on the previous one. Figure 5 shows the plate diagram of the dynamic topic model (Blei & Lafferty, 2006). Recall that LDA assumes that the order of the documents does not matter, but this assumption is not appropriate for sequential corpora. In addition, we may also want to track how language changes over time. In the dynamic topic model, the topics are allowed to drift in a sequence.

Figure 5: Plate diagram for dynamic topic models. The topics β_{k,1}, ..., β_{k,T} drift through time.

1.4.3

Polylingual topic model

The polylingual topic model (Mimno et al., 2009) is an extension of latent Dirichlet allocation (LDA) for modeling polylingual document tuples. Each tuple is a set of documents that are loosely equivalent to each other, but written in different languages. The data setting is that comparable versions of each document exist in multiple languages; for example, the Wikipedia article for "Barack Obama" exists in twelve languages. The polylingual topic model is very similar to LDA, except that the topic assignments, z, and words, w, are sampled separately for each language. Figure 6 shows the plate diagram of the polylingual topic model.

Figure 6: Plate diagram for polylingual topic models.

1.4.4

Supervised LDA

LDA is an unsupervised model, but much data comes paired with response variables. For example, user reviews can be paired with a number of stars; web pages can be paired with a number of "likes"; documents can be paired with links to other documents; images can be paired with a category. Supervised LDA models are topic models of documents and responses, fit to find topics predictive of the response. In supervised LDA, we add to LDA a response variable associated with each document, and we jointly model the documents and the responses in order to find latent topics that will best predict the response variables for future unlabeled documents. Figure 7 shows the plate diagram for this model.

Figure 7: Plate diagram for supervised LDA. The generative process is: draw topic proportions θ | α ∼ Dir(α); for each word, draw a topic assignment z_n | θ ∼ Mult(θ) and a word w_n | z_n, β_{1:K} ∼ Mult(β_{z_n}); finally, draw the response variable y | z_{1:N}, η, σ² ∼ N(ηᵀ z̄, σ²), where z̄ = (1/N) Σ_{n=1}^N z_n.

2

Dirichlet Process and Dirichlet Process Mixtures

2.1

Introduction

In parametric modeling, it is assumed that data can be represented by models using a fixed, finite number of parameters. Examples of parametric models include clusters of K Gaussians and polynomial regression models. In many problems, determining the number of parameters a priori is difficult; for example, selecting the number of clusters in a cluster model, the number of segments in an image segmentation problem, the number of chains in a hidden Markov model, or the number of topics in a topic modeling problem before the data is seen can be problematic. In nonparametric modeling, the number of parameters is not fixed, and often grows with the sample size. Kernel density estimation is an example of a nonparametric model. In Bayesian nonparametrics, the number of parameters is itself considered to be a random variable. One example is to do clustering with k-means (or a mixture of Gaussians) while the number of clusters k is unknown. Bayesian inference addresses this problem by treating k itself as a random variable. A prior is defined over an infinite-dimensional model space, and inference is done to select the number of parameters. Such models have infinite capacity, in that they include an infinite number of parameters a priori; however, given finite data, only a finite set of these parameters will be used. Unused parameters will be integrated out.

2.2

Parametric vs. Nonparametric

• Parametric models:
  – Finite and fixed number of parameters
  – The number of parameters is independent of the dataset
• Nonparametric models:
  – Do have parameters ("infinite dimensional" would be a better name)
  – Can be understood as having an infinite number of parameters
  – Can be understood as having a random number of parameters


  – The number of parameters can grow with the dataset
• Semiparametric models:
  – Have a parametric component and a nonparametric component

Figure 8: Frequentist and Bayesian methods for Parametric and Nonparametric

Figure 9: Different applications for Parametric and Nonparametric

Definition: a model is a collection of distributions

{p_θ : θ ∈ Θ}.    (1)

In a parametric model, the parameter vector is finite dimensional:

Θ ⊂ R^k.    (2)

In a nonparametric model, the parameters come from a possibly infinite-dimensional space:

Θ ⊂ F.    (3)

2.3

Motivations

Model selection is an operation that is fraught with difficulties, whether we use cross validation or marginal probabilities as the basis for selection. The Bayesian nonparametric approach is an alternative to parametric modeling and selection. There are two motivations for Bayesian nonparametric models: In clustering, the actual number of clusters used to model data is not fixed, and can be automatically inferred from data using the usual Bayesian posterior inference framework. The equivalent operation for finite mixture models would be model averaging or model selection for the appropriate number of components, an approach which is fraught with difficulties. Thus infinite mixture models as exemplified by DP mixture models provide a compelling alternative to the traditional finite mixture model paradigm. In density estimation, we are interested in modeling the density from which a given set of data is drawn. To avoid limiting ourselves to any parametric class, we may again use a nonparametric prior over an infinite set of distributions.

2.4

Exchangeability and de Finetti's Theorem

Definition 1: A joint probability distribution is exchangeable if it is invariant to permutation. Definition 2: A possibly infinite sequence of random variables (X_1, X_2, X_3, ...) is exchangeable if for any finite permutation s of the indices (1, 2, ..., n):

p(X_1, X_2, ..., X_n) = p(X_{s(1)}, X_{s(2)}, ..., X_{s(n)})

(4)

Exchangeability is different from independent and identically distributed (i.i.d.): it means that it does not matter if the data is reordered. De Finetti's theorem states that if (X_1, X_2, X_3, ...) is infinitely exchangeable, then the joint distribution has a representation as a mixture:

p(x_1, x_2, ..., x_n) = ∫ Π_{i=1}^n p(x_i | θ) dP(θ)    (5)

for some distribution P over θ.

2.5

Chinese Restaurant Process (CRP)

The distribution over partitions can be described in terms of the following restaurant metaphor of Figure 10. We assume that a Chinese restaurant has infinite tables, each of which can seat infinite customers. In addition, there is only one dish on each table.


The first customer enters the restaurant and sits at the first table. The second customer enters and decides either to sit with the first customer or to choose a new table. In the general case, the (n+1)st customer either chooses to join an already occupied table k with probability proportional to the number of customers n_k already sitting there, or sits at a new table with probability proportional to α. In this metaphor, customers are identified with the integers 1, 2, 3, ... and tables with clusters. When all n customers have sat down at the tables, they are partitioned into clusters, which exhibits the clustering property of the Dirichlet process described above.

Figure 10: Chinese Restaurant Process

10-708: Probabilistic Graphical Models 10-708, Spring 2016

Lecture 19: Indian Buffet Process Lecturer: Matthew Gormley

1

Scribes: Kai-Wen Liang, Han Lu

Dirichlet Process Review

1.1

Chinese Restaurant Process

In probability theory, the Chinese restaurant process is a discrete-time stochastic process, analogous to seating customers at an infinite number of tables in a Chinese restaurant. Assume each customer enters and sits down at a table. The way they sit at the tables follows the process below:
• The first customer sits at the first unoccupied table
• Each subsequent customer chooses a table according to the following probability distribution:
p(kth occupied table) ∝ n_k,    p(next unoccupied table) ∝ α
In the end, we have the number of people sitting at each table. This corresponds to a distribution over clusterings, where customer = index and table = cluster. Although the CRP allows a potentially infinite number of clusters, the expected number of clusters given n customers is O(α log(n)). The table counts also exhibit the rich-get-richer effect on clusters. As α goes to 0, the number of clusters goes to 1, while as α goes to +∞, the number of clusters goes to n.
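A small sketch of simulating table assignments from the CRP above; the concentration value and number of customers are illustrative choices.

import numpy as np

def crp(n_customers, alpha, seed=0):
    """Chinese restaurant process: customer i joins table k with probability
    proportional to n_k, or a new table with probability proportional to alpha."""
    rng = np.random.default_rng(seed)
    counts = []                                # n_k for each occupied table
    assignments = []
    for i in range(n_customers):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)                   # open a new table
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, counts

assignments, counts = crp(1000, alpha=2.0)
print(len(counts))    # number of tables, expected to be O(alpha * log n)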

1.2

CRP Mixture Model

Here we denote by z_1, z_2, ..., z_n a sequence of indices drawn from a Chinese Restaurant Process, where n is the number of customers. For each table/cluster we also draw a distribution θ_k* from a base distribution H. Although there is an infinite number of tables/clusters available in the CRP, the maximum number of clusters/tables is the number of customers (i.e., n). Finally, for each customer z_i (cluster index), we draw


an observation x_i from p(x_i | θ_{z_i}*). In the Chinese restaurant story, we can view z_i as the table assignment of the ith customer, θ_k* as the table-specific distribution over dishes, and x_i as the dish that the ith customer ordered, following the table-specific dish distribution. The next thing we want is to solve the inference problem (i.e., computing the distribution of z and θ given the observations x). Because of the exchangeability of the CRP, a Gibbs sampler is easy to derive: for each observation, we can remove the customer/dish from the restaurant and resample as if they were the last to enter. Here we describe three Gibbs samplers for the CRP mixture model.
• Algorithm 1 (uncollapsed)
  – Markov chain state: per-customer parameters θ_1, θ_2, ..., θ_n
  – For i = 1, ..., n: draw θ_i ∼ p(θ_i | θ_{−i}, x)
• Algorithm 2 (uncollapsed)
  – Markov chain state: per-customer cluster indices z_1, ..., z_n and per-cluster parameters θ_1*, ..., θ_K*
  – For i = 1, ..., n: draw z_i ∼ p(z_i | z_{−i}, x, θ*)
  – Set K = number of clusters in z
  – For k = 1, ..., K: draw θ_k* ∼ p(θ_k* | {x_i : z_i = k})
• Algorithm 3 (collapsed)
  – Markov chain state: per-customer cluster indices z_1, ..., z_n
  – For i = 1, ..., n: draw z_i ∼ p(z_i | z_{−i}, x)
For Algorithm 1, if θ_i = θ_j then i and j are in the same cluster. For Algorithm 2, since it is uncollapsed, it is hard to draw a new z_i under the conditional distribution.

1.3

Dirichlet Process

• Parameters of a DP:
  – Base distribution, H, a probability distribution over Θ
  – Strength parameter, α ∈ R
• We say G ∼ DP(α, H) if for any partition A_1 ∪ A_2 ∪ · · · ∪ A_K = Θ we have:
(G(A_1), ..., G(A_K)) ∼ Dirichlet(αH(A_1), ..., αH(A_K))
The above definition says that the DP is a distribution over probability measures such that the marginals on finite partitions are Dirichlet distributed. Given the definition above, we have the following properties:
• The base distribution is the mean of the DP: E[G(A)] = H(A) for any A ⊂ Θ
• The strength parameter is like an inverse variance: V[G(A)] = H(A)(1 − H(A)) / (α + 1)
• Samples from a DP are discrete distributions (the stick-breaking construction of G ∼ DP(α, H) makes this clear)
• The posterior distribution of G ∼ DP(α, H) given samples θ_1, ..., θ_n from G is again a DP:
G | θ_1, ..., θ_n ∼ DP( α + n, (α/(α+n)) H + (n/(α+n)) (1/n) Σ_{i=1}^n δ_{θ_i} )

1.4

Stick Breaking Construction

The stick-breaking construction provides a constructive definition of the Dirichlet process as follows:
• Start with a stick of length 1 and break it at β_1. The length of the broken-off piece is π_1.
• Recursively break the rest of the stick to obtain β_2, β_3, ... and π_2, π_3, ..., where β_k ∼ Beta(1, α) and π_k = β_k Π_{l=1}^{k−1} (1 − β_l).
Also, we draw θ_k* from the base distribution H. Then G = Σ_{k=1}^∞ π_k δ_{θ_k*} ∼ DP(α, H).
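A truncated stick-breaking sketch that draws an approximate sample G from a DP; the truncation level, α, and base distribution are illustrative assumptions.

import numpy as np

def stick_breaking_dp(alpha, base_sampler, truncation=1000, seed=0):
    """Truncated stick-breaking construction: beta_k ~ Beta(1, alpha),
    pi_k = beta_k * prod_{l<k}(1 - beta_l), atoms theta_k* ~ H."""
    rng = np.random.default_rng(seed)
    betas = rng.beta(1.0, alpha, size=truncation)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    pis = betas * remaining                        # stick lengths pi_k
    atoms = base_sampler(rng, truncation)          # theta_k* ~ H
    return pis, atoms

pis, atoms = stick_breaking_dp(alpha=5.0, base_sampler=lambda rng, n: rng.normal(0, 1, n))
print(pis.sum())    # close to 1 for a large enough truncation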

2

Indian Buffet Process

2.1

Motivation

There are some latent feature models that are familiar to us: for example, factor analysis, probabilistic PCA, cooperative vector quantization, and sparse PCA. The applications are various; one is as follows: we have images containing some set of objects, and we want to get a vector in which a one corresponds to the existence of an object in the image and a zero otherwise. What latent feature models do is help us assign data instances to multiple classes, while a mixture model only assigns each data instance to one class. Another example is the Netflix challenge, where we have sparse data on the preferences of users and we want to find movies to recommend to them. Latent feature models also allow an infinite number of features, so that we do not need to specify the number beforehand. The formal description of latent feature models is as follows: let x_i be the ith data instance and f_i its features. Define X = [x_1ᵀ, x_2ᵀ, ..., x_Nᵀ] to be the list of data instances and F = [f_1ᵀ, f_2ᵀ, ..., f_Nᵀ] the list of features. The model is then specified by the joint distribution p(X, F), and by specifying some priors over the features we further factorize it as p(X, F) = P(X|F)p(F). We can further decompose the feature matrix F into a sparse binary matrix Z and a value matrix V. That is, for a real matrix F, we have F = Z ⊗ V,


where ⊗ is elementwise product and zij ∈ {0, 1} and vij ∈ R. One example is shown as follows:

The reason that this is a powerful idea is that as the number of features K, which is the number of columns here, goes to infinity, we do not need to represent the entire matrix V (even if it might be dense), as long as the matrix Z is appropriately sparse. Therefore, the model becomes p(X, F) = p(X|F)p(Z)p(V). The main topic of this lecture, the Indian Buffet Process, provides a way to specify p(Z) with an infinite number of features K. Before going to infinite latent feature models, we first review the basics of finite feature models.

2.2

Finite Feature Model

The first example is the Beta-Bernoulli Model. We have encountered this model before when we were talking about LDA. Here we restate the coin-flipping story: from a hyperparameter α, we sample a weighted coin for each column k, and for each row n we sample a head or tail. More formally, for each column we sample a feature probability π_k and for each row we sample an ON/OFF value based on π_k. That is,
• for each feature k ∈ {1, ..., K}:
  – π_k ∼ Beta(α/K, 1), where α > 0
  – for each object i ∈ {1, ..., N}:
    ∗ z_ik ∼ Bernoulli(π_k)
The graphical representation can be drawn as a plate diagram, which gives us the probability of z_ik given π_k and α.

Because the Beta distribution is the conjugate prior of the Bernoulli distribution (this is the special case of the Dirichlet distribution being the conjugate prior of the Multinomial, where the dimension decreases to 2), we can analytically marginalize out the feature parameters π_k. The probability of just the matrix Z can be written as the product of the marginal probabilities of each column, that is,

P(Z) = Π_{k=1}^K ∫ ( Π_{i=1}^N P(z_ik | π_k) ) p(π_k) dπ_k = Π_{k=1}^K (α/K) Γ(m_k + α/K) Γ(N − m_k + 1) / Γ(N + 1 + α/K),

where m_k = Σ_{i=1}^N z_ik is the number of features ON in column k, and Γ is the Gamma function.

The question that we are interested in is the expected number of non-zero elements in the matrix Z. To answer this question, we first recall that

if X ∼ Beta(r, s), then E[X] = r/(r + s);   if Y ∼ Bernoulli(p), then E[Y] = p.

Since z_{ik} ∼ Bernoulli(π_k) and π_k ∼ Beta(α/K, 1), we have

E[z_{ik}] = \frac{\alpha/K}{1 + \alpha/K},

and we can calculate the expected number of ON elements as

E[\mathbf{1}^T Z \mathbf{1}] = E\Big[ \sum_{i=1}^{N} \sum_{k=1}^{K} z_{ik} \Big] = \frac{N\alpha}{1 + \alpha/K}.
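As a quick sanity check (a simulation sketch, not part of the lecture), we can verify this expectation numerically for a large K:

```python
# Minimal simulation of the finite Beta-Bernoulli feature model; names are illustrative.
import numpy as np

def sample_Z(N, K, alpha, rng):
    """Sample a binary feature matrix Z: pi_k ~ Beta(alpha/K, 1), z_ik ~ Bernoulli(pi_k)."""
    pi = rng.beta(alpha / K, 1.0, size=K)
    return rng.random((N, K)) < pi

rng = np.random.default_rng(0)
N, K, alpha = 20, 1000, 3.0
ones = np.mean([sample_Z(N, K, alpha, rng).sum() for _ in range(2000)])
print(ones, N * alpha / (1 + alpha / K))   # both close to N*alpha when K is large
```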

The expected number of ON elements is upper-bounded by Nα. If we take K → ∞, the value simply goes to Nα, which means that this particular model guarantees sparsity even with an infinite set of features. However, a problem is that when K → ∞, p(Z) also goes to 0, because the factor α/K in each column's marginal goes to 0. This is not a property we want, since we do not want the entire matrix Z forced to 0. To tackle this problem, we first recognize that the features are not identifiable, meaning the order of the features does not matter to the model. To understand the concept of "not identifiable", recall that in a topic model the index of a topic does not matter: which topic corresponds to which index k is arbitrary. In a latent feature model, there is likewise no difference between feature k = 13 and feature k = 27. With this in mind, we can convert the matrix to Left-Ordered Form (lof). Define the history of feature k to be the number whose binary representation is given by the column:

h_k = \sum_{i=1}^{N} 2^{N-i} z_{ik}.

The figure below helps us understand the concept of history:


With history at hand, we further define the lof (Z) to be Z sorted left-to-right by the history of each feature. The figure below depicts the concept.

We define the equivalence class [Z] = {Z′ : lof(Z′) = lof(Z)}, the collection of all matrices Z′ that have the same lof. By a counting argument, the cardinality of [Z] is

\frac{K!}{\prod_{h=0}^{2^N - 1} K_h!},

which is the number of matrices that have the same lof. Now, instead of calculating the probability of a particular matrix p(Z), we calculate the probability of the collection of matrices p([Z]). That is,

\lim_{K \to \infty} p([Z]) = \lim_{K \to \infty} \frac{K!}{\prod_{h=0}^{2^N-1} K_h!}\, p(Z) = \frac{\alpha^{K_+}}{\prod_{h=1}^{2^N-1} K_h!} \cdot \exp\{-\alpha H_N\} \prod_{k=1}^{K_+} \frac{(N - m_k)!\,(m_k - 1)!}{N!},

where K_+ is the number of features with non-zero history, and H_N = \sum_{j=1}^{N} 1/j is the Nth harmonic number. Doing the algebra, we can see that this probability no longer goes to zero. Now we have enough background to go to the Indian Buffet Process.

2.3 The Indian Buffet Process

Imagine that there is an Indian restaurant with a wonderful buffet containing an infinite number of dishes. Each customer walks in, takes as many dishes as they like, and sits down once they have enough food on their plate. The rules for selecting dishes are as follows:


• 1st customer: starts at the left and selects a Poisson(α) number of dishes.
• ith customer:
  – Samples previously sampled dishes according to their popularity, i.e. tries dish k with probability m_k / i, where m_k is the number of previous customers who tried dish k.
  – Selects a Poisson(α/i) number of new dishes.

An example of this process is shown below. The problem is that, stated this way, the process is not exchangeable: which dishes are sampled as "new" depends on the customer order. The way to fix this is to modify the way the ith customer selects dishes:

• Makes a single decision for all dishes with the same history h: if there are K_h dishes with history h sampled by m_h customers, the customer samples a Binomial(m_h / i) number of them, starting at the left.
• Selects a Poisson(α/i) number of new dishes.

This is equivalent to viewing the dish matrix in left-ordered form. With this fix, we can calculate the probability p([Z]).

Next we construct a Gibbs sampler for the Indian Buffet Process. Specifically, we consider a "prior only" sampler of p(Z | α). For finite K, we have

P(z_{ik} = 1 | z_{-i,k}) = \int_0^1 P(z_{ik} | \pi_k)\, p(\pi_k | z_{-i,k})\, d\pi_k = \frac{m_{-i,k} + \alpha/K}{N + \alpha/K},

where z_{-i,k} is the kth column excluding row i, and m_{-i,k} is the number of rows with feature k excluding row i. For infinite K, since the Indian Buffet Process is exchangeable, we can sample just as in the CRP, choosing an order such that the ith customer was the last to enter. For any k such that m_{-i,k} > 0, resample

P(z_{ik} = 1 | z_{-i,k}) = \frac{m_{-i,k}}{N},

then draw a Poisson(α/N) number of new dishes.

Some properties of the Indian Buffet Process should be noted:
• It is infinitely exchangeable.
• The number of ones in each row is Poisson(α).
• The expected total number of ones is αN.
• The number of nonzero columns grows as O(α log N).
• It has a stick-breaking representation.
• It can be interpreted using the Beta-Bernoulli process.

Finally, posterior inference can be done using several different methods, such as Gibbs sampling, a conjugate sampler, etc. Previous work has reported using this model for graph structures, protein complexes, etc.
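A minimal sketch (following the restaurant metaphor above; names and sizes are illustrative) of drawing a binary matrix Z from the IBP prior:

```python
import numpy as np

def sample_ibp(N, alpha, rng):
    """Draw Z from the IBP prior: dishes[k] holds the customers who took dish k."""
    dishes = []
    for i in range(1, N + 1):
        for customers in dishes:                  # old dishes: popularity m_k / i
            if rng.random() < len(customers) / i:
                customers.add(i)
        for _ in range(rng.poisson(alpha / i)):   # Poisson(alpha/i) new dishes
            dishes.append({i})
    Z = np.zeros((N, len(dishes)), dtype=int)
    for k, customers in enumerate(dishes):
        for i in customers:
            Z[i - 1, k] = 1
    return Z

rng = np.random.default_rng(1)
Z = sample_ibp(N=10, alpha=2.0, rng=rng)
print(Z.shape, Z.sum(axis=1))                     # row sums are Poisson(alpha) on average
```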

10-708: Probabilistic Graphical Models 10-708, Spring 2016

20: Gaussian Processes Lecturer: Andrew Gordon Wilson

Scribes: Sai Ganesh Bandiatmakuri

1 Discussion about ML

Here we give an introduction to machine learning and discuss the practicality of a few historical techniques. In one example, we observe that sampling using a Hamiltonian Monte Carlo algorithm, although effective in practice, is still associated with some hard problems, like hand-tuning some of the hyper-parameters (e.g. the step size); many recent efforts in machine learning attempt to automate such manual steps. Machine learning is also primarily adaptive function learning: learning to adapt based on data rather than following a fixed set of rules. In order to do this successfully, a model should be able to automatically discover patterns and extrapolate to new situations. In this context, we present an example of a trend graph of the number of airline passengers by year and qualitatively show the effectiveness of a spectral mixture kernel model.

Figure 1: Extrapolating airline passenger count

In the above picture, the red curve is generated by an RBF kernel, which makes strong smoothness assumptions. The black curve is generated by a different kind of kernel based on spectral mixtures; it is very close to the ground truth shown in green.

1.1 Support and Inductive Biases

Whenever we want to discuss the effectiveness of a model, we need to consider its support and its inductive bias. The support is what solutions we think are a priori possible; it is the space of solutions that can possibly be represented by the model. For example, if we restrict the model to a polynomial of degree 3, the support involves only polynomials of degree ≤ 3, and the model cannot handle cases where the true distribution


is more complex. The inductive bias of a model is which solutions (within the support) are a priori likely; it is a measure of how we distribute the support. Another example discussed was the inductive bias of convolutional layers: these layers are less flexible than fully connected layers, but their inductive bias is often better suited to computer vision tasks. In principle, we want a model to be able to represent all likely functions, although how it distributes the support may be biased. Today's deep learning approaches and the Gaussian process are considered universal approximators. We can get a better idea of the complexity of models by plotting the likelihoods of certain datasets given a model.

Figure 2: Simple models can only generate certain datasets

Figure 3: More flexible models can generate more datasets


Figure 4: Flexible model with inductive bias

1.2 Basic regression problem

The basic regression problem is to predict a function value at a data point x∗ which may be far away from the given training data points. The typical approach is to assume some parametric form of the function with parameters w, formulate an error function E(w), and minimize with respect to w. There are different choices for the function type f(x, w): w^T x (linear in w and x), w^T φ(x) (linear basis function model), etc. An example of E(w) is the L2 error,

E(w) = \sum_{i=1}^{N} |f(x_i, w) - y(x_i)|^2.

1.2.1 A probabilistic approach by modeling noise

In the above expression, although we may arrive at a set of parameters w, it may not provide an intuitive understanding of the error measure. A probabilistic approach is to also model the noise.

y(x) = f(x, w) + ε(x).

If we model the noise as i.i.d. additive Gaussian with mean 0 and variance σ²,

p(y(x) | x, w, σ²) = N(y(x); f(x, w), σ²)

p(Y | X, w, σ²) = \prod_{i=1}^{N} N(y(x_i); f(x_i, w), σ²)

\log p(Y | X, w, σ²) ∝ -\frac{1}{2σ²} \sum_{i=1}^{N} |f(x_i, w) - y(x_i)|^2

The above approach makes the same predictions as an L2 error function, but it can now be optimized for both w and σ², and it provides a more intuitive understanding of the error function and the noise. For example, if there are too many outliers, we can change the representation of ε(x) to a Laplacian distribution


and the squared error to a sum of absolute errors. This method thus provides an intuitive framework for representing uncertainty and for model development. However, both of the above approaches are still prone to overfitting (when using highly complex models). One way to address this is to introduce regularization, but that has its own issues: picking the regularization penalty term and its weight is a difficult design decision and typically requires cross validation.
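To make the earlier point concrete, here is a minimal sketch (illustrative; the basis and data are arbitrary) showing that the Gaussian-noise MLE for a linear basis function model reduces to least squares, with σ² then estimated from the residuals:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(x.size)    # noisy targets

Phi = np.vander(x, N=4, increasing=True)                  # basis phi(x) = [1, x, x^2, x^3]
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)                # minimizing the L2 error = MLE for w
sigma2 = np.mean((Phi @ w - y) ** 2)                       # MLE of the noise variance
print(w, sigma2)
```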

1.3 Bayesian model averaging

An alternate way to avoid overfitting is by Bayesian model averaging. Here, we’re more interested in the distribution of functions than the parameters w themselves. We average over infinitely many models p(y|x∗ , w) by the posterior probabilities of those models. This gives us automatically calibrated complexity and does not have the problem of overfitting.

p(y | x_*, y, X) = \int p(y | x_*, w)\, p(w | y, X)\, dw

1.4 Examples of Occam's razor

One example discussed in class was a tree partially occluding a box, as shown below. The simplest explanation is that there is a regular sized box with no deformities behind the tree. There can be more elaborate and far-fetched hypotheses for the occluded portion of the box (e.g. it could be cut away, distorted, etc.), but Occam's razor favors the first explanation.

Figure 5: Tree covering a box

Another example discussed was predicting the next two numbers in the integer sequence {−1, 3, 7, 11, ?, ?}. The first hypothesis H1, the simplest, assumes it is an arithmetic progression and predicts 15 and 19. The second, more complicated hypothesis fits a cubic, H2: y(x) = −x³/11 + 9x²/11 + 23/11, which also reproduces the sequence. We can quantitatively see that Occam's razor favors the first hypothesis. We first write the posterior odds in terms of the likelihoods and priors using Bayes rule:

p(H1 | D) / p(H2 | D) = [p(D | H1) / p(D | H2)] · [p(H1) / p(H2)].

For integers in the range [−50, 50], we arrive at

p(H1 | D) / p(H2 | D) ≈ 10^{−4} / (2.3 × 10^{−12}),

i.e. roughly 40 million times higher for H1, even if we assume P(H1) = P(H2).
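As a quick check (an illustrative sketch, not part of the notes), both hypotheses reproduce the observed sequence but continue it very differently:

```python
# H1: add 4.  H2: the cubic recurrence above, applied to the previous term.
def h1(x):
    return x + 4

def h2(x):
    return -x**3 / 11 + 9 * x**2 / 11 + 23 / 11

seq = [-1]
for _ in range(5):
    seq.append(h2(seq[-1]))
print(seq)                                # [-1, 3.0, 7.0, 11.0, -19.9..., ...]
print([h1(x) for x in [-1, 3, 7, 11]])    # [3, 7, 11, 15]
```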


1.5 Occam's asymptote

Since we have techniques like Bayesian model averaging, regularization, etc., it seems intuitive to pick a "sufficiently complex" model to represent the data and not worry about overfitting. However, we will see that the choice of prior is important in making such a decision. If we try to fit a function using different models, where model k is a polynomial of order k, we note that in the naive case the graph of the marginal likelihood forms an Occam's hill: too simple models do not explain the data, and neither do too complex models.

Figure 6: Occam's hill

The above paradox arises when we use an isotropic prior (i.e. equal weights on all parameters): if the coefficients are a, we have p(a) = N(0, σ²I). However, most of the information is usually captured in the lower order polynomial terms, with some noise in the higher order coefficients, so using an anisotropic prior that reduces the weights for higher order terms is more intuitive: p(a_i) = N(0, γ^{-i}), where γ is learned from the data. In this case, we arrive at Occam's asymptote.


Figure 7: Occam’s asymptote

2 Gaussian process

2.1 Linear model as a Gaussian process

Consider a simple linear model f(x) = a_0 + a_1 x, where a_0, a_1 ∼ N(0, 1). We have E[f(x)] = 0, and the covariance between two points f(x_b) and f(x_c) is

E[f(x_b) f(x_c)] = E[a_0^2 + a_0 a_1 (x_b + x_c) + a_1^2 x_b x_c] = E[a_0^2] + 0 + E[a_1^2] x_b x_c = 1 + x_b x_c.

Therefore any collection of function values has a joint Gaussian distribution, [f(x_1), f(x_2), ..., f(x_N)] ∼ N(0, K), where K_ij = 1 + x_i x_j. By definition, f(x) is a Gaussian process.
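A small illustration (not from the notes): sampling a_0, a_1 and computing the empirical covariance of f recovers the kernel 1 + x x^T:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 5)

# Weight-space view: sample many functions f(x) = a0 + a1*x
a = rng.standard_normal((100000, 2))
f = a[:, :1] + a[:, 1:] * x                 # each row is one sampled function on the grid
print(np.cov(f, rowvar=False).round(2))     # empirical covariance of the function values

# Function-space view: the GP covariance K_ij = 1 + x_i x_j
print((1 + np.outer(x, x)).round(2))
```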

2.2 Definition

A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution. We write f (x) ∼ GP (m, k) to mean [f (x1 ), ..., f (xN )] ∼ N (µ, K), where µi = m(xi ) and Kij = k(xi , xj ) for any collection of input values x1 , ..., xN . In other words, f is a GP with mean function m(x) and covariance kernel k(xi , xj ).

2.3 Linear basis function models

By modeling the regression function as a linear regression in the basis function space, we can show that it is still a Gaussian process.


f(x, w) = w^T φ(x),   p(w) = N(0, Σ_w)

E[f(x, w)] = m(x) = E[w]^T φ(x) = 0

cov(f(x_i), f(x_j)) = E[f(x_i) f(x_j)] − E[f(x_i)] E[f(x_j)] = φ(x_i)^T E[w w^T] φ(x_j) − 0 = φ(x_i)^T Σ_w φ(x_j)

Thus f(x) ∼ GP(m, k), where m(x) = 0 and k(x_i, x_j) = φ(x_i)^T Σ_w φ(x_j). The entire model is therefore encapsulated as a distribution over functions with kernel k(x, x′).

2.3.1 Inference for new test samples

The results in this section are derived in detail in Chapter 2 of [2]. For inference in this model, we have new test samples X_* and need to predict their outputs f_*. We assume the training data is given, where y is the output of f with additive Gaussian noise ε, i.e. y = f + ε with ε ∼ N(0, σ²I). As a result, y ∼ N(0, K(X, X) + σ²I). We write the joint distribution of y and f_*, which by the definition of a Gaussian process is Gaussian:

\begin{bmatrix} y \\ f_* \end{bmatrix} \sim N\left( 0, \begin{bmatrix} K(X, X) + \sigma^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix} \right)

Therefore the conditional distribution p(f_* | y) can be written as (f_* | X, X_*, y) ∼ N(m, V), where m = K(X_*, X)(K(X, X) + σ²I)^{-1} y and V = K(X_*, X_*) − K(X_*, X)(K(X, X) + σ²I)^{-1} K(X, X_*).
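A minimal sketch of these formulas in code (an illustrative implementation, not the lecture's; it uses an RBF covariance, introduced formally below, and arbitrary synthetic data):

```python
import numpy as np

def rbf_kernel(A, B, amp=1.0, ls=0.5):
    """k(x, x') = amp^2 * exp(-(x - x')^2 / (2 ls^2)) for 1-D inputs."""
    d2 = (A[:, None] - B[None, :]) ** 2
    return amp**2 * np.exp(-d2 / (2 * ls**2))

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 20)
y = np.sin(X) + 0.1 * rng.standard_normal(X.size)
Xs = np.linspace(-4, 4, 100)
sigma2 = 0.1**2

Kxx = rbf_kernel(X, X) + sigma2 * np.eye(X.size)
Ksx = rbf_kernel(Xs, X)
Kss = rbf_kernel(Xs, Xs)

mean = Ksx @ np.linalg.solve(Kxx, y)                  # m = K(X*,X)(K + sigma^2 I)^{-1} y
cov = Kss - Ksx @ np.linalg.solve(Kxx, Ksx.T)         # V = K** - K*x (K + sigma^2 I)^{-1} Kx*
print(mean[:5], np.sqrt(np.diag(cov))[:5])
```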

2.4 The Gaussian process graphical model

Figure 8: The Gaussian process graphical model. Squares are observed variables, circles are latent, and the thick horizontal bar is a set of fully connected nodes. Note that each y_i is conditionally independent given f_i. Because of the marginalization property of the GP, adding further inputs x and unobserved targets y does not change the distribution of any other variables. Originally from Rasmussen and Williams (2006) [2].


2.5 Example: RBF kernel

This is one of the most popular kernels, and it has nice theoretical properties. It can be viewed as an infinite basis function model, where each feature can be written as φ_i(x) = σ² exp(−(x − c_i)² / (2l²)). If we use J features, we can write a kernel entry as

k(x_p, x_q) = \frac{\sigma^2}{J} \sum_{i=1}^{J} \phi_i(x_p) \phi_i(x_q).

Taking the limit J → ∞, we can write the kernel function as

K_{RBF}(x_i, x_j) = a^2 \exp\!\Big( -\frac{\|x_i - x_j\|^2}{2 l^2} \Big),

where the hyperparameters a and l control the amplitude and the wiggliness (length-scale) of the function, respectively.

Here, the intuition behind the RBF kernel is that nearby samples are more correlated than samples that are farther away (controlled by l).

Figure 9: Visualization of the covariance matrix

l controls how the correlations decay with L2 distance. A larger value of l indicates slower decay, so points farther apart still have non-zero covariance (correlation). This is represented in the figure below, where the x axis is the distance τ between samples.

For larger l, we can see long range correlations. If l is very large, we see high correlations between samples. If l is very small, we see a lot of wiggliness where the distribution tries to fit individual data points. If l is


0, this is equivalent to white noise.

2.6 Learning and model selection

We can integrate away the entire Gaussian process f(x) to obtain the marginal likelihood as a function of the model hyperparameters alone:

p(y | θ, X) = \int p(y | f, θ, X)\, p(f | θ, X)\, df

\log p(y | θ, X) = -\frac{1}{2} y^T (K + \sigma^2 I)^{-1} y - \frac{1}{2} \log|K + \sigma^2 I| - \frac{N}{2} \log(2\pi)

The first term above evaluates the model fit and the second term penalizes the model complexity.
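A small sketch (illustrative; kernel, data, and hyperparameter values are arbitrary) of evaluating this log marginal likelihood to compare hyperparameter settings:

```python
import numpy as np

def rbf(A, B, amp, ls):
    return amp**2 * np.exp(-(A[:, None] - B[None, :])**2 / (2 * ls**2))

def gp_log_marginal(X, y, amp, ls, sigma2):
    K = rbf(X, X, amp, ls) + sigma2 * np.eye(X.size)
    L = np.linalg.cholesky(K)                           # stable factorization of K
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))                # equals -0.5 * log|K|
            - 0.5 * X.size * np.log(2 * np.pi))

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 30)
y = np.sin(X) + 0.1 * rng.standard_normal(X.size)
for ls in (0.1, 0.5, 2.0):                              # compare length-scales
    print(ls, gp_log_marginal(X, y, amp=1.0, ls=ls, sigma2=0.01))
```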

2.7 Learning by full Bayesian treatment

The following derivation is from Section 5.2 in [2]. For each function f, we may have some low level parameters w (which may be weights in a neural network, for example), and we have hyperparameters θ on top of them (e.g. the weight decay for neural networks, or a and l for Gaussian processes). On top of these, we may have a discrete set of possible model structures H_i. At the bottom level, the posterior over the parameters is given by Bayes rule,

p(w | y, X, θ, H_i) = \frac{p(y | X, w, H_i)\, p(w | θ, H_i)}{p(y | X, θ, H_i)},

where p(y | X, w, H_i) is the likelihood and p(w | θ, H_i) is the prior. The normalizing constant in the denominator is obtained by marginalizing out w and is called the marginal likelihood.

p(y | X, θ, H_i) = \int p(y | X, w, H_i)\, p(w | θ, H_i)\, dw

At the next level, we express the posterior distribution of the hyperparameters, where the marginal likelihood defined above plays the role of the likelihood:

p(θ | y, X, H_i) = \frac{p(y | X, θ, H_i)\, p(θ | H_i)}{p(y | X, H_i)}

Here, p(θ | H_i) is the hyperprior (prior over the hyper-parameters) and the normalizing constant is

p(y | X, H_i) = \int p(y | X, θ, H_i)\, p(θ | H_i)\, dθ.

At the top level, we compute the posterior of the model:

p(H_i | y, X) = \frac{p(y | X, H_i)\, p(H_i)}{p(y | X)},   where   p(y | X) = \sum_i p(y | X, H_i)\, p(H_i).   (1)


However, computing the integral in Equation (1) may be intractable, and one may have to resort to analytic approximations (e.g. variational inference) or MCMC methods.

2.8 Example: Combining kernels to predict CO2 trends

The data consists of monthly average atmospheric CO2 concentrations derived from air samples collected at the Mauna Loa Observatory, Hawaii, between 1958 and 2003. The data is shown below. Our goal is to model the CO2 concentration as a function of time x.

The following kernels (covariance functions) were handcrafted by inspecting the training data.

• Long rising trend: k_1(x_p, x_q) = θ_1² exp(−(x_p − x_q)² / (2θ_2²))
• Quasi-periodic seasonal changes: k_2(x_p, x_q) = k_rbf(x_p, x_q) k_per(x_p, x_q) = θ_3² exp(−(x_p − x_q)² / (2θ_4²) − 2 sin²(π(x_p − x_q)) / θ_5²)
• Multi-scale medium-term irregularities: k_3(x_p, x_q) = θ_6² (1 + (x_p − x_q)² / (2θ_8 θ_7²))^{−θ_8}
• Correlated and i.i.d. noise: k_4(x_p, x_q) = θ_9² exp(−(x_p − x_q)² / (2θ_10²)) + θ_11² δ_pq
• k_total(x_p, x_q) = k_1(x_p, x_q) + k_2(x_p, x_q) + k_3(x_p, x_q) + k_4(x_p, x_q)

Results from the predictions are shown below. We display the given training data together with the 95% predictive confidence region of a Gaussian process regression model, extrapolated 20 years into the future. The rising trend and seasonal variations are clearly visible. Note also that the confidence interval gets wider the further the predictions are extrapolated. Original figure from [2].
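The following is a small sketch of how such a composite covariance function can be assembled from simpler pieces; the θ values below are arbitrary placeholders, not the fitted values from [2]:

```python
import numpy as np

def k_trend(r, t1=60.0, t2=90.0):
    return t1**2 * np.exp(-r**2 / (2 * t2**2))

def k_seasonal(r, t3=2.0, t4=100.0, t5=1.3):
    return t3**2 * np.exp(-r**2 / (2 * t4**2) - 2 * np.sin(np.pi * r)**2 / t5**2)

def k_irregular(r, t6=0.6, t7=1.0, t8=0.8):
    return t6**2 * (1 + r**2 / (2 * t8 * t7**2)) ** (-t8)

def k_noise(r, t9=0.2, t10=1.6, t11=0.2):
    return t9**2 * np.exp(-r**2 / (2 * t10**2)) + t11**2 * (r == 0)   # (r == 0) plays the role of delta_pq

def k_total(xp, xq):
    r = xp[:, None] - xq[None, :]            # pairwise time differences (years)
    return k_trend(r) + k_seasonal(r) + k_irregular(r) + k_noise(r)

x = np.linspace(1958, 2003, 10)
print(k_total(x, x).shape)                   # (10, 10) covariance matrix
```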


2.9 Non-Gaussian likelihoods

If our likelihood is non-Gaussian, we can no longer integrate away the Gaussian process analytically to infer f_*. However, we can use a simple Monte Carlo sum:

p(f_* | y, X, X_*) = \int p(f_* | f, x_*)\, p(f | y)\, df ≈ \frac{1}{J} \sum_{j=1}^{J} p(f_* | f^{(j)}, x_*),   f^{(j)} ∼ p(f | y)

We can sample from p(f | y) using elliptical slice sampling [1]. For the hyperparameters, one possible approach is to perform Gibbs sampling:

p(f | y, θ) ∝ p(y | f) p(f | θ)
p(θ | f, y) ∝ p(f | θ) p(θ)

Typically, however, since f and θ are highly correlated, the sampler will not mix very well and the approach usually does not work. According to the lecturer, a better alternative for hyperparameter inference is to compute an approximation of the distribution (using variational methods, for example) and use that instead as the source distribution to obtain samples.

3 Further extensions

Two major research areas devoted to extending Gaussian processes are automatic kernel selection and improved scalability. In the CO2 predictions explored above, there were many manual decisions made by inspecting the data and handcrafting kernels. One possible direction is to automate this entire process and learn the kernel combinations in a data-driven way. This is also an active area for high dimensional problems: the RBF kernel uses the L2 distance between data points, which need not be the best distance metric in higher dimensions.


Another focus area is scalability. When N (the number of training points) is around a thousand or smaller, Gaussian processes remain the gold standard for regression. Many computations involve solving large systems of linear equations and performing Cholesky decompositions. The complexity of these operations is cubic in the number of training points and therefore does not scale beyond a few thousand points. However, many of these matrices have predictable structure, which can be exploited to make decompositions and inference more efficient.

References

[1] Iain Murray, Ryan Prescott Adams, and David J. C. MacKay. Elliptical slice sampling. arXiv preprint arXiv:1001.0175, 2009.

[2] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

10-708: Probabilistic Graphical Models 10-708, Spring 2016

Lecture 21: Spectral Learning for Graphical Models Lecturer: Eric P. Xing

Scribes: Maruan Al-Shedivat, Wei-Cheng Chang, Frederick Liu

1 Motivation

In modern machine learning, latent variables are often introduced into models to endow them with learnable and interpretable structure. Examples of such models include various state space models of sequential data (such as hidden Markov models), mixed membership models (such as topic models), and stochastic grammars (such as probabilistic context free grammars) used to model the grammatical structure of sentences or the structure of RNA sequences. Despite the flexibility of such models, the latent structure often complicates inference and learning (exact inference becomes statistically and computationally intractable) and forces one to employ approximate methods such as Expectation Maximization (EM) type algorithms. The main drawbacks of EM are slow convergence and a tendency to get stuck at local optima. Spectral methods for learning latent variable models attempt to mitigate these issues by representing the latent structures implicitly through the spectral properties of different statistics of the observable variables. While some applications may require explicit representation of the latent variables (when the goal is to interpret the data), many applications are interested in prediction, and hence use latent structures merely as reasonable (often domain-specific) constraints. Examples of such applications include forward prediction in various sequence models (Figure 1). In such cases, spectral methods offer a powerful toolbox of techniques based on linear algebraic properties of the statistics of models with latent structure. These techniques not only have appealing theoretical properties, such as global consistency and provable guarantees on convergence to the global optimum, but often work orders of magnitude faster than EM-type algorithms in practice. In this set of notes, we introduce and discuss the main ideas behind spectral learning techniques for discrete latent variable models. Our working examples are the mixture model and the hidden Markov model (HMM).

Figure 1: Prediction of the future observations given the past of a dynamical system using a state space model (SSM) or a hidden Markov model (HMM).


Figure 2: Marginalization of latent variables leads to fully connected graphs.

2 Latent Structure and Low Rank Factorization

Having motivated the main ideas behind spectral methods, we start by considering the following question. As mentioned, we do not care about the latent structure in prediction tasks explicitly. However, consider an HMM; if we marginalize out the latent variables of the model, we certainly arrive at a complete graph (Figure 2). Is there any difference between initially starting with a complete graphical model (i.e., encoding no structural assumptions) and the model we get via marginalization of an HMM? Even though all the observable variables become correlated after marginalization, the structure of the correlations between the variables is, in fact, controlled by the latent factors of the original model. Hence, there is a difference, and, as we will see, it manifests in the low rank structure of the joint probability distribution over the observables.

2.1 Mixture model example

To gain more intuition, consider an example of a mixture model with three discrete observable variables X1, X2, X3, where X_i ∈ [m] := {1, ..., m}, and a discrete latent variable H ∈ [k]. Consider the case when k = 1, i.e., the model has a single latent state. The joint probability distribution has the following form:

P(X_1, X_2, X_3) = P(H = 1) \prod_{j=1}^{3} P(X_j | H = 1) = \prod_{j=1}^{3} P(X_j | H = 1)   (1)

where, since there is a single latent state, P (H = 1) = 1, and the distribution factorizes over the observable variables, which means they are all independent (Figure 3, left).

Figure 3: Depending on the number of latent states, a mixture model can be equivalent to a model with independent random variables (left) or to a fully connected graph (right).

Now, consider k = m³, i.e., the number of latent states coincides with the number of all possible configurations of the values taken by X1, X2, X3. In that case, the number of parameters in the mixture model is sufficient to encode any discrete distribution over X1, X2, X3, and hence the model is equivalent to a completely connected graph (i.e., the structural assumptions are effectively void). What happens when the number of latent states is at neither of these two extremes?


2.2 Independence and Rank

To answer the stated question, first consider the sum rule and the chain rule from a purely algebraic perspective. Let A ∈ [m] and B ∈ [n] be discrete random variables. The sum rule can be represented simply as a matrix-vector multiplication:

P(A) = \sum_b P(A | B = b) P(B = b) = P(A | B) · P(B),   (2)

where P(A | B) is an m × n matrix of conditional probabilities and P(B) is a vector of size n. The chain rule can be represented in a similar form:

P(A, B) = P(A | B) P(B) = P(A | B) · diag[P(B)],   (3)

where diag[P(B)] denotes a diagonal matrix with the entries of P(B) on the diagonal. The diagonal form is used to keep B from being marginalized out.

Figure 4: Graphical models for two random variables: arbitrarily dependent (left), completely independent (center), dependent through a latent variable (right).

Now we are ready to consider the joint distribution of a simple two-variable graphical model with two discrete variables, A and B. In the general case, A and B are arbitrarily dependent (Figure 4, left), and P(A, B) is an arbitrary table whose only constraint is that its entries sum to 1. If A and B are independent (Figure 4, center), then P(A = a, B = b) = P(A = a) P(B = b), and hence the joint probability table can be represented as an outer product of the vectors P(A) and P(B):

P(A, B) = P(A) ⊗ P(B),   (4)

which means that P(A, B) is of rank 1. If we introduce a latent variable X with k ≤ min{m, n} states between A and B (Figure 4, right), we can write the joint probability distribution in the following form:

P(A, B) = P(A | X) diag[P(X)] P(B | X)^T,   (5)

where we used the algebraic representation of the chain and sum rules introduced above. Figure 5 depicts the joint distribution in the form of matrix multiplications. Note that under the condition k ≤ min{m, n}, P(A, B) is of rank k, i.e., neither full-column nor full-row rank. Hence, assumptions about certain latent structures result in low rank dependencies between the random variables that can be further exploited using standard tools from linear algebra: ranks, eigenspaces, singular value decompositions, etc.

Figure 5: Low rank decomposition of the joint distribution over the observable variables of a model with a latent variable with k states.
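A small numerical check (illustrative; sizes and tables are arbitrary) of this low rank property: a joint table generated through a k-state latent variable has rank at most k.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 6, 5, 2

def random_conditional(rows, cols, rng):
    """Random conditional probability matrix; each column sums to 1."""
    T = rng.random((rows, cols))
    return T / T.sum(axis=0, keepdims=True)

P_A_given_X = random_conditional(m, k, rng)       # P(A | X), m x k
P_B_given_X = random_conditional(n, k, rng)       # P(B | X), n x k
P_X = rng.dirichlet(np.ones(k))                   # P(X), length k

P_AB = P_A_given_X @ np.diag(P_X) @ P_B_given_X.T # equation (5)
print(P_AB.sum(), np.linalg.matrix_rank(P_AB))    # sums to 1, rank <= k (here 2)
```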

3 An Alternate Factorization

In the previous sections, we discovered that the joint probability of two random variables can be written as a low rank factorization,

M = LR,   (6)

assuming M has rank k. However, it is well known that such a factorization is not unique. By inserting an invertible matrix S and its inverse, we have

M = L S S^{-1} R.   (7)

An interesting question is: can we have a factorization that depends only on observed variables? To see this, let us continue with the HMM example. We want to factorize a matrix of four variables,

P[X_{1,2}, X_{3,4}],   (8)

such that the factor matrices each involve at most three observed variables. First we factorize the following two matrices based on formula (5):

P[X_{1,2}, X_3] = P[X_{1,2} | H_2] diag(P[H_2]) P[X_3 | H_2]^T   (9)

P[X_2, X_{3,4}] = P[X_2 | H_2] diag(P[H_2]) P[X_{3,4} | H_2]^T   (10)

Combining equations (9) and (10) with the analogous factorization of P[X_2, X_3] and cancelling the hidden-variable factors, we finally obtain

P[X_{1,2}, X_{3,4}] = P[X_{1,2}, X_3] P[X_2, X_3]^{-1} P[X_2, X_{3,4}].   (11)

Figure 6: Two ways of doing observable factorization for the same joint distribution. At this point, it may not be very encouraging since we have only reduced the joint distribution of four random variables to factor of three random variables multiplications. Nonetheless, the amazing part of the observable factorization is that every latent tree of V variables could be recursively applying this factorization technique to an extend that all factors are of size 3 and that all factors are only functions of observed variables.


4 Training, Testing, and Consistency

4.1 Training and Testing

In training, we replace each probability matrix with its MLE and get

P_MLE[X_{1,2}, X_3],  P_MLE[X_2, X_3]^{-1},  P_MLE[X_2, X_{3,4}].   (12)

For the discrete case, the MLE matrices correspond to frequency counts. At test time we plug specific values into the factorization, and inference is just a table lookup.

4.2 Consistency

It is well known that the maximum likelihood estimator (MLE) is consistent for the true joint probability. However, simply estimating the big probability table from the data is not very statistically efficient. An alternative is to first factorize the joint probability into smaller pieces according to the latent structure of the graphical model, and estimate those small tables with the EM algorithm. Nevertheless, EM may get stuck in local optima and thus is not guaranteed to obtain the MLE of the factorized model. In spectral learning, we can estimate the joint probability by the observable factorization,

P_MLE[X_{1,2}, X_3] P_MLE[X_2, X_3]^{-1} P_MLE[X_2, X_{3,4}] → P[X_{1,2}, X_{3,4}].

In this way, the estimator enjoys the consistency property and is computationally tractable. The only remaining issue is computing the inverse of the probability matrix.

5 The Existence of Inverse

We now look at the conditions for the inverse P[X_2, X_3]^{-1} to be well defined. Recall

P[X_2, X_3] = P[X_2 | H_2] diag(P[H_2]) P[X_3 | H_2]^T.   (13)

All the matrices on the right hand side must be full rank. We will discuss the cases where k ≠ m, where k is the number of latent states and m the number of observed states.

5.1 m > k

The inverse does not exist in this case. However, this can be solved easily by projecting the matrix to a lower dimensional space:

P[X_2, X_3]^{-1} = V (U^T P[X_2, X_3] V)^{-1} U^T,   (14)

where U and V contain the top-k left and right singular vectors of P[X_2, X_3].

5.2 k > m

This case can be interpreted as the observed states not being expressive enough to capture the relationship: frequency counting alone misses some of the information. Intuitively, large k and small m means long range dependencies. We try to solve this with the long range features discussed in the following sections.

6 Empirical Results with Spectral Learning for Latent Probabilistic Context Free Grammars

The F1 measure in [Cohen et al. 2013] did not show great improvement. However, the run time of the algorithm is reduced to about 1/20.

7 Spectral Learning With Features

By using more complex features, such as E[φ_L ⊗ φ_R], to represent the original variables (for which P[X_2, X_3] = E[δ_2 ⊗ δ_3]), we are able to handle the case where k > m.

22: Introduction to Hilbert Space Embeddings and Kernel GM

A kernel function K(x, y) satisfies the positive definiteness condition

∫∫ f(x) K(x, y) f(y) dx dy > 0,   ∀f.

This is a generalization of the positive definite matrix. The most common kernel that we will use is the Gaussian RBF kernel,

K(x, y) = exp(−‖x − y‖²₂ / σ²).

Consider holding one argument of the kernel fixed. The result is a function of one variable, which we call a feature function. The collection of feature functions is called the feature map, φ_x := K(x, ·).


For example, using the Gaussian kernel, the feature functions are unnormalized Gaussians. Here is an example:

φ_1(x) = exp(−‖1 − x‖²₂ / σ²).

The inner product of feature functions in an RKHS is defined as ⟨φ_x, φ_y⟩ = ⟨K(x, ·), K(y, ·)⟩ := K(x, y). Intuitively, this quantity is the dot product between two feature vectors. By the symmetry property of kernels, φ_x(y) = φ_y(x) = K(x, y). Having defined feature functions, consider the space composed of all functions that are linear combinations of these feature functions. That is,

F_0 := { f(z) = Σ_{j=1}^{k} α_j φ_{x_j}(z) : k ∈ N_+, x_j ∈ X }.

Then, define the Reproducing Kernel Hilbert Space F to be the completion of the set F_0 defined above. The feature functions thus form a spanning set (albeit an over-complete one) for this space F. Indeed, any object in the RKHS can be obtained as a linear combination of these feature functions, by definition. With this definition in place, the space F exhibits the reproducing property, from which the RKHS derives its name. Mathematically,

⟨f, φ_x⟩ = f(x),

where f is some function in F. What this means is that to evaluate a function at some point, one does not have to operate explicitly in the infinite-dimensional space, but can instead simply take the inner product of that function with the feature function of the point. The proof of this property is as follows:

⟨f, φ_x⟩ = ⟨Σ_j α_j φ_{x_j}, φ_x⟩
         = Σ_j α_j ⟨φ_{x_j}, φ_x⟩        (linearity of the inner product)
         = Σ_j α_j K(x_j, x)             (definition of the kernel)
         = f(x).

Recall how this property is used to great advantage in SVMs, where data points are symbolically mapped to RKHS feature functions. However, operationally, they are only evaluated with inner products, so this symbolic mapping never has to be made explicit.

4 Embedding Distributions in Reproducing Kernel Hilbert Spaces

We now turn to the problem of embedding entire distributions in RKHS.

4.1 The Mean Map - Embedding Distributions of One Variable

We first show how to embed univariate distributions in an RKHS. Consider the mean map, defined as:

μ_X(·) = E_{X∼D}[φ_X] = ∫ p_D(x) φ_x(·) dx.


This is effectively a statistic computed over the feature function mappings of the distribution into the RKHS; intuitively, it is an empirical summary of the data. In the finite sample case it is simply the first moment,

μ̂_X = (1/N) Σ_{n=1}^{N} φ_{x_n}.

It can be shown that when the kernel is universal, the mapping from distributions to embeddings is one-to-one. The Gaussian RBF kernel and the Laplacian kernel are examples of universal kernels. As an illustrative example, consider the finite dimensional case of an RKHS embedding for a distribution that takes on discrete values from 1 to 4. In its explicit form, computing a few moments of this distribution directly from the data leads to a loss of information. Now consider an RKHS mapping of the data into R^4. Let the feature functions in this RKHS be the indicator (one-hot) vectors

φ_1 = (1, 0, 0, 0)^T,  φ_2 = (0, 1, 0, 0)^T,  φ_3 = (0, 0, 1, 0)^T,  φ_4 = (0, 0, 0, 1)^T.

Given this mapping, the mean map is

μ_X = E_X[φ_X] = P(X = 1)φ_1 + P(X = 2)φ_2 + P(X = 3)φ_3 + P(X = 4)φ_4,

which is exactly the marginal probability vector in the discrete case. It is evident that the mean map (a 4-dimensional vector in this case) is a more expressive statistic than the empirical mean computed in the simplistic case. The mean map can also be conveniently evaluated using an inner product: E_{X∼D}[f(X)] = ⟨f, μ_X⟩. The proof is as follows:

⟨f, μ_X⟩ = ⟨f, E_{X∼D}[φ_X]⟩                       (definition of the mean map)
         = E_{X∼D}[⟨f, φ_X⟩] = E_{X∼D}[f(X)]       (reproducing property)
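A small illustration (not from the notes) of the discrete case: with one-hot feature functions, the empirical mean map is just the empirical marginal probability vector, and expectations become inner products.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.1, 0.2, 0.3, 0.4])                 # true P(X = 1..4)
data = rng.choice(4, size=10000, p=p)              # samples of X (0-indexed)

Phi = np.eye(4)                                    # phi_i = i-th standard basis vector
mu_hat = Phi[data].mean(axis=0)                    # (1/N) sum_n phi_{x_n}
print(mu_hat)                                      # approx [0.1, 0.2, 0.3, 0.4]

f = np.array([1.0, 2.0, 3.0, 4.0])                 # f(X) = X, written as a vector of values
print(f @ mu_hat, f[data + 0].mean() + 1 - 1)      # <f, mu_X> matches the sample mean of f(X)
```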

4.2 Cross-Covariance - Embedding Joint Distributions of Two Variables

Now consider the problem of embedding joint distributions of two variables in an RKHS. Begin by implicitly defining the cross-covariance operator C_YX through the property

⟨g, C_YX f⟩ = E_YX[f(X) g(Y)],   ∀ f ∈ F, g ∈ G.

Note that the two Hilbert spaces F and G no longer need to be the same, and indeed are likely to be different. C_YX will be the joint embedding of the distribution over X and Y. Now we show how C_YX is constructed. Suppose we have φ_X ∈ F and φ_Y ∈ G, the feature functions of the two RKHSs. For two centered random variables, the covariance is Cov(X, Y) = E_YX[XY]. In the infinite dimensional case, this translates to

C_YX = E_YX[φ_Y ⊗ φ_X],

where ⊗ is the tensor product operator. This operator effectively creates a new space from the feature functions of the spaces F and G, leading to a formal characterization of the tensor product of the two Hilbert spaces, H = {h : ∃ f ∈ F, ∃ g ∈ G s.t. h = f ⊗ g}. The expectation in this new space is then the cross-covariance operator.


The proof of correctness of the cross-covariance operator property is given below:

⟨g, C_YX f⟩ = ⟨g, E_YX[φ_Y ⊗ φ_X] f⟩
            = E_YX[⟨g, (φ_Y ⊗ φ_X) f⟩]
            = E_YX[⟨g, ⟨φ_X, f⟩ φ_Y⟩]          (definition of the outer product)
            = E_YX[⟨g, φ_Y⟩⟨φ_X, f⟩]
            = E_YX[g(Y) f(X)].                 (reproducing property)

Taking the cross-covariance of a variable with itself leads to the auto-covariance operator.

4.3 Product of Cross-Covariances - Embedding Conditional Distributions of Two Variables

Given what we know about cross-covariance and auto-covariance, we can now derive an explicit form for the embedding of conditional distributions of two variables. In simple probabilistic terms we have P(X, Y) = P(Y | X) P(X). In linear algebraic operations, the conditional distribution then emerges as P(Y | X) = P(Y, X) · Diag(P(X))^{-1}. But we already know that the embedding of a joint distribution P(X, Y) is the cross-covariance operator C_YX, and the embedding of a distribution P(X) in diagonalized matrix form is the auto-covariance operator C_XX. It follows that the embedding of a conditional distribution is also an operator. Specifically, C_{Y|X} = C_YX C_XX^{-1}. It can be shown that this operator has the property E_{Y|X}[φ_Y | X] = C_{Y|X} φ_X.

5 Kernel Graphical Models

Since we can embed marginal, joint, and conditional distributions in an RKHS, we can use these embeddings to replace the conditional probability tables used before in graphical models, yielding kernel graphical models. We can also perform inference in a kernel graphical model with the sum rule and chain rule in the RKHS. Specifically, the sum rule for densities and its corresponding rule in the RKHS is

P[X] = ∫_Y P[X, Y] = ∫_Y P[X | Y] P[Y]   ⟺   μ_X = C_{X|Y} μ_Y.

Likewise for the chain rule,

P[X, Y] = P[X | Y] P[Y] = P[Y | X] P[X]   ⟺   C_YX = C_{Y|X} C_XX,  C_XY = C_{X|Y} C_YY.

Consider the simple graphical model shown in Figure 1. If the variables in this model are discrete, we can parameterize the model using the probability vector P[A] and the conditional probability matrices P[B|A], P[C|B], and P[D|C]. We can then do inference based on these matrices. For example, let us see how we would compute P[A = a, D = d]. To do this, we need the joint distribution matrix P[A, D]. However, if we directly use matrix multiplication, P[A] would be integrated out and we would only get P[D]. To solve this problem, we convert P[A] to the diagonal matrix diag(P[A]). Then P[A, D] can be calculated as

P[A, D] = diag(P[A]) P[B|A]^T P[C|B]^T P[D|C]^T.


Figure 1: Example Graphical Model

To compute the probability P[A = a, D = d], we introduce the evidence vectors δ_a and δ_d, where (for example) δ_a is the all-zero vector except for the element corresponding to value a. If the variables were continuous, we could instead use the cross-covariance operators described earlier. For example,

C_AD = C_AA C_{B|A}^T C_{C|B}^T C_{D|C}^T,   P[A = a, D = d] ∝ φ_a^T C_AD φ_d.

These examples show that inference in a kernel graphical model is similar to inference in a regular graphical model. Therefore, we can apply the inference algorithms for regular graphical models, such as message passing, to kernel graphical models by replacing the sum-product operations with tensor operations.
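A small discrete sanity check (illustrative; variable sizes and tables are arbitrary) of the chain computation above, queried with indicator (evidence) vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

def cond(rows, cols):
    """Conditional probability matrix P[row | col]; columns sum to 1."""
    M = rng.random((rows, cols))
    return M / M.sum(axis=0, keepdims=True)

nA, nB, nC, nD = 2, 3, 3, 2
P_A = rng.dirichlet(np.ones(nA))
P_BgA, P_CgB, P_DgC = cond(nB, nA), cond(nC, nB), cond(nD, nC)

P_AD = np.diag(P_A) @ P_BgA.T @ P_CgB.T @ P_DgC.T    # joint table over (A, D)

a, d = 1, 0
delta_a, delta_d = np.eye(nA)[a], np.eye(nD)[d]      # evidence vectors
print(delta_a @ P_AD @ delta_d, P_AD[a, d])          # identical; P_AD sums to 1
```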

10-708: Probabilistic Graphical Models 10-708, Spring 2016

25 : Graphical induced structured input/output models Lecturer: Eric P. Xing

Scribes: Raied Aljadaany, Shi Zong, Chenchen Zhu

Disclaimer: A large part of the content is taken from the references listed. In this section, we extend the concept of traditional graphical models from modeling dependencies of a distribution, or minimizing a loss function on graphs, to modeling constraints.

1 Genetic Basis of Diseases

A single nucleotide polymorphism, often abbreviated SNP, is a variation in a single nucleotide that occurs at a specific position in the genome¹. Genetic association hypothesis testing aims at finding which SNPs are causal (or associated) vis-a-vis a hereditary disease. In other words, we hope to find the mapping between genotype and phenotype. Until now, the most popular approaches for genetic and molecular analysis of diseases have mainly been based on classical statistical techniques, such as linkage analysis of selected markers and quantitative trait locus (QTL) mapping conducted over one phenotype and one marker genotype at a time, which are then corrected for multiple hypothesis testing. Primitive data mining methods include the clustering of gene expressions and the high-level descriptive analysis of molecular networks. Such approaches yield crude, usually qualitative characterizations of the study subjects. However, many complex disease syndromes, such as asthma, consist of a large number of highly related, rather than independent, clinical or molecular phenotypes. This raises a new technical challenge: identifying genetic variations associated simultaneously with correlated traits. In this lecture, we will see several methods for analyzing the multi-correspondence mapping between multiple SNPs and multiple symptom phenotypes.

2 Basics

2.1 Sparse Learning

Assume we have a linear model,

y = Xβ + ε,   (1)

where X ∈ R^{N×J} is the input matrix, y ∈ R^{N×1} is the output vector, and ε ∼ N(0, σ²I_{N×N}) is an error term of length N with zero mean and constant variance. Then the lasso can be formulated as:

arg min_β f(β) = (1/2) ‖y − Xβ‖²₂ + λ‖β‖₁   (2)

¹ https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism


where ‖β‖₁ = Σ_{j=1}^{J} |β_j| is the sum of the absolute values of the elements of β.

Figure 1 provides a geometric view of lasso.

Figure 1 Geometric view of lasso. Left: ℓ1 regularization (lasso); right: ℓ2 regularization (ridge regression). Image taken from [Hastie et al., 2009].
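A minimal solver sketch (illustrative, not the lecture's algorithm) for problem (2), using proximal gradient descent (ISTA); the data and λ are arbitrary:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    step = 1.0 / np.linalg.norm(X, 2) ** 2          # 1 / Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)                    # gradient of the smooth squared-error part
        b = soft_threshold(b - step * grad, step * lam)
    return b

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
beta_true = np.zeros(20); beta_true[[2, 7, 11]] = [1.5, -2.0, 1.0]
y = X @ beta_true + 0.1 * rng.standard_normal(100)
print(np.nonzero(lasso_ista(X, y, lam=5.0))[0])     # roughly recovers the true support
```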

2.2 Multi-Task Learning

The basic linear model for multi-task regression can be written as:

y_k = Xβ_k + ε_k,   ∀ k = 1, 2, ..., K   (3)

where X = [x_1, ..., x_J] ∈ R^{N×J} denotes the input matrix, Y = [y_1, ..., y_K] ∈ R^{N×K} denotes the output matrix, β_k = [β_{1k}, ..., β_{Jk}] ∈ R^J is the regression parameter vector of length J for the k-th output, and ε_k is an error term of length N with zero mean and constant variance. Denote by B the combination of all the β_k:

B = (β_1, ..., β_K) = \begin{pmatrix} β_{11} & β_{12} & \cdots & β_{1K} \\ β_{21} & β_{22} & \cdots & β_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ β_{J1} & β_{J2} & \cdots & β_{JK} \end{pmatrix}   (4)

We can use the lasso to solve equation (3), which amounts to the following optimization problem:

\hat{B}^{lasso} = \arg\min_B \sum_k (y_k - Xβ_k)^T (y_k - Xβ_k) + λ \sum_j \sum_k |β_{jk}|   (5)

where λ is a tuning parameter that controls the level of sparsity; a larger λ leads to a sparser solution. In multi-task learning, the goal is to select input variables that are relevant to at least one task. Thus, an ℓ1/ℓ2 penalty has been proposed. Here the ℓ2 penalty comes from taking the ℓ2 norm of the regression coefficients β^j across all outputs for each input j, and the ℓ1 penalty comes from summing the resulting J ℓ2 norms, which encourages sparsity across input variables. The ℓ1/ℓ2 penalized multi-task regression is


defined as follows:

\hat{B}^{\ell_1/\ell_2} = \arg\min_B \sum_k (y_k - Xβ_k)^T (y_k - Xβ_k) + λ \sum_j \|β^j\|_2   (6)

Here the ℓ1 part enforces sparsity, and the ℓ2 part combines information across tasks. Since all of the elements of β^j take non-zero values if the j-th input is selected, the estimate \hat{B}^{ℓ1/ℓ2} is sparse only across inputs, not across outputs.

2.3 Structure Association

Returning to the genetic basis of diseases example, we can formulate it as a regression problem. That is, given the multivariate input X (the SNPs) and the multivariate output Y (the phenotypes), we hope to identify the association B between X and Y. The matrix B encodes the structure and strength of the association, e.g. the parameter β_{jk} represents the association strength between SNP j and trait k. The output covariates Y can also carry structure: a graph connecting the phenotypes, a tree structure connecting genes, etc. We will mainly consider three types of structured association.

• Association to a graph-structured phenome: Graph-guided fused lasso [Kim and Xing, 2009] • Association to a tree-structured phenome: Tree-guided group lasso [Kim and Xing, 2010] • Association between a subnetwork of genome and a subnetwork of phenome: Two-graph guided multitask lasso [Chen et al., 2012]

3 Structure Association I: Graph-guided Fused lasso

3.1 Motivation

To capture correlated genome associations to a Quantitative Trait Network (QTN), we employ a multivariate linear regression model as the basic model for trait responses given inputs of genome variations such as SNPs, with the addition of a sparsity-biasing regularizer to encourage selection of truly relevant SNPs in the presence of many irrelevant ones. In order to estimate the association strengths jointly for multiple correlated traits while maintaining sparsity, we introduce another penalty term called graph-guided fusion penalty into the lasso framework. This novel penalty makes use of the complex correlation pattern among the traits represented as a QTN, and encourages the traits which appear highly correlated in the QTN to be influenced by a common set of genetic markers. Thus, the GFlasso estimate of the regression coefficients reveals joint associations of each SNP with the correlated traits in the entire subnetwork as well as associations with each individual trait. Figure 2 provides a visualization about two different choices of the fusion scheme, which leads to two variants of GFlasso: Graph-constrained Fused lasso (Gc Flasso) and Graph-weighted Fused lasso (Gw Flasso) .


Figure 2 Illustrations for association analysis with multiple quantitative traits using various regression methods. left: original lasso; middle: Graph-constrained Fused lasso (Gc Flasso); right: Graph-weighted Fused lasso (Gw Flasso). Image taken from [Kim and Xing, 2009].

3.2 Model I: Graph-constrained Fused lasso

As shown in the middle subfigure of Figure 2, the GcFlasso model considers the graph structure without edge weights. Formally, GcFlasso can be formulated as:

\hat{B}^{GC} = \arg\min_B \sum_k (y_k - Xβ_k)^T (y_k - Xβ_k) + λ \underbrace{\sum_k \sum_j |β_{jk}|}_{\text{lasso penalty}} + γ \underbrace{\sum_{(m,l) \in E} \sum_j |β_{jm} - \mathrm{sign}(r_{ml}) β_{jl}|}_{\text{graph-constrained fusion penalty}}   (7)

where E is the set of edges. Here the last term (which we refer to as a fusion penalty or a total variation cost) encourages (but does not strictly enforce) βjm and sign(rml )βjl to take the same value by shrinking the difference between them toward zero. γ is a tuning parameter and a larger value for γ leads to a greater fusion effect, or in other words, a sparser result.

3.3 Model II: Graph-weighted Fused lasso

As shown in the right subfigure of Figure 2, the GwFlasso model considers not only the graph structure but also the edge weights. Formally, GwFlasso can be formulated as:

\hat{B}^{GW} = \arg\min_B \sum_k (y_k - Xβ_k)^T (y_k - Xβ_k) + λ \underbrace{\sum_k \sum_j |β_{jk}|}_{\text{lasso penalty}} + γ \underbrace{\sum_{(m,l) \in E} f(r_{ml}) \sum_j |β_{jm} - \mathrm{sign}(r_{ml}) β_{jl}|}_{\text{graph-weighted fusion penalty}}   (8)

The Gw Flasso method weights each term in the fusion penalty in equation 8 by the amount of correlation between the two traits being fused, so that the amount of correlation controls the amount of fusion for each edge. More generally, Gw Flasso weights each term in the fusion penalty with a monotonically increasing function of the absolute values of correlations, and finds an estimate of the regression coefficients.
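A small sketch (illustrative; the graph, correlations, and the choice f(r) = |r| are placeholders) of evaluating the GwFlasso objective (8) for a given coefficient matrix:

```python
import numpy as np

def gwflasso_objective(Y, X, B, edges, r, lam, gamma):
    """Y: N x K outputs, X: N x J inputs, B: J x K coefficients,
    edges: list of (m, l) trait pairs, r[(m, l)]: correlation between traits m and l."""
    loss = np.sum((Y - X @ B) ** 2)                      # squared-error terms
    lasso = lam * np.abs(B).sum()                        # lasso penalty
    fusion = gamma * sum(
        abs(r[e]) * np.abs(B[:, e[0]] - np.sign(r[e]) * B[:, e[1]]).sum()
        for e in edges)                                  # graph-weighted fusion penalty
    return loss + lasso + fusion

rng = np.random.default_rng(0)
N, J, K = 50, 10, 3
X, B = rng.standard_normal((N, J)), rng.standard_normal((J, K))
Y = X @ B + 0.1 * rng.standard_normal((N, K))
edges, r = [(0, 1), (1, 2)], {(0, 1): 0.8, (1, 2): -0.6}
print(gwflasso_objective(Y, X, B, edges, r, lam=0.1, gamma=0.1))
```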


3.4 Optimization Problem

The optimization problems in equations (7) and (8) are convex and can be formulated as quadratic programs. There are several existing tools for solving quadratic programs, but there are some issues:

• These approaches do not scale, in terms of computation time, to large problems involving hundreds or thousands of traits, as is the case in a typical multiple-trait association study;
• Difficulty arises in directly optimizing equations (7) and (8), as they are non-smooth functions because of the ℓ1 norms.

Taking GwFlasso as an example, we may reformulate it into an equivalent form that involves only smooth functions:

\min_{β_k, d_{jk}, d_{jml}} \sum_k (y_k - Xβ_k)^T (y_k - Xβ_k) + λ \sum_{j,k} \frac{β_{jk}^2}{d_{jk}} + γ \sum_{(m,l) \in E} f(r_{ml})^2 \sum_j \frac{(β_{jm} - \mathrm{sign}(r_{ml}) β_{jl})^2}{d_{jml}}

subject to  \sum_{j,k} d_{jk} = 1,  \sum_{(m,l) \in E} \sum_j d_{jml} = 1,  d_{jk} ≥ 0 ∀ j, k,  d_{jml} ≥ 0 ∀ j, (m,l) ∈ E.   (9)

We can solve the above problem with a coordinate-descent algorithm: we iteratively update β_k, d_{jk}, and d_{jml} until there is little improvement in the objective value. By taking the derivative with respect to one variable, setting it to zero, and keeping the other variables fixed, we obtain the following update rules:

β_{jk} = \frac{\sum_i x_{ij} \big( y_{ik} - \sum_{j' \neq j} β_{j'k} x_{ij'} \big) + γ \sum_{(k,l) \in E} \frac{f(r_{kl})^2 \mathrm{sign}(r_{kl}) β_{jl}}{d_{jkl}} + γ \sum_{(m,k) \in E} \frac{f(r_{mk})^2 \mathrm{sign}(r_{mk}) β_{jm}}{d_{jmk}}}{\sum_i x_{ij}^2 + \frac{λ}{d_{jk}} + γ \sum_{(k,l) \in E} \frac{f(r_{kl})^2}{d_{jkl}} + γ \sum_{(m,k) \in E} \frac{f(r_{mk})^2}{d_{jmk}}}   (10)

d_{jk} = \frac{|β_{jk}|}{\sum_{j',a} |β_{j'a}|}   (11)

d_{jml} = \frac{f(r_{ml}) |β_{jm} - \mathrm{sign}(r_{ml}) β_{jl}|}{\sum_{(a,b) \in E} \sum_{j'} f(r_{ab}) |β_{j'a} - \mathrm{sign}(r_{ab}) β_{j'b}|}   (12)

4 Structure Association II: Tree-guided Group lasso

4.1 Motivation

In a univariate-output regression setting, sparse regression methods that extend lasso have been proposed to allow the recovered relevant inputs to reflect the underlying structural information among the inputs.


Group lasso achieves this by applying an ℓ1 norm over groups of inputs while using an ℓ2 norm for the input variables within each group. This ℓ1/ℓ2 norm for group lasso has been extended to a more general setting to encode prior knowledge about various sparsity patterns, where the key idea is to allow the groups to overlap. However, overlapping groups in these regularization methods can cause an imbalance among different outputs, because the regression coefficients for an output that appears in a large number of groups are more heavily penalized than those for outputs belonging to fewer groups. A tree-guided lasso for multi-task regression with structured sparsity has therefore been proposed. The tree-guided lasso has several advantages:

• A tree structure naturally represents a hierarchical structure;
• Compared to a graph with O(|V|²) edges, a tree has only O(|V|) edges, which makes it scalable to a very large number of phenotypes.

4.2 Examples of Constructing Penalties with Tree Structure

The ℓ1-penalized regression assumes that all outputs in the problem share a common set of relevant input variables, but that is not always the case in practice. Here we consider a simple case of two genes. As shown in Figure 3, a low height from the nodes to their parent indicates a tight correlation and we should select them jointly, while a large height indicates a weak correlation and we should select them separately.

Figure 3 Two genes example for tree-structured penalty construction.

Based on the above intuition, we can revise the original ℓ1 penalty to fit this tree structure. One possible way to formulate the penalty is as follows:

penalty = λ \sum_j \Big[ h \underbrace{(|β_1^j| + |β_2^j|)}_{\ell_1 \text{ regularization}} + (1 - h) \underbrace{\sqrt{(β_1^j)^2 + (β_2^j)^2}}_{\ell_2 \text{ regularization}} \Big]   (13)

Here, the ℓ1 penalty means selecting the β_{jk}s for the two nodes separately, while the ℓ2 penalty means selecting them jointly. h is the tuning parameter that controls the level of balance.


Figure 4 General tree for tree-structured penalty construction.

For a general tree as shown in Figure 4, the penalty can be formulated as:

\hat{B}^{Tree} = \arg\min_B \sum_k (y_k - Xβ_k)^T (y_k - Xβ_k) + λ \sum_j \Big[ h_2 \underbrace{(|C_1| + |β_3^j|)}_{\text{separate selection}} + (1 - h_2) \underbrace{\sqrt{(β_1^j)^2 + (β_2^j)^2 + (β_3^j)^2}}_{\text{joint selection}} \Big]   (14)

where C_1 can be recursively expanded as:

C_1 = h_1 (|β_1^j| + |β_2^j|) + (1 - h_1) \sqrt{(β_1^j)^2 + (β_2^j)^2}   (15)

4.3 Definition

Now we can formally formulate tree-guided group lasso. Given the tree T over the outputs, we generalize the ℓ1/ℓ2 regularization to a tree regularization as follows. We expand the ℓ2 part of the ℓ1/ℓ2 penalty into a group-lasso penalty, where the groups are defined based on the tree T. In this tree T, each node v ∈ V is associated with a group G_v, whose members consist of all the output variables (leaf nodes) in the subtree rooted at node v. Given these groups of outputs arising from tree T, tree-guided group lasso can be written as follows:

\hat{B}^{Tree} = \arg\min_B \sum_k (y_k - Xβ_k)^T (y_k - Xβ_k) + λ \sum_j \sum_{v \in V} w_v \|β^j_{G_v}\|_2   (16)

where β^j_{G_v} is the vector of regression coefficients for input j restricted to the outputs in G_v. Each group of regression coefficients β^j_{G_v} is weighted by w_v, which reflects the strength of correlation within the group.

In order to define the weights w_v, we first associate each internal node v of the tree T with two quantities s_v and g_v satisfying s_v + g_v = 1. Here s_v represents the weight for selecting the output variables associated with each child of node v separately, and g_v represents the weight for selecting them jointly. Section 4.2 gave an example of constructing a tree-structured penalty; here we give the formal formulation. Given an arbitrary tree T, by recursively applying the same operation from the root node toward the leaf nodes, we get:

λ \sum_j \sum_{v \in V} w_v \|β^j_{G_v}\|_2 = λ \sum_j W_j(v_{root})   (17)


where

W_j(v) = \begin{cases} s_v \cdot \sum_{c \in \text{Children}(v)} |W_j(c)| + g_v \cdot \|β^j_{G_v}\|_2 & \text{if } v \text{ is an internal node,} \\ \sum_{m \in G_v} |β^j_m| & \text{if } v \text{ is a leaf node.} \end{cases}   (18)

4.4 Parameter Estimation

We use an alternative formulation in order to estimate the regression coefficients in tree-guided group lasso.

\hat{B}^{Tree} = \arg\min \sum_k (y_k - X\beta_k)^T (y_k - X\beta_k) + \lambda \underbrace{\Big( \sum_j \sum_{v \in V} w_v \|\beta^j_{G_v}\|_2 \Big)^2}_{\text{relaxation needed on this term}}    (19)

As the ℓ1/ℓ2 norm is a non-smooth function, we make a relaxation using the fact that the variational formulation of a mixed-norm regularization is equivalent to a weighted ℓ2 regularization:

\Big( \sum_j \sum_{v \in V} w_v \|\beta^j_{G_v}\|_2 \Big)^2 \le \sum_j \sum_{v \in V} \frac{w_v^2 \|\beta^j_{G_v}\|_2^2}{d_{j,v}}    (20)

where \sum_j \sum_v d_{j,v} = 1, d_{j,v} \ge 0 \ \forall j, v, and the equality holds for

d_{j,v} = \frac{w_v \|\beta^j_{G_v}\|_2}{\sum_j \sum_{v \in V} w_v \|\beta^j_{G_v}\|_2}    (21)

Thus, the optimization problem in equation 19 can be rewritten as:

\min \sum_k (y_k - X\beta_k)^T (y_k - X\beta_k) + \lambda \sum_j \sum_{v \in V} \frac{w_v^2 \|\beta^j_{G_v}\|_2^2}{d_{j,v}}
\text{subject to} \ \sum_j \sum_v d_{j,v} = 1, \quad d_{j,v} \ge 0, \ \forall j, v    (22)

Here, the additional variables d_{j,v} are introduced for smoothing. We solve the problem in equation 22 by optimizing β and d_{j,v} alternately over iterations until convergence. In each iteration, we first fix the values of β_k and update d_{j,v} using the update equation 21. Then, we treat d_{j,v} as constant and optimize for β_k, which leads to a closed-form solution:

\beta_k = (X^T X + \lambda D)^{-1} X^T y_k    (23)

where D is a J × J diagonal matrix with \sum_{v \in V} w_v^2 / d_{j,v} in the j-th element along the diagonal.
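As a rough illustration of this alternating scheme, the following Python sketch applies the d-update of equation 21 and the closed-form β-update of equation 23 as written above (i.e., using the simplified diagonal D that sums over all groups); the data, groups, and weights are synthetic and hypothetical.

```python
# A minimal sketch (hypothetical data) of the alternating scheme in Eqs. (21)-(23):
# fix beta and update d, then fix d and solve the weighted ridge problem for beta.
import numpy as np

def tree_guided_group_lasso(X, Y, groups, w, lam=0.1, iters=50, eps=1e-10):
    """X: (N, J) inputs; Y: (N, K) outputs; groups: list of output-index arrays G_v;
    w: group weights w_v."""
    N, J = X.shape
    B = np.linalg.lstsq(X, Y, rcond=None)[0]                   # warm start, shape (J, K)
    for _ in range(iters):
        # d-update, Eq. (21): d_{j,v} proportional to w_v * ||beta_{G_v}^j||_2
        norms = np.array([[w_v * np.linalg.norm(B[j, G_v]) for G_v, w_v in zip(groups, w)]
                          for j in range(J)]) + eps
        d = norms / norms.sum()
        # beta-update, Eq. (23): beta_k = (X^T X + lam * D)^{-1} X^T y_k
        D = np.diag((np.array(w) ** 2 / d).sum(axis=1))
        B = np.linalg.solve(X.T @ X + lam * D, X.T @ Y)
    return B

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
Y = rng.normal(size=(40, 3))
groups = [np.array([0, 1]), np.array([2]), np.array([0, 1, 2])]   # groups from a small tree
print(tree_guided_group_lasso(X, Y, groups, w=[0.5, 0.5, 1.0]).round(3))
```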


5 Structure Association III: Two-graph Guided Multi-task Lasso

5.1 Motivation

Two-graph guided multi-task lasso tries to answer the question of how multiple genetic variants in a biological process or pathway, by forming a subnetwork, jointly affect a subnetwork of multiple correlated traits. It is motivated by graph structures in both the genome and the phenome, and tries to take advantage of both sources of side information simultaneously. Figure 5 gives an illustration of two-graph guided multi-task lasso. Here we can see that two traits connected in the trait network are coupled through the paths between the two nodes, and likewise two SNPs connected in the genome network are coupled through the paths between the two nodes.

Figure 5: Illustration of two-graph guided multi-task lasso.

5.2 Parameter Estimation

The two-graph guided multi-task lasso is defined as:

\hat{B}^{TCML} = \arg\min \sum_k (y_k - X\beta_k)^T (y_k - X\beta_k) + \lambda \|B\|_1 + \gamma_1 \underbrace{\text{pen}_1(E_1, B)}_{\text{trait network}} + \gamma_2 \underbrace{\text{pen}_2(E_2, B)}_{\text{genome network}}    (24)

where pen_1 and pen_2 are two penalty functions measuring the discrepancy between the prior label and feature graphs and the association pattern. Specifically, they can be defined as:

\text{pen}_1(E_1, B) = \sum_{e_{m,l} \in E_1} w(e_{m,l}) \sum_{j=1}^{J} |\beta_{jm} - \text{sign}(r_{m,l})\,\beta_{jl}|

\text{pen}_2(E_2, B) = \sum_{e_{f,g} \in E_2} w(e_{f,g}) \sum_{k=1}^{K} |\beta_{fk} - \text{sign}(r_{f,g})\,\beta_{gk}|    (25)

where w(e_{m,l}) and w(e_{f,g}) are the weights assigned to the edges in graphs E_1 and E_2, respectively, and r_{m,l} and r_{f,g} are the correlations between y_m and y_l, and between y_f and y_g, respectively. It is easy to see that the objective function in equation 24 is non-differentiable. Thus, it is optimized by transforming it into a series of smooth functions that can be efficiently minimized by a coordinate-descent algorithm. Detailed update rules can be found in [Chen et al., 2012].
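The two penalties in equation 25 are straightforward to evaluate. Here is a minimal sketch with hypothetical edge lists and coefficients (not from the lecture): each edge pulls the coefficient profiles of its two endpoints towards each other, or towards mirror images when the correlation is negative.

```python
# A minimal sketch (hypothetical edge lists) of the graph-guided penalties in Eq. (25).
import numpy as np

def graph_penalty(B, edges, axis):
    """B: (J, K) coefficients; edges: list of (a, b, weight, corr).
    axis=1 couples columns (traits, pen_1); axis=0 couples rows (SNPs, pen_2)."""
    total = 0.0
    for a, b, weight, corr in edges:
        vec_a = B[:, a] if axis == 1 else B[a, :]
        vec_b = B[:, b] if axis == 1 else B[b, :]
        total += weight * np.sum(np.abs(vec_a - np.sign(corr) * vec_b))
    return total

B = np.array([[0.5, 0.4, 0.0],
              [0.0, 0.1, -0.3]])              # hypothetical 2 SNPs x 3 traits
trait_edges = [(0, 1, 1.0, 0.8)]              # traits 0 and 1 positively correlated
snp_edges = [(0, 1, 0.5, -0.6)]               # SNPs 0 and 1 negatively correlated
print(graph_penalty(B, trait_edges, axis=1), graph_penalty(B, snp_edges, axis=0))
```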


References

[Chen et al., 2012] Chen, X., Shi, X., Xu, X., Wang, Z., Mills, R., Lee, C., and Xu, J. (2012). A two-graph guided multi-task lasso approach for eQTL mapping. In Lawrence, N. D. and Girolami, M. A., editors, AISTATS, volume 22 of JMLR Proceedings, pages 208–217. JMLR.org.

[Hastie et al., 2009] Hastie, T. J., Tibshirani, R. J., and Friedman, J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, New York.

[Kim and Xing, 2009] Kim, S. and Xing, E. P. (2009). Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genetics, 5(8):1–18.

[Kim and Xing, 2010] Kim, S. and Xing, E. P. (2010). Tree-guided group lasso for multi-task regression with structured sparsity. In Fürnkranz, J. and Joachims, T., editors, ICML, pages 543–550. Omnipress.

10-708: Probabilistic Graphical Models 10-708, Spring 2016

24: Max-Margin Learning of GMs
Lecturer: Matthew Gormley        Scribes: Achal Dave, Po-Wei Wang, Eric Wong

1 Introduction

We’ve often discussed how there are 5 key ingredients to graphical models: the data, the model, the objective, learning, and inference. This note focuses on the last of these: inference, the task of computing quantities such as likelihoods and marginal probabilities.

1.1 Inference

There are three general tasks for inference.

1. Marginal Inference: Compute marginals of variables and cliques:
   p(x_C) = \sum_{x' : x'_C = x_C} p(x' \mid \theta)

2. Partition Function: Compute the normalization constant:
   Z(\theta) = \sum_x \prod_{C \in \mathcal{C}} \psi_C(x_C)

3. MAP Inference: Compute the variable assignment with highest probability:
   \hat{x} = \arg\max_x p(x \mid \theta)

Unfortunately, all three of these are NP-hard in the general case. Let’s focus on the third task, MAP inference. In the past, we’ve talked about one way to perform MAP inference using Max-Product Belief Propagation, which gives us max-marginals to directly read off the max. In these notes, we’ll show that MAP inference can be performed efficiently by solving a linear program.

2 MAP Inference as Mathematical Programming

MAP inference in a chain Markov net (CRF): Let's assume we are working with acyclic models (a chain model, or a CRF). Recall that a CRF consists of hidden variables y and observed variables x. The graphical model for a CRF is represented in Figure 1.


Figure 1: Graphical model for a CRF (a chain of hidden variables y_{t-1}, y_t, y_{t+1} with observations x_{t-1}, x_t, x_{t+1}).

Concretely, we are interested in computing

\hat{y} = \arg\max_y \log p_w(y \mid x)    (1)

(You may notice that we are using w to represent the model parameters, as opposed to θ, which we used in earlier lectures. This is for consistency with the literature in this area.) To do MAP inference, we can drop the normalization constant in the distribution.

\hat{y} = \arg\max_y \sum_j w^T f_{node}(x_j, y_j) + \sum_{j,k} w^T f_{edge}(x_{jk}, y_j, y_k)    (2)

We will show that we can reformulate this as an integer linear program (ILP). We can apply the branch-and-bound method to approximately solve the integer linear program, in which the simplex algorithm is applied to solve the linear part. However, if we could relax the integer constraints, we could simply run simplex once and get the exact optimum for the problem. Fortunately, the following lemma allows us to do exactly that:

Lemma 1 ([Wainwright et al., 2002]). If there is a unique MAP assignment, the linear program relaxation of the integer linear program on a tree-structured model is guaranteed to have an integer solution, which is exactly the MAP solution.

With this lemma in hand, we can now set up the integer program for MAP inference, knowing that we'll be able to solve it efficiently. We create auxiliary variables z_i corresponding to one-hot encodings of each y_i, and auxiliary variables z_{ij} corresponding to one-hot encoding matrices for each edge. Concretely, suppose that y_i ∈ {A, B}. Then the sequence y = 'AB' generates the following auxiliary variables:

z_1 = \begin{pmatrix} 1 & 0 \end{pmatrix}^T; \quad z_2 = \begin{pmatrix} 0 & 1 \end{pmatrix}^T; \quad z_{12} = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}

Now, we can convert the maximization in eq. (2) into a maximization over the encodings z.

\max_z \ \sum_{j,m} z_j(m)\,[w^T f_{node}(x_j, m)] + \sum_{j,k,m,n} z_{jk}(m,n)\,[w^T f_{edge}(x_{jk}, m, n)]

subject to \sum_m z_j(m) = 1  (normalization)
           \sum_n z_{jk}(m,n) = z_j(m)  (agreement)
           z_j(m) \in \mathbb{Z}, \ z_{jk}(m,n) \in \mathbb{Z}  (integer)

It turns out that we can rewrite this optimization more compactly as

\max_{z : Az = b} (F^T w)^T z

But when our model is acyclic, Lemma 1 shows that we can remove the integer constraints, solve the resulting linear program, and still get an integer(!) solution.

Not covered: dual decomposition. Some substructures correspond to linear chains and others define a tree, and we have dynamic programming algorithms for each one; dual decomposition allows us to put these algorithms together.
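To make the LP relaxation concrete, the following sketch sets up the relaxed program for a tiny two-node chain with binary labels and solves it with scipy.optimize.linprog; the node and edge scores are hypothetical numbers standing in for w^T f_node and w^T f_edge. On this tree-structured example the relaxation returns an integral solution, as Lemma 1 predicts.

```python
# A minimal sketch (hypothetical scores) of the LP relaxation for MAP inference
# on a 2-node chain with binary labels.
import numpy as np
from scipy.optimize import linprog

theta_node = np.array([[0.2, 1.0],   # node scores for variable 1 and 2, labels {0, 1}
                       [0.7, 0.1]])
theta_edge = np.array([[0.0, 0.9],   # edge scores for the single edge (labels m, n)
                       [0.3, 0.0]])

# Variable order: z1(0), z1(1), z2(0), z2(1), z12(00), z12(01), z12(10), z12(11)
c = -np.concatenate([theta_node.ravel(), theta_edge.ravel()])   # linprog minimizes

A_eq, b_eq = [], []
A_eq.append([1, 1, 0, 0, 0, 0, 0, 0]); b_eq.append(1)   # normalization of z1
A_eq.append([0, 0, 1, 1, 0, 0, 0, 0]); b_eq.append(1)   # normalization of z2
A_eq.append([-1, 0, 0, 0, 1, 1, 0, 0]); b_eq.append(0)  # agreement: sum_n z12(0,n) = z1(0)
A_eq.append([0, -1, 0, 0, 0, 0, 1, 1]); b_eq.append(0)  # agreement: sum_n z12(1,n) = z1(1)
A_eq.append([0, 0, -1, 0, 1, 0, 1, 0]); b_eq.append(0)  # agreement: sum_m z12(m,0) = z2(0)
A_eq.append([0, 0, 0, -1, 0, 1, 0, 1]); b_eq.append(0)  # agreement: sum_m z12(m,1) = z2(1)

res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, 1))
print(np.round(res.x, 3))   # on a tree, the relaxed solution is already 0/1
```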

3 Max-Margin Markov Networks (M3Ns)

3.1 Motivation 1: Comparison of SVMs and GMs

Handwriting recognition: Suppose we have an SVM and we want to distinguish the 'r's from the rest of the characters. The challenge is that some characters look similar: it is difficult to distinguish between an r and a c, for example. SVMs let us use kernels for high-dimensional learning and give generalization bounds. We would like to use these nice properties of SVMs for learning from sequences, but the number of classes is then exponential in the sequence length. With graphical models, the model is instead linear in the length. As of 2002, graphical models did not give us kernels or generalization bounds, but did let us use label correlations. Now, with M3Ns, we can have all three advantages.

Classical predictive models: Our next motivation is a comparison of loss functions. We have some predictive function, and for learning we want to minimize a loss function plus a regularizer. For logistic regression, we have the logistic loss with an l2 regularizer. For the SVM, we have the hinge loss with an l2 regularizer. Each of these losses has unique advantages and disadvantages, which are discussed in detail in slide 27 of the lecture. We would like to be able to use these losses in conjunction with graphical models, as opposed to simply performing MAP inference.

The structured prediction problem: The classical models described above generally do not incorporate structure amongst labels. Consider, for example, OCR, where we know that the sequence 'brace' is more likely than the sequence 'acrbe'. Classifying each letter individually ignores this structure. Similarly, in dependency parsing of sentences, we have variables between pairs of words and a graphical model that scores edges or triplets.


Structured prediction graphical models: Intuitively, the M3N optimizes the hinge loss over the whole sequence, the CRF optimizes the soft-max (log) loss, and the SVM optimizes the hinge loss but only over individual labels. For the M3N we capitalize on the factorization structure to solve the problem. Challenges:
1. We typically want an interpretable model. If we have millions of variables and only 10 are non-zero, we might actually learn a lot about what these variables are telling us about the data.
2. Prior information on output structures (to be discussed on Wednesday).
3. Latent structures, time series, and scalability are other challenges.
The main takeaway from M3Ns is that we get strong generalization bounds that translate into better empirical performance.

Parameter estimation for M3Ns. Max conditional likelihood: what we want is to return the answer with the maximum possible conditional likelihood, which is determined by the score w^T f(x, y). So the dot product for the correct answer should be greater than the dot product for any wrong answer, but there is an exponential number of wrong answers. In other words, the goal is to find w such that w^T (f(x, t(x)) − f(x, y)) > 0, i.e., w^T ∆f_x(y) ≥ γ ∆t_x(y), and to maximize the margin γ. Estimation: we want to maximize the margin subject to these constraints. Still, the problem has an exponential number of constraints over y; we can't just drop it into a QP solver, since it would be too big to write down.

LP duality recap: Variables turn into constraints, constraints turn into variables, and the optimal values are the same when both problems are feasible. Specifically, for the primal LP

\max_z \ c^T z \quad \text{s.t. } Az \le b, \ z \ge 0

the dual is

\min_\lambda \ b^T \lambda \quad \text{s.t. } A^T \lambda \ge c, \ \lambda \ge 0
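A quick numerical check of this duality, under hypothetical A, b, c (not from the lecture), is sketched below: both the primal and the dual are solved with scipy and their optimal values coincide.

```python
# A minimal sketch (hypothetical numbers) checking LP strong duality with scipy.
import numpy as np
from scipy.optimize import linprog

c = np.array([3.0, 2.0])
A = np.array([[1.0, 1.0],
              [2.0, 1.0]])
b = np.array([4.0, 6.0])

# Primal: max c^T z s.t. Az <= b, z >= 0 (linprog minimizes, so negate c)
primal = linprog(-c, A_ub=A, b_ub=b, bounds=(0, None))
# Dual: min b^T lam s.t. A^T lam >= c  <=>  -A^T lam <= -c
dual = linprog(b, A_ub=-A.T, b_ub=-c, bounds=(0, None))

print(-primal.fun, dual.fun)   # both equal the same optimal value (10.0 here)
```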

So in the setting of Markov nets, we turn the exponentially many margin constraints into dual variables α, and we therefore have an exponential number of dual variables. The constraints are that the α's sum to 1 and are greater than or equal to 0 — in other words, α represents a probability distribution! We can therefore use insights from graphical models to factorize the summation.

Variable elimination: To compute partition functions, we push sums inward and use the factorization to compute marginals more efficiently with dynamic programming. We can do the same with α: each α assigns a probability to each possible y. We introduce factored dual variables whose number is linear in the size of the network: on an acyclic graph, we can factorize the α's over nodes and edges with variables µ_i and µ_{ij}. Rewriting the dual using the µ's, we can formulate the problem as a quadratic program in µ with linear constraints. The first term of the objective is just an expectation of ∆t(y). The second is just the product of two expectations of ∆f(y). We can now decompose these expectations over the factorization of α and rewrite them over the µ's.


Factored constraints (when the network is a tree; otherwise add a clique-tree constraint):

\sum_y \alpha(y) = 1, \ \alpha(y) \ge 0 \ \forall y \quad \Rightarrow \quad
\begin{cases}
\sum_{y_i} \mu(y_i) = 1, \ \sum_{y_i, y_j} \mu(y_i, y_j) = 1 & \text{Normalization} \\
\mu(y_i) \ge 0, \ \mu(y_i, y_j) \ge 0 & \text{Non-negativity} \\
\sum_{y_j} \mu(y_i, y_j) = \mu(y_i) & \text{Agreement} \\
\mu \in \text{CliqueTreePolytope} & \text{Triangulation}
\end{cases}

The factored dual is now quadratic in the network size, and the constraint set is exponential only in the tree width. If you wanted to do a tri-gram HMM, you could take cliques of larger size than just the edges; this is still exponential in the tree width. These constraints are not enough to guarantee a one-to-one correspondence with a real probability distribution: we have the same set of constraints as in loopy belief propagation, and when the graph is not acyclic the solution of the factored dual will be reasonable but might not correspond to a true α. This has to do with the marginal polytope.

3.2 Min-max formulation

Instead of considering exponentially many constraints in the primal

\min_w \ \frac{1}{2} \|w\|, \quad \text{s.t. } w^T f(x, y^*) \ge w^T f(x, y) + \text{loss}(y, y^*), \ \forall y,

there is really just one y that achieves the highest score, so we only need to consider the most violated constraint instead of every constraint:

\min_w \ \frac{1}{2} \|w\|, \quad \text{s.t. } w^T f(x, y^*) \ge \max_{y \ne y^*} \{ w^T f(x, y) + \text{loss}(y, y^*) \}.

Find the 'best' y and add that one constraint to the problem; we then only need a polynomial number of constraints, linear in the number of training examples. This assumes we have a fixed w, but if we do this during learning we want to jointly find w as well. You can enforce the constraints one at a time, but here's a different way. Let's rewrite the constraint as a max within the constraint, and turn the discrete problem into a continuous one. Rewriting it with one-hot (indicator) variables for the linear chain structure, the problem decomposes into the variables and the edges between variables. This gives us a transformed problem in terms of the z variables. We can then use strong Lagrangian duality to get a third optimization problem with a nice objective and linear constraints. In short: we turned exponentially many constraints into a min-max problem, replaced the max with z, switched to the dual, and jointly optimizing over both sets of variables gives a compact quadratic program.

Experimental results. Handwriting recognition: the task is to predict the right characters given a sequence of character images (see the lecture slides for the numerical results). In both cases we need to define feature functions over the input images; over the same feature functions, the M3N has lower test error. The kernel trick can be used to predict over pairs and triples of features. Hypertext classification: this graphical model has cycles; we didn't talk about the SMO algorithm, which has probably been overtaken by exponentiated gradient. We see the same improvements from the M3N. Named entity recognition: the input is a sentence segmented into words, and for each word we predict whether it is an organization, person, location, or miscellaneous entity. Same improvements.


Additional: proximal / stochastic gradient methods, e.g., MIRA [Crammer et al., 2005], can be used to get around a quadratic programming solver and instead use gradient-based methods as in CRFs.

4 MaxEnDNet

Idea: use model averaging and the idea of entropy to learn graphical models. For next time.

References

[Crammer et al., 2005] Crammer, K., McDonald, R., and Pereira, F. (2005). Scalable large-margin online learning for structured classification. In NIPS Workshop on Learning With Structured Outputs.

[Wainwright et al., 2002] Wainwright, M., Jaakkola, T., and Willsky, A. (2002). MAP estimation via agreement on (hyper)trees: Message-passing and linear programming approaches. In Proceedings of the Annual Allerton Conference on Communication, Control and Computing, volume 40, pages 1565–1575.

10-708: Probabilistic Graphical Models 10-708, Spring 2016

25 : Posterior Regularization: an integrative paradigm for learning GMs
Lecturer: Matt Gormley        Scribes: Tzu-Ming Kuo

1 Continued: Maximum Entropy Discrimination Markov Network

1.1 Motivation

Likelihood-based estimation and max-margin learning are two widely used learning approaches in the machine learning community. Likelihood-based estimation is usually applied to train probabilistic models; it tries to find a model that maximizes the likelihood on some given training samples. On the other hand, max-margin methods try to find a model that generates the correct output while keeping the decision value of the correct output away from that of incorrect outputs. These two approaches have different advantages: • Likelihood Estimation – Easy to perform Bayesian learning, – Incorporates prior knowledge, latent structures, missing data – Bayesian or direct regularization • Max-Margin Learning – Dual sparsity – Sound theoretical guarantees – Easy to incorporate the kernel trick. Since these approaches enjoy different advantages, it is natural to come up with a model that incorporates both.

1.2 Maximum Entropy Discrimination

This work is an attempt to integrate likelihood-based methods and max-margin learning. It takes the idea of model averaging from Bayesian learning and the idea of the margin from max-margin learning. In model averaging, instead of using only one model to generate the output, every model is used to generate an output and the final output is the weighted average of the outputs from all models. During training, the goal is to find the weighting that performs best on the training samples. Taking binary classification as an example, the model output ŷ for an instance x is

\hat{y} = \text{sign}\left( \int p(w)\, F(x; w)\, dw \right).


Substituting the model output into the max-margin constraints yields the expected margin constraint

\int p(w)\,[y F(x; w) - \xi]\, dw \ge 0.

Sometimes we may have prior knowledge about the distribution p(w). For example, if we think w should be close to the origin, we may want to use a zero-mean Gaussian distribution as the prior for w. Given a prior distribution p_0(w) for w, we want p(w) to be as close to p_0(w) as possible; one approach is to minimize the KL-divergence between p(w) and p_0(w). Putting the objectives above together yields the optimization problem of Maximum Entropy Discrimination for binary classification:

\min_{p(w)} \ \text{KL}(p(w) \,\|\, p_0(w)) \quad \text{subject to} \ \int p(w)\,[y_i F(x_i; w) - \xi_i]\, dw \ge 0, \ \forall i, \quad \xi_i \ge 0.

1.3 Maximum Entropy Discrimination Markov Networks

Maximum Entropy Discrimination Markov Networks extend the idea of Maximum Entropy Discrimination to Markov networks and consider structured prediction problems. The optimization problem is

\min_{p(w), \xi} \ \text{KL}(p(w) \,\|\, p_0(w)) + U(\xi) \quad \text{subject to} \ p(w) \in \mathcal{F}_1, \ \xi_i \ge 0,

where U is the loss penalizing violations of the expected margin constraints, and \mathcal{F}_1 is the set of distributions satisfying the expected margin constraints

\mathcal{F}_1 = \{ p(w) : \int p(w)\,[\Delta F_i(y; w) - \Delta \ell_i(y)]\, dw - \xi_i \ge 0, \ \forall y \ne y_i \},

with \Delta F_i(y; w) \equiv F(x_i, y_i; w) - F(x_i, y; w), and \Delta \ell_i(y) is the margin required for predicting y_i rather than y. Given p(w), the prediction rule is

h_1(x, p(w)) = \arg\max_{y \in \mathcal{Y}(x)} \int p(w)\, F(x, y; w)\, dw.

1.3.1 Optimization Problem

It can be shown that the optimal posterior distribution p^*(w) can be expressed as

p^*(w) = \frac{1}{Z(\alpha)}\, p_0(w) \exp\Big( \sum_{i,y} \alpha_i(y)\,[\Delta F_i(y; w) - \Delta \ell_i(y)] \Big),

where \alpha is the solution of the dual problem

\max_\alpha \ -\log Z(\alpha) - U^*(\alpha) \quad \text{subject to} \ \alpha_i(y) \ge 0, \ \forall y, i,

and U^* is the conjugate of U.

1.3.2 Gaussian Maximum Entropy Discrimination Markov Networks

An interesting observation is that if we specify the model F, margin penalty U, and prior p_0(w) as

F(x, y; w) \equiv w^T f(x, y), \quad U(\xi) \equiv \sum_i \xi_i, \quad p_0(w) \equiv \mathcal{N}(0, I),

the dual problem becomes

\max_\alpha \ \sum_{i,y} \alpha_i(y)\,\Delta \ell_i(y) - \frac{1}{2} \Big\| \sum_{i,y} \alpha_i(y)\,\Delta f_i(y) \Big\|^2 \quad \text{subject to} \ \sum_y \alpha_i(y) = C \ \forall i, \quad \alpha_i(y) \ge 0 \ \forall i, y.

The corresponding primal optimum is p^*(w) = \mathcal{N}(\mu_w, I), where

\mu_w = \sum_{i,y} \alpha_i(y)\,\Delta f_i(y).

The prediction rule is

\hat{y} = \arg\max_{y \in \mathcal{Y}(x)} \int p(w)\, F(x, y; w)\, dw = \mu_w^T f(x, y).

The dual problem and the prediction rule are identical to those of the M3N. Therefore, in essence, the Gaussian Maximum Entropy Discrimination Markov Network is a probabilistic version of the M3N: it uses the posterior mean of p^*(w) instead of w^*. This means that Maximum Entropy Discrimination Markov Networks are a more general framework that subsumes M3Ns and offers additional advantages for being probabilistic, including a PAC-Bayesian prediction error guarantee, the introduction of useful biases via entropy regularization, and the integration of generative and discriminative principles.

1.3.3 Laplace Maximum Entropy Discrimination Markov Networks

The Gaussian prior can be replaced with a Laplace prior po (w) =

K Y k=1



λ −√λ|wl | = e 2

√ !K √ λ e− λkwk . 2

This is called Laplace Maximum Entropy Discrimination Markov Network. The optimization problem becomes ! r p N K X √ X λµ2k + 1 + 1 1 1 2 min λ µk + − √ log +C ξi µ,ξ λ 2 λ i=1 k=1 subjects to µT ∆fi (y) − ∆li (y) + ξi ≥ 0, ξi ≥ 0, ∀i, ∀y 6= yi .

4

25 : Posterior Regularization: an integrative paradigm for learning GMs

The nature of Laplace prior has a l1 regularization effect on the components of w, push the weights towards zero. The hyper-parameter λ controls the regularization effect. As λ increases, the model becomes more regularized. This can be seen by the fact that hwk ip =

2ηk , ∀k, λ − ηk2

where η is is a linear combination of the support vectors: η=

X

αi (y)∆fi (y).

i,y

The shrinkage effect makes the primal solution of Laplace Maximum Entropy Discrimination Markov Network to be sparse. This means that a large number of components of w will be 0 and informative components will be more likely be non-zero. In addition, since we only have non-zero value on the dual variables associated with active constraints in primal problem and solution of the dual problem will be sparse. Therefore, Laplace Maximum Entropy Discrimination Markov Network have not only primal sparsity but also dual sparsity. Define KL-norm as

kµkKL ≡

K X k=1

r µ2k

1 1 + − √ log λ λ

p

λµ2k + 1 + 1 2

! ,

and compare with l1 and l2 norm by plotting a level curve.

Figure 1: The KL-norm can be considered an interpolation between l1 and l2 norm controlled by parameter λ.

1.3.4

Variational Learning of Laplace Maximum Entropy Discrimination Markov Networks

Though Laplace Maximum Entropy Discrimination Markov Network has nice property, the both primal problem and the dual problem is infeasible to be exactly solved. A alternative way to solve this problem is to utilize the hierarchical representation of Laplace prior. Using this property we get an upper bound of

25 : Posterior Regularization: an integrative paradigm for learning GMs

5

KL-norm Z KL(pkp0 ) = −H(p) − hlog

p(w|τ )p(τ |λ)dτ ip Z p(w|τ )p(τ |λ) ≤ −H(p) − hlog q(τ ) dτ ip q(τ ) ≡ L(p(w), q(τ )).

And we minimize the following problem, which is an upper bound of the primal problem min

L(p(w), q(τ )) + U (ξ).

p(w)∈F1 ,q(τ ),ξ

Though this problem is still difficult to solve if we optimize all variables at the same time. However, a nice property of this problem is that when q(τ ) is fixed, the problem is reduced to a M3 N problem. And if p(w) and ξi is fixed instead, q(τ ) has a closed-form solution and its expectation is s 1 λ . h iq = τk hwk i p This suggests that we can solve this problem by alternatively fixing q(τ ) and p(w), ξ and solves for the optimal for other variables.

2

2.1

Posterior Regularization: an integrative paradigm for learning GMs Introduction

This lecture is not just about regularization, it is about to integrate what we learnt in this class so far. In the first part of class, parameter estimation for graphical models are based on maximum likelihood principle because that is the most common objective function we use on graphs. In the latter part of class, we have seen some different types of loss functions. For example, we introduce a prior distribution for parameters and we end up optimizing the posterior probability of the parameters of the model given samples. In the last lecture, we also learnt to integrate max-margin learning principle with Markov Network and results in a probability model with dual/primal sparsity and generalization guarantee. All this method enjoys different advantages between each-other. It is natural to ask if we can find a way to integrate those principles.

2.2

Bayesian Inference

Bayesian inference relies on a prior distribution over models and a likelihood function of data given a model. Bayesian inference allow us not only to utilize prior knowledge by designing the prior distribution over models but also have trade-off between the evidence and the prior knowledge. Bayesian inference is a coherent framework of dealing with uncertainties: p(M|x) = R

p(x|M)π(M) , p(x|M)π(M)dM

where p(x|M) is a likelihood function of data x given model M, and π(M) is a prior distribution of model M.Bayes rule offers a mathematically rigorous computational mechanism for combining prior knowledge with evidence.

6

25 : Posterior Regularization: an integrative paradigm for learning GMs

2.2.1

Parametric Bayesian Inference

In parametric Bayesian inference, a model M can represented by a finite set of parameters θ. The posterior distribution is p(θ|x) = R

p(x|θ)π(θ) ∝ p(x|θ)π(θ). p(x|θ)π(θ)dθ

There are some examples of conjugacy in picking the appropriate prior: • Gaussian distribution prior + 2D Gaussian likelihood → Gaussian posterior distribution • Dirichlet distribution prior + 2D Multinomial likelihood → Dirichlet posterior distribution • Sparsity-inducing priors + some likelihood models → Sparse Bayesian inference 2.2.2

Nonparametric Bayesian Inference

If the parameter can grow as the number of data increase (i.e., model cannot be represented by finite number of parameters). In this case the posterior distribution is p(M|x) = R

p(x|M)π(M) ∝ p(x|M)π(M). p(x|M)π(M)dM

Note that the formula is s only symbolically true. We cannot write a closed-form for an infinite term. we need process definitions such as Dirichlet process, Indian buffet process, and Gaussian process, which allow us to construct conditional distributions of one instance given all the other instances.

10-708: Probabilistic Graphical Models 10-708, Spring 2014

26 : Deep Neural Networks and GMs Lecturer: Matthew Gormley

1

Scribes: Hayden Luse

Introduction

The goal of deep learning is to solve the problem of coming up with good features. This is valuable as one of the best ways to get state of the art results is good feature engineering. Not only can deep networks themselves produce state of the art results but graphical models can also use the features discovered by these networks. There is currently a huge amount of interest in these techniques and large amounts of corporate money are being funnelled into groups using them including a four hundred million dollar investment by Google in Deep Mind. This interest is based on the success these techniques have had and in many areas the current state of the art was achieved with deep networks that did not require hand tuned features. The basic ideas have been around since the 1960’s however there have been marked improvements in recent years. Stochastic Gradient Descent for example was a huge improvement. Beyond that, using more hidden units, new nonlinear functions (ReLu’s), better online optimization and simply having more powerful CPUs and GPUs have advanced the field significantly. This lecture focuses on the decision function and loss functions used in deep learning.

2

Background and Notation

2.1

basic structure

1. Given training data {xi , yi }N i=1 2. Choose a decision function yˆ = fθ (xi ) 3. Choose a loss function `(y, yˆ) ∈ R) 4. Define a goal θ∗ = min θ

PN

i=1

`(fθ (xi ), yi )

5. Train with SGD θ(t+1) = θt − ηt ∇`(fθ (xi ), yi ) In order to calculate the gradients in SGD, backpropogation can be used which is a special case of a more general algorithm Reverse Mode Automatic Differentiation.

3

Deep Neural Networks (DNNs)

First we consider choosing fθ (xi ) = h(θ · xi ) where h(a) = a for the following network architecture. 1

2

26 : Deep Neural Networks and GMs

1 In this case, the network is equivalent to linear regression. If instead we choose h(a) = 1+exp(a) this network is equivalent to logistic regression. The next step in progressing towards deep networks is to begin adding hidden layers as in the following network.

This network also includes multiple outputs. A single layer network like this is already a universal function approximator and works well. However a network with further hidden layers can have fewer computational units for the same power and be representationally efficient. Deeper networks can generalize non-locally and may allow for hierarchy. Overall deeper networks have been shown to work better. The intuition for why deeper networks work better is that they allow for different levels of abstraction of features. As we do not know going in the ”right” level of abstraction we can let the model figure it out.

3.1 3.1.1

Training No Pre-Training

The first method we can use to train a deep network is to simply use the same strategy as for a shallow net and simply train the network with Backpropogation. Backpropogation is just repeated use of the chain rule. For a network with one hidden layer and a sigmoid activation function for the hidden units backpropogation is: Forward Pass:

26 : Deep Neural Networks and GMs

3

1. Loss function: J = y ∗ log q + (1 − y ∗ ) log(1 − q) 1 2. Output (sigmoid) y = 1+exp(−b) PD 3. Output (linear) b = j=0 βj zj

4. Hidden (sigmoid) zj = 5. Hidden (linear) aj =

1 1+exp(−aj )

PM

i=0

aji xi

Backwards Pass: (1−y ∗ ) q−1

1.

dJ dq

=

y∗ q

2.

dJ db

=

dJ dy dy dq db , db

3.

dJ dβj

=

dJ db db db dβj , dβj

= zj

4.

dJ dzj

=

dJ db db db dzj , dzj

= βj

5.

dJ daj

=

dJ dzj dzj dzj daj , daj

6.

dJ dαji

7.

dJ dxi

= =

+

=

exp(b) (exp(b)+1)2

=

daj dJ daj daj dαji , dαji dJ dαji dαji dαji dxi , dxi

exp(aj ) (exp(aj )+1)2

= xi =

PD

j=0

αji

However this algorithm runs into two major problems. First, this is a convex optimization algorithm being used to optimize a non-convex function and as such gets stuck in local optima. Second, as this method has more and more multiplications as more hidden layers are added the gradients being propogated backwards through the network become vanishingly small. 3.1.2

Supervised Pre-Training

The idea of supervised pre-training for the network is to use labeled data and then greedily train the network layer by layer from the bottom up, fixing each layer’s parameters after it has been trained. After this step, fine tune the network by training as usual with backpropogation. This approach performs better than with no pre-training but still fails to outperform shallow networks. 3.1.3

Unsupervised Pre-training

This works identically to supervised pre-training except for the output that the network is being trained to predict in each step. For unsupervised pretraining instead of predicting a label, the network predicts the input which is of greater dimension than the internal hidden layers. This setup is called an autoencoder and the goal is to minimize the reconstruction error. The intuition is that if the network can reduce the dimensionality of the input and reconstruct it accurately then it has learned useful features of the data. The pre-training is done with the same approach of greedy layerwise optimization and then fine tuning with backpropogation. This approach performs best and outperforms shallow networks.

In practice, using ReLu units instead of sigmoid helps and regularization helps significantly. Convolutional Neural Nets do not commonly use pretraining.

4

4 4.1

26 : Deep Neural Networks and GMs

Deep Belief Networks Background

The connection between graphical models and deep networks goes back to the very first deep learning papers in 2006 which were about innovations in training a particular flavor of Belief Network that also happened to be a neural net. Deep belief networks are generative models with the goal of learning unsupervised representations of probability distributions (such as for images) that can be sampled from.

4.2

Sigmoid Belief Networks

A sigmoid belief network is a directed graphical model (the following diagram is a graphical model not a neural net) of binary variables in fully connected layers where only the bottom layer is observed. The specific parameterization of the conditional probabilities is p(xi |parents(xi )) = 1+exp(−1P wi xj ) . The model has the j following form.

4.3

Contrastive Divergence Training

Contrastive Divergence is a general tool for learning a generative distribution, where the derivative of the log partition function is intractable to compute. log L = log P (D) =

X v∈D

So

(log(P ∗ (v)/Z) ∝

1 X log P ∗ (v) − log Z N v∈D

26 : Deep Neural Networks and GMs

5

X ∂ 1 XX ∂ ∗ ∂ log L ∝ P (h|v) (x) − P (v, h) log P ∗ (x) ∂w N ∂w ∂w v∈D h

v,h

Contrastive divergence estimates the second term with a Monte Carlo estimate from a 1-step Gibbs sampler. The first term is called the clamped/wake phase and the second is called the unclamped/sleep/free phase. For a belief net the joint is automatically normalized: Z is a constant 1. The second term is zero. For the L weight wij from j into i the gradient log ∂wij = (xi − pi )xj so stochastic gradient ascent is ∆wij ∝ (xi − pi )xj which is called the delta rule. This is a stochastic version of the EM algorithm. In the E step we get samples from the posterior and in the M step we apply a learning rule that makes them more likely.

In practice applying Contrastive Divergence to a Deep Sigmoid Belief Net fails. Sampling from the posterior of many hidden layers does not approach equilibrium quickly enough.

4.4

Boltzman Machines

A Boltzman Machine is an undirected graphical model of binary variables with pairwise potentials. The parameterization of the potentials is ψij (xi , xj ) = exp(xi Wij xj ) so a higher value of Wij leads to higher correlation between xi and xj .

4.5

Restricted Boltzman Machines

A restricted boltzman machine assumes all visible units are one layer and hidden units are another. Units within a layer are unconnected.

Since units within a layer are not connected we can do Gibbs sampling updating a whole layer at a time. So to learn an RBM, the program is to start with a training vector on visible units and then alternate between updating all hidden units in parallel and updating all visible units in parallel. ∆wij = η[< vi hj >0 − < vi hj >∞ ] By restricting connectivity we do not have to wait for equilibrium in the clamped phase. We can also curtail the Markov chain during learning. We can use

6

26 : Deep Neural Networks and GMs

∆wij = η[< vi hj >0 − < vi hj >1 ] which is not the correct gradient but works well in practice.

Another way to look at RBM’s is that they are equivalent to infinitely deep belief networks. Sampling alternately from the visible units and the hidden units infinitely is the same as sampling from an infinitely deep belief network with the weights between each layer being tied. So training an RBM is actually training an infinitely deep sigmoid belief net.

Our goal then is to untie the weights of the layers of this net. We do this by first freezing the first RBM and then training another RBM on top of it. This unties the weights of layers 2+ in the net. We can then freeze the weights of the second layer to untie layers 3+ and so on. After untying the third layer, the structure is as follows:

26 : Deep Neural Networks and GMs

7

Where the third RBM is as before equivalent to an infinitely deep belief network with tied weights. To train this network we can use the same wake sleep structure as before. The wake phase is doing a bottom up pass, starting with a pattern in the training set. Use the delta rule to make this more likely under the generative model. The sleep phase is to do a top-down pass starting from an equilibrium sample from the top RBM. Use the delta rule to make this more likely under the recognition model.

10-708: Probabilistic Graphical Models 10-708, Spring 2016

27: Hybrid Graphical Models and Neural Networks Lecturer: Matt Gormley

1

Scribes: Jakob Bauer, Otilia Stretcu, Rohan Varma

Motivation

We first look at a high-level comparison between deep learning and standard machine learning techniques (like graphical models). The empirical goal in deep learning is usually that of classification or feature learning, whereas in graphical models we are often interested in transfer learning and latent variable inference. The main learning algorithm in deep learning is back-propagation whereas in machine learning, this is a major focus of open research with many inference algorithms, some of which we have studied in this course. Deep learning is also often tested and evaluated upon massive datasets. When evaluating performance in the many different techniques, we want to able to conclusively determine the factor that led to the improvement in performance. That is, whether it came from using a better model architecture, or a better algorithm that is more accurate and/or faster, or is it merely because of better training data. In current deep learning research, since we evaluate using a black-box performance score, we often lose the ability to make these distinctions. Additionally, the concept of training error is perfectly valid when we have a classifier with no hidden states and hence inference and inference accuracy is not an issue. However, deep neural networks are not just classifiers and possess many hidden states, so inference quality is something that should be looked at more carefully. This is often not the case in current deep learning research. Hence, in graphical models, lots of efforts are directed towards improving inference accuracy and convergence, whereas in deep learning, most effort is directed towards comparing different architectures and activation functions. A convergence between these two fields is something that we might see in the near future given a lot of the similarities between how they both started.

1.1

Hybrid NN + HMM

We have seen how graphical models lets us encode domain knowledge in an easy way. Also neural networks are good at fitting data discriminatively to make good prediction. We now examine whether we can combine the best of these two paradigms, that is, can we define a neural network that incorporates domain knowledge. In this hybrid model, we want to use a neural network to learn features for a graphical model, then train the entire model using back-propagation.

1

2

27: Hybrid Graphical Models and Neural Networks

(a)

Figure 1: Hybrid Models The general recipe for classification is as follows. Given training data {xi , yi }N i=1 , we choose a decision ˆ = fθ (xi ) and a loss function `(ˆ function y y, yi ) ∈ R. Now we can define our objective and can train (minimize the objective) using gradient descent as below. ∗

θ = arg min

N X

θ

`(fθ (xi ), yi )

i=1

θ(t+1) = θ(t) − ηt O`(fθ (xi ), yi ) ˆ = fθ (xi ). We know In our hybrid model, we incorporate our graphical model into the decision function y how to compute the marginal probabilities, but we now want to make a prediction y. We could use a minimum Bayes risk (MBR) decoded h(x) returns the variable assignment with minimum expected loss under the model’s distribution. hθ (x) = Ey∼pθ (·|x) [`(ˆ y, y)] X = arg min pθ (y|x)`(ˆ y, y) ˆ y

y

We can use the 0 − 1 loss function which returns 1 only if the two assignments are identical and 0 otherwise, `(ˆ y, y) = 1 − I(ˆ y, y) in which case the MBR decoder is exactly the MAP inference problem. hθ (x) = arg min ˆ y

X

pθ (y|x)(1 − I(ˆ y, y))

y

= arg max pθ (ˆ y|x) ˆ y

27: Hybrid Graphical Models and Neural Networks

3

We could also use the Hamming loss function which returns the number of incorrect variable assignments `(ˆ y, y) =

V X

(1 − I(yˆi , yi ))

i=1

In this case the MBR decoder is yˆi = hθ (x)i = arg max pθ (yˆi |x) yˆi

and decomposes across variables and requires the variable marginals. Now going back to our recipe, if were to use the MBR decoder as the decision function, we note that we minimized the objective and updated using the gradient descent where we need to compute O`(fθ (xi ), yi ). However the MBR decoder is not differentiable because of the arg max term and we cannot compute the updates.

2

Hybrid NN + HMM

2.1

Model: neural net for emissions

Our model is a hybrid HMM + NN where the HMM has Gaussian emission. Note that this is sometimes called a GMM-HMM model. The hidden states, S1 , . . . , SN , are phonemes, i.e., St ∈ /p/, /t/, /k/, . . . , /g/ , and the emissions, Y1 , . . . , YT , are vectors in k-dimensional space, i.e., Yt ∈ Rk . The emission probabilities are given by a mixture of Gaussians:   X 1 Zk T −1 p exp − (Yt − µk ) Σk (Yt − µk ) p(Yt |St = i) = . (1) n 2 (2π) |Σk | k We can combine the HMM with a NN by attaching identical copies of the same NN to each of the emission nodes (see Figure 2a). The NN takes as input a fixed number of frames of the speech signal, x1 , . . . , xM , and generates learned features of the frames as its ouput. Note that the input frames of neighboring networks are overlapping.

(a)

(b)

Figure 2: Hybrid HMM + NN model There are several issues with this model: • The visual representation of the HMM and NN are similar although they are completely unrelated. • The emissions are simultaneously generated in a top-down fashion by the HMM and in a bottom-up fashion by the NN. • The emissions are not actually observed but only appear as a function of the observed speech signal.

4

27: Hybrid Graphical Models and Neural Networks

The picture improves somewhat if we use factor graph notation instead (see Figure 2b). The graph still represents an HMM but we have gotten rid of the arrows. At test time, we first feed the speech signal as input into the NN and then use the resulting features for each timestep as observations in the HMM. In a third step, we run the forward-backward algorithm on the HMM to compute the alpha-beta probabilities and the gamma (marginal) probabilities according to the following formulae: X  αi,t , P Y1t , St = i model = bi,t aj,i αj,t−1 , (2) j

βi,t , P





T Yt+1 St

 X = i, model = ai,j bj,t+1 βj,t+1 ,

(3)

j

 γi,t , P St = 1 Y1t , model = αi,t βi,t ,

(4)

where the transition and emission probabilities are defined as ai,j , P (St = 1|St−1 = j) ,

(5)

bi,t , P (Yt |St = i) .

(6)

Finally, we compute a loss in the form of the log-likelihood: log p(S, Y ) = αend,T .

(7)

Note that to get a loss of this form, all we have to do is use the alpha-beta probabilities as the outputs of the decision functions.
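For concreteness, the alpha-beta-gamma recursions of equations 2-4 can be implemented directly; the following sketch uses a hypothetical 2-state HMM with made-up transition and emission values (not from the lecture).

```python
# A minimal sketch (hypothetical 2-state HMM) of the forward-backward recursions:
# alpha and beta messages, and the (unnormalized) gamma marginals.
import numpy as np

a = np.array([[0.8, 0.2],       # a[i, j] = P(S_t = j | S_{t-1} = i)
              [0.3, 0.7]])
b = np.array([[0.9, 0.4],       # b[i, t] = P(Y_t | S_t = i), hypothetical emission scores
              [0.1, 0.6],
              [0.2, 0.5]]).T    # shape (2 states, 3 time steps)
pi = np.array([0.5, 0.5])       # initial state distribution

T = b.shape[1]
alpha = np.zeros((2, T))
beta = np.ones((2, T))
alpha[:, 0] = pi * b[:, 0]
for t in range(1, T):
    alpha[:, t] = b[:, t] * (a.T @ alpha[:, t - 1])     # forward message, Eq. (2)
for t in range(T - 2, -1, -1):
    beta[:, t] = a @ (b[:, t + 1] * beta[:, t + 1])     # backward message, Eq. (3)
gamma = alpha * beta                                     # unnormalized marginals, Eq. (4)
print(np.log(alpha[:, -1].sum()))                        # log-likelihood log p(Y_1..Y_T)
```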

2.2

Learning: backprop for end-to-end training

Given the decision function and the the objective function, the only remaining question is how to compute the gradient. Fortunately, all the feed-forward computations are now differentiable since there is no longer an argmax operation but only summations and products to compute the α’s and β’s. Recall that in backpropagation, for a given y = g(u) and u = h(x) the gradient can be computed by applying the chain rule: J

X dyi duj dyi = , dxk duj dxk j=1

∀i, k.

(8)

In our case, g(·) is the graphical model and h(·) the neural network. Hence, all we need to be able to compute the gradient are partial derivatives for the graphical model and partial derivatives for the neural network. du dyi In particular, du is the partial derivative of the log-likelihood with respect to the emissions and dxkj is the j partial derivative of the emission with respect to the NN parameters. Thus, the forward pass is given by the following equations: cj =

X

λij xi ,

(9)

i

 z j = σ cj , X d= ρj zj ,

(10) (11)

j

Yt,k = σ(d) .

(12)

27: Hybrid Graphical Models and Neural Networks

αi,t = bi,t

5

X

aj,i αj,t−1 ,

(13)

j

βi,t =

X

ai,j bj,t+1 βj,t+1 ,

(14)

j

γi,t = αi,t βi,t ,

(15)

J = log p(S, Y ) = αend,t .

(16)

The first box corresponds to the NN part and the second box to the HMM part. The third box shows how to compute the loss. For the backward pass, we have: dJ γi,t = , dbi,t bi,t X dJ dbi,t dJ = , dYt,k dbi,t dYi,k

(17) (18)

bi,t

    X X 1 Zk dbi,t T −1   p = wkl (µkl − Ylt ) exp − (Yt − µk ) Σk (Yt − µk ) n dYjt 2 (2π) |Σk | k

dJ dd dJ dρj dJ dzj dJ dcj dJ dλij

3 3.1

(19)

l

= = = = =

dJ dy , dy dd dJ dd , dd dρj dJ dd , dd dzj dJ dzj , dzj dcj dJ dcj , dcj dλij

 dy = σ(d) 1 − σ(d) , dd dd = zj , dρj dd = ρj , dzj   dzj = σ cj 1 − σ cj , dcj dcj = xi . dλij

(20) (21) (22) (23) (24)

Recurrent Neural Networks (RNNs) RNN definition

Deep neural networks, such as DNNs, DBNs, and DBMs, have several limitations. For instance, they only accept inputs of fixed size and produce outputs of fixed size. This makes it difficult to work with sequences of data, such as time series. In addition, these models have a fixed number of computation steps, given by the number of layers in the model. Recurrent Neural Networks (RNNs) are able to deal with these problems and model sequences of data, such as sentences, speech, stock market data, and signals. RNNs are neural networks with loops, which allow information to persist. Figure 3 (a) shows a typical RNN with one hidden

6

27: Hybrid Graphical Models and Neural Networks

layer. The loop allows information to be passed from one step of the network to the next. We can define a RNN as follows: inputs : x = (x1 , x2 , ..., xT ), xi ∈ RI hidden units : h = (h1 , h2 , ..., hT ), hi ∈ RJ outputs : y = (y1 , y2 , ..., yT ), yi ∈ R nonlinearity : H

(a) RNN

K

where

ht = H(Wxh xt + Whh ht−1 + bh ) yt = Why ht + by

(b) Unrolled RNN

Figure 3: Recurrent neural network (RNN) We can unroll the RNN through time as shown in Figure 3 (b). Notice that when T = 1, we have a standard feed-forward neural network with one hidden layer. When T > 1, we can share parameters and allow input/output pairs of arbitrary length. To train the RNN, we unroll it through time and apply backpropagation, as in a typical feed-forward neural net.

3.2

Bidirectional RNNs

The loop in the RNN allows it to incorporate past information into the current prediction. However, in certain applications, such as speech processing, we want to model both the left (backward in time) and right (forward in time) context in order to make a prediction. This can be modeled using Bidirectional RNNs. Bidirectional RNNs contain two types of hidden layers, one which has forward loop, and the other has a backward loop (see Figure 4).

(a) Bidirectional RNN

(b) Bidirectional RNN unrolled

Figure 4: Bidirectional recurrent neural network We can define a Bidirectional RNN as follows:

27: Hybrid Graphical Models and Neural Networks

inputs : x = (x1 , x2 , ..., xT ), xi ∈ RI → − ← − hidden units : h and h outputs : y = (y1 , y2 , ..., yT ), yi ∈ RK nonlinearity : H

3.3

7

where

→ − → − −) −→ − h t−1 + b→ − xt + W→ h t = H(Wx→ h h h h ← − ← − ← ← − ← − ← − h t = H(Wx h xt + W h h h t−1 + b h− ) → − ← − − h t + by − h t + W→ yt = W← hy hy

Deep RNNs

In the examples above, we defined, for simplicity, a RNN with only one hidden layer, and a bidirectional RNN with only one hidden layer in each direction. However, we can also define RNNs and bidirectional RNNs with several hidden layers. For example, Figure 5 shows a deep bidirectional RNN. The upper level hidden units, as well as the output layer, contain inputs from the previous two layers (i.e. wider input). This model allows us to capture information from both left and right context, as well as learn interesting features in the higher levels of the RNN.

3.4

LSTM

RNNs are very useful due to their ability to incorporate past information to the present task. For instance, in sentence processing, it can use the previous words Figure 5: Deep bidirecin order to make predictions for the next word. In theory, the structure of the tional RNN RNN should allow it to handle long-term dependencies, meaning it should be able to incorporate information from inputs as far back in time as necessary. In practice, this is most often not possible, due to what is known as the vanishing gradient problem. This refers to the fact that, in standard RNNs, the signal from much earlier inputs is lost through time, because we multiply many very low valued gradients. Long Short-Term Memorm (LSTM) neural networks are a type RNN that deals with this problem, and are able to learn long-term dependencies. In LSTMs, the hidden units contain special gates, which determine how the information is propagated through time. These gates allow the network to choose to remember or forget information. Figure 6 (a) shows an unrolled LSTM, and Figure 6 (b) shows a LSTM hidden unit, which controls the propagation of information.

(a) LSTM

(b) LSTM unit

Figure 6: Long - short term memory network (LSTM) The LSTM hidden unit in Figure 6 (b) consists of: (i) an input gate, it , which masks out the standard RNN inputs, (ii) a forget gate, ft , which masks out the previous cell, (iii) a cell, ct , which combines the input with the forget mask in order to learn keep a current state of the LSTM, (iv) an output gate, ot ,

8

27: Hybrid Graphical Models and Neural Networks

which masks out the values of the next hidden input. Note that there are several LSTM gate architectures that have been proposed in the literature. Here we show a simple version. However, [Jozefowicz et al., 2015] have explored 10,000 different LSTM architectures, and found several variants that worked similarly well on several tasks. We can also have Deep Bidirectional LSTMs (DBLSTM), which has the same general topology as a Deep Bidirectional RNN, but the hidden units consist of LSTM units, instead of single neurons. Such networks present no additional representation power over Deep Bidirectional RNNs. However, they are easier to learn, and therefore are preferred in practice.

4

Hybrid RNN + HMM

The hybrid RNN + HMM is similar in principle to the hybrid NN + HMM described in section 2. The difference is that we replace the NN with an RNN; we can also make the RNN bidirectional and replace the hidden units of the RNN with LSTM units. We show this architecture in Figure 7. Inference and training in this model are similar to those of the hybrid NN + HMM: the objective is to maximize the log-likelihood, inference can be done with the forward-backward algorithm, and the model can be learned with Stochastic Gradient Descent (SGD) via backpropagation. [Graves et al., 2013] applied this model to the task of phoneme recognition. Their results show that the hybrid NN + HMM models perform better than neural-net-only models, and among the hybrid architectures, the HMM + Deep Bidirectional LSTM performs best.

5

Figure 7: Hybrid RNN + HMM

Hybrid CNN + CRF

We can also create hybrid graphical models and neural networks using Conditional Random Fields (CRFs) and Convolutional Neural Networks (CNNs). Such a model is shown in Figure 8. This model allows us to compute the values of the factors using a CNN. We allow each factor to have its own parameters (i.e. its own CNN). This hybrid model is particularly suited for Natural Language Processing applications such as part-of-speech tagging, noun-phrase and verb-phrase chunking, namedentity recognition, and semantic role labeling. In such NLP tasks, the CNN is 1-dimensional. The output of the CNN is fed into the input of the CRF. Experiments on these tasks show that the CNN+HMM model gives results close to the state of the art on benchmark systems. In addition, the advantage of this model is that it does not require hand-engineered features.

Figure 8: Hybrid CNN + CRF

27: Hybrid Graphical Models and Neural Networks

9

References [Graves et al., 2013] Graves, A., Mohamed, A.-r., and Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6645–6649. IEEE. [Jozefowicz et al., 2015] Jozefowicz, R., Zaremba, W., and Sutskever, I. (2015). An empirical exploration of recurrent network architectures. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2342–2350.

10-708: Probabilistic Graphical Models 10-708, Spring 2016

28 : Distributed Algorithms for ML
Lecturer: Eric P. Xing        Scribes: Joe Runde, Michael Muehl

1 Introduction

Big data has been all the rage for the last few years- there are even markets where it is sold or used as a form of currency. It has been around for many years, for example biologists sequenced the human genome more than a decade ago. The hope is that big data can generate a lot of value, function, or capabilities for us. But the question is, how do we take advantage of this big data? We need systems that can search, digest, index, and understand contents of big data, but there are a few key challenges with in doing so. This lecture will outline those challenges, and describe big picture solutions with distributed algorithms.

2 Challenges with understanding Big Data

Looking into the data to find the information we need is not an easy task as data grows larger. Fundamental issues arise in figuring out how to make the computation, storage, transmission, and imputation of the data happen rapidly. Then there are technical challenges beyond that, of ensuring the models continue to work better as the size of the system increases. Naive implementations often end up with very poor performance, such as the red line in Figure 1. Below we outline three key challenges of scaling up ML methods.

2.1 Challenge 1: Massive Data Scale

Figure 1: Common scalability issue with ML methods

The first problem is very obvious: the data is very big. For example, an Internet of Things dataset may contain data from over 50B devices, and in the realm of social data, Facebook now has over 1.6 billion active users. Even neglecting most of the information from these devices or people, we need to deal with a graph with billions of nodes and the edges among them. Already we can see there will be a big problem just storing and transmitting this data, as it cannot fit into the memory of a single machine.

2.2 Challenge 2: Massive Model Scale

Once the problem of transmitting and storing all of the data is dealt with, there is the even bigger challenge of doing computations with it. Sophisticated models like very large deep CNNs, or models with parameters that grow with the amount of data may have billions of parameters. How can we ensure that computations of these parameters are done correctly and efficiently across many machines, and then attempt to do inference?

2.3 Challenge 3: Supporting New Methods

Finally, older methods like KNN, K means, or Naive Bayes are often insufficient for picking out the concepts that we want to learn from big data. In fact, even for smaller data like image sets that could be stored on a single computer, larger models such as deep neural nets with billions of nodes are used. While some newer software packages like Torch, YahooLDA, or Vowpal Wabbit provide newer models, many packages still only implement the classic methods that have been in place for decades. Implementing newer models to gain insights from big data is often a challenge left up to the system engineers.

2.4 Solution

How can we start to solve these challenges? They are still open questions: it is unclear even whether the best approach is to try to make a serial algorithm faster, or to use a simpler approach in parallel and try to guarantee correctness across many machines. We will need to present a formal view of machine learning methods, and explain how to design scalable systems based on a few mathematical and functional properties of the method.

3 ML Methods

We’ll start by defining some broad properties of ML programs. An ML program consists of a mathematical model from some family, which is solved by an ML algorithm of one of a few types. Further, we can view each ML program as either a probabilistic program or an optimization program. A visualization is provided in Figure 2.

Figure 2: Classification of ML programs

We can then break down an ML program into its constituent parts. Usually we have:

• An objective/loss function L(θ, D). Examples are the data log-likelihood, or the MSE between data and predictions.

• Regularization / prior / structural knowledge r(θ). Examples: L2 regularization to prevent overfitting, L1 regularization for sparsity, or Dirichlet priors for smoothing.

• An algorithm to solve for the model given the data. Usually, ML models are solved with an iterative-convergent ML algorithm, which repeatedly updates θ until it is stationary. For example, MCMC methods or Variational Inference are used to solve probabilistic programs, while Gradient Descent or Proximal methods are used to solve optimization programs.

Here are a couple of examples of optimization programs and probabilistic programs:

3.1 Example 1: Lasso Regression

Lasso Regression is an example of an optimization program. Here the data D consists of a feature matrix X with response vector y, and the model θ is a parameter vector. The objective L(θ, D) is the least-squares difference between y and Xθ. The regularization r(θ) is the L1 norm of θ, times a tunable scalar λ. Possible algorithms to solve the problem are Coordinate Descent or Stochastic Proximal Gradient Descent.
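To make the loss/regularizer decomposition concrete, here is a minimal NumPy sketch of the Lasso program's two components, using small synthetic data invented purely for illustration:

```python
import numpy as np

def lasso_loss(theta, X, y):
    """Objective L(theta, D): squared error between y and X @ theta."""
    return 0.5 * np.sum((y - X @ theta) ** 2)

def lasso_reg(theta, lam):
    """Regularizer r(theta): L1 norm of theta times a tunable scalar lambda."""
    return lam * np.sum(np.abs(theta))

# Tiny synthetic data, invented for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
theta_true = np.zeros(10)
theta_true[:3] = [2.0, -1.0, 0.5]
y = X @ theta_true + 0.1 * rng.normal(size=100)

theta = np.zeros(10)
print(lasso_loss(theta, X, y) + lasso_reg(theta, lam=0.1))  # full objective value
```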

3.2 Example 2: Topic Models

Topic models are an example of a probabilistic program. Here the objective L(θ, D) is the log-likelihood of the documents given topic indicators, document-topic distributions, and topic-word distributions. The prior r(θ) is a Dirichlet over the document-topic and topic-word distributions, with tunable hyperparameters controlling their strength. The algorithm used to learn the model is usually collapsed Gibbs sampling.

Figure 3: Error tolerance of ML programs

4 ML vs Classical Computing

In classical computing, programs are deterministic and operation-centric: a strict set of instructions of predetermined length is followed until completion. A problem with this approach is that strict operational correctness is required; one error in the middle of a sorting algorithm and the end result is useless. Atomic correctness is required, and there is an entire field of research on fault tolerance to address this issue.

On the other hand, ML algorithms are probabilistic, optimization-centric, and iterative-convergent. Update rules, which may or may not be deterministic, are followed for an indeterminate amount of time until some solution is reached. In this case, a few errors here and there don't really matter, as the program will continue seeking an optimum regardless of a few missteps.

ML programs do have pitfalls of their own, though. First, there can be structural dependency which isn't known a priori and can change as the program runs. As correlations between parameters change, the parallel structure of the program must change as well to stay efficient. Convergence of the parameters themselves is also non-uniform: often most of the parameters converge very quickly, while the rest take thousands more iterations to fit. Structuring a system around these properties can be challenging.

Having this formal classification and breakdown of ML programs can be immensely helpful for building scalable systems. We can implement a smaller number of "workhorse" algorithms which solve the basic challenges outlined above, and then apply these algorithms to newer, more sophisticated models as they are created.

5 Distributed ML Programs

Generally, there are two types of programs we may wish to parallelize: optimization programs, such as Stochastic Gradient Descent or LASSO, and probabilistic programs implementing various Markov Chain Monte Carlo methods, such as Gibbs Sampling. Additionally, we can take two approaches to the parallelization procedure: we can parallelize the model or parallelize the data. For this lecture, we are going to focus on optimization programs, specifically Stochastic Gradient Descent.

5.1 Parallel Methods for Stochastic Gradient Descent

Stochastic Gradient Descent is an algorithm that at first glance seems poorly suited to parallelization. Recall the basic Stochastic Gradient Descent template:

$$x \leftarrow x - \eta_t \lambda \,\Omega'[x] - \eta_t \,\partial_x f_t(x)$$

where $f_t(x)$ is the loss function and $\Omega[x]$ is the regularizer.
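As a point of reference for the parallel variants below, here is a minimal serial sketch of this template in NumPy, assuming a per-example squared-error loss and Ω(x) = ½‖x‖²; the function names and default settings are illustrative, not from the lecture:

```python
import numpy as np

def sgd(X, y, lam=0.01, eta=0.01, epochs=5, seed=0):
    """Serial SGD for the template x <- x - eta*lam*Omega'[x] - eta*grad f_t(x),
    using a per-example squared-error loss and Omega(x) = 0.5*||x||^2."""
    rng = np.random.default_rng(seed)
    x = np.zeros(X.shape[1])
    for _ in range(epochs):
        for t in rng.permutation(len(y)):
            grad_f = (X[t] @ x - y[t]) * X[t]   # gradient of the loss on example t
            x -= eta * (lam * x + grad_f)       # regularizer gradient plus loss gradient
    return x
```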

5.2 Standard Parallel Methods

The most immediately obvious problem is that Stochastic Gradient Descent is inherently serial: the parameters must be updated after seeing every example. The next example needs to wait until the update is done before it can calculate its gradient, so any parallelization scheme needs to address this issue.

However, the method provided by Zinkevich et al. gives theoretical guarantees of convergence in a multicore setting. Here we execute the above loop on each core independently while working with a common shared weight vector x. This means that if we update x in a round-robin fashion we will have a delay of k − 1 when using k cores. The delay is the time between seeing an instance of f and when we can actually apply the update to x. The intuition is that as we converge, optimization becomes more and more of an averaging procedure, and in this case a small delay does not hurt at all.

The Achilles heel of the above algorithm is that it requires extremely low latency between the individual processing nodes. While this is fine within a single computer or on a GPU, it will fail miserably on a network of computers, since we may have many milliseconds of latency between the nodes. This suggests an even simpler algorithm for further parallelization:

1. Overpartition the data into k blocks for k clusters (i.e., for each cluster simply draw a random fraction c = O(m/k) of the entire dataset).

2. Perform stochastic gradient descent on each machine separately, with a constant learning rate.

3. Average the solutions between the different machines.

This scheme can also be shown to converge.
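A toy simulation of this scheme, with the k "machines" emulated by a simple loop rather than an actual cluster (all names and defaults are assumptions for illustration):

```python
import numpy as np

def parallel_sgd_average(X, y, k=4, eta=0.01, epochs=5, seed=0):
    """Overpartition the data into k blocks, run SGD independently on each block
    with a constant learning rate, and average the resulting solutions."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    solutions = []
    for block in np.array_split(rng.permutation(n), k):   # one block per "machine"
        x = np.zeros(d)
        for _ in range(epochs):
            for t in rng.permutation(block):
                grad = (X[t] @ x - y[t]) * X[t]
                x -= eta * grad
        solutions.append(x)
    return np.mean(solutions, axis=0)                     # average across machines
```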

5.3 Hogwild!

Interestingly, in many machine learning problems the cost can be represented as just a function $f(x)$, and additionally this function is sparse. Consider Sparse SVM:

$$\min_x \sum_{\alpha \in E} \max(1 - y_\alpha x^T z_\alpha, 0) + \lambda \|x\|_2^2$$

where $z_\alpha$ is sparse; or Matrix Completion:

$$\min_{W,H} \sum_{(u,v) \in E} (A_{uv} - W_u H_v^T)^2 + \lambda_1 \|W\|_F^2 + \lambda_2 \|H\|_F^2$$

where the input matrix $A$ is sparse; or Graph Cuts:

$$\min_x \sum_{(u,v) \in E} w_{uv} \|x_u - x_v\|_1 \quad \text{subject to } x_v \in S_D,\; v = 1, \ldots, n$$

where $W$ is a sparse similarity matrix encoding a graph.

In the Hogwild! algorithm, we take advantage of this sparsity as follows, in parallel on each core:

1. Sample $e$ uniformly at random from $E$.
2. Read the current parameters $x_e$ and evaluate the gradient of the function $f_e$.
3. Sample a coordinate $v$ uniformly from $e$.
4. Perform SGD on only coordinate $v$ with a small constant step size.

Since we are only updating a single coordinate at a time, there is no need for memory locking of any kind. This gives a near-linear speedup if done on a single machine. However, there are problems if we try to use this algorithm in a distributed environment. In particular, the variance of the parameter estimate is

$$\mathrm{Var}_{t+1} = \mathrm{Var}_t - 2\eta_t \,\mathrm{cov}(x_t, \mathbb{E}_{\Delta_t}[g_t]) + O(\eta_t \xi_t) + O(\eta_t^2 \rho_t^2) + O_t^*$$

where $O_t^*$ represents 5th and higher order terms, which grow with the delay $\Delta_t$ among machines. So the higher the delay between machines, the higher the variance in the parameter estimate, and hence the more instability in convergence. Since distributed systems generally have much higher delay than single machines, this algorithm is better suited for use on a single machine.
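A rough single-process sketch of the Hogwild!-style update, using Python threads over a shared NumPy vector; this only illustrates the lock-free, per-coordinate update pattern (the GIL means it will not show a real speedup, and a production implementation would exploit truly sparse gradients):

```python
import threading
import numpy as np

def hogwild(X, y, n_threads=4, eta=0.01, n_updates=5000, seed=0):
    """Lock-free parallel SGD: each thread samples an example, computes a sparse
    gradient, and updates the shared parameter vector x without any locking."""
    n, d = X.shape
    x = np.zeros(d)                                   # shared parameters, no lock

    def worker(worker_seed):
        rng = np.random.default_rng(worker_seed)
        for _ in range(n_updates):
            t = rng.integers(n)                       # sample an example e from E
            nz = np.nonzero(X[t])[0]                  # coordinates this example touches
            if nz.size == 0:
                continue
            v = rng.choice(nz)                        # sample a coordinate v from e
            grad_v = (X[t] @ x - y[t]) * X[t, v]      # partial derivative wrt x_v
            x[v] -= eta * grad_v                      # unsynchronized coordinate update

    threads = [threading.Thread(target=worker, args=(seed + i,)) for i in range(n_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return x
```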

10-708: Probabilistic Graphical Models 10-708, Spring 2014

29: Distributed Systems for ML Lecturer: Eric P. Xing

Scribes: Petar Stojanov, Christoph Dann

1 Case Studies for Distributed Machine Learning Algorithms

In the first part of the lecture, several examples of how to distribute common machine learning algorithms were discussed.

1.1 Coordinate Descent

Coordinate descent is a general strategy for convex optimization problems. The basic idea is to iteratively solve the problem by optimizing the objective with respect to one variable at a time while keeping all other dimensions fixed. While the order in which the dimensions are optimized can be chosen arbitrarily, it is crucial for convergence guarantees that the updates occur sequentially. Coordinate descent can, for example, be used to efficiently solve the LASSO objective

$$\min_\beta \; \frac{1}{2}\|y - X\beta\|_2^2 + \lambda \sum_j |\beta_j|, \tag{1}$$

where we assume $X_i^\top X_i = 1$. Using the KKT conditions, one can derive that

$$\arg\min_{\beta_i} \; \frac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1 = S_\lambda\!\left(X_i^\top (y - X_{-i}\beta_{-i})\right) \tag{2}$$

with $S_\lambda(x) = \mathrm{sgn}(x)\max\{0, |x| - \lambda\}$ being the soft-threshold operator. Similarly, in block coordinate descent, a batch of variables is updated at a time while all others are kept fixed.

One algorithm for parallel coordinate descent is Shotgun [Bradley et al., 2011]. In each iteration, this algorithm chooses P variables uniformly at random and then updates these variables in parallel. Shotgun scales linearly when the features are almost uncorrelated; for perfectly correlated features, Shotgun is not faster than sequential updates. The analysis of Scherrer et al. [2012] shows that one can mitigate this issue by, instead of choosing the variables to update at random, partitioning the dimensions into P blocks so as to optimize the block spectral radius. Since this is itself a challenging combinatorial optimization problem, the authors propose a heuristic: greedily cluster dimensions with the largest absolute inner products into the same block. This heuristic yields good empirical results.

The parameter scheduler STRADS by Lee et al. [2014] for model-parallel machine learning is applicable to coordinate descent and can further improve upon these results. STRADS leverages two ideas for scheduling parameter updates. First, parameters are chosen randomly, with higher probability assigned to parameters that changed a lot recently (prioritization). Second, STRADS respects dependencies by avoiding updating parameters in parallel that are too linearly dependent. See Figure 1 for example results; a minimal sketch of the sequential update appears below.
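Here is that minimal sketch of the sequential soft-threshold update in NumPy, assuming the columns of X are normalized so that $X_i^\top X_i = 1$; Shotgun would instead pick P coordinates at random each round and update them in parallel:

```python
import numpy as np

def soft_threshold(x, lam):
    """S_lambda(x) = sgn(x) * max(0, |x| - lambda)."""
    return np.sign(x) * np.maximum(0.0, np.abs(x) - lam)

def lasso_cd(X, y, lam, n_iters=100):
    """Cyclic coordinate descent for the LASSO, assuming normalized columns of X."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iters):
        for j in range(d):
            r_j = y - X @ beta + X[:, j] * beta[j]   # residual with beta_j removed
            beta[j] = soft_threshold(X[:, j] @ r_j, lam)
    return beta
```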

Figure 1: Performance comparison of STRADS and Shotgun

1.2 Proximal Gradient Descent

In the vast majority of this course we saw objective functions that are smooth, for which gradient updates can be derived in a straightforward fashion. However, the criteria to be optimized sometimes contain regularization terms which are not smooth. One example of such a criterion is the LASSO, given by:

$$\min_\beta \; \|y - X\beta\|_2^2 + \lambda \sum_j |\beta_j| \tag{3}$$

Let's treat the two components as a smooth component f and a non-smooth component g, yielding the criterion:

$$\min_\beta \; f(\beta) + g(\beta) \tag{4}$$

Then, in order to construct a gradient update, one has to solve what is called a proximal map, an auxiliary problem given by:

$$\mathrm{prox}_t(\beta) = \arg\min_z \; \frac{1}{2t}\|\beta - z\|^2 + g(z) \tag{5}$$

Then, instead of our regular gradient update, the update for the next iteration's β becomes:

$$\beta^k = \mathrm{prox}_t\!\left(\beta^{k-1} - t\nabla f(\beta^{k-1})\right) \tag{6}$$

To provide a bit of intuition, this update arises from a quadratic approximation of the smooth part f: it can be interpreted as staying close to the gradient update while also making g small. For some non-smooth regularizers, the proximal map has a closed form, and certain properties of the proximal mapping problem allow it to be computed easily for more complex non-smooth regularizers.

In particular, if the non-smooth regularizer is a sum of two non-smooth terms ($g = g_1 + g_2$), then we have the following composition property:

$$P^t_{g_1 + g_2} = P^t_{g_1}\!\left(P^t_{g_2}\right) \tag{7}$$

Given this proximal mapping framework, there is a way to implement it in a distributed environment. Namely, given servers and workers, we perform the gradient update on the workers:

$$\delta^{k-1} = \beta^{k-1} - t\nabla f(\beta^{k-1}) \tag{8}$$

where t is the step size. We then send this information to the servers and perform the proximal map to get the next iterate of β:

$$\beta^k = P^t_g(\delta^{k-1}) \tag{9}$$

The new update is then returned to the workers and this process is iterated.
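A toy sketch of this worker/server split for the LASSO, with the worker gradient step and the server-side proximal (soft-threshold) step written as separate functions; here they are simply called in a loop, whereas a real system would place them on different machines (function names are illustrative):

```python
import numpy as np

def worker_gradient_step(beta, X, y, t):
    """Worker side: delta = beta - t * grad f(beta), with f(beta) = 0.5*||y - X beta||^2."""
    grad = X.T @ (X @ beta - y)
    return beta - t * grad

def server_prox_step(delta, t, lam):
    """Server side: proximal map of g(beta) = lam*||beta||_1, i.e. the soft-threshold."""
    return np.sign(delta) * np.maximum(0.0, np.abs(delta) - t * lam)

def proximal_gradient(X, y, lam=0.1, t=None, n_iters=200):
    if t is None:
        t = 1.0 / np.linalg.norm(X, 2) ** 2     # step size from the Lipschitz constant of grad f
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        delta = worker_gradient_step(beta, X, y, t)   # computed on the workers
        beta = server_prox_step(delta, t, lam)        # proximal map on the servers
    return beta
```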

1.3 ADMM

In the previous section we saw how to deal with, and parallelize, criteria that decompose into a smooth and a non-smooth part. However, there is a wide variety of problems whose objective functions do not decompose into such neat sums of terms. Here is a general representation of the problem:

$$\min_{w,z} \; f(w) + g(z) \tag{10}$$

$$\text{s.t. } Aw + Bz = c \tag{11}$$

In this general objective, the variables w and z are uncoupled, but the constraint couples them and makes it difficult to derive proper updates. To make this problem manageable, we introduce what is called the "augmented Lagrangian", given by:

$$L_\mu(w, z; \lambda) = f(w) + g(z) + \lambda^T(Aw + Bz - c) + \frac{\mu}{2}\|Aw + Bz - c\|_2^2 \tag{12}$$

From the theory of duality, it can be shown that a saddle point of this augmented Lagrangian recovers the original objective in equations (10)–(11), so we now solve the following problem:

$$\min_{w,z} \max_\lambda \; L_\mu(w, z; \lambda) = f(w) + g(z) + \lambda^T(Aw + Bz - c) + \frac{\mu}{2}\|Aw + Bz - c\|_2^2 \tag{13}$$

Solving this problem with a fixed dual variable λ yields the following updates for the (k+1)-th step:

$$w^{k+1} \leftarrow \arg\min_w L_\mu(w, z^k; \lambda^k) = \arg\min_w \; f(w) + \frac{\mu}{2}\|Aw + Bz^k - c + \lambda^k/\mu\|^2 \tag{14}$$

$$z^{k+1} \leftarrow \arg\min_z L_\mu(w^{k+1}, z; \lambda^k) = \arg\min_z \; g(z) + \frac{\mu}{2}\|Aw^{k+1} + Bz - c + \lambda^k/\mu\|^2 \tag{15}$$

Now we fix the primal variables w, z and perform a gradient ascent update on the dual λ:

$$\lambda^{k+1} \leftarrow \lambda^k + t(Aw^{k+1} + Bz^{k+1} - c) \tag{16}$$

An obvious way to parallelize this algorithm is to perform all the primal variable updates on worker machines. The problem is that waiting for all of them to complete before updating the dual λ creates a bottleneck. This can be avoided by doing the dual update asynchronously (after seeing s out of n primal updates).
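As a concrete (serial) instance, here is a sketch of the ADMM updates for the LASSO using the splitting f(w) = ½‖y − Xw‖², g(z) = λ‖z‖₁ with constraint w − z = 0 (so A = I, B = −I, c = 0) and the scaled dual u = λ_dual/µ; this particular splitting is a standard textbook choice rather than something specified in the lecture, and in a distributed setting the w-update would be farmed out to the workers:

```python
import numpy as np

def soft_threshold(x, thr):
    return np.sign(x) * np.maximum(0.0, np.abs(x) - thr)

def admm_lasso(X, y, lam=0.1, mu=1.0, n_iters=200):
    """ADMM for the LASSO: f(w) = 0.5*||y - Xw||^2, g(z) = lam*||z||_1, w - z = 0."""
    n, d = X.shape
    w, z, u = np.zeros(d), np.zeros(d), np.zeros(d)
    Xty = X.T @ y
    M = np.linalg.inv(X.T @ X + mu * np.eye(d))   # cached factor for the w-update
    for _ in range(n_iters):
        w = M @ (Xty + mu * (z - u))              # primal w-update (smooth part)
        z = soft_threshold(w + u, lam / mu)       # primal z-update (proximal of g)
        u = u + w - z                             # (scaled) dual ascent step
    return z
```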

1.4 MCMC for Dirichlet Processes

Parallelizing MCMC inference often relies on the idea of auxiliary variables: additional variables are introduced into the model to create (conditional) independences. One example of such a strategy can be used to parallelize inference in Dirichlet processes. Here, the idea is to leverage the fact that the following process

$$D_j \sim \mathrm{DP}(\alpha/P, H), \quad j = 1, \ldots, P \tag{17}$$

$$\phi \sim \mathrm{Dirichlet}(\alpha/P, \ldots, \alpha/P) \tag{18}$$

$$\pi_i \sim \mathrm{Categorical}(\phi) \tag{19}$$

$$\theta_i \sim D_{\pi_i} \tag{20}$$

has the same marginal distribution over the $\theta_i$ as

$$D \sim \mathrm{DP}(\alpha, H) \tag{21}$$

$$\theta_i \sim D. \tag{22}$$

Hence, inference can be conducted in parallel by randomly assigning the data to local Dirichlet process models. Then, until convergence, one iteratively performs Gibbs sampling in parallel within each local model and then swaps clusters across models using Metropolis-Hastings.

1.5 Embarrassingly Parallel MCMC

One way to speed up MCMC algorithms is to run MCMC in parallel on subsets of the data, without any communication between the machines. The objective is to recover the full posterior distribution after the machines have completed, given by:

$$p(\theta \mid x^N) \propto p(\theta)\,p(x^N \mid \theta) = p(\theta)\prod_{i=1}^N p(x_i \mid \theta) \tag{24}$$

The approach consists of partitioning the data into M subsets $\{x^{n_1}, \ldots, x^{n_M}\}$. Each machine m then targets a sub-posterior defined by:

$$p_m(\theta) \propto p(\theta)^{1/M}\, p(x^{n_m} \mid \theta) \tag{25}$$

The approach works by:

• sampling from the respective sub-posteriors $p_m$ in parallel;

• combining the sub-posterior samples via a non-parametric kernel density estimator. Given T samples $\{\theta^m_{t_m}\}_{t_m=1}^T$ from each sub-posterior $p_m$:

$$\hat{p}_m(\theta) = \frac{1}{T}\sum_{t_m=1}^T \frac{1}{h^d}K\!\left(\frac{\|\theta - \theta^m_{t_m}\|}{h}\right) = \frac{1}{T}\sum_{t_m=1}^T \mathcal{N}_d(\theta \mid \theta^m_{t_m}, h^2 I_d) \tag{26}$$

• the sub-posterior kernel density estimators can then be combined:

$$\widehat{p_1 \cdots p_M}(\theta) = \hat{p}_1 \cdots \hat{p}_M(\theta) = \frac{1}{T^M}\prod_{m=1}^M \sum_{t_m=1}^T \mathcal{N}_d(\theta \mid \theta^m_{t_m}, h^2 I_d) \propto \sum_{t_1=1}^T \cdots \sum_{t_M=1}^T w_{t_\cdot}\, \mathcal{N}_d\!\left(\theta \,\middle|\, \bar\theta_{t_\cdot}, \tfrac{h^2}{M} I_d\right) \tag{27}$$

where $\bar\theta_{t_\cdot}$ is the sample mean of $\{\theta^m_{t_m}\}_{m=1}^M$ and $w_{t_\cdot} = \prod_{m=1}^M \mathcal{N}_d(\theta^m_{t_m} \mid \bar\theta_{t_\cdot}, h^2 I_d)$.
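As a simplified illustration of the combination step, the sketch below fits a single Gaussian to each sub-posterior sample set and multiplies the Gaussians; this parametric shortcut is an assumption made here for brevity and is coarser than the non-parametric kernel estimator described above:

```python
import numpy as np

def combine_subposteriors_gaussian(sample_sets):
    """Parametric combination: fit a Gaussian to each sub-posterior's samples
    (each an array of shape (T, d)) and multiply the Gaussian densities."""
    precisions, weighted_means = [], []
    for samples in sample_sets:
        mu = samples.mean(axis=0)
        cov_m = np.atleast_2d(np.cov(samples, rowvar=False))
        prec = np.linalg.inv(cov_m)
        precisions.append(prec)
        weighted_means.append(prec @ np.atleast_1d(mu))
    combined_cov = np.linalg.inv(sum(precisions))    # precisions add under a product of Gaussians
    combined_mean = combined_cov @ sum(weighted_means)
    return combined_mean, combined_cov
```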

Figure 2: Diagram of the LightLDA algorithm

1.6 Gibbs Sampling for Topic Models

The most famous way of parallelizing MCMC is collapsed Gibbs sampling for topic models. It repeatedly resamples topic assignments from a simple conditional:

$$p(z_{ij} = k \mid x_{ij}, \delta_i, B) \propto (\delta_{ik} + \alpha_k)\, \frac{\beta_{x_{ij}} + B_{k, x_{ij}}}{V\beta + \sum_{v=1}^V B_{k,v}} \tag{28}$$

where δ and B are the document-topic and topic-word (word count matrix) model parameters. The documents can be partitioned into pieces with replicated model parameters and distributed to workers, where parallel Gibbs sampling is performed. This approach has several problems:

• Convergence is not guaranteed, because Markov chain ergodicity is broken.

• CPU cycles are wasted while synchronizing the model.

The parallelism incurs error in B, and the parameters become stale during the sampling process. One way to deal with this is to allow asynchronicity and more frequent communication between the machines. Another way is to partition the word count matrix into non-overlapping blocks. This method is called LightLDA and is shown in Figure 2. Briefly, the algorithm works by keeping the data on disk and sending model portions to machines as needed. A short outline (a sketch of the per-token sampling step follows the outline):

• Preprocess the data, determine which set of words goes to which block, and organize it on disk.


• Pull a set of words from a key-value store.

• Perform sampling, write the result to disk sequentially, and send the changes back to the key-value store.

This way, the errors due to partitioning the data are eliminated, because the data's integrity is kept intact on disk.
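Here is the promised sketch of the per-token collapsed Gibbs step, implementing the conditional in equation (28) with NumPy and a symmetric prior β; the surrounding bookkeeping (decrementing and re-incrementing the count matrices, partitioning documents across workers) is omitted and the names are illustrative:

```python
import numpy as np

def resample_token(w, doc_topic, topic_word, alpha, beta, rng):
    """One collapsed-Gibbs draw for a single token with word id w, following (28):
    p(z = k) proportional to (doc_topic[k] + alpha[k]) * (beta + topic_word[k, w])
                             / (V*beta + sum_v topic_word[k, v]).
    doc_topic and topic_word are assumed to already have this token's current
    assignment decremented."""
    V = topic_word.shape[1]
    probs = (doc_topic + alpha) * (beta + topic_word[:, w]) / (V * beta + topic_word.sum(axis=1))
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Toy usage with invented counts.
rng = np.random.default_rng(0)
K, V = 4, 50
doc_topic = rng.integers(0, 5, size=K).astype(float)
topic_word = rng.integers(0, 3, size=(K, V)).astype(float)
new_z = resample_token(w=7, doc_topic=doc_topic, topic_word=topic_word,
                       alpha=np.full(K, 0.1), beta=0.01, rng=rng)
```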

2 Systems and Architectures for Distributed Machine Learning

While the first part of the lecture was devoted to specific examples of distributed machine learning algorithms, the last part discussed higher-level architectures and general challenges for distributed ML. Many machine learning methods can be cast as iterative-convergent methods; for examples, see the previous section. There are two basic approaches to parallelization:

• Data-Parallelism: The data is partitioned and distributed onto the different workers. Each worker typically updates all parameters based on its share of the data (see Section 1.4 for an example).

• Model-Parallelism: Each worker has access to the entire dataset but only updates a subset of the parameters at a time (see Section 1.1 for an example).

Of course, both approaches can also be combined into data- and model-parallel methods. There are two major challenges in data parallelism:

1. Need for partial synchronicity: Synchronization (and therefore communication cost) should only be incurred when necessary. Workers waiting for synchronization should be avoided when possible.

2. Need for straggler tolerance: Slow workers need to be allowed to catch up.

One attempt at overcoming these challenges is the Stale Synchronous Parallel (SSP) model, in which each thread runs at its own pace without synchronization, but, for the theoretical guarantees to hold, the threads are not allowed to drift more than some bounded number of steps apart. To reduce communication overhead, each worker has its own cached version of the parameters (which need not be up to date). This approach has been further improved in the Eager SSP protocol (ESSP). A minimal sketch of the staleness bound appears below.
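Here is that minimal sketch of the SSP staleness bound, expressed as a small clock object shared by worker threads; it is only meant to illustrate the rule that no worker may run more than `staleness` steps ahead of the slowest worker, not the parameter caching of a real SSP parameter server:

```python
import threading

class StaleSynchronousClock:
    """Each worker advances its own clock, but blocks if it would run more than
    `staleness` steps ahead of the slowest worker."""
    def __init__(self, n_workers, staleness):
        self.clocks = [0] * n_workers
        self.staleness = staleness
        self.cond = threading.Condition()

    def advance(self, worker_id):
        with self.cond:
            self.clocks[worker_id] += 1
            self.cond.notify_all()    # a slow worker catching up may unblock others
            # Wait until this worker is within `staleness` steps of the slowest one.
            while self.clocks[worker_id] > min(self.clocks) + self.staleness:
                self.cond.wait()
```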

References

Joseph K. Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin. Parallel coordinate descent for L1-regularized loss minimization. arXiv preprint arXiv:1105.5379, 2011.

Seunghak Lee, Jin Kyu Kim, Xun Zheng, Qirong Ho, Garth A. Gibson, and Eric P. Xing. On model parallelization and scheduling strategies for distributed machine learning. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2834–2842. Curran Associates, Inc., 2014.

Chad Scherrer, Ambuj Tewari, Mahantesh Halappanavar, and David Haglin. Feature clustering for accelerating parallel coordinate descent. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 28–36. Curran Associates, Inc., 2012.