Here is the complete list of oral exam questions along with their answers:
Empirical Risk Minimization (ERM) consists of finding the best hypothesis from a given hypothesis class by minimizing the empirical risk (or training error), which is the average loss over a training set.
Formally, for a hypothesis class $ \mathcal{H}=\{h: X\rightarrow Y\}$ and a loss function $\ell$, the empirical risk is
$$\begin{equation*}
R_{\mathrm{emp}}(h) = \frac{1}{N} \sum_{i=1}^N \ell(x_i, y_i,h)
\end{equation*}$$
ERM aims to find $ h^* $ such that: $ h^* = \arg\min_{h \in \mathcal{H}} R_{\mathrm{emp}}(h). $ This minimum can be combinatorially hard to achieve, so in practice the framework is often relaxed and we settle for a good (approximately optimal) solution.
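As a toy illustration, here is a minimal ERM sketch over a small finite hypothesis class of 1-D threshold classifiers with 0-1 loss (the class, data, and loss below are assumptions made only for this example):

```python
import numpy as np

def empirical_risk(h, X, y):
    """Average 0-1 loss of hypothesis h on the training set (X, y)."""
    predictions = np.array([h(x) for x in X])
    return np.mean(predictions != y)

# Illustrative finite hypothesis class: threshold classifiers h_t(x) = 1[x >= t].
thresholds = np.linspace(-3, 3, 61)
hypotheses = [lambda x, t=t: int(x >= t) for t in thresholds]

# Toy training data (assumed for the example), labels generated by the threshold t = 0.5.
rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = (X >= 0.5).astype(int)

# ERM: pick the hypothesis with the smallest empirical risk.
risks = [empirical_risk(h, X, y) for h in hypotheses]
best = int(np.argmin(risks))
print(f"best threshold ~ {thresholds[best]:.2f}, empirical risk = {risks[best]:.3f}")
```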
Probably Approximately Correct (PAC) learning is a framework in computational learning theory that provides a formal definition of what it means for a learning algorithm to generalize well.
A hypothesis set $ \mathcal{H} $ is agnostic PAC-learnable iff there exists a learning algorithm $A$ and a sample-size function $N(\epsilon,\delta)$ such that, for every $\epsilon,\delta\in(0,1)$ and every distribution $D$, when $A$ is trained on at least $N(\epsilon,\delta)$ i.i.d. samples from $D$:
$$\begin{equation*}
p_D\left(R(h_D^A)\le \min_{h\in\mathcal{H}}R(h)+\epsilon \right)\ge 1-\delta
\end{equation*}$$
where $R(h)$ is the true (expected) risk and $h_D^A$ is the hypothesis returned by $A$ when trained on the sample $D$.
Given a class of hypothesis functions $\mathcal{H}$, a set of points is said to be shattered by $\mathcal{H}$ if, for every possible labeling of those points, there is some $h\in\mathcal{H}$ that realizes it.
The VC dimension of $\mathcal{H}$ is the cardinality of the largest set of points that $\mathcal{H}$ can shatter (it is infinite if arbitrarily large sets can be shattered).
Probabilistic inference involves computing the posterior distribution of unknown variables given observed data. This process uses the principles of probability theory to make predictions or decisions under uncertainty. The basic ideas include, given two sets of random variables $x$ (observed) and $z$ (unobserved):
- Marginalization: summing over or integrating out unwanted variables to obtain a marginal distribution, like $p(x)=\int p(x,z)dz$ (or $\sum_z p(x,z)$ if $z$ is discrete).
- Conditioning: computing the distribution of the remaining variables given observed values of the others, like $p(z|x=\bar{x})=\frac{p(\bar x,z)}{p(\bar x)}$.

Combined together we have that: $$\begin{equation*} p(z_j|x_i=\bar x_i)=\frac{p(\bar x_i,z_j)}{p(\bar x_i)}=\frac{\int p(x,z)\,dx_{-i}\,dz_{-j}}{\int p(x,z)\,dx_{-i}\,dz} \end{equation*}$$
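As a small numerical illustration of these two operations, here is a sketch that marginalizes and conditions a discrete joint $p(x,z)$ stored as a table (the joint values are made up):

```python
import numpy as np

# Hypothetical joint p(x, z) over x in {0,1,2} (rows) and z in {0,1} (columns).
joint = np.array([[0.10, 0.15],
                  [0.20, 0.25],
                  [0.05, 0.25]])
assert np.isclose(joint.sum(), 1.0)

# Marginalization: p(x) = sum_z p(x, z)
p_x = joint.sum(axis=1)

# Conditioning: p(z | x = 1) = p(x=1, z) / p(x=1)
x_obs = 1
p_z_given_x = joint[x_obs] / joint[x_obs].sum()

print("p(x) =", p_x)
print("p(z | x=1) =", p_z_given_x)
```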
A Bayesian Network (BN) is a kind of Probabilistic Graphical Model (PGM) that represents the joint probability distribution of a set of random variables using a Directed Acyclic Graph (DAG).
Each node represents a random variable, and edges represent conditional dependencies.
The joint distribution factorizes according to the graph structure:
$$\begin{equation*} P(x_1, \ldots, x_n) = \prod_{i=1}^n P(x_i | \text{Pa}(x_i)) \end{equation*}$$
where $ \text{Pa}(x_i) $ denotes the parents of $ x_i $ in the graph.
Given Random Variables (RV) $a$, $b$, and $c$, we can study the conditional independence relations implied by the graph by looking at how paths between $a$ and $b$ pass through $c$.
There are three particular scenarios to consider in Bayesian Networks (where the bar means that the RV is observed):
- Tail to tail, so $a \leftarrow \bar c \rightarrow b$, which means $p(a,b,c)=p(a|c)p(b|c)p(c)$; then we have $$\begin{align*} p(a,b)&=\sum_c p(a|c)p(b|c)p(c)\ne p(a)p(b) \\ p(a,b\,|\,c)&=\frac{p(a,b,c)}{p(c)}=\frac{p(a|c)p(b|c)p(c)}{p(c)}=p(a|c)p(b|c)\\ &\text{so it means that } a \not\perp\!\!\!\perp b \text{ and } (a\perp\!\!\!\perp b)\,|\,c \end{align*}$$
- Head to tail, so $a\rightarrow \bar c\rightarrow b$, which means $p(a,b,c)=p(b|c)p(c|a)p(a)$; then we have $$\begin{align*} p(a,b)&=\sum_c p(b|c)p(c|a)p(a)=p(b|a)p(a) \\ p(a,b\,|\,c)&=\frac{p(a,b,c)}{p(c)}=\frac{p(b|c)p(c|a)p(a)}{p(c)}=p(b|c)p(a|c)\\ &\text{so it means that } a \not\perp\!\!\!\perp b \text{ and } (a\perp\!\!\!\perp b)\,|\,c \end{align*}$$
- Head to head, so $a\rightarrow \bar c \leftarrow b$, which means $p(a,b,c)=p(a)p(b)p(c|a,b)$; then we have $$\begin{align*} p(a,b)&=\sum_c p(a)p(b)p(c|a,b)= p(a)p(b) \\ p(a,b\,|\,c)&=\frac{p(a,b,c)}{p(c)}=\frac{p(a)p(b)p(c|a,b)}{p(c)}\ne p(a|c)p(b|c)\\ &\text{so it means that } a \perp\!\!\!\perp b \text{ and } (a\not\perp\!\!\!\perp b)\,|\,c \end{align*}$$

This can all be extended to subsets $A,B,C$, and in general a path from $a$ to $b$ is blocked by $c$ if:

- $c$ is observed and the path is head-to-tail or tail-to-tail at $c$;
- neither $c$ nor any of its descendants is observed and the path is head-to-head at $c$.
A Markov Random Field (MRF) is an undirected graphical model that represents the joint distribution of a set of random variables using an undirected graph. Each node represents a variable and edges represent direct probabilistic interactions. The joint distribution factorizes over the cliques (fully connected subgraphs) of the graph:
$$
\begin{equation*} P(x_1, \ldots, x_n) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C)\end{equation*}
$$
where $ \mathcal{C} $ denotes the set of cliques, $ \psi_C $ are potential functions (they must always be non-negative), and $ Z $ is the partition function (normalization constant), $ Z = \sum_{x} \prod_{C \in \mathcal{C}} \psi_C(x_C) $.
8. Define Factor Graphs and how to convert Bayesian Networks and Markov Random Fields to Factor Graphs
Factor graphs are bipartite graphs representing the factorization of a function. They consist of variable nodes and factor nodes, with an edge between a variable node and a factor node whenever that variable appears in that factor.
It's possible to convert to Factor Graphs the following models:

- Bayesian Networks: each conditional probability is represented as a factor node connected to the variable and its parents, formally $p(x_j|\text{Pa}(x_j))\rightarrow f_j(\{x_j\} \cup \text{Pa}(x_j))$.
- Markov Random Fields (MRF): each clique potential is represented as a factor node connected to the variables in the clique, formally $\forall c\in\mathcal{C}: f_c(x_c)=\psi_c(x_c)$.
We do this because now we can perform message passing on it, without loss of generality.
The Sum-Product (or Belief Propagation) algorithm is an efficient message-passing algorithm used on tree-structured factor graphs (so, with no loops) to compute marginal and conditional distributions of variables.
Given the set $\mathcal{N}(x)$ of factor nodes neighboring a variable node $x$ (and $\mathcal{N}(f)$ of variable nodes neighboring a factor $f$), messages are exchanged along the edges:

- from variable to factor: $\mu_{x\rightarrow f}(x)=\prod_{f_i\in\mathcal{N}(x)\setminus f}\mu_{f_i\rightarrow x}(x)$;
- from factor to variable: $\mu_{f\rightarrow x}(x)=\sum_{x_1,\ldots,x_k} f(x,x_1,\ldots,x_k)\prod_{x_i\in\mathcal{N}(f)\setminus x}\mu_{x_i\rightarrow f}(x_i)$.
The Sum-product algorithm works in two steps:
- forward pass: messages are sent from the leaves to the root;
- backward pass: propagation from the root back to the leaves.
After this, it's possible to obtain marginals and conditionals!
- Computation of the marginal of a variable node $x$: $p(x)=\prod_{f\in\mathcal{N}(x)}\mu_{f\rightarrow x}(x)$
- Computation of the marginal of a factor node $f(\bar x)$ (where $\bar x=\mathcal{N}(f)$ is the set of neighboring variable nodes): $p(\bar x)=f(\bar x)\prod_{x\in\mathcal{N}(f)}\mu_{x\rightarrow f}(x)$
- Computation of conditionals $p(x|y=\hat y)$ with $y\subseteq\{x_1,\ldots,x_n\}$:
  - Clamping $y$ to $\hat y$, so we fix $y=\hat y$ in the joint $p(x,x_1,\ldots,x_k,y=\hat y)$
  - Running the sum-product algorithm with $y$ fixed to $\hat y$, so $p(x,y=\hat y)=\sum_{x_1,\ldots,x_k}p(x,x_1,\ldots,x_k,y=\hat y)$
  - Computing the conditional as $p(x|y=\hat y)=\frac{p(x,y=\hat y)}{p(\hat y)}$, where $p(\hat y)=\sum_x p(x,y=\hat y)$

In case of unnormalized joint distributions, we have to normalize.
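Below is a minimal sketch of sum-product message passing on a 3-variable chain with binary variables and pairwise factors (the factor tables are arbitrary); the marginal obtained from the messages is checked against brute-force enumeration of the joint:

```python
import numpy as np

# Chain factor graph: x1 -- f12 -- x2 -- f23 -- x3, all variables binary.
# Hypothetical (unnormalized) pairwise factor tables.
f12 = np.array([[1.0, 2.0],
                [0.5, 1.5]])       # f12[x1, x2]
f23 = np.array([[2.0, 1.0],
                [1.0, 3.0]])       # f23[x2, x3]

# Leaf variable-to-factor messages are identically 1, so the factor-to-variable
# messages arriving at x2 are just the factors summed over the leaf variable.
mu_f12_to_x2 = f12.sum(axis=0)     # sum over x1
mu_f23_to_x2 = f23.sum(axis=1)     # sum over x3

# Marginal of x2 = product of incoming messages, then normalization.
p_x2 = mu_f12_to_x2 * mu_f23_to_x2
p_x2 /= p_x2.sum()

# Brute-force check: sum the full joint f12[x1,x2] * f23[x2,x3] over x1 and x3.
joint = f12[:, :, None] * f23[None, :, :]
p_x2_brute = joint.sum(axis=(0, 2))
p_x2_brute /= p_x2_brute.sum()

print(p_x2, p_x2_brute)            # the two vectors should coincide
```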
The Max-Plus (or Max-Product) algorithm is an exact inference algorithm used on tree-structured factor graphs to find the configuration of variables that maximizes the joint probability, known as the Maximum a Posteriori (MAP) estimate.
The algorithm operates by passing messages from variable nodes to factors and vice versa. It is analogous to the Sum-Product algorithm but replaces summation with maximization and products with sums (working in log-space for numerical stability).
The message-passing rules are defined as follows:
- Message from variable $x$ to factor $f$: $\begin{equation} \mu_{x \to f}(x) = \sum_{f_i \in \mathcal{N}(x) \setminus f} \mu_{f_i \to x}(x) \end{equation}$
- Message from factor $f$ to variable $x$: $\begin{equation} \mu_{f \to x}(x) = \max_{x_1, \ldots, x_k} \left[ \log f(x, x_1, \ldots, x_k) + \sum_{x_i \in \mathcal{N}(f) \setminus x} \mu_{x_i \to f}(x_i) \right] \end{equation}$ where $\mathcal{N}(\cdot)$ denotes the set of neighbouring nodes.
The base cases are:
- For a leaf variable node: $\mu_{x \to f}(x) = 0$
- For a leaf factor node: $\mu_{f \to x}(x) = \log f(x)$
To also find the exact MAP assignment (not just its probability), a backtracking step is required. During the forward pass, each factor-to-variable message also records the maximizing values of the other variables: $\begin{equation} \Phi_{f \to x}(x) = \arg\max_{x_1, \ldots, x_k} \left[ \log f(x, x_1, \ldots, x_k) + \sum_{x_i \in \mathcal{N}(f) \backslash x} \mu_{x_i \to f}(x_i) \right] \end{equation}$
After a full forward pass to the root, the root's value is fixed to its MAP state. The algorithm then performs a backward pass, using the stored $\Phi_{f \to x}$ tables to recover, step by step down to the leaves, the maximizing value of every other variable.
Chapter 5: Hidden Markov Models (HMM)
A Hidden Markov Model (HMM) is a specific state-space model for sequential data where the latent variables $z_n$ are discrete and form a Markov chain, while the observations $x_n$ may be discrete or continuous.
The model is characterized by three parameters:
- Transition Probability: $A_{ij} = p(z_n = j | z_{n-1} = i)$, governing the evolution of the discrete hidden states.
- Initial Distribution: $\boldsymbol{\pi}_i = p(z_1 = i)$.
- Emission Probability: $\boldsymbol{\psi}=p(x_n | z_n)$, the distribution of observations given the hidden state (e.g., a Gaussian if $x_n$ is continuous).
The joint distribution of the hidden states and observations factorizes as:
$$\begin{equation*}
p(\mathbf{x}, \mathbf{z} | \boldsymbol{\theta}) = p(z_1 | \boldsymbol{\pi}) \prod_{n=2}^{N} p(z_n | z_{n-1}, \mathbf{A}) \prod_{n=1}^{N} p(x_n | z_n, \boldsymbol{\psi})
\end{equation*}$$
where $\boldsymbol{\theta} = (\boldsymbol{\pi}, \mathbf{A}, \boldsymbol{\psi})$ collects all the model parameters.
The primary inference problems in HMMs are:
- Filtering: the task of estimating the current hidden state given all observations up to the current time, i.e., computing the belief state $p(z_n | x_1,\ldots,x_n)$.
- Smoothing: the task of obtaining a better estimate of a past hidden state by incorporating "future" evidence, i.e., computing $p(z_k | \bar x)$ for $1 \leq k < N$, where $\bar x=(x_1,\ldots,x_N)$.
- Most Likely Explanation (Viterbi Decoding): the task of finding the most probable sequence of hidden states that could have generated the observed sequence, i.e., finding $z^* = \arg\max_{\bar z} p(\bar z | \bar x)$. This is solved efficiently using the Viterbi algorithm, a special case of the Max-Plus algorithm on the HMM's trellis graph.
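As an illustration of Viterbi decoding, here is a compact log-space sketch for a discrete-emission HMM (the transition, initial, and emission tables below are invented for the example):

```python
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """Most likely hidden state sequence for a discrete-emission HMM.

    log_pi: (K,)   log initial probabilities
    log_A:  (K, K) log transitions, log_A[i, j] = log p(z_n=j | z_{n-1}=i)
    log_B:  (K, V) log emissions,   log_B[j, v] = log p(x_n=v | z_n=j)
    obs:    sequence of observed symbols in {0, ..., V-1}
    """
    K, N = len(log_pi), len(obs)
    delta = np.zeros((N, K))            # best log-score of any path ending in state j at step n
    backptr = np.zeros((N, K), dtype=int)
    delta[0] = log_pi + log_B[:, obs[0]]
    for n in range(1, N):
        scores = delta[n - 1][:, None] + log_A       # scores[i, j]
        backptr[n] = scores.argmax(axis=0)           # best previous state for each current state
        delta[n] = scores.max(axis=0) + log_B[:, obs[n]]
    # Backtracking from the best final state.
    path = [int(delta[-1].argmax())]
    for n in range(N - 1, 0, -1):
        path.append(int(backptr[n][path[-1]]))
    return path[::-1]

# Illustrative 2-state HMM with 2 observation symbols.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.3, 0.7]])
print(viterbi(np.log(pi), np.log(A), np.log(B), obs=[0, 0, 1, 1, 1]))
```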
Given a $d$-dimensional random vector $x$, the multivariate Gaussian distribution is $\mathcal{N}(x\mid\mu,\Sigma)=\frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$, with mean vector $\mu$ and covariance matrix $\Sigma$.
The distribution has several important closure properties:
- Exponential of quadratic form: every distribution which is the exponential of a quadratic form is a Gaussian distribution. So, given $\begin{equation}p(x)=c\cdot e^{-\frac{1}{2}x^TAx+b^Tx} \end{equation}$ with $x$ and $b$ $d$-dimensional vectors and $A\in\mathbb{R}^{d\times d}$ a symmetric and invertible matrix, completing the square shows that the quantity in the exponential is equal to: $\begin{equation} -\frac{1}{2}x^TAx+b^Tx=-\frac{1}{2}(x-A^{-1}b)^TA(x-A^{-1}b)+\frac{1}{2}b^TA^{-1}b \end{equation}$ If we use the properties of the exponential, we get $\begin{equation} c\,e^{-\frac{1}{2}x^{T}Ax+b^{T}x} = c\,\mathcal{N}(x\mid A^{-1}b, A^{-1})\sqrt{(2\pi)^{d}\det(A^{-1})}\,e^{\frac{1}{2}b^{T}A^{-1}b} \end{equation}$ so, since $p$ is a density, we obtain: $\begin{equation}p(x)=\mathcal{N}(x\mid A^{-1}b,A^{-1})\end{equation}$
- Linear Transformation: if $ x \sim \mathcal{N}(\mu_x, \Sigma_x) $ and $b\sim\mathcal{N}(\mu_b,\Sigma_b)$, with $x$ and $b$ independent, then $ y = Ax + b $ is also Gaussian, with $y\sim\mathcal{N}(A\mu_x+\mu_b,\,A\Sigma_xA^T+\Sigma_b)$.
- Product of Gaussians: the product of two Gaussian densities is still (proportional to) a Gaussian, whereas, in general, the product of two densities is not a density: $$\begin{align*} \mathcal{N}(x|\mu_1,\Sigma_1)\,\mathcal{N}(x|\mu_2,\Sigma_2)=\mathcal{N}(x|\mu,\Sigma)\cdot K \end{align*}$$
- Marginal Distribution: any subset of a multivariate Gaussian is also Gaussian, so assuming that $z=(x,y)$ and $z\sim\mathcal{N}(\mu,\Sigma)$, where $\mu=(\mu_x,\mu_y)$ and $\Sigma=\begin{pmatrix}\Sigma_{xx} & \Sigma_{xy}\\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}$, then we have $x\sim \mathcal{N}(\mu_x,\Sigma_{xx})$.
- Conditional Distribution: the conditional distribution of a subset of Gaussian variables given the others is also Gaussian: $ p(x|y)= \mathcal{N}(x\mid\mu_x+\Sigma_{xy}\Sigma_{yy}^{-1}(y-\mu_y),\,\Sigma_{xx}-\Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx})$.
13. How to transform a generic Gaussian Distribution to a Standard Gaussian via Principal Components
To transform a generic Gaussian distribution $ X\sim N(\mu, \Sigma) $ to a standard Gaussian $ N(0, I) $:
- Centering: subtract the mean: $ X' = X - \mu $.
- Whitening: apply the whitening transformation using the eigenvalue decomposition of $ \Sigma $: $ X'' = \Lambda^{-1/2} E^T X' $, where $ \Sigma = E \Lambda E^T $ and $E$ is an orthogonal matrix whose columns are the eigenvectors of $\Sigma$.
The resulting $ X'' $ will follow $ N(0, I).$
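A short numerical sketch of the two steps above, assuming an illustrative 2-D mean and covariance and checking that the transformed samples have (approximately) zero mean and identity covariance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed generic Gaussian N(mu, Sigma) for the example.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=5000)

# Eigendecomposition Sigma = E Lambda E^T (columns of E are eigenvectors).
lam, E = np.linalg.eigh(Sigma)

# Centering and whitening: X'' = Lambda^{-1/2} E^T (X - mu).
X_white = (np.diag(lam ** -0.5) @ E.T @ (X - mu).T).T

print(np.round(X_white.mean(axis=0), 2))   # ~ [0, 0]
print(np.round(np.cov(X_white.T), 2))      # ~ identity
```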
Bayesian estimation updates the probability of a hypothesis as more evidence becomes available, combining prior knowledge and observed data. The fundamental equation is Bayes' theorem: $ p(\theta | \bar x) = \frac{p(\bar x | \theta) p(\theta)}{p(\bar x)}, $ where:
- $ \theta $ is the parameter or hypothesis.
- $\bar x $ is the observed data.
- $ p(\theta | \bar x) $ is the posterior distribution (updated belief after seeing the data).
- $ p(\bar x | \theta) $ is the likelihood (probability of the data given the parameter).
- $ p(\theta) $ is the prior distribution (initial belief about the parameter).
- $ p(\bar x) $ is the marginal likelihood or evidence, given by $ p(\bar x) = \int p(\bar x | \theta) p(\theta) \, d\theta.$
The problem is that it is difficult to compute the marginal likelihood, since the integral is hard to approximate in a high-dimensional setting. So, it is convenient to compute the predictive distribution of new data, $p(x^\ast\mid\bar x)=\int p(x^\ast\mid\theta)\,p(\theta\mid\bar x)\,d\theta$, or to resort to approximate inference methods.
Bayesian Linear Regression is a probabilistic framework that reformulates standard linear regression by treating the model parameters not as fixed unknown values but as random variables with probability distributions. This shift from point estimates to distributions allows for a principled quantification of uncertainty. The entire framework is built upon three foundational components:
- Prior Distribution: this represents our initial beliefs about the parameters before observing any data. It is a mechanism to encode domain knowledge or desired constraints (like preventing overfitting). A typical choice is a zero-mean Gaussian prior, $p(w | \alpha) = \mathcal{N}(w | 0, \alpha^{-1}I)$, which expresses a preference for smaller weights, functionally equivalent to L2 regularization but within a probabilistic framework.
- Posterior Distribution: after observing data $(\bar{x}, \bar{y})$, we update our beliefs using Bayes' Theorem. The posterior distribution, $p(w | \bar{x}, \bar{y})$, combines the prior with the information from the data (the likelihood) to give a complete distribution over plausible parameter values. For the conjugate Gaussian prior and likelihood, this posterior is also Gaussian.
- Predictive Distribution: the ultimate goal of the Bayesian approach. Instead of making a prediction with a single "best" model, we average the predictions of all possible models (all possible $w$), weighted by their posterior probability. This results in a prediction that inherently includes uncertainty estimates, distinguishing between noise in the data (aleatoric uncertainty) and uncertainty in the model itself (epistemic uncertainty).
In essence, Bayesian Linear Regression provides a complete statistical description of the linear modeling process, from prior assumptions to final predictions with full uncertainty quantification.
The regularization term is a bias introduced into our model, so we treat it as a prior distribution over the weights that is Gaussian, like $p(w|\alpha)=\mathcal N(w|0,\alpha^{-1}I)$.
Fixing the hyperparameters $\alpha$ (prior precision) and $\beta$ (observation-noise precision), the posterior over the weights is again Gaussian.
In general, given a Gaussian prior of the form $p(w)=\mathcal N(w|m_0,S_0)$, the posterior is $p(w|\bar x,\bar y)=\mathcal N(w|m_N,S_N)$ with $S_N^{-1}=S_0^{-1}+\beta\Phi^T\Phi$ and $m_N=S_N(S_0^{-1}m_0+\beta\Phi^T\bar y)$, where $\Phi$ is the design matrix of basis functions evaluated at the training inputs.
The predictive distribution for a new input $ x^* $ is obtained by integrating this over all the possible models:
$$\begin{equation*}
p(y|x^*,\bar x,\bar y,\alpha, \beta)=\int \underbrace{p(y|x^*,w,\beta)}_{\mathcal N(y|w^T\phi(x^*),\beta^{-1})}\cdot \underbrace{p(w|\bar x,\bar y,\alpha,\beta)}_{\text{Gaussian posterior}}\,dw
\end{equation*}$$
Since in the Bayesian linear regression setting both these distributions are Gaussian, and the product of Gaussian densities as well as the marginal of a Gaussian are again Gaussian, it can be shown that the probability above is
$$
p(y|x^*,\bar x,\bar y,\alpha, \beta)=\mathcal N(y\mid m_N^T\phi(x^*),\sigma^2_N(x^*)) \\
\text{where } \sigma^2_N(x^*)=\frac{1}{\beta}+\phi(x^*)^TS_N\phi(x^*)
$$
The prediction is a Gaussian centered on an average prediction and having a variance with two terms: one accounts for the noise of the observations, while the second one accounts for the epistemic uncertainty of our model. As we increase our knowledge, the epistemic uncertainty goes to zero and we are left with just the aleatoric uncertainty: as more data is observed ($N\to\infty$), the term $\phi(x^*)^TS_N\phi(x^*)$ vanishes and only the noise variance $1/\beta$ remains.
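The sketch below implements these posterior and predictive formulas for an assumed polynomial basis, toy sinusoidal data, and illustrative values of $\alpha$ and $\beta$:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0          # assumed prior and noise precisions

def phi(x, degree=3):
    """Polynomial feature map phi(x) = (1, x, x^2, x^3)."""
    return np.vstack([x ** d for d in range(degree + 1)]).T

# Toy data y = sin(x) + noise.
x_train = rng.uniform(-3, 3, size=20)
y_train = np.sin(x_train) + rng.normal(scale=beta ** -0.5, size=x_train.size)
Phi = phi(x_train)

# Posterior p(w | data) = N(w | m_N, S_N) for a zero-mean isotropic prior.
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ y_train

# Predictive distribution at a new input x*.
x_star = np.array([0.5])
phi_star = phi(x_star)
mean_star = phi_star @ m_N
var_star = 1.0 / beta + np.sum(phi_star @ S_N * phi_star, axis=1)   # noise + epistemic terms
print(mean_star, np.sqrt(var_star))
```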
The model evidence, or marginal likelihood,
$\begin{equation}
p(\bar y|\bar x, \alpha, \beta) = \int p(\bar y|\bar x, w, \beta)p(w|\alpha)dw
\end{equation}$
is the primary tool for optimizing the hyperparameters $\alpha$ and $\beta$ directly from the training data, without a separate validation set.
The strategy is to find the hyperparameters that maximize the model evidence:
$$\begin{equation*}
\alpha^*, \beta^* = \arg\max_{\alpha,\beta} p(\bar y \mid \bar x, \alpha, \beta)
\end{equation*}$$
This approach can be justified by considering a Bayesian treatment of the hyperparameters:
- We assume an uninformative hyper-prior $p(\alpha, \beta)$ (e.g., a uniform distribution).
- We approximate the true posterior $p(\alpha, \beta|\bar y, \bar x)$ as a delta function centered at its mode, i.e., $p(\alpha, \beta|\bar y, \bar x) \approx \delta(\alpha - \alpha^*)\delta(\beta - \beta^*)$. This is equivalent to the MAP estimate under the uninformative prior.
In practice, the (log) marginal likelihood can be computed analytically for this conjugate model, and its maximization leads to fixed-point equations for $\alpha$ and $\beta$ that are iterated until convergence.
Taking the eigenvalues $\lambda_i$ of $\beta\Phi^T\Phi$, the quantity $\gamma=\sum_i\frac{\lambda_i}{\alpha+\lambda_i}$ measures the effective number of well-determined parameters, and the fixed-point updates become $\alpha=\frac{\gamma}{m_N^Tm_N}$ and $\frac{1}{\beta}=\frac{1}{N-\gamma}\sum_{n}\left(y_n-m_N^T\phi(x_n)\right)^2$.
Small values of the ratios $\frac{\lambda_i}{\alpha+\lambda_i}$ correspond to directions in parameter space where the prior dominates (the data carries little information), while values close to one correspond to well-determined directions.
Bayesian model comparison relies on the posterior probability of each model, $P(\mathcal{M}_i\mid D)\propto P(D \mid \mathcal{M}_i)\,P(\mathcal{M}_i)$, where:

- $ P(D | \mathcal{M}_i) $ is the marginal likelihood of the model $\mathcal M_i$.
- $ P(\mathcal{M}_i) $ is the prior probability of the model.
There are two methods to perform model selection between 2 models:
- Model averaging: instead of choosing one model we consider both of them, weighted according to the posterior distribution. The predictive distribution then is $$ p(y|x,D)=\sum_j p(y|x,D,\mathcal M_j)\cdot p(\mathcal M_j| D) $$
- Choose the best model by computing the Bayes Factor (of $\mathcal M_1$ versus $\mathcal M_2$): $$ BF(\mathcal M_1,\mathcal M_2)=\frac{p(D|\mathcal M_1)}{p(D|\mathcal M_2)} $$

We choose the model with the larger Bayes Factor. Intuitively this is justified via the KL divergence: if $\mathcal M_1$ is the true model, then on average $\int p(D|\mathcal M_1)\log BF(\mathcal M_1,\mathcal M_2)\,dD=\mathrm{KL}\big(p(D|\mathcal M_1)\,\|\,p(D|\mathcal M_2)\big)>0$, so the Bayes factor favors the correct model on average.
The Laplace approximation provides a method for approximating complex probability distributions by using a Gaussian distribution centered at the mode of the target distribution. Specifically, for a distribution expressed as $p(z)=\frac{1}{Z}f(z)$, where $Z=\int f(z)\,dz$ is an (unknown) normalization constant:
First, we find the mode $z_0$ of $f(z)$, i.e., a point where the gradient vanishes, $\nabla f(z_0)=0$.
Next, we approximate $\log f(z)$ with a second-order Taylor expansion around the mode: $\log f(z)\approx\log f(z_0)-\frac{1}{2}(z-z_0)^TA(z-z_0)$, where $A=-\nabla\nabla\log f(z)\big|_{z=z_0}$ is the negative Hessian evaluated at the mode.
This leads to a Gaussian approximation of the original distribution: $$ q(z) = \mathcal{N}(z \mid z_0, A^{-1}) $$ with the corresponding approximation of the normalization constant: $$ Z \approx \frac{(2\pi)^{k/2}}{|A|^{1/2}} f(z_0) $$
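Here is a minimal 1-D numerical sketch of this recipe, applied to an illustrative unnormalized Gamma-like density; the mode is found by grid search and the second derivative by finite differences:

```python
import numpy as np

# Unnormalized target f(z) (illustrative): a Gamma(shape=3, rate=2) kernel on z > 0.
def log_f(z):
    return 2.0 * np.log(z) - 2.0 * z

# Step 1: find the mode z0 of f (here by a simple grid search, for illustration).
grid = np.linspace(1e-3, 10.0, 100001)
z0 = grid[np.argmax(log_f(grid))]

# Step 2: A = -(d^2/dz^2) log f(z) at z0, via central finite differences.
h = 1e-4
A = -(log_f(z0 + h) - 2 * log_f(z0) + log_f(z0 - h)) / h ** 2

# Gaussian approximation q(z) = N(z | z0, A^{-1}) and normalization-constant estimate.
Z_approx = np.sqrt(2 * np.pi / A) * np.exp(log_f(z0))
Z_exact = 2.0 / 2.0 ** 3     # integral of z^2 e^{-2z} dz = Gamma(3) / 2^3 = 0.25
print(f"mode = {z0:.3f}, variance = {1/A:.3f}, Z_approx = {Z_approx:.3f}, Z_exact = {Z_exact:.3f}")
```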
In the context of Bayesian model comparison, where we have a parametric model with parameters $\theta$, the Laplace approximation can be applied to the model evidence $p(D)=\int p(D|\theta)p(\theta)\,d\theta$, giving $\ln p(D)\approx\ln p(D|\theta_{MAP})+\ln p(\theta_{MAP})+\frac{M}{2}\ln(2\pi)-\frac{1}{2}\ln|A|$, where $M$ is the number of parameters and $A$ is the Hessian of the negative log posterior at $\theta_{MAP}$.
This approximation provides a computationally tractable way to estimate Bayesian model evidence, balancing model fit against model complexity through the determinant of the Hessian matrix.
The Bayesian Information Criterion (BIC) is an asymptotic approximation to the log model evidence, obtained by further simplifying the Laplace approximation for large $N$:
$$ BIC = -2\log L + M\log N \approx -2\log p(D) $$

- $BIC$ corresponds (up to the factor $-2$ and additive constants) to the log marginal likelihood $\log p(D)$.
- $ L $ is the maximized likelihood of the model, which in our framework is $p(D|\hat\theta)$.
- $ M $ is the number of parameters.
- $ N $ is the number of observations.

A lower BIC indicates a better model.
Given a dataset $(\bar x,\bar y)$ with binary labels $y_i\in\{0,1\}$, in Bayesian logistic regression we place a Gaussian prior on the weights, $p(w)=\mathcal N(w\mid 0,\alpha^{-1}I)$, and model $p(y=1\mid x,w)=\sigma(w^T\phi(x))$ with the logistic sigmoid $\sigma$.
The posterior, then, is
$$
p(w|\bar x,\bar y)=\frac{p(\bar y|w,\bar x)p(w)}{p(\bar y|\bar x)}\propto p(\bar y|w,\bar x)p(w)
$$
In the 2-class problem with a Bernoulli model of noise, we can write the likelihood as $p(\bar y|w,\bar x)=\prod_{i=1}^N s_i^{y_i}(1-s_i)^{1-y_i}$, where $s_i=\sigma(w^T\phi(x_i))$. The posterior is not Gaussian, so we approximate it with the Laplace approximation in two steps:
- find $w_{MAP}=\arg\max_w \log p(w|\bar y)=\arg\max_w \big[\log p(w)+\log p(\bar y|w)\big]$ by numerical optimization;
- compute the Hessian of the negative log posterior at $w_{MAP}$; this is the precision matrix, and its inverse is the covariance $S_N$ of the Gaussian.
In this way, the Laplace Approximation of the posterior is $q(w)=\mathcal N(w\mid w_{MAP},S_N)$.
Important: to make predictions we still have to integrate the sigmoid likelihood against this Gaussian; this integral has no closed form, but it can be approximated well, e.g., with the probit approximation.
Why do we do this? In the context of Bayesian Networks we are only able to perform exact inference if our network satisfies certain structural properties; in particular it has to be a tree or a polytree. In the general case we resort to sampling (Monte Carlo) methods.
Rejection sampling generates samples from a target distribution $ p(x) $ by sampling from a proposal distribution $ q(x) $ that is easy to sample from and for which we can find a constant $M$ with $ p(x)\le Mq(x) $ for all $x$.
The steps are:
- Sample $ x $ from $ q(x) $.
- Generate $ u \sim \text{Uniform}(0, 1) $.
- Accept $ x $ if $ u \leq \frac{p(x)}{Mq(x)} $, where $ M $ is a constant such that $ p(x) \leq Mq(x) $ for all $ x $; if rejected, repeat until a sample is accepted.
We can actually compute the expected number of samples from $q$ needed per accepted sample: the acceptance probability is $\int \frac{p(x)}{Mq(x)}q(x)\,dx=\frac{1}{M}$, so on average $M$ proposals are needed for each accepted sample.
A major drawback is efficiency: the expected number of trials per accepted sample is $M$, and in high dimensions a valid $M$ typically grows exponentially with the dimensionality, making the method impractical.
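A minimal sketch of the procedure above, with an unnormalized Beta-like target on $[0,1]$ and a uniform proposal (target, proposal, and $M$ are chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: (unnormalized) Beta(2, 5)-like density on [0, 1]; proposal: Uniform(0, 1).
p_tilde = lambda x: x * (1 - x) ** 4
q = lambda x: 1.0                      # Uniform(0, 1) density
M = p_tilde(0.2) * 1.01                # any M with M*q(x) >= p_tilde(x); the mode is at x = 1/5

samples, trials = [], 0
while len(samples) < 10000:
    x = rng.uniform(0, 1)              # 1. sample from the proposal
    u = rng.uniform(0, 1)              # 2. uniform for the accept test
    trials += 1
    if u <= p_tilde(x) / (M * q(x)):   # 3. accept with probability p_tilde / (M q)
        samples.append(x)

print("acceptance rate:", len(samples) / trials)   # ~ (normalizing constant of p_tilde) / M
print("sample mean:", np.mean(samples))            # Beta(2, 5) mean = 2/7 ≈ 0.286
```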
Importance sampling estimates properties of a target distribution $ p(x) $ using samples from a proposal distribution $ q(x) $; no sample is ever rejected, instead each sample is reweighted.
Given a proposal distribution $q(x)$, expectations under $p$ can be rewritten as expectations under $q$: $\mathbb{E}_p[f(x)]=\mathbb{E}_q\!\left[f(x)\frac{p(x)}{q(x)}\right]\approx\frac{1}{N}\sum_{i=1}^N w_i f(x_i)$, with $x_i\sim q$ and importance weights $w_i=\frac{p(x_i)}{q(x_i)}$.
If $p$ is only known up to a normalization constant ($\tilde p$), we use the self-normalized estimator with weights $\tilde w_i=\frac{w_i}{\sum_j w_j}$. The estimator has low variance only when $q$ is close to $p$ (and has heavier tails); otherwise a few large weights dominate.
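A short sketch of the reweighting idea, estimating $\mathbb{E}_p[x^2]$ for a standard Gaussian target using a broader Gaussian proposal (both distributions are assumptions made for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Goal: estimate E_p[f(x)] for p = N(0, 1) using samples from a broader proposal q = N(0, 2^2).
f = lambda x: x ** 2                              # E_p[x^2] = 1
log_p = lambda x: -0.5 * x ** 2 - 0.5 * np.log(2 * np.pi)
log_q = lambda x: -0.5 * (x / 2) ** 2 - np.log(2) - 0.5 * np.log(2 * np.pi)

x = rng.normal(0, 2, size=50000)                  # samples from q
w = np.exp(log_p(x) - log_q(x))                   # importance weights p(x)/q(x)

estimate = np.mean(w * f(x))                      # estimator with normalized weights
self_norm = np.sum(w * f(x)) / np.sum(w)          # self-normalized version (p known up to a constant)
print(estimate, self_norm)                        # both should be close to 1
```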
A Markov chain is a stochastic process, denoted as $\{x_t\}_{t\ge 0}$, where the future state depends only on the current state, through a transition kernel $p(x_n|x_{n-1})$, and not on past states. This is known as the "Markov property" and is encoded as:
$$
p(x_n|x_{n-1},\ldots,x_0)=p(x_n|x_{n-1})
$$
Another required property is time homogeneity, i.e., the transition kernel $p(x_n|x_{n-1})$ is the same at every time step $n$.
The Detailed Balance condition ensures that a Markov Chain is reversible; it means that there exists a distribution $\pi$ such that $\pi(x)\,p(y|x)=\pi(y)\,p(x|y)$ for all states $x,y$. Any distribution satisfying detailed balance is a stationary distribution of the chain.
It is not true that the existence of a stationary distribution implies that our Markov Chain is reversible.
Markov Chain Monte Carlo (MCMC) methods help us sample from a target distribution $p(x)$, possibly known only up to a normalization constant ($\tilde p$), by constructing a Markov chain whose stationary distribution is $p(x)$ and then simulating it.
The Metropolis-Hastings algorithm is a popular MCMC method. It generates a candidate state from a proposal distribution and accepts or rejects it based on the following criterion:
$$ \alpha(y|x) = \min\left( 1, \frac{\tilde p(y) q(x|y)}{\tilde p(x) q(y|x)}\right) $$
For symmetric proposal kernels, $q(y|x)=q(x|y)$, the criterion simplifies to $\alpha(y|x)=\min\left(1,\frac{\tilde p(y)}{\tilde p(x)}\right)$; this special case is the original Metropolis algorithm.
The Metropolis-Hastings acceptance criterion satisfies the detailed balance condition.
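A minimal sketch of Metropolis-Hastings with a symmetric Gaussian random-walk proposal on an illustrative 1-D bimodal target (so the proposal ratio cancels, as noted above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target: a two-component 1-D Gaussian mixture.
def log_p_tilde(x):
    return np.logaddexp(-0.5 * (x + 2) ** 2, -0.5 * (x - 2) ** 2)

def metropolis_hastings(n_steps, step_size=1.0, x0=0.0):
    x, chain = x0, []
    for _ in range(n_steps):
        y = x + rng.normal(scale=step_size)              # symmetric Gaussian proposal
        log_alpha = log_p_tilde(y) - log_p_tilde(x)      # q cancels since it is symmetric
        if np.log(rng.uniform()) < log_alpha:            # accept with prob min(1, ratio)
            x = y
        chain.append(x)
    return np.array(chain)

chain = metropolis_hastings(20000)
print("mean ~ 0:", chain.mean(), " std:", chain.std())
```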
Vanilla MCMC can have several issues:
- High mixing time: the chain may take a long time to explore the entire state space before it reaches the stationary distribution.
- Choosing the right proposal kernel is difficult: the goal is a good balance between exploration and exploitation, which is especially challenging for multimodal distributions and in high dimensions.
- Convergence Diagnostics: It can be challenging to determine if the chain has converged to the stationary distribution.
Gibbs sampling is a special case of MCMC where each variable is sampled from its full conditional distribution given all the other variables, $\tilde p(x_k\mid x_{-k})$, and every proposal is accepted.
The algorithm involves iteratively sampling each variable from its conditional distribution:
- Pick a variable $k\in\{1,\dots,n\}$
- Set $x_j^{(t+1)}=x_j^{(t)}$ for $j\ne k$
- Sample $x_k^{(t+1)}$ from $\tilde p (x_k|x^{(t)}_{-k})$
- Repeat until convergence.
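A minimal sketch of this scheme for an illustrative zero-mean bivariate Gaussian, whose full conditionals are known in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: zero-mean bivariate Gaussian with correlation rho; its full conditionals are
# p(x1 | x2) = N(rho * x2, 1 - rho^2), and symmetrically for x2.
rho = 0.8
n_steps = 20000
x = np.zeros(2)
chain = np.empty((n_steps, 2))

for t in range(n_steps):
    k = t % 2                                                # sweep the components in turn
    other = x[1 - k]
    x[k] = rng.normal(rho * other, np.sqrt(1 - rho ** 2))    # sample from the full conditional
    chain[t] = x                                             # the other component is unchanged

print(np.round(np.corrcoef(chain[1000:].T), 2))  # correlation ≈ rho after burn-in
```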
Gibbs sampling can be seen as a Metropolis-Hastings algorithm whose proposal is the full conditional: with this choice the acceptance probability is always exactly 1.
The Gibbs sampling algorithm can be generalized, for example, by sampling blocks instead of single variables, i.e. we can sample a whole subset of variables jointly from its conditional given the rest (blocked Gibbs sampling).
There are some issues with this technique:
- We may not satisfy ergodicity in Gibbs sampling, since it may not always be possible to find a path between two states which is made only of "single component" steps.
- If variables are strongly correlated, then our mixing time can be very high.
Convergence diagnostics assess whether an MCMC chain has converged to the stationary distribution.
Given a scalar function $\Psi$ of the state, evaluated along the MCMC trajectories, the Gelman-Rubin diagnostic $\hat R$ compares the variance within each trajectory with the variance between trajectories.
Practically:
- We sample $m/2\ge1$ trajectories from overdispersed initial points: we try to start from different states that are far away in our state space.
- We sample for $4n$ steps.
- We throw away the first half of every trajectory, so that we have only $2n$ points left. This phase is known as the burn-in or warm-up phase. We do this because it takes time to reach the steady state.
- Then we split the remaining trajectories in $2$ parts, so that we are left with $m$ different trajectories, each of length $n$.
Now, for each trajectory $j$ we compute the within-trajectory mean $\bar\Psi_j$ of the values $\Psi_{ij}$, $i=1,\ldots,n$.
We are also interested in the overall mean $\bar\Psi$ and in the within-sequence and between-sequence variances, $W$ and $B$.
We can define these quantities:
$$
s_j^2=\frac{1}{n-1}\sum_{i=1}^n (\Psi_{ij}-\bar \Psi_j)^2 \\
W=\frac{1}{m}\sum_{j=1}^m s_j^2 \\
B=\frac{n}{m-1}\sum_{j=1}^m(\bar \Psi_j-\bar \Psi)^2
$$
where $\bar\Psi_j=\frac{1}{n}\sum_{i=1}^n\Psi_{ij}$ and $\bar\Psi=\frac{1}{m}\sum_{j=1}^m\bar\Psi_j$.
For finite $n$, each individual trajectory has typically not yet explored the whole target distribution, so $W$ underestimates the true variance:
$$
W\le \text{VAR}[\Psi]
$$
But also it can be shown that
$$
\text{VAR}[\Psi]\le\frac{n-1}{n}W+\frac{1}{n}B=\text{VAR}^+[\Psi]
$$
So, we have an upper and lower bound, both converging to the true variance. Therefore we can compute:
$$
\hat R=\sqrt{\frac{\text{VAR}^+[\Psi]}{W}}=\sqrt{\frac{n-1}{n}+\frac{1}{n}\frac{B}{W}}
$$
which can be monitored while running the MCMC simulations. Notice that $\hat R$ approaches $1$ from above as the chains converge and $n$ grows; values close to $1$ (e.g., below $\approx 1.1$) are taken as evidence of convergence.
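A compact sketch of the computation of $\hat R$ from the quantities above, applied to synthetic chains (one set of well-mixed chains and one set of chains stuck around different means, both invented for illustration):

```python
import numpy as np

def r_hat(psi):
    """Gelman-Rubin statistic for an array psi of shape (n, m): n draws for each of m trajectories."""
    n, m = psi.shape
    psi_bar_j = psi.mean(axis=0)                          # per-trajectory means
    psi_bar = psi_bar_j.mean()                            # overall mean
    s2_j = psi.var(axis=0, ddof=1)                        # per-trajectory variances s_j^2
    W = s2_j.mean()                                       # within-sequence variance
    B = n / (m - 1) * np.sum((psi_bar_j - psi_bar) ** 2)  # between-sequence variance
    var_plus = (n - 1) / n * W + B / n
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(0)
# Well-mixed chains all sampling N(0,1) give R-hat ~ 1; chains stuck around different means do not.
good = rng.normal(size=(1000, 4))
bad = rng.normal(size=(1000, 4)) + np.array([0.0, 1.0, 2.0, 3.0])
print(r_hat(good), r_hat(bad))
```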
The effective sample size (ESS) measures the quality of the samples generated by an MCMC algorithm, accounting for the autocorrelation between nearby points in the trajectory.
The ESS is given by:
$ n_{eff} = \frac{nm}{1 + 2 \sum_{k=1}^\infty \rho_k}<nm, $
where $nm$ is the total number of samples and $ \rho_k $ is the autocorrelation at lag $ k $; with positively correlated samples, $n_{eff}$ is strictly smaller than $nm$.
Hamiltonian Monte Carlo (HMC) is an MCMC method that uses Hamiltonian dynamics to propose new states. HMC improves the efficiency of the sampler by reducing random walk behavior. It introduces auxiliary momentum variables and simulates the Hamiltonian dynamics in order to move along the surface of the energy corresponding to our probability distribution.
We always start from the same purpose: sampling from a target distribution $\tilde p(x)$ (possibly unnormalized). HMC augments the state with an auxiliary momentum variable $y$.
We will sample from joint distribution:
$$
p(x,y)=\frac{1}{Z}e^{H(x,y)}
$$
where $H(x,y)=\log\tilde p(x)-\frac{1}{2}\|y\|^2$ plays the role of the (negative) Hamiltonian: $x$ acts as a position with potential energy $-\log\tilde p(x)$ and $y$ as an independent Gaussian momentum, so that the marginal of $x$ is exactly the target.
In Hamiltonian Monte Carlo, we sample by following trajectories defined by the Hamiltonian, ensuring energy remains constant. This means we move through the probability distribution space according to the equations of motion, preserving energy to efficiently explore high-probability regions.
The algorithm is the following:
- Pick the current state $x_i$
- We sample $y\sim p(y)$, randomizing the momentum
- We choose a random direction in time, sampling $t \sim \text{Uniform}\{-1, +1\}$ (this makes the proposal reversible and helps ensure ergodicity; Hamiltonian dynamics are time-reversible)
- We move according to Hamiltonian dynamics from $(x_i,y)$ to a candidate $(x',y')$, doing $L$ steps
- We introduce a rejection criterion:
  - we accept the candidate if $H(x',y')>H(x,y)$
  - otherwise, we accept it with probability $e^{H(x',y')-H(x,y)}$
Meanwhile, each of the $L$ steps from $(x,y)$ towards the candidate is carried out with a discrete, volume-preserving and time-reversible integrator of the Hamiltonian dynamics (typically the leapfrog scheme); the final accept/reject step corrects for the discretization error, so that detailed balance still holds.
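A minimal sketch of an HMC step with a leapfrog integrator on an illustrative 2-D standard Gaussian target; it uses a fixed step size, omits the random time-direction draw for brevity, and writes the acceptance rule in the $e^{H'-H}$ form used above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: standard 2-D Gaussian, through its unnormalized log-density and gradient.
log_p_tilde = lambda x: -0.5 * np.sum(x ** 2)
grad_log_p = lambda x: -x

def hmc_step(x, step_size=0.2, n_leapfrog=20):
    y = rng.normal(size=x.shape)                      # resample the momentum
    x_new, y_new = x.copy(), y.copy()
    # Leapfrog integration of the Hamiltonian dynamics (volume-preserving, reversible).
    y_new += 0.5 * step_size * grad_log_p(x_new)
    for _ in range(n_leapfrog - 1):
        x_new += step_size * y_new
        y_new += step_size * grad_log_p(x_new)
    x_new += step_size * y_new
    y_new += 0.5 * step_size * grad_log_p(x_new)
    # H(x, y) = log p_tilde(x) - ||y||^2 / 2, the convention used in the text above.
    H_old = log_p_tilde(x) - 0.5 * np.sum(y ** 2)
    H_new = log_p_tilde(x_new) - 0.5 * np.sum(y_new ** 2)
    # Accept if the energy did not decrease, otherwise with probability e^{H_new - H_old}.
    return x_new if np.log(rng.uniform()) < H_new - H_old else x

samples, x = [], np.zeros(2)
for _ in range(5000):
    x = hmc_step(x)
    samples.append(x)
print(np.round(np.cov(np.array(samples).T), 2))       # ≈ identity covariance
```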
The Expectation-Maximization (EM) algorithm is designed to solve the problem of finding Maximum Likelihood Estimates (MLE) for the parameters $\theta$ of probabilistic models that involve latent (unobserved) variables.
The problem is formally stated as follows:
- We have observed data $x = (x_1, \ldots, x_N)$.
- We assume a generative model $p(x, z | \theta)$ that also depends on latent variables $z = (z_1, \ldots, z_M)$.
- Our goal is to find the parameters that maximize the marginal likelihood of the observed data: $$ \theta_{ML} = \underset{\theta}{\operatorname{argmax}} \, p(x | \theta) = \underset{\theta}{\operatorname{argmax}} \sum_{z} p(x, z | \theta) $$
But this optimization problem is intractable, and that's because of two main reasons:
- The sum over latent states: the marginal likelihood $p(x | \theta)$ requires a summation (or integration in the continuous case) over all possible configurations of the latent variables $z$. If each $z_i$ can take $K$ values, the number of terms in the sum grows as $K^M$, which is computationally infeasible for anything but trivially small $M$ (the "curse of dimensionality").
- The log-sum problem: to perform optimization, we typically work with the log-likelihood. However, the structure of the problem becomes prohibitively complex: $$ \log p(x | \theta) = \log \left( \sum_{z} p(x, z | \theta) \right) $$ The logarithm is placed outside the sum. This log-sum structure makes it impossible to obtain a closed-form solution or to compute the gradient of the log-likelihood efficiently, as the sum inside the log couples the parameters $\theta$ for all possible values of $z$.
The problem is further compounded when we have multiple i.i.d. observations: the complete log-likelihood $\log p(x,z|\theta)$ would decompose nicely over data points, but the marginal log-likelihood does not, since each observation contributes its own log-of-a-sum term.
In summary, the problem solved by EM is maximum likelihood parameter estimation in models with latent variables, and its primary purpose is to circumvent the direct, intractable optimization of the marginal log-likelihood caused by the log-of-a-sum problem.
The Evidence Lower Bound (ELBO) is a concept used in variational inference, a method in Bayesian statistics used to approximate complex integrals. The ELBO is used to derive a lower bound on the log marginal likelihood, which is often intractable to compute directly.
Let's denote the observed data as $x$, the latent variables as $z$, and introduce an arbitrary distribution $q(z)$ over the latent variables.
The log marginal likelihood can be written, for any distribution $q(z)$, as:
$$\log p(x)=\mathbb{E}_{q(z)}\big[\log p(x,z)-\log p(z|x)\big]$$
We can't compute this directly because the posterior $p(z|x)$ itself depends on the marginal likelihood. However, we can rearrange terms to get an expression that we can compute, adding and subtracting the term $\log q(z)$ inside the expectation:
$$\log p(x)=\mathrm{KL}\big(q(z)\,\|\,p(z|x)\big)+\mathbb{E}_{q(z)}\big[\log p(x,z)-\log q(z)\big]$$
The second term on the right-hand side is the ELBO. It's a lower bound on the log marginal likelihood because the KL divergence is always non-negative. The goal of variational inference is to choose $q$ (within a tractable family) so as to maximize the ELBO, which is equivalent to minimizing the KL divergence from $q(z)$ to the true posterior $p(z|x)$.
35. Discuss the Expectation Maximization Algorithm (both E-step and M-step) (formalism different, to analyze better)
The Expectation-Maximization (EM) algorithm is a popular iterative method used in statistical computations to find maximum likelihood (or MAP) estimates of parameters in models with latent variables.
It alternates between the E-step, which computes the expected value of the log-likelihood given the current parameter estimates, and the M-step, which maximizes this expected log-likelihood to update the parameter estimates.
- Expectation Step (E-step): given the current parameters $\theta^{(t)}$, we first calculate the expected value of the complete-data log-likelihood. This is done by taking the expectation over the conditional distribution of the latent variable $z$ given the observed data $\bar x$ under the current parameter estimates. Mathematically, this can be written as: $$Q(\theta, \theta^{(t)}) = E_{z|\bar x,\theta^{(t)}}[\log p(z,\bar x|\theta)]$$
- Maximization Step (M-step): we find the parameters that maximize the expected log-likelihood found in the E-step: $$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^{(t)})$$
The algorithm alternates between these two steps until the parameters converge, i.e., the estimates of the parameters do not change significantly between successive iterations.
The EM algorithm guarantees that the likelihood will not decrease with each iteration. However, it does not guarantee to find the global maximum likelihood, and the result can be sensitive to the choice of initial parameters.
The algorithm works by repeating the E-step and M-step until convergence (i.e. whenever the increase of the log-likelihood, or the change in the parameters, between successive iterations falls below a chosen threshold).
After setting the Kullback-Leibler divergence to zero, we close the gap vertically and we make the log-likelihood equal to the lower bound.
In the M-step, since we are maximizing the lower bound, we are pushing it up, but then also the log-likelihood will be pushed up, possibly of a larger quantity than the lower bound, creating a new gap.
After both the E and M steps are completed, the log-likelihood is always non-decreasing, until it gets really close to a local maximum and eventually reaches it. So we can consider as the termination condition of the EM algorithm the following: stop when the increase of the log-likelihood (or of the lower bound) between two successive iterations falls below a fixed threshold.
In general the EM algorithm converges to a local optimum of the marginal log-likelihood $\log p(\bar x\mid\theta)$, not necessarily the global one.
The EM algorithm is the standard method for learning the parameters of a Gaussian Mixture Model (GMM). A GMM assumes data is generated from a mixture of $K$ Gaussian components. The model has parameters $\theta=(\pi,\mu,\Sigma)$:
- $\pi$: a vector of mixing coefficients (priors for each component), where $\pi_j\ge 0$ and $\sum_j \pi_j = 1$.
- $\mu_j$: the mean of the $j$-th Gaussian component.
- $\Sigma_j$: the covariance matrix of the $j$-th Gaussian component.
The latent variable $z$ is a one-hot encoded vector indicating which component generated a data point $x$.
EM for GMM:
- Initialization: initialize the parameters $\theta^{(0)} = (\pi, \mu, \Sigma)$ (e.g., using K-means).
- E-step (Expectation): compute the "responsibility" $\gamma(z_{nj})$ of component $j$ for generating data point $x_n$. This is the posterior probability $p(z_j=1 | x_n)$: $$ \gamma(z_{nj}) = \frac{\pi_j \mathcal{N}(x_n | \mu_j, \Sigma_j)}{\sum_{i=1}^K \pi_i \mathcal{N}(x_n | \mu_i, \Sigma_i)} $$
- M-step (Maximization): update the parameters using the current responsibilities. These updates are weighted by $\gamma(z_{nj})$.
  - Update mixing coefficients: $$ \pi_j^{\text{new}} = \frac{N_j}{N}, \quad \text{where } N_j = \sum_{n=1}^N \gamma(z_{nj}) $$
  - Update means: $$ \mu_j^{\text{new}} = \frac{1}{N_j} \sum_{n=1}^N \gamma(z_{nj}) x_n $$
  - Update covariances: $$ \Sigma_j^{\text{new}} = \frac{1}{N_j} \sum_{n=1}^N \gamma(z_{nj}) (x_n - \mu_j^{\text{new}})(x_n - \mu_j^{\text{new}})^T $$
- Repeat: iterate the E-step and M-step until the log-likelihood converges.
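A compact sketch of these updates for a 1-D mixture of two Gaussians on synthetic data (data, initialization, and number of iterations are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data drawn from two Gaussians (illustrative).
x = np.concatenate([rng.normal(-2, 0.7, 300), rng.normal(3, 1.2, 700)])
N, K = len(x), 2

# Initialization of pi, mu, sigma^2.
pi = np.ones(K) / K
mu = rng.choice(x, size=K)
var = np.ones(K)

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

for _ in range(100):
    # E-step: responsibilities gamma[n, j] = p(z_j = 1 | x_n).
    dens = pi * normal_pdf(x[:, None], mu, var)        # shape (N, K)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step: update parameters with responsibility-weighted statistics.
    Nj = gamma.sum(axis=0)
    pi = Nj / N
    mu = (gamma * x[:, None]).sum(axis=0) / Nj
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nj

print(np.round(pi, 2), np.round(mu, 2), np.round(np.sqrt(var), 2))
```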
Chapter 10: Variational Inference (from here on I gave up, and the following answers were made with ChatGPT-4o the night before the exam)
Variational Inference (VI) aims to approximate complex posterior distributions $ p(z|x) $ with a simpler, parameterized distribution $ q(z|\lambda) $ by minimizing the Kullback-Leibler (KL) divergence between $ q(z|\lambda) $ and $ p(z|x) $:
$\begin{equation} \text{KL}(q(z|\lambda) \,\|\, p(z|x)) = \mathbb{E}_{q(z|\lambda)} \left[ \log q(z|\lambda) - \log p(z|x) \right] \end{equation}$
This can be reformulated as maximizing the Evidence Lower Bound (ELBO):
$ \mathcal{L}(\lambda) = \mathbb{E}_{q(z|\lambda)} \left[ \log p(x,z) - \log q(z|\lambda) \right] $
Mean Field Variational Inference (MFVI) is a type of VI where the variational distribution $ q(z|\lambda) $ is assumed to factorize across the latent variables:
$ q(z|\lambda) = \prod_{i=1}^{n} q_i(z_i|\lambda_i) $
This factorization simplifies the optimization problem and makes the computations more tractable.
For a joint Gaussian distribution $ p(z) = \mathcal{N}(z|\mu, \Lambda^{-1}) $, with $ \mu = [\mu_1, \mu_2]^T $ and precision matrix $ \Lambda $, the MFVI approximates the joint distribution as a product of univariate Gaussians:
$ q(z) = q(z_1) q(z_2) $
The update equations for the variational parameters are derived from the ELBO optimization:
$ q_1(z_1) = \mathcal{N}(z_1|\nu_1, \Lambda_{11}^{-1}) $ $ \nu_1 = \mu_1 - \Lambda_{11}^{-1} \Lambda_{12} (\mathbb{E}_{q_2}[z_2] - \mu_2) $
Similarly,
$ q_2(z_2) = \mathcal{N}(z_2|\nu_2, \Lambda_{22}^{-1}) $ $ \nu_2 = \mu_2 - \Lambda_{22}^{-1} \Lambda_{21} (\mathbb{E}_{q_1}[z_1] - \mu_1) $
The behavior of the variational approximation depends on which form of the Kullback-Leibler (KL) divergence is minimized:
- Direct KL (minimizing $\text{KL}(q\,\|\,p)$): this is the standard approach in Variational Inference. It results in a mode-seeking behavior (also called zero-forcing). The variational distribution $q$ will concentrate on the main modes of the true posterior $p$ and will be forced to be zero where $p$ is zero. This can cause $q$ to underestimate the spread of $p$ and potentially miss smaller, secondary modes.
- Inverse KL (minimizing $\text{KL}(p\,\|\,q)$): this approach results in a mode-covering behavior (also called zero-avoiding). The variational distribution $q$ will be encouraged to spread out to cover all regions where $p$ has mass, ensuring that $q$ is never zero if $p$ could be non-zero. However, this objective is typically intractable for VI as it requires computing expectations with respect to the true posterior $p(z|x)$.
In Variational Linear Regression, the goal is to approximate the posterior distribution of weights and noise precision in a Bayesian linear regression model. Given the model:
$ y = Xw + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I) $ $ p(w|\alpha) = \mathcal{N}(w|0, \alpha^{-1} I) $ $ p(\alpha) = \text{Gamma}(\alpha|a_0, b_0) $
The variational distribution is factorized as:
$ q(w, \alpha) = q(w) q(\alpha) $
The updates for $ q(w) $ and $ q(\alpha) $ involve the expectations of the sufficient statistics, iteratively optimized until convergence.
Black Box Variational Inference (BBVI) is a flexible VI approach that uses stochastic gradient descent to optimize the ELBO. It allows for the use of any model and any variational distribution without the need for model-specific derivations. BBVI leverages Monte Carlo estimates of gradients, making it applicable to a wide range of probabilistic models.
44. Discuss How to Compute the Gradient of the ELBO for a Non-reparameterizable Variational Distribution
For non-reparameterizable variational distributions, the gradient of the ELBO can be computed using the score function estimator (also known as the REINFORCE estimator). This approach estimates the gradient as follows:
$ \nabla_\lambda \mathcal{L}(\lambda) = \mathbb{E}_{q(z|\lambda)} \left[ \left(\nabla_\lambda \log q(z|\lambda)\right) \left(\log p(x,z) - \log q(z|\lambda)\right) \right] $
While this estimator has high variance, techniques such as Rao-Blackwellization and control variates can reduce it.
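A minimal sketch of the score-function (REINFORCE) estimator above for a toy problem where $q(z|\lambda)=\mathcal N(z|\lambda,1)$ and the unnormalized target is a 1-D Gaussian, optimized by stochastic gradient ascent on the ELBO (model, variational family, sample size, and step size are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unnormalized target: log p~(z) of a N(3, 1); variational family q(z|lam) = N(lam, 1).
log_p_tilde = lambda z: -0.5 * (z - 3.0) ** 2
log_q = lambda z, lam: -0.5 * (z - lam) ** 2 - 0.5 * np.log(2 * np.pi)
score = lambda z, lam: z - lam                      # d/dlam log q(z|lam)

lam, lr, S = 0.0, 0.05, 64
for step in range(2000):
    z = rng.normal(lam, 1.0, size=S)                # samples from q(z|lam)
    # Score-function (REINFORCE) estimate of the ELBO gradient.
    grad = np.mean(score(z, lam) * (log_p_tilde(z) - log_q(z, lam)))
    lam += lr * grad                                # stochastic gradient ascent on the ELBO
print(lam)                                          # should approach 3 (the target mean)
```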
In the context of Black Box Variational Inference, Rao-Blackwellization is a variance reduction technique that leverages the structure of the model to compute parts of an expectation analytically, even when the entire expectation is intractable.
The key insight is to identify a subset of variables—specifically, those in the Markov blanket of a target variable $z_i$—so that the gradient estimator for $\lambda_i$ only involves the terms of the joint distribution that actually depend on $z_i$.
The standard, high-variance gradient estimator is: $\begin{equation} \nabla_{\lambda_i} \mathcal{L}(\lambda) \approx \frac{1}{S} \sum_{s=1}^{S} \nabla_{\lambda_i} \log q(z_i^{(s)}|\lambda_i) \left[ \log p(x, z^{(s)}) - \log q(z^{(s)}|\lambda) \right] \end{equation}$
The Rao-Blackwellized version restricts the inner term to depend only on the Markov blanket of $z_i$: $\begin{equation} \nabla_{\lambda_i} \mathcal{L}(\lambda) \approx \frac{1}{S} \sum_{s=1}^{S} \nabla_{\lambda_i} \log q_i(z_i^{(s)}|\lambda_i) \left[ \log p_i(x, z_{(i)}^{(s)}) - \log q_{(i)}(z_{(i)}^{(s)}|\lambda) \right] \end{equation}$ where:

- $q_{(i)}(z|\lambda)$ is the marginal variational distribution over the Markov blanket of $z_i$,
- $p_i(x, z_{(i)})$ is the part of the joint distribution $p(x, z)$ that depends on $z_{(i)}$.
By integrating over the Markov blanket variables (either analytically or with lower-variance sampling), this estimator focuses on the relevant local information for each $z_i$, reducing the variance of the gradient estimate without introducing bias.
Control variates are another technique for reducing variance in Monte Carlo estimates. The idea is to use an auxiliary variable with known expected value to construct an estimator with lower variance. For a function $ g(z) $ with known expectation $ \mathbb{E}[g(z)] = 0 $, the control variate estimator is:
$ \hat{f}_{CV} = \hat{f} - c(\hat{g} - \mathbb{E}[g(z)]) $
where $ c $ is a coefficient that minimizes the variance of the adjusted estimator.
Bayesian Neural Networks (BNNs) extend neural networks by placing a prior distribution over the network's weights. This allows for the incorporation of uncertainty in predictions and model parameters. Inference in BNNs is typically performed using VI, where the posterior distribution of the weights is approximated by a simpler variational distribution. This approach allows for robust predictions and quantification of uncertainty.
Generative modelling involves learning the joint distribution $ p(x, z) $ of observed data $ x $ and latent variables $ z $ to generate new data samples. The goal is to model the underlying data distribution and generate realistic samples. Common generative models include Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and normalizing flows.
Autoencoding Variational Bayes (AEVB) is a framework for training Variational Autoencoders (VAEs). A VAE consists of an encoder that maps input data to a latent space and a decoder that reconstructs the input from the latent space. The ELBO for AEVB is:
$ \mathcal{L}(\phi, \theta) = \mathbb{E}_{q_\phi(z|x)} \left[ \log p_\theta(x|z) \right] - \text{KL}(q_\phi(z|x) \,\|\, p(z)) $
where $ \phi $ and $ \theta $ are the parameters of the encoder and decoder, respectively.
Denoising Diffusion Models are a class of generative models where data is corrupted with noise through a forward diffusion process and a neural network is trained to reverse this process. The forward process adds noise step by step:
$ q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I) $
The reverse process is learned to generate data by denoising from the prior distribution to the data distribution. The neural network learns to approximate the reverse conditional probabilities, allowing for efficient sampling and generation of new data.
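A small numerical sketch of this forward (noising) process on toy 1-D data, together with the equivalent closed-form jump $q(x_t|x_0)=\mathcal N(\sqrt{\bar\alpha_t}\,x_0,(1-\bar\alpha_t)I)$ that follows from iterating the Gaussian transitions (the data and the linear noise schedule are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Forward diffusion: q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I).
T = 1000
betas = np.linspace(1e-4, 0.02, T)                  # illustrative linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)                # used for the closed-form q(x_t | x_0)

x0 = rng.normal(3.0, 0.5, size=5000)                # toy "data" distribution

# Step-by-step corruption of the data.
x = x0.copy()
for t in range(T):
    x = np.sqrt(1 - betas[t]) * x + np.sqrt(betas[t]) * rng.normal(size=x.shape)

# Equivalent one-shot sampling: q(x_T | x_0) = N(sqrt(alpha_bar_T) x_0, (1 - alpha_bar_T) I).
x_direct = np.sqrt(alphas_bar[-1]) * x0 + np.sqrt(1 - alphas_bar[-1]) * rng.normal(size=x0.shape)

print(x.mean(), x.std())            # ≈ 0, 1: the data has been driven towards the N(0, 1) prior
print(x_direct.mean(), x_direct.std())
```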