Oral Exam of PML

Here is the complete list of oral exam questions along with their answers:

Chapter 2: Empirical Risk Minimization and PAC Learning

1. Introduce the Empirical Risk Minimization (ERM) principle

The Empirical Risk Minimization (ERM) principle consists in finding the hypothesis from a given hypothesis class that minimizes the empirical risk (or training error), i.e. the average loss over a training set.
Formally, for a hypothesis class $ \mathcal{H}=\{h: X\rightarrow Y\}$ and a loss function $\ell(x,y,h)$, the empirical risk of a hypothesis $h$ on a training set $\{(x_i, y_i)\}_{i=1}^N$ is:
$$\begin{equation*} R_{emp}(h) = \frac{1}{N} \sum_{i=1}^N \ell(x_i, y_i,h) \end{equation*}$$

ERM aims to find $ h^* $ such that: $ h^* = \arg\min_{h \in \mathcal{H}} R_{emp}(h). $ This minimum can be combinatorially hard to achieve, so in practice the framework is often relaxed and we settle for a good (approximately optimal) solution.
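A minimal sketch of brute-force ERM over a finite class of 1-D threshold classifiers with 0-1 loss (the data and the hypothesis class are illustrative, not from the course):

```python
import numpy as np

def empirical_risk(h, X, y):
    """Average 0-1 loss of hypothesis h on the training set."""
    return np.mean(h(X) != y)

def erm(hypothesis_class, X, y):
    """Return the hypothesis with minimal empirical risk (brute-force ERM)."""
    risks = [empirical_risk(h, X, y) for h in hypothesis_class]
    return hypothesis_class[int(np.argmin(risks))]

# Toy example: 1-D threshold classifiers h_t(x) = 1[x >= t]
X = np.array([0.1, 0.4, 0.35, 0.8, 0.9])
y = np.array([0, 0, 0, 1, 1])
thresholds = np.linspace(0, 1, 21)
H = [lambda X, t=t: (X >= t).astype(int) for t in thresholds]
h_star = erm(H, X, y)
print("training error of ERM solution:", empirical_risk(h_star, X, y))
```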

2. Define PAC Learning

Probably Approximately Correct (PAC) learning is a framework in computational learning theory that provides a formal definition of what it means for a learning algorithm to generalize well.
A hypothesis class $ \mathcal{H} $ is agnostic PAC-learnable iff there exists a learning algorithm $A$ such that $ \forall\epsilon > 0 $ and $ \forall\delta > 0 $, for any data-generating distribution $ p(x,y)$, there is a number of data points $m^*(\epsilon,\delta)\in\mathbb{N}$ such that for every dataset $D\sim p^m$ with $|D|=m\ge m^*(\epsilon,\delta)$, with probability at least $ 1 - \delta $ the algorithm produces a hypothesis $ h $ whose risk is within $\epsilon$ of the best in the class:
$$\begin{equation*} p_D\left(R(h_D^A)\le \min_{h\in\mathcal{H}}R(h)+\epsilon \right)\ge 1-\delta \end{equation*}$$ where $m^*(\epsilon,\delta)$ depends polynomially on $\frac{1}{\epsilon}$ and $\frac{1}{\delta}$, and $A$ should run in polynomial time.

3. Define VC Dimension (Vapnik-Chervonenkis)

Given a class of hypothesis functions $\mathcal{H} = \{ h : X \rightarrow \{0,1\} \}$ and a subset $C=\{c_1,\dots,c_m\}\subseteq X$ of input points, we define $\mathcal{H}_C=\{(h(c_1),\dots,h(c_m))\mid h\in\mathcal{H}\}$ and we say that $\mathcal{H}$ shatters $C$ iff $|\mathcal{H}_C|=2^m$.
The VC dimension of $\mathcal{H}$ is defined as: $$\begin{equation*} VCdim(\mathcal{H})=\max\{m \mid \exists C\subseteq X,\ |C|=m \ \text{s.t.}\ \mathcal{H} \text{ shatters } C\} \end{equation*}$$ So if $\mathcal{H}$ shatters some $C$ with $|C|\ge2m$, then we can't learn $\mathcal{H}$ with $m$ samples, but $\mathcal{H}$ is (agnostic) PAC-learnable iff $VCdim(\mathcal{H})<\infty$.

Chapter 3: Probabilistic Graphical Models (PGM)

4. Introduce the basic ideas of Probabilistic Inference

Probabilistic inference involves computing the posterior distribution of unknown variables given observed data. This process uses the principles of probability theory to make predictions or decisions under uncertainty. The basic ideas include, given two sets of random variables $x=(x_1,\dots,x_n)$ and $z=(z_1,\dots, z_m)$:

  • Marginalization: Summing over or integrating out unwanted variables to obtain a marginal distribution, like $p(x)=\int p(x,z)dz$ (or $\sum_zp(x,z)$ if $z$ is discrete)
  • Conditioning: computing the distribution of some variables given observed values of others, like $p(z|x=\bar{x})=\frac{p(\bar x,z)}{p(\bar x)}$

Combined together we have that: $$\begin{equation*} p(z_j|x_i=\bar x_i)=\frac{p(\bar x_i,z_j)}{p(\bar x_i)}=\frac{\int p(x,z)dx_{-i}dz_{-j}}{\int p(x,z)dx_{-i}dz} \end{equation*}$$

5. Introduce Bayesian Networks and their factorization

A Bayesian Network (BN) is a kind of Probabilistic Graphical Model (PGM) that represents the joint probability distribution of a set of random variables using a Directed Acyclic Graph (DAG).
Each node represents a random variable, and edges represent conditional dependencies.
The joint distribution factorizes according to the graph structure: $$\begin{equation*} P(x_1, \ldots, x_n) = \prod_{i=1}^n P(x_i | \text{Pa}(x_i)) \end{equation*}$$ where $ \text{Pa}(x_i) $ denotes the parents of $ x_i $ in the graph.

6. Conditional Independence in Bayesian Networks

Given Random Variables (RV) $a,b,c$ we say that $a$ is conditionally independent of $b$ given $c$ (formally $a\perp\!\!\!\perp b \mid c$) if and only if $p(a\mid b,c)= p(a|c)$ or, equivalently, $p(a,b \mid c)=p(a|c)p(b|c)$.

There are three particular scenarios to consider in Bayesian Networks (where the bar means that the RV is observed):

  • Tail to tail, so $a \leftarrow \bar c \rightarrow b$, which means $p(a,b,c)=p(a|c)p(b|c)p(c)$; then we have $$\begin{align*} p(a,b)&=\sum_cp(a|c)p(b|c)p(c)\ne p(a)p(b) \\ p(a,b \mid c)&=\frac{p(a,b,c)}{p(c)}=\frac{p(a|c)p(b|c)p(c)}{p(c)}=p(a|c)p(b|c) \end{align*}$$ so it means that $a \not\perp\!\!\!\perp b$ and $a\perp\!\!\!\perp b \mid c$.
  • Head to tail, so $a\rightarrow \bar c\rightarrow b$, which means $p(a,b,c)=p(b|c)p(c|a)p(a)$; then we have $$\begin{align*} p(a,b)&=\sum_cp(b|c)p(c|a)p(a)=p(b|a)p(a) \\ p(a,b \mid c)&=\frac{p(a,b,c)}{p(c)}=\frac{p(b|c)p(c|a)p(a)}{p(c)}=p(b|c)p(a|c) \end{align*}$$ so it means that $a \not\perp\!\!\!\perp b$ and $a\perp\!\!\!\perp b \mid c$.
  • Head to head, so $a\rightarrow \bar c \leftarrow b$, which means $p(a,b,c)=p(a)p(b)p(c\mid a,b)$; then we have $$\begin{align*} p(a,b)&=\sum_cp(a)p(b)p(c\mid a,b)= p(a)p(b) \\ p(a,b \mid c)&=\frac{p(a,b,c)}{p(c)}=\frac{p(a)p(b)p(c\mid a,b)}{p(c)}\ne p(a|c)p(b|c) \end{align*}$$ so it means that $a \perp\!\!\!\perp b$ and $a\not\perp\!\!\!\perp b \mid c$.

All of this can be extended to subsets $A,B,C$; in general a path from $a$ to $b$ is blocked by $c$ if:
  • $c$ is observed and the path is head-to-tail or tail-to-tail in c;
  • neither $c$ nor any of its descendants is observed and the path is head-to-head in $c$.

7. Introduce Markov Random Fields

A Markov Random Field (MRF) is an undirected graphical model that represents the joint distribution of a set of random variables using an undirected graph. Each node represents a variable and edges represent direct probabilistic interactions. The joint distribution factorizes over the cliques (fully connected subgraphs) of the graph: $$ \begin{equation*} P(x_1, \ldots, x_n) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C)\end{equation*} $$
where $ \mathcal{C} $ denotes the set of cliques, $ \psi_C $ are potential functions (they must satisfy $\psi_C\ge0$ and their product must have finite integral) and $ Z=\int \prod_{C \in \mathcal{C}} \psi_C(x_C) dx_C<\infty$ is the partition function. A typical choice of potential function is the Boltzmann form $\psi_C(x_C)=e^{-E(x_C)}$, where $E(x_C)$ is the energy (low energy $\rightarrow$ high potential)

Chapter 4: Exact Inference in PGM

8. Define Factor Graphs and how to convert Bayesian Networks and Markov Random Fields to Factor Graphs

Factor graphs are bipartite graphs representing the factorization of a function. They consist of variable nodes $\boxed{x}$ and factor nodes $\boxed{f}$ such that the distribution can be written as $p(x)\propto \prod_{c\in \mathcal{C}}f_c(x_c)$.
It's possible to convert to Factor Graphs the following models:

  • Bayesian Networks: Each conditional probability is represented as a factor node connected to the variable and its parents, formally $\forall p(x_j|Pa(x_j))\rightarrow f_i(x_j \cup Pa(x_j))$
  • Markov Random Fields (MRF): Each clique potential is represented as a factor node connected to the variables in the clique, formally $\forall c\in\mathcal{C} \rightarrow f_c(x_c)\sim\psi_c(x_c)$

We do this because now we can perform message passing on it, without loss of generality.

9. Describe the Sum Product algorithm

The Sum-Product (or Belief Propagation) algorithm is an efficient message-passing algorithm used on factor graphs that are trees (so, no loops) to compute marginal and conditional distributions of variables.
Given the set of neighbouring nodes of a variable $x$, $\mathcal{N}(x)=\{f_1,\ldots, f_k \mid f_i \text{ is connected to }x\}$, and the set of neighbouring nodes of a factor $f$, $\mathcal{N}(f)=\{x_1,\ldots, x_k \mid x_i \text{ is connected to }f\}$, the messages are defined as: $$\begin{align*} \mu_{f\rightarrow x}(x)&=\sum_{x_i\in\mathcal{N}(f)\setminus x}f(x,x_1,\ldots,x_k)\prod_{x_i\in\mathcal{N}(f)\setminus x}\mu_{x_i\rightarrow f}(x_i)\\ \mu_{x\rightarrow f}(x)&=\prod_{f_i\in\mathcal{N}(x)\setminus f} \mu_{f_i\rightarrow x}(x) \end{align*}$$ where in the base case for an $x$ leaf $\mu_{x\rightarrow f}(x)=1$ and for an $f$ leaf $\mu_{f\rightarrow x}(x)=f(x)$.
The Sum-product algorithm works in two steps:

  • forward pass: messages are sent from the leaves to the root;
  • backward pass: propagation from the root back to the leaves.

After this, it's possible to obtain marginals and conditionals!

  • Computation of the marginal of a variable node $x$:
    $p(x)=\prod_{f\in\mathcal{N}(x)}\mu_{f\rightarrow x}(x)$
  • Computation of the marginal of a factor node $f(\bar x)$ (where $\bar x$ is the set of neighbouring variable nodes, $\bar x=\mathcal{N}(f)$):
    $p(\bar x)=f(\bar x)\prod_{x\in\mathcal{N}(f)}\mu_{x\rightarrow f}(x)$
  • Computation of conditionals $p(x|y=\hat y)$ on $y\subseteq\{x_1,\ldots,x_n\}$:
    • Clamping $y$ to $\hat y$, so we fix $y=\hat y$ in the joint $p(x,x_1,\ldots,x_k,y=\hat y)$
    • Running the sum-product algorithm with $y$ fixed to $\hat y$, so $p(x,y=\hat y)=\sum_{x_1,\ldots,x_k}p(x,x_1,\ldots x_k,y=\hat y)$
    • Computing the conditional as: $p(x|y=\hat y)=\frac{p(x,y=\hat y)}{p(\hat y)}$, where $p(\hat y)=\sum_x p(x,y=\hat y)$

In the case of an unnormalized joint distribution, we have to normalize the resulting marginals.
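A minimal numpy sketch of sum-product on a three-variable chain $x_1 - f_{12} - x_2 - f_{23} - x_3$, checked against brute-force marginalization (the factor tables are illustrative):

```python
import numpy as np

# Chain factor graph x1 -- f12 -- x2 -- f23 -- x3 with binary variables.
f12 = np.array([[1.0, 2.0], [3.0, 1.0]])   # f12[x1, x2]
f23 = np.array([[2.0, 1.0], [1.0, 4.0]])   # f23[x2, x3]

# Factor-to-variable messages towards x2 (variable-to-factor messages on a
# chain are just the incoming message, and 1 at the leaves).
mu_f12_to_x2 = f12.sum(axis=0)             # sum over x1
mu_f23_to_x2 = f23.sum(axis=1)             # sum over x3

# Marginal of x2: product of incoming messages, then normalize.
p_x2 = mu_f12_to_x2 * mu_f23_to_x2
p_x2 = p_x2 / p_x2.sum()

# Brute-force check over the joint p(x1,x2,x3) ∝ f12[x1,x2] * f23[x2,x3].
joint = f12[:, :, None] * f23[None, :, :]
print(np.allclose(p_x2, joint.sum(axis=(0, 2)) / joint.sum()))  # True
```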

10. Describe the Max Plus algorithm

The Max-Plus (or Max-Product) algorithm is an exact inference algorithm used on tree-structured factor graphs to find the configuration of variables that maximizes the joint probability, known as the Maximum a Posteriori (MAP) estimate: $x^{MAP} = \arg\max_{\mathbf{x}} p(\mathbf{x})$.
The algorithm operates by passing messages from nodes to factors and vice versa. It is analogous to the Sum-Product algorithm but replaces summation with maximization and products with sums (it typically works in log-space for numerical stability).

The message-passing rules are defined as follows:

  • Message from variable $x$ to factor $f$: $\begin{equation} \mu_{x \to f}(x) = \sum_{f_i \in \mathcal{N}(x) \backslash f} \mu_{f_i \to x}(x) \end{equation}$

  • Message from factor $f$ to variable $x$: $\begin{equation} \mu_{f \to x}(x) = \max_{x_1, \ldots, x_k} \left[ \log f(x, x_1, \ldots, x_k) + \sum_{x_i \in \mathcal{N}(f) \backslash x} \mu_{x_i \to f}(x_i) \right] \end{equation}$ where $\mathcal{N}(\cdot)$ denotes the set of neighbouring nodes.

The base cases are:

  • For a leaf variable node: $\mu_{x \to f}(x) = 0$
  • For a leaf factor node: $\mu_{f \to x}(x) = \log f(x)$

To also find the exact MAP assignment (not just its probability), a backtracking step is required. During the forward pass, each factor-to-variable message also records the maximizing values of the other variables: $\begin{equation} \Phi_{f \to x}(x) = \arg\max_{x_1, \ldots, x_k} \left[ \log f(x, x_1, \ldots, x_k) + \sum_{x_i \in \mathcal{N}(f) \backslash x} \mu_{x_i \to f}(x_i) \right] \end{equation}$

After a full forward pass to the root, the root's value is fixed to its MAP state. The algorithm then performs a backward pass, using the stored $\Phi$ functions to retrieve the optimal state of every other variable in the graph.

Chapter 5: Hidden Markov Models (HMM)

11. Define HMM and different Inference problems in HMM

A Hidden Markov Model (HMM) is a specific state-space model for sequential data where the latent variables $z = (z_1, \ldots, z_N)$ form a Markov chain, and each observation $x_n$ is dependent only on the state of its corresponding latent variable $z_n$.

The model is characterized by three parameters:

  • Transition Probability: $A_{ij} = p(z_n = j | z_{n-1} = i)$, governing the evolution of the discrete hidden states.
  • Initial Distribution: $\boldsymbol{\pi}_i = p(z_1 = i)$.
  • Emission Probability: $\boldsymbol{\psi}=p(x_n | z_n)$, the distribution of observations given the hidden state (e.g., a Gaussian if $x_n$ is continuous).

The joint distribution of the hidden states and observations factorizes as: $$\begin{equation*} p(\mathbf{x}, \mathbf{z} | \boldsymbol{\theta}) = p(z_1 | \boldsymbol{\pi}) \prod_{n=2}^{N} p(z_n | z_{n-1}, \mathbf{A}) \prod_{n=1}^{N} p(x_n | z_n, \boldsymbol{\psi}) \end{equation*}$$ where $\boldsymbol{\theta} = \{\boldsymbol{\pi}, \mathbf{A}, \boldsymbol{\psi}\}$.

The primary inference problems in HMMs are:

  1. Filtering: The task of estimating the current hidden state given all observations up to the current time, i.e., computing the belief state $p(z_n | x_1,\ldots,x_n)$.
  2. Smoothing: The task of obtaining a better estimate of a past hidden state by incorporating "future" evidence, i.e., computing $p(z_k | \bar x)$ for $1 \leq k < N$.
  3. Most Likely Explanation (Viterbi Decoding): The task of finding the most probable sequence of hidden states that could have generated the observed sequence, i.e., finding $z^* = \arg\max_{\bar z} p(\bar z | \bar x)$. This is solved efficiently using the Viterbi algorithm, a special case of the Max-Plus algorithm on the HMM's trellis graph (see the sketch after this list).
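A compact numpy sketch of Viterbi decoding in log-space (array shapes and the toy usage below are illustrative):

```python
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """Most likely hidden state sequence.
    log_pi: (K,) log initial probs; log_A: (K,K) log transitions;
    log_B: (K,V) log emission probs; obs: sequence of observation indices."""
    N, K = len(obs), len(log_pi)
    delta = np.zeros((N, K))           # best log-score ending in state k at step n
    psi = np.zeros((N, K), dtype=int)  # backpointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for n in range(1, N):
        scores = delta[n - 1][:, None] + log_A          # scores[i, j]
        psi[n] = scores.argmax(axis=0)
        delta[n] = scores.max(axis=0) + log_B[:, obs[n]]
    # Backtracking from the best final state
    z = np.zeros(N, dtype=int)
    z[-1] = delta[-1].argmax()
    for n in range(N - 2, -1, -1):
        z[n] = psi[n + 1][z[n + 1]]
    return z

# Tiny usage: 2 hidden states, 2 observation symbols
log_pi = np.log([0.6, 0.4]); log_A = np.log([[0.7, 0.3], [0.4, 0.6]])
log_B = np.log([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(log_pi, log_A, log_B, obs=[0, 0, 1, 1]))
```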

Chapter 6: Bayesian Linear Regression

12. Discuss closure properties of the Gaussian Distribution

Given a $d$-dimensional Multivariate Gaussian, denoted by $\mathcal{N}(x|\mu,\Sigma)$, where $\mu$ is the vector of means and $\Sigma$ is the covariance matrix, written as: $$\begin{equation*} \mathcal{N}(x|\mu,\Sigma)=\frac{1}{\sqrt{(2\pi)^d\det(\Sigma)}}e^{-\frac{1}{2}\left[(x-\mu)^T\Sigma^{-1}(x-\mu)\right]} \end{equation*}$$ where $A=\Sigma^{-1}$ is the precision matrix, which is sometimes used instead of the covariance matrix in the definition of the Gaussian.
The distribution has several important closure properties:

  • Exponential of a quadratic form: every distribution that is the exponential of a quadratic form is a Gaussian distribution.
    So, given $\begin{equation}p(x)=c\cdot e^{-\frac{1}{2}x^TAx+b^Tx} \end{equation}$ with $x$ and $b$ $d$-dimensional vectors and $A\in\mathbb{R}^{d\times d}$ a symmetric and invertible matrix, completing the square gives: $\begin{equation} -\frac{1}{2}x^TAx+b^Tx=-\frac{1}{2}(x-A^{-1}b)^TA(x-A^{-1}b)+\frac{1}{2}b^TA^{-1}b \end{equation}$ Using the properties of the exponential, we get $\begin{equation} c\,e^{-\frac{1}{2}x^{T}Ax+b^{T}x} = c\,\mathcal{N}(x|A^{-1}b, A^{-1})\sqrt{(2\pi)^{d}\det(A^{-1})}\,e^{\frac{1}{2}b^{T}A^{-1}b} \end{equation}$ and since $p(x)$ is normalized we obtain: $\begin{equation}p(x)=\mathcal{N}(x|A^{-1}b,A^{-1})\end{equation}$

  • Linear Transformation: If $ x \sim \mathcal{N}(\mu_x, \Sigma_x) $ and $b\sim\mathcal{N}(\mu_b,\Sigma_b)$, with $x$ and $b$ independent, then $ y = Ax + b $ is also Gaussian, with $y\sim\mathcal{N}(A\mu_x+\mu_b,\,A\Sigma_xA^T+\Sigma_b)$.

  • Product of Gaussians: The product of two Gaussian densities is proportional to a Gaussian (whereas, in general, the product of two densities is not a density):
    $$\begin{align*} \mathcal{N}(x|\mu_1,\Sigma_1)\,\mathcal{N}(x|\mu_2,\Sigma_2)=K\cdot\mathcal{N}(x|\mu,\Sigma) \end{align*}$$ for some constant $K$.

  • Marginal Distribution: Any subset of jointly Gaussian variables is also Gaussian, so assuming that $z=(x,y)$ and $z\sim\mathcal{N}(\mu,\Sigma)$, where $\mu=(\mu_x,\mu_y)$ and $\Sigma=\begin{pmatrix}\Sigma_{xx} & \Sigma_{xy}\\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}$, then we have $x\sim \mathcal{N}(\mu_x,\Sigma_{xx})$

  • Conditional Distribution: The conditional distribution of a subset of Gaussian variables given others is also Gaussian like
    $ p(x|y)= \mathcal{N}(x|\mu_x+\Sigma_{xy}\Sigma_{yy}^{-1}(y-\mu_y),\Sigma_{xx}-\Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx})$

13. How to transform a generic Gaussian Distribution to a Standard Gaussian via Principal Components

To transform a generic Gaussian distribution $ X\sim N(\mu, \Sigma) $ to a standard Gaussian $ N(0, I) $:

  1. Centering: Subtract the mean: $ X' = X - \mu $.
  2. Whitening: Apply the whitening transformation using the eigenvalue decomposition of $ \Sigma $: $ X'' = \Lambda^{-1/2} E^T X' $, where $ \Sigma = E \Lambda E^T $ and $E$ is an orthogonal matrix whose columns are the eigenvectors of $\Sigma$.
    The resulting $ X'' $ will follow $ N(0, I).$
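A small numpy sketch of this two-step transformation on synthetic data (the mean and covariance values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=10_000)   # samples from N(mu, Sigma)

# Eigendecomposition Sigma = E Lambda E^T (columns of E are the eigenvectors).
lam, E = np.linalg.eigh(Sigma)

# Centering + whitening: X'' = Lambda^{-1/2} E^T (X - mu)
X_white = (X - mu) @ E @ np.diag(1.0 / np.sqrt(lam))

print(np.round(X_white.mean(axis=0), 2))   # ~ [0, 0]
print(np.round(np.cov(X_white.T), 2))      # ~ identity
```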

14. Introduce the Bayesian Estimation Principles

Bayesian estimation updates the probability of a hypothesis as more evidence becomes available, combining prior knowledge and observed data. The fundamental equation is Bayes' theorem: $ p(\theta | \bar x) = \frac{p(\bar x | \theta) p(\theta)}{p(\bar x)}, $ where:

  • $ \theta $ is the parameter or hypothesis.
  • $\bar x $ is the observed data.
  • $ p(\theta | \bar x) $ is the posterior distribution (updated belief after seeing the data).
  • $ p(\bar x | \theta) $ is the likelihood (probability of the data given the parameter).
  • $ p(\theta) $ is the prior distribution (initial belief about the parameter).
  • $ p(\bar x) $ is the marginal likelihood or evidence, given by $ p(\bar x) = \int p(\bar x | \theta) p(\theta) \, d\theta.$

The problem is that the marginal likelihood is difficult to compute, since the integral is hard to approximate in a high-dimensional setting. It is then convenient to compute the predictive distribution of $x$ given the observed data $\bar x$: $$\begin{align*} p(x|\bar x)&=\int p(x,\theta|\bar x)d\theta\\ &=\int p(x|\theta,\bar x)p(\theta|\bar x)d\theta\\ &=\int p(x|\theta)p(\theta|\bar x)d\theta \end{align*}$$ In general, the prior is chosen conjugate to the likelihood, like Gaussian with Gaussian or Beta with Bernoulli.

15. Introduce Bayesian Linear Regression

Bayesian Linear Regression is a probabilistic framework that reformulates standard linear regression by treating the model parameters not as fixed unknown values but as random variables with probability distributions. This shift from point estimates to distributions allows for a principled quantification of uncertainty. The entire framework is built upon three foundational components:

  • Prior Distribution: This represents our initial beliefs about the parameters before observing any data. It is a mechanism to encode domain knowledge or desired constraints (like preventing overfitting). A typical choice is a zero-mean Gaussian prior, $p(w | \alpha) = \mathcal{N}(w | 0, \alpha^{-1}I)$, which expresses a preference for smaller weights, functionally equivalent to L2 regularization but within a probabilistic framework.
  • Posterior Distribution: After observing data $(\bar{x}, \bar{y})$, we update our beliefs using Bayes' Theorem. The posterior distribution, $p(w | \bar{x}, \bar{y})$, combines the prior with the information from the data (the likelihood) to give a complete distribution over plausible parameter values. For the conjugate Gaussian prior and likelihood, this posterior is also Gaussian.
  • Predictive Distribution: The ultimate goal of the Bayesian approach. Instead of making a prediction with a single "best" model, we average the predictions of all possible models (all possible $w$), weighted by their posterior probability. This results in a prediction that inherently includes uncertainty estimates, distinguishing between noise in the data (aleatoric uncertainty) and uncertainty in the model itself (epistemic uncertainty).

In essence, Bayesian Linear Regression provides a complete statistical description of the linear modeling process, from prior assumptions to final predictions with full uncertainty quantification.

16. Prior and Posterior over Parameters in Bayesian Linear Regression

The regularization term is a bias introduced in our model, so we treat it as a prior distribution over the weights, typically Gaussian: $p(w|\alpha)=\mathcal{N}(w|0,\alpha^{-1}I)$, where $\alpha$ is a hyperparameter.
With $\alpha$ fixed and Gaussian noise of precision $\beta$ on the observations, applying Bayes' theorem the posterior is: $$ p(w|\bar x,\bar y,\alpha,\beta)=\frac{p(\bar y|\bar x,w,\alpha,\beta)p(w|\alpha)}{p(\bar y|\bar x,\alpha,\beta)} $$ Given a Gaussian prior and a Gaussian likelihood, the $\log$ of the unnormalized posterior is a quadratic function of $w$, and if the $\log$ of a distribution is quadratic then the distribution is Gaussian: summarizing, the posterior is also Gaussian.
In general, given a Gaussian prior of the form $p(w|m_0,S_0)=\mathcal{N}(w|m_0,S_0)$, the posterior becomes $$ p(w|\bar x,\bar y,\alpha,\beta)=\mathcal{N}(w|m_N,S_N)\\ m_N=S_N[S_0^{-1}m_0+\beta\Phi^T\bar y]\\ S_N^{-1}=S_0^{-1}+\beta\Phi^T\Phi $$

17. Predictive Distribution in Bayesian Linear Regression

The predictive distribution for a new input $ x^* $ is obtained by integrating over all possible models: $$\begin{equation*} p(y|x^*,\bar x,\bar y,\alpha, \beta)=\int \underbrace{p(y|x^*,w,\beta)}_{\mathcal N(y|w^T\phi(x^*),\beta^{-1})}\cdot \underbrace{p(w|\bar x,\bar y,\alpha,\beta)}_{\text{Gaussian posterior}}\,dw \end{equation*}$$ Since in the Bayesian linear regression setting both of these distributions are Gaussian, we know that the product of Gaussian densities is (proportional to) a Gaussian, and the marginal of a Gaussian is a Gaussian. Therefore it can be shown that the probability above is $$ p(y|x^*,\bar x,\bar y,\alpha, \beta)=\mathcal N(y|m_N^T\phi(x^*),\sigma^2_N(x^*))\quad \text{where } \sigma^2_N(x^*)=\frac{1}{\beta}+\phi(x^*)^TS_N\phi(x^*) $$ The prediction is a Gaussian centered on an average prediction, with a variance made of two terms: one takes into account the noise of the observations, while the second takes into account the epistemic uncertainty of our model. As more data is observed ($N \to \infty$), the epistemic uncertainty diminishes ($S_N \to 0$) and we are left with just the aleatoric uncertainty.
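A small numpy sketch of these formulas, assuming the zero-mean isotropic prior $p(w)=\mathcal N(w|0,\alpha^{-1}I)$ from question 16 and illustrative data and basis functions:

```python
import numpy as np

def blr_posterior(Phi, y, alpha, beta):
    """Posterior N(m_N, S_N) for the prior p(w) = N(0, alpha^{-1} I)."""
    S_N_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ y
    return m_N, S_N

def blr_predict(phi_star, m_N, S_N, beta):
    """Predictive mean and variance at a new feature vector phi(x*)."""
    mean = m_N @ phi_star
    var = 1.0 / beta + phi_star @ S_N @ phi_star   # aleatoric + epistemic
    return mean, var

# Toy usage with basis functions phi(x) = (1, x)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 30)
y = 0.5 + 2.0 * x + rng.normal(0, 0.1, 30)
Phi = np.stack([np.ones_like(x), x], axis=1)
m_N, S_N = blr_posterior(Phi, y, alpha=1.0, beta=100.0)
print(blr_predict(np.array([1.0, 0.3]), m_N, S_N, beta=100.0))
```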

18. Discuss Model Evidence and Hyperparameter Optimization

The model evidence, or marginal likelihood, $\begin{equation} p(\bar y|\bar x, \alpha, \beta) = \int p(\bar y|\bar x, w, \beta)p(w|\alpha)dw \end{equation}$ is the primary tool for optimizing the hyperparameters $\alpha$ and $\beta$. This process is known as Type-II Maximum Likelihood or Empirical Bayes.
The strategy is to find the hyperparameters that maximize the model evidence: $$\begin{equation*} \alpha^*, \beta^* = \arg\max_{\alpha,\beta} p(\bar y \mid \bar x, \alpha, \beta) \end{equation*}$$

This approach can be justified by considering a Bayesian treatment of the hyperparameters:

  • We assume an uninformative hyper-prior $p(\alpha, \beta)$ (e.g., a uniform distribution).
  • We approximate the true posterior $p(\alpha, \beta|y, x)$ as a delta function centered at its mode, i.e., $p(\alpha, \beta|y, x) \approx \delta(\alpha - \alpha^*)\delta(\beta - \beta^*)$. This is equivalent to the MAP estimate under the uninformative prior.

In practice, the (log) marginal likelihood can be computed analytically for this conjugate model, and its maximization leads to fixed-point equations for $\alpha$ and $\beta$ (e.g., $\alpha = \frac{\gamma}{m_N^\top m_N}$).

19. Introduce the Idea of Effective Number of Parameters

Taking the eigenvalues $\lambda_i$ of $\beta\Phi^T\Phi$ we obtain information about the curvature of the likelihood around the Maximum Likelihood solution for $w$.
A small $\lambda_i$ means small curvature (a flat direction), so higher uncertainty on $w_i$, while a large $\lambda_i$ means large curvature and, hence, a more precise estimate. The effective number of parameters is $$ \gamma=\sum_{i=1}^{M} \frac{\lambda_i}{\alpha+\lambda_i} $$ where $\lambda_i\ll\alpha\rightarrow\frac{\lambda_i}{\alpha+\lambda_i}\approx 0$, while $\lambda_i\gg\alpha\rightarrow\frac{\lambda_i}{\alpha+\lambda_i}\approx 1$. Thus $\gamma$ counts how many weights $w_i$ are "well-determined" by the data (where the ML estimate is close to the MAP estimate). A parameter is only counted ($\gamma_i \approx 1$) if there is sufficient evidence in the data to overcome the prior's regularization. This makes the model complexity adaptive and inherently guards against overfitting.

20. Discuss Bayesian Model Comparison

Bayesian model comparison relies on the marginal likelihood $p(D|\mathcal{M}_i)$. This quantity naturally penalizes model complexity, implementing Occam's razor. Bayesian model comparison involves computing the posterior probability of each model given the data, placing a prior $p(\mathcal{M}_i)$ and fixing the hyperparameters: $ P(\mathcal{M}_i | D) = \frac{P(D | \mathcal{M}_i) P(\mathcal{M}_i)}{\sum_jP(D|\mathcal{M}_j)p(\mathcal M_j)}, $ where

  • $ P(D | \mathcal{M}_i) $ is the marginal likelihood of the model $\mathcal M_i$.
  • $ P(\mathcal{M}_i) $ is the prior probability of the model.

There are two ways to proceed when comparing two models:

  1. Model averaging : instead of choosing one model we consider both of them, weighted according to the posterior distribution. The predictive distribution then is

    $$ p(y|x,D)=\sum_j p(y|x,D,\mathcal M_j)\cdot p(\mathcal M_j| D) $$

  2. Choose the best model by computing the Bayes Factor (of $\mathcal M_1$ versus $\mathcal M_2$)

    $$ BF(\mathcal M_1,\mathcal M_2)=\frac{p(D|\mathcal M_1)}{p(D|\mathcal M_2)} $$

    We choose the model with the larger Bayes Factor; this is justified by the KL divergence, since on average $\int p(D|\mathcal M_1)\log BF(\mathcal M_1,\mathcal M_2)\,dD\ge0$, i.e. the expected log Bayes factor favours the true model.

Chapter 7: Bayesian Linear Classification

21. Describe Laplace Approximation

The Laplace approximation provides a method for approximating complex probability distributions by using a Gaussian distribution centered at the mode of the target distribution. Specifically, for a distribution expressed as $p(z) = \frac{1}{Z}f(z)$, where $Z$ is the normalization constant, the method proceeds as follows:

First, we find the mode $z_0$ of the distribution by solving: $$ \nabla \log f(z_0) = 0 $$ This point $z_0$ corresponds to the maximum a posteriori (MAP) estimate in Bayesian contexts.

Next, we approximate $\log f(z)$ using a second-order Taylor expansion around $z_0$: $$ \log f(z) \approx \log f(z_0) - \frac{1}{2}(z - z_0)^\top A (z - z_0) $$ where $A = -\nabla^2 \log f(z_0)$ is the Hessian matrix of the negative log-probability evaluated at $z_0$.

This leads to a Gaussian approximation of the original distribution: $$ q(z) = \mathcal{N}(z \mid z_0, A^{-1}) $$ with the corresponding approximation of the normalization constant: $$ Z \approx \frac{(2\pi)^{k/2}}{|A|^{1/2}} f(z_0) $$

In the context of Bayesian model comparison, where we have a parametric model $\mathcal{M}(\theta)$ with prior $p(\theta)$ and want to compute the marginal likelihood $p(D) = \int p(D|\theta)p(\theta)d\theta$, the Laplace approximation can be applied by setting $f(\theta) = p(D|\theta)p(\theta)$. Using the MAP estimate $\theta_{\text{MAP}}$, the marginal likelihood is approximated as: $$ p(D) \approx \frac{(2\pi)^{M/2}}{|A|^{1/2}} p(D|\theta_{\text{MAP}})p(\theta_{\text{MAP}}) $$ where $A = -\nabla^2 \log[p(D|\theta_{\text{MAP}})p(\theta_{\text{MAP}})]$ is the Hessian of the negative log-posterior evaluated at the mode.

This approximation provides a computationally tractable way to estimate Bayesian model evidence, balancing model fit against model complexity through the determinant of the Hessian matrix.
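A minimal 1-D sketch of the procedure (the Gamma-shaped target and the finite-difference Hessian are illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gamma

# Target: unnormalized Gamma(a, 1) density f(z) = z^(a-1) exp(-z), true Z = Gamma(a)
a = 6.0
log_f = lambda z: (a - 1) * np.log(z) - z

# 1) Find the mode z0 by maximizing log f
z0 = minimize_scalar(lambda z: -log_f(z), bounds=(1e-6, 50), method="bounded").x

# 2) Curvature A = -d^2/dz^2 log f at z0, via central finite differences
h = 1e-4
A = -(log_f(z0 + h) - 2 * log_f(z0) + log_f(z0 - h)) / h**2

# 3) Gaussian approximation q(z) = N(z | z0, 1/A) and evidence estimate
Z_laplace = np.exp(log_f(z0)) * np.sqrt(2 * np.pi / A)
print(z0, 1 / A)                 # mode ~ 5, variance ~ 5
print(Z_laplace, gamma(a))       # ~ 118 vs Gamma(6) = 120
```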

22. Discuss Bayesian Information Criterion for Model Comparison

The Bayesian Information Criterion (BIC) is an asymptotic approximation to the log model evidence: $\log p(D|\mathcal{M}) \approx \log p(D|\theta_{ML}) - \frac{M}{2} \log N$. In practice, one computes $BIC = -2 \log p(D|\theta_{ML}) + M \log N$ and prefers models with lower BIC values, where:

  • $-BIC/2$ approximates the log marginal likelihood $\log p(D|\mathcal{M})$.
  • $ p(D|\theta_{ML}) $ is the maximized likelihood of the model.
  • $ M $ is the number of parameters.
  • $ N $ is the number of observations.

A lower BIC indicates a better model.

23. Introduce Bayesian Logistic Regression

Given a dataset $D$ of $(\bar x_i,\bar y_i)_{i=1}^N$, a set of basis functions $\phi(x)$, and the logistic sigmoid activation s.t. $f(x,w)=\sigma(w^T\phi(x))$, to implement Bayesian Logistic Regression we place a Gaussian prior over $w$, $p(w)=\mathcal N(w|m_0,S_0)$, whose hyperparameters can be tuned via marginal likelihood optimization.
The posterior, then, is $$ p(w|\bar x,\bar y)=\frac{p(\bar y|w,\bar x)p(w)}{p(\bar y|\bar x)}\propto p(\bar y|w,\bar x)p(w) $$ In the 2-class problem with a Bernoulli noise model, we can write the likelihood as $p(\bar y|w,\bar x)=\prod_{i=1}^N s_i^{y_i}(1-s_i)^{1-y_i}$ where $s_i=f(x_i,w)$. The log posterior is no longer a quadratic function of $w$, so the posterior is not Gaussian and we apply the Laplace Approximation in the following way:

  1. find $w_{MAP}=\arg\max_w \log p(w|\bar y)=\arg\max_w \left[\log p(w)+\log p(\bar y|w)\right]$ by numerical optimization
  2. compute the Hessian of the negative log posterior at $w_{MAP}$ and invert it, obtaining the covariance matrix $S_N$ of the Gaussian.

In this way, the Laplace Approximation of the posterior is $q(w)\approx p(w|\bar y)$ with $$ q(w)=\mathcal N(w|w_{MAP},S_N) $$ The predictive distribution is, then: $$ p(C_1|x^*,\bar y,\bar x)=\int p(C_1|x^*,w,\bar y,\bar x)p(w|\bar y,\bar x)dw\approx\int f(x^*,w)q(w)dw $$ The predictive integral $\int \sigma(w^T\phi(x^*))q(w)dw$ is intractable. In practice, it is often approximated, for example by using the probit function to approximate the logistic sigmoid.

Chapter 8: Sampling-based Inference

Important:
Why do we do this? In the context of Bayesian Networks we are only able to perform exact inference if our network satisfies certain structural properties; in particular it has to be a tree or a polytree.

24. Rejection Sampling

Rejection sampling generates samples from a target distribution $ p(x) $ by sampling from a proposal distribution $ q(x) $ for which there exists a constant $M>0$ such that $Mq(x)\ge p(x)$ for all $x$, and accepting/rejecting samples based on a criterion.
The steps are:

  1. Sample $ x $ from $ q(x) $.
  2. Generate $ u \sim \text{Uniform}(0, 1) $.
  3. Accept $ x $ if $ u \leq \frac{p(x)}{Mq(x)} $; if rejected, repeat the procedure until acceptance.

We can actually compute the expected number of samples from $q$ needed to accept a single sample of $p$, and this expectation is in fact $M$. Therefore, if $M$ is large, this method becomes inefficient. Rejection sampling is especially inefficient in high dimension, where a very large $M$ is typically needed to "cover" $p$, making the number of trials per accepted sample impractically large.
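A minimal sketch with an illustrative Beta(2,5) target and a uniform proposal (the constant $M$ is chosen to dominate the target density, whose maximum is about 2.46):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target p(x): Beta(2, 5) density; proposal q(x): Uniform(0, 1).
p = lambda x: 30 * x * (1 - x) ** 4          # Beta(2,5) pdf
M = 2.5                                      # satisfies p(x) <= M * q(x) = M

def rejection_sample():
    while True:
        x = rng.uniform(0, 1)                # sample from the proposal q
        u = rng.uniform(0, 1)
        if u <= p(x) / M:                    # accept with prob p(x) / (M q(x))
            return x

samples = np.array([rejection_sample() for _ in range(5000)])
print(samples.mean())                        # ~ 2 / (2 + 5) = 0.286
```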

25. Importance Sampling

Importance sampling estimates properties of a target distribution $ p(x) $ using samples from a proposal distribution $ q(x) $. The key idea is to weight the samples by the ratio of the target and proposal distributions: $$ \mathbb{E}_p[f(x)] = \int f(x)\frac{1}{Z}\tilde p(x)dx=\frac{\int f(x)\tilde p(x)dx}{\int \tilde p(x)dx} $$ where $\tilde p(x)$ is the unnormalized distribution where the normalization is $Z=\int \tilde p(x)dx$.
Given a proposal distribution $q(x)$ we can turn the evaluation of $\mathbb{E}_p [f(x)]$ into: $$ \frac{\int f(x)\tilde p(x)dx}{\int \tilde p(x)dx}=\frac{\int f(x)\tilde p(x)\frac{q(x)}{q(x)}dx}{\int \tilde p(x)\frac{q(x)}{q(x)}dx}=\frac{\mathbb{E}_q\left[f(x) \frac{\tilde p(x)}{q(x)}\right]}{\mathbb{E}_q\left[ \frac{\tilde p(x)}{q(x)}\right]} $$ Then, we sample $x_1,\ldots,x_N$ from $q(x)$ and replace the expectations with their empirical estimates: $$ \frac{\mathbb{E}_q\left[f(x) \frac{\tilde p(x)}{q(x)}\right]}{\mathbb{E}_q\left[ \frac{\tilde p(x)}{q(x)}\right]}\approx\frac{\frac{1}{N}\sum_{i=1}^N f(x_i)w(x_i)}{\frac{1}{N}\sum_{i=1}^N w(x_i)}=\frac{\sum_{i=1}^N f(x_i)w(x_i)}{\sum_{i=1}^N w(x_i)} $$ where we define the importance weights as $w(x_i)=\frac{\tilde p(x_i)}{q(x_i)}$

If $\frac{fp}{q}$ is approximately constant, then the estimates can be very good. If the weights vary a lot instead, we get a large variance and consequently a poor estimate, so the quality depends on the choice of $q(x)$. This makes importance sampling highly useful for estimating the probability of rare events: one can choose a proposal distribution $q(x)$ that oversamples the region of interest (making the event less rare) and then correct for this bias using the importance weights.
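A sketch of the self-normalized estimator on an illustrative rare-event example (the target, the proposal, and the use of scipy are assumptions for the demonstration):

```python
import numpy as np
from scipy.stats import norm

# Unnormalized target ~ N(3, 1) restricted to x > 4 (a rare-event region);
# proposal q = N(4.5, 1) oversamples the region of interest.
p_tilde = lambda x: norm.pdf(x, loc=3, scale=1) * (x > 4)
q = norm(loc=4.5, scale=1)

x = q.rvs(size=20_000, random_state=0)
w = p_tilde(x) / q.pdf(x)                     # importance weights w(x_i)

f = lambda x: x                               # estimate E_p[x | x > 4]
estimate = np.sum(f(x) * w) / np.sum(w)       # self-normalized estimator
print(estimate)                               # ~ 4.5 (mean of the truncated normal)
```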

26. Introduce Markov Chain and the Detailed Balance Condition

A Markov chain is a stochastic process, denoted $\{x_t\}_{t\ge 0}$, where the future state depends only on the current state, through a transition kernel $p(x_n|x_{n-1})$, and not on past states. This is known as the "Markov Property" and is encoded as:
$$ p(x_n|x_{n-1},\ldots,x_0)=p(x_n|x_{n-1}) $$
Another required property is time homogeneity, i.e. $p(x_n|x_{n-1})=p(x_1|x_0) \ \forall n$, which means that the probability of jumping from a certain state to another stays the same for every time step.
The Detailed Balance condition ensures that a Markov Chain is reversible: it means that there exists a stationary distribution $\pi(y)$, defined by $\pi(y)=\int p(y|x)\pi(x)dx$, such that $ p(x|y) \pi(y) = p(y|x)\pi(x) $.
Note that the converse does not hold: the existence of a stationary distribution does not imply that the Markov Chain is reversible.

27. Introduce Markov Chain Monte Carlo and the Metropolis-Hastings Criterion

Markov Chain Monte Carlo (MCMC) methods help us sample from $p(x)=\frac{1}{Z}\tilde p(x)$ by constructing an ergodic Markov Chain (the entire state space is reachable, no matter where we start exploring) with a fixed proposal kernel $q(x|y)$.
The Metropolis-Hastings algorithm is a popular MCMC method. It generates a candidate state from a proposal distribution and accepts or rejects it based on the following criterion:
$$ \alpha(y|x) = \min\left( 1, \frac{\tilde p(y) q(x|y)}{\tilde p(x) q(y|x)}\right) $$ For symmetric proposal kernels $q(x|y)=q(y|x)$ this simplifies to the Metropolis criterion, which can be written as $$ \alpha(y|x) = \min\left( 1, \frac{\tilde p(y)}{\tilde p(x)}\right) $$ where $\frac{\tilde p(y)}{\tilde p(x)}=\frac{ p(y)}{ p(x)}$, so regions with higher probability ($p(y)>p(x)$) are more likely to be visited.
The Metropolis-Hastings acceptance criterion satisfies the detailed balance condition.
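A minimal random-walk Metropolis sketch for an illustrative unnormalized 1-D target (the Gaussian-mixture target and step size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target: mixture of two Gaussians (known only up to Z).
log_p_tilde = lambda x: np.logaddexp(-0.5 * (x - 2) ** 2, -0.5 * (x + 2) ** 2)

def metropolis(n_steps, step=1.0, x0=0.0):
    x, chain = x0, []
    for _ in range(n_steps):
        y = x + rng.normal(0, step)                      # symmetric proposal q(y|x)
        log_alpha = log_p_tilde(y) - log_p_tilde(x)      # Metropolis criterion
        if np.log(rng.uniform()) < log_alpha:            # accept with prob min(1, p(y)/p(x))
            x = y
        chain.append(x)
    return np.array(chain)

chain = metropolis(50_000)
print(chain.mean(), chain.std())     # mean ~ 0, std ~ sqrt(5) ≈ 2.24 for this mixture
```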

28. Discuss Issues of Vanilla MCMC

Vanilla MCMC can have several issues:

  • High mixing time: the chain may take a long time to explore the entire state space before it reaches the stationary distribution.
  • Choosing the right proposal kernel is difficult: the goal is to have a good balance between exploration and exploitation, which is especially challenging for multimodal distributions and in high dimensionality.
  • Convergence Diagnostics: It can be challenging to determine if the chain has converged to the stationary distribution.

29. Gibbs Sampling

Gibbs sampling is a special case of MCMC where each variable is sampled from its conditional distribution $p(x_i|x_{-i})$ given the current values of the other variables.
The algorithm involves iteratively sampling each variable from its conditional distribution:

  1. Pick a variable $k\in\{1,\dots,n\}$
  2. Set $x_j^{(t+1)}=x_j^{(t)}$ for $j\ne k$
  3. Sample $x_k^{(t+1)}$ from $\tilde p (x_k|x^{(t)}_{-k})$
  4. Repeat until convergence.

Gibbs sampling can be seen as a Metropolis-Hastings algorithm whose proposals are always accepted (acceptance probability 1).
The Gibbs sampling algorithm can be generalized, for example, by sampling blocks instead of a single variable, i.e. we can sample a subset $x_j, \dots,x_k\subset x$ jointly.
There are some issue with this technique:

  • We may not satisfy ergodicity in Gibbs sampling, since it may not always be possible to find a path between two states that is made only of "single component" steps.
  • If variables are strongly correlated, then our mixing time can be very high.
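A minimal Gibbs sampler for an illustrative bivariate Gaussian target, where the full conditionals are available in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8        # correlation of the target bivariate standard Gaussian

def gibbs(n_steps):
    x1, x2, samples = 0.0, 0.0, []
    for _ in range(n_steps):
        # Sample each coordinate from its full conditional p(x_k | x_{-k})
        x1 = rng.normal(rho * x2, np.sqrt(1 - rho ** 2))
        x2 = rng.normal(rho * x1, np.sqrt(1 - rho ** 2))
        samples.append((x1, x2))
    return np.array(samples)

samples = gibbs(20_000)
print(np.round(np.corrcoef(samples.T), 2))   # off-diagonal ~ 0.8
```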

30. Discuss Convergence Diagnostics and the $\hat R$ index

Convergence diagnostics assess whether an MCMC chain has converged to the stationary distribution. Given a function of the state of the MCMC trajectory $\{x_t\}_{t\ge 0}$, $\Psi:\mathcal X\rightarrow \mathbb R$, we want to run more than one chain, compare the distributions of the samples we obtain, and check whether they look more or less the same.
Practically:

  • We sample $m/2\ge1$ trajectories from overdispersed initial points. We try to start from different states that are far away in our state space.
  • Sample for $4n$ steps.
  • We throw away the first half of every trajectory so that we have only $2n$ points left. This phase is known as the burn-in or warm-up phase; we do this because it takes time to reach the steady state.
  • Then we split each remaining trajectory in $2$ parts, so that we are left with $m$ different trajectories, each of length $n$.

Now, denoting the samples $\Psi_{ij}$, where $i\in[1,n]$ indexes the step and $j\in[1,m]$ the trajectory, we can define: $$\begin{cases} \bar \Psi_j=\frac{1}{n}\sum_{i=1}^n\Psi_{ij}\\ \bar \Psi=\frac{1}{m}\sum_{j=1}^m\bar\Psi_{j} \end{cases}$$

We are also interested in $\text{VAR}[\bar \Psi]$ but, since our samples are not independent, we don't have $\text{VAR}[\bar \Psi]=\frac{1}{n}\text{VAR}[\Psi]$, because two consecutive points in the chain can be more correlated than two points distant in time from one another.
We can define these quantities: $$ s_j^2=\frac{1}{n-1}\sum_{i=1}^n (\Psi_{ij}-\bar \Psi_j)^2\qquad W=\frac{1}{m}\sum_{j=1}^m s_j^2\qquad B=\frac{n}{m-1}\sum_{j=1}^m(\bar \Psi_j-\bar \Psi)^2 $$ where $W$ is called the within variance while $B$ is the between variance, and both are indicators of dispersion.
By definition $$ W\le \text{VAR}[\Psi] $$ But it can also be shown that $$ \text{VAR}[\Psi]\le\frac{n-1}{n}W+\frac{1}{n}B=\text{VAR}^+[\Psi] $$ So we have an upper and a lower bound, both converging to the true variance. Therefore we can compute: $$ \hat R=\sqrt{\frac{\text{VAR}^+[\Psi]}{W}}=\sqrt{\frac{n-1}{n}+\frac{1}{n}\frac{B}{W}} $$ which can be monitored while running the MCMC simulations. Notice that $\hat R\ge1$ and that it goes to $1$ for $n\rightarrow\infty$ (in practice we can say that when $\hat R\approx1.1$ we have converged to our stationary distribution).
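A compact sketch of the $\hat R$ computation from the quantities above (the input is assumed to already be split into $m$ chains of length $n$ with warm-up discarded; the test chains are synthetic):

```python
import numpy as np

def r_hat(psi):
    """Gelman-Rubin diagnostic. psi has shape (m, n): m chains of length n."""
    m, n = psi.shape
    chain_means = psi.mean(axis=1)                       # \bar{Psi}_j
    grand_mean = chain_means.mean()                      # \bar{Psi}
    W = psi.var(axis=1, ddof=1).mean()                   # within-chain variance
    B = n / (m - 1) * np.sum((chain_means - grand_mean) ** 2)  # between-chain variance
    var_plus = (n - 1) / n * W + B / n
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(0)
good = rng.normal(size=(4, 1000))                    # chains sampling the same distribution
bad = good + np.array([[0.0], [0.0], [3.0], [3.0]])  # two chains stuck elsewhere
print(r_hat(good), r_hat(bad))                       # ~ 1.0 vs noticeably > 1
```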

31. Introduce the Ideas of Effective Sample Size

The effective sample size (ESS) measures the quality of samples generated by an MCMC algorithm, accounting for the autocorrelation between nearby points in the trajectory.
The ESS is given by: $ n_{eff} = \frac{nm}{1 + 2 \sum_{k=1}^\infty \rho_k}<nm, $ where $nm$ is the total number of samples and $ \rho_k $ is the autocorrelation at lag $ k $ (so $\rho_k=\text{Corr}[\Psi(x_i),\Psi(x_{i+k})]$). In this scenario we can write $$ \text{VAR}[\bar \Psi]=\frac{\text{VAR}[\Psi]}{n_{eff}} $$ We can estimate the autocorrelation using the identity $\mathbb E[(\Psi_i-\Psi_{i-k})^2]=2(1-\rho_k)\text{VAR}[\Psi]$; this expectation is called the variogram at lag $k$ and is estimated as: $$ V_k=\frac{1}{m(n-k)}\sum_{j=1}^m\sum_{i=k+1}^n (\Psi_{i,j}-\Psi_{i-k,j})^2 $$ so that $$ \rho_k=1-\frac{V_k}{2\text{VAR}[\Psi]} $$

32. Hamiltonian Monte Carlo

Hamiltonian Monte Carlo (HMC) is an MCMC method that uses Hamiltonian dynamics to propose new states. HMC improves the efficiency of the sampler by reducing random-walk behavior. It introduces auxiliary momentum variables and simulates the Hamiltonian dynamics in order to move along level sets of the energy corresponding to our probability distribution. We always start from the same purpose: sampling $p(x)=\frac{1}{Z}\tilde p(x)$, but this time we express it using the Boltzmann trick: $$ p(x)=\frac{1}{Z}\tilde p(x)=\frac{1}{Z_x}e^{H_x(x)} $$ We also introduce momentum variables $y$ and assign them a probability distribution $p(y)=\frac{1}{Z_y}e^{H_y(y)}$, with the typical assumption that $p(y)$ is a standard Gaussian, so $H_y(y)=-\frac{1}{2}y^Ty$.
We will sample from joint distribution: $$ p(x,y)=\frac{1}{Z}e^{H(x,y)} $$ where $H(x,y)=H_x(x)+H_y(y)$ and $Z=Z_xZ_y$.
In Hamiltonian Monte Carlo, we sample by following trajectories defined by the Hamiltonian, ensuring energy remains constant. This means we move through the probability distribution space according to the equations of motion, preserving energy to efficiently explore high-probability regions. The algorithm is the following:

  1. Pick $x_i$
  2. We sample $y\sim p(y)$, randomizing the momentum
  3. We choose a random direction in time, sampling $t \sim \text{Uniform}\{-1, +1\}$ (this ensures reversibility in time and helps ergodicity; Hamiltonian dynamics are time-reversible)
  4. We move according to Hamiltonian dynamics from $(x_i,y)$ to a candidate $(x',y')$ doing $L$ steps
  5. We introduce a rejection criterion:
    • we accept it if $H(x',y')>H(x,y)$
    • otherwise, we accept it with probability $e^{H(x',y')-H(x,y)}$

Meanwhile, in each step from $(x,y)$ to $(x',y')$ we move following the Hamiltonian dynamics, keeping $H(x',y')=H(x,y)$: at each step we set $(x',y')=(x+\Delta x,y+\Delta y)$ and with a first-order expansion we obtain: $$ H(x+\Delta x,y+\Delta y)\approx H(x,y)+\nabla_xH_x(x)^T\Delta x+\nabla_yH_y(y)^T\Delta y $$ Then, enforcing the condition $H(x',y')=H(x,y)$, we can derive the equations of motion: $$ \begin{cases} \Delta x=\varepsilon\nabla_y H_y(y)\\ \Delta y=-\varepsilon\nabla_x H_x(x) \end{cases} $$ Our acceptance criterion is a Metropolis acceptance criterion $$ \alpha=\min\left\{ 1,\frac{p(x',y')}{p(x,y)} \right\}=\min\left\{ 1,e^{H(x',y')-H(x,y)} \right\} $$
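A sketch of an HMC step, written in the more common potential-energy convention $U(x)=-\log\tilde p(x)$ (equivalent to the $e^{H}$ convention above up to a sign) and using a leapfrog integrator instead of the first-order update; the standard-Gaussian target is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: standard 2-D Gaussian. U(x) = -log p_tilde(x); total energy = U(x) + 0.5 y.y
U = lambda x: 0.5 * x @ x
grad_U = lambda x: x

def hmc_step(x, eps=0.1, L=20):
    y = rng.normal(size=x.shape)                 # resample momentum y ~ N(0, I)
    x_new, y_new = x.copy(), y.copy()
    # Leapfrog integration of the Hamiltonian dynamics
    y_new -= 0.5 * eps * grad_U(x_new)
    for _ in range(L):
        x_new += eps * y_new
        y_new -= eps * grad_U(x_new)
    y_new += 0.5 * eps * grad_U(x_new)           # turn the last full step into a half step
    # Metropolis acceptance on the change in total energy
    dH = (U(x_new) + 0.5 * y_new @ y_new) - (U(x) + 0.5 * y @ y)
    return x_new if np.log(rng.uniform()) < -dH else x

x, chain = np.zeros(2), []
for _ in range(5000):
    x = hmc_step(x)
    chain.append(x)
print(np.cov(np.array(chain).T).round(2))        # ~ identity
```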

Chapter 9: Expectation Maximization

33. Introduce the Problem that can be Solved by EM

The Expectation-Maximization (EM) algorithm is designed to solve the problem of finding Maximum Likelihood Estimates (MLE) for the parameters $\theta$ of a probabilistic model in the presence of latent (unobserved) variables.

The problem is formally stated as follows:

  • We have observed data $x = (x_1, \ldots, x_N)$.
  • We assume a generative model $p(x, z | \theta)$ that also depends on latent variables $z = (z_1, \ldots, z_M)$.
  • Our goal is to find the parameters that maximize the marginal likelihood of the observed data: $$ \theta_{ML} = \underset{\theta}{\operatorname{argmax}} \, p(x | \theta) = \underset{\theta}{\operatorname{argmax}} \sum_{z} p(x, z | \theta) $$

But this optimization problem is intractable, and that's because of two main reasons:

  1. The Sum-over-Latent-States: The marginal likelihood $p(x | \theta)$ requires a summation (or integration in the continuous case) over all possible configurations of the latent variables $z$. If each $z_i$ can take $K$ values, the number of terms in the sum grows as $K^M$, which is computationally infeasible for anything but trivially small $M$ (the "curse of dimensionality").

  2. The Log-Sum Problem: To perform optimization, we typically work with the log-likelihood. However, the structure of the problem becomes prohibitively complex: $$ \log p(x | \theta) = \log \left( \sum_{z} p(x, z | \theta) \right) $$ The logarithm is placed outside the sum. This log-sum structure makes it impossible to obtain a closed-form solution or to compute the gradient of the log-likelihood efficiently, as the sum inside the log couples the parameters $\theta$ for all possible values of $z$.

The problem is further compounded when we have multiple i.i.d. observations. The complete log-likelihood $p(x, z | \theta)$ factorizes nicely, but the marginal log-likelihood does not: $$ \log p(x | \theta) = \log \left( \sum_{z} \prod_{n=1}^N p(x_n, z_n | \theta) \right) \neq \sum_{n=1}^N \log p(x_n | \theta) $$

In summary, the problem solved by EM is maximum likelihood parameter estimation in models with latent variables, and its primary purpose is to circumvent the direct, intractable optimization of the marginal log-likelihood caused by the log-of-a-sum problem.

34. Introduce and Derive the Evidence Lower Bound

The Evidence Lower Bound (ELBO) is a concept used in variational inference, a method in Bayesian statistics used to approximate complex integrals. The ELBO is used to derive a lower bound on the log marginal likelihood, which is often intractable to compute directly.

Let's denote the observed data as $x$, the latent variables as $z$, and the parameters of the model as $\theta$. The goal of variational inference is to approximate the posterior distribution $p(z | x, \theta)$.
The log marginal likelihood can be written as: $$\log p(x | \theta) = \log \int p(x, z | \theta) dz$$ This is often difficult to compute directly. Instead, we introduce a variational distribution $q(z)$ that approximates the true posterior, and we try to minimize the divergence between $q(z)$ and $p(z | x, \theta)$. This divergence can be measured using the Kullback-Leibler (KL) divergence:

$$KL(q(z) || p(z | x, \theta)) = E_{q(z)}[\log q(z) - \log p(z | x, \theta)]$$

We can't compute this directly because it depends on the log marginal likelihood. However, using $p(z|x,\theta)=\frac{p(x,z|\theta)}{p(x|\theta)}$ inside the KL term and rearranging, we obtain a decomposition whose second term we can compute:

$$\log p(x | \theta) = KL(q(z) || p(z | x, \theta)) + E_{q(z)}[\log p(x, z | \theta) - \log q(z)]$$

The second term on the right-hand side is the ELBO. It's a lower bound on the log marginal likelihood because the KL divergence is always non-negative. The goal of variational inference is to choose $q(z)$ to maximize the ELBO, which also minimizes the KL divergence and makes $q(z)$ a better approximation to the true posterior.

35. Discuss the Expectation Maximization Algorithm (both E-step and M-step) (formalism different, to analyze better)

The Expectation-Maximization (EM) algorithm is a popular iterative method used in statistical computations to find maximum likelihood estimates of parameters, $\theta_{ML}=\arg\max_\theta \log p(\bar x|\theta)$. We know that: $$ELBO(q,\theta)= E_{q(z)}[\log p(x, z | \theta) - \log q(z)]$$ where the first term is the energy term, while the second term is the entropy of the $q(z)$ distribution.
It alternates between the E-step, which computes the expected value of the log-likelihood given the current parameter estimates, and the M-step, which maximizes this expected log-likelihood to update the parameter estimates.

  1. Expectation Step (E-step): Given the current parameters $\theta^{(t)}$, we first calculate the expected value of the log-likelihood function. This is done by taking the expectation over the conditional distribution of the latent variable $z$ given the observed data $\bar x$ under the current parameter estimates. Mathematically, this can be written as:

    $$Q(\theta, \theta^{(t)}) = E_{z|\bar x,\theta^{(t)}}[\log p(z,\bar x|\theta)]$$

  2. Maximization Step (M-step): We find the parameters that maximize the expected log-likelihood found on the E-step:

    $$\theta^{(t+1)} = argmax_{\theta} Q(\theta, \theta^{(t)})$$

The algorithm alternates between these two steps until the parameters converge, i.e., the estimates of the parameters do not change significantly between successive iterations.

The EM algorithm guarantees that the likelihood will not decrease with each iteration. However, it does not guarantee to find the global maximum likelihood, and the result can be sensitive to the choice of initial parameters.

36. Discuss convergence of the EM algorithm

The algorithm works by repeating the E-step and M-step until convergence (i.e. whenever $||\theta^{(t+1)}- \theta^{(t)}|| < \varepsilon$).
After setting the Kullback-Leibler divergence to zero, we close the gap vertically and we make the log-likelihood equal to the lower bound. In the M-step, since we are maximizing the lower bound, we are pushing it up, but then also the log-likelihood will be pushed up, possibly of a larger quantity than the lower bound, creating a new gap.
After both the E and M steps are completed, the log-likelihood never decreases, until it gets close to a local maximum and eventually reaches it. So we can consider as the termination condition of the EM algorithm the following: $ELBO(q^{(t+1)},\theta^{(t+1)})-ELBO(q^{(t)},\theta^{(t)})<\delta$.
In general the EM algorithm converges to a local optimum of $\log p(\bar x|\theta)$.

37. Discuss EM algorithm for Gaussian Mixtures

The EM algorithm is the standard method for learning the parameters of a Gaussian Mixture Model (GMM). A GMM assumes data is generated from a mixture of k Gaussian components. The model has parameters $\theta = (\pi, \mu, \Sigma)$:

  • $\pi$: A vector of mixing coefficients (priors for each component), where $\pi_j \ge 0$ and $\sum_j \pi_j = 1$
  • $\mu_j$: The mean of the j-th Gaussian component.
  • $\Sigma_j$: The covariance matrix of the $j$-th Gaussian component.

The latent variable z is a one-hot encoded vector indicating which component generated a data point x.

EM for GMM:

  1. Initialization: Initialize the parameters $\theta^{0} = (\pi, \mu, \Sigma)$ (e.g., using K-means).

  2. E-step (Expectation): Compute the "responsibility" $\gamma(z_{nj})$ of component $j$ for generating data point $x_n$. This is the posterior probability $p(z_j=1 | x_n)$. $$ \gamma(z_{nj}) = \frac{\pi_j \mathcal{N}(x_n | \mu_j, \Sigma_j)}{\sum_{i=1}^K \pi_i \mathcal{N}(x_n | \mu_i, \Sigma_i)} $$

  3. M-step (Maximization): Update the parameters using the current responsibilities. These updates are weighted by $\gamma(z_{nj})$.

    • Update mixing coefficients: $$ \pi_j^{\text{new}} = \frac{N_j}{N}, \quad \text{where } N_j = \sum_{n=1}^N \gamma(z_{nj}) $$
    • Update means: $$ \mu_j^{\text{new}} = \frac{1}{N_j} \sum_{n=1}^N \gamma(z_{nj}) x_n $$
    • Update covariances: $$ \Sigma_j^{\text{new}} = \frac{1}{N_j} \sum_{n=1}^N \gamma(z_{nj}) (x_n - \mu_j^{\text{new}})(x_n - \mu_j^{\text{new}})^T $$
  4. Repeat: Iterate steps 2 (E-step) and 3 (M-step) until the log-likelihood converges.
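A compact numpy/scipy sketch of this EM loop on synthetic data (the initialization and the toy clusters are illustrative choices):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)]          # init means from data points
    Sigma = np.stack([np.cov(X.T) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: responsibilities gamma[n, j] = p(z_j = 1 | x_n)
        dens = np.stack([pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                         for j in range(K)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted updates of pi, mu, Sigma
        Nj = gamma.sum(axis=0)
        pi = Nj / N
        mu = (gamma.T @ X) / Nj[:, None]
        for j in range(K):
            diff = X - mu[j]
            Sigma[j] = (gamma[:, j, None] * diff).T @ diff / Nj[j]
    return pi, mu, Sigma

# Toy data: two well-separated clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([-3, 0], 1, (200, 2)), rng.normal([3, 0], 1, (300, 2))])
pi, mu, Sigma = gmm_em(X, K=2)
print(np.round(pi, 2), np.round(mu, 1))   # weights ~ {0.4, 0.6}, means near (-3,0) and (3,0)
```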

Chapter 10: Variational Inference (from here I gave up and all those answers are made with CHAT-GPT 4o the night before the exam)

38. Introduce the Problem Formulation for Variational Inference

Variational Inference (VI) aims to approximate complex posterior distributions $ p(z|x) $ with a simpler, parameterized distribution $ q(z|\lambda) $ by minimizing the Kullback-Leibler (KL) divergence between $ q(z|\lambda) $ and $ p(z|x) $:

$\begin{equation} \text{KL}(q(z|\lambda) \| p(z|x)) = \mathbb{E}_{q(z|\lambda)} \left[ \log q(z|\lambda) - \log p(z|x) \right] \end{equation}$

This can be reformulated as maximizing the Evidence Lower Bound (ELBO):

$ \mathcal{L}(\lambda) = \mathbb{E}_{q(z|\lambda)} \left[ \log p(x,z) - \log q(z|\lambda) \right] $

39. Introduce Mean Field Variational Inference

Mean Field Variational Inference (MFVI) is a type of VI where the variational distribution $ q(z|\lambda) $ is assumed to factorize across the latent variables:

$ q(z|\lambda) = \prod_{i=1}^{n} q_i(z_i|\lambda_i) $

This factorization simplifies the optimization problem and makes the computations more tractable.

40. Example of Mean Field Variational Inference on Gaussian Distribution

For a joint Gaussian distribution $ p(z) = \mathcal{N}(z|\mu, \Lambda^{-1}) $, with $ \mu = [\mu_1, \mu_2]^T $ and precision matrix $ \Lambda $, the MFVI approximates the joint distribution as a product of univariate Gaussians:

$ q(z) = q(z_1) q(z_2) $

The update equations for the variational parameters are derived from the ELBO optimization:

$ q_1(z_1) = \mathcal{N}(z_1|\nu_1, \Lambda_{11}^{-1}) $ $ \nu_1 = \mu_1 - \Lambda_{11}^{-1} \Lambda_{12} (\mathbb{E}_{q_2}[z_2] - \mu_2) $

Similarly,

$ q_2(z_2) = \mathcal{N}(z_2|\nu_2, \Lambda_{22}^{-1}) $ $ \nu_2 = \mu_2 - \Lambda_{22}^{-1} \Lambda_{21} (\mathbb{E}_{q_1}[z_1] - \mu_1) $
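A tiny sketch of the resulting coordinate-ascent updates on an illustrative 2-D Gaussian (the values of $\mu$ and $\Lambda$ are made up):

```python
import numpy as np

# Target: 2-D Gaussian with mean mu and precision matrix Lam.
mu = np.array([1.0, -1.0])
Lam = np.array([[2.0, 0.9], [0.9, 1.0]])

# Mean-field q(z) = q1(z1) q2(z2): iterate the coordinate updates for the means.
nu = np.zeros(2)                        # current E_{q1}[z1], E_{q2}[z2]
for _ in range(50):
    nu[0] = mu[0] - Lam[0, 1] / Lam[0, 0] * (nu[1] - mu[1])
    nu[1] = mu[1] - Lam[1, 0] / Lam[1, 1] * (nu[0] - mu[0])

print(nu)   # converges to the true mean [1, -1]
# The factor variances are Lam[0,0]**-1 and Lam[1,1]**-1, which underestimate the
# true marginal variances (a known property of minimizing the direct KL).
```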

41. Variational Inference with Direct and Inverse KL

The behavior of the variational approximation depends on which form of the Kullback-Leibler (KL) divergence is minimized:

  • Direct KL (Minimizing $\text{KL}(q|p)$): This is the standard approach in Variational Inference. It results in a mode-seeking behavior (also called zero-forcing). The variational distribution $q$ will concentrate on the main modes of the true posterior $p$ and will be forced to be zero where $p$ is zero. This can cause $q$ to underestimate the spread of $p$ and potentially miss smaller, secondary modes.

  • Inverse KL (Minimizing $\text{KL}(p|q)$): This approach results in a mode-covering behavior (also called zero-avoiding). The variational distribution $q$ will be encouraged to spread out to cover all regions where $p$ has mass, ensuring that $q$ is never zero if $p$ could be non-zero. However, this objective is typically intractable for VI as it requires computing expectations with respect to the true posterior $p(z|x)$.

42. Variational Linear Regression

In Variational Linear Regression, the goal is to approximate the posterior distribution of weights and noise precision in a Bayesian linear regression model. Given the model:

$ y = Xw + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I) $ $ p(w|\alpha) = \mathcal{N}(w|0, \alpha^{-1} I) $ $ p(\alpha) = \text{Gamma}(\alpha|a_0, b_0) $

The variational distribution is factorized as:

$ q(w, \alpha) = q(w) q(\alpha) $

The updates for $ q(w) $ and $ q(\alpha) $ involve the expectations of the sufficient statistics, iteratively optimized until convergence.

43. Introduce Black Box Variational Inference

Black Box Variational Inference (BBVI) is a flexible VI approach that uses stochastic gradient descent to optimize the ELBO. It allows for the use of any model and any variational distribution without the need for model-specific derivations. BBVI leverages Monte Carlo estimates of gradients, making it applicable to a wide range of probabilistic models.

44. Discuss How to Compute the Gradient of the ELBO for a Non-reparameterizable Variational Distribution

For non-reparameterizable variational distributions, the gradient of the ELBO can be computed using the score function estimator (also known as the REINFORCE estimator). This approach estimates the gradient as follows:

$ \nabla_\lambda \mathcal{L}(\lambda) = \mathbb{E}_{q(z|\lambda)} \left[ (\nabla_\lambda \log q(z|\lambda)) (\log p(x,z) - \log q(z|\lambda)) \right] $

While this estimator has high variance, techniques such as Rao-Blackwellization and control variates can reduce it.
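A minimal sketch of the score-function estimator for a 1-D Gaussian $q(z|\mu)$ with fixed unit variance and an illustrative target (all names and values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model with a single latent z: log p(x, z) up to a constant, peaked at z = 2.
log_p = lambda z: -0.5 * (z - 2.0) ** 2

def elbo_grad_score(mu, sigma=1.0, S=2000):
    """Score-function (REINFORCE) estimate of d ELBO / d mu for q(z) = N(mu, sigma^2)."""
    z = rng.normal(mu, sigma, size=S)
    log_q = -0.5 * ((z - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    score = (z - mu) / sigma ** 2                 # grad_mu log q(z | mu)
    return np.mean(score * (log_p(z) - log_q))

# Simple stochastic gradient ascent on the ELBO
mu = -3.0
for _ in range(200):
    mu += 0.1 * elbo_grad_score(mu)
print(mu)                                         # ~ 2, the mean of the target
```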

45. Discuss Rao-Blackwellization for Variance Reduction

In the context of Black Box Variational Inference, Rao-Blackwellization is a variance reduction technique that leverages the structure of the model to compute parts of an expectation analytically, even when the entire expectation is intractable.

The key insight is to identify a subset of variables—specifically, those in the Markov blanket of a target variable $z_i$ and integrate over them explicitly, rather than sampling. This conditions the estimator on more information, reducing its variance.

The standard, high-variance gradient estimator is: $\begin{equation} \nabla_{\lambda_i} \mathcal{L}(\lambda) \approx \frac{1}{S} \sum_{s=1}^{S} \nabla_{\lambda_i} \log q(z_i^{(s)}|\lambda_i) \left[ \log p(x, z^{(s)}) - \log q(z^{(s)}|\lambda) \right] \end{equation}$

The Rao-Blackwellized version restricts the inner term to only depend on the Markov blanket $z_{(i)}$ of $z_i$: $\begin{equation} \hat{\nabla}_{\lambda_i}[\mathcal{L}] = \mathbb{E}_{q_{(i)}} \left[ \nabla_{\lambda_i} \log q(z_i|\lambda_i) \left( \log p_i(x, z_{(i)}) - \log q(z_i|\lambda_i) \right) \right] \end{equation}$ where:

  • $q_{(i)}(z|\lambda)$ is the marginal distribution over the Markov blanket of $z_i$,
  • $p_i(x, z_{(i)})$ is the part of the joint distribution $p(x, z)$ that depends on $z_{(i)}$.

By integrating over the Markov blanket variables (either analytically or with lower-variance sampling), this estimator focuses on the relevant local information for $z_i$, significantly reducing the variance compared to the naive estimator.

46. Discuss Control Variates for Variance Reduction

Control variates are another technique for reducing variance in Monte Carlo estimates. The idea is to use an auxiliary variable with known expected value to construct an estimator with lower variance. For a function $ g(z) $ with known expectation $ \mathbb{E}[g(z)] = 0 $, the control variate estimator is:

$ \hat{f}_{CV} = \hat{f} - c(\hat{g} - \mathbb{E}[g(z)]) $

where $ c $ is a coefficient that minimizes the variance of the adjusted estimator.

47. Discuss Bayesian Neural Networks

Bayesian Neural Networks (BNNs) extend neural networks by placing a prior distribution over the network's weights. This allows for the incorporation of uncertainty in predictions and model parameters. Inference in BNNs is typically performed using VI, where the posterior distribution of the weights is approximated by a simpler variational distribution. This approach allows for robust predictions and quantification of uncertainty.

Chapter 11: Generative Modelling

48. Introduce the Generative Modelling Problem

Generative modelling involves learning the joint distribution $ p(x, z) $ of observed data $ x $ and latent variables $ z $ to generate new data samples. The goal is to model the underlying data distribution and generate realistic samples. Common generative models include Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and normalizing flows.

49. Introduce Autoencoding Variational Bayes

Autoencoding Variational Bayes (AEVB) is a framework for training Variational Autoencoders (VAEs). A VAE consists of an encoder that maps input data to a latent space and a decoder that reconstructs the input from the latent space. The ELBO for AEVB is:

$ \mathcal{L}(\phi, \theta) = \mathbb{E}_{q_\phi(z|x)} \left[ \log p_\theta(x|z) \right] - \text{KL}(q_\phi(z|x) \| p(z)) $

where $ \phi $ and $ \theta $ are the parameters of the encoder and decoder, respectively.

50. Introduce Denoising Diffusion Models

Denoising Diffusion Models are a class of generative models where data is corrupted with noise through a forward diffusion process and a neural network is trained to reverse this process. The forward process adds noise step by step:

$ q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I) $

The reverse process is learned to generate data by denoising from the prior distribution to the data distribution. The neural network learns to approximate the reverse conditional probabilities, allowing for efficient sampling and generation of new data.
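A small sketch of the forward (noising) process, using the standard closed form $q(x_t|x_0)=\mathcal N(\sqrt{\bar\alpha_t}\,x_0,(1-\bar\alpha_t)I)$ obtained by composing the single-step kernels, with an illustrative linear schedule for $\beta_t$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear noise schedule beta_t and the derived alpha_bar_t = prod_s (1 - beta_s).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I)."""
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise, noise

x0 = rng.normal(size=(4,))
x_mid, _ = q_sample(x0, t=100)      # partially noised
x_end, _ = q_sample(x0, t=T - 1)    # essentially pure Gaussian noise
print(np.round(alpha_bar[[100, T - 1]], 3), np.round(x_mid, 2), np.round(x_end, 2))
```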
