Variational approaches in machine learning

Variational approaches in machine learning (ML) shift the perspective away from explicitly dealing with data samples to dealing with probability distributions that describe those samples. This can be a powerful data-modeling approach, one that stands behind many of the recent successes of generative models.

Since the core concepts of variational approaches are rooted in probability distributions, we need to first talk about one of the simplest ones…

The Gaussian normal distribution

For a random variable \(z\)—for example, star dates from the Star Trek Original Series—the Gaussian probability density function (PDF), also known as the Gaussian normal distribution, is denoted \(\mathcal{N}(z \mid \mu, \sigma)\), where \(\mu\) is the mean and \(\sigma\) is the standard deviation of the distribution (note that \(\sigma^2\) is a quantity known as the variance of the distribution). The symbol “\(\mid\)” denotes conditioning: the density of \(z\) given the parameters \(\mu\) and \(\sigma\). The Gaussian PDF can be conveniently expressed in analytic form as:

\(\mathcal{N}(z \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2 \pi}} \exp \left( - \frac{1}{2} \frac{(z - \mu)^2}{\sigma^2} \right)\)

Such a function always has a total area under the curve equal to unity, as required of any PDF.

A Python implementation of the above equation is quite straightforward:

import numpy as np

def gaussian_PDF(x,
                 μ=0.0,
                 σ=1.0):
    # Density of the Gaussian normal distribution N(x | μ, σ),
    # evaluated element-wise for scalar or NumPy array inputs x.
    return 1.0 / (σ * np.sqrt(2.0 * np.pi)) * np.exp(-0.5 * (x - μ)**2 / (σ**2))

The figure below visualizes a few Gaussian PDFs for a couple of choices for \(\mu\) and \(\sigma\).
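
If you'd like to recreate such a figure, here is a minimal matplotlib sketch; the specific \((\mu, \sigma)\) pairs below are example choices of mine, not necessarily the ones shown in the figure:

import matplotlib.pyplot as plt

x = np.linspace(-10.0, 10.0, 500)

# Example (μ, σ) pairs, chosen purely for illustration
for μ, σ in [(0.0, 1.0), (0.0, 2.0), (3.0, 0.5)]:
    plt.plot(x, gaussian_PDF(x, μ, σ), label=f"μ={μ}, σ={σ}")

plt.xlabel("z")
plt.ylabel("PDF")
plt.legend()
plt.show()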

The cool part about this distribution is that, knowing the values of \(\mu\) and \(\sigma\), you know everything about it. What’s even cooler, you can now sample a value of the random variable \(z\) from a distribution with specific values of \(\mu\) and \(\sigma\)! The way to do that is to first draw a sample, \(\varepsilon\), from \(\mathcal{N}(z \mid \mu = 0.0, \sigma = 1.0)\) and then re-scale that sample to the desired \(\mu\) and \(\sigma\) values, like so:

\(z = \varepsilon \cdot \sigma + \mu\)

Take a look at how I have now drawn 1000 samples from each distribution and plotted histograms of those samples. They indeed fit the given PDFs!
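
In code, this rescaling trick takes only a couple of lines. Here is a minimal sketch using NumPy's random generator; the \(\mu\) and \(\sigma\) values below are just example choices:

rng = np.random.default_rng(seed=0)   # fixed seed, purely for reproducibility

μ, σ = 3.0, 0.5                       # example parameter values
ε = rng.standard_normal(1000)         # 1000 draws from N(z | μ=0.0, σ=1.0)
z = ε * σ + μ                         # rescale to the desired μ and σ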

Working with log-distributions

Working with probabilities or PDFs can be tricky for ML algorithms, because their values can get arbitrarily close to zero. Take a look at the Gaussian PDFs from the earlier figure: for the most part (stretching out to plus/minus infinity), the Gaussian PDF is very near zero. How near? Well, the endpoints of the red curve correspond to the density being \(\mathcal{O}(10^{-23})\). We may easily break the numerical stability of a computational algorithm if we have to run operations (such as addition or subtraction) on such small numbers. In addition, predictive ML approaches may not work that well when they have to distinguish between \(10^{-6}\) and \(10^{-7}\) and, at the same time, also work well in the regime between \(10^{0}\) and \(10^{-1}\). This is why you will often encounter logarithmic transformations of probabilities, or of PDFs, in ML. In the figure below, I’ve transformed the earlier Gaussian PDFs with a natural logarithm. There is now much more variability in values that we would have previously considered essentially zero.
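
To make the problem concrete, here is a minimal sketch; the density value of \(10^{-23}\) is just an example magnitude from the far tail. Multiplying many such tiny densities underflows to zero in double precision, while summing their logarithms remains perfectly representable:

densities = np.full(1000, 1e-23)      # e.g., far-tail Gaussian density values
print(np.prod(densities))             # 0.0 -- the product underflows
print(np.sum(np.log(densities)))      # ≈ -52959.5 -- the log-sum is fine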

Let’s compute the natural logarithm of the Gaussian normal distribution—this will come in handy later:

\(\ln \left( \mathcal{N}(z \mid \mu, \sigma) \right) = \ln \left( \frac{1}{\sigma} \right) + \ln \left( \frac{1}{\sqrt{2\pi}} \right) - \frac{1}{2} \frac{(z - \mu)^2}{\sigma^2} = -\frac{1}{2} \left( \ln \sigma^2 + \ln(2\pi) + \frac{(z - \mu)^2}{\sigma^2} \right)\)

Here’s the Python implementation of the above equation:

def ln_gaussian_PDF(x,
                    μ=0.0,
                    σ=1.0):
    # Natural logarithm of the Gaussian PDF N(x | μ, σ), computed directly in
    # log-space (safer than np.log(gaussian_PDF(x, μ, σ)) for extreme x).
    return -0.5 * (np.log(σ**2) + np.log(2 * np.pi) + (x - μ)**2 / (σ**2))
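
As a quick sanity check, the log-space implementation should agree with taking the logarithm of the direct PDF wherever the latter does not underflow; for example:

x = np.linspace(-5.0, 5.0, 11)
assert np.allclose(ln_gaussian_PDF(x), np.log(gaussian_PDF(x)))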

The concept of maximizing log-likelihoods