# Evidence approximation in linear regression: A method that produces “automatically regularized” solutions.

## Summary

In this post, I look at a Bayesian treatment of the linear regression problem. Making use of basis functions allows you to model non-linear patterns in data, however taking this route usually requires that you regularize your solution. To find the best regularization parameter often requires cross validation, but by looking at a framework known as Evidence Approximation, this allows us to reformulate regression optimization into a form where we can analytically find the most likely value for the regularization parameter. What results is a lightweight algorithm that requires no parameter-tuning.

(Note: This post is based on my collection of reading notes from Section 3.5 of Chris Bishop’s book “Pattern Recognition and Machine Learning”. I highly recommend this text, and encourage you to look at this source for derivations and deeper insights.)

## Some theory…

$y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^T \boldsymbol\phi(\mathbf{x})$

where $\boldsymbol\phi(\mathbf{x})$ is a vector of basis function values. Using linear basis functions is powerful, as it lets you model non-linear relationships in your data. Unfortunately, selecting the best basis functions is an art and often times you need to take a “kitchen sink” approach and throw in as many basis functions as possible.

As soon as you do this though you run into overfitting problems as well as numerical problems as your design matrix can become ill-conditioned. The solution to this, of course, is to add a regularization penalty on $\mathbf{w}$, and the common practice is to use cross validation to find the right value of $\alpha$.

It turns out that by taking a Bayesian look at the linear regression problem, we can derive a solution for the value of the regularization parameter that does not require cross validation, rather it’s easily estimated from your training data. Let’s dig into this:

We start with a generative model that measures the likelihood of your observations $\{\mathbf{t},\mathbf{x}\}$,

$p(\mathbf{t}|\mathbf{w},\alpha,\beta)=p(\mathbf{t}|\boldsymbol\Phi^T\mathbf{w},\beta) p(\mathbf{w}|\mathbf{0},\alpha\mathbf{I})$

Where the probability density functions on the right are both Gaussian, $\boldsymbol\Phi$ is the design matrix of basis functions evaluated at your input points $\mathbf{x}$, and $\beta$ and $\alpha$ are precision parameters for their respective distributions. Note that the last distribution on the r.h.s. is a prior on the distribution of weights $\mathbf{w}$ with zero mean and precision $\alpha \mathbf{I}$.

Our goal is to find the most likely value for $\alpha$ and $\beta$ given the training data, this requires that we integrate out $\mathbf{w}$. Hence, we have,

$p(\mathbf{t}|\alpha,\beta ) =\int p(\mathbf{t} | \boldsymbol\Phi^T\mathbf{w}, \beta) p(\mathbf{w}|\alpha \mathbf{I})\delta\mathbf{w}$. (1)

As we’ll see shortly, this solution leads to an optimally regularized linear regression solution.

We can re-write Equation 1 as,

$(\frac{\beta}{2\pi})^{N/2} (\frac{\alpha}{2\pi})^{M/2}\int-E(\mathbf{w})\delta\mathbf{w}$

where $E(\mathbf{w}) = \frac{\beta}{2}||\mathbf{t} - \boldsymbol\Phi^T\mathbf{w}||^2 + \frac{\alpha}{2}\mathbf{w}^T\mathbf{w}$, N is the number of training points and M is the number of basis functions.

Here is where the connection to regularized linear regression comes out. Notice that $E( \mathbf{w} )$ is the same the regularized least squares problem where the regularization parameter $\lambda = \alpha / \beta$. Also note that optimization of the log-likelihood of (1) w.r.t $\mathbf{w}$ reduces to the optimization of $E( \mathbf{w} )$.

Our goal at this point is to find the most likely $\alpha$ and $\beta$ within $p(\mathbf{t}|\alpha,\beta )$. This gives us an implicit value for the regularization parameter, at which point we are free to estimate the most likely value for $\mathbf{w}$ using the least squares solution. There are a few methods for doing this, for example, EM or trying to find a direct solution. In this post we look at the direct solution. You can follow the derivation in Section 3.5 of Bishop’s text. The solutions for $\alpha$ and $\beta$ are,

$\alpha=\frac{\gamma}{\mathbf{m}^T\mathbf{m}}$,

$\beta=\frac{1}{N-\gamma}||\mathbf{t}-\boldsymbol\Phi \mathbf{m}||^2$

where $\gamma=\sum{\frac{\lambda_i}{\alpha+\lambda_i}}$ where the  $\lambda_i$‘s are the eigenvalues of the matrix $\beta\boldsymbol\Phi^T\boldsymbol\Phi$, $\mathbf{m}=\beta \mathbf{A}^{-1} \boldsymbol\Phi^T\mathbf{t}$, and $\mathbf{A}=\alpha\mathbf{I}+\beta \boldsymbol\Phi^T\boldsymbol\Phi$.

These solutions are implicit, but we can estimate $\alpha$ and $\beta$ by plugging in initial estimates and iterating over the recurrence relations until convergence.

## Testing it out…

I wrote a Matlab script implementing the solution (You can download the code below). Again, the nice thing about this solution is that all you need to provide is the training data and your design matrix and an optimally “regularized” solution just pops out.

I tested this method out on the classic “Series G”  data set of Box and Jenkins  which represents airline passenger totals from 1949 to 1960. The dataset is displayed as the solid blue curve in the image above. Noticing the obviously periodic trends I created a design matrix that contained sinusoids with various periods and offsets. Specifically I added sinusoids $\phi_i(x) = sin(ax + b)$ for$a$ between .25 and 10 in steps of .25 and $b$ between 0 and $\pi$ in steps of $\pi/4$. Also, noticing the general uptrend in the data I added a single linear function to the design matrix.

The results are satisfying. The method is able to find the correct weights for $\alpha$  and $\beta$, such that it picks up the periodic uptrend, all without needing to resort to cross validation to tune parameters.