I think I have a cleaner way to explain Diagnosis Tip #2 in my post on diagnosing problems with your machine learning algorithm. Many machine learning solutions boil down to defining a model specified by some parameter $latex \theta$ and then creating an objective or error function $latex J( \theta )$ that you optimize to find the best possible $latex \theta_{OPT}$. Diagnosis Tip #2 is about figuring out when there is a problem with your formulation of $latex J( \theta )$. It helps you answer the question, “Is there a problem with the formulation of my optimization function? Or is my formulation correct, but there is a problem in my implementation (e.g., non-convergence, a software bug)?”

As an example, consider the problem of fitting a linear model to a regression training set. $latex \theta$ is the set of weights that defines your linear model, and $latex J(\theta)$ is the error function to be minimized, a sum of squared errors. Say you train your model and your test results are awful. What now? Is there a problem with your error function? If so, you need to change it, perhaps by adding a regularization term or using the LASSO. But what if the sum of squared errors is the right formulation? Then maybe you have a bug in your software keeping it from converging to a solution. How do you figure out which it is? Here’s a way that will help get to the bottom of things.
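To make this concrete, here is a minimal Python sketch of the setup. None of this code is from the original post; the data is synthetic, and names like `train_gradient_descent` are my own. The trainer is deliberately stopped after only a few steps so it can play the part of a broken (non-converged) optimizer in the diagnostic below:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set: y = 1 + 3x plus a little noise.
X = np.column_stack([np.ones(50), rng.uniform(-1.0, 1.0, 50)])
y = X @ np.array([1.0, 3.0]) + rng.normal(0.0, 0.1, 50)

def J(theta, X, y):
    """The error function: sum of squared errors (smaller is better)."""
    r = X @ theta - y
    return r @ r

def train_gradient_descent(X, y, lr=0.1, steps=5):
    """Gradient descent on J. `steps` is deliberately tiny, so this solver
    stops far from the optimum -- standing in for the non-convergence or
    bug that the diagnostic is supposed to catch."""
    theta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(steps):
        theta -= lr * (2.0 / n) * X.T @ (X @ theta - y)
    return theta

theta_opt = train_gradient_descent(X, y)
```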

Come up with an independent, alternate method of generating a solution; call it $latex \theta_{ALT}$. Construct $latex \theta_{ALT}$ so that it beats the test error of $latex \theta_{OPT}$, the solution of your optimization problem. The best way of doing this is to derive $latex \theta_{ALT}$ from a human-crafted solution (or perhaps from a heavier-duty algorithm).
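Continuing the sketch, one way to get an independent $latex \theta_{ALT}$ is a closed-form least-squares solve via `np.linalg.lstsq`: it attacks the same problem but shares no code with the gradient-descent trainer, so it cannot share its bugs. This is just one possible choice of alternate method:

```python
# Independent solution: closed-form least squares from the library.
theta_alt, *_ = np.linalg.lstsq(X, y, rcond=None)

# A hand-crafted theta_alt would do just as well, e.g. weights eyeballed
# from a plot of the data:
# theta_alt = np.array([1.0, 3.0])
```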

Now plug $latex \theta_{ALT}$ into $latex J(\theta)$. What should we expect? Because we have “cheated” and hand-crafted $latex \theta_{ALT}$ to have good test performance, we should find that $latex J( \theta_{ALT} ) > J ( \theta_{OPT} )$ (assume for the moment that we are maximizing $latex J(\theta)$; for an error function like the sum of squared errors, which we minimize, the inequalities flip). If this is indeed the case, then our formulation of $latex J(\theta)$ is correct and the problem lies in the implementation of our optimization algorithm: a correctly working optimizer would never return a $latex \theta_{OPT}$ that scores worse under $latex J(\theta)$ than some other $latex \theta$.

On the other hand, if we find that $latex J( \theta_{ALT} ) < J ( \theta_{OPT} )$, then the formulation of $latex J(\theta)$ is incorrect: $latex \theta_{ALT}$ performs better on the test set, so the inequality should have gone the other way around. $latex J(\theta)$ is rewarding the wrong solution, and you need to go back and reformulate the function.
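Putting the diagnostic together on the running sketch (again, an illustration under my assumptions above, not the post’s own code). Because this $latex J(\theta)$ is minimized rather than maximized, the inequalities from the discussion above flip:

```python
# Held-out test set drawn from the same process as the training data.
X_test = np.column_stack([np.ones(20), rng.uniform(-1.0, 1.0, 20)])
y_test = X_test @ np.array([1.0, 3.0]) + rng.normal(0.0, 0.1, 20)

# Sanity check first: the diagnostic only applies if theta_alt really
# beats theta_opt on the test set.
print("test error, theta_opt:", J(theta_opt, X_test, y_test))
print("test error, theta_alt:", J(theta_alt, X_test, y_test))

if J(theta_alt, X, y) < J(theta_opt, X, y):
    # theta_alt is better under J *and* on the test set, yet the
    # optimizer missed it: the formulation is fine, so suspect the
    # implementation (bug, non-convergence).
    print("J(theta_ALT) < J(theta_OPT): check the optimizer implementation.")
else:
    # theta_opt wins under J but loses on the test set: J is not
    # measuring what we actually care about, so reformulate it.
    print("J(theta_ALT) >= J(theta_OPT): reformulate J.")
```

With the deliberately under-trained gradient descent above, this prints the first branch: the alternate solution scores better under $latex J(\theta)$, pointing the finger at the optimizer rather than the formulation.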