“Is your optimization function correct?” Revisited

I think I have a cleaner way to explain Diagnosis Tip #2 in my post on diagnosing problems with your machine learning algorithm. Many machine learning solutions boil down to defining a model that is specified by some parameter \theta and then creating an optimization function or error function J(\theta) that you use to find the best possible \theta_{OPT}. Diagnosis Tip #2 is about figuring out when there is a problem with your formulation of J(\theta). It helps you answer the question, “Is there a problem with my optimization function formulation? Or is my formulation correct, and the problem lies instead in my implementation (e.g., non-convergence, a software bug, etc.)?”

As an example, consider the problem of fitting a linear model to a regression training set. \theta is the set of weights that defines your linear model, and J(\theta) is the error function to be minimized, here a sum of squared errors. Say you run your model training and your test results are awful; what now? Is there a problem with your error function? If so, you need to change it, perhaps by adding a regularization term or switching to the LASSO. But what if the sum of squared errors is the right objective? Then maybe you have a bug in your software that keeps it from converging to a solution. How do you figure out what’s wrong? Here’s a way that will help you get to the bottom of things.
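To make this concrete, here is a minimal sketch of that setup in Python. The helper names and the plain gradient-descent trainer are illustrative choices of mine, not something prescribed by the diagnostic:

```python
import numpy as np

# Illustrative setup (helper names are my own): X is an (n, d) design
# matrix, y the targets, and theta the weight vector of the linear model
# y_hat = X @ theta.
def J(theta, X, y):
    """The objective we are minimizing: the sum of squared errors."""
    residuals = X @ theta - y
    return float(residuals @ residuals)

def fit_gradient_descent(X, y, lr=1e-3, n_iters=10_000):
    """One possible trainer for theta_OPT. A bug here (wrong gradient sign,
    bad learning rate, stopping too early) is exactly the kind of
    implementation problem the diagnostic is meant to expose."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = 2.0 * X.T @ (X @ theta - y)  # gradient of the SSE
        theta -= lr * grad
    return theta
```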

Come up with an independent, alternative method of generating a solution; call it \theta_{ALT}. Construct \theta_{ALT} so that it beats the test error of \theta_{OPT}, the solution of your optimization problem. The best way of doing this is to derive \theta_{ALT} from a hand-crafted solution (or perhaps from a heavier-duty algorithm).
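For the regression example, here are a couple of illustrative ways to produce such a \theta_{ALT} (the helper names, the ridge penalty, and the placeholder weights are my own choices); whichever you use, verify separately that it really does beat \theta_{OPT} on the test metric:

```python
import numpy as np

# (a) A "heavier duty" independent algorithm: closed-form ridge regression,
#     solved directly instead of going through our own training loop.
def theta_alt_from_ridge(X, y, lam=1.0):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# (b) A hand-crafted solution: weights chosen from domain knowledge or copied
#     from a known-good baseline (these numbers are placeholders).
theta_alt_handcrafted = np.array([2.0, -1.0, 0.5])

# Whichever route you take, confirm on held-out data that theta_ALT really
# does beat theta_OPT on test error before plugging it into J(theta).
```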

Now plug \theta_{ALT} into J(\theta). What should we expect? Because we have “cheated” and hand-crafted \theta_{ALT} to have good performance, we should find that J(\theta_{ALT}) > J(\theta_{OPT}) (assume for the moment that we’re maximizing J(\theta)). If this is indeed the case, then our formulation of J(\theta) is correct, and the problem is in the implementation of our optimization algorithm: the optimizer returned \theta_{OPT} even though a \theta with a higher value of J exists, so it never actually maximized J.

On the other hand, if we find that J(\theta_{ALT}) < J(\theta_{OPT}), then J(\theta) is incorrect: the optimizer did its job and found a \theta with a higher value of J, yet the better-performing \theta_{ALT} scores worse on J, so maximizing J(\theta) does not correspond to doing well on the metric you actually care about. You need to go back and reformulate the function.
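Putting the two cases together for the regression example, where J(\theta) is being minimized rather than maximized so the inequalities above flip, here is a rough sketch of the diagnostic. It assumes the J, fit_gradient_descent, and theta_alt_from_ridge helpers from the earlier snippets, and that \theta_{ALT} has already been confirmed to beat \theta_{OPT} on test error:

```python
import numpy as np

# Minimal sketch of the diagnostic for the running regression example.
# Because this J is *minimized*, the inequalities flip relative to the
# maximization discussion in the text.
def diagnose(X_train, y_train):
    theta_opt = fit_gradient_descent(X_train, y_train)  # our optimizer's answer
    theta_alt = theta_alt_from_ridge(X_train, y_train)   # independent alternative

    if J(theta_alt, X_train, y_train) < J(theta_opt, X_train, y_train):
        # theta_alt scores better on J as well, so our optimizer never actually
        # minimized J: suspect the implementation (bug, bad learning rate,
        # stopping before convergence, ...).
        return "suspect the optimization implementation"
    else:
        # Our optimizer drove J at least as low as the better-performing
        # theta_alt, yet test results are still worse: J is being optimized
        # correctly but measures the wrong thing. Reformulate J(theta).
        return "suspect the formulation of J(theta)"
```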

 
