We spent a lot of time in our statistics class on regressions. A regression simply measures the relationship between the mean value of the output, or ‘y’, variable and the values of the input, or ‘x’, variables. It lets us predict a value for ‘y’ given the values of certain ‘x’ variables.
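To make "predict a value for ‘y’ given ‘x’" concrete, here is a minimal sketch in Python with NumPy. The numbers are made up purely for illustration (they are not from our class), and the variable names `hours` and `score` are my own.

```python
import numpy as np

# Made-up numbers: weekly hours of exercise (x) and a health score (y).
hours = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
score = np.array([50.0, 54.0, 61.0, 63.0, 70.0, 74.0])

# Fit score = intercept + slope * hours by ordinary least squares.
slope, intercept = np.polyfit(hours, score, deg=1)


def predict(x):
    """Predicted mean health score for x hours of exercise."""
    return intercept + slope * x


print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"predicted score at 6 hours: {predict(6.0):.2f}")
```

The fitted line passes through the mean of the data, which is one way to see that a regression is about the mean value of ‘y’.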
For any regression, we go through the following steps:
1. Think of a model of y. Let’s say we are attempting to predict the health of a person. We need to first think of all the factors that result in good health.
So, good health = f(good diet) + f(exercise) + f(sleep) + (a collection of other minor variables, or noise)
It is important to capture the major factors in this model so as to guard against omitted variable bias, or OVB.
2. Sort out causality. We then think about whether the relationship is cause-and-effect. This deserves a blog post in itself, as this “identification” error is rampant across many studies. Here, we could argue that, largely, good health is caused by behavior. That said, we are probably missing genetic predisposition to good health. That is an example of a variable that should go into the equation, because leaving it out is a sure-fire source of OVB.
3. Clean your data. Now it is time to examine the data we have. Look at trends and outliers.
4. Match the data to the theory to see if you have the right variables. This is when we check if we have data for diet, exercise, sleep and genetics, as we’re trying to predict the average health of a patient. We also need to check if we have a good proxy for average health, e.g. number of doctor visits per year.
5. Check for “action” in the key variables. Next, we need to make sure that variables like diet, exercise, etc., have enough variation. Without variation, it is impossible to link movement in the variables with movement in the output.
6. Briefly examine a simple model. Run a first regression with a simple model and check for relationships and significance (or level of confidence in our relationships).
7. Settle on an empirical method. Are we assuming a linear relationship? Is there any chance an improvement in diet, exercise or sleep could result in a percentage change in health, in which case a log model may fit better than a linear one?
8. Check for multicollinearity and heteroskedasticity. These are big words. Checking for multicollinearity simply means checking whether the x variables in the model are correlated with each other. For example, are good diet and exercise correlated? If everyone who has a good diet also exercises a certain amount, perhaps we could just include one of the two. The key here is to understand that multicollinearity doesn’t bias the coefficient estimates. It just inflates their standard errors, which makes it harder to tell the effects of the correlated variables apart. So, definitely important to check, but no need to panic.
Heteroskedasticity, on the other hand, means the spread of the residuals (the unexplained part of the regression) is not constant. I won’t go into depth here, but the quick check is to see whether there is a pattern in the residuals. A pattern can also be a hint that we don’t have the right empirical model.
9. Check for robustness. The robustness check is an interesting one: it involves removing variables or groups of variables to see if they make a difference to the output. For example, one check would be to remove sleep from the regression and see if it makes a difference. If it doesn’t, we can safely take it out. The idea here is to use as few variables as necessary to check whether there is a relationship between the variables and the output.
10. Run the final regression and interpret your results.
(Note: We have intentionally not discussed the R-squared of the regression. The R-squared isn’t a good predictor of regression accuracy, especially in models where future trends aren’t likely to mirror historical trends. Low R-squared models in very hard-to-predict situations can still tell us a lot.)
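Several of the steps above (5, 8, 9 and 10) can be sketched in a few lines of Python with NumPy. Everything here is an assumption made for illustration: the data is synthetic, the coefficients 2.0, 1.5 and 0.5 are invented, and the helper names `ols` and `vif` are my own. The residual check at the end is a rough screen, not a formal test.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic, made-up data: exercise is deliberately correlated with diet.
diet = rng.normal(0.0, 1.0, n)
exercise = 0.7 * diet + rng.normal(0.0, 0.7, n)
sleep = rng.normal(0.0, 1.0, n)
health = 2.0 * diet + 1.5 * exercise + 0.5 * sleep + rng.normal(0.0, 1.0, n)

names = ["diet", "exercise", "sleep"]
X = np.column_stack([diet, exercise, sleep])


def ols(X, y):
    """Return (coefficients, standard errors, residuals); the intercept comes first."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = len(y) - A.shape[1]
    sigma2 = resid @ resid / dof
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    return beta, se, resid


def vif(X, j):
    """Variance inflation factor: 1 / (1 - R^2) from regressing column j on the others."""
    others = np.delete(X, j, axis=1)
    _, _, resid = ols(others, X[:, j])
    r2 = 1.0 - resid.var() / X[:, j].var()
    return 1.0 / (1.0 - r2)


# Step 5: is there enough "action" (variation) in each input?
variation = X.std(axis=0)
for name, s in zip(names, variation):
    print(f"std({name}) = {s:.2f}")

# Step 8a: multicollinearity. A VIF well above 1 means inflated standard errors.
vifs = [vif(X, j) for j in range(X.shape[1])]
for name, v in zip(names, vifs):
    print(f"VIF({name}) = {v:.2f}")

# Step 10: the "final" regression on all three inputs.
beta_full, se_full, resid = ols(X, health)
fitted = health - resid
print("full model coefficients:", np.round(beta_full, 2))
print("full model std errors:  ", np.round(se_full, 2))

# Step 8b: rough heteroskedasticity screen. Do squared residuals grow with the fit?
hetero_corr = np.corrcoef(fitted, resid ** 2)[0, 1]
print(f"corr(fitted, resid^2) = {hetero_corr:.2f}")

# Step 9: robustness. Drop sleep and see how much the other coefficients move.
beta_small, se_small, _ = ols(X[:, :2], health)
print("without sleep:", np.round(beta_small, 2))
```

Because diet and exercise are built to be correlated here, their VIFs should come out noticeably above that of sleep, while the coefficient estimates themselves stay close to the values used to generate the data.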
Sorry for the long-drawn statistics lesson. Now for what I find most interesting: step 10, running that final regression, is a 10-second task followed by a few minutes of interpreting the result. So, 90% or more of the time we’ve spent on the analysis has gone into preparing the data for regression.
I find that analogous to all of life’s great processes: 90% of most great processes is indeed prep time.