Averages and statistics

A friend who is in the market for a home recently shared how averages and statistics have little predictive value. While they might give you a sense of the likely range of outcomes, your unique preferences may land you well above or below these numbers. You are, after all, a data point.

I’ve observed that the same is true for the other challenging activity that many of us go through – job searching. When we’re out there looking for the next gig, it is tempting to get caught in the “average and statistics” zone by asking “do I stand a chance based on people like me who’ve been through the process before?” 

While it is important to know these numbers and use them to avoid spending all our time on unlikely events, the flip side holds as well. We are just one data point. If we’re focused and thoughtful about how we approach our search, averages and statistics can matter less than we think.

Understand them, then learn to ignore them.

90% of all analysis is prep time – MBA Learnings

We spent a lot of time in our statistics class on regressions. A regression simply measures the relation between the mean value of the output or ‘y’ variable and the values of the input or ‘x’ variables. It lets us predict a value for ‘y’ given the values of certain ‘x’ variables.

For any regression, we go through the following steps –

1. Think of a model of y. Let’s say we are attempting to predict the health of a person. We need to first think of all the factors that result in good health.

So, good health = f(good diet) + f(exercise) + f(sleep) + (collection of other minor variables or noise)

It is important to include the key factors in this model so as to guard against omitted variable bias or OVB.

2. Sort out causality. We then think about whether the relationship is cause-and-effect. This deserves a blog post in itself as this “identification” error is rampant across many studies. Here, we could argue that, largely, good health is caused by behavior. That said, we are probably missing genetic predisposition to good health. That is an example of a variable that should go into the equation because leaving it out is a sure-shot source of OVB.

3. Clean your data. Now it is time to examine the data we have. Look at trends and outliers.

4. Match the data to the theory to see if you have the right variables. This is when we check if we have data for diet, exercise, sleep and genetics as we’re trying to predict the average health of a patient. We also need to check if we have a good proxy of average health – e.g. number of doctor visits per year.

5. Check for “action” in the key variables. Next, we need to make sure that variables like diet, exercise, etc., have enough variation. Without variation, it is impossible to link movement in the variables with movement in the output.

6. Briefly examine a simple model. Run a first regression with a simple model and check for relationships and significance (or level of confidence in our relationships).

7. Settle on an empirical method. Are we assuming a linear relationship? Is there any chance an improvement in diet or exercise or sleep could result in a percentage change in health rather than a fixed change?

8. Check for multicollinearity and heteroskedasticity. These are big words. Checking for multicollinearity simply means checking whether the x variables in the model are correlated with each other. For example, are good diet and exercise correlated? If everyone who has a good diet also exercises a certain amount, perhaps we could just include one of the two. The key here is to understand that multicollinearity doesn’t bias the coefficient estimates. It just inflates their standard errors, so it gets harder to tell the effects of the correlated variables apart. So, it is definitely important to check but there is no need to panic.

Checking for heteroskedasticity, on the other hand, helps tell us whether we have the right empirical model. I won’t go into depth here but this is a quick check to see if there is a pattern in the spread of the residuals (the part of the output the regression leaves unexplained).

9. Check for robustness. The robustness check is an interesting check – it involves removing variables or groups of variables to see if they make a difference to the output. For example, one check would be to remove sleep from the regression and see if it makes a difference. If it doesn’t, we can safely take it out. The idea here is to use as few variables as necessary to check whether there is a relationship between the variables and the output.

10. Run the final regression and interpret your results.
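To make a few of these steps concrete, here is a rough sketch of steps 5, 6, 8 and 9 using only NumPy. Everything in it is an assumption for illustration: the data is synthetic, the "true" coefficients are made up, and `ols` and `vif` are small hypothetical helpers written for this post rather than functions from any library. The heteroskedasticity line is a crude eyeball-style check, not a formal test.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Made-up data (an assumption for illustration, not real health data).
# Exercise is deliberately built to be correlated with diet.
diet = rng.normal(size=n)
exercise = 0.8 * diet + 0.6 * rng.normal(size=n)
sleep = rng.normal(size=n)
health = 1.0 + 0.5 * diet + 0.3 * exercise + 0.2 * sleep + rng.normal(size=n)


def ols(y, *columns):
    """Least squares with an intercept: coefficients, standard errors, residuals."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se, resid


def vif(target, *others):
    """Variance inflation factor: 1 / (1 - R^2) of one x regressed on the other x's."""
    _, _, resid = ols(target, *others)
    r2 = 1.0 - resid.var() / target.var()
    return 1.0 / (1.0 - r2)


# Step 5: is there "action" (variation) in each key variable?
spread = {"diet": diet.std(), "exercise": exercise.std(), "sleep": sleep.std()}

# Step 6: a first, simple model with diet only.
# Its diet coefficient also absorbs part of the omitted exercise effect.
beta_simple, se_simple, _ = ols(health, diet)

# Step 8a: multicollinearity. A VIF well above 1 flags a correlated x.
vif_diet = vif(diet, exercise, sleep)
vif_sleep = vif(sleep, diet, exercise)

# Fuller model with all three variables.
beta_full, se_full, resid_full = ols(health, diet, exercise, sleep)

# Step 8b: crude heteroskedasticity check. Do squared residuals move with the fit?
fitted = health - resid_full
spread_pattern = np.corrcoef(fitted, resid_full ** 2)[0, 1]

# Step 9: robustness. Drop sleep and see how much the diet coefficient shifts.
beta_no_sleep, _, _ = ols(health, diet, exercise)
diet_shift = beta_full[1] - beta_no_sleep[1]

print("spread:", spread)
print("diet coefficient, simple vs full:", beta_simple[1], beta_full[1])
print("diet standard error, simple vs full:", se_simple[1], se_full[1])
print("VIF diet, VIF sleep:", vif_diet, vif_sleep)
print("residual spread pattern:", spread_pattern)
print("shift in diet coefficient without sleep:", diet_shift)
```

In this made-up setup, the standard error on diet is larger once the correlated exercise variable joins the model, which is the multicollinearity effect described in step 8. On real data, a library such as statsmodels would likely be a more convenient choice than hand-rolled helpers.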

(Note: We have intentionally not discussed the R-squared of the regression. The R-squared isn’t a good predictor of regression accuracy, especially in models where future trends aren’t likely to mirror historical trends. Low R-squared models in very hard-to-predict situations can still tell us a lot.)
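Here is a small sketch of that last point, again on synthetic data with made-up numbers (the slope of 0.5 and the noise level are assumptions chosen for illustration). The output is mostly noise, so the R-squared comes out very low, yet the regression can still pin down the slope reasonably well.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Made-up data: a real but weak effect (slope 0.5) buried in a lot of noise.
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=3.0, size=n)

# Fit y = intercept + slope * x by least squares.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# R-squared: share of the variation in y that the model explains.
centered = y - y.mean()
r_squared = 1.0 - (resid @ resid) / (centered @ centered)

# Standard error and t-statistic of the slope.
sigma2 = resid @ resid / (n - 2)
se_slope = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
t_slope = beta[1] / se_slope

print("R-squared:", r_squared)
print("estimated slope:", beta[1], "standard error:", se_slope)
print("t-statistic:", t_slope)
```

A low R-squared here says individual outcomes are hard to predict. It does not say the relationship between x and y is absent or uninteresting.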

Sorry for the long-drawn statistics lesson. Now for what I find most interesting – step 10, or running that final regression, is a 10-second task followed by a few minutes interpreting the result. So, 90% or more of the time we’ve spent on the analysis has gone into preparing the data for regression.

I find that analogous to all of life’s great processes – 90% of most great processes is indeed prep time.