Spurious Correlations – What They Are and Why They Matter

In an earlier post I mentioned that one of the big benefits of geospatial technology is its ability to show connections between complex and often disparate data sets. As you work with Big Data you tend to see the value of these multi-layered and often multi-dimensional perspectives of a trend or event. While that can lead to incredible results, it can also lead to spurious correlations of data.

First, let me state that I am not a Data Scientist or Statistician, and there are definitely people far more expert on this topic than I am. But if you are like the majority of companies out there experimenting with geospatial and big data, it is likely that your company doesn’t have these experts on staff. So a little awareness, understanding, and caution can go a long way in this type of scenario.

Before we dig into that more, let’s think about what your goal is:

  • Do you want to be able to identify and understand a particular trend – reinforcing actions and/or behavior? –OR–
  • Do you want to understand what triggers a specific event – initiating a specific behavior?

Both are important, but they are different. My personal focus has been on identifying trends so that you can leverage or exploit them for commercial gain. While that may sound a bit ominous, it is really what business is all about.

There is a popular saying that goes, “Correlation does not imply causation.” A common example is that at a large fire you may see a large number of fire trucks. There is a correlation, but it does not imply that fire trucks cause fires. Now, extending this analogy, let’s assume that in a major city the probability of multi-tenant buildings catching fire is relatively high. Since it is a big city, it is likely that most of those apartments or condos have WiFi hotspots. A spurious correlation would be to conclude that WiFi hotspots cause fires.

As you can see, there is definitely potential to misunderstand the results of correlated data. A more careful analysis would lead you to see that the type of building (multi-tenant residential housing) is related both to fire risk and to technology (WiFi) or income (middle-class or higher), which is why the two appear to move together. Taking the next step to understand the findings, rather than accepting them at face value, is very important.
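To make the WiFi example concrete, here is a minimal sketch in Python. The numbers are entirely made up for illustration, and the names (`buildings`, `pearson`, `correlation_for`) are my own, not from any real data set or library. The idea is simply to compare the correlation across all buildings with the correlation inside each building type.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical, hand-made data: (building type, WiFi hotspots, fires on record).
# Multi-tenant buildings have both more hotspots and more fires.
buildings = [
    ("single-family", 1, 0),
    ("single-family", 2, 1),
    ("single-family", 3, 0),
    ("multi-tenant", 20, 3),
    ("multi-tenant", 30, 4),
    ("multi-tenant", 40, 3),
]

def correlation_for(rows):
    """Correlation between hotspot count and fire count for the given rows."""
    hotspots = [row[1] for row in rows]
    fires = [row[2] for row in rows]
    return pearson(hotspots, fires)

overall = correlation_for(buildings)
single = correlation_for([b for b in buildings if b[0] == "single-family"])
multi = correlation_for([b for b in buildings if b[0] == "multi-tenant"])

print(f"All buildings:      {overall:.2f}")
print(f"Single-family only: {single:.2f}")
print(f"Multi-tenant only:  {multi:.2f}")
```

In this toy data the correlation across all buildings is strong, but it disappears once you look inside each building type. That pattern is a hint that building type, not WiFi, is doing the work. Real data will rarely be this clean, so treat this as a way of thinking about the problem rather than a test you can rely on by itself.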

Once you have what looks to be an interesting correlation, there are many fun and interesting things you can do to validate, refine, or refute your hypothesis. It is likely that even without high-caliber data experts and specialists you will be able to identify correlations and trends that can provide you and your company with a competitive advantage. Don’t let the potential complexity become an excuse for not getting started, because, as you can see above, it is possible to gain insight and create value with a little effort and simple analysis.