In an earlier post I mentioned that one of the big benefits of geospatial technology is its ability to show connections between complex and often disparate data sets. As you work with Big Data you tend to see the value of these multi-layered and often multi-dimensional perspectives of a trend or event. While that can lead to incredible results, it can also lead to spurious correlations of data.
First, let me state that I am not a Data Scientist or Statistician, and there are definitely people far more expert on this topic than I am. But if you are like the majority of companies out there experimenting with geospatial and big data, it is likely that your company doesn’t have these experts on staff. So, a little awareness, understanding, and caution can go a long way in this type of scenario.
Before we dig into that more, let’s think about what your goal is:
- Do you want to be able to identify and understand a particular trend – reinforcing actions and/or behavior? –OR–
- Do you want to understand what triggers a specific event – initiating a specific behavior?
Both are important, but they are different. My personal focus has been on identifying trends so that you can leverage or exploit them for commercial gain. While that may sound a bit ominous, it is really what business is all about.
There is a popular saying that goes, “Correlation does not imply causation.” A common example is that at a large fire you may see a large number of fire trucks. There is a correlation, but it does not imply that fire trucks cause fires. Now, extending this analogy, let’s assume that in a major city the probability of multi-tenant buildings catching fire is relatively high. Since it is a big city, it is also likely that most of those apartments or condos have WiFi hotspots. A spurious correlation would be to imply that WiFi hotspots cause fires.
As you can see, there is definitely potential to misunderstand the results of correlated data. More logical analysis would lead you to see the relationships between the type of building (multi-tenant residential housing) and technology (WiFi) or income (middle-class or higher). Taking the next step to understand the findings, rather than accepting them at face value, is very important.
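The confounding effect described above is easy to demonstrate. Here is a minimal sketch, using made-up synthetic data, in which building density drives both the number of WiFi hotspots and the number of fires – so the two correlate strongly even though neither causes the other:

```python
import random

random.seed(42)

# Hypothetical data: building density is the hidden confounder.
# Denser blocks have both more WiFi hotspots and more fires.
n = 1000
density = [random.random() for _ in range(n)]            # buildings per block
wifi = [10 * d + random.gauss(0, 0.5) for d in density]  # hotspots per block
fires = [3 * d + random.gauss(0, 0.5) for d in density]  # fires per block

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Strong positive correlation, despite no causal link between the two.
print(round(pearson(wifi, fires), 2))
```

Controlling for the confounder (comparing only blocks of similar density) would make the spurious relationship largely disappear.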
Once you have what looks to be an interesting correlation there are many fun and interesting things you can do to validate, refine, or refute your hypothesis. It is likely that even without high-caliber data experts and specialists you will be able to identify correlations and trends that can provide you and your company with a competitive advantage. Don’t let the potential complexity become an excuse for not getting started, because as you can see above it is possible to gain insight and create value with a little effort and simple analysis.
Two years ago I was assigned some of the product management and product marketing work for a new version of a database product we were releasing. To me this was the trifecta of bad fortune. I didn’t mind product marketing but knew it took a lot of work to do it well. I didn’t feel that product management was a real challenge (I was so wrong here), and I really didn’t want to have anything to do with maps.
I was so wrong in so many ways. I didn’t realize that real product management was just as much work as product marketing. And, I learned that geospatial was far more than just maps. It was quite an eye-opening experience for me – one that turned out to be very valuable as well.
First, let me start by saying that I now have a huge appreciation for Cartography. I never realized how complex mapmaking really is, and how there is just as much art to it as there is science (a lot like programming). Maps can be so much more than just simple drawings.
I had a great teacher when it came to geospatial – Tyler Mitchell (@spatialguru). He showed me the power of overlaying tabular business data with common spatial data (addresses, zip / postal codes, coordinates) and presenting the “conglomeration of data” in layers that made things easier to understand. People buy easy, so that is good in my book.
The more I thought about this technology – simple points, lines, and areas combined with powerful functions – the more I began to think about other uses. I realized that you could use it to correlate very different data sets and graphically show relationships that would otherwise be extremely difficult to see.
Think about having access to population data, demographic data, business and housing data, crime data, health / disease data, etc. Now, think about a simple, easy-to-use graphical dashboard that lets you overlay as many of those data sets as you want. Within seconds you see very specific clusters of data that are correlated geographically.
Some data may only be granular to a zip code or city, but other data will allow you to identify patterns down to specific streets and neighborhoods. Just think of how something so simple can help you make much better decisions. The interesting thing is how few businesses are really taking advantage of this cost-effective technology.
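The layering idea above boils down to joining disparate data sets on a shared geographic key. Here is a minimal sketch – all zip codes and figures are made up for illustration – of merging crime, income, and WiFi counts per zip code into a single overlay:

```python
# Hypothetical per-zip-code data sets (all numbers invented).
crime_by_zip = {"94102": 120, "94110": 85, "94123": 15}
income_by_zip = {"94102": 48000, "94110": 62000, "94123": 115000}
wifi_by_zip = {"94102": 310, "94110": 275, "94123": 190}

# "Overlay" the layers: one merged record per geographic key.
layers = {"crime": crime_by_zip, "income": income_by_zip, "wifi": wifi_by_zip}
overlay = {
    zip_code: {name: data[zip_code] for name, data in layers.items()}
    for zip_code in crime_by_zip
}

# Each record now carries every layer's value for that zip code.
for zip_code, attrs in sorted(overlay.items()):
    print(zip_code, attrs)
```

A real system would use spatial joins (point-in-polygon, buffers, and so on) rather than exact key matches, but the principle – one merged record per location – is the same.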
If that wasn’t enough, just think about location-aware applications and the proliferation of smart devices that lend themselves to so many helpful and lucrative mobile applications. Even more than that, these capabilities make those devices more helpful and user friendly. Just think about how easy it is to find the nearest Indian restaurant when the thought of curry for lunch hits you. And these things are just the tip of the iceberg.
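That “nearest restaurant” lookup is a classic nearest-neighbor query over coordinates. A simple sketch, using the standard haversine great-circle formula and invented restaurant locations:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))  # 6371 km = mean Earth radius

# Hypothetical restaurant coordinates (made up for illustration).
restaurants = {
    "Curry House": (37.7793, -122.4193),    # near downtown SF
    "Tandoor Place": (37.8044, -122.2712),  # Oakland
    "Spice Route": (37.3382, -121.8863),    # San Jose
}

me = (37.7749, -122.4194)  # roughly downtown San Francisco
nearest = min(restaurants, key=lambda name: haversine_km(*me, *restaurants[name]))
print(nearest)
```

Production systems use spatial indexes (R-trees, geohashes) instead of scanning every candidate, but the distance math underneath is the same.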
What a lucky day it was for me when I was assigned this work that I did not want. Little did I know that it would change the way that I think about so many things. That’s just the way things work out sometimes.
Ever since I worked on redesigning a risk management system at an insurance company (1994-1995), I have been impressed by how much better decisions can be made with more data – assuming it is the right data. The concept of “What is the right data?” has intrigued me for years, as what seems like common sense today could have been unknown 5-10 years ago and could be completely passé 5-10 years from now. Context becomes very important because of the variability and relevance of data over time.
This is what makes Big Data interesting. There really is no right or wrong answer or definition. Having a framework to define, categorize, and use that data is important. And at some point being able to refer to the data in-context will be very important as well. Just think about how challenging it could be to compare scenarios or events from 5 years ago with those of today. It’s likely not an apples-to-apples comparison but could certainly be done. The concept of maximizing the value of data is pretty cool stuff.
The way I think of Big Data is similar to a water tributary system. Water enters the system in many ways – rain from the clouds, sprinklers fed from private and public supplies, runoff, overflow, etc. It also has many interesting dimensions, such as quality/purity (not necessarily the same due to different aspects of need), velocity, depth, capacity, and so forth. Not all water gets into the tributary system (e.g., some is absorbed into the groundwater tables, and some evaporates) – just as some data loss should be anticipated.
If you think in terms of streams, ponds, rivers, lakes, reservoirs, deltas, etc. there are many relevant analogies that can be made. And just like the course of a river may change over time, data in our “big data” water tributary system could also change over time.
Another part of my thinking is based on an experience I had about a decade ago (2002 – 2003 timeframe) working on a project for a Nanotech company. In their labs, they were testing various things. There were particles embedded in shingles and paint that changed reflectivity based on temperature. There were very small batteries that could be recharged tens of thousands of times, were light, and had more capacity than a common 12-volt car battery.
And, there was a section where they were doing “biometric testing” for the military. I have since read articles about things like smart fabrics that could monitor the health of a soldier and do things like apply basic first aid and notify others once a problem was detected. This company felt that by 2020 advanced nanotechnology would be widely used by the military, and by 2025 it would be in wide commercial use. Is that still a possibility? Who knows…
Much of what you read today is about the exponential growth of data. I agree with that, but as stated earlier – and this is important – I believe that the nature and sources of that data will change significantly. For example, nanoparticles in engine oil will provide information about temperature, engine speed and load, and even things like rapid changes in movement (fast take-offs or stops, quick turns). Nanoparticles in the paint will provide weather conditions. Nanoparticles in the seat upholstery will provide information about occupants (number, size, weight). Sort of like the “sensor web,” from the original Kevin Delin perspective. A lot of “Information of Things” data will be generated, but then what?
I believe that time will become an important aspect of every piece of data, and that location (X, Y, and Z coordinates) will be just as important. But, not every sensor will collect location (spatial data). I do believe there will be multiple data aggregators in common use at common points (your car, your house, your watch). Those aggregators will package the available data in something akin to an XML object, which allows flexibility. From my perspective, this is where things become very interesting relative to commercial use and data privacy.
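To make the aggregator idea concrete, here is a minimal sketch of one such packager. The element names, sensor names, and coordinates are all hypothetical; it simply stamps a batch of readings with time and location and emits a flexible XML object, as described above:

```python
import time
import xml.etree.ElementTree as ET

def package_readings(readings, lat, lon, alt):
    """Hypothetical aggregator: wrap sensor readings in a timestamped,
    location-stamped XML batch. Element/attribute names are invented."""
    batch = ET.Element("sensor-batch", timestamp=str(int(time.time())))
    ET.SubElement(batch, "location", lat=str(lat), lon=str(lon), alt=str(alt))
    for sensor, value in readings.items():
        ET.SubElement(batch, "reading", sensor=sensor, value=str(value))
    return ET.tostring(batch, encoding="unicode")

# Example: a car-mounted aggregator packaging two engine sensors.
xml_doc = package_readings(
    {"oil-temp-c": 92.5, "engine-rpm": 2400},
    lat=44.98, lon=-93.27, alt=250.0,
)
print(xml_doc)
```

Because XML (or JSON) is schema-flexible, the same envelope can carry whatever sensors happen to be present – which is exactly why an aggregator at the car, house, or watch level does not need to know every sensor in advance.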
Currently, companies like Google make a lot of money from aggregating data from multiple sources, correlating it to a variety of attributes, and then selling knowledge derived from that plethora of data. I believe that there will be opportunities for individuals to use “data exchanges” to manage, sell, and directly benefit from their own data. The more interesting their data, the more value it has and the more benefit it provides to the person selling it. This could have a huge economic impact, and that would foster both the use and expansion of various commercial ecosystems required to manage the commercial and privacy aspects of this technology.
The next logical step in this vision is “smart everything.” For example, you could buy a shirt that is just a shirt. But, for an extra cost, you could turn-on medical monitoring or refractive heating/cooling. And, if you felt there was a market for extra dimensions of data that could benefit you financially, then you could enable those sensors as well. Just think of the potential impact that technology would make to commerce in this scenario.
This is what I personally believe will happen within the next decade or so. This won’t be the only type of or use of big data. Rather, there will be many valid types and uses of data – some complementary and some completely discrete. It has the potential to become a confusing mess. But, people will find ways to ingest, categorize, and correlate data to create value with it – today or in the future.
Utilizing data will become an increasing competitive advantage for the people and companies who know how to do something interesting and useful with it. Who knows what will be viewed as valuable data 5-10 years from now, but it will likely be different from what we view as valuable data today.
So, what are your thoughts? Can we predict the future based on the past? Or, is it simply enough to create platforms that are powerful enough, flexible enough, and extensible enough to change our understanding as our perspective of what is important changes? Either way it will be fun!