Big Data

The Unsung Hero of Big Data


Earlier this week, I read a blog post regarding the recent Gartner Hype Cycle for Advanced Analytics and Data Science, 2015. The Gartner chart reminded me of the epigram, “Plus ça change, plus c’est la même chose” (“the more things change, the more they stay the same”).

To some extent, that is true, as you could consider today’s Big Data a derivative of yesterday’s VLDBs (very large databases) and Data Warehouses. One of the biggest changes, IMO, is the shift away from Star Schemas and practices implemented for performance reasons, such as aggregating data sets, using derived and encoded values, and using surrogate and foreign keys to establish linkage. Going forward, it may not be possible to maintain that much rigidity and still be as responsive as needed from a competitive perspective.

There are many dimensions to big data: a huge sample of data (volume), which becomes your universal set and supports deep analysis as well as temporal and spatial analysis; a variety of data (structured and unstructured) that often does not lend itself to SQL-based analytics; and, often, data streaming in (velocity) from multiple sources – an area that will become even more important in the era of the Internet of Things. These are the “Three V’s” people have talked about for the past five years.

Like many people, my interest in Object Database technology initially waned in the late 1990s. That is, until about four years ago when a project at work led me back in this direction. As I dug into the various products, I learned they were alive and doing well in several niche areas. That finding led to a better understanding of the real value of object databases.

Some products try to be “All Vs to all people,” but generally, what works best is a complementary, integrated set of tools working together as services within a single platform. It makes a lot of sense. So, back to object databases.

One of the things I like most about my job is the business development aspect. One of the product families I’m responsible for is Versant, which includes the Versant Object Database (VOD: high performance, high throughput, high concurrency) and Fast Objects (great for embedded applications). I’ve met and worked with brilliant people who have created amazing products based on this technology. Creative people like these are fun to work with, and helping them grow their business is mutually beneficial. Everyone wins.

An area where VOD excels is the near real-time processing of streaming data. The reason it is so adept at this task is the way objects are mapped out in the database: they are stored in a way that essentially mirrors reality. So, optionality is not a problem – no disjoint queries or missed data, no complex query gyrations to get the correct data set. Things like sparse indexing are no problem with VOD. This means that pattern matching is quick and easy, as is more traditional rule and look-up validation. Polymorphism allows objects, functions, and even data to have multiple forms.
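To make that concrete, here is a minimal sketch in plain Python (3.10+, for the match statement) of the general idea – not Versant’s actual API, just an illustration of how a polymorphic object model with optional, sparse attributes sidesteps the NULL-handling and join gymnastics of a rigid schema:

```python
from dataclasses import dataclass
from typing import Optional

# Base event type; subclasses add only the fields they need, so
# "optionality" is modeled directly instead of as NULL columns.
@dataclass
class Event:
    source: str
    timestamp: float

@dataclass
class NetworkEvent(Event):
    src_ip: str = ""
    dst_ip: str = ""
    geo: Optional[str] = None   # sparse attribute: present only when known

@dataclass
class SensorEvent(Event):
    temperature: Optional[float] = None
    humidity: Optional[float] = None

def suspicious(event: Event) -> bool:
    """Polymorphic rule check: each event is matched on what it is,
    with no joins or missing-column workarounds."""
    match event:
        case NetworkEvent(geo=g) if g in {"blocked-region-1", "blocked-region-2"}:
            return True
        case SensorEvent(temperature=t) if t is not None and t > 90.0:
            return True
        case _:
            return False

# A toy stream of heterogeneous events (invented values).
stream = [
    NetworkEvent("fw01", 1.0, "10.0.0.5", "10.0.0.9", geo="blocked-region-1"),
    SensorEvent("rack7", 2.0, temperature=95.5),
    NetworkEvent("fw02", 3.0, "10.0.0.6", "10.0.0.7"),  # geo unknown: no problem
]
alerts = [e for e in stream if suspicious(e)]
print(alerts)
```

The point of the sketch: each object carries exactly the attributes it has, and the rules dispatch on type, so adding a new event shape means adding a class and a case, not reworking a schema.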


VOD does more by allowing data to be more, which is ideal for environments where change is the norm: cybersecurity, fraud detection, threat detection, logistics, and heuristic load optimization. In each case, performance, accuracy, and adaptability are the keys to success.

The ubiquity of devices generating data today, combined with the desire of people and companies to leverage that data for commercial and non-commercial benefit, is very different from what we saw 10+ years ago. Products like VOD are working their way up the Slope of Enlightenment because there is a need to connect the dots better and faster – especially as the volume and variety of those dots increase. It is not a “one size fits all” solution, but it is often the perfect tool for this type of work.

These are indeed exciting times!

Big Data – The Genie is out of the Bottle!


Back in early 2011, the other members of the Ingres executive team and I were taking a bet on the future of our company. We knew we needed to do something big and bold, so we decided to build what we thought the standard data platform would be in 5-7 years. A small minority of the team did not believe this was possible and left; the rest focused on making it happen. We made three strategic acquisitions to fill in the gaps in our Big Data platform. Today (as Actian), we have nearly achieved our goal. It was a leap of faith back then, but our vision turned out to be spot-on, and our gamble is paying off today.

My mailbox fills daily with stories, seminars, white papers, etc., about Big Data. While the subject feels increasingly mainstream, it is interesting to read and hear the range of comments on it. They run from “It’s not real” and “It’s irrelevant” to “It can be transformational for your business” to “Without big data, there would be no <insert company name here>.”


What I continue to find amazing is hearing comments about big data being optional. It’s not – that genie has already been let out of the bottle. There are incredible opportunities for those companies that understand and embrace the potential. I like to tell people that big data can be their unfair advantage in business. Is that really the case? Let’s explore that assertion and find out.

We live in the age of the “Internet of Things.” Data about nearly everything is everywhere, as are the tools to correlate that data into an understanding of so many things (activities, relationships, likes and dislikes, etc.). With smart devices that enable mobile computing, we have the extra dimension of location. And with new technologies such as Graph Databases (commonly queried with SPARQL), graphical interfaces to analyze that data (such as Sigma), and identification techniques such as Stylometry, it is getting easier to identify and correlate that data. Someday, this will feed into artificial intelligence, becoming a superpower for those who know how to leverage it effectively.

We are generating larger and larger volumes of data about everything we do and everything going on around us, and tools are evolving to make sense of that data better and faster than ever. Organizations that perform the best analysis, get the answers fastest, and act on that insight quickly are more likely to win than organizations that look at a smaller slice of the world or adopt a “wait and see” posture. That seems like a significant advantage in my book. But is it an unfair advantage?

First, let’s remember that big data is just another tool. Like most tools, it has the potential for misuse and abuse. Whether a particular application is viewed as “good” or “bad” depends on the goals and perspective of the entity using the tool (which may be the polar opposite of the view held by the people it targets). So, I will not attempt to judge the various use cases but rather present a few and let you decide.

Scenario 1 – Sales Organization: What if you could understand what a prospect company tells you it needs and had a way to validate and refine that understanding? That’s half the battle in sales (budget, integration, and support / politics are other key hurdles). Imagine data that helped you understand not only the actions of that organization (customers and industries, sales and purchases, gains and losses, etc.) but also the stakeholders’ and decision-makers’ goals, interests, and biases. This could provide a holistic view of the environment and allow you to make a highly targeted offering, with messaging tailored to each individual. That is possible, and I’ll explain shortly.

Scenario 2 – Hiring Organization: Many questions cannot be asked by a hiring manager. While I’m not an attorney, I would bet that State and Federal laws have not kept pace with technology. And while those laws vary state by state, there are likely loopholes allowing public records to be used. Moreover, implied data that is not officially considered could color the judgment of a hiring manager or organization. For instance, if you wanted to “get a feeling” that a candidate might fit in with the team or the culture of the organization or have interests and views that are aligned with or contrary to your own, you could look for personal internet activity that would provide a more accurate picture of that person’s interests.

Scenario 3 – Teacher / Professor: There are already sites in use to search for plagiarism in written documents, but what if you had a way to make an accurate determination about whether an original work was created by your student? There are people who, for a price, will do the work and write a paper for a student. So, what if you could not only determine that the paper was not written by your student but also determine who the likely author was?

Do some of these things seem impossible or at least implausible? Personally, I don’t believe so. Let’s start with the typical data that our credit card companies, banks, search engines, and social network sites already have related to us. Add to that the identified information available for purchase from marketing companies and various government agencies. That alone can provide a pretty comprehensive view of us. But there is so much more that’s available.

Consider the potential of gathering information from intelligent devices accessible through the Internet, your alarm and video monitoring system, etc. These are intended to be private data sources, but one thing history has taught us is that anything accessible is subject to unauthorized access and use (just think about the numerous recent credit card hacking incidents).

Even de-identified data (medical / health / prescription / insurance claim data is one major example), which receives much less protection and can often be purchased, could be correlated with a reasonably high degree of confidence to reveal other “private” aspects of your life. The key is to look for connections (websites, IP addresses, locations, businesses, people) and things that are logically related (such as illnesses / treatments / prescriptions), and then accurately identify individuals (stylometry looks at things like sentence complexity, function words, co-location of words, misspellings and misuse of words, etc., and will likely someday take into consideration things like idea density). It is nearly impossible to remain anonymous in the Age of Big Data.
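As a toy illustration of the stylometry idea (my own example, nowhere near a production system), here is a sketch that compares function-word frequencies between a known writing sample and an unknown one – the hypothetical `known_sample` and `unknown_paper` texts are placeholders you would supply:

```python
import re
from math import sqrt

# A handful of common function words; real stylometric studies use
# hundreds of features (function words, sentence lengths, n-grams, etc.).
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "that", "is", "it", "but"]

def profile(text: str) -> list[float]:
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1
    return [words.count(w) / total for w in FUNCTION_WORDS]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two feature vectors (1.0 = identical mix)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

known_sample = "..."    # writing known to be by the person in question
unknown_paper = "..."   # the document of disputed authorship
score = cosine_similarity(profile(known_sample), profile(unknown_paper))
print(f"Stylometric similarity: {score:.3f}")  # higher = more alike
```

Function words are attractive features because writers use them unconsciously and consistently, which is exactly why this kind of fingerprinting makes anonymity so hard to maintain.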

There has been a paradigm shift regarding the practical application of data analysis, and the companies that understand this and embrace it will likely perform better than those that don’t. There are new ethical considerations that arise from this technology, and likely new laws and regulations as well. But for now, the race is on!

Spurious Correlations – What They Are and Why They Matter



In an earlier post, I mentioned that one of the big benefits of geospatial technology is its ability to show connections between complex and often disparate data sets. As you work with Big Data, you tend to see the value of these multi-layered and often multi-dimensional perspectives of a trend or event. While that can lead to incredible results, it can also lead to spurious data correlations.

First, let me state that I am not a Data Scientist or Statistician, and there are definitely people far more expert on this topic than I am. But if yours is like the majority of companies out there experimenting with geospatial and big data, you likely don’t have these experts on staff. So, a little awareness, understanding, and caution can go a long way.

Before we dig into that more, let’s think about what your goal is:

  • Do you want to be able to identify and understand a particular trend – reinforcing actions and/or behavior? –OR–
  • Do you want to understand what triggers a specific event – initiating a specific behavior?

Both are important, but they are different. My focus has been identifying trends so that you can leverage or exploit them for commercial gain. While that may sound a bit ominous, it is really what business is all about.

A popular saying goes, “Correlation does not imply causation.” A common example: you will see many fire trucks at a large fire. There is a correlation, but it does not imply that fire trucks cause fires. Now, extending this analogy, let’s assume that the probability of a fire starting in a multi-tenant building in a major city is relatively high. Since it is a big city, it is likely that most of those apartments or condos have WiFi hotspots. A spurious correlation would imply that WiFi hotspots cause fires.

As you can see, there is definitely the potential to misunderstand the results of correlated data. A more logical analysis would lead you to see the relationships between the type of building (multi-tenant residential housing) and technology (WiFi) or income (middle-class or higher). Taking the next step to understand the findings, rather than accepting them at face value, is very important.
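A quick simulation makes the point. In this sketch (all numbers invented, purely illustrative), building density is a hidden confounder that drives both the hotspot count and the fire count, so the two correlate strongly even though neither causes the other:

```python
import random

random.seed(42)

# Hidden confounder: number of multi-tenant buildings per district.
districts = [random.randint(10, 500) for _ in range(200)]

# Both observed variables are driven by building density, plus noise.
wifi_hotspots = [b * 3 + random.gauss(0, 20) for b in districts]
fires = [b * 0.05 + random.gauss(0, 2) for b in districts]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Prints a correlation near 1.0, yet hotspots do not cause fires:
# both simply track the number of buildings. Controlling for the
# confounder (e.g., fires per building) makes the "link" vanish.
print(f"r(wifi, fires) = {pearson(wifi_hotspots, fires):.2f}")
```

The fix suggested in the comment – normalizing by the confounder before correlating – is exactly the kind of simple sanity check that catches most spurious findings.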

Once you have what looks to be an interesting correlation, there are many fun and interesting things you can do to validate, refine, or refute your hypothesis. It is likely that even without high-caliber data experts and specialists, you will be able to identify correlations and trends that can provide you and your company with a competitive advantage. Don’t let the potential complexity become an excuse for not getting started. With a little effort and simple analysis, it is possible to gain insight and create value.

There’s a story in there – I just know it…


I was reading an article from Nancy Duarte about Strengthening Culture with Storytelling, and it made me think about how important a skill storytelling can be in business and how much more effective it can be than simply presenting facts and data. What follows are a few examples; you probably have many of your own.

One of the best salespeople I’ve ever known wasn’t a salesperson at all: Jon Vice, former CEO of the Children’s Hospital of Wisconsin. Jon is very personable and has the ability to make each person feel like the most important person in the room (quite a skill in itself). Jon would talk to a room of people and tell a story. Mid-story, you were hooked. You completely bought what he was selling, often without knowing what the “ask” was. It was an amazing thing to experience.

Years ago, when my company was funding medical research projects, my oldest daughter (then only four years old) and I watched a presentation on the mid-term findings of one of the projects. The MD/Ph.D. giving the presentation was impressive, but what he showed was slide after slide of data. After 10-15 minutes, my daughter held her Curious George stuffed animal up in front of her (where the shadow would be seen on the screen) and proclaimed, “Boring!”

Six months later, that same person gave his wrap-up presentation. It was short and told an interesting story that explained why these findings were important, laying the groundwork for a follow-on project. A few years later he commented that his initial presentation became a valuable lesson. That was when he realized the story the data told was far more compelling than just the data itself.

A few years ago, the company I work for introduced a high-performance analytics database. We touted that our product was 100 times faster than other products, which happened to be a similar message used by a handful of competitors. In my region, we created a “Why Fast Matters” webinar series and told the stories of our early Proof of Value efforts. This helped my team make the first few sales of this new product and change the approach the rest of the company used to position this product. People understood our value proposition because these success stories made the facts tangible.

I tell my teams to weave the thread of our value proposition into the fabric of a prospect’s story. This makes us part of the story and makes this new story their own (as opposed to our story). This simple approach has been very effective.

What if you’re not selling anything? Your data still tells a story – even more so with big data. Whether you are analyzing data from a single source (such as audit or log data) or correlating data from multiple sources, the data has a story to tell. Whether it is patterns, trends, or correlated events, the story is there. And once you find it, there is so much you can do to build it out.

Whether you are selling, managing, teaching, coaching, analyzing, or just hanging out with friends or colleagues, being able to entertain with a story is a valuable skill. It is also a great way to make many things more interesting and memorable in business. So, give it a try.

Getting Started with Big Data



Being in Sales, I have the opportunity to speak to many customers and prospects about many things. Most are interested in Cloud Computing and Big Data, but often they don’t fully understand how they will leverage the technology to maximize the benefits.

Here is a simple three-step process that I use:

1. For Big Data, I explain that there is no single correct definition. Because of this, I recommend that companies focus on what they need rather than what to call it. Results are more important than definitions for these purposes.

2. Relate the technology to something people are likely already familiar with (extending those concepts). For example: Cloud computing is similar to virtualization and has many of the same benefits; Big Data is similar to data warehousing. This helps make new concepts more tangible.

3. Provide a high-level explanation of how “new and old” differ and why new is better, using specific examples they can relate to. For example: Cloud computing often occurs in an external data center (possibly one whose location you don’t even know), so security can be even more complex than for in-house systems and applications. It is possible to have both Public and Private Clouds, and a public cloud from a major vendor may be more secure and easier to implement than a similar system using your own hardware.

Big Data is a little bit like my first house. I was newly married and anticipated having children and, eventually, moving into a larger house. My wife and I started buying things that fit our vision of the future and storing them in our basement. We were planning for a future that was not 100% known.

But our vision changed over time, and we did not know exactly what we needed until the end. After 7 years, our basement was very full, and finding things was difficult. When we moved to a bigger house, we did have a lot of what we needed. But we also had many things we no longer wanted or needed, and there were a few things we wished we had purchased earlier. We did our best, and most of what we did was beneficial, but those purchases were speculative, and in the end, there was some waste.

How many of you would have thought Social Media Sentiment Analysis would be important 5 years ago? How many would have thought that hashtag usage would become so pervasive in all forms of media? How many understood the importance of location information (and even the time stamp for that location)? My guess is fewer than half of all companies.

This ambiguity is both the good and the bad of big data. In the old data warehouse days, you knew what was important because it was your data about your business, systems, and customers. While IT may have seemed tough in the past, it can be much more challenging now. But the payoff can also be much larger, so it is worth the effort. You often don’t know what you don’t know – and you just need to accept that.

Now we care about unstructured data (website information, blog posts, press releases, tweets, etc.), streaming data (stock ticker data is a common example), sensor data (temperature, altitude, humidity, location, lateral and horizontal forces), temporal data, etc. Data arrives from multiple sources and likely will have multiple time frame references (e.g., constant streaming versus updates with varying granularity), often in unknown or inconsistent formats. Someday soon, data from all sources will be automatically analyzed to identify patterns and correlations and gain other relevant insights.
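As one small example of what “multiple time frame references” means in practice, here is a sketch that normalizes timestamps arriving in mixed formats onto a single UTC timeline. The formats and the assume-UTC policy are assumptions for illustration; real feeds will differ:

```python
from datetime import datetime, timezone

# Hypothetical formats seen across sources; real feeds will vary.
KNOWN_FORMATS = [
    "%Y-%m-%dT%H:%M:%S%z",      # ISO 8601 with offset, e.g. tick data
    "%Y-%m-%d %H:%M:%S",        # naive local time from a sensor log
    "%d/%b/%Y:%H:%M:%S %z",     # web server access-log style
]

def to_utc(raw: str) -> datetime:
    """Try each known format and normalize to UTC. Naive times are
    assumed to already be UTC - a policy decision, not a given."""
    for fmt in KNOWN_FORMATS:
        try:
            dt = datetime.strptime(raw, fmt)
        except ValueError:
            continue
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)
        return dt.astimezone(timezone.utc)
    raise ValueError(f"Unrecognized timestamp format: {raw!r}")

samples = [
    "2014-07-01T09:30:00+0200",
    "2014-07-01 07:30:00",
    "01/Jul/2014:03:30:00 -0400",
]
print([to_utc(s).isoformat() for s in samples])
```

Even this toy version shows why integration is hard: every source needs its formats cataloged, and every naive timestamp forces a judgment call about what clock it was written against.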

Robust and flexible data integration, data protection, and data privacy will all become far more important in the near future! This is just the beginning for Big Data.