Root Cause Analysis
Too often people, including consultants, spend time trying to solve the wrong problem because they have incomplete or incorrect information. I was once investigating a series of performance problems and unplanned outages that were assumed to be two separate problems. As I gathered information, several people offered anecdotes about anomalous behavior in a variety of systems, speculated about the “real problem,” and described “chasing ghosts” during previous attempts to resolve it.
I remember stating that I was there to solve a real problem that was having a serious negative impact on production, and that I did not intend to chase ghosts or do anything else that would waste time. Next, I outlined the approach I would use to determine the root cause and said that we would reconvene to discuss the real problem and potential solutions. A few people scoffed and felt that this was a waste of time and money.
The process we followed was simple, structured, and logical. I took everything that was known to be true and mapped it out. I looked for patterns, commonalities, and intersections of systems and events. Within two days my team and I had identified a complex root cause involving multiple components, and we demonstrated that it reliably reproduced the symptoms our client was experiencing. From there we worked with their teams to make minor network changes, system configuration changes, and several small application changes.
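To make the mapping step concrete, here is a minimal sketch of the idea in Python. It is not the tooling we used on that engagement; the incident names, component names, and helper functions are all hypothetical and exist only to illustrate looking for intersections among confirmed facts.

```python
from collections import Counter

# Hypothetical incident records: each maps an observed event to the
# components confirmed (not rumored) to be involved. All names are invented.
incidents = {
    "outage-1": {"load-balancer", "app-server", "database"},
    "outage-2": {"app-server", "database", "batch-job"},
    "slowdown-1": {"load-balancer", "database", "batch-job"},
}


def common_components(records):
    """Return the components present in every incident."""
    sets = list(records.values())
    return set.intersection(*sets) if sets else set()


def component_frequency(records):
    """Count how many incidents each component appears in."""
    return Counter(c for comps in records.values() for c in comps)


# Components shared by every incident are the first candidates to examine.
print(common_components(incidents))
# Components that recur often, but not always, may be contributing factors.
print(component_frequency(incidents).most_common())
```

A component that appears in every incident is only a candidate, not a conclusion. In our case the root cause involved several components interacting, so recurring combinations deserve as much attention as any single item.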
By the end of the second week, they were no longer experiencing major slowdowns or unplanned outages. Each outage cost this company tens of thousands of dollars in lost sales due to the time-sensitive nature of their product. Within one week they had recovered the cost of hiring me and my team. What stuck with us was how many really smart people “believed in ghosts” and failed to focus on the information that they already had.
A few years later we decided to create a white paper to help others in need of a simple, structured approach. Below is a link to that white paper, which was written by one of the top people on my team. We received very positive feedback at the time, so it seems it could still be useful today. Please take a look and let me know what you think.