Want to predict and prevent major incidents? Stop looking for them!
Wouldn’t it be great if we all got a warning before a major incident occurs, so that we can prevent it from happening? That may sound like science fiction, but it is not. We just need to start looking differently and use different tools. Below is the story of how we discovered what to do differently.
The signs are there
Some years ago I was working in a large IT organization where the number of Major Incidents was getting out of hand. Obviously this had a direct negative effect on customer satisfaction, so the VP of operations gave a chosen few (moi included) carte blanche to do whatever was needed to turn this trend around.
First order of business was to perform high quality root cause analysis on the Major Incidents that had occurred in the past 2 years. We made a major discovery: nearly 90% of those showed early warnings. Some of those showed up hours or days before the actual service disruption. We found the early warnings in a variety of different sources: system and device logging, performance and capacity data across different services and components. This meant that we had a chance of predicting Major Incidents if we just figured out how and where to look.
Big data, big dilemma
Looking back at data after a Major Incident is not too difficult. You know the starting point, or at least the general time frame and technical area in which to look. So how can we turn this around so that we can start looking forward? Potentially the warning signs can be found in combinations of many data sources that are at least remotely relevant. A few examples: number of log messages, number of logins, holiday planning, network capacity, number of user tickets, #hitsonsocialmedia, outside temperature, highway traffic patterns, number of days until Christmas, free storage percentage, the Dusseldorf Airport skytrain time schedule, number of days since last reboot, MTBF, employee satisfaction, etc. All of these sources have at one time actually been related to a major incident that I was involved in. Some via correlation and some via causality.
This raises a few challenges. Firstly, it is hard to predict which sources are at least remotely relevant. So let’s not filter and use ALL available data? Yes, good plan, but not so easy to implement. Not everything is readily available, and some is even protected by law. Also, who would consider the schedule of the Tour de France even remotely relevant to predicting a major incident in IT? (Yes, it was.)
The second challenge is that this concerns a huge amount of data and many different sources. In most professional IT environments you could easily run into thousands of servers, routers, switches, firewalls, load balancers, databases, applications, etc. Collecting all that data over your network could potentially cause a major incident itself. And if you do manage to collect it you’ll have a huge bucket of bytes that no human can ever make sense of. It is simply too much to eyeball or pivot through.
Lucky for us, some people had already coined the concept of Big Data and written software to take care of this. The root cause analysis we had done enabled us to discover which data sources statistically showed the most and best early warning signs. So we added those sources to our big data environment first. At the start, over 4 million new lines of log data were added to the central big data repository every day from thousands of devices in near real time.
But then our next challenge came along: what do we ‘ask’ the big data system? “Please predict our next Major Incident days in advance?” Answer: “SYNTAX ERROR”
Learning what to look for
Traditionally, system and network monitoring works as follows: a threshold is reached for a monitored device (component) and this causes an event or incident ticket to be generated (yes, a bit too simplistic, I know). Nothing wrong with this. However, it is limited in time and scope, and limited to detecting things that you can think of beforehand. Traditional monitoring searches for something that is expected to be found. It was clear that the traditional way of monitoring was not going to increase our predictive capabilities.
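To make that limitation concrete, here is a minimal sketch of threshold-style monitoring. It is an illustration only, not the tooling we used: the metric names, limits and the `check_thresholds` helper are all made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str   # hypothetical metric name
    limit: float  # fixed limit chosen in advance by a human


def check_thresholds(sample, thresholds):
    """Return one alert per metric whose value exceeds its predefined limit."""
    alerts = []
    for t in thresholds:
        value = sample.get(t.metric)
        if value is not None and value > t.limit:
            alerts.append(f"{t.metric}={value} exceeds {t.limit}")
    return alerts


# Only the metrics somebody thought of beforehand are watched.
thresholds = [Threshold("cpu_percent", 90.0), Threshold("disk_used_percent", 85.0)]

# Assumed sample: CPU and disk look healthy, but the log volume is unusual.
sample = {"cpu_percent": 42.0, "disk_used_percent": 70.0, "log_messages_per_min": 5000}

alerts = check_thresholds(sample, thresholds)
print(alerts)
```

Nothing is reported for this sample, because nobody defined a threshold for the log message rate. The check can only find what it was told to expect.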
Each time we found warning signs looking back at a major incident we said: “sigh, why didn’t I think of that before?”. This made us realize we were continually searching for something without having a clue what it looked like. We needed a new monitoring approach that helps us find something we don’t know we’re looking for.
We looked again at the early warning signs and discovered 3 ways they showed themselves: correlation, trends and patterns. Looking for these should not be limited to a single component, but has to be done across time, scope, service and any other dimension.
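As a rough illustration of the correlation part, the sketch below computes a Pearson correlation between two counters from different sources, with an optional time lag so that one series can act as an early warning for the other. The series, their names and the `lagged_pearson` helper are assumptions made for this example, not data or code from our environment.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation information
    return cov / sqrt(var_x * var_y)


def lagged_pearson(early, late, lag):
    """Correlate `early` at time t with `late` at time t + lag."""
    return pearson(early[: len(early) - lag], late[lag:])


# Made-up hourly counts: tickets follow connection retries one hour later.
connection_retries = [1, 5, 2, 8, 3, 1]
user_tickets = [0, 1, 5, 2, 8, 3]

same_hour = lagged_pearson(connection_retries, user_tickets, lag=0)
one_hour_later = lagged_pearson(connection_retries, user_tickets, lag=1)
print(same_hour, one_hour_later)
```

In this toy data the correlation is much stronger with a one hour lag than within the same hour, which is the kind of relation that makes one source a warning sign for another. Correlation alone says nothing about causality, so a result like this is a reason to investigate, not a conclusion.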
Now we started to understand what we should ask our Big Data system. We started building dashboards that showed the trends and patterns in messages on different devices, stacking multiple graphs of multiple sources to see if we could spot correlation. I started to build a small team that specialized in using these tools to find stuff they weren’t looking for. Instead of monitoring for a red message or an alert, I asked them to find anything that stands out, that deviates from a trend, anything that forms a pattern that grabs attention. And on a day to day basis we started learning how to look. Each day we understood the behavior of the data better, we tweaked the graphs, and we prioritized new data sources to be added. The tools enabled us to compare trends and patterns of today with historic data. To be clear: we were looking at the number of messages much more than the content.
The biggest challenge remained: how to query the big data system effectively. This was still something we as humans needed to figure out. Most of the Big Data tools that we came across helped cope with huge amounts of data, and enabled effective and fast querying of that data and visualization of the outcome. The predictive and intelligent analytical capabilities were still mostly up to the humans (this was 1.5 years ago). So far I’ve been impressed with many of those Big Data tools, but only one surprised me with its artificial intelligence. The difference is that most tools helped us find clues we didn't know we were looking for, rather than finding the clues for us automatically.
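The idea of comparing today's message volume with historic data can be sketched in a few lines. This is a simplified stand-in for what we did by eye on dashboards, assuming per-hour message counts for one device. The `volume_outliers` function, the sample numbers and the z-score limit of 3 are illustrative choices, not settings from our environment.

```python
from statistics import mean, stdev


def volume_outliers(history, today, z_limit=3.0):
    """Flag hours where today's message count deviates from the same hour in history.

    history: list of past days, each a list of per-hour message counts
             (at least two days are needed to compute a standard deviation)
    today:   list of per-hour message counts for the current day
    Returns a list of (hour, count, z_score) tuples.
    """
    flagged = []
    for hour, count in enumerate(today):
        past = [day[hour] for day in history]
        mu = mean(past)
        sigma = stdev(past)
        if sigma == 0:
            # History is perfectly flat: any change at all is a deviation.
            z = 0.0 if count == mu else float("inf")
        else:
            z = (count - mu) / sigma
        if abs(z) > z_limit:
            flagged.append((hour, count, z))
    return flagged


# Made-up counts for four hours on three earlier days.
history = [
    [100, 110, 105, 98],
    [102, 108, 107, 101],
    [99, 111, 103, 100],
]
# Today: hour 2 shows a surge, hour 3 goes almost silent.
today = [101, 109, 240, 20]

flagged = volume_outliers(history, today)
print([hour for hour, _, _ in flagged])
```

Note that the sudden silence in the last hour is flagged just like the surge. A device that stops logging can be as much of a warning sign as one that starts shouting, and a fixed upper threshold would never notice it.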
So to me the real key to Big Data is knowing what question to ask, how to visualize the data and how to position the tool towards what you are trying to achieve.
Did it help?
Yes! Even in the first week of our new way of monitoring with Big Data tools we started to predict incidents and act upon them. Some of them might have escalated to major incidents, but we didn’t let that happen.
Because we were looking at our entire IT landscape we also found a lot of garbage. The landscape needed hygiene. Superfluous messages, connection retries to systems that were long gone, small system failures that were insignificant to the service. At least 35% of the data we were looking at was overhead: either logging that was irrelevant, or device behavior that was unwanted. This not only used a lot of system and network resources, but also made it harder for us to spot patterns and trends. So in the next months 80% of our time was spent on a major cleanup of the landscape.
The third way in which it helped was in context of problem solving. Unfortunately we could not predict and prevent all prio 1 incidents and major incidents. But we could use our Big Data tooling to look back at huge amounts of relevant data when an incident occurred. This enabled us to speed up the incident resolution considerably.
It is hard to say how many major incidents were prevented this way. But the cost of non-quality that we prevented far outweighed the cost of our efforts. Not only by saving on down time and repair time, but most importantly in customer satisfaction and trust.
These efforts were part of an improvement program that reduced the total number of incidents by 50% in at least 2 consecutive years. It also reduced the number of major incidents by more than 50% and increased the number of days without p1 from 0 per year to over 20 per year. The incident prediction and analysis efforts I described were one of the major contributors to these successes, next to better problem management and risk analysis for changes.
This article is mostly about the trigger or detection part of incident prevention. It is useless without effective problem solving and analysis methods and capabilities. If you think: “We already have ITIL, 5 Why and Ishikawa, so we’re set”, then you’re in even more trouble than you may realize 😊 At the time of this story I introduced the CoThink RATIO approach in our organization to help us get better at analysis and solving (of which I became such a fan that I decided to go work for them). Also, the operational management and priorities in the organization need drastic change in most cases in order to enable this new(er) approach to monitoring and prevention of incidents. Contact me if you want to discuss this further.
Key takeaways
- Most major incidents can be predicted!
- Use Big Data tooling in a way that lets you find something you are not searching for: embrace serendipity!
- Train people to ask the right questions, to find trends, correlation and patterns.
- Learn from history and from each step you take on this journey: analyze! Each p1 or major incident that does occur should be analyzed to inspire new ways of questioning your Big Data.
- Use effective problem solving and analysis methods! (here's a tip)
I can’t wait to see the next developments in machine learning and how that can help predict and prevent incidents even further. In the meantime, I hope to have inspired you to look at the topic from a different viewpoint. Thanks for reading!
PS Special thanks to the few cowboys who were with me on this fast journey forward to the future. I didn't do all of this alone, which is why I wrote it in 'we' form.