Using Big Data to solve Big Problems!
The combination of powerful methodologies with a new way of searching and monitoring is needed to tackle complex problems now and in the future.
The future of monitoring is also the future of problem prediction and prevention. A good intelligent big data analytics toolset and an effective problem solving & troubleshooting approach are needed to be ready for the complex problems of the future. However, we also need to start looking at monitoring in a very different way.
In the beginning ..
In the beginning we had console messages. After that came icons with different colors for status, and then event windows with color-coded event descriptions. Nowadays we have a mixture of all of that. What they have in common is that a smart engineer at some point had to define the thresholds and event criteria that determine when the system alerts the operators. This means that this type of monitoring is limited by what can be preconceived as an issue or anomaly. In other words, the expertise within a technology silo determines the quality of its monitoring. It often also means that the monitoring itself is organized in silos, so little cross-service or cross-technology correlation is possible.
Enter the future ..
With the increasing complexity of IT there are a few new factors to consider. IT service and infrastructure stacks are increasingly fragmented, and the fragments obviously need to work together. What I see a lot in complex problems is that while the individual components show no issues or anomalies, there is still a problem somewhere in the combination of, or collaboration between, the components. Traditional monitoring won't find this, as it is mostly restricted to silos and to what can be preconceived and defined. So what to do about it?
Over 90% of P1s show warning signs: patterns, trends and correlation.
In order to find issues in IT operations, and even to see issues coming before they disturb an IT service, there are three things you need to look for: patterns, trends and correlation. Where? you might ask. Well, in as much relevant operational data as possible. Why? you might ask. Well, after looking at an enormous number of P1s and major incidents I can conclude that in over 90% of P1s there are warning signs visible in the data, usually clear enough and early enough to allow operators to prevent a service disruption. But the warning signs are not as obvious as they used to be: they are sometimes symptomatic, sometimes technical, sometimes service-related, but mostly only indicative of cause or symptoms. Trends, patterns and correlation.
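To make the difference between a preconceived threshold and a trend concrete, here is a minimal sketch in Python. The function names, the disk usage figures and the limit of 90 are all hypothetical and chosen only for illustration; this is not the method of any particular monitoring product.

```python
def static_threshold_alerts(samples, limit):
    """Return the indexes of samples at or above a fixed, predefined limit."""
    return [i for i, value in enumerate(samples) if value >= limit]


def trend_slope(samples):
    """Least-squares slope of the samples against their index (units per sample)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    covariance = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    variance = sum((i - mean_x) ** 2 for i in range(n))
    return covariance / variance


def samples_until_limit(samples, limit):
    """Estimate how many more samples it takes for the trend to reach the limit.

    Returns None when the trend is flat or falling.
    """
    slope = trend_slope(samples)
    if slope <= 0:
        return None
    return max(0.0, (limit - samples[-1]) / slope)


# Hypothetical hourly disk usage in percent; the classic alert limit is 90.
disk_usage = [60, 62, 64, 66, 68, 70, 72, 74]

print(static_threshold_alerts(disk_usage, 90))  # no sample has reached the limit yet
print(trend_slope(disk_usage))                  # growth per hour
print(samples_until_limit(disk_usage, 90))      # rough number of hours left
```

The threshold check stays silent on this data because no sample has crossed the limit, while the slope shows steady growth and gives a rough estimate of the time left to act. A straight-line fit is the simplest possible trend model; real data will usually need something that copes with noise and seasonality.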
Ok, how do I do this?
I am talking about BIG data and relevant analytics tools. When I first implemented such monitoring, we started with the most accessible and most valuable data. In that case it was logging from load balancers. Even though those load balancers only rarely contained the root cause of a service disruption, their behavior was indicative of where we might find the actual root cause. It got really interesting when we added user logins, end user incident tickets, batch schedule results, network performance, outside temperature, network logging etc. You can imagine that combining that data can help to find some interesting trends and patterns. A few years ago, when I was first involved with big data and analytics, the tools were able to handle millions of lines of logging, but we had to do the trend, pattern and correlation searching by hand. Nowadays even that part is impressively automated (IBM Predictive Insights and Seasonality impressed me a lot lately).
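As an illustration of what "correlation searching" across data sources can look like, the following Python sketch ranks a few data series by how strongly they move together with load balancer errors. The series names and all numbers are hypothetical, and a plain Pearson correlation is only one of many possible measures; it is an assumption for this example, not a description of how any specific analytics tool works.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant series carries no correlation information
    return covariance / (spread_x * spread_y)


def rank_correlations(target, candidates):
    """Sort candidate series by the strength of their correlation with target."""
    scores = {name: pearson(target, series) for name, series in candidates.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical hourly counts and readings, aligned on the same eight hours.
load_balancer_errors = [2, 3, 2, 10, 12, 3, 2, 11]
other_sources = {
    "incident_tickets": [1, 1, 1, 5, 6, 1, 1, 5],
    "outside_temperature": [18, 19, 20, 21, 22, 21, 20, 19],
}

for name, score in rank_correlations(load_balancer_errors, other_sources):
    print(f"{name}: {score:.2f}")
```

A strong correlation only tells you where to look next; it does not prove cause, which is why the result still needs to feed into a structured root cause analysis.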
Every time a problem did occur, we would analyse it, find the root cause using CoThink RATIO methods, and use the output to correct or add other data to the big data environment or to fine-tune the analytics.
Below is an overview of how the structured RATIO approach and intelligent tooling work together:
This new way of monitoring is not limited by a technology silo or by what can be preconceived. Well, not quite, because someone still needs to have a clever idea of what data to add to the tool. For that, it helps to think outside the box. Let me give you three examples of data correlation that helped me solve crises in the past:
Data out of the box into the box:
The schedule of the Dusseldorf monorail and the schedule of mainframe batch processing in the data center next to the airport helped me find the cause of the recurring crashes of quarter-end processing.
The schedule of the Tour de France and the LAN performance data helped me resolve an intermittent LAN performance problem.
The lunch schedules, building attendance data and the number of user complaints one time helped me figure out the root cause of an intermittent mobile device e-mail problem.
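Examples like these come down to laying an external schedule next to the incident timestamps. The Python sketch below shows one simple way to do that, assuming the external event windows do not overlap; the timestamps, the five-minute window and the function names are hypothetical and are not the actual data or tooling from these cases.

```python
from datetime import datetime, timedelta


def fraction_near_events(incidents, events, window):
    """Share of incidents that fall within `window` after any external event."""
    if not incidents:
        return 0.0
    hits = sum(
        1
        for incident in incidents
        if any(timedelta(0) <= incident - event <= window for event in events)
    )
    return hits / len(incidents)


def expected_fraction(events, window, period_start, period_end):
    """Share of the observed period covered by event windows.

    This is the fraction of incidents you would expect near an event by pure
    chance. Assumes the windows do not overlap and lie inside the period.
    """
    covered = len(events) * window
    return min(1.0, covered / (period_end - period_start))


# Hypothetical data: an external schedule and the moments a job crashed.
day = datetime(2015, 3, 31)
train_passes = [day.replace(hour=12), day.replace(hour=13), day.replace(hour=14)]
crashes = [
    day.replace(hour=12, minute=2),
    day.replace(hour=13, minute=1),
    day.replace(hour=14, minute=3),
    day.replace(hour=15, minute=30),
]
window = timedelta(minutes=5)

observed = fraction_near_events(crashes, train_passes, window)
by_chance = expected_fraction(
    train_passes, window, day.replace(hour=12), day.replace(hour=16)
)
print(f"observed: {observed:.2%}, expected by chance: {by_chance:.2%}")
```

When the observed share is far above what chance alone would give, the external schedule is worth a closer look as a lead towards the root cause.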
The message here is: trends, patterns and correlation are not limited to what you can conceive, and are not limited by any technology silo. They are the new navigation to finding a root cause; the most effective input to a good RCA process (Problem Analysis in RATIO).
Old monitoring: limited by what can be conceived and by technology silos
New monitoring: using big data tooling to find trends, patterns and correlation that you weren't even looking for, but that effectively point towards the root cause or preventive measures.
So, stop searching for what you were looking for (what was preconceived), and start finding what you weren't searching for (trends, patterns, correlation).
Hope this inspires you to look at monitoring from another angle!