Better to be safe than sorry. This applies to many things in life, including disruptions. But what if feeling sorry is faster, cheaper and more sustainable than taking preventative measures? Then proactive action turns into reactive action and, let's face it, we are generally quite good at that.
Yet, somehow, the word "reactive" has a nasty aftertaste. It is associated with laziness. Everything should be proactive and radiate energy. In other words: problem management instead of incident management. But why, exactly?
Corkscrews and reactive action
As I write this, I think back to a conversation I once had about corkscrews. A wine wholesaler had accidentally delivered a batch of defective corkscrews. It was not the cork that came loose, but the handle. The product manager, however, considered a recall unnecessary: "That costs far too much time and money, and taking a hit is also an art."
Only a few customers returned their broken corkscrew. These customers received a voucher in addition to a properly functioning corkscrew. Considerate, isn't it? Now, I don't want to compare a product like a corkscrew with a business-critical IT component, but it is certainly an interesting angle.
Problem management vs incident management
According to ITIL, a problem is an underlying cause of one or more incidents. Whereas incident management focuses purely on rapid recovery, problem management focuses on understanding the cause and thus on preventive action. Preventing incidents, limiting the impact of an incident and preventing its recurrence are central here.
In practice, however, the problem management process is often not initiated at all. This may be because the cause is not considered relevant or because the costs are simply too high. Then again, good problem management also weighs the economic aspects and the reputational damage. Nevertheless, problem management remains an unattractive task for many companies.
The reactive variant is therefore strongly preferred: a fire is spotted, the fire brigade is called, everyone waits, the fire is put out, and the fire brigade reports that the fire is out and leaves again. Entire departments are designed for this. And, to be honest, it works quite well, doesn't it?
The other day, someone on a specialised ITIL forum suggested that problem management should be deleted from the ITIL literature altogether. "Nobody uses it and there are all kinds of tools nowadays that detect faults and thus indirectly engage in problem management", according to this hitherto respected forum user. Tools do not replace processes, Mr Forum User; as a rule, they support them.
Nevertheless, he had aroused my curiosity. The person in question spoke of a tool that mapped out "normal behaviour" on the basis of certain activities and patterns. Depending on the size and complexity of the environment, this learning period could take anywhere from hours to weeks. Afterwards, the system was able to detect deviations and thus alert specialists in time or even intervene automatically, whatever that may involve. A conditioned system that wakes up as soon as the alarm goes off.
The root cause analysis (RCA) that is delivered afterwards looks decent. However, it is limited to a number of parameters and is certainly not a real RCA that you can simply send to a customer. No, that requires human input in the form of knowledge, insight and perhaps a little wisdom: just about everything that most modern tools still cannot provide. For now, these tools are better suited to a more advanced form of incident management: first detecting, then solving. I do wonder, though, how well they cope with very complex multi-vendor infrastructures with thousands of variables. A big advantage remains that intervention can take place before the user even picks up the phone. In that sense, we are still being proactive.
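To make the idea of a learning period a little more concrete, here is a minimal sketch of how a baseline and a deviation check could work. It is not how any particular tool does it: the function names, the response-time figures and the threshold of three standard deviations are all assumptions chosen purely for illustration.

```python
from statistics import mean, stdev


def learn_baseline(samples):
    """Summarise 'normal behaviour' as the mean and standard deviation."""
    return mean(samples), stdev(samples)


def is_deviation(value, baseline, threshold=3.0):
    """Flag a value that lies more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        # A perfectly flat baseline: any different value counts as a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical response times in milliseconds, collected during the learning period.
learning_period = [102, 98, 101, 99, 100, 103, 97, 100]
baseline = learn_baseline(learning_period)

for observed in (101, 140):
    if is_deviation(observed, baseline):
        print(f"{observed} ms: deviation, alert a specialist")
    else:
        print(f"{observed} ms: normal")
```

A check like this only says that something is off, not why. That is exactly the gap between advanced incident management and a real RCA.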
Combination of technical and soft skills
As long as there are no tools that think for us, problem management will largely remain a human activity, in which the combination of technical skills and soft skills is more important than ever. It is the people who do the work. We will have to analyse, ask questions, consult and negotiate in order to make decisions. Such as: when does an incident become a problem?
A gigantic incident in which the lights go out in an entire city does not have to be a problem according to ITIL, whereas a small incident that keeps recurring is ripe for the problem management process. It comes down to analysis and a critical look at the surrounding circumstances. To return to the city in the dark: is the cause obvious? What is the likelihood of repetition? Is it a logical consequence of certain actions? In short, these are the questions that the problem management funnel asks. If we can already answer them with confidence, why should we apply problem management?
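The recurrence part of that funnel can be sketched in a few lines. This is only an illustration under stated assumptions: the incident categories, the log entries and the threshold of three repeats are hypothetical, and a count like this supports the human judgement described above rather than replacing it.

```python
from collections import Counter


def problem_candidates(incidents, min_repeats=3):
    """Return the categories that recur often enough to deserve a closer look."""
    counts = Counter(category for category, _ in incidents)
    return sorted(c for c, n in counts.items() if n >= min_repeats)


# Hypothetical incident log: (category, affected scope).
incident_log = [
    ("vpn-drop", "site A"),
    ("city-blackout", "entire city"),
    ("vpn-drop", "site B"),
    ("vpn-drop", "site A"),
]

print(problem_candidates(incident_log))
```

The one-off blackout, however large, does not show up here. The small VPN drop that keeps coming back does.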
Last but not least
Of course, when it comes to problem management, we must not forget the known error database, which answers questions such as: Can we link the problems, failures and incidents to known errors? Are there workarounds available? And can we apply these workarounds without letting an entire infrastructure collapse? It takes time to build a known error database, let alone maintain it. Yet it is worth the effort, if only for the learning organisation. If we stop learning, we might as well stop working.
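As a rough illustration of those questions, a known error database can be thought of as a list of symptoms with workarounds attached. The sketch below rests on assumptions: the record fields, the example entries and the simple text match are invented for this purpose, and a real database will be considerably richer.

```python
from dataclasses import dataclass


@dataclass
class KnownError:
    symptom: str              # lower-case phrase we expect in incident reports
    workaround: str           # what to do until the cause is fixed
    safe_in_production: bool  # can it be applied without wider impact?


# Hypothetical entries, for illustration only.
KNOWN_ERRORS = [
    KnownError("vpn tunnel drops", "restart the tunnel service", True),
    KnownError("routing table corrupt", "reboot the core router", False),
]


def find_known_errors(incident_text, known_errors=KNOWN_ERRORS):
    """Return the known errors whose symptom appears in the incident text."""
    text = incident_text.lower()
    return [ke for ke in known_errors if ke.symptom in text]


for match in find_known_errors("Customer reports VPN tunnel drops every hour"):
    note = "safe to apply" if match.safe_in_production else "needs a change window"
    print(f"Workaround: {match.workaround} ({note})")
```

The `safe_in_production` flag is there for the third question: a workaround that brings down the rest of the infrastructure is not much of a workaround.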
Would you like to know how we at Nomios deal with incident and problem management for our customers? Then contact us and we will be happy to discuss it with you.