The Incident Management Process

by: Mat Strange

Following on from my previous post on what constitutes an Incident within IT Service Management (ITSM), I will now dive into the processes we can use to manage and resolve incidents in our systems.

Incident Management is a solid framework within ITSM that guides how incidents are handled from identification to resolution. The process typically includes the following stages:

Incident Identification

The first step is to identify, verify and acknowledge the incident. This can be done through monitoring tools, user reports, or automated alerts. It is always recommended to verify the incident through a secondary tool or process to reduce the risk of false positives from the monitoring system. Once the incident is confirmed, it is important to quickly assess its impact on the business and users.

Incident Logging

Once the incident is identified, it is important to log it in a central system, for example Jira or ServiceNow. This log should include all relevant information about the incident, such as its description, impact, and any initial actions taken.
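To make the logging step concrete, here is a minimal sketch of what a logged incident might contain. The class name, field names and sample values are all hypothetical and chosen for illustration only; tools such as Jira or ServiceNow define their own schemas.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class IncidentRecord:
    """Hypothetical shape of a logged incident (illustrative field names)."""
    title: str
    description: str
    impact: str
    reported_by: str
    # Actions taken so far, in the order they were recorded.
    initial_actions: List[str] = field(default_factory=list)
    # Timestamp in UTC so that records from different sites are comparable.
    opened_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def add_action(self, note: str) -> None:
        """Append a note describing an action taken on the incident."""
        self.initial_actions.append(note)


# Example usage with made-up values.
record = IncidentRecord(
    title="Checkout page returns errors",
    description="Customers see an error page when submitting payment.",
    impact="Customers cannot complete online orders",
    reported_by="monitoring",
)
record.add_action("Restarted the payment service")
```

Capturing the opening time automatically, rather than asking the person logging the incident to type it, keeps later measurements such as time to resolve consistent.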

Incident Categorisation and Prioritisation

The next step is to categorise and prioritise the incident. This will help to ensure that the most critical incidents are addressed first. Incidents can be categorised based on their nature, urgency, and impact, so that effective communications can be sent and the correct resources allocated. Categorisation is also important if adherence to SLAs forms part of the support strategy. SLAs ensure that incidents are handled within agreed-upon timeframes, emphasising the commitment to customer satisfaction and operational continuity. Proper incident prioritisation, in accordance with SLAs, ensures that critical incidents receive immediate attention while less critical ones are managed efficiently.
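One common way to turn impact and urgency into a priority is a simple lookup matrix, with each priority mapped to an SLA target. The sketch below assumes a three-level scale, priorities P1 to P5 and example resolution targets; all of these values are illustrative assumptions, and real figures would come from the SLAs agreed with the business.

```python
from datetime import datetime, timedelta

# Assumed impact/urgency matrix; the mapping is illustrative only.
PRIORITY_MATRIX = {
    ("high", "high"): "P1",
    ("high", "medium"): "P2",
    ("medium", "high"): "P2",
    ("high", "low"): "P3",
    ("medium", "medium"): "P3",
    ("low", "high"): "P3",
    ("medium", "low"): "P4",
    ("low", "medium"): "P4",
    ("low", "low"): "P5",
}

# Assumed resolution targets per priority (not taken from any real SLA).
SLA_TARGETS = {
    "P1": timedelta(hours=1),
    "P2": timedelta(hours=4),
    "P3": timedelta(hours=8),
    "P4": timedelta(hours=24),
    "P5": timedelta(hours=72),
}


def prioritise(impact: str, urgency: str) -> str:
    """Look up the priority for a given impact and urgency."""
    try:
        return PRIORITY_MATRIX[(impact.lower(), urgency.lower())]
    except KeyError:
        raise ValueError(f"Unknown impact/urgency: {impact!r}, {urgency!r}")


def resolution_deadline(opened_at: datetime, priority: str) -> datetime:
    """Return the time by which the incident should be resolved."""
    return opened_at + SLA_TARGETS[priority]


# Example: a high-impact, medium-urgency incident opened at 09:00.
opened = datetime(2024, 1, 1, 9, 0)
priority = prioritise("high", "medium")
deadline = resolution_deadline(opened, priority)
```

Keeping the matrix and the targets as data rather than branching logic means they can be reviewed and changed by the service owner without touching the code.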

Incident Investigation and Diagnosis

Once the incident is categorised and prioritised, it is important to investigate the root cause of the incident. This is essential for preventing similar incidents from happening in the future. The investigation may involve collecting data, analysing logs, and interviewing users. Where possible, read-only accounts should be used to access log files to reduce the chance of data loss and preserve the chain of custody, which is especially important for cyber security incidents. At this stage it might be possible to identify a workaround to restore service, even if a full fix will require a significant amount of time and effort.

Incident Handling

This phase is often managed by a designated Incident Manager and runs in parallel with several of the previous steps of the incident process. It involves effective communication with stakeholders, including users and relevant IT teams. For a major incident, stakeholders are often gathered in an ‘incident room’ conference call to facilitate rapid discussions. Effective communication ensures that all parties are informed of the incident’s status and that the right resources are deployed for resolution. The phase also includes escalation when necessary to ensure swift resolution and minimise impact. When an incident escalates beyond the initial response, it moves up the organisational hierarchy for more advanced expertise and intervention. This ensures that critical incidents receive the required attention, that any workarounds are approved (e.g. if the cost or risk is high), and that crisis management activities are invoked if required.

Incident Resolution

Once the root cause of the incident is found, it is time to resolve the incident. This may involve implementing corrective actions, such as fixing a bug, replacing faulty hardware, or restoring data from a backup. If time allows, the corrective action should first be tested in a preproduction or staging environment, to reduce the likelihood of any unintended regressions, before the fix is deployed to production.

Incident Closure

Once the incident is resolved, it is important to close it out formally. Before closure, smoke and unit testing on the production systems should be performed to make sure that the incident has been fully resolved. Closure then involves documenting the resolution details.

Incident Reporting and Review

Finally, it is important to generate incident reports and analyse incident data. This information can be used to find trends, improve the incident management process, and prevent similar incidents from happening in the future.
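As an example of the kind of analysis this stage can involve, the sketch below computes the mean time to resolve per incident category. The input format (category, opened time, resolved time) and the sample data are assumptions made for illustration; in practice the data would be exported from the incident logging system.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple

Incident = Tuple[str, datetime, datetime]  # (category, opened, resolved)


def mean_time_to_resolve(incidents: Iterable[Incident]) -> Dict[str, timedelta]:
    """Average the time between opening and resolution for each category."""
    durations = defaultdict(list)
    for category, opened, resolved in incidents:
        durations[category].append(resolved - opened)
    return {
        category: sum(values, timedelta()) / len(values)
        for category, values in durations.items()
    }


# Made-up sample data for illustration.
sample = [
    ("network", datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 11, 0)),
    ("network", datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 13, 0)),
    ("database", datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 10, 0)),
]
report = mean_time_to_resolve(sample)
```

Tracking a figure like this over successive months is one way to see whether changes to the process are actually shortening resolution times.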

On top of these usual steps, there are some further enhancements that can dramatically improve handling and resolution times:

Use automation to streamline the process: There are several tools and technologies that can be used to automate tasks such as incident logging, categorisation, and prioritisation. This can free up IT staff to focus on more complex tasks.
Implement a knowledge base and asset database: A well-implemented knowledge base and asset database can be invaluable tools for incident management. By storing information about previous incidents, their resolutions, and the assets that may be affected, these databases can help IT teams to quickly identify and resolve incidents, prevent similar incidents from happening in the future, assess related dependencies and improve their overall incident management process.
Establish a culture of continuous improvement: Regularly review the incident management process and identify opportunities for improvement. This will help to ensure that the process is as efficient and effective as possible.
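To illustrate the automation point above, here is a deliberately simple sketch of rule-based categorisation using keyword matching. The categories, keywords and function name are hypothetical; commercial ITSM tools offer far richer rules engines, and this only shows the basic idea.

```python
# Hypothetical rules: each category is matched by a few keywords.
# Rules are checked in order, so earlier categories take precedence.
RULES = [
    ("database", ("sql", "database", "deadlock")),
    ("network", ("dns", "timeout", "packet loss")),
    ("access", ("password", "login", "locked out")),
]


def categorise(description: str, default: str = "uncategorised") -> str:
    """Return the first category whose keywords appear in the description."""
    text = description.lower()
    for category, keywords in RULES:
        if any(keyword in text for keyword in keywords):
            return category
    return default


# Example usage with made-up incident descriptions.
first = categorise("Users locked out after password reset")
second = categorise("DNS lookups failing for internal hosts")
third = categorise("Printer jam on the second floor")
```

Anything that falls through to the default category can be routed to a person for manual triage, so automation handles the routine cases without hiding the unusual ones.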

It is, however, important to note that this is a framework, and an Incident Management process that works perfectly for one organisation may not work for another. The process must adapt to suit the risk appetite of the business, the domain in which it operates, and the customer profile, to name but a few examples.

In future posts, I will be digging into many of these areas in greater depth and bringing them to life with some practical examples.

Conclusion

In summary, the Incident Process is far more than just a technical process. It’s a strategic approach to maintaining operational excellence, preserving customer satisfaction and ensuring business continuity. By handling incidents with precision and efficiency, organisations not only navigate the complexities of the digital age but thrive within it.

Stay connected with Lean Tree as we continue to provide you with practical guidance, industry knowledge, and expertise to make the most of your ITSM endeavours. If you have specific themes or topics you’d like to explore further in subsequent blog posts or would like to discuss how we can support your technology transformation, please feel free to get in touch!