{"id":1490,"date":"2023-12-14T09:00:32","date_gmt":"2023-12-14T09:00:32","guid":{"rendered":"https:\/\/leantree.co.uk\/?p=1490"},"modified":"2024-02-16T09:03:28","modified_gmt":"2024-02-16T09:03:28","slug":"the-incident-management-process","status":"publish","type":"post","link":"https:\/\/leantree.co.uk\/the-incident-management-process\/","title":{"rendered":"The Incident Management Process"},"content":{"rendered":"
\n\n

Following on from my previous post<\/a> on what constitutes an Incident within IT Service Management (ITSM), I will now dive into the processes we can use to manage and resolve incidents in our systems.<\/p>\n\n\n\n

Incident Management is a solid framework within ITSM that guides how incidents are handled from identification to resolution. The process typically includes the following stages:<\/p>\n\n\n\n

Incident Identification<\/h2>\n\n\n\n

The first step is to identify, verify and acknowledge the incident. This can be done through monitoring tools, user reports, or automated alerts. It is always recommended to verify the incident through a secondary tool or process to eliminate the risk of false positives from the monitoring system. Once the incident is found, it is important to quickly assess its impact on the business and users.<\/p>\n\n\n\n
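As a rough illustration of the secondary verification described above, the sketch below confirms an incident only when the primary alert is firing and at least one independent health check also fails. The function name and the check callables are hypothetical and not tied to any particular monitoring product.

```python
def verify_incident(primary_alert_firing, secondary_checks):
    # secondary_checks is a list of zero-argument callables (hypothetical),
    # each returning True when the service looks healthy from that source.
    if not primary_alert_firing:
        return False
    # Confirm only if at least one independent check also reports a failure.
    return any(not check() for check in secondary_checks)


# Example: the monitoring alert fires, and a separate synthetic probe also fails.
confirmed = verify_incident(True, [lambda: False])
```

If every secondary check reports healthy, the alert is treated as a likely false positive and can be reviewed rather than raised as an incident.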

Incident Logging<\/h2>\n\n\n\n

Once the incident is identified, it is important to log it in a central system, for example Jira or ServiceNow. This log should include all relevant information about the incident, such as its description, impact, and any initial actions taken.<\/p>\n\n\n\n
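To make the contents of such a log entry concrete, here is a minimal sketch of an incident record. The field names are assumptions chosen for illustration and do not correspond to the Jira or ServiceNow data models.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentRecord:
    # All field names are hypothetical; adapt them to your ticketing system.
    title: str
    description: str
    impact: str
    reported_by: str
    logged_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    actions_taken: list = field(default_factory=list)

    def add_action(self, note):
        # Record each action with a UTC timestamp to keep a clear timeline.
        self.actions_taken.append((datetime.now(timezone.utc), note))


record = IncidentRecord(
    title='Checkout page returning errors',
    description='Users report failures when submitting orders.',
    impact='Customers cannot complete purchases',
    reported_by='monitoring',
)
record.add_action('Restarted the payment service')
```

Timestamping each action as it is logged makes the later review stage much easier, because the timeline does not have to be reconstructed from memory.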

Incident Categorisation and Prioritisation<\/h2>\n\n\n\n

The next step is to categorise and prioritise the incident. This will help to ensure that the most critical incidents are addressed first. Incidents can be categorised based on their nature, urgency, and impact, so that effective communications can be sent and the correct resources allocated. Categorisation is also important if adherence to SLAs forms part of the support strategy. SLAs ensure that incidents are handled within agreed-upon timeframes, emphasising the commitment to customer satisfaction and operational continuity. Proper incident prioritisation, in accordance with SLAs, ensures that critical incidents receive immediate attention while less critical ones are managed efficiently.<\/p>\n\n\n\n
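One common way to turn impact and urgency into a priority is a simple lookup matrix, with each priority mapped to an SLA target. The sketch below assumes three levels for each dimension and invented resolution targets; real values would come from your own SLAs.

```python
from datetime import datetime, timedelta

# Hypothetical impact/urgency matrix; the levels and priorities are assumptions.
PRIORITY_MATRIX = {
    ('high', 'high'): 'P1',
    ('high', 'medium'): 'P2',
    ('medium', 'high'): 'P2',
    ('high', 'low'): 'P3',
    ('medium', 'medium'): 'P3',
    ('low', 'high'): 'P3',
    ('medium', 'low'): 'P4',
    ('low', 'medium'): 'P4',
    ('low', 'low'): 'P4',
}

# Illustrative resolution targets only; not taken from any real SLA.
SLA_RESOLUTION_TARGETS = {
    'P1': timedelta(hours=4),
    'P2': timedelta(hours=8),
    'P3': timedelta(days=3),
    'P4': timedelta(days=5),
}


def prioritise(impact, urgency):
    # Look up the priority for a given (impact, urgency) pair.
    return PRIORITY_MATRIX[(impact, urgency)]


def resolution_deadline(priority, logged_at):
    # The deadline is the logging time plus the SLA target for that priority.
    return logged_at + SLA_RESOLUTION_TARGETS[priority]


priority = prioritise('high', 'medium')
deadline = resolution_deadline(priority, datetime(2024, 1, 1, 9, 0))
```

Keeping the matrix as data rather than nested conditionals makes it easy to review with the business and to change when SLAs are renegotiated.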

Incident Investigation and Diagnosis<\/h2>\n\n\n\n

Once the incident is categorised and prioritised, it is important to investigate its root cause. This is essential for preventing similar incidents from happening in the future. The investigation may involve collecting data, analysing logs, and interviewing users. Where possible, read-only accounts should be used to access log files to reduce the chance of data loss and preserve the chain of custody \u2013 especially important for cyber security incidents. At this stage it might be possible to identify a workaround to restore service, even if a full fix will require a significant amount of time and effort.<\/p>\n\n\n\n

Incident Handling<\/h2>\n\n\n\n

This phase is often managed by a designated Incident Manager and runs in parallel with several of the previous steps. It involves effective communication with stakeholders, including users and relevant IT teams. For a major incident, stakeholders are often gathered in an \u2018incident room\u2019 conference call to facilitate rapid discussion. It also includes escalation when necessary to ensure swift resolution and minimise impact. Effective communication ensures that all parties are informed of the incident’s status and that the right resources are deployed for resolution. When an incident escalates beyond the initial response, it moves up the organisational hierarchy for more advanced expertise and intervention. This ensures that critical incidents receive the required attention, that any workarounds are approved (e.g. if the cost or risk is high), and that crisis management activities are invoked if required.<\/p>\n\n\n

\n\n
\"medium<\/figure>\n\n<\/div>\n\n\n

Incident Resolution<\/h2>\n\n\n\n

Once the root cause of the incident is found, it is time to resolve the incident. This may involve implementing corrective actions, such as fixing a bug, replacing faulty hardware, or restoring data from a backup. If time allows, the corrective action should first be tested in a preproduction or staging environment, to reduce the likelihood of any unintended regressions, before the fix is deployed to production.<\/p>\n\n\n\n

Incident Closure<\/h2>\n\n\n\n

Once the incident is resolved, it is important to close it out formally. Before closure, smoke tests should be run against the production systems to confirm that the incident has been fully resolved. The resolution details should then be documented.<\/p>\n\n\n\n

Incident Reporting and Review<\/h2>\n\n\n\n

Finally, it is important to generate incident reports and analyse incident data. This information can be used to find trends, improve the incident management process, and prevent similar incidents from happening in the future.<\/p>\n\n\n\n
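As a small example of the kind of analysis meant here, the sketch below counts incidents per category and computes the mean time to resolve. The dictionary keys and the sample data are invented for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def summarise(incidents):
    # Each incident is a dict with hypothetical keys:
    # 'category', 'logged_at' and 'resolved_at'.
    by_category = Counter(i['category'] for i in incidents)
    durations = [i['resolved_at'] - i['logged_at'] for i in incidents]
    # Mean time to resolve; None when there are no incidents to average.
    mean = sum(durations, timedelta()) / len(durations) if durations else None
    return by_category, mean


# Invented sample data, not real incident history.
sample = [
    {'category': 'network', 'logged_at': datetime(2024, 1, 1, 9, 0),
     'resolved_at': datetime(2024, 1, 1, 11, 0)},
    {'category': 'network', 'logged_at': datetime(2024, 1, 2, 9, 0),
     'resolved_at': datetime(2024, 1, 2, 13, 0)},
]
counts, mean_time_to_resolve = summarise(sample)
```

Tracking these figures per category over time is one simple way to spot recurring problem areas and to check whether process changes are actually reducing resolution times.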

On top of these usual steps, there are some further enhancements that can dramatically improve handling and resolution times:<\/p>\n\n\n\n