The game is changing for the IT ops community, which means the rules of the past make less and less sense. Organizations need accurate, understandable, and actionable metrics in the right context to measure operations performance and drive critical business transformation.
The more customers adopt modern tools, and the more varied the incidents they manage, the less sense it makes to lump all of those incidents into one bucket and compute a single average resolution time to represent ops performance. Yet that is what IT has been doing for a long time.
History and metrics
History shows that context is key to interpreting signals correctly and avoiding errors and misunderstandings. For example, during the 1980s, Sweden set up a system to analyze hydrophone signals and alert it to Russian submarines in Swedish waters. The Swedes relied on an acoustic signature they thought represented a class of Russian submarines, but it was actually the sound of gas bubbles released by herring when confronted by a potential predator. This misinterpretation of a metric increased tensions between the countries and almost resulted in a war.
Mean time to resolve (MTTR) is the main ops performance metric operations managers use to gauge progress toward their goals. It is a long-established measure rooted in systems reliability engineering. MTTR has been adopted across many industries, including manufacturing, facility maintenance, and, more recently, IT ops, where it represents the average time from incident creation to resolution over a given period.
MTTR is calculated by adding up the resolution time of every incident (measured from the time of incident creation to the time of resolution) and dividing that total by the number of incidents.
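The calculation described above can be sketched in a few lines of Python. This is only an illustration: the function name `mean_time_to_resolve` and the (created, resolved) tuple layout are assumptions made for the example, not part of any monitoring tool's API.

```python
from datetime import datetime, timedelta

def mean_time_to_resolve(incidents):
    """Average of (resolved - created) across all incidents in the period."""
    if not incidents:
        raise ValueError("need at least one incident")
    # Sum every incident's creation-to-resolution time, then divide by the count.
    total = sum((resolved - created for created, resolved in incidents), timedelta())
    return total / len(incidents)

# Hypothetical sample data: three incidents created at the same moment.
start = datetime(2024, 1, 1, 9, 0)
incidents = [
    (start, start + timedelta(minutes=1)),   # transient blip
    (start, start + timedelta(minutes=30)),  # routine incident
    (start, start + timedelta(hours=5)),     # long-running incident
]
mttr = mean_time_to_resolve(incidents)
print(mttr)  # 1:50:20
```

Note that the result of 1 hour 50 minutes matches none of the three incidents, which is a small-scale version of the problem with averaging dissimilar incidents.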
MTTR is exactly what it says: it is the average across all incidents. MTTR smears together both high- and low-urgency incidents. It counts each separate, ungrouped incident on its own, which biases the resolve time. It includes manually resolved and auto-resolved incidents in the same context. It mixes in incidents that are tabled for days (or months) after creation, or even ignored completely. Finally, MTTR includes every little transient burst (incidents that are auto-closed in under 120 seconds), which are either noisy non-issues or quickly resolved by a machine.
In short, MTTR throws all incidents, regardless of type, into a single bucket and calculates an “average” resolution time across the entire set. This overly simplistic method results in a noisy, erroneous, and misleading indication of how operations is performing.
A new way of measuring performance
Critical incident response time (CIRT) is a new, significantly more accurate method to evaluate operations performance. PagerDuty developed the concept of CIRT, but the methodology is freely available for anyone to use.
CIRT focuses on the incidents that are most likely to impact business by culling noise from incoming signals using the following techniques:
- Real business-impacting (or potentially impacting) incidents are very rarely low urgency, so rule out all low-urgency incidents.
- Real business-impacting incidents are very rarely (if ever) auto-resolved by monitoring tools without the need for human intervention, so rule out incidents that were not resolved by a human.
- Short, bursting, and transient incidents that are resolved within 120 seconds are highly unlikely to be real business-impacting incidents, so rule them out.
- Incidents that go unnoticed, are tabled, or are ignored (not acknowledged, not resolved) for a very long time are rarely business-impacting; rule them out. Note: This threshold can be a statistically derived number that is customer-specific (e.g., two standard deviations above the mean) to avoid using an arbitrary number.
- Individual, ungrouped incidents generated by separate alerts are not representative of the larger business-impacting incident. Therefore, simulate incident groupings with a very conservative threshold, e.g., two minutes, to calculate response time.
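The five rules above can be sketched as a filter-then-group pass over incident records. This is a rough illustration under stated assumptions, not PagerDuty's implementation: the `Incident` record, its field names, and both function names are hypothetical, the long-tail cutoff is passed in by the caller, and a group's response time is assumed to run from the group's earliest creation to its latest resolution.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, pstdev
from typing import List, Optional

@dataclass
class Incident:  # hypothetical record shape, not a real API object
    created_at: datetime
    resolved_at: Optional[datetime]
    high_urgency: bool
    resolved_by_human: bool

TRANSIENT = timedelta(seconds=120)   # transient cutoff named in the text
GROUP_WINDOW = timedelta(minutes=2)  # conservative grouping window from the text

def long_tail_cutoff(durations: List[timedelta]) -> timedelta:
    """Mean plus two standard deviations, one way to derive the 'ignored' threshold."""
    secs = [d.total_seconds() for d in durations]
    return timedelta(seconds=mean(secs) + 2 * pstdev(secs))

def critical_response_times(incidents: List[Incident], cutoff: timedelta) -> List[timedelta]:
    # Rules 1-4: keep high-urgency, human-resolved incidents that are neither
    # transient (resolved within 120 seconds) nor left open past the cutoff.
    kept = [
        i for i in incidents
        if i.high_urgency
        and i.resolved_by_human
        and i.resolved_at is not None
        and TRANSIENT < i.resolved_at - i.created_at <= cutoff
    ]
    # Rule 5: simulate grouping by merging incidents created within
    # GROUP_WINDOW of the previous incident in the same group.
    kept.sort(key=lambda i: i.created_at)
    groups: List[List[Incident]] = []
    for inc in kept:
        if groups and inc.created_at - groups[-1][-1].created_at <= GROUP_WINDOW:
            groups[-1].append(inc)
        else:
            groups.append([inc])
    # Assumption: a group's response time spans first creation to last resolution.
    return [max(i.resolved_at for i in g) - g[0].created_at for g in groups]

# Hypothetical sample data.
t0 = datetime(2024, 1, 1, 9, 0)
m, h = timedelta(minutes=1), timedelta(hours=1)
sample = [
    Incident(t0, t0 + 40 * m, True, True),                    # kept
    Incident(t0 + m, t0 + 45 * m, True, True),                # kept, grouped with the first
    Incident(t0 + h, t0 + h + timedelta(seconds=30), True, True),  # transient
    Incident(t0 + 2 * h, t0 + 2 * h + 20 * m, False, True),   # low urgency
    Incident(t0 + 3 * h, t0 + 3 * h + 10 * m, True, False),   # auto-resolved
    Incident(t0 + 4 * h, t0 + 4 * h + 15 * m, True, True),    # kept
    Incident(t0 + 5 * h, t0 + 5 * h + timedelta(days=30), True, True),  # tabled for a month
    Incident(t0 + 6 * h, None, True, False),                  # never resolved
]
times = critical_response_times(sample, cutoff=timedelta(days=1))
print(times)  # two groups remain: 45 minutes and 15 minutes
```

In practice the `cutoff` argument could come from `long_tail_cutoff` applied to a customer's historical resolution times rather than a fixed value. How the remaining response times are summarized (mean, median, or a percentile) is left open here.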
What effect does applying these assumptions have on response times? In a nutshell, a very large one.
By focusing on ops performance during critical, business-impacting incidents, the resolve-time distribution narrows and shifts greatly to the left, because the measure now covers similar types of incidents rather than every incident of every kind.
Because MTTR calculates a much longer, artificially skewed response time, it is a poor indicator of operations performance. CIRT, on the other hand, is an intentional measure focused on the incidents that matter most to business.
Another critical measure to use alongside CIRT is the percentage of responders who are acknowledging and resolving incidents. This is important because it validates whether the CIRT (or mean time to acknowledge and MTTR, for that matter) can be trusted. For example, an MTTR of 10 minutes sounds great, but if only 42% of your responders are resolving their incidents, then that number is suspect.
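One possible way to compute such a percentage is sketched below. The text does not define the measure precisely, so this example assumes that a responder counts as "resolving" if they resolved at least one incident assigned to them. The function name and the (responder, resolved_by_assignee) pair layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def responder_resolution_rate(assignments: Iterable[Tuple[str, bool]]) -> float:
    """Percentage of responders who resolved at least one incident assigned to them."""
    resolved_any: Dict[str, bool] = defaultdict(bool)
    for responder, resolved_by_assignee in assignments:
        # A responder is counted once, however many incidents they resolved.
        resolved_any[responder] |= resolved_by_assignee
    if not resolved_any:
        raise ValueError("no assignments")
    return 100.0 * sum(resolved_any.values()) / len(resolved_any)

# Hypothetical sample: four responders, two of whom resolved something.
assignments = [
    ("ana", True), ("ana", False),
    ("ben", False),
    ("cy", True),
    ("dee", False),
]
rate = responder_resolution_rate(assignments)
print(rate)  # 50.0
```

A stricter definition, such as requiring a responder to resolve most of their assigned incidents, would change the loop but not the overall shape of the calculation.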
In summary, CIRT and the percentage of responders who are acknowledging and resolving incidents form a valuable set of metrics that give you a much better idea of how operations is performing. Gauging performance is the first step to improving performance, so these new measures are key to achieving continuous cycles of measurable improvement for your organization.
Source link: MTTR is dead, long live CIRT