The True Cost of Downtime
“I&O leaders are often on a misguided mission to find mythical cost-of-downtime numbers, which leads to a lack of credibility and, ultimately, a denial of necessary funding. Focus instead on impacts that matter to the business.”
According to a Gartner report titled “Why Business Leaders Don’t Care About the Cost of Downtime”, published in April 2019, “through 2021, 65% of I&O leaders will underinvest in their availability and recovery needs because they use estimated cost-of-downtime metrics.”
What’s the Cloudsoft take?
In Gartner’s report, the writers state that “every time, such broad-brush numbers are wrong and their usage completely invalidates the objective of justifying increased investment in availability and recovery. Generic ‘cost of downtime’ metrics are often built on five myths”.
We believe that this means these leaders aren’t right-sizing their availability because they don’t know what investment will deliver the right availability and recovery for the parts of their IT that support critical business applications.
Gartner suggests that leaders “quantify any actual increases in the cost of production, decreases in revenue or profitability that accrue from an outage by looking at the factors of production and how they are affected by an outage.”
The cost of getting the investment wrong can be measured in the “downside risk” of well-intentioned yet wayward spending. Metrics that do not reflect the application’s real availability and recovery needs might show improved availability even though the application suffers an outage at a critical time and loses transactions.
Adjusting the approach to availability
In Cloudsoft’s experience, the key to getting the investment right and improving application availability and recovery lies in adjusting the approach:
|Be application-centric|Align metrics and controls for resilience and recovery to the application, not to a sub-component or technology layer. For example, an application may be a CRM system.|
|Use autonomic controls|Use self-managing autonomic systems to codify your resilience and recovery policies at the application level, and limit or remove the human “bump in the road”.|
|Define business tolerances|Understand, for each application, where the outage impact inflection point is, and build recovery and resilience to stay within Maximum Tolerable Period of Disruption (MTPOD) thresholds.|
Be application-centric
Metrics and controls must be aligned to the software application, not its individual components or technology layers.
For example, investment that makes the web layer scalable enough to handle 10x traffic without failure may be deemed a success by the web team. But if other components in the transaction flow, such as the database, are not upgraded as well, the constraint simply moves to them and an outage still occurs. The web team is happy, but the database team is on fire and transactions are being lost.
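The way a constraint moves from one layer to another can be illustrated with a small sketch. The component names and capacity figures are hypothetical, chosen only for illustration, and the model assumes every transaction passes through every component, so end-to-end capacity is simply the lowest capacity in the chain.

```python
def bottleneck(capacities):
    """Return (component, capacity) for the component that limits the whole flow."""
    name = min(capacities, key=capacities.get)
    return name, capacities[name]


# Hypothetical capacities in transactions per second (illustrative only).
before = {"web": 1_000, "app": 2_000, "database": 1_500}
# The web layer is scaled 10x; nothing else changes.
after = {**before, "web": before["web"] * 10}

print(bottleneck(before))  # ('web', 1000)
print(bottleneck(after))   # ('database', 1500)
```

Under these assumed numbers, a 10x investment in the web layer raises the application's capacity by only 1.5x, because the database becomes the new limit.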
It is typical for enterprises to divide responsibility for one application amongst different teams aligned to technology layers or sub-components of an application or system.
Each team has different metrics, only applicable to their domain, using different tools, with different ideas of an “outage” and different recovery plans.
By unifying application resilience and recovery at the application level, it’s possible to “look down” at all the dependent, interconnected systems that make up the application and understand how the metrics and behaviour of each component affects the whole system, and even implement controls at the application layer to make lower-level systems more resilient.
For example, a Cloudsoft AMP blueprint defines the DNA of an application and its components, wiring them together and deploying them in a resilient and recoverable manner. It also provides autonomic (no human interaction) resilience and recovery through sensors and effectors.
Use autonomic controls
The focus for improved availability has shifted in recent years from improving probabilistic uptime to accepting that failure is inevitable and investing instead in resilience to, and recovery from, those inevitable failures.
Cloudsoft AMP combines policies, sensors and effectors with the deep understanding of an application’s DNA that it gains from the blueprint, sensing events and remediating them according to predefined management policies.
For example, an increase in web HTTP transactions and a rise in a web server CPU load could breach a threshold where Cloudsoft AMP will deploy an additional web server and wire it into the load balancer, attach cache mechanisms, create database connections and access security systems – then remove it when traffic reduces below a different threshold – all without human interaction. This provides resilience to expected future changes in traffic.
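The scale-out and scale-in behaviour described above can be sketched as a simple threshold policy with two separate thresholds. This is a generic illustration, not Cloudsoft AMP's API or blueprint syntax: the `ScalingPolicy` class, the `run` helper, the threshold values and the CPU readings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    """Hypothetical policy: thresholds apply to average web server CPU load (0.0-1.0)."""
    scale_out_above: float = 0.80
    scale_in_below: float = 0.30
    min_servers: int = 1
    max_servers: int = 10

    def decide(self, cpu_load: float, servers: int) -> int:
        """Return the desired number of servers for one sensor reading."""
        if cpu_load > self.scale_out_above and servers < self.max_servers:
            return servers + 1
        if cpu_load < self.scale_in_below and servers > self.min_servers:
            return servers - 1
        return servers


def run(policy, readings, servers=1):
    """Feed sensor readings through the policy; the stand-in effector just records pool size."""
    history = []
    for cpu_load in readings:
        servers = policy.decide(cpu_load, servers)
        history.append(servers)
    return history


# Hypothetical CPU readings: a traffic spike followed by a quiet period.
print(run(ScalingPolicy(), [0.50, 0.85, 0.90, 0.60, 0.25, 0.20]))
# [1, 2, 3, 3, 2, 1]
```

Using a lower threshold for scale-in than for scale-out leaves a band in which the pool size is held steady, which avoids adding and removing servers on every small fluctuation.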
Cloudsoft AMP can autonomically modify any running application it controls in limitless ways because it can control anything with an API or that can be scripted. It is often used to direct lower-level systems such as infrastructure provisioning tools to provide infrastructure, and plays very well in complex IT ecosystems.
Being able to add Cloudsoft AMP to an existing, complex IT infrastructure and integrate it northbound (other systems query or control AMP) and southbound (AMP queries and controls other systems) is essential.
Define business tolerances
It’s a myth that all outages have the same cost, yet this is the most common approach: over two-thirds of leaders use an “average cost of downtime” for all outages. This number is then used to make strategic IT decisions where availability probabilities are involved, such as datacenter uptimes. For example, if a facility has an uptime of three nines (99.9%), it can be expected to have nearly nine hours of outage per year. Multiplying nine by the number of services and the average cost of downtime is then used to justify a decision.
In their report “Why Business Leaders Don’t Care About the Cost of Downtime”, Gartner observed that there is a period of outage for an application (not a component) after which there is an inflection point where the impact dramatically increases.
This is why resilience and recovery are the bedrock of improved availability: they keep outages short enough to avoid this inflection point, whose threshold is expressed as the Maximum Tolerable Period of Disruption (MTPOD).
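The arithmetic behind the three nines example can be checked with a short sketch. The function names, the service count and the hourly cost are hypothetical and used only to show how the broad-brush estimate is assembled.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of outage per year for a given availability fraction."""
    return (1 - availability) * HOURS_PER_YEAR


def naive_annual_cost(availability, services, average_cost_per_hour):
    """The broad-brush estimate: downtime x number of services x one average cost."""
    return expected_downtime_hours(availability) * services * average_cost_per_hour


print(round(expected_downtime_hours(0.999), 2))  # 8.76

# Hypothetical inputs: 20 services and an "average" cost of 10,000 per hour.
print(round(naive_annual_cost(0.999, 20, 10_000)))  # 1752000
```

The estimate treats every hour of every service as equally costly, which is exactly the assumption the myth rests on.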
An application-centric deployment, measurement and control system like Cloudsoft AMP can expose this MTPOD as a dashboard metric and demonstrate how its effectors autonomically handle sensed events to recover the system before MTPOD is breached.
Shifting to an application-centric approach and swapping misguided metrics for ones that are more representative of actual business performance can transform that performance by right-sizing the investment in availability, resilience and recovery, which in turn will deliver better services.
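As a rough illustration of tracking an outage against MTPOD, the sketch below reports whether the threshold has been breached and how much time remains. The `mtpod_status` function, the 30-minute MTPOD and the timestamps are hypothetical, and this is not how Cloudsoft AMP itself computes or displays the metric.

```python
from datetime import datetime, timedelta


def mtpod_status(outage_start, now, mtpod):
    """Return (breached, remaining) for an ongoing outage against its MTPOD."""
    elapsed = now - outage_start
    remaining = mtpod - elapsed
    # Remaining time is clamped at zero once the threshold has passed.
    return elapsed > mtpod, max(remaining, timedelta(0))


# Hypothetical values: an application with a 30-minute MTPOD.
start = datetime(2019, 4, 9, 9, 0)
mtpod = timedelta(minutes=30)

breached, remaining = mtpod_status(start, start + timedelta(minutes=12), mtpod)
print(breached, remaining)  # False 0:18:00

breached, remaining = mtpod_status(start, start + timedelta(minutes=45), mtpod)
print(breached, remaining)  # True 0:00:00
```

A recovery that completes while the first value is still False has kept the outage inside the business tolerance.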
Cloudsoft AMP helps leaders to improve the availability, resilience and recoverability of their applications in unique ways:
- Unlike most management tools, Cloudsoft AMP is a top-down, application-centric, full-lifecycle modelling, deployment and in-life management tool. Most management tools are bottom-up and infrastructure-focused, with no in-life management capabilities.
- Cloudsoft AMP can glue together anything that has an API or can be scripted, for example securing on-premises Oracle databases with cloud security systems. This makes it applicable to a wide range of complex IT landscapes.
- By combining Cloudsoft AMP policies, sensors and effectors, an organisation can wrap autonomics around its applications, freeing up precious and finite staff time by removing or reducing the need for human support for applications.
Gartner, Why Business Leaders Don’t Care About the Cost of Downtime, 9 April 2019, David Gregory, Mark Jaggers