Application availability is at the heart of Digital Operational Resilience
Downtime is expensive. The average cost of unplanned application downtime at Tier One financial institutions exceeds $2.5 billion every year, according to IDC.
The Operational Resilience of your business depends now more than ever on the availability of your IT systems. Critical systems, and the applications which support them, need to be available at all times, especially in a crisis.
The Digital Operational Resilience Act presents an opportunity to review, and implement changes to, your systems and processes. You can both reduce your financial exposure and demonstrate to regulators the robustness of your resilience processes.
In this blog, we discuss how you can quantify the financial impact of downtime and how an application-centric approach to availability can solve the resilience problem.
Quantifying the cost of downtime
According to a Gartner report titled “Why Business Leaders Don’t Care About the Cost of Downtime”, published in April 2019, “through 2021, 65% of I&O leaders will underinvest in their availability and recovery needs because they use estimated cost-of-downtime metrics.”
We believe that this means these leaders aren’t right-sizing their availability because they don’t know what investment will deliver the right availability and recovery for their critical business applications.
Focusing on the wrong metrics for availability and recovery might improve overall availability while the application still suffers an outage at a critical time, leading to a loss of transactions and therefore revenue. High ‘average’ availability is not enough to guarantee availability at financially critical moments.
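To make the point about averages concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than real data: the hourly revenue profile, the 30-day period and the two outage scenarios are hypothetical, and the names `Outage`, `availability` and `revenue_lost` are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outage:
    start_hour: int      # hour of day the outage starts (0-23)
    duration_hours: int  # whole hours, to keep the arithmetic simple


def availability(outages, period_hours):
    """Fraction of the period during which the application was up."""
    down = sum(o.duration_hours for o in outages)
    return 1 - down / period_hours


def revenue_lost(outages, revenue_per_hour):
    """Sum the revenue attributed to each hour of day the outages covered."""
    lost = 0
    for o in outages:
        for h in range(o.start_hour, o.start_hour + o.duration_hours):
            lost += revenue_per_hour[h % 24]
    return lost


# Hypothetical profile: 50,000 per hour in a two-hour peak window
# (08:00-10:00) and 1,000 per hour at all other times.
REVENUE_PER_HOUR = [50_000 if 8 <= h < 10 else 1_000 for h in range(24)]
PERIOD_HOURS = 30 * 24  # a 30-day reporting period

night = [Outage(start_hour=2, duration_hours=2)]  # outage at 02:00
peak = [Outage(start_hour=8, duration_hours=2)]   # outage at 08:00

for label, outages in (("night", night), ("peak", peak)):
    print(
        label,
        f"availability={availability(outages, PERIOD_HOURS):.4%}",
        f"lost={revenue_lost(outages, REVENUE_PER_HOUR)}",
    )
```

Both scenarios report exactly the same availability figure, because each has two hours of downtime in the same period. Under these assumed numbers, however, the peak-time outage loses fifty times the revenue of the overnight one, which is the gap an average-only metric hides.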
Adjusting the approach to availability
In Cloudsoft’s experience, the key to getting the investment right and improving application availability and recovery lies in adjusting the approach:
Metrics and controls must be aligned to the software application, not its individual components, technology layers and the often disparate teams aligned to them. This might be directed by the Shared Site Reliability Engineering team, the importance of which is discussed in our eBook.
By unifying resilience and recovery at the application level, it’s possible to “look down” at all the dependent, interconnected systems that make up the application and understand how the metrics and behaviour of each component affect the whole system. It is even possible to implement controls at the application layer that make lower-level systems more resilient.
Understand and define business tolerances for disruption
The focus for improved availability has shifted in recent years from improving probabilistic uptime to accepting that failure is inevitable, and investing instead in resilience to, and recovery from, those inevitable events.
In their report “Why Business Leaders Don’t Care About the Cost of Downtime”, Gartner observed that there is a period of outage for an application after which there is an inflection point where the impact dramatically increases. This is measured by the Maximum Tolerable Period of Disruption (MTPOD).
For each application, understand where the outage impact inflection point lies, and build recovery and resilience that keep outages within the Maximum Tolerable Period of Disruption (MTPOD) threshold.
Automation, automation, automation
The pinnacle of application resilience involves intelligent automation tools that transform the runbook from a Word document to a model in source control, extending techniques such as infrastructure-as-code. These tools, such as Cloudsoft AMP, enable you to codify your resilience and recovery policies at the application level and limit or remove the human “bump in the road”.
The best-performing tools seek to model logical application components so that recovery processes are easily and consistently visualised, while integrating the breadth of technologies an organisation uses. This means availability and recovery are optimised, and strategies are not only reused but also improved, automatically tested and rolled out. As a result, all stakeholders have a clear view of what is running and where, not only in peacetime but also in the (much-less-likely) event of an incident.
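As a rough illustration of what moving the runbook from a Word document into source control can look like, the sketch below models a recovery policy as plain Python data and checks it against an MTPOD threshold. This is not Cloudsoft AMP's format or API. The class names, the `payments-gateway` application, the step timings and the 30-minute MTPOD are all hypothetical, and the step actions are stand-ins that simply report success.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RecoveryStep:
    name: str
    action: Callable[[], bool]  # returns True when the step succeeds
    expected_minutes: float     # time budgeted for this step


@dataclass
class RecoveryPolicy:
    application: str
    mtpod_minutes: float  # Maximum Tolerable Period of Disruption
    steps: list[RecoveryStep] = field(default_factory=list)

    def planned_minutes(self) -> float:
        """Total budgeted recovery time if the steps run in sequence."""
        return sum(step.expected_minutes for step in self.steps)

    def within_mtpod(self) -> bool:
        """True when the planned recovery fits inside the MTPOD threshold."""
        return self.planned_minutes() <= self.mtpod_minutes

    def run(self) -> list[str]:
        """Run steps in order, stopping at the first failure.

        Returns the names of the steps that completed.
        """
        completed = []
        for step in self.steps:
            if not step.action():
                break
            completed.append(step.name)
        return completed


# Hypothetical policy; each lambda stands in for real automation.
policy = RecoveryPolicy(
    application="payments-gateway",
    mtpod_minutes=30,
    steps=[
        RecoveryStep("fail over database", lambda: True, 10),
        RecoveryStep("restart application tier", lambda: True, 5),
        RecoveryStep("verify health checks", lambda: True, 5),
    ],
)

print(policy.application, policy.planned_minutes(), policy.within_mtpod())
print(policy.run())
```

Because the policy is ordinary code, a check such as `within_mtpod()` could run in a test pipeline on every change, so a recovery plan that drifts past its threshold is caught in review rather than during an incident.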
Applying this approach for Digital Operational Resilience
Ensure digital operational resilience by guaranteeing rock-solid availability for all your critical digital infrastructure and processes.
Visit our Continuous Resilience Resource Centre for guides and practical tips on implementing an application-first approach to your digital operational resilience.
Gartner, Why Business Leaders Don’t Care About the Cost of Downtime, 9 April 2019, David Gregory, Mark Jaggers