Cloudsoft AMP demo: auto-remediation, toil reduction & chaos engineering
Cloudsoft AMP is our flexible automation engine with strong capabilities around Site Reliability Engineering (SRE).
Our latest demo showcases these capabilities, diving into how AMP:
- monitors and senses the health of your environments
- captures and codifies your systems, runbooks and best practices in blueprints and policies
- automates remediation of issues, with no human intervention.
These capabilities enable AMP to help SREs to:
- Reduce toil
- Reduce mean time to recover
- Increase mean time between failures
And at the end of the demo, we nod towards how AMP can help to test and stress environments with a bit of chaos engineering.
Watch the demo here, and read the transcript below.
AMP is our flexible automation engine with strong capabilities around SRE.
Over the next few minutes, I’ll walk you through how AMP supports the main tenets of SRE, namely:
- Providing contextualised workload monitoring around service levels, key indicators and objectives, bridging the gap between architecture and tooling
- Auto-remediating whole categories of problems to improve MTTR and uptime
- Automating repetitive tasks to reduce toil
- Coordinating whatever tooling is already in place (the myriad names along the bottom of the slide)
- Capturing everything in blueprints, which codify knowledge and make it shareable across the whole community: SRE, application and ops teams
We’ll begin in AMP’s Dashboard view, which summarises the state of all running workloads to high-level stakeholders like product owners.
This is where we can surface relevant information about workload health and compliance, including key SLIs and status against SLOs.
All of which is central to SRE, because we can’t manage or improve what we can’t see.
And right now, we know at a glance that our estate is healthy and all services are in compliance with their defined objectives.
In the case of [this] workload, we’re highlighting web latency and error rate as our key SLIs, with attached SLOs that are currently met, as you can see.
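To make the relationship between SLIs and SLOs concrete, here is a minimal Python sketch. It is not AMP's API: the `Slo` class, the metric names and the limits are all assumptions chosen for illustration, since the demo does not state the actual objectives.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A hypothetical objective: the named SLI must stay at or below `limit`."""

    sli_name: str
    limit: float

    def is_met(self, slis: dict[str, float]) -> bool:
        # A missing reading counts as non-compliant: we can't vouch for what we can't see.
        value = slis.get(self.sli_name)
        return value is not None and value <= self.limit


def compliance_report(slis: dict[str, float], slos: list[Slo]) -> dict[str, bool]:
    """Map each SLI name to whether its objective is currently met."""
    return {slo.sli_name: slo.is_met(slis) for slo in slos}


# Illustrative values only; not taken from the demo.
slos = [Slo("web_latency_ms", 200.0), Slo("error_rate", 0.01)]
healthy = compliance_report({"web_latency_ms": 120.0, "error_rate": 0.002}, slos)
degraded = compliance_report({"web_latency_ms": 450.0, "error_rate": 0.002}, slos)
```

In this sketch `healthy` reports both objectives as met, while `degraded` flags only the latency objective, which is the kind of at-a-glance status a dashboard would surface.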
What’s surfaced on the dashboard is just part of a richer set of information held within AMP’s runtime models of the running services. Reliability engineers and app developers may wish to dig deeper into the structure and state of the workload, which they can do via the inspector.
Here we can see raw data about each component of the system:
AMP can acquire metrics “directly”, by interacting with the component itself; or “indirectly” via whatever additional monitoring solutions might be in place
AMP contextualises all that information through knowledge of the workload’s structure
We’ll come back to how that knowledge gets expressed later.
Now this particular app contains a bug that causes one of our key indicators (SLI) to deteriorate, threatening the corresponding objective (SLO).
I can artificially trigger that here.
[Injects trigger condition off screen]
AMP inspector and dashboard update in near real-time
Our summary and detail views remain up-to-date with the system itself
The runbook (not shown) suggests manual remediation, e.g. a remote login to restart the process. An SRE recognises that doing so manually is simply toil, and instead automates this with AMP’s workflows:
We can run ad-hoc workflows directly in the console
Or we can further reduce toil and attach workflows to entities as effectors
Creates key ops actions that are contextualised and easy to find
Can go beyond this and use AMP policies to create truly autonomous systems.
Attach a simple threshold policy that invokes our workflows when SLI exceeds our goal.
Re-introduce problem, see auto-remediation
Closed-loop management; autonomics
Toil further reduced
Gets more interesting when workflows are coordinating activities across multiple components: think canary deployments, or blue-green updates.
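The closed loop described above (a threshold policy that invokes a workflow when the SLI exceeds its goal) can be sketched in a few lines of Python. This is a simulation under stated assumptions, not AMP's policy or workflow syntax: `ThresholdPolicy`, `FakeService` and the latency figures are hypothetical names and values invented for the example.

```python
from typing import Callable


class ThresholdPolicy:
    """Hypothetical policy: run `remediate` when a reading exceeds `goal`."""

    def __init__(self, goal: float, remediate: Callable[[], None]) -> None:
        self.goal = goal
        self.remediate = remediate
        self.invocations = 0

    def on_reading(self, value: float) -> bool:
        """Handle one SLI reading; return True if remediation was triggered."""
        if value > self.goal:
            self.invocations += 1
            self.remediate()
            return True
        return False


class FakeService:
    """Stand-in for the buggy app: latency stays high until the process restarts."""

    def __init__(self) -> None:
        self.latency_ms = 100.0

    def inject_fault(self) -> None:
        self.latency_ms = 900.0

    def restart(self) -> None:
        # Plays the role of the restart workflow taken from the runbook.
        self.latency_ms = 100.0


service = FakeService()
policy = ThresholdPolicy(goal=200.0, remediate=service.restart)

service.inject_fault()
triggered = policy.on_reading(service.latency_ms)  # breach: the workflow runs
settled = policy.on_reading(service.latency_ms)    # healthy again: no action
```

The design point is that the policy knows nothing about how remediation works; it only holds a reference to a workflow, so the same workflow can also be run ad hoc or exposed as an effector.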
Let’s return to how AMP expresses all this knowledge. It all starts with a blueprint, which we can interactively build in the composer.
Blueprints capture structure, dependencies, locations and behaviour (policy)
Incorporate other tech + artifacts: Terraform, Ansible, scripts, etc.
Goal is to capture everything-as-code
Policies are part of the same coherent model
They can support event-triggered actions, which we’ve seen for auto-remediation
We can also run scheduled actions, which can alleviate toil associated with routine maintenance tasks
And there are specific policies that offer integration with other relevant enterprise systems:
- updating the CMDB with infrastructure changes
- raising incident tickets in the ITSM system when problems are detected, and resolving those tickets automatically
AMP records composable knowledge at the level of individual components and architectural patterns
Can really start to apply broader autonomics patterns around resilience
Self-healing at the level of individual components or tiers
Escalating all the way up to automating DR processes across data centres, cloud regions or even between different cloud providers
We can increase our MTBF, as small failures don’t cascade to become system outages
And our MTTR obviously improves massively through automation
Both of which contribute to healthy service uptime
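The link between MTBF, MTTR and uptime can be made explicit with the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR), where MTBF is taken as the mean operating time between failures. The Python sketch below uses made-up figures purely for illustration; they are not measurements from the demo.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative numbers only (assumptions, not demo results).
manual = availability(mtbf_hours=500.0, mttr_hours=2.0)     # restart by hand
automated = availability(mtbf_hours=500.0, mttr_hours=0.1)  # auto-remediated
```

With these assumed inputs, cutting recovery time from two hours to six minutes raises availability from roughly 99.6% to roughly 99.98%; raising MTBF moves the result in the same direction.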
I want to offer a brief nod towards chaos engineering, where AMP can serve two roles:
- As a remediation agent, which we’ve already seen: we can extend our models with additional policies that describe the relevant auto-remediation steps for the categories of failure injected by your simian army
- As a chaos agent: coordinating various tools to inject failures, and validating the system’s responses.
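As a rough illustration of the chaos-agent role, the Python sketch below injects a failure into a simulated tier and then checks whether remediation restores health within a bounded number of rounds. Everything here is hypothetical: `run_chaos_experiment`, `FakeTier` and the replica counts are assumptions for the example and do not represent AMP's interfaces or any particular chaos tool.

```python
from typing import Callable


def run_chaos_experiment(
    inject: Callable[[], None],
    remediate_step: Callable[[], None],
    is_healthy: Callable[[], bool],
    max_steps: int = 5,
) -> dict:
    """Inject a failure, then allow up to `max_steps` remediation rounds."""
    if not is_healthy():
        raise RuntimeError("refusing to inject a fault into an unhealthy system")
    inject()
    for step in range(1, max_steps + 1):
        remediate_step()
        if is_healthy():
            return {"recovered": True, "steps": step}
    return {"recovered": False, "steps": max_steps}


class FakeTier:
    """Stand-in for a tier of replicas; healthy only when every replica is up."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.up = size

    def kill(self, count: int) -> None:
        self.up = max(0, self.up - count)

    def heal_one(self) -> None:
        # Plays the role of a self-healing policy replacing one failed replica.
        if self.up < self.size:
            self.up += 1

    def healthy(self) -> bool:
        return self.up == self.size


tier = FakeTier(size=3)
result = run_chaos_experiment(
    inject=lambda: tier.kill(2),
    remediate_step=tier.heal_one,
    is_healthy=tier.healthy,
)
# result == {"recovered": True, "steps": 2}
```

The experiment refuses to start against an already unhealthy system, and it reports failure rather than looping forever, so a missing or broken remediation policy shows up as `recovered: False`.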
I’ll wrap up by noting:
We’ve barely scratched the surface of AMP, but we’ve seen how it supports the essential pillars of SRE through:
- Contextualised observability of health status and service levels
- User-defined automation workflows in response to toil, also contextualised
- Auto-remediation workflows for reduced MTTR
- All represented as composable building-blocks that can be curated and shared among SRE stakeholders
And AMP does so in a way that integrates with existing culture and tooling, and adapts well to chaos engineering.