Cloudsoft AMP demo: auto-remediation, toil reduction & chaos engineering

May 11, 2023

Charlotte Binstead

Cloudsoft AMP is our flexible automation engine with strong capabilities around Site Reliability Engineering (SRE).

Our latest demo showcases these capabilities, diving into how AMP:

monitors and senses the health of your environments
captures and codifies your systems, runbooks and best practices in blueprints and policies
automates remediation of issues, with no human intervention.

These capabilities enable AMP to help SREs to:

Reduce toil
Reduce mean time to recover
Increase mean time between failures
Scale

And at the end of the demo, we nod towards how AMP can help to test and stress environments with a bit of chaos engineering.

Watch the demo here, and read the transcript below.

Want your own, personalised demo?

Let us show you a personalised demo that answers your questions and ticks your boxes.

Book a 30 minute discovery session to get started and we'll arrange a demo that shows you how AMP can work for your specific needs.

Demo Transcript:

AMP is our flexible automation engine with strong capabilities around SRE.

Over the next few minutes, I’ll walk you through how AMP supports the main tenets of SRE, namely:

Providing contextualised workload monitoring around service levels, key indicators and objectives, bridging the gap between architecture and tooling
Auto-remediating whole categories of problems to improve MTTR and uptime
Automating repetitive tasks to reduce toil
AMP can coordinate whatever tooling is already in place (the myriad names along the bottom of the slide)
Everything is captured in blueprints which codify knowledge and make it shareable across the whole community: SRE, application and ops teams

We’ll begin in AMP’s Dashboard view, which summarises the state of all running workloads to high-level stakeholders like product owners.

This is where we can surface relevant information about workload health and compliance, including key SLIs and status against SLOs.

All of which is central to SRE, because we can’t manage or improve what we can’t see.

And right now, we know at a glance that our estate is healthy and all services are in compliance with their defined objectives.

In the case of [this] workload, we’re highlighting web latency and error rate as our key SLIs, with attached SLOs that are currently met, as you can see.
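To make the relationship between an SLI and its SLO concrete, here is a minimal Python sketch of the kind of compliance check a dashboard like this summarises. It is not AMP code: the `Slo` class, the `compliance_report` helper, the SLI names and the threshold values are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical objective: the indicator must stay at or below `threshold`."""
    sli_name: str
    threshold: float
    unit: str

    def is_met(self, observed: float) -> bool:
        return observed <= self.threshold


def compliance_report(slos, observations):
    """Map each SLI name to True (objective met) or False (objective breached)."""
    return {slo.sli_name: slo.is_met(observations[slo.sli_name]) for slo in slos}


# Illustrative numbers only; the demo does not state the actual targets.
slos = [
    Slo("web_latency_ms", threshold=250.0, unit="ms"),
    Slo("error_rate", threshold=0.01, unit="fraction of requests"),
]

healthy = compliance_report(slos, {"web_latency_ms": 120.0, "error_rate": 0.002})
degraded = compliance_report(slos, {"web_latency_ms": 900.0, "error_rate": 0.002})

print(healthy)
print(degraded)
```

In the second report only the latency objective is breached, which is the situation the demo goes on to trigger artificially.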

What’s surfaced on the dashboard is just part of a richer set of information held within AMP’s runtime models of the running services. Reliability engineers and app developers may wish to dig deeper into the structure and state of the workload, which they can do via the inspector.

Here we can see raw data about each component of the system:

AMP can acquire metrics “directly”, by interacting with the component itself; or “indirectly” via whatever additional monitoring solutions might be in place

AMP contextualises all that information through knowledge of the workload’s structure

We’ll come back to how that knowledge gets expressed later.

Now this particular app contains a bug that causes one of our key indicators (SLI) to deteriorate, threatening the corresponding objective (SLO).

I can artificially trigger that here.

[Injects trigger condition off screen]

AMP inspector and dashboard update in near real-time

Our summary and detail views remain up-to-date with the system itself

The runbook (not shown) suggests manual remediation, e.g. logging in remotely to restart the process. An SRE recognises that doing so manually is simply toil, and instead automates it with AMP’s workflows:

We can run ad-hoc workflows directly in the console

Or we can further reduce toil and attach workflows to entities as effectors

Creates key ops actions that are contextualised and easy to find

Can go beyond this and use AMP policies to create truly autonomous systems.

Attach a simple threshold policy that invokes our workflows when SLI exceeds our goal.

Re-introduce problem, see auto-remediation

Closed-loop management; autonomics

Toil further reduced
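The closed loop described above (sense the SLI, compare it with a threshold, invoke a remediation workflow) can be sketched in a few lines of Python. This is an illustration of the pattern only, not AMP's policy API: `ThresholdPolicy`, `FakeService` and the 250 ms threshold are all hypothetical stand-ins.

```python
from typing import Callable


class ThresholdPolicy:
    """Hypothetical policy: run a remediation action when a reading exceeds a threshold."""

    def __init__(self, threshold: float, remediate: Callable[[], None]):
        self.threshold = threshold
        self.remediate = remediate
        self.invocations = 0

    def on_sensor_update(self, value: float) -> bool:
        """Return True if the reading breached the threshold and remediation ran."""
        if value > self.threshold:
            self.invocations += 1
            self.remediate()
            return True
        return False


class FakeService:
    """Stand-in for the demo app: a bug inflates latency, a restart clears it."""

    def __init__(self):
        self.latency_ms = 100.0

    def inject_bug(self):
        self.latency_ms = 900.0

    def restart(self):
        self.latency_ms = 100.0


service = FakeService()
# The "workflow" attached to the policy is simply the restart action here.
policy = ThresholdPolicy(threshold=250.0, remediate=service.restart)

service.inject_bug()                                  # re-introduce the problem
fired = policy.on_sensor_update(service.latency_ms)   # policy senses and remediates
print(fired, service.latency_ms, policy.invocations)
```

The same action that an engineer would otherwise run by hand is reused by the policy, so no human is needed in the loop once it is attached.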

Gets more interesting when workflows are coordinating activities across multiple components: think canary deployments, or blue-green updates.

Let’s return to how AMP expresses all this knowledge. It all starts with a blueprint, which we can interactively build in the composer.

[composer]

Blueprints capture structure, dependencies, locations and behaviour (policy)

Incorporate other tech + artifacts: Terraform, Ansible, scripts, etc.

Goal is to capture everything-as-code

Policies are part of the same coherent model

They can support event-triggered actions, which we’ve seen for auto-remediation

We can also run scheduled actions, which can alleviate toil associated with routine maintenance tasks

And there are specific policies that offer integration with other relevant enterprise systems:

updating the CMDB with infrastructure changes
raising incident tickets in the ITSM system when problems are detected, and resolving those tickets automatically

AMP records composable knowledge at the level of individual components and architectural patterns

Can really start to apply broader autonomics patterns around resilience

Self-healing at the level of individual components or tiers

Escalating all the way up to automating DR processes across data centres, cloud regions or even between different cloud providers

We can increase our MTBF, as small failures don’t cascade to become system outages

And our MTTR obviously improves massively through automation

Both of which contribute to healthy service uptime
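One common way to see why both metrics matter is the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The short Python sketch below applies it; the hour values are invented for illustration and are not measurements from the demo.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative figures only.
baseline = availability(mtbf_hours=500.0, mttr_hours=2.0)
faster_recovery = availability(mtbf_hours=500.0, mttr_hours=0.1)   # automation cuts MTTR
fewer_failures = availability(mtbf_hours=2000.0, mttr_hours=2.0)   # self-healing raises MTBF

for label, value in [
    ("baseline", baseline),
    ("faster recovery", faster_recovery),
    ("fewer failures", fewer_failures),
]:
    print(f"{label}: {value:.5f}")
```

Either lever, a longer time between failures or a shorter time to recover, pushes the ratio closer to 1.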

I want to offer a brief nod towards chaos engineering, where AMP can serve two roles:

As a remediation agent, which we’ve already seen
Extend our models with additional policies to describe relevant auto-remediation steps for categories of failure injected by your simian army

Alternatively, AMP can be a chaos agent: coordinating various tools to inject failures, and validating the system’s responses.
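The two roles can be pictured together as a tiny chaos experiment: inject a failure, let remediation act, then validate the outcome. The Python sketch below is a toy model under stated assumptions; `Component`, `inject_failure`, `self_heal` and `run_experiment` are hypothetical names, and the "failure" is just a flag flipped in memory rather than anything injected by a real chaos tool.

```python
import random


class Component:
    """Hypothetical workload component with a simple health flag."""

    def __init__(self, name: str):
        self.name = name
        self.healthy = True


def inject_failure(components, rng):
    """Chaos step: mark one randomly chosen component as failed."""
    victim = rng.choice(components)
    victim.healthy = False
    return victim


def self_heal(components):
    """Stand-in for remediation policies: restore every unhealthy component."""
    healed = [c.name for c in components if not c.healthy]
    for c in components:
        c.healthy = True
    return healed


def run_experiment(components, rng):
    """Inject one failure, let remediation run, then validate the response."""
    victim = inject_failure(components, rng)
    healed = self_heal(components)
    recovered = all(c.healthy for c in components)
    return {"victim": victim.name, "healed": healed, "recovered": recovered}


rng = random.Random(42)  # seeded so repeated runs pick the same victim
tier = [Component("web-1"), Component("web-2"), Component("db")]
result = run_experiment(tier, rng)
print(result)
```

The validation step is the important part: the experiment passes only if exactly the injected failure was remediated and the whole tier ends up healthy.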

I’ll wrap up by noting:

We’ve barely scratched the surface of AMP, but we’ve seen how it supports the essential pillars of SRE through:

Contextualised observability of health status and service levels
User-defined automation workflows in response to toil, also contextualised
Auto-remediation workflows for reduced MTTR
All represented as composable building-blocks that can be curated and shared among SRE stakeholders

And AMP does so in a way that integrates with culture and tooling, and adapts well to chaos engineering
