Consequences of bad health checks in AWS Application Load Balancer

Ed: I’m getting a bit worried about our VP of Engineering, Aled: he is at his happiest doing cloud war games and pre-/post-mortems! This is the first of many upcoming posts from him about solving customer application challenges on AWS. Notice how, in this post, it’s not just a “cloud plumbing” IaaS answer: it’s a combination of application (thread pool) AND cloud (AWS ALB) know-how that was the answer. Over to Aled!

I was recently discussing an application outage that an AWS customer experienced several times during spiky, heavy load. This situation can be improved by a minor addition to the web-app and reconfiguration of the load balancer’s health check. I think it’s an interesting case-study for how best to configure AWS Application Load Balancers and auto-scaling groups.

The Architecture

Below is a (simplified) architecture diagram:

Customer Architecture for ALB Enhancements

The load balancer and auto-scaling group’s health check used a `/health` HTTP endpoint, which included a check that the database was reachable.
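The post doesn't name the customer's stack, so purely as an illustration (Python, with a stand-in `db` object), the coupling in that original health check looked something like this:

```python
# Illustrative sketch only: a /health handler that couples app-server
# health to database reachability. The db object is a stand-in.
def health_handler(db, timeout_seconds=2):
    """Return (status_code, body) for GET /health."""
    try:
        db.ping(timeout=timeout_seconds)  # round-trip to the database
        return 200, "OK"
    except Exception:
        # A slow or overloaded database makes the *app-server* look dead.
        return 503, "database unreachable"
```

Note that a database timeout here produces a 503 from a perfectly healthy app-server, which is exactly what the load balancer acts on.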

The Problem

Under extremely heavy load, the database was overloaded and became very slow. The load balancer’s health-checks to the app-server timed out, which caused the app-server to be removed from the load balancer’s target group. The auto-scaler would also replace these app-servers due to the failing health-check.

This made the app completely unavailable for a few minutes. The database load would drop to zero, and then this cycle would repeat when the health-checks were passing again.

Unfortunately the alerting was set up to only page people if the site was down for 15 minutes. The cycle repeated in less than 15 minutes, so the alerts were not triggered.

Fix part one: the health-check

The first problem was that the health check required a response from the database. If the database responded too slowly, the app-server's health check failed: the load balancer blamed the app-server and removed it from the target group.

An improvement would be to use a different endpoint (e.g. `/alive`) that did not check the database connection. This improves things, but not enough.

The app-server handles HTTP requests in a thread pool with a maximum size. This pool was maxed out with customer HTTP requests blocked on database access, and further requests were queued by the app-server. The `/alive` health-check calls sat in that same queue, so they were still slow to respond.

To improve this further, a different thread pool was needed for the health-checks. This is one reason why the application load balancer’s configuration supports a different port for health checks versus customer traffic. A small change to the application code would be to add a listener on a different port, served by a different thread pool with no authentication, so the `/alive` calls respond in a timely manner.

Fix part two: the alerting

The alerts were raised only if the application was down for 15 minutes. This threshold was presumably chosen because of too many false positives in the past, or because of overconfidence in the auto-scaling group's ability to fix the problem.

I’d recommend that this time be greatly reduced. Instead, rely on a highly available architecture to avoid major outages – if one happens, then respond quickly.
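As an illustrative sketch (the resource names, SNS topic, and thresholds are assumptions, not the customer's configuration), a CloudFormation alarm on the target group's healthy-host count could page within a few minutes rather than fifteen:

```yaml
# Illustrative CloudFormation fragment: page when the target group has had
# no healthy hosts for 3 consecutive minutes (values are examples only).
HealthyHostsAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: Page on-call if the ALB target group loses all healthy hosts
    Namespace: AWS/ApplicationELB
    MetricName: HealthyHostCount
    Statistic: Minimum
    Period: 60
    EvaluationPeriods: 3
    Threshold: 1
    ComparisonOperator: LessThanThreshold
    Dimensions:
      - Name: LoadBalancer
        Value: !GetAtt AppLoadBalancer.LoadBalancerFullName
      - Name: TargetGroup
        Value: !GetAtt AppTargetGroup.TargetGroupFullName
    AlarmActions:
      - !Ref PagerSnsTopic   # assumed SNS topic wired to the paging system
```

An alarm of this shape would have caught the flapping cycle described above, since each pass through the cycle left the target group empty for more than three minutes.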

It’s also a good idea to monitor the four golden signals that Google SRE recommends: latency, traffic, errors and saturation. Many outages affect only a subset of requests, meaning your service level objectives (SLOs) are not met while a high-level health check still passes. How to do this will be the topic of another blog post.

Fix part three: the database

The RDS MySQL database was the bottleneck throughout. The long-term solution must address that. Several options are available:

  • Monitor RDS through CloudWatch to better understand its performance – for example, has the burst-capacity credit balance run out, causing IOPS to drop to the baseline? Is the database CPU-bound or I/O-bound?
  • Temporarily increase the instance type before the next spike in traffic (only works because they have predictable spikes, and can have short maintenance windows to do the resize).
  • Switch to AWS Aurora, to get much better performance at the same price.
  • Increase the instance type permanently to better handle the load.
  • Create read-replicas, to offload some of the requests.
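On the first of those options: if the database uses gp2 storage, burst credits are a common culprit. Using AWS's documented gp2 model (a 5.4 million I/O credit bucket, bursts up to 3,000 IOPS, baseline of 3 IOPS per GB with a 100 IOPS floor), a back-of-envelope sketch shows how quickly a traffic spike drains the bucket:

```python
# Back-of-envelope gp2 burst-credit maths (constants from AWS gp2 docs).
GP2_CREDIT_BUCKET = 5_400_000   # I/O credits per gp2 volume
GP2_BURST_IOPS = 3000           # maximum burst rate
GP2_BASELINE_PER_GB = 3         # baseline IOPS per provisioned GB
GP2_MIN_BASELINE = 100          # floor for small volumes

def gp2_baseline_iops(volume_gb):
    """Baseline IOPS for a gp2 volume of the given size."""
    return max(GP2_MIN_BASELINE, GP2_BASELINE_PER_GB * volume_gb)

def seconds_until_baseline(volume_gb, sustained_iops):
    """How long a full credit bucket lasts at a sustained IOPS rate.
    Returns None if the rate is at or below baseline (credits refill)."""
    baseline = gp2_baseline_iops(volume_gb)
    drain_rate = min(sustained_iops, GP2_BURST_IOPS) - baseline
    if drain_rate <= 0:
        return None
    return GP2_CREDIT_BUCKET / drain_rate
```

For example, a 100 GB volume (baseline 300 IOPS) sustaining 3,000 IOPS empties a full bucket in 5,400,000 / 2,700 = 2,000 seconds, roughly 33 minutes – after which IOPS collapse to baseline. The CloudWatch `BurstBalance` metric shows this directly.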

To improve reliability, they could also check that RDS is configured to run multi-AZ, that the RDS maintenance window is configured sensibly for their use-case, and that CloudWatch alerts are configured to proactively detect performance problems.

Fix part four: postmortems

I was surprised that this problem had affected them multiple times. A healthy way to avoid that is to do a postmortem after each incident. This should be blame-free, it should identify the underlying causes, and should include concrete improvements to prevent this from happening again. It should be shared with the affected customers so they understand you are sorry and will avoid it in the future.

Without such postmortems, history will repeat itself.


What everyone needs to know about migrating Applications to the Cloud

Why bother migrating to the cloud? Seriously, who is really doing it? Why are those people doing it? For what outcomes? Hang on a minute, isn’t cloud bad for security? Doesn’t cloud cost “a lot”? Isn’t cloud “just a trend”? “We’ll never put production in the cloud” (didn’t we say that about virtualisation over a decade ago?).

The top 1000 companies in the world have about 2,000 applications each. About 15% of those applications are in the public cloud today, and 75% will be there by 2020. That means roughly 1.2 million workloads (60% of 2 million applications) are going to move to the cloud. And that’s just the top 1000. What about the millions of businesses in the UK?

What about you?

Woah. Wait a minute, Steve. It sounds too good to be true.

What are the benefits? Risks? What is it? How do I do it?

That’s why we at Cloudsoft have started a series on exposing everything about the cloud.

Starting with this white paper to share our experience and help people.

Get the paper – What everyone needs to know about migrating applications to AWS

We, at Cloudsoft, live and breathe applications in AWS. We are obsessed and do war games and all kinds of things that a modern MSP on AWS should do. AWS cloud plumbing, for us, is table stakes. Anyone can do it, but it is often done surprisingly badly.

We are application, automation and AWS experts. In person and in code. We are a friction-free, happy-friendly co-collaborator.

At Cloudsoft we supercharge our customers with best practice cloud finance, operations and security as basic necessities. Then we add our secret sauce: we add application-focused goodness. This is unique.

We believe in show-and-tell so we are writing a series of papers and sharing everything we know. Here is our first example, but we will add more every week.

Enjoy. Please (PLEASE) feedback. We seriously want to hear from you 🙂


What everyone needs to know about migrating applications to AWS


#CEE18 Cloudsoft activities at CloudExpo Europe 2018


I’m writing this while hurtling down the east coast of the UK in a train at 125mph en route to #CEE18 CloudExpo Europe 2018. Very excited, and here’s why!

I’m representing Cloudsoft and talking about our Rx3 solution that will migrate, run and evolve business applications in the cloud.

We are working hard to democratize our cloud capabilities to a wider UK market and a key part of that is working with IDE Group as a leading UK partner.

Hey YOU! Are you at #CEE18? Get in touch with the form at the bottom of this post so we can meet up and talk applications, cloud, security, costs, operations 🙂

We are doing two key pieces at CloudExpo:

  • The Interview: Jay Bradley of IDE Group interviews Steve Chambers of Cloudsoft on cloud experiences.
  • The Show: Steve Chambers will be on Compare the Cloud’s DisruptLIVE show. This will be a fun 15 minutes of putting the cloud world to rights!

The Interview: Migrating to the public cloud: how to achieve a positive outcome

Come and join us! 21-Mar-2018   Agile Networks (SDN & NFV) Theatre
Businesses migrating to the public cloud need to avoid a range of pitfalls, wrong turns and dead-ends. If these are successfully navigated, the cloud experience will be less expensive, availability will improve, and businesses will be able to take advantage of all the opportunities presented by hyperscale public cloud pioneers like AWS and Azure.

Session host, Jay Bradley of IDE Group, will interview Steve Chambers, COO of Cloudsoft, who has a wealth of experience “doing public cloud” for companies large and small.

With a focus on aligning technology to business outcomes, Jay will question Steve about what to aim for and what to watch out for when migrating to the cloud, running applications in the cloud, and the important, but often missed, evolution of applications to exploit cloud innovation.

The Show: Jez Black interviews Steve Chambers on putting the cloud world to rights

Let’s meet at #CEE18!

Get in touch and let’s talk applications, costs, security and operations in the cloud.


Cloudsoft’s Andrew Kennedy announced as a Hyperledger Technical Ambassador

Hyperledger Sawtooth

Andrew Kennedy is a Distributed Systems Hacker on Cloudsoft’s engineering team. His latest work has resulted in him not only becoming a maintainer for the open source Hyperledger Sawtooth project but also being invited to join the Hyperledger Technical Ambassador program. Andrew specialises in building Cloudsoft blockchain projects on top of our managed AWS offering. Currently, he’s the lead engineer for our ongoing work with Blockchain Technology Partners.

While congratulating Andrew on his stellar progress and catching up on his contributions to open source projects during his work with Blockchain Technology Partners, I managed to get some more information out of him about how this all came about.

“Having originally focused on Hyperledger Fabric, we started working with Hyperledger Sawtooth late last year. It’s an interesting blockchain framework because the architecture is extremely flexible, with pluggable consensus models and transaction processors, including a module that handles Ethereum EVM code for smart contracts.”

Cloudsoft has a long history as a significant contributor to open source through Apache Brooklyn and Apache jclouds, and now you’re contributing to projects from the Hyperledger community: what is it you like about open source?

“One of the things I love about open source is that if there is a problem you can go right ahead and fix it, and if your fix works, everyone else using the project gets to take advantage. For example, in the process of building a proof of concept deployment for a customer, I ended up doing a lot of work on the Seth (Sawtooth-Ethereum) transaction processor.

I refactored Seth to be compatible with the latest API changes in the core project and updated features to bring it into line with the Ethereum specifications. This involved interacting with the other Sawtooth developers, from Intel and bitwise.io, who were very helpful in getting me up to speed with the project.

After several of my pull requests had been approved, I was asked to become a maintainer for the Sawtooth project, which allows me to work more closely with the rest of the team on adding new features and improving the software.”

So what’s next as a Technical Ambassador for the Hyperledger Community?

“The Hyperledger community includes not just large entities like IBM and Intel, but also small, agile, specialist companies like Cloudsoft and Blockchain Technology Partners.

The technical ambassador program is a way for people like me, who are involved in the development of the projects, to share in-depth knowledge about the software with other engineers, from how it works and how it’s built to ways of using it in an existing software stack.”

I can’t wait to see the outcome of the next stage of Andrew’s work, both in terms of customer success and open source contributions: talk about a win-win! We’ll get Andrew to blog more on his progress in a few weeks’ time. In the meantime, we’ll leave the last word to our customer:

“Andrew is a tremendous software engineer and an outstanding contributor to open source projects, having cut his teeth on Apache Qpid while at JP Morgan. We are delighted to see his BTP-sponsored contributions to Hyperledger Sawtooth recognized by the Hyperledger community with his appointment as a Technical Ambassador, as well as a Sawtooth committer and Seth maintainer,” Duncan Johnston-Watt, CEO of Blockchain Technology Partners.


SDI18 A Practical Guide to Cloud for Grown Ups


A cheeky title, I know. Pejorative even. But there’s a serious message behind the laughter.

Cloud finance, security and operations are the top challenges and initiatives in the cloud, especially for advanced cloud consumers. But many only learn this the hard way, as shown by the recent 2018 RightScale State of the Cloud Report (n=997).

For example, the Top 5 Challenges according to survey respondents, grouped by cloud maturity:

Top 5 Cloud Challenges

I think this is partly to do with legacy IT mindsets – what IT Service Management practitioners call “IT’s Bad Parenting”. Basically, when you invest millions of pounds in your on-premises datacenter, the feeling is that it’s secure (because it’s in the basement) and cheap (because you don’t see the bills). Move that thinking to the cloud, and problems happen.

This isn’t a new phenomenon: over two years ago I was on a webinar where a VP of Technology complained that “…nobody is watching the finances. And multi-cloud amplifies the problem because all the resources and billing is different….”

At Cloudsoft we deal with this every day on behalf of clients:

  1. We migrate applications into the cloud,
  2. We run those applications in the cloud,
  3. We evolve those apps to exploit the cloud.

We take care of a lot of security, finance and operational aspects so customers don’t have to.

The Rise of the Cloud Service Delivery Manager

If you don’t have a Cloudsoft to help you, you should get a Cloud Service Delivery Manager (CSDM) in place as soon as possible. This is the person who looks at IT and cloud from the perspective of the business, the customers, the services. The goal of the CSDM is to stop cloud finance, security and operations falling down the cracks (because it’s “Somebody Else’s Problem”).

I thought it would be fun to talk to the ITSM Professional community at the recent Service Desk Institute 2018 conference in Birmingham. I was disappointed that many in my audience couldn’t tell me what the bridge was in the title slide, but they made up for that by being passionate and professional about customers and business. Great crowd!

Here are my slides. I hope they make sense, and please do contact me (details on slide) with any feedback – I’d love to hear from you!

I’d love to hear what you think about this topic, please get in touch!