AWS Load Balancers & WAF: Availability vs Security with 'fail open'.
The AWS Application Load Balancer (ALB) and Web Application Firewall (WAF) are two popular services that play extremely well together. However, as we’ll see, this integration is not quite seamless: they are two separate services, connected via the AWS internal network. So what happens in that rare situation when your ALB can’t get a timely response from its associated WAF to validate an HTTP request? With the recent addition of “WAF fail open”, you now have a choice: whether to remain secure, or to remain available.
“Fail closed” vs “Fail open”
By default, the ALB takes a security-first approach: if the WAF cannot check the request, it is treated as malicious. The request is blocked, nothing is forwarded to the back-end service, and a 500 “internal server error” is returned to the client.
AWS calls this “fail closed”: think “closed” like a door (not a switch)—nothing at all gets through. Although secure, this has one obvious drawback: your service becomes unavailable to regular, non-malicious users if the WAF cannot respond. The secure option leads to downtime which may affect your SLAs!
You can enable WAF fail open to improve your application’s availability, but this is a tradeoff against security. If the WAF is unable to check the request, the request will still be forwarded to your back-end service to be processed.
The Reliability Risk
We first encountered this problem during a partial availability zone outage affecting the London (eu-west-2) region in August 2020. Despite having a highly available architecture spanning multiple availability zones, approximately one third of user requests failed with 500 responses over a half-hour period. The failures were requests hitting the ALB node in the affected availability zone.
Digging deeper, we found many errors like the one below in the ALB access logs:
xxxxxxxxxxxx_elasticloadbalancing_eu-west-2_app.xxxxxxxx.xxxxxxxxxxxxxxxx_20200825T0930Z_xxx.xxx.xxx._xxxxxxxx.log.gz:https 2020-08-25T09:25:37.301101Z app/xxxxxxxx/xxxxxxxxxxxxxxxx xxx.xxx.xxx.xxx:48362 - -1 -1 -1 500 - 1605 318 "POST https://xxxxxx.com:443/xxxxxx HTTP/1.1" "Amazon CloudFront" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-west-2:xxxxxxxxxxxx:targetgroup/xxxxxxxx/xxxxxxxxxxxxxxxx "xxxxxxxx" "xxxxxxxx.com" "session-reused" -1 2020-08-25T09:25:33.259000Z "waf-failed" "-" "WAFConnectionTimeout" "-" "-" "-" "-"
According to AWS support, this is a known issue: “your clients can experience intermittent waf-failed requests, which result in 500 HTTP response code. These responses are generated by the load balancer when it encounters errors evaluating requests with your configured WAF rules.” There is a “fail-closed mechanism” that the load balancer uses to enforce that a request either is properly evaluated against the configured WAF rules and passes them, or fails.
The AWS WAF SLA is 99.95% uptime. AWS offers 10% service credits if there is more than 21 minutes of outage in a given month, or 25% if there is more than 7 hours. That payout is tiny compared to the impact of failed requests for some applications. On the plus side, AWS are extremely good at meeting their SLAs (better than competing cloud providers, in our experience).
To build reliable systems, we want to avoid any outages in sub-components causing the system to fail. If you rely on all sub-components being healthy, then the overall availability of your system is the product of each component’s availability - this quickly gives a much lower level of availability.
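As a quick illustration of that product effect (the figures are illustrative, using the WAF's own 99.95% SLA as the per-component availability):

```python
def serial_availability(components: list[float]) -> float:
    """Overall availability of a system that needs every component healthy:
    the product of the individual availabilities."""
    result = 1.0
    for availability in components:
        result *= availability
    return result

# Three components, each at 99.95% (illustrative figures):
overall = serial_availability([0.9995, 0.9995, 0.9995])
# ~0.99850 - already below 99.9%, even though every part is at 99.95%
```

Each extra serial dependency multiplies in another "three and a half nines", so a chain of individually respectable components soon falls below the availability target of the whole system.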
So how could we improve the availability?
AWS Load Balancer “WAF fail open”
Back in August 2020, we discussed this with AWS support. Instead of the fail-closed mechanism when the load balancer fails to get a response from the WAF, this could be set to “fail open”.
This was a hidden feature: we could ask AWS support to enable this configuration option for our account/region so we could configure the load balancer to instead “fail open”. If the load balancer cannot get a response from the AWS WAF, it would then still forward the request to the servers in our target group.
A few months later, this feature became generally available, with no fanfare or AWS announcement that we noticed - just an addition to the AWS documentation.
Note: this is very different from the load balancer’s existing target-group fail-open behaviour, where requests are routed to all targets when every member of the target group is unhealthy.
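Once generally available, the setting is exposed as the load balancer attribute `waf.fail_open.enabled`. A minimal sketch of flipping it via boto3's `elbv2` client - the ARN is a placeholder, and the actual AWS call is shown in comments so the snippet stands alone:

```python
def waf_fail_open_attributes(enabled: bool) -> list[dict]:
    """Build the attribute list for elbv2 modify_load_balancer_attributes."""
    return [{"Key": "waf.fail_open.enabled",
             "Value": "true" if enabled else "false"}]

# With boto3 (not executed here), this would be applied as:
#   import boto3
#   elbv2 = boto3.client("elbv2")
#   elbv2.modify_load_balancer_attributes(
#       LoadBalancerArn="arn:aws:elasticloadbalancing:...",  # your ALB's ARN
#       Attributes=waf_fail_open_attributes(True),
#   )
```

The same attribute can of course be set from the console, the CLI, or infrastructure-as-code tooling.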
The Security Risk
The “fail-closed mechanism” is clearly the most secure: a request will only reach your servers if it has been properly evaluated against the configured WAF rules and passed them. Changing this to “fail open” is a trade-off between reliability and security.
This is something for your security team and business to decide: what are the security risks of temporarily bypassing the WAF, what availability does your application really need, and what are the alternatives?
Worst case, consider a sophisticated attacker: if they suspect some target applications are using fail-open then they might launch an attack when there is an AWS availability zone outage. Some of the malicious requests could bypass the WAF.
This could be handled with defence-in-depth. For example, don’t just rely on the WAF to defend against SQL Injection attacks, but also implement defences in your code such as careful sanitising of inputs. Of course there is another trade-off here for the increased implementation effort as there can be many WAF rules in use.
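As an illustration of that defence-in-depth layer, parameterised queries neutralise an injection payload even when the WAF never saw the request (hypothetical table and data, using Python's built-in sqlite3):

```python
import sqlite3

# Illustrative schema and data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

user_input = "alice' OR '1'='1"  # a typical SQL injection payload

# Placeholder binding treats the input purely as data, never as SQL,
# so the payload simply matches no rows:
rows = conn.execute("SELECT name FROM users WHERE name = ?",
                    (user_input,)).fetchall()
# rows == []
```

The same principle (bind parameters, never string concatenation) applies whatever database driver your back-end uses; the WAF then becomes an extra layer rather than the only defence.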
Detecting WAF Failures
If you configure “fail open”, your infosec team will likely ask: how can we detect when, and for which requests, the WAF check was skipped?
Unfortunately there are no CloudWatch metrics for this.
Fortunately, you can detect this in the ELB access logs (although testing your monitoring configuration is problematic, given the rarity of the situation!). Checking with AWS Support, we’d expect the actions_executed field to contain “waf-failed”. It is also worth checking (but not solely relying upon) the error_reason field for any of:
- WAFConnectionError: The load balancer cannot connect to AWS WAF.
- WAFConnectionTimeout: The connection to AWS WAF timed out.
- WAFResponseReadTimeout: A request to AWS WAF timed out.
- WAFServiceError: AWS WAF returned a 5XX error.
- WAFUnhandledException: The load balancer encountered an unhandled exception.
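A detection sketch along those lines, scanning access log lines for the waf-failed action or any of the WAF error reasons. This is a simple field-membership check rather than a full positional parse of the ALB log format, which makes it tolerant of format additions at the cost of a (very unlikely) false positive if one of these strings appears in another field:

```python
import shlex

WAF_ERROR_REASONS = {
    "WAFConnectionError",
    "WAFConnectionTimeout",
    "WAFResponseReadTimeout",
    "WAFServiceError",
    "WAFUnhandledException",
}

def is_waf_failure(log_line: str) -> bool:
    """Return True if an ALB access log line records a WAF evaluation failure.

    shlex.split honours the log's quoted fields, so multi-word fields such as
    the request line or user agent stay as single tokens.
    """
    fields = shlex.split(log_line)
    return "waf-failed" in fields or bool(WAF_ERROR_REASONS & set(fields))
```

In practice you would run a check like this over the logs delivered to S3 (e.g. via Athena or a small Lambda) and alert on any match.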
Alternatives to WAF “fail open”?
What are good alternatives for meeting reliability goals, without resorting to WAF fail open? There is a lot of great material on this topic, such as general advice and best practices in the AWS well-architected reliability pillar.
The most important design pattern is to implement retries (ideally with timeouts, exponential backoff and jitter) in the client. This is often possible where you are in control of the client code, or where you can influence the users of your API.
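A generic sketch of that client-side pattern - exponential backoff capped at a maximum delay, with “full jitter” on each wait. The names and tuning values here are our own, not from any particular SDK:

```python
import random
import time

def retry_with_backoff(call, attempts=4, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep):
    """Call `call`, retrying on failure with exponential backoff + full jitter.

    Delays grow as base_delay * 2**attempt, capped at max_delay, and each
    actual wait is drawn uniformly from [0, delay] ("full jitter") so that
    many retrying clients don't all hammer the service in lockstep.
    """
    for attempt in range(attempts):
        try:
            return call()
        except Exception:  # in real code, catch only retryable errors
            if attempt == attempts - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))
```

In production you would also bound the total elapsed time and retry only on errors that are plausibly transient (such as the 500s seen during the WAF outage).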
For an availability zone outage like the one we experienced, it would be possible to reconfigure the application load balancer on-the-fly to temporarily remove the faulty availability zone. However, we didn’t go down that road: AWS support discouraged it when we raised the idea; it would be difficult to automate (tricky to confidently detect that the failure scenario justified this extreme response); difficult to realistically simulate the scenario for testing; and the alternatives felt better for our use-cases.
The load balancer “WAF fail open” attribute will be a useful configuration option for some applications. However, the security implications require careful consideration. Alternative design patterns could deliver the same reliability without the risks.
When thinking about these trade-offs, it’s also important to be realistic about what availability your application really needs - for example, many applications do not need anywhere close to 99.99% availability. Error budgets can be a great way to think about this.
If you want to discuss application reliability and security with a cloud expert, you can speak to Cloudsoft to find out how we can help.