Handling AWS EC2 VM Failures
What happens if an EC2 instance fails to start? This post covers how to troubleshoot the failure and, more importantly, how to stop it from becoming a serious issue in production.
The Problem
I hit a problem resizing a VM in AWS EC2. To reduce the cost of one of our dev VMs, I changed its instance type. However, the instance failed to come up properly. The CloudWatch instance status check showed:
Instance reachability check failed at March 1, 2018 at 2:47:00 PM UTC (8 minutes ago)
Diagnosing
Have you tried turning it off and on again? I did that, but the same thing happened (including after reverting to the previous instance type).
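For completeness, the off-and-on-again step can also be scripted rather than done through the console. A minimal boto3 sketch, where the instance ID and region are placeholders:

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # placeholder region
instance_id = "i-0123456789abcdef0"                 # placeholder instance ID

# Stop the instance and wait until it has fully stopped.
ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

# Start it again and wait for it to reach the "running" state.
# Note this only means EC2 has started it; the status checks come later.
ec2.start_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])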
Next up was to follow the excellent steps in the AWS documentation for Troubleshooting Instances with Failed Status Checks.
The instance's system log gave very useful information. It is accessible through the EC2 console via "Instance Settings -> Get System Log". For my VM, it showed:
[ 1.653949] List of all partitions:
[ 1.655877] No filesystem could mount root, tried:
[ 1.657906] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
[ 1.658780] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 3.10.0-693.2.2.el7.centos.plus.x86_64 #1
[ 1.658780] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[ 1.658780] ffffffffffffff00 0000000087cd85cd ffff880086607d68 ffffffff816b4d84
[ 1.658780] ffff880086607de8 ffffffff816aec47 ffffffff00000010 ffff880086607df8
[ 1.658780] ffff880086607d98 0000000087cd85cd 0000000087cd85cd ffff880086607e00
[ 1.658780] Call Trace:
[ 1.658780] [<ffffffff816b4d84>] dump_stack+0x19/0x1b
[ 1.658780] [<ffffffff816aec47>] panic+0xe8/0x20d
[ 1.658780] [<ffffffff81b5d5f5>] mount_block_root+0x291/0x2a0
[ 1.658780] [<ffffffff81b5d657>] mount_root+0x53/0x56
[ 1.658780] [<ffffffff81b5d796>] prepare_namespace+0x13c/0x174
[ 1.658780] [<ffffffff81b5d273>] kernel_init_freeable+0x1f2/0x219
[ 1.658780] [<ffffffff81b5c9d4>] ? initcall_blacklist+0xb0/0xb0
[ 1.658780] [<ffffffff816a3c80>] ? rest_init+0x80/0x80
[ 1.658780] [<ffffffff816a3c8e>] kernel_init+0xe/0xf0
[ 1.658780] [<ffffffff816c5f18>] ret_from_fork+0x58/0x90
[ 1.658780] [<ffffffff816a3c80>] ? rest_init+0x80/0x80
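As an aside, the same system log can be pulled programmatically rather than through the console. A minimal boto3 sketch (the instance ID and region are placeholders; the API returns the output base64-encoded):

import base64
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # placeholder region

# Fetch the console output for the instance - the same text the EC2
# console shows under "Get System Log". It is returned base64-encoded.
resp = ec2.get_console_output(InstanceId="i-0123456789abcdef0")
print(base64.b64decode(resp.get("Output", "")).decode("utf-8", errors="replace"))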
That "Unable to mount root fs" kernel panic led me to filesystem and kernel troubleshooting.
I detached the volume, then attached and mounted it on another VM to inspect the file system - it all looked fine. The kernel version also looked fine.
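For anyone wanting to script that inspection step, here is a rough boto3 sketch of the detach and re-attach part. The volume ID, helper instance ID and region are placeholders, the failed instance must already be stopped, and both instances must be in the same availability zone:

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # placeholder region

broken_volume = "vol-0aaaaaaaaaaaaaaaa"    # root volume of the failed instance
helper_instance = "i-0bbbbbbbbbbbbbbbb"    # healthy instance in the same AZ

# Detach the root volume from the (stopped) failed instance and wait
# until it is available again.
ec2.detach_volume(VolumeId=broken_volume)
ec2.get_waiter("volume_available").wait(VolumeIds=[broken_volume])

# Attach it to the helper instance as a secondary device.
ec2.attach_volume(VolumeId=broken_volume,
                  InstanceId=helper_instance,
                  Device="/dev/sdf")
ec2.get_waiter("volume_in_use").wait(VolumeIds=[broken_volume])

# On the helper instance itself, mount and inspect the filesystem; the
# device name there may differ (e.g. /dev/xvdf1 or /dev/nvme1n1p1):
#   sudo mkdir -p /mnt/inspect && sudo mount /dev/xvdf1 /mnt/inspect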
Best Practices
Confession: much of the investigation above was just for my general interest. To get the service healthy again quickly, we instead just provisioned a replacement VM.
A few of the techniques we use to avoid ever needing to do this in production include:
- Don't store business data on the root EBS volume. Instead, attach and mount additional EBS volumes when required.
- Use configuration-as-code so we can easily spin up a replacement VM.
- Use CloudWatch alarms, along with on-box monitoring, to quickly detect such problems (see the alarm sketch after this list).
- Automate the replacement of failed servers, e.g. using AWS auto-scaling groups or policies in Apache Brooklyn (a self-healing auto-scaling group sketch also follows below).
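For the CloudWatch point above, here is a minimal boto3 sketch of an alarm on the instance reachability metric - the same check that flagged the failure earlier. The instance ID, region and SNS topic ARN are placeholders:

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")  # placeholder region

# Raise an alarm when the instance reachability check fails for two
# consecutive one-minute periods, and notify an (assumed) SNS topic.
cloudwatch.put_metric_alarm(
    AlarmName="dev-vm-status-check-failed",          # placeholder name
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed_Instance",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=2,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:ops-alerts"],  # placeholder ARN
)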
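And for automated replacement, a single-instance auto-scaling group is one simple option: if the instance fails its EC2 health checks, the group terminates it and launches a replacement. A boto3 sketch, assuming a launch template for the VM already exists (all names and IDs are placeholders):

import boto3

autoscaling = boto3.client("autoscaling", region_name="eu-west-1")  # placeholder region

# A min=max=1 group: the single instance is replaced automatically if it
# becomes unhealthy according to the EC2 status checks.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="dev-vm-self-healing",      # placeholder name
    LaunchTemplate={"LaunchTemplateName": "dev-vm-template", "Version": "$Latest"},
    MinSize=1,
    MaxSize=1,
    DesiredCapacity=1,
    VPCZoneIdentifier="subnet-0123456789abcdef0",    # placeholder subnet
    HealthCheckType="EC2",
    HealthCheckGracePeriod=300,
)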
Get a 7-day free trial of our easy-to-use Visual Composer, built to create CloudFormation templates graphically.