Automating ECS cluster upgrades with CloudFormation and Lambda

Alexander Savchuk
Published in Xero Developer
May 11, 2018 · 8 min read



"Cumulus clouds panorama" by fir0002 | flagstaffotos.com.au (GFDL 1.2), via Wikimedia Commons

At Xero, we use the Amazon ECS-optimised AMI as the base image for our ECS clusters. AWS frequently releases updates to this AMI, which require us to first bake our custom image on top of the latest Amazon AMI and then roll it out to all of our ECS clusters.

We can automate the first part with a simple lambda, triggered by new messages in the Amazon ECS-Optimised AMI Update Notifications SNS topic. The lambda then kicks off the SSM automation that performs the actual baking and sharing of the AMI.
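A minimal sketch of such a lambda, assuming Python and boto3; the automation document name and its parameter are placeholders for our internal baking automation:

```python
import os
import boto3

ssm = boto3.client("ssm")

def handler(event, context):
    # The SNS notification carries details about the newly released ECS-optimised AMI.
    message = event["Records"][0]["Sns"]["Message"]

    # Kick off the SSM automation that bakes and shares our custom AMI.
    # "BakeEcsAmi" and "sourceAmiDetails" are illustrative names only.
    ssm.start_automation_execution(
        DocumentName=os.environ.get("AUTOMATION_DOCUMENT", "BakeEcsAmi"),
        Parameters={"sourceAmiDetails": [message]},
    )
```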

The second part is a bit more involved, and that’s what I will cover in this blog.

Set up rolling updates on the ECS Auto Scaling group(s)

The first step is to make sure that replacing instances in the Auto Scaling group(s) forming the ECS cluster is an easy and foolproof process. When you update the Launch Configuration with a new AMI, Auto Scaling does… absolutely nothing. All new instances will get the new configuration (with the latest AMI and other changes), while all existing servers stay the way they are until termination. It is possible to force instance cycling by either shutting instances down one by one or doubling the Auto Scaling group size and then scaling it back, but there is a better way.

We can use the UpdatePolicy attribute of CloudFormation to control the process of cycling instances. UpdatePolicy is only exposed in CloudFormation, which can be a bit of a problem if we use another tool (like Terraform) to codify the rest of our infrastructure. The good news is that we can use a CloudFormation stack resource in our Terraform template, even if this does look a little bit weird.
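A sketch of what this can look like; the resource name, template file, and variables are illustrative:

```hcl
# Wraps the ECS Auto Scaling group CloudFormation template in a
# Terraform-managed stack so we keep access to UpdatePolicy.
resource "aws_cloudformation_stack" "ecs_asg" {
  name          = "ecs-cluster-asg"
  template_body = file("${path.module}/ecs-asg.yaml")

  parameters = {
    ClusterName     = var.cluster_name
    ImageId         = var.ecs_ami_id
    DesiredCapacity = var.desired_capacity
  }
}
```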

On the bright side, CloudFormation allows us to use the cfn-signal helper script, which helps validate that userdata has run successfully. In its most basic form, the userdata script can wait for the ECS agent to start and then report the exit code to CloudFormation. If it is 0, all is well and CloudFormation proceeds with creating another instance and terminating an old one.
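A sketch of such a userdata script, assuming the Amazon Linux ECS-optimised AMI; the cluster, stack, resource, and region names are placeholders:

```bash
#!/bin/bash
# Join the cluster and make sure the CloudFormation helper scripts are available.
echo "ECS_CLUSTER=my-cluster" >> /etc/ecs/ecs.config
yum install -y aws-cfn-bootstrap

# Wait (up to ~5 minutes) for the local ECS agent introspection endpoint to respond.
for i in $(seq 1 60); do
  curl -sf http://localhost:51678/v1/metadata > /dev/null && break
  sleep 5
done
curl -sf http://localhost:51678/v1/metadata > /dev/null

# Report the exit code of the check to CloudFormation; a non-zero code causes
# the rolling update to roll back to the previous Launch Configuration.
/opt/aws/bin/cfn-signal -e $? \
  --stack my-stack --resource ECSAutoScalingGroup --region us-east-1
```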

If the ECS agent fails to start for whatever reason, CloudFormation attempts to roll the Auto Scaling group back to the old Launch Configuration, whose AMI and userdata have already proven to work.

If userdata gets more complex over time, we can extend this process by running an arbitrary suite of tests on the instance to verify that it’s in the desired state before reporting the success code to CloudFormation.

The reference architecture for Amazon ECS provides a good example of using the ‘AutoScalingRollingUpdate’ policy, but we should tweak this template a little bit:

  • set ‘MinInstancesInService’ to the minimum size of the Auto Scaling group instead of 1 to ensure that we never go below what’s reasonable for our cluster
  • set ‘MaxBatchSize’ to whatever makes sense for a given ECS cluster — if the cluster is rather large, bumping this value up a bit will speed up cluster upgrades
  • add ‘MinSuccessfulInstancesPercent’ parameter and set it to something lower than 100 — it doesn’t normally make sense to roll back the whole deployment because of some random transient error, like a network hiccup
  • add ‘IgnoreUnmodifiedGroupSizeProperties’ scheduled action and set it to ‘true’ — this means CloudFormation will not try to reset the size of the group to the desired size if it was changed due to autoscaling actions
  • add ‘DeletionPolicy’ and set it to ‘Retain’ to protect against accidental stack deletion (this is complemented nicely by setting a stack policy to prevent such mishaps in the first place)

The resulting snippet can look like this:
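(A sketch only; the resource names, sizes, and timeouts are illustrative and should be tuned per cluster.)

```yaml
ECSAutoScalingGroup:
  Type: AWS::AutoScaling::AutoScalingGroup
  DeletionPolicy: Retain                      # protect against accidental stack deletion
  Properties:
    LaunchConfigurationName: !Ref ECSLaunchConfiguration
    MinSize: 4
    MaxSize: 12
    DesiredCapacity: 4
    VPCZoneIdentifier: !Ref Subnets
  CreationPolicy:
    ResourceSignal:
      Timeout: PT15M
  UpdatePolicy:
    AutoScalingRollingUpdate:
      MinInstancesInService: 4                # never go below the minimum size of the group
      MaxBatchSize: 4                         # larger batches speed up upgrades of big clusters
      MinSuccessfulInstancesPercent: 75       # tolerate transient failures without rolling back
      PauseTime: PT15M
      WaitOnResourceSignals: true             # wait for cfn-signal from userdata
    AutoScalingScheduledAction:
      IgnoreUnmodifiedGroupSizeProperties: true   # don't reset sizes changed by autoscaling
```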

Drain ECS tasks from the instance before terminating it

During a cluster update, the Auto Scaling group terminates all old instances. If any ECS tasks are running on them, they are all abruptly stopped, which essentially causes a micro-outage. In a cloud environment, all clients need to be able to deal with such failures, and we do use Chaos lambda to deliberately inject faults into many parts of our stack. This, however, needs to be done on a separate schedule and not be part of regular deployments.

We can remove a container instance from the cluster by setting its state to DRAINING and giving it some time. The ECS scheduler will take care of shifting service tasks to other container instances in the cluster. It's important to note that tasks which don't belong to a service (e.g. daemon tasks that must run on each node, similar to a Kubernetes DaemonSet) are not affected by this, and we have to manage them separately.
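Setting the state is a single API call; for example, with boto3 (the cluster name and container instance ARN are placeholders):

```python
import boto3

ecs = boto3.client("ecs")

# Put the container instance into DRAINING so the scheduler moves service tasks elsewhere.
ecs.update_container_instances_state(
    cluster="my-cluster",
    containerInstances=["arn:aws:ecs:us-east-1:123456789012:container-instance/abcd1234"],
    status="DRAINING",
)
```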

We can automate this process by leveraging Auto Scaling lifecycle hooks. This blog post from AWS explains the process well, but at a high level the sequence of actions looks like this:

  1. When the Auto Scaling group initiates the termination of one of the instances, the lifecycle hook gets triggered and puts the server into a ‘Terminating:Wait’ state. It will remain in this state until the timeout period ends or some external agent completes the lifecycle action, which continues the termination process.
  2. The lifecycle hook also sends a message to an SNS topic.
  3. This message invokes a lambda, which finds the container instance id of the EC2 instance about to be terminated, puts it into ‘DRAINING’ state (if it isn’t in it already), and checks if there are any tasks still running on this server.
  4. If there are any, the lambda sends another message to the same SNS topic to trigger itself again.
  5. If the instance has no running tasks, the lambda completes the lifecycle action, and the Auto Scaling group terminates the server.
Using a lifecycle hook to gracefully drain tasks from an instance before terminating it (AWS Compute Blog)
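A minimal sketch of that loop, roughly following the AWS sample; the cluster name is a placeholder, and for brevity only one known cluster is scanned:

```python
import json
import boto3

ecs = boto3.client("ecs")
sns = boto3.client("sns")
autoscaling = boto3.client("autoscaling")

CLUSTER = "my-cluster"  # placeholder

def find_container_instance(ec2_instance_id):
    # Map the EC2 instance id from the lifecycle event to an ECS container instance ARN.
    # If the cluster isn't known in advance, every cluster has to be scanned this way.
    arns = ecs.list_container_instances(cluster=CLUSTER)["containerInstanceArns"]
    described = ecs.describe_container_instances(cluster=CLUSTER, containerInstances=arns)
    for ci in described["containerInstances"]:
        if ci["ec2InstanceId"] == ec2_instance_id:
            return ci["containerInstanceArn"]

def handler(event, context):
    record = event["Records"][0]["Sns"]
    message = json.loads(record["Message"])

    container_instance = find_container_instance(message["EC2InstanceId"])
    ecs.update_container_instances_state(
        cluster=CLUSTER, containerInstances=[container_instance], status="DRAINING")

    tasks = ecs.list_tasks(cluster=CLUSTER, containerInstance=container_instance)["taskArns"]
    if tasks:
        # Tasks still running: re-publish the message so this lambda is invoked again.
        sns.publish(TopicArn=record["TopicArn"], Message=record["Message"])
    else:
        # Instance is empty: let the Auto Scaling group finish terminating it.
        autoscaling.complete_lifecycle_action(
            LifecycleHookName=message["LifecycleHookName"],
            AutoScalingGroupName=message["AutoScalingGroupName"],
            LifecycleActionToken=message["LifecycleActionToken"],
            LifecycleActionResult="CONTINUE")
```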

The sample lambda mostly works as expected. There were, however, some issues when I tried to run it in one of our clusters:

  1. The lambda only sets the instance to the ‘DRAINING’ state and then waits passively for all tasks to be shifted away by the scheduler. This approach does not work for tasks running outside of any service — for example, daemon tasks such as cAdvisor, which are started from userdata at container instance launch time.
  2. The lambda essentially runs in a very tight infinite loop and only breaks out of it if either there are no more tasks running on the instance or it reaches the heartbeat timeout on the lifecycle hook. Because the ECS scheduler cannot migrate the daemon tasks, the lambda sends an excessive number of SNS messages and keeps invoking itself until the heartbeat timeout runs out.
  3. The lambda makes API calls to retrieve any parameters it needs, such as the container instance id or the SNS topic ARN. Combined with #2, this results in getting rate-limited by AWS. The lambda can no longer retrieve the parameters it needs, and even the web console becomes inoperable, which is a deal breaker during cluster upgrades.

To resolve these issues, we rewrote and open-sourced the drain-lambda, and it’s now available on GitHub. Here’s a summary of the main changes:

  1. The lambda forcefully stops all daemon tasks, since the ECS scheduler cannot migrate them; the scheduler migrates the service tasks as usual.
  2. We added a brief stagger between invocations, so the lambda doesn’t spam SNS as much.
  3. The container instance id is only retrieved once. This turned out to be a surprisingly non-trivial operation, as there is no direct relationship between the EC2 instance id, which we get from the Auto Scaling group as part of the lifecycle event, and the container instance id, which can only be determined by listing all ECS clusters. Once found, the container instance id is included in the SNS message, so subsequent lambda invocations no longer need to make an API call to get it.
  4. We also added a counter which serves as a circuit breaker, so the lambda never exceeds a predefined number of invocations. The problem with relying on timeouts alone is that if the lambda is invoked outside of the Auto Scaling lifecycle (for example, manually during testing), it will keep sending SNS messages and invoking itself forever.
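For example, the re-published message might carry this state along; the field names and limits are illustrative, not necessarily those used by the actual lambda:

```python
import json
import time

def republish(sns, topic_arn, message, container_instance_arn,
              max_invocations=40, stagger_seconds=20):
    # Carry the container instance ARN so later invocations skip the lookup,
    # plus a counter that acts as a circuit breaker against runaway self-invocation.
    count = message.get("InvocationCount", 0) + 1
    if count > max_invocations:
        raise RuntimeError("exceeded the maximum number of drain invocations")

    message["InvocationCount"] = count
    message["ContainerInstanceArn"] = container_instance_arn

    # Brief stagger so the lambda doesn't spam SNS in a tight loop.
    time.sleep(stagger_seconds)
    sns.publish(TopicArn=topic_arn, Message=json.dumps(message))
```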

Tag old instances to prevent ECS from shifting tasks to them

Now we have an automated process both for rolling out new AMIs and for moving ECS tasks off container instances at the end of their lifecycle. The only missing piece is controlling where those tasks are shifted to. The ECS scheduler is not aware of Auto Scaling events, so in many cases the target can be a server which will be terminated next. The result is that running tasks are shifted from one old instance to another until they finally bunch up on a few of the new ones. This gif, captured from the excellent presentation by Matt Callanan of Expedia, illustrates the whole process.

Matt Callanan. Automating Zero-Downtime Production Cluster Upgrades for Amazon ECS

Note that since we’re already using drain-lambda, the tasks will be shifted before the termination, not after, but the problem of where they get moved to remains.

There is an elegant solution to this problem, first suggested in this issue in the ECS agent repository. ECS task placement constraints allow us to specify rules the scheduler considers during task placement. For example, we can constrain a task to only run on given instance types. We can also specify other parameters, such as tags, and only allow task placement on instances which don't have a certain tag. A simple lambda can automate this tagging. The complete process looks like this:

  1. All ECS tasks that we care about are constrained to instances which either don't have the ‘drain’ tag, or don't have it set to ‘true’. Here's a Terraform snippet which achieves this:
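(A sketch, assuming the ‘drain’ marker is exposed as an ECS container instance attribute; the service definition and the exact constraint expression are illustrative and may differ from what we actually use.)

```hcl
resource "aws_ecs_service" "app" {
  name            = "app"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 2

  # Only place tasks on instances that are not marked for draining.
  placement_constraints {
    type       = "memberOf"
    expression = "attribute:drain !exists or attribute:drain != 'true'"
  }
}
```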

2. Before triggering the update of the ECS CloudFormation stack, the lambda marks all currently running instances with ‘drain = true’ tag, so the scheduler cannot place any new tasks on them. Tagging the ECS nodes should be a separate CI/CD step.

3. New instances created during the rollout don't have this tag and can receive new tasks. To prevent tasks from congregating on the few instances launched first, the ‘MaxBatchSize’ parameter should be reasonably large, perhaps close to the ‘DesiredCapacity’ parameter.

4. If the stack update fails (or there are no changes to be applied), we should reset the tag so ECS tasks can be placed on the old instances again. A post-build step in the Jenkinsfile, or something similar for other CI systems, can achieve this:
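(A sketch for a declarative Jenkinsfile; the lambda name and payload are placeholders for however the tag lambda is actually invoked, and the no-changes case would need similar handling.)

```groovy
post {
  failure {
    // Reset the 'drain' marker so tasks can be placed on the old instances again.
    sh '''
      aws lambda invoke \
        --function-name ecs-cluster-tag-lambda \
        --payload '{"cluster": "my-cluster", "action": "untag"}' \
        /dev/null
    '''
  }
}
```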

The tag lambda was open-sourced as part of the ecs-cluster-update-lambda project.

Summary

The final sequence of steps to perform an automated ECS cluster upgrade looks like this:

  1. Taint old instances before starting the CloudFormation stack update.
  2. Trigger the stack update. The drain-lambda puts all instances in the cluster into the ‘DRAINING’ state before allowing Auto Scaling to terminate them, and the ECS scheduler gracefully migrates all service tasks to new instances. An outage will only occur for tasks which run in a singleton configuration (i.e. have ‘maximumPercent’ set to 100 and ‘minimumHealthyPercent’ set to 0 in the ECS service definition), and even then it will only happen once, as the tasks are placed on new instances.
  3. If the update fails, all old instances are untainted and can be used again. Otherwise, the ECS cluster will be fully upgraded to the latest AMI.

Main references

ecs-cluster-update-lambda — a set of lambdas used to facilitate zero-downtime ECS cluster updates at Xero

CON314 Automating Zero-Downtime Production Cluster Upgrades for Amazon ECS — slides from a talk by Matt Callanan at AWS re:Invent 2017

How to Automate Container Instance Draining in Amazon ECS — blog post from AWS Compute blog

ECS Container draining sample repository from AWS
