The background to this story is that I recently spent the bulk of a week getting a prototype service deployed with AWS CloudFormation, and the experience was still reasonably painful. My team has other services deployed with CloudFormation, which are working perfectly fine (now), but I had hoped there would be some improvements available since the last time we went through the process.
Firstly I'll describe in a sentence what CloudFormation does, for those who aren't familiar with it, and then list the components which need to go into making something deployable by CloudFormation. CloudFormation itself allows you to describe an entire collection of resources required to make a service runnable and accessible - for example web servers, database servers, a load balancer and perhaps a Memcached cluster.
You might consider the descriptive language it requires somewhat akin to the current generation of configuration management languages, but far more restrictive and minimal. It also tends to occupy itself with what happens at initial provisioning rather than with keeping the whole "stack" consistent over time - tasks which are more or less delegated to other services such as CloudWatch and Auto Scaling. The fundamental input is the CloudFormation template, which requires you to provide the following:
The input is encoded in JSON - not a bad choice in that it is fairly easy to read, but it is also easy to break through human editing, with a missing quote or comma. Depending on your JSON parser of choice, it can be very hard to locate the breaking change when that happens.
Static configuration data is mixed with dynamic configuration data (parameters that can come from external sources, or mappings that depend on other inputs), which complicates the understanding of the template but also vastly enriches its functionality. Here we already have two possible origins of data coming into the system from sources other than the template itself: parameters passed on the command line to a tool, or read from a configuration file (local, stored in revision control, stored in S3, etc.) to be merged in with other parameters.
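To make that concrete, here is a minimal sketch of what such a template looks like; the parameter, mapping and resource names (and the AMI ID) are invented for illustration:

```json
{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "Minimal illustrative stack",
  "Parameters": {
    "InstanceType": {
      "Type": "String",
      "Default": "m1.small",
      "Description": "EC2 instance type for the web tier"
    }
  },
  "Mappings": {
    "RegionToAMI": {
      "eu-west-1": { "AMI": "ami-12345678" }
    }
  },
  "Resources": {
    "WebServer": {
      "Type": "AWS::EC2::Instance",
      "Properties": {
        "InstanceType": { "Ref": "InstanceType" },
        "ImageId": { "Fn::FindInMap": ["RegionToAMI", { "Ref": "AWS::Region" }, "AMI"] }
      }
    }
  }
}
```

Even in this toy example you can see the static and dynamic data interleaved: the Ref and Fn::FindInMap calls are resolved at provisioning time from parameters and mappings, while everything around them is fixed in the template.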
Then there is the user data, which is really my main pain point.
We've described the stack and set up a template, ranging in complexity from very simple to a rich description of different environments and software version requirements - but even with a completely running stack, we don't have any instances actually running our software. You might have generated your own AMIs with your software built into them, configured to start on boot - but in that case you have simply moved your complexity from the provisioning stage to the build stage (and building an AMI as complete as this takes time, both in preparation and on every build).
There are several methods available for delivering bootstrapping instructions to your newly started instances (most of which are described in this document from AWS).
None of these is a perfect solution, and choosing one over another is often simply selecting a point on a complexity scale. I won't go into detail about how each works, but will try to express what pained me most about each.
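As an illustration of the low end of that complexity scale, here is roughly what the simplest method - a plain shell script passed as user data - looks like inside the template. The bucket, artifact and service names are invented, and it assumes the instance has the AWS CLI installed plus an instance profile that can read the bucket:

```json
"UserData": { "Fn::Base64": { "Fn::Join": ["", [
  "#!/bin/bash -e\n",
  "aws s3 cp s3://my-artifact-bucket/myservice.tar.gz /tmp/myservice.tar.gz\n",
  "mkdir -p /opt/myservice\n",
  "tar -xzf /tmp/myservice.tar.gz -C /opt/myservice\n",
  "start myservice\n"
] ] } }
```

It works, but every deployment detail ends up trapped inside string-joined JSON: the quoting is fragile, there is no syntax checking of the embedded script, and any change to the bootstrap logic means editing the template itself.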
I'll now briefly describe another part of the problem which also has an impact on the choice of bootstrapping methodology, before getting to a more complete conclusion.
If you have been paying attention, you will have read at least something recently about the wars raging over the decision to replace init in Debian. I won't provide any links - Google has more than enough fodder on the subject. It's a topic I've rarely concerned myself with (if you've been using RHEL or CentOS for any reasonable amount of time, any system other than the standard SysV-style init has barely been a blip on the radar), but nevertheless I'm aware that the current choice is systemd, for better or worse.
Be that as it may, a bunch of my services run on Ubuntu and hence use Upstart (which I've rarely had problems with, although from my reading on the topic that experience seems to be rare). Attempting to integrate Upstart-based control of a service with the CloudFormation user data or other bootstrapping mechanisms comes with its own challenges.
Despite these challenges, Upstart still allows you to provide a very minimal startup script for your service (mine was about three lines - compare that to your typical init script), along with some conveniences such as automatic logging. It's still not perfectly smooth to deploy a service unless you want to package everything up in a deb/rpm earlier in your build pipeline (and even then you still have to deal with fully configuring your service for the environment and versions at hand), but to be fair, that's not the fault or purpose of Upstart.
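For illustration, a minimal Upstart job in that spirit might look like the following (the job name and binary path are invented, and it assumes Upstart 1.4 or later for the console log stanza):

```
# /etc/init/myservice.conf
description "myservice"
start on runlevel [2345]
stop on runlevel [!2345]
respawn
console log
exec /opt/myservice/myservice
```

The console log stanza sends the process's stdout and stderr to /var/log/upstart/myservice.log, which is the automatic logging convenience mentioned above.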
An initial implementation of ours had the vast majority of the work being done by Ruby scripts, some shell, the AWS Ruby SDK and ERB templating. Specifically, you would call a shell script with some parameters (either manually or from our build system); that would call a Ruby script with those parameters and perhaps some other data divined from the filesystem; and the Ruby script would render the ERB template and make calls using the AWS Ruby SDK to S3 (for build artifacts and more CloudFormation parameters stored in files) and to CloudFormation to create or update the stack.
There were a lot of moving parts, data came from a bunch of different places, and there was no consistent language or division of work between the components. On top of the configuration data being spread amongst several different sources, passing that data between components at runtime - as command-line parameters, environment variables or values embedded in JSON templates - is awkward at best.
So, I'm pretty good at complaining about things, but what does my solution look like? One distinct advantage I have is that I tend to write software in Golang, and hence end up with a single binary artifact from which the service is run. Most of the time there are few or no external supporting files, so there is very little to ship and configure. But nevertheless...
When I next need to make changes to this deployment system, I'll attempt to remove as much of the Ruby code as possible and rely solely on the AWS CLI (the Python-based one). This will let us drop the Ruby SDK requirement (which, unless you are a developer, has very little utility at the command line) in favour of a CLI which can be used for many other things in addition to its integration in our deployment script. It's also a lot lighter-weight than the previous JVM-based CLI utilities. I've already reduced the number of data sources considerably; these could potentially be reduced further, or failing that, we can ensure there are suitable mechanisms for tracing data transformations through the system, so we know where data is coming from and when it has changed.
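Sketching how that might look, the entire create-or-update step collapses into a couple of CLI calls (the stack name, template file and parameter values here are hypothetical):

```sh
# First deployment: create the stack from the template in revision control
aws cloudformation create-stack \
    --stack-name myservice-staging \
    --template-body file://myservice.template \
    --parameters ParameterKey=InstanceType,ParameterValue=m1.small

# Subsequent deployments: update the stack in place with new parameters
aws cloudformation update-stack \
    --stack-name myservice-staging \
    --template-body file://myservice.template \
    --parameters ParameterKey=InstanceType,ParameterValue=m1.medium
```

Everything the old shell-plus-Ruby-plus-ERB chain did is then reduced to assembling that parameter list, which gives you one place to look when tracing where a value came from.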
These are just a few realisations from the latest round of bootstrapping refactoring I had to perform. What have your own experiences been like in bootstrapping nodes and keeping your sanity?