adrift on a cosmic ocean

Writings on various topics (mostly technical) from Oliver Hookins and Angela Collins. We have lived in Berlin since 2009, have two kids, and have far too little time to really justify having a blog.

It's 2015, and I'm still writing config management commands wrapped in for loops.

Posted by Oliver on the 10th of April, 2015 in category Tech
Tagged with: chef, elasticsearch, pain

Warning: this is a bit of a rant. Today my team had to migrate our ElasticSearch cluster from one set of instances in our EC2 VPC to a smaller set of physical machines in our regular datacenter (yes, it's actually cheaper). Both sets of machines/instances are Chef-controlled, which I generally don't have to worry about, but in this case it mattered due to the ordering of steps in the migration.

The general process was something like this:

  • Remove the Logstash role from the nodes we were collecting data from to avoid immediate reconfiguration when we started moving ElasticSearch nodes around.
  • Remove the ElasticSearch role from the old ES cluster, and alter the disk and memory layout settings to match the new hardware.
  • Add the ElasticSearch role to the new ES machines and let them bring up the new cluster.
  • Add the Logstash role back to the data collection nodes, which would then point to the new cluster.

For simplicity's sake, I'll just say we weren't too interested in the data, as we rotate it fairly regularly anyway, so we didn't bother migrating anything.

Last week I was already bitten by a classic Chefism when we realised that the way I'd set up the Logstash recipe was incorrect. To avoid weird and wonderful automatic reconfigurations of the ElasticSearch nodes that Logstash points to, I use a hard-coded array of node addresses in the role attributes, but in the recipe I leave the default setting (in case you forgot to set up your array of ElasticSearch addresses) as a single-element array containing an empty string: [""]. Chef professionals will already know what is coming.

Of course, Chef has an attribute precedence ruleset that still fails to sink into my brain even now, and what compounds the problem is that mergeable structures (arrays, hashes) are merged rather than replaced. So I ended up with an array containing all my ElasticSearch node addresses, plus the empty string merged in from the recipe default. For some indeterminate time we had been receiving only about 80% of the data we expected, as the Logstashes configured with the empty address had simply been dropping messages. Good thing there wasn't ElasticSearch running on those machines as well!
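The merge behaviour can be illustrated with a plain-Ruby sketch. To be clear, this is an illustrative approximation, not Chef's actual Chef::Mixin::DeepMerge code, and the hostnames are placeholders: the point is that arrays found at the same key are unioned rather than replaced, so the recipe's default [""] survives alongside the role's override.

```ruby
# Illustrative approximation of Chef-style deep merge (not Chef's real
# implementation): arrays at the same key are unioned, not replaced.
def deep_merge(default, override)
  if default.is_a?(Array) && override.is_a?(Array)
    default | override  # union keeps the default's elements
  elsif default.is_a?(Hash) && override.is_a?(Hash)
    default.merge(override) { |_key, d, o| deep_merge(d, o) }
  else
    override
  end
end

recipe_default = { "logstash" => { "elasticsearch_servers" => [""] } }
role_override  = { "logstash" => {
  "elasticsearch_servers" => ["es1.example.com", "es2.example.com"] } }

merged = deep_merge(recipe_default, role_override)
merged["logstash"]["elasticsearch_servers"]
# => ["", "es1.example.com", "es2.example.com"] -- the stray "" survives
```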

On attempting to fix this, I removed the default value that had been merged in, but decided it would be prudent to check that an array had been given for the list of ElasticSearch servers to avoid Logstash starting up with no server to send messages to, and potentially start crash-looping. The first attempt was something like this:

if elasticsearch_servers.class != Array
  fail "You need to set node['logstash']['elasticsearch_servers'] with an array of hostnames."
end

Then I found that this fail block was being triggered on every Logstash server. Prior experiences with Hashes and Mashes in Chef land made me suspect that some bizarro internal Chef type was being used instead of a plain Array, and a quick dip into Chef shell confirmed it: I was indeed dealing with an instance of Chef::Node::ImmutableArray. That was surprise number two.

The third thing that got us today was Chef's eventual consistency model with respect to node data. If you are running Chef rather regularly (e.g. every 15 minutes) or have simply been unlucky, you may have attempted to add or remove a role from a machine while it was in the middle of a Chef run, only for it to write its node attributes back to the Chef server at the end of the run and wipe out your change. I'm sure there's a better way of doing this (and I'm surprised there isn't locking involved), but I managed to run into it not only at the start of our migration but also at the end. Our new ElasticSearch nodes started life by accidentally joining the old cluster for a few minutes (thankfully without killing ourselves with the shard-relocation traffic), since the old roles had not been completely removed. Then we managed to keep sending Logstash messages to the old cluster for a period of time at the end of the migration, until we figured out the Logstash nodes still thought the old cluster was in use.

The for loop referenced in the title was, of course, me repeatedly adding and removing roles from machines with the knife command, then restarting Logstash until it was clear the only connections to our old cluster were from the health checkers. (I could have used pdsh, but I needed to space out the Logstash restarts due to the usual JVM startup cycle.)

All of this makes me really glad that I don't deal with configuration management on a daily (or even weekly, perhaps even monthly) basis. Writing and running 12 Factor apps, and things easily deployable via CloudFormation (yes, even with its mountain of JSON config), feels like actually getting work done; this just feels like pushing a boulder up a hill. Sorry for the Friday evening downer - happy weekend everyone!

© 2010-2018 Oliver Hookins and Angela Collins