It’s 2015, and I’m still writing config management commands wrapped in for loops.

by Oliver on Friday, April 10th, 2015.

Warning: this is a bit of a rant. Today my team had to migrate our ElasticSearch cluster from one set of instances in our EC2 VPC, to a smaller set of physical machines in our regular datacenter (yes, it’s actually cheaper). Both sets of machines/instances are Chef-controlled, which generally I don’t have to worry about, but in this case it was important due to the ordering of steps in the migration.

The general process was something like this:

  • Remove the Logstash role from the nodes we were collecting data from to avoid immediate reconfiguration when we started moving ElasticSearch nodes around.
  • Remove the ElasticSearch role from the old ES cluster, and alter the disk and memory layout settings to match the new hardware.
  • Add the ElasticSearch role to the new ES machines and let them bring up the new cluster.
  • Add the Logstash role back to the data collection nodes, which would then point to the new cluster.

For simplicity’s sake, I’ll just say we weren’t too interested in the data as we rotate it fairly regularly anyway, and didn’t bother migrating anything.

Last week I was already bitten by a classic Chefism when we realised that how I’d set up the Logstash recipe was incorrect. To avoid weird and wonderful automatic reconfigurations of the ElasticSearch nodes that Logstash points to, I use a hard-coded array of node addresses in the role attributes, but in the recipe leave the default setting (in case you forgot to set up your array of ElasticSearch addresses) as a single array – [“”]. Chef professionals will already know what is coming.

Of course, Chef has an attribute precedence ruleset that still fails to sink into my brain even now, and what compounds the problem is that mergeable structures (arrays, hashes) are merged rather than replaced. So I ended up with an array containing all my ElasticSearch node addresses, and So for some indeterminate time we’ve been only receiving about 80% of the data we had been expecting, as the Logstashes configured with have been simply dropping the messages. Good thing there wasn’t ElasticSearch running on those machines as well!

On attempting to fix this, I removed the default value that had been merged in, but decided it would be prudent to check that an array had been given for the list of ElasticSearch servers to avoid Logstash starting up with no server to send messages to, and potentially start crash-looping. The first attempt was something like this:

if ! elasticsearch_servers.class == Array
  fail "You need to set node["logstash"]["elasticsearch_servers"] with an array of hostnames."

Then I found that this fail block was being triggered on every Logstash server. Prior experiences with Hashes and Mashes in Chef land made me suspect that some bizarro internal Chef type was being used instead of a plain Array, and a quick dip into Chef shell confirmed that – indeed I was dealing with an instance of Chef::Node::ImmutableArray. That was surprised number two.

The third thing that got us today was Chef’s eventual consistency model with respect to node data. If you are running Chef rather regularly (e.g. every 15 minutes) or have simply been unlucky, you may have attempted to add or remove a role from a machine while it was in the middle of a Chef run, only for it to write its node attributes back to the Chef server at the end of the run and wipe out your change. I’m sure there’s a better way of doing this (and I’m surprised there isn’t locking involved) but I managed to run into this not only at the start of our migration but also at the end. So we started out life for our new ElasticSearch nodes by accidentally joining them to the old cluster for a few minutes (thankfully not killing ourselves with the shard relocation traffic) since the old roles had not been completely removed. Then we managed to continue sending Logstash messages to the old cluster for a period of time at the end of the migration, until we figured out the Logstash nodes still thought the old cluster was in use.

The for loop referenced in the title was of course me, repeatedly attempting to add or remove roles from machines with the knife command, then restarting Logstash until it was clear the only connections to our old cluster were from the health checkers. (Of course I could have used PDSH, but I needed to space out the Logstash restarts, due to the usual JVM startup cycle.)

All of this makes me really glad that I don’t deal with configuration management on a daily (or even weekly, perhaps even monthly) basis. Writing and running 12 Factor apps and things easily deployable via CloudFormation (yes, even with its mountain of JSON config) are actually getting work done, and this just feels like pushing a boulder up a hill. Sorry for the Friday evening downer – happy weekend everyone!

Tags: , ,

Friday, April 10th, 2015 Tech No Comments

Chef’s attribute system is broken

by Oliver on Thursday, December 19th, 2013.

Disclaimer: I’m not what I would call a “hard-core” Chef user, but I’ll explain in a moment why that is irrelevant.

I’ve been working on and off with Chef for almost a year and a half now; certainly not something I use every day or even every week but perhaps a handful of times per month something will come up that requires me to jump into Chef land and get some real work done. There are definitely people in the organisation with a lot more Chef knowledge than I, and I utilise their knowledge to make up for my own knowledge gaps.

If you don’t know me, I have a fairly good background already in configuration management. I would safely say I fit into the “power-user” category of Puppet users at my previous employer, having worked quite a lot on custom Ruby bits and pieces for our installation, and even giving a main-stage talk at Puppet Camp in Amsterdam in 2011. So I know plenty about the general landscape.

One of the main features of the system I helped build around that time was a very flexible data store and layered configuration engine – an External Node Classifier – which, mostly due to user demand from those using our system, went from a handful of layers at which you could define data to over 12. These allowed you to express variables tied to a given app, role within that app’s deployment, datacenter, region, environment, and machine itself. Originally we had just a few of these, and there were just a few combinations in which you could use them. Demand grew until we had covered a reasonable amount of combinations, at which point the system became a little unwieldy.

We tried documenting the order of overrides permissible, which turned into a fairly awful document, and it confused the hell out of new users who had much simpler requirements. But despite supporting so many combinations already, there would always be someone else coming up and asking for some other new combination we hadn’t yet supported. When I left, we were just starting to gear up for a rewrite of the engine that would support a completely different mechanism for configuration that wasn’t limited to just facts about the app or its deployment environment, but I’m not sure how well that went.

My point is, if you haven’t already guessed, is that Chef replicates all of that flexibility and all of the pitfalls that go along with it. Initially I was comforted by its familiarity, but now I find it to just painfully shackled to the same inescapable problems I was trying to forget, and even introduces problems of its own. Take a look at the documentation for deep merging attributes.

A feature in this area (“knockout” prefix for deep merging) was even removed in the last major release. Deep merging makes so little sense in Chef, it pains me. Chef has all the flexibility of raw Ruby at its fingertips, and yet deep merging was decided to be necessary anyway – even though you can more or less accomplish any data transformations you want with native code. And not all of it is even correct.

I’ve seen several examples in our own codebase of deep merging being used as intended – with hashes – but also several examples where arrays have been used “to work around deep merging”. If you read the documentation, you’ll see that both hashes and arrays are candidates for deep merging and yet apparently even on the latest version (11.8) this is not really the case. Another point that pains me is that in order to avoid the deep merging behaviour encourages all of the bad habits that go along with Ruby’s duck-typing system – if you have an array already in place in a cookbook default attribute, and define a hash (or any other type) in a role default attribute it will be overwritten wholesale. This potentially puts a large onus on cookbook authors to excessively type check and stamps a large “WAT” all over this.

Finally, the complexity of the attribute definition layers is just excessive. I’ve previously done a lot of stuff like this in Puppet and it still makes very little sense to me. I was able in one instance to override a hash defined in the cookbook default attributes with another hash defined in the role override attributes despite them apparently being candidates for merging, and yet when the role override attributes were moved down to role default attribute level they WERE merged. The system is just far too complex and complicated, even for relatively savvy users. If a configuration management system requires you to do excessive juggling of config data just to get it working properly, it’s actually costing rather than saving you time.

Fortunately there are now other (what I will still call “current generation”, since they haven’t made any huge noticeable improvements over current offerings) configuration management systems in the marketplace such as Ansible, Salt and Juju. In my brief examinations of them, none of them appear to do much of a better job but at least it is good to have other alternatives than just Chef and Puppet (and I suppose CFEngine3, if you’re a real masochist). Will there be another generation? Based on my experiences with containerised application hosting in the style of Docker, Heroku and others, perhaps today’s configuration management systems are trying to do too much already and we just don’t need more – we need less.

Tags: , ,

Thursday, December 19th, 2013 Tech No Comments

Another transition

by Oliver on Friday, August 10th, 2012.

A couple of things of note have happened recently. Firstly, I managed to conquer the #10-level Currywurst at Curry & Chili, rated at a (literally) eye-watering 7.7 million Scovilles – not a performance I care to repeat. Secondly, I recently changed employer from Nokia to SoundCloud. This is a big change for me, and in some ways hard since it was working with Nokia that allowed me to anchor here in Berlin and has been a huge area of stability in our lives amongst a lot of change. Most significantly, it means making the final transition from a somewhat “DevOps” role which combined operational elements and development to a full-time developer role.

Although I’ve seen people go the other way (from 100% development to operations/systems-engineering), I feel like that is “Salmoning” to some extent. Especially with the fervour surrounding DevOps culture and the Cloud, Web 2.0/3.0 etc, taking on additional development experience and tasks seems a natural flow in this industry. Whatever everyone else is doing, this has been high on my personal agenda for some years now and I’m very glad to have actually made it work out. Whatever I end up working on, I’m going to try to continue writing up posts on the interesting technical aspects (and I know my new employer encourages such things).

In just over a week I’ve already had to get to grips with a bunch of new technologies (mostly just new to me) and systems and it is proving not only a worthy challenge but a lot of fun. Two technologies in particular I wanted to write a couple of paragraphs about.

Puppet vs Chef

If you’ve read previous posts, worked with me, met up with me at various Meetups or conferences or even seen me talk before you’d know I’ve been pretty involved with Puppet for quite a few years now. I never thought I’d be “the Puppet guy” at my last two places of work and after a while wished I wasn’t – not because it wasn’t enjoyable, but because I wanted some new challenges. I took a look at Chef a long time ago but didn’t think it fit my/our requirements very well and more or less dismissed it after that, and didn’t re-evaluate again.

If that sounds at all familiar to you, I urge you to take another look! The last couple of days working with it have been very eye-opening for me, especially given the amount of custom systems we built around Puppet at Nokia.

  • The data bag system is a big win. I like that you get light-weight JSON storage built into Chef, and that it provides a hierarchical storage and lookup system out of the box. OK, Puppet has this now too with Hiera built-in, but I found the data bags system immediately understandable and usable. The only thing that bothers me is using the custom Mash type which pollutes any output into YAML (which now requires the original object definitions in order to de-serialise).
  • I assume data bags are stored in CouchDB, which adds cool factor. It also means the Chef Server is completely stateful, which I’m on the fence about but so far while I’ve been using it it has been a good experience.
  • The knife tool plays further into this concept of a stateful Chef Server and is a really useful tool for getting content into, and information out of the server. I can really say I missed having a tool like this for many years already with Puppet.
  • Having a management interface built-in is a nice touch. Yes, there is Puppet Dashboard/Console but it has always been an optional extra component. Viewing all cookbooks, recipes, data bags, roles, nodes and everything else about the system immediately is a big convenience rather than getting selective access to some of those parts of the system.
  • Being able to write recipes in native Ruby once you are very familiar with Ruby is extremely liberating. The gloves can come off, and you can really achieve whatever you want to achieve by using a tool you are already familiar with.

Of course, any one of these could be just as applicably a criticism against Chef. The Ruby DSL and access directly to native Ruby code would allow many novice users to shoot themselves immediately in the foot. The relatively closed nature of the Puppet DSL means that testing and introspection of the configuration code is much simpler as opposed to native Ruby which can do anything in any number of ways. A lot of the other parts of Chef would be equally hard to wrap your head around if you have not ever tried your hand at coding.

That being said, at this point I find it a logical and usable system; one that could have solved a lot of problems had we given it a chance. I think Puppet was the right tool at the time, but Chef has a lot of possibilities too – take a look at it if you haven’t already.

NodeJS and the cool kids of event-based programming.

NodeJS seems to attract a lot of love and hatred – it is a polarising technology. Unfairly perhaps, I put it into my hatred bucket – mostly because a lot I had read about it seemed to consider it the silver bullet to web scaling problems, and I hadn’t honestly tried it or learned about it.

I’m not sure if any of the systems I’ll encounter over the next few months are actually running Node, but I decided to give myself a coding challenge and attempt to write a small application with it anyway. From what I’ve seen and read about its design so far, it seems like a good solution to a very specific set of event-driven, small payload, tightly-couple client/server problems. It is certainly not a silver bullet, and it comes with its own set of restrictions but again like Chef I have to recommend that you at least get yourself familiar with it and have a think about what problems you could solve with Node.

So that has been my experience the last few days, and I hope to get much more familiar with these two tools before very long. With any luck, you’ll see some more posts on both of them!

Tags: , , , ,

Friday, August 10th, 2012 Tech No Comments