Another transition

by Oliver on Friday, August 10th, 2012.

A couple of things of note have happened recently. Firstly, I managed to conquer the #10-level Currywurst at Curry & Chili, rated at a (literally) eye-watering 7.7 million Scovilles – not a performance I care to repeat. Secondly, I recently changed employer from Nokia to SoundCloud. This is a big change for me, and in some ways a hard one, since working with Nokia is what allowed me to anchor here in Berlin and has been a huge source of stability in our lives amongst a lot of change. Most significantly, it means making the final transition from a somewhat “DevOps” role combining operational elements and development to a full-time developer role.

Although I’ve seen people go the other way (from 100% development to operations/systems-engineering), I feel like that is “Salmoning” to some extent. Especially with the fervour surrounding DevOps culture and the Cloud, Web 2.0/3.0 etc, taking on additional development experience and tasks seems a natural flow in this industry. Whatever everyone else is doing, this has been high on my personal agenda for some years now and I’m very glad to have actually made it work out. Whatever I end up working on, I’m going to try to continue writing up posts on the interesting technical aspects (and I know my new employer encourages such things).

In just over a week I’ve already had to get to grips with a bunch of new technologies (mostly just new to me) and systems and it is proving not only a worthy challenge but a lot of fun. Two technologies in particular I wanted to write a couple of paragraphs about.

Puppet vs Chef

If you’ve read previous posts, worked with me, met up with me at various Meetups or conferences or even seen me talk before, you’d know I’ve been pretty involved with Puppet for quite a few years now. I never thought I’d be “the Puppet guy” at my last two places of work and after a while wished I wasn’t – not because it wasn’t enjoyable, but because I wanted some new challenges. I took a look at Chef a long time ago but didn’t think it fit my/our requirements very well, more or less dismissed it after that, and never re-evaluated it.

If that sounds at all familiar to you, I urge you to take another look! The last couple of days working with it have been very eye-opening for me, especially given the amount of custom systems we built around Puppet at Nokia.

  • The data bag system is a big win. I like that you get light-weight JSON storage built into Chef, and that it provides a hierarchical storage and lookup system out of the box. OK, Puppet has this now too with Hiera built-in, but I found the data bag system immediately understandable and usable. The only thing that bothers me is the custom Mash type, which pollutes any output serialised to YAML (the YAML then requires the original object definitions in order to de-serialise).
  • I assume data bags are stored in CouchDB, which adds cool factor. It also means the Chef Server is completely stateful, which I’m on the fence about, but so far it has been a good experience.
  • The knife tool plays further into this concept of a stateful Chef Server and is a really useful tool for getting content into, and information out of, the server. I can honestly say I missed having a tool like this for many years with Puppet.
  • Having a management interface built-in is a nice touch. Yes, there is Puppet Dashboard/Console but it has always been an optional extra component. Viewing all cookbooks, recipes, data bags, roles, nodes and everything else about the system immediately is a big convenience rather than getting selective access to some of those parts of the system.
  • Being able to write recipes in native Ruby once you are very familiar with Ruby is extremely liberating. The gloves can come off, and you can really achieve whatever you want to achieve by using a tool you are already familiar with.
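For illustration (the bag and item names here are hypothetical, not from any real setup), a data bag item is nothing more than a small JSON document with an id:

```json
{
  "id": "deploy",
  "repository": "git@example.com:myapp.git",
  "revision": "v1.2.0"
}
```

A recipe can then fetch it with data_bag_item('myapp', 'deploy'), and knife data bag from file myapp deploy.json uploads it to the server – the same knife workflow I mentioned above.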

Of course, any one of these could just as easily be levelled as a criticism against Chef. The Ruby DSL and direct access to native Ruby code allow many novice users to shoot themselves in the foot immediately. The relatively closed nature of the Puppet DSL means that testing and introspection of the configuration code is much simpler, as opposed to native Ruby, which can do anything in any number of ways. A lot of the other parts of Chef would be equally hard to wrap your head around if you have never tried your hand at coding.

That being said, at this point I find it a logical and usable system; one that could have solved a lot of problems had we given it a chance. I think Puppet was the right tool at the time, but Chef has a lot of possibilities too – take a look at it if you haven’t already.

NodeJS and the cool kids of event-based programming.

NodeJS seems to attract a lot of love and hatred – it is a polarising technology. Unfairly perhaps, I put it into my hatred bucket – mostly because a lot of what I had read about it seemed to treat it as the silver bullet to web scaling problems, and I hadn’t honestly tried it or learned about it.

I’m not sure if any of the systems I’ll encounter over the next few months are actually running Node, but I decided to give myself a coding challenge and attempt to write a small application with it anyway. From what I’ve seen and read about its design so far, it seems like a good solution to a very specific set of event-driven, small-payload, tightly-coupled client/server problems. It is certainly not a silver bullet, and it comes with its own set of restrictions, but, as with Chef, I have to recommend that you at least get yourself familiar with it and have a think about what problems you could solve with Node.

So that has been my experience the last few days, and I hope to get much more familiar with these two tools before very long. With any luck, you’ll see some more posts on both of them!


More on Puppet module unit-testing

by Oliver on Tuesday, June 19th, 2012.

I’ve previously made presentations and blog posts about Puppet and module testing – my position is that you should treat Puppet code as just that: code. Just like mainstream programming languages, it is possible (and good practice) to test your Puppet manifests so that you have higher confidence in them working when it comes time to actually run them.

There are some other factors which play a part:

  • Make your modules generic. Any environment or host specifics you have baked into classes or definitions makes them that much harder to test in a dissimilar (read: clean) environment like your CI pipeline.
  • As a corollary to the first point, classes should be parameterised so that they can be used in a variety of different ways – in your production environment, in staging, in QA, development (etc etc) and of course your test pipeline.
  • Loosely couple your modules. Tight dependencies enforced with strict ordering constraints mean that you can’t test each class by itself without pulling in all the dependencies as well. Speaking directly from experience, this can mean errors are that much harder to track down when you have to look in a bunch of places for one failing resource.

It is issues like this last point that seem to cause the most grief when testing Puppet modules in our environment. We have a collection of common modules called dist, which provide both re-usable functionality when required by application modules (e.g. the ability to easily set up a MySQL server, a standard way to provision Apache/Nginx etc) and configuration we expect to be standardized across all machines – in other words, the platform. In fact the wrapper class that pulls in the standardized configuration is called just that – “platform”.

Platform pulls in a lot of helper functionality which application modules can take for granted. An example is the yum class. Here we set up a standard /etc/yum.conf with some tunable values, the /etc/yum.repos.d fragments directory and a bunch of standard repository fragments such as OS, Updates and so on. The module also includes a defined type, yum::repo which acts much like the built-in version but works with the yum class to have a fully managed fragments directory – if a yum fragment is on disk but not managed by Puppet, it is removed.

Naturally, in developing an application module you will want to set up a repository fragment to point to wherever you have your app packages stored, so all application modules utilise yum::repo at least once. Now, in testing your application class foo as a unit test, you might have the following code:

class foo {
  yum::repo { 'myapp':
    descr => 'Repository for My App',
  }
}

To test it, you’d have the following in the tests directory:

class { 'foo': }

This is, of course, a trivial example with no parameters. Here, we already have a problem when attempting to unit test the class – it will immediately fail due to the yum class not having been instantiated in the catalog, thus not satisfying the dependency the yum::repo defined type has on it. Typically we have worked around this by just adding it to the test:

class { 'yum': }
class { 'foo': }

This is fine if you know which class you need to pull in, but if there are dependencies between resources in different classes it may not be so clear. The dependent resource will know what it needs, but not where it should retrieve it from. This pattern actually breaks encapsulation, so while it is acceptable in Puppet standard practice it is not very good practice from a developer standpoint.

Another idea we toyed with was automatically including a “stubbed out” version of the platform in every test, thus satisfying all dependencies that application classes may have without needing the user to specify them. I don’t like this idea for a couple of main reasons:

  1. It will blow out compile (and thus test) time a lot. We can usually get away with each test running for a few seconds, and a complete app module (for the entire job) in maybe 30-60 seconds. Pull in all of platform for every test and one application module will take a few minutes. Multiply that by hundreds of application modules and you are looking at a big increase in test time.
  2. This is no longer really unit testing. We’re doing full-blown integration testing at this point. Don’t get me wrong – this is also valuable, but there is a time and place for it, and I don’t want to destroy the unit testing that we already have, with the implicit limited scope (and thus easier fault-finding) that it provides.

In traditional unit-testing with external dependencies that we don’t want to test, we would mock those dependencies. Good mocking libraries will also allow us to be explicit about call ordering, inputs and outputs expected in order to verify behaviour of our own code as well as the relationship it establishes with the external dependencies. Is this possible with Puppet? What would it look like?

define yum::repo (
  $descr = ''
) { }
class { 'foo': }

Now we have a somewhat mocked version of yum::repo that our class can use without having to worry about other chained dependencies outside of its view of the world. This starts introducing some other problems though:

  • It’s quite clear that Puppet language just doesn’t have the capabilities for advanced mocking (which is no surprise – that’s not its primary goal). It would be interesting if a third-party library provided Puppet mocks though…
  • We now have inconsistency in our testing methods. The only time you need to mock out a class/define is when it has unresolved dependencies you don’t want to have to worry about. In all other cases, we can still use the real version (which will be on the modulepath already since we install the dist modules into the testing VM with the app module). Now there are two slightly confusing mechanisms for testing.
  • Will the mocked version be found before the real version that is elsewhere on the modulepath? I haven’t looked into the code to know whether it will be found immediately by virtue of it having just been parsed, but it’s not an unknown factor that I like.
  • One of the tenets of reusable, encapsulated functionality that we provide in our dist modules is that you don’t need to know the details. In fact, thanks to parameterised classes, defined types, custom types and providers it is often not possible to tell what is built-in to Puppet and what is one of the previously mentioned ways of extending it – and this is just how it should be. Would you really want to mock out any one of these resource types when it provides so much more than just a container with parameters? Input validation, consistency between input parameters, platform checking, built-in dependency handling between resources of the same type are all valid reasons to stick with what these types give you for free. It feels wrong to remove them (which effectively is the reason you do the compile testing in the first place).

I haven’t spent as much time on this as on other Puppet problems previously, because at the moment (fortunately) it is mostly no more than an annoyance. We can add in the missing test dependencies by hand, and most of our users are becoming savvy enough to do it themselves. I’m interested in what the community thinks about this topic though, and whether you have solved this problem yourself. Please leave comments; I would love to know what you think!


Puppetmaster performance tuning with mod_mem_cache

by Oliver on Saturday, April 7th, 2012.

It has been quite a while since I’ve had anything sufficiently detailed or interesting to post on a technical theme so it was quite a welcome divergence from the usual sprint work recently when we started having some Puppet performance issues. First up, like most people working at big corporations I can’t give you any figures that might give away numbers of servers or anything else sensitive but obviously we run enough machines to cause performance issues on a reasonably set up Puppet environment. I’ll share some statistics with you as well which give context and show the kind of before and after effects of the work.

If you are used to working with Puppet you might be accustomed to runs taking perhaps a minute, or a few minutes for complicated catalogs. When your run time starts to exceed ten minutes it starts getting uncomfortable, and when it gets much more than that people start complaining. We had noticed some of our core systems were taking an extremely long time to complete runs (which were meant to be occurring every 15 minutes) and had also had some reports of extremely long runs for some of our customers.

First off, we ruled out this bug although we suffered from it briefly. If you are still running 2.6 and use run stages, please upgrade to 2.7 or you will soon (or already) be suffering from fairly bad performance during client graph processing. We were seeing several orders of magnitude worse performance as you can see from the bug report.

How can you start to logically proceed towards a performance fix?

  • Puppet reports from the client do contain timestamps for the time each resource was processed, but no duration of client or server processing time (actually, wall time of an individual resource is what we are interested in).
  • Without digging deeper into the exact processing time of each event it is not possible to see whether performance problems are solely client-related, server-related or both. Puppet is a fairly chatty client/server application so a run is tightly bound between the two (at least, in a standard configuration).
  • Generally clients will outnumber servers by at least an order of magnitude, and possibly up to four (five?) if your configuration is lightweight or runs don’t occur frequently. (This is obviously an approximation, and could differ significantly with aggressive load-balancing and performance tuning. You may even have only clients and no servers.)
  • We need to break down exactly which operations are costing us computing effort, and work out what would give us the maximum gain with minimum time spent.
  • Therefore, we need information to gain awareness of these operations.

Better Logging

We were already running our Puppetmasters through Apache with Passenger (a.k.a. mod_rails/mod_rack) so we had logs in combined log format already in existence. Unfortunately, while you can get an idea of a lot of statistics already from this format of logs, computational expense of each request and response is not clear. I decided to add the %D field to the log which gives the response time in microseconds for each request, which would allow me to get a better idea of what was actually taking a long time. I had my suspicions that we had recently hit a request ceiling beyond which performance would degrade rapidly (not linearly) but I needed actual evidence of this.

For the record, here’s the actual log format specification:

LogFormat "%h %l %u %t \"%r\" %>s %b %D \"%{Referer}i\" \"%{User-Agent}i\"" puppet
CustomLog /var/log/httpd/puppet.log puppet

And the log output lines start looking like this:

- - [18/Mar/2012:04:02:25 +0000] "GET /environment1/file_metadata/modules/logrotate/etc/logrotate.conf HTTP/1.1" 200 333 42741 "-" "-"

In this case, we have a response time of 42741 microseconds. Don’t freak out just yet.

Great, so now I have a few days worth of logs in the new format to work with and figure out what is going on. Initially I thought it would be simple but after a few minutes of scanning the logs I realised I’d want to do some analysis which would point me in the exact direction of the problem. There are some basic elements at play here which I wanted to do the analysis over:

  • Environments: We allow separation of distinct apps, and releases of those apps by using the Puppet environment. So by looking at the environment we can get an idea if one app is causing excessive load, or one specific release of that app.
  • Requestor: Is one host, or a group of hosts hitting the Puppetmaster more than it should?
  • Indirector: Are catalog compiles taking a long time and causing everything else to be slow, is report processing taking the time or are we serving excessive numbers of file resources? I erroneously named this the “resource type” in my analysis script because I was in a RESTful state of mind but technically these are Indirectors in Puppet Land.
  • Resources: Within a given Indirector, is a particular resource taking a long time to generate (e.g. a particularly large or complex catalog, or an extremely large file).

These were what I thought I’d start with. Clearly a generic HTTP log statistics package like AWStats is not going to cater to most of my requirements, but at the other end of the scale something like Puppet Dashboard/Console is a little too specific to events around Puppet runs – I wanted something in between. I’d be surprised if no one else had written something, but a few things made me decide to write my own: the challenge, the naive belief it would be quick and easy, fooling myself into believing it was too specific a problem to reuse someone else’s code, and the sheer interest factor.

Log Analysis and Problem Identification

So I wrote puppet_log_analyse.py. I’m trying to get this open sourced (which is an overall goal for my team) but it has to go through the due process of review first. As soon as this is complete I’ll post back with a link to where it is externally available should it be useful for anyone.

It’s my first bit of Python in quite a while so it’s probably not brilliant code. It underwent several iterations as I wanted more information from the logs, and also as I experimented with different ways of storing and querying the data. I initially was using heapq as I was looking for the worst or best results from many, and heapq maintains relatively inexpensive insert cost while providing a data structure that will be sorted already when you query it. Unfortunately some of the more interesting statistics I wanted to generate proved harder to get out of heapq than I expected so despite taking a performance hit (and probably costing a lot more memory at runtime) I switched to SQLite where I could express myself in familiar SQL, while still storing all the data in memory. Hey, I’ve got 8GB in my laptop so it’s no problem!
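As an illustrative miniature of that approach (the regex and table schema here are simplified stand-ins, not the real script):

```python
import re
import sqlite3

# Simplified sketch of the analysis: pull the method, environment,
# indirector and the %D response time (in microseconds) out of each
# log line, store them in an in-memory SQLite database, and aggregate
# per indirector.
LOG_RE = re.compile(
    r'"(?P<method>GET|PUT) /(?P<env>[^/]+)/(?P<indirector>[^/]+)/\S* HTTP/[\d.]+"'
    r' \d+ \S+ (?P<usec>\d+) '
)

def analyse(lines):
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE req (method TEXT, indirector TEXT, usec INTEGER)")
    for line in lines:
        m = LOG_RE.search(line)
        if m:
            db.execute("INSERT INTO req VALUES (?, ?, ?)",
                       (m.group("method"), m.group("indirector"), int(m.group("usec"))))
    # Count, average and extremes per (method, indirector) pair.
    return db.execute(
        "SELECT method, indirector, COUNT(*), AVG(usec), MIN(usec), MAX(usec)"
        " FROM req GROUP BY method, indirector ORDER BY COUNT(*) DESC"
    ).fetchall()
```

Pointed at a few days of logs, this is the kind of query that produces the per-indirector breakdowns shown below.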

I ended up generating a reasonable set of different metrics but the ones that I ended up using the most for the decision on what to optimise came from the following:

Top resource types:
  7129326 GET file_metadata
  285765 GET file_metadatas
  85287 GET catalog
  84108 PUT report
  1770 GET file_content
  15 GET certificate
  5 GET certificate_request
  5 PUT certificate_request
  5 GET certificate_revocation_list

Statistics by resource type:
                                   AVG      STD.DEV    MIN   MAX
 GET catalog                       6951.7ms 5430.7ms 8.175ms 2.9996e+05ms
 GET certificate                   10.722ms 3.6854ms 5.128ms 17.609ms
 GET certificate_request           22.437ms 29.293ms 5.567ms 80.777ms
 GET certificate_revocation_list   19.492ms 29.293ms 3.877ms 78.062ms
 GET file_content                  10.106ms 56.091ms 2.917ms 1457ms
 GET file_metadata                 13.58ms 639.82ms 2.243ms 2.9982e+05ms
 GET file_metadatas                123.61ms 2817.5ms 3.587ms 2.9197e+05ms
 PUT certificate_request           55.047ms 3.0518ms 50.834ms 58.675ms
 PUT report                        2005.5ms 2411.3ms 7.438ms 2.99e+05ms

Yikes, that’s one loaded server. You can see that response times are not great across the board but the maximums are woeful. A lot of time is being spent generating catalogs (which is no surprise) but I was not expecting the sheer number of file_metadata and file_metadatas requests which are a couple of orders of magnitude higher in count than even catalogs (which is exactly the number of runs that have taken place). Taking the average response time of file_metadata requests and their count over these logs we can see that the server spent about 26.88 hours out of what was about 60 hours of logs generating MD5 sums and returning them. Probably for the same files over and over again, which are not changing.
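That figure is simple arithmetic over the stats above, and worth a quick sanity check:

```python
# Back-of-the-envelope check of the time spent serving file_metadata,
# using the request count and average response time from the stats above.
requests = 7129326    # GET file_metadata count
avg_ms = 13.58        # average file_metadata response time
hours = requests * avg_ms / 1000 / 3600
print(round(hours, 2))  # -> 26.89
```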

Turning on Caching in Apache

Catalogs can be precompiled and there are various other tweaks that can be made, but the first idea that sprang to mind was simply caching the file_metadata responses. Even if Puppet were caching the responses (and it’s not clear to me that it ever did), we are still having our requests go through a whole lot of slow Ruby to get the same result back. Caching closer to the requestor makes sense from a performance point of view, and I was already aware of mod_mem_cache. Rather than spend more time adding a caching layer with a separate product like Varnish, why not just add more configuration to Apache, which was already in place?
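The intuition can be sketched in miniature (a toy model, not Puppet’s or Apache’s actual code path): serving file_metadata means hashing the file on every request, while a response cache pays that cost once per entry.

```python
import hashlib
from functools import lru_cache

# Toy model of the expensive path: every file_metadata request re-reads
# the file and recomputes its MD5 sum.
def file_metadata(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

# A response cache in front of it (mod_mem_cache plays this role in
# Apache) returns the stored answer instead, trading a little freshness
# for a lot of speed.
cached_file_metadata = lru_cache(maxsize=10000)(file_metadata)
```

The trade-off is exactly the one a cache expiry setting governs: until the entry expires, a changed file would be reported with its old checksum.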

I came up with a reasonable looking configuration and did some basic testing.

        <IfModule mod_mem_cache.c>
                CacheEnable mem /
                CacheDefaultExpire 300
                MCacheSize 1024000
                MCacheMaxObjectCount 10000
                MCacheMinObjectSize 1
                MCacheMaxObjectSize 2048000
                CacheIgnoreNoLastMod On
        </IfModule>

I’ll explain some of these settings later on. What kind of a performance gain can you get by caching a file metadata response (which is just an MD5 sum right now, due to another bug)? Obviously this is dependent on file size and CPU hashing speed, but let’s take a reasonable example of a 256KB file.

- - [01/Apr/2012:01:30:55 -0700] "GET /production/file_metadata/modules/testmodule/testfile HTTP/1.1" 200 342 7853 "-" "curl/7.15.5 (x86_64-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5"

We can see that this request took 7853 microseconds, or just under 8ms, to process and return on an otherwise idle machine. That seems pretty fast already, but what happens when we turn on caching?

- - [01/Apr/2012:01:34:26 -0700] "GET /production/file_metadata/modules/testmodule/testfile HTTP/1.1" 200 342 6960 "-" "curl/7.15.5 (x86_64-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5"
- - [01/Apr/2012:01:34:26 -0700] "GET /production/file_metadata/modules/testmodule/testfile HTTP/1.1" 200 342 226 "-" "curl/7.15.5 (x86_64-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5"

The first request after the restart is predictably about the same speed as before, but the subsequent response is about 30 times faster. That’s a big improvement for our most frequent query.

So we’re done now, right? Well, you could just go ahead and implement this as is given what we’ve seen already but I wanted to be a bit more certain I’d actually see a reasonable improvement. There are also some caveats to the configuration which I’ll list at the end of the post. To gain some more concrete figures of performance improvement I decided to run some performance tests before and after the change using Apache JMeter.

Performance Testing the Solution

I won’t go into depth with JMeter as there is plenty of information about it out there and honestly, it’s a snap to set up and use even if you’ve never used it before. The main problems I hit immediately while testing were as follows:

  • Generating load and serving the application under test on the same machine is a recipe for disaster (which should be obvious). So you will want to have a dedicated load generator and a separate machine for the application (the Puppetmaster in this case).
  • If you are attempting to do distributed load testing (i.e. with dedicated JMeter servers and a separate frontend) you will hit the usual RMI connection problems if there are firewalls/NAT in between your servers and the frontend.
  • Testing the Puppetmaster necessitates using SSL and a client certificate for authentication/authorisation. Currently it seems as though the JMeter SSL Manager will not send client certificate details through to the JMeter servers when doing distributed load testing. In order to work around this I used stunnel directly between the JMeter server and the Puppetmaster but I soon hit performance bottlenecks due to stunnel – it appears to be quite inefficient.
  • In order to use your existing Puppet-generated certificates and private keys you’ll need to package them in a PKCS#12-format file. For the sake of readers the command to do so is openssl pkcs12 -export -out jmeter.p12 -in cert.pem -inkey priv.pem. You’ll need to put a passphrase on the exported file (OpenSSL asks you for it) but the SSL Manager in JMeter handles this when the file is loaded.

So, I installed a barebones X11 environment on my remote load generator machine and ran it all from there. The test scenario was fairly simple:

  • HTTP Request was being made to https://puppetmaster:8140/production/file_metadata/modules/testmodule/testfile with GET.
  • The HTTP Header Manager was used to set 'Accept: yaml'.
  • The Graph Results listener was used to generate a simple graph of the output.
  • I used 100 concurrent threads each making 1000 requests.

Admittedly, by only testing a single file in a single module I was not making this an extremely good test, but I wanted to verify that caching was going to generate some reasonable savings across many clients, so I assumed responses in the production system would be cached quite quickly and reused for enough clients to make a difference. In this case, the file enters the filesystem cache after the first request so it is being served from memory with or without response caching. As I previously mentioned, the Puppetmaster doesn’t seem to be doing any other form of caching for the file, so the difference in performance should only be between running through the Puppetmaster code and generating an MD5 sum versus returning the cached response.

Here’s what the output from JMeter tended to look like:

JMeter graph results

After speaking with some long-time JMeter users, it appears I’m not the only one who thinks that this double-rainbow isn’t all that useful. It gives you a reasonable overview of what is going on early in the run but beyond that if you want some usable graphs it is better to export the results and make your own. In any case, the run summary was my best indicator of results:

Before caching:

  • Throughput: ~20859 requests/minute
  • Average: 276ms
  • Median: 100ms
  • Deviation: 712ms
  • Client load average: ~3.5
  • Server load average: ~55

After enabling caching:

  • Throughput: ~50000 requests/minute
  • Average: 84ms
  • Median: 41ms
  • Deviation: 735ms
  • Client load average: ~11
  • Server load average: ~1.2

It’s not the 30 times faster I’d hoped for but still, more than doubling the throughput is a significant win. The vastly decreased response time and obvious improvement in server load also seem to point to a potential big win when this is actually deployed.

Now armed with results I decided to work the configuration changes into our Puppetmaster build (which itself is automated) and do some final smoke checks as QA before rolling out.

Making It Live… Almost

The general plan was to make a new release of our Puppetmaster, do some similar tests and then release to production. I spun up a new Puppetmaster using our automation recipe for it and attempted the same basic requests using curl, but saw fairly inconsistent results. I could often make ten or more requests before seeing a response time in the logs corresponding to a cached result being returned. Even more surprising, with Apache’s debugging turned on, I could verify that it was caching the result but not returning it all the time.

Unfortunately I’d missed this note in the mod_mem_cache documentation:

... In particular, this module's cache is per-process, ...

In my case, the handling of requests had gone to somewhat random child processes and I’d only received a cached response after hitting the same child process again (assuming the cached response is still fresh). Easy to see when testing by hand, not so easy when sending 1000 requests from 100 test clients in JMeter. This does have some ramifications though:

  • Aside from memory requirements of the Puppetmaster app running through Passenger, you effectively have to account for a completely separate cache running in each child process of Apache. If you want to ensure a good hit rate, you may need a reasonable amount of RAM.
  • If you have a very small number of clients and requests, a lot of Apache child processes and a lot of files, it is probable that you won’t hit the cache much of the time if the cached responses are evenly distributed. There’s not a big downside to this in general, as you’ll get a performance increase even with a few hits, but you do stand a greater chance of suffering from stale responses, which may not make the change worth it.
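The second point lends itself to a quick back-of-the-envelope model (assuming, simplistically, that requests land on child processes uniformly at random):

```python
# Per-process caches only hit when the child that cached a URL happens to
# serve it again. For r requests to one URL across k children, the expected
# number of misses equals the expected number of distinct children hit,
# k * (1 - (1 - 1/k)**r), since each child misses exactly once.
def expected_hit_rate(children, requests):
    misses = children * (1 - (1 - 1 / children) ** requests)
    return 1 - misses / requests
```

With 20 children, 10 requests to the same file hit the cache only about 20% of the time; it takes a high request volume per URL before the per-process caches are mostly warm, which is exactly the low-traffic caveat above.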

Making It Live For Real

Here are some statistics from the logs for the same Puppetmaster after the change:

Statistics by resource type:
                                   AVG      STD.DEV    MIN   MAX
 GET catalog                       7053.9ms 3062.9ms 4.559ms 55651ms
 GET certificate                   15.386ms 36.771ms 4.381ms 263.69ms
 GET certificate_request           7.351ms 2.5096ms 5.001ms 15.452ms
 GET certificate_revocation_list   4.4384ms 1.0202ms 3.252ms 7.733ms
 GET file_content                  15.078ms 150.57ms 2.466ms 5207.8ms
 GET file_metadata                 8.4615ms 109.2ms 0.049ms 20517ms
 GET file_metadatas                65.971ms 345.93ms 3.546ms 21223ms
 PUT certificate_request           53.68ms 2.56ms 49.41ms 59.803ms
 PUT report                        434.32ms 1281.1ms 4.569ms 2.3324e+05ms

Across the board, operations haven’t changed a great deal but they are generally more consistent. File_metadata operations have improved by about 60% which is a pretty big change considering the number of these operations the Puppetmaster is processing. In terms of what actual effect this had on load, we went from seeing about 1 runnable process per core constantly to about 0.5.

And finally…

This was an interesting exercise, especially since I'd had caching and Puppet on my radar for quite some time (just look at this page by Masterzen) but never actually had the need or the chance to investigate it. The work paid off and will let our system scale up a whole lot more. I haven't been doing a great deal of Puppet work recently (hey, it's nice to branch out and do other things) but I like digging into the internals when I can to find a good solution to an irritating problem.

Fortunately or unfortunately, there's always a plethora of work to be done on infrastructure issues and improvements, so I'll put in a small plug for my company. If this post or anything I've written in the past piques your interest, I need more people on my team! http://devops.nokia.com/ is the place to go to get on board 🙂


  • I initially started testing in our internal private cloud, but when I hit performance problems with JMeter I moved to a couple of instances in HP's cloud, to which I have beta access. Performance was much better there, but I realised the critical bottleneck was stunnel, as mentioned above. Once I had confirmed the numbers, I moved the testing back to our internal cloud and saw equivalent results, except on OpenVZ containers, where throughput was almost double in some cases. This matches comparisons we've previously made against KVM (which we also support internally, and which HP's cloud runs on). If you run your Puppetmasters virtualised, consider OpenVZ or Virtuozzo.
  • In our mod_mem_cache configuration I set a fairly low timeout (300 seconds) and chose to tell Apache to ignore the lack of ETag or Last-Modified headers (CacheIgnoreNoLastMod On) in responses, which would normally assist it in expiring stale responses. Unfortunately Puppet doesn't yet send back any such metadata with its responses, so we can't properly check response freshness – though for file_metadata responses it arguably doesn't make much sense anyway.
  • Unlike some other configuration management systems that rely on proprietary protocols between client and server (CFengine comes to mind, although I haven’t looked at it in a while), Puppet relies on REST between client and server which means a lot of the standard methodologies for improving web performance are applicable. This is really nice to fall back on, as you can see from what I’ve written.
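
For reference, the relevant Apache directives look roughly like this. This is a hedged reconstruction: only the 300-second expiry and CacheIgnoreNoLastMod come from the notes above; the cache sizes and URL path are illustrative placeholders:

```apache
# Cache Puppet responses in memory, separately within each Apache child process
<IfModule mod_mem_cache.c>
    CacheEnable mem /production/file_metadata
    MCacheSize 102400              # illustrative: cache size in KB, per child
    MCacheMaxObjectSize 1048576    # illustrative: largest cacheable response, bytes
    CacheDefaultExpire 300         # the 300-second timeout mentioned above
    CacheIgnoreNoLastMod On        # Puppet sends no Last-Modified/ETag headers
</IfModule>
```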


Saturday, April 7th, 2012 Tech 11 Comments

How we use Vagrant as a throwaway testing environment

by Oliver on Tuesday, May 31st, 2011.

As I mentioned in my presentation at PuppetCamp, one of the goals of our Puppet code testing was to run Puppet Faces-style compile tests on all of our modules. Since this requires a somewhat realistic deployment environment, a clean system, and our modules to be installed as packages (since that’s how we are distributing our own modules) it makes sense to run this stage in a throwaway virtual machine.

This is where Vagrant comes in. I won’t dive into what Vagrant is all about since the website describes it well, but instead try to focus on my testing methodology. Using Vagrant VMs is a natural fit for this use case, but there are a few aspects which also make it a bit difficult to do this kind of testing (probably irrespective of the tool you use):

  • VMs are typically set up on a private NATed network on the hypervisor machine. This has the added benefit of not requiring cooperation from the regular network, but means that you cannot straightforwardly connect to each VM as needed by your testing system.
  • Related to the previous point, your test jobs will be running on a Jenkins build master or slave, which may then need to SSH/RPC to the Vagrant/VirtualBox host, which will then need to SSH to the Vagrant VMs. Not very nice.
  • Vagrant has the same VM spin-up lag problems that the cloud has – you want a new instance to become available rapidly so testing completes fast. Alternatively you can pre-build your VMs but this introduces complexities in the testing system.
  • Related to the previous point, bugs/limitations in Vagrant/VirtualBox mean that you can't spin up several VMs simultaneously due to race conditions.
  • Your testing system exists outside the VMs – you want to get your code to test inside the VM.

These were probably the hardest problems I had to solve (aside from just writing the Rake tasks to wrap around Puppet Faces). I’ll break these down into sections and describe the solutions. I won’t go into the usual nightmare of how to get an appropriately recent version of Ruby/Rubygems etc onto a CentOS 5.5 machine so that you can actually run Vagrant – that’s surely been solved many times already.

Network connectivity

This was a relatively easy one. Vagrant VMs run on a host-only NATed network, but we need to somehow get the Jenkins jobs running something initially on the Vagrant host. Basically, I decided the easiest course of action would be to make the physical machine a Jenkins slave right from the start. So I have the master connecting to this machine as the jenkins user, which is also the user I am using to run all of the Vagrant stuff. Jenkins owns the VMs, so to speak, which makes connectivity and permissions on them much easier.

Once a job has started it is possible, by using the vagrant command, to connect to any of the VMs and run commands on them, destroy them, bring them back up again etc.

SSH and jobs within jobs

As I mentioned in a previous post, working with Net::SSH can be fiddly, but it is by far preferable to command line SSH invocation and all of the usual string-manipulation, input/output and error-handling nightmares that entails. The basic principle for running these test jobs is:

  1. Run the job from Jenkins, which runs an initial rake task on the Vagrant hypervisor machine (the test wrapper)
  2. The test wrapper calls the Vagrant libraries and sets up configuration for the VM in question.
  3. The build tools (previously checked out from version control to the Vagrant hypervisor machine) are copied to the Vagrant VM’s shared directory on the hypervisor. They are now available to the VM as well.
  4. A Net::SSH call is made to the VM, which calls the “real” test task from the build tools in the shared folder.
  5. Testing runs in the VM.
  6. Net::SSH call to the VM finishes, and the VM is destroyed.

Now we only really have a single RPC via Net::SSH which is relatively easy to manage programmatically. Of course, we lose granularity in the output since we are running through an SSH session. We cannot raise exceptions or pass structured data – but have to rely on exit codes and STDOUT/STDERR. So far this doesn’t seem like a big limitation in this use case.

VM Lifetime

My first mental approach was to have a number of VMs constantly running on the machine (roughly equal to the number of physical CPU cores), and a helper script or daemon that handled allocation of VMs to incoming test jobs. After each job completes, the helper would mark the VM as unavailable, destroy and recreate it from the base box image and when ready again return it to the available pool of VMs. This seemed like a nice idea at first but I realised the coding would take some time.

As soon as I had a basic idea for what the testing tasks themselves would look like, I decided to implement them for a single job first as a proof of concept. It was hard-coded to a single VM which made things much simpler, and it just worked. Pretty soon I moved on to creating the rest of my test jobs and had to find a way to handle the parallel builds. The simplest possible solution (when using Jenkins, at least) is to prepare a fixed number of VMs (say 8, to match the physical number of cores) and tie each one to a specific build executor of the Jenkins slave (you obviously need to also configure Jenkins to use no more than this number of executors on the slave).

Thus, as long as your create/destroy sequences work (hint: put the destroy in the ensure section of a begin/rescue block) you will have no problem running jobs continually – each new job run gets a fresh instance of the VM, tied to that executor. The problem I ran into here was the bug/limitation in Vagrant and/or VirtualBox that prevents multiple simultaneous VM creations. If the number of jobs exceeds the number of build executors and the jobs are doing similar things, they will all finish at similar times and attempt to cycle their VMs at the same time, which inevitably fails.
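
The begin/ensure recycling pattern itself is simple enough to sketch in isolation. Here with_fresh_vm and its lambda arguments are hypothetical stand-ins for the real Vagrant create/destroy calls, just to show that cleanup runs even when a test raises:

```ruby
# Run a test body between VM create/destroy, guaranteeing the destroy runs
# even when the tests raise - so the executor's VM is always recycled.
def with_fresh_vm(create:, destroy:)
  create.call
  begin
    yield
  ensure
    destroy.call   # always recycle the VM, pass or fail
  end
end

log = []
begin
  with_fresh_vm(create: -> { log << :created }, destroy: -> { log << :destroyed }) do
    log << :testing
    raise "a test failure"
  end
rescue RuntimeError
  log << :failure_reported
end
puts log.inspect   # => [:created, :testing, :destroyed, :failure_reported]
```

In the real task the lambdas would shell out to vagrant up / vagrant destroy for the VM tied to this executor.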

It seemed like I was back at square one, looking at having a smart helper script which automated VM recycling. I actually implemented locking around what I judged to be the critical code section where the VMs were created but I continued to experience problems. In the end, Patrick Debois of DevOpsDays fame made a suggestion to keep all the VMs running and use his sahara add-on to Vagrant in order to do lightweight snapshots/rollbacks on the VMs rather than a full-blown create/destroy between tests. Now the general run procedure looks like:

  1. Start wrapper task on Vagrant hypervisor machine
  2. Call real test task through Net::SSH to VM
  3. After Net::SSH call completes, run vagrant sandbox rollback on VM

The rollback takes a matter of seconds (the snapshot is generated just once, when I first create each VM on hypervisor boot), and now each compile job takes about 45 seconds in total. This is well under the 10-minute goal I had originally set, and the system is working much better than I had originally imagined, despite being effectively a "workaround" for my mental design.

Sharing code with the VM

A while back I forcibly separated my build toolchain from my Puppet code, making it possible to store our Puppet modules in any repository: dev teams can maintain their own modules but build them using our common toolchain. This has generally been a success, and as a bonus it also makes the modules pretty easy to share and copy around. In the case of Vagrant, Jenkins automatically checks out the build toolchain at the start of the job and calls the test wrapper task.

The test wrapper task then copies the checked-out build toolchain directory into the Vagrant VM's shared directory (in my case, /vagrant/puppetvm#{ENV['BUILD_EXECUTOR']}, which is the directory the VM's Vagrantfile lives in). Inside the VM, the files are also visible at the same path. The basic mechanism for the test is now as follows:

  1. Copy the build tools to the VM shared directory
  2. Call Net::SSH to the VM and install the Rake gem
  3. Call Net::SSH to the VM and run rake on the “real” test task
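
Step 1 is little more than a recursive copy into the shared folder. A minimal sketch with illustrative paths (share_build_tools is a hypothetical helper; the real target was the puppetvm directory keyed off BUILD_EXECUTOR):

```ruby
require 'fileutils'
require 'tmpdir'

# Copy the checked-out build toolchain into the VM's shared directory, so the
# same files become visible inside the guest under /vagrant.
def share_build_tools(checkout_dir, shared_dir)
  FileUtils.mkdir_p(shared_dir)
  FileUtils.cp_r(File.join(checkout_dir, '.'), shared_dir)
end

# Demonstrate with throwaway directories standing in for the real paths:
result = nil
Dir.mktmpdir do |tmp|
  checkout = File.join(tmp, 'build_tools')
  shared   = File.join(tmp, 'puppetvm1')   # would be the Vagrantfile directory
  FileUtils.mkdir_p(checkout)
  File.write(File.join(checkout, 'Rakefile'), "task :default\n")
  share_build_tools(checkout, shared)
  result = File.exist?(File.join(shared, 'Rakefile'))
end
puts result   # the toolchain file is now in the shared directory
```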

Actually running the tests

So what does my “real” test task look like? This will look different for everyone, so I’ll just list the general steps:

  1. Set up Yum repository fragments so we have access to the Puppet packages, and the repository that contains our syntax-checked Puppet modules.
  2. Yum install Puppet and supporting packages (puppet-interfaces and interface-utils, pre-release parts of Puppet Faces that we've been using for a while) plus the package of the Puppet module we wish to test.
  3. Run Puppet Faces compile tests on the module
  4. Exit with a reasonable exit code for the wrapper task to catch and use as the result of the job.

The hardest part of all this is making changes to the testing toolchain, since by the time you want to look at how something failed, your VM has already been rolled back. This is just a sad fact of having to tie VMs to build executors; we also can't leave failed builds lying around with their VMs (especially in the case of real test failures). If anything, it has strengthened the need to pull core logic out of the Rake tasks and into a separate set of build libraries that can be tested on their own with Test::Unit or RSpec – although given how much of the testing relies on shell interaction, it is difficult to test adequately (especially once you resort to mocking).


Tuesday, May 31st, 2011 Tech 7 Comments

Video of talk from Puppet Camp 2011 Amsterdam

by Oliver on Friday, May 20th, 2011.

(Video embedded in the original post.)

Friday, May 20th, 2011 Tech 1 Comment

How we use cucumber-puppet

by Oliver on Monday, May 16th, 2011.

I got a question recently by email in followup to my presentation at Puppet Camp in Amsterdam about how we use cucumber-puppet. I touched on the subject only briefly in my talk but what I did say is that it revolutionised my approach to Puppet in general. Don’t get too high an opinion of the tool from that statement! Behaviour-driven development in general was a new thing to me and did change my ways of thinking, but my opinions of it in conjunction with Puppet have changed over the months slightly.

Before I go into too much depth, let’s take a look at the tool and how it is used. To be fair, cucumber-puppet is a good tool (as is cucumber itself and cucumber-puppet’s cousin, cucumber-nagios). Typically you’ll start off by running cucumber-puppet-gen world in your Puppet code repository and let it generate the infrastructure necessary to start writing your own tests. Basically they are grouped into three main categories:

  • modules – where you actually write your high-level language tests
  • steps – the Ruby-language breakdowns to help cucumber-puppet turn natural-language requests into things to test in the Puppet catalog
  • support – cucumber-puppet settings and globals

As you might have noticed, you are actually directing cucumber-puppet with natural language, which gets translated into native Ruby and applied as tests against various content in the Puppet catalog. It's actually not that much magic. Let's look at an example feature:

Feature: Base repositories
  In order to have a system that can install packages
  As a sysadmin
  I want all of the CentOS repositories to be available

  Scenario: CentOS yum repositories
    Given a node of class "yum::base"
    When I compile the catalog
    Then there should be a yum repository "Base"
    And there should be a yum repository "Updates"
    And there should be a yum repository "Extras"

Pretty cool huh? The most important aspect of this is that it is readable by humans. As you go on though, you’ll realise it is somewhat verbose and you are prone to much repetition. Anyway, let’s take a look at some of the steps that make this work. You’ll notice we said a node of class "yum::base". It’s not exactly a real node, we are just directing cucumber-puppet to compile a catalog that has just a single class in it – yum::base and treat it as if that is the entire node.

Given /^a node of class "([^\"]*)"$/ do |klass|
  @klass = klass
end

Then /^there should be a yum repository "([^\"]*)"$/ do |name|
  steps %Q{
    Then there should be a resource "Yum::Repo[#{name}]"
    And it should be "enabled"
  }
end

Then /^it should be "(enabled|disabled)"$/ do |bool|
  if bool == "enabled"
    fail unless @resource["enabled"] == "1"
  else
    fail unless @resource["enabled"] == "0"
  end
end

Then /^there should be a resource "([^\"]*)"$/ do |res|
  @resource = resource(res)
  fail "Resource #{res} was not defined" unless @resource
end

Those are all the steps necessary to make the previous feature work. They should be fairly clear even if you have no idea about Ruby or cucumber-puppet. Some important items to note:

  • Steps can call other steps, so that you have a useful abstraction mechanism between many different things that test resources in similar ways – e.g. for presence in the catalog.
  • Yes, not all the feature text is actually parsed. A lot of it is human-understandable, but computationally-useless fluff.
  • We have replaced the built-in Yum provider with our own (that is just a managed directory and template files in a defined type) – as the Yum provider famously doesn’t support purging.
  • @resource["enabled"] == "1" looks wrong, but that's how Yum repositories represent settings, so we mirror that here even if it is not strictly boolean.

You do get a lot of steps for free if you run cucumber-puppet-gen world (like the first and last quoted above), so you don't have to come up with it all by yourself. This is the general style of testing with cucumber-puppet, at least up to version 0.0.6: very verbose, liable to duplicate your Puppet code at almost a 1:1 ratio (or even beyond it), but still a very useful tool for refactoring and the like. Starting with 0.1.0 it became possible to create a catalog policy – think of it as a bit closer to reality: instead of testing arbitrary aspects of your Puppet code against fake machine definitions (or even real ones with faked facts), you can now test real facts from a real machine and be sure that the machine's catalog compiles.

We're not doing this just yet (I've been busy working on other aspects of our Puppet systems), but in theory it is not much work to get going. You do, however, need some mechanism for transporting your real machines' YAML files from your Puppetmasters to your testing environment for runs through cucumber-puppet. While this is definitely a step forward, it also gets into territory that perhaps more people are considering – pre-compiling catalogs on your real Puppetmasters and checking for success/failure there. It also gives you the ability to check for differences between catalogs on the same machine when inputs change (i.e. your configuration data changes) – Puppet Faces will give you the ability to do this quite easily.

Another couple of cool things about cucumber-puppet before I sign off. Because it is based on cucumber, it can generate output in many different useful formats – for example, a JUnit report (in XML format). Jenkins supports JUnit reports natively, so you can run your cucumber-puppet tests in a Jenkins job and have the JUnit test results integrated into the build result and history. Very cool.

Finally, since you are testing catalogs with cucumber-puppet you can make tests for just about anything in those catalogs. For example, if you are generating some application configuration using an ERB template and want to check that certain values have been correctly substituted, you can just test what has been generated as the file content:

Scenario: Proxy host and port have sensible defaults
  Given a node of class "mymodule::myapp"
  And we have loaded "test" settings
  And we have unset the fact "proxy_host"
  And we have unset the fact "proxy_port"
  When I compile the catalog
  Then there should be a file "/etc/myapp/config.properties"
  And the file should contain "proxy.port=-1"
  And the file should contain /proxy\.host=$/


Then /^the file should contain "(.*)"$/ do |text|
  fail "File parameter 'content' was not specified" if @resource["content"].nil?
  fail "Text content [#{text}] was not found" unless @resource["content"].include?(text)
end

Then /^the file should contain \/([^\"].*)\/$/ do |regex|
  fail "File parameter 'content' was not specified" if @resource["content"].nil?
  fail "Text regex [/#{regex}/] did not match" unless @resource["content"] =~ /#{regex}/
end

The complexity and coverage of your tests are limited only by your curiosity about the Puppet catalog and your Ruby skills (both of which are easily developed).


Monday, May 16th, 2011 Tech 2 Comments

reblog: Modeling Class Composition With Parameterized Classes

by Oliver on Wednesday, May 11th, 2011.


I'd like to point out a fine blog post by Dan Bode on parametrised class theory and application. No surprises there – I've done a lot of work with Dan, and our new Puppet implementations reflect this – but I find it a very logical way of working.


Wednesday, May 11th, 2011 Tech No Comments

CI and Vagrant SSH sessions

by Oliver on Sunday, May 8th, 2011.

If you saw my talk in Amsterdam or read my slides (or even work with me, <shudder/>) you would know that one of the stages of our Puppet module testing is compile testing using Puppet Faces (or at least a pre-release version of it, since we started using it a while ago). I'm just putting the finishing touches on the system so that it uses dynamically provisioned VMs with Vagrant, and one of the slightly fiddly bits is actually running things inside the VMs – you'd think this would be a pretty large use case for Vagrant, but I'm not sure how much success is being had with it in reality.

Vagrant wraps the Ruby Net::SSH library to provide SSH access into its VMs (which just uses VirtualBox's port forwarding, since all of the VMs run on a NATed private network on the host). You can easily issue vagrant ssh from within a VM's project directory and it will just work. You can even issue the lesser-known vagrant ssh -e 'some command' form to run a single command in the VM, but it wraps Net::SSH::Connection::Session.exec!, which is completely synchronous (probably what you want) and holds back all output until the process has terminated and the exit code is known (probably not what you want). Especially when the process takes a few minutes and you are watching Jenkins' job console output, you want to see what is going on.

Unfortunately Vagrant’s libraries don’t expose the full richness of Net::SSH, but even if they did you wouldn’t be much better off. Net::SSH gives you connection sessions in which you can issue multiple commands synchronously (as Vagrant typically does), or multiple commands asynchronously – and basically this equates to “assume they will run in parallel, in no particular order”. There is also no direct handling of output and return codes – you need to set up callbacks for these. What this all amounts to is a bit of hackery just to get line-by-line output for our Jenkins job, and capture the return codes of each command properly.

#!/usr/bin/env ruby
require 'rubygems'
require 'net/ssh'

# Run a single command synchronously, streaming its output as it arrives and
# returning the command's exit status.
def ssh_command(session, cmd)
  exit_code = nil
  channel = session.open_channel do |chan|
    chan.exec(cmd) do |ch, success|
      raise "could not execute: #{cmd}" unless success
      ch.on_data do |ch2, data|
        $stdout.puts "STDOUT: #{data}"
      end
      ch.on_extended_data do |ch2, type, data|
        $stderr.puts "STDERR: #{data}"
      end
      ch.on_request "exit-status" do |ch2, data|
        exit_code = data.read_long
      end
    end
  end
  channel.wait   # process the connection until the command finishes
  exit_code
end

Net::SSH.start 'localhost', 'vagrant' do |session|
  ['echo foo >&2', 'sleep 5', 'echo bar', '/bin/false', 'echo baz'].each do |command|
    if (rc = ssh_command(session, command)) != 0
      puts "ERROR: #{command}"
      puts "RC: #{rc}"
    end
  end
end

What this should give you (omitting the streamed STDOUT/STDERR lines) is:

$ ./ssh_test.rb 
ERROR: /bin/false
RC: 1
$ echo $?

Vagrant 0.7.3 sets up a read-only accessor to the Net::SSH::Connection::Session object contained within the Vagrant::SSH object so it’s easy to just hook in to that and set up the code above to get slightly more flexible SSH access to the VM for our CI tasks:

commands = ['some', 'list', 'of', 'commands']

env = Vagrant::Environment.new(:cwd => VMDIR)
env.primary_vm.ssh.execute do |ssh|
  commands.each do |c|
    if (rc = ssh_command(ssh.session, c)) != 0
      puts "ERROR: #{c}"
      puts "RC: #{rc}"
    end
  end
end


Sunday, May 8th, 2011 Tech No Comments

PuppetCamp 2011 Amsterdam Talk

by Oliver on Friday, April 29th, 2011.

If you made it to my talk, I hope you enjoyed it. If not, then maybe next time! Either way, I’ve uploaded my slides which are available here.

If I forgot your name, your face, or just totally forgot who you are I apologise, and similarly if I didn’t introduce myself when I was standing right next to you – I’m a terribly shy person at heart so I probably didn’t have the guts to interrupt your conversation to shake your hand. See you next time or on the mailing lists / IRC 🙂


Friday, April 29th, 2011 Tech 1 Comment

Puppet Dev Training / Camp @ Amsterdam

by Oliver on Monday, April 25th, 2011.

I’m currently at the Puppet Developer course at Beurs van Berlage in Amsterdam. We drove up on Friday from Berlin (which should have taken 6 hours, but due to traffic jams and sanity stops it took almost 11 hours). We’ve now spent the weekend exploring Amsterdam, which is an amazing looking city. The canals are breathtaking, and despite the frenzied bicycles and scooters constantly whizzing past it seems to have a relaxed atmosphere. It is like the vibe on Berlin’s Spree river, but several magnitudes of relaxation higher due to there being so many more waterways.

Three days of training and two days of Puppet Camp… and hopefully some more time for exploring and fun. If you are coming to Puppet Camp, see you there!


Monday, April 25th, 2011 Tech No Comments