Archive for April, 2012

Puppetmaster performance tuning with mod_mem_cache

by Oliver on Saturday, April 7th, 2012.

It has been quite a while since I’ve had anything sufficiently detailed or interesting to post on a technical theme so it was quite a welcome divergence from the usual sprint work recently when we started having some Puppet performance issues. First up, like most people working at big corporations I can’t give you any figures that might give away numbers of servers or anything else sensitive but obviously we run enough machines to cause performance issues on a reasonably set up Puppet environment. I’ll share some statistics with you as well which give context and show the kind of before and after effects of the work.

If you are used to working with Puppet you might be accustomed to runs taking perhaps a minute, or a few minutes for complicated catalogs. When your run time starts to exceed ten minutes it starts getting uncomfortable, and when it gets much more than that people start complaining. We had noticed some of our core systems were taking an extremely long time to complete runs (which were meant to be occurring every 15 minutes) and had also had some reports of extremely long runs for some of our customers.

First off, we ruled out this bug although we suffered from it briefly. If you are still running 2.6 and use run stages, please upgrade to 2.7 or you will soon (or already) be suffering from fairly bad performance during client graph processing. We were seeing several orders of magnitude worse performance as you can see from the bug report.

How can you start to logically proceed towards a performance fix?

  • Puppet reports from the client do contain timestamps for the time each resource was processed, but no duration of client or server processing time (actually, wall time of an individual resource is what we are interested in).
  • Without digging deeper into the exact processing time of each event it is not possible to see whether performance problems are solely client-related, server-related or both. Puppet is a fairly chatty client/server application so a run is tightly bound between the two (at least, in a standard configuration).
  • Generally clients will outnumber servers by at least an order of magnitude, and possibly up to four (five?) if your configuration is lightweight or runs don’t occur frequently. (This is obviously an approximation, and could differ significantly with aggressive load-balancing and performance tuning. You may even have only clients and no servers.)
  • We need to break down exactly which operations are costing us computing effort, and work out what would give us the maximum gain with minimum time spent.
  • Therefore, we need information to gain awareness of these operations.

Better Logging

We were already running our Puppetmasters through Apache with Passenger (a.k.a. mod_rails/mod_rack) so we had logs in combined log format already in existence. Unfortunately, while you can get an idea of a lot of statistics already from this format of logs, computational expense of each request and response is not clear. I decided to add the %D field to the log which gives the response time in microseconds for each request, which would allow me to get a better idea of what was actually taking a long time. I had my suspicions that we had recently hit a request ceiling beyond which performance would degrade rapidly (not linearly) but I needed actual evidence of this.

For the record, here’s the actual log format specification:

LogFormat "%h %l %u %t \"%r\" %>s %b %D \"%{Referer}i\" \"%{User-Agent}i\"" puppet
CustomLog /var/log/httpd/puppet.log puppet

And the log output lines start looking like this: - - [18/Mar/2012:04:02:25 +0000] "GET /environment1/file_metadata/modules/logrotate/etc/logrotate.conf HTTP/1.1" 200 333 42741 "-" "-"

In this case, we have a response time of 42741 microseconds. Don’t freak out just yet.

Great, so now I have a few days worth of logs in the new format to work with and figure out what is going on. Initially I thought it would be simple but after a few minutes of scanning the logs I realised I’d want to do some analysis which would point me in the exact direction of the problem. There are some basic elements at play here which I wanted to do the analysis over:

  • Environments: We allow separation of distinct apps, and releases of those apps by using the Puppet environment. So by looking at the environment we can get an idea if one app is causing excessive load, or one specific release of that app.
  • Requestor: Is one host, or a group of hosts hitting the Puppetmaster more than it should?
  • Indirector: Are catalog compiles taking a long time and causing everything else to be slow, is report processing taking the time or are we serving excessive numbers of file resources? I erroneously named this the “resource type” in my analysis script because I was in a RESTful state of mind but technically these are Indirectors in Puppet Land.
  • Resources: Within a given Indirector, is a particular resource taking a long time to generate (e.g. a particularly large or complex catalog, or an extremely large file).

These were what I thought I’d start with. Clearly a generic http log statistics package like AWStats is not going to cater to most of my requirements, but at the other end of the scale something like Puppet Dashboard/Console is a little too specific to events around Puppet runs – I wanted something in between. I’d be surprised if no one else had written something but a few things made me decide to write my own; the challenge, the naive belief it would be quick and easy, fooling myself into believing it was too specific a problem to reuse someone else’s code, and for the sheer interest factor.

Log Analysis and Problem Identification

So I wrote I’m trying to get this open sourced (which is an overall goal for my team) but it has to go through the due process of review first. As soon as this is complete I’ll post back with a link to where it is externally available should it be useful for anyone.

It’s my first bit of Python in quite a while so it’s probably not brilliant code. It underwent several iterations as I wanted more information from the logs, and also as I experimented with different ways of storing and querying the data. I initially was using heapq as I was looking for the worst or best results from many, and heapq maintains relatively inexpensive insert cost while providing a data structure that will be sorted already when you query it. Unfortunately some of the more interesting statistics I wanted to generate proved harder to get out of heapq than I expected so despite taking a performance hit (and probably costing a lot more memory at runtime) I switched to SQLite where I could express myself in familiar SQL, while still storing all the data in memory. Hey, I’ve got 8GB in my laptop so it’s no problem!

I ended up generating a reasonable set of different metrics but the ones that I ended up using the most for the decision on what to optimise came from the following:

Top resource types:
  7129326 GET file_metadata
  285765 GET file_metadatas
  85287 GET catalog
  84108 PUT report
  1770 GET file_content
  15 GET certificate
  5 GET certificate_request
  5 PUT certificate_request
  5 GET certificate_revocation_list

Statistics by resource type:
                                   AVG      STD.DEV    MIN   MAX
 GET catalog                       6951.7ms 5430.7ms 8.175ms 2.9996e+05ms
 GET certificate                   10.722ms 3.6854ms 5.128ms 17.609ms
 GET certificate_request           22.437ms 29.293ms 5.567ms 80.777ms
 GET certificate_revocation_list   19.492ms 29.293ms 3.877ms 78.062ms
 GET file_content                  10.106ms 56.091ms 2.917ms 1457ms
 GET file_metadata                 13.58ms 639.82ms 2.243ms 2.9982e+05ms
 GET file_metadatas                123.61ms 2817.5ms 3.587ms 2.9197e+05ms
 PUT certificate_request           55.047ms 3.0518ms 50.834ms 58.675ms
 PUT report                        2005.5ms 2411.3ms 7.438ms 2.99e+05ms

Yikes, that’s one loaded server. You can see that response times are not great across the board but the maximums are woeful. A lot of time is being spent generating catalogs (which is no surprise) but I was not expecting the sheer number of file_metadata and file_metadatas requests which are a couple of orders of magnitude higher in count than even catalogs (which is exactly the number of runs that have taken place). Taking the average response time of file_metadata requests and their count over these logs we can see that the server spent about 26.88 hours out of what was about 60 hours of logs generating MD5 sums and returning them. Probably for the same files over and over again, which are not changing.

Turning on Caching in Apache

Catalogs can be precompiled and there are various other tweaks that can be made but the first idea that sprang to mind was simply caching the file_metadata responses. Even if Puppet were caching the responses (and it’s not clear to me that it ever did), we are still having our requests go through a whole lot of slow Ruby to get the same result back. Caching closer to the requestor makes sense from a performance point of view, and I was already aware of mod_mem_cache. Rather than spend more time adding a caching layer with a separate product like Varnish why not just add more configuration to Apache which was already in place?

I came up with a reasonable looking configuration and did some basic testing.

        <IfModule mod_mem_cache.c>
                CacheEnable mem /
                CacheDefaultExpire 300
                MCacheSize 1024000
                MCacheMaxObjectCount 10000
                MCacheMinObjectSize 1
                MCacheMaxObjectSize 2048000
                CacheIgnoreNoLastMod On

I’ll explain some of these settings later on. What kind of a performance gain can you get by caching a file metadata (which is just an MD5 sum right now, due to another bug)? Obviously this is dependent on file size and CPU hashing speed but let’s take a reasonably example of a 256KB file. - - [01/Apr/2012:01:30:55 -0700] "GET /production/file_metadata/modules/testmodule/testfile HTTP/1.1" 200 342 7853 "-" "curl/7.15.5 (x86_64-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5"

We can see that this request took 7853 microseconds, or just under 8ms to process and return on an otherwise idle machine. That seems pretty fast already, but what happens when we turn on caching? - - [01/Apr/2012:01:34:26 -0700] "GET /production/file_metadata/modules/testmodule/testfile HTTP/1.1" 200 342 6960 "-" "curl/7.15.5 (x86_64-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5" - - [01/Apr/2012:01:34:26 -0700] "GET /production/file_metadata/modules/testmodule/testfile HTTP/1.1" 200 342 226 "-" "curl/7.15.5 (x86_64-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5"

The first request after the restart is predictably about the same speed as before, but the subsequent response is about 30 times faster. That’s a big improvement for our most frequent query.

So we’re done now, right? Well, you could just go ahead and implement this as is given what we’ve seen already but I wanted to be a bit more certain I’d actually see a reasonable improvement. There are also some caveats to the configuration which I’ll list at the end of the post. To gain some more concrete figures of performance improvement I decided to run some performance tests before and after the change using Apache JMeter.

Performance Testing the Solution

I won’t go into depth with JMeter as there is plenty of information about it out there and honestly, it’s a snap to set up and use even if you’ve never used it before. The main problems I hit immediately while testing were as follows:

  • Generating load and serving the application under test on the same machine is a recipe for disaster (which should be obvious). So you will want to have a dedicated load generator and a separate machine for the application (the Puppetmaster in this case).
  • If you are attempting to do distributed load testing (i.e. with dedicated JMeter servers and a separate frontend) you will hit the usual RMI connection problems if there are firewalls/NAT in between your servers and the frontend.
  • Testing the Puppetmaster necessitates using SSL and a client certificate for authentication/authorisation. Currently it seems as though the JMeter SSL Manager will not send client certificate details through to the JMeter servers when doing distributed load testing. In order to work around this I used stunnel directly between the JMeter server and the Puppetmaster but I soon hit performance bottlenecks due to stunnel – it appears to be quite inefficient.
  • In order to use your existing Puppet-generated certificates and private keys you’ll need to package them in a PKCS#12-format file. For the sake of readers the command to do so is openssl pkcs12 -export -out jmeter.p12 -in cert.pem -inkey priv.pem. You’ll need to put a passphrase on the exported file (OpenSSL asks you for it) but the SSL Manager in JMeter handles this when the file is loaded.

So, I installed a barebones X11 environment on my remote load generator machine and ran it all from there. The test scenario was fairly simple:

  • HTTP Request was being made to https://puppetmaster:8140/production/file_metadata/modules/testmodule/testfile with GET.
  • The HTTP Header Manager was used to set 'Accept: yaml'.
  • The Graph Results listener was used to generate a simple graph of the output.
  • I used 100 concurrent threads each making 1000 requests.

Admittedly by only testing a single file in a single module I was not making this an extremely good test but I wanted to verify that caching was going to generate some reasonable savings across many clients, so I assumed responses in the production system would be cached quite quickly and reused for enough clients to make a difference. In this case, the file enters the filesystem cache after the first request so it is being served from memory with or without response caching. As I previously mentioned, the Puppetmaster doesn’t seem to be doing any other form of caching for the file so the diffence in performance should only be between running through the Puppetmaster code and generating an MD5 sum versus returning the cached response.

Here’s what the output from JMeter tended to look like:

JMeter graph results

JMeter graph results

After speaking with some long-time JMeter users, it appears I’m not the only one who thinks that this double-rainbow isn’t all that useful. It gives you a reasonable overview of what is going on early in the run but beyond that if you want some usable graphs it is better to export the results and make your own. In any case, the run summary was my best indicator of results:

Before caching:

  • Throughput: ~20859 requests/minute
  • Average: 276ms
  • Median: 100ms
  • Deviation: 712ms
  • Client load average: ~3.5
  • Server load average: ~55

After enabling caching:

  • Throughput: ~50000 requests/minute
  • Average: 84ms
  • Median: 41ms
  • Deviation: 735ms
  • Client load average: ~11
  • Server load average: ~1.2

It’s not the 30 times faster I’d hoped for but still, double the throughput is a significant win. The vastly decreased response time and obvious improvements in server load also seem to point to a potential big win when this is actually deployed.

Now armed with results I decided to work the configuration changes into our Puppetmaster build (which itself is automated) and do some final smoke checks as QA before rolling out.

Making It Live… Almost

The general plan was to make a new release of our Puppetmaster, do some similar tests and then release to production. I spun up a new Puppetmaster using our automation recipe for it and attempted the same basic requests using curl, but saw fairly inconsistent results. I could often make ten or more requests before seeing a response time in the logs corresponding to a cached result being returned. Even more surprising, with Apache’s debugging turned on, I could verify that it was caching the result but not returning it all the time.

Unfortunately I’d missed this note in the mod_mem_cache documentation:

... In particular, this module's cache is per-process, ...

In my case, the handling of requests had gone to somewhat random child processes and I’d only received a cached response after hitting the same child process again (assuming the cached response is still fresh). Easy to see when testing by hand, not so easy when sending 1000 requests from 100 test clients in JMeter. This does have some ramifications though:

  • Aside from memory requirements of the Puppetmaster app running through Passenger, you effectively have to account for a completely separate cache running in each child process of Apache. If you want to ensure a good hit rate, you may need a reasonable amount of RAM.
  • If you have a very small amount of clients and requests, a lot of Apache child processes and a lot of files, it is probable that you won’t hit the cache much of the time if the cached responses are evenly distributed. There’s not a big downside to this in general as you’ll get a performance increase even with a few hits but you do stand a greater chance of suffering from stale responses which may not make the change worth it.

Making It Live For Real

Here are some statistics from the logs for the same Puppetmaster after the change:

Statistics by resource type:
                                   AVG      STD.DEV    MIN   MAX
 GET catalog                       7053.9ms 3062.9ms 4.559ms 55651ms
 GET certificate                   15.386ms 36.771ms 4.381ms 263.69ms
 GET certificate_request           7.351ms 2.5096ms 5.001ms 15.452ms
 GET certificate_revocation_list   4.4384ms 1.0202ms 3.252ms 7.733ms
 GET file_content                  15.078ms 150.57ms 2.466ms 5207.8ms
 GET file_metadata                 8.4615ms 109.2ms 0.049ms 20517ms
 GET file_metadatas                65.971ms 345.93ms 3.546ms 21223ms
 PUT certificate_request           53.68ms 2.56ms 49.41ms 59.803ms
 PUT report                        434.32ms 1281.1ms 4.569ms 2.3324e+05ms

Across the board, operations haven’t changed a great deal but they are generally more consistent. File_metadata operations have improved by about 60% which is a pretty big change considering the number of these operations the Puppetmaster is processing. In terms of what actual effect this had on load, we went from seeing about 1 runnable process per core constantly to about 0.5.

And finally…

This was an interesting exercise, especially since I’d had caching and Puppet on the radar for quite some time now (just look at this page by Masterzen) but not actually had a chance to need it or investigate it. The work paid off and will let our system scale up a whole lot more. I haven’t been doing a great deal of Puppet work recently (hey, it’s nice to branch out and do other things) but I like digging into the internals when I can to find a good solution to an irritating problem.

Fortunately or unfortunately there’s always a plethora of work to be done on infrastructure issues and improvements so I’ll put in a small plug for my company. If this post or anything I’ve written in the past piques your interest, I need more people on my team! is the place to go to get on board 🙂


  • I initially started testing in our internal, private cloud but when I hit performance issues with JMeter testing I moved my testing to a couple of instances in HP’s cloud which I have beta access to. Performance was much better there but I realised the critical bottleneck was stunnel as I’ve already mentioned above. Once I had confirmed numbers, I moved the testing back to our internal cloud and saw equivalent numbers except for OpenVZ containers where it was almost double the throughput in some cases. This corresponds to previous examples we’ve seen compared with KVM (which we also support internally, and is what HP’s cloud runs on). If you run your Puppetmasters virtualised, consider OpenVZ or Virtuozzo.
  • In our mod_mem_cache configuration I set a fairly low timeout (300 seconds) and I’ve chosen to tell Apache to ignore the lack of ETags or last-modified headers (CacheIgnoreNoLastMod On) in responses that would usually assist it in stale response expiry. Unfortunately Puppet doesn’t yet send back any kind of metadata like this for responses so we can’t nicely check the freshness of responses, however for file_metadata responses like this arguably it doesn’t make a lot of sense anyway.
  • Unlike some other configuration management systems that rely on proprietary protocols between client and server (CFengine comes to mind, although I haven’t looked at it in a while), Puppet relies on REST between client and server which means a lot of the standard methodologies for improving web performance are applicable. This is really nice to fall back on, as you can see from what I’ve written.

Tags: , , , ,

Saturday, April 7th, 2012 Tech 11 Comments