It has been quite a while since I've had anything sufficiently detailed or interesting to post on a technical theme, so it was a welcome divergence from the usual sprint work when we started having some Puppet performance issues recently. First up: like most people working at big corporations, I can't give you any figures that might reveal how many servers we run or anything else sensitive, but obviously we run enough machines to cause performance issues on a reasonably well set up Puppet environment. I'll share some statistics which give context and show the before and after effects of the work.
If you are used to working with Puppet you might be accustomed to runs taking perhaps a minute, or a few minutes for complicated catalogs. When run times start to exceed ten minutes it gets uncomfortable, and when they go much beyond that, people start complaining. We had noticed that some of our core systems were taking an extremely long time to complete runs (which were meant to be occurring every 15 minutes) and had also received reports of very long runs from some of our customers.
First off, we ruled out this bug, although we did suffer from it briefly. If you are still running 2.6 and use run stages, please upgrade to 2.7, or you will soon be (or already are) suffering from fairly bad performance during client graph processing. We were seeing several orders of magnitude worse performance, as you can see from the bug report.
How can you start to logically proceed towards a performance fix?
We were already running our Puppetmasters through Apache with Passenger (a.k.a. mod_rails/mod_rack), so we already had logs in combined log format. Unfortunately, while you can get a lot of statistics out of that format, the computational expense of each request and response is not visible. I decided to add the %D field to the log, which records the response time in microseconds for each request, so I could get a better idea of what was actually taking a long time. I had my suspicions that we had recently hit a request ceiling beyond which performance degrades rapidly (not linearly), but I needed actual evidence of this.
For the record, here's the actual log format specification:

```
LogFormat "%h %l %u %t \"%r\" %>s %b %D \"%{Referer}i\" \"%{User-Agent}i\"" puppet
CustomLog /var/log/httpd/puppet.log puppet
```
And the log output lines start looking like this:

```
192.0.2.1 - - [18/Mar/2012:04:02:25 +0000] "GET /environment1/file_metadata/modules/logrotate/etc/logrotate.conf HTTP/1.1" 200 333 42741 "-" "-"
```
In this case, we have a response time of 42741 microseconds, or a bit under 43ms. Don't freak out just yet.
Great, so now I have a few days' worth of logs in the new format to work with and figure out what is going on. Initially I thought it would be simple, but after a few minutes of scanning the logs I realised I'd want to do some analysis that would point me in the exact direction of the problem. There are some basic elements at play here which I wanted the analysis to cover:
These were what I thought I'd start with. Clearly a generic HTTP log statistics package like AWStats is not going to cater to most of my requirements, but at the other end of the scale something like Puppet Dashboard/Console is a little too specific to events around Puppet runs - I wanted something in between. I'd be surprised if no one else had written something, but a few things made me decide to write my own: the challenge, the naive belief that it would be quick and easy, fooling myself into believing it was too specific a problem to reuse someone else's code, and the sheer interest factor.
So I wrote puppet_log_analyse.py. I'm trying to get this open sourced (which is an overall goal for my team) but it has to go through the due process of review first. As soon as that is complete I'll post back with a link to where it is externally available, should it be useful for anyone.
It's my first bit of Python in quite a while so it's probably not brilliant code. It went through several iterations as I wanted more information from the logs, and also as I experimented with different ways of storing and querying the data. I initially used heapq, since I was looking for the worst and best results out of many, and a heap keeps inserts relatively cheap while making it easy to pull out the smallest or largest items when you query it. Unfortunately some of the more interesting statistics I wanted to generate proved harder to get out of heapq than I expected, so despite taking a performance hit (and probably paying a lot more in memory at runtime) I switched to SQLite, where I could express myself in familiar SQL while still keeping all the data in memory. Hey, I've got 8GB in my laptop so it's no problem!
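To give an idea of the approach (this is a minimal sketch rather than puppet_log_analyse.py itself, and the regex assumes the log format shown earlier), the whole thing boils down to parsing each line into a resource type and a response time, loading that into an in-memory SQLite table, and letting SQL do the aggregation:

```python
# Minimal sketch: parse combined-format logs with the extra %D field into
# in-memory SQLite and aggregate response times by request type.
import re
import sqlite3
import sys

LINE_RE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d+) \S+ (?P<micros>\d+)'
)

def resource_type(path):
    # /<environment>/<resource_type>/... -> <resource_type>
    parts = path.split("/")
    return parts[2] if len(parts) > 2 else "unknown"

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE requests (method TEXT, resource TEXT, status INTEGER, ms REAL)")

with open(sys.argv[1]) as logfile:
    for line in logfile:
        match = LINE_RE.match(line)
        if match:
            db.execute("INSERT INTO requests VALUES (?, ?, ?, ?)",
                       (match.group("method"),
                        resource_type(match.group("path")),
                        int(match.group("status")),
                        int(match.group("micros")) / 1000.0))

query = """SELECT method, resource, COUNT(*), AVG(ms), MIN(ms), MAX(ms)
           FROM requests GROUP BY method, resource ORDER BY COUNT(*) DESC"""
for row in db.execute(query):
    print(row)
```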
I ended up generating a reasonable set of different metrics, but the ones I relied on most when deciding what to optimise came from the following:

```
Top resource types:
  7129326 GET file_metadata
   285765 GET file_metadatas
    85287 GET catalog
    84108 PUT report
     1770 GET file_content
       15 GET certificate
        5 GET certificate_request
        5 PUT certificate_request
        5 GET certificate_revocation_list

Statistics by resource type:
                                    AVG         STD.DEV     MIN        MAX
GET catalog                         6951.7ms    5430.7ms    8.175ms    2.9996e+05ms
GET certificate                     10.722ms    3.6854ms    5.128ms    17.609ms
GET certificate_request             22.437ms    29.293ms    5.567ms    80.777ms
GET certificate_revocation_list     19.492ms    29.293ms    3.877ms    78.062ms
GET file_content                    10.106ms    56.091ms    2.917ms    1457ms
GET file_metadata                   13.58ms     639.82ms    2.243ms    2.9982e+05ms
GET file_metadatas                  123.61ms    2817.5ms    3.587ms    2.9197e+05ms
PUT certificate_request             55.047ms    3.0518ms    50.834ms   58.675ms
PUT report                          2005.5ms    2411.3ms    7.438ms    2.99e+05ms
```
Yikes, that's one loaded server. You can see that response times are not great across the board, but the maximums are woeful. A lot of time is being spent generating catalogs (no surprise), but I was not expecting the sheer number of file_metadata and file_metadatas requests, which are a couple of orders of magnitude more numerous than even catalog requests (and the catalog count is exactly the number of runs that took place). Taking the average response time of file_metadata requests and their count over these logs, the server spent about 26.88 hours out of roughly 60 hours of logs generating MD5 sums and returning them - probably for the same files over and over again, even though they are not changing.
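That figure is just the request count multiplied by the average response time from the tables above:

```python
# Back-of-the-envelope check using the figures from the tables above.
requests = 7129326        # GET file_metadata count
avg_ms = 13.58            # average file_metadata response time
hours = requests * avg_ms / 1000 / 3600
print("%.2f hours" % hours)    # => 26.89 hours, out of roughly 60 hours of logs
```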
Catalogs can be precompiled, and there are various other tweaks that can be made, but the first idea that sprang to mind was simply caching the file_metadata responses. Even if Puppet were caching the responses (and it's not clear to me that it ever did), each request still goes through a whole lot of slow Ruby to return the same result. Caching closer to the requestor makes sense from a performance point of view, and I was already aware of mod_mem_cache. Rather than spend more time adding a caching layer with a separate product like Varnish, why not just add more configuration to the Apache instance that was already in place?
I came up with a reasonable-looking configuration and did some basic testing:

```
<IfModule mod_mem_cache.c>
    CacheEnable mem /
    CacheDefaultExpire 300
    MCacheSize 1024000
    MCacheMaxObjectCount 10000
    MCacheMinObjectSize 1
    MCacheMaxObjectSize 2048000
    CacheIgnoreNoLastMod On
</IfModule>
```
I'll explain some of these settings later on. What kind of a performance gain can you get by caching a file_metadata response (which is just an MD5 sum right now, due to another bug)? Obviously this depends on file size and CPU hashing speed, but let's take a reasonable example of a 256KB file:

```
127.0.0.1 - - [01/Apr/2012:01:30:55 -0700] "GET /production/file_metadata/modules/testmodule/testfile HTTP/1.1" 200 342 7853 "-" "curl/7.15.5 (x86_64-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5"
```
We can see that this request took 7853 microseconds, or just under 8ms, to process and return on an otherwise idle machine. That seems pretty fast already, but what happens when we turn on caching?

```
127.0.0.1 - - [01/Apr/2012:01:34:26 -0700] "GET /production/file_metadata/modules/testmodule/testfile HTTP/1.1" 200 342 6960 "-" "curl/7.15.5 (x86_64-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5"
127.0.0.1 - - [01/Apr/2012:01:34:26 -0700] "GET /production/file_metadata/modules/testmodule/testfile HTTP/1.1" 200 342 226 "-" "curl/7.15.5 (x86_64-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5"
```
The first request after the restart is predictably about the same speed as before, but the subsequent response is about 30 times faster. That's a big improvement for our most frequent query.
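Out of curiosity I also wanted a feel for how much of the uncached time is the MD5 sum itself. A rough micro-benchmark (in Python here, rather than the Ruby the Puppetmaster actually runs, so treat the numbers as indicative only) shows that hashing 256KB comes in well under a millisecond on reasonably modern hardware, which suggests most of the uncached response time is spent in the surrounding stack rather than in the hash:

```python
# Rough micro-benchmark: cost of an MD5 sum over a 256KB buffer.
import hashlib
import os
import time

data = os.urandom(256 * 1024)    # roughly a 256KB file's worth of data
iterations = 1000

start = time.time()
for _ in range(iterations):
    hashlib.md5(data).hexdigest()
elapsed = time.time() - start

print("MD5 of 256KB: %.3f ms per hash" % (elapsed * 1000 / iterations))
```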
So we're done now, right? Well, you could just go ahead and implement this as-is given what we've seen already, but I wanted to be more certain I'd actually see a reasonable improvement. There are also some caveats to the configuration which I'll list at the end of the post. To get some concrete figures on the performance improvement, I decided to run some tests before and after the change using Apache JMeter.
I won't go into depth on JMeter as there is plenty of information about it out there and, honestly, it's a snap to set up and use even if you've never used it before. The main problems I hit immediately while testing were as follows:

- The Puppetmaster wants clients to present their SSL certificates, so I had to convert an existing client certificate and key into a format JMeter's SSL Manager could load: `openssl pkcs12 -export -out jmeter.p12 -in cert.pem -inkey priv.pem`. You'll need to put a passphrase on the exported file (OpenSSL asks you for it) but the SSL Manager in JMeter handles this when the file is loaded.
- JMeter is a GUI application (at least the way I was using it) and my load generator had no graphical environment. So, I installed a barebones X11 environment on my remote load generator machine and ran it all from there.

The test scenario was fairly simple:
- Request https://puppetmaster:8140/production/file_metadata/modules/testmodule/testfile with GET.
- Set the request header 'Accept: yaml'.
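Outside of JMeter, the same request can be reproduced with a few lines of Python if you want to spot-check response times by hand. This is just a sketch: the hostname and the certificate, key and CA paths are placeholders for wherever your Puppet SSL files actually live.

```python
# Rough equivalent of the JMeter request: GET a file_metadata URL over SSL
# using a Puppet client certificate, and time the response.
import http.client
import ssl
import time

# Placeholder paths -- point these at your own Puppet client cert, key and CA.
ctx = ssl.create_default_context(cafile="ca.pem")
ctx.check_hostname = False          # the Puppet CA's names may not satisfy strict hostname checks
ctx.load_cert_chain(certfile="cert.pem", keyfile="priv.pem")

conn = http.client.HTTPSConnection("puppetmaster", 8140, context=ctx)
start = time.time()
conn.request("GET",
             "/production/file_metadata/modules/testmodule/testfile",
             headers={"Accept": "yaml"})
resp = conn.getresponse()
body = resp.read()
print("%d %d bytes in %.1f ms" % (resp.status, len(body), (time.time() - start) * 1000))
```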
Admittedly, by only testing a single file in a single module I was not making this an extremely good test, but I wanted to verify that caching was going to generate some reasonable savings across many clients, and I assumed responses in the production system would be cached quickly enough and reused by enough clients to make a difference. In this case, the file enters the filesystem cache after the first request, so it is being served from memory with or without response caching. As I previously mentioned, the Puppetmaster doesn't seem to be doing any other form of caching for the file, so the difference in performance should only be between running through the Puppetmaster code and generating an MD5 sum versus returning the cached response.
Here's what the output from JMeter tended to look like:
After speaking with some long-time JMeter users, it appears I'm not the only one who thinks that this double-rainbow isn't all that useful. It gives you a reasonable overview of what is going on early in the run but beyond that if you want some usable graphs it is better to export the results and make your own. In any case, the run summary was my best indicator of results:
Before caching:
After enabling caching:
It's not the 30 times faster I'd hoped for but still, double the throughput is a significant win. The vastly decreased response time and obvious improvements in server load also seem to point to a potential big win when this is actually deployed.
Now armed with results I decided to work the configuration changes into our Puppetmaster build (which itself is automated) and do some final smoke checks as QA before rolling out.
The general plan was to make a new release of our Puppetmaster, do some similar tests and then release to production. I spun up a new Puppetmaster using our automation recipe for it and attempted the same basic requests using curl, but saw fairly inconsistent results. I could often make ten or more requests before seeing a response time in the logs that corresponded to a cached result being returned. Even more surprisingly, with Apache's debugging turned on, I could verify that it was caching the result but not returning it every time.
Unfortunately I'd missed this note in the mod_mem_cache documentation:
... In particular, this module's cache is per-process, ...
In my case, requests were being handled by more or less random child processes, and I'd only get a cached response when I happened to hit the same child process again (and its cached copy was still fresh). Easy to see when testing by hand, not so easy when sending 1000 requests from 100 test clients in JMeter. This does have some ramifications though: the cache (and its memory limits) is duplicated in every child process, and each child has to populate its own copy before it can serve any hits.
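To make that concrete, here's a toy simulation (the numbers are entirely hypothetical: 20 worker processes, 1000 identical requests) of what a per-process cache does when requests land on random children:

```python
# Toy simulation of a per-process cache: each worker keeps its own cache,
# so identical requests keep missing until every worker holds its own copy.
import random

workers = 20                   # hypothetical number of child processes
caches = [set() for _ in range(workers)]
hits = misses = 0

for _ in range(1000):          # 1000 identical requests
    cache = caches[random.randrange(workers)]
    if "file_metadata/testfile" in cache:
        hits += 1
    else:
        misses += 1
        cache.add("file_metadata/testfile")

print("hits=%d misses=%d" % (hits, misses))   # expect roughly one miss per worker
```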
Here are some statistics from the logs for the same Puppetmaster after the change:

```
Statistics by resource type:
                                    AVG         STD.DEV     MIN        MAX
GET catalog                         7053.9ms    3062.9ms    4.559ms    55651ms
GET certificate                     15.386ms    36.771ms    4.381ms    263.69ms
GET certificate_request             7.351ms     2.5096ms    5.001ms    15.452ms
GET certificate_revocation_list     4.4384ms    1.0202ms    3.252ms    7.733ms
GET file_content                    15.078ms    150.57ms    2.466ms    5207.8ms
GET file_metadata                   8.4615ms    109.2ms     0.049ms    20517ms
GET file_metadatas                  65.971ms    345.93ms    3.546ms    21223ms
PUT certificate_request             53.68ms     2.56ms      49.41ms    59.803ms
PUT report                          434.32ms    1281.1ms    4.569ms    2.3324e+05ms
```
Across the board, operations haven't changed a great deal, but they are generally more consistent. File_metadata requests now average around 60% of their previous response time, which is a pretty big change considering the number of these operations the Puppetmaster is processing. In terms of what actual effect this had on load, we went from seeing about 1 runnable process per core constantly to about 0.5.
This was an interesting exercise, especially since I'd had caching and Puppet on the radar for quite some time (just look at this page by Masterzen) but had never actually needed to investigate it properly. The work paid off and will let our system scale up a whole lot more. I haven't been doing a great deal of Puppet work recently (hey, it's nice to branch out and do other things) but I like digging into the internals when I can to find a good solution to an irritating problem.
Fortunately or unfortunately there's always a plethora of work to be done on infrastructure issues and improvements so I'll put in a small plug for my company. If this post or anything I've written in the past piques your interest, I need more people on my team! http://devops.nokia.com/ is the place to go to get on board :)