Mothers Day 2012

by Angela on Saturday, May 12th, 2012.

So this is what the world looks like when I have had sleep. Its good. its surreal. I have motivation to write a blog post for the first time in months.

Since returning to Berlin early 2012 –  Kaichen has been in Kita. This has allowed me to have time  to sleep and also be “alone” for a few hours each day  for the first time since he was born ( over two years)-  and  this “freedom” for want of a better word, its a foreign concept to me.

I am still adjusting, even now I still look around for him, and I still have to tell myself that I can actually have a shower or sit down to enjoy a coffee without looking up to see what he is doing.

I don’t really know how to put into words the way the world looks now through non tired eyes –  shinier perhaps?

After talking to Kais Kita teacher, (aka Anne Hathaway of Berlin) –  she enlightened me to the idea that perhaps i had been functioning at such a low energy/on no sleep level for so long – it became the norm, and suddenly, my mind, now fuelled on sleep, (plus the mandatory caffeine) is buzzing on a new level!

suddenly ideas and thoughts come flooding to me – things seem possible in  new way, on a practical and theoretical level. Even my daily yoga is met with a new breath, I can feel myself going deeper into the so called “dead zones”…

I have known for a while now that I have become totally incapable of relaxing. I cant sit still. I have borrowed the term, “the blessed unrest” taken from the idea that it is the blessed unrest that keeps you marching. Partly i fear if i slow, i wont get going again, ( think  - the rolling stone, no moss analogy). Oli also has a similar blessed unrest, and we both can see it in Kai. His Kita teacher, is amazed at how Kai “never stops moving”, he is like the energizer bunny. When she first confessed this, I was so relieved. For so long I was under the assumption that I was a bad mother for being tired running after him, thinking that Kai needing to escape, run for the road without fear, go down slides head first on his own, climb onto and into everything, and catapult himself off things, was normal boy behaviour, and to an extent it is – but Kai takes it to the next level of intensity. I also love him for it. He has an unwavering sense of adventure. If there is a new thing to explore, he leads the way. on the flip side, he also gets bored and restless and has a short fuse. His stubbornness is not a huge shock, ( both oliver and I are very stubborn by nature) but a lot of these arguments do fall under the nature/nurture debate.

A reflection fitting in line with Mothers Day tomorrow, finds me at the point where I am still riddled with guilt – feeling like every day  I commit myself to being ” a better mother”/ more patient, more attentive to meeting Kais needs, and yet, I still fall down. It also never ceases to amaze me that many mothers I speak too from many varying backgrounds,  share this sentiment!

Somehow I can give advice I cant take myself. Hypocritically of me, I can reassure someone else that they are a brilliant mother, and that they are human, and thus allowed to make mistakes  - yet I want more for myself, and for Kaichen.

My mothers Day present to myself? A Butoh Workshop entitled “Frontier is the Body” with the famous Yuko Kaseki – held here in Berlin, at Eden Studios An eight day workshop, that will allow me to explore my sense of self (whatever that is now).

A chance to use my body in conjunction with my mind. To explore movement and my kinesthetic awareness to re connect with myself. I was lucky to secure a place, and I am just a little excited!

Saturday, May 12th, 2012 Thoughts No Comments

The Duck Always Bites Twice

by Oliver on Tuesday, May 8th, 2012.

These days I’m noticing myself saying more and more frequently that Duck Typing is great, except when it’s not.

An amusing issue that briefly cropped up this afternoon was when we failed to correctly negotiate a data structure inside of a Rake task. Consider the following basic task:

desc "a test task"
task :test, :glob do |t,args|
  if args[:glob].nil?
    args[:glob] = 'some default value'
  end
  puts args[:glob]
end


What kind of output would you expect would happen if you ran rake test right now? If you said nil you’d be right! That’s odd, I wonder what is going on here?

...
puts args
...


Some debugging code later… what is the output? That’s right, it’s an empty hash – {}.

You could forgive us for thinking it might behave as one. Anyway, needless to say we then tried args.class and it turns out to be a Rake::TaskArguments, which evidently decides to make the arguments immutable but in such a way that you never know about it.

What usually happens?

$ irb
irb(main):001:0> class Foo
irb(main):002:1> attr_reader :bar
irb(main):003:1> def initialize(value)
irb(main):004:2> @bar = value
irb(main):005:2> end
irb(main):006:1> end
=> nil
irb(main):007:0> f = Foo.new(5)
=> #
irb(main):008:0> f.bar
=> 5
irb(main):009:0> f.bar = 6
NoMethodError: undefined method `bar=' for #
	from (irb):9

If you’ve seen the WAT video then you know what’s coming next:

    def method_missing(sym, *args, &block)
      lookup(sym.to_sym)
    end

...

    protected

    def lookup(name)
      if @hash.has_key?(name)
        @hash[name]
      elsif ENV.has_key?(name.to_s)
        ENV[name.to_s]
      elsif ENV.has_key?(name.to_s.upcase)
        ENV[name.to_s.upcase]
      elsif @parent
        @parent.lookup(name)
      end
    end


To be fair, this is actually kinda cool. Not only can you do something like args.glob you can also do args[:pwd] or args.term or args.USERNAME.

Unfortunately it lets you do completely unexpected things as in the above example, which is handily translated into the symbol :[]= (which I like to call the Cookie Monster symbol), which doesn’t exist, returns nothing and throws away the value you attempted to assign to it. Because it is handled by method_missing, the additional value we supplied was accepted but not used, unlike any typical situation where it will cause a compile error.

Tags: , ,

Tuesday, May 8th, 2012 Tech No Comments

Puppetmaster performance tuning with mod_mem_cache

by Oliver on Saturday, April 7th, 2012.

It has been quite a while since I’ve had anything sufficiently detailed or interesting to post on a technical theme so it was quite a welcome divergence from the usual sprint work recently when we started having some Puppet performance issues. First up, like most people working at big corporations I can’t give you any figures that might give away numbers of servers or anything else sensitive but obviously we run enough machines to cause performance issues on a reasonably set up Puppet environment. I’ll share some statistics with you as well which give context and show the kind of before and after effects of the work.

If you are used to working with Puppet you might be accustomed to runs taking perhaps a minute, or a few minutes for complicated catalogs. When your run time starts to exceed ten minutes it starts getting uncomfortable, and when it gets much more than that people start complaining. We had noticed some of our core systems were taking an extremely long time to complete runs (which were meant to be occurring every 15 minutes) and had also had some reports of extremely long runs for some of our customers.

First off, we ruled out this bug although we suffered from it briefly. If you are still running 2.6 and use run stages, please upgrade to 2.7 or you will soon (or already) be suffering from fairly bad performance during client graph processing. We were seeing several orders of magnitude worse performance as you can see from the bug report.

How can you start to logically proceed towards a performance fix?

  • Puppet reports from the client do contain timestamps for the time each resource was processed, but no duration of client or server processing time (actually, wall time of an individual resource is what we are interested in).
  • Without digging deeper into the exact processing time of each event it is not possible to see whether performance problems are solely client-related, server-related or both. Puppet is a fairly chatty client/server application so a run is tightly bound between the two (at least, in a standard configuration).
  • Generally clients will outnumber servers by at least an order of magnitude, and possibly up to four (five?) if your configuration is lightweight or runs don’t occur frequently. (This is obviously an approximation, and could differ significantly with aggressive load-balancing and performance tuning. You may even have only clients and no servers.)
  • We need to break down exactly which operations are costing us computing effort, and work out what would give us the maximum gain with minimum time spent.
  • Therefore, we need information to gain awareness of these operations.

Better Logging

We were already running our Puppetmasters through Apache with Passenger (a.k.a. mod_rails/mod_rack) so we had logs in combined log format already in existence. Unfortunately, while you can get an idea of a lot of statistics already from this format of logs, computational expense of each request and response is not clear. I decided to add the %D field to the log which gives the response time in microseconds for each request, which would allow me to get a better idea of what was actually taking a long time. I had my suspicions that we had recently hit a request ceiling beyond which performance would degrade rapidly (not linearly) but I needed actual evidence of this.

For the record, here’s the actual log format specification:

LogFormat "%h %l %u %t \"%r\" %>s %b %D \"%{Referer}i\" \"%{User-Agent}i\"" puppet
CustomLog /var/log/httpd/puppet.log puppet

And the log output lines start looking like this:

192.0.2.1 - - [18/Mar/2012:04:02:25 +0000] "GET /environment1/file_metadata/modules/logrotate/etc/logrotate.conf HTTP/1.1" 200 333 42741 "-" "-"


In this case, we have a response time of 42741 microseconds. Don’t freak out just yet.

Great, so now I have a few days worth of logs in the new format to work with and figure out what is going on. Initially I thought it would be simple but after a few minutes of scanning the logs I realised I’d want to do some analysis which would point me in the exact direction of the problem. There are some basic elements at play here which I wanted to do the analysis over:

  • Environments: We allow separation of distinct apps, and releases of those apps by using the Puppet environment. So by looking at the environment we can get an idea if one app is causing excessive load, or one specific release of that app.
  • Requestor: Is one host, or a group of hosts hitting the Puppetmaster more than it should?
  • Indirector: Are catalog compiles taking a long time and causing everything else to be slow, is report processing taking the time or are we serving excessive numbers of file resources? I erroneously named this the “resource type” in my analysis script because I was in a RESTful state of mind but technically these are Indirectors in Puppet Land.
  • Resources: Within a given Indirector, is a particular resource taking a long time to generate (e.g. a particularly large or complex catalog, or an extremely large file).

These were what I thought I’d start with. Clearly a generic http log statistics package like AWStats is not going to cater to most of my requirements, but at the other end of the scale something like Puppet Dashboard/Console is a little too specific to events around Puppet runs – I wanted something in between. I’d be surprised if no one else had written something but a few things made me decide to write my own; the challenge, the naive belief it would be quick and easy, fooling myself into believing it was too specific a problem to reuse someone else’s code, and for the sheer interest factor.

Log Analysis and Problem Identification

So I wrote puppet_log_analyse.py. I’m trying to get this open sourced (which is an overall goal for my team) but it has to go through the due process of review first. As soon as this is complete I’ll post back with a link to where it is externally available should it be useful for anyone.

It’s my first bit of Python in quite a while so it’s probably not brilliant code. It underwent several iterations as I wanted more information from the logs, and also as I experimented with different ways of storing and querying the data. I initially was using heapq as I was looking for the worst or best results from many, and heapq maintains relatively inexpensive insert cost while providing a data structure that will be sorted already when you query it. Unfortunately some of the more interesting statistics I wanted to generate proved harder to get out of heapq than I expected so despite taking a performance hit (and probably costing a lot more memory at runtime) I switched to SQLite where I could express myself in familiar SQL, while still storing all the data in memory. Hey, I’ve got 8GB in my laptop so it’s no problem!

I ended up generating a reasonable set of different metrics but the ones that I ended up using the most for the decision on what to optimise came from the following:

Top resource types:
  7129326 GET file_metadata
  285765 GET file_metadatas
  85287 GET catalog
  84108 PUT report
  1770 GET file_content
  15 GET certificate
  5 GET certificate_request
  5 PUT certificate_request
  5 GET certificate_revocation_list

Statistics by resource type:
                                   AVG      STD.DEV    MIN   MAX
 GET catalog                       6951.7ms 5430.7ms 8.175ms 2.9996e+05ms
 GET certificate                   10.722ms 3.6854ms 5.128ms 17.609ms
 GET certificate_request           22.437ms 29.293ms 5.567ms 80.777ms
 GET certificate_revocation_list   19.492ms 29.293ms 3.877ms 78.062ms
 GET file_content                  10.106ms 56.091ms 2.917ms 1457ms
 GET file_metadata                 13.58ms 639.82ms 2.243ms 2.9982e+05ms
 GET file_metadatas                123.61ms 2817.5ms 3.587ms 2.9197e+05ms
 PUT certificate_request           55.047ms 3.0518ms 50.834ms 58.675ms
 PUT report                        2005.5ms 2411.3ms 7.438ms 2.99e+05ms


Yikes, that’s one loaded server. You can see that response times are not great across the board but the maximums are woeful. A lot of time is being spent generating catalogs (which is no surprise) but I was not expecting the sheer number of file_metadata and file_metadatas requests which are a couple of orders of magnitude higher in count than even catalogs (which is exactly the number of runs that have taken place). Taking the average response time of file_metadata requests and their count over these logs we can see that the server spent about 26.88 hours out of what was about 60 hours of logs generating MD5 sums and returning them. Probably for the same files over and over again, which are not changing.

Turning on Caching in Apache

Catalogs can be precompiled and there are various other tweaks that can be made but the first idea that sprang to mind was simply caching the file_metadata responses. Even if Puppet were caching the responses (and it’s not clear to me that it ever did), we are still having our requests go through a whole lot of slow Ruby to get the same result back. Caching closer to the requestor makes sense from a performance point of view, and I was already aware of mod_mem_cache. Rather than spend more time adding a caching layer with a separate product like Varnish why not just add more configuration to Apache which was already in place?

I came up with a reasonable looking configuration and did some basic testing.

        <IfModule mod_mem_cache.c>
                CacheEnable mem /
                CacheDefaultExpire 300
                MCacheSize 1024000
                MCacheMaxObjectCount 10000
                MCacheMinObjectSize 1
                MCacheMaxObjectSize 2048000
                CacheIgnoreNoLastMod On
        </IfModule>


I’ll explain some of these settings later on. What kind of a performance gain can you get by caching a file metadata (which is just an MD5 sum right now, due to another bug)? Obviously this is dependent on file size and CPU hashing speed but let’s take a reasonably example of a 256KB file.

127.0.0.1 - - [01/Apr/2012:01:30:55 -0700] "GET /production/file_metadata/modules/testmodule/testfile HTTP/1.1" 200 342 7853 "-" "curl/7.15.5 (x86_64-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5"


We can see that this request took 7853 microseconds, or just under 8ms to process and return on an otherwise idle machine. That seems pretty fast already, but what happens when we turn on caching?

127.0.0.1 - - [01/Apr/2012:01:34:26 -0700] "GET /production/file_metadata/modules/testmodule/testfile HTTP/1.1" 200 342 6960 "-" "curl/7.15.5 (x86_64-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5"
127.0.0.1 - - [01/Apr/2012:01:34:26 -0700] "GET /production/file_metadata/modules/testmodule/testfile HTTP/1.1" 200 342 226 "-" "curl/7.15.5 (x86_64-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5"


The first request after the restart is predictably about the same speed as before, but the subsequent response is about 30 times faster. That’s a big improvement for our most frequent query.

So we’re done now, right? Well, you could just go ahead and implement this as is given what we’ve seen already but I wanted to be a bit more certain I’d actually see a reasonable improvement. There are also some caveats to the configuration which I’ll list at the end of the post. To gain some more concrete figures of performance improvement I decided to run some performance tests before and after the change using Apache JMeter.

Performance Testing the Solution

I won’t go into depth with JMeter as there is plenty of information about it out there and honestly, it’s a snap to set up and use even if you’ve never used it before. The main problems I hit immediately while testing were as follows:

  • Generating load and serving the application under test on the same machine is a recipe for disaster (which should be obvious). So you will want to have a dedicated load generator and a separate machine for the application (the Puppetmaster in this case).
  • If you are attempting to do distributed load testing (i.e. with dedicated JMeter servers and a separate frontend) you will hit the usual RMI connection problems if there are firewalls/NAT in between your servers and the frontend.
  • Testing the Puppetmaster necessitates using SSL and a client certificate for authentication/authorisation. Currently it seems as though the JMeter SSL Manager will not send client certificate details through to the JMeter servers when doing distributed load testing. In order to work around this I used stunnel directly between the JMeter server and the Puppetmaster but I soon hit performance bottlenecks due to stunnel – it appears to be quite inefficient.
  • In order to use your existing Puppet-generated certificates and private keys you’ll need to package them in a PKCS#12-format file. For the sake of readers the command to do so is openssl pkcs12 -export -out jmeter.p12 -in cert.pem -inkey priv.pem. You’ll need to put a passphrase on the exported file (OpenSSL asks you for it) but the SSL Manager in JMeter handles this when the file is loaded.

So, I installed a barebones X11 environment on my remote load generator machine and ran it all from there. The test scenario was fairly simple:

  • HTTP Request was being made to https://puppetmaster:8140/production/file_metadata/modules/testmodule/testfile with GET.
  • The HTTP Header Manager was used to set 'Accept: yaml'.
  • The Graph Results listener was used to generate a simple graph of the output.
  • I used 100 concurrent threads each making 1000 requests.

Admittedly by only testing a single file in a single module I was not making this an extremely good test but I wanted to verify that caching was going to generate some reasonable savings across many clients, so I assumed responses in the production system would be cached quite quickly and reused for enough clients to make a difference. In this case, the file enters the filesystem cache after the first request so it is being served from memory with or without response caching. As I previously mentioned, the Puppetmaster doesn’t seem to be doing any other form of caching for the file so the diffence in performance should only be between running through the Puppetmaster code and generating an MD5 sum versus returning the cached response.

Here’s what the output from JMeter tended to look like:

JMeter graph results

JMeter graph results

After speaking with some long-time JMeter users, it appears I’m not the only one who thinks that this double-rainbow isn’t all that useful. It gives you a reasonable overview of what is going on early in the run but beyond that if you want some usable graphs it is better to export the results and make your own. In any case, the run summary was my best indicator of results:

Before caching:

  • Throughput: ~20859 requests/minute
  • Average: 276ms
  • Median: 100ms
  • Deviation: 712ms
  • Client load average: ~3.5
  • Server load average: ~55

After enabling caching:

  • Throughput: ~50000 requests/minute
  • Average: 84ms
  • Median: 41ms
  • Deviation: 735ms
  • Client load average: ~11
  • Server load average: ~1.2

It’s not the 30 times faster I’d hoped for but still, double the throughput is a significant win. The vastly decreased response time and obvious improvements in server load also seem to point to a potential big win when this is actually deployed.

Now armed with results I decided to work the configuration changes into our Puppetmaster build (which itself is automated) and do some final smoke checks as QA before rolling out.

Making It Live… Almost

The general plan was to make a new release of our Puppetmaster, do some similar tests and then release to production. I spun up a new Puppetmaster using our automation recipe for it and attempted the same basic requests using curl, but saw fairly inconsistent results. I could often make ten or more requests before seeing a response time in the logs corresponding to a cached result being returned. Even more surprising, with Apache’s debugging turned on, I could verify that it was caching the result but not returning it all the time.

Unfortunately I’d missed this note in the mod_mem_cache documentation:

... In particular, this module's cache is per-process, ...

In my case, the handling of requests had gone to somewhat random child processes and I’d only received a cached response after hitting the same child process again (assuming the cached response is still fresh). Easy to see when testing by hand, not so easy when sending 1000 requests from 100 test clients in JMeter. This does have some ramifications though:

  • Aside from memory requirements of the Puppetmaster app running through Passenger, you effectively have to account for a completely separate cache running in each child process of Apache. If you want to ensure a good hit rate, you may need a reasonable amount of RAM.
  • If you have a very small amount of clients and requests, a lot of Apache child processes and a lot of files, it is probable that you won’t hit the cache much of the time if the cached responses are evenly distributed. There’s not a big downside to this in general as you’ll get a performance increase even with a few hits but you do stand a greater chance of suffering from stale responses which may not make the change worth it.

Making It Live For Real

Here are some statistics from the logs for the same Puppetmaster after the change:

Statistics by resource type:
                                   AVG      STD.DEV    MIN   MAX
 GET catalog                       7053.9ms 3062.9ms 4.559ms 55651ms
 GET certificate                   15.386ms 36.771ms 4.381ms 263.69ms
 GET certificate_request           7.351ms 2.5096ms 5.001ms 15.452ms
 GET certificate_revocation_list   4.4384ms 1.0202ms 3.252ms 7.733ms
 GET file_content                  15.078ms 150.57ms 2.466ms 5207.8ms
 GET file_metadata                 8.4615ms 109.2ms 0.049ms 20517ms
 GET file_metadatas                65.971ms 345.93ms 3.546ms 21223ms
 PUT certificate_request           53.68ms 2.56ms 49.41ms 59.803ms
 PUT report                        434.32ms 1281.1ms 4.569ms 2.3324e+05ms


Across the board, operations haven’t changed a great deal but they are generally more consistent. File_metadata operations have improved by about 60% which is a pretty big change considering the number of these operations the Puppetmaster is processing. In terms of what actual effect this had on load, we went from seeing about 1 runnable process per core constantly to about 0.5.

And finally…

This was an interesting exercise, especially since I’d had caching and Puppet on the radar for quite some time now (just look at this page by Masterzen) but not actually had a chance to need it or investigate it. The work paid off and will let our system scale up a whole lot more. I haven’t been doing a great deal of Puppet work recently (hey, it’s nice to branch out and do other things) but I like digging into the internals when I can to find a good solution to an irritating problem.

Fortunately or unfortunately there’s always a plethora of work to be done on infrastructure issues and improvements so I’ll put in a small plug for my company. If this post or anything I’ve written in the past piques your interest, I need more people on my team! http://devops.nokia.com/ is the place to go to get on board :)

Notes

  • I initially started testing in our internal, private cloud but when I hit performance issues with JMeter testing I moved my testing to a couple of instances in HP’s cloud which I have beta access to. Performance was much better there but I realised the critical bottleneck was stunnel as I’ve already mentioned above. Once I had confirmed numbers, I moved the testing back to our internal cloud and saw equivalent numbers except for OpenVZ containers where it was almost double the throughput in some cases. This corresponds to previous examples we’ve seen compared with KVM (which we also support internally, and is what HP’s cloud runs on). If you run your Puppetmasters virtualised, consider OpenVZ or Virtuozzo.
  • In our mod_mem_cache configuration I set a fairly low timeout (300 seconds) and I’ve chosen to tell Apache to ignore the lack of ETags or last-modified headers (CacheIgnoreNoLastMod On) in responses that would usually assist it in stale response expiry. Unfortunately Puppet doesn’t yet send back any kind of metadata like this for responses so we can’t nicely check the freshness of responses, however for file_metadata responses like this arguably it doesn’t make a lot of sense anyway.
  • Unlike some other configuration management systems that rely on proprietary protocols between client and server (CFengine comes to mind, although I haven’t looked at it in a while), Puppet relies on REST between client and server which means a lot of the standard methodologies for improving web performance are applicable. This is really nice to fall back on, as you can see from what I’ve written.

Tags: , , , ,

Saturday, April 7th, 2012 Tech 3 Comments

On small revelations

by Oliver on Sunday, March 25th, 2012.

I had a very brief moment of inspiration this weekend when a realisation occurred to me. It’s nothing amazing, and it has probably already been mentioned many (dozens of, thousands of?) times before but I’ll say it anyway. My fundamental approach to work has changed due to the “cloud”, or rather, easily provisionable and destroyable virtual machines. This may be pretty obvious but it’s worth stating anyway.

A bunch of years ago now I had to bring up some new systems for doing border routing at my previous employer. I had had no experience with Quagga at the time and I spent a decent week setting up a small test environment with something like five or six cheap (but physical) machines to represent our coming redundant border routers, a couple of direct upstreams and whatever remained as route injectors representing the Internet. Once it was running I was extremely careful not to pollute the environment so as to ensure that the deployment was correct. I’m not even sure I wiped everything and started from scratch so as to QA the configurations.

This week I’ve been working on some Puppet tuning work involving mod_mem_cache in Apache. It has been all too easy to bring up a couple of reasonably representative CentOS 5.6 machines in either our own private cloud at work, or in one of a couple of public clouds I have access to. If things go pear-shaped I can just blow it away and start again, which is definitely not the kind of behaviour I have when making modifications on my home router/fileserver/webserver or even my US VPS system. One tends to be far more careful and perhaps even willingly create a lot more technical debt working with those machines.

In my current workflow, I’m all too happy to settle on a configuration that looks somewhat reasonable, record it in my configuration management system, reprovision everything again and test it. I actually discovered some fairly serious caveats in what I was working on and even now may have stumbled on an Apache bug (or undocumented feature) due to willingly blowing it all away and testing again. I could have done the same locally using Vagrant but I needed the performance of two larger instances in this case.

The behaviour is valuable. You know that what you are doing is of limited lifetime, so you work to make sure you can reliably arrive at the same configuration as many times as you want. This also has a side-effect of ensuring you only create a minimal configuration so as to save time and effort. The most powerful thing is that I have a level of confidence in the results that I can’t say I’ve ever had in the past.

Tags: ,

Sunday, March 25th, 2012 Tech No Comments

Separation and reunification

by Oliver on Wednesday, March 14th, 2012.

If you know me from my professional life you may think of me as a techy, obsessive-compulsive, slightly (or aggressively) arrogant person driven to making some cool large systems work. But I’m also a family man, and for the last two months I’ve been separated from two-thirds of that family. During the last vacation back to Australia, I came back at the start of January while my wife and son stayed on for a duration we didn’t have concrete plans on at the time.

It ended up being two months almost to the day, and while I kept myself busy with work in general as well as trips to the US for work and Brussels for FOSDEM, I was completely unprepared for the emotional upheaval that hit me when they returned. Needless to say, biological mothers are hit with a barrage of various chemical rebalancing acts which, combined with the addition of a newborn, add up to be a formidable force in the first few weeks and months of the baby’s life. As a man, I don’t get the same effects (although sympathetic effects can’t be ruled out) but despite society’s expectations on males and my own generally robotic appearance day to day, I found I was hit with a wall of emotion in those first few days as well.

It was an undefinable experience. I couldn’t analyse it or understand it, but I knew it had me in its grasp. The responsibility of taking care of a fragile life form, being the only force in the universe that can keep this thing alive; but that again puts it into too tangible forms. As much as you can have these things going through your head and know that you are completely capable of it, the emotion still hits you.

And now that they are back again I found myself in the same place. Not sure I could do it, especially with a child that had aged and matured in just the two months he’d been away. New behaviours, slightly more grown-up features. Just back from the airport, I was going to duck out briefly to grab some take-away food since we had almost nothing in the house and the little guy burst into tears just as I was walking out the door. That’s putting it mildly – he just lost it completely. I’d be lying if I said it didn’t break my heart. I think both of us missed each other more than we realised.

Now he has some kind of flu which has spread to the rest of us. I’ve been off work since last week and taking care of a snot-dribbling, but still amazingly energetic two-year-old is draining to say the least. Me and the little guy have been sleeping on the sofa-bed to give Mama a chance to rest after the draining flights from Australia, which means not much sleep for either of the boys unfortunately. It’s not a great start to the year (now that we’re all together, this is our year beginning effectively) but we still have hopes it will be better than the last one. Kai will be going to Kita, Ange will have some freedom and do whatever she wants after two years of full-time caregiving and I’ll be my usual effervescent self with any luck.

Tags: , ,

Wednesday, March 14th, 2012 Thoughts No Comments

WAT

by Oliver on Thursday, February 2nd, 2012.

I’ve been spreading this excellent talk by Gary Bernhardt around my co-workers and friends who universally love it. I’m also proud to say the “WAT” meme has (hopefully permanently) entered my team’s culture as we find it adequately sums up our feelings towards various bits of software that we either have to work with or maintain.

Half my team is away on business trips or sick leave at the moment so it was a relatively quiet day, which the rest of us spent squashing bugs. Unfortunately for us, most of our codebase is in Ruby and we have started cultivating a reasonable collection of Javascript in some of our web-oriented interfaces so you can imagine that Gary’s talk was particularly poignant.

A co-worker had submitted a review request this afternoon (yes, code reviews are awesome) and I was talking a brief look through it as I was not that familiar with this piece of code. He’d admirably refactored some quite nasty stuff but I was a bit perplexed as to some of the logic. An example might be similar to this:

def check_vm_state(vm, vmstate)
  if vmstate[vm][:host]
    return vmstate[vm][:host]
  end
  raise InvalidHypervisorError
end

OK, it wasn’t nearly as bad as this but you get the point. We’re meant to pass in some kind of hash which contains status information, pull out something relevant to the VM in question and return it; if not, throw an error. What threw me was the test for this which was clearly just some fudged parameters but I couldn’t figure out what was going on:

assert_equals some_valid_value check_vm_state(1,2)

Again, for your sanity’s sake it was a little more than this. I wondered how this could possibly work, and my co-worker did the same. “But clearly the tests are passing!” someone exclaimed. Let’s take a look:

1.8.7 :001 > 2[1]
 => 1
1.8.7 :002 > 2[1][:host]
 => 0

WAT.

Apparently the [] operator on a Fixnum will retrieve the binary digit value at that index. OK, not too unreasonable. But what about the index off the :host symbol?!? Well, easy – it is cast to its unique identifier which is an integer value – in this case 16473 but this changes every time you run Ruby. This bit index clearly can’t exist for the number 2 so the return value is 0. WAT.

Here’s another cool one. We had some awful code that was a bunch of if statements testing for equality to various things – which in most instances you would just replace with a case/switch statement and be done with it. So we tried, and failed (initially):

> foo = 'hello'
 => "hello"
> case foo
>   when 'hello'
>   puts 'hello'
>   when 'goodbye'
>   puts 'goodbye'
>   else
>     puts 'something else'
>   end
hello
 => nil

Absolutely no surprises there. Except in this case we’re more interested in the type of the object we are dealing with since our particular piece of code is making use of Ruby’s duck-typing to do some smart manipulation of various objects in similar ways. So now:

> foo.class
 => String
> case foo.class
>   when String
>   puts 'string'
>   when Fixnum
>   puts 'fixnum'
>   else
>     puts 'something else'
>   end
something else
 => nil

Er… WAT? Obviously we failed to take into account that the case statement calls the === method on the operand in each when statement and it behaves completely differently depending on whether it is used with an Object or a Module (which Class inherits from as do String, Fixnum etc). For Module it will only return true if the thing being compared to is an instance or descendent of the thing being operated on, whereas Objects just compare equality (and not identity).

I’m sure this behaviour is also overridden in other types to “make sense” under the circumstances, but unfortunately just serves to confuse by its inconsistency. I realise this represents something like a raised middle finger to dyed-in-the-wool Rubyists but it really isn’t following any principle of least surprise that I know about.

Anyway, after all of us said WAT about nine-thousand times I came to the conclusion that duck-typing is only cool if not every single basic type in the language responds to most method calls, usually in completely different and unexpected ways. If it swims like a duck, looks like a duck and quacks like a duck, it could be a duck. Or some kind of shapeshifting, organism impersonating cyborg warrior from the future intent on the destruction of our minds and all we hold dear to us.

On a slightly less inflammatory note, Ruby is actually still quite nice to use once you know all it’s little quirks. I went to my first Clojure meetup in Berlin tonight and was introduced to some pretty awesome concepts. I’m not sure I’m completely sold but it may actually be time to broaden my horizons somewhat but I guess I’m stuck with Ruby for a while yet ;)

Tags: ,

Thursday, February 2nd, 2012 Tech No Comments

More talks

by Oliver on Thursday, December 8th, 2011.

I gave yet another permutation of my talk about DevOps in Big Enterprise tonight at the Berlin DevOps meetup. I last gave it as a 45 minute presentation at the VelocityConf BoF and prior to that as a 5 minute Ignite talk at DevOpsDays Göteborg. This time I’m actually uploading the slides (although they are fairly minimalistic).

Enjoy!

Tags: ,

Thursday, December 8th, 2011 Tech No Comments

Exceptional circumstances

by Oliver on Thursday, November 24th, 2011.

I’m still building up to my article on how to properly mock calls to Rake’s sh to facilitate testing of your Rakefile tasks but I haven’t quite worked up the strength to trace the calls through the libraries and into Eigenclass wonderland. For the moment, I’ve got just enough fodder to write a bit about determinism in testing.

Since Friday, both a coworker and I have been hit by the same testing failure. Not the exact same one mind you as we are working on two different pieces of software but the same failure in that the order of our tests matter. Anyone experienced in TDD will tell you that you need to make sure your tests don’t depend on the order in which they run so that your units of testing don’t build upon previous assumptions and don’t fully test each unit in isolation. Easier said than done unfortunately.

It would be nice if we could test the cartesian product of all of our units in all possible orders to rule out any unplanned dependency but it is simply not feasible for anything but the smallest programs. In any case, our small programs managed to cause a small bit of hell just fine, thankyou! Enough babbling and onto the code:


desc 'Run tests'
RSpec::Core::RakeTask.new('test') do |t|
  t.pattern = 'spec/*_spec.rb'
end

This is a fairly standard task that sets up RSpec tests using RSpec’s Rake helpers. Not much to see here – except for the fact that my two test files run in opposite orders on different machines. I managed to rule out different Ruby versions (exactly the same, installed in both places by RVM), Gem versions (same of Rake, RSpec and all relevant dependencies), even the locale on both machines was identical to rule out sorting order differences in the filenames.

RSpec uses RSpec::Core::RakeTask#pattern to assemble a FileList with the pattern you have set. FileList (defined by Rake) basically uses Dir.glob to get its dirty work done:


# Add matching glob patterns.
    def add_matching(pattern)
      Dir[pattern].each do |fn|
        self << fn unless exclude?(fn)
      end
    end

Dir[] just aliases Dir.glob, and within the source of that you can find the following:


dp = readdir(dirp->dir);

readdir(3) just returns the directory entries in the order they are linked together, which is also not related to inode numbering but as best as I can tell is from outer leaf inwards (since the most recently created file is listed first).

Now I have some idea of why the testing order can be different, but I'm no closer to the cause of the problem - my tests succeed when run in one order and not in the opposite order. The errors look something like this:


  1) Nagios#new raises an exception when the command file is missing
     Failure/Error: expect { Nagios.new(@cmd_file, @status_file) }.to raise_error(NagiosFileError, /not found/)
       expected NagiosFileError with message matching /not found/, got #
     # ./spec/nagios_spec.rb:26

Hmm, a bare Exception object with no information about where it came from. I had a few suspicions:

  • I have no idea how to code Ruby
  • I changed the code in some subtle way and broke the tests legitimately
  • Some Mock object's lifetime is longer than expected and sticking around

This last idea seemed most plausible since I was able to put some debugging code into the constructor of the object and it was not being called at all. Mocha usefully has an unstub method which allows you to remove stubs on an object/class you had previously set up and return it to its previous state, but this seemed to be a no-go:


     Failure/Error: Nagios.unstub(:new)
       The method `new` was not stubbed or was already unstubbed

I installed the very useful ruby-debug and invoked that just before the failing tests started and did some poking around but wasn't able to find much (although that's probably more to do with lack of skill than the debugger lacking functionality). How to proceed? Unfortunately the easiest way forward seemed to be the brute force method - comment out tests in the preceding file until it starts working then narrow the field. Fortunately the problem uncovered itself immediately:


  # Change the existing mock to be something we can pick up in output
  before(:each) do
    Nagios.stubs(:new).returns('surprise !')
  end

Then in the test output:

 NoMethodError:
       undefined method `cmd_file' for "surprise !":String
     # ./spec/nagios_spec.rb:55

Hmm, that's no good. The stubbing is surviving between test files. TL;DR - It's good to read the documentation! RSpec supports several mocking frameworks - FlexMock, Mocha, RR, bring your own framework and of course, RSpec itself. I already knew of this before but since I've been working mostly with Test::Unit up until recently it didn't register in my brain whatsoever and I just reached for Mocha out of habit.

It turns out that if you use a different mocking framework but don't tell RSpec about it, weird stuff happens - that is to say, the kind of behaviour you see above. You add a small block to your test code something like this:


RSpec.configure do |config|
  config.mock_framework = :mocha
end

Once I had confirmed this was the actual problem and my tests were passing, I decided to rewrite all my mocking code to use RSpec mocking anyway - less gems, easier to maintain.

So the main takeaways from this experience are:

  • Read the documentation!
  • Ruby is powerful and frequently easy to use, but can still not be easy to interrogate when there are problems happening that don't seem to make sense.
  • Having source code on hand to see what your software is really doing is an awesome thing. Admittedly, the metaprogramming behind testing and mocking frameworks can be some mind-bending stuff, but the code is there when you want to dive in.
  • Testing can be frustrating, but awesome.
  • Ordering is not always the problem. Try to get your tests order-independent, but just like code coverage it can be a large time sink and the law of diminishing returns applies.

Tags: , , , , , , ,

Thursday, November 24th, 2011 Tech 1 Comment

Making a ‘complete’ service

by Oliver on Friday, November 18th, 2011.

One of the things that is pushed here a lot is the necessity to make all new applications really production-ready before being deployed as a service. There is of course the adage “If it’s not monitored, it doesn’t exist” but beyond that there are a lot of other little fiddly details just to make something run. I’m talking about things like init scripts, logging etc.

It’s very easy to get wrapped up writing code for some new idea you have or perhaps extending some existing codebase, and with most frameworks it is relatively easy to run your app from a terminal on your local workstation. What changes when you want to take it to a real server is generally (and I’m limiting this post to these topics):

  • packaging
  • service lifetime management
  • monitoring
  • logging

This is not an exhaustive list. When making these additions to my Sinagios app I felt more pain than expected, and suddenly it made sense why we tend to see a lot of applications that fall down so much in these areas – it’s really boring and stupid work. Seriously. It’s not fun, and so much of it could be automated away but sadly we don’t live in that kind of a world.

Packaging is largely a non-issue thanks to such tools as FPM (although it has some minor issues still), and it is all too simple to add some basic monitoring to a Sinatra app:


# Health check for monitoring systems
get '/v1/health/?' do
  # Just try to verify the command and status files look ok, rescue the
  # exception and allow the message to propagate to the output with a
  # reasonable error code.
  begin
    nagios = Nagios.new
  rescue Exception => detail
    body detail.message
    status 500
  else
    body 'OK'
    status 200
  end
end

The next pain point was generating the init script. I’d really like to see a somewhat more specific framework than the one currently in place – shell script is ubiquitous but sometimes it is just annoying. Sadly most apps still don’t fit into the generic functions available in /etc/init.d/functions and I found it took an hour or so to craft the file just right to get my app launching nicely with rackup. There should be some convention over placement and naming of entities like PID files, process names, what to do with standard in/out/error, run users, log/run/work directories etc etc. It could be so much easier, but it’s a problem for smarter people than I.

Finally I hit logging and this was a really irritating one, not least because the Rubyforge website was down all day and I couldn’t get to the documentation. In daemon mode, rackup will still output errors to STDERR so you have the option of either redirecting that at invocation time or in the config.ru file (and possibly later as well, I didn’t bother to try). Standard output (which in interactive mode will accept the access logging) ceases to offer anything useful as Rack wraps up logging in its CommonLogger class.

There are a multitude of different ways to skin this cat but the simplest seems to be telling Rack (through Sinatra) which logging “middleware” to use, in our case just a simple instance of Logger (effectively just appending to a file, but it can also add timestamps, priorities, and take care of rotation). Sadly the basic example fails still:


logger = Logger.new('/var/log/sinagios/access.log')
use Rack::CommonLogger, logger

This leads to:

!! Unexpected error while processing request: undefined method `write' for #<Logger:0x7f6f28d033d0>

But only if you are running non-daemonized!

That’s right – if you add this code to your app it will just break it and not leave a trace in any logs anywhere. You will have to run it interactively again to figure out the problem, although fortunately it is staring us right in the face. Rack::CommonLogger expects the logger interface to support the write method, which Logger does not at the moment. Let’s just alias it to the append operator:


Logger.class_eval { alias :write :'<<' }

This will have Logger just append the log entries without any additional formatting or timestamps as Rack::CommonLogger already adds these for us. I also added logrotate fragments to ensure the logs don’t consume the entire disk (although connecting to Scribe would have been better, but is a project for another day). This work has been somewhat enlightening but not exactly thrilling – I can see why it is such a chore for developers and frequently left out altogether. Now it is done though, I can at least reuse it but it would be nice to see some better system frameworks for these common and boring tasks for getting the real stuff running in production.

Tags: , , , , , ,

Friday, November 18th, 2011 Tech No Comments

Sinagios – a very simple RESTful API to Nagios downtime

by Oliver on Monday, November 14th, 2011.

Last week I spent a bit of time scratching an itch I’ve had with doing maintenance. We use Nagios for our monitoring (without any addons, plugins or frontends) and scheduling downtime is a pain. Not only that, but deleting downtime is impossible short of screenscraping or (shock) actually clicking in the user interface. When I’m done doing maintenance I want to delete all the downtime (regardless of if the window will naturally come to an end soon or not) because I want notifications to start sending immediately if there is a real problem.

Naturally I did some diligence beforehand as this is hardly a topic that hasn’t come up before, and sure enough some guy has implemented something similar but in Python. Nevertheless I wanted to do some coding so I present to you Sinagios.

It’s a pretty simple Sinatra app (and thus obviously Ruby), in fact it only does three things:

  • Lists currently scheduled downtime
  • Schedules new downtime
  • Deletes downtime

I also got to dive into a bit of RSpec and Rack-test since I’ve mostly gotten away with Test::Unit before and non-web-facing systems so it was an interesting change. Fairly soon I’ll be uploading a spec file for making RPMs and adding some Puppet recipes to deploy it so that it’s a more complete set of stuff.

Hopefully someone will find it useful!

Tags: , , ,

Monday, November 14th, 2011 Tech 1 Comment