Archive for January, 2015

Golang, testing and HTTP router package internals

by Oliver on Sunday, January 18th, 2015.

We have an internal service that takes requests of the form /foo/{ID}/bar, which is expected to generate data of the form “bar” about the entity {ID} within the collection “foo”. Real clear! Because this service has a bunch of similar routes with a resource identifier as part of the request path, we use the Gorilla Mux package. There are many different approaches to HTTP request routing and many different packages (and some approaches recommending you use no external packages at all) – but that is out of scope for this article!

This week we realised that for a given endpoint, one other service that calls ours was URL-encoding the resource ID. There are two parts to this that need to be stated:

  • IDs are meant to be opaque identifiers like id:12345 – so they have at least one colon in them.
  • A colleague later pointed out that the encoding of path parts is actually up to the application to define (although I haven’t recently read the part of the spec that says this). URL-encoding is thus not strictly necessary except for query parameters, but in this case we have something arriving encoded and need to deal with it.

The fix is relatively simple. We already had some code pulling out the ID:

id := mux.Vars(request)["id"]


However, this was just passing the ID to another part of the code that validated that the ID is properly formed, with a regular expression. It was expecting, again, something like id:12345 and not id%3A12345. So we introduce a very simple change:

    encodedID := mux.Vars(request)["id"]
    id, err := url.QueryUnescape(encodedID)
    if err != nil {
        return err
    }


OK, this will work, but we should test it. We’ve introduced two new “happy paths” (a request whose ID is not encoded, where the decoded version is exactly the same; and a request whose ID is encoded, where the decoded version is what later validation expects) and one new “unhappy path”, where the ID is malformed and doesn’t decode properly. The problem is that we need to test this function by passing in a request that has an appropriate ID for the path we are trying to exercise.

Already, this sets us down the road to the yak shaving shop. We can assemble a request ourselves, and attempt to set up all of the right internal variables necessary to be able to pull them out again inside our own function in the way it expects. A slightly harder way would be to set up a test server with httptest, set up the routing with mux and make a real request through them, which will ensure the request that makes it to our function under test is as real as can be. However we are now also effectively testing the request handling and routing as well – not the minimum code surface area.

As it turns out, neither of these options are particularly good. I’ll start with the latter. You start up a test server and assemble a request like so (some parts have been deliberately simplified):

testID := "id%XX12345" // deliberately malform the encoded part
ts, _ := httptest.NewServer(...)
reqURL := fmt.Sprintf("%s/foo/%s/bar", ts.URL, testID)
resp, err := http.Get(reqURL)


Uh-oh: the malformed encoded characters will be caught by http.Get() before the request is even sent by the client, and the same goes for http.NewRequest(). We can’t test like this – but what if we assemble the request ourselves?

serverAddr := strings.Split(ts.URL, "http://")[1]
req := &http.Request{
	Method: "GET",
	URL: &url.URL{
		Host:   serverAddr,
		Scheme: "http",
		Opaque: "/foo/id%XX12345/bar",
	},
}
resp, err := http.DefaultClient.Do(req)


We can send this request and it will make it to the server, but now the request handling on the server side will parse the path and catch the error there – it still won’t make it to the function under test. We could write our own test server that has its own request handling wrapped around a TCP connection, but that’s far more work. We also have to determine whether the request has succeeded or failed via the test server’s response codes (and possibly response body text) which is really not ideal.

So, onto testing with a “fake” request. Looking back at our code, we notice that we are pulling the ID out of the request via mux.Vars(request)["id"]. When you don’t use any request routing package like this, basically all request path and query parameter variables are accessible directly on the request object, but my suspicion was that mux.Vars didn’t just simply wrap around data already in the request object but that it stored it elsewhere in a different way. Looking at the mux code, it actually uses a data structure defined in the context package, a very gnarly nesting of maps. The outer level keys off the request pointer, and each unique request pointer will have a map containing different “context keys” – either varsKey or routeKey depending on where the parameters are coming from (but I’ll not dive into that in this article).

The part of the request we are interested in is grouped under varsKey, which is the first iota constant, so its value is 0. We can use context.Set() to set the appropriate data we want to fake out in our request, with a relatively bizarre invocation:

type contextKey int

const (
	varsKey contextKey = iota
	routeKey
)

context.Set(request, varsKey, map[string]string{"id": "id%XX12345"})


This appeared to be enough to work, but inevitably the test failed because mux.Vars(request)["id"] returned an empty string. I added some debugging Printfs to the mux and context packages to print out what was being set and what was being accessed, and in every case what was created looked like it should have been correct:

map[0x10428000:map[0:map[id:id%XX12345]]]


The request pointer keying into the top-level map was the same in both cases, but the map of parameter names to values was only there when setting it in the test – what mux.Vars() was accessing simply didn’t have them.

The problem is of course simple. The mux package is keying into the second-level map with a variable of value 0 but type mux.contextKey. I was attempting to fool it by keying into the map with a main.contextKey of the same value. The only reason this worked at all was due to the inner map of data being map[interface{}]interface{} – effectively untyped – the two zero-valued variables of different types (and even then, only different by virtue of being in different packages) did not collide and hence there was no way to get out the value I had previously set.
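This non-collision is easy to reproduce in isolation (a standalone sketch, not the actual mux/context internals – the two key types here stand in for mux.contextKey and main.contextKey):

```go
package main

import "fmt"

type keyA int // stands in for mux's unexported contextKey
type keyB int // stands in for an identical-looking type in our test package

func main() {
	// The inner map in the context package is effectively untyped.
	m := map[interface{}]interface{}{}
	m[keyA(0)] = "set via keyA"

	// Same underlying value, different type: this is a distinct map key,
	// so the lookup misses entirely.
	v, ok := m[keyB(0)]
	fmt.Println(v, ok)      // <nil> false
	fmt.Println(m[keyA(0)]) // set via keyA
}
```

Interface values compare equal only when both their dynamic types and values match, which is exactly why keying with your own type can never fool a package that keys with its own unexported one.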

Since mux.contextKey is not exported, there is actually no way to fake out that request data (well, I’m sure it can be done with some reflect package magic, but that’s definitely indicative of code smell). The end result was that this small code change is untestable in the unhappy path. I’m still relatively sure nothing unexpected will happen at runtime since the request handling above the function will catch malformed encodings, and some alternatives do exist, such as doing this kind of decoding in its own handler wrapping around what we already have set up, or not using the mux package in the first place and simplifying our request routes.
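To illustrate the last alternative – dropping the mux package for this route shape – the extraction and decoding can live in a plain function over the path string, which makes even the unhappy path trivially testable. This is only a sketch under the assumption of a single /foo/{ID}/bar route, not our production code:

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

var errNoMatch = errors.New("path does not match /foo/{ID}/bar")

// parseFooBarID extracts and decodes the {ID} segment of /foo/{ID}/bar.
// Because it takes a plain string, every case – including a malformed
// encoding – can be unit-tested without faking any request internals.
func parseFooBarID(path string) (string, error) {
	parts := strings.Split(path, "/")
	if len(parts) != 4 || parts[0] != "" || parts[1] != "foo" || parts[3] != "bar" {
		return "", errNoMatch
	}
	return url.QueryUnescape(parts[2])
}

func main() {
	fmt.Println(parseFooBarID("/foo/id%3A12345/bar")) // id:12345 <nil>
	fmt.Println(parseFooBarID("/foo/id%XX12345/bar")) // "", error
}
```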

It is, yet again, a great example of why sometimes the simplest changes can take the most time, discussion, agonising over testing methodologies and the greatest personal sacrifice of software development values. The only reason I didn’t spend longer on it (and I definitely could have) was that it was blocking other teams from progressing on their own work (beyond the obvious wastage of my own productive time).



AWS AutoScaling group size metrics (or lack thereof)

by Oliver on Saturday, January 17th, 2015.

One of the notably lacking metrics from CloudWatch has been the current and previous AutoScaling group sizes – in other words, how many nodes are in the cluster. I’ve worked around this by using the regular EC2 APIs, querying the current cluster size and the desired size and logging this to Graphite. However, it only gives you the current values – not anything in the past, which regular CloudWatch metrics do (up to 2 weeks in the past).

My colleague Sean came up with a nice workaround – using the SampleCount statistic of the CPUUtilization metric within a given AutoScaling group namespace. Here’s an example, using the AWS CLI:


$ aws cloudwatch get-metric-statistics --dimensions Name=AutoScalingGroupName,Value=XXXXXXXXProdCluster1-XXXXXXXX --metric CPUUtilization --namespace AWS/EC2 --period 60 --statistics SampleCount --start-time 2015-01-17T00:00:00 --end-time 2015-01-17T00:05:00
{
    "Datapoints": [
        {
            "SampleCount": 69.0,
            "Timestamp": "2015-01-17T00:00:00Z",
            "Unit": "Percent"
        },
        {
            "SampleCount": 69.0,
            "Timestamp": "2015-01-17T00:01:00Z",
            "Unit": "Percent"
        },
        {
            "SampleCount": 69.0,
            "Timestamp": "2015-01-17T00:03:00Z",
            "Unit": "Percent"
        },
        {
            "SampleCount": 69.0,
            "Timestamp": "2015-01-17T00:02:00Z",
            "Unit": "Percent"
        },
        {
            "SampleCount": 67.0,
            "Timestamp": "2015-01-17T00:04:00Z",
            "Unit": "Percent"
        }
    ],
    "Label": "CPUUtilization"
}

Some things to note:

  • Ignore the units – it’s not a percentage!
  • You will need to adjust your --period parameter to match the metric sampling period on the EC2 instances in the AutoScaling group – with regular monitoring enabled this will be one sample per 5 minutes (300 seconds); with detailed monitoring enabled it will be one sample per minute (60 seconds).
  • The last point also means that if you want to gather less frequent data points for historical data, you’ll need to do some division – e.g. using --period 3600 will require you to divide the resulting sample count by 12 (regular monitoring) or 60 (detailed monitoring) before you store it.
  • Going via CloudWatch in this way means you can see your cluster size history for the last two weeks, just like any other CloudWatch metric!
  • Unfortunately you will lose your desired cluster size metric, which is not captured. In practice I haven’t really required both desired and actual cluster size metrics.
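The division from the notes above can be captured in a small helper. A minimal sketch – the function and parameter names are my own, not anything from an AWS SDK:

```go
package main

import "fmt"

// clusterSize converts a CloudWatch SampleCount into an instance count.
// periodSeconds is the --period passed to get-metric-statistics;
// sampleIntervalSeconds is 300 for regular monitoring or 60 for detailed
// monitoring, since each instance contributes one sample per interval.
func clusterSize(sampleCount float64, periodSeconds, sampleIntervalSeconds int) float64 {
	samplesPerInstance := float64(periodSeconds / sampleIntervalSeconds)
	return sampleCount / samplesPerInstance
}

func main() {
	// 69 samples over a 60-second period with detailed monitoring
	// means 69 instances, matching the example output above.
	fmt.Println(clusterSize(69, 60, 60)) // 69

	// Over an hourly period, each detailed-monitoring instance
	// contributes ~60 samples, so divide accordingly.
	fmt.Println(clusterSize(4140, 3600, 60)) // 69
}
```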

We’ll start using this almost immediately, as we can remove one crufty metric collection script in the process. Hope it also helps some of you out there in AWS land!

