Pre-warming Memcache for fun and profit

by Oliver on Wednesday, August 12th, 2015.

One of the services my team runs in AWS makes good use of Memcached (via the ElastiCache product). I say “good” use as we manage to achieve a hit rate of something like 98% most of the time, although I now realise that it comes at a significant cost – when this cache is removed, the application takes a heavy hit. Unlike other applications that traditionally cache the results of MySQL queries, this particular application stores GOB-encoded binary metadata, but what the application does is outside the scope of this post. When the cached entries aren’t there, the application has to do a reasonable amount of work to regenerate them and store them back.

Recently I observed that when one of our ElastiCache nodes is restarted (which can happen for maintenance, or due to system failure) the application already takes an undesirable hit. We could minimise this impact by having more nodes in the cluster with less capacity each – for the same overall cluster capacity. Thus, going from say 3 nodes, where losing one costs us 33% of our cache capacity, to 8 nodes, where losing one costs us 12.5%, is a far better situation. I also realised we could upgrade to the latest generation of cache nodes, which sweetens the deal.

The problem that arises is: how can I cycle out the ElastiCache cluster with minimal impact to the application and user experience? To cut a long story short, there’s no way to change individual nodes in a cluster to a different type, and if you maintain your configuration in CloudFormation and change the instance type there, you’ll destroy the entire cluster and recreate it – losing your cache in the process (in fact you’ll be without any cache for a short period of time). I decided to create a new CloudFormation stack altogether, pre-warm the cache and bring it into operation gently.

How can you pre-warm the cache? Ideally, you could dump the entire contents and simply insert it into the new cluster (much like MySQL dumps or backups), but with Memcached this is impossible. There is the stats cachedump command to Memcached, which is capable of dumping out the first 2MB of keys of a given slab. If you’re not aware of how Memcached stores its data, it breaks the memory allocation into various “slabs” of increasing sizes and stores values in the closest-sized slab that will fit it (although always rounding up). Thus, internally the data is segmented. You can list stats for all of the current slabs with stats slabs, then perform a dump of the keys with stats cachedump {slab} {limit}.
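To make the slab idea a little more concrete, here is a small Python sketch of how chunk sizes grow and how an item ends up in the smallest slab class that fits it. The 96-byte starting size and the 1.25 growth factor are assumptions based on common Memcached 1.4 defaults on 64-bit builds (your build and startup flags may differ), and the function names are made up for illustration.

```python
def slab_chunk_sizes(first=96, factor=1.25, max_size=1024 * 1024, align=8):
    """Return the chunk size of each slab class, smallest first.

    All parameters are assumed defaults, not values read from a live server.
    """
    sizes = []
    size = first
    while size <= max_size / factor:
        # Chunk sizes are rounded up to the alignment boundary.
        if size % align:
            size += align - (size % align)
        sizes.append(size)
        size = int(size * factor)
    # The last class always holds the largest allowed item.
    sizes.append(max_size)
    return sizes


def slab_for(item_size, sizes):
    """Return (slab class number, chunk size) for an item of item_size bytes.

    item_size means the whole stored item (key, value and header overhead).
    """
    for number, chunk in enumerate(sizes, start=1):
        if item_size <= chunk:
            return number, chunk
    raise ValueError("item larger than the biggest slab class")


if __name__ == "__main__":
    sizes = slab_chunk_sizes()
    print(sizes[:6])
    print(slab_for(100, sizes))
```

A 100-byte item does not fit the first class, so it is rounded up into the next one and the difference is wasted space within that chunk.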

There are a couple of problems with this. One is the aforementioned 2MB limit on the returned data, which in my case did in fact limit how useful this approach was. Some slabs had several hundred thousand objects and I was able to retrieve nowhere near the whole keyspace. Secondly, the developer community around Memcached is opposed to the continued lifetime of this command, and it may be removed in future (perhaps it already has been, I’m not sure, but at least it still exists in 1.4.14 which I’m using) – I’m sure they have good reasons for it. I was also concerned that using the command would lock internal data structures and cause operational issues for the application accessing the server.

You can see the not-so-reassuring function comment here describing the locking characteristics of this operation. Sure enough, the critical section is properly locked with pthread_mutex_lock on the LRU lock for the slab, which I assumed meant that only cache evictions would be affected by taking this lock. Based on some tests (and common sense) I suspect that it is an LRU lock in name only, and more generally locks the data structure in the case of writes (although it does record cache access stats somewhere as well, perhaps in another structure). In any case as mentioned before, I was able to retrieve only a small amount of the total keyspace from my cluster, so as well as being a dangerous exercise, using the stats cachedump command was not useful for my original purpose.

Later in the day I decided to instead retrieve the Elastic LoadBalancer logs from the last few days, run awk over them to extract the request path (for some requests that would trigger a cache fill) and simply make the same requests to the new cluster. This is more effort up-front since the ELB logs can be quite large, and unfortunately are not compressed, but fortunately awk is very fast. The second part to this approach (or any for that matter) is using Vegeta to “attack” your new cluster of machines, replaying the previous requests that you’ve pulled from the ELB logs.
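If you would rather not do the extraction in awk, the same step can be sketched in Python. This assumes the classic ELB access log layout in which the request appears as a quoted "METHOD URL HTTP/x.y" field, and it writes lines in the plain "GET url" form that Vegeta accepts as a targets file. The function name, the path prefix filter and the new host name are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Matches the quoted request field, e.g. "GET http://host:80/path HTTP/1.1"
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<url>\S+) HTTP/[\d.]+"')


def vegeta_targets(log_lines, new_base, path_prefix="/"):
    """Yield one Vegeta target line per distinct GET path found in the logs.

    new_base is the scheme and host of the new cluster (an assumption here).
    """
    seen = set()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match or match.group("method") != "GET":
            continue
        parts = urlsplit(match.group("url"))
        path = parts.path + ("?" + parts.query if parts.query else "")
        # Keep only requests that would trigger a cache fill, once each.
        if not path.startswith(path_prefix) or path in seen:
            continue
        seen.add(path)
        yield "GET " + new_base + path


if __name__ == "__main__":
    import sys
    for target in vegeta_targets(sys.stdin, "http://new-cluster.example.com"):
        print(target)
```

The resulting file can then be fed to something like vegeta attack -targets=targets.txt -rate=50 -duration=60s, with the rate chosen so that the new cluster is warmed gently rather than flattened.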

A more adventurous approach might be to use Elastic MapReduce to parse the logs, pull out the request paths and use the streaming API to call an external script that will make the HTTP request to the ELB. That way you could quite nicely farm out the work of making a large number of parallel requests from a much larger time period in order to more thoroughly pre-warm that cache with historical requests. Or poll your log store frequently and replay ELB requests to the new cluster with just a short delay after they happen on your primary cluster. If you attempt either of these and enjoy some success, let me know!

Wednesday, August 12th, 2015 Tech No Comments

AWS AutoScaling group size metrics (or lack thereof)

by Oliver on Saturday, January 17th, 2015.

One of the notably lacking metrics from CloudWatch has been the current and previous AutoScaling group sizes – in other words, how many nodes are in the cluster. I’ve worked around this by using the regular EC2 APIs, querying the current cluster size and the desired size and logging this to Graphite. However, it only gives you the current values – not anything in the past, which regular CloudWatch metrics do (up to 2 weeks in the past).

My colleague Sean came up with a nice workaround – using the SampleCount statistic of the CPUUtilization metric, filtered by the AutoScalingGroupName dimension for the AutoScaling group in question. Here’s an example, using the AWS Python CLI:

$ aws cloudwatch get-metric-statistics --dimensions Name=AutoScalingGroupName,Value=XXXXXXXXProdCluster1-XXXXXXXX --metric CPUUtilization --namespace AWS/EC2 --period 60 --statistics SampleCount --start-time 2015-01-17T00:00:00 --end-time 2015-01-17T00:05:00
{
    "Datapoints": [
        {
            "SampleCount": 69.0,
            "Timestamp": "2015-01-17T00:00:00Z",
            "Unit": "Percent"
        },
        {
            "SampleCount": 69.0,
            "Timestamp": "2015-01-17T00:01:00Z",
            "Unit": "Percent"
        },
        {
            "SampleCount": 69.0,
            "Timestamp": "2015-01-17T00:03:00Z",
            "Unit": "Percent"
        },
        {
            "SampleCount": 69.0,
            "Timestamp": "2015-01-17T00:02:00Z",
            "Unit": "Percent"
        },
        {
            "SampleCount": 67.0,
            "Timestamp": "2015-01-17T00:04:00Z",
            "Unit": "Percent"
        }
    ],
    "Label": "CPUUtilization"
}

Some things to note:

  • Ignore the units – it’s not a percentage!
  • You will need to adjust your --period parameter to match that of your metric sampling period on the EC2 instances in the AutoScale group – if you have regular monitoring enabled this will be one sample per 5 minutes (300 seconds), if you have detailed monitoring enabled it will be one sample per 1 minute (60 seconds).
  • The last point also means that if you want to gather less frequent data points for historical data, you’ll need to do some division – e.g. using --period 3600 will require you to divide the resulting sample count by 12 (regular monitoring) or 60 (detailed monitoring) before you store it.
  • Going via CloudWatch in this way means you can see your cluster size history for the last two weeks, just like any other CloudWatch metric!
  • Unfortunately you will lose your desired cluster size metric, which is not captured. In practice I haven’t really required both desired and actual cluster size metrics.
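The division mentioned in the notes above is easy to get wrong when switching periods, so here is a minimal Python sketch of it. The function name is made up, and it assumes every instance reported for the whole period; instances launched or terminated part-way through will give a fractional result.

```python
def group_size(sample_count, period_seconds, detailed_monitoring=True):
    """Convert a CPUUtilization SampleCount into an approximate instance count.

    Each instance contributes one sample per sampling interval: 60 seconds
    with detailed monitoring, 300 seconds with regular monitoring.
    """
    sampling_interval = 60 if detailed_monitoring else 300
    if period_seconds % sampling_interval:
        raise ValueError("period must be a multiple of the sampling interval")
    samples_per_instance = period_seconds // sampling_interval
    return sample_count / samples_per_instance


if __name__ == "__main__":
    # One-minute period, detailed monitoring: the count is the group size.
    print(group_size(69.0, 60))
    # One-hour period, regular monitoring: divide by 12.
    print(group_size(828.0, 3600, detailed_monitoring=False))
```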

We’ll start using this almost immediately, as we can remove one crufty metric collection script in the process. Hope it also helps some of you out there in AWS land!

Saturday, January 17th, 2015 Tech No Comments

Elastic MapReduce and data in S3

by Oliver on Friday, February 28th, 2014.

I don’t have to do much data analysis fortunately, but when I do there are two options: either the data is local to our own datacenter and I can use our own Hadoop cluster, or it is external and I can use Elastic MapReduce. Generally you don’t run an Elastic MapReduce cluster all the time, so when you create your cluster you still need to get that data into the system somehow. Usually the easiest way is to use one of your existing running instances outside of the MapReduce system to transfer it from wherever it may be to S3. If you are lucky, the data is already in S3.

Even better, Elastic MapReduce has the ability to run jobs against datasets located in S3 (rather than on HDFS as is usually the case). I believe this used to be a customisation AWS had applied to Hadoop, but it has been in mainline for some time now. It is really quite simple – instead of supplying an absolute or relative path to your HDFS datastore, you can provide an S3-style URI to the data such as: s3://my-bucket-name/mydata/

The “magic” here is not that it now runs the job against S3 directly, but that it creates a job before your main workflow to copy the data over from S3 to HDFS. Unfortunately, it’s a bit slow. Previously it has also had showstopper bugs which prevented it working for me at all, but in a lot of cases I just didn’t care enough and used it anyway. Today’s job had significantly more data, and so I decided to copy the data over by hand. I knew it would be faster, but I didn’t expect as much of a difference as this:

[Screenshot of the graph described below, taken 2014-02-28]

The first part of the graph is the built-in copy operation as part of the job I had started, and where it steepens significantly is where I stopped the original job and started the S3DistCp command. Its usage is relatively simple:

hadoop fs -mkdir hdfs:///data/
hadoop jar lib/emr-s3distcp-1.0.jar --src s3n://my-bucket-name/path/to/logs/ --dest hdfs:///data/

The s3distcp jar file is already loaded on the master node when it is bootstrapped, so you can do this interactively or as part of a step on a cluster you have running automatically. I thoroughly recommend using it, as it will cut down the total time of your job significantly!

Friday, February 28th, 2014 Tech 2 Comments

Can’t create new network sockets? Maybe it isn’t user limits…

by Oliver on Thursday, February 28th, 2013.

I’ve been doing a lot more programming in Go recently, mostly because it has awesome concurrency primitives but also because it is generally a pretty amazing language. Unlike other languages which have threads, fibres or event-driven frameworks to achieve good concurrency, Go manages to avoid all of these but still remains readable. You can also reason about its behaviour very effectively due to how easily understandable and straightforward concepts like channels and goroutines are.

But enough about Go (for the moment). Recently I found the need to quickly duplicate the contents of one Amazon S3 bucket to another. This would not be a problem, were it not for the fact that the bucket contained several million objects. Fortunately, there are two factors which make this not so daunting:

  1. S3 scales better than your application ever can, so you can throw as many requests at it as you like.
  2. You can copy objects between buckets very easily with a PUT request combined with a special header indicating the object you want copied (you don’t need to physically GET then PUT the data).

A perfect job for a Go program! The keys of the objects are in a consistent format, so we can split up the keyspace by prefixes and split the work-load amongst several goroutines. For example, if your objects are named 00000000 through to 99999999 using only numerical characters, you could quite easily split this into 10 segments of 10 million keys. Using the bucket GET method you can retrieve up to 1000 keys in a batch using prefixes. Even if you split into 10 million key segments and there aren’t that many actual objects, the only things that matter are that you start and finish in the right places (the beginning and end of the segment) and continue making batch requests until you have all of the keys in that part of the keyspace.
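The splitting and batching described above can be sketched in Python as follows (the original program was written in Go). Here list_page is a hypothetical stand-in for the bucket GET request that returns up to max_keys keys after a marker, and fake_bucket is an in-memory substitute so the sketch runs without S3. Real responses carry a truncation flag; the sketch simply treats a short page as the end of the segment.

```python
from concurrent.futures import ThreadPoolExecutor


def list_segment(list_page, prefix, max_keys=1000):
    """Collect every key under one prefix, one batch at a time."""
    keys, marker = [], ""
    while True:
        page = list_page(prefix, marker, max_keys)
        keys.extend(page)
        if len(page) < max_keys:
            # A short (or empty) page means the segment is exhausted.
            return keys
        marker = page[-1]


def list_all(list_page, prefixes, workers=10, max_keys=1000):
    """List each prefix segment in its own worker and merge the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        segments = pool.map(
            lambda prefix: list_segment(list_page, prefix, max_keys), prefixes
        )
    return [key for segment in segments for key in segment]


def fake_bucket(all_keys):
    """Build an in-memory list_page function (illustration only)."""
    ordered = sorted(all_keys)

    def list_page(prefix, marker, max_keys):
        matching = [k for k in ordered if k.startswith(prefix) and k > marker]
        return matching[:max_keys]

    return list_page


if __name__ == "__main__":
    keys = ["%08d" % n for n in range(0, 100000000, 40000)]
    found = list_all(fake_bucket(keys), [str(d) for d in range(10)], max_keys=100)
    print(len(found), "keys listed")
```

Because each segment only needs a start and an end, it does not matter how sparsely the keyspace is populated; an empty segment just costs one request.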

So now we have a mechanism for rapidly retrieving all of the keys. For millions of objects this will still take some time, but you have divided the work amongst several goroutines so it will be that much faster. For comparison, the Amazon Ruby SDK uses the same REST requests under the hood when using the bucket iterator bucket.each { |obj| … } but only serially – there is no division of work.

Now to copy all of our objects we just need to take each key returned by the bucket GET batches, and send off one PUT request for each one. This introduces a much slower process – one GET request results in up to 1000 keys, but then we need to perform 1000 PUTs to copy them. The PUTs also take quite a long time each, as the S3 backend has to physically copy the data between buckets – for large objects this can still take some time.

Let’s use some more concurrency, and have a pool of 100 goroutines waiting to process the batch of 1000 keys just fetched. A recent discussion on the golang-nuts group resulted in some good suggestions from others in the Go community and resulted in this code:

It’s not a lot of code, which makes me think it is reasonably idiomatic and correct Go. Best yet, it has the possibility to scale out to truly tremendous numbers of workers. You may notice that each of the workers also uses the same http.Client and this is intentional – internally the http.Client makes some optimisations around connection reuse so that you aren’t susceptible to the performance penalty of socket creation and TCP handshakes for every request. Generally this works pretty well.
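For readers who don't write Go, the worker-pool shape described here looks roughly like this in Python. This is an illustrative sketch and not the code from the golang-nuts discussion: copy_batch and copy_one are hypothetical names, and copy_one stands in for whatever issues the PUT copy request through one shared client.

```python
import queue
import threading


def copy_batch(keys, copy_one, workers=100):
    """Run copy_one(key) for every key using a fixed pool of worker threads.

    Returns a list of (key, exception) pairs for the copies that failed.
    """
    todo = queue.Queue()
    for key in keys:
        todo.put(key)

    failures = []
    failures_lock = threading.Lock()

    def worker():
        while True:
            try:
                key = todo.get_nowait()
            except queue.Empty:
                return  # nothing left in this batch
            try:
                copy_one(key)
            except Exception as exc:
                with failures_lock:
                    failures.append((key, exc))

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return failures


if __name__ == "__main__":
    done = []
    print(copy_batch(["a", "b", "c"], done.append, workers=2), sorted(done))
```

The important design point is the same as in the Go version: the workers share one client object rather than each creating their own, so that connection reuse has a chance to work.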

Let’s think about system limits now. Say we want to make our PUT copy operations really fast, and give each fetcher its own pool of 100 copier goroutines. With just 10 fetcher goroutines that means we now have 1000 goroutines vying for attention from the http.Client connection handling. Even if the fetchers are idle, if we have all of the copier workers running at the same time, we might require 1000 concurrent TCP connections. With a default user limit of 1024 open file handles (e.g. on Ubuntu 12.04) this means we are dangerously close to exceeding that limit.

Head lookup no such host

When you see an error like the above pop up in your program’s output, it almost seems a certainty that you have exceeded these limits… and you’d be right! For now… Initially these were the errors I was getting, and while it was somewhat mysterious that I would see so many of them (literally one for each failed request), apparently some additional sockets are required for name lookups (even if locally cached). I’m still looking for a reference for this, so if you know of it please let me know in the comments.

This resulted in a second snippet of Go code to check my user limits:

Using syscall.Getrusage in conjunction with syscall.Getrlimit would allow you to fairly dynamically scale your program to use just as much of the system resources as it has access to, but not overstep these boundaries. But remember what I said about using http.Client before? The net/http package documentation says “Clients should be reused instead of created as needed” and “Clients are safe for concurrent use by multiple goroutines”, and both of these are indeed accurate. The unexpected side-effect of this is that, unfortunately, the usage of TCP connections is now fairly opaque to us. Thus our understanding of current system resource usage is fundamentally detached from how we use http.Client. This will become important in just a moment.
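The same kind of limit check can be sketched in Python with the standard resource module (Unix only), which wraps the same getrlimit call. The reserve and cap values, and the idea of sizing the worker pool directly from the soft limit, are assumptions made for illustration rather than what the original Go snippet did.

```python
import resource


def max_workers(reserve=64, cap=1000):
    """Pick a worker count that stays under the open file descriptor limit.

    reserve leaves headroom for stdin/stdout, log files, DNS lookups and so on.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return cap
    return max(1, min(cap, soft - reserve))


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("open files: soft=%s hard=%s" % (soft, hard))
    print("workers:", max_workers())
```

As the rest of this post shows, staying under the file descriptor limit is necessary but not sufficient, since sockets in TIME_WAIT no longer count against it.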

So, having raised my ulimits far beyond what I expected I actually needed (this was to be the only program running on my test EC2 instance anyway), I re-ran the program and faced another error:

Error: dial tcp cannot assign requested address

What the… I thought I had dealt with user limits? I didn’t initially find the direct cause of this, thinking I hadn’t properly dealt with the user limits issue. I found a few group discussion threads dealing with http.Client connection reuse, socket lifetimes and related topics, and I first tried a few different versions of Go, suspecting it was a bug fixed in the source tip (more or less analogous to HEAD on origin/master in Git, if you mainly use that VCS). Unfortunately this yielded no fix and no additional insights.

I had been monitoring open file handles of the process during runtime and noticed it had never gone over about 150 concurrent connections. Using netstat, on the other hand, showed that there were a significant number of connections in the TIME_WAIT state. This socket state is used by the kernel to leave a trace of the connection around in case there are duplicate packets on the network waiting to arrive (among other things). In this state the socket is actually detached from the process that created it, but waiting for kernel cleanup – therefore it actually doesn’t count as an open file handle anymore, but that doesn’t mean it can’t cause problems!

In this case I was connecting to Amazon S3 from a single IP address – the only one configured on the EC2 instance. S3 itself has a number of IP addresses on both East and West coasts, rotated automatically through DNS-based load-balancing mechanisms. However, at any given moment you will resolve a single IP address and probably use that for a small period of time before querying DNS again and perhaps getting another IP. So we can basically say we have one IP contacting another IP – and this is where the problem lies.

When an IPv4 network socket is created, there are five basic elements the kernel uses to make it unique among all others on the system:

protocol; local IPv4 address : local IPv4 port <-> remote IPv4 address : remote IPv4 port

Given roughly 2^32 possibilities for the local IP, the same for the remote IP and 2^16 for each of the local and remote ports (assuming we can use any privileged ports < 1024 if we use the root account), that gives us about 2^96 different combinations of numbers and thus the number of theoretical IPv4 TCP sockets a single system could keep track of. That’s a whole lot! Now consider that we have a single local IP on the instance, we have (for some small amount of time) a single remote IP for Amazon S3, and we are reaching it only over port 80 – now three of our variables are reduced to a single possibility and we only have the local port range to make use of.

Worse still, the default setting (for my machine at least) of the local port range available to non-root users was only 32768-61000, which reduced my available local ports to less than half of the total range. After watching the output of netstat and grepping for TIME_WAIT sockets, it was evident that I was using up these 28,000-odd local ports within a matter of seconds. When there are no remaining local port numbers to be used, the kernel simply fails to create a network socket for the program and returns an error as in the above message – cannot assign requested address.
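A quick back-of-the-envelope calculation shows how tight this budget is. The sketch below assumes the port range quoted above and a TIME_WAIT duration of 60 seconds, which is the usual fixed value on Linux; the function name is made up.

```python
def local_port_budget(low=32768, high=61000, time_wait_seconds=60):
    """Return (usable local ports, sustainable new connections per second).

    Applies to one local IP talking to one remote IP and port, where every
    closed connection holds its local port for time_wait_seconds.
    """
    ports = high - low + 1
    return ports, ports / time_wait_seconds


if __name__ == "__main__":
    ports, rate = local_port_budget()
    print("usable local ports:", ports)
    print("share of the 16-bit port range: %.0f%%" % (100.0 * ports / 65536))
    print("sustainable new connections per second: %.0f" % rate)
```

In other words, if every request opens and then closes its own connection, a few hundred requests per second to a single remote address is enough to run out of local ports within a minute, and a burst of thousands does it in seconds.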

Armed with this knowledge, there are a couple of kernel tunings you can make. The tcp_tw_reuse and tcp_tw_recycle settings both affect when the kernel will reclaim sockets in the TIME_WAIT state, but in practice they didn’t seem to have much effect. Another setting, tcp_max_tw_buckets, sets a limit on the total number of TIME_WAIT sockets and kills them off rapidly once the count exceeds this limit. All three of these parameters look and sound slightly dangerous, and given that they had not had much effect anyway, I was loath to use them and call the problem solved. After all, if the program was closing the connections and leaving them for the kernel to clean up, it didn’t sound like http.Client was doing a very good job of reusing connections automatically.

Incidentally, Go does support automatic reuse of connections in TIME_WAIT with the SO_REUSEADDR socket option, but this only applies to listening sockets (i.e. servers).

Unfortunately that brought me to about the end of my inspiration, but a co-worker pointed me in the direction of the http.Transport’s MaxIdleConnsPerHost parameter, which I was only vaguely aware of due to having skimmed the source of that package in the last couple of days, desperately searching for clues. The default value used here is two (2) which seems reasonable for most applications, but evidently is terrible when your application has large bursts of requests rather than a constant flow. I believe that internally, the transport creates as many connections as required, the requests are processed and closed and then all of those connections (but two) are terminated again, left in TIME_WAIT state for the kernel to deal with. Just a few cycles of this need to repeat before you have built up tens of thousands of sockets in this state.

Altering the value of MaxIdleConnsPerHost to around 250 immediately removed the problem, and I didn’t see any sockets in TIME_WAIT state while I was monitoring the program. Shortly thereafter the program stopped functioning, I believe because my instance was blacklisted by AWS for sending too many requests to S3 in a short period of time – scalability achieved!
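The effect of the idle connection limit can be illustrated with a toy model in Python. This models the behaviour as hypothesised above (each burst needs one connection per request, and after the burst only max_idle connections are kept open); it is not a description of the actual net/http internals, and the function name is invented.

```python
def simulate_bursts(bursts, burst_size, max_idle):
    """Return (connections opened, connections closed) over a run of bursts.

    Every closed connection would linger in TIME_WAIT on the client side.
    """
    idle = 0
    opened = 0
    closed = 0
    for _ in range(bursts):
        # Reuse whatever is idle, open new connections for the rest.
        opened += max(0, burst_size - idle)
        # When the burst ends, keep at most max_idle connections around.
        kept = min(burst_size, max_idle)
        closed += burst_size - kept
        idle = kept
    return opened, closed


if __name__ == "__main__":
    print("max_idle=2:  ", simulate_bursts(30, 1000, 2))
    print("max_idle=250:", simulate_bursts(30, 250, 250))
```

With a limit of two, thirty bursts of 1000 requests leave nearly 30,000 closed connections behind, which is more than the whole local port range discussed earlier. With the limit at or above the burst size, connections are opened once and then reused.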

If there are any lessons in this, I guess it is that you still often need to be aware of what is happening at the lowest levels of the system even if your programming language or application has abstracted enough of the details away for you not to have to worry about them. Even knowing that there was an idle connection limit of two would not have given away the whole picture of the forces at play here. Go is still my favourite language at the moment and I was glad that the fix was relatively simple, and I still have a very understandable codebase with excellent performance characteristics. However, whenever the network and remote services with variable performance characteristics are involved, any problem can take on large complexity.

Thursday, February 28th, 2013 Tech 8 Comments

Service SDKs and Language Support Part 2

by Oliver on Sunday, January 20th, 2013.

As I wrote previously, I found a mismatch between the goals of large cloud services like Amazon Web Services and the languages they support, which conflicts slightly with the notion of making highly concurrent and parallelised workflows.

Of course the obvious followup to that post (even embarrassingly obvious since I’ve been mentioning Go so copiously recently) is to point out that Google’s App Engine is doing this right by supporting Go as a first-class language, even getting an SDK provided for several platforms.

I haven’t had a chance to use App Engine so far, but I’d like to in future. Unfortunately, Google’s suite of services is not nearly as rich as that provided in AWS right now but I’m sure they are working hard on achieving feature parity in order to pull more customers over from AWS.

Sunday, January 20th, 2013 Tech No Comments

Personal off-site backups

by Oliver on Saturday, December 29th, 2012.

Unlike many, I’m actually a good boy and do backups of my personal data (for which I can mostly thank my obsessive-compulsive side). However, up until now I’ve been remiss in my duties to also take these backups off-site in case of fire, theft, acts of god or gods etc. Without a tape system or rotation of hard drives (not to mention an actual “off-site” site to store them), this ends up being a little tricky to pull off.

Some of my coworkers and colleagues make use of various online backup services, a lot of which are full-service offerings with a custom client or fixed workflow for performing the backups. At least one person I know backs up (or used to) to Amazon S3 directly; but even in the cheapest of their regions, the cost is significant for what could remain an effectively cold backup. It may be somewhat easier to swallow now that they have recently reduced their pricing across the board.

Glacier is a really interesting offering from Amazon that I’ve been playing with a bit recently, and while its price point is squarely aimed at businesses who want to back up really large amounts of data, it also makes a lot of sense for personal backups. Initially the interface was somewhat similar to what you would expect from a tape system – collect your files together as a vaguely linear archive and upload it with some checksum information. I was considering writing a small backup tool that would make backing up to Glacier reasonably simple but didn’t quite get around to it in time.

Fortunately for me, waiting paid off as they recently added support for transitioning S3 objects to Glacier automatically. This means you get to use the regular S3 interface for uploading and downloading individual objects/files, but allow the automatic archival mechanism to move them into Glacier for long-term storage. This actually makes the task of performing cost-effective remote backups ridiculously trivial but I still wrote a small tool to automate it a little bit.

Hence, glacier_backup. It just uses a bit of Ruby, the Amazon Ruby SDK (which is a very nice library, incidentally), ActiveRecord and progressbar. Basically, it just traverses directories you configure it with and uploads any readable file there to S3, after setting up a bucket of your choosing and setting a policy to transition all objects to Glacier immediately. Some metadata is stored locally using ActiveRecord, not because it is necessary (you can store a wealth of metadata on S3 objects themselves), but each S3 request costs something, so it’s helpful to avoid making requests if it is not necessary.
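The idea of consulting local metadata before touching S3 can be sketched as follows. This is in Python rather than the Ruby of glacier_backup, and it is not that tool's code: the function name, the use of a plain dictionary and the choice of (size, modification time) as the fingerprint are all assumptions for illustration.

```python
import os


def files_to_upload(roots, already_uploaded):
    """Yield (path, fingerprint) for readable files that need uploading.

    already_uploaded maps a path to the fingerprint recorded at its last
    upload, so unchanged files cost no S3 requests at all.
    """
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if not (os.path.isfile(path) and os.access(path, os.R_OK)):
                    continue
                info = os.stat(path)
                fingerprint = (info.st_size, int(info.st_mtime))
                if already_uploaded.get(path) != fingerprint:
                    yield path, fingerprint


if __name__ == "__main__":
    uploaded = {}
    for path, fingerprint in files_to_upload(["."], uploaded):
        # The real upload would happen here; then record it locally.
        uploaded[path] = fingerprint
    print(len(uploaded), "files would be uploaded")
```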

It’s not an amazing bit of code but it gets the job done, and it is somewhat satisfying to see the progress bar flying past as it archives my personal files up to the cloud. Give it a try, if you have a need for remote backups. Pull requests or features/issues are of course welcome, and I hope you find it useful!

Saturday, December 29th, 2012 Tech No Comments

On Service SDKs and Language Support

by Oliver on Wednesday, November 21st, 2012.

As I’ve previously mentioned, I’ve been doing a lot of work recently with various aspects of AWS on a daily basis (or close to it). My primary language these days is still Ruby, but I’ve been labouring through the excellent Seven Languages in Seven Weeks book in the hope I can broaden my horizons somewhat. I’m fairly comfortable with Python, somewhat familiar with Javascript now after playing with NodeJS and I have a cursory ability still in C/C++ and Java but it has been over 10 years since I’ve done anything significant in any of those languages.

Suffice to say, I’m far from being a polyglot, but I know my current limitations. Go has been increasingly noticeable on my radar and I am starting to familiarise myself with it, but this has led me to a small realisation. When service providers (like Amazon in this case) are providing SDK support they typically will be catering to their largest consumer base. Internally they largely use Java and that shows by their first-class support for that language and toolchain.

Using the example of Elastic Beanstalk and the language support it provides, you can quite easily determine their current (or recent) priorities. Java came first, with .NET and PHP following. Python came about half-way through this year and Ruby was only recently added. Their general-purpose SDKs are somewhat more limiting, only supporting Java, .NET, PHP and Ruby (outside of mobile platform support). These are reasonable, if middle-of-the-road options.

Today I was attempting to run some code against the Ruby SDK, using JRuby. The amount of work it has to do is significant, parallelisable and doesn’t exactly fit Ruby’s poor native support (at least in MRI) for true concurrency. I’m not going to gain anything by rewriting in PHP, cannot consider .NET and Java is just not going to be a good use of my time. I feel like there is an impedance mismatch between this set of languages and the scale of what AWS supports.

You are supposed to be scaling up to large amounts of computing and storage to best take advantage of what AWS offers. Similarly, you best make use of the platform by highly parallelising your workload. The only vaguely relevant language from this point of view is Java, but it’s just not a desirable general-purpose language for many of us, especially if we want to enjoy low-friction development as so many newer languages provide.

To be more specific – languages like Go, Erlang (or perhaps more relevant, Elixir), Scala etc offer fantastic concurrency and more attractive development experiences but these are not going to be supported by the official SDKs. It makes perfect sense from the point of view of the size of the developer base, but from the point of view of picking the right tool for the job it doesn’t. Perhaps in a few years this paradigm of highly parallel computing will have gained momentum enough that these languages move to the mainstream (ok, Heroku supports Scala already) and we start to see more standard SDK support for them.

Wednesday, November 21st, 2012 Tech 1 Comment

Amazon S3 object deletions and Multi-Factor Authentication

by Oliver on Sunday, October 7th, 2012.

I’ve been using S3 a lot in the last couple of months, and with the Amazon SDK for Ruby it really is dead simple to work with (as well as all of the other AWS services the SDK supports currently). So simple in fact, that you could quite easily delete all of your objects with very little work indeed. I did some benchmarks and found that (with batch operations) it took around 3 minutes to delete ~75000 files in about a terabyte. Single threaded.

Parallelize that workload and you could drop everything in your S3 buckets within a matter of minutes for just about any number of objects. Needless to say, if a hacker gets your credentials an extraordinary amount of damage can be done very easily and in a very short amount of time. Given there is often a several hour lag in accesses being logged, you’ll probably not find out about such accesses until long after the fact. Another potential cause of deletions is of course human error (and this is generally way more probable). In both cases there is something you can do about it.

S3 buckets have supported versioning for well over two years now, and if you use SVN, Git, or some other version control system then you’ll already understand how it works. The access methods of plain objects and their versions do differ slightly but the principal ideas are the same (object access methods generally operate on only the latest, non-deleted version). With versioning you can already protect yourself against accidental deletion, since you can revert to the last non-deleted version at any time.

However there is nothing preventing you from deleting all versions of a file, and with it all traces that that file ever existed. This is an explicit departure from the analogy with source versioning systems, as any object with versions still present will continue to cost you real money (even if the latest version is a delete marker). So, you can add Multi-Factor Authentication to your API access to S3 and secure these version deletion operations.

This has existed in the web API for some time but I recently had a commit merged into the official SDK that allows you to enable MFA Delete on a bucket, and there is another one in flight which will allow you to actually use the multi-factor tokens in individual delete requests. The usage is slightly interesting so I thought I’d demonstrate how it is done in Ruby, and some thoughts on its potential use cases. If you want to use it now, you’ll have to pull down my branch (until the pull request is merged).

Enabling MFA

I won’t go into details about acquiring the actual MFA device as it is covered in sufficient detail in the official documentation but suffice it to say that you can buy an actual hardware TOTP token, or use Amazon’s or Google’s “virtual” MFA applications for iPhone or Android. Setting them up and associating them with an account is also fairly straightforward (as long as you are using the AWS console; the command line IAM tools are another matter altogether).

Setting up MFA Delete on your bucket is actually quite trivial:

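require 'rubygems'
require 'aws-sdk'
# Credentials are placeholders; AWS::S3.new is the S3 interface in version 1 of the SDK
s3 = AWS::S3.new(:access_key_id => 'XXXX', :secret_access_key => 'XXXX')
bucket = s3.buckets['my-test-bucket']
# :mfa is the device serial number, a space, then the current token
bucket.enable_versioning(:mfa_delete => 'Enable', :mfa => 'arn:aws:iam::123456789012:mfa/root-account-mfa-device 123456')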

Behind the scenes, this doesn’t do much different to enabling versioning without MFA. It adds a new element to the XML request which requests that MFA Delete be enabled, and adds a header containing the MFA device serial number and current token number. Importantly (and this may trip you up if you have started using IAM access controls), only the owner of a bucket can enable/disable MFA Delete. In the case of a “standard” account and delegated IAM accounts under it, this will be the “standard” account (even if one of the sub-accounts was used to create the bucket).

Version Deletion with MFA

Now, it is still possible to delete objects but not versions. Version deletion looks much the same but requires the serial/token passed in if MFA Delete is enabled:

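require 'rubygems'
require 'aws-sdk'
# Credentials are placeholders; AWS::S3.new is the S3 interface in version 1 of the SDK
s3 = AWS::S3.new(:access_key_id => 'XXXX', :secret_access_key => 'XXXX')
bucket = s3.buckets['my-test-bucket']
# Delete one specific version, passing the device serial number and a fresh token
bucket.versions['itHPX6m8na_sog0cAtkgP3QITEE8v5ij'].delete(:mfa => 'arn:aws:iam::123456789012:mfa/root-account-mfa-device 123456')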

As mentioned above there are some limitations to this (as you’ve probably guessed):

  • Being a TOTP system, tokens can be used only once. That means you can delete a single version with a single token, no more. Given that on Google Authenticator and Gemalto physical TOTP devices a token is generated once every 30 seconds it may take up to a minute to completely eradicate all traces of an object that was deleted previously (original version + delete marker).
  • Following on from this, it is almost impossible to consider doing large numbers of deletions. There is a batch object deletion method inside of AWS::S3::ObjectCollection but this is not integrated with any of the MFA Delete mechanisms. Even then, you can only perform batches of 1000 deletions at a time.
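To see why the 30-second step dominates the time taken, here is a minimal TOTP sketch in Python following RFC 6238 with HMAC-SHA1. The secret used is the published RFC test value and has nothing to do with any real MFA device; the function name is made up.

```python
import hashlib
import hmac
import struct


def totp(secret, unix_time, step=30, digits=6):
    """Return the time-based one-time password for the given moment."""
    # Every timestamp inside the same step maps to the same counter value.
    counter = int(unix_time) // step
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low nibble of the last byte picks the offset.
    offset = digest[-1] & 0x0F
    number = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(number % 10 ** digits).zfill(digits)


if __name__ == "__main__":
    secret = b"12345678901234567890"  # RFC test secret, illustration only
    for moment in (30, 59, 60):
        print(moment, totp(secret, moment))
```

The token is the same at 30 and 59 seconds and only changes at 60, so deleting a version and then its delete marker means waiting for the counter to tick over between the two requests.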

As it stands, I’m not sure how practical it is. MFA involves an inherently human-oriented process as it involves something you have rather than something you are or something you know (both of which are reasonably easily transcribed once into a computer). Given the access medium is an API designed for rapid, lightweight use there seems to be an impedance mismatch. Still, with some implementation work to get the batch deletions working it would probably serve a lot of use cases.

Are you using MFA Delete (through any of the native APIs or other language SDKs, or even 3rd-party apps)? I would love to hear about other people’s experiences with it – leave your comments below.

Sunday, October 7th, 2012 Tech 6 Comments