“otherness” in others that shines.
Today I actually have the energy for a blog post. Probably because I went for a bike ride when it was -1 outside, and am WIDE awake as a result!
That aside, I realised that the last time I went bike riding, Kai was barely walking, meaning it has been a long time since I have ridden. However, I distinctly recall having significantly lower energy then than this time around.
As I was riding along, aware of how good I felt, I began to realise that this is a small landmark event for me. For those of you who know me well or are close to me you will recall that the past two years have seen me not at my optimum health.
My internal fire was extinguished. 2012 in particular was a dark time, endless days feeling lost and alone. I will not mention the specifics, however I will be the first to admit that due to external stresses in my life that were beyond my control, I stopped functioning. My eating became disordered. I had no appetite for my life and the knot in the pit of my stomach prevented me from thinking clearly and making rational decisions. It prevented me from sleeping, dreaming, from seizing opportunities that were presented to me. The isolation was crippling. My mind was fractured.
Some people whispered on corners, some turned away, and some screamed at the top of their voices.
Since returning to a healthier state of being I have thought about sharing a part or the whole of this journey. I thought initially that I did not want to write a blog post about that time. However, I think there is merit in sharing. I usually have no time for blogging, and I am not an avid blogger who likes to spread my opinion via the web; however, if I can reach out to others, then it seems to make sense to write about something I know will affect others.
I know for a fact that mental illness is a huge part of human society, especially in the westernised world. If there was less fear and judgement then people would be open to sharing more. This will not be an attempt to rid the world of stigmatisation; however, I can share a part of a story that may help.
While it was in large part a personal journey, a re-building of a self, I did not do this on my own. I did this with support from certain people, and those people know who they are, and they should be acknowledged. For those who lost faith in me during this time: I can understand it would have been painful to be a powerless witness. While I have a firm idea, I may not fully know the extent of their pain or what was going through their minds as they struggled with watching me struggle.
Call it ironic but in actual fact- having people give up on me, only strengthened my resolve to push forward and prove to myself they were wrong, so to them I say an additional thank you.
I wonder about a hypothetical alternative. If I had been in a traumatic road accident which resulted in a broken leg and left me just as incapacitated, just as damaged, would the treatment of me have been more understanding? I think it would. People could “see the cast” on my leg and understand what was “wrong with me”. When it comes to matters of mental health, a lot of people are just not educated about the facts, and naturally we shy away from what we do not understand, because “it is dangerous”. Therein lies the problem with social determinants of health. Stigma.
Essentially what propelled me to get through this dark time was my son Kai. I could not leave my child motherless. I had to get up and face myself. I have realised I cannot separate myself from being a mother now. This death of my previous identity was something I had to grieve. Something I had to challenge, to deny, to surrender to, to play with, to taste and re-taste, to explore, realise and accept.
While having my first baby, at age 23 – away from family in a foreign country was totally my choice, I realise now that was a choice marked as one of the hardest things I have ever done.
- But when you answer to someone who is 100% dependent on you for their survival, the pressure is overwhelming. Your thoughts about yourself must change.
What kind of example did I want to set for my son?
That personal fears can rule your daily existence?
Or rather that your opinions of yourself are flexible, can mutate and evolve, and that we have a right to demand that our own physical, mental and psychological health is a priority?
That being afraid of life is a balanced approach? or rather; that it is healthy to have fears, but not to let them dominate you?
That sometimes you can try your hardest at something, but it will not always come to fruition, and that is acceptable too.
That we need to love ourselves enough to forgive ourselves for the things beyond our control. Nothing is so big that we are not deserving to share a part of our-self with the world, furthermore – that self betterment is always desired. Accepting a second rate shadow of ourselves, is not acceptable.
Each second of every day we make choices with consequences, and each day we can choose to alter what is causing us to lose respect for ourselves, or what is blocking inner contentment.
I wanted Kai to learn that happiness should not be the ultimate goal, as it is a fleeting emotion which will pass, and it is exhausting to chase an unrealistic ideal; rather, to reach for a place of self contentment no matter what you are doing or where you are.
I wanted Kai to see mistakes are a natural progression in evolution, that adults and children alike make mistakes and can adapt. That adults do not always have all the answers, all the time, we are human after all.
That there are myriads of ways to deal with the situations presented to us, and we can choose.
The final lesson I hope I can impart is how art can heal.
An art project I began whilst in an hour of darkness helped me to process my fears during this time, to channel it into something that can now potentially serve other people who are struggling and need something to cling to at their hour of darkness. A tiny seed of hope was implanted in me the day I decided to do this art project, a little flickering spark on the horizon that makes you think- what is up ahead?
When I told a therapist I was doing this- she was taken aback, and then disapproving. Her judgement was intriguing as she scoffed at me and rolled her eyes- I just smiled at her in response. I CAN not wait to prove her wrong.
Apple can’t obey its own specifications
I’ve been playing around with a basic implementation of HLS the last few days, and despite it being a proprietary streaming transport there is a public specification available for it which seems reasonably complete. That being said, it is still proprietary, hasn’t made it through the standardization process and the public spec is suspected of being at least partially incomplete. THAT being said, I still managed to get both iTunes and an iPad working with it just based on my reading of the specification, so it has enough to get by.
However, as I worked through the small number of bugs I had in my implementation, at first only iTunes would play back content and the iPad would not. Later, when I had fixed the bugs, the situation was reversed: the iPad could play back but iTunes could not. One would think that a desktop app would be more lenient with faulty implementations on the server side than a portable device which may not always be kept up to date, but apparently not!
It turns out that iTunes (at least 11.0.2 build 26) doesn’t actually implement Apple’s own specification properly, in that it can’t handle media segment URIs relative to the URI of the m3u8 playlist. This has been in the specification since at least draft 3 of the public specification, released April 2, 2010. Here’s what the playlist should (or could) look like:
#EXTM3U
#EXT-X-VERSION:4
#EXT-X-TARGETDURATION:2
#EXT-X-MEDIA-SEQUENCE:1
#EXTINF:0.992653,
#EXT-X-BYTERANGE:15883@0
test.mp3
#EXTINF:0.992653,
#EXT-X-BYTERANGE:15883@15882
test.mp3
#EXTINF:0.992653,
#EXT-X-BYTERANGE:15884@31764
test.mp3
With no further specification, the HLS client should request the media file at the same path level (and from the same server) as the playlist file. Unfortunately iTunes seems to require the full URI to the media segments:
#EXTINF:0.992653,
#EXT-X-BYTERANGE:15884@31764
http://hls.paperairoplane.net/test.mp3
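For illustration, here is a minimal Go sketch of the resolution a client is expected to perform on a relative segment URI. The playlist file name `playlist.m3u8` is an assumption (only the host appears above), and `resolveSegment` is a hypothetical helper, not part of any HLS library.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveSegment resolves a media segment URI found in a playlist against
// the URI the playlist itself was fetched from (standard RFC 3986 rules).
func resolveSegment(playlistURI, segmentURI string) (string, error) {
	base, err := url.Parse(playlistURI)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(segmentURI)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	// Assumed playlist location; only the host name comes from the post.
	playlist := "http://hls.paperairoplane.net/playlist.m3u8"

	// A relative and an absolute segment URI should end up in the same place.
	for _, seg := range []string{"test.mp3", "http://hls.paperairoplane.net/test.mp3"} {
		abs, err := resolveSegment(playlist, seg)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(seg, "->", abs)
	}
}
```

Both forms resolve to the same absolute URI, which is why a client that only accepts the second form is stricter than it needs to be.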
It’s not the end of the world, but it is unnecessary, and a requirement that died almost three years ago when the second draft specification expired. The other side-effect of this is that your playlists will now contain more data. If you are streaming very long tracks with reasonably short media segments, the larger playlist takes longer to download, which delays the point at which the player can start requesting media segments.
The main mechanism by which we could offset this is also seemingly not present in iTunes – gzip content encoding. iTunes makes no attempt to negotiate any compression in the response, which was part of the specification since draft 4, released June 5, 2010. It mystifies me that Apple can stay so far behind in its own streaming transport with its flagship music playback software and yet clearly keep its devices implementing the same streaming mechanism up to date (at least enough to be useful).
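To get a feel for how much gzip would help with full-URI playlists, here is a rough Go sketch. `encodePlaylist` and `samplePlaylist` are hypothetical helpers, the `Accept-Encoding` check is deliberately naive (it ignores q-values), and the generated playlist only imitates the shape of the one above.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"strings"
)

// encodePlaylist returns the Content-Encoding to use and the bytes to send,
// compressing only when the client advertised gzip support.
func encodePlaylist(acceptEncoding string, playlist []byte) (string, []byte, error) {
	if !strings.Contains(acceptEncoding, "gzip") {
		return "", playlist, nil // client did not negotiate compression
	}
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(playlist); err != nil {
		return "", nil, err
	}
	if err := zw.Close(); err != nil {
		return "", nil, err
	}
	return "gzip", buf.Bytes(), nil
}

// samplePlaylist builds a playlist with full segment URIs, one byte range
// per segment, purely as test data.
func samplePlaylist(segments int) []byte {
	var b strings.Builder
	b.WriteString("#EXTM3U\n#EXT-X-VERSION:4\n#EXT-X-TARGETDURATION:2\n")
	for i := 0; i < segments; i++ {
		fmt.Fprintf(&b, "#EXTINF:0.992653,\n#EXT-X-BYTERANGE:15883@%d\n", i*15883)
		b.WriteString("http://hls.paperairoplane.net/test.mp3\n")
	}
	return []byte(b.String())
}

func main() {
	playlist := samplePlaylist(1000)
	for _, accept := range []string{"", "gzip, deflate"} {
		enc, body, err := encodePlaylist(accept, playlist)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("Accept-Encoding=%q Content-Encoding=%q bytes=%d\n", accept, enc, len(body))
	}
}
```

Because every entry repeats the same URI, the compressed playlist is a small fraction of the original, which is exactly the saving a client gives up by not negotiating gzip.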
Can’t create new network sockets? Maybe it isn’t user limits…
I’ve been doing a lot more programming in Go recently, mostly because it has awesome concurrency primitives but also because it is generally a pretty amazing language. Unlike other languages which have threads, fibres or event-driven frameworks to achieve good concurrency, Go manages to avoid all of these but still remain readable. You can also reason about its behaviour very effectively due to how easily understandable and straightforward concepts like channels and goroutines are.
But enough about Go (for the moment). Recently I found the need to quickly duplicate the contents of one Amazon S3 bucket to another. This would not be a problem, were it not for the fact that the bucket contained several million objects. Fortunately, there are two factors which make this not so daunting:
- S3 scales better than your application ever can, so you can throw as many requests at it as you like.
- You can copy objects between buckets very easily with a PUT request combined with a special header indicating the object you want copied (you don’t need to physically GET then PUT the data).
A perfect job for a Go program! The keys of the objects are in a consistent format, so we can split up the keyspace by prefixes and split the work-load amongst several goroutines. For example, if your objects are named 00000000 through to 99999999 using only numerical characters, you could quite easily split this into 10 segments of 10 million keys. Using the bucket GET method you can retrieve up to 1000 keys in a batch using prefixes. Even if you split into 10 million key segments and there aren’t that many actual objects, the only things that matter are that you start and finish in the right places (the beginning and end of the segment) and continue making batch requests until you have all of the keys in that part of the keyspace.
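The prefix-splitting and batching idea can be sketched roughly as follows. This is an illustration, not the program I ran: `listFunc` is a hypothetical stand-in for the bucket GET request, and `fakeBucket` is an in-memory substitute so the sketch runs without touching S3.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// listFunc is a hypothetical stand-in for the S3 bucket GET (list) request:
// it returns up to max keys with the given prefix that sort after marker,
// plus a flag saying whether more keys remain.
type listFunc func(prefix, marker string, max int) (keys []string, truncated bool)

// fetchPrefix pages through one segment of the keyspace until it is exhausted.
func fetchPrefix(prefix string, batch int, list listFunc) []string {
	var all []string
	marker := ""
	for {
		keys, truncated := list(prefix, marker, batch)
		all = append(all, keys...)
		if !truncated || len(keys) == 0 {
			return all
		}
		marker = keys[len(keys)-1] // continue after the last key seen
	}
}

// fetchAll runs one goroutine per prefix and collects the results in order.
func fetchAll(prefixes []string, batch int, list listFunc) [][]string {
	results := make([][]string, len(prefixes))
	var wg sync.WaitGroup
	for i, p := range prefixes {
		wg.Add(1)
		go func(i int, p string) {
			defer wg.Done()
			results[i] = fetchPrefix(p, batch, list) // each goroutine owns one slot
		}(i, p)
	}
	wg.Wait()
	return results
}

// decimalPrefixes splits a numeric keyspace into ten segments, "0" to "9".
func decimalPrefixes() []string {
	var out []string
	for d := 0; d < 10; d++ {
		out = append(out, fmt.Sprint(d))
	}
	return out
}

// sampleKeys generates n eight-digit keys spaced step apart (test data only).
func sampleKeys(n, step int) []string {
	var keys []string
	for i := 0; i < n; i++ {
		keys = append(keys, fmt.Sprintf("%08d", i*step))
	}
	return keys
}

// fakeBucket is an in-memory listFunc used in place of a real bucket.
func fakeBucket(keys []string) listFunc {
	sorted := append([]string(nil), keys...)
	sort.Strings(sorted)
	return func(prefix, marker string, max int) ([]string, bool) {
		var out []string
		for _, k := range sorted {
			if !strings.HasPrefix(k, prefix) || k <= marker {
				continue
			}
			if len(out) == max {
				return out, true // another matching key exists beyond this batch
			}
			out = append(out, k)
		}
		return out, false
	}
}

func main() {
	list := fakeBucket(sampleKeys(25, 4000000))
	for i, keys := range fetchAll(decimalPrefixes(), 2, list) {
		fmt.Printf("prefix %d: %d keys\n", i, len(keys))
	}
}
```

Against the real service the batch size would be 1000 and each `listFunc` call would be one HTTP request; the paging loop stays the same.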
So now we have a mechanism for rapidly retrieving all of the keys. For millions of objects this will still take some time, but you have divided the work amongst several goroutines so it will be that much faster. For comparison, the Amazon Ruby SDK uses the same REST requests under the hood when using the bucket iterator bucket.each { |obj| … } but only serially – there is no division of work.
Now to copy all of our objects we just need to take each key returned by the bucket GET batches, and send off one PUT request for each one. This introduces a much slower process – one GET request results in up to 1000 keys, but then we need to perform 1000 PUTs to copy them. The PUTs also take quite a long time each, as the S3 backend has to physically copy the data between buckets – for large objects this can still take some time.
Let’s use some more concurrency, and have a pool of 100 goroutines waiting to process the batch of 1000 keys just fetched. A recent discussion on the golang-nuts group resulted in some good suggestions from others in the Go community and resulted in this code:
It’s not a lot of code, which makes me think it is reasonably idiomatic and correct Go. Best yet, it has the possibility to scale out to truly tremendous numbers of workers. You may notice that each of the workers also uses the same http.Client and this is intentional – internally the http.Client makes some optimisations around connection reuse so that you aren’t susceptible to the performance penalty of socket creation and TCP handshakes for every request. Generally this works pretty well.
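As a rough illustration of the pattern (a sketch with assumed names, not the listing discussed above), a fixed pool of workers draining a channel of keys might look like this. `copyFunc` is a hypothetical stand-in for the PUT copy request; in a real program it would close over one shared http.Client.

```go
package main

import (
	"fmt"
	"sync"
)

// copyFunc is a hypothetical stand-in for the S3 PUT copy request for one key.
type copyFunc func(key string) error

// copyKeys fans a batch of keys out to a fixed number of worker goroutines
// and reports how many copies succeeded and how many failed.
func copyKeys(keys []string, workers int, fn copyFunc) (copied, failed int) {
	jobs := make(chan string)
	var wg sync.WaitGroup
	var mu sync.Mutex // protects the two counters

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for key := range jobs {
				err := fn(key)
				mu.Lock()
				if err != nil {
					failed++
				} else {
					copied++
				}
				mu.Unlock()
			}
		}()
	}

	for _, k := range keys {
		jobs <- k
	}
	close(jobs) // lets the workers' range loops finish
	wg.Wait()
	return copied, failed
}

func main() {
	keys := make([]string, 1000)
	for i := range keys {
		keys[i] = fmt.Sprintf("%08d", i)
	}
	// A no-op copy keeps the sketch runnable without network access.
	copied, failed := copyKeys(keys, 100, func(string) error { return nil })
	fmt.Println("copied:", copied, "failed:", failed)
}
```

The unbuffered channel means the producer blocks whenever all workers are busy, which keeps memory use flat however many keys are fed in.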
Let’s think about system limits now. Say we want to make our PUT copy operations really fast, and use 100 copier goroutines for each fetcher. With just 10 fetcher goroutines that means we now have 1000 copier goroutines vying for attention from the http.Client connection handling. Even if the fetchers are idle, if we have all of the copier workers running at the same time, we might require 1000 concurrent TCP connections. With a default user limit of 1024 open file handles (e.g. on Ubuntu 12.04) this means we are dangerously close to exceeding that limit.
Head http://mybucket.s3.amazonaws.com:80/: lookup mybucket.s3.amazonaws.com: no such host
When you see an error like the above pop up in your program’s output, it almost seems a certainty that you have exceeded these limits… and you’d be right! For now… Initially these were the errors I was getting, and while it was somewhat mysterious that I would see so many of them (literally one for each failed request), apparently some additional sockets are required for name lookups (even if locally cached). I’m still looking for a reference for this, so if you know of it please let me know in the comments.
This resulted in a second snippet of Go code to check my user limits:
Using syscall.Getrusage in conjunction with syscall.Getrlimit would allow you to fairly dynamically scale your program to use just as much of the system resources as it has access to, but not overstep these boundaries. But remember what I said about using http.Client before? The net/http package documentation says Clients should be reused instead of created as needed and Clients are safe for concurrent use by multiple goroutines and both of these are indeed accurate. The unexpected side-effect of this is that, unfortunately, the usage of TCP connections is now fairly opaque to us. Thus our understanding of current system resource usage is fundamentally detached from how we use http.Client. This will become important in just a moment.
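For reference, reading the open-file limit from Go can be sketched as below. This is a minimal Unix-only example; `openFileLimit` is a name made up for this illustration, while `syscall.Getrlimit` and `syscall.RLIMIT_NOFILE` are the real standard-library pieces.

```go
package main

import (
	"fmt"
	"syscall"
)

// openFileLimit returns the soft and hard limits on open file descriptors
// for the current process (Unix-like systems only).
func openFileLimit() (soft, hard uint64, err error) {
	var rl syscall.Rlimit
	if err = syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return 0, 0, err
	}
	// The field types differ between platforms, hence the conversions.
	return uint64(rl.Cur), uint64(rl.Max), nil
}

func main() {
	soft, hard, err := openFileLimit()
	if err != nil {
		fmt.Println("getrlimit failed:", err)
		return
	}
	fmt.Printf("open files: soft limit %d, hard limit %d\n", soft, hard)
}
```

A program could compare the soft limit with its planned number of workers before starting them, although as described above that number says little about what http.Client does underneath.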
So, having raised my ulimits far beyond what I expected I actually needed (this was to be the only program running on my test EC2 instance anyway), I re-ran the program and faced another error:
Error: dial tcp 207.171.163.142:80: cannot assign requested address
What the… I thought I had dealt with user limits? I didn’t initially find the direct cause of this, thinking I hadn’t properly dealt with the user limits issue. I found a few group discussion threads dealing with http.Client connection reuse, socket lifetimes and related topics, and I first tried a few different versions of Go, suspecting it was a bug fixed in the source tip (more or less analogous to HEAD on origin/master in Git, if you mainly use that VCS). Unfortunately this yielded no fix and no additional insights.
I had been monitoring open file handles of the process during runtime and noticed it had never gone over about 150 concurrent connections. Using netstat, on the other hand, showed that there were a significant number of connections in the TIME_WAIT state. This socket state is used by the kernel to leave a trace of the connection around in case there are duplicate packets on the network waiting to arrive (among other things). In this state the socket is actually detached from the process that created it, but waiting for kernel cleanup – therefore it actually doesn’t count as an open file handle anymore, but that doesn’t mean it can’t cause problems!
In this case I was connecting to Amazon S3 from a single IP address – the only one configured on the EC2 instance. S3 itself has a number of IP addresses on both East and West coasts, rotated automatically through DNS-based load-balancing mechanisms. However, at any given moment you will resolve a single IP address and probably use that for a small period of time before querying DNS again and perhaps getting another IP. So we can basically say we have one IP contacting another IP – and this is where the problem lies.
When an IPv4 network socket is created, there are five basic elements the kernel uses to make it unique among all others on the system:
protocol; local IPv4 address : local IPv4 port <-> remote IPv4 address : remote IPv4 port
Given roughly 2^27 possibilities for local IP (class A,B,C), the same for remote IP and 2^16 for each of the local and remote ports (assuming we can use any privileged ports < 1024 if we use the root account), that gives us about 2^86 different combinations of numbers and thus number of theoretical IPv4 TCP sockets a single system could keep track of. That’s a whole lot! Now consider that we have a single local IP on the instance, we have (for some small amount of time) a single remote IP for Amazon S3, and we are reaching it only over port 80 – now three of our variables are reduced to a single possibility and we only have the local port range to make use of.
Worse still, the default setting (for my machine at least) of the local port range available to non-root users was only 32768-61000, which reduced my available local ports to less than half of the total range. After watching the output of netstat and grepping for TIME_WAIT sockets, it was evident that I was using up these 30000-odd local ports within a matter of seconds. When there are no remaining local port numbers to be used, the kernel simply fails to create a network socket for the program and returns an error as in the above message – cannot assign requested address.
Armed with this knowledge, there are a couple of kernel tunings you can make. The tcp_tw_reuse and tcp_tw_recycle settings both affect when the kernel will reclaim sockets in the TIME_WAIT state, but practically this didn’t seem to have much effect. Another setting, tcp_max_tw_buckets, sets a limit on the total number of TIME_WAIT sockets and actively kills them off once the count exceeds this limit. All three of these parameters look and sound slightly dangerous, and since they had little effect anyway I was loath to use them and call the problem solved. After all, if the program was killing the connections and leaving them for the kernel to clean up, it didn’t sound like http.Client was doing a very good job of reusing connections automatically.
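The arithmetic is worth spelling out. The sketch below assumes a TIME_WAIT duration of 60 seconds, which is the usual Linux value but was not something I measured here; the function names are made up for the example.

```go
package main

import "fmt"

// portCount returns how many local ports an inclusive range contains.
func portCount(low, high int) int {
	return high - low + 1
}

// sustainableRate estimates how many new connections per second one
// local IP can open to a single remote IP and port if every closed
// connection holds its local port for timeWaitSeconds.
func sustainableRate(ports int, timeWaitSeconds float64) float64 {
	return float64(ports) / timeWaitSeconds
}

func main() {
	ports := portCount(32768, 61000)
	// 60 seconds is an assumed TIME_WAIT duration (the usual Linux value).
	rate := sustainableRate(ports, 60)
	fmt.Printf("%d local ports, roughly %.0f new connections per second\n", ports, rate)
}
```

Under those assumptions the ceiling is a few hundred new connections per second, which a burst of 1000 short-lived connections blows through almost immediately.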
Incidentally, Go does support automatic reuse of connections in TIME_WAIT with the SO_REUSEADDR socket option, but this only applies to listening sockets (i.e. servers).
Unfortunately that brought me about to the end of my inspiration, but a co-worker pointed me in the direction of the http.Transport’s MaxIdleConnsPerHost parameter, which I was only vaguely aware of due to having skimmed the source of that package in the last couple of days, desperately searching for clues. The default value used here is two (2) which seems reasonable for most applications, but evidently is terrible when your application has large bursts of requests rather than a constant flow. I believe that internally, the transport creates as many connections as required, the requests are processed and closed and then all of those connections (but two) are terminated again, left in TIME_WAIT state for the kernel to deal with. Just a few cycles of this need to repeat before you have built up tens of thousands of sockets in this state.
Altering the value of MaxIdleConnsPerHost to around 250 immediately removed the problem, and I didn’t see any sockets in TIME_WAIT state while I was monitoring the program. Shortly thereafter the program stopped functioning, I believe because my instance was blacklisted by AWS for sending too many requests to S3 in a short period of time – scalability achieved!
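Configuring this is only a few lines. The sketch below is a minimal example rather than my exact code: `newTransport` and `drain` are hypothetical helper names, while `MaxIdleConnsPerHost` and `http.DefaultMaxIdleConnsPerHost` are real parts of net/http.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// newTransport returns a transport that keeps up to maxIdlePerHost idle
// connections per host instead of the default of two.
func newTransport(maxIdlePerHost int) *http.Transport {
	return &http.Transport{MaxIdleConnsPerHost: maxIdlePerHost}
}

// drain reads the rest of a response body and closes it, which is what
// allows the underlying connection to go back into the idle pool.
func drain(resp *http.Response) {
	io.Copy(io.Discard, resp.Body)
	resp.Body.Close()
}

func main() {
	transport := newTransport(250)
	// One client, shared by every worker goroutine.
	client := &http.Client{Transport: transport}
	_ = client

	fmt.Println("default idle connections per host:", http.DefaultMaxIdleConnsPerHost)
	fmt.Println("configured idle connections per host:", transport.MaxIdleConnsPerHost)
}
```

The limit should be at least as large as the number of workers that can hit the same host at once, otherwise the surplus connections are still closed after each burst.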
If there are any lessons in this, I guess it is that you still often need to be aware of what is happening at the lowest levels of the system even if your programming language or application has abstracted enough of the details away for you not to have to worry about them. Even knowing that there was an idle connection limit of two would not have given away the whole picture of the forces at play here. Go is still my favourite language at the moment and I was glad that the fix was relatively simple, and I still have a very understandable codebase with excellent performance characteristics. However, whenever the network and remote services with variable performance characteristics are involved, any problem can take on large complexity.
Asynchronous MySQL queries with non-blocking readiness checks
Well, despite my best intentions, here I am again writing Ruby. I decided to automate a small part of some data analysis I’ve had to do a few times, starting with the database queries themselves. Unfortunately the data is spread over several hosts and databases and the first implementation simply queried them serially. The next iteration used the mysql2 gem’s asynchronous query functionality but still naively blocked on the results retrieval rather than polling the IOs to see when they could be read from.
It doesn’t actually add anything to my script to do this, but it seemed like a small learning opportunity and somewhat interesting so here is the guts of that code:
The code is pretty simple and the comments should reveal the intent of any confusing lines. The only part that was slightly irritating was receiving file descriptor numbers from Mysql2::Client#socket rather than the IO itself, hence having to re-open the same file descriptor.
In this case I haven’t done anything fancy after checking when the results are ready, but you can see how this could be trivially turned into a system for querying multiple backends for the same data and returning the fastest result which is a quite popular pattern at the moment.
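For comparison, the fastest-result pattern is easy to sketch in Go (used here instead of Ruby purely for illustration). The backends are simulated with sleeps, and `queryFunc`, `fastest` and `fakeBackend` are all hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// queryFunc is a hypothetical stand-in for running the same query
// against one backend and returning its result.
type queryFunc func() string

// fastest starts every backend query concurrently and returns whichever
// result arrives first.
func fastest(backends []queryFunc) string {
	// Buffered so the slower goroutines can still send and exit.
	results := make(chan string, len(backends))
	for _, q := range backends {
		go func(q queryFunc) { results <- q() }(q)
	}
	return <-results
}

// fakeBackend simulates a backend that answers after delayMillis.
func fakeBackend(name string, delayMillis int) queryFunc {
	return func() string {
		time.Sleep(time.Duration(delayMillis) * time.Millisecond)
		return name
	}
}

func main() {
	winner := fastest([]queryFunc{
		fakeBackend("db1", 200),
		fakeBackend("db2", 10),
		fakeBackend("db3", 100),
	})
	fmt.Println("first answer came from", winner)
}
```

The slower queries still run to completion in the background here; a fuller version would cancel them once a winner is in.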
Service SDKs and Language Support Part 2
As I wrote previously, I found a mismatch between the goals of large cloud services like Amazon Web Services and the languages they support, which conflicts somewhat with the notion of making highly concurrent and parallelised workflows.
Of course the obvious followup to that post (even embarrassingly obvious since I’ve been mentioning Go so much recently) is to point out that Google’s App Engine is doing this right by supporting Go as a first-class language, even getting an SDK provided for several platforms.
I haven’t had a chance to use App Engine so far, but I’d like to in future. Unfortunately, Google’s suite of services is not nearly as rich as that provided in AWS right now but I’m sure they are working hard on achieving feature parity in order to pull more customers over from AWS.
Seven Languages – Io
I’ve mentioned a couple of times that I started reading through Seven Languages in Seven Weeks, and even though I’ve recently been heavily sidetracked by Learning Go I just finished chapter two which dealt with Io.
The book gushes over the language, and I’ve read a lot of other people’s blogs where they seem quite excited about it. In the end I couldn’t finish the last exercise, even having a working example to go off. It was just a bit too painful trying to find the right object context, navigate around the strange method syntax and other oddities in the language. Of course, it would be wrong to blame the language, so I’m just going to leave it saying that Io didn’t resonate with me.
Interestingly, even though I haven’t used a great deal of Javascript I wasn’t too bothered by the prototyping paradigm of the language. The main confusion (aside from general syntax) was the control you had over in whose context the method arguments would be evaluated – the sender’s or the target’s. It took a bit of playing around to figure out which was the correct alternative in all instances.
For now, I’m conquered. Maybe I’ll come back to that exercise and solve it, but probably not. On to Prolog, which I did get into briefly around 2001 (with fairly awful results). Hopefully the experience will be better this time.
Another Personal Evolution – From Ruby to Go
Almost two years ago now, I wrote a post about how I was fed up with resorting to shell scripting as my knee-jerk reaction to computer problems. At the time, I had been attacking any problem that required more than a couple of commands at the prompt by writing a shell (usually BASH) script and hit major limitations that I really should have been solving with a legitimate programming language. I resolved to only resort to Ruby or Python and in that goal I’ve actually been very successful (although I’ve ended up using Ruby around 90% of the time and Python only 10% of the time, which I wish was a little more evenly distributed).
Now I feel as if there is another evolution happening which I need to apply myself to. As a side-effect of the kind of work I’ve been doing, Ruby is just not cutting it. I love the flexibility of it (even despite the numerous ways you can shoot yourself in the foot), and there are some really great libraries like the AWS Ruby SDK which I’ve been using a lot lately. However, when you start wanting to do highly parallelised or concurrent tasks (and this is an excellent talk on the subject), it all starts getting a bit painful. I dabbled in event-based programming last year with NodeJS but found the spaghetti callbacks a bit mind-bending. Similarly with Ruby and EventMachine the code can be less than perfectly understandable. Goliath makes the task somewhat easier (if you are writing a web-service), and em-synchrony follows a similar pattern with Ruby Fibers but they all fall down if you need to use any libraries which don’t make use of asynchronous IO. I briefly looked at Python’s Twisted framework but didn’t find it much better (although that may be an unfair statement, as I didn’t spend much time on it).
I tried a different approach recently and attempted to use the quite awesome JRuby and solve the problem with native threads and the power of the JVM, but hit similar problems with libraries just not working in JRuby. This seems to be a common problem still, unfortunately. The overall result is having no clear option from a Ruby point of view when attempting to make a high-performance application that is also readable and understandable. It’s a bit of a blanket statement, granted, and if I had more constraints on my development environment I might have persisted with one of the options above (there are certainly workarounds to most of the problems I’ve experienced).
Fortunately for me, I have a flexible working environment, buy-in with alternative languages is pretty good and I’m willing to learn something new. Go is a relatively new language, having only been around (publicly) for just over three years, but quite nicely fits my current needs. I won’t go into it technically, as it is all over the interwebs, but I find it relatively easy to read (even for a newbie), and similarly easy to write.
However, I find myself in the same situation I was almost two years ago: it will take some effort to stop the now familiar knee-jerk reaction – this time towards Ruby – and establish the new habit in using Go wherever possible. I’ve just finished up a recent small spare-time project which utilised Ruby so I have free rein to indulge in Go at every possible opportunity. It is scary, but also very exciting – just as it was declaring my intention to use only Ruby almost two years ago.
That’s not to say I’m going to use Go exclusively – I still have to finish up reading (and working) through Seven Languages in Seven Weeks. My intention is not to become a polyglot (I think that’s a bit beyond my capabilities), but I’d at least like to be reasonably proficient in at least one language that solves a given set of problems well. I found that niche with Ruby, and now I am hoping to find that niche with Go. If you haven’t tried it, I thoroughly recommend it.
Personal off-site backups
Unlike many, I’m actually a good boy and do backups of my personal data (for which I can mostly thank my obsessive-compulsive side). However, up until now I’ve been remiss in my duties to also take these backups off-site in case of fire, theft, acts of god or gods etc. Without a tape system or rotation of hard drives (not to mention an actual “off-site” site to store them), this ends up being a little tricky to pull off.
Some of my coworkers and colleagues make use of various online backup services, a lot of which are full-service offerings with a custom client or fixed workflow for performing the backups. At least one person I know backs up (or used to) to Amazon S3 directly; but even in the cheapest of their regions, the cost is significant for what could remain an effectively cold backup. It may be somewhat easier to swallow now that they have recently reduced their pricing across the board.
Glacier is a really interesting offering from Amazon that I’ve been playing with a bit recently, and while its price point is squarely aimed at businesses who want to back up really large amounts of data, it also makes a lot of sense for personal backups. Initially the interface was somewhat similar to what you would expect from a tape system – collect your files together as a vaguely linear archive and upload it with some checksum information. I was considering writing a small backup tool that would make backing up to Glacier reasonably simple but didn’t quite get around to it in time.
Fortunately for me, waiting paid off as they recently added support for transitioning S3 objects to Glacier automatically. This means you get to use the regular S3 interface for uploading and downloading individual objects/files, but allow the automatic archival mechanism to move them into Glacier for long-term storage. This actually makes the task of performing cost-effective remote backups ridiculously trivial but I still wrote a small tool to automate it a little bit.
Hence, glacier_backup. It just uses a bit of Ruby, the Amazon Ruby SDK (which is a very nice library, incidentally), ActiveRecord and progressbar. Basically, it just traverses directories you configure it with and uploads any readable file there to S3, after setting up a bucket of your choosing and setting a policy to transition all objects to Glacier immediately. Some metadata is stored locally using ActiveRecord, not because it is necessary (you can store a wealth of metadata on S3 objects themselves), but each S3 request costs something, so it’s helpful to avoid making requests if it is not necessary.
It’s not an amazing bit of code but it gets the job done, and it is somewhat satisfying to see the progress bar flying past as it archives my personal files up to the cloud. Give it a try, if you have a need for remote backups. Pull requests or features/issues are of course welcome, and I hope you find it useful!
On Service SDKs and Language Support
As I’ve previously mentioned, I’ve been doing a lot of work recently with various aspects of AWS on a daily basis (or close to it). My primary language these days is still Ruby, but I’ve been labouring through the excellent Seven Languages in Seven Weeks book in the hope I can broaden my horizons somewhat. I’m fairly comfortable with Python, somewhat familiar with JavaScript now after playing with Node.js, and I still have a cursory ability in C/C++ and Java, but it has been over 10 years since I’ve done anything significant in any of those languages.
Suffice to say, I’m far from being a polyglot, but I know my current limitations. Go has been increasingly noticeable on my radar and I am starting to familiarise myself with it, but this has led me to a small realisation. When service providers (like Amazon in this case) provide SDK support, they will typically cater to their largest consumer base. Internally they largely use Java, and that shows in their first-class support for that language and toolchain.
Using the example of Elastic Beanstalk and the language support it provides, you can quite easily determine their current (or recent) priorities. Java came first, with .NET and PHP following. Python came about half-way through this year and Ruby was only recently added. Their general-purpose SDKs are somewhat more limiting, only supporting Java, .NET, PHP and Ruby (outside of mobile platform support). These are reasonable, if middle-of-the-road options.
Today I was attempting to run some code against the Ruby SDK, using JRuby. The amount of work it has to do is significant and parallelisable, and doesn’t exactly fit Ruby’s poor native support (at least in MRI) for true concurrency. I’m not going to gain anything by rewriting in PHP, cannot consider .NET, and Java is just not going to be a good use of my time. I feel like there is an impedance mismatch between this set of languages and the scale of what AWS supports.
You are supposed to be scaling up to large amounts of computing and storage to best take advantage of what AWS offers. Similarly, you best make use of the platform by highly parallelising your workload. The only vaguely relevant language from this point of view is Java, but it’s just not a desirable general-purpose language for many of us, especially if we want to enjoy low-friction development as so many newer languages provide.
To be more specific – languages like Go, Erlang (or perhaps more relevant, Elixir) and Scala offer fantastic concurrency and more attractive development experiences, but these are not going to be supported by the official SDKs. It makes perfect sense from the point of view of the size of the developer base, but from the point of view of picking the right tool for the job it doesn’t. Perhaps in a few years this paradigm of highly parallel computing will have gained enough momentum that these languages move to the mainstream (ok, Heroku supports Scala already) and we start to see more standard SDK support for them.
Amazon S3 object deletions and Multi-Factor Authentication
I’ve been using S3 a lot in the last couple of months, and with the Amazon SDK for Ruby it really is dead simple to work with (as are all of the other AWS services the SDK currently supports). So simple in fact, that you could quite easily delete all of your objects with very little work indeed. I did some benchmarks and found that (with batch operations) it took around 3 minutes to delete ~75,000 files totalling about a terabyte. Single threaded.
Parallelize that workload and you could drop everything in your S3 buckets within a matter of minutes for just about any number of objects. Needless to say, if a hacker gets your credentials an extraordinary amount of damage can be done very easily and in a very short amount of time. Given there is often a several hour lag in accesses being logged, you’ll probably not find out about such accesses until long after the fact. Another potential cause of deletions is of course human error (and this is generally way more probable). In both cases there is something you can do about it.
S3 buckets have supported versioning for well over two years now, and if you use SVN, Git, or some other version control system then you’ll already understand how it works. The access methods of plain objects and their versions do differ slightly but the principal ideas are the same (object access methods generally operate on only the latest, non-deleted version). With versioning you can already protect yourself against accidental deletion, since you can revert to the last non-deleted version at any time.
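To make the delete-marker behaviour concrete, here is a toy in-memory model of a single versioned key. It is only an illustration of the semantics described above, not the SDK's API: the `VersionedObject` class and its method names are invented for this sketch.

```ruby
# Toy model of one versioned S3 key. Names are illustrative only.
class VersionedObject
  Version = Struct.new(:id, :data, :delete_marker)

  def initialize
    @versions = []
    @next_id = 0
  end

  # Every write stacks a new version on top of the old ones.
  def put(data)
    push(data, false)
  end

  # A plain delete removes nothing; it only stacks a delete marker.
  def delete
    push(nil, true)
  end

  # Plain reads see only the latest version, and nothing if it is a marker.
  def get
    latest = @versions.last
    latest.nil? || latest.delete_marker ? nil : latest.data
  end

  # Permanently remove one specific version or marker.
  def delete_version(id)
    @versions.reject! { |v| v.id == id }
  end

  def versions
    @versions.dup
  end

  private

  def push(data, marker)
    @next_id += 1
    @versions << Version.new(@next_id, data, marker)
    @next_id
  end
end

obj = VersionedObject.new
obj.put('v1')
marker_id = obj.delete          # the object now looks deleted
obj.get                         # => nil
obj.delete_version(marker_id)   # removing the marker "undeletes" it
obj.get                         # => "v1"
```

In this model, recovering from an accidental delete is just a matter of removing the marker, while `delete_version` applied to every entry erases the key entirely.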
However, there is nothing preventing you from deleting all versions of a file, and with it all traces that that file ever existed. This is an explicit departure from the analogy with source versioning systems, as any object with versions still present will continue to cost you real money (even if the latest version is a delete marker). To guard against this, you can add Multi-Factor Authentication to your API access to S3 and secure these version deletion operations.
This has existed in the web API for some time but I recently had a commit merged into the official SDK that allows you to enable MFA Delete on a bucket, and there is another one in flight which will allow you to actually use the multi-factor tokens in individual delete requests. The usage is slightly interesting so I thought I’d demonstrate how it is done in Ruby, and some thoughts on its potential use cases. If you want to use it now, you’ll have to pull down my branch (until the pull request is merged).
Enabling MFA
I won’t go into details about acquiring the actual MFA device as it is covered in sufficient detail in the official documentation but suffice it to say that you can buy an actual hardware TOTP token, or use Amazon’s or Google’s “virtual” MFA applications for iPhone or Android. Setting them up and associating them with an account is also fairly straightforward (as long as you are using the AWS console; the command line IAM tools are another matter altogether).
Setting up MFA Delete on your bucket is actually quite trivial:
```ruby
require 'rubygems'
require 'aws-sdk'
s3 = AWS::S3.new(:access_key_id => 'XXXX', :secret_access_key => 'XXXX')
bucket = s3.buckets['my-test-bucket']
# :mfa is the MFA device serial number, a space, then the current token code
bucket.enable_versioning(:mfa_delete => 'Enable', :mfa => 'arn:aws:iam::123456789012:mfa/root-account-mfa-device 123456')
```
Behind the scenes, this doesn’t do much different to enabling versioning without MFA. It adds a new element to the XML request which requests that MFA Delete be enabled, and adds a header containing the MFA device serial number and current token number. Importantly (and this may trip you up if you have started using IAM access controls), only the owner of a bucket can enable/disable MFA Delete. In the case of a “standard” account and delegated IAM accounts under it, this will be the “standard” account (even if one of the sub-accounts was used to create the bucket).
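For the curious, the two pieces of the request mentioned above look roughly like this. This is a sketch based on my reading of the S3 REST API for PUT Bucket versioning (the `MfaDelete` element and the `x-amz-mfa` header); the helper names `versioning_body` and `mfa_header` are made up for the example and are not part of the Ruby SDK.

```ruby
# Build the XML body of a PUT Bucket versioning request.
# The MfaDelete element is only present when MFA Delete is requested.
def versioning_body(mfa_delete)
  mfa_element = mfa_delete ? "  <MfaDelete>Enabled</MfaDelete>\n" : ''
  "<VersioningConfiguration xmlns=\"http://s3.amazonaws.com/doc/2006-03-01/\">\n" \
  "  <Status>Enabled</Status>\n" \
  "#{mfa_element}" \
  "</VersioningConfiguration>"
end

# Build the extra header: device serial number, a space, then the token code.
def mfa_header(serial, token)
  { 'x-amz-mfa' => "#{serial} #{token}" }
end

puts versioning_body(true)
puts mfa_header('arn:aws:iam::123456789012:mfa/root-account-mfa-device', '123456')
```

The only difference from a plain "enable versioning" request is the one extra XML element and the one extra header, which is why the SDK change itself is small.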
Version Deletion with MFA
Now, it is still possible to delete objects (which just adds a delete marker) but not versions. Version deletion looks much the same but requires the serial number and token to be passed in if MFA Delete is enabled:
```ruby
require 'rubygems'
require 'aws-sdk'
s3 = AWS::S3.new(:access_key_id => 'XXXX', :secret_access_key => 'XXXX')
bucket = s3.buckets['my-test-bucket']
# Each call needs a fresh token code; a code cannot be reused
bucket.versions['itHPX6m8na_sog0cAtkgP3QITEE8v5ij'].delete(:mfa => 'arn:aws:iam::123456789012:mfa/root-account-mfa-device 123456')
```
As mentioned above there are some limitations to this (as you’ve probably guessed):
- Being a TOTP system, tokens can be used only once. That means you can delete a single version with a single token, no more. Given that on Google Authenticator and Gemalto physical TOTP devices a token is generated once every 30 seconds it may take up to a minute to completely eradicate all traces of an object that was deleted previously (original version + delete marker).
- Following on from this, it is almost impossible to consider doing large numbers of deletions. There is a batch object deletion method inside of AWS::S3::ObjectCollection but this is not integrated with any of the MFA Delete mechanisms. Even then, you can only perform batches of 1000 deletions at a time.
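To put some numbers on those two limitations, here is a back-of-the-envelope sketch using only the figures quoted in this post (one token every 30 seconds, 1000 keys per batch request, and the ~75,000 objects from the earlier benchmark). The method names are made up for the example.

```ruby
TOKEN_INTERVAL = 30    # seconds between fresh TOTP codes
BATCH_LIMIT    = 1000  # maximum keys per batch delete request

# Seconds needed if every version deletion has to wait for its own token.
def mfa_delete_seconds(version_count)
  version_count * TOKEN_INTERVAL
end

# Number of batch requests needed for a plain (non-MFA) batch delete.
def batch_requests(object_count)
  (object_count.to_f / BATCH_LIMIT).ceil
end

versions = 75_000
puts "MFA, one version per token: #{(mfa_delete_seconds(versions) / 86_400.0).round(1)} days"
puts "Batch requests without MFA: #{batch_requests(versions)}"
```

At one version per token, 75,000 versions would take about 26 days of continuous typing, compared with 75 batch requests and roughly 3 minutes without MFA.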
As it stands, I’m not sure how practical it is. MFA is an inherently human-oriented process, as it involves something you have rather than something you are or something you know (both of which are reasonably easily transcribed once into a computer). Given the access medium is an API designed for rapid, lightweight use, there seems to be an impedance mismatch. Still, with some implementation work to get the batch deletions working, it would probably serve a lot of use cases.
Are you using MFA Delete (through any of the native APIs or other language SDKs, or even 3rd-party apps)? I would love to hear about other people’s experiences with it – leave your comments below.