Archive for February, 2014

Elastic MapReduce and data in S3

by Oliver on Friday, February 28th, 2014.

I don’t have to do much data analysis fortunately, but when I do there are two options: either the data is local to our own datacenter and I can use our own Hadoop cluster, or it is external and I can use Elastic MapReduce. Generally you don’t run an Elastic MapReduce cluster all the time, so when you create your cluster you still need to get that data into the system somehow. Usually the easiest way is to use one of your existing running instances outside of the MapReduce system to transfer it from wherever it may be to S3. If you are lucky, the data is already in S3.

Even better, Elastic MapReduce has the ability to run jobs against datasets located in S3 (rather than on HDFS as is usually the case). I believe this used to be a customisation AWS had applied to Hadoop, but it has been in mainline for some time now. It is really quite simple – instead of supplying an absolute or relative path to your HDFS datastore, you can provide an S3-style URI to the data such as: s3://my-bucket-name/mydata/

The “magic” here is not that it now runs the job against S3 directly, but that it creates a job before your main workflow to copy the data over from S3 to HDFS. Unfortunately, it’s a bit slow. Previously it has also had showstopper bugs which prevented it working for me at all, but in a lot of cases I just didn’t care enough and used it anyway. Today’s job had significantly more data, and so I decided to copy the data over by hand. I knew it would be faster, but I did not expect as much of a difference as this:

[Image: Screen Shot 2014-02-28 at 5.56.15 PM]

The first part of the graph is the built-in copy operation as part of the job I had started, and where it steepens significantly is where I stopped the original job and started the S3DistCp command. Its usage is relatively simple:

hadoop fs -mkdir hdfs:///data/
hadoop jar lib/emr-s3distcp-1.0.jar --src s3n://my-bucket-name/path/to/logs/ --dest hdfs:///data/

The s3distcp jar file is already loaded on the master node when it is bootstrapped, so you can do this interactively or as part of a step on a cluster you have running automatically. I thoroughly recommend using it, as it will cut down the total time of your job significantly!

Friday, February 28th, 2014 Tech 2 Comments

Setting goals for learning for 2014

by Oliver on Tuesday, February 18th, 2014.

Perhaps it is a little late in the year to be conducting a personal retrospective on the years past, but at this point I’m starting to wonder about the challenges ahead. Over the last two to three years I’ve distinctly changed my career direction from systems engineering, to “DevOps” (whatever that means anymore), to developer. Sure, I’m technically a Computer Scientist by tertiary education standards but I’ve been outside of that field for a large part of my professional career. I’ve now almost completed the book Seven Languages in Seven Weeks – not just reading through it, but dedicating enough time to absorbing the material and implementing all of the problems given. In actual day-to-day programming terms I keep myself busy largely with Golang, occasionally Ruby, and occasionally ActionScript. Perhaps with the exception of ActionScript I find myself solidly in the realm of “backend languages”.

That’s a pretty fair assessment, as I am employed as a backend engineer. From time to time I do need to delve into JavaScript and front-end tasks, and I feel like everything goes to pieces. It conjures up the same feelings I had when working as a systems administrator, seeing an error and diving into the (usually C) codebase, only to stare at the screen utterly confused and not knowing what to do. The spirit was willing but the flesh was weak, and I feel the same way when getting into JavaScript and/or front-end development territory.

Not influencing my thoughts around this, but also not entirely unrelated, is this blog post by Ian Bicking (of SQLObject, Paste, virtualenv and pip fame, as well as many other excellent pieces of software). Ian expresses some interesting points including “the browser seems like the most interesting platform” which does resonate with me – the HTML5 media realm is where a lot of my time is spent but without really understanding what is going on. For that reason I’m dedicating (at least a significant amount of) my mental space in 2014 to JavaScript and front-end development learning.

If you’ve been writing a personal web page (or perhaps on behalf of a friend) and been stuck using tables or frames, or resorted to using Twitter Bootstrap out of frustration and even then not really knowing what you are doing, you’ll understand the desire to know more of how all that web magic works. I’m totally happy writing an API in some of the previously mentioned languages, but when it comes to actually making something that works in the browser that doesn’t look like it’s been transported from 1995, well – there’s something more to be learned.

Tuesday, February 18th, 2014 Tech No Comments

The current state of feature flags

by Oliver on Tuesday, February 11th, 2014.

Feature flags, flippers, toggles, selective rollouts and various other terms have been used to describe systems that allow you to deploy code without necessarily forcing it on all of your users. I won’t describe the concept here as it has already been covered in depth by a variety of companies and people, notably (and not necessarily in date order) Flickr, GitHub, Forrst, Martin Fowler and many others. The concept seems to have sprung into the general consciousness around 2009-2010 and many derivative articles, blog posts and screencasts appeared in 2011-2012. You can easily find many of the original and forked gems and other libraries between Google and GitHub.

The basic feature flag concept is simple – it is either on or off for all of your users. Enabling a feature for groups of users is a small extension, but it does raise architectural concerns when the groups become quite large – you don’t want to be searching linearly through an array of user identifiers, and similarly, arranging the identifiers into a data structure capable of being searched in less than O(N) time complicates what was previously a fairly simple system.
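As a rough illustration of the group lookup described above, here is a minimal sketch that assumes the whole group fits in the memory of each application process. The constant BETA_GROUP, the user IDs and the method name are hypothetical and not taken from any of the libraries mentioned.

```ruby
require "set"

# Hypothetical group of user IDs for which a feature is enabled.
BETA_GROUP = Set.new([42, 1337, 9001])

# Set#include? is a hash lookup, so on average it avoids the linear
# scan that Array#include? would perform over the same identifiers.
def enabled_for_group?(user_id, group = BETA_GROUP)
  group.include?(user_id)
end
```

This keeps the check itself simple, but it does not address where a very large group is stored or how it is kept in sync between processes, which is probably where the real complication lies.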

What is most interesting to me is the percentage selection mechanism for flags. Almost exclusively, the mechanism by which rollout libraries perform this is something like

user_id % 100 < desired_percentage

or some simple variation on this theme. This flipper gem actually takes a CRC of the user identifier in order to not use the raw identifier in the comparison (or perhaps if the identifier is non-numeric):

Zlib.crc32(key) % 100 < percentage

which is an interesting twist on the original idea. That and other libraries (such as etsy/feature) occasionally add the ability to enable a feature not just for a percentage of users browsing the site, but by postings, post creators, etc., or basically any item or entity that can be consumed or act on your site and that has some kind of unique identifier that can be modulo’d and turned into a percentage.
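The CRC variation above can be wrapped into a small helper to make its behaviour easier to see. This is only a sketch: the method name enabled_for_percentage? is hypothetical, and converting the key with to_s is an assumption made so that numeric and string identifiers can both be passed in.

```ruby
require "zlib"

# Returns true when the key hashes into one of the first `percentage`
# buckets out of 100. The same key always lands in the same bucket.
def enabled_for_percentage?(key, percentage)
  Zlib.crc32(key.to_s) % 100 < percentage
end
```

Because the bucket depends only on the key, raising the percentage can only add users: anyone enabled at 10 percent is still enabled at 50 percent.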

However, all these mechanisms suffer from the same limitation – whatever identifier you select to do this comparison will have the simultaneous advantage and disadvantage of behaving in the same way every time you do the comparison. The first 1% of users will always be the same, every time they visit the site (ignoring the additional users from growth through new signups). Sure, you could use one of the mechanisms that selects a random 1%, but for reasons of response and object representation cacheability, having responses change every time something is retrieved can have a performance impact. Even more importantly, if you are trying to perform A/B testing and want to know the impact or user perception of your new feature by collecting metrics around its use, the selection of users for that feature cannot change.

Oddly, I’ve only seen one comment anywhere that asks about this problem – the first 1% of users in that percentage will always be the same. Thus every feature flag you roll out to a percentage of users will hit that 1% first, every time. If you have a lot of features controlled in this way, and your features tend to be very experimental in nature you could end up with a lot of users in this percentage who experience a very odd, buggy, inconsistent view of your website. Alternatively, they may get a completely awesome experience with the most up-to-date and useful features!

Clearly what is lacking here is the ability to segregate experiments and features by different sections of your userbase. A basic less-than operator on the modulo of some number is clearly insufficient. My own thoughts on this haven’t produced anything significantly different from the status quo. At a minimum, we cannot use a less-than comparison but must segregate individual experimental features into non-overlapping ranges of users. This is starting to sound like a system of multiple A/B tests rather than feature flags, and in fact this principle of non-colliding userbases involved in different tests is present in some A/B testing frameworks.
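The idea of non-overlapping ranges could be sketched roughly as follows. This is an illustration under stated assumptions rather than how any existing library works: the EXPERIMENT_BUCKETS table, the feature names and both method names are hypothetical, and CRC32 is reused only because it appears in the flipper example earlier.

```ruby
require "zlib"

# Hypothetical allocation table: each experiment owns a disjoint range
# of the 100 buckets, so no user can be in more than one experiment.
EXPERIMENT_BUCKETS = {
  new_player: 0...5,   # 5% of users
  new_search: 5...10,  # a different 5%
  dark_theme: 10...30  # a different 20%
}.freeze

# Maps a user identifier to a stable bucket between 0 and 99.
def bucket_for(user_id)
  Zlib.crc32(user_id.to_s) % 100
end

# True only when the user's bucket lies inside the feature's own range.
# Unknown features are treated as disabled.
def enabled?(feature, user_id)
  range = EXPERIMENT_BUCKETS[feature]
  !range.nil? && range.cover?(bucket_for(user_id))
end
```

The trade-off is visible in the table itself: the ranges cannot sum to more than 100 buckets, and growing an experiment requires free buckets next to its current range (or a more elaborate allocation), so rolling a feature out to everyone means leaving this scheme behind.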

That being said, Google searches on the topic usually bring up content relating to multivariate testing and multi-armed bandit testing – both extremely useful concepts in their own right but not quite my desired mechanism. What few resources I can find on the subject are fairly lacking in technical implementation details. Perhaps this is just not a requirement many people have. If you know better, please leave a comment below!

Tuesday, February 11th, 2014 Tech 3 Comments

Seven Languages – Clojure

by Oliver on Monday, February 10th, 2014.

I notice my pace has yet again slowed between the last chapter of the book – Erlang – and this one. Another five months have passed since I finished the chapter on Erlang! In actual fact, I haven’t been slaving away on the next language that whole time – decompression of sorts has to follow each chapter, and dealing with a manic three-year-old, finding some time for a bit of exercise and trying to learn a spoken language (German) all take a decent amount away from my free time.

The sixth chapter of Seven Languages in Seven Weeks is Clojure – a challenging language, but after getting through the previous five chapters this one only took me about three weeks of real world time (spent on-and-off) to conquer the last exercise of the chapter.

Since I tend to ramble on about the experiences I had while learning the new language, I’m going to break it down into a series of (hopefully) short points – what I liked about it and what I disliked. Do bear in mind that I’m no expert in Clojure, with only a brief learning period dedicated to it.

What I liked:

  • It seems to have everything. The transactional memory support, power/libraries/community of the JVM, and many programming paradigms baked into the one language. This felt a lot like my experience with Scala, and I’m not sure if it is due to the JVM powering the runtime or the intents of the creators of the language.
  • What I learned previously about how to best utilise recursion from Prolog and Erlang was also quite applicable here (albeit in slightly different form using loop/recur).
  • The Leiningen tool and its REPL make getting into Clojure relatively easy, without having to initially bother with much of the JVM-required compilation/classpathery stuff (which frankly, I still don’t understand).
  • After just a small amount of time, the initial perception of it all being a mountain of parentheses dissipates reasonably quickly (but not entirely). Prefix notation is actually not that bad.

What I disliked:

  • Despite my last point in the section above, parentheses and punctuation remain a big problem to newcomers to the language. If you are not used to Lisp-based languages, there is a big learning curve here. Similar to Scala, I found the large amount of other punctuation (which is used extensively in the language core as well) to be quite hard to understand. Some areas that provide interoperability with Java also have their own unique operators which makes it even harder to wrap your head around.
  • There are often several ways to do things which are not obvious to a newcomer (e.g. creating classes with deftype vs defrecord vs a regular map, or when to use atoms vs refs vs other blocking data structures from the Java library). Some are still listed as experimental alpha features. Fortunately there are plenty of resources out there, via either Google or Stack Overflow.
  • The language is powerful and sophisticated, but I think this requires a corresponding amount of sophistication on the part of the programmer to use it without constructing a monstrosity. Macros take a while to wrap your head around (and I still couldn’t tell you with certainty exactly when things need to be quoted and when not).
  • Without being very familiar with Java (and its libraries) or the JVM, I felt at a disadvantage. I think a lot of parts of Clojure and Scala are framed in terms of how they wrap around the JVM, or solve a Java problem in a better or more understandable way than simply standing on their own. If you want to use the extensive Java interoperability then you have no choice but to learn how that works and its requirements (and with such extensive facilities on the Java side, it frequently makes sense to use the Java interop).
  • To me it just doesn’t feel like a great general-purpose language, but that is probably just because it seems quite academic. I can’t imagine doing very rapid iteration web-app development in it, for example (although I know some people at my work that are doing just that). I guess what it comes down to, is that you would need a lot more experience in this language than you would if you were to pick up Ruby and start developing with Rails for example.

If this all seems like I’m not in favour of the language, that’s not the case at all. Despite its challenges, I see Clojure as a very tempting and powerful language. If I were suddenly in a position where I had to do 100% of my coding in this language, I would see it as a good thing. For the moment though, there are simpler languages that accomplish everything that I need, and I don’t feel the desire to become an expert in every language I have managed to familiarise myself with.

Sidebar: Spoken vs Programming Languages

After doing this much study on a variety of programming languages I don’t use on a day-to-day basis, and having been learning German for a few years now (with varying levels of dedication), I’ve naturally been comparing how learning and knowledge of the two different types of language differ. I’ll preface everything I say below with the fact that I’m not a linguist and haven’t researched this topic academically whatsoever.

Firstly, there exists a certain type of programmer, computer nerd, systems engineer, etc. that will list (somewhat facetiously) their known languages (e.g. on Facebook, LinkedIn etc.) like this – English, German (or some other spoken language), Pig Latin, C, Python etc. etc. Maybe even Klingon. Their argument is that all languages are equivalent and that they know C just as well as they do English. The intent of listing languages in these data fields is usually just for natural spoken languages, but they have mixed the two “types” of language together.

To the majority of us, this argument is plainly false. I recall briefly reading some discussion on this from actual linguists, and at a purely biological level, using spoken languages and using computer languages exercise completely different parts of the brain. There are different amounts of reasoning, analysis and plain communication going on depending on whether you are speaking to another human being or expressing an algorithm to a computer.

The grammar of spoken languages is complex, has many exceptions, idioms, and is constantly evolving, whereas in computer languages it is extremely well defined, seldom changes and must be understood by the computer and programmer in 100% of cases. Spoken languages have tens or hundreds of thousands of words, whereas computer languages often have just dozens or hundreds of identifiers at their core. Fluency is defined in a spoken language as basically needing no assistance to communicate with anyone in that language, whether it be spoken or written; even warping the language outside of its usual boundaries while remaining understood by other fluent speakers. Fluency in a computer language, it could be argued, might still permit a user of the language to consult references from time to time. Computer languages are also almost exclusively written, permitting more leisurely consideration of the correct grammar, syntax and vocabulary with which to express one’s self.

This seems like a fairly compelling argument for the two types of language to be vastly different, but recently I’ve been thinking more and more about another level of similarities beyond those points I’ve raised above. I would argue that true fluency in a computer language would in fact allow you to converse (perhaps not exclusively) with another fluent “speaker” of that language in actual spoken words, without aid of references. Anyone who has taken an interview at Google would know the requirement for whiteboarding a solution to a given problem in the language of your choice. You have no option but to be able to express yourself instantaneously, without references, and without making any mistakes – much like natural spoken languages.

Once you take into account all of the standard libraries, commonly used libraries outside of that, frameworks, extensions, plugins etc of a given computer language, the vocabulary is extended dramatically past the dozens or hundreds of words barrier. You can even draw a parallel between learning a given framework in a computer language, and becoming a specialist in a given occupational field – medicine for example introduces a new range of specialist language, just as the next new web-app framework might in your computer language of choice.

When speaking a computer language, the barrier for understandability is actually in some ways higher than for natural spoken languages with a human partner. A human has the benefit of context, shared common knowledge and culture, observable body language, and can grant understandability concessions when the grammar, vocabulary or syntax is not entirely correct but can be inferred. A computer knows none of these and will not accept anything less than 100% accuracy.

Computers are hard, cold, reasoning machines and computer languages are expressly designed to convey meaning as efficiently as possible and with little room for interpretive error. Spoken languages are the result of centuries or millennia of evolution and culture, not to mention the development and psychology of the human brain itself. In some ways it is amazing that they are able to be compared at all, given their origins are so vastly different.

After dedicating my little free time over the last three weeks to Clojure it is now back to German until I have finished the current teaching book I’m working through. The unifying factor for me personally is that I find learning both spoken and computer languages challenging, mind-bending but exciting. I have no intention of becoming “fluent” in more than a very small number of programming languages (a passing familiarity is probably sufficient) but I would be significantly upset if I never become fluent in German.

On a related note, if you haven’t yet checked out Hello World Quiz, it is frustrating but simultaneously a lot of fun 🙂

Monday, February 10th, 2014 Tech No Comments

Mobile shell and high latency connections

by Oliver on Monday, February 3rd, 2014.

I seem to recall first being introduced to MOSH after I saw a link to it in my RSS feed, and almost simultaneously in my work’s developer mailing list (where people tend to post about new and interesting technologies). People were literally losing their shit over it (well, ok, not literally) and declaring they would never use anything but MOSH ever again. I may be dramatising somewhat.

It’s certainly an interesting piece of software and I could see it might be of utility if you really do work “mobile” a lot of the time or have really shoddy connectivity, but I’m not under any illusion that most of the people I know actually have it that rough. 1st world problems and all of that. In any case, I now actually have a chance to use it since I’m working remotely and with much greater latency to the servers than I’m used to. I realise that up until now I’ve had it pretty good – most of our servers are easily within 100ms of us (continental Europe) and those that aren’t so close are usually on the east coast of the U.S. I have an account on a server back in Sydney, which is almost intolerable to use, but I barely ever use it.

So I installed MOSH and gave it a run. It is undoubtedly good for typing actual letters and deleting them – I will give it that much. There is nothing quite as horrible as making several typos and thundering on with the rest of the line only to realise that you must awkwardly backspace (usually just a few characters at a time to ensure you don’t go too far by accident) or move the cursor back to the scene of the crime. Once you get there, it’s another laborious process of fixing the error and moving back to where you left off. I won’t go into more detail – you understand the problems MOSH is trying to solve.

The underliney thing it does is great, and I truly enjoyed the performance improvement as already stated. Yes, you do have to wait at the end for a moment so that the remote end can confirm it has received your keystrokes, but you generally already know it is fine at that point. The few times my connection or VPN has dropped, MOSH has kept the session around and even printed a little message at the top of the screen to tell me about it. That was pretty awesome.

But the terminal experience could be even tighter. Moving the cursor by itself is for some reason unbuffered (or at least feels that way) – it is just as slow as if I were using normal SSH. I don’t see why this has to be the case, as it is not actually changing anything, and MOSH could presumably perform the same optimisations on the cursor positioning. Delete (rather than backspace) also (at least as far as I could tell) would drop back to regular unbuffered speeds. Given that I often can’t remember more shell shortcuts than Meta-Del, Ctrl-e, Ctrl-a, etc., I’m often using the cursor keys for navigation, so this frequently kills me.

I was also not entirely sure how it would go combined with a screen session, but there didn’t seem to be any problems there. Overall I would definitely recommend it, if you have an unreliable or high-latency connection – it is far better than “raw” SSH. But I really hope the developers have further ideas to tweak and tune the experience – it goes perhaps half of the way to being truly awesome but is not quite there yet.

Monday, February 3rd, 2014 Tech 1 Comment