LZ* compression algorithms

by Oliver on Sunday, May 25th, 2014.

While I was implementing a small, naive log aggregation tool I had a moment to consider the type of compression I wanted to use on the log files at rest. The main implication this has outside of the efficiency of compression and how much space the files will take up is how they can be used once stored. If you are using tools like gzip/gunzip with awk or other simple command-line tools, or even within scripts written in common languages like Python or Ruby, gzip compression poses very little problem – since in most cases you are processing one file at a time.

If you want to use a distributed computation system like Hadoop on the other hand, this can be a problem. Gzip files can’t be split, so you tend to suffer by only having a single mapper being able to work on a file at a time. If your files are small, this may not be a problem but if your files are large it can be. Other tools such as Impala will outright refuse to work on gzipped data, so this may further limit your options. I started looking into alternative compression algorithms that these tools do support and one name that kept coming up was LZO. If you look it up on Wikipedia it doesn’t offer much insight into what it actually is. Since I was implementing my aggregation tool in Golang, I checked the standard library and only found LZW compression.

Are LZO and LZW the same? Are they related? Is one better than the other? I also found very little help in the Google results (but perhaps that is just me). In the end I implemented gzip compression in my program, but little did I know that even gzip is basically in the same family as these aforementioned LZ-prefixed algorithms (via deflate).

I just started watching a new series of videos on the Google Developers YouTube channel called Compressor Headepisode 2 of which covers the LZ Compression Family, how these algorithms work in easy-to-understand terms and which programs we know of today that inherit from the fundamental algorithms LZ77 and LZ78. I highly recommend watching them!

Tags: , , , , ,

Sunday, May 25th, 2014 Tech 4 Comments

Efficient retry/backoff mechanisms

by Oliver on Wednesday, October 16th, 2013.

In day to day life as a systems engineer there aren’t too many opportunities to build solutions based purely on theoretical studies or research papers – generally solutions are built for you and you “just” need to make them work, tie up the spaghetti, shave the yak and so on. Entering the world of programming and software development though, a lot of new problems come up that you can find your own solution to. On one hand this can be extremely liberating – speaking from experience, having for the first time in my career being employed as a software engineer for just over a year now. You can let your creativity go wild, dream up whatever crazy solution and then implement it and see if it works.

On the other hand, there is a vast body of existing work out there to build on, almost exclusively by people smarter than you are and who have put a lot of effort and thought into solving a specific problem. Some good examples in recent times are Paxos and similarly Raft consensus algorithms, F1 Distributed Database, BigTable, Map/Reduce and many many more. A great blog that I try to keep on top of is Werner Vogels’ All Things Distributed, where he maintains a mix of AWS-related announcement-style posts and musings on the various research papers he is reading while on flights between engagements. Needless to say that the Google Research Papers are another valuable source of information, as well as a lot of things coming out of the various SIG* organisations (e.g. SIGOPS, SIGGRAPH).

But back to my point. Having recently coded up a few systems which hurl copious amounts of requests at various AWS services, I very quickly found out that you WILL get errors and need to handle them in a smart way. Not retrying is not an option, as is retrying indefinitely with no limit. Linearly-spaced retries with a limit tend to also suffer from abusing the target system beyond what it can tolerate, before giving up entirely and perhaps prematurely. Once you start to implement a system with exponential backoff similar to how most multi-access medium algorithms work (think Ethernet with CSMA/CD), you are finally starting to get somewhere.

Someone at $EMPLOYER recently linked to this paper which details backoff mechanisms utilising a Fibonacci sequence. It is very interesting work, and taking the time to read and understand such research allows you to build stronger, more resilient software with characteristics far exceeding what you could have come up with on your own. This in itself I find fascinating and addictive. I strongly recommend giving it a read and utilising this approach to building software.

Tags: , , ,

Wednesday, October 16th, 2013 Tech No Comments

On Pair Programming

by Oliver on Thursday, June 13th, 2013.

It has been a while since my last blog post; certainly not because I wanted to write less but I’ve just been trying to learn so damn much recently it hasn’t left much time for anything else. Whether or not I’ve actually learned anything is yet to be seen. This leads me to a subject which I am still struggling with – Pair Programming.

Not that I don’t “get it”, “understand it”, or “enjoy it” – I hope that I have attained all three of these – but due to my situation with software development it can be a bit of a struggle. What is this situation? Well, let me briefly explain.

My background, ultimately, is in Computer Science. I have a Bachelor’s Degree in it, and did study several years of programming in various languages and contexts. However my career up until the last couple of years has been decidedly in the Systems Administration realm. I enjoyed this immensely, and while my programming skills grew rusty I eventually grew to envy some of my colleagues whose abilities far exceeded my own in tailoring their own solutions to problems while I had to reuse whatever OSS fit the bill.

While at Nokia I was fortunate to get a lot of time to myself to implement some next generation configuration management solutions, and found myself hacking away at Ruby for quite some time. Enough, I guess, at least to make it possible to start my current job at SoundCloud as a Software Developer again. The environment is extremely challenging in that we have a lot of very interesting problems to solve, and many very talented developers working around you on these problems.

My team in particular seems to have tasks which span a number of systems, and we are aiming very much to approach our work end-to-end rather than just focussing on the back-end or front-end systems. The current situation is that I am most comfortable in Ruby (but don’t particularly like it anymore), working up my experience in Go, learning Javascript and ActionScript, and reacquainting myself with C and Python (ok, these last two aren’t used a lot). The codebases we work on are frequently not “owned” by us (which fortunately doesn’t really matter), and we often have to switch between several different systems and languages to solve the one end-to-end problem.

That was a rather long introduction. What’s the point? I find when I attack a problem, aside from the initial paralysis of indecision due to not knowing how to get started, I also can spend a good hour or two just acquainting myself with the codebase and finding my way around it. Where the hell do I make the change I want to make? How do I do it? How can I most quickly learn the conventions of this codebase? All of these questions have answers, but if I were to pair on the task from the beginning, it would also involve someone looking over my shoulder (or I over theirs) for a couple of hours while my brain just starts to fire. It doesn’t feel very productive for one, let alone two people.

The end result unfortunately is that after I’ve acquainted myself with the codebase, I probably won’t end up pairing with someone else to complete the work. I’ve figured out the problem, prototyped it until it worked, and then tidied up enough to call the problem “solved”. Maybe I’ve even added tests! Where in this process was I supposed to have started pairing? I genuinely do enjoy it, but when I’m faced with the prospect of wasting another co-worker’s time and exposing my ineptitude it doesn’t feel as enticing. I’m anything but a coding Ninja or Rockstar, and often it feels like that’s all I’m surrounded by.

If you are in a similar position, how do you deal with this? If you have any suggestions or comments on the subject I would love to read it and discuss further.

Tags: , ,

Thursday, June 13th, 2013 Tech 5 Comments