Archive for October, 2013

Efficient retry/backoff mechanisms

by Oliver on Wednesday, October 16th, 2013.

In day to day life as a systems engineer there aren’t too many opportunities to build solutions based purely on theoretical studies or research papers – generally solutions are built for you and you “just” need to make them work, tie up the spaghetti, shave the yak and so on. Entering the world of programming and software development though, a lot of new problems come up that you can find your own solution to. On one hand this can be extremely liberating – speaking from experience, having for the first time in my career being employed as a software engineer for just over a year now. You can let your creativity go wild, dream up whatever crazy solution and then implement it and see if it works.

On the other hand, there is a vast body of existing work out there to build on, almost exclusively by people smarter than you are and who have put a lot of effort and thought into solving a specific problem. Some good examples in recent times are Paxos and similarly Raft consensus algorithms, F1 Distributed Database, BigTable, Map/Reduce and many many more. A great blog that I try to keep on top of is Werner Vogels’ All Things Distributed, where he maintains a mix of AWS-related announcement-style posts and musings on the various research papers he is reading while on flights between engagements. Needless to say that the Google Research Papers are another valuable source of information, as well as a lot of things coming out of the various SIG* organisations (e.g. SIGOPS, SIGGRAPH).

But back to my point. Having recently coded up a few systems which hurl copious amounts of requests at various AWS services, I very quickly found out that you WILL get errors and need to handle them in a smart way. Not retrying is not an option, as is retrying indefinitely with no limit. Linearly-spaced retries with a limit tend to also suffer from abusing the target system beyond what it can tolerate, before giving up entirely and perhaps prematurely. Once you start to implement a system with exponential backoff similar to how most multi-access medium algorithms work (think Ethernet with CSMA/CD), you are finally starting to get somewhere.

Someone at $EMPLOYER recently linked to this paper which details backoff mechanisms utilising a Fibonacci sequence. It is very interesting work, and taking the time to read and understand such research allows you to build stronger, more resilient software with characteristics far exceeding what you could have come up with on your own. This in itself I find fascinating and addictive. I strongly recommend giving it a read and utilising this approach to building software.

Tags: , , ,

Wednesday, October 16th, 2013 Tech No Comments

Fun and games with awk on Mac

by Oliver on Monday, October 7th, 2013.

I’m a long-time fan of simple UNIX tools that do one thing well and nothing else. At previous employers I’ve even taken great delight in asking interview questions in which the best answer would be a simple awk one-liner rather than a complicated multi-line catastrophe involving cat, grep, sed and perhaps even other tools. Poor interview techniques aside, awk is really useful, and if you have occasion to solve problems using an actual awk script (as in a multi-line separate file) then all power to you.

Today I had a small problem I decided would be best solved by an awk one-liner – splitting a file into several components based on input characteristics. In this case, I have a list of essentially random alpha-numeric filenames and want to split them into separate files based on the first character of the name – essentially bucketing. Let’s pretend the list looks like this:


They’re not actually sorted, completely random names (with the same length), but you see the basic structure of the input.

I was going to perform the splitting with this:

awk '{ s = substr($0,1,1); filename = "output/"s"_files.txt"; print $1 >>filename }' input.txt

The various awk manuals say something along the lines of identical string values in different statements denote the same open file which I read as reusing the same file handle, but the reality seems to not reflect this:

awk: output/7_files.txt makes too many open files
 input record number 432, file input.txt
 source line number 1

This error only seems to occur with the version of awk included with MacOSX I’m using. Gawk suffered no similar problem. Nevertheless, you can ask it to close the filehandle after usage, to be safe (although probably at the cost of many more system calls, unless it optimises them away):

awk '{ s = substr($0,1,1); filename = "output/"s"_files.txt"; print $1 >>filename; close(filename) }' input.txt

However, I would end up with all of my input being split into output files in an odd way:

$ ls -alh output/
total 288
drwxr-xr-x  38 oliver  staff   1.3K Oct  7 21:47 .
drwxr-xr-x  14 oliver  staff   476B Oct  7 21:11 ..
-rw-r--r--   1 oliver  staff   988B Oct  7 21:47 0_files.txt
-rw-r--r--   1 oliver  staff   1.3K Oct  7 21:47 1_files.txt
-rw-r--r--   1 oliver  staff   1.6K Oct  7 21:47 2_files.txt
-rw-r--r--   1 oliver  staff   1.4K Oct  7 21:47 3_files.txt
-rw-r--r--   1 oliver  staff   1.0K Oct  7 21:47 4_files.txt
-rw-r--r--   1 oliver  staff   1.2K Oct  7 21:47 5_files.txt
-rw-r--r--   1 oliver  staff   1.2K Oct  7 21:47 6_files.txt
-rw-r--r--   1 oliver  staff   1.0K Oct  7 21:47 7_files.txt
-rw-r--r--   1 oliver  staff   1.4K Oct  7 21:47 8_files.txt
-rw-r--r--   1 oliver  staff   1.0K Oct  7 21:47 9_files.txt
-rw-r--r--   1 oliver  staff   2.1K Oct  7 21:47 A_files.txt
-rw-r--r--   1 oliver  staff   2.3K Oct  7 21:47 B_files.txt
-rw-r--r--   1 oliver  staff   2.7K Oct  7 21:47 C_files.txt
-rw-r--r--   1 oliver  staff   2.5K Oct  7 21:47 D_files.txt
-rw-r--r--   1 oliver  staff   2.3K Oct  7 21:47 E_files.txt
-rw-r--r--   1 oliver  staff   2.4K Oct  7 21:47 F_files.txt
-rw-r--r--   1 oliver  staff   3.0K Oct  7 21:47 G_files.txt
-rw-r--r--   1 oliver  staff   2.2K Oct  7 21:47 H_files.txt
-rw-r--r--   1 oliver  staff   2.6K Oct  7 21:47 I_files.txt
-rw-r--r--   1 oliver  staff   1.9K Oct  7 21:47 J_files.txt
-rw-r--r--   1 oliver  staff   2.0K Oct  7 21:47 K_files.txt
-rw-r--r--   1 oliver  staff   2.4K Oct  7 21:47 M_files.txt
-rw-r--r--   1 oliver  staff   2.4K Oct  7 21:47 N_files.txt
-rw-r--r--   1 oliver  staff   1.7K Oct  7 21:47 P_files.txt
-rw-r--r--   1 oliver  staff   2.4K Oct  7 21:47 Q_files.txt
-rw-r--r--   1 oliver  staff   2.1K Oct  7 21:47 l_files.txt
-rw-r--r--   1 oliver  staff   2.0K Oct  7 21:47 o_files.txt
-rw-r--r--   1 oliver  staff   2.5K Oct  7 21:47 r_files.txt
-rw-r--r--   1 oliver  staff   2.4K Oct  7 21:47 s_files.txt
-rw-r--r--   1 oliver  staff   2.8K Oct  7 21:47 t_files.txt
-rw-r--r--   1 oliver  staff   3.2K Oct  7 21:47 u_files.txt
-rw-r--r--   1 oliver  staff   3.4K Oct  7 21:47 v_files.txt
-rw-r--r--   1 oliver  staff   2.7K Oct  7 21:47 w_files.txt
-rw-r--r--   1 oliver  staff   3.5K Oct  7 21:47 x_files.txt
-rw-r--r--   1 oliver  staff   3.5K Oct  7 21:47 y_files.txt
-rw-r--r--   1 oliver  staff   2.9K Oct  7 21:47 z_files.txt

There only seems to be one of each of the files for each letter, whether upper- or lower-case. Inspection of one of the files shows that it contains both upper- and lower-case filenames – so clearly awk is able to distinguish between the letters.

I even went so far as to run awk in debug mode and traced each command:

gawk> c
Breakpoint 1, main() at `foo.awk':4
4           print $1 >>filename
gawk> p $1
gawk> p filename
filename = "output/Q_files.txt"
gawk> c
Breakpoint 1, main() at `foo.awk':4
4           print $1 >>filename
gawk> p $1
$1 = "qqqqqqqq"
gawk> p filename
filename = "output/q_files.txt"

And then cross-checked with lsof:

lsof -n | grep awk
gawk      34874 oliver  cwd      DIR                1,4        476 55018025 /Users/oliver/
gawk      34874 oliver  txt      REG                1,4     513488 55052828 /usr/local/Cellar/gawk/4.1.0/bin/gawk
gawk      34874 oliver  txt      REG                1,4     600576 34360163 /usr/lib/dyld
gawk      34874 oliver  txt      REG                1,4  299741184 54931101 /private/var/db/dyld/dyld_shared_cache_x86_64
gawk      34874 oliver    0u     CHR               16,1 0t13185493     1177 /dev/ttys001
gawk      34874 oliver    1u     CHR               16,1 0t13185493     1177 /dev/ttys001
gawk      34874 oliver    2u     CHR               16,1 0t13185493     1177 /dev/ttys001
gawk      34874 oliver    3r     REG                1,4     128822 55057090 /Users/oliver/input.txt
gawk      34874 oliver    4r     REG                1,4        111 55057040 /Users/oliver/foo.awk
gawk      34874 oliver    5w     REG                1,4        171 55057144 /Users/oliver/output/Q_files.txt

Aha! The smoking gun! Awk thinks it has q_files.txt open, but the OS thinks it is Q_files.txt. Could this have anything to do with the filesystem being case-insensitive? Of course it does, and now that I know that I can’t believe it was staring me in the face the whole time – but that’s the nature of these problems.

The thing that disturbs me the most about this is not that there are platform differences (the eventual intention was to run this code fragment on a Linux system with a case-sensitive filesystem anyway) but that such a difference is invisible to the program running, and its cause is relatively hard to determine. A quick Google on the topic suggests that running a case-sensitive filesystem on MacOSX can cause problems with several well-used commercial programs, but I wonder if there is any underlying reason why this was even the case in the first place? The justification of it being simpler for the user seems a bit weak…

Tags: ,

Monday, October 7th, 2013 Tech No Comments