adrift on a cosmic ocean

Writings on various topics (mostly technical) from Oliver Hookins and Angela Collins. We have lived in Berlin since 2009, have two kids, and have far too little time to really justify having a blog.

Fun and games with awk on Mac

Posted by Oliver on the 7th of October, 2013 in category Tech
Tagged with: awk, macosx

I'm a long-time fan of simple UNIX tools that do one thing well and nothing else. At previous employers I've even taken great delight in asking interview questions in which the best answer would be a simple awk one-liner rather than a complicated multi-line catastrophe involving cat, grep, sed and perhaps even other tools. Poor interview techniques aside, awk is really useful, and if you have occasion to solve problems using an actual awk script (as in a multi-line separate file) then all power to you.

Today I had a small problem I decided would be best solved by an awk one-liner - splitting a file into several components based on input characteristics. In this case, I have a list of essentially random alpha-numeric filenames and want to split them into separate files based on the first character of the name - essentially bucketing. Let's pretend the list looks like this:

aaaaaaaa
bbbbbbbb
cccccccc
dddddddd
...
zzzzzzzz
AAAAAAAA
BBBBBBBB
CCCCCCCC
DDDDDDDD
...
ZZZZZZZZ
00000000
11111111
...
99999999

They're not actually sorted; the names are completely random (all the same length), but you can see the basic structure of the input.

I was going to perform the splitting with this:

awk '{ s = substr($0,1,1); filename = "output/"s"_files.txt"; print $1 >>filename }' input.txt

The various awk manuals say something along the lines of "identical string values in different statements denote the same open file", which I read as reusing the same file handle, but reality doesn't seem to reflect this:

awk: output/7_files.txt makes too many open files
 input record number 432, file input.txt
 source line number 1
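
For reference, the manual's claim can be checked in isolation. This is only a minimal sketch (the function name same_stream_demo and the file name demo.txt are made up for illustration): two separate print statements redirect to the same string value, and if awk really treats that as one open stream, the second ">" should not truncate what the first one wrote.

```shell
#!/bin/sh
# Sketch: two print statements with an identical filename string share one
# open stream, so the second ">" does not truncate the first line.
same_stream_demo() {
    dir=$(mktemp -d) || return 1
    awk -v f="$dir/demo.txt" 'BEGIN {
        print "first"  > f    # opens (and truncates) the file
        print "second" > f    # same string, so the same open stream is reused
        close(f)
    }'
    cat "$dir/demo.txt"       # both lines should be present
    rm -rf "$dir"
}

same_stream_demo
```

So the handle is reused per filename; the limit is hit because every distinct filename keeps its own handle open until it is closed.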

The "too many open files" error only seems to occur with the version of awk included with the MacOSX I'm using. Gawk suffered no similar problem. Nevertheless, you can ask awk to close the file handle after use, to be safe (although probably at the cost of many more system calls, unless it optimises them away):

awk '{ s = substr($0,1,1); filename = "output/"s"_files.txt"; print $1 >>filename; close(filename) }' input.txt
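
To see the close() variant working end to end, here is a small self-contained sketch. The sample names, the temporary directory and the function name bucket_demo are all invented for the example, and the input deliberately uses only lower-case letters and digits.

```shell
#!/bin/sh
# Sketch: bucket lines by first character, closing each output file after
# every write so that only one is ever open at a time.
bucket_demo() {
    dir=$(mktemp -d) || return 1
    mkdir "$dir/output"
    # Made-up sample input; lower-case letters and digits only.
    printf '%s\n' a1b2c3d4 7f7f7f7f aq9z0x1y 70000000 > "$dir/input.txt"

    awk -v outdir="$dir/output" '{
        s = substr($0, 1, 1)
        filename = outdir "/" s "_files.txt"
        print $1 >> filename   # ">>" appends, so reopening keeps earlier lines
        close(filename)        # release the handle straight away
    }' "$dir/input.txt"

    # Show each bucket file followed by its contents.
    for f in "$dir"/output/*_files.txt; do
        echo "== ${f##*/}"
        cat "$f"
    done
    rm -rf "$dir"
}

bucket_demo
```

Note that ">>" matters here: once a file has been closed, the next write reopens it, and a plain ">" would truncate it each time.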

However, I would end up with all of my input being split into output files in an odd way:

$ ls -alh output/
total 288
drwxr-xr-x  38 oliver  staff   1.3K Oct  7 21:47 .
drwxr-xr-x  14 oliver  staff   476B Oct  7 21:11 ..
-rw-r--r--   1 oliver  staff   988B Oct  7 21:47 0_files.txt
-rw-r--r--   1 oliver  staff   1.3K Oct  7 21:47 1_files.txt
-rw-r--r--   1 oliver  staff   1.6K Oct  7 21:47 2_files.txt
-rw-r--r--   1 oliver  staff   1.4K Oct  7 21:47 3_files.txt
-rw-r--r--   1 oliver  staff   1.0K Oct  7 21:47 4_files.txt
-rw-r--r--   1 oliver  staff   1.2K Oct  7 21:47 5_files.txt
-rw-r--r--   1 oliver  staff   1.2K Oct  7 21:47 6_files.txt
-rw-r--r--   1 oliver  staff   1.0K Oct  7 21:47 7_files.txt
-rw-r--r--   1 oliver  staff   1.4K Oct  7 21:47 8_files.txt
-rw-r--r--   1 oliver  staff   1.0K Oct  7 21:47 9_files.txt
-rw-r--r--   1 oliver  staff   2.1K Oct  7 21:47 A_files.txt
-rw-r--r--   1 oliver  staff   2.3K Oct  7 21:47 B_files.txt
-rw-r--r--   1 oliver  staff   2.7K Oct  7 21:47 C_files.txt
-rw-r--r--   1 oliver  staff   2.5K Oct  7 21:47 D_files.txt
-rw-r--r--   1 oliver  staff   2.3K Oct  7 21:47 E_files.txt
-rw-r--r--   1 oliver  staff   2.4K Oct  7 21:47 F_files.txt
-rw-r--r--   1 oliver  staff   3.0K Oct  7 21:47 G_files.txt
-rw-r--r--   1 oliver  staff   2.2K Oct  7 21:47 H_files.txt
-rw-r--r--   1 oliver  staff   2.6K Oct  7 21:47 I_files.txt
-rw-r--r--   1 oliver  staff   1.9K Oct  7 21:47 J_files.txt
-rw-r--r--   1 oliver  staff   2.0K Oct  7 21:47 K_files.txt
-rw-r--r--   1 oliver  staff   2.4K Oct  7 21:47 M_files.txt
-rw-r--r--   1 oliver  staff   2.4K Oct  7 21:47 N_files.txt
-rw-r--r--   1 oliver  staff   1.7K Oct  7 21:47 P_files.txt
-rw-r--r--   1 oliver  staff   2.4K Oct  7 21:47 Q_files.txt
-rw-r--r--   1 oliver  staff   2.1K Oct  7 21:47 l_files.txt
-rw-r--r--   1 oliver  staff   2.0K Oct  7 21:47 o_files.txt
-rw-r--r--   1 oliver  staff   2.5K Oct  7 21:47 r_files.txt
-rw-r--r--   1 oliver  staff   2.4K Oct  7 21:47 s_files.txt
-rw-r--r--   1 oliver  staff   2.8K Oct  7 21:47 t_files.txt
-rw-r--r--   1 oliver  staff   3.2K Oct  7 21:47 u_files.txt
-rw-r--r--   1 oliver  staff   3.4K Oct  7 21:47 v_files.txt
-rw-r--r--   1 oliver  staff   2.7K Oct  7 21:47 w_files.txt
-rw-r--r--   1 oliver  staff   3.5K Oct  7 21:47 x_files.txt
-rw-r--r--   1 oliver  staff   3.5K Oct  7 21:47 y_files.txt
-rw-r--r--   1 oliver  staff   2.9K Oct  7 21:47 z_files.txt

There seems to be only one file for each letter, regardless of case. Inspection of one of the files shows that it contains both upper- and lower-case filenames - so clearly awk is able to distinguish between the letters.

I even went so far as to run awk in debug mode and traced each command:

gawk> c
Breakpoint 1, main() at `foo.awk':4
4           print $1 >>filename
gawk> p $1
$1 = "QQQQQQQQ"
gawk> p filename
filename = "output/Q_files.txt"
gawk> c
Breakpoint 1, main() at `foo.awk':4
4           print $1 >>filename
gawk> p $1
$1 = "qqqqqqqq"
gawk> p filename
filename = "output/q_files.txt"

And then cross-checked with lsof:

$ lsof -n | grep awk
gawk      34874 oliver  cwd      DIR                1,4        476 55018025 /Users/oliver/
gawk      34874 oliver  txt      REG                1,4     513488 55052828 /usr/local/Cellar/gawk/4.1.0/bin/gawk
gawk      34874 oliver  txt      REG                1,4     600576 34360163 /usr/lib/dyld
gawk      34874 oliver  txt      REG                1,4  299741184 54931101 /private/var/db/dyld/dyld_shared_cache_x86_64
gawk      34874 oliver    0u     CHR               16,1 0t13185493     1177 /dev/ttys001
gawk      34874 oliver    1u     CHR               16,1 0t13185493     1177 /dev/ttys001
gawk      34874 oliver    2u     CHR               16,1 0t13185493     1177 /dev/ttys001
gawk      34874 oliver    3r     REG                1,4     128822 55057090 /Users/oliver/input.txt
gawk      34874 oliver    4r     REG                1,4        111 55057040 /Users/oliver/foo.awk
gawk      34874 oliver    5w     REG                1,4        171 55057144 /Users/oliver/output/Q_files.txt

Aha! The smoking gun! Awk thinks it has q_files.txt open, but the OS thinks it is Q_files.txt. Could this have anything to do with the filesystem being case-insensitive? Of course it could, and now that I know, I can't believe it was staring me in the face the whole time - but that's the nature of these problems.
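
A quick way to confirm the suspicion is to probe the filesystem directly. The following is a rough sketch rather than an authoritative test (the function name is_case_insensitive and the probe file names are my own invention): it creates a file with a mixed-case name in a scratch directory and checks whether the all-lower-case spelling resolves to it.

```shell
#!/bin/sh
# Sketch: succeeds (exit 0) if the given directory appears to sit on a
# case-insensitive filesystem, exit 1 if not, exit 2 if the probe failed.
is_case_insensitive() {
    target=${1:-.}
    probe=$(mktemp -d "$target/casecheck.XXXXXX") || return 2
    : > "$probe/CaseProbe"                 # create the mixed-case file
    if [ -e "$probe/caseprobe" ]; then     # does the lower-case name resolve?
        result=0
    else
        result=1
    fi
    rm -rf "$probe"
    return "$result"
}

if is_case_insensitive "${TMPDIR:-/tmp}"; then
    echo "case-insensitive"
else
    echo "case-sensitive"
fi
```

The answer applies only to the filesystem holding the directory you pass in, so it is worth pointing it at the actual output directory rather than relying on /tmp.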

The thing that disturbs me the most about this is not that there are platform differences (the eventual intention was to run this code fragment on a Linux system with a case-sensitive filesystem anyway) but that such a difference is invisible to the program running, and its cause is relatively hard to determine. A quick Google on the topic suggests that running a case-sensitive filesystem on MacOSX can cause problems with several well-used commercial programs, but I wonder if there is any underlying reason why this was even the case in the first place? The justification of it being simpler for the user seems a bit weak...
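
If the script does have to run on a case-insensitive filesystem, one possible workaround is to encode the case in the filename itself. This is just a sketch under my own assumptions: the upper_/lower_/other_ prefixes and the function name split_case_safe are an illustrative naming scheme, not anything awk or the OS prescribes.

```shell
#!/bin/sh
# Sketch: encode the case of the bucket character in the filename itself, so
# "Q" and "q" cannot collide on a case-insensitive filesystem.
split_case_safe() {
    # $1 = input file, $2 = output directory (created if missing)
    mkdir -p "$2" || return 1
    awk -v outdir="$2" '{
        s = substr($0, 1, 1)
        if (s != tolower(s))      prefix = "upper_"   # changes when lowered
        else if (s != toupper(s)) prefix = "lower_"   # changes when raised
        else                      prefix = "other_"   # digits and the rest
        filename = outdir "/" prefix s "_files.txt"
        print $1 >> filename
        close(filename)
    }' "$1"
}

# Made-up sample input, one name per case plus a digit.
demo=$(mktemp -d)
printf '%s\n' QQQQQQQQ qqqqqqqq 77777777 > "$demo/input.txt"
split_case_safe "$demo/input.txt" "$demo/output"
ls "$demo/output"    # should list three separate files
rm -rf "$demo"
```

Comparing against tolower() and toupper() avoids relying on character-class support in the regex engine, which varies between awk implementations.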

© 2010-2018 Oliver Hookins and Angela Collins