linux

Can’t create new network sockets? Maybe it isn’t user limits…

by Oliver on Thursday, February 28th, 2013.

I’ve been doing a lot more programming in Go recently, mostly because it has awesome concurrency primitives but also because it is generally a pretty amazing language. Unlike other languages which have threads, fibres or event-driven frameworks to achieve good concurrency, Go manages to avoid all of these but still remain readable. You can also reason about its behaviour very effectively due to how easily understandable and straightforward concepts like channels and goroutines are.

But enough about Go (for the moment). Recently I found the need to quickly duplicate the contents of one Amazon S3 bucket to another. This would not be a problem, were it not for the fact that the bucket contained several million objects. Fortunately, there are two factors which make this not so daunting:

  1. S3 scales better than your application ever can, so you can throw as many requests at it as you like.
  2. You can copy objects between buckets very easily with a PUT request combined with a special header indicating the object you want copied (you don’t need to physically GET then PUT the data).

A perfect job for a Go program! The keys of the objects are in a consistent format, so we can split up the keyspace by prefixes and split the work-load amongst several goroutines. For example, if your objects are named 00000000 through to 99999999 using only numerical characters, you could quite easily split this into 10 segments of 10 million keys. Using the bucket GET method you can retrieve up to 1000 keys in a batch using prefixes. Even if you split into 10 million key segments and there aren’t that many actual objects, the only things that matter are that you start and finish in the right places (the beginning and end of the segment) and continue making batch requests until you have all of the keys in that part of the keyspace.

So now we have a mechanism for rapidly retrieving all of the keys. For millions of objects this will still take some time, but you have divided the work amongst several goroutines so it will be that much faster. For comparison, the Amazon Ruby SDK uses the same REST requests under the hood when using the bucket iterator bucket.each { |obj| … } but only serially – there is no division of work.

Now to copy all of our objects we just need to take each key returned by the bucket GET batches, and send off one PUT request for each one. This introduces a much slower process – one GET request results in up to 1000 keys, but then we need to perform 1000 PUTs to copy them. The PUTs also take quite a long time each, as the S3 backend has to physically copy the data between buckets – for large objects this can still take some time.

Let’s use some more concurrency, and have a pool of 100 goroutines waiting to process the batch of 1000 keys just fetched. A recent discussion on the golang-nuts group resulted in some good suggestions from others in the Go community and resulted in this code:


It’s not a lot of code, which makes me think it is reasonably idiomatic and correct Go. Best yet, it has the possibility to scale out to truly tremendous numbers of workers. You may notice that each of the workers also uses the same http.Client and this is intentional – internally the http.Client makes some optimisations around connection reuse so that you aren’t susceptible to the performance penalty of socket creation and TCP handshakes for every request. Generally this works pretty well.

Let’s think about system limits now. Say we want to make our PUT copy operations really fast, and use 100 goroutines per fetcher for these operations. With just 10 fetcher goroutines that means we now have 1000 goroutines vying for attention from the http.Client connection handling. Even if the fetchers are idle, if we have all of the copier workers running at the same time, we might require 1000 concurrent TCP connections. With a default user limit of 1024 open file handles (e.g. on Ubuntu 12.04) this means we are dangerously close to exceeding that limit.

Head http://mybucket.s3.amazonaws.com:80/: lookup mybucket.s3.amazonaws.com: no such host

When you see an error like the above pop up in your program’s output, it almost seems a certainty that you have exceeded these limits… and you’d be right! For now… Initially these were the errors I was getting, and while it was somewhat mysterious that I would see so many of them (literally one for each failed request), apparently some additional sockets are required for name lookups (even if locally cached). I’m still looking for a reference for this, so if you know of it please let me know in the comments.

This resulted in a second snippet of Go code to check my user limits:
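A minimal sketch of such a check, assuming a Unix-like system where syscall.Getrlimit and RLIMIT_NOFILE are available; the function name openFileLimits is illustrative and this is not the original snippet:

```go
package main

import (
	"fmt"
	"syscall"
)

// openFileLimits returns the soft and hard limits on open file
// descriptors (which include sockets) for the current process.
func openFileLimits() (soft, hard uint64, err error) {
	var rl syscall.Rlimit
	if err = syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return 0, 0, err
	}
	return uint64(rl.Cur), uint64(rl.Max), nil
}

func main() {
	soft, hard, err := openFileLimits()
	if err != nil {
		fmt.Println("could not read limits:", err)
		return
	}
	fmt.Printf("open files: soft limit %d, hard limit %d\n", soft, hard)
}
```

The soft limit is the one the process actually runs into; it can be raised up to the hard limit without extra privileges.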

Using syscall.Getrusage in conjunction with syscall.Getrlimit would allow you to fairly dynamically scale your program to use just as much of the system resources as it has access to, but not overstep these boundaries. But remember what I said about using http.Client before? The net/http package documentation says Clients should be reused instead of created as needed and Clients are safe for concurrent use by multiple goroutines and both of these are indeed accurate. The unexpected side-effect of this is that, unfortunately, the usage of TCP connections is now fairly opaque to us. Thus our understanding of current system resource usage is fundamentally detached from how we use http.Client. This will become important in just a moment.

So, having raised my ulimits far beyond what I expected I actually needed (this was to be the only program running on my test EC2 instance anyway), I re-ran the program and faced another error:

Error: dial tcp 207.171.163.142:80: cannot assign requested address

What the… I thought I had dealt with user limits? I didn’t initially find the direct cause of this, thinking I hadn’t properly dealt with the user limits issue. I found a few group discussion threads dealing with http.Client connection reuse, socket lifetimes and related topics, and I first tried a few different versions of Go, suspecting it was a bug fixed in the source tip (more or less analogous to HEAD on origin/master in Git, if you mainly use that VCS). Unfortunately this yielded no fix and no additional insights.

I had been monitoring open file handles of the process during runtime and noticed it had never gone over about 150 concurrent connections. Using netstat, on the other hand, showed that there were a significant number of connections in the TIME_WAIT state. This socket state is used by the kernel to leave a trace of the connection around in case there are duplicate packets on the network waiting to arrive (among other things). In this state the socket is actually detached from the process that created it, but waiting for kernel cleanup – therefore it actually doesn’t count as an open file handle anymore, but that doesn’t mean it can’t cause problems!

In this case I was connecting to Amazon S3 from a single IP address – the only one configured on the EC2 instance. S3 itself has a number of IP addresses on both East and West coasts, rotated automatically through DNS-based load-balancing mechanisms. However, at any given moment you will resolve a single IP address and probably use that for a small period of time before querying DNS again and perhaps getting another IP. So we can basically say we have one IP contacting another IP – and this is where the problem lies.

When an IPv4 network socket is created, there are five basic elements the kernel uses to make it unique among all others on the system:

protocol; local IPv4 address : local IPv4 port <-> remote IPv4 address : remote IPv4 port

Given roughly 2^27 possibilities for local IP (class A,B,C), the same for remote IP and 2^16 for each of the local and remote ports (assuming we can use any privileged ports < 1024 if we use the root account), that gives us about 2^86 different combinations of numbers and thus number of theoretical IPv4 TCP sockets a single system could keep track of. That’s a whole lot! Now consider that we have a single local IP on the instance, we have (for some small amount of time) a single remote IP for Amazon S3, and we are reaching it only over port 80 – now three of our variables are reduced to a single possibility and we only have the local port range to make use of.

Worse still, the default setting (for my machine at least) of the local port range available to non-root users was only 32768-61000, which reduced my available local ports to less than half of the total range. After watching the output of netstat and grepping for TIME_WAIT sockets, it was evident that I was using up these roughly 28,000 local ports within a matter of seconds. When there are no remaining local port numbers to be used, the kernel simply fails to create a network socket for the program and returns an error as in the above message – cannot assign requested address.
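To put rough numbers on this, here is a small back-of-the-envelope sketch. It assumes both ends of the port range are usable and that each closed connection lingers in TIME_WAIT for 60 seconds, which is the usual fixed interval on Linux and not a figure I measured; the function names are illustrative.

```go
package main

import "fmt"

// ephemeralPorts returns how many local ports fall in the inclusive
// range [low, high].
func ephemeralPorts(low, high int) int {
	return high - low + 1
}

// maxSustainedRate estimates how many new connections per second one
// local IP can open to a single remote IP and port before every local
// port is tied up in TIME_WAIT.
func maxSustainedRate(ports, timeWaitSeconds int) float64 {
	return float64(ports) / float64(timeWaitSeconds)
}

func main() {
	ports := ephemeralPorts(32768, 61000)
	// Assumption: 60 second TIME_WAIT interval.
	rate := maxSustainedRate(ports, 60)
	fmt.Printf("%d local ports, about %.0f new connections per second\n", ports, rate)
}
```

Under those assumptions anything beyond a few hundred short-lived connections per second to the same remote address will eventually exhaust the range, which matches what netstat was showing.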

Armed with this knowledge, there are a couple of kernel tunings you can make. tcp_tw_reuse and tcp_tw_recycle both affect when the kernel will reclaim sockets in the TIME_WAIT state, but in practice they didn’t seem to have much effect. Another setting, tcp_max_tw_buckets, sets a limit on the total number of TIME_WAIT sockets and actively kills them off once the count exceeds this limit. All three of these parameters look and sound slightly dangerous, and given that they had little effect anyway, I was loath to use them and call the problem solved. After all, if the program was killing the connections and leaving them for the kernel to clean up, it didn’t sound like http.Client was doing a very good job of reusing connections automatically.

Incidentally, Go does support automatic reuse of connections in TIME_WAIT with the SO_REUSEADDR socket option, but this only applies to listening sockets (i.e. servers).

Unfortunately that brought me to about the end of my inspiration, but a co-worker pointed me in the direction of the http.Transport’s MaxIdleConnsPerHost parameter, which I was only vaguely aware of due to having skimmed the source of that package in the last couple of days, desperately searching for clues. The default value used here is two (2), which seems reasonable for most applications, but evidently is terrible when your application has large bursts of requests rather than a constant flow. I believe that internally, the transport creates as many connections as required, the requests are processed and completed, and then all of those connections (but two) are terminated again, left in TIME_WAIT state for the kernel to deal with. Just a few cycles of this need to repeat before you have built up tens of thousands of sockets in this state.

Altering the value of MaxIdleConnsPerHost to around 250 immediately removed the problem, and I didn’t see any sockets in TIME_WAIT state while I was monitoring the program. Shortly thereafter the program stopped functioning, I believe because my instance was blacklisted by AWS for sending too many requests to S3 in a short period of time – scalability achieved!

If there are any lessons in this, I guess it is that you still often need to be aware of what is happening at the lowest levels of the system even if your programming language or application has abstracted enough of the details away for you not to have to worry about them. Even knowing that there was an idle connection limit of two would not have given away the whole picture of the forces at play here. Go is still my favourite language at the moment and I was glad that the fix was relatively simple, and I still have a very understandable codebase with excellent performance characteristics. However, whenever the network and remote services with variable performance characteristics are involved, any problem can take on large complexity.


Thursday, February 28th, 2013 Tech 8 Comments

Reliably finding processes with ps by name

by Oliver on Sunday, July 17th, 2011.

I imagine that there are two groups of people who might read this post:

  1. When you need to find a process by name, you run ps -ef or similar and pipe into grep processname. You gladly suffer the presence of your own grep process being shown in the output, or maybe even grep -v it out (these are the “heathens”).

  2. When you need to find a process by name, you run ps -C processname (these are the “enlightened ones”).

If you fall into the first category, you fail my interview tests. Perhaps you smugly fall into the second category, but surely you have seen this occur:


ohookins 4410 0.0 0.2 212804 9384 ? S 20:10 0:00 /usr/lib/bamf/bamfdaemon
$ ps u -C /usr/lib/bamf/bamfdaemon
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
$ ps u -C bamfdaemon
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
ohookins 4410 0.0 0.2 212804 9384 ? S 20:10 0:00 /usr/lib/bamf/bamfdaemon

I don’t particularly care about bamfdaemon, but given that the process listing shows the full path to the binary, why can’t we search for it by this process name? Why does the unqualified filename work? OK, perhaps it is just basing the match on the unqualified filename…


ohookins 3710 0.0 0.9 468988 38216 ? Sl 19:49 0:05 /usr/bin/python /usr/bin/terminator
$ ps u -C terminator
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
$ ps u -C python
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
$ ps u -C /usr/bin/python
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
$ ps u -C /usr/bin/terminator
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
ohookins 3710 0.0 0.9 468988 38216 ? Sl 19:49 0:05 /usr/bin/python /usr/bin/terminator

OK, what the heck is going on here exactly? I’m not terribly familiar with the POSIX specification, so let’s take a look at the source code of procps:


break; case SEL_COMM: i=sn->n; while(i--)
if(!strncmp( buf->cmd, (*(sn->u+i)).cmd, 15 )) return 1;

In select.c of ps, we see these two lines in the case statement which selects between different process identification mechanisms. -C actually allows you to select multiple processes by different names, since it iterates through the list of selectors (which I didn’t know before looking at the code – very cool).

A limited string comparison is done between the argument given to -C and the process being examined. You can see that this limit is 15 characters, and in the union inside the selection node only 16 characters are stored anyway. Let’s have a look at what this proc_t buf struct looks like so we can figure out what the comparison is being done on. This sits in proc/readproc.h:


char cmd[16]; // stat,status basename of executable file in call to exec(2)

Now we are getting somewhere. We can easily verify that the character limit in the comparison is being applied:


$ ps u -C upstart-socket-
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1123 0.0 0.0 15004 400 ? S 18:46 0:00 upstart-socket-bridge --daemon
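The same comparison can be mimicked outside of ps. The sketch below is my own illustration, not code from procps: matchesComm is a made-up name, and it simply compares at most 15 bytes of each string in the way strncmp does in the snippet above.

```go
package main

import "fmt"

// commLen is the number of characters ps compares for -C.
const commLen = 15

// truncate cuts s down to at most commLen bytes.
func truncate(s string) string {
	if len(s) > commLen {
		return s[:commLen]
	}
	return s
}

// matchesComm reports whether selector matches comm when only the first
// 15 characters of each are compared.
func matchesComm(comm, selector string) bool {
	return truncate(comm) == truncate(selector)
}

func main() {
	// The stored command name is already cut to 15 characters.
	comm := truncate("upstart-socket-bridge")
	fmt.Println(matchesComm(comm, "upstart-socket-"))       // matches
	fmt.Println(matchesComm(comm, "upstart-socket-bridge")) // also matches
	fmt.Println(matchesComm(comm, "upstart"))               // too short, no match
}
```

A selector shorter than 15 characters has to match the whole stored name, which is why a bare prefix like upstart finds nothing.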

So if you are lazy, you only have to type 15 characters at most of your process name. Let’s look at the more complicated case of when processes are just really hard to find by any name we can see in the process listing – my candidate case for this is Jenkins, which is notoriously hard to track down especially if you are running several Java-based services on the one machine (for example Jenkins itself, Nexus and perhaps Sonar which all logically fit together as part of a typical Java build server):


$ ps uwww -U jenkins
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
jenkins 19517 9.3 30.8 1823496 1245768 ? Ssl Jul08 1255:29 /usr/bin/java -Dcom.sun.akuma.Daemon=daemonized -Djava.awt.headless=true -Xmx1024m -Xms768m -DJENKINS_HOME=/var/lib/jenkins -jar /usr/lib/jenkins/jenkins.war --logfile=/var/log/jenkins/jenkins.log --daemon --httpPort=8080 --debug=5 --handlerCountMax=100 --handlerCountMaxIdle=20

Nothing amazing here, let’s find this process by the command name:


$ ps uwww -C java
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
nexus 3853 0.0 3.5 1442536 145520 ? Sl Jun06 3:17 java -Dsun.net.inetaddr.ttl=3600 -Dbasedir=. -Djava.io.tmpdir=./runtime/tmp -Djava.library.path=bin/jsw/linux-x86-64/lib -classpath bin/jsw/lib/wrapper-3.2.3.jar:./runtime/apps/nexus/lib/plexus-classworlds-1.4.jar:./conf/ -Dwrapper.key=ybUhRQr9hU88aJwC -Dwrapper.port=32000 -Dwrapper.jvm.port.min=31000 -Dwrapper.jvm.port.max=31999 -Dwrapper.pid=3837 -Dwrapper.version=3.2.3 -Dwrapper.native_library=wrapper -Dwrapper.service=TRUE -Dwrapper.cpu.timeout=10 -Dwrapper.jvmid=1 org.codehaus.plexus.classworlds.launcher.Launcher

Wait, where is Jenkins? Didn’t we confirm that the process running was in fact /usr/bin/java, and we know only java is used as the executable basename inside of ps? How is it possible that ps is now not showing us the Jenkins process? Let’s have a slightly different look at it:


$ ps -U jenkins
PID TTY TIME CMD
19517 ? 20:55:47 exe
$ ps uwww -C exe
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
jenkins 19517 9.3 30.7 1823496 1245124 ? Ssl Jul08 1255:50 /usr/bin/java -Dcom.sun.akuma.Daemon=daemonized -Djava.awt.headless=true -Xmx1024m -Xms768m -DJENKINS_HOME=/var/lib/jenkins -jar /usr/lib/jenkins/jenkins.war --logfile=/var/log/jenkins/jenkins.log --daemon --httpPort=8080 --debug=5 --handlerCountMax=100 --handlerCountMaxIdle=20

Aha! We have found the errant Jenkins process. But why is the basename exe? As it turns out, it is a peculiarity of Jenkins itself, documented in https://issues.jenkins-ci.org/browse/JENKINS-9206 which also causes problems with the init script (when it tries to find the process with the incorrect method, as we found above, due to certain assumptions).

In any case, now we’ve seen how ps operates and even how to find a process using the correct method, even when that process is playing hard to get.


Sunday, July 17th, 2011 Tech 7 Comments

Dependencies stored in grey matter

by Oliver on Wednesday, February 2nd, 2011.

I have a Zotac ZBox which I use as my HTPC, and generally it works pretty well. One thing that is slightly troublesome is the HDMI audio, which seems to rely on having the ALSA backports modules package installed in order for it to work. Remembering this is key, though, since once the package is installed it does not automatically get updated when you upgrade the kernel package.

Most packaging systems rely on the fact that you only have the one instance of a package around at a time, and as time passes you upgrade these packages between versions (a notable exception is the Gem packaging system). Kernel packages are the exception to the rule: not only can you have several present on your system at a time, but this is usually desirable so that you can test stability, roll back etc. For this reason the version number of the kernel creeps into the package name, making it effectively treated as a unique package (and since the file paths differ, it is unique on disk as well). The DEB package format handles upgrades by way of a meta-package which pulls in the latest version of the kernel package. RPM uses some other magic I can’t recall right now.

In the case of the linux-backports-modules-alsa package, the same idea applies. However where the kernel meta-package pulls in the newer kernel package when there is an update available, it can’t do the same for this package since not everybody wants it installed automatically. Since I do want it installed automatically but am not in a position to change the packages, this puts me in a slightly irritating position. Ideally there would be some hook that I could use to pull in the latest version of this package whenever a new kernel package is installed (and in fact there is, in /etc/kernel/postinst.d/) but anything triggered synchronously with the kernel upgrade will fail since the dpkg transaction is not yet complete and starting a new one will be blocked.

The trigger could in fact schedule an at job to install the newer alsa package a few minutes later, but I don’t like the asynchronous nature of this method and the likelihood of failure (what if I reboot immediately after to activate the new kernel?) although I can’t see an obvious alternative. Does anybody have any suggestions?

The workaround, to avoid having to remember to install the latest version, is to make use of the kernel package maintainer hook directory: /etc/kernel/postinst.d. Scripts in this directory are called after installation of a new kernel, with the first parameter being the new kernel version in uname -r format.


Wednesday, February 2nd, 2011 Tech No Comments

I/O redirection “optimizations”

by Oliver on Sunday, September 12th, 2010.

Quite a while back, I had to migrate a few terabytes of data from one machine to another. Not that special a task, and certainly a few terabytes is not that much, but at the time it was a reasonable amount and even over a 1Gbps network it can take some time. Fortunately it was not time critical and I could take the server in question down for a while to facilitate the migration. The data in question was a number of discrete filesystems on a bunch of LVM logical volumes, thus I was able to basically just recreate the LVs on the destination and do a straight bit copy.

That all being said, I still wanted it to complete quickly! After fixing the usual problem of readahead settings being too low for sequential reads from the source, the copy proceeded more or less as expected, and I kept a watchful eye on iostat. This is where things got a bit strange, as I noticed identical read and write values coming back from the destination LV. The basic formula of the copy was as follows:

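# source
for i in /dev/VolumeGroup/*; do
    # Number of logical extents in this LV
    LE=$(lvdisplay "$i" | awk '/Current LE/ {print $NF}')
    NAME=$(basename "$i")
    # Announce name and size on the control channel, then give the
    # destination time to create the LV
    echo "${NAME}:${LE}" | nc newmachine 30001
    sleep 10
    # Stream the raw contents of the LV over the data channel
    dd if="$i" bs=4M | nc newmachine 30000
    sleep 10
done
# Tell the destination that there are no more LVs to come
echo "DONE:0" | nc newmachine 30001

# destination
while true
do
    # Wait for the next "name:extents" message on the control channel
    INFO=$(nc -l 30001)
    NAME=$(echo "$INFO" | cut -f1 -d:)
    LE=$(echo "$INFO" | cut -f2 -d:)
    if [ "$NAME" = "DONE" ]
    then
        break
    fi
    lvcreate -l "$LE" -n "$NAME" /dev/VolumeGroup
    # Write the incoming data stream straight into the new LV
    nc -l 30000 > "/dev/VolumeGroup/$NAME"
done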

Unfortunately I don’t have the actual code around, so the above is only an off-the-top-of-my-head approximation, but you should get the idea:

  • Loop over our logical volumes we want to migrate over to the new machine, determining name and number of logical extents (yes, we have to do some extra work if logical extent size differs between source and destination).
  • Pipe the number of LEs and the name of the LV to the destination over a “control” channel so that the new LV can be created, and wait a few seconds for this to take place.
  • Read out the source LV with a reasonable block size, and pipe it over to the destination where it is piped directly into the new LV. I may have added an intermediate stage of dd to ensure an output block size of 4MB as well, but my memory fails me.

So, as I mentioned, at this point I noticed that not only was data being written to the destination LV (as you would expect) but a corresponding amount was being simultaneously read from it. I was not able to resolve this discrepancy at the time, although I suspected perhaps some intelligence in part of the redirection on the output side was trying to determine which blocks actually needed overwriting.

A couple of months ago I spotted this post in Chris Siebenmann’s blog which may explain it. He has certainly run into a similar confounding case of system “intelligence”.


Sunday, September 12th, 2010 Tech No Comments

Exact determination of Linux process memory usage

by Oliver on Sunday, September 12th, 2010.

While I was still working for Anchor Systems, we had a client who was launching a fairly large website and as part of the gradual ramp-up to delivery we needed to perform some capacity tuning of the web/application servers. The application stack was basically Perl via mod_perl on Apache (not threaded) so we had to determine the memory footprint of the application and make a determination of how many client processes we could support on each server (divide your available physical RAM by the size of the process).

Unfortunately for the system administrators in question, this is a little more difficult than expected due to Linux’s memory sharing smarts. There used to be no easy way to determine the split between shared and private RSS (Resident Set Size) of a process, making it virtually impossible to say how much of the memory allocation for a process was really completely unique and therefore important to be included in calculations. A similar issue existed for determining the number of mapped pages. At the time, we chose the safest option – consider the entire allocation to be private – thus using slightly more hardware resources but guaranteeing never to cause performance degradation due to overzealous memory allocation.

Kernel versions >= 2.6.25 provide the /proc/$PID/pagemap interface which allows you to examine the page tables for processes. The format of the data is documented in the Linux Cross Reference, which you should bookmark now if you haven’t already! There is also a writeup of the interface and how it can be used on LWN.net, which is another bookmark-worthy resource with many very technical articles.

It appears someone has also written a userspace tool to pull information out of this interface, at http://selenic.com/repo/pagemap/

It is also possible to view human-readable information directly from /proc/$PID/smaps, which divides memory allocation up by loaded libraries and the stack. Quite verbose and certainly useful in some situations.
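As a rough illustration of how the private/shared split can be pulled out of smaps, here is a minimal sketch. It assumes the Private_Clean, Private_Dirty, Shared_Clean and Shared_Dirty fields that smaps reports in kB for each mapping; the function name summarize is illustrative, and this is not how the pagemap tool above works.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// summarize adds up the private and shared resident sizes (in kB)
// across all mappings in the text of an smaps file.
func summarize(smaps string) (privateKB, sharedKB int) {
	sc := bufio.NewScanner(strings.NewReader(smaps))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 2 {
			continue
		}
		kb, err := strconv.Atoi(fields[1])
		if err != nil {
			continue // mapping header or a non-numeric line
		}
		switch fields[0] {
		case "Private_Clean:", "Private_Dirty:":
			privateKB += kb
		case "Shared_Clean:", "Shared_Dirty:":
			sharedKB += kb
		}
	}
	return privateKB, sharedKB
}

func main() {
	// Inspect the given PID, or this process itself by default.
	pid := "self"
	if len(os.Args) > 1 {
		pid = os.Args[1]
	}
	data, err := os.ReadFile("/proc/" + pid + "/smaps")
	if err != nil {
		fmt.Println("could not read smaps:", err)
		return
	}
	private, shared := summarize(string(data))
	fmt.Printf("private: %d kB, shared: %d kB\n", private, shared)
}
```

The private total is the figure that matters for the capacity calculation, since the shared pages are only paid for once no matter how many Apache children are running.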


Sunday, September 12th, 2010 Tech 1 Comment

IPv6 Privacy

by Oliver on Sunday, September 12th, 2010.

I’m a big proponent of getting IPv6 out there, but not everyone shares this opinion. A lot of people are happy to stick with IPv4 and all of the horrid NATing nightmares this introduces despite there being such big wins when using IPv6. Some issues it does introduce though need to be dealt with:

  • The big compatibility issue, which encompasses OS, network stack and application support as well as support by all of the solid-state devices out there already.
  • Direct-connectivity to all endpoints is now possible, so NAT cannot be relied upon to provide security.
  • MAC addresses are directly converted into EUI-64 addresses which, when used with IPv6 autoconfiguration, are directly exposed in the IPv6 address.

These last two seem to cause a bit of argument. NAT does provide an implicit form of security (albeit one that can be bypassed with advanced techniques). Adequate firewalling mitigates the security problem, and doesn’t involve breaking the Internet. The alternative is working with an Internet where NAT is omnipresent and every P2P service requires proxies or connection brokering services. This will be a problem.

The last point of MAC address privacy is easily dealt with. If you (like me) don’t particularly like the idea of exposing your MAC address to the world when you use IPv6, you can configure your system to randomize the address. Setting:

net.ipv6.conf.all.use_tempaddr = 2

will cause your network stack to use the advertised prefix on your segment but generate a random suffix, and use it as the preferred address (any previously configured address from the MAC address will be deprecated and eventually removed).
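To see why the autoconfigured address gives the MAC away in the first place, here is a small sketch of the standard modified EUI-64 conversion: flip the universal/local bit of the first byte and insert ff:fe in the middle. The function name interfaceID and the example MAC are made up for illustration.

```go
package main

import (
	"fmt"
	"net"
)

// interfaceID converts a 48-bit MAC address into the 64-bit interface
// identifier that forms the lower half of an autoconfigured IPv6 address.
func interfaceID(mac string) (string, error) {
	hw, err := net.ParseMAC(mac)
	if err != nil {
		return "", err
	}
	if len(hw) != 6 {
		return "", fmt.Errorf("expected a 48-bit MAC, got %d bytes", len(hw))
	}
	// Flip the universal/local bit and insert ff:fe between the two halves.
	return fmt.Sprintf("%02x%02x:%02xff:fe%02x:%02x%02x",
		hw[0]^0x02, hw[1], hw[2], hw[3], hw[4], hw[5]), nil
}

func main() {
	id, err := interfaceID("00:11:22:33:44:55")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(id) // 0211:22ff:fe33:4455
}
```

Every byte of the MAC survives in the result, so anyone who sees the address can recover the hardware address and follow it from network to network.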


Sunday, September 12th, 2010 Tech No Comments