jenkins

Reliably finding processes with ps by name

by Oliver on Sunday, July 17th, 2011.

I imagine that there are two groups of people who might read this post:

  1. When you need to find a process by name, you run ps -ef or similar and pipe into grep processname. You gladly suffer the presence of your own grep process being shown in the output, or maybe even grep -v it out (these are the “heathens”).

  2. When you need to find a process by name, you run ps -C processname (these are the “enlightened ones”).

If you fall into the first category, you fail my interview tests. Perhaps you smugly fall into the second category, but surely you have seen this occur:


ohookins 4410 0.0 0.2 212804 9384 ? S 20:10 0:00 /usr/lib/bamf/bamfdaemon
$ ps u -C /usr/lib/bamf/bamfdaemon
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
$ ps u -C bamfdaemon
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
ohookins 4410 0.0 0.2 212804 9384 ? S 20:10 0:00 /usr/lib/bamf/bamfdaemon

I don’t particularly care about bamfdaemon, but given that the process listing shows the full path to the binary, why can’t we search for it by this process name? Why does the unqualified filename work? OK, perhaps it is just basing the match on the unqualified filename…


ohookins 3710 0.0 0.9 468988 38216 ? Sl 19:49 0:05 /usr/bin/python /usr/bin/terminator
$ ps u -C terminator
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
$ ps u -C python
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
$ ps u -C /usr/bin/python
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
$ ps u -C /usr/bin/terminator
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
ohookins 3710 0.0 0.9 468988 38216 ? Sl 19:49 0:05 /usr/bin/python /usr/bin/terminator

OK, what the heck is going on here exactly? I’m not terribly familiar with the POSIX specification, so let’s take a look at the source code of procps:


break; case SEL_COMM: i=sn->n; while(i--)
if(!strncmp( buf->cmd, (*(sn->u+i)).cmd, 15 )) return 1;

In select.c of ps, we see these two lines in the case statement which selects between different process identification mechanisms. -C actually allows you to select multiple processes by different names, since it iterates through the list of selectors (which I didn’t know before looking at the code, very cool).

A limited string comparison is done between the argument given to -C and the process being examined. You can see that this limit is 15 characters, and in the union inside the selection node only 16 characters are stored anyway. Let’s have a look at what this proc_t buf struct looks like so we can figure out what the comparison is being done on. This sits in proc/readproc.h:


char cmd[16]; // stat,status basename of executable file in call to exec(2)
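The comment on that field points at stat and status, meaning the files of those names under /proc/<pid>. As a rough sketch (not how procps itself parses the file), the name can be pulled out of a stat line in shell. The function name stat_comm and the sample line are made up for illustration. The only assumption about the format is that the name is wrapped in parentheses right after the PID, as it is on Linux.

```shell
# Extract the command name from one line of /proc/<pid>/stat.
# The name sits between the first "(" and the last ")", so names that
# themselves contain spaces or parentheses are still handled.
stat_comm() {
    stat_rest=${1#*\(}          # drop everything up to the first "("
    stat_name=${stat_rest%\)*}  # drop the last ")" and what follows
    printf '%s\n' "$stat_name"
}

# Hypothetical, abbreviated stat line for the bamfdaemon process.
stat_comm '4410 (bamfdaemon) S 1 4410 4410 0'

# On a Linux machine, the same for the current shell:
# stat_comm "$(cat /proc/$$/stat)"
```

Note that the full path never appears in this field, only the short name.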

Now we are getting somewhere. We can easily verify that the comparison really is limited to 15 characters:


$ ps u -C upstart-socket-
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1123 0.0 0.0 15004 400 ? S 18:46 0:00 upstart-socket-bridge --daemon
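To make the effect of that strncmp call concrete, here is a small shell sketch of the same comparison. The function name comm_matches is made up, and it assumes the two lines quoted from select.c are the whole story for -C. Both strings are cut to their first 15 characters and must then be equal, so this is not a prefix match.

```shell
# Succeeds when SELECTOR and COMM are equal within their first 15
# characters, mimicking strncmp(a, b, 15) == 0.
comm_matches() {
    [ "$(printf '%.15s' "$1")" = "$(printf '%.15s' "$2")" ]
}

# The stored name of upstart-socket-bridge is cut to 15 characters.
comm_matches upstart-socket-bridge upstart-socket- && echo "match"

# A shorter prefix is a different string, so it does not match.
comm_matches upstart upstart-socket- || echo "no match"
```

Typing the full name or exactly its first 15 characters both work, while anything shorter does not.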

So if you are lazy, you only have to type at most 15 characters of your process name. Let’s look at the more complicated case, where a process is really hard to find by any name we can see in the process listing. My candidate for this is Jenkins, which is notoriously hard to track down, especially if you are running several Java-based services on the one machine (for example Jenkins itself, Nexus and perhaps Sonar, which all logically fit together as part of a typical Java build server):


$ ps uwww -U jenkins
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
jenkins 19517 9.3 30.8 1823496 1245768 ? Ssl Jul08 1255:29 /usr/bin/java -Dcom.sun.akuma.Daemon=daemonized -Djava.awt.headless=true -Xmx1024m -Xms768m -DJENKINS_HOME=/var/lib/jenkins -jar /usr/lib/jenkins/jenkins.war --logfile=/var/log/jenkins/jenkins.log --daemon --httpPort=8080 --debug=5 --handlerCountMax=100 --handlerCountMaxIdle=20

Nothing amazing here, let’s find this process by the command name:


$ ps uwww -C java
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
nexus 3853 0.0 3.5 1442536 145520 ? Sl Jun06 3:17 java -Dsun.net.inetaddr.ttl=3600 -Dbasedir=. -Djava.io.tmpdir=./runtime/tmp -Djava.library.path=bin/jsw/linux-x86-64/lib -classpath bin/jsw/lib/wrapper-3.2.3.jar:./runtime/apps/nexus/lib/plexus-classworlds-1.4.jar:./conf/ -Dwrapper.key=ybUhRQr9hU88aJwC -Dwrapper.port=32000 -Dwrapper.jvm.port.min=31000 -Dwrapper.jvm.port.max=31999 -Dwrapper.pid=3837 -Dwrapper.version=3.2.3 -Dwrapper.native_library=wrapper -Dwrapper.service=TRUE -Dwrapper.cpu.timeout=10 -Dwrapper.jvmid=1 org.codehaus.plexus.classworlds.launcher.Launcher

Wait, where is Jenkins? Didn’t we confirm that the process running was in fact /usr/bin/java, and we know only java is used as the executable basename inside of ps? How is it possible that ps is now not showing us the Jenkins process? Let’s have a slightly different look at it:


$ ps -U jenkins
PID TTY TIME CMD
19517 ? 20:55:47 exe
$ ps uwww -C exe
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
jenkins 19517 9.3 30.7 1823496 1245124 ? Ssl Jul08 1255:50 /usr/bin/java -Dcom.sun.akuma.Daemon=daemonized -Djava.awt.headless=true -Xmx1024m -Xms768m -DJENKINS_HOME=/var/lib/jenkins -jar /usr/lib/jenkins/jenkins.war --logfile=/var/log/jenkins/jenkins.log --daemon --httpPort=8080 --debug=5 --handlerCountMax=100 --handlerCountMaxIdle=20

Aha! We have found the errant Jenkins process. But why is the basename exe? As it turns out, it is a peculiarity of Jenkins itself, documented in https://issues.jenkins-ci.org/browse/JENKINS-9206 which also causes problems with the init script (when it tries to find the process with the incorrect method, as we found above, due to certain assumptions).
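When the short name is as unhelpful as exe, another option is to match against the full argument list instead. procps ships pgrep, whose -f option does this. The sketch below does the same by hand, reading the cmdline files under /proc. The function name pids_by_cmdline and its second parameter are made up for illustration, and it assumes a Linux-style /proc.

```shell
# Print the PIDs of processes whose full argument list contains NEEDLE.
# PROC_ROOT defaults to /proc; the parameter exists only so the function
# can be pointed at a test directory.
pids_by_cmdline() (
    needle=$1
    proc_root=${2:-/proc}
    for dir in "$proc_root"/[0-9]*; do
        [ -r "$dir/cmdline" ] || continue
        # Arguments in cmdline are NUL-separated; turn them into spaces.
        args=$(tr '\0' ' ' 2>/dev/null < "$dir/cmdline")
        case $args in
            *"$needle"*) printf '%s\n' "${dir##*/}" ;;
        esac
    done
)

pids_by_cmdline jenkins.war
```

If the search string is itself on the calling shell's command line (for example with sh -c), that shell will show up in the results too, much like the grep process in the ps | grep approach.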

In any case, now we’ve seen how ps operates and even how to find a process using the correct method, even when that process is playing hard to get.



An odd obsession with hardware utilisation

by Oliver on Saturday, February 19th, 2011.

I’m sure I’m not alone in my personification of hardware, but in extension to that I like to know that my hardware is doing something. The thought of it sitting there idle just bugs me. So when I install a fresh new Jenkins server and it is sitting there waiting for jobs to be fired off, it saddens me just a little that it isn’t utilised more.

The flip-side situation is when the machine is overutilised, or even just adequately utilised. Just as it is frustrating in one way to have a job that completes before you even have time to make a coffee, it is frustrating to have to wait for several jobs to complete that take an hour. In reality, the sweet spot in the middle, where you have the best of both worlds (latency and throughput), is so hard to achieve that you end up having to choose one or the other.

I guess this is what cloud computing is for, which is yet another area I feel like I’m two years behind in. My understanding is that there are no real guarantees of throughput or latency. You have the promise of “infinite” scalability, but no real idea of how things will perform on any given VM. This makes it impossible to know exactly how much performance you will get at any given point in time (obviously I am speaking about the public cloud here). What does this matter to an OCD sysadmin who likes his hardware to be well utilised? Probably nothing.

Going to the cloud does make sense, but who in this industry would never miss the endless rows of blinkenlights in the datacenter on the occasional visit? I doubt there would be a single person among us (and if you answered “yes”, what is wrong with you? ;))

