[mpich-discuss] mpich2-1.2.1, only starts 5 mpd's and cpi won't run, compiler flags issue?

David Mathog mathog at caltech.edu
Tue Feb 9 11:58:55 CST 2010


> Unfortunately, this error message isn't particularly helpful in  
> figuring out what the problem is.  The most suspicious thing is  
> whichever process is dying with "Communication Error" during its  
> participation in MPI_Bcast.  Adding the "-l" option to mpiexec will  
> make it easier to figure out who that is.  Then try dropping that host  
> from your next run in case there is a problem with that particular  
> machine.  You might also try running only on the cluster nodes, that  
> is, without the head node because it has a more complicated networking  
> setup.
> 
> Otherwise you can try running something like "mpiexec -n 8 strace -ff
> -o strace.log /path/to/cpi".  That might shed some light on things, but
> I can't guarantee anything.

This is really not working very well.  Sometimes mpdboot works,
sometimes not, failing at one machine or another out of the 21.  It
seems to work more reliably when athcool is turned off.  Athcool is a
power-saving mode on Athlons that makes them run slower and
definitely slows down the network.  However, I have never previously
seen it interact with software in any way other than slowing it down
slightly.  Perhaps there is an overly optimistic timeout value somewhere
in MPICH2?  (Note there is a script that turns athcool off once it
detects a "significant load" on a compute node, but the few packets
moving around in mpdboot are not enough to trigger it.)
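
One way to put numbers on "sometimes works, sometimes not" is to repeat
a command several times and count the successes, once with athcool on
and once with it off.  The following is only a minimal sketch in POSIX
shell; the helper name count_successes is made up for illustration and
is not part of MPICH2.

```sh
#!/bin/sh
# count_successes TRIALS CMD [ARGS...]
# Hypothetical helper: runs CMD TRIALS times, discards its output,
# and prints "successes/trials" based on the exit status of each run.
count_successes() {
    trials=$1
    shift
    ok=0
    i=0
    while [ "$i" -lt "$trials" ]; do
        if "$@" >/dev/null 2>&1; then
            ok=$((ok + 1))
        fi
        i=$((i + 1))
    done
    echo "$ok/$trials"
}
```

It could be used along these lines, where mpd.hosts is a placeholder for
the real hosts file:

  count_successes 10 sh -c 'mpdboot -n 21 -f mpd.hosts && mpdallexit'

This only counts exit statuses, so a run that hangs rather than fails
would still block the loop.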

With athcool off, and mpdboot having run correctly, this works reliably:

  mpiexec -n 3 strace -ff -o /tmp/strace.log /bin/hostname

and this never works:

  mpiexec -n 16 strace -ff -o /tmp/strace.log /bin/hostname

In the latter case it just locks up, and the strace.log file is never
written on any of the compute nodes.  Working up slowly, -n 8 works
(usually) and -n 9 doesn't (so far, ever):

mpiexec -l -n 8 strace -ff -o /tmp/strace.log /bin/hostname
0: safserver.bio.caltech.edu
2: monkey12.cluster
3: monkey11.cluster
4: monkey10.cluster
5: monkey09.cluster
7: monkey15.cluster
6: monkey04.cluster
1: monkey02.cluster
[safrun@safserver ~]$ mpiexec -l -n 9 strace -ff -o /tmp/strace.log /bin/hostname
(LOCKS, hit ^C and it emits)
mpiexec_safserver.bio.caltech.edu (mpiexec 440): mpiexec: failed to obtain sock from manager
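
The step-up search above can be scripted so that a hang counts as a
failure instead of needing ^C.  This is a sketch under assumptions: it
relies on timeout(1) from GNU coreutils where available, the function
names run_limited and first_failing_n are invented for this example,
and the launcher is read from a hypothetical MPIEXEC variable that
defaults to mpiexec.

```sh
#!/bin/sh
# run_limited SECS CMD [ARGS...]
# Runs CMD under timeout(1) if it is installed; otherwise runs it
# directly, in which case a hang will still block.
run_limited() {
    secs=$1
    shift
    if command -v timeout >/dev/null 2>&1; then
        timeout "$secs" "$@"
    else
        "$@"
    fi
}

# first_failing_n MAX SECS PROGRAM [ARGS...]
# Tries "-n 1", "-n 2", ... up to MAX and prints the first process
# count whose run fails or exceeds SECS seconds, or "none" if all pass.
first_failing_n() {
    max=$1
    limit=$2
    shift 2
    n=1
    while [ "$n" -le "$max" ]; do
        if ! run_limited "$limit" "${MPIEXEC:-mpiexec}" -n "$n" "$@" \
                >/dev/null 2>&1; then
            echo "$n"
            return 0
        fi
        n=$((n + 1))
    done
    echo "none"
}
```

For example:

  first_failing_n 16 30 /bin/hostname

Since -n 8 itself only works "usually", a single pass can stop early;
repeating the search a few times gives a better picture.  A run killed
by the timeout may leave stray mpd processes behind, just as ^C does.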

This leaves an extra mpd running, but only on the nodes listed for
-n 8.  That is, it didn't start an mpd on some other node and fail
after that; it looks like the failure was an inability to start the
mpd on the 9th node.  Even without strace, this:

  mpiexec -l -n 8 /bin/hostname

fails sometimes and works other times.  -n 9 never works.
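
To see which nodes are left with an extra mpd after a failed run, the
per-node process count can be collected in one pass.  The sketch below
makes several assumptions: the function names are invented, the remote
shell is taken from a hypothetical RSH variable (default ssh), the
hosts file has one hostname per line, and an mpd is recognized by the
word "mpd" or "mpd.py" in the "ps -eo args" output, which may need
adjusting for a particular install.

```sh
#!/bin/sh
# count_mpd: reads process command lines on stdin and prints how many
# look like an mpd (assumed pattern: "mpd" or "mpd.py" as a whole word
# or as the last path component).
count_mpd() {
    grep -c -E '(^|[ /])mpd(\.py)?( |$)' || :
}

# mpd_per_host HOSTSFILE
# Prints "host count" for every host listed in HOSTSFILE.
mpd_per_host() {
    while read -r host; do
        [ -n "$host" ] || continue
        # stdin is redirected so the remote shell cannot eat the host list
        n=$("${RSH:-ssh}" "$host" ps -eo args </dev/null | count_mpd)
        echo "$host $n"
    done < "$1"
}
```

A node that reports 2 where the others report 1 is one that kept a
leftover mpd from the interrupted job.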

Since /bin/hostname has no MPI code linked into it, the fault must lie
either in mpd/mpiexec itself or in something it doesn't like about the
way our systems are configured.

I checked that all nodes are running, with no kernel errors, and none
of them has a problem running /bin/hostname when started by other means:

% /usr/common/bin/rsh -f \
  /usr/common/etc/machines.LINUX_INTEL '/bin/hostname'
monkey01.cluster
monkey02.cluster
(snip)
monkey19.cluster
monkey20.cluster
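
With 20 lines of output it is easy to overlook a node that did not
answer.  A small check like the one below compares the names that came
back against an expected list.  It is a sketch only: missing_hosts is
an invented name, and it assumes both files hold one hostname per line
in the same form (for example monkey01.cluster).

```sh
#!/bin/sh
# missing_hosts EXPECTED_FILE ACTUAL_FILE
# Prints every line of EXPECTED_FILE that does not appear, as an exact
# whole-line match, anywhere in ACTUAL_FILE.  Prints nothing if all
# expected hosts are present.
missing_hosts() {
    grep -v -x -F -f "$2" "$1" || :
}
```

The second file would be the saved output of the rsh command above;
any name printed is a node that never reported its hostname.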

Any idea what might be triggering the "failed to obtain sock from manager"?

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech