[mpich-discuss] mpich2-1.2.1, only starts 5 mpd's and cpi won't run, compiler flags issue?

Dave Goodell goodell at mcs.anl.gov
Mon Feb 8 13:51:17 CST 2010


On Feb 8, 2010, at 1:05 PM, David Mathog wrote:

> Built and installed mpich2-1.2.1 on our master node, which placed it in
> /opt/mpich2_121.  Copied that directory to all compute nodes.  For an
> account set up .mpd.conf with a secretword, put /opt/mpich2_121/bin in
> that account's PATH via .bashrc, and did:
>
> mpdboot -f /usr/common/etc/machines.LINUX_INTEL_Safserver \
>  -r rsh --ifhn=192.168.1.220 &
>
> There are 21 machines listed in that file, but it only started mpd on
> the first 5.  No warnings or anything, it just did 5 and stopped;
> /var/log/messages on the missing nodes doesn't show any attempt to rsh in.

I think you need to pass "-n 21" (or equivalently "--totalnum=21") to  
mpdboot.  Also, mpdboot is not usually run in the background.  If you  
are running it in the background because it hangs, then you are likely  
hitting the hang bug described here: https://trac.mcs.anl.gov/projects/mpich2/ticket/974
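
As a rough sketch of the above (not something I have run against your setup), the total can be derived from the machines file instead of being hard-coded.  The helper name count_hosts is made up for illustration, and it assumes the file has one host per line with optional blank lines and "#" comments.  Whether the local host needs to be counted in addition depends on whether it is already listed in the file.

```shell
#!/bin/sh
# Hypothetical helper: count usable entries in an mpd machines file.
# Assumes one host per line; blank lines and '#' comment lines are skipped.
count_hosts() {
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1" | wc -l | tr -d ' '
}

MACHINES=/usr/common/etc/machines.LINUX_INTEL_Safserver

# Only attempt the boot if the file is readable and mpdboot is on PATH.
if [ -r "$MACHINES" ] && command -v mpdboot >/dev/null 2>&1; then
    N=$(count_hosts "$MACHINES")
    # Run in the foreground; -n is the total number of mpds to start.
    mpdboot -n "$N" -f "$MACHINES" -r rsh --ifhn=192.168.1.220
fi
```

The other options are copied from your original command line; only "-n" is new and the trailing "&" is dropped.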

You should also be aware that there is no way to prevent the head node  
from being part of the mpd ring if you run mpdboot on the head node.   
The hydra process manager does not suffer from this deficiency: http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager

> The rump cluster works to the extent that
>
>  mpiexec -n 5 mpdtrace -l
>  mpiexec -n 5 /bin/uname -n
>
> give the expected results.  However this doesn't work:
>
>  mpiexec -n -5 /opt/mpich2_121/examples/cpi

(I'm guessing that "-5" was really just a typo for "5" here.)

> and this is why, I think:
>
>  rsh monkey01 '/opt/mpich2_121/examples/cpi'
>  Illegal instruction
>
> The main machine is a dual opteron, and the slaves are all Athlon MPs,
> so it is possible to generate code that would run on the former and not
> the latter.  However they are both running the same 32 bit linux, and
> have the same versions of all packages installed.   Does MPICH2 somehow
> pick compiler flags that might only run locally by default (none were
> specified in ./configure)?   For instance, to rebuild cpi make does only
> this:

It's not so much an issue of compiler flags.  Rather, the results of  
some configure tests run on the head node are not appropriate for the  
rest of the cluster.  In particular, you are very likely encountering a  
variant of this bug: http://trac.mcs.anl.gov/projects/mpich2/ticket/694

You should be able to fix the "Illegal instruction" issue by either of  
the following approaches:

1) Compiling MPICH2 on one of the cluster nodes.  This will help to  
ensure that any configure tests accurately reflect the environment on  
the main cluster.

2) Configuring with  
"--with-atomic-primitives=opa_gcc_intel_32_64_p3.h".  This tells MPICH2  
to use atomics suitable for older x86 processors that don't support mfence.
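
To see which nodes are affected, one option is to look for the sse2 flag in /proc/cpuinfo, since mfence was introduced with SSE2.  The following is only a sketch: the helper name has_sse2 and the CPUINFO variable are made up for illustration, and it assumes the Linux x86 cpuinfo layout with a "flags" line.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given cpuinfo-style file lists sse2.
# mfence is an SSE2 instruction, so a CPU without sse2 cannot execute it.
has_sse2() {
    grep -E '^flags[[:space:]]*:' "$1" | grep -qw sse2
}

# CPUINFO is an illustrative override; by default the local node is checked.
CPUINFO=${CPUINFO:-/proc/cpuinfo}
if [ -r "$CPUINFO" ]; then
    if has_sse2 "$CPUINFO"; then
        echo "sse2 present: mfence should be available on this node"
    else
        echo "sse2 missing: consider --with-atomic-primitives=opa_gcc_intel_32_64_p3.h"
    fi
fi
```

Run on the head node and on one of the Athlon MP nodes, this should make the difference between the two visible before you rebuild.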

-Dave


