[mpich-discuss] mpich2-1.2.1, only starts 5 mpd's and cpi won't run, compiler flags issue?
Dave Goodell
goodell at mcs.anl.gov
Mon Feb 8 13:51:17 CST 2010
On Feb 8, 2010, at 1:05 PM, David Mathog wrote:
> Built and installed mpich2-1.2.1 on our master node, which placed it in
> /opt/mpich2_121. Copied that directory to all compute nodes. For an
> account set up .mpd.conf with a secretword, put /opt/mpich2_121/bin in
> that account's PATH via .bashrc, and did:
>
> mpdboot -f /usr/common/etc/machines.LINUX_INTEL_Safserver \
> -r rsh --ifhn=192.168.1.220 &
>
> There are 21 machines listed in that file, but it only started mpd on
> the first 5. No warnings or anything; it just did 5 and stopped.
> /var/log/messages on the missing nodes doesn't show any attempt to rsh
> in.
I think you need to pass "-n 21" (or equivalently "--totalnum=21") to
mpdboot. Also, mpdboot is not usually run in the background. If you
are running it in the background because it hangs, then you are likely
hitting the hang bug described here: https://trac.mcs.anl.gov/projects/mpich2/ticket/974
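As a rough sketch of the suggestion above, the boot command could be run in the foreground with the total given explicitly. The hosts file path, the rsh option, and the --ifhn address are copied from the original post. The count_hosts helper is hypothetical and only illustrates checking how many entries the file holds; it assumes blank lines and lines starting with '#' are not host entries.

```shell
#!/bin/sh
# Hosts file path taken from the original post.
HOSTS=/usr/common/etc/machines.LINUX_INTEL_Safserver

# count_hosts FILE (hypothetical helper): count usable entries,
# skipping blank lines and '#' comment lines (an assumption about the format).
count_hosts() {
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1" | wc -l | tr -d ' '
}

# Only attempt the boot where mpdboot and the hosts file actually exist.
if command -v mpdboot >/dev/null 2>&1 && [ -r "$HOSTS" ]; then
    echo "entries in hosts file: $(count_hosts "$HOSTS")"
    # Foreground (no trailing '&'), with the total passed explicitly.
    mpdboot -n 21 -f "$HOSTS" -r rsh --ifhn=192.168.1.220
    # List the ring members to confirm how many mpds came up.
    mpdtrace -l
fi
```

If mpdtrace still lists fewer hosts than expected, comparing its output against the hosts file should show which nodes were skipped.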
You should also be aware that there is no way to prevent the head node
from being part of the mpd ring if you run mpdboot on the head node.
The hydra process manager does not suffer from this deficiency: http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager
> The rump cluster works to the extent that
>
> mpiexec -n 5 mpdtrace -l
> mpiexec -n 5 /bin/uname -n
>
> give the expected results. However this doesn't work:
>
> mpiexec -n -5 /opt/mpich2_121/examples/cpi
(I'm guessing that "-5" was really just a typo for "5" here.)
> and this is why, I think:
>
> rsh monkey01 '/opt/mpich2_121/examples/cpi'
> Illegal instruction
>
> The main machine is a dual Opteron, and the slaves are all Athlon MPs,
> so it is possible to generate code that would run on the former and not
> the latter. However, they are both running the same 32-bit Linux and
> have the same versions of all packages installed. Does MPICH2 somehow
> pick compiler flags that might only run locally by default (none were
> specified in ./configure)? For instance, to rebuild cpi make does only
> this:
It's not so much an issue of compiler flags. Rather, the results of
some configure tests run on the head node are not appropriate for the
rest of the cluster. In particular, you are very likely encountering a
variant of this bug: http://trac.mcs.anl.gov/projects/mpich2/ticket/694
You should be able to fix the "Illegal instruction" issue by either of
the following approaches:
1) Compiling MPICH2 on one of the cluster nodes. This will help to
ensure that any configure tests accurately reflect the environment on
the main cluster.
2) Configuring with
"--with-atomic-primitives=opa_gcc_intel_32_64_p3.h". This tells MPICH2
to use atomics suitable for older x86 processors that don't support
mfence.
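A minimal sketch of how approach 2 might be applied, under a few assumptions. Because mfence is an SSE2 instruction, the sse2 entry in a node's /proc/cpuinfo flags line is used here as a proxy for mfence support. The has_sse2 and pick_atomics_opt helpers are hypothetical names, not part of MPICH2. The install prefix and the monkey01 host name come from the original post.

```shell
#!/bin/sh
# has_sse2 FILE (hypothetical helper): succeed if a cpuinfo-style FILE
# lists the sse2 flag, used here as a proxy for mfence support.
has_sse2() {
    grep -E '^flags[[:space:]]*:' "$1" | grep -qw sse2
}

# pick_atomics_opt FILE (hypothetical helper): print the configure option
# from approach 2 when the CPU described by FILE lacks sse2, else nothing.
pick_atomics_opt() {
    if has_sse2 "$1"; then
        echo ""
    else
        echo "--with-atomic-primitives=opa_gcc_intel_32_64_p3.h"
    fi
}

# Intended use, from the MPICH2 source tree on the head node. The cpuinfo
# must come from a compute node, since the head node's CPU is the one
# that misleads configure:
#   rsh monkey01 cat /proc/cpuinfo > /tmp/monkey01.cpuinfo
#   ./configure --prefix=/opt/mpich2_121 $(pick_atomics_opt /tmp/monkey01.cpuinfo)
#   make && make install
```

Building on a compute node instead (approach 1) avoids the need for this check, since configure then probes the older CPU directly.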
-Dave