[mpich-discuss] mpich2-1.2.1, only starts 5 mpd's and cpi won't run, compiler flags issue?
Dave Goodell
goodell at mcs.anl.gov
Mon Feb 8 17:29:54 CST 2010
On Feb 8, 2010, at 4:55 PM, David Mathog wrote:
> Using --with-atomic-primitives=opa_gcc_intel_32_64-p3.h did not
> resolve
> this issue. Again it blew up after the following:
That's not really surprising. If cpi will run on all machines
individually, then you are no longer hitting the "Illegal instruction"
problem.
> mpdboot -f /usr/common/etc/machines.LINUX_INTEL_Safserver -n 21 -r rsh
> --ifhn=192.168.1.220 -v
> mpiexec -n 8 /opt/mpich2_121/examples/cpi
> Process 0 of 8 is on safserver.bio.caltech.edu
[snip]
> MPID_nem_tcp_recv_handler(1446)...: socket closed
> rank 6 in job 2 safserver.bio.caltech.edu_49241 caused collective
> abort of all ranks
> exit status of rank 6: killed by signal 9
> rank 4 in job 2 safserver.bio.caltech.edu_49241 caused collective
> abort of all ranks
> exit status of rank 4: killed by signal 9
Unfortunately, this error message isn't particularly helpful in
figuring out what the problem is. The most suspicious thing is
whichever process is dying with "Communication Error" during its
participation in MPI_Bcast. Adding the "-l" option to mpiexec will
label each line of output with its rank, making it easier to figure
out which process that is. Then try dropping that host from your next
run in case there is a problem with that particular machine. You might
also try running only on the cluster nodes, that is, without the head
node, since it has a more complicated networking setup.
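The two steps above might look something like this (a sketch only: the
machinefile path and mpdboot options are taken from the original
report, while "suspect-host" and the reduced node count are
placeholders for whatever the labeled output points to):

```shell
# Label each rank's output with its rank number to spot the failing process
mpiexec -l -n 8 /opt/mpich2_121/examples/cpi

# Hypothetical: build a machinefile without the suspect host, restart
# the mpd ring on the remaining nodes, and rerun
grep -v suspect-host /usr/common/etc/machines.LINUX_INTEL_Safserver > machines.trimmed
mpdallexit
mpdboot -f machines.trimmed -n 20 -r rsh --ifhn=192.168.1.220 -v
mpiexec -l -n 8 /opt/mpich2_121/examples/cpi
```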
Otherwise you can try running something like
"mpiexec -n 8 strace -ff -o strace.log /path/to/cpi". That might shed
some light on things, but I can't guarantee anything.
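For reference, strace's -ff flag follows forked children and writes one
log per process, named by PID. A minimal way to use that here (the tail
length is arbitrary; what you'd look for near the end of each log is an
assumption about where the failure shows up, e.g. a socket call failing
or the process being killed):

```shell
# Trace every rank; each process gets its own file strace.log.<pid>
mpiexec -n 8 strace -ff -o strace.log /path/to/cpi

# Inspect the tail of each per-process log for the last system calls
# before the abort (e.g. a failed socket operation or a fatal signal)
tail -n 20 strace.log.*
```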
-Dave