[mpich-discuss] mpich2-1.2.1, only starts 5 mpd's and cpi won't run, compiler flags issue?

Dave Goodell goodell at mcs.anl.gov
Mon Feb 8 17:29:54 CST 2010


On Feb 8, 2010, at 4:55 PM, David Mathog wrote:

> Using --with-atomic-primitives=opa_gcc_intel_32_64-p3.h did not
> resolve this issue.  Again it blew up following:

That's not really surprising.  If cpi will run on all machines  
individually, then you are no longer hitting the "Illegal instruction"  
problem.

> mpdboot -f /usr/common/etc/machines.LINUX_INTEL_Safserver -n 21 -r rsh
> --ifhn=192.168.1.220 -v
> mpiexec -n 8 /opt/mpich2_121/examples/cpi
> Process 0 of 8 is on safserver.bio.caltech.edu
[snip]
> MPID_nem_tcp_recv_handler(1446)...: socket closed
> rank 6 in job 2  safserver.bio.caltech.edu_49241   caused collective
> abort of all ranks
>  exit status of rank 6: killed by signal 9
> rank 4 in job 2  safserver.bio.caltech.edu_49241   caused collective
> abort of all ranks
>  exit status of rank 4: killed by signal 9

Unfortunately, this error message isn't particularly helpful in  
figuring out what the problem is.  The most suspicious thing is  
whichever process is dying with "Communication Error" during its  
participation in MPI_Bcast.  Adding the "-l" option to mpiexec will  
make it easier to figure out which rank that is.  Then try dropping
that host from your next run, in case there is a problem with that
particular machine.  You might also try running only on the compute
nodes, that is, without the head node, since the head node has a more
complicated networking setup.
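A rough sketch of that first step (the hostfile and binary path are taken
from your commands above; the log filename "cpi.log" is just a placeholder
I made up):

```shell
# Prefix every output line with the rank that printed it, and keep a
# copy of the combined output for later inspection:
mpiexec -l -n 8 /opt/mpich2_121/examples/cpi 2>&1 | tee cpi.log

# With -l, each line starts with "<rank>: ", so the rank reporting the
# failure can be read off directly:
grep -i 'communication error' cpi.log
```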

Otherwise you can try running something like "mpiexec -n 8 strace -ff
-o strace.log /path/to/cpi".  That might shed some light on things, but
I can't guarantee anything.
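Spelled out, that run produces one log per process; afterwards you can skim
the tails of the logs for failed socket calls or fatal signals (the grep
patterns below are my own guesses at what might show up, not guaranteed
matches):

```shell
# -ff follows forked children and writes a separate log per process;
# -o sets the filename prefix, yielding strace.log.<pid> for each rank:
mpiexec -n 8 strace -ff -o strace.log /path/to/cpi

# Look at the end of each per-process log for suspicious errno values
# or signals:
for f in strace.log.*; do
    echo "== $f =="
    tail -n 20 "$f" | grep -E 'ECONNRESET|EPIPE|SIGKILL|SIGSEGV'
done
```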

-Dave


