[mpich-discuss] mpi job crashing: rank 2 in job 7 caused collective abort of all ranks

Mary Ellen Fitzpatrick mfitzpat at bu.edu
Fri Oct 31 10:53:14 CDT 2008


Hi,
I am trying to run MPI jobs on my cluster, and they fail with "rank 2 in
job 7 caused collective abort of all ranks".
It is a 48-node Linux cluster (two dual-core CPUs per node), running Torque
with Maui, and has MPICH2 1.0.7 installed. I have mpds running on all
nodes, started from the head node as root.
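
The ring was brought up roughly like the sketch below; the host file name
is just a placeholder for whatever list of node names is actually used:

  # start one mpd per node, reading the node names from a host file
  mpdboot -n 48 -f mpd.hosts
  # check that every node has joined the ring
  mpdtrace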

What I want to do is submit an MPI job that runs one process per node,
with each process using all 4 cores on its node, and submit that job to
4 nodes.

If I request 1 node with 4 processors in my PBS script
(#PBS -l nodes=1:ppn=4), it works fine: everything runs on one node with
4 CPUs, and the MPI output says everything ran perfectly. If I request 4
nodes with 4 processors each (#PBS -l nodes=4:ppn=4), it fails: my Torque
prologue/epilogue output file says the job ran on 4 nodes and requested 16
processors, but the MPI output file has the following error:

-snippet-
Initializing MPI Routines...
Initializing MPI Routines...
Initializing MPI Routines...
rank 2 in job 7  node1043_55948   caused collective abort of all ranks
  exit status of rank 2: killed by signal 9
rank 1 in job 7  node1043_55948   caused collective abort of all ranks
  exit status of rank 1: killed by signal 9
-snippet-
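
For reference, the submission script looks roughly like the sketch below;
the executable name and the exact mpiexec line are placeholders for what I
actually run, not a literal copy:

  #!/bin/sh
  #PBS -l nodes=4:ppn=4
  cd $PBS_O_WORKDIR
  # 16 processes = 4 nodes x 4 cores; ./my_mpi_program is a placeholder name
  mpiexec -machinefile $PBS_NODEFILE -n 16 ./my_mpi_program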

Does anyone know why my MPI job is crashing?

-- 
Thanks
Mary Ellen



