[mpich-discuss] mpi job crashing: rank 2 in job 7 caused collective abort of all ranks
Mary Ellen Fitzpatrick
mfitzpat at bu.edu
Fri Oct 31 10:53:14 CDT 2008
Hi,
I am trying to run MPI jobs on my cluster, and they fail with: rank 2 in
job 7 caused collective abort of all ranks.
The cluster has 48 Linux nodes (each with two dual-core CPUs) and runs
Torque with Maui. MPICH2-1.07 is installed, and mpds are running on all
nodes, started from the head node as root.
What I want to do is submit an MPI job that runs one process per node,
requests all 4 cores on each node, and is spread across 4 nodes.
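To make the intended layout concrete, here is a minimal sketch, assuming Torque's usual behavior of listing each allocated host once per requested core in the file named by $PBS_NODEFILE. The helper name unique_hosts and the hostnames other than node1043 are hypothetical, chosen only for illustration.

```python
import os


def unique_hosts(nodefile_lines):
    """Return each host once, in first-seen order (one entry per node)."""
    hosts = []
    for line in nodefile_lines:
        host = line.strip()
        if host and host not in hosts:
            hosts.append(host)
    return hosts


# Hypothetical nodefile contents for nodes=4:ppn=4:
# four hosts, each repeated once per requested core (16 lines in total).
sample = [h for h in ("node1043", "node1044", "node1045", "node1046")
          for _ in range(4)]
print(len(sample), unique_hosts(sample))

# Inside a real job, read the file Torque points to instead (if it is set).
nodefile = os.environ.get("PBS_NODEFILE")
if nodefile and os.path.exists(nodefile):
    with open(nodefile) as f:
        print("\n".join(unique_hosts(f)))
```

The de-duplicated list is what "one process per node" would look like as a host list; whether and how the launcher accepts such a list depends on the MPI installation.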
If I request 1 node with 4 processors in my PBS script
(#PBS -l nodes=1:ppn=4), it works fine: everything runs on one node with
4 CPUs, and the MPI output says everything ran perfectly.
If I request 4 nodes with 4 processors each (#PBS -l nodes=4:ppn=4), it
fails. My Torque prologue/epilogue output file says the job ran on 4
nodes and requested 16 processors.
But the MPI output file has the following error:
-snippet-
Initializing MPI Routines...
Initializing MPI Routines...
Initializing MPI Routines...
rank 2 in job 7 node1043_55948 caused collective abort of all ranks
exit status of rank 2: killed by signal 9
rank 1 in job 7 node1043_55948 caused collective abort of all ranks
exit status of rank 1: killed by signal 9
-snippet-
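For reference, signal 9 on Linux is SIGKILL, which a process cannot catch or handle, so the ranks were most likely killed from outside rather than exiting on their own. A minimal sketch for decoding the number, using Python's standard signal module:

```python
import signal

# Map the numeric exit signal from the mpd output to its symbolic name.
sig = signal.Signals(9)
print(sig.name)
```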
Does anyone know why my MPI job is crashing?
--
Thanks
Mary Ellen