[mpich-discuss] mpi job crashing: rank 2 in job 7 caused collective abort of all ranks
Rajeev Thakur
thakur at mcs.anl.gov
Fri Oct 31 13:21:19 CDT 2008
Does the code run elsewhere with 16 processes if you run it simply as
"mpiexec -n 16 a.out"?
Rajeev
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Mary
> Ellen Fitzpatrick
> Sent: Friday, October 31, 2008 10:53 AM
> To: mpich-discuss at mcs.anl.gov; Mary Ellen Fitzpatrick
> Subject: [mpich-discuss] mpi job crashing: rank 2 in job 7
> caused collective abort of all ranks
>
> Hi,
> Trying to run MPI jobs on my cluster; the job fails with "rank 2 in
> job 7 caused collective abort of all ranks".
> It's a 48-node (dual dual-core CPU) Linux cluster running Torque with
> Maui. MPICH2 1.0.7 is installed, and I have mpd's running on all
> nodes, started from the head node as root.
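> For reference, I start them roughly like this from the head node
> (mpd.hosts is a placeholder for my file listing the compute nodes):
>
> mpdboot -n 48 -f mpd.hosts    # bring up the mpd ring on all 48 nodes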
>
> What I want to do is submit an MPI job that requests all 4 cores on
> each node, spread across 4 nodes.
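> My submission script for that case is roughly like the sketch below
> (the #PBS -l line is the part I vary; the job name and the executable
> ./a.out are placeholders for my actual values):
>
> #!/bin/bash
> #PBS -N mpi_test
> #PBS -l nodes=4:ppn=4
> cd $PBS_O_WORKDIR             # run from the submission directory
> NP=$(wc -l < $PBS_NODEFILE)   # Torque writes one line per allocated core (16 here)
> mpiexec -n $NP ./a.out        # one MPI process per allocated core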
>
> If I request 1 node with 4 processors (#PBS -l nodes=1:ppn=4), it
> works fine: everything runs on one node across 4 CPUs, and the MPI
> output says everything ran perfectly. If I request 4 nodes with 4
> processors each (#PBS -l nodes=4:ppn=4), it fails. My Torque
> prologue/epilogue output file says the job ran on 4 nodes and
> requested 16 processors, but the MPI output file has the following
> error:
>
> -snippet-
> Initializing MPI Routines...
> Initializing MPI Routines...
> Initializing MPI Routines...
> rank 2 in job 7 node1043_55948 caused collective abort of all ranks
> exit status of rank 2: killed by signal 9
> rank 1 in job 7 node1043_55948 caused collective abort of all ranks
> exit status of rank 1: killed by signal 9
> -snippet -
>
> Anyone know why my mpi job is crashing?
>
> --
> Thanks
> Mary Ellen
>
>