[mpich-discuss] mpi job crashing: rank 2 in job 7 caused collective abort of all ranks
Rajeev Thakur
thakur at mcs.anl.gov
Fri Oct 31 15:10:40 CDT 2008
Could you try logging in to node1047 and running all 16 processes manually
there?
Rajeev
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Mary
> Ellen Fitzpatrick
> Sent: Friday, October 31, 2008 2:06 PM
> To: mpich-discuss at mcs.anl.gov; Mary Ellen Fitzpatrick
> Subject: Re: [mpich-discuss] mpi job crashing: rank 2 in job
> 7 caused collective abort of all ranks
>
> Hi,
> If I log in to one of the compute nodes and run this from the command line:
> mpiexec -n 16 dock6.mpi -i dock.in -o dock.out &> dock.log
> it appears to run all 16 processes on that node, with no rank errors.
>
> If I add the above line to my pbs script and submit it (requesting 4 nodes
> with 4 CPUs each), then I still get the following errors. One thing of note
> in the output file is that all of the "rank abort" errors are listed for
> the same node, node1047, and there are more errors/ranks than processors
> (4 cores on the node).
>
> Another thing I changed was in the /etc/mpd.hosts file: I added the number
> of cpus for each node, e.g. node1048:4
> Prior to this I had the syntax wrong for listing the number of cpus.
> Now the mpd command line shows --ncpus=4 on the nodes:
> node1048: /usr/bin/python /usr/local/mpich2/bin/mpd.py -h node1010 -p 48118 --ncpus=4
>
> Initializing MPI Routines...
> Initializing MPI Routines...
> rank 15 in job 2 node1047_33232 caused collective abort of all ranks
> exit status of rank 15: killed by signal 9
> rank 14 in job 2 node1047_33232 caused collective abort of all ranks
> exit status of rank 14: return code 0
> rank 13 in job 2 node1047_33232 caused collective abort of all ranks
> exit status of rank 13: return code 0
> rank 11 in job 2 node1047_33232 caused collective abort of all ranks
> exit status of rank 11: return code 0
> rank 10 in job 2 node1047_33232 caused collective abort of all ranks
> exit status of rank 10: return code 0
> -snippet-
>
> Rajeev Thakur wrote:
> > Does the code run elsewhere with 16 processes if you run it simply as
> > "mpiexec -n 16 a.out"?
> >
> > Rajeev
> >
> >
> >> -----Original Message-----
> >> From: mpich-discuss-bounces at mcs.anl.gov
> >> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Mary
> >> Ellen Fitzpatrick
> >> Sent: Friday, October 31, 2008 10:53 AM
> >> To: mpich-discuss at mcs.anl.gov; Mary Ellen Fitzpatrick
> >> Subject: [mpich-discuss] mpi job crashing: rank 2 in job 7
> >> caused collective abort of all ranks
> >>
> >> Hi,
> >> I am trying to run mpi jobs on my cluster, and the job fails with:
> >> rank 2 in job 7 caused collective abort of all ranks
> >> The cluster has 48 Linux nodes (two dual-core cpus each), running Torque
> >> with Maui. I have MPICH2-1.07 installed, and mpd's running on all nodes,
> >> started from the head node as root.
> >>
> >> What I want to do is submit an mpi job that runs one process per node
> >> and requests all 4 cores on the node, and I want to submit this one
> >> process to 4 nodes.
> >>
> >> If I request 1 node with 4 processors in my pbs script
> >> (#PBS -l nodes=1:ppn=4), then it works fine: everything runs on one node
> >> with 4 cpus, and the mpi output says everything ran perfectly. If I
> >> request 4 nodes with 4 processors each (#PBS -l nodes=4:ppn=4), then it
> >> fails. My torque epilogue/prologue output file says the job ran on 4
> >> nodes and requested 16 processors.
> >> But the mpi output file has the following error:
> >>
> >> -snippet-
> >> Initializing MPI Routines...
> >> Initializing MPI Routines...
> >> Initializing MPI Routines...
> >> rank 2 in job 7 node1043_55948 caused collective abort of all ranks
> >> exit status of rank 2: killed by signal 9
> >> rank 1 in job 7 node1043_55948 caused collective abort of all ranks
> >> exit status of rank 1: killed by signal 9
> >> -snippet-
> >>
> >> Anyone know why my mpi job is crashing?
> >>
> >> --
> >> Thanks
> >> Mary Ellen
> >>
> >>
> >>
> >
> >
> >
>
> --
> Thanks
> Mary Ellen
>
>
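The abort messages quoted in this thread follow a regular pattern, so a long
mpiexec log can be tallied mechanically. The following is a minimal Python
sketch, not part of MPICH2: the function name summarize_aborts and the regular
expressions are assumptions based only on the lines quoted above, and other
MPICH2 versions may format these messages differently. It counts abort reports
per host_port token and records the exit status reported for each rank.

```python
import re
from collections import Counter

# Patterns inferred from the log lines quoted in this thread (an assumption,
# not an official MPICH2 format specification).
ABORT_RE = re.compile(r"rank (\d+) in job (\d+)\s+(\S+)\s+caused collective abort")
STATUS_RE = re.compile(r"exit status of rank (\d+): (.+)")
# Leading email quote markers such as "> " or "> >> ".
QUOTE_RE = re.compile(r"^(?:>+\s*)+")


def summarize_aborts(log_text):
    """Tally abort reports per host_port token and collect exit statuses."""
    tags = Counter()   # host_port token -> number of abort reports
    statuses = {}      # rank -> exit status text
    for raw in log_text.splitlines():
        line = QUOTE_RE.sub("", raw).strip()
        match = ABORT_RE.search(line)
        if match:
            tags[match.group(3)] += 1
            continue
        match = STATUS_RE.search(line)
        if match:
            statuses[int(match.group(1))] = match.group(2).strip()
    # Ranks whose exit status says they were terminated by a signal.
    killed = sorted(r for r, s in statuses.items()
                    if s.startswith("killed by signal"))
    return {"tags": dict(tags), "statuses": statuses, "killed": killed}


if __name__ == "__main__":
    sample = (
        "rank 15 in job 2 node1047_33232 caused collective abort of all ranks\n"
        "exit status of rank 15: killed by signal 9\n"
        "rank 14 in job 2 node1047_33232 caused collective abort of all ranks\n"
        "exit status of rank 14: return code 0\n"
    )
    print(summarize_aborts(sample))
```

In the second log excerpt above, only rank 15 is reported as killed by signal 9
(SIGKILL), while the other ranks shown report return code 0, so a summary like
this helps pick out the rank that was killed first from the ranks that were
shut down as part of the collective abort.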