[mpich-discuss] mpi job crashing: rank 2 in job 7 caused collective abort of all ranks

Rajeev Thakur thakur at mcs.anl.gov
Fri Oct 31 15:10:40 CDT 2008


Could you try logging in to node1047 and running all 16 processes manually
there?
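
For example, something along these lines (reusing the same executable and
input files from your script below):

    ssh node1047
    # same command you already run interactively; all 16 ranks on this one node
    mpiexec -n 16 dock6.mpi -i dock.in -o dock.out &> dock.log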

Rajeev 

> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov 
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Mary 
> Ellen Fitzpatrick
> Sent: Friday, October 31, 2008 2:06 PM
> To: mpich-discuss at mcs.anl.gov; Mary Ellen Fitzpatrick
> Subject: Re: [mpich-discuss] mpi job crashing: rank 2 in job 7 caused collective abort of all ranks
> 
> Hi,
> If I log in to one of the compute nodes and from the command line run:
>   mpiexec -n 16 dock6.mpi -i dock.in -o dock.out &> dock.log
> it appears to run all 16 processes on that node with no rank errors.
> 
> If I add the above line to my pbs script and submit it (requesting 4
> nodes with 4 CPUs each), then I still get the following errors.  One
> thing of note in the output file is that all of the "rank abort" errors
> are listed for the same node, node1047, with more aborted ranks than
> that node has processors (4 cores).
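> 
> For reference, the relevant part of my pbs script is roughly this (the
> cd to the submit directory is just how I happen to run it; the mpiexec
> line is exactly the one above):
> 
>   #PBS -l nodes=4:ppn=4
>   cd $PBS_O_WORKDIR    # (illustrative) run from the directory I submitted from
>   mpiexec -n 16 dock6.mpi -i dock.in -o dock.out &> dock.log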
> 
> Another thing I changed was the /etc/mpd.hosts file: I added the number
> of cpus for each node, i.e. node1048:4.  Prior to this I had the syntax
> for listing the number of cpus wrong.  Now --ncpus=4 shows up in the mpd
> command line on the nodes:
> node1048:  /usr/bin/python /usr/local/mpich2/bin/mpd.py -h node1010 -p
> 48118 --ncpus=4
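> 
> So /etc/mpd.hosts now has one line per node in this form (node1048 is a
> real node; the others here just stand in for the rest of the list):
> 
>   node1047:4
>   node1048:4
>   ...            (and so on, one entry per compute node)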
> 
> Initializing MPI Routines...
> Initializing MPI Routines...
> rank 15 in job 2  node1047_33232   caused collective abort of all ranks
>   exit status of rank 15: killed by signal 9
> rank 14 in job 2  node1047_33232   caused collective abort of all ranks
>   exit status of rank 14: return code 0
> rank 13 in job 2  node1047_33232   caused collective abort of all ranks
>   exit status of rank 13: return code 0
> rank 11 in job 2  node1047_33232   caused collective abort of all ranks
>   exit status of rank 11: return code 0
> rank 10 in job 2  node1047_33232   caused collective abort of all ranks
>   exit status of rank 10: return code 0
> -snippet-
> 
> Rajeev Thakur wrote:
> > Does the code run elsewhere with 16 processes if you run it simply as
> > "mpiexec -n 16 a.out"?
> >
> > Rajeev 
> >
> >   
> >> -----Original Message-----
> >> From: mpich-discuss-bounces at mcs.anl.gov 
> >> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Mary 
> >> Ellen Fitzpatrick
> >> Sent: Friday, October 31, 2008 10:53 AM
> >> To: mpich-discuss at mcs.anl.gov; Mary Ellen Fitzpatrick
> >> Subject: [mpich-discuss] mpi job crashing: rank 2 in job 7 
> >> caused collective abort of all ranks
> >>
> >> Hi,
> >> Trying to run mpi jobs on my cluster, I get "rank 2 in job 7 caused
> >> collective abort of all ranks" and the mpi job fails.
> >> It is a 48-node (dual dual-core cpu) Linux cluster, running Torque
> >> with Maui.  I have MPICH2-1.0.7 installed, and mpd's running on all
> >> nodes, started from the head node as root.
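> >>
> >> (Roughly, the ring is brought up from the head node with something like
> >>   mpdboot -n 48 -f /etc/mpd.hosts
> >> where the node count and hosts file are the ones mentioned above.)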
> >>
> >> What I want to do is submit an mpi job that runs one process per
> >> node, requesting all 4 cores on the node, and submit that one
> >> process to 4 nodes.
> >>
> >> If I request 1 node with 4 processors in my pbs script (#PBS -l
> >> nodes=1:ppn=4), then it works fine: everything runs on one node with
> >> 4 cpus, and the mpi output says everything ran perfectly.  If I
> >> request 4 nodes with 4 processors (#PBS -l nodes=4:ppn=4), then it
> >> fails; my torque epilogue/prologue output file says the job ran on 4
> >> nodes and requested 16 processors.
> >> But the mpi output file has the following error:
> >>
> >> -snippet-
> >> Initializing MPI Routines...
> >> Initializing MPI Routines...
> >> Initializing MPI Routines...
> >> rank 2 in job 7  node1043_55948   caused collective abort of all ranks
> >>   exit status of rank 2: killed by signal 9
> >> rank 1 in job 7  node1043_55948   caused collective abort of all ranks
> >>   exit status of rank 1: killed by signal 9
> >> -snippet -
> >>
> >> Anyone know why my mpi job is crashing?
> >>
> >> -- 
> >> Thanks
> >> Mary Ellen
> >>
> >>
> >>     
> >
> >
> >   
> 
> -- 
> Thanks
> Mary Ellen
> 
> 



