[mpich-discuss] mpi job crashing: rank 2 in job 7 caused collective abort of all ranks: RESOLVED
Mary Ellen Fitzpatrick
mfitzpat at bu.edu
Mon Nov 3 13:51:54 CST 2008
I was able to resolve my issue by adding the mpiexec option
-machinefile $PBS_NODEFILE to my command.
From within my PBS script I was running the following, and it would give
the rank abort error:
mpiexec -n $NP dock6.mpi -i dock.in -o dock.out &> dock.log
I added the option "-machinefile $PBS_NODEFILE" right after the call to
mpiexec and it worked:
mpiexec -machinefile $PBS_NODEFILE -n $NP dock6.mpi -i dock.in -o
dock.out &> dock.log
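For reference, the relevant part of my PBS script now looks roughly like
the sketch below. The job name is made up, and computing $NP by counting
the lines in $PBS_NODEFILE is just one way to do it; adjust for your own
site:

#!/bin/bash
#PBS -N dock6_run
#PBS -l nodes=4:ppn=4
# Torque writes one line per allocated core into $PBS_NODEFILE,
# so counting its lines gives the number of MPI processes to start.
NP=$(wc -l < $PBS_NODEFILE)
cd $PBS_O_WORKDIR
# Point mpiexec at the nodes Torque actually allocated, not /etc/mpd.hosts.
mpiexec -machinefile $PBS_NODEFILE -n $NP dock6.mpi -i dock.in -o dock.out &> dock.log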
My error (well, one of them anyway :) ) was that because I had the
/etc/mpd.hosts file on each node with the node:ppn list, I thought
it was being read by Torque/Maui. But apparently not: the PBS script
uses the node list from $PBS_NODEFILE instead.
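In case it helps anyone else: $PBS_NODEFILE is generated by Torque for
each job and, as far as I can tell, lists every allocated node once per
core. So inside a nodes=4:ppn=4 job, running

cat $PBS_NODEFILE

prints something like this (hostnames here are made up):

node1043
node1043
node1043
node1043
node1044
node1044
node1044
node1044
(and so on for the remaining two nodes)

/etc/mpd.hosts, as far as I understand it, is only read when the mpd ring
is booted with mpdboot; mpiexec itself does not consult it, which is why
the -machinefile option was needed to match the PBS allocation.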
Thanks to all who responded and I hope this info is helpful to others.
Mary Ellen
Fitzpatrick, Mary Ellen wrote:
> Logging into node1047, I enter the following command 16 times manually:
> mpiexec dock6.mpi -i dock.in -o dock.out &> dock.log &
>
> I view top and all of the processes that I requested are running.
> I have started the compute node mpds from my head node via the command:
> mpdboot -n 47 -f /etc/mpd.hosts
>
> Should I be starting the mpds on the compute nodes as part of my PBS script?
> Thanks
>
>
> Rajeev Thakur wrote:
>
>> Could you try logging in to node1047 and running all 16 processes manually
>> there?
>>
>> Rajeev
>>
>>
>>
>>> -----Original Message-----
>>> From: mpich-discuss-bounces at mcs.anl.gov
>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Mary
>>> Ellen Fitzpatrick
>>> Sent: Friday, October 31, 2008 2:06 PM
>>> To: mpich-discuss at mcs.anl.gov; Mary Ellen Fitzpatrick
>>> Subject: Re: [mpich-discuss] mpi job crashing: rank 2 in job
>>> 7 caused collective abort of all ranks
>>>
>>> Hi,
>>> If I loging into one of the compute nodes and from the command line
>>> run: mpiexec -n 16 dock6.mpi -i dock.in -o dock.out &> dock.log
>>> It appears to run all 16 processes on that node and no rank errors.
>>>
>>> If I add the above line to my PBS script and submit it (requesting 4
>>> nodes with 4 CPUs), then I still get the following errors. One thing
>>> of note in the output file is that all of the "rank abort" errors are
>>> listed for the same node, node1047: more errors/ranks than processors
>>> (4 cores on the node).
>>>
>>> Another thing I changed was the /etc/mpd.hosts file: I added the
>>> number of cpus for each node, i.e. node1048:4.
>>> Prior to this I had the syntax wrong for listing the number of cpus.
>>> Now the mpd on each node shows --ncpus=4:
>>> node1048: /usr/bin/python /usr/local/mpich2/bin/mpd.py -h node1010 -p 48118 --ncpus=4
>>>
>>> Initializing MPI Routines...
>>> Initializing MPI Routines...
>>> rank 15 in job 2 node1047_33232 caused collective abort of all ranks
>>> exit status of rank 15: killed by signal 9
>>> rank 14 in job 2 node1047_33232 caused collective abort of all ranks
>>> exit status of rank 14: return code 0
>>> rank 13 in job 2 node1047_33232 caused collective abort of all ranks
>>> exit status of rank 13: return code 0
>>> rank 11 in job 2 node1047_33232 caused collective abort of all ranks
>>> exit status of rank 11: return code 0
>>> rank 10 in job 2 node1047_33232 caused collective abort of all ranks
>>> exit status of rank 10: return code 0
>>> -snippet-
>>>
>>> Rajeev Thakur wrote:
>>>
>>>
>>>> Does the code run elsewhere with 16 processes if you run it simply as
>>>> "mpiexec -n 16 a.out"?
>>>>
>>>> Rajeev
>>>>
>>>>
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: mpich-discuss-bounces at mcs.anl.gov
>>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Mary
>>>>> Ellen Fitzpatrick
>>>>> Sent: Friday, October 31, 2008 10:53 AM
>>>>> To: mpich-discuss at mcs.anl.gov; Mary Ellen Fitzpatrick
>>>>> Subject: [mpich-discuss] mpi job crashing: rank 2 in job 7
>>>>> caused collective abort of all ranks
>>>>>
>>>>> Hi,
>>>>> Trying to run MPI jobs on my cluster; I get "rank 2 in job 7 caused
>>>>> collective abort of all ranks" and the MPI job fails.
>>>>> It is a 48-node (dual dual-core CPU) Linux cluster, running Torque
>>>>> with Maui, with MPICH2 1.0.7 installed. I have mpds running on all
>>>>> nodes, started from the head node as root.
>>>>>
>>>>> What I want to do is submit an MPI job that runs one process per
>>>>> node, requesting all 4 cores on the node, and submit this one
>>>>> process to 4 nodes.
>>>>>
>>>>> If I request in my PBS script 1 node with 4 processors
>>>>> (#PBS -l nodes=1:ppn=4), then it works fine: everything runs on one
>>>>> node with 4 cpus, and the MPI output says everything ran perfectly.
>>>>> If I request in my PBS script 4 nodes with 4 processors
>>>>> (#PBS -l nodes=4:ppn=4), then it fails; my Torque prologue/epilogue
>>>>> output file says the job ran on 4 nodes and requested 16 processors.
>>>>> But the MPI output file has the following error:
>>>>>
>>>>> -snippet-
>>>>> Initializing MPI Routines...
>>>>> Initializing MPI Routines...
>>>>> Initializing MPI Routines...
>>>>> rank 2 in job 7 node1043_55948 caused collective abort of all ranks
>>>>> exit status of rank 2: killed by signal 9
>>>>> rank 1 in job 7 node1043_55948 caused collective abort of all ranks
>>>>> exit status of rank 1: killed by signal 9
>>>>> -snippet -
>>>>>
>>>>> Anyone know why my mpi job is crashing?
>>>>>
>>>>> --
>>>>> Thanks
>>>>> Mary Ellen
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>> --
>>> Thanks
>>> Mary Ellen
>>>
>>>
>>>
>>>
>>
>>
>
>
--
Thanks
Mary Ellen