[mpich-discuss] mpi job crashing: rank 2 in job 7 caused collective abort of all ranks
Mary Ellen Fitzpatrick
mfitzpat at bu.edu
Mon Nov 3 08:49:24 CST 2008
Logging into node1047, I enter the following command 16 times manually:
mpiexec dock6.mpi -i dock.in -o dock.out &> dock.log &
In top I can see that all of the processes I requested are running.
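For what it's worth, each of those invocations presumably starts its own
single-rank job (mpiexec defaults to -n 1), so a single 16-rank run on the
node would look something like:
mpiexec -n 16 dock6.mpi -i dock.in -o dock.out &> dock.log &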
I have started the compute node mpds from my head node via the command:
mpdboot -n 47 -f /etc/mpd.hosts
Should I be starting the mpds on the compute nodes as part of my pbs script?
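Something like the following is what I had in mind for the pbs script
(untested sketch; it assumes mpdboot and mpiexec are on the PATH on the
compute nodes and that the mpd ring is booted per job from $PBS_NODEFILE
rather than at boot time):
#!/bin/bash
#PBS -l nodes=4:ppn=4
cd $PBS_O_WORKDIR
# collapse Torque's node file (one line per core) to one line per node
sort -u $PBS_NODEFILE > mpd.hosts.$PBS_JOBID
# boot one mpd per allocated node, run the job, then tear the ring down
mpdboot -n $(wc -l < mpd.hosts.$PBS_JOBID) -f mpd.hosts.$PBS_JOBID
mpiexec -n 16 dock6.mpi -i dock.in -o dock.out &> dock.log
mpdallexit
rm -f mpd.hosts.$PBS_JOBID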
Thanks
Rajeev Thakur wrote:
> Could you try logging in to node1047 and running all 16 processes manually
> there?
>
> Rajeev
>
>
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Mary
>> Ellen Fitzpatrick
>> Sent: Friday, October 31, 2008 2:06 PM
>> To: mpich-discuss at mcs.anl.gov; Mary Ellen Fitzpatrick
>> Subject: Re: [mpich-discuss] mpi job crashing: rank 2 in job 7 caused
>> collective abort of all ranks
>>
>> Hi,
>> If I log into one of the compute nodes and from the command line
>> run: mpiexec -n 16 dock6.mpi -i dock.in -o dock.out &> dock.log
>> it appears to run all 16 processes on that node with no rank errors.
>>
>> If I add the above line to my pbs script and submit it (requesting 4
>> nodes with 4 CPUs), then I still get the following errors. One thing of
>> note in the output file is that all of the "rank abort" errors are
>> listed for the same node, node1047: more ranks reported than the node
>> has processors (4 cores).
>>
>> Another thing I changed: in the /etc/mpd.hosts file I added the number
>> of cpus for each node, i.e. node1048:4. Prior to this I had the syntax
>> wrong for listing the number of cpus. Now --ncpus=4 shows up in the mpd
>> processes on the nodes:
>> node1048: /usr/bin/python /usr/local/mpich2/bin/mpd.py -h node1010 -p 48118 --ncpus=4
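>>
>> For the record, the whole /etc/mpd.hosts now looks roughly like this
>> (hostnames are just examples, one line per node):
>> node1047:4
>> node1048:4
>> node1049:4
>> node1050:4
>> and mpdtrace -l run from the head node shows one mpd per node.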
>>
>> Initializing MPI Routines...
>> Initializing MPI Routines...
>> rank 15 in job 2 node1047_33232 caused collective abort of all ranks
>> exit status of rank 15: killed by signal 9
>> rank 14 in job 2 node1047_33232 caused collective abort of all ranks
>> exit status of rank 14: return code 0
>> rank 13 in job 2 node1047_33232 caused collective abort of all ranks
>> exit status of rank 13: return code 0
>> rank 11 in job 2 node1047_33232 caused collective abort of all ranks
>> exit status of rank 11: return code 0
>> rank 10 in job 2 node1047_33232 caused collective abort of all ranks
>> exit status of rank 10: return code 0
>> -snippet-
>>
>> Rajeev Thakur wrote:
>>
>>> Does the code run elsewhere with 16 processes if you run it simply as
>>> "mpiexec -n 16 a.out"?
>>>
>>> Rajeev
>>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: mpich-discuss-bounces at mcs.anl.gov
>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Mary
>>>> Ellen Fitzpatrick
>>>> Sent: Friday, October 31, 2008 10:53 AM
>>>> To: mpich-discuss at mcs.anl.gov; Mary Ellen Fitzpatrick
>>>> Subject: [mpich-discuss] mpi job crashing: rank 2 in job 7
>>>> caused collective abort of all ranks
>>>>
>>>> Hi,
>>>> Trying to run mpi jobs on my cluster, and I get "rank 2 in job 7
>>>> caused collective abort of all ranks"; the mpi job fails.
>>>> 48-node (dual dual-core cpu) Linux cluster, running Torque with Maui.
>>>> MPICH2-1.0.7 is installed. I have mpd's running on all nodes, started
>>>> from the head node as root.
>>>>
>>>> What I want to do is submit an mpi job that runs one process per
>>>> node, requesting all 4 cores on the node, and I want to submit this
>>>> one process to 4 nodes.
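>>>>
>>>> Something along these lines is what I am aiming for (untested sketch;
>>>> nodes.uniq is just a scratch file name I made up):
>>>> #PBS -l nodes=4:ppn=4
>>>> cd $PBS_O_WORKDIR
>>>> # one rank per node: collapse Torque's node file to unique hosts
>>>> sort -u $PBS_NODEFILE > nodes.uniq
>>>> mpiexec -machinefile nodes.uniq -n 4 dock6.mpi -i dock.in -o dock.out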
>>>>
>>>> If I request 1 node with 4 processors in my pbs script, then it works
>>>> fine: #PBS -l nodes=1:ppn=4, everything runs on one node with 4 cpus,
>>>> and the mpi output says everything ran perfectly. If I request 4
>>>> nodes with 4 processors, #PBS -l nodes=4:ppn=4, then it fails; my
>>>> torque epilogue/prologue output file says the job ran on 4 nodes and
>>>> requested 16 processors.
>>>> But the mpi output file has the following error:
>>>>
>>>> -snippet-
>>>> Initializing MPI Routines...
>>>> Initializing MPI Routines...
>>>> Initializing MPI Routines...
>>>> rank 2 in job 7 node1043_55948 caused collective abort of all ranks
>>>> exit status of rank 2: killed by signal 9
>>>> rank 1 in job 7 node1043_55948 caused collective abort of all ranks
>>>> exit status of rank 1: killed by signal 9
>>>> -snippet -
>>>>
>>>> Anyone know why my mpi job is crashing?
>>>>
>>>> --
>>>> Thanks
>>>> Mary Ellen
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>> --
>> Thanks
>> Mary Ellen
>>
>>
>>
>
>
>
--
Thanks
Mary Ellen