[mpich-discuss] mpi job crashing: rank 2 in job 7 caused collective abort of all ranks
Mary Ellen Fitzpatrick
mfitzpat at bu.edu
Fri Oct 31 14:06:28 CDT 2008
Hi,
If I log in to one of the compute nodes and run the following from the
command line:
mpiexec -n 16 dock6.mpi -i dock.in -o dock.out &> dock.log
it appears to run all 16 processes on that node with no rank errors.
If I add the above line to my PBS script and submit it (requesting 4
nodes with 4 CPUs each), I still get the errors below. One thing of note
in the output file is that all of the "rank abort" errors are listed for
the same node, node1047, i.e. more aborting ranks than that node has
processors (4 cores).
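For reference, the submission script looks roughly like this (the
walltime and working-directory lines are just typical boilerplate; the
relevant parts are the resource request and the mpiexec line):

#!/bin/bash
# Request 4 nodes with 4 processors each (16 MPI processes total).
#PBS -l nodes=4:ppn=4
#PBS -l walltime=04:00:00

# Run from the directory the job was submitted from.
cd $PBS_O_WORKDIR

# Same command that works when run by hand on a single node.
mpiexec -n 16 dock6.mpi -i dock.in -o dock.out &> dock.log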
Another thing I changed was the /etc/mpd.hosts file: I added the number
of CPUs for each node, e.g. node1048:4.
Prior to this I had the syntax for listing the number of CPUs wrong.
Now --ncpus=4 shows up in the mpd process on each node:
node1048: /usr/bin/python /usr/local/mpich2/bin/mpd.py -h node1010 -p
48118 --ncpus=4
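In case the mpd setup matters, the /etc/mpd.hosts entries now use the
hostname:ncpus form, one node per line (node names below are examples),
and the ring is brought up from the head node with something along the
lines of mpdboot:

node1047:4
node1048:4
node1049:4

mpdboot -n <number of mpds to start> -f /etc/mpd.hosts

The error output from the nodes=4:ppn=4 job is: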
Initializing MPI Routines...
Initializing MPI Routines...
rank 15 in job 2 node1047_33232 caused collective abort of all ranks
exit status of rank 15: killed by signal 9
rank 14 in job 2 node1047_33232 caused collective abort of all ranks
exit status of rank 14: return code 0
rank 13 in job 2 node1047_33232 caused collective abort of all ranks
exit status of rank 13: return code 0
rank 11 in job 2 node1047_33232 caused collective abort of all ranks
exit status of rank 11: return code 0
rank 10 in job 2 node1047_33232 caused collective abort of all ranks
exit status of rank 10: return code 0
-snippet-
Rajeev Thakur wrote:
> Does the code run elsewhere with 16 processes if you run it simply as
> "mpiexec -n 16 a.out"?
>
> Rajeev
>
>
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Mary
>> Ellen Fitzpatrick
>> Sent: Friday, October 31, 2008 10:53 AM
>> To: mpich-discuss at mcs.anl.gov; Mary Ellen Fitzpatrick
>> Subject: [mpich-discuss] mpi job crashing: rank 2 in job 7
>> caused collective abort of all ranks
>>
>> Hi,
>> I am trying to run MPI jobs on my cluster, and the job fails with:
>> "rank 2 in job 7 caused collective abort of all ranks".
>> It is a 48-node (dual dual-core CPU) Linux cluster running Torque
>> with Maui.
>> I have MPICH2 1.0.7 installed, with mpd's running on all nodes,
>> started from the head node as root.
>>
>> What I want to do is submit an MPI job that runs one process per
>> node and requests all 4 cores on the node, and to submit this one
>> process to 4 nodes.
>>
>> If I request 1 node with 4 processors in my PBS script (#PBS -l
>> nodes=1:ppn=4), it works fine: everything runs on one node with 4
>> CPUs, and the MPI output says everything ran perfectly. If I request
>> 4 nodes with 4 processors (#PBS -l nodes=4:ppn=4), it fails. My
>> Torque epilogue/prologue output file says the job ran on 4 nodes and
>> requested 16 processors, but the MPI output file has the following
>> error:
>>
>> -snippet-
>> Initializing MPI Routines...
>> Initializing MPI Routines...
>> Initializing MPI Routines...
>> rank 2 in job 7 node1043_55948 caused collective abort of all ranks
>> exit status of rank 2: killed by signal 9
>> rank 1 in job 7 node1043_55948 caused collective abort of all ranks
>> exit status of rank 1: killed by signal 9
>> -snippet -
>>
>> Does anyone know why my MPI job is crashing?
>>
>> --
>> Thanks
>> Mary Ellen
>>
>>
>>
>
>
>
--
Thanks
Mary Ellen