[mpich-discuss] mpi job crashing: rank 2 in job 7 caused collective abort of all ranks
Mary Ellen Fitzpatrick
mfitzpat at bu.edu
Fri Oct 31 14:06:28 CDT 2008
Hi,
If I log in to one of the compute nodes and run the following from the
command line:
mpiexec -n 16 dock6.mpi -i dock.in -o dock.out &> dock.log
it appears to run all 16 processes on that node with no rank errors.
If I add the above line to my PBS script and submit it (requesting 4
nodes with 4 CPUs each), I still get the errors below. One thing of note
in the output file is that all of the "rank abort" errors are listed for
the same node, node1047, i.e. more aborting ranks than that node has
processors (4 cores).
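For reference, the submission script looks roughly like this (the
walltime and working-directory lines are just typical boilerplate; the
relevant parts are the resource request and the mpiexec line):

#!/bin/bash
# Request 4 nodes with 4 processors each (16 MPI processes total).
#PBS -l nodes=4:ppn=4
#PBS -l walltime=04:00:00

# Run from the directory the job was submitted from.
cd $PBS_O_WORKDIR

# Same command that works when run by hand on a single node.
mpiexec -n 16 dock6.mpi -i dock.in -o dock.out &> dock.log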
Another thing I changed was the /etc/mpd.hosts file: I added the number
of CPUs for each node, e.g. node1048:4.
Prior to this I had the syntax for listing the number of CPUs wrong.
Now --ncpus=4 shows up in the mpd process on each node:
node1048: /usr/bin/python /usr/local/mpich2/bin/mpd.py -h node1010 -p
48118 --ncpus=4
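In case the mpd setup matters, the /etc/mpd.hosts entries now use the
hostname:ncpus form, one node per line (node names below are examples),
and the ring is brought up from the head node with something along the
lines of mpdboot:

node1047:4
node1048:4
node1049:4

mpdboot -n <number of mpds to start> -f /etc/mpd.hosts

The error output from the nodes=4:ppn=4 job is: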
Initializing MPI Routines...
Initializing MPI Routines...
rank 15 in job 2 node1047_33232 caused collective abort of all ranks
exit status of rank 15: killed by signal 9
rank 14 in job 2 node1047_33232 caused collective abort of all ranks
exit status of rank 14: return code 0
rank 13 in job 2 node1047_33232 caused collective abort of all ranks
exit status of rank 13: return code 0
rank 11 in job 2 node1047_33232 caused collective abort of all ranks
exit status of rank 11: return code 0
rank 10 in job 2 node1047_33232 caused collective abort of all ranks
exit status of rank 10: return code 0
-snippet-
Rajeev Thakur wrote:
> Does the code run elsewhere with 16 processes if you run it simply as
> "mpiexec -n 16 a.out"?
>
> Rajeev
>
>
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Mary
>> Ellen Fitzpatrick
>> Sent: Friday, October 31, 2008 10:53 AM
>> To: mpich-discuss at mcs.anl.gov; Mary Ellen Fitzpatrick
>> Subject: [mpich-discuss] mpi job crashing: rank 2 in job 7
>> caused collective abort of all ranks
>>
>> Hi,
>> I am trying to run MPI jobs on my cluster, and the job fails with:
>> "rank 2 in job 7 caused collective abort of all ranks".
>> It is a 48-node (dual dual-core CPU) Linux cluster running Torque
>> with Maui.
>> I have MPICH2 1.0.7 installed, with mpd's running on all nodes,
>> started from the head node as root.
>>
>> What I want to do is submit an MPI job that runs one process per
>> node and requests all 4 cores on the node, and to submit this one
>> process to 4 nodes.
>>
>> If I request 1 node with 4 processors in my PBS script (#PBS -l
>> nodes=1:ppn=4), it works fine: everything runs on one node with 4
>> CPUs, and the MPI output says everything ran perfectly. If I request
>> 4 nodes with 4 processors (#PBS -l nodes=4:ppn=4), it fails. My
>> Torque epilogue/prologue output file says the job ran on 4 nodes and
>> requested 16 processors, but the MPI output file has the following
>> error:
>>
>> -snippet-
>> Initializing MPI Routines...
>> Initializing MPI Routines...
>> Initializing MPI Routines...
>> rank 2 in job 7 node1043_55948 caused collective abort of all ranks
>> exit status of rank 2: killed by signal 9
>> rank 1 in job 7 node1043_55948 caused collective abort of all ranks
>> exit status of rank 1: killed by signal 9
>> -snippet -
>>
>> Does anyone know why my MPI job is crashing?
>>
>> --
>> Thanks
>> Mary Ellen
>>
>>
>>
>
>
>
--
Thanks
Mary Ellen