[mpich-discuss] mpi job crashing: rank 2 in job 7 caused collective abort of all ranks
Mary Ellen Fitzpatrick
mfitzpat at bu.edu
Mon Nov 3 08:49:24 CST 2008
Logging into node1047, I enter the following command 16 times manually:
mpiexec dock6.mpi -i dock.in -o dock.out &> dock.log &
In top I can see that all of the processes I requested are running.
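For what it's worth, each of those invocations presumably starts its own
single-rank job (mpiexec defaults to -n 1), so a single 16-rank run on the
node would look something like:
mpiexec -n 16 dock6.mpi -i dock.in -o dock.out &> dock.log &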
I have started the compute node mpds from my head node via the command:
mpdboot -n 47 -f /etc/mpd.hosts
Should I be starting the mpds on the compute nodes as part of my pbs script?
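Something like the following is what I had in mind for the pbs script
(untested sketch; it assumes mpdboot and mpiexec are on the PATH on the
compute nodes and that the mpd ring is booted per job from $PBS_NODEFILE
rather than at boot time):
#!/bin/bash
#PBS -l nodes=4:ppn=4
cd $PBS_O_WORKDIR
# collapse Torque's node file (one line per core) to one line per node
sort -u $PBS_NODEFILE > mpd.hosts.$PBS_JOBID
# boot one mpd per allocated node, run the job, then tear the ring down
mpdboot -n $(wc -l < mpd.hosts.$PBS_JOBID) -f mpd.hosts.$PBS_JOBID
mpiexec -n 16 dock6.mpi -i dock.in -o dock.out &> dock.log
mpdallexit
rm -f mpd.hosts.$PBS_JOBID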
Thanks
Rajeev Thakur wrote:
> Could you try logging in to node1047 and running all 16 processes manually
> there?
>
> Rajeev
>
>
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Mary
>> Ellen Fitzpatrick
>> Sent: Friday, October 31, 2008 2:06 PM
>> To: mpich-discuss at mcs.anl.gov; Mary Ellen Fitzpatrick
>> Subject: Re: [mpich-discuss] mpi job crashing: rank 2 in job 7 caused
>> collective abort of all ranks
>>
>> Hi,
>> If I log into one of the compute nodes and from the command line
>> run: mpiexec -n 16 dock6.mpi -i dock.in -o dock.out &> dock.log
>> it appears to run all 16 processes on that node with no rank errors.
>>
>> If I add the above line to my pbs script and submit it (requesting 4
>> nodes with 4 CPUs), then I still get the following errors. One thing of
>> note in the output file is that all of the "rank abort" errors are
>> listed for the same node, node1047: more ranks reported than the node
>> has processors (4 cores).
>>
>> Another thing I changed: in the /etc/mpd.hosts file I added the number
>> of cpus for each node, i.e. node1048:4. Prior to this I had the syntax
>> wrong for listing the number of cpus. Now --ncpus=4 shows up in the mpd
>> processes on the nodes:
>> node1048: /usr/bin/python /usr/local/mpich2/bin/mpd.py -h node1010 -p 48118 --ncpus=4
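>>
>> For the record, the whole /etc/mpd.hosts now looks roughly like this
>> (hostnames are just examples, one line per node):
>> node1047:4
>> node1048:4
>> node1049:4
>> node1050:4
>> and mpdtrace -l run from the head node shows one mpd per node.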
>>
>> Initializing MPI Routines...
>> Initializing MPI Routines...
>> rank 15 in job 2 node1047_33232 caused collective abort of all ranks
>> exit status of rank 15: killed by signal 9
>> rank 14 in job 2 node1047_33232 caused collective abort of all ranks
>> exit status of rank 14: return code 0
>> rank 13 in job 2 node1047_33232 caused collective abort of all ranks
>> exit status of rank 13: return code 0
>> rank 11 in job 2 node1047_33232 caused collective abort of all ranks
>> exit status of rank 11: return code 0
>> rank 10 in job 2 node1047_33232 caused collective abort of all ranks
>> exit status of rank 10: return code 0
>> -snippet-
>>
>> Rajeev Thakur wrote:
>>
>>> Does the code run elsewhere with 16 processes if you run it simply as
>>> "mpiexec -n 16 a.out"?
>>>
>>> Rajeev
>>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: mpich-discuss-bounces at mcs.anl.gov
>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Mary
>>>> Ellen Fitzpatrick
>>>> Sent: Friday, October 31, 2008 10:53 AM
>>>> To: mpich-discuss at mcs.anl.gov; Mary Ellen Fitzpatrick
>>>> Subject: [mpich-discuss] mpi job crashing: rank 2 in job 7
>>>> caused collective abort of all ranks
>>>>
>>>> Hi,
>>>> Trying to run mpi jobs on my cluster, and I get "rank 2 in job 7
>>>> caused collective abort of all ranks"; the mpi job fails.
>>>> 48-node (dual dual-core cpu) Linux cluster, running Torque with Maui.
>>>> MPICH2-1.0.7 is installed. I have mpd's running on all nodes, started
>>>> from the head node as root.
>>>>
>>>> What I want to do is submit an mpi job that runs one process per
>>>> node, requesting all 4 cores on the node, and I want to submit this
>>>> one process to 4 nodes.
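>>>>
>>>> Something along these lines is what I am aiming for (untested sketch;
>>>> nodes.uniq is just a scratch file name I made up):
>>>> #PBS -l nodes=4:ppn=4
>>>> cd $PBS_O_WORKDIR
>>>> # one rank per node: collapse Torque's node file to unique hosts
>>>> sort -u $PBS_NODEFILE > nodes.uniq
>>>> mpiexec -machinefile nodes.uniq -n 4 dock6.mpi -i dock.in -o dock.out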
>>>>
>>>> If I request 1 node with 4 processors in my pbs script, then it works
>>>> fine: #PBS -l nodes=1:ppn=4, everything runs on one node with 4 cpus,
>>>> and the mpi output says everything ran perfectly. If I request 4
>>>> nodes with 4 processors, #PBS -l nodes=4:ppn=4, then it fails; my
>>>> torque epilogue/prologue output file says the job ran on 4 nodes and
>>>> requested 16 processors.
>>>> But the mpi output file has the following error:
>>>>
>>>> -snippet-
>>>> Initializing MPI Routines...
>>>> Initializing MPI Routines...
>>>> Initializing MPI Routines...
>>>> rank 2 in job 7 node1043_55948 caused collective abort of all ranks
>>>> exit status of rank 2: killed by signal 9
>>>> rank 1 in job 7 node1043_55948 caused collective abort of all ranks
>>>> exit status of rank 1: killed by signal 9
>>>> -snippet -
>>>>
>>>> Anyone know why my mpi job is crashing?
>>>>
>>>> --
>>>> Thanks
>>>> Mary Ellen
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>> --
>> Thanks
>> Mary Ellen
>>
>>
>>
>
>
>
--
Thanks
Mary Ellen