[mpich-discuss] mpi job crashing: rank 2 in job 7 caused collective abort of all ranks: RESOLVED
Mary Ellen Fitzpatrick
mfitzpat at bu.edu
Mon Nov 3 13:51:54 CST 2008
I was able to resolve my issue by adding the mpiexec option
-machinefile $PBS_NODEFILE to my command.
From within my PBS script I was running the following, and it would give
the rank abort error:
mpiexec -n $NP dock6.mpi -i dock.in -o dock.out &> dock.log
I added the option "-machinefile $PBS_NODEFILE" right after the call to
mpiexec and it worked:
mpiexec -machinefile $PBS_NODEFILE -n $NP dock6.mpi -i dock.in -o
dock.out &> dock.log
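For reference, the relevant part of my PBS script now looks roughly like
the sketch below. The job name is made up, and computing $NP by counting
the lines in $PBS_NODEFILE is just one way to do it; adjust for your own
site:

#!/bin/bash
#PBS -N dock6_run
#PBS -l nodes=4:ppn=4
# Torque writes one line per allocated core into $PBS_NODEFILE,
# so counting its lines gives the number of MPI processes to start.
NP=$(wc -l < $PBS_NODEFILE)
cd $PBS_O_WORKDIR
# Point mpiexec at the nodes Torque actually allocated, not /etc/mpd.hosts.
mpiexec -machinefile $PBS_NODEFILE -n $NP dock6.mpi -i dock.in -o dock.out &> dock.log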
My error (well, one of them anyway :) ) was that because I had the
/etc/mpd.hosts file on each node with the node:ppn list, I thought
it was being read by Torque/Maui. But apparently not: the PBS script
uses the node list from $PBS_NODEFILE instead.
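In case it helps anyone else: $PBS_NODEFILE is generated by Torque for
each job and, as far as I can tell, lists every allocated node once per
core. So inside a nodes=4:ppn=4 job, running

cat $PBS_NODEFILE

prints something like this (hostnames here are made up):

node1043
node1043
node1043
node1043
node1044
node1044
node1044
node1044
(and so on for the remaining two nodes)

/etc/mpd.hosts, as far as I understand it, is only read when the mpd ring
is booted with mpdboot; mpiexec itself does not consult it, which is why
the -machinefile option was needed to match the PBS allocation.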
Thanks to all who responded and I hope this info is helpful to others.
Mary Ellen
Fitzpatrick, Mary Ellen wrote:
> Logging into node1047, I enter the following command 16 times manually:
> mpiexec dock6.mpi -i dock.in -o dock.out &> dock.log &
>
> I view top and all of the processes that I requested are running.
> I have started the compute node mpds from my head node via the command:
> mpdboot -n 47 -f /etc/mpd.hosts
>
> Should I be starting the mpds on the compute nodes as part of my PBS script?
> Thanks
>
>
> Rajeev Thakur wrote:
>
>> Could you try logging in to node1047 and running all 16 processes manually
>> there?
>>
>> Rajeev
>>
>>
>>
>>> -----Original Message-----
>>> From: mpich-discuss-bounces at mcs.anl.gov
>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Mary
>>> Ellen Fitzpatrick
>>> Sent: Friday, October 31, 2008 2:06 PM
>>> To: mpich-discuss at mcs.anl.gov; Mary Ellen Fitzpatrick
>>> Subject: Re: [mpich-discuss] mpi job crashing: rank 2 in job
>>> 7 caused collective abort of all ranks
>>>
>>> Hi,
>>> If I loging into one of the compute nodes and from the command line
>>> run: mpiexec -n 16 dock6.mpi -i dock.in -o dock.out &> dock.log
>>> It appears to run all 16 processes on that node and no rank errors.
>>>
>>> If I add the above line to my PBS script and submit it (requesting 4
>>> nodes with 4 CPUs), then I still get the following errors. One thing
>>> of note in the output file is that all of the "rank abort" errors are
>>> listed for the same node, node1047: more errors/ranks than processors
>>> (4 cores on the node).
>>>
>>> Another thing I changed was the /etc/mpd.hosts file: I added the
>>> number of cpus for each node, i.e. node1048:4.
>>> Prior to this I had the syntax wrong for listing the number of cpus.
>>> Now the mpd on each node shows --ncpus=4:
>>> node1048: /usr/bin/python /usr/local/mpich2/bin/mpd.py -h node1010 -p 48118 --ncpus=4
>>>
>>> Initializing MPI Routines...
>>> Initializing MPI Routines...
>>> rank 15 in job 2 node1047_33232 caused collective abort of all ranks
>>> exit status of rank 15: killed by signal 9
>>> rank 14 in job 2 node1047_33232 caused collective abort of all ranks
>>> exit status of rank 14: return code 0
>>> rank 13 in job 2 node1047_33232 caused collective abort of all ranks
>>> exit status of rank 13: return code 0
>>> rank 11 in job 2 node1047_33232 caused collective abort of all ranks
>>> exit status of rank 11: return code 0
>>> rank 10 in job 2 node1047_33232 caused collective abort of all ranks
>>> exit status of rank 10: return code 0
>>> -snippet-
>>>
>>> Rajeev Thakur wrote:
>>>
>>>
>>>> Does the code run elsewhere with 16 processes if you run it simply as
>>>> "mpiexec -n 16 a.out"?
>>>>
>>>> Rajeev
>>>>
>>>>
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: mpich-discuss-bounces at mcs.anl.gov
>>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Mary
>>>>> Ellen Fitzpatrick
>>>>> Sent: Friday, October 31, 2008 10:53 AM
>>>>> To: mpich-discuss at mcs.anl.gov; Mary Ellen Fitzpatrick
>>>>> Subject: [mpich-discuss] mpi job crashing: rank 2 in job 7
>>>>> caused collective abort of all ranks
>>>>>
>>>>> Hi,
>>>>> Trying to run MPI jobs on my cluster; I get "rank 2 in job 7 caused
>>>>> collective abort of all ranks" and the MPI job fails.
>>>>> It is a 48-node (dual dual-core CPU) Linux cluster, running Torque
>>>>> with Maui, with MPICH2 1.0.7 installed. I have mpds running on all
>>>>> nodes, started from the head node as root.
>>>>>
>>>>> What I want to do is submit an MPI job that runs one process per
>>>>> node, requesting all 4 cores on the node, and submit this one
>>>>> process to 4 nodes.
>>>>>
>>>>> If I request in my PBS script 1 node with 4 processors
>>>>> (#PBS -l nodes=1:ppn=4), then it works fine: everything runs on one
>>>>> node with 4 cpus, and the MPI output says everything ran perfectly.
>>>>> If I request in my PBS script 4 nodes with 4 processors
>>>>> (#PBS -l nodes=4:ppn=4), then it fails; my Torque prologue/epilogue
>>>>> output file says the job ran on 4 nodes and requested 16 processors.
>>>>> But the MPI output file has the following error:
>>>>>
>>>>> -snippet-
>>>>> Initializing MPI Routines...
>>>>> Initializing MPI Routines...
>>>>> Initializing MPI Routines...
>>>>> rank 2 in job 7 node1043_55948 caused collective abort of all ranks
>>>>> exit status of rank 2: killed by signal 9
>>>>> rank 1 in job 7 node1043_55948 caused collective abort of all ranks
>>>>> exit status of rank 1: killed by signal 9
>>>>> -snippet -
>>>>>
>>>>> Anyone know why my mpi job is crashing?
>>>>>
>>>>> --
>>>>> Thanks
>>>>> Mary Ellen
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>> --
>>> Thanks
>>> Mary Ellen
>>>
>>>
>>>
>>>
>>
>>
>
>
--
Thanks
Mary Ellen