[mpich-discuss] MPD in the PBS environment

Rajeev Thakur thakur at mcs.anl.gov
Thu Feb 12 22:45:15 CST 2009


Anne,
     If you are using MPICH2 with PBS, you may want to consider using the
mpiexec for PBS developed by Pete Wyckoff:
http://www.osc.edu/~pw/mpiexec/index.php . You don't need to use MPD at all
if you use that.

Rajeev 

> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov 
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Anne M. Hammond
> Sent: Thursday, February 12, 2009 1:42 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] MPD in the PBS environment
>
> Thanks Anthony.  Although NNODES was defined, it was not the correct
> number of mpds to start.  This has been fixed.
>
> The mpds are now launching on the nodes that PBS allocates, but the 
> mpiexec process is still trying to connect to a root mpd:
>
> [hammond at boron ecrp12NoRing]$ more ecrp12NoRing.log
> mpdroot: cannot connect to local mpd at: /tmp/mpd2.console_root
>     probable cause:  no mpd daemon on this machine
>     possible cause:  unix socket /tmp/mpd2.console_root has been removed
> mpiexec_node12.cl.corp.com (__init__ 1190): forked process failed; status=255
>
> The file /tmp/mpd2.console_hammond exists.  Shouldn't mpiexec be 
> trying to connect to that socket??
>
>
> On Thu, 12 Feb 2009, Anne M. Hammond wrote:
>
>> Yes.  NNODES is set:
>>
>> setenv NNODES `wc $PBS_NODEFILE|awk '{print $1}'`
>>
>>
>> On Thu, 12 Feb 2009, Anthony Chan wrote:
>>
>>>
>>> Did you set NNODES in your PBS script ?
>>>
>>> ----- "Anne M. Hammond" <hammond at txcorp.com> wrote:
>>>
>>>> These are the relevant lines from the qsub file:
>>>>
>>>> sort -u $PBS_NODEFILE > mpd.hosts
>>>> mpdboot -f mpd.hosts -n $NNODES --rsh=/usr/bin/rsh
>>>> mpiexec -machinefile $PBS_NODEFILE -np $NNODES $RUNJOB -i $WORK_AREA/$PREFILE/$PREFILE.in -dim 2 -n 100000 -d 10000 > $PREFILE.log
>>>> mpdallexit
>>>>
>>>> mpd.hosts:
>>>> node12
>>>> node13
>>>>
>>>> When the ring is not running, this is the error message from the 
>>>> PBS job:
>>>>
>>>> mpdroot: cannot connect to local mpd at: /tmp/mpd2.console_root
>>>>      probable cause:  no mpd daemon on this machine
>>>>      possible cause:  unix socket /tmp/mpd2.console_root has been removed
>>>> mpiexec_node12.cl.corp.com (__init__ 1190): forked process failed; status=255
>>>>
>>>> Do you have to have a persistent ring booted in order to use mpd 
>>>> from PBS?  Or is my qsub script incorrect?
>>>>
>>>> Thanks in advance,
>>>> Anne
>>>
>>>
>>
>>
>
> --
>
> Anne M. Hammond - Systems / Network Administration - Tech-X Corp
>                   hammond_at_txcorp.com 720-974-1840
>
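
A note on the node count discussed in the quoted thread: the corrected
script is not shown there, so the following is only a sketch of one way
the counts could be derived. `wc` on $PBS_NODEFILE counts lines, which is
one per allocated slot, whereas mpdboot wants one mpd per distinct host.
The sketch is written for a Bourne-type shell rather than the csh used in
the thread. The names NHOSTS, NSLOTS and WORKDIR, and the stand-in node
file used when PBS is not present, are assumptions for illustration.
Passing the slot count to -np is a common choice, not something the
thread confirms.

```shell
#!/bin/sh
# Sketch: derive the mpd count (distinct hosts) and the slot count
# from a PBS-style node file. Variable names are illustrative.

WORKDIR=$(mktemp -d)

if [ -z "${PBS_NODEFILE:-}" ]; then
    # Hypothetical stand-in for use outside PBS: two hosts, two slots each.
    PBS_NODEFILE="$WORKDIR/nodefile"
    printf 'node12\nnode12\nnode13\nnode13\n' > "$PBS_NODEFILE"
fi

# One entry per distinct host, as in the thread's "sort -u" step.
sort -u "$PBS_NODEFILE" > "$WORKDIR/mpd.hosts"
NHOSTS=$(wc -l < "$WORKDIR/mpd.hosts" | tr -d ' ')

# One line per allocated slot in the node file.
NSLOTS=$(wc -l < "$PBS_NODEFILE" | tr -d ' ')

echo "hosts=$NHOSTS slots=$NSLOTS"

# The commands from the thread would then take these counts, e.g.:
#   mpdboot -f "$WORKDIR/mpd.hosts" -n "$NHOSTS" --rsh=/usr/bin/rsh
#   mpiexec -machinefile "$PBS_NODEFILE" -np "$NSLOTS" ...
#   mpdallexit
```

With the thread's two hosts (node12, node13), NHOSTS would be 2 however
many slots PBS lists per host.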



