[mpich-discuss] MPD in the PBS environment

Anne M. Hammond hammond at txcorp.com
Thu Feb 12 14:22:10 CST 2009


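This must be the problem. /etc/bashrc contains:

export MPD_USE_ROOT_MPD=1

With that set, mpiexec looks for root's mpd console socket
(/tmp/mpd2.console_root) rather than /tmp/mpd2.console_hammond.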
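
A minimal sketch of how one might check this from inside the job script. The helper name expected_mpd_console is hypothetical, and the two socket paths are inferred only from the error log and the file listing quoted below, not from the mpd source. It also assumes any non-empty value of the variable selects the root console, which may not match how mpd actually parses it.

```shell
# Hypothetical helper: print the console socket that mpiexec is expected
# to use, based only on the paths that appear in this thread.
expected_mpd_console() {
    if [ -n "${MPD_USE_ROOT_MPD:-}" ]; then
        # Assumption: any non-empty value selects root's console.
        echo "/tmp/mpd2.console_root"
    else
        echo "/tmp/mpd2.console_${USER:-$(id -un)}"
    fi
}

# Clear the setting inherited from /etc/bashrc before mpdboot and mpiexec run,
# so the per-user ring started by the job is the one that gets used.
unset MPD_USE_ROOT_MPD
```

In a csh job script (the qsub file quoted below uses setenv), the equivalent of the last line would be `unsetenv MPD_USE_ROOT_MPD`.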

On Thu, 12 Feb 2009, Anne M. Hammond wrote:

> Thanks Anthony.  Although NNODES was defined, it was the wrong
> number of mpds to start.  This has been fixed.
>
> The mpds are now launching on the nodes that PBS allocates, but the
> mpiexec process is still trying to connect to a root mpd:
>
> [hammond at boron ecrp12NoRing]$ more ecrp12NoRing.log
> mpdroot: cannot connect to local mpd at: /tmp/mpd2.console_root
>     probable cause:  no mpd daemon on this machine
>     possible cause:  unix socket /tmp/mpd2.console_root has been removed
> mpiexec_node12.cl.corp.com (__init__ 1190): forked process failed; status=255
>
> The file /tmp/mpd2.console_hammond exists.  Shouldn't mpiexec
> be trying to connect to that socket??
>
>
> On Thu, 12 Feb 2009, Anne M. Hammond wrote:
>
>>  Yes.  NNODES is set:
>>
>>  setenv NNODES `wc $PBS_NODEFILE|awk '{print $1}'`
>> 
>>
>>  On Thu, 12 Feb 2009, Anthony Chan wrote:
>> 
>> > 
>> >   Did you set NNODES in your PBS script ?
>> > 
>> >   ----- "Anne M. Hammond" <hammond at txcorp.com> wrote:
>> > 
>> > >   These are the relevant lines from the qsub file:
>> > > 
>> > >   sort -u $PBS_NODEFILE > mpd.hosts
>> > >   mpdboot -f mpd.hosts -n $NNODES --rsh=/usr/bin/rsh
>> > >   mpiexec -machinefile $PBS_NODEFILE -np $NNODES $RUNJOB -i
>> > >   $WORK_AREA/$PREFILE/$PREFILE.in -dim 2 -n 100000 -d 10000 >
>> > >   $PREFILE.log
>> > >   mpdallexit
>> > > 
>> > >   mpd.hosts:
>> > >   node12
>> > >   node13
>> > > 
>> > >   When the ring is not running, this is the error message from the
>> > >   PBS job:
>> > > 
>> > >   mpdroot: cannot connect to local mpd at: /tmp/mpd2.console_root
>> > >        probable cause:  no mpd daemon on this machine
>> > >        possible cause:  unix socket /tmp/mpd2.console_root has been
>> > >   removed
>> > >   mpiexec_node12.cl.corp.com (__init__ 1190): forked process failed;
>> > >   status=255
>> > > 
>> > >   Do you have to have a persistent ring booted in order to use mpd
>> > >   from PBS?  Or is my qsub script incorrect?
>> > > 
>> > >   Thanks in advance,
>> > >   Anne
>> > 
>> > 
>> 
>> 
>
>
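
For the node-count problem mentioned in the quoted thread, here is a hedged sketch of one way to keep the two counts separate. It assumes the nodefile lists a host once per allocated process slot, so that its line count differs from the number of unique hosts in mpd.hosts. The variable names NMPDS and NPROCS and the sample nodefile contents are illustrative, not taken from the original script.

```shell
# Stand-in for $PBS_NODEFILE so the sketch runs outside PBS
# (hypothetical contents: two slots on each of two nodes).
nodefile=$(mktemp)
printf 'node12\nnode12\nnode13\nnode13\n' > "$nodefile"

# One mpd per unique host, as in the original "sort -u" line.
hosts=$(mktemp)
sort -u "$nodefile" > "$hosts"

# Number of mpds to boot: unique hosts.
NMPDS=$(wc -l < "$hosts" | tr -d ' ')
# Number of MPI processes: one per line of the nodefile.
NPROCS=$(wc -l < "$nodefile" | tr -d ' ')

echo "mpdboot -n $NMPDS, mpiexec -np $NPROCS"

rm -f "$nodefile" "$hosts"
```

In the real job script, NMPDS would go to mpdboot -n and NPROCS to mpiexec -np.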

-- 

Anne M. Hammond - Systems / Network Administration - Tech-X Corp
                   hammond_at_txcorp.com 720-974-1840

