[mpich-discuss] MPD in the PBS environment

Anne M. Hammond hammond at txcorp.com
Thu Feb 12 13:41:53 CST 2009


Thanks, Anthony.  Although NNODES was defined, it held the wrong
number of mpds to start.  This has been fixed.
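
For reference, one way this kind of mismatch can arise (an assumption; the
post does not say what the wrong value was): in the qsub lines quoted
below, mpd.hosts is built with sort -u, so it has one line per host, while
NNODES is taken from wc on $PBS_NODEFILE, which has one line per allocated
slot.  With more than one slot per node the two counts differ.  A minimal
sketch in sh syntax (the original script uses csh), with the hypothetical
helper names count_hosts and count_slots:

```shell
#!/bin/sh
# Hypothetical helpers, not part of MPICH or PBS.

# Number of distinct hosts in a node file: one mpd is needed per host.
count_hosts() {
    sort -u "$1" | grep -c .
}

# Number of non-empty lines in a node file: one MPI process per slot.
count_slots() {
    grep -c . "$1"
}

# Possible use inside a job script (variable names are assumptions):
#   NHOSTS=$(count_hosts "$PBS_NODEFILE")
#   NPROCS=$(count_slots "$PBS_NODEFILE")
#   mpdboot -f mpd.hosts -n "$NHOSTS" --rsh=/usr/bin/rsh
#   mpiexec -machinefile "$PBS_NODEFILE" -np "$NPROCS" $RUNJOB
```

Keeping the two counts in separate variables avoids passing a slot count
to mpdboot -n, which expects the number of mpds to start.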

The mpds are now launching on the nodes that PBS allocates, but the
mpiexec process is still trying to connect to a root mpd:

[hammond@boron ecrp12NoRing]$ more ecrp12NoRing.log
mpdroot: cannot connect to local mpd at: /tmp/mpd2.console_root
     probable cause:  no mpd daemon on this machine
     possible cause:  unix socket /tmp/mpd2.console_root has been removed
mpiexec_node12.cl.corp.com (__init__ 1190): forked process failed; status=255

The file /tmp/mpd2.console_hammond exists.  Shouldn't mpiexec
be trying to connect to that socket instead?
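
A quick way to confirm which console sockets are actually present on the
node where mpiexec runs is sketched below.  It assumes only the
/tmp/mpd2.console_<user> naming seen in the log above; the helper names
console_path and check_console are hypothetical, and the sketch uses sh
syntax rather than csh.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting mpd console sockets.

# Build the console socket path for a user name, following the
# /tmp/mpd2.console_<user> pattern shown in the error message.
console_path() {
    printf '/tmp/mpd2.console_%s\n' "$1"
}

# Report whether the given path exists and is a Unix socket.
check_console() {
    if [ -S "$1" ]; then
        echo "present: $1"
    else
        echo "missing: $1"
    fi
}

# Possible use on the compute node:
#   check_console "$(console_path "$(id -un)")"
#   check_console "$(console_path root)"
#   env | grep '^MPD_'    # show any MPD-related environment settings
```

If the per-user socket is present and the root one is missing, the
question becomes why mpiexec is choosing the root console name, which the
environment listing may help narrow down.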


On Thu, 12 Feb 2009, Anne M. Hammond wrote:

> Yes.  NNODES is set:
>
> setenv NNODES `wc $PBS_NODEFILE|awk '{print $1}'`
>
>
> On Thu, 12 Feb 2009, Anthony Chan wrote:
>
>>
>>  Did you set NNODES in your PBS script ?
>>
>>  ----- "Anne M. Hammond" <hammond at txcorp.com> wrote:
>> 
>> >  These are the relevant lines from the qsub file:
>> > 
>> >  sort -u $PBS_NODEFILE > mpd.hosts
>> >  mpdboot -f mpd.hosts -n $NNODES --rsh=/usr/bin/rsh
>> >  mpiexec -machinefile $PBS_NODEFILE -np $NNODES $RUNJOB -i
>> >  $WORK_AREA/$PREFILE/$PREFILE.in -dim 2 -n 100000 -d 10000 >
>> >  $PREFILE.log
>> >  mpdallexit
>> > 
>> >  mpd.hosts:
>> >  node12
>> >  node13
>> > 
>> >  When the ring is not running, this is the error message from the
>> >  PBS job:
>> > 
>> >  mpdroot: cannot connect to local mpd at: /tmp/mpd2.console_root
>> >       probable cause:  no mpd daemon on this machine
>> >       possible cause:  unix socket /tmp/mpd2.console_root has been
>> >  removed
>> >  mpiexec_node12.cl.corp.com (__init__ 1190): forked process failed;
>> >  status=255
>> > 
>> >  Do you have to have a persistent ring booted in order to use mpd
>> >  from PBS?  Or is my qsub script incorrect?
>> > 
>> >  Thanks in advance,
>> >  Anne
>> 
>> 
>
>

-- 

Anne M. Hammond - Systems / Network Administration - Tech-X Corp
                   hammond_at_txcorp.com 720-974-1840

