[mpich-discuss] MPD in the PBS environment
Rajeev Thakur
thakur at mcs.anl.gov
Thu Feb 12 22:45:15 CST 2009
Anne,
If you are using MPICH2 with PBS, you may want to consider using the
mpiexec for PBS developed by Pete Wyckoff:
http://www.osc.edu/~pw/mpiexec/index.php . You don't need to use MPD at all
if you use that.
Rajeev
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Anne M.
> Hammond
> Sent: Thursday, February 12, 2009 1:42 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] MPD in the PBS environment
>
> Thanks Anthony. Although NNODES was defined, it was the incorrect number
> of mpds to start. This has been fixed.
>
> The mpds are now launching on the nodes that PBS allocates, but the
> mpiexec process is still trying to connect to a root mpd:
>
> [hammond at boron ecrp12NoRing]$ more ecrp12NoRing.log
> mpdroot: cannot connect to local mpd at: /tmp/mpd2.console_root
> probable cause: no mpd daemon on this machine
> possible cause: unix socket /tmp/mpd2.console_root has been removed
> mpiexec_node12.cl.corp.com (__init__ 1190): forked process failed; status=255
>
> The file /tmp/mpd2.console_hammond exists. Shouldn't mpiexec be
> trying to connect to that socket??
>
>
> On Thu, 12 Feb 2009, Anne M. Hammond wrote:
>
>> Yes. NNODES is set:
>>
>> setenv NNODES `wc $PBS_NODEFILE|awk '{print $1}'`
>>
>>
>> On Thu, 12 Feb 2009, Anthony Chan wrote:
>>
>>>
>>> Did you set NNODES in your PBS script ?
>>>
>>> ----- "Anne M. Hammond" <hammond at txcorp.com> wrote:
>>>
>>>> These are the relevant lines from the qsub file:
>>>>
>>>> sort -u $PBS_NODEFILE > mpd.hosts
>>>> mpdboot -f mpd.hosts -n $NNODES --rsh=/usr/bin/rsh
>>>> mpiexec -machinefile $PBS_NODEFILE -np $NNODES $RUNJOB -i $WORK_AREA/$PREFILE/$PREFILE.in -dim 2 -n 100000 -d 10000 > $PREFILE.log
>>>> mpdallexit
>>>>
>>>> mpd.hosts:
>>>> node12
>>>> node13
>>>>
>>>> When the ring is not running, this is the error message from the
>>>> PBS job:
>>>>
>>>> mpdroot: cannot connect to local mpd at: /tmp/mpd2.console_root
>>>> probable cause: no mpd daemon on this machine
>>>> possible cause: unix socket /tmp/mpd2.console_root has been removed
>>>> mpiexec_node12.cl.corp.com (__init__ 1190): forked process failed; status=255
>>>>
>>>> Do you have to have a persistent ring booted in order to use mpd
>>>> from PBS? Or is my qsub script incorrect?
>>>>
>>>> Thanks in advance,
>>>> Anne
>>>
>>>
>>
>>
>
> --
>
> Anne M. Hammond - Systems / Network Administration - Tech-X Corp
> hammond_at_txcorp.com 720-974-1840
>
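The thread mentions that NNODES held the wrong number of mpds to start, without saying exactly what was wrong. One plausible way such a count can drift (an assumption, not necessarily what happened here) is that a PBS node file may list a host once per allocated processor slot, so its line count can exceed the number of distinct hosts written to mpd.hosts by "sort -u". A minimal sketch, using a made-up node file in place of $PBS_NODEFILE:

```shell
#!/bin/sh
# Hypothetical stand-in for $PBS_NODEFILE (contents invented for illustration):
# each host appears once per allocated processor slot.
nodefile=$(mktemp)
printf 'node12\nnode12\nnode13\nnode13\n' > "$nodefile"

# Distinct hosts: the number of mpds needed, one per host.
# This mirrors the "sort -u" step used to build mpd.hosts in the thread.
nhosts=$(sort -u "$nodefile" | wc -l | tr -d ' ')

# Total slots: the number of lines in the node file, which is what a
# plain line count of the file reports.
nprocs=$(wc -l < "$nodefile" | tr -d ' ')

echo "hosts=$nhosts procs=$nprocs"
rm -f "$nodefile"
```

With a file like this, the host count would be the natural value for "mpdboot -n" and the slot count for "mpiexec -np"; whether the two differ on a given cluster depends on how the job requested its processors.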