[mpich-discuss] mpdboot error

Jacob Harvey j.harv8 at gmail.com
Fri Apr 23 13:42:55 CDT 2010


Dave,

That is the weird thing: the mpd ring does seem to be forming. The
problem is simply that my script hangs on the mpdboot line. I can ssh
to the nodes assigned to the job and see the mpd process running on
each of them. So the mpd ring forms, but why doesn't mpdboot give
control back to the script?

I'm definitely open to more suggestions if anyone has any, but in
the meantime I'll give hydra a shot. Thanks.

Jacob

On Fri, Apr 23, 2010 at 2:21 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:
> It's good to hear that it was just a bad mpdboot line.
>
> As for your further troubles with mpd, please see this page:
> http://wiki.mcs.anl.gov/mpich2/index.php/Frequently_Asked_Questions#Q:_My_MPD_ring_won.27t_start.2C_what.27s_wrong.3F
>
> Particularly, I would recommend using hydra if you don't specifically need
> MPD.
>
> -Dave
>
> On Apr 23, 2010, at 1:15 PM, Jacob Harvey wrote:
>
>> Dave and MPICH2 users,
>>
>> Thanks for your timely response. I did actually have an error in my
>> invocation of mpdboot: I was missing a space before -f (doh).
>> I.e., I had:
>>
>> mpdboot -n 2-f mpd.hosts
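For anyone who finds this thread later, here is a minimal Python sketch of why the missing space produces the ValueError in the traceback quoted further down. The traceback shows mpdboot calling int(argv[argidx+1]); the function name parse_total_to_start and everything else here is my own illustration, not mpdboot's actual source. Without the space, the shell hands over "2-f" as a single argument, so the integer conversion fails:

```python
import shlex


def parse_total_to_start(argv):
    """Mimic the step shown in the traceback: read the integer after -n."""
    argidx = argv.index("-n")
    return int(argv[argidx + 1])


# shlex.split tokenizes the line roughly the way the shell would.
good = shlex.split("mpdboot -n 2 -f mpd.hosts")
bad = shlex.split("mpdboot -n 2-f mpd.hosts")

print(parse_total_to_start(good))

try:
    parse_total_to_start(bad)
except ValueError as err:
    # "2-f" arrives as one token, so int() cannot parse it.
    print("ValueError:", err)
```

With the space restored, "-f" becomes its own argument and the value after -n parses cleanly.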
>>
>> But fixing it has led me to some other weird problems.
>> I want to run on 2 nodes with 2 procs each (4 total), but my PBS script
>> seems to hang while booting the mpd ring. For instance, the main part
>> of my PBS script looks like this ($NODES = 2):
>>
>> echo 'Building the MPD ring'
>> $MPI_HOME/bin/mpdboot -n $NODES -f $PBS_NODEFILE -r ssh
>> echo ' '
>>
>> echo 'Inspecting if all MPI nodes have been activated'
>> $MPI_HOME/bin/mpdtrace -l
>> echo ' '
>>
>> echo 'Checking the connectivity'
>> $MPI_HOME/bin/mpdringtest 100
>> echo ' '
>>
>> echo 'Running my code in parallel'
>> echo ' '
>>
>> if [[ `uname -i` == "x86_64" ]]; then
>>    mpirun -np 4 ~/dlpoly/dl_poly_2.20_i86/execute/DLPOLY.X
>> else
>>    mpirun -np 4 ~/dlpoly/dl_poly_2.20_i386/execute/DLPOLY.X
>> fi
>>
>> Yet all I get in my output file is the 'Building the MPD ring' line, with
>> no further output from either the PBS script or the program I am
>> running. While the job is "running", though, I can ssh to the nodes
>> involved and see that the mpd ring is up (either by seeing the process
>> in top, or with mpdtrace -l). What seems to force mpdboot to move on is
>> specifying more hosts than are available (i.e.,
>> "mpdboot -n 4 -f mpd.hosts" when only 2 nodes are used). In that case
>> I get the typical "Too many hosts" error, but the mpd ring is formed
>> and the job runs as expected. As I said, I've used this same script
>> on another one of our clusters, where it works just fine, and I've gone
>> through the troubleshooting section in the manual without finding an
>> error.
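One sanity check that may be worth doing here (an assumption on my part, not a confirmed cause of the hang): Torque-style PBS typically writes one line per processor slot to $PBS_NODEFILE, so a job with 2 nodes and 2 procs per node would list 4 lines but only 2 distinct hosts. Below is a small sketch that counts the distinct hosts so the number can be compared with the value given to -n. The function name distinct_hosts and the sample host names are hypothetical:

```python
def distinct_hosts(nodefile_text):
    """Return host names in first-seen order, skipping blanks and repeats."""
    seen = []
    for line in nodefile_text.splitlines():
        host = line.strip()
        if host and host not in seen:
            seen.append(host)
    return seen


# Hypothetical nodefile contents for 2 nodes with 2 procs each.
sample = "node01\nnode01\nnode02\nnode02\n"

hosts = distinct_hosts(sample)
print(len(hosts), hosts)
```

In a real job the text would come from the file named by the PBS_NODEFILE environment variable rather than from the sample string.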
>>
>> Any suggestions on why the mpd ring is hanging?
>>
>> Jacob
>>
>> On Fri, Apr 23, 2010 at 1:16 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:
>>>
>>> Hi Jacob,
>>>
>>> Can you post your mpdboot invocation?  There is probably either a mistake in
>>> that line or a bug in mpdboot.
>>>
>>> -Dave
>>>
>>> On Apr 23, 2010, at 11:41 AM, Jacob Harvey wrote:
>>>
>>>> MPICH2 users,
>>>>
>>>> Has anyone seen this error from mpdboot before?
>>>>
>>>> Traceback (most recent call last):
>>>>  File "/opt/mpich2-1.2.1-i86/bin/mpdboot", line 476, in ?
>>>>  mpdboot()
>>>>  File "/opt/mpich2-1.2.1-i86/bin/mpdboot", line 158, in mpdboot
>>>>  totalnumToStart = int(argv[argidx+1])
>>>> ValueError: invalid literal for int(): 2-f
>>>>
>>>> It's odd because I only get this error when I try to set up an mpd ring
>>>> for calculations in a PBS script. If I try to set up an mpd ring
>>>> from the command line, the ring forms just fine. Plus, I've gone
>>>> through the section on debugging MPD rings in the manual and did not
>>>> turn up any errors in doing so. Another puzzling piece of information
>>>> is that I use the exact same PBS script on a different cluster and the
>>>> mpd ring forms just fine. So I'm confused as to what the problem is
>>>> here. Any thoughts? Thanks in advance!
>>>>
>>>> Jacob
>>>>
>>>> --
>>>> --
>>>> Jacob Harvey
>>>>
>>>> Graduate Student
>>>>
>>>> University of Massachusetts Amherst
>>>>
>>>> j.harv8 at gmail.com
>>>> _______________________________________________
>>>> mpich-discuss mailing list
>>>> mpich-discuss at mcs.anl.gov
>>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>>
>>>
>>
>>
>>
>
>



-- 
--
Jacob Harvey

Graduate Student

University of Massachusetts Amherst

j.harv8 at gmail.com

