[mpich-discuss] mpdboot error

Jacob Harvey j.harv8 at gmail.com
Fri Apr 23 14:03:41 CDT 2010


We are running MPICH2-1.2.1, with CentOS 4.8 on the head node and
CentOS 4.4 on all other nodes. Is the difference in operating systems
a problem? I'm guessing not, because we don't use the head node for
calculations, but I figured I would ask. Otherwise it looks very
similar to that bug. Similar to your findings, I cannot reproduce this
behavior on CentOS 5.4, which is running on our other cluster that
works just fine. Should I give the quick fix you list in the bug
report a shot, or was this fixed in mpich2-1.2.1p1?

Jacob

On Fri, Apr 23, 2010 at 2:47 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:
> What version of MPICH2 are you using?  What operating system are you using
> (CentOS/RHEL?)?  You may be hitting the bug linked from that FAQ entry:
> https://trac.mcs.anl.gov/projects/mpich2/ticket/974
>
> -Dave
>
> On Apr 23, 2010, at 1:42 PM, Jacob Harvey wrote:
>
>> Dave,
>>
>> See, that is the weird thing: it seems as though the
>> mpd ring is indeed forming. It's simply that my script hangs on
>> the mpdboot line. I can ssh to the nodes specified for the job
>> and see the mpd process running on each node. So the mpd ring forms,
>> but why doesn't mpdboot give control back to the script?
>>
>> I'm definitely up for more suggestions if anyone has any, but in
>> the meantime I'll give hydra a shot. Thanks.
>>
>> Jacob
>>
>> On Fri, Apr 23, 2010 at 2:21 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:
>>>
>>> It's good to hear that it was just a bad mpdboot line.
>>>
>>> As for your further troubles with mpd, please see this page:
>>>
>>> http://wiki.mcs.anl.gov/mpich2/index.php/Frequently_Asked_Questions#Q:_My_MPD_ring_won.27t_start.2C_what.27s_wrong.3F
>>>
>>> Particularly, I would recommend using hydra if you don't specifically
>>> need MPD.
>>>
>>> -Dave
>>>
>>> On Apr 23, 2010, at 1:15 PM, Jacob Harvey wrote:
>>>
>>>> Dave and MPICH2 users,
>>>>
>>>> Thanks for your timely response. I did actually have an error in my
>>>> invocation of mpdboot. I was missing a space on my mpdboot line (doh).
>>>> i.e., I had:
>>>>
>>>> mpdboot -n 2-f mpd.hosts
>>>>
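As a hedged illustration of why the missing space matters: the shell hands `2-f` to mpdboot as a single word, and mpdboot's `int()` conversion of the `-n` value then fails, which is the ValueError in the traceback quoted further down. Below is a minimal sketch of a guard a job script could run before calling mpdboot. `is_positive_int` and `check_count` are hypothetical helpers, not part of MPICH2.

```shell
#!/bin/bash
# Hypothetical helper (not part of MPICH2):
# succeed only if $1 is a positive integer without leading zeros.
is_positive_int() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;  # empty, or contains a non-digit such as '-'
    0*)          return 1 ;;  # zero, or a value with a leading zero
    *)           return 0 ;;
  esac
}

# Hypothetical helper: report whether a candidate -n value looks usable.
check_count() {
  if is_positive_int "$1"; then
    echo "ok: $1"
  else
    echo "bad count: $1"
  fi
}

check_count 2      # prints "ok: 2"
check_count 2-f    # prints "bad count: 2-f"
```

A check like this only catches malformed counts early; it does not change how mpdboot itself parses its arguments.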
>>>> But this has actually caused me to run into some other weird problems.
>>>> I want to run on 2 nodes with 2 procs each (4 total). But my PBS script
>>>> seems to hang on booting the mpd ring. So for instance the main part
>>>> of my PBS script looks like this ($NODES = 2):
>>>>
>>>> echo 'Building the MPD ring'
>>>> $MPI_HOME/bin/mpdboot -n $NODES -f $PBS_NODEFILE -r ssh
>>>> echo ' '
>>>>
>>>> echo 'Inspecting if all MPI nodes have been activated'
>>>> $MPI_HOME/bin/mpdtrace -l
>>>> echo ' '
>>>>
>>>> echo 'Checking the connectivity'
>>>> $MPI_HOME/bin/mpdringtest 100
>>>> echo ' '
>>>>
>>>> echo 'Running my code in parallel'
>>>> echo ' '
>>>>
>>>> if [[ `uname -i` == "x86_64" ]]; then
>>>>   mpirun -np 4 ~/dlpoly/dl_poly_2.20_i86/execute/DLPOLY.X
>>>> else
>>>>   mpirun -np 4 ~/dlpoly/dl_poly_2.20_i386/execute/DLPOLY.X
>>>> fi
>>>>
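PBS typically lists a host once per requested processor in `$PBS_NODEFILE`, so a job with 2 nodes and 2 procs each would have 4 lines but only 2 distinct hosts. A sketch, assuming that layout, of deriving both counts from the file rather than hard-coding them follows. `count_hosts` and `count_slots` are hypothetical helper names, and the demo file with made-up host names stands in for a real `$PBS_NODEFILE`.

```shell
#!/bin/bash
# Hypothetical helpers; $1 is the path to a PBS-style node file.
# Number of distinct hosts (blank lines ignored).
count_hosts() { grep . "$1" | sort -u | wc -l | tr -d ' '; }
# Number of non-empty lines, i.e. process slots.
count_slots() { grep -c . "$1"; }

# Demo node file standing in for $PBS_NODEFILE (host names are made up).
demo=$(mktemp)
printf 'node01\nnode01\nnode02\nnode02\n' > "$demo"

NODES=$(count_hosts "$demo")
NPROCS=$(count_slots "$demo")
echo "NODES=$NODES NPROCS=$NPROCS"   # prints "NODES=2 NPROCS=4"

rm -f "$demo"
```

In the job script the same helpers would be pointed at `$PBS_NODEFILE`, with `$NODES` passed to `mpdboot -n` and `$NPROCS` to `mpirun -np`.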
>>>> Yet all I get is the 'Building the mpd ring' line with no further
>>>> output (from either the PBS script or the program I am running) in my
>>>> output file. But while the job is "running" I can ssh to the nodes
>>>> involved in the mpd ring and can see the mpd ring running (either
>>>> seeing the process in top, or mpdtrace -l). What seems to force
>>>> mpdboot to move on is specifying more hosts than are available (i.e.,
>>>> "mpdboot -n 4 -f mpd.hosts" when only 2 nodes are used). In that case
>>>> I get the typical "Too many hosts" error, but the mpd ring is formed
>>>> and the job runs as expected. Like I said, I've used this same script
>>>> on another one of our clusters and it works just fine, and I've gone
>>>> through the troubleshooting section in the manual but did not find an
>>>> error.
>>>>
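One way to keep a job script from stalling indefinitely on a step like this is to bound it with a time limit and then inspect the ring separately. The sketch below assumes GNU coreutils `timeout` is installed, which may not be the case on older distributions. `run_with_limit` is a hypothetical wrapper, and `sleep 5` merely stands in for a hanging mpdboot. This is a workaround sketch, not a fix for the underlying hang, and whether the mpd ring survives mpdboot being killed would need to be checked on the cluster.

```shell
#!/bin/bash
# Hypothetical wrapper: run a command, but give up after LIMIT seconds.
# Relies on GNU coreutils `timeout`, which exits with status 124 on expiry.
run_with_limit() {
  local limit=$1
  shift
  timeout "$limit" "$@"
  local rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "command did not return within ${limit}s: $*" >&2
  fi
  return "$rc"
}

# Stand-in demo: `sleep 5` plays the role of a command that hangs.
run_with_limit 1 sleep 5
echo "exit status: $?"   # prints "exit status: 124"
```

In the PBS script the wrapped call might look like `run_with_limit 60 $MPI_HOME/bin/mpdboot -n $NODES -f $PBS_NODEFILE -r ssh`, where the 60-second limit is an arbitrary choice, followed by the existing `mpdtrace -l` check to see whether the ring is actually up.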
>>>> Any suggestions on why the mpd ring is hanging?
>>>>
>>>> Jacob
>>>>
>>>> On Fri, Apr 23, 2010 at 1:16 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:
>>>>>
>>>>> Hi Jacob,
>>>>>
>>>>> Can you post your mpdboot invocation?  There is probably either a
>>>>> mistake in that line or a bug in mpdboot.
>>>>>
>>>>> -Dave
>>>>>
>>>>> On Apr 23, 2010, at 11:41 AM, Jacob Harvey wrote:
>>>>>
>>>>>> MPICH2 users,
>>>>>>
>>>>>> Has anyone seen this error from mpdboot before?
>>>>>>
>>>>>> Traceback (most recent call last):
>>>>>>  File "/opt/mpich2-1.2.1-i86/bin/mpdboot", line 476, in ?
>>>>>>  mpdboot()
>>>>>>  File "/opt/mpich2-1.2.1-i86/bin/mpdboot", line 158, in mpdboot
>>>>>>  totalnumToStart = int(argv[argidx+1])
>>>>>> ValueError: invalid literal for int(): 2-f
>>>>>>
>>>>>> It's odd because I only get this error when I try to set up an mpd ring
>>>>>> for calculations in a PBS script. If I set up an mpd ring
>>>>>> from the command line, the ring forms just fine. Plus, I've gone
>>>>>> through the section on debugging MPD rings in the manual and did not
>>>>>> turn up any errors in doing so. Another puzzling piece of information
>>>>>> is that I use the exact same PBS script on a different cluster and the
>>>>>> mpd ring forms just fine. So I'm confused as to what the problem is
>>>>>> here. Any thoughts? Thanks in advance!
>>>>>>
>>>>>> Jacob
>>>>>>
>>>>>> --
>>>>>> --
>>>>>> Jacob Harvey
>>>>>>
>>>>>> Graduate Student
>>>>>>
>>>>>> University of Massachusetts Amherst
>>>>>>
>>>>>> j.harv8 at gmail.com
>>>>>> _______________________________________________
>>>>>> mpich-discuss mailing list
>>>>>> mpich-discuss at mcs.anl.gov
>>>>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss



-- 
--
Jacob Harvey

Graduate Student

University of Massachusetts Amherst

j.harv8 at gmail.com

