[mpich-discuss] mpdboot error

Jacob Harvey j.harv8 at gmail.com
Fri Apr 23 14:20:21 CDT 2010


Dave,

Thanks a ton for the help. I've been struggling with this for some
time now. The updated Python script you listed in #974 did the trick.
Either way, I'll definitely be looking into hydra. Thanks again!

Jacob
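
The ValueError quoted further down in this thread came from a missing space that fused the node count and the next flag into "2-f". A minimal sketch of a guard for a job script follows, assuming bash; the helper names is_positive_int, count_unique_hosts and boot_ring are hypothetical and not part of MPICH2 or PBS. The mpdboot options are the ones used in the quoted PBS script. Counting unique hosts assumes the node file may list a host once per requested processor, which depends on the local PBS setup.

```shell
#!/bin/bash
# Hypothetical helpers for a PBS job script; not part of MPICH2 or PBS.

# Succeed only if $1 is a positive decimal integer.
# A fused argument such as "2-f" is rejected here, before mpdboot sees it.
is_positive_int() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -gt 0 ]
}

# Print the number of distinct, non-empty host names listed in file $1.
count_unique_hosts() {
    sort -u "$1" | grep -c .
}

# Validate the node count, then call mpdboot with the same options used in
# the job script quoted in this thread (-n, -f, -r ssh).
boot_ring() {
    local nodes=$1 hostfile=$2
    if ! is_positive_int "$nodes"; then
        echo "refusing to run mpdboot: bad node count '$nodes'" >&2
        return 2
    fi
    "$MPI_HOME/bin/mpdboot" -n "$nodes" -f "$hostfile" -r ssh
}
```

In a job script this might be called as boot_ring "$(count_unique_hosts "$PBS_NODEFILE")" "$PBS_NODEFILE", so that a malformed count fails with a short message instead of a Python traceback.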

On Fri, Apr 23, 2010 at 3:17 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:
> I doubt that the OS difference is causing this problem, although there is
> always a small chance that it will cause _some_ problem.
>
> You can try the fix listed, but you may still have a problem (the fix for
> #963 is available in 1.2.1p1, but there is no fix currently for #974).
>
> In all cases, I still recommend hydra instead of mpd.
>
> -Dave
>
> On Apr 23, 2010, at 2:03 PM, Jacob Harvey wrote:
>
>> We are running MPICH2-1.2.1, with CentOS 4.8 on the head node and
>> CentOS 4.4 on all other nodes. Is the difference in operating systems
>> a problem? I'm guessing not, because we don't use the head node for
>> calculations, but I figured I would ask. Otherwise it looks very
>> similar to that bug. Similar to your findings, I cannot reproduce this
>> behavior on CentOS 5.4, which is running on our other cluster that
>> works just fine. Should I give the quick fix you list in the bug
>> report a shot, or was this fixed in mpich2-1.2.1p1?
>>
>> Jacob
>>
>> On Fri, Apr 23, 2010 at 2:47 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:
>>>
>>> What version of MPICH2 are you using?  What operating system are you
>>> using (CentOS/RHEL?)?  You may be hitting the bug linked from that FAQ entry:
>>> https://trac.mcs.anl.gov/projects/mpich2/ticket/974
>>>
>>> -Dave
>>>
>>> On Apr 23, 2010, at 1:42 PM, Jacob Harvey wrote:
>>>
>>>> Dave,
>>>>
>>>> See, that is the weird thing about this: it seems as though the
>>>> mpd ring is indeed forming. It's simply that my script hangs on
>>>> the mpdboot line. But I can ssh to the nodes specified for the job
>>>> and see the mpd process running on that node. So the mpd ring forms,
>>>> but why doesn't mpdboot give control back to the script?
>>>>
>>>> I'm definitely up for more suggestions if anyone has any, but in
>>>> the meantime I'll give hydra a shot. Thanks.
>>>>
>>>> Jacob
>>>>
>>>> On Fri, Apr 23, 2010 at 2:21 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:
>>>>>
>>>>> It's good to hear that it was just a bad mpdboot line.
>>>>>
>>>>> As for your further troubles with mpd, please see this page:
>>>>>
>>>>>
>>>>> http://wiki.mcs.anl.gov/mpich2/index.php/Frequently_Asked_Questions#Q:_My_MPD_ring_won.27t_start.2C_what.27s_wrong.3F
>>>>>
>>>>> Particularly, I would recommend using hydra if you don't specifically
>>>>> need MPD.
>>>>>
>>>>> -Dave
>>>>>
>>>>> On Apr 23, 2010, at 1:15 PM, Jacob Harvey wrote:
>>>>>
>>>>>> Dave and MPICH2 users,
>>>>>>
>>>>>> Thanks for your timely response. I did actually have an error in my
>>>>>> invocation of mpdboot. I was missing a space on my mpdboot line (doh),
>>>>>> i.e., I had:
>>>>>>
>>>>>> mpdboot -n 2-f mpd.hosts
>>>>>>
>>>>>> But this has actually caused me to run into some other weird problems.
>>>>>> I want to run on 2 nodes with 2 procs each (4 total), but my PBS script
>>>>>> seems to hang on booting the mpd ring. For instance, the main part
>>>>>> of my PBS script looks like this ($NODES = 2):
>>>>>>
>>>>>> echo 'Building the MPD ring'
>>>>>> $MPI_HOME/bin/mpdboot -n $NODES -f $PBS_NODEFILE -r ssh
>>>>>> echo ' '
>>>>>>
>>>>>> echo 'Inspecting if all MPI nodes have been activated'
>>>>>> $MPI_HOME/bin/mpdtrace -l
>>>>>> echo ' '
>>>>>>
>>>>>> echo 'Checking the connectivity'
>>>>>> $MPI_HOME/bin/mpdringtest 100
>>>>>> echo ' '
>>>>>>
>>>>>> echo 'Running my code in parallel'
>>>>>> echo ' '
>>>>>>
>>>>>> if [[ `uname -i` == "x86_64" ]]; then
>>>>>>  mpirun -np 4 ~/dlpoly/dl_poly_2.20_i86/execute/DLPOLY.X
>>>>>> else
>>>>>>  mpirun -np 4 ~/dlpoly/dl_poly_2.20_i386/execute/DLPOLY.X
>>>>>> fi
>>>>>>
>>>>>> Yet all I get is the 'Building the MPD ring' line with no further
>>>>>> output (from either the PBS script or the program I am running) in my
>>>>>> output file. But while the job is "running" I can ssh to the nodes
>>>>>> involved in the mpd ring and can see the mpd ring running (either
>>>>>> by seeing the process in top, or with mpdtrace -l). What seems to force
>>>>>> mpdboot to move on is specifying more hosts than are available (i.e.,
>>>>>> "mpdboot -n 4 -f mpd.hosts" when only 2 nodes are used). In that case
>>>>>> I get the typical "Too many hosts" error, but the mpd ring is formed
>>>>>> and the job runs as expected. Like I said, I've used this same script
>>>>>> on another one of our clusters and it works just fine, and I've gone
>>>>>> through the troubleshooting section in the manual but did not find an
>>>>>> error.
>>>>>>
>>>>>> Any suggestions on why the mpd ring is hanging?
>>>>>>
>>>>>> Jacob
>>>>>>
>>>>>> On Fri, Apr 23, 2010 at 1:16 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:
>>>>>>>
>>>>>>> Hi Jacob,
>>>>>>>
>>>>>>> Can you post your mpdboot invocation?  There is probably either a
>>>>>>> mistake in that line or a bug in mpdboot.
>>>>>>>
>>>>>>> -Dave
>>>>>>>
>>>>>>> On Apr 23, 2010, at 11:41 AM, Jacob Harvey wrote:
>>>>>>>
>>>>>>>> MPICH2 users,
>>>>>>>>
>>>>>>>> Has anyone seen this error from mpdboot before?
>>>>>>>>
>>>>>>>> Traceback (most recent call last):
>>>>>>>>   File "/opt/mpich2-1.2.1-i86/bin/mpdboot", line 476, in ?
>>>>>>>>     mpdboot()
>>>>>>>>   File "/opt/mpich2-1.2.1-i86/bin/mpdboot", line 158, in mpdboot
>>>>>>>>     totalnumToStart = int(argv[argidx+1])
>>>>>>>> ValueError: invalid literal for int(): 2-f
>>>>>>>>
>>>>>>>> It's odd because I only get this error when I try to set up an mpd
>>>>>>>> ring for calculations in a PBS script. If I try to set up an mpd ring
>>>>>>>> from the command line, the ring forms just fine. Plus, I've gone
>>>>>>>> through the section on debugging MPD rings in the manual and did not
>>>>>>>> turn up any errors in doing so. Another puzzling piece of information
>>>>>>>> is that I use the exact same PBS script on a different cluster and the
>>>>>>>> mpd ring forms just fine. So I'm confused as to what the problem is
>>>>>>>> here. Any thoughts? Thanks in advance!
>>>>>>>>
>>>>>>>> Jacob
>>>>>>>>
>>>>>>>> --
>>>>>>>> --
>>>>>>>> Jacob Harvey
>>>>>>>>
>>>>>>>> Graduate Student
>>>>>>>>
>>>>>>>> University of Massachusetts Amherst
>>>>>>>>
>>>>>>>> j.harv8 at gmail.com
>>>>>>>> _______________________________________________
>>>>>>>> mpich-discuss mailing list
>>>>>>>> mpich-discuss at mcs.anl.gov
>>>>>>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss



-- 
--
Jacob Harvey

Graduate Student

University of Massachusetts Amherst

j.harv8 at gmail.com
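
The symptom described in this thread was that the mpd ring formed but mpdboot never returned control to the job script. One way to tell those two conditions apart from inside a script is to poll mpdtrace, which prints one line per host in the ring. The sketch below assumes bash and that mpdtrace is on the PATH; wait_for_ring is a hypothetical helper, not an MPICH2 tool, and it is not the fix from ticket #974.

```shell
#!/bin/bash
# Hypothetical helper; not part of MPICH2.

# wait_for_ring EXPECTED TRIES [PROBE_CMD...]
# Run PROBE_CMD (default: mpdtrace) up to TRIES times, one second apart, and
# succeed as soon as it prints at least EXPECTED non-empty lines.
wait_for_ring() {
    local expected=$1 tries=$2 i=0 n
    shift 2
    [ "$#" -gt 0 ] || set -- mpdtrace
    while [ "$i" -lt "$tries" ]; do
        # grep -c prints 0 and exits non-zero when nothing matches.
        n=$("$@" 2>/dev/null | grep -c . || true)
        if [ "$n" -ge "$expected" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}
```

For example, wait_for_ring "$NODES" 30 "$MPI_HOME/bin/mpdtrace" returns success once at least $NODES hosts are reported, or failure after roughly 30 seconds. Passing the probe command as an argument keeps the function testable without a running ring.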

