[mpich-discuss] mpdboot error

Dave Goodell goodell at mcs.anl.gov
Fri Apr 23 14:17:43 CDT 2010


I doubt that the OS difference is causing this problem, although there  
is always a small chance that it will cause _some_ problem.

You can try the fix listed, but you may still have a problem (the fix
for #963 is available in 1.2.1p1, but there is currently no fix for
#974).

In all cases, I still recommend hydra instead of mpd.
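
For example, hydra needs no daemon ring at all, so the whole
mpdboot/mpdtrace dance drops out of the PBS script. A minimal sketch,
assuming your 1.2.1 build installed hydra's launcher as mpiexec.hydra
(the DLPOLY path is just the one from your script):

  # launch directly on the PBS-allocated nodes; no mpd ring required
  $MPI_HOME/bin/mpiexec.hydra -f $PBS_NODEFILE -n 4 \
      ~/dlpoly/dl_poly_2.20_i86/execute/DLPOLY.X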

-Dave

On Apr 23, 2010, at 2:03 PM, Jacob Harvey wrote:

> We are running MPICH2-1.2.1, with CentOS 4.8 on the head node and
> CentOS 4.4 on all other nodes. Is the difference in operating systems
> a problem? I'm guessing no, because we don't use the head node for
> calculations, but I figured I would ask. Otherwise it looks very
> similar to that bug. Similar to your findings, I cannot reproduce this
> behavior on CentOS 5.4, which runs on our other cluster that works
> just fine. Should I give the quick fix you list in the bug report a
> shot, or was this fixed in mpich2-1.2.1p1?
>
> Jacob
>
> On Fri, Apr 23, 2010 at 2:47 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:
>> What version of MPICH2 are you using? What operating system are you
>> using (CentOS/RHEL)? You may be hitting the bug linked from that FAQ
>> entry:
>> https://trac.mcs.anl.gov/projects/mpich2/ticket/974
>>
>> -Dave
>>
>> On Apr 23, 2010, at 1:42 PM, Jacob Harvey wrote:
>>
>>> Dave,
>>>
>>> That is the weird thing about this: it seems as though the mpd ring
>>> is indeed forming. It's just that my script hangs on the mpdboot
>>> line. I can ssh to the nodes specified for the job and see the mpd
>>> process running on each node. So the mpd ring forms, but why doesn't
>>> mpdboot give control back to the script?
>>>
>>> I'm definitely up for more suggestions if anyone has any, but in the
>>> meantime I'll give hydra a shot. Thanks.
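>>>
>>> One diagnostic I may try first, assuming the 1.2.1 mpdboot supports
>>> its documented --verbose flag, is rerunning the boot step verbosely
>>> to see which host it stalls on:
>>>
>>>   mpdboot -n $NODES -f $PBS_NODEFILE -r ssh --verbose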
>>>
>>> Jacob
>>>
>>> On Fri, Apr 23, 2010 at 2:21 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:
>>>>
>>>> It's good to hear that it was just a bad mpdboot line.
>>>>
>>>> As for your further troubles with mpd, please see this page:
>>>>
>>>> http://wiki.mcs.anl.gov/mpich2/index.php/Frequently_Asked_Questions#Q:_My_MPD_ring_won.27t_start.2C_what.27s_wrong.3F
>>>>
>>>> In particular, I would recommend using hydra if you don't
>>>> specifically need MPD.
>>>>
>>>> -Dave
>>>>
>>>> On Apr 23, 2010, at 1:15 PM, Jacob Harvey wrote:
>>>>
>>>>> Dave and MPICH2 users,
>>>>>
>>>>> Thanks for your timely response. I did actually have an error in
>>>>> my invocation of mpdboot: I was missing a space on the mpdboot
>>>>> line (doh), i.e. I had:
>>>>>
>>>>> mpdboot -n 2-f mpd.hosts
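>>>>>
>>>>> when it should have been:
>>>>>
>>>>> mpdboot -n 2 -f mpd.hosts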
>>>>>
>>>>> Fixing that has led me into some other weird problems, though. I
>>>>> want to run on 2 nodes with 2 processes each (4 total), but my PBS
>>>>> script seems to hang while booting the mpd ring. The main part of
>>>>> my PBS script looks like this ($NODES = 2):
>>>>>
>>>>> echo 'Building the MPD ring'
>>>>> $MPI_HOME/bin/mpdboot -n $NODES -f $PBS_NODEFILE -r ssh
>>>>> echo ' '
>>>>>
>>>>> echo 'Inspecting if all MPI nodes have been activated'
>>>>> $MPI_HOME/bin/mpdtrace -l
>>>>> echo ' '
>>>>>
>>>>> echo 'Checking the connectivity'
>>>>> $MPI_HOME/bin/mpdringtest 100
>>>>> echo ' '
>>>>>
>>>>> echo 'Running my code in parallel'
>>>>> echo ' '
>>>>>
>>>>> # pick the binary matching this node's architecture
>>>>> if [[ `uname -i` == "x86_64" ]]; then
>>>>>   mpirun -np 4 ~/dlpoly/dl_poly_2.20_i86/execute/DLPOLY.X
>>>>> else
>>>>>   mpirun -np 4 ~/dlpoly/dl_poly_2.20_i386/execute/DLPOLY.X
>>>>> fi
>>>>>
>>>>> Yet all I get in my output file is the 'Building the MPD ring'
>>>>> line, with no further output from either the PBS script or the
>>>>> program I am running. While the job is "running" I can ssh to the
>>>>> nodes involved in the mpd ring and see the ring up (either by
>>>>> spotting the mpd process in top, or via mpdtrace -l). What seems
>>>>> to force mpdboot to move on is specifying more hosts than are
>>>>> available (i.e. "mpdboot -n 4 -f mpd.hosts" when only 2 nodes are
>>>>> used). In that case I get the typical "Too many hosts" error, but
>>>>> the mpd ring is formed and the job runs as expected. Like I said,
>>>>> I've used this same script on another of our clusters and it works
>>>>> just fine, and I've gone through the troubleshooting section in
>>>>> the manual without finding an error.
>>>>>
>>>>> Any suggestions on why the mpd ring is hanging?
>>>>>
>>>>> Jacob
>>>>>
>>>>> On Fri, Apr 23, 2010 at 1:16 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:
>>>>>>
>>>>>> Hi Jacob,
>>>>>>
>>>>>> Can you post your mpdboot invocation? There is probably either a
>>>>>> mistake in that line or a bug in mpdboot.
>>>>>>
>>>>>> -Dave
>>>>>>
>>>>>> On Apr 23, 2010, at 11:41 AM, Jacob Harvey wrote:
>>>>>>
>>>>>>> MPICH2 users,
>>>>>>>
>>>>>>> Has anyone seen this error from mpdboot before?
>>>>>>>
>>>>>>> Traceback (most recent call last):
>>>>>>>   File "/opt/mpich2-1.2.1-i86/bin/mpdboot", line 476, in ?
>>>>>>>     mpdboot()
>>>>>>>   File "/opt/mpich2-1.2.1-i86/bin/mpdboot", line 158, in mpdboot
>>>>>>>     totalnumToStart = int(argv[argidx+1])
>>>>>>> ValueError: invalid literal for int(): 2-f
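>>>>>>>
>>>>>>> Reading the traceback, mpdboot apparently received the single
>>>>>>> token "2-f" where it expected an integer host count. A minimal
>>>>>>> sketch reproducing the same failure under Python 2:
>>>>>>>
>>>>>>>   $ python -c 'int("2-f")'
>>>>>>>   ValueError: invalid literal for int(): 2-f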
>>>>>>>
>>>>>>> It's odd because I only get this error when I try to set up an
>>>>>>> mpd ring for calculations in a PBS script; if I set up an mpd
>>>>>>> ring from the command line, it forms just fine. I've also gone
>>>>>>> through the MPD debugging steps in the manual and did not turn
>>>>>>> up any errors. Another puzzling piece of information is that I
>>>>>>> use the exact same PBS script on a different cluster and the mpd
>>>>>>> ring forms just fine there. So I'm confused as to what the
>>>>>>> problem is here. Any thoughts? Thanks in advance!
>>>>>>>
>>>>>>> Jacob
>>>>>>>
> -- 
> Jacob Harvey
> Graduate Student
> University of Massachusetts Amherst
> j.harv8 at gmail.com
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss


