[mpich-discuss] mpdboot hangs, but ...

Scott Atchley atchley at myri.com
Mon Dec 14 09:17:57 CST 2009


Dave,

I encountered this problem on Friday as well. I hit ^C and the MPDs
were running. I should have attached and gotten a stack trace.

If I encounter it again this week, I will try to get more info.

Scott

On Dec 14, 2009, at 10:10 AM, Dave Goodell wrote:

> Unfortunately, I haven't been able to reproduce this hang here when  
> using the updated version of mpd.py that I posted earlier, which  
> makes it difficult to debug and fix your problem.
>
> I would recommend trying the hydra process manager instead.  It will  
> be the default process manager in the next release and is where we  
> are putting most of our PM development effort.
>
> http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager
>
> If you really need MPD specifically, I might be able to figure out  
> what's going on from some "lsof" output on your system together with  
> a tweaked mpd.py/mpdboot.py.  But we should probably open a ticket  
> to track this at that point.
>
> -Dave
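
The traceback in Ben's original report (quoted further down) shows
mpdboot blocked in fd.readlines(). In Python, readlines() returns only
at end-of-file, that is, once every process holding the write end of the
pipe has closed it, and lsof is one way to see which processes those
are. The thread does not establish that this is the cause of the hang,
so the sketch below only demonstrates the general behavior. The child
command is a hypothetical stand-in for a daemon that prints one line and
then keeps its stdout open.

```python
import subprocess
import sys
import time

# Hypothetical stand-in for a daemon: print one line, then keep stdout open.
CHILD = "import sys, time; print('ready'); sys.stdout.flush(); time.sleep(1)"


def time_read(method):
    """Start the child, read its stdout with `method`, return (data, seconds)."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD],
                            stdout=subprocess.PIPE, text=True)
    start = time.time()
    data = getattr(proc.stdout, method)()
    elapsed = time.time() - start
    proc.stdout.close()
    proc.wait()
    return data, elapsed


if __name__ == "__main__":
    # readline() returns as soon as the first line arrives.
    print(time_read("readline"))
    # readlines() returns only at EOF, here when the child exits.
    print(time_read("readlines"))
```

If the child never exited, or handed the pipe to a long-lived
descendant, the readlines() call would wait indefinitely, while reading
line by line would let the caller stop once it has the output it needs.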
>
> On Dec 10, 2009, at 9:33 AM, Benjamin Svetitsky wrote:
>
>> Quite right.  I had a line in my ~/.mpd.conf that said
>> MPD_USE_ROOT_MPD=1
>>
>> When I changed the 1 to 0 I was able to run mpdboot as a regular  
>> user. The result was exactly the same as when I ran mpdboot as  
>> root:  It started the daemons and then hung until I hit ^C,  
>> whereupon the traceback was the same as noted below.  After the ^C  
>> I was able to run MPI jobs.
>>
>> 			Ben
>>
>>
>> Dave Goodell wrote:
>>> The regular user hang might be because you don't have a proper  
>>> ~/.mpd.conf when running as the regular user.  Can you try just  
>>> running "mpd &" as the regular user and see if you get any error  
>>> messages out?
>>> Appendix A of this document also can be helpful for mpd problems: http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-1.2.1-installguide.pdf 
>>> If you CTRL-C the hung mpdboot process when running as a regular
>>> user, what traceback do you get?
>>> -Dave
>>> On Dec 6, 2009, at 9:19 AM, Benjamin Svetitsky wrote:
>>>> If I run mpdboot as a regular user then it hangs WITHOUT starting  
>>>> up the daemons.    -Ben
>>>>
>>>> Rajeev Thakur wrote:
>>>>> Are you running as root? Can you try running it as regular user  
>>>>> first?
>>>>> Rajeev
>>>>>> -----Original Message-----
>>>>>> From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Benjamin Svetitsky
>>>>>> Sent: Sunday, December 06, 2009 8:02 AM
>>>>>> To: mpich-discuss at mcs.anl.gov
>>>>>> Subject: Re: [mpich-discuss] mpdboot hangs, but ...
>>>>>>
>>>>>> Hi Dave,
>>>>>>
>>>>>> Thanks for the quick work.  But the problem is still there.  I  
>>>>>> downloaded the file and put it where you said; then I did:
>>>>>> configure
>>>>>> make
>>>>>> make install
>>>>>> and verified that the new copy of mpd.py is in /usr/local/bin.  But
>>>>>> mpdboot -n 4 -f /root/mpd.hosts
>>>>>> still doesn't exit after starting up the daemons.
>>>>>>
>>>>>>       -Ben
>>>>>>
>>>>>> Dave Goodell wrote:
>>>>>>> This has been fixed in the trunk.  Anyone who needs a fix in the short
>>>>>>> term should be able to download the following copy of mpd.py and drop
>>>>>>> it into src/pm/mpd/ in their MPICH2 source tree (and then re-install
>>>>>>> MPICH2):
>>>>>>>
>>>>>>> https://trac.mcs.anl.gov/projects/mpich2/export/5923/mpich2/trunk/src/pm/mpd/mpd.py
>>>>>>>
>>>>>>>
>>>>>>> -Dave
>>>>>>>
>>>>>>> On Dec 4, 2009, at 9:28 AM, Dave Goodell wrote:
>>>>>>>
>>>>>>>> Hi Ben,
>>>>>>>>
>>>>>>>> This looks very similar to ticket #963: https://trac.mcs.anl.gov/projects/mpich2/ticket/963
>>>>>>>>
>>>>>>>> Please feel free to add yourself to the CC list if you would like to
>>>>>>>> receive progress updates.  Thanks for letting us know that
>>>>>>>> you are having trouble.
>>>>>>>>
>>>>>>>> Thinking about it this morning, I just had an idea of what might be
>>>>>>>> going on. I'll spend some time on it today and see if I can reproduce
>>>>>>>> it and work up a fix.
>>>>>>>>
>>>>>>>> In the meantime, you can either use the hydra process manager (built
>>>>>>>> by default as "mpiexec.hydra") or copy the mpd.py script from 1.1.1p1
>>>>>>>> as a workaround.
>>>>>>>>
>>>>>>>> -Dave
>>>>>>>>
>>>>>>>> On Dec 4, 2009, at 5:38 AM, Benjamin Svetitsky wrote:
>>>>>>>>
>>>>>>>>> Hello everybody,
>>>>>>>>>
>>>>>>>>> I just upgraded from mpich2-1.0.8 to mpich2-1.2.1.  When I run
>>>>>>>>>
>>>>>>>>>  mpdboot -n 4 -f /root/mpd.hosts
>>>>>>>>>
>>>>>>>>> as root, the command just hangs until I hit ^C.
>>>>>>>>> Nonetheless, it starts the daemons successfully and I can
>>>>>>>>> run MPI jobs as usual (so far).  A subsequent mpdallexit kills
>>>>>>>>> the daemons without complaint.
>>>>>>>>>
>>>>>>>>> Details:
>>>>>>>>>
>>>>>>>>> I am running four Intel quad-cores under CentOS:
>>>>>>>>> Linux version 2.6.18-164.6.1.el5.centos.plus
>>>>>>>>> The file /root/mpd.hosts contains:
>>>>>>>>> -- 
>>>>>>>>> nodeA
>>>>>>>>> nodeB
>>>>>>>>> nodeC
>>>>>>>>> nodeD
>>>>>>>>> -- 
>>>>>>>>> and I executed mpdboot on nodeC.
>>>>>>>>> I compiled the MPICH2 source without any config options.
>>>>>>>>> After mpdboot hangs for several minutes and I hit ^C, it  
>>>>>>>>> responds:
>>>>>>>>> -- 
>>>>>>>>> Traceback (most recent call last):
>>>>>>>>>   File "/usr/local/bin/mpdboot", line 476, in ?
>>>>>>>>>     mpdboot()
>>>>>>>>>   File "/usr/local/bin/mpdboot", line 347, in mpdboot
>>>>>>>>>     handle_mpd_output(fd,fd2idx,hostsAndInfo)
>>>>>>>>>   File "/usr/local/bin/mpdboot", line 385, in handle_mpd_output
>>>>>>>>>     for line in fd.readlines():    # handle output from shells that echo stuff
>>>>>>>>> KeyboardInterrupt
>>>>>>>>> -- 
>>>>>>>>> which may be irrelevant.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>  Ben
>>>>>>>>> -- 
>>>>>>>>> Prof. Benjamin Svetitsky         Phone:            +972-3-640 8870
>>>>>>>>> School of Physics and Astronomy  Fax:              +972-3-640 7932
>>>>>>>>> Tel Aviv University              E-mail:      bqs at julian.tau.ac.il
>>>>>>>>> 69978 Tel Aviv, Israel           WWW: http://julian.tau.ac.il/~bqs
>>>>>>>>> _______________________________________________
>>>>>>>>> mpich-discuss mailing list
>>>>>>>>> mpich-discuss at mcs.anl.gov
>>>>>>>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss


