[mpich-discuss] mpdboot error

Dave Goodell goodell at mcs.anl.gov
Fri Apr 23 13:47:17 CDT 2010


What version of MPICH2 are you using?  What operating system are you  
using (CentOS/RHEL?)?  You may be hitting the bug linked from that FAQ  
entry: https://trac.mcs.anl.gov/projects/mpich2/ticket/974

-Dave

On Apr 23, 2010, at 1:42 PM, Jacob Harvey wrote:

> Dave,
>
> See, the weird thing is that the mpd ring does seem to be forming. It's
> simply that my script hangs on the mpdboot line. But I can ssh to the
> nodes specified for the job and see the mpd process running on each
> node. So the mpd ring forms, but why doesn't mpdboot give control back
> to the script?
>
> I'm definitely up for more suggestions if anyone has any, but in
> the meantime I'll give hydra a shot. Thanks.
>
> Jacob
>
> On Fri, Apr 23, 2010 at 2:21 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:
>> It's good to hear that it was just a bad mpdboot line.
>>
>> As for your further troubles with mpd, please see this page:
>> http://wiki.mcs.anl.gov/mpich2/index.php/Frequently_Asked_Questions#Q:_My_MPD_ring_won.27t_start.2C_what.27s_wrong.3F
>>
>> Particularly, I would recommend using hydra if you don't specifically
>> need MPD.
>>
>> -Dave
>>
>> On Apr 23, 2010, at 1:15 PM, Jacob Harvey wrote:
>>
>>> Dave and MPICH2 users,
>>>
>>> Thanks for your timely response. I did actually have an error in my
>>> invocation of mpdboot. I was missing a space on my mpdboot line (doh),
>>> i.e., I had:
>>>
>>> mpdboot -n 2-f mpd.hosts
>>>
>>> But fixing this has led me to some other weird problems. I want to run
>>> on 2 nodes with 2 procs each (4 total), but my PBS script seems to hang
>>> while booting the mpd ring. For instance, the main part of my PBS script
>>> looks like this ($NODES = 2):
>>>
>>> echo 'Building the MPD ring'
>>> $MPI_HOME/bin/mpdboot -n $NODES -f $PBS_NODEFILE -r ssh
>>> echo ' '
>>>
>>> echo 'Inspecting if all MPI nodes have been activated'
>>> $MPI_HOME/bin/mpdtrace -l
>>> echo ' '
>>>
>>> echo 'Checking the connectivity'
>>> $MPI_HOME/bin/mpdringtest 100
>>> echo ' '
>>>
>>> echo 'Running my code in parallel'
>>> echo ' '
>>>
>>> if [[ `uname -i` == "x86_64" ]]; then
>>>    mpirun -np 4 ~/dlpoly/dl_poly_2.20_i86/execute/DLPOLY.X
>>> else
>>>    mpirun -np 4 ~/dlpoly/dl_poly_2.20_i386/execute/DLPOLY.X
>>> fi
>>>
>>> Yet all I get is the 'Building the MPD ring' line, with no further
>>> output (from either the PBS script or the program I am running) in my
>>> output file. But while the job is "running" I can ssh to the nodes
>>> involved in the mpd ring and see the mpd ring running (either by
>>> seeing the process in top, or with mpdtrace -l). What seems to force
>>> mpdboot to move on is specifying more hosts than are available (i.e.,
>>> "mpdboot -n 4 -f mpd.hosts" when only 2 nodes are used). In that case
>>> I get the typical "Too many hosts" error, but the mpd ring is formed
>>> and the job runs as expected. Like I said, I've used this same script
>>> on another one of our clusters and it works just fine, and I've gone
>>> through the troubleshooting section in the manual but did not find an
>>> error.
>>>
>>> Any suggestions on why the mpd ring is hanging?
>>>
>>> Jacob
>>>
>>> On Fri, Apr 23, 2010 at 1:16 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:
>>>>
>>>> Hi Jacob,
>>>>
>>>> Can you post your mpdboot invocation?  There is probably either a
>>>> mistake in that line or a bug in mpdboot.
>>>>
>>>> -Dave
>>>>
>>>> On Apr 23, 2010, at 11:41 AM, Jacob Harvey wrote:
>>>>
>>>>> MPICH2 users,
>>>>>
>>>>> Has anyone seen this error from mpdboot before?
>>>>>
>>>>> Traceback (most recent call last):
>>>>>   File "/opt/mpich2-1.2.1-i86/bin/mpdboot", line 476, in ?
>>>>>     mpdboot()
>>>>>   File "/opt/mpich2-1.2.1-i86/bin/mpdboot", line 158, in mpdboot
>>>>>     totalnumToStart = int(argv[argidx+1])
>>>>> ValueError: invalid literal for int(): 2-f
>>>>>
>>>>> It's odd because I only get this error when I try to set up an mpd ring
>>>>> for calculations in a PBS script. If I set up an mpd ring from the
>>>>> command line, the ring forms just fine. Plus I've gone through the
>>>>> section on debugging MPD rings in the manual and did not turn up any
>>>>> errors in doing so. Another puzzling piece of information is that I
>>>>> use the exact same PBS script on a different cluster and the mpd ring
>>>>> forms just fine. So I'm confused as to what the problem is here. Any
>>>>> thoughts? Thanks in advance!
>>>>>
>>>>> Jacob
>>>>>
>>>>> --
>>>>> --
>>>>> Jacob Harvey
>>>>>
>>>>> Graduate Student
>>>>>
>>>>> University of Massachusetts Amherst
>>>>>
>>>>> j.harv8 at gmail.com
>>>>> _______________________________________________
>>>>> mpich-discuss mailing list
>>>>> mpich-discuss at mcs.anl.gov
>>>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
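
For anyone finding this thread later: the traceback in the first message follows from how the command line is split into arguments. Below is a minimal sketch in Python, assuming only what the traceback shows (the token after -n is handed to int()). The helper name parse_n is made up for illustration and is not mpdboot's actual code.

```python
import shlex


def parse_n(command_line):
    """Return the integer that follows -n on the given command line.

    Hypothetical helper mimicking the int() call seen in the traceback.
    """
    argv = shlex.split(command_line)  # whitespace splitting, as a shell would do
    argidx = argv.index("-n")
    return int(argv[argidx + 1])      # raises ValueError for a non-numeric token


# With the space in place, the token after -n is "2".
good = parse_n("mpdboot -n 2 -f mpd.hosts")

# Without the space, "2-f" is a single token and int() rejects it.
try:
    parse_n("mpdboot -n 2-f mpd.hosts")
    bad_error = None
except ValueError as exc:
    bad_error = str(exc)

print(good)
print(bad_error)
```

The exact wording of the ValueError message differs between Python versions, but in both cases the offending token "2-f" is reported.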


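On the "2 nodes with 2 procs each" setup: PBS/Torque normally writes one line per allocated processor slot to $PBS_NODEFILE, so each host may appear there more than once. Whether that has anything to do with the hang is not established in this thread. The sketch below only shows how one might count distinct hosts under that assumption; the function name and the sample host names are made up.

```python
def unique_hosts(nodefile_text):
    """Return distinct host names in first-seen order, skipping blank lines."""
    seen = []
    for line in nodefile_text.splitlines():
        host = line.strip()
        if host and host not in seen:
            seen.append(host)
    return seen


# Made-up sample: 2 nodes with 2 slots each, as a PBS nodefile might list them.
sample = "node01\nnode01\nnode02\nnode02\n"
hosts = unique_hosts(sample)
print(len(hosts), hosts)
```

In a real job the text would come from the file named by the PBS_NODEFILE environment variable, and len(hosts) would be a candidate value for the -n argument.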
