[mpich-discuss] mpdboot error

Jacob Harvey j.harv8 at gmail.com
Fri Apr 23 13:15:24 CDT 2010


Dave and MPICH2 users,

Thanks for your timely response. I did have an error in my mpdboot
invocation: I was missing a space between the node count and -f (doh),
i.e. I had:

mpdboot -n 2-f mpd.hosts
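
With the space restored, the line reads "mpdboot -n 2 -f mpd.hosts". A
minimal sketch of a guard that could catch this kind of typo in a job
script before mpdboot tries to convert the count to an integer is shown
here. The name is_node_count is a hypothetical helper for illustration,
not part of MPICH2, and the mpdboot call is left commented out so the
sketch runs without MPICH2 installed.

```shell
#!/bin/sh
# Hypothetical helper (not part of MPICH2): succeed only if $1 is a
# positive decimal integer, i.e. a value mpdboot can accept after -n.
is_node_count() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit ("2-f")
        0*)          return 1 ;;   # zero, or a leading zero
        *)           return 0 ;;
    esac
}

# Possible use inside a job script (commented out on purpose):
# is_node_count "$NODES" || { echo "bad node count: $NODES" >&2; exit 1; }
# $MPI_HOME/bin/mpdboot -n "$NODES" -f mpd.hosts
```

With this guard, a value such as "2-f" is rejected by the script itself
rather than surfacing as a Python traceback from mpdboot.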

Fixing this, however, exposed another problem. I want to run on 2 nodes
with 2 processes each (4 total), but my PBS script seems to hang while
booting the mpd ring. The main part of the script looks like this
($NODES = 2):

echo 'Building the MPD ring'
$MPI_HOME/bin/mpdboot -n $NODES -f $PBS_NODEFILE -r ssh
echo ' '

echo 'Inspecting if all MPI nodes have been activated'
$MPI_HOME/bin/mpdtrace -l
echo ' '

echo 'Checking the connectivity'
$MPI_HOME/bin/mpdringtest 100
echo ' '

echo 'Running my code in parallel'
echo ' '

if [[ `uname -i` == "x86_64" ]]; then
     mpirun -np 4 ~/dlpoly/dl_poly_2.20_i86/execute/DLPOLY.X
else
     mpirun -np 4 ~/dlpoly/dl_poly_2.20_i386/execute/DLPOLY.X
fi

Yet all I get is the 'Building the MPD ring' line, with no further
output (from either the PBS script or the program I am running) in my
output file. While the job is "running", though, I can ssh to the nodes
involved and see that the mpd ring is up (either by seeing the mpd
process in top or by running mpdtrace -l). What seems to force mpdboot
to move on is specifying more hosts than are available (i.e.
"mpdboot -n 4 -f mpd.hosts" when only 2 nodes are used). In that case
I get the typical "Too many hosts" error, but the mpd ring is formed
and the job runs as expected. As I said, I've used this same script on
another one of our clusters and it works just fine, and I've gone
through the troubleshooting section in the manual but did not find an
error.
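
One check that could be added to the script while debugging is to
compare $NODES with the number of distinct hosts in the node file. This
is only a sketch under an assumption: PBS commonly writes one line per
allocated processor to $PBS_NODEFILE, so with 2 processes per node each
host may appear twice. Whether that is related to the hang has not been
confirmed. The helper name count_unique_hosts is hypothetical, and the
mpdboot call is commented out so the sketch runs without MPICH2.

```shell
#!/bin/sh
# Hypothetical helper: print the number of distinct, non-empty host
# names listed in the file given as $1.
count_unique_hosts() {
    grep -v '^[[:space:]]*$' "$1" | sort -u | wc -l | tr -d ' '
}

# Possible use inside the PBS script (commented out on purpose):
# HOSTS=$(count_unique_hosts "$PBS_NODEFILE")
# echo "NODES=$NODES, distinct hosts in node file=$HOSTS"
# sort -u "$PBS_NODEFILE" > mpd.hosts
# $MPI_HOME/bin/mpdboot -n "$HOSTS" -f mpd.hosts -r ssh
```

If the two numbers disagree, passing mpdboot a de-duplicated host file
and a matching count would at least rule that out as a factor.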

Any suggestions on why the mpd ring is hanging?

Jacob

On Fri, Apr 23, 2010 at 1:16 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:
> Hi Jacob,
>
> Can you post your mpdboot invocation?  There is probably either a mistake in
> that line or a bug in mpdboot.
>
> -Dave
>
> On Apr 23, 2010, at 11:41 AM, Jacob Harvey wrote:
>
>> MPICH2 users,
>>
>> Has anyone seen this error from mpdboot before?
>>
>> Traceback (most recent call last):
>>  File "/opt/mpich2-1.2.1-i86/bin/mpdboot", line 476, in ?
>>   mpdboot()
>>  File "/opt/mpich2-1.2.1-i86/bin/mpdboot", line 158, in mpdboot
>>   totalnumToStart = int(argv[argidx+1])
>> ValueError: invalid literal for int(): 2-f
>>
>> It's odd because I only get this error when I try to set up an mpd ring
>> for calculations in a PBS script. But if I try to set up an mpd ring
>> from the command line then the ring forms just fine. Plus I've gone
>> through the debugging MPD rings in the manual and did not turn up any
>> errors in doing so. Another puzzling piece of information is that I
>> use the same exact PBS script on a different cluster and the mpd ring
>> forms just fine. So I'm confused as to what is the problem here. Any
>> thoughts? Thanks in advance!
>>
>> Jacob
>>
>> --
>> --
>> Jacob Harvey
>>
>> Graduate Student
>>
>> University of Massachusetts Amherst
>>
>> j.harv8 at gmail.com
>> _______________________________________________
>> mpich-discuss mailing list
>> mpich-discuss at mcs.anl.gov
>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>



-- 
--
Jacob Harvey

Graduate Student

University of Massachusetts Amherst

j.harv8 at gmail.com

