[MPICH] MPI_Comm_spawn, -usize and -machinefile
Martin Siegert
siegert at sfu.ca
Fri Jan 6 16:28:46 CST 2006
Sorry for replying to my own email, but ...
On Thu, Jan 05, 2006 at 06:39:34PM -0800, Martin Siegert wrote:
> Hi,
>
> I am trying to figure out how to use MPI_Comm_spawn. In particular,
> I want the slave processes spawned on nodes specified in the
> -machinefile argument to mpiexec, e.g.,
>
> mpiexec -machinefile mpihosts -usize 4 -n 1 ./master_prog ./slave_prog
>
> master_prog then calls
>
> MPI_Comm_spawn(argv[1], slave_argv, universe_size-1,
> MPI_INFO_NULL, 0, MPI_COMM_SELF, &everyone,
> MPI_ERRCODES_IGNORE);
>
> and I expected that those slave processes would run on the remaining
> hosts specified in the "mpihosts" file (there are 4 hosts in that file).
> That's not what is happening, instead the slaves are spawned on the
> first 3 hosts listed by mpdtrace. Is there any way to have those slaves
> started on the nodes specified in the mpihosts file?
>
> Or is the only way to achieve this by doing
>
> export MPD_USE_ROOT_MPD=0
> mpdboot -n 4 -f mpihosts
> mpiexec -usize 4 -n 1 ./master_prog ./slave_prog
> mpdallexit
>
> (this is with mpich2-1.0.3 and I usually use the mpd's started by root
> at boot time on each node, i.e., every user by default has the
> environment variable MPD_USE_ROOT_MPD set to 1).
Even this last method does not work:
assume I have a "mpihosts" file
r1
r2
r2
r3
r4
r4
- usually this would be the $PBS_NODEFILE generated by the batch scheduler.
I can get the no. of mpd to boot through
nmpd=`cat mpihosts | sort -u | wc -l`
and the no. of processes through
ncpus=`cat mpihosts | wc -l`
and then would do
unset MPD_USE_ROOT_MPD
mpdboot -n $nmpd -f mpihosts -r rsh
mpiexec -usize $ncpus -n 1 ./master_prog ./slave_prog
But this starts the slaves on the wrong hosts as well, e.g., assuming that
mpdtrace shows
r1
r3
r2
r4
I would have a master on r1 and slaves on r1, r3, r3, r2, and r4.
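For reference, the two counts used above, together with the number of processes the machine file implies for each host, can be computed as in the following sketch. It assumes a POSIX shell and writes the six-line sample file to a temporary file (a made-up location, so the real mpihosts is not touched).

```shell
# Sample machine file from above, written to a temporary file (hypothetical path).
hosts=$(mktemp)
printf '%s\n' r1 r2 r2 r3 r4 r4 > "$hosts"

# Number of distinct hosts: one mpd per host when booting with mpdboot -n.
nmpd=$(sort -u "$hosts" | wc -l)
# Number of lines: total number of processes, i.e. the value passed to -usize.
ncpus=$(wc -l < "$hosts")
# Lines per host: how many processes each host is expected to run.
perhost=$(sort "$hosts" | uniq -c |
    awk '{ printf "%s%s:%s", sep, $2, $1; sep = " " } END { print "" }')

# $(( )) strips the leading blanks that some wc implementations print.
echo "nmpd=$((nmpd)) ncpus=$((ncpus)) perhost=$perhost"
rm -f "$hosts"
```

For the sample file this prints nmpd=4 ncpus=6 perhost=r1:1 r2:2 r3:1 r4:2, which is the placement I would like the master and slaves to follow.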
I then tried
mpdboot -n 6 -f mpihosts -r rsh -1
mpdtrace
r1
r2
r1
r4
r2
r3
which again shows the wrong list of hosts: 2 mpds on r1 and r2 instead of
two mpds on r2 and r4. Isn't "mpdboot -1 -f mpihosts ..." supposed to
start one mpd for each line in the mpihosts file?
[also: mpdboot -1 appears to be quite unreliable: about half the time
when I try this I get an error
mpdboot_r1 (handle_mpd_output 368): failed to connect to mpd on r2]
The only way I got this to work was:
mpd &
port=`mpdtrace -l | sed -e 's/.*_//' -e 's/[^0-9].*//'`
rsh -n r2 "unset MPD_USE_ROOT_MPD;mpd -p $port" &
rsh -n r2 "unset MPD_USE_ROOT_MPD;mpd -p $port --noconsole" &
rsh -n r3 "unset MPD_USE_ROOT_MPD;mpd -p $port" &
rsh -n r4 "unset MPD_USE_ROOT_MPD;mpd -p $port" &
rsh -n r4 "unset MPD_USE_ROOT_MPD;mpd -p $port --noconsole" &
mpiexec -usize 6 -n 1 ./master_prog ./slave_prog
which is really too ugly and complicated for general use.
I guess I could write a script that parses the PBS_NODEFILE
and starts the mpds, but isn't there an easier way?
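A rough sketch of what such a script might look like is below. It only prints the rsh commands rather than running them. The function name gen_mpd_cmds and the port 12345 are made up for illustration; the mpd options are simply copied from the commands above and not checked any further. It assumes the first line of the machine file names the local host, where an mpd is already running.

```shell
# Print (not run) the rsh commands that start one mpd per line of a machine
# file. gen_mpd_cmds is a hypothetical helper, not part of MPICH2.
#   $1 = machine file, $2 = port of the mpd already running on the local host
gen_mpd_cmds() {
    awk -v port="$2" '
        NF == 0 { next }                  # skip blank lines
        {
            n = seen[$1]++                # earlier lines naming this host
            if (!started++) next          # first entry: local mpd, already up
            extra = (n > 0) ? " --noconsole" : ""
            printf "rsh -n %s \"unset MPD_USE_ROOT_MPD; mpd -p %s%s\" &\n", \
                   $1, port, extra
        }' "$1"
}

# Demo with the sample file from above (temporary path, made-up port).
nodes=$(mktemp)
printf '%s\n' r1 r2 r2 r3 r4 r4 > "$nodes"
gen_mpd_cmds "$nodes" 12345
rm -f "$nodes"
```

In real use the port would come from mpdtrace -l as above, the machine file would be $PBS_NODEFILE, and the printed commands could be piped to sh once they look right. The first mpd on each host is started with a console and any further ones with --noconsole, mirroring the hand-written commands.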
Cheers,
Martin