[MPICH] Problem probing message from dynamically spawned process
Pieter Thysebaert
pieter.thysebaert at intec.ugent.be
Thu May 24 03:24:42 CDT 2007
Hello,
I've attached the three programs test-master, test-worker, and test-slave
(P1, P2, and P3).
Test-master spawns test-worker and then starts an infinite MPI_Iprobe
loop listening for test-worker messages. One message is sent from
test-worker to test-master.
Similarly, test-worker spawns test-slave and starts an MPI_Iprobe loop.
One message is sent from test-slave to test-worker. It can be received
in test-worker by calling the appropriate MPI_Recv directly, but as it
is, this message is never detected by MPI_Iprobe in test-worker (thus,
test-worker stalls in an infinite loop and the system never returns).
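In rough outline, test-worker does the following (this is only a sketch
to show the structure; the hardcoded path to test-slave, the actual tag
value, and the exact output statements are in the attached test-worker.cc):

#include <mpi.h>
#include <iostream>
using namespace std;

#define TAG_TEST 1   // placeholder; the real tag value is defined in the attachment

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    // Intercommunicator back to the test-master that spawned us
    MPI_Comm master;
    MPI_Comm_get_parent(&master);

    // Spawn test-slave (the attachment uses a hardcoded absolute path here)
    MPI_Comm slaveComm;
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Comm_spawn("test-slave", MPI_ARGV_NULL, 1, info, 0,
                   MPI_COMM_SELF, &slaveComm, MPI_ERRCODES_IGNORE);

    // Send one test message to test-master, which detects it via MPI_Iprobe
    int test = 1;
    MPI_Send(&test, 1, MPI_INT, 0, TAG_TEST, master);

    // Message loop: probe for the slave's test message.
    // In practice the flag never becomes 1, so this loop spins forever.
    for (;;) {
        int flag = 0;
        MPI_Status s;
        MPI_Iprobe(0, TAG_TEST, slaveComm, &flag, &s);
        if (flag)
            break;
    }

    int data;
    MPI_Status s;
    MPI_Recv(&data, 1, MPI_INT, 0, TAG_TEST, slaveComm, &s);
    cout << "Worker: received data from slave: " << data << endl;

    MPI_Finalize();
    return 0;
}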
I compile these programs with mpicxx, which on my Debian Etch AMD64
system reports the following version information:
mpicxx for 1.0.5
Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v
--enable-languages=c,c++,fortran,objc,obj-c++,treelang --prefix=/usr
--enable-shared --with-system-zlib --libexecdir=/usr/lib
--without-included-gettext --enable-threads=posix --enable-nls
--program-suffix=-4.1 --enable-__cxa_atexit --enable-clocale=gnu
--enable-libstdcxx-debug --enable-mpfr --enable-checking=release
x86_64-linux-gnu
Thread model: posix
gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
The master and worker contain hardcoded paths ("/home/pthyseba/...") to
the worker and slave executables, so they will need to be changed when
running on another system.
Anyway, mpdtrace -l on my machine returns:
fleming_39205 (<my_ip>)
"mpiexec -n 1 test-master" outputs things like:
Master: message from worker detected after 85632 probes
Slave started, contacting worker
and then the program hangs.
If I remove the infinite loop in test-worker and replace it with a single
MPI_Recv operation to capture the test message sent by test-slave, it
simply works:
Master: message from worker detected after 2302235 probes
Slave started, contacting worker
Worker: received data from slave: 1
and the program exits cleanly.
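(The working variant just replaces the whole probe loop in test-worker
with a blocking receive, roughly:

    int data;
    MPI_Status s;
    MPI_Recv(&data, 1, MPI_INT, 0, TAG_TEST, slaveComm, &s);
    cout << "Worker: received data from slave: " << data << endl;

using the same slaveComm intercommunicator as in the sketch above.)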
This has been puzzling me for over a week, and I am at a complete loss
as to whether I am doing something wrong or whether something is wrong
with my system.
My MPICH2 version is 1.0.5, and I compiled it from source packages at
http://www-unix.mcs.anl.gov/mpi/mpich/.
I'm currently downloading 1.0.5p4, but I am unsure whether that upgrade
will do anything for my problem.
Any and all suggestions are more than welcome!
Pieter
Rajeev Thakur wrote:
> It should work. Can you send us a test program that we can use to reproduce
> the problem?
>
> Rajeev
>
>
>> -----Original Message-----
>> From: owner-mpich-discuss at mcs.anl.gov
>> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Pieter
>> Thysebaert
>> Sent: Wednesday, May 23, 2007 8:02 AM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: Re: [MPICH] Problem probing message from dynamically
>> spawned process
>>
>> Following up on my original post:
>>
>> The difficulties I'm seeing occur when the master process itself is
>> MPI_Comm_spawn-ed by another process.
>> So I seem to be having problems with nested MPI_Comm_spawn calls....
>> Is nesting MPI_Comm_spawn supposed to be a functional and supported
>> operation?
>>
>>
>> So to summarize:
>>
>> when I only have 2 processes, P1 and P2, and P1 MPI_Comm_spawn-s P2, I
>> can implement a message loop in P1 (using MPI_Iprobe).
>>
>> When I have 3 processes with P1 spawning P2 and P2 spawning P3, I can
>> implement a message loop in P1 listening for messages from P2, I can
>> also send data from P3 to P2 BUT MPI_Iprobe() in P2, testing for P3
>> messages always returns false, prohibiting me from implementing a
>> similar message loop in P2 (listening for P3 messages).
>>
>>
>> Is there some race condition or unsupported feature (or blatant misuse
>> of the MPI API) I'm unaware of?
>>
>> Thanks,
>> Pieter
>>
>>
>>
>> Pieter Thysebaert wrote:
>>
>>> Hello,
>>>
>>> I'm using MPICH2 1.0.5 on Debian Etch AMD64 (mpd daemon). I'm trying to
>>> implement a Master / Worker architecture, where the master can dynamically
>>> spawn additional workers (using MPI_Comm_spawn).
>>>
>>> Ultimately, I want the master to listen to its workers using a loop with
>>> MPI_Iprobe statements to process incoming messages. However, when testing
>>> my initial efforts, I have stumbled over a peculiar situation which
>>> (seemingly) allows the Master to receive a worker's (test) message, but
>>> cannot Iprobe for it.
>>>
>>> In my testing, the spawned Workers run on the same machine as the Master.
>>>
>>> Assume the Worker (residing in an executable called "Worker") looks like
>>> this:
>>>
>>> int main(int argc, char** argv) {
>>>     MPI_Comm Master;
>>>
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_get_parent(&Master);
>>>     if (Master == MPI_COMM_NULL) {
>>>         cerr << "No parent Master!" << endl;
>>>         return 1;
>>>     }
>>>
>>>     int size;
>>>     MPI_Comm_remote_size(Master, &size);
>>>     if (size != 1) {
>>>         cerr << "Parent Master doesn't have size 1" << endl;
>>>         return 1;
>>>     }
>>>
>>>     // Test: send test message to Master
>>>     int test = 37;
>>>     MPI_Send(&test, 1, MPI_INT, 0, TAG_TEST, Master);
>>>     // Rest of code
>>> }
>>>
>>>
>>> And the Master begins as
>>>
>>> int main(int argc, char** argv) {
>>>     MPI_Init(&argc, &argv);
>>>
>>>     MPI_Comm workerComm;
>>>     MPI_Info ourInfo;
>>>     MPI_Info_create(&ourInfo);
>>>
>>>     // Spawn Worker
>>>     MPI_Comm_spawn("Worker", MPI_ARGV_NULL, 1, ourInfo, 0,
>>>                    MPI_COMM_SELF, &workerComm, MPI_ERRCODES_IGNORE);
>>>
>>>     // Test: check test message from worker
>>>     MPI_Status s;
>>>     for (;;) {
>>>         int flag = 0;
>>>         int result = MPI_Iprobe(0, TAG_TEST, workerComm, &flag, &s);
>>>         cout << "MPI_Iprobe: result is " << result << ", flag is " << flag << endl;
>>>         if (flag > 0)
>>>             break;
>>>     }
>>>
>>>     int test;
>>>     MPI_Recv(&test, 1, MPI_INT, 0, TAG_TEST, workerComm, &s);
>>>     cout << "BlackBoard: Have received test, data is " << test << endl;
>>> }
>>>
>>>
>>> What happens when running this architecture (mpiexec -n 1 Master) is that
>>> the Master never leaves its for loop (probing for messages from the
>>> Worker, flag and result equal 0 forever; according to my docs, flag
>>> should become 1 when a message is available), even if I let it run for a
>>> long time.
>>>
>>> However, when I remove the for loop in the Master and immediately proceed
>>> to MPI_Recv() of the TAG_TEST message, all goes well (i.e. the message is
>>> received by the master and both master and worker continue).
>>>
>>> What am I doing wrong or not understanding correctly?
>>>
>>> The message send/receive and probing works fine on this same machine
>>> when two processes are started with mpiexec -n 2 (and thus have ranks 0
>>> and 1 in the same MPI_COMM_WORLD) and MPI_COMM_WORLD is used everywhere.
>>>
>>> Pieter
>>>
>>>
>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test-master.cc
Type: text/x-c++src
Size: 1013 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20070524/86cbd86a/attachment.cc>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test-slave.cc
Type: text/x-c++src
Size: 776 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20070524/86cbd86a/attachment-0001.cc>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test-worker.cc
Type: text/x-c++src
Size: 1326 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20070524/86cbd86a/attachment-0002.cc>