[MPICH] Problem probing message from dynamically spawned process
Pieter Thysebaert
pieter.thysebaert at intec.ugent.be
Thu May 24 04:26:23 CDT 2007
Hello,
again following up to my own post:
I stand corrected, as an upgrade to the latest mpich2 version (1.0.5p4)
seems to have fixed my problem as far as I can tell!
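As a sanity check before and after an upgrade like this, the standard
MPI_Get_version call reports the MPI level a binary is built against,
and MPICH2's mpi.h should also define a release string; the sketch
below prints it only under #ifdef MPICH2_VERSION since I'd double-check
that macro name against your own headers:

#include "mpi.h"
#include <iostream>

using namespace std;

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    // MPI standard level implemented by the library, e.g. 2.0
    int version, subversion;
    MPI_Get_version(&version, &subversion);
    cout << "MPI standard " << version << "." << subversion << endl;

#ifdef MPICH2_VERSION
    // Library release string, e.g. "1.0.5p4"
    cout << "MPICH2 " << MPICH2_VERSION << endl;
#endif

    MPI_Finalize();
    return 0;
}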
Thx,
Pieter
Pieter Thysebaert wrote:
> Hello,
>
> I've attached the three programs, test-master, test-worker and
> test-slave (P1, P2 and P3).
> Test-master spawns test-worker and then starts an infinite MPI_Iprobe
> loop listening for test-worker messages. One message is sent from
> test-worker to test-master.
> Similarly, test-worker spawns test-slave and starts an MPI_Iprobe loop.
> One message is sent from test-slave to test-worker. It can be received
> in test-worker by calling the appropriate MPI_Recv directly, but as it
> is, this message is never detected by MPI_Iprobe in test-worker (thus,
> test-worker stalls in an infinite loop and the system never returns).
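>
> (A useful way to narrow this down, as a sketch against the attached
> test-worker: swap its Iprobe loop for a blocking MPI_Probe, which
> separates "the message never arrives" from "MPI_Iprobe never reports
> it":
>
>     MPI_Status s;
>     // Blocks until a TAG_TEST message from remote rank 0 of
>     // slaveComm is matchable. If this returns while the Iprobe loop
>     // spins forever, the message does arrive and only the
>     // non-blocking probe misses it.
>     MPI_Probe(0, TAG_TEST, slaveComm, &s);
>     int data;
>     MPI_Recv(&data, 1, MPI_INT, 0, TAG_TEST, slaveComm, &s);
>
> Since a direct MPI_Recv already works here, MPI_Probe would be
> expected to return as well, pointing at the non-blocking probe path.)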
>
> I compile these programs with mpicxx, which on my Debian Etch AMD64
> system reports itself as "mpicxx for 1.0.5" and wraps this compiler:
> Using built-in specs.
> Target: x86_64-linux-gnu
> Configured with: ../src/configure -v
> --enable-languages=c,c++,fortran,objc,obj-c++,treelang --prefix=/usr
> --enable-shared --with-system-zlib --libexecdir=/usr/lib
> --without-included-gettext --enable-threads=posix --enable-nls
> --program-suffix=-4.1 --enable-__cxa_atexit --enable-clocale=gnu
> --enable-libstdcxx-debug --enable-mpfr --enable-checking=release
> x86_64-linux-gnu
> Thread model: posix
> gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
>
> The master and worker contain hardcoded paths ("/home/pthyseba/...") to
> the worker and slave executables, so they will need to be changed when
> running on another system.
>
> Anyway, mpdtrace -l on my machine returns:
> fleming_39205 (<my_ip>)
>
> "mpiexec -n 1 test-master" outputs things like:
>
> Master: message from worker detected after 85632 probes
> Slave started, contacting worker
>
> and then the program hangs.
>
> If I remove the infinite loop in test-worker and replace it by a single
> MPI_Recv operation to capture the test message sent by test-slave, it
> simply works:
>
> Master: message from worker detected after 2302235 probes
> Slave started, contacting worker
> Worker: received data from slave: 1
>
> and the program exits cleanly.
>
> This has been puzzling me for over a week, and I am at a complete loss
> as to what I am doing wrong, or what is wrong with my system.
>
> My MPICH2 version is 1.0.5, and I compiled it from source packages at
> http://www-unix.mcs.anl.gov/mpi/mpich/.
> I'm currently downloading 1.0.5p4, but am unsure whether that upgrade
> will do anything for my problem.
>
>
> Any and all suggestions are more than welcome!
>
> Pieter
>
>
>
>
>
> Rajeev Thakur wrote:
>
>> It should work. Can you send us a test program that we can use to reproduce
>> the problem?
>>
>> Rajeev
>>
>>
>>
>>> -----Original Message-----
>>> From: owner-mpich-discuss at mcs.anl.gov
>>> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Pieter
>>> Thysebaert
>>> Sent: Wednesday, May 23, 2007 8:02 AM
>>> To: mpich-discuss at mcs.anl.gov
>>> Subject: Re: [MPICH] Problem probing message from dynamically
>>> spawned process
>>>
>>> Following up to my original post:
>>>
>>> The difficulties I'm seeing occur when the master process itself is
>>> MPI_Comm_spawn-ed by another process.
>>> So I seem to be having problems with nested MPI_Comm_spawn calls....
>>> Is nesting MPI_Comm_spawn supposed to be a functional and supported
>>> operation?
>>>
>>>
>>> So to summarize:
>>>
>>> when I only have 2 processes, P1 and P2, and P1 MPI_Comm_spawn-s P2, I
>>> can implement a message loop in P1 (using MPI_Iprobe).
>>>
>>> When I have 3 processes, with P1 spawning P2 and P2 spawning P3, I
>>> can implement a message loop in P1 listening for messages from P2,
>>> and I can also send data from P3 to P2, BUT MPI_Iprobe() in P2,
>>> testing for P3 messages, always returns false, preventing me from
>>> implementing a similar message loop in P2 (listening for P3
>>> messages).
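>>>
>>> For concreteness, the shape of the loop I'm after in P2 (a sketch;
>>> masterComm and slaveComm are named as in my test programs, and
>>> single-int payloads are assumed) polls both intercommunicators in
>>> turn:
>>>
>>>     for (;;) {
>>>         int flag = 0;
>>>         MPI_Status s;
>>>         // any traffic from the parent (P1)?
>>>         MPI_Iprobe(0, MPI_ANY_TAG, masterComm, &flag, &s);
>>>         if (flag) {
>>>             int data;
>>>             MPI_Recv(&data, 1, MPI_INT, 0, s.MPI_TAG, masterComm, &s);
>>>         }
>>>         // any traffic from the spawned child (P3)?
>>>         MPI_Iprobe(0, MPI_ANY_TAG, slaveComm, &flag, &s);
>>>         if (flag) {
>>>             int data;
>>>             MPI_Recv(&data, 1, MPI_INT, 0, s.MPI_TAG, slaveComm, &s);
>>>         }
>>>     }
>>>
>>> It is the second Iprobe, on slaveComm, that never raises flag.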
>>>
>>>
>>> Is there some race condition or unsupported feature (or
>>> blatant misuse
>>> of the MPI API) I'm unaware of?
>>>
>>> Thanks,
>>> Pieter
>>>
>>>
>>>
>>> Pieter Thysebaert wrote:
>>>
>>>
>>>> Hello,
>>>>
>>>> I'm using MPICH2 1.0.5 on Debian Etch AMD64 (mpd daemon). I'm trying
>>>> to implement a Master / Worker architecture, where the master can
>>>> dynamically spawn additional workers (using MPI_Comm_spawn).
>>>>
>>>> Ultimately, I want the master to listen to its workers using a loop
>>>> with MPI_Iprobe statements to process incoming messages. However,
>>>> when testing my initial efforts, I have stumbled over a peculiar
>>>> situation in which the Master can (seemingly) receive a worker's
>>>> test message, but cannot Iprobe for it.
>>>>
>>>> In my testing, the spawned Workers run on the same machine as the
>>>> Master.
>>>>
>>>> Assume the Worker (residing in an executable called "Worker") looks
>>>> like this:
>>>>
>>>> int main(int argc, char** argv) {
>>>>     MPI_Comm Master;
>>>>
>>>>     MPI_Init(&argc, &argv);
>>>>     MPI_Comm_get_parent(&Master);
>>>>     if (Master == MPI_COMM_NULL) {
>>>>         cerr << "No parent Master!" << endl;
>>>>         return 1;
>>>>     }
>>>>
>>>>     int size;
>>>>     MPI_Comm_remote_size(Master, &size);
>>>>     if (size != 1) {
>>>>         cerr << "Parent Master doesn't have size 1" << endl;
>>>>         return 1;
>>>>     }
>>>>
>>>>     // Test: send a test message to the Master (rank 0 of the
>>>>     // remote group on this intercommunicator)
>>>>     int test = 37;
>>>>     MPI_Send(&test, 1, MPI_INT, 0, TAG_TEST, Master);
>>>>     // Rest of code
>>>> }
>>>>
>>>>
>>>> And the Master begins as:
>>>>
>>>> int main(int argc, char** argv) {
>>>>     MPI_Init(&argc, &argv);
>>>>
>>>>     MPI_Comm workerComm;
>>>>     MPI_Info ourInfo;
>>>>     MPI_Info_create(&ourInfo);
>>>>
>>>>     // Spawn Worker
>>>>     MPI_Comm_spawn("Worker", MPI_ARGV_NULL, 1, ourInfo, 0,
>>>>                    MPI_COMM_SELF, &workerComm, MPI_ERRCODES_IGNORE);
>>>>
>>>>     // Test: wait for the test message from the Worker
>>>>     MPI_Status s;
>>>>     for (;;) {
>>>>         int flag = 0;
>>>>         int result = MPI_Iprobe(0, TAG_TEST, workerComm, &flag, &s);
>>>>         cout << "MPI_Iprobe: result is " << result << ", flag is "
>>>>              << flag << endl;
>>>>         if (flag > 0)
>>>>             break;
>>>>     }
>>>>
>>>>     int test;
>>>>     MPI_Recv(&test, 1, MPI_INT, 0, TAG_TEST, workerComm, &s);
>>>>     cout << "BlackBoard: Have received test, data is " << test << endl;
>>>> }
>>>>
>>>>
>>>> What happens when running this architecture (mpiexec -n 1 Master)
>>>> is that the Master never leaves its for loop (probing for messages
>>>> from the Worker, flag and result equal 0 forever; according to my
>>>> docs, flag should become 1 when a message is available), even if I
>>>> let it run for a long time.
>>>>
>>>> However, when I remove the for loop in the Master and immediately
>>>> proceed to MPI_Recv() of the TAG_TEST message, all goes well (i.e.
>>>> the message is received by the master and both master and worker
>>>> continue).
>>>>
>>>> What am I doing wrong or not understanding correctly?
>>>>
>>>> The message send/receive and probing works fine on this same
>>>> machine when two processes are started with mpiexec -n 2 (and thus
>>>> have ranks 0 and 1 in the same MPI_COMM_WORLD) and MPI_COMM_WORLD
>>>> is used everywhere.
>>>>
>>>> Pieter
>
>
> ------------------------------------------------------------------------
>
> #include "mpi.h"
> #include <iostream>
>
> using namespace std;
>
> #define TAG_TEST 500
>
> int main(int argc, char** argv) {
> int commSize, commRank, result;
>
> result = MPI_Init(&argc, &argv);
> if (result != MPI_SUCCESS) {
> cerr << "Error initializing MPI application!" << endl;
> MPI_Abort(MPI_COMM_WORLD, -1);
> }
>
> MPI_Comm_size(MPI_COMM_WORLD, &commSize);
> MPI_Comm_rank(MPI_COMM_WORLD, &commRank);
>
> int flag;
> MPI_Info ourInfo;
> MPI_Info_create(&ourInfo);
>
> MPI_Comm workerComm;
> MPI_Comm_spawn("/home/pthyseba/workspace/OCTOPUS/test-worker", MPI_ARGV_NULL, 1, ourInfo, 0, MPI_COMM_SELF, &workerComm, MPI_ERRCODES_IGNORE);
> for (int i = 0; ; i++) {
> MPI_Status s;
> MPI_Iprobe(0, TAG_TEST, workerComm, &flag, &s);
> if (flag > 0) {
> cout << "Master: message from worker detected after " << i << " probes" << endl;
> int data;
> MPI_Recv(&data, 1, MPI_INT, 0, TAG_TEST, workerComm, &s);
> break;
> }
> }
>
> MPI_Info_free(&ourInfo);
> MPI_Finalize();
> return 0;
> }
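>
> (An aside on the loop above: the probe counts in the output show it
> spinning flat out. For anything beyond a minimal test, a short sleep
> between probes keeps the poll from burning a full core; a sketch,
> using usleep from <unistd.h>, included at the top of the file:
>
>     for (int i = 0; ; i++) {
>         MPI_Status s;
>         MPI_Iprobe(0, TAG_TEST, workerComm, &flag, &s);
>         if (flag > 0)
>             break;
>         usleep(1000);  // back off ~1 ms between probes
>     }
> )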
>
> ------------------------------------------------------------------------
>
> #include "mpi.h"
> #include <iostream>
>
> using namespace std;
>
> #define TAG_TEST 500
>
> int main(int argc, char** argv) {
> int commSize, commRank, result;
>
> result = MPI_Init(&argc, &argv);
> if (result != MPI_SUCCESS) {
> cerr << "Error initializing MPI application!" << endl;
> MPI_Abort(MPI_COMM_WORLD, -1);
> }
>
> MPI_Comm_size(MPI_COMM_WORLD, &commSize);
> MPI_Comm_rank(MPI_COMM_WORLD, &commRank);
>
> MPI_Comm workerComm;
> MPI_Comm_get_parent(&workerComm);
>
> int size;
> MPI_Comm_remote_size(workerComm, &size);
> if (size != 1) {
> cerr << "Parent worker doesn't have size 1" << endl;
> return 1;
> }
>
> cout << "Slave started, contacting worker" << endl;
>
> MPI_Send(&size, 1, MPI_INT, 0, TAG_TEST, workerComm);
> MPI_Finalize();
> return 0;
> }
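>
> (Note on the MPI_Send above: workerComm comes from
> MPI_Comm_get_parent, so it is an intercommunicator, and the
> destination rank 0 names rank 0 of the *remote* group, i.e. the
> spawning test-worker. A quick sanity check, as a sketch:
>
>     int is_inter = 0;
>     MPI_Comm_test_inter(workerComm, &is_inter);
>     // is_inter is 1 here: parent/child communicators from
>     // MPI_Comm_spawn / MPI_Comm_get_parent are intercommunicators
> )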
>
> ------------------------------------------------------------------------
>
> #include "mpi.h"
> #include <iostream>
>
> using namespace std;
>
> #define TAG_TEST 500
>
> int main(int argc, char** argv) {
> int commSize, commRank, result;
>
> result = MPI_Init(&argc, &argv);
> if (result != MPI_SUCCESS) {
> cerr << "Error initializing MPI application!" << endl;
> MPI_Abort(MPI_COMM_WORLD, -1);
> }
>
> MPI_Comm_size(MPI_COMM_WORLD, &commSize);
> MPI_Comm_rank(MPI_COMM_WORLD, &commRank);
>
> int flag;
>
> MPI_Comm masterComm;
> MPI_Comm_get_parent(&masterComm);
>
> int size;
> MPI_Comm_remote_size(masterComm, &size);
> if (size != 1) {
> cerr << "Parent master doesn't have size 1" << endl;
> return 1;
> }
>
> MPI_Send(&size, 1, MPI_INT, 0, TAG_TEST, masterComm);
>
> // spawn slave
> MPI_Info ourInfo;
> MPI_Info_create(&ourInfo);
> MPI_Comm slaveComm;
> MPI_Comm_spawn("/home/pthyseba/workspace/OCTOPUS/test-slave", MPI_ARGV_NULL, 1, ourInfo, 0, MPI_COMM_SELF, &slaveComm, MPI_ERRCODES_IGNORE);
>
> MPI_Status s;
> int data;
> for (int i = 0;; i++) {
> MPI_Iprobe(0, TAG_TEST, slaveComm, &flag, &s);
> if (flag > 0) {
> cout << "Worker: message from slave detected after " << i << " probes" << endl;
> MPI_Recv(&data, 1, MPI_INT, 0, TAG_TEST, slaveComm, &s);
> break;
> }
> }
>
> cout << "Worker: received data from slave: " << data << endl;
>
> MPI_Info_free(&ourInfo);
> MPI_Finalize();
> return 0;
> }
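>
> (A teardown note on all three programs, a sketch rather than something
> the bug needs: MPI_Comm_disconnect waits for pending traffic on a
> dynamic intercommunicator and lets the two sides finalize
> independently, which is tidier than leaving it all to MPI_Finalize:
>
>     // e.g. in test-worker, once the exchange is done:
>     MPI_Comm_disconnect(&slaveComm);   // drop the link to test-slave
>     MPI_Comm_disconnect(&masterComm);  // drop the link to test-master
>     MPI_Finalize();
> )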
>