[mpich-discuss] spawned processes do not shut down
Jonathan Bishop
jbishop.rwc at gmail.com
Tue Nov 1 19:00:13 CDT 2011
Hi,
I have just discovered that calling MPI_Comm_disconnect causes a crash if I
spawn more than one worker. The crash only occurs when the workers are spawned
across the network, not when everything runs on a single machine. Here is the error message...
Assertion failed in file mpid_nem_init.c at line 575: our_pg_rank < pg->size
internal ABORT - process 0
[proxy:2:1 at o14sa01] parse_exec_params (./pm/pmiserv/pmip_cb.c:843): no
executable given and doesn't look like a restart either
[proxy:2:1 at o14sa01] procinfo (./pm/pmiserv/pmip_cb.c:898): unable to
parse argument list
[proxy:2:1 at o14sa01] HYD_pmcd_pmip_control_cmd_cb
(./pm/pmiserv/pmip_cb.c:932): error parsing process info
[proxy:2:1 at o14sa01] HYDT_dmxu_poll_wait_for_event
(./tools/demux/demux_poll.c:77): callback returned error status
[proxy:2:1 at o14sa01] main (./pm/pmiserv/pmip.c:226): demux engine error
waiting for event
[mpiexec at o14sa16] control_cb (./pm/pmiserv/pmiserv_cb.c:215): assert
(!closed) failed
[mpiexec at o14sa16] HYDT_dmxu_poll_wait_for_event
(./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec at o14sa16] HYD_pmci_wait_for_completion
(./pm/pmiserv/pmiserv_pmci.c:181): error waiting for event
[mpiexec at o14sa16] main (./ui/mpich/mpiexec.c:405): process manager
error waiting for completion
Again, all is well if I spawn only 1 worker, or if I do not pass a
machine file to mpiexec (in which case all processes run on the same
host as mpiexec).
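For reference, the failing case is an ordinary machine-file launch. The command
line below is only illustrative (the executable name is a placeholder), but it
is along the lines of

    mpiexec -f machinefile -n 1 ./spawn_test

where machinefile just lists the hosts one per line (o14sa01, o14sa16, and so
on in my case) and ./spawn_test is the compiled program below.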
The program below is similar to the previous one in this thread, but I
have added a constant to control the number of workers spawned, and
also added mpi_send and mpi_recv functions to simplify the code.
Simply run the program and enter the following at the prompt:

start    // spawns NWORKER worker processes
<various strings, each of which should be echoed back by the workers>
stop     // all workers end
start    // crashes (at least on my network)
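For what it's worth, the same sequence can be driven without console input.
Here is an untested sketch that would replace the master's interactive loop
(it reuses the mpi_send/mpi_recv helpers and NWORKER from the program below)
and just does start -> echo -> stop -> start:

    for (int cycle = 0; cycle < 2; cycle++) { // the second spawn should hit the same crash
        MPI_Comm intercom;
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, NWORKER, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &intercom, MPI_ERRCODES_IGNORE);
        for (int w = 0; w < NWORKER; w++) {
            mpi_send(w, "hello", intercom); // one round trip, like typing a string
            cout << "worker " << w << " returned " << mpi_recv(w, intercom) << endl;
        }
        for (int w = 0; w < NWORKER; w++) mpi_send(w, "stop", intercom); // workers disconnect and exit
        MPI_Comm_disconnect(&intercom);
    }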
Thanks in advance for any help you can give.
Jon
=========================START OF PROGRAM=============================
#include <sys/types.h>
#include <unistd.h>
#include <cstring>
#include <iostream>
#include <string>
#include "mpi.h"
using namespace std;

const int BUFSIZE = 1000;
const int NWORKER = 2; // No crash if set to 1

// -------------------------------------------------------------------------------------------------
// Send a std::string to rank dst over comm.
static void mpi_send(int dst, const string& s, MPI_Comm comm) {
    char buf[BUFSIZE];
    strcpy(buf, s.c_str());
    MPI_Send(buf, s.size(), MPI_CHAR, dst, 0, comm);
}

// -------------------------------------------------------------------------------------------------
// Receive a message from rank src over comm and return it as a std::string.
static string mpi_recv(int src, MPI_Comm comm) {
    char buf[BUFSIZE];
    MPI_Status status;
    MPI_Recv(buf, BUFSIZE, MPI_CHAR, src, MPI_ANY_TAG, comm, &status);
    int count;
    MPI_Get_count(&status, MPI_CHAR, &count);
    buf[count] = 0; // null-terminate before converting to std::string
    return buf;
}

// -------------------------------------------------------------------------------------------------
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);

    // Master: interactive command loop
    if (parent == MPI_COMM_NULL) {
        MPI_Comm intercom = MPI_COMM_NULL;
        while (1) {
            cout << "Enter: ";
            string s;
            cin >> s;
            if (s == "start") {
                if (intercom != MPI_COMM_NULL) { cout << "already started" << endl; continue; }
                MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, NWORKER, MPI_INFO_NULL,
                               0, MPI_COMM_SELF, &intercom, MPI_ERRCODES_IGNORE);
                continue;
            }
            if (s == "stop") {
                if (intercom == MPI_COMM_NULL) { cout << "worker not running" << endl; continue; }
                for (int w = 0; w < NWORKER; w++) mpi_send(w, s, intercom);
                MPI_Comm_disconnect(&intercom);
                intercom = MPI_COMM_NULL;
                continue;
            }
            if (s == "exit") {
                if (intercom != MPI_COMM_NULL) { cout << "need to stop before exit" << endl; continue; }
                break;
            }
            if (intercom == MPI_COMM_NULL) { cout << "need to start" << endl; continue; }
            for (int w = 0; w < NWORKER; w++) { // echo the string off every worker
                mpi_send(w, s, intercom);
                string t = mpi_recv(w, intercom);
                cout << "worker " << w << " returned " << t << endl;
            }
        }
    }

    // Worker: echo strings back to the master until told to stop
    if (parent != MPI_COMM_NULL) {
        while (1) {
            string s = mpi_recv(0, parent);
            if (s == "stop") { MPI_Comm_disconnect(&parent); break; }
            mpi_send(0, s, parent);
        }
    }
    MPI_Finalize();
}