[mpich-discuss] failure on respawn
Dave Goodell
goodell at mcs.anl.gov
Fri Nov 4 11:09:54 CDT 2011
I don't really have any suggestions for this particular error. It looks like a bug in MPICH2, either in the process manager or (more likely) in the nemesis initialization process. You're welcome to file a bug report in trac and attach the test program, although we're unlikely to be able to take a look at it right away.
However, in general I recommend staying away from MPI dynamic processes. The feature is little used and therefore poorly tested in all MPI implementations. Furthermore, there are a few known issues in MPICH2's implementation of dynamic processes. Finally, there's no good way to use codes like this in a batch-scheduled environment, so if you ever want to scale up to a large machine, you'll likely have to rewrite your code entirely to launch a fixed set of processes up front.
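For illustration only, here is a minimal sketch of the static alternative this usually means: start the master and all workers together under mpiexec and separate the roles with MPI_Comm_split instead of spawning. The rank-0-as-master convention below is an assumption, not something your code has to follow:

#include <iostream>
#include "mpi.h"

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Assumption: rank 0 plays the master role, all other ranks are workers.
    int color = (rank == 0) ? 0 : 1;
    MPI_Comm group; // master-only or worker-only communicator
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &group);

    if (rank == 0) {
        // master logic; talk to workers directly over MPI_COMM_WORLD
    } else {
        // worker logic; use 'group' for worker-only communication
    }

    MPI_Comm_free(&group);
    MPI_Finalize();
    return 0;
}

Launched as, e.g., "mpiexec -n 11 ./a.out" (one master plus ten workers), this fits a batch scheduler, which allocates the fixed process count up front.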
-Dave
On Nov 4, 2011, at 10:55 AM CDT, Jonathan Bishop wrote:
> Any takers for this one?
>
> On Wed, Nov 2, 2011 at 9:50 AM, Jonathan Bishop <jbishop.rwc at gmail.com> wrote:
> Hi,
>
> Here is a short program which shows an MPI crash when multiple MPI_Comm_spawn calls are made. Previously, it was found that it is necessary to call MPI_Comm_disconnect from both the worker and master processes to make sure that the spawned processes actually die. Unfortunately, this new crash may be related to that fix: if I remove the disconnects, the crash disappears.
>
> Here is the crash message...
>
> Fatal error in MPI_Init: Other MPI error, error stack:
> MPIR_Init_thread(392).................:
> MPID_Init(139)........................: channel initialization failed
> MPIDI_CH3_Init(38)....................:
> MPID_nem_init(196)....................:
> MPIDI_CH3I_Seg_commit(366)............:
> MPIU_SHMW_Hnd_deserialize(324)........:
> MPIU_SHMW_Seg_open(863)...............:
> MPIU_SHMW_Seg_create_attach_templ(637): open failed - No such file or directory
>
> <repeated a number of times>
>
> Thanks,
>
> Jon
>
>
> #include <iostream>
> #include "mpi.h"
>
> using namespace std;
>
> const int BUFSIZE = 1000;
> const int NWORKER = 10;
> const int NPASS = 10;
>
> int main(int argc, char **argv)
> {
>     MPI_Init(&argc, &argv);
>     MPI_Comm parent;
>     MPI_Comm_get_parent(&parent);
>
>     // Master: spawn a fresh set of workers on every pass, then disconnect.
>     if (parent == MPI_COMM_NULL) {
>         for (int i = 0; i < NPASS; i++) {
>             cout << "pass " << i << " =============" << endl;
>             MPI_Comm intercom = MPI_COMM_NULL;
>             cout << "spawn " << NWORKER << endl;
>             MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, NWORKER, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercom, MPI_ERRCODES_IGNORE);
>             // Send each worker a zero-length message telling it to stop.
>             for (int worker = 0; worker < NWORKER; worker++) {
>                 cout << "stop " << worker << endl;
>                 char buf[BUFSIZE];
>                 MPI_Send(buf, 0, MPI_CHAR, worker, 0, intercom);
>             }
>             cout << "disconnect" << endl;
>             MPI_Comm_disconnect(&intercom);
>             intercom = MPI_COMM_NULL;
>         }
>     }
>
>     // Worker: wait for the stop message, then disconnect from the parent.
>     if (parent != MPI_COMM_NULL) {
>         char buf[BUFSIZE];
>         MPI_Status status;
>         MPI_Recv(buf, BUFSIZE, MPI_CHAR, 0, MPI_ANY_TAG, parent, &status);
>         MPI_Comm_disconnect(&parent);
>     }
>
>     MPI_Finalize();
>
>     return 0;
> }
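>
> (For reproduction, this was presumably built and run with MPICH2's standard wrappers; the source file name here is made up:)
>
> mpicxx spawn_test.cpp -o spawn_test
> mpiexec -n 1 ./spawn_test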
>
>
> _______________________________________________
> mpich-discuss mailing list mpich-discuss at mcs.anl.gov
> To manage subscription options or unsubscribe:
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss