[mpich-discuss] failure on respawn

Dave Goodell goodell at mcs.anl.gov
Fri Nov 4 11:09:54 CDT 2011


I don't really have any suggestions for this particular error.  It looks like a bug in MPICH2, either in the process manager or (more likely) in the nemesis initialization process.  You're welcome to file a bug report in trac and attach the test program, although we're unlikely to be able to take a look at it right away.

However, in general I recommend staying away from MPI dynamic processes.  They are little used and therefore poorly tested in all MPI implementations.  Furthermore, there are a few known issues in MPICH2's implementation of dynamic processes.  Finally, there's no good way to use codes like this in a batch-scheduled environment, so if you ever want to scale up to a large machine then you'll likely have to rewrite your code entirely.

-Dave

On Nov 4, 2011, at 10:55 AM CDT, Jonathan Bishop wrote:

> Any takers for this one?
> 
> On Wed, Nov 2, 2011 at 9:50 AM, Jonathan Bishop <jbishop.rwc at gmail.com> wrote:
> Hi,
> 
> Here is a short program that crashes in MPI when MPI_Comm_spawn is called multiple times. Previously, it was found that both the worker and master processes must call MPI_Comm_disconnect to make sure that the spawned processes actually die. Unfortunately, this second issue may be related to that fix: if I remove the disconnects, the crash disappears.
> 
> Here is the crash message...
> 
> Fatal error in MPI_Init: Other MPI error, error stack:
> MPIR_Init_thread(392).................: 
> MPID_Init(139)........................: channel initialization failed
> MPIDI_CH3_Init(38)....................: 
> MPID_nem_init(196)....................: 
> MPIDI_CH3I_Seg_commit(366)............: 
> MPIU_SHMW_Hnd_deserialize(324)........: 
> MPIU_SHMW_Seg_open(863)...............: 
> MPIU_SHMW_Seg_create_attach_templ(637): open failed - No such file or directory
> 
> <repeated a number of times>
> 
> Thanks,
> 
> Jon
> 
> 
> #include <iostream>
> #include "mpi.h"
> 
> using namespace std;
> 
> const int BUFSIZE = 1000;
> const int NWORKER = 10;
> const int NPASS = 10;
> 
> int main(int argc, char **argv)
> {
>   MPI_Init(&argc, &argv);
>   MPI_Comm parent;
>   MPI_Comm_get_parent(&parent);
> 
>   // Master
>   if (parent == MPI_COMM_NULL) {
>     for (int i = 0; i < NPASS; i++) {
>       cout << "pass " << i << " =============" << endl;
>       MPI_Comm intercom = MPI_COMM_NULL;
>       cout << "spawn " << NWORKER << endl;
>       MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, NWORKER, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercom, MPI_ERRCODES_IGNORE);
>       for (int worker = 0; worker < NWORKER; worker++) {
>         cout << "stop " << worker << endl;
>         char buf[BUFSIZE];
>         MPI_Send(buf, 0, MPI_CHAR, worker, 0, intercom);
>       }
>       cout << "disconnect" << endl;
>       MPI_Comm_disconnect(&intercom);
>       intercom = MPI_COMM_NULL;
>     }
>   }
> 
>   // Worker
>   if (parent != MPI_COMM_NULL) {
>     char buf[BUFSIZE];
>     MPI_Status status;
>     MPI_Recv(buf, BUFSIZE, MPI_CHAR, 0, MPI_ANY_TAG, parent, &status);
>     MPI_Comm_disconnect(&parent);
>   }
> 
>   MPI_Finalize();
> 
>   return 0;
> }
> 