[Mochi-devel] [EXTERNAL] [SSG] pmix initialization failure

Latham, Robert J. robl at mcs.anl.gov
Wed Sep 16 09:15:18 CDT 2020


On Wed, 2020-09-16 at 14:09 +0000, Sim, Hyogi wrote:
> Hi Shane,
> 
> You are correct that ssg_group_create_pmix() is failing. I heard
> from another coworker that he avoided using pmix because he frequently
> observed unreliable behavior from it. I am not sure yet what exactly
> causes the problem.
> 
> As for the application, it is like a service daemon and spawns
> exactly one process per node. I can probably test with another
> group creation function (possibly with MPI), then see if it still
> fails or not.


We've been investigating SSG behavior at larger scale.  I added a
pointer to your message to this SSG issue:

https://xgitlab.cels.anl.gov/sds/ssg/-/issues/21

==rob

> 
> > On Sep 15, 2020, at 6:27 PM, Snyder, Shane <ssnyder at mcs.anl.gov>
> > wrote:
> > 
> > Hi Hyogi,
> > 
> > Thanks for the heads up!
> > 
> > What exactly fails when you run the code snippet you shared? Is it
> > the ssg_group_create_pmix() call? That would make the most sense
> > but just making sure it's not something else like the Margo or SSG
> > init calls. I don't see anything obviously wrong in your code.
> > 
> > I'm currently investigating some other errors that result in
> > sporadic hangs or crashes on Summit when using SSG to launch many
> > processes on a node (e.g., starting 64 SSG processes on one Summit
> > node) -- maybe this is just another variation of that particular
> > bug (though I'm using verbs and you're using tcp)? It might be
> > helpful if you can confirm exactly how many processes and nodes it
> > takes to trigger the problem so I can try to reproduce. Your
> > original mail mentions 512 nodes, but it's not clear whether you're
> > starting just a single process on each node or more than that, etc.
> > It would be nice if we could simplify the reproducer so we don't
> > need so many nodes to debug, if possible.
> > 
> > --Shane
> > From: mochi-devel <mochi-devel-bounces at lists.mcs.anl.gov> on behalf
> > of Sim, Hyogi <simh at ornl.gov>
> > Sent: Tuesday, September 15, 2020 10:59 AM
> > To: mochi-devel at lists.mcs.anl.gov <mochi-devel at lists.mcs.anl.gov>
> > Subject: [Mochi-devel] [SSG] pmix initialization failure
> >  
> > Hi,
> > 
> > I am initializing an SSG group using pmix on Summit at . The
> > initialization works as expected, but only up to a certain number
> > of compute nodes (~ 256 nodes). The group initialization seems
> > to always fail with 512+ nodes. Assuming that ssg itself has
> > been tested at a larger scale, I am wondering if you see any
> > obvious problems in my code below.
> > 
> > For ssg_config and ssg_group_update_cb, I just copied directly from
> > the mochi documentation (
> > https://mochi.readthedocs.io/en/latest/ssg/05_create_pmix.html). I
> > am using v0.4.1.
> > 
> > ===
> > 
> > static int comm_init(void)
> > {
> >     int ret = 0;
> >     int i = 0;
> >     int rank = 0;
> >     int nranks = 0;
> >     pmix_proc_t proc;
> >     margo_instance_id mid;
> >     ssg_group_id_t gid;
> > 
> >     __debug("initialize the communication");
> > 
> >     ret = PMIx_Init(&proc, NULL, 0);
> >     if (ret != PMIX_SUCCESS) {
> >         __error("PMIx_Init failed: %s", PMIx_Error_string(ret));
> >         return ret;
> >     }
> > 
> >     __debug("pmix initialized");
> > 
> >     mid = margo_init("ofi+tcp://", MARGO_SERVER_MODE, 1, 4);
> >     if (mid == MARGO_INSTANCE_NULL) {
> >         __error("failed to initialize margo");
> >         return EIO;
> >     }
> > 
> >     __debug("margo initialized");
> > 
> >     ret = ssg_init();
> >     if (ret != SSG_SUCCESS) {
> >         __error("ssg_init() failed");
> >         return ret;
> >     }
> > 
> >     gid = ssg_group_create_pmix(mid, "servergroup", proc,
> >                                 &ssg_config, ssg_group_update_cb, NULL);
> >     if (gid == SSG_GROUP_ID_INVALID) {
> >         __error("ssg_group_create_pmix() failed");
> >         /* ret still holds SSG_SUCCESS from ssg_init(); return an error */
> >         return EIO;
> >     }
> > 
> >     rank = ssg_get_group_self_rank(gid);
> >     nranks = ssg_get_group_size(gid);
> > 
> >     __debug("ssg group (gid=%llu, rank=%d, nranks=%d)",
> >             (unsigned long long) gid, (int) rank, nranks);
> > 
> >     ssg_group_dump(gid);
> > 
> >     return 0;
> > }
> > 
> > ===
> > 
> > Thanks,
> > Hyogi
> > _______________________________________________
> > mochi-devel mailing list
> > mochi-devel at lists.mcs.anl.gov
> > https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel
> > https://www.mcs.anl.gov/research/projects/mochi
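
A note on the error path in the quoted comm_init(): group creation
reports failure through a sentinel ID (SSG_GROUP_ID_INVALID) rather than
a status code, so the caller has to supply its own nonzero return value.
The following is a minimal, self-contained sketch of that pattern. All
names in it (group_id_t, GROUP_ID_INVALID, fake_create_ok,
fake_create_fail, create_group_checked) are hypothetical stand-ins so
that the example compiles without SSG; it is not the SSG API, and it
says nothing about why the call fails at 512+ nodes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for illustration only; these are NOT the SSG API. */
typedef uint64_t group_id_t;
#define GROUP_ID_INVALID ((group_id_t)0)

/* Signature of a group-creation routine that signals failure via a sentinel. */
typedef group_id_t (*group_create_fn)(const char *name);

/* Stub creators that simulate a successful and a failed creation. */
static group_id_t fake_create_ok(const char *name)
{
    (void)name;
    return 42;
}

static group_id_t fake_create_fail(const char *name)
{
    (void)name;
    return GROUP_ID_INVALID;
}

/*
 * Create a group and translate the sentinel failure into an errno-style code.
 * Returns 0 on success and EIO on failure; *gid_out is written only on success.
 */
static int create_group_checked(group_create_fn create, const char *name,
                                group_id_t *gid_out)
{
    assert(create != NULL && gid_out != NULL);

    group_id_t gid = create(name);
    if (gid == GROUP_ID_INVALID) {
        fprintf(stderr, "group creation failed for \"%s\"\n", name);
        /* Return a fixed error instead of reusing an earlier status value. */
        return EIO;
    }

    *gid_out = gid;
    return 0;
}
```

In comm_init() the same idea would amount to returning a fixed error
such as EIO whenever gid == SSG_GROUP_ID_INVALID, so that callers can
tell a failed group creation apart from success.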

