[Mochi-devel] [EXTERNAL] [SSG] pmix initialization failure

Sim, Hyogi simh at ornl.gov
Wed Sep 16 09:09:08 CDT 2020


Hi Shane,

You are correct that ssg_group_create_pmix() is failing. I heard from another coworker that he avoided using pmix because he frequently observed unreliable behavior from it. I am not yet sure what exactly causes the problem.

As for the application, it is like a service daemon and spawns exactly one process per node. I can probably test with another group creation function (possibly the MPI one) and see whether it still fails.

Thanks,
Hyogi



> On Sep 15, 2020, at 6:27 PM, Snyder, Shane <ssnyder at mcs.anl.gov> wrote:
> 
> Hi Hyogi,
> 
> Thanks for the heads up!
> 
> What exactly fails when you run the code snippet you shared? Is it the ssg_group_create_pmix() call? That would make the most sense but just making sure it's not something else like the Margo or SSG init calls. I don't see anything obviously wrong in your code.
> 
> I'm currently investigating some other errors that result in sporadic hangs or crashes on Summit when using SSG to launch many processes on a node (e.g., starting 64 SSG processes on one Summit node) -- maybe this is just another variation of that particular bug (though I'm using verbs and you're using tcp)? It would help if you could confirm exactly how many processes and nodes it takes to trigger the problem so I can try to reproduce it. Your original mail mentions 512 nodes, but it's not clear whether you're starting just a single process on each node or more than that. If possible, it would be nice to simplify the reproducer so we don't need so many nodes to debug.
> 
> --Shane
> From: mochi-devel <mochi-devel-bounces at lists.mcs.anl.gov> on behalf of Sim, Hyogi <simh at ornl.gov>
> Sent: Tuesday, September 15, 2020 10:59 AM
> To: mochi-devel at lists.mcs.anl.gov <mochi-devel at lists.mcs.anl.gov>
> Subject: [Mochi-devel] [SSG] pmix initialization failure
>  
> Hi,
> 
> I am initializing an SSG group using pmix on Summit at . The initialization works as expected, but only up to a certain number of compute nodes (~ 256 nodes). The group initialization always seems to fail with 512+ nodes. Assuming that ssg itself has been tested at a larger scale, I am wondering if you see any obvious problems in my code below.
> 
> For ssg_config and ssg_group_update_cb, I just copied directly from the mochi documentation (https://mochi.readthedocs.io/en/latest/ssg/05_create_pmix.html). I am using v0.4.1.
> 
> ===
> 
> static int comm_init(void)
> {
>     int ret = 0;
>     int rank = 0;
>     int nranks = 0;
>     pmix_proc_t proc;
>     margo_instance_id mid;
>     ssg_group_id_t gid;
> 
>     __debug("initialize the communication");
> 
>     ret = PMIx_Init(&proc, NULL, 0);
>     if (ret != PMIX_SUCCESS) {
>         __error("PMIx_Init failed: %s", PMIx_Error_string(ret));
>         return ret;
>     }
> 
>     __debug("pmix initialized");
> 
>     mid = margo_init("ofi+tcp://", MARGO_SERVER_MODE, 1, 4);
>     if (mid == MARGO_INSTANCE_NULL) {
>         __error("failed to initialize margo");
>         return EIO;
>     }
> 
>     __debug("margo initialized");
> 
>     ret = ssg_init();
>     if (ret != SSG_SUCCESS) {
>         __error("ssg_init() failed");
>         return ret;
>     }
> 
>     gid = ssg_group_create_pmix(mid, "servergroup", proc,
>                                 &ssg_config, ssg_group_update_cb, NULL);
>     if (gid == SSG_GROUP_ID_INVALID) {
>         __error("ssg_group_create_pmix() failed");
>         return EIO; /* ret still holds SSG_SUCCESS here, so return an explicit error */
>     }
> 
>     rank = ssg_get_group_self_rank(gid);
>     nranks = ssg_get_group_size(gid);
> 
>     __debug("ssg group (gid=%llu, rank=%d, nranks=%d)",
>             (unsigned long long) gid, (int) rank, nranks);
> 
>     ssg_group_dump(gid);
> 
>     return 0;
> }
> 
> ===
> 
> Thanks,
> Hyogi
> _______________________________________________
> mochi-devel mailing list
> mochi-devel at lists.mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel
> https://www.mcs.anl.gov/research/projects/mochi
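
A general point about error paths like the one after ssg_group_create_pmix() in the quoted snippet: when a call signals failure through a sentinel value (there, SSG_GROUP_ID_INVALID) rather than an integer status, a status variable assigned by an earlier successful call still holds the success code, so the error branch needs to return its own nonzero value. The sketch below illustrates this with stand-in functions only. The names group_id_t, GROUP_ID_INVALID, init_step, create_group, comm_init_stale, and comm_init_checked are hypothetical and are not part of SSG, Margo, or PMIx.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-ins for this sketch only; these are NOT SSG APIs. */
typedef unsigned long long group_id_t;
#define GROUP_ID_INVALID 0ULL
#define STEP_SUCCESS 0

/* Pretend init step that reports its status through an int return code. */
static int init_step(void)
{
    return STEP_SUCCESS;
}

/* Pretend group creation that reports failure through a sentinel id. */
static group_id_t create_group(int should_fail)
{
    return should_fail ? GROUP_ID_INVALID : 42ULL;
}

/* Error path reuses 'ret', which still holds the earlier success code. */
int comm_init_stale(int should_fail)
{
    int ret = init_step();
    if (ret != STEP_SUCCESS)
        return ret;

    group_id_t gid = create_group(should_fail);
    if (gid == GROUP_ID_INVALID) {
        fprintf(stderr, "create_group() failed\n");
        return ret; /* ret == STEP_SUCCESS: the caller sees 0 */
    }

    return 0;
}

/* Error path returns an explicit nonzero code instead. */
int comm_init_checked(int should_fail)
{
    int ret = init_step();
    if (ret != STEP_SUCCESS)
        return ret;

    group_id_t gid = create_group(should_fail);
    if (gid == GROUP_ID_INVALID) {
        fprintf(stderr, "create_group() failed\n");
        return EIO; /* explicit error, independent of 'ret' */
    }

    return 0;
}
```

With a simulated failure, comm_init_stale(1) returns 0 while comm_init_checked(1) returns EIO, so only the second variant lets a caller detect that group creation did not succeed.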


