[Mochi-devel] [EXTERNAL] [SSG] pmix initialization failure

Sim, Hyogi simh at ornl.gov
Wed Sep 16 10:25:39 CDT 2020


Thanks, Rob.

I have just tested with ssg_group_create_mpi(), and it works fine (tested up to 1024 nodes, ppn=1). It seems like the problem is with pmix.

Best,
Hyogi



> On Sep 16, 2020, at 10:15 AM, Latham, Robert J. <robl at mcs.anl.gov> wrote:
> 
> On Wed, 2020-09-16 at 14:09 +0000, Sim, Hyogi wrote:
>> Hi Shane,
>> 
>> You are correct that ssg_group_create_pmix() is failing. I heard
>> from another coworker that he avoided using pmix because he frequently
>> observed unreliable behavior from pmix. I am not sure yet what exactly
>> causes the problem.
>> 
>> As for the application, it is like a service daemon and spawns
>> exactly one process per node. I can probably test with another
>> group creation function (possibly the MPI one), and then see whether
>> it still fails.
> 
> 
> We've been investigating SSG behavior at larger scale.  I added a
> pointer to your message to this SSG issue:
> 
> https://xgitlab.cels.anl.gov/sds/ssg/-/issues/21
> 
> ==rob
> 
>> 
>>> On Sep 15, 2020, at 6:27 PM, Snyder, Shane <ssnyder at mcs.anl.gov>
>>> wrote:
>>> 
>>> Hi Hyogi,
>>> 
>>> Thanks for the heads up!
>>> 
>>> What exactly fails when you run the code snippet you shared? Is it
>>> the ssg_group_create_pmix() call? That would make the most sense
>>> but just making sure it's not something else like the Margo or SSG
>>> init calls. I don't see anything obviously wrong in your code.
>>> 
>>> I'm currently investigating some other errors that result in
>>> sporadic hangs or crashes on Summit when using SSG to launch many
>>> processes on a node (e.g., starting 64 SSG processes on one Summit
>>> node) -- maybe this is just another variation of that particular
>>> bug (though I'm using verbs and you're using tcp)? It might be
>>> helpful if you can confirm exactly how many processes and nodes it
>>> takes to trigger the problem so I can try to reproduce. Your
>>> original mail mentions 512 nodes, but it's not clear whether you're
>>> starting just a single process on each node or more than that, etc.
>>> It would be nice if we could simplify the reproducer so we don't
>>> need so many nodes to debug, if possible.
>>> 
>>> --Shane
>>> From: mochi-devel <mochi-devel-bounces at lists.mcs.anl.gov> on behalf
>>> of Sim, Hyogi <simh at ornl.gov>
>>> Sent: Tuesday, September 15, 2020 10:59 AM
>>> To: mochi-devel at lists.mcs.anl.gov <mochi-devel at lists.mcs.anl.gov>
>>> Subject: [Mochi-devel] [SSG] pmix initialization failure
>>> 
>>> Hi,
>>> 
>>> I am initializing an SSG group using pmix on Summit at . The
>>> initialization works as expected, but only up to a certain number
>>> of compute nodes (~ 256 nodes). The group initialization always
>>> seems to fail with 512+ nodes. Assuming that ssg itself has
>>> been tested at a larger scale, I am wondering if you see any
>>> obvious problems in my code below.
>>> 
>>> For ssg_config and ssg_group_update_cb, I just copied directly from
>>> the mochi documentation (
>>> https://mochi.readthedocs.io/en/latest/ssg/05_create_pmix.html). I
>>> am using v0.4.1.
>>> 
>>> ===
>>> 
>>> static int comm_init(void)
>>> {
>>>    int ret = 0;
>>>    int i = 0;
>>>    int rank = 0;
>>>    int nranks = 0;
>>>    pmix_proc_t proc;
>>>    margo_instance_id mid;
>>>    ssg_group_id_t gid;
>>> 
>>>    __debug("initialize the communication");
>>> 
>>>    ret = PMIx_Init(&proc, NULL, 0);
>>>    if (ret != PMIX_SUCCESS) {
>>>        __error("PMIx_Init failed: %s", PMIx_Error_string(ret));
>>>        return ret;
>>>    }
>>> 
>>>    __debug("pmix initialized");
>>> 
>>>    mid = margo_init("ofi+tcp://", MARGO_SERVER_MODE, 1, 4);
>>>    if (mid == MARGO_INSTANCE_NULL) {
>>>        __error("failed to initialize margo");
>>>        return EIO;
>>>    }
>>> 
>>>    __debug("margo initialized");
>>> 
>>>    ret = ssg_init();
>>>    if (ret != SSG_SUCCESS) {
>>>        __error("ssg_init() failed");
>>>        return ret;
>>>    }
>>> 
>>>    gid = ssg_group_create_pmix(mid, "servergroup", proc,
>>>                                &ssg_config, ssg_group_update_cb, NULL);
>>>    if (gid == SSG_GROUP_ID_INVALID) {
>>>        __error("ssg_group_create_pmix() failed");
>>>        return EIO; /* ret is still SSG_SUCCESS from ssg_init() here */
>>>    }
>>> 
>>>    rank = ssg_get_group_self_rank(gid);
>>>    nranks = ssg_get_group_size(gid);
>>> 
>>>    __debug("ssg group (gid=%llu, rank=%d, nranks=%d)",
>>>            (unsigned long long) gid, (int) rank, nranks);
>>> 
>>>    ssg_group_dump(gid);
>>> 
>>>    return 0;
>>> }
>>> 
>>> ===
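
A side note on error handling in a function shaped like comm_init(): a call that reports failure through an invalid handle (the way ssg_group_create_pmix() does with SSG_GROUP_ID_INVALID) never touches an integer status variable, so the error branch should return an explicit code instead of a status left over from an earlier successful call. Below is a minimal standalone sketch of that pitfall. Every name in it (stub_init, stub_group_create, group_id_t, GROUP_ID_INVALID, STATUS_SUCCESS) is a hypothetical stand-in made up for illustration and is not part of the SSG, Margo, or PMIx APIs.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-ins for illustration only; not the SSG API. */
typedef unsigned long long group_id_t;
#define GROUP_ID_INVALID 0ULL
#define STATUS_SUCCESS 0

/* Reports failure through its integer return value. */
int stub_init(void)
{
    return STATUS_SUCCESS;
}

/* Reports failure through an invalid handle, never through an int. */
group_id_t stub_group_create(int should_fail)
{
    return should_fail ? GROUP_ID_INVALID : 42ULL;
}

/* Pitfall: 'ret' still holds the success code from stub_init(). */
int setup_stale_status(int should_fail)
{
    int ret = stub_init();
    if (ret != STATUS_SUCCESS)
        return ret;

    group_id_t gid = stub_group_create(should_fail);
    if (gid == GROUP_ID_INVALID) {
        fprintf(stderr, "group creation failed\n");
        return ret; /* evaluates to 0, so the caller sees success */
    }

    return 0;
}

/* Alternative: return an explicit error code on the handle failure. */
int setup_explicit_status(int should_fail)
{
    int ret = stub_init();
    if (ret != STATUS_SUCCESS)
        return ret;

    group_id_t gid = stub_group_create(should_fail);
    if (gid == GROUP_ID_INVALID) {
        fprintf(stderr, "group creation failed\n");
        return EIO; /* nonzero, so the caller can detect the failure */
    }

    return 0;
}
```

With a forced failure, setup_stale_status(1) returns 0 while setup_explicit_status(1) returns EIO, so only the second variant lets a caller notice that group creation did not succeed. EIO is used here only because the margo_init() error path above already returns it; any nonzero code would do.
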
>>> 
>>> Thanks,
>>> Hyogi
>>> _______________________________________________
>>> mochi-devel mailing list
>>> mochi-devel at lists.mcs.anl.gov
>>> https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel
>>> https://www.mcs.anl.gov/research/projects/mochi