[Mochi-devel] [SSG] pmix initialization failure

Snyder, Shane ssnyder at mcs.anl.gov
Tue Sep 15 17:27:27 CDT 2020


Hi Hyogi,

Thanks for the heads up!

What exactly fails when you run the code snippet you shared? Is it the ssg_group_create_pmix() call? That would make the most sense, but I want to make sure it's not something else, like the Margo or SSG init calls. I don't see anything obviously wrong in your code.
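
One way to pin down which call fails is to have the init routine report the failing stage explicitly rather than a bare return code. The following is only a minimal sketch of that idea: the enum, the struct, and the stub_ok/stub_fail placeholders are hypothetical names that are not part of PMIx, Margo, or SSG, and the real calls would have to be wrapped in functions with this signature.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stage labels, one per init call in the shared snippet. */
enum init_stage {
    STAGE_NONE = 0,   /* every step succeeded */
    STAGE_PMIX,       /* would wrap PMIx_Init() */
    STAGE_MARGO,      /* would wrap margo_init() */
    STAGE_SSG_INIT,   /* would wrap ssg_init() */
    STAGE_SSG_GROUP   /* would wrap ssg_group_create_pmix() */
};

/* A step returns 0 on success and nonzero on failure. */
typedef int (*init_step_fn)(void);

struct init_step {
    enum init_stage stage;
    init_step_fn run;
};

/* Run the steps in order and return the first stage that fails,
 * or STAGE_NONE if all of them succeed. */
enum init_stage run_init_steps(const struct init_step *steps, size_t n)
{
    assert(steps != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (steps[i].run() != 0)
            return steps[i].stage;
    }
    return STAGE_NONE;
}

/* Human-readable name for log messages. */
const char *init_stage_name(enum init_stage s)
{
    switch (s) {
    case STAGE_NONE:      return "none";
    case STAGE_PMIX:      return "PMIx_Init";
    case STAGE_MARGO:     return "margo_init";
    case STAGE_SSG_INIT:  return "ssg_init";
    case STAGE_SSG_GROUP: return "ssg_group_create_pmix";
    }
    return "unknown";
}

/* Placeholder steps for illustration only; they call no real library. */
int stub_ok(void)   { return 0; }
int stub_fail(void) { return 1; }
```

With a table such as { {STAGE_PMIX, stub_ok}, {STAGE_MARGO, stub_ok}, {STAGE_SSG_INIT, stub_ok}, {STAGE_SSG_GROUP, stub_fail} }, run_init_steps() returns STAGE_SSG_GROUP, so the caller can log the failing stage by name on every rank.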

I'm currently investigating some other errors that result in sporadic hangs or crashes on Summit when using SSG to launch many processes on a node (e.g., starting 64 SSG processes on one Summit node). Maybe this is just another variation of that particular bug, though I'm using verbs and you're using tcp? It would be helpful if you could confirm exactly how many processes and nodes it takes to trigger the problem so I can try to reproduce it. Your original mail mentions 512 nodes, but it's not clear whether you're starting just a single process on each node or more than that. If possible, it would be nice to simplify the reproducer so we don't need so many nodes to debug.
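
To narrow down the node count that triggers the problem, a bisection between a known-good and a known-bad size is one option. This is a rough sketch under strong assumptions: that the failure is deterministic and monotonic in the node count (a sporadic failure would need repeated runs per size), and that some hook can launch a job and report whether it failed. The run_fails_fn hook and the fake_fails_at_300 predicate are hypothetical, and the threshold of 300 is made up purely for illustration.

```c
#include <assert.h>

/* Hypothetical hook: returns nonzero if a run with this many nodes fails.
 * In practice it would submit a job and inspect the outcome. */
typedef int (*run_fails_fn)(int nnodes);

/* Return the smallest node count in (known_good, known_bad] that fails,
 * assuming known_good passes, known_bad fails, and failure is monotonic. */
int smallest_failing_nodes(int known_good, int known_bad, run_fails_fn fails)
{
    assert(known_good < known_bad);
    int lo = known_good;  /* invariant: a run with lo nodes passes */
    int hi = known_bad;   /* invariant: a run with hi nodes fails */
    while (hi - lo > 1) {
        int mid = lo + (hi - lo) / 2;
        if (fails(mid))
            hi = mid;
        else
            lo = mid;
    }
    return hi;
}

/* Placeholder predicate for illustration only: pretends that runs
 * fail at 300 nodes or more. The number 300 is invented. */
int fake_fails_at_300(int nnodes)
{
    return nnodes >= 300;
}
```

Starting from the sizes mentioned in the original mail (256 nodes working, 512 failing), this needs eight trial runs to isolate the threshold. The same search could be repeated with a different number of processes per node to see whether the total process count or the node count matters.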

--Shane
________________________________
From: mochi-devel <mochi-devel-bounces at lists.mcs.anl.gov> on behalf of Sim, Hyogi <simh at ornl.gov>
Sent: Tuesday, September 15, 2020 10:59 AM
To: mochi-devel at lists.mcs.anl.gov <mochi-devel at lists.mcs.anl.gov>
Subject: [Mochi-devel] [SSG] pmix initialization failure

Hi,

I am initializing an SSG group using PMIx on Summit at . The initialization works as expected, but only up to a certain number of compute nodes (~256 nodes). The group initialization always seems to fail with 512+ nodes. Assuming that SSG itself has been tested at a larger scale, I am wondering if you see any obvious problems in my code below.

For ssg_config and ssg_group_update_cb, I copied the code directly from the Mochi documentation (https://mochi.readthedocs.io/en/latest/ssg/05_create_pmix.html). I am using v0.4.1.

===

static int comm_init(void)
{
    int ret = 0;
    int i = 0;
    int rank = 0;
    int nranks = 0;
    pmix_proc_t proc;
    margo_instance_id mid;
    ssg_group_id_t gid;

    __debug("initializing the communication");

    ret = PMIx_Init(&proc, NULL, 0);
    if (ret != PMIX_SUCCESS) {
        __error("PMIx_Init failed: %s", PMIx_Error_string(ret));
        return ret;
    }

    __debug("pmix initialized");

    mid = margo_init("ofi+tcp://", MARGO_SERVER_MODE, 1, 4);
    if (mid == MARGO_INSTANCE_NULL) {
        __error("failed to initialize margo");
        return EIO;
    }

    __debug("margo initialized");

    ret = ssg_init();
    if (ret != SSG_SUCCESS) {
        __error("ssg_init() failed");
        return ret;
    }

    gid = ssg_group_create_pmix(mid, "servergroup", proc,
                                &ssg_config, ssg_group_update_cb, NULL);
    if (gid == SSG_GROUP_ID_INVALID) {
        __error("ssg_group_create_pmix() failed");
        return EIO; /* ret still holds SSG_SUCCESS from ssg_init(), so returning it would hide the failure */
    }

    rank = ssg_get_group_self_rank(gid);
    nranks = ssg_get_group_size(gid);

    __debug("ssg group (gid=%llu, rank=%d, nranks=%d)",
            (unsigned long long) gid, (int) rank, nranks);

    ssg_group_dump(gid);

    return 0;
}

===

Thanks,
Hyogi