[Mochi-devel] [SSG] pmix initialization failure

Sim, Hyogi simh at ornl.gov
Tue Sep 15 10:59:23 CDT 2020


Hi,

I am initializing an SSG group using PMIx on Summit. The initialization works as expected, but only up to a certain number of compute nodes (~256 nodes). With 512 or more nodes, the group initialization always seems to fail. Assuming that SSG itself has been tested at a larger scale, I am wondering if you see any obvious problems in my code below.

I copied ssg_config and ssg_group_update_cb directly from the Mochi documentation (https://mochi.readthedocs.io/en/latest/ssg/05_create_pmix.html). I am using v0.4.1.

===

static int comm_init(void)
{
    int ret = 0;
    int i = 0;
    int rank = 0;
    int nranks = 0;
    pmix_proc_t proc;
    margo_instance_id mid;
    ssg_group_id_t gid;

    __debug("initializing the communication");

    ret = PMIx_Init(&proc, NULL, 0);
    if (ret != PMIX_SUCCESS) {
        __error("PMIx_Init failed: %s", PMIx_Error_string(ret));
        return ret;
    }

    __debug("pmix initialized");

    mid = margo_init("ofi+tcp://", MARGO_SERVER_MODE, 1, 4);
    if (mid == MARGO_INSTANCE_NULL) {
        __error("failed to initialize margo");
        return EIO;
    }

    __debug("margo initialized");

    ret = ssg_init();
    if (ret != SSG_SUCCESS) {
        __error("ssg_init() failed");
        return ret;
    }

    gid = ssg_group_create_pmix(mid, "servergroup", proc,
                                &ssg_config, ssg_group_update_cb, NULL);
    if (gid == SSG_GROUP_ID_INVALID) {
        __error("ssg_group_create_pmix() failed");
        /* ret is still SSG_SUCCESS (0) here, so return an explicit error */
        return EIO;
    }

    rank = ssg_get_group_self_rank(gid);
    nranks = ssg_get_group_size(gid);

    __debug("ssg group (gid=%llu, rank=%d, nranks=%d)",
            (unsigned long long) gid, (int) rank, nranks);

    ssg_group_dump(gid);

    return 0;
}

===

Thanks,
Hyogi

