[Mochi-devel] [SSG] pmix initialization failure
Sim, Hyogi
simh at ornl.gov
Tue Sep 15 10:59:23 CDT 2020
Hi,
I am initializing an SSG group using PMIx on Summit. The initialization works as expected, but only up to a certain number of compute nodes (around 256). With 512 or more nodes, the group initialization always seems to fail. Assuming that SSG itself has been tested at a larger scale, I am wondering whether you see any obvious problems in my code below.
I copied ssg_config and ssg_group_update_cb directly from the Mochi documentation (https://mochi.readthedocs.io/en/latest/ssg/05_create_pmix.html). I am using v0.4.1.
===
static int comm_init(void)
{
    int ret = 0;
    int i = 0;
    int rank = 0;
    int nranks = 0;
    pmix_proc_t proc;
    margo_instance_id mid;
    ssg_group_id_t gid;

    __debug("initialize the communication");

    ret = PMIx_Init(&proc, NULL, 0);
    if (ret != PMIX_SUCCESS) {
        __error("PMIx_Init failed: %s", PMIx_Error_string(ret));
        return ret;
    }
    __debug("pmix initialized");

    mid = margo_init("ofi+tcp://", MARGO_SERVER_MODE, 1, 4);
    if (mid == MARGO_INSTANCE_NULL) {
        __error("failed to initialize margo");
        return EIO;
    }
    __debug("margo initialized");

    ret = ssg_init();
    if (ret != SSG_SUCCESS) {
        __error("ssg_init() failed");
        return ret;
    }

    gid = ssg_group_create_pmix(mid, "servergroup", proc,
                                &ssg_config, ssg_group_update_cb, NULL);
    if (gid == SSG_GROUP_ID_INVALID) {
        __error("ssg_group_create_pmix() failed");
        /* ret still holds SSG_SUCCESS from ssg_init(), so return an explicit error */
        return EIO;
    }

    rank = ssg_get_group_self_rank(gid);
    nranks = ssg_get_group_size(gid);
    __debug("ssg group (gid=%llu, rank=%d, nranks=%d)",
            (unsigned long long) gid, (int) rank, nranks);
    ssg_group_dump(gid);

    return 0;
}
===
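A side note on the error path after ssg_group_create_pmix(): at that point ret was last set by a call that succeeded, so returning a leftover ret from the failure branch would report success to the caller. Below is a minimal sketch of that effect. It does not call SSG at all; fake_group_create, fake_group_id_t, FAKE_GROUP_ID_INVALID and FAKE_SUCCESS are hypothetical stand-ins I made up so the snippet compiles on its own, and their values are not taken from the SSG headers.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins (not the real SSG types or constants). */
typedef uint64_t fake_group_id_t;
#define FAKE_GROUP_ID_INVALID ((fake_group_id_t)0)
#define FAKE_SUCCESS 0

/* Stand-in for a group-creation call that signals failure by
 * returning an invalid ID instead of an error code. */
static fake_group_id_t fake_group_create(int should_fail)
{
    return should_fail ? FAKE_GROUP_ID_INVALID : (fake_group_id_t)42;
}

/* Failure branch returns the leftover status of the previous call. */
static int create_group_reuse_ret(int should_fail)
{
    int ret = FAKE_SUCCESS; /* left over from an earlier successful call */
    fake_group_id_t gid = fake_group_create(should_fail);
    if (gid == FAKE_GROUP_ID_INVALID)
        return ret;         /* reports success even though creation failed */
    return 0;
}

/* Failure branch returns an explicit error code instead. */
static int create_group_explicit(int should_fail)
{
    fake_group_id_t gid = fake_group_create(should_fail);
    if (gid == FAKE_GROUP_ID_INVALID)
        return EIO;         /* the caller can now see the failure */
    return 0;
}
```

With should_fail set to 1, create_group_reuse_ret() returns 0 while create_group_explicit() returns EIO, so only the second variant lets the caller detect the failed group creation.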
Thanks,
Hyogi