[Mochi-devel] [EXTERNAL] [SSG] pmix initialization failure
Sim, Hyogi
simh at ornl.gov
Wed Sep 16 10:25:39 CDT 2020
Thanks, Rob.
I have just tested with ssg_group_create_mpi(), and it works fine (tested up to 1024 nodes, ppn=1). It seems like the problem is with pmix.
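
To narrow down where the pmix path starts failing (the thread reports it working at about 256 nodes and failing at 512 and above), one option is to bisect the node count. The sketch below is only an illustration. It assumes failures are monotonic in node count, which this thread has not confirmed. `run_ok_fn`, `smallest_failing_nodes`, and `fake_run_ok` are hypothetical names and not part of SSG, Margo, or PMIx. In practice the callback would launch the reproducer at the given node count and report whether ssg_group_create_pmix() succeeded.

```c
#include <assert.h>

/* Hypothetical predicate: returns nonzero if group creation succeeded
 * when run on `nodes` nodes. Not part of SSG or PMIx. */
typedef int (*run_ok_fn)(int nodes, void *arg);

/* Bisect between a node count known to work (lo_ok) and one known to
 * fail (hi_bad). Assumes that once a count fails, every larger count
 * fails too. Returns the smallest failing node count. */
int smallest_failing_nodes(int lo_ok, int hi_bad, run_ok_fn run_ok, void *arg)
{
    assert(lo_ok < hi_bad);
    while (hi_bad - lo_ok > 1) {
        int mid = lo_ok + (hi_bad - lo_ok) / 2;
        if (run_ok(mid, arg))
            lo_ok = mid;   /* mid works: the threshold is above mid */
        else
            hi_bad = mid;  /* mid fails: the threshold is at or below mid */
    }
    return hi_bad;
}

/* Stand-in predicate for demonstration only: pretends that failures
 * begin at a made-up threshold passed through `arg`. */
int fake_run_ok(int nodes, void *arg)
{
    int threshold = *(int *)arg;
    return nodes < threshold;
}
```

With the reported bounds, a call such as `smallest_failing_nodes(256, 512, run_ok, ctx)` needs at most 8 trial runs to isolate the threshold. If the failure is sporadic rather than monotonic, each trial would have to be repeated several times before trusting the result.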
Best,
Hyogi
> On Sep 16, 2020, at 10:15 AM, Latham, Robert J. <robl at mcs.anl.gov> wrote:
>
> On Wed, 2020-09-16 at 14:09 +0000, Sim, Hyogi wrote:
>> Hi Shane,
>>
>> You are correct that ssg_group_create_pmix() is failing. I heard
>> from another coworker that he avoided using pmix because he frequently
>> observed unreliable behavior from it. I am not sure yet what exactly
>> causes the problem.
>>
>> As for the application, it is like a service daemon and spawns
>> exactly one process per node. I can probably test with another
>> group creation function (possibly the MPI one), then see whether it
>> still fails.
>
>
> We've been investigating SSG behavior at larger scale. I added a
> pointer to your message to this SSG issue:
>
> https://xgitlab.cels.anl.gov/sds/ssg/-/issues/21
>
> ==rob
>
>>
>>> On Sep 15, 2020, at 6:27 PM, Snyder, Shane <ssnyder at mcs.anl.gov>
>>> wrote:
>>>
>>> Hi Hyogi,
>>>
>>> Thanks for the heads up!
>>>
>>> What exactly fails when you run the code snippet you shared? Is it
>>> the ssg_group_create_pmix() call? That would make the most sense
>>> but just making sure it's not something else like the Margo or SSG
>>> init calls. I don't see anything obviously wrong in your code.
>>>
>>> I'm currently investigating some other errors that result in
>>> sporadic hangs or crashes on Summit when using SSG to launch many
>>> processes on a node (e.g., starting 64 SSG processes on one Summit
>>> node) -- maybe this is just another variation of that particular
>>> bug (though I'm using verbs and you're using tcp)? It might be
>>> helpful if you can confirm exactly how many processes and nodes it
>>> takes to trigger the problem so I can try to reproduce. Your
>>> original mail mentions 512 nodes, but it's not clear whether you're
>>> starting just a single process on each node or more than that, etc.
>>> It would be nice if we could simplify the reproducer so we don't
>>> need so many nodes to debug, if possible.
>>>
>>> --Shane
>>> From: mochi-devel <mochi-devel-bounces at lists.mcs.anl.gov> on behalf
>>> of Sim, Hyogi <simh at ornl.gov>
>>> Sent: Tuesday, September 15, 2020 10:59 AM
>>> To: mochi-devel at lists.mcs.anl.gov <mochi-devel at lists.mcs.anl.gov>
>>> Subject: [Mochi-devel] [SSG] pmix initialization failure
>>>
>>> Hi,
>>>
>>> I am initializing an SSG group using pmix on Summit at . The
>>> initialization works as expected, but only up to a certain number
>>> of compute nodes (~256 nodes). The group initialization always
>>> seems to fail with 512+ nodes. Assuming that ssg itself has
>>> been tested at a larger scale, I am wondering if you see any
>>> obvious problems in my code below.
>>>
>>> For ssg_config and ssg_group_update_cb, I just copied directly from
>>> the mochi documentation (
>>> https://mochi.readthedocs.io/en/latest/ssg/05_create_pmix.html). I
>>> am using v0.4.1.
>>>
>>> ===
>>>
>>> static int comm_init(void)
>>> {
>>>     int ret = 0;
>>>     int i = 0;
>>>     int rank = 0;
>>>     int nranks = 0;
>>>     pmix_proc_t proc;
>>>     margo_instance_id mid;
>>>     ssg_group_id_t gid;
>>>
>>>     __debug("initialize the communication");
>>>
>>>     ret = PMIx_Init(&proc, NULL, 0);
>>>     if (ret != PMIX_SUCCESS) {
>>>         __error("PMIx_Init failed: %s", PMIx_Error_string(ret));
>>>         return ret;
>>>     }
>>>
>>>     __debug("pmix initialized");
>>>
>>>     mid = margo_init("ofi+tcp://", MARGO_SERVER_MODE, 1, 4);
>>>     if (mid == MARGO_INSTANCE_NULL) {
>>>         __error("failed to initialize margo");
>>>         return EIO;
>>>     }
>>>
>>>     __debug("margo initialized");
>>>
>>>     ret = ssg_init();
>>>     if (ret != SSG_SUCCESS) {
>>>         __error("ssg_init() failed");
>>>         return ret;
>>>     }
>>>
>>>     gid = ssg_group_create_pmix(mid, "servergroup", proc,
>>>                                 &ssg_config, ssg_group_update_cb,
>>>                                 NULL);
>>>     if (gid == SSG_GROUP_ID_INVALID) {
>>>         __error("ssg_group_create_pmix() failed");
>>>         /* ret still holds SSG_SUCCESS here, so return an explicit error */
>>>         return EIO;
>>>     }
>>>
>>>     rank = ssg_get_group_self_rank(gid);
>>>     nranks = ssg_get_group_size(gid);
>>>
>>>     __debug("ssg group (gid=%llu, rank=%d, nranks=%d)",
>>>             (unsigned long long) gid, (int) rank, nranks);
>>>
>>>     ssg_group_dump(gid);
>>>
>>>     return 0;
>>> }
>>>
>>> ===
>>>
>>> Thanks,
>>> Hyogi
>>> _______________________________________________
>>> mochi-devel mailing list
>>> mochi-devel at lists.mcs.anl.gov
>>> https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel
>>> https://www.mcs.anl.gov/research/projects/mochi
>>