[Mochi-devel] [EXTERNAL] [SSG] pmix initialization failure

Snyder, Shane ssnyder at mcs.anl.gov
Mon Sep 21 09:19:11 CDT 2020


Hi Hyogi,

Is this bug occurring in the same code snippet you shared in your previous email, or is it more complicated? And you're only using 1 process per node, so 64 total processes, right?

I have some time early this week to get on Summit and try to reproduce this as well as the other error you mentioned related to PMIx. This one does appear to be some weird corruption of SSG state that should be investigated.

Thanks,
--Shane
________________________________
From: Sim, Hyogi <simh at ornl.gov>
Sent: Friday, September 18, 2020 6:20 PM
To: Sim, Hyogi <simh at ornl.gov>
Cc: Latham, Robert J. <robl at mcs.anl.gov>; Snyder, Shane <ssnyder at mcs.anl.gov>; mochi-devel at lists.mcs.anl.gov <mochi-devel at lists.mcs.anl.gov>
Subject: Re: [Mochi-devel] [EXTERNAL] [SSG] pmix initialization failure

BTW, today I encountered a segfault on Summit while running with 64 nodes. I do not think this behavior is deterministic:

===

Core was generated by `/gpfs/alpine/proj-shared/stf008/hs2/metasim/summit/sum/progress.debug/sum-paral'.
Program terminated with signal 11, Segmentation fault.
#0  0x00002000000c1004 in swim_dping_ack_recv_ult (handle=0x231e5010)
    at ../src/swim-fd/swim-fd-ping.c:312
312             ABT_rwlock_wrlock(group->swim_ctx->swim_lock);
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.6.ppc64le libatomic-4.8.5-37.el7_6.ppc64le libgcc-4.8.5-37.el7_6.ppc64le libibverbs-41mlnx1-OFED.4.7.0.0.2.47329.ppc64le libmlx4-41mlnx1-OFED.4.7.3.0.3.47329.ppc64le libmlx5-41mlnx1-OFED.4.7.0.3.3.47329.ppc64le libnl3-3.2.28-4.el7.ppc64le librxe-41mlnx1-OFED.4.4.2.4.6.47329.ppc64le libstdc++-4.8.5-37.el7_6.ppc64le numactl-libs-2.0.9-7.el7.ppc64le
(gdb) bt
#0  0x00002000000c1004 in swim_dping_ack_recv_ult (handle=0x231e5010)
    at ../src/swim-fd/swim-fd-ping.c:312
#1  0x00002000000c1460 in swim_dping_ack_recv_ult_wrapper (handle=0x231e5010)
    at ../src/swim-fd/swim-fd-ping.c:378
#2  0x000020000065ac44 in ABTD_thread_func_wrapper_thread ()
   from /autofs/nccs-svm1_proj/csc300/mjbrim/spack.mjb/opt/spack/linux-rhel7-power8le/gcc-4.8.5/argobots-1.0rc2-2kq6htkm2ura7u3qj6wpr5xhqyns5ikp/lib/libabt.so.0
#3  0x000020000065b7b8 in make_fcontext ()
   from /autofs/nccs-svm1_proj/csc300/mjbrim/spack.mjb/opt/spack/linux-rhel7-power8le/gcc-4.8.5/argobots-1.0rc2-2kq6htkm2ura7u3qj6wpr5xhqyns5ikp/lib/libabt.so.0
(gdb) f 0
#0  0x00002000000c1004 in swim_dping_ack_recv_ult (handle=0x231e5010)
    at ../src/swim-fd/swim-fd-ping.c:312
312             ABT_rwlock_wrlock(group->swim_ctx->swim_lock);
(gdb) p group
$1 = (ssg_group_t *) 0x24978ca0
(gdb) p group->swim_ctx
$2 = (swim_context_t *) 0x0
(gdb) set print pretty on
(gdb) p group[0]
$3 = {
  mid_state = 0x231eff40,
  name = 0x249792e0 "\360{\227$",
  view = {
    size = 33,
    member_map = 0x0,
    rank_array = 0x24979510
  },
  config = {
    swim_period_length_ms = 3000,
    swim_suspect_timeout_periods = 5,
    swim_subgroup_member_count = -1,
    ssg_credential = -1
  },
  dead_members = 0x0,
  swim_ctx = 0x0,
  update_cb = 0x10002034 <ssg_group_update_cb>,
  update_cb_dat = 0x0,
  lock = 0xf
}
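
From the dump, group->swim_ctx is NULL at the point where swim-fd-ping.c:312 takes the lock, even though the rest of the group struct is still (partially) populated. I am not sure what the real fix is, but just to illustrate the crash site, a defensive guard would look roughly like the sketch below (it would only mask whatever leaves swim_ctx NULL while an ack RPC is still in flight):

===

    /* sketch: guard before the wrlock at swim-fd-ping.c:312, inside
     * swim_dping_ack_recv_ult() */
    if (group->swim_ctx == NULL) {
        /* group state looks torn down or corrupted; bail out rather
         * than dereference a NULL swim context (handle cleanup elided) */
        return;
    }
    ABT_rwlock_wrlock(group->swim_ctx->swim_lock);

===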



> On Sep 16, 2020, at 11:25 AM, Sim, Hyogi <simh at ornl.gov> wrote:
>
> Thanks, Rob.
>
> I have just tested with ssg_group_create_mpi(), and it works fine (tested up to 1024 nodes, ppn=1). It seems like the problem is with pmix.
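>
> For reference, the MPI-based creation is roughly the following sketch
> (mirroring my pmix snippet below; the exact ssg_group_create_mpi()
> argument list is assumed from the mochi docs, and error checks are
> elided):
>
> ===
>
>   MPI_Init(&argc, &argv);
>   mid = margo_init("ofi+tcp://", MARGO_SERVER_MODE, 1, 4);
>   ret = ssg_init();
>   gid = ssg_group_create_mpi(mid, "servergroup", MPI_COMM_WORLD,
>                              &ssg_config, ssg_group_update_cb, NULL);
>
> ===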
>
> Best,
> Hyogi
>
>
>
>> On Sep 16, 2020, at 10:15 AM, Latham, Robert J. <robl at mcs.anl.gov> wrote:
>>
>> On Wed, 2020-09-16 at 14:09 +0000, Sim, Hyogi wrote:
>>> Hi Shane,
>>>
>>> You are correct that ssg_group_create_pmix() is failing. I heard
>>> from another coworker that he avoided using pmix because he
>>> frequently observed unreliable behavior from it. I am not sure
>>> exactly what is causing the problem at this point.
>>>
>>> As for the application, it is like a service daemon and spawns
>>> exactly one process per node. I can probably test with another
>>> group creation function (possibly the MPI one) and see if it still
>>> fails or not.
>>
>>
>> We've been investigating SSG behavior at larger scale.  I added a
>> pointer to your message to this SSG issue:
>>
>> https://xgitlab.cels.anl.gov/sds/ssg/-/issues/21
>>
>> ==rob
>>
>>>
>>>> On Sep 15, 2020, at 6:27 PM, Snyder, Shane <ssnyder at mcs.anl.gov>
>>>> wrote:
>>>>
>>>> Hi Hyogi,
>>>>
>>>> Thanks for the heads up!
>>>>
>>>> What exactly fails when you run the code snippet you shared? Is it
>>>> the ssg_group_create_pmix() call? That would make the most sense
>>>> but just making sure it's not something else like the Margo or SSG
>>>> init calls. I don't see anything obviously wrong in your code.
>>>>
>>>> I'm currently investigating some other errors that result in
>>>> sporadic hangs or crashes on Summit when using SSG to launch many
>>>> processes on a node (e.g., starting 64 SSG processes on one Summit
>>>> node) -- maybe this is just another variation of that particular
>>>> bug (though I'm using verbs and you're using tcp)? It might be
>>>> helpful if you can confirm exactly how many processes and nodes it
>>>> takes to trigger the problem so I can try to reproduce. Your
>>>> original mail mentions 512 nodes, but it's not clear whether you're
>>>> starting just a single process on each node or more than that, etc.
>>>> It would be nice if we could simplify the reproducer so we don't
>>>> need so many nodes to debug, if possible.
>>>>
>>>> --Shane
>>>> From: mochi-devel <mochi-devel-bounces at lists.mcs.anl.gov> on behalf
>>>> of Sim, Hyogi <simh at ornl.gov>
>>>> Sent: Tuesday, September 15, 2020 10:59 AM
>>>> To: mochi-devel at lists.mcs.anl.gov <mochi-devel at lists.mcs.anl.gov>
>>>> Subject: [Mochi-devel] [SSG] pmix initialization failure
>>>>
>>>> Hi,
>>>>
>>>> I am initializing an SSG group using pmix on Summit. The
>>>> initialization works as expected, but only up to a certain number
>>>> of compute nodes (~256 nodes). The group initialization seems to
>>>> always fail with 512+ nodes. Assuming that ssg itself has been
>>>> tested at a larger scale, I am wondering if you see any obvious
>>>> problems in my code below.
>>>>
>>>> For ssg_config and ssg_group_update_cb, I copied directly from the
>>>> mochi documentation
>>>> (https://mochi.readthedocs.io/en/latest/ssg/05_create_pmix.html).
>>>> I am using v0.4.1.
>>>>
>>>> ===
>>>>
>>>> static int comm_init(void)
>>>> {
>>>>   int ret = 0;
>>>>   int i = 0;
>>>>   int rank = 0;
>>>>   int nranks = 0;
>>>>   pmix_proc_t proc;
>>>>   margo_instance_id mid;
>>>>   ssg_group_id_t gid;
>>>>
>>>>   __debug("initializa the communication");
>>>>
>>>>   ret = PMIx_Init(&proc, NULL, 0);
>>>>   if (ret != PMIX_SUCCESS) {
>>>>       __error("PMIx_Init failed: %s", PMIx_Error_string(ret));
>>>>       return ret;
>>>>   }
>>>>
>>>>   __debug("pmix initialized");
>>>>
>>>>   mid = margo_init("ofi+tcp://", MARGO_SERVER_MODE, 1, 4);
>>>>   if (mid == MARGO_INSTANCE_NULL) {
>>>>       __error("failed to initialize margo");
>>>>       return EIO;
>>>>   }
>>>>
>>>>   __debug("margo initialized");
>>>>
>>>>   ret = ssg_init();
>>>>   if (ret != SSG_SUCCESS) {
>>>>       __error("ssg_init() failed");
>>>>       return ret;
>>>>   }
>>>>
>>>>   gid = ssg_group_create_pmix(mid, "servergroup", proc,
>>>>                               &ssg_config, ssg_group_update_cb,
>>>>                               NULL);
>>>>   if (gid == SSG_GROUP_ID_INVALID) {
>>>>       __error("ssg_group_create_pmix() failed");
>>>>       return EIO;  /* ret is still 0 here, so return a real error */
>>>>   }
>>>>
>>>>   rank = ssg_get_group_self_rank(gid);
>>>>   nranks = ssg_get_group_size(gid);
>>>>
>>>>   __debug("ssg group (gid=%llu, rank=%d, nranks=%d)",
>>>>           (unsigned long long) gid, (int) rank, nranks);
>>>>
>>>>   ssg_group_dump(gid);
>>>>
>>>>   return 0;
>>>> }
>>>>
>>>> ===
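>>>>
>>>> For completeness, the matching teardown would be roughly the
>>>> following sketch (call names assumed from the public SSG, Margo,
>>>> and PMIx APIs; error checks elided):
>>>>
>>>> ===
>>>>
>>>>   ssg_group_destroy(gid);
>>>>   ssg_finalize();
>>>>   margo_finalize(mid);
>>>>   PMIx_Finalize(NULL, 0);
>>>>
>>>> ===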
>>>>
>>>> Thanks,
>>>> Hyogi
>>>
>
