[Mochi-devel] [EXTERNAL] [SSG] pmix initialization failure

Sim, Hyogi simh at ornl.gov
Fri Sep 18 18:20:48 CDT 2020


BTW, today I encountered a segfault on Summit while running on 64 nodes. I do not think this behavior is deterministic:

===

Core was generated by `/gpfs/alpine/proj-shared/stf008/hs2/metasim/summit/sum/progress.debug/sum-paral'.
Program terminated with signal 11, Segmentation fault.
#0  0x00002000000c1004 in swim_dping_ack_recv_ult (handle=0x231e5010)
    at ../src/swim-fd/swim-fd-ping.c:312
312             ABT_rwlock_wrlock(group->swim_ctx->swim_lock);
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.6.ppc64le libatomic-4.8.5-37.el7_6.ppc64le libgcc-4.8.5-37.el7_6.ppc64le libibverbs-41mlnx1-OFED.4.7.0.0.2.47329.ppc64le libmlx4-41mlnx1-OFED.4.7.3.0.3.47329.ppc64le libmlx5-41mlnx1-OFED.4.7.0.3.3.47329.ppc64le libnl3-3.2.28-4.el7.ppc64le librxe-41mlnx1-OFED.4.4.2.4.6.47329.ppc64le libstdc++-4.8.5-37.el7_6.ppc64le numactl-libs-2.0.9-7.el7.ppc64le
(gdb) bt
#0  0x00002000000c1004 in swim_dping_ack_recv_ult (handle=0x231e5010)
    at ../src/swim-fd/swim-fd-ping.c:312
#1  0x00002000000c1460 in swim_dping_ack_recv_ult_wrapper (handle=0x231e5010)
    at ../src/swim-fd/swim-fd-ping.c:378
#2  0x000020000065ac44 in ABTD_thread_func_wrapper_thread ()
   from /autofs/nccs-svm1_proj/csc300/mjbrim/spack.mjb/opt/spack/linux-rhel7-power8le/gcc-4.8.5/argobots-1.0rc2-2kq6htkm2ura7u3qj6wpr5xhqyns5ikp/lib/libabt.so.0
#3  0x000020000065b7b8 in make_fcontext ()
   from /autofs/nccs-svm1_proj/csc300/mjbrim/spack.mjb/opt/spack/linux-rhel7-power8le/gcc-4.8.5/argobots-1.0rc2-2kq6htkm2ura7u3qj6wpr5xhqyns5ikp/lib/libabt.so.0
(gdb) f 0
#0  0x00002000000c1004 in swim_dping_ack_recv_ult (handle=0x231e5010)
    at ../src/swim-fd/swim-fd-ping.c:312
312             ABT_rwlock_wrlock(group->swim_ctx->swim_lock);
(gdb) p group
$1 = (ssg_group_t *) 0x24978ca0
(gdb) p group->swim_ctx
$2 = (swim_context_t *) 0x0
(gdb) set print pretty on
(gdb) p group[0]
$3 = {
  mid_state = 0x231eff40,
  name = 0x249792e0 "\360{\227$",
  view = {
    size = 33,
    member_map = 0x0,
    rank_array = 0x24979510
  },
  config = {
    swim_period_length_ms = 3000,
    swim_suspect_timeout_periods = 5,
    swim_subgroup_member_count = -1,
    ssg_credential = -1
  },
  dead_members = 0x0,
  swim_ctx = 0x0,
  update_cb = 0x10002034 <ssg_group_update_cb>,
  update_cb_dat = 0x0,
  lock = 0xf
}
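
For readers following the trace: frame 0 dereferences group->swim_ctx at swim-fd-ping.c:312, and gdb shows that pointer as 0x0, so the ack handler apparently ran while the group had no SWIM context. A minimal sketch of a defensive check follows, in C. The mock_* types and function are hypothetical stand-ins that only mirror the field names in the dump above; they are not SSG's actual code, and a check like this would only avoid the crash, not explain why swim_ctx was NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the SSG types seen in the gdb output;
 * the real definitions live in SSG and carry many more fields. */
typedef struct {
    int lock_count; /* stands in for the ABT_rwlock swim_lock */
} mock_swim_context_t;

typedef struct {
    mock_swim_context_t *swim_ctx; /* 0x0 in the core dump */
} mock_group_t;

enum { MOCK_OK = 0, MOCK_ERR_NO_SWIM = -1 };

/* Sketch of an ack handler that checks swim_ctx before touching its lock. */
static int mock_dping_ack_recv(mock_group_t *group)
{
    assert(group != NULL);          /* the group pointer itself was valid in the dump */
    if (group->swim_ctx == NULL) {
        return MOCK_ERR_NO_SWIM;    /* no SWIM context: drop the ack instead of faulting */
    }
    group->swim_ctx->lock_count++;  /* where the real code would take swim_lock */
    return MOCK_OK;
}
```

With swim_ctx set to NULL the handler returns MOCK_ERR_NO_SWIM instead of faulting; with a valid context it proceeds to the section that would hold the lock.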



> On Sep 16, 2020, at 11:25 AM, Sim, Hyogi <simh at ornl.gov> wrote:
> 
> Thanks, Rob.
> 
> I have just tested with ssg_group_create_mpi(), and it works fine (tested up to 1024 nodes, ppn=1). It seems like the problem is with pmix.
> 
> Best,
> Hyogi
> 
> 
> 
>> On Sep 16, 2020, at 10:15 AM, Latham, Robert J. <robl at mcs.anl.gov> wrote:
>> 
>> On Wed, 2020-09-16 at 14:09 +0000, Sim, Hyogi wrote:
>>> Hi Shane,
>>> 
>>> You are correct that ssg_group_create_pmix() is failing. I heard
>>> from a coworker that he avoided using pmix because he frequently
>>> observed unreliable behavior from it. I am not yet sure what exactly
>>> causes the problem.
>>> 
>>> As for the application, it is like a service daemon and spawns
>>> exactly one process per node. I can probably test with another
>>> group creation function (possibly the MPI one), then see if it still
>>> fails or not.
>> 
>> 
>> We've been investigating SSG behavior at larger scale.  I added a
>> pointer to your message to this SSG issue:
>> 
>> https://xgitlab.cels.anl.gov/sds/ssg/-/issues/21
>> 
>> ==rob
>> 
>>> 
>>>> On Sep 15, 2020, at 6:27 PM, Snyder, Shane <ssnyder at mcs.anl.gov>
>>>> wrote:
>>>> 
>>>> Hi Hyogi,
>>>> 
>>>> Thanks for the heads up!
>>>> 
>>>> What exactly fails when you run the code snippet you shared? Is it
>>>> the ssg_group_create_pmix() call? That would make the most sense
>>>> but just making sure it's not something else like the Margo or SSG
>>>> init calls. I don't see anything obviously wrong in your code.
>>>> 
>>>> I'm currently investigating some other errors that result in
>>>> sporadic hangs or crashes on Summit when using SSG to launch many
>>>> processes on a node (e.g., starting 64 SSG processes on one Summit
>>>> node) -- maybe this is just another variation of that particular
>>>> bug (though I'm using verbs and you're using tcp)? It might be
>>>> helpful if you can confirm exactly how many processes and nodes it
>>>> takes to trigger the problem so I can try to reproduce. Your
>>>> original mail mentions 512 nodes, but it's not clear whether you're
>>>> starting just a single process on each node or more than that, etc.
>>>> It would be nice if we could simplify the reproducer so we don't
>>>> need so many nodes to debug, if possible.
>>>> 
>>>> --Shane
>>>> From: mochi-devel <mochi-devel-bounces at lists.mcs.anl.gov> on behalf
>>>> of Sim, Hyogi <simh at ornl.gov>
>>>> Sent: Tuesday, September 15, 2020 10:59 AM
>>>> To: mochi-devel at lists.mcs.anl.gov <mochi-devel at lists.mcs.anl.gov>
>>>> Subject: [Mochi-devel] [SSG] pmix initialization failure
>>>> 
>>>> Hi,
>>>> 
>>>> I am initializing an SSG group using pmix on Summit at . The
>>>> initialization works as expected, but only up to a certain number
>>>> of compute nodes (~256 nodes). The group initialization always
>>>> seems to fail with 512+ nodes. Assuming that ssg itself has
>>>> been tested at a larger scale, I am wondering if you see any
>>>> obvious problems in my code below.
>>>> 
>>>> For ssg_config and ssg_group_update_cb, I just copied directly from
>>>> the mochi documentation (
>>>> https://mochi.readthedocs.io/en/latest/ssg/05_create_pmix.html). I
>>>> am using v0.4.1.
>>>> 
>>>> ===
>>>> 
>>>> static int comm_init(void)
>>>> {
>>>>   int ret = 0;
>>>>   int i = 0;
>>>>   int rank = 0;
>>>>   int nranks = 0;
>>>>   pmix_proc_t proc;
>>>>   margo_instance_id mid;
>>>>   ssg_group_id_t gid;
>>>> 
>>>>   __debug("initialize the communication");
>>>> 
>>>>   ret = PMIx_Init(&proc, NULL, 0);
>>>>   if (ret != PMIX_SUCCESS) {
>>>>       __error("PMIx_Init failed: %s", PMIx_Error_string(ret));
>>>>       return ret;
>>>>   }
>>>> 
>>>>   __debug("pmix initialized");
>>>> 
>>>>   mid = margo_init("ofi+tcp://", MARGO_SERVER_MODE, 1, 4);
>>>>   if (mid == MARGO_INSTANCE_NULL) {
>>>>       __error("failed to initialize margo");
>>>>       return EIO;
>>>>   }
>>>> 
>>>>   __debug("margo initialized");
>>>> 
>>>>   ret = ssg_init();
>>>>   if (ret != SSG_SUCCESS) {
>>>>       __error("ssg_init() failed");
>>>>       return ret;
>>>>   }
>>>> 
>>>>   gid = ssg_group_create_pmix(mid, "servergroup", proc,
>>>>                               &ssg_config, ssg_group_update_cb, NULL);
>>>>   if (gid == SSG_GROUP_ID_INVALID) {
>>>>       __error("ssg_group_create_pmix() failed");
>>>>       return EIO; /* ret still holds the successful ssg_init() result */
>>>>   }
>>>> 
>>>>   rank = ssg_get_group_self_rank(gid);
>>>>   nranks = ssg_get_group_size(gid);
>>>> 
>>>>   __debug("ssg group (gid=%llu, rank=%d, nranks=%d)",
>>>>           (unsigned long long) gid, (int) rank, nranks);
>>>> 
>>>>   ssg_group_dump(gid);
>>>> 
>>>>   return 0;
>>>> }
>>>> 
>>>> ===
>>>> 
>>>> Thanks,
>>>> Hyogi
>>>> _______________________________________________
>>>> mochi-devel mailing list
>>>> mochi-devel at lists.mcs.anl.gov
>>>> https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel
>>>> https://www.mcs.anl.gov/research/projects/mochi
>>> 
> 


