[Mochi-devel] [EXTERNAL] Re: Margo handler hangs on flooded requests

Fri Sep 25 14:30:44 CDT 2020

Fantastic- thank you for confirming so quickly!

I'll defer to Jerome on the best way to handle the official spack repo, but I would imagine the easiest short-term fix is to add a variant to control it.

thanks,
-Phil
________________________________
From: Sim, Hyogi <simh at ornl.gov>
Sent: Friday, September 25, 2020 2:54 PM
To: Sim, Hyogi <simh at ornl.gov>
Cc: Jerome Soumagne <jsoumagne at hdfgroup.org>; Carns, Philip H. <carns at mcs.anl.gov>; mochi-devel at lists.mcs.anl.gov <mochi-devel at lists.mcs.anl.gov>; Wang, Feiyi <fwang2 at ornl.gov>; Brim, Michael J. <brimmj at ornl.gov>
Subject: Re: [Mochi-devel] [EXTERNAL] Re: Margo handler hangs on flooded requests

Jerome and Phil,

It seems like your suggestion fixes the problem. Thank you!

I am still waiting for results from larger runs on Summit, but I was able to run the reproducer successfully on summitdev with 54 nodes/20 ppn (1080 bcast rpcs). The number of blocked ULTs no longer gets stuck at 255 but increases further (e.g., 900+). I will report back once I get results from Summit.

BTW, is there any way to set MERCURY_ENABLE_POST_LIMIT=OFF when using the official spack repo (instead of manually patching)?

Regards,
Hyogi

> On Sep 25, 2020, at 12:34 PM, Sim, Hyogi <simh at ornl.gov> wrote:
>
> Thank you, Jerome. I will try your suggestion of setting MERCURY_POST_LIMIT. We were suspecting some constant close to 255, because the number of blocked ULTs reached 255 when the server got stuck.
>
> Thanks,
> Hyogi
>
>
>> On Sep 25, 2020, at 12:27 PM, Jerome Soumagne <jsoumagne at hdfgroup.org> wrote:
>>
>> Hi Hyogi and Phil
>>
>> Yes, it is very likely that this is the issue. Indeed, handles can only be re-used once the user has released them with a call to HG_Destroy() in the RPC handler callback, meaning that the callback also has to be triggered.
>>
>> The limit by default is 256 so I'm not sure if that really matches the 512 from your report. At any rate there are two cmake build variables that can affect that behavior:
>> MERCURY_ENABLE_POST_LIMIT
>> and
>> MERCURY_POST_LIMIT (set to 256 by default)
>>
>> You can try to simply turn off that limit by setting MERCURY_ENABLE_POST_LIMIT to OFF (if you are using ccmake, press 't' to toggle advanced mode).
>>
>> Here is also a quick patch if you are building through spack
>>
>> /packages/mercury/package.py
>> index 31ee318c9..086ab6266 100644
>> --- a/var/spack/repos/builtin/packages/mercury/package.py
>> +++ b/var/spack/repos/builtin/packages/mercury/package.py
>> @@ -69,6 +69,7 @@ def cmake_args(self):
>>             '-DBUILD_SHARED_LIBS:BOOL=%s' % variant_bool('+shared'),
>>             '-DBUILD_TESTING:BOOL=%s' % str(self.run_tests),
>>             '-DMERCURY_ENABLE_PARALLEL_TESTING:BOOL=%s' % str(parallel_tests),
>> +            '-DMERCURY_ENABLE_POST_LIMIT:BOOL=OFF',
>>             '-DMERCURY_USE_BOOST_PP:BOOL=ON',
>>             '-DMERCURY_USE_CHECKSUMS:BOOL=ON',
>>             '-DMERCURY_USE_EAGER_BULK:BOOL=ON',
>>
>> Jerome
>>
>>
>> On Fri, 2020-09-25 at 13:21 +0000, Carns, Philip H. wrote:
>>> Thanks for the clarifications Hyogi (in this and the previous email), that helps quite a bit.
>>>
>>> If it happens with both ofi+tcp and bmi then that (probably) rules out a transport problem.  Neither requires memory registration either, which is another possible transport-level resource constraint.
>>>
>>> Jerome, do you know of anything that could be an issue in Mercury 1.0.1 w.r.t. resource consumption with a pattern like this, where rpc handler handles are held open for an extended period waiting on a chain of dependent RPCs to complete?
>>>
>>> I was wondering about the HG_POST_LIMIT.  Do incoming RPC handles count against this limit until the handler is completed at trigger time, or until the handle is destroyed?
>>>
>>> If it is the former, margo should recycle them very quickly.  The trigger callbacks only execute long enough to spawn detached Argobots ULTs before returning (and those ULTs are using deferred execution, so there isn't much work there).  If it is the latter, that could be an issue for this case, though, because the handles at the root of the trees won't be destroyed until after all of the dependent RPCs are complete, and it might eventually not leave enough buffers free to make progress on the chain.
>>>
>>> thanks,
>>> -Phil
>>>
>>> From: Sim, Hyogi <simh at ornl.gov>
>>> Sent: Thursday, September 24, 2020 5:33 PM
>>> To: Carns, Philip H. <carns at mcs.anl.gov>
>>> Cc: Sim, Hyogi <simh at ornl.gov>; mochi-devel at lists.mcs.anl.gov <mochi-devel at lists.mcs.anl.gov>; Brim, Michael J. <brimmj at ornl.gov>; Wang, Feiyi <fwang2 at ornl.gov>
>>> Subject: Re: [EXTERNAL] Re: Margo handler hangs on flooded requests
>>>
>>>
>>>
>>>> On Sep 24, 2020, at 5:06 PM, Carns, Philip H. <carns at mcs.anl.gov> wrote:
>>>>
>>>> Oh and one other simple high level question: in the reproducer, every server is initiating a broadcast with itself as the root at the same time?  Otherwise there would only be one handler in operation at a time per server, and presumably it doesn't hang in that case?
>>>
>>> For "512 nodes/ppn1", every server is initiating a broadcast at the same time, so 512 broadcast operations are in flight at the same time. In this case, each server will handle 511 rpc requests from other peers. This runs successfully.
>>>
>>> For "512 nodes/ppn2", every server is initiating two broadcasts at the same time, so 1024 broadcasts are in flight at the same time. Each server is expected to handle 511*2 rpc requests. This hangs partway through, after handling around 510 requests.
>>>
>>> In the reproducer, the server operation is triggered by local client applications. The server launches a separate rpc channel (margo instance with na+sm://) for local clients. This part of the server is called 'listener' in the code:
>>>
>>> https://code.ornl.gov/hyogi/metasim/-/blob/master/server/src/metasim-listener.c
>>>
>>> The client application for the sum (which asks the local server to initiate the broadcast sum) is:
>>>
>>> https://code.ornl.gov/hyogi/metasim/-/blob/master/examples/src/sum.c
>>>
>>>
>>> (I've added some collaborators in this email).
>>>
>>> Thanks,
>>> Hyogi
>>>
>>>
>>>>
>>>> thanks,
>>>> -Phil
>>>> From: mochi-devel <mochi-devel-bounces at lists.mcs.anl.gov> on behalf of Carns, Philip H. <carns at mcs.anl.gov>
>>>> Sent: Thursday, September 24, 2020 5:00 PM
>>>> To: Sim, Hyogi <simh at ornl.gov>; mochi-devel at lists.mcs.anl.gov <mochi-devel at lists.mcs.anl.gov>
>>>> Subject: Re: [Mochi-devel] Margo handler hangs on flooded requests
>>>>
>>>> Hi Hyogi,
>>>>
>>>> I'm not going to lie, that's a daunting bug report 🙂  Thank you for all of the detailed information and reproducer!
>>>>
>>>> Your theory of resource exhaustion triggering a deadlock seems likely.  Off the top of my head I'm not sure what the problematic resource would be, though.  Some random ideas would be that a limit on the number of preposted buffers (for incoming RPCs) has been exhausted, or that the handlers are inadvertently blocking on something that is not Argobots-aware and clogging up the RPC pools.
>>>>
>>>> I have some preliminary questions about the test environment (I apologize if this is covered in the repo; I have not dug into it yet):
>>>>       • Exactly what metric are you showing in the graphs (as in, what Argobots function are you using to retrieve the information)?  Just making sure there is no ambiguity in the interpretation.  It looks like the graphs are showing suspended threads and threads that are eligible for execution.
>>>>       • What versions of Mercury, Argobots, and Libfabric (if present) are you using?
>>>>       • Which transport are you using in Mercury?  Are you modifying its behavior with any environment variables?
>>>>       • How big is the RPC handler pool, and is there a dedicated progress thread for Margo?
>>>> thanks,
>>>> -Phil
>>>> From: mochi-devel <mochi-devel-bounces at lists.mcs.anl.gov> on behalf of Sim, Hyogi <simh at ornl.gov>
>>>> Sent: Thursday, September 24, 2020 1:31 PM
>>>> To: mochi-devel at lists.mcs.anl.gov <mochi-devel at lists.mcs.anl.gov>
>>>> Subject: [Mochi-devel] Margo handler hangs on flooded requests
>>>>
>>>> Hello,
>>>>
>>>> I am a part of the UnifyFS development team (https://github.com/LLNL/UnifyFS). We use the margo framework for our rpc communications.
>>>>
>>>> We are currently redesigning our metadata handling, and the new design includes broadcasting operations across server daemons. UnifyFS spawns one server daemon per compute node. For broadcasting, we first build a binary tree of server ranks, rooted at the rank that initiates the broadcast, and then recursively forward the request to child nodes in the tree. When a server doesn't have a child (i.e., a leaf node), it will directly respond to its parent.
>>>>
>>>> While testing our implementation, we've found that the servers hang while handling rpc requests, especially when many broadcasting operations are triggered simultaneously. Although we cannot exactly identify the number of broadcasting operations/servers from which it starts to fail, we can always reproduce with a sufficiently large scale, for instance, 48 nodes with ppn=16 (48 servers and 16 broadcasts per server). Our tests were performed on Summit and Summitdev.
>>>>
>>>> To isolate and debug the problem, we wrote a separate test program without the unifyfs codepath. Specifically, each server triggers a sum operation that adds up all ranks of servers using the broadcast operation. The following is the snippet of the sum operation (Full code is at https://code.ornl.gov/hyogi/metasim/-/blob/master/server/src/metasim-rpc.c):
>>>>
>>>> ---
>>>>
>>>> /*
>>>> * sum rpc (broadcasting)
>>>> */
>>>>
>>>> static int sum_forward(metasim_rpc_tree_t *tree,
>>>>                       metasim_sum_in_t *in, metasim_sum_out_t *out)
>>>> {
>>>>    int ret = 0;
>>>>    int i;
>>>>    int32_t seed = 0;
>>>>    int32_t partial_sum = 0;
>>>>    int32_t sum = 0;
>>>>    int child_count = tree->child_count;
>>>>    int *child_ranks = tree->child_ranks;
>>>>    corpc_req_t *req = NULL;
>>>>
>>>>    seed = in->seed;
>>>>
>>>>    if (child_count == 0) {
>>>>        __debug("i have no child (sum=%d)", sum);
>>>>        goto out;
>>>>    }
>>>>
>>>>    __debug("bcasting sum to %d children:", child_count);
>>>>
>>>>    for (i = 0; i < child_count; i++)
>>>>        __debug("child[%d] = rank %d", i, child_ranks[i]);
>>>>
>>>>    /* forward requests to children in the rpc tree */
>>>>    req = calloc(child_count, sizeof(*req));
>>>>    if (!req) {
>>>>        __error("failed to allocate memory for corpc");
>>>>        return ENOMEM;
>>>>    }
>>>>
>>>>    for (i = 0; i < child_count; i++) {
>>>>        corpc_req_t *r = &req[i];
>>>>        int child = child_ranks[i];
>>>>
>>>>        ret = corpc_get_handle(rpcset.sum, child, r);
>>>>        if (ret) {
>>>>            __error("corpc_get_handle failed, abort rpc");
>>>>            goto out;
>>>>        }
>>>>
>>>>        ret = corpc_forward_request((void *) in, r);
>>>>        if (ret) {
>>>>            __error("corpc_forward_request failed, abort rpc");
>>>>            goto out;
>>>>        }
>>>>    }
>>>>
>>>>    /* collect results */
>>>>    for (i = 0; i < child_count; i++) {
>>>>        metasim_sum_out_t _out;
>>>>        corpc_req_t *r = &req[i];
>>>>
>>>>        ret = corpc_wait_request(r);
>>>>        if (ret) {
>>>>            __error("corpc_wait_request failed, abort rpc");
>>>>            goto out;
>>>>        }
>>>>
>>>>        margo_get_output(r->handle, &_out);
>>>>        partial_sum = _out.sum;
>>>>        sum += partial_sum;
>>>>
>>>>        __debug("sum from child[%d] (rank=%d): %d (sum=%d)",
>>>>                i, child_ranks[i], partial_sum, sum);
>>>>
>>>>        margo_free_output(r->handle, &_out);
>>>>        margo_destroy(r->handle);
>>>>    }
>>>>
>>>> out:
>>>>    free(req); /* release the request array (free(NULL) is a no-op) */
>>>>    sum += metasim->rank + seed;
>>>>    out->sum = sum;
>>>>
>>>>    return ret;
>>>> }
>>>>
>>>> static void metasim_rpc_handle_sum(hg_handle_t handle)
>>>> {
>>>>    int ret = 0;
>>>>    hg_return_t hret;
>>>>    metasim_rpc_tree_t tree;
>>>>    metasim_sum_in_t in;
>>>>    metasim_sum_out_t out;
>>>>
>>>>    __debug("sum rpc handler");
>>>>    print_margo_handler_pool_info(metasim->mid);
>>>>
>>>>    hret = margo_get_input(handle, &in);
>>>>    if (hret != HG_SUCCESS) {
>>>>        __error("margo_get_input failed");
>>>>        margo_destroy(handle); /* release the handle on the error path too */
>>>>        return;
>>>>    }
>>>>
>>>>    metasim_rpc_tree_init(metasim->rank, metasim->nranks, in.root, 2, &tree);
>>>>
>>>>    ret = sum_forward(&tree, &in, &out);
>>>>    if (ret)
>>>>        __error("sum_forward failed");
>>>>
>>>>    metasim_rpc_tree_free(&tree);
>>>>    margo_free_input(handle, &in);
>>>>
>>>>    margo_respond(handle, &out);
>>>>
>>>>    margo_destroy(handle);
>>>> }
>>>> DEFINE_MARGO_RPC_HANDLER(metasim_rpc_handle_sum)
>>>>
>>>> int metasim_rpc_invoke_sum(int32_t seed, int32_t *sum)
>>>> {
>>>>    int ret = 0;
>>>>    int32_t _sum = 0;
>>>>    metasim_rpc_tree_t tree;
>>>>    metasim_sum_in_t in;
>>>>    metasim_sum_out_t out;
>>>>
>>>>    ret = metasim_rpc_tree_init(metasim->rank, metasim->nranks, metasim->rank,
>>>>                                2, &tree);
>>>>    if (ret) {
>>>>        __error("failed to initialize the rpc tree (ret=%d)", ret);
>>>>        return ret;
>>>>    }
>>>>
>>>>    in.root = metasim->rank;
>>>>    in.seed = seed;
>>>>
>>>>    ret = sum_forward(&tree, &in, &out);
>>>>    if (ret) {
>>>>        __error("sum_forward failed (ret=%d)", ret);
>>>>    } else {
>>>>       _sum = out.sum;
>>>>       __debug("rpc sum final result = %d", _sum);
>>>>
>>>>       *sum = _sum;
>>>>    }
>>>>
>>>>    metasim_rpc_tree_free(&tree);
>>>>
>>>>    return ret;
>>>> }
>>>>
>>>> ---
>>>>
>>>> This sum operation runs successfully with 512 nodes/ppn=1 (512 broadcasts), but fails/hangs with 512 nodes/ppn=2 (1024 broadcasts), on Summit. From the log messages, we see a different pattern in the Argobots rpc handler pool. In the following plots, the blue line shows the size of the handler pool, and the red line is the number of blocked elements, sampled each time the sum handler is invoked. This result is from one server (out of 512), but other servers show a similar pattern. In the failing case (2nd plot), we see that all ULTs become blocked. When the server hangs, the blocked count was 255.
>>>>
>>>>
>>>> <logs_success (size and blocked).png>
>>>>
>>>>
>>>> <logs_fail (size and blocked).png>
>>>>
>>>>
>>>> I am wondering if our way of implementing the broadcasting operation is problematic. Or, is there any parameter that we need to tune for handling a large number of requests? The margo versions that we have tested with are v0.4.3 and v0.5.1.
>>>>
>>>> We suspect that somehow a leaf node cannot respond to its parent due to some resource exhaustion, which results in a deadlock.
>>>>
>>>> The full test code is at https://code.ornl.gov/hyogi/metasim, and the full log is at https://code.ornl.gov/hyogi/metasim/-/tree/master/logs.
>>>>
>>>> Any advice will be helpful. Thank you!
>>>>
>>>> Best,
>>>> Hyogi
>>>
>>> _______________________________________________
>>> mochi-devel mailing list
>>> mochi-devel at lists.mcs.anl.gov
>>>
>>> https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel
>>>
>>> https://www.mcs.anl.gov/research/projects/mochi
>
