[Mochi-devel] [EXTERNAL] Re: Margo handler hangs on flooded requests

Thu Sep 24 16:33:13 CDT 2020

> On Sep 24, 2020, at 5:06 PM, Carns, Philip H. <carns at mcs.anl.gov> wrote:
> 
> Oh and one other simple high level question: in the reproducer, every server is initiating a broadcast with itself as the root at the same time?  Otherwise there would only be one handler in operation at a time per server, and presumably it doesn't hang in that case?

For "512 nodes/ppn1", every server initiates one broadcast at the same time, so 512 broadcasts are in flight at once. In this case, each server handles 511 rpc requests from its peers, and the run completes successfully.

For "512 nodes/ppn2", every server initiates two broadcasts at the same time, so 1024 broadcasts are in flight at once. Each server is expected to handle 511*2 rpc requests. This run hangs partway through, after around 510 requests have been handled.
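As a quick sanity check on those numbers, here is a minimal sketch of the arithmetic. It assumes each broadcast reaches every non-root server exactly once, which is what a spanning tree implies. The helper names `expected_rpcs_per_server` and `total_broadcasts` are made up for illustration and are not part of metasim.

```c
#include <assert.h>

/* Hypothetical helper (not part of metasim): number of sum RPCs one server
 * should receive when every server roots `ppn` broadcasts and each
 * broadcast visits every non-root server exactly once. */
static long expected_rpcs_per_server(long nservers, long ppn)
{
    assert(nservers > 0 && ppn >= 0);
    return (nservers - 1) * ppn;
}

/* Hypothetical helper: total number of broadcasts in flight in the job. */
static long total_broadcasts(long nservers, long ppn)
{
    assert(nservers > 0 && ppn >= 0);
    return nservers * ppn;
}
```

For 512 servers this gives 511 requests per server at ppn=1 and 1022 at ppn=2, so a hang after roughly 510 requests is about halfway through the expected load.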

In the reproducer, the server operation is triggered by local client applications. The server launches a separate rpc channel (margo instance with na+sm://) for local clients. This part of the server is called 'listener' in the code:

https://code.ornl.gov/hyogi/metasim/-/blob/master/server/src/metasim-listener.c

The client application for the sum (which asks the local server to initiate the broadcast sum) is:

https://code.ornl.gov/hyogi/metasim/-/blob/master/examples/src/sum.c

(I've added some collaborators to this email).

Thanks,
Hyogi

> 
> thanks,
> -Phil
> From: mochi-devel <mochi-devel-bounces at lists.mcs.anl.gov> on behalf of Carns, Philip H. <carns at mcs.anl.gov>
> Sent: Thursday, September 24, 2020 5:00 PM
> To: Sim, Hyogi <simh at ornl.gov>; mochi-devel at lists.mcs.anl.gov <mochi-devel at lists.mcs.anl.gov>
> Subject: Re: [Mochi-devel] Margo handler hangs on flooded requests
>  
> Hi Hyogi,
> 
> I'm not going to lie, that's a daunting bug report 🙂  Thank you for all of the detailed information and reproducer!
> 
> Your theory of resource exhaustion triggering a deadlock seems likely.  Off the top of my head I'm not sure what the problematic resource would be, though.  Some random ideas would be that a limit on the number of preposted buffers (for incoming RPC's) has been exhausted, or that the handlers are inadvertently blocking on something that is not Argobots aware and clogging up the RPC pools.
> 
> I have some preliminary questions about the test environment (I apologize if this is covered in the repo; I have not dug into it yet):
> 	• Exactly what metric are you showing in the graphs (as in, what Argobots function are you using to retrieve the information)?  Just making sure there is no ambiguity in the interpretation.  It looks like the graphs are showing suspended threads and threads that are eligible for execution.
> 	• What versions of Mercury, Argobots, and Libfabric (if present) are you using?
> 	• Which transport are you using in Mercury?  Are you modifying its behavior with any environment variables?
> 	• How big is the RPC handler pool, and is there a dedicated progress thread for Margo?
> thanks,
> -Phil
> From: mochi-devel <mochi-devel-bounces at lists.mcs.anl.gov> on behalf of Sim, Hyogi <simh at ornl.gov>
> Sent: Thursday, September 24, 2020 1:31 PM
> To: mochi-devel at lists.mcs.anl.gov <mochi-devel at lists.mcs.anl.gov>
> Subject: [Mochi-devel] Margo handler hangs on flooded requests
>  
> Hello,
> 
> I am a part of the UnifyFS development team (https://github.com/LLNL/UnifyFS). We use the margo framework for our rpc communications.
> 
> We are currently redesigning our metadata handling, and the new design includes broadcasting operations across server daemons. UnifyFS spawns one server daemon per compute node. For broadcasting, we first build a binary tree of server ranks, rooted at the rank that initiates the broadcast, and then recursively forward the request to child nodes in the tree. When a server doesn't have a child (i.e., it is a leaf node), it responds directly to its parent.
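To make the tree construction concrete, here is a minimal sketch of how the children of a rank could be computed for a k-ary tree rooted at an arbitrary rank. It assumes ranks are renumbered relative to the root and laid out like an array-backed heap. The real metasim_rpc_tree_init may do this differently, and `tree_children` is a hypothetical name.

```c
#include <assert.h>

/* Hypothetical sketch (not the metasim implementation): compute the
 * children of `rank` in a k-ary tree over `nranks` ranks rooted at `root`.
 * Ranks are shifted so that the root becomes id 0, then laid out
 * heap-style: the children of id i are i*k+1 .. i*k+k.
 * Writes at most k ranks into `children` and returns how many it wrote. */
static int tree_children(int rank, int nranks, int root, int k, int *children)
{
    int count = 0;
    int id, c;

    assert(nranks > 0 && k > 0);
    assert(rank >= 0 && rank < nranks);
    assert(root >= 0 && root < nranks);

    /* position of this rank relative to the root */
    id = (rank - root + nranks) % nranks;

    for (c = 1; c <= k; c++) {
        long child_id = (long) id * k + c;

        if (child_id >= nranks)
            break;  /* child ids only grow with c, so we can stop here */

        /* map the relative id back to a real rank */
        children[count++] = (int) ((child_id + root) % nranks);
    }

    return count;
}
```

With 7 ranks, k=2 and root 3, rank 4 (relative id 1) forwards to ranks 6 and 0, while ranks whose relative id is 3 or higher are leaves and respond directly to their parent.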
> 
> While testing our implementation, we've found that the servers hang while handling rpc requests, especially when many broadcasting operations are triggered simultaneously. Although we cannot identify the exact number of broadcasting operations/servers at which it starts to fail, we can always reproduce the hang at a sufficiently large scale, for instance, 48 nodes with ppn=16 (48 servers and 16 broadcasts per server). Our tests were performed on Summit and Summitdev.
> 
> To isolate and debug the problem, we wrote a separate test program without the unifyfs codepath. Specifically, each server triggers a sum operation that adds up all ranks of servers using the broadcast operation. The following is the snippet of the sum operation (Full code is at https://code.ornl.gov/hyogi/metasim/-/blob/master/server/src/metasim-rpc.c):
> 
> ---
> 
> /*
>  * sum rpc (broadcasting)
>  */
> 
> static int sum_forward(metasim_rpc_tree_t *tree,
>                        metasim_sum_in_t *in, metasim_sum_out_t *out)
> {
>     int ret = 0;
>     int i;
>     int32_t seed = 0;
>     int32_t partial_sum = 0;
>     int32_t sum = 0;
>     int child_count = tree->child_count;
>     int *child_ranks = tree->child_ranks;
>     corpc_req_t *req = NULL;
> 
>     seed = in->seed;
> 
>     if (child_count == 0) {
>         __debug("i have no child (sum=%d)", sum);
>         goto out;
>     }
> 
>     __debug("bcasting sum to %d children:", child_count);
> 
>     for (i = 0; i < child_count; i++)
>         __debug("child[%d] = rank %d", i, child_ranks[i]);
> 
>     /* forward requests to children in the rpc tree */
>     req = calloc(child_count, sizeof(*req));
>     if (!req) {
>         __error("failed to allocate memory for corpc");
>         return ENOMEM;
>     }
> 
>     for (i = 0; i < child_count; i++) {
>         corpc_req_t *r = &req[i];
>         int child = child_ranks[i];
> 
>         ret = corpc_get_handle(rpcset.sum, child, r);
>         if (ret) {
>             __error("corpc_get_handle failed, abort rpc");
>             goto out;
>         }
> 
>         ret = corpc_forward_request((void *) in, r);
>         if (ret) {
>             __error("corpc_forward_request failed, abort rpc");
>             goto out;
>         }
>     }
> 
>     /* collect results */
>     for (i = 0; i < child_count; i++) {
>         metasim_sum_out_t _out;
>         corpc_req_t *r = &req[i];
> 
>         ret = corpc_wait_request(r);
>         if (ret) {
>             __error("corpc_wait_request failed, abort rpc");
>             goto out;
>         }
> 
>         margo_get_output(r->handle, &_out);
>         partial_sum = _out.sum;
>         sum += partial_sum;
> 
>         __debug("sum from child[%d] (rank=%d): %d (sum=%d)",
>                 i, child_ranks[i], partial_sum, sum);
> 
>         margo_free_output(r->handle, &_out);
>         margo_destroy(r->handle);
>     }
> 
>     /* all requests have completed; release the request array */
>     free(req);
>     req = NULL;
> 
> out:
>     sum += metasim->rank + seed;
>     out->sum = sum;
> 
>     return ret;
> }
> 
> static void metasim_rpc_handle_sum(hg_handle_t handle)
> {
>     int ret = 0;
>     hg_return_t hret;
>     metasim_rpc_tree_t tree;
>     metasim_sum_in_t in;
>     metasim_sum_out_t out;
> 
>     __debug("sum rpc handler");
>     print_margo_handler_pool_info(metasim->mid);
> 
>     hret = margo_get_input(handle, &in);
>     if (hret != HG_SUCCESS) {
>         __error("margo_get_input failed");
>         margo_destroy(handle); /* the handler owns the handle, so release it on this path too */
>         return;
>     }
> 
>     metasim_rpc_tree_init(metasim->rank, metasim->nranks, in.root, 2, &tree);
> 
>     ret = sum_forward(&tree, &in, &out);
>     if (ret)
>         __error("sum_forward failed");
> 
>     metasim_rpc_tree_free(&tree);
>     margo_free_input(handle, &in);
> 
>     margo_respond(handle, &out);
> 
>     margo_destroy(handle);
> }
> DEFINE_MARGO_RPC_HANDLER(metasim_rpc_handle_sum)
> 
> int metasim_rpc_invoke_sum(int32_t seed, int32_t *sum)
> {
>     int ret = 0;
>     int32_t _sum = 0;
>     metasim_rpc_tree_t tree;
>     metasim_sum_in_t in;
>     metasim_sum_out_t out;
> 
>     ret = metasim_rpc_tree_init(metasim->rank, metasim->nranks, metasim->rank,
>                                 2, &tree);
>     if (ret) {
>         __error("failed to initialize the rpc tree (ret=%d)", ret);
>         return ret;
>     }
> 
>     in.root = metasim->rank;
>     in.seed = seed;
> 
>     ret = sum_forward(&tree, &in, &out);
>     if (ret) {
>         __error("sum_forward failed (ret=%d)", ret);
>     } else {
>         _sum = out.sum;
>         __debug("rpc sum final result = %d", _sum);
> 
>         *sum = _sum;
>     }
> 
>     metasim_rpc_tree_free(&tree);
> 
>     return ret;
> }
> 
> ---
> 
> This sum operation runs successfully with 512 nodes/ppn=1 (512 broadcasts), but fails/hangs with 512 nodes/ppn=2 (1024 broadcasts), on Summit. From the log messages, we see a different pattern in the Argobots rpc handler pool. In the following plots, the blue line shows the size of the handler pool, and the red line shows the number of blocked elements, sampled each time the sum handler is invoked. This result is from one server (out of 512), but other servers show a similar pattern. In the failing case (2nd plot), we see that all ULTs become blocked. When the server hangs, the blocked count is 255.
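For reference when reading the blocked counts: a handler can only block in corpc_wait_request if its server has children in that broadcast's tree. The sketch below counts, for one server, how many non-root tree positions have at least one child. It assumes a heap-style layout where the relative id is (rank - root) mod nranks and the children of id i are i*k+1 .. i*k+k, so that a server occupies each relative id 1 .. nranks-1 exactly once across the broadcasts rooted at the other servers. That layout is an assumption, not something confirmed from metasim_rpc_tree_init, and `forwarding_positions` is a hypothetical name.

```c
#include <assert.h>

/* Hypothetical helper (layout is an assumption, see above): count the
 * non-root relative ids 1 .. nranks-1 that have at least one child in a
 * heap-style k-ary tree. A handler running at such a position forwards
 * the request and then waits for its children. */
static int forwarding_positions(int nranks, int k)
{
    int id;
    int count = 0;

    assert(nranks > 0 && k > 0);

    for (id = 1; id < nranks; id++) {
        /* the first child of id is id*k+1; it exists only if it is in range */
        if ((long) id * k + 1 < nranks)
            count++;
    }

    return count;
}
```

Under that assumption, 512 ranks with a binary tree give 255 forwarding positions per server when every other server roots one broadcast (ppn=1), and twice that many forwarding handlers for ppn=2. The figure of 255 is the same as the blocked count seen at the hang, but this sketch does not establish whether that is more than a coincidence.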
> 
> 
> <logs_success (size and blocked).png>
> 
> 
> <logs_fail (size and blocked).png>
> 
> 
> I am wondering if our way of implementing the broadcasting operation is problematic. Or, is there any parameter that we need to tune for handling a large number of requests? The margo versions that we have tested with are v0.4.3 and v0.5.1.
> 
> We suspect that somehow a leaf node cannot respond to its parent due to some resource exhaustion, which results in a deadlock.
> 
> The full test code is at https://code.ornl.gov/hyogi/metasim, and the full log is at https://code.ornl.gov/hyogi/metasim/-/tree/master/logs.
> 
> Any advice will be helpful. Thank you!
> 
> Best,
> Hyogi