<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);">
Oh, and one other simple high-level question: in the reproducer, is every server initiating a broadcast with itself as the root at the same time? Otherwise there would be only one handler in operation at a time per server, and presumably it doesn't hang in that case?</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);">
thanks,</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);">
-Phil<br>
</div>
<div id="appendonsend"></div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> mochi-devel <mochi-devel-bounces@lists.mcs.anl.gov> on behalf of Carns, Philip H. <carns@mcs.anl.gov><br>
<b>Sent:</b> Thursday, September 24, 2020 5:00 PM<br>
<b>To:</b> Sim, Hyogi <simh@ornl.gov>; mochi-devel@lists.mcs.anl.gov <mochi-devel@lists.mcs.anl.gov><br>
<b>Subject:</b> Re: [Mochi-devel] Margo handler hangs on flooded requests</font>
<div> </div>
</div>
<style type="text/css" style="display:none">
<!--
p
{margin-top:0;
margin-bottom:0}
-->
</style>
<div dir="ltr">
<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0); background-color:rgb(255,255,255)">
Hi Hyogi,</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0); background-color:rgb(255,255,255)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0); background-color:rgb(255,255,255)">
I'm not going to lie, that's a daunting bug report <span id="x_🙂">🙂 Thank you for all of the detailed information and reproducer!</span><br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0); background-color:rgb(255,255,255)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0); background-color:rgb(255,255,255)">
Your theory of resource exhaustion triggering a deadlock seems likely. Off the top of my head I'm not sure what the problematic resource would be, though. Some random ideas: a limit on the number of preposted buffers (for incoming RPCs) has been exhausted, or the handlers are inadvertently blocking on something that is not Argobots-aware and clogging up the RPC pools.<br>
</div>
<div><br>
</div>
<div>I have some preliminary questions about the test environment (I apologize if this is covered in the repo; I have not dug into it yet):</div>
<div>
<ul>
<li>Exactly what metric are you showing in the graphs (as in, which Argobots function are you using to retrieve the information)? Just making sure there is no ambiguity in the interpretation. It looks like the graphs are showing suspended threads and threads that are eligible for execution.<br>
</li><li>What versions of Mercury, Argobots, and Libfabric (if present) are you using?</li><li>Which transport are you using in Mercury? Are you modifying its behavior with any environment variables?<br>
</li><li>How big is the RPC handler pool, and is there a dedicated progress thread for Margo?</li></ul>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0); background-color:rgb(255,255,255)">
thanks,</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0); background-color:rgb(255,255,255)">
-Phil<br>
</div>
<div id="x_appendonsend"></div>
<hr tabindex="-1" style="display:inline-block; width:98%">
<div id="x_divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" color="#000000" style="font-size:11pt"><b>From:</b> mochi-devel <mochi-devel-bounces@lists.mcs.anl.gov> on behalf of Sim, Hyogi <simh@ornl.gov><br>
<b>Sent:</b> Thursday, September 24, 2020 1:31 PM<br>
<b>To:</b> mochi-devel@lists.mcs.anl.gov <mochi-devel@lists.mcs.anl.gov><br>
<b>Subject:</b> [Mochi-devel] Margo handler hangs on flooded requests</font>
<div> </div>
</div>
<div style="word-wrap:break-word; line-break:after-white-space">Hello,
<div class=""><br class="">
</div>
<div class="">I am a part of the UnifyFS development team (<a href="https://github.com/LLNL/UnifyFS" class="">https://github.com/LLNL/UnifyFS</a>). We use the margo framework for our rpc communications.</div>
<div class=""><br class="">
</div>
<div class="">We are currently redesigning our metadata handling, and the new design includes broadcasting operations across server daemons. UnifyFS spawns one sever daemon per compute node. For broadcasting, we first build a binary tree of server ranks, rooted
by the rank who initiates the broadcasting, and then recursively forward the request to child nodes in the tree. When a server doesn't have a child (i.e., a leaf node), it will directly respond to its parent.</div>
<div class=""><br class="">
</div>
<div class="">While testing our implementation, we've found that the servers hang while handling rpc requests, especially when many broadcasting operations are triggered simultaneously. Although we cannot exactly identify the number of broadcasting operations/servers
from which it starts to fail, we can always reproduce with a sufficiently large scale, for instance, 48 nodes with ppn=16 (48 servers and 16 broadcasts per server). Our tests were performed on Summit and Summitdev.</div>
<div class=""><br class="">
</div>
<div class="">To isolate and debug the problem, we wrote a separate test program without the unifyfs codepath. Specifically, each server triggers a sum operation that adds up all ranks of servers using the broadcast operation. The following is the snippet of
the sum operation (Full code is at <a href="https://code.ornl.gov/hyogi/metasim/-/blob/master/server/src/metasim-rpc.c" class="">
https://code.ornl.gov/hyogi/metasim/-/blob/master/server/src/metasim-rpc.c</a>):</div>
<div class=""><br class="">
</div>
<div class="">---</div>
<div class=""><br class="">
</div>
<div class="">/*<br class="">
* sum rpc (broadcasting)<br class="">
*/<br class="">
<br class="">
static int sum_forward(metasim_rpc_tree_t *tree,<br class="">
metasim_sum_in_t *in, metasim_sum_out_t *out)<br class="">
{<br class="">
int ret = 0;<br class="">
int i;<br class="">
int32_t seed = 0;<br class="">
int32_t partial_sum = 0;<br class="">
int32_t sum = 0;<br class="">
int child_count = tree->child_count;<br class="">
int *child_ranks = tree->child_ranks;<br class="">
corpc_req_t *req = NULL;<br class="">
<br class="">
seed = in->seed;<br class="">
<br class="">
if (child_count == 0) {<br class="">
__debug("i have no child (sum=%d)", sum);<br class="">
goto out;<br class="">
}<br class="">
<br class="">
__debug("bcasting sum to %d children:", child_count);<br class="">
<br class="">
for (i = 0; i < child_count; i++)<br class="">
__debug("child[%d] = rank %d", i, child_ranks[i]);<br class="">
<br class="">
/* forward requests to children in the rpc tree */<br class="">
req = calloc(child_count, sizeof(*req));<br class="">
if (!req) {<br class="">
__error("failed to allocate memory for corpc");<br class="">
return ENOMEM;<br class="">
}<br class="">
<br class="">
for (i = 0; i < child_count; i++) {<br class="">
corpc_req_t *r = &req[i];<br class="">
int child = child_ranks[i];<br class="">
<br class="">
ret = corpc_get_handle(rpcset.sum, child, r);<br class="">
if (ret) {<br class="">
__error("corpc_get_handle failed, abort rpc");<br class="">
goto out;<br class="">
}<br class="">
<br class="">
ret = corpc_forward_request((void *) in, r);<br class="">
if (ret) {<br class="">
__error("corpc_forward_request failed, abort rpc");<br class="">
goto out;<br class="">
}<br class="">
}<br class="">
<br class="">
/* collect results */<br class="">
for (i = 0; i < child_count; i++) {<br class="">
metasim_sum_out_t _out;<br class="">
corpc_req_t *r = &req[i];<br class="">
<br class="">
ret = corpc_wait_request(r);<br class="">
if (ret) {<br class="">
__error("corpc_wait_request failed, abort rpc");<br class="">
goto out;<br class="">
}<br class="">
<br class="">
margo_get_output(r->handle, &_out);<br class="">
partial_sum = _out.sum;<br class="">
sum += partial_sum;<br class="">
<br class="">
__debug("sum from child[%d] (rank=%d): %d (sum=%d)",<br class="">
i, child_ranks[i], partial_sum, sum);<br class="">
<br class="">
margo_free_output(r->handle, &_out);<br class="">
margo_destroy(r->handle);<br class="">
}<br class="">
<br class="">
out:<br class="">
 free(req); /* req may still be NULL if we bailed out before the allocation */<br class="">
sum += metasim->rank + seed;<br class="">
out->sum = sum;<br class="">
<br class="">
return ret;<br class="">
}<br class="">
<br class="">
static void metasim_rpc_handle_sum(hg_handle_t handle)<br class="">
{<br class="">
int ret = 0;<br class="">
hg_return_t hret;<br class="">
metasim_rpc_tree_t tree;<br class="">
metasim_sum_in_t in;<br class="">
metasim_sum_out_t out;<br class="">
<br class="">
__debug("sum rpc handler");<br class="">
print_margo_handler_pool_info(metasim->mid);<br class="">
<br class="">
hret = margo_get_input(handle, &in);<br class="">
if (hret != HG_SUCCESS) {<br class="">
 __error("margo_get_input failed");<br class="">
 margo_destroy(handle); /* avoid leaking the handle on the error path */<br class="">
 return;<br class="">
}<br class="">
<br class="">
</div>
<div class=""> metasim_rpc_tree_init(metasim->rank, metasim->nranks, in.root, 2, &tree);<br class="">
<br class="">
ret = sum_forward(&tree, &in, &out);<br class="">
if (ret)<br class="">
__error("sum_forward failed");<br class="">
<br class="">
metasim_rpc_tree_free(&tree);<br class="">
margo_free_input(handle, &in);<br class="">
<br class="">
margo_respond(handle, &out);<br class="">
<br class="">
margo_destroy(handle);<br class="">
}<br class="">
DEFINE_MARGO_RPC_HANDLER(metasim_rpc_handle_sum)<br class="">
<br class="">
int metasim_rpc_invoke_sum(int32_t seed, int32_t *sum)<br class="">
{<br class="">
int ret = 0;<br class="">
int32_t _sum = 0;<br class="">
metasim_rpc_tree_t tree;<br class="">
metasim_sum_in_t in;<br class="">
metasim_sum_out_t out;<br class="">
<br class="">
ret = metasim_rpc_tree_init(metasim->rank, metasim->nranks, metasim->rank,<br class="">
2, &tree);<br class="">
if (ret) {<br class="">
__error("failed to initialize the rpc tree (ret=%d)", ret);<br class="">
return ret;<br class="">
}<br class="">
<br class="">
in.root = metasim->rank;<br class="">
in.seed = seed;<br class="">
<br class="">
ret = sum_forward(&tree, &in, &out);<br class="">
if (ret) {<br class="">
__error("sum_forward failed (ret=%d)", ret);<br class="">
} else {<br class="">
_sum = out.sum;<br class="">
__debug("rpc sum final result = %d", _sum);<br class="">
<br class="">
*sum = _sum;<br class="">
}<br class="">
<br class="">
metasim_rpc_tree_free(&tree);<br class="">
<br class="">
return ret;<br class="">
}<br class="">
</div>
<div class=""><br class="">
</div>
<div class="">---</div>
<div class=""><br class="">
</div>
<div class="">This sum operation runs successfully with 512 nodes/ppn=1 (512 broadcasting), but fails/hangs with 512 nodes/ppn=2 (1024 broadcasting), on Summit. From the log messages, we see a different pattern in the argobot rpc handler pool. In the following
plots, the blue line shows the size of the handler pool, and the red line is the number of blocked elements, each time when the sum handler is invoked. This result is from one server (out of 512), but other servers show a similar pattern. In the failing case
(2nd plot), we see that all ults become blocked. When the server hangs, the blocked count was 255.</div>
<div class=""><br class="">
</div>
<div class=""><br class="">
</div>
<div class=""><img id="x_x_7292AE3F-B69F-4E67-BBA2-B3836EACEE6B" class="" data-outlook-trace="F:2|T:2" src="cid:15561D05-0BE5-48A0-9A81-BC44B6384FFB"></div>
<div class=""><br class="">
</div>
<div class=""><br class="">
</div>
<div class=""><img id="x_x_948F40CA-5415-4E05-8941-7A2A89D4A2B3" class="" data-outlook-trace="F:2|T:2" src="cid:F5278F00-DD8D-4D6B-9804-BAF91AFEEDBD"></div>
<div class=""><br class="">
</div>
<div class=""><br class="">
</div>
<div class="">I am wondering if our way of implementing the broadcasting operation is problematic. Or, is there any parameter that we need to tune for handling a large number of requests? The margo versions that we have test with are v0.4.3 and v0.5.1.</div>
<div class=""><br class="">
</div>
<div class="">We suspect that somehow a leaf node cannot respond to its parent due to some resource exhaustion, which results in a deadlock.</div>
<div class=""><br class="">
</div>
<div class="">The full test code is at <a href="https://code.ornl.gov/hyogi/metasim" class="">https://code.ornl.gov/hyogi/metasim</a>, and the full log is at <a href="https://code.ornl.gov/hyogi/metasim/-/tree/master/logs" class="">https://code.ornl.gov/hyogi/metasim/-/tree/master/logs</a>.</div>
<div class=""><br class="">
</div>
<div class="">Any advice will be helpful. Thank you!</div>
<div class=""><br class="">
<div class="">Best,</div>
<div class="">Hyogi</div>
</div>
<div class=""><br class="">
</div>
</div>
</div>
</body>
</html>