[Mochi-devel] Margo handler hangs on flooded requests

Thu Sep 24 16:00:13 CDT 2020

Hi Hyogi,

I'm not going to lie, that's a daunting bug report 🙂  Thank you for all of the detailed information and reproducer!

Your theory of resource exhaustion triggering a deadlock seems likely.  Off the top of my head I'm not sure what the problematic resource would be, though.  Some random ideas: a limit on the number of preposted buffers (for incoming RPCs) may have been exhausted, or the handlers may be inadvertently blocking on something that is not Argobots-aware and clogging up the RPC pools.

I have some preliminary questions about the test environment (I apologize if this is covered in the repo; I have not dug into it yet):

  *   Exactly what metric are you showing in the graphs (as in, what Argobots function are you using to retrieve the information)?  Just making sure there is no ambiguity in the interpretation.  It looks like the graphs are showing suspended threads and threads that are eligible for execution.
  *   What versions of Mercury, Argobots, and Libfabric (if present) are you using?
  *   Which transport are you using in Mercury?  Are you modifying its behavior with any environment variables?
  *   How big is the RPC handler pool, and is there a dedicated progress thread for Margo?

thanks,
-Phil
________________________________
From: mochi-devel <mochi-devel-bounces at lists.mcs.anl.gov> on behalf of Sim, Hyogi <simh at ornl.gov>
Sent: Thursday, September 24, 2020 1:31 PM
To: mochi-devel at lists.mcs.anl.gov <mochi-devel at lists.mcs.anl.gov>
Subject: [Mochi-devel] Margo handler hangs on flooded requests

Hello,

I am a part of the UnifyFS development team (https://github.com/LLNL/UnifyFS). We use the margo framework for our rpc communications.

We are currently redesigning our metadata handling, and the new design includes broadcasting operations across server daemons. UnifyFS spawns one server daemon per compute node. For broadcasting, we first build a binary tree of server ranks, rooted at the rank that initiates the broadcast, and then recursively forward the request to child nodes in the tree. When a server doesn't have a child (i.e., a leaf node), it will directly respond to its parent.
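The tree construction described above can be sketched as follows. This is only an illustration, not the UnifyFS or metasim implementation: the helper name `tree_children` and the rotation scheme (mapping the initiating rank to relative position 0) are assumptions made for the example.

```c
/*
 * Hypothetical helper: compute the children of `rank` in a k-ary tree
 * over `nranks` servers, rooted at `root`.
 *
 * Ranks are rotated so that the root sits at relative position 0; the
 * children of relative position p are then p*k+1 .. p*k+k (if they exist).
 * Writes at most k child ranks into `children` and returns the count.
 * A return value of 0 means `rank` is a leaf.
 */
static int tree_children(int rank, int nranks, int root, int k, int *children)
{
    int rel = (rank - root + nranks) % nranks;  /* position relative to root */
    int count = 0;
    int i;

    for (i = 1; i <= k; i++) {
        int child_rel = rel * k + i;

        if (child_rel >= nranks)
            break;                              /* ran past the last server */

        /* rotate back to the real server rank */
        children[count++] = (child_rel + root) % nranks;
    }

    return count;
}
```

With 7 servers and a binary tree (k = 2) rooted at rank 3, this sketch gives rank 3 the children 4 and 5, rank 4 the children 6 and 0, and makes rank 6 a leaf.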

While testing our implementation, we've found that the servers hang while handling rpc requests, especially when many broadcasting operations are triggered simultaneously. Although we cannot identify the exact number of broadcasting operations or servers at which it starts to fail, we can always reproduce the hang at a sufficiently large scale, for instance, 48 nodes with ppn=16 (48 servers and 16 broadcasts per server). Our tests were performed on Summit and Summitdev.

To isolate and debug the problem, we wrote a separate test program without the unifyfs codepath. Specifically, each server triggers a sum operation that adds up all ranks of servers using the broadcast operation. The following is the snippet of the sum operation (Full code is at https://code.ornl.gov/hyogi/metasim/-/blob/master/server/src/metasim-rpc.c):

---

/*
 * sum rpc (broadcasting)
 */

static int sum_forward(metasim_rpc_tree_t *tree,
                       metasim_sum_in_t *in, metasim_sum_out_t *out)
{
    int ret = 0;
    int i;
    int32_t seed = 0;
    int32_t partial_sum = 0;
    int32_t sum = 0;
    int child_count = tree->child_count;
    int *child_ranks = tree->child_ranks;
    corpc_req_t *req = NULL;

    seed = in->seed;

    if (child_count == 0) {
        __debug("i have no child (sum=%d)", sum);
        goto out;
    }

    __debug("bcasting sum to %d children:", child_count);

    for (i = 0; i < child_count; i++)
        __debug("child[%d] = rank %d", i, child_ranks[i]);

    /* forward requests to children in the rpc tree */
    req = calloc(child_count, sizeof(*req));
    if (!req) {
        __error("failed to allocate memory for corpc");
        return ENOMEM;
    }

    for (i = 0; i < child_count; i++) {
        corpc_req_t *r = &req[i];
        int child = child_ranks[i];

        ret = corpc_get_handle(rpcset.sum, child, r);
        if (ret) {
            __error("corpc_get_handle failed, abort rpc");
            goto out;
        }

        ret = corpc_forward_request((void *) in, r);
        if (ret) {
            __error("corpc_forward_request failed, abort rpc");
            goto out;
        }
    }

    /* collect results */
    for (i = 0; i < child_count; i++) {
        metasim_sum_out_t _out;
        corpc_req_t *r = &req[i];

        ret = corpc_wait_request(r);
        if (ret) {
            __error("corpc_wait_request failed, abort rpc");
            goto out;
        }

        margo_get_output(r->handle, &_out);
        partial_sum = _out.sum;
        sum += partial_sum;

        __debug("sum from child[%d] (rank=%d): %d (sum=%d)",
                i, child_ranks[i], partial_sum, sum);

        margo_free_output(r->handle, &_out);
        margo_destroy(r->handle);
    }

out:
    /* release the request array; free(NULL) is a no-op on the leaf path */
    free(req);

    sum += metasim->rank + seed;
    out->sum = sum;

    return ret;
}

static void metasim_rpc_handle_sum(hg_handle_t handle)
{
    int ret = 0;
    hg_return_t hret;
    metasim_rpc_tree_t tree;
    metasim_sum_in_t in;
    metasim_sum_out_t out;

    __debug("sum rpc handler");
    print_margo_handler_pool_info(metasim->mid);

    hret = margo_get_input(handle, &in);
    if (hret != HG_SUCCESS) {
        __error("margo_get_input failed");
        margo_destroy(handle);  /* do not leak the handle on the error path */
        return;
    }

    metasim_rpc_tree_init(metasim->rank, metasim->nranks, in.root, 2, &tree);

    ret = sum_forward(&tree, &in, &out);
    if (ret)
        __error("sum_forward failed");

    metasim_rpc_tree_free(&tree);
    margo_free_input(handle, &in);

    margo_respond(handle, &out);

    margo_destroy(handle);
}
DEFINE_MARGO_RPC_HANDLER(metasim_rpc_handle_sum)

int metasim_rpc_invoke_sum(int32_t seed, int32_t *sum)
{
    int ret = 0;
    int32_t _sum = 0;
    metasim_rpc_tree_t tree;
    metasim_sum_in_t in;
    metasim_sum_out_t out;

    ret = metasim_rpc_tree_init(metasim->rank, metasim->nranks, metasim->rank,
                                2, &tree);
    if (ret) {
        __error("failed to initialize the rpc tree (ret=%d)", ret);
        return ret;
    }

    in.root = metasim->rank;
    in.seed = seed;

    ret = sum_forward(&tree, &in, &out);
    if (ret) {
        __error("sum_forward failed (ret=%d)", ret);
    } else {
       _sum = out.sum;
       __debug("rpc sum final result = %d", _sum);

       *sum = _sum;
    }

    metasim_rpc_tree_free(&tree);

    return ret;
}

---
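As a sanity check on the sum operation above: every server adds its own rank plus the seed exactly once, so a broadcast over n servers should return n(n-1)/2 + n*seed regardless of which rank is the root. The following is a minimal local sketch of that reasoning, with no RPCs involved. The function names `simulate_sum` and `expected_sum` are hypothetical, and the tree layout (k-ary, rotated so the root is at relative position 0) is an assumption rather than the actual metasim_rpc_tree_init() logic.

```c
#include <stdint.h>

/*
 * Hypothetical local stand-in for sum_forward(): recursively collect the
 * partial sums of the children of relative position `rel`, then add this
 * server's own rank and the seed, mirroring the final step in sum_forward().
 */
static int32_t simulate_sum(int rel, int nranks, int root, int k, int32_t seed)
{
    int32_t sum = 0;
    int i;

    for (i = 1; i <= k; i++) {
        int child_rel = rel * k + i;

        if (child_rel >= nranks)
            break;                      /* leaf, or no further children */

        sum += simulate_sum(child_rel, nranks, root, k, seed);
    }

    /* real rank of this position, plus the seed */
    sum += (int32_t)((rel + root) % nranks) + seed;

    return sum;
}

/* Closed form: sum of ranks 0..n-1, plus one seed per server. */
static int32_t expected_sum(int nranks, int32_t seed)
{
    return (int32_t)(nranks * (nranks - 1) / 2) + (int32_t)nranks * seed;
}
```

Calling `simulate_sum(0, nranks, root, 2, seed)` starts at the root. For 48 servers with seed 3, both functions yield 1272, which gives a reference value to compare against the result reported by metasim_rpc_invoke_sum() in a run that completes.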

This sum operation runs successfully with 512 nodes/ppn=1 (512 broadcasts), but fails/hangs with 512 nodes/ppn=2 (1024 broadcasts), on Summit. From the log messages, we see a different pattern in the Argobots rpc handler pool. In the following plots, the blue line shows the size of the handler pool, and the red line is the number of blocked elements, sampled each time the sum handler is invoked. This result is from one server (out of 512), but other servers show a similar pattern. In the failing case (2nd plot), we see that all ULTs become blocked. When the server hangs, the blocked count is 255.

[Plot 1: logs_success (size and blocked).png]

[Plot 2: logs_fail (size and blocked).png]

I am wondering if our way of implementing the broadcasting operation is problematic. Or, is there any parameter that we need to tune for handling a large number of requests? The margo versions that we have tested with are v0.4.3 and v0.5.1.

We suspect that somehow a leaf node cannot respond to its parent due to some resource exhaustion, which results in a deadlock.

The full test code is at https://code.ornl.gov/hyogi/metasim, and the full log is at https://code.ornl.gov/hyogi/metasim/-/tree/master/logs.

Any advice will be helpful. Thank you!

Best,
Hyogi

-------------- next part --------------
A non-text attachment was scrubbed...
Name: logs_success (size and blocked).png
Type: image/png
Size: 22250 bytes
Desc: logs_success (size and blocked).png
URL: <http://lists.mcs.anl.gov/pipermail/mochi-devel/attachments/20200924/312ded5d/attachment-0002.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: logs_fail (size and blocked).png
Type: image/png
Size: 14654 bytes
Desc: logs_fail (size and blocked).png
URL: <http://lists.mcs.anl.gov/pipermail/mochi-devel/attachments/20200924/312ded5d/attachment-0003.png>