[Mochi-devel] An overview of application usage scenarios

Carns, Philip H. carns at mcs.anl.gov
Fri Jun 28 08:37:17 CDT 2019


Hi Srinivasan,

Re: 3., there is a PR here that Mathieu and I have been iterating on:

     https://xgitlab.cels.anl.gov/sds/margo/merge_requests/16

... and you can also install it as part of a spack install by using the @develop-rpc-breadcrumb version of margo.

     spack install mobject ^margo@develop-rpc-breadcrumb

I don't have documentation yet (this week got away from me), but the short version is that you can activate it by calling margo_diag_start() and then display simple cumulative statistics about all RPCs issued by that margo instance by calling margo_diag_dump().  The programs in the examples/ subdirectory do this, so you can see a quick example there.  Each RPC type gets its own min/max/avg/count counters, and RPC identifiers are recorded as a stack so that you get separate counters for RPC x() depending on whether it was triggered by RPC foo() or RPC bar().  This is important for our composition model because there may be multiple ways to trigger an RPC on an internal component.

The output is not user-friendly yet; this PR is just experimenting with an instrumentation mechanism rather than with storage or analysis of that instrumentation.  The internal margo_breadcrumb_measure() function is where the cumulative counters are updated right now.

The thing that's Dapper-like about it is that it times RPCs and chains of RPCs.  The thing that's un-Dapper-like is that it accumulates aggregate statistics rather than storing individual measurements, and there is no sampling mechanism.

thanks,
-Phil

________________________________
From: mochi-devel <mochi-devel-bounces at lists.mcs.anl.gov> on behalf of Dorier, Matthieu via mochi-devel <mochi-devel at lists.mcs.anl.gov>
Sent: Friday, June 28, 2019 7:44 AM
To: Srinivasan Ramesh; mochi-devel at lists.mcs.anl.gov
Subject: Re: [Mochi-devel] An overview of application usage scenarios

Hi Srinivasan,

Glad that the tutorials were useful! Regarding your questions:

1.a: HEPnOS is intended to be deployed separately from the application. The physicists we are working with deploy it in a set of containers. We envision leaving it running for the duration of an experimental campaign (which may be weeks to months). FlameStore is supposed to act like a cache for a single workflow. It lives for the duration of the workflow. As for SDSDKV, I'm not too familiar with that one but I think it's supposed to be deployed for the duration of an application as well.

1.b: HEPnOS is not part of the application (though it could be deployed as part of it). FlameStore is part of the application (same MPI_COMM_WORLD). For SDSDKV I don't know.

1.c: Yes, HEPnOS is intended to be long-running.

2: I wish we had such a code; our physicist colleagues are working on it right now.

Thanks,

Matthieu

On 27/06/2019, 17:31, "mochi-devel on behalf of Srinivasan Ramesh via mochi-devel" <mochi-devel-bounces at lists.mcs.anl.gov on behalf of mochi-devel at lists.mcs.anl.gov> wrote:

    Hi team,

    @Matthieu: Thanks for the wiki tutorial for Mochi. I found it extremely
    useful for my understanding and tried out the hands-on tutorials.

    I re-read the PDSW "Methodology for rapid development..." paper and
    installed HEPnOS locally on my laptop. A few questions come to mind:

    1. For each of the popular data-services mentioned in the paper
    (FlameStore, HEPnOS, SDSDKV), what is the model of usage/topology?
    Specifically:
        a. Are these services part of a workflow? Meaning, a node allocation
    is managed, and the services are long-running for the duration of the
    workflow. Jobs within the workflow come and go, and use the service
    during their execution.
        b. Are these services part of the application itself? Meaning a
    "regular" MPI job where the service is built into each MPI process and
    loaded as a library local to the process.
        c. Is it possible that certain services are long-running on the
    system "forever" (reduces to (a) I guess?)

    The methodology paper hints at the topology but doesn't really provide a
    concrete description. With regard to performance measurement,
    I am fully aware that data-services can span the entire range of
    possibilities. However, I think it may not be a bad idea to start with
    specific scenarios in mind and then generalize to broader cases once we
    have a grasp on the problem.

    2. Can I get access to a high-energy physics code that actually uses the
    HEPnOS service? Can I run this setup on my laptop?

    3. @Phil: I remember you mentioning that you had a branch where you had
    developed a Dapper-like request tracing infrastructure? Could you kindly
    point me to this?

    Regards,
    --
    Srinivasan Ramesh
    _______________________________________________
    mochi-devel mailing list
    mochi-devel at lists.mcs.anl.gov
    https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel
    https://www.mcs.anl.gov/research/projects/mochi

