[Darshan-users] Full time deployment
Phil Carns
carns at mcs.anl.gov
Tue Jul 9 14:51:21 CDT 2013
Hi Richard. Some comments inline below:
On 07/09/2013 01:04 PM, Hedges, Richard M. wrote:
> Anyone out there deploying Darshan to profile all MPI jobs on a big
> system?
You bet :) The current list:
ANL IBM BG/Q systems: Mira, Cetus, and Vesta
ANL IBM BG/P systems: Intrepid, Challenger, and Surveyor
NERSC Cray XE6 system: Hopper
You can find out a little more about the setup on those systems at these
pages:
https://www.alcf.anl.gov/user-guides/darshan
https://www.alcf.anl.gov/user-guides/bgp-darshan
http://www.nersc.gov/users/software/debugging-and-profiling/darshan/
> What are your experiences?
Those deployments are pretty smooth now, although of course I'm pretty
biased :-) Most of the effort is just in the initial setup and testing
to make sure that there isn't any unexpected overhead, but at this point
we've done quite a bit to make sure that's not a problem. I don't know
exactly what the coverage rate is on Hopper right now, but in general I
wouldn't expect more than 50% coverage in the long run for various
reasons (apps that don't use MPI, apps that were precompiled without
Darshan, or apps that simply don't run to MPI_Finalize()).
> How much effort is required to keep reviewing the incoming data?
This is an open question. We've done some work in identifying
interesting jobs based on various metrics at NERSC (see CUG'13 paper)
and done some work on storing results in a database for ad-hoc queries
at ALCF, but neither of these has quite evolved into a full-time
system. NERSC does have a mechanism for generating some automatic
summaries for end users in their web portal, but you still have to know
to go look at it.
At ANL the way it works in practice is that we go to the Darshan logs
when someone reports an I/O problem or if we suspect an I/O problem in a
specific application. We don't have any automated mechanism for
reporting statistics from the log data.
I would like for there to be a nice, general-purpose analysis tool for
scanning Darshan logs and reporting information from the data (even if
it just made summaries you could email out of a cron job daily or
weekly), but unfortunately that doesn't exist yet.
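As a rough illustration of the kind of daily or weekly digest described
above, here is a minimal Python sketch. It is not an existing Darshan
tool. It assumes you have already captured the text printed by
darshan-parser for each log, and that the job header appears as
"# key: value" lines with a "uid" field; check that against the output of
your Darshan version. The names parse_header and summarize are
hypothetical.

```python
import collections


def parse_header(parser_output):
    """Collect '# key: value' header lines from darshan-parser-style text.

    Assumption: job metadata is printed as comment lines such as
    '# uid: 1001'. Lines without that shape are ignored.
    """
    header = {}
    for line in parser_output.splitlines():
        if not line.startswith("# ") or ":" not in line:
            continue
        key, _, value = line[2:].partition(":")
        # Keep the first occurrence of each key.
        header.setdefault(key.strip(), value.strip())
    return header


def summarize(headers):
    """Build a short plain-text digest from a list of parsed headers."""
    jobs_per_uid = collections.Counter(
        h.get("uid", "unknown") for h in headers
    )
    lines = ["Darshan log digest: %d job(s)" % len(headers)]
    for uid, count in sorted(jobs_per_uid.items()):
        lines.append("  uid %s: %d job(s)" % (uid, count))
    return "\n".join(lines)
```

A cron job could run darshan-parser on each new log, pass the captured
text through parse_header, and mail the string returned by summarize.
Counting jobs per uid is only a placeholder metric; any header or counter
field could be aggregated the same way.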
> Have you solved or identified any problems with Darshan? Problems w
> apps? Problems w the system?
Quite a few at this point, actually. In our case it has mostly been app
problems. There are some examples on the Darshan publications page, and
I can think of a few more that I could tell you about off-list if you
are interested. I am not aware of any system problems that have been
discovered with Darshan yet.
-Phil
>
> Thanks,
> - Richard
>
> ====================================================
>
> Richard Hedges
> Customer Support and Test - File Systems Project
> Development Environment Group - Livermore Computing
> Lawrence Livermore National Laboratory
> 7000 East Avenue, MS L-557
> Livermore, CA 94551
>
> v: (925) 423-2699
> f: (925) 423-6961
> E: richard-hedges at llnl.gov
>
>
>
> _______________________________________________
> Darshan-users mailing list
> Darshan-users at lists.mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/darshan-users