<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">Hi Richard. Some comments inline
below:<br>
<br>
On 07/09/2013 01:04 PM, Hedges, Richard M. wrote:<br>
</div>
<blockquote
cite="mid:05CADB00DB0C5441BDA26AA36D16DDC9299260A4@PRDEXMBX-05.the-lab.llnl.gov"
type="cite">
<div>
<div>
<div>Anyone out there deploying Darshan to profile all MPI
jobs on a big system? <br>
</div>
</div>
</div>
</blockquote>
<br>
You bet :) The current list:<br>
<br>
ANL IBM BG/Q systems: Mira, Cetus, and Vesta<br>
ANL IBM BG/P systems: Intrepid, Challenger, and Surveyor<br>
NERSC Cray XE6 system: Hopper<br>
<br>
You can find out a little more about the setup on those systems at
these pages:<br>
<br>
<a class="moz-txt-link-freetext" href="https://www.alcf.anl.gov/user-guides/darshan">https://www.alcf.anl.gov/user-guides/darshan</a><br>
<a class="moz-txt-link-freetext" href="https://www.alcf.anl.gov/user-guides/bgp-darshan">https://www.alcf.anl.gov/user-guides/bgp-darshan</a><br>
<a class="moz-txt-link-freetext" href="http://www.nersc.gov/users/software/debugging-and-profiling/darshan/">http://www.nersc.gov/users/software/debugging-and-profiling/darshan/</a><br>
<br>
<br>
<blockquote
cite="mid:05CADB00DB0C5441BDA26AA36D16DDC9299260A4@PRDEXMBX-05.the-lab.llnl.gov"
type="cite">
<div>
<div>
<div>What are your experiences? </div>
</div>
</div>
</blockquote>
<br>
Those deployments run smoothly now, although of course I'm
biased :-) Most of the effort goes into the initial setup
and testing to confirm that there isn't any unexpected overhead,
and at this point we've done quite a bit to make sure that's not a
problem. I don't know exactly what the coverage rate is on Hopper
right now, but in general I wouldn't expect more than 50% coverage
in the long run, for various reasons (apps that don't use MPI, apps
that were precompiled without Darshan, or apps that simply never
reach MPI_Finalize()).<br>
<br>
<blockquote
cite="mid:05CADB00DB0C5441BDA26AA36D16DDC9299260A4@PRDEXMBX-05.the-lab.llnl.gov"
type="cite">
<div>
<div>
<div> How much effort is required to keep reviewing the
incoming data? <br>
</div>
</div>
</div>
</blockquote>
<br>
This is an open question. We've done some work at NERSC on
identifying interesting jobs based on various metrics (see the
CUG'13 paper), and some work at ALCF on storing results in a
database for ad-hoc queries, but neither has quite evolved into
a full-time system. NERSC does have a mechanism for generating some
automatic summaries for end users in their web portal, but you still
have to know to go look at it. <br>
<br>
At ANL, the way it works in practice is that we go to the Darshan
logs when someone reports an I/O problem or when we suspect one
in a specific application. We don't have any automated
mechanism for reporting statistics from the log data.<br>
<br>
I would like there to be a nice, general-purpose analysis tool
for scanning Darshan logs and reporting information from the data
(even if it just produced summaries you could email out from a cron
job daily or weekly), but unfortunately that doesn't exist yet.<br>
<br>
<blockquote
cite="mid:05CADB00DB0C5441BDA26AA36D16DDC9299260A4@PRDEXMBX-05.the-lab.llnl.gov"
type="cite">
<div>
<div>
<div>Have you solved or identified any problems with Darshan?
Problems w apps? Problems w the system?</div>
</div>
</div>
</blockquote>
<br>
Quite a few at this point, actually. In our case it has mostly been
app problems. There are some examples on the Darshan publications
page, and I can think of a few more that I could tell you about
off-list if you are interested. I am not aware of any system
problems that have been discovered with Darshan yet. <br>
<br>
-Phil<br>
<br>
<blockquote
cite="mid:05CADB00DB0C5441BDA26AA36D16DDC9299260A4@PRDEXMBX-05.the-lab.llnl.gov"
type="cite">
<div>
<div>
<div><br>
</div>
<div>Thanks,</div>
<div>- Richard</div>
<div>
<div><br>
</div>
<div>====================================================</div>
<div><br>
</div>
<div>Richard Hedges</div>
<div>Customer Support and Test - File Systems Project</div>
<div>Development Environment Group - Livermore Computing</div>
<div>Lawrence Livermore National Laboratory</div>
<div>7000 East Avenue, MS L-557</div>
<div>Livermore, CA 94551</div>
<div><br>
</div>
<div>v: (925) 423-2699</div>
<div>f: (925) 423-6961</div>
<div>E: <a class="moz-txt-link-abbreviated" href="mailto:richard-hedges@llnl.gov">richard-hedges@llnl.gov</a></div>
</div>
</div>
</div>
<div><br>
</div>
<br>
<pre wrap="">_______________________________________________
Darshan-users mailing list
<a class="moz-txt-link-abbreviated" href="mailto:Darshan-users@lists.mcs.anl.gov">Darshan-users@lists.mcs.anl.gov</a>
<a class="moz-txt-link-freetext" href="https://lists.mcs.anl.gov/mailman/listinfo/darshan-users">https://lists.mcs.anl.gov/mailman/listinfo/darshan-users</a>
</pre>
</blockquote>
<br>
</body>
</html>