[Darshan-users] Full time deployment

Phil Carns carns at mcs.anl.gov
Tue Jul 9 14:51:21 CDT 2013


Hi Richard.  Some comments inline below:

On 07/09/2013 01:04 PM, Hedges, Richard M. wrote:
> Anyone out there deploying Darshan to profile all MPI jobs on a big 
> system?

You bet :)  The current list:

     ANL IBM BG/Q systems: Mira, Cetus, and Vesta
     ANL IBM BG/P systems: Intrepid, Challenger, and Surveyor
     NERSC Cray XE6 system: Hopper

You can find out a little more about the setup on those systems at these 
pages:

     https://www.alcf.anl.gov/user-guides/darshan
     https://www.alcf.anl.gov/user-guides/bgp-darshan
     http://www.nersc.gov/users/software/debugging-and-profiling/darshan/


> What are your experiences?

Those deployments are pretty smooth now, although of course I'm pretty 
biased :-)  Most of the effort is just in the initial setup and testing 
to make sure that there isn't any unexpected overhead, but at this point 
we've done quite a bit to make sure that's not a problem.  I don't know 
exactly what the coverage rate is on Hopper right now, but in general I 
wouldn't expect more than 50% coverage in the long run for various 
reasons (apps that don't use MPI, apps that were precompiled without 
Darshan, or apps that simply don't run to MPI_Finalize()).
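
To make "coverage" concrete, here is a minimal sketch that treats it as the 
fraction of scheduled jobs that produced a Darshan log.  That definition, the 
function name, and the idea of comparing two lists of job IDs are assumptions 
for illustration only; this is not how any of the sites above actually 
measure it.

```python
def coverage_rate(scheduler_job_ids, darshan_job_ids):
    """Return the fraction of scheduled jobs that also produced a Darshan log.

    Both arguments are iterables of job IDs (hypothetical inputs: one list
    from the batch scheduler, one extracted from the Darshan log directory).
    """
    scheduled = set(scheduler_job_ids)
    if not scheduled:
        # No jobs ran, so there is nothing to cover.
        return 0.0
    # Only count logs that correspond to a job the scheduler knows about.
    covered = scheduled & set(darshan_job_ids)
    return len(covered) / len(scheduled)


if __name__ == "__main__":
    # Four scheduled jobs, two of which left a log behind.
    print(coverage_rate([101, 102, 103, 104], [102, 104, 999]))
```

Jobs that never reach MPI_Finalize() or never link against Darshan simply 
show up as scheduled job IDs with no matching log.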

>  How much effort is required to keep reviewing the incoming data?

This is an open question.  We've done some work in identifying 
interesting jobs based on various metrics at NERSC (see CUG'13 paper) 
and done some work on storing results in a database for ad-hoc queries 
at ALCF, but neither of these has quite evolved into a full-time 
system.  NERSC does have a mechanism for generating some automatic 
summaries for end users in their web portal, but you still have to know 
to go look at it.

At ANL the way it works in practice is that we go to the Darshan logs 
when someone reports an I/O problem or if we suspect an I/O problem in a 
specific application.  We don't have any automated mechanism for 
reporting statistics from the log data.

I would like for there to be a nice, general-purpose analysis tool for 
scanning Darshan logs and reporting information from the data (even if 
it just made summaries you could email out of a cron job daily or 
weekly), but unfortunately that doesn't exist yet.
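
As a rough idea of the simplest form such a summary could take, here is a 
sketch that only counts log files per user and formats a plain-text report 
that a cron job could mail out.  It does not parse log contents at all.  It 
assumes that log file names begin with the user name followed by an 
underscore and end in ".darshan.gz"; both the naming assumption and the 
function names are hypothetical and would need checking against the actual 
deployment.

```python
import os
from collections import Counter


def count_logs_by_user(log_root, suffix=".darshan.gz"):
    """Walk log_root and count log files per user.

    Assumption: each file name starts with "<user>_" and ends with suffix.
    """
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(log_root):
        for name in filenames:
            if not name.endswith(suffix):
                continue  # skip anything that is not a log file
            user = name.split("_", 1)[0]
            counts[user] += 1
    return counts


def format_summary(counts):
    """Render the counts as plain text, busiest users first."""
    lines = ["Darshan logs by user:"]
    for user, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        lines.append(f"  {user}: {n}")
    lines.append(f"  total: {sum(counts.values())}")
    return "\n".join(lines)


if __name__ == "__main__":
    import sys

    # Usage: python summarize_logs.py /path/to/darshan-logs
    print(format_summary(count_logs_by_user(sys.argv[1])))
```

A real tool would go further and pull per-job I/O metrics out of each log, 
but even a count like this shows at a glance who is generating data.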

> Have you solved or identified any problems with Darshan?  Problems w 
> apps?  Problems w the system?

Quite a few at this point, actually.  In our case it has mostly been app 
problems.  There are some examples on the Darshan publications page, and 
I can think of a few more that I could tell you about off-list if you 
are interested.  I am not aware of any system problems that have been 
discovered with Darshan yet.

-Phil

>
> Thanks,
> - Richard
>
> ====================================================
>
> Richard Hedges
> Customer Support and Test - File Systems Project
> Development Environment Group - Livermore Computing
> Lawrence Livermore National Laboratory
> 7000 East Avenue, MS L-557
> Livermore, CA    94551
>
> v:    (925) 423-2699
> f:    (925) 423-6961
> E:    richard-hedges at llnl.gov
>
