[Darshan-users] getting plots
Phil Carns
carns at mcs.anl.gov
Wed Mar 16 12:58:45 CDT 2016
On 03/15/2016 03:22 PM, Burlen Loring wrote:
> Shane, Phil, these commands are all working fine. Thanks
>
> One thing I'd like to do is separate the open from the write. The
> file-list-detailed option may give me what I need.
>
> Are the values in the columns timestamps? So I can estimate write time
> by "<end_close> - <start_write>",
Hi Burlen,
You have at least three options, just to list them out for completeness and
maybe give you some more values to sanity-check against:
<slowest>: a cumulative counter of how much time the slowest rank (there
is just one rank per file in your case) spent doing any kind of I/O (read
or write) or metadata (open or close) operation.

<end_write> - <start_write>: elapsed time between when the first write was
issued and when the last write completed. Similar to the above, but it
will (for better or worse) include any time the process spent doing other
things in between successive write() calls; it does not include open or
close time.

<end_close> - <start_write>: similar to the above, except that it also
includes close time. That's arguably a valid thing to look at as well if
the file system elected to wait until close time to flush data.
> and open time by "<start_write> -
> <start_open>", columns?
Yes, that is a reasonable guess, especially if you have access to the
source code and can double check that the app isn't doing anything else
between the open and the write.
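If it helps, here is a rough, untested awk sketch (not an official Darshan
utility) for turning the --file-list-detailed output into per-file
estimates of those deltas. It finds the timestamp columns by their header
labels; if the header and data fields don't line up in your Darshan
version (e.g. because a label contains a space), hard-code the field
numbers instead:

$ darshan-parser --file-list-detailed \
    loring_oscillator_id1336628_3-14-37278-8825292184016672560_1.darshan | awk '
  /<start_open>/ {                 # header line: find timestamp columns by label
      for (i = 1; i <= NF; i++) {
          if ($i == "<start_open>")  so = i
          if ($i == "<start_write>") sw = i
          if ($i == "<end_write>")   ew = i
          if ($i == "<end_close>")   ec = i
      }
      next
  }
  so && $0 !~ /^#/ {               # data lines: per-file open/write estimates
      printf "%s open=%.6f write=%.6f write+close=%.6f\n", $1, $sw - $so, $ew - $sw, $ec - $sw
  }'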
>
> I didn't see documentation for those columns and want to make sure I
> have it right before embarking on a wild goose chase.
>
> <start_open>  <start_read>  <start_write>  <end_read>  <end_write>  <end_close>
> 0.770823      0.000000      1.602510       0.000000    1.607319     1.608949
Sorry about the lack of documentation; you are in top secret command
line option territory :)
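For the row you pasted, taking the timestamps as seconds relative to the
start of the job, those estimates come out to roughly:

open        = <start_write> - <start_open>  = 1.602510 - 0.770823 ~ 0.83 s
write       = <end_write>   - <start_write> = 1.607319 - 1.602510 ~ 0.005 s
write+close = <end_close>   - <start_write> = 1.608949 - 1.602510 ~ 0.006 s

so for that particular file nearly all of the time went into the open.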
thanks,
-Phil
> Burlen
>
> On 03/15/2016 10:48 AM, Phil Carns wrote:
>> One thing that you can do (not sure if it would be helpful in this
>> case) is filter a Darshan log file down so that it only includes
>> instrumentation data for a single file, and then run
>> darshan-job-summary.pl on just that single-file view. If you want to
>> try that, you can do the following:
>>
>> $ darshan-parser --file-list
>> loring_oscillator_id1336628_3-14-37278-8825292184016672560_1.darshan
>> | head -n 75
>>
>> # I'm just picking a file at random from the output, but this example is for
>> # /global/cscratch1/sd/loring/sensei/fpp/10k/PosthocIO_5.vtmb
>> # so I'm using its corresponding hash value. The following command will
>> # write a new darshan log file that strips away everything except for
>> # that one file:
>>
>> $ darshan-convert --file 13503923528039498363
>> loring_oscillator_id1336628_3-14-37278-8825292184016672560_1.darshan
>> onefile.darshan
>>
>> $ darshan-job-summary.pl onefile.darshan
>>
>> The resulting PDF is generated almost instantly and is easy to open,
>> but it doesn't tell you anything about the I/O except what happened to
>> that one file. It might be helpful in some cases, though.
>>
>> You can also use the following to get a text dump of the cumulative
>> statistics across all files (which also runs pretty quickly):
>>
>> $ darshan-parser --total
>> loring_oscillator_id1336628_3-14-37278-8825292184016672560_1.darshan
>>
>> Unfortunately that output is presented in text format instead of
>> producing another darshan log that could then be visualized with
>> darshan-job-summary.pl, but maybe that is something we could consider
>> in a future version.
>>
>> thanks,
>> -Phil
>>
>> On 03/15/2016 11:18 AM, Burlen Loring wrote:
>>> I let darshan-job-summary.pl run all night; it's still going, but with
>>> no indication of progress.
>>>
>>> This is my first experience with Darshan, so let me ask a naive
>>> question: is it possible to extract a time series for a single process?
>>> Write bandwidth over time, for instance? And time spent in file open
>>> (or close) over time?
>>>
>>> Thanks for all your help
>>> Burlen
>>>
>>> On 03/14/2016 09:53 PM, Burlen Loring wrote:
>>>> Yes, you are correct, it's file-per-process on 6496 processes, and the
>>>> simulation runs for 100 time steps, plus there are some header files
>>>> and directories created (I think by rank 0). It doesn't seem like too
>>>> extreme a case to me. We will also run 50k cores for 100 time steps.
>>>> It sounds like Darshan can't analyze this type of I/O, but please let
>>>> me know if you have any ideas!
>>>>
>>>> On the size discrepancy: my fault. Darshan had the size correct; I was
>>>> looking at the wrong output file. 200G is the size of the smaller run
>>>> (812 procs). I apologize that I didn't notice that sooner!
>>>>
>>>> On 03/14/2016 08:55 PM, Shane Snyder wrote:
>>>>> Maybe the reason the job summary graphs are hanging is the number of
>>>>> files the application is opening? It looks like there
>>>>> are over 500,000 files (100 each for 6,496 processes). I haven't
>>>>> tried generating graphs for any logs that large myself, but that
>>>>> might be beyond what the graphing utilities can realistically handle.
>>>>> It takes forever for me to even parse the logs in text form.
>>>>>
>>>>> As for the discrepancy in size, that may just be due to what the 'du'
>>>>> utility is actually reporting. 'du' measures the size of a given file
>>>>> based on the underlying file system block size. If the file is 1
>>>>> byte, and the block size is 1 MiB, the file is reported as 1 MiB.
>>>>> Additionally, if you run 'du' on a directory containing numerous
>>>>> subdirectories (as you have, 100 subdirectories), it counts the sizes
>>>>> of the directories as well. Darshan will only report the I/O observed
>>>>> at the application level, so it does not account for file system
>>>>> blocks or directories. You can use 'du -b' to show the "actual" size
>>>>> (i.e., not rounded up to block sizes) of individual files, though it
>>>>> still counts subdirectory sizes when determining the size of a given
>>>>> directory. If you do that, is it closer to what Darshan reports?
>>>>>
>>>>> --Shane
>>>>>
>>>>> On 03/14/2016 06:44 PM, Burlen Loring wrote:
>>>>>> sure, here is the link
>>>>>> https://drive.google.com/open?id=0B3y5yyus32lveHljWkExal9TVmM
>>>>>>
>>>>>> On 03/14/2016 03:56 PM, Shane Snyder wrote:
>>>>>>> Hi Burlen,
>>>>>>>
>>>>>>> Would you mind sharing your Darshan log with us? If you prefer, you
>>>>>>> can send it to me off-list, or if it contains sensitive information
>>>>>>> we can give you details on how to anonymize parts of it (e.g., file
>>>>>>> names, etc.).
>>>>>>>
>>>>>>> I don't know for sure the historical reason the "(may be
>>>>>>> incorrect)" caveat is given with the total bytes read and written.
>>>>>>> Someone correct me if I'm wrong, but I suspect it is there to warn
>>>>>>> against the possibility that the code actually read/wrote more data
>>>>>>> than expected from the application's point of view. For instance,
>>>>>>> data sieving, an I/O optimization at the MPI-IO layer, can result
>>>>>>> in more data being read than the application requested in order to
>>>>>>> improve performance. That shouldn't account for the drastic
>>>>>>> discrepancy you are seeing, though, so perhaps something else is up.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> --Shane
>>>>>>>
>>>>>>> On 03/14/2016 05:29 PM, Burlen Loring wrote:
>>>>>>>> Hi, I'd like to analyze our runs with Darshan. I'm able to get the
>>>>>>>> log files, but so far no luck plotting them.
>>>>>>>>
>>>>>>>> In the terminal after a while I see the following output, but then
>>>>>>>> the program appears to hang. After ~20 min of no output and no
>>>>>>>> evidence of it running in top, I killed it, and I didn't see any
>>>>>>>> newly created files.
>>>>>>>>
>>>>>>>> I'm also wondering about the total bytes report and the warning
>>>>>>>> that it may be wrong. It does indeed seem way off: du reports
>>>>>>>> 1.6T, but Darshan only reports ~200G.
>>>>>>>>
>>>>>>>> Please let me know what I did wrong, and whether I should be
>>>>>>>> concerned about the numbers being so far off!
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Burlen
>>>>>>>>
>>>>>>>> $ /work/apps/darshan/3.0.0-pre/bin/darshan-job-summary.pl
>>>>>>>> loring_oscillator_id1336621_3-14-37256-5315836542621785504_1.darshan
>>>>>>>>
>>>>>>>> Slowest unique file time: 25.579892
>>>>>>>> Slowest shared file time: 0
>>>>>>>> Total bytes read and written by app (may be incorrect):
>>>>>>>> 214218545937
>>>>>>>> Total absolute I/O time: 25.579892
>>>>>>>> **NOTE: above shared and unique file times calculated using MPI-IO
>>>>>>>> timers if MPI-IO interface used on a given file, POSIX timers
>>>>>>>> otherwise.