<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Mon, May 4, 2015 at 7:07 AM, Justin Chang <span dir="ltr">&lt;<a href="mailto:jychang48@gmail.com" target="_blank">jychang48@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Jed,<div><br></div><div>Thanks for the reply. Not too long ago one of you (Matt, I think) mentioned the Roofline model, and I was hoping to emulate something like it for my application. If I understand the presentation slides (and the paper implementing it) correctly, the upper bound on FLOP/s is calculated by multiplying the STREAM bandwidth by the ratio of flops to DRAM bytes (aka arithmetic intensity). The workload (i.e., flops) can be counted via PetscLogFlops(), and in the paper the total bytes transferred for fmadd in the sparse matvec were counted manually. Since my program involves more than just matvec, I am curious whether there's a way to obtain the bytes for all operations and functions invoked. </div><div><br></div><div>Or if I really should go with what you had suggested, could you elaborate a little more on it, or point me to some papers/links/slides that talk about it? </div></blockquote><div><br></div><div>The best we can do here is an estimate (because of all the caveats that Jed points out). I suggest just counting by hand how many bytes come down from memory, just as we do for flops.</div><div><br></div><div>  Thanks,</div><div><br></div><div>    Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>Thanks,</div><div>Justin<div><div class="h5"><br><br>On Monday, May 4, 2015, Jed Brown &lt;<a>jed@jedbrown.org</a>&gt; wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Justin Chang &lt;<a>jychang48@gmail.com</a>&gt; writes:<br>

<br>

> Hello everyone,<br>

><br>

> If I wanted to obtain the bytes/second for my PETSc program, is there a<br>

> generic way of doing this? My initial thought would be to first run the<br>

> program with valgrind to obtain the total memory usage, and then run it<br>

> without valgrind to get the wall clock time. These two metrics then give<br>

> you the bytes/second.<br>

<br>

Not really, because usually we're interested in useful bandwidth<br>

sustained from some level of cache.  You can use hardware performance<br>

counters to measure the number of cache lines transferred, but this is<br>

usually an overestimate of the amount of useful data.  You really need a<br>

performance model for your application and a cache model for the machine<br>

to say what bandwidth is useful.<br>

<br>

> Or can PETSc manually count the loads/stores the way it's done for<br>

> flops?<br>

<br>

No, this information is not available in source code and would be nearly<br>

meaningless even if it were.<br>

<br>

> I was looking at the PetscMemXXX() functions but wasn't sure if this<br>

> is what I was looking for.<br>

><br>

> Thanks,<br>

> Justin<br>

</blockquote></div></div></div>

</blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature">What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div>
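<div><br></div><div>To illustrate the manual counting suggested above, here is a minimal Python sketch for a CSR sparse matvec, combined with the Roofline bound Justin describes (STREAM bandwidth times arithmetic intensity, capped by the peak flop rate). Everything in it is an assumption for illustration: the function names are hypothetical and not part of PETSc, the byte count assumes 8-byte values, 4-byte indices, and a perfect cache in which each array is moved exactly once, and the bandwidth and peak numbers are made up. Given Jed's caveats, treat the result as a rough estimate only.</div>
<pre>
```python
def csr_matvec_flops(nnz):
    """Flops for y = A*x in CSR: one multiply and one add per stored nonzero."""
    return 2 * nnz


def csr_matvec_bytes(nrows, ncols, nnz, value_bytes=8, index_bytes=4):
    """Optimistic byte estimate for y = A*x in CSR.

    Assumption (not measured): every array is moved from memory exactly once,
    i.e. the input vector x is perfectly reused from cache.
    """
    matrix = nnz * (value_bytes + index_bytes)  # nonzero values + column indices
    row_ptr = (nrows + 1) * index_bytes         # CSR row offsets
    x_read = ncols * value_bytes                # input vector, read once
    y_write = nrows * value_bytes               # output vector, written once
    return matrix + row_ptr + x_read + y_write


def roofline_bound(flops, bytes_moved, stream_bw, peak_flops):
    """Upper bound on flop/s: min(peak, bandwidth * arithmetic intensity)."""
    intensity = flops / bytes_moved             # flops per byte
    return min(peak_flops, stream_bw * intensity)


if __name__ == "__main__":
    # Hypothetical problem size and machine numbers, chosen only for illustration.
    nrows = ncols = 1_000_000
    nnz = 7_000_000
    flops = csr_matvec_flops(nnz)
    nbytes = csr_matvec_bytes(nrows, ncols, nnz)
    bound = roofline_bound(flops, nbytes, stream_bw=50e9, peak_flops=500e9)
    print(f"flops = {flops}, bytes = {nbytes}")
    print(f"arithmetic intensity = {flops / nbytes:.4f} flop/byte")
    print(f"roofline bound = {bound:.3e} flop/s")
```
</pre>
<div>The same bookkeeping would have to be repeated by hand for each kernel in the application (vector updates, dot products, and so on), with the per-kernel flops and bytes summed before computing the overall intensity. The flop side can be checked against what PetscLogFlops() reports; the byte side has no such check, so the bound is only as good as the cache assumption behind it.</div>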

</div></div>