<div dir="ltr"><div><div><div>Thank you guys for your responses. If I want to estimate the number of bytes that come down, would -memory_info give me that information?<br><br></div>And with this information plus the total number of logged flops, can I get the ratio of flops to bytes, and hence a (crude) estimate of the upper bound on FLOPS/s based on the reported stream BW?<br><br></div>Thanks,<br></div>Justin<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, May 4, 2015 at 11:07 AM, Jed Brown <span dir="ltr"><<a href="mailto:jed@jedbrown.org" target="_blank">jed@jedbrown.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">Justin Chang <<a href="mailto:jychang48@gmail.com">jychang48@gmail.com</a>> writes:<br>

<br>

> Hi Jed,<br>

><br>

> Thanks for the reply. Not too long ago one of you guys (Matt I think) had<br>

> mentioned the Roofline model and I was hoping to emulate something like it<br>

> for my application. If I understand the presentation slides (and the paper<br>

> implementing it) correctly, the upper bound FLOPS/s is calculated by<br>

> multiplying the stream BW by the ratio of DRAM flop to byte (aka arithmetic<br>

> intensity). The workload (i.e., flops) can be counted via PetscLogFlops()<br>

> and in the paper, the sparse matvec total bytes transferred for fmadd was<br>

> manually counted. Since my program involves more than just matvec I am<br>

> curious if there's a way to obtain the bytes for all operations and<br>

> functions invoked.<br>

<br>

</span>Counting "useful" data motion subject to some cache granularity is not<br>

automatic.  You can look at performance analysis of stencil operations<br>

for an example of what this can look like.  I go through examples in my<br>

class, but I do it interactively with experiments rather than off of<br>

slides.<br>
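<br>
As a rough illustration of the roofline arithmetic described in the quoted message (stream BW multiplied by the flop-to-byte ratio), here is a minimal Python sketch for a single CSR sparse matvec. It is not part of PETSc. The function names, the matrix sizes, the 20 GB/s stream bandwidth, the 100 GFLOPS/s peak, and the idealized cache model (matrix data read once, the input vector read once, the output vector written once) are all assumptions chosen for illustration.<br>
<pre>
```python
# Rough roofline estimate for a CSR sparse matrix-vector product y = A x.
# All sizes and machine figures below are made-up illustration values.

def csr_matvec_traffic(nrows, ncols, nnz, value_bytes=8, index_bytes=4):
    """Return (flops, bytes) for one y = A x under an idealized cache model.

    Assumes every matrix value, column index, and row pointer is read from
    memory once, x is read once (perfect reuse), and y is written once.
    """
    flops = 2 * nnz  # one multiply and one add per stored nonzero
    matrix_bytes = nnz * (value_bytes + index_bytes) + (nrows + 1) * index_bytes
    vector_bytes = ncols * value_bytes + nrows * value_bytes
    return flops, matrix_bytes + vector_bytes


def roofline_bound(flops, nbytes, stream_bw, peak_flops):
    """Attainable FLOPS/s = min(peak, arithmetic intensity * bandwidth)."""
    intensity = flops / nbytes  # flop per byte
    return min(peak_flops, intensity * stream_bw)


flops, nbytes = csr_matvec_traffic(nrows=1_000_000, ncols=1_000_000, nnz=7_000_000)
bound = roofline_bound(flops, nbytes, stream_bw=20e9, peak_flops=100e9)
print(f"arithmetic intensity: {flops / nbytes:.3f} flop/byte")
print(f"roofline bound:       {bound / 1e9:.2f} GFLOPS/s")
```
</pre>
The byte count here is a model, not a measurement: real traffic depends on cache-line granularity and on how well the input vector is reused, which is why the "useful" data motion has to be reasoned about per operation.<br>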

<div class="HOEnZb"><div class="h5"><br>

> Or if I really should go with what you had suggested, could you elaborate a<br>

> little more on it, or point me to some papers/links/slides that talk about<br>

> it?<br>

><br>

> Thanks,<br>

> Justin<br>

><br>

> On Monday, May 4, 2015, Jed Brown <<a href="mailto:jed@jedbrown.org">jed@jedbrown.org</a>> wrote:<br>

><br>

>> Justin Chang <<a href="mailto:jychang48@gmail.com">jychang48@gmail.com</a>> writes:<br>

>><br>

>> > Hello everyone,<br>

>> ><br>

>> > If I wanted to obtain the bytes/second for my PETSc program, is there a<br>

>> > generic way of doing this? My initial thought would be to first run the<br>

>> > program with valgrind to obtain the total memory usage, and then run it<br>

>> > without valgrind to get the wall clock time. These two metrics then give<br>

>> > you the bytes/second.<br>

>><br>

>> Not really, because usually we're interested in useful bandwidth<br>

>> sustained from some level of cache.  You can use hardware performance<br>

>> counters to measure the number of cache lines transferred, but this is<br>

>> usually an overestimate of the amount of useful data.  You really need a<br>

>> performance model for your application and a cache model for the machine<br>

>> to say what bandwidth is useful.<br>

>><br>

>> > Or can PETSc manually count the load/stores the way it's done for<br>

>> > flops?<br>

>><br>

>> No, this information is not available in source code and would be nearly<br>

>> meaningless even if it was.<br>

>><br>

>> > I was looking at the PetscMemXXX() functions but wasn't sure if this<br>

>> > is what I was looking for.<br>

>> ><br>

>> > Thanks,<br>

>> > Justin<br>

>><br>

</div></div></blockquote></div><br></div>