<div dir="ltr">Thank you very much, Junchao!<div><br></div><div>Most of these tools are developed for Linux, and at this time I am mainly interested in code for Windows.</div><div>I found this thread very informative:</div><div>
<p class="gmail-p1" style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12px;line-height:normal;font-family:"Helvetica Neue";color:rgb(220,161,13)"><a href="https://stackoverflow.com/questions/34641644/is-there-a-windows-equivalent-of-the-linux-command-perf-stat">https://stackoverflow.com/questions/34641644/is-there-a-windows-equivalent-of-the-linux-command-perf-stat</a></p><p class="gmail-p1" style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12px;line-height:normal;font-family:"Helvetica Neue";color:rgb(220,161,13)"><br></p><p class="gmail-p1" style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12px;line-height:normal;font-family:"Helvetica Neue";color:rgb(220,161,13)">Thanks,</p></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Dec 3, 2020 at 8:58 PM Junchao Zhang <<a href="mailto:junchao.zhang@gmail.com">junchao.zhang@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr">You can try HPCToolkit (<a href="http://hpctoolkit.org/" target="_blank">http://hpctoolkit.org/</a>), TAU (<a href="https://www.cs.uoregon.edu/research/tau/home.php" target="_blank">https://www.cs.uoregon.edu/research/tau/home.php</a>), or Intel VTune. 
But for each, you need to read its manual to learn it.</div><div dir="ltr"><br clear="all"><div><div dir="ltr"><div dir="ltr">--Junchao Zhang</div></div></div><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Dec 3, 2020 at 5:29 PM C B <<a href="mailto:cebau.mail@gmail.com" target="_blank">cebau.mail@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Barry,<div><br><div>Thank you so much for your quick reply and insight.</div><div><br></div><div>Are there any tools or simple ways to determine how much time is lost to cache misses and the like? Please direct me to any resources for learning about this.</div><div><br></div><div>Thanks again!</div><div><br></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Dec 3, 2020 at 4:09 PM Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank">bsmith@petsc.dev</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><br><div><br><blockquote type="cite"><div>On Dec 3, 2020, at 2:25 PM, C B <<a href="mailto:cebau.mail@gmail.com" target="_blank">cebau.mail@gmail.com</a>> wrote:</div><br><div><div dir="ltr"><p class="MsoNormal" style="margin:0in 0in 8pt;line-height:107%;font-size:11pt;font-family:Calibri,sans-serif">Resorting to your expertise in software performance:</p><p class="MsoNormal" style="margin:0in 0in 8pt;line-height:107%;font-size:11pt;font-family:Calibri,sans-serif">Subject: Looking for a crude assessment of CPU speed or DRAM
speed bottlenecks in shared memory multi-core PCs</p><p class="MsoNormal" style="margin:0in 0in 8pt;line-height:107%;font-size:11pt;font-family:Calibri,sans-serif">On a typical PC with one Xeon CPU (8 cores), a serial code runs a case in say 10 hours of
wall time, and on the same computer 4 instances of the same code running simultaneously
(the same case) take essentially the same wall time: 10 hrs, or marginally
more, such as 10 hrs 30 mins. There is
no I/O, there is plenty of free physical RAM, and each core running an instance shows ~100%
utilization.</p><p class="MsoNormal" style="margin:0in 0in 8pt;line-height:107%;font-size:11pt;font-family:Calibri,sans-serif">Q1: What could we conclude about this hardware-software-case
combination in terms of being CPU bound, memory bandwidth bound, etc.?</p><div><br></div></div></div></blockquote> It does not appear to be memory bandwidth bound. Presumably the 4 cases will each be utilizing the same memory bandwidth as one case, so I think one can conclude that a single case is using at most 25 percent of the memory bandwidth.</div><div><br></div><div><br><blockquote type="cite"><div><div dir="ltr"><p class="MsoNormal" style="margin:0in 0in 8pt;line-height:107%;font-size:11pt;font-family:Calibri,sans-serif">Q2: Can we say that this hardware-software-case combination is
not DRAM bound, and that it “may be amenable” to a good speedup running
multiple threads in the same shared memory environment?</p><div><br></div></div></div></blockquote> I think this is a good way to say it: "since it is not DRAM bound, it may be amenable to good speedup running multiple threads". It may also be amenable to MPI parallelism. There are other factors that affect parallel performance besides memory bandwidth; without more information, these are unknown.</div><div><br></div><div> Barry</div><div><br></div><div><br></div><div><br><blockquote type="cite"><div><div dir="ltr"><p class="MsoNormal" style="margin:0in 0in 8pt;line-height:107%;font-size:11pt;font-family:Calibri,sans-serif">I did look into the shared memory benchmark <a href="http://www.cs.virginia.edu/stream" style="color:rgb(5,99,193)" target="_blank">http://www.cs.virginia.edu/stream</a> but I could not draw any conclusions.</p><p class="MsoNormal" style="margin:0in 0in 8pt;line-height:107%;font-size:11pt;font-family:Calibri,sans-serif">If this is a trivial question, please point me to a good resource
to learn.</p><p class="MsoNormal" style="margin:0in 0in 8pt;line-height:107%;font-size:11pt;font-family:Calibri,sans-serif">Thanks!</p></div>
</div></blockquote></div><br></div></blockquote></div>
</blockquote></div></div>
</blockquote></div>
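<div>The "at most 25 percent" estimate in Barry's reply can be reproduced with a little arithmetic, sketched here in Python. The function name and its parameters are hypothetical and not from the thread. The sketch assumes that every instance moves the same amount of data as the serial run and that the concurrent instances together cannot exceed the machine's total memory bandwidth. It is only a rough upper bound, not a measurement.</div>

```python
def single_run_bandwidth_bound(n_instances, t_single_hours, t_concurrent_hours):
    """Upper bound on the fraction of total memory bandwidth one serial run uses.

    Assumption (hypothetical model): each instance moves the same number of
    bytes as the serial run. If n instances finish in t_concurrent, their
    combined average bandwidth is n * b * (t_single / t_concurrent), where b
    is the serial run's average bandwidth. That total cannot exceed the
    machine's bandwidth B, so b / B <= t_concurrent / (n * t_single).
    """
    if n_instances < 1 or t_single_hours <= 0 or t_concurrent_hours <= 0:
        raise ValueError("need n_instances >= 1 and positive run times")
    # A fraction above 1.0 carries no information, so cap the bound there.
    return min(1.0, t_concurrent_hours / (n_instances * t_single_hours))


# The two scenarios described in the original question.
no_slowdown = single_run_bandwidth_bound(4, 10.0, 10.0)
slight_slowdown = single_run_bandwidth_bound(4, 10.0, 10.5)
print(f"4 instances, 10 h each:   at most {no_slowdown:.2%} of bandwidth")
print(f"4 instances, 10.5 h each: at most {slight_slowdown:.2%} of bandwidth")
```

<div>With no slowdown the bound is 25 percent, matching the reply. With the 10 hrs 30 mins case it rises only slightly. The bound says nothing about cache misses or other shared resources, so a profiler is still needed for those.</div>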