[petsc-users] CPU speed or DRAM speed bottlenecks ?

C B cebau.mail at gmail.com
Fri Dec 4 00:29:20 CST 2020


Thank you very much Junchao!

Most of these tools are developed for Linux, and at this time I am mainly
interested in code for Windows.
I found this thread very informative:

https://stackoverflow.com/questions/34641644/is-there-a-windows-equivalent-of-the-linux-command-perf-stat


Thanks,

On Thu, Dec 3, 2020 at 8:58 PM Junchao Zhang <junchao.zhang at gmail.com>
wrote:

> You can try HPCToolkit (http://hpctoolkit.org/), Tau (
> https://www.cs.uoregon.edu/research/tau/home.php), or Intel VTune. But
> for each, you need to read its manual to learn it.
>
> --Junchao Zhang
>
>
> On Thu, Dec 3, 2020 at 5:29 PM C B <cebau.mail at gmail.com> wrote:
>
>> Barry,
>>
>> Thank you so much for your quick reply and insight.
>>
>> Are there any tools or simple ways to determine how much time is lost to
>> cache misses, etc.? Please direct me to any resources to learn about this.
>>
>> Thanks again!
>>
>>
>> On Thu, Dec 3, 2020 at 4:09 PM Barry Smith <bsmith at petsc.dev> wrote:
>>
>>>
>>>
>>> On Dec 3, 2020, at 2:25 PM, C B <cebau.mail at gmail.com> wrote:
>>>
>>> Resorting to your expertise in software performance:
>>>
>>> Subject: Looking for a crude assessment of CPU speed or DRAM speed
>>> bottlenecks in shared memory multi-core PCs
>>>
>>> On a typical PC with one Xeon CPU (8 cores), a serial code runs a case
>>> in, say, 10 hours of wall time, and on the same computer 4 instances of the
>>> same code running simultaneously (the same case) take essentially the same
>>> wall time: 10 hrs, or marginally more, such as 10 hrs 30 mins. There is
>>> no I/O, lots of free physical RAM, and each core running an instance shows
>>> ~100% utilization.
>>>
>>> Q1: What could we conclude about this hardware-software-case combination
>>> in terms of being CPU bound, memory bandwidth bound, etc ?
>>>
>>>    It does not appear to be memory bandwidth bound.  Presumably the 4
>>> cases each use the same memory bandwidth as one case, and they run together
>>> without slowing down, so I think one can conclude that a single case is
>>> using at most 25 percent of the memory bandwidth.
>>>
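A rough sketch of the arithmetic behind the 25 percent estimate above, in Python. It assumes a deliberately simple model that is not from the thread: all instances share one fixed memory bandwidth, and wall time only stretches once their combined demand exceeds it. The function name and its parameters are hypothetical, chosen for illustration.

```python
def bandwidth_share_upper_bound(t_single, t_concurrent, n_instances):
    """Upper bound on the fraction of memory bandwidth used by ONE instance.

    Assumed model (a simplification, not a measurement):
      - a single instance uses a fraction f of the available bandwidth;
      - n instances together demand n * f;
      - if n * f <= 1 there is no slowdown, otherwise wall time grows
        by roughly the factor n * f.
    Under that model, f <= slowdown / n, capped at 1.
    """
    if t_single <= 0 or t_concurrent <= 0 or n_instances < 1:
        raise ValueError("times must be positive and n_instances >= 1")
    slowdown = t_concurrent / t_single
    return min(1.0, slowdown / n_instances)


if __name__ == "__main__":
    # Numbers from the question: 10 h alone, 10 h to 10.5 h with 4 instances.
    for t_concurrent in (10.0, 10.5):
        bound = bandwidth_share_upper_bound(10.0, t_concurrent, 4)
        print(f"{t_concurrent} h with 4 instances -> at most {bound:.4f} of bandwidth")
```

With the numbers from the question, the bound comes out at about 25 to 26 percent for one instance. It is only an upper bound: the real share could be much lower, and the model ignores cache sharing and CPU clock changes under load.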
>>>
>>> Q2: Can we say that this hardware-software-case combination is not DRAM
>>> bound, and that it “may be amenable” to a good speedup running multiple
>>> threads in the same shared memory environment ?
>>>
>>>    I think this is a good way to say it: since it is not DRAM bound, it
>>> may be amenable to good speedup running multiple threads. It may also be
>>> amenable to MPI parallelism. There are other factors that affect parallel
>>> performance besides memory bandwidth; without more information these are
>>> unknown.
>>>
>>>   Barry
>>>
>>>
>>>
>>> I did look into the shared memory benchmark
>>> http://www.cs.virginia.edu/stream  but I could not draw any conclusions.
>>>
>>> If this is a trivial question, please point me to a good resource to
>>> learn.
>>>
>>> Thanks!
>>>
>>>
>>>