[petsc-dev] Kokkos/Crusher performance

Barry Smith bsmith at petsc.dev
Sat Jan 22 20:48:58 CST 2022


  I am not arguing for a rickety set of scripts; I am arguing that doing more is not so easy, and it is only worth doing if the underlying benchmark is worth the effort.

> On Jan 22, 2022, at 8:08 PM, Jed Brown <jed at jedbrown.org> wrote:
> 
> Yeah, I'm referring to the operational aspect of data management, not benchmark design (which is hard; even Sam spent years working with Mark and me on HPGMG to refine that).
> 
> If you run libCEED BPs (which use PETSc), you can run one command
> 
> srun -N.... ./bps -ceed /cpu/self/xsmm/blocked,/gpu/cuda/gen -degree 2,3,4,5 -local_nodes 1000,5000000 -problem bp1,bp2,bp3,bp4
> 
> and it'll loop (in C code) over all the combinations (reusing some non-benchmarked things like the DMPlex) across the whole range of sizes, problems, devices. It makes one output file and you feed that to a Python script to read it as a Pandas DataFrame and plot (or read and interact in a notebook). You can have a basket of files from different machines and slice those plots without code changes.
> 
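> As a rough sketch of that analysis step (the file name and column names below are assumptions for illustration, not the actual bps output format):
> 
>     import pandas as pd
>     import matplotlib.pyplot as plt
> 
>     # Read the single output file produced by the benchmark sweep into a DataFrame.
>     df = pd.read_csv("bps_results.csv")
> 
>     # Slice by backend and problem; the sizes plotted are whatever the run produced,
>     # so nothing here has to be kept in sync with the job submission.
>     for (backend, problem), group in df.groupby(["backend", "problem"]):
>         plt.loglog(group["dofs"], group["throughput"], marker="o", label=f"{backend} {problem}")
> 
>     plt.xlabel("local DoFs")
>     plt.ylabel("throughput")
>     plt.legend()
>     plt.savefig("bps.png")
> 
> Reading everything through one DataFrame is what makes slicing across machines a no-code-change operation.
> 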
> We should do something similar for a suite of PETSc benchmarks, even just basic Vec and Mat operations like in the reports. It isn't more work than a rickety bundle of scripts, and it's a lot less error-prone.
> 
> Barry Smith <bsmith at petsc.dev> writes:
> 
>>  I submit it is actually a good amount of additional work and requires real creativity and very good judgment; it is not a good intro or undergrad project, especially for someone without a huge amount of hands-on experience already. Look who had to do the new SPEChpc multigrid benchmark: the last time I checked, Sam was not an undergrad (he is a Senior Scientist at Lawrence Berkeley National Laboratory, cited by 11,194). I definitely do not plan to involve myself in any brand-new serious benchmarking studies in my current lifetime; doing one correctly is a massive undertaking, IMHO.
>> 
>>> On Jan 22, 2022, at 6:43 PM, Jed Brown <jed at jedbrown.org> wrote:
>>> 
>>> This isn't so much more or less work as work in more useful places. Maybe making a clean workflow for these experiments would be a good undergrad or intro project.
>>> 
>>> Barry Smith <bsmith at petsc.dev> writes:
>>> 
>>>> Performance studies are enormously difficult to do well, which is why there are so few good ones out there. And unless you fall into the LINPACK benchmark or hit upon STREAM, the rewards of doing an excellent job are pretty thin. Even STREAM was not properly maintained for many years; you could not just get it and use it out of the box for a variety of purposes (which is why PETSc has its hacked-up versions). I submit a proper performance study is a full-time job, and everyone already has one of those.
>>>> 
>>>>> On Jan 22, 2022, at 2:11 PM, Jed Brown <jed at jedbrown.org> wrote:
>>>>> 
>>>>> Barry Smith <bsmith at petsc.dev> writes:
>>>>> 
>>>>>>> On Jan 22, 2022, at 12:15 PM, Jed Brown <jed at jedbrown.org> wrote:
>>>>>>> Barry, when you did the tech reports, did you make an example to reproduce on other architectures? Like, run this one example (it'll run all the benchmarks across different sizes) and then run this script on the output to make all the figures?
>>>>>> 
>>>>>> It is documented in https://www.overleaf.com/project/5ff8f7aca589b2f7eb81c579 (you may need to dig through the submit scripts, etc., to find the exact details).
>>>>> 
>>>>> This runs a ton of small jobs, and each job doesn't really preload. Instead of loops in the job submission scripts, the loops could be inside the C code, which could directly output tabular data. That would run faster and be easier to submit and analyze.
>>>>> 
>>>>> https://gitlab.com/hannah_mairs/summit-performance/-/blob/master/summit-submissions/submit_gpu1.lsf
>>>>> 
>>>>> It would hopefully also avoid writing the size range manually in the analysis script, where it has to match the job submission exactly.
>>>>> 
>>>>> https://gitlab.com/hannah_mairs/summit-performance/-/blob/master/python/graphs.py#L8-9
>>>>> 
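>>>>> For example (a hypothetical sketch; the file and column names are made up, not what graphs.py actually reads), the analysis could take the sizes from the data itself instead of repeating a hard-coded range:
>>>>> 
>>>>>     import pandas as pd
>>>>> 
>>>>>     df = pd.read_csv("results.csv")            # single file written by the benchmark driver
>>>>>     sizes = sorted(df["local_size"].unique())  # sizes come from the data, not a hand-copied list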
>>>>> 
>>>>> We'd have a much easier time understanding new machines if we put just a fraction of the thought that goes into public library interfaces into the design of performance studies.
