[petsc-dev] Kokkos/Crusher performance

Jacob Faibussowitsch jacob.fai at gmail.com
Sat Jan 22 13:34:44 CST 2022


> I suggested years ago that -log_view automatically print useful information about the GPU setup (when GPUs are used), but everyone seemed comfortable with the lack of information, so no one improved it.

FWIW, PetscDeviceView() does a bit of what you want (it just dumps all of cuda/hipDeviceProp)
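
A minimal sketch of calling it directly from a small test program (the PetscDeviceCreate/PetscDeviceDestroy calls, the PETSC_DEVICE_HIP enum name, and PETSC_DECIDE for the device id are assumed from the development branch of this era and may differ between releases):

#include <petscdevice.h>

int main(int argc, char **argv)
{
  PetscDevice    device;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
  /* Assumed usage: grab a handle to the default HIP device on this rank and
     dump its properties (the cuda/hipDeviceProp contents) to stdout */
  ierr = PetscDeviceCreate(PETSC_DEVICE_HIP, PETSC_DECIDE, &device);CHKERRQ(ierr);
  ierr = PetscDeviceView(device, PETSC_VIEWER_STDOUT_WORLD);CHKERRQ(ierr);
  ierr = PetscDeviceDestroy(&device);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}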

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

> On Jan 22, 2022, at 12:55, Barry Smith <bsmith at petsc.dev> wrote:
> 
> 
>  I suggested years ago that -log_view automatically print useful information about the GPU setup (when GPUs are used), but everyone seemed comfortable with the lack of information, so no one improved it. I think that for a small number of GPUs -log_view should print the details, and for a larger number it should print some statistics (how many physical GPUs, etc.). Currently it does not even print how many are used. Requiring another option to get this basic information is a mistake: we already print a ton of background with -log_view, so it is a shame there is no background on the GPU usage.
> 
> 
> 
> 
> 
>> On Jan 22, 2022, at 1:06 PM, Jed Brown <jed at jedbrown.org> wrote:
>> 
>> Mark Adams <mfadams at lbl.gov> writes:
>> 
>>> On Sat, Jan 22, 2022 at 12:29 PM Jed Brown <jed at jedbrown.org> wrote:
>>> 
>>>> Mark Adams <mfadams at lbl.gov> writes:
>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> VecPointwiseMult     402 1.0 2.9605e-01 3.6 1.05e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   5  1  0  0  0 22515   70608      0 0.00e+00    0 0.00e+00 100
>>>>>>> VecScatterBegin      400 1.0 1.6791e-01 6.0 0.00e+00 0.0 3.7e+05 1.6e+04 0.0e+00  0  0 62 54  0   2  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>>>>> VecScatterEnd        400 1.0 1.0057e+00 7.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   5  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>>>>> PCApply              402 1.0 2.9638e-01 3.6 1.05e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   5  1  0  0  0 22490   70608      0 0.00e+00    0 0.00e+00 100
>>>>>> 
>>>>>> Most of the MatMult time is attributed to VecScatterEnd here. Can you
>>>>>> share a run of the same total problem size on 8 ranks (one rank per GPU)?
>>>>>> 
>>>>>> 
>>>>> Attached. I ran out of memory with the same size problem, so this is the
>>>>> 262K / GPU version.
>>>> 
>>>> How was this launched? Is it possible all 8 ranks were using the same GPU?
>>>> (Perf is that bad.)
>>>> 
>>> 
>>> srun -n8 -N1 *--ntasks-per-gpu=1* --gpu-bind=closest ../ex13
>>> -dm_plex_box_faces 2,2,2 -petscpartitioner_simple_process_grid 2,2,2
>>> -dm_plex_box_upper 1,1,1 -petscpartitioner_simple_node_grid 1,1,1
>>> -dm_refine 6 -dm_view -dm_mat_type aijkokkos -dm_vec_type kokkos -pc_type
>>> jacobi -log_view -ksp_view -use_gpu_aware_mpi true
>> 
>> I'm still worried because the results are so unreasonable. We should add an option like -view_gpu_busid that prints this information per rank.
>> 
>> https://code.ornl.gov/olcf/hello_jobstep/-/blob/master/hello_jobstep.cpp
>> 
>> A single-process/single-GPU run would also be a useful point of comparison.
> 

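The hello_jobstep check Jed links above amounts to something like the following (a simplified sketch, not the actual OLCF source; it assumes a HIP build and an MPI compiler wrapper, and the same idea works with the cuda* equivalents):

#include <mpi.h>
#include <hip/hip_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int  rank, ndev = 0, hostlen;
  char host[MPI_MAX_PROCESSOR_NAME];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Get_processor_name(host, &hostlen);

  /* Each rank reports which GPUs it can actually see; if the binding is wrong
     (e.g. all 8 ranks sharing one GPU), every rank prints the same PCI bus id
     or all devices, and the anomaly behind the -log_view numbers is obvious. */
  if (hipGetDeviceCount(&ndev) != hipSuccess) ndev = 0;
  for (int d = 0; d < ndev; ++d) {
    char busid[64] = "unknown";
    (void)hipDeviceGetPCIBusId(busid, (int)sizeof(busid), d);
    printf("rank %d on %s sees device %d with PCI bus id %s\n", rank, host, d, busid);
  }

  MPI_Finalize();
  return 0;
}

Launched with the same srun line as above, one output line per rank per visible device is enough to confirm (or rule out) a bad --gpu-bind.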