[petsc-dev] Kokkos/Crusher performance
Jed Brown
jed at jedbrown.org
Sat Jan 22 12:06:09 CST 2022
Mark Adams <mfadams at lbl.gov> writes:
> On Sat, Jan 22, 2022 at 12:29 PM Jed Brown <jed at jedbrown.org> wrote:
>
>> Mark Adams <mfadams at lbl.gov> writes:
>>
>> >>
>> >> > VecPointwiseMult     402 1.0 2.9605e-01 3.6 1.05e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   5  1  0  0  0 22515   70608      0 0.00e+00    0 0.00e+00 100
>> >> > VecScatterBegin      400 1.0 1.6791e-01 6.0 0.00e+00 0.0 3.7e+05 1.6e+04 0.0e+00  0  0 62 54  0   2  0 100 100  0     0       0      0 0.00e+00    0 0.00e+00   0
>> >> > VecScatterEnd        400 1.0 1.0057e+00 7.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   5  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00   0
>> >> > PCApply              402 1.0 2.9638e-01 3.6 1.05e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   5  1  0  0  0 22490   70608      0 0.00e+00    0 0.00e+00 100
>> >>
>> >> Most of the MatMult time is attributed to VecScatterEnd here. Can you
>> >> share a run of the same total problem size on 8 ranks (one rank per GPU)?
>> >>
>> >>
> Attached. I ran out of memory with the same size problem, so this is the
> 262K / GPU version.
>>
>> How was this launched? Is it possible all 8 ranks were using the same GPU?
>> (Perf is that bad.)
>>
>
> srun -n8 -N1 --ntasks-per-gpu=1 --gpu-bind=closest ../ex13
> -dm_plex_box_faces 2,2,2 -petscpartitioner_simple_process_grid 2,2,2
> -dm_plex_box_upper 1,1,1 -petscpartitioner_simple_node_grid 1,1,1
> -dm_refine 6 -dm_view -dm_mat_type aijkokkos -dm_vec_type kokkos -pc_type
> jacobi -log_view -ksp_view -use_gpu_aware_mpi true
I'm still worried because the results are so unreasonable. We should add an option like -view_gpu_busid that prints which GPU (PCI bus id) each rank is actually using, along the lines of
https://code.ornl.gov/olcf/hello_jobstep/-/blob/master/hello_jobstep.cpp
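
Roughly something like the following (a minimal sketch modeled on that hello_jobstep code, not existing PETSc API; the option name -view_gpu_busid, the file name, and the output format are placeholders). Compile with hipcc (or the Cray cc wrapper with the HIP paths) plus the MPI wrappers:

#include <mpi.h>
#include <hip/hip_runtime_api.h>
#include <stdio.h>

/* Print, for every MPI rank, the host name, the number of visible GPUs,
   and the PCI bus id of the device the rank is bound to. Error checking
   omitted to keep the sketch short. */
int main(int argc, char **argv)
{
  int  rank, len, ndev, dev;
  char host[MPI_MAX_PROCESSOR_NAME], busid[64];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Get_processor_name(host, &len);

  hipGetDeviceCount(&ndev);                            /* GPUs visible to this rank */
  hipGetDevice(&dev);                                  /* device this rank would use */
  hipDeviceGetPCIBusId(busid, (int)sizeof(busid), dev);

  printf("rank %d on %s: %d visible GPU(s), device %d, bus id %s\n",
         rank, host, ndev, dev, busid);

  MPI_Finalize();
  return 0;
}

Run it (or the hello_jobstep program itself) under the same flags as above, e.g. srun -n8 -N1 --ntasks-per-gpu=1 --gpu-bind=closest ./gpu_busid; each rank should report a distinct bus id.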
A single-process/single-GPU run would also be a useful point of comparison.
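
For a rough sense of scale, my back-of-envelope from the VecPointwiseMult row quoted above (the rank count is inferred from the logged rates, not stated in the log):

  1.05e+08 flops / 402 calls ≈ 2.6e+05 flops per call per rank;
  22515 MF/s aggregate / (1.05e+08 / 2.9605e-01 ≈ 355 MF/s per rank) ≈ 64 ranks;
  a pointwise multiply moves ~24 bytes per flop, so 355 MF/s per rank is only
  ~8.5 GB/s of effective bandwidth per GCD, against HBM bandwidth north of
  1 TB/s on an MI250X GCD.

That is the kind of gap that makes me suspect the GPU binding (or that launch latency completely dominates at this size).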