[petsc-dev] Parallel calculation on GPU
Karl Rupp
rupp at iue.tuwien.ac.at
Wed Aug 20 09:03:43 CDT 2014
>> What you could do with 4N procs for PETSc is to define your own matrix
>> layout, where only one out of four processes actually owns part of the
>> matrix. After MatAssemblyBegin()/MatAssemblyEnd() the full data gets
>> correctly transferred to N procs, with the other 3*N procs being
>> 'empty'. You should then be able to run the solver with all 4*N
>> processors, but only N of them actually do the work on the GPUs.
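For illustration, a rough sketch of such a layout (untested, placeholder
sizes, error checking omitted for brevity; it assumes the number of ranks
is a multiple of 4) could look like this:

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat         A;
  PetscMPIInt rank, size;
  PetscInt    nglobal = 1000000;   /* placeholder global problem size */
  PetscInt    nlocal  = 0;

  PetscInitialize(&argc, &argv, NULL, NULL);
  MPI_Comm_rank(PETSC_COMM_WORLD, &rank);
  MPI_Comm_size(PETSC_COMM_WORLD, &size);

  /* only every fourth rank owns rows; assumes size is a multiple of 4 */
  if (rank % 4 == 0) {
    PetscInt nowners = size / 4;
    PetscInt owner   = rank / 4;
    nlocal = nglobal / nowners + (owner < nglobal % nowners ? 1 : 0);
  }

  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, nlocal, nlocal, nglobal, nglobal); /* explicit local sizes fix the layout */
  MatSetFromOptions(A);                             /* e.g. -mat_type aijcusparse at runtime */
  MatSetUp(A);

  /* ... MatSetValues() may be called from any rank ... */
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);            /* entries migrate to the owning ranks */

  MatDestroy(&A);
  PetscFinalize();
  return 0;
}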
> OK, I understand your solution, as I was already thinking about that;
> thanks for confirming it. But my fear is that performance will not
> improve. Indeed, I still don't understand (even after analyzing
> -log_summary profiles and searching the petsc-dev archives) what is
> slowing things down when several MPI tasks share one GPU, compared to
> one MPI task working with one GPU.
> In the proposed solution, 4*N processes will still exchange MPI messages
> during a KSP iteration, and the amount of data copied between the GPU
> and the CPU(s) will be the same, so if you could enlighten me, I would
> be glad.
One of the causes of the performance penalty you observe is the increased
PCI-Express traffic: if four ranks share a single GPU, then each
matrix-vector product requires at least 8 vector transfers between host
and device (roughly two per rank and product), rather than just 2 with a
single MPI rank. Similarly, you have four times the number of kernel
launches. It may well be that these overheads just eat up all the
performance gains you could otherwise obtain. I don't have your profiling
data at hand, so I can't be more specific at this point.
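If it helps with the comparison, you could wrap the solve in its own
logging stage so the -log_summary numbers from the 1-rank-per-GPU and
4-ranks-per-GPU runs are easier to line up. A minimal sketch (untested,
with a toy diagonal system just so the program is complete):

#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat           A;
  Vec           x, b;
  KSP           ksp;
  PetscLogStage solve_stage;
  PetscInt      i, istart, iend, n = 100;  /* placeholder problem size */

  PetscInitialize(&argc, &argv, NULL, NULL);

  /* toy diagonal system, just to have something to solve */
  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
  MatSetFromOptions(A);
  MatSetUp(A);
  MatGetOwnershipRange(A, &istart, &iend);
  for (i = istart; i < iend; i++) MatSetValue(A, i, i, 2.0, INSERT_VALUES);
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  MatCreateVecs(A, &x, &b);
  VecSet(b, 1.0);

  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetOperators(ksp, A, A);
  KSPSetFromOptions(ksp);

  /* separate logging stage so -log_summary reports the solve on its own,
     including the host<->device copies and kernel launches */
  PetscLogStageRegister("KSPSolve only", &solve_stage);
  PetscLogStagePush(solve_stage);
  KSPSolve(ksp, b, x);
  PetscLogStagePop();

  KSPDestroy(&ksp);
  VecDestroy(&x);
  VecDestroy(&b);
  MatDestroy(&A);
  PetscFinalize();
  return 0;
}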
Best regards,
Karli