[petsc-dev] Parallel calculation on GPU

Karl Rupp rupp at iue.tuwien.ac.at
Wed Aug 20 06:14:59 CDT 2014


>>> Is there a way to run a calculation with 4*N MPI tasks where
>>> my matrix is first built outside PETSc, then to solve the
>>> linear system using PETSc Mat, Vec, KSP on only N MPI
>>> tasks to address efficiently the N GPUs?
>> as far as I can tell, this should be possible with a suitable
>> subcommunicator. The tricky piece, however, is to select the right MPI
>> ranks for this. Note that you generally have no guarantee on how the
>> MPI ranks are distributed across the nodes, so be prepared for
>> something fairly specific to your MPI installation.
> Yes, I am ready to face this point too.

Okay, good to know that you are aware of this.

> I also started the work with a purely CPU-based solve only to test, but
> without success. When I read this:
> "If you wish PETSc code to run ONLY on a subcommunicator of
> MPI_COMM_WORLD, create that communicator first and assign it to
> BEFORE calling PetscInitialize
> Thus if you are running a four process job and two processes will run
> PETSc and have PetscInitialize and PetscFinalize
> and two process will not, then do this. If ALL processes in
> the job are using PetscInitialize and PetscFinalize
> then you don't need to do this, even if different subcommunicators of
> the job are doing different things with PETSc."
> I think I am not in this special scenario, because as my matrix is
> initially partitioned on 4
> processes, I need to call PetscInitialize() on each of the 4 processes
> in order to build the PETSc matrix
> with MatSetValues. And my goal is then to solve the linear system on
> only 2 processes... So
> will building a sub-communicator really do the trick? Or am I missing
> something?

Oh, then I misunderstood your question. I thought that you wanted to run
*your* code on 4*N procs and let PETSc never see more than N procs when
feeding the matrix.

What you could do with 4*N procs for PETSc is to define your own matrix
layout, where only one out of four processes actually owns part of the
matrix. After MatAssemblyBegin()/MatAssemblyEnd() the full data gets
correctly transferred to N procs, with the other 3*N procs being
'empty'. You should then be able to run the solver with all 4*N
processes, but only N of them actually do the work on the GPUs.
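
As a rough sketch of such a layout (not tested against your setup): the
helper below computes how many rows a given rank would own if only every
fourth rank holds matrix data. The function name local_rows_for_rank and
the rule "rank % stride == 0 owns rows" are assumptions made for
illustration; which ranks should really be the owners depends on how your
MPI installation places ranks on the nodes.

```c
#include <assert.h>

/* Hypothetical helper (not part of PETSc): number of matrix rows owned by
 * `rank` when only every `stride`-th rank owns rows. All other ranks own
 * zero rows. Rows are split as evenly as possible among the owners, with
 * the first (global_rows % nowners) owners getting one extra row.
 * Assumption: ranks with rank % stride == 0 are the ones driving a GPU. */
long local_rows_for_rank(int rank, int nranks, int stride, long global_rows)
{
    assert(stride > 0 && nranks > 0 && nranks % stride == 0);
    assert(rank >= 0 && rank < nranks && global_rows >= 0);

    if (rank % stride != 0) return 0;      /* 'empty' process */

    int  nowners = nranks / stride;        /* N owners out of 4*N ranks */
    int  owner   = rank / stride;          /* index among the owners */
    long base    = global_rows / nowners;
    long rem     = global_rows % nowners;

    return base + (owner < rem ? 1 : 0);
}
```

For a square system with M global rows, the returned value m would be
passed as the local size on each rank, e.g. MatSetSizes(A, m, m, M, M) and
VecSetSizes(x, m, M), with stride = 4 in your case. All 4*N ranks can
still call MatSetValues() for the entries they computed; the assembly
calls then move the values to the owning ranks.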

A different question is whether you actually need all 4*N MPI ranks for
the system assembly. You can make your life a lot easier if you only run
with N MPI ranks upfront, particularly if the performance gains from
N->4N procs in the assembly stage are small relative to the time spent in
the solver. This may well be the case for memory bandwidth limited
applications, where one process can utilize most of the available
bandwidth. Either way, a test run with N procs will give you a good
profiling baseline on whether you can expect any performance gains from
GPUs in the solver stage overall. It may well be that you can get faster
solver times with some fancy multigrid preconditioning techniques on a
purely CPU-based implementation, which are unavailable on GPUs. Also,
your system size needs to be sufficiently large (100k unknowns per GPU
as a rule of thumb) to hide PCI-Express latencies.

Best regards,

