[petsc-dev] Parallel calculation on GPU

Projet_TRIOU triou at cea.fr
Wed Aug 20 06:34:54 CDT 2014

On 08/20/14 13:14, Karl Rupp wrote:
> Hey,
>>>> Is there a way to run a calculation with 4*N MPI tasks where
>>>> my matrix is first built outside PETSc, then to solve the
>>>> linear system using PETSc Mat, Vec, KSP on only N MPI
>>>> tasks to address efficiently the N GPUs?
>>> as far as I can tell, this should be possible with a suitable
>>> subcommunicator. The tricky piece, however, is to select the right MPI
>>> ranks for this. Note that you generally have no guarantee on how the
>>> MPI ranks are distributed across the nodes, so be prepared for
>>> something fairly specific to your MPI installation.
>> Yes, I am ready to face this point too.
> Okay, good to know that you are aware of this.
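To make the rank-selection point concrete, here is a minimal sketch of one way to choose which world ranks would drive a GPU. It assumes every rank has already gathered the node name of all ranks (for instance with MPI_Get_processor_name() plus an allgather); the function name pick_gpu_ranks and the gpus_per_node parameter are hypothetical, for illustration only. The selected ranks would then get their own color in MPI_Comm_split().

```python
from collections import defaultdict

def pick_gpu_ranks(hostnames, gpus_per_node):
    """Return the world ranks that should each drive one GPU.

    hostnames[r] is the node name reported by world rank r
    (assumed to have been gathered beforehand on every rank).
    The first `gpus_per_node` ranks found on each node are kept,
    so the result does not depend on how the MPI launcher
    distributed the ranks across the nodes.
    """
    seen = defaultdict(int)   # ranks encountered so far on each node
    chosen = []
    for rank, host in enumerate(hostnames):
        if seen[host] < gpus_per_node:
            chosen.append(rank)
        seen[host] += 1
    return chosen

# Block placement: ranks 0-3 on node n0, ranks 4-7 on node n1.
block = ["n0"] * 4 + ["n1"] * 4
# Round-robin placement: ranks alternate between the two nodes.
round_robin = ["n0", "n1"] * 4

print(pick_gpu_ranks(block, 1))        # one GPU rank per node
print(pick_gpu_ranks(round_robin, 1))  # different ranks, same idea
```

The two placements give different rank lists, which is exactly why a fixed rule such as "every fourth rank" is only safe when the launcher's placement is known.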
>> I also started the work with a purely CPU-based solve only to test, but
>> without success. When
>> I read this:
>> "If you wish PETSc code to run ONLY on a subcommunicator of
>> MPI_COMM_WORLD, create that communicator first and assign it to
>> PETSC_COMM_WORLD
>> <http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PETSC_COMM_WORLD.html#PETSC_COMM_WORLD>
>> BEFORE calling PetscInitialize
>> <http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscInitialize.html#PetscInitialize>().
>> Thus if you are running a four process job and two processes will run
>> PETSc and have PetscInitialize() and PetscFinalize
>> <http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscFinalize.html#PetscFinalize>()
>> and two process will not, then do this. If ALL processes in the job
>> are using PetscInitialize() and PetscFinalize() then you don't need to
>> do this, even if different subcommunicators of the job are doing
>> different things with PETSc."
>> I think I am not in this special scenario: since my matrix is
>> initially partitioned across 4 processes, I need to call
>> PetscInitialize() on all 4 processes in order to build the PETSc
>> matrix with MatSetValues. My goal is then to solve the linear system
>> on only 2 processes... So will building a sub-communicator really do
>> the trick? Or am I missing something?
> oh, then I misunderstood your question. I thought that you want to run 
> *your* code on 4N procs and let PETSc never see more than N procs when 
> feeding the matrix.
Sorry, I was not very clear :-)
> What you could do with 4N procs for PETSc is to define your own matrix 
> layout, where only one out of four processes actually owns part of the 
> matrix. After MatAssemblyBegin()/MatAssemblyEnd() the full data gets 
> correctly transferred to N procs, with the other 3*N procs being 
> 'empty'. You should then be able to run the solver with all 4*N 
> processors, but only N of them actually do the work on the GPUs.
OK, I understand your solution. I was already thinking along those lines,
so thanks for confirming it. My fear, however, is that performance will
not improve. Indeed, I still don't understand (even after analyzing
-log_summary profiles and searching the petsc-dev archives) what slows
things down when several MPI tasks share one GPU, compared to one MPI
task working with one GPU...
In the proposed solution, 4*N processes will still exchange MPI messages
during a KSP iteration, and the amount of data copied between GPU and
CPU(s) will be the same, so if you could enlighten me, I would be glad.
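For reference, the layout discussed above (only one process out of four owns rows) can be sketched as a plain computation of the local row counts. This is a sketch under assumptions, not a definitive implementation: the function name local_row_counts and the group parameter are hypothetical, and picking the first rank of every four consecutive ranks as owner assumes that consecutive ranks sit on the same node. The resulting count for each rank is what one would pass as the local size to MatSetSizes() and VecSetSizes().

```python
def local_row_counts(n_global_rows, n_ranks, group=4):
    """Local row count per rank when only the first rank of every
    `group` consecutive ranks owns part of the matrix.

    Assumption: n_ranks is a multiple of `group`, and consecutive
    ranks share a node (and therefore a GPU).
    """
    if n_ranks % group != 0:
        raise ValueError("n_ranks must be a multiple of group")
    n_owners = n_ranks // group
    # Spread the rows as evenly as possible over the owning ranks.
    base, extra = divmod(n_global_rows, n_owners)
    counts = []
    for rank in range(n_ranks):
        if rank % group == 0:
            owner = rank // group
            counts.append(base + (1 if owner < extra else 0))
        else:
            counts.append(0)  # 'empty' rank: owns no rows
    return counts

# 8 ranks (N = 2 GPUs), 11 global rows.
print(local_row_counts(11, 8))
```

With this layout all 4*N ranks can still call MatSetValues() on their own entries; the off-process values are shipped to the N owning ranks during MatAssemblyBegin()/MatAssemblyEnd().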
> A different question is whether you actually need all 4*N MPI ranks 
> for the system assembly. You can make your life a lot easier if you 
> only run with N MPI ranks upfront, particularly if the performance 
> gain from N->4N procs in the assembly stage is small relative to the 
> time spent in the solver. 
Indeed, but it is not always small in our cases...
> This may well be the case for memory bandwidth limited applications, 
> where one process can utilize most of the available bandwidth. Either 
> way, a test run with N procs will give you a good profiling baseline 
> on whether you can expect any performance gains from GPUs in the 
> solver stage overall. It may well be that you can get faster solver 
> times with some fancy multigrid preconditioning techniques on a purely 
> CPU-based implementation, which is unavailable on GPUs. Also, your 
> system size needs to be sufficiently large (100k unknowns per GPU as a 
> rule of thumb) to hide PCI-Express latencies.
Indeed, the rule of thumb seems to be 100-150k unknowns per GPU for my app.

Thanks Karli, I really appreciate your advice,

> Best regards,
> Karli

*Trio_U support team*
Marthe ROUX (01 69 08 00 02) Saclay
Pierre LEDAC (04 38 78 91 49) Grenoble