[petsc-users] Offloading linear solves in time stepper to GPU

Barry Smith bsmith at mcs.anl.gov
Sun May 31 11:58:40 CDT 2015


> On May 30, 2015, at 10:14 PM, Harshad Sahasrabudhe <hsahasra at purdue.edu> wrote:
> 
> Is your intent to solve a problem that matters in a way that makes sense for a scientist or engineer
> 
> I want to see if we can speed up the time stepper for a large system using GPUs. For a large system with a sparse matrix of dimension 420,000 x 420,000, each time step takes 341 seconds on a single process and 180 seconds on 16 processes.

   Rather than going off on a wild goose chase it would be good to understand 1) WHY the time on one process is so poor and 2) why the speedup to 16 processes is so low. This means gathering information and then analyzing it.

So first you need to measure the memory bandwidth of your system for 1 to 16 processes. This is explained at http://www.mcs.anl.gov/petsc/documentation/faq.html#computers : run "make streams NPMAX=16", then use the MPI "binding" options to see if they improve the streams numbers. What do you get for these numbers?
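
For example, a minimal sketch (the binding flag is MPI-implementation specific: -bind-to core is the MPICH spelling and --bind-to core the Open MPI one, and the path to the streams binary is an assumption about the PETSc source tree):

    cd $PETSC_DIR
    make streams NPMAX=16
    # rerun the streams binary by hand with explicit binding and compare:
    mpiexec -n 16 -bind-to core ./src/benchmarks/streams/MPIVersion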

Next you need to run your PETSc application with -log_summary to see how much time is spent in the linear solve, how that time is divided among the parts of the linear solve, and how many iterations it takes. To start, run with -log_summary and 1, 2, 4, 8, and 16 MPI processes. What do you get for these numbers?
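
For example, a minimal sketch (./app stands in for your executable and its usual arguments):

    for n in 1 2 4 8 16 ; do
      mpiexec -n $n ./app -log_summary > log.$n
    done
    # compare the times and flop rates of the KSPSolve, PCApply, and
    # MatMult events across the five logs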

   In addition to 1) and 2) you need to determine what a good preconditioner is for YOUR problem. Linear iterative solvers are not black-box solvers; using an inappropriate preconditioner can make many orders of magnitude of difference in solution time (more than changing the hardware). If your problem is a nice elliptic operator then something like -pc_type gamg might work well (or you can try -pc_type hypre or -pc_type ml; these require installing the optional external packages; see http://www.mcs.anl.gov/petsc/documentation/linearsolvertable.html). If your problem is a saddle-point problem (e.g. Stokes) then you likely need the PCFIELDSPLIT preconditioner to "pull out" the saddle-point part. For more complicated simulations you will need to nest several preconditioners.
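
   All of these preconditioners can be tried from the command line without recompiling, provided the application calls KSPSetFromOptions() (or TSSetFromOptions() when using the time stepper). A sketch, where ./app again stands in for your executable:

    mpiexec -n 16 ./app -pc_type gamg  -log_summary
    mpiexec -n 16 ./app -pc_type hypre -pc_hypre_type boomeramg -log_summary
    mpiexec -n 16 ./app -pc_type ml    -log_summary
    # watch the convergence while experimenting:
    mpiexec -n 16 ./app -pc_type gamg -ksp_converged_reason -ksp_monitor_true_residual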

  Barry





> So the scaling isn't that good. We also run out of memory with a larger number of processes. 
> 
> On Sat, May 30, 2015 at 11:01 PM, Jed Brown <jed at jedbrown.org> wrote:
> Harshad Sahasrabudhe <hsahasra at purdue.edu> writes:
> > For now, I want to serialize the matrices and vectors and offload them to 1
> > GPU from the root process. Then distribute the result later.
> 
> Unless you have experience with these solvers and the overheads
> involved, I think you should expect this to be much slower than simply
> doing the solves using a reasonable method on the CPU.  Is your intent
> to solve a problem that matters in a way that makes sense for a
> scientist or engineer, or is it to demonstrate that a particular
> combination of packages/methods/hardware can be used?
> 


