[petsc-dev] GPU preconditioners
Karl Rupp
rupp at mcs.anl.gov
Sat Jan 18 03:26:52 CST 2014
Hi Andrea,
the fix is now merged to master:
https://bitbucket.org/petsc/petsc/commits/087a195f1d07b315894e9d8ab1801a0ce993221c
Best regards,
Karli
On 01/17/2014 10:13 PM, Andrea Lani wrote:
> Well, I have 9 equations, so 9x9 I guess...
>
> I hope the one you are mentioning was a major bug, because what I get is
> seriously wrong: while on a single GPU (KSPGMRES+PCASM) I get a residual
> of +0.72, on 8 cores/GPUs I get -1.00 at the first time step, just to
> give an example. Can this be due to the bug you mentioned, or do you
> suspect something more?
>
> What should I do then? Wait for the valgrind fix that is underway and
> then update? Can you please notify me when this is fixed? I'm writing a
> final report for a project and I would like to include this feature,
> fully fixed, if possible.
>
> Another question: what exactly do you mean by "order the unknowns
> properly" in this case?
> Thanks a lot!
>
> Andrea
>
>
> On Fri, Jan 17, 2014 at 10:02 PM, Karl Rupp <rupp at mcs.anl.gov> wrote:
>
> Hi Andrea,
>
>
> In fact, I have another major problem: when running on multi-GPU with
> PETSc, my results are totally inconsistent compared to a single GPU.
>
>
> This was a bug which was fixed a couple of days ago. The fix is in
> branch 'next', but not yet merged to master since there is another
> valgrind issue I haven't nailed down yet.
>
>
>
> In my code, for now, I'm assuming a 1-1 correspondence between CPU
> and GPU: I run on 8 cores and 8 GPUs (4 K10). How can I enforce this
> in the PETSc solver? Is it automatically done or do I have to specify
> some options?
>
>
> One MPI rank maps to one logical GPU. In your case, please run with
> 8 MPI ranks and distribute them equally over the nodes equipped with
> the GPUs.
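>
> In case a concrete sketch helps: nothing GPU-specific is needed in the
> application code itself; the GPU types are selected at run time. The
> example below is purely illustrative and assumes the CUSP-based types
> (runtime options -vec_type cusp and -mat_type aijcusp) and a launch
> like "mpiexec -n 8 ./app"; adapt it to whatever backend you built with.
>
> #include <petscksp.h>
>
> static char help[] = "Toy solve; GPU vector/matrix types via options.\n";
>
> int main(int argc, char **argv)
> {
>   Mat            A;
>   Vec            x, b;
>   KSP            ksp;
>   PetscInt       i, Istart, Iend, n = 1000;
>   PetscErrorCode ierr;
>
>   ierr = PetscInitialize(&argc, &argv, NULL, help);if (ierr) return ierr;
>
>   /* -mat_type aijcusp (or plain aij) decides where the matrix lives */
>   ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
>   ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);CHKERRQ(ierr);
>   ierr = MatSetFromOptions(A);CHKERRQ(ierr);
>   ierr = MatSetUp(A);CHKERRQ(ierr);
>   ierr = MatGetOwnershipRange(A, &Istart, &Iend);CHKERRQ(ierr);
>   for (i = Istart; i < Iend; ++i) {        /* toy 1D Laplacian */
>     if (i > 0)     {ierr = MatSetValue(A, i, i-1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
>     if (i < n - 1) {ierr = MatSetValue(A, i, i+1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
>     ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
>   }
>   ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>   ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>
>   /* -vec_type cusp (or the standard type) decides where vectors live */
>   ierr = VecCreate(PETSC_COMM_WORLD, &b);CHKERRQ(ierr);
>   ierr = VecSetSizes(b, PETSC_DECIDE, n);CHKERRQ(ierr);
>   ierr = VecSetFromOptions(b);CHKERRQ(ierr);
>   ierr = VecDuplicate(b, &x);CHKERRQ(ierr);
>   ierr = VecSet(b, 1.0);CHKERRQ(ierr);
>
>   ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
>   ierr = KSPSetOperators(ksp, A, A, SAME_NONZERO_PATTERN);CHKERRQ(ierr); /* 3.4-style call */
>   ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr); /* e.g. -ksp_type gmres -pc_type asm */
>   ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
>
>   ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
>   ierr = MatDestroy(&A);CHKERRQ(ierr);
>   ierr = VecDestroy(&x);CHKERRQ(ierr);
>   ierr = VecDestroy(&b);CHKERRQ(ierr);
>   ierr = PetscFinalize();
>   return ierr;
> }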
>
> As for the preconditioners: We haven't added any new preconditioners
> recently. Preconditioning on GPUs is a very problem-specific thing
> due to the burden of PCI-Express latency. Massively parallel
> approaches such as Sparse Approximate Inverses perform well in terms
> of theoretical FLOP counts, but are poor in terms of convergence and
> pretty expensive in terms of memory when running many simultaneous
> factorizations. ILU on the GPU can be fast if you order the unknowns
> properly and have only a few nonzeros per row, but it is not great in
> terms of convergence rate either. PCI-Express bandwidth and latency
> are really a problem here...
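>
> (Concretely, by "ordering" I mean the ordering used for the incomplete
> factorization, for example a bandwidth-reducing reverse Cuthill-McKee
> ordering. The sketch below is purely illustrative, not a recommendation
> for your particular problem; the helper name is made up, and the same
> effect can be had with -sub_pc_factor_mat_ordering_type rcm at run time.)
>
> #include <petscksp.h>
>
> /* Illustrative helper (assumed name): switch the local subsolves of a
>  * block-Jacobi preconditioner to ILU with an RCM ordering.  The KSP
>  * must already have its operators set before calling this. */
> PetscErrorCode UseILUWithRCM(KSP ksp)
> {
>   PC             pc, subpc;
>   KSP           *subksp;
>   PetscInt       i, nlocal, first;
>   PetscErrorCode ierr;
>
>   ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
>   ierr = PCSetType(pc, PCBJACOBI);CHKERRQ(ierr);
>   ierr = KSPSetUp(ksp);CHKERRQ(ierr);            /* needed before querying sub-KSPs */
>   ierr = PCBJacobiGetSubKSP(pc, &nlocal, &first, &subksp);CHKERRQ(ierr);
>   for (i = 0; i < nlocal; ++i) {
>     ierr = KSPGetPC(subksp[i], &subpc);CHKERRQ(ierr);
>     ierr = PCSetType(subpc, PCILU);CHKERRQ(ierr);
>     ierr = PCFactorSetMatOrderingType(subpc, MATORDERINGRCM);CHKERRQ(ierr);
>   }
>   return 0;
> }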
>
> How large are your blocks when using a block-Jacobi preconditioner
> for your problem? On the order of 3x3 or (much) larger?
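>
> (Side note, purely for illustration: if the blocks are small dense point
> blocks coming from the coupled equations per mesh node, declaring the
> block size on the matrix is what lets block-aware preconditioners such
> as point-block Jacobi, -pc_type pbjacobi, work on those blocks directly.
> A hypothetical snippet, with the helper name and arguments made up:)
>
> #include <petscmat.h>
>
> /* Hypothetical helper: create a matrix whose rows come in dense point
>  * blocks of size bs (number of coupled equations per node).
>  * nLocalRows must be a multiple of bs. */
> PetscErrorCode CreatePointBlockMatrix(MPI_Comm comm, PetscInt nLocalRows,
>                                       PetscInt bs, Mat *A)
> {
>   PetscErrorCode ierr;
>
>   ierr = MatCreate(comm, A);CHKERRQ(ierr);
>   ierr = MatSetSizes(*A, nLocalRows, nLocalRows,
>                      PETSC_DETERMINE, PETSC_DETERMINE);CHKERRQ(ierr);
>   ierr = MatSetFromOptions(*A);CHKERRQ(ierr);   /* aij, baij, aijcusp, ... */
>   ierr = MatSetBlockSize(*A, bs);CHKERRQ(ierr); /* record the point-block size */
>   ierr = MatSetUp(*A);CHKERRQ(ierr);
>   return 0;
> }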
>
> Best regards,
> Karli
>
>
>
>
> --
> Dr. Andrea Lani
> Senior Research Engineer, PhD
> Aeronautics & Aerospace dept., CFD group
> Von Karman Institute for Fluid Dynamics
> Chaussée de Waterloo 72,
> B-1640, Rhode-Saint-Genèse, Belgium
> fax : +32-2-3599600
> work : +32-2-3599769
> lani at vki.ac.be