[petsc-dev] GPU preconditioners
Karl Rupp
rupp at mcs.anl.gov
Sat Jan 18 03:26:52 CST 2014
Hi Andrea,
the fix is now merged to master:
https://bitbucket.org/petsc/petsc/commits/087a195f1d07b315894e9d8ab1801a0ce993221c
Best regards,
Karli
On 01/17/2014 10:13 PM, Andrea Lani wrote:
> Well, I have 9 equations, so 9x9 I guess...
>
> I hope the one you are mentioning was a major bug, because what I get is
> seriously wrong: while on a single GPU (KSPGMRES+PCASM) I get a residual
> of +0.72, on 8 cores/GPUs I get -1.00 at the first time step, just to
> give an example. Can this be due to the bug you mentioned, or do you
> suspect something more?
>
> What should I do then? Wait for the valgrind fix that is underway and
> then update? Can you please notify me when this is fixed? I'm writing a
> final report for a project and I would like to include this feature,
> fully fixed, if possible.
>
> Another question: what exactly do you mean by "order the unknowns
> properly" in this case?
> Thanks a lot!
>
> Andrea
>
>
> On Fri, Jan 17, 2014 at 10:02 PM, Karl Rupp <rupp at mcs.anl.gov> wrote:
>
> Hi Andrea,
>
>
> In fact, I have another major problem: when running on multi-GPU with
> PETSc, my results are totally inconsistent compared to a single GPU.
>
>
> This was a bug which was fixed a couple of days ago. The fix is in
> branch 'next', but not yet merged to master since there is another
> valgrind issue I haven't nailed down yet.
>
>
>
> In my code, for now, I'm assuming a 1-1 correspondence between CPU
> and GPU: I run on 8 cores and 8 GPUs (4 K10). How can I enforce this
> in the PETSc solver? Is it automatically done or do I have to specify
> some options?
>
>
> One MPI rank maps to one logical GPU. In your case, please run with
> 8 MPI ranks and distribute them equally over the nodes equipped with
> the GPUs.
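>
> In case a concrete sketch helps: nothing GPU-specific is needed in the
> application code itself; the GPU types are selected at run time. The
> example below is purely illustrative and assumes the CUSP-based types
> (runtime options -vec_type cusp and -mat_type aijcusp) and a launch
> like "mpiexec -n 8 ./app"; adapt it to whatever backend you built with.
>
> #include <petscksp.h>
>
> static char help[] = "Toy solve; GPU vector/matrix types via options.\n";
>
> int main(int argc, char **argv)
> {
>   Mat            A;
>   Vec            x, b;
>   KSP            ksp;
>   PetscInt       i, Istart, Iend, n = 1000;
>   PetscErrorCode ierr;
>
>   ierr = PetscInitialize(&argc, &argv, NULL, help);if (ierr) return ierr;
>
>   /* -mat_type aijcusp (or plain aij) decides where the matrix lives */
>   ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
>   ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);CHKERRQ(ierr);
>   ierr = MatSetFromOptions(A);CHKERRQ(ierr);
>   ierr = MatSetUp(A);CHKERRQ(ierr);
>   ierr = MatGetOwnershipRange(A, &Istart, &Iend);CHKERRQ(ierr);
>   for (i = Istart; i < Iend; ++i) {        /* toy 1D Laplacian */
>     if (i > 0)     {ierr = MatSetValue(A, i, i-1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
>     if (i < n - 1) {ierr = MatSetValue(A, i, i+1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
>     ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
>   }
>   ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>   ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>
>   /* -vec_type cusp (or the standard type) decides where vectors live */
>   ierr = VecCreate(PETSC_COMM_WORLD, &b);CHKERRQ(ierr);
>   ierr = VecSetSizes(b, PETSC_DECIDE, n);CHKERRQ(ierr);
>   ierr = VecSetFromOptions(b);CHKERRQ(ierr);
>   ierr = VecDuplicate(b, &x);CHKERRQ(ierr);
>   ierr = VecSet(b, 1.0);CHKERRQ(ierr);
>
>   ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
>   ierr = KSPSetOperators(ksp, A, A, SAME_NONZERO_PATTERN);CHKERRQ(ierr); /* 3.4-style call */
>   ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr); /* e.g. -ksp_type gmres -pc_type asm */
>   ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
>
>   ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
>   ierr = MatDestroy(&A);CHKERRQ(ierr);
>   ierr = VecDestroy(&x);CHKERRQ(ierr);
>   ierr = VecDestroy(&b);CHKERRQ(ierr);
>   ierr = PetscFinalize();
>   return ierr;
> }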
>
> As for the preconditioners: We haven't added any new preconditioners
> recently. Preconditioning on GPUs is a very problem-specific thing
> due to the burden of PCI-Express latency. Massively parallel
> approaches such as Sparse Approximate Inverses perform well in terms
> of theoretical FLOP counts, but are poor in terms of convergence and
> pretty expensive in terms of memory when running many simultaneous
> factorizations. ILU on the GPU can be fast if you order the unknowns
> properly and have only a few nonzeros per row, but it is not great in
> terms of convergence rate either. PCI-Express bandwidth and latency
> are really a problem here...
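>
> (Concretely, by "ordering" I mean the ordering used for the incomplete
> factorization, for example a bandwidth-reducing reverse Cuthill-McKee
> ordering. The sketch below is purely illustrative, not a recommendation
> for your particular problem; the helper name is made up, and the same
> effect can be had with -sub_pc_factor_mat_ordering_type rcm at run time.)
>
> #include <petscksp.h>
>
> /* Illustrative helper (assumed name): switch the local subsolves of a
>  * block-Jacobi preconditioner to ILU with an RCM ordering.  The KSP
>  * must already have its operators set before calling this. */
> PetscErrorCode UseILUWithRCM(KSP ksp)
> {
>   PC             pc, subpc;
>   KSP           *subksp;
>   PetscInt       i, nlocal, first;
>   PetscErrorCode ierr;
>
>   ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
>   ierr = PCSetType(pc, PCBJACOBI);CHKERRQ(ierr);
>   ierr = KSPSetUp(ksp);CHKERRQ(ierr);            /* needed before querying sub-KSPs */
>   ierr = PCBJacobiGetSubKSP(pc, &nlocal, &first, &subksp);CHKERRQ(ierr);
>   for (i = 0; i < nlocal; ++i) {
>     ierr = KSPGetPC(subksp[i], &subpc);CHKERRQ(ierr);
>     ierr = PCSetType(subpc, PCILU);CHKERRQ(ierr);
>     ierr = PCFactorSetMatOrderingType(subpc, MATORDERINGRCM);CHKERRQ(ierr);
>   }
>   return 0;
> }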
>
> How large are your blocks when using a block-Jacobi preconditioner
> for your problem? On the order of 3x3 or (much) larger?
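>
> (Side note, purely for illustration: if the blocks are small dense point
> blocks coming from the coupled equations per mesh node, declaring the
> block size on the matrix is what lets block-aware preconditioners such
> as point-block Jacobi, -pc_type pbjacobi, work on those blocks directly.
> A hypothetical snippet, with the helper name and arguments made up:)
>
> #include <petscmat.h>
>
> /* Hypothetical helper: create a matrix whose rows come in dense point
>  * blocks of size bs (number of coupled equations per node).
>  * nLocalRows must be a multiple of bs. */
> PetscErrorCode CreatePointBlockMatrix(MPI_Comm comm, PetscInt nLocalRows,
>                                       PetscInt bs, Mat *A)
> {
>   PetscErrorCode ierr;
>
>   ierr = MatCreate(comm, A);CHKERRQ(ierr);
>   ierr = MatSetSizes(*A, nLocalRows, nLocalRows,
>                      PETSC_DETERMINE, PETSC_DETERMINE);CHKERRQ(ierr);
>   ierr = MatSetFromOptions(*A);CHKERRQ(ierr);   /* aij, baij, aijcusp, ... */
>   ierr = MatSetBlockSize(*A, bs);CHKERRQ(ierr); /* record the point-block size */
>   ierr = MatSetUp(*A);CHKERRQ(ierr);
>   return 0;
> }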
>
> Best regards,
> Karli
>
>
>
>
> --
> Dr. Andrea Lani
> Senior Research Engineer, PhD
> Aeronautics & Aerospace dept., CFD group
> Von Karman Institute for Fluid Dynamics
> Chaussée de Waterloo 72,
> B-1640, Rhode-Saint-Genèse, Belgium
> fax : +32-2-3599600
> work : +32-2-3599769
> lani at vki.ac.be