[petsc-dev] GPU preconditioners

Andrea Lani andrea.lani at gmail.com
Fri Jan 17 15:13:54 CST 2014


Well, I have 9 equations, so 9x9 I guess...

I hope the bug you are mentioning was a major one, because what I get is
seriously wrong: while on a single GPU (KSPGMRES + PCASM) I get a residual
of +0.72, on 8 cores/GPUs I get -1.00 at the first time step, to give just
one example. Could this be due to the bug you mention, or do you suspect
something else?

What should I do then? Wait for the valgrind fix that is underway and then
update? Could you please notify me when this is fixed? I'm writing a final
report for a project and would like to include this feature, fully fixed,
if possible.

Another question: what exactly do you mean by "order the unknowns properly"
in this case?
Thanks a lot!

Andrea


On Fri, Jan 17, 2014 at 10:02 PM, Karl Rupp <rupp at mcs.anl.gov> wrote:

> Hi Andrea,
>
>
>> In fact, I have another major problem: when running on multi-GPU with
>> PETSc my results are totally inconsistent compared to a single GPU.
>>
>
> This was a bug that was fixed a couple of days ago. The fix is in branch
> 'next', but not yet merged to 'master', since there is another valgrind
> issue I haven't nailed down yet.
>
>
>
>> In my code, for now, I'm assuming a 1-1 correspondence between CPU and
>> GPU: I run on 8 cores and 8 GPUs (4 K10).  How can I enforce this in the
>> PETSc solver? Is it automatically done or do I have to specify some
>> options?
>>
>
> One MPI rank maps to one logical GPU. In your case, please run with 8 MPI
> ranks and distribute them equally over the nodes equipped with the GPUs.
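
Just to make sure I drive this correctly from my side: is a launch along
these lines what you have in mind? (The executable name and the GPU Vec/Mat
type names are my guess for the CUSP backend; please correct the options if
needed.)

    mpiexec -n 8 ./my_solver -vec_type cusp -mat_type aijcusp \
            -ksp_type gmres -pc_type asm -ksp_monitor

That is, 8 MPI ranks spread evenly over the GPU nodes, one rank per GPU,
with the Vec/Mat types switched to their GPU variants through the options
database.
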
>
> As for the preconditioners: We haven't added any new preconditioners
> recently. Preconditioning on GPUs is a very problem-specific thing due to
> the burden of PCI-Express latency. Massively parallel approaches such as
> Sparse Approximate Inverses perform well in terms of theoretical FLOP
> counts, but are poor in terms of convergence and pretty expensive in terms
> of memory when running many simultaneous factorizations. ILU on the GPU can
> be fast if you order the unknowns properly and have only a few nonzeros per
> row, but it is not great in terms of convergence rate either. PCI-Express
> bandwidth and latency are really a problem here...
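
To connect this to my ordering question above: is this what "order the
unknowns properly" translates to in practice, i.e. asking the
(sub)factorizations for a bandwidth-reducing ordering such as RCM? My guess
at the options (names from memory, so please correct me) would be

    -pc_type bjacobi -sub_pc_type ilu -sub_pc_factor_mat_ordering_type rcm

or the same -sub_pc_factor_mat_ordering_type option on my current ASM
subdomain solves, rather than a reordering of the unknowns at the
application level.
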
>
> How large are your blocks when using a block-Jacobi preconditioner for
> your problem? On the order of 3x3 or (much) larger?
>
> Best regards,
> Karli
>
>


-- 
Dr. Andrea Lani
Senior Research Engineer, PhD
Aeronautics & Aerospace dept., CFD group
Von Karman Institute for Fluid Dynamics
Chaussée de Waterloo 72,
B-1640, Rhode-Saint-Genèse, Belgium
fax : +32-2-3599600
work: +32-2-3599769
lani at vki.ac.be