[petsc-dev] GPU preconditioners
Andrea Lani
andrea.lani at gmail.com
Sat Jan 18 03:53:02 CST 2014
Thanks a lot, Karli! I will update, run a few tests and let you know if my problem is fixed!
Best regards
Andrea
On Jan 18, 2014, at 10:26 AM, Karl Rupp <rupp at mcs.anl.gov> wrote:
> Hi Andrea,
>
> the fix is now merged to master:
> https://bitbucket.org/petsc/petsc/commits/087a195f1d07b315894e9d8ab1801a0ce993221c
>
> Best regards,
> Karli
>
>
>
> On 01/17/2014 10:13 PM, Andrea Lani wrote:
>> Well, I have 9 equations, so 9x9 I guess...
>>
>> I hope the one you are mentioning was a major bug, because what I get is
>> seriously wrong: on a single GPU (KSPGMRES+PCASM) I get a residual of
>> +0.72, while on 8 cores/GPUs I get -1.00 at the first time step, to give
>> one example. Can this be due to the bug you mention, or do you suspect
>> something more?
>>
>> What should I do then? Wait for the valgrind fix that is underway and
>> then update? Can you please notify me when this is fixed? I'm writing a
>> final report for a project and I would like to include this feature,
>> fully fixed, if possible.
>>
>> Another question: what exactly do you mean by "order the unknowns
>> properly" in this case?
>> Thanks a lot!
>>
>> Andrea
>>
>>
>> On Fri, Jan 17, 2014 at 10:02 PM, Karl Rupp <rupp at mcs.anl.gov> wrote:
>>
>> Hi Andrea,
>>
>>
>> In fact, I have another major problem: when running on multi-GPU with
>> PETSc my results are totally inconsistent compared to a single GPU.
>>
>>
>> This was a bug that was fixed a couple of days ago. The fix is in branch
>> 'next', but not yet merged to master because of another valgrind issue
>> I haven't nailed down yet.
>>
>>
>>
>> In my code, for now, I'm assuming a 1-1 correspondence between CPU and
>> GPU: I run on 8 cores and 8 GPUs (4 K10). How can I enforce this in the
>> PETSc solver? Is it automatically done or do I have to specify some
>> options?
>>
>>
>> One MPI rank maps to one logical GPU. In your case, please run with
>> 8 MPI ranks and distribute them equally over the nodes equipped with
>> the GPUs.
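To spell out the one-rank-per-logical-GPU mapping described above, here is
a minimal sketch assuming CUDA and plain MPI; PETSc's GPU backends normally
take care of device selection themselves, so this is only an illustration
of what the arrangement means, not something the thread's code requires.

    /* sketch: one MPI rank per logical GPU (illustrative only) */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
      int rank, ndev;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* map ranks round-robin onto the GPUs visible on this node */
      cudaGetDeviceCount(&ndev);
      if (ndev > 0) cudaSetDevice(rank % ndev);
      printf("rank %d -> device %d of %d\n", rank, ndev ? rank % ndev : -1, ndev);

      MPI_Finalize();
      return 0;
    }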
>>
>> As for the preconditioners: We haven't added any new preconditioners
>> recently. Preconditioning on GPUs is a very problem-specific thing
>> due to the burden of PCI-Express latency. Massively parallel
>> approaches such as Sparse Approximate Inverses perform well in terms
>> of theoretical FLOP counts, but are poor in terms of convergence and
>> pretty expensive in terms of memory when running many simultaneous
>> factorizations. ILU on the GPU can be fast if you order the unknowns
>> properly and have only a few nonzeros per row, but it is not great in
>> terms of convergence rate either. PCI-Express bandwidth and latency
>> are really a problem here...
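As a hedged sketch of what choosing ILU with a reordering can look like,
the fragment below sets an ILU preconditioner with a reverse Cuthill-McKee
ordering on an existing KSP. The function name and the assumption that a
KSP called ksp is already set up are illustrative; the same choice can be
made from the command line with -pc_type ilu -pc_factor_mat_ordering_type
rcm, and nothing here by itself guarantees that the factorization runs on
the GPU.

    /* sketch: ILU with an RCM ordering; error checking omitted for brevity */
    #include <petscksp.h>

    PetscErrorCode UseOrderedILU(KSP ksp)   /* ksp assumed already set up */
    {
      PC pc;

      KSPGetPC(ksp, &pc);
      PCSetType(pc, PCILU);
      /* bandwidth-reducing ordering; helps the triangular solves */
      PCFactorSetMatOrderingType(pc, MATORDERINGRCM);
      return 0;
    }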
>>
>> How large are your blocks when using a block-Jacobi preconditioner
>> for your problem? On the order of 3x3, or (much) larger?
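Since the thread mentions 9 coupled equations and hence 9x9 blocks, here is
a hedged sketch of one way to expose that block structure to PETSc with
point-block Jacobi. The names A and ksp are assumed to exist, the block
size would normally be set before the matrix is assembled, and this is only
an illustration rather than the setup actually used by the poster.

    /* sketch: 9x9 point-block Jacobi; error checking omitted for brevity */
    #include <petscksp.h>

    PetscErrorCode UsePointBlockJacobi(Mat A, KSP ksp)
    {
      PC pc;

      MatSetBlockSize(A, 9);        /* 9 unknowns per grid point */
      KSPGetPC(ksp, &pc);
      PCSetType(pc, PCPBJACOBI);    /* invert the 9x9 diagonal blocks */
      return 0;
    }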
>>
>> Best regards,
>> Karli
>>
>>
>>
>>
>> --
>> Dr. Andrea Lani
>> Senior Research Engineer, PhD
>> Aeronautics & Aerospace dept., CFD group
>> Von Karman Institute for Fluid Dynamics
>> Chausse de Waterloo 72,
>> B-1640, Rhode-Saint-Genese, Belgium
>> fax : +32-2-3599600
>> work : +32-2-3599769
>> lani at vki.ac.be
>