[petsc-users] Questions about setting values for GPU based matrices

Fredrik Heffer Valdmanis fredva at ifi.uio.no
Tue Nov 29 02:38:44 CST 2011


2011/10/28 Matthew Knepley <knepley at gmail.com>

> On Fri, Oct 28, 2011 at 10:24 AM, Fredrik Heffer Valdmanis <
> fredva at ifi.uio.no> wrote:
>
>> Hi,
>>
>> I am working on integrating the new GPU-based vectors and matrices into
>> FEniCS. Now I'm looking at the possibility of getting some speedup during
>> finite element assembly, specifically when inserting the local element
>> matrices into the global matrix. In that regard, I have a few
>> questions I hope you can help me out with:
>>
>> - When calling MatSetValues with a MATSEQAIJCUSP matrix as a parameter,
>> what exactly happens? As far as I can see, MatSetValues is not
>> implemented for GPU-based matrices, nor is mat->ops->setvalues set
>> to point to any function for this Mat type.
>>
>
> Yes, MatSetValues always operates on the CPU side. It would not make sense
> to do individual operations on the GPU.
>
> I have written batched assembly for element matrices that are all the
> same size:
>
>
> http://www.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/docs/manualpages/Mat/MatSetValuesBatch.html
>
>
>> - Are matrices assembled in their entirety on the CPU and then copied
>> over to the GPU (after calling MatAssemblyBegin)? Or are values copied
>> over to the GPU each time you call MatSetValues?
>>
>
> That function assembles the matrix on the GPU and then copies it to the
> CPU. The only time you would not want this copy is when you are running
> in serial and never touch the matrix on the CPU afterwards, so I left
> the copy in.
>
>
>> - Can we expect to see any speedup from using MatSetValuesBatch over
>> MatSetValues, or is the batch version simply a utility function? This
>> question goes for both CPU- and GPU-based matrices.
>>
>
> CPU: no
>
> GPU: yes, I see a speedup of roughly the GPU-to-CPU memory bandwidth
> ratio.

Hi,

I have now integrated MatSetValuesBatch into our existing PETSc wrapper
layer and tested matrix assembly with Poisson's equation on different
meshes, using elements of varying order. I timed the single call to
MatSetValuesBatch and compared it with the total time consumed by the
repeated calls to MatSetValues in the old implementation. The results are
as follows:

Poisson on 1000x1000 unit square, 1st order Lagrange elements:
MatSetValuesBatch: 0.88576 s
repeated calls to MatSetValues: 0.76654 s

Poisson on 500x500 unit square, 2nd order Lagrange elements:
MatSetValuesBatch: 0.9324 s
repeated calls to MatSetValues: 0.81644 s

Poisson on 300x300 unit square, 3rd order Lagrange elements:
MatSetValuesBatch: 0.93988 s
repeated calls to MatSetValues: 1.03884 s

As you can see, the two methods take almost the same amount of time.
What behavior and performance should we expect? Is there any way to
optimize the performance of batched assembly?
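
For reference, the batched path in our wrapper essentially reduces to a
single call of the following form. This is a minimal sketch; Ne, Nl,
elemRows and elemMats are placeholders for what our assembly loop
produces, here sized for the 1000x1000 first-order case:

    /* Sketch of the batched assembly call for a MATSEQAIJCUSP matrix A.
       Ne, Nl, elemRows and elemMats stand in for what the wrapper
       extracts from the mesh and the local element tensors. */
    PetscErrorCode ierr;
    PetscInt       Ne = 2000000;  /* elements: 2 triangles per grid cell */
    PetscInt       Nl = 3;        /* local dofs per P1 triangle          */
    PetscInt      *elemRows;      /* Ne*Nl global row indices            */
    PetscScalar   *elemMats;      /* Ne*Nl*Nl packed dense blocks        */

    /* ... fill elemRows and elemMats from the local element tensors ... */

    ierr = MatSetValuesBatch(A, Ne, Nl, elemRows, elemMats);CHKERRQ(ierr);
    ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);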


I also have a problem with Thrust throwing std::bad_alloc on some calls to
MatSetValuesBatch. The exception originates in thrust::device_ptr<void>
thrust::detail::device::cuda::malloc<0u>(unsigned long). It seems to be
thrown when the number of double values I send to MatSetValuesBatch
approaches 30 million. I am testing this on a laptop with 4 GB of RAM and
a GeForce 540M (1 GB of memory), so 30 million doubles should be far from
exhausting memory on either the host or the device. Any clues as to what
causes this problem and how to avoid it?
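
For what it's worth, 30 million doubles is only about 240 MB, so I assume
the failure comes from Thrust allocating device-side temporaries on top of
the input data rather than from the input itself. A quick way to check the
actual headroom on the card right before the call is something like this
(plain CUDA runtime; print_gpu_mem is just a hypothetical helper name):

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* Print free/total device memory; calling this just before
       MatSetValuesBatch shows how much headroom the 1 GB card has left. */
    static void print_gpu_mem(const char *where)
    {
      size_t free_b = 0, total_b = 0;
      if (cudaMemGetInfo(&free_b, &total_b) == cudaSuccess) {
        printf("%s: %.0f MB free of %.0f MB total\n", where,
               free_b / 1048576.0, total_b / 1048576.0);
      }
    }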

Thanks,

Fredrik