[petsc-users] Questions about setting values for GPU based matrices

Matthew Knepley knepley at gmail.com
Tue Nov 29 10:57:05 CST 2011


On Tue, Nov 29, 2011 at 10:37 AM, Fredrik Heffer Valdmanis <
fredva at ifi.uio.no> wrote:

> 2011/11/29 Matthew Knepley <knepley at gmail.com>
>
>> On Tue, Nov 29, 2011 at 2:38 AM, Fredrik Heffer Valdmanis <
>> fredva at ifi.uio.no> wrote:
>>
>>> 2011/10/28 Matthew Knepley <knepley at gmail.com>
>>>
>>>> On Fri, Oct 28, 2011 at 10:24 AM, Fredrik Heffer Valdmanis <
>>>> fredva at ifi.uio.no> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am working on integrating the new GPU based vectors and matrices
>>>>> into FEniCS. Now, I'm looking at the possibility of getting some speedup
>>>>> during finite element assembly, specifically when inserting the local
>>>>> element matrices into the global matrix. In that regard, I have a few
>>>>> questions I hope you can help me out with:
>>>>>
>>>>> - When calling MatSetValues with a MATSEQAIJCUSP matrix as a parameter,
>>>>> what exactly happens? As far as I can see, MatSetValues is not implemented
>>>>> for GPU-based matrices, nor is mat->ops->setvalues set to point to any
>>>>> function for this Mat type.
>>>>>
>>>>
>>>> Yes, MatSetValues always operates on the CPU side. It would not make
>>>> sense to do individual operations on the GPU.
>>>>
>>>> I have written batched assembly for element matrices that are all
>>>> the same size:
>>>>
>>>>
>>>> http://www.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/docs/manualpages/Mat/MatSetValuesBatch.html
>>>>
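>>>> As a rough sketch of the calling pattern (the sizes and array names are
>>>> illustrative, and J stands for an already-created matrix):
>>>>
>>>>   PetscErrorCode ierr;
>>>>   PetscInt       ne = 100000, bs = 3; /* number of element matrices and
>>>>                                          their common dimension */
>>>>   PetscInt       *rows;               /* ne*bs global row indices; column
>>>>                                          indices are taken to be the same */
>>>>   PetscScalar    *v;                  /* ne*bs*bs values, one element
>>>>                                          matrix after another, row-major */
>>>>   /* ... allocate and fill rows and v from the local element matrices ... */
>>>>   ierr = MatSetValuesBatch(J, ne, bs, rows, v);CHKERRQ(ierr);
>>>>   ierr = MatAssemblyBegin(J, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>>>>   ierr = MatAssemblyEnd(J, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);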
>>>>
>>>>> - Are matrices assembled in their entirety on the CPU and then copied
>>>>> over to the GPU (after calling MatAssemblyBegin)? Or are values copied
>>>>> over to the GPU each time you call MatSetValues?
>>>>>
>>>>
>>>> That function assembles the matrix on the GPU and then copies it to the
>>>> CPU. The only time you do not want this copy is when you are running in
>>>> serial and never touch the matrix afterwards, so I left the copy in.
>>>>
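>>>> To make the copy directions concrete, here is a minimal sketch of setting
>>>> up the GPU matrix type (the communicator and size are illustrative):
>>>>
>>>>   Mat      J;
>>>>   PetscInt n = 1000000;
>>>>   ierr = MatCreate(PETSC_COMM_SELF, &J);CHKERRQ(ierr);
>>>>   ierr = MatSetSizes(J, PETSC_DECIDE, PETSC_DECIDE, n, n);CHKERRQ(ierr);
>>>>   ierr = MatSetType(J, MATSEQAIJCUSP);CHKERRQ(ierr);
>>>>   /* With MatSetValues(), entries accumulate on the CPU side and the
>>>>      matrix is copied to the GPU when it is next needed there; with
>>>>      MatSetValuesBatch(), the matrix is assembled on the GPU and the
>>>>      result is copied back to the CPU, as described above. */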
>>>>
>>>>> - Can we expect to see any speedup from using MatSetValuesBatch over
>>>>> MatSetValues, or is the batch version simply a utility function? This
>>>>> question goes for both CPU- and GPU-based matrices.
>>>>>
>>>>
>>>> CPU: no
>>>>
>>>> GPU: yes, I see a speedup of roughly the GPU/CPU memory-bandwidth ratio
>>>>
>>>>
>>> Hi,
>>>
>>> I have now integrated MatSetValuesBatch into our existing PETSc wrapper
>>> layer. I have tested matrix assembly with Poisson's equation on different
>>> meshes with elements of varying order. I have timed the single call to
>>> MatSetValuesBatch and compared that to the total time consumed by the
>>> repeated calls to MatSetValues in the old implementation. I have the
>>> following results:
>>>
>>> Poisson on 1000x1000 unit square, 1st order Lagrange elements:
>>> MatSetValuesBatch: 0.88576 s
>>> repeated calls to MatSetValues: 0.76654 s
>>>
>>> Poisson on 500x500 unit square, 2nd order Lagrange elements:
>>> MatSetValuesBatch: 0.9324 s
>>> repeated calls to MatSetValues: 0.81644 s
>>>
>>> Poisson on 300x300 unit square, 3rd order Lagrange elements:
>>> MatSetValuesBatch: 0.93988 s
>>> repeated calls to MatSetValues: 1.03884 s
>>>
>>> As you can see, the two methods take almost the same amount of time.
>>> What behavior and performance should we expect? Is there any way to
>>> optimize the performance of batched assembly?
>>>
>>
>> Almost certainly it is not dispatching to the CUDA version. The regular
>> version just calls MatSetValues() in a loop. Are you
>> using a SEQAIJCUSP matrix?
>>
> Yes. The same matrices yield a speedup of 4-6x when solving the system on
> the GPU.
>
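
To double-check from code that the CUSP type really took effect, something
like this should work (a sketch; J stands for your assembled matrix):

  PetscBool iscusp;
  ierr = PetscObjectTypeCompare((PetscObject)J, MATSEQAIJCUSP, &iscusp);CHKERRQ(ierr);
  if (!iscusp) SETERRQ(PETSC_COMM_SELF, PETSC_ERR_ARG_WRONG, "Not a SEQAIJCUSP matrix");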

Please confirm that the correct routine is being used by running with -info
and sending the output.

Please send the output of -log_summary so I can confirm the results.

You can run KSP ex4 to reproduce my results, where I see a 5.5x speedup on
the GTX 285.
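
For example, something along these lines (the path to ex4 may differ in your
tree; -info and -log_summary are the standard options mentioned above):

  cd src/ksp/ksp/examples/tutorials
  make ex4
  ./ex4 -info          # reports which routines and copies are triggered
  ./ex4 -log_summary   # timing breakdown to compare against mine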

   Matt


>
>>
>>>  I also have a problem with Thrust throwing std::bad_alloc on some
>>> calls to MatSetValuesBatch. The exception originates in
>>> thrust::device_ptr<void> thrust::detail::device::cuda::malloc<0u>(unsigned
>>> long). It seems to be thrown when the number of double values I send to
>>> MatSetValuesBatch approaches 30 million. I am testing this on a laptop with
>>> 4 GB RAM and a GeForce 540 M (1 GB memory), so the 30 million doubles are
>>> far from exhausting my memory, on both the host and device side. Any
>>> clues on what causes this problem and how to avoid it?
>>>
>>
>> It uses more memory than just the values: 30 million doubles is already
>> about 240 MB, before any intermediate storage the assembly needs. I would
>> have to look at the specific case, but I assume that the memory is exhausted.
>>
> OK, I can look further into it myself as well. Thanks,
>
> Fredrik
>



-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener