[petsc-users] Questions about setting values for GPU based matrices

Thu Dec 1 07:00:16 CST 2011

On Thu, Dec 1, 2011 at 5:39 AM, Fredrik Heffer Valdmanis
<fredva at ifi.uio.no> wrote:

>
> 2011/11/29 Matthew Knepley <knepley at gmail.com>
>
>> On Tue, Nov 29, 2011 at 10:37 AM, Fredrik Heffer Valdmanis <
>> fredva at ifi.uio.no> wrote:
>>
>>> 2011/11/29 Matthew Knepley <knepley at gmail.com>
>>>
>>>> On Tue, Nov 29, 2011 at 2:38 AM, Fredrik Heffer Valdmanis <
>>>> fredva at ifi.uio.no> wrote:
>>>>
>>>>> 2011/10/28 Matthew Knepley <knepley at gmail.com>
>>>>>
>>>>>> On Fri, Oct 28, 2011 at 10:24 AM, Fredrik Heffer Valdmanis <
>>>>>> fredva at ifi.uio.no> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am working on integrating the new GPU based vectors and matrices
>>>>>>> into FEniCS. Now, I'm looking at the possibility for getting some speedup
>>>>>>> during finite element assembly, specifically when inserting the local
>>>>>>> element matrix into the global element matrix. In that regard, I have a few
>>>>>>> questions I hope you can help me out with:
>>>>>>>
>>>>>>> - When calling MatSetValues with a MATSEQAIJCUSP matrix as
>>>>>>> parameter, what exactly is it that happens? As far as I can see,
>>>>>>> MatSetValues is not implemented for GPU based matrices, nor is
>>>>>>> mat->ops->setvalues set to point at any function for this Mat type.
>>>>>>>
>>>>>>
>>>>>> Yes, MatSetValues always operates on the CPU side. It would not make
>>>>>> sense to do individual operations on the GPU.
>>>>>>
>>>>>> I have written batched assembly for element matrices that are all
>>>>>> the same size:
>>>>>>
>>>>>>
>>>>>> http://www.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/docs/manualpages/Mat/MatSetValuesBatch.html
>>>>>>
>>>>>>
>>>>>>> - Is it such that matrices are assembled in their entirety on the
>>>>>>> CPU, and then copied over to the GPU (after calling MatAssemblyBegin)? Or
>>>>>>> are values copied over to the GPU each time you call MatSetValues?
>>>>>>>
>>>>>>
>>>>>> That function assembles the matrix on the GPU and then copies to the
>>>>>> CPU. The only time you do not want this copy is when
>>>>>> you are running in serial and never touch the matrix afterwards, so I
>>>>>> left it in.
>>>>>>
>>>>>>
>>>>>>> - Can we expect to see any speedup from using MatSetValuesBatch over
>>>>>>> MatSetValues, or is the batch version simply a utility function? This
>>>>>>> question goes for both CPU- and GPU-based matrices.
>>>>>>>
>>>>>>
>>>>>> CPU: no
>>>>>>
>>>>>> GPU: yes, I see a speedup of about the GPU-to-CPU memory bandwidth ratio
>>>>>>
>>>>>>
>>>>> Hi,
>>>>>
>>>>> I have now integrated MatSetValuesBatch in our existing PETSc wrapper
>>>>> layer. I have tested matrix assembly with Poisson's equation on different
>>>>> meshes with elements of varying order. I have timed the single call to
>>>>> MatSetValuesBatch and compared that to the total time consumed by the
>>>>> repeated calls to MatSetValues in the old implementation. I have the
>>>>> following results:
>>>>>
>>>>> Poisson on 1000x1000 unit square, 1st order Lagrange elements:
>>>>> MatSetValuesBatch: 0.88576 s
>>>>> repeated calls to MatSetValues: 0.76654 s
>>>>>
>>>>> Poisson on 500x500 unit square, 2nd order Lagrange elements:
>>>>> MatSetValuesBatch: 0.9324 s
>>>>> repeated calls to MatSetValues: 0.81644 s
>>>>>
>>>>> Poisson on 300x300 unit square, 3rd order Lagrange elements:
>>>>> MatSetValuesBatch: 0.93988 s
>>>>> repeated calls to MatSetValues: 1.03884 s
>>>>>
>>>>> As you can see, the two methods take almost the same amount of time.
>>>>> What behavior and performance should we expect? Is there any way to
>>>>> optimize the performance of batched assembly?
>>>>>
>>>>
>>>> Almost certainly it is not dispatching to the CUDA version. The regular
>>>> version just calls MatSetValues() in a loop. Are you
>>>> using a SEQAIJCUSP matrix?
>>>>
>>> Yes. The same matrices yield a speedup of 4-6x when solving the system
>>> on the GPU.
>>>
>>
>> Please confirm that the correct routine is being called by running with
>> -info and sending the output.
>>
>> Please send the output of -log_summary so I can confirm the results.
>>
>> You can run KSP ex4 and reproduce my results where I see a 5.5x speedup
>> on the GTX285
>>
> I am not sure what to look for in those outputs. I have uploaded the
> output of running my assembly program with -info and -log_summary, and the
> output of running ex4 with -log_summary. See
>
> http://folk.uio.no/fredva/assembly_info.txt
> http://folk.uio.no/fredva/assembly_log_summary.txt
> http://folk.uio.no/fredva/ex4_log_summary.txt
>
> Trying this on a different machine now, I actually see some speedup. 3rd
> order Poisson on 300x300 assembles in 0.211 sec on GPU and 0.4232 sec on
> CPU. For 1st order and 1000x1000 mesh, I go from 0.31 sec to 0.205 sec.
> I have tried to increase the mesh size to see if the speedup increases,
> but I hit the bad_alloc error pretty quickly.
>
> For a problem of that size, should I expect even more speedup? Please let
> me know if you need any more output from test runs on my machine.
>
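
As a quick check on the timings quoted above, the implied speedups can be computed directly. This is only a sketch using the numbers reported in this thread. The dictionary keys and the `speedup` helper are illustrative names. For the 1st order case on the second machine, it is assumed that "from 0.31 sec to 0.205 sec" means CPU to GPU.

```python
# Timings in seconds, copied from the messages quoted above.
# First machine: one MatSetValuesBatch call vs. repeated MatSetValues calls.
first_machine = {
    "P1, 1000x1000": {"batch": 0.88576, "loop": 0.76654},
    "P2, 500x500": {"batch": 0.9324, "loop": 0.81644},
    "P3, 300x300": {"batch": 0.93988, "loop": 1.03884},
}

# Second machine: GPU vs. CPU assembly.
# Assumption: "from 0.31 sec to 0.205 sec" is read as CPU -> GPU.
second_machine = {
    "P3, 300x300": {"gpu": 0.211, "cpu": 0.4232},
    "P1, 1000x1000": {"gpu": 0.205, "cpu": 0.31},
}


def speedup(baseline, candidate):
    """Return how many times faster `candidate` is than `baseline`."""
    return baseline / candidate


first_speedups = {
    case: speedup(t["loop"], t["batch"]) for case, t in first_machine.items()
}
second_speedups = {
    case: speedup(t["cpu"], t["gpu"]) for case, t in second_machine.items()
}

for case, s in first_speedups.items():
    print(f"first machine,  {case}: batch vs. loop speedup {s:.2f}x")
for case, s in second_speedups.items():
    print(f"second machine, {case}: GPU vs. CPU speedup {s:.2f}x")
```

A value below 1 means the batched call was slower than the loop. On the first machine only the 3rd order case comes out above 1, while the second machine shows roughly 1.5x to 2x.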

Here are my results (attached as AssemblyResults.pdf) for nxn grids where
n = range(150, 1350, 100). This is using a GTX 285. What card are you using?
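
Note that range(150, 1350, 100) excludes its upper bound, so the largest grid is n = 1250. The sketch below just expands the range. The vertex count is an assumption: it takes n to be the number of cells per side, with 1st order Lagrange elements (one unknown per vertex), which the message does not state.

```python
# Grid sizes as given in the message; range() excludes the stop value.
grid_sizes = list(range(150, 1350, 100))

# Assumption (not stated in the message): n counts cells per side and the
# elements are 1st order Lagrange, so there is one unknown per mesh vertex.
unknowns = [(n + 1) ** 2 for n in grid_sizes]

for n, dofs in zip(grid_sizes, unknowns):
    print(f"n = {n:4d}: {dofs} unknowns (under the assumption above)")
```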

   Matt

> --
> Fredrik
>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
-------------- next part --------------
A non-text attachment was scrubbed...
Name: AssemblyResults.pdf
Type: application/pdf
Size: 63138 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20111201/047edf0c/attachment-0001.pdf>