[petsc-users] Questions about setting values for GPU based matrices

Fredrik Heffer Valdmanis fredva at ifi.uio.no
Fri Dec 2 04:54:20 CST 2011


2011/12/2 Fredrik Heffer Valdmanis <fredva at ifi.uio.no>

> 2011/12/1 Matthew Knepley <knepley at gmail.com>
>
>> On Thu, Dec 1, 2011 at 5:39 AM, Fredrik Heffer Valdmanis <
>> fredva at ifi.uio.no> wrote:
>>
>>>
>>> 2011/11/29 Matthew Knepley <knepley at gmail.com>
>>>
>>>> On Tue, Nov 29, 2011 at 10:37 AM, Fredrik Heffer Valdmanis <
>>>> fredva at ifi.uio.no> wrote:
>>>>
>>>>> 2011/11/29 Matthew Knepley <knepley at gmail.com>
>>>>>
>>>>>> On Tue, Nov 29, 2011 at 2:38 AM, Fredrik Heffer Valdmanis <
>>>>>> fredva at ifi.uio.no> wrote:
>>>>>>
>>>>>>> 2011/10/28 Matthew Knepley <knepley at gmail.com>
>>>>>>>
>>>>>>>> On Fri, Oct 28, 2011 at 10:24 AM, Fredrik Heffer Valdmanis <
>>>>>>>> fredva at ifi.uio.no> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I am working on integrating the new GPU based vectors and matrices
>>>>>>>>> into FEniCS. Now, I'm looking at the possibility for getting some speedup
>>>>>>>>> during finite element assembly, specifically when inserting the local
>>>>>>>>> element matrix into the global element matrix. In that regard, I have a few
>>>>>>>>> questions I hope you can help me out with:
>>>>>>>>>
>>>>>>>>> - When calling MatSetValues with a MATSEQAIJCUSP matrix as
>>>>>>>>> parameter, what exactly is it that happens? As far as I can see,
>>>>>>>>> MatSetValues is not implemented for GPU based matrices, neither is
>>>>>>>>> the mat->ops->setvalues set to point at any function for this Mat type.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Yes, MatSetValues always operates on the CPU side. It would not
>>>>>>>> make sense to do individual operations on the GPU.
>>>>>>>>
>>>>>>>> I have written batched assembly for element matrices that are
>>>>>>>> all the same size:
>>>>>>>>
>>>>>>>>
>>>>>>>> http://www.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/docs/manualpages/Mat/MatSetValuesBatch.html
>>>>>>>>
>>>>>>>>
>>>>>>>>> - Is it such that matrices are assembled in their entirety on the
>>>>>>>>> CPU, and then copied over to the GPU (after calling MatAssemblyBegin)? Or
>>>>>>>>> are values copied over to the GPU each time you call MatSetValues?
>>>>>>>>>
>>>>>>>>
>>>>>>>> That function assembles the matrix on the GPU and then copies to
>>>>>>>> the CPU. The only time you do not want this copy is when
>>>>>>>> you are running in serial and never touch the matrix afterwards, so
>>>>>>>> I left it in.
>>>>>>>>
>>>>>>>>
>>>>>>>>> - Can we expect to see any speedup from using MatSetValuesBatch
>>>>>>>>> over MatSetValues, or is the batch version simply a utility function? This
>>>>>>>>> question goes for both CPU- and GPU-based matrices.
>>>>>>>>>
>>>>>>>>
>>>>>>>> CPU: no
>>>>>>>>
>>>>>>>> GPU: yes, I see a speedup of about the memory bandwidth ratio
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>
>>>>>>> I have now integrated MatSetValuesBatch in our existing PETSc
>>>>>>> wrapper layer. I have tested matrix assembly with Poisson's equation on
>>>>>>> different meshes with elements of varying order. I have timed the single
>>>>>>> call to MatSetValuesBatch and compared that to the total time consumed by
>>>>>>> the repeated calls to MatSetValues in the old implementation. I have the
>>>>>>> following results:
>>>>>>>
>>>>>>> Poisson on 1000x1000 unit square, 1st order Lagrange elements:
>>>>>>> MatSetValuesBatch: 0.88576 s
>>>>>>> repeated calls to MatSetValues: 0.76654 s
>>>>>>>
>>>>>>> Poisson on 500x500 unit square, 2nd order Lagrange elements:
>>>>>>> MatSetValuesBatch: 0.9324 s
>>>>>>> repeated calls to MatSetValues: 0.81644 s
>>>>>>>
>>>>>>> Poisson on 300x300 unit square, 3rd order Lagrange elements:
>>>>>>> MatSetValuesBatch: 0.93988 s
>>>>>>> repeated calls to MatSetValues: 1.03884 s
>>>>>>>
>>>>>>> As you can see, the two methods take almost the same amount of time.
>>>>>>> What behavior and performance should we expect? Is there any way to
>>>>>>> optimize the performance of batched assembly?
>>>>>>>
>>>>>>
>>>>>> Almost certainly it is not dispatching to the CUDA version. The
>>>>>> regular version just calls MatSetValues() in a loop. Are you
>>>>>> using a SEQAIJCUSP matrix?
>>>>>>
>>>>>  Yes. The same matrices yield a speedup of 4-6x when solving the
>>>>> system on the GPU.
>>>>>
>>>>
>>>> Please confirm that the correct routine is called by running with -info
>>>> and sending the output.
>>>>
>>>> Please send the output of -log_summary so I can confirm the results.
>>>>
>>>> You can run KSP ex4 and reproduce my results where I see a 5.5x speedup
>>>> on the GTX285
>>>>
>>>> I am not sure what to look for in those outputs. I have uploaded the
>>> output of running my assembly program with -info and -log_summary, and the
>>> output of running ex4 with -log_summary. See
>>>
>>> http://folk.uio.no/fredva/assembly_info.txt
>>> http://folk.uio.no/fredva/assembly_log_summary.txt
>>> http://folk.uio.no/fredva/ex4_log_summary.txt
>>>
>>> Trying this on a different machine now, I actually see some speedup. 3rd
>>> order Poisson on 300x300 assembles in 0.211 sec on GPU and 0.4232 sec on
>>> CPU. For 1st order and 1000x1000 mesh, I go from 0.31 sec to 0.205 sec.
>>> I have tried to increase the mesh size to see if the speedup increases,
>>> but I hit the bad_alloc error pretty quickly.
>>>
>>> For a problem of that size, should I expect even more speedup? Please
>>> let me know if you need any more output from test runs on my machine.
>>>
>>
>> Here are my results for nxn grids where n = range(150, 1350, 100). This
>> is using a GTX 285. What card are you using?
>>
>> I realize now that I was including the time it takes to construct the
> large flattened array of values that is sent to MatSetValuesBatch. I
> assume of course that you only time MatSetValues/MatSetValuesBatch
> completely isolated. If I do this, I get significant speedup as well. Sorry
> for the confusion here.
>
> Still, this construction has to be done somehow in order to have
> meaningful data to pass to MatSetValuesBatch. The way I do this is
> apparently almost as costly as calling MatSetValues for each local matrix.
>
> Have you got any ideas on how to speed up the construction of the values
> array? This has to be done very efficiently in order for batch assembly to
> yield any speedup overall.
>
> Arg, disregard last transmission! I was confusing myself with timings from
several runs, and the "significant speedup" I referred to was seen when I
timed things very badly. The numbers from yesterday's mail are correct,
those were obtained using a GTX 280. That is, 30%-50% speedup on Poisson 2D
on different meshes.
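
As a quick check on the 30%-50% figure, the timings quoted above (0.4232 s
vs 0.211 s for 3rd order on 300x300, and 0.31 s vs 0.205 s for 1st order on
1000x1000) can be run through the small C sketch below. The helper names are
made up for illustration, and nothing here calls PETSc.

```c
#include <assert.h>

/* Speedup factor: how many times faster the new timing is. */
static double speedup(double t_old, double t_new)
{
    assert(t_new > 0.0);
    return t_old / t_new;
}

/* Fraction of the old run time that was saved (0.5 means 50% less time). */
static double time_saved(double t_old, double t_new)
{
    assert(t_old > 0.0);
    return (t_old - t_new) / t_old;
}
```

With the numbers above, speedup(0.4232, 0.211) is about 2.0 (roughly 50% less
time) and speedup(0.31, 0.205) is about 1.5 (roughly 34% less time).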

The question from my previous email remains, though: we need to speed up the
construction of the values array to get good speedup overall.
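
One direction that might help, sketched below under several assumptions:
allocate the flat arrays once and let the element kernel write each local
matrix straight into its slot, so there is no separate local matrix and no
extra copy. The BatchBuffer type and batch_* helpers are hypothetical names,
plain int/double stand in for PetscInt/PetscScalar, and the layout (Ne*Nl row
indices, Ne*Nl*Nl values, row-major per element) is my reading of the
MatSetValuesBatch manual page. Whether this actually beats the current
approach would need to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical container for the two flat arrays a batched insert consumes. */
typedef struct {
    int     ne;    /* number of elements (Ne) */
    int     nl;    /* rows (= columns) per element matrix (Nl) */
    int    *rows;  /* ne*nl global row indices */
    double *vals;  /* ne*nl*nl entries, row-major within each element */
} BatchBuffer;

/* Allocate both arrays once, up front. Returns 0 on success, -1 on failure. */
static int batch_alloc(BatchBuffer *b, int ne, int nl)
{
    assert(b != NULL && ne > 0 && nl > 0);
    b->ne   = ne;
    b->nl   = nl;
    b->rows = malloc((size_t)ne * nl * sizeof *b->rows);
    b->vals = calloc((size_t)ne * nl * nl, sizeof *b->vals);
    if (b->rows == NULL || b->vals == NULL) {
        free(b->rows);
        free(b->vals);
        b->rows = NULL;
        b->vals = NULL;
        return -1;
    }
    return 0;
}

/* Slot for element e's nl*nl matrix; the kernel writes into it directly. */
static double *batch_elem_vals(BatchBuffer *b, int e)
{
    assert(e >= 0 && e < b->ne);
    return b->vals + (size_t)e * b->nl * b->nl;
}

/* Slot for element e's nl global row indices. */
static int *batch_elem_rows(BatchBuffer *b, int e)
{
    assert(e >= 0 && e < b->ne);
    return b->rows + (size_t)e * b->nl;
}

/* Release both arrays. */
static void batch_free(BatchBuffer *b)
{
    free(b->rows);
    free(b->vals);
    b->rows = NULL;
    b->vals = NULL;
}
```

The assembly loop would then call batch_elem_vals(&buf, e) and
batch_elem_rows(&buf, e) for each cell, hand those pointers to the element
tabulation routine, and pass buf.rows and buf.vals to the batch call once at
the end.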

Sorry for the spamming,

Fredrik