Re: Slow MatSetValues

Lars Rindorf Lars.Rindorf at teknologisk.dk
Fri May 30 06:44:12 CDT 2008


Hi everybody

Thanks for all the suggestions and help. The problem turns out to be of a somewhat different nature. I use only direct solvers, so I give the options "-ksp_type preonly -pc_type lu" to perform a standard LU factorization. This works fine without any problems. If I additionally set "-mat_type umfpack" to use UMFPACK, then MatSetValues is very, very slow (about 50 times slower). If, as a test, I call MatAssemblyBegin and MatAssemblyEnd before MatSetValues and only use the LU (no UMFPACK), then the performance is similarly slow.
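
For concreteness, the test mentioned above looks roughly like this (a sketch only; I assume MAT_FINAL_ASSEMBLY here, and M->M is the matrix handle from the assembly loop quoted further down):

    /* test: finalize assembly once up front, then insert values as usual */
    ierr = MatAssemblyBegin(M->M, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    ierr = MatAssemblyEnd(M->M, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    /* ... the element loop calling MatSetValues(M->M,1,&i,1,&j,&tmp,ADD_VALUES)
       follows here; with the two calls above in place, the plain LU case
       assembles about as slowly as the -mat_type umfpack case ... */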

My code is otherwise identical in its PETSc setup to that at http://www-unix.mcs.anl.gov/petsc/petsc-2/snapshots/petsc-current/src/ksp/ksp/examples/tutorials/ex8.c.html.

There is no need to invoke MatAssemblyBegin() with the argument MAT_FLUSH_ASSEMBLY since MatSetValues is only given the ADD_VALUES argument. So it is not that.

Is there some conflict between the matrix format used by UMFPACK and something else?

KR, Lars





-----Original Message-----
From: owner-petsc-users at mcs.anl.gov [mailto:owner-petsc-users at mcs.anl.gov] On behalf of Barry Smith
Sent: 30 May 2008 03:21
To: petsc-users at mcs.anl.gov
Subject: Re: Slow MatSetValues


   I realize I made a mistake for three dimensions below; when nodes share an edge in 3d they will be over-counted. The fix is to have another array with one entry per edge that gives the number of elements that contain that edge. Then use

                 if node1 and node2 share an edge then set t = 1/elementsperedge[edge that connects node1 and node2]
                 else if node1 and node2 share a face in 3d and that face in 3d is not a boundary face then set t = .5 (this prevents double counting of these couplings)
                 else set t = 1.0

    This increases the complexity of the code a bit but is still very rapid.
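
   In code, the counting loop (with this edge correction folded in) might look roughly like the sketch below. Only the Vec calls are PETSc; the function name CountCouplings and the mesh arrays and queries (element_nodes, nodes_per_el, elementsperedge, edge_index(), shared_edge(), shared_interior_face()) are placeholders for whatever the application already provides, local (ghosted) node numbering is assumed throughout, and ghostedowner is the array obtained from the ghostedowner vector.

    PetscErrorCode CountCouplings(PetscInt nelements,PetscInt nodes_per_el,
                                  const PetscInt *element_nodes,
                                  const PetscScalar *ghostedowner,
                                  Vec ghostedon,Vec ghostedoff)
    {
      PetscErrorCode ierr;
      PetscScalar    *gon,*goff,t;
      PetscInt       e,a,b,n1,n2;

      ierr = VecGetArray(ghostedon,&gon);CHKERRQ(ierr);
      ierr = VecGetArray(ghostedoff,&goff);CHKERRQ(ierr);
      for (e=0; e<nelements; e++) {
        const PetscInt *enodes = element_nodes + e*nodes_per_el;
        for (a=0; a<nodes_per_el; a++) {
          for (b=a+1; b<nodes_per_el; b++) {
            n1 = enodes[a]; n2 = enodes[b];
            /* weight so couplings seen by several elements are not over-counted */
            if (shared_edge(n1,n2)) {
              t = 1.0/elementsperedge[edge_index(n1,n2)];  /* 3d edge fix above */
            } else if (shared_interior_face(n1,n2)) {
              t = 0.5;                   /* interior face: seen by two elements */
            } else {
              t = 1.0;
            }
            if (ghostedowner[n1] == ghostedowner[n2]) {    /* same owning process */
              gon[n1]  += t; gon[n2]  += t;
            } else {                                       /* different owners */
              goff[n1] += t; goff[n2] += t;
            }
          }
        }
      }
      ierr = VecRestoreArray(ghostedon,&gon);CHKERRQ(ierr);
      ierr = VecRestoreArray(ghostedoff,&goff);CHKERRQ(ierr);
      return 0;
    }

   After the VecScatter add back into the on and off vectors (described in the message below), their entries, rounded to the nearest integer, give the per-row d_nnz and o_nnz counts to hand to MatCreateMPIAIJ().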

    Barry


On May 29, 2008, at 4:49 PM, Barry Smith wrote:

>
>  Partition the elements across the processes,
>
>   then partition the nodes across processes (try to make sure that 
> each node is on the same process as at least one of its elements),
>
>   create
>      1) three parallel vectors with the number of locally owned nodes on 
> each process;
>          call these vectors off and on and owner; fill the on vector 
> with a 1 in each location, and fill the owner vector with the process 
> rank in each location
>      2) three sequential vectors on each process with the total number 
> of nodes of all the elements of that process (this is the locally 
> owned plus ghosted nodes)
>           call these vectors ghostedoff and ghostedon and ghostedowner
>      3) a VecScatter from the "locally owned plus ghosted nodes" to 
> the "locally owned nodes"
>      [you need these anyway for the numerical part of the code when 
> you evaluate your nonlinear functions (or the right hand side for 
> linear problems)]
>
>  scatter the owner vector to the ghostedowner vector. Now on each 
> process loop over the locally owned ELEMENTS:
>      for each node1 in that element
>           for each node2 in that element (excluding the node1 in the 
> outer loop)
>                 if node1 and node2 share an edge (face in 3d) and that 
> edge (face in 3d) is not a boundary edge (face in 3d) set t = .5 
> (this prevents double counting of these couplings)
>                 else set t = 1.0
>                 if node1 and node2 are both owned by the same 
> process** add t into ghostedon at both the node1 location and the 
> node2 location
>                 if node1 and node2 are owned by different processes 
> add t into ghostedoff at both the node1 and node2 locations
>
>   Do a VecScatter add from the ghostedoff and ghostedon into the off 
> and on.
>
>   The off and on vectors now contain exactly the preallocation needed 
> on each process.
>
>   The amount of work required is proportional to the number of 
> elements times the (number of nodes on an element)^2, and the amount 
> of memory needed is roughly three global vectors and three local 
> vectors. This is much less work and memory than is needed in the 
> numerical part of the code, and hence it is very efficient. In fact 
> it is likely much cheaper than a single nonlinear function evaluation.
>
>    Barry
>
> ** two nodes are owned by the same process if ghostedowner of node1 
> matches ghostedowner of node2
>
> On May 29, 2008, at 3:50 PM, Billy Araújo wrote:
>
>>
>> Hi,
>>
>> I just want to share my experience with FE assembly.
>> I think the problem with preallocation for finite element matrices is 
>> that you don't know how many elements are connected to a given node; 
>> there can be 5, 20 elements or more. You can build a structure with 
>> the number of nodes connected to each node and then preallocate the 
>> matrix, but this is not very efficient.
>>
>> I know UMFPACK has a method of forming triplets with the matrix 
>> information, and then it has routines to add duplicate entries and 
>> compress the data into a compressed matrix format, although I have 
>> never used UMFPACK with PETSc. I also don't know whether there are 
>> similar functions in PETSc optimized for FE matrix assembly.
>>
>> Regards,
>>
>> Billy.
>>
>>
>>
>> -----Original Message-----
>> From: owner-petsc-users at mcs.anl.gov on behalf of Barry Smith
>> Sent: Wed 28-05-2008 16:03
>> To: petsc-users at mcs.anl.gov
>> Subject: Re: Slow MatSetValues
>>
>>
>> http://www-unix.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/docs/manual.pdf#sec_matsparse
>> http://www-unix.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/docs/manualpages/Mat/MatCreateMPIAIJ.html
>>
>> Also, slightly less important: collapse the four MatSetValues() calls 
>> below into a single call that sets the little two by two block (see 
>> the sketch at the very end of this thread)
>>
>>    Barry
>>
>> On May 28, 2008, at 9:07 AM, Lars Rindorf wrote:
>>
>> > Hi everybody
>> >
>> > I have a problem with MatSetValues, since building my matrix takes 
>> > much longer (35 s) than solving the system (0.2 s). When the number 
>> > of degrees of freedom is increased, the problem worsens. The rate 
>> > at which the elements of the (sparse) matrix are set also seems to 
>> > decrease with the number of elements already set. That is, it 
>> > becomes slower near the end.
>> >
>> > The structure of my program is something like:
>> >
>> > for element in finite elements
>> >     for dof in element
>> >         for equations in FEM formulation
>> >             ierr = MatSetValues(M->M,1,&i,1,&j,&tmp,ADD_VALUES);
>> >             ierr = MatSetValues(M->M,1,&k,1,&l,&tmp,ADD_VALUES);
>> >             ierr = MatSetValues(M->M,1,&i,1,&l,&tmp,ADD_VALUES);
>> >             ierr = MatSetValues(M->M,1,&k,1,&j,&tmp,ADD_VALUES);
>> >
>> >
>> > where i,j,k,l are appropriate integers and tmp is a double value to 
>> > be added.
>> >
>> > The code has worked fine with a previous version of PETSc (not 
>> > compiled by me). The version of PETSc that I use is slightly newer 
>> > (I think), 2.3.3 vs. ~2.3.
>> >
>> > Is it some kind of dynamic allocation problem? I have tried using 
>> > MatSetValuesBlocked, but this is only slightly faster. If I monitor 
>> > the program's CPU and memory consumption, the CPU is 100% used 
>> > and the memory consumption is only 20-30 MB.
>> >
>> > My computer runs Red Hat Linux on a quad-core Xeon processor. I 
>> > use Intel's MKL BLAS and LAPACK.
>> >
>> > What should I do to speed up PETSc?
>> >
>> > Kind regards
>> > Lars
>> > _____________________________
>> >
>> >
>> > Lars Rindorf
>> > M.Sc., Ph.D.
>> >
>> > http://www.dti.dk
>> >
>> > Danish Technological Institute
>> > Gregersensvej
>> >
>> > 2630 Taastrup
>> >
>> > Denmark
>> > Phone +45 72 20 20 00
>> >
>> >
>>
>>
>
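
For reference, the "single call for the little two by two block" that Barry suggests above would look something like the sketch below (a sketch only; i, j, k, l, tmp, ierr and the M->M handle are the ones from the quoted assembly loop, and vals holds the four contributions in row-major order):

    PetscInt    rows[2],cols[2];
    PetscScalar vals[4];

    rows[0] = i; rows[1] = k;
    cols[0] = j; cols[1] = l;
    /* row-major by default: vals[0] -> (i,j), vals[1] -> (i,l),
       vals[2] -> (k,j), vals[3] -> (k,l); fill each with the tmp
       contribution that the four separate calls were adding */
    ierr = MatSetValues(M->M,2,rows,2,cols,vals,ADD_VALUES);CHKERRQ(ierr);

This only trims call overhead, though; as the quoted links indicate, getting the preallocation right (MatCreateSeqAIJ/MatCreateMPIAIJ with per-row nonzero counts) is the main fix.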



