[petsc-users] Best practices for solving Dense Linear systems
Nidish
nb25 at rice.edu
Sat Aug 8 16:55:46 CDT 2020
On 8/7/20 12:55 PM, Barry Smith wrote:
>
>
>> On Aug 7, 2020, at 12:26 PM, Nidish <nb25 at rice.edu> wrote:
>>
>>
>> On 8/7/20 8:52 AM, Barry Smith wrote:
>>>
>>>
>>>> On Aug 7, 2020, at 1:25 AM, Nidish <nb25 at rice.edu> wrote:
>>>>
>>>> Indeed - I was just using the default solver (GMRES with ILU).
>>>>
>>>> Using just standard LU (direct solve with "-pc_type lu -ksp_type
>>>> preonly"), I find elemental to be extremely slow even for a
>>>> 1000x1000 matrix.
>>>>
>>>
>>> What about on one process?
>> On just one process the performance is comparable.
>>>
>>> Elemental generally won't be competitive for such tiny matrices.
>>>>
>>>> For MPIAIJ it throws an error if I try "-pc_type lu".
>>>>
>>>
>>> Yes, there is no PETSc code for a sparse parallel direct solve, so
>>> this is expected.
>>>
>>> What about the following?
>>>
>>>> mpirun -n 1 ./ksps -N 1000 -mat_type mpidense -pc_type jacobi
>>>>
>>>> mpirun -n 4 ./ksps -N 1000 -mat_type mpidense -pc_type jacobi
>>>>
>> Same results - the elemental version is MUCH slower (for 1000x1000).
>>> Where will your dense matrices come from, and how big will they
>>> be in practice? This will help determine whether an iterative solver
>>> is appropriate. If they will be 100,000 by 100,000, for example, then
>>> testing with 1000 will tell you nothing useful; you need to test with
>>> the problem size you care about.
>>
>> The matrices in my application arise from substructuring/Component
>> Mode Synthesis conducted on a system that is linear "almost
>> everywhere", for example jointed systems. The procedure we follow is:
>> build a mesh and identify the nodes corresponding to the interfaces,
>> then reduce the model using component mode synthesis to obtain a
>> representation of the system using just the interface
>> degrees-of-freedom along with some (~10s of) generalized "modal
>> coordinates". We conduct the non-linear analyses (transient,
>> steady-state harmonic, etc.) using these matrices.
>>
>> I am interested in conducting a non-linear mesh-convergence study for
>> a particular system of interest wherein the interface DoFs number
>> approximately 4000, 8000, 12000, and 16000. I'm fairly certain the
>> dense matrices will not be larger. The
>>
>
> Ok, so it is not clear how well conditioned these dense matrices
> will be.
>
> There are three questions that need to be answered.
>
> 1) For your problem, can iterative methods be used, and will they
> require less work than direct solvers?
>
> For direct LU the work is order N^3 for the factorization,
> with a relatively small constant. Because of the smart organization
> inside dense LU, the flops can be done very efficiently.
>
> For GMRES with Jacobi preconditioning the work is order N^2
> (the time for a dense matrix-vector product) for each iteration. If
> the number of iterations is small then the total work is much less
> than for a direct solver. In the worst case the number of iterations
> is order N, so the total work is order N^3, the same order as a direct
> method. But the efficiency of a dense matrix-vector product is much
> lower than the efficiency of an LU factorization, so even if the work
> is the same order it can take longer. One should use mpidense as the
> matrix format for iterative methods.
>
> With iterative methods YOU get to decide how accurate you need
> your solution to be; you do this by setting how small you want the
> residual to be (since you can't directly control the error). By
> default PETSc uses a relative decrease in the residual of 1e-5.
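As an aside on controlling that tolerance: the relative residual
tolerance can be tightened from the command line with -ksp_rtol, and
-ksp_monitor prints the residual at each iteration (the same settings
are available in code through KSPSetTolerances()). For example, with
the ksps driver from this thread:

    mpirun -n 4 ./ksps -N 1000 -mat_type mpidense -ksp_type gmres \
        -pc_type jacobi -ksp_rtol 1e-8 -ksp_monitor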
>
> 2) For your size problems, can parallelism help?
>
> I think it should, but Elemental, since it requires a different data
> layout, has additional overhead getting the data into the optimal
> format for parallelism.
>
> 3) Can parallelism help on YOUR machine? Just because a machine has
> multiple cores does not mean it can utilize them efficiently for
> solvers if the total machine memory bandwidth is limited.
>
> So the first thing to do, on the machine you plan to use for your
> computations, is to run the streams benchmark discussed in
> https://www.mcs.anl.gov/petsc/documentation/faq.html#computers. This
> will give us some general idea of how much parallelism you can take
> advantage of. Is the machine a parallel cluster or just a single node?
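For reference, the streams benchmark mentioned above can be run from
the top of a configured PETSc source tree (NPMAX sets the largest
process count tried; exact usage may differ slightly between PETSc
versions):

    cd $PETSC_DIR
    make streams NPMAX=4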
>
> After this I'll give you a few specific cases to run to get a
> feeling for what approach would be best for your problems.
>
> Barry
>
Thank you for the responses. Here's a pointwise response to your queries:
1) I am presently working with random matrices (with a large constant
value on the diagonal to ensure diagonal dominance) before I start
working with the matrices from my system. At the end of the day, the
matrices I expect to be using can be thought of as Schur complements
of a Laplacian operator.
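A minimal sketch of assembling such a test matrix in PETSc is below;
the size N and the magnitude of the diagonal shift are illustrative,
and this is not the actual ksps.cpp attached earlier:

    #include <petscmat.h>
    /* ... after PetscInitialize() ... */
    Mat      A;
    PetscInt N = 1000;                     /* illustrative size          */
    MatCreateDense(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE,
                   N, N, NULL, &A);        /* MPIDENSE on >1 process     */
    MatSetRandom(A, NULL);                 /* fill with random entries   */
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
    MatShift(A, (PetscScalar)N);           /* add N*I: diagonal dominance */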
2) Since my application is joint dynamics, I have a non-linear function
that has to be evaluated at quadrature locations on a 2D mesh and
integrated to form the residual vector as well as the Jacobian matrices.
I therefore definitely expect some potential for speedup in the function
evaluations.
Since the matrices I will end up with will be dense (at least for static
simulations), I wanted guidance on the best solver options for my
problem.
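For context on where those dense Jacobians would eventually plug in,
here is a rough sketch of the usual SNES wiring. FormResidual and
FormJacobian are hypothetical callbacks standing in for the quadrature
evaluation described above; this is not the code from this thread:

    #include <petscsnes.h>

    /* Hypothetical callbacks: evaluate the residual and the (dense)
       Jacobian from the quadrature-point data. */
    extern PetscErrorCode FormResidual(SNES, Vec, Vec, void *);
    extern PetscErrorCode FormJacobian(SNES, Vec, Mat, Mat, void *);

    /* ... after PetscInitialize(), with N set to the interface size ... */
    SNES snes;
    Mat  J;
    Vec  u, r;
    MatCreateDense(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE,
                   N, N, NULL, &J);
    MatCreateVecs(J, &u, &r);
    SNESCreate(PETSC_COMM_WORLD, &snes);
    SNESSetFunction(snes, r, FormResidual, NULL);
    SNESSetJacobian(snes, J, J, FormJacobian, NULL);
    SNESSetFromOptions(snes);   /* inner KSP/PC still chosen at run time,
                                   e.g. -ksp_type gmres -pc_type jacobi */
    SNESSolve(snes, NULL, u);

With this setup the linear-solver experiments discussed earlier apply
unchanged to the inner solves of each Newton iteration.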
3) I am presently on an octa-core (4 physical cores) machine with 16
GB of RAM. I plan to conduct code development and benchmarking on this
machine before I start running larger models on a cluster I have access
to.
I was unable to run the streams benchmark on the cluster (PETSc 3.11.1
is installed there, and the benchmarks in the git directory were giving
issues), but I was able to run it on my local machine - here's the
output:
scaling.log (processes, bandwidth, speedup over 1 process):
1 13697.5004 Rate (MB/s)
2 13021.9505 Rate (MB/s) 0.950681
3 12402.6925 Rate (MB/s) 0.905471
4 12309.1712 Rate (MB/s) 0.898644
Could you point me to the part in the documentation that speaks about
the different options available for dealing with dense matrices? I just
realized that bindings for MUMPS are available in PETSc.
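If it helps: MUMPS is a sparse direct solver, so in PETSc it pairs with
the (MPI)AIJ format rather than the dense formats. A sketch of
selecting it with the ksps driver (option name as of PETSc 3.9 and
later):

    mpirun -n 4 ./ksps -N 1000 -mat_type mpiaij -ksp_type preonly \
        -pc_type lu -pc_factor_mat_solver_type mumps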
Thank you very much,
Nidish
>
>> However, for frequency-domain simulations, we use matrices that are
>> about 10 times the size of the original matrices (whose meshes have
>> been shown to be convergent in static test cases).
>>
>> Thank you,
>> Nidish
>>
>>>
>>> Barry
>>>
>>>> I'm attaching the code here, in case you'd like to have a look at
>>>> what I've been trying to do.
>>>>
>>>> The two configurations of interest are,
>>>>
>>>> $> mpirun -n 4 ./ksps -N 1000 -mat_type mpiaij
>>>> $> mpirun -n 4 ./ksps -N 1000 -mat_type elemental
>>>>
>>>> (for the GMRES with ILU) and,
>>>>
>>>> $> mpirun -n 4 ./ksps -N 1000 -mat_type mpiaij -pc_type lu
>>>> -ksp_type preonly
>>>> $> mpirun -n 4 ./ksps -N 1000 -mat_type elemental -pc_type lu
>>>> -ksp_type preonly
>>>>
>>>> elemental seems to perform poorly in both cases.
>>>>
>>>> Nidish
>>>>
>>>> On 8/7/20 12:50 AM, Barry Smith wrote:
>>>>>
>>>>> What is the output of -ksp_view for the two cases?
>>>>>
>>>>> It is not only the matrix format but also the matrix solver that
>>>>> matters. For example, if you are using an iterative solver, the
>>>>> Elemental format won't be faster; you should use the PETSc
>>>>> MPIDENSE format. The Elemental format is really intended for use
>>>>> with a direct LU solver for the matrix. For tiny matrices like
>>>>> this an iterative solver could easily be faster than the direct
>>>>> solver; it depends on the conditioning (eigenstructure) of the
>>>>> dense matrix. Also, the default PETSc solver uses block Jacobi
>>>>> with ILU on each process if you are using a sparse format, and ILU
>>>>> applied to a dense matrix is actually LU, so your solver probably
>>>>> also differs between the MPIAIJ and the Elemental runs.
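For reference, that diagnostic output can be obtained by adding
-ksp_view to the earlier runs (and -log_view additionally reports where
the time is spent):

    mpirun -n 4 ./ksps -N 1000 -mat_type mpiaij -ksp_view -log_view
    mpirun -n 4 ./ksps -N 1000 -mat_type elemental -ksp_view -log_view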
>>>>>
>>>>> Barry
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> On Aug 7, 2020, at 12:30 AM, Nidish <nb25 at rice.edu> wrote:
>>>>>>
>>>>>> Thank you for the response.
>>>>>>
>>>>>> I've just been running some tests with matrices up to 2e4
>>>>>> dimensions (dense). When I compared the solution times for
>>>>>> "-mat_type elemental" and "-mat_type mpiaij" running with 4
>>>>>> cores, I found the mpidense versions running way faster than
>>>>>> elemental. I have not been able to make the elemental version
>>>>>> finish up for 2e4 so far (my patience runs out faster).
>>>>>>
>>>>>> What's going on here? I thought elemental was supposed to be
>>>>>> superior for dense matrices.
>>>>>>
>>>>>> I can share the code if that's appropriate for this forum (sorry,
>>>>>> I'm new here).
>>>>>>
>>>>>> Nidish
>>>>>> On Aug 6, 2020, at 23:01, Barry Smith <bsmith at petsc.dev> wrote:
>>>>>>
>>>>>> On Aug 6, 2020, at 7:32 PM, Nidish <nb25 at rice.edu> wrote:
>>>>>>
>>>>>> I'm relatively new to PETSc, and my applications involve (for the
>>>>>> most part) dense matrix solves. I read in the documentation that
>>>>>> this is an area PETSc does not specialize in but instead
>>>>>> recommends external libraries such as Elemental. I'm wondering if
>>>>>> there are any "best" practices in this regard. Some questions I'd
>>>>>> like answered are:
>>>>>>
>>>>>> 1. Can I just declare my dense matrix as a sparse one and fill
>>>>>> the whole matrix up? Do any of the others go this route? What're
>>>>>> possible pitfalls/unfavorable outcomes for this? I understand the
>>>>>> memory overhead probably shoots up.
>>>>>>
>>>>>>
>>>>>> No, this isn't practical; the performance will be terrible.
>>>>>>
>>>>>> 2. Are there any specific guidelines on when I can expect
>>>>>> elemental to perform better in parallel than in serial?
>>>>>>
>>>>>>
>>>>>> Because the computation-to-communication ratio for dense matrices
>>>>>> is higher than for sparse ones, you will see better parallel
>>>>>> performance for dense problems of a given size than for sparse
>>>>>> problems of a similar size. In other words, parallelism can help
>>>>>> for dense matrices at relatively small problem sizes; of course,
>>>>>> the specifics of your machine's hardware and software also play a
>>>>>> role.
>>>>>>
>>>>>> Barry
>>>>>>
>>>>>> Of course, I'm interested in any other details that may be
>>>>>> important in this regard.
>>>>>>
>>>>>> Thank you,
>>>>>> Nidish
>>>>>>
>>>>>>
>>>>>
>>>> --
>>>> Nidish
>>>> <ksps.cpp>
>>>
>> --
>> Nidish
>
--
Nidish