[petsc-users] Petsc mesh scalability issue with iterative solver and direct solver

Jinlei Shen jshen25 at jhu.edu
Tue Aug 9 23:17:10 CDT 2016


Thanks for all the replies.

Really appreciate it.

Bests,
Jinlei

On Tue, Aug 9, 2016 at 11:48 AM, Matthew Knepley <knepley at gmail.com> wrote:

> On Tue, Aug 9, 2016 at 9:52 AM, Jinlei Shen <jshen25 at jhu.edu> wrote:
>
>> Hi Barry,
>>
>> Thanks for your answer.
>>
>> But logically, for a large problem we always expect the parallel program
>> to perform better in both speed and memory, since each process
>> independently deals with its own submatrix, especially for an iterative
>> solver, as the CG+BJ results show.
>> I just don't understand why, for CG+ASM and SUPER_LU, the two-process run
>> is the least efficient of all these cases. If this is due to the
>> communication cost compared with a single process, why does the time go
>> back down with three or more processes? I'm new to parallelism; could you
>> speculate on any possible reason for this situation?
>>
>
> As Barry noted, there are two reasons that you get slowdown:
>
>   1) Worse performance of existing algorithm
>
> This comes from things like communication. This is usually a small
> contributor to slowdown.
>
>   2) Different parallel algorithm
>
> This is most likely what is causing your slowdown.
>
>    Matt
>
>
>> Great thanks
>>
>>
>>
>> On Fri, Aug 5, 2016 at 10:09 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>
>>>
>>> > On Aug 5, 2016, at 5:58 PM, Jinlei Shen <jshen25 at jhu.edu> wrote:
>>> >
>>> > Hi,
>>> >
>>> > Thanks for your answers.
>>> >
>>> > I just figured out the issues, which are mainly due to the
>>> ill-conditioning of my matrix. I found the condition number blows up when
>>> the beam is discretized into a large number of elements.
>>> >
>>> > Now, I am using the 1D bar model to solve the same problem. The good
>>> news is the solution is always accurate and stable even when I discretize
>>> it into 10 million elements.
>>> >
>>> > When I run the model in parallel with both the iterative solver
>>> (CG+BJACOBI/ASM) and the direct solver (SUPER_LU), I get the following
>>> results:
>>> >
>>> > Mesh size: 1 million unknowns
>>> > Processes      1      2      4      6      8      10     12     16     20
>>> > CG+BJ          0.36   0.22   0.15   0.12   0.11   0.1    0.096  0.097  0.099
>>> > CG+ASM         0.47   0.46   0.267  0.2    0.17   0.15   0.145  0.16   0.15
>>> > SUPER_LU_DIST  4.73   5.4    4.69   4.58   4.38   4.2    4.27   4.28   4.38
>>> >
>>> > It seems CG+BJ works correctly, i.e. the time decreases quickly with a
>>> few more processes and levels off with many more cores.
>>> >
>>> > However, I have some concerns about CG+ASM and SUPER_LU_DIST. The time
>>> for both methods goes up when I use two processes compared with a single
>>> process.
>>>
>>>    This is actually not surprising at all, but since the mantra is
>>> "parallelism will always make things faster" it can confuse people. When
>>> run with one process, ASM and SuperLU_DIST use essentially sequential
>>> algorithms; when run with two processes they "switch" to parallel
>>> algorithms, which simply are not as good as the essentially sequential
>>> algorithm obtained with one process, hence they run slower. This is just
>>> life; there really isn't anything one can do about it, except perhaps use
>>> a poorer-quality algorithm on one process so that two processes look
>>> better. But the goal of PETSc is not to make parallelism look good, it is
>>> to provide efficient solvers (as best we can) for one and multiple
>>> processes.
>>>
>>>    Barry
>>>
>>>
>>>
>>>
>>> > The tendency is more obvious when I use a larger mesh size.
>>> > I especially doubt the SUPER_LU_DIST results in parallel, since the
>>> overall speedup is very small, which is not what I expected.
>>> > The runtime options I use for the ASM preconditioner and the
>>> SUPER_LU_DIST solver are shown below:
>>> > ASM preconditioner:  -pc_type asm -pc_asm_type basic
>>> > SUPER_LU_DIST solver:   -ksp_type preonly -pc_type lu
>>> -pc_factor_mat_solver_package superlu_dist
>>> >
>>> > I use the same mpiexec -n np ./xxxx for all solvers.
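>>> >
>>> > (For concreteness, an ASM run would look something like "mpiexec -n 4
>>> ./xxxx -pc_type asm -pc_asm_type basic" and a SUPER_LU_DIST run like
>>> "mpiexec -n 4 ./xxxx -ksp_type preonly -pc_type lu
>>> -pc_factor_mat_solver_package superlu_dist", with the KSP type for the
>>> iterative runs set to CG either in the code or via -ksp_type cg.)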
>>> >
>>> > Am I using them correctly? If so, is there any way to speed up the
>>> computation further, especially for SUPER_LU_DIST?
>>> >
>>> > Thank you very much!
>>> >
>>> > Bests,
>>> > Jinlei
>>> >
>>> > On Mon, Aug 1, 2016 at 2:10 PM, Matthew Knepley <knepley at gmail.com>
>>> wrote:
>>> > On Mon, Aug 1, 2016 at 12:52 PM, Jinlei Shen <jshen25 at jhu.edu> wrote:
>>> > Hi Barry,
>>> >
>>> > Thanks for your reply.
>>> >
>>> > First, as you suggested, I checked my program under valgrind. The
>>> results for both the sequential and parallel cases showed no memory
>>> errors.
>>> >
>>> > Second, I coded a sequential program without using PETSc to generate
>>> the global matrix of a small mesh for the same problem. I then compared
>>> the matrix from PETSc (sequential and parallel) against the serial code,
>>> and they are the same.
>>> > The way I assemble the global matrix in parallel is to first
>>> distribute the nodes and elements among the processes, then loop over the
>>> elements owned by the calling process and add each element stiffness into
>>> the global matrix. Since the nodes and elements of the cantilever beam are
>>> numbered consecutively, the connectivity is simple, and I didn't use any
>>> partitioning tool to optimize the mesh. It's also easy to determine the
>>> preallocation d_nnz and o_nnz: each node connects only to its left and
>>> right neighbors (except at the two ends), so the maximum number of
>>> nonzeros in each row is 6. The MatSetValues loop is shown as follows:
>>> >     ! loop over the elements owned by this process
>>> >     do iEL = idElStart, idElEnd
>>> >         ! 0-based global indices of the 4 DOF of element iEL (2 DOF per node)
>>> >         g_EL = (/2*iEL-1-1, 2*iEL-1, 2*iEL+1-1, 2*iEL+2-1/)
>>> >         ! add the 4x4 element stiffness SE into the global matrix SG
>>> >         call MatSetValues(SG, 4, g_EL, 4, g_EL, SE, ADD_VALUES, ierr)
>>> >     end do
>>> > where idElStart and idElEnd are the global numbers of the first and
>>> last elements that the process owns, g_EL holds the global DOF indices of
>>> element iEL, and SE is the element stiffness matrix, which is the same for
>>> all elements.
>>> > With this assembly, most elements are assembled within their own
>>> process, while a few elements cross two processes.
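>>> >
>>> > For reference, the preallocation is set up roughly like the following
>>> sketch (a variable name such as nLocalDof is only illustrative; here I
>>> use simple per-block-row upper bounds instead of the exact d_nnz/o_nnz
>>> arrays, assuming the MPIBAIJ format with block size 2 described in my
>>> first email):
>>> >     call MatCreate(PETSC_COMM_WORLD, SG, ierr)
>>> >     call MatSetSizes(SG, nLocalDof, nLocalDof, PETSC_DETERMINE, PETSC_DETERMINE, ierr)
>>> >     call MatSetType(SG, MATMPIBAIJ, ierr)
>>> >     ! bs = 2; at most 3 coupled node-blocks per block row, at most 1 of
>>> >     ! which can be owned by another process
>>> >     call MatMPIBAIJSetPreallocation(SG, 2, 3, PETSC_NULL_INTEGER, 1, PETSC_NULL_INTEGER, ierr)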
>>> >
>>> > The BC for my problem (a cantilever under an end point load) is to fix
>>> the first two DOF, so I called MatZeroRowsColumns to zero the first two
>>> rows and columns, with the diagonal set to one and without changing the
>>> RHS.
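>>> >
>>> > The call is essentially the one quoted in my first email below, with
>>> g_BC holding the (0-based) global indices of the two fixed DOF and one =
>>> 1.0:
>>> >     call MatZeroRowsColumns(SG, 2, g_BC, one, PETSC_NULL_OBJECT, PETSC_NULL_OBJECT, ierr)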
>>> >
>>> > Now some new issues show up:
>>> >
>>> > I ran with -ksp_monitor_true_residual and -ksp_converged_reason, and
>>> the monitor showed two different residuals: one is the residual norm I
>>> can select (preconditioned, unpreconditioned, natural), and the other is
>>> called the true residual.
>>> >
>>> > I initially thought the true residual is the same as the
>>> unpreconditioned one, based on its definition, but that seems not to be
>>> the case. Is it the norm of the residual (b - Ax) between the computed
>>> RHS and the true RHS? But then how should I understand the
>>> unpreconditioned residual, since its definition is b - Ax as well?
>>> >
>>> > It is the unpreconditioned residual. You must be misinterpreting. We
>>> could determine exactly what is going on if you sent the output with the
>>> suggested options.
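>>> >
>>> > (For reference: with the default left preconditioning, the residual
>>> norm that KSP monitors and tests by default is the preconditioned one,
>>> ||B(b - Ax)||, while the "true" residual printed by
>>> -ksp_monitor_true_residual is the unpreconditioned norm ||b - Ax||.)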
>>> >
>>> > Can I set the true residual as my convergence criterion?
>>> >
>>> > Use right preconditioning.
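>>> >
>>> > (Right preconditioning can be selected at runtime with -ksp_pc_side
>>> right, for KSP types that support it; the convergence test then uses the
>>> unpreconditioned, i.e. true, residual norm.)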
>>> >
>>> > I found the accuracy for a large mesh in my problem doesn't necessarily
>>> depend on the tolerance I set, whether preconditioned or unpreconditioned;
>>> sometimes it reports convergence while the solution is not correct. But
>>> the true residual seems to reflect the actual convergence very well: if
>>> the true residual is diverging, no matter what the other residual says,
>>> the results are bad!
>>> >
>>> > Yes, your preconditioner looks singular. Note that BJACOBI has an
>>> inner solver, and by default that is GMRES/ILU(0). I think ILU(0) is
>>> really ill-conditioned for your problem.
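>>> >
>>> > (The inner solver can be changed at runtime with the -sub_ksp_type and
>>> -sub_pc_type options, e.g. the -sub_pc_type lu mentioned below.)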
>>> >
>>> > Regarding the preconditioner, I actually used BJACOBI before I sent
>>> the first email, since JACOBI or PBJACOBI didn't even converge when the
>>> size was large.
>>> > But BJACOBI also didn't perform well in parallel for a large mesh, as
>>> described in my last email, while it's fine for small sizes (below 10k
>>> elements).
>>> >
>>> > Yesterday, I tried ASM with CG using the runtime options: -pc_type
>>> asm -pc_asm_type basic -sub_pc_type lu (the default is ilu).
>>> > For the 15k-element mesh, I am now able to get the correct answer with
>>> 1-3, 6 and more processes, using either -sub_pc_type lu or ilu.
>>> >
>>> > Yes, LU works for your subdomain solver.
>>> >
>>> > Based on all the results I have obtained, the results vary a lot with
>>> different PCs, and it seems ASM is better for a large problem.
>>> >
>>> > It's not ASM so much as the LU subsolver that is better.
>>> >
>>> > But what is the major factor producing such a difference between
>>> different PCs, since it's not just a matter of computational efficiency
>>> but also of accuracy?
>>> > Also, I noticed that for a large mesh the solution is unstable with a
>>> small number of processes: for the 15k case the solution is not correct
>>> with 4 and 5 processes, yet it is always correct with more than 6
>>> processes. For the 50k mesh case, even more processes are required for
>>> stability.
>>> >
>>> > Yes, partitioning is very important here. Since you do not have a good
>>> partition, you can get these wild variations.
>>> >
>>> >   Thanks,
>>> >
>>> >      Matt
>>> >
>>> > What do you think about this? Is anything wrong?
>>> > Since the parallel iterative solver first computes locally (if this is
>>> correct), could it be that there are 'good' and 'bad' local blocks when
>>> dividing the global matrix, and the result from a 'bad' block contaminates
>>> the global result, while with more processes such a risk is reduced?
>>> >
>>> > I would highly appreciate it if you could give me some guidance on the
>>> above questions.
>>> >
>>> > Thank you very much.
>>> >
>>> > Bests,
>>> > Jinlei
>>> >
>>> >
>>> > On Fri, Jul 29, 2016 at 2:09 PM, Barry Smith <bsmith at mcs.anl.gov>
>>> wrote:
>>> >
>>> >   First, run all the cases under valgrind to make sure there is no
>>> use of uninitialized data or overwriting of data. Go to
>>> http://www.mcs.anl.gov/petsc, follow the link to FAQ, and search for
>>> valgrind (the web server seems to be broken at the moment).
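>>> >
>>> > (A typical invocation is something along the lines of "mpiexec -n 2
>>> valgrind --tool=memcheck -q --num-callers=20 ./xxxx <options>"; see the
>>> FAQ entry for the exact recommended flags.)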
>>> >
>>> >   Second, it is possible that the code that assembles the matrices and
>>> vectors is not assembling them correctly for either the sequential or the
>>> parallel case. Hence a different number of processes could be generating a
>>> different linear system, hence inconsistent results. How are you handling
>>> the parallelism? How do you know the matrix generated in parallel is
>>> identical to the one generated sequentially?
>>> >
>>> > Simple preconditioners such as pbjacobi will converge slower and
>>> slower with more elements.
>>> >
>>> > Note that you should run with -ksp_monitor_true_residual and
>>> -ksp_converged_reason to make sure that the iterative solver is even
>>> converging. By default PETSc KSP solvers do not stop with a big error
>>> message if they do not converge, so you need to make sure they are always
>>> converging.
>>> >
>>> >    Barry
>>> >
>>> >
>>> >
>>> > > On Jul 29, 2016, at 11:46 AM, Jinlei Shen <jshen25 at jhu.edu> wrote:
>>> > >
>>> > > Dear PETSC developers,
>>> > >
>>> > > Thank you for developing such a powerful tool for scientific
>>> computations.
>>> > >
>>> > > I'm currently trying to run a simple cantilever beam FEM to test the
>>> scalability of PETSc on multiple processors. I also want to verify whether
>>> an iterative solver or a direct solver is more efficient for a large
>>> parallel FEM problem.
>>> > >
>>> > > Problem description: an elementary Euler cantilever beam with a point
>>> load at the end along the -y direction. Each node has 2 DOF (deflection
>>> and rotation). MPIBAIJ is used with bs = 2, and dnnz and onnz are
>>> determined based on the connectivity. Each process loops over its elements
>>> to assemble the global matrix with the same element stiffness matrix. The
>>> boundary condition is set using
>>> call MatZeroRowsColumns(SG,2,g_BC,one,PETSC_NULL_OBJECT,PETSC_NULL_OBJECT,ierr)
>>> > >
>>> > > Based on what I have done so far, I find the computations work well,
>>> i.e. the results are correct compared with the theoretical solution, for
>>> small mesh sizes (smaller than 5000 elements) using both solvers with
>>> different numbers of processes.
>>> > >
>>> > > However, there are several confusing issues when I increase the mesh
>>> size to 10000 or more elements with the iterative solver (CG + PCBJACOBI):
>>> > >
>>> > > 1. For 10k elements, I can get an accurate solution using the
>>> iterative solver with a single process. However, when I use 2-8 processes,
>>> it says the linear solver converged (with different iteration counts), but
>>> the results differ between process counts and are erroneous. The weird
>>> thing is that when I use more than 9 processes, the results are correct
>>> again. I am really confused by this. Could you explain why? If my
>>> parallelization is not correct, why does it work for small cases? I also
>>> checked the global matrix and RHS vector and didn't see any mallocs during
>>> the process.
>>> > >
>>> > > 2. For 30k elements, if I use one process, it says: Linear solve did
>>> not converge due to DIVERGED_INDEFINITE_PC. Does this commonly happen for
>>> large sparse matrices? If so, is there any stable solver or PC for large
>>> problems?
>>> > >
>>> > >
>>> > > For parallel computing using the direct solver (SUPERLU_DIST + PCLU),
>>> I can only get accurate results when the number of elements is below 5000.
>>> There must be something wrong. The way I use the superlu_dist solver is to
>>> first convert the MatType to AIJ, then call PCFactorSetMatSolverPackage,
>>> and change the PC to PCLU. Am I missing anything else to run SUPER_LU
>>> correctly?
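>>> > >
>>> > > What I do in code is roughly the following sketch (ksp and pc are the
>>> solver objects; the names are illustrative), which should be equivalent to
>>> the runtime options -ksp_type preonly -pc_type lu
>>> -pc_factor_mat_solver_package superlu_dist:
>>> > >     call KSPSetType(ksp, KSPPREONLY, ierr)
>>> > >     call KSPGetPC(ksp, pc, ierr)
>>> > >     call PCSetType(pc, PCLU, ierr)
>>> > >     call PCFactorSetMatSolverPackage(pc, MATSOLVERSUPERLU_DIST, ierr)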
>>> > >
>>> > >
>>> > > I also used SUPER_LU and the iterative solver (CG+PCBJACOBI) to solve
>>> the sequential version of the same problem. The results show that the
>>> iterative solver works well for <50k elements, while SUPER_LU only gets
>>> the right solution below 5k elements. Can I say the iterative solver is
>>> better than SUPER_LU for large problems? How can I improve the solver to
>>> cope with a very large problem, such as a million-by-million system?
>>> Another thing is that the performance of SUPER_LU is still doubtful.
>>> > >
>>> > > For the inaccuracy issue, do you think it may be due to memory?
>>> However, no memory error shows up during the execution.
>>> > >
>>> > > I would really appreciate it if someone could resolve the puzzles
>>> above for me. My goal is to replace the current SUPER_LU solver in my
>>> parallel CPFEM main program with an iterative solver using PETSc.
>>> > >
>>> > >
>>> > > Please let me know if you would like to see my code in detail.
>>> > >
>>> > > Thank you very much.
>>> > >
>>> > > Bests,
>>> > > Jinlei
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> > -- Norbert Wiener
>>> >
>>>
>>>
>>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>