[petsc-users] Petsc mesh scalability issue with iterative solver and direct solver

Matthew Knepley knepley at gmail.com
Mon Aug 1 13:10:34 CDT 2016


On Mon, Aug 1, 2016 at 12:52 PM, Jinlei Shen <jshen25 at jhu.edu> wrote:

> Hi Barry,
>
> Thanks for your reply.
>
> Firstly, as you suggested, I checked my program under valgrind. The
> results for both the sequential and parallel cases showed that no memory
> errors were detected.
>
> Second, I coded a sequential program without using PETSc to generate the
> global matrix of a small mesh for the same problem. I then compared the
> matrices from PETSc (sequential and parallel) with the one from the serial
> code, and they are the same.
> The way I assemble the global matrix in parallel is to first distribute the
> nodes and elements among the processes, then loop over the elements owned by
> the calling process and add each element stiffness into the global matrix.
> Since the nodes and elements of the cantilever beam are numbered
> consecutively, the connectivity is simple. I didn't use any partitioning tool
> to optimize the mesh. It's also easy to determine the preallocation d_nnz and
> o_nnz, since each node connects only to its left and right neighbors (except
> at the two ends), so the maximum number of nonzeros in each row is 6. The
> MatSetValues loop is shown as follows:
>     do iEL = idElStart, idElEnd
>         g_EL = (/2*iEL-1-1,2*iEL-1,2*iEL+1-1,2*iEL+2-1/)
>         call MatSetValues(SG,4,g_EL,4,g_El,SE,ADD_VALUES,ierr)
>     end do
> where idElStart and idElEnd are the global indices of the first and last
> elements owned by the process, g_EL holds the global (0-based) indices of the
> DOFs of element iEL, and SE is the element stiffness matrix, which is the same
> for all elements.
> With this assembly, most elements are assembled within their own process,
> while a few elements cross two processes.
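>
> As an aside, a minimal sketch of a matching preallocation call for the MPIBAIJ
> case could look like the following (the counts here are simple upper bounds
> rather than the exact d_nnz/o_nnz arrays you computed, and assume each process
> owns a contiguous range of nodes):
>
>     call MatMPIBAIJSetPreallocation(SG,2,3,PETSC_NULL_INTEGER,1,PETSC_NULL_INTEGER,ierr)
>
> i.e. block size 2, at most 3 diagonal-block columns per block row (the node
> itself and its two neighbors), and at most 1 off-diagonal-block column at a
> process boundary.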
>
> The BC for my problem (a cantilever under an end point load) is to fix the
> first two DOFs, so I called MatZeroRowsColumns to zero the first two rows and
> columns, with the diagonal set to one, without changing the RHS.
>
> Now some new issues show up:
>
> I ran with -ksp_monitor_true_residual and -ksp_converged_reason, and the
> monitor showed two different residuals: one is the residual norm I can select
> (preconditioned, unpreconditioned, natural), and the other is called the true
> residual.
>
> I initially thought the true residual was the same as the unpreconditioned
> one, based on its definition, but that does not seem to be the case. Is it the
> norm of the residual (b - Ax) between the computed RHS and the true RHS? But
> then how should I understand the unpreconditioned residual, since its
> definition is b - Ax as well?
>

It is the unpreconditioned residual. You must be misinterpreting. We could
determine exactly what is happening if you sent the output with the suggested
options.
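
For example, everything can be captured in one run with something like the
following (the executable name and process count are just placeholders):

  mpiexec -n 4 ./beam -ksp_type cg -pc_type bjacobi -ksp_monitor_true_residual -ksp_converged_reason -ksp_view

Each monitor line then reports both the residual norm the KSP is using for its
convergence test and the true residual norm ||b - Ax||, along with ||r||/||b||.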


> Can I set the true residual as my convergence criterion?
>

Use right preconditioning.
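
For example, with GMRES you could add -ksp_pc_side right (or call
KSPSetPCSide(ksp,PC_RIGHT,ierr) in the code); with right preconditioning the
norm used in the convergence test is the true residual norm. Note that not
every KSP supports right preconditioning, so check the -ksp_view output to see
what is actually being used.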


> I found that the accuracy for large meshes in my problem didn't necessarily
> depend on the tolerance I set, whether preconditioned or unpreconditioned;
> sometimes it reported convergence while the solution was not correct. But the
> true residual seems to reflect the actual convergence very well: if the true
> residual is diverging, no matter what the first residual says, the results are
> bad!
>

Yes, your preconditioner looks singular. Note that BJACOBI has an inner
solver, and by default that is GMRES/ILU(0). I think
ILU(0) is really ill-conditioned for your problem.
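
You can see exactly what the inner solver is with -ksp_view, and you can change
it from the command line, e.g. -sub_ksp_type preonly -sub_pc_type lu, or
-sub_pc_type ilu -sub_pc_factor_levels 1 to try ILU(1); the -sub_ prefix
addresses the inner KSP/PC of BJACOBI (and of ASM).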


> Regarding the preconditioner: I actually used BJACOBI before I sent the first
> email, since JACOBI and PBJACOBI didn't even converge when the size was large.
> But BJACOBI also didn't perform well in parallel for large meshes, as
> described in my last email, while it is fine for small sizes (below 10k
> elements).
>
> Yesterday, I tried ASM with CG using the runtime options: -pc_type asm
> -pc_asm_type basic -sub_pc_type lu (the default is ilu).
> For the 15k-element mesh, I am now able to get the correct answer with 1-3, 6
> and more processes, using either -sub_pc_type lu or ilu.
>

Yes, LU works for your subdomain solver.


> Based on all the results I have obtained, the results vary a lot with
> different PCs, and it seems ASM is better for large problems.
>

It's not ASM so much as the LU subsolver that is better.
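
If you want to check that, one comparison would be -pc_type bjacobi
-sub_pc_type lu versus -pc_type asm -sub_pc_type lu: block Jacobi is
essentially additive Schwarz with zero overlap, so if both behave similarly
the gain is coming from the LU subsolver rather than from the overlap.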


> But what is the major factor that produces such differences between the PCs?
> It is not just a matter of computational efficiency, but also of accuracy.
> Also, I noticed that for large meshes the solution is unstable with a small
> number of processes: for the 15k case, the solution is not correct with 4 and
> 5 processes, yet it is always correct with more than 6 processes. For the 50k
> mesh, even more processes are required before the results become stable.
>

Yes, partitioning is very important here. Since you do not have a good
partition, you can get these wild variations.
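
As a quick sanity check on how the rows are distributed, you can print what
each process actually owns (a minimal sketch, using your matrix name SG;
rStart and rEnd are PetscInt variables):

  call MatGetOwnershipRange(SG,rStart,rEnd,ierr)
  print *, 'this rank owns rows', rStart, 'to', rEnd-1

and compare that against the element ranges you assign in your assembly loop.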

  Thanks,

     Matt


> What do you think about this? Is anything wrong?
> Since the iterative solver in parallel first computes locally (if this is
> correct), could it be that there are 'good' and 'bad' local problems when the
> global matrix is divided, and the result from a 'bad' local problem
> contaminates the global result, while with more processes such risk is
> reduced?
>
> I would greatly appreciate any guidance you could give me on the questions
> above.
>
> Thank you very much.
>
> Bests,
> Jinlei
>
>
> On Fri, Jul 29, 2016 at 2:09 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
>>
>>   First, run all the cases under valgrind to make sure there is no use of
>> uninitialized data or overwriting of data. Go to
>> http://www.mcs.anl.gov/petsc, follow the link to FAQ, and search for
>> valgrind (the web server seems to be broken at the moment).
>>
>>   Second, it is possible that the code that assembles the matrices and
>> vectors is not assembling them correctly for either the sequential or the
>> parallel case. Hence a different number of processes could be generating a
>> different linear system and hence inconsistent results. How are you handling
>> the parallelism? How do you know the matrix generated in parallel is
>> identical to the one generated sequentially?
>>
>> Simple preconditioners such as pbjacobi will converge slower and slower
>> with more elements.
>>
>> Note that you should run with -ksp_monitor_true_residual and
>> -ksp_converged_reason to make sure that the iterative solver is even
>> converging. By default, PETSc KSP solvers do not stop with a big error
>> message if they do not converge, so you need to make sure they are always
>> converging.
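>>
>> For example, after KSPSolve you can check programmatically with something
>> like (a minimal sketch; ksp is your solver object and reason is declared as
>> a KSPConvergedReason):
>>
>>     call KSPGetConvergedReason(ksp,reason,ierr)
>>     if (reason < 0) then
>>        print *, 'KSP diverged, reason = ', reason
>>     end if
>>
>> or simply run with -ksp_error_if_not_converged to turn non-convergence into
>> a hard error.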
>>
>>    Barry
>>
>>
>>
>> > On Jul 29, 2016, at 11:46 AM, Jinlei Shen <jshen25 at jhu.edu> wrote:
>> >
>> > Dear PETSC developers,
>> >
>> > Thank you for developing such a powerful tool for scientific
>> computations.
>> >
>> > I'm currently running a simple cantilever beam FEM to test the
>> scalability of PETSc on multiple processors. I also want to find out whether
>> an iterative solver or a direct solver is more efficient for large parallel
>> FEM problems.
>> >
>> > Problem description: an elementary Euler cantilever beam with a point
>> load at the end along the -y direction. Each node has 2 DOFs (deflection and
>> rotation). MPIBAIJ is used with bs = 2, and dnnz and onnz are determined based
>> on the connectivity. I loop over the elements in each process to assemble the
>> global matrix with the same element stiffness matrix. The boundary condition
>> is set using call
>> MatZeroRowsColumns(SG,2,g_BC,one,PETSC_NULL_OBJECT,PETSC_NULL_OBJECT,ierr).
>> >
>> > Based on what I have done, I find the computations work well, i.e. the
>> results agree with the theoretical solution, for small mesh sizes (smaller
>> than 5000 elements) using both solvers with different numbers of processes.
>> >
>> > However, several confusing issues appear when I increase the mesh size to
>> 10000 or more elements with the iterative solver (CG + PCBJACOBI):
>> >
>> > 1. For 10k elements, I can get an accurate solution using the iterative
>> solver on a single process. However, when I use 2-8 processes, the linear
>> solver reports convergence after different numbers of iterations, but the
>> results differ across process counts and are erroneous. The weird thing is
>> that when I use more than 9 processes, the results are correct again. I am
>> really confused by this. Could you explain why? If my parallelization is not
>> correct, why does it work for the small cases? I also checked the global
>> matrix and RHS vector and didn't see any mallocs during assembly.
>> >
>> > 2. For 30k elements, if I use one process, it says: Linear solve did not
>> converge due to DIVERGED_INDEFINITE_PC. Does this commonly happen for large
>> sparse matrices? If so, is there a more robust solver or PC for large
>> problems?
>> >
>> >
>> > For parallel computing using the direct solver (SUPERLU_DIST + PCLU), I
>> can only get accurate results when the number of elements is below 5000.
>> There must be something wrong. The way I use the superlu_dist solver is to
>> first convert the MatType to AIJ, then call PCFactorSetMatSolverPackage, and
>> change the PC to PCLU. Am I missing anything else needed to run SUPERLU_DIST
>> correctly?
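>> >
>> > For reference, a minimal sketch of that sequence in Fortran would be
>> > something like the following (ksp and pc are simply the names of the
>> > solver and preconditioner objects):
>> >
>> >     call KSPGetPC(ksp,pc,ierr)
>> >     call PCSetType(pc,PCLU,ierr)
>> >     call PCFactorSetMatSolverPackage(pc,MATSOLVERSUPERLU_DIST,ierr)
>> >
>> > or, equivalently, the runtime options -pc_type lu
>> > -pc_factor_mat_solver_package superlu_dist, with the matrix created as
>> > (MPI)AIJ rather than BAIJ.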
>> >
>> >
>> > I also used SUPERLU_DIST and the iterative solver (CG + PCBJACOBI) to
>> solve the sequential version of the same problem. The results show that the
>> iterative solver works well for <50k elements, while SUPERLU_DIST only gets
>> the right solution below 5k elements. Can I say the iterative solver is
>> better than SUPERLU_DIST for large problems? How can I improve the solver to
>> cope with very large problems, such as a million by a million? Another thing
>> is that the performance of SUPERLU_DIST is still questionable.
>> >
>> > Regarding the inaccuracy issue, do you think it may be due to memory?
>> However, no memory error shows up during execution.
>> >
>> > I would really appreciate it if someone could resolve these puzzles for
>> me. My goal is to replace the current SUPERLU_DIST solver in my parallel
>> CPFEM main program with an iterative solver using PETSc.
>> >
>> >
>> > Please let me know if you would like to see my code in detail.
>> >
>> > Thank you very much.
>> >
>> > Bests,
>> > Jinlei
>> >
>> >
>>
>>
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener