[petsc-users] SuperLU convergence problem (More test)

Danyang Su dsu at eos.ubc.ca
Tue Dec 8 22:27:13 CST 2015


Hi Hong,

Thanks for checking this. A mechanical model was added at the time when 
the solver failed, causing some problem. We need to improve this part in 
the code.

Thanks again and best wishes,

Danyang

On 15-12-08 08:10 PM, Hong wrote:
> Danyang :
> Your matrices are ill-conditioned, numerically singular with
> Recip. condition number = 6.000846e-16
> Recip. condition number = 2.256434e-27
> Recip. condition number = 1.256452e-18
> i.e., condition numbers = O(1.e16 - 1.e27), there is no accuracy in 
> computed solution.
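
A minimal sketch of the arithmetic behind these numbers, assuming IEEE double precision and the usual rule of thumb that the relative error of a computed solution can be as large as roughly cond(A) * eps. The helper name condition_summary is made up for illustration; it is not a PETSc or SuperLU routine.

```python
import math
import sys

def condition_summary(rcond):
    """Turn a reciprocal condition number into (cond, digits_lost).

    digits_lost = log10(cond) is a rough estimate of how many of the
    ~16 decimal digits of a double may be unreliable in the solution.
    """
    cond = 1.0 / rcond
    return cond, math.log10(cond)

# Reciprocal condition numbers quoted above (matrix 1, 140, 141).
rcond_values = [6.000846e-16, 2.256434e-27, 1.256452e-18]

eps = sys.float_info.epsilon  # about 2.2e-16 for IEEE doubles

for rcond in rcond_values:
    cond, digits_lost = condition_summary(rcond)
    # cond * eps near or above 1 means no digits can be trusted.
    print(f"rcond={rcond:.6e}  cond~{cond:.2e}  "
          f"digits lost~{digits_lost:.1f}  cond*eps~{cond * eps:.1e}")
```

With cond * eps of order 1 or larger for all three matrices, the rule of thumb agrees with the statement that no accuracy is left in the computed solution.
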
>
> I checked your matrix  168 - 172, got Recip. condition number 
> = 1.548816e-12.
>
> You need to check your model to understand why the matrices are so
> ill-conditioned.
>
> Hong
>
>     Hi Hong,
>
>     Sorry to bother you again. The modified code works much better
>     than before using either superlu or mumps. However, it still
>     encounters failures. The case is similar to the previous one,
>     with ill-conditioned matrices.
>
>     The code crashed after a long simulation time if I use
>     superlu_dist, but does not fail if I use superlu. I restarted the
>     simulation shortly before the time it crashes and can reproduce
>     the following error:
>
>      timestep:      22 time: 1.750E+04 years   delt: 2.500E+00 years  
>     iter:  1 max.sia: 5.053E-03 tol.sia: 1.000E-01
>      Newton Iteration Convergence Summary:
>      Newton       maximum      maximum     solver
>      iteration    updatePa  updateTemp    residual iterations 
>     maxvolpa   maxvoltemp   nexvolpa nexvoltemp
>          1       0.1531E+08 0.1755E+04   0.6920E-05 1    
>     5585          4402        5814        5814
>
>     *** Error in `../program_test': malloc(): memory corruption:
>     0x0000000003a70d50 ***
>     Program received signal SIGABRT: Process abort signal.
>     Backtrace for this error:
>
>     The solver failed at timestep 22, Newton iteration 2. I exported
>     the matrices at timestep 1 (matrix 1) and timestep 22 (matrices
>     140 and 141). Matrix 141 is where it failed. Judging from the
>     estimated values, the three matrices here are not ill-conditioned.
>
>     I did the same using the newly modified ex52f code and found quite
>     different results for matrix 141. The norm from superlu is much
>     more acceptable than the one from superlu_dist. In this test,
>     memory corruption was not detected. The code and example data can
>     be downloaded from the link below.
>
>     https://www.dropbox.com/s/i1ls0bg0vt7gu0v/petsc-superlu-test2.tar.gz?dl=0
>
>
>     ****************More test on matrix_and_rhs_bin2*******************
>     mpiexec.hydra -n 1 ./ex52f -f0
>     ./matrix_and_rhs_bin2/a_flow_check_1.bin -rhs
>     ./matrix_and_rhs_bin2/b_flow_check_1.bin -loop_matrices flow_check
>     -loop_folder ./matrix_and_rhs_bin2 -matrix_index_start 140
>     -matrix_index_end 141  -pc_type lu -pc_factor_mat_solver_package
>     superlu -ksp_monitor_true_residual -mat_superlu_conditionnumber
>      -->loac matrix a
>      -->load rhs b
>      size l,m,n,mm       90000       90000 90000       90000
>       Recip. condition number = 6.000846e-16
>       0 KSP preconditioned resid norm 1.146871454377e+08 true resid
>     norm 4.711091037809e+03 ||r(i)||/||b|| 1.000000000000e+00
>       1 KSP preconditioned resid norm 2.071118508260e-06 true resid
>     norm 3.363767171515e-08 ||r(i)||/||b|| 7.140102249181e-12
>     Norm of error  3.3638E-08 iterations     1
>      -->Test for matrix          140
>       Recip. condition number = 2.256434e-27
>       0 KSP preconditioned resid norm 2.084372893355e+14 true resid
>     norm 4.711091037809e+03 ||r(i)||/||b|| 1.000000000000e+00
>       1 KSP preconditioned resid norm 4.689629276419e+00 true resid
>     norm 1.037236635337e-01 ||r(i)||/||b|| 2.201690918330e-05
>     Norm of error  1.0372E-01 iterations     1
>      -->Test for matrix          141
>       Recip. condition number = 1.256452e-18
>       0 KSP preconditioned resid norm 1.055488964519e+08 true resid
>     norm 4.711091037809e+03 ||r(i)||/||b|| 1.000000000000e+00
>       1 KSP preconditioned resid norm 2.998827511681e-04 true resid
>     norm 4.805214542776e-04 ||r(i)||/||b|| 1.019979130994e-07
>     Norm of error  4.8052E-04 iterations     1
>      --> End of test, bye
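
For reading the -ksp_monitor_true_residual lines above: the last column is the true residual norm divided by ||b||. A small sketch that recomputes it from the logged values, assuming ||b|| equals the iteration-0 true residual norm (the log shows ||r(0)||/||b|| = 1 there). The helper relative_residual is an illustrative name only.

```python
def relative_residual(resid_norm, rhs_norm):
    """Return ||r|| / ||b||, the third quantity printed by the monitor."""
    return resid_norm / rhs_norm

# ||b||, taken from the iteration-0 lines (same value in all three runs).
b_norm = 4.711091037809e+03

# Final true residual norms copied from the superlu log above.
final_true_resid = {
    "a_flow_check_1": 3.363767171515e-08,
    "matrix 140": 1.037236635337e-01,
    "matrix 141": 4.805214542776e-04,
}

for name, resid in final_true_resid.items():
    print(f"{name:>15}: ||r||/||b|| = {relative_residual(resid, b_norm):.6e}")
```

The recomputed ratios should match the logged 7.14e-12, 2.20e-05 and 1.02e-07, which makes it easy to see that matrix 140 is solved about six orders of magnitude less accurately than the first matrix.
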
>
>
>     mpiexec.hydra -n 1 ./ex52f -f0
>     ./matrix_and_rhs_bin2/a_flow_check_1.bin -rhs
>     ./matrix_and_rhs_bin2/b_flow_check_1.bin -loop_matrices flow_check
>     -loop_folder ./matrix_and_rhs_bin2 -matrix_index_start 140
>     -matrix_index_end 141  -pc_type lu -pc_factor_mat_solver_package
>     superlu_dist
>      -->loac matrix a
>      -->load rhs b
>      size l,m,n,mm       90000       90000 90000       90000
>     Norm of error  3.6752E-08 iterations     1
>      -->Test for matrix          140
>     Norm of error  1.6335E-01 iterations     1
>      -->Test for matrix          141
>     Norm of error  3.4345E+01 iterations     1
>      --> End of test, bye
>
>     Thanks,
>
>     Danyang
>
>     On 15-12-07 12:01 PM, Hong wrote:
>>     Danyang:
>>     Add 'call MatSetFromOptions(A,ierr)' to your code.
>>     Attached below is ex52f.F modified from your ex52f.F to be
>>     compatible with petsc-dev.
>>
>>     Hong
>>
>>         Hello Hong,
>>
>>         Thanks for the quick reply and the option
>>         "-mat_superlu_dist_fact SamePattern" works like a charm, if I
>>         use this option from the command line.
>>
>>         How can I make this option the default? I tried using
>>         PetscOptionsInsertString("-mat_superlu_dist_fact
>>         SamePattern",ierr) in my code, but this does not work.
>>
>>         Thanks,
>>
>>         Danyang
>>
>>
>>         On 15-12-07 10:42 AM, Hong wrote:
>>>         Danyang :
>>>
>>>         Adding '-mat_superlu_dist_fact SamePattern' fixed the
>>>         problem. Below is how I figured it out.
>>>
>>>         1. Reading ex52f.F, I see '-superlu_default' =
>>>         '-pc_factor_mat_solver_package superlu_dist'; the latter
>>>         enables runtime options for other packages. I use
>>>         superlu_dist-4.2 and superlu-4.1 for the tests below.
>>>
>>>         2. Using Matrix 168 to set up the KSP solver and factorization,
>>>         all packages (petsc, superlu_dist and mumps) give the same
>>>         correct results:
>>>
>>>         ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_168.bin -rhs
>>>         matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices
>>>         flow_check -loop_folder matrix_and_rhs_bin -pc_type lu
>>>         -pc_factor_mat_solver_package petsc
>>>          -->loac matrix a
>>>          -->load rhs b
>>>          size l,m,n,mm       90000 90000       90000 90000
>>>         Norm of error  7.7308E-11 iterations     1
>>>          -->Test for matrix          168
>>>         ..
>>>          -->Test for matrix          172
>>>         Norm of error  3.8461E-11 iterations     1
>>>
>>>         ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_168.bin -rhs
>>>         matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices
>>>         flow_check -loop_folder matrix_and_rhs_bin -pc_type lu
>>>         -pc_factor_mat_solver_package superlu_dist
>>>         Norm of error  9.4073E-11 iterations     1
>>>          -->Test for matrix          168
>>>         ...
>>>          -->Test for matrix          172
>>>         Norm of error  3.8187E-11 iterations     1
>>>
>>>         3. Using superlu, I get
>>>         ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_168.bin -rhs
>>>         matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices
>>>         flow_check -loop_folder matrix_and_rhs_bin -pc_type lu
>>>         -pc_factor_mat_solver_package superlu
>>>         Norm of error  1.0191E-06 iterations     1
>>>          -->Test for matrix          168
>>>         ...
>>>          -->Test for matrix          172
>>>         Norm of error  9.7858E-07 iterations     1
>>>
>>>         Replacing the default DiagPivotThresh of 1.0 with 0.0, I get
>>>         the same solutions as the other packages:
>>>
>>>         ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_168.bin -rhs
>>>         matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices
>>>         flow_check -loop_folder matrix_and_rhs_bin -pc_type lu
>>>         -pc_factor_mat_solver_package superlu
>>>         -mat_superlu_diagpivotthresh 0.0
>>>
>>>         Norm of error  8.3614E-11 iterations     1
>>>          -->Test for matrix          168
>>>         ...
>>>          -->Test for matrix          172
>>>         Norm of error  3.7098E-11 iterations     1
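
A rough sketch of what a diagonal pivot threshold controls, as far as I understand the SuperLU documentation: with threshold u in [0, 1], the diagonal entry is kept as pivot when its magnitude is at least u times the largest magnitude in the column, so u = 1.0 behaves like partial pivoting and u = 0.0 keeps any nonzero diagonal. The function below is a toy illustration with made-up names, not SuperLU's actual code.

```python
def choose_pivot_row(column, diag_row, thresh):
    """Pick the pivot row for one column under threshold pivoting.

    column   -- dict {row_index: value} of candidate entries in the column
    diag_row -- row index of the diagonal entry
    thresh   -- threshold u in [0, 1]
    """
    largest_row = max(column, key=lambda r: abs(column[r]))
    largest = abs(column[largest_row])
    diag = abs(column.get(diag_row, 0.0))
    # Keep the diagonal if it is nonzero and passes the threshold test.
    if diag != 0.0 and diag >= thresh * largest:
        return diag_row
    # Otherwise fall back to the entry of largest magnitude.
    return largest_row

# A column whose diagonal entry (row 0) is tiny compared with row 1.
column = {0: 1.0e-3, 1: 5.0, 2: -2.0}

print("thresh 1.0 ->", choose_pivot_row(column, diag_row=0, thresh=1.0))
print("thresh 0.0 ->", choose_pivot_row(column, diag_row=0, thresh=0.0))
```

Which setting gives the better answer depends on the matrix; the runs above only show that 0.0 worked better for these particular matrices.
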
>>>
>>>         4.
>>>         Using '-mat_view ascii::ascii_info', I found that
>>>         a_flow_check_1.bin and a_flow_check_168.bin seem to have the
>>>         same structure:
>>>
>>>          -->loac matrix a
>>>         Mat Object: 1 MPI processes
>>>         type: seqaij
>>>         rows=90000, cols=90000
>>>         total: nonzeros=895600, allocated nonzeros=895600
>>>         total number of mallocs used during MatSetValues calls =0
>>>           using I-node routines: found 45000 nodes, limit used is 5
>>>
>>>         5.
>>>         Using a_flow_check_1.bin, I am able to reproduce the error
>>>         you reported: all packages give correct results except
>>>         superlu_dist:
>>>         ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_1.bin -rhs
>>>         matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices
>>>         flow_check -loop_folder matrix_and_rhs_bin -pc_type lu
>>>         -pc_factor_mat_solver_package superlu_dist
>>>         Norm of error  2.5970E-12 iterations     1
>>>          -->Test for matrix  168
>>>         Norm of error  1.3936E-01 iterations    34
>>>          -->Test for matrix  169
>>>
>>>         I guess the error might come from reuse of the matrix factor.
>>>         Replacing the default
>>>         -mat_superlu_dist_fact <SamePattern_SameRowPerm> with
>>>         -mat_superlu_dist_fact SamePattern, I get
>>>
>>>         ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_1.bin -rhs
>>>         matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices
>>>         flow_check -loop_folder matrix_and_rhs_bin -pc_type lu
>>>         -pc_factor_mat_solver_package superlu_dist
>>>         -mat_superlu_dist_fact SamePattern
>>>
>>>         Norm of error  2.5970E-12 iterations     1
>>>          -->Test for matrix  168
>>>         Norm of error  9.4073E-11 iterations     1
>>>          -->Test for matrix  169
>>>         Norm of error  6.4303E-11 iterations     1
>>>          -->Test for matrix  170
>>>         Norm of error  7.4327E-11 iterations     1
>>>          -->Test for matrix  171
>>>         Norm of error  5.4162E-11 iterations     1
>>>          -->Test for matrix  172
>>>         Norm of error  3.4440E-11 iterations     1
>>>          --> End of test, bye
>>>
>>>         Sherry may be able to tell you why SamePattern_SameRowPerm
>>>         causes the difference here.
>>>         Based on the above experiments, I would set the following as
>>>         defaults:
>>>         '-mat_superlu_diagpivotthresh 0.0' in the petsc/superlu interface.
>>>         '-mat_superlu_dist_fact SamePattern' in the petsc/superlu_dist
>>>         interface.
>>>
>>>         Hong
>>>
>>>             Hi Hong,
>>>
>>>             I did more tests today and finally found that the
>>>             solution accuracy depends on the quality of the initial
>>>             (first) matrix. I modified ex52f.F to do the test. There
>>>             are 6 matrices and right-hand-side vectors, all from my
>>>             reactive transport simulation. Results are quite
>>>             different depending on which one is used for the
>>>             factorization, and also differ if you run with different
>>>             options. My code is similar to the First or the Second
>>>             test below. When the matrix is well conditioned, it works
>>>             fine. But if the initial matrix is well conditioned, it
>>>             is likely to crash when the matrix becomes
>>>             ill-conditioned. Since most of my cases are well
>>>             conditioned, I didn't detect the problem before. This
>>>             case is a special one.
>>>
>>>
>>>             How can I avoid this problem? Should I redo the
>>>             factorization? Can PETSc automatically detect this
>>>             problem, or is there any option available to do this?
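
One generic way to guard against this, sketched here in plain Python rather than PETSc: after each direct solve, compute the true relative residual ||b - A x|| / ||b|| and redo the factorization from the current matrix if it exceeds a tolerance. Everything below (lu_factor, lu_solve, solve_with_check, the tolerance, the 2x2 matrices) is a hypothetical toy; it reuses stale numeric factors only to provoke a bad solve and does not model how superlu_dist reuses its row permutation.

```python
import math

def lu_factor(a):
    """LU with partial pivoting for a small dense matrix (list of lists)."""
    n = len(a)
    lu = [row[:] for row in a]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, perm

def lu_solve(factors, b):
    """Solve using factors from lu_factor (forward, then back substitution)."""
    lu, perm = factors
    n = len(lu)
    y = [b[p] for p in perm]
    for i in range(n):
        for j in range(i):
            y[i] -= lu[i][j] * y[j]
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            y[i] -= lu[i][j] * y[j]
        y[i] /= lu[i][i]
    return y

def relative_residual(a, x, b):
    """Return ||b - A x||_2 / ||b||_2."""
    r = [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(a, b)]
    norm = lambda v: math.sqrt(sum(t * t for t in v))
    return norm(r) / norm(b)

def solve_with_check(a, b, factors, tol=1e-8):
    """Solve with existing factors; refactor from `a` if the residual is bad."""
    x = lu_solve(factors, b)
    refactored = False
    if relative_residual(a, x, b) > tol:
        factors = lu_factor(a)
        x = lu_solve(factors, b)
        refactored = True
    return x, factors, refactored

a_first = [[4.0, 1.0], [1.0, 3.0]]   # matrix used for the first factorization
a_later = [[4.0, 1.0], [1.0, 0.5]]   # later matrix with different values
b = [1.0, 2.0]

stale = lu_factor(a_first)
x, fresh, refactored = solve_with_check(a_later, b, stale)
print("refactored:", refactored, " x =", x)
```

The residual check costs one matrix-vector product per solve, which is cheap compared with a factorization; the tolerance would have to be chosen relative to what the Newton iteration needs.
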
>>>
>>>             All the data and test code (modified ex52f) can be found
>>>             via the dropbox link below.
>>>             https://www.dropbox.com/s/4al1a60creogd8m/petsc-superlu-test.tar.gz?dl=0
>>>
>>>
>>>             Summary of my test is shown below.
>>>
>>>             First, use the Matrix 1 to setup KSP solver and
>>>             factorization, then solve 168 to 172
>>>
>>>             mpiexec.hydra -n 1 ./ex52f -f0
>>>             /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/a_flow_check_1.bin
>>>             -rhs
>>>             /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/b_flow_check_1.bin
>>>             -loop_matrices flow_check -loop_folder
>>>             /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin
>>>             -pc_type lu -pc_factor_mat_solver_package superlu_dist
>>>
>>>             Norm of error 3.8815E-11 iterations     1
>>>              -->Test for matrix          168
>>>             Norm of error 4.2307E-01 iterations 32
>>>              -->Test for matrix          169
>>>             Norm of error 3.0528E-01 iterations 32
>>>              -->Test for matrix          170
>>>             Norm of error 3.1177E-01 iterations 32
>>>              -->Test for matrix          171
>>>             Norm of error 3.2793E-01 iterations 32
>>>              -->Test for matrix          172
>>>             Norm of error 3.1251E-01 iterations 31
>>>
>>>             Second, use Matrix 1 to set up the KSP solver and
>>>             factorization using the implemented SuperLU-related
>>>             code. I thought this would generate the same results as
>>>             the First test, but it actually does not.
>>>
>>>             mpiexec.hydra -n 1 ./ex52f -f0
>>>             /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/a_flow_check_1.bin
>>>             -rhs
>>>             /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/b_flow_check_1.bin
>>>             -loop_matrices flow_check -loop_folder
>>>             /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin
>>>             -superlu_default
>>>
>>>             Norm of error 2.2632E-12 iterations     1
>>>              -->Test for matrix          168
>>>             Norm of error 1.0817E+04 iterations     1
>>>              -->Test for matrix          169
>>>             Norm of error 1.0786E+04 iterations     1
>>>              -->Test for matrix          170
>>>             Norm of error 1.0792E+04 iterations     1
>>>              -->Test for matrix          171
>>>             Norm of error 1.0792E+04 iterations     1
>>>              -->Test for matrix          172
>>>             Norm of error 1.0792E+04 iterations     1
>>>
>>>
>>>             Third, use the Matrix 168 to setup KSP solver and
>>>             factorization, then solve 168 to 172
>>>
>>>             mpiexec.hydra -n 1 ./ex52f -f0
>>>             /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/a_flow_check_168.bin
>>>             -rhs
>>>             /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/b_flow_check_168.bin
>>>             -loop_matrices flow_check -loop_folder
>>>             /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin
>>>             -pc_type lu -pc_factor_mat_solver_package superlu_dist
>>>
>>>             Norm of error 9.5528E-10 iterations     1
>>>              -->Test for matrix          168
>>>             Norm of error 9.4945E-10 iterations     1
>>>              -->Test for matrix          169
>>>             Norm of error 6.4279E-10 iterations     1
>>>              -->Test for matrix          170
>>>             Norm of error 7.4633E-10 iterations     1
>>>              -->Test for matrix          171
>>>             Norm of error 7.4863E-10 iterations     1
>>>              -->Test for matrix          172
>>>             Norm of error 8.9701E-10 iterations     1
>>>
>>>             Fourth, use Matrix 168 to set up the KSP solver and
>>>             factorization using the implemented SuperLU-related
>>>             code. I thought this would generate the same results as
>>>             the Third test, but it actually does not.
>>>
>>>             mpiexec.hydra -n 1 ./ex52f -f0
>>>             /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/a_flow_check_168.bin
>>>             -rhs
>>>             /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/b_flow_check_168.bin
>>>             -loop_matrices flow_check -loop_folder
>>>             /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin
>>>             -superlu_default
>>>
>>>             Norm of error 3.7017E-11 iterations     1
>>>              -->Test for matrix          168
>>>             Norm of error 3.6420E-11 iterations     1
>>>              -->Test for matrix          169
>>>             Norm of error 3.7184E-11 iterations     1
>>>              -->Test for matrix          170
>>>             Norm of error 3.6847E-11 iterations     1
>>>              -->Test for matrix          171
>>>             Norm of error 3.7883E-11 iterations     1
>>>              -->Test for matrix          172
>>>             Norm of error 3.8805E-11 iterations     1
>>>
>>>             Thanks very much,
>>>
>>>             Danyang
>>>
>>>             On 15-12-03 01:59 PM, Hong wrote:
>>>>             Danyang :
>>>>             Further testing a_flow_check_168.bin,
>>>>             ./ex10 -f0
>>>>             /Users/Hong/Downloads/matrix_and_rhs_bin/a_flow_check_168.bin
>>>>             -rhs
>>>>             /Users/Hong/Downloads/matrix_and_rhs_bin/x_flow_check_168.bin
>>>>             -pc_type lu -pc_factor_mat_solver_package superlu
>>>>             -ksp_monitor_true_residual -mat_superlu_conditionnumber
>>>>             Recip. condition number = 1.610480e-12
>>>>             0 KSP preconditioned resid norm 6.873340313547e+09 true
>>>>             resid norm 7.295020990196e+03 ||r(i)||/||b||
>>>>             1.000000000000e+00
>>>>             1 KSP preconditioned resid norm 2.051833296449e-02 true
>>>>             resid norm 2.976859070118e-02 ||r(i)||/||b||
>>>>             4.080672384793e-06
>>>>             Number of iterations =   1
>>>>             Residual norm 0.0297686
>>>>
>>>>             condition number of this matrix = 1/1.610480e-12 = 6.2e+11,
>>>>             i.e., O(1.e+12), so this matrix is ill-conditioned.
>>>>
>>>>             Hong
>>>>
>>>>
>>>>                 Hi Hong,
>>>>
>>>>                 The binary format of matrix, rhs and solution can
>>>>                 be downloaded via the link below.
>>>>
>>>>                 https://www.dropbox.com/s/cl3gfi0s0kjlktf/matrix_and_rhs_bin.tar.gz?dl=0
>>>>
>>>>                 Thanks,
>>>>
>>>>                 Danyang
>>>>
>>>>
>>>>                 On 15-12-03 10:50 AM, Hong wrote:
>>>>>                 Danyang:
>>>>>
>>>>>
>>>>>
>>>>>                     To my surprise, the solutions from SuperLU at
>>>>>                     timestep 29 seem incorrect for the first 4
>>>>>                     Newton iterations, but the solutions from the
>>>>>                     iterative solver and MUMPS are correct.
>>>>>
>>>>>                     Please find all the matrices, rhs and
>>>>>                     solutions at timestep 29 via the link below.
>>>>>                     The data is a bit large, so I am sharing
>>>>>                     it through Dropbox. A piece of MATLAB code to
>>>>>                     read these data and then compute the norm has
>>>>>                     also been attached.
>>>>>                     https://www.dropbox.com/s/rr8ueysgflmxs7h/results-check.tar.gz?dl=0
>>>>>
>>>>>
>>>>>                 Can you send us matrix in petsc binary format?
>>>>>
>>>>>                 e.g., call MatView(M,
>>>>>                 PETSC_VIEWER_BINARY_(PETSC_COMM_WORLD))
>>>>>                 or '-ksp_view_mat binary'
>>>>>
>>>>>                 Hong
>>>>>
>>>>>
>>>>>
>>>>>                     Below is a summary of the norm from the three
>>>>>                     solvers at timestep 29, newton iteration 1 to 5.
>>>>>
>>>>>                     Timestep 29, norm of residual:
>>>>>                     Newton iter   seq            superlu        mumps
>>>>>                     1             1.661321e-09   1.657103e+04   3.731225e-11
>>>>>                     2             1.753079e-09   6.675467e+02   1.509919e-13
>>>>>                     3             4.914971e-10   1.236362e-01   2.139303e-17
>>>>>                     4             3.532769e-10   1.304670e-04   5.387000e-20
>>>>>                     5             3.885629e-10   2.754876e-07   4.108675e-21
>>>>>
>>>>>                     Would anybody please check if SuperLU can
>>>>>                     solve these matrices? Another possibility is
>>>>>                     that something is wrong in my own code. But so
>>>>>                     far, I cannot find any problem in my code,
>>>>>                     since the same code works fine if I use an
>>>>>                     iterative solver or the direct solver MUMPS.
>>>>>                     For the other cases I have tested, all these
>>>>>                     solvers work fine.
>>>>>
>>>>>                     Please let me know if I did not write down the
>>>>>                     problem clearly.
>>>>>
>>>>>                     Thanks,
>>>>>
>>>>>                     Danyang
