[petsc-users] SuperLU convergence problem (More test)
Hong
hzhang at mcs.anl.gov
Tue Dec 8 22:10:19 CST 2015
Danyang :
Your matrices are ill-conditioned, numerically singular with
Recip. condition number = 6.000846e-16
Recip. condition number = 2.256434e-27
Recip. condition number = 1.256452e-18
i.e., condition numbers = O(1.e16 - 1.e27), so there is no accuracy in the
computed solution.
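
As a rough illustration of why these reciprocal condition numbers leave no accuracy, the sketch below applies the usual rule of thumb (relative error bounded by roughly cond * machine epsilon) to the values reported in this thread. The name digits_estimate is made up for this example, and the rule is only an order-of-magnitude guide, not what SuperLU computes internally.

```python
import math

# Double-precision machine epsilon (about 2.2e-16).
EPS = 2.0 ** -52


def digits_estimate(rcond):
    """Return (cond, digits): the condition number 1/rcond and a rough count
    of decimal digits one can still trust in the solution, using the rule of
    thumb relative_error <~ cond * EPS. This is a hypothetical helper."""
    cond = 1.0 / rcond
    rel_err_bound = cond * EPS
    if rel_err_bound >= 1.0:
        # The bound exceeds 100% relative error: no digits can be trusted.
        return cond, 0
    return cond, int(math.floor(-math.log10(rel_err_bound)))


# Reciprocal condition numbers quoted in this thread.
for rcond in (6.000846e-16, 2.256434e-27, 1.256452e-18, 1.548816e-12):
    cond, digits = digits_estimate(rcond)
    print(f"rcond = {rcond:.6e}  cond ~ {cond:.1e}  trusted digits ~ {digits}")
```

Under this estimate the first three values leave zero trustworthy digits, while 1.548816e-12 leaves about three.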
I checked your matrices 168 - 172 and got Recip. condition number
= 1.548816e-12.
You need to check your model to understand why the matrices are so
ill-conditioned.
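
One possible cause worth checking (an assumption, not something established in this thread) is poor row scaling, for example when equations in very different units share one matrix. The sketch below uses a hypothetical 2x2 matrix and made-up helper names to show how dividing each row by its largest entry can change the condition number. It is a diagnostic idea only.

```python
def norm_inf(m):
    """Infinity norm (largest absolute row sum) of a dense matrix."""
    return max(sum(abs(v) for v in row) for row in m)


def cond_inf_2x2(a):
    """Infinity-norm condition number of a nonsingular 2x2 matrix,
    computed from the explicit inverse."""
    (p, q), (r, s) = a
    det = p * s - q * r
    inv = [[s / det, -q / det], [-r / det, p / det]]
    return norm_inf(a) * norm_inf(inv)


def scale_rows(a):
    """Divide each row by its largest absolute entry (simple row scaling).
    Assumes no row is entirely zero."""
    return [[v / max(abs(w) for w in row) for v in row] for row in a]


# Hypothetical block: the two rows differ in magnitude by about 1e12,
# standing in for equations written in very different units.
A = [[2.0e8, 1.0e8],
     [1.0e-4, 3.0e-4]]

print("cond before row scaling:", cond_inf_2x2(A))
print("cond after row scaling: ", cond_inf_2x2(scale_rows(A)))
```

If the condition number drops sharply after scaling, the ill-conditioning is at least partly an artifact of units. If it stays large, the cause lies elsewhere in the model.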
Hong
Hi Hong,
>
> Sorry to bother you again. The modified code works much better than before
> with either superlu or mumps. However, it still encounters failures. The case
> is similar to the previous one: ill-conditioned matrices.
>
> The code crashes after a long simulation if I use superlu_dist, but
> does not fail if I use superlu. I restarted the simulation shortly before the
> time it crashes and can reproduce the following error:
>
> timestep: 22 time: 1.750E+04 years delt: 2.500E+00 years iter:
> 1 max.sia: 5.053E-03 tol.sia: 1.000E-01
> Newton Iteration Convergence Summary:
> Newton maximum maximum solver
> iteration updatePa updateTemp residual iterations maxvolpa
> maxvoltemp nexvolpa nexvoltemp
> 1 0.1531E+08 0.1755E+04 0.6920E-05 1 5585
> 4402 5814 5814
>
> *** Error in `../program_test': malloc(): memory corruption:
> 0x0000000003a70d50 ***
> Program received signal SIGABRT: Process abort signal.
> Backtrace for this error:
>
> The solver failed at timestep 22, Newton iteration 2. I exported the
> matrices at timestep 1 (matrix 1) and timestep 22 (matrix 140 and 141).
> Matrix 141 is where it failed. Judging from the estimated values, the three
> matrices here are not ill-conditioned.
>
> I ran the same test with the newly modified ex52f code and found quite
> different results for matrix 141. The norm from superlu is much more
> acceptable than that from superlu_dist. In this test, memory corruption was
> not detected. The code and example data can be downloaded from the link below.
>
> https://www.dropbox.com/s/i1ls0bg0vt7gu0v/petsc-superlu-test2.tar.gz?dl=0
>
>
> ****************More test on matrix_and_rhs_bin2*******************
> mpiexec.hydra -n 1 ./ex52f -f0 ./matrix_and_rhs_bin2/a_flow_check_1.bin
> -rhs ./matrix_and_rhs_bin2/b_flow_check_1.bin -loop_matrices flow_check
> -loop_folder ./matrix_and_rhs_bin2 -matrix_index_start 140
> -matrix_index_end 141 -pc_type lu -pc_factor_mat_solver_package superlu
> -ksp_monitor_true_residual -mat_superlu_conditionnumber
> -->loac matrix a
> -->load rhs b
> size l,m,n,mm 90000 90000 90000 90000
> Recip. condition number = 6.000846e-16
> 0 KSP preconditioned resid norm 1.146871454377e+08 true resid norm
> 4.711091037809e+03 ||r(i)||/||b|| 1.000000000000e+00
> 1 KSP preconditioned resid norm 2.071118508260e-06 true resid norm
> 3.363767171515e-08 ||r(i)||/||b|| 7.140102249181e-12
> Norm of error 3.3638E-08 iterations 1
> -->Test for matrix 140
> Recip. condition number = 2.256434e-27
> 0 KSP preconditioned resid norm 2.084372893355e+14 true resid norm
> 4.711091037809e+03 ||r(i)||/||b|| 1.000000000000e+00
> 1 KSP preconditioned resid norm 4.689629276419e+00 true resid norm
> 1.037236635337e-01 ||r(i)||/||b|| 2.201690918330e-05
> Norm of error 1.0372E-01 iterations 1
> -->Test for matrix 141
> Recip. condition number = 1.256452e-18
> 0 KSP preconditioned resid norm 1.055488964519e+08 true resid norm
> 4.711091037809e+03 ||r(i)||/||b|| 1.000000000000e+00
> 1 KSP preconditioned resid norm 2.998827511681e-04 true resid norm
> 4.805214542776e-04 ||r(i)||/||b|| 1.019979130994e-07
> Norm of error 4.8052E-04 iterations 1
> --> End of test, bye
>
>
> mpiexec.hydra -n 1 ./ex52f -f0 ./matrix_and_rhs_bin2/a_flow_check_1.bin
> -rhs ./matrix_and_rhs_bin2/b_flow_check_1.bin -loop_matrices flow_check
> -loop_folder ./matrix_and_rhs_bin2 -matrix_index_start 140
> -matrix_index_end 141 -pc_type lu -pc_factor_mat_solver_package
> superlu_dist
> -->loac matrix a
> -->load rhs b
> size l,m,n,mm 90000 90000 90000 90000
> Norm of error 3.6752E-08 iterations 1
> -->Test for matrix 140
> Norm of error 1.6335E-01 iterations 1
> -->Test for matrix 141
> Norm of error 3.4345E+01 iterations 1
> --> End of test, bye
>
> Thanks,
>
> Danyang
>
> On 15-12-07 12:01 PM, Hong wrote:
>
> Danyang:
> Add 'call MatSetFromOptions(A,ierr)' to your code.
> Attached below is ex52f.F modified from your ex52f.F to be compatible with
> petsc-dev.
>
> Hong
>
> Hello Hong,
>>
>> Thanks for the quick reply. The option "-mat_superlu_dist_fact
>> SamePattern" works like a charm if I pass it on the command
>> line.
>>
>> How can I make this option the default? I tried calling
>> PetscOptionsInsertString("-mat_superlu_dist_fact SamePattern",ierr) in my
>> code, but this does not work.
>>
>> Thanks,
>>
>> Danyang
>>
>>
>> On 15-12-07 10:42 AM, Hong wrote:
>>
>> Danyang :
>>
>> Adding '-mat_superlu_dist_fact SamePattern' fixed the problem. Below is
>> how I figured it out.
>>
>> 1. Reading ex52f.F, I see that '-superlu_default' =
>> '-pc_factor_mat_solver_package superlu_dist'; the latter enables runtime
>> options for other packages. I use superlu_dist-4.2 and superlu-4.1 for the
>> tests below.
>>
>> 2. Using matrix 168 to set up the KSP solver and factorization, all
>> packages (petsc, superlu_dist and mumps) give the same correct results:
>>
>> ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_168.bin -rhs
>> matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices flow_check
>> -loop_folder matrix_and_rhs_bin -pc_type lu -pc_factor_mat_solver_package
>> petsc
>> -->loac matrix a
>> -->load rhs b
>> size l,m,n,mm 90000 90000 90000 90000
>> Norm of error 7.7308E-11 iterations 1
>> -->Test for matrix 168
>> ..
>> -->Test for matrix 172
>> Norm of error 3.8461E-11 iterations 1
>>
>> ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_168.bin -rhs
>> matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices flow_check
>> -loop_folder matrix_and_rhs_bin -pc_type lu -pc_factor_mat_solver_package
>> superlu_dist
>> Norm of error 9.4073E-11 iterations 1
>> -->Test for matrix 168
>> ...
>> -->Test for matrix 172
>> Norm of error 3.8187E-11 iterations 1
>>
>> 3. Using superlu, I get
>> ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_168.bin -rhs
>> matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices flow_check
>> -loop_folder matrix_and_rhs_bin -pc_type lu -pc_factor_mat_solver_package
>> superlu
>> Norm of error 1.0191E-06 iterations 1
>> -->Test for matrix 168
>> ...
>> -->Test for matrix 172
>> Norm of error 9.7858E-07 iterations 1
>>
>> Changing DiagPivotThresh from its default of 1. to 0.0, I get the same
>> solutions as the other packages:
>>
>> ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_168.bin -rhs
>> matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices flow_check
>> -loop_folder matrix_and_rhs_bin -pc_type lu -pc_factor_mat_solver_package
>> superlu -mat_superlu_diagpivotthresh 0.0
>>
>> Norm of error 8.3614E-11 iterations 1
>> -->Test for matrix 168
>> ...
>> -->Test for matrix 172
>> Norm of error 3.7098E-11 iterations 1
>>
>> 4. Using '-mat_view ascii::ascii_info', I found that a_flow_check_1.bin and
>> a_flow_check_168.bin seem to have the same structure:
>>
>> -->loac matrix a
>> Mat Object: 1 MPI processes
>> type: seqaij
>> rows=90000, cols=90000
>> total: nonzeros=895600, allocated nonzeros=895600
>> total number of mallocs used during MatSetValues calls =0
>> using I-node routines: found 45000 nodes, limit used is 5
>>
>> 5. Using a_flow_check_1.bin, I am able to reproduce the error you reported:
>> all packages give correct results except superlu_dist:
>> ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_1.bin -rhs
>> matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices flow_check
>> -loop_folder matrix_and_rhs_bin -pc_type lu -pc_factor_mat_solver_package
>> superlu_dist
>> Norm of error 2.5970E-12 iterations 1
>> -->Test for matrix 168
>> Norm of error 1.3936E-01 iterations 34
>> -->Test for matrix 169
>>
>> I guess the error might come from reuse of the matrix factor. Replacing the
>> default
>> -mat_superlu_dist_fact <SamePattern_SameRowPerm> with
>> -mat_superlu_dist_fact SamePattern, I get
>>
>> ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_1.bin -rhs
>> matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices flow_check
>> -loop_folder matrix_and_rhs_bin -pc_type lu -pc_factor_mat_solver_package
>> superlu_dist -mat_superlu_dist_fact SamePattern
>>
>> Norm of error 2.5970E-12 iterations 1
>> -->Test for matrix 168
>> Norm of error 9.4073E-11 iterations 1
>> -->Test for matrix 169
>> Norm of error 6.4303E-11 iterations 1
>> -->Test for matrix 170
>> Norm of error 7.4327E-11 iterations 1
>> -->Test for matrix 171
>> Norm of error 5.4162E-11 iterations 1
>> -->Test for matrix 172
>> Norm of error 3.4440E-11 iterations 1
>> --> End of test, bye
>>
>> Sherry may be able to tell you why SamePattern_SameRowPerm causes the
>> difference here.
>> Based on the above experiments, I would set the following as defaults:
>> '-mat_superlu_diagpivotthresh 0.0' in the petsc/superlu interface.
>> '-mat_superlu_dist_fact SamePattern' in the petsc/superlu_dist interface.
>>
>> Hong
>>
>> Hi Hong,
>>>
>>> I did more tests today and finally found that the solution accuracy
>>> depends on the quality of the initial (first) matrix. I modified ex52f.F to
>>> do the test. There are 6 matrices and right-hand-side vectors, all
>>> from my reactive transport simulation. Results are
>>> quite different depending on which matrix is used for the factorization,
>>> and they also differ if you run with different options. My code is
>>> similar to the First or the Second test below. When the matrices are well
>>> conditioned, it works fine. But if the initial matrix is well conditioned,
>>> it is likely to crash when a later matrix becomes ill-conditioned. Since
>>> most of my cases are well conditioned, I didn't detect the problem before.
>>> This case is a special one.
>>>
>>>
>>> How can I avoid this problem? Should I redo the factorization? Can PETSc
>>> automatically detect this problem, or is there an option available to do
>>> this?
>>>
>>> All the data and test code (modified ex52f) can be found via the dropbox
>>> link below.
>>>
>>> https://www.dropbox.com/s/4al1a60creogd8m/petsc-superlu-test.tar.gz?dl=0
>>>
>>>
>>> Summary of my test is shown below.
>>>
>>> First, use matrix 1 to set up the KSP solver and factorization, then
>>> solve matrices 168 to 172
>>>
>>> mpiexec.hydra -n 1 ./ex52f -f0
>>> /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/a_flow_check_1.bin
>>> -rhs
>>> /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/b_flow_check_1.bin
>>> -loop_matrices flow_check -loop_folder
>>> /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin -pc_type lu
>>> -pc_factor_mat_solver_package superlu_dist
>>>
>>> Norm of error 3.8815E-11 iterations 1
>>> -->Test for matrix 168
>>> Norm of error 4.2307E-01 iterations 32
>>> -->Test for matrix 169
>>> Norm of error 3.0528E-01 iterations 32
>>> -->Test for matrix 170
>>> Norm of error 3.1177E-01 iterations 32
>>> -->Test for matrix 171
>>> Norm of error 3.2793E-01 iterations 32
>>> -->Test for matrix 172
>>> Norm of error 3.1251E-01 iterations 31
>>>
>>> Second, use matrix 1 to set up the KSP solver and factorization using the
>>> SuperLU-related code implemented in ex52f (-superlu_default). I thought this
>>> would generate the same results as the First test, but it actually does not.
>>>
>>> mpiexec.hydra -n 1 ./ex52f -f0
>>> /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/a_flow_check_1.bin
>>> -rhs
>>> /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/b_flow_check_1.bin
>>> -loop_matrices flow_check -loop_folder
>>> /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin -superlu_default
>>>
>>> Norm of error 2.2632E-12 iterations 1
>>> -->Test for matrix 168
>>> Norm of error 1.0817E+04 iterations 1
>>> -->Test for matrix 169
>>> Norm of error 1.0786E+04 iterations 1
>>> -->Test for matrix 170
>>> Norm of error 1.0792E+04 iterations 1
>>> -->Test for matrix 171
>>> Norm of error 1.0792E+04 iterations 1
>>> -->Test for matrix 172
>>> Norm of error 1.0792E+04 iterations 1
>>>
>>>
>>> Third, use matrix 168 to set up the KSP solver and factorization, then
>>> solve matrices 168 to 172
>>>
>>> mpiexec.hydra -n 1 ./ex52f -f0
>>> /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/a_flow_check_168.bin
>>> -rhs
>>> /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/b_flow_check_168.bin
>>> -loop_matrices flow_check -loop_folder
>>> /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin -pc_type lu
>>> -pc_factor_mat_solver_package superlu_dist
>>>
>>> Norm of error 9.5528E-10 iterations 1
>>> -->Test for matrix 168
>>> Norm of error 9.4945E-10 iterations 1
>>> -->Test for matrix 169
>>> Norm of error 6.4279E-10 iterations 1
>>> -->Test for matrix 170
>>> Norm of error 7.4633E-10 iterations 1
>>> -->Test for matrix 171
>>> Norm of error 7.4863E-10 iterations 1
>>> -->Test for matrix 172
>>> Norm of error 8.9701E-10 iterations 1
>>>
>>> Fourth, use matrix 168 to set up the KSP solver and factorization using
>>> the SuperLU-related code implemented in ex52f (-superlu_default). I thought
>>> this would generate the same results as the Third test, but it actually
>>> does not.
>>>
>>> mpiexec.hydra -n 1 ./ex52f -f0
>>> /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/a_flow_check_168.bin
>>> -rhs
>>> /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/b_flow_check_168.bin
>>> -loop_matrices flow_check -loop_folder
>>> /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin -superlu_default
>>>
>>> Norm of error 3.7017E-11 iterations 1
>>> -->Test for matrix 168
>>> Norm of error 3.6420E-11 iterations 1
>>> -->Test for matrix 169
>>> Norm of error 3.7184E-11 iterations 1
>>> -->Test for matrix 170
>>> Norm of error 3.6847E-11 iterations 1
>>> -->Test for matrix 171
>>> Norm of error 3.7883E-11 iterations 1
>>> -->Test for matrix 172
>>> Norm of error 3.8805E-11 iterations 1
>>>
>>> Thanks very much,
>>>
>>> Danyang
>>>
>>> On 15-12-03 01:59 PM, Hong wrote:
>>>
>>> Danyang :
>>> Further testing a_flow_check_168.bin,
>>> ./ex10 -f0 /Users/Hong/Downloads/matrix_and_rhs_bin/a_flow_check_168.bin
>>> -rhs /Users/Hong/Downloads/matrix_and_rhs_bin/x_flow_check_168.bin -pc_type
>>> lu -pc_factor_mat_solver_package superlu -ksp_monitor_true_residual
>>> -mat_superlu_conditionnumber
>>> Recip. condition number = 1.610480e-12
>>> 0 KSP preconditioned resid norm 6.873340313547e+09 true resid norm
>>> 7.295020990196e+03 ||r(i)||/||b|| 1.000000000000e+00
>>> 1 KSP preconditioned resid norm 2.051833296449e-02 true resid norm
>>> 2.976859070118e-02 ||r(i)||/||b|| 4.080672384793e-06
>>> Number of iterations = 1
>>> Residual norm 0.0297686
>>>
>>> The condition number of this matrix is 1/1.610480e-12 = O(1.e+12),
>>> i.e., this matrix is ill-conditioned.
>>>
>>> Hong
>>>
>>>
>>> Hi Hong,
>>>>
>>>> The binary format of matrix, rhs and solution can be downloaded via the
>>>> link below.
>>>>
>>>> https://www.dropbox.com/s/cl3gfi0s0kjlktf/matrix_and_rhs_bin.tar.gz?dl=0
>>>>
>>>> Thanks,
>>>>
>>>> Danyang
>>>>
>>>>
>>>> On 15-12-03 10:50 AM, Hong wrote:
>>>>
>>>> Danyang:
>>>>
>>>>>
>>>>>
>>>>> To my surprise, the solutions from SuperLU at timestep 29 seem
>>>>> incorrect for the first 4 Newton iterations, but the solutions from the
>>>>> iterative solver and MUMPS are correct.
>>>>>
>>>>> Please find all the matrices, rhs and solutions at timestep 29 via the
>>>>> link below. The data is a bit large, so I am sharing it through
>>>>> Dropbox. A piece of MATLAB code that reads these data and then computes
>>>>> the norm is also attached.
>>>>> https://www.dropbox.com/s/rr8ueysgflmxs7h/results-check.tar.gz?dl=0
>>>>>
>>>>
>>>> Can you send us matrix in petsc binary format?
>>>>
>>>> e.g., call MatView(M, PETSC_VIEWER_BINARY_(PETSC_COMM_WORLD))
>>>> or '-ksp_view_mat binary'
>>>>
>>>> Hong
>>>>
>>>>>
>>>>>
>>>>> Below is a summary of the norms from the three solvers at timestep 29,
>>>>> Newton iterations 1 to 5.
>>>>>
>>>>> Timestep 29
>>>>> Norm of residual seq 1.661321e-09, superlu 1.657103e+04, mumps
>>>>> 3.731225e-11
>>>>> Norm of residual seq 1.753079e-09, superlu 6.675467e+02, mumps
>>>>> 1.509919e-13
>>>>> Norm of residual seq 4.914971e-10, superlu 1.236362e-01, mumps
>>>>> 2.139303e-17
>>>>> Norm of residual seq 3.532769e-10, superlu 1.304670e-04, mumps
>>>>> 5.387000e-20
>>>>> Norm of residual seq 3.885629e-10, superlu 2.754876e-07, mumps
>>>>> 4.108675e-21
>>>>>
>>>>> Would anybody please check whether SuperLU can solve these matrices?
>>>>> Another possibility is that something is wrong in my own code. But so far,
>>>>> I cannot find any problem in my code, since the same code works fine if I
>>>>> use an iterative solver or the direct solver MUMPS. For the other cases I
>>>>> have tested, all these solvers work fine.
>>>>>
>>>>> Please let me know if I did not write down the problem clearly.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Danyang
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
