[petsc-users] SuperLU convergence problem (More test)

Hong hzhang at mcs.anl.gov
Mon Dec 7 12:42:07 CST 2015


Danyang:

Adding '-mat_superlu_dist_fact SamePattern' fixed the problem. Below is how
I figured it out.

1. Reading ex52f.F, I see that '-superlu_default' is equivalent to
'-pc_factor_mat_solver_package superlu_dist'; the latter form enables
runtime options for the other packages as well. I use superlu_dist-4.2 and
superlu-4.1 for the tests below.
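
For reference, the equivalent of '-pc_type lu -pc_factor_mat_solver_package
superlu_dist' can also be set from code. A minimal sketch, assuming a
PETSc 3.6-era C program with an existing KSP object named ksp (error
checking abbreviated):

  #include <petscksp.h>

  PC pc;
  ierr = KSPGetPC(ksp,&pc);CHKERRQ(ierr);
  ierr = PCSetType(pc,PCLU);CHKERRQ(ierr);     /* direct LU factorization */
  ierr = PCFactorSetMatSolverPackage(pc,MATSOLVERSUPERLU_DIST);CHKERRQ(ierr);
  /* or MATSOLVERSUPERLU / MATSOLVERMUMPS for the other packages */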

2. Using Matrix 168 to set up the KSP solver and factorization, all
packages (petsc, superlu_dist and mumps) give the same correct results:

./ex52f -f0 matrix_and_rhs_bin/a_flow_check_168.bin -rhs
matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices flow_check
-loop_folder matrix_and_rhs_bin -pc_type lu -pc_factor_mat_solver_package
petsc
 -->load matrix a
 -->load rhs b
 size l,m,n,mm       90000       90000       90000       90000
Norm of error  7.7308E-11 iterations     1
 -->Test for matrix          168
..
 -->Test for matrix          172
Norm of error  3.8461E-11 iterations     1

./ex52f -f0 matrix_and_rhs_bin/a_flow_check_168.bin -rhs
matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices flow_check
-loop_folder matrix_and_rhs_bin -pc_type lu -pc_factor_mat_solver_package
superlu_dist
Norm of error  9.4073E-11 iterations     1
 -->Test for matrix          168
...
 -->Test for matrix          172
Norm of error  3.8187E-11 iterations     1

3. Using superlu, I get:
./ex52f -f0 matrix_and_rhs_bin/a_flow_check_168.bin -rhs
matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices flow_check
-loop_folder matrix_and_rhs_bin -pc_type lu -pc_factor_mat_solver_package
superlu
Norm of error  1.0191E-06 iterations     1
 -->Test for matrix          168
...
 -->Test for matrix          172
Norm of error  9.7858E-07 iterations     1

Replacing the default DiagPivotThresh of 1.0 with 0.0 (i.e., telling
SuperLU to keep the diagonal pivots rather than perform threshold-based
partial pivoting), I get the same solutions as the other packages:

./ex52f -f0 matrix_and_rhs_bin/a_flow_check_168.bin -rhs
matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices flow_check
-loop_folder matrix_and_rhs_bin -pc_type lu -pc_factor_mat_solver_package
superlu -mat_superlu_diagpivotthresh 0.0

Norm of error  8.3614E-11 iterations     1
 -->Test for matrix          168
...
 -->Test for matrix          172
Norm of error  3.7098E-11 iterations     1

4. Using '-mat_view ascii::ascii_info', I found that a_flow_check_1.bin
and a_flow_check_168.bin seem to have the same structure:

 -->load matrix a
Mat Object: 1 MPI processes
  type: seqaij
  rows=90000, cols=90000
  total: nonzeros=895600, allocated nonzeros=895600
  total number of mallocs used during MatSetValues calls =0
    using I-node routines: found 45000 nodes, limit used is 5
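
For a programmatic version of the same check, a small sketch (assuming the
loaded matrix is named A; note this compares only sizes and nonzero counts,
not the actual sparsity pattern):

  MatInfo info;
  ierr = MatGetInfo(A,MAT_GLOBAL_SUM,&info);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_WORLD,"nonzeros used %g, allocated %g\n",
                     (double)info.nz_used,(double)info.nz_allocated);CHKERRQ(ierr);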

5. Using a_flow_check_1.bin, I can reproduce the error you reported: all
packages give correct results except superlu_dist:
./ex52f -f0 matrix_and_rhs_bin/a_flow_check_1.bin -rhs
matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices flow_check
-loop_folder matrix_and_rhs_bin -pc_type lu -pc_factor_mat_solver_package
superlu_dist
Norm of error  2.5970E-12 iterations     1
 -->Test for matrix          168
Norm of error  1.3936E-01 iterations    34
 -->Test for matrix          169

I suspect the error comes from the reuse of the matrix factorization.
Replacing the default -mat_superlu_dist_fact <SamePattern_SameRowPerm> with
-mat_superlu_dist_fact SamePattern, I get

./ex52f -f0 matrix_and_rhs_bin/a_flow_check_1.bin -rhs
matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices flow_check
-loop_folder matrix_and_rhs_bin -pc_type lu -pc_factor_mat_solver_package
superlu_dist -mat_superlu_dist_fact SamePattern

Norm of error  2.5970E-12 iterations     1
 -->Test for matrix          168
Norm of error  9.4073E-11 iterations     1
 -->Test for matrix          169
Norm of error  6.4303E-11 iterations     1
 -->Test for matrix          170
Norm of error  7.4327E-11 iterations     1
 -->Test for matrix          171
Norm of error  5.4162E-11 iterations     1
 -->Test for matrix          172
Norm of error  3.4440E-11 iterations     1
 --> End of test, bye

Sherry may tell you why SamePattern_SameRowPerm causes the difference here.
Based on the above experiments, I would set the following as defaults:
'-mat_superlu_diagpivotthresh 0.0' in the petsc/superlu interface;
'-mat_superlu_dist_fact SamePattern' in the petsc/superlu_dist interface.
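
Until those are the defaults, a minimal sketch of forcing both options from
application code, before KSPSetUp is called (two-argument
PetscOptionsSetValue as in the PETSc 3.6 series; newer releases take an
options database as the first argument):

  ierr = PetscOptionsSetValue("-mat_superlu_diagpivotthresh","0.0");CHKERRQ(ierr);
  ierr = PetscOptionsSetValue("-mat_superlu_dist_fact","SamePattern");CHKERRQ(ierr);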

Hong

Hi Hong,
>
> I did more tests today and finally found that the solution accuracy
> depends on the quality of the initial (first) matrix. I modified ex52f.F
> to do the test. There are 6 matrices and right-hand-side vectors. All of
> them come from my reactive transport simulation. The results differ
> depending on which matrix you use for the factorization, and also differ
> when you run with different options. My code is similar to the first or
> the second test below. When the matrix is well conditioned, it works fine.
> But if the factorization is built from a well-conditioned initial matrix,
> the solve is likely to fail once the matrix becomes ill-conditioned. Since
> most of my cases are well conditioned, I did not detect the problem
> before. This case is a special one.
>
>
> How can I avoid this problem? Shall I redo the factorization? Can PETSc
> detect this problem automatically, or is there an option available to do
> this?
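>
> For example, would something like the following sketch be a reasonable way
> to detect a bad solve and redo the factorization? (Written as C against the
> PETSc 3.6 API; ksp, A, b, x stand for my existing objects.)
>
>   KSPConvergedReason reason;
>   ierr = KSPSolve(ksp,b,x);CHKERRQ(ierr);
>   ierr = KSPGetConvergedReason(ksp,&reason);CHKERRQ(ierr);
>   if (reason < 0) {                                /* the solve diverged */
>     PC pc;
>     ierr = KSPGetPC(ksp,&pc);CHKERRQ(ierr);
>     ierr = PCReset(pc);CHKERRQ(ierr);              /* discard the stale factorization */
>     ierr = KSPSetOperators(ksp,A,A);CHKERRQ(ierr); /* request a fresh factorization */
>     ierr = KSPSolve(ksp,b,x);CHKERRQ(ierr);
>   }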
>
> All the data and test code (modified ex52f) can be found via the dropbox
> link below.
>
> https://www.dropbox.com/s/4al1a60creogd8m/petsc-superlu-test.tar.gz?dl=0
>
>
> Summary of my test is shown below.
>
> First, use Matrix 1 to set up the KSP solver and factorization, then
> solve matrices 168 to 172:
>
> mpiexec.hydra -n 1 ./ex52f -f0
> /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/a_flow_check_1.bin
> -rhs
> /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/b_flow_check_1.bin
> -loop_matrices flow_check -loop_folder
> /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin -pc_type lu
> -pc_factor_mat_solver_package superlu_dist
>
> Norm of error  3.8815E-11 iterations     1
>  -->Test for matrix          168
> Norm of error  4.2307E-01 iterations    32
>  -->Test for matrix          169
> Norm of error  3.0528E-01 iterations    32
>  -->Test for matrix          170
> Norm of error  3.1177E-01 iterations    32
>  -->Test for matrix          171
> Norm of error  3.2793E-01 iterations    32
>  -->Test for matrix          172
> Norm of error  3.1251E-01 iterations    31
>
> Second, use Matrix 1 to set up the KSP solver and factorization through
> the SuperLU-related code in ex52f ('-superlu_default'). I thought this
> would generate the same results as the first test, but it does not.
>
> mpiexec.hydra -n 1 ./ex52f -f0
> /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/a_flow_check_1.bin
> -rhs
> /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/b_flow_check_1.bin
> -loop_matrices flow_check -loop_folder
> /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin -superlu_default
>
> Norm of error  2.2632E-12 iterations     1
>  -->Test for matrix          168
> Norm of error  1.0817E+04 iterations     1
>  -->Test for matrix          169
> Norm of error  1.0786E+04 iterations     1
>  -->Test for matrix          170
> Norm of error  1.0792E+04 iterations     1
>  -->Test for matrix          171
> Norm of error  1.0792E+04 iterations     1
>  -->Test for matrix          172
> Norm of error  1.0792E+04 iterations     1
>
>
> Third, use Matrix 168 to set up the KSP solver and factorization, then
> solve matrices 168 to 172:
>
> mpiexec.hydra -n 1 ./ex52f -f0
> /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/a_flow_check_168.bin
> -rhs
> /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/b_flow_check_168.bin
> -loop_matrices flow_check -loop_folder
> /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin -pc_type lu
> -pc_factor_mat_solver_package superlu_dist
>
> Norm of error  9.5528E-10 iterations     1
>  -->Test for matrix          168
> Norm of error  9.4945E-10 iterations     1
>  -->Test for matrix          169
> Norm of error  6.4279E-10 iterations     1
>  -->Test for matrix          170
> Norm of error  7.4633E-10 iterations     1
>  -->Test for matrix          171
> Norm of error  7.4863E-10 iterations     1
>  -->Test for matrix          172
> Norm of error  8.9701E-10 iterations     1
>
> Fourth, use Matrix 168 to set up the KSP solver and factorization through
> the SuperLU-related code in ex52f ('-superlu_default'). I thought this
> would generate the same results as the third test, but it does not.
>
> mpiexec.hydra -n 1 ./ex52f -f0
> /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/a_flow_check_168.bin
> -rhs
> /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/b_flow_check_168.bin
> -loop_matrices flow_check -loop_folder
> /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin -superlu_default
>
> Norm of error  3.7017E-11 iterations     1
>  -->Test for matrix          168
> Norm of error  3.6420E-11 iterations     1
>  -->Test for matrix          169
> Norm of error  3.7184E-11 iterations     1
>  -->Test for matrix          170
> Norm of error  3.6847E-11 iterations     1
>  -->Test for matrix          171
> Norm of error  3.7883E-11 iterations     1
>  -->Test for matrix          172
> Norm of error  3.8805E-11 iterations     1
>
> Thanks very much,
>
> Danyang
>
> On 15-12-03 01:59 PM, Hong wrote:
>
> Danyang:
> Further testing a_flow_check_168.bin:
> ./ex10 -f0 /Users/Hong/Downloads/matrix_and_rhs_bin/a_flow_check_168.bin
> -rhs /Users/Hong/Downloads/matrix_and_rhs_bin/x_flow_check_168.bin -pc_type
> lu -pc_factor_mat_solver_package superlu -ksp_monitor_true_residual
> -mat_superlu_conditionnumber
>   Recip. condition number = 1.610480e-12
>   0 KSP preconditioned resid norm 6.873340313547e+09 true resid norm
> 7.295020990196e+03 ||r(i)||/||b|| 1.000000000000e+00
>   1 KSP preconditioned resid norm 2.051833296449e-02 true resid norm
> 2.976859070118e-02 ||r(i)||/||b|| 4.080672384793e-06
> Number of iterations =   1
> Residual norm 0.0297686
>
> condition number of this matrix = 1/1.610480e-12 ~ 6.2e+11, i.e., on the
> order of 1e+12; this matrix is ill-conditioned. (With double-precision
> unit roundoff eps ~ 1.1e-16, a direct solve of such a system can lose up
> to roughly kappa*eps ~ 1e-4 in relative accuracy.)
>
> Hong
>
>
> Hi Hong,
>>
>> The matrix, rhs and solution in binary format can be downloaded via the
>> link below.
>>
>> https://www.dropbox.com/s/cl3gfi0s0kjlktf/matrix_and_rhs_bin.tar.gz?dl=0
>>
>> Thanks,
>>
>> Danyang
>>
>>
>> On 15-12-03 10:50 AM, Hong wrote:
>>
>> Danyang:
>>
>>>
>>>
>>> To my surprise, the solutions from SuperLU at timestep 29 seem to be
>>> incorrect for the first 4 Newton iterations, but the solutions from the
>>> iterative solver and from MUMPS are correct.
>>>
>>> Please find all the matrices, rhs vectors and solutions at timestep 29
>>> via the link below. The data is a bit large, so I share it through
>>> Dropbox. A piece of matlab code to read these data and then compute the
>>> norm is also attached.
>>> https://www.dropbox.com/s/rr8ueysgflmxs7h/results-check.tar.gz?dl=0
>>>
>>
>> Can you send us the matrix in petsc binary format?
>>
>> e.g., call MatView(M, PETSC_VIEWER_BINARY_(PETSC_COMM_WORLD))
>> or '-ksp_view_mat binary'
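>>
>> In C, a minimal sketch for writing the matrix and the rhs into a single
>> binary file (the file name is only an example; M is the matrix and b the
>> rhs vector):
>>
>>   PetscViewer viewer;
>>   ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD,"flow_check.bin",FILE_MODE_WRITE,&viewer);CHKERRQ(ierr);
>>   ierr = MatView(M,viewer);CHKERRQ(ierr);   /* matrix first */
>>   ierr = VecView(b,viewer);CHKERRQ(ierr);   /* then the rhs, in the same file */
>>   ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);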
>>
>> Hong
>>
>>>
>>>
>>> Below is a summary of the norms from the three solvers at timestep 29,
>>> Newton iterations 1 to 5.
>>>
>>> Timestep 29
>>> Norm of residual seq 1.661321e-09, superlu 1.657103e+04, mumps 3.731225e-11
>>> Norm of residual seq 1.753079e-09, superlu 6.675467e+02, mumps 1.509919e-13
>>> Norm of residual seq 4.914971e-10, superlu 1.236362e-01, mumps 2.139303e-17
>>> Norm of residual seq 3.532769e-10, superlu 1.304670e-04, mumps 5.387000e-20
>>> Norm of residual seq 3.885629e-10, superlu 2.754876e-07, mumps 4.108675e-21
>>>
>>> Would anybody please check whether SuperLU can solve these matrices?
>>> Another possibility is that something is wrong in my own code, but so
>>> far I cannot find any problem there, since the same code works fine with
>>> the iterative solver or with the direct solver MUMPS. For all the other
>>> cases I have tested, all of these solvers work fine.
>>>
>>> Please let me know if I have not described the problem clearly.
>>>
>>> Thanks,
>>>
>>> Danyang
>>>
>>>
>>>
>>>
>>
>>
>
>