[petsc-users] SuperLU_dist issue in 3.7.4

Hong hzhang at mcs.anl.gov
Mon Nov 7 11:18:52 CST 2016


Anton:
I am planning to work on this as soon as I get time. I assume that your
code is working with the option '-mat_superlu_dist_fact
SamePattern_SameRowPerm'. If not, let me know.

What I'm planning to do is detect the existence of Pc and Pr in the petsc
interface and then set the reuse option accordingly, so users will not be
bothered by it.
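
A minimal sketch of that detection logic, following Sherry's description
quoted below of which permutations each options.Fact value needs as input.
The enum is a local stand-in that only mirrors the option names from this
thread, and choose_fact() and fact_name() are hypothetical helpers, not code
from the PETSc interface or from SuperLU_DIST:

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-in mirroring the options.Fact names discussed in this thread.
   Assumption: this is NOT the type declared in the SuperLU_DIST headers. */
typedef enum { DOFACT, SamePattern, SamePattern_SameRowPerm } fact_choice_t;

/* Hypothetical helper: choose the reuse mode from what a previous
   factorization left behind.
   have_pc: a column (sparsity) permutation Pc is available as input.
   have_pr: a row (pivoting) permutation Pr is available as input. */
fact_choice_t choose_fact(bool have_pc, bool have_pr)
{
    if (have_pc && have_pr)
        return SamePattern_SameRowPerm; /* skip both ordering steps */
    if (have_pc)
        return SamePattern;             /* skip only the sparsity ordering */
    return DOFACT;                      /* nothing reusable: start from scratch */
}

/* Map a choice back to the option name used on the command line. */
const char *fact_name(fact_choice_t f)
{
    switch (f) {
    case DOFACT:                  return "DOFACT";
    case SamePattern:             return "SamePattern";
    case SamePattern_SameRowPerm: return "SamePattern_SameRowPerm";
    }
    assert(!"unreachable: unknown fact_choice_t value");
    return "";
}
```

With this sketch, a first factorization (no Pc, no Pr) maps to DOFACT, and a
later one with both permutations stored maps to SamePattern_SameRowPerm. It
only checks whether the inputs exist. Whether the row permutation should
still be reused when the matrix values change significantly is a separate
question that this sketch does not address.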

>
> Setting Options.Fact = DOFACT for all factorizations is currently
> impossible via PETSc interface.
>
This might be a bug on our side. I'll check it.


> The user is expected to choose some kind of reuse model.
> If you could add it, I (and other users probably too) would really
> appreciate that.
>

I'll try to get it done soon and will let you know. Thanks for your patience.

Hong

>
>
>
> I'll check our interface to see if we can add flag-checking for Pr and Pc,
> then set default accordingly.
>
> Hong
>
> On Wed, Oct 26, 2016 at 3:23 PM, Xiaoye S. Li <xsli at lbl.gov> wrote:
>
>> Some graph preprocessing steps can be skipped ONLY IF a previous
>> factorization was done, and the information can be reused (AS INPUT) to the
>> new factorization.
>>
>> In general, the driver routine pdgssvx() in SRC/pdgssvx.c performs the LU
>> factorization of the following (preprocessed) matrix:
>>  Pc*Pr*diag(R)*A*diag(C)*Pc^T = L*U
>>
>> The default is to do LU from scratch, including all the steps to compute
>> equilibration (R, C), pivot ordering (Pr), and sparsity ordering (Pc).
>>
>> -- The default should be set as options.Fact = DOFACT.
>>
>> -- When you set options.Fact = SamePattern, the sparsity ordering step is
>> skipped, but you need to input Pc which was obtained from a previous
>> factorization.
>>
>> -- When you set options.Fact = SamePattern_SameRowPerm, both sparsity
>> reordering and pivoting ordering steps are skipped, but you need to input
>> both Pr and Pc.
>>
>> Please see Lines 258 - 307 comments in SRC/pdgssvx.c for details,
>> regarding which data structures should be inputs and which are outputs.
>> The Users Guide also explains this.
>>
>> In EXAMPLE/ directory, I have various examples of these usage situations,
>> see EXAMPLE/README.
>>
>> I am a little puzzled why in PETSc, the default is set to SamePattern ??
>>
>> Sherry
>>
>>
>> On Tue, Oct 25, 2016 at 9:18 AM, Hong <hzhang at mcs.anl.gov> wrote:
>>
>>> Sherry,
>>>
>>> We set '-mat_superlu_dist_fact SamePattern'  as default in
>>> petsc/superlu_dist on 12/6/15 (see attached email below).
>>>
>>> However, Anton must set 'SamePattern_SameRowPerm' to avoid a crash in his
>>> code. Checking
>>> http://crd-legacy.lbl.gov/~xiaoye/SuperLU/superlu_dist_code_html/pzgssvx___a_bglobal_8c.html
>>> I see a detailed description of using SamePattern_SameRowPerm, which
>>> requires more from the user than SamePattern. I guess these flags are used
>>> for efficiency. The library sets a default, then has users switch it for
>>> their own applications. The default setting should not cause a crash. If
>>> a crash occurs, giving a meaningful error message would be helpful.
>>>
>>> Do you have a suggestion for how we should set the default in petsc for
>>> this flag?
>>>
>>> Hong
>>>
>>> -------------------
>>> Hong <hzhang at mcs.anl.gov>
>>> 12/7/15
>>>
>>> to Danyang, petsc-maint, PETSc, Xiaoye
>>> Danyang :
>>>
>>> Adding '-mat_superlu_dist_fact SamePattern' fixed the problem. Below is
>>> how I figured it out.
>>>
>>> 1. Reading ex52f.F, I see '-superlu_default' =
>>> '-pc_factor_mat_solver_package superlu_dist', the latter enables runtime
>>> options for other packages. I use superlu_dist-4.2 and superlu-4.1 for the
>>> tests below.
>>> ...
>>> 5.
>>> Using a_flow_check_1.bin, I am able to reproduce the error you reported:
>>> all packages give correct results except superlu_dist:
>>> ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_1.bin -rhs
>>> matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices flow_check
>>> -loop_folder matrix_and_rhs_bin -pc_type lu -pc_factor_mat_solver_package
>>> superlu_dist
>>> Norm of error  2.5970E-12 iterations     1
>>>  -->Test for matrix          168
>>> Norm of error  1.3936E-01 iterations    34
>>>  -->Test for matrix          169
>>>
>>> I guess the error might come from reuse of matrix factor. Replacing
>>> default
>>> -mat_superlu_dist_fact <SamePattern_SameRowPerm> with
>>> -mat_superlu_dist_fact SamePattern, I get
>>>
>>> ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_1.bin -rhs
>>> matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices flow_check
>>> -loop_folder matrix_and_rhs_bin -pc_type lu -pc_factor_mat_solver_package
>>> superlu_dist -mat_superlu_dist_fact SamePattern
>>>
>>> Norm of error  2.5970E-12 iterations     1
>>>  -->Test for matrix          168
>>> ...
>>> Sherry may tell you why SamePattern_SameRowPerm causes the difference
>>> here.
>>> Based on the above experiments, I would set the following as defaults:
>>> '-mat_superlu_diagpivotthresh 0.0' in petsc/superlu interface.
>>> '-mat_superlu_dist_fact SamePattern' in petsc/superlu_dist interface.
>>>
>>> Hong
>>>
>>> On Tue, Oct 25, 2016 at 10:38 AM, Hong <hzhang at mcs.anl.gov> wrote:
>>>
>>>> Anton,
>>>> I guess, when you reuse matrix and its symbolic factor with updated
>>>> numerical values, superlu_dist requires this option. I'm cc'ing Sherry to
>>>> confirm it.
>>>>
>>>> I'll check petsc/superlu-dist interface to set this flag for this case.
>>>>
>>>> Hong
>>>>
>>>>
>>>> On Tue, Oct 25, 2016 at 8:20 AM, Anton Popov <popov at uni-mainz.de>
>>>> wrote:
>>>>
>>>>> Hong,
>>>>>
>>>>> I get all the problems gone and valgrind-clean output if I specify
>>>>> this:
>>>>>
>>>>> -mat_superlu_dist_fact SamePattern_SameRowPerm
>>>>> What does SamePattern_SameRowPerm actually mean?
>>>>> Row permutations are for large diagonal, column permutations are for
>>>>> sparsity, right?
>>>>> Will it skip subsequent matrix permutations for large diagonal even if
>>>>> matrix values change significantly?
>>>>>
>>>>> Surprisingly everything works even with:
>>>>>
>>>>> -mat_superlu_dist_colperm PARMETIS
>>>>> -mat_superlu_dist_parsymbfact TRUE
>>>>>
>>>>> Thanks,
>>>>> Anton
>>>>>
>>>>> On 10/24/2016 09:06 PM, Hong wrote:
>>>>>
>>>>> Anton:
>>>>>>
>>>>>> If replacing superlu_dist with mumps, does your code work?
>>>>>>
>>>>>> yes
>>>>>>
>>>>>
>>>>> You may use mumps in your code, or test different options for
>>>>> superlu_dist:
>>>>>
>>>>>   -mat_superlu_dist_equil: <TRUE> Equilibrate matrix (None)
>>>>>   -mat_superlu_dist_rowperm <LargeDiag> Row permutation (choose one
>>>>> of) LargeDiag NATURAL (None)
>>>>>   -mat_superlu_dist_colperm <METIS_AT_PLUS_A> Column permutation
>>>>> (choose one of) NATURAL MMD_AT_PLUS_A MMD_ATA METIS_AT_PLUS_A PARMETIS
>>>>> (None)
>>>>>   -mat_superlu_dist_replacetinypivot: <FALSE> Replace tiny pivots
>>>>> (None)
>>>>>   -mat_superlu_dist_parsymbfact: <FALSE> Parallel symbolic
>>>>> factorization (None)
>>>>>   -mat_superlu_dist_fact <SamePattern> Sparsity pattern for repeated
>>>>> matrix factorization (choose one of) SamePattern SamePattern_SameRowPerm
>>>>> (None)
>>>>>
>>>>> The options inside <> are defaults. You may try others. This might
>>>>> help narrow down the bug.
>>>>>
>>>>> Hong
>>>>>
>>>>>>
>>>>>> Hong
>>>>>>>
>>>>>>> On 10/24/2016 05:47 PM, Hong wrote:
>>>>>>>
>>>>>>> Barry,
>>>>>>> Your change indeed fixed the error of his testing code.
>>>>>>> As Satish tested, on your branch, ex16 runs smoothly.
>>>>>>>
>>>>>>> I do not understand why on the maint or master branch, ex16 crashes
>>>>>>> inside superlu_dist, but not with mumps.
>>>>>>>
>>>>>>>
>>>>>>> I also confirm that ex16 runs fine with latest fix, but
>>>>>>> unfortunately not my code.
>>>>>>>
>>>>>>> This is something to be expected, since my code preallocates once in
>>>>>>> the beginning. So there is no way it can be affected by multiple
>>>>>>> preallocations. Subsequently I only do matrix assembly, that makes sure
>>>>>>> structure doesn't change (set to get error otherwise).
>>>>>>>
>>>>>>> Summary: we don't have a simple test code to debug superlu issue
>>>>>>> anymore.
>>>>>>>
>>>>>>> Anton
>>>>>>>
>>>>>>> Hong
>>>>>>>
>>>>>>> On Mon, Oct 24, 2016 at 9:34 AM, Satish Balay <balay at mcs.anl.gov>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On Mon, 24 Oct 2016, Barry Smith wrote:
>>>>>>>>
>>>>>>>> >
>>>>>>>> > > [Or perhaps Hong is using a different test code and is
>>>>>>>> observing bugs
>>>>>>>> > > with superlu_dist interface..]
>>>>>>>> >
>>>>>>>> >    She states that her test does a NEW MatCreate() for each
>>>>>>>> matrix load (I cut and pasted it in the email I just sent). The bug I fixed
>>>>>>>> was only related to using the SAME matrix from one MatLoad() in another
>>>>>>>> MatLoad().
>>>>>>>>
>>>>>>>> Ah - ok.. Sorry - wasn't thinking clearly :(
>>>>>>>>
>>>>>>>> Satish
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
>