[petsc-users] Error using MUMPS to solve large linear system

Xiaoye S. Li xsli at lbl.gov
Fri Mar 7 15:00:33 CST 2014


For superlu_dist, you can try:

options.ReplaceTinyPivot  = NO;   (I think default is YES)

and/or

options.IterRefine = YES;
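
For reference, the PETSc superlu_dist wrapper appears to expose both settings as
runtime options; the option names below are assumptions, so verify them against
the -help output before relying on them:

  -mat_superlu_dist_replacetinypivot 0
  -mat_superlu_dist_iterrefine 1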


Sherry Li



> On Sun, Mar 2, 2014 at 2:23 PM, Matt Landreman <matt.landreman at gmail.com> wrote:

> Hi,
>
> I'm having some problems with my PETSc application similar to the ones
> discussed in this thread, so perhaps one of you can help. In my application
> I factorize a preconditioner matrix with mumps or superlu_dist, then use this
> factorized preconditioner to accelerate GMRES on a matrix that is denser
> than the preconditioner.  I've been running on Edison at NERSC.  My program
> works reliably for problem sizes below about 1 million x 1 million, but
> above this size the factorization step fails in one of many possible ways,
> depending on the compiler, # of nodes, # of procs/node, etc.:
>
> When I use superlu_dist, I get 1 of 2 failure modes:
> (1) the first step of KSP returns "0 KSP residual norm -nan" and ksp then
> returns KSPConvergedReason = -9, or
> (2) the factorization completes, but GMRES then converges excruciatingly
> slowly or not at all, even if I choose the "real" matrix to be identical to
> the preconditioner matrix so KSP ought to converge in 1 step (which it does
> for smaller matrices).
>
> For mumps, the factorization can fail in many different ways:
> (3) With the intel compiler I usually get "Caught signal number 11 SEGV:
> Segmentation Violation"
> (4) Sometimes with the intel compiler I get "Caught signal number 7 BUS:
> Bus Error"
> (5) With the gnu compiler I often get a bunch of lines like "problem with
> NIV2_FLOPS message  -5.9604644775390625E-008           0
>  -227464733.99999997"
> (6) Other times with gnu I get a mumps error with INFO(1)=-9 or
> INFO(1)=-17. The mumps documentation suggests I should increase icntl(14),
> but what is an appropriate value? 50? 10000? (A sketch of the corresponding
> runtime option follows this list of failure modes.)
> (7) With the Cray compiler I consistently get this cryptic error:
> Fatal error in PMPI_Test: Invalid MPI_Request, error stack:
> PMPI_Test(166): MPI_Test(request=0xb228dbf3c, flag=0x7ffffffe097c,
> status=0x7ffffffe0a00) failed
> PMPI_Test(121): Invalid MPI_Request
> _pmiu_daemon(SIGCHLD): [NID 02784] [c6-1c1s8n0] [Sun Mar  2 10:35:20 2014]
> PE RANK 0 exit signal Aborted
> [NID 02784] 2014-03-02 10:35:20 Apid 3374579: initiated application
> termination
> Application 3374579 exit codes: 134
>
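A minimal sketch for the icntl(14) question in (6): icntl(14) is the percentage of
extra workspace allowed on top of MUMPS's estimate, so it is usually grown in
moderate steps rather than set to a huge value at once (illustrative values only):

  -mat_mumps_icntl_14 50      (50% extra workspace over the estimate)
  -mat_mumps_icntl_14 200     (then 200%, and so on, if INFO(1)=-9 persists)

If the value needed keeps climbing into the thousands, that may point to a problem
beyond a tight workspace estimate.
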
> For linear systems smaller than roughly 1 million x 1 million, my application
> is very robust: it works consistently with both mumps & superlu_dist, for a
> wide range of # of nodes and # of procs/node, and with all 3
> available compilers on Edison (Intel, GNU, Cray).
>
> By the way, mumps failed for much smaller problems until I tried
> -mat_mumps_icntl_7 2 (inspired by your conversation last week). I tried all
> the other options for icntl(7), icntl(28), and icntl(29), finding
> icntl(7)=2 works best by far.  I tried the flags that worked for Samar
> (-mat_superlu_dist_colperm PARMETIS -mat_superlu_dist_parsymbfact 1) with
> superlu_dist, but they did not appear to change anything in my case.
>
> Can you recommend any other parameters of petsc, superlu_dist, or mumps
> that I should try changing?  I don't care in the end whether I use
> superlu_dist or mumps.
>
> Thanks!
>
> Matt Landreman
>
>
> On Tue, Feb 25, 2014 at 3:50 PM, Xiaoye S. Li <xsli at lbl.gov> wrote:
>
>> Very good!  Thanks for the update.
>> I guess you are using all 16 cores per node?  Since superlu_dist
>> currently is MPI-only, if you generate 16 MPI tasks per node, the serial
>> symbolic factorization has less than 2 GB of memory to work with.
>>
>> Sherry
>>
>>
>> On Tue, Feb 25, 2014 at 12:22 PM, Samar Khatiwala <spk at ldeo.columbia.edu> wrote:
>>
>>> Hi Sherry,
>>>
>>> Thanks! I tried your suggestions and it worked!
>>>
>>> For the record I added these flags: -mat_superlu_dist_colperm PARMETIS
>>> -mat_superlu_dist_parsymbfact 1
>>>
>>> Also, for completeness and since you asked:
>>>
>>> size: 2346346 x 2346346
>>> nnz:  60856894
>>> unsymmetric
>>>
>>> The hardware (http://www2.cisl.ucar.edu/resources/yellowstone/hardware)
>>> specs are: 2 GB/core, 32 GB/node (27 GB usable), 16 cores per node.
>>> I've been running on 8 nodes (so 8 x 27 ~ 216 GB).
>>>
>>> Thanks again for your help!
>>>
>>> Samar
>>>
>>> On Feb 25, 2014, at 1:00 PM, "Xiaoye S. Li" <xsli at lbl.gov> wrote:
>>>
>>> I didn't follow the discussion thread closely ... How large is your
>>> matrix dimension, and number of nonzeros?
>>> How large is the memory per core (or per node)?
>>>
>>> The default setting in superlu_dist is to use serial symbolic
>>> factorization. You can turn on parallel symbolic factorization by:
>>>
>>> options.ParSymbFact = YES;
>>> options.ColPerm = PARMETIS;
>>>
>>> Is your matrix symmetric?  If so, note that you still need to give both the
>>> upper and lower halves of matrix A to superlu, which does not exploit symmetry.
>>>
>>> Do you know whether you need numerical pivoting?  If not, you can turn
>>> off pivoting by:
>>>
>>> options.RowPerm = NATURAL;
>>>
>>> This avoids some other serial bottleneck.
>>>
>>> All these options can be turned on in the petsc interface. Please check
>>> out the syntax there.
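
A minimal sketch of what that looks like at runtime (the first two option names are
confirmed elsewhere in this thread; the row-permutation name is an assumption, so
check it against -help):

  -mat_superlu_dist_colperm PARMETIS
  -mat_superlu_dist_parsymbfact 1
  -mat_superlu_dist_rowperm NATURAL   (only if you are sure no numerical pivoting is needed)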
>>>
>>>
>>> Sherry
>>>
>>>
>>>
>>> On Tue, Feb 25, 2014 at 8:07 AM, Samar Khatiwala <spk at ldeo.columbia.edu> wrote:
>>>
>>>> Hi Barry,
>>>>
>>>> You're probably right. I note that the error occurs almost instantly
>>>> and I've tried increasing the number of CPUs
>>>> (as many as ~1000 on Yellowstone) to no avail. I know this is a big
>>>> problem but I didn't think it was that big!
>>>>
>>>> Sherry: Is there any way to write out more diagnostic info? E.g., how
>>>> much memory superlu thinks it needs/is attempting
>>>> to allocate.
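
One rough way to watch memory from the PETSc side is sketched below; it reports the
process's resident memory (rank 0 only, since PetscPrintf prints from rank 0), not
superlu's own internal estimate:

  PetscErrorCode  ierr;
  PetscLogDouble  mem;
  ierr = PetscMemoryGetCurrentUsage(&mem);CHKERRQ(ierr);  /* resident set size, in bytes */
  ierr = PetscPrintf(PETSC_COMM_WORLD,"memory before factorization: %g bytes\n",mem);CHKERRQ(ierr);
  ierr = KSPSolve(ksp,b,x);CHKERRQ(ierr);                 /* factorization happens inside the solve */
  ierr = PetscMemoryGetCurrentUsage(&mem);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_WORLD,"memory after factorization:  %g bytes\n",mem);CHKERRQ(ierr);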
>>>>
>>>> Thanks,
>>>>
>>>> Samar
>>>>
>>>> On Feb 25, 2014, at 10:57 AM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>> >
>>>> >>
>>>> >> I tried superlu_dist again and it crashes even more quickly than
>>>> MUMPS with just the following error:
>>>> >>
>>>> >> ERROR: 0031-250  task 128: Killed
>>>> >
>>>> >   This is usually a symptom of running out of memory.
>>>> >
>>>> >>
>>>> >> Absolutely nothing else is written out to either stderr or stdout.
>>>> This is with -mat_superlu_dist_statprint.
>>>> >> The program works fine on a smaller matrix.
>>>> >>
>>>> >> This is the sequence of calls:
>>>> >>
>>>> >> KSPSetType(ksp,KSPPREONLY);
>>>> >> PCSetType(pc,PCLU);
>>>> >> PCFactorSetMatSolverPackage(pc,MATSOLVERSUPERLU_DIST);
>>>> >> KSPSetFromOptions(ksp);
>>>> >> PCSetFromOptions(pc);
>>>> >> KSPSolve(ksp,b,x);
>>>> >>
>>>> >> All of these successfully return *except* the very last one to
>>>> KSPSolve.
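
For what it's worth, below is a sketch of the same sequence with PETSc's usual error
checking and a post-solve convergence query; this only helps when the failure is
reported back to PETSc rather than the job being killed outright:

  PetscErrorCode     ierr;
  KSPConvergedReason reason;

  ierr = KSPSetType(ksp,KSPPREONLY);CHKERRQ(ierr);
  ierr = PCSetType(pc,PCLU);CHKERRQ(ierr);
  ierr = PCFactorSetMatSolverPackage(pc,MATSOLVERSUPERLU_DIST);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
  ierr = PCSetFromOptions(pc);CHKERRQ(ierr);
  ierr = KSPSolve(ksp,b,x);CHKERRQ(ierr);                    /* factorization + solve */
  ierr = KSPGetConvergedReason(ksp,&reason);CHKERRQ(ierr);   /* negative values indicate divergence */
  if (reason < 0) {
    ierr = PetscPrintf(PETSC_COMM_WORLD,"KSP diverged, reason %d\n",(int)reason);CHKERRQ(ierr);
  }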
>>>> >>
>>>> >> Any help would be appreciated. Thanks!
>>>> >>
>>>> >> Samar
>>>> >>
>>>> >> On Feb 24, 2014, at 3:58 PM, Xiaoye S. Li <xsli at lbl.gov> wrote:
>>>> >>
>>>> >>> Samar:
>>>> >>> If you include the error message from the superlu_dist crash, I can
>>>> probably tell the reason.  (Better yet, include the printout before the
>>>> crash.)
>>>> >>>
>>>> >>> Sherry
>>>> >>>
>>>> >>>
>>>> >>> On Mon, Feb 24, 2014 at 9:56 AM, Hong Zhang <hzhang at mcs.anl.gov>
>>>> wrote:
>>>> >>> Samar :
>>>> >>> There are limitations for direct solvers.
>>>> >>> Do not expect that any solver can be used on arbitrarily large problems.
>>>> >>> Since superlu_dist also crashes, direct solvers may not be able to
>>>> work on your application.
>>>> >>> This is why I suggest increasing the size incrementally.
>>>> >>> You may have to experiment with other types of solvers.
>>>> >>>
>>>> >>> Hong
>>>> >>>
>>>> >>> Hi Hong and Jed,
>>>> >>>
>>>> >>> Many thanks for replying. It would indeed be nice if the error
>>>> messages from MUMPS were less cryptic!
>>>> >>>
>>>> >>> 1) I have tried smaller matrices, although given how my problem is
>>>> set up, a jump in size is difficult to avoid. But it is a good idea
>>>> >>> that I will try.
>>>> >>>
>>>> >>> 2) I did try various ordering but not the one you suggested.
>>>> >>>
>>>> >>> 3) Tracing the error through the MUMPS code suggests a rather abrupt
>>>> termination of the program (there should be more
>>>> >>> error messages if, for example, memory were the problem). I therefore
>>>> thought it might be an interface problem rather than
>>>> >>> one with mumps and turned to the petsc-users group first.
>>>> >>>
>>>> >>> 4) I've tried superlu_dist but it also crashes (also unclear as to
>>>> why), at which point I decided to try mumps. The fact that both
>>>> >>> crash would again indicate a common (memory?) problem.
>>>> >>>
>>>> >>> I'll try a few more things before asking the MUMPS developers.
>>>> >>>
>>>> >>> Thanks again for your help!
>>>> >>>
>>>> >>> Samar
>>>> >>>
>>>> >>> On Feb 24, 2014, at 11:47 AM, Hong Zhang <hzhang at mcs.anl.gov>
>>>> wrote:
>>>> >>>
>>>> >>>> Samar:
>>>> >>>> The crash occurs in
>>>> >>>> ...
>>>> >>>> [161]PETSC ERROR: Error in external library!
>>>> >>>> [161]PETSC ERROR: Error reported by MUMPS in numerical
>>>> factorization phase: INFO(1)=-1, INFO(2)=48
>>>> >>>>
>>>> >>>> For a very large matrix, this is likely a memory problem, as you suspected.
>>>> >>>> I would suggest
>>>> >>>> 1. run problems with gradually increasing sizes (do not jump from a
>>>> small one to a very large one) and observe memory usage using
>>>> >>>> '-ksp_view'.
>>>> >>>>   I see you use '-mat_mumps_icntl_14 1000', i.e., percentage of
>>>> estimated workspace increase. Is it too large?
>>>> >>>>   Anyway, this input should not cause the crash, I guess.
>>>> >>>> 2. experiment with different matrix orderings via -mat_mumps_icntl_7
>>>> <> (I usually use sequential ordering 2); see also the combined option
>>>> sketch after this list.
>>>> >>>>    I see you use parallel ordering -mat_mumps_icntl_29 2.
>>>> >>>> 3. send a bug report to the mumps developers for their suggestions.
>>>> >>>>
>>>> >>>> 4. try other direct solvers, e.g., superlu_dist.
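
As one illustrative (not prescriptive) combination of the runtime options mentioned in
this thread, for getting more information out of a failing MUMPS run:

  -ksp_view -mat_mumps_icntl_4 2 -mat_mumps_icntl_7 2 -mat_mumps_icntl_14 100

Here -mat_mumps_icntl_4 raises MUMPS's own verbosity (it is used with value 3 later in
this message), and the icntl_14 value is only a placeholder.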
>>>> >>>>
>>>> >>>> ...
>>>> >>>>
>>>> >>>> etc. etc. The above error, I can tell, has something to do with
>>>> processor 48 (INFO(2)) and so forth, but not the previous one.
>>>> >>>>
>>>> >>>> The full output enabled with -mat_mumps_icntl_4 3 is in the
>>>> attached file. Any hints as to what could be giving this
>>>> >>>> error would be very much appreciated.
>>>> >>>>
>>>> >>>> I do not know how to interpret this output file. The mumps developers
>>>> would give you better suggestions on it.
>>>> >>>> I would appreciate learning as well :-)
>>>> >>>>
>>>> >>>> Hong
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>
>>>> >
>>>>
>>>>
>>>
>>>
>>
>