[petsc-users] Error using MUMPS to solve large linear system

Xiaoye S. Li xsli at lbl.gov
Tue Feb 25 12:00:55 CST 2014


I didn't follow the discussion thread closely ... How large are your matrix
dimension and number of nonzeros?
How much memory do you have per core (or per node)?

The default setting in superlu_dist is to use serial symbolic
factorization. You can turn on parallel symbolic factorization by setting:

options.ParSymbFact = YES;
options.ColPerm = PARMETIS;

Is your matrix symmetric? If so, you need to give both the upper and lower
halves of matrix A to superlu, which doesn't exploit symmetry.
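
For example, if the matrix currently lives in PETSc's symmetric SBAIJ
format, a minimal sketch of expanding it to full AIJ storage (assuming an
already-assembled matrix A; Afull is a hypothetical name) would be:

Mat Afull;
/* expand symmetric (SBAIJ) storage into full AIJ storage for superlu */
MatConvert(A, MATAIJ, MAT_INITIAL_MATRIX, &Afull);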

Do you know whether you need numerical pivoting? If not, you can turn off
pivoting with:

options.RowPerm = NATURAL;

This avoids another serial bottleneck.
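
Putting these together through the native SuperLU_DIST C API, a minimal
sketch (assuming the superlu_options_t struct and set_default_options_dist()
from SuperLU_DIST 3.x; NOROWPERM is the rowperm enum value that the PETSc
interface spells NATURAL):

#include "superlu_ddefs.h"

superlu_options_t options;
set_default_options_dist(&options);   /* start from the library defaults */
options.ParSymbFact = YES;            /* parallel symbolic factorization */
options.ColPerm     = PARMETIS;       /* parallel ordering, needed with ParSymbFact */
options.RowPerm     = NOROWPERM;      /* no numerical (row) pivoting */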

All of these options can be set through the PETSc interface; please check
the syntax there.
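
For example (a sketch; please verify the exact runtime option names against
your PETSc version's -help output):

-mat_superlu_dist_parsymbfact
-mat_superlu_dist_colperm PARMETIS
-mat_superlu_dist_rowperm NATURAL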


Sherry



On Tue, Feb 25, 2014 at 8:07 AM, Samar Khatiwala <spk at ldeo.columbia.edu> wrote:

> Hi Barry,
>
> You're probably right. I note that the error occurs almost instantly and
> I've tried increasing the number of CPUs
> (as many as ~1000 on Yellowstone) to no avail. I know this is a big
> problem but I didn't think it was that big!
>
> Sherry: Is there any way to write out more diagnostic info? E.g., how much
> memory superlu thinks it needs/is attempting to allocate.
>
> Thanks,
>
> Samar
>
> On Feb 25, 2014, at 10:57 AM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> >
> >>
> >> I tried superlu_dist again and it crashes even more quickly than MUMPS
> with just the following error:
> >>
> >> ERROR: 0031-250  task 128: Killed
> >
> >   This is usually a symptom of running out of memory.
> >
> >>
> >> Absolutely nothing else is written out to either stderr or stdout. This
> is with -mat_superlu_dist_statprint.
> >> The program works fine on a smaller matrix.
> >>
> >> This is the sequence of calls:
> >>
> >> KSPSetType(ksp,KSPPREONLY);
> >> PCSetType(pc,PCLU);
> >> PCFactorSetMatSolverPackage(pc,MATSOLVERSUPERLU_DIST);
> >> KSPSetFromOptions(ksp);
> >> PCSetFromOptions(pc);
> >> KSPSolve(ksp,b,x);
> >>
> >> All of these return successfully *except* the very last one, the call to KSPSolve.
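> >>
> >> (A minimal error-checked variant of the same sequence, a sketch assuming
> the usual PetscErrorCode/CHKERRQ idiom inside a function returning
> PetscErrorCode, so a failure inside KSPSolve prints a full error trace:)
> >>
> >> PetscErrorCode ierr;
> >> ierr = KSPSetType(ksp,KSPPREONLY);CHKERRQ(ierr);
> >> ierr = PCSetType(pc,PCLU);CHKERRQ(ierr);
> >> ierr = PCFactorSetMatSolverPackage(pc,MATSOLVERSUPERLU_DIST);CHKERRQ(ierr);
> >> ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
> >> ierr = PCSetFromOptions(pc);CHKERRQ(ierr);
> >> ierr = KSPSolve(ksp,b,x);CHKERRQ(ierr);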
> >>
> >> Any help would be appreciated. Thanks!
> >>
> >> Samar
> >>
> >> On Feb 24, 2014, at 3:58 PM, Xiaoye S. Li <xsli at lbl.gov> wrote:
> >>
> >>> Samar:
> >>> If you include the error message printed when superlu_dist crashes, I
> can probably tell the reason. (Better yet, include the printout before
> the crash.)
> >>>
> >>> Sherry
> >>>
> >>>
> >>> On Mon, Feb 24, 2014 at 9:56 AM, Hong Zhang <hzhang at mcs.anl.gov>
> wrote:
> >>> Samar:
> >>> There are limitations for direct solvers.
> >>> Do not expect that any solver can be used on arbitrarily large problems.
> >>> Since superlu_dist also crashes, direct solvers may not be able to
> work on your application.
> >>> This is why I suggest increasing the size incrementally.
> >>> You may have to experiment with other types of solvers.
> >>>
> >>> Hong
> >>>
> >>> Hi Hong and Jed,
> >>>
> >>> Many thanks for replying. It would indeed be nice if the error
> messages from MUMPS were less cryptic!
> >>>
> >>> 1) I have tried smaller matrices, although given how my problem is set
> up, a jump in size is difficult to avoid. But it's a good idea that I
> will try.
> >>>
> >>> 2) I did try various orderings but not the one you suggested.
> >>>
> >>> 3) Tracing the error through the MUMPS code suggests a rather abrupt
> termination of the program (there should be more error messages if, for
> example, memory were the problem). I therefore thought it might be an
> interface problem rather than one with MUMPS, and turned to the
> petsc-users group first.
> >>>
> >>> 4) I've tried superlu_dist but it also crashes (also unclear as to
> why), at which point I decided to try MUMPS. The fact that both crash
> would again indicate a common (memory?) problem.
> >>>
> >>> I'll try a few more things before asking the MUMPS developers.
> >>>
> >>> Thanks again for your help!
> >>>
> >>> Samar
> >>>
> >>> On Feb 24, 2014, at 11:47 AM, Hong Zhang <hzhang at mcs.anl.gov> wrote:
> >>>
> >>>> Samar:
> >>>> The crash occurs in
> >>>> ...
> >>>> [161]PETSC ERROR: Error in external library!
> >>>> [161]PETSC ERROR: Error reported by MUMPS in numerical factorization
> phase: INFO(1)=-1, INFO(2)=48
> >>>>
> >>>> For a very large matrix, this is likely a memory problem, as you suspected.
> >>>> I would suggest:
> >>>> 1. Run problems with increasing sizes (do not jump from a small one to a
> very large one) and observe memory usage using '-ksp_view'.
> >>>>   I see you use '-mat_mumps_icntl_14 1000', i.e., the percentage of
> estimated workspace increase. Is it too large?
> >>>>   Anyway, this input should not cause the crash, I guess.
> >>>> 2. Experiment with different matrix orderings via -mat_mumps_icntl_7
> <> (I usually use sequential ordering 2).
> >>>>    I see you use the parallel ordering -mat_mumps_icntl_29 2 (a sample
> command line follows this list).
> >>>> 3. Send a bug report to the MUMPS developers for their suggestions.
> >>>>
> >>>> 4. Try other direct solvers, e.g., superlu_dist.
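> >>>>
> >>>> For example, a sketch of a single run combining suggestions 1 and 2
> (assuming the '-pc_factor_mat_solver_package' option of the PETSc
> interface, a hypothetical executable ./ex, and a hypothetical core count):
> >>>>
> >>>> mpiexec -n 256 ./ex -ksp_type preonly -pc_type lu \
> >>>>     -pc_factor_mat_solver_package mumps \
> >>>>     -mat_mumps_icntl_7 2 -mat_mumps_icntl_4 3 -ksp_view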
> >>>>
> >>>> ...
> >>>>
> >>>> etc. etc. I can tell the above error has something to do with
> processor 48 (INFO(2)) and so forth, but not the previous one.
> >>>>
> >>>> The full output enabled with -mat_mumps_icntl_4 3 is shown in the
> attached file. Any hints as to what could be causing this error would be
> very much appreciated.
> >>>>
> >>>> I do not know how to interpret this output file; the MUMPS developers
> would give you better suggestions on it.
> >>>> I would appreciate learning as well :-)
> >>>>
> >>>> Hong
> >>>
> >>>
> >>>
> >>
> >
>
>

