[petsc-users] Error using MUMPS to solve large linear system

Matt Landreman matt.landreman at gmail.com
Sun Mar 2 16:23:46 CST 2014


Hi,

I'm having some problems with my PETSc application similar to the ones
discussed in this thread, so perhaps one of you can help. In my application
I factorize a preconditioner matrix with MUMPS or superlu_dist, and use this
factorized preconditioner to accelerate GMRES on a matrix that is denser
than the preconditioner. I've been running on Edison at NERSC. My program
works reliably for problem sizes below about 1 million x 1 million, but
above this size the factorization step fails in one of many possible ways,
depending on the compiler, # of nodes, # of procs/node, etc.:

When I use superlu_dist, I get one of two failure modes:
(1) the first step of KSP returns "0 KSP residual norm -nan" and KSP then
returns KSPConvergedReason = -9 (the sketch just after this list shows how
I read that value), or
(2) the factorization completes, but GMRES then converges excruciatingly
slowly or not at all, even if I choose the "real" matrix to be identical to
the preconditioner matrix, so KSP ought to converge in one step (which it
does for smaller matrices).
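
For reference, here is roughly how I read the convergence information after
each solve. This is only a sketch rather than a verbatim excerpt from my code,
and the helper name is made up; -9 should correspond to KSP_DIVERGED_NANORINF
if I am reading the headers correctly:

#include <petscksp.h>

/* Sketch only: solve and report why the iteration stopped.
   Assumes ksp, b, and x have already been set up. */
PetscErrorCode ReportSolveOutcome(KSP ksp, Vec b, Vec x)
{
  KSPConvergedReason reason;
  PetscInt           its;
  PetscErrorCode     ierr;

  PetscFunctionBegin;
  ierr = KSPSolve(ksp,b,x);CHKERRQ(ierr);
  ierr = KSPGetConvergedReason(ksp,&reason);CHKERRQ(ierr);
  ierr = KSPGetIterationNumber(ksp,&its);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_WORLD,"KSPConvergedReason = %d after %D iterations\n",
                     (int)reason,its);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}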

For MUMPS, the factorization can fail in many different ways:
(3) With the Intel compiler I usually get "Caught signal number 11 SEGV:
Segmentation Violation".
(4) Sometimes with the Intel compiler I get "Caught signal number 7 BUS:
Bus Error".
(5) With the GNU compiler I often get a bunch of lines like "problem with
NIV2_FLOPS message  -5.9604644775390625E-008           0
 -227464733.99999997"
(6) Other times with GNU I get a MUMPS error with INFO(1)=-9 or
INFO(1)=-17. The MUMPS documentation suggests I should increase ICNTL(14),
but what is an appropriate value? 50? 10000? (One way of setting it is
sketched right after this list.)
(7) With the Cray compiler I consistently get this cryptic error:
Fatal error in PMPI_Test: Invalid MPI_Request, error stack:
PMPI_Test(166): MPI_Test(request=0xb228dbf3c, flag=0x7ffffffe097c,
status=0x7ffffffe0a00) failed
PMPI_Test(121): Invalid MPI_Request
_pmiu_daemon(SIGCHLD): [NID 02784] [c6-1c1s8n0] [Sun Mar  2 10:35:20 2014]
PE RANK 0 exit signal Aborted
[NID 02784] 2014-03-02 10:35:20 Apid 3374579: initiated application
termination
Application 3374579 exit codes: 134
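
Regarding (6): below is a minimal sketch of how I could set ICNTL(14) and the
MUMPS print level from the code rather than on the command line (the runtime
equivalents would be -mat_mumps_icntl_14 100 and -mat_mumps_icntl_4 2). This
is not what my application currently does; the helper name and the value 100
are only illustrative, and it assumes MatMumpsSetIcntl() is available in my
PETSc build:

#include <petscksp.h>

/* Sketch only: assumes ksp already has its operators set via KSPSetOperators(). */
PetscErrorCode SetMumpsControls(KSP ksp)
{
  PC             pc;
  Mat            F;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = KSPGetPC(ksp,&pc);CHKERRQ(ierr);
  ierr = PCSetType(pc,PCLU);CHKERRQ(ierr);
  ierr = PCFactorSetMatSolverPackage(pc,MATSOLVERMUMPS);CHKERRQ(ierr);
  ierr = PCFactorSetUpMatSolverPackage(pc);CHKERRQ(ierr); /* create the MUMPS factor matrix */
  ierr = PCFactorGetMatrix(pc,&F);CHKERRQ(ierr);
  ierr = MatMumpsSetIcntl(F,14,100);CHKERRQ(ierr); /* ICNTL(14): % workspace increase over the estimate */
  ierr = MatMumpsSetIcntl(F,4,2);CHKERRQ(ierr);    /* ICNTL(4): more verbose MUMPS diagnostics */
  PetscFunctionReturn(0);
}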

For linear systems smaller than around 1 million x 1 million, my application
is very robust: it works consistently with both MUMPS and superlu_dist, for
a wide range of # of nodes and # of procs/node, and with all 3 compilers
available on Edison (Intel, GNU, Cray).

By the way, MUMPS failed for much smaller problems until I tried
-mat_mumps_icntl_7 2 (inspired by your conversation last week). I tried all
the other options for ICNTL(7), ICNTL(28), and ICNTL(29), and found
ICNTL(7)=2 works best by far. I tried the flags that worked for Samar
(-mat_superlu_dist_colperm PARMETIS -mat_superlu_dist_parsymbfact 1) with
superlu_dist, but they did not appear to change anything in my case.
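
In case it is relevant, here is a sketch of how I could pre-load those
superlu_dist flags into the options database before KSPSetFromOptions() is
called, so they apply even when I forget them on the command line. Again,
this is only a sketch: the helper name is made up, and it assumes the
two-argument PetscOptionsSetValue() of the PETSc release I am using:

#include <petscksp.h>

/* Sketch only: insert the superlu_dist flags quoted above into the options
   database; must be called before KSPSetFromOptions()/PCSetFromOptions(). */
PetscErrorCode PreloadSuperluDistOptions(void)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = PetscOptionsSetValue("-mat_superlu_dist_colperm","PARMETIS");CHKERRQ(ierr);
  ierr = PetscOptionsSetValue("-mat_superlu_dist_parsymbfact","1");CHKERRQ(ierr);
  ierr = PetscOptionsSetValue("-mat_superlu_dist_statprint",NULL);CHKERRQ(ierr); /* print factorization stats */
  PetscFunctionReturn(0);
}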

Can you recommend any other parameters of PETSc, superlu_dist, or MUMPS
that I should try changing? I don't care in the end whether I use
superlu_dist or MUMPS.

Thanks!

Matt Landreman


On Tue, Feb 25, 2014 at 3:50 PM, Xiaoye S. Li <xsli at lbl.gov> wrote:

> Very good!  Thanks for the update.
> I guess you are using all 16 cores per node?  Since superlu_dist currently
> is MPI-only, if you generate 16 MPI tasks, the serial symbolic factorization
> has less than 2 GB of memory to work with.
>
> Sherry
>
>
> On Tue, Feb 25, 2014 at 12:22 PM, Samar Khatiwala <spk at ldeo.columbia.edu> wrote:
>
>> Hi Sherry,
>>
>> Thanks! I tried your suggestions and it worked!
>>
>> For the record I added these flags: -mat_superlu_dist_colperm PARMETIS
>> -mat_superlu_dist_parsymbfact 1
>>
>> Also, for completeness and since you asked:
>>
>> size: 2346346 x 2346346
>> nnz:  60856894
>> unsymmetric
>>
>> The hardware (http://www2.cisl.ucar.edu/resources/yellowstone/hardware)
>> specs are: 2 GB/core, 32 GB/node (27 GB usable), 16 cores per node.
>> I've been running on 8 nodes (so 8 x 27 ~ 216 GB).
>>
>> Thanks again for your help!
>>
>> Samar
>>
>> On Feb 25, 2014, at 1:00 PM, "Xiaoye S. Li" <xsli at lbl.gov> wrote:
>>
>> I didn't follow the discussion thread closely ... How large is your
>> matrix dimension, and number of nonzeros?
>> How large is the memory per core (or per node)?
>>
>> The default setting in superlu_dist is to use serial symbolic
>> factorization. You can turn on parallel symbolic factorization by:
>>
>> options.ParSymbFact = YES;
>> options.ColPerm = PARMETIS;
>>
>> Is your matrix symmetric? If so, you need to give both the upper and lower
>> halves of matrix A to superlu, which doesn't exploit symmetry.
>>
>> Do you know whether you need numerical pivoting?  If not, you can turn
>> off pivoting by:
>>
>> options.RowPerm = NATURAL;
>>
>> This avoids some other serial bottleneck.
>>
>> All these options can be turned on in the petsc interface. Please check
>> out the syntax there.
>>
>>
>> Sherry
>>
>>
>>
>> On Tue, Feb 25, 2014 at 8:07 AM, Samar Khatiwala <spk at ldeo.columbia.edu> wrote:
>>
>>> Hi Barry,
>>>
>>> You're probably right. I note that the error occurs almost instantly and
>>> I've tried increasing the number of CPUs
>>> (as many as ~1000 on Yellowstone) to no avail. I know this is a big
>>> problem but I didn't think it was that big!
>>>
>>> Sherry: Is there any way to write out more diagnostic info? E.g., how
>>> much memory superlu thinks it needs/is attempting to allocate.
>>>
>>> Thanks,
>>>
>>> Samar
>>>
>>> On Feb 25, 2014, at 10:57 AM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>> >
>>> >>
>>> >> I tried superlu_dist again and it crashes even more quickly than
>>> MUMPS with just the following error:
>>> >>
>>> >> ERROR: 0031-250  task 128: Killed
>>> >
>>> >   This is usually a symptom of running out of memory.
>>> >
>>> >>
>>> >> Absolutely nothing else is written out to either stderr or stdout.
>>> This is with -mat_superlu_dist_statprint.
>>> >> The program works fine on a smaller matrix.
>>> >>
>>> >> This is the sequence of calls:
>>> >>
>>> >> KSPSetType(ksp,KSPPREONLY);
>>> >> PCSetType(pc,PCLU);
>>> >> PCFactorSetMatSolverPackage(pc,MATSOLVERSUPERLU_DIST);
>>> >> KSPSetFromOptions(ksp);
>>> >> PCSetFromOptions(pc);
>>> >> KSPSolve(ksp,b,x);
>>> >>
>>> >> All of these successfully return *except* the very last one, the call
>>> to KSPSolve.
>>> >>
>>> >> Any help would be appreciated. Thanks!
>>> >>
>>> >> Samar
>>> >>
>>> >> On Feb 24, 2014, at 3:58 PM, Xiaoye S. Li <xsli at lbl.gov> wrote:
>>> >>
>>> >>> Samar:
>>> >>> If you include the error message while crashing using superlu_dist,
>>> I probably know the reason. (Better yet, include the printout before the
>>> crash.)
>>> >>>
>>> >>> Sherry
>>> >>>
>>> >>>
>>> >>> On Mon, Feb 24, 2014 at 9:56 AM, Hong Zhang <hzhang at mcs.anl.gov>
>>> wrote:
>>> >>> Samar:
>>> >>> There are limitations for direct solvers.
>>> >>> Do not expect that any solver can be used on arbitrarily large problems.
>>> >>> Since superlu_dist also crashes, direct solvers may not be able to
>>> work on your application.
>>> >>> This is why I suggested increasing the size incrementally.
>>> >>> You may have to experiment with other types of solvers.
>>> >>>
>>> >>> Hong
>>> >>>
>>> >>> Hi Hong and Jed,
>>> >>>
>>> >>> Many thanks for replying. It would indeed be nice if the error
>>> messages from MUMPS were less cryptic!
>>> >>>
>>> >>> 1) I have tried smaller matrices, although given how my problem is
>>> set up, a jump in size is difficult to avoid. But it is a good idea
>>> >>> that I will try.
>>> >>>
>>> >>> 2) I did try various orderings, but not the one you suggested.
>>> >>>
>>> >>> 3) Tracing the error through the MUMPS code suggests a rather abrupt
>>> termination of the program (there should be more
>>> >>> error messages if, for example, memory was a problem). I therefore
>>> thought it might be an interface problem rather than
>>> >>> one with mumps and turned to the petsc-users group first.
>>> >>>
>>> >>> 4) I've tried superlu_dist but it also crashes (also unclear as to
>>> why) at which point I decided to try mumps. The fact that both
>>> >>> crash would again indicate a common (memory?) problem.
>>> >>>
>>> >>> I'll try a few more things before asking the MUMPS developers.
>>> >>>
>>> >>> Thanks again for your help!
>>> >>>
>>> >>> Samar
>>> >>>
>>> >>> On Feb 24, 2014, at 11:47 AM, Hong Zhang <hzhang at mcs.anl.gov> wrote:
>>> >>>
>>> >>>> Samar:
>>> >>>> The crash occurs in
>>> >>>> ...
>>> >>>> [161]PETSC ERROR: Error in external library!
>>> >>>> [161]PETSC ERROR: Error reported by MUMPS in numerical
>>> factorization phase: INFO(1)=-1, INFO(2)=48
>>> >>>>
>>> >>>> for a very large matrix, likely a memory problem as you suspected.
>>> >>>> I would suggest:
>>> >>>> 1. Run problems with increasing sizes (do not jump from a small one to
>>> a very large one) and observe memory usage using
>>> >>>> '-ksp_view'.
>>> >>>>   I see you use '-mat_mumps_icntl_14 1000', i.e., the percentage of
>>> estimated workspace increase. Is it too large?
>>> >>>>   Anyway, this input should not cause the crash, I guess.
>>> >>>> 2. Experiment with different matrix orderings via -mat_mumps_icntl_7
>>> <> (I usually use sequential ordering 2).
>>> >>>>    I see you use parallel ordering -mat_mumps_icntl_29 2.
>>> >>>> 3. Send a bug report to the MUMPS developers for their suggestions.
>>> >>>>
>>> >>>> 4. Try other direct solvers, e.g., superlu_dist.
>>> >>>>
>>> >>>> ...
>>> >>>>
>>> >>>> etc etc. The above error I can tell has something to do with
>>> processor 48 (INFO(2)) and so forth but not the previous one.
>>> >>>>
>>> >>>> The full output enabled with -mat_mumps_icntl_4 3 looks as in the
>>> attached file. Any hints as to what could be giving this
>>> >>>> error would be very much appreciated.
>>> >>>>
>>> >>>> I do not know how to interpret this output file. The MUMPS developers
>>> would give you a better suggestion on it.
>>> >>>> I would appreciate learning as well :-)
>>> >>>>
>>> >>>> Hong
>>> >>>
>>> >>>
>>> >>>
>>> >>
>>> >
>>>
>>>
>>
>>
>