[petsc-users] Error using MUMPS to solve large linear system

Samar Khatiwala spk at ldeo.columbia.edu
Tue Feb 25 14:22:05 CST 2014


Hi Sherry,

Thanks! I tried your suggestions and it worked!

For the record I added these flags: -mat_superlu_dist_colperm PARMETIS -mat_superlu_dist_parsymbfact 1 
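That is, the full run line looks roughly like the following (the executable name and process count here are placeholders, not my actual job script):

  mpiexec -n 128 ./myprog \
      -mat_superlu_dist_colperm PARMETIS \
      -mat_superlu_dist_parsymbfact 1 \
      -mat_superlu_dist_statprint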

Also, for completeness and since you asked:

size: 2346346 x 2346346
nnz:  60856894
unsymmetric

The hardware specs (http://www2.cisl.ucar.edu/resources/yellowstone/hardware) are: 2 GB/core, 32 GB/node (27 GB usable), 16 cores per node.
I've been running on 8 nodes (so 8 x 27 ~ 216 GB).
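
As a rough back-of-the-envelope check (my estimate, not a measured number): storing A itself in double precision only costs about

  60.9e6 nonzeros x (8 B value + 4 B column index) ~ 0.7 GB

so nearly all of the memory goes into the fill-in of the L and U factors and the analysis data structures, which is presumably why the parallel symbolic factorization made the difference.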

Thanks again for your help!

Samar

On Feb 25, 2014, at 1:00 PM, "Xiaoye S. Li" <xsli at lbl.gov> wrote:

> I didn't follow the discussion thread closely ... How large are your matrix dimension and number of nonzeros?
> How large is the memory per core (or per node)?  
> 
> The default setting in superlu_dist is to use serial symbolic factorization. You can turn on parallel symbolic factorization by:
> 
> options.ParSymbFact = YES;
> options.ColPerm = PARMETIS;
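> 
> (A minimal sketch of where these settings live when calling the SuperLU_DIST C interface directly, assuming the superlu_options_t struct of the 3.x releases; the struct name may differ in other versions:)
> 
>   superlu_options_t options;
>   set_default_options_dist(&options);  /* start from the library defaults */
>   options.ParSymbFact = YES;           /* parallel symbolic factorization */
>   options.ColPerm = PARMETIS;          /* ParMETIS fill-reducing column ordering */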
> 
> Is your matrix symmetric? If so, you need to give both the upper and lower halves of matrix A to SuperLU, which doesn't exploit symmetry.
> 
> Do you know whether you need numerical pivoting?  If not, you can turn off pivoting by:
> 
> options.RowPerm = NATURAL;
> 
> This avoids another serial bottleneck.
> 
> All of these options can be set through the PETSc interface; please check the syntax there.
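> 
> (Roughly, the corresponding PETSc runtime options are the following; the first two are the flags Samar reports using at the top of this thread, while -mat_superlu_dist_rowperm is my assumption and should be checked against your PETSc version:)
> 
>   -mat_superlu_dist_colperm PARMETIS    ColPerm = PARMETIS
>   -mat_superlu_dist_parsymbfact         ParSymbFact = YES
>   -mat_superlu_dist_rowperm NATURAL     RowPerm = NATURAL (only if no numerical pivoting is needed)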
> 
> 
> Sherry
> 
> 
> 
> On Tue, Feb 25, 2014 at 8:07 AM, Samar Khatiwala <spk at ldeo.columbia.edu> wrote:
> Hi Barry,
> 
> You're probably right. I note that the error occurs almost instantly and I've tried increasing the number of CPUs
> (as many as ~1000 on Yellowstone) to no avail. I know this is a big problem but I didn't think it was that big!
> 
> Sherry: Is there any way to write out more diagnostic info? E.g., how much memory superlu thinks it needs / is attempting
> to allocate.
> 
> Thanks,
> 
> Samar
> 
> On Feb 25, 2014, at 10:57 AM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> >
> >>
> >> I tried superlu_dist again and it crashes even more quickly than MUMPS with just the following error:
> >>
> >> ERROR: 0031-250  task 128: Killed
> >
> >   This is usually a symptom of running out of memory.
> >
> >>
> >> Absolutely nothing else is written out to either stderr or stdout. This is with -mat_superlu_dist_statprint.
> >> The program works fine on a smaller matrix.
> >>
> >> This is the sequence of calls:
> >>
> >> KSPSetType(ksp,KSPPREONLY);                             /* no Krylov iterations: direct solve only */
> >> PCSetType(pc,PCLU);                                     /* full LU factorization */
> >> PCFactorSetMatSolverPackage(pc,MATSOLVERSUPERLU_DIST);  /* factor with SuperLU_DIST */
> >> KSPSetFromOptions(ksp);
> >> PCSetFromOptions(pc);
> >> KSPSolve(ksp,b,x);                                      /* this is the call that fails */
> >>
> >> All of these return successfully *except* the very last one, the call to KSPSolve.
> >>
> >> Any help would be appreciated. Thanks!
> >>
> >> Samar
> >>
> >> On Feb 24, 2014, at 3:58 PM, Xiaoye S. Li <xsli at lbl.gov> wrote:
> >>
> >>> Samar:
> >>> If you include the error message from the superlu_dist crash, I can probably tell the reason. (Better yet, include the printout before the crash.)
> >>>
> >>> Sherry
> >>>
> >>>
> >>> On Mon, Feb 24, 2014 at 9:56 AM, Hong Zhang <hzhang at mcs.anl.gov> wrote:
> >>> Samar :
> >>> There are limitations for direct solvers.
> >>> Do not expect that any solver can be used on arbitrarily large problems.
> >>> Since superlu_dist also crashes, direct solvers may not be able to work on your application.
> >>> This is why I suggested increasing the size incrementally.
> >>> You may have to experiment with other types of solvers.
> >>>
> >>> Hong
> >>>
> >>> Hi Hong and Jed,
> >>>
> >>> Many thanks for replying. It would indeed be nice if the error messages from MUMPS were less cryptic!
> >>>
> >>> 1) I have tried smaller matrices, although given how my problem is set up, a jump in size is difficult to avoid. But it's a good idea
> >>> that I will try.
> >>>
> >>> 2) I did try various orderings, but not the one you suggested.
> >>>
> >>> 3) Tracing the error through the MUMPS code suggests a rather abrupt termination of the program (there should be more
> >>> error messages if, for example, memory were a problem). I therefore thought it might be an interface problem rather than
> >>> one with MUMPS and turned to the petsc-users group first.
> >>>
> >>> 4) I've tried superlu_dist but it also crashes (also unclear as to why), at which point I decided to try MUMPS. The fact that both
> >>> crash would again indicate a common (memory?) problem.
> >>>
> >>> I'll try a few more things before asking the MUMPS developers.
> >>>
> >>> Thanks again for your help!
> >>>
> >>> Samar
> >>>
> >>> On Feb 24, 2014, at 11:47 AM, Hong Zhang <hzhang at mcs.anl.gov> wrote:
> >>>
> >>>> Samar:
> >>>> The crash occurs in
> >>>> ...
> >>>> [161]PETSC ERROR: Error in external library!
> >>>> [161]PETSC ERROR: Error reported by MUMPS in numerical factorization phase: INFO(1)=-1, INFO(2)=48
> >>>>
> >>>> For a very large matrix, this is likely a memory problem, as you suspected.
> >>>> I would suggest:
> >>>> 1. Run problems with increasing sizes (do not jump from a small one to a very large one) and observe memory usage using
> >>>> '-ksp_view'.
> >>>>   I see you use '-mat_mumps_icntl_14 1000', i.e., a 1000% increase of the estimated workspace. Is that too large?
> >>>>   Anyway, this input should not cause the crash, I guess.
> >>>> 2. Experiment with different matrix orderings via -mat_mumps_icntl_7 <> (I usually use sequential ordering 2); see the option summary after this list.
> >>>>    I see you use parallel ordering -mat_mumps_icntl_29 2.
> >>>> 3. Send a bug report to the MUMPS developers for their suggestions.
> >>>>
> >>>> 4. Try other direct solvers, e.g., superlu_dist.
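> >>>>
> >>>> (A quick summary of the runtime options mentioned in this thread, with the ICNTL meanings as I recall them from the MUMPS user guide; the example values are only illustrative, please verify against the guide:)
> >>>>
> >>>>   -mat_mumps_icntl_4 3      verbosity of the MUMPS output
> >>>>   -mat_mumps_icntl_7 2      sequential ordering (2 = AMF)
> >>>>   -mat_mumps_icntl_14 30    percentage increase of the estimated workspace
> >>>>   -mat_mumps_icntl_29 2     ordering tool used by the parallel analysis
> >>>>   -ksp_view                 report the solver setup, including memory information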
> >>>>
> >>>> …
> >>>>
> >>>> etc. etc. The above error, I can tell, has something to do with processor 48 (INFO(2)) and so forth, but not the previous one.
> >>>>
> >>>> The full output enabled with -mat_mumps_icntl_4 3 looks as in the attached file. Any hints as to what could be giving this
> >>>> error would be very much appreciated.
> >>>>
> >>>> I do not know how to interpret this output file. The MUMPS developers would give you better suggestions on it.
> >>>> I would appreciate learning as well :-)
> >>>>
> >>>> Hong
> >>>
> >>>
> >>>
> >>
> >
> 
> 
