[petsc-users] Error using MUMPS to solve large linear system

Xiaoye S. Li xsli at lbl.gov
Tue Feb 25 14:50:26 CST 2014


Very good!  Thanks for the update.
I guess you are using all 16 cores per node?  Since superlu_dist is
currently MPI-only, if you run 16 MPI tasks per node, the serial symbolic
factorization has less than 2 GB of memory to work with.

Sherry


On Tue, Feb 25, 2014 at 12:22 PM, Samar Khatiwala <spk at ldeo.columbia.edu> wrote:

> Hi Sherry,
>
> Thanks! I tried your suggestions and it worked!
>
> For the record I added these flags: -mat_superlu_dist_colperm PARMETIS
> -mat_superlu_dist_parsymbfact 1
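>
> For illustration, appended to the usual run command these looked roughly
> like the following (executable name and task count are only placeholders):
>
>   mpiexec -n 128 ./prog -mat_superlu_dist_colperm PARMETIS \
>     -mat_superlu_dist_parsymbfact 1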
>
> Also, for completeness and since you asked:
>
> size: 2346346 x 2346346
> nnz:  60856894
> unsymmetric
>
> The hardware (http://www2.cisl.ucar.edu/resources/yellowstone/hardware)
> specs are: 2 GB/core, 32 GB/node (27 GB usable), 16 cores per node.
> I've been running on 8 nodes (so 8 x 27 ~ 216 GB).
>
> Thanks again for your help!
>
> Samar
>
> On Feb 25, 2014, at 1:00 PM, "Xiaoye S. Li" <xsli at lbl.gov> wrote:
>
> I didn't follow the discussion thread closely ... How large is your matrix
> dimension, and number of nonzeros?
> How large is the memory per core (or per node)?
>
> The default setting in superlu_dist is to use serial symbolic
> factorization. You can turn on parallel symbolic factorization by:
>
> options.ParSymbFact = YES;
> options.ColPerm = PARMETIS;
>
> Is your matrix symmetric?  If so, you still need to give both the upper and
> lower halves of matrix A to superlu, which doesn't exploit symmetry.
>
> Do you know whether you need numerical pivoting?  If not, you can turn off
> pivoting by:
>
> options.RowPerm = NATURAL;
>
> This avoids some other serial bottleneck.
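>
> In the native superlu_dist interface, a rough sketch of setting all three
> options (assuming the standard options struct and defaults routine from the
> superlu_dist headers of the version at hand) would be:
>
>   superlu_options_t options;
>   set_default_options_dist(&options);  /* start from the defaults */
>   options.ColPerm     = PARMETIS;      /* parallel column ordering */
>   options.ParSymbFact = YES;           /* parallel symbolic factorization */
>   options.RowPerm     = NATURAL;       /* skip numerical (row) pivoting */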
>
> All these options can be turned on in the petsc interface. Please check
> out the syntax there.
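>
> Assuming the runtime option names follow the usual -mat_superlu_dist_*
> pattern (worth confirming against -help output), the equivalents would be
> roughly:
>
>   -mat_superlu_dist_colperm PARMETIS
>   -mat_superlu_dist_parsymbfact
>   -mat_superlu_dist_rowperm NATURAL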
>
>
> Sherry
>
>
>
> On Tue, Feb 25, 2014 at 8:07 AM, Samar Khatiwala <spk at ldeo.columbia.edu> wrote:
>
>> Hi Barry,
>>
>> You're probably right. I note that the error occurs almost instantly and
>> I've tried increasing the number of CPUs
>> (as many as ~1000 on Yellowstone) to no avail. I know this is a big
>> problem but I didn't think it was that big!
>>
>> Sherry: Is there any way to write out more diagnostic info? E.g., how much
>> memory superlu thinks it needs/is attempting
>> to allocate.
>>
>> Thanks,
>>
>> Samar
>>
>> On Feb 25, 2014, at 10:57 AM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>> >
>> >>
>> >> I tried superlu_dist again and it crashes even more quickly than MUMPS
>> with just the following error:
>> >>
>> >> ERROR: 0031-250  task 128: Killed
>> >
>> >   This is usually a symptom of running out of memory.
>> >
>> >>
>> >> Absolutely nothing else is written out to either stderr or stdout.
>> This is with -mat_superlu_dist_statprint.
>> >> The program works fine on a smaller matrix.
>> >>
>> >> This is the sequence of calls:
>> >>
>> >> KSPSetType(ksp,KSPPREONLY);
>> >> PCSetType(pc,PCLU);
>> >> PCFactorSetMatSolverPackage(pc,MATSOLVERSUPERLU_DIST);
>> >> KSPSetFromOptions(ksp);
>> >> PCSetFromOptions(pc);
>> >> KSPSolve(ksp,b,x);
>> >>
>> >> All of these successfully return *except* the very last one to
>> KSPSolve.
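>> >>
>> >> For completeness, the surrounding setup is roughly the standard PETSc
>> >> pattern; this is a sketch rather than my exact code (assume the matrix A
>> >> and vectors b, x are assembled beforehand, with the usual ierr/CHKERRQ
>> >> checks around each call):
>> >>
>> >>   KSP ksp;
>> >>   PC  pc;
>> >>   KSPCreate(PETSC_COMM_WORLD,&ksp);
>> >>   KSPSetOperators(ksp,A,A,SAME_NONZERO_PATTERN); /* 4-arg form in PETSc 3.4 */
>> >>   KSPSetType(ksp,KSPPREONLY);      /* apply the (direct) preconditioner once */
>> >>   KSPGetPC(ksp,&pc);
>> >>   PCSetType(pc,PCLU);              /* full LU factorization */
>> >>   PCFactorSetMatSolverPackage(pc,MATSOLVERSUPERLU_DIST);
>> >>   KSPSetFromOptions(ksp);
>> >>   PCSetFromOptions(pc);
>> >>   KSPSolve(ksp,b,x);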
>> >>
>> >> Any help would be appreciated. Thanks!
>> >>
>> >> Samar
>> >>
>> >> On Feb 24, 2014, at 3:58 PM, Xiaoye S. Li <xsli at lbl.gov> wrote:
>> >>
>> >>> Samar:
>> >>> If you include the error message from the superlu_dist crash, I can
>> probably tell the reason.  (Better yet, include the printout before the
>> crash.)
>> >>>
>> >>> Sherry
>> >>>
>> >>>
>> >>> On Mon, Feb 24, 2014 at 9:56 AM, Hong Zhang <hzhang at mcs.anl.gov>
>> wrote:
>> >>> Samar :
>> >>> There are limitations for direct solvers.
>> >>> Do not expect that any solver can be used on arbitrarily large problems.
>> >>> Since superlu_dist also crashes, direct solvers may not be able to
>> work on your application.
>> >>> This is why I suggested increasing the size incrementally.
>> >>> You may have to experiment with other types of solvers.
>> >>>
>> >>> Hong
>> >>>
>> >>> Hi Hong and Jed,
>> >>>
>> >>> Many thanks for replying. It would indeed be nice if the error
>> messages from MUMPS were less cryptic!
>> >>>
>> >>> 1) I have tried smaller matrices, although given how my problem is set
>> up a jump in size is difficult to avoid. But it's a good idea
>> >>> that I will try.
>> >>>
>> >>> 2) I did try various orderings but not the one you suggested.
>> >>>
>> >>> 3) Tracing the error through the MUMPS code suggests a rather abrupt
>> termination of the program (there should be more
>> >>> error messages if, for example, memory was a problem). I therefore
>> thought it might be an interface problem rather than
>> >>> one with mumps and turned to the petsc-users group first.
>> >>>
>> >>> 4) I've tried superlu_dist but it also crashes (also unclear as to
>> why), at which point I decided to try mumps. The fact that both
>> >>> crash would again indicate a common (memory?) problem.
>> >>>
>> >>> I'll try a few more things before asking the MUMPS developers.
>> >>>
>> >>> Thanks again for your help!
>> >>>
>> >>> Samar
>> >>>
>> >>> On Feb 24, 2014, at 11:47 AM, Hong Zhang <hzhang at mcs.anl.gov> wrote:
>> >>>
>> >>>> Samar:
>> >>>> The crash occurs in
>> >>>> ...
>> >>>> [161]PETSC ERROR: Error in external library!
>> >>>> [161]PETSC ERROR: Error reported by MUMPS in numerical factorization
>> phase: INFO(1)=-1, INFO(2)=48
>> >>>>
>> >>>> For a very large matrix, this is likely a memory problem, as you
>> suspected.
>> >>>> I would suggest
>> >>>> 1. Run problems with increasing sizes (rather than jumping from a small
>> one to a very large one) and observe memory usage using
>> >>>> '-ksp_view'.
>> >>>>   I see you use '-mat_mumps_icntl_14 1000', i.e., the percentage of
>> estimated workspace increase. Is it too large?
>> >>>>   Anyway, this input should not cause the crash, I guess.
>> >>>> 2. Experiment with different matrix orderings via -mat_mumps_icntl_7
>> <> (I usually use sequential ordering 2); see the example command after
>> this list.
>> >>>>    I see you use parallel ordering -mat_mumps_icntl_29 2.
>> >>>> 3. Send a bug report to the MUMPS developers for their suggestions.
>> >>>>
>> >>>> 4. Try other direct solvers, e.g., superlu_dist.
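>> >>>>
>> >>>> For example, an illustrative command line (executable name, process
>> count, and the icntl_14 value are only placeholders to show the syntax):
>> >>>>
>> >>>>   mpiexec -n 512 ./prog -ksp_type preonly -pc_type lu \
>> >>>>     -pc_factor_mat_solver_package mumps \
>> >>>>     -mat_mumps_icntl_7 2 -mat_mumps_icntl_14 50 \
>> >>>>     -mat_mumps_icntl_4 3 -ksp_view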
>> >>>>
>> >>>> ...
>> >>>>
>> >>>> etc. etc. I can tell the above error has something to do with
>> processor 48 (INFO(2)) and so forth, but not the previous one.
>> >>>>
>> >>>> The full output enabled with -mat_mumps_icntl_4 3 looks as in the
>> attached file. Any hints as to what could be giving this
>> >>>> error would be very much appreciated.
>> >>>>
>> >>>> I do not know how to interpret this output file. The MUMPS developers
>> would give you better suggestions on it.
>> >>>> I would appreciate learning as well :-)
>> >>>>
>> >>>> Hong
>> >>>
>> >>>
>> >>>
>> >>
>> >
>>
>>
>
>