[petsc-users] Error using MUMPS to solve large linear system

Samar Khatiwala spk at ldeo.columbia.edu
Tue Feb 25 10:07:52 CST 2014


Hi Barry,

You're probably right. I note that the error occurs almost instantly and I've tried increasing the number of CPUs 
(as many as ~1000 on Yellowstone) to no avail. I know this is a big problem but I didn't think it was that big!

Sherry: Is there any way to write out more diagnostic info? E.g., how much memory superlu_dist thinks it needs or is attempting 
to allocate.

Thanks,

Samar

On Feb 25, 2014, at 10:57 AM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> 
>> 
>> I tried superlu_dist again and it crashes even more quickly than MUMPS with just the following error:
>> 
>> ERROR: 0031-250  task 128: Killed
> 
>   This is usually a symptom of running out of memory.
> 
>> 
>> Absolutely nothing else is written out to either stderr or stdout. This is with -mat_superlu_dist_statprint. 
>> The program works fine on a smaller matrix.
>> 
>> This is the sequence of calls:
>> 
>> KSPSetType(ksp,KSPPREONLY);
>> PCSetType(pc,PCLU);
>> PCFactorSetMatSolverPackage(pc,MATSOLVERSUPERLU_DIST);
>> KSPSetFromOptions(ksp);
>> PCSetFromOptions(pc);
>> KSPSolve(ksp,b,x);
>> 
>> All of these return successfully *except* the very last one, the call to KSPSolve.
>> 
>> Any help would be appreciated. Thanks!
>> 
>> Samar
>> 
>> On Feb 24, 2014, at 3:58 PM, Xiaoye S. Li <xsli at lbl.gov> wrote:
>> 
>>> Samar:
>>> If you include the error message printed when superlu_dist crashes, I can probably tell the reason. (Better yet, include the printout before the crash.)
>>> 
>>> Sherry
>>> 
>>> 
>>> On Mon, Feb 24, 2014 at 9:56 AM, Hong Zhang <hzhang at mcs.anl.gov> wrote:
>>> Samar:
>>> There are limitations for direct solvers. 
>>> Do not expect that any solver can be used on arbitrarily large problems.
>>> Since superlu_dist also crashes, direct solvers may not be able to work on your application. 
>>> This is why I suggest increasing the size incrementally. 
>>> You may have to experiment with other types of solvers.
>>> 
>>> Hong
>>> 
>>> Hi Hong and Jed,
>>> 
>>> Many thanks for replying. It would indeed be nice if the error messages from MUMPS were less cryptic!
>>> 
>>> 1) I have tried smaller matrices, although given how my problem is set up, a jump in size is difficult to avoid. But it is a good idea 
>>> that I will try.
>>> 
>>> 2) I did try various orderings but not the one you suggested.
>>> 
>>> 3) Tracing the error through the MUMPS code suggests a rather abrupt termination of the program (there should be more 
>>> error messages if, for example, memory was a problem). I therefore thought it might be an interface problem rather than 
>>> one with MUMPS and turned to the petsc-users group first. 
>>> 
>>> 4) I've tried superlu_dist but it also crashes (it is also unclear why), at which point I decided to try MUMPS. The fact that both 
>>> crash would again indicate a common (memory?) problem.
>>> 
>>> I'll try a few more things before asking the MUMPS developers.
>>> 
>>> Thanks again for your help!
>>> 
>>> Samar
>>> 
>>> On Feb 24, 2014, at 11:47 AM, Hong Zhang <hzhang at mcs.anl.gov> wrote:
>>> 
>>>> Samar:
>>>> The crash occurs in 
>>>> ...
>>>> [161]PETSC ERROR: Error in external library!
>>>> [161]PETSC ERROR: Error reported by MUMPS in numerical factorization phase: INFO(1)=-1, INFO(2)=48
>>>> 
>>>> For a very large matrix, this is likely a memory problem, as you suspected. 
>>>> I would suggest:
>>>> 1. Run problems of gradually increasing size (do not jump from a small one to a very large one) and observe memory usage using
>>>> '-ksp_view'.
>>>>   I see you use '-mat_mumps_icntl_14 1000', i.e., the percentage increase in the estimated workspace. Is it too large?
>>>>   Anyway, this input should not cause the crash, I guess.
>>>> 2. Experiment with different matrix orderings via -mat_mumps_icntl_7 <> (I usually use sequential ordering 2). 
>>>>    I see you use parallel ordering -mat_mumps_icntl_29 2.
>>>> 3. Send a bug report to the MUMPS developers for their suggestions.
>>>> 
>>>> 4. Try other direct solvers, e.g., superlu_dist.
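To put the '-mat_mumps_icntl_14 1000' setting in perspective: if ICNTL(14) is read as a percentage increase over the estimated workspace, as described above, then 1000 asks for roughly eleven times the estimate. The following is only a sketch of that arithmetic; the helper name and the use of megabytes are illustrative assumptions and are not part of PETSc or MUMPS.

```c
#include <assert.h>

/* Hypothetical helper (not a PETSc or MUMPS function): workspace requested
 * after applying a percentage increase to an estimated workspace size.
 * Assumes ICNTL(14) means "percentage increase of the estimate". */
double relaxed_workspace_mb(double estimate_mb, int icntl14_percent)
{
    /* Both inputs are expected to be non-negative in this sketch. */
    assert(estimate_mb >= 0.0 && icntl14_percent >= 0);
    return estimate_mb * (1.0 + icntl14_percent / 100.0);
}
```

For example, relaxed_workspace_mb(100.0, 1000) gives 1100.0, so an estimate that already strains a node becomes much harder to satisfy with such a large percentage.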
>>>> 
>>>>>>>> 
>>>> etc etc. I can tell that the above error has something to do with processor 48 (INFO(2)) and so forth, but I cannot interpret the previous one.
>>>> 
>>>> The full output enabled with -mat_mumps_icntl_4 3 looks as in the attached file. Any hints as to what could be giving this
>>>> error would be very much appreciated.
>>>> 
>>>> I do not know how to interpret this output file. The MUMPS developers could give you better suggestions on it.
>>>> I would appreciate learning about it as well :-)
>>>> 
>>>> Hong
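The reading of INFO(1)=-1, INFO(2)=48 used in this thread (an error flagged on processor 48) can be written down as a small check. This is a sketch under that assumption only: the function name is hypothetical, other INFO(1) codes are not interpreted, and the MUMPS users' guide remains the reference for what each code means.

```c
#include <assert.h>

/* Hypothetical helper (not a PETSc or MUMPS function): return the process
 * number on which MUMPS flagged the error when INFO(1) == -1, or -1 when
 * INFO(1) has any other value and INFO(2) carries a different meaning. */
int mumps_failing_process(int info1, int info2)
{
    if (info1 == -1) {
        /* In this case INFO(2) is taken to identify the failing process. */
        assert(info2 >= 0);
        return info2;
    }
    return -1;
}
```

With the values from the error message above, mumps_failing_process(-1, 48) returns 48. That process is the one to inspect for memory use, which need not be the process that printed the PETSc error.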
>>> 
>>> 
>>> 
>> 
> 


