<div dir="ltr"><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">I didn't follow the discussion thread closely ... How large is your matrix (dimension and number of nonzeros)?<br>How much memory is available per core (or per node)?<br>
<br>The default setting in superlu_dist is to use serial symbolic factorization. You can turn on parallel symbolic factorization by:<br><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">options.ParSymbFact = YES;<br>
options.ColPerm = PARMETIS;<br><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Is your matrix symmetric? If so, you need to give both the upper and lower halves of matrix A to superlu_dist, which doesn't exploit symmetry.<br>
<br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Do you know whether you need numerical pivoting? If not, you can turn off pivoting by:<br><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">
options.RowPerm = NATURAL;<br><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">This avoids another serial bottleneck.<br><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">
All these options can be set through the PETSc interface. Please check the syntax there.<br><br><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Sherry<br><br></div></div><div class="gmail_extra">
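<div class="gmail_default" style="font-family:arial,helvetica,sans-serif">As a rough sketch of what the PETSc syntax might look like, the superlu_dist settings mentioned in the reply could be collected in an options file such as the one below. The option names are assumed from PETSc's SuperLU_DIST interface of that era and the file name is hypothetical; running the program with -help should list the names your PETSc build actually accepts.<br><br></div>

```
# superlu_dist.opts (hypothetical file name); pass it with: -options_file superlu_dist.opts
# Direct solve through superlu_dist, matching the KSPPREONLY / PCLU calls in the thread.
-ksp_type preonly
-pc_type lu
-pc_factor_mat_solver_package superlu_dist
# Parallel symbolic factorization (options.ParSymbFact = YES),
# paired with the ParMETIS column ordering (options.ColPerm = PARMETIS).
-mat_superlu_dist_parsymbfact
-mat_superlu_dist_colperm PARMETIS
# No numerical pivoting (options.RowPerm = NATURAL); only if the matrix does not need it.
-mat_superlu_dist_rowperm NATURAL
# Print superlu_dist statistics, as already used in the thread.
-mat_superlu_dist_statprint
```

<div class="gmail_default" style="font-family:arial,helvetica,sans-serif">The same options can also be given directly on the command line; they only take effect because the code in the thread calls KSPSetFromOptions.<br><br></div>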
<br><br><div class="gmail_quote">On Tue, Feb 25, 2014 at 8:07 AM, Samar Khatiwala <span dir="ltr"><<a href="mailto:spk@ldeo.columbia.edu" target="_blank">spk@ldeo.columbia.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Hi Barry,<br>
<br>
You're probably right. I note that the error occurs almost instantly and I've tried increasing the number of CPUs<br>
(as many as ~1000 on Yellowstone) to no avail. I know this is a big problem but I didn't think it was that big!<br>
<br>
Sherry: Is there any way to write out more diagnostic info? E.g., how much memory superlu thinks it needs/is attempting<br>
to allocate.<br>
<br>
Thanks,<br>
<br>
Samar<br>
<div class="HOEnZb"><div class="h5"><br>
On Feb 25, 2014, at 10:57 AM, Barry Smith <<a href="mailto:bsmith@mcs.anl.gov">bsmith@mcs.anl.gov</a>> wrote:<br>
><br>
>><br>
>> I tried superlu_dist again and it crashes even more quickly than MUMPS with just the following error:<br>
>><br>
>> ERROR: 0031-250 task 128: Killed<br>
><br>
> This is usually a symptom of running out of memory.<br>
><br>
>><br>
>> Absolutely nothing else is written out to either stderr or stdout. This is with -mat_superlu_dist_statprint.<br>
>> The program works fine on a smaller matrix.<br>
>><br>
>> This is the sequence of calls:<br>
>><br>
>> KSPSetType(ksp,KSPPREONLY);<br>
>> PCSetType(pc,PCLU);<br>
>> PCFactorSetMatSolverPackage(pc,MATSOLVERSUPERLU_DIST);<br>
>> KSPSetFromOptions(ksp);<br>
>> PCSetFromOptions(pc);<br>
>> KSPSolve(ksp,b,x);<br>
>><br>
>> All of these successfully return *except* the very last one to KSPSolve.<br>
>><br>
>> Any help would be appreciated. Thanks!<br>
>><br>
>> Samar<br>
>><br>
>> On Feb 24, 2014, at 3:58 PM, Xiaoye S. Li <<a href="mailto:xsli@lbl.gov">xsli@lbl.gov</a>> wrote:<br>
>><br>
>>> Samar:<br>
>>> If you include the error message from the superlu_dist crash, I can probably tell you the reason. (Better yet, include the printout before the crash.)<br>
>>><br>
>>> Sherry<br>
>>><br>
>>><br>
>>> On Mon, Feb 24, 2014 at 9:56 AM, Hong Zhang <<a href="mailto:hzhang@mcs.anl.gov">hzhang@mcs.anl.gov</a>> wrote:<br>
>>> Samar:<br>
>>> There are limitations for direct solvers.<br>
>>> Do not expect that any solver can be used on arbitrarily large problems.<br>
>>> Since superlu_dist also crashes, direct solvers may not be able to work on your application.<br>
>>> This is why I suggest increasing the size incrementally.<br>
>>> You may have to experiment with other types of solvers.<br>
>>><br>
>>> Hong<br>
>>><br>
>>> Hi Hong and Jed,<br>
>>><br>
>>> Many thanks for replying. It would indeed be nice if the error messages from MUMPS were less cryptic!<br>
>>><br>
>>> 1) I have tried smaller matrices although given how my problem is set up a jump is difficult to avoid. But a good idea<br>
>>> that I will try.<br>
>>><br>
>>> 2) I did try various orderings but not the one you suggested.<br>
>>><br>
>>> 3) Tracing the error through the MUMPS code suggests a rather abrupt termination of the program (there should be more<br>
>>> error messages if, for example, memory was a problem). I therefore thought it might be an interface problem rather than<br>
>>> one with mumps and turned to the petsc-users group first.<br>
>>><br>
>>> 4) I've tried superlu_dist but it also crashes (also unclear as to why) at which point I decided to try mumps. The fact that both<br>
>>> crash would again indicate a common (memory?) problem.<br>
>>><br>
>>> I'll try a few more things before asking the MUMPS developers.<br>
>>><br>
>>> Thanks again for your help!<br>
>>><br>
>>> Samar<br>
>>><br>
>>> On Feb 24, 2014, at 11:47 AM, Hong Zhang <<a href="mailto:hzhang@mcs.anl.gov">hzhang@mcs.anl.gov</a>> wrote:<br>
>>><br>
>>>> Samar:<br>
>>>> The crash occurs in<br>
>>>> ...<br>
>>>> [161]PETSC ERROR: Error in external library!<br>
>>>> [161]PETSC ERROR: Error reported by MUMPS in numerical factorization phase: INFO(1)=-1, INFO(2)=48<br>
>>>><br>
>>>> For a very large matrix, this is likely a memory problem, as you suspected.<br>
>>>> I would suggest<br>
>>>> 1. run problems of gradually increasing size (rather than jumping from a small one to a very large one) and observe memory usage using<br>
>>>> '-ksp_view'.<br>
>>>> I see you use '-mat_mumps_icntl_14 1000', i.e., percentage of estimated workspace increase. Is it too large?<br>
>>>> Anyway, this input should not cause the crash, I guess.<br>
>>>> 2. experiment with different matrix orderings -mat_mumps_icntl_7 <> (I usually use sequential ordering 2)<br>
>>>> I see you use parallel ordering -mat_mumps_icntl_29 2.<br>
>>>> 3. send a bug report to the MUMPS developers for their suggestions.<br>
>>>><br>
>>>> 4. try other direct solvers, e.g., superlu_dist.<br>
>>>><br>
>>>> …<br>
>>>><br>
>>>> etc etc. The above error I can tell has something to do with processor 48 (INFO(2)) and so forth but not the previous one.<br>
>>>><br>
>>>> The full output enabled with -mat_mumps_icntl_4 3 looks as in the attached file. Any hints as to what could be giving this<br>
>>>> error would be very much appreciated.<br>
>>>><br>
>>>> I do not know how to interpret this output file. The MUMPS developers could give you better suggestions on it.<br>
>>>> I would appreciate learning from it as well :-)<br>
>>>><br>
>>>> Hong<br>
>>><br>
>>><br>
>>><br>
>><br>
><br>
<br>
</div></div></blockquote></div><br></div>