[petsc-users] Can't expand MemType 1: jcol 16104

Wed Jul 8 12:34:10 CDT 2015

Hi,

I have used the switch -mat_superlu_dist_parsymbfact in my pbs script.
However, although my program worked fine with sequential symbolic
factorization, I get one of the following 2 behaviors when I run with
parallel symbolic factorization (depending on the number of processors that
I use):

1) the program just hangs (it seems stuck in some subroutine ==> see
test.out-hangs)
2) I get a floating point exception ==> see
test.out-floating-point-exception

Note that as suggested in the Superlu manual, I use a power of 2 number of
procs. Are there any tunable parameters for the parallel symbolic
factorization? Note that when I build my sparse matrix, most elements I add
are nonzero of course but to simplify the programming, I also add a few
zero elements in the sparse matrix. I was thinking that maybe if the
parallel symbolic factorization proceed by block, there could be some
blocks where the pivot would be zero, hence creating the FPE??

Thanks,

Anthony

On Wed, Jul 8, 2015 at 6:46 AM, Xiaoye S. Li <xsli at lbl.gov> wrote:

> Did you find out how to change option to use parallel symbolic
> factorization?  Perhaps PETSc team can help.
>
> Sherry
>
>
> On Tue, Jul 7, 2015 at 3:58 PM, Xiaoye S. Li <xsli at lbl.gov> wrote:
>
>> Is there an inquiry function that tells you all the available options?
>>
>> Sherry
>>
>> On Tue, Jul 7, 2015 at 3:25 PM, Anthony Paul Haas <aph at email.arizona.edu>
>> wrote:
>>
>>> Hi Sherry,
>>>
>>> Thanks for your message. I have used superlu_dist default options. I did
>>> not realize that I was doing serial symbolic factorization. That is
>>> probably the cause of my problem.
>>> Each node on Garnet has 60GB usable memory and I can run with 1,2,4,8,16
>>> or 32 core per node.
>>>
>>> So I should use:
>>>
>>> -mat_superlu_dist_r 20
>>> -mat_superlu_dist_c 32
>>>
>>> How do you specify the parallel symbolic factorization option? is it
>>> -mat_superlu_dist_matinput 1
>>>
>>> Thanks,
>>>
>>> Anthony
>>>
>>>
>>> On Tue, Jul 7, 2015 at 3:08 PM, Xiaoye S. Li <xsli at lbl.gov> wrote:
>>>
>>>> For superlu_dist failure, this occurs during symbolic factorization.
>>>> Since you are using serial symbolic factorization, it requires the entire
>>>> graph of A to be available in the memory of one MPI task. How much memory
>>>> do you have for each MPI task?
>>>>
>>>> It won't help even if you use more processes.  You should try to use
>>>> parallel symbolic factorization option.
>>>>
>>>> Another point.  You set up process grid as:
>>>>        Process grid nprow 32 x npcol 20
>>>> For better performance, you show swap the grid dimension. That is, it's
>>>> better to use 20 x 32, never gives nprow larger than npcol.
>>>>
>>>>
>>>> Sherry
>>>>
>>>>
>>>> On Tue, Jul 7, 2015 at 1:27 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>
>>>>>
>>>>>    I would suggest running a sequence of problems, 101 by 101 111 by
>>>>> 111 etc and get the memory usage in each case (when you run out of memory
>>>>> you can get NO useful information out about memory needs). You can then
>>>>> plot memory usage as a function of problem size to get a handle on how much
>>>>> memory it is using.  You can also run on more and more processes (which
>>>>> have a total of more memory) to see how large a problem you may be able to
>>>>> reach.
>>>>>
>>>>>    MUMPS also has an "out of core" version (which we have never used)
>>>>> that could in theory anyways let you get to large problems if you have lots
>>>>> of disk space, but you are on your own figuring out how to use it.
>>>>>
>>>>>   Barry
>>>>>
>>>>> > On Jul 7, 2015, at 2:37 PM, Anthony Paul Haas <aph at email.arizona.edu>
>>>>> wrote:
>>>>> >
>>>>> > Hi Jose,
>>>>> >
>>>>> > In my code, I use once PETSc to solve a linear system to get the
>>>>> baseflow (without using SLEPc) and then I use SLEPc to do the stability
>>>>> analysis of that baseflow. This is why, there are some SLEPc options that
>>>>> are not used in test.out-superlu_dist-151x151 (when I am solving for the
>>>>> baseflow with PETSc only). I have attached a 101x101 case for which I get
>>>>> the eigenvalues. That case works fine. However If i increase to 151x151, I
>>>>> get the error that you can see in test.out-superlu_dist-151x151 (similar
>>>>> error with mumps: see test.out-mumps-151x151 line 2918 ). If you look a the
>>>>> very end of the files test.out-superlu_dist-151x151 and
>>>>> test.out-mumps-151x151, you will see that the last info message printed is:
>>>>> >
>>>>> > On Processor (after EPSSetFromOptions)  0    memory:
>>>>> 0.65073152000E+08          =====>  (see line 807 of module_petsc.F90)
>>>>> >
>>>>> > This means that the memory error probably occurs in the call to
>>>>> EPSSolve (see module_petsc.F90 line 810). I would like to evaluate how much
>>>>> memory is required by the most memory intensive operation within EPSSolve.
>>>>> Since I am solving a generalized EVP, I would imagine that it would be the
>>>>> LU decomposition. But is there an accurate way of doing it?
>>>>> >
>>>>> > Before starting with iterative solvers, I would like to exploit as
>>>>> much as I can direct solvers. I tried GMRES with default preconditioner at
>>>>> some point but I had convergence problem. What solver/preconditioner would
>>>>> you recommend for a generalized non-Hermitian (EPS_GNHEP) EVP?
>>>>> >
>>>>> > Thanks,
>>>>> >
>>>>> > Anthony
>>>>> >
>>>>> > On Tue, Jul 7, 2015 at 12:17 AM, Jose E. Roman <jroman at dsic.upv.es>
>>>>> wrote:
>>>>> >
>>>>> > El 07/07/2015, a las 02:33, Anthony Haas escribió:
>>>>> >
>>>>> > > Hi,
>>>>> > >
>>>>> > > I am computing eigenvalues using PETSc/SLEPc and superlu_dist for
>>>>> the LU decomposition (my problem is a generalized eigenvalue problem). The
>>>>> code runs fine for a grid with 101x101 but when I increase to 151x151, I
>>>>> get the following error:
>>>>> > >
>>>>> > > Can't expand MemType 1: jcol 16104   (and then [NID 00037]
>>>>> 2015-07-06 19:19:17 Apid 31025976: OOM killer terminated this process.)
>>>>> > >
>>>>> > > It seems to be a memory problem. I monitor the memory usage as far
>>>>> as I can and it seems that memory usage is pretty low. The most memory
>>>>> intensive part of the program is probably the LU decomposition in the
>>>>> context of the generalized EVP. Is there a way to evaluate how much memory
>>>>> will be required for that step? I am currently running the debug version of
>>>>> the code which I would assume would use more memory?
>>>>> > >
>>>>> > > I have attached the output of the job. Note that the program uses
>>>>> twice PETSc: 1) to solve a linear system for which no problem occurs, and,
>>>>> 2) to solve the Generalized EVP with SLEPc, where I get the error.
>>>>> > >
>>>>> > > Thanks
>>>>> > >
>>>>> > > Anthony
>>>>> > > <test.out-superlu_dist-151x151>
>>>>> >
>>>>> > In the output you are attaching there are no SLEPc objects in the
>>>>> report and SLEPc options are not used. It seems that SLEPc calls are
>>>>> skipped?
>>>>> >
>>>>> > Do you get the same error with MUMPS? Have you tried to solve linear
>>>>> systems with a preconditioned iterative solver?
>>>>> >
>>>>> > Jose
>>>>> >
>>>>> >
>>>>> >
>>>>> <module_petsc.F90><test.out-mumps-151x151><test.out_superlu_dist-101x101><test.out-superlu_dist-151x151>
>>>>>
>>>>>
>>>>
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20150708/2709f809/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.out-hangs
Type: application/octet-stream
Size: 660 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20150708/2709f809/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.out-floating-point-exception
Type: application/octet-stream
Size: 9713 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20150708/2709f809/attachment-0001.obj>