FW: [PETSC #18391] PETSc crash with memory allocation in ILU preconditioning
On Fri, Oct 10, 2008 at 6:29 PM, Deng, Ying wrote:
> Hi,
> I am seeing problems when trying to build petsc-dev code. My configure
> line is below, same as I successfully did for 2.3.2-p10. I tried with
> mkl 9 and mkl 10. Same errors. There are references to undefined
> symbols. Please share with me if you have any experience with the issue
> or suggestions to resolve it.
1) Please always send configure.log. The screen output does not tell us
enough to debug problems.
2) Specifying libraries directly is usually not a good idea, since some packages,
like MKL, tend to depend on other libraries (like libguide, libpthread). I would
use --with-blas-lapack-dir=$MKLDIR
3) Mail about install problems should go to petsc-maint at mcs.anl.gov. petsc-dev
is for discussion of development.
> Thanks,
> Ying
> ./config/configure.py --with-batch=1 --with-clanguage=C++
> --with-vendor-compilers=intel '--CXXFLAGS=-g
> -gcc-name=/usr/intel/pkgs/gcc/4.2.2/bin/g++ -gcc-version=420 '
> '--LDFLAGS=-L/usr/lib64 -L/usr/intel/pkgs/gcc/4.2.2/lib -ldl -lpthread
> -Qlocation,ld,/usr/intel/pkgs/gcc/4.2.2/x86_64-suse-linux/bin
> -L/usr/intel/pkgs/icc/10.1.008e/lib -lirc' --with-cxx=$ICCDIR/bin/icpc
> --with-fc=$IFCDIR/bin/ifort --with-mpi-compilers=0 --with-mpi-shared=0
> --with-debugging=yes --with-mpi=yes --with-mpi-include=$MPIDIR/include
> --with-mpi-lib=\[$MPIDIR/lib64/libmpi.a,$MPIDIR/lib64/libmpiif.a,$MPIDIR/lib64/libmpigi.a\]
> --with-blas-lapack-lib=\[$MKLLIBDIR/libguide.so,$MKLLIBDIR/libmkl_lapack.so,$MKLLIBDIR/libmkl_solver.a,$MKLLIBDIR/libmkl.so\]
> --with-scalapack=yes --with-scalapack-include=$MKLDIR/include
> --with-scalapack-lib=$MKLLIBDIR/libmkl_scalapack.a --with-blacs=yes
> --with-blacs-include=$MKLDIR/include
> --with-blacs-lib=$MKLLIBDIR/libmkl_blacs_intelmpi_lp64.a
> --with-umfpack=1
> --with-umfpack-lib=\[$UMFPACKDIR/UMFPACK/Lib/libumfpack.a,$UMFPACKDIR/AMD/Lib/libamd.a\]
> --with-umfpack-include=$UMFPACKDIR/UMFPACK/Include
> --with-parmetis=1 --with-parmetis-dir=$PARMETISDIR --with-mumps=1
> --download-mumps=$PETSC_DIR/externalpackages/MUMPS_4.6.3.tar.gz
> --with-superlu_dist=1
> --download-superlu_dist=$PETSC_DIR/externalpackages/superlu_dist_2.0.tar.gz
> ....
> /nfs/pdx/proj/dt/pdx_sde02/x86-64_linux26/petsc/petsc-dev/conftest.c:7:
> undefined reference to `f2cblaslapack311_id_'
> /p/dt/sde/tools/x86-64_linux26/mkl/
> undefined reference to `pthread_atfork'
> ....
> --------------------------------------------------------------------------------------
> You set a value for --with-blas-lapack-lib=<lib>, but
> ['/p/dt/sde/tools/x86-64_linux26/mkl/',
> '/p/dt/sde/tools/x86-64_linux26/mkl/
> o',
> '/p/dt/sde/tools/x86-64_linux26/mkl/
> ', '/p/dt/sde/tools/x86-64_linux26/mkl/']
> cannot be used
> *********************************************************************************
> We don't have all the code just right to use those packages with
> 64 bit integers. I will try to get them all working by Monday and
> will let you know my progress. To use them you will need to be using
> petsc-dev
> http://www-unix.mcs.anl.gov/petsc/petsc-as/developers/index.html
> so you can switch to that now, if you are not yet using it, in
> preparation for my updates.
> Barry
On Oct 9, 2008, at 12:52 PM, Rhew, Jung-hoon wrote:
>> Hi,
>> I found that the root cause of the malloc error was that our PETSc
>> library had been compiled without the 64-bit flag. Thus, PetscInt
>> was defined as "int" instead of "long long", and for large problems
>> the allocation size exceeds the maximum value of int and causes
>> integer overflow.
>> But when I tried to build with the 64-bit flag
>> (--with-64-bit-indices=1), all files associated with the external
>> libraries built with PETSc (such as UMFPACK and MUMPS) started
>> failing to compile, mainly due to the incompatibility between "int"
>> in those libraries and "long long" in PETSc.
>> I wonder if you can let us know how to resolve this conflict when
>> building PETSc with 64-bit indices. The brute-force way is to change
>> the source code of those libraries where the conflicts occur, but I
>> wonder if there is a neater way of doing this.
>> Thanks.
>> jr
>> Example:
>> libfast in: /nfs/ltdn/disks/td_disk49/usr.cdmg/jrhew/work/mds_work/PETSC/mypetsc-2.3.2-p10/src/mat/impls/aij/seq/umfpack
>> umfpack.c(154): error: a value of type "PetscInt={long long} *" cannot be used to initialize an entity of type "int *"
>> int m=A->rmap.n,n=A->cmap.n,*ai=mat->i,*aj=mat->j,status,*ra,idx;
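The compiler error above comes from handing an array of 64-bit PetscInt values to code that expects "int *". As a rough illustration only (this is not a fix proposed anywhere in this thread), one stopgap is to copy the indices into a 32-bit array and refuse any value that does not fit. In the sketch below, index64_t is a stand-in for a 64-bit PetscInt and narrow_indices is a hypothetical helper, not a PETSc or UMFPACK function.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

typedef long long index64_t; /* stand-in for a 64-bit PetscInt (assumption) */

/* Hypothetical helper: copy n 64-bit indices into a newly allocated int array.
   Returns NULL if any value does not fit in int or if allocation fails.
   The caller owns the returned array and must free() it. */
int *narrow_indices(const index64_t *src, size_t n)
{
    assert(src != NULL || n == 0);
    int *dst = malloc((n ? n : 1) * sizeof *dst);
    if (!dst) return NULL;
    for (size_t i = 0; i < n; i++) {
        if (src[i] < INT_MIN || src[i] > INT_MAX) {
            free(dst);          /* value would be truncated: give up */
            return NULL;
        }
        dst[i] = (int)src[i];
    }
    return dst;
}
```

The copy costs extra memory and only helps while every index still fits in 32 bits, so it does not remove the underlying limit of the 32-bit library.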
>> During the symbolic phase of ILU(N) there is no way to know in
>> advance how many new nonzeros are needed in the factored version
>> over the original matrix (this is true for LU too). We handle this
>> by starting with a certain amount of memory; if that is not
>> enough for the symbolic factor, we double the memory allocated,
>> copy the values over from the old copy of the symbolic factor
>> (what has been computed so far), and then free the old copy.
>> To avoid this "memory doubling" (which is not very memory
>> efficient) you can use the option -pc_factor_fill or
>> PCFactorSetFill() to set slightly more than the "correct" value;
>> then only a single malloc is needed and you can do larger problems.
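The doubling strategy described above can be sketched in plain C. This is a minimal illustration under assumptions, not PETSc's actual factorization code: growbuf, growbuf_init and growbuf_push are hypothetical names, and the initial capacity is simply fill times the number of nonzeros in the original matrix.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    int    *data;    /* entries of the symbolic factor computed so far */
    size_t  len;     /* entries in use */
    size_t  cap;     /* entries allocated */
    int     mallocs; /* number of allocations performed so far */
} growbuf;

/* Start with room for fill * nnz_A entries (at least one). Returns 0 on success. */
int growbuf_init(growbuf *b, size_t nnz_A, double fill)
{
    assert(b != NULL && fill > 0.0);
    size_t cap = (size_t)(fill * (double)nnz_A);
    if (cap == 0) cap = 1;
    b->data = malloc(cap * sizeof *b->data);
    if (!b->data) return -1;
    b->len = 0;
    b->cap = cap;
    b->mallocs = 1;
    return 0;
}

/* Append one entry. When the buffer is full: allocate twice the space,
   copy what has been computed so far, then free the old copy. */
int growbuf_push(growbuf *b, int value)
{
    if (b->len == b->cap) {
        size_t ncap = 2 * b->cap;
        int *ndata = malloc(ncap * sizeof *ndata);
        if (!ndata) return -1;
        memcpy(ndata, b->data, b->len * sizeof *ndata);
        free(b->data);
        b->data = ndata;
        b->cap = ncap;
        b->mallocs++;
    }
    b->data[b->len++] = value;
    return 0;
}
```

While a doubling step is in progress the old and new buffers exist at the same time, so the peak footprint is about three times the old size. A fill value that is large enough from the start keeps mallocs at 1 and avoids that peak.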
>> Of course, the question is "what value should I use for fill?"
>> There is no formula; if there were, we would use it automatically.
>> So the only way I know is to run smaller problems and get a feel
>> for what the ratio should be for your larger problem. Run with
>> -info | grep pc_factor_fill and it will tell you what "you should
>> have used".
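The arithmetic behind "get a feel for the ratio" is small enough to write down. The sketch below assumes that the fill is the ratio of nonzeros in the factor to nonzeros in the original matrix and that a small safety margin is added on top; the margin and both function names are illustrative assumptions, since the thread only says "slightly more than the correct value".

```c
#include <assert.h>

/* Fill observed on a small run: nonzeros in the factor divided by
   nonzeros in the original matrix. */
double observed_fill(long long nnz_factor, long long nnz_A)
{
    assert(nnz_A > 0 && nnz_factor >= 0);
    return (double)nnz_factor / (double)nnz_A;
}

/* Fill to request for the larger run: the observed ratio plus a safety
   margin (for example, margin = 0.1 adds 10%). */
double suggested_fill(double observed, double margin)
{
    assert(observed > 0.0 && margin >= 0.0);
    return observed * (1.0 + margin);
}
```

For instance, a small run whose factor holds 3.4 million nonzeros for a matrix with 1 million would give an observed fill of about 3.4. Whether that ratio carries over to the larger problem is exactly the judgment call described above.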
>> Hope this helps,
>> Barry
On Oct 7, 2008, at 5:46 PM, Rhew, Jung-hoon wrote:
>>> Hi,
>>> 1. I ran it on a 64-bit machine with 32GB of physical memory but it
>>> still crashed. At the crash, the peak memory was 17GB, so there was
>>> plenty of memory left. This is why I don't think the simulation
>>> needed the full 32GB plus swap space (more than 64GB).
>>> 2. The problem size is too big for a direct solver, as it can easily go
>>> beyond 32GB. Actually, we use MUMPS for smaller problems.
>>> 3. ILU(N) is the most robust preconditioner we have found for our
>>> production simulation, so we want to stick with it.
>>> I think I'll send a test case that reproduces the problem.
>>> It's not hard for ILU(k) to exceed the 32-bit limit for large
>>> matrices. I would recommend
>>> 1) Using a 64-bit machine with more memory
>>> 2) Trying a sparse direct solver like MUMPS
>>> 3) Trying another preconditioner, which is of course problem
>>> dependent
>>> Thanks,
>>> Matt
On Tue, Oct 7, 2008 at 4:03 PM, Rhew, Jung-hoon <jung-hoon.rhew at intel.com> wrote:
>>>> Dear PETSc team,
>>>> We use PETSc as a linear solver library in our tool, and in some
>>>> test cases using the ILU(N) preconditioner we have problems with
>>>> memory. I'm not sending our matrix at this time since it is huge,
>>>> but if you think it is needed, I'll send it to you.
>>>> Thanks for your help in advance.
>>>> Log file is attached.
>>>> OS: suse 64bit sles9
>>>> 2.6.5-7.276.PTF.196309.1-smp #1 SMP Mon Jul 24 10:45:31 UTC 2006 x86_64 x86_64 x86_64 GNU/Linux
>>>> PETSc ver: petsc-2.3.2-p10
>>>> MPI implementation: Intel MPI based on MPICH2 and MVAPICH2
>>>> Compiler: GCC 4.2.2
>>>> Probable PETSc component: n/a
>>>> Problem Description
>>>> Solver setting: BCGSL (L=2) and ILU(N=2)
>>>> -ksp_rtol=1e-14
>>>> -ksp_type=bcgsl
>>>> -ksp_bcgsl_ell=2
>>>> -pc_factor_levels=2
>>>> -pc_factor_reuseordering
>>>> -pc_factor_zeropivot=0.0
>>>> -pc_type=ilu
>>>> -pc_factor_fill=2
>>>> -pc_factor_mat_ordering_type=rcm
>>>> malloc crash: sparse matrix size ~ 500K by 500K with NNZ ~ 0.002%
>>>> (full error message is attached.)
>>>> In the debugger, symbolic ILU requires memory beyond the max int. At
>>>> line 1089 in aijfact.c, len becomes -2147483648 as
>>>> (bi[n])*sizeof(PetscScalar) > max int.
>>>> len = (bi[n])*sizeof(PetscScalar);
>>>> Then, it causes the following malloc error in subsequent function
>>>> calls (the call stack is also in the attached error message).
>>>> [0]PETSC ERROR: --------------------- Error Message ------------------------------------
>>>> [0]PETSC ERROR: Out of memory. This could be due to allocating
>>>> [0]PETSC ERROR: too large an object or bleeding by not properly
>>>> [0]PETSC ERROR: destroying unneeded objects.
>>>> [0]PETSC ERROR: Memory allocated -2147483648 Memory used by process -2147483648
>>>> [0]PETSC ERROR: Try running with -malloc_dump or -malloc_log for info.
>>>> [0]PETSC ERROR: Memory requested 18446744071912865792!
>>>> [0]PETSC ERROR: ------------------------------------------------------------------------
>>>> [0]PETSC ERROR: Petsc Release Version 2.3.2, Patch 10, Wed Mar 28 19:13:22 CDT 2007 HG revision: d7298c71db7f5e767f359ae35d33cab3bed44428
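The two numbers in this report fit a 32-bit overflow followed by sign extension, which can be checked with a few lines of C. The sketch assumes an 8-byte scalar and uses a nonzero count of 2^28 purely as an illustration; neither is stated in the report, and the function names are hypothetical. The "Memory requested" figure equals 2^64 - 1796685824, which is what a negative 32-bit length looks like once it is widened to a 64-bit unsigned size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte count as a build with 32-bit indices would end up storing it:
   the product is reduced to 32 bits. Unsigned arithmetic models the
   wraparound without undefined behaviour; the final conversion to
   int32_t is two's complement on common platforms. */
int32_t bytes_as_int32(int32_t nnz, size_t scalar_size)
{
    assert(nnz >= 0);
    uint32_t wrapped = (uint32_t)((uint64_t)nnz * (uint64_t)scalar_size);
    return (int32_t)wrapped;
}

/* The same product computed with 64-bit indices does not wrap. */
int64_t bytes_as_int64(int64_t nnz, size_t scalar_size)
{
    assert(nnz >= 0);
    return nnz * (int64_t)scalar_size;
}

/* What an allocator taking a 64-bit unsigned size sees when it is
   handed a negative 32-bit length: sign-extend, then reinterpret. */
uint64_t as_request_size(int32_t len)
{
    return (uint64_t)(int64_t)len;
}
```

With 2^28 entries of 8 bytes each, bytes_as_int32 gives -2147483648 while bytes_as_int64 gives 2147483648, and as_request_size(-1796685824) gives 18446744071912865792, the value printed in the log.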
>>>> Possibly relevant symptom: iterative solver with ILU(N) consumes
>>>> more memory than direct solver as N gets larger (>5), although the
>>>> matrix is not big enough to cause a malloc crash like the above.
