[petsc-users] Questions on 1 billion unknowns and 64-bit-indices

Matthew Knepley knepley at gmail.com
Thu Apr 30 13:29:13 CDT 2015


On Fri, May 1, 2015 at 4:21 AM, Danyang Su <danyang.su at gmail.com> wrote:

>  Dear All,
>
> I have run my code successfully with up to 100 million total unknowns
> using 1000 processors on the WestGrid Jasper cluster in Canada. But when I
> scale the unknowns up to 1 billion, the code crashes with the following
> error: it runs out of memory.
>

If you are running out of memory, you need to use more processors.

  Thanks,

     Matt
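
To gauge how much memory each rank actually uses before scaling up further, one option is to query PETSc's memory counters directly. Below is a minimal C sketch against the PETSc 3.5-era API; the helper name report_max_memory and the placement of the call are illustrative, not part of PETSc or of the original code.

#include <petscsys.h>

/* Illustrative helper (not part of PETSc): report the largest resident set
   size over all ranks of the given communicator.                           */
static PetscErrorCode report_max_memory(MPI_Comm comm)
{
  PetscErrorCode ierr;
  PetscLogDouble rss, rss_max;

  ierr = PetscMemoryGetCurrentUsage(&rss);CHKERRQ(ierr);  /* bytes used by this process */
  ierr = MPI_Reduce(&rss, &rss_max, 1, MPI_DOUBLE, MPI_MAX, 0, comm);CHKERRQ(ierr);
  ierr = PetscPrintf(comm, "max resident memory per rank: %g MB\n", rss_max/1048576.0);CHKERRQ(ierr);
  return 0;
}

int main(int argc, char **argv)
{
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
  /* ... create and assemble the matrices and vectors here ... */
  ierr = report_max_memory(PETSC_COMM_WORLD);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

Calling such a check where memory peaks (typically after matrix assembly and preconditioner setup) can show whether the job needs more processes or nodes with more memory per core.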


> Error message from the valgrind output:
>
> ==9344== Invalid read of size 16
> ==9344==    at 0xADB2906: __intel_sse2_strdup (in /lustre/jasper/software/intel/l_ics_2013.0.028/composer_xe_2013.1.117/compiler/lib/intel64/libintlc.so.5)
> ==9344==    by 0xE6: ???
> ==9344==    by 0xE7: ???
> ==9344==    by 0x5: ???
> ==9344==  Address 0xb364410 is 16 bytes inside a block of size 17 alloc'd
> ==9344==    at 0x4A0638D: malloc (vg_replace_malloc.c:291)
> ==9344==    by 0x3DE7C6807C: vasprintf (in /lib64/libc-2.5.so)
> ==9344==    by 0x3DE7C4CBE7: asprintf (in /lib64/libc-2.5.so)
> ==9344==    by 0x9DC511E: opal_output_init (output.c:144)
> ==9344==    by 0x9DC042D: opal_init_util (opal_init.c:207)
> ==9344==    by 0x9CF4EBB: ompi_mpi_init (ompi_mpi_init.c:309)
> ==9344==    by 0x9D0D802: PMPI_Init (pinit.c:84)
> ==9344==    by 0x905E976: PMPI_INIT (pinit_f.c:75)
> ==9344==    by 0x4D5280F: petscinitialize_ (in /lustre/jasper/software/petsc/petsc-3.5.1/lib/libpetsc.so.3.5.1)
> ==9344==    by 0x439D05: petsc_mpi_common_mp_petsc_mpi_initialize_ (in /lustre/home/danyangs/benchmark/basin/min3p_thcm)
> ==9344==    by 0x5FDBB9: MAIN__ (in /lustre/home/danyangs/benchmark/basin/min3p_thcm)
> ==9344==    by 0x4321FB: main (in /lustre/home/danyangs/benchmark/basin/min3p_thcm)
> ==9344==
>
> Error message from the Jasper cluster output:
> --32725:0:aspacem  <<< SHOW_SEGMENTS: out_of_memory (407 segments, 96 segnames)
> --32725:0:aspacem  ( 0) /lustre/jasper/software/valgrind/valgrind-3.9.0/lib/valgrind/memcheck-amd64-linux
> --32725:0:aspacem  ( 1) /lustre/home/danyangs/benchmark/basin/min3p_thcm
> --32725:0:aspacem  ( 2) /lib64/ld-2.5.so
> --32725:0:aspacem  ( 3) /data2/PBStmp/6456165.jasper-usradm.westgrid.ca/vgdb-pipe-shared-mem-vgdb-32725-by-danyangs-on-cl2n050
> --32725:0:aspacem  ( 4) /lustre/jasper/software/valgrind/valgrind-3.9.0/lib/valgrind/vgpreload_core-amd64-linux.so
> --32725:0:aspacem  ( 5) /lustre/jasper/software/valgrind/valgrind-3.9.0/lib/valgrind/vgpreload_memcheck-amd64-linux.so
> --32725:0:aspacem  ( 6) /lustre/jasper/software/petsc/petsc-3.5.1/lib/libpetsc.so.3.5.1
> --32725:0:aspacem  ( 7) /lustre/jasper/software/openmpi/openmpi-1.6.5-intel/lib/libmpi_cxx.so.1.0.2
> --32725:0:aspacem  ( 8) /lustre/jasper/software/intel/l_ics_2013.0.028/composer_xe_2013.1.117/mkl/lib/intel64/libmkl_scalapack_lp64.so
> --32725:0:aspacem  ( 9) /lustre/jasper/software/intel/l_ics_2013.0.028/composer_xe_2013.1.117/mkl/lib/intel64/libmkl_intel_lp64.so
> --32725:0:aspacem  (10) /lustre/jasper/software/intel/l_ics_2013.0.028/composer_xe_2013.1.117/mkl/lib/intel64/libmkl_sequential.so
> --32725:0:aspacem  (11) /lustre/jasper/software/intel/l_ics_2013.0.028/composer_xe_2013.1.117/mkl/lib/intel64/libmkl_core.so
> --32725:0:aspacem  (12) /lustre/jasper/software/petsc/petsc-3.5.1/lib/libparmetis.so
> --32725:0:aspacem  (13) /lustre/jasper/software/petsc/petsc-3.5.1/lib/libmetis.so
> --32725:0:aspacem  (14) /lustre/jasper/software/openmpi/openmpi-1.6.5-intel/lib/openmpi/mca_paffinity_hwloc.so
> --32725:0:aspacem  (15) /usr/lib64/libX11.so.6.2.0
> --32725:0:aspacem  (16) /lib64/libpthread-2.5.so
> --32725:0:aspacem  (17) /lib64/libssl.so.0.9.8e
> --32725:0:aspacem  (18) /lib64/libcrypto.so.0.9.8e
> --32725:0:aspacem  (19) /lustre/jasper/software/openmpi/openmpi-1.6.5-intel/lib/libmpi_f90.so.1.3.0
> --32725:0:aspacem  (20) /lustre/jasper/software/openmpi/openmpi-1.6.5-intel/lib/libmpi_f77.so.1.0.7
> --32725:0:aspacem  (21) /lustre/jasper/software/intel/l_ics_2013.0.028/composer_xe_2013.1.117/compiler/lib/intel64/libimf.so
>
> The PETSc configuration is as follows:
>
>
> ================================================================================
> Starting Configure Run at Tue Dec 16 10:42:20 2014
> Configure Options: --configModules=PETSc.Configure --optionsModule=PETSc.compilerOptions
>   --prefix=/global/software/petsc/petsc-3.5.1 --with-shared-libraries --with-mpirun=mpiexec
>   --with-vendor-compiler=intel --with-blas-lapack-lib=-mkl=sequential --with-cc=mpicc
>   --with-cxx=mpiCC --with-fc=mpif90 --COPTFLAGS=-O2 --CXXOPTFLAGS=-O2 --FOPTFLAGS=-O2
>   --with-debugging=no --with-blacs=yes
>   --with-blacs-include=/lustre/jasper/software/intel/l_ics_2013.0.028/composer_xe_2013.1.117/mkl/include
>   --with-blacs-lib=/lustre/jasper/software/intel/l_ics_2013.0.028/composer_xe_2013.1.117/mkl/lib/intel64/libmkl_blacs_openmpi_lp64.a
>   --with-scalapack=yes
>   --with-scalapack-include=/lustre/jasper/software/intel/l_ics_2013.0.028/composer_xe_2013.1.117/mkl/include
>   --with-scalapack-lib="-L/lustre/jasper/software/intel/l_ics_2013.0.028/composer_xe_2013.1.117/mkl/lib/intel64/ -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64"
>   --download-metis=metis-5.0.2-p3.tar.gz --download-parmetis=yes --download-superlu_dist=yes
>   --download-hypre=yes
> Working directory: /lustre/jasper/software-build/petsc/petsc-3.5.1
> Machine platform:
> ('Linux', 'jasper.westgrid.ca', '2.6.18-274.el5', '#1 SMP Fri Jul 22 04:43:29 EDT 2011', 'x86_64', 'x86_64')
> Python version:
> 2.4.3 (#1, Sep 21 2011, 19:55:41)
> [GCC 4.1.2 20080704 (Red Hat 4.1.2-51)]
>
> ================================================================================
>
> My case does not meet the conditions for 64-bit indices described below.
> By default the type that PETSc uses to index into arrays and keep sizes of
> arrays is a PetscInt defined to be a 32 bit int. If your problem
>
>    - involves more than 2^31 - 1 unknowns (around 2 billion) OR
>    - your matrix might contain more than 2^31 - 1 nonzeros on a single
>    process
>
> then you need to use this option. Otherwise you will get strange crashes.
>
> Do you guys have suggestions on this?
>
> Thanks and regards,
>
> Danyang
>
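
As a quick sanity check of the 2^31 - 1 limits quoted above, the expected nonzero count per process can be estimated with plain 64-bit arithmetic. The figures below (10^9 unknowns, 1000 processes, a 7-point stencil) are placeholders, not the actual sparsity of the reported problem:

#include <stdio.h>

int main(void)
{
  /* Placeholder figures; substitute the real problem sizes. */
  long long n_global    = 1000000000LL;  /* total unknowns            */
  long long nprocs      = 1000LL;        /* MPI processes             */
  long long nnz_per_row = 7LL;           /* e.g. a 7-point stencil    */
  long long int32_max   = 2147483647LL;  /* 2^31 - 1                  */

  long long nnz_local = (n_global / nprocs) * nnz_per_row;

  printf("estimated nonzeros per process: %lld\n", nnz_local);
  printf("fits in a 32-bit PetscInt: %s\n",
         nnz_local <= int32_max ? "yes" : "no");
  return 0;
}

If either limit were exceeded, the usual remedy is to reconfigure PETSc with --with-64-bit-indices so that PetscInt is 64 bits wide; with numbers like those above the 32-bit index range is not the bottleneck, which points back to a plain out-of-memory condition.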



-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener