[petsc-users] MatMatMult causes crash

Matthew Knepley knepley at gmail.com
Thu Jan 19 10:59:34 CST 2017


On Thu, Jan 19, 2017 at 10:54 AM, Cyrill Vonplanta <cyrill.von.planta at usi.ch> wrote:

> Thanks for the answer. I believe that a shortage of integer bits is not
> the problem, as the problem size is still very small (I printed out the
> matrix sizes of the MatMatMult operands on the Cray and on my machine below).
>

Then the next step is to run under valgrind, or to give us something that
reproduces this error. It does not occur in any of our tests so far.
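
For reference, the usual way to do this for an MPI run is to launch the
binary through valgrind, e.g. something like

  mpiexec -n 2 valgrind --tool=memcheck -q --num-callers=20 \
    --log-file=valgrind.log.%p ./moose-passo-opt <your options>

(the launcher and option values are illustrative; the binary name is taken
from your log), and then check the per-process log files for invalid
reads/writes or uses of uninitialised values before the crash.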

  Thanks,

     Matt


> In addition, by commenting code in and out I found that the matrix _O
> (which encodes an orthogonal 3D transformation and contains only 3x3
> blocks on the diagonal) causes this. This seems strange to me, as the
> matrix is properly set up and well behaved. When I write it out to
> MATLAB, _O has full rank and the eigenvalues look fine. Is there a way to
> diagnose this matrix further in PETSc, or do I have to pass something
> other than PETSC_DEFAULT to MatMatMult(...)?
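>
> For what it's worth, here is a minimal sketch of how one could inspect
> the matrix and pass an explicit fill estimate instead of PETSC_DEFAULT.
> It uses the variable names from the snippet further below; the fill
> value 2.0 is an illustrative guess, not from the original code:
>
> #include <petscmat.h>
>
> PetscErrorCode ierr;
> PetscInt M, N, rstart, rend;
> Mat C;
>
> /* global size and this rank's row ownership of _O */
> ierr = MatGetSize(_O, &M, &N); CHKERRQ(ierr);
> ierr = MatGetOwnershipRange(_O, &rstart, &rend); CHKERRQ(ierr);
> ierr = PetscSynchronizedPrintf(PETSC_COMM_WORLD,
>          "rows [%D,%D) of global %D x %D\n", rstart, rend, M, N); CHKERRQ(ierr);
> ierr = PetscSynchronizedFlush(PETSC_COMM_WORLD, PETSC_STDOUT); CHKERRQ(ierr);
>
> /* summary view: nonzeros, storage, mallocs during assembly */
> ierr = PetscViewerPushFormat(PETSC_VIEWER_STDOUT_WORLD, PETSC_VIEWER_ASCII_INFO); CHKERRQ(ierr);
> ierr = MatView(_O, PETSC_VIEWER_STDOUT_WORLD); CHKERRQ(ierr);
> ierr = PetscViewerPopFormat(PETSC_VIEWER_STDOUT_WORLD); CHKERRQ(ierr);
>
> /* explicit fill ratio instead of PETSC_DEFAULT */
> ierr = MatMatMult(_O, _interpolations[0], MAT_INITIAL_MATRIX, 2.0, &C); CHKERRQ(ierr);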
>
> Cyrill
> --
> On the Cray machine:
>
> _O:
> (Matrix) Type: mpiaij, rank 0| Global row size: 1107, global column size:
> 1107, local row size: 666, local column size: 666, blocksize: 1
> (Matrix) Type: mpiaij, rank 1| Global row size: 1107, global column size:
> 1107, local row size: 441, local column size: 441, blocksize: 1
>
> _interpolations[0]:
> (Matrix) Type: mpiaij, rank 0| Global row size: 1107, global column size:
> 195, local row size: 666, local column size: 132, blocksize: 1
> (Matrix) Type: mpiaij, rank 1| Global row size: 1107, global column size:
> 195, local row size: 441, local column size: 63, blocksize: 1
>
>
>
> On my desktop:
>
> _O:
> (Matrix) Type: mpiaij, rank 0| Global row size: 1107, global column size:
> 1107, local row size: 645, local column size: 645, blocksize: 1
> (Matrix) Type: mpiaij, rank 1| Global row size: 1107, global column size:
> 1107, local row size: 462, local column size: 462, blocksize: 1
>
> _interpolations[0]:
> (Matrix) Type: mpiaij, rank 0| Global row size: 1107, global column size:
> 195, local row size: 645, local column size: 126, blocksize: 1
> (Matrix) Type: mpiaij, rank 1| Global row size: 1107, global column size:
> 195, local row size: 462, local column size: 69, blocksize: 1
>
>
>
>
>
>
>  *******
> Cyrill von Planta
>
> Institute of Computational Science
> University of Lugano          **   Switzerland
> Via Giuseppe Buffi 13        **   6900 Lugano
> Tel.: +41 (0)58 666 49 73   **   Fax.: +41 (0)58 666 45 36
> http://ics.usi.ch/                  **   cyrill.von.planta at usi.ch
>
> On 19 Jan 2017, at 17:03, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
>
>   An absurd memory request like "Memory requested 18446744068029169664"
> usually means that 32-bit integers are not large enough for the problem.
> Try configuring on the Cray with --with-64-bit-indices
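>
> (As a concrete illustration, not from the thread: 18446744068029169664
> is exactly the value produced when a 32-bit count that has overflowed to
> a negative number is converted to an unsigned 64-bit size. A standalone
> sketch, with the count chosen to reproduce this particular number:)
>
> #include <stdio.h>
>
> int main(void)
> {
>   /* a 32-bit element count that has overflowed to a negative value */
>   int count = -710047744;
>   /* the conversion to size_t in an allocation wraps around */
>   size_t request = (size_t)count * sizeof(double);
>   printf("Memory requested %zu\n", request); /* 18446744068029169664 */
>   return 0;
> }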
>
>   Barry
>
>
> On Jan 19, 2017, at 7:14 AM, Cyrill Vonplanta <cyrill.von.planta at usi.ch> wrote:
>
> Dear PETSc Users,
>
>
> I have a problem with a solver running on a Cray machine: it crashes in
> the call to MatMatMult (see error message below). When I run the same
> solver on my own machine, in serial or in parallel, it runs through, and
> when I look at it with -malloc_debug there do not seem to be any issues.
>
> Does someone have a clue what the cause of this failure could be?
>
> Best, Cyrill
> --
>
> The line that causes the crash is this:
>
> ierr = MatMatMult(_O, _interpolations[0], MAT_INITIAL_MATRIX,
> PETSC_DEFAULT, &mmg->interpolations[mg_levels-2]); CHKERRQ(ierr);
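>
> (For reference, a minimal sketch of the dimension checks one could place
> just before this call; the requirement for C = A*B is that the global and
> local column layout of A match the row layout of B. This is illustrative,
> not from the original code:)
>
> PetscInt An, an, Bm, bm;
> ierr = MatGetSize(_O, NULL, &An); CHKERRQ(ierr);
> ierr = MatGetLocalSize(_O, NULL, &an); CHKERRQ(ierr);
> ierr = MatGetSize(_interpolations[0], &Bm, NULL); CHKERRQ(ierr);
> ierr = MatGetLocalSize(_interpolations[0], &bm, NULL); CHKERRQ(ierr);
> if (An != Bm || an != bm) SETERRQ4(PETSC_COMM_WORLD, PETSC_ERR_ARG_SIZ,
>   "A cols %D/%D do not match B rows %D/%D (global/local)", An, an, Bm, bm);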
>
> The error message:
>
>
> [0]PETSC ERROR: --------------------- Error Message
> --------------------------------------------------------------
> [0]PETSC ERROR: Out of memory. This could be due to allocating
> [0]PETSC ERROR: too large an object or bleeding by not properly
> [0]PETSC ERROR: destroying unneeded objects.
> [0]PETSC ERROR: Memory allocated 0 Memory used by process 61852
> [0]PETSC ERROR: Try running with -malloc_dump or -malloc_log for info.
> [0]PETSC ERROR: Memory requested 18446744068029169664
> [0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html
> for trouble shooting.
> [0]PETSC ERROR: Petsc Release Version 3.7.2, Jun, 05, 2016
> [0]PETSC ERROR: /scratch/snx3000/studi/./moose-passo-opt on a haswell
> named nid01137 by studi Thu Jan 19 14:03:27 2017
> [0]PETSC ERROR: Configure options --known-has-attribute-aligned=1
> --known-mpi-int64_t=0 --known-bits-per-byte=8 --known-sdot-returns-double=0
> --known-snrm2-returns-double=0 --known-level1-dcache-assoc=0
> --known-level1-dcache-linesize=32 --known-level1-dcache-size=32768
> --known-memcmp-ok=1 --known-mpi-c-double-complex=1
> --known-mpi-long-double=1 --known-mpi-shared-libraries=0
> --known-sizeof-MPI_Comm=4 --known-sizeof-MPI_Fint=4 --known-sizeof-char=1
> --known-sizeof-double=8 --known-sizeof-float=4 --known-sizeof-int=4
> --known-sizeof-long-long=8 --known-sizeof-long=8 --known-sizeof-short=2
> --known-sizeof-size_t=8 --known-sizeof-void-p=8 --with-ar=ar --with-batch=1
> --with-cc=cc --with-clib-autodetect=0 --with-cxx=CC
> --with-cxxlib-autodetect=0 --with-debugging=0 --with-dependencies=0
> --with-fc=ftn --with-fortran-datatypes=0 --with-fortran-interfaces=0
> --with-fortranlib-autodetect=0 --with-ranlib=ranlib --with-scalar-type=real
> --with-shared-ld=ar --with-etags=0 --with-dependencies=0 --with-x=0
> --with-ssl=0 --with-shared-libraries=0 --with-dependencies=0
> --with-mpi-lib="[]" --with-mpi-include="[]" --with-blas-lapack-lib="-L/
> opt/cray/libsci/13.2.0/GNU/5.1/x86_64/lib -lsci_gnu_mp" --with-superlu=1
> --with-superlu-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include
> --with-superlu-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib
> -lsuperlu" --with-superlu_dist=1 --with-superlu_dist-include=/
> opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-superlu_dist-lib="-L/
> opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lsuperlu_dist"
> --with-parmetis=1 --with-parmetis-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include
> --with-parmetis-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib
> -lparmetis" --with-metis=1 --with-metis-include=/opt/
> cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-metis-lib="-L/opt/cray/
> tpsl/16.07.1/GNU/5.1/haswell/lib -lmetis" --with-ptscotch=1
> --with-ptscotch-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include
> --with-ptscotch-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib
> -lptscotch -lscotch -lptscotcherr -lscotcherr" --with-scalapack=1
> --with-scalapack-include=/opt/cray/libsci/13.2.0/GNU/5.1/x86_64/include
> --with-scalapack-lib="-L/opt/cray/libsci/13.2.0/GNU/5.1/x86_64/lib
> -lsci_gnu_mpi_mp -lsci_gnu_mp" --with-mumps=1 --with-mumps-include=/opt/
> cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-mumps-lib="-L/opt/cray/
> tpsl/16.07.1/GNU/5.1/haswell/lib -lcmumps -ldmumps -lesmumps -lsmumps
> -lzmumps -lmumps_common -lptesmumps -lpord" --with-hdf5=1
> --with-hdf5-include=/opt/cray/hdf5-parallel/1.8.16/GNU/5.1/include
> --with-hdf5-lib="-L/opt/cray/hdf5-parallel/1.8.16/GNU/5.1/lib
> -lhdf5_parallel -lz -ldl" --CFLAGS="-march=haswell -fopenmp -O3
> -ffast-math  -fPIC" --CPPFLAGS= --CXXFLAGS="-march=haswell -fopenmp -O3
> -ffast-math   -fPIC" --FFLAGS="-march=haswell -fopenmp -O3 -ffast-math
>  -fPIC" --LIBS= --CXX_LINKER_FLAGS= --PETSC_ARCH=haswell
> --prefix=/opt/cray/pe/petsc/3.7.2.1/real/GNU/5.1/haswell --with-hypre=1
> --with-hypre-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include
> --with-hypre-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lHYPRE"
> --with-sundials=1 --with-sundials-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include
> --with-sundials-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib
> -lsundials_cvode -lsundials_cvodes -lsundials_ida -lsundials_idas
> -lsundials_kinsol -lsundials_nvecparallel -lsundials_nvecserial"
> [0]PETSC ERROR: #1 MatGetBrowsOfAoCols_MPIAIJ() line 4815 in
> src/mat/impls/aij/mpi/mpiaij.c
> [0]PETSC ERROR: #2 MatGetBrowsOfAoCols_MPIAIJ() line 4815 in
> src/mat/impls/aij/mpi/mpiaij.c
> [0]PETSC ERROR: #3 MatMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable() line
> 198 in src/mat/impls/aij/mpi/mpimatmatmult.c
> [0]PETSC ERROR: #4 MatMatMult_MPIAIJ_MPIAIJ() line 34 in
> src/mat/impls/aij/mpi/mpimatmatmult.c
> [0]PETSC ERROR:   MMG Setup 30.868420 ms.
> #5 MatMatMult() line 9517 in src/mat/interface/matrix.c
> [0]PETSC ERROR: #6 MMGSetup() line 85 in /users/studi/src/moose-passo/
> src/passo/monotone_mg.C
> [0]PETSC ERROR: --------------------- Error Message
> --------------------------------------------------------------
> [0]PETSC ERROR: Arguments are incompatible
> [0]PETSC ERROR: Incompatible vector local lengths 666 != 10922
> [0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html
> for trouble shooting.
> [0]PETSC ERROR: Petsc Release Version 3.7.2, Jun, 05, 2016
> [0]PETSC ERROR: /scratch/snx3000/studi/./moose-passo-opt on a haswell
> named nid01137 by studi Thu Jan 19 14:03:27 2017
> [0]PETSC ERROR: Configure options --known-has-attribute-aligned=1
> --known-mpi-int64_t=0 --known-bits-per-byte=8 --known-sdot-returns-double=0
> --known-snrm2-returns-double=0 --known-level1-dcache-assoc=0
> --known-level1-dcache-linesize=32 --known-level1-dcache-size=32768
> --known-memcmp-ok=1 --known-mpi-c-double-complex=1
> --known-mpi-long-double=1 --known-mpi-shared-libraries=0
> --known-sizeof-MPI_Comm=4 --known-sizeof-MPI_Fint=4 --known-sizeof-char=1
> --known-sizeof-double=8 --known-sizeof-float=4 --known-sizeof-int=4
> --known-sizeof-long-long=8 --known-sizeof-long=8 --known-sizeof-short=2
> --known-sizeof-size_t=8 --known-sizeof-void-p=8 --with-ar=ar --with-batch=1
> --with-cc=cc --with-clib-autodetect=0 --with-cxx=CC
> --with-cxxlib-autodetect=0 --with-debugging=0 --with-dependencies=0
> --with-fc=ftn --with-fortran-datatypes=0 --with-fortran-interfaces=0
> --with-fortranlib-autodetect=0 --with-ranlib=ranlib --with-scalar-type=real
> --with-shared-ld=ar --with-etags=0 --with-dependencies=0 --with-x=0
> --with-ssl=0 --with-shared-libraries=0 --with-dependencies=0
> --with-mpi-lib="[]" --with-mpi-include="[]" --with-blas-lapack-lib="-L/
> opt/cray/libsci/13.2.0/GNU/5.1/x86_64/lib -lsci_gnu_mp" --with-superlu=1
> --with-superlu-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include
> --with-superlu-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib
> -lsuperlu" --with-superlu_dist=1 --with-superlu_dist-include=/
> opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-superlu_dist-lib="-L/
> opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lsuperlu_dist"
> --with-parmetis=1 --with-parmetis-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include
> --with-parmetis-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib
> -lparmetis" --with-metis=1 --with-metis-include=/opt/
> cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-metis-lib="-L/opt/cray/
> tpsl/16.07.1/GNU/5.1/haswell/lib -lmetis" --with-ptscotch=1
> --with-ptscotch-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include
> --with-ptscotch-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib
> -lptscotch -lscotch -lptscotcherr -lscotcherr" --with-scalapack=1
> --with-scalapack-include=/opt/cray/libsci/13.2.0/GNU/5.1/x86_64/include
> --with-scalapack-lib="-L/opt/cray/libsci/13.2.0/GNU/5.1/x86_64/lib
> -lsci_gnu_mpi_mp -lsci_gnu_mp" --with-mumps=1 --with-mumps-include=/opt/
> cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-mumps-lib="-L/opt/cray/
> tpsl/16.07.1/GNU/5.1/haswell/lib -lcmumps -ldmumps -lesmumps -lsmumps
> -lzmumps -lmumps_common -lptesmumps -lpord" --with-hdf5=1
> --with-hdf5-include=/opt/cray/hdf5-parallel/1.8.16/GNU/5.1/include
> --with-hdf5-lib="-L/opt/cray/hdf5-parallel/1.8.16/GNU/5.1/lib
> -lhdf5_parallel -lz -ldl" --CFLAGS="-march=haswell -fopenmp -O3
> -ffast-math  -fPIC" --CPPFLAGS= --CXXFLAGS="-march=haswell -fopenmp -O3
> -ffast-math   -fPIC" --FFLAGS="-march=haswell -fopenmp -O3 -ffast-math
>  -fPIC" --LIBS= --CXX_LINKER_FLAGS= --PETSC_ARCH=haswell
> --prefix=/opt/cray/pe/petsc/3.7.2.1/real/GNU/5.1/haswell --with-hypre=1
> --with-hypre-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include
> --with-hypre-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lHYPRE"
> --with-sundials=1 --with-sundials-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include
> --with-sundials-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib
> -lsundials_cvode -lsundials_cvodes -lsundials_ida -lsundials_idas
> -lsundials_kinsol -lsundials_nvecparallel -lsundials_nvecserial"
> [0]PETSC ERROR: #7 VecCopy() line 1639 in src/vec/vec/interface/vector.c
> Level 1, Presmoothing step 0 ... srun: error: nid01137: task 0:
> Trace/breakpoint trap
> srun: Terminating job step 349949.1
> slurmstepd: error: *** STEP 349949.1 ON nid01137 CANCELLED AT
> 2017-01-19T14:03:32 ***
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: nid01137: task 1: Killed
>
>
>
>
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener