[petsc-users] MatMatMult causes crash

Cyrill Vonplanta cyrill.von.planta at usi.ch
Thu Jan 19 10:54:34 CST 2017


Thanks for the answer. I don't believe the integer width is the problem, since the problem size is still very small (the matrix sizes of the MatMatMult operands on the Cray and on my machine are printed below).

In addition, by commenting code in and out I found that the matrix _O (it encodes an orthogonal 3D transformation and contains only 3x3 blocks on the diagonal) is what triggers this. That seems strange to me, since the matrix is assembled and well behaved: when I write it out to MATLAB, _O has full rank and the eigenvalues look fine. Is there a way to diagnose this matrix further in PETSc, or do I perhaps have to pass something other than PETSC_DEFAULT for the fill argument of MatMatMult(...)?
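For reference, a minimal sketch (reusing the Mat names _O and _interpolations[0] from the call further below; not taken verbatim from our code) of how the per-rank layouts and preallocation info could be printed with plain PETSc calls:

PetscErrorCode ierr;
PetscMPIInt    rank;
PetscInt       mO, nO, mI, nI, rstart, rend;
MatInfo        info;

/* per-rank local sizes and ownership range of _O, plus its local preallocation info */
ierr = MPI_Comm_rank(PETSC_COMM_WORLD, &rank);CHKERRQ(ierr);
ierr = MatGetLocalSize(_O, &mO, &nO);CHKERRQ(ierr);
ierr = MatGetLocalSize(_interpolations[0], &mI, &nI);CHKERRQ(ierr);
ierr = MatGetOwnershipRange(_O, &rstart, &rend);CHKERRQ(ierr);
ierr = MatGetInfo(_O, MAT_LOCAL, &info);CHKERRQ(ierr);
ierr = PetscSynchronizedPrintf(PETSC_COMM_WORLD,
         "[%d] _O local %D x %D (rows %D..%D), interpolation local %D x %D, nz_used %g, nz_allocated %g\n",
         rank, mO, nO, rstart, rend-1, mI, nI, info.nz_used, info.nz_allocated);CHKERRQ(ierr);
ierr = PetscSynchronizedFlush(PETSC_COMM_WORLD, PETSC_STDOUT);CHKERRQ(ierr);

/* the problem is small, so a full ASCII dump of _O is still feasible */
ierr = MatView(_O, PETSC_VIEWER_STDOUT_WORLD);CHKERRQ(ierr);

(Running with -mat_view ::ascii_info should print similar summary information for matrices as they are assembled, without any code changes.)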

Cyrill
--
On the cray machine

_O:
(Matrix) Type: mpiaij, rank 0| Global row size: 1107, global column size: 1107, local row size: 666, local column size: 666, blocksize: 1
(Matrix) Type: mpiaij, rank 1| Global row size: 1107, global column size: 1107, local row size: 441, local column size: 441, blocksize: 1

_interpolations[0]:
(Matrix) Type: mpiaij, rank 0| Global row size: 1107, global column size: 195, local row size: 666, local column size: 132, blocksize: 1
(Matrix) Type: mpiaij, rank 1| Global row size: 1107, global column size: 195, local row size: 441, local column size: 63, blocksize: 1



On my Desktop:

_O:
(Matrix) Type: mpiaij, rank 0| Global row size: 1107, global column size: 1107, local row size: 645, local column size: 645, blocksize: 1
(Matrix) Type: mpiaij, rank 1| Global row size: 1107, global column size: 1107, local row size: 462, local column size: 462, blocksize: 1

_interpolations[0]:
(Matrix) Type: mpiaij, rank 0| Global row size: 1107, global column size: 195, local row size: 645, local column size: 126, blocksize: 1
(Matrix) Type: mpiaij, rank 1| Global row size: 1107, global column size: 195, local row size: 462, local column size: 69, blocksize: 1






 *******
Cyrill von Planta

Institute of Computational Science
University of Lugano          **   Switzerland
Via Giuseppe Buffi 13        **   6900 Lugano
Tel.: +41 (0)58 666 49 73   **   Fax.: +41 (0)58 666 45 36
http://ics.usi.ch/                  **   cyrill.von.planta at usi.ch

On 19 Jan 2017, at 17:03, Barry Smith <bsmith at mcs.anl.gov> wrote:


  An absurd memory request such as "Memory requested 18446744068029169664" usually means that 32-bit integers are not large enough for the problem. Try configuring on the Cray with --with-64-bit-indices
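(A small standalone check, as a sketch, to confirm which index width the installed PETSc library actually uses; when built with --with-64-bit-indices, PetscInt is 8 bytes:)

#include <petscsys.h>

int main(int argc, char **argv)
{
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
  /* 4 bytes means 32-bit indices; 8 bytes means the library was built with --with-64-bit-indices */
  ierr = PetscPrintf(PETSC_COMM_WORLD, "sizeof(PetscInt) = %d bytes\n", (int)sizeof(PetscInt));CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}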

  Barry


On Jan 19, 2017, at 7:14 AM, Cyrill Vonplanta <cyrill.von.planta at usi.ch> wrote:

Dear PETSc Users,


I have a problem with a solver running on a Cray machine that crashes in the call MatMatMult (see the error message below). When I run the same solver on my own machine, in serial or in parallel, it runs through fine, and with -malloc_debug there don't seem to be any issues either.

Does someone have a clue what the cause of this failure could be?

Best, Cyrill
--

The line that causes the crash is this:

ierr = MatMatMult(_O, _interpolations[0], MAT_INITIAL_MATRIX, PETSC_DEFAULT, &mmg->interpolations[mg_levels-2]); CHKERRQ(ierr);
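(As a side note, here is a sketch of a guard one could put in front of this call, plus an explicit fill estimate instead of PETSC_DEFAULT; the variable names are those from the call above, and the fill value 2.0 is just an illustrative guess:)

PetscInt nO_local, mI_local;

/* MatMatMult(A,B,...) requires the local column layout of A to match the
   local row layout of B on every rank */
ierr = MatGetLocalSize(_O, NULL, &nO_local);CHKERRQ(ierr);
ierr = MatGetLocalSize(_interpolations[0], &mI_local, NULL);CHKERRQ(ierr);
if (nO_local != mI_local) SETERRQ2(PETSC_COMM_SELF, PETSC_ERR_ARG_SIZ,
    "local column size of _O (%D) != local row size of _interpolations[0] (%D)", nO_local, mI_local);

/* explicit fill estimate instead of PETSC_DEFAULT */
ierr = MatMatMult(_O, _interpolations[0], MAT_INITIAL_MATRIX, 2.0, &mmg->interpolations[mg_levels-2]);CHKERRQ(ierr);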

The error message:


[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: Out of memory. This could be due to allocating
[0]PETSC ERROR: too large an object or bleeding by not properly
[0]PETSC ERROR: destroying unneeded objects.
[0]PETSC ERROR: Memory allocated 0 Memory used by process 61852
[0]PETSC ERROR: Try running with -malloc_dump or -malloc_log for info.
[0]PETSC ERROR: Memory requested 18446744068029169664
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.7.2, Jun, 05, 2016
[0]PETSC ERROR: /scratch/snx3000/studi/./moose-passo-opt on a haswell named nid01137 by studi Thu Jan 19 14:03:27 2017
[0]PETSC ERROR: Configure options --known-has-attribute-aligned=1 --known-mpi-int64_t=0 --known-bits-per-byte=8 --known-sdot-returns-double=0 --known-snrm2-returns-double=0 --known-level1-dcache-assoc=0 --known-level1-dcache-linesize=32 --known-level1-dcache-size=32768 --known-memcmp-ok=1 --known-mpi-c-double-complex=1 --known-mpi-long-double=1 --known-mpi-shared-libraries=0 --known-sizeof-MPI_Comm=4 --known-sizeof-MPI_Fint=4 --known-sizeof-char=1 --known-sizeof-double=8 --known-sizeof-float=4 --known-sizeof-int=4 --known-sizeof-long-long=8 --known-sizeof-long=8 --known-sizeof-short=2 --known-sizeof-size_t=8 --known-sizeof-void-p=8 --with-ar=ar --with-batch=1 --with-cc=cc --with-clib-autodetect=0 --with-cxx=CC --with-cxxlib-autodetect=0 --with-debugging=0 --with-dependencies=0 --with-fc=ftn --with-fortran-datatypes=0 --with-fortran-interfaces=0 --with-fortranlib-autodetect=0 --with-ranlib=ranlib --with-scalar-type=real --with-shared-ld=ar --with-etags=0 --with-dependencies=0 --with-x=0 --with-ssl=0 --with-shared-libraries=0 --with-dependencies=0 --with-mpi-lib="[]" --with-mpi-include="[]" --with-blas-lapack-lib="-L/opt/cray/libsci/13.2.0/GNU/5.1/x86_64/lib -lsci_gnu_mp" --with-superlu=1 --with-superlu-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-superlu-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lsuperlu" --with-superlu_dist=1 --with-superlu_dist-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-superlu_dist-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lsuperlu_dist" --with-parmetis=1 --with-parmetis-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-parmetis-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lparmetis" --with-metis=1 --with-metis-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-metis-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lmetis" --with-ptscotch=1 --with-ptscotch-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-ptscotch-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lptscotch -lscotch -lptscotcherr -lscotcherr" --with-scalapack=1 --with-scalapack-include=/opt/cray/libsci/13.2.0/GNU/5.1/x86_64/include --with-scalapack-lib="-L/opt/cray/libsci/13.2.0/GNU/5.1/x86_64/lib -lsci_gnu_mpi_mp -lsci_gnu_mp" --with-mumps=1 --with-mumps-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-mumps-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lcmumps -ldmumps -lesmumps -lsmumps -lzmumps -lmumps_common -lptesmumps -lpord" --with-hdf5=1 --with-hdf5-include=/opt/cray/hdf5-parallel/1.8.16/GNU/5.1/include --with-hdf5-lib="-L/opt/cray/hdf5-parallel/1.8.16/GNU/5.1/lib -lhdf5_parallel -lz -ldl" --CFLAGS="-march=haswell -fopenmp -O3 -ffast-math  -fPIC" --CPPFLAGS= --CXXFLAGS="-march=haswell -fopenmp -O3 -ffast-math   -fPIC" --FFLAGS="-march=haswell -fopenmp -O3 -ffast-math   -fPIC" --LIBS= --CXX_LINKER_FLAGS= --PETSC_ARCH=haswell --prefix=/opt/cray/pe/petsc/3.7.2.1/real/GNU/5.1/haswell --with-hypre=1 --with-hypre-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-hypre-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lHYPRE" --with-sundials=1 --with-sundials-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-sundials-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lsundials_cvode -lsundials_cvodes -lsundials_ida -lsundials_idas -lsundials_kinsol -lsundials_nvecparallel -lsundials_nvecserial"
[0]PETSC ERROR: #1 MatGetBrowsOfAoCols_MPIAIJ() line 4815 in src/mat/impls/aij/mpi/mpiaij.c
[0]PETSC ERROR: #2 MatGetBrowsOfAoCols_MPIAIJ() line 4815 in src/mat/impls/aij/mpi/mpiaij.c
[0]PETSC ERROR: #3 MatMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable() line 198 in src/mat/impls/aij/mpi/mpimatmatmult.c
[0]PETSC ERROR: #4 MatMatMult_MPIAIJ_MPIAIJ() line 34 in src/mat/impls/aij/mpi/mpimatmatmult.c
[0]PETSC ERROR:   MMG Setup 30.868420 ms.
#5 MatMatMult() line 9517 in src/mat/interface/matrix.c
[0]PETSC ERROR: #6 MMGSetup() line 85 in /users/studi/src/moose-passo/src/passo/monotone_mg.C
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: Arguments are incompatible
[0]PETSC ERROR: Incompatible vector local lengths 666 != 10922
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.7.2, Jun, 05, 2016
[0]PETSC ERROR: /scratch/snx3000/studi/./moose-passo-opt on a haswell named nid01137 by studi Thu Jan 19 14:03:27 2017
[0]PETSC ERROR: Configure options --known-has-attribute-aligned=1 --known-mpi-int64_t=0 --known-bits-per-byte=8 --known-sdot-returns-double=0 --known-snrm2-returns-double=0 --known-level1-dcache-assoc=0 --known-level1-dcache-linesize=32 --known-level1-dcache-size=32768 --known-memcmp-ok=1 --known-mpi-c-double-complex=1 --known-mpi-long-double=1 --known-mpi-shared-libraries=0 --known-sizeof-MPI_Comm=4 --known-sizeof-MPI_Fint=4 --known-sizeof-char=1 --known-sizeof-double=8 --known-sizeof-float=4 --known-sizeof-int=4 --known-sizeof-long-long=8 --known-sizeof-long=8 --known-sizeof-short=2 --known-sizeof-size_t=8 --known-sizeof-void-p=8 --with-ar=ar --with-batch=1 --with-cc=cc --with-clib-autodetect=0 --with-cxx=CC --with-cxxlib-autodetect=0 --with-debugging=0 --with-dependencies=0 --with-fc=ftn --with-fortran-datatypes=0 --with-fortran-interfaces=0 --with-fortranlib-autodetect=0 --with-ranlib=ranlib --with-scalar-type=real --with-shared-ld=ar --with-etags=0 --with-dependencies=0 --with-x=0 --with-ssl=0 --with-shared-libraries=0 --with-dependencies=0 --with-mpi-lib="[]" --with-mpi-include="[]" --with-blas-lapack-lib="-L/opt/cray/libsci/13.2.0/GNU/5.1/x86_64/lib -lsci_gnu_mp" --with-superlu=1 --with-superlu-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-superlu-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lsuperlu" --with-superlu_dist=1 --with-superlu_dist-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-superlu_dist-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lsuperlu_dist" --with-parmetis=1 --with-parmetis-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-parmetis-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lparmetis" --with-metis=1 --with-metis-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-metis-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lmetis" --with-ptscotch=1 --with-ptscotch-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-ptscotch-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lptscotch -lscotch -lptscotcherr -lscotcherr" --with-scalapack=1 --with-scalapack-include=/opt/cray/libsci/13.2.0/GNU/5.1/x86_64/include --with-scalapack-lib="-L/opt/cray/libsci/13.2.0/GNU/5.1/x86_64/lib -lsci_gnu_mpi_mp -lsci_gnu_mp" --with-mumps=1 --with-mumps-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-mumps-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lcmumps -ldmumps -lesmumps -lsmumps -lzmumps -lmumps_common -lptesmumps -lpord" --with-hdf5=1 --with-hdf5-include=/opt/cray/hdf5-parallel/1.8.16/GNU/5.1/include --with-hdf5-lib="-L/opt/cray/hdf5-parallel/1.8.16/GNU/5.1/lib -lhdf5_parallel -lz -ldl" --CFLAGS="-march=haswell -fopenmp -O3 -ffast-math  -fPIC" --CPPFLAGS= --CXXFLAGS="-march=haswell -fopenmp -O3 -ffast-math   -fPIC" --FFLAGS="-march=haswell -fopenmp -O3 -ffast-math   -fPIC" --LIBS= --CXX_LINKER_FLAGS= --PETSC_ARCH=haswell --prefix=/opt/cray/pe/petsc/3.7.2.1/real/GNU/5.1/haswell --with-hypre=1 --with-hypre-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-hypre-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lHYPRE" --with-sundials=1 --with-sundials-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-sundials-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lsundials_cvode -lsundials_cvodes -lsundials_ida -lsundials_idas -lsundials_kinsol -lsundials_nvecparallel -lsundials_nvecserial"
[0]PETSC ERROR: #7 VecCopy() line 1639 in src/vec/vec/interface/vector.c
Level 1, Presmoothing step 0 ... srun: error: nid01137: task 0: Trace/breakpoint trap
srun: Terminating job step 349949.1
slurmstepd: error: *** STEP 349949.1 ON nid01137 CANCELLED AT 2017-01-19T14:03:32 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: nid01137: task 1: Killed





