[petsc-users] MatMatMult causes crash
Cyrill Vonplanta
cyrill.von.planta at usi.ch
Thu Jan 19 07:14:55 CST 2017
Dear PETSc Users,
I have a problem with a solver running on a Cray machine: it crashes at the call to “MatMatMult” (see the error message below). When I run the same solver on my own machine, in serial or in parallel, it runs through, and running it with -malloc_debug does not reveal any issues either.
Does anyone have an idea what could be causing this failure?
Best Cyrill
--
The line that causes the crash is this:
ierr = MatMatMult(_O, _interpolations[0], MAT_INITIAL_MATRIX, PETSC_DEFAULT, &mmg->interpolations[mg_levels-2]); CHKERRQ(ierr);
The error message:
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: Out of memory. This could be due to allocating
[0]PETSC ERROR: too large an object or bleeding by not properly
[0]PETSC ERROR: destroying unneeded objects.
[0]PETSC ERROR: Memory allocated 0 Memory used by process 61852
[0]PETSC ERROR: Try running with -malloc_dump or -malloc_log for info.
[0]PETSC ERROR: Memory requested 18446744068029169664
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.7.2, Jun, 05, 2016
[0]PETSC ERROR: /scratch/snx3000/studi/./moose-passo-opt on a haswell named nid01137 by studi Thu Jan 19 14:03:27 2017
[0]PETSC ERROR: Configure options --known-has-attribute-aligned=1 --known-mpi-int64_t=0 --known-bits-per-byte=8 --known-sdot-returns-double=0 --known-snrm2-returns-double=0 --known-level1-dcache-assoc=0 --known-level1-dcache-linesize=32 --known-level1-dcache-size=32768 --known-memcmp-ok=1 --known-mpi-c-double-complex=1 --known-mpi-long-double=1 --known-mpi-shared-libraries=0 --known-sizeof-MPI_Comm=4 --known-sizeof-MPI_Fint=4 --known-sizeof-char=1 --known-sizeof-double=8 --known-sizeof-float=4 --known-sizeof-int=4 --known-sizeof-long-long=8 --known-sizeof-long=8 --known-sizeof-short=2 --known-sizeof-size_t=8 --known-sizeof-void-p=8 --with-ar=ar --with-batch=1 --with-cc=cc --with-clib-autodetect=0 --with-cxx=CC --with-cxxlib-autodetect=0 --with-debugging=0 --with-dependencies=0 --with-fc=ftn --with-fortran-datatypes=0 --with-fortran-interfaces=0 --with-fortranlib-autodetect=0 --with-ranlib=ranlib --with-scalar-type=real --with-shared-ld=ar --with-etags=0 --with-dependencies=0 --with-x=0 --with-ssl=0 --with-shared-libraries=0 --with-dependencies=0 --with-mpi-lib="[]" --with-mpi-include="[]" --with-blas-lapack-lib="-L/opt/cray/libsci/13.2.0/GNU/5.1/x86_64/lib -lsci_gnu_mp" --with-superlu=1 --with-superlu-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-superlu-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lsuperlu" --with-superlu_dist=1 --with-superlu_dist-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-superlu_dist-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lsuperlu_dist" --with-parmetis=1 --with-parmetis-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-parmetis-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lparmetis" --with-metis=1 --with-metis-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-metis-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lmetis" --with-ptscotch=1 --with-ptscotch-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include 
--with-ptscotch-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lptscotch -lscotch -lptscotcherr -lscotcherr" --with-scalapack=1 --with-scalapack-include=/opt/cray/libsci/13.2.0/GNU/5.1/x86_64/include --with-scalapack-lib="-L/opt/cray/libsci/13.2.0/GNU/5.1/x86_64/lib -lsci_gnu_mpi_mp -lsci_gnu_mp" --with-mumps=1 --with-mumps-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-mumps-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lcmumps -ldmumps -lesmumps -lsmumps -lzmumps -lmumps_common -lptesmumps -lpord" --with-hdf5=1 --with-hdf5-include=/opt/cray/hdf5-parallel/1.8.16/GNU/5.1/include --with-hdf5-lib="-L/opt/cray/hdf5-parallel/1.8.16/GNU/5.1/lib -lhdf5_parallel -lz -ldl" --CFLAGS="-march=haswell -fopenmp -O3 -ffast-math -fPIC" --CPPFLAGS= --CXXFLAGS="-march=haswell -fopenmp -O3 -ffast-math -fPIC" --FFLAGS="-march=haswell -fopenmp -O3 -ffast-math -fPIC" --LIBS= --CXX_LINKER_FLAGS= --PETSC_ARCH=haswell --prefix=/opt/cray/pe/petsc/3.7.2.1/real/GNU/5.1/haswell --with-hypre=1 --with-hypre-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-hypre-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lHYPRE" --with-sundials=1 --with-sundials-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-sundials-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lsundials_cvode -lsundials_cvodes -lsundials_ida -lsundials_idas -lsundials_kinsol -lsundials_nvecparallel -lsundials_nvecserial"
[0]PETSC ERROR: #1 MatGetBrowsOfAoCols_MPIAIJ() line 4815 in src/mat/impls/aij/mpi/mpiaij.c
[0]PETSC ERROR: #2 MatGetBrowsOfAoCols_MPIAIJ() line 4815 in src/mat/impls/aij/mpi/mpiaij.c
[0]PETSC ERROR: #3 MatMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable() line 198 in src/mat/impls/aij/mpi/mpimatmatmult.c
[0]PETSC ERROR: #4 MatMatMult_MPIAIJ_MPIAIJ() line 34 in src/mat/impls/aij/mpi/mpimatmatmult.c
MMG Setup 30.868420 ms.
[0]PETSC ERROR: #5 MatMatMult() line 9517 in src/mat/interface/matrix.c
[0]PETSC ERROR: #6 MMGSetup() line 85 in /users/studi/src/moose-passo/src/passo/monotone_mg.C
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: Arguments are incompatible
[0]PETSC ERROR: Incompatible vector local lengths 666 != 10922
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.7.2, Jun, 05, 2016
[0]PETSC ERROR: /scratch/snx3000/studi/./moose-passo-opt on a haswell named nid01137 by studi Thu Jan 19 14:03:27 2017
[0]PETSC ERROR: Configure options --known-has-attribute-aligned=1 --known-mpi-int64_t=0 --known-bits-per-byte=8 --known-sdot-returns-double=0 --known-snrm2-returns-double=0 --known-level1-dcache-assoc=0 --known-level1-dcache-linesize=32 --known-level1-dcache-size=32768 --known-memcmp-ok=1 --known-mpi-c-double-complex=1 --known-mpi-long-double=1 --known-mpi-shared-libraries=0 --known-sizeof-MPI_Comm=4 --known-sizeof-MPI_Fint=4 --known-sizeof-char=1 --known-sizeof-double=8 --known-sizeof-float=4 --known-sizeof-int=4 --known-sizeof-long-long=8 --known-sizeof-long=8 --known-sizeof-short=2 --known-sizeof-size_t=8 --known-sizeof-void-p=8 --with-ar=ar --with-batch=1 --with-cc=cc --with-clib-autodetect=0 --with-cxx=CC --with-cxxlib-autodetect=0 --with-debugging=0 --with-dependencies=0 --with-fc=ftn --with-fortran-datatypes=0 --with-fortran-interfaces=0 --with-fortranlib-autodetect=0 --with-ranlib=ranlib --with-scalar-type=real --with-shared-ld=ar --with-etags=0 --with-dependencies=0 --with-x=0 --with-ssl=0 --with-shared-libraries=0 --with-dependencies=0 --with-mpi-lib="[]" --with-mpi-include="[]" --with-blas-lapack-lib="-L/opt/cray/libsci/13.2.0/GNU/5.1/x86_64/lib -lsci_gnu_mp" --with-superlu=1 --with-superlu-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-superlu-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lsuperlu" --with-superlu_dist=1 --with-superlu_dist-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-superlu_dist-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lsuperlu_dist" --with-parmetis=1 --with-parmetis-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-parmetis-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lparmetis" --with-metis=1 --with-metis-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-metis-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lmetis" --with-ptscotch=1 --with-ptscotch-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include 
--with-ptscotch-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lptscotch -lscotch -lptscotcherr -lscotcherr" --with-scalapack=1 --with-scalapack-include=/opt/cray/libsci/13.2.0/GNU/5.1/x86_64/include --with-scalapack-lib="-L/opt/cray/libsci/13.2.0/GNU/5.1/x86_64/lib -lsci_gnu_mpi_mp -lsci_gnu_mp" --with-mumps=1 --with-mumps-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-mumps-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lcmumps -ldmumps -lesmumps -lsmumps -lzmumps -lmumps_common -lptesmumps -lpord" --with-hdf5=1 --with-hdf5-include=/opt/cray/hdf5-parallel/1.8.16/GNU/5.1/include --with-hdf5-lib="-L/opt/cray/hdf5-parallel/1.8.16/GNU/5.1/lib -lhdf5_parallel -lz -ldl" --CFLAGS="-march=haswell -fopenmp -O3 -ffast-math -fPIC" --CPPFLAGS= --CXXFLAGS="-march=haswell -fopenmp -O3 -ffast-math -fPIC" --FFLAGS="-march=haswell -fopenmp -O3 -ffast-math -fPIC" --LIBS= --CXX_LINKER_FLAGS= --PETSC_ARCH=haswell --prefix=/opt/cray/pe/petsc/3.7.2.1/real/GNU/5.1/haswell --with-hypre=1 --with-hypre-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-hypre-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lHYPRE" --with-sundials=1 --with-sundials-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-sundials-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lsundials_cvode -lsundials_cvodes -lsundials_ida -lsundials_idas -lsundials_kinsol -lsundials_nvecparallel -lsundials_nvecserial"
[0]PETSC ERROR: #7 VecCopy() line 1639 in src/vec/vec/interface/vector.c
Level 1, Presmoothing step 0 ... srun: error: nid01137: task 0: Trace/breakpoint trap
srun: Terminating job step 349949.1
slurmstepd: error: *** STEP 349949.1 ON nid01137 CANCELLED AT 2017-01-19T14:03:32 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: nid01137: task 1: Killed