<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Thu, Jan 19, 2017 at 10:54 AM, Cyrill Vonplanta <span dir="ltr"><<a href="mailto:cyrill.von.planta@usi.ch" target="_blank">cyrill.von.planta@usi.ch</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Thanks for the answer. I believe that bit shortage is not the problem as the problem size is still very small (I printed out the matrix sizes of the operands in MatMatMult on the cray and my machine below).<br></blockquote><div><br></div><div>Then the next step is to run under valgrind, or give us something that reproduces this error. It does not occur in our tests yet.</div><div><br></div><div>  Thanks,</div><div><br></div><div>     Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
In addition, by commenting code in and out I found that the matrix _O (which encodes an orthogonal 3D transformation and contains only 3x3 blocks on the diagonal) causes this. This seems strange to me, as the matrix is set up correctly and is well behaved. When I write it out to MATLAB, _O has full rank and the eigenvalues are nice. Is there a way to diagnose this matrix further in PETSc, or do I perhaps have to pass something other than PETSC_DEFAULT for the fill argument in MatMatMult(...)? The kind of in-PETSc checks I have in mind are sketched just below.<br>
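<br>
(Sketch, assuming the Mat handle _O and the ierr/CHKERRQ pattern from the call further down; MatGetInfo reports nonzero and malloc statistics, and the Frobenius norm of a matrix of orthogonal 3x3 diagonal blocks should come out as sqrt(nrows):)<br>
<br>
MatInfo   info;<br>
PetscReal nrm;<br>
ierr = MatGetInfo(_O, MAT_GLOBAL_SUM, &info); CHKERRQ(ierr);  /* nonzero/malloc statistics */<br>
ierr = PetscPrintf(PETSC_COMM_WORLD, "nz_used %g, nz_allocated %g, mallocs %g\n", info.nz_used, info.nz_allocated, info.mallocs); CHKERRQ(ierr);<br>
ierr = MatNorm(_O, NORM_FROBENIUS, &nrm); CHKERRQ(ierr);  /* should be finite, about sqrt(1107) here */<br>
ierr = MatView(_O, PETSC_VIEWER_STDOUT_WORLD); CHKERRQ(ierr);  /* or run with -mat_view ::ascii_info */<br>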
<br>
Cyrill<br>
--<br>
On the cray machine<br>
<br>
_O:<br>
(Matrix) Type: mpiaij, rank 0| Global row size: 1107, global column size: 1107, local row size: 666, local column size: 666, blocksize: 1<br>
(Matrix) Type: mpiaij, rank 1| Global row size: 1107, global column size: 1107, local row size: 441, local column size: 441, blocksize: 1<br>
<br>
_interpolations[0]:<br>
(Matrix) Type: mpiaij, rank 0| Global row size: 1107, global column size: 195, local row size: 666, local column size: 132, blocksize: 1<br>
(Matrix) Type: mpiaij, rank 1| Global row size: 1107, global column size: 195, local row size: 441, local column size: 63, blocksize: 1<br>
<br>
<br>
<br>
On my Desktop:<br>
<br>
_O:<br>
(Matrix) Type: mpiaij, rank 0| Global row size: 1107, global column size: 1107, local row size: 645, local column size: 645, blocksize: 1<br>
(Matrix) Type: mpiaij, rank 1| Global row size: 1107, global column size: 1107, local row size: 462, local column size: 462, blocksize: 1<br>
<br>
_interpolations[0]:<br>
(Matrix) Type: mpiaij, rank 0| Global row size: 1107, global column size: 195, local row size: 645, local column size: 126, blocksize: 1<br>
(Matrix) Type: mpiaij, rank 1| Global row size: 1107, global column size: 195, local row size: 462, local column size: 69, blocksize: 1<br>
<br>
<br>
<br>
<br>
<br>
<br>
 *******<br>
Cyrill von Planta<br>
<br>
Institute of Computational Science<br>
University of Lugano          **   Switzerland<br>
Via Giuseppe Buffi 13        **   6900 Lugano<br>
Tel.: +41 (0)58 666 49 73   **   Fax.: +41 (0)58 666 45 36<br>
<a href="http://ics.usi.ch/" rel="noreferrer" target="_blank">http://ics.usi.ch/</a>                  **   <a href="mailto:cyrill.von.planta@usi.ch">cyrill.von.planta@usi.ch</a><<wbr>mailto:<a href="mailto:cyrill.von.planta@usi.ch">cyrill.von.planta@usi.<wbr>ch</a>><br>
<br>
On 19 Jan 2017, at 17:03, Barry Smith <<a href="mailto:bsmith@mcs.anl.gov">bsmith@mcs.anl.gov</a>> wrote:<br>
<br>
<br>
  Absurd memory requests ("Memory requested 18446744068029169664") usually mean that 32-bit integers are not large enough for the problem. Try configuring on the Cray with --with-64-bit-indices.<br>
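<br>
  As a side note, that number is exactly (size_t)(-710047744) * sizeof(double): a negative 32-bit count slipping into a malloc-size computation produces precisely this kind of value. A toy illustration (the count is made up, chosen only to reproduce the number in your log):<br>
<br>
#include <stdio.h><br>
#include <stddef.h><br>
<br>
int main(void)<br>
{<br>
  int n = -710047744;                      /* hypothetical overflowed count */<br>
  size_t bytes = (size_t)n * sizeof(double);<br>
  printf("%zu\n", bytes);                  /* prints 18446744068029169664 */<br>
  return 0;<br>
}<br>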
<br>
  Barry<br>
<br>
<br>
On Jan 19, 2017, at 7:14 AM, Cyrill Vonplanta <<a href="mailto:cyrill.von.planta@usi.ch">cyrill.von.planta@usi.ch</a>> wrote:<br>
<br>
Dear PETSc Users,<br>
<br>
<br>
I have a problem with a solver running on a Cray machine that crashes in the call to MatMatMult (see error message below). When I run the same solver on my machine, in serial or in parallel, it runs through, and when I look at it with -malloc_debug there do not seem to be any issues.<br>
<br>
Does anyone have a clue what the cause of this failure could be?<br>
<br>
Best, Cyrill<br>
--<br>
<br>
The line that causes the crash is this:<br>
<br>
ierr = MatMatMult(_O, _interpolations[0], MAT_INITIAL_MATRIX, PETSC_DEFAULT, &mmg->interpolations[mg_levels-2]); CHKERRQ(ierr);<br>
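<br>
For reference, the fourth argument is the fill estimate, i.e. the predicted ratio nnz(C)/(nnz(A)+nnz(B)); PETSC_DEFAULT lets PETSc guess. A variant with an explicit estimate (the 2.0 here is a hypothetical value) would look like:<br>
<br>
ierr = MatMatMult(_O, _interpolations[0], MAT_INITIAL_MATRIX, 2.0, &mmg->interpolations[mg_levels-2]); CHKERRQ(ierr);<br>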
<br>
The error message:<br>
<br>
<br>
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------<br>
[0]PETSC ERROR: Out of memory. This could be due to allocating<br>
[0]PETSC ERROR: too large an object or bleeding by not properly<br>
[0]PETSC ERROR: destroying unneeded objects.<br>
[0]PETSC ERROR: Memory allocated 0 Memory used by process 61852<br>
[0]PETSC ERROR: Try running with -malloc_dump or -malloc_log for info.<br>
[0]PETSC ERROR: Memory requested 18446744068029169664<br>
[0]PETSC ERROR: See <a href="http://www.mcs.anl.gov/petsc/documentation/faq.html" rel="noreferrer" target="_blank">http://www.mcs.anl.gov/petsc/documentation/faq.html</a> for trouble shooting.<br>
[0]PETSC ERROR: Petsc Release Version 3.7.2, Jun, 05, 2016<br>
[0]PETSC ERROR: /scratch/snx3000/studi/./moose-passo-opt on a haswell named nid01137 by studi Thu Jan 19 14:03:27 2017<br>
[0]PETSC ERROR: Configure options --known-has-attribute-aligned=1 --known-mpi-int64_t=0 --known-bits-per-byte=8 --known-sdot-returns-double=0 --known-snrm2-returns-double=0 --known-level1-dcache-assoc=0 --known-level1-dcache-linesize=32 --known-level1-dcache-size=32768 --known-memcmp-ok=1 --known-mpi-c-double-complex=1 --known-mpi-long-double=1 --known-mpi-shared-libraries=0 --known-sizeof-MPI_Comm=4 --known-sizeof-MPI_Fint=4 --known-sizeof-char=1 --known-sizeof-double=8 --known-sizeof-float=4 --known-sizeof-int=4 --known-sizeof-long-long=8 --known-sizeof-long=8 --known-sizeof-short=2 --known-sizeof-size_t=8 --known-sizeof-void-p=8 --with-ar=ar --with-batch=1 --with-cc=cc --with-clib-autodetect=0 --with-cxx=CC --with-cxxlib-autodetect=0 --with-debugging=0 --with-dependencies=0 --with-fc=ftn --with-fortran-datatypes=0 --with-fortran-interfaces=0 --with-fortranlib-autodetect=0 --with-ranlib=ranlib --with-scalar-type=real --with-shared-ld=ar --with-etags=0 --with-dependencies=0 --with-x=0 --with-ssl=0 --with-shared-libraries=0 --with-dependencies=0 --with-mpi-lib="[]" --with-mpi-include="[]" --with-blas-lapack-lib="-L/opt/cray/libsci/13.2.0/GNU/5.1/x86_64/lib -lsci_gnu_mp" --with-superlu=1 --with-superlu-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-superlu-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lsuperlu" --with-superlu_dist=1 --with-superlu_dist-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-superlu_dist-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lsuperlu_dist" --with-parmetis=1 --with-parmetis-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-parmetis-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lparmetis" --with-metis=1 --with-metis-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-metis-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lmetis" --with-ptscotch=1 --with-ptscotch-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-ptscotch-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lptscotch -lscotch -lptscotcherr -lscotcherr" --with-scalapack=1 --with-scalapack-include=/opt/cray/libsci/13.2.0/GNU/5.1/x86_64/include --with-scalapack-lib="-L/opt/cray/libsci/13.2.0/GNU/5.1/x86_64/lib -lsci_gnu_mpi_mp -lsci_gnu_mp" --with-mumps=1 --with-mumps-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-mumps-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lcmumps -ldmumps -lesmumps -lsmumps -lzmumps -lmumps_common -lptesmumps -lpord" --with-hdf5=1 --with-hdf5-include=/opt/cray/hdf5-parallel/1.8.16/GNU/5.1/include --with-hdf5-lib="-L/opt/cray/hdf5-parallel/1.8.16/GNU/5.1/lib -lhdf5_parallel -lz -ldl" --CFLAGS="-march=haswell -fopenmp -O3 -ffast-math -fPIC" --CPPFLAGS= --CXXFLAGS="-march=haswell -fopenmp -O3 -ffast-math -fPIC" --FFLAGS="-march=haswell -fopenmp -O3 -ffast-math -fPIC" --LIBS= --CXX_LINKER_FLAGS= --PETSC_ARCH=haswell --prefix=/opt/cray/pe/petsc/3.7.2.1/real/GNU/5.1/haswell --with-hypre=1 --with-hypre-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-hypre-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lHYPRE" --with-sundials=1 --with-sundials-include=/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/include --with-sundials-lib="-L/opt/cray/tpsl/16.07.1/GNU/5.1/haswell/lib -lsundials_cvode -lsundials_cvodes -lsundials_ida -lsundials_idas -lsundials_kinsol -lsundials_nvecparallel -lsundials_nvecserial"<br>
[0]PETSC ERROR: #1 MatGetBrowsOfAoCols_MPIAIJ() line 4815 in src/mat/impls/aij/mpi/mpiaij.c<br>
[0]PETSC ERROR: #2 MatGetBrowsOfAoCols_MPIAIJ() line 4815 in src/mat/impls/aij/mpi/mpiaij.c<br>
[0]PETSC ERROR: #3 MatMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable() line 198 in src/mat/impls/aij/mpi/mpimatmatmult.c<br>
[0]PETSC ERROR: #4 MatMatMult_MPIAIJ_MPIAIJ() line 34 in src/mat/impls/aij/mpi/mpimatmatmult.c<br>
MMG Setup 30.868420 ms.<br>
[0]PETSC ERROR: #5 MatMatMult() line 9517 in src/mat/interface/matrix.c<br>
[0]PETSC ERROR: #6 MMGSetup() line 85 in /users/studi/src/moose-passo/src/passo/monotone_mg.C<br>
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------<br>
[0]PETSC ERROR: Arguments are incompatible<br>
[0]PETSC ERROR: Incompatible vector local lengths 666 != 10922<br>
[0]PETSC ERROR: See <a href="http://www.mcs.anl.gov/petsc/documentation/faq.html" rel="noreferrer" target="_blank">http://www.mcs.anl.gov/petsc/documentation/faq.html</a> for trouble shooting.<br>
[0]PETSC ERROR: Petsc Release Version 3.7.2, Jun, 05, 2016<br>
[0]PETSC ERROR: /scratch/snx3000/studi/./moose-passo-opt on a haswell named nid01137 by studi Thu Jan 19 14:03:27 2017<br>
[0]PETSC ERROR: Configure options [identical to the configure options listed above]<br>
[0]PETSC ERROR: #7 VecCopy() line 1639 in src/vec/vec/interface/vector.c<br>
Level 1, Presmoothing step 0 ... srun: error: nid01137: task 0: Trace/breakpoint trap<br>
srun: Terminating job step 349949.1<br>
slurmstepd: error: *** STEP 349949.1 ON nid01137 CANCELLED AT 2017-01-19T14:03:32 ***<br>
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.<br>
srun: error: nid01137: task 1: Killed<br>
<br>
<br>
<br>
<br>
</blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature">What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div>
</div></div>