[petsc-dev] Error on large problems.

Mark Adams mfadams at lbl.gov
Sat Mar 7 13:21:29 CST 2015


I seem to be getting this error on Edison with 128K cores and ~4 billion
equations.  I've seen this error several times.  I've attached recent
output from one such run.  I wonder if it is an integer overflow.  This was
built with 64-bit integers, but I notice that GAMG prints out N and I see
N=0 for the finest level.
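
As a quick sanity check on the overflow idea, here is a small sketch (not taken from the GAMG source) of what a global row count looks like if it is narrowed to a 32-bit signed int somewhere along the way. The `rows_per_rank` value is hypothetical, chosen so the total is exactly 2**32; the post only says "~4 billion". The helper `as_int32` is also just an illustration. A printed N=0 would be consistent with this kind of wraparound, but on its own it does not show that the crash has the same cause.

```python
import ctypes


def as_int32(n: int) -> int:
    """Value n would take if stored in a 32-bit signed C int (wraps modulo 2**32)."""
    return ctypes.c_int32(n).value


np_ranks = 131072         # np reported in the attached log
rows_per_rank = 32768     # hypothetical: makes the global size exactly 2**32
N = np_ranks * rows_per_rank  # Python integers do not overflow

print(N)            # 4294967296
print(as_int32(N))  # 0
```

The coarse-level size in the log, 88465566, fits comfortably in 32 bits, which would explain why only the finest level shows a suspicious value.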

Mark
-------------- next part --------------
	[0]PCSetFromOptions_GAMG threshold set -5.000000e-03
[0] FAS solver |r|=2.36953, |r|/|b|=0.0148153
	[0]PCSetUp_GAMG level 0 N=0, n data rows=1, n data cols=1, nnz/row (ave)=26, np=131072
	[0]PCGAMGFilterGraph 100% nnz after filtering, with threshold -0.005, 26.9649 nnz ave. (N=0)
[0]PCGAMGCoarsen_AGG square graph
[0]PCGAMGCoarsen_AGG coarsen graph
	[0]maxIndSetAgg removed 0 of 0 vertices. (0 local)  88465566 selected.
		[0]PCGAMGProlongator_AGG New grid 88465566 nodes
			PCGAMGOptprol_AGG smooth P0: max eigen=1.408912e+00 min=1.563660e-03 PC=jacobi
		[0]PCSetUp_GAMG 1) N=88465566, n data cols=1, nnz/row (ave)=46, 131072 active pes
	[0]PCGAMGFilterGraph 100% nnz after filtering, with threshold -0.005, 46.2266 nnz ave. (N=88465566)
[0]PCGAMGCoarsen_AGG square graph
MPICH2 ERROR [Rank 66213] [job id 10600361] [Fri Mar  6 04:01:56 2015] [c6-1c2s12n2] [nid02866] - MPID_nem_gni_check_localCQ(): GNI_CQ_EVENT_TYPE_POST had error (SOURCE_SSID:AT_MDD_INV:CPLTN_SRSP)
[66213]PETSC ERROR: #1 MatTransposeMatMultNumeric_MPIAIJ_MPIAIJ() line 1535 in /global/u2/m/madams/petsc_maint/src/mat/impls/aij/mpi/mpimatmatmult.c
[66213]PETSC ERROR: #2 MatTransposeMatMult_MPIAIJ_MPIAIJ() line 887 in /global/u2/m/madams/petsc_maint/src/mat/impls/aij/mpi/mpimatmatmult.c
[66213]PETSC ERROR: #3 MatTransposeMatMult() line 8977 in /global/u2/m/madams/petsc_maint/src/mat/interface/matrix.c
[66213]PETSC ERROR: #4 PCGAMGCoarsen_AGG() line 991 in /global/u2/m/madams/petsc_maint/src/ksp/pc/impls/gamg/agg.c
[66213]PETSC ERROR: #5 PCSetUp_GAMG() line 596 in /global/u2/m/madams/petsc_maint/src/ksp/pc/impls/gamg/gamg.c
[66213]PETSC ERROR: #6 PCSetUp() line 902 in /global/u2/m/madams/petsc_maint/src/ksp/pc/interface/precon.c
[66213]PETSC ERROR: #7 KSPSetUp() line 306 in /global/u2/m/madams/petsc_maint/src/ksp/ksp/interface/itfunc.c
[66213]PETSC ERROR: #8 KSPSolve() line 418 in /global/u2/m/madams/petsc_maint/src/ksp/ksp/interface/itfunc.c
MPICH2 ERROR [Rank 65378] [job id 10600361] [Fri Mar  6 04:01:56 2015] [c6-1c2s3n2] [nid02830] - MPID_nem_gni_check_localCQ(): GNI_CQ_EVENT_TYPE_POST had error (SOURCE_SSID:AT_MDD_INV:CPLTN_SRSP)
[65378]PETSC ERROR: #1 MatTransposeMatMultNumeric_MPIAIJ_MPIAIJ() line 1535 in /global/u2/m/madams/petsc_maint/src/mat/impls/aij/mpi/mpimatmatmult.c
[65378]PETSC ERROR: #2 MatTransposeMatMult_MPIAIJ_MPIAIJ() line 887 in /global/u2/m/madams/petsc_maint/src/mat/impls/aij/mpi/mpimatmatmult.c
[65378]PETSC ERROR: #3 MatTransposeMatMult() line 8977 in /global/u2/m/madams/petsc_maint/src/mat/interface/matrix.c
[65378]PETSC ERROR: #4 PCGAMGCoarsen_AGG() line 991 in /global/u2/m/madams/petsc_maint/src/ksp/pc/impls/gamg/agg.c
[65378]PETSC ERROR: #5 PCSetUp_GAMG() line 596 in /global/u2/m/madams/petsc_maint/src/ksp/pc/impls/gamg/gamg.c
[65378]PETSC ERROR: #6 PCSetUp() line 902 in /global/u2/m/madams/petsc_maint/src/ksp/pc/interface/precon.c
[65378]PETSC ERROR: #7 KSPSetUp() line 306 in /global/u2/m/madams/petsc_maint/src/ksp/ksp/interface/itfunc.c
[65378]PETSC ERROR: #8 KSPSolve() line 418 in /global/u2/m/madams/petsc_maint/src/ksp/ksp/interface/itfunc.c
aprun: Apid 10600361: Caught signal Terminated, sending to application
aprun: Apid 10600361: Caught signal Terminated, sending to application
[98520]PETSC ERROR: [98521]PETSC ERROR: ------------------------------------------------------------------------
------------------------------------------------------------------------
[98520]PETSC ERROR: [98521]PETSC ERROR: [98522]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[98523]PETSC ERROR: [98521]PETSC ERROR: ------------------------------------------------------------------------
[98520]PETSC ERROR: ------------------------------------------------------------------------
[98522]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
Try option -start_in_debugger or -on_error_attach_debugger
[98520]PETSC ERROR: [98523]PETSC ERROR: [98521]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[98524]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[98522]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
Try option -start_in_debugger or -on_error_attach_debugger
Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[98521]PETSC ERROR: [98520]PETSC ERROR: [98522]PETSC ERROR: [98523]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
Try option -start_in_debugger or -on_error_attach_debugger
[98522]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[98523]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[98520]PETSC ERROR: ------------------------------------------------------------------------
configure using --with-debugging=yes, recompile, link, and run 
[98522]PETSC ERROR: [98520]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
[98526]PETSC ERROR: [98522]PETSC ERROR: [98525]PETSC ERROR: to get more inf[78000]PETSC ERROR: ------------------------------------------------------------------------
[78001]PETSC ERROR: [78000]PETSC ERROR: ------------------------------------------------------------------------
[78002]PETSC ERROR: [78001]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
------------------------------------------------------------------------
Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[78002]PETSC ERROR: [78001]PETSC ERROR: [78003]PETSC ERROR: [78000]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
Try option -start_in_debugger or -on_error_attach_debugger
[78002]PETSC ERROR: ------------------------------------------------------------------------
[78004]PETSC ERROR: [78003]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
------------------------------------------------------------------------
[78002]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[78001]PETSC ERROR: [78000]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[78002]PETSC ERROR: [78000]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[78001]PETSC ERROR: [78004]PETSC ERROR: [78005]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[78002]PETSC ERROR: [78000]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
[78001]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[78002][65688]PETSC ERROR: [65689]PETSC ERROR: ------------------------------------------------------------------------
------------------------------------------------------------------------
[65690]PETSC ERROR: [65689]PETSC ERROR: [65688]PETSC ERROR: ------------------------------------------------------------------------
Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[65690]PETSC ERROR: [65691]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[65690]PETSC ERROR: ------------------------------------------------------------------------
[65689]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
Try option -start_in_debugger or -on_error_attach_debugger
[65692]PETSC ERROR: [65691]PETSC ERROR: [65690]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
------------------------------------------------------------------------
[65691]PETSC ERROR: [65690]PETSC ERROR: [65692]PETSC ERROR: [65688]PETSC ERROR: [65693]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[65692]PETSC ERROR: [65690]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
configure using --with-debugging=yes, recompile, link, and run 
Try option -start_in_debugger or -on_error_attach_debugger
[65688]PETSC ERROR: [65690]PETSC ERROR: [65694]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[65691]PETSC ERROR: [65688]PETSC ERROR: ------------------------------------------------------------------------
or see http://www.mcs.anl.gov/petsc/documentatio[61584]PETSC ERROR: ------------------------------------------------------------------------
[61585]PETSC ERROR: [61584]PETSC ERROR: ------------------------------------------------------------------------
[61585]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[61586]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[61584]PETSC ERROR: ------------------------------------------------------------------------
Try option -start_in_debugger or -on_error_attach_debugger
[61585]PETSC ERROR: [61586]PETSC ERROR: [61587]PETSC ERROR: [61584]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[61586]PETSC ERROR: ------------------------------------------------------------------------
Try option -start_in_debugger or -on_error_attach_debugger
[61587]PETSC ERROR: [61584]PETSC ERROR: [61585]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
Try option -start_in_debugger or -on_error_attach_debugger
or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[61586]PETSC ERROR: [61587]PETSC ERROR: [61584]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[61585]PETSC ERROR: [61586]PETSC ERROR: [61588]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
configure using --with-debugging=yes, recompile, link, and run 
or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[61587]PETSC ERROR: [61584]PETSC ERROR: [61585]PETSC ERROR: [61586]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
configure using --with-debugging[53376]PETSC ERROR: ------------------------------------------------------------------------
[53377]PETSC ERROR: [53376]PETSC ERROR: ------------------------------------------------------------------------
Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[53377]PETSC ERROR: [53378]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[53376]PETSC ERROR: [53377]PETSC ERROR: ------------------------------------------------------------------------
[53379]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[53378]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[53376]PETSC ERROR: ------------------------------------------------------------------------
[53377]PETSC ERROR: [53379]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[53380]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[53378]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[53376]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[53379]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[53377]PETSC ERROR: [53378]PETSC ERROR: [53381]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[53379]PETSC ERROR: [53378]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[53376]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
------------------------------------------------------------------------
configure using --with-debugging=yes, recompile, link, and run [45168]PETSC ERROR: ------------------------------------------------------------------------
[45169]PETSC ERROR: [45168]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
------------------------------------------------------------------------
[45168]PETSC ERROR: [45170]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[45169]PETSC ERROR: [45168]PETSC ERROR: ------------------------------------------------------------------------
or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[45170]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[45168]PETSC ERROR: [45170]PETSC ERROR: [45171]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
Try option -start_in_debugger or -on_error_attach_debugger
[45168]PETSC ERROR: ------------------------------------------------------------------------
[45170]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
[45171]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[45170]PETSC ERROR: [45172]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[45171]PETSC ERROR: [45168]PETSC ERROR: ------------------------------------------------------------------------
to get more information on the crash.
[45170]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[45172]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
[45171]PETSC ERROR: [45170]PETSC ERROR: [45173]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
to get more information on the crash.
[45171]PETSC ERROR: --------------------------------------[41064]PETSC ERROR: ------------------------------------------------------------------------
[41065]PETSC ERROR: [41064]PETSC ERROR: ------------------------------------------------------------------------
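
Only two ranks in the output above (66213 and 65378) report a PETSc stack; everything after the aprun messages is signal-15 shutdown noise from the other ranks. A minimal sketch for pulling just the stack frames out of interleaved output like this, assuming the frame lines keep the `[rank]PETSC ERROR: #k Function() line N in path` shape seen above (the function name `stack_frames` and the regular expression are illustrative, not part of PETSc):

```python
import re

# Matches lines such as:
# [66213]PETSC ERROR: #1 MatTransposeMatMultNumeric_MPIAIJ_MPIAIJ() line 1535 in /path/file.c
FRAME = re.compile(r"\[(\d+)\]PETSC ERROR: #(\d+) (\w+)\(\) line (\d+) in (\S+)")


def stack_frames(log_text):
    """Return {rank: [(depth, function, line), ...]} for ranks that printed a stack."""
    frames = {}
    for line in log_text.splitlines():
        m = FRAME.match(line.strip())
        if m:
            rank, depth, func, lineno, _path = m.groups()
            frames.setdefault(int(rank), []).append((int(depth), func, int(lineno)))
    return frames


sample = """\
[66213]PETSC ERROR: #1 MatTransposeMatMultNumeric_MPIAIJ_MPIAIJ() line 1535 in /global/u2/m/madams/petsc_maint/src/mat/impls/aij/mpi/mpimatmatmult.c
[66213]PETSC ERROR: #2 MatTransposeMatMult_MPIAIJ_MPIAIJ() line 887 in /global/u2/m/madams/petsc_maint/src/mat/impls/aij/mpi/mpimatmatmult.c
[98520]PETSC ERROR: [98521]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
"""

for rank, stack in stack_frames(sample).items():
    print(rank, stack[0])
```

Lines whose prefix is followed by another rank tag or by the signal-15 text do not match the pattern, so the shutdown messages are dropped.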

