[petsc-users] OOM error while using TSSUNDIALS in PETSc

Satish Balay balay at mcs.anl.gov
Wed Mar 17 09:42:01 CDT 2021


On Wed, 17 Mar 2021, Patrick Sanan wrote:

> 
> > On 17.03.2021 at 15:15, Sanjoy Kumar Mazumder <mazumder at purdue.edu> wrote:
> > 
> > Hi all,
> > 
> > I am trying to solve a set of coupled stiff ODEs in parallel using TSSUNDIALS, with SUNDIALS_BDF selected via 'TSSundialsSetType' in PETSc. I am using a sparse Jacobian matrix of type MATMPIAIJ with no preconditioner. It runs for a long time with a very small timestep (~10^-10 to 10^-8) and then terminates abruptly with the following error:
> > 
> > 'slurmstepd: error: Detected 4 oom-kill event(s) in step 1701844.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.'
> > 
> This general class of problem can arise if there is a (small) memory leak occurring at every time step, so that is the first thing to rule out.
> 
> > After going through some of the common suggestions in the mailing list before, 
> > 
> > 1) I tried increasing the memory allotted per CPU (--mem-per-cpu) in my batch script but the problem still remains.
> When you tried increasing the memory allocated per CPU, did the solver take more timesteps before the OOM error?
> 
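The reasoning behind that question can be sketched as follows. This is only an illustrative model, assuming a constant leak per time step; the function name and all numbers are hypothetical and not taken from the reported run. Under that assumption, the number of steps completed before the OOM kill should grow roughly in proportion to the memory headroom, so a run that dies later with a larger --mem-per-cpu points towards a leak.

```python
def steps_until_oom(limit_bytes, baseline_bytes, leak_per_step_bytes):
    """Steps completed before usage exceeds the limit, for a constant per-step leak.

    Returns None when there is no growth, since this model then never predicts an OOM.
    """
    if leak_per_step_bytes <= 0:
        return None
    headroom = max(limit_bytes - baseline_bytes, 0)
    return headroom // leak_per_step_bytes


GiB = 1024 ** 3
baseline = 1 * GiB   # hypothetical steady-state footprint of one process
leak = 4096          # hypothetical bytes leaked per time step

for limit in (2 * GiB, 4 * GiB):
    print(limit // GiB, "GiB limit:", steps_until_oom(limit, baseline, leak), "steps")
```

If the step count before the kill does not change with the limit, a steady per-step leak becomes less likely and a single large allocation is the more plausible cause.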
> > 2) I have also checked for proper deallocation of the arrays in my function and jacobian sub-routines before every TS iteration.
> Did you confirm this with a tool like valgrind? If not, is it possible for you to run a few time steps of your code on a local machine with valgrind?

If PetscMalloc is used [or PETSc objects are not destroyed], you can check with -malloc_dump
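
Note that -malloc_dump reports at PetscFinalize(), which a job killed by the OOM handler never reaches, so it needs a short run that exits normally (for example, limited with -ts_max_steps). A complementary check is to record the memory in use after each time step, e.g. with PetscMemoryGetCurrentUsage() in a TS monitor, and look at the trend. Below is a minimal sketch of that last step only, assuming the readings have already been collected; the readings shown are made up for illustration, and the helper name is hypothetical.

```python
def leak_per_step(samples):
    """Least-squares slope of memory readings (bytes) against the step index."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


# Hypothetical readings, one per time step, in bytes (not from the reported run).
readings = [1_000_000 + 4096 * i for i in range(10)]
print("estimated growth per step (bytes):", leak_per_step(readings))
```

A slope that stays clearly positive over many steps suggests something is allocated on every step and never freed; a slope near zero suggests looking elsewhere.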

Satish

> 
> > 3) The time allotted for my job in the assigned nodes (wall-time) far exceeds the time for which the job is actually running.
> > 
> > Is there anything I am missing or not doing properly? The complete error output shown after termination is given below.
> > 
> > Thanks
> > 
> > With regards,
> > Sanjoy
> > 
> > --------------------------------------------------------------------------
> > Primary job  terminated normally, but 1 process returned
> > a non-zero exit code. Per user-direction, the job has been aborted.
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > MPI_ABORT was invoked on rank 11 in communicator MPI_COMM_WORLD
> > with errorcode 50176059.
> > 
> > NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> > You may or may not see output from other processes, depending on
> > exactly when Open MPI kills them.
> > --------------------------------------------------------------------------
> > [1]PETSC ERROR: ------------------------------------------------------------------------
> > [1]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
> > [1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> > [1]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> > [1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
> > [1]PETSC ERROR: likely location of problem given in stack below
> > [1]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
> > [1]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> > [1]PETSC ERROR:       INSTEAD the line number of the start of the function
> > [1]PETSC ERROR:       is given.
> > [1]PETSC ERROR: [1] TSStep_Sundials line 121 /home/mazumder/petsc-3.14.5/src/ts/impls/implicit/sundials/sundials.c
> > [1]PETSC ERROR: [1] TSStep line 3736 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
> > [1]PETSC ERROR: [1] TSSolve line 4046 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
> > [1]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
> > [1]PETSC ERROR: Signal received
> > [1]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
> > [1]PETSC ERROR: Petsc Release Version 3.14.5, Mar 03, 2021
> > [1]PETSC ERROR: ./ThO2 on a arch-linux-c-debug named bell-a017.rcac.purdue.edu by mazumder Mon Mar 15 13:26:36 2021
> > [1]PETSC ERROR: Configure options --with-cc-mpicc --with-cxx=mpicxx --with-fc=mpif90 --download-fblaslapack --download-sundials=yes --with-debugging
> > [1]PETSC ERROR: #1 User provided function() line 0 in  unknown file
> > [2]PETSC ERROR: ------------------------------------------------------------------------
> > [2]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
> > [2]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> > [2]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> > [2]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
> > [2]PETSC ERROR: likely location of problem given in stack below
> > [2]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
> > [2]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> > [2]PETSC ERROR:       INSTEAD the line number of the start of the function
> > [2]PETSC ERROR:       is given.
> > [2]PETSC ERROR: [2] TSStep_Sundials line 121 /home/mazumder/petsc-3.14.5/src/ts/impls/implicit/sundials/sundials.c
> > [2]PETSC ERROR: [2] TSStep line 3736 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
> > [2]PETSC ERROR: [2] TSSolve line 4046 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
> > [2]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
> > [2]PETSC ERROR: Signal received
> > [2]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
> > [2]PETSC ERROR: Petsc Release Version 3.14.5, Mar 03, 2021
> > [2]PETSC ERROR: ./ThO2 on a arch-linux-c-debug named bell-a017.rcac.purdue.edu by mazumder Mon Mar 15 13:26:36 2021
> > [2]PETSC ERROR: Configure options --with-cc-mpicc --with-cxx=mpicxx --with-fc=mpif90 --download-fblaslapack --download-sundials=yes --with-debugging
> > [2]PETSC ERROR: #1 User provided function() line 0 in  unknown file
> > [3]PETSC ERROR: ------------------------------------------------------------------------
> > [3]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
> > [3]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> > [3]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> > [3]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
> > [3]PETSC ERROR: likely location of problem given in stack below
> > [3]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
> > [3]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> > [3]PETSC ERROR:       INSTEAD the line number of the start of the function
> > [3]PETSC ERROR:       is given.
> > [3]PETSC ERROR: [3] TSStep_Sundials line 121 /home/mazumder/petsc-3.14.5/src/ts/impls/implicit/sundials/sundials.c
> > [3]PETSC ERROR: [3] TSStep line 3736 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
> > [3]PETSC ERROR: [3] TSSolve line 4046 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
> > [3]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
> > [3]PETSC ERROR: Signal received
> > [3]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
> > [3]PETSC ERROR: Petsc Release Version 3.14.5, Mar 03, 2021
> > [3]PETSC ERROR: ./ThO2 on a arch-linux-c-debug named bell-a017.rcac.purdue.edu by mazumder Mon Mar 15 13:26:36 2021
> > [3]PETSC ERROR: Configure options --with-cc-mpicc --with-cxx=mpicxx --with-fc=mpif90 --download-fblaslapack --download-sundials=yes --with-debugging
> > [3]PETSC ERROR: #1 User provided function() line 0 in  unknown file
> > [4]PETSC ERROR: ------------------------------------------------------------------------
> > [4]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
> > [4]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> > [4]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> > [4]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
> > [4]PETSC ERROR: likely location of problem given in stack below
> > [4]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
> > [4]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> > [4]PETSC ERROR:       INSTEAD the line number of the start of the function
> > [4]PETSC ERROR:       is given.
> > [4]PETSC ERROR: [4] TSStep_Sundials line 121 /home/mazumder/petsc-3.14.5/src/ts/impls/implicit/sundials/sundials.c
> > [4]PETSC ERROR: [4] TSStep line 3736 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
> > [4]PETSC ERROR: [4] TSSolve line 4046 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
> > [4]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
> > [4]PETSC ERROR: Signal received
> > [4]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
> > [4]PETSC ERROR: Petsc Release Version 3.14.5, Mar 03, 2021
> > [4]PETSC ERROR: ./ThO2 on a arch-linux-c-debug named bell-a017.rcac.purdue.edu by mazumder Mon Mar 15 13:26:36 2021
> > [4]PETSC ERROR: Configure options --with-cc-mpicc --with-cxx=mpicxx --with-fc=mpif90 --download-fblaslapack --download-sundials=yes --with-debugging
> > [4]PETSC ERROR: #1 User provided function() line 0 in  unknown file
> > --------------------------------------------------------------------------
> > mpirun noticed that process rank 0 with PID 0 on node bell-a017 exited on signal 9 (Killed).
> > --------------------------------------------------------------------------
> > [bell-a017.rcac.purdue.edu:62310] 62 more processes have sent help message help-mpi-api.txt / mpi-abort
> > [bell-a017.rcac.purdue.edu:62310] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> > slurmstepd: error: Detected 4 oom-kill event(s) in step 1701844.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
> 
> 


