[petsc-users] OOM error while using TSSUNDIALS in PETSc

Sanjoy Kumar Mazumder mazumder at purdue.edu
Wed Mar 17 09:15:27 CDT 2021


Hi all,

I am trying to solve a set of coupled stiff ODEs in parallel using TSSUNDIALS in PETSc, with the integrator type set to SUNDIALS_BDF through TSSundialsSetType. I am using a sparse Jacobian matrix of type MATMPIAIJ with no preconditioner. The job runs for a long time with a very small timestep (~10^-8 to 10^-10) and then terminates abruptly with the following error:

'slurmstepd: error: Detected 4 oom-kill event(s) in step 1701844.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.'

After going through some of the common suggestions made earlier on the mailing list:

1) I tried increasing the memory allotted per CPU (--mem-per-cpu) in my batch script, but the problem remains.
2) I have also checked that the arrays in my function and Jacobian subroutines are properly deallocated before every TS iteration.
3) The time allotted for my job on the assigned nodes (wall time) far exceeds the time for which the job actually runs.
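
Point 2 can also be checked empirically rather than by inspection: if the resident memory of a rank rises steadily with the step count, something is accumulating per step. Below is a minimal sketch in Python, assuming a Linux compute node where /proc/<pid>/status is readable. The function names are hypothetical and not part of PETSc or SUNDIALS; PETSc's own -memory_view option and PetscMemoryGetCurrentUsage() report similar information from inside the program.

```python
import os
import re


def rss_kib_from_status(status_text):
    """Parse the VmRSS value (KiB) from the text of /proc/<pid>/status.

    Returns None when the field is absent.
    """
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def current_rss_kib(pid="self"):
    """Read the resident set size of a process; None if /proc is unavailable."""
    path = f"/proc/{pid}/status"
    if not os.path.exists(path):
        return None
    with open(path) as handle:
        return rss_kib_from_status(handle.read())


def growth_per_step(samples):
    """Least-squares slope of RSS (KiB) against step index.

    `samples` is a list of (step, rss_kib) pairs. A slope that stays clearly
    positive over many steps suggests memory is accumulating per step.
    """
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = sum(step for step, _ in samples) / n
    mean_y = sum(rss for _, rss in samples) / n
    sxx = sum((step - mean_x) ** 2 for step, _ in samples)
    sxy = sum((step - mean_x) * (rss - mean_y) for step, rss in samples)
    return sxy / sxx if sxx else 0.0
```

To use it, one would sample current_rss_kib(pid) for the PID of each MPI rank at regular intervals on the compute node, pair each reading with the step count printed by the solver, and pass the pairs to growth_per_step.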

Is there anything I am missing or not doing properly? The complete error output printed after termination is given below.

Thanks

With regards,
Sanjoy

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 11 in communicator MPI_COMM_WORLD
with errorcode 50176059.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[1]PETSC ERROR: ------------------------------------------------------------------------
[1]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[1]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[1]PETSC ERROR: likely location of problem given in stack below
[1]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
[1]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[1]PETSC ERROR:       INSTEAD the line number of the start of the function
[1]PETSC ERROR:       is given.
[1]PETSC ERROR: [1] TSStep_Sundials line 121 /home/mazumder/petsc-3.14.5/src/ts/impls/implicit/sundials/sundials.c
[1]PETSC ERROR: [1] TSStep line 3736 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
[1]PETSC ERROR: [1] TSSolve line 4046 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
[1]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[1]PETSC ERROR: Signal received
[1]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[1]PETSC ERROR: Petsc Release Version 3.14.5, Mar 03, 2021
[1]PETSC ERROR: ./ThO2 on a arch-linux-c-debug named bell-a017.rcac.purdue.edu by mazumder Mon Mar 15 13:26:36 2021
[1]PETSC ERROR: Configure options --with-cc-mpicc --with-cxx=mpicxx --with-fc=mpif90 --download-fblaslapack --download-sundials=yes --with-debugging
[1]PETSC ERROR: #1 User provided function() line 0 in  unknown file
[2]PETSC ERROR: ------------------------------------------------------------------------
[2]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[2]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[2]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[2]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[2]PETSC ERROR: likely location of problem given in stack below
[2]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
[2]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[2]PETSC ERROR:       INSTEAD the line number of the start of the function
[2]PETSC ERROR:       is given.
[2]PETSC ERROR: [2] TSStep_Sundials line 121 /home/mazumder/petsc-3.14.5/src/ts/impls/implicit/sundials/sundials.c
[2]PETSC ERROR: [2] TSStep line 3736 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
[2]PETSC ERROR: [2] TSSolve line 4046 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
[2]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[2]PETSC ERROR: Signal received
[2]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[2]PETSC ERROR: Petsc Release Version 3.14.5, Mar 03, 2021
[2]PETSC ERROR: ./ThO2 on a arch-linux-c-debug named bell-a017.rcac.purdue.edu by mazumder Mon Mar 15 13:26:36 2021
[2]PETSC ERROR: Configure options --with-cc-mpicc --with-cxx=mpicxx --with-fc=mpif90 --download-fblaslapack --download-sundials=yes --with-debugging
[2]PETSC ERROR: #1 User provided function() line 0 in  unknown file
[3]PETSC ERROR: ------------------------------------------------------------------------
[3]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[3]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[3]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[3]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[3]PETSC ERROR: likely location of problem given in stack below
[3]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
[3]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[3]PETSC ERROR:       INSTEAD the line number of the start of the function
[3]PETSC ERROR:       is given.
[3]PETSC ERROR: [3] TSStep_Sundials line 121 /home/mazumder/petsc-3.14.5/src/ts/impls/implicit/sundials/sundials.c
[3]PETSC ERROR: [3] TSStep line 3736 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
[3]PETSC ERROR: [3] TSSolve line 4046 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
[3]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[3]PETSC ERROR: Signal received
[3]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[3]PETSC ERROR: Petsc Release Version 3.14.5, Mar 03, 2021
[3]PETSC ERROR: ./ThO2 on a arch-linux-c-debug named bell-a017.rcac.purdue.edu by mazumder Mon Mar 15 13:26:36 2021
[3]PETSC ERROR: Configure options --with-cc-mpicc --with-cxx=mpicxx --with-fc=mpif90 --download-fblaslapack --download-sundials=yes --with-debugging
[3]PETSC ERROR: #1 User provided function() line 0 in  unknown file
[4]PETSC ERROR: ------------------------------------------------------------------------
[4]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[4]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[4]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[4]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[4]PETSC ERROR: likely location of problem given in stack below
[4]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
[4]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[4]PETSC ERROR:       INSTEAD the line number of the start of the function
[4]PETSC ERROR:       is given.
[4]PETSC ERROR: [4] TSStep_Sundials line 121 /home/mazumder/petsc-3.14.5/src/ts/impls/implicit/sundials/sundials.c
[4]PETSC ERROR: [4] TSStep line 3736 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
[4]PETSC ERROR: [4] TSSolve line 4046 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
[4]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[4]PETSC ERROR: Signal received
[4]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[4]PETSC ERROR: Petsc Release Version 3.14.5, Mar 03, 2021
[4]PETSC ERROR: ./ThO2 on a arch-linux-c-debug named bell-a017.rcac.purdue.edu by mazumder Mon Mar 15 13:26:36 2021
[4]PETSC ERROR: Configure options --with-cc-mpicc --with-cxx=mpicxx --with-fc=mpif90 --download-fblaslapack --download-sundials=yes --with-debugging
[4]PETSC ERROR: #1 User provided function() line 0 in  unknown file
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node bell-a017 exited on signal 9 (Killed).
--------------------------------------------------------------------------
[bell-a017.rcac.purdue.edu:62310] 62 more processes have sent help message help-mpi-api.txt / mpi-abort
[bell-a017.rcac.purdue.edu:62310] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
slurmstepd: error: Detected 4 oom-kill event(s) in step 1701844.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
