[petsc-users] OOM error while using TSSUNDIALS in PETSc

Patrick Sanan patrick.sanan at gmail.com
Wed Mar 17 09:27:14 CDT 2021

> On 17.03.2021 at 15:15, Sanjoy Kumar Mazumder <mazumder at purdue.edu> wrote:
> 
> Hi all,
> 
> I am trying to solve a set of coupled stiff ODEs in parallel using TSSUNDIALS with SUNDIALS_BDF as 'TSSundialsSetType' in PETSc. I am using a sparse Jacobian matrix of type MATMPIAIJ with no preconditioner. It runs for a long time with a very small timestep (~10^-8 - 10^-10) and then terminates abruptly with the following error:
> 
> 'slurmstepd: error: Detected 4 oom-kill event(s) in step 1701844.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.'
> 
This general class of problem can arise if a (small) memory leak occurs at every time step, so that is the first thing to rule out.
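
A minimal sketch of one way to check for this, in Python. It assumes you can log the resident memory of one rank at each time step (for example by calling PetscMemoryGetCurrentUsage() from a TS monitor) and then fits a straight line to memory versus step number. A steadily positive slope would point to a per-step leak. The function name `leak_rate` and the sample numbers are made up for illustration.

```python
def leak_rate(steps, mem_bytes):
    """Least-squares slope of memory versus step number, in bytes per step."""
    n = len(steps)
    if n < 2 or n != len(mem_bytes):
        raise ValueError("need at least two (step, memory) samples")
    mean_s = sum(steps) / n
    mean_m = sum(mem_bytes) / n
    var_s = sum((s - mean_s) ** 2 for s in steps)
    if var_s == 0:
        raise ValueError("step numbers must not all be equal")
    cov = sum((s - mean_s) * (m - mean_m) for s, m in zip(steps, mem_bytes))
    return cov / var_s


# Hypothetical samples: memory grows by 4096 bytes at every step.
steps = [0, 100, 200, 300]
mem = [1_000_000 + 4096 * s for s in steps]
print(leak_rate(steps, mem))
```

A slope near zero over many steps would instead suggest that memory use is flat and the cause lies elsewhere.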

> After going through some of the common suggestions in the mailing list before, 
> 
> 1) I tried increasing the memory allotted per CPU (--mem-per-cpu) in my batch script but the problem still remains. 
When you tried increasing the memory allocated per CPU, did the solver take more timesteps before the OOM error?
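
The reason for asking: if memory leaks at a roughly constant rate per step, the number of steps completed before the OOM kill should grow roughly in proportion to the memory limit. A small Python sketch of that arithmetic follows. The limits and step counts are hypothetical and do not come from your runs, and the helper name `implied_leak_per_step` is invented here.

```python
def implied_leak_per_step(limit_a, steps_a, limit_b, steps_b):
    """Bytes leaked per step implied by two runs killed at different memory limits.

    Run B must have the larger limit. Returns None when run B did not get
    further than run A, in which case a steady per-step leak is an unlikely
    explanation (a single large allocation would fit better).
    """
    if limit_b <= limit_a:
        raise ValueError("run B must have the larger memory limit")
    if steps_b <= steps_a:
        return None
    return (limit_b - limit_a) / (steps_b - steps_a)


GIB = 1024 ** 3
# Hypothetical: 2 GiB per CPU was killed at step 50_000, 4 GiB at step 100_000.
rate = implied_leak_per_step(2 * GIB, 50_000, 4 * GIB, 100_000)
print(rate)  # estimated bytes leaked per time step
```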

> 2) I have also checked for proper deallocation of the arrays in my function and jacobian sub-routines before every TS iteration.
Did you confirm this with a tool like valgrind? If not, is it possible for you to run a few time steps of your code on a local machine with valgrind?

> 3) The time allotted for my job in the assigned nodes (wall-time) far exceeds the time for which the job is actually running.
> 
> Is there anything I am missing out or not doing properly? Given below is the complete error that is showing up after the termination.
> 
> Thanks
> 
> With regards,
> Sanjoy
> 
> --------------------------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 11 in communicator MPI_COMM_WORLD
> with errorcode 50176059.
> 
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> [1]PETSC ERROR: ------------------------------------------------------------------------
> [1]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
> [1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [1]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> [1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
> [1]PETSC ERROR: likely location of problem given in stack below
> [1]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
> [1]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> [1]PETSC ERROR:       INSTEAD the line number of the start of the function
> [1]PETSC ERROR:       is given.
> [1]PETSC ERROR: [1] TSStep_Sundials line 121 /home/mazumder/petsc-3.14.5/src/ts/impls/implicit/sundials/sundials.c
> [1]PETSC ERROR: [1] TSStep line 3736 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
> [1]PETSC ERROR: [1] TSSolve line 4046 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
> [1]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
> [1]PETSC ERROR: Signal received
> [1]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
> [1]PETSC ERROR: Petsc Release Version 3.14.5, Mar 03, 2021
> [1]PETSC ERROR: ./ThO2 on a arch-linux-c-debug named bell-a017.rcac.purdue.edu by mazumder Mon Mar 15 13:26:36 2021
> [1]PETSC ERROR: Configure options --with-cc-mpicc --with-cxx=mpicxx --with-fc=mpif90 --download-fblaslapack --download-sundials=yes --with-debugging
> [1]PETSC ERROR: #1 User provided function() line 0 in  unknown file
> [2]PETSC ERROR: ------------------------------------------------------------------------
> [2]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
> [2]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [2]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> [2]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
> [2]PETSC ERROR: likely location of problem given in stack below
> [2]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
> [2]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> [2]PETSC ERROR:       INSTEAD the line number of the start of the function
> [2]PETSC ERROR:       is given.
> [2]PETSC ERROR: [2] TSStep_Sundials line 121 /home/mazumder/petsc-3.14.5/src/ts/impls/implicit/sundials/sundials.c
> [2]PETSC ERROR: [2] TSStep line 3736 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
> [2]PETSC ERROR: [2] TSSolve line 4046 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
> [2]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
> [2]PETSC ERROR: Signal received
> [2]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
> [2]PETSC ERROR: Petsc Release Version 3.14.5, Mar 03, 2021
> [2]PETSC ERROR: ./ThO2 on a arch-linux-c-debug named bell-a017.rcac.purdue.edu by mazumder Mon Mar 15 13:26:36 2021
> [2]PETSC ERROR: Configure options --with-cc-mpicc --with-cxx=mpicxx --with-fc=mpif90 --download-fblaslapack --download-sundials=yes --with-debugging
> [2]PETSC ERROR: #1 User provided function() line 0 in  unknown file
> [3]PETSC ERROR: ------------------------------------------------------------------------
> [3]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
> [3]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [3]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> [3]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
> [3]PETSC ERROR: likely location of problem given in stack below
> [3]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
> [3]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> [3]PETSC ERROR:       INSTEAD the line number of the start of the function
> [3]PETSC ERROR:       is given.
> [3]PETSC ERROR: [3] TSStep_Sundials line 121 /home/mazumder/petsc-3.14.5/src/ts/impls/implicit/sundials/sundials.c
> [3]PETSC ERROR: [3] TSStep line 3736 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
> [3]PETSC ERROR: [3] TSSolve line 4046 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
> [3]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
> [3]PETSC ERROR: Signal received
> [3]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
> [3]PETSC ERROR: Petsc Release Version 3.14.5, Mar 03, 2021
> [3]PETSC ERROR: ./ThO2 on a arch-linux-c-debug named bell-a017.rcac.purdue.edu by mazumder Mon Mar 15 13:26:36 2021
> [3]PETSC ERROR: Configure options --with-cc-mpicc --with-cxx=mpicxx --with-fc=mpif90 --download-fblaslapack --download-sundials=yes --with-debugging
> [3]PETSC ERROR: #1 User provided function() line 0 in  unknown file
> [4]PETSC ERROR: ------------------------------------------------------------------------
> [4]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
> [4]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [4]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> [4]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
> [4]PETSC ERROR: likely location of problem given in stack below
> [4]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
> [4]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> [4]PETSC ERROR:       INSTEAD the line number of the start of the function
> [4]PETSC ERROR:       is given.
> [4]PETSC ERROR: [4] TSStep_Sundials line 121 /home/mazumder/petsc-3.14.5/src/ts/impls/implicit/sundials/sundials.c
> [4]PETSC ERROR: [4] TSStep line 3736 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
> [4]PETSC ERROR: [4] TSSolve line 4046 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
> [4]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
> [4]PETSC ERROR: Signal received
> [4]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
> [4]PETSC ERROR: Petsc Release Version 3.14.5, Mar 03, 2021
> [4]PETSC ERROR: ./ThO2 on a arch-linux-c-debug named bell-a017.rcac.purdue.edu by mazumder Mon Mar 15 13:26:36 2021
> [4]PETSC ERROR: Configure options --with-cc-mpicc --with-cxx=mpicxx --with-fc=mpif90 --download-fblaslapack --download-sundials=yes --with-debugging
> [4]PETSC ERROR: #1 User provided function() line 0 in  unknown file
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 0 on node bell-a017 exited on signal 9 (Killed).
> --------------------------------------------------------------------------
> [bell-a017.rcac.purdue.edu:62310] 62 more processes have sent help message help-mpi-api.txt / mpi-abort
> [bell-a017.rcac.purdue.edu:62310] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> slurmstepd: error: Detected 4 oom-kill event(s) in step 1701844.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
