[petsc-users] OOM error while using TSSUNDIALS in PETSc

Matthew Knepley knepley at gmail.com
Wed Mar 17 09:51:30 CDT 2021


On Wed, Mar 17, 2021 at 10:15 AM Sanjoy Kumar Mazumder <mazumder at purdue.edu>
wrote:

> Hi all,
>
> I am trying to solve a set of coupled stiff ODEs in parallel using
> TSSUNDIALS, with the type set to SUNDIALS_BDF via TSSundialsSetType(). I am
> using a sparse Jacobian matrix of type MATMPIAIJ with no preconditioner.
>

What is the convergence of your linear solver like? You can see this using

  -ksp_converged_reason
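
As a rough sketch, the option goes on the command line after the executable
name. The process count below is only a placeholder, and ./ThO2 is the
executable name taken from the error log further down; neither is a
recommendation. The snippet prints the command instead of launching it, so it
can be checked first and then pasted into the batch script:

```shell
#!/bin/sh
# Sketch only: NP is a placeholder value, not the count used in the original job.
NP=4
# Executable name as it appears in the PETSc error log of this thread.
APP=./ThO2
# PETSc picks up runtime options given after the executable name.
CMD="mpirun -n $NP $APP -ksp_converged_reason"
# Print the assembled command for inspection.
echo "$CMD"
```

Running the printed command should report, for each linear solve PETSc's KSP
performs, whether it converged or diverged and after how many iterations.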

  Thanks,

     Matt


> It runs for a long time with a very small timestep (~10^-8 - 10^-10) and
> then terminates abruptly with the following error:
>
> 'slurmstepd: error: Detected 4 oom-kill event(s) in step 1701844.batch
> cgroup. Some of your processes may have been killed by the cgroup
> out-of-memory handler.'
>
> After going through some of the common suggestions made earlier on the
> mailing list:
>
> 1) I tried increasing the memory allotted per CPU (--mem-per-cpu) in my
> batch script, but the problem remains.
> 2) I have also checked that the arrays in my function and Jacobian
> subroutines are properly deallocated before every TS iteration.
> 3) The time allotted for my job on the assigned nodes (wall time) far
> exceeds the time for which the job actually runs.
>
> Is there anything I am missing or not doing properly? The complete error
> output that appears after termination is given below.
>
> Thanks
>
> With regards,
> Sanjoy
>
> --------------------------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 11 in communicator MPI_COMM_WORLD
> with errorcode 50176059.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> [1]PETSC ERROR:
> ------------------------------------------------------------------------
> [1]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the
> batch system) has told this process to end
> [1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [1]PETSC ERROR: or see
> https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> [1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS
> X to find memory corruption errors
> [1]PETSC ERROR: likely location of problem given in stack below
> [1]PETSC ERROR: ---------------------  Stack Frames
> ------------------------------------
> [1]PETSC ERROR: Note: The EXACT line numbers in the stack are not
> available,
> [1]PETSC ERROR:       INSTEAD the line number of the start of the function
> [1]PETSC ERROR:       is given.
> [1]PETSC ERROR: [1] TSStep_Sundials line 121
> /home/mazumder/petsc-3.14.5/src/ts/impls/implicit/sundials/sundials.c
> [1]PETSC ERROR: [1] TSStep line 3736
> /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
> [1]PETSC ERROR: [1] TSSolve line 4046
> /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
> [1]PETSC ERROR: --------------------- Error Message
> --------------------------------------------------------------
> [1]PETSC ERROR: Signal received
> [1]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html
> for trouble shooting.
> [1]PETSC ERROR: Petsc Release Version 3.14.5, Mar 03, 2021
> [1]PETSC ERROR: ./ThO2 on a arch-linux-c-debug named
> bell-a017.rcac.purdue.edu by mazumder Mon Mar 15 13:26:36 2021
> [1]PETSC ERROR: Configure options --with-cc-mpicc --with-cxx=mpicxx
> --with-fc=mpif90 --download-fblaslapack --download-sundials=yes
> --with-debugging
> [1]PETSC ERROR: #1 User provided function() line 0 in  unknown file
> [2]PETSC ERROR:
> ------------------------------------------------------------------------
> [2]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the
> batch system) has told this process to end
> [2]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [2]PETSC ERROR: or see
> https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> [2]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS
> X to find memory corruption errors
> [2]PETSC ERROR: likely location of problem given in stack below
> [2]PETSC ERROR: ---------------------  Stack Frames
> ------------------------------------
> [2]PETSC ERROR: Note: The EXACT line numbers in the stack are not
> available,
> [2]PETSC ERROR:       INSTEAD the line number of the start of the function
> [2]PETSC ERROR:       is given.
> [2]PETSC ERROR: [2] TSStep_Sundials line 121
> /home/mazumder/petsc-3.14.5/src/ts/impls/implicit/sundials/sundials.c
> [2]PETSC ERROR: [2] TSStep line 3736
> /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
> [2]PETSC ERROR: [2] TSSolve line 4046
> /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
> [2]PETSC ERROR: --------------------- Error Message
> --------------------------------------------------------------
> [2]PETSC ERROR: Signal received
> [2]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html
> for trouble shooting.
> [2]PETSC ERROR: Petsc Release Version 3.14.5, Mar 03, 2021
> [2]PETSC ERROR: ./ThO2 on a arch-linux-c-debug named
> bell-a017.rcac.purdue.edu by mazumder Mon Mar 15 13:26:36 2021
> [2]PETSC ERROR: Configure options --with-cc-mpicc --with-cxx=mpicxx
> --with-fc=mpif90 --download-fblaslapack --download-sundials=yes
> --with-debugging
> [2]PETSC ERROR: #1 User provided function() line 0 in  unknown file
> [3]PETSC ERROR:
> ------------------------------------------------------------------------
> [3]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the
> batch system) has told this process to end
> [3]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [3]PETSC ERROR: or see
> https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> [3]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS
> X to find memory corruption errors
> [3]PETSC ERROR: likely location of problem given in stack below
> [3]PETSC ERROR: ---------------------  Stack Frames
> ------------------------------------
> [3]PETSC ERROR: Note: The EXACT line numbers in the stack are not
> available,
> [3]PETSC ERROR:       INSTEAD the line number of the start of the function
> [3]PETSC ERROR:       is given.
> [3]PETSC ERROR: [3] TSStep_Sundials line 121
> /home/mazumder/petsc-3.14.5/src/ts/impls/implicit/sundials/sundials.c
> [3]PETSC ERROR: [3] TSStep line 3736
> /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
> [3]PETSC ERROR: [3] TSSolve line 4046
> /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
> [3]PETSC ERROR: --------------------- Error Message
> --------------------------------------------------------------
> [3]PETSC ERROR: Signal received
> [3]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html
> for trouble shooting.
> [3]PETSC ERROR: Petsc Release Version 3.14.5, Mar 03, 2021
> [3]PETSC ERROR: ./ThO2 on a arch-linux-c-debug named
> bell-a017.rcac.purdue.edu by mazumder Mon Mar 15 13:26:36 2021
> [3]PETSC ERROR: Configure options --with-cc-mpicc --with-cxx=mpicxx
> --with-fc=mpif90 --download-fblaslapack --download-sundials=yes
> --with-debugging
> [3]PETSC ERROR: #1 User provided function() line 0 in  unknown file
> [4]PETSC ERROR:
> ------------------------------------------------------------------------
> [4]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the
> batch system) has told this process to end
> [4]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [4]PETSC ERROR: or see
> https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> [4]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS
> X to find memory corruption errors
> [4]PETSC ERROR: likely location of problem given in stack below
> [4]PETSC ERROR: ---------------------  Stack Frames
> ------------------------------------
> [4]PETSC ERROR: Note: The EXACT line numbers in the stack are not
> available,
> [4]PETSC ERROR:       INSTEAD the line number of the start of the function
> [4]PETSC ERROR:       is given.
> [4]PETSC ERROR: [4] TSStep_Sundials line 121
> /home/mazumder/petsc-3.14.5/src/ts/impls/implicit/sundials/sundials.c
> [4]PETSC ERROR: [4] TSStep line 3736
> /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
> [4]PETSC ERROR: [4] TSSolve line 4046
> /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c
> [4]PETSC ERROR: --------------------- Error Message
> --------------------------------------------------------------
> [4]PETSC ERROR: Signal received
> [4]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html
> for trouble shooting.
> [4]PETSC ERROR: Petsc Release Version 3.14.5, Mar 03, 2021
> [4]PETSC ERROR: ./ThO2 on a arch-linux-c-debug named
> bell-a017.rcac.purdue.edu by mazumder Mon Mar 15 13:26:36 2021
> [4]PETSC ERROR: Configure options --with-cc-mpicc --with-cxx=mpicxx
> --with-fc=mpif90 --download-fblaslapack --download-sundials=yes
> --with-debugging
> [4]PETSC ERROR: #1 User provided function() line 0 in  unknown file
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 0 on node bell-a017 exited on
> signal 9 (Killed).
> --------------------------------------------------------------------------
> [bell-a017.rcac.purdue.edu:62310] 62 more processes have sent help
> message help-mpi-api.txt / mpi-abort
> [bell-a017.rcac.purdue.edu:62310] Set MCA parameter
> "orte_base_help_aggregate" to 0 to see all help / error messages
> slurmstepd: error: Detected 4 oom-kill event(s) in step 1701844.batch
> cgroup. Some of your processes may have been killed by the cgroup
> out-of-memory handler.
>
>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>