<div dir="ltr"><div dir="ltr">On Wed, Mar 17, 2021 at 10:15 AM Sanjoy Kumar Mazumder <<a href="mailto:mazumder@purdue.edu">mazumder@purdue.edu</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">Hi all,</span>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
I am trying to solve a set of coupled stiff ODEs in parallel using TSSUNDIALS with SUNDIALS_BDF as 'TSSundialsSetType' in PETSc. I am using a sparse Jacobian matrix of type MATMPIAIJ with no preconditioner.</div></div></div></blockquote><div><br></div><div>What is the convergence of your linear solver like? You can see this using</div><div><br></div><div> -ksp_converged_reason</div><div><br></div><div> Thanks,</div><div><br></div><div> Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)"><div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)"> It runs for a long time with a very small timestep
(~10^-8 to 10^-10) and then terminates abruptly with the following error:</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
'slurmstepd: error: Detected 4 oom-kill event(s) in step 1701844.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.'</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
After going through some of the common suggestions made earlier on the mailing list:<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
1) I tried increasing the memory allotted per CPU (--mem-per-cpu) in my batch script, but the problem still remains.
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
2) I have also checked for proper deallocation of the arrays in my function and Jacobian subroutines before every TS iteration.</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
3) The time allotted for my job on the assigned nodes (wall-time) far exceeds the time for which the job is actually running.</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Is there anything I am missing or not doing properly? Given below is the complete error output that shows up after termination.</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Thanks</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
With regards,<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Sanjoy<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
--------------------------------------------------------------------------
<div>Primary job terminated normally, but 1 process returned</div>
<div>a non-zero exit code. Per user-direction, the job has been aborted.</div>
<div>--------------------------------------------------------------------------</div>
<div>--------------------------------------------------------------------------</div>
<div>MPI_ABORT was invoked on rank 11 in communicator MPI_COMM_WORLD</div>
<div>with errorcode 50176059.</div>
<div><br>
</div>
<div>NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.</div>
<div>You may or may not see output from other processes, depending on</div>
<div>exactly when Open MPI kills them.</div>
<div>--------------------------------------------------------------------------</div>
<div>[1]PETSC ERROR: ------------------------------------------------------------------------</div>
<div>[1]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end</div>
<div>[1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger</div>
<div>[1]PETSC ERROR: or see <a href="https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind" target="_blank">https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind</a></div>
<div>[1]PETSC ERROR: or try <a href="http://valgrind.org" target="_blank">http://valgrind.org</a> on GNU/linux and Apple Mac OS X to find memory corruption errors</div>
<div>[1]PETSC ERROR: likely location of problem given in stack below</div>
<div>[1]PETSC ERROR: --------------------- Stack Frames ------------------------------------</div>
<div>[1]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,</div>
<div>[1]PETSC ERROR: INSTEAD the line number of the start of the function</div>
<div>[1]PETSC ERROR: is given.</div>
<div>[1]PETSC ERROR: [1] TSStep_Sundials line 121 /home/mazumder/petsc-3.14.5/src/ts/impls/implicit/sundials/sundials.c</div>
<div>[1]PETSC ERROR: [1] TSStep line 3736 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c</div>
<div>[1]PETSC ERROR: [1] TSSolve line 4046 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c</div>
<div>[1]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------</div>
<div>[1]PETSC ERROR: Signal received</div>
<div>[1]PETSC ERROR: See <a href="https://www.mcs.anl.gov/petsc/documentation/faq.html" target="_blank">https://www.mcs.anl.gov/petsc/documentation/faq.html</a> for trouble shooting.</div>
<div>[1]PETSC ERROR: Petsc Release Version 3.14.5, Mar 03, 2021 </div>
<div>[1]PETSC ERROR: ./ThO2 on a arch-linux-c-debug named <a href="http://bell-a017.rcac.purdue.edu" target="_blank">bell-a017.rcac.purdue.edu</a> by mazumder Mon Mar 15 13:26:36 2021</div>
<div>[1]PETSC ERROR: Configure options --with-cc-mpicc --with-cxx=mpicxx --with-fc=mpif90 --download-fblaslapack --download-sundials=yes --with-debugging</div>
<div>[1]PETSC ERROR: #1 User provided function() line 0 in unknown file</div>
<div>[2]PETSC ERROR: ------------------------------------------------------------------------</div>
<div>[2]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end</div>
<div>[2]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger</div>
<div>[2]PETSC ERROR: or see <a href="https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind" target="_blank">https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind</a></div>
<div>[2]PETSC ERROR: or try <a href="http://valgrind.org" target="_blank">http://valgrind.org</a> on GNU/linux and Apple Mac OS X to find memory corruption errors</div>
<div>[2]PETSC ERROR: likely location of problem given in stack below</div>
<div>[2]PETSC ERROR: --------------------- Stack Frames ------------------------------------</div>
<div>[2]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,</div>
<div>[2]PETSC ERROR: INSTEAD the line number of the start of the function</div>
<div>[2]PETSC ERROR: is given.</div>
<div>[2]PETSC ERROR: [2] TSStep_Sundials line 121 /home/mazumder/petsc-3.14.5/src/ts/impls/implicit/sundials/sundials.c</div>
<div>[2]PETSC ERROR: [2] TSStep line 3736 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c</div>
<div>[2]PETSC ERROR: [2] TSSolve line 4046 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c</div>
<div>[2]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------</div>
<div>[2]PETSC ERROR: Signal received</div>
<div>[2]PETSC ERROR: See <a href="https://www.mcs.anl.gov/petsc/documentation/faq.html" target="_blank">https://www.mcs.anl.gov/petsc/documentation/faq.html</a> for trouble shooting.</div>
<div>[2]PETSC ERROR: Petsc Release Version 3.14.5, Mar 03, 2021 </div>
<div>[2]PETSC ERROR: ./ThO2 on a arch-linux-c-debug named <a href="http://bell-a017.rcac.purdue.edu" target="_blank">bell-a017.rcac.purdue.edu</a> by mazumder Mon Mar 15 13:26:36 2021</div>
<div>[2]PETSC ERROR: Configure options --with-cc-mpicc --with-cxx=mpicxx --with-fc=mpif90 --download-fblaslapack --download-sundials=yes --with-debugging</div>
<div>[2]PETSC ERROR: #1 User provided function() line 0 in unknown file</div>
<div>[3]PETSC ERROR: ------------------------------------------------------------------------</div>
<div>[3]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end</div>
<div>[3]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger</div>
<div>[3]PETSC ERROR: or see <a href="https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind" target="_blank">https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind</a></div>
<div>[3]PETSC ERROR: or try <a href="http://valgrind.org" target="_blank">http://valgrind.org</a> on GNU/linux and Apple Mac OS X to find memory corruption errors</div>
<div>[3]PETSC ERROR: likely location of problem given in stack below</div>
<div>[3]PETSC ERROR: --------------------- Stack Frames ------------------------------------</div>
<div>[3]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,</div>
<div>[3]PETSC ERROR: INSTEAD the line number of the start of the function</div>
<div>[3]PETSC ERROR: is given.</div>
<div>[3]PETSC ERROR: [3] TSStep_Sundials line 121 /home/mazumder/petsc-3.14.5/src/ts/impls/implicit/sundials/sundials.c</div>
<div>[3]PETSC ERROR: [3] TSStep line 3736 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c</div>
<div>[3]PETSC ERROR: [3] TSSolve line 4046 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c</div>
<div>[3]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------</div>
<div>[3]PETSC ERROR: Signal received</div>
<div>[3]PETSC ERROR: See <a href="https://www.mcs.anl.gov/petsc/documentation/faq.html" target="_blank">https://www.mcs.anl.gov/petsc/documentation/faq.html</a> for trouble shooting.</div>
<div>[3]PETSC ERROR: Petsc Release Version 3.14.5, Mar 03, 2021 </div>
<div>[3]PETSC ERROR: ./ThO2 on a arch-linux-c-debug named <a href="http://bell-a017.rcac.purdue.edu" target="_blank">bell-a017.rcac.purdue.edu</a> by mazumder Mon Mar 15 13:26:36 2021</div>
<div>[3]PETSC ERROR: Configure options --with-cc-mpicc --with-cxx=mpicxx --with-fc=mpif90 --download-fblaslapack --download-sundials=yes --with-debugging</div>
<div>[3]PETSC ERROR: #1 User provided function() line 0 in unknown file</div>
<div>[4]PETSC ERROR: ------------------------------------------------------------------------</div>
<div>[4]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end</div>
<div>[4]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger</div>
<div>[4]PETSC ERROR: or see <a href="https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind" target="_blank">https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind</a></div>
<div>[4]PETSC ERROR: or try <a href="http://valgrind.org" target="_blank">http://valgrind.org</a> on GNU/linux and Apple Mac OS X to find memory corruption errors</div>
<div>[4]PETSC ERROR: likely location of problem given in stack below</div>
<div>[4]PETSC ERROR: --------------------- Stack Frames ------------------------------------</div>
<div>[4]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,</div>
<div>[4]PETSC ERROR: INSTEAD the line number of the start of the function</div>
<div>[4]PETSC ERROR: is given.</div>
<div>[4]PETSC ERROR: [4] TSStep_Sundials line 121 /home/mazumder/petsc-3.14.5/src/ts/impls/implicit/sundials/sundials.c</div>
<div>[4]PETSC ERROR: [4] TSStep line 3736 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c</div>
<div>[4]PETSC ERROR: [4] TSSolve line 4046 /home/mazumder/petsc-3.14.5/src/ts/interface/ts.c</div>
<div>[4]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------</div>
<div>[4]PETSC ERROR: Signal received</div>
<div>[4]PETSC ERROR: See <a href="https://www.mcs.anl.gov/petsc/documentation/faq.html" target="_blank">https://www.mcs.anl.gov/petsc/documentation/faq.html</a> for trouble shooting.</div>
<div>[4]PETSC ERROR: Petsc Release Version 3.14.5, Mar 03, 2021 </div>
<div>[4]PETSC ERROR: ./ThO2 on a arch-linux-c-debug named <a href="http://bell-a017.rcac.purdue.edu" target="_blank">bell-a017.rcac.purdue.edu</a> by mazumder Mon Mar 15 13:26:36 2021</div>
<div>[4]PETSC ERROR: Configure options --with-cc-mpicc --with-cxx=mpicxx --with-fc=mpif90 --download-fblaslapack --download-sundials=yes --with-debugging</div>
[4]PETSC ERROR: #1 User provided function() line 0 in unknown file</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
--------------------------------------------------------------------------
<div>mpirun noticed that process rank 0 with PID 0 on node bell-a017 exited on signal 9 (Killed).</div>
<div>--------------------------------------------------------------------------</div>
<div>[<a href="http://bell-a017.rcac.purdue.edu:62310" target="_blank">bell-a017.rcac.purdue.edu:62310</a>] 62 more processes have sent help message help-mpi-api.txt / mpi-abort</div>
<div>[<a href="http://bell-a017.rcac.purdue.edu:62310" target="_blank">bell-a017.rcac.purdue.edu:62310</a>] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages</div>
slurmstepd: error: Detected 4 oom-kill event(s) in step 1701844.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.</div>
<br>
</div>
</div>
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div><div><br></div><div><a href="http://www.cse.buffalo.edu/~knepley/" target="_blank">https://www.cse.buffalo.edu/~knepley/</a><br></div></div></div></div></div></div></div></div>