<html><head><meta http-equiv="Content-Type" content="text/html; charset=us-ascii"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><br class=""><div><br class=""></div><div> Everyone,</div><div><br class=""></div><div> Previously we checked the bounds range for the debug version of the code but not the optimized version. Based on Mark's experience I felt that the tiny hit on performance on checking was worth it all the time and our intention is now to always check these bounds.</div><div><br class=""></div><div> Barry</div><div><br class=""><blockquote type="cite" class=""><div class="">On Nov 3, 2020, at 7:30 AM, Matthew Knepley <<a href="mailto:knepley@gmail.com" class="">knepley@gmail.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class=""><div dir="ltr" class="">On Tue, Nov 3, 2020 at 8:23 AM Mark McClure <<a href="mailto:mark@resfrac.com" class="">mark@resfrac.com</a>> wrote:<br class=""></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr" class="">Hi, all.<div class=""><br class=""></div><div class="">I am emailing to close the loop on this. </div><div class=""><br class=""></div><div class="">There were two things combining to cause our issue.</div><div class=""><br class=""></div><div class="">1. At some point, several years ago, I had set PetscPushErrorHandler(PetscAbortErrorHandler, NULL), and then forgotten about it. This caused the program to terminate when an error was encountered.</div><div class=""><br class=""></div><div class="">2. 
I believe our code had a very rare, nonreproducible bug (occurring once in thousands of hours of runtime) in which it passed column and/or row values to Petsc that were greater than the size of the matrix.</div><div class=""><br class=""></div><div class="">We have since changed the error handler and added a special error check that verifies the column/row assignments and, if they are out of range, aborts the timestep (which is then rerun with a smaller dt). Since then, we've run for over a month and have not seen the problem recur.</div><div class=""><br class=""></div><div class="">Two ideas that might be helpful to avoid this problem in the future:</div><div class=""><br class=""></div><div class="">1. When Petsc aborts on an error because PetscPushErrorHandler(PetscAbortErrorHandler, NULL) is set, it might be helpful to add the following to the error message that is printed to cout: "Petsc is aborting the program because the Petsc Error Handler has been set to abort on errors. This can be changed by modifying the option passed into the error handler." Otherwise, it might be unclear to a user why Petsc is aborting (if abort on error is set, but they don't realize it). We had written error checks to catch errors passed out of Petsc and handle them gracefully, but we had overlooked the error handler setting, and so were confused about why this error was causing the entire program to terminate. A bit more explanation in the abort message could help avoid that kind of user confusion. I had thought it was a hard crash inside Petsc, since I could not see why the function wasn't returning.</div></div></blockquote><div class=""><br class=""></div><div class="">Hi Mark,</div><div class=""><br class=""></div><div class="">That is a good suggestion. I am doing it.</div><div class=""> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr" class=""><div class="">2. 
When the "greater than largest key allowed" error is encountered, it might be helpful for the message printed to users to additionally say: "This error may occur if the column and row values passed into Petsc are not within the range from 0 to the size of the matrix." I am not positive, but I am about 90% sure that is why we were hitting the error.</div></div></blockquote><div class=""><br class=""></div><div class="">Can you help me understand this? It should be impossible to pass row or column indices that are greater than the matrix size through MatSetValues:</div><div class=""><br class=""></div><div class="">  <a href="https://gitlab.com/petsc/petsc/-/blob/master/src/mat/impls/aij/mpi/mpiaij.c#L582" class="">https://gitlab.com/petsc/petsc/-/blob/master/src/mat/impls/aij/mpi/mpiaij.c#L582</a></div><div class=""><br class=""></div><div class="">What are you calling to get them in?</div><div class=""><br class=""></div><div class="">  Thanks,</div><div class=""><br class=""></div><div class="">     Matt</div><div class=""> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr" class=""><div class="">Thank you again for your help. 
I really appreciate your rapid responses and help!</div><div class=""><br class=""></div><div class="">Mark</div><div class=""><br class=""></div><div class=""><br class=""></div><div class=""><br class=""></div><div class=""><br class=""></div></div><br class=""><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Sep 26, 2020 at 10:53 PM Mark McClure <<a href="mailto:mark@resfrac.com" target="_blank" class="">mark@resfrac.com</a>> wrote:<br class=""></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr" class="">Ok, I think we've made some progress.<div class=""><br class=""></div><div class="">We were already calling the function like this: ierr = PetscCall(); if (ierr != 0) {do something to handle error}. We are actually doing that on every single call made to Petsc, just to be careful. This is what was confusing to me. Why was the program terminating from within Petsc and not returning with an error? We'd written our code so that if Petsc did return with an error, we'd discard the full timestep, destroy all the Petsc data structures and redo everything with a smaller dt. So if Petsc did hit this error very rarely, we might be able to recover gracefully.</div><div class=""><br class=""></div><div class="">It does not appear to be seg faulting. So it seemed that the program was being terminated intentionally from within Petsc, which was puzzling, and why I was asking about that in my previous email.</div><div class=""><br class=""></div><div class="">So - Chris made a great find. It turns out that right after PetscInitialize in our main.cpp, we had the line:</div><div class=""><br class=""></div><div class="">PetscPushErrorHandler(PetscAbortErrorHandler, NULL);<br class=""></div><div class=""><br class=""></div><div class="">That line was telling Petsc to call MPI_Abort if there was an error. I probably put that line into the code years ago and forgot it was there. 
So, as Barry said, if we change the PetscErrorHandler option to ignore, then at least we can avoid the program aborting on the error, and hopefully be able to recover with our existing code logic. </div><div class=""><br class=""></div><div class="">Also, I may have found a clue on the root cause of the error. I had thought that we were checking all of our inputs to Petsc for issues such as out of range index values. But I went back and saw that, due to a versioning mistake, there is one particular error check on our inputs that was being removed from production builds by a preprocessor definition. This means that bad inputs would not be caught in our production builds, so it is possible that bad inputs could have been passed into Petsc. I don't know for sure - but it is plausible. The missing error check was doing the following: checking to see if the fourth entry to MatSetValues ("n", for the number of nonzero values in the row) (<a href="https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/MatSetValues.html" target="_blank" class="">https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/MatSetValues.html</a>) was equal to the sum of the number of diagonal and off diagonal values that we had specified in our previous call to MatMPIAIJSetPreallocation. </div><div class=""><br class=""></div><div class="">So that is at least a theory for what was happening. The theory would be: very rarely, due to a bug in our code, we were running MatSetValues with "n" set to a value not equal to the number of nonzero values promised in the call to MatMPIAIJSetPreallocation. 
Maybe this led to the Petsc "key 7556 is greater than largest key allowed 5693" error message, and then our setting of 'PetscAbortErrorHandler' was causing the program to abort.<br class=""></div><div class=""><br class=""></div><div class="">Mark</div><div class=""><br class=""></div><div class=""><br class=""></div><div class=""><br class=""></div><div class=""><br class=""></div><div class=""><br class=""></div><div class=""> </div></div><br class=""><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Sep 26, 2020 at 9:51 PM Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank" class="">bsmith@petsc.dev</a>> wrote:<br class=""></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class=""><br class=""><div class=""><br class=""><blockquote type="cite" class=""><div class="">On Sep 26, 2020, at 5:58 PM, Junchao Zhang <<a href="mailto:junchao.zhang@gmail.com" target="_blank" class="">junchao.zhang@gmail.com</a>> wrote:</div><br class=""><div class=""><div dir="ltr" class=""><div dir="ltr" class=""><br class=""></div><br class=""><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Sep 26, 2020 at 5:44 PM Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank" class="">mfadams@lbl.gov</a>> wrote:<br class=""></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr" class=""><div dir="ltr" class=""><br class=""></div><br class=""><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Sep 26, 2020 at 1:07 PM Matthew Knepley <<a href="mailto:knepley@gmail.com" target="_blank" class="">knepley@gmail.com</a>> wrote:<br class=""></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr" class=""><div dir="ltr" class="">On Sat, Sep 26, 2020 at 11:17 AM Mark McClure <<a 
href="mailto:mark@resfrac.com" target="_blank" class="">mark@resfrac.com</a>> wrote:<br class=""></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr" class="">Thank you all for the explanations. <div class=""><br class=""></div><div class="">Following Matt's suggestion, we'll use -g (and not use -with-debugging=0) in all future builds for all users, so that in future we can provide better information.</div><div class=""><br class=""></div><div class="">Second, Chris is going to boil our function down to a minimal stub and share it, in case there is some subtle issue with the way the functions are being called. </div><div class=""><br class=""></div><div class="">Third, I have a question/request - Petsc is, in fact, detecting an error. As far as I can tell, this is not an uncontrolled 'seg fault'. It seems to me that maybe Petsc could choose to return from the function when it detects this error, returning an error code, rather than dumping the core and terminating the program. If Petsc simply returned with an error message, this would resolve the problem for us. After the Petsc call, we check for Petsc error messages. If Petsc returns an error - that's fine - we use a direct solver as a backup, and the simulation continues. So - I am not sure whether this is feasible - but if Petsc could return with an error message - rather than dumping the core and terminating the program - then that would effectively resolve the issue for us. Would this change be possible?</div></div></blockquote><div class=""><br class=""></div><div class="">At some level, I think it is currently doing what you want. CHKERRQ() simply returns an error code from that function call, printing an error message. 
Suppressing the message is harder I think,</div></div></div></blockquote><div class=""><br class=""></div><div class="">He does not need this.</div><div class=""> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr" class=""><div class="gmail_quote"><div class="">but for now, if you know what function call is causing the error, you can just catch the (ierr != 0) yourself instead of using CHKERRQ. </div></div></div></blockquote><div class=""><br class=""></div><div class="">This is what I suggested earlier but maybe I was not clear enough.</div><div class=""><br class=""></div><div class="">Your code calls something like </div><div class=""><br class=""></div><div class="">ierr = SNESSolve(....); CHKERRQ(ierr);</div><div class=""><br class=""></div><div class="">You can replace this with:</div><div class=""><br class=""></div><div class="">  ierr = SNESSolve(....);  <br class=""></div><div class="">  if (ierr) {</div></div></div></blockquote><div class="">How do we deal with the CHKERRQ(ierr); calls inside SNESSolve()?</div></div></div></div></blockquote><div class=""><br class=""></div><div class=""><br class=""></div>   PetscPushErrorHandler(PetscIgnoreErrorHandler,NULL);</div><div class=""><br class=""></div><div class="">  But the problem in this code's runs appears to be due to corrupt data; why and how it gets corrupted is not known. Continuing with an alternative solver because a solver failed for numerical or algorithmic reasons is generally fine, but continuing when there is corrupted data is always iffy because one doesn't know how far the corruption has spread. 
SNESDestroy(&snes); SNESCreate(&snes); will likely clean out any potentially corrupted data but if the corruption got into the mesh data structures it will still be there.</div><div class=""><br class=""></div><div class="">  A very sophisticated library code would, when it detects this type of corruption, sweep through all the data structures looking for any indications of corruption, to help determine the cause and/or even fix the problem. We don't have this code in place, though we could add some, because generally we rely on valgrind or -malloc_debug to detect such corruption. The problem is that valgrind and -malloc_debug don't fit well in a production environment. Handling corruption that comes up in production but not in testing is difficult.</div><div class=""><br class=""></div><div class="">  Barry</div><div class=""><br class=""></div><div class=""><br class=""></div><div class=""><br class=""></div><div class=""><br class=""><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr" class=""><div class="gmail_quote"><div class="">     .... </div><div class="">  }</div><div class=""><br class=""></div><div class="">I suggested something earlier to do here. Maybe call KSPView. You could even destroy the solver and start the solver from scratch and see if that works.</div><div class=""><br class=""></div><div class="">Mark</div><div class=""> <br class=""></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr" class=""><div class="gmail_quote"><div class="">The drawback here is that we might not have cleaned up</div><div class="">all the state so that restarting makes sense. 
It should be possible to just kill the solve, reset the solver, and retry, although it is not clear to me at first glance if MPI will be in an okay state.</div><div class=""><br class=""></div></div></div>
</blockquote></div></div>
</blockquote></div></div>
</div></blockquote></div><br class=""></div></blockquote></div>
</blockquote></div>
</blockquote></div><br clear="all" class=""><div class=""><br class=""></div>-- <br class=""><div dir="ltr" class="gmail_signature"><div dir="ltr" class=""><div class=""><div dir="ltr" class=""><div class=""><div dir="ltr" class=""><div class="">What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br class="">-- Norbert Wiener</div><div class=""><br class=""></div><div class=""><a href="http://www.cse.buffalo.edu/~knepley/" target="_blank" class="">https://www.cse.buffalo.edu/~knepley/</a><br class=""></div></div></div></div></div></div></div></div>
</div></blockquote></div><br class=""></body></html>