<html><head><meta http-equiv="Content-Type" content="text/html; charset=us-ascii"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><div class=""><br class=""></div> Sounds like it might be a compiler problem generating bad code. <div class=""><br class=""></div><div class=""> On the machine where it fails, you can run with -fp_trap to have it error out as soon as a NaN or Inf appears. If you can use the debugger on that machine, you can tell the debugger to catch floating-point exceptions and see the exact line and the values of variables where a NaN or Inf appears.</div><div class=""><br class=""></div><div class=""> As Matt conjectured, it is likely there is a divide by zero before PETSc detects it, and it may be helpful to find out exactly where that happens.</div><div class=""><br class=""></div><div class=""> Barry</div><div class=""><br class=""><div><br class=""><blockquote type="cite" class=""><div class="">On Aug 25, 2020, at 8:03 AM, Alfredo Jaramillo <<a href="mailto:ajaramillopalma@gmail.com" class="">ajaramillopalma@gmail.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class="">Yes, Barry, that is correct.<br class=""><div class=""><div dir="ltr" data-smartmail="gmail_signature" class=""><div dir="ltr" class=""><div class=""><div dir="ltr" class=""><div class=""><div dir="ltr" class=""><div class=""><div dir="ltr" class=""><div dir="ltr" class=""><div dir="ltr" class=""><div dir="ltr" class=""><br class=""></div></div></div></div></div></div></div></div></div></div></div></div><br class=""></div><br class=""><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Aug 25, 2020 at 1:02 AM Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank" class="">bsmith@petsc.dev</a>> wrote:<br class=""></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class=""><div class=""><br 
class=""></div> On one system you get this error, on another system with the identical code and test case you do not get the error?<div class=""><br class=""></div><div class=""> You get it with three iterative methods but not with MUMPS?<br class=""><div class=""><br class=""></div><div class="">Barry</div><div class=""><br class=""><div class=""><br class=""><blockquote type="cite" class=""><div class="">On Aug 24, 2020, at 8:35 PM, Alfredo Jaramillo <<a href="mailto:ajaramillopalma@gmail.com" target="_blank" class="">ajaramillopalma@gmail.com</a>> wrote:</div><br class=""><div class=""><div dir="ltr" class=""><div class="">Hello Barry, Matthew, thanks for the replies !</div><div class=""><br class=""></div><div class="">Yes, it is our custom code, and it also happens when setting -pc_type bjacobi. Before testing an iterative solver, we were using MUMPS (-ksp_type preonly -ksp_pc_type lu -pc_factor_mat_solver_type mumps) without issues.</div><div class=""><br class=""></div><div class="">Running the ex19 (as "mpirun -n 4 ex19 -da_refine 5") did not produce any problem.<br class=""><br class=""></div><div class="">To reproduce the situation on my computer, I was able to reproduce the error for a small case and -pc_type bjacobi. 
For that particular case, when running on the cluster the error appears at the very last iteration:</div><div class=""><br class=""></div><div class="">=====<br class=""></div><div class=""> 27 KSP Residual norm 8.230378644666e-06 <br class="">[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------<br class="">[0]PETSC ERROR: Invalid argument<br class="">[0]PETSC ERROR: Scalar value must be same on all processes, argument # 3</div><div class="">====</div><div class=""><br class=""></div><div class="">whereas on my computer the error is not raised and the solve converges instead:</div><div class=""><br class=""></div><div class="">====<br class="">Linear interp_ solve converged due to CONVERGED_RTOL iterations 27</div><div class="">====</div><div class=""><br class=""></div><div class="">I will run valgrind to look for possible memory corruption.</div><div class=""><br class=""></div><div class="">thank you<br class=""></div><div class="">Alfredo<br class=""></div></div><br class=""><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Aug 24, 2020 at 9:00 PM Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank" class="">bsmith@petsc.dev</a>> wrote:<br class=""></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class=""><div dir="auto" class=""><div class=""><br class=""></div> Oh yes, it could happen with NaN. 
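<div class=""><br class=""></div><div class="">As a rough illustration of why a NaN can surface as this particular message: under IEEE 754 a NaN compares unequal to everything, itself included, so a consistency check built on equality reports a mismatch even when every rank holds the "same" NaN. The sketch below is a plain-Python stand-in and not PETSc's implementation; same_on_all_ranks is a hypothetical helper, and each list plays the role of one scalar per rank.</div>

```python
import math


def same_on_all_ranks(values):
    """Hypothetical stand-in for a cross-rank consistency check.

    Returns True only if every entry compares equal to the first one.
    """
    return all(v == values[0] for v in values)


# One scalar per simulated rank (assumed values, for illustration only).
finite = [0.5, 0.5, 0.5, 0.5]
with_nan = [math.nan] * 4

print(same_on_all_ranks(finite))    # True
print(same_on_all_ranks(with_nan))  # False, because NaN != NaN
```

<div class="">Both lists are identical across the simulated ranks, yet only the finite one passes. That is consistent with the error appearing only once a NaN has entered the dot products.</div>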
<div class=""><br class=""></div><div class=""> KSPGMRESClassicalGramSchmidtOrthogonalization() calls KSPCheckDot(ksp,lhh[j]); so it should detect any NaN that appears and set ksp->convergedreason, but the call to VecMAXPY() is still made before returning, which produces the error message.</div><div class=""><br class=""></div><div class=""> We should short-circuit the orthogonalization as soon as it sees a NaN/Inf and return immediately so that GMRES can clean up and produce a useful error message. </div><div class=""><br class=""></div><div class=""> Alfredo,</div><div class=""><br class=""></div><div class=""> It is also possible that the hypre preconditioners are producing a NaN because your matrix is too difficult for them to handle, but it would be odd for that to happen after many iterations.</div><div class=""><br class=""></div><div class=""> As I suggested before, run with -pc_type bjacobi to see if you get the same problem.</div><div class=""><br class=""></div><div class=""> Barry</div><div class=""><br class=""></div><div class=""><div class=""><br class=""><blockquote type="cite" class=""><div class="">On Aug 24, 2020, at 6:38 PM, Matthew Knepley <<a href="mailto:knepley@gmail.com" target="_blank" class="">knepley@gmail.com</a>> wrote:</div><br class=""><div class=""><div dir="ltr" class=""><div dir="ltr" class="">On Mon, Aug 24, 2020 at 6:27 PM Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank" class="">bsmith@petsc.dev</a>> wrote:<br class=""></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br class="">
Alfredo,<br class="">
<br class="">
This should never happen. The input to the VecMAXPY in GMRES is computed via VecMDot, which produces the same result on all processes.<br class="">
<br class="">
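For context, the dot-products-then-update pattern in one classical Gram-Schmidt step can be sketched as below, with a finiteness check placed between the two phases. This is a pure-Python illustration under stated assumptions, not PETSc code: cgs_step and dot are hypothetical helpers, and the two loops only mimic the roles of VecMDot and VecMAXPY.<br class="">
<br class="">

```python
import math


def dot(x, y):
    """Plain dot product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(x, y))


def cgs_step(basis, w):
    """One classical Gram-Schmidt step that stops early on NaN/Inf.

    Returns (w_orth, coeffs, ok). If any coefficient is not finite,
    w is returned unchanged with ok=False so the caller can report it.
    """
    # Phase 1: all projection coefficients at once (the VecMDot role).
    coeffs = [dot(v, w) for v in basis]

    # Short-circuit: skip the update if anything is NaN or Inf.
    if not all(math.isfinite(c) for c in coeffs):
        return list(w), coeffs, False

    # Phase 2: subtract the projections (the VecMAXPY role).
    w_orth = list(w)
    for c, v in zip(coeffs, basis):
        w_orth = [wi - c * vi for wi, vi in zip(w_orth, v)]
    return w_orth, coeffs, True


basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(cgs_step(basis, [3.0, 4.0, 5.0]))  # ([0.0, 0.0, 5.0], [3.0, 4.0], True)
print(cgs_step(basis, [math.nan, 1.0, 2.0])[2])  # False
```

<br class="">
With the check in place, a non-finite coefficient never reaches the update phase, so the failure can be reported as a NaN/Inf rather than as a cross-process mismatch.<br class="">
<br class="">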
If you run with -pc_type bjacobi does it also happen?<br class="">
<br class="">
Is this your custom code or does it happen in PETSc examples also? Like src/snes/tutorials/ex19 -da_refine 5 <br class="">
<br class="">
Could be memory corruption, can you run under valgrind?<br class=""></blockquote><div class=""><br class=""></div><div class="">Couldn't it happen if something generates a NaN? That also should not happen, but I was allowing that pilut might do it.</div><div class=""><br class=""></div><div class=""> Thanks,</div><div class=""><br class=""></div><div class=""> Matt</div><div class=""> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
Barry<br class="">
<br class="">
<br class="">
> On Aug 24, 2020, at 4:05 PM, Alfredo Jaramillo <<a href="mailto:ajaramillopalma@gmail.com" target="_blank" class="">ajaramillopalma@gmail.com</a>> wrote:<br class="">
> <br class="">
> Dear PETSc developers,<br class="">
> <br class="">
> I'm trying to solve a linear problem with GMRES preconditioned with pilut from HYPRE. For this I'm using the options:<br class="">
> <br class="">
> -ksp_type gmres -pc_type hypre -pc_hypre_type pilut -ksp_monitor<br class="">
> <br class="">
> If I use a single core, GMRES (+ pilut or euclid) converges. However, when using multiple cores, the following error appears after some number of iterations:<br class="">
> <br class="">
> [0]PETSC ERROR: Scalar value must be same on all processes, argument # 3<br class="">
> <br class="">
> in the function VecMAXPY. I have attached a screenshot with more detailed output. The same happens when using euclid. Could you please give me some insight into this?<br class="">
> <br class="">
> best regards<br class="">
> Alfredo<br class="">
> <Screenshot from 2020-08-24 17-57-52.png><br class="">
<br class="">
</blockquote></div><br clear="all" class=""><div class=""><br class=""></div>-- <br class=""><div dir="ltr" class=""><div dir="ltr" class=""><div class=""><div dir="ltr" class=""><div class=""><div dir="ltr" class=""><div class="">What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br class="">-- Norbert Wiener</div><div class=""><br class=""></div><div class=""><a href="http://www.cse.buffalo.edu/~knepley/" target="_blank" class="">https://www.cse.buffalo.edu/~knepley/</a><br class=""></div></div></div></div></div></div></div></div>
</div></blockquote></div><br class=""></div></div></div></blockquote></div>
</div></blockquote></div><br class=""></div></div></div></blockquote></div>
</div></blockquote></div><br class=""></div></body></html>