[petsc-dev] errors galore related to Barry's change to PetscError

Jed Brown jed at 59A2.org
Sun May 9 08:47:41 CDT 2010


On Sat, 8 May 2010 17:42:33 -0500, Barry Smith <bsmith at mcs.anl.gov> wrote:
>     Now that a comm is passed into the error handler, how we use it is
>     still preliminary and a work in progress. Likely it will evolve as
>     we figure out what to do.
> 
> The reason I don't have non-root call MPI_Abort() (even after waiting)
> is that MPI_Abort() will trigger all the other processes to abort? If
> the root is "late" getting to the error then it will receive an abort
> from the non-root MPI_Abort() and never execute the traceback hence no
> error message; bad news. At least I think this might happen.

Alternatively this happens:

  mpirun has exited due to process rank 1 with PID 4388 on
  node kunyang exiting without calling "finalize". This may
  have caused other processes in the application to be
  terminated by signals sent by mpirun (as reported here).

But there is a crucial behavioral change.  The user used to be able to
catch the error at any point in the chain and decide not to make it
fatal.  This is no longer possible with the traceback error handler
(which admittedly isn't the best handler for this handling mechanism).
I realize that MPI (and thus PETSc) makes no guarantees about the state
after an error occurs, but the user might be trying to write a checkpoint
or release some resources, in which case abort() from the other ranks is
not desirable.
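
To make the concern concrete, here is a minimal sketch in plain C of
catching an error partway up the call chain. All names (ErrCode,
inner_solve, middle_step, run_with_recovery) are hypothetical and this is
not PETSc's API; it only illustrates that when each level returns a code
instead of aborting, a caller can still choose to recover.

```c
#include <stdio.h>

/* Hypothetical names throughout; this is not PETSc's API. */
typedef int ErrCode;

/* Innermost routine: reports failure by returning a nonzero code. */
static ErrCode inner_solve(int fail)
{
  if (fail) {
    fprintf(stderr, "inner_solve: failed\n");
    return 1;
  }
  return 0;
}

/* Middle of the chain: passes the code upward without aborting. */
static ErrCode middle_step(int fail)
{
  ErrCode ierr = inner_solve(fail);
  if (ierr) return ierr;
  return 0;
}

/* A caller that catches the error and decides it is not fatal.
   Sets *recovered to 1 when it handled an error, 0 otherwise. */
static ErrCode run_with_recovery(int fail, int *recovered)
{
  ErrCode ierr = middle_step(fail);
  *recovered = 0;
  if (ierr) {
    /* Here the user could write a checkpoint or release resources. */
    *recovered = 1;
  }
  return 0;
}
```

If any level in this chain called abort() itself, run_with_recovery()
would never get the chance to clean up.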

> An alternative to what I have done is to have non-root wait a while
> and then return with the usual traceback. Thus under normal
> circumstances it will receive the abort() from root before printing
> the traceback so we will get one nice traceback from root. (will it?)
> Under strange circumstances where root for some reason doesn't get to
> the error we will get the current behavior where everyone else prints
> the traceback and so we do get a useful error message (not perfect
> because there are several error messages, but much better than no
> messages).

I think this would be better.
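
A toy model of that policy, in plain C with no MPI calls, might look
like the following. The names (ErrorAction, error_action,
tracebacks_printed) are hypothetical, and the model assumes that when
root reaches the error, its abort arrives before the other ranks finish
waiting, which is exactly the open "(will it?)" question above.

```c
/* Toy model with hypothetical names; no MPI calls are made here. */
typedef enum { PRINT_THEN_ABORT, WAIT_THEN_PRINT } ErrorAction;

/* Root prints its traceback and aborts at once; other ranks wait first. */
static ErrorAction error_action(int rank)
{
  return rank == 0 ? PRINT_THEN_ABORT : WAIT_THEN_PRINT;
}

/* Count the tracebacks that get printed, ASSUMING root's abort (when
   root reaches the error) kills the waiting ranks before their wait
   expires. */
static int tracebacks_printed(int nranks, int root_reaches_error)
{
  int count = 0;
  for (int rank = 0; rank < nranks; rank++) {
    if (error_action(rank) == PRINT_THEN_ABORT) {
      if (root_reaches_error) count++;  /* the one nice traceback */
    } else if (!root_reaches_error) {
      count++;  /* wait expired with no abort from root */
    }
  }
  return count;
}
```

Under that assumption the normal case yields a single traceback from
root, and the case where root never reaches the error yields one from
every other rank.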

Jed


