[petsc-users] question about partitioning usage

Wed Jan 25 14:41:47 CST 2012

On Wed, Jan 25, 2012 at 13:20, Dominik Szczerba <dominik at itis.ethz.ch> wrote:

> I create it to partition my input unstructured mesh, then I destroy
> it. It is so separated from the rest of the code, that I could even
> delegate it to a separate program.
> But then, as shown in the other thread, I somehow get an assertion
> fail from parmetis much later in the code, when creating solver
> contexts. Why is this happening if the MatPartitioning object was
> destroyed?
>

Dominik, you've been at this game for a while now, so you should be
familiar with your tools. In any software project, when you encounter an
error, you should ask the question "how did I get here?" With languages
that have a call stack, the first step to answering that question is to
look at the stack trace. There is a well-known tool for producing stack
traces and performing related tasks: it's called a debugger.

Now PETSc noticed early on that teaching every user to use a debugger (and
dealing with the issues on clusters) is too much work. Consequently, PETSc
manages its own call stack so that it can produce stack traces
automatically. Third-party packages usually don't do this, so they usually
just return error codes. Asserts are useful to a developer because they take
very little typing and the debugger catches them, but they are horrible in
production because the program just exits without telling you why there was
an error. If the programmer was nice, they would have placed a comment at
that line of code explaining what might cause the assertion to fail, but
you still have to dig up the source code and go to that line, and many
programmers aren't that careful.

Of course, failing asserts don't usually exit forcefully; they call abort(),
which raises SIGABRT and, as it turns out, you can catch SIGABRT (though
blocking or ignoring it will not stop abort()). However, SIGABRT is the same
signal that is usually used to instruct a program to dump core, so in order
not to interfere with the ability to get a core dump, PETSc does not catch
SIGABRT. (At least I think this is the rationale; perhaps it should be
reconsidered.) Consequently, PETSc
doesn't automatically give stack traces when third-party libraries call
abort().
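
If you want a trace from a failed assert() without attaching a debugger, one
option is to catch SIGABRT yourself in the application. The following is only
a minimal sketch, not something PETSc does: the names abort_trace_handler,
install_abort_tracer and abort_seen are made up for illustration, and
backtrace()/backtrace_symbols_fd() from <execinfo.h> are assumed to be
available (they are glibc extensions, and backtrace() is not strictly
async-signal-safe, so treat this as a debugging aid only).

```c
#define _POSIX_C_SOURCE 200809L  /* for sigaction() under strict -std modes */
#include <execinfo.h>   /* backtrace(), backtrace_symbols_fd(): glibc-specific */
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical flag: set by the handler so the caller can see SIGABRT arrived. */
volatile sig_atomic_t abort_seen = 0;

/* Hypothetical handler: dump the call stack to stderr when SIGABRT is raised. */
void abort_trace_handler(int sig)
{
    static const char msg[] = "caught SIGABRT, stack trace follows:\n";
    void *frames[64];
    int depth;

    (void)sig;
    abort_seen = 1;
    if (write(STDERR_FILENO, msg, sizeof msg - 1) < 0) {
        /* Nothing useful can be done if stderr is unwritable. */
    }
    depth = backtrace(frames, 64);
    backtrace_symbols_fd(frames, depth, STDERR_FILENO);
    /* When the signal came from abort(), returning here lets abort()
       go on to terminate the process as usual. */
}

/* Install the handler; returns 0 on success, -1 on failure (as sigaction does). */
int install_abort_tracer(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = abort_trace_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGABRT, &sa, NULL);
}
```

Calling install_abort_tracer() early in main() is enough. Symbol names in the
trace are usually more readable if the executable is linked with -rdynamic,
and you still get less information than a debugger would give you.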

Note that the only thing worse than calling assert()/abort() from a library
is to call exit() in an error condition and, unfortunately, this is more
common than you might think. Again, PETSc could register a callback with
atexit(), but this would interfere with users' ability to exit
intentionally (before calling PetscFinalize()) without seeing a PETSc error
message. In any case, the current decision was not to use atexit() either.
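
To make the trade-off concrete, here is a rough sketch of what such an
atexit() hook could look like. It is not PETSc code; library_finalized,
mark_library_finalized, report_early_exit and install_exit_reporter are
hypothetical names. Note that the callback has no way to tell a deliberate
exit by the user from an exit() buried in a third-party library, which is
exactly the interference described above.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical flag: set once the library has been shut down cleanly. */
int library_finalized = 0;

/* Call this from the (hypothetical) finalize routine. */
void mark_library_finalized(void)
{
    library_finalized = 1;
}

/* Runs at every normal process exit, no matter who called exit(). */
void report_early_exit(void)
{
    if (!library_finalized) {
        /* A deliberate user exit and a library's error exit look the same here. */
        fputs("warning: exit() reached before finalize\n", stderr);
    }
}

/* Register the hook; returns 0 on success, nonzero on failure (as atexit does). */
int install_exit_reporter(void)
{
    return atexit(report_early_exit);
}
```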

What this all adds up to for you is that you should use a debugger to get a
stack trace if you want to find out how you reached a failed assert() in
some third-party package. You know that PETSc doesn't require that package,
so it's not using it by default for anything. Presumably, you also know
that "somehow get an assertion fail from parmetis much later in the code,
when creating solver contexts" is pretty vague. How did you determine that
it was when creating solver contexts? Even if that claim is correct, are we
expected to guess what kind of solvers you have and how they are
configured, such that a partitioner might be called? You have to help us
help you.

So don't be afraid to use the debugger. Read enough of the documentation
that you can fluently work with conditional breakpoints and with
watchpoints (both values and locations). Other times, use valgrind with
--db-attach=yes to get the debugger to the correct place. I can understand
that it looks overwhelming if you are starting out, but if you have made it
through your first year of serious development and aren't familiar with
these tools yet, you have already lost more time than it takes to learn the
tools.
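
As a small practice target for those debugger features, something like the
following will do. The program and its names (total, add_cell, sum_cells) are
invented for illustration; the gdb commands in the trailing comment are
standard ones, assuming a build with -g -O0.

```c
#include <stddef.h>

/* Hypothetical running total that a watchpoint can observe. */
double total = 0.0;

/* Hypothetical kernel: adds one cell's contribution to the total. */
void add_cell(size_t index, double value)
{
    (void)index;      /* only here so a breakpoint condition can test it */
    total += value;
}

/* Sum n cells by calling add_cell() once per entry. */
double sum_cells(const double *cells, size_t n)
{
    size_t i;

    total = 0.0;
    for (i = 0; i < n; i++)
        add_cell(i, cells[i]);
    return total;
}

/*
 * Example gdb session:
 *   (gdb) break add_cell if index == 2   conditional breakpoint
 *   (gdb) run
 *   (gdb) backtrace                      "how did I get here?"
 *   (gdb) watch total                    stop whenever the value changes
 *   (gdb) up                             move to the sum_cells frame
 *   (gdb) watch -l cells[3]              watch the memory location itself
 *   (gdb) continue
 */
```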

When developing new code, have a hard rule that there always be a run-time
parameter to change the problem size, and always make the smallest problem
size something that runs in less than 10 seconds without optimization. I
have occasionally ended up in circumstances where these rules were not
followed and I have regretted it every time. Similarly, always develop code
in a friendly development environment. To me, that means that debuggers and
valgrind must work, disk latency must be low enough that I don't notice it
in Emacs, and the source code has to be indexed so that I can move around
quickly. It also means that the build system has to be fast: waiting more
than a few seconds for compilation when you make a simple change to a C
file is unacceptable.
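
The run-time problem size rule costs only a few lines. A PETSc code would
normally read the value from the options database; the sketch below uses plain
argv so that it stands alone. The option name "-n", the default of 16, and the
function name problem_size_from_args are assumptions made for illustration.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical default: small enough for a quick unoptimized debug run. */
#define DEFAULT_PROBLEM_SIZE 16L

/* Scan argv for "-n <size>"; fall back to the default if the option is
   absent or its value is not a positive integer. */
long problem_size_from_args(int argc, char **argv)
{
    int i;

    for (i = 1; i + 1 < argc; i++) {
        if (strcmp(argv[i], "-n") == 0) {
            char *end = NULL;
            long n;

            errno = 0;
            n = strtol(argv[i + 1], &end, 10);
            if (errno == 0 && end != argv[i + 1] && *end == '\0' && n > 0)
                return n;
            return DEFAULT_PROBLEM_SIZE;
        }
    }
    return DEFAULT_PROBLEM_SIZE;
}
```

With this in place, "./app" runs the tiny case under the debugger or valgrind,
and "./app -n 1000000" runs the real one, without recompiling.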

If you follow these guidelines, I think you will end up answering your own
questions in less time than it takes to ask them, and when you find that you
need to ask, you will have plenty of relevant information, so that hopefully
we won't need several rounds of email ping-pong to get oriented.