Petsc on Blue Gene

Brian Biskeborn bbiskebo at us.ibm.com
Wed Jul 11 13:26:36 CDT 2007


> > > Can you send a log of these messages? Is this on BGL or BGP? Does the
> > > program abort? [on encountering these messages]
> >
> > The program does not abort on exceptions - the only evidence of the
> > problem is messages in the event log reading "Kernel detected X
> > floating point alignment exceptions" (where X is a number usually on
> > the order of 10^5) followed by what looks like a series of register
> > values. I'm running on BGL.

> Is this event log in some system logs that users have no access to?
> Where is this logfile? [I'm guessing it's neither JOBID.output nor
> JOBID.error]

The exceptions are displayed in the so-called RAS event log. Users here
have direct access to this log, but I'm not sure the same facility is
necessarily available at every Blue Gene installation.

> >
> > > With the minimal runs I've done on BGL - I don't remember seeing any
> > > such messages.
> >
> > > [Barry can confirm this] the code in mal.c attempts to make sure the
> > > memory allocated by PETSc is aligned properly. [8 byte boundary for
> > > doubles]
> >
> > > One possibility is that the data passed in to MatAssemblyBegin() is
> > > not aligned?
> >
> > This says to me that the unaligned data is probably being generated
> > outside of PETSc. Thanks for the info, I now have a much better idea
> > about where to look for the problem!

> If the problem exists in PETSc, it should be reproducible with a
> PETSc example [perhaps mat/examples/tests/ex2.c - which does
> MatSetValues()]

> cqsub -n 2 -t 2 ex9

Well, what do you know? I'm getting floating point exceptions with
mat/examples/tests/ex2. There are two relevant lines in the RAS event log.
They are:
Kernel detected 6 floating point alignment exceptions (1) iar 0x0026f4a8,
dear 0x0069c22c (2) iar 0x0026f4a8, dear 0x0069c23c (3) iar 0x0026f4a8,
dear 0x0069c24c (4) iar 0x0026f4a8, dear 0x0069c25c (5) iar 0x0026f4a8,
dear 0x0069c26c (6) iar 0x0026f4a8, dear 0x0069c27c
Kernel detected 36 floating point alignment exceptions (29) iar 0x0026f4a8,
dear 0x006b48ac (30) iar 0x0026f4a8, dear 0x006b48bc (31) iar 0x0026f4a8,
dear 0x006b48cc (32) iar 0x0026f4a8, dear 0x006b48dc (33) iar 0x0026f4a8,
dear 0x006b48ec (34) iar 0x0026f4a8, dear 0x006b48fc (35) iar 0x0026f4a8,
dear 0x006b490c (36) iar 0x0026f4a8, dear 0x006b491c

The 42 exceptions above break down as follows: 10 in the test of MatNorm,
12 in MatTranspose, 10 in the 2nd MatNorm, and 10 during the test of
MatAXPY.

I built the PETSc used above with a modified version of mal.c that
#defines PETSC_MEMALIGN to 32 (to be on the safe side) and calls
posix_memalign to allocate aligned memory. As far as I know, the only
way to get alignment exceptions with that change would be to introduce a
misalignment manually somewhere (for example, storing an int immediately
followed by a double at the beginning of an aligned memory block).

Sorry, I totally forgot to mention this earlier: the code I'm working with
requires PETSc 2.3.0, so I'm not using the latest version.

Any suggestions on where this data misalignment might be occurring?

Thanks,
Brian



