[petsc-dev] Strange failure in PetscSF when using Open MPI 1.10.2 in OS X Travis-CI

Lawrence Mitchell lawrence.mitchell at imperial.ac.uk
Thu Jun 16 05:16:39 CDT 2016



On 15/06/16 19:59, Lisandro Dalcin wrote:
> This is the failing build:
> https://travis-ci.org/petsc/petsc/jobs/137818148
> 
> A similar build with MPICH does not generate this error:
> https://travis-ci.org/petsc/petsc/jobs/137818145
> 
> Maybe Open MPI bug?

I don't think so.  I can reproduce this on ubuntu/16.04 with openmpi
1.10.2.  The problem is, I think, as follows:

When sfbasic sets up a pack in PetscSFBasicPackTypeSetup it does a
bunch of comparisons for the datatype.  In this case, the unit
datatype is an MPI_Type_contiguous(4, MPIU_COMPLEX).

So the check

  ierr = MPIPetsc_Type_compare_contig(unit,MPIU_COMPLEX,&nPetscComplexContig);CHKERRQ(ierr);

should return true in nPetscComplexContig.

But it doesn't.  Why?

MPIPetsc_Type_compare_contig unwraps the passed in types.  Neither was
dupped so we pull apart unit:

MPI_Type_get_envelope(unit, ...)

The combiner is contiguous, great.  So now we do:

MPI_Type_get_contents(unit, ...)

This returns one datatype that "is equivalent to the datatype used
when creating unit".  It is only *equal* to the datatype used if
MPIU_COMPLEX is a predefined data type.  But if PETSC_CLANGUAGE_CXX is
defined, then MPIU_COMPLEX is *not* a predefined datatype, afaict.

So now the check:

    if (atypes[0] == btype) *n = aints[0];

fails, and we don't determine that the type is contiguous, and so we
fall through to the "generic" code around line 640 in sfbasic.c.  The
sizeof(int) is 4 and the number of bytes in the type is 16*4, so this
is not handled.  Hence the error.

I think the correct fix for this is to use MPIPetsc_Type_compare to
compare atypes[0] and btype, rather than expecting object identity.

We then run into a further error in PetscSFBasicGetPackInUse, because
we call:

MPIPetsc_Type_compare(unit, link->unit)

where link->unit is created from MPI_Type_dup(unit, &link->unit)

So we'll unwrap link->unit and return the type that
MPI_Type_get_contents returns.  But now we run into the same problem:
that type is only equivalent to unit, not the same handle, so an
identity comparison fails.

There's a comment in MPIPetsc_Type_compare that the internal
comparison should be recursive.  With that addition as well, the ex3
tests pass again.
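
To illustrate the difference between comparing handles and comparing
structure, here is a toy model in plain C.  It does not use MPI or
PETSc at all: the Type struct, the Kind enum and the functions unwrap,
type_equal, compare_contig and compare_contig_identity are all
hypothetical names made up for this sketch, and it is not the attached
patch.  It only mimics the idea that a derived type can be rebuilt as a
separate but equivalent object, so that pointer equality fails where a
recursive comparison succeeds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an MPI datatype.  This is NOT the MPI API. */
typedef enum { KIND_NAMED, KIND_DUP, KIND_CONTIGUOUS } Kind;

typedef struct Type {
  Kind kind;
  int id;                   /* identifies a named (predefined) type */
  int count;                /* replication count for KIND_CONTIGUOUS */
  const struct Type *base;  /* wrapped type for DUP and CONTIGUOUS */
} Type;

/* Strip any number of dup wrappers. */
static const Type *unwrap(const Type *t)
{
  assert(t != NULL);
  while (t->kind == KIND_DUP) t = t->base;
  return t;
}

/* Recursive structural comparison: two types match if, after
   unwrapping dups, they are the same named type or contiguous types
   with equal counts and matching bases. */
static bool type_equal(const Type *a, const Type *b)
{
  a = unwrap(a);
  b = unwrap(b);
  if (a == b) return true;
  if (a->kind != b->kind) return false;
  if (a->kind == KIND_NAMED) return a->id == b->id;
  return a->count == b->count && type_equal(a->base, b->base);
}

/* Returns n if a is n contiguous copies of b, otherwise 0.
   The base is compared structurally. */
static int compare_contig(const Type *a, const Type *b)
{
  a = unwrap(a);
  if (a->kind != KIND_CONTIGUOUS) return 0;
  return type_equal(a->base, b) ? a->count : 0;
}

/* Same check, but the base is compared by identity only.  This
   mimics the "atypes[0] == btype" test. */
static int compare_contig_identity(const Type *a, const Type *b)
{
  a = unwrap(a);
  if (a->kind != KIND_CONTIGUOUS) return 0;
  return (a->base == b) ? a->count : 0;
}
```

In this model, if the complex type is itself a contiguous pair of
doubles (an assumption made for the sketch) and unit is built on a
separate but structurally identical copy of it, compare_contig_identity
reports 0 while compare_contig reports 4.  A dup of unit also compares
equal to unit under type_equal.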

Tentative patch to fix this problem attached.

Cheers,

Lawrence
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-SF-handle-case-where-base-datatype-was-not-predefine.patch
Type: text/x-patch
Size: 2526 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20160616/794da0f1/attachment.bin>