[petsc-dev] Strange failure in PetscSF when using Open MPI 1.10.2 in OS X Travis-CI
Lawrence Mitchell
lawrence.mitchell at imperial.ac.uk
Thu Jun 16 05:16:39 CDT 2016
On 15/06/16 19:59, Lisandro Dalcin wrote:
> This is the failing build:
> https://travis-ci.org/petsc/petsc/jobs/137818148
>
> A similar build with MPICH does not generate this error:
> https://travis-ci.org/petsc/petsc/jobs/137818145
>
> Maybe Open MPI bug?
I don't think so. I can reproduce this on Ubuntu 16.04 with Open MPI
1.10.2. The problem is, I think, as follows:
When sfbasic sets up a pack in PetscSFBasicPackTypeSetup it does a
bunch of comparisons for the datatype. In this case, the unit
datatype is an MPI_Type_contiguous(4, MPIU_COMPLEX).
So the check
ierr = MPIPetsc_Type_compare_contig(unit,MPIU_COMPLEX,&nPetscComplexContig);CHKERRQ(ierr);
should return true in nPetscComplexContig.
But it doesn't. Why?
MPIPetsc_Type_compare_contig unwraps the passed-in types. Neither was
dup'ed, so we pull apart unit:
MPI_Type_get_envelope(unit, ...)
The combiner is contiguous, great. So now we do:
MPI_Type_get_contents(unit, ...)
This returns one datatype that "is equivalent to the datatype used
when creating unit". It is only *equal* to the datatype used if
MPIU_COMPLEX is a predefined data type. But if PETSC_CLANGUAGE_CXX is
defined, then MPIU_COMPLEX is *not* a predefined datatype, afaict.
So now the check:
if (atypes[0] == btype) *n = aints[0];
fails, and we don't determine that the type is contiguous, and so we
fall through to the "generic" code around line 640 in sfbasic.c. The
sizeof(int) is 4 and the number of bytes in the type is 16*4, so this
is not handled. Hence the error.
I think the correct fix for this is to use MPIPetsc_Type_compare to
compare atypes[0] and btype, rather than expecting object identity.
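To make the difference concrete, here is a minimal toy model in plain C. It does not use MPI: ToyType, toy_type_equal and toy_compare_contig are hypothetical names invented for illustration, and the "equivalent but not identical" handle that MPI_Type_get_contents hands back for a derived base type is modelled by a local copy. It also assumes, purely for the sake of the example, that the complex type is built as two contiguous doubles.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an MPI datatype handle. NOT the MPI API. */
typedef enum { TOY_NAMED, TOY_CONTIGUOUS } ToyCombiner;

typedef struct ToyType {
  ToyCombiner           combiner; /* how the type was built */
  int                   count;    /* TOY_NAMED: size in bytes; TOY_CONTIGUOUS: repeat count */
  const struct ToyType *base;     /* element type, NULL for TOY_NAMED */
} ToyType;

/* Structural equality: same combiner, same count, structurally equal base. */
static bool toy_type_equal(const ToyType *a, const ToyType *b)
{
  if (a == b) return true;
  if (!a || !b) return false;
  if (a->combiner != b->combiner || a->count != b->count) return false;
  return toy_type_equal(a->base, b->base);
}

/* Returns the repeat count if unit is "count copies of btype", else 0.
 * structural == false mimics the identity check `atypes[0] == btype`;
 * structural == true mimics comparing the two types by structure. */
static int toy_compare_contig(const ToyType *unit, const ToyType *btype, bool structural)
{
  if (unit->combiner != TOY_CONTIGUOUS) return 0;

  /* Model of "get contents": a named base comes back as the same handle,
   * a derived base comes back as an equivalent but distinct handle. */
  ToyType        copy = *unit->base;
  const ToyType *elem = (unit->base->combiner == TOY_NAMED) ? unit->base : &copy;

  bool match = structural ? toy_type_equal(elem, btype) : (elem == btype);
  return match ? unit->count : 0;
}
```

With a derived complex type {TOY_CONTIGUOUS, 2, &dbl} as the base of a 4-element contiguous unit, the identity variant returns 0 and the structural variant returns 4. With a named base such as dbl itself, both variants return 4, which would be consistent with the MPICH build not hitting the problem if MPIU_COMPLEX is predefined there.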
We then run into a further error in PetscSFBasicGetPackInUse, because
we call:
MPIPetsc_Type_compare(unit, link->unit)
where link->unit is created from MPI_Type_dup(unit, &link->unit)
So we'll unwrap link->unit and return the type that
MPI_Type_get_contents returns. But now we run into the same problem:
this type just "looks the same" as unit, and isn't the same.
There's a comment in MPIPetsc_Type_compare that the internal
comparison should be recursive. With that addition as well, the ex3
tests pass again.
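A rough sketch of why the recursion matters, again as a toy model in plain C rather than real MPI code. ToyType, toy_unwrap_dup, toy_compare_shallow and toy_compare_recursive are hypothetical names for illustration only; they are not taken from the attached patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an MPI datatype handle. NOT the MPI API. */
typedef enum { TOY_NAMED, TOY_DUP, TOY_CONTIGUOUS } ToyCombiner;

typedef struct ToyType {
  ToyCombiner           combiner; /* how the type was built */
  int                   count;    /* repeat count for TOY_CONTIGUOUS, unused otherwise */
  const struct ToyType *base;     /* inner type, NULL for TOY_NAMED */
} ToyType;

/* Strip any number of dup layers. */
static const ToyType *toy_unwrap_dup(const ToyType *t)
{
  while (t->combiner == TOY_DUP) t = t->base;
  return t;
}

/* One level only: after unwrapping, the base types must be the same handle. */
static bool toy_compare_shallow(const ToyType *a, const ToyType *b)
{
  a = toy_unwrap_dup(a);
  b = toy_unwrap_dup(b);
  if (a == b) return true;
  if (a->combiner != b->combiner) return false;
  if (a->combiner == TOY_CONTIGUOUS) return a->count == b->count && a->base == b->base;
  return false; /* two distinct named types never match */
}

/* Recursive: the base types are compared with the same procedure. */
static bool toy_compare_recursive(const ToyType *a, const ToyType *b)
{
  a = toy_unwrap_dup(a);
  b = toy_unwrap_dup(b);
  if (a == b) return true;
  if (a->combiner != b->combiner) return false;
  if (a->combiner == TOY_CONTIGUOUS)
    return a->count == b->count && toy_compare_recursive(a->base, b->base);
  return false; /* two distinct named types never match */
}
```

If unit is 4 x cplx_a and link->unit is a dup of 4 x cplx_b, where cplx_a and cplx_b are separate but equivalent derived handles over the same named double, the shallow comparison reports a mismatch while the recursive one reports a match.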
Tentative patch to fix this problem attached.
Cheers,
Lawrence
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-SF-handle-case-where-base-datatype-was-not-predefine.patch
Type: text/x-patch
Size: 2526 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20160616/794da0f1/attachment.bin>