[mpich2-dev] 0 byte derived types

Jeff Hammond jhammond at alcf.anl.gov
Mon Apr 23 09:23:46 CDT 2012


A bug appeared when BGP came online and has reappeared on BGQ.  It
relates to MPI_Bcast of a non-zero number of 0-byte derived datatypes.
ScaLAPACK is one source of this pattern.  They have a workaround, but
it seems to me that either ScaLAPACK is using MPI in a non-compliant
way or there is a bug in MPICH2 that has persisted across many major
version releases.

Are you guys aware of this?  Has it been fixed in 1.5?  Is there a
test to make sure there is no regression in the future?  The ScaLAPACK
code and comment noting the problem with MPICH are below, as is a long
email thread with Brian Smith and Lee on this topic.

Thanks,

Jeff



#include "Bdef.h"
MPI_Datatype BI_GetMpiGeType(BLACSCONTEXT *ctxt, int m, int n, int lda,
                               MPI_Datatype Dtype, int *N)
{
  int info;
  MPI_Datatype GeType;

/*
 * Some versions of mpich and its derivitives cannot handle 0 byte typedefs,
 * so we set type MPI_BYTE as a flag for a 0 byte message
 */
#ifdef ZeroByteTypeBug
  if ( (m < 1) || (n < 1) )
  {
     *N = 0;
     return (MPI_BYTE);
  }
#endif
  *N = 1;
  info=MPI_Type_vector(n, m, lda, Dtype, &GeType);
  info=MPI_Type_commit(&GeType);

  return(GeType);
}
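
To make the arithmetic concrete, here is a minimal sketch in plain C (no
MPI needed) of what BI_GetMpiGeType hands to the broadcast. The helper
names ge_type_bytes and ge_bcast_count are hypothetical stand-ins, not
BLACS or MPICH functions. The size formula assumes what MPI_Type_size
reports for a type built by MPI_Type_vector(n, m, lda, Dtype): n blocks
of m elements, with the stride lda affecting only the extent.

```c
#include <stddef.h>

/* Hypothetical model of the data size of the type built by
 * MPI_Type_vector(n, m, lda, Dtype): n blocks of m elements each.
 * The stride lda changes the extent, not the size, so it is omitted. */
size_t ge_type_bytes(int m, int n, size_t elem_size)
{
    if (m < 1 || n < 1)
        return 0;               /* empty matrix: a 0-byte datatype */
    return (size_t)m * (size_t)n * elem_size;
}

/* Hypothetical model of the count BI_GetMpiGeType reports through *N.
 * With the ZeroByteTypeBug workaround the empty case yields 0;
 * without it the count is always 1. */
int ge_bcast_count(int m, int n, int workaround)
{
    if (workaround && (m < 1 || n < 1))
        return 0;
    return 1;
}
```

For m = 0, n = 4 and 8-byte elements, ge_type_bytes returns 0 while
ge_bcast_count(0, 4, 0) returns 1, which is the non-zero count of a
0-byte type that reaches MPI_Bcast when the workaround is not compiled in.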


---------- Forwarded message ----------
From: Brian Smith <smithbr at us.ibm.com>
Date: Mon, Apr 23, 2012 at 8:32 AM
Subject: Re: [td-support #113586] PAMI assertions
To: Jeff Hammond <jhammond at alcf.anl.gov>
Cc: jeff.science at gmail.com, Lee Killough <killough at alcf.anl.gov>,
"td-support at alcf.anl.gov" <td-support at alcf.anl.gov>


Well, there are 2 different bugs here.

(from memory) 1) We found places where SCALAPACK made assumptions about
uninitialized variables that caused significant badness in a number of
apps. I believe someone reported this to the SCALAPACK maintainers
many years ago. In fact, they ran something like valgrind and provided
a patch for *all* of the usage of uninitialized variables. The
SCALAPACK people did not integrate the changes at the time. Perhaps
this new release will have some of them added so we don't have to deal
with that again. See
http://icl.cs.utk.edu/lapack-forum/viewtopic.php?f=2&t=588&p=1911&hilit=trsm#p1911
and
http://icl.cs.utk.edu/lapack-forum/viewtopic.php?f=13&t=2625

(more recently, based on different glue punt-to-MPICH-logic) 2)
SCALAPACK creates zero-length datatypes and then calls bcast with
nonzero counts. The glue didn't test for this condition.
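
The missing check in case 2 amounts to something like the sketch below.
It is plain C with hypothetical names (glue_bcast_path, BCAST_NOOP,
BCAST_OPTIMIZED) and is not the actual pamid glue code. It only
illustrates testing the total byte count, rather than the count alone,
before choosing the optimized protocol.

```c
#include <stddef.h>

/* Hypothetical outcome of the dispatch decision in the collectives glue. */
enum bcast_path {
    BCAST_NOOP,       /* nothing to move: return success immediately */
    BCAST_OPTIMIZED   /* hand off to the optimized protocol */
};

/* Decide how to handle a broadcast of `count` items of `type_size` bytes.
 * count is assumed non-negative (argument checking is assumed to have
 * happened earlier). Checking count alone is not enough: count > 0 with
 * type_size == 0 still moves zero bytes, and the optimized path asserts
 * that the byte count is non-zero. */
enum bcast_path glue_bcast_path(int count, size_t type_size)
{
    size_t bytes = (size_t)count * type_size;
    return bytes == 0 ? BCAST_NOOP : BCAST_OPTIMIZED;
}
```

With this shape, glue_bcast_path(1, 0) and glue_bcast_path(0, 8) both take
the no-op path, and only a broadcast that actually carries data reaches
the optimized protocol.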

A test for #2 in the MPICH2 test bucket wouldn't be a bad thing, but
#1 is of course beyond the scope of MPICH2. It is possible there is a
test for #2 already but because of other circumstances (node count for
example) we might not have seen a failure. I'm sure Dave G. or someone
could comment on that.



Brian Smith (smithbr at us.ibm.com)
BlueGene MPI Development/
Communications Team Lead
IBM Master Inventor
IBM Rochester
Phone: 507 253 4717




From:        Jeff Hammond <jhammond at alcf.anl.gov>
To:        Brian Smith/Rochester/IBM at IBMUS
Cc:        Lee Killough <killough at alcf.anl.gov>,
"td-support at alcf.anl.gov" <td-support at alcf.anl.gov>
Date:        04/21/2012 08:51 PM
Subject:        Re: [td-support #113586] PAMI assertions
Sent by:        jeff.science at gmail.com
________________________________



If this showed up on BGP and now on BGQ, why was it not added to the
MPICH2 test suite 3+ years ago?  This is a bug in MPICH2 according to
the comments in ScaLAPACK and the fact that both BGP and BGQ suffered
it despite forking vastly different code bases, right?

I'm trying to write a standalone test for this, btw, but haven't been
successful yet.

Jeff

On Sat, Apr 21, 2012 at 8:46 PM, Brian Smith <smithbr at us.ibm.com> wrote:
> Hi Lee,
>
> It's not actually a user error, what SCALAPACK is doing is (probably, I
> didn't look at it too much) valid MPI code. However, it appears to be a
> weird fringe case that none of the test cases that come with MPICH, nor the
> gigantic Intel/ANL testbucket found.
>
> Basically, we were missing an if() check in the collectives glue to check
> for nonzero counts of zero length datatypes. The optimized protocols don't
> deal with things like that which is why there was an assert().
>
>
>
>
>
>
> From:        Lee Killough <killough at alcf.anl.gov>
> To:        Jeff Hammond <jhammond at alcf.anl.gov>
> Cc:        "td-support at alcf.anl.gov" <td-support at alcf.anl.gov>, Brian
> Smith/Rochester/IBM at IBMUS
> Date:        04/20/2012 11:08 PM
> Subject:        Re: [td-support #113586] PAMI assertions
> ________________________________
>
>
>
> Sorry, a busy evening after 6 pm, fast forwarding to this email, and have
> not read previous.
>
> First, if it's a user error, it should never be diagnosed in an assert().
> assert() is only intended for catching internal errors, and should be turned
> off in production code. It being an assert() immediately threw me off and
> made me think it was a configuration issue or mismatched libraries, etc.
>
> A new version of ScaLAPACK is about to be released, maybe even as I send
> this email. I have been working closely with the developers for the past
> month on several bugs, some of which are only seen on BG, such as illegal
> Fortran calls with overlapping arguments.
>
> If we can identify the cause and work on a fix for this bcast problem, I may
> be able to get it in before the next release, or maybe not. If you have a
> BLACS code fix suggestion, please send it and I'll try to get the fix in the
> next version of ScaLAPACK.
>
> Thanks,
> Lee
>
> On Apr 20, 2012, at 22:44, Jeff Hammond <jhammond at alcf.anl.gov> wrote:
>
>> IBM says it's a ScaLAPACK problem but that the latest MPI/PAMI has a
>> fix anyways.
>>
>> See if this makes sense from the BLACS code.  We can look through MPI
>> standard together next week to see if BLACS violates it.  It would be
>> the first time Clint Whaley completely screwed up using MPI.
>>
>> Jeff
>>
>> ---------- Forwarded message ----------
>> From: Brian Smith <smithbr at us.ibm.com>
>> Date: Fri, Apr 20, 2012 at 8:47 PM
>> Subject: Re: Fwd: [td-support #113586] PAMI assertions
>> To: Jeff Hammond <jhammond at alcf.anl.gov>
>> Cc: jeff.science at gmail.com, Michael Blocksome <blocksom at us.ibm.com>
>>
>> It's goofy datatype stuff in SCALAPACK. There's a fix in head... I
>> didn't/don't feel it was worth efixing.
>>
>> I forget if the problem was a nonzero count with a zero-byte
>> constructed datatype or a zero count with a non-zero byte constructed
>> datatype, something stupid like that, so it's unlikely a real
>> application is going to hit it.
>>
>>
>> On Fri, Apr 20, 2012 at 7:34 PM, Jeff Hammond <jhammond at alcf.anl.gov>
>> wrote:
>>> 0-byte bcast is fine with the MPI I always use.
>>>
>>> Can you print the args at
>>> /gpfs/veas-fs0/killough/libs/build/SCALAPACK-xl/BLACS/SRC/dgebr2d_.c:127
>>> and see what I need to test?
>>>
>>> Thanks,
>>>
>>> Jeff
>>>
>>> On Fri, Apr 20, 2012 at 7:18 PM, Jeff Hammond <jhammond at alcf.anl.gov>
>>> wrote:
>>>> Assuming I am looking at the same code (PAMI Git head is V1R1M1
>>>> already and I'm too Git-impaired to toggle for V1R1M0, nor do I want
>>>> to do this in any case), the assertion that fails indicates a 0-byte
>>>> message is being attempted.
>>>>
>>>> I'll write a test of 0-byte MPI_Bcast right now.  Which MPI library
>>>> are you linking against?
>>>>
>>>> Jeff
>>>>
>>>> =================================================
>>>>          pami_result_t  postShortCollective (uint32_t        opcode,
>>>>                                              uint32_t        sizeoftype,
>>>>                                              uint32_t        bytes,
>>>>                                              char          * src,
>>>>                                              PipeWorkQueue * dpwq,
>>>>                                              pami_event_function cb_done,
>>>>                                              void          * cookie,
>>>>                                              unsigned        classroute)
>>>>          {
>>>>            TRACE_FN_ENTER();
>>>>            TRACE_FORMAT("opcode %u, sizeoftype %u, bytes %u, src %p, dpwq %p, classroute %u",
>>>>                         opcode, sizeoftype, bytes, src, dpwq, classroute);
>>>>            PAMI_assert (bytes <= _collstate._tempSize);
>>>>            PAMI_assert(bytes);  /* <--- JEFF: This is line 284 */
>>>>            _int64Cpy(_collstate._tempBuf, src, bytes);
>>>>            //memcpy(_collstate._tempBuf, src, bytes);
>>>> ...
>>>> =================================================
>>>>
>>>> On Fri, Apr 20, 2012 at 6:06 PM, Lee Killough <killough at alcf.anl.gov>
>>>> wrote:
>>>>> With the new GA driver, I'm getting a lot of PAMI assertions when
>>>>> running ScaLAPACK programs. The traceback:
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>> Program   : ./xzsep
>>>>> ------------------------------------------------------------------------
>>>>> +++ID Rank: 0, TGID: 1, Core: 0, HWTID:0 TID: 1 State: RUN
>>>>>
>>>>> 00000000016a3638
>>>>> abort
>>>>> /bgsys/drivers/V1R1M0/ppc64/toolchain/gnu/glibc-2.12.2/stdlib/abort.c:77
>>>>>
>>>>> 000000000169c668
>>>>> __assert_fail
>>>>> /bgsys/drivers/V1R1M0/ppc64/toolchain/gnu/glibc-2.12.2/assert/assert.c:81
>>>>>
>>>>> 000000000149774c
>>>>> PAMI::Device::MU::CollectiveDmaModelBase::postShortCollective(unsigned int, unsigned int, unsigned int, char*, PAMI::PipeWorkQueue*, void (*)(void*, void*, pami_result_t), void*, unsigned int)
>>>>> /bgsys/source/srcV1R1M0.5670/comm/sys/buildtools/pami/components/devices/bgq/mu2/model/CollectiveDmaModelBase.h:284
>>>>>
>>>>> 0000000001497954
>>>>> PAMI::Device::MU::CollectiveMulticastDmaModel::postMulticastImmediate_impl(unsigned long, unsigned long, pami_multicast_t*, void*)
>>>>> /bgsys/source/srcV1R1M0.5670/comm/sys/buildtools/pami/components/devices/bgq/mu2/model/CollectiveMulticastDmaModel.h:107
>>>>>
>>>>> 0000000001394304
>>>>> PAMI::Geometry::Algorithm<PAMI::Geometry::Common>::generate(pami_xfer_t*)
>>>>> /bgsys/source/srcV1R1M0.5670/comm/sys/buildtools/pami/algorithms/geometry/Algorithm.h:45
>>>>>
>>>>> 0000000001359dec
>>>>> MPIDO_Bcast
>>>>> /bgsys/source/srcV1R1M0.5670/comm/lib/dev/mpich2/src/mpid/pamid/src/coll/bcast/mpido_bcast.c:146
>>>>>
>>>>> 00000000012f95cc
>>>>> MPIR_Bcast_impl
>>>>> /bgsys/source/srcV1R1M0.5670/comm/lib/dev/mpich2/src/mpi/coll/bcast.c:1310
>>>>>
>>>>> 00000000012f997c
>>>>> PMPI_Bcast
>>>>> /bgsys/source/srcV1R1M0.5670/comm/lib/dev/mpich2/src/mpi/coll/bcast.c:1464
>>>>>
>>>>> 000000000101b980
>>>>> dgebr2d
>>>>> /gpfs/veas-fs0/killough/libs/build/SCALAPACK-xl/BLACS/SRC/dgebr2d_.c:127
>>>>>
>>>>> 00000000010670f4
>>>>> pdlared1d
>>>>> /gpfs/veas-fs0/killough/libs/build/SCALAPACK-xl/SRC/pdlared1d.f:156
>>>>>
>>>>> 0000000001045350
>>>>> pzheevx
>>>>> /gpfs/veas-fs0/killough/libs/build/SCALAPACK-xl/SRC/pzheevx.f:839
>>>>>
>>>>> 00000000010084e0
>>>>> pzsepsubtst
>>>>> /gpfs/veas-fs0/killough/libs/build/SCALAPACK-xl/TESTING/EIG/pzsepsubtst.f:396
>>>>>
>>>>> 0000000001002874
>>>>> pzseptst
>>>>> /gpfs/veas-fs0/killough/libs/build/SCALAPACK-xl/TESTING/EIG/pzseptst.f:565
>>>>>
>>>>> 00000000010123ec
>>>>> pzsepreq
>>>>> /gpfs/veas-fs0/killough/libs/build/SCALAPACK-xl/TESTING/EIG/pzsepreq.f:205
>>>>>
>>>>> 00000000010112e8
>>>>> pzsepdriver
>>>>> /gpfs/veas-fs0/killough/libs/build/SCALAPACK-xl/TESTING/EIG/pzsepdriver.f:229
>>>>>
>>>>> 0000000001699b08
>>>>> generic_start_main
>>>>> /bgsys/drivers/V1R1M0/ppc64/toolchain/gnu/glibc-2.12.2/csu/../csu/libc-start.c:226
>>>>>
>>>>> 0000000001699e04
>>>>> __libc_start_main
>>>>> /bgsys/drivers/V1R1M0/ppc64/toolchain/gnu/glibc-2.12.2/csu/../sysdeps/unix/sysv/linux/powerpc/libc-start.c:194
>>>>>
>>>>> 0000000000000000
>>>>> ??
>>>>> ??:0
>>>>>
>>>>> It looks like an assertion is failing at:
>>>>>
>>>>> PAMI::Device::MU::CollectiveDmaModelBase::postShortCollective(unsigned int, unsigned int, unsigned int, char*, PAMI::PipeWorkQueue*, void (*)(void*, void*, pami_result_t), void*, unsigned int)
>>>>> /bgsys/source/srcV1R1M0.5670/comm/sys/buildtools/pami/components/devices/bgq/mu2/model/CollectiveDmaModelBase.h:284
>>>>>
>>>>> during a broadcast.
>>>>>
>>>>> I don't recall these errors in the previous driver.
>>>>>
>>>>> Lee
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Jeff Hammond
>>>> Argonne Leadership Computing Facility
>>>> University of Chicago Computation Institute
>>>> jhammond at alcf.anl.gov / (630) 252-5381
>>>> http://www.linkedin.com/in/jeffhammond
>>>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond (in-progress)
>>>> https://wiki.alcf.anl.gov/old/index.php/User:Jhammond (deprecated)
>>>> https://wiki-old.alcf.anl.gov/index.php/User:Jhammond(deprecated)
>
>
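
Lee's point earlier in this thread, that conditions a caller can trigger
should be reported as errors while assert() is reserved for internal
invariants, can be sketched as follows. The function post_short, the
return codes, and the buffer size are hypothetical illustrations in
plain C, not PAMI code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { POST_OK = 0, POST_EINVAL = 1 };   /* hypothetical return codes */

#define TEMP_SIZE 64                     /* hypothetical staging capacity */
static char temp_buf[TEMP_SIZE];

/* Copy `bytes` from src into a staging buffer.
 * Anything a caller can get wrong is reported through the return value,
 * so the behavior is the same in builds compiled with NDEBUG. */
int post_short(const char *src, size_t bytes)
{
    if (bytes > TEMP_SIZE)
        return POST_EINVAL;      /* caller-visible error, no abort */
    if (bytes == 0)
        return POST_OK;          /* 0-byte message: a valid no-op */
    if (src == NULL)
        return POST_EINVAL;

    /* Internal invariant: the checks above guarantee this. If it ever
     * fails, the bug is in this function, not in the caller. */
    assert(bytes <= sizeof temp_buf);

    memcpy(temp_buf, src, bytes);
    return POST_OK;
}
```

Under these assumptions a 0-byte request returns POST_OK instead of
aborting, and an oversized request returns POST_EINVAL, so neither case
depends on assertions being enabled.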








