[mpich2-dev] 0 byte derived types

Dave Goodell goodell at mcs.anl.gov
Mon May 7 14:23:02 CDT 2012


FYI, I've added a test for this in r9835.  Stock MPICH2 passed it, although I made a few fixes in r9836 to avoid doing some pointless communication.  It looks like there was a potential bug in the heterogeneous support case (now fixed), but all of that support has been untested for years.
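
The pattern the test exercises looks roughly like the following minimal
sketch (this is not the actual code from r9835, just the shape of the
corner case):

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      double buf[1];
      MPI_Datatype zerotype;

      MPI_Init(&argc, &argv);
      /* a committed derived type that describes zero bytes, mirroring the
         BLACS case where m or n is 0 */
      MPI_Type_vector(0, 1, 1, MPI_DOUBLE, &zerotype);
      MPI_Type_commit(&zerotype);
      /* a nonzero count of a zero-byte type: the payload is still 0 bytes */
      MPI_Bcast(buf, 4, zerotype, 0, MPI_COMM_WORLD);
      MPI_Type_free(&zerotype);
      MPI_Finalize();
      return 0;
  }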

-Dave

On Apr 23, 2012, at 10:22 AM CDT, Jeff Hammond wrote:

> If you read Brian's email, he indicates that there is a problem with
> PAMI, but this is also a problem at the MPI level, if one is to
> believe the comment in BLACS.  Lee noted that he had to work around
> this problem with NEC-MPI as well, so it appears to be a corner case
> that is overlooked in multiple implementations (for good reason; this
> use case is ridiculous).
> 
> Thanks,
> 
> Jeff
> 
> On Mon, Apr 23, 2012 at 10:19 AM, Dave Goodell <goodell at mcs.anl.gov> wrote:
>> The assert you quote in the oldest email in the thread is in PAMI code, so I don't think this is directly an MPICH2 bug.  I do agree, however, that zero-size, non-zero-count bcast should be tested in our test suite and that we have a coverage gap here.
>> 
>> I'll add creating a test for this to my TODO list.
>> 
>> -Dave
>> 
>> On Apr 23, 2012, at 9:23 AM CDT, Jeff Hammond wrote:
>> 
>>> A bug appeared when BGP came online and has reappeared on BGQ.  It
>>> relates to MPI_Bcast of a non-zero count of 0-byte derived datatypes.
>>> ScaLAPACK is one source of this pattern.  They have a workaround, but
>>> it seems that either ScaLAPACK is using MPI in a non-compliant way or
>>> there is a bug in MPICH2 that has persisted across many major version
>>> releases.
>>> 
>>> Are you guys aware of this?  Has it been fixed in 1.5?  Is there a
>>> test to make sure there is no regression in the future?  The ScaLAPACK
>>> code and comment noting the problem with MPICH are below, as is a long
>>> email thread with Brian Smith and Lee on this topic.
>>> 
>>> Thanks,
>>> 
>>> Jeff
>>> 
>>> 
>>> 
>>> #include "Bdef.h"
>>> MPI_Datatype BI_GetMpiGeType(BLACSCONTEXT *ctxt, int m, int n, int lda,
>>>                               MPI_Datatype Dtype, int *N)
>>> {
>>>  int info;
>>>  MPI_Datatype GeType;
>>> 
>>> /*
>>> * Some versions of mpich and its derivatives cannot handle 0 byte typedefs,
>>> * so we set type MPI_BYTE as a flag for a 0 byte message
>>> */
>>> #ifdef ZeroByteTypeBug
>>>  if ( (m < 1) || (n < 1) )
>>>  {
>>>     *N = 0;
>>>     return (MPI_BYTE);
>>>  }
>>> #endif
>>>  *N = 1;
>>>  info=MPI_Type_vector(n, m, lda, Dtype, &GeType);
>>>  info=MPI_Type_commit(&GeType);
>>> 
>>>  return(GeType);
>>> }
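>>> 
>>> For context, a caller consumes the returned type and count roughly as in
>>> the sketch below; the wrapper function, its arguments, and the communicator
>>> are placeholders, not an actual BLACS call site:
>>> 
>>>  #include "Bdef.h"
>>> 
>>>  static void bcast_ge_matrix(BLACSCONTEXT *ctxt, double *A, int m, int n,
>>>                              int lda, int root, MPI_Comm comm)
>>>  {
>>>     int N;
>>>     MPI_Datatype GeType = BI_GetMpiGeType(ctxt, m, n, lda, MPI_DOUBLE, &N);
>>> /*
>>> * With ZeroByteTypeBug defined, N == 0 and GeType == MPI_BYTE whenever
>>> * m < 1 or n < 1, so the broadcast moves zero bytes without ever building
>>> * a 0-byte derived type.
>>> */
>>>     MPI_Bcast(A, N, GeType, root, comm);
>>>     if (GeType != MPI_BYTE) MPI_Type_free(&GeType);
>>>  }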
>>> 
>>> 
>>> ---------- Forwarded message ----------
>>> From: Brian Smith <smithbr at us.ibm.com>
>>> Date: Mon, Apr 23, 2012 at 8:32 AM
>>> Subject: Re: [td-support #113586] PAMI assertions
>>> To: Jeff Hammond <jhammond at alcf.anl.gov>
>>> Cc: jeff.science at gmail.com, Lee Killough <killough at alcf.anl.gov>,
>>> "td-support at alcf.anl.gov" <td-support at alcf.anl.gov>
>>> 
>>> 
>>> Well, there are 2 different bugs here.
>>> 
>>> (from memory) 1) We found places where SCALAPACK made assumptions about
>>> uninitialized variables that caused significant badness in a number of
>>> apps. I believe someone reported this to the SCALAPACK maintainers
>>> many years ago. In fact, they ran something like valgrind and provided
>>> a patch for *all* of the uses of uninitialized variables. The
>>> SCALAPACK people did not integrate the changes at the time. Perhaps
>>> this new release will have some of them added so we don't have to deal
>>> with that again. See
>>> http://icl.cs.utk.edu/lapack-forum/viewtopic.php?f=2&t=588&p=1911&hilit=trsm#p1911
>>> and
>>> http://icl.cs.utk.edu/lapack-forum/viewtopic.php?f=13&t=2625
>>> 
>>> (more recently, based on different punt-to-MPICH logic in the glue) 2)
>>> SCALAPACK creates zero-length datatypes and then calls bcast with
>>> nonzero counts. The glue didn't test for this condition.
>>> 
>>> A test for #2 in the MPICH2 test bucket wouldn't be a bad thing, but
>>> #1 is of course beyond the scope of MPICH2. It is possible there is a
>>> test for #2 already but because of other circumstances (node count for
>>> example) we might not have seen a failure. I'm sure Dave G. or someone
>>> could comment on that.
>>> 
>>> 
>>> 
>>> Brian Smith (smithbr at us.ibm.com)
>>> BlueGene MPI Development/
>>> Communications Team Lead
>>> IBM Master Inventor
>>> IBM Rochester
>>> Phone: 507 253 4717
>>> 
>>> 
>>> 
>>> 
>>> From:        Jeff Hammond <jhammond at alcf.anl.gov>
>>> To:        Brian Smith/Rochester/IBM at IBMUS
>>> Cc:        Lee Killough <killough at alcf.anl.gov>,
>>> "td-support at alcf.anl.gov" <td-support at alcf.anl.gov>
>>> Date:        04/21/2012 08:51 PM
>>> Subject:        Re: [td-support #113586] PAMI assertions
>>> Sent by:        jeff.science at gmail.com
>>> ________________________________
>>> 
>>> 
>>> 
>>> If this showed up on BGP and now on BGQ, why was it not added to the
>>> MPICH2 test suite 3+ years ago?  This is a bug in MPICH2 according to
>>> the comments in ScaLAPACK and the fact that both BGP and BGQ suffered
>>> from it despite having vastly different forked code bases, right?
>>> 
>>> I'm trying to write a standalone test for this, btw, but haven't been
>>> successful yet.
>>> 
>>> Jeff
>>> 
>>> On Sat, Apr 21, 2012 at 8:46 PM, Brian Smith <smithbr at us.ibm.com> wrote:
>>>> Hi Lee,
>>>> 
>>>> It's not actually a user error; what SCALAPACK is doing is (probably; I
>>>> didn't look at it too closely) valid MPI code. However, it appears to be a
>>>> weird fringe case that neither the test cases that come with MPICH nor the
>>>> gigantic Intel/ANL test bucket caught.
>>>> 
>>>> Basically, we were missing an if() check in the collectives glue to check
>>>> for nonzero counts of zero-length datatypes. The optimized protocols don't
>>>> deal with cases like that, which is why there was an assert().
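>>>> 
>>>> For illustration, the missing guard amounts to something like the sketch
>>>> below, written against plain MPI rather than the actual PAMI glue (the
>>>> function and its arguments are placeholders):
>>>> 
>>>>   #include <mpi.h>
>>>> 
>>>>   /* returns 1 if the broadcast can be completed as a local no-op instead
>>>>      of being handed to the optimized protocol, which asserts on
>>>>      zero-byte messages */
>>>>   static int bcast_is_noop(int count, MPI_Datatype datatype)
>>>>   {
>>>>      int type_size;
>>>>      MPI_Type_size(datatype, &type_size);
>>>>      /* a nonzero count of a zero-length type still moves zero bytes */
>>>>      return (count == 0 || type_size == 0);
>>>>   }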
>>>> 
>>>> 
>>>> 
>>>> Brian Smith (smithbr at us.ibm.com)
>>>> BlueGene MPI Development/
>>>> Communications Team Lead
>>>> IBM Master Inventor
>>>> IBM Rochester
>>>> Phone: 507 253 4717
>>>> 
>>>> "the scientific community is very A-Buzz with positive reviews of Blue Gene
>>>> ..." - Charles Archer - un-sung hero of technology
>>>> 
>>>> 
>>>> 
>>>> 
>>>> From:        Lee Killough <killough at alcf.anl.gov>
>>>> To:        Jeff Hammond <jhammond at alcf.anl.gov>
>>>> Cc:        "td-support at alcf.anl.gov" <td-support at alcf.anl.gov>, Brian
>>>> Smith/Rochester/IBM at IBMUS
>>>> Date:        04/20/2012 11:08 PM
>>>> Subject:        Re: [td-support #113586] PAMI assertions
>>>> ________________________________
>>>> 
>>>> 
>>>> 
>>>> Sorry, it has been a busy evening after 6 pm; I'm fast-forwarding to this
>>>> email and have not read the previous messages.
>>>> 
>>>> First, if it's a user error, it should never be diagnosed in an assert().
>>>> assert() is only intended for catching internal errors, and should be turned
>>>> off in production code. The fact that it was an assert() immediately threw
>>>> me off and made me think it was a configuration issue or mismatched
>>>> libraries, etc.
>>>> 
>>>> A new version of ScaLAPACK is about to be released, maybe even as I send
>>>> this email. I have been working closely with the developers for the past
>>>> month on several bugs, some of which are only seen on BG, such as illegal
>>>> Fortran calls with overlapping arguments.
>>>> 
>>>> If we can identify the cause and work on a fix for this bcast problem, I may
>>>> be able to get it in before the next release, or maybe not. If you have a
>>>> BLACS code fix suggestion, please send it and I'll try to get the fix in the
>>>> next version of ScaLAPACK.
>>>> 
>>>> Thanks,
>>>> Lee
>>>> 
>>>> On Apr 20, 2012, at 22:44, Jeff Hammond <jhammond at alcf.anl.gov> wrote:
>>>> 
>>>>> IBM says it's a ScaLAPACK problem but that the latest MPI/PAMI has a
>>>>> fix anyway.
>>>>> 
>>>>> See if this makes sense from the BLACS code.  We can look through the MPI
>>>>> standard together next week to see if BLACS violates it.  It would be
>>>>> the first time Clint Whaley completely screwed up using MPI.
>>>>> 
>>>>> Jeff
>>>>> 
>>>>> ---------- Forwarded message ----------
>>>>> From: Brian Smith <smithbr at us.ibm.com>
>>>>> Date: Fri, Apr 20, 2012 at 8:47 PM
>>>>> Subject: Re: Fwd: [td-support #113586] PAMI assertions
>>>>> To: Jeff Hammond <jhammond at alcf.anl.gov>
>>>>> Cc: jeff.science at gmail.com, Michael Blocksome <blocksom at us.ibm.com>
>>>>> 
>>>>> It's goofy datatype stuff in SCALAPACK. There's a fix in head... I
>>>>> didn't/don't feel it was worth efixing.
>>>>> 
>>>>> I forget whether the problem was a nonzero count with a zero-byte
>>>>> constructed datatype or a zero count with a nonzero-byte constructed
>>>>> datatype, something stupid like that, so it's unlikely a real
>>>>> application is going to hit it.
>>>>> 
>>>>> Brian Smith (smithbr at us.ibm.com)
>>>>> BlueGene MPI Development/
>>>>> Communications Team Lead
>>>>> IBM Master Inventor
>>>>> IBM Rochester
>>>>> Phone: 507 253 4717
>>>>> 
>>>>> 
>>>>> 
>>>>> On Fri, Apr 20, 2012 at 7:34 PM, Jeff Hammond <jhammond at alcf.anl.gov>
>>>>> wrote:
>>>>>> 0-byte bcast is fine with the MPI I always use.
>>>>>> 
>>>>>> Can you print the args at
>>>>>> /gpfs/veas-fs0/killough/libs/build/SCALAPACK-xl/BLACS/SRC/dgebr2d_.c:127
>>>>>> and see what I need to test?
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Jeff
>>>>>> 
>>>>>> On Fri, Apr 20, 2012 at 7:18 PM, Jeff Hammond <jhammond at alcf.anl.gov>
>>>>>> wrote:
>>>>>>> Assuming I am looking at the same code (PAMI Git head is V1R1M1
>>>>>>> already and I'm too Git-impaired to toggle for V1R1M0, nor do I want
>>>>>>> to do this in any case), the assertion that fails indicates a 0-byte
>>>>>>> message is being attempted.
>>>>>>> 
>>>>>>> I'll write a test of 0-byte MPI_Bcast right now.  Which MPI library
>>>>>>> are you linking against?
>>>>>>> 
>>>>>>> Jeff
>>>>>>> 
>>>>>>> =================================================
>>>>>>>          pami_result_t  postShortCollective (uint32_t        opcode,
>>>>>>>                                              uint32_t        sizeoftype,
>>>>>>>                                              uint32_t        bytes,
>>>>>>>                                              char          * src,
>>>>>>>                                              PipeWorkQueue * dpwq,
>>>>>>>                                              pami_event_function
>>>>>>> cb_done,
>>>>>>>                                              void          * cookie,
>>>>>>>                                              unsigned        classroute)
>>>>>>>          {
>>>>>>>            TRACE_FN_ENTER();
>>>>>>>            TRACE_FORMAT("opcode %u, sizeoftype %u, bytes %u, src %p,
>>>>>>> dpwq %p, classroute %u", opcode, sizeoftype, bytes, src, dpwq,
>>>>>>> classroute);
>>>>>>>            PAMI_assert (bytes <= _collstate._tempSize);
>>>>>>>            PAMI_assert(bytes);  /* <---- JEFF: This is line 284 */
>>>>>>>            _int64Cpy(_collstate._tempBuf, src, bytes);
>>>>>>>            //memcpy(_collstate._tempBuf, src, bytes);
>>>>>>> ...
>>>>>>> =================================================
>>>>>>> 
>>>>>>> On Fri, Apr 20, 2012 at 6:06 PM, Lee Killough <killough at alcf.anl.gov>
>>>>>>> wrote:
>>>>>>>> With the new GA driver, I'm getting a lot of PAMI assertions when
>>>>>>>> running
>>>>>>>> ScaLAPACK programs. The traceback:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ------------------------------------------------------------------------
>>>>>>>> Program   : ./xzsep
>>>>>>>> 
>>>>>>>> ------------------------------------------------------------------------
>>>>>>>> +++ID Rank: 0, TGID: 1, Core: 0, HWTID:0 TID: 1 State: RUN
>>>>>>>> 
>>>>>>>> 00000000016a3638
>>>>>>>> abort
>>>>>>>> 
>>>>>>>> /bgsys/drivers/V1R1M0/ppc64/toolchain/gnu/glibc-2.12.2/stdlib/abort.c:77
>>>>>>>> 
>>>>>>>> 000000000169c668
>>>>>>>> __assert_fail
>>>>>>>> 
>>>>>>>> /bgsys/drivers/V1R1M0/ppc64/toolchain/gnu/glibc-2.12.2/assert/assert.c:81
>>>>>>>> 
>>>>>>>> 000000000149774c
>>>>>>>> PAMI::Device::MU::CollectiveDmaModelBase::postShortCollective(unsigned
>>>>>>>> int, unsigned int, unsigned int, char*, PAMI::PipeWorkQueue*, void
>>>>>>>> (*)(void*, void*, pami_result_t), void*, unsigned int)
>>>>>>>> 
>>>>>>>> /bgsys/source/srcV1R1M0.5670/comm/sys/buildtools/pami/components/devices/bgq/mu2/model/CollectiveDmaModelBase.h:284
>>>>>>>> 
>>>>>>>> 0000000001497954
>>>>>>>> 
>>>>>>>> PAMI::Device::MU::CollectiveMulticastDmaModel::postMulticastImmediate_impl(unsigned
>>>>>>>> long, unsigned long, pami_multicast_t*, void*)
>>>>>>>> 
>>>>>>>> /bgsys/source/srcV1R1M0.5670/comm/sys/buildtools/pami/components/devices/bgq/mu2/model/CollectiveMulticastDmaModel.h:107
>>>>>>>> 
>>>>>>>> 0000000001394304
>>>>>>>> 
>>>>>>>> PAMI::Geometry::Algorithm<PAMI::Geometry::Common>::generate(pami_xfer_t*)
>>>>>>>> 
>>>>>>>> /bgsys/source/srcV1R1M0.5670/comm/sys/buildtools/pami/algorithms/geometry/Algorithm.h:45
>>>>>>>> 
>>>>>>>> 0000000001359dec
>>>>>>>> MPIDO_Bcast
>>>>>>>> 
>>>>>>>> /bgsys/source/srcV1R1M0.5670/comm/lib/dev/mpich2/src/mpid/pamid/src/coll/bcast/mpido_bcast.c:146
>>>>>>>> 
>>>>>>>> 00000000012f95cc
>>>>>>>> MPIR_Bcast_impl
>>>>>>>> 
>>>>>>>> /bgsys/source/srcV1R1M0.5670/comm/lib/dev/mpich2/src/mpi/coll/bcast.c:1310
>>>>>>>> 
>>>>>>>> 00000000012f997c
>>>>>>>> PMPI_Bcast
>>>>>>>> 
>>>>>>>> /bgsys/source/srcV1R1M0.5670/comm/lib/dev/mpich2/src/mpi/coll/bcast.c:1464
>>>>>>>> 
>>>>>>>> 000000000101b980
>>>>>>>> dgebr2d
>>>>>>>> 
>>>>>>>> /gpfs/veas-fs0/killough/libs/build/SCALAPACK-xl/BLACS/SRC/dgebr2d_.c:127
>>>>>>>> 
>>>>>>>> 00000000010670f4
>>>>>>>> pdlared1d
>>>>>>>> /gpfs/veas-fs0/killough/libs/build/SCALAPACK-xl/SRC/pdlared1d.f:156
>>>>>>>> 
>>>>>>>> 0000000001045350
>>>>>>>> pzheevx
>>>>>>>> /gpfs/veas-fs0/killough/libs/build/SCALAPACK-xl/SRC/pzheevx.f:839
>>>>>>>> 
>>>>>>>> 00000000010084e0
>>>>>>>> pzsepsubtst
>>>>>>>> 
>>>>>>>> /gpfs/veas-fs0/killough/libs/build/SCALAPACK-xl/TESTING/EIG/pzsepsubtst.f:396
>>>>>>>> 
>>>>>>>> 0000000001002874
>>>>>>>> pzseptst
>>>>>>>> 
>>>>>>>> /gpfs/veas-fs0/killough/libs/build/SCALAPACK-xl/TESTING/EIG/pzseptst.f:565
>>>>>>>> 
>>>>>>>> 00000000010123ec
>>>>>>>> pzsepreq
>>>>>>>> 
>>>>>>>> /gpfs/veas-fs0/killough/libs/build/SCALAPACK-xl/TESTING/EIG/pzsepreq.f:205
>>>>>>>> 
>>>>>>>> 00000000010112e8
>>>>>>>> pzsepdriver
>>>>>>>> 
>>>>>>>> /gpfs/veas-fs0/killough/libs/build/SCALAPACK-xl/TESTING/EIG/pzsepdriver.f:229
>>>>>>>> 
>>>>>>>> 0000000001699b08
>>>>>>>> generic_start_main
>>>>>>>> 
>>>>>>>> /bgsys/drivers/V1R1M0/ppc64/toolchain/gnu/glibc-2.12.2/csu/../csu/libc-start.c:226
>>>>>>>> 
>>>>>>>> 0000000001699e04
>>>>>>>> __libc_start_main
>>>>>>>> 
>>>>>>>> /bgsys/drivers/V1R1M0/ppc64/toolchain/gnu/glibc-2.12.2/csu/../sysdeps/unix/sysv/linux/powerpc/libc-start.c:194
>>>>>>>> 
>>>>>>>> 0000000000000000
>>>>>>>> ??
>>>>>>>> ??:0
>>>>>>>> 
>>>>>>>> It looks like an assertion is failing at:
>>>>>>>> 
>>>>>>>> PAMI::Device::MU::CollectiveDmaModelBase::postShortCollective(unsigned
>>>>>>>> int, unsigned int, unsigned int, char*, PAMI::PipeWorkQueue*, void
>>>>>>>> (*)(void*, void*, pami_result_t), void*, unsigned int)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> /bgsys/source/srcV1R1M0.5670/comm/sys/buildtools/pami/components/devices/bgq/mu2/model/CollectiveDmaModelBase.h:284
>>>>>>>> 
>>>>>>>> during a broadcast.
>>>>>>>> 
>>>>>>>> I don't recall these errors in the previous driver.
>>>>>>>> 
>>>>>>>> Lee
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
> 
> 
> 
> -- 
> Jeff Hammond
> Argonne Leadership Computing Facility
> University of Chicago Computation Institute
> jhammond at alcf.anl.gov / (630) 252-5381
> http://www.linkedin.com/in/jeffhammond
> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond (in-progress)
> https://wiki.alcf.anl.gov/old/index.php/User:Jhammond (deprecated)
> https://wiki-old.alcf.anl.gov/index.php/User:Jhammond (deprecated)


