[mpich2-dev] 0 byte derived types

Jeff Hammond jhammond at alcf.anl.gov
Mon Apr 23 10:22:14 CDT 2012


If you read Brian's email, he indicates that there is a problem with
PAMI, but this is also a problem at the MPI level, if one is to
believe the comment in BLACS.  Lee noted that he had to work around
this problem with NEC-MPI as well, so it appears to be a corner case
that is overlooked in multiple implementations (for good reason; this
use case is ridiculous).

Thanks,

Jeff

On Mon, Apr 23, 2012 at 10:19 AM, Dave Goodell <goodell at mcs.anl.gov> wrote:
> The assert you quote in the oldest email in the thread is in PAMI code, so I don't think this is directly an MPICH2 bug.  I do agree, however, that zero-size, non-zero-count bcast should be tested in our test suite and that we have a coverage gap here.
>
> I'll add creating a test for this to my TODO list.
>
> -Dave
>
> On Apr 23, 2012, at 9:23 AM CDT, Jeff Hammond wrote:
>
>> A bug appeared when BGP came online and has reappeared on BGQ.  It
>> relates to MPI_Bcast of a non-zero number of 0-byte derived datatypes.
>> ScaLAPACK is one source of this pattern.  They have a workaround, but
>> it seems to be that either ScaLAPACK is using MPI in a non-compliant
>> way or there is a bug in MPICH2 that has persisted across many major
>> version releases.
>>
>> Are you guys aware of this?  Has it been fixed in 1.5?  Is there a
>> test to make sure there is no regression in the future?  The ScaLAPACK
>> code and comment noting the problem with MPICH is below, as is a long
>> email thread with Brian Smith and Lee on this topic.
>>
>> Thanks,
>>
>> Jeff
>>
>>
>>
>> #include "Bdef.h"
>> MPI_Datatype BI_GetMpiGeType(BLACSCONTEXT *ctxt, int m, int n, int lda,
>>                               MPI_Datatype Dtype, int *N)
>> {
>>  int info;
>>  MPI_Datatype GeType;
>>
>> /*
>> * Some versions of mpich and its derivitives cannot handle 0 byte typedefs,
>> * so we set type MPI_BYTE as a flag for a 0 byte message
>> */
>> #ifdef ZeroByteTypeBug
>>  if ( (m < 1) || (n < 1) )
>>  {
>>     *N = 0;
>>     return (MPI_BYTE);
>>  }
>> #endif
>>  *N = 1;
>>  info=MPI_Type_vector(n, m, lda, Dtype, &GeType);
>>  info=MPI_Type_commit(&GeType);
>>
>>  return(GeType);
>> }
>>
>>
>> ---------- Forwarded message ----------
>> From: Brian Smith <smithbr at us.ibm.com>
>> Date: Mon, Apr 23, 2012 at 8:32 AM
>> Subject: Re: [td-support #113586] PAMI assertions
>> To: Jeff Hammond <jhammond at alcf.anl.gov>
>> Cc: jeff.science at gmail.com, Lee Killough <killough at alcf.anl.gov>,
>> "td-support at alcf.anl.gov" <td-support at alcf.anl.gov>
>>
>>
>> Well, there are 2 different bugs here.
>>
>> (from memory) 1) We found places where SCALAPACK made assumptions about
>> uninitialized variables that caused significant badness in a number of
>> apps. I believe someone reported this to the SCALAPACK maintainers
>> many years ago. In fact, they ran something like valgrind and provided
>> a patch for *all* of the usage of uninitialized variables. The
>> SCALAPACK people did not integrate the changes at the time. Perhaps
>> this new release will have some of them added so we don't have to deal
>> with that again. See
>> http://icl.cs.utk.edu/lapack-forum/viewtopic.php?f=2&t=588&p=1911&hilit=trsm#p1911
>> and
>> http://icl.cs.utk.edu/lapack-forum/viewtopic.php?f=13&t=2625
>>
>> (more recently, based on different glue punt-to-MPICH-logic) 2)
>> SCALAPACK creates zero-length datatypes and then calls bcast with
>> nonzero counts. The glue didn't test for this condition.
>>
>> A test for #2 in the MPICH2 test bucket wouldn't be a bad thing, but
>> #1 is of course beyond the scope of MPICH2. It is possible there is a
>> test for #2 already but because of other circumstances (node count for
>> example) we might not have seen a failure. I'm sure Dave G. or someone
>> could comment on that.
>>
>>
>>
>> Brian Smith (smithbr at us.ibm.com)
>> BlueGene MPI Development/
>> Communications Team Lead
>> IBM Master Inventor
>> IBM Rochester
>> Phone: 507 253 4717
>>
>>
>>
>>
>> From:        Jeff Hammond <jhammond at alcf.anl.gov>
>> To:        Brian Smith/Rochester/IBM at IBMUS
>> Cc:        Lee Killough <killough at alcf.anl.gov>,
>> "td-support at alcf.anl.gov" <td-support at alcf.anl.gov>
>> Date:        04/21/2012 08:51 PM
>> Subject:        Re: [td-support #113586] PAMI assertions
>> Sent by:        jeff.science at gmail.com
>> ________________________________
>>
>>
>>
>> If this showed up on BGP and now on BGQ, why was it not added to the
>> MPICH2 test suite 3+ years ago?  This is a bug in MPICH2 according to
>> the comments in ScaLAPACK and the fact that both BGP and BGQ suffered
>> it despite forking vastly different code bases, right?
>>
>> I'm trying to write a standalone test for this, btw, but haven't been
>> successful yet.
>>
>> Jeff
>>
>> On Sat, Apr 21, 2012 at 8:46 PM, Brian Smith <smithbr at us.ibm.com> wrote:
>>> Hi Lee,
>>>
>>> It's not actually a user error, what SCALAPACK is doing is (probably, I
>>> didn't look at it too much) valid MPI code. However, it appears to be a
>>> weird fringe case that none of the test cases that come with MPICH, nor the
>>> gigantic Intel/ANL testbucket found.
>>>
>>> Basically, we were missing an if() check in the collectives glue to check
>>> for nonzero counts of zero length datatypes. The optimized protocols don't
>>> deal with things like that which is why there was an assert().
>>>
>>>
>>>
>>> Brian Smith (smithbr at us.ibm.com)
>>> BlueGene MPI Development/
>>> Communications Team Lead
>>> IBM Master Inventor
>>> IBM Rochester
>>> Phone: 507 253 4717
>>>
>>> "the scientific community is very A-Buzz with positive reviews of Blue Gene
>>> ..." - Charles Archer - un-sung hero of technology
>>>
>>>
>>>
>>>
>>> From:        Lee Killough <killough at alcf.anl.gov>
>>> To:        Jeff Hammond <jhammond at alcf.anl.gov>
>>> Cc:        "td-support at alcf.anl.gov" <td-support at alcf.anl.gov>, Brian
>>> Smith/Rochester/IBM at IBMUS
>>> Date:        04/20/2012 11:08 PM
>>> Subject:        Re: [td-support #113586] PAMI assertions
>>> ________________________________
>>>
>>>
>>>
>>> Sorry, a busy evening after 6 pm, fast forwarding to this email, and have
>>> not read previous.
>>>
>>> First, if it's a user error, it should never be diagnosed in an assert().
>>> assert() is only intended for catching internal errors, and should be turned
>>> off in production code. It being an assert() immediately threw me off and
>>> made me think it was a configuration issue or mismatched libraries, etc.
>>>
>>> A new version of ScaLAPACK is about to be released, maybe even as I send
>>> this email. I have been working closely with the developers for the past
>>> month on several bugs, some of which are only seen on BG, such as illegal
>>> Fortran calls with overlapping arguments.
>>>
>>> If we can identify the cause and work on a fix for this bcast problem, I may
>>> be able to get it in before the next release, or maybe not. If you have a
>>> BLACS code fix suggestion, please send it and I'll try to get the fix in the
>>> next version of ScaLAPACK.
>>>
>>> Thanks,
>>> Lee
>>>
>>> On Apr 20, 2012, at 22:44, Jeff Hammond <jhammond at alcf.anl.gov> wrote:
>>>
>>>> IBM says it's a ScaLAPACK problem but that the latest MPI/PAMI has a
>>>> fix anyways.
>>>>
>>>> See if this makes sense from the BLACS code.  We can look through MPI
>>>> standard together next week to see if BLACS violates it.  It would be
>>>> the first time Clint Whaley completely screwed up using MPI.
>>>>
>>>> Jeff
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: Brian Smith <smithbr at us.ibm.com>
>>>> Date: Fri, Apr 20, 2012 at 8:47 PM
>>>> Subject: Re: Fwd: [td-support #113586] PAMI assertions
>>>> To: Jeff Hammond <jhammond at alcf.anl.gov>
>>>> Cc: jeff.science at gmail.com, Michael Blocksome <blocksom at us.ibm.com>
>>>>
>>>> It's goofy datatype stuff in SCALAPACK. There's a fix in head... I
>>>> didn't/don't feel it was worth efixing.
>>>>
>>>> I forget if the problem was a nonzero count with a zero-byte
>>>> constructed datatype or a zero count with a non-zero byte constructed
>>>> datatype, something stupid like that, so it's unlikely a real
>>>> application is going to hit it.
>>>>
>>>> Brian Smith (smithbr at us.ibm.com)
>>>> BlueGene MPI Development/
>>>> Communications Team Lead
>>>> IBM Master Inventor
>>>> IBM Rochester
>>>> Phone: 507 253 4717
>>>>
>>>>
>>>>
>>>> On Fri, Apr 20, 2012 at 7:34 PM, Jeff Hammond <jhammond at alcf.anl.gov>
>>>> wrote:
>>>>> 0-byte bcast is fine with the MPI I always use.
>>>>>
>>>>> Can you print the args at
>>>>> /gpfs/veas-fs0/killough/libs/build/SCALAPACK-xl/BLACS/SRC/dgebr2d_.c:127
>>>>> and see what I need to test?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Jeff
>>>>>
>>>>> On Fri, Apr 20, 2012 at 7:18 PM, Jeff Hammond <jhammond at alcf.anl.gov>
>>>>> wrote:
>>>>>> Assuming I am looking at the same code (PAMI Git head is V1R1M1
>>>>>> already and I'm too Git-impaired to toggle for V1R1M0, nor do I want
>>>>>> to do this in any case), the assertion that fails indicates a 0-byte
>>>>>> message is being attempted.
>>>>>>
>>>>>> I'll write a test of 0-byte MPI_Bcast right now.  Which MPI library
>>>>>> are you linking against?
>>>>>>
>>>>>> Jeff
>>>>>>
>>>>>> =================================================
>>>>>>          pami_result_t  postShortCollective (uint32_t        opcode,
>>>>>>                                              uint32_t        sizeoftype,
>>>>>>                                              uint32_t        bytes,
>>>>>>                                              char          * src,
>>>>>>                                              PipeWorkQueue * dpwq,
>>>>>>                                              pami_event_function cb_done,
>>>>>>                                              void          * cookie,
>>>>>>                                              unsigned        classroute)
>>>>>>          {
>>>>>>            TRACE_FN_ENTER();
>>>>>>            TRACE_FORMAT("opcode %u, sizeoftype %u, bytes %u, src %p, dpwq %p, classroute %u",
>>>>>>                         opcode, sizeoftype, bytes, src, dpwq, classroute);
>>>>>>            PAMI_assert (bytes <= _collstate._tempSize);
>>>>>>            PAMI_assert(bytes);  /* <--- JEFF: This is line 284 */
>>>>>>            _int64Cpy(_collstate._tempBuf, src, bytes);
>>>>>>            //memcpy(_collstate._tempBuf, src, bytes);
>>>>>> ...
>>>>>> =================================================
>>>>>>
>>>>>> On Fri, Apr 20, 2012 at 6:06 PM, Lee Killough <killough at alcf.anl.gov>
>>>>>> wrote:
>>>>>>> With the new GA driver, I'm getting a lot of PAMI assertions when
>>>>>>> running
>>>>>>> ScaLAPACK programs. The traceback:
>>>>>>>
>>>>>>>
>>>>>>> ------------------------------------------------------------------------
>>>>>>> Program   : ./xzsep
>>>>>>>
>>>>>>> ------------------------------------------------------------------------
>>>>>>> +++ID Rank: 0, TGID: 1, Core: 0, HWTID:0 TID: 1 State: RUN
>>>>>>>
>>>>>>> 00000000016a3638
>>>>>>> abort
>>>>>>>
>>>>>>> /bgsys/drivers/V1R1M0/ppc64/toolchain/gnu/glibc-2.12.2/stdlib/abort.c:77
>>>>>>>
>>>>>>> 000000000169c668
>>>>>>> __assert_fail
>>>>>>>
>>>>>>> /bgsys/drivers/V1R1M0/ppc64/toolchain/gnu/glibc-2.12.2/assert/assert.c:81
>>>>>>>
>>>>>>> 000000000149774c
>>>>>>> PAMI::Device::MU::CollectiveDmaModelBase::postShortCollective(unsigned
>>>>>>> int, unsigned int, unsigned int, char*, PAMI::PipeWorkQueue*, void
>>>>>>> (*)(void*, void*, pami_result_t), void*, unsigned int)
>>>>>>>
>>>>>>> /bgsys/source/srcV1R1M0.5670/comm/sys/buildtools/pami/components/devices/bgq/mu2/model/CollectiveDmaModelBase.h:284
>>>>>>>
>>>>>>> 0000000001497954
>>>>>>>
>>>>>>> PAMI::Device::MU::CollectiveMulticastDmaModel::postMulticastImmediate_impl(unsigned
>>>>>>> long, unsigned long, pami_multicast_t*, void*)
>>>>>>>
>>>>>>> /bgsys/source/srcV1R1M0.5670/comm/sys/buildtools/pami/components/devices/bgq/mu2/model/CollectiveMulticastDmaModel.h:107
>>>>>>>
>>>>>>> 0000000001394304
>>>>>>>
>>>>>>> PAMI::Geometry::Algorithm<PAMI::Geometry::Common>::generate(pami_xfer_t*)
>>>>>>>
>>>>>>> /bgsys/source/srcV1R1M0.5670/comm/sys/buildtools/pami/algorithms/geometry/Algorithm.h:45
>>>>>>>
>>>>>>> 0000000001359dec
>>>>>>> MPIDO_Bcast
>>>>>>>
>>>>>>> /bgsys/source/srcV1R1M0.5670/comm/lib/dev/mpich2/src/mpid/pamid/src/coll/bcast/mpido_bcast.c:146
>>>>>>>
>>>>>>> 00000000012f95cc
>>>>>>> MPIR_Bcast_impl
>>>>>>>
>>>>>>> /bgsys/source/srcV1R1M0.5670/comm/lib/dev/mpich2/src/mpi/coll/bcast.c:1310
>>>>>>>
>>>>>>> 00000000012f997c
>>>>>>> PMPI_Bcast
>>>>>>>
>>>>>>> /bgsys/source/srcV1R1M0.5670/comm/lib/dev/mpich2/src/mpi/coll/bcast.c:1464
>>>>>>>
>>>>>>> 000000000101b980
>>>>>>> dgebr2d
>>>>>>>
>>>>>>> /gpfs/veas-fs0/killough/libs/build/SCALAPACK-xl/BLACS/SRC/dgebr2d_.c:127
>>>>>>>
>>>>>>> 00000000010670f4
>>>>>>> pdlared1d
>>>>>>> /gpfs/veas-fs0/killough/libs/build/SCALAPACK-xl/SRC/pdlared1d.f:156
>>>>>>>
>>>>>>> 0000000001045350
>>>>>>> pzheevx
>>>>>>> /gpfs/veas-fs0/killough/libs/build/SCALAPACK-xl/SRC/pzheevx.f:839
>>>>>>>
>>>>>>> 00000000010084e0
>>>>>>> pzsepsubtst
>>>>>>>
>>>>>>> /gpfs/veas-fs0/killough/libs/build/SCALAPACK-xl/TESTING/EIG/pzsepsubtst.f:396
>>>>>>>
>>>>>>> 0000000001002874
>>>>>>> pzseptst
>>>>>>>
>>>>>>> /gpfs/veas-fs0/killough/libs/build/SCALAPACK-xl/TESTING/EIG/pzseptst.f:565
>>>>>>>
>>>>>>> 00000000010123ec
>>>>>>> pzsepreq
>>>>>>>
>>>>>>> /gpfs/veas-fs0/killough/libs/build/SCALAPACK-xl/TESTING/EIG/pzsepreq.f:205
>>>>>>>
>>>>>>> 00000000010112e8
>>>>>>> pzsepdriver
>>>>>>>
>>>>>>> /gpfs/veas-fs0/killough/libs/build/SCALAPACK-xl/TESTING/EIG/pzsepdriver.f:229
>>>>>>>
>>>>>>> 0000000001699b08
>>>>>>> generic_start_main
>>>>>>>
>>>>>>> /bgsys/drivers/V1R1M0/ppc64/toolchain/gnu/glibc-2.12.2/csu/../csu/libc-start.c:226
>>>>>>>
>>>>>>> 0000000001699e04
>>>>>>> __libc_start_main
>>>>>>>
>>>>>>> /bgsys/drivers/V1R1M0/ppc64/toolchain/gnu/glibc-2.12.2/csu/../sysdeps/unix/sysv/linux/powerpc/libc-start.c:194
>>>>>>>
>>>>>>> 0000000000000000
>>>>>>> ??
>>>>>>> ??:0
>>>>>>>
>>>>>>> It looks like an assertion is failing at:
>>>>>>>
>>>>>>> PAMI::Device::MU::CollectiveDmaModelBase::postShortCollective(unsigned
>>>>>>> int, unsigned int, unsigned int, char*, PAMI::PipeWorkQueue*, void
>>>>>>> (*)(void*, void*, pami_result_t), void*, unsigned int)
>>>>>>>
>>>>>>>
>>>>>>> /bgsys/source/srcV1R1M0.5670/comm/sys/buildtools/pami/components/devices/bgq/mu2/model/CollectiveDmaModelBase.h:284
>>>>>>>
>>>>>>> during a broadcast.
>>>>>>>
>>>>>>> I don't recall these errors in the previous driver.
>>>>>>>
>>>>>>> Lee
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jeff Hammond
>>>>>> Argonne Leadership Computing Facility
>>>>>> University of Chicago Computation Institute
>>>>>> jhammond at alcf.anl.gov / (630) 252-5381
>>>>>> http://www.linkedin.com/in/jeffhammond
>>>>>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond (in-progress)
>>>>>> https://wiki.alcf.anl.gov/old/index.php/User:Jhammond (deprecated)
>>>>>> https://wiki-old.alcf.anl.gov/index.php/User:Jhammond(deprecated)
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>



-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond (in-progress)
https://wiki.alcf.anl.gov/old/index.php/User:Jhammond (deprecated)
https://wiki-old.alcf.anl.gov/index.php/User:Jhammond(deprecated)

