[petsc-dev] Is mpich/master:a8a2b30fd21 tested with Petsc?

Jed Brown jed at jedbrown.org
Thu Nov 8 13:04:12 CST 2018


This bug is present in Ubuntu 18.10, which distributes unpatched
mpich-3.3b2.  I just submitted a bug report:

https://bugs.launchpad.net/ubuntu/+source/mpich/+bug/1802372

Eric Chamberland <Eric.Chamberland at giref.ulaval.ca> writes:

> Hi,
>
> mainly for PETSc users: please do not waste your time using the MPI
> released with Intel Parallel Studio 2019, since it is the buggy MPICH
> 3.3b2 for which this thread was originally created...
>
> I also just posted a reminder about this on the Intel forum:
>
> https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/797761
>
> Eric
>
> On 19/04/18 09:01 AM, Eric Chamberland wrote:
>> Hi,
>> 
>> this morning, mpich/master with PETSc is 100% working again for us.
>> 
>> Thanks to both commits:
>> 
>> https://github.com/pmodels/mpich/commit/c597c8d79deea220a42751fda0f01ce70764c260 
>> 
>> 
>> https://github.com/pmodels/mpich/commit/8edabc7373b82dd660019e53d246131765819294 
>> 
>> 
>> and thanks to everybody who helped:
>> 
>> Satish
>> Min
>> Wesley
>> Ken
>> Rob
>> Jed
>> 
>> :)
>> 
>> Eric
>> 
>> On 17/04/18 04:58 PM, Min Si wrote:
>>> Hi all,
>>>
>>> Thanks for narrowing down the problem. I checked the MPICH code and 
>>> believe this is a bug in MPICH. I just created a PR to fix it:
>>> https://github.com/pmodels/mpich/pull/3097
>>>
>>> It should be merged into MPICH master branch soon.
>>>
>>> Thanks,
>>> Min
>>>
>>> On 2018/04/17 14:10, Eric Chamberland wrote:
>>>> Hi,
>>>>
>>>> Are we talking about the "tag" passed to MPI_Isend, for example?
>>>>
>>>> But does that mean something must change for every MPI call that
>>>> uses tags, or is it only a "bad" tag usage in PETSc?
>>>>
>>>> thanks Satish for your finding!
>>>>
>>>> Eric
>>>>
>>>> On 16/04/18 11:31 PM, Satish Balay wrote:
>>>>> On Tue, 13 Mar 2018, Eric Chamberland wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> each night we are testing mpich/master with our petsc-based code. I don't
>>>>>> know if the PETSc team is doing the same thing with mpich/master. (Maybe it
>>>>>> is a good idea?)
>>>>>>
>>>>>> Everything was fine (except the issue
>>>>>> https://github.com/pmodels/mpich/issues/2892) up to commit 7b8d64debd, but
>>>>>> since commit mpich:a8a2b30fd21, I have a segfault on any parallel nightly
>>>>>> test.
>>>>>
>>>>> I attempted a bisect of the above range of commits and narrowed it down to:
>>>>>
>>>>>>>>>>>>
>>>>> db11d4c4a70e39a28be88ed32f00542301699e08 is the first bad commit
>>>>> <<<<<<<
>>>>>>>>>>>>>
>>>>> balay at asterix /home/balay/soft/build/mpich ((db11d4c4a...)|BISECTING)
>>>>> $ git show db11d4c4a70e39a28be88ed32f00542301699e08
>>>>> commit db11d4c4a70e39a28be88ed32f00542301699e08 (HEAD, refs/bisect/bad)
>>>>> Author: Ken Raffenetti <raffenet at mcs.anl.gov>
>>>>> Date:   Thu Feb 15 11:37:59 2018 -0600
>>>>>
>>>>>      init: Fix tag upper limit initialization
>>>>>
>>>>>      The starting point for this value is equivalent to the usable tag bits
>>>>>      macro. This value should be set before device initialization,
>>>>>      otherwise devices will assume they have more bits than are actually
>>>>>      available.
>>>>>
>>>>>      Signed-off-by: Wesley Bland <wesley.bland at intel.com>
>>>>>
>>>>> diff --git a/src/mpi/init/initthread.c b/src/mpi/init/initthread.c
>>>>> index cbc41f4d5..b31ae2f07 100644
>>>>> --- a/src/mpi/init/initthread.c
>>>>> +++ b/src/mpi/init/initthread.c
>>>>> @@ -403,7 +403,7 @@ int MPIR_Init_thread(int *argc, char ***argv, int required, int *provided)
>>>>>       MPIR_Process.attrs.host = MPI_PROC_NULL;
>>>>>       MPIR_Process.attrs.io = MPI_PROC_NULL;
>>>>>       MPIR_Process.attrs.lastusedcode = MPI_ERR_LASTCODE;
>>>>> -    MPIR_Process.attrs.tag_ub = 0;
>>>>> +    MPIR_Process.attrs.tag_ub = MPIR_TAG_USABLE_BITS;
>>>>>       MPIR_Process.attrs.universe = MPIR_UNIVERSE_SIZE_NOT_SET;
>>>>>       MPIR_Process.attrs.wtime_is_global = 0;
>>>>>
>>>>> @@ -531,13 +531,6 @@ int MPIR_Init_thread(int *argc, char ***argv, int required, int *provided)
>>>>>       MPIR_Assert(((unsigned) MPIR_Process.
>>>>>                    attrs.tag_ub & ((unsigned) MPIR_Process.attrs.tag_ub + 1)) == 0);
>>>>>
>>>>> -    /* Set aside tag space for tagged collectives and failure notification */
>>>>> -#ifdef HAVE_TAG_ERROR_BITS
>>>>> -    MPIR_Process.attrs.tag_ub >>= 3;
>>>>> -#else
>>>>> -    MPIR_Process.attrs.tag_ub >>= 1;
>>>>> -#endif
>>>>> -
>>>>>       /* Assert: tag_ub is at least the minimum asked for in the MPI spec */
>>>>>       MPIR_Assert(MPIR_Process.attrs.tag_ub >= 32767);
>>>>> <<<<<<<<<<<<<<<<<
>>>>>
>>>>> Reverting this patch gets mpich-3.3b2 working with PETSc.
>>>>>
>>>>> Satish
>>>>>

