[petsc-users] MPI error for large number of processes and subcomms

Junchao Zhang junchao.zhang at gmail.com
Tue Apr 14 14:23:27 CDT 2020


There is an MPI_Allreduce in PetscGatherNumberOfMessages, which is why I
suspected it was the problem. Even if users configure PETSc with 64-bit
indices, we use PetscMPIInt in MPI calls, so that is not a problem.
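
As an illustration, here is the pattern (a hypothetical helper, not the actual
PETSc source) for how a PetscInt count is checked and downcast to PetscMPIInt
before it reaches MPI:

#include <petscsys.h>

/* Hypothetical helper, not PETSc source: a 64-bit PetscInt count is checked
   and downcast to a 32-bit PetscMPIInt before it is handed to MPI. */
PetscErrorCode SendCount(PetscInt nlocal,PetscMPIInt dest,PetscMPIInt tag,MPI_Comm comm)
{
  PetscMPIInt    n;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = PetscMPIIntCast(nlocal,&n);CHKERRQ(ierr); /* errors out if nlocal does not fit in an int */
  ierr = MPI_Send(&n,1,MPI_INT,dest,tag,comm);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
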
Try -vecscatter_type mpi1 to fall back to the original VecScatter
implementation. If the problem remains, could you provide a test example for
me to debug?
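
Even a reproducer as small as the sketch below would do (hypothetical code,
not taken from your application; it builds one VecScatter that pulls the block
owned by the next rank, so the scatter type can be compared by running with
and without -vecscatter_type mpi1):

#include <petscvec.h>

int main(int argc,char **argv)
{
  Vec            x,y;
  IS             ix;
  VecScatter     sct;
  PetscInt       n = 10,N,first;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc,&argv,NULL,NULL);if (ierr) return ierr;

  /* A parallel vector and a sequential vector to gather into */
  ierr = VecCreateMPI(PETSC_COMM_WORLD,n,PETSC_DECIDE,&x);CHKERRQ(ierr);
  ierr = VecCreateSeq(PETSC_COMM_SELF,n,&y);CHKERRQ(ierr);

  /* Each rank pulls the block owned by the next rank, forcing communication */
  ierr = VecGetSize(x,&N);CHKERRQ(ierr);
  ierr = VecGetOwnershipRange(x,&first,NULL);CHKERRQ(ierr);
  ierr = ISCreateStride(PETSC_COMM_SELF,n,(first+n)%N,1,&ix);CHKERRQ(ierr);

  /* The scatter type can be selected at run time, e.g. -vecscatter_type mpi1 */
  ierr = VecScatterCreate(x,ix,y,NULL,&sct);CHKERRQ(ierr);
  ierr = VecScatterBegin(sct,x,y,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = VecScatterEnd(sct,x,y,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);

  ierr = VecScatterDestroy(&sct);CHKERRQ(ierr);
  ierr = ISDestroy(&ix);CHKERRQ(ierr);
  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&y);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}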

--Junchao Zhang


On Tue, Apr 14, 2020 at 12:13 PM Randall Mackie <rlmackie862 at gmail.com>
wrote:

> Hi Junchao,
>
> We have tried your two suggestions, but the problem remains.
> The problem seems to be the MPI_Isend at line 117 in
> PetscGatherMessageLengths, not MPI_Allreduce.
>
> We have now tried Intel MPI, MPICH, and OpenMPI, so we think the
> problem must be elsewhere and not in MPI itself.
>
> Given that this is a 64-bit-indices build of PETSc, is there some possible
> incompatibility between PETSc and the MPI calls?
>
> We are open to any other suggestions; other than running valgrind on
> thousands of processes, we seem to have run out of ideas.
>
> Thanks, Randy M.
>
> On Apr 13, 2020, at 8:54 AM, Junchao Zhang <junchao.zhang at gmail.com>
> wrote:
>
>
> --Junchao Zhang
>
>
> On Mon, Apr 13, 2020 at 10:53 AM Junchao Zhang <junchao.zhang at gmail.com>
> wrote:
>
>> Randy,
>>    Someone reported a similar problem before; it turned out to be an Intel
>> MPI MPI_Allreduce bug.  A workaround is setting the environment variable
>> I_MPI_ADJUST_ALLREDUCE=1.arr
>>
>  Correct:  I_MPI_ADJUST_ALLREDUCE=1
>
>>    But you mentioned MPICH also had the error, so maybe the problem is not
>> the same. Let's try the workaround first. If it doesn't work, add another
>> PETSc option, -build_twosided allreduce, which is a workaround for Intel
>> MPI_Ibarrier bugs we have met.
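>>
>>    For example (just a sketch; "mpirun" and "./your_app" are placeholders
>> for your actual launcher and executable):
>>
>>      export I_MPI_ADJUST_ALLREDUCE=1
>>      mpirun -n <nprocs> ./your_app -build_twosided allreduce <other options>
>>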
>>    Thanks.
>> --Junchao Zhang
>>
>>
>> On Mon, Apr 13, 2020 at 10:38 AM Randall Mackie <rlmackie862 at gmail.com>
>> wrote:
>>
>>> Dear PETSc users,
>>>
>>> We are trying to understand an issue that has come up in running our
>>> code on a large cloud cluster with a large number of processes and subcomms.
>>> This is code that we use daily on multiple clusters without problems,
>>> and that runs valgrind clean for small test problems.
>>>
>>> The run generates the following messages but doesn’t crash; it just seems
>>> to hang, with all processes continuing to show activity:
>>>
>>> [492]PETSC ERROR: #1 PetscGatherMessageLengths() line 117 in
>>> /mnt/home/cgg/PETSc/petsc-3.12.4/src/sys/utils/mpimesg.c
>>> [492]PETSC ERROR: #2 VecScatterSetUp_SF() line 658 in
>>> /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/impls/sf/vscatsf.c
>>> [492]PETSC ERROR: #3 VecScatterSetUp() line 209 in
>>> /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/interface/vscatfce.c
>>> [492]PETSC ERROR: #4 VecScatterCreate() line 282 in
>>> /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/interface/vscreate.c
>>>
>>>
>>> Looking at line 117 in PetscGatherMessageLengths, we find that the
>>> offending statement is the MPI_Isend:
>>>
>>>
>>>   /* Post the Isends with the message length-info */
>>>   for (i=0,j=0; i<size; ++i) {
>>>     if (ilengths[i]) {
>>>       ierr = MPI_Isend((void*)(ilengths+i),1,MPI_INT,i,tag,comm,s_waits+j);CHKERRQ(ierr);
>>>       j++;
>>>     }
>>>   }
>>>
>>> We have tried this with Intel MPI 2018, Intel MPI 2019, and MPICH, all
>>> giving the same problem.
>>>
>>> We suspect some limit is being set on this cloud cluster, perhaps on the
>>> number of file descriptors or connections, but we don’t know.
>>>
>>> Does anyone have any ideas? We are sort of grasping at straws at this point.
>>>
>>> Thanks, Randy M.
>>>
>>
>