[petsc-users] MPI error for large number of processes and subcomms

Mon Apr 20 16:45:21 CDT 2020

Hello, Randy,
  I looked further into the problem and believe it was caused by
overwhelming network traffic. The code sometimes fails at MPI_Waitall. I
printed the MPI error strings of the bad MPI statuses. One of them reads
"MPID_nem_tcp_connpoll(1845): Communication error with rank 25: Connection
reset by peer", which is a TCP error and has nothing to do with PETSc.
  Further investigation shows that in the case of 5120 ranks with 320
subcommunicators, during VecScatterSetUp each rank has around 640
isend/irecv neighbors, and quite a few ranks have 1280 isend neighbors. I
guess this flood of isends occasionally broke the connection.
  That piece of code in VecScatterSetUp calculates the communication
pattern. With index sets that have good locality, the calculation itself
incurs less traffic. Here good locality means that the indices in an index
set mostly point to local entries. However, the AOApplicationToPetsc() calls
in your code unnecessarily ruin the good PETSc ordering. If we remove
AOApplicationToPetsc() (the vecscatter result is still correct), then each
rank uniformly has around 320 isend/irecv neighbors.
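
To make "good locality" concrete, here is a minimal sketch in plain C (no
MPI or PETSc calls) that counts how many distinct remote ranks one rank has
to contact for a given list of global indices. The helper names owner_of and
count_remote_neighbors are hypothetical, and the contiguous block layout is
an assumption made for illustration; it is not the code in VecScatterSetUp.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: owner rank of global index g when n_global entries
   are split into contiguous blocks over n_ranks, with the first
   (n_global % n_ranks) ranks holding one extra entry. */
int owner_of(long g, long n_global, int n_ranks)
{
  long base  = n_global / n_ranks;
  long extra = n_global % n_ranks;
  long cut   = extra * (base + 1); /* first index owned by a "short" rank */

  assert(n_ranks > 0 && g >= 0 && g < n_global);
  if (g < cut) return (int)(g / (base + 1));
  return (int)(extra + (g - cut) / base);
}

/* Count the distinct remote ranks that `rank` must contact to reach the
   entries idx[0..n-1]. Returns -1 if the scratch array cannot be allocated. */
int count_remote_neighbors(const long *idx, int n, int rank,
                           long n_global, int n_ranks)
{
  char *seen = calloc((size_t)n_ranks, 1);
  int   i, count = 0;

  if (!seen) return -1;
  for (i = 0; i < n; i++) {
    int o = owner_of(idx[i], n_global, n_ranks);
    if (o != rank && !seen[o]) { seen[o] = 1; count++; }
  }
  free(seen);
  return count;
}
```

With 16 entries on 4 ranks, rank 1 owns indices 4 to 7. An index list made
of those four indices needs no remote neighbor, while a list that touches
one entry in every block needs three.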
  So please test with this modification and see whether it works in your
environment. If it is not applicable, we can provide options in PETSc to
carry out the communication in phases to avoid flooding the network (though
that is better done by MPI).
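
As a rough illustration of what "in phases" could mean, the sketch below
shows only the batching arithmetic in plain C. The names num_phases and
phase_range and the idea of a fixed cap on messages in flight are
assumptions made for this example; they are not existing PETSc options or
functions.

```c
#include <assert.h>

/* Number of phases needed to post n_msgs messages when at most
   max_inflight of them may be outstanding at once (hypothetical cap). */
int num_phases(int n_msgs, int max_inflight)
{
  assert(n_msgs >= 0 && max_inflight > 0);
  return (n_msgs + max_inflight - 1) / max_inflight;
}

/* Half-open range [*first, *last) of message slots handled in `phase`. */
void phase_range(int n_msgs, int max_inflight, int phase,
                 int *first, int *last)
{
  assert(phase >= 0 && phase < num_phases(n_msgs, max_inflight));
  *first = phase * max_inflight;
  *last  = *first + max_inflight;
  if (*last > n_msgs) *last = n_msgs; /* the final phase may be shorter */
}

/* Intended use inside an MPI code (not compiled here):
     for (p = 0; p < num_phases(n, cap); p++) {
       phase_range(n, cap, p, &lo, &hi);
       ... post MPI_Isend for slots lo..hi-1 ...
       ... MPI_Waitall on those requests before starting the next phase ...
     }
   Matching the receives and avoiding deadlock are not addressed here. */
```

For a rank with 1280 isend neighbors and an assumed cap of 320, this gives
four phases of 320 messages each.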

 Thanks.
--Junchao Zhang

On Fri, Apr 17, 2020 at 10:47 AM Randall Mackie <rlmackie862 at gmail.com>
wrote:

> Hi Junchao,
>
> Thank you for your efforts.
> We tried petsc-3.13.0 but it made no difference.
> We now think the issue is with sysctl parameters, and increasing those
> seems to have cleared up the problem.
> This also most likely explains how different clusters had different
> behaviors with our test code.
>
> We are now running our code and will report back once we are sure that
> there are no further issues.
>
> Thanks again for your help.
>
> Randy M.
>
> On Apr 17, 2020, at 8:09 AM, Junchao Zhang <junchao.zhang at gmail.com>
> wrote:
>
>
>
>
> On Thu, Apr 16, 2020 at 11:13 PM Junchao Zhang <junchao.zhang at gmail.com>
> wrote:
>
>> Randy,
>>   I reproduced your error with petsc-3.12.4 and 5120 MPI ranks. I also
>> found the error went away with petsc-3.13.  However, I have not figured out
>> what the bug is and which commit fixed it :).
>>   So on your side, it is better to use the latest petsc.
>>
> I want to add that even with petsc-3.12.4 the error is random. I was
> only able to reproduce the error once, so I cannot claim that petsc-3.13
> actually fixed it (or that the bug is really in petsc).
>
>
>> --Junchao Zhang
>>
>>
>> On Thu, Apr 16, 2020 at 9:06 PM Junchao Zhang <junchao.zhang at gmail.com>
>> wrote:
>>
>>> Randy,
>>>   Up to now I could not reproduce your error, even with the largest run:
>>> mpirun -n 5120 ./test -nsubs 320 -nx 100 -ny 100 -nz 100
>>>   While I continue testing, you can try other options. It looks like you
>>> want to duplicate a vector to subcomms. I don't think you need the two
>>> lines:
>>>
>>> call AOApplicationToPetsc(aoParent,nis,ind1,ierr)
>>> call AOApplicationToPetsc(aoSub,nis,ind2,ierr)
>>>
>>>  In addition, you can use simpler and more memory-efficient index sets.
>>> There is a petsc example for this task, see case 3 in
>>> https://gitlab.com/petsc/petsc/-/blob/master/src/vec/vscat/tests/ex9.c
>>>  BTW, it is good to use petsc master so we are on the same page.
>>> --Junchao Zhang
>>>
>>>
>>> On Wed, Apr 15, 2020 at 10:28 AM Randall Mackie <rlmackie862 at gmail.com>
>>> wrote:
>>>
>>>> Hi Junchao,
>>>>
>>>> So I was able to create a small test code that duplicates the issue we
>>>> have been having, and it is attached to this email in a zip file.
>>>> Included is the test.F90 code, the commands to duplicate crash and to
>>>> duplicate a successful run, output errors, and our petsc configuration.
>>>>
>>>> Our findings to date include:
>>>>
>>>> - The error is reproducible in a very short time with this script.
>>>> - It is related to nproc*nsubs and (although to a lesser extent) to the
>>>>   DM grid size.
>>>> - It happens regardless of MPI implementation (mpich, intel mpi 2018,
>>>>   2019, openmpi) or compiler (gfortran/gcc, intel 2018).
>>>> - Changing vecscatter_type to mpi1 or mpi3 has no effect. mpi1 seems to
>>>>   slightly increase the limit, but still fails on the full machine set.
>>>> - Nothing looks interesting in valgrind.
>>>>
>>>> Our initial tests were carried out on an Azure cluster, but we also
>>>> tested on our smaller cluster, and we found the following:
>>>>
>>>> Works:
>>>> $PETSC_DIR/lib/petsc/bin/petscmpiexec -n 1280 -hostfile hostfile ./test -nsubs 80 -nx 100 -ny 100 -nz 100
>>>>
>>>> Crashes (this works on Azure)
>>>> $PETSC_DIR/lib/petsc/bin/petscmpiexec -n 2560 -hostfile hostfile ./test -nsubs 80 -nx 100 -ny 100 -nz 100
>>>>
>>>> So it looks like it may also be related to the physical number of
>>>> nodes.
>>>>
>>>> In any case, even with 2560 processes on 192 cores the memory does not
>>>> go above 3.5 GB, so you don’t need a huge cluster to test.
>>>>
>>>> Thanks,
>>>>
>>>> Randy M.
>>>>
>>>>
>>>>
>>>> On Apr 14, 2020, at 12:23 PM, Junchao Zhang <junchao.zhang at gmail.com>
>>>> wrote:
>>>>
>>>> There is an MPI_Allreduce in PetscGatherNumberOfMessages, which is why I
>>>> suspected it was the problem. Even if users configure petsc with 64-bit
>>>> indices, we use PetscMPIInt in MPI calls. So that is not a problem.
>>>> Try -vecscatter_type mpi1 to revert to the original VecScatter
>>>> implementation. If the problem still remains, could you provide a test
>>>> example for me to debug?
>>>>
>>>> --Junchao Zhang
>>>>
>>>>
>>>> On Tue, Apr 14, 2020 at 12:13 PM Randall Mackie <rlmackie862 at gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Junchao,
>>>>>
>>>>> We have tried your two suggestions but the problem remains.
>>>>> And the problem seems to be on the MPI_Isend line 117 in
>>>>> PetscGatherMessageLengths and not MPI_AllReduce.
>>>>>
>>>>> We have now tried Intel MPI, Mpich, and OpenMPI, and so are thinking
>>>>> the problem must be elsewhere and not MPI.
>>>>>
>>>>> Given that this is a 64-bit indices build of PETSc, is there some
>>>>> possible incompatibility between PETSc and MPI calls?
>>>>>
>>>>> We are open to any other possible suggestions to try as other than
>>>>> valgrind on thousands of processes we seem to have run out of ideas.
>>>>>
>>>>> Thanks, Randy M.
>>>>>
>>>>> On Apr 13, 2020, at 8:54 AM, Junchao Zhang <junchao.zhang at gmail.com>
>>>>> wrote:
>>>>>
>>>>>
>>>>> --Junchao Zhang
>>>>>
>>>>>
>>>>> On Mon, Apr 13, 2020 at 10:53 AM Junchao Zhang <
>>>>> junchao.zhang at gmail.com> wrote:
>>>>>
>>>>>> Randy,
>>>>>>    Someone reported a similar problem before. It turned out to be an
>>>>>> Intel MPI MPI_Allreduce bug.  A workaround is setting the environment variable
>>>>>> I_MPI_ADJUST_ALLREDUCE=1.arr
>>>>>>
>>>>>  Correct:  I_MPI_ADJUST_ALLREDUCE=1
>>>>>
>>>>>>    But you mentioned mpich also had the error. So maybe the problem
>>>>>> is not the same. So let's try the workaround first. If it doesn't work, add
>>>>>> another petsc option -build_twosided allreduce, which is a workaround for
>>>>>> Intel MPI_Ibarrier bugs we met.
>>>>>>    Thanks.
>>>>>> --Junchao Zhang
>>>>>>
>>>>>>
>>>>>> On Mon, Apr 13, 2020 at 10:38 AM Randall Mackie <
>>>>>> rlmackie862 at gmail.com> wrote:
>>>>>>
>>>>>>> Dear PETSc users,
>>>>>>>
>>>>>>> We are trying to understand an issue that has come up in running our
>>>>>>> code on a large cloud cluster with a large number of processes and subcomms.
>>>>>>> This is code that we use daily on multiple clusters without
>>>>>>> problems, and that runs valgrind clean for small test problems.
>>>>>>>
>>>>>>> The run generates the following messages, but doesn’t crash, just
>>>>>>> seems to hang with all processes continuing to show activity:
>>>>>>>
>>>>>>> [492]PETSC ERROR: #1 PetscGatherMessageLengths() line 117 in
>>>>>>> /mnt/home/cgg/PETSc/petsc-3.12.4/src/sys/utils/mpimesg.c
>>>>>>> [492]PETSC ERROR: #2 VecScatterSetUp_SF() line 658 in
>>>>>>> /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/impls/sf/vscatsf.c
>>>>>>> [492]PETSC ERROR: #3 VecScatterSetUp() line 209 in
>>>>>>> /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/interface/vscatfce.c
>>>>>>> [492]PETSC ERROR: #4 VecScatterCreate() line 282 in
>>>>>>> /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/interface/vscreate.c
>>>>>>>
>>>>>>>
>>>>>>> Looking at line 117 in PetscGatherMessageLengths we find the
>>>>>>> offending statement is the MPI_Isend:
>>>>>>>
>>>>>>>
>>>>>>>   /* Post the Isends with the message length-info */
>>>>>>>   for (i=0,j=0; i<size; ++i) {
>>>>>>>     if (ilengths[i]) {
>>>>>>>       ierr = MPI_Isend((void*)(ilengths+i),1,MPI_INT,i,tag,comm,s_waits+j);CHKERRQ(ierr);
>>>>>>>       j++;
>>>>>>>     }
>>>>>>>   }
>>>>>>>
>>>>>>> We have tried this with Intel MPI 2018, 2019, and mpich, all giving
>>>>>>> the same problem.
>>>>>>>
>>>>>>> We suspect there is some limit being set on this cloud cluster on
>>>>>>> the number of file connections or something, but we don’t know.
>>>>>>>
>>>>>>> Does anyone have any ideas? We are sort of grasping at straws at this
>>>>>>> point.
>>>>>>>
>>>>>>> Thanks, Randy M.
>>>>>>>
>>>>>>
>>>>>
>>>>
>