[petsc-users] MPI error for large number of processes and subcomms

Junchao Zhang junchao.zhang at gmail.com
Thu Apr 30 20:29:58 CDT 2020


I guess you can fix that with an additional option: -build_twosided allreduce

We have two algorithms for PetscCommBuildTwoSided: ibarrier (used when the # of
ranks > 1024) and allreduce (otherwise). The flow control with ibarrier is
much weaker than with allreduce, though in my tests both worked.
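
For example (adapting the command from your output below; adjust to your actual
run script), the option is simply appended:

  mpiexec -np 1280 -hostfile machines ./test -nsubs 160 -nx 100 -ny 100 -nz 10 \
    -max_pending_isends 64 -build_twosided allreduce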

Thanks.
--Junchao Zhang


On Thu, Apr 30, 2020 at 7:46 PM Randall Mackie <rlmackie862 at gmail.com>
wrote:

> Hi Junchao,
>
> Unfortunately these modifications did not work on our cluster (see output
> below).
> However, I am not asking you to spend any more time on this, as we are able
> to avoid the problem by setting appropriate sysctl parameters in
> /etc/sysctl.conf.
>
> Thank you again for all your help on this.
>
> Randy
>
>
> Output of test program:
>
>  mpiexec -np 1280 -hostfile machines ./test -nsubs 160 -nx 100 -ny 100 -nz
> 10 -max_pending_isends 64
> Started
>
>  ind2 max              31999999
>  nis                  33600
> begin VecScatter create
> [1175]PETSC ERROR: #1 PetscCommBuildTwoSided_Ibarrier() line 102 in
> /state/std2/FEMI/PETSc/petsc-jczhang-throttle-pending-isends/src/sys/utils/mpits.c
> [1175]PETSC ERROR: #2 PetscCommBuildTwoSided() line 313 in
> /state/std2/FEMI/PETSc/petsc-jczhang-throttle-pending-isends/src/sys/utils/mpits.c
> [1175]PETSC ERROR: #3 PetscSFSetUp_Basic() line 33 in
> /state/std2/FEMI/PETSc/petsc-jczhang-throttle-pending-isends/src/vec/is/sf/impls/basic/sfbasic.c
> [1175]PETSC ERROR: #4 PetscSFSetUp() line 253 in
> /state/std2/FEMI/PETSc/petsc-jczhang-throttle-pending-isends/src/vec/is/sf/interface/sf.c
> [1175]PETSC ERROR: #5 VecScatterSetUp_SF() line 747 in
> /state/std2/FEMI/PETSc/petsc-jczhang-throttle-pending-isends/src/vec/vscat/impls/sf/vscatsf.c
> [1175]PETSC ERROR: #6 VecScatterSetUp() line 208 in
> /state/std2/FEMI/PETSc/petsc-jczhang-throttle-pending-isends/src/vec/vscat/interface/vscatfce.c
> [1175]PETSC ERROR: #7 VecScatterCreate() line 287 in
> /state/std2/FEMI/PETSc/petsc-jczhang-throttle-pending-isends/src/vec/vscat/interface/vscreate.c
>
>
>
> On Apr 27, 2020, at 9:59 AM, Junchao Zhang <junchao.zhang at gmail.com>
> wrote:
>
> Randy,
>    You are absolutely right. The AOApplicationToPetsc calls cannot be
> removed.  Since the excessive communication is inevitable, I made two
> changes in petsc to ease it. One is to skew the communication so that each
> rank sends to ranks greater than itself first. The other is an option,
> -max_pending_isends, to control the number of pending isends. The current
> default is 512.
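>
>    As a rough illustration only (a minimal sketch of those two ideas, not the
> actual code in the MR; the function name and its arguments are hypothetical):
>
>   #include <mpi.h>
>   #include <stdlib.h>
>
>   /* Send one integer to each destination, visiting ranks greater than mine
>      first and keeping at most max_pending isends in flight. */
>   static void throttled_sends(MPI_Comm comm,int ndest,const int *dest,
>                               int *payload,int tag,int max_pending)
>   {
>     int          rank,pass,i,idx,npending = 0;
>     MPI_Request *reqs = (MPI_Request*)malloc(max_pending*sizeof(MPI_Request));
>
>     MPI_Comm_rank(comm,&rank);
>     for (pass=0; pass<2; pass++) {               /* pass 0: dest > rank; pass 1: the rest */
>       for (i=0; i<ndest; i++) {
>         if ((pass == 0) != (dest[i] > rank)) continue;
>         if (npending == max_pending) {           /* throttle: wait for one send to finish */
>           MPI_Waitany(max_pending,reqs,&idx,MPI_STATUS_IGNORE);
>           reqs[idx] = reqs[--npending];          /* compact the active-request array */
>         }
>         MPI_Isend(&payload[i],1,MPI_INT,dest[i],tag,comm,&reqs[npending++]);
>       }
>     }
>     MPI_Waitall(npending,reqs,MPI_STATUSES_IGNORE); /* drain the remaining sends */
>     free(reqs);
>   }
>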
>    I have an MR at https://gitlab.com/petsc/petsc/-/merge_requests/2757.
> I tested it dozens of times with your example at 5120 ranks. It worked fine.
>    Please try it in your environment and let me know the result. Since the
> failure is random, you may need to run multiple times.
>
>   BTW, if you have no objection, I'd like to add your excellent example to the
> petsc repo.
>
>    Thanks
> --Junchao Zhang
>
>
> On Fri, Apr 24, 2020 at 5:32 PM Randall Mackie <rlmackie862 at gmail.com>
> wrote:
>
>> Hi Junchao,
>>
>> I tested by commenting out the AOApplicationToPetsc calls as you suggest,
>> but it doesn’t work because it doesn’t maintain the proper order of the
>> elements in the scattered vectors.
>>
>> I attach a modified version of the test code where I put elements into
>> the global vector, then carry out the scatter, and check on the subcomms
>> that they are correct.
>>
>> You can see everything is fine with the AOApplicationToPetsc calls, but
>> the comparison fails when those are commented out.
>>
>> If there is some way I can achieve the right VecScatters without those
>> calls, I would be happy to know how to do that.
>>
>> Thank you again for your help.
>>
>> Randy
>>
>> ps. I suggest you run this test with nx=ny=nz=10, only a couple of
>> subcomms, and maybe 4 processes to demonstrate the behavior.
>>
>>
>> On Apr 20, 2020, at 2:45 PM, Junchao Zhang <junchao.zhang at gmail.com>
>> wrote:
>>
>> Hello, Randy,
>>   I looked further at the problem and believe it was due to overwhelming
>> traffic. The code sometimes fails at MPI_Waitall. I printed out the MPI error
>> strings of the bad MPI statuses. One of them is like
>> "MPID_nem_tcp_connpoll(1845): Communication error with rank 25: Connection
>> reset by peer", which is a TCP error and has nothing to do with petsc.
>>   Further investigation shows that in the case of 5120 ranks with 320 sub
>> communicators, during VecScatterSetUp each rank has around 640
>> isend/irecv neighbors, and quite a few ranks have 1280 isend neighbors. I
>> guess these overwhelming isends occasionally crashed the connection.
>>   The piece of code in VecScatterSetUp calculates the communication
>> pattern. With index sets having good locality, the calculation itself
>> incurs less traffic. Here good locality means the indices in an index set
>> mostly point to local entries. However, the AOApplicationToPetsc() calls in
>> your code unnecessarily ruined the good petsc ordering. If we remove
>> AOApplicationToPetsc() (the vecscatter result is still correct), then each
>> rank uniformly has around 320 isends/irecvs.
>>   So, please test with this modification and see if it really works in your
>> environment. If it is not applicable, we can provide options in petsc to carry
>> out the communication in phases to avoid flooding the network (though this is
>> better done by MPI).
>>
>>  Thanks.
>> --Junchao Zhang
>>
>>
>> On Fri, Apr 17, 2020 at 10:47 AM Randall Mackie <rlmackie862 at gmail.com>
>> wrote:
>>
>>> Hi Junchao,
>>>
>>> Thank you for your efforts.
>>> We tried petsc-3.13.0 but it made no difference.
>>> We now think the issue is with sysctl parameters, and increasing those
>>> seems to have cleared up the problem.
>>> This also most likely explains why different clusters behaved differently
>>> with our test code.
>>>
>>> We are now running our code and will report back once we are sure that
>>> there are no further issues.
>>>
>>> Thanks again for your help.
>>>
>>> Randy M.
>>>
>>> On Apr 17, 2020, at 8:09 AM, Junchao Zhang <junchao.zhang at gmail.com>
>>> wrote:
>>>
>>>
>>>
>>>
>>> On Thu, Apr 16, 2020 at 11:13 PM Junchao Zhang <junchao.zhang at gmail.com>
>>> wrote:
>>>
>>>> Randy,
>>>>   I reproduced your error with petsc-3.12.4 and 5120 mpi ranks. I also
>>>> found the error went away with petsc-3.13.  However, I have not figured out
>>>> what the bug is or which commit fixed it :).
>>>>   So on your side, it is better to use the latest petsc.
>>>>
>>> I want to add that even with petsc-3.12.4 the error is random. I was
>>> only able to reproduce the error once, so I cannot claim that petsc-3.13
>>> actually fixed it (or that the bug is really in petsc).
>>>
>>>
>>>> --Junchao Zhang
>>>>
>>>>
>>>> On Thu, Apr 16, 2020 at 9:06 PM Junchao Zhang <junchao.zhang at gmail.com>
>>>> wrote:
>>>>
>>>>> Randy,
>>>>>   Up to now I have not been able to reproduce your error, even with the
>>>>> biggest case: mpirun -n 5120 ./test -nsubs 320 -nx 100 -ny 100 -nz 100
>>>>>   While I continue testing, you can try other options. It looks like you
>>>>> want to duplicate a vector to subcomms. I don't think you need these two
>>>>> lines:
>>>>>
>>>>> call AOApplicationToPetsc(aoParent,nis,ind1,ierr)
>>>>> call AOApplicationToPetsc(aoSub,nis,ind2,ierr)
>>>>>
>>>>>  In addition, you can use simpler and more memory-efficient index
>>>>> sets. There is a petsc example for this task; see case 3 in
>>>>> https://gitlab.com/petsc/petsc/-/blob/master/src/vec/vscat/tests/ex9.c
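>>>>>
>>>>>  For instance, a minimal sketch of the general idea (not the exact code of
>>>>> ex9.c; it assumes x lives on PETSC_COMM_WORLD, y is its copy on a subcomm,
>>>>> both with the same global size, and uses the ierr/CHKERRQ style):
>>>>>
>>>>>   Vec            x,y;          /* assumed created elsewhere */
>>>>>   IS             ix,iy;
>>>>>   VecScatter     vscat;
>>>>>   PetscInt       rstart,rend,n;
>>>>>   PetscErrorCode ierr;
>>>>>
>>>>>   ierr = VecGetOwnershipRange(y,&rstart,&rend);CHKERRQ(ierr);
>>>>>   n    = rend - rstart;
>>>>>   /* Contiguous, locally owned entries: no explicit index array is stored */
>>>>>   ierr = ISCreateStride(PETSC_COMM_SELF,n,rstart,1,&ix);CHKERRQ(ierr);
>>>>>   ierr = ISCreateStride(PETSC_COMM_SELF,n,rstart,1,&iy);CHKERRQ(ierr);
>>>>>   ierr = VecScatterCreate(x,ix,y,iy,&vscat);CHKERRQ(ierr);
>>>>>   ierr = VecScatterBegin(vscat,x,y,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>>>>>   ierr = VecScatterEnd(vscat,x,y,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>>>>>   /* vscat can be reused for repeated scatters; destroy it when done */
>>>>>   ierr = ISDestroy(&ix);CHKERRQ(ierr);
>>>>>   ierr = ISDestroy(&iy);CHKERRQ(ierr);
>>>>>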
>>>>>  BTW, it is good to use petsc master so we are on the same page.
>>>>> --Junchao Zhang
>>>>>
>>>>>
>>>>> On Wed, Apr 15, 2020 at 10:28 AM Randall Mackie <rlmackie862 at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Junchao,
>>>>>>
>>>>>> So I was able to create a small test code that duplicates the issue
>>>>>> we have been having, and it is attached to this email in a zip file.
>>>>>> Included are the test.F90 code, the commands to duplicate the crash and a
>>>>>> successful run, the output errors, and our petsc configuration.
>>>>>>
>>>>>> Our findings to date include:
>>>>>>
>>>>>> - The error is reproducible in a very short time with this script.
>>>>>> - It is related to nproc*nsubs and (although to a lesser extent) to the DM
>>>>>>   grid size.
>>>>>> - It happens regardless of MPI implementation (mpich, intel mpi 2018,
>>>>>>   2019, openmpi) or compiler (gfortran/gcc, intel 2018).
>>>>>> - Changing vecscatter_type to mpi1 or mpi3 has no effect. mpi1 seems to
>>>>>>   slightly increase the limit, but it still fails on the full machine set.
>>>>>> - Nothing looks interesting on valgrind.
>>>>>>
>>>>>> Our initial tests were carried out on an Azure cluster, but we also
>>>>>> tested on our smaller cluster, and we found the following:
>>>>>>
>>>>>> Works:
>>>>>> $PETSC_DIR/lib/petsc/bin/petscmpiexec -n 1280 -hostfile hostfile
>>>>>> ./test -nsubs 80 -nx 100 -ny 100 -nz 100
>>>>>>
>>>>>> Crashes (this works on Azure)
>>>>>> $PETSC_DIR/lib/petsc/bin/petscmpiexec -n 2560 -hostfile hostfile
>>>>>> ./test -nsubs 80 -nx 100 -ny 100 -nz 100
>>>>>>
>>>>>> So it looks like it may also be related to the physical number of
>>>>>> nodes.
>>>>>>
>>>>>> In any case, even with 2560 processes on 192 cores the memory does
>>>>>> not go above 3.5 Gbytes, so you don’t need a huge cluster to test.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Randy M.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Apr 14, 2020, at 12:23 PM, Junchao Zhang <junchao.zhang at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> There is an MPI_Allreduce in PetscGatherNumberOfMessages; that is why
>>>>>> I suspected it was the problem. Even if users configure petsc with 64-bit
>>>>>> indices, we use PetscMPIInt in MPI calls, so that is not a problem.
>>>>>> Try -vecscatter_type mpi1 to restore the original VecScatter
>>>>>> implementation. If the problem still remains, could you provide a test
>>>>>> example for me to debug?
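>>>>>>
>>>>>> For example (a hypothetical launch line; substitute your actual command and
>>>>>> options):
>>>>>>
>>>>>>   mpiexec -n <nproc> ./your_app <your usual options> -vecscatter_type mpi1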
>>>>>>
>>>>>> --Junchao Zhang
>>>>>>
>>>>>>
>>>>>> On Tue, Apr 14, 2020 at 12:13 PM Randall Mackie <
>>>>>> rlmackie862 at gmail.com> wrote:
>>>>>>
>>>>>>> Hi Junchao,
>>>>>>>
>>>>>>> We have tried your two suggestions but the problem remains.
>>>>>>> The problem seems to be at the MPI_Isend on line 117 in
>>>>>>> PetscGatherMessageLengths, and not in MPI_Allreduce.
>>>>>>>
>>>>>>> We have now tried Intel MPI, MPICH, and OpenMPI, and so we are thinking
>>>>>>> the problem must be elsewhere and not in MPI.
>>>>>>>
>>>>>>> Given that this is a 64-bit indices build of PETSc, is there some
>>>>>>> possible incompatibility between PETSc and MPI calls?
>>>>>>>
>>>>>>> We are open to any other suggestions to try, since other than running
>>>>>>> valgrind on thousands of processes we seem to have run out of ideas.
>>>>>>>
>>>>>>> Thanks, Randy M.
>>>>>>>
>>>>>>> On Apr 13, 2020, at 8:54 AM, Junchao Zhang <junchao.zhang at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>> --Junchao Zhang
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Apr 13, 2020 at 10:53 AM Junchao Zhang <
>>>>>>> junchao.zhang at gmail.com> wrote:
>>>>>>>
>>>>>>>> Randy,
>>>>>>>>    Someone reported a similar problem before. It turned out to be an Intel
>>>>>>>> MPI MPI_Allreduce bug.  A workaround is setting the environment variable
>>>>>>>> I_MPI_ADJUST_ALLREDUCE=1
>>>>>>>>
>>>>>>>
>>>>>>>>    But you mentioned mpich also had the error, so maybe the problem is
>>>>>>>> not the same. Let's try the workaround first. If it doesn't work, add
>>>>>>>> another petsc option, -build_twosided allreduce, which is a workaround for
>>>>>>>> Intel MPI_Ibarrier bugs we have met.
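>>>>>>>>
>>>>>>>>    For example, in your job script (a hedged sketch; substitute your
>>>>>>>> actual launch line and options):
>>>>>>>>
>>>>>>>>   # step 1: the Intel MPI workaround
>>>>>>>>   export I_MPI_ADJUST_ALLREDUCE=1
>>>>>>>>   mpirun -n <nproc> ./your_app <your usual options>
>>>>>>>>
>>>>>>>>   # step 2 (only if step 1 does not help): add the petsc option
>>>>>>>>   mpirun -n <nproc> ./your_app <your usual options> -build_twosided allreduce
>>>>>>>>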
>>>>>>>>    Thanks.
>>>>>>>> --Junchao Zhang
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Apr 13, 2020 at 10:38 AM Randall Mackie <
>>>>>>>> rlmackie862 at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Dear PETSc users,
>>>>>>>>>
>>>>>>>>> We are trying to understand an issue that has come up in running
>>>>>>>>> our code on a large cloud cluster with a large number of processes and
>>>>>>>>> subcomms.
>>>>>>>>> This is code that we use daily on multiple clusters without
>>>>>>>>> problems, and that runs valgrind clean for small test problems.
>>>>>>>>>
>>>>>>>>> The run generates the following messages, but doesn’t crash; it just
>>>>>>>>> seems to hang, with all processes continuing to show activity:
>>>>>>>>>
>>>>>>>>> [492]PETSC ERROR: #1 PetscGatherMessageLengths() line 117 in
>>>>>>>>> /mnt/home/cgg/PETSc/petsc-3.12.4/src/sys/utils/mpimesg.c
>>>>>>>>> [492]PETSC ERROR: #2 VecScatterSetUp_SF() line 658 in
>>>>>>>>> /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/impls/sf/vscatsf.c
>>>>>>>>> [492]PETSC ERROR: #3 VecScatterSetUp() line 209 in
>>>>>>>>> /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/interface/vscatfce.c
>>>>>>>>> [492]PETSC ERROR: #4 VecScatterCreate() line 282 in
>>>>>>>>> /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/interface/vscreate.c
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Looking at line 117 in PetscGatherMessageLengths, we find that the
>>>>>>>>> offending statement is the MPI_Isend:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>   /* Post the Isends with the message length-info */
>>>>>>>>>   for (i=0,j=0; i<size; ++i) {
>>>>>>>>>     if (ilengths[i]) {
>>>>>>>>>       ierr =
>>>>>>>>> MPI_Isend((void*)(ilengths+i),1,MPI_INT,i,tag,comm,s_waits+j);CHKERRQ(ierr);
>>>>>>>>>       j++;
>>>>>>>>>     }
>>>>>>>>>   }
>>>>>>>>>
>>>>>>>>> We have tried this with Intel MPI 2018, 2019, and mpich, all
>>>>>>>>> giving the same problem.
>>>>>>>>>
>>>>>>>>> We suspect there is some limit set on this cloud cluster on the
>>>>>>>>> number of file connections or something similar, but we don’t know.
>>>>>>>>>
>>>>>>>>> Anyone have any ideas? We are sort of grasping at straws at this
>>>>>>>>> point.
>>>>>>>>>
>>>>>>>>> Thanks, Randy M.
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>
>>
>