[petsc-users] MPI error for large number of processes and subcomms
Junchao Zhang
junchao.zhang at gmail.com
Mon Apr 13 10:54:31 CDT 2020
--Junchao Zhang
On Mon, Apr 13, 2020 at 10:53 AM Junchao Zhang <junchao.zhang at gmail.com>
wrote:
> Randy,
> Someone reported similar problem before. It turned out an Intel MPI
> MPI_Allreduce bug. A workaround is setting the environment variable
> I_MPI_ADJUST_ALLREDUCE=1.arr
>
Correct: I_MPI_ADJUST_ALLREDUCE=1
> But you mentioned mpich also shows the error, so the problem may not be
> the same. Let's try the workaround first. If it doesn't work, add
> another PETSc option, -build_twosided allreduce, which is a workaround for
> Intel MPI_Ibarrier bugs we have met.
> Thanks.
> --Junchao Zhang
>
>
> On Mon, Apr 13, 2020 at 10:38 AM Randall Mackie <rlmackie862 at gmail.com>
> wrote:
>
>> Dear PETSc users,
>>
>> We are trying to understand an issue that has come up in running our code
>> on a large cloud cluster with a large number of processes and subcomms.
>> This is code that we use daily on multiple clusters without problems, and
>> that runs valgrind clean for small test problems.
>>
>> The run generates the following messages but doesn't crash; it just
>> seems to hang, with all processes continuing to show activity:
>>
>> [492]PETSC ERROR: #1 PetscGatherMessageLengths() line 117 in
>> /mnt/home/cgg/PETSc/petsc-3.12.4/src/sys/utils/mpimesg.c
>> [492]PETSC ERROR: #2 VecScatterSetUp_SF() line 658 in
>> /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/impls/sf/vscatsf.c
>> [492]PETSC ERROR: #3 VecScatterSetUp() line 209 in
>> /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/interface/vscatfce.c
>> [492]PETSC ERROR: #4 VecScatterCreate() line 282 in
>> /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/interface/vscreate.c
>>
>>
>> Looking at line 117 in PetscGatherMessageLengths, we find the offending
>> statement is the MPI_Isend:
>>
>>
>> /* Post the Isends with the message length-info */
>> for (i=0,j=0; i<size; ++i) {
>>   if (ilengths[i]) {
>>     ierr = MPI_Isend((void*)(ilengths+i),1,MPI_INT,i,tag,comm,s_waits+j);CHKERRQ(ierr);
>>     j++;
>>   }
>> }
>>
>> We have tried this with Intel MPI 2018, 2019, and mpich, all giving the
>> same problem.
>>
>> We suspect some limit is being enforced on this cloud cluster, perhaps on
>> the number of open file descriptors or connections per process, but we
>> don't know.
>>
>> Anyone have any ideas? We are sort of grasping for straws at this point.
>>
>> Thanks, Randy M.
>>
>
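The workaround suggested above can be sketched as a launch recipe. This is a generic sketch: the executable name and process count are placeholders, not values from the thread.

```shell
# Force Intel MPI to use its first (reference) allreduce algorithm,
# sidestepping the tuned implementation with the reported bug.
export I_MPI_ADJUST_ALLREDUCE=1

# Hypothetical launch; ./app and -n 1024 are placeholders.
# -build_twosided allreduce tells PETSc to use an allreduce-based
# setup instead of the MPI_Ibarrier-based one.
mpiexec -n 1024 ./app -build_twosided allreduce
```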
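To probe the "some limit" suspicion in Randy's message, the per-process resource limits on a compute node can be inspected directly. A minimal sketch, generic Linux rather than anything specific to this cluster:

```shell
# Print limits that commonly bite large MPI jobs on a node:
ulimit -n   # max open file descriptors (connections consume these)
ulimit -u   # max user processes/threads
# System-wide PID limit (Linux only; silently skipped elsewhere):
cat /proc/sys/kernel/pid_max 2>/dev/null
```

Run inside the job (e.g. via the batch script) rather than on a login node, since limits often differ between the two.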