[petsc-users] MPI error for large number of processes and subcomms

Randall Mackie rlmackie862 at gmail.com
Mon Apr 13 10:54:19 CDT 2020


Thanks, we’ll try it and report back.

Randy M.

> On Apr 13, 2020, at 8:53 AM, Junchao Zhang <junchao.zhang at gmail.com> wrote:
> 
> Randy,
>    Someone reported a similar problem before. It turned out to be an Intel MPI MPI_Allreduce bug. A workaround is setting the environment variable I_MPI_ADJUST_ALLREDUCE=1.
>    But you mentioned mpich also had the error, so the problem may not be the same. Let's try the workaround first. If it doesn't work, add another petsc option, -build_twosided allreduce, which is a workaround for Intel MPI_Ibarrier bugs we have met.
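> A sketch of how the two workarounds could be applied before launching the job (PETSC_OPTIONS is PETSc's standard options environment variable; the launch line itself is a placeholder for your own job script):

```shell
# Workaround 1: force Intel MPI to use a specific MPI_Allreduce algorithm.
export I_MPI_ADJUST_ALLREDUCE=1

# Workaround 2: have PETSc use allreduce-based two-sided setup
# instead of MPI_Ibarrier.
export PETSC_OPTIONS="-build_twosided allreduce"

# Then launch as usual, e.g.: mpiexec -n <nproc> ./yourapp
```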
>    Thanks.
> --Junchao Zhang
> 
> 
> On Mon, Apr 13, 2020 at 10:38 AM Randall Mackie <rlmackie862 at gmail.com <mailto:rlmackie862 at gmail.com>> wrote:
> Dear PETSc users,
> 
> We are trying to understand an issue that has come up in running our code on a cloud cluster with a large number of processes and subcomms.
> This is code that we use daily on multiple clusters without problems, and that runs valgrind clean for small test problems.
> 
> The run generates the following messages, but doesn’t crash, just seems to hang with all processes continuing to show activity:
> 
> [492]PETSC ERROR: #1 PetscGatherMessageLengths() line 117 in /mnt/home/cgg/PETSc/petsc-3.12.4/src/sys/utils/mpimesg.c
> [492]PETSC ERROR: #2 VecScatterSetUp_SF() line 658 in /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/impls/sf/vscatsf.c
> [492]PETSC ERROR: #3 VecScatterSetUp() line 209 in /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/interface/vscatfce.c
> [492]PETSC ERROR: #4 VecScatterCreate() line 282 in /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/interface/vscreate.c
> 
> 
> Looking at line 117 in PetscGatherMessageLengths, we find the offending statement is the MPI_Isend:
> 
>  
>   /* Post the Isends with the message length-info */
>   for (i=0,j=0; i<size; ++i) {
>     if (ilengths[i]) {
>       ierr = MPI_Isend((void*)(ilengths+i),1,MPI_INT,i,tag,comm,s_waits+j);CHKERRQ(ierr);
>       j++;
>     }
>   } 
> 
> We have tried this with Intel MPI 2018, 2019, and mpich, all giving the same problem.
> 
> We suspect there is some limit being set on this cloud cluster on the number of file connections or something, but we don’t know.
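> One quick check along those lines: when MPI communicates over TCP, each connected peer consumes a socket, and each socket counts against the per-process open-file-descriptor limit, so a low limit on the cloud nodes could plausibly stall connection setup at large process counts. A sketch of what to inspect on a compute node (run inside the same environment the job uses):

```shell
# Per-process limits that large MPI jobs commonly hit: each TCP
# connection between ranks consumes one file descriptor on the node.
ulimit -n   # max open file descriptors for this shell/process
ulimit -u   # max user processes/threads
```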
> 
> Anyone have any ideas? We are sort of grasping for straws at this point.
> 
> Thanks, Randy M.
