[petsc-users] MPI error for large number of processes and subcomms

Randall Mackie rlmackie862 at gmail.com
Tue Apr 14 12:13:45 CDT 2020


Hi Junchao,

We have tried your two suggestions but the problem remains.
The problem appears to be the MPI_Isend at line 117 in PetscGatherMessageLengths, not the MPI_Allreduce.

We have now tried Intel MPI, MPICH, and Open MPI, so we are starting to think the problem lies elsewhere and not in MPI itself.

Given that this is a 64-bit-indices build of PETSc, could there be some incompatibility between PETSc and the MPI calls?
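
To be concrete about what we mean, here is a tiny sketch (not from our code, and assuming a --with-64-bit-indices build) of the kind of mismatch we are worried about: counts and lengths handed to MPI must be plain int even when PetscInt is 64-bit, and PETSc guards that downcast with PetscMPIIntCast():

#include <petscsys.h>

int main(int argc, char **argv)
{
  PetscErrorCode ierr;
  PetscInt       nbig = 3000000000; /* larger than 2^31-1, so it cannot be passed as an MPI int */
  PetscMPIInt    nsmall;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
  /* We expect this to raise a PETSc error rather than silently truncate */
  ierr = PetscMPIIntCast(nbig, &nsmall);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

If the cast errors out cleanly on our build, that path at least behaves as documented and we can rule it out.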

We are open to any other suggestions; short of running valgrind on thousands of processes, we seem to have run out of ideas.
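
In the meantime, below is a small standalone sketch (not our production code, and much simpler than what PETSc actually does) that posts one MPI_Irecv and one MPI_Isend of a single int per rank, roughly mimicking the message-length exchange in PetscGatherMessageLengths. If this also hangs at the same process count on the cloud cluster, that would point at the MPI/fabric layer rather than at PETSc:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
  int size, rank, i;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  int         *sendlen = malloc(size * sizeof(int)); /* one "message length" per destination */
  int         *recvlen = malloc(size * sizeof(int));
  MPI_Request *reqs    = malloc(2 * size * sizeof(MPI_Request));

  for (i = 0; i < size; i++) sendlen[i] = rank + 1;

  /* Post all receives, then all sends, then wait: the same nonblocking pattern as the
     length exchange in PetscGatherMessageLengths (which uses MPI_ANY_SOURCE; here the
     sources are explicit to keep the sketch short) */
  for (i = 0; i < size; i++) MPI_Irecv(&recvlen[i], 1, MPI_INT, i, 0, MPI_COMM_WORLD, &reqs[i]);
  for (i = 0; i < size; i++) MPI_Isend(&sendlen[i], 1, MPI_INT, i, 0, MPI_COMM_WORLD, &reqs[size + i]);
  MPI_Waitall(2 * size, reqs, MPI_STATUSES_IGNORE);

  if (!rank) printf("length exchange on %d ranks completed\n", size);

  free(sendlen); free(recvlen); free(reqs);
  MPI_Finalize();
  return 0;
}

We would launch it with the same mpiexec/srun invocation and rank count as the failing job.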

Thanks, Randy M.

> On Apr 13, 2020, at 8:54 AM, Junchao Zhang <junchao.zhang at gmail.com> wrote:
> 
> On Mon, Apr 13, 2020 at 10:53 AM Junchao Zhang <junchao.zhang at gmail.com> wrote:
> Randy,
>    Someone reported a similar problem before. It turned out to be an Intel MPI MPI_Allreduce bug. A workaround is setting the environment variable I_MPI_ADJUST_ALLREDUCE=1.
>    But you mentioned MPICH also shows the error, so the problem may not be the same. Let's try that workaround first. If it doesn't help, add the PETSc option -build_twosided allreduce, which is a workaround for Intel MPI_Ibarrier bugs we have encountered.
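> 
> For concreteness, here is a rough sketch (untested; normally you would just export the variable in the job script and pass the option on the command line) of applying both workarounds programmatically. Whether Intel MPI honors an I_MPI_* variable set this way depends on the launcher, so treat the setenv as a convenience only:
> 
> #include <stdlib.h>
> #include <petscsys.h>
> 
> int main(int argc, char **argv)
> {
>   PetscErrorCode ierr;
> 
>   /* Must be in the environment before MPI_Init, which PetscInitialize calls */
>   setenv("I_MPI_ADJUST_ALLREDUCE", "1", 1);
> 
>   ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
> 
>   /* Same effect as passing -build_twosided allreduce on the command line,
>      provided it is set before the first VecScatter/PetscSF is built */
>   ierr = PetscOptionsSetValue(NULL, "-build_twosided", "allreduce");CHKERRQ(ierr);
> 
>   /* ... rest of the application ... */
> 
>   ierr = PetscFinalize();
>   return ierr;
> }
> 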
>    Thanks.
> --Junchao Zhang
> 
> 
> On Mon, Apr 13, 2020 at 10:38 AM Randall Mackie <rlmackie862 at gmail.com> wrote:
> Dear PETSc users,
> 
> We are trying to understand an issue that has come up in running our code on a large cloud cluster with a large number of processes and subcomms.
> This is code that we use daily on multiple clusters without problems, and that runs valgrind clean for small test problems.
> 
> The run generates the following messages, but doesn’t crash, just seems to hang with all processes continuing to show activity:
> 
> [492]PETSC ERROR: #1 PetscGatherMessageLengths() line 117 in /mnt/home/cgg/PETSc/petsc-3.12.4/src/sys/utils/mpimesg.c
> [492]PETSC ERROR: #2 VecScatterSetUp_SF() line 658 in /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/impls/sf/vscatsf.c
> [492]PETSC ERROR: #3 VecScatterSetUp() line 209 in /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/interface/vscatfce.c
> [492]PETSC ERROR: #4 VecScatterCreate() line 282 in /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/interface/vscreate.c
> 
> 
> Looking at line 117 in PetscGatherMessageLengths, we find that the offending statement is the MPI_Isend:
> 
>  
>   /* Post the Isends with the message length-info */
>   for (i=0,j=0; i<size; ++i) {
>     if (ilengths[i]) {
>       ierr = MPI_Isend((void*)(ilengths+i),1,MPI_INT,i,tag,comm,s_waits+j);CHKERRQ(ierr);
>       j++;
>     }
>   } 
> 
> We have tried this with Intel MPI 2018, Intel MPI 2019, and MPICH, all giving the same problem.
> 
> We suspect some limit is being imposed on this cloud cluster, perhaps on the number of open file descriptors or network connections per node, but we don't know.
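> 
> If it helps, here is a small per-rank check (just a sketch, POSIX only) that could be run to see whether file-descriptor or locked-memory limits differ between the cloud nodes and our usual clusters:
> 
> #include <stdio.h>
> #include <mpi.h>
> #include <sys/resource.h>
> 
> int main(int argc, char **argv)
> {
>   struct rlimit nofile, memlock;
>   int           rank;
> 
>   MPI_Init(&argc, &argv);
>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> 
>   getrlimit(RLIMIT_NOFILE, &nofile);   /* open descriptors; TCP connections count against this */
>   getrlimit(RLIMIT_MEMLOCK, &memlock); /* locked memory, which RDMA fabrics typically need */
> 
>   printf("[%d] RLIMIT_NOFILE soft=%llu hard=%llu  RLIMIT_MEMLOCK soft=%llu hard=%llu\n",
>          rank, (unsigned long long)nofile.rlim_cur, (unsigned long long)nofile.rlim_max,
>          (unsigned long long)memlock.rlim_cur, (unsigned long long)memlock.rlim_max);
> 
>   MPI_Finalize();
>   return 0;
> }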
> 
> Anyone have any ideas? We are grasping at straws at this point.
> 
> Thanks, Randy M.
