[petsc-users] MPI error for large number of processes and subcomms
Randall Mackie
rlmackie862 at gmail.com
Mon Apr 13 10:54:19 CDT 2020
Thanks, we’ll try that and report back.
Randy M.
> On Apr 13, 2020, at 8:53 AM, Junchao Zhang <junchao.zhang at gmail.com> wrote:
>
> Randy,
> Someone reported a similar problem before. It turned out to be an Intel MPI MPI_Allreduce bug. A workaround is setting the environment variable I_MPI_ADJUST_ALLREDUCE=1.
> But you mentioned MPICH also had the error, so maybe the problem is not the same. Let's try the workaround first. If it doesn't work, add another petsc option, -build_twosided allreduce, which is a workaround for Intel MPI_Ibarrier bugs we have met.
> Thanks.
> --Junchao Zhang
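[Editor's note: the two workarounds above can be sketched as follows. This is a sketch only; the executable name ./app and the rank count are placeholders, and the launcher invocation will depend on your scheduler.]

```shell
# Workaround 1: force Intel MPI to use its first (simplest) MPI_Allreduce
# algorithm instead of the default tuned one, sidestepping the reported bug.
export I_MPI_ADJUST_ALLREDUCE=1

# Workaround 2: have PETSc build its two-sided communication with an
# Allreduce rather than MPI_Ibarrier, avoiding known Ibarrier bugs.
# (Hypothetical launch line; adjust rank count and executable name.)
# mpirun -n 1024 ./app -build_twosided allreduce

# Confirm the environment variable is set for the launched ranks.
echo "I_MPI_ADJUST_ALLREDUCE=$I_MPI_ADJUST_ALLREDUCE"
```

Note that the environment variable must be exported before the MPI launch so that it propagates to every rank.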
>
>
> On Mon, Apr 13, 2020 at 10:38 AM Randall Mackie <rlmackie862 at gmail.com <mailto:rlmackie862 at gmail.com>> wrote:
> Dear PETSc users,
>
> We are trying to understand an issue that has come up in running our code on a large cloud cluster with a large number of processes and subcomms.
> This is code that we use daily on multiple clusters without problems, and that runs valgrind clean for small test problems.
>
> The run generates the following messages, but doesn’t crash, just seems to hang with all processes continuing to show activity:
>
> [492]PETSC ERROR: #1 PetscGatherMessageLengths() line 117 in /mnt/home/cgg/PETSc/petsc-3.12.4/src/sys/utils/mpimesg.c
> [492]PETSC ERROR: #2 VecScatterSetUp_SF() line 658 in /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/impls/sf/vscatsf.c
> [492]PETSC ERROR: #3 VecScatterSetUp() line 209 in /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/interface/vscatfce.c
> [492]PETSC ERROR: #4 VecScatterCreate() line 282 in /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/interface/vscreate.c
>
>
> Looking at line 117 in PetscGatherMessageLengths we find the offending statement is the MPI_Isend:
>
>
> /* Post the Isends with the message length-info */
> for (i=0,j=0; i<size; ++i) {
>   if (ilengths[i]) {
>     ierr = MPI_Isend((void*)(ilengths+i),1,MPI_INT,i,tag,comm,s_waits+j);CHKERRQ(ierr);
>     j++;
>   }
> }
>
> We have tried this with Intel MPI 2018, Intel MPI 2019, and MPICH, all of which give the same problem.
>
> We suspect there is some limit being set on this cloud cluster on the number of file connections or something, but we don’t know.
>
> Anyone have any ideas? We are sort of grasping at straws at this point.
>
> Thanks, Randy M.
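[Editor's note: one quick way to test the "some limit on connections" suspicion raised above is to check the per-process file-descriptor limits on the compute nodes, since open sockets count against this limit and each MPI connection can consume one. Run this inside the job allocation, not on the login node, as cloud images often set different limits for batch jobs.]

```shell
# Soft limit: the cap currently in effect for this process.
ulimit -Sn
# Hard limit: the maximum the soft limit can be raised to without privilege.
ulimit -Hn
```

If the soft limit is small (e.g. 1024) relative to the number of ranks a process must talk to, raising it in the job prologue would be a reasonable experiment.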