[petsc-users] Do the guards against calling MPI_Comm_dup() in PetscCommDuplicate() apply with Fortran?

Smith, Barry F. bsmith at mcs.anl.gov
Fri Nov 1 10:24:22 CDT 2019


  Certain OpenMPI versions have a bug where, even when you properly duplicate and then free communicators, the implementation eventually "runs out of communicators". This is definitely a bug, and it was fixed in later OpenMPI versions; we wasted a lot of time tracking it down in the past. By now the affected releases are old; the OpenMPI site https://www.open-mpi.org/software/ompi/v4.0/ lists the buggy versions as retired.
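
  To make the failure mode concrete, a minimal standalone reproducer for this class of bug would look roughly like the sketch below (illustrative only, not taken from an OpenMPI bug report): every duplicate is freed before the next one is created, so a correct MPI implementation can run the loop indefinitely, while the buggy releases eventually fail inside MPI_Comm_dup().

    /* Hypothetical reproducer: repeated, correctly paired dup/free of a
       communicator.  A correct MPI implementation never runs out of
       communicators here. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
      int rank;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      for (int i = 0; i < 100000; i++) {
        MPI_Comm dup;
        MPI_Comm_dup(MPI_COMM_WORLD, &dup);  /* acquires a fresh context id */
        MPI_Comm_free(&dup);                 /* releases it immediately     */
      }
      if (!rank) printf("completed all dup/free iterations\n");
      MPI_Finalize();
      return 0;
    }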

   So the question is: should PETSc attempt to change its behavior, or add functionality or hacks, to work around this bug?

   My answer is NO. This is a "NEW" cluster! By definition, a "NEW" cluster is not running OpenMPI 2.1. The cluster managers need to remove the buggy version of OpenMPI from their system. If they are incapable of doing the most elementary part of their job (removing buggy code), then the application person is stuck putting hacks into their code to work around the bugs on their cluster; it cannot be PETSc's responsibility to distort itself because of ancient bugs in other software.

  Barry

Note that this OpenMPI bug does not affect very many MPI or PETSc codes: it only affects codes that correctly call duplicate and free many times. This is why PETSc's configure does not blacklist the buggy OpenMPI versions (though perhaps it should).
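
For anyone unfamiliar with the guard the subject line refers to: PetscCommDuplicate() calls MPI_Comm_dup() only the first time it sees a given user communicator and caches the resulting inner communicator as an MPI attribute on it, so later PETSc objects on the same communicator reuse the cached duplicate. The sketch below shows the general attribute-caching technique; the function and variable names are made up for illustration, and PETSc's actual implementation additionally reference counts the inner communicator and manages tags.

    /* Illustrative sketch of communicator caching via MPI attributes
       (not PETSc's actual code).  Assumes MPI_Init has been called.
       A real implementation would also install a delete callback to
       free the cached duplicate and would reference count it. */
    #include <mpi.h>
    #include <stdlib.h>

    static int cached_keyval = MPI_KEYVAL_INVALID;

    void get_inner_comm(MPI_Comm user_comm, MPI_Comm *inner_comm)
    {
      MPI_Comm *cached;
      int       found;

      if (cached_keyval == MPI_KEYVAL_INVALID)
        MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, MPI_COMM_NULL_DELETE_FN,
                               &cached_keyval, NULL);

      MPI_Comm_get_attr(user_comm, cached_keyval, &cached, &found);
      if (found) {                      /* reuse the cached duplicate */
        *inner_comm = *cached;
        return;
      }
      cached = (MPI_Comm *)malloc(sizeof(MPI_Comm));
      MPI_Comm_dup(user_comm, cached);  /* done at most once per user communicator */
      MPI_Comm_set_attr(user_comm, cached_keyval, cached);
      *inner_comm = *cached;
    }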



> On Nov 1, 2019, at 5:41 AM, Patrick Sanan via petsc-users <petsc-users at mcs.anl.gov> wrote:
> 
> Context: I'm trying to track down an error that (only) arises when running a Fortran 90 code, using PETSc, on a new cluster. The code creates and destroys a linear system (Mat, Vec, and KSP) at each of (many) timesteps. The error message from a user looks like this, which leads me to suspect that MPI_Comm_dup() is being called many times and that this eventually becomes a problem for this particular MPI implementation (Open MPI 2.1.0):
> 
> [lo-a2-058:21425] *** An error occurred in MPI_Comm_dup
> [lo-a2-058:21425] *** reported by process [4222287873,2]
> [lo-a2-058:21425] *** on communicator MPI COMMUNICATOR 65534 DUP FROM 65533
> [lo-a2-058:21425] *** MPI_ERR_INTERN: internal error
> [lo-a2-058:21425] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [lo-a2-058:21425] ***    and potentially your MPI job)
> 
> Question: I remember some discussion recently (but can't find the thread) about not calling MPI_Comm_dup() too many times from PetscCommDuplicate(), which would allow one to safely use the (admittedly not optimal) approach used in this application code. Is that a correct understanding, and would the fixes made in that context also apply to Fortran? I don't fully understand the details of the MPI techniques used, so I thought I'd ask here.
> 
> If I hack a simple build-solve-destroy example to run several outer loops, I see a notable difference between the C and Fortran examples. With the attached ex223.c and ex221f.F90, which just add outer loops (5 iterations) to the KSP tutorial examples ex23.c and ex21f.F90, respectively, I see the following. Note that in the Fortran case the communicator appears to actually be duplicated in each loop iteration, whereas in the C case this only happens in the first one:
> 
> [(arch-maint-extra-opt) tutorials (maint *$%=)]$ ./ex223 -info | grep PetscCommDuplicate
> [0] PetscCommDuplicate(): Duplicating a communicator 1140850688 -2080374784 max tags = 268435455
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> 
> [(arch-maint-extra-opt) tutorials (maint *$%=)]$ ./ex221f -info | grep PetscCommDuplicate
> [0] PetscCommDuplicate(): Duplicating a communicator 1140850688 -2080374784 max tags = 268435455
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Duplicating a communicator 1140850688 -2080374784 max tags = 268435455
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Duplicating a communicator 1140850688 -2080374784 max tags = 268435455
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Duplicating a communicator 1140850688 -2080374784 max tags = 268435455
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Duplicating a communicator 1140850688 -2080374784 max tags = 268435455
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
> 
> 
> 
> <ex221f.F90><ex223.c>
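
For readers following along, the outer loop added in the attached examples is just a repeated create-solve-destroy cycle around the existing tutorial code. In C it is schematically something like the sketch below (an illustration, not the attached file verbatim; error checking and the actual assembly and solve are omitted).

    #include <petscksp.h>

    int main(int argc, char **argv)
    {
      PetscInitialize(&argc, &argv, NULL, NULL);
      for (int i = 0; i < 5; i++) {
        KSP ksp;
        KSPCreate(PETSC_COMM_WORLD, &ksp); /* PetscCommDuplicate() runs here; -info shows
                                              whether it duplicates or reuses the inner comm */
        /* ... create and assemble the Mat and Vecs, KSPSetOperators(),
               KSPSolve(), as in ex23.c ... */
        KSPDestroy(&ksp);                  /* drops this object's reference to the
                                              inner communicator */
      }
      PetscFinalize();
      return 0;
    }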


