[petsc-users] Do the guards against calling MPI_Comm_dup() in PetscCommDuplicate() apply with Fortran?

Patrick Sanan patrick.sanan at gmail.com
Fri Nov 1 06:45:04 CDT 2019


Ah, really interesting! In the attached ex321f.F90, I create a dummy KSP before the loop, and indeed the behavior is as you say - the communicator is duplicated only once:

[(arch-maint-extra-opt) tutorials (maint *$%=)]$ ./ex321f -info | grep PetscCommDuplicate
[0] PetscCommDuplicate(): Duplicating a communicator 1140850688 -2080374784 max tags = 268435455
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
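
For reference, a minimal sketch of the pattern in ex321f.F90 (illustrative only, not the attached file; the loop body is elided):

program main
#include <petsc/finclude/petscksp.h>
      use petscksp
      implicit none
      KSP            ksp_keepalive, ksp
      PetscErrorCode ierr
      PetscInt       i

      call PetscInitialize(PETSC_NULL_CHARACTER, ierr)

      ! Dummy KSP created before the loop: it holds a reference to the
      ! duplicated inner communicator for the whole run, so the reference
      ! count never drops to zero between iterations and MPI_Comm_dup()
      ! is not called again.
      call KSPCreate(PETSC_COMM_WORLD, ksp_keepalive, ierr)

      do i = 1, 5
         call KSPCreate(PETSC_COMM_WORLD, ksp, ierr)
         ! ... build the system and solve, as in the original example ...
         call KSPDestroy(ksp, ierr)
      end do

      call KSPDestroy(ksp_keepalive, ierr)
      call PetscFinalize(ierr)
end program main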

I've asked the user to re-run with -info, so hopefully I'll be able to see whether the duplication is happening as I expect (in which case your insight might provide at least a workaround), and whether it is somehow choosing a new communicator number each time.

> Am 01.11.2019 um 12:36 schrieb Stefano Zampini <stefano.zampini at gmail.com>:
> 
> I know why your C code does not duplicate the comm at each step. This is because it uses PETSC_VIEWER_STDOUT_WORLD, which basically inserts the duplicated comm into PETSC_COMM_WORLD as an attribute. Try removing the KSPView call and you will see that the C code behaves like the Fortran one.
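> 
> A sketch of the effect (hypothetical loop body, in Fortran for comparison; not the actual ex223.c):
> 
>    do i = 1, 5
>       call KSPCreate(PETSC_COMM_WORLD, ksp, ierr)
>       ! KSPView on PETSC_VIEWER_STDOUT_WORLD stashes the duplicated comm
>       ! as an attribute on PETSC_COMM_WORLD, so it outlives the KSPDestroy
>       call KSPView(ksp, PETSC_VIEWER_STDOUT_WORLD, ierr)
>       call KSPDestroy(ksp, ierr)
>    end do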
> 
> 
>> On Nov 1, 2019, at 2:16 PM, Stefano Zampini <stefano.zampini at gmail.com <mailto:stefano.zampini at gmail.com>> wrote:
>> 
>> From src/sys/objects/ftn-custom/zstart.c (petscinitialize_internal):
>> 
>> PETSC_COMM_WORLD = MPI_COMM_WORLD
>> 
>> This means that PETSC_COMM_WORLD is not a PETSc communicator.
>> 
>> The first matrix creation duplicates PETSC_COMM_WORLD, and the duplicated comm can then be reused by the other objects.
>> When you finally destroy the matrix inside the loop, the reference count of this duplicated comm drops to zero and it is freed.
>> This is why you duplicate at each step.
>> 
>> However, the C version of PetscInitialize does the same, so I’m not sure why this happens with Fortran and not with C. (Do you leak objects in the C code?)
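>> 
>> Schematically (hypothetical loop; assuming the Mat is the first object created and the last one destroyed in each iteration):
>> 
>>    do i = 1, nsteps
>>       ! The first creation finds no PETSc attribute on PETSC_COMM_WORLD,
>>       ! so MPI_Comm_dup() is called (reference count -> 1)
>>       call MatCreate(PETSC_COMM_WORLD, A, ierr)
>>       ! ... Vec/KSP creations reuse the duplicated comm (count > 1) ...
>>       ! Destroying the last object drops the count to 0: the duplicated
>>       ! comm is freed, and the next iteration must duplicate again
>>       call MatDestroy(A, ierr)
>>    end do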
>> 
>> 
>>> On Nov 1, 2019, at 1:41 PM, Patrick Sanan via petsc-users <petsc-users at mcs.anl.gov <mailto:petsc-users at mcs.anl.gov>> wrote:
>>> 
>>> Context: I'm trying to track down an error that (only) arises when running a Fortran 90 code, using PETSc, on a new cluster. The code creates and destroys a linear system (Mat, Vec, and KSP) at each of (many) timesteps. The error message from a user looks like this, which leads me to suspect that MPI_Comm_dup() is being called many times and that this eventually becomes a problem for this particular MPI implementation (Open MPI 2.1.0):
>>> 
>>> [lo-a2-058:21425] *** An error occurred in MPI_Comm_dup
>>> [lo-a2-058:21425] *** reported by process [4222287873,2]
>>> [lo-a2-058:21425] *** on communicator MPI COMMUNICATOR 65534 DUP FROM 65533
>>> [lo-a2-058:21425] *** MPI_ERR_INTERN: internal error
>>> [lo-a2-058:21425] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>> [lo-a2-058:21425] ***    and potentially your MPI job)
>>> 
>>> Question: I remember some discussion recently (but can't find the thread) about not calling MPI_Comm_dup() too many times from PetscCommDuplicate(), which would allow one to safely use the (admittedly not optimal) approach used in this application code. Is that a correct understanding, and would the fixes made in that context also apply to Fortran? I don't fully understand the details of the MPI techniques used, so I thought I'd ask here.
>>> 
>>> If I hack a simple build-solve-destroy example to run several loops, I see a notable difference between the C and Fortran examples. With the attached ex223.c and ex221f.F90, which just add outer loops (5 iterations) to the KSP tutorials examples ex23.c and ex21f.F90, respectively, I see the following. Note that in the Fortran case, a communicator appears to be duplicated in each iteration, whereas in the C case this happens only in the first iteration:
>>> 
>>> [(arch-maint-extra-opt) tutorials (maint *$%=)]$ ./ex223 -info | grep PetscCommDuplicate
>>> [0] PetscCommDuplicate(): Duplicating a communicator 1140850688 -2080374784 max tags = 268435455
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> 
>>> [(arch-maint-extra-opt) tutorials (maint *$%=)]$ ./ex221f -info | grep PetscCommDuplicate
>>> [0] PetscCommDuplicate(): Duplicating a communicator 1140850688 -2080374784 max tags = 268435455
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Duplicating a communicator 1140850688 -2080374784 max tags = 268435455
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Duplicating a communicator 1140850688 -2080374784 max tags = 268435455
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Duplicating a communicator 1140850688 -2080374784 max tags = 268435455
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Duplicating a communicator 1140850688 -2080374784 max tags = 268435455
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374784
>>> 
>>> 
>>> 
>>> <ex221f.F90><ex223.c>
>> 
> 
