[petsc-dev] Fwd: [petsc-maint] [OMPI users] Deadlock in OpenMPI 1.8.3 and PETSc 3.4.5

Barry Smith bsmith at mcs.anl.gov
Fri Dec 19 11:00:46 CST 2014



> Begin forwarded message:
> 
> Date: December 19, 2014 at 9:44:31 AM CST
> From: George Bosilca <bosilca at icl.utk.edu>
> To: Open MPI Users <users at open-mpi.org>
> Cc: "petsc-maint at mcs.anl.gov" <petsc-maint at mcs.anl.gov>
> Subject: Re: [petsc-maint] [OMPI users] Deadlock in OpenMPI 1.8.3 and PETSc 3.4.5
> 
> On Fri, Dec 19, 2014 at 8:58 AM, Jeff Squyres (jsquyres) <jsquyres at cisco.com> wrote:
> George:
> 
> (I'm not a member of petsc-maint; I have no idea whether my mail will actually go through to that list)
> 
> TL;DR: I do not think that George's change was correct. PETSc is relying on undefined behavior in the MPI standard and should probably update to use a different scheme.
> 
> I will detail below why the change is correct, or at least no worse than what we have today.
> 
> Regarding your second point, while I do tend to agree that such an issue is better addressed in the MPI Forum, the last attempt to fix this was certainly not a resounding success.
> 
> 
> More detail:
> 
> More specifically, George's change can lead to inconsistency/incorrectness in the presence of multiple threads simultaneously executing attribute actions on a single entity.
> 
> Indeed, there is a slight window of opportunity for inconsistencies in the recursive behavior. But the inconsistencies were already in the code, especially in the single-threaded case. As we never received any complaints on this topic, I did not deem it worthwhile to address them in my last commit. Moreover, the specific behavior needed by PETSc is already available in Open MPI when compiled without thread support, as the only thing that "protects" the attributes is that global mutex.
> 
> For example, in ompi_attr_delete_all(), it gets the count of all attributes and then loops <count> times to delete each attribute.  But each attribute callback can now insert or delete attributes on that entity.  This means the loop can either fail to delete an attribute (because some attribute callback already deleted it) or fail to delete *all* attributes (because some attribute callback added more).
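>
> As a toy illustration of that hazard (a made-up attribute table, not the actual Open MPI code), consider a count captured up front while a delete callback mutates the table underneath the loop:
>
>     /* Minimal sketch (not the actual Open MPI code): the count is captured
>      * once, but a delete callback may itself add or remove attributes while
>      * the loop is running. */
>     #include <stdio.h>
>
>     #define MAX_ATTRS 8
>
>     static int attrs[MAX_ATTRS];   /* toy "attribute table": nonzero = present */
>     static int num_attrs = 3;
>
>     /* toy delete callback: deletes a *different* attribute as a side effect */
>     static void delete_callback(int key)
>     {
>         if (key == 0 && attrs[2]) {
>             attrs[2] = 0;          /* mutates the table under the loop */
>             num_attrs--;
>         }
>     }
>
>     int main(void)
>     {
>         attrs[0] = attrs[1] = attrs[2] = 1;
>
>         int count = num_attrs;     /* count captured before the loop */
>         for (int i = 0; i < count; i++) {
>             if (!attrs[i]) {       /* already gone: a callback beat us to it */
>                 printf("attribute %d vanished underneath the loop\n", i);
>                 continue;
>             }
>             delete_callback(i);
>             attrs[i] = 0;
>             num_attrs--;
>         }
>         printf("remaining attributes: %d\n", num_attrs);
>         return 0;
>     }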
> 
> To be extremely precise, the deletion part is always correct, as it copies the values to be deleted into a temporary array before calling any callbacks (and before releasing the mutex), so we only remove what was in the object's attribute hash when the function was called. Don't misunderstand: we have an extremely good reason to do it this way, as we need to call the callbacks in the order in which the attributes were created (mandated by the MPI standard).
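>
> A condensed, standalone sketch of that snapshot-then-callback pattern (toy data structures, not the real ompi_attr_delete_all()) looks like this:
>
>     #include <stdio.h>
>     #include <stdlib.h>
>     #include <string.h>
>
>     typedef struct {
>         int key;
>         void (*delete_fn)(int key);
>     } toy_attr_t;
>
>     static void noisy_delete(int key) { printf("delete callback for key %d\n", key); }
>
>     static void toy_attr_delete_all(toy_attr_t *table, int *count)
>     {
>         int n = *count;
>         toy_attr_t *snapshot = malloc(n * sizeof(*snapshot));
>         memcpy(snapshot, table, n * sizeof(*snapshot));  /* copy before any callback runs */
>
>         /* Callbacks run in creation order and only for attributes that were
>          * present at entry, even if a callback mutates the live table. */
>         for (int i = 0; i < n; i++)
>             if (snapshot[i].delete_fn)
>                 snapshot[i].delete_fn(snapshot[i].key);
>
>         *count = 0;
>         free(snapshot);
>     }
>
>     int main(void)
>     {
>         toy_attr_t table[4] = { {10, noisy_delete}, {11, noisy_delete}, {12, noisy_delete} };
>         int count = 3;
>         toy_attr_delete_all(table, &count);
>         printf("attributes left: %d\n", count);
>         return 0;
>     }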
>  
> ompi_attr_copy_all() has similar problems -- in general, the hash that it is looping over can change underneath it.
> 
> For the copy it is a little bit trickier, as the calling order is not imposed. Our peculiar implementation of the hash table (backed by an array) makes the code work, with a single (possibly minor) exception: when the hash table itself is grown between two calls. However, as stated before, this issue was already present in the single-threaded case for years. Addressing it is another two-line patch, but I leave that exercise to an interested reader.
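>
> For the curious, the growth exception is the classic stale-iterator problem with an array-backed table; the sketch below is only a generic illustration of that problem, not Open MPI's hash code:
>
>     #include <stdint.h>
>     #include <stdio.h>
>     #include <stdlib.h>
>
>     int main(void)
>     {
>         size_t cap = 4, used = 4;
>         int *table = malloc(cap * sizeof(*table));
>         for (size_t i = 0; i < used; i++) table[i] = (int)i;
>
>         /* Remember where the table used to live before it grows. */
>         uintptr_t old_base = (uintptr_t)table;
>
>         /* Something (e.g. a copy callback inserting entries) grows the table. */
>         cap *= 2;
>         table = realloc(table, cap * sizeof(*table));
>
>         /* Any pointer or index computed against the old block may now be
>          * stale; correct code must re-read the base pointer after growth. */
>         if ((uintptr_t)table != old_base)
>             printf("table moved; iterators captured before the growth are stale\n");
>
>         free(table);
>         return 0;
>     }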
> 
>    George.
> 
> 
> This is, unfortunately, an undefined area of the MPI specification.  I do believe that our previous behavior was *correct* -- it just deadlocks with PETSc because PETSc is relying on undefined behavior.
> 
> For those who care, Microsoft/Cisco proposed a new attribute system to the Forum a while ago that removes all these kinds of ambiguities (see http://meetings.mpi-forum.org/secretary/2013/09/slides/jsquyres-attributes-revamp.pdf).  However, we didn't get a huge amount of interest, and therefore lost our window of opportunity to advance the proposal.  I'd be more than happy to talk anyone through the proposal if they have the interest/cycles to take it over and advance it with the Forum.
> 
> Two additional points from the PDF listed above:
> 
> - on slide 21, it was decided to not allow the recursive behavior (i.e., you can ignore the "This is under debate" bullet).
> - the "destroy" callback was not judged to be useful; you can ignore slides 22 and 23.
> 
> 
> On Dec 17, 2014, at 11:16 PM, George Bosilca <bosilca at icl.utk.edu> wrote:
> 
> > Ben,
> >
> > I can't find anything in the MPI standard suggesting that recursive behavior of the attribute deletion is enforced or supported. Thus, the current behavior of Open MPI (a single lock for all attributes), while maybe a little strict, is standard compliant (and thus correct). I guess this is one of those peculiar differences between MPI libraries that makes libraries which take advantage of a particular implementation of the MPI functions less portable.
> >
> > That being said, changing Open MPI to a less restrictive behavior is a 4-line patch [1]. Unfortunately, this will not make your life easier: you will still have to cope with all the existing versions of Open MPI that do not support this recursive behavior.
> >
> > I leave the answer to the question of including this patch into the next release to our release managers.
> >
> >   George.
> >
> > [1] https://github.com/open-mpi/ompi/commit/4d55ae838d5eff2818707da7ba60ffd640144360
> >
> >
> > On Wed, Dec 17, 2014 at 10:11 PM, Howard Pritchard <hppritcha at gmail.com> wrote:
> > Hi Ben,
> >
> > Would you mind checking if you still observe this deadlock condition if you use
> > the 1.8.4 rc4 candidate?
> >
> > openmpi-1.8.4rc4.tar.gz
> >
> > I realize the behavior will likely be the same, but this is just to double check.
> >
> > The Open MPI man page for MPI_Attr_get (hmm.. no MPI_Comm_attr_get man page,
> > needs to be fixed) says nothing about issues with recursion with respect to invoking
> > this function within an attribute delete callback, so I would treat this as a bug.
> >
> > Thanks for your patience,
> >
> > Howard
> >
> >
> > 2014-12-17 17:07 GMT-07:00 Ben Menadue <ben.menadue at nci.org.au>:
> > Hi PETSc and OpenMPI teams,
> >
> > I'm running into a deadlock in PETSc 3.4.5 with OpenMPI 1.8.3:
> >
> >  1. PetscCommDestroy calls MPI_Attr_delete
> >  2. MPI_Attr_delete acquires a lock
> >  3. MPI_Attr_delete calls Petsc_DelComm_Outer (through a callback)
> >  4. Petsc_DelComm_Outer calls MPI_Attr_get
> >  5. MPI_Attr_get wants to also acquire the lock from step 2.
> >
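> > A minimal standalone sketch of that re-entry pattern, written against plain
> > pthreads rather than the real OpenMPI internals (all of the toy_* names
> > below are made up for illustration), shows the same hang:
> >
> >     #include <pthread.h>
> >     #include <stdio.h>
> >
> >     static pthread_mutex_t attribute_lock = PTHREAD_MUTEX_INITIALIZER;
> >
> >     /* stands in for MPI_Attr_get called from inside the delete callback */
> >     static void toy_attr_get(void)
> >     {
> >         if (pthread_mutex_trylock(&attribute_lock) != 0) {
> >             /* with pthread_mutex_lock() this would block forever */
> >             printf("lock already held by this thread: deadlock\n");
> >             return;
> >         }
> >         pthread_mutex_unlock(&attribute_lock);
> >     }
> >
> >     /* stands in for the registered delete callback (Petsc_DelComm_Outer) */
> >     static void toy_delete_callback(void)
> >     {
> >         toy_attr_get();
> >     }
> >
> >     /* stands in for MPI_Attr_delete */
> >     static void toy_attr_delete(void)
> >     {
> >         pthread_mutex_lock(&attribute_lock);   /* step 2: acquire the lock */
> >         toy_delete_callback();                 /* steps 3-5: callback re-enters */
> >         pthread_mutex_unlock(&attribute_lock);
> >     }
> >
> >     int main(void)
> >     {
> >         toy_attr_delete();
> >         return 0;
> >     }
> >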
> > From the OpenMPI source code, it looks like you can't call an MPI_Attr_*
> > function from inside a registered deletion callback. The source notes that
> > all of these functions acquire a single global lock, which is where the
> > problem comes from -- here are the comments and the lock definition, in
> > ompi/attribute/attribute.c of OpenMPI 1.8.3:
> >
> >     404 /*
> >     405  * We used to have multiple locks for semi-fine-grained locking.  But
> >     406  * the code got complex, and we had to spend time looking for subtle
> >     407  * bugs.  Craziness -- MPI attributes are *not* high performance, so
> >     408  * just use a One Big Lock approach: there is *no* concurrent access.
> >     409  * If you have the lock, you can do whatever you want and no data will
> >     410  * change/disapear from underneath you.
> >     411  */
> >     412 static opal_mutex_t attribute_lock;
> >
> > To get it to work, I had to modify the definition of this lock to use a
> > recursive mutex:
> >
> >     412 static opal_mutex_t attribute_lock = { .m_lock_pthread = PTHREAD_RECURSIVE_MUTEX_INITIALIZER_NP };
> >
> > but this is non-portable.
> >
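> > For reference, a portable POSIX way to get the same recursive behaviour
> > (shown here only as a sketch against a plain pthread_mutex_t, not against
> > Open MPI's opal_mutex_t) is to initialize the lock at runtime with
> > PTHREAD_MUTEX_RECURSIVE:
> >
> >     #include <pthread.h>
> >     #include <stdio.h>
> >
> >     static pthread_mutex_t attribute_lock;
> >
> >     /* must run once, before the lock is first used */
> >     static void attribute_lock_init(void)
> >     {
> >         pthread_mutexattr_t attr;
> >         pthread_mutexattr_init(&attr);
> >         pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
> >         pthread_mutex_init(&attribute_lock, &attr);
> >         pthread_mutexattr_destroy(&attr);
> >     }
> >
> >     int main(void)
> >     {
> >         attribute_lock_init();
> >         pthread_mutex_lock(&attribute_lock);
> >         pthread_mutex_lock(&attribute_lock);   /* re-acquisition by the same thread succeeds */
> >         printf("recursive re-lock succeeded\n");
> >         pthread_mutex_unlock(&attribute_lock);
> >         pthread_mutex_unlock(&attribute_lock);
> >         return 0;
> >     }
> >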
> > Is this the behaviour expected from new versions of OpenMPI? If so, a new
> > approach might be needed in PETSc. Otherwise, maybe a per-attribute lock is
> > needed in OpenMPI -- but I'm not sure whether the get in the callback is on
> > the same attribute as the one being deleted.
> >
> > Thanks,
> > Ben
> >
> > #0  0x00007fd7d5de4264 in __lll_lock_wait () from /lib64/libpthread.so.0
> > #1  0x00007fd7d5ddf508 in _L_lock_854 () from /lib64/libpthread.so.0
> > #2  0x00007fd7d5ddf3d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
> > #3  0x00007fd7d27d91bc in ompi_attr_get_c () from /apps/openmpi/1.8.3/lib/libmpi.so.1
> > #4  0x00007fd7d2803f03 in PMPI_Attr_get () from /apps/openmpi/1.8.3/lib/libmpi.so.1
> > #5  0x00007fd7d7716006 in Petsc_DelComm_Outer (comm=0x7fd7d2a83b30, keyval=128, attr_val=0x7fff00a20f00, extra_state=0xffffffffffffffff) at pinit.c:406
> > #6  0x00007fd7d27d8cad in ompi_attr_delete_impl () from /apps/openmpi/1.8.3/lib/libmpi.so.1
> > #7  0x00007fd7d27d8f2f in ompi_attr_delete () from /apps/openmpi/1.8.3/lib/libmpi.so.1
> > #8  0x00007fd7d2803dfc in PMPI_Attr_delete () from /apps/openmpi/1.8.3/lib/libmpi.so.1
> > #9  0x00007fd7d78bf5c5 in PetscCommDestroy (comm=0x7fd7d2a83b30) at tagm.c:256
> > #10 0x00007fd7d7506f58 in PetscHeaderDestroy_Private (h=0x7fd7d2a83b30) at inherit.c:114
> > #11 0x00007fd7d75038a0 in ISDestroy (is=0x7fd7d2a83b30) at index.c:225
> > #12 0x00007fd7d75029b7 in PCReset_ILU (pc=0x7fd7d2a83b30) at ilu.c:42
> > #13 0x00007fd7d77a9baa in PCReset (pc=0x7fd7d2a83b30) at precon.c:81
> > #14 0x00007fd7d77a99ae in PCDestroy (pc=0x7fd7d2a83b30) at precon.c:117
> > #15 0x00007fd7d7557c1a in KSPDestroy (ksp=0x7fd7d2a83b30) at itfunc.c:788
> > #16 0x00007fd7d91cdcca in linearSystemPETSc<double>::~linearSystemPETSc (this=0x7fd7d2a83b30) at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Solver/linearSystemPETSc.hpp:73
> > #17 0x00007fd7d8ddb63b in GFaceCompound::parametrize (this=0x7fd7d2a83b30, step=128, tom=10620672) at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Geo/GFaceCompound.cpp:1672
> > #18 0x00007fd7d8dda0fe in GFaceCompound::parametrize (this=0x7fd7d2a83b30) at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Geo/GFaceCompound.cpp:916
> > #19 0x00007fd7d8f98b0e in checkMeshCompound (gf=0x7fd7d2a83b30, edges=...) at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Mesh/meshGFace.cpp:2588
> > #20 0x00007fd7d8f95c7e in meshGenerator (gf=0xd13020, RECUR_ITER=0, repairSelfIntersecting1dMesh=true, onlyInitialMesh=false, debug=false, replacement_edges=0x0) at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Mesh/meshGFace.cpp:1075
> > #21 0x00007fd7d8f9a41e in meshGFace::operator() (this=0x7fd7d2a83b30, gf=0x80, print=false) at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Mesh/meshGFace.cpp:2562
> > #22 0x00007fd7d8f8c327 in Mesh2D (m=0x7fd7d2a83b30) at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Mesh/Generator.cpp:407
> > #23 0x00007fd7d8f8ad0b in GenerateMesh (m=0x7fd7d2a83b30, ask=128) at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Mesh/Generator.cpp:641
> > #24 0x00007fd7d8e43126 in GModel::mesh (this=0x7fd7d2a83b30, dimension=128) at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Geo/GModel.cpp:535
> > #25 0x00007fd7d8c1acd2 in GmshBatch () at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Common/Gmsh.cpp:240
> > #26 0x000000000040187a in main (argc=-760726736, argv=0x80) at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Common/Main.cpp:27
> >
> >
> 
> 
> --
> Jeff Squyres
> jsquyres at cisco.com
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
