[petsc-dev] Registration implicitly collective on COMM_WORLD

Jed Brown jedbrown at mcs.anl.gov
Mon Feb 4 22:40:40 CST 2013


On Mon, Feb 4, 2013 at 10:05 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:

>
>    This is currently a mess.
>
>    Say one process calls PetscFunctionListAdd() with a function pointer,
> but another calls it with the string name of the function. Now both
> processes call PetscFunctionListFind() with a common comm. The process with
> the function pointer will return immediately with the answer. The one
> without the function pointer will start mucking around with dynamic
> libraries which "sometimes" could be collective on the comm so it would
> block?
>
>   These sets of routines evolved organically over time. We need to refactor
> the whole hierarchy of these routines and figure out what collectivity is
> needed.  There are too many potential comms since they were kind of shoved
> in over time.
>
>   It may be simplest if we treat accessing the dynamic libraries as
> completely non-collective. This means removing things like
> PetscDLLibraryRetrieve() which, while a way cool concept, has never proven
> to be practical during its 15 years of existence.
>
>    So are we able to treat accessing dynamic libraries as completely
> non-collective? Will this lose a valuable feature?
>

Sort of. The problem is that independent access to the file system is
already so slow on current hardware that shared libraries bring those
expensive machines to their knees. When we worked on this problem for
Python (which is _heavily_ dependent on dynamic loading), we patched
glibc-rtld so we could get hooks into a library I called "collfs"
(collective file system) that would do a collective open implemented using
MPI_Bcast. Most of this circus could go away if libc provided "dlopenfd()",
in which case we could use shm_open() and avoid touching the file system at
all.

From the glibc implementation, I don't think anyone was trying to make
adding dlopenfd() easy to implement, so we probably have to deal with
paths. Still, if we have a working shm_open and a communicator, we can
avoid the libc-rtld hocus pocus with a fast collective load implemented as:

  rank 0 mmaps the file
  everyone else does shm_open and mmap
  MPI_Bcast
  dlopen("/dev/shm/thelib.so",)
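
The node-local half of those steps can be sketched in plain POSIX C. This is only a sketch under stated assumptions: stage_library_in_shm is a hypothetical helper, the `image` buffer stands in for whatever MPI_Bcast would have delivered (the broadcast itself is left out so the example stays self-contained), and the "/dev/shm" prefix is the Linux convention for where shm_open objects appear, not something POSIX promises.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>     /* O_CREAT, O_RDWR, O_TRUNC */
#include <stdio.h>     /* snprintf */
#include <string.h>    /* memcpy */
#include <sys/mman.h>  /* shm_open, shm_unlink, mmap, munmap */
#include <sys/stat.h>  /* mode bits */
#include <unistd.h>    /* ftruncate, close */

/* Hypothetical helper: copy `len` bytes of a library image into the POSIX
 * shared-memory object `name` (must start with '/', e.g. "/thelib.so") and
 * write the path a later dlopen() could be given into `path`.
 * Returns 0 on success, -1 on failure. */
int stage_library_in_shm(const char *name, const void *image, size_t len,
                         char *path, size_t pathlen)
{
  if (len == 0) return -1;                      /* mmap rejects zero length */

  int fd = shm_open(name, O_CREAT | O_RDWR | O_TRUNC, 0700);
  if (fd < 0) return -1;

  if (ftruncate(fd, (off_t)len) != 0) {         /* size the object first */
    close(fd);
    shm_unlink(name);
    return -1;
  }

  void *dst = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  close(fd);                                    /* the mapping outlives the fd */
  if (dst == MAP_FAILED) {
    shm_unlink(name);
    return -1;
  }

  memcpy(dst, image, len);                      /* in the real scheme, MPI_Bcast
                                                   would fill this buffer */
  munmap(dst, len);

  /* Assumption: Linux exposes shm objects under /dev/shm. */
  int n = snprintf(path, pathlen, "/dev/shm%s", name);
  return (n > 0 && (size_t)n < pathlen) ? 0 : -1;
}
```

In the collective version, one rank per node would call this with the broadcast buffer and then every rank would dlopen the resulting path. Whether dlopen accepts it depends on how /dev/shm is mounted on the machine (a noexec mount would likely get in the way), and someone still has to shm_unlink the object afterwards.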

In summary, I think collective loads are useful even without the
"retrieval" stuff.

Now it's cleaner for modularity to load the entire plugin library up-front
and let PetscDLLibraryRegister_thelib call MatRegister for everything that
it provides. It's easy to manage collectivity this way, but unfortunately,
it eats up startup time and memory. (PETSc's current dynamic registration
is like Emacs "autoload".)
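
To make the eager versus autoload distinction concrete, here is a toy function list in C. It is not PETSc's implementation: FnList, fnlist_add, fnlist_find, and the demo_* names are all hypothetical, and the Loader callback merely stands in for the dlopen/dlsym work that a real lazy lookup would do.

```c
#include <stddef.h>
#include <string.h>

typedef void (*GenericFn)(void);
/* Hypothetical stand-in for "open the plugin library and look up the symbol". */
typedef GenericFn (*Loader)(const char *name);

typedef struct {
  const char *name;
  GenericFn   fn;      /* NULL until resolved */
  Loader      loader;  /* NULL for entries registered with a pointer */
} Entry;

#define MAX_ENTRIES 16
typedef struct { Entry e[MAX_ENTRIES]; int n; } FnList;

/* Register either a ready function pointer (eager) or a loader (lazy). */
int fnlist_add(FnList *l, const char *name, GenericFn fn, Loader loader)
{
  if (l->n == MAX_ENTRIES || (!fn && !loader)) return -1;
  l->e[l->n++] = (Entry){name, fn, loader};
  return 0;
}

/* Look up by name; a lazy entry pays its load cost on the first lookup only. */
GenericFn fnlist_find(FnList *l, const char *name)
{
  for (int i = 0; i < l->n; i++) {
    Entry *e = &l->e[i];
    if (strcmp(e->name, name) != 0) continue;
    if (!e->fn && e->loader) e->fn = e->loader(name);  /* autoload happens here */
    return e->fn;
  }
  return NULL;
}

/* Demo pieces: a counter shows when the "library load" actually happens. */
int demo_loads = 0;
void demo_impl(void) {}
GenericFn demo_loader(const char *name) { (void)name; demo_loads++; return demo_impl; }
```

Registering "eager" with demo_impl and "lazy" with demo_loader leaves demo_loads at 0 until the first fnlist_find on "lazy". That first lookup is the step whose collectivity is hard to reason about, since a process that registered a pointer returns immediately while one that registered a name goes off to load.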

At a cost of at least one reduction per library per communicator, we could
keep track of the scope on which each library has been loaded so that all
loads are safe. Of course performance would go way down if many callers
brought in the library on a small object, but that may be unavoidable.


>    Barry
>
>
> On Feb 4, 2013, at 9:22 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
>
> > On Sat, Feb 2, 2013 at 3:30 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> >
> >    Yeah I noticed this problem but didn't want to deal with it when I
> changed the code.
> >
> > So if we believe the documentation of PetscFunctionListAdd,
> XXInitializePackage() is effectively collective on COMM_WORLD (though not
> documented as such). This means that if
> !defined(PETSC_USE_DYNAMIC_LIBRARIES), the following could deadlock:
> >
> > if (!rank) {
> >   VecCreate(PETSC_COMM_SELF,....);
> > }
> >
> > which would be awfully bad behavior. In reality, PetscFunctionListAdd()
> does not reference comm at all. Why did you add the comm argument?
> "Consistency"?
> >
> > Whatever the "next" documentation system is, it should be taught to
> trace the "collective" attribute and complain if a "Not Collective"
> function calls a Collective function with an argument other than COMM_SELF.
> >
> >
> >     Yes we should remove the "Formally Collective", I was drinking that
> week :-)
> >
> >    Barry
> >
> > On Feb 2, 2013, at 2:54 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
> >
> > > In [1], PetscFunctionListAdd became implicitly collective on
> COMM_WORLD, but all the XXRegisterDynamic() say "Not collective". These
> all have to be updated if this is the case, but I'm not sure it's even a
> good thing. What if we have a big multi-domain simulation in which we
> initialize each of the components on their own subcomm. Those
> sub-components would not be allowed to register methods (or load plugins)
> that they might use because registration was implicitly more global.
> > >
> > > The comm is used by PetscLs and others. This is important because file
> systems are terrible at independent access. (Same for loading shared
> libraries; it's potentially much easier to do it by broadcasting the
> library, though portability is tricky.)
> > >
> > > Anyway, it would be really bad to PetscDLLibraryAppend() on a subcomm
> and have the registration function in the shared lib call
> PCRegisterDynamic() that promotes itself to COMM_WORLD.
> > >
> > > Maybe we need to pass an explicit comm to all the registration
> functions.
> > >
> > > [1]
> https://bitbucket.org/petsc/petsc-dev/commits/07f9e01e040feeb4162253a60ca63556436f4135
> > >
> > > What does "Formally collective" mean anyway? Either it's always safe
> to call independently, it's "Logically collective" so that there is no
> performance impact, but it still needs to be collective to have consistent
> state, or it's Not Collective. This falls under Not Collective because it
> can deadlock if you call it independently.
> >
> >
>
>