[mpich-discuss] Problems with Pcontrol and MPE2 -- fixed, please accept this patch

Anthony Chan chan at mcs.anl.gov
Mon Apr 26 17:45:16 CDT 2010


Hi Brian,

Thanks for looking into the bug.  Your patch isn't ideal because the bug
affects not only MPI_Comm_create/MPI_Comm_dup/MPI_Comm_free but also
MPI_Intercomm_create.  Instead of ignoring MPI_Pcontrol in the communicator
functions, as your patch does, my patch turns off as much MPE logging as
possible in all the affected communicator functions, following the intent of
MPI_Pcontrol(0).  I've committed my fix to svn.  Could you try again and see
whether the latest log_mpi_core.c works with your code?
  
https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk/src/mpe2/src/wrappers/src/log_mpi_core.c
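
To illustrate the intent (this is not the committed code), here is a minimal,
self-contained sketch of the pattern: communicator bookkeeping always runs,
while event logging honours the MPI_Pcontrol setting.  Only is_mpilog_on is a
name from log_mpi_core.c; every sketch_* function, the integer communicator
IDs, and the counters are hypothetical stand-ins for the real MPI and MPE
calls.

```c
#include <assert.h>

/* Hypothetical stand-ins; none of this is MPE2's real data structure. */
enum { SKETCH_MAX_COMMS = 16 };

static int is_mpilog_on = 1;                  /* flag name taken from the thread */
static int sketch_tracked[SKETCH_MAX_COMMS];  /* communicators known to the logger */
static int sketch_n_tracked = 0;
static int sketch_n_events = 0;               /* events actually written to the log */

/* Mimics MPI_Pcontrol(level): 0 turns logging off, nonzero turns it on. */
static void sketch_pcontrol(int level)
{
    is_mpilog_on = (level != 0);
}

/* Returns 1 if the logger has bookkeeping for comm_id, 0 otherwise. */
static int sketch_is_tracked(int comm_id)
{
    int i;
    for (i = 0; i < sketch_n_tracked; i++)
        if (sketch_tracked[i] == comm_id)
            return 1;
    return 0;
}

/* Wrapper for a communicator-creating call. */
static int sketch_comm_dup(int new_comm_id)
{
    assert(new_comm_id >= 0);
    if (sketch_n_tracked >= SKETCH_MAX_COMMS)
        return -1;
    sketch_tracked[sketch_n_tracked++] = new_comm_id;  /* bookkeeping: always */
    if (is_mpilog_on)
        sketch_n_events++;                             /* logging: only if on */
    return 0;
}

/* Wrapper for an ordinary call that uses a communicator. */
static int sketch_send(int comm_id)
{
    if (!is_mpilog_on)
        return 0;                      /* nothing to log */
    if (!sketch_is_tracked(comm_id))
        return -1;                     /* logger has never seen this communicator */
    sketch_n_events++;
    return 0;
}
```

In this sketch, a communicator created while logging is off can still be used
in logged calls once logging is turned back on, because the bookkeeping step
was never skipped.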

Thanks,
A.Chan

----- "Brian Wainscott" <brian at lstc.com> wrote:

> Anthony,
> 
> I'm pretty sure logging was off when MPI_Comm_create was called, as well as
> MPI_Comm_dup and MPI_Comm_free.  In any case, I found something that works.
> Please let me know what you think of it, and adapt it to the svn code if
> possible:
> 
> After some playing, I finally resorted to this, which is a bit brute force and
> maybe not what you'd want to do, but it works for me:  For each of the three
> routines MPI_Comm_create, MPI_Comm_dup, MPI_Comm_free, I added this line to the
> top of the routine in log_mpi_core.c:
> 
>   int savelog = is_mpilog_on; is_mpilog_on = 1;
> 
> and this line at the bottom
> 
>   is_mpilog_on = savelog;
> 
> I did that to all three routines, which I figured would fool each routine into
> acting like MPI_Pcontrol(1) was in effect.  And everything works the way I
> wanted it to.
> 
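
For readers following along, the save/restore idea described above can be
sketched in isolation as below.  This is a simplified model under stated
assumptions, not the text of log_mpi_core.c: is_mpilog_on and savelog are the
names used in the patch, while sketch_log_event, sketch_wrapped_comm_call,
sketch_check, and the event counter are hypothetical.

```c
#include <assert.h>

static int is_mpilog_on = 0;          /* 0 as if MPI_Pcontrol(0) were in effect */
static int sketch_events_logged = 0;  /* hypothetical count of logged events */

/* Stand-in for a logging macro that checks the global flag. */
static void sketch_log_event(void)
{
    if (is_mpilog_on)
        sketch_events_logged++;
}

/* Stand-in for one wrapped communicator routine. */
static int sketch_wrapped_comm_call(void)
{
    int savelog = is_mpilog_on;  /* remember the caller's setting */
    int returnVal = 0;

    is_mpilog_on = 1;            /* force logging on for this routine only */
    sketch_log_event();          /* state begin */
    sketch_log_event();          /* state end */

    is_mpilog_on = savelog;      /* restore before returning */
    return returnVal;
}

/* Small self-check showing the effect. */
static void sketch_check(void)
{
    is_mpilog_on = 0;
    sketch_events_logged = 0;

    sketch_log_event();                 /* ordinary call: not logged */
    assert(sketch_events_logged == 0);

    sketch_wrapped_comm_call();         /* wrapped call: logged anyway */
    assert(sketch_events_logged == 2);
    assert(is_mpilog_on == 0);          /* caller's setting is intact */
}
```

Two general caveats with this pattern: the flag is a plain global, so another
thread could observe the forced value while the routine runs, and every return
path has to restore the saved value.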
> So, please accept this (or something like it) as a patch to MPE2.
> 
> For reference, here is a full "diff -u" between the 2.1.1 version of
> log_mpi_core.c and my working version:
> 
> 
> --- log_mpi_core.c.old  2010-04-01 09:34:10.000000000 -0700
> +++ log_mpi_core.c      2010-04-26 10:25:09.000000000 -0700
> @@ -2360,6 +2360,7 @@
>    MPE_LOG_STATE_DECL
>    MPE_LOG_COMM_DECL
>    MPE_LOG_THREADSTM_DECL
> +  int savelog = is_mpilog_on; is_mpilog_on = 1;
> 
>  /*
>      MPI_Comm_create - prototyping replacement for MPI_Comm_create
> @@ -2387,6 +2388,7 @@
>    MPE_LOG_STATE_END(comm,NULL)
>    MPE_LOG_THREAD_UNLOCK
> 
> +  is_mpilog_on = savelog;
>    return returnVal;
>  }
> 
> @@ -2398,6 +2400,7 @@
>    MPE_LOG_STATE_DECL
>    MPE_LOG_COMM_DECL
>    MPE_LOG_THREADSTM_DECL
> +  int savelog = is_mpilog_on; is_mpilog_on = 1;
> 
>  /*
>      MPI_Comm_dup - prototyping replacement for MPI_Comm_dup
> @@ -2425,6 +2428,7 @@
>    MPE_LOG_STATE_END(comm,NULL)
>    MPE_LOG_THREAD_UNLOCK
>                                                                       
> +  is_mpilog_on = savelog;
>    return returnVal;
>  }
> 
> @@ -2435,6 +2439,7 @@
>    MPE_LOG_STATE_DECL
>    MPE_LOG_COMM_DECL
>    MPE_LOG_THREADSTM_DECL
> +  int savelog = is_mpilog_on; is_mpilog_on = 1;
> 
>  /*
>      MPI_Comm_free - prototyping replacement for MPI_Comm_free
> @@ -2464,6 +2469,7 @@
>    MPE_LOG_STATE_END(*comm,NULL)
>    MPE_LOG_THREAD_UNLOCK
> 
> +  is_mpilog_on = savelog;
>    return returnVal;
>  }
> 
> 
> Brian
> 
> ------ "Anthony Chan" <chan at mcs.anl.gov> wrote:
> > I assume you are getting a segfault when MPI_Comm_dup wasn't logged.
> > Was MPI_Comm_free() of the dup'ed communicator not being logged as well?
> > 
> > ----- "Brian Wainscott" <brian at lstc.com> wrote:
> > 
> >> > Hi Chan,
> >> > 
> >> > I got your changes to log_mpi_core.c, and things are better....but I think not
> >> > quite right.  Now the code is blowing up when I call MPI_COMM_FREE and logging is
> >> > disabled.  In this case, the communicator being freed was created via COMM_DUP,
> >> > in case that makes any difference.  I looked through log_mpi_core, and COMM_DUP
> >> > seems to be treated like COMM_CREATE as far as I can see.  On the other hand, it
> >> > is likely this is just the first communicator I'm freeing so how it was created
> >> > may not matter.
> >> > 
> >> > I rebuilt MPE2 with debugging enabled, and got this for my traceback:
> >> > 
> >> > 
> >> > #0  0x0000000004052373 in CLOG_Buffer_save_header (buffer=0xcb81de0,
> >> >     commIDs=0xe9000898, thd=0, rectype=9) at clog_buffer.c:630
> >> > #1  0x0000000004052b90 in CLOG_Buffer_save_commevt (buffer=0xcb81de0,
> >> >     commIDs=0xe9000898, thd=0, etype=10, guid=0x44a28a0 "", icomm=-999999999,
> >> >     comm_rank=-1, world_rank=-1) at clog_buffer.c:900
> >> > #2  0x000000000404c070 in MPE_Log_commIDs_nullcomm (commIDs=0xe9000898,
> >> >     local_thread=0, comm_etype=10) at mpe_log.c:224
> >> > #3  0x00000000040140a2 in MPI_Comm_free (comm=0x7fffe9000848) at
> >> >     log_mpi_core.c:2477
> >> > 
> >> > 
> >> > The problem seems to be that CLOG_Buffer_save_header has these lines:
> >> > 
> >> >     hdr->icomm       = commIDs->local_ID;
> >> >     hdr->rank        = commIDs->comm_rank;
> >> > 
> >> > but commIDs is not a valid memory address.  It is never properly set in the macro
> >> > MPE_LOG_INTRACOMM -- in fact, it looks as though it is known to be logging an
> >> > action for MPI_COMM_NULL (based on the name of the function used,
> >> > MPE_Log_commIDs_nullcomm), but it is still trying to dereference this thing.
> >> > 
> >> > I hope that makes sense to you...
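
One defensive way to handle the case described above is sketched below, purely
as an illustration.  The sentinel values -999999999 and -1 are the ones visible
in the traceback; the Sketch* types and sketch_save_header are hypothetical and
are not CLOG's real definitions.  Note that the pointer in the traceback is
non-NULL garbage, so a check like this only helps if the pointer is
initialized to NULL where it is declared.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced versions of the structures involved. */
typedef struct { int local_ID; int comm_rank; } SketchCommIDs;
typedef struct { int icomm;    int rank;      } SketchHeader;

/* Sentinels matching the values shown in the traceback for a null communicator. */
enum { SKETCH_NULL_ICOMM = -999999999, SKETCH_NULL_RANK = -1 };

/* Fill the header, falling back to sentinels when there is no communicator record. */
static void sketch_save_header(SketchHeader *hdr, const SketchCommIDs *commIDs)
{
    assert(hdr != NULL);
    if (commIDs == NULL) {
        /* MPI_COMM_NULL case: nothing to dereference */
        hdr->icomm = SKETCH_NULL_ICOMM;
        hdr->rank  = SKETCH_NULL_RANK;
        return;
    }
    hdr->icomm = commIDs->local_ID;
    hdr->rank  = commIDs->comm_rank;
}
```

A caller would declare its record pointer as `const SketchCommIDs *ids = NULL;`
and only assign it once a valid record has been looked up.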
> >> > 
> >> > BTW -- I'm running with 2.1.1, plus your version of log_mpi_core.c.
> >> > Should I try something newer?
> >> > 
> >> > Brian
> >> > 
> >> > 
> >>> > > Hi Brian,
> >>> > > 
> >>> > > I've modified log_mpi_core.c to address this MPI_Pcontrol issue with
> >>> > > the MPI communicator functions within MPE.  Could you recompile MPE
> >>> > > after updating your log_mpi_core.c with
> >>> > > 
> >>> > > https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk/src/mpe2/src/wrappers/src/log_mpi_core.c
> >>> > > 
> >>> > > and see if this solves your problem?
> >>> > > 
> >>> > > A.Chan
> >>> > > 
> >>> > > ----- chan at mcs.anl.gov wrote:
> >>> > > 
> >>>>> > >> > Hi Brian,
> >>>>> > >> > 
> >>>>> > >> > MPE logging needs to know that the user program makes communicator
> >>>>> > >> > creation calls, e.g. MPI_Comm_create/MPI_Comm_split/MPI_Comm_dup, ...;
> >>>>> > >> > otherwise any subsequent MPI calls that use these communicators
> >>>>> > >> > can't be logged by MPE.  There is a mechanism in MPE that bypasses
> >>>>> > >> > the actual logging but still keeps track of communicator
> >>>>> > >> > creation/destruction.  It is likely the mechanism has a bug.
> >>>>> > >> > Do you have a small program that shows your use of communicators
> >>>>> > >> > so I can make sure whatever fixes I apply will solve your
> >>>>> > >> > problem?
> >>>>> > >> > 
> >>>>> > >> > PS. Thanks for spending time to track down the problem.
> >>>>> > >> > 
> >>>>> > >> > A.Chan
> >>>>> > >> > ----- "Brian Wainscott" <brian at lstc.com> wrote:
> >>>>> > >> > 
> >>>>>>> > >>> > > I posted previously with the subject "MPE logging with OpenMPI" describing some
> >>>>>>> > >>> > > issues I was having getting MPI_Pcontrol to work.  Anthony Chan suggested I try
> >>>>>>> > >>> > > MPICH instead of OpenMPI, which I've finally had time to do.  It also
> >>>>>>> > >>> > > doesn't work.
> >>>>>>> > >>> > >
> >>>>>>> > >>> > > I looked through the source code for mpe2, and suspect I know the issue, and am
> >>>>>>> > >>> > > looking for help/confirmation/hopefully a fix or workaround:
> >>>>>>> > >>> > >
> >>>>>>> > >>> > > According to these comments in log_mpi_core.c
> >>>>>>> > >>> > > (src/mpe2/src/wrappers/src):
> >>>>>>> > >>> > >
> >>>>>>> > >>> > >  * MPI_Init checks for logging control options and environment variables,
> >>>>>>> > >>> > >  * and MPI_Pcontrol allows control over logging (allowing the user to
> >>>>>>> > >>> > >  * turn logging on and off).  Note that some routines are ALWAYS logged;
> >>>>>>> > >>> > >  * principly, these are the communicator constuction routines (needed to
> >>>>>>> > >>> > >  * avoid using the "context_id" which may not exist in some MPI
> >>>>>>> > >>> > >  * implementations).
> >>>>>>> > >>> > >
> >>>>>>> > >>> > > and this comment:
> >>>>>>> > >>> > >
> >>>>>>> > >>> > > /*
> >>>>>>> > >>> > >   level = 1 turns on tracing,
> >>>>>>> > >>> > >   level = 0 turns it off.
> >>>>>>> > >>> > >
> >>>>>>> > >>> > >   Still to do: in some cases, must log communicator operations even if
> >>>>>>> > >>> > >   logging is off.
> >>>>>>> > >>> > >  */
> >>>>>>> > >>> > > int MPI_Pcontrol( const int level, ... )
> >>>>>>> > >>> > >
> >>>>>>> > >>> > > I suspect the problem is related to a conflict with MPI_Pcontrol and certain
> >>>>>>> > >>> > > communicator construction operations?
> >>>>>>> > >>> > >
> >>>>>>> > >>> > > I tried modifying the problem I am running, in such a way that it should not
> >>>>>>> > >>> > > create many (any?) communicators after initialization, and then everything
> >>>>>>> > >>> > > behaves as I'd like: I can call MPI_Pcontrol(0) early on, and later call
> >>>>>>> > >>> > > MPI_Pcontrol(1) then MPI_Pcontrol(0), and get one nice window into the execution,
> >>>>>>> > >>> > > without a LOT of stuff I'm not interested in.
> >>>>>>> > >>> > >
> >>>>>>> > >>> > > With my original problem, which does create communicators, I call MPI_Pcontrol(0)
> >>>>>>> > >>> > > right after initialization, then MPI_Pcontrol(1) later, then immediately get this
> >>>>>>> > >>> > > error:
> >>>>>>> > >>> > >
> >>>>>>> > >>> > > clog_commset.c:CLOG_CommSet_get_IDs() -
> >>>>>>> > >>> > >         PMPI_Comm_get_attr() fails!
> >>>>>>> > >>> > >
> >>>>>>> > >>> > >
> >>>>>>> > >>> > >
> >>>>>>> > >>> > > I tried putting calls to MPI_Pcontrol(1) just before (and MPI_Pcontrol(0) just
> >>>>>>> > >>> > > after) every call to MPI_COMM_CREATE/MPI_COMM_DUP/MPI_COMM_FREE, but that didn't
> >>>>>>> > >>> > > work (or maybe I missed one....)  Or maybe this is a red herring, and the smaller
> >>>>>>> > >>> > > problem ran for some other unrelated reason.
> >>>>>>> > >>> > >
> >>>>>>> > >>> > > Suggestions of anything else to try?
> >>>>>>> > >>> > >
> >>>>>>> > >>> > > Does anyone know exactly WHICH calls must always be made?  It should be a simple
> >>>>>>> > >>> > > matter to ignore the "is_mpilog_on" flag for just a few calls, if that is all
> >>>>>>> > >>> > > that is needed....I just need to know WHICH ones.
> >>>>>>> > >>> > >
> >>>>>>> > >>> > > Thanks!
> >>>>>>> > >>> > >
> >>>>>>> > >>> > > Brian
> >>>>>>> > >>> > >
> >>>>>>> > >>> > > _______________________________________________
> >>>>>>> > >>> > > mpich-discuss mailing list
> >>>>>>> > >>> > > mpich-discuss at mcs.anl.gov
> >>>>>>> > >>> > > https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> >> > 
> >

