[mpich-discuss] Problems with Pcontrol and MPE2 -- fixed, please accept this patch

Brian Wainscott brian at lstc.com
Mon Apr 26 12:56:09 CDT 2010


Anthony,

I'm pretty sure logging was off when MPI_Comm_create was called, as well as
MPI_Comm_dup and MPI_Comm_free.  In any case, I found something that works.
Please let me know what you think of it, and adapt it to the svn code if possible:

After some experimenting, I resorted to a brute-force change.  It may not be
what you'd want to ship, but it works for me.  In each of the three routines
MPI_Comm_create, MPI_Comm_dup, and MPI_Comm_free, I added this line at the
top of the routine in log_mpi_core.c:

  int savelog = is_mpilog_on; is_mpilog_on = 1;

and this line at the bottom, just before the return:

  is_mpilog_on = savelog;

The idea is to make each routine behave as though MPI_Pcontrol(1) were in
effect for the duration of the call, and then restore the caller's setting.
With that change, everything works the way I wanted it to.
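
To illustrate the save/restore idea outside of MPE, here is a minimal
standalone sketch.  Only the flag name is_mpilog_on comes from
log_mpi_core.c; wrapped_comm_op and tracked_ops are hypothetical stand-ins I
made up for illustration, and the "if" test is only an assumption about how
the real macros consult the flag:

```c
/* Stand-in for the global flag that MPI_Pcontrol toggles. */
int is_mpilog_on = 1;

/* Hypothetical counter: how many communicator operations were tracked. */
int tracked_ops = 0;

/* Hypothetical stand-in for a wrapper such as MPI_Comm_dup. */
int wrapped_comm_op(void)
{
    int savelog = is_mpilog_on;  /* remember the caller's setting */
    is_mpilog_on = 1;            /* force tracking for this call  */

    if (is_mpilog_on)            /* assumed: bookkeeping is gated on the flag */
        tracked_ops++;

    is_mpilog_on = savelog;      /* restore before returning */
    return 0;
}
```

If is_mpilog_on is set to 0 first (as MPI_Pcontrol(0) would do), a call to
wrapped_comm_op still bumps tracked_ops, and the flag is back at 0 afterwards.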

So, please accept this (or something like it) as a patch to MPE2.

For reference, here is a full "diff -u" between the 2.1.1 version of
log_mpi_core.c and my working version:


--- log_mpi_core.c.old  2010-04-01 09:34:10.000000000 -0700
+++ log_mpi_core.c      2010-04-26 10:25:09.000000000 -0700
@@ -2360,6 +2360,7 @@
   MPE_LOG_STATE_DECL
   MPE_LOG_COMM_DECL
   MPE_LOG_THREADSTM_DECL
+  int savelog = is_mpilog_on; is_mpilog_on = 1;

 /*
     MPI_Comm_create - prototyping replacement for MPI_Comm_create
@@ -2387,6 +2388,7 @@
   MPE_LOG_STATE_END(comm,NULL)
   MPE_LOG_THREAD_UNLOCK

+  is_mpilog_on = savelog;
   return returnVal;
 }

@@ -2398,6 +2400,7 @@
   MPE_LOG_STATE_DECL
   MPE_LOG_COMM_DECL
   MPE_LOG_THREADSTM_DECL
+  int savelog = is_mpilog_on; is_mpilog_on = 1;

 /*
     MPI_Comm_dup - prototyping replacement for MPI_Comm_dup
@@ -2425,6 +2428,7 @@
   MPE_LOG_STATE_END(comm,NULL)
   MPE_LOG_THREAD_UNLOCK
+  is_mpilog_on = savelog;
   return returnVal;
 }

@@ -2435,6 +2439,7 @@
   MPE_LOG_STATE_DECL
   MPE_LOG_COMM_DECL
   MPE_LOG_THREADSTM_DECL
+  int savelog = is_mpilog_on; is_mpilog_on = 1;

 /*
     MPI_Comm_free - prototyping replacement for MPI_Comm_free
@@ -2464,6 +2469,7 @@
   MPE_LOG_STATE_END(*comm,NULL)
   MPE_LOG_THREAD_UNLOCK

+  is_mpilog_on = savelog;
   return returnVal;
 }
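
One way to picture the crash in the quoted thread below: the logging code
looks up bookkeeping for a communicator and then dereferences the result even
when there is nothing valid to find.  The following is a toy standalone
sketch of a defensive lookup, not the CLOG/MPE code.  The names comm_ids_t,
registry_add, registry_find, and log_free_event are hypothetical; only the
field names local_ID and comm_rank are borrowed from the traceback.

```c
#include <stddef.h>

#define MAX_COMMS 8

/* Hypothetical per-communicator record. */
typedef struct {
    int handle;      /* toy communicator handle */
    int local_ID;
    int comm_rank;
} comm_ids_t;

comm_ids_t registry[MAX_COMMS];
int registry_len = 0;

/* Record a communicator; returns 0 on success, -1 if the table is full. */
int registry_add(int handle, int comm_rank)
{
    if (registry_len >= MAX_COMMS)
        return -1;
    registry[registry_len].handle = handle;
    registry[registry_len].local_ID = registry_len;
    registry[registry_len].comm_rank = comm_rank;
    registry_len++;
    return 0;
}

/* Look up a communicator; returns NULL if it was never recorded. */
const comm_ids_t *registry_find(int handle)
{
    int i;
    for (i = 0; i < registry_len; i++)
        if (registry[i].handle == handle)
            return &registry[i];
    return NULL;
}

/* Log a "free" event without dereferencing a missing record.
   Returns the local_ID that would be logged, or -1 if unknown. */
int log_free_event(int handle)
{
    const comm_ids_t *ids = registry_find(handle);
    if (ids == NULL)
        return -1;   /* creation was never recorded, so skip logging */
    return ids->local_ID;
}
```

My patch takes the other route: it makes sure the creation is always
recorded, so the lookup never comes back empty in the first place.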


Brian

------ "Anthony Chan" <chan at mcs.anl.gov> wrote:
> I assume you are getting a segfault when MPI_Comm_dup wasn't logged; 
> was MPI_Comm_free() of the dup'ed communicator not being logged as well?
> 
> ----- "Brian Wainscott" <brian at lstc.com> wrote:
> 
>> > Hi Chan,
>> > 
>> > I got your changes to log_mpi_core.c, and things are better....but I
>> > think not quite right.  Now the code is blowing up when I call
>> > MPI_COMM_FREE and logging is disabled.  In this case, the communicator
>> > being freed was created via COMM_DUP, in case that makes any
>> > difference.  I looked through log_mpi_core, and COMM_DUP seems to be
>> > treated like COMM_CREATE as far as I can see.  On the other hand, it
>> > is likely this is just the first communicator I'm freeing so how it
>> > was created may not matter.
>> > 
>> > I rebuilt MPE2 with debugging enabled, and got this for my traceback:
>> > 
>> > 
>> > #0  0x0000000004052373 in CLOG_Buffer_save_header (buffer=0xcb81de0,
>> >     commIDs=0xe9000898, thd=0, rectype=9) at clog_buffer.c:630
>> > #1  0x0000000004052b90 in CLOG_Buffer_save_commevt (buffer=0xcb81de0,
>> >     commIDs=0xe9000898, thd=0, etype=10, guid=0x44a28a0 "",
>> >     icomm=-999999999, comm_rank=-1, world_rank=-1) at clog_buffer.c:900
>> > #2  0x000000000404c070 in MPE_Log_commIDs_nullcomm (commIDs=0xe9000898,
>> >     local_thread=0, comm_etype=10) at mpe_log.c:224
>> > #3  0x00000000040140a2 in MPI_Comm_free (comm=0x7fffe9000848)
>> >     at log_mpi_core.c:2477
>> > 
>> > 
>> > The problem seems to be that CLOG_Buffer_save_header has these lines:
>> > 
>> >     hdr->icomm       = commIDs->local_ID;
>> >     hdr->rank        = commIDs->comm_rank;
>> > 
>> > but commIDs is not a valid memory address.  It is never properly set
>> > in the macro MPE_LOG_INTRACOMM -- in fact, it looks as though it is
>> > known to be logging an action for MPI_COMM_NULL (based on the name of
>> > the function used, MPE_Log_commIDs_nullcomm), but it is still trying
>> > to dereference this thing.
>> > 
>> > I hope that makes sense to you...
>> > 
>> > BTW -- I'm running with 2.1.1, plus your version of log_mpi_core.c.
>> > Should I try something newer?
>> > 
>> > Brian
>> > 
>> > 
>>> > > Hi Brian,
>>> > > 
>>> > > I've modified log_mpi_core.c to address this MPI_Pcontrol of MPI
>>> > > communicator function within MPE.  Could you recompile MPE by
>>> > > updating your log_mpi_core.c with
>>> > > 
>>> > > https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk/src/mpe2/src/wrappers/src/log_mpi_core.c
>>> > > 
>>> > > and see if this solves your problem.
>>> > > 
>>> > > A.Chan
>>> > > 
>>> > > ----- chan at mcs.anl.gov wrote:
>>> > > 
>>>>> > >> > Hi Brian,
>>>>> > >> > 
>>>>> > >> > MPE logging needs to know that the user program makes
>>>>> > >> > communicator creation calls, e.g.
>>>>> > >> > MPI_Comm_create/MPI_Comm_split/MPI_Comm_dup,...; otherwise
>>>>> > >> > any subsequent MPI calls that use these communicators can't
>>>>> > >> > be logged by MPE.  There is a mechanism in MPE that bypasses
>>>>> > >> > the actual logging but still keeps track of communicator
>>>>> > >> > creation/destruction.  It is likely the mechanism has a bug.
>>>>> > >> > Do you have a small program that shows your use of
>>>>> > >> > communicators, so I can make sure whatever fixes I apply
>>>>> > >> > will solve your problem?
>>>>> > >> > 
>>>>> > >> > PS. Thanks for spending time to track down the problem.
>>>>> > >> > 
>>>>> > >> > A.Chan
>>>>> > >> > ----- "Brian Wainscott" <brian at lstc.com> wrote:
>>>>> > >> > 
>>>>>>> > >>> > > I posted previously with the subject "MPE logging with
>>>>>>> > >>> > > OpenMPI" describing some issues I was having getting
>>>>>>> > >>> > > MPI_Pcontrol to work.  Anthony Chan suggested I try
>>>>>>> > >>> > > MPICH instead of OpenMPI, which I've finally had time
>>>>>>> > >>> > > to do.  It also doesn't work.
>>>>>>> > >>> > >
>>>>>>> > >>> > > I looked through the source code for mpe2, and suspect
>>>>>>> > >>> > > I know the issue, and am looking for
>>>>>>> > >>> > > help/confirmation/hopefully a fix or workaround:
>>>>>>> > >>> > >
>>>>>>> > >>> > > According to these comments in log_mpi_core.c
>>>>>>> > >>> > > (src/mpe2/src/wrappers/src):
>>>>>>> > >>> > >
>>>>>>> > >>> > >  * MPI_Init checks for logging control options and environment variables,
>>>>>>> > >>> > >  * and MPI_Pcontrol allows control over logging (allowing the user to
>>>>>>> > >>> > >  * turn logging on and off).  Note that some routines are ALWAYS logged;
>>>>>>> > >>> > >  * principly, these are the communicator constuction routines (needed to
>>>>>>> > >>> > >  * avoid using the "context_id" which may not exist in some MPI
>>>>>>> > >>> > >  * implementations).
>>>>>>> > >>> > >
>>>>>>> > >>> > > and this comment:
>>>>>>> > >>> > >
>>>>>>> > >>> > > /*
>>>>>>> > >>> > >   level = 1 turns on tracing,
>>>>>>> > >>> > >   level = 0 turns it off.
>>>>>>> > >>> > >
>>>>>>> > >>> > >   Still to do: in some cases, must log communicator operations even if
>>>>>>> > >>> > >   logging is off.
>>>>>>> > >>> > >  */
>>>>>>> > >>> > > int MPI_Pcontrol( const int level, ... )
>>>>>>> > >>> > >
>>>>>>> > >>> > > I suspect the problem is related to a conflict between
>>>>>>> > >>> > > MPI_Pcontrol and certain communicator construction
>>>>>>> > >>> > > operations?
>>>>>>> > >>> > >
>>>>>>> > >>> > > I tried modifying the problem I am running, in such a
>>>>>>> > >>> > > way that it should not create many (any?) communicators
>>>>>>> > >>> > > after initialization, and then everything behaves as
>>>>>>> > >>> > > I'd like: I can call MPI_Pcontrol(0) early on, and
>>>>>>> > >>> > > later call MPI_Pcontrol(1) then MPI_Pcontrol(0), and
>>>>>>> > >>> > > get one nice window into the execution, without a LOT
>>>>>>> > >>> > > of stuff I'm not interested in.
>>>>>>> > >>> > >
>>>>>>> > >>> > > With my original problem, which does create
>>>>>>> > >>> > > communicators, I call MPI_Pcontrol(0) right after
>>>>>>> > >>> > > initialization, then MPI_Pcontrol(1) later, then
>>>>>>> > >>> > > immediately get this error:
>>>>>>> > >>> > >
>>>>>>> > >>> > > clog_commset.c:CLOG_CommSet_get_IDs() -
>>>>>>> > >>> > >         PMPI_Comm_get_attr() fails!
>>>>>>> > >>> > >
>>>>>>> > >>> > >
>>>>>>> > >>> > >
>>>>>>> > >>> > > I tried putting calls to MPI_Pcontrol(1) just before
>>>>>>> > >>> > > (and MPI_Pcontrol(0) just after) every call to
>>>>>>> > >>> > > MPI_COMM_CREATE/MPI_COMM_DUP/MPI_COMM_FREE, but that
>>>>>>> > >>> > > didn't work (or maybe I missed one....)  Or maybe this
>>>>>>> > >>> > > is a red herring, and the smaller problem ran for some
>>>>>>> > >>> > > other unrelated reason.
>>>>>>> > >>> > >
>>>>>>> > >>> > > Suggestions of anything else to try?
>>>>>>> > >>> > >
>>>>>>> > >>> > > Does anyone know exactly WHICH calls must always be
>>>>>>> > >>> > > made?  It should be a simple matter to ignore the
>>>>>>> > >>> > > "is_mpilog_on" flag for just a few calls, if that is
>>>>>>> > >>> > > all that is needed....I just need to know WHICH ones.
>>>>>>> > >>> > >
>>>>>>> > >>> > > Thanks!
>>>>>>> > >>> > >
>>>>>>> > >>> > > Brian
>>>>>>> > >>> > >
>>>>>>> > >>> > > _______________________________________________
>>>>>>> > >>> > > mpich-discuss mailing list
>>>>>>> > >>> > > mpich-discuss at mcs.anl.gov
>>>>>>> > >>> > > https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>> > 
> 


