[mpich-discuss] Problems with Pcontrol and MPE2 -- fixed, please accept this patch
Brian Wainscott
brian at lstc.com
Mon Apr 26 12:56:09 CDT 2010
Anthony,
I'm pretty sure logging was off when MPI_Comm_create was called, as well as
MPI_Comm_dup and MPI_Comm_free. In any case, I found something that works.
Please let me know what you think of it, and adapt it to the svn code if possible:
After some playing, I finally resorted to this, which is a bit brute force, and
maybe not what you'd want to do, but it works for me: For each of the three
routines MPI_Comm_create, MPI_Comm_dup, MPI_Comm_free, I added this line to the
top of the routine in log_mpi_core.c:
int savelog = is_mpilog_on; is_mpilog_on = 1;
and this line at the bottom
is_mpilog_on = savelog;
The idea is to fool each routine into acting as though MPI_Pcontrol(1) were in
effect for the duration of the call, and then put the flag back the way the
caller left it. With that change, everything works the way I wanted it to.
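To make the pattern concrete outside of MPE2, here is a minimal standalone
sketch of the save / force-on / restore idea. It is not the MPE2 source:
is_mpilog_on is only a stand-in for the real flag in log_mpi_core.c, and
events_logged, log_event and wrapped_comm_op are made-up names used purely
for illustration.

```c
#include <stdio.h>

/* Stand-in for the flag that MPI_Pcontrol toggles in log_mpi_core.c. */
static int is_mpilog_on = 0;

/* Hypothetical counter so the effect is observable; not part of MPE2. */
static int events_logged = 0;

/* Hypothetical logging hook: records an event only while logging is on. */
static void log_event(const char *name)
{
    if (is_mpilog_on) {
        events_logged++;
        printf("logged: %s\n", name);
    }
}

/* Hypothetical wrapper showing the save / force-on / restore pattern. */
static int wrapped_comm_op(const char *name)
{
    int savelog = is_mpilog_on;  /* remember the caller's setting */
    int returnVal = 0;

    is_mpilog_on = 1;            /* act as if MPI_Pcontrol(1) were in effect */
    log_event(name);             /* a real wrapper would call PMPI_Comm_* here */
    is_mpilog_on = savelog;      /* restore the caller's setting before returning */

    return returnVal;
}
```

Calling wrapped_comm_op("MPI_Comm_dup") while is_mpilog_on is 0 still records
the event and leaves the flag at 0 afterwards, which is the behavior I was
after. Note that the sketch assumes a single thread touching the flag.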
So, please accept this (or something like it) as a patch to MPE2.
For reference, here is a full "diff -u" between the 2.1.1 version of
log_mpi_core.c and my working version:
--- log_mpi_core.c.old 2010-04-01 09:34:10.000000000 -0700
+++ log_mpi_core.c 2010-04-26 10:25:09.000000000 -0700
@@ -2360,6 +2360,7 @@
MPE_LOG_STATE_DECL
MPE_LOG_COMM_DECL
MPE_LOG_THREADSTM_DECL
+ int savelog = is_mpilog_on; is_mpilog_on = 1;
/*
MPI_Comm_create - prototyping replacement for MPI_Comm_create
@@ -2387,6 +2388,7 @@
MPE_LOG_STATE_END(comm,NULL)
MPE_LOG_THREAD_UNLOCK
+ is_mpilog_on = savelog;
return returnVal;
}
@@ -2398,6 +2400,7 @@
MPE_LOG_STATE_DECL
MPE_LOG_COMM_DECL
MPE_LOG_THREADSTM_DECL
+ int savelog = is_mpilog_on; is_mpilog_on = 1;
/*
MPI_Comm_dup - prototyping replacement for MPI_Comm_dup
@@ -2425,6 +2428,7 @@
MPE_LOG_STATE_END(comm,NULL)
MPE_LOG_THREAD_UNLOCK
+ is_mpilog_on = savelog;
return returnVal;
}
@@ -2435,6 +2439,7 @@
MPE_LOG_STATE_DECL
MPE_LOG_COMM_DECL
MPE_LOG_THREADSTM_DECL
+ int savelog = is_mpilog_on; is_mpilog_on = 1;
/*
MPI_Comm_free - prototyping replacement for MPI_Comm_free
@@ -2464,6 +2469,7 @@
MPE_LOG_STATE_END(*comm,NULL)
MPE_LOG_THREAD_UNLOCK
+ is_mpilog_on = savelog;
return returnVal;
}
Brian
------ "Anthony Chan" <chan at mcs.anl.gov> wrote:
> I assume you are getting the segfault when MPI_Comm_dup wasn't logged.
> Was MPI_Comm_free() of the dup'ed communicator not being logged as well?
>
> ----- "Brian Wainscott" <brian at lstc.com> wrote:
>
>> > Hi Chan,
>> >
>> > I got your changes to log_mpi_core.c, and things are better....but I think
>> > not quite right. Now the code is blowing up when I call MPI_COMM_FREE and
>> > logging is disabled. In this case, the communicator being freed was created
>> > via COMM_DUP, in case that makes any difference. I looked through
>> > log_mpi_core, and COMM_DUP seems to be treated like COMM_CREATE as far as I
>> > can see. On the other hand, it is likely this is just the first
>> > communicator I'm freeing so how it was created may not matter.
>> >
>> > I rebuilt MPE2 with debugging enabled, and got this for my traceback:
>> >
>> >
>> > #0 0x0000000004052373 in CLOG_Buffer_save_header (buffer=0xcb81de0, commIDs=0xe9000898, thd=0, rectype=9) at clog_buffer.c:630
>> > #1 0x0000000004052b90 in CLOG_Buffer_save_commevt (buffer=0xcb81de0, commIDs=0xe9000898, thd=0, etype=10, guid=0x44a28a0 "", icomm=-999999999, comm_rank=-1, world_rank=-1) at clog_buffer.c:900
>> > #2 0x000000000404c070 in MPE_Log_commIDs_nullcomm (commIDs=0xe9000898, local_thread=0, comm_etype=10) at mpe_log.c:224
>> > #3 0x00000000040140a2 in MPI_Comm_free (comm=0x7fffe9000848) at log_mpi_core.c:2477
>> >
>> >
>> > The problem seems to be that CLOG_Buffer_save_header has these lines:
>> >
>> > hdr->icomm = commIDs->local_ID;
>> > hdr->rank = commIDs->comm_rank;
>> >
>> > but commIDs is not a valid memory address. It is never properly set in the
>> > macro MPE_LOG_INTRACOMM -- in fact, it looks as though it is known to be
>> > logging an action for MPI_COMM_NULL (based on the name of the function
>> > used, MPE_Log_commIDs_nullcomm), but it is still trying to dereference
>> > this thing.
>> >
>> > I hope that makes sense to you...
>> >
>> > BTW -- I'm running with 2.1.1, plus your version of log_mpi_core.c.
>> > Should I try something newer?
>> >
>> > Brian
>> >
>> >
>>> > > Hi Brian,
>>> > >
>>> > > I've modified log_mpi_core.c to address this MPI_Pcontrol handling of
>>> > > MPI communicator functions within MPE. Could you recompile MPE after
>>> > > updating your log_mpi_core.c with
>>> > >
>>> > > https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk/src/mpe2/src/wrappers/src/log_mpi_core.c
>>> > >
>>> > > and see if this solves your problem.
>>> > >
>>> > > A.Chan
>>> > >
>>> > > ----- chan at mcs.anl.gov wrote:
>>> > >
>>>>> > >> > Hi Brian,
>>>>> > >> >
>>>>> > >> > MPE logging needs to know that the user program makes communicator
>>>>> > >> > creation calls, e.g. MPI_Comm_create/MPI_Comm_split/MPI_Comm_dup,
>>>>> > >> > otherwise any subsequent MPI calls that use these communicators
>>>>> > >> > can't be logged by MPE. There is a mechanism in MPE that bypasses
>>>>> > >> > the actual logging but still keeps track of communicator
>>>>> > >> > creation/destruction. It is likely the mechanism has a bug.
>>>>> > >> > Do you have a small program that shows your use of communicators
>>>>> > >> > so I can make sure whatever fixes I apply will solve your
>>>>> > >> > problem?
>>>>> > >> >
>>>>> > >> > PS. Thanks for spending time to track down the problem.
>>>>> > >> >
>>>>> > >> > A.Chan
>>>>> > >> > ----- "Brian Wainscott" <brian at lstc.com> wrote:
>>>>> > >> >
>>>>>>> > >>> > > I posted previously with the subject "MPE logging with OpenMPI"
>>>>>>> > >>> > > describing some issues I was having getting MPI_Pcontrol to work.
>>>>>>> > >>> > > Anthony Chan suggested I try MPICH instead of OpenMPI, which I've
>>>>>>> > >>> > > finally had time to do. It also doesn't work.
>>>>>>> > >>> > >
>>>>>>> > >>> > > I looked through the source code for mpe2, and suspect I know the
>>>>>>> > >>> > > issue, and am looking for help/confirmation/hopefully a fix or
>>>>>>> > >>> > > workaround:
>>>>>>> > >>> > >
>>>>>>> > >>> > > According to these comments in log_mpi_core.c
>>>>>>> > >>> > > (src/mpe2/src/wrappers/src):
>>>>>>> > >>> > >
>>>>>>> > >>> > > * MPI_Init checks for logging control options and environment variables,
>>>>>>> > >>> > > * and MPI_Pcontrol allows control over logging (allowing the user to
>>>>>>> > >>> > > * turn logging on and off). Note that some routines are ALWAYS logged;
>>>>>>> > >>> > > * principly, these are the communicator constuction routines (needed to
>>>>>>> > >>> > > * avoid using the "context_id" which may not exist in some MPI
>>>>>>> > >>> > > * implementations).
>>>>>>> > >>> > >
>>>>>>> > >>> > > and this comment:
>>>>>>> > >>> > >
>>>>>>> > >>> > > /*
>>>>>>> > >>> > > level = 1 turns on tracing,
>>>>>>> > >>> > > level = 0 turns it off.
>>>>>>> > >>> > >
>>>>>>> > >>> > > Still to do: in some cases, must log communicator operations even if
>>>>>>> > >>> > > logging is off.
>>>>>>> > >>> > > */
>>>>>>> > >>> > > int MPI_Pcontrol( const int level, ... )
>>>>>>> > >>> > >
>>>>>>> > >>> > > I suspect the problem is related to a conflict with MPI_Pcontrol and
>>>>>>> > >>> > > certain communicator construction operations?
>>>>>>> > >>> > >
>>>>>>> > >>> > > I tried modifying the problem I am running, in such a way that it
>>>>>>> > >>> > > should not create many (any?) communicators after initialization, and
>>>>>>> > >>> > > then everything behaves as I'd like: I can call MPI_Pcontrol(0) early
>>>>>>> > >>> > > on, and later call MPI_Pcontrol(1) then MPI_Pcontrol(0), and get one
>>>>>>> > >>> > > nice window into the execution, without a LOT of stuff I'm not
>>>>>>> > >>> > > interested in.
>>>>>>> > >>> > >
>>>>>>> > >>> > > With my original problem, which does create communicators, I call
>>>>>>> > >>> > > MPI_Pcontrol(0) right after initialization, then MPI_Pcontrol(1)
>>>>>>> > >>> > > later, then immediately get this error:
>>>>>>> > >>> > >
>>>>>>> > >>> > > clog_commset.c:CLOG_CommSet_get_IDs() -
>>>>>>> > >>> > > PMPI_Comm_get_attr() fails!
>>>>>>> > >>> > >
>>>>>>> > >>> > >
>>>>>>> > >>> > >
>>>>>>> > >>> > > I tried putting calls to MPI_Pcontrol(1) just before (and
>>>>>>> > >>> > > MPI_Pcontrol(0) just after) every call to
>>>>>>> > >>> > > MPI_COMM_CREATE/MPI_COMM_DUP/MPI_COMM_FREE, but that didn't work (or
>>>>>>> > >>> > > maybe I missed one....) Or maybe this is a red herring, and the
>>>>>>> > >>> > > smaller problem ran for some other unrelated reason.
>>>>>>> > >>> > >
>>>>>>> > >>> > > Suggestions of anything else to try?
>>>>>>> > >>> > >
>>>>>>> > >>> > > Does anyone know exactly WHICH calls must always be made? It should
>>>>>>> > >>> > > be a simple matter to ignore the "is_mpilog_on" flag for just a few
>>>>>>> > >>> > > calls, if that is all that is needed....I just need to know WHICH
>>>>>>> > >>> > > ones.
>>>>>>> > >>> > >
>>>>>>> > >>> > > Thanks!
>>>>>>> > >>> > >
>>>>>>> > >>> > > Brian
>>>>>>> > >>> > >
>>>>>>> > >>> > > _______________________________________________
>>>>>>> > >>> > > mpich-discuss mailing list
>>>>>>> > >>> > > mpich-discuss at mcs.anl.gov
>>>>>>> > >>> > > https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>> >
>