[mpich-discuss] Problems with MPI_Pcontrol and MPE2
Brian Wainscott
brian at lstc.com
Fri Apr 23 14:51:31 CDT 2010
Hi Chan,
I got your changes to log_mpi_core.c, and things are better, but I think not
quite right. Now the code blows up when I call MPI_COMM_FREE while logging is
disabled. In this case, the communicator being freed was created via COMM_DUP,
in case that makes any difference. I looked through log_mpi_core.c, and COMM_DUP
seems to be treated like COMM_CREATE as far as I can see. On the other hand,
this is likely just the first communicator I'm freeing, so how it was created
may not matter.
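To make the scenario concrete, here is a toy model in plain C of the behavior discussed in this thread: communicator creation and destruction are always tracked, while log records are written only when logging is enabled. This is a sketch under stated assumptions, not MPE's code. Every name in it (tracker_t, tracker_add, tracker_remove, logging_on, events_logged) is hypothetical, and the integer handles merely stand in for MPI communicators.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names come from MPE. */
#define MAX_COMMS 16

typedef struct {
    int  handles[MAX_COMMS]; /* application-side communicator handles */
    bool in_use[MAX_COMMS];  /* which slots currently hold a handle   */
    int  events_logged;      /* records actually written to the log   */
    bool logging_on;         /* what MPI_Pcontrol(0/1) would toggle   */
} tracker_t;

void tracker_init(tracker_t *t)
{
    assert(t != NULL);
    *t = (tracker_t){0};     /* empty table, logging off */
}

/* Would be called from the create/dup/split wrappers.
 * The handle is ALWAYS recorded; a log record is written only
 * when logging is on. Returns the slot index, or -1 if full. */
int tracker_add(tracker_t *t, int handle)
{
    assert(t != NULL);
    for (int i = 0; i < MAX_COMMS; i++) {
        if (!t->in_use[i]) {
            t->in_use[i]  = true;
            t->handles[i] = handle;
            if (t->logging_on)
                t->events_logged++;
            return i;
        }
    }
    return -1;
}

/* Would be called from the free wrapper.
 * Returns the slot the handle occupied, or -1 if the handle
 * was never tracked (the case that must not be dereferenced). */
int tracker_remove(tracker_t *t, int handle)
{
    assert(t != NULL);
    for (int i = 0; i < MAX_COMMS; i++) {
        if (t->in_use[i] && t->handles[i] == handle) {
            t->in_use[i] = false;
            if (t->logging_on)
                t->events_logged++;
            return i;
        }
    }
    return -1;
}
```

In this model, a communicator duplicated while logging is off can still be found when it is freed, and a handle that was never tracked is reported as -1 instead of being used.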
I rebuilt MPE2 with debugging enabled, and got this for my traceback:
#0 0x0000000004052373 in CLOG_Buffer_save_header (buffer=0xcb81de0,
    commIDs=0xe9000898, thd=0, rectype=9) at clog_buffer.c:630
#1 0x0000000004052b90 in CLOG_Buffer_save_commevt (buffer=0xcb81de0,
    commIDs=0xe9000898, thd=0, etype=10, guid=0x44a28a0 "", icomm=-999999999,
    comm_rank=-1, world_rank=-1) at clog_buffer.c:900
#2 0x000000000404c070 in MPE_Log_commIDs_nullcomm (commIDs=0xe9000898,
    local_thread=0, comm_etype=10) at mpe_log.c:224
#3 0x00000000040140a2 in MPI_Comm_free (comm=0x7fffe9000848)
    at log_mpi_core.c:2477
The problem seems to be that CLOG_Buffer_save_header has these lines:
hdr->icomm = commIDs->local_ID;
hdr->rank = commIDs->comm_rank;
but commIDs is not a valid memory address. It is never properly set in the macro
MPE_LOG_INTRACOMM. In fact, the code appears to know that it is logging an
action for MPI_COMM_NULL (judging by the name of the function used,
MPE_Log_commIDs_nullcomm), yet it still tries to dereference the pointer.
I hope that makes sense to you...
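A minimal sketch of the kind of guard that would avoid this dereference, in plain C. It is not the actual CLOG code: the struct types, save_header, and the NULL_ICOMM/NULL_RANK names are hypothetical, and only the field names local_ID and comm_rank and the placeholder values -999999999 and -1 are taken from the lines and traceback above. It also assumes the caller initializes commIDs to NULL before the macro runs, since a NULL check cannot detect an uninitialized pointer such as the one in the traceback.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the CLOG structures.
 * Only local_ID and comm_rank are names quoted in this thread. */
typedef struct {
    int local_ID;
    int comm_rank;
} comm_ids_t;

typedef struct {
    int icomm;
    int rank;
} rec_header_t;

/* Placeholder values seen in frame #1 of the traceback
 * (icomm=-999999999, comm_rank=-1); the macro names are invented. */
#define NULL_ICOMM (-999999999)
#define NULL_RANK  (-1)

/* Fill the record header. If no communicator IDs are available
 * (the MPI_COMM_NULL case), use the placeholders instead of
 * dereferencing the pointer. */
void save_header(rec_header_t *hdr, const comm_ids_t *commIDs)
{
    assert(hdr != NULL);
    if (commIDs != NULL) {
        hdr->icomm = commIDs->local_ID;
        hdr->rank  = commIDs->comm_rank;
    } else {
        hdr->icomm = NULL_ICOMM;
        hdr->rank  = NULL_RANK;
    }
}
```

Whether the real fix belongs in the header-saving routine or in the macro that should set commIDs is a design choice this sketch does not settle.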
BTW -- I'm running with 2.1.1, plus your version of log_mpi_core.c. Should I try
something newer?
Brian
> Hi Brian,
>
> I've modified log_mpi_core.c to address this problem with MPI_Pcontrol
> and the MPI communicator functions within MPE. Could you recompile MPE
> after updating your log_mpi_core.c with
>
> https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk/src/mpe2/src/wrappers/src/log_mpi_core.c
>
> and see if this solves your problem.
>
> A.Chan
>
> ----- chan at mcs.anl.gov wrote:
>
>> > Hi Brian,
>> >
>> > MPE logging needs to know that the user program makes communicator
>> > creation calls, e.g. MPI_Comm_create/MPI_Comm_split/MPI_Comm_dup,
>> > otherwise any subsequent MPI calls that use these communicators
>> > can't be logged by MPE. There is a mechanism in MPE that bypasses
>> > the actual logging but still keeps track of communicator
>> > creation/destruction. It is likely that this mechanism has a bug.
>> > Do you have a small program that shows your use of communicators,
>> > so I can make sure whatever fixes I apply will solve your
>> > problem?
>> >
>> > PS. Thanks for spending time to track down the problem.
>> >
>> > A.Chan
>> > ----- "Brian Wainscott" <brian at lstc.com> wrote:
>> >
>>> > > I posted previously with the subject "MPE logging with OpenMPI",
>>> > > describing some issues I was having getting MPI_Pcontrol to work.
>>> > > Anthony Chan suggested I try MPICH instead of OpenMPI, which I've
>>> > > finally had time to do. It also doesn't work.
>>> > >
>>> > > I looked through the source code for mpe2, and suspect I know the
>>> > > issue, and am looking for help/confirmation/hopefully a fix or
>>> > > workaround:
>>> > >
>>> > > According to these comments in log_mpi_core.c
>>> > > (src/mpe2/src/wrappers/src):
>>> > >
>>> > > * MPI_Init checks for logging control options and environment variables,
>>> > > * and MPI_Pcontrol allows control over logging (allowing the user to
>>> > > * turn logging on and off). Note that some routines are ALWAYS logged;
>>> > > * principly, these are the communicator constuction routines (needed to
>>> > > * avoid using the "context_id" which may not exist in some MPI
>>> > > * implementations).
>>> > >
>>> > > and this comment:
>>> > >
>>> > > /*
>>> > > level = 1 turns on tracing,
>>> > > level = 0 turns it off.
>>> > >
>>> > > Still to do: in some cases, must log communicator operations even
>>> > > if logging is off.
>>> > > */
>>> > > int MPI_Pcontrol( const int level, ... )
>>> > >
>>> > > I suspect the problem is related to a conflict between MPI_Pcontrol
>>> > > and certain communicator construction operations?
>>> > >
>>> > > I tried modifying the problem I am running so that it should not
>>> > > create many (any?) communicators after initialization, and then
>>> > > everything behaves as I'd like: I can call MPI_Pcontrol(0) early on,
>>> > > and later call MPI_Pcontrol(1) then MPI_Pcontrol(0), and get one nice
>>> > > window into the execution, without a LOT of stuff I'm not interested in.
>>> > >
>>> > > With my original problem, which does create communicators, I call
>>> > > MPI_Pcontrol(0) right after initialization, then MPI_Pcontrol(1)
>>> > > later, then immediately get this error:
>>> > >
>>> > > clog_commset.c:CLOG_CommSet_get_IDs() -
>>> > > PMPI_Comm_get_attr() fails!
>>> > >
>>> > >
>>> > >
>>> > > I tried putting calls to MPI_Pcontrol(1) just before (and
>>> > > MPI_Pcontrol(0) just after) every call to
>>> > > MPI_COMM_CREATE/MPI_COMM_DUP/MPI_COMM_FREE, but that didn't work
>>> > > (or maybe I missed one....) Or maybe this is a red herring, and the
>>> > > smaller problem ran for some other unrelated reason.
>>> > >
>>> > > Suggestions of anything else to try?
>>> > >
>>> > > Does anyone know exactly WHICH calls must always be made? It should
>>> > > be a simple matter to ignore the "is_mpilog_on" flag for just a few
>>> > > calls, if that is all that is needed....I just need to know WHICH ones.
>>> > >
>>> > > Thanks!
>>> > >
>>> > > Brian
>>> > >
>>> > > _______________________________________________
>>> > > mpich-discuss mailing list
>>> > > mpich-discuss at mcs.anl.gov
>>> > > https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss