[mpich-discuss] Problems with Pcontrol and MPE2 -- fixed, please accept this patch

Brian Wainscott brian at lstc.com
Mon Apr 26 19:27:58 CDT 2010


Hi Anthony,

On 04/26/2010 03:45 PM, Anthony Chan wrote:
> 
> Hi Brian,
> 
> Thanks for looking into the bug.  Your patch isn't ideal because the bug
> affects not only comm_create/comm_dup/comm_free but also intercomm_create.

Not surprising -- I didn't think it was really what SHOULD be done, because I
don't really know all the details of how MPE works....

> Instead of ignoring the MPI_Pcontrol for the communicator functions as shown
> in your patch, my patch turns off as much MPE logging as possible for all the
> affected communicator functions following the intent of MPI_Pcontrol(0).  
> I've committed my fix to svn, could you try it again to see if the latest
> log_mpi_core.c works on your code ?
>   
> https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk/src/mpe2/src/wrappers/src/log_mpi_core.c

Thanks....but no, it still blows up:

#0  0x00000000040183db in CLOG_Buffer_save_header (buffer=0xd6fee30, commIDs=0x0,
    thd=0, rectype=9) at clog_buffer.c:630
#1  0x0000000004018bf8 in CLOG_Buffer_save_commevt (buffer=0xd6fee30, commIDs=0x0,
    thd=0, etype=100, guid=0x4468060 "", icomm=-999999999, comm_rank=-1,
    world_rank=-1) at clog_buffer.c:900
#2  0x00000000040120d8 in MPE_Log_commIDs_nullcomm (commIDs=0x0, local_thread=0,
    comm_etype=100) at mpe_log.c:224
#3  0x0000000003fd8685 in MPI_Comm_create (comm=0xd5faec0, group=0xdf11130,
    comm_out=0x7fff1c627ab0) at log_mpi_core.c:2390
#4  0x0000000003fcc382 in mpi_comm_create_ (comm=0x8b9ad1c, group=0x7e75368,
    comm_out=0x8b6ba2c, __ierr=0x7e75360) at mpe_proff.c:644

The indicated line reads:

         hdr->icomm       = commIDs->local_ID;

and commIDs == 0x0.
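
To make the failure concrete, here is a minimal standalone sketch of the kind of guard I would have expected around that dereference. This is NOT the CLOG code: the type and function names (sketch_commIDs_t, sketch_hdr_t, sketch_save_header) are made up for illustration, and I am only assuming that the placeholder values seen in the traceback (icomm=-999999999, comm_rank=-1) would be acceptable fallbacks for a null communicator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real CLOG structures have more fields. */
typedef struct { int local_ID; int comm_rank; } sketch_commIDs_t;
typedef struct { int icomm;    int rank;      } sketch_hdr_t;

/* Placeholder value taken from the traceback (icomm=-999999999). */
#define SKETCH_NULL_ICOMM (-999999999)

/* Fill in the header, but fall back to placeholder values instead of
   dereferencing commIDs when it is NULL. */
static void sketch_save_header(sketch_hdr_t *hdr, const sketch_commIDs_t *commIDs)
{
    assert(hdr != NULL);
    if (commIDs != NULL) {
        hdr->icomm = commIDs->local_ID;
        hdr->rank  = commIDs->comm_rank;
    } else {
        hdr->icomm = SKETCH_NULL_ICOMM;
        hdr->rank  = -1;
    }
}
```

With a guard like this, sketch_save_header(&hdr, NULL) fills in the placeholders rather than crashing. Whether the real check belongs in CLOG_Buffer_save_header or in one of its callers, I can't say.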

If I have time tomorrow, I'll try to throw together a small program that fails,
which I can then send you.  In the meantime, I'll keep my version of the program
just in case.
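
For anyone following along, the save/restore trick in my version (see the diff quoted below) boils down to the pattern in this standalone sketch. Only the flag name is_mpilog_on comes from log_mpi_core.c; the counter and the two functions are hypothetical stand-ins for the MPE macros and the MPI_Comm_* wrappers, so treat it as an illustration of the idea rather than the actual code.

```c
/* Standalone copy of the flag; in MPE it lives in log_mpi_core.c. */
static int is_mpilog_on = 0;

/* Hypothetical counter standing in for MPE's communicator bookkeeping. */
static int sketch_comms_tracked = 0;

/* Hypothetical stand-in for the logging macros: acts only when the flag is on. */
static void sketch_track_comm(void)
{
    if (is_mpilog_on)
        sketch_comms_tracked++;
}

/* Hypothetical stand-in for a wrapper such as MPI_Comm_dup. */
static int sketch_comm_wrapper(void)
{
    int savelog = is_mpilog_on;  /* remember what MPI_Pcontrol last set */
    is_mpilog_on = 1;            /* act as if MPI_Pcontrol(1) were in effect */

    sketch_track_comm();         /* the bookkeeping now always runs */

    is_mpilog_on = savelog;      /* put the user's setting back */
    return 0;
}
```

The point is that the bookkeeping runs on every call while the flag looks untouched to the rest of the program. The obvious downside is that these calls also end up in the log even when logging is supposed to be off.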

BTW -- I tried adding this line:

  MPE_LOG_COMM_CHECK(*comm)

in MPI_Comm_create, just like you had added in MPI_Comm_free, but it didn't help.....

> 
> Thanks,
> A.Chan
> 
> ----- "Brian Wainscott" <brian at lstc.com> wrote:
> 
>> Anthony,
>>
>> I'm pretty sure logging was off when MPI_Comm_create was called, as well as
>> MPI_Comm_dup and MPI_Comm_free.  In any case, I found something that works.
>> Please let me know what you think of it, and adapt it to the svn code if
>> possible:
>>
>> After some playing, I finally resorted to this, which is a bit brute force,
>> and maybe not what you'd want to do, but it works for me:  For each of the
>> three routines MPI_Comm_create, MPI_Comm_dup, MPI_Comm_free, I added this
>> line to the top of the routine in log_mpi_core.c:
>>
>>   int savelog = is_mpilog_on; is_mpilog_on = 1;
>>
>> and this line at the bottom:
>>
>>   is_mpilog_on = savelog;
>>
>> I did that to all three routines, which I figured would fool each routine
>> into acting like MPI_Pcontrol(1) was in effect.  And everything works the way
>> I wanted it to.
>>
>> So, please accept this (or something like it) as a patch to MPE2.
>>
>> For reference, here is a full "diff -u" between the 2.1.1 version of
>> log_mpi_core.c and my working version:
>>
>>
>> --- log_mpi_core.c.old  2010-04-01 09:34:10.000000000 -0700
>> +++ log_mpi_core.c      2010-04-26 10:25:09.000000000 -0700
>> @@ -2360,6 +2360,7 @@
>>    MPE_LOG_STATE_DECL
>>    MPE_LOG_COMM_DECL
>>    MPE_LOG_THREADSTM_DECL
>> +  int savelog = is_mpilog_on; is_mpilog_on = 1;
>>
>>  /*
>>      MPI_Comm_create - prototyping replacement for MPI_Comm_create
>> @@ -2387,6 +2388,7 @@
>>    MPE_LOG_STATE_END(comm,NULL)
>>    MPE_LOG_THREAD_UNLOCK
>>
>> +  is_mpilog_on = savelog;
>>    return returnVal;
>>  }
>>
>> @@ -2398,6 +2400,7 @@
>>    MPE_LOG_STATE_DECL
>>    MPE_LOG_COMM_DECL
>>    MPE_LOG_THREADSTM_DECL
>> +  int savelog = is_mpilog_on; is_mpilog_on = 1;
>>
>>  /*
>>      MPI_Comm_dup - prototyping replacement for MPI_Comm_dup
>> @@ -2425,6 +2428,7 @@
>>    MPE_LOG_STATE_END(comm,NULL)
>>    MPE_LOG_THREAD_UNLOCK
>>
>> +  is_mpilog_on = savelog;
>>    return returnVal;
>>  }
>>
>> @@ -2435,6 +2439,7 @@
>>    MPE_LOG_STATE_DECL
>>    MPE_LOG_COMM_DECL
>>    MPE_LOG_THREADSTM_DECL
>> +  int savelog = is_mpilog_on; is_mpilog_on = 1;
>>
>>  /*
>>      MPI_Comm_free - prototyping replacement for MPI_Comm_free
>> @@ -2464,6 +2469,7 @@
>>    MPE_LOG_STATE_END(*comm,NULL)
>>    MPE_LOG_THREAD_UNLOCK
>>
>> +  is_mpilog_on = savelog;
>>    return returnVal;
>>  }
>>
>>
>> Brian
>>
>> ------ "Anthony Chan" <chan at mcs.anl.gov> wrote:
>>> I assume you are getting a segfault when MPI_Comm_dup wasn't logged;
>>> was MPI_Comm_free() of the dup'ed communicator not being logged as well?
>>>
>>> ----- "Brian Wainscott" <brian at lstc.com> wrote:
>>>
>>>>> Hi Chan,
>>>>>
>>>>> I got your changes to log_mpi_core.c, and things are better....but I think
>>>>> not quite right.  Now the code is blowing up when I call MPI_COMM_FREE and
>>>>> logging is disabled.  In this case, the communicator being freed was
>>>>> created via COMM_DUP, in case that makes any difference.  I looked through
>>>>> log_mpi_core, and COMM_DUP seems to be treated like COMM_CREATE as far as
>>>>> I can see.  On the other hand, it is likely this is just the first
>>>>> communicator I'm freeing so how it was created may not matter.
>>>>>
>>>>> I rebuilt MPE2 with debugging enabled, and got this for my traceback:
>>>>>
>>>>>
>>>>> #0  0x0000000004052373 in CLOG_Buffer_save_header (buffer=0xcb81de0,
>>>>>     commIDs=0xe9000898, thd=0, rectype=9) at clog_buffer.c:630
>>>>> #1  0x0000000004052b90 in CLOG_Buffer_save_commevt (buffer=0xcb81de0,
>>>>>     commIDs=0xe9000898, thd=0, etype=10, guid=0x44a28a0 "",
>>>>>     icomm=-999999999, comm_rank=-1, world_rank=-1) at clog_buffer.c:900
>>>>> #2  0x000000000404c070 in MPE_Log_commIDs_nullcomm (commIDs=0xe9000898,
>>>>>     local_thread=0, comm_etype=10) at mpe_log.c:224
>>>>> #3  0x00000000040140a2 in MPI_Comm_free (comm=0x7fffe9000848) at
>>>>>     log_mpi_core.c:2477
>>>>>
>>>>>
>>>>> The problem seems to be that CLOG_Buffer_save_header has these lines:
>>>>>
>>>>>     hdr->icomm       = commIDs->local_ID;
>>>>>     hdr->rank        = commIDs->comm_rank;
>>>>>
>>>>> but commIDs is not a valid memory address.  It is never properly set in
>>>>> the macro MPE_LOG_INTRACOMM -- in fact, it looks as though it is known to
>>>>> be logging an action for MPI_COMM_NULL (based on the name of the function
>>>>> used, MPE_Log_commIDs_nullcomm), but it is still trying to dereference
>>>>> this thing.
>>>>>
>>>>> I hope that makes sense to you...
>>>>>
>>>>> BTW -- I'm running with 2.1.1, plus your version of log_mpi_core.c.
>>>>> Should I try something newer?
>>>>>
>>>>> Brian
>>>>>
>>>>>
>>>>>>> Hi Brian,
>>>>>>>
>>>>>>> I've modified log_mpi_core.c to address this MPI_Pcontrol issue with
>>>>>>> the MPI communicator functions within MPE.  Could you recompile MPE
>>>>>>> after updating your log_mpi_core.c with
>>>>>>>
>>>>>>> https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk/src/mpe2/src/wrappers/src/log_mpi_core.c
>>>>>>>
>>>>>>> and see if this solves your problem.
>>>>>>>
>>>>>>> A.Chan
>>>>>>>
>>>>>>> ----- chan at mcs.anl.gov wrote:
>>>>>>>
>>>>>>>>>>> Hi Brian,
>>>>>>>>>>>
>>>>>>>>>>> MPE logging needs to know that the user program makes communicator
>>>>>>>>>>> creation calls, e.g. MPI_Comm_create/MPI_Comm_split/MPI_Comm_dup,....
>>>>>>>>>>> otherwise any subsequent MPI calls that use these communicators
>>>>>>>>>>> can't be logged by MPE.  There is a mechanism in MPE that bypasses
>>>>>>>>>>> the actual logging but still keeps track of communicator
>>>>>>>>>>> creation/destruction.  It is likely the mechanism has a bug.
>>>>>>>>>>> Do you have a small program that shows your use of communicators
>>>>>>>>>>> so I can make sure whatever fixes I apply will solve your
>>>>>>>>>>> problem?
>>>>>>>>>>>
>>>>>>>>>>> PS. Thanks for spending time to track down the problem.
>>>>>>>>>>>
>>>>>>>>>>> A.Chan
>>>>>>>>>>> ----- "Brian Wainscott" <brian at lstc.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>>>> I posted previously with the subject "MPE logging with OpenMPI"
>>>>>>>>>>>>>>> describing some issues I was having getting MPI_Pcontrol to work.
>>>>>>>>>>>>>>> Anthony Chan suggested I try MPICH instead of OpenMPI, which I've
>>>>>>>>>>>>>>> finally had time to do.  It also doesn't work.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I looked through the source code for mpe2, and suspect I know the
>>>>>>>>>>>>>>> issue, and am looking for help/confirmation/hopefully a fix or
>>>>>>>>>>>>>>> workaround:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> According to these comments in log_mpi_core.c
>>>>>>>>>>>>>>> (src/mpe2/src/wrappers/src):
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  * MPI_Init checks for logging control options and environment variables,
>>>>>>>>>>>>>>>  * and MPI_Pcontrol allows control over logging (allowing the user to
>>>>>>>>>>>>>>>  * turn logging on and off).  Note that some routines are ALWAYS logged;
>>>>>>>>>>>>>>>  * principly, these are the communicator constuction routines (needed to
>>>>>>>>>>>>>>>  * avoid using the "context_id" which may not exist in some MPI
>>>>>>>>>>>>>>>  * implementations).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> and this comment:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> /*
>>>>>>>>>>>>>>>   level = 1 turns on tracing,
>>>>>>>>>>>>>>>   level = 0 turns it off.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   Still to do: in some cases, must log communicator operations even if
>>>>>>>>>>>>>>>   logging is off.
>>>>>>>>>>>>>>>  */
>>>>>>>>>>>>>>> int MPI_Pcontrol( const int level, ... )
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I suspect the problem is related to a conflict between MPI_Pcontrol
>>>>>>>>>>>>>>> and certain communicator construction operations?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I tried modifying the problem I am running, in such a way that it
>>>>>>>>>>>>>>> should not create many (any?) communicators after initialization,
>>>>>>>>>>>>>>> and then everything behaves as I'd like: I can call MPI_Pcontrol(0)
>>>>>>>>>>>>>>> early on, and later call MPI_Pcontrol(1) then MPI_Pcontrol(0), and
>>>>>>>>>>>>>>> get one nice window into the execution, without a LOT of stuff I'm
>>>>>>>>>>>>>>> not interested in.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> With my original problem, which does create communicators, I call
>>>>>>>>>>>>>>> MPI_Pcontrol(0) right after initialization, then MPI_Pcontrol(1)
>>>>>>>>>>>>>>> later, then immediately get this error:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> clog_commset.c:CLOG_CommSet_get_IDs() -
>>>>>>>>>>>>>>>         PMPI_Comm_get_attr() fails!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I tried putting calls to MPI_Pcontrol(1) just before (and
>>>>>>>>>>>>>>> MPI_Pcontrol(0) just after) every call to
>>>>>>>>>>>>>>> MPI_COMM_CREATE/MPI_COMM_DUP/MPI_COMM_FREE, but that didn't work
>>>>>>>>>>>>>>> (or maybe I missed one....)  Or maybe this is a red herring, and
>>>>>>>>>>>>>>> the smaller problem ran for some other unrelated reason.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Suggestions of anything else to try?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Does anyone know exactly WHICH calls must always be made?  It
>>>>>>>>>>>>>>> should be a simple matter to ignore the "is_mpilog_on" flag for
>>>>>>>>>>>>>>> just a few calls, if that is all that is needed....I just need to
>>>>>>>>>>>>>>> know WHICH ones.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>> mpich-discuss mailing list
>>>>>>>>>>>>>>> mpich-discuss at mcs.anl.gov
>>>>>>>>>>>>>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>>>>
>>>

