[mpich-discuss] Assertion failure in ch3_progress

Dave Goodell goodell at mcs.anl.gov
Fri Jan 30 10:57:17 CST 2009


At the bottom of [1] you will find instructions for building mpich2  
with valgrind support.  Basically do what Darius said, but you should  
probably also add the CPPFLAGS argument as shown on the wiki.  This  
should help catch additional errors.

-Dave

[1] http://wiki.mcs.anl.gov/mpich2/index.php/Support_for_Debugging_Memory_Allocation

On Jan 30, 2009, at 9:57 AM, Darius Buntinas wrote:

>
> Thanks Dorian.  I would try what Dave suggested.  We do run tests here
> to check for things like this, and haven't seen such errors, so it's
> possible that we don't have a good test for this.  If you could send us
> a short, specific example program that shows this problem, that would
> help us track it down.
>
> Such a problem may also be caused by the application corrupting MPICH2's
> address space.  One way to check this would be to run your app using
> valgrind and look for illegal memory references.  To do this, you should
> first configure MPICH2 with --enable-g=dbg,meminit (and make clean;
> make; make install), recompile your app, then run your app through
> valgrind like this (all on one line):
>
>  mpiexec -n 10 valgrind --log-file="vglog-%q{PMI_RANK}" my_app my_app_args
>
> Look at the log files (vglog-0, vglog-1, ...) and see if valgrind found
> any errors.
>
> -d
>
> On 01/30/2009 03:45 AM, Dorian Krause wrote:
>> Hi Darius,
>>
>> thanks for your answer.
>>
>> I will try to break it down to a simple example. What I can
>> say right now is that
>>
>> a) The problem depends on the communication volume (the larger
>> the communication volume, the earlier, in timesteps, the problem occurs).
>>
>> b) It only occurs when the procs are on different machines.
>>
>>
>> It would be helpful if there were a way to make sure that MPICH2
>> behaves in the same way on shared memory and distributed memory
>> machines (e.g. doesn't use IPC). Is there such a way? (I suspect
>> that the behaviour differs because of point b).)
>>
>> Thanks.
>>
>> Dorian
>>
>>
>> Darius Buntinas wrote:
>>> This means that the header of the received packet has been corrupted.
>>> It looks like it might be an internal bug.  Can you send us a short
>>> program that demonstrates this?
>>>
>>> Thanks,
>>> -d
>>>
>>> On 01/27/2009 07:25 AM, Dorian Krause wrote:
>>>
>>>> Hi List,
>>>>
>>>> I'm running an application with mpich2-1.1a2 (Intel compiler) which uses
>>>> one-sided communication to put data from a contiguous buffer on the
>>>> origin side into a strided (derived datatype) buffer on the target side.
>>>> The program runs fine with (let's say) 4 procs on a single machine but
>>>> fails with
>>>>
>>>> Assertion failed in file ch3_progress.c at line 473: pkt->type >= 0 && pkt->type < MPIDI_NEM_PKT_END
>>>> internal ABORT - process 3
>>>>
>>>>
>>>> if submitted to the cluster (I suppose it does not use nemesis in the
>>>> first case?). In ch3_progress.c I can read that "invalid pkt data will
>>>> result in unpredictable behavior".
>>>>
>>>> Can you tell me what that means? What is "pkt data" and to which input
>>>> from the application does the pkt instance correspond?
>>>>
>>>>
>>>> Thanks,
>>>> Dorian
>>>>
>>
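
For the log check Darius describes (look at vglog-0, vglog-1, ... and see
if valgrind found any errors), a minimal sketch in Python follows.  It is
not from the thread: the names count_errors and scan_logs are made up for
illustration, and it assumes each log ends with valgrind's usual
"ERROR SUMMARY: N errors from M contexts" line and that the logs sit in
the current directory.

```python
import re
from pathlib import Path

# Matches valgrind's closing summary line, for example:
# "==12345== ERROR SUMMARY: 2 errors from 1 contexts (suppressed: 0 from 0)"
# Commas are tolerated in the count in case a build prints thousands separators.
SUMMARY_RE = re.compile(r"ERROR SUMMARY:\s+([\d,]+)\s+errors?")


def count_errors(log_text):
    """Return the count from the last ERROR SUMMARY line, or None if there is none."""
    matches = SUMMARY_RE.findall(log_text)
    if not matches:
        return None
    return int(matches[-1].replace(",", ""))


def scan_logs(directory=".", pattern="vglog-*"):
    """Map each matching log file name to its error count (hypothetical helper)."""
    results = {}
    for path in sorted(Path(directory).glob(pattern)):
        results[path.name] = count_errors(path.read_text(errors="replace"))
    return results


if __name__ == "__main__":
    for name, errors in scan_logs().items():
        status = "no summary found" if errors is None else f"{errors} error(s)"
        print(f"{name}: {status}")
```

A rank whose log has no summary line (reported as "no summary found") is
worth reading by hand, as is any rank with a nonzero count; the summary
only says how many errors there were, not what they were.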


