[mpich-discuss] Assertion failure in ch3_progress

Darius Buntinas buntinas at mcs.anl.gov
Fri Jan 30 09:57:29 CST 2009


Thanks Dorian.  I would try what Dave suggested.  We do run tests here
to check for things like this and haven't seen such errors, so it's
possible that we don't have a good test for this case.  If you could
send us a short, specific example program that demonstrates the
problem, that would help us track it down.

Such a problem may also be caused by the application corrupting MPICH2's
address space.  One way to check this is to run your app under valgrind
and look for illegal memory references.  To do this, first configure
and rebuild MPICH2 with memory debugging enabled, for example:
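
  ./configure --enable-g=dbg,meminit --prefix=/path/to/install
  make clean
  make
  make install

(Add --enable-g=dbg,meminit to whatever configure options you normally
use; the --prefix above is just a placeholder.)  Then recompile your
app against the rebuilt library and run it through valgrind like this
(all on one line):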

  mpiexec -n 10 valgrind --log-file="vglog-%q{PMI_RANK}" my_app my_app_args

Look at the log files (vglog-0, vglog-1, ...) and see if valgrind found
any errors.

-d

On 01/30/2009 03:45 AM, Dorian Krause wrote:
> Hi Darius,
> 
> thanks for your answer.
> 
> I will try to break it down to a simple example. What I can
> say right now is that
> 
> a) The problem depends on the communication volume (the larger the
> communication volume, the earlier in timesteps the problem occurs).
> 
> b) It only occurs when the procs are on different machines.
> 
> 
> It would be helpful if there were a way to make sure that MPICH2
> behaves the same way on shared-memory and distributed-memory machines
> (e.g. doesn't use IPC). Is there such a way? (I suspect the behaviour
> differs because of point b.)
> 
> Thanks.
> 
> Dorian
> 
> 
> Darius Buntinas wrote:
>> This means that the header of the received packet has been corrupted.
>> It looks like it might be an internal bug.  Can you send us a short
>> program that demonstrates this?
>>
>> Thanks,
>> -d
>>
>> On 01/27/2009 07:25 AM, Dorian Krause wrote:
>>  
>>> Hi List,
>>>
>>> I'm running an application with mpich2-1.1a2 (Intel compiler) which
>>> uses one-sided communication to put data from a contiguous buffer on
>>> the origin side into a strided (derived datatype) buffer on the
>>> target side.  The program runs fine with (let's say) 4 procs on a
>>> single machine but fails with
>>>
>>> Assertion failed in file ch3_progress.c at line 473: pkt->type >= 0 &&
>>> pkt->type < MPIDI_NEM_PKT_END
>>> internal ABORT - process 3
>>>
>>>
>>> if submitted to the cluster (I suppose it does not use nemesis in the
>>> first case?).  In ch3_progress.c there is a comment that "invalid pkt
>>> data will result in unpredictable behavior".
>>>
>>> Can you tell me what that means? What is "pkt data", and to which
>>> input from the application does the pkt instance correspond?
>>>
>>>
>>> Thanks,
>>> Dorian
>>>     
> 

