[mpich-discuss] Assertion failure in ch3_progress

Dorian Krause ddkrause at uni-bonn.de
Sun Feb 1 16:16:31 CST 2009


Hi,


Following your suggestions, I installed a debug build of MPICH2 and ran
the program with MPICH_NOLOCAL=1 under valgrind. The error output after
timestep 999 is:


Fatal error in MPI_Win_fence: Other MPI error, error stack:
MPI_Win_fence(123).......................: MPI_Win_fence(assert=0, win=0xa0000000) failed
MPIDI_Win_fence(313).....................: Detected an error while in progress wait for RMA messages
MPIDI_CH3I_Progress(136).................:
MPID_nem_mpich2_blocking_recv(1101)......:
MPID_nem_newtcp_module_poll(49)..........:
MPID_nem_newtcp_module_connpoll(1619)....:
state_commrdy_handler(1492)..............:
MPID_nem_newtcp_module_recv_handler(1420): read from socket failed - Invalid argument
rank 1 in job 12  m02_60428   caused collective abort of all ranks
  exit status of rank 1: killed by signal 9

I attached the valgrind output. Apart from some conditional-jump issues
and one invalid read (which, after investigation, I suspect is a false
positive), valgrind found some syscall issues which I am unable to
interpret; maybe you can see more...

Recently I ran the program on an SMP machine with a larger input set
(larger communication volume).
As a result I got a deadlock in the sequence

MPI_Barrier
MPI_Win_fence
MPI_Barrier

because 3 of the 4 procs waited in the second barrier while the
remaining proc got stuck in the fence.
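
To illustrate, the pattern around the hang looks roughly like this (a
minimal sketch only; win, comm, buf, n and peer are placeholders for the
application's actual window and buffers, and the opening fence/put merely
indicate where the RMA traffic presumably comes from):

    MPI_Win_fence(0, win);   /* opens the epoch (placement assumed)   */
    MPI_Put(buf, n, MPI_DOUBLE, peer, 0, n, MPI_DOUBLE, win);
    MPI_Barrier(comm);       /* all 4 procs pass this barrier         */
    MPI_Win_fence(0, win);   /* 1 proc never returns from this fence  */
    MPI_Barrier(comm);       /* the other 3 procs block here          */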


Additionally, it might be of interest to you that I recently posted a
test example to the mvapich2 mailing list showing a deadlock in
MPI_Win_fence (the example was extracted from the same application).
Since the error there is completely different, however, I suspect it is
a separate problem with the rdma channel in mvapich ...


I will now start to construct a (hopefully small) test example.
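
For reference, the kind of pattern such a test would exercise looks
roughly like the sketch below (the sizes, stride and target rank are
made up for illustration, not taken from the application; run with at
least 2 procs):

#include <mpi.h>

/* Sketch only: put from a contiguous origin buffer into a strided
 * (derived datatype) target buffer inside a fence epoch. */
int main(int argc, char **argv)
{
    double origin[64], target[64 * 4];
    MPI_Datatype strided;
    MPI_Win win;
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < 64; i++)
        origin[i] = (double)i;

    MPI_Win_create(target, sizeof(target), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* one double every 4 doubles on the target side */
    MPI_Type_vector(64, 1, 4, MPI_DOUBLE, &strided);
    MPI_Type_commit(&strided);

    MPI_Win_fence(0, win);
    if (rank == 0)    /* contiguous source -> strided destination */
        MPI_Put(origin, 64, MPI_DOUBLE, 1, 0, 1, strided, win);
    MPI_Win_fence(0, win);    /* the failures show up in a fence */

    MPI_Type_free(&strided);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}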

Thanks + Regards,
Dorian





Dave Goodell wrote:
> At the bottom of [1] you will find instructions for building mpich2 
> with valgrind support.  Basically do what Darius said, but you should 
> probably also add the CPPFLAGS argument as shown on the wiki.  This 
> should help catch additional errors.
>
> -Dave
>
> [1] 
> http://wiki.mcs.anl.gov/mpich2/index.php/Support_for_Debugging_Memory_Allocation 
>
>
> On Jan 30, 2009, at 9:57 AM, Darius Buntinas wrote:
>
>>
>> Thanks Dorian.  I would try what Dave suggested.  We do run tests here
>> to check for things like this, and haven't seen such errors, so it's
>> possible that we don't have a good test for this.  If you could send us
>> a specific (short) example program that shows this problem, that
>> would help us track this down.
>>
>> Such a problem may also be caused by the application corrupting MPICH2's
>> address space.  One way to check this would be to run your app using
>> valgrind and look for illegal memory references.  To do this you should
>> first configure MPICH2 with --enable-g=dbg,meminit (and make clean;
>> make; make install), recompile your app, then run your app through
>> valgrind like this (all on one line):
>>
>>  mpiexec -n 10 valgrind --log-file="vglog-%q{PMI_RANK}" my_app
>> my_app_args
>>
>> Look at the log files (vglog-0, vglog-1, ...) and see if valgrind found
>> any errors.
>>
>> -d
>>
>> On 01/30/2009 03:45 AM, Dorian Krause wrote:
>>> Hi Darius,
>>>
>>> thanks for your answer.
>>>
>>> I will try to break it down to a simple example. What I can
>>> say right now is that
>>>
>>> a) The problem depends on the communication volume (the larger
>>> the communication volume, the earlier (in timesteps) the problem
>>> occurs).
>>>
>>> b) It only occurs when the procs are on different machines.
>>>
>>>
>>> It would be helpful if there were a way to make sure that MPICH2
>>> behaves in the same way on shared-memory and distributed-memory
>>> machines (e.g. doesn't use IPC). Is there such a way? (I suspect
>>> that the behaviour differs because of point b) above.)
>>>
>>> Thanks.
>>>
>>> Dorian
>>>
>>>
>>> Darius Buntinas wrote:
>>>> This means that the header of the received packet has been corrupted.
>>>> It looks like it might be an internal bug.  Can you send us a short
>>>> program that demonstrates this?
>>>>
>>>> Thanks,
>>>> -d
>>>>
>>>> On 01/27/2009 07:25 AM, Dorian Krause wrote:
>>>>
>>>>> Hi List,
>>>>>
>>>>> I'm running an application with mpich2-1.1a2 (intel compiler) which
>>>>> uses one-sided communication to put data from a contiguous buffer on
>>>>> the origin side into a strided (derived datatype) buffer on the
>>>>> target side. The program runs fine with (let's say) 4 procs on a
>>>>> single machine but fails with
>>>>>
>>>>> Assertion failed in file ch3_progress.c at line 473: pkt->type >= 0 && pkt->type < MPIDI_NEM_PKT_END
>>>>> internal ABORT - process 3
>>>>>
>>>>>
>>>>> when submitted to the cluster (I suppose it does not use nemesis in
>>>>> the first case?). In ch3_progress.c I can read that "invalid pkt
>>>>> data will result in unpredictable behavior".
>>>>>
>>>>> Can you tell me what that means? What is "pkt data", and to which
>>>>> input from the application does the pkt instance correspond?
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Dorian
>>>>>
>>>

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: vglog-0.20891
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20090201/08571551/attachment-0002.diff>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: vglog-1.20890
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20090201/08571551/attachment-0003.diff>

