[mpich-discuss] Assertion failure in ch3_progress
Dave Goodell
goodell at mcs.anl.gov
Mon Feb 2 10:20:26 CST 2009
Looking at the valgrind output and your stdout, there does seem to be
some sort of problem with the MPI_Win_fence call, although it's tough
to say exactly what is causing it without a test application that we
can really dig into. Hopefully you will be able to isolate the problem
into an example.
-Dave
On Feb 1, 2009, at 4:16 PM, Dorian Krause wrote:
> following your suggestions, I installed a debug mpich2 version and ran
> the program with MPICH_NOLOCAL=1 under valgrind. The error output
> after timestep 999 is:
>
>
> Fatal error in MPI_Win_fence: Other MPI error, error stack:
> MPI_Win_fence(123).......................: MPI_Win_fence(assert=0,
> win=0xa0000000) failed
> MPIDI_Win_fence(313).....................: Detected an error while
> in progress wait for RMA messages
> MPIDI_CH3I_Progress(136).................:
> MPID_nem_mpich2_blocking_recv(1101)......:
> MPID_nem_newtcp_module_poll(49)..........:
> MPID_nem_newtcp_module_connpoll(1619)....:
> state_commrdy_handler(1492)..............:
> MPID_nem_newtcp_module_recv_handler(1420): read from socket failed -
> Invalid argument
> rank 1 in job 12 m02_60428 caused collective abort of all ranks
> exit status of rank 1: killed by signal 9
>
> I attached the valgrind output. Apart from some conditional-jump
> issues and one invalid read (which I suspect to be a false positive
> after investigation), valgrind found some syscall issues which I'm
> unable to interpret; maybe you can see more...
>
> Recently I ran the program on an SMP machine with a larger input set
> (larger communication volume).
> As a result I got a deadlock in the sequence
>
> MPI_Barrier
> MPI_Win_fence
> MPI_Barrier
>
> because 3 of the 4 procs waited in the second barrier while 1 proc
> got stuck in the fence.
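>
> In C, the per-timestep synchronization looks roughly like the sketch
> below (made-up names; the actual puts between the fences are left
> out):
>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     MPI_Win win;
>     int t, nsteps = 1000;
>
>     MPI_Init(&argc, &argv);
>     /* zero-size window, just to show the synchronization pattern */
>     MPI_Win_create(NULL, 0, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
>
>     for (t = 0; t < nsteps; t++) {
>         MPI_Barrier(MPI_COMM_WORLD);
>         MPI_Win_fence(0, win);       /* one proc got stuck here */
>         MPI_Barrier(MPI_COMM_WORLD); /* the others waited here */
>     }
>
>     MPI_Win_free(&win);
>     MPI_Finalize();
>     return 0;
> }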
>
>
> Additionally, it might be of interest to you that I recently posted
> a test example to the mvapich2 mailing list showing a deadlock in
> MPI_Win_fence (the example was extracted from the same application).
> However, since that error is completely different, I suspect it is a
> separate problem with the rdma channel in mvapich ...
>
>
> I will now start to construct a (hopefully small) test example.
>
> Thanks + Regards,
> Dorian
>
>
>
>
>
> Dave Goodell wrote:
>> At the bottom of [1] you will find instructions for building mpich2
>> with valgrind support. Basically do what Darius said, but you
>> should probably also add the CPPFLAGS argument as shown on the
>> wiki. This should help catch additional errors.
>>
>> -Dave
>>
>> [1] http://wiki.mcs.anl.gov/mpich2/index.php/Support_for_Debugging_Memory_Allocation
>>
>> On Jan 30, 2009, at 9:57 AM, Darius Buntinas wrote:
>>
>>>
>>> Thanks, Dorian. I would try what Dave suggested. We do run tests
>>> here to check for things like this and haven't seen such errors, so
>>> it's possible that we don't have a good test for this. If you could
>>> send us a specific, short example program that shows this problem,
>>> that would help us track it down.
>>>
>>> Such a problem may also be caused by the application corrupting
>>> MPICH2's address space. One way to check this would be to run your
>>> app using valgrind and look for illegal memory references. To do
>>> this you should first configure MPICH2 with --enable-g=dbg,meminit
>>> (and make clean; make; make install), recompile your app, then run
>>> your app through valgrind like this (all on one line):
>>>
>>> mpiexec -n 10 valgrind --log-file="vglog-%q{PMI_RANK}" my_app my_app_args
>>>
>>> Look at the log files (vglog-0, vglog-1, ...) and see if valgrind
>>> found any errors.
>>>
>>> -d
>>>
>>> On 01/30/2009 03:45 AM, Dorian Krause wrote:
>>>> Hi Darius,
>>>>
>>>> thanks for your answer.
>>>>
>>>> I will try to break it down into a simple example. What I can
>>>> say right now is that
>>>>
>>>> a) The problem depends on the communication volume (the larger
>>>> the communication volume, the earlier (in timesteps) the problem
>>>> occurs).
>>>>
>>>> b) It only occurs when the procs are on different machines.
>>>>
>>>>
>>>> It would be helpful if there were a way to make sure that MPICH2
>>>> behaves in the same way on shared-memory and distributed-memory
>>>> machines (e.g. doesn't use IPC). Is there such a way (I suspect
>>>> that there is a different behaviour because of point b))?
>>>>
>>>> Thanks.
>>>>
>>>> Dorian
>>>>
>>>>
>>>> Darius Buntinas wrote:
>>>>> This means that the header of the received packet has been
>>>>> corrupted. It looks like it might be an internal bug. Can you
>>>>> send us a short program that demonstrates this?
>>>>>
>>>>> Thanks,
>>>>> -d
>>>>>
>>>>> On 01/27/2009 07:25 AM, Dorian Krause wrote:
>>>>>
>>>>>> Hi List,
>>>>>>
>>>>>> I'm running an application with mpich2-1.1a2 (intel compiler)
>>>>>> which uses one-sided communication to put data from a contiguous
>>>>>> buffer on the origin side into a strided (derived datatype)
>>>>>> buffer on the target side. The program runs fine with (let's say)
>>>>>> 4 procs on a single machine but fails with
>>>>>>
>>>>>> Assertion failed in file ch3_progress.c at line 473:
>>>>>> pkt->type >= 0 && pkt->type < MPIDI_NEM_PKT_END
>>>>>> internal ABORT - process 3
>>>>>>
>>>>>>
>>>>>> if submitted to the cluster (I suppose it does not use nemesis in
>>>>>> the first case?). In ch3_progress.c I can read that "invalid pkt
>>>>>> data will result in unpredictable behavior".
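>>>>>>
>>>>>> Schematically, the communication pattern is roughly the following
>>>>>> (only a sketch; the buffer sizes, names and the vector layout are
>>>>>> made up, not the real application code):
>>>>>>
>>>>>> #include <mpi.h>
>>>>>> #include <stdlib.h>
>>>>>>
>>>>>> int main(int argc, char **argv)
>>>>>> {
>>>>>>     int rank, size, i;
>>>>>>     const int nblocks = 64, stride = 4;   /* made-up sizes */
>>>>>>     MPI_Datatype strided;
>>>>>>     MPI_Win win;
>>>>>>     double *winbuf, *sendbuf;
>>>>>>
>>>>>>     MPI_Init(&argc, &argv);
>>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>>
>>>>>>     /* strided target layout: nblocks doubles, stride apart */
>>>>>>     MPI_Type_vector(nblocks, 1, stride, MPI_DOUBLE, &strided);
>>>>>>     MPI_Type_commit(&strided);
>>>>>>
>>>>>>     MPI_Alloc_mem(nblocks * stride * sizeof(double),
>>>>>>                   MPI_INFO_NULL, &winbuf);
>>>>>>     sendbuf = malloc(nblocks * sizeof(double));
>>>>>>     for (i = 0; i < nblocks; i++)
>>>>>>         sendbuf[i] = rank + i;
>>>>>>
>>>>>>     MPI_Win_create(winbuf, nblocks * stride * sizeof(double),
>>>>>>                    sizeof(double), MPI_INFO_NULL,
>>>>>>                    MPI_COMM_WORLD, &win);
>>>>>>
>>>>>>     /* contiguous origin buffer -> strided target buffer */
>>>>>>     MPI_Win_fence(0, win);
>>>>>>     MPI_Put(sendbuf, nblocks, MPI_DOUBLE, (rank + 1) % size,
>>>>>>             0, 1, strided, win);
>>>>>>     MPI_Win_fence(0, win);
>>>>>>
>>>>>>     MPI_Win_free(&win);
>>>>>>     MPI_Free_mem(winbuf);
>>>>>>     free(sendbuf);
>>>>>>     MPI_Type_free(&strided);
>>>>>>     MPI_Finalize();
>>>>>>     return 0;
>>>>>> }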
>>>>>>
>>>>>> Can you tell me what that means? What is "pkt data", and to which
>>>>>> input from the application does the pkt instance correspond?
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Dorian
>>>>>>
>>>>
>
> <vglog-0.20891><vglog-1.20890>