[mpich-discuss] Assertion failure in ch3_progress
Dave Goodell
goodell at mcs.anl.gov
Mon Feb 2 10:20:26 CST 2009
Looking at the valgrind output and your stdout, there does seem to be
some sort of problem with the MPI_Win_fence call, although it's tough
to say exactly what is causing it without a test application that we
can really dig into. Hopefully you will be able to isolate the problem
into an example.
-Dave
On Feb 1, 2009, at 4:16 PM, Dorian Krause wrote:
> following your suggestions, I installed a debug mpich2 version and ran
> the program with MPICH_NOLOCAL=1 under valgrind. The error output
> after timestep 999 is:
>
>
> Fatal error in MPI_Win_fence: Other MPI error, error stack:
> MPI_Win_fence(123).......................: MPI_Win_fence(assert=0,
> win=0xa0000000) failed
> MPIDI_Win_fence(313).....................: Detected an error while
> in progress wait for RMA messages
> MPIDI_CH3I_Progress(136).................:
> MPID_nem_mpich2_blocking_recv(1101)......:
> MPID_nem_newtcp_module_poll(49)..........:
> MPID_nem_newtcp_module_connpoll(1619)....:
> state_commrdy_handler(1492)..............:
> MPID_nem_newtcp_module_recv_handler(1420): read from socket failed -
> Invalid argument
> rank 1 in job 12 m02_60428 caused collective abort of all ranks
> exit status of rank 1: killed by signal 9
>
> I attached the valgrind output. Apart from some conditional-jump
> issues and one invalid read (which I suspect to be a false positive
> after investigation), valgrind found some syscall issues which I'm
> unable to interpret; maybe you can see more...
>
> Recently I ran the program on an SMP machine with a larger input set
> (larger communication volume).
> As a result I got a deadlock in the sequence
>
> MPI_Barrier
> MPI_Win_fence
> MPI_Barrier
>
> because 3 of the 4 procs waited in the second barrier while 1 proc
> got stuck in the fence.
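>
> In C, the per-timestep synchronization looks roughly like the sketch
> below (made-up names; the actual puts between the fences are left
> out):
>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     MPI_Win win;
>     int t, nsteps = 1000;
>
>     MPI_Init(&argc, &argv);
>     /* zero-size window, just to show the synchronization pattern */
>     MPI_Win_create(NULL, 0, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
>
>     for (t = 0; t < nsteps; t++) {
>         MPI_Barrier(MPI_COMM_WORLD);
>         MPI_Win_fence(0, win);       /* one proc got stuck here */
>         MPI_Barrier(MPI_COMM_WORLD); /* the others waited here */
>     }
>
>     MPI_Win_free(&win);
>     MPI_Finalize();
>     return 0;
> }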
>
>
> Additionally, it might be of interest to you that I recently posted
> a test example to the mvapich2 mailing list showing a deadlock in
> MPI_Win_fence (the example was extracted from the same application).
> However, since that error is completely different, I suspect it is a
> separate problem with the rdma channel in mvapich ...
>
>
> I will now start to construct a (hopefully small) test example.
>
> Thanks + Regards,
> Dorian
>
>
>
>
>
> Dave Goodell wrote:
>> At the bottom of [1] you will find instructions for building mpich2
>> with valgrind support. Basically do what Darius said, but you
>> should probably also add the CPPFLAGS argument as shown on the
>> wiki. This should help catch additional errors.
>>
>> -Dave
>>
>> [1] http://wiki.mcs.anl.gov/mpich2/index.php/Support_for_Debugging_Memory_Allocation
>>
>> On Jan 30, 2009, at 9:57 AM, Darius Buntinas wrote:
>>
>>>
>>> Thanks, Dorian. I would try what Dave suggested. We do run tests
>>> here to check for things like this and haven't seen such errors, so
>>> it's possible that we don't have a good test for this. If you could
>>> send us a specific, short example program that shows this problem,
>>> that would help us track it down.
>>>
>>> Such a problem may also be caused by the application corrupting
>>> MPICH2's address space. One way to check this would be to run your
>>> app using valgrind and look for illegal memory references. To do
>>> this you should first configure MPICH2 with --enable-g=dbg,meminit
>>> (and make clean; make; make install), recompile your app, then run
>>> your app through valgrind like this (all on one line):
>>>
>>> mpiexec -n 10 valgrind --log-file="vglog-%q{PMI_RANK}" my_app my_app_args
>>>
>>> Look at the log files (vglog-0, vglog-1, ...) and see if valgrind
>>> found any errors.
>>>
>>> -d
>>>
>>> On 01/30/2009 03:45 AM, Dorian Krause wrote:
>>>> Hi Darius,
>>>>
>>>> thanks for your answer.
>>>>
>>>> I will try to break it down into a simple example. What I can
>>>> say right now is that
>>>>
>>>> a) The problem depends on the communication volume (the larger
>>>> the communication volume, the earlier (in timesteps) the problem
>>>> occurs).
>>>>
>>>> b) It only occurs when the procs are on different machines.
>>>>
>>>>
>>>> It would be helpful if there were a way to make sure that MPICH2
>>>> behaves in the same way on shared-memory and distributed-memory
>>>> machines (e.g. doesn't use IPC). Is there such a way (I suspect
>>>> that there is a different behaviour because of point b))?
>>>>
>>>> Thanks.
>>>>
>>>> Dorian
>>>>
>>>>
>>>> Darius Buntinas wrote:
>>>>> This means that the header of the received packet has been
>>>>> corrupted. It looks like it might be an internal bug. Can you
>>>>> send us a short program that demonstrates this?
>>>>>
>>>>> Thanks,
>>>>> -d
>>>>>
>>>>> On 01/27/2009 07:25 AM, Dorian Krause wrote:
>>>>>
>>>>>> Hi List,
>>>>>>
>>>>>> I'm running an application with mpich2-1.1a2 (intel compiler)
>>>>>> which uses one-sided communication to put data from a contiguous
>>>>>> buffer on the origin side into a strided (derived datatype)
>>>>>> buffer on the target side. The program runs fine with (let's say)
>>>>>> 4 procs on a single machine but fails with
>>>>>>
>>>>>> Assertion failed in file ch3_progress.c at line 473:
>>>>>> pkt->type >= 0 && pkt->type < MPIDI_NEM_PKT_END
>>>>>> internal ABORT - process 3
>>>>>>
>>>>>>
>>>>>> if submitted to the cluster (I suppose it does not use nemesis in
>>>>>> the first case?). In ch3_progress.c I can read that "invalid pkt
>>>>>> data will result in unpredictable behavior".
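>>>>>>
>>>>>> Schematically, the communication pattern is roughly the following
>>>>>> (only a sketch; the buffer sizes, names and the vector layout are
>>>>>> made up, not the real application code):
>>>>>>
>>>>>> #include <mpi.h>
>>>>>> #include <stdlib.h>
>>>>>>
>>>>>> int main(int argc, char **argv)
>>>>>> {
>>>>>>     int rank, size, i;
>>>>>>     const int nblocks = 64, stride = 4;   /* made-up sizes */
>>>>>>     MPI_Datatype strided;
>>>>>>     MPI_Win win;
>>>>>>     double *winbuf, *sendbuf;
>>>>>>
>>>>>>     MPI_Init(&argc, &argv);
>>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>>
>>>>>>     /* strided target layout: nblocks doubles, stride apart */
>>>>>>     MPI_Type_vector(nblocks, 1, stride, MPI_DOUBLE, &strided);
>>>>>>     MPI_Type_commit(&strided);
>>>>>>
>>>>>>     MPI_Alloc_mem(nblocks * stride * sizeof(double),
>>>>>>                   MPI_INFO_NULL, &winbuf);
>>>>>>     sendbuf = malloc(nblocks * sizeof(double));
>>>>>>     for (i = 0; i < nblocks; i++)
>>>>>>         sendbuf[i] = rank + i;
>>>>>>
>>>>>>     MPI_Win_create(winbuf, nblocks * stride * sizeof(double),
>>>>>>                    sizeof(double), MPI_INFO_NULL,
>>>>>>                    MPI_COMM_WORLD, &win);
>>>>>>
>>>>>>     /* contiguous origin buffer -> strided target buffer */
>>>>>>     MPI_Win_fence(0, win);
>>>>>>     MPI_Put(sendbuf, nblocks, MPI_DOUBLE, (rank + 1) % size,
>>>>>>             0, 1, strided, win);
>>>>>>     MPI_Win_fence(0, win);
>>>>>>
>>>>>>     MPI_Win_free(&win);
>>>>>>     MPI_Free_mem(winbuf);
>>>>>>     free(sendbuf);
>>>>>>     MPI_Type_free(&strided);
>>>>>>     MPI_Finalize();
>>>>>>     return 0;
>>>>>> }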
>>>>>>
>>>>>> Can you tell me what that means? What is "pkt data", and to which
>>>>>> input from the application does the pkt instance correspond?
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Dorian
>>>>>>
>>>>
>
> <vglog-0.20891><vglog-1.20890>