[mpich-discuss] Using MPI_Put/Get correctly?
Rajeev Thakur
thakur at mcs.anl.gov
Tue Dec 21 08:50:56 CST 2010
You could also try configuring MPICH2 with --with-device=ch3:sock. That will use a different communication method and will help narrow down the problem.
Rajeev
On Dec 17, 2010, at 2:59 PM, Grismer, Matthew J Civ USAF AFMC AFRL/RBAT wrote:
> I rebuilt MPICH2 with debugging support, so I get some more detail on
> the error point from gdb:
>
> Program received signal EXC_BAD_ACCESS, Could not access memory.
> Reason: 13 at address: 0x0000000000000000
> 0x000000010040a645 in MPID_Segment_blkidx_m2m (blocks_p=0x7fff5fbfd328, count=1606407072, blocklen=3138, offsetarray=0x6, el_type=8847888, rel_off=4310413312, bufp=0x6000218000000000, v_paramp=0x7fff5fbfd3a0) at segment_packunpack.c:313
> 313 MPIDI_COPY_FROM_VEC(src, dest, 0, int64_t, blocklen, 1);
> (gdb) list
> 308
> 309 /* note: macro modifies dest buffer ptr, so we must reset */
> 310 if (el_size == 8
> 311 MPIR_ALIGN8_TEST(src, dest))
> 312 {
> 313 MPIDI_COPY_FROM_VEC(src, dest, 0, int64_t, blocklen, 1);
> 314 }
> 315 else if (el_size == 4
> 316 MPIR_ALIGN4_TEST(src,dest))
> 317 {
>
> Also, I'm trying to come up with a small sample that demonstrates the
> issue.
>
> Matt
>
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Dave Goodell
> Sent: Thursday, December 16, 2010 4:45 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] Using MPI_Put/Get correctly?
>
> That must be a bug in MPICH2. The name of the routine is helpful, but
> your MPICH2 isn't built with debug information, so it's a bit harder to
> tell what part of that function is causing the trouble. Also, a stack
> trace with line numbers would be helpful.
>
>
> As Rajeev mentioned before, a small test program would really help us
> troubleshoot this. It can be very difficult to find/fix this sort of
> thing over email.
>
> -Dave
>
> On Dec 16, 2010, at 3:35 PM CST, Grismer, Matthew J Civ USAF AFMC
> AFRL/RBAT wrote:
>
>> I attached to the running processes with gdb, and get the following
>> error when the code dies:
>>
>> Program received signal EXC_BAD_ACCESS, Could not access memory.
>> Reason: 13 at address 0x00000000000
>> 0x000000010040a5e5 in MPID_Segment_blkidx_m2m ()
>>
>> if that is any help at all...
>>
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of James Dinan
>> Sent: Thursday, December 16, 2010 3:28 PM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] Using MPI_Put/Get correctly?
>>
>> Hi Matt,
>>
>> If my understanding is correct, the only time you are allowed to perform
>> direct load/store accesses on local data that is exposed in a window is
>> when the window is closed under active target or when you are in an
>> exclusive access epoch under passive target mode. So I think what you
>> are doing may be invalid even though you are able to guarantee that
>> accesses do not overlap. The source for your put will need to be a
>> private buffer. You may be able to accomplish this easily in your code,
>> or you might have to copy data into a private buffer (before you post
>> the window) before you can put().
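
To make the private-buffer suggestion concrete, here is a minimal C sketch that is not from the thread. It copies the blocks QF(1:6,1,k) for a list of indices k into a contiguous buffer that lives outside the window. The element type double, the names qf_offset and pack_send_blocks, and the 1-based index list idx are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

enum { QF_DIM1 = 6, QF_DIM2 = 2 };  /* extents of the first two dimensions of QF(6,2,N) */

/* Element offset of Fortran QF(i,j,k) (1-based, column-major) from QF(1,1,1). */
size_t qf_offset(int i, int j, int k)
{
    assert(i >= 1 && i <= QF_DIM1 && j >= 1 && j <= QF_DIM2 && k >= 1);
    return (size_t)(i - 1)
         + (size_t)QF_DIM1 * (size_t)(j - 1)
         + (size_t)QF_DIM1 * (size_t)QF_DIM2 * (size_t)(k - 1);
}

/* Copy QF(1:6,1,idx[m]) for m = 0..count-1 into the contiguous private
 * buffer sendbuf, which must hold at least 6*count elements. */
void pack_send_blocks(const double *qf, const int *idx, size_t count,
                      double *sendbuf)
{
    for (size_t m = 0; m < count; ++m) {
        const double *src = qf + qf_offset(1, 1, idx[m]);
        for (int i = 0; i < QF_DIM1; ++i)
            sendbuf[m * QF_DIM1 + i] = src[i];
    }
}
```

With the data packed this way, the origin of the put would be sendbuf with a contiguous count of 6*count elements, so no indexed origin datatype over the window memory is needed.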
>>
>> Even though this is outside of the standard, some (many?) MPI
>> implementations may actually allow this on cache-coherent systems (I
>> think MPICH2 on shared memory will allow it).
>>
>> I would be surprised if this error is causing your seg fault (more
>> likely it should just result in corrupted data within the bounds of your
>> buffer). I would tend to suspect that something is off in your
>> datatype, possibly the target datatype, since the segfault occurs in
>> wait(), which is when data might be getting unpacked at the target. Can
>> you run your code through a debugger or valgrind to give us more
>> information on how/when the seg fault occurs?
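
One cheap way to rule out a bad datatype before reaching for a debugger is to check the displacement array that was handed to MPI_Type_indexed. The C sketch below is not from the thread and the function name blocks_in_bounds is hypothetical. It assumes displacements are counted in elements from the window base and that the put uses a target displacement of 0.

```c
#include <stddef.h>

/* Return 1 if every block [disp[m], disp[m] + blocklen) lies inside an
 * array of total_elems elements, and 0 otherwise. Displacements are in
 * units of the element type, as MPI_Type_indexed expects. */
int blocks_in_bounds(const int *disp, size_t count, int blocklen,
                     size_t total_elems)
{
    if (blocklen < 0)
        return 0;
    for (size_t m = 0; m < count; ++m) {
        if (disp[m] < 0)
            return 0;  /* would land before the window base */
        if ((size_t)disp[m] + (size_t)blocklen > total_elems)
            return 0;  /* would run past the end of the window */
    }
    return 1;
}
```

For an array QF(6,2,N), total_elems would be 12*N and blocklen would be 6.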
>>
>> Cheers,
>> ~Jim.
>>
>> On 12/16/2010 12:33 PM, Grismer, Matthew J Civ USAF AFMC AFRL/RBAT
>> wrote:
>>> I am trying to modify the communication routines in our code to use
>>> MPI_Put's instead of sends and receives. This worked fine for several
>>> variable Put's, but now I have one that is causing seg faults. Reading
>>> through the MPI documentation it is not clear to me if what I am doing
>>> is permissible or not. Basically, the question is this - if I have
>>> defined all of an array as a window on each processor, can I PUT data
>>> from that array to remote processes at the same time as the remote
>>> processes are PUTing into the local copy, assuming no overlaps of any of
>>> the PUTs?
>>>
>>> Here are the details if that doesn't make sense. I have a (Fortran)
>>> array QF(6,2,N) on each processor, where N could be a very large number
>>> (100,000). I create a window QFWIN on the entire array on all the
>>> processors. I define MPI_Type_indexed "sending" datatypes (QFSND) with
>>> block lengths of 6 that send from QF(1,1,*), and MPI_Type_indexed
>>> "receiving" datatypes (QFREC) with block lengths of 6 that receive into
>>> QF(1,2,*). Here * is a non-repeating set of integers up to N. I create
>>> groups of processors that communicate, where these groups will all
>>> exchange QF data, PUTing local QF(1,1,*) to remote QF(1,2,*). So,
>>> processor 1 is PUTing QF data to processors 2,3,4 at the same time 2,3,4
>>> are putting their QF data to 1, and so on. Processors 2,3,4 are PUTing
>>> into non-overlapping regions of QF(1,2,*) on 1, and 1 is PUTing from
>>> QF(1,1,*) to 2,3,4, and so on. So, my calls look like this on each
>>> processor:
>>>
>>> assertion = 0
>>> call MPI_Win_post(group, assertion, QFWIN, ierr)
>>> call MPI_Win_start(group, assertion, QFWIN, ierr)
>>>
>>> do I=1,neighbors
>>>    call MPI_Put(QF, 1, QFSND(I), NEIGHBOR(I), 0, 1, QFREC(I), QFWIN, ierr)
>>> end do
>>>
>>> call MPI_Win_complete(QFWIN,ierr)
>>> call MPI_Win_wait(QFWIN,ierr)
>>>
>>> Note I did define QFREC locally on each processor to properly represent
>>> where the data was going on the remote processors. The error value
>>> ierr=0 after MPI_Win_post, MPI_Win_start, MPI_Put, and MPI_Win_complete,
>>> and the code seg faults in MPI_Win_wait.
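
For reference, the element displacements that indexed types over QF(1,1,*) and QF(1,2,*) would need can be worked out from Fortran's column-major layout: QF(1,j,k) sits 6*(j-1) + 12*(k-1) elements past QF(1,1,1). The C sketch below is not from the thread; the names qf_block_disp and fill_displacements and the 1-based index arrays are hypothetical, and it assumes both types are built relative to the start of QF.

```c
#include <stddef.h>

enum { QF_DIM1 = 6, QF_DIM2 = 2 };  /* extents of the first two dimensions of QF(6,2,N) */

/* Element displacement of QF(1,j,k) from QF(1,1,1) (1-based, column-major). */
int qf_block_disp(int j, int k)
{
    return QF_DIM1 * (j - 1) + QF_DIM1 * QF_DIM2 * (k - 1);
}

/* Fill the displacement arrays for a "sending" type over QF(1,1,send_idx[m])
 * and a "receiving" type over QF(1,2,recv_idx[m]); each block has length 6. */
void fill_displacements(const int *send_idx, const int *recv_idx, size_t count,
                        int *send_disp, int *recv_disp)
{
    for (size_t m = 0; m < count; ++m) {
        send_disp[m] = qf_block_disp(1, send_idx[m]);
        recv_disp[m] = qf_block_disp(2, recv_idx[m]);
    }
}
```

Under this layout a send block covers elements 12*(k-1) through 12*(k-1)+5 and a receive block covers 12*(k-1)+6 through 12*k-1, so the two sets of blocks cannot overlap for any choice of indices.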
>>>
>>> I'm using MPICH2 1.3.1 on Mac OS X 10.6.5, built with Intel XE (12.0)
>>> compilers, and running on just 2 (internal) processors of my Mac Pro.
>>> The code ran normally with this configuration up until the point I put
>>> the above in. Several other communications with MPI_Put similar to the
>>> above work fine, though the windows are only on a subset of the
>>> communicated array, and the origin data is being PUT from part of the
>>> array that is not within the window.
>>>
>>> _____________________________________________________
>>> Matt
>>>
>>> _______________________________________________
>>> mpich-discuss mailing list
>>> mpich-discuss at mcs.anl.gov
>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss