[mpich-discuss] Using MPI_Put/Get correctly?

Grismer, Matthew J Civ USAF AFMC AFRL/RBAT Matthew.Grismer at wpafb.af.mil
Fri Dec 17 14:59:14 CST 2010


I rebuilt MPICH2 with debugging support, so I get some more detail on
the error point from gdb:

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: 13 at address: 0x0000000000000000
0x000000010040a645 in MPID_Segment_blkidx_m2m (blocks_p=0x7fff5fbfd328,
count=1606407072, blocklen=3138, offsetarray=0x6, el_type=8847888,
rel_off=4310413312, bufp=0x6000218000000000, v_paramp=0x7fff5fbfd3a0) at
segment_packunpack.c:313
313		    MPIDI_COPY_FROM_VEC(src, dest, 0, int64_t, blocklen, 1);
(gdb) list
308	
309		/* note: macro modifies dest buffer ptr, so we must reset */
310		if (el_size == 8
311		    MPIR_ALIGN8_TEST(src, dest))
312		{
313		    MPIDI_COPY_FROM_VEC(src, dest, 0, int64_t, blocklen, 1);
314		}
315		else if (el_size == 4
316			 MPIR_ALIGN4_TEST(src,dest))
317		{

Also, I'm trying to come up with a small sample that demonstrates the
issue.
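
A minimal reproducer for the pattern described further down the thread
might look roughly like the following C sketch.  The array layout, sizes,
and names here are illustrative assumptions, not the actual application
code: two ranks each expose their whole array in a window, build indexed
origin/target datatypes with block length 6, and PUT into each other's
"receive" half under post/start/complete/wait.

#include <mpi.h>
#include <stdlib.h>

#define N   1000   /* number of blocks (the real code uses ~100,000) */
#define BLK 6      /* block length, as in QF(6,2,N) */

int main(int argc, char **argv)
{
    int rank, nprocs, i;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs != 2)
        MPI_Abort(MPI_COMM_WORLD, 1);   /* sketch assumes exactly 2 ranks */

    /* qf[i][0][*] is the "send" half, qf[i][1][*] the "receive" half;
       note the origin buffer lies inside the exposed window, which is
       exactly the pattern being questioned in this thread. */
    double (*qf)[2][BLK] = malloc((size_t)N * sizeof *qf);
    for (i = 0; i < N * 2 * BLK; i++)
        ((double *)qf)[i] = rank;

    MPI_Win win;
    MPI_Win_create(qf, (MPI_Aint)(N * sizeof *qf), (int)sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Indexed types: origin blocks start at qf[i][0][0], target blocks
       at qf[i][1][0]. */
    int *blens = malloc(N * sizeof(int));
    int *sdisp = malloc(N * sizeof(int));
    int *rdisp = malloc(N * sizeof(int));
    for (i = 0; i < N; i++) {
        blens[i] = BLK;
        sdisp[i] = i * 2 * BLK;         /* qf[i][0][0] */
        rdisp[i] = i * 2 * BLK + BLK;   /* qf[i][1][0] */
    }
    MPI_Datatype qfsnd, qfrec;
    MPI_Type_indexed(N, blens, sdisp, MPI_DOUBLE, &qfsnd);
    MPI_Type_indexed(N, blens, rdisp, MPI_DOUBLE, &qfrec);
    MPI_Type_commit(&qfsnd);
    MPI_Type_commit(&qfrec);

    /* Each rank's access/exposure group contains only the other rank. */
    int peer = 1 - rank;
    MPI_Group world_grp, peer_grp;
    MPI_Comm_group(MPI_COMM_WORLD, &world_grp);
    MPI_Group_incl(world_grp, 1, &peer, &peer_grp);

    /* Generalized active-target synchronization, as in the Fortran code
       quoted below. */
    MPI_Win_post(peer_grp, 0, win);
    MPI_Win_start(peer_grp, 0, win);
    MPI_Put(qf, 1, qfsnd, peer, 0, 1, qfrec, win);
    MPI_Win_complete(win);
    MPI_Win_wait(win);

    MPI_Type_free(&qfsnd);
    MPI_Type_free(&qfrec);
    MPI_Group_free(&peer_grp);
    MPI_Group_free(&world_grp);
    MPI_Win_free(&win);
    free(blens); free(sdisp); free(rdisp); free(qf);
    MPI_Finalize();
    return 0;
}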

Matt

-----Original Message-----
From: mpich-discuss-bounces at mcs.anl.gov
[mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Dave Goodell
Sent: Thursday, December 16, 2010 4:45 PM
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] Using MPI_Put/Get correctly?

That must be a bug in MPICH2.  The name of the routine is helpful, but
your MPICH2 isn't built with debug information, so it's a bit harder to
tell what part of that function is causing the trouble.  Also, a stack
trace with line numbers would be helpful.


As Rajeev mentioned before, a small test program would really help us
troubleshoot this.  It can be very difficult to find/fix this sort of
thing over email.

-Dave

On Dec 16, 2010, at 3:35 PM CST, Grismer, Matthew J Civ USAF AFMC
AFRL/RBAT wrote:

> I attached to the running processes with gdb, and get the following
> error when the code dies:
> 
> Program received signal EXC_BAD_ACCESS, Could not access memory.
> Reason:  13 at address 0x00000000000
> 0x000000010040a5e5 in MPID_Segment_blkidx_m2m ()
> 
> if that is any help at all...
> 
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of James Dinan
> Sent: Thursday, December 16, 2010 3:28 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] Using MPI_Put/Get correctly?
> 
> Hi Matt,
> 
> If my understanding is correct, the only time you are allowed to
> perform direct load/store accesses on local data that is exposed in a
> window is when the window is closed under active target, or when you
> are in an exclusive access epoch under passive target mode.  So I
> think what you are doing may be invalid even though you are able to
> guarantee that the accesses do not overlap.  The source for your put
> will need to be a private buffer; you may be able to accomplish this
> easily in your code, or you might have to copy data into a private
> buffer (before you post the window) before you can put().
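
For what it's worth, the private-buffer variant described above might
look roughly like the following C sketch; the function and buffer names
here are placeholders, not code from this thread.

#include <mpi.h>
#include <string.h>

/* qf:      window memory, n elements laid out as double[2][6]
   sendbuf: private origin buffer holding the n outgoing blocks of 6 */
void exchange_with_private_origin(double *qf, double *sendbuf, int n,
                                  MPI_Datatype qfrec, /* target datatype */
                                  MPI_Group grp, int peer, MPI_Win win)
{
    int i;

    /* Copy the outgoing halves out of the window memory while no epoch
       is open, so the local load/store access is clearly legal. */
    for (i = 0; i < n; i++)
        memcpy(&sendbuf[i * 6], &qf[i * 12], 6 * sizeof(double));

    MPI_Win_post(grp, 0, win);      /* expose the local window  */
    MPI_Win_start(grp, 0, win);     /* access the peer's window */

    /* Origin is the contiguous private buffer; the target side still
       uses the indexed "receive" datatype into the peer's window. */
    MPI_Put(sendbuf, n * 6, MPI_DOUBLE, peer, 0, 1, qfrec, win);

    MPI_Win_complete(win);
    MPI_Win_wait(win);
}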
> 
> Even though this is outside of the standard, some (many?) MPI 
> implementations may actually allow this on cache-coherent systems (I 
> think MPICH2 on shared memory will allow it).
> 
> I would be surprised if this error is causing your seg fault (more
> likely it would just result in corrupted data within the bounds of
> your buffer).  I would tend to suspect that something is off in your
> datatype, possibly the target datatype, since the segfault occurs in
> wait(), which is when data might be getting unpacked at the target.
> Can you run your code through a debugger or valgrind to give us more
> information on how/when the seg fault occurs?
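
One quick sanity check in that direction (a sketch; the helper name and
parameters below are hypothetical, not from this thread) is to compare
the target datatype's size and extent against the window before the Put:

#include <mpi.h>
#include <stdio.h>

/* Returns 1 if 'target_disp' plus the extent of 'ttype' fits within a
   window of 'win_bytes' bytes created with displacement unit 'disp_unit'. */
int target_type_fits(MPI_Datatype ttype, MPI_Aint target_disp,
                     int disp_unit, MPI_Aint win_bytes)
{
    MPI_Aint lb, extent;
    int size;

    MPI_Type_get_extent(ttype, &lb, &extent); /* span covered in the window */
    MPI_Type_size(ttype, &size);              /* bytes actually transferred */
    printf("target type: size=%d bytes, lb=%ld, extent=%ld\n",
           size, (long)lb, (long)extent);

    return target_disp * disp_unit + lb + extent <= win_bytes;
}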
> 
> Cheers,
>  ~Jim.
> 
> On 12/16/2010 12:33 PM, Grismer, Matthew J Civ USAF AFMC AFRL/RBAT
> wrote:
>> I am trying to modify the communication routines in our code to use
>> MPI_Put's instead of sends and receives.  This worked fine for
>> several variable Put's, but now I have one that is causing seg
>> faults.  Reading through the MPI documentation, it is not clear to me
>> whether what I am doing is permissible.  Basically, the question is
>> this: if I have defined all of an array as a window on each
>> processor, can I PUT data from that array to remote processes at the
>> same time as the remote processes are PUTing into the local copy,
>> assuming none of the PUTs overlap?
>> 
>> Here are the details if that doesn't make sense.  I have a (Fortran)
>> array QF(6,2,N) on each processor, where N could be a very large
>> number (100,000).  I create a window QFWIN on the entire array on all
>> the processors.  I define MPI_Type_indexed "sending" datatypes
>> (QFSND) with block lengths of 6 that send from QF(1,1,*), and
>> MPI_Type_indexed "receiving" datatypes (QFREC) with block lengths of
>> 6 that receive into QF(1,2,*).  Here * is a non-repeating set of
>> integers up to N.  I create groups of processors that communicate,
>> where these groups will all exchange QF data, PUTing local QF(1,1,*)
>> to remote QF(1,2,*).  So, processor 1 is PUTing QF data to processors
>> 2,3,4 at the same time 2,3,4 are PUTing their QF data to 1, and so
>> on.  Processors 2,3,4 are PUTing into non-overlapping regions of
>> QF(1,2,*) on 1, and 1 is PUTing from QF(1,1,*) to 2,3,4, and so on.
>> So, my calls look like this on each processor:
>> 
>> assertion = 0
>> call MPI_Win_post(group, assertion, QFWIN, ierr)
>> call MPI_Win_start(group, assertion, QFWIN, ierr)
>> 
>> do I=1,neighbors
>>   call MPI_Put(QF, 1, QFSND(I), NEIGHBOR(I), 0, 1, QFREC(I), QFWIN, ierr)
>> end do
>> 
>> call MPI_Win_complete(QFWIN,ierr)
>> call MPI_Win_wait(QFWIN,ierr)
>> 
>> Note I did define QFREC locally on each processor to properly
>> represent where the data was going on the remote processors.  The
>> error value ierr=0 after MPI_Win_post, MPI_Win_start, MPI_Put, and
>> MPI_Win_complete, and the code seg faults in MPI_Win_wait.
>> 
>> I'm using MPICH2 1.3.1 on Mac OS X 10.6.5, built with the Intel XE
>> (12.0) compilers, and running on just 2 (internal) processors of my
>> Mac Pro.  The code ran normally with this configuration up until the
>> point I put the above in.  Several other communications with MPI_Put
>> similar to the above work fine, though those windows are only on a
>> subset of the communicated array, and the origin data is being PUT
>> from part of the array that is not within the window.
>> 
>> _____________________________________________________
>> Matt
>> 
> 

_______________________________________________
mpich-discuss mailing list
mpich-discuss at mcs.anl.gov
https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss

