[mpich-discuss] Using MPI_Put/Get correctly?

Grismer, Matthew J Civ USAF AFMC AFRL/RBAT Matthew.Grismer at wpafb.af.mil
Mon Dec 27 16:29:39 CST 2010


I've created two example test programs that appear to highlight the issue
with MPICH2; both die when I run them on 2 processors.  I am pretty certain
the first (putoneway.f90) should work, as I am only doing a single put from
one processor to a second processor; the target processor is doing nothing
with the windowed array that is receiving the data.  My guess is that the
problem lies in the indexed datatypes that I am using for both the origin
and target.

The second case (putbothways.f90) closely mirrors what I am actually trying
to do in my code, that is, have each processor put into the other
processor's windowed array at the same time.  So each process is sending
from and receiving into the same array at the same time, with no overlap in
the sent and received data.  Once again I'm using indexed datatypes for
both the origin and target.

To build:  mpif90 putoneway.f90
To run:  mpiexec -np 2 a.out

Matt
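
For readers without the scrubbed attachments, here is a minimal sketch of
the one-way case described above.  It is not the original putoneway.f90.
The program name, the small N, the cell list, and the names sendbuf, cells,
and rectype are all hypothetical, chosen only for illustration.  It differs
from the description in two hedged ways: the origin is a private contiguous
buffer (as Jim suggests further down) rather than an indexed view of QF, and
the target displacement is declared INTEGER(KIND=MPI_ADDRESS_KIND), which is
what the MPI Fortran binding of MPI_Put specifies for that argument.

```fortran
! Hypothetical sketch only; not the scrubbed putoneway.f90.
! Build:  mpif90 put_sketch.f90      Run:  mpiexec -np 2 a.out
program put_sketch
  use mpi
  implicit none
  integer, parameter :: n = 8          ! assumed small N for illustration
  integer, parameter :: nblk = 3       ! number of cells transferred
  double precision :: qf(6,2,n)        ! windowed array, shaped as in the thread
  double precision :: sendbuf(6,nblk)  ! private origin buffer, outside the window
  integer :: cells(nblk), blens(nblk), displs(nblk), peer(1)
  integer :: rectype, win, world_group, peer_group
  integer :: rank, nprocs, dblsize, i, ierr
  integer(kind=MPI_ADDRESS_KIND) :: winsize, tdisp

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  if (nprocs /= 2) call MPI_Abort(MPI_COMM_WORLD, 1, ierr)

  qf = 0.0d0
  cells = (/ 2, 5, 7 /)                ! assumed, non-repeating cell indices
  blens = 6
  ! Zero-based offsets of QF(1,2,cell), counted in double precision elements:
  ! each cell holds 6*2 = 12 values, and QF(1,2,cell) starts 6 values in.
  displs = 12*(cells - 1) + 6
  call MPI_Type_indexed(nblk, blens, displs, MPI_DOUBLE_PRECISION, &
                        rectype, ierr)
  call MPI_Type_commit(rectype, ierr)

  ! Expose all of QF; the displacement unit is one double precision element.
  call MPI_Type_size(MPI_DOUBLE_PRECISION, dblsize, ierr)
  winsize = int(size(qf), MPI_ADDRESS_KIND) * dblsize
  call MPI_Win_create(qf, winsize, dblsize, MPI_INFO_NULL, &
                      MPI_COMM_WORLD, win, ierr)

  ! Group containing only the other rank.
  call MPI_Comm_group(MPI_COMM_WORLD, world_group, ierr)
  peer(1) = 1 - rank
  call MPI_Group_incl(world_group, 1, peer, peer_group, ierr)

  if (rank == 0) then
     ! Origin side: fill the private buffer, then start/put/complete.
     do i = 1, nblk
        sendbuf(:, i) = dble(cells(i))
     end do
     tdisp = 0
     call MPI_Win_start(peer_group, 0, win, ierr)
     call MPI_Put(sendbuf, 6*nblk, MPI_DOUBLE_PRECISION, 1, tdisp, &
                  1, rectype, win, ierr)
     call MPI_Win_complete(win, ierr)
  else
     ! Target side: expose the window, then wait for the put to land.
     call MPI_Win_post(peer_group, 0, win, ierr)
     call MPI_Win_wait(win, ierr)
     print *, 'QF(1,2,cells) on rank 1:', qf(1, 2, cells)
  end if

  call MPI_Group_free(peer_group, ierr)
  call MPI_Group_free(world_group, ierr)
  call MPI_Win_free(win, ierr)
  call MPI_Type_free(rectype, ierr)
  call MPI_Finalize(ierr)
end program put_sketch
```

The origin sends 18 contiguous values and the target datatype describes 3
blocks of 6, so the two type signatures match; the displacements are counted
in elements of the base type, not bytes.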

-----Original Message-----
From: mpich-discuss-bounces at mcs.anl.gov
[mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Dave Goodell
Sent: Thursday, December 16, 2010 4:45 PM
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] Using MPI_Put/Get correctly?

That must be a bug in MPICH2.  The name of the routine is helpful, but your
MPICH2 isn't built with debug information, so it's a bit harder to tell what
part of that function is causing the trouble.  Also, a stack trace with line
numbers would be helpful.


As Rajeev mentioned before, a small test program would really help us
troubleshoot this.  It can be very difficult to find/fix this sort of thing
over email.

-Dave

On Dec 16, 2010, at 3:35 PM CST, Grismer, Matthew J Civ USAF AFMC AFRL/RBAT
wrote:

> I attached to the running processes with gdb, and get the following
> error when the code dies:
> 
> Program received signal EXC_BAD_ACCESS, Could not access memory.
> Reason:  13 at address 0x00000000000
> 0x000000010040a5e5 in MPID_Segment_blkidx_m2m ()
> 
> if that is any help at all...
> 
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of James Dinan
> Sent: Thursday, December 16, 2010 3:28 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] Using MPI_Put/Get correctly?
> 
> Hi Matt,
> 
> If my understanding is correct, the only time you are allowed to perform
> direct load/store accesses on local data that is exposed in a window is
> when the window is closed under active target or when you are in an
> exclusive access epoch under passive target mode.  So I think what you
> are doing may be invalid even though you are able to guarantee that
> accesses do not overlap.  The source for your put will need to be a
> private buffer; you may be able to accomplish this easily in your code,
> or you might have to copy data into a private buffer (before you post
> the window) before you can put().
> 
> Even though this is outside of the standard, some (many?) MPI 
> implementations may actually allow this on cache-coherent systems (I 
> think MPICH2 on shared memory will allow it).
> 
> I would be surprised if this error is causing your seg fault (more
> likely it should just result in corrupted data within the bounds of your
> buffer).  I would tend to suspect that something is off in your
> datatype, possibly the target datatype, since the segfault occurs in
> wait(), which is when data might be getting unpacked at the target.  Can
> you run your code through a debugger or valgrind to give us more
> information on how/when the seg fault occurs?
> 
> Cheers,
>  ~Jim.
> 
> On 12/16/2010 12:33 PM, Grismer, Matthew J Civ USAF AFMC AFRL/RBAT
> wrote:
>> I am trying to modify the communication routines in our code to use
>> MPI_Put's instead of sends and receives.  This worked fine for several
>> variable Put's, but now I have one that is causing seg faults. Reading
>> through the MPI documentation it is not clear to me if what I am doing
>> is permissible or not.  Basically, the question is this - if I have
>> defined all of an array as a window on each processor, can I PUT data
>> from that array to remote processes at the same time as the remote
>> processes are PUTing into the local copy, assuming no overlaps of any
>> of the PUTs?
>> 
>> Here are the details if that doesn't make sense.  I have a (Fortran)
>> array QF(6,2,N) on each processor, where N could be a very large
>> number (100,000).  I create a window QFWIN on the entire array on all
>> the processors.  I define MPI_Type_indexed "sending" datatypes (QFSND)
>> with block lengths of 6 that send from QF(1,1,*), and MPI_Type_indexed
>> "receiving" datatypes (QFREC) with block lengths of 6 that receive
>> into QF(1,2,*).  Here * is a non-repeating set of integers up to N.  I
>> create groups of processors that communicate, where these groups will
>> all exchange QF data, PUTing local QF(1,1,*) to remote QF(1,2,*).  So,
>> processor 1 is PUTing QF data to processors 2,3,4 at the same time
>> 2,3,4 are PUTing their QF data to 1, and so on.  Processors 2,3,4 are
>> PUTing into non-overlapping regions of QF(1,2,*) on 1, and 1 is PUTing
>> from QF(1,1,*) to 2,3,4, and so on.  So, my calls look like this on
>> each processor:
>> 
>> assertion = 0
>> call MPI_Win_post(group, assertion, QFWIN, ierr)
>> call MPI_Win_start(group, assertion, QFWIN, ierr)
>> 
>> do I=1,neighbors
>>   call MPI_Put(QF, 1, QFSND(I), NEIGHBOR(I), 0, 1, QFREC(I), QFWIN, &
>>                ierr)
>> end do
>> 
>> call MPI_Win_complete(QFWIN,ierr)
>> call MPI_Win_wait(QFWIN,ierr)
>> 
>> Note I did define QFREC locally on each processor to properly
>> represent where the data was going on the remote processors.  The
>> error value ierr=0 after MPI_Win_post, MPI_Win_start, MPI_Put, and
>> MPI_Win_complete, and the code seg faults in MPI_Win_wait.
>> 
>> I'm using MPICH2 1.3.1 on Mac OS X 10.6.5, built with Intel XE (12.0)
>> compilers, and running on just 2 (internal) processors of my Mac Pro.
>> The code ran normally with this configuration up until the point I put
>> the above in.  Several other communications with MPI_Put similar to
>> the above work fine, though the windows are only on a subset of the
>> communicated array, and the origin data is being PUT from part of the
>> array that is not within the window.
>> 
>> _____________________________________________________
>> Matt
>> 
>> _______________________________________________
>> mpich-discuss mailing list
>> mpich-discuss at mcs.anl.gov
>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
-------------- next part --------------
A non-text attachment was scrubbed...
Name: putoneway.f90
Type: application/octet-stream
Size: 1782 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20101227/4e659916/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: putbothways.f90
Type: application/octet-stream
Size: 1878 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20101227/4e659916/attachment-0001.obj>