[mpich-discuss] Using MPI_Put/Get correctly?
Grismer, Matthew J Civ USAF AFMC AFRL/RBAT
Matthew.Grismer at wpafb.af.mil
Mon Jan 3 10:16:37 CST 2011
Unfortunately, correcting the integer type for the displacement does not fix
the problem in my code, argh! So, thinking this might have something to do
with the large arrays and amount of data being passed in the actual code, I
modified my example (attached putbothways2.f90) so that the array sizes and
amount of data swapped are nearly identical to the code giving me the issue.
I also filled the shared array with random data, instead of 0's and 1's, to
ensure nothing special was happening due to the simple, uniform data. The
example still works great, but my actual code still seg faults at the same
location. The only difference is that the reason now given for the failure
at line 313 of MPID_Segment_blkidx_m2m is KERN_INVALID_ADDRESS instead of
"13".
So, to summarize: the example code that uses MPI_Put calls with indexed
datatypes to swap data between 2 processors works without issue, while the
actual code that communicates in the same manner fails. The only difference
is that the actual code allocates many other arrays, which are communicated
in various ways (sends, puts, broadcasts, etc.). I checked and re-checked
all the argument lists associated with the indexed data, window, and puts;
everything looks correct. Any thoughts or suggestions on how to proceed?
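For anyone following along, the pattern that still fails reduces to roughly
the sketch below. All names are illustrative stand-ins (they follow the QF
example quoted later in this thread, not the attached files), and the
displacement is declared with kind MPI_ADDRESS_KIND per the earlier fix:

```fortran
! Illustrative sketch of the exchange pattern under discussion.
integer(kind=MPI_ADDRESS_KIND) :: disp
integer :: i, ierr
disp = 0
call MPI_Win_post(group, 0, QFWIN, ierr)    ! expose the local window
call MPI_Win_start(group, 0, QFWIN, ierr)   ! begin the access epoch
do i = 1, neighbors
   call MPI_Put(QF, 1, QFSND(i), NEIGHBOR(i), disp, 1, QFREC(i), QFWIN, ierr)
end do
call MPI_Win_complete(QFWIN, ierr)          ! puts are locally complete
call MPI_Win_wait(QFWIN, ierr)              ! exposure epoch ends
```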
Matt
-----Original Message-----
From: mpich-discuss-bounces at mcs.anl.gov
[mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Grismer,Matthew J
Civ USAF AFMC AFRL/RBAT
Sent: Wednesday, December 29, 2010 12:05 PM
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] Using MPI_Put/Get correctly?
Ahah, yes, I did miss that in the examples, thank you! And I see I have
the same issue for that particular Put in my actual code, I will see if
that fixes things...
Matt
-----Original Message-----
From: mpich-discuss-bounces at mcs.anl.gov
[mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Rajeev Thakur
Sent: Wednesday, December 29, 2010 1:59 AM
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] Using MPI_Put/Get correctly?
Matt,
The target_disp parameter to MPI_Put is of type integer
(kind=MPI_ADDRESS_KIND). If I define a variable disp of that type, set
it to 0, and pass it to MPI_Put (instead of directly passing 0), both
examples work.
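In other words, the fix amounts to something like this sketch (argument
names here are illustrative, not from the attached examples):

```fortran
! Declare the displacement with the correct kind instead of passing a
! literal 0, which has default integer kind and corrupts the argument list.
integer(kind=MPI_ADDRESS_KIND) :: disp
disp = 0
call MPI_Put(origin_buf, 1, origin_type, target_rank, disp, &
             1, target_type, win, ierr)
```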
Rajeev
On Dec 27, 2010, at 4:29 PM, Grismer, Matthew J Civ USAF AFMC AFRL/RBAT
wrote:
> I've created two example test programs that appear to highlight the issue
> with MPICH2; both die when I run them on 2 processors. I am pretty certain
> the first (putoneway.f90) should work, as I am only doing a single put from
> one processor to a second processor; the target processor is doing nothing
> with the window'ed array that is receiving the data. My guess is the problem
> lies in the indexed datatypes that I am using for both the origin and
> target.
>
> The second case (putbothways.f90) closely mirrors what I am actually trying
> to do in my code, that is have each processor put into the other processor's
> window'ed array at the same time. So, each process is sending from and
> receiving into the same array at the same time, with no overlap in the sent
> and received data. Once again I'm using indexed data types for both the
> origin and target.
>
> To build: mpif90 putoneway.f90
> To run: mpiexec -np 2 a.out
>
> Matt
>
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Dave Goodell
> Sent: Thursday, December 16, 2010 4:45 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] Using MPI_Put/Get correctly?
>
> That must be a bug in MPICH2. The name of the routine is helpful, but your
> MPICH2 isn't built with debug information, so it's a bit harder to tell what
> part of that function is causing the trouble. Also, a stack trace with line
> numbers would be helpful.
>
>
> As Rajeev mentioned before, a small test program would really help us
> troubleshoot this. It can be very difficult to find/fix this sort of thing
> over email.
>
> -Dave
>
> On Dec 16, 2010, at 3:35 PM CST, Grismer, Matthew J Civ USAF AFMC AFRL/RBAT
> wrote:
>
>> I attached to the running processes with gdb, and get the following
>> error when the code dies:
>>
>> Program received signal EXC_BAD_ACCESS, Could not access memory.
>> Reason: 13 at address 0x00000000000
>> 0x000000010040a5e5 in MPID_Segment_blkidx_m2m ()
>>
>> if that is any help at all...
>>
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of James Dinan
>> Sent: Thursday, December 16, 2010 3:28 PM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] Using MPI_Put/Get correctly?
>>
>> Hi Matt,
>>
>> If my understanding is correct, the only time you are allowed to perform
>> direct load/store accesses on local data that is exposed in a window is
>> when the window is closed under active target, or when you are in an
>> exclusive access epoch under passive target mode. So I think what you
>> are doing may be invalid even though you are able to guarantee that
>> accesses do not overlap. The source for your put will need to be a
>> private buffer; you may be able to accomplish this easily in your code,
>> or you might have to copy data into a private buffer (before you post
>> the window) before you can put().
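A minimal sketch of that private-buffer workaround (illustrative names,
using the QF layout from Matt's original message below; the origin
datatype would need to describe the private buffer's layout rather than
QF's):

```fortran
! Copy the origin data out of the windowed array into a buffer that is
! not part of the window, then put from that private copy during the epoch.
qf_copy(:,:) = QF(:,1,:)                 ! private copy, outside the window
call MPI_Win_post(group, 0, QFWIN, ierr)
call MPI_Win_start(group, 0, QFWIN, ierr)
do i = 1, neighbors
   ! copy_type(i) is a hypothetical datatype matching qf_copy's layout
   call MPI_Put(qf_copy, 1, copy_type(i), NEIGHBOR(i), disp, &
                1, QFREC(i), QFWIN, ierr)
end do
call MPI_Win_complete(QFWIN, ierr)
call MPI_Win_wait(QFWIN, ierr)
```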
>>
>> Even though this is outside of the standard, some (many?) MPI
>> implementations may actually allow this on cache-coherent systems (I
>> think MPICH2 on shared memory will allow it).
>>
>> I would be surprised if this error is causing your seg fault (more
>> likely it should just result in corrupted data within the bounds of your
>> buffer). I would tend to suspect that something is off in your
>> datatype, possibly the target datatype, since the segfault occurs in
>> wait(), which is when data might be getting unpacked at the target. Can
>> you run your code through a debugger or valgrind to give us more
>> information on how/when the seg fault occurs?
>>
>> Cheers,
>> ~Jim.
>>
>> On 12/16/2010 12:33 PM, Grismer, Matthew J Civ USAF AFMC AFRL/RBAT
>> wrote:
>>> I am trying to modify the communication routines in our code to use
>>> MPI_Put's instead of sends and receives. This worked fine for several
>>> variable Put's, but now I have one that is causing seg faults. Reading
>>> through the MPI documentation it is not clear to me if what I am doing
>>> is permissible or not. Basically, the question is this - if I have
>>> defined all of an array as a window on each processor, can I PUT data
>>> from that array to remote processes at the same time as the remote
>>> processes are PUTing into the local copy, assuming no overlaps of any of
>>> the PUTs?
>>>
>>> Here are the details if that doesn't make sense. I have a (Fortran)
>>> array QF(6,2,N) on each processor, where N could be a very large number
>>> (100,000). I create a window QFWIN on the entire array on all the
>>> processors. I define MPI_Type_indexed "sending" datatypes (QFSND) with
>>> block lengths of 6 that send from QF(1,1,*), and MPI_Type_indexed
>>> "receiving" datatypes (QFREC) with block lengths of 6 that receive into
>>> QF(1,2,*). Here * is a non-repeating set of integers up to N. I create
>>> groups of processors that communicate, where these groups will all
>>> exchange QF data, PUTing local QF(1,1,*) to remote QF(1,2,*). So,
>>> processor 1 is PUTing QF data to processors 2,3,4 at the same time
>>> 2,3,4 are putting their QF data to 1, and so on. Processors 2,3,4 are
>>> PUTing into non-overlapping regions of QF(1,2,*) on 1, and 1 is PUTing
>>> from QF(1,1,*) to 2,3,4, and so on. So, my calls look like this on each
>>> processor:
>>>
>>> assertion = 0
>>> call MPI_Win_post(group, assertion, QFWIN, ierr)
>>> call MPI_Win_start(group, assertion, QFWIN, ierr)
>>>
>>> do I = 1, neighbors
>>>   call MPI_Put(QF, 1, QFSND(I), NEIGHBOR(I), 0, 1, QFREC(I), QFWIN, ierr)
>>> end do
>>>
>>> call MPI_Win_complete(QFWIN,ierr)
>>> call MPI_Win_wait(QFWIN,ierr)
>>>
>>> Note I did define QFREC locally on each processor to properly represent
>>> where the data was going on the remote processors. The error value
>>> ierr=0 after MPI_Win_post, MPI_Win_start, MPI_Put, and MPI_Win_complete,
>>> and the code seg faults in MPI_Win_wait.
>>>
>>> I'm using MPICH2 1.3.1 on Mac OS X 10.6.5, built with Intel XE (12.0)
>>> compilers, and running on just 2 (internal) processors of my Mac Pro.
>>> The code ran normally with this configuration up until the point I put
>>> the above in. Several other communications with MPI_Put similar to the
>>> above work fine, though the windows are only on a subset of the
>>> communicated array, and the origin data is being PUT from part of the
>>> array that is not within the window.
>>>
>>> _____________________________________________________
>>> Matt
>>>
>>> _______________________________________________
>>> mpich-discuss mailing list
>>> mpich-discuss at mcs.anl.gov
>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>
>
>
> <putoneway.f90><putbothways.f90>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: putbothways2.f90
Type: application/octet-stream
Size: 2179 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20110103/b4cddb06/attachment.obj>