[mpich-discuss] Using MPI_Put/Get correctly?

Pavan Balaji balaji at mcs.anl.gov
Mon Jan 3 20:51:24 CST 2011


Can you try to run your code through valgrind?
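
For example, something along these lines (assuming the same 2-process
run as in the earlier build instructions):

  mpiexec -np 2 valgrind ./a.out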

  -- Pavan

On 01/03/2011 10:16 AM, Grismer, Matthew J Civ USAF AFMC AFRL/RBAT wrote:
> Unfortunately correcting the integer type for the displacement does not fix
> the problem in my code, argh! So, thinking this might have something to do
> with the large arrays and amount of data being passed in the actual code, I
> modified my example (attached putbothways2.f90) so that the array sizes and
> amount of data swapped are nearly identical to the code giving me the issue.
> I also filled the array that is shared with random data, instead of 0's and
> 1's, to ensure nothing special was happening due to the simple, uniform
> data. Unfortunately, the example works great, but my actual code still seg
> faults at the same location.  The only difference is now the reason given
> for the failure at line 313 of MPID_Segment_blkidx_m2m is
> KERN_INVALID_ADDRESS instead of "13".
>
> So, the summary is the example code that uses MPI_Put calls with indexed
> datatypes to swap data between 2 processors works without issue, while the
> actual code that communicates in the same manner fails.  The only difference
> is the actual code allocates many other arrays, which are communicated in
> various ways (sends, puts, broadcasts, etc).  I checked and re-checked all
> the argument lists associated with the indexed data, window, and puts;
> everything looks correct.  Any thoughts or suggestions on how to proceed?
>
> Matt
>
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Grismer,Matthew J
> Civ USAF AFMC AFRL/RBAT
> Sent: Wednesday, December 29, 2010 12:05 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] Using MPI_Put/Get correctly?
>
> Ahah, yes, I did miss that in the examples, thank you!  And I see I have
> the same issue for that particular Put in my actual code, I will see if
> that fixes things...
>
> Matt
>
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Rajeev Thakur
> Sent: Wednesday, December 29, 2010 1:59 AM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] Using MPI_Put/Get correctly?
>
> Matt,
>           The target_disp parameter to MPI_Put is of type integer
> (kind=MPI_ADDRESS_KIND). If I define a variable disp of that type, set
> it to 0, and pass it to MPI_Put (instead of directly passing 0), both
> examples work.
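>
> For example, something along these lines (a minimal sketch using the
> argument names from the original post):
>
>     integer(kind=MPI_ADDRESS_KIND) :: disp
>
>     disp = 0
>     call MPI_Put(QF, 1, QFSND(I), NEIGHBOR(I), disp, 1, QFREC(I), &
>                  QFWIN, ierr)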
>
> Rajeev
>
>
> On Dec 27, 2010, at 4:29 PM, Grismer, Matthew J Civ USAF AFMC AFRL/RBAT
> wrote:
>
>> I've created two example test programs that appear to highlight the
>> issue with MPICH2; both die when I run them on 2 processors.  I am
>> pretty certain the first (putoneway.f90) should work, as I am only
>> doing a single put from one processor to a second processor; the
>> target processor is doing nothing with the windowed array that is
>> receiving the data.  My guess is the problem lies in the indexed
>> datatypes that I am using for both the origin and target.
>>
>> The second case (putbothways.f90) closely mirrors what I am actually
>> trying to do in my code, that is, have each processor put into the
>> other processor's windowed array at the same time.  So, each process
>> is sending from and receiving into the same array at the same time,
>> with no overlap in the sent and received data.  Once again I'm using
>> indexed datatypes for both the origin and target.
>>
>> To build:  mpif90 putoneway.f90
>> To run:  mpiexec -np 2 a.out
>>
>> Matt
>>
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Dave Goodell
>> Sent: Thursday, December 16, 2010 4:45 PM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] Using MPI_Put/Get correctly?
>>
>> That must be a bug in MPICH2.  The name of the routine is helpful, but
>> your MPICH2 isn't built with debug information, so it's a bit harder
>> to tell what part of that function is causing the trouble.  Also, a
>> stack trace with line numbers would be helpful.
>>
>> As Rajeev mentioned before, a small test program would really help us
>> troubleshoot this.  It can be very difficult to find/fix this sort of
>> thing over email.
>>
>> -Dave
>>
>> On Dec 16, 2010, at 3:35 PM CST, Grismer, Matthew J Civ USAF AFMC
>> AFRL/RBAT wrote:
>>
>>> I attached to the running processes with gdb, and get the following
>>> error when the code dies:
>>>
>>> Program received signal EXC_BAD_ACCESS, Could not access memory.
>>> Reason:  13 at address 0x00000000000
>>> 0x000000010040a5e5 in MPID_Segment_blkidx_m2m ()
>>>
>>> if that is any help at all...
>>>
>>> -----Original Message-----
>>> From: mpich-discuss-bounces at mcs.anl.gov
>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of James Dinan
>>> Sent: Thursday, December 16, 2010 3:28 PM
>>> To: mpich-discuss at mcs.anl.gov
>>> Subject: Re: [mpich-discuss] Using MPI_Put/Get correctly?
>>>
>>> Hi Matt,
>>>
>>> If my understanding is correct, the only time you are allowed to
>>> perform direct load/store accesses on local data that is exposed in
>>> a window is when the window is closed under active target, or when
>>> you are in an exclusive access epoch under passive target mode.  So
>>> I think what you are doing may be invalid even though you are able
>>> to guarantee that accesses do not overlap.  The source for your put
>>> will need to be a private buffer; you may be able to accomplish this
>>> easily in your code, or you might have to copy the data into a
>>> private buffer (before you post the window) before you can put(), as
>>> in the sketch below.
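>>>
>>> For example, something along these lines (a rough sketch; QFBUF and
>>> disp are placeholders I'm introducing here, and the other names come
>>> from your original post):
>>>
>>>     real, allocatable :: QFBUF(:,:,:)
>>>     integer(kind=MPI_ADDRESS_KIND) :: disp
>>>
>>>     allocate(QFBUF(6,2,N))
>>>     QFBUF = QF          ! private copy of QF, used only as the Put origin
>>>     disp = 0
>>>
>>>     call MPI_Win_post(group, 0, QFWIN, ierr)
>>>     call MPI_Win_start(group, 0, QFWIN, ierr)
>>>     do I = 1, neighbors
>>>       call MPI_Put(QFBUF, 1, QFSND(I), NEIGHBOR(I), disp, 1, &
>>>                    QFREC(I), QFWIN, ierr)
>>>     end do
>>>     call MPI_Win_complete(QFWIN, ierr)
>>>     call MPI_Win_wait(QFWIN, ierr)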
>>>
>>> Even though this is outside of the standard, some (many?) MPI
>>> implementations may actually allow this on cache-coherent systems (I
>>> think MPICH2 on shared memory will allow it).
>>>
>>> I would be surprised if this error is causing your seg fault (more
>>> likely it would just result in corrupted data within the bounds of
>>> your buffer).  I would tend to suspect that something is off in your
>>> datatype, possibly the target datatype, since the segfault occurs in
>>> wait(), which is when data might be getting unpacked at the target.
>>> Can you run your code through a debugger or valgrind to give us more
>>> information on how/when the seg fault occurs?
>>>
>>> Cheers,
>>> ~Jim.
>>>
>>> On 12/16/2010 12:33 PM, Grismer, Matthew J Civ USAF AFMC AFRL/RBAT
>>> wrote:
>>>> I am trying to modify the communication routines in our code to use
>>>> MPI_Put's instead of sends and receives.  This worked fine for
>>>> several variable Put's, but now I have one that is causing seg
>>>> faults.  Reading through the MPI documentation it is not clear to me
>>>> if what I am doing is permissible or not.  Basically, the question
>>>> is this - if I have defined all of an array as a window on each
>>>> processor, can I PUT data from that array to remote processes at the
>>>> same time as the remote processes are PUTing into the local copy,
>>>> assuming no overlaps of any of the PUTs?
>>>>
>>>> Here are the details if that doesn't make sense.  I have a (Fortran)
>>>> array QF(6,2,N) on each processor, where N could be a very large
>>>> number (100,000).  I create a window QFWIN on the entire array on
>>>> all the processors.  I define MPI_Type_indexed "sending" datatypes
>>>> (QFSND) with block lengths of 6 that send from QF(1,1,*), and
>>>> MPI_Type_indexed "receiving" datatypes (QFREC) with block lengths of
>>>> 6 that receive into QF(1,2,*).  Here * is a non-repeating set of
>>>> integers up to N.  I create groups of processors that communicate,
>>>> where these groups will all exchange QF data, PUTing local QF(1,1,*)
>>>> to remote QF(1,2,*).  So, processor 1 is PUTing QF data to
>>>> processors 2,3,4 at the same time 2,3,4 are PUTing their QF data to
>>>> 1, and so on.  Processors 2,3,4 are PUTing into non-overlapping
>>>> regions of QF(1,2,*) on 1, and 1 is PUTing from QF(1,1,*) to 2,3,4,
>>>> and so on.  So, my calls look like this on each processor:
>>>>
>>>> assertion = 0
>>>> call MPI_Win_post(group, assertion, QFWIN, ierr)
>>>> call MPI_Win_start(group, assertion, QFWIN, ierr)
>>>>
>>>> do I = 1, neighbors
>>>>   call MPI_Put(QF, 1, QFSND(I), NEIGHBOR(I), 0, 1, QFREC(I), QFWIN, ierr)
>>>> end do
>>>>
>>>> call MPI_Win_complete(QFWIN,ierr)
>>>> call MPI_Win_wait(QFWIN,ierr)
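>>>>
>>>> For reference, the indexed datatypes are built roughly like this
>>>> (nsend and send_idx here are placeholder names rather than the
>>>> actual variables, and QF is default real):
>>>>
>>>>     integer :: blocklens(nsend), displs(nsend)
>>>>
>>>>     blocklens = 6                        ! each block is QF(1:6,1,k)
>>>>     do j = 1, nsend
>>>>       displs(j) = 12*(send_idx(j) - 1)   ! offset of QF(1,1,send_idx(j)), in reals
>>>>     end do
>>>>     call MPI_Type_indexed(nsend, blocklens, displs, MPI_REAL, QFSND(I), ierr)
>>>>     call MPI_Type_commit(QFSND(I), ierr)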
>>>>
>>>> Note that I did define QFREC locally on each processor to properly
>>>> represent where the data is going on the remote processors.  The
>>>> error value ierr=0 after MPI_Win_post, MPI_Win_start, MPI_Put, and
>>>> MPI_Win_complete, and the code seg faults in MPI_Win_wait.
>>>>
>>>> I'm using MPICH2 1.3.1 on Mac OS X 10.6.5, built with Intel XE
>>>> (12.0) compilers, and running on just 2 (internal) processors of my
>>>> Mac Pro.  The code ran normally with this configuration up until the
>>>> point I put the above in.  Several other communications with MPI_Put
>>>> similar to the above work fine, though the windows are only on a
>>>> subset of the communicated array, and the origin data is being PUT
>>>> from part of the array that is not within the window.
>>>>
>>>> _____________________________________________________
>>>> Matt
>>>>
>> <putoneway.f90><putbothways.f90>

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji

