[mpich-discuss] fence vs. lock-unlock

Mon Jul 9 08:17:12 CDT 2012

If your communication pattern is known in advance and this regular,
why are you using one-sided?  It seems you could use send-recv here
quite easily.
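
As a rough sketch of why the regularity helps (the function names
recv_source and send_dest are made up for illustration, not taken from
your code): at step i every rank reads from (rank+i)%nproc, so every
rank can also work out which rank wants its data at that step. That is
all a send-recv version needs to know.

```c
#include <assert.h>

/* Rank that `rank` reads from at step i, as in the quoted loop:
 * MPI_Get something from process (rank+i)%nproc. */
int recv_source(int rank, int i, int nproc)
{
    assert(nproc > 0 && rank >= 0 && rank < nproc && i >= 0);
    return (rank + i) % nproc;
}

/* Rank that reads from `rank` at step i, i.e. the rank r for which
 * (r + i) % nproc == rank. A send-recv version would send to it. */
int send_dest(int rank, int i, int nproc)
{
    assert(nproc > 0 && rank >= 0 && rank < nproc && i >= 0);
    /* Reduce i first so the intermediate value stays positive. */
    return (rank - i % nproc + nproc) % nproc;
}
```

Inside the loop, the fence/MPI_Get/fence triple could then become a
single MPI_Sendrecv with dest = send_dest(rank, i, nproc) and
source = recv_source(rank, i, nproc); the buffers, counts and tags
depend on your data layout, which I am only guessing at.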

Background (asynchronous) progress is usually available on
supercomputers, so if that's your long-term target, then MPI_Get is a
reasonable design choice.

Jeff

On Mon, Jul 9, 2012 at 1:20 AM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
> That's because, by default, progress on passive-target RMA may require the target to call an MPI function. You can set the environment variable MPICH_ASYNC_PROGRESS to enable asynchronous progress.
>
> Rajeev
>
> On Jul 9, 2012, at 1:10 AM, Jie Chen wrote:
>
>> I am using one-sided communication with lock-unlock and see something that I do not understand.
>>
>> In the normal case (using fence), my code block looks something like this (rank is my process id, nproc is the total number of processes):
>>
>> for i = 0 to nproc-1
>> -- fence
>> -- MPI_Get something from process (rank+i)%nproc
>> -- fence
>> -- computation
>> end
>>
>> The timeline for the above operations looks like the following, which is completely normal (- means computation; * means communication, including fence and MPI_Get):
>>
>> proc 0: *--------*--------*--------*--------
>> proc 1: *--------*--------*--------*--------
>> proc 2: *--------*--------*--------*--------
>> proc 3: *--------*--------*--------*--------
>>
>> The above illustration shows perfect computation load balance, for simplicity.
>>
>> However, when I replace the two fences with lock and unlock, the timeline looks like the following:
>>
>> proc 0: *--------**********--------*--------**********--------
>> proc 1: *--------*--------**********--------*--------
>> proc 2: *--------*******************--------*--------*--------
>> proc 3: *--------*--------**********--------**********--------
>>
>> The problem here is that sometimes the communication takes a very long time to finish. In particular, this is due to the MPI_Win_unlock call, which does not return until the target process has finished one round of computation.
>>
>> I do not understand why the unlock (or perhaps the actual data transfer) is so time-consuming. The figures here show a balanced computational workload. When the workload is not balanced, I thought the lock/unlock mechanism would be better than fence because it avoids barriers. But according to my experiments, it appears that having barriers is better than having none. Is this caused by the implementation of MPI_Win_unlock or by the hardware?
>>
>>
>>
>> --
>> Jie Chen
>> Mathematics and Computer Science Division
>> Argonne National Laboratory
>> Address: 9700 S Cass Ave, Bldg 240, Lemont, IL 60439
>> Phone: (630) 252-3313
>> Email: jiechen at mcs.anl.gov
>> Homepage: http://www.mcs.anl.gov/~jiechen
>> _______________________________________________
>> mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
>> To manage subscription options or unsubscribe:
>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss

-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond