[mpich-discuss] fence vs. lock-unlock

Tue Jul 10 08:06:35 CDT 2012

What I did was add "-env MPICH_ASYNC_PROGRESS 1" to the mpiexec command line, but I don't see any improvement. Did I miss anything?

Jie

------------------------------

Date: Mon, 9 Jul 2012 01:20:39 -0500
From: Rajeev Thakur <thakur at mcs.anl.gov>
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] fence vs. lock-unlock
Message-ID: <68DD6AB6-ECD4-4FEE-ADE8-68F08FDEBFCD at mcs.anl.gov>

That's because, by default, passive-target RMA may require the target process to call an MPI function before it can make progress. You can set the environment variable MPICH_ASYNC_PROGRESS to enable asynchronous progress.

Rajeev

On Jul 9, 2012, at 1:10 AM, Jie Chen wrote:

> I am using one-sided communication with lock/unlock and see something that I do not understand.
>
> In the normal case (using fence), my code block looks something like this (rank is my process ID, nproc is the total number of processes):
>
> for i = 0 to nproc-1
> -- fence
> -- MPI_Get something from process (rank+i)%nproc
> -- fence
> -- computation
> end
>
> The timeline for the above operations looks like the following, which is completely normal ('-' denotes computation; '*' denotes communication, including the fences and MPI_Get):
>
> proc 0: *--------*--------*--------*--------
> proc 1: *--------*--------*--------*--------
> proc 2: *--------*--------*--------*--------
> proc 3: *--------*--------*--------*--------
>
> For simplicity, the illustration above assumes perfect computational load balance.
>
> However, when I replace the two fences with lock and unlock, the timeline looks like the following:
>
> proc 0: *--------**********--------*--------**********--------
> proc 1: *--------*--------**********--------*--------
> proc 2: *--------*******************--------*--------*--------
> proc 3: *--------*--------**********--------**********--------
>
> The problem is that the communication sometimes takes a very long time to finish. Specifically, the time is spent in the MPI_Win_unlock call, which does not return until the target process has finished one round of computation.
>
> I do not understand why the unlock (or perhaps the actual data transfer) is so time-consuming. The figures here show a balanced computational workload. When the workload is not balanced, I expected lock/unlock to be better than fence because it avoids barriers. But my experiments suggest that having barriers is better than having none. Is this caused by the implementation of MPI_Win_unlock or by the hardware?
>
>
>
> --
> Jie Chen
> Mathematics and Computer Science Division
> Argonne National Laboratory
> Address: 9700 S Cass Ave, Bldg 240, Lemont, IL 60439
> Phone: (630) 252-3313
> Email: jiechen at mcs.anl.gov
> Homepage: http://www.mcs.anl.gov/~jiechen