[petsc-dev] PetscSFFetchAndOpBegin_Window

Jeff Hammond jeff.science at gmail.com
Wed Mar 11 13:54:15 CDT 2015


Permission to ask potentially dumb questions?

In this code (src/vec/is/sf/impls/window/sfwindow.c in git master as
of 7fb8eca36577947f0d4ded035522334534bc7e38):

static PetscErrorCode PetscSFFetchAndOpBegin_Window(PetscSF sf,MPI_Datatype unit,void *rootdata,const void *leafdata,void *leafupdate,MPI_Op op)
{
  PetscErrorCode     ierr;
  PetscInt           i,nranks;
  const PetscMPIInt  *ranks;
  const MPI_Datatype *mine,*remote;
  MPI_Win            win;

  PetscFunctionBegin;
  ierr = PetscSFGetRanks(sf,&nranks,&ranks,NULL,NULL,NULL);CHKERRQ(ierr);
  ierr = PetscSFWindowGetDataTypes(sf,unit,&mine,&remote);CHKERRQ(ierr);
  ierr = PetscSFWindowOpTranslate(&op);CHKERRQ(ierr);
  ierr = PetscSFGetWindow(sf,unit,rootdata,PETSC_FALSE,0,0,0,&win);CHKERRQ(ierr);
  for (i=0; i<sf->nranks; i++) {
    ierr = MPI_Win_lock(MPI_LOCK_EXCLUSIVE,sf->ranks[i],0,win);CHKERRQ(ierr);
    ierr = MPI_Get(leafupdate,1,mine[i],ranks[i],0,1,remote[i],win);CHKERRQ(ierr);
    ierr = MPI_Accumulate((void*)leafdata,1,mine[i],ranks[i],0,1,remote[i],op,win);CHKERRQ(ierr);
    ierr = MPI_Win_unlock(ranks[i],win);CHKERRQ(ierr);
  }
  PetscFunctionReturn(0);
}

I see the following issues:

1) Implementing atomic read-modify-write using
Lock->Get->Accumulate->Unlock is W.R.O.N.G.  Get and Accumulate are
NOT guaranteed to be ordered within any epoch.  Ever.  You appear to
be leveraging the ordered-channel semantics of MPICH's ch3.  You need
to either use MPI-3 RMA's MPI_Get_accumulate (or MPI_Fetch_and_op) or
use a mutex a la MPICH's ROMIO and ARMCI-MPI's MPI-2 port (a sketch
using MPI_Get_accumulate follows the example below).  See
http://dx.doi.org/10.1109/CCGRID.2005.1558687 for the mutex algorithm
details.

To be explicit, your code is equivalent to:

  for (i=0; i<sf->nranks; i++) {
    ierr = MPI_Win_lock(MPI_LOCK_EXCLUSIVE,sf->ranks[i],0,win);CHKERRQ(ierr);
    int order = rand()%2;
    if (order) {
      ierr = MPI_Get(leafupdate,1,mine[i],ranks[i],0,1,remote[i],win);CHKERRQ(ierr);
      ierr = MPI_Accumulate((void*)leafdata,1,mine[i],ranks[i],0,1,remote[i],op,win);CHKERRQ(ierr);
    } else {
      ierr = MPI_Accumulate((void*)leafdata,1,mine[i],ranks[i],0,1,remote[i],op,win);CHKERRQ(ierr);
      ierr = MPI_Get(leafupdate,1,mine[i],ranks[i],0,1,remote[i],win);CHKERRQ(ierr);
    }
    ierr = MPI_Win_unlock(ranks[i],win);CHKERRQ(ierr);
  }
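
For reference, here is a rough, untested sketch of what I have in mind
with MPI_Get_accumulate, reusing the mine[]/remote[] datatypes, the
ranks[] array, and win from the routine above.  It fetches the old root
value into leafupdate and applies op with leafdata in a single RMA call:

  for (i=0; i<nranks; i++) {
    ierr = MPI_Win_lock(MPI_LOCK_EXCLUSIVE,ranks[i],0,win);CHKERRQ(ierr);
    /* origin buffer = local contribution, result buffer = fetched old value;
       the fetch and the accumulate are atomic per basic element */
    ierr = MPI_Get_accumulate((void*)leafdata,1,mine[i],
                              leafupdate,1,mine[i],
                              ranks[i],0,1,remote[i],op,win);CHKERRQ(ierr);
    ierr = MPI_Win_unlock(ranks[i],win);CHKERRQ(ierr);
  }

Whether you still need the exclusive lock then depends on whether
per-element atomicity is enough for the SF semantics; I have not checked.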

2) 'sf->ranks[i]' is used in MPI_Win_lock, whereas the rest of the RMA
calls use 'ranks[i]' (the array returned by PetscSFGetRanks).  I assume
this is innocuous, but unless you hold a mutex on 'sf', it is
theoretically possible for sf->ranks[i] to be changed by another thread
in a way that leads to an undefined or incorrect program.  If you
prohibit that as part of the internal design contract, it should be
noted.
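
If it is not intentional, the mechanical fix is presumably just to use
the cached array consistently, e.g.:

    ierr = MPI_Win_lock(MPI_LOCK_EXCLUSIVE,ranks[i],0,win);CHKERRQ(ierr);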

Best,

Jeff

-- 
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/


