[petsc-dev] MatPinToCPU

Mon Jul 29 16:31:06 CDT 2019

  I don't understand the notation in the legend on the second page

12,288 cpus and no GPUs ?

24 GPUs?  or 6 GPUs

192 GPUs?

1536 GPUs?

12,288 GPUs?  or 12288/4 = 3072  GPUs?

So on the largest run using GPUs or not takes pretty much exactly the same 
amount of  time?

What about 6 GPUs vs 24 CPUs ? Same equal amount of time. 

Can you send some log summaries

> On Jul 29, 2019, at 4:01 PM, Mark Adams <mfadams at lbl.gov> wrote:
> 
> FYI, CUDA is running and here is some preliminary data on up to 1/8 of SUMMIT. This run with 4 cores/processes per GPU, so the GPU is virtualized into 4 GPUs.
> 
> On Sun, Jul 28, 2019 at 2:34 PM Karl Rupp <rupp at iue.tuwien.ac.at> wrote:
> Hi Mark,
> 
> feel free to submit a fresh pull request now. I looked at your latest 
> commit in the repository in order to cherry-pick it, but it looked like 
> it had a few other bits in it as well.
> 
> Best regards,
> Karli
> 
> 
> On 7/28/19 6:27 PM, Mark Adams via petsc-dev wrote:
> > This is looking good. I'm not seeing the numerical problems, but I've 
> > just hid them by avoiding the GPU on coarse grids.
> > 
> > Should I submit a pull request now or test more or wait for Karl?
> > 
> > On Sat, Jul 27, 2019 at 7:37 PM Mark Adams <mfadams at lbl.gov 
> > <mailto:mfadams at lbl.gov>> wrote:
> > 
> >     Barry, I fixed CUDA to pin to CPUs correctly for GAMG at least.
> >     There are some hacks here that we can work on.
> > 
> >     I will start testing it tomorrow, but I am pretty sure that I have
> >     not regressed. I am hoping that this will fix the numerical
> >     problems, which seem to be associated with empty processors.
> > 
> >     I did need to touch code outside of GAMG and CUDA. It might be nice
> >     to test this in a next.
> > 
> >     GAMG now puts all reduced processorg grids on the CPU. This could be
> >     looked at in the future.
> > 
> > 
> >     On Sat, Jul 27, 2019 at 1:00 PM Smith, Barry F. <bsmith at mcs.anl.gov
> >     <mailto:bsmith at mcs.anl.gov>> wrote:
> > 
> > 
> > 
> >          > On Jul 27, 2019, at 11:53 AM, Mark Adams <mfadams at lbl.gov
> >         <mailto:mfadams at lbl.gov>> wrote:
> >          >
> >          >
> >          > On Sat, Jul 27, 2019 at 11:39 AM Smith, Barry F.
> >         <bsmith at mcs.anl.gov <mailto:bsmith at mcs.anl.gov>> wrote:
> >          >
> >          >   Good catch. Thanks. Maybe the SeqCUDA has the same problem?
> >          >
> >          > THis is done  (I may have done it).
> >          >
> >          > Now it seems to me that when you call VecPinToCPU you are
> >         setting up and don't have data, so this copy does not seem
> >         necessary. Maybe remove the copy here:
> >          >
> >          > PetscErrorCode VecPinToCPU_MPICUDA(Vec V,PetscBool pin)
> >          > {
> >          >   PetscErrorCode ierr;
> >          >
> >          >   PetscFunctionBegin;
> >          >   V->pinnedtocpu = pin;
> >          >   if (pin) {
> >          >     ierr = VecCUDACopyFromGPU(V);CHKERRQ(ierr); ????
> > 
> >             The copy from GPU should actually only do anything if the
> >         GPU already has data and PETSC_OFFLOAD_GPU. If the GPU does not
> >         have data
> >         the copy doesn't do anything. When one calls VecPinToCPU() one
> >         doesn't know where the data is so the call must be made, but it
> >         may do nothing
> > 
> >            Note that VecCUDACopyFromGPU() calls
> >         VecCUDAAllocateCheckHost() not VecCUDAAllocateCheck() so the GPU
> >         will not allocate space,
> >         VecCUDAAllocateCheck() is called from VecCUDACopyToGPU().
> > 
> >             Yes, perhaps the naming could be more consistent:
> > 
> >         1) in one place it is Host in an other place it is nothing
> >         2) some places it is Host, Device, some places GPU,CPU
> > 
> >             Perhaps Karl can make these all consistent and simpler in
> >         his refactorization
> > 
> > 
> >            Barry
> > 
> > 
> >          >
> >          > or
> >          >
> >          > Not allocate the GPU if it is pinned by added in a check here:
> >          >
> >          > PetscErrorCode VecCUDAAllocateCheck(Vec v)
> >          > {
> >          >   PetscErrorCode ierr;
> >          >   cudaError_t    err;
> >          >   cudaStream_t   stream;
> >          >   Vec_CUDA       *veccuda;
> >          >
> >          >   PetscFunctionBegin;
> >          >   if (!v->spptr) {
> >          >     ierr = PetscMalloc(sizeof(Vec_CUDA),&v->spptr);CHKERRQ(ierr);
> >          >     veccuda = (Vec_CUDA*)v->spptr;
> >          > if (v->valid_GPU_array != PETSC_OFFLOAD_CPU) {
> >          >     err =
> >         cudaMalloc((void**)&veccuda->GPUarray_allocated,sizeof(PetscScalar)*((PetscBLASInt)v->map->n));CHKERRCUDA(err);
> >          >     veccuda->GPUarray = veccuda->GPUarray_allocated;
> >          >     err = cudaStreamCreate(&stream);CHKERRCUDA(err);
> >          >     veccuda->stream = stream;
> >          >     veccuda->hostDataRegisteredAsPageLocked = PETSC_FALSE;
> >          >     if (v->valid_GPU_array == PETSC_OFFLOAD_UNALLOCATED) {
> >          >       if (v->data && ((Vec_Seq*)v->data)->array) {
> >          >         v->valid_GPU_array = PETSC_OFFLOAD_CPU;
> >          >       } else {
> >          >         v->valid_GPU_array = PETSC_OFFLOAD_GPU;
> >          >       }
> >          >     }
> >          > }
> >          >   }
> >          >   PetscFunctionReturn(0);
> >          > }
> >          >
> >          >
> >          >
> >          >
> >          >
> >          > > On Jul 27, 2019, at 10:40 AM, Mark Adams <mfadams at lbl.gov
> >         <mailto:mfadams at lbl.gov>> wrote:
> >          > >
> >          > > Yea, I just figured out the problem. VecDuplicate_MPICUDA
> >         did not call PinToCPU or even copy pinnedtocpu. It just copied
> >         ops, so I added and am testing:
> >          > >
> >          > >   ierr =
> >         VecCreate_MPICUDA_Private(*v,PETSC_TRUE,w->nghost,0);CHKERRQ(ierr);
> >          > >   vw   = (Vec_MPI*)(*v)->data;
> >          > >   ierr = PetscMemcpy((*v)->ops,win->ops,sizeof(struct
> >         _VecOps));CHKERRQ(ierr);
> >          > >   ierr = VecPinToCPU(*v,win->pinnedtocpu);CHKERRQ(ierr);
> >          > >
> >          > > Thanks,
> >          > >
> >          > > On Sat, Jul 27, 2019 at 11:33 AM Smith, Barry F.
> >         <bsmith at mcs.anl.gov <mailto:bsmith at mcs.anl.gov>> wrote:
> >          > >
> >          > >   I don't understand the context. Once a vector is pinned
> >         to the CPU the flag should be PETSC_OFFLOAD_CPU permanently
> >         until the pin to cpu is turned off.  Do you have a pinned vector
> >         that has the value PETSC_OFFLOAD_GPU?  For example here it is
> >         set to PETSC_OFFLOAD_CPU
> >          > >
> >          > > PetscErrorCode VecPinToCPU_MPICUDA(Vec V,PetscBool pin)
> >          > > {
> >          > > ....
> >          > >   if (pin) {
> >          > >     ierr = VecCUDACopyFromGPU(V);CHKERRQ(ierr);
> >          > >     V->valid_GPU_array = PETSC_OFFLOAD_CPU; /* since the
> >         CPU code will likely change values in the vector */
> >          > >
> >          > >
> >          > >   Is there any way to reproduce the problem?
> >          > >
> >          > >   Barry
> >          > >
> >          > >
> >          > >
> >          > >
> >          > > > On Jul 27, 2019, at 10:28 AM, Mark Adams <mfadams at lbl.gov
> >         <mailto:mfadams at lbl.gov>> wrote:
> >          > > >
> >          > > > I'm not sure what to do here. The problem is that
> >         pinned-to-cpu vectors are calling VecCUDACopyFromGPU here.
> >          > > >
> >          > > > Should I set x->valid_GPU_array to something else, like
> >         PETSC_OFFLOAD_CPU, in PinToCPU so this block of code i s not
> >         executed?
> >          > > >
> >          > > > PetscErrorCode VecGetArray(Vec x,PetscScalar **a)
> >          > > > {
> >          > > >   PetscErrorCode ierr;
> >          > > > #if defined(PETSC_HAVE_VIENNACL)
> >          > > >   PetscBool      is_viennacltype = PETSC_FALSE;
> >          > > > #endif
> >          > > >
> >          > > >   PetscFunctionBegin;
> >          > > >   PetscValidHeaderSpecific(x,VEC_CLASSID,1);
> >          > > >   ierr = VecSetErrorIfLocked(x,1);CHKERRQ(ierr);
> >          > > >   if (x->petscnative) {
> >          > > > #if defined(PETSC_HAVE_VIENNACL) || defined(PETSC_HAVE_CUDA)
> >          > > >     if (x->valid_GPU_array == PETSC_OFFLOAD_GPU) {
> >          > > > #if defined(PETSC_HAVE_VIENNACL)
> >          > > >       ierr =
> >         PetscObjectTypeCompareAny((PetscObject)x,&is_viennacltype,VECSEQVIENNACL,VECMPIVIENNACL,VECVIENNACL,"");CHKERRQ(ierr);
> >          > > >       if (is_viennacltype) {
> >          > > >         ierr = VecViennaCLCopyFromGPU(x);CHKERRQ(ierr);
> >          > > >       } else
> >          > > > #endif
> >          > > >       {
> >          > > > #if defined(PETSC_HAVE_CUDA)
> >          > > >         ierr = VecCUDACopyFromGPU(x);CHKERRQ(ierr);
> >          > > > #endif
> >          > > >      }
> >          > > >     } else if (x->valid_GPU_array ==
> >         PETSC_OFFLOAD_UNALLOCATED) {
> >          > > > #if defined(PETSC_HAVE_VIENNACL)
> >          > > >       ierr =
> >         PetscObjectTypeCompareAny((PetscObject)x,&is_viennacltype,VECSEQVIENNACL,VECMPIVIENNACL,VECVIENNACL,"");CHKERRQ(ierr);
> >          > > >       if (is_viennacltype) {
> >          > > >         ierr = VecViennaCLAllocateCheckHost(x);CHKERRQ(ierr);
> >          > > >       } else
> >          > > > #endif
> >          > > >       {
> >          > > > #if defined(PETSC_HAVE_CUDA)
> >          > > >         ierr = VecCUDAAllocateCheckHost(x);CHKERRQ(ierr);
> >          > > > #endif
> >          > > >       }
> >          > > >     }
> >          > > > #endif
> >          > > >     *a = *((PetscScalar**)x->data);
> >          > > >   } else {
> >          > > >
> >          > > >
> >          > > > On Tue, Jul 23, 2019 at 9:18 PM Smith, Barry F.
> >         <bsmith at mcs.anl.gov <mailto:bsmith at mcs.anl.gov>> wrote:
> >          > > >
> >          > > >  Yes, it needs to be able to switch back and forth
> >         between the CPU and GPU methods so you need to move into it the
> >         setting of the methods that is currently directly in the create
> >         method. See how  MatConvert_SeqAIJ_SeqAIJViennaCL() calls ierr =
> >         MatPinToCPU_SeqAIJViennaCL(A,PETSC_FALSE);CHKERRQ(ierr); to set
> >         the methods for the GPU initially.
> >          > > >
> >          > > >   Barry
> >          > > >
> >          > > >
> >          > > > > On Jul 23, 2019, at 7:32 PM, Mark Adams
> >         <mfadams at lbl.gov <mailto:mfadams at lbl.gov>> wrote:
> >          > > > >
> >          > > > >
> >          > > > >   What are the symptoms of it not working? Does it
> >         appear to be still copying the matrices to the GPU? then running
> >         the functions on the GPU?
> >          > > > >
> >          > > > >
> >          > > > > The object is dispatching the CUDA mat-vec etc.
> >          > > > >
> >          > > > >   I suspect the pinning is incompletely done for CUDA
> >         (and MPIOpenCL) matrices.
> >          > > > >
> >          > > > >
> >          > > > > Yes, git grep MatPinToCPU shows stuff for ViennaCL but
> >         not CUDA.
> >          > > > >
> >          > > > > I guess I can add something like this below. Do we need
> >         to set the device methods? They are already set when this method
> >         is set, right?
> >          > > > >
> >          > > > > We need the equivalent of
> >          > > > >
> >          > > > > static PetscErrorCode MatPinToCPU_SeqAIJViennaCL(Mat
> >         A,PetscBool flg)
> >          > > > > {
> >          > > > >   PetscFunctionBegin;
> >          > > > >   A->pinnedtocpu = flg;
> >          > > > >   if (flg) {
> >          > > > >     A->ops->mult           = MatMult_SeqAIJ;
> >          > > > >     A->ops->multadd        = MatMultAdd_SeqAIJ;
> >          > > > >     A->ops->assemblyend    = MatAssemblyEnd_SeqAIJ;
> >          > > > >     A->ops->duplicate      = MatDuplicate_SeqAIJ;
> >          > > > >   } else {
> >          > > > >     A->ops->mult           = MatMult_SeqAIJViennaCL;
> >          > > > >     A->ops->multadd        = MatMultAdd_SeqAIJViennaCL;
> >          > > > >     A->ops->assemblyend    = MatAssemblyEnd_SeqAIJViennaCL;
> >          > > > >     A->ops->destroy        = MatDestroy_SeqAIJViennaCL;
> >          > > > >     A->ops->duplicate      = MatDuplicate_SeqAIJViennaCL;
> >          > > > >   }
> >          > > > >   PetscFunctionReturn(0);
> >          > > > > }
> >          > > > >
> >          > > > > for MPIViennaCL and MPISeqAIJ Cusparse but it doesn't
> >         look like it has been written yet.
> >          > > > >
> >          > > > >
> >          > > > > >
> >          > > > > > It does not seem to work. It does not look like CUDA
> >         has an MatCreateVecs. Should I add one and copy this flag over?
> >          > > > >
> >          > > > >    We do need this function. But I don't see how it
> >         relates to pinning. When the matrix is pinned to the CPU we want
> >         it to create CPU vectors which I assume it does.
> >          > > > >
> >          > > > >
> >          > > > > >
> >          > > > > > Mark
> >          > > > >
> >          > > >
> >          > >
> >          >
> > 
> <strong_su.png><dynamic_range_log.png>