On Sat, Oct 6, 2012 at 9:15 PM, Karl Rupp <span dir="ltr"><<a href="mailto:rupp@mcs.anl.gov" target="_blank">rupp@mcs.anl.gov</a>></span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hi again,<div class="im"><br>

<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

    Hmm, I thought that spptr is some 'special pointer' as commented in<br>

    Mat, but not supposed to be a generic pointer to a derived class'<br>

    datastructure (spptr is only injected with #define PETSC_HAVE_CUSP).<br>

<br>

<br>

Look at cuspvecimpl.h, for example.<br>

<br>

#undef __FUNCT__<br>

#define __FUNCT__ "VecCUSPGetArrayReadWrite"<br>

PETSC_STATIC_INLINE PetscErrorCode VecCUSPGetArrayReadWrite(Vec v,<br>

CUSPARRAY** a)<br>

{<br>

   PetscErrorCode ierr;<br>

<br>

   PetscFunctionBegin;<br>

   *a   = 0;<br>

   ierr = VecCUSPCopyToGPU(v);CHKERRQ(ierr);<br>

   *a   = ((Vec_CUSP *)v->spptr)->GPUarray;<br>

   PetscFunctionReturn(0);<br>

}<br>

<br>

Vec is following the convention from Mat where spptr points to the<br>

Mat_UMFPACK, Mat_SuperLU, Mat_SeqBSTRM, etc, which hold the extra<br>

information needed for that derived class (note that "derivation" for<br>

Mat is typically done at run time, after the object has been created,<br>

due to a call to MatGetFactor).<br>

</blockquote>

<br></div>

Yes, I looked at this code before writing my opening email. I doubt that it is a good idea to *enforce* that handles are hidden somewhere behind spptr, because the memory is then bound to a particular library. Compare this with the CPU memory allocated by PETSc in *data: one is in principle free to pass the raw memory pointer to external libraries and let them operate on it. There can be library-specific data hidden behind *spptr if required, but I'd prefer to have the raw memory for the representation of the vector itself available at *data.<div class="im">

<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

        With different devices, we could simply have a valid flag for each<br>

        device. When someone does VecDevice1GetArrayWrite(), the flag<br>

        for all<br>

        other devices is marked invalid. When VecDevice2GetArrayRead() is<br>

        called, the implementation copies from any valid device to device2.<br>

        Packing all those flags as bits in a single int is perhaps<br>

        convenient,<br>

        but not necessary.<br>

<br>

<br>

    I think that the most common way of handling GPUs will be an<br>

    overlapping decomposition of the host array, similar to how a vector<br>

    is distributed via MPI (locally owned, writeable, vs ghost values<br>

    with read-only). Assigning the full vector exclusively to just one<br>

    device is more a single-GPU scenario rather than a multi-GPU use case.<br>

<br>

<br>

Okay, the matrix will have to partition itself. What is the advantage of<br>

having a single CPU process addressing multiple GPUs? Why not use<br>

different MPI processes? (We can have the MPI processes sharing a node<br>

create a subcomm so they can decide which process is driving which device.)<br>

</blockquote>
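<div>The per-device valid-flag scheme quoted above can be sketched in plain C. This is only an illustration under stated assumptions: MultiDeviceVec, vec_get_array_write and vec_get_array_read are hypothetical names rather than PETSc API, host arrays stand in for device buffers, and the flags are packed as bits in one unsigned int, which the quoted text calls convenient but not necessary.</div>
<pre>
```c
/* Hypothetical sketch: none of these names are PETSc API. */
enum { NUM_DEVICES = 3, VEC_LEN = 4 };

typedef struct {
  double   buf[NUM_DEVICES][VEC_LEN]; /* host stand-in for one buffer per device */
  unsigned valid;                     /* bit d set means buf[d] holds current data */
} MultiDeviceVec;

/* Write access on device d: every other copy becomes stale. */
double *vec_get_array_write(MultiDeviceVec *v, int d)
{
  v->valid = 1u << d;
  return v->buf[d];
}

/* Read access on device d: if buf[d] is stale, copy from any valid device.
   Returns a null pointer when no device holds valid data yet. */
const double *vec_get_array_read(MultiDeviceVec *v, int d)
{
  int src, i;

  if (!(v->valid & (1u << d))) {
    for (src = 0; src < NUM_DEVICES; src++) {
      if (v->valid & (1u << src)) break;
    }
    if (src == NUM_DEVICES) return 0;
    for (i = 0; i < VEC_LEN; i++) v->buf[d][i] = v->buf[src][i];
    v->valid |= 1u << d; /* source and destination are now both current */
  }
  return v->buf[d];
}
```
</pre>
<div>In this sketch a read never invalidates other copies, so repeated reads on several devices trigger at most one copy per device until the next write.</div>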

<br></div>

Making MPI a prerequisite for multi-GPU usage would be an unnecessary restriction, wouldn't it?</blockquote><div><br></div><div>Small point: I don't believe this; in fact, I believe the opposite. There are many equivalent ways of doing these</div>

<div>things, and we should use the simplest and most structured that can accomplish our goal. We have</div><div>already bought into MPI and should never fall into the trap of trying to support another paradigm at</div><div>

the expense of simplicity.</div><div><br></div><div>   Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

How would the user decide which device they wanted computation to run<br>

on? (Also, is OpenCL really the right name in an environment where there<br>

may be multiple devices using OpenCL?)<br>

</blockquote>

<br></div>

That probably leaves the 'memory' scope - I'd like to avoid hard-wiring accelerator buffers to devices a priori. OpenCL, for example, defines memory buffers per context, not per device as is the case in CUDA. Vec_OpenCL thus refers only to the runtime rather than to particular devices, so the name should be okay...<div class="im">

<br>

<br>

<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Currently, the type indicates<br>

where native operations should "prefer" to compute, copying data there<br>

when necessary. The Vec operations have different implementations for<br>

CUDA and OpenCL so I don't see the problem with making them different<br>

derived classes. If we wanted a hybrid CUDA/OpenCL class, it would<br>

contain the logic for deciding where to do things followed by dispatch<br>

into the device-specific implementation, thus it doesn't seem like<br>

duplication to me.<br>

</blockquote>

<br></div>

Vec_CUDA vs. Vec_OpenCL is probably indeed okay. What I'd like to avoid is to end up with Vec_CUSP, Vec_CUDALib2, Vec_CUDALib3, etc., which are all operating on CUDA-handles, yet one cannot start kernels for e.g. the addition of a Vec_CUSP and a Vec_CUDALib2 without creating a temporary.<br>
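<div>A minimal sketch of this layout, assuming hypothetical names throughout: RawVec, LibAExtras, LibBExtras and raw_vec_add are illustrative only and are not PETSc, CUSP or CUDA types, and host memory stands in for a device buffer. Because the kernel touches only the raw buffer at data, two vectors carrying different library-specific extras behind spptr can be added without a temporary.</div>
<pre>
```c
/* Hypothetical sketch: illustrative names, not PETSc, CUSP or CUDA API. */
typedef struct { int lib_a_flags;  } LibAExtras; /* extras for a first library */
typedef struct { int lib_b_stream; } LibBExtras; /* extras for a second library */

typedef struct {
  double *data;  /* raw buffer holding the vector entries, usable by any library */
  int     n;     /* number of entries */
  void   *spptr; /* optional library-specific data, ignored by raw kernels */
} RawVec;

/* y = y + x, written against the raw buffers only.
   Returns 0 on success and 1 on a size mismatch. */
int raw_vec_add(RawVec *y, const RawVec *x)
{
  int i;

  if (y->n != x->n) return 1;
  for (i = 0; i < y->n; i++) y->data[i] += x->data[i];
  return 0;
}
```
</pre>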


<br>

Best regards,<br>

Karli<br>

<br>

</blockquote></div><br><br clear="all"><div><br></div>-- <br>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>

-- Norbert Wiener<br>