[petsc-users] Questions about setting values for GPU based matrices
Anders Logg
logg at simula.no
Fri Dec 2 05:19:24 CST 2011
On Fri, Dec 02, 2011 at 11:54:20AM +0100, Fredrik Heffer Valdmanis wrote:
> 2011/12/2 Fredrik Heffer Valdmanis <fredva at ifi.uio.no>
> 2011/12/1 Matthew Knepley <knepley at gmail.com>
> On Thu, Dec 1, 2011 at 5:39 AM, Fredrik Heffer Valdmanis <fredva at ifi.uio.no> wrote:
> 2011/11/29 Matthew Knepley <knepley at gmail.com>
> On Tue, Nov 29, 2011 at 10:37 AM, Fredrik Heffer Valdmanis <fredva at ifi.uio.no> wrote:
> 2011/11/29 Matthew Knepley <knepley at gmail.com>
> On Tue, Nov 29, 2011 at 2:38 AM, Fredrik Heffer Valdmanis <fredva at ifi.uio.no> wrote:
> 2011/10/28 Matthew Knepley <knepley at gmail.com>
> On Fri, Oct 28, 2011 at 10:24 AM, Fredrik Heffer Valdmanis <fredva at ifi.uio.no> wrote:
> Hi,
>
> I am working on integrating the new GPU based vectors and matrices into
> FEniCS. Now, I'm looking at the possibility for getting some speedup during
> finite element assembly, specifically when inserting the local element
> matrix into the global element matrix. In that regard, I have a few
> questions I hope you can help me out with:
>
> - When calling MatSetValues with a MATSEQAIJCUSP matrix as parameter, what
> exactly is it that happens? As far as I can see, MatSetValues is not
> implemented for GPU based matrices, neither is mat->ops->setvalues set to
> point at any function for this Mat type.
>
> Yes, MatSetValues always operates on the CPU side. It would not make sense
> to do individual operations on the GPU.
>
> I have written batched assembly for element matrices that are all the same
> size:
>
> http://www.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/docs/manualpages/Mat/MatSetValuesBatch.html
>
> - Is it such that matrices are assembled in their entirety on the CPU, and
> then copied over to the GPU (after calling MatAssemblyBegin)? Or are values
> copied over to the GPU each time you call MatSetValues?
>
> That function assembles the matrix on the GPU and then copies to the CPU.
> The only time you do not want this copy is when you are running in serial
> and never touch the matrix afterwards, so I left it in.
>
> - Can we expect to see any speedup from using MatSetValuesBatch over
> MatSetValues, or is the batch version simply a utility function? This
> question goes for both CPU- and GPU-based matrices.
>
> CPU: no
>
> GPU: yes, I see about the memory bandwidth ratio
>
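For concreteness, here is a minimal sketch (not taken from the thread) of the two insertion paths contrasted above, assuming Ne same-size element matrices of dimension Nd, a connectivity array cells of Ne*Nd global dof indices, and a packed array Ke of Ne*Nd*Nd values stored row-major per element; all of these names are made up for illustration:

  #include <petscmat.h>

  /* Loop path: one MatSetValues call per element; insertion always happens
     on the CPU side. */
  static PetscErrorCode assemble_loop(Mat A, PetscInt Ne, PetscInt Nd,
                                      PetscInt *cells, PetscScalar *Ke)
  {
    PetscErrorCode ierr;
    PetscInt       e;

    for (e = 0; e < Ne; ++e) {
      ierr = MatSetValues(A, Nd, &cells[e*Nd], Nd, &cells[e*Nd],
                          &Ke[e*Nd*Nd], ADD_VALUES);CHKERRQ(ierr);
    }
    ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    return 0;
  }

  /* Batch path: all element matrices are handed over in a single call, which
     adds the values; with a MATSEQAIJCUSP matrix this is the routine that can
     assemble on the GPU and then copy the result back to the CPU. */
  static PetscErrorCode assemble_batch(Mat A, PetscInt Ne, PetscInt Nd,
                                       PetscInt *cells, PetscScalar *Ke)
  {
    PetscErrorCode ierr;

    ierr = MatSetValuesBatch(A, Ne, Nd, cells, Ke);CHKERRQ(ierr);
    ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    return 0;
  }
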
> Hi,
>
> I have now integrated MatSetValuesBatch in our existing PETSc wrapper
> layer. I have tested matrix assembly with Poisson's equation on different
> meshes with elements of varying order. I have timed the single call to
> MatSetValuesBatch and compared that to the total time consumed by the
> repeated calls to MatSetValues in the old implementation. I have the
> following results:
>
> Poisson on 1000x1000 unit square, 1st order Lagrange elements:
> MatSetValuesBatch: 0.88576 s
> repeated calls to MatSetValues: 0.76654 s
>
> Poisson on 500x500 unit square, 2nd order Lagrange elements:
> MatSetValuesBatch: 0.9324 s
> repeated calls to MatSetValues: 0.81644 s
>
> Poisson on 300x300 unit square, 3rd order Lagrange elements:
> MatSetValuesBatch: 0.93988 s
> repeated calls to MatSetValues: 1.03884 s
>
> As you can see, the two methods take almost the same amount of time. What
> behavior and performance should we expect? Is there any way to optimize the
> performance of batched assembly?
>
> Almost certainly it is not dispatching to the CUDA version. The regular
> version just calls MatSetValues() in a loop. Are you using a SEQAIJCUSP
> matrix?
> Yes. The same matrices yield a speedup of 4-6x when solving the system on
> the GPU.
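
For reference, a matrix of that type would typically be created along these lines; this is only a sketch, and the size N and the preallocation numbers are placeholders, not values from the thread:

  #include <petscmat.h>

  /* Sketch: create an N x N sequential AIJ matrix stored on the GPU via
     CUSP. Preallocation here is a placeholder. */
  static PetscErrorCode create_gpu_matrix(PetscInt N, Mat *A)
  {
    PetscErrorCode ierr;

    ierr = MatCreate(PETSC_COMM_SELF, A);CHKERRQ(ierr);
    ierr = MatSetSizes(*A, N, N, N, N);CHKERRQ(ierr);
    ierr = MatSetType(*A, MATSEQAIJCUSP);CHKERRQ(ierr);
    /* alternatively, call MatSetFromOptions(*A) and select the type at
       run time with -mat_type seqaijcusp */
    ierr = MatSeqAIJSetPreallocation(*A, 9, NULL);CHKERRQ(ierr);
    return 0;
  }
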
>
> Please confirm that the correct routine is used by running with -info and
> sending the output.
>
> Please send the output of -log_summary so I can confirm the results.
>
> You can run KSP ex4 and reproduce my results, where I see a 5.5x speedup on
> the GTX285.
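
One way to make such a comparison unambiguous in -log_summary is to wrap only the insertion call in its own logging stage; a minimal sketch, reusing the hypothetical cells/Ke arrays from the earlier sketch (the stage name "MatInsertion" is made up):

  #include <petscmat.h>

  /* Sketch: put just the insertion phase in its own logging stage so that
     -log_summary reports it separately from the rest of the run. */
  static PetscErrorCode insert_values_timed(Mat A, PetscInt Ne, PetscInt Nd,
                                            PetscInt *cells, PetscScalar *Ke)
  {
    PetscLogStage  stage;
    PetscErrorCode ierr;

    ierr = PetscLogStageRegister("MatInsertion", &stage);CHKERRQ(ierr);
    ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
    ierr = MatSetValuesBatch(A, Ne, Nd, cells, Ke);CHKERRQ(ierr); /* or the MatSetValues loop */
    ierr = PetscLogStagePop();CHKERRQ(ierr);

    ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    return 0;
  }
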
>
> I am not sure what to look for in those outputs. I have uploaded the output
> of running my assembly program with -info and -log_summary, and the output
> of running ex4 with -log_summary. See
>
> http://folk.uio.no/fredva/assembly_info.txt
> http://folk.uio.no/fredva/assembly_log_summary.txt
> http://folk.uio.no/fredva/ex4_log_summary.txt
>
> Trying this on a different machine now, I actually see some speedup. 3rd
> order Poisson on 300x300 assembles in 0.211 sec on GPU and 0.4232 sec on
> CPU. For 1st order and 1000x1000 mesh, I go from 0.31 sec to 0.205 sec.
> I have tried to increase the mesh size to see if the speedup increases, but
> I hit the bad_alloc error pretty quickly.
>
> For a problem of that size, should I expect even more speedup? Please let
> me know if you need any more output from test runs on my machine.
>
> Here are my results for nxn grids where n = range(150, 1350, 100). This is
> using a GTX 285. What card are you using?
>
> I realize now that I was including the time it takes to construct the
> large flattened array of values that is sent to MatSetValuesBatch. I
> assume of course that you only time MatSetValues/MatSetValuesBatch
> completely isolated. If I do this, I get significant speedup as well.
> Sorry for the confusion here.
>
> Still, this construction has to be done somehow in order to have
> meaningful data to pass to MatSetValuesBatch. The way I do this is
> apparently almost as costly as calling MatSetValues for each local
> matrix.
>
> Have you got any ideas on how to speed up the construction of the
> values array? This has to be done very efficiently in order for batch
> assembly to yield any speedup overall.
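
The thread does not show how the packed arrays are built in the FEniCS wrapper; purely as an illustration of the layout MatSetValuesBatch expects, one could preallocate the two arrays once and let the element kernel write directly into its slice, rather than assembling each local matrix separately and copying it in. The kernel below is a dummy stand-in, not real FE code:

  #include <petscmat.h>

  /* Dummy stand-in for the real FE kernel: fills the Nd global dof indices
     and the Nd x Nd local matrix of element e directly into its slices. */
  static void compute_element_matrix(PetscInt e, PetscInt Nd,
                                     PetscInt *rows, PetscScalar *vals)
  {
    PetscInt i, j;
    for (i = 0; i < Nd; ++i) rows[i] = e*Nd + i;          /* dummy dof map */
    for (i = 0; i < Nd; ++i)
      for (j = 0; j < Nd; ++j) vals[i*Nd + j] = (i == j); /* dummy entries */
  }

  static PetscErrorCode build_and_insert(Mat A, PetscInt Ne, PetscInt Nd)
  {
    PetscErrorCode ierr;
    PetscInt       *rows, e;
    PetscScalar    *vals;

    /* Allocate the packed arrays once: Ne*Nd row indices, Ne*Nd*Nd values. */
    ierr = PetscMalloc(Ne*Nd*sizeof(PetscInt), &rows);CHKERRQ(ierr);
    ierr = PetscMalloc(Ne*Nd*Nd*sizeof(PetscScalar), &vals);CHKERRQ(ierr);

    /* Each element writes straight into its slice of the packed arrays. */
    for (e = 0; e < Ne; ++e)
      compute_element_matrix(e, Nd, &rows[e*Nd], &vals[e*Nd*Nd]);

    /* Hand the whole batch to PETSc and assemble. */
    ierr = MatSetValuesBatch(A, Ne, Nd, rows, vals);CHKERRQ(ierr);
    ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

    ierr = PetscFree(rows);CHKERRQ(ierr);
    ierr = PetscFree(vals);CHKERRQ(ierr);
    return 0;
  }
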
>
> Arg, disregard last transmission! I was confusing myself with timings from
> several runs, and the "significant speedup" I referred to was seen
> when I timed things very badly. The numbers from yesterday's mail are
> correct; those were obtained using a GTX 280. That is, 30%-50% speedup on
> Poisson 2D on different meshes.
>
> The question from my previous email remains, though: we need to speed up the
> construction of the values array to get good speedup overall.
>
> Sorry for the spamming,
Off-topic: I find this thread extremely hard to follow. Is Gmail
required to read this list? The html-formatting with indentation (and
no ">") makes it really hard to read in my email-client (mutt).
--
Anders