[petsc-users] Questions about setting values for GPU based matrices
Anders Logg
logg at simula.no
Fri Dec 2 05:19:24 CST 2011
On Fri, Dec 02, 2011 at 11:54:20AM +0100, Fredrik Heffer Valdmanis wrote:
> 2011/12/2 Fredrik Heffer Valdmanis <fredva at ifi.uio.no>
> 2011/12/1 Matthew Knepley <knepley at gmail.com>
> On Thu, Dec 1, 2011 at 5:39 AM, Fredrik Heffer Valdmanis <fredva at ifi.uio.no> wrote:
> 2011/11/29 Matthew Knepley <knepley at gmail.com>
> On Tue, Nov 29, 2011 at 10:37 AM, Fredrik Heffer Valdmanis <fredva at ifi.uio.no> wrote:
> 2011/11/29 Matthew Knepley <knepley at gmail.com>
> On Tue, Nov 29, 2011 at 2:38 AM, Fredrik Heffer Valdmanis <fredva at ifi.uio.no> wrote:
> 2011/10/28 Matthew Knepley <knepley at gmail.com>
> On Fri, Oct 28, 2011 at 10:24 AM, Fredrik Heffer Valdmanis <fredva at ifi.uio.no> wrote:
> Hi,
>
> I am working on integrating the new GPU based vectors and matrices into
> FEniCS. Now, I'm looking at the possibility for getting some speedup during
> finite element assembly, specifically when inserting the local element
> matrix into the global element matrix. In that regard, I have a few
> questions I hope you can help me out with:
>
> - When calling MatSetValues with a MATSEQAIJCUSP matrix as parameter, what
> exactly is it that happens? As far as I can see, MatSetValues is not
> implemented for GPU based matrices, neither is mat->ops->setvalues set to
> point at any function for this Mat type.
>
> Yes, MatSetValues always operates on the CPU side. It would not make sense
> to do individual operations on the GPU.
>
> I have written batched assembly for element matrices that are all the same
> size:
>
> http://www.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/docs/manualpages/Mat/MatSetValuesBatch.html
>
> - Is it such that matrices are assembled in their entirety on the CPU, and
> then copied over to the GPU (after calling MatAssemblyBegin)? Or are values
> copied over to the GPU each time you call MatSetValues?
>
> That function assembles the matrix on the GPU and then copies to the CPU.
> The only time you do not want this copy is when you are running in serial
> and never touch the matrix afterwards, so I left it in.
>
> - Can we expect to see any speedup from using MatSetValuesBatch over
> MatSetValues, or is the batch version simply a utility function? This
> question goes for both CPU- and GPU-based matrices.
>
> CPU: no
>
> GPU: yes, I see about the memory bandwidth ratio
>
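For concreteness, here is a minimal sketch (not taken from the thread) of the two insertion paths contrasted above, assuming Ne same-size element matrices of dimension Nd, a connectivity array cells of Ne*Nd global dof indices, and a packed array Ke of Ne*Nd*Nd values stored row-major per element; all of these names are made up for illustration:

  #include <petscmat.h>

  /* Loop path: one MatSetValues call per element; insertion always happens
     on the CPU side. */
  static PetscErrorCode assemble_loop(Mat A, PetscInt Ne, PetscInt Nd,
                                      PetscInt *cells, PetscScalar *Ke)
  {
    PetscErrorCode ierr;
    PetscInt       e;

    for (e = 0; e < Ne; ++e) {
      ierr = MatSetValues(A, Nd, &cells[e*Nd], Nd, &cells[e*Nd],
                          &Ke[e*Nd*Nd], ADD_VALUES);CHKERRQ(ierr);
    }
    ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    return 0;
  }

  /* Batch path: all element matrices are handed over in a single call, which
     adds the values; with a MATSEQAIJCUSP matrix this is the routine that can
     assemble on the GPU and then copy the result back to the CPU. */
  static PetscErrorCode assemble_batch(Mat A, PetscInt Ne, PetscInt Nd,
                                       PetscInt *cells, PetscScalar *Ke)
  {
    PetscErrorCode ierr;

    ierr = MatSetValuesBatch(A, Ne, Nd, cells, Ke);CHKERRQ(ierr);
    ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    return 0;
  }
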
> Hi,
>
> I have now integrated MatSetValuesBatch in our existing PETSc wrapper
> layer. I have tested matrix assembly with Poisson's equation on different
> meshes with elements of varying order. I have timed the single call to
> MatSetValuesBatch and compared that to the total time consumed by the
> repeated calls to MatSetValues in the old implementation. I have the
> following results:
>
> Poisson on 1000x1000 unit square, 1st order Lagrange elements:
> MatSetValuesBatch: 0.88576 s
> repeated calls to MatSetValues: 0.76654 s
>
> Poisson on 500x500 unit square, 2nd order Lagrange elements:
> MatSetValuesBatch: 0.9324 s
> repeated calls to MatSetValues: 0.81644 s
>
> Poisson on 300x300 unit square, 3rd order Lagrange elements:
> MatSetValuesBatch: 0.93988 s
> repeated calls to MatSetValues: 1.03884 s
>
> As you can see, the two methods take almost the same amount of time. What
> behavior and performance should we expect? Is there any way to optimize the
> performance of batched assembly?
>
> Almost certainly it is not dispatching to the CUDA version. The regular
> version just calls MatSetValues() in a loop. Are you using a SEQAIJCUSP
> matrix?
> Yes. The same matrices yield a speedup of 4-6x when solving the system on
> the GPU.
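
For reference, a matrix of that type would typically be created along these lines; this is only a sketch, and the size N and the preallocation numbers are placeholders, not values from the thread:

  #include <petscmat.h>

  /* Sketch: create an N x N sequential AIJ matrix stored on the GPU via
     CUSP. Preallocation here is a placeholder. */
  static PetscErrorCode create_gpu_matrix(PetscInt N, Mat *A)
  {
    PetscErrorCode ierr;

    ierr = MatCreate(PETSC_COMM_SELF, A);CHKERRQ(ierr);
    ierr = MatSetSizes(*A, N, N, N, N);CHKERRQ(ierr);
    ierr = MatSetType(*A, MATSEQAIJCUSP);CHKERRQ(ierr);
    /* alternatively, call MatSetFromOptions(*A) and select the type at
       run time with -mat_type seqaijcusp */
    ierr = MatSeqAIJSetPreallocation(*A, 9, NULL);CHKERRQ(ierr);
    return 0;
  }
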
>
> Please confirm that the correct routine is used by running with -info and
> sending the output.
>
> Please send the output of -log_summary so I can confirm the results.
>
> You can run KSP ex4 and reproduce my results, where I see a 5.5x speedup on
> the GTX285.
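
One way to make such a comparison unambiguous in -log_summary is to wrap only the insertion call in its own logging stage; a minimal sketch, reusing the hypothetical cells/Ke arrays from the earlier sketch (the stage name "MatInsertion" is made up):

  #include <petscmat.h>

  /* Sketch: put just the insertion phase in its own logging stage so that
     -log_summary reports it separately from the rest of the run. */
  static PetscErrorCode insert_values_timed(Mat A, PetscInt Ne, PetscInt Nd,
                                            PetscInt *cells, PetscScalar *Ke)
  {
    PetscLogStage  stage;
    PetscErrorCode ierr;

    ierr = PetscLogStageRegister("MatInsertion", &stage);CHKERRQ(ierr);
    ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
    ierr = MatSetValuesBatch(A, Ne, Nd, cells, Ke);CHKERRQ(ierr); /* or the MatSetValues loop */
    ierr = PetscLogStagePop();CHKERRQ(ierr);

    ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    return 0;
  }
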
>
> I am not sure what to look for in those outputs. I have uploaded the output
> of running my assembly program with -info and -log_summary, and the output
> of running ex4 with -log_summary. See
>
> http://folk.uio.no/fredva/assembly_info.txt
> http://folk.uio.no/fredva/assembly_log_summary.txt
> http://folk.uio.no/fredva/ex4_log_summary.txt
>
> Trying this on a different machine now, I actually see some speedup. 3rd
> order Poisson on 300x300 assembles in 0.211 sec on GPU and 0.4232 sec on
> CPU. For 1st order and 1000x1000 mesh, I go from 0.31 sec to 0.205 sec.
> I have tried to increase the mesh size to see if the speedup increases, but
> I hit the bad_alloc error pretty quickly.
>
> For a problem of that size, should I expect even more speedup? Please let
> me know if you need any more output from test runs on my machine.
>
> Here are my results for nxn grids where n = range(150, 1350, 100). This is
> using a GTX 285. What card are you using?
>
> I realize now that I was including the time it takes to construct the
> large flattened array of values that is sent to MatSetValuesBatch. I
> assume of course that you only time MatSetValues/MatSetValuesBatch
> completely isolated. If I do this, I get significant speedup as well.
> Sorry for the confusion here.
>
> Still, this construction has to be done somehow in order to have
> meaningful data to pass to MatSetValuesBatch. The way I do this is
> apparently almost as costly as calling MatSetValues for each local
> matrix.
>
> Have you got any ideas on how to speed up the construction of the
> values array? This has to be done very efficiently in order for batch
> assembly to yield any speedup overall.
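
The thread does not show how the packed arrays are built in the FEniCS wrapper; purely as an illustration of the layout MatSetValuesBatch expects, one could preallocate the two arrays once and let the element kernel write directly into its slice, rather than assembling each local matrix separately and copying it in. The kernel below is a dummy stand-in, not real FE code:

  #include <petscmat.h>

  /* Dummy stand-in for the real FE kernel: fills the Nd global dof indices
     and the Nd x Nd local matrix of element e directly into its slices. */
  static void compute_element_matrix(PetscInt e, PetscInt Nd,
                                     PetscInt *rows, PetscScalar *vals)
  {
    PetscInt i, j;
    for (i = 0; i < Nd; ++i) rows[i] = e*Nd + i;          /* dummy dof map */
    for (i = 0; i < Nd; ++i)
      for (j = 0; j < Nd; ++j) vals[i*Nd + j] = (i == j); /* dummy entries */
  }

  static PetscErrorCode build_and_insert(Mat A, PetscInt Ne, PetscInt Nd)
  {
    PetscErrorCode ierr;
    PetscInt       *rows, e;
    PetscScalar    *vals;

    /* Allocate the packed arrays once: Ne*Nd row indices, Ne*Nd*Nd values. */
    ierr = PetscMalloc(Ne*Nd*sizeof(PetscInt), &rows);CHKERRQ(ierr);
    ierr = PetscMalloc(Ne*Nd*Nd*sizeof(PetscScalar), &vals);CHKERRQ(ierr);

    /* Each element writes straight into its slice of the packed arrays. */
    for (e = 0; e < Ne; ++e)
      compute_element_matrix(e, Nd, &rows[e*Nd], &vals[e*Nd*Nd]);

    /* Hand the whole batch to PETSc and assemble. */
    ierr = MatSetValuesBatch(A, Ne, Nd, rows, vals);CHKERRQ(ierr);
    ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

    ierr = PetscFree(rows);CHKERRQ(ierr);
    ierr = PetscFree(vals);CHKERRQ(ierr);
    return 0;
  }
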
>
> Arg, disregard last transmission! I was confusing myself with timings from
> several runs, and the "significant speedup" I referred to was seen
> when I timed things very badly. The numbers from yesterday's mail are
> correct; those were obtained using a GTX 280. That is, 30%-50% speedup on
> Poisson 2D on different meshes.
>
> The question from my previous email remains, though: we need to speed up the
> construction of the values array to get good speedup overall.
>
> Sorry for the spamming,
Off-topic: I find this thread extremely hard to follow. Is Gmail
required to read this list? The html-formatting with indentation (and
no ">") makes it really hard to read in my email-client (mutt).
--
Anders