[petsc-dev] Proper matrix size to choose when evaluating MatMult?

Sun Feb 23 23:34:26 CST 2020

On Sat, Feb 22, 2020 at 11:05 PM Karl Rupp <rupp at iue.tuwien.ac.at> wrote:

> Hi Junchao,
>
> > I want to evaluate MatMult on GPU.  I took a 2M x 2M matrix and ran with
> > 6 MPI ranks and 6 GPUs.  It took about 0.9 seconds.
>
> How many nonzeros per row? With 0.9 seconds you should either have many
> runs of MatMult, or a fairly dense matrix; or a really slow MatMult
> kernel ;-)
>
I had a typo.  It should be 0.9e-3 seconds. I ran with 6 GPUs and 6 MPI
ranks. The matrix has about 100 nonzeros per row.  2M x 2M is the whole
matrix size.  Thanks for the explanation.
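
As a sanity check on the corrected figures, here is a minimal sketch of the
arithmetic in Python. The row count, nonzeros per row, rank count, MatMult
time and 10us latency come from this thread. The storage cost of 12 bytes per
nonzero (8-byte value plus 4-byte column index) is an assumption, not
something stated here, and the bandwidth figure ignores vector traffic and
inter-GPU communication, so treat it as a rough lower bound only.

```python
# Figures quoted in this thread.
total_rows = 2_000_000        # global matrix is 2M x 2M
nnz_per_row = 100             # approximate nonzeros per row
ranks = 6                     # one GPU per MPI rank
matmult_time_s = 0.9e-3       # corrected MatMult time
launch_latency_s = 10e-6      # one kernel launch or stream synchronization

# Per-rank problem size.
rows_per_rank = total_rows / ranks
nnz_per_rank = rows_per_rank * nnz_per_row

# Share of one MatMult taken by a single launch or synchronization.
latency_fraction = launch_latency_s / matmult_time_s

# ASSUMPTION: 8-byte values and 4-byte column indices; matrix data only.
bytes_per_nnz = 8 + 4
est_bandwidth_bytes_per_s = nnz_per_rank * bytes_per_nnz / matmult_time_s

print(f"rows per rank:     {rows_per_rank:,.0f}")
print(f"nonzeros per rank: {nnz_per_rank:,.0f}")
print(f"latency fraction:  {latency_fraction:.2%}")
print(f"est. bandwidth:    {est_bandwidth_bytes_per_s / 1e9:.0f} GB/s per GPU")
```

Under these assumptions a single 10us launch is on the order of 1% of one
MatMult, though a MatMult that issues several launches and synchronizations
would see a proportionally larger share.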

> A 2M-by-2M matrix for a 5-point stencil is probably still on the small
> side (I'm assuming that you run 2M-by-2M for *each* GPU), but should
> suffice. Expect that communication costs are significant (i.e. the
> bookkeeping and data exchange between GPUs is on the order of the costs
> for running the MatMult kernel for the respective diagonal block).
>
>
> > A kernel launch or
> > a stream synchronization took about 10us.  Compared with MatMult, they
> > are tiny. Does it mean we can ignore them?  What is a proper size to
> > evaluate MatMult?  I heard it is a few thousand rows per MPI rank.  Why?
>
> That would be a typical strong scaling limit for a CPU-based run on a
> well-tuned BlueGene-type system. With GPUs you will probably need at
> least 100k unknowns (or ~1M nonzeros) per rank in the strong scaling
> limit. Add a factor of ~10 to make latency costs small in comparison.
>
> Best regards,
> Karli
>
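
The rule of thumb quoted above can be turned into a quick size check. This is
only a sketch: the thresholds (100k unknowns or ~1M nonzeros per rank, times a
factor of ~10) are the rough figures from the quoted reply, reading "add a
factor of ~10" as multiplying the per-rank size by 10 is an interpretation,
and the function name size_check is hypothetical.

```python
# Rough GPU thresholds from the quoted reply (per MPI rank).
STRONG_SCALING_UNKNOWNS_PER_RANK = 100_000
STRONG_SCALING_NNZ_PER_RANK = 1_000_000
LATENCY_MARGIN = 10  # ASSUMPTION: "factor of ~10" read as 10x the size


def size_check(total_rows, nnz_per_row, ranks):
    """Compare a global problem size against the rule-of-thumb thresholds."""
    rows_per_rank = total_rows / ranks
    nnz_per_rank = rows_per_rank * nnz_per_row
    return {
        "rows_per_rank": rows_per_rank,
        "nnz_per_rank": nnz_per_rank,
        # True if the per-rank size clears the threshold including the margin.
        "rows_ok": rows_per_rank
        >= LATENCY_MARGIN * STRONG_SCALING_UNKNOWNS_PER_RANK,
        "nnz_ok": nnz_per_rank
        >= LATENCY_MARGIN * STRONG_SCALING_NNZ_PER_RANK,
    }


# The run discussed in this thread: 2M rows, ~100 nonzeros per row, 6 ranks.
report = size_check(2_000_000, 100, 6)
for key, value in report.items():
    print(f"{key}: {value}")
```

For the run in this thread the two criteria disagree: the row count per rank
falls short of 10 x 100k, while the nonzero count per rank exceeds 10 x 1M,
because the matrix is much denser than a 5-point stencil.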