[petsc-dev] Proper matrix size to choose when evaluating MatMult?
rupp at iue.tuwien.ac.at
Sat Feb 22 23:05:44 CST 2020
> I want to evaluate MatMult on GPU. I took a 2M x 2M matrix and ran with
> 6 mpi ranks and 6 GPUs. It took about 0.9 seconds.
How many nonzeros per row? With 0.9 seconds you should either have many
runs of MatMult, a fairly dense matrix, or a really slow MatMult.
A 2M-by-2M matrix for a 5-point stencil is probably still on the small
side (I'm assuming that you run 2M-by-2M for *each* GPU), but should
suffice. Expect that communication costs are significant (i.e. the
bookkeeping and data exchange between GPUs is on the order of the costs
for running the MatMult kernel for the respective diagonal block).
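To make the 0.9-second remark concrete, here is a rough back-of-the-envelope sketch. It assumes MatMult is purely memory-bandwidth bound, CSR storage with 8-byte values and 4-byte column indices, and a GPU memory bandwidth of 900 GB/s. The bandwidth figure and the helper name `spmv_time_estimate` are assumptions for illustration only, not numbers from the original run.

```python
def spmv_time_estimate(n_rows, nnz_per_row, bandwidth_bytes_per_s):
    """Estimate the time of one CSR sparse matrix-vector product,
    assuming it is limited only by memory traffic."""
    nnz = n_rows * nnz_per_row
    # Matrix traffic: 8-byte value + 4-byte column index per nonzero,
    # plus the 4-byte row-pointer array.
    bytes_matrix = nnz * (8 + 4) + (n_rows + 1) * 4
    # Vector traffic: read x once and write y once (optimistic caching).
    bytes_vectors = n_rows * 8 * 2
    return (bytes_matrix + bytes_vectors) / bandwidth_bytes_per_s


# ASSUMPTION: 900 GB/s memory bandwidth; replace with the measured value
# for the actual device.
ASSUMED_BANDWIDTH = 900e9

# 2M rows per GPU with a 5-point stencil (about 5 nonzeros per row).
t_matmult = spmv_time_estimate(2_000_000, 5, ASSUMED_BANDWIDTH)
runs_in_window = 0.9 / t_matmult

print(f"estimated time per MatMult: {t_matmult * 1e3:.3f} ms")
print(f"MatMults that would fit in 0.9 s: {runs_in_window:.0f}")
```

Under these assumptions a single MatMult takes a fraction of a millisecond, so 0.9 seconds would correspond to thousands of MatMult calls, a much denser matrix, or a kernel running far below the bandwidth limit.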
> A kernel launch or
> a stream synchronization took about 10us. Compared with MatMult, they
> are tiny. Does it mean we can ignore them? What is a proper size to
> evaluate MatMult? I heard it is a few thousand rows per MPI rank. Why?
That would be a typical strong scaling limit for a CPU-based run on a
well-tuned BlueGene-type system. With GPUs you will probably need at
least 100k unknowns (or ~1M nonzeros) per rank in the strong scaling
limit. Add a factor of ~10 to make latency costs small in comparison.
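The sizes above can be checked with a small sketch that compares a fixed 10us latency (the figure quoted in the question) against a bandwidth-bound estimate of the kernel time. The 900 GB/s bandwidth, the CSR byte counts, the single latency event per MatMult, and the helper name `latency_fraction` are all assumptions for illustration; a real run will have several launches and synchronizations plus MPI communication.

```python
def latency_fraction(n_rows, nnz_per_row, bandwidth_bytes_per_s,
                     latency_s, n_latency_events=1):
    """Fraction of total MatMult time spent on fixed latency costs,
    assuming a bandwidth-bound CSR kernel."""
    nnz = n_rows * nnz_per_row
    # CSR traffic: values + column indices, row pointers, x and y vectors.
    bytes_moved = nnz * (8 + 4) + (n_rows + 1) * 4 + n_rows * 8 * 2
    kernel_time = bytes_moved / bandwidth_bytes_per_s
    overhead = latency_s * n_latency_events
    return overhead / (overhead + kernel_time)


ASSUMED_BANDWIDTH = 900e9   # ASSUMPTION: bytes per second, device dependent
LATENCY = 10e-6             # 10us per launch or synchronization, as quoted

# ~10 nonzeros per row, so 100k unknowns correspond to ~1M nonzeros.
fractions = {
    n_rows: latency_fraction(n_rows, 10, ASSUMED_BANDWIDTH, LATENCY)
    for n_rows in (5_000, 100_000, 1_000_000)
}

for n_rows, frac in fractions.items():
    print(f"{n_rows:>9} unknowns per rank: "
          f"latency is {100 * frac:.0f}% of the time")
```

With these assumed numbers, latency dominates at a few thousand rows per rank, is comparable to the kernel time around 100k unknowns, and drops to a small share once the problem is about ten times larger.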