[petsc-dev] Proper matrix size to choose when evaluating MatMult?

Karl Rupp rupp at iue.tuwien.ac.at
Sat Feb 22 23:05:44 CST 2020

Hi Junchao,

> I want to evaluate MatMult on GPU.  I took a 2M x 2M matrix and ran with 
> 6 mpi ranks and 6 GPUs.  It took about 0.9 seconds.  

How many nonzeros per row? With 0.9 seconds you must either be timing 
many runs of MatMult, have a fairly dense matrix, or a really slow 
MatMult kernel ;-)

A 2M-by-2M matrix for a 5-point stencil is probably still on the small 
side (I'm assuming that you run 2M-by-2M for *each* GPU), but it should 
suffice. Expect the communication costs to be significant (i.e. the 
bookkeeping and data exchange between GPUs is on the order of the cost 
of running the MatMult kernel for the respective diagonal block).

> A kernel launch or 
> a stream synchronization took about 10us.  Compared with MatMult, they 
> are tiny. Does it mean we can ignore them?  What is a proper size to 
> evaluate MatMult?  I heard it is a few thousand rows per MPI rank.  Why?

That would be a typical strong scaling limit for a CPU-based run on a 
well-tuned BlueGene-type system. With GPUs you will probably need at 
least 100k unknowns (or ~1M nonzeros) per rank in the strong scaling 
limit. Add a factor of ~10 to make latency costs small in comparison.

Best regards,
