[PETSC #16886] Storage of the matrix
petsc-maint at mcs.anl.gov
Sun Oct 14 21:14:13 CDT 2007
On Wed, 10 Oct 2007, Jaime Suarez wrote:
> Also, I would like to perform a matrix-vector multiplication in
> parallel, but this time the matrix would be symmetric, so that the
> libraries for sparse matrix do not seem efficient for equally
> distributing the jobs in several processors. Can you give me some advice
> about what to do?
> Thanking you in advance,
> Jaime Suarez
This is a good point. Though the symmetric compressed sparse row (CSR) matrix
format is good for reducing memory usage, it makes load balancing of __floating point operations__
and __communication__ difficult. If roughly the same number of rows is assigned to
each process, the last process has fewer nonzeros (and hence less floating point work to do);
in addition, the neighbor-to-neighbor communication is not symmetric. The first process needs
to receive from several other processes but does not send ANY messages, while the last process
receives no messages but needs to send to several. The other issue is when the messages are posted.
There need to be TWO "waves" of communication: each process needs to receive the remote vector
entries to do the "upper triangular" part of the multiply, and it also needs to send the partial
results it computed for the "lower triangular" part of the multiply to the process that owns
them. We found that actually doing this as two separate VecScatters made the whole multiply
a good bit slower than the MPIAIJ format; hence we merged them so there is only one VecScatter
and hence one "wave" of communication. BUT this introduces its own problem: the first process cannot start
its scatter until it has done ALL of the lower part of the multiply, so later processes will be
waiting a bit longer.
Ways to improve the situation
1) give the later processes more rows to balance the number of nonzeros (and hence floating point operations)
per process. The drawbacks to this are
   a) calculating the rows that each process should get (which depends on the nonzero pattern),
   b) the synchronization of when sends and receives are posted by different processes is worse, and
   c) the rest of the code may now be unbalanced, since later processes have many more entries in the
      local vectors.
2) just use the MPIAIJ format and give up the memory savings.
3) store the "block diagonal" portion of the matrix (that portion associated with the local rows AND columns
of the matrix) in SBAIJ format but redundantly store the off-diagonal entries on the appropriate processes.
This has nice load balancing. If we assume the vast majority of the matrix entries are in the
"block diagonal" portion then this will still save most of the memory that the MPISBAIJ format saves.
If there is a lot of demand for an efficient parallel symmetric format it may make sense for us
to implement option 3.