I have a code that uses the same matrix L and its
transpose in every time update.
I can improve the performance of the code by replacing
MatMultTranspose() with MatMult() and computing
the transposed matrix only once, at the beginning of
the code. The cost is, of course, the extra storage for
the transposed matrix.
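
To illustrate the two kinds of product in serial terms,
here is a minimal sketch in plain C. It is not PETSc
code: the Csr struct and both function names are made
up for this example, and it ignores everything
parallel. csr_mult() corresponds to applying a stored
matrix, while csr_mult_transpose() applies the
transpose without ever forming it.

```c
#include <assert.h>

/* Hypothetical serial CSR matrix; not a PETSc type. */
typedef struct {
    int nrows, ncols;
    const int *rowptr;   /* length nrows + 1 */
    const int *colidx;   /* length rowptr[nrows] */
    const double *val;   /* length rowptr[nrows] */
} Csr;

/* y = A x: each row gathers from x and writes one entry of y. */
void csr_mult(const Csr *A, const double *x, double *y)
{
    for (int i = 0; i < A->nrows; i++) {
        double sum = 0.0;
        for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; k++) {
            assert(A->colidx[k] >= 0 && A->colidx[k] < A->ncols);
            sum += A->val[k] * x[A->colidx[k]];
        }
        y[i] = sum;
    }
}

/* y = A^T x without forming A^T: each row scatters into y. */
void csr_mult_transpose(const Csr *A, const double *x, double *y)
{
    for (int j = 0; j < A->ncols; j++)
        y[j] = 0.0;
    for (int i = 0; i < A->nrows; i++) {
        for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; k++) {
            assert(A->colidx[k] >= 0 && A->colidx[k] < A->ncols);
            y[A->colidx[k]] += A->val[k] * x[i];
        }
    }
}
```

In the second routine the writes into y are scattered
by column index, whereas the first writes each y[i]
exactly once.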

However, I have a question regarding the efficiency of
transposing the matrix. I created the matrix L as
MPIAIJ and preallocated the proper memory for it.
Then I call MatTranspose(L,&LT) to compute LT, the
transpose of L. But I noticed that this step is
extremely slow, 6 times slower than the creation of
the matrix L itself.

My first question is: do I need to preallocate the
memory for LT as well? I did not, since I assumed
PETSc is smart enough to figure out the necessary
memory.

Secondly, I am not sure why MatTranspose() is so slow. I
understand that transposing a matrix may require a call
to MPI_Alltoall, which is extremely expensive. But it
seems that I could build LT directly, through a process
similar to the one that created the matrix L, and that
this would be much faster. I do not know how
MatTranspose() is implemented, or whether I should
compose LT myself instead of transposing L.
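
To make concrete what I mean by composing LT directly,
here is a minimal serial sketch in plain C. It is not
PETSc code and makes no claim about how MatTranspose()
is implemented; csr_transpose() and its arguments are
names made up for this example. The first pass counts
the entries in each row of the transpose, which plays
the role that preallocation plays for an MPIAIJ matrix,
and the second pass fills the entries in place.

```c
#include <assert.h>
#include <stdlib.h>

/* Transpose an nrows x ncols CSR matrix into caller-provided arrays.
 * t_rowptr must hold ncols + 1 entries; t_colidx and t_val must hold
 * rowptr[nrows] entries. Returns 0 on success, -1 if the scratch
 * allocation fails. Hypothetical helper, serial only. */
int csr_transpose(int nrows, int ncols,
                  const int *rowptr, const int *colidx, const double *val,
                  int *t_rowptr, int *t_colidx, double *t_val)
{
    /* Pass 1: count the entries in each column of A, i.e. each row of A^T. */
    for (int j = 0; j <= ncols; j++)
        t_rowptr[j] = 0;
    for (int k = 0; k < rowptr[nrows]; k++) {
        assert(colidx[k] >= 0 && colidx[k] < ncols);
        t_rowptr[colidx[k] + 1]++;
    }

    /* A prefix sum turns the counts into row start offsets. */
    for (int j = 0; j < ncols; j++)
        t_rowptr[j + 1] += t_rowptr[j];

    /* Pass 2: fill, keeping one insertion cursor per row of A^T. */
    int *next = malloc((size_t)ncols * sizeof *next);
    if (ncols > 0 && next == NULL)
        return -1;
    for (int j = 0; j < ncols; j++)
        next[j] = t_rowptr[j];
    for (int i = 0; i < nrows; i++) {
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++) {
            int dst = next[colidx[k]]++;
            t_colidx[dst] = i;
            t_val[dst] = val[k];
        }
    }
    free(next);
    return 0;
}
```

Because the rows of L are visited in increasing order,
the column indices within each row of the result come
out sorted, and no entry is ever moved after it has
been written.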

