[petsc-users] Solving A*X = B where A and B are matrices

Mon Feb 4 14:23:08 CST 2013

Hello,
I apologize for asking a very similar question again. I have a problem in
which the greatest bottleneck is the Trace(A\B) function. Thus I am trying
to run this function in parallel. If I understand correctly, MatMatSolve is
a sequential operation, am I right?
I have tried the following approach - I divide the matrix B by columns, so
basically I solve a bunch of linear systems of type A\b where A is a matrix
and b is a vector (column of matrix B). Since these A\b operations are
independent, I would like to run them all in parallel.
In my experiment I have:

1) loaded the whole matrix A on each process
2) loaded matrix B as shared on all processes
3) transposed matrix B
On each separate process in parallel:
     - run Cholesky factorization (only for the first A\b solve)
     - solve linear systems A\b_i for each row i of matrix B that is stored
       on that process
     - send the partial trace result to the main process
Please find attached the complete code.
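
For readers without the attachment, the per-process work in the steps above (factor A once, solve for each right-hand side, keep only entry i of each solution) can be sketched in plain C. This is not the attached PETSc code. It is a minimal dense sketch under assumptions: the size N is fixed at 3, the function names are hypothetical, and a square-root-free Cholesky variant (A = L*D*L^T) is swapped in so that no math library is needed.

```c
#include <assert.h>

#define N 3  /* system size, fixed here to keep the sketch short */

/* In-place root-free Cholesky (A = L * D * L^T) of a symmetric positive
   definite matrix. On return the diagonal of a holds D and the strict
   lower triangle holds L (unit diagonal implied). Returns 0 on success,
   -1 if a pivot is not positive. */
static int ldlt_factor(double a[N][N])
{
    for (int j = 0; j < N; j++) {
        double d = a[j][j];
        for (int k = 0; k < j; k++) d -= a[j][k] * a[j][k] * a[k][k];
        if (d <= 0.0) return -1;
        a[j][j] = d;
        for (int i = j + 1; i < N; i++) {
            double s = a[i][j];
            for (int k = 0; k < j; k++) s -= a[i][k] * a[j][k] * a[k][k];
            a[i][j] = s / d;
        }
    }
    return 0;
}

/* Solve A x = b with the stored factors; x overwrites b. */
static void ldlt_solve(double f[N][N], double b[N])
{
    for (int i = 0; i < N; i++)                  /* forward: L y = b */
        for (int k = 0; k < i; k++) b[i] -= f[i][k] * b[k];
    for (int i = 0; i < N; i++) b[i] /= f[i][i]; /* diagonal: D z = y */
    for (int i = N - 1; i >= 0; i--)             /* backward: L^T x = z */
        for (int k = i + 1; k < N; k++) b[i] -= f[k][i] * b[k];
}

/* Trace(A\B). bt is B transposed, so row i of bt is the right-hand side b_i.
   a is factored once and reused; both arguments are overwritten. */
static int trace_of_solve(double a[N][N], double bt[N][N], double *trace)
{
    assert(trace != 0);
    if (ldlt_factor(a) != 0) return -1;
    *trace = 0.0;
    for (int i = 0; i < N; i++) {
        ldlt_solve(a, bt[i]);   /* x_i = A \ b_i */
        *trace += bt[i][i];     /* keep only entry i of x_i */
    }
    return 0;
}
```

Calling trace_of_solve with B equal to A should give a value close to N, since A\A is the identity. In the parallel version each process would run the loop only over the rows of the transposed B that it stores and then send its partial sum to the main process.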

My question is about the scalability of this solution. I see an improvement
when running the code on 20 nodes as opposed to 10 nodes, but if I run the
code on 40 nodes, my calculation time is even worse than the calculation time
on 20 nodes. Please find attached my log summary for 10 nodes, 20 nodes and
40 nodes. I should maybe point out that I am only interested in speeding up
the Trace(A\B) calculation time (noted with "Trace calculations took: " in
my output).
Could you please help me figure out what is causing my performance to
worsen when using more nodes? I am sorry if this is a very basic question,
I am fairly new to PETSc.
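
One way to reason about the scheme described above is a simple cost model, sketched here in C. It is not a diagnosis of the attached logs. It assumes that every process repeats the factorization (so that cost does not shrink as processes are added), that the solves are split evenly, and that there is some overhead growing linearly with the number of processes. That last term, the function names, and any numbers fed into the model are hypothetical.

```c
#include <assert.h>

/* Number of right-hand sides handled by the busiest process when m
   columns are split as evenly as possible over p processes. */
static int max_columns_per_process(int m, int p)
{
    assert(m >= 0 && p > 0);
    return (m + p - 1) / p;
}

/* Modeled wall time for the trace computation:
     t_factor   - factorization, repeated on every process (not divided by p)
     t_solve    - one triangular solve pair for a single right-hand side
     t_overhead - ASSUMED per-process overhead (loading, communication);
                  the linear growth in p is an assumption, not a measurement */
static double modeled_time(int m, int p, double t_factor, double t_solve,
                           double t_overhead)
{
    return t_factor + max_columns_per_process(m, p) * t_solve + p * t_overhead;
}
```

With made-up inputs such as m = 1000, t_factor = 5.0, t_solve = 0.1 and t_overhead = 0.2, the modeled time drops from 10 to 20 processes and rises again at 40. Comparing the per-stage times in the log summaries against a model like this would show which term actually dominates.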
Grateful in advance,
Jelena

On Mon, Dec 3, 2012 at 2:21 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:

>
> http://www.mcs.anl.gov/petsc/documentation/faq.html#computers
>
> On Dec 3, 2012, at 1:08 PM, Jelena Slivka <slivkaje at gmail.com> wrote:
>
> > Thank you very much!
> > However, I have another question. I have a cluster of 4 nodes and each
> node has 6 cores. If I run my code using 6 cores on one node (using the
> command "mpiexec -n 6") it is much faster than running it on just one
> process (which is expected). However, if I try running the code on multiple
> nodes (using "mpiexec -f machinefile -ppn 4", where machinefile is the file
> which contains the node names), it runs much slower than on just one
> process. This also happens with tutorial examples. I have checked the
> number of iterations of the KSP solver when spread over multiple processors,
> and that doesn't seem to be the problem. Do you have any suggestions as to
> what I am doing wrong? Are the commands I am using wrong?
> >
> >
> > On Sat, Dec 1, 2012 at 6:03 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> >
> >     We recommend following the directions at
> http://www.mcs.anl.gov/petsc/documentation/faq.html#schurcomplement  for
> computing a Schur complement; just skip the unneeded step. MUMPS supports a
> parallel Cholesky but you can also use a parallel LU with MUMPS, PaSTIX or
> SuperLU_Dist and those will work fine also. With current software Cholesky
> in parallel is not tons better than LU so generally not worth monkeying
> with.
> >
> >    Barry
> >
> >
> > On Dec 1, 2012, at 12:05 PM, Jelena Slivka <slivkaje at gmail.com> wrote:
> >
> > > Hello!
> > > I am trying to solve A*X = B where A and B are matrices, and then find
> the trace of the resulting matrix X. My approach has been to partition matrix B
> in column vectors bi and then solve each system A*xi = bi. Then, for all
> vectors xi I would extract i-th element xi(i) and sum those elements in
> order to get Trace(X).
> > > Pseudo-code:
> > > 1) load matrices A and B
> > > 2) transpose matrix B (so that each right-hand side bi is in the row,
> as operation MatGetColumnVector is slow)
> > > 3) set up KSPSolve
> > > 4) create vector diagonal (in which xi(i) elements will be stored)
> > > 5) for each row i of matrix B owned by current process:
> > >           - create vector bi by extracting row i from matrix B
> > >           - apply KSPsolve to get xi
> > >           - insert value xi(i) in diagonal vector (only the process which
> > >             holds the ith entry of vector xi should do so)
> > > 6) sum vector diagonal to get the trace.
> > > However, my code (attached, along with the test case) runs fine on one
> process, but hangs if started on multiple processes. Could you please help
> me figure out what I am doing wrong?
> > > Also, could you please tell me whether it is possible to use Cholesky
> factorization when running on multiple processes (I see that I cannot use
> it when I set the format of matrix A to MPIAIJ)?
> > >
> > > <Experiment.c><Abin><Bbin>
> >
> >
>
>
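
Step 5 of the quoted pseudo-code says that only the process holding entry i of xi should store it. A minimal sketch of that ownership test in plain C follows. It assumes the vector is split into contiguous blocks as evenly as possible, with the first n % size ranks getting one extra entry. The function names are hypothetical. In PETSc itself the range would be queried with VecGetOwnershipRange rather than recomputed.

```c
#include <assert.h>

/* Half-open range [start, end) of global indices owned by `rank` when n
   entries are split into contiguous blocks over `size` processes, the
   first (n % size) ranks receiving one extra entry. */
static void ownership_range(int n, int size, int rank, int *start, int *end)
{
    assert(n >= 0 && size > 0 && rank >= 0 && rank < size);
    int base = n / size, extra = n % size;
    *start = rank * base + (rank < extra ? rank : extra);
    *end = *start + base + (rank < extra ? 1 : 0);
}

/* Rank that owns global index i under the same layout. */
static int owner_of(int n, int size, int i)
{
    assert(size > 0 && i >= 0 && i < n);
    int base = n / size, extra = n % size;
    int cut = extra * (base + 1);  /* first index owned by a rank with no extra entry */
    if (i < cut) return i / (base + 1);
    return extra + (i - cut) / base;
}
```

Inside the loop over rows, a process with rank r would store xi(i) only when start <= i < end for its own range, which is the same condition as owner_of(n, size, i) == r.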
Attachments:
log_10nodes.log (11461 bytes): <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20130204/48ee578b/attachment-0003.obj>
log_20nodes.log (11463 bytes): <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20130204/48ee578b/attachment-0004.obj>
log_40nodes.log (11462 bytes): <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20130204/48ee578b/attachment-0005.obj>
ComputeTraceParallelKSP.c (3720 bytes): <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20130204/48ee578b/attachment-0001.c>