[petsc-users] Solving A*X = B where A and B are matrices
Karl Rupp
rupp at mcs.anl.gov
Mon Feb 4 14:32:18 CST 2013
Hi Jelena,
there are two things to note:
a) From the logs:
##########################################################
# #
# WARNING!!! #
# #
# This code was compiled with a debugging option, #
# To get timing results run ./configure #
# using --with-debugging=no, the performance will #
# be generally two or three times faster. #
# #
##########################################################
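A non-debug rebuild along the lines the banner suggests looks like this (your existing configure options, e.g. compilers, MPI and external packages, should be repeated as well):

```shell
# Rebuild PETSc without debugging for timing runs; append your usual
# configure options (compilers, MPI paths, downloaded packages).
./configure --with-debugging=no
make all
```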
b) It seems you are transposing the full matrix on every process. I
doubt that this is what you really want, as it eats up about 45 percent
of the total time in the 40-process case. You should instead load the
entries into the right-hand-side vector only when you need them.
Best regards,
Karli
On 02/04/2013 02:23 PM, Jelena Slivka wrote:
> Hello,
> I apologize for asking a very similar question again. I have a problem
> in which the greatest bottleneck is the Trace(A\B) function, so I am
> trying to run this function in parallel. If I understand correctly,
> MatMatSolve is a sequential operation, am I right?
> I have tried the following approach - I divide the matrix B by columns,
> so basically I solve a bunch of linear systems of type A\b where A is a
> matrix and b is a vector (column of matrix B). Since these A\b
> operations are independent, I would like to run them all in parallel.
> In my experiment I have:
>
> 1) loaded the whole matrix A on each process
> 2) loaded matrix B as shared on all processes
> 3) transposed matrix B
> On each separate process in parallel:
> - run Cholesky factorization (only for the first A\b solve)
> - Solve linear systems A\b_i for each row i of matrix B that is
> stored on that process
> - Send the partial trace result to the main process
> Please find attached the complete code.
>
> My question is about scalability of this solution. I see the improvement
> when running the code on 20 nodes as opposed to 10 nodes, but if I run
> the code on 40 nodes, my calculation time is even worse than calculation
> time on 20 nodes. Please find attached my log summary for 10 nodes, 20
> nodes and 40 nodes. I should maybe point out that I am only interested
> in speeding up the Trace(A\B) calculation time (noted with "Trace
> calculations took: " in my output).
> Could you please help me figure out what is causing my performance to
> worsen when using more nodes? I am sorry if this is a very basic
> question, I am fairly new to PETSc.
> Grateful in advance,
> Jelena
>
> On Mon, Dec 3, 2012 at 2:21 PM, Barry Smith <bsmith at mcs.anl.gov
> <mailto:bsmith at mcs.anl.gov>> wrote:
>
>
> http://www.mcs.anl.gov/petsc/documentation/faq.html#computers
>
> On Dec 3, 2012, at 1:08 PM, Jelena Slivka <slivkaje at gmail.com
> <mailto:slivkaje at gmail.com>> wrote:
>
> > Thank you very much!
> > However, I have another question. I have a cluster of 4 nodes and
> each node has 6 cores. If I run my code using 6 cores on one node
> (using the command "mpiexec -n 6") it is much faster than running it
> on just one process (which is expected). However, if I try running
> the code on multiple nodes (using "mpiexec -f machinefile -ppn 4",
> where machinefile is the file which contains the node names), it
> runs much slower than on just one process. This also happens with
> tutorial examples. I have checked the number of iterations for the KSP
> solver when spread across multiple processes and it doesn't seem to be
> the problem. Do you have any suggestions on what I am doing wrong?
> Are the commands I am using wrong?
> >
> >
> > On Sat, Dec 1, 2012 at 6:03 PM, Barry Smith <bsmith at mcs.anl.gov
> <mailto:bsmith at mcs.anl.gov>> wrote:
> >
> > We recommend following the directions
> http://www.mcs.anl.gov/petsc/documentation/faq.html#schurcomplement
> for computing a Schur complement; just skip the unneeded step.
> MUMPS supports a parallel Cholesky, but you can also use a parallel
> LU with MUMPS, PaStiX or SuperLU_DIST, and those will work fine also.
> With current software Cholesky in parallel is not tons better than
> LU so generally not worth monkeying with.
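For reference, with MUMPS (or SuperLU_DIST) built into PETSc, the parallel direct solve Barry describes can be selected entirely from the command line; the option names below are from the PETSc 3.x series of that era, and the executable name is made up:

```shell
# Hypothetical run: parallel direct LU via MUMPS, chosen at runtime
# (-pc_factor_mat_solver_package is the PETSc 3.x option name).
mpiexec -n 8 ./Experiment -ksp_type preonly -pc_type lu \
    -pc_factor_mat_solver_package mumps
```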
> >
> > Barry
> >
> >
> > On Dec 1, 2012, at 12:05 PM, Jelena Slivka <slivkaje at gmail.com
> <mailto:slivkaje at gmail.com>> wrote:
> >
> > > Hello!
> > > I am trying to solve A*X = B where A and B are matrices, and
> then find the trace of the resulting matrix X. My approach has been to
> partition matrix B into column vectors bi and then solve each system
> A*xi = bi. Then, for all vectors xi I would extract i-th element
> xi(i) and sum those elements in order to get Trace(X).
> > > Pseudo-code:
> > > 1) load matrices A and B
> > > 2) transpose matrix B (so that each right-hand side bi is in
> a row, as the MatGetColumnVector operation is slow)
> > > 3) set up KSPSolve
> > > 4) create vector diagonal (in which xi(i) elements will be stored)
> > > 5) for each row i of matrix B owned by the current process:
> > > - create vector bi by extracting row i from matrix B
> > > - apply KSPsolve to get xi
> > > - insert value xi(i) in diagonal vector (only the process
> which holds the ith value of vector x(i) should do so)
> > > 6) sum vector diagonal to get the trace.
> > > However, my code (attached, along with the test case) runs fine
> on one process, but hangs if started on multiple processes. Could
> you please help me figure out what I am doing wrong?
> > > Also, could you please tell me whether it is possible to use Cholesky
> factorization when running on multiple processes (I see that I
> cannot use it when I set the format of matrix A to MPIAIJ)?
> > >
> > > <Experiment.c><Abin><Bbin>
> >
> >
>
>