PETSc runs slower on a shared memory machine than on a cluster
Shi Jin
jinzishuai at yahoo.com
Sat Feb 3 15:46:29 CST 2007
Thank you.
I did the same runs again with -log_summary. Here is
the part that I think is most important.
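(For completeness: both runs use the same number of MPI processes, launched along
the lines of "mpiexec -n <nproc> ./fem -log_summary", where ./fem only stands in
for our executable.)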
On the cluster:
--- Event Stage 5: Projection
Event                Count      Time (sec)     Flops/sec                          --- Global ---   --- Stage ---    Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct   %T %F %M %L %R   %T %F %M %L %R  Mflop/s
------------------------------------------------------------------------------------------------------------------------
[x]rhsLu                99 1.0 2.3875e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 9.9e+01  7  0  0  0  0  14  0   0   0  0      0
VecMDot             133334 1.0 4.1386e+02 1.6 3.43e+08 1.6 0.0e+00 0.0e+00 1.3e+05 10 18  0  0 45  21 27   0   0 49    883
VecNorm             137829 1.0 6.9839e+01 1.5 1.27e+08 1.5 0.0e+00 0.0e+00 1.4e+05  2  1  0  0 46   4  2   0   0 51    350
VecScale            137928 1.0 5.5639e+00 1.1 5.79e+08 1.1 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1   0   0  0   2197
VecCopy               4495 1.0 8.4510e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0      0
VecSet              142522 1.0 1.7712e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0   0   0  0      0
VecAXPY               8990 1.0 9.9013e-01 1.1 4.34e+08 1.1 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0   1610
VecMAXPY            137829 1.0 2.1687e+02 1.1 4.92e+08 1.1 0.0e+00 0.0e+00 0.0e+00  6 20  0  0  0  12 29   0   0  0   1793
VecScatterBegin     137829 1.0 2.1816e+01 1.9 0.00e+00 0.0 8.3e+05 3.4e+04 0.0e+00  0  0 91 74  0   1  0 100 100  0      0
VecScatterEnd       137730 1.0 3.0302e+01 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0   0   0  0      0
VecNormalize        137829 1.0 7.6565e+01 1.4 1.68e+08 1.4 0.0e+00 0.0e+00 1.4e+05  2  2  0  0 46   4  3   0   0 51    479
MatMult             137730 1.0 3.5652e+02 1.3 2.58e+08 1.2 8.3e+05 3.4e+04 0.0e+00  9 15 91 74  0  19 21 100 100  0    815
MatSolve            137829 1.0 5.0916e+02 1.2 1.56e+08 1.2 0.0e+00 0.0e+00 0.0e+00 13 14  0  0  0  28 20   0   0  0    531
MatGetRow         44110737 1.0 1.1846e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0   7  0   0   0  0      0
KSPGMRESOrthog      133334 1.0 6.0430e+02 1.3 3.87e+08 1.3 0.0e+00 0.0e+00 1.3e+05 15 37  0  0 45  32 54   0   0 49   1209
KSPSolve                99 1.0 1.4336e+03 1.0 2.37e+08 1.0 8.3e+05 3.4e+04 2.7e+05 40 68 91 74 91  86 100 100 100 100  944
PCSetUpOnBlocks         99 1.0 3.2687e-04 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0      0
PCApply             137829 1.0 5.3316e+02 1.2 1.50e+08 1.2 0.0e+00 0.0e+00 0.0e+00 14 14  0  0  0  30 20   0   0  0    507
---------------------------------------------------
On the shared-memory machine:
--- Event Stage 5: Projection
Event                Count      Time (sec)     Flops/sec                          --- Global ---   --- Stage ---    Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct   %T %F %M %L %R   %T %F %M %L %R  Mflop/s
------------------------------------------------------------------------------------------------------------------------
[x]rhsLu                99 1.0 2.0673e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 9.9e+01  5  0  0  0  0   9  0   0   0  0      0
VecMDot             133334 1.0 7.0932e+02 2.1 2.70e+08 2.1 0.0e+00 0.0e+00 1.3e+05 11 18  0  0 45  22 27   0   0 49    515
VecNorm             137829 1.0 1.2860e+02 7.0 3.32e+08 7.0 0.0e+00 0.0e+00 1.4e+05  2  1  0  0 46   3  2   0   0 51    190
VecScale            137928 1.0 5.0018e+00 1.0 6.36e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1   0   0  0   2444
VecCopy               4495 1.0 1.4161e+00 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0      0
VecSet              142522 1.0 1.9602e+01 2.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0   0   0  0      0
VecAXPY               8990 1.0 1.5128e+00 1.4 3.67e+08 1.4 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0   1054
VecMAXPY            137829 1.0 3.5204e+02 1.4 3.82e+08 1.4 0.0e+00 0.0e+00 0.0e+00  7 20  0  0  0  13 29   0   0  0   1105
VecScatterBegin     137829 1.0 1.4310e+01 2.2 0.00e+00 0.0 8.3e+05 3.4e+04 0.0e+00  0  0 91 74  0   0  0 100 100  0      0
VecScatterEnd       137730 1.0 1.5035e+02 6.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   3  0   0   0  0      0
VecNormalize        137829 1.0 1.3453e+02 5.6 3.80e+08 5.6 0.0e+00 0.0e+00 1.4e+05  2  2  0  0 46   3  3   0   0 51    272
MatMult             137730 1.0 5.4179e+02 1.5 1.99e+08 1.4 8.3e+05 3.4e+04 0.0e+00 11 15 91 74  0  21 21 100 100  0    536
MatSolve            137829 1.0 7.9682e+02 1.4 1.18e+08 1.4 0.0e+00 0.0e+00 0.0e+00 16 14  0  0  0  30 20   0   0  0    339
MatGetRow         44110737 1.0 1.0296e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   5  0   0   0  0      0
KSPGMRESOrthog      133334 1.0 9.4927e+02 1.4 2.75e+08 1.4 0.0e+00 0.0e+00 1.3e+05 18 37  0  0 45  34 54   0   0 49    770
KSPSolve                99 1.0 2.0562e+03 1.0 1.65e+08 1.0 8.3e+05 3.4e+04 2.7e+05 47 68 91 74 91  91 100 100 100 100  658
PCSetUpOnBlocks         99 1.0 3.3998e-04 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0      0
PCApply             137829 1.0 8.2326e+02 1.4 1.14e+08 1.4 0.0e+00 0.0e+00 0.0e+00 16 14  0  0  0  31 20   0   0  0    328
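Putting a few of the per-process rates from the two runs side by side (numbers
taken from the Mflop/s column above):

                 cluster   shared memory   (Mflop/s)
  MatMult            815             536
  MatSolve           531             339
  VecMDot            883             515
  VecMAXPY          1793            1105
  KSPSolve           944             658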
I do see that the cluster run is faster than the shared-memory run. However,
I am not sure how to tell from these numbers that the memory subsystem is the
reason for the difference; I don't know what evidence to look for in the log.
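Would a simple memory-bandwidth test, run with different numbers of processes
on each machine, be a reasonable way to check the per-process bandwidth? I am
thinking of something like the sketch below (the array size, repetition count,
and file name are only my guesses at a sensible test, not taken from our code):

/* bw.c: rough triad-style memory bandwidth check per MPI process.
 * Everything here (array size, repetitions, file name) is an assumption,
 * not part of the finite element code under discussion. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define N    10000000   /* 10M doubles per array, ~240 MB per process in total */
#define REPS 10

int main(int argc, char **argv)
{
  int     rank, size, r;
  long    i;
  double *a, *b, *c, t0, t1, bytes, rate, minrate;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  a = (double *) malloc(N * sizeof(double));
  b = (double *) malloc(N * sizeof(double));
  c = (double *) malloc(N * sizeof(double));
  for (i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

  MPI_Barrier(MPI_COMM_WORLD);
  t0 = MPI_Wtime();
  for (r = 0; r < REPS; r++)
    for (i = 0; i < N; i++)
      a[i] = b[i] + 3.0 * c[i];         /* triad: 2 loads + 1 store per element */
  t1 = MPI_Wtime();

  bytes = 3.0 * sizeof(double) * (double) N * REPS;  /* bytes streamed by this process */
  rate  = bytes / (t1 - t0) / 1.0e6;                 /* MB/s for this process */

  /* the slowest process is what limits the overall run */
  MPI_Reduce(&rate, &minrate, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
  if (rank == 0)
    printf("%d processes: ~%.0f MB/s per process (slowest), check value %g\n",
           size, minrate, a[0]);

  free(a); free(b); free(c);
  MPI_Finalize();
  return 0;
}

I imagine compiling this with mpicc and running it with 1, 4, 8, and 16
processes on each machine would show whether the per-process rate drops much
more sharply on the shared-memory box than on the cluster.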
Thanks again.
Shi
--- Satish Balay <balay at mcs.anl.gov> wrote:
> There are 2 aspects to performance:
>
> - MPI performance [while message passing]
> - sequential performance for the numerical stuff.
>
> So it could be that the SMP box has better MPI performance. This can
> be verified with -log_summary from both the runs [and looking at
> VecScatter times].
>
> However, with the sequential numerical code it primarily depends
> upon the bandwidth between the CPU and the memory. On the SMP box,
> depending upon how the memory subsystem is designed, the effective
> memory bandwidth per CPU could be a small fraction of the peak memory
> bandwidth [when all CPUs are used].
>
> So you'll have to look at the memory subsystem design of each of these
> machines and compare the memory bandwidth per CPU. The performance
> from -log_summary, for example in MatMult, will reflect this
> [including the above communication overhead].
>
> Satish
>
> On Fri, 2 Feb 2007, Shi Jin wrote:
>
> > Hi there,
> >
> > I am fairly new to PETSc but have 5 years of MPI
> > programming experience already. I recently took on a project
> > of analyzing a finite element code written in C with PETSc.
> > I found out that on a shared-memory machine (60GB RAM,
> > 16 CPUs), the code runs around 4 times slower than
> > on a distributed-memory cluster (4GB RAM, 4 CPUs/node),
> > although they yield identical results.
> > There are 1.6 million finite elements in the problem, so
> > it is a fairly large calculation. The total memory
> > used is 3GB x 16 = 48GB.
> >
> > Both systems run Linux, and the same code is compiled
> > against the same version of MPICH-2 and PETSc.
> >
> > The shared-memory machine is actually a little faster
> > than the cluster machines in terms of single-process runs.
> >
> > I am surprised by this result, since we usually tend to
> > think that shared memory would be much faster because
> > in-memory operations are much faster than network
> > communication.
> >
> > However, I read the PETSc FAQ and found that "the
> > speed of sparse matrix computations is almost totally
> > determined by the speed of the memory, not the speed
> > of the CPU".
> > This makes me wonder whether the poor performance of
> > my code on the shared-memory machine is due to the
> > competition of different processes for the same memory
> > bus. Since the code is still MPI based, a lot of data
> > are moving around inside the memory. Is this a
> > reasonable explanation of what I observed?
> >
> > Thank you very much.
> >
> > Shi