PETSc runs slower on a shared memory machine than on a cluster

Satish Balay balay at mcs.anl.gov
Mon Feb 5 19:05:46 CST 2007


One more comment regarding single-core vs dual-core Opterons:

There are two ways to evaluate the performance: performance per core,
or performance for the price [of the machine].

Ideally we'd like the performance per core to be scalable [for
publishing pretty graphs]. However, a dual-core machine does not cost
twice as much as a single-core machine [it probably costs 10-30%
more]. So realistically, if one can get some factor of improvement in
performance with 16 cores vs 8 cores, one can consider the dual-core
machine as providing reasonable performance for the price.

Satish

On Mon, 5 Feb 2007, Satish Balay wrote:

> A couple of comments:
> 
> - With the dual-core Opteron, the memory bandwidth per core is now
> cut in half, so per-core performance suffers. However, memory
> bandwidth across the CPUs (sockets) is scalable [6.4 GB/s per socket,
> or 3.2 GB/s per core - see the bandwidth sketch after this list].
> 
> - The current-generation Intel Core 2 Duo appears to claim sufficient
> bandwidth [15.3 GB/s per socket = 7.6 GB/s per core?], so on that
> number alone this chip might do better than the AMD chip. However,
> I'm not sure there is an SMP built on this chip with a scalable
> memory system [across, say, 8 sockets, as you currently have..]
> 
> - Older Intel SMP boxes had a single memory bank shared across all
> the CPUs, so the effective bandwidth per CPU was pretty small. [The
> Opteron's scalable memory architecture looked much better than those
> older Intel SMPs.]
> 
> - From the previous -log_summary output, part of the inefficiency of
> the SMP box [when compared to the cluster] was in the MPI
> performance. Do you still see this effect in the '-np 8' runs? If
> so, this could be [part of] the reason for the 30% reduction in
> performance - see the example run right below.
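> 
> For instance, the pinned 8-process case could be rerun with the
> logging option appended [this is just the command Shi used, with
> -log_summary added; adjust the core list as needed]:
> 
>     petscmpirun -n 8 taskset -c 0,2,4,6,8,10,12,14 ../spAF -log_summary
> 
> and the MPI message and VecScatter numbers compared side by side
> with the cluster's -log_summary output.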
> 
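> As a rough back-of-the-envelope on why these bandwidth numbers
> matter: for a PETSc AIJ matrix-vector product, each nonzero moves
> about 12 bytes (an 8-byte value plus a 4-byte column index) for 2
> flops, so at ~3.2 GB/s per core the matvec tops out near 0.5 GF/s no
> matter how fast the CPU is [rough numbers, ignoring vector traffic].
> To get a measured figure for what one core can actually pull from
> memory, a small STREAM-style triad loop is enough. Below is a
> minimal sketch [not the real STREAM benchmark; the array size,
> repeat count and the 24-bytes-per-iteration accounting are my own
> rough choices]. Run one copy pinned to a single core, then two
> copies pinned to the two cores of the same chip, and compare.
> 
>     /* bw.c - rough per-core memory bandwidth probe (sketch only) */
>     #include <stdio.h>
>     #include <stdlib.h>
>     #include <sys/time.h>
> 
>     #define N 10000000            /* three 80 MB arrays - well past cache */
> 
>     int main(void)
>     {
>       double *a = malloc(N*sizeof(double));
>       double *b = malloc(N*sizeof(double));
>       double *c = malloc(N*sizeof(double));
>       struct timeval t0, t1;
>       double sec;
>       int i, rep;
> 
>       for (i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }
> 
>       gettimeofday(&t0, NULL);
>       for (rep = 0; rep < 10; rep++)
>         for (i = 0; i < N; i++)
>           a[i] = b[i] + 3.0*c[i];  /* triad: ~24 bytes of traffic/iter */
>       gettimeofday(&t1, NULL);
> 
>       sec = (t1.tv_sec - t0.tv_sec) + 1e-6*(t1.tv_usec - t0.tv_usec);
>       printf("check %g, approx bandwidth %.2f GB/s\n",
>              a[N/2], 10.0*24.0*N/sec/1e9);
>       free(a); free(b); free(c);
>       return 0;
>     }
> 
> Compile with 'gcc -O2 bw.c -o bw' and run it pinned, e.g.
> 'taskset -c 0 ./bw'.
> 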
> Satish
> 
> On Mon, 5 Feb 2007, Shi Jin wrote:
> 
> > Hi there,
> > 
> > I have made some new progress on the issue of SMP performance. My
> > shared-memory machine is an 8-socket dual-core Opteron machine,
> > and I think the two cores on a single CPU chip share the memory
> > bandwidth. Therefore, if I can avoid putting two processes on the
> > same chip, I should get some performance improvement. Indeed, I am
> > able to do this with the Linux command taskset.
> > Here is what I did:
> > petscmpirun -n 8 taskset -c 0,2,4,6,8,10,12,14 ../spAF
> > This way, I specifically ask the processes to run on the first
> > core of each CPU.
> > By doing this, my performance is doubled compared with the simple
> > petscmpirun -n 8 ../spAF
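> > 
> > [A note on what this invocation actually does: each of the 8 ranks
> > gets the whole 0,2,...,14 mask rather than a one-to-one binding,
> > so it is the Linux scheduler that spreads them over distinct
> > cores. 'taskset -cp <pid>' on one of the running spAF processes -
> > the pid here is just a placeholder - prints the affinity list each
> > rank ended up with.]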
> > 
> > So this test shows that we do suffer from multiple processes
> > competing for resources, especially when we use 16 processes.
> > 
> > However, I should point out that even with the help of taskset,
> > the shared-memory performance is still 30% less than that on the
> > cluster.
> > 
> > I am not sure whether this problem is specific to AMD machines or
> > whether it applies to any shared-memory architecture.
> > 
> > Thanks.
> > Shi
> > 
> > --- Shi Jin <jinzishuai at yahoo.com> wrote:
> > 
> > > Hi there,
> > > 
> > > I am fairly new to PETSc but already have 5 years of MPI
> > > programming experience. I recently took on a project of
> > > analyzing a finite element code written in C with PETSc.
> > > I found that on a shared-memory machine (60 GB RAM, 16 CPUs),
> > > the code runs around 4 times slower than on a distributed-memory
> > > cluster (4 GB RAM, 4 CPUs/node), although they yield identical
> > > results.
> > > There are 1.6 million finite elements in the problem, so it is a
> > > fairly large calculation. The total memory used is
> > > 3 GB x 16 = 48 GB.
> > > 
> > > Both systems run Linux, and the same code is compiled against
> > > the same versions of MPICH2 and PETSc.
> > > 
> > > The shared-memory machine is actually a little faster than the
> > > cluster machines for single-process runs.
> > > 
> > > I am surprised at this result, since we usually tend to think
> > > that shared memory would be much faster because in-memory
> > > operations are much faster than network communication.
> > > 
> > > However, I read the PETSc FAQ and found that "the speed of
> > > sparse matrix computations is almost totally determined by the
> > > speed of the memory, not the speed of the CPU".
> > > This makes me wonder whether the poor performance of my code on
> > > a shared-memory machine is due to different processes competing
> > > for the same memory bus. Since the code is still MPI based, a
> > > lot of data is moving around in memory. Is this a reasonable
> > > explanation of what I observed?
> > > 
> > > Thank you very much.
> > > 
> > > Shi