PETSc runs slower on a shared memory machine than on a cluster
Shi Jin
jinzishuai at yahoo.com
Wed Feb 7 10:27:48 CST 2007
Thank you very much, Satish.
You are right. From the log_summary, the communication actually takes
slightly more time on the shared-memory machine than on the cluster,
even after using taskset. This is still hard to understand, since I
would think in-memory operations have to be orders of magnitude faster
than network operations (gigabit Ethernet).
By the way, I took a look at the specs of my shared-memory machine
(Sun Fire 4600 server). It seems that each CPU socket has its own
DIMMs of RAM. I wonder if there is a speed penalty when a process has
to fetch data from the RAM attached to another CPU (a NUMA effect).
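
In case it is useful, here is a minimal sketch of how one could test
that NUMA effect with libnuma. I have not run this on the Sun box;
the 512 MB buffer size and the simple gettimeofday timing are just
reasonable guesses:

/* numa_sweep.c: time streaming reads from each NUMA node's memory.
 * Sketch only; compile with: gcc -O2 numa_sweep.c -lnuma
 * Run pinned to one core, e.g.: taskset -c 0 ./a.out
 * so that one node is local and the others are remote. */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

static double wtime(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

int main(void) {
    size_t i, n = 64 * 1024 * 1024;   /* 512 MB per buffer */
    int node;
    if (numa_available() < 0) { fprintf(stderr, "no NUMA\n"); return 1; }
    for (node = 0; node <= numa_max_node(); node++) {
        double *buf = numa_alloc_onnode(n * sizeof(double), node);
        double t, s = 0.0;
        if (!buf) { perror("numa_alloc_onnode"); return 1; }
        for (i = 0; i < n; i++) buf[i] = 1.0;      /* fault pages in */
        t = wtime();
        for (i = 0; i < n; i++) s += buf[i];       /* stream through it */
        t = wtime() - t;
        printf("node %d: %.0f MB/s (sum=%g)\n",
               node, n * sizeof(double) / t / 1e6, s);
        numa_free(buf, n * sizeof(double));
    }
    return 0;
}

If the remote nodes come out much slower than the local one, that
would confirm the penalty for crossing sockets.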
Thanks.
Shi
--- Satish Balay <balay at mcs.anl.gov> wrote:
> A couple of comments:
>
> - With the dual-core Opteron, the memory bandwidth per core is now
>   reduced by half, so the performance suffers. However, memory
>   bandwidth across CPUs is scalable [6.4 GB/s per socket, i.e.
>   3.2 GB/s per core].
>
> - The current-generation Intel Core 2 Duo appears to claim
>   sufficient bandwidth [15.3 GB/s per node = 7.6 GB/s per core?], so
>   going by this bandwidth number the chip might do better than the
>   AMD chip. However, I'm not sure there is an SMP with this chip
>   that has a scalable memory system [across, say, 8 sockets, as you
>   currently have].
>
> - Older Intel SMP boxes had a single memory bank shared across all
>   the CPUs [so the effective bandwidth per CPU was pretty small; the
>   Opterons' scalable architecture looked much better than those
>   older Intel SMPs].
>
> - From the previous log_summary, part of the inefficiency of the
>   SMP box [when compared to the cluster] was in the MPI performance.
>   Do you still see this effect in the '-np 8' runs? If so, this
>   could be part of the reason for the 30% reduction in performance;
>   the ping-pong sketch below is one way to isolate it.
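>
> A ping-pong between two ranks, run once on two cores of the SMP box
> and once across two cluster nodes, gives a direct comparison of the
> two MPI paths. A minimal sketch [the 800 KB message size is an
> arbitrary choice; match it to what your code actually sends]:
>
> /* pingpong.c: round-trip time/bandwidth between ranks 0 and 1.
>  * Sketch only; compile with: mpicc -O2 pingpong.c -o pingpong */
> #include <mpi.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(int argc, char **argv) {
>     int rank, i, reps = 1000, n = 100000;   /* 800 KB messages */
>     double *buf, t;
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     buf = calloc(n, sizeof(double));
>     MPI_Barrier(MPI_COMM_WORLD);
>     t = MPI_Wtime();
>     for (i = 0; i < reps; i++) {
>         if (rank == 0) {
>             MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
>             MPI_Recv(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
>                      MPI_STATUS_IGNORE);
>         } else if (rank == 1) {
>             MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
>                      MPI_STATUS_IGNORE);
>             MPI_Send(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
>         }
>     }
>     if (rank == 0) {
>         t = MPI_Wtime() - t;
>         printf("round trip %.1f us, one-way %.1f MB/s\n",
>                1e6 * t / reps, 2.0 * reps * n * sizeof(double) / t / 1e6);
>     }
>     MPI_Finalize();
>     return 0;
> }
>
> If the SMP numbers are not clearly better than gigabit Ethernet, it
> would be worth checking which channel your MPICH2 was built with
> [ch3:sock is the default; ch3:shm and ch3:ssm use shared memory
> within a node].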
>
> Satish
>
> On Mon, 5 Feb 2007, Shi Jin wrote:
>
> > Hi there,
> >
> > I have made some new progress on the issue of SMP performance.
> > My shared-memory machine is an 8-socket dual-core Opteron machine,
> > and I think the two cores on a single CPU chip share that chip's
> > memory bandwidth. Therefore, if I avoid placing two processes on
> > the same chip, I should get some performance improvement. Indeed,
> > I am able to do this with the Linux command taskset.
> >
> > Here is what I did:
> >   petscmpirun -n 8 taskset -c 0,2,4,6,8,10,12,14 ../spAF
> > This way, I specifically ask the processes to run only on the
> > first core of each CPU. By doing this, my performance is doubled
> > compared with the simple 'petscmpirun -n 8 ../spAF'. (The sketch
> > below shows how to verify the resulting CPU masks.)
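> >
> > To check what taskset actually did, each rank can print the CPU
> > set it is allowed to run on. A minimal sketch (Linux-specific;
> > this is not part of spAF):
> >
> > /* affinity.c: print each MPI rank's allowed CPU set (Linux only).
> >  * Sketch only; compile with: mpicc -O2 affinity.c -o affinity */
> > #define _GNU_SOURCE
> > #include <sched.h>
> > #include <stdio.h>
> > #include <mpi.h>
> >
> > int main(int argc, char **argv) {
> >     int rank, cpu;
> >     cpu_set_t mask;
> >     MPI_Init(&argc, &argv);
> >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >     sched_getaffinity(0, sizeof(mask), &mask);   /* 0 = self */
> >     printf("rank %d may run on:", rank);
> >     for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
> >         if (CPU_ISSET(cpu, &mask)) printf(" %d", cpu);
> >     printf("\n");
> >     MPI_Finalize();
> >     return 0;
> > }
> >
> > Note that taskset used this way gives every rank the same mask
> > {0,2,...,14}; the kernel may still migrate ranks within that set.
> > It only keeps them off the second core of each chip.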
> >
> > So this test shows that we do suffer from competition for
> > resources among multiple processes, especially when we use 16
> > processes.
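> >
> > One can put a number on that competition with a rough STREAM-style
> > triad loop (a sketch only, not the official STREAM benchmark): run
> > one copy pinned to core 0, then two copies pinned to the two cores
> > of the same chip, and watch the per-copy bandwidth drop.
> >
> > /* triad.c: crude estimate of sustainable memory bandwidth.
> >  * Sketch only; compile with: gcc -O2 triad.c -o triad
> >  * e.g.: taskset -c 0 ./triad                          (one core)
> >  *       taskset -c 0 ./triad & taskset -c 1 ./triad   (same chip) */
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <sys/time.h>
> >
> > #define N    20000000
> > #define REPS 10
> >
> > int main(void) {
> >     double *a = malloc(N * sizeof(double));
> >     double *b = malloc(N * sizeof(double));
> >     double *c = malloc(N * sizeof(double));
> >     struct timeval t0, t1;
> >     double sec;
> >     int i, r;
> >     for (i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }
> >     gettimeofday(&t0, NULL);
> >     for (r = 0; r < REPS; r++)
> >         for (i = 0; i < N; i++)
> >             a[i] = b[i] + 3.0 * c[i];   /* ~24 bytes moved per iter */
> >     gettimeofday(&t1, NULL);
> >     sec = (t1.tv_sec - t0.tv_sec) + 1e-6 * (t1.tv_usec - t0.tv_usec);
> >     printf("triad: %.2f GB/s (a[0]=%g)\n",
> >            24.0 * N * REPS / sec / 1e9, a[0]);
> >     return 0;
> > }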
> >
> > However, I should point out that even with the help of taskset,
> > the shared-memory performance is still 30% less than that on the
> > cluster.
> >
> > I am not sure whether this problem is specific to AMD machines or
> > whether it applies to any shared-memory architecture.
> >
> > Thanks.
> > Shi
> >
> > --- Shi Jin <jinzishuai at yahoo.com> wrote:
> >
> > > Hi there,
> > >
> > > I am fairly new to PETSc but have 5 years of MPI programming
> > > experience. I recently took on a project of analyzing a finite
> > > element code written in C with PETSc.
> > > I found that on a shared-memory machine (60 GB RAM, 16 CPUs)
> > > the code runs around 4 times slower than on a distributed-memory
> > > cluster (4 GB RAM, 4 CPUs per node), although the two yield
> > > identical results.
> > > There are 1.6 million finite elements in the problem, so it is a
> > > fairly large calculation. The total memory used is
> > > 3 GB x 16 = 48 GB.
> > >
> > > Both systems run Linux, and the same code is compiled against
> > > the same versions of MPICH2 and PETSc.
> > >
> > > The shared-memory machine is actually a little faster than the
> > > cluster machines in terms of single-process runs.
> > >
> > > I am surprised at this result, since we usually tend to think
> > > that shared memory would be much faster, in-memory operations
> > > being much faster than network communication.
> > >
> > > However, I read the PETSc FAQ and found that "the speed of
> > > sparse matrix computations is almost totally determined by the
> > > speed of the memory, not the speed of the CPU".
> > > This makes me wonder whether the poor performance of my code on
> > > the shared-memory machine is due to different processes
> > > competing for the same memory bus. Since the code is still MPI
> > > based, a lot of data is moving around in memory. Is this a
> > > reasonable explanation of what I observed?
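> > >
> > > The FAQ claim can be seen from the arithmetic intensity of a
> > > sparse matrix-vector product. Here is a sketch of a CSR kernel
> > > (PETSc's AIJ format is CSR-like, but this is illustrative code,
> > > not PETSc's implementation):
> > >
> > > /* y = A*x for a matrix in CSR form. Each nonzero costs 2 flops
> > >  * but moves at least 12 bytes (8-byte value + 4-byte column
> > >  * index), so at ~3 GB/s of memory bandwidth one core cannot
> > >  * exceed roughly 0.5 Gflop/s here, regardless of CPU clock. */
> > > void spmv_csr(int nrows, const int *rowptr, const int *col,
> > >               const double *val, const double *x, double *y)
> > > {
> > >     int i, j;
> > >     for (i = 0; i < nrows; i++) {
> > >         double sum = 0.0;
> > >         for (j = rowptr[i]; j < rowptr[i + 1]; j++)
> > >             sum += val[j] * x[col[j]];
> > >         y[i] = sum;
> > >     }
> > > }
> > >
> > > Two processes sharing one memory bus would each see roughly half
> > > of that bound, which is the kind of competition I am asking
> > > about.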
> > >
> > > Thank you very much.
> > >
> > > Shi