[mpich-discuss] Why is my quad core slower than cluster

Gus Correa gus at ldeo.columbia.edu
Wed Jul 9 16:29:25 CDT 2008


Hi Zach and list

zach wrote:

>Hi,
>
>Thanks for the info.
>Something is really wrong because comparing 1 to 4 core utilization on the code
>on my home pc, the speed increase is *tiny*.
>  
>
Well, I haven't tried an Intel Core 2 quad-core (I think this is what you 
have, right?).
The dual-core Xeon and Opteron machines I tried do speed up, but 
sub-linearly, and the Opteron is somewhat better, as I mentioned to you.

I would try running the program with 1 core, then 2 cores, and so on,
and monitor the performance with "top". There may be a turning point,
or in the worst case it is a downhill trip.
Worth checking anyway.
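A sketch of that sweep, in case it helps. The "./mysim" name and the bare
"mpirun" are placeholders for your own binary and your MPICH2 install path,
not anything from your setup:

```shell
# Placeholders: point SIM at your MPI binary and MPIRUN at the full path
# of your MPICH2 mpirun before running.
SIM=${SIM:-./mysim}
MPIRUN=${MPIRUN:-mpirun}
for n in 1 2 4; do
    echo "--- $n process(es) ---"
    start=$(date +%s)
    "$MPIRUN" -n "$n" "$SIM" || { echo "(launch failed; set SIM/MPIRUN)"; continue; }
    echo "elapsed: $(( $(date +%s) - start ))s"
done
```

Watch "top" in another terminal during each pass: ideally you see n
processes near 100% CPU, all in state R (running).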

>I am pretty sure my home pc recognizes all the ram and when 
>
If it recognizes the whole 8GB, then top should show this (in kBytes) on 
its header lines.
How much does it show?

Top reads this info from /proc/meminfo, which you can also check directly.

Likewise, check the number of actual cpus in /proc/cpuinfo.

Did you check these two things?
Don't take them for granted!
We had both problems here, in different computers, different occasions,
and for different reasons.
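On any Linux box the two checks amount to one-liners (assuming a standard
/proc filesystem):

```shell
# Total RAM the kernel actually sees, in kB (8GB should show roughly 8000000):
grep MemTotal /proc/meminfo
# Number of logical CPUs the kernel sees (should be 4 on a quad-core):
grep -c '^processor' /proc/cpuinfo
```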

>When I run top, my sim is the only one really taking up the majority of the resources.
>  
>
I would try to run this at runlevel 3.
Change the /etc/inittab line

id:5:initdefault:

to

id:3:initdefault:

(Change back later, to restore runlevel 5.)

>I see four processes at 98% CPU (or more) for 4 processes. 
>
Do you see them running simultaneously on top, or alternating in-and-out 
there?
Are some processes runnable and others sleeping perhaps?

That top shows four processes does not guarantee that there are four 
active cores.
You can easily oversubscribe physical processors or cores by launching 
mpirun with more processes than processors/cores.
E.g., "mpirun -n 10 hello_world" is a pretty common test on
machines with fewer than 10 cpus/cores. And it works!
Linux is a multitasking OS, and each MPI instance of your program is a 
task.
To check for four active cores you need to look at /proc/cpuinfo.
(Note that oversubscribing processors leads to bad performance,
and for large programs, to crashes.)
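A quick guard to catch this before launching. The NP value here is just an
example; set it to whatever you plan to pass to "mpirun -n":

```shell
# Compare the planned process count against the cores the kernel reports.
NP=4                                          # what you plan to pass to "mpirun -n"
NCORES=$(grep -c '^processor' /proc/cpuinfo)  # logical cores Linux sees
if [ "$NP" -gt "$NCORES" ]; then
    echo "warning: $NP processes on $NCORES cores -- they will time-slice"
else
    echo "$NP processes fit on $NCORES cores"
fi
```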

>Memory at
>like 11% or so...
>  
>

What memory percent does "top" show on the cluster?
Are the two values compatible?

>One thing I noticed is when I type
>"which mpirun"
>"which mpd"
>--The ones in /usr/bin come up.
>I worked around this by using absolute path to the mpich install dir
>versions when running mpd & before a sim and starting a sim with
>mpirun.
>  
>
That is a major and very common source of confusion.
You can find many cases reported (and fixed) on this list.
Linux distributions and compilers come with their own versions of MPI,
which tend to be on your search path ahead of the MPICH2 that you install.

Make sure you configured MPICH2 on your home PC and on the cluster with 
the same compilers and options.
Make sure you compiled on both computers with the right MPICH2 mpicc (or 
mpif77, etc.), that you launched the right MPICH2 mpd, and that you use 
the right MPICH2 mpirun/mpiexec on the two computers.
All these matter.
If you have the slightest doubt that something may have been mixed up,
I suggest that you start over from scratch, rather than trying to find 
the needle in the haystack.
I had the same problem too, misled by the large population of MPI 
flavors available on Linux boxes.
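One way to audit this is to print where each tool resolves on PATH; all of
them should point into the same MPICH2 tree (the tool list is illustrative,
adjust it to what you actually use):

```shell
# Show which MPI binaries the shell would actually pick up. If any of these
# resolve to /usr/bin instead of your MPICH2 prefix, the distribution's MPI
# is shadowing your install.
for tool in mpicc mpif77 mpirun mpiexec mpd; do
    path=$(command -v "$tool" 2>/dev/null) || path="(not on PATH)"
    echo "$tool -> $path"
done
```

MPICH2's compiler wrappers also accept "mpicc -show", which prints the
underlying compiler and flags, handy for confirming that both machines were
built the same way.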

>Is it possible that the path setup is still causing an issue in the
>communication?
>
Yes, for instance, if you are using the wrong mpirun or the wrong mpd.
I wouldn't put all my bets on this, though.
It should be carefully checked, but it is not the only suspect.
It may be something else.

>Also, I am not sure how mpd and ssh are related, but do I need to
>configure ssh settings in the mpich install in any way for a quad core
>box?
>
I am not an expert, but I would guess mpich and mpd will largely bypass 
ssh on a standalone PC like yours.
The MPICH2 developers will be able to clarify this point better.

A final issue that just came to mind.
How big is the actual computation?
How long is the walltime?
If the problem is too small, if the walltime is very short, the whole 
computation may be dominated by non-parallel tasks (initialization, etc.),
and won't scale well with the number of processors (Amdahl's law).
In many cases you can easily increase the problem/computation size to 
work around this.
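To see how sharply the serial fraction bites: Amdahl's law puts the best
possible speedup on n cores at 1/(s + (1-s)/n), where s is the serial
fraction of the runtime. The fractions below are made-up illustrations, not
measurements of your code:

```shell
# Upper bound on speedup for a few hypothetical serial fractions s, on 4 cores.
for s in 0.05 0.25 0.50; do
    awk -v s="$s" 'BEGIN { n = 4; printf "s=%.2f -> max speedup %.2f on %d cores\n", s, 1/(s + (1-s)/n), n }'
done
```

Even a 25% serial fraction caps a 4-core run below 2.3x, which is why
growing the problem size (shrinking s) often restores scaling.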

Hang in there, Zach.
Take a breath, freshen up your mind, play some ping-pong in the gym,
then go back to the problem,  but don't give up.
The reason for the performance drop may not be fixable, but it should be 
identifiable.

Gus

>Zach
>
>On 7/9/08, Gus Correa <gus at ldeo.columbia.edu> wrote:
>  
>
>>Hi Zach and list
>>
>>Well, re-reading your message with more attention I saw your
>>estimate of 1/3 speed on your home computer.
>>That is indeed on the low side, although I would expect a multi-core machine
>>to run slower than a cluster of single-cores, especially if the programs are
>>memory-intensive.
>>A lower performance factor on the ballpark of 0.6-0.8 is what I've seen so
>>far on
>>memory-intensive programs (climate models), but not your low value of
>>1/3=0.33.
>>
>>In any case, you are sure all four cores are working on your home PC,
>>and the SMP kernel is running.
>>
>>Besides, you seem to have installed the MPICH2 nemesis device in your home
>>PC,
>>and hopefully on the cluster too, as recommended by Rajeev
>>
>>In addition, you don't seem to have a problem with memory size (which would
>>trigger
>>paging), because you said the total memory in your PC and in the cluster is
>>the same.
>>I would check the amount of memory in /proc/meminfo anyway (look for
>>MemTotal). Some memory modules are tricky,
>>and have to be seated in matched slots in order to be recognized.
>>We had this problem here with a Dell computer.
>>The computer manual diagrams were misleading,
>>and it took a few attempts to get the memory slots right, when we upgraded
>>it.
>>Before that, a system with 8GB of memory installed would recognize only 2GB.
>>
>>So, if all the considerations above are correct, a number of possibilities
>>are removed.
>>Let's try other possibilities.
>>
>>What is the other (concurrent) activity in your home PC that runs along
>>with your mpich program?
>>I have experienced significant performance degradation of MPI programs
>>running in standalone PCs when other users login and start their programs.
>>The memory-greedy Matlab is the first killer, but fancy desktops, intense
>>web browsing,
>>streaming video and music, etc, can compete with the mpich program for
>>memory and CPU cycles,
>>to the point that the mpich program can't really work,
>>and spends most of the time switching context in/out.
>>HPC and interactive workstation use don't really mix well.
>>You can monitor this activity with the "top" command on your PC,
>>and compare it with what you get from "top" on your cluster nodes.
>>
>>I would also suggest starting the system at runlevel 3 (no X-windows),
>>and running the mpich program alone, if you want to make a fair performance
>>comparison between your home PC and your office cluster (whose nodes are
>>likely to be at runlevel 3 and be dedicated to run the mpich program only).
>>
>>Also, a fair comparison should take into account the cpu speeds of each
>>computer.
>>A 3.6GHz processor works faster than a 2.8GHz of similar architecture.
>>Since both computers you use have Intel processors (comparing Intel with AMD
>>seems to be more complicated),
>>maybe you can just look at the raw processor speeds
>>in /proc/cpuinfo (look for cpu MHz), and factor in the ratio of these values
>>on both computers,
>>when you compare their performance.
>>
>>I hope this helps.
>>
>>Gus Correa
>>
>>--
>>---------------------------------------------------------------------
>>Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
>>Lamont-Doherty Earth Observatory - Columbia University
>>P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
>>Oceanography Bldg., Rm. 103-D, ph. (845) 365-8911, fax (845) 365-8736
>>---------------------------------------------------------------------
>>
>>zach wrote:
>>
>>>Thanks for the info.
>>>I tried all of these things but it does not look like it gave any improvement.
>>>
>>>Zach
>>>
>>>On Tue, Jul 8, 2008 at 2:52 PM, Gus Correa <gus at ldeo.columbia.edu> wrote:
>>>
>>>>PS:
>>>>
>>>>Zach: A couple of obvious checks, besides Rajeev's important suggestion:
>>>>
>>>>1) Make sure the SMP kernel is running on your home PC:
>>>>"uname -a"
>>>>(Should show "smp" as part of the string.)
>>>>
>>>>2) Check if Ubuntu triggers all four cores:
>>>>"cat /proc/cpuinfo". (Should show four "virtual" CPUs.)
>>>>
>>>>Gus Correa
>>>>
>>>>##########
>>>>
>>>>Rajeev Thakur wrote:
>>>>
>>>>Try using the Nemesis device in MPICH2 if you aren't already. Configure with
>>>>--with-device=ch3:nemesis.
>>>>
>>>>Rajeev
>>>>
>>>>
>>>>Gus Correa wrote:
>>>>
>>>>>Hello  Zach and list
>>>>>
>>>>>From all that I've observed on dual-processor dual-core PCs,
>>>>>and from all that I've read on the web about dual-processor quad-core
>>>>>machines,
>>>>>your results are not alarming, but typical.
>>>>>I was as disappointed as you are, when I saw my speedup results.
>>>>>A lot of people out there had the same frustration too.
>>>>>
>>>>>My benchmarks using a standard climate atmospheric model (NCAR CAM3) on
>>>>>a dual-processor dual-core Xeon workstation showed a speedup factor of 3
>>>>>(not 4),
>>>>>when I moved from one core to four cores.
>>>>>Likewise for a dual-processor dual-core Opteron workstation,
>>>>>I've got a speedup factor slightly below 3.5. (Better than Xeon, but still
>>>>>not 4).
>>>>>
>>>>>The problem seems to get worse with quad-cores, again with the Opterons
>>>>>slightly ahead of the game.
>>>>>Memory/bus contention has been mentioned as the culprit by a lot of
>>>>>people.
>>>>>One core in a multicore doesn't scale as one (single-core) CPU.
>>>>>
>>>>>You will find plenty of references to this problem on the web and on many
>>>>>mailing lists:
>>>>>here in the MPICH list, on the Rocks Cluster list, on the MITgcm list,
>>>>>etc, etc.
>>>>>
>>>>>I hope it heals (as helping it cannot)
>>>>>Gus Correa
>>>>>
>>>>zach wrote:
>>>>
>>>>>I am using a cluster.
>>>>>Each PC has two CPUs, and they are Xeons. Each CPU has 4GB; I think Red
>>>>>Hat is running.
>>>>>
>>>>>I also use a pc at home- quad core intel chip, 8gb ram, ubuntu.
>>>>>
>>>>>Both are using mpich.
>>>>>
>>>>>I have found that my home pc is only running about 1/3 the speed of
>>>>>the cluster, and the number of processes (4) and code is the same.
>>>>>
>>>>>Can anyone tell me if this is typical, and why, or am I not optimizing
>>>>>something properly?
>>>>>
>>>>>Thanks
>>>>>Zach



