[mpich-discuss] A general question to the community on data sharing for multi core nodes

Hiatt, Dave M dave.m.hiatt at citi.com
Tue Aug 24 11:01:02 CDT 2010


What kinds of approaches are others using to maximize performance on multi-core, multi-processor systems?  For performance reasons I'd really like to have only one MPI process running on each node and then communicate data among the workers on each node in the most efficient way.

I've got a bunch of twin i7 nodes.  For a while I was able to get by just using the round-robin approach and assigning enough processes so that each core was busy.  The advantage was that it was easy and let me be lazy.  But as the development has matured, the volume of data has begun to badly clog the cluster's bandwidth.  That's because I have 8 cores per node, so the same data goes to each process (i.e., each core) and is therefore sent 8 times over the network and then held 8 times in memory, which is expensive too.  The root cause is that a particular component of the data is common to every MPI process, so replicating it is wasteful.  We knew this would be an issue eventually, and eventually has become now.
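To make the current pattern concrete, every rank ends up with its own full copy of the common block, roughly like the sketch below.  The block name and size are placeholders, and whether it arrives by broadcast or individual sends, the result is the same: 8 private copies per node.

#include <mpi.h>
#include <vector>

// Sketch of the current distribution: every rank receives and keeps a
// full private copy of the common data block.
void distribute_common_block(std::vector<double>& common_block)
{
    // All ranks have already resized common_block to the agreed length;
    // rank 0 has filled it in.
    MPI_Bcast(&common_block[0],
              static_cast<int>(common_block.size()),
              MPI_DOUBLE, 0, MPI_COMM_WORLD);
}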

Right now, to start with, I've separated the MPI calls out into a single thread in the process, so the data volume is very reasonable and network performance is fine, and I have only one copy of the large block of shared data.  In my current design I then fork off threads to fill up the cores.  This works, but there is a moderate amount of contention even after using a number of performance analysis tools to find hot spots, and the threading has, in my opinion, introduced logic "clutter" which is distracting.  That said, if threading is the best solution, then the threading stays.  But the performance issue seems to have its roots in OpenMP, the threading model I'm using (I am not using critical sections - they were too expensive in terms of contention, so I went to individual locks).  I'm using OpenMP because this system runs on both Linux and Windows and I want a common denominator for how threading is implemented.  So I'm looking to see if there is a better solution than this.  Hence the following questions.
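For reference, the locking pattern I ended up with looks roughly like the sketch below.  It's a minimal illustration, not my actual code - the bucket structure and the update are placeholders - but it shows the per-object OpenMP locks I'm using in place of critical sections:

#include <omp.h>
#include <utility>
#include <vector>

// One lock per bucket instead of a single critical section, so threads
// only contend when they actually touch the same bucket.
// (Bucket count and the update itself are placeholders.)
const int NUM_BUCKETS = 64;
static omp_lock_t bucket_locks[NUM_BUCKETS];
static double     bucket_totals[NUM_BUCKETS];

void init_locks()
{
    for (int i = 0; i < NUM_BUCKETS; ++i)
        omp_init_lock(&bucket_locks[i]);
}

void accumulate(const std::vector<std::pair<int, double> >& work)
{
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(work.size()); ++i) {
        int b = work[i].first % NUM_BUCKETS;
        omp_set_lock(&bucket_locks[b]);     // lock only this bucket
        bucket_totals[b] += work[i].second;
        omp_unset_lock(&bucket_locks[b]);
    }
}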

Would creating one independent communicator for each node, and communicating within that node using a local communicator, be better than using, say, pipes, or than trying to referee with the OpenMP locks?  I would then have to make an "inter-communicator" on the real MPI_COMM_WORLD to pass data only to each node's local rank 0, and those ranks would in turn use intra-communicators to communicate locally (a rough sketch of what I mean is below).  Or would it perhaps be better to use "named pipes" - which I understand on Windows, but for my Linux components I'm not as well versed - and leave MPI to moving the freight between nodes?  If so, does anyone know if there is a Boost or GNU package out there that creates named pipes for Linux?  I looked, and either I'm just malfunctioning mentally or there isn't one.  Can anyone help there?  It looks to me that in the Linux space I need to create a socket-level class to emulate the "named pipes" that exist in Windows.  Or am I missing something?  This is all in C++, by the way.
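Here is a rough sketch of the per-node communicator idea, just to make it concrete - the host-name hash and the "node leader" convention are placeholders, not working code from my system:

#include <mpi.h>

// Split MPI_COMM_WORLD into one communicator per node by hashing the
// processor name.  Rank 0 of each node communicator would be the local
// "node 0" that talks to the other nodes over MPI_COMM_WORLD.
int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    char name[MPI_MAX_PROCESSOR_NAME];
    int  name_len;
    MPI_Get_processor_name(name, &name_len);

    // Crude color: hash the host name so ranks on the same node get the
    // same color.  A real version would need to guard against collisions.
    unsigned int hash = 0;
    for (int i = 0; i < name_len; ++i)
        hash = hash * 31u + static_cast<unsigned char>(name[i]);
    int color = static_cast<int>(hash & 0x7fffffffu);

    MPI_Comm node_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &node_comm);

    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    // Node leaders (node_rank == 0) would receive the large common block
    // over MPI_COMM_WORLD and then broadcast it once within the node, e.g.
    //   MPI_Bcast(buffer, count, MPI_BYTE, 0, node_comm);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}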

Or are others attacking this problem with better ideas than these?  If so, could you share them?

"People get held back by the voice inside em" - K'naan - In the Beginning

Dave M. Hiatt
Director, Risk Analytics
CitiMortgage
1000 Technology Drive
O'Fallon, MO 63368-2240

Telephone:  636-261-1408
