[mpich2-dev] More ROMIO performance questions

Bob Cernohous bobc at us.ibm.com
Tue Sep 15 13:28:18 CDT 2009


Yes, I meant cb_buffer_size.
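
For what it's worth, this is roughly how we pass the hints in before the
collective open (a minimal sketch only; the file name and hint values are
placeholders, not what the customer actually runs):

#include <mpi.h>

/* Sketch: set ROMIO collective-buffering hints before the collective open.
 * File name and values below are placeholders. */
int open_with_cb_hints(MPI_Comm comm, MPI_File *fh)
{
    MPI_Info info;
    MPI_Info_create(&info);

    /* cb_buffer_size: size of the collective buffer on each aggregator */
    MPI_Info_set(info, "cb_buffer_size", "16777216");   /* 16 MiB */
    /* cb_nodes: number of aggregator processes doing the actual I/O */
    MPI_Info_set(info, "cb_nodes", "8");

    int rc = MPI_File_open(comm, "testfile",
                           MPI_MODE_CREATE | MPI_MODE_WRONLY, info, fh);
    MPI_Info_free(&info);
    return rc;
}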

I believe it was tested lockless, but I'll try to verify that.

I'm not completely assuming a high performance file system (HPFS).  What 
about NFS?  Or even with an HPFS, BG has "i/o poor" racks with 512 cores 
(writers) per 1 i/o node (HPFS client).  Would an equivalent non-BG setup, 
a single server (HPFS client) with 512 processes (writers), have the same 
problems?

Do we just accept less performance in these scenarios?  Is it all up to 
the (single) HPFS client to figure it out?  Is it all HPFS configuration 
and they got it wrong?  They only saw the problem when, for example, all 
512 cores wrote to the same file.  All 512 cores writing to different 
files worked well... so I guess it could be client configuration or 
limited resources.

Or does an option to throttle back individual writers that all funnel 
through a single point (client/server) make any sense at all?  Collective 
i/o through N aggregators per i/o node works pretty well if you get the 
hints right, but with more overhead than their customized flow control of 
N writers per i/o node (roughly the kind of thing sketched below).

I thought about shared files.  If they opened N shared files per pset, 
they would basically get N concurrent writers/readers, something like the 
sketch below.  That would just be another way to coordinate, like their 
customized testcase.
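
Something along these lines (again just a sketch; the communicator split
and the file naming are made up for illustration, and it leans on the
ordered-mode routines Rob mentions below):

#include <mpi.h>
#include <stdio.h>

/* Sketch: split a pset communicator into n_files groups, each sharing
 * one file through the shared file pointer.  Ordered-mode writes keep
 * each group's data in rank order, and the groups proceed concurrently.
 * The communicator, n_files and the file-name scheme are illustrative. */
void write_n_shared_files(MPI_Comm pset_comm, int n_files,
                          const void *buf, int count)
{
    int rank;
    MPI_Comm_rank(pset_comm, &rank);

    MPI_Comm group;
    MPI_Comm_split(pset_comm, rank % n_files, rank, &group);

    char fname[64];
    snprintf(fname, sizeof(fname), "shared_file.%d", rank % n_files);

    MPI_File fh;
    MPI_File_open(group, fname, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);

    /* Collective ordered-mode write: each rank's block lands after the
     * blocks written by the lower ranks in its group. */
    MPI_File_write_ordered(fh, buf, count, MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Comm_free(&group);
}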

Bob Cernohous:  (T/L 553) 507-253-6093

BobC at us.ibm.com
IBM Rochester, Building 030-2(C335), Department 61L
3605 Hwy 52 North, Rochester,  MN 55901-7829

> Chaos reigns within.
> Reflect, repent, and reboot.
> Order shall return.


Rob Latham <robl at mcs.anl.gov> wrote on 09/14/2009 05:48 PM:

> On Mon, Sep 14, 2009 at 04:57:47PM -0500, Bob Cernohous wrote:
> > 
> > We tried to tune it with hints for cb_block_size and get ok
> > performance when we can avoid read/write data sieving.
> 
> I'm sure you must have meant "cb_buffer_size" ? 
> 
> > They customized the testcase to coordinate/flow-control the
> > non-collective i/o and they get great performance.  They only have N
> > simultaneous writers/readers active.  They pass a token around and
> > take turns.  It's almost like having N aggregators but without the
> > collective i/o overhead to pass the data around.  Instead they pass a
> > small token and take turns writing the large, non-interleaved
> > contiguous data blocks.
> > 
> > I'm not aware of anything in MPI-IO or ROMIO that would do this?  Has
> > this been explored by the experts (meaning you guys)?
> 
> In the ordered mode routines, we pass a token around to ensure that
> processes write/read in rank order.  (This is actually a pretty naive
> way to implement ordered mode, but until very recently nobody seemed
> too concerned about shared file pointer performance.)
> 
> We don't do anything like this in ROMIO because frankly if a high
> performance file system can't handle simultaneous non-interleaved
> contiguous data blocks (what we would consider the best case scenario
> performance-wise), then a lot of ROMIO assumptions about how to
> achieve peak performance kind of go out the window.
> 
> However, Kevin's suggestion that this is instead due to lock
> contention makes a lot of sense, and I'm curious to hear what impact,
> if any, that has on your customer's performance.
> 
> ==rob
> 
> -- 
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA