[mpich2-dev] More ROMIO performance questions
Rob Latham
robl at mcs.anl.gov
Tue Sep 15 14:39:30 CDT 2009
On Tue, Sep 15, 2009 at 01:28:18PM -0500, Bob Cernohous wrote:
> Yes, I meant cb_buffer_size.
ok, good
> I believe it was tested lockless, but I'll try to verify that.
>
> I'm not completely assuming a high performance file system (HPFS). What
> about NFS?
I know you are stuck supporting NFS because you promised your
customers you would, but NFS has never ceased being a headache for us.
> Or even with an HPFS, BG has "i/o poor" racks with 512 cores
> (writers) to 1 i/o node (HPFS client). Would an equivalent non-BG
> setup, say a server (HPFS client) with 512 processes (writers),
> have the same problems?
Isn't that what 'bgl_nodes_pset' is supposed to address? Whether you
are i/o rich or i/o poor, aggregate down to 'bgl_nodes_pset'
aggregators per i/o node. There are tons of things ROMIO can do in
collective I/O. There is very little ROMIO can do to improve
independent I/O (not nothing, though; see below).
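For reference, here is a minimal sketch of how an application would
feed those hints to ROMIO before opening the file. The hint names are
the ones already mentioned in this thread; the values, file name, and
sizes are made up for illustration and would need tuning for a real
pset geometry:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Info info;
    int rank;
    double buf[1024] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Info_create(&info);
    /* collective buffer size per aggregator: 16 MiB, arbitrary */
    MPI_Info_set(info, "cb_buffer_size", "16777216");
    /* on Blue Gene, ROMIO uses this to pick the number of
     * aggregators per i/o node; 8 is just an example value */
    MPI_Info_set(info, "bgl_nodes_pset", "8");

    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* collective write: every rank participates, but only the
     * aggregators actually touch the file system */
    MPI_Offset offset = (MPI_Offset)rank * sizeof(buf);
    MPI_File_write_at_all(fh, offset, buf, 1024, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}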
> Do we just accept less performance in these scenarios? Is it all up
> to the (single) HPFS client to figure it out? Is it all HPFS config
> and they got it wrong? They only saw the problem with, for example,
> all 512 cores writing to the same file. All 512 cores writing to
> different files worked well... so I guess it could be client
> config/limited resources.
If all 512 cores can write to different files and see acceptable
performance, then we have adequate throughput between compute nodes
and i/o nodes and between i/o nodes and the HPFS. So yeah, it sounds
like they've got a problem at the HPFS layer. I'm going to guess they
are running Lustre.
> Or does an option to throttle back individual writers going through a
> single point/client/server/? make any sense at all? Collective i/o
> through N aggregators per i/o node works pretty well if you get the hints
> right, but with more overhead than their customized flow control of N
> writers per i/o node.
>
> I thought about shared files. If they opened N shared files per pset they
> would basically be getting N concurrent writers/readers. This would just
> be another way to coordinate like their customized testcase.
If you pass around a token in the MPI layer, you can easily starve
processes. Independent I/O means there's no guarantee when any
process will be in that call. So, do you have rank 1 give up the
token after exiting MPI_FILE_WRITE? Who does it pass the token to?
Will that rank be in an MPI call and able to make progress on the
receive? Do you have rank 4 take the token from someone when it's
ready to do I/O?
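To make that concrete, here is a rough sketch (mine, not anything from
the customized testcase) of what a token-passing scheme around
independent writes looks like; the comments mark exactly the hand-off
questions above:

#include <mpi.h>

void token_ordered_write(MPI_File fh, void *buf, int count,
                         MPI_Datatype type, MPI_Offset off)
{
    int rank, nprocs, token = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* everyone except rank 0 blocks here until its predecessor is
     * done, even a rank that has no I/O to do right now */
    if (rank != 0)
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* independent write while holding the token */
    MPI_File_write_at(fh, off, buf, count, type, MPI_STATUS_IGNORE);

    /* hand-off: if rank+1 hasn't reached its MPI_Recv yet (it might
     * be computing for a long time), the token sits in transit and
     * every rank farther down the chain is starved */
    if (rank != nprocs - 1)
        MPI_Send(&token, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
}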
The only thing I can think of that might help in this situation is a
collaborative write-back cache with a handful of writer processes.
Our friends at Northwestern University have done a lot of work in this
area.
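To give a flavor of that idea (deliberately much simpler than the
Northwestern work, and with every name and size made up), here is a
toy delegation sketch in which only one rank per group of 64 ever
touches the file; everyone else ships its buffer to that delegate:

#include <mpi.h>
#include <stdlib.h>

#define STRIDE   64      /* ranks per delegate, arbitrary */
#define NDOUBLES 1024    /* doubles written per rank, arbitrary */

void delegated_write(MPI_File fh, double *buf, int rank, int nprocs)
{
    int delegate = (rank / STRIDE) * STRIDE;  /* lowest rank in group */

    if (rank != delegate) {
        /* non-writers just forward their data to the delegate */
        MPI_Send(buf, NDOUBLES, MPI_DOUBLE, delegate, rank,
                 MPI_COMM_WORLD);
    } else {
        int last = delegate + STRIDE;
        if (last > nprocs) last = nprocs;

        /* the delegate writes its own block... */
        MPI_File_write_at(fh,
                          (MPI_Offset)rank * NDOUBLES * sizeof(double),
                          buf, NDOUBLES, MPI_DOUBLE, MPI_STATUS_IGNORE);

        /* ...then drains and writes each group member's block, so the
         * file system only ever sees one client per group */
        double *tmp = malloc(NDOUBLES * sizeof(double));
        for (int src = delegate + 1; src < last; src++) {
            MPI_Recv(tmp, NDOUBLES, MPI_DOUBLE, src, src,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_File_write_at(fh,
                              (MPI_Offset)src * NDOUBLES * sizeof(double),
                              tmp, NDOUBLES, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);
        }
        free(tmp);
    }
}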
==rob
--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA