[mpich2-dev] More ROMIO performance questions
Rob Latham
robl at mcs.anl.gov
Tue Sep 15 14:39:30 CDT 2009
On Tue, Sep 15, 2009 at 01:28:18PM -0500, Bob Cernohous wrote:
> Yes, I meant cb_buffer_size.
ok, good
> I believe it was tested lockless, but I'll try to verify that.
>
> I'm not completely assuming a high performance file system (HPFS). What
> about NFS?
I know you are stuck supporting NFS because you promised your
customers you would, but NFS has never ceased being a headache for us.
> Or even with an HPFS, BG has "i/o poor" racks with 512 cores
> (writers) to 1 i/o node (HPFS client). Would an equivalent non-BG
> setup, say a server (HPFS client) with 512 processes (writers),
> have the same problems?
Isn't that what 'bgl_nodes_pset' is supposed to address? Whether you
are i/o rich or i/o poor, aggregate down to 'bgl_nodes_pset'
aggregators per i/o node. There are tons of things ROMIO can do in
collective I/O. There is very little ROMIO can do to improve
independent I/O (not nothing, though; see below).
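For reference, here is a minimal sketch of how an application would
feed those hints to ROMIO before opening the file. The hint names are
the ones already mentioned in this thread; the values, file name, and
sizes are made up for illustration and would need tuning for a real
pset geometry:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Info info;
    int rank;
    double buf[1024] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Info_create(&info);
    /* collective buffer size per aggregator: 16 MiB, arbitrary */
    MPI_Info_set(info, "cb_buffer_size", "16777216");
    /* on Blue Gene, ROMIO uses this to pick the number of
     * aggregators per i/o node; 8 is just an example value */
    MPI_Info_set(info, "bgl_nodes_pset", "8");

    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* collective write: every rank participates, but only the
     * aggregators actually touch the file system */
    MPI_Offset offset = (MPI_Offset)rank * sizeof(buf);
    MPI_File_write_at_all(fh, offset, buf, 1024, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}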
> Do we just accept less performance in these scenarios? Is it all up
> to the (single) HPFS client to figure it out? Is it all HPFS config
> and they got it wrong? They only saw the problem with, for example,
> all 512 cores writing to the same file. All 512 cores writing to
> different files worked well... so I guess it could be client
> config/limited resources.
If all 512 cores can write to different files and see acceptable
performance, then we have adequate throughput between compute nodes
and i/o nodes and between i/o nodes and the HPFS. So yeah, it sounds
like they've got a problem at the HPFS layer. I'm going to guess they
are running Lustre.
> Or does an option to throttle back individual writers going through a
> single point/client/server/? make any sense at all? Collective i/o
> through N aggregators per i/o node works pretty well if you get the hints
> right, but with more overhead than their customized flow control of N
> writers per i/o node.
>
> I thought about shared files. If they opened N shared files per pset they
> would basically be getting N concurrent writers/readers. This would just
> be another way to coordinate like their customized testcase.
If you pass around a token in the MPI layer, you can easily starve
processes. Independent I/O means there's no guarantee when any
process will be in that call. So, do you have rank 1 give up the
token after exiting MPI_FILE_WRITE? Who does it pass the token to?
Will that rank be in an MPI call and able to make progress on the
receive? Do you have rank 4 take the token from someone when it's
ready to do I/O?
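To make that concrete, here is a rough sketch (mine, not anything from
the customized testcase) of what a token-passing scheme around
independent writes looks like; the comments mark exactly the hand-off
questions above:

#include <mpi.h>

void token_ordered_write(MPI_File fh, void *buf, int count,
                         MPI_Datatype type, MPI_Offset off)
{
    int rank, nprocs, token = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* everyone except rank 0 blocks here until its predecessor is
     * done, even a rank that has no I/O to do right now */
    if (rank != 0)
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* independent write while holding the token */
    MPI_File_write_at(fh, off, buf, count, type, MPI_STATUS_IGNORE);

    /* hand-off: if rank+1 hasn't reached its MPI_Recv yet (it might
     * be computing for a long time), the token sits in transit and
     * every rank farther down the chain is starved */
    if (rank != nprocs - 1)
        MPI_Send(&token, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
}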
The only thing I can think of that might help in this situation is a
collaborative write-back cache with a handful of writer processes.
Our friends at Northwestern University have done a lot of work in this
area.
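To give a flavor of that idea (deliberately much simpler than the
Northwestern work, and with every name and size made up), here is a
toy delegation sketch in which only one rank per group of 64 ever
touches the file; everyone else ships its buffer to that delegate:

#include <mpi.h>
#include <stdlib.h>

#define STRIDE   64      /* ranks per delegate, arbitrary */
#define NDOUBLES 1024    /* doubles written per rank, arbitrary */

void delegated_write(MPI_File fh, double *buf, int rank, int nprocs)
{
    int delegate = (rank / STRIDE) * STRIDE;  /* lowest rank in group */

    if (rank != delegate) {
        /* non-writers just forward their data to the delegate */
        MPI_Send(buf, NDOUBLES, MPI_DOUBLE, delegate, rank,
                 MPI_COMM_WORLD);
    } else {
        int last = delegate + STRIDE;
        if (last > nprocs) last = nprocs;

        /* the delegate writes its own block... */
        MPI_File_write_at(fh,
                          (MPI_Offset)rank * NDOUBLES * sizeof(double),
                          buf, NDOUBLES, MPI_DOUBLE, MPI_STATUS_IGNORE);

        /* ...then drains and writes each group member's block, so the
         * file system only ever sees one client per group */
        double *tmp = malloc(NDOUBLES * sizeof(double));
        for (int src = delegate + 1; src < last; src++) {
            MPI_Recv(tmp, NDOUBLES, MPI_DOUBLE, src, src,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_File_write_at(fh,
                              (MPI_Offset)src * NDOUBLES * sizeof(double),
                              tmp, NDOUBLES, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);
        }
        free(tmp);
    }
}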
==rob
--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA