[mpich2-dev] Another romio performance question

Fri Aug 21 11:38:17 CDT 2009

On Fri, Aug 21, 2009 at 10:35:38AM -0500, Bob Cernohous wrote:
> I have a customer complaint that a particular testcase performs poorly 
> with MPI_File_write_at_all.  For the most part, it's just a testcase that 
> shouldn't be using collective i/o.  Each rank writes large contiguous 
> blocks to its own range of the file.  So aggregation is just a waste of 
> time.  Each write the aggregator does is a large contiguous write for a 
> single rank.  So there's really no true aggregation.
> 
> What caught my eye was, for example,  using a 16MB cb_buffer_size and 
> writing a contiguous 1M block causes read-modify-write of the whole 16M 
> because of the single large (15M) trailing (or leading) hole.   It just 
> seems like we should do better, but is it worth doing anything for 
> something that probably isn't a true collective i/o pattern?
> 
> I can fix the testcase performance by hinting cb_buffer_size down to 1MB 
> and then there's no hole.  This is a fine user circumvention, but I'm 
> trying to decide if we should do more.

OK, I see what's going on here.

When the workload is as you describe, ROMIO normally looks at the
accesses and, if they are not interleaved, decides it would be better
served with independent access (in ad_write_coll.c there's a check for
interleaved accesses).  Contiguous data like your customer's falls
under the non-interleaved category.

However, on BlueGene, the romio_cb_read and romio_cb_write hints are
set to 'enable' instead of 'automatic'.  This is usually the right
thing, since aggregation works great on BlueGene for workloads that
are non-overlapping, but also non-contiguous.
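
To make the interaction between the interleave check and the hint
concrete, here is a rough sketch.  It is not the code in
ad_write_coll.c; the names cb_hint, count_interleaved and
use_collective_buffering are made up for illustration, and the
interleave test is a simplified guess at the idea (a rank's range
starts before the previous rank's range ends).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the values of the romio_cb_write hint. */
typedef enum { CB_AUTOMATIC, CB_ENABLE, CB_DISABLE } cb_hint;

/* Count ranks whose byte range starts before the previous rank's range
 * ends.  start[i] and end[i] are the first and last file offsets
 * touched by rank i (both inclusive). */
static int count_interleaved(const long long *start, const long long *end,
                             size_t nprocs)
{
    int count = 0;
    for (size_t i = 1; i < nprocs; i++) {
        assert(start[i] <= end[i]);
        if (start[i] < end[i - 1])
            count++;
    }
    return count;
}

/* 'enable' and 'disable' force the decision; only 'automatic' consults
 * the interleave count. */
static bool use_collective_buffering(cb_hint hint, int interleaved)
{
    if (hint == CB_DISABLE)
        return false;
    if (hint == CB_ENABLE)
        return true;
    return interleaved > 0;
}
```

With 'automatic', three ranks writing [0,99], [100,199], [200,299]
would go independent; with 'enable' the same pattern is pushed through
the aggregators regardless.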

I guess that's why we might not have paid a whole lot of attention to
holes preceding or following the data of interest. 
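
For a sense of what such a hole costs, here is a back-of-the-envelope
model of the case you describe.  It assumes (as in your report, not
from reading the code) that any hole in the collective buffer triggers
a read-modify-write of the whole buffer; io_cost and rmw_cost are
illustrative names only.

```c
#include <assert.h>

#define MIB (1024LL * 1024LL)

/* Bytes read from and written to the file system for one buffer flush. */
typedef struct {
    long long read_bytes;
    long long write_bytes;
} io_cost;

/* Assumed model: if the data does not fill the buffer, the whole buffer
 * is read, patched, and written back; otherwise it is written once. */
static io_cost rmw_cost(long long buf_size, long long data_len)
{
    assert(buf_size > 0 && data_len > 0 && data_len <= buf_size);
    io_cost c;
    if (data_len < buf_size) {
        c.read_bytes = buf_size;   /* fetch the hole's current contents */
        c.write_bytes = buf_size;  /* write data plus hole back */
    } else {
        c.read_bytes = 0;
        c.write_bytes = buf_size;
    }
    return c;
}
```

Under that model, rmw_cost(16 * MIB, 1 * MIB) moves 16MB in each
direction to land 1MB of data, while rmw_cost(1 * MIB, 1 * MIB) reads
nothing and writes 1MB, which matches the effect of hinting
cb_buffer_size down to 1MB.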

A couple of years back we found a bug in ROMIO because it was ignoring
leading or trailing holes.  Sounds like we fixed that case (it was
causing a segfault) but then made other cases perform worse.

Maybe ROMIO needs to have more clever 'automatic' logic?  Or actually,
on BlueGene, less clever: contiguous data, especially contiguous data
of 1MB or more, should just be handled independently.  We'd still want
to use collective buffering for noncontiguous data.
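
Something along these lines is what I have in mind.  This is only a
sketch of the rule, not a patch: choose_write_path and the 1MB
INDEP_THRESHOLD_BYTES cutoff are placeholders, and leaving small
contiguous writes on the collective path is an assumption that would
need measuring.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder cutoff taken from the "1MB or more" remark above. */
#define INDEP_THRESHOLD_BYTES (1024LL * 1024LL)

typedef enum { WRITE_INDEPENDENT, WRITE_COLLECTIVE } write_path;

/* Large contiguous per-rank writes bypass the aggregators; everything
 * else (noncontiguous, or small contiguous) keeps collective buffering. */
static write_path choose_write_path(bool contiguous, long long bytes_per_rank)
{
    assert(bytes_per_rank >= 0);
    if (contiguous && bytes_per_rank >= INDEP_THRESHOLD_BYTES)
        return WRITE_INDEPENDENT;
    return WRITE_COLLECTIVE;
}
```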

==rob

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA