[mpich2-dev] MPI_File_get_position_shared scaling

Rob Latham robl at mcs.anl.gov
Thu Aug 20 15:53:42 CDT 2009


On Thu, Aug 20, 2009 at 03:11:02PM -0500, Bob Cernohous wrote:
> We have a customer complaint that:
> 
> "The MPI subroutine MPI_File_get_position_shared is too slow and has a bad 
>  impact on I/O performance. "
> 
> From what I can see, each rank exclusively locks and operates on the
> shared_fp.  So it doesn't scale well, but I don't see anything BlueGene 
> specific.  Am I missing an ad_xxx that does something better?  Or is this 
> just "as expected".  Any comments from the ROMIO experts?

Hi Bob.  This is definitely "as expected".  ROMIO implements shared
file pointers for correctness, not for performance.

We've done research into using MPI-2 one-sided RMA operations to
support shared file pointers without using file locks.  If your
application is well behaved, you get great performance.  

If the rank holding the RMA window goes off into a CPU loop for a
couple of hours, that would not be "well behaved" :>

Here's the paper:
http://www.mcs.anl.gov/~robl/papers/latham_rmaops.pdf
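
To give a flavor of the idea, here's a toy sketch of the atomic
fetch-and-add on a window that drives the whole thing.  I'm cheating
and using MPI_Fetch_and_op from the MPI-3 RMA interface for brevity;
the paper's whole point is building this atomic operation out of
MPI-2 one-sided calls, which is much hairier.  Take this as a sketch
of the concept, not the paper's code (names like "my_bytes" are mine):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Win win;
    MPI_Offset *counter, my_bytes = 100, my_start;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* the shared offset lives in a window hosted on rank 0 */
    MPI_Win_allocate(rank == 0 ? sizeof(MPI_Offset) : 0,
                     sizeof(MPI_Offset), MPI_INFO_NULL, MPI_COMM_WORLD,
                     &counter, &win);
    if (rank == 0) {
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
        *counter = 0;
        MPI_Win_unlock(0, win);
    }
    MPI_Barrier(MPI_COMM_WORLD);

    /* atomically fetch the current offset and advance it by my_bytes */
    MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
    MPI_Fetch_and_op(&my_bytes, &my_start, MPI_OFFSET, 0, 0, MPI_SUM, win);
    MPI_Win_unlock(0, win);

    /* my_start is where this rank would write, e.g. with
       MPI_File_write_at -- no file system locks involved */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}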

We've seen somewhat more interest in shared file pointers over the
last couple of years, but I think your customer can get by without
them with just a little bit of work, and get vastly better performance
to boot.  

Every process knows how much data it will write or read; it just
doesn't know where to begin.  That's easy to solve: get the implicit
file pointer with MPI_FILE_GET_POSITION, then use MPI_SCAN or
MPI_EXSCAN to have each process efficiently compute its own starting
offset within the I/O pattern.  Then everyone can do an
MPI_FILE_WRITE_AT_ALL or MPI_FILE_READ_AT_ALL.  (This is the crux of
the 'ordered mode' algorithm in the linked paper.)
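
Here's a quick sketch of that recipe as a complete toy program (the
file name and buffer contents are just illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Offset base;
    long long my_bytes, my_start = 0;
    char buf[64];
    int rank, count;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* each rank writes a different amount, as it would with a
       shared file pointer */
    count = snprintf(buf, sizeof(buf), "rank %d reporting\n", rank);
    my_bytes = count;

    MPI_File_open(MPI_COMM_WORLD, "out.txt",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* where does the whole I/O pattern start? */
    MPI_File_get_position(fh, &base);

    /* sum of everyone before me = my offset within the pattern */
    MPI_Exscan(&my_bytes, &my_start, 1, MPI_LONG_LONG, MPI_SUM,
               MPI_COMM_WORLD);
    if (rank == 0)
        my_start = 0;   /* MPI_Exscan leaves rank 0's recvbuf undefined */

    MPI_File_write_at_all(fh, base + (MPI_Offset)my_start, buf, count,
                          MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}

One MPI_EXSCAN and one collective write: no locking anywhere, and the
collective call gives ROMIO a chance to optimize the whole pattern at
once.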

Maybe more information than you wanted!

==rob

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

