[mpich-discuss] patch for ad_lustre_wrcoll.c

Martin Pokorny mpokorny at nrao.edu
Wed Dec 7 11:38:05 CST 2011


Rob Latham wrote:
> On Wed, Aug 10, 2011 at 01:56:33PM -0600, Martin Pokorny wrote:
>> These changes improve performance by reducing the number of system
>> 'write' calls in the ADIO Lustre collective write code, and
>> perhaps also by keeping the writes ordered. This is especially
>> effective in my application, in which the data are highly
>> interleaved among the processes in the group calling the MPI-IO
>> collective write functions.
> 
> Hi Martin.  Sorry it has taken me so long to respond to your 
> contribution.  I think I understand what you're doing and why, and 
> I'm going to commit it.
> 
> Let me make sure I really do understand by explaining back what you 
> are doing:
> 
> - in collective I/O, if there are any gaps in the file domain, 
> collective I/O does a read-modify-write.  This works great on most 
> file systems but on Lustre the (implicit, here) locking needed for 
> this is extremely costly.  So, the Lustre driver has a hint to turn 
> off data sieving in collective cases and service each request piece 
> by piece.

I'm not convinced that the locking is necessarily extremely costly in 
the current implementation. The reason that my application doesn't use 
data sieving is more related to the fact that the files being written 
have lots of holes, and using data sieving leaves random values in the 
files where the holes are. (I realize that this usage isn't ideal, but 
the holes will eventually go away, and it's not worth the effort to do 
something better at the moment.)

> - because the requests are being serviced piecewise, certain 
> workloads could result in out-of-order blocks that once placed back 
> in order end up being adjacent to each other and can be merged.

Yes, that's the effect the patch is intended to have. This reduces the 
number of system calls to the Lustre client code.
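
To illustrate the idea only (this is not the patch itself, and the names 
io_req and coalesce_reqs are hypothetical, not ADIO identifiers), a 
minimal sketch of the sort-and-merge step might look like the code below. 
It assumes each request is a plain (offset, length) pair and ignores the 
memory buffers, which the real code would also have to keep in matching 
order.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical request descriptor: one contiguous byte range in the file. */
typedef struct {
    long long offset;  /* starting file offset, in bytes */
    long long len;     /* length of the range, in bytes */
} io_req;

/* qsort comparator: ascending by file offset. */
static int cmp_offset(const void *a, const void *b)
{
    const io_req *x = a;
    const io_req *y = b;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

/*
 * Sort the requests by offset, then merge every run of ranges that are
 * exactly adjacent (end of one == start of the next).  The array is
 * compacted in place; the return value is the number of ranges left,
 * i.e. the number of write calls that would be needed.
 */
size_t coalesce_reqs(io_req *reqs, size_t n)
{
    size_t out = 0;  /* index of the range currently being extended */
    size_t i;

    if (n == 0)
        return 0;

    qsort(reqs, n, sizeof *reqs, cmp_offset);

    for (i = 1; i < n; i++) {
        if (reqs[out].offset + reqs[out].len == reqs[i].offset)
            reqs[out].len += reqs[i].len;   /* adjacent: extend the range */
        else
            reqs[++out] = reqs[i];          /* gap: start a new range */
    }
    return out + 1;
}
```

For example, four 4-byte requests arriving at offsets 8, 0, 4 and 20 
would collapse to two ranges, (0, 12) and (20, 4), so two writes instead 
of four. Holes in the file are left alone because ranges separated by a 
gap are never merged.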

> I would very much like to have a tiny test case that shows this out 
> of order workload but for now I am just committing the patch.  SVN 
> revision 9240 has the fix.

I would like one, too! Unfortunately, my application is a part of a 
real-time, event-driven system, and we don't have the ability to 
simulate the application inputs. I'll see if I can generate a test case 
for you outside of my application, but it may be a while before I can 
get to that.

(Sorry about the duplicate message, Rob; I neglected to cc mpich-discuss 
in my first message.)

-- 
Martin Pokorny
Software Engineer - Expanded Very Large Array
National Radio Astronomy Observatory
Socorro, NM USA

