[mpich-discuss] testing new ADIO Lustre code

Martin Pokorny mpokorny at nrao.edu
Wed May 23 14:29:22 CDT 2012


Rob,

I've been in email contact with Wei-keng Liao about the changes that 
I've made. We have mainly discussed my implementation of the new 
non-blocking code; it turns out that he is currently working on a very 
similar set of modifications. The file domain partitioning algorithm 
comes from Wei-keng, and I expect that the results he published should 
apply to my implementation, at least approximately. He has encouraged me 
to do some large-scale testing using XSEDE, and I have plans to run some 
benchmarks in that setting soon.

A few more comments, below.

Rob Latham wrote:
> On Thu, May 17, 2012 at 03:02:48PM -0600, Martin Pokorny wrote:
>> Hi, everyone;
>>
>> I've been using MPI-IO on a Lustre file system to good effect for a
>> while now in an application that has up to 32 processes writing to a
>> shared file. However, seeking to understand the performance of our
>> system, and improve on it, I've recently made some changes to the
>> ADIO Lustre code, which show some promise, but need more testing.
>> Eventually, I'd like to submit the code changes back to the mpich2
>> project, but that is certainly contingent upon the results of
>> testing (and various code compliance issues for mpich2/romio/adio
>> that I will likely need to sort out). This message is my request for
>> volunteers to help test my code, in particular for output file
>> correctness and shared-file write performance. If you're interested
>> in doing shared file I/O using MPI-IO on Lustre, please continue
>> reading this message.
> 
> Gosh, Martin, I really thought you'd get more attention with this
> post.  

Me too.

> I'd like to see these patches: I can't aggressively test them on a
> lustre system but I'd be happy to provide another set of
> ROMIO-eyeballs.  

I can send them to you now, if you want. However, I'm not finished 
testing and can't rule out further changes. I will go ahead and put them 
somewhere publicly accessible and let everyone know when that's done.

>> In broad terms, the changes I made are on two fronts: changing the
>> file domain partitioning algorithm, and introducing non-blocking
>> operations at several points. 
> 
> Non-blocking communication or i/o ?
> 
> One concern with non-blocking I/O in this path is that often the
> communication and I/O networks are the same thing (e.g. infiniband, or
> the BlueGene tree network in some situations).  

Non-blocking in both communication and I/O. I was preparing a question 
for the Lustre discussion list about non-blocking I/O using the POSIX 
aio API, so I'll just ask it here instead: is POSIX aio on a Lustre file 
system truly asynchronous? I suspect that the glibc implementation of 
aio is asynchronous with respect to the calling thread, but I wonder 
whether the system calls to the Lustre client are themselves 
asynchronous. Can anyone help me understand this?

I have a little data suggesting that the aio calls do improve 
performance a bit, but this is a tentative conclusion.
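
For context, the overlap I'm aiming for between the MPI data exchange 
and the aio writes looks roughly like the sketch below. It is 
illustrative only, not the actual patch: exchange_for_stripe, 
write_stripe, and reap_stripe are hypothetical stand-ins for the ADIO 
exchange and aio steps, and double buffering is assumed so the exchange 
for stripe s+1 never touches the buffer being written for stripe s.

#include <mpi.h>
#include <aio.h>

/* Hypothetical helpers (not real ADIO/ROMIO routines): */
void exchange_for_stripe(MPI_Comm comm, int stripe, MPI_Request *req); /* MPI_Isend/MPI_Irecv    */
void write_stripe(int stripe, struct aiocb *cb);                       /* aio_write              */
void reap_stripe(struct aiocb *cb);                                    /* aio_suspend/aio_return */

void write_stripes_pipelined(MPI_Comm comm, int nstripes)
{
    MPI_Request req;
    struct aiocb cb;

    if (nstripes <= 0)
        return;

    exchange_for_stripe(comm, 0, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);          /* data for stripe 0 is in place */

    for (int s = 0; s < nstripes; s++) {
        write_stripe(s, &cb);                   /* stripe s heads to its OST */
        if (s + 1 < nstripes) {
            exchange_for_stripe(comm, s + 1, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);  /* exchange overlaps the in-flight write */
        }
        reap_stripe(&cb);                       /* settle stripe s before reusing cb */
    }
}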

>> The file domain partitioning algorithm
>> that I implemented is from the paper "Dynamically Adapting File
>> Domain Partitioning Methods for Collective I/O Based on Underlying
>> Parallel File System Locking Protocols" by Wei-keng Liao and Alok
>> Choudhary. The non-blocking operations that I added allow the ADIO
>> Lustre driver to better parallelize the data exchange and writing
>> procedures over multiple stripes within each process writing to one
>> Lustre OST,
> 
> I was hoping Wei-keng would chime in on this.  I'll be sure to draw
> your patches to his attention.

I've already done that.

>> My testing so far has been limited to four nodes, up to sixteen
>> processes, writing to shared files on a Lustre file system with up
>> to eight OSTs. 
> 
> Right now the only concern I have is that you may have (and without
> looking at the code I have no way of knowing) traded better small-scale
> performance for worse large-scale performance.

Right. As I mentioned above, I will soon be testing my code in a 
large-scale setting.

>> These tests were conducted to simulate the production
>> application for which I'm responsible, but on a different cluster,
>> focused only on the file output. In these rather limited tests, I've
>> seen write performance gains of up to a factor of two or three. The
>> new file domain partitioning algorithm is most effective when the
>> number of processes exceeds the number of Lustre OSTs, but there are
>> smaller gains in other cases, and I have not seen an instance in which
>> the performance has decreased. As an example, in one case using
>> sixteen processes, MPI over Infiniband, and a file striping factor
>> of four, the new code achieves over 800 MB/s, whereas the standard
>> code achieves 300 MB/s. I have hints that the relative performance
>> gains are greater when using 1 Gb Ethernet rather than Infiniband for
>> MPI message passing, but I have not completed my testing in that
>> environment.
>>
>> If you're willing to try out this code in a test environment please
>> let me know. I have not yet put the code into a publicly accessible
>> repository, but will do so if there is interest out there.
> 
> ==rob
> 


-- 
Martin

