[mpich-discuss] testing new ADIO Lustre code

Martin Pokorny mpokorny at nrao.edu
Thu May 17 16:02:48 CDT 2012


Hi, everyone;

I've been using MPI-IO on a Lustre file system to good effect for a 
while now in an application that has up to 32 processes writing to a 
shared file. However, in an effort to understand and improve the I/O 
performance of our system, I've recently made some changes to the ADIO 
Lustre code; they show some promise, but need more testing. Eventually, 
I'd like to submit the changes back to the mpich2 project, but that is 
certainly contingent upon the results of testing (and various code 
compliance issues for mpich2/romio/adio that I will likely need to sort 
out). This message is my request for volunteers to help test my code, 
in particular for output file correctness and shared-file write 
performance. If you're interested in doing shared-file I/O using MPI-IO 
on Lustre, please read on.
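For context, here is a minimal sketch of the kind of access pattern 
involved: every rank writes a contiguous, disjoint block of a single 
shared file with a collective write, and the Lustre striping is 
requested through the standard ROMIO hints. The file name, buffer size, 
and hint values below are purely illustrative and not taken from my 
application.

/* Minimal sketch: collective shared-file write with Lustre striping
 * requested via ROMIO hints.  Sizes and hint values are illustrative. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const size_t block = 4 * 1024 * 1024;   /* 4 MiB per rank */
    char *buf = malloc(block);
    memset(buf, 'a' + (rank % 26), block);

    /* Striping hints are honored by the ADIO Lustre driver when the
     * file is created. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "4");     /* number of OSTs */
    MPI_Info_set(info, "striping_unit", "1048576"); /* 1 MiB stripes  */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared_test.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* Each rank writes its block at a disjoint offset of the shared file. */
    MPI_Offset offset = (MPI_Offset)rank * (MPI_Offset)block;
    MPI_File_write_at_all(fh, offset, buf, (int)block, MPI_CHAR,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    free(buf);
    MPI_Finalize();
    return 0;
}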

In broad terms, the changes I made are on two fronts: changing the file 
domain partitioning algorithm, and introducing non-blocking operations 
at several points. The file domain partitioning algorithm that I 
implemented is from the paper "Dynamically Adapting File Domain 
Partitioning Methods for Collective I/O Based on Underlying Parallel 
File System Locking Protocols" by Wei-keng Liao and Alok Choudhary. The 
non-blocking operations that I added allow the ADIO Lustre driver to 
better overlap the data exchange and write phases across the multiple 
stripes that each process writes to a single Lustre OST.
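To give a rough idea of the partitioning (this is not the actual ROMIO 
code, just an illustration under simplified assumptions), file domains 
are cut on stripe boundaries and stripes are handed out cyclically 
among the I/O aggregators, so that a given aggregator keeps writing to 
the same OST and file domains never straddle a lock (stripe) boundary 
shared with another writer. The function name and numbers below are 
made up for the example.

/* Rough illustration of stripe-aligned, cyclic file domain assignment;
 * names and the exact assignment rule are simplified assumptions. */
#include <stdio.h>

/* Map a stripe index to the aggregator responsible for it. */
static int stripe_to_aggregator(long stripe, int naggs)
{
    return (int)(stripe % naggs);
}

int main(void)
{
    const long stripe_size  = 1 << 20;  /* 1 MiB stripes (illustrative) */
    const int  stripe_count = 4;        /* OSTs in the file's layout    */
    const int  naggs        = 8;        /* I/O aggregator processes     */

    /* With naggs a multiple of stripe_count, aggregator a always lands
     * on OST (a % stripe_count): stripe s lives on OST (s % stripe_count),
     * and s % naggs == a implies s % stripe_count == a % stripe_count. */
    for (long s = 0; s < 16; s++) {
        int agg = stripe_to_aggregator(s, naggs);
        printf("stripe %2ld  [%8ld, %8ld)  -> aggregator %d (OST %ld)\n",
               s, s * stripe_size, (s + 1) * stripe_size,
               agg, s % stripe_count);
    }
    return 0;
}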

My testing so far has been limited to four nodes and up to sixteen 
processes, writing to shared files on a Lustre file system with up to 
eight OSTs. These tests were designed to simulate the production 
application for which I'm responsible, but were run on a different 
cluster and focused only on the file output. In these rather limited 
tests, I've seen write performance gains of up to a factor of two or 
three. The new file domain partitioning algorithm is most effective 
when the number of processes exceeds the number of Lustre OSTs, but 
there are smaller gains in other cases, and I have not seen an instance 
in which the performance has decreased. As an example, in one case 
using sixteen processes, MPI over InfiniBand, and a file striping 
factor of four, the new code achieves over 800 MB/s, whereas the 
standard code achieves 300 MB/s. I have hints that the relative 
performance gains are greater when using 1 Gb Ethernet rather than 
InfiniBand for MPI message passing, but I have not completed my testing 
in that environment.
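If you want to reproduce this kind of measurement, one simple approach 
is to time the collective write across all ranks and report the 
aggregate rate based on the slowest rank. The sketch below assumes a 
file handle and buffer set up as in the earlier example; the function 
name is just for illustration.

/* Sketch of a bandwidth measurement around a collective write; assumes
 * fh, buf, block, and offset were set up as in the earlier sketch. */
#include <mpi.h>
#include <stdio.h>

void timed_collective_write(MPI_File fh, const char *buf, size_t block,
                            MPI_Offset offset)
{
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    MPI_File_write_at_all(fh, offset, buf, (int)block, MPI_CHAR,
                          MPI_STATUS_IGNORE);
    MPI_File_sync(fh);  /* include the time for data to reach the file system */

    double elapsed = MPI_Wtime() - t0, max_elapsed;
    MPI_Reduce(&elapsed, &max_elapsed, 1, MPI_DOUBLE, MPI_MAX, 0,
               MPI_COMM_WORLD);
    if (rank == 0)
        printf("aggregate write rate: %.1f MB/s\n",
               (double)nprocs * block / max_elapsed / 1.0e6);
}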

If you're willing to try out this code in a test environment, please 
let me know. I have not yet put the code into a publicly accessible 
repository, but will do so if there is interest.

-- 
Martin Pokorny
Software Engineer - Jansky Very Large Array
National Radio Astronomy Observatory - New Mexico Operations

