[mpich-discuss] testing new ADIO Lustre code
Martin Pokorny
mpokorny at nrao.edu
Thu May 17 16:02:48 CDT 2012
Hi, everyone,
I've been using MPI-IO on a Lustre file system to good effect for a
while now in an application that has up to 32 processes writing to a
shared file. However, seeking to understand and improve the performance
of our system, I've recently made some changes to the ADIO Lustre code,
which show some promise but need more testing. Eventually, I'd like to
submit the code changes back to the mpich2 project, but that is
certainly contingent upon the results of testing (and upon various code
compliance issues for mpich2/romio/adio that I will likely need to sort
out). This message is my request for volunteers to help test my code, in
particular for output file correctness and shared-file write
performance. If you're interested in doing shared file I/O using MPI-IO
on Lustre, please continue reading this message.
In broad terms, the changes I made are on two fronts: changing the file
domain partitioning algorithm, and introducing non-blocking operations
at several points. The file domain partitioning algorithm that I
implemented is from the paper "Dynamically Adapting File Domain
Partitioning Methods for Collective I/O Based on Underlying Parallel
File System Locking Protocols" by Wei-keng Liao and Alok Choudhary. The
non-blocking operations that I added allow the ADIO Lustre driver to
better parallelize the data exchange and writing procedures over
multiple stripes within each process writing to one Lustre OST.
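For anyone unfamiliar with the workload involved, here is a minimal
sketch of the kind of shared-file collective write that these changes
target. It is purely illustrative (the file path, buffer size, and hint
values are placeholders), and the striping hints shown are the standard
ROMIO ones, not anything introduced by my patch:

/* Illustrative shared-file collective write: each rank writes one
 * contiguous block of a single shared file on Lustre. The collective
 * write path (two-phase aggregation, file domain partitioning) runs
 * underneath MPI_File_write_at_all. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int count = 4 * 1024 * 1024;        /* 4 MiB per rank */
    char *buf = malloc(count);
    memset(buf, rank, count);

    /* Lustre striping hints passed through ROMIO at file creation;
     * the values here are placeholders. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "4");      /* number of OSTs */
    MPI_Info_set(info, "striping_unit", "1048576");  /* 1 MiB stripes */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/lustre/scratch/testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* All ranks participate in the collective write of the shared file. */
    MPI_Offset offset = (MPI_Offset)rank * count;
    MPI_File_write_at_all(fh, offset, buf, count, MPI_BYTE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    free(buf);
    MPI_Finalize();
    return 0;
}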
My testing so far has been limited to four nodes, up to sixteen
processes, writing to shared files on a Lustre file system with up to
eight OSTs. These tests were designed to simulate the production
application for which I'm responsible, but they were run on a different
cluster and focused only on the file output. In these rather limited tests, I've
seen write performance gains of up to a factor of two or three. The new
file domain partitioning algorithm is most effective when the number of
processes exceeds the number of Lustre OSTs, but there are smaller gains
in other cases, and I have not seen an instance in which the performance
has decreased. As an example, in one case using sixteen processes, MPI
over InfiniBand, and a file striping factor of four, the new code
achieves over 800 MB/s, whereas the standard code achieves 300 MB/s.
There are hints that the relative performance gains are even greater
when using 1Gb Ethernet rather than InfiniBand for MPI message passing,
but I have not completed my testing in that environment.
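For concreteness, aggregate MB/s figures like the ones above are
typically computed as the total bytes written divided by the elapsed
time of the slowest rank. Here is a minimal sketch of such a
measurement; the helper is illustrative only and is not part of my test
harness or the patch:

/* Illustrative bandwidth measurement: time a collective write across
 * all ranks and report the aggregate throughput on rank 0. */
#include <mpi.h>
#include <stdio.h>

void timed_collective_write(MPI_File fh, MPI_Offset offset,
                            const void *buf, int count)
{
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Barrier(MPI_COMM_WORLD);          /* start all ranks together */
    double t0 = MPI_Wtime();

    MPI_File_write_at_all(fh, offset, buf, count, MPI_BYTE,
                          MPI_STATUS_IGNORE);
    MPI_File_sync(fh);                    /* force data to storage
                                             before stopping the timer */

    double elapsed = MPI_Wtime() - t0, slowest;
    MPI_Reduce(&elapsed, &slowest, 1, MPI_DOUBLE, MPI_MAX, 0,
               MPI_COMM_WORLD);

    if (rank == 0)
        printf("aggregate write bandwidth: %.1f MB/s\n",
               ((double)count * nprocs / 1.0e6) / slowest);
}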
If you're willing to try out this code in a test environment, please let
me know. I have not yet put the code into a publicly accessible
repository, but will do so if there is interest out there.
--
Martin Pokorny
Software Engineer - Jansky Very Large Array
National Radio Astronomy Observatory - New Mexico Operations