[mpich-discuss] Parallel I/O on Lustre: MPI Vs. POSIX

Rob Latham robl at mcs.anl.gov
Thu Jun 23 10:45:09 CDT 2011


I'm a little late to this discussion because I was on vacation.  

Most of the important points have been made already, but I would add
a few comments:

- The just-released MPICH2-1.4 has the latest updates for Lustre.
  It's a bit tricky to get good MPI-IO performance out of Lustre, but
  MPICH2-1.4 encapsulates all the tricks the community knows about.
  (One way to experiment with tuning hints is sketched below.)

  Other MPI-IO implementations sync up with MPICH2's MPI-IO
  implementation, but it takes time for those changes to propagate, so
  MPICH2 will always have the latest fixes and improvements.  For
  example, earlier releases synchronized unnecessarily when closing a
  file; 1.4 no longer does.  (That may be what shows up in the large
  MPI-READ-CLOSE time below.)
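
Since the question below asks about hints: here is a minimal, hedged
sketch of passing MPI-IO hints at open time and then asking the
implementation which hints it actually honored.  "striping_factor"
and "striping_unit" are reserved hint keys in the MPI standard;
whether they take effect depends on the implementation, and striping
hints only apply when the file is first created.  The file name and
hint values here are illustrative, not recommendations.

  #include <mpi.h>
  #include <cstdio>

  int main( int argc, char **argv )
  {
    MPI_Init( &argc, &argv );
    int rank;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    // Request a 32 MB stripe across 32 OSTs (illustrative values).
    MPI_Info info;
    MPI_Info_create( &info );
    MPI_Info_set( info, "striping_factor", "32" );
    MPI_Info_set( info, "striping_unit",   "33554432" );

    MPI_File fh;
    MPI_File_open( MPI_COMM_WORLD, "hinted.dat",
                   MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh );

    // Ask the implementation which hints it actually used.
    MPI_Info used;
    MPI_File_get_info( fh, &used );
    if( rank == 0 )
      {
        int nkeys;
        MPI_Info_get_nkeys( used, &nkeys );
        for( int i=0; i < nkeys; ++i )
          {
            char key[ MPI_MAX_INFO_KEY ];
            char value[ 256 ];
            int flag;
            MPI_Info_get_nthkey( used, i, key );
            MPI_Info_get( used, key, (int)sizeof(value)-1, value, &flag );
            if( flag )
              std::printf( "%s = %s\n", key, value );
          }
      }

    MPI_Info_free( &used );
    MPI_Info_free( &info );
    MPI_File_close( &fh );
    MPI_Finalize();
    return 0;
  }

ROMIO (the MPI-IO layer in MPICH2) also understands additional
"romio_*" keys -- for example romio_cb_write and romio_ds_write to
toggle collective buffering and data sieving -- but the exact set
varies by version, so it is safer to inspect the MPI_File_get_info
output than to hard-code them.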

==rob


On Mon, Jun 20, 2011 at 04:52:12PM -0400, George Zagaris wrote:
> Dear All:
> 
> I am currently investigating the best I/O strategy for large-scale
> data, targeting the Lustre architecture in particular.
> 
> Towards this end, I developed a small benchmark (also attached) in
> which each process writes and reads 4,194,304 doubles (32 MB per
> process), both with MPI I/O on a single shared file and with POSIX
> I/O on separate files -- one file per process.
> 
> I ran this code with 32 processes in a directory that has:
> (a) a stripe size of 32 MB, i.e., the data is stripe aligned, and
> (b) a stripe count (number of OSTs) of 32.
> 
> Given this configuration I would expect no file-system contention:
> the data is stripe aligned and the number of OSTs equals the number
> of processes performing I/O, so each rank's 32 MB block should land
> on its own OST.  Hence I would expect the MPI I/O performance to be
> close to the POSIX performance.  The raw numbers I obtained (in
> seconds, max across ranks) do not corroborate this theory, however:
> 
> MPI-WRITE-OPEN:      0.0422981
> MPI-WRITE-CLOSE:     0.000592947
> MPI-WRITE:           0.0437472
> MPI-READ-OPEN:       0.00699806
> MPI-READ-CLOSE:      1.30613
> MPI-READ:            1.30675
> POSIX-WRITE-OPEN:    0.017261
> POSIX-WRITE-CLOSE:   0.00202298
> POSIX-WRITE:         0.00158501
> POSIX-READ-OPEN:     0.00238109
> POSIX-READ-CLOSE:    0.000462055
> POSIX-READ:          0.00268793
> 
> I was wondering whether anyone has experience using MPI I/O on
> Lustre, and whether hints can improve the I/O performance.  Any
> additional thoughts, comments, or suggestions would also be very
> welcome.
> 
> I sincerely thank you for all your time & help.
> 
> Best Regards,
> George

> /*=========================================================================
> 
>   Program:   Visualization Toolkit
>   Module:    ConflictFreeStripeAligned.cxx
> 
>   Copyright (c) Ken Martin, Will Schroeder, Bill Lorensen
>   All rights reserved.
>   See Copyright.txt or http://www.kitware.com/Copyright.htm for details.
> 
>      This software is distributed WITHOUT ANY WARRANTY; without even
>      the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
>      PURPOSE.  See the above copyright notice for more information.
> 
> =========================================================================*/
> // .NAME ConflictFreeStripeAligned.cxx -- MPI I/O example with no conflicts
> //
> // .SECTION Description
> //  A simple benchmark wherein each process writes the same number of bytes
> //  as the stripe size.
> 
> #include <iostream>
> #include <sstream>
> #include <fstream>
> #include <cassert>
> #include <mpi.h>
> #include <cmath>
> #include <limits>
> 
> // Statistics
> #define MPI_FILE_OPEN_WRITE   0
> #define MPI_FILE_CLOSE_WRITE  1
> #define MPI_FILE_WRITE        2
> #define MPI_FILE_OPEN_READ    3
> #define MPI_FILE_CLOSE_READ   4
> #define MPI_FILE_READ         5
> #define POSIX_OPEN_WRITE      6
> #define POSIX_CLOSE_WRITE     7
> #define POSIX_WRITE           8
> #define POSIX_OPEN_READ       9
> #define POSIX_CLOSE_READ     10
> #define POSIX_READ           11
> 
> // Number of doubles each process writes and reads.  The original run
> // used 131072 doubles (8 bytes each), i.e., 1,048,576 bytes = 1 MB,
> // matching a 1 MB stripe size.
> //#define NUMDOUBLES 131072
> 
> // 4,194,304 doubles = 33,554,432 bytes = 32 MB, equal to the 32 MB
> // stripe size of the directory where the data is stored.
> #define NUMDOUBLES 4194304
> 
> // Description:
> // Logs a message -- only process 0 writes the message
> void Log( const char *msg );
> 
> // Description:
> // Initialize Data
> void InitializeData( double *data, const int N, const int rank );
> 
> // Description:
> // Checks the data
> void CheckData( double *a, double *b, const int N );
> 
> // Description:
> // Writes data using collective MPI I/O
> void WriteData(double *data, const int N, const int rank );
> void POSIXWrite( double *data, const int N, const int rank );
> 
> // Description:
> // Reads data using collective MPI I/O
> void ReadData(double *data, const int N, const int rank );
> void POSIXRead( double *data, const int N, const int rank );
> 
> // Description:
> // Gathers the statistics from all the processes
> void GatherStatistics( );
> 
> // Description:
> // Returns the statistic name for the given index
> std::string GetStringFromIdx( const int idx );
> 
> 
> // Description:
> // Program data-structure
> struct {
>     int Rank;
>     int NumProcessors;
>     int N;
>     double *dataToWrite;
>     double *dataRead;
>     double *statistics;
> } Program;
> 
> // Description:
> // Program main
> int main( int argc, char **argv )
> {
>   MPI_Init( &argc, &argv );
>   MPI_Comm_rank( MPI_COMM_WORLD, &Program.Rank );
>   MPI_Comm_size( MPI_COMM_WORLD, &Program.NumProcessors );
> 
>   // STEP 0: Initialize data
>   Program.N           = NUMDOUBLES;
>   Program.dataRead    = new double[ Program.N ];
>   Program.dataToWrite = new double[ Program.N ];
>   Program.statistics  = new double[ 12 ];
>   for( int i=0; i < 12; ++i )
>     Program.statistics[ i ] = 0.0;
>   InitializeData(Program.dataToWrite, Program.N, Program.Rank );
>   MPI_Barrier( MPI_COMM_WORLD );
> 
>   // STEP 1: Write Data with MPI I/O
>   Log( "Write data with MPI I/O" );
>   WriteData( Program.dataToWrite, Program.N, Program.Rank );
> 
>   // STEP 2: Read Data with MPI I/O
>   Log( "Read data with MPI I/O" );
>   ReadData( Program.dataRead, Program.N, Program.Rank );
> 
>   // STEP 3: Check data from MPI I/O
>   Log( "Checking data from MPI I/O" );
>   CheckData( Program.dataToWrite, Program.dataRead, Program.N );
>   MPI_Barrier( MPI_COMM_WORLD );
> 
>   // STEP 4: Write data with POSIX I/O
>   Log( "Writing data with POSIX I/O" );
>   POSIXWrite( Program.dataToWrite, Program.N, Program.Rank );
> 
>   // STEP 5: Read data with POSIX I/O
>   Log( "Reading data with POSIX I/O" );
>   POSIXRead( Program.dataRead, Program.N, Program.Rank );
> 
>   // STEP 6: Check data from POSIX I/O
>   Log( "Check data from POSIX I/O" );
>   CheckData( Program.dataToWrite, Program.dataRead, Program.N );
>   MPI_Barrier( MPI_COMM_WORLD );
> 
>   // STEP 7: Gather statistics
>   Log( "Gathering statistics" );
>   GatherStatistics();
> 
>   // STEP 8: Clean up
>   delete [] Program.dataRead;
>   delete [] Program.dataToWrite;
>   delete [] Program.statistics;
> 
>   MPI_Finalize();
>   return 0;
> }
> 
> //==============================================================================
> //      F U N C T I O N   P R O T O T Y P E   I M P L E M E N T A T I O N
> //==============================================================================
> 
> // Description:
> // Returns a string representation from the index
> std::string GetStringFromIdx( const int idx )
> {
>   // Statistic names, indexed by the #defines at the top of the file.
>   static const char *names[ 12 ] = {
>     "MPI-WRITE-OPEN",   "MPI-WRITE-CLOSE",   "MPI-WRITE",
>     "MPI-READ-OPEN",    "MPI-READ-CLOSE",    "MPI-READ",
>     "POSIX-WRITE-OPEN", "POSIX-WRITE-CLOSE", "POSIX-WRITE",
>     "POSIX-READ-OPEN",  "POSIX-READ-CLOSE",  "POSIX-READ"
>   };
> 
>   if( (idx < 0) || (idx >= 12) )
>     {
>       assert( "Code should not reach here!" && false );
>       return( std::string( "" ) );
>     }
>   return( std::string( names[ idx ] ) );
> }
> 
> //------------------------------------------------------------------------------
> // Description:
> // Gathering statistics.
> void GatherStatistics()
> {
>   double *StatisticsPerRank = NULL;
>   if( Program.Rank == 0 )
>     StatisticsPerRank = new double[ 12*Program.NumProcessors ];
> 
>   MPI_Gather(
>     Program.statistics, 12, MPI_DOUBLE,
>     StatisticsPerRank, 12, MPI_DOUBLE,
>     0, MPI_COMM_WORLD );
> 
> 
>   if( Program.Rank == 0 )
>     {
>       // Find the max of each statistic across all ranks.  Timings
>       // are non-negative, so 0.0 is a safe initial value.  (Note:
>       // std::numeric_limits<double>::min() is the smallest positive
>       // double, not the most negative one.)
>       double maxStats[ 12 ];
>       for( int i=0; i < 12; ++i )
>         maxStats[ i ] = 0.0;
> 
>       for( int i=0; i < Program.NumProcessors; ++i )
>         {
>           for( int j=0; j < 12; ++j )
>           {
>             if( maxStats[ j ] < StatisticsPerRank[i*12+j] )
>               maxStats[ j ] = StatisticsPerRank[ i*12+j ];
>           }
>         } // END for all processors
> 
>       // Write the max out
>       std::ofstream ofs;
>       ofs.open( "MaxStatistics.txt" );
>       assert( ofs.is_open() );
> 
>       for( int i=0; i < 12; ++i )
>         ofs << GetStringFromIdx( i ) << ": " << maxStats[ i ] << std::endl;
> 
>       ofs.close();
>       delete [] StatisticsPerRank;
>     }
> 
>   MPI_Barrier( MPI_COMM_WORLD );
> }
> 
> //------------------------------------------------------------------------------
> // Description:
> // Reads data using POSIX I/O
> void POSIXRead( double *data, const int N, const int rank )
> {
>   assert( "pre: data array is NULL" && (data != NULL)  );
> 
>   double start, end;
>   std::ostringstream oss;
>   oss << "posixdata_" << rank << ".dat";
>   FILE *fhandle;
>   start   = MPI_Wtime();
>   fhandle = fopen( oss.str().c_str(), "rb" );
>   end     = MPI_Wtime();
>   assert( fhandle != NULL );
>   Program.statistics[ POSIX_OPEN_READ ] = end-start;
> 
>   start = MPI_Wtime();
>   size_t nread = fread( data, sizeof(double), N, fhandle );
>   end   = MPI_Wtime();
>   assert( "post: short read" && (nread == static_cast<size_t>(N)) );
>   Program.statistics[ POSIX_READ ] = end-start;
> 
>   start = MPI_Wtime();
>   fclose(fhandle);
>   end   = MPI_Wtime();
>   Program.statistics[ POSIX_CLOSE_READ ] = end-start;
> }
> 
> //------------------------------------------------------------------------------
> // Description:
> // Writes data using POSIX I/O
> void POSIXWrite( double *data, const int N, const int rank )
> {
>   assert( "pre: data array is NULL" && (data != NULL) );
> 
>   double start, end;
>   std::ostringstream oss;
>   oss << "posixdata_" << rank << ".dat";
>   FILE *fhandle;
>   start   = MPI_Wtime();
>   fhandle = fopen( oss.str().c_str(), "wb" );
>   end     = MPI_Wtime();
>   assert( fhandle != NULL );
>   Program.statistics[ POSIX_OPEN_WRITE ] = end-start;
> 
>   start = MPI_Wtime();
>   size_t nwritten = fwrite( data, sizeof(double), N, fhandle );
>   end   = MPI_Wtime();
>   assert( "post: short write" && (nwritten == static_cast<size_t>(N)) );
>   Program.statistics[ POSIX_WRITE ] = end-start;
> 
>   start = MPI_Wtime();
>   fclose( fhandle );
>   end   = MPI_Wtime();
>   Program.statistics[ POSIX_CLOSE_WRITE ] = end-start;
> }
> 
> //------------------------------------------------------------------------------
> // Description:
> // Writes data using MPI I/O
> void WriteData( double *data, const int N, const int rank )
> {
>   double start, end;
>   MPI_File fhandle;
> 
>   // STEP 0: Open file
>   start = MPI_Wtime();
>   MPI_File_open(
>       MPI_COMM_WORLD, "mpitestdata.dat",
>       MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL,
>       &fhandle );
>   end = MPI_Wtime();
>   Program.statistics[ MPI_FILE_OPEN_WRITE ] = end-start;
> 
>   // STEP 1: compute this rank's byte offset into the shared file.
>   // Cast first so the product is evaluated as a 64-bit MPI_Offset.
>   MPI_Offset offset = static_cast<MPI_Offset>( rank )*N*sizeof(double);
> 
>   // STEP 2: write data
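>   // Note: MPI_File_write_at_all is collective -- every rank in the
>   // communicator must call it.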
>   start = MPI_Wtime();
>   MPI_File_write_at_all(
>    fhandle, offset, data, N, MPI_DOUBLE, MPI_STATUS_IGNORE );
>   end   = MPI_Wtime();
>   Program.statistics[ MPI_FILE_WRITE ] = end-start;
> 
>   // STEP 3: close file
>   start = MPI_Wtime();
>   MPI_File_close( &fhandle );
>   end   = MPI_Wtime();
>   Program.statistics[ MPI_FILE_CLOSE_WRITE ] = end-start;
> }
> 
> //------------------------------------------------------------------------------
> // Description:
> // Reads data using collective MPI I/O
> void ReadData( double *data, const int N, const int rank )
> {
>   double start, end;
>   MPI_File fhandle;
> 
>   // STEP 0: Open file
>   start = MPI_Wtime();
>   MPI_File_open(
>       MPI_COMM_WORLD, "mpitestdata.dat",
>       MPI_MODE_RDONLY, MPI_INFO_NULL,
>       &fhandle );
>   end   = MPI_Wtime();
>   Program.statistics[ MPI_FILE_OPEN_READ ] = end-start;
> 
>   // STEP 1: compute this rank's byte offset into the shared file.
>   // Cast first so the product is evaluated as a 64-bit MPI_Offset.
>   MPI_Offset offset = static_cast<MPI_Offset>( rank )*N*sizeof(double);
> 
>   // STEP 2: read data
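>   // Note: MPI_File_read_at_all is collective -- every rank in the
>   // communicator must call it.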
>   start = MPI_Wtime();
>   MPI_File_read_at_all(
>     fhandle, offset, data, N, MPI_DOUBLE, MPI_STATUS_IGNORE );
>   end   = MPI_Wtime();
>   Program.statistics[ MPI_FILE_READ ] = end-start;
> 
>   // STEP 3: close file
>   start = MPI_Wtime();
>   MPI_File_close( &fhandle );
>   end   = MPI_Wtime();
>   Program.statistics[ MPI_FILE_CLOSE_READ ] = end-start;
> }
> 
> //------------------------------------------------------------------------------
> // Description:
> // Checks that the data read back matches the data written.  Exact
> // equality is fine here because the same bytes are round-tripped.
> void CheckData( double *a, double *b, const int N )
> {
>   assert( "pre: array is NULL" && (a != NULL) );
>   assert( "pre: array is NULL" && (b != NULL) );
> 
>   for( int i=0; i < N; ++i )
>     {
>       assert( a[ i ] == b[ i ] );
>     } // END for all data
> 
> }
> 
> //------------------------------------------------------------------------------
> // Description:
> // Initializes the data to be written
> void InitializeData( double *data, const int N, const int rank )
> {
>   assert( "pre: data != NULL" && (data != NULL) );
> 
>   for( int i=0; i < N; ++i )
>    data[ i ] = rank*10+i;
> }
> 
> //------------------------------------------------------------------------------
> // Description:
> // Logs message to STDOUT
> void Log( const char *msg )
> {
>   if( Program.Rank == 0 )
>     {
>       std::cout << msg << std::endl;
>       std::cout.flush();
>     }
>   MPI_Barrier( MPI_COMM_WORLD );
> }

> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss


-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

