benchmark program question

Wei-keng Liao wkliao at ece.northwestern.edu
Wed Feb 27 11:26:10 CST 2008


Marty,

Please see below.

On Tue, 26 Feb 2008, Marty Barnaby wrote:
> For my 26 GB/s rate, I don't know many of the specifics of the FS itself,
> except that it is what is being called /scratch_grande on our big XT3,
> Redstorm.
> 
> For various reasons I didn't choose IOR, but instead ran a simple MPI
> executable based on something written by Rob Matzke for our now-defunct
> SAF IO library project.
> 
> He basically had a user input parameter for the blocksize and, after the
> MPI_File_open, looped nrecs times, re-dumping the same buffer with
> MPI_File_write_at_all, recomputing each processor's offset from the
> buffer size and the total processor count.
> 
> We were working on having a real user get his enormous restart written in
> less time when I decided to find the combination that got me the absolute
> fastest rate. I don't have a complete histogram of parameters, but I found
> that with a job where the processor count was the same as the maximum LFS
> stripe count (160) and the buffer size was 20 MB per processor, I could
> consistently get 26 GB/s for files over 100 GB in total, final size.

This is interesting. It indicates that setting the number of I/O 
aggregators (the cb_nodes hint in ROMIO) to the same value as the Lustre 
stripe count can provide better I/O performance. Is your 20 MB buffer 
size the ROMIO collective buffer size (the cb_buffer_size hint)? It is 
known that if cb_buffer_size is small, a single collective I/O call may 
be carried out in several two-phase I/O stages. Providing a large 
cb_buffer_size (the default is 16 MB as of mpich2-1.0.6) can reduce the 
number of two-phase I/O stages.
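
Just to make sure we are talking about the same thing, here is a minimal 
sketch of how I read your pattern (not your actual program; the hint 
values and the 20 MB buffer simply mirror the numbers from your run), 
with the two hints set through an MPI_Info object:

/* Sketch: set the ROMIO hints discussed above and repeat a collective
 * write, one fixed-size record per iteration.  Each rank writes a
 * disjoint 20 MB block of every record. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define NRECS    8
#define BUFSIZE  (20 * 1024 * 1024)   /* 20 MB per process, as in your test */

int main(int argc, char **argv)
{
    int rank, nprocs, i;
    char *buf;
    MPI_File fh;
    MPI_Info info;
    MPI_Offset offset;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    buf = malloc(BUFSIZE);

    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_nodes", "160");            /* match the stripe count */
    MPI_Info_set(info, "cb_buffer_size", "20971520"); /* 20 MB collective buffer */

    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    for (i = 0; i < NRECS; i++) {
        /* each record is nprocs * BUFSIZE bytes; ranks write disjoint blocks */
        offset = ((MPI_Offset)i * nprocs + rank) * BUFSIZE;
        MPI_File_write_at_all(fh, offset, buf, BUFSIZE, MPI_BYTE,
                              MPI_STATUS_IGNORE);
    }

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    free(buf);
    MPI_Finalize();
    return 0;
}

The hints can also be supplied at run time through ROMIO's hints file 
(the ROMIO_HINTS environment variable), which is handy when you cannot 
modify the benchmark source.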


> I thought this had no practical value at the time, since it means having
> 3.3 GB across the communicator for one collective field or variable store
> operation. However, now I'm seeing fields this large, and I think this may
> approach representing the second part of a properly applied two-phase
> aggregation. Besides what is already available in ROMIO, I am working with
> the CAM code from UCAR, where they have written their own IO package that
> applies the primary phase in a specific manner, usually involving
> 'rearrangement': turning a large number of small buffers into a small
> number of larger ones on a subset of the global communicator's processors,
> then performing the writes via PNetCDF.

I have some experience with I/O rearrangement. I found that if the ROMIO 
file domains are partitioned so that they align with the stripe 
boundaries, I/O performance can be improved significantly. Experimental 
results can be found in my IPDPS 2007 paper:
 "An Implementation and Evaluation of Client-Side File Caching for MPI-IO"

The alignment has been tested in ROMIO and is enabled by a hint, but this 
feature has not been incorporated into the ROMIO release yet.
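
To sketch the idea (this is not the actual ROMIO code, and the names 
below are made up for illustration): each aggregator's file domain is 
cut on stripe boundaries, so a given stripe is written by only one 
aggregator and lock contention between clients on the OSTs is reduced.

/* Sketch of stripe-aligned file-domain partitioning.  stripe_size and
 * the aggregate access range [start, end) are assumed to be known;
 * the real ROMIO code handles many more details. */
#include <mpi.h>

static void aligned_file_domain(MPI_Offset start, MPI_Offset end,
                                MPI_Offset stripe_size,
                                int naggr, int aggr_rank,
                                MPI_Offset *fd_start, MPI_Offset *fd_end)
{
    /* stripes covered by the aggregate access range */
    MPI_Offset first_stripe = start / stripe_size;
    MPI_Offset last_stripe  = (end - 1) / stripe_size;
    MPI_Offset nstripes     = last_stripe - first_stripe + 1;

    /* hand out whole stripes, not bytes, to each aggregator */
    MPI_Offset per_aggr = (nstripes + naggr - 1) / naggr;

    *fd_start = (first_stripe + (MPI_Offset)aggr_rank * per_aggr)
                * stripe_size;
    *fd_end   = *fd_start + per_aggr * stripe_size;

    /* clip to the actual access range; an aggregator with
     * *fd_start >= *fd_end has no file domain */
    if (*fd_start < start) *fd_start = start;
    if (*fd_end   > end)   *fd_end   = end;
}

The point is simply that every boundary between two file domains falls 
on a stripe boundary, so each stripe is written by only one aggregator.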

Wei-keng


> 
> Marty
> 
> 
> 
> Wei-keng Liao wrote:
> > Hi,
> >
> > The IOR benchmark includes pnetcdf: http://sourceforge.net/projects/ior-sio
> > Its pnetcdf part was recently modified to use record variables. That
> > should take care of the file-appending I/O mode.
> >
> > I am interested in your results of 26 GB/s on Lustre using MPI-IO on
> > shared files. My experience topped out at about 15 GB/s on Jaguar @ORNL,
> > using an application I/O kernel whose data partitioning pattern is 3D
> > block-block-block. But those results were obtained when Jaguar was
> > running Catamount; Jaguar is currently being upgraded to new hardware
> > and software. I would appreciate it if you could share your results.
> >
> > Wei-keng
> >
> >
> > On Tue, 26 Feb 2008, Marty Barnaby wrote:
> >
> >   
> > > I'm new to the parallel NetCDF interface, and I don't have much
> > > experience with NetCDF either. Because of new interest on our part, we
> > > would like a straightforward benchmark program to get byte-rate
> > > metrics for writing to a POSIX FS (chiefly some large Lustre
> > > deployments). I've had some reasonable experience with this at the
> > > MPI-IO level, achieving a sustained average rate of 26 GB/s writing to
> > > a single, shared file with an LFS stripe count of 160. If anyone is
> > > interested, I could provide more specifics.
> > >
> > > I can't find the benchmark-type code that I really need, though I've
> > > been looking at the material under /test, like /test_double/test_write.c,
> > > which I've compiled and executed successfully at the appropriate -np 4
> > > level.
> > >
> > > There are three dynamics I would like to have that I can't see how to
> > > get.
> > >
> > >   1. Run on any number of processors. I'm sure this is simple, but I
> > >      want to know where the failure is when I attempt it.
> > >
> > >   2. Set the number of bytes appended to an open file in a single,
> > >      atomic, collective write operation. In my MPI-IO benchmark program
> > >      I merely got this number by having a buffer size on each processor,
> > >      and the total was the product of this times the number of
> > >      processors. At the higher level of PNetCDF I'm not sure which
> > >      value I'm getting in the def_var and put_var.
> > >
> > >   3. Be able to perform any number of equivalent, collective write
> > >      operations, appending to the same, open file. Simply a:
> > >
> > >          for ( i = 0; i < NUMRECS; i++ )
> > >
> > >      concept. This is basically what our scientific simulation
> > >      applications do in their 'dump' mode.
> > >
> > >
> > > Thanks,
> > > Marty Barnaby
> > >
> > >     
> >
> >
> >   
> 
> 



