<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

For my 26 GB/s rate, I don't know many of the specifics of the FS itself,

except that it is what is being called /scratch_grande on our big XT3,

Redstorm.<br>

<br>

For various reasons, I didn't choose IOR but instead ran a simple MPI

executable based on something written by Rob Matzke for our now-defunct

SAF IO library project.<br>

<br>

He basically had a user input parameter for the block size and, after

the MPI_File_open, went around a loop nrecs times, re-dumping the same

buffer with MPI_File_write_at_all after recomputing, for each processor,

the offset based on the buffer size and the total processor count.<br>
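<br>
As a rough sketch of that offset arithmetic, assuming each collective write lays the processors' blocks end to end in rank order and the records follow one another in the file (this is not Rob's actual code; write_offset and all_offsets are hypothetical names, and Python is used only for brevity):<br>
<pre wrap="">
```python
def write_offset(rec, rank, nprocs, blocksize):
    """Byte offset of one rank's block within record number rec.

    Assumed layout: within a record the ranks' blocks sit end to end,
    and record rec+1 starts right after record rec.
    """
    return (rec * nprocs + rank) * blocksize


def all_offsets(nrecs, nprocs, blocksize):
    """Offsets in file order: record by record, then rank by rank."""
    return [write_offset(rec, rank, nprocs, blocksize)
            for rec in range(nrecs)
            for rank in range(nprocs)]
```
</pre>
With 4 processors and a 10-byte block, for example, record 1 starts at byte 40, and rank 2 of that record writes at byte 60.<br>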

<br>

We were working on having a real user get his enormous restart in less

time when I decided to find the combination that gave me the absolute

fastest rate. I don't have an entire histogram of parameters, but I found

that with a job where the processor count was the same as the maximum LFS

stripe-count (160) and the buffer size was 20 MB per processor, I could

consistently get to 26 GB/s for files over 100 GB in total, final size.<br>
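<br>
For what it's worth, those sizes work out as follows. This is only back-of-the-envelope arithmetic: whether the 20 MB is decimal or binary is an assumption either way, so both are shown, and the names are made up for illustration:<br>
<pre wrap="">
```python
MB = 10**6    # decimal megabyte
MIB = 2**20   # binary mebibyte
GB = 10**9    # decimal gigabyte

NPROCS = 160        # processor count, equal to the maximum LFS stripe-count
RATE_GB_S = 26.0    # observed sustained rate in GB/s


def per_write_gb(buffer_bytes):
    """Decimal GB moved by one collective write across all processors."""
    return NPROCS * buffer_bytes / GB


def writes_to_reach(total_gb, buffer_bytes):
    """Smallest number of collective writes whose sum reaches total_gb."""
    per_write = NPROCS * buffer_bytes
    return -(-total_gb * GB // per_write)  # ceiling division on integers


decimal_gb = per_write_gb(20 * MB)     # 3.2 GB per collective write
binary_gb = per_write_gb(20 * MIB)     # about 3.36 GB per collective write
seconds_per_write = binary_gb / RATE_GB_S
```
</pre>
At 26 GB/s one such write takes roughly 0.12 to 0.13 seconds, and 30 to 32 of them pass the 100 GB mark.<br>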

<br>

I thought this had no practical value at the time, since it means

having 3.3 GB across the communicator for a single collective field or

variable store operation. However, I am now seeing fields this large,

and I think this may come close to representing the second phase of a

properly applied two-phase aggregation. Besides what is already available

in ROMIO, I am working with the CAM code from UCAR, where they have

written their own IO package that applies the first phase in a

specific manner, usually involving 'rearrangement': turning a large

number of small buffers into a small number of larger ones on a subset

of the global communicator's processors, then performing the writes via

PNetCDF.<br>
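<br>
To make 'rearrangement' concrete, here is a serial toy sketch in Python. It is not the CAM IO package's actual scheme and does no MPI communication; the contiguous grouping of ranks and the names aggregator_for and rearrange are assumptions made for illustration:<br>
<pre wrap="">
```python
def aggregator_for(rank, nprocs, naggs):
    """Map a rank to one of naggs aggregators using contiguous rank groups."""
    return rank * naggs // nprocs


def rearrange(buffers, naggs):
    """Merge per-rank byte buffers into naggs larger buffers, in rank order.

    buffers[r] stands in for the small buffer held by rank r; in a real
    code the merge would be done with sends or a gather on a subset of
    the communicator, not a local loop.
    """
    nprocs = len(buffers)
    merged = [bytearray() for _ in range(naggs)]
    for rank, buf in enumerate(buffers):
        merged[aggregator_for(rank, nprocs, naggs)].extend(buf)
    return [bytes(m) for m in merged]
```
</pre>
With 8 ranks and 2 aggregators, ranks 0 to 3 feed the first large buffer and ranks 4 to 7 the second, so the later write phase issues 2 large writes instead of 8 small ones.<br>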

<br>

Marty<br>

<br>

<br>

<br>

Wei-keng Liao wrote:

<blockquote

 cite="mid:Pine.LNX.4.64.0802261238090.15708@delta.ece.northwestern.edu"

 type="cite">

  <pre wrap="">Hi,

IOR benchmark includes pnetcdf. <a class="moz-txt-link-freetext" href="http://sourceforge.net/projects/ior-sio">http://sourceforge.net/projects/ior-sio</a>

Its pnetcdf was recently modified to use record variables. That should

take care of the file appending I/O mode.

I am interested in your results of 26 GB/s on Lustre using MPI-IO on

shared files. My experience was about 15 GB/s top on Jaguar @ORNL using an

application I/O kernel whose data partitioning patterns are 3D

block-block-block. But the results were obtained when Jaguar was running

catamount. Jaguar is currently under upgrade to new hardware and software.

I would appreciate if you can share your results.

Wei-keng

On Tue, 26 Feb 2008, Marty Barnaby wrote:

  </pre>

  <blockquote type="cite">

    <pre wrap="">I'm new to the parallel NetCDF interface, and I don't have much

experience with NetCDF either. Because of new interest on our part, we

would like to have a straightforward, benchmark program to get byte-rate

metrics for writing to a Posix FS (chiefly, some large Lustre

deployments). I've had some reasonable experiences in this at the MPI-IO

level, achieving a sustained, average rate of 26 GB/s; this writing to a

single, shared file with an LFS stripe-count of 160. If anyone is

interested, I could provide them with more specifics.

I can't find the benchmark-type code that I really need, though I've

been looking at the material under /test like /test_double/test_write.c

This I've compiled and executed successfully at the appropriate -np 4

level.

There are three dynamics I would like to have that I can't see how to

get.

  1. Run on any number of processors. I'm sure this is simple, but I want to

      know where the failure is when I attempt it.

  2. Set the number of bytes appended to an open file in a single, atomic,

      collective write operation. In my MPI-IO benchmark program I merely

      got this number by having a buffer size on each processor, and the

      total was the product of this times the number of processors. At the higher

      level of PNetCDF I'm not sure which value I'm getting in the def_var

      and put_var.

  3. Be able to perform any number of equivalent, collective write operations,

      appending to the same, open file. Simply a:

      for ( i = 0; i &lt; NUMRECS; i++ )

      concept. This is basically what our scientific, simulation applications

      do in their 'dump' mode.

Thanks,

Marty Barnaby

    </pre>

  </blockquote>


</blockquote>

<br>

</body>

</html>