Performance tuning problem with iput_vara_double/wait_all

Phil Miller mille121 at illinois.edu
Wed Dec 26 23:25:29 CST 2012


I'm trying to write out a NetCDF file from an MPI program written in
Fortran 90, and I'm seeing painfully bad performance: about 6 MB/s
aggregate across all the processes in my job.

The file will contain a 4D array of doubles, with the dimensions being
called (latitude, longitude, pft, week), of sizes (720, 360, 24, 52),
giving a data volume of ~2.4 GiB. In the program, this data is
distributed among MPI ranks, each of which holds some arbitrary subset
of the points (distributed so as to be approximately load balanced,
since some points are heavier to compute than others). Their arrays
are 3D (point, pft, week), with a translation from point to lat/lon
through an indirection array 'mask'.
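For reference, the variable is defined roughly like this (a condensed
sketch of what's in the fuller source linked below; the create mode flag,
filename, and variable name are illustrative, not the exact code):

```fortran
! Sketch of the file/variable definition matching the layout described
! above. Error checking via nc_check, as in the write loop.
call nc_check(nfmpi_create(MPI_COMM_WORLD, 'output.nc', &
     NF_64BIT_OFFSET, MPI_INFO_NULL, nfid))
call nc_check(nfmpi_def_dim(nfid, 'latitude',  720_MPI_OFFSET_KIND, dimids(1)))
call nc_check(nfmpi_def_dim(nfid, 'longitude', 360_MPI_OFFSET_KIND, dimids(2)))
call nc_check(nfmpi_def_dim(nfid, 'pft',        24_MPI_OFFSET_KIND, dimids(3)))
call nc_check(nfmpi_def_dim(nfid, 'week',       52_MPI_OFFSET_KIND, dimids(4)))
call nc_check(nfmpi_def_var(nfid, 'var', NF_DOUBLE, 4, dimids, varidv))
call nc_check(nfmpi_enddef(nfid))
```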

What I'm trying to do is issue, from each rank, one nonblocking array
put call of shape (1, 1, 24, 52) per point it owns, and then a single
wait_all to complete the operation.

I'm running this on the Hopper Cray XE6 system at NERSC with a Lustre
filesystem. They have pnetcdf version 1.3.1 installed, and I'm
compiling my code with Intel's compiler, version 12. The target
directory is set to 1MB stripes across 48 OSTs, the recommended
configuration from NERSC's site for files in the 1-10GB range.
Following that advice, I've also tried setting the environment
variable
MPICH_MPIIO_HINTS="*:romio_cb_write=enable:romio_ds_write=disable"
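Concretely, the setup looks like this (a sketch; the output directory is
a placeholder, and I'm assuming the older single-letter lfs option
spelling for the Lustre version on Hopper):

```shell
# Stripe the target directory: 1 MiB stripes across 48 OSTs, per NERSC's
# recommendation for files in the 1-10 GB range. $OUTPUT_DIR is a placeholder.
lfs setstripe -s 1m -c 48 "$OUTPUT_DIR"

# Enable collective buffering and disable data sieving for writes in
# Cray's MPI-IO layer, set before launching the job.
export MPICH_MPIIO_HINTS="*:romio_cb_write=enable:romio_ds_write=disable"
```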

The entire write takes about 6 minutes. If I switch to independent
data mode, it gets much slower - more than 20 minutes to write the
file.

The heart of the source code in question is as follows:

     nvals(1) = 1
     nvals(2) = 1
     nvals(3) = numpft
     nvals(4) = 52

     start(3) = 1
     start(4) = 1

     do i = myigp_begin, myigp_end
        start(1) = mask(i, 2)
        start(2) = mask(i, 3)
        call nc_check(nfmpi_iput_vara_double(nfid, varidv, start, &
             nvals, values(i,:,:), reqs(i)))

        if (myid == 0) print *, 'Put data', i, start(1), start(2), myigp_num
     end do

     if (myid == 0) print *, 'Put all data', myigp_num, size(reqs), &
          (myigp_end - myigp_begin + 1)

     call nc_check(nfmpi_wait_all(nfid, myigp_num, reqs, stats))

The more complete source code as I have it so far, with the array
declarations, allocations, and surrounding routines, can be seen here:
http://pastebin.com/BSwT8n2s

Are there other hints that I should be giving the library? Do I need
to redistribute the data into some 'nice' order myself before making
the put calls, or pass the data in a different arrangement? It seems
like the standard two-phase collective I/O algorithm should work fine
here, but I'm not seeing that happen.
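In case it matters, I could also pass the hints through an MPI_Info
object at file creation rather than the environment variable. Roughly
(a sketch; the striping_factor/striping_unit hint names are assumptions
about what Cray's MPI-IO layer honors, and the filename is illustrative):

```fortran
! Sketch: supplying ROMIO/Lustre hints via MPI_Info at create time.
call MPI_Info_create(info, ierr)
call MPI_Info_set(info, 'romio_cb_write', 'enable', ierr)
call MPI_Info_set(info, 'romio_ds_write', 'disable', ierr)
call MPI_Info_set(info, 'striping_factor', '48', ierr)      ! 48 OSTs
call MPI_Info_set(info, 'striping_unit', '1048576', ierr)   ! 1 MiB stripes
call nc_check(nfmpi_create(MPI_COMM_WORLD, 'output.nc', &
     NF_CLOBBER, info, nfid))
call MPI_Info_free(info, ierr)
```

Would that be expected to behave any differently from the environment
variable on this system?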

Thank you for your attention.

Phil

