Performance tuning problem with iput_vara_double/wait_all

Wei-keng Liao wkliao at ece.northwestern.edu
Thu Dec 27 14:47:31 CST 2012


Hi, Phil,

It looks like each of your (1, 1, 24, 52) requests ends up writing one
array element of 8 bytes at a time: since latitude and longitude vary
fastest in the file, none of the 24 x 52 values in a request are
contiguous on disk. Performance will definitely be poor for this I/O
strategy.

Please check whether the file indices translated by mask are actually
contiguous. If so, you can replace the loop with a single write call.
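For example, something along these lines (an untested sketch reusing
your variable names; nvals(2:4) keep the values you already set, and
j and nreq are new locals):

    integer :: j, nreq

    nreq = 0
    i = myigp_begin
    do while (i <= myigp_end)
       ! extend the run while the next point is at lat+1, same lon
       j = i
       do while (j < myigp_end)
          if (mask(j+1, 2) /= mask(j, 2) + 1 .or. &
              mask(j+1, 3) /= mask(j, 3)) exit
          j = j + 1
       end do

       start(1) = mask(i, 2)
       start(2) = mask(i, 3)
       nvals(1) = j - i + 1      ! run length along latitude
       nreq = nreq + 1
       ! note: values(i:j,:,:) is a non-contiguous section, so the
       ! compiler passes a copy-in temporary; a nonblocking put needs
       ! its buffer to stay valid until nfmpi_wait_all, so you may
       ! have to stage each run into a scratch buffer you keep around
       call nc_check(nfmpi_iput_vara_double(nfid, varidv, start, &
                     nvals, values(i:j, :, :), reqs(nreq)))
       i = j + 1
    end do

    call nc_check(nfmpi_wait_all(nfid, nreq, reqs, stats))

Even partial coalescing helps: a run of length r turns r x 24 x 52
isolated 8-byte pieces into 24 x 52 pieces of 8r bytes each.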

Wei-keng

On Dec 26, 2012, at 11:25 PM, Phil Miller wrote:

> I'm trying to write out a NetCDF file from an MPI program written in
> Fortran 90, and I'm seeing painfully bad performance: about 6 MB/sec
> aggregate across all the processes in my job.
> 
> The file will contain a 4D array of doubles, with the dimensions being
> called (latitude, longitude, pft, week), of sizes (720, 360, 24, 52),
> giving a data volume of ~2.4 GiB. In the program, this data is
> distributed among MPI ranks, each of which holds some arbitrary subset
> of the points (distributed so as to be approximately load balanced,
> since some points are heavier to compute than others). Their arrays
> are 3D (point, pft, week), with a translation from point to lat/lon
> through an indirection array 'mask'.
> 
> My approach is to issue, from each rank, one nonblocking array put
> call of size (1, 1, 24, 52) per point it owns, and then wait_all to
> complete the operation.
> 
> I'm running this on the Hopper Cray XE6 system at NERSC with a Lustre
> filesystem. They have pnetcdf version 1.3.1 installed, and I'm
> compiling my code with Intel's compiler, version 12. The target
> directory is set to 1MB stripes across 48 OSTs, the recommended
> configuration from NERSC's site for files in the 1-10GB range.
> Following that advice, I've also tried setting the environment
> variable
> MPICH_MPIIO_HINTS="*:romio_cb_write=enable:romio_ds_write=disable"
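> 
> If it matters, I understand the same hints can also be attached via
> an MPI_Info object when the file is created (sketch only, not my
> actual code; 'output.nc' stands in for my real path):
> 
>     use mpi
>     implicit none
>     include 'pnetcdf.inc'
> 
>     integer :: info, ierr, nfid
> 
>     call MPI_Info_create(info, ierr)
>     call MPI_Info_set(info, 'romio_cb_write', 'enable', ierr)
>     call MPI_Info_set(info, 'romio_ds_write', 'disable', ierr)
>     ! standard MPI-IO striping hints, mirroring the lfs settings;
>     ! these only take effect when the file is first created
>     call MPI_Info_set(info, 'striping_factor', '48', ierr)
>     call MPI_Info_set(info, 'striping_unit', '1048576', ierr)
> 
>     call nc_check(nfmpi_create(MPI_COMM_WORLD, 'output.nc', &
>                   NF_CLOBBER, info, nfid))
>     call MPI_Info_free(info, ierr)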
> 
> The entire write takes about 6 minutes. If I switch to independent
> data mode, it gets much slower - more than 20 minutes to write the
> file.
> 
> The heart of the source code in question is as follows:
> 
>     nvals(1) = 1
>     nvals(2) = 1
>     nvals(3) = numpft
>     nvals(4) = 52
> 
>     start(3) = 1
>     start(4) = 1
> 
>     do i = myigp_begin, myigp_end
>        start(1) = mask(i, 2)
>        start(2) = mask(i, 3)
>        call nc_check(nfmpi_iput_vara_double(nfid, varidv, start, &
>                      nvals, values(i,:,:), reqs(i)))
> 
>        if (myid == 0) print *, 'Put data', i, start(1), start(2), myigp_num
>     end do
> 
>     if (myid == 0) print *, 'Put all data', myigp_num, size(reqs), &
>                             (myigp_end - myigp_begin + 1)
> 
>     call nc_check(nfmpi_wait_all(nfid, myigp_num, reqs, stats))
> 
> The more complete source code as I have it so far, with the array
> declarations, allocations, and surrounding routines, can be seen here:
> http://pastebin.com/BSwT8n2s
> 
> Are there other hints that I should be giving the library? Do I need
> to redistribute the data into some 'nice' order myself before making
> the put calls, or pass the data in a different arrangement? It seems
> like the standard two-phase collective I/O algorithm should work
> fine here, but I'm not seeing that happen.
> 
> Thank you for your attention.
> 
> Phil
