Performance when reading many small variables

Wei-keng Liao wkliao at eecs.northwestern.edu
Mon Dec 7 23:36:47 CST 2015


Hi, Michael

Another way is to have one process read all the data from the file
and broadcast it to the other processes. In your example, the O(2000)
single-entry variables occupy only about 16 KB, and broadcasting 16 KB
should not take long on today's parallel computers.

To further improve performance, you can apply the nonblocking
API approach on the root process (opening the file with MPI_COMM_SELF),
so those 2000 single-entry "get" requests are aggregated into a single
MPI file read.

Wei-keng

On Dec 7, 2015, at 11:18 PM, Schlottke-Lakemper, Michael wrote:

> Hi Wei-keng,
> 
> Thanks a lot for your elaborate answer. It might take us a while to implement your suggestions, but it gives us a good idea where to start.
> 
> Michael
> 
>> On 01 Dec 2015, at 08:06, Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:
>> 
>> Hi, Michael
>> 
>> You can use the PnetCDF nonblocking APIs to read. A code fragment that uses
>> nonblocking reads is shown below.
>> 
>>   int reqs[2000], statuses[2000];
>> 
>>   err = ncmpi_open(MPI_COMM_WORLD, filename, omode, MPI_INFO_NULL, &ncid);
>> 
>>   /* post 2000 nonblocking read requests, one per variable */
>>   for (i=0; i<2000; i++)
>>       err = ncmpi_iget_vara_int(ncid, varid[i], start, count, &buf[i], &reqs[i]);
>> 
>>   /* flush all pending requests in a single MPI-IO call */
>>   err = ncmpi_wait_all(ncid, 2000, reqs, statuses);
>> 
>> 
>> If there is only one entry per variable, you can use the var APIs and skip
>> the start and count arguments. For example:
>> 
>>   for (i=0; i<2000; i++)
>>       err = ncmpi_iget_var_int(ncid, varid[i], &buf[i], &reqs[i]);
>> 
>> 
>> 
>> The PnetCDF nonblocking APIs defer the requests until ncmpi_wait_all, where all
>> requests are aggregated into one big, single MPI I/O call. Many example programs
>> (in C and Fortran) are available in every PnetCDF release under the examples
>> directory: http://trac.mcs.anl.gov/projects/parallel-netcdf/browser/trunk/examples
>> 
>> In addition, I suggest opening the input file using MPI_COMM_WORLD, so the
>> program can take advantage of MPI collective I/O for better performance, even if
>> all processes read the same data.
>> 
>> If your input file is generated by a PnetCDF program, then I suggest disabling
>> file offset alignment for the fixed-size (non-record) variables, given that there
>> is only one entry per variable. To disable alignment, create an MPI info object,
>> set nc_var_align_size to 1, and pass the info object to the ncmpi_create call.
>> Alternatively, you can set the same hint at run time. Please see
>> https://trac.mcs.anl.gov/projects/parallel-netcdf/wiki/HintsForPnetcdf
>> and
>> http://cucis.ece.northwestern.edu/projects/PnetCDF/doc/pnetcdf-c/PNETCDF_005fHINTS.html
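>> 
>> For instance, a minimal sketch of setting this hint through an MPI info object
>> at file creation time (error checking omitted) could be:
>> 
>>   MPI_Info info;
>>   MPI_Info_create(&info);
>>   /* disable alignment of fixed-size variables */
>>   MPI_Info_set(info, "nc_var_align_size", "1");
>>   err = ncmpi_create(MPI_COMM_WORLD, filename, NC_CLOBBER, info, &ncid);
>>   MPI_Info_free(&info);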
>> 
>> For further information, please check Q&A in
>> http://cucis.ece.northwestern.edu/projects/PnetCDF/faq.html
>> and
>> http://cucis.ece.northwestern.edu/projects/PnetCDF
>> 
>> Wei-keng
>> 
>> On Nov 30, 2015, at 11:53 PM, Schlottke-Lakemper, Michael wrote:
>> 
>>> Dear all,
>>> 
>>> We recently converted all of our code to use the Parallel netCDF library instead of the NetCDF library (before we had a mix), also using Pnetcdf for non-parallel file access. We did not have any issues whatsoever, until one user notified us of a performance regression in a particular case.
>>> 
>>> He is trying to read many (O(2000)) variables from a single file in a loop, each variable with just one entry. Since this is very old code and usually only a few variables are involved, each process reads the same data individually. Before, the NetCDF library was used for this task, and during refactoring it was replaced by Pnetcdf with MPI_COMM_SELF. When running the code on a moderate number of MPI ranks (~500), the user noticed a severe performance degradation after the switch to Pnetcdf:
>>> 
>>> Before, reading the 2000 variables cumulatively took ~0.6 s. After switching to Pnetcdf (using ncmpi_get_vara_int_all), this increased to ~300 s. Going from MPI_COMM_SELF to MPI_COMM_WORLD reduced it to ~30 s, which is still high in comparison.
>>> 
>>> What, if anything, can we do to get similar performance when using Pnetcdf in this particular case? I know this is a rather degenerate case and that one possible fix would be to change the layout to one variable with 2000 entries, but I was hoping that someone here has a suggestion for what we could try anyway.
>>> 
>>> Thanks a lot in advance
>>> 
>>> Michael
>>> 
>>> 
>>> --
>>> Michael Schlottke-Lakemper
>>> 
>>> SimLab Highly Scalable Fluids & Solids Engineering
>>> Jülich Aachen Research Alliance (JARA-HPC)
>>> RWTH Aachen University
>>> Wüllnerstraße 5a
>>> 52062 Aachen
>>> Germany
>>> 
>>> Phone: +49 (241) 80 95188
>>> Fax: +49 (241) 80 92257
>>> Mail: m.schlottke-lakemper at aia.rwth-aachen.de
>>> Web: http://www.jara.org/jara-hpc
>>> 
>> 
> 


