how to do nonblocking collective i/o

Wei-keng Liao wkliao at ece.northwestern.edu
Mon Jan 28 22:31:04 CST 2013


Here is a detailed explanation.

Prior to r1121, the implementation of the nonblocking APIs did not break each
request into a list of offset-length pairs. It simply used the start[] and
count[] arguments to define a filetype through a call to
MPI_Type_create_subarray(). The filetypes of all nonblocking requests were
then concatenated into a single one, provided their file offsets were
monotonically increasing. If not, the requests were divided into groups, each
group satisfying the MPI fileview requirement, and each group then made an
MPI collective read/write call.
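
To illustrate (this is only a rough sketch, not the actual PnetCDF source;
gsizes, NY, NX, and filetype are names made up here), the old approach did
roughly the following for each pending request:

    /* sketch: build a filetype for one request from its start[]/count[],
       assuming a 2D variable of global shape NY x NX stored as doubles */
    int gsizes[2], subsizes[2], starts[2];
    MPI_Datatype filetype;

    gsizes[0]   = NY;             gsizes[1]   = NX;
    subsizes[0] = (int)count[0];  subsizes[1] = (int)count[1];
    starts[0]   = (int)start[0];  starts[1]   = (int)start[1];

    MPI_Type_create_subarray(2, gsizes, subsizes, starts, MPI_ORDER_C,
                             MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);
    /* the filetypes of all pending requests are then concatenated (when
       their offsets are monotonically increasing) and used to set the
       MPI fileview before a single collective read/write */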

The reason for the above design was memory consumption. Flattening each
request into a list of offset-length pairs can take up a non-negligible
amount of memory. Take examples/column_wise.c as an example, in which each
process writes a few non-contiguous columns of a global 2D array using the
nonblocking APIs. The C struct for each flattened offset-length pair takes
24 bytes, larger than the data it describes (8 bytes if the variable is of
type double). This is why I did not use this approach in the first place.
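
For reference, a flattened pair is conceptually something like the struct
below (this is only a sketch; the exact fields in the PnetCDF source may
differ, but on a 64-bit machine a layout like this is where the 24 bytes
come from):

    /* conceptual sketch only, not the actual internal PnetCDF struct */
    typedef struct {
        MPI_Offset off;   /* starting file offset of the piece  (8 bytes) */
        MPI_Offset len;   /* length of the piece                (8 bytes) */
        void      *buf;   /* pointer into the user's buffer     (8 bytes) */
    } flat_pair;          /* 24 bytes per pair vs. 8 bytes of double data */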

The fix in r1121 eventually adopts this flattening approach anyway, so be warned!

If the extra memory is of no concern, the new fix can significantly improve
performance. I evaluated column_wise.c with a slightly larger array size
and saw a decent improvement. You can give it a try.

Maybe in the future we will come up with a smarter approach that dynamically
decides when to fall back to the previous method, say when the additional
memory required exceeds a threshold.

Wei-keng

On Jan 28, 2013, at 9:45 PM, Liu, Jaln wrote:

> Hi Dr. Liao,
> 
> So glad to know that it has been solved. 
> 
> I was confused about why sorting the offset-length pairs could not generate a fileview that abides by the requirements; I didn't know that the nonblocking requests were divided into groups. Can you please tell me why the nonblocking requests were initially designed to be divided into groups? Was it for scalability or some other reason? If for scalability, how about now?
> 
>> Please give it a try and let me know if you see a problem.
> 
> Sure, I'm testing it.
> 
> Jialin
> ________________________________________
> Best Regards,
> Jialin Liu, Ph.D student.
> Computer Science Department
> Texas Tech University
> Phone: 806.742.3513(x241)
> Office:Engineer Center 304
> http://myweb.ttu.edu/jialliu/
> ________________________________________
> From: parallel-netcdf-bounces at lists.mcs.anl.gov [parallel-netcdf-bounces at lists.mcs.anl.gov] on behalf of Wei-keng Liao [wkliao at ece.northwestern.edu]
> Sent: Monday, January 28, 2013 6:13 PM
> To: parallel-netcdf at lists.mcs.anl.gov
> Subject: Re: how to do nonblocking collective i/o
> 
> Hi, Jialin, please see my in-line response below.
> 
> On Jan 28, 2013, at 4:05 PM, Liu, Jaln wrote:
> 
>> Hi Rob,
>> 
>> Thanks for your answer.
>> 
>>> You're close. I  bet by the time I finish writing this email Wei-keng
>>> will already respond.
>> 
>> You remind me of a previous thread on the pnetcdf mailing list:
>> 'Performance tuning problem with iput_vara_double/wait_all',
>> 
>> Dr. Liao once mentioned "But concatenating two filetypes will end up with a filetype violating the requirement of monotonic non-decreasing file offsets",
>> 
>> So I guess that even though my code correctly tries to do non-blocking collective I/O, it will still end up performing collective I/O individually for each group of requests, right?
>> Is there any way to know this before running a performance test?
> 
> This problem has been resolved in SVN r1121 committed on Saturday.
> Please give it a try and let me know if you see a problem.
> 
> 
>> I have another related question.
>> According to the paper "Combining I/O operations for multiple array variables in parallel netCDF", nonblocking collective I/O is designed for accessing multiple variables. But I assume it is also useful for optimizing access to multiple subsets of a single variable, just like what I'm trying to do in the code, right?
> 
> PnetCDF nonblocking APIs can be used to aggregate requests within a variable and
> across variables (including a mix of record and non-record variables). There is an
> example program newly added in trunk/examples/column_wise.c that makes multiple
> nonblocking write calls to a single 2D variable, each request writing one column of the array.
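> Conceptually, the pattern is like the sketch below (this is not the actual
> example file; NX, NY, my_columns, and buf are placeholder names):
> 
>     int        i, err, req[NX], st[NX];
>     MPI_Offset start[2], count[2];
> 
>     count[0] = NY;  count[1] = 1;            /* one whole column per request */
>     for (i = 0; i < NX; i++) {
>         start[0] = 0;
>         start[1] = my_columns[i];            /* non-contiguous columns */
>         err = ncmpi_iput_vara_double(ncid, varid, start, count,
>                                      &buf[i*NY], &req[i]);
>         if (err != NC_NOERR) handle_error(err);
>     }
>     err = ncmpi_wait_all(ncid, NX, req, st); /* flush all requests at once */
>     if (err != NC_NOERR) handle_error(err);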
> 
> 
> Wei-keng
> 
> 
> 
>> Jialin
>> 
>> 
>> 
>>> Here is the code I wrote:
>>> 
>>>       float **nb_temp_in = malloc(numcalls * sizeof(float *));
>>>       int *request = calloc(numcalls, sizeof(int));
>>>       int *status  = calloc(numcalls, sizeof(int));
>>>       int varasize;
>>> 
>>>       /* post one nonblocking read per call */
>>>       for (j = 0; j < numcalls; j++) {
>>>         mpi_count[1] = (j > NLVL) ? NLVL : j + 1;
>>>         varasize = mpi_count[0] * mpi_count[1] * NLAT * NLON;
>>>         nb_temp_in[j] = calloc(varasize, sizeof(float));
>>>         ret = ncmpi_iget_vara(ncid, temp_varid,
>>>                               mpi_start, mpi_count, nb_temp_in[j],
>>>                               varasize, MPI_FLOAT, &request[j]);
>>>         if (ret != NC_NOERR) handle_error(ret);
>>>       }
>>> 
>>>       /* collectively wait for all posted requests to complete */
>>>       ret = ncmpi_wait_all(ncid, numcalls, request, status);
>>>       for (j = 0; j < numcalls; j++)
>>>         if (status[j] != NC_NOERR) handle_error(status[j]);
>>>     }
>>> 
>>> I have two questions:
>>> 1. In the above code, what is the right way to parallelize the program?
>>> By decomposing the for loop "for (j = 0; j < numcalls; j++)"?
>> 
>> No "right" way, really. Depends on what the reader needs.  Decomposing
>> over numcalls is definitely one way.  Or you can decompose over
>> 'mpi_start' and 'mpi_count' -- though I personally have to wrestle
>> with block decomposition for a while before it's correct.
>> 
>>> 2. How do I do non-blocking collective I/O? Is there a function like
>>> 'ncmpi_iget_vara_all'?
>> 
>> You already did it.
>> 
>> We've iterated over a few nonblocking-pnetcdf approaches over the
>> years, but settled on this way:
>> - Operations are posted independently.
>> - One can collectively wait for completion with "ncmpi_wait_all", as
>> you did.
>> - If one needs to wait for completion locally due to the nature of the
>> application, one might not get the best performance, but
>> "ncmpi_wait" is still there if the app needs independent I/O
>> completion (a small sketch follows this list).
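>> 
>> For the local-completion case, reusing the request/status arrays from your
>> code, the independent counterpart is roughly (just a sketch):
>> 
>>     /* ncmpi_wait is the independent version; the file needs to be in
>>        independent data mode, i.e. after ncmpi_begin_indep_data() */
>>     ncmpi_begin_indep_data(ncid);
>>     ret = ncmpi_wait(ncid, numcalls, request, status);
>>     for (j = 0; j < numcalls; j++)
>>         if (status[j] != NC_NOERR) handle_error(status[j]);
>>     ncmpi_end_indep_data(ncid);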
>> 
>> ==rob
>> 
>> --
>> Rob Latham
>> Mathematics and Computer Science Division
>> Argonne National Lab, IL USA


