how to do nonblocking collective i/o
Wei-keng Liao
wkliao at ece.northwestern.edu
Tue Jan 29 12:54:18 CST 2013
Hi, Phil,
Thanks. ISAM can be a good use case. Could you provide more info about it?
I made a few changes in the file /trunk/src/lib/nonblocking.c
in the latest SVN. Please give it a try and let me know if you
encounter a problem. I am very interested in how it performs
for your workload.
Wei-keng
On Jan 29, 2013, at 11:41 AM, Phil Miller wrote:
> On Tue, Jan 29, 2013 at 11:38 AM, Wei-keng Liao
> <wkliao at ece.northwestern.edu> wrote:
>> Maybe, if I can find a good use case exhibiting the access patterns
>> as discussed in this thread. Any suggestions?
>
> I'll be happy to work with you on extracting the pattern exhibited by
> ISAM for this.
>
>>
>> Wei-keng
>>
>> On Jan 29, 2013, at 9:50 AM, Rob Latham wrote:
>>
>>> Wei-keng: you've done a lot of work on the non-blocking interface
>>> since the 2009 paper. I wonder if there's a publication you can get
>>> out of those modifications.
>>>
>>> ==rob
>>>
>>> On Mon, Jan 28, 2013 at 10:31:04PM -0600, Wei-keng Liao wrote:
>>>> Here are the detailed explanations.
>>>>
>>>> Prior to r1121, the implementation of the nonblocking APIs did not break
>>>> each request into a list of offset-length pairs. It simply used the
>>>> arguments start[] and count[] to define a filetype through a call to
>>>> MPI_Type_create_subarray(). The filetypes of all nonblocking requests
>>>> were then concatenated into a single one, if their file offsets were
>>>> monotonically increasing. If not, the requests were divided into groups,
>>>> each group fulfilling the MPI fileview requirement, and then each group
>>>> called an MPI collective read/write.
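For illustration, here is a minimal sketch of that pre-r1121 idea for a 2D variable; all names (NY, NX, start, count) are made up, and this is not the actual PnetCDF source:

    /* Build one subarray filetype per nonblocking request from its
       start[]/count[] arguments (2D case; all names here are illustrative). */
    MPI_Offset start[2] = {0, 4}, count[2] = {NY, 2};  /* example request */
    int gsizes[2]   = {NY, NX};                        /* global array shape */
    int subsizes[2] = {(int)count[0], (int)count[1]};
    int starts[2]   = {(int)start[0], (int)start[1]};
    MPI_Datatype filetype;
    MPI_Type_create_subarray(2, gsizes, subsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);
    /* The filetypes of all pending requests could then be stitched together
       (e.g. with MPI_Type_create_struct) only if their byte offsets are
       monotonically increasing, as an MPI fileview requires. */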
>>>>
>>>> The reason for the above design was a memory concern. Flattening each
>>>> request into a list of offset-length pairs can take up a non-negligible
>>>> amount of memory. Take examples/column_wise.c as an example, in which
>>>> each process writes a few non-contiguous columns into a global 2D array
>>>> using nonblocking APIs. The C struct for each flattened offset-length
>>>> pair takes 24 bytes, larger than the data it describes (8 bytes if the
>>>> variable is of type double). This is why I did not use this approach in
>>>> the first place.
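A rough illustration of the memory arithmetic above; the struct layout below is hypothetical, not the exact one in nonblocking.c:

    /* Hypothetical flattened offset-length entry, 24 bytes on a typical
       64-bit platform; the real layout in nonblocking.c may differ. */
    typedef struct {
        MPI_Offset off;   /* 8 bytes: file offset of the segment  */
        MPI_Offset len;   /* 8 bytes: length of the segment       */
        void      *buf;   /* 8 bytes: user buffer for the segment */
    } flat_pair;
    /* In the column-wise case each 8-byte double lands in its own segment,
       so the metadata alone is about 3x the size of the data it moves. */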
>>>>
>>>> The fix in r1121 eventually uses this approach. So, be warned!
>>>>
>>>> If the memory space is of no concern, this new fix can significantly
>>>> improve the performance. I evaluated column_wise.c with a slightly
>>>> larger array size and saw a decent improvement. You can give it a try.
>>>>
>>>> Maybe in the future, we will come up with a smart approach to
>>>> dynamically decide when to fall back to the previous approach, say when
>>>> the additional memory space required is beyond a threshold.
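One way such a heuristic could look, purely as illustration; no such threshold exists in the library today, and the names (including flat_pair from the sketch above) are invented:

    /* Flatten only when the metadata cost stays under an arbitrary cap;
       otherwise fall back to the subarray-filetype path. */
    #define FLATTEN_MEM_LIMIT (16 * 1024 * 1024)            /* 16 MiB, arbitrary */
    size_t flatten_bytes = num_pairs * sizeof(flat_pair);   /* hypothetical */
    int use_flattened_path = (flatten_bytes <= FLATTEN_MEM_LIMIT);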
>>>>
>>>> Wei-keng
>>>>
>>>> On Jan 28, 2013, at 9:45 PM, Liu, Jaln wrote:
>>>>
>>>>> Hi Dr. Liao,
>>>>>
>>>>> So glad to know that it has been solved.
>>>>>
>>>>> I was confused about why sorting the offset-length pairs could not
>>>>> generate a fileview that abides by the requirements; I didn't know that
>>>>> the nonblocking requests were divided into groups. Can you please tell
>>>>> me why the nonblocking requests were initially designed to be divided
>>>>> into groups? Was it for scalability or some other reason? If for
>>>>> scalability, how about now?
>>>>>
>>>>>> Please give it a try and let me know if you see a problem.
>>>>>
>>>>> Sure, I'm testing it.
>>>>>
>>>>> Jialin
>>>>> ________________________________________
>>>>> Best Regards,
>>>>> Jialin Liu, Ph.D student.
>>>>> Computer Science Department
>>>>> Texas Tech University
>>>>> Phone: 806.742.3513(x241)
>>>>> Office:Engineer Center 304
>>>>> http://myweb.ttu.edu/jialliu/
>>>>> ________________________________________
>>>>> From: parallel-netcdf-bounces at lists.mcs.anl.gov
>>>>> [parallel-netcdf-bounces at lists.mcs.anl.gov] on behalf of Wei-keng Liao
>>>>> [wkliao at ece.northwestern.edu]
>>>>> Sent: Monday, January 28, 2013 6:13 PM
>>>>> To: parallel-netcdf at lists.mcs.anl.gov
>>>>> Subject: Re: how to do nonblocking collective i/o
>>>>>
>>>>> Hi, Jialin, please see my in-line response below.
>>>>>
>>>>> On Jan 28, 2013, at 4:05 PM, Liu, Jaln wrote:
>>>>>
>>>>>> Hi Rob,
>>>>>>
>>>>>> Thanks for your answer.
>>>>>>
>>>>>>> You're close. I bet by the time I finish writing this email Wei-keng
>>>>>>> will already respond.
>>>>>>
>>>>>> You remind me of a previous thread on the pnetcdf mailing list:
>>>>>> 'Performance tuning problem with iput_vara_double/wait_all',
>>>>>>
>>>>>> Dr. Liao once mentioned "But concatenating two filetypes will end up
>>>>>> with a filetype violating the requirement of monotonic non-decreasing file
>>>>>> offsets",
>>>>>>
>>>>>> So I guess that even though my code is correctly trying to do
>>>>>> non-blocking collective I/O, it will still end up as multiple separate
>>>>>> collective I/O calls, right?
>>>>>> Is there any way we can know this before a performance test?
>>>>>
>>>>> This problem has been resolved in SVN r1121 committed on Saturday.
>>>>> Please give it a try and let me know if you see a problem.
>>>>>
>>>>>
>>>>>> I have another related question. According to the paper "Combining
>>>>>> I/O operations for multiple array variables in parallel netCDF", the
>>>>>> non-blocking collective I/O is designed for accessing multiple
>>>>>> variables. But I assume it is also useful for optimizing access to
>>>>>> multiple subsets of one variable, just like what I'm trying to do in
>>>>>> the code, right?
>>>>>
>>>>> PnetCDF nonblocking APIs can be used to aggregate requests within a
>>>>> variable and across variables (also, mixed record and non-record
>>>>> variables). There is an example program newly added in
>>>>> trunk/examples/column_wise.c that calls multiple nonblocking writes to
>>>>> a single 2D variable, where each request writes one column of the 2D
>>>>> array.
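A condensed sketch of that pattern, loosely based on the description above; the array sizes, names, and cyclic column assignment are assumptions, not the exact contents of examples/column_wise.c:

    /* Each process posts one nonblocking write per column it owns,
       then flushes them all with a single collective wait. */
    #define NY 10
    #define NX 16
    float     *col[NX];
    int        req[NX], st[NX], nreqs = 0, err;
    MPI_Offset start[2], count[2];

    count[0] = NY;  count[1] = 1;            /* one whole column per request */
    for (int i = rank; i < NX; i += nprocs) {
        col[nreqs] = malloc(NY * sizeof(float));
        /* ... fill col[nreqs] with this column's data ... */
        start[0] = 0;  start[1] = i;         /* column i of the global array */
        err = ncmpi_iput_vara_float(ncid, varid, start, count,
                                    col[nreqs], &req[nreqs]);
        if (err != NC_NOERR) handle_error(err);
        nreqs++;
    }
    err = ncmpi_wait_all(ncid, nreqs, req, st);   /* one collective flush */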
>>>>>
>>>>>
>>>>> Wei-keng
>>>>>
>>>>>
>>>>>
>>>>>> Jialin
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Here is the code I wrote:
>>>>>>>
>>>>>>> float **nb_temp_in = malloc(numcalls * sizeof(float *));
>>>>>>> int *request = calloc(numcalls, sizeof(int));
>>>>>>> int *status  = calloc(numcalls, sizeof(int));
>>>>>>> int varasize;
>>>>>>> for (j = 0; j < numcalls; j++) {
>>>>>>>     mpi_count[1] = (j > NLVL) ? NLVL : j + 1;
>>>>>>>     varasize = mpi_count[0] * mpi_count[1] * NLAT * NLON;
>>>>>>>     nb_temp_in[j] = calloc(varasize, sizeof(float));
>>>>>>>     ret = ncmpi_iget_vara(ncid, temp_varid,
>>>>>>>                           mpi_start, mpi_count, nb_temp_in[j],
>>>>>>>                           varasize, MPI_FLOAT, &request[j]);
>>>>>>>     if (ret != NC_NOERR) handle_error(ret);
>>>>>>> }
>>>>>>>
>>>>>>> ret = ncmpi_wait_all(ncid, numcalls, request, status);
>>>>>>> for (j = 0; j < numcalls; j++)
>>>>>>>     if (status[j] != NC_NOERR) handle_error(status[j]);
>>>>>>>
>>>>>>> I have two questions:
>>>>>>> 1. In the above code, what is the right way to parallelize the
>>>>>>> program? By decomposing the for loop "for (j = 0; j < numcalls; j++)"?
>>>>>>
>>>>>> No "right" way, really. Depends on what the reader needs.
>>>>>> Decomposing
>>>>>> over numcalls is definitely one way. Or you can decompose over
>>>>>> 'mpi_start' and 'mpi_count' -- though I personally have to wrestle
>>>>>> with block decomposition for a while before it's correct.
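As a rough example of that second option, the following block-decomposes one dimension of the read across ranks via 'mpi_start'/'mpi_count'; which dimension to split (index 2 here, taken to be NLAT) is an assumption:

    /* Block-decompose the NLAT dimension (assumed to be index 2) over nprocs ranks. */
    MPI_Offset total = NLAT;
    MPI_Offset chunk = total / nprocs;
    MPI_Offset rem   = total % nprocs;
    mpi_start[2] = rank * chunk + (rank < rem ? rank : rem);
    mpi_count[2] = chunk + (rank < rem ? 1 : 0);
    /* mpi_start/mpi_count for the other dimensions stay as in the loop above. */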
>>>>>>
>>>>>>> 2. How to do non-blocking collective I/O? Is there a function like
>>>>>>> 'ncmpi_iget_vara_all'?
>>>>>>
>>>>>> You already did it.
>>>>>>
>>>>>> We've iterated over a few nonblocking-pnetcdf approaches over the
>>>>>> years, but settled on this way:
>>>>>> - Operations are posted independently.
>>>>>> - One can collectively wait for completion with "ncmpi_wait_all", as
>>>>>> you did.
>>>>>> - If one needs to wait for completion locally due to the nature of the
>>>>>> application, one might not get the best performance, but "ncmpi_wait"
>>>>>> is still there if the app needs independent I/O completion.
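For completeness, a small sketch contrasting the two completion calls mentioned in the list above; request setup and error handling are as in the earlier snippet:

    /* Collective completion: all processes that opened the file call this together. */
    ret = ncmpi_wait_all(ncid, numcalls, request, status);

    /* Independent completion: a process flushes its own pending requests
       without coordinating with the others, usually at some performance cost. */
    ret = ncmpi_wait(ncid, numcalls, request, status);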
>>>>>>
>>>>>> ==rob
>>>>>>
>>>>>> --
>>>>>> Rob Latham
>>>>>> Mathematics and Computer Science Division
>>>>>> Argonne National Lab, IL USA
>>>>
>>>
>>> --
>>> Rob Latham
>>> Mathematics and Computer Science Division
>>> Argonne National Lab, IL USA
>>