how to do nonblocking collective i/o
Wei-keng Liao
wkliao at ece.northwestern.edu
Tue Jan 29 12:54:18 CST 2013
Hi, Phil,
Thanks. ISAM can be a good use case. Could you provide more info about it?
I made a few changes in the file /trunk/src/lib/nonblocking.c
in the latest SVN. Please give it a try and let me know if you
encounter a problem. I am very interested in how it performs
for your workload.
Wei-keng
On Jan 29, 2013, at 11:41 AM, Phil Miller wrote:
> On Tue, Jan 29, 2013 at 11:38 AM, Wei-keng Liao
> <wkliao at ece.northwestern.edu> wrote:
>> Maybe, if I can find a good use case exhibiting the access patterns
>> as discussed in this thread. Any suggestions?
>
> I'll be happy to work with you on extracting the pattern exhibited by
> ISAM for this.
>
>>
>> Wei-keng
>>
>> On Jan 29, 2013, at 9:50 AM, Rob Latham wrote:
>>
>>> Wei-keng: you've done a lot of work on the non-blocking interface
>>> since the 2009 paper. I wonder if there's a publication you can get
>>> out of those modifications.
>>>
>>> ==rob
>>>
>>> On Mon, Jan 28, 2013 at 10:31:04PM -0600, Wei-keng Liao wrote:
>>>> Here are the detailed explanations.
>>>>
>>>> Prior to r1121, the implementation of the nonblocking APIs did not break
>>>> each request into a list of offset-length pairs. It simply used the
>>>> arguments start[] and count[] to define a filetype through a call to
>>>> MPI_Type_create_subarray(). The filetypes of all nonblocking requests
>>>> were then concatenated into a single one, if their file offsets were
>>>> monotonically increasing. If not, the requests were divided into groups,
>>>> each group fulfilling the MPI fileview requirement, and then each group
>>>> called an MPI collective read/write.
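For illustration, here is a minimal sketch of that pre-r1121 idea for a 2D variable; all names (NY, NX, start, count) are made up, and this is not the actual PnetCDF source:

    /* Build one subarray filetype per nonblocking request from its
       start[]/count[] arguments (2D case; all names here are illustrative). */
    MPI_Offset start[2] = {0, 4}, count[2] = {NY, 2};  /* example request */
    int gsizes[2]   = {NY, NX};                        /* global array shape */
    int subsizes[2] = {(int)count[0], (int)count[1]};
    int starts[2]   = {(int)start[0], (int)start[1]};
    MPI_Datatype filetype;
    MPI_Type_create_subarray(2, gsizes, subsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);
    /* The filetypes of all pending requests could then be stitched together
       (e.g. with MPI_Type_create_struct) only if their byte offsets are
       monotonically increasing, as an MPI fileview requires. */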
>>>>
>>>> The reason for the above design was a memory concern. Flattening each
>>>> request into a list of offset-length pairs can take up a non-negligible
>>>> amount of memory. Take examples/column_wise.c as an example, in which
>>>> each process writes a few non-contiguous columns into a global 2D array
>>>> using nonblocking APIs. The C struct for each flattened offset-length
>>>> pair takes 24 bytes, larger than the data it describes (8 bytes if the
>>>> variable is of type double). This is why I did not use this approach in
>>>> the first place.
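A rough illustration of the memory arithmetic above; the struct layout below is hypothetical, not the exact one in nonblocking.c:

    /* Hypothetical flattened offset-length entry, 24 bytes on a typical
       64-bit platform; the real layout in nonblocking.c may differ. */
    typedef struct {
        MPI_Offset off;   /* 8 bytes: file offset of the segment  */
        MPI_Offset len;   /* 8 bytes: length of the segment       */
        void      *buf;   /* 8 bytes: user buffer for the segment */
    } flat_pair;
    /* In the column-wise case each 8-byte double lands in its own segment,
       so the metadata alone is about 3x the size of the data it moves. */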
>>>>
>>>> The fix in r1121 eventually uses this approach. So, be warned!
>>>>
>>>> If the memory space is of no concern, this new fix can significantly
>>>> improve the performance. I evaluated column_wise.c with a slightly
>>>> larger array size and saw a decent improvement. You can give it a try.
>>>>
>>>> Maybe in the future, we will come up with a smart approach to
>>>> dynamically decide when to fall back to the previous approach, say when
>>>> the additional memory space required is beyond a threshold.
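One way such a heuristic could look, purely as illustration; no such threshold exists in the library today, and the names (including flat_pair from the sketch above) are invented:

    /* Flatten only when the metadata cost stays under an arbitrary cap;
       otherwise fall back to the subarray-filetype path. */
    #define FLATTEN_MEM_LIMIT (16 * 1024 * 1024)            /* 16 MiB, arbitrary */
    size_t flatten_bytes = num_pairs * sizeof(flat_pair);   /* hypothetical */
    int use_flattened_path = (flatten_bytes <= FLATTEN_MEM_LIMIT);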
>>>>
>>>> Wei-keng
>>>>
>>>> On Jan 28, 2013, at 9:45 PM, Liu, Jaln wrote:
>>>>
>>>>> Hi Dr. Liao,
>>>>>
>>>>> So glad to know that it has been solved.
>>>>>
>>>>> I was confused about why sorting the offset-length pairs could not
>>>>> generate a fileview that abides by the requirements; I didn't know that
>>>>> the nonblocking requests were divided into groups. Can you please tell
>>>>> me why the nonblocking requests were initially designed to be divided
>>>>> into groups? Was it for scalability or some other reason? If for
>>>>> scalability, how about now?
>>>>>
>>>>>> Please give it a try and let me know if you see a problem.
>>>>>
>>>>> Sure, I'm testing it.
>>>>>
>>>>> Jialin
>>>>> ________________________________________
>>>>> Best Regards,
>>>>> Jialin Liu, Ph.D student.
>>>>> Computer Science Department
>>>>> Texas Tech University
>>>>> Phone: 806.742.3513(x241)
>>>>> Office:Engineer Center 304
>>>>> http://myweb.ttu.edu/jialliu/
>>>>> ________________________________________
>>>>> From: parallel-netcdf-bounces at lists.mcs.anl.gov
>>>>> [parallel-netcdf-bounces at lists.mcs.anl.gov] on behalf of Wei-keng Liao
>>>>> [wkliao at ece.northwestern.edu]
>>>>> Sent: Monday, January 28, 2013 6:13 PM
>>>>> To: parallel-netcdf at lists.mcs.anl.gov
>>>>> Subject: Re: how to do nonblocking collective i/o
>>>>>
>>>>> Hi, Jialin, please see my in-line response below.
>>>>>
>>>>> On Jan 28, 2013, at 4:05 PM, Liu, Jaln wrote:
>>>>>
>>>>>> Hi Rob,
>>>>>>
>>>>>> Thanks for your answer.
>>>>>>
>>>>>>> You're close. I bet by the time I finish writing this email Wei-keng
>>>>>>> will already respond.
>>>>>>
>>>>>> You remind me of a previous thread on the pnetcdf mailing list:
>>>>>> 'Performance tuning problem with iput_vara_double/wait_all',
>>>>>>
>>>>>> Dr. Liao once mentioned "But concatenating two filetypes will end up
>>>>>> with a filetype violating the requirement of monotonic non-decreasing file
>>>>>> offsets",
>>>>>>
>>>>>> So I guess that even though my code is correctly trying to do
>>>>>> non-blocking collective I/O, it will still end up as multiple separate
>>>>>> collective I/O calls, right?
>>>>>> Is there any way we can know this before a performance test?
>>>>>
>>>>> This problem has been resolved in SVN r1121 committed on Saturday.
>>>>> Please give it a try and let me know if you see a problem.
>>>>>
>>>>>
>>>>>> I have another related question. According to the paper "Combining
>>>>>> I/O operations for multiple array variables in parallel netCDF", the
>>>>>> non-blocking collective I/O is designed for accessing multiple
>>>>>> variables. But I assume it is also useful for optimizing access to
>>>>>> multiple subsets of one variable, just like what I'm trying to do in
>>>>>> the code, right?
>>>>>
>>>>> PnetCDF nonblocking APIs can be used to aggregate requests within a
>>>>> variable and across variables (also, mixed record and non-record
>>>>> variables). There is an example program newly added in
>>>>> trunk/examples/column_wise.c that calls multiple nonblocking writes to
>>>>> a single 2D variable, where each request writes one column of the 2D
>>>>> array.
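A condensed sketch of that pattern, loosely based on the description above; the array sizes, names, and cyclic column assignment are assumptions, not the exact contents of examples/column_wise.c:

    /* Each process posts one nonblocking write per column it owns,
       then flushes them all with a single collective wait. */
    #define NY 10
    #define NX 16
    float     *col[NX];
    int        req[NX], st[NX], nreqs = 0, err;
    MPI_Offset start[2], count[2];

    count[0] = NY;  count[1] = 1;            /* one whole column per request */
    for (int i = rank; i < NX; i += nprocs) {
        col[nreqs] = malloc(NY * sizeof(float));
        /* ... fill col[nreqs] with this column's data ... */
        start[0] = 0;  start[1] = i;         /* column i of the global array */
        err = ncmpi_iput_vara_float(ncid, varid, start, count,
                                    col[nreqs], &req[nreqs]);
        if (err != NC_NOERR) handle_error(err);
        nreqs++;
    }
    err = ncmpi_wait_all(ncid, nreqs, req, st);   /* one collective flush */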
>>>>>
>>>>>
>>>>> Wei-keng
>>>>>
>>>>>
>>>>>
>>>>>> Jialin
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Here is the code I wrote:
>>>>>>>
>>>>>>> float **nb_temp_in = malloc(numcalls * sizeof(float *));
>>>>>>> int *request = calloc(numcalls, sizeof(int));
>>>>>>> int *status  = calloc(numcalls, sizeof(int));
>>>>>>> int varasize;
>>>>>>> for (j = 0; j < numcalls; j++) {
>>>>>>>     mpi_count[1] = (j > NLVL) ? NLVL : j + 1;
>>>>>>>     varasize = mpi_count[0] * mpi_count[1] * NLAT * NLON;
>>>>>>>     nb_temp_in[j] = calloc(varasize, sizeof(float));
>>>>>>>     ret = ncmpi_iget_vara(ncid, temp_varid,
>>>>>>>                           mpi_start, mpi_count, nb_temp_in[j],
>>>>>>>                           varasize, MPI_FLOAT, &request[j]);
>>>>>>>     if (ret != NC_NOERR) handle_error(ret);
>>>>>>> }
>>>>>>>
>>>>>>> ret = ncmpi_wait_all(ncid, numcalls, request, status);
>>>>>>> for (j = 0; j < numcalls; j++)
>>>>>>>     if (status[j] != NC_NOERR) handle_error(status[j]);
>>>>>>>
>>>>>>> I have two questions:
>>>>>>> 1. In the above code, what is the right way to parallelize the
>>>>>>> program? By decomposing the for loop "for (j = 0; j < numcalls; j++)"?
>>>>>>
>>>>>> No "right" way, really. Depends on what the reader needs.
>>>>>> Decomposing
>>>>>> over numcalls is definitely one way. Or you can decompose over
>>>>>> 'mpi_start' and 'mpi_count' -- though I personally have to wrestle
>>>>>> with block decomposition for a while before it's correct.
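As a rough example of that second option, the following block-decomposes one dimension of the read across ranks via 'mpi_start'/'mpi_count'; which dimension to split (index 2 here, taken to be NLAT) is an assumption:

    /* Block-decompose the NLAT dimension (assumed to be index 2) over nprocs ranks. */
    MPI_Offset total = NLAT;
    MPI_Offset chunk = total / nprocs;
    MPI_Offset rem   = total % nprocs;
    mpi_start[2] = rank * chunk + (rank < rem ? rank : rem);
    mpi_count[2] = chunk + (rank < rem ? 1 : 0);
    /* mpi_start/mpi_count for the other dimensions stay as in the loop above. */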
>>>>>>
>>>>>>> 2. How to do non-blocking collective I/O? Is there a function like
>>>>>>> 'ncmpi_iget_vara_all'?
>>>>>>
>>>>>> You already did it.
>>>>>>
>>>>>> We've iterated over a few nonblocking-pnetcdf approaches over the
>>>>>> years, but settled on this way:
>>>>>> - Operations are posted independently.
>>>>>> - One can collectively wait for completion with "ncmpi_wait_all", as
>>>>>> you did.
>>>>>> - If one needs to wait for completion locally due to the nature of the
>>>>>> application, one might not get the best performance, but "ncmpi_wait"
>>>>>> is still there if the app needs independent I/O completion.
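For completeness, a small sketch contrasting the two completion calls mentioned in the list above; request setup and error handling are as in the earlier snippet:

    /* Collective completion: all processes that opened the file call this together. */
    ret = ncmpi_wait_all(ncid, numcalls, request, status);

    /* Independent completion: a process flushes its own pending requests
       without coordinating with the others, usually at some performance cost. */
    ret = ncmpi_wait(ncid, numcalls, request, status);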
>>>>>>
>>>>>> ==rob
>>>>>>
>>>>>> --
>>>>>> Rob Latham
>>>>>> Mathematics and Computer Science Division
>>>>>> Argonne National Lab, IL USA
>>>>
>>>
>>> --
>>> Rob Latham
>>> Mathematics and Computer Science Division
>>> Argonne National Lab, IL USA
>>