how to do nonblocking collective i/o

Wei-keng Liao wkliao at ece.northwestern.edu
Thu Jan 31 10:59:11 CST 2013


Hi, Jialin,

The I/O aggregator is the term used in ROMIO, not PnetCDF.
On most file systems, the number of aggregators is usually one per compute node.
On Lustre, it depends on the file striping count (i.e., the number of OSTs used).
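
For instance, on Lustre you can influence both through MPI-IO hints passed in the
MPI_Info argument when opening or creating the file (a minimal sketch; the hint
values and file name below are placeholders, and the striping hints only take
effect at file-creation time):

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_nodes", "8");            /* number of I/O aggregators  */
    MPI_Info_set(info, "striping_factor", "8");     /* Lustre stripe count (OSTs) */
    MPI_Info_set(info, "striping_unit", "1048576"); /* Lustre stripe size (bytes) */

    int ncid, err;
    err = ncmpi_open(MPI_COMM_WORLD, "testfile.nc", NC_NOWRITE, info, &ncid);
    if (err != NC_NOERR) handle_error(err);
    MPI_Info_free(&info);

You can also check the hints actually in effect with ncmpi_inq_file_info().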

What is the definition of your temp_varid? Is it a 4D array of type float?
Please tell us its dimensionality.

What performance variance did you observe?
I/O performance has a lot to do with your environment.
If you are using a parallel file system, what is its striping configuration?
This, together with the variable's definition, can tell a lot about the performance.

Wei-keng

On Jan 31, 2013, at 3:32 AM, Liu, Jaln wrote:

> Hi all,
> 
> If I use ncmpi_get_vara_float_all in the code, what is the default number of aggregators that will do the I/O?
> 
>       mpi_start[1] = rank * 5;   /* note: the start position depends on the rank */
>       mpi_count[1] = 10;
>       temp_in = (float *) malloc(mpi_count[0] * mpi_count[1] * NLAT * NLON * sizeof(float));
>       ret = ncmpi_get_vara_float_all(ncid, temp_varid, mpi_start,
>                                      mpi_count, temp_in);
>       if (ret != NC_NOERR) handle_error(ret);
> 
> I run the code with something like mpirun -n 50.
> 
> Is the number of aggregators equal to the number of nodes here, or to 50?
> 
> I'm confused about the performance variance I see when I change the number of processes, and I wonder if I misunderstood this type of I/O.
> 
> Best Regards,
> Jialin Liu, Ph.D student.
> Computer Science Department
> Texas Tech University
> Phone: 806.252.2832
> Office:Engineer Center 304
> http://myweb.ttu.edu/jialliu/
> ________________________________________
> From: parallel-netcdf-bounces at lists.mcs.anl.gov [parallel-netcdf-bounces at lists.mcs.anl.gov] on behalf of Rob Latham [robl at mcs.anl.gov]
> Sent: Tuesday, January 29, 2013 1:01 PM
> To: Wei-keng Liao
> Cc: parallel-netcdf at lists.mcs.anl.gov
> Subject: Re: how to do nonblocking collective i/o
> 
> On Tue, Jan 29, 2013 at 11:38:58AM -0600, Wei-keng Liao wrote:
>> Hi, Rob
>> 
>> Maybe, if I can find a good use case exhibiting the access patterns
>> as discussed in this thread. Any suggestions?
> 
> The other thing such a publication should cover is the change in the
> API since the IASDS workshop paper.
> 
> We (mostly you) have redesigned the non-blocking interface to better
> accommodate chombo-like workloads.  The earliest non-blocking versions
> posted operations collectively, but that restriction was relaxed and
> now the "wait" step has the option to be collective or not.
> 
> ==rob
> 
>> Wei-keng
>> 
>> On Jan 29, 2013, at 9:50 AM, Rob Latham wrote:
>> 
>>> Wei-keng: you've done a lot of work on the non-blocking interface
>>> since the 2009 paper.  I wonder if there's a publication you can get
>>> out of those modifications.
>>> 
>>> ==rob
>>> 
>>> On Mon, Jan 28, 2013 at 10:31:04PM -0600, Wei-keng Liao wrote:
>>>> Here is the detailed explanations.
>>>> 
>>>> Prior to r1121, the implementation of the nonblocking APIs did not break each
>>>> request into a list of offset-length pairs. It simply used the arguments
>>>> start[] and count[] to define a filetype through a call to
>>>> MPI_Type_create_subarray(). The filetypes of all nonblocking requests were
>>>> then concatenated into a single one, if their file offsets were monotonically
>>>> increasing. If not, the requests were divided into groups, each group
>>>> fulfilling the MPI fileview requirement, and then each group called an
>>>> MPI collective read/write.
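>>>> 
>>>> Roughly, the filetype for one request was built like this (a simplified
>>>> sketch, not the actual PnetCDF source; a 2D double variable of global
>>>> shape NY x NX is assumed):
>>>> 
>>>>     int sizes[2]    = {NY, NX};                       /* global variable shape */
>>>>     int subsizes[2] = {(int)count[0], (int)count[1]}; /* requested subarray    */
>>>>     int starts[2]   = {(int)start[0], (int)start[1]}; /* its starting corner   */
>>>>     MPI_Datatype filetype;
>>>>     MPI_Type_create_subarray(2, sizes, subsizes, starts,
>>>>                              MPI_ORDER_C, MPI_DOUBLE, &filetype);
>>>>     MPI_Type_commit(&filetype);
>>>>     /* the filetypes of all pending requests are then concatenated into one
>>>>        fileview only if their offsets are monotonically nondecreasing */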
>>>> 
>>>> The reason for the above design was memory consumption. Flattening each
>>>> request into a list of offset-length pairs can take up a non-negligible
>>>> amount of memory. Take examples/column_wise.c as an example, in which each
>>>> process writes a few non-contiguous columns into a global 2D array using
>>>> nonblocking APIs. The C struct for each flattened offset-length pair takes
>>>> 24 bytes, bigger than the data it describes (8 bytes if the variable is of
>>>> type double). This is why I did not use this approach in the first place.
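>>>> 
>>>> As a back-of-the-envelope illustration (hypothetical numbers, just to show
>>>> the ratio):
>>>> 
>>>>     /* one column of a row-major NY x NX double array is NY separate
>>>>        8-byte elements, i.e. NY offset-length pairs after flattening */
>>>>     MPI_Offset ny         = 1000;
>>>>     MPI_Offset data_bytes = ny * 8;    /* data actually written:  8,000 B */
>>>>     MPI_Offset pair_bytes = ny * 24;   /* flattened pair list:   24,000 B */
>>>>     /* the bookkeeping is 3x larger than the data itself */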
>>>> 
>>>> The fix in r1121 eventually uses this approach. So, be warned!
>>>> 
>>>> If the memory space is of no concern, this new fix can significantly improve
>>>> the performance. I evaluated column_wise.c with a slightly larger array size
>>>> and saw a decent improvement. You can give it a try.
>>>> 
>>>> Maybe in the future, we will come up with a smart approach to dynamically
>>>> decide when to fall back to the previous approach, say when the additional
>>>> memory space required is beyond a threshold.
>>>> 
>>>> Wei-keng
>>>> 
>>>> On Jan 28, 2013, at 9:45 PM, Liu, Jaln wrote:
>>>> 
>>>>> Hi Dr. Liao,
>>>>> 
>>>>> So glad to know that it has been solved.
>>>>> 
>>>>> I was confused about why sorting the offset-length pairs could not generate a fileview that abides by the requirement; I didn't know that the nonblocking requests were divided into groups. Can you please tell me why the nonblocking requests were initially designed to be divided into groups? Was it for scalability or some other reason? If for scalability, is that still the case now?
>>>>> 
>>>>>> Please give it a try and let me know if you see a problem.
>>>>> 
>>>>> Sure, I'm testing it.
>>>>> 
>>>>> Jialin
>>>>> ________________________________________
>>>>> Best Regards,
>>>>> Jialin Liu, Ph.D student.
>>>>> Computer Science Department
>>>>> Texas Tech University
>>>>> Phone: 806.742.3513(x241)
>>>>> Office:Engineer Center 304
>>>>> http://myweb.ttu.edu/jialliu/
>>>>> ________________________________________
>>>>> From: parallel-netcdf-bounces at lists.mcs.anl.gov [parallel-netcdf-bounces at lists.mcs.anl.gov] on behalf of Wei-keng Liao [wkliao at ece.northwestern.edu]
>>>>> Sent: Monday, January 28, 2013 6:13 PM
>>>>> To: parallel-netcdf at lists.mcs.anl.gov
>>>>> Subject: Re: how to do nonblocking collective i/o
>>>>> 
>>>>> Hi, Jialin, please see my in-line response below.
>>>>> 
>>>>> On Jan 28, 2013, at 4:05 PM, Liu, Jaln wrote:
>>>>> 
>>>>>> Hi Rob,
>>>>>> 
>>>>>> Thanks for your answer.
>>>>>> 
>>>>>>> You're close. I bet by the time I finish writing this email Wei-keng
>>>>>>> will have already responded.
>>>>>> 
>>>>>> That reminds me of a previous thread on the pnetcdf mailing list:
>>>>>> 'Performance tuning problem with iput_vara_double/wait_all'.
>>>>>> 
>>>>>> Dr. Liao once mentioned "But concatenating two filetypes will end up with a filetype violating the requirement of monotonic non-decreasing file offsets",
>>>>>> 
>>>>>> So I guess that even if my code correctly posts non-blocking collective I/O, the requests may still end up as separate collective I/O operations, right?
>>>>>> Is there any way we can know this before running a performance test?
>>>>> 
>>>>> This problem has been resolved in SVN r1121 committed on Saturday.
>>>>> Please give it a try and let me know if you see a problem.
>>>>> 
>>>>> 
>>>>>> I have another related question,
>>>>>> According to the paper "Combining I/O Operations for Multiple Array Variables in Parallel netCDF", the non-blocking collective I/O is designed for accessing multiple variables. But I assume it is also useful for optimizing access to multiple subsets of one variable, just like what I'm trying to do in the code, right?
>>>>> 
>>>>> PnetCDF nonblocking APIs can be used to aggregate requests within a variable and
>>>>> across variables (also, mixed record and non-record variables). There is a newly
>>>>> added example program, trunk/examples/column_wise.c, that makes multiple
>>>>> nonblocking write calls to a single 2D variable, each request writing one column of the 2D array.
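>>>>> 
>>>>> The pattern there is roughly the following (a trimmed sketch of the idea,
>>>>> not the exact example code; ncid, varid, NY, NX, and a column-ordered
>>>>> buffer buf are assumed):
>>>>> 
>>>>>     int j, err;
>>>>>     int *reqs = (int *) malloc(NX * sizeof(int));
>>>>>     int *sts  = (int *) malloc(NX * sizeof(int));
>>>>>     MPI_Offset start[2], count[2];
>>>>>     count[0] = NY;  count[1] = 1;        /* one whole column per request */
>>>>>     for (j = 0; j < NX; j++) {
>>>>>         start[0] = 0;  start[1] = j;
>>>>>         err = ncmpi_iput_vara_double(ncid, varid, start, count,
>>>>>                                      &buf[j * NY], &reqs[j]);
>>>>>         if (err != NC_NOERR) handle_error(err);
>>>>>     }
>>>>>     err = ncmpi_wait_all(ncid, NX, reqs, sts);  /* flush everything at once */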
>>>>> 
>>>>> 
>>>>> Wei-keng
>>>>> 
>>>>> 
>>>>> 
>>>>>> Jialin
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> Here is the code I wrote:
>>>>>>> 
>>>>>>>     float **nb_temp_in = malloc(numcalls * sizeof(float *));
>>>>>>>     int *request = calloc(numcalls, sizeof(int));
>>>>>>>     int *status = calloc(numcalls, sizeof(int));
>>>>>>>     int varasize;
>>>>>>>     for (j = 0; j < numcalls; j++)
>>>>>>>     {
>>>>>>>       mpi_count[1] = (j >= NLVL) ? NLVL : j + 1;  /* cap the level count at NLVL */
>>>>>>>       varasize = mpi_count[0] * mpi_count[1] * NLAT * NLON;
>>>>>>>       nb_temp_in[j] = calloc(varasize, sizeof(float));
>>>>>>>       ret = ncmpi_iget_vara(ncid, temp_varid,
>>>>>>>                             mpi_start, mpi_count, nb_temp_in[j],
>>>>>>>                             varasize, MPI_FLOAT, &request[j]);
>>>>>>>       if (ret != NC_NOERR) handle_error(ret);
>>>>>>>     }
>>>>>>> 
>>>>>>>     ret = ncmpi_wait_all(ncid, numcalls, request, status);
>>>>>>>     for (j = 0; j < numcalls; j++)
>>>>>>>       if (status[j] != NC_NOERR) handle_error(status[j]);
>>>>>>>   }
>>>>>>> 
>>>>>>> I have two questions,
>>>>>>> 1, in the above code, what is right way to parallelize the program?
>>>>>>> by decomposing the for loop " for(j=0;j<numcalls;j++)"?
>>>>>> 
>>>>>> No "right" way, really. Depends on what the reader needs.  Decomposing
>>>>>> over numcalls is definitely one way.  Or you can decompose over
>>>>>> 'mpi_start' and 'mpi_count' -- though I personally have to wrestle
>>>>>> with block decomposition for a while before it's correct.
>>>>>> 
>>>>>>> 2, how to do non-blocking collective I/O? is there a function like
>>>>>>> 'ncmpi_iget_vara_all'?
>>>>>> 
>>>>>> you already did it.
>>>>>> 
>>>>>> We've iterated over a few nonblocking PnetCDF approaches over the
>>>>>> years, but settled on this way (a short sketch follows the list):
>>>>>> - operations are posted independently.
>>>>>> - One can collectively wait for completion with "ncmpi_wait_all", as
>>>>>> you did.
>>>>>> - If one needs to wait for completion locally due to the nature of the
>>>>>> application, one might not get the best performance, but
>>>>>> "ncmpi_wait" is still there if the app needs independent I/O
>>>>>> completion.
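>>>>>> 
>>>>>> In code, the two completion paths look like this (a sketch; 'request',
>>>>>> 'status', and 'numcalls' are assumed to be set up as in your snippet):
>>>>>> 
>>>>>>     /* collective completion: all processes call it together */
>>>>>>     ret = ncmpi_wait_all(ncid, numcalls, request, status);
>>>>>> 
>>>>>>     /* independent completion: switch to independent data mode first */
>>>>>>     ret = ncmpi_begin_indep_data(ncid);
>>>>>>     ret = ncmpi_wait(ncid, numcalls, request, status);
>>>>>>     ret = ncmpi_end_indep_data(ncid);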
>>>>>> 
>>>>>> ==rob
>>>>>> 
>>>>>> --
>>>>>> Rob Latham
>>>>>> Mathematics and Computer Science Division
>>>>>> Argonne National Lab, IL USA
>>>> 
>>> 
>>> --
>>> Rob Latham
>>> Mathematics and Computer Science Division
>>> Argonne National Lab, IL USA
>> 
> 
> --
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA


