how to do nonblocking collective i/o
Liu, Jaln
jaln.liu at ttu.edu
Thu Jan 31 15:14:15 CST 2013
Hi Dr. Liao,
>Small jobs tend to give
>inconsistent results.
I see. I need to vary the size of the requests.
Jialin
________________________________________
From: parallel-netcdf-bounces at lists.mcs.anl.gov [parallel-netcdf-bounces at lists.mcs.anl.gov] on behalf of Wei-keng Liao [wkliao at ece.northwestern.edu]
Sent: Thursday, January 31, 2013 12:51 PM
To: parallel-netcdf at lists.mcs.anl.gov
Subject: Re: how to do nonblocking collective i/o
Jialin,
In PnetCDF, you usually don't have to set the number of I/O aggregators
unless you want to fine-tune the MPI-IO performance. PnetCDF just
faithfully passes the hint down to MPI-IO.
What is the dimension order of the 4D array, and what is the length of each dimension?
Your striping size of 100 MB is too big. I suggest you change it to 1 MB.
You can check what striping configuration is used by your job by
printing the following MPI-IO info keys (see the sketch below):
striping_factor
striping_unit
cb_nodes (the number of I/O aggregators)
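
For instance, here is a minimal sketch (not from your program; "testfile.nc"
is just a placeholder path) that opens the same file directly with MPI-IO
and prints the hints actually in effect:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Info info;
    char value[MPI_MAX_INFO_VAL+1];
    int flag, rank, i;
    const char *keys[3] = {"striping_factor", "striping_unit", "cb_nodes"};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File_open(MPI_COMM_WORLD, "testfile.nc", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_get_info(fh, &info);       /* hints MPI-IO uses for this file */

    if (rank == 0) {
        for (i = 0; i < 3; i++) {
            MPI_Info_get(info, keys[i], MPI_MAX_INFO_VAL, value, &flag);
            printf("%s = %s\n", keys[i], flag ? value : "(not set)");
        }
    }

    MPI_Info_free(&info);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}

Compile it with mpicc and run it with the same number of processes as your
PnetCDF job to see the values in effect when the file is opened.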
The ROMIO Lustre driver will set the number of I/O aggregators (cb_nodes)
to the minimum of striping_factor (the number of OSTs) and the number of
MPI processes in your job. In your case that is 10, 20, and 30.
Since 10 and 20 divide 40 evenly, the mapping between processes and OSTs
is a perfect match: each aggregator always accesses the same four (or two)
OSTs and no others, so no OST is shared between processes. The 10- and
20-process jobs should therefore perform well, much better than the
30-process job. In addition, the 20-process case should be better than
the 10-process one, as it has less lock contention.
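
To make the numbers above concrete, here is a tiny illustrative program
(this is not the ROMIO code, just the min(striping_factor, nprocs) rule
described above applied to your three job sizes):

#include <stdio.h>

int main(void)
{
    int striping_factor = 40;      /* number of OSTs the file is striped over */
    int jobs[3] = {10, 20, 30};    /* number of MPI processes per job         */
    int i;

    for (i = 0; i < 3; i++) {
        int nprocs   = jobs[i];
        int cb_nodes = (striping_factor < nprocs) ? striping_factor : nprocs;
        printf("nprocs=%2d  cb_nodes=%2d  OSTs map evenly: %s\n",
               nprocs, cb_nodes,
               (striping_factor % cb_nodes == 0) ? "yes" : "no");
    }
    return 0;
}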
In any case, people are more interested in I/O performance for larger jobs,
given that 40 OSTs are available on your machine. Small jobs tend to give
inconsistent results.
Wei-keng
On Jan 31, 2013, at 11:54 AM, Liu, Jaln wrote:
> Thanks, Dr. Liao,
>
>> "I/O aggregator" is a term used in ROMIO, not PnetCDF.
>> On most file systems the number of aggregators is usually the number of compute nodes.
>> On Lustre it depends on the file striping count (the number of OSTs).
>
> But in PnetCDF, do I need to consider the number of aggregators?
> I used Lustre and striped the data across 40 OSTs; the stripe size is 104857600 bytes (100 MB).
>
> I didn't quite get your meaning. In this case I ran on 5 compute nodes, each with 12 cores;
> is the number of I/O processes (aggregators) 5?
>
>> What is the definition of your temp_varid? Is it a 4D array of type float?
>> Please tell us its dimensionality.
>
> Yes, it is a 4D float variable that I generated myself.
>
>> What performance variance did you observe?
>
> For example, running with 20 processes performs better than with 10 or 30.
>
> In other words, the test with a certain number of processes does not fit the trend of all the other groups of tests. Each test was run multiple times.
>
>> The I/O performance has a lot to do with your environment.
>> What is the file striping configuration, if you are using a parallel
>> file system? This and the variable's definition may tell a lot about the performance.
>
> Wei-keng
>
> On Jan 31, 2013, at 3:32 AM, Liu, Jaln wrote:
>
>> Hi all,
>>
>> If I use ncmpi_get_vara_float_all in the code, what is the default number of aggregators that will do the I/O?
>>
>> mpi_start[1] = rank * 5;  /* each rank starts at a different offset along dimension 1 */
>> mpi_count[1] = 10;
>> temp_in = (float *) malloc(mpi_count[0] * mpi_count[1] * NLAT * NLON * sizeof(float));
>> ret = ncmpi_get_vara_float_all(ncid, temp_varid, mpi_start, mpi_count, temp_in);
>> if (ret != NC_NOERR) handle_error(ret);
>>
>> I run the code with mpirun -n 50.
>>
>> Is the number of aggregators equal to the number of nodes, or is it 50 here?
>>
>> I'm confused about the performance variance when I change the number of processes; I wonder if I misunderstood this type of I/O.
>>
>> Best Regards,
>> Jialin Liu, Ph.D student.
>> Computer Science Department
>> Texas Tech University
>> Phone: 806.252.2832
>> Office:Engineer Center 304
>> http://myweb.ttu.edu/jialliu/
>> ________________________________________
>> From: parallel-netcdf-bounces at lists.mcs.anl.gov [parallel-netcdf-bounces at lists.mcs.anl.gov] on behalf of Rob Latham [robl at mcs.anl.gov]
>> Sent: Tuesday, January 29, 2013 1:01 PM
>> To: Wei-keng Liao
>> Cc: parallel-netcdf at lists.mcs.anl.gov
>> Subject: Re: how to do nonblocking collective i/o
>>
>> On Tue, Jan 29, 2013 at 11:38:58AM -0600, Wei-keng Liao wrote:
>>> Hi, Rob
>>>
>>> Maybe, if I can find a good use case exhibiting the access patterns
>>> as discussed in this thread. Any suggestions?
>>
>> The other thing such a publication should cover is the change in the
>> API since the IASDS workshop paper.
>>
>> We (mostly you) have redesigned the non-blocking interface to better
>> accommodate chombo-like workloads. The earliest non-blocking versions
>> posted operations collectively, but that restriction was relaxed and
>> now the "wait" step has the option to be collective or not.
>>
>> ==rob
>>
>>> Wei-keng
>>>
>>> On Jan 29, 2013, at 9:50 AM, Rob Latham wrote:
>>>
>>>> Wei-keng: you've done a lot of work on the non-blocking interface
>>>> since the 2009 paper. I wonder if there's a publication you can get
>>>> out of those modifications.
>>>>
>>>> ==rob
>>>>
>>>> On Mon, Jan 28, 2013 at 10:31:04PM -0600, Wei-keng Liao wrote:
>>>>> Here are the detailed explanations.
>>>>>
>>>>> Prior to r1121, the implementation of the nonblocking APIs did not break each
>>>>> request into a list of offset-length pairs. It simply used the arguments
>>>>> start[] and count[] to define a filetype through a call to
>>>>> MPI_Type_create_subarray(). The filetypes of all nonblocking requests were
>>>>> then concatenated into a single one, if their file offsets were monotonically
>>>>> increasing. If not, the requests were divided into groups, each group
>>>>> fulfilling the MPI fileview requirement, and each group then made its own
>>>>> MPI collective read/write call.
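>>>>>
>>>>> As a rough illustration of the subarray-filetype idea just described
>>>>> (a minimal sketch only, not the PnetCDF source; it assumes a 100x100
>>>>> 2D float variable and a request for a single column):
>>>>>
>>>>> #include <mpi.h>
>>>>>
>>>>> int main(int argc, char **argv)
>>>>> {
>>>>>     MPI_Init(&argc, &argv);
>>>>>
>>>>>     int gsizes[2]   = {100, 100};   /* global shape of the variable       */
>>>>>     int starts[2]   = {0, 10};      /* start[] of one nonblocking request */
>>>>>     int subsizes[2] = {100, 1};     /* count[]: a single column           */
>>>>>     MPI_Datatype filetype;
>>>>>
>>>>>     MPI_Type_create_subarray(2, gsizes, subsizes, starts,
>>>>>                              MPI_ORDER_C, MPI_FLOAT, &filetype);
>>>>>     MPI_Type_commit(&filetype);
>>>>>     /* Such a filetype becomes the fileview for a collective read/write.
>>>>>      * Concatenating several of them requires monotonically non-decreasing
>>>>>      * file offsets, which is why requests had to be split into groups. */
>>>>>     MPI_Type_free(&filetype);
>>>>>
>>>>>     MPI_Finalize();
>>>>>     return 0;
>>>>> }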
>>>>>
>>>>> The reason for the above design was a memory concern. Flattening each
>>>>> request into a list of offset-length pairs can take up a non-negligible
>>>>> amount of memory. Take examples/column_wise.c as an example, in which each
>>>>> process writes a few non-contiguous columns into a global 2D array using
>>>>> nonblocking APIs. The C struct for each flattened offset-length pair takes
>>>>> 24 bytes, bigger than the data it describes (say, 8 bytes if the variable is
>>>>> of type double). This is why I did not use this approach in the first place.
>>>>>
>>>>> The fix in r1121 eventually uses this approach. So, be warned!
>>>>>
>>>>> If the memory space is of no concern, this new fix can significantly improve
>>>>> the performance. I evaluated column_wise.c with a slightly larger array size
>>>>> and saw a decent improvement. You can give it a try.
>>>>>
>>>>> Maybe in the future, we will come up with a smart approach to dynamically
>>>>> decide when to fall back to the previous approach, say when the additional
>>>>> memory space required is beyond a threshold.
>>>>>
>>>>> Wei-keng
>>>>>
>>>>> On Jan 28, 2013, at 9:45 PM, Liu, Jaln wrote:
>>>>>
>>>>>> Hi Dr. Liao,
>>>>>>
>>>>>> So glad to know that it has been solved.
>>>>>>
>>>>>> I was confused about why sorting the offset-length pairs could not generate a fileview that abides by the requirements; I didn't know that the nonblocking requests were divided into groups. Can you please tell me why the nonblocking requests were initially designed to be divided into groups? Was it for scalability or some other reason? If for scalability, how about now?
>>>>>>
>>>>>>> Please give it a try and let me know if you see a problem.
>>>>>>
>>>>>> Sure, I'm testing it.
>>>>>>
>>>>>> Jialin
>>>>>> ________________________________________
>>>>>> Best Regards,
>>>>>> Jialin Liu, Ph.D student.
>>>>>> Computer Science Department
>>>>>> Texas Tech University
>>>>>> Phone: 806.742.3513(x241)
>>>>>> Office:Engineer Center 304
>>>>>> http://myweb.ttu.edu/jialliu/
>>>>>> ________________________________________
>>>>>> From: parallel-netcdf-bounces at lists.mcs.anl.gov [parallel-netcdf-bounces at lists.mcs.anl.gov] on behalf of Wei-keng Liao [wkliao at ece.northwestern.edu]
>>>>>> Sent: Monday, January 28, 2013 6:13 PM
>>>>>> To: parallel-netcdf at lists.mcs.anl.gov
>>>>>> Subject: Re: how to do nonblocking collective i/o
>>>>>>
>>>>>> Hi, Jialin, please see my in-line response below.
>>>>>>
>>>>>> On Jan 28, 2013, at 4:05 PM, Liu, Jaln wrote:
>>>>>>
>>>>>>> Hi Rob,
>>>>>>>
>>>>>>> Thanks for your answer.
>>>>>>>
>>>>>>> You're close. I bet by the time I finish writing this email, Wei-keng
>>>>>>> will have already responded.
>>>>>>>
>>>>>>> You remind me of a previous thread on the pnetcdf mailing list:
>>>>>>> 'Performance tuning problem with iput_vara_double/wait_all',
>>>>>>>
>>>>>>> Dr. Liao once mentioned "But concatenating two filetypes will end up with a filetype violating the requirement of monotonic non-decreasing file offsets",
>>>>>>>
>>>>>>> So I guess that even if my code correctly tries to do non-blocking collective I/O, it will still result in separate collective I/O calls for each group, right?
>>>>>>> Is there any way to know this before running a performance test?
>>>>>>
>>>>>> This problem has been resolved in SVN r1121 committed on Saturday.
>>>>>> Please give it a try and let me know if you see a problem.
>>>>>>
>>>>>>
>>>>>>> I have another related question,
>>>>>>> According to the paper "Combining I/O Operations for Multiple Array Variables in Parallel netCDF", nonblocking collective I/O is designed for accessing multiple variables. But I assume it is also useful for optimizing access to multiple subsets of one variable, just like what I'm trying to do in the code, right?
>>>>>>
>>>>>> PnetCDF nonblocking APIs can be used to aggregate requests within a variable and
>>>>>> across variables (including a mix of record and non-record variables). There is an
>>>>>> example program newly added in trunk/examples/column_wise.c that makes multiple
>>>>>> nonblocking write calls to a single 2D variable; each request writes one column of the 2D array.
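>>>>>>
>>>>>> The pattern looks roughly like this (a condensed sketch in the spirit of
>>>>>> column_wise.c, not the actual example file; error checking is omitted and
>>>>>> the number of columns NX is assumed to be a multiple of the process count):
>>>>>>
>>>>>> #include <stdlib.h>
>>>>>> #include <mpi.h>
>>>>>> #include <pnetcdf.h>
>>>>>>
>>>>>> #define NY 10
>>>>>> #define NX 16
>>>>>>
>>>>>> int main(int argc, char **argv)
>>>>>> {
>>>>>>     int rank, nprocs, ncid, varid, dimid[2], j, ncols, *req, *st;
>>>>>>     float *buf;
>>>>>>     MPI_Offset start[2], count[2];
>>>>>>
>>>>>>     MPI_Init(&argc, &argv);
>>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
>>>>>>
>>>>>>     ncmpi_create(MPI_COMM_WORLD, "testfile.nc", NC_CLOBBER,
>>>>>>                  MPI_INFO_NULL, &ncid);
>>>>>>     ncmpi_def_dim(ncid, "Y", NY, &dimid[0]);
>>>>>>     ncmpi_def_dim(ncid, "X", NX, &dimid[1]);
>>>>>>     ncmpi_def_var(ncid, "var", NC_FLOAT, 2, dimid, &varid);
>>>>>>     ncmpi_enddef(ncid);
>>>>>>
>>>>>>     ncols = NX / nprocs;                  /* columns owned by this rank   */
>>>>>>     req   = (int*)   malloc(ncols * sizeof(int));
>>>>>>     st    = (int*)   malloc(ncols * sizeof(int));
>>>>>>     buf   = (float*) calloc(ncols * NY, sizeof(float));
>>>>>>
>>>>>>     count[0] = NY;  count[1] = 1;         /* one whole column per request */
>>>>>>     for (j = 0; j < ncols; j++) {
>>>>>>         start[0] = 0;
>>>>>>         start[1] = rank + j * nprocs;     /* columns interleaved by rank  */
>>>>>>         ncmpi_iput_vara_float(ncid, varid, start, count,
>>>>>>                               &buf[j * NY], &req[j]);
>>>>>>     }
>>>>>>     ncmpi_wait_all(ncid, ncols, req, st); /* one collective completion    */
>>>>>>
>>>>>>     ncmpi_close(ncid);
>>>>>>     free(req); free(st); free(buf);
>>>>>>     MPI_Finalize();
>>>>>>     return 0;
>>>>>> }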
>>>>>>
>>>>>>
>>>>>> Wei-keng
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Jialin
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Here is the code I wrote:
>>>>>>>>
>>>>>>>> float **nb_temp_in = malloc(numcalls * sizeof(float *));
>>>>>>>> int *request = calloc(numcalls, sizeof(int));
>>>>>>>> int *status  = calloc(numcalls, sizeof(int));
>>>>>>>> int varasize;
>>>>>>>>
>>>>>>>> for (j = 0; j < numcalls; j++) {
>>>>>>>>     mpi_count[1] = (j > NLVL) ? NLVL : j + 1;
>>>>>>>>     varasize = mpi_count[0] * mpi_count[1] * NLAT * NLON;
>>>>>>>>     nb_temp_in[j] = calloc(varasize, sizeof(float));
>>>>>>>>     /* post one nonblocking read per iteration */
>>>>>>>>     ret = ncmpi_iget_vara(ncid, temp_varid, mpi_start, mpi_count,
>>>>>>>>                           nb_temp_in[j], varasize, MPI_FLOAT,
>>>>>>>>                           &request[j]);
>>>>>>>>     if (ret != NC_NOERR) handle_error(ret);
>>>>>>>> }
>>>>>>>>
>>>>>>>> /* complete all posted requests with a single collective wait */
>>>>>>>> ret = ncmpi_wait_all(ncid, numcalls, request, status);
>>>>>>>> for (j = 0; j < numcalls; j++)
>>>>>>>>     if (status[j] != NC_NOERR) handle_error(status[j]);
>>>>>>>>
>>>>>>>> I have two questions:
>>>>>>>> 1. In the above code, what is the right way to parallelize the program?
>>>>>>>> By decomposing the for loop "for (j = 0; j < numcalls; j++)"?
>>>>>>>
>>>>>>> No "right" way, really. Depends on what the reader needs. Decomposing
>>>>>>> over numcalls is definitely one way. Or you can decompose over
>>>>>>> 'mpi_start' and 'mpi_count' -- though I personally have to wrestle
>>>>>>> with block decomposition for a while before it's correct.
>>>>>>>
>>>>>>>> 2. How do I do non-blocking collective I/O? Is there a function like
>>>>>>>> 'ncmpi_iget_vara_all'?
>>>>>>>
>>>>>>> you already did it.
>>>>>>>
>>>>>>> We've iterated over a few nonblocking-pnetcdf approaches over the
>>>>>>> years, but settled on this way:
>>>>>>> - operations are posted independently.
>>>>>>> - One can collectively wait for completion with "ncmpi_wait_all", as
>>>>>>> you did.
>>>>>>> - If one needs to wait for completion locally due to the nature of the
>>>>>>> application, one might not get the best performance, but
>>>>>>> "ncmpi_wait" is still there if the app needs independent I/O
>>>>>>> completion.
>>>>>>>
>>>>>>> ==rob
>>>>>>>
>>>>>>> --
>>>>>>> Rob Latham
>>>>>>> Mathematics and Computer Science Division
>>>>>>> Argonne National Lab, IL USA
>>>>>
>>>>
>>>> --
>>>> Rob Latham
>>>> Mathematics and Computer Science Division
>>>> Argonne National Lab, IL USA
>>>
>>
>> --
>> Rob Latham
>> Mathematics and Computer Science Division
>> Argonne National Lab, IL USA
>