how to do nonblocking collective i/o

Phil Miller mille121 at illinois.edu
Tue Jan 29 11:41:31 CST 2013


On Tue, Jan 29, 2013 at 11:38 AM, Wei-keng Liao
<wkliao at ece.northwestern.edu> wrote:
> Maybe, if I can find a good use case exhibiting the access patterns
> as discussed in this thread. Any suggestions?

I'll be happy to work with you on extracting the pattern exhibited by
ISAM for this.

>
> Wei-keng
>
> On Jan 29, 2013, at 9:50 AM, Rob Latham wrote:
>
> > Wei-keng: you've done a lot of work on the non-blocking interface
> > since the 2009 paper.  I wonder if there's a publication you can get
> > out of those modifications.
> >
> > ==rob
> >
> > On Mon, Jan 28, 2013 at 10:31:04PM -0600, Wei-keng Liao wrote:
> >> Here is the detailed explanations.
> >>
> >> Prior to r1121, the implementation of the nonblocking APIs did not
> >> break each request into a list of offset-length pairs. It simply used
> >> the arguments start[] and count[] to define a filetype through a call
> >> to MPI_Type_create_subarray(). The filetypes of all nonblocking
> >> requests were then concatenated into a single one, if their file
> >> offsets were monotonically increasing. If not, the requests were
> >> divided into groups, each group fulfilling the MPI fileview
> >> requirement, and each group called an MPI collective read/write.
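As a rough sketch of that pre-r1121 approach (illustrative only, not the
actual PnetCDF internals; request_filetype, gsizes, starts and counts are
made-up names), each posted request's start[]/count[] maps onto an MPI
subarray filetype roughly like this:

    #include <mpi.h>

    /* Sketch: turn one request's start[]/count[] into an MPI filetype.
     * The resulting filetypes can only be concatenated into one fileview
     * if their file offsets are monotonically increasing. */
    static MPI_Datatype request_filetype(int ndims,
                                         const int gsizes[], /* global array shape */
                                         const int starts[], /* request start[]    */
                                         const int counts[]) /* request count[]    */
    {
        MPI_Datatype ftype;
        MPI_Type_create_subarray(ndims, gsizes, counts, starts,
                                 MPI_ORDER_C, MPI_DOUBLE, &ftype);
        MPI_Type_commit(&ftype);
        return ftype;
    }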
> >>
> >> The reason for the above design was memory concerns. Flattening each
> >> request into a list of offset-length pairs can take up a non-negligible
> >> amount of memory. Take examples/column_wise.c, in which each process
> >> writes a few non-contiguous columns into a global 2D array using
> >> nonblocking APIs: the C struct for each flattened offset-length pair
> >> takes 24 bytes, bigger than the data it describes (8 bytes if the
> >> variable is of double type). This is why I did not use this approach in
> >> the first place.
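To see where the 24 bytes come from, a flattened entry could plausibly look
like the struct below (an assumed, illustrative layout, not necessarily the
one in the PnetCDF source); on a 64-bit build it is three times the size of
the single double element it may describe:

    /* Illustrative only: one plausible flattened offset-length entry. */
    typedef struct {
        MPI_Offset  off;   /* starting file offset of the segment  (8 bytes) */
        MPI_Offset  len;   /* length of the contiguous segment     (8 bytes) */
        void       *buf;   /* matching position in the user buffer (8 bytes) */
    } flat_pair;           /* 24 bytes vs. 8 bytes of double data it covers  */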
> >>
> >> The fix in r1121 eventually uses this approach. So, be warned!
> >>
> >> If memory space is of no concern, this new fix can significantly
> >> improve performance. I evaluated column_wise.c with a slightly larger
> >> array size and saw a decent improvement. You can give it a try.
> >>
> >> Maybe in the future we will come up with a smart approach to
> >> dynamically decide when to fall back to the previous approach, say when
> >> the additional memory space required is beyond a threshold.
> >>
> >> Wei-keng
> >>
> >> On Jan 28, 2013, at 9:45 PM, Liu, Jaln wrote:
> >>
> >>> Hi Dr. Liao,
> >>>
> >>> So glad to know that it has been solved.
> >>>
> >>> I was confused about why sorting the offset-length pairs could not
> >>> generate a fileview that abides by the requirements; I didn't know that
> >>> the nonblocking requests were divided into groups. Can you please tell
> >>> me why the nonblocking requests were initially designed to be divided
> >>> into groups? For scalability or some other reason? If for scalability,
> >>> how about now?
> >>>
> >>>> Please give it a try and let me know if you see a problem.
> >>>
> >>> Sure, I'm testing it.
> >>>
> >>> Jialin
> >>> ________________________________________
> >>> Best Regards,
> >>> Jialin Liu, Ph.D student.
> >>> Computer Science Department
> >>> Texas Tech University
> >>> Phone: 806.742.3513(x241)
> >>> Office:Engineer Center 304
> >>> http://myweb.ttu.edu/jialliu/
> >>> ________________________________________
> >>> From: parallel-netcdf-bounces at lists.mcs.anl.gov
> >>> [parallel-netcdf-bounces at lists.mcs.anl.gov] on behalf of Wei-keng Liao
> >>> [wkliao at ece.northwestern.edu]
> >>> Sent: Monday, January 28, 2013 6:13 PM
> >>> To: parallel-netcdf at lists.mcs.anl.gov
> >>> Subject: Re: how to do nonblocking collective i/o
> >>>
> >>> Hi, Jialin, please see my in-line response below.
> >>>
> >>> On Jan 28, 2013, at 4:05 PM, Liu, Jaln wrote:
> >>>
> >>>> Hi Rob,
> >>>>
> >>>> Thanks for your answer.
> >>>>
> >>>>> You're close. I bet by the time I finish writing this email,
> >>>>> Wei-keng will have already responded.
> >>>>
> >>>> You remind me of a previous thread on the pnetcdf mailing list:
> >>>> 'Performance tuning problem with iput_vara_double/wait_all',
> >>>>
> >>>> Dr. Liao once mentioned, "But concatenating two filetypes will end up
> >>>> with a filetype violating the requirement of monotonic non-decreasing
> >>>> file offsets."
> >>>>
> >>>> So I guess that even if my code correctly tries to do non-blocking
> >>>> collective I/O, it will still end up as separate collective I/O
> >>>> operations, right? Is there any way to know this before running a
> >>>> performance test?
> >>>
> >>> This problem has been resolved in SVN r1121 committed on Saturday.
> >>> Please give it a try and let me know if you see a problem.
> >>>
> >>>
> >>>> I have another related question. According to the paper "Combining I/O
> >>>> Operations for Multiple Array Variables in Parallel netCDF", the
> >>>> non-blocking collective I/O is designed for accessing multiple
> >>>> variables. But I assume it is also useful for optimizing accesses to
> >>>> multiple subsets of one variable, just like what I'm trying to do in
> >>>> the code, right?
> >>>
> >>> The PnetCDF nonblocking APIs can be used to aggregate requests within a
> >>> variable and across variables (including mixed record and non-record
> >>> variables). There is an example program newly added in
> >>> trunk/examples/column_wise.c that posts multiple nonblocking writes to a
> >>> single 2D variable, each request writing one column of the 2D array.
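For reference, the pattern in column_wise.c boils down to something like the
sketch below (ncols, nrows, first_col, col_stride, buf, ncid, varid and
handle_error are assumed names; see trunk/examples/column_wise.c for the real
program):

    /* Sketch: post one nonblocking write per non-contiguous column of a
     * 2D variable, then flush them all with a single collective wait. */
    MPI_Offset start[2], count[2];
    int j, err;
    int *reqs  = (int*) malloc(ncols * sizeof(int));
    int *stats = (int*) malloc(ncols * sizeof(int));

    count[0] = nrows;            /* a whole column ...       */
    count[1] = 1;                /* ... one column at a time */
    for (j = 0; j < ncols; j++) {
        start[0] = 0;
        start[1] = first_col + j * col_stride;
        err = ncmpi_iput_vara_double(ncid, varid, start, count,
                                     &buf[j * nrows], &reqs[j]);
        if (err != NC_NOERR) handle_error(err);
    }

    /* all posted requests are aggregated into one collective I/O here */
    err = ncmpi_wait_all(ncid, ncols, reqs, stats);
    for (j = 0; j < ncols; j++)
        if (stats[j] != NC_NOERR) handle_error(stats[j]);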
> >>>
> >>>
> >>> Wei-keng
> >>>
> >>>
> >>>
> >>>> Jialin
> >>>>
> >>>>
> >>>>
> >>>>> Here is the code I wrote:
> >>>>>
> >>>>>      float **nb_temp_in = malloc(numcalls * sizeof(float *));
> >>>>>      int *request = calloc(numcalls, sizeof(int));
> >>>>>      int *status  = calloc(numcalls, sizeof(int));
> >>>>>      int varasize;
> >>>>>      for (j = 0; j < numcalls; j++)
> >>>>>      {
> >>>>>        mpi_count[1] = (j > NLVL) ? NLVL : j + 1;
> >>>>>        varasize = mpi_count[0] * mpi_count[1] * NLAT * NLON;
> >>>>>        nb_temp_in[j] = calloc(varasize, sizeof(float));
> >>>>>        /* post a nonblocking read; it is serviced later at wait time */
> >>>>>        ret = ncmpi_iget_vara(ncid, temp_varid,
> >>>>>                              mpi_start, mpi_count, nb_temp_in[j],
> >>>>>                              varasize, MPI_FLOAT, &request[j]);
> >>>>>        if (ret != NC_NOERR) handle_error(ret);
> >>>>>      }
> >>>>>
> >>>>>      ret = ncmpi_wait_all(ncid, numcalls, request, status);
> >>>>>      for (j=0; j<numcalls; j++)
> >>>>>       if (status[j] != NC_NOERR) handle_error(status[j]);
> >>>>>    }
> >>>>>
> >>>>> I have two questions,
> >>>>> 1, in the above code, what is right way to parallelize the program?
> >>>>> by decomposing the for loop " for(j=0;j<numcalls;j++)"?
> >>>>
> >>>> No "right" way, really. Depends on what the reader needs.
> >>>> Decomposing
> >>>> over numcalls is definitely one way.  Or you can decompose over
> >>>> 'mpi_start' and 'mpi_count' -- though I personally have to wrestle
> >>>> with block decomposition for a while before it's correct.
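A minimal sketch of the first option, decomposing the numcalls loop over
ranks, assuming the names from the snippet above plus rank/nprocs obtained
from MPI_Comm_rank/MPI_Comm_size:

    /* Sketch: round-robin decomposition of the request loop over ranks.
     * Each rank posts only its share of the numcalls requests; every
     * rank must still reach the collective ncmpi_wait_all(). */
    int rank, nprocs, nposted = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (j = rank; j < numcalls; j += nprocs) {
        /* set mpi_start[]/mpi_count[] and allocate nb_temp_in[nposted]
           for call j exactly as in the original loop, then: */
        ret = ncmpi_iget_vara(ncid, temp_varid, mpi_start, mpi_count,
                              nb_temp_in[nposted], varasize, MPI_FLOAT,
                              &request[nposted]);
        if (ret != NC_NOERR) handle_error(ret);
        nposted++;
    }
    ret = ncmpi_wait_all(ncid, nposted, request, status);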
> >>>>
> >>>>> 2, how to do non-blocking collective I/O? is there a function like
> >>>>> 'ncmpi_iget_vara_all'?
> >>>>
> >>>> You already did it.
> >>>>
> >>>> We've iterated over a few nonblocking-pnetcdf approaches over the
> >>>> years, but settled on this way:
> >>>> - Operations are posted independently.
> >>>> - One can collectively wait for completion with "ncmpi_wait_all", as
> >>>>   you did.
> >>>> - If one needs to wait for completion locally due to the nature of the
> >>>>   application, one might not get the best performance, but "ncmpi_wait"
> >>>>   is still there if the app needs independent I/O completion.
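In code, assuming the variables from the snippet earlier in the thread, the
two completion modes look like this (use one or the other for a given set of
requests):

    /* (a) Collective completion: called by all processes that opened the
     * file, letting the library service the requests with collective
     * MPI-IO underneath. */
    ret = ncmpi_wait_all(ncid, numcalls, request, status);

    /* (b) Independent completion: may be called by any subset of
     * processes, at the cost of falling back to independent I/O. */
    /* ret = ncmpi_wait(ncid, numcalls, request, status); */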
> >>>>
> >>>> ==rob
> >>>>
> >>>> --
> >>>> Rob Latham
> >>>> Mathematics and Computer Science Division
> >>>> Argonne National Lab, IL USA
> >>
> >
> > --
> > Rob Latham
> > Mathematics and Computer Science Division
> > Argonne National Lab, IL USA
>

