how to do nonblocking collective i/o

Liu, Jaln jaln.liu at ttu.edu
Mon Jan 28 21:45:19 CST 2013


Hi Dr. Liao,

So glad to know that it has been solved. 

I was confused about why sorting the offset-length pairs could not generate a fileview that abides by the requirement; I didn't know that the nonblocking requests were divided into groups. Can you please tell me why the nonblocking requests were originally designed to be divided into groups? Was it for scalability, or some other reason? And if for scalability, is that still the case now?

>Please give it a try and let me know if you see a problem.

Sure, I'm testing it.

Jialin
________________________________________
Best Regards,
Jialin Liu, Ph.D student.
Computer Science Department
Texas Tech University
Phone: 806.742.3513(x241)
Office:Engineer Center 304
http://myweb.ttu.edu/jialliu/
________________________________________
From: parallel-netcdf-bounces at lists.mcs.anl.gov [parallel-netcdf-bounces at lists.mcs.anl.gov] on behalf of Wei-keng Liao [wkliao at ece.northwestern.edu]
Sent: Monday, January 28, 2013 6:13 PM
To: parallel-netcdf at lists.mcs.anl.gov
Subject: Re: how to do nonblocking collective i/o

Hi, Jialin, please see my in-line response below.

On Jan 28, 2013, at 4:05 PM, Liu, Jaln wrote:

> Hi Rob,
>
> Thanks for your answer.
>
>> You're close. I  bet by the time I finish writing this email Wei-keng
>> will already respond.
>
> You remind me of a previous thread on the pnetcdf mailing list:
> 'Performance tuning problem with iput_vara_double/wait_all',
>
> Dr. Liao once mentioned "But concatenating two filetypes will end up with a filetype violating the requirement of monotonic non-decreasing file offsets",
>
> So I guess that even if my code correctly requests non-blocking collective I/O, it will still end up carrying out the collective I/O individually, right?
> Is there any way to know this before running a performance test?

This problem has been resolved in SVN r1121 committed on Saturday.
Please give it a try and let me know if you see a problem.


> I have another related question,
> According to the paper "Combining I/O Operations for Multiple Array Variables in Parallel netCDF", non-blocking collective I/O is designed for accesses to multiple variables. But I assume it is also useful for optimizing accesses to multiple subsets of a single variable, just like what I'm trying to do in the code, right?

PnetCDF nonblocking APIs can be used to aggregate requests within a single variable as well as
across variables (including a mix of record and non-record variables). A newly added
example program, trunk/examples/column_wise.c, issues multiple nonblocking writes
to a single 2D variable, where each request writes one column of the 2D array.
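As a rough sketch (not the actual column_wise.c source), the request pattern in that example can be expressed as below. NY, NX, and the helper column_request are illustrative names introduced here, and the PnetCDF calls that would consume the computed (start, count) pairs are indicated in comments:

```c
#include <assert.h>

#define NY 4    /* illustrative 2D variable: NY rows x NX columns */
#define NX 10

/* Cyclic column assignment: rank r owns columns r, r+nprocs, r+2*nprocs, ...
 * Fill in the (start, count) pair (PnetCDF takes these as MPI_Offset
 * arrays) for the j-th column owned by `rank`; return -1 when the rank
 * owns fewer than j+1 columns. */
static int column_request(int rank, int nprocs, int j,
                          long long start[2], long long count[2])
{
    long long col = rank + (long long)j * nprocs;
    if (col >= NX) return -1;
    start[0] = 0;  start[1] = col;  /* top of column `col` */
    count[0] = NY; count[1] = 1;    /* the whole column    */
    return 0;
}

/* Each rank then posts one nonblocking write per owned column,
 *   ncmpi_iput_vara_float(ncid, varid, start, count, buf[j], &req[j]);
 * and completes all of them with a single collective call,
 *   ncmpi_wait_all(ncid, nreq, req, statuses);
 * which is the point at which PnetCDF aggregates the requests. */
```

Posting many small column writes and completing them in one wait_all is what lets the library combine them into larger, better-formed file accesses.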


Wei-keng



> Jialin
>
>
>
>> Here are the codes I wrote:
>>
>>        float ** nb_temp_in=malloc(numcalls*sizeof(float *));
>>        int * request=calloc(numcalls, sizeof(int));
>>        int * status=calloc(numcalls,sizeof(int));
>>        int varasize;
>>        for(j=0;j<numcalls;j++)
>>        {
>>          mpi_count[1]=(j>NLVL)?NLVL:j+1;
>>          varasize=mpi_count[0]*mpi_count[1]*NLAT*NLON;
>>          nb_temp_in[j]=calloc(varasize,sizeof(float));
>>          ret = ncmpi_iget_vara(ncid, temp_varid,
>>               mpi_start, mpi_count, nb_temp_in[j],
>>               varasize, MPI_FLOAT, &request[j]);
>>          if (ret != NC_NOERR) handle_error(ret);
>>        }
>>
>>        ret = ncmpi_wait_all(ncid, numcalls, request, status);
>>        for (j=0; j<numcalls; j++)
>>         if (status[j] != NC_NOERR) handle_error(status[j]);
>>
>> I have two questions,
>> 1, in the above code, what is the right way to parallelize the program?
>> by decomposing the for loop "for (j=0; j<numcalls; j++)"?
>
> No "right" way, really. Depends on what the reader needs.  Decomposing
> over numcalls is definitely one way.  Or you can decompose over
> 'mpi_start' and 'mpi_count' -- though I personally have to wrestle
> with block decomposition for a while before it's correct.
>
>> 2, how to do non-blocking collective I/O? is there a function like
>> 'ncmpi_iget_vara_all'?
>
> you already did it.
>
> We've iterated over a few nonblocking-pnetcdf approaches over the
> years, but settled on this way:
> - operations are posted independently.
> - One can collectively wait for completion with "ncmpi_wait_all", as
>  you did.
> - If one needs to wait for completion locally due to the nature of the
>  application, one might not get the best performance, but
>  "ncmpi_wait" is still there if the app needs independent I/O
>  completion.
>
> ==rob
>
> --
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA
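Rob's second option above, decomposing over 'mpi_start' and 'mpi_count', could be sketched roughly as follows. NROWS and block_decompose are illustrative names introduced here, not identifiers from the code in this thread:

```c
#include <assert.h>

#define NROWS 10   /* illustrative size of the decomposed dimension */

/* Block-partition NROWS rows across nprocs ranks: each rank gets a
 * contiguous slab, with the remainder spread over the first ranks.
 * The results would populate mpi_start[0]/mpi_count[0] before the
 * ncmpi_iget_vara() calls, so each rank reads only its own slab. */
static void block_decompose(int rank, int nprocs,
                            long long *start, long long *count)
{
    long long base = NROWS / nprocs;
    long long rem  = NROWS % nprocs;
    *count = base + (rank < rem ? 1 : 0);
    *start = rank * base + (rank < rem ? rank : rem);
}
```

Getting the remainder handling right is the part Rob alludes to wrestling with; the check worth making is that the slabs are contiguous, in order, and cover every row exactly once.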

