how to do nonblocking collective i/o

Mon Jan 28 16:05:19 CST 2013

Hi Rob, 

Thanks for your answer.

>You're close. I bet by the time I finish writing this email Wei-keng
>will already respond.

That reminds me of a previous thread on the pnetcdf mailing list,
'Performance tuning problem with iput_vara_double/wait_all',

in which Dr. Liao mentioned: "But concatenating two filetypes will end up with a filetype violating the requirement of monotonic non-decreasing file offsets".

So I guess that even if my code correctly attempts non-blocking collective I/O, it will still be carried out as a separate collective I/O call for each request rather than as one combined call, right?
Is there any way to know this before running a performance test?
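
To make the question concrete, here is a rough sketch of the kind of
check I have in mind. It is not part of pnetcdf: the names
linear_offset, requests_are_monotonic and MAX_DIMS are made up, and it
assumes a fixed-size (non-record) variable stored in row-major order.
It only tests a sufficient condition: each request must begin at or
after the last element of the request posted before it.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DIMS 8  /* hypothetical limit, for this sketch only */

/* Row-major linear offset (in elements) of idx[] in an array of dims[]. */
long long linear_offset(int ndims, const long long *dims,
                        const long long *idx)
{
    long long off = 0;
    for (int d = 0; d < ndims; d++)
        off = off * dims[d] + idx[d];
    return off;
}

/* starts and counts hold nreq rows of ndims entries each, in posting
 * order. Returns 1 if every request begins at or after the last element
 * of the previous request, 0 otherwise. */
int requests_are_monotonic(int ndims, const long long *dims, int nreq,
                           const long long *starts,
                           const long long *counts)
{
    assert(ndims > 0 && ndims <= MAX_DIMS);
    long long prev_last = -1;  /* offset of last element seen so far */
    for (int r = 0; r < nreq; r++) {
        const long long *s = starts + (size_t)r * ndims;
        const long long *c = counts + (size_t)r * ndims;
        long long last[MAX_DIMS];
        for (int d = 0; d < ndims; d++) {
            assert(c[d] > 0);
            last[d] = s[d] + c[d] - 1;  /* last index touched in dim d */
        }
        if (linear_offset(ndims, dims, s) < prev_last)
            return 0;  /* this request starts before the previous ends */
        prev_last = linear_offset(ndims, dims, last);
    }
    return 1;
}
```

My loop posts every request with the same mpi_start and a growing
mpi_count[1], so a check like this would report the requests as
overlapping, i.e. not monotonic.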

I have another, related question.
According to the paper "Combining I/O Operations for Multiple Array Variables in Parallel netCDF", non-blocking collective I/O is designed for accessing multiple variables. But I assume it is also useful for optimizing access to multiple subsets of a single variable, as I'm trying to do in my code. Is that right?

Jialin

> Here is the code I wrote:
>
>         float ** nb_temp_in=malloc(numcalls*sizeof(float *));
>         int * request=calloc(numcalls, sizeof(int));
>         int * status=calloc(numcalls,sizeof(int));
>         int varasize;
>         for(j=0;j<numcalls;j++)
>         {
>           mpi_count[1]=(j>NLVL)?NLVL:j+1;
>           varasize=mpi_count[0]*mpi_count[1]*NLAT*NLON;
>           nb_temp_in[j]=calloc(varasize,sizeof(float));
>           ret = ncmpi_iget_vara(ncid, temp_varid,
>                mpi_start,mpi_count,nb_temp_in[j],
>                varasize,MPI_FLOAT,&(request[j]));
>           if (ret != NC_NOERR) handle_error(ret);
>         }
>
>         ret = ncmpi_wait_all(ncid, numcalls, request, status);
>         for (j=0; j<numcalls; j++)
>          if (status[j] != NC_NOERR) handle_error(status[j]);
>       }
>
> I have two questions:
> 1. In the above code, what is the right way to parallelize the program?
> By decomposing the for loop "for(j=0;j<numcalls;j++)"?

No "right" way, really. Depends on what the reader needs.  Decomposing
over numcalls is definitely one way.  Or you can decompose over
'mpi_start' and 'mpi_count' -- though I personally have to wrestle
with block decomposition for a while before it's correct.
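
For what it's worth, here is a minimal sketch of a 1-D block
decomposition, the part I usually get wrong on the first try.  The
helper name block_decompose is made up for illustration, and plain
long long stands in for MPI_Offset; you would assign the results to the
relevant entries of 'mpi_start' and 'mpi_count' for whichever dimension
you split.

```c
#include <assert.h>

/* Split n items across nprocs ranks as evenly as possible.
 * The first (n % nprocs) ranks each get one extra item.
 * On return, *start is this rank's first index and *count its length. */
void block_decompose(long long n, int nprocs, int rank,
                     long long *start, long long *count)
{
    assert(n >= 0 && nprocs > 0 && rank >= 0 && rank < nprocs);
    long long base  = n / nprocs;   /* items every rank gets */
    long long extra = n % nprocs;   /* leftover items to hand out */
    *count = base + (rank < extra ? 1 : 0);
    /* Ranks before this one hold `base` items each, plus one extra
     * item for each of them that is among the first `extra` ranks. */
    *start = rank * base + (rank < extra ? rank : extra);
}
```

The same helper works for splitting over numcalls instead: pass
numcalls as n and loop j from start to start+count-1 on each rank.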

> 2. How do I do non-blocking collective I/O? Is there a function like
> 'ncmpi_iget_vara_all'?

You already did it.

We've iterated over a few nonblocking-pnetcdf approaches over the
years, but settled on this way:
- operations are posted independently.
- One can collectively wait for completion with "ncmpi_wait_all", as
  you did.
- If one needs to wait for completion locally due to the nature of the
  application, one might not get the best performance, but
  "ncmpi_wait" is still there if the app needs independent I/O
  completion.

==rob

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA