Is this the proper way to do parallel reads? and if so, is there a bug?

Rob Latham robl at mcs.anl.gov
Thu Jan 10 09:43:33 CST 2013


On Thu, Jan 10, 2013 at 06:53:45AM -0500, Smith, Brian E. wrote:
> Hi all,
> 
> I am fairly new to (p)netCDF, but have plenty of MPI experience. I am working on improving the I/O performance of a fairly simple code I have.
> 
> The basic premise is:
> 1) create communicators where the members of the communicator are on the same "node".  
> 2) each "node" opens its subset of the total number of files (say 50 files out of 200 if we have 4 "nodes")
> 3) each "core" on a given node reads its subset of variables from its node's file (say 25 out of 100 variables if we have 4 "cores" per "node")
> 4) each "core" does some local work (right now, just a simple sum)
> 5) each "node" combines the local work (basically combine local sums for a given var on multiple nodes to a global sum to calculate an average)
> 6) write results.

Ah! I think I see a problem in step 3. If there are 100 variables and
you want to read them collectively, then every process in the
communicator that opened the file has to participate in each collective
call, even if the 'count' array is all zeros for the do-nothing
processes.

Maybe you are already doing that...
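
I mean something roughly like this (an untested sketch that reuses the
names from your snippet; this_rank_wants_var() is a made-up predicate,
and I'm assuming a 2-D lat x lon variable): every rank that opened the
file makes the collective call, and the ranks that don't want this
particular variable pass an all-zero count:

    /* untested sketch: every process in the communicator that opened the
     * file calls the collective get for every variable; processes that do
     * not want this variable pass zero counts, so they participate but
     * read nothing */
    MPI_Offset start[2] = {0, 0};
    MPI_Offset count[2] = {0, 0};
    float *buf = NULL;

    if (this_rank_wants_var(j)) {   /* hypothetical predicate */
        count[0] = lat;
        count[1] = lon;
        buf = (float *)Malloc(sizeof(float) * lat * lon);
    }
    SAFE(ncmpi_get_vara_float_all(infh, var1[j].varid, start, count, buf));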

If you use the pnetcdf non-blocking interface, you have a bit more
flexibility: the collective requirement is deferred until the
ncmpi_wait_all() call, because in our implementation all I/O is
deferred until that call.
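
In your loop that could look roughly like this (untested sketch, again
reusing your names; MAX_VARS is just an assumed upper bound on the
number of posted requests): each process posts only the reads it
actually wants, then everyone completes them with one collective wait:

    /* untested sketch of the non-blocking interface: post only the reads
     * this process wants, then finish them with a single collective wait */
    int reqs[MAX_VARS], statuses[MAX_VARS];
    int nreq = 0;

    for (j = coreRank * vars_percore; j < (coreRank + 1) * vars_percore; j++) {
        fdata[fcount][vcount] = (float *)Malloc(sizeof(float) * lat * lon);
        SAFE(ncmpi_iget_var_float(infh, var1[j].varid,
                                  fdata[fcount][vcount], &reqs[nreq++]));
        vcount++;
    }
    /* collective: every process that opened the file must make this call,
     * even if it posted zero requests */
    SAFE(ncmpi_wait_all(infh, nreq, reqs, statuses));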

You can put yourself into independent mode (you must call
ncmpi_begin_indep_data, which flushes the record-variable caches and
resets the file views), but performance is pretty poor; we don't spend
any time optimizing independent I/O.
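
If you did want to try it, it would look roughly like this (untested
sketch, reusing your names): bracket the independent gets with
begin/end_indep_data, and no other process has to take part in the gets
themselves:

    /* untested sketch of independent mode; note that ncmpi_begin_indep_data()
     * and ncmpi_end_indep_data() are themselves collective over the
     * communicator used to open the file */
    SAFE(ncmpi_begin_indep_data(infh));
    for (j = coreRank * vars_percore; j < (coreRank + 1) * vars_percore; j++) {
        fdata[fcount][vcount] = (float *)Malloc(sizeof(float) * lat * lon);
        SAFE(ncmpi_get_var_float(infh, var1[j].varid, fdata[fcount][vcount]));
        vcount++;
    }
    SAFE(ncmpi_end_indep_data(infh));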

> Depending on how I combine ncmpi_get_var_*_all vs. ncmpi_get_var and independent vs. collective mode, I get errors in all four cases, just different ones. I think this snippet is the "right" way, so it is what I'd like to focus on.
> 
> Should this work? I am using pnetcdf 1.2. I will probably try upgrading today to see if that changes the behavior, but I don't see much in the change logs that would suggest that is necessary. This is also linked against MPICH2 1.4.1p1.
> 
> 
> here's the rough snippet:
>      
>      fcount = 0;
>      /* each node handles files, so we need rank in IOcomm */
>      for (i = IORank * files_pernode; i < (IORank + 1) * files_pernode; i++)
>      {
>          fdata[fcount] = (float **)Malloc(sizeof(float *) * vars_percore);
>          SAFE(ncmpi_open(iocomm, fname[i], NC_NOWRITE, MPI_INFO_NULL, &infh));
> 
>          vcount = 0;
>          for (j = coreRank * vars_percore; j < (coreRank + 1) * vars_percore; j++)
>          {
>              fdata[fcount][vcount] = (float *)Malloc(sizeof(float) * lat * lon);
>              SAFE(ncmpi_get_var_float_all(infh, var1[j].varid, fdata[fcount][vcount]));
>              finish_doing_some_local_work();
>              vcount++;
>          }
>          do_some_collective_work();
>          fcount++;
>      }
> 
> 
> I get errors deep inside MPI doing this:
> #0  0x00007fff8c83bd57 in memmove$VARIANT$sse42 ()
> #1  0x0000000100052389 in MPIUI_Memcpy ()
> #2  0x0000000100051ffb in MPIDI_CH3U_Buffer_copy ()
> #3  0x000000010006792f in MPIDI_Isend_self ()
> #4  0x0000000100063e72 in MPID_Isend ()
> #5  0x00000001000a34b5 in PMPI_Isend ()
> #6  0x00000001000ac8c5 in ADIOI_R_Exchange_data ()
> #7  0x00000001000aba71 in ADIOI_GEN_ReadStridedColl ()
> #8  0x00000001000c6545 in MPIOI_File_read_all ()
> #9  0x00000001000a779f in MPI_File_read_all ()
> #10 0x000000010001d0c9 in ncmpii_getput_vars (ncp=0x7fff5fbff600, varp=0x100282890, start=0x7fff5fbff600, count=0x7fff5fbff600, stride=0x7fff5fbff600, buf=0x7fff5fbff600, bufcount=3888000, datatype=1275069450, rw_flag=1, io_method=1) at getput_vars.c:741
> #11 0x0000000100018fc9 in ncmpi_get_var_float_all (ncid=2440416, varid=238391296, ip=0x7fff5fbff670) at getput_var.c:500
> #12 0x0000000100006e50 in do_fn (fn_params=0x7fff5fbff7b0) at average.c:103
> #13 0x0000000100000c00 in main (argc=12, argv=0x7fff5fbff9b8) at main.c:47
> 
> with this failing assert:
> Assertion failed in file ch3u_buffer.c at line 77: FALSE
> memcpy argument memory ranges overlap, dst_=0x1048bf000 src_=0x1049bd000 len_=15552000
> 
> I am testing with 4 processes on my laptop (mpirun -n 4 ./a.out --various-parameters). The communicators are set up as I expect, at least as far as being properly disjoint:
> I am world rank 0. IO Rank 0. core rank 0. I will be processing file 0151-01.nc (fh 1) and var VAR1. fcount 0 vcount 0, nc_type 5
> I am world rank 2. IO Rank 1. core rank 0. I will be processing file 0158-07.nc (fh 1) and var VAR1. fcount 0 vcount 0, nc_type 5
> I am world rank 1. IO Rank 0. core rank 1. I will be processing file 0151-01.nc (fh 1) and var VAR2. fcount 0 vcount 0, nc_type 5
> I am world rank 3. IO Rank 1. core rank 1. I will be processing file 0158-07.nc (fh 1) and var VAR2. fcount 0 vcount 0, nc_type 5
> 
> What am I doing wrong?
> 
> Thanks.
> 

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

