Is this the proper way to do parallel reads? and if so, is there a bug?

Wei-keng Liao wkliao at ece.northwestern.edu
Thu Jan 10 11:49:52 CST 2013


Hi, Brian,

The communicators you used look fine to me. What Rob explained about the collective
read in your step 3 does point out the cause of the error.

Since the call you make reads an entire variable, every process in the
communicator must participate in reading each variable, whether or not that
process actually needs the variable's data.

Say the two processes of world rank 0 and 1, which open file 0151-01.nc, are
reading variables VAR1 and VAR2, respectively. Collective I/O requires that
both processes participate in the calls to ncmpi_get_var_float_all():

	ncmpi_get_var_float_all(infh, VAR1.varid, buf1);
	ncmpi_get_var_float_all(infh, VAR2.varid, buf2);


An alternative is to call ncmpi_get_vara_float_all(), the variant that takes
start[] and count[] arguments. Set start[] to all zeros and count[] to the
full variable dimension sizes on the process that needs to read the variable's
data, and set count[] to all zeros on the process that does not need that
variable. For example, for the two processes on the same node:

	MPI_Offset start[2], count1[2], count2[2];  /* say the variables are 2D */

	start[0]  = start[1]  = 0;
	count1[0] = count1[1] = 0;   /* counts passed to the VAR1 call */
	count2[0] = count2[1] = 0;   /* counts passed to the VAR2 call */
	if (rank == 0) {             /* rank 0 reads all of VAR1, none of VAR2 */
		count1[0] = full length of VAR1 in dimension Y
		count1[1] = full length of VAR1 in dimension X
	}
	else if (rank == 1) {        /* rank 1 reads all of VAR2, none of VAR1 */
		count2[0] = full length of VAR2 in dimension Y
		count2[1] = full length of VAR2 in dimension X
	}
	ncmpi_get_vara_float_all(infh, VAR1.varid, start, count1, buf1);
	ncmpi_get_vara_float_all(infh, VAR2.varid, start, count2, buf2);

In this case, rank 0 reads the entire VAR1 and none of VAR2. Similarly,
rank 1 reads the entire VAR2 and none of VAR1.
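
If the full dimension lengths are not known ahead of time, they can be
queried from the file before setting the counts. Here is a minimal sketch of
that lookup (my illustration, not from your code; it assumes both variables
are 2D with dimension order Y, X, and omits error checking):

	int ndims, dimids[2];

	if (rank == 0) {
		/* fill count1[] with VAR1's full dimension lengths */
		ncmpi_inq_varndims(infh, VAR1.varid, &ndims);    /* expect 2 */
		ncmpi_inq_vardimid(infh, VAR1.varid, dimids);
		ncmpi_inq_dimlen(infh, dimids[0], &count1[0]);   /* Y */
		ncmpi_inq_dimlen(infh, dimids[1], &count1[1]);   /* X */
	}
	else if (rank == 1) {
		/* fill count2[] with VAR2's full dimension lengths */
		ncmpi_inq_varndims(infh, VAR2.varid, &ndims);
		ncmpi_inq_vardimid(infh, VAR2.varid, dimids);
		ncmpi_inq_dimlen(infh, dimids[0], &count2[0]);
		ncmpi_inq_dimlen(infh, dimids[1], &count2[1]);
	}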

As Rob suggested, using the nonblocking APIs avoids this altogether and should
give you better performance.
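
For reference, here is a rough sketch of how the nonblocking version might
look for the same two-process example (again my illustration, reusing start,
count1, count2, buf1, and buf2 from above; error checking omitted). Each
process posts only the read it needs; no data is moved until
ncmpi_wait_all(), which every process that opened the file must call and
where the I/O is carried out collectively.

	int req, status;

	if (rank == 0)
		/* post a nonblocking read of all of VAR1; nothing is read yet */
		ncmpi_iget_vara_float(infh, VAR1.varid, start, count1, buf1, &req);
	else /* rank == 1 */
		ncmpi_iget_vara_float(infh, VAR2.varid, start, count2, buf2, &req);

	/* the actual I/O happens here, collectively; each process waits
	   only on its own request */
	ncmpi_wait_all(infh, 1, &req, &status);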

Wei-keng

On Jan 10, 2013, at 10:18 AM, Smith, Brian E. wrote:

> Is reading collectively across the communicator that opened the file enough? 
> 
> I have 4 MPI tasks. I create 2 "node" communicators via MPI_Comm_split, then 2 "core" communicators via MPI_Comm_split:
> 
> I am world rank 0. IO Rank 0. core rank 0. I will be processing file 0151-01.nc (fh 1) and var VAR1.fcount 0 vcount 0, nc_type 5
> I am world rank 2. IO Rank 1. core rank 0. I will be processing file 0158-07.nc (fh 1) and var VAR1. fcount 0 vcount 0, nc_type 5
> I am world rank 1. IO Rank 0. core rank 1. I will be processing file 0151-01.nc (fh 1) and var VAR2. fcount 0 vcount 0, nc_type 5
> I am world rank 3. IO Rank 1. core rank 1. I will be processing file 0158-07.nc (fh 1) and var VAR2. fcount 0 vcount 0, nc_type 5
> 
> So in this case, world rank 0 and world rank 1 are reading different variables out of the same file. They are on the same "node" communicator, so the file is ncmpi_open()ed with that communicator. 
> 
> Here are the comm splits too. rankspernode is passed in explicitly in this case. I can figure it out on Cray and Blue Gene systems, but there isn't a portable way to do that (in MPI 2.x), and I'm just testing on my laptop anyway.
> 
> 
>   numnodes = world_size / rankspernode;
> 
>   mynode = rank % numnodes; /* generic for laptop testing */
> 
>   MPI_Comm_split(MPI_COMM_WORLD, mynode, rank, &iocomm);
> 
>   MPI_Comm_rank(iocomm, &iorank);
>   MPI_Comm_split(MPI_COMM_WORLD, iorank, rank, &corecomm);
>   MPI_Comm_size(corecomm, &numcores);
>   MPI_Comm_rank(corecomm, &corerank);
> 
> then:
> 	files_pernode = numfiles / numnodes;
> 	vars_percore = numvars / numcores;
> 
> 	for(i = mynode * files_pernode; i < (mynode + 1 ) * files_pernode; i++)
> 	 {
> 		ncmpi_open(iocomm, fname[i], NC_NOWRITE, MPI_INFO_NULL, &infh);
> 		for(j = corerank * vars_percore; j < (corerank+1)*vars_percore; j++)
> 		{
> 			ncmpi_get_var_float_all(infh, vars[j], data);
> 		}
> 	}
> 
> I'm fairly sure from my printfs (see above) that the communicators are split appropriately for the IO I want to do. However, I'll double check by printing out the communicator pointers too.
> 
> Does that make more sense? 
> 
> 
> Thanks.
> 
> 
> 
> On Jan 10, 2013, at 10:43 AM, Rob Latham wrote:
> 
>> On Thu, Jan 10, 2013 at 06:53:45AM -0500, Smith, Brian E. wrote:
>>> Hi all,
>>> 
>>> I am fairly new to (p)netCDF, but have plenty of MPI experience. I am working on improving IO performance of a fairly simple code I have.
>>> 
>>> The basic premise is:
>>> 1) create communicators where the members of the communicator are on the same "node".  
>>> 2) each "node" opens their subset of the total number of files (say 50 files out of 200 if we have 4 "nodes")
>>> 3) each "core" on a given node reads their subset of variables from their node's file. (say 25 out of 100 variables if we have 4 "cores" per "node")
>>> 4) each "core" does some local work (right now, just a simple sum)
>>> 5) each "node" combines the local work (basically combine local sums for a given var on multiple nodes to a global sum to calculate an average)
>>> 6) write results.
>> 
>> Ah! I think I see a problem in step 3. If there are 100 variables, and
>> you want to read those variables collectively, you're going to have to
>> get everyone participating, even if the 'count' array is zero for the
>> do-nothing processors. 
>> 
>> Maybe you are already doing that...
>> 
>> If you use the pnetcdf non-blocking interface, you have a bit more
>> flexibility: the collective requirement is deferred until the
>> ncmpi_wait_all() call  (because in our implementation, all I/O is
>> deferred until the ncmpi_wait_all() call). 
>> 
>> You can put yourself into independent mode (must call
>> ncmpi_begin_indep_data to flush record variable caches and reset file
>> views), but performance is pretty poor. We don't spend any time
>> optimizing independent I/O.
>> 
>>> Depending on how I tweak ncmpi_get_var_*_all vs ncmpi_get_var and indep mode vs collective mode, I get different errors (in all 4 cases though). I think this snippet is the "right" way, so it is what I'd like to focus on.
>>> 
>>> Should this work? I am using pnetcdf 1.2. I will probably try upgrading today to see if that changes the behavior, but I don't see much in the change logs that would suggest that is necessary. This is also linked against MPICH2 1.4.1p1.
>>> 
>>> 
>>> here's the rough snippet:
>>> 
>>>    fcount = 0;
>>>    /* each node handles files, so we need rank in IOcomm */
>>>    for(i = IORank * files_pernode; i < (IORank + 1 ) * files_pernode; i++)
>>>    {
>>>       fdata[fcount] = (float **)Malloc(sizeof(float *) * vars_percore);
>>>       SAFE(ncmpi_open(iocomm, fname[i], NC_NOWRITE, MPI_INFO_NULL, &infh));
>>> 
>>>       vcount = 0;
>>>       for(j = coreRank * vars_percore; j < (coreRank + 1) * vars_percore; j++)
>>>       {
>>>          fdata[fcount][vcount] = (float *)Malloc(sizeof(float) * lat * lon);
>>>          SAFE(ncmpi_get_var_float_all(infh, var1[j].varid, fdata[fcount][vcount]));
>>>          finish_doing_some_local_work();
>>>          vcount++;
>>>       }
>>>       do_some_collective_work();
>>>       fcount++;
>>>    }
>>> 
>>> 
>>> I get errors deep inside MPI doing this:
>>> #0  0x00007fff8c83bd57 in memmove$VARIANT$sse42 ()
>>> #1  0x0000000100052389 in MPIUI_Memcpy ()
>>> #2  0x0000000100051ffb in MPIDI_CH3U_Buffer_copy ()
>>> #3  0x000000010006792f in MPIDI_Isend_self ()
>>> #4  0x0000000100063e72 in MPID_Isend ()
>>> #5  0x00000001000a34b5 in PMPI_Isend ()
>>> #6  0x00000001000ac8c5 in ADIOI_R_Exchange_data ()
>>> #7  0x00000001000aba71 in ADIOI_GEN_ReadStridedColl ()
>>> #8  0x00000001000c6545 in MPIOI_File_read_all ()
>>> #9  0x00000001000a779f in MPI_File_read_all ()
>>> #10 0x000000010001d0c9 in ncmpii_getput_vars (ncp=0x7fff5fbff600, varp=0x100282890, start=0x7fff5fbff600, count=0x7fff5fbff600, stride=0x7fff5fbff600, buf=0x7fff5fbff600, bufcount=3888000, datatype=1275069450, rw_flag=1, io_method=1) at getput_vars.c:741
>>> #11 0x0000000100018fc9 in ncmpi_get_var_float_all (ncid=2440416, varid=238391296, ip=0x7fff5fbff670) at getput_var.c:500
>>> #12 0x0000000100006e50 in do_fn (fn_params=0x7fff5fbff7b0) at average.c:103
>>> #13 0x0000000100000c00 in main (argc=12, argv=0x7fff5fbff9b8) at main.c:47
>>> 
>>> with this failing assert:
>>> Assertion failed in file ch3u_buffer.c at line 77: FALSE
>>> memcpy argument memory ranges overlap, dst_=0x1048bf000 src_=0x1049bd000 len_=15552000
>>> 
>>> I am testing with 4 processes on my laptop (mpirun -n 4 ./a.out --various-parameters). The communicators are setup as I expect, at least as far as being properly disjoint:
>>> I am world rank 0. IO Rank 0. core rank 0. I will be processing file 0151-01.nc (fh 1) and var VAR1.fcount 0 vcount 0, nc_type 5
>>> I am world rank 2. IO Rank 1. core rank 0. I will be processing file 0158-07.nc (fh 1) and var VAR1. fcount 0 vcount 0, nc_type 5
>>> I am world rank 1. IO Rank 0. core rank 1. I will be processing file 0151-01.nc (fh 1) and var VAR2. fcount 0 vcount 0, nc_type 5
>>> I am world rank 3. IO Rank 1. core rank 1. I will be processing file 0158-07.nc (fh 1) and var VAR2. fcount 0 vcount 0, nc_type 5
>>> 
>>> What am I doing wrong?
>>> 
>>> Thanks.
>>> 
>> 
>> -- 
>> Rob Latham
>> Mathematics and Computer Science Division
>> Argonne National Lab, IL USA
> 


