problems writing vars with pnetcdf

Jianwei Li jianwei at ece.northwestern.edu
Fri Dec 3 17:04:25 CST 2004


Hello, Katie,

Thank you for pointing this out. 
I think you have found a hidden bug in our PnetCDF implementation in its
handling of zero-size I/O.

For sub-array access, although the underlying MPI/MPI-IO can handle a zero-size
request gracefully (and so can the intermediate malloc), the PnetCDF code checks
(start, edge, dimsize) and treats [start+edge > dimsize] as invalid even when
[edge==0], returning an error like:
	"Index exceeds dimension bound".

Actually, this is also a "bug" in Unidata netCDF-3.5.0, which returns the same
error message:
	"Index exceeds dimension bound"

Luckily, nobody in the serial netCDF world is interested in reading/writing zero
bytes. (We should still point this out to the Unidata netCDF developers, though
perhaps they are already watching this list.)

I agree that this case is unavoidable in a parallel I/O environment, and I will
fix this bug in the next release. For now, here is a quick fix for anyone who
has run into this problem:

	1. go into the pnetcdf src code: parallel-netcdf/src/lib/mpinetcdf.c
	2. identify all ncmpi_{get/put}_vara[_all], ncmpi_{get/put}_vars[_all]
	   subroutines. (well, if you only need "vars", you can ignore the 
	   "vara" part for now)
	3. in each of these subroutines, locate the code section between (but
	   excluding) the set_var{a/s}_fileview and MPI_File_write[_all]
	   function calls:
	   
	   	set_var{a/s}_fileview
	   	
	   	section {
	   		4 lines of code calculating nelems/nbytes
	   		other code
	   	}
	   	
	   	MPI_File_write[_all]
	   
	4. move the 4 lines of nelems/nbytes calculation code from after the
	   set_var{a/s}_fileview function call to before it, and move the
	   set_var{a/s}_fileview function call into that section.
	5. after nbytes is calculated, bypass the above section if nbytes==0,
	   using the following pseudo-code (see also the schematic C sketch
	   after this list):
	   
	   	calculating nelems/nbytes
	   	
	   	if (nbytes != 0) {
	   		set_var{a/s}_fileview
	   		section [without calculating nelems/nbytes]
	   	}
	   	
	   	MPI_File_write[_all]
	   	
	6. Rebuild the PnetCDF library.
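
Put together, steps 4 and 5 give the affected routines roughly the shape below.
This is only a schematic C sketch: the variable names, the element-size field,
and the exact MPI_File_write_all arguments are illustrative, not a copy of
mpinetcdf.c.

	#include <mpi.h>

	/* Schematic of the restructured write path (illustrative names only). */
	static int write_vars_schematic(MPI_File fh, void *xbuf,
	                                const MPI_Offset *count, int ndims,
	                                int xsz)      /* external element size */
	{
	    MPI_Offset nelems = 1;
	    int i, nbytes;
	    MPI_Status status;

	    /* step 4: compute nelems/nbytes before the fileview is touched */
	    for (i = 0; i < ndims; i++)
	        nelems *= count[i];
	    nbytes = (int)(nelems * xsz);

	    /* step 5: skip fileview setup (and buffer packing) when nbytes == 0 */
	    if (nbytes != 0) {
	        /* set_var{a/s}_fileview(...);  -- unchanged, just moved inside */
	        /* ... pack/convert the user buffer into xbuf ... */
	    }

	    /* the collective write is still made by every process, with a
	     * zero-byte request on processes that have nothing to write */
	    return MPI_File_write_all(fh, xbuf, nbytes, MPI_BYTE, &status);
	}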

Note: this only addresses the zero-size case, and it may cause "nc_test" in our
test suite to miss some originally-expected errors (and hence report failures),
because (start, edge=0, dimsize) used to be rejected when [start > dimsize] but
is now always accepted, since we bypass the boundary check entirely. It is hard
to say whether that case should count as valid at all, but it is at least safe
to treat it as VALID.
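
One possible shape for a proper fix in a future release would be to relax the
boundary check itself instead of bypassing it. The following is only a rough
sketch with a hypothetical helper name, not the actual PnetCDF internals:

	#include <mpi.h>
	#include <pnetcdf.h>

	/* Hypothetical relaxed check: a zero-length edge is always accepted;
	 * a non-zero edge must still fit within the dimension. */
	static int check_edge(MPI_Offset start, MPI_Offset edge, MPI_Offset dimsize)
	{
	    if (edge == 0)
	        return NC_NOERR;          /* nothing to read/write: treat as valid */
	    if (start < 0 || start + edge > dimsize)
	        return NC_EINVALCOORDS;   /* "Index exceeds dimension bound" */
	    return NC_NOERR;
	}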

I hope this works for you and for everybody else.

Thanks again for the valuable feedback, and further comments are welcome!

--
Jianwei

=========================================
 Jianwei Li				~
					~
 Northwestern University		~
 2145 Sheridan Rd, ECE Dept.		~
 Evanston, IL 60208			~
					~
 (847)467-2299				~
=========================================

  
	   

>Hi All,

>
>I'm not sure if this list gets much traffic but here goes.  I'm having a 
>problem writing out data in parallel for a particular case when there are 
>zero elements to write on a given processor.
>
>Let me explain a little better.  For a very simple case, a 1 dimensional 
>array that we want to write in parallel - we define a dimension say, 
>'dim_num_particles' and define a variable, say 'particles' with a unique 
>id.
>
>Each processor then writes out its portion of the particles into the
>particles variable with the correct starting position and count.  As long
>as each processor has at least one particle to write we have absolutely no
>problems, but quite often in our code there are processors that have zero
>particles for a given checkpoint file and thus have nothing to write to
>file.  This is where we hang.
>
>
>I've tried a couple different hacks to get around this --
>
>* First was to try to write a zero-length array, with the count = zero
>  and the offset or starting point = 'dim_num_particles', but that
>  returned an error message from the put_vars calls.
>  All other offsets I chose returned errors as well, which is
>  understandable.
>
>* The second thing I tried was to not write the data at all if there
>  were zero particles on a proc.  But that hung.  After talking to some
>  people here, they thought this also made sense, because all procs would
>  now not be doing the same task, a problem we've also seen hang HDF5.
>
>-- I can do a really ugly hack by increasing dim_num_particles to have
>extra room.  That way, if a proc had zero particles it could write out a
>dummy value.  The problem is that this messes up our offsets when we need
>to read in the checkpoint file.
>
>
>Has anyone else seen this problem or know a fix to it?
>
>Thanks,
>
>Katie
>
>
>____________________________
>Katie Antypas
>ASC Flash Center
>University of Chicago
>kantypas at flash.uchicago.edu
>



