problems writing vars with pnetcdf

Jianwei Li jianwei at ece.northwestern.edu
Fri Dec 3 17:19:25 CST 2004


	Sorry, a few minor corrections below:


>Hello, Katie,
>
>Thank you for pointing this out. 
>I think you found a hidden bug in our PnetCDF implementation in dealing with
>zero size I/O.
>
>For sub-array access, although the underlying MPI/MPI-IO can handle "size=0" 
     ^^^^^^^^^
	The same applies to strided (vars) sub-array access.

>gracefully (so can the intermediate malloc), the PnetCDF code would check the 
>(start, edge, dimsize), and it thought that [start+edge > dimsize] was not
					      ^^^^^^^^^^^^^^^^^^^^
						This should always be invalid,
						but [start >= dimsize] was
						handled inappropriately in
						the coordinate check when
						[edge == 0].

>valid even if [edge==0], and returned an error like:
>	"Index exceeds dimension bound".
>
>Actually, this is also a "bug" in Unidata netCDF-3.5.0, and it returns the same
>error message:
>	"Index exceeds dimension bound"
>
>Luckily, nobody in the serial netCDF world is interested in trying to read/write
>zero bytes. (Though we should point this out to the Unidata netCDF developers,
>or perhaps they are already watching this thread.)
>
>I agree that this case is inevitable in a parallel I/O environment, and I will
>fix this bug in the next release, but for now I have the following quick fix for
>anyone who has run into this problem:
>
>	1. go into the PnetCDF source code: parallel-netcdf/src/lib/mpinetcdf.c
>	2. identify all the ncmpi_{get/put}_vara[_all] and ncmpi_{get/put}_vars[_all]
>	   subroutines. (If you only need "vars", you can ignore the
>	   "vara" part for now.)
>	3. in each of these subroutines, locate the code section between (but
>	   excluding) the set_var{a/s}_fileview and MPI_File_write[_all] calls:
>	   
>	   	set_var{a/s}_fileview
>	   	
>	   	section{	   		
>	   		4 lines of code calculating nelems/nbytes
>	   		other code
>	   	}
>	   	
>	   	MPI_File_write[_all]
>	   
>	4. move the 4 lines of nelems/nbytes calculation code from after the
>	   set_var{a/s}_fileview function call to before it, and move the
>	   set_var{a/s}_fileview call into that section.
>	5. after nbytes is calculated, bypass the above section if nbytes == 0,
>	   as in the following pseudo-code (see also the C sketch after this
>	   list):
>	   
>	   	calculating nelems/nbytes
>	   	
>	   	if (nbytes != 0) {
>	   		set_var{a/s}_fileview
>	   		section [without calculating nelems/nbytes]
>	   	}
>	   	
>	   	MPI_File_write[_all]
>	   	
>	6. rebuild the PnetCDF library.
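
	For illustration, here is a minimal C sketch of the pattern described in
	steps 3-5 (the function and variable names write_my_piece, buf, count and
	elem_size are hypothetical stand-ins; the real code lives inside the
	ncmpi_{get/put}_var{a/s}[_all] routines in mpinetcdf.c and differs in
	detail):

	    #include <mpi.h>

	    /* Compute nelems/nbytes before touching the fileview, skip the
	     * fileview/packing section when nbytes == 0, but always make the
	     * collective write call. */
	    int write_my_piece(MPI_File fh, const void *buf,
	                       int ndims, const int *count, int elem_size)
	    {
	        MPI_Status status;
	        int i, nelems = 1, nbytes;

	        for (i = 0; i < ndims; i++)
	            nelems *= count[i];
	        nbytes = nelems * elem_size;

	        if (nbytes != 0) {
	            /* set_var{a/s}_fileview() and the buffer packing would go
	             * here, only when there is really something to write */
	        }

	        /* every process still enters the collective write, possibly
	         * with a zero-length request, so the call does not hang */
	        return MPI_File_write_all(fh, (void *)buf, nbytes,
	                                  MPI_BYTE, &status);
	    }
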
>
>Note: this will only solve this problem, and it may make "nc_test" in our test
>suite miss some originally-expected errors (and hence report failures), because
>(start, edge=0, dimsize) was invalid if [start>dimsize] but now it is always 
					  ^^^^^^^^^^^^^
					  I meant [start>=dimsize]
					  

>valid since we bypass the boundary check. Actually, it is hard to tell whether it
>is valid or not after all, but it is at least safe to treat it simply as VALID.
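
	As a longer-term alternative to bypassing the boundary check entirely,
	the coordinate check itself could treat any zero-size access as valid,
	along the lines of the sketch below (the names check_bounds, start,
	edge and dimsize are illustrative only and do not match the actual
	PnetCDF internals):

	    /* return 0 if the (start, edge) request fits within the dimension
	     * sizes, nonzero otherwise */
	    static int check_bounds(int ndims, const long *start,
	                            const long *edge, const long *dimsize)
	    {
	        int i;
	        for (i = 0; i < ndims; i++) {
	            if (edge[i] == 0)
	                continue;   /* zero-size access: treat as valid */
	            if (start[i] >= dimsize[i] ||
	                start[i] + edge[i] > dimsize[i])
	                return 1;   /* "Index exceeds dimension bound" */
	        }
	        return 0;
	    }
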
>
>Hope it will work for you and everybody.
>
>Thanks again for the valuable feedback, and further comments are welcome!
>
>
> Jianwei
>  
>	   
>
>>Hi All,
>>
>>
>>I'm not sure if this list gets much traffic but here goes.  I'm having a 
>>problem writing out data in parallel for a particular case when there are 
>>zero elements to write on a given processor.
>>
>>Let me explain a little better.  Take a very simple case: a 1-dimensional
>>array that we want to write in parallel.  We define a dimension, say
>>'dim_num_particles', and define a variable, say 'particles', with a unique
>>id.
>>
>>Each processor then writes out its portion of the particles into the
>>'particles' variable with the correct starting position and count.  As long
>>as each processor has at least one particle to write we have absolutely no
>>problems, but quite often in our code there are processors that have zero
>>particles for a given checkpoint file and thus have nothing to write to
>>file.  This is where we hang.
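
	(For illustration, the write pattern described above boils down to one
	collective call per process, roughly as in the sketch below, where
	my_start/my_count are this rank's offset and particle count and may
	legitimately be 0 on some ranks. The helper name checkpoint_particles
	is hypothetical; ncmpi_put_vara_float_all is the PnetCDF call for a
	contiguous 1-D float slice, and the same point applies to the
	ncmpi_put_vars_* variants. Every rank must make the collective call,
	even with a zero count, or the other ranks hang.)

	    #include <mpi.h>
	    #include <pnetcdf.h>

	    /* each rank writes its own contiguous slice of 'particles' */
	    int checkpoint_particles(int ncid, int varid,
	                             MPI_Offset my_start, MPI_Offset my_count,
	                             float *my_particles)
	    {
	        /* collective: must be called on every rank, even when
	         * my_count == 0 */
	        return ncmpi_put_vara_float_all(ncid, varid,
	                                        &my_start, &my_count,
	                                        my_particles);
	    }
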
>>
>>
>>I've tried a couple different hacks to get around this --
>>
>>* The first was to try to write a zero-length array, with the count = zero
>>  and the offset or starting point = 'dim_num_particles', but that
>>  returned an error message from the put_vars calls.
>>  All other offsets I chose returned errors as well, which is
>>  understandable.
>>
>>* The second thing I tried was to not write the data at all if there
>>  were zero particles on a proc.  But that hung.  After talking to some
>>  people here, they thought this also made sense, because all procs would
>>  then not be doing the same task, a problem we've also seen hang HDF5.
>>
>>-- I can do a really ugly hack by increasing dim_num_particles to leave
>>extra room.  That way, if a proc had zero particles it could write out a
>>dummy value.  The problem is that this messes up our offsets when we need
>>to read the checkpoint file back in.
>>
>>
>>Has anyone else seen this problem, or does anyone know a fix for it?
>>
>>Thanks,
>>
>>Katie
>>
>>
>>____________________________
>>Katie Antypas
>>ASC Flash Center
>>University of Chicago
>>kantypas at flash.uchicago.edu
>>


Jianwei

=========================================
 Jianwei Li				~
					~
 Northwestern University		~
 2145 Sheridan Rd, ECE Dept.		~
 Evanston, IL 60208			~
					~
 (847)467-2299				~
=========================================



