problems writing vars with pnetcdf
Jianwei Li
jianwei at ece.northwestern.edu
Fri Dec 3 17:19:25 CST 2004
Sorry, a few minor corrections below:
>Hello, Katie,
>
>Thank you for pointing this out.
>I think you found a hidden bug in our PnetCDF implementation in dealing with
>zero size I/O.
>
>For sub-array access, although underlying MPI/MPI-IO can handle "size=0"
^^^^^^^^^
The same applies to strided sub-array (vars) access.
>gracefully (so can intermediate malloc), the PnetCDF code would check the
>(start, edge, dimsize), and it thought that [start+edge > dimsize] was not
^^^^^^^^^^^^^^^^^^^^
This should always be invalid; the real issue was that [start >= dimsize] was
handled inappropriately in the coordinate check when [edge==0].
>valid even if [edge==0] and returned error like:
> "Index exceeds dimension bound".
>
>Actually, this is also a "bug" in Unidata netCDF-3.5.0, and it returns the same
>error message:
> "Index exceeds dimension bound"
>
>Luckily, nobody in the serial netCDF world is interested in reading/writing
>zero bytes. (Though we should point this out to the Unidata netCDF developers;
>perhaps they are already watching this list.)
>
>I agree that this case is inevitable in a parallel I/O environment, and I will
>fix this bug in the next release. For now, here is a quick fix for whoever has
>run into this problem:
>
> 1. go into the pnetcdf src code: parallel-netcdf/src/lib/mpinetcdf.c
> 2. identify all ncmpi_{get/put}_vara[_all], ncmpi_{get/put}_vars[_all]
> subroutines. (well, if you only need "vars", you can ignore the
> "vara" part for now)
> 3. in each of the subroutines, locate the code section between (but
>    excluding) the set_var{a/s}_fileview and MPI_File_write[_all] calls:
>
>        set_var{a/s}_fileview
>
>        section{
>            4 lines of code calculating nelems/nbytes
>            other code
>        }
>
>        MPI_File_write[_all]
>
> 4. move the 4 lines of nelems/nbytes calculation code from after the
>    set_var{a/s}_fileview function call to before it, and move the
>    set_var{a/s}_fileview function call into that section.
> 5. After nbytes is calculated, bypass the above section if nbytes==0,
>    as in the following pseudo-code (a stand-alone C sketch also follows
>    step 6):
>
>        calculating nelems/nbytes
>
>        if (nbytes != 0) {
>            set_var{a/s}_fileview
>            section [without calculating nelems/nbytes]
>        }
>
>        MPI_File_write[_all]
>
> 6. Rebuild the pnetCDF library.
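
For anyone who wants to see the resulting control flow in isolation, here is a
small, self-contained MPI-IO program that follows the same pattern the steps
above produce: compute nbytes first, build the fileview only when nbytes != 0,
and have every process make the collective write call. It is independent of
the PnetCDF internals; the file name, data type, and decomposition are made up
for illustration.

    #include <stdlib.h>
    #include <mpi.h>

    #define DIMSIZE 16              /* global length of a 1-D variable */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* contrived decomposition: the last rank owns zero elements */
        int nelems = (rank == nprocs - 1) ? 0 : DIMSIZE / nprocs;
        int start  = rank * (DIMSIZE / nprocs);
        int nbytes = nelems * (int) sizeof(int);   /* computed up front */

        int *buf = (int *) malloc((nelems > 0 ? nelems : 1) * sizeof(int));
        for (int i = 0; i < nelems; i++) buf[i] = rank;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "zero_size_demo.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        if (nbytes != 0) {
            /* only ranks that have data build a subarray fileview;
             * a subarray datatype requires positive subsizes        */
            int gsize = DIMSIZE;
            MPI_Datatype ftype;
            MPI_Type_create_subarray(1, &gsize, &nelems, &start,
                                     MPI_ORDER_C, MPI_INT, &ftype);
            MPI_Type_commit(&ftype);
            MPI_File_set_view(fh, 0, MPI_INT, ftype, "native",
                              MPI_INFO_NULL);
            MPI_Type_free(&ftype);
        }

        /* every rank makes the collective call, so nobody hangs; ranks
         * with nbytes == 0 simply write zero elements                 */
        MPI_File_write_all(fh, buf, nelems, MPI_INT, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }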
>
>Note: it will only solve this problem and may make "nc_test" in our test
>suite miss some originally-expected errors (hence report failures), because
>(start, edge=0, dimsize) was invalid if [start>dimsize] but now it is always
^^^^^^^^^^^^^
I meant [start>=dimsize]
>valid as we'll bypass the boundary check. Actually it's hard to tell if it's
>valid or not after all, but it is at least safe to treat it just as VALID.
>
>Hope this works for you and everybody.
>
>Thanks again for the valuable feedback, and further comments are welcome!
>
>
> Jianwei
>
>
>
>>Hi All,
>>
>>
>>I'm not sure if this list gets much traffic but here goes. I'm having a
>>problem writing out data in parallel for a particular case when there are
>>zero elements to write on a given processor.
>>
>>Let me explain a little better. For a very simple case, a 1-dimensional
>>array that we want to write in parallel, we define a dimension, say
>>'dim_num_particles', and define a variable, say 'particles', with a
>>unique id.
>>
>>Each processor then writes out its portion of the particles into the
>>particles variable with the correct starting position and count. As long
>>as each processor has at least one particle to write we have absolutely
>>no problems, but quite often in our code there are processors that have
>>zero particles for a given checkpoint file and thus have nothing to write
>>to file. This is where we hang.
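
The access pattern described above, including the zero-particle case, looks
roughly like the sketch below. This is illustrative code only, not Katie's
actual code; the file name, counts, and variable sizes are invented, and error
checking is omitted. Run it on at least two processes so that one rank really
does have zero particles (count == 0 with start == dim_num_particles), which
is exactly the failing case.

    #include <mpi.h>
    #include <pnetcdf.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs, ncid, dimid, varid;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* every rank owns 4 particles except the last, which owns none */
        MPI_Offset count = (rank == nprocs - 1) ? 0 : 4;
        MPI_Offset start = 4 * rank;
        MPI_Offset dim_num_particles = 4 * (MPI_Offset)(nprocs - 1);
        float data[4] = {0.0f, 1.0f, 2.0f, 3.0f};

        ncmpi_create(MPI_COMM_WORLD, "particles.nc", NC_CLOBBER,
                     MPI_INFO_NULL, &ncid);
        ncmpi_def_dim(ncid, "dim_num_particles", dim_num_particles, &dimid);
        ncmpi_def_var(ncid, "particles", NC_FLOAT, 1, &dimid, &varid);
        ncmpi_enddef(ncid);

        /* collective write of each rank's slice; the rank with count == 0
         * (and start == dim_num_particles) is the one that triggers the
         * "Index exceeds dimension bound" error / hang discussed above   */
        ncmpi_put_vara_float_all(ncid, varid, &start, &count, data);

        ncmpi_close(ncid);
        MPI_Finalize();
        return 0;
    }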
>>
>>
>>I've tried a couple different hacks to get around this --
>>
>>* First was to try to write a zero-length array, with the count= zero
>> and the offset or starting point = 'dim_num_particles' but that
>> returned an error message from the put_vars calls.
>>  All other offsets I chose returned errors as well, which is
>> understandable.
>>
>>* The second thing I tried was to not write the data at all if there
>>  were zero particles on a proc. But that hung. After talking to some
>>  people here, they thought this also made sense because all procs would
>>  no longer be doing the same task, a problem we've also seen hang HDF5.
>>
>>-- I can do a really ugly hack by increasing the dim_num_particles to have
>>extra room. That way if a proc had zero particles it could write out a
>>dummy value. The problem is that this messes up our offsets when we need
>>to read in the checkpoint file.
>>
>>
>>Has anyone else seen this problem or know a fix to it?
>>
>>Thanks,
>>
>>Katie
>>
>>
>>____________________________
>>Katie Antypas
>>ASC Flash Center
>>University of Chicago
>>kantypas at flash.uchicago.edu
>>
Jianwei
=========================================
Jianwei Li

Northwestern University
2145 Sheridan Rd, ECE Dept.
Evanston, IL 60208

(847)467-2299
=========================================