[gdsjaar at sandia.gov: [netcdfgroup] strlen calls in NC_finddim and NC_findvar]

Fri Dec 4 09:33:53 CST 2009

I modified my fix somewhat from what is described below.  The 'nchars'
field of NC_string is not what is needed: it is modified for alignment
and can be incorrect after a rename operation.  Instead, I added a
'lenstr' field to both NC_dim and NC_var that holds the length of the
name.  In one case this reduced the number of strlen calls from
476,952,472 to 389,810 (from 43.6% of execution time down to 0.25%).
There are still several calls to strncmp.
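
A minimal sketch of the cached-length idea, assuming simplified
stand-in types.  The names dim_entry, dim_set_name, and find_dim_cached
are hypothetical and are not the actual netCDF structs or functions;
only the 'lenstr' field name comes from the change described above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for NC_dim (not the real layout). */
typedef struct {
    const char *name;   /* NUL-terminated dimension name */
    size_t lenstr;      /* cached strlen(name) */
} dim_entry;

/* Set or rename: the cached length is refreshed every time the name
 * changes, so it cannot go stale after a rename. */
static void dim_set_name(dim_entry *d, const char *name)
{
    d->name = name;
    d->lenstr = strlen(name);
}

/* Linear search by name; returns the index, or -1 if not found.
 * strlen is called once per lookup instead of once per entry. */
static int find_dim_cached(const dim_entry *dims, size_t ndims,
                           const char *name)
{
    size_t slen = strlen(name);
    for (size_t i = 0; i < ndims; i++) {
        /* Cheap length comparison first; strncmp only on equal lengths. */
        if (dims[i].lenstr == slen &&
            strncmp(dims[i].name, name, slen) == 0)
            return (int)i;
    }
    return -1;
}
```

With this layout the strncmp calls remain, but they run only for
entries whose cached length matches the name being searched for.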

I think a better fix than caching the name string length may be to
compute a hash of the name and store that instead.  The finddim and
findvar functions can then hash the name they are searching for.  The
inner loop would compare only the hash values and, when they match, do
the further strncmp check to catch hash collisions.
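
The hashing idea can be sketched as follows.  This is only an
illustration under stated assumptions: var_entry, name_hash,
var_set_name, and find_var_hashed are hypothetical names, and 32-bit
FNV-1a is my arbitrary choice of hash function, not something netCDF
uses.  The sketch uses strcmp for the collision check because the
stand-in names are NUL-terminated.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for NC_var (not the real layout). */
typedef struct {
    const char *name;   /* NUL-terminated variable name */
    uint32_t hash;      /* cached hash of name */
} var_entry;

/* 32-bit FNV-1a over a NUL-terminated string (assumed hash choice). */
static uint32_t name_hash(const char *s)
{
    uint32_t h = 2166136261u;          /* FNV offset basis */
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;                /* FNV prime */
    }
    return h;
}

/* Set or rename: recompute the cached hash whenever the name changes. */
static void var_set_name(var_entry *v, const char *name)
{
    v->name = name;
    v->hash = name_hash(name);
}

/* Linear search by name; returns the index, or -1 if not found. */
static int find_var_hashed(const var_entry *vars, size_t nvars,
                           const char *name)
{
    uint32_t h = name_hash(name);      /* hash the search key once */
    for (size_t i = 0; i < nvars; i++) {
        /* Integer compare in the inner loop; the string compare runs
         * only on a hash match, to rule out collisions. */
        if (vars[i].hash == h && strcmp(vars[i].name, name) == 0)
            return (int)i;
    }
    return -1;
}
```

The search is still linear in the number of entries; the gain is that
almost every iteration is a single integer comparison.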

Rob Latham wrote:
> Greg S. found something noteworthy on the serial netcdf list.  We do
> something similar (not surprising: I'm sure our NC_finddim and
> NC_findvar functions are 99% unchanged from serial netcdf).
>
> In NC_finddim we have a call to strlen as part of the condition of a
> for loop.  If there are a lot of dimensions as in Greg's case, then
> yeah, we too would call strlen a lot.
>
> http://trac.mcs.anl.gov/projects/parallel-netcdf/browser/trunk/src/lib/dim.c#L135
>
> Our ncmpii_NC_findvar calls strlen inside a loop for each variable in
> a dataset.
>
> http://trac.mcs.anl.gov/projects/parallel-netcdf/browser/trunk/src/lib/var.c#L317
>
> How common are datasets with thousands of dimensions and thousands of
> variables?
>
> In a followup message, Greg found at least one case where "size" was
> not the same as strlen(name) for one of these NC_dim types, so it
> looks like the easy optimization won't work out after all.
>
> The status quo isn't awful if you've got a small number of dimensions
> and variables.  If anybody else has a dataset like Greg's, though,
> reply to this email and we'll put optimizing this workload on the todo
> list.
>
> thanks
> ==rob
>
> ----- Forwarded message from Greg Sjaardema <gdsjaar at sandia.gov> -----
>
> Sender: netcdfgroup-bounces at unidata.ucar.edu
> From: Greg Sjaardema <gdsjaar at sandia.gov>
> Subject: [netcdfgroup] strlen calls in NC_finddim and NC_findvar
> Date: Thu, 3 Dec 2009 15:41:49 -0700
> Message-ID: <4B183EAD.20808 at sandia.gov>
> To: "netcdfgroup at unidata.ucar.edu" <netcdfgroup at unidata.ucar.edu>
>
> I have a monstrous file with several thousand dimensions and variables
> which is running slower than it should.  I investigated the runtime
> and found that strlen was the major time user in the NC_finddim and
> NC_findvar calls.  The obvious optimization was to cache the length of
> the name instead of calling strlen each time.  However, when I went to
> do this, I discovered that the length is already cached as the nchars
> field in the NC_string struct.
>
> I did some checks in the code and also added some assertions to the
> code and verified that, as far as I can tell, nchars is the correct
> length of the string.  Is there a reason that it isn't used and
> strlen() is called instead?  Switching the code to use nchars dropped
> my execution time from 20 units to 6 units.  I would like to make the
> switch, but wondered if there was some strange corner case where the
> nchars value is incorrect and will cause problems.
>
> Thanks,
> --Greg
>
> ----- End forwarded message -----