[gdsjaar at sandia.gov: [netcdfgroup] strlen calls in NC_finddim and NC_findvar]

Fri Dec 4 09:10:35 CST 2009

Greg S. found something noteworthy on the serial netcdf list.  We do
something similar (not surprising: I'm sure our NC_finddim and
NC_findvar functions are 99% unchanged from serial netcdf).

In NC_finddim we have a call to strlen as part of the condition of a
for loop.  If there are a lot of dimensions as in Greg's case, then
yeah, we too would call strlen a lot.

http://trac.mcs.anl.gov/projects/parallel-netcdf/browser/trunk/src/lib/dim.c#L135

Our ncmpii_NC_findvar calls strlen inside a loop, once for each
variable in a dataset.

http://trac.mcs.anl.gov/projects/parallel-netcdf/browser/trunk/src/lib/var.c#L317
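
For anyone who doesn't want to click through, here is a minimal sketch
of the pattern being described.  It is not the actual dim.c or var.c
code: demo_dim, demo_finddim, and counted_strlen are made-up names, and
the struct is a stand-in rather than the real NC_dim layout.  The
counter is only there to make the number of strlen calls visible.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a dimension entry; not the real NC_dim. */
typedef struct {
    const char *name;   /* NUL-terminated dimension name */
} demo_dim;

/* Counts strlen calls on stored names so the cost is visible. */
static size_t demo_strlen_calls = 0;

static size_t counted_strlen(const char *s)
{
    demo_strlen_calls++;
    return strlen(s);
}

/* Linear search in the style described above: the length of each stored
 * name is recomputed inside the loop condition.  Returns the index of
 * the first match, or -1 if no entry has that name. */
static int demo_finddim(const demo_dim *dims, size_t ndims, const char *name)
{
    assert(dims != NULL && name != NULL);
    size_t slen = strlen(name);   /* length of the key, computed once */
    size_t i;

    for (i = 0;
         i < ndims
         && (counted_strlen(dims[i].name) != slen
             || strncmp(dims[i].name, name, slen) != 0);
         i++)
        ;   /* all the work happens in the loop condition */

    return (i < ndims) ? (int)i : -1;
}
```

In this sketch a lookup that misses, or that matches the last entry,
calls strlen once for every stored name.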

How common are datasets with thousands of dimensions and thousands of
variables?

In a followup message, Greg found at least one case where "size" was
not the same as strlen(name) for one of these NC_dim types, so it
looks like the easy optimization won't work out after all.
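
If the cached length can't be trusted, one possible alternative (my own
sketch, not something proposed in the thread and not tested against the
real library) is to skip strlen on the stored name and check for its
terminator instead.  demo_var and demo_findvar are hypothetical names,
and the sketch assumes stored names are NUL-terminated, which the
existing strlen call already relies on.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical entry type, for illustration only. */
typedef struct {
    const char *name;   /* NUL-terminated variable name */
} demo_var;

/* Returns the index of the entry whose name equals `name`, or -1. */
static int demo_findvar(const demo_var *vars, size_t nvars, const char *name)
{
    assert(vars != NULL && name != NULL);
    size_t slen = strlen(name);   /* only the key is measured */

    for (size_t i = 0; i < nvars; i++) {
        const char *stored = vars[i].name;
        /* If the first slen bytes match and the stored name ends right
         * there, the two names have the same length and content.
         * strncmp stops at the first differing byte, so a mismatch
         * never walks the whole stored name. */
        if (strncmp(stored, name, slen) == 0 && stored[slen] == '\0')
            return (int)i;
    }
    return -1;
}
```

The terminator check is what keeps "temp" from matching a stored
"temperature".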

The status quo isn't awful if you've got a small number of dimensions
and variables.  If anybody else has a dataset like Greg's, though,
reply to this email and we'll put optimizing this workload on the todo
list.

thanks
==rob

----- Forwarded message from Greg Sjaardema <gdsjaar at sandia.gov> -----

Sender: netcdfgroup-bounces at unidata.ucar.edu
From: Greg Sjaardema <gdsjaar at sandia.gov>
Subject: [netcdfgroup] strlen calls in NC_finddim and NC_findvar
Date: Thu, 3 Dec 2009 15:41:49 -0700
Message-ID: <4B183EAD.20808 at sandia.gov>
User-Agent: Thunderbird 2.0.0.23 (X11/20090812)
To: "netcdfgroup at unidata.ucar.edu" <netcdfgroup at unidata.ucar.edu>

I have a monstrous file with several thousand dimensions and variables
which is running slower than it should.  I investigated the runtime
and found that strlen was the major time user in the NC_finddim and
NC_findvar calls.  The obvious optimization was to cache the length of
the name instead of calling strlen each time.  However, when I went to
do this, I discovered that the length is already cached as the nchars
field in the NC_string struct.

I did some checks in the code and also added some assertions to the
code and verified that, as far as I can tell, nchars is the correct
length of the string.  Is there a reason that it isn't used and
strlen() is called instead?  Switching the code to use nchars dropped
my execution time from 20 units to 6 units.  I would like to make the
switch, but wondered if there was some strange corner case where the
nchars value is incorrect and will cause problems.

Thanks,
--Greg

----- End forwarded message -----

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA