Houston, we have a problem

Jianwei Li jianwei at cheetah.cpdc.ece.nwu.edu
Mon Aug 4 13:07:21 CDT 2003



	John,

	Thanks for actively testing this new release!

> Jianwei,
>
> I finally loaded up 0.8.9 and ran with it.  My run of the Fortran test
> code appears to put out exactly the same results that you got, except
> that my read/write times are all zero?  This also leads to a couple of

	Do you mean that both the header I/O (define) time and the data
	I/O time are zero? Is that possible when your data is actually
	being written out?
	I'll try another run later to see what's going on...
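
	For reference, the INF values follow directly from the zero
	times. The figures in my earlier output (quoted further down) are
	consistent with bandwidth = file size / data I/O time and
	"eff." = file size / (header time + data time). That reading is
	inferred from the numbers, not taken from the test source. A
	minimal C sketch under that assumption, where mb_per_sec() and
	report() are hypothetical helpers that refuse to divide by a
	zero time:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: bandwidth in MB/s, or -1.0 when the measured
 * time is zero (or negative) and the ratio would be meaningless. */
double mb_per_sec(double size_mb, double seconds)
{
    assert(size_mb >= 0.0);
    if (seconds <= 0.0)
        return -1.0;            /* caller prints "n/a" instead of INF */
    return size_mb / seconds;
}

/* Print one bandwidth line, falling back to "n/a" when the data I/O
 * time is unusable.  The "eff." formula is an assumption (see above). */
void report(const char *label, double size_mb,
            double header_s, double data_s)
{
    double raw = mb_per_sec(size_mb, data_s);
    double eff = mb_per_sec(size_mb, header_s + data_s);

    if (raw < 0.0)
        printf("%s: n/a (data I/O time is zero)\n", label);
    else
        printf("%s: %.3f MB/s (eff., %.3f MB/s)\n", label, raw, eff);
}
```

	With the numbers from the 16-PE run, report("Write", 134.218,
	0.8125, 1.0) gives 134.218 MB/s and an effective 74.051 MB/s,
	while a data time of 0.0 prints "n/a" rather than INF.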

> INF values being output (/0).  My netCDF output appears to be the same
> as yours, but it does not match the C test code output?  The C output
> starts 0.001, 0.002, 0.003, ...  The Fortran output starts 65.794,
> 65.795, 65.796, ...  Could be a problem with my test code?  I was kind

	This is definitely due to a difference between your C test and
	your Fortran test :)
	I checked it: in the Get_Field triple loop, your Fortran indices
	start from 1 while your C indices start from 0, and since they
	are multiplied by 256 the values come out very different :)
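
	To make the arithmetic concrete: the two first values are
	consistent with a field of the form
	value = 0.001 * (1 + i + 256*j + 256*256*k), where i, j, k are
	the loop indices. That formula is a reconstruction from the
	numbers, not a copy of Get_Field. With all three indices starting
	at 0 the first point is 0.001; starting at 1 adds
	1 + 256 + 65536 = 65793 counts, which gives 65.794. A small C
	sketch of that assumption (field_count() and print_first_value()
	are made-up names):

```c
#include <assert.h>
#include <stdio.h>

#define N 256   /* points per dimension, as in the 256 x 256 x 256 test */

/* Assumed field definition, in units of 0.001:
 *   count = 1 + i + N*j + N*N*k
 * Inferred from the reported output, not from the real Get_Field code. */
long field_count(long i, long j, long k)
{
    assert(i >= 0 && j >= 0 && k >= 0);
    return 1 + i + N * j + (long)N * N * k;
}

/* Print the first field value for loops that start at `base` (0 or 1). */
void print_first_value(const char *label, long base)
{
    long count = field_count(base, base, base);
    printf("%s: first value = %.3f\n", label, count * 0.001);
}
```

	print_first_value("C", 0) prints 0.001 and
	print_first_value("Fortran", 1) prints 65.794, matching the two
	outputs you describe.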

> of hoping that somehow the Fortran results were just reversed, but this
> does not appear to be the case.
>

	It is not reversed because you have taken care of the dimension
	ordering problem explicitly in your Fortran code (in the
	Get_Field subroutine).

	Don't worry, I'll send out another email talking about this.
	I think it's not a big problem.
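
	As a quick illustration of the ordering issue (a generic sketch,
	not your Get_Field code): a Fortran array declared
	A(nlon, nlat, nlev) is stored with its first index varying
	fastest, while a C/netCDF variable declared
	tt(level, latitude, longitude) is stored with its last index
	varying fastest. Listing the dimensions in reverse order
	therefore describes the same memory layout, so nothing has to be
	transposed. The helper names below are invented for the example:

```c
#include <assert.h>
#include <stddef.h>

/* Offset of element (k, j, i) in a C (row-major) array tt[nk][nj][ni]:
 * the last index, i, varies fastest. */
size_t c_offset(size_t k, size_t j, size_t i, size_t nj, size_t ni)
{
    return (k * nj + j) * ni + i;
}

/* Offset of element (i, j, k) in a Fortran (column-major) array
 * A(ni, nj, nk), written with 0-based indices for easy comparison:
 * the first index, i, varies fastest. */
size_t fortran_offset(size_t i, size_t j, size_t k, size_t ni, size_t nj)
{
    return i + ni * (j + nj * k);
}

/* Return 1 if the two layouts agree for every element of an
 * nk x nj x ni block, 0 otherwise. */
int layouts_match(size_t nk, size_t nj, size_t ni)
{
    assert(nk > 0 && nj > 0 && ni > 0);
    for (size_t k = 0; k < nk; k++)
        for (size_t j = 0; j < nj; j++)
            for (size_t i = 0; i < ni; i++)
                if (c_offset(k, j, i, nj, ni) !=
                    fortran_offset(i, j, k, ni, nj))
                    return 0;
    return 1;
}
```

	layouts_match(16, 256, 256) returns 1 for the per-process block
	size shown in the 16-PE output, which is why the data is not
	reversed as long as the dimension lists are.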

	Jianwei

> Regards,
> John
>
> Jianwei Li wrote:
> > OK. I got the full output from my test after I made those modifications
> > to the fortran binding code.
> > The result seems good?
> >
> > ###############################################################
> > standard output:
> > mype  pe_coords    totsiz_3d         locsiz_3d       kstart,jstart,istart
> >   0    0  0  0   256  256  256      16  256  256        0      0      0
> >   1    1  0  0   256  256  256      16  256  256       16      0      0
> >   3    3  0  0   256  256  256      16  256  256       48      0      0
> >   4    4  0  0   256  256  256      16  256  256       64      0      0
> >   7    7  0  0   256  256  256      16  256  256      112      0      0
> >   2    2  0  0   256  256  256      16  256  256       32      0      0
> >   5    5  0  0   256  256  256      16  256  256       80      0      0
> >   6    6  0  0   256  256  256      16  256  256       96      0      0
> >   8    8  0  0   256  256  256      16  256  256      128      0      0
> >   9    9  0  0   256  256  256      16  256  256      144      0      0
> >  11   11  0  0   256  256  256      16  256  256      176      0      0
> >  14   14  0  0   256  256  256      16  256  256      224      0      0
> >  10   10  0  0   256  256  256      16  256  256      160      0      0
> >  12   12  0  0   256  256  256      16  256  256      192      0      0
> >  13   13  0  0   256  256  256      16  256  256      208      0      0
> >  15   15  0  0   256  256  256      16  256  256      240      0      0
> > write 1: 1.312E+00 1.562E+00
> > write 2: 1.000E+00 1.188E+00
> > write 3: 8.750E-01 1.438E+00
> > write 4: 8.125E-01 1.062E+00
> > write 5: 8.125E-01 1.000E+00
> >  read 1: 1.250E-01 3.750E-01
> > diff, delmax, delmin = 0.000E+00 0.000E+00 0.000E+00
> >  read 2: 1.250E-01 5.000E-01
> >  read 3: 1.250E-01 3.750E-01
> >  read 4: 2.500E-01 3.750E-01
> >  read 5: 1.250E-01 5.000E-01
> > File size:  1.342E+02 MB
> >     Write:   134.218 MB/s  (eff.,    74.051 MB/s)
> >     Read :   357.914 MB/s  (eff.,   268.435 MB/s)
> > Total number PEs:   16
> >   8.125E-01  1.000E+00   74.051  1.250E-01  3.750E-01  268.435
> >
> > ####################################################################
> > ncdump pnf_test.nc | more
> > netcdf pnf_test {
> > dimensions:
> >         level = 256 ;
> >         latitude = 256 ;
> >         longitude = 256 ;
> > variables:
> >         float tt(level, latitude, longitude) ;
> > data:
> >
> >  tt =
> >   65.794, 65.795, 65.796, 65.797, 65.798, 65.799, 65.8, 65.801, 65.802,
> >     65.803, 65.804, 65.805, 65.806, 65.807, 65.808, 65.809, 65.81, 65.811,
> >     65.812, 65.813, 65.814, 65.815, 65.816, 65.817, 65.818, 65.819, 65.82,
> >     65.821, 65.822, 65.823, 65.824, 65.825, 65.826, 65.827, 65.828, 65.829,
> >     65.83, 65.831, 65.832, 65.833, 65.834, 65.835, 65.836, 65.837, 65.838,
> >     65.839, 65.84, 65.841, 65.842, 65.843, 65.844, 65.845, 65.846, 65.847,
> >     65.848, 65.849, 65.85, 65.851, 65.852, 65.853, 65.854, 65.855, 65.856,
> >     65.857, 65.858, 65.859, 65.86, 65.861, 65.862, 65.863, 65.864, 65.865,
> >     65.866, 65.867, 65.868, 65.869, 65.87, 65.871, 65.872, 65.873, 65.874,
> >     65.875, 65.876, 65.877, 65.878, 65.879, 65.88, 65.881, 65.882, 65.883,
> >     65.884, 65.885, 65.886, 65.887, 65.888, 65.889, 65.89, 65.891, 65.892,
> >     65.893, 65.894, 65.895, 65.896, 65.897, 65.898, 65.899, 65.9, 65.901,
> >     65.902, 65.903, 65.904, 65.905, 65.906, 65.907, 65.908, 65.909, 65.91,
> >     65.911, 65.912, 65.913, 65.914, 65.915, 65.916, 65.917, 65.918, 65.919,
> >     65.92, 65.921, 65.922, 65.923, 65.924, 65.925, 65.926, 65.927, 65.928,
> >     65.929, 65.93, 65.931, 65.932, 65.933, 65.934, 65.935, 65.936, 65.937,
> >     65.938, 65.939, 65.94, 65.941, 65.942, 65.943, 65.944, 65.945, 65.946,
> >     65.947, 65.948, 65.949, 65.95, 65.951, 65.952, 65.953, 65.954, 65.955,
> >     65.956, 65.957, 65.958, 65.959, 65.96, 65.961, 65.962, 65.963, 65.964,
> >     65.965, 65.966, 65.967, 65.968, 65.969, 65.97, 65.971, 65.972, 65.973,
> >     65.974, 65.975, 65.976, 65.977, 65.978, 65.979, 65.98, 65.981, 65.982,
> >     65.983, 65.984, 65.985, 65.986, 65.987, 65.988, 65.989, 65.99, 65.991,
> >     65.992, 65.993, 65.994, 65.995, 65.996, 65.997, 65.998, 65.999, 66,
> >     66.001, 66.002, 66.003, 66.004, 66.005, 66.006, 66.007, 66.008, 66.009,
> >     66.01, 66.011, 66.012, 66.013, 66.014, 66.015, 66.016, 66.017, 66.018,
> >     66.019, 66.02, 66.021, 66.022, 66.023, 66.024, 66.025, 66.026, 66.027,
> >     66.028, 66.029, 66.03, 66.031, 66.032, 66.033, 66.034, 66.035, 66.036,
> >     66.037, 66.038, 66.039, 66.04, 66.041, 66.042, 66.043, 66.044, 66.045,
> >     66.046, 66.047, 66.048, 66.049,
> >
> > ...
> >
> > Jianwei
> >
> > On Fri, 1 Aug 2003, Jianwei Li wrote:
> >
> >
> >>I think I know where the problem is now, after looking into the
> >>fortran binding code as a "human":)
> >>
> >>The automatic fortran binding is mistaking (*start)[], (*count)[],
> >>and (*stride)[] as *start[], *count[], *stride[].
> >>After changing the fortran binding interface code from
> >>
> >> FORTRAN_API void FORT_CALL nfmpi_put_vara_float_all_ ( int *v1, int *v2,
> >> int * v3[], int * v4[], float*v5, MPI_Fint *ierr ){
> >>     *ierr = ncmpi_put_vara_float_all( *v1, *v2, (const size_t *)(*v3),
> >> (const size_t *)(*v4), v5 );
> >> }
> >>
> >>to
> >>
> >> FORTRAN_API void FORT_CALL nfmpi_put_vara_float_all_ ( int *v1, int *v2,
> >> int (* v3)[], int (* v4)[], float*v5, MPI_Fint *ierr ){
> >>     *ierr = ncmpi_put_vara_float_all( *v1, *v2, (const size_t *)(*v3),
> >> (const size_t *)(*v4), v5 );
> >> }
> >>
> >>in file "parallel-netcdf-0.8.8/src/libf/put_vara_float_allf.c"
> >>
> >>and doing the same thing to other fortran binding functions that need to
> >>deal with start[], count[], stride[],
> >>I got the fortran test running successfully.
> > >>And that's why the original ones fail only for put_vara/get_vara:)
> >>
> >>Another way may be just using **start, **count, **stride?
> >>
> >>But I don't know how to modify these automatically :(
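
	A note on why the declarator matters, as a sketch rather than
	the actual binding code: Fortran passes an array argument as the
	address of its first element. In a C parameter list, int *v[] is
	adjusted to int **v, so (*v) loads a pointer value out of the
	array contents, which is not a valid address. int (*v)[] is a
	pointer to an array, so (*v) is the array itself and decays to
	the address of its first element. For the same reason, **start
	would behave like the broken form. The function names below are
	made up for the illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Working form: v points to an array of int, so (*v) decays to a
 * pointer to the first element of that array. */
const int *first_via_array_pointer(int (*v)[])
{
    assert(v != NULL);
    return *v;
}

/* Broken form for Fortran callers: as a parameter, int *v[] means
 * int **v, so (*v) is a pointer read from the memory at v.  This only
 * works if the caller really passes an array of pointers. */
const int *first_via_pointer_array(int *v[])
{
    assert(v != NULL);
    return *v;
}

/* Simplest form: the address of the first element, used directly. */
const int *first_via_plain_pointer(int *v)
{
    assert(v != NULL);
    return v;
}
```

	Given int start[3], first_via_array_pointer(&start) and
	first_via_plain_pointer(start) both return start.
	first_via_pointer_array() only returns start when it is handed
	an array of pointers such as int *p[1] = { start }; handing it
	the integer array itself would reinterpret the integers as an
	address.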
> >>
> >>#############################################################################
> >>This is the netcdf data file generated, running with 8 processes:
> >>
> >>ncdump pnf_test.nc | more
> >>netcdf pnf_test {
> >>dimensions:
> >>        level = 256 ;
> >>        latitude = 256 ;
> >>        longitude = 256 ;
> >>variables:
> >>        float tt(level, latitude, longitude) ;
> >>data:
> >>
> >> tt =
> >>  65.794, 65.795, 65.796, 65.797, 65.798, 65.799, 65.8, 65.801, 65.802,
> >>    65.803, 65.804, 65.805, 65.806, 65.807, 65.808, 65.809, 65.81, 65.811,
> >>    65.812, 65.813, 65.814, 65.815, 65.816, 65.817, 65.818, 65.819, 65.82,
> >>    65.821, 65.822, 65.823, 65.824, 65.825, 65.826, 65.827, 65.828, 65.829,
> >>    65.83, 65.831, 65.832, 65.833, 65.834, 65.835, 65.836, 65.837, 65.838,
> >>    65.839, 65.84, 65.841, 65.842, 65.843, 65.844, 65.845, 65.846, 65.847,
> >>    65.848, 65.849, 65.85, 65.851, 65.852, 65.853, 65.854, 65.855, 65.856,
> >>    65.857, 65.858, 65.859, 65.86, 65.861, 65.862, 65.863, 65.864, 65.865,
> >>    65.866, 65.867, 65.868, 65.869, 65.87, 65.871, 65.872, 65.873, 65.874,
> >>...
> >>
> >>#####################################################################
> >>And the standard output looks like:
> >>poe pnf_test -nodes 1 -tasks_per_node 8 -rmpool 1 -euilib ip -euidevice en0
> >>mype  pe_coords    totsiz_3d         locsiz_3d       kstart,jstart,istart
> >>  0    0  0  0   256  256  256      32  256  256        0      0      0
> >>  1    1  0  0   256  256  256      32  256  256       32      0      0
> >>  2    2  0  0   256  256  256      32  256  256       64      0      0
> >>  3    3  0  0   256  256  256      32  256  256       96      0      0
> >>  4    4  0  0   256  256  256      32  256  256      128      0      0
> >>  5    5  0  0   256  256  256      32  256  256      160      0      0
> >>  6    6  0  0   256  256  256      32  256  256      192      0      0
> >>  7    7  0  0   256  256  256      32  256  256      224      0      0
> >>write 1: 0.000E+00 7.040E+02
> >>
> > >>... It's still running, and I'll post the full output later to confirm
> > >>my guess:)
> >>
> >>Jianwei
> >>
> >>On Thu, 31 Jul 2003, John Tannahill wrote:
> >>
> >>
> >>>Jianwei,
> >>>
> >>>This is what I was suspecting as well.
> >>>
> >>>John
> >>>
> >>>Jianwei Li wrote:
> >>>
> >>>>I think I was wrong after a careful look at the standard output.
> >>>>
> >>>>operation    header I/O time    data I/O time
> >>>>write 2: 	1.250E-01 	0.000E+00
> >>>>read 2:  	6.250E-02 	0.000E+00
> >>>>
> >>>>It seems that Nfmpi_Put_Vara_Float_All/Nfmpi_Get_Vara_Float_All
> >>>>are not running properly in this case.
> >>>>
> >>>>We should look in more details for problems ...
> >>>>
> >>>>Jianwei
> >>>>
> >>>>On Thu, 31 Jul 2003, Jianwei Li wrote:
> >>>>
> >>>>
> >>>>
> >>>>>John,
> >>>>>
> >>>>>I had a quick run of your attached fortran code using pnetcdf0.8.8
> >>>>>on SDSC's IBM-SP (called bluehorizon). The code ran pretty well
> > >>>>>and generated these outputs:
> >>>>>
> >>>>>#######################################################################
> >>>>>standard output:
> >>>>>
> >>>>>mype  pe_coords    totsiz_3d         locsiz_3d       kstart,jstart,istart
> >>>>> 0    0  0  0   256  256  256      16  256  256        0      0      0
> >>>>> 1    1  0  0   256  256  256      16  256  256       16      0      0
> >>>>>13   13  0  0   256  256  256      16  256  256      208      0      0
> >>>>> 2    2  0  0   256  256  256      16  256  256       32      0      0
> >>>>> 8    8  0  0   256  256  256      16  256  256      128      0      0
> >>>>> 5    5  0  0   256  256  256      16  256  256       80      0      0
> >>>>> 9    9  0  0   256  256  256      16  256  256      144      0      0
> >>>>> 6    6  0  0   256  256  256      16  256  256       96      0      0
> >>>>>10   10  0  0   256  256  256      16  256  256      160      0      0
> >>>>> 4    4  0  0   256  256  256      16  256  256       64      0      0
> >>>>>11   11  0  0   256  256  256      16  256  256      176      0      0
> >>>>>12   12  0  0   256  256  256      16  256  256      192      0      0
> >>>>>14   14  0  0   256  256  256      16  256  256      224      0      0
> >>>>>15   15  0  0   256  256  256      16  256  256      240      0      0
> >>>>> 3    3  0  0   256  256  256      16  256  256       48      0      0
> >>>>> 7    7  0  0   256  256  256      16  256  256      112      0      0
> >>>>>write 1: 2.500E-01 6.250E-02
> >>>>>write 2: 1.250E-01 0.000E+00
> >>>>>write 3: 1.250E-01 6.250E-02
> >>>>>write 4: 1.875E-01 0.000E+00
> >>>>>write 5: 1.250E-01 0.000E+00
> >>>>>read 1: 6.250E-02 0.000E+00
> >>>>>diff, delmax, delmin = 1.009E+00 1.738E+00 1.701E-02
> >>>>>read 2: 6.250E-02 0.000E+00
> >>>>>read 3: 6.250E-02 0.000E+00
> >>>>>read 4: 6.250E-02 0.000E+00
> >>>>>read 5: 6.250E-02 0.000E+00
> >>>>>File size:  1.342E+02 MB
> >>>>>   Write:       INF MB/s  (eff.,  1073.742 MB/s)
> >>>>>   Read :       INF MB/s  (eff.,  2147.484 MB/s)
> >>>>>Total number PEs:   16
> >>>>> 1.250E-01  0.000E+00 1073.742  6.250E-02  0.000E+00 2147.484
> >>>>>
> >>>>>##########################################################################
> >>>>>netcdf file <pnf_test.nc>:
> >>>>>ncdump pnf_test.nc | more
> >>>>>netcdf pnf_test {
> >>>>>dimensions:
> >>>>>       level = 256 ;
> >>>>>       latitude = 256 ;
> >>>>>       longitude = 256 ;
> >>>>>variables:
> >>>>>       float tt(level, latitude, longitude) ;
> >>>>>data:
> >>>>>
> >>>>>tt =
> >>>>> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >>>>>0,
> >>>>>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >>>>>0,
> >>>>>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >>>>>0,
> >>>>>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >>>>>0,
> >>>>>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >>>>>0,
> >>>>>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >>>>>0,
> >>>>>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >>>>>0,
> >>>>>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >>>>>0,
> >>>>>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >>>>>0,
> >>>>>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >>>>>0,
> >>>>>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >>>>> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >>>>>0,
> >>>>>   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >>>>>0,
> >>>>>...
> >>>>>
> >>>>>I think it's a successful run, right?
> >>>>>
> > >>>>>So what is going on? Is the Fortran binding problem specific to the
> > >>>>>Frost platform, or is it something else?
> >>>>>
> > >>>>>btw, I built my pnetcdf lib as below and maybe you want to try this:
> >>>>>
> >>>>>setenv CC xlc
> >>>>>setenv FC xlf
> >>>>>setenv F90 xlf90
> >>>>>setenv CXX xlC
> >>>>>setenv FFLAGS '-d -O2'
> >>>>>setenv MPICC mpcc_r
> >>>>>setenv MPIF77 mpxlf_r
> >>>>>
> >>>>>#make
> >>>>>#make install
> >>>>>
> >>>>>//what else can I do?:)
> >>>>>
> >>>>>Jianwei
> >>>>>
> >>>>>On Thu, 31 Jul 2003, John Tannahill wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>>>Rob,
> >>>>>>
> >>>>>>I am hoping that I can catch you before you leave, so that you can
> >>>>>>pass this on to someone, but if you are already gone, can anyone
> >>>>>>else take a look at this?
> >>>>>>
> >>>>>>I have graduated up to my original bigger test case and the C version
> >>>>>>works, but the Fortran version doesn't.  It's certainly possible that
> >>>>>>I have screwed up the translation from C to Fortran and I will be
> >>>>>>looking at that, but I wanted to pass this back to you folks, so that
> > >>>>>>you can take a look at it too.
> >>>>>>
> >>>>>>I am using 0.8.8.  Attached are two tar files that should be pretty
> >>>>>>self-explanatory, but let me know if you have questions.
> >>>>>>
> >>>>>>Regards,
> >>>>>>John
> >>>>>>
> >>>>>>--
> >>>>>>============================
> >>>>>>John R. Tannahill
> >>>>>>Lawrence Livermore Nat. Lab.
> >>>>>>P.O. Box 808, M/S L-103
> >>>>>>Livermore, CA  94551
> >>>>>>925-423-3514
> >>>>>>Fax:  925-423-4908
> >>>>>>============================
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>>--
> >>>============================
> >>>John R. Tannahill
> >>>Lawrence Livermore Nat. Lab.
> >>>P.O. Box 808, M/S L-103
> >>>Livermore, CA  94551
> >>>925-423-3514
> >>>Fax:  925-423-4908
> >>>============================
> >>>
> >>
> >
> >
>
>
> --
> ============================
> John R. Tannahill
> Lawrence Livermore Nat. Lab.
> P.O. Box 808, M/S L-103
> Livermore, CA  94551
> 925-423-3514
> Fax:  925-423-4908
> ============================
>



