Houston, we have a problem

Jianwei Li jianwei at cheetah.cpdc.ece.nwu.edu
Fri Aug 1 11:28:41 CDT 2003


I think I know where the problem is now, after reading through the
Fortran binding code as a "human" :)

The automatically generated Fortran binding declares the start, count,
and stride arguments as *start[], *count[], *stride[] (arrays of pointers)
instead of (*start)[], (*count)[], (*stride)[] (pointers to arrays).
After changing the Fortran binding interface code from

 FORTRAN_API void FORT_CALL nfmpi_put_vara_float_all_ ( int *v1, int *v2,
 int * v3[], int * v4[], float*v5, MPI_Fint *ierr ){
     *ierr = ncmpi_put_vara_float_all( *v1, *v2, (const size_t *)(*v3),
 (const size_t *)(*v4), v5 );
 }

to

 FORTRAN_API void FORT_CALL nfmpi_put_vara_float_all_ ( int *v1, int *v2,
 int (* v3)[], int (* v4)[], float*v5, MPI_Fint *ierr ){
     *ierr = ncmpi_put_vara_float_all( *v1, *v2, (const size_t *)(*v3),
 (const size_t *)(*v4), v5 );
 }

in the file "parallel-netcdf-0.8.8/src/libf/put_vara_float_allf.c",

and making the same change to the other Fortran binding functions that
take start[], count[], or stride[],
I got the Fortran test running successfully.
That would also explain why the original bindings fail only for
put_vara/get_vara :)
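
To make the difference between the two declarations concrete, here is a
minimal sketch in plain C. It is not part of pnetcdf; the helper names
via_pointer_to_array and via_array_of_pointers are hypothetical and exist
only for illustration. It assumes, as the bindings above do, that Fortran
passes the address of the first array element.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Corrected declaration: v is a pointer to an array of int, so *v is the
 * array itself and decays to the very address the caller passed in. */
static const int *via_pointer_to_array(int (*v)[])
{
    return *v;
}

/* What *v evaluates to under the mistaken declaration `int *v[]`: the
 * first pointer-sized bytes of the array contents are loaded and treated
 * as an address. memcpy emulates that load so the sketch stays well
 * defined; the buffer must hold at least sizeof(uintptr_t) bytes. */
static uintptr_t via_array_of_pointers(const void *fortran_arg)
{
    uintptr_t bogus;

    assert(fortran_arg != NULL);
    memcpy(&bogus, fortran_arg, sizeof bogus);
    return bogus;
}
```

For an array such as int start[4] = {32, 0, 0, 0}, the first helper called
with &start returns the address of start, while the second called with
start returns a value built from the index data rather than an address.
Note that in a parameter list C adjusts `int *v[]` to `int **v`, so those
two spellings declare the same type.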

Another way may be just using **start, **count, **stride?

But I don't know how to modify these automatically :(

#############################################################################
This is the netcdf data file generated, running with 8 processes:

ncdump pnf_test.nc | more
netcdf pnf_test {
dimensions:
        level = 256 ;
        latitude = 256 ;
        longitude = 256 ;
variables:
        float tt(level, latitude, longitude) ;
data:

 tt =
  65.794, 65.795, 65.796, 65.797, 65.798, 65.799, 65.8, 65.801, 65.802,
    65.803, 65.804, 65.805, 65.806, 65.807, 65.808, 65.809, 65.81, 65.811,
    65.812, 65.813, 65.814, 65.815, 65.816, 65.817, 65.818, 65.819, 65.82,
    65.821, 65.822, 65.823, 65.824, 65.825, 65.826, 65.827, 65.828, 65.829,
    65.83, 65.831, 65.832, 65.833, 65.834, 65.835, 65.836, 65.837, 65.838,
    65.839, 65.84, 65.841, 65.842, 65.843, 65.844, 65.845, 65.846, 65.847,
    65.848, 65.849, 65.85, 65.851, 65.852, 65.853, 65.854, 65.855, 65.856,
    65.857, 65.858, 65.859, 65.86, 65.861, 65.862, 65.863, 65.864, 65.865,
    65.866, 65.867, 65.868, 65.869, 65.87, 65.871, 65.872, 65.873, 65.874,
...

#####################################################################
And the standard output looks like:
poe pnf_test -nodes 1 -tasks_per_node 8 -rmpool 1 -euilib ip -euidevice en0
mype  pe_coords    totsiz_3d         locsiz_3d       kstart,jstart,istart
  0    0  0  0   256  256  256      32  256  256        0      0      0
  1    1  0  0   256  256  256      32  256  256       32      0      0
  2    2  0  0   256  256  256      32  256  256       64      0      0
  3    3  0  0   256  256  256      32  256  256       96      0      0
  4    4  0  0   256  256  256      32  256  256      128      0      0
  5    5  0  0   256  256  256      32  256  256      160      0      0
  6    6  0  0   256  256  256      32  256  256      192      0      0
  7    7  0  0   256  256  256      32  256  256      224      0      0
write 1: 0.000E+00 7.040E+02

... It's still running. I'll post the full output later to confirm
my diagnosis :)

Jianwei

On Thu, 31 Jul 2003, John Tannahill wrote:

> Jianwei,
>
> This is what I was suspecting as well.
>
> John
>
> Jianwei Li wrote:
> > I think I was wrong after a careful look at the standard output.
> >
> > operation    header I/O time    data I/O time
> > write 2: 	1.250E-01 	0.000E+00
> > read 2:  	6.250E-02 	0.000E+00
> >
> > It seems that Nfmpi_Put_Vara_Float_All/Nfmpi_Get_Vara_Float_All
> > are not running properly in this case.
> >
> > We should look into this in more detail ...
> >
> > Jianwei
> >
> > On Thu, 31 Jul 2003, Jianwei Li wrote:
> >
> >
> >>John,
> >>
> >>I had a quick run of your attached fortran code using pnetcdf0.8.8
> >>on SDSC's IBM-SP (called bluehorizon). The code ran pretty well
> >>and generated these outputs:
> >>
> >>#######################################################################
> >>standard output:
> >>
> >>mype  pe_coords    totsiz_3d         locsiz_3d       kstart,jstart,istart
> >>  0    0  0  0   256  256  256      16  256  256        0      0      0
> >>  1    1  0  0   256  256  256      16  256  256       16      0      0
> >> 13   13  0  0   256  256  256      16  256  256      208      0      0
> >>  2    2  0  0   256  256  256      16  256  256       32      0      0
> >>  8    8  0  0   256  256  256      16  256  256      128      0      0
> >>  5    5  0  0   256  256  256      16  256  256       80      0      0
> >>  9    9  0  0   256  256  256      16  256  256      144      0      0
> >>  6    6  0  0   256  256  256      16  256  256       96      0      0
> >> 10   10  0  0   256  256  256      16  256  256      160      0      0
> >>  4    4  0  0   256  256  256      16  256  256       64      0      0
> >> 11   11  0  0   256  256  256      16  256  256      176      0      0
> >> 12   12  0  0   256  256  256      16  256  256      192      0      0
> >> 14   14  0  0   256  256  256      16  256  256      224      0      0
> >> 15   15  0  0   256  256  256      16  256  256      240      0      0
> >>  3    3  0  0   256  256  256      16  256  256       48      0      0
> >>  7    7  0  0   256  256  256      16  256  256      112      0      0
> >>write 1: 2.500E-01 6.250E-02
> >>write 2: 1.250E-01 0.000E+00
> >>write 3: 1.250E-01 6.250E-02
> >>write 4: 1.875E-01 0.000E+00
> >>write 5: 1.250E-01 0.000E+00
> >> read 1: 6.250E-02 0.000E+00
> >>diff, delmax, delmin = 1.009E+00 1.738E+00 1.701E-02
> >> read 2: 6.250E-02 0.000E+00
> >> read 3: 6.250E-02 0.000E+00
> >> read 4: 6.250E-02 0.000E+00
> >> read 5: 6.250E-02 0.000E+00
> >>File size:  1.342E+02 MB
> >>    Write:       INF MB/s  (eff.,  1073.742 MB/s)
> >>    Read :       INF MB/s  (eff.,  2147.484 MB/s)
> >>Total number PEs:   16
> >>  1.250E-01  0.000E+00 1073.742  6.250E-02  0.000E+00 2147.484
> >>
> >>##########################################################################
> >>netcdf file <pnf_test.nc>:
> >>ncdump pnf_test.nc | more
> >>netcdf pnf_test {
> >>dimensions:
> >>        level = 256 ;
> >>        latitude = 256 ;
> >>        longitude = 256 ;
> >>variables:
> >>        float tt(level, latitude, longitude) ;
> >>data:
> >>
> >> tt =
> >>  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >>0,
> >>    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >>0,
> >>    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >>0,
> >>    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >>0,
> >>    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >>0,
> >>    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >>0,
> >>    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >>0,
> >>    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >>0,
> >>    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >>0,
> >>    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >>0,
> >>    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >>  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >>0,
> >>    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >>0,
> >>...
> >>
> >>I think it's a successful run, right?
> >>
> >>So what now? Is this a Fortran binding problem specific to the Frost
> >>platform, or something else?
> >>
> >>btw, I build my pnetcdf lib as below and maybe you want to try this:
> >>
> >>setenv CC xlc
> >>setenv FC xlf
> >>setenv F90 xlf90
> >>setenv CXX xlC
> >>setenv FFLAGS '-d -O2'
> >>setenv MPICC mpcc_r
> >>setenv MPIF77 mpxlf_r
> >>
> >>#make
> >>#make install
> >>
> >>//what else can I do?:)
> >>
> >>Jianwei
> >>
> >>On Thu, 31 Jul 2003, John Tannahill wrote:
> >>
> >>
> >>>Rob,
> >>>
> >>>I am hoping that I can catch you before you leave, so that you can
> >>>pass this on to someone, but if you are already gone, can anyone
> >>>else take a look at this?
> >>>
> >>>I have graduated up to my original bigger test case and the C version
> >>>works, but the Fortran version doesn't.  It's certainly possible that
> >>>I have screwed up the translation from C to Fortran and I will be
> >>>looking at that, but I wanted to pass this back to you folks, so that
> >>>you can take a look at it too.
> >>>
> >>>I am using 0.8.8.  Attached are two tar files that should be pretty
> >>>self-explanatory, but let me know if you have questions.
> >>>
> >>>Regards,
> >>>John
> >>>
> >>>--
> >>>============================
> >>>John R. Tannahill
> >>>Lawrence Livermore Nat. Lab.
> >>>P.O. Box 808, M/S L-103
> >>>Livermore, CA  94551
> >>>925-423-3514
> >>>Fax:  925-423-4908
> >>>============================
> >>>
> >>
> >
> >
>
>
> --
> ============================
> John R. Tannahill
> Lawrence Livermore Nat. Lab.
> P.O. Box 808, M/S L-103
> Livermore, CA  94551
> 925-423-3514
> Fax:  925-423-4908
> ============================
>



