PNetCDF problem

HLung at us.fujitsu.com HLung at us.fujitsu.com
Thu Dec 24 14:32:42 CST 2015


Hi Wei-keng,

I tried PnetCDF 1.6.1 with WRF as you advised.  This time the WRF calculation itself completed, but the program died again, in ncmpix_put_int64, with the same address-not-aligned error (BUS_ADRALN).  The output file is empty.  This function is also in ncx.c.

I am wondering if this error is related to the input file or to the WRF run.  Does PnetCDF place any restrictions on the WRF input data?

Thanks,

Han

-----Original Message-----
From: Wei-keng Liao [mailto:wkliao at eecs.northwestern.edu] 
Sent: Wednesday, November 11, 2015 5:39 PM
To: Lung, Han
Cc: parallel-netcdf at lists.mcs.anl.gov
Subject: Re: PNetCDF problem

Hi, Han

Thanks for reporting the problem you are encountering.
Since PnetCDF 1.3.0 is quite old now (more than 3 years), I wonder if you can try 1.6.1, the latest stable release. ncx.c has had a significant revision since 1.3.0.

We do not have access to Fujitsu compilers, so I would not be surprised by problems in your environment. Actually, I was recently informed that the same BUS_ADRALN error was observed when using Fujitsu compilers. I am in the process of getting an account on a machine with Fujitsu compilers, so I can debug this issue. Before I can do anything, could you at least try 1.6.1? Thanks.

Wei-keng

On Nov 11, 2015, at 6:42 PM, hlung at us.fujitsu.com wrote:

> Hi,
>  
> This is Han Lung, Director of HPC Group, Fujitsu America, Inc.
>  
> I am working on WRF 3.7+PNetcdf 1.3.0 and got some errors.  The test case is conus12km.
>  
> I was told by wrfhelp that this error is related to the internals of PnetCDF, so I am writing to you in the hope that you can help.
>  
> I traced the error to ncmpix_put_size_t(void  **xpp, const MPI_Offset   lp, int sizeof_t) in ncx.c of parallel-netcdf-1.3.0.  Here is the part of the code that caused the error:
>  
>  
> #ifdef WORDS_BIGENDIAN
>         MPI_Offset *ptr = (MPI_Offset*) (*xpp); /* typecast to 8-byte integer */
>         *ptr = lp;    /* <-- error here */
> #else
>         ..
>  
> This is an operation for 8-byte integer lp (sizeof_t = 8).  However, 
> *xpp is incremented each time by 4 or 8, depending on sizeof_t:  *xpp  
> = (void *)((char *)(*xpp) + sizeof_t);
>  
> Now when *xpp is not on 8-byte boundary (due to previous operation of 4-byte increment) the operation "*ptr = lp;" will cause the address not aligned error (BUS_ADRALN). 
>  
> In my case, for most of the time, sizeof_t is 4, and there is no problem.  The first two times when sizeof_t is 8, *xpp is 591725472 and 591725720, respectively, which are at 8-byte boundary, so there is no problem, either.  The third time when sizeof_t is 8, *xpp is 591725972, which is not at 8-byte boundary, and that caused the problem.
>  
> I modified ncx.c to pad 4 bytes whenever *xpp is not on an 8-byte boundary.  That did solve the misalignment problem, but the program then died at a later free() call.  I am also not sure whether this padding is the right approach, since the padded bytes are all garbage.
>  
> Have you seen this kind of error before?  Any advice on how to resolve it?
>  
> Thanks,
>  
> Han
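
The misalignment described in the quoted message can be illustrated with a small C sketch. This is not the PnetCDF code and not the fix adopted by the PnetCDF developers; `is_aligned` and `put_int64_unaligned` are hypothetical names introduced here only for illustration. The sketch assumes the value must be stored in big-endian (XDR) order, as in the WORDS_BIGENDIAN branch above, and writes it one byte at a time so that no 8-byte store is ever issued to an address that may be only 4-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not in PnetCDF): nonzero when addr is a multiple
 * of align. */
static int is_aligned(uintptr_t addr, size_t align)
{
    return (addr % align) == 0;
}

/* Hypothetical alignment-safe 8-byte store (not the PnetCDF implementation).
 * Writes value most-significant byte first, so it never dereferences *xpp
 * as an 8-byte integer, then advances *xpp by 8 bytes. */
static void put_int64_unaligned(void **xpp, int64_t value)
{
    unsigned char *dst;
    uint64_t u = (uint64_t)value;
    int i;

    assert(xpp != NULL && *xpp != NULL);
    dst = (unsigned char *)(*xpp);

    for (i = 0; i < 8; i++)
        dst[i] = (unsigned char)(u >> (56 - 8 * i));

    *xpp = (void *)(dst + 8);
}
```

With the addresses reported above, `is_aligned(591725472, 8)` and `is_aligned(591725720, 8)` are true, while `is_aligned(591725972, 8)` is false (the remainder is 4), which matches the third 8-byte write being the one that fails. A byte-wise store costs a little more than a single 8-byte store, but it avoids padding the buffer, so the file layout is unchanged.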


