PNetCDF problem

HLung at us.fujitsu.com
Thu Dec 24 17:24:17 CST 2015


Hi Wei-keng,

Really appreciate your help.  I will try to download this unreleased version and see how it works.

Best regards, and I wish you a very merry Christmas holiday,

Han

-----Original Message-----
From: Wei-keng Liao [mailto:wkliao at eecs.northwestern.edu] 
Sent: Thursday, December 24, 2015 2:40 PM
To: Lung, Han
Cc: parallel-netcdf at lists.mcs.anl.gov
Subject: Re: PNetCDF problem

Hi Han,

The bug related to BUS_ADRALN has recently been fixed in the PnetCDF repo and will be available in the next release.
If you would like to try the latest unreleased PnetCDF, please run the following commands to download the code.

svn checkout https://svn.mcs.anl.gov/repos/parallel-netcdf/trunk
cd trunk
autoreconf

Then please run the same configure and make commands you used before.

If you are using Fujitsu compilers, then please read the file README.Fujitsu. You might want to add LDFLAGS="-L." into your configure command line.

Let me know if you encounter any further problems.

Wei-keng

On Dec 24, 2015, at 2:32 PM, hlung at us.fujitsu.com wrote:

> Hi Wei-keng,
> 
> I tried PnetCDF 1.6.1 with WRF, as you advised.  This time the WRF calculation itself completed, but the program died again, at ncmpix_put_int64, with the same address-not-aligned error (BUS_ADRALN).  The output file is empty.  This function is also in ncx.c.
> 
> I am wondering whether this error is related to the input file or to the WRF run.  Does PnetCDF have any restrictions on the WRF input data?
> 
> Thanks,
> 
> Han
> 
> -----Original Message-----
> From: Wei-keng Liao [mailto:wkliao at eecs.northwestern.edu]
> Sent: Wednesday, November 11, 2015 5:39 PM
> To: Lung, Han
> Cc: parallel-netcdf at lists.mcs.anl.gov
> Subject: Re: PNetCDF problem
> 
> Hi, Han
> 
> Thanks for reporting the problem you are encountering.
> Since PnetCDF 1.3.0 is quite old now (more than 3 years), I wonder if you can try 1.6.1, the latest stable release. ncx.c has had a significant revision since 1.3.0.
> 
> We do not have access to Fujitsu compilers, so I would not be surprised by problems in your environment. Actually, I was recently informed that the same BUS_ADRALN error was observed with Fujitsu compilers. I am in the process of getting an account on a machine with Fujitsu compilers so that I can debug this issue. Before I can do anything, could you at least try 1.6.1? Thanks.
> 
> Wei-keng
> 
> On Nov 11, 2015, at 6:42 PM, hlung at us.fujitsu.com wrote:
> 
>> Hi,
>> 
>> This is Han Lung, Director of HPC Group, Fujitsu America, Inc.
>> 
>> I am working on WRF 3.7 + PnetCDF 1.3.0 and got some errors.  The test case is conus12km.
>> 
>> I was told by wrfhelp that this error is related to the internals of PnetCDF, so I am writing to you in the hope that you can help.
>> 
>> I traced the error to ncmpix_put_size_t(void  **xpp, const MPI_Offset   lp, int sizeof_t) in ncx.c of parallel-netcdf-1.3.0.  Here is the part of the code that caused the error:
>> 
>> 
>> #ifdef WORDS_BIGENDIAN
>>        MPI_Offset *ptr = (MPI_Offset*) (*xpp); /* typecast to 8-byte integer */
>>        *ptr = lp;    /* <-- error here */
>> #else
>>        ..
>> 
>> This is an operation for the 8-byte integer lp (sizeof_t = 8).  However, *xpp is incremented each time by 4 or 8, depending on sizeof_t:
>> 
>>        *xpp = (void *)((char *)(*xpp) + sizeof_t);
>> 
>> Now, when *xpp is not on an 8-byte boundary (because of an earlier 4-byte increment), the operation "*ptr = lp;" causes the address-not-aligned error (BUS_ADRALN).
>> 
>> In my case, sizeof_t is 4 most of the time, and there is no problem.  The first two times sizeof_t is 8, *xpp is 591725472 and 591725720, respectively; both are on an 8-byte boundary, so there is no problem either.  The third time sizeof_t is 8, *xpp is 591725972, which is not on an 8-byte boundary, and that caused the problem.
>> 
>> I modified ncx.c to pad 4 bytes whenever *xpp is not on an 8-byte boundary.  That did solve the misalignment problem, but the program then died at a later free() call.  I am also not sure whether padding is the right approach, since the padded parts are all garbage.
>> 
>> Have you seen this kind of error before?  Any advice on how to resolve it?
>> 
>> Thanks,
>> 
>> Han
> 
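
The alignment arithmetic in the original report can be checked with a small C sketch. The helper name misalignment is hypothetical and is not part of PnetCDF; it only reports how many bytes an address lies past the previous boundary, assuming the boundary size is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not PnetCDF code: number of bytes by which `addr`
 * lies past the previous `alignment` boundary (0 means aligned).
 * `alignment` must be a nonzero power of two. */
static unsigned misalignment(uintptr_t addr, unsigned alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (unsigned)(addr & (uintptr_t)(alignment - 1));
}
```

For the three values of *xpp quoted in the report, misalignment(addr, 8) gives 0 for 591725472, 0 for 591725720, and 4 for 591725972, which matches the observation that only the third 8-byte store faulted.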

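One common way to avoid this class of fault without padding the buffer is to store the value one byte at a time instead of casting *xpp to an 8-byte integer pointer. The sketch below is only an illustration of that technique. The function name put_be64_bytewise is hypothetical, the big-endian byte order is an assumption based on the WORDS_BIGENDIAN branch quoted in the report, and this is not necessarily the fix that went into the PnetCDF repository.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of PnetCDF: write `value` as 8 big-endian
 * bytes at *xpp, then advance *xpp by 8. Only single-byte stores are used,
 * so the destination address needs no particular alignment. */
static void put_be64_bytewise(void **xpp, int64_t value)
{
    assert(xpp != NULL && *xpp != NULL);
    unsigned char *dst = (unsigned char *)(*xpp);
    uint64_t v = (uint64_t)value;
    for (int i = 0; i < 8; i++)
        dst[i] = (unsigned char)(v >> (56 - 8 * i)); /* most significant byte first */
    *xpp = (void *)(dst + 8);
}
```

Because nothing is padded, the layout of the buffer is unchanged and no garbage bytes are introduced. The trade-off is eight byte stores in place of one word store.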

