Inconsistent results on bluegene (reproduce the same problem on ANL's BG/L)
Robert Latham
robl at mcs.anl.gov
Tue Jun 6 13:26:52 CDT 2006
On Tue, Jun 06, 2006 at 10:29:53AM -0700, Yu-Heng Tseng wrote:
> Dear Rob,
>
> Thank you for your explanation. In fact, we got problems when we were
> importing CAM3.1 (Community Atmosphere Model) on ANL's BG/L using
> PnetCDF. The code crashed at the following line
> ierr = Nfmpi_Close (ncid)
> after it passes the read-in statement (ierr=0)
> ierr = Nfmpi_Put_Vara_Real_All(ncid, tt_id, start_3d,
> count_3d, tt)
>
> I traced the problem back using the simplest possible Fortran test
> for PnetCDF, and that is how we finally identified this problem.
>
> CAM3.1 with PnetCDF works well for many other platforms (IBM SP3-5 and
> Cray) already but it fails on both NCAR and ANL's BG/L systems. The
> benchmark porting tests for CAM on BG/L are delayed due to this
> problem. A presentation in July (ScicomP12) may be canceled if we
> still couldn't get results by then. That's why we are eager to get it
> fixed as soon as possible. Thanks again for your help!
Ah-ha!
There is one thing you might want to try, especially since you are
calling one of the vara_*_all functions. Set the BGLMPIO_TUNEBLOCKING
environment variable to 0 (at ANL, that's 'cqsub -e
BGLMPIO_TUNEBLOCKING=0'; I'm not sure what the equivalent is at NCAR).
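If jobs are submitted from a script, the setting above could be wrapped in a small helper so it is never forgotten. This is only a sketch: the 'cqsub -e' flag and the variable name come from this message, while the helper name build_cqsub_command, the executable ./pnetcdf_test, and the idea of passing other options through untouched are hypothetical.

```python
import shlex

# Setting taken from the message: disables the BG/L MPI-IO blocking tuning.
TUNEBLOCKING_SETTING = "BGLMPIO_TUNEBLOCKING=0"


def build_cqsub_command(executable, extra_options=()):
    """Return an argument list for a cqsub submission with the workaround set.

    extra_options is passed through unchanged; which options a site needs
    (node count, wall time, ...) is left to the caller and not assumed here.
    """
    return ["cqsub", "-e", TUNEBLOCKING_SETTING, *extra_options, executable]


if __name__ == "__main__":
    # Print the command line instead of running it, so the sketch is safe
    # to try on a machine without cqsub installed.
    print(shlex.join(build_cqsub_command("./pnetcdf_test")))
```

The helper only assembles the command; handing the list to subprocess.run on a real login node is left out on purpose.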
This will hurt I/O performance by about 20%, but might give you
correct answers. With the variable set, a 16-process run gives me
correct answers on both PVFS2 and NFS (all zeros for diff, delmax and
delmin), so it seems to help a lot.
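For anyone repeating the check, here is a rough sketch of what "all zeros" means for a write-then-read-back test. The names diff, delmax and delmin come from the test output mentioned above; their exact definitions in the Fortran test are an assumption here (sum of absolute differences, largest difference, smallest difference), and roundtrip_deltas is a hypothetical helper.

```python
def roundtrip_deltas(written, read_back):
    """Compare values written to a file with the values read back.

    Assumed definitions (not taken from the original test program):
      diff   = sum of absolute element-wise differences
      delmax = largest element-wise difference (read_back - written)
      delmin = smallest element-wise difference (read_back - written)
    """
    if len(written) != len(read_back):
        raise ValueError("arrays must have the same length")
    deltas = [r - w for w, r in zip(written, read_back)]
    diff = sum(abs(d) for d in deltas)
    delmax = max(deltas, default=0.0)
    delmin = min(deltas, default=0.0)
    return diff, delmax, delmin


def is_consistent(written, read_back):
    """True when the read-back data matches exactly (all three values zero)."""
    return roundtrip_deltas(written, read_back) == (0.0, 0.0, 0.0)
```

A consistent run should make is_consistent return True for every variable that was written; any nonzero value points at the data inconsistency discussed in this thread.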
This workaround should help you on both Argonne's and NCAR's bluegene.
Please let me know if this does the trick for you, and I'll make a
more-prominent note in our README.bgl about the data inconsistencies.
==rob
--
Rob Latham
Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
Argonne National Labs, IL USA                B29D F333 664A 4280 315B