Inconsistent results on bluegene (reproduce the same problem on ANL's BG/L)
Yu-Heng Tseng
YHTseng at lbl.gov
Tue Jun 13 00:30:27 CDT 2006
Dear Rob,
Yes! Setting BGLMPIO_TUNEBLOCKING=0 gives correct results for the
nodes=16 and nodes=32 runs. The nodes=2 run, however, still gives
wrong answers. Can you explain why the workaround helps in some cases
but not in all of them? Thanks a lot for your help!
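For anyone who wants to repeat the comparison, the three runs could be
submitted along the following lines. This is only a dry-run sketch: it
prints the commands instead of submitting them. Only
'cqsub -e BGLMPIO_TUNEBLOCKING=0' comes from Rob's message; the -n and
-t options, the 10-minute limit and the program name ./pnetcdf_test are
assumptions for illustration and will differ per site.

```shell
#!/bin/sh
# Dry-run sketch: print one cqsub command per node count discussed above.
# Assumptions (not from the thread): -n <nodes>, -t <minutes>, and the
# hypothetical test program ./pnetcdf_test.
EXE=./pnetcdf_test

print_cmds() {
    for nodes in 2 16 32; do
        # -e passes the workaround variable to the job environment
        echo "cqsub -n $nodes -t 10 -e BGLMPIO_TUNEBLOCKING=0 $EXE"
    done
}

print_cmds
```

Removing the echo (or piping the output to sh) would submit the jobs
for real, so check the printed commands against the local queue setup
first.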
The workaround also works for the CAM3.1 application: CAM now runs
successfully, and the crashes I mentioned previously are gone.
However, I still need to verify that all the results are correct.
Thanks a lot for your help!
Yu-heng
---------------------------------------------------
Yu-Heng Tseng
Computational Research Division
Lawrence Berkeley National Laboratory
One Cyclotron Rd, MS: 50F-1650
Berkeley, CA 94720
YHTseng at lbl.gov
510.495.2904
----- Original Message -----
From: robl at mcs.anl.gov (Robert Latham)
Date: Tuesday, June 6, 2006 11:26 am
Subject: Re: Inconsistent results on bluegene (reproduce the same
problem on ANL's BG/L)
> On Tue, Jun 06, 2006 at 10:29:53AM -0700, Yu-Heng Tseng wrote:
> > Dear Rob,
> >
> > Thank you for your explanation. In fact, we got problems when we
> > were importing CAM3.1 (Community Atmosphere Model) on ANL's BG/L
> > using PnetCDF. The code crashed at the following line
> > ierr = Nfmpi_Close (ncid)
> > after it passed the read-in statement (ierr=0)
> > ierr = Nfmpi_Put_Vara_Real_All(ncid, tt_id, start_3d,
> > count_3d, tt)
> >
> > I tried to trace back the problem and use the most simple Fortran
> > test for PnetCDF. Then we identified this problem finally.
> >
> > CAM3.1 with PnetCDF works well for many other platforms (IBM SP3-5
> > and Cray) already but it fails on both NCAR and ANL's BG/L systems.
> > The benchmark porting tests for CAM on BG/L are delayed due to this
> > problem. A presentation in July (ScicomP12) may be canceled if we
> > still couldn't get results by then. That's why we are eager to get
> > it fixed as soon as possible. Thanks again for your help!
>
> Ah-ha!
>
> There is one thing you might want to try, especially since you are
> calling one of the vara_*_all functions. Set the BGLMPIO_TUNEBLOCKING
> environment variable to 0 (at ANL, that's 'cqsub -e
> BGLMPIO_TUNEBLOCKING=0'; not sure what that would be at NCAR).
>
> This will hurt I/O performance by about 20%, but might give you
> correct answers. For 16 processes, I get correct answers on PVFS2
> and NFS (all zeros for diff, delmax and delmin) so it seems to
> help a lot.
>
> This workaround should help you on both Argonne's and NCAR's
> bluegene.
>
> Please let me know if this does the trick for you, and I'll make a
> more-prominent note in our README.bgl about the data inconsistencies.
>
> ==rob
>
> --
> Rob Latham
> Mathematics and Computer Science Division A215 0178 EA2D B059 8CDF
> Argonne National Labs, IL USA B29D F333 664A 4280 315B
>