Inconsistent results on bluegene (reproduce the same problem on ANL's BG/L)

Yu-Heng Tseng YHTseng at lbl.gov
Tue Jun 13 00:30:27 CDT 2006


Dear Rob,

Yes! That gives correct results for the nodes=16 and nodes=32 runs. The
nodes=2 run, however, still gives wrong answers. Can you explain why the
workaround helps but does not fix the problem completely? Thanks a lot
for your help!

The workaround also works for the CAM3.1 application, and CAM now runs
successfully. The previous errors (crashes) I mentioned are gone.
However, I still need to verify that all of the results are correct.

Thanks a lot for your help!

Yu-heng
---------------------------------------------------
Yu-Heng Tseng

Computational Research Division
Lawrence Berkeley National Laboratory
One Cyclotron Rd, MS: 50F-1650
Berkeley, CA 94720
YHTseng at lbl.gov
510.495.2904

----- Original Message -----
From: robl at mcs.anl.gov (Robert Latham)
Date: Tuesday, June 6, 2006 11:26 am
Subject: Re: Inconsistent results on bluegene (reproduce the same 
problem on ANL's BG/L)

> On Tue, Jun 06, 2006 at 10:29:53AM -0700, Yu-Heng Tseng wrote:
> > Dear Rob,
> > 
> > Thank you for your explanation. In fact, we ran into problems when
> > porting CAM3.1 (Community Atmosphere Model) to ANL's BG/L using
> > PnetCDF. The code crashed at the following line
> >         ierr = Nfmpi_Close (ncid)
> > after it passed the read-in statement (ierr=0)
> >         ierr = Nfmpi_Put_Vara_Real_All(ncid, tt_id, start_3d, count_3d, tt)
> > 
> > I tried to trace the problem back using the simplest Fortran test
> > for PnetCDF, and that is how we finally identified this problem.
> > 
> > CAM3.1 with PnetCDF already works well on many other platforms
> > (IBM SP3-5 and Cray), but it fails on both NCAR's and ANL's BG/L
> > systems. The benchmark porting tests for CAM on BG/L are delayed
> > because of this problem. A presentation in July (ScicomP12) may be
> > canceled if we still can't get results by then. That's why we are
> > eager to get it fixed as soon as possible. Thanks again for your help!
> 
> Ah-ha!
> 
> There is one thing you might want to try, especially since you are
> calling one of the vara_*_all functions.  Set the BGLMPIO_TUNEBLOCKING
> environment variable to 0 (at ANL, that's 'cqsub -e
> BGLMPIO_TUNEBLOCKING=0'; I'm not sure what that would be at NCAR).
> 
> This will hurt I/O performance by about 20%, but might give you
> correct answers.  For 16 processes, I get correct answers on PVFS2
> and NFS (all zeros for diff, delmax and delmin), so it seems to help
> a lot.
> 
> This workaround should help you on both Argonne's and NCAR's bluegene.
> 
> Please let me know if this does the trick for you, and I'll make a
> more-prominent note in our README.bgl about the data inconsistencies.
> 
> ==rob
> 
> -- 
> Rob Latham
> Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
> Argonne National Labs, IL USA                B29D F333 664A 4280 315B
> 



