Inconsistent results on bluegene (reproduce the same problem on ANL's BG/L)

Robert Latham robl at mcs.anl.gov
Tue Jun 6 13:26:52 CDT 2006


On Tue, Jun 06, 2006 at 10:29:53AM -0700, Yu-Heng Tseng wrote:
> Dear Rob,
> 
> Thank you for your explanation. In fact, we ran into problems when we were 
> porting CAM3.1 (Community Atmosphere Model) to ANL's BG/L using 
> PnetCDF. The code crashed at the following line
>         ierr = Nfmpi_Close (ncid)
> after it passes the readin statement (ierr=0) 
>         ierr = Nfmpi_Put_Vara_Real_All(ncid, tt_id, start_3d, count_3d, tt)
> 
> I tried to trace the problem back using the simplest possible Fortran test 
> for PnetCDF, and that is how we finally identified this problem.
> 
> CAM3.1 with PnetCDF already works well on many other platforms (IBM SP3-5 and 
> Cray), but it fails on both NCAR's and ANL's BG/L systems. The 
> benchmark porting tests for CAM on BG/L are delayed due to this 
> problem. A presentation in July (ScicomP12) may be canceled if we 
> still can't get results by then. That's why we are eager to get it 
> fixed as soon as possible. Thanks again for your help!

Ah-ha!

There is one thing you might want to try, especially since you are
calling one of the vara_*_all functions.  Set the BGLMPIO_TUNEBLOCKING
environment variable to 0 (at ANL, that's 'cqsub -e
BGLMPIO_TUNEBLOCKING=0'; I'm not sure what the equivalent would be at
NCAR).

This will hurt I/O performance by about 20%, but might give you
correct answers.  With that setting and 16 processes, I get correct
answers on both PVFS2 and NFS (all zeros for diff, delmax and delmin),
so it seems to help a lot.
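
For anyone who wants to repeat this kind of check outside the Fortran
test, here is a minimal Python sketch of the comparison.  It is only an
illustration under stated assumptions: it takes diff to be the summed
absolute difference between the values written and the values read
back, and delmax/delmin to be the largest and smallest element-wise
difference.  The names compare, written and read_back are hypothetical
and are not part of PnetCDF or of the test program.

```python
def compare(written, read_back):
    """Return (diff, delmax, delmin) for two equal-length sequences.

    Assumed definitions (not taken from the original test program):
      diff   = sum of absolute element-wise differences
      delmax = largest element-wise difference (read_back - written)
      delmin = smallest element-wise difference (read_back - written)
    """
    if len(written) != len(read_back):
        raise ValueError("arrays must have the same length")
    deltas = [r - w for w, r in zip(written, read_back)]
    diff = sum(abs(d) for d in deltas)
    delmax = max(deltas, default=0.0)
    delmin = min(deltas, default=0.0)
    return diff, delmax, delmin


# Made-up data: one consistent read-back and one with a corrupted element.
written = [1.0, 2.5, -3.0, 4.25]
consistent = list(written)
corrupted = [1.0, 2.5, 0.0, 4.25]

print(compare(written, consistent))  # all zeros means the data round-tripped
print(compare(written, corrupted))   # any nonzero value flags an inconsistency
```

All three values being zero corresponds to the "correct answers" case
above; a nonzero value in any of them means the data read back does not
match what was written.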

This workaround should help you on both Argonne's and NCAR's bluegene.

Please let me know if this does the trick for you, and I'll make a
more-prominent note in our README.bgl about the data inconsistencies.

==rob

-- 
Rob Latham
Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
Argonne National Labs, IL USA                B29D F333 664A 4280 315B



