Inconsistent results on bluegene (reproduce the same problem on ANL's BG/L)

Yu-Heng Tseng YHTseng at lbl.gov
Tue Jun 6 12:29:53 CDT 2006


Dear Rob,

Thank you for your explanation. In fact, we ran into problems while 
porting CAM3.1 (Community Atmosphere Model) to ANL's BG/L using 
PnetCDF. The code crashed at the following line
        ierr = Nfmpi_Close (ncid)
even though the preceding put call had returned successfully (ierr = 0):
        ierr = Nfmpi_Put_Vara_Real_All(ncid, tt_id, start_3d, count_3d, tt)

To track the problem down, I reduced it to the simplest possible 
Fortran test for PnetCDF, and that is how we finally identified this 
issue.
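
For reference, the kind of minimal test I used looks roughly like the 
sketch below (the file name, dimension sizes, and the variable name 
'tt' here are only illustrative placeholders, not code taken from 
CAM3.1):

      program minimal_pnetcdf_test
      ! Sketch of a minimal parallel write test: each rank writes one
      ! x-y plane of a 3-d real variable, then the file is closed.
      ! (File name and dimension sizes are placeholders only.)
      implicit none
      include 'mpif.h'
      include 'pnetcdf.inc'

      integer ierr, rank, nprocs, ncid, tt_id, dimids(3)
      integer(kind=MPI_OFFSET_KIND) nx, ny, nz
      integer(kind=MPI_OFFSET_KIND) start_3d(3), count_3d(3)
      real tt(4,4)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

      nx = 4
      ny = 4
      nz = nprocs

      ! create the file and define a 3-d real variable 'tt'
      ierr = nfmpi_create(MPI_COMM_WORLD, 'test.nc', NF_CLOBBER,       &
                          MPI_INFO_NULL, ncid)
      if (ierr .ne. NF_NOERR) print *, 'create: ', nfmpi_strerror(ierr)
      ierr = nfmpi_def_dim(ncid, 'x', nx, dimids(1))
      ierr = nfmpi_def_dim(ncid, 'y', ny, dimids(2))
      ierr = nfmpi_def_dim(ncid, 'z', nz, dimids(3))
      ierr = nfmpi_def_var(ncid, 'tt', NF_REAL, 3, dimids, tt_id)
      ierr = nfmpi_enddef(ncid)

      ! each rank writes the plane at index z = rank+1
      start_3d(1) = 1
      start_3d(2) = 1
      start_3d(3) = rank + 1
      count_3d(1) = nx
      count_3d(2) = ny
      count_3d(3) = 1
      tt = real(rank)

      ierr = nfmpi_put_vara_real_all(ncid, tt_id, start_3d, count_3d, tt)
      if (ierr .ne. NF_NOERR) print *, 'put:    ', nfmpi_strerror(ierr)
      ierr = nfmpi_close(ncid)
      if (ierr .ne. NF_NOERR) print *, 'close:  ', nfmpi_strerror(ierr)

      call MPI_Finalize(ierr)
      end program minimal_pnetcdf_test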

CAM3.1 with PnetCDF already works well on many other platforms (IBM 
SP3-5 and Cray), but it fails on both NCAR's and ANL's BG/L systems. 
The benchmark porting tests for CAM on BG/L are delayed because of 
this problem, and a presentation in July (ScicomP12) may have to be 
canceled if we still cannot get results by then. That is why we are 
eager to get this fixed as soon as possible. Thanks again for your help!

Cheers
Yu-heng
---------------------------------------------------
Yu-Heng Tseng

Computational Research Division
Lawrence Berkeley National Laboratory
One Cyclotron Rd, MS: 50F-1650
Berkeley, CA 94720
YHTseng at lbl.gov
510.495.2904

----- Original Message -----
From: robl at mcs.anl.gov (Robert Latham)
Date: Tuesday, June 6, 2006 8:09 am
Subject: Re: Inconsistent results on bluegene (reproduce the same problem on ANL's BG/L)

> On Sat, Jun 03, 2006 at 07:52:09AM -0700, Yu-Heng Tseng wrote:
> > Thanks for checking this out. However, could you get more details?
> > It is very strange that the inconsistency always occurs on
> > nodes=2,8,16 (when testing nodes=2,4,8,16,32,64,128). This is true
> > for both ANL's BG/L and NCAR's BG/L. This is also true for
> > different file systems; I believe NCAR's BG/L also uses a different
> > file system. Does that imply that parallel I/O is still not stable
> > on BG/L so far? Any way to fix this? Thanks a lot for your
> > investigation.
> 
> Well, I don't know how Lustre or GPFS file systems are exported to
> BGL compute nodes.  In Argonne's case, both the NFS-exported home
> directories and PVFS2 are treated by the MPI-IO implementation as a
> unix file system.  Because both file systems lack certain unix-like
> characteristics (caching and locking behaviors), treating them like
> a unix file system will work a lot of the time, but not always.
> 
> The fastest way to fix this is for IBM to rebuild their MPI-IO with
> support for NFS.  As of V1R2M1_020_2006-060110, there is no NFS
> support in the MPI-IO implementation.  I've asked our BGL guys about
> this.  Native support for PVFS2 is a bit harder than a recompile, but
> we're working on it.
> 
> In the meantime, do try with real applications.  There are many
> workloads (as you have seen) that do not exhibit this failure.  If
> you can provide additional applications and workloads that do fail,
> that would be good motivation for an updated MPI-IO implementation.
> 
> Thanks 
> ==rob
> 
> -- 
> Rob Latham
> Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
> Argonne National Labs, IL USA                B29D F333 664A 4280 315B
> 
> 



