Inconsistent results on bluegene (reproduce the same problem on ANL's BG/L)

Robert Latham robl at mcs.anl.gov
Tue Jun 13 08:23:38 CDT 2006


On Mon, Jun 12, 2006 at 10:30:27PM -0700, Yu-Heng Tseng wrote:
> Dear Rob,
> 
> YES! That helps for nodes=16,32 runs to get correct results. For 
> node=2 run, it still gives wrong answers. Can you explain this? It 
> really helps but not totally why? Really thanks a lot for your help! 

Well, the BGLMPIO_TUNEBLOCKING workaround doesn't address the core
issue: that the MPI-IO implementation is making incorrect assumptions
about the underlying file system.  When you set BGLMPIO_TUNEBLOCKING,
you take a slower, unoptimized code path that needs to make fewer
assumptions.  That gets you pretty close, but as you've seen there
still are places where you get in trouble.  I've queued up a 2-run
job on PVFS to see if pnf_test gives bad results there.

> For CAM3.1 application, that also works and CAM can run
> successfully.  The previous errors (crashes) I mentioned are gone.
> However, I need to verify if we get all results correct.

That's great news.  Sorry it's taken so long to get you going. Hope
you still have time to collect enough data for your presentation.

==rob

-- 
Rob Latham
Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
Argonne National Labs, IL USA                B29D F333 664A 4280 315B



