Problem on Blue Gene/P

Wei-keng Liao wkliao at ece.northwestern.edu
Mon Jun 15 22:53:58 CDT 2009



I agree with this suggestion. So, there are three options.

1. When pnetcdf is built in debug mode (chosen at configure time), we
    will enable consistency checking across all processes. In this
    case, an inconsistency is considered a fatal error. For debugging
    purposes, this should be fine. Note that consistency checking may
    become costly, especially when the number of processes is large.
    We do not expect a production pnetcdf to be built in debug mode.

2. When built without debug mode, pnetcdf will only take process 0's
    inputs and ignore all others. Also, consistency checking is
    disabled.

3. A middle ground: enable consistency checking, but if an
    inconsistency is detected, use only process 0's inputs to define
    variables, attributes, etc. The error is not fatal; pnetcdf only
    prints a warning message.

If these are fine with the pnetcdf community, we will start to
implement them.
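
To make the difference between the three options concrete, here is a
minimal sketch in Python. It is not pnetcdf code: the function name
resolve_dims, the policy names "debug", "rank0" and "warn", and the
exception class are all hypothetical, and the list dims_per_rank stands
in for whatever collective exchange a real MPI implementation would
need in order to compare the processes' inputs.

```python
import warnings


class InconsistentDefinitionError(Exception):
    """Hypothetical error for option 1 (debug mode): ranks disagree."""


def resolve_dims(dims_per_rank, policy):
    """Pick the dimensions to use for one variable.

    dims_per_rank[i] is the shape that rank i passed in define mode.
    policy is a hypothetical name for one of the three options:
      "debug" -> option 1: check, and treat a mismatch as fatal
      "rank0" -> option 2: no check, rank 0's inputs always win
      "warn"  -> option 3: check, warn on mismatch, rank 0's inputs win
    """
    if policy not in ("debug", "rank0", "warn"):
        raise ValueError("unknown policy: %r" % (policy,))

    root = dims_per_rank[0]

    # Option 2: skip the consistency check entirely.
    if policy == "rank0":
        return root

    # Options 1 and 3: compare every rank's inputs against rank 0's.
    mismatched = [r for r, d in enumerate(dims_per_rank) if d != root]
    if not mismatched:
        return root

    msg = "dimensions on ranks %s differ from rank 0's %s" % (mismatched, root)
    if policy == "debug":
        raise InconsistentDefinitionError(msg)

    # Option 3: not fatal, only a warning; rank 0's view is used.
    warnings.warn(msg)
    return root
```

With P0 passing (10, 10) and P1 passing (10, 11), "rank0" and "warn"
both return (10, 10), the latter after a warning, while "debug" raises
the error.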

Wei-keng

On Jun 15, 2009, at 8:21 PM, Yu-heng Tseng wrote:

> Thanks Rob and Wei-keng,
> It would be a good idea to make this check a warning only. In most
> realistic applications (including the Community Atmospheric Model,
> within CCSM development), it's almost impossible to have the same
> dimension for the same array variable within different processes.
> Usually the sizes differ by 1 or 2. Cheers,
> Yu-heng Tseng
> Department of Atmospheric Sciences
> National Taiwan University
> No. 1, Sec. 4, Roosevelt Rd, Taipei 106, Taiwan
> tel: 886-2-33663918
> email: yhtseng at as.ntu.edu.tw
>
> ----- Original Message -----
> From: wkliao at ece.northwestern.edu
> To: wkliao at ece.northwestern.edu; parallel-netcdf <parallel-netcdf at lists.mcs.anl.gov>
> Sent: 2009-06-16 03:00:01
> Subject: Re: Problem on Blue Gene/P
>
> For example, suppose a new 2D array variable is being defined and
> there are 2 processes, P0 and P1. The metadata (array dimensions,
> attributes, etc. in define mode) must be the same on P0 and P1. If
> P0 uses 10x10 dimension values and P1 uses 10x11, then this error
> message will appear.
>
> Wei-keng
>
> On Jun 15, 2009, at 12:57 PM, Julien Bodart wrote:
>
> > Thanks everybody for your help.
> >
> > I am afraid I don't get the point "your code is defining netcdf
> > variables and attributes in a slightly different way on some MPI
> > processes than others"... depending on what?
> >
> > Another test I could try is to disable the check made by
> > ncmpi_enddef, if that is possible, and see what kind of output
> > file I get. I don't know whether it can be done easily without
> > recompiling the library.
> >
> > I will try the binary-search debugging anyway.
> >
> >
> > 2009/6/15 Rob Latham
> > On Fri, Jun 12, 2009 at 02:19:33PM +0200, Julien Bodart wrote:
> > > While it does not create any problems on small cases, bigger
> > > cases stop at the ncmpi_enddef call on some files (randomly, even
> > > with synchronisation in between), saying that there is a mismatch
> > > between dimensions. After many checks, it does not seem that
> > > there is anything wrong with the dimensions. I have no idea how
> > > to solve the problem. Has anyone had a similar problem?
> > > Thanks for your help.
> >
> > Hi Julien. Wei-keng is right: I know you've checked carefully, but
> > some part of your code is defining netcdf variables and attributes
> > in a slightly different way on some MPI processes than others.
> >
> > The main way people debug this is through binary search: comment out
> > half of the define-mode portion; if the problem persists, comment
> > out half of the remainder, else, try with the other half.
> >
> > You're not the first to encounter this problem.  Maybe this could
> > be a warning and not an error, and maybe we should just have the
> > define mode view as rank 0 sees it be the one that wins if there's
> > a discrepancy.   I don't know how many people (if any) rely on the
> > current behavior to find problems.
> >
> > ==rob
> >
> > --
> > Rob Latham
> > Mathematics and Computer Science Division
> > Argonne National Lab, IL USA
> >