Problem on Blue Gene/P

Rob Ross rross at mcs.anl.gov
Tue Jun 16 12:44:09 CDT 2009


Hi Jim, yes, this might be the cleanest way to solve the problem. -- Rob

On Jun 16, 2009, at 11:47 AM, Jim Edwards wrote:

> Can we find a way to do (a) and make it backward compatible? The
> only way I can think of is to ignore those calls on ranks other than
> root, without requiring them. Print something if in debug mode,
> otherwise just ignore them.
>
> On Tue, Jun 16, 2009 at 10:26 AM, Rob Ross <rross at mcs.anl.gov> wrote:
> Hey,
>
> I guess I feel like either (a) the API should require these calls  
> from only one process, or (b) the values from the processes should  
> match up. We can do (c) everyone passes something in, but we ignore  
> everything from non-root processes, but that semantic doesn't make  
> any sense really, right? Why bother to make the user make the call  
> on the other processes if we're not going to look at the input?
>
> The only real argument for (c) is that it doesn't require existing  
> users to change their code, which is definitely a factor...
>
> Rob
>
>
> On Jun 16, 2009, at 11:15 AM, Jim Edwards wrote:
>
> If you make it collective but only valid on rank 0 the change should  
> be transparent to most users.
>
> I can't think of a reason at the moment that you shouldn't just make  
> it independent, but I don't want you to do anything that would  
> require developers to change code that they already have.   So when  
> you say invalid I hope you don't mean error generating.
>
> On Tue, Jun 16, 2009 at 10:06 AM, Rob Ross <rross at mcs.anl.gov> wrote:
> Hi Jim,
>
> Yeah, I agree that (given that we're bcast()ing the header around to  
> check it anyway) there isn't much advantage to what we're doing now.
>
> So what would the API look like? Would we make all the define mode  
> calls independent and only valid on rank 0 of the communicator used  
> to open the dataset? Do we make them collective but ignore values  
> from all other processes?
>
> Rob
>
>
> On Jun 16, 2009, at 10:46 AM, Jim Edwards wrote:
>
> Seems like it wouldn't be any harder to implement, and would be more
> user friendly, if you use the value from the root process and ignore
> the others. In my application there are several attributes which
> are set on the root task and which are not needed on the other tasks
> except to meet the requirement of pnetcdf. If we remove this
> requirement, I could get rid of several otherwise unnecessary
> mpi_bcast calls.
>
>
>
> On Tue, Jun 16, 2009 at 9:38 AM, Rob Ross <rross at mcs.anl.gov> wrote:
> Hi Julien,
>
> Huh. I wouldn't have expected that. Perhaps we should adjust the  
> PnetCDF semantics for the *values* such that you get a value from  
> some process, not defined which one if they are different?
>
> Rob
>
>
> On Jun 16, 2009, at 10:27 AM, Julien Bodart wrote:
>
> Hi Rob,
>
> This was in the value actually:
>
> current_time = time(NULL);
> GSTRING_ASS(my_string, ctime(&current_time));
> status = ncmpi_put_att_text(nc_id, NC_GLOBAL, "time_stamp",
>                             (MPI_Offset) (my_string->len - 1), my_string->str);
>
> GSTRING_ASS is a macro that returns a string, where a string consists
> of a char array (str) and the number of elements in that array (len).
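
The workaround implied by this message is to build the time stamp on one process and share it, so that every rank passes identical text to ncmpi_put_att_text. The following is only a minimal sketch under stated assumptions: make_time_stamp and STAMP_LEN are hypothetical names introduced for illustration, not part of PnetCDF or of Julien's code, and the MPI calls appear only in a comment so the block compiles without an MPI installation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

#define STAMP_LEN 32 /* hypothetical buffer size; ctime() needs 26 bytes */

/* Hypothetical helper: copy ctime(t) into buf without its trailing
 * newline and return the resulting string length. */
size_t make_time_stamp(time_t t, char *buf, size_t bufsize)
{
    assert(buf != NULL && bufsize >= 26);
    const char *text = ctime(&t);
    assert(text != NULL);
    strncpy(buf, text, bufsize - 1); /* pads the rest of buf with zeros */
    buf[bufsize - 1] = '\0';
    buf[strcspn(buf, "\n")] = '\0';  /* drop the newline ctime() appends */
    return strlen(buf);
}

/* In an MPI program (not compiled here), only rank 0 would call
 * make_time_stamp(); every rank would then receive the same bytes
 * before the collective define-mode call:
 *
 *   MPI_Bcast(stamp, STAMP_LEN, MPI_CHAR, 0, comm);
 *   status = ncmpi_put_att_text(nc_id, NC_GLOBAL, "time_stamp",
 *                               (MPI_Offset) strlen(stamp), stamp);
 */
```

Because every rank then holds the same buffer, the consistency check at ncmpi_enddef has nothing to disagree about. The cost is the extra broadcast that Jim describes elsewhere in this thread as otherwise unnecessary.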
>
> Regards,
>
> Julien
>
> 2009/6/16 Rob Ross <rross at mcs.anl.gov>
> Hi Julien,
>
> So you were using the time in the name of the attribute, not in the  
> value?
>
> Thanks,
>
> Rob
>
>
> On Jun 16, 2009, at 7:22 AM, Julien Bodart wrote:
>
> Hi everybody,
>
> I finally managed to remove this bug, which was of course coming from
> my source code!
> The culprit: a "time" global attribute coming from the "ctime"
> function, which of course differs across a large number of
> processors... I know this is silly, but I was not expecting
> a check on the global attributes, especially with the error message
> "NC definitions mismatch".
> So I have to apologize for such a stupid mistake, but at the same
> time, it reinforces the idea of rethinking this check function.
> Thanks again.
>
> Julien
>
> 2009/6/16 Wei-keng Liao <wkliao at ece.northwestern.edu>
>
>
> I agree with this suggestion. So, there are three options.
>
> 1. When pnetcdf is built with debug mode (at configure time), we will
>    enable the consistency checking across all processes. In this case,
>    the error is considered fatal. For debugging purpose, this should be
>    fine. Note that the consistency checking may become costly,
>    especially when the number of processes is large. We do not expect a
>    production pnetcdf to be built with the debug mode.
>
> 2. When built without debug mode, pnetcdf will only take process 0's
>    inputs and ignore all others. Also, consistency checking is disabled.
>
> 3. A middle ground: enable consistency checking but only process 0's
>    inputs are used to define variables, attributes, etc. if
>    inconsistency is detected. The error is not fatal, but only gives a
>    warning message.
>
> If these are fine to the pnetcdf community, we will start to  
> implement them.
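
The three options above can be sketched as one resolution routine that differs only in what it does when a rank disagrees with process 0. This is a plain-C simulation under stated assumptions, not PnetCDF code: resolve_attribute and the check_policy names are hypothetical, and the per-rank values are passed in as an array instead of being gathered over MPI.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical policy names mirroring options 1, 2 and 3. */
enum check_policy { CHECK_FATAL, ROOT_ONLY, CHECK_WARN };

/* values[i] holds the attribute text supplied by rank i.
 * Returns the text to store, or NULL when the policy treats a
 * mismatch as a fatal error. */
const char *resolve_attribute(const char *const *values, int nprocs,
                              enum check_policy policy)
{
    assert(values != NULL && nprocs > 0);

    if (policy == ROOT_ONLY)
        return values[0]; /* option 2: no checking at all */

    for (int rank = 1; rank < nprocs; rank++) {
        if (strcmp(values[rank], values[0]) != 0) {
            if (policy == CHECK_FATAL)
                return NULL; /* option 1: mismatch is an error */
            /* option 3: report the mismatch, then let rank 0 win */
            fprintf(stderr,
                    "warning: rank %d differs from rank 0; "
                    "using rank 0's value\n", rank);
            break;
        }
    }
    return values[0];
}
```

Only the first and third policies pay for the comparison loop, which stands in for the cross-process check that becomes costly at large process counts.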
>
> Wei-keng
>
>
> On Jun 15, 2009, at 8:21 PM, Yu-hengTseng wrote:
>
> Thanks Rob and Wei-keng,
> It would be a good idea to make this check a warning only. In most
> realistic applications (including the Community Atmospheric Model,
> within CCSM development), it's almost impossible to have the same
> dimension for the same array variable within different processes.
> Usually they are shifted by 1 or 2. Cheers,
> Yu-heng Tseng
> Department of Atmospheric Sciences
> National Taiwan University
> No. 1, Sec. 4, Roosevelt Rd, Taipei 106, Taiwan
> tel: 886-2-33663918
> email: yhtseng at as.ntu.edu.tw
>
> ----- Original Message -----
> From: wkliao at ece.northwestern.edu
> To: wkliao at ece.northwestern.edu; parallel-netcdf <parallel-netcdf at lists.mcs.anl.gov>
> Sent: 2009-06-16 03:00:01
> Subject: Re: Problem on Blue Gene/P
>
> For example, consider defining a new 2D array variable when the number
> of processes is 2, P0 and P1. The metadata (array dimensions,
> attributes, etc. in define mode) must be the same between P0 and P1.
> If P0 uses 10x10 dimension values and P1 uses 10x11, then this error
> message will appear.
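
The comparison described in this example can be illustrated without MPI. The sketch below assumes every rank's dimension list has already been collected into one array; first_mismatched_rank is a hypothetical helper for illustration and is not how PnetCDF implements its check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: dims holds nprocs rows of ndims values each,
 * row i being the dimension sizes that rank i used in define mode.
 * Returns the first rank whose row differs from rank 0's, or -1 if
 * all ranks agree. */
int first_mismatched_rank(const long *dims, int nprocs, int ndims)
{
    assert(dims != NULL && nprocs > 0 && ndims > 0);

    for (int rank = 1; rank < nprocs; rank++) {
        for (int d = 0; d < ndims; d++) {
            if (dims[(size_t)rank * ndims + d] != dims[d])
                return rank;
        }
    }
    return -1;
}
```

With P0 using 10x10 and P1 using 10x11, the rows are {10, 10} and {10, 11}, so the helper reports rank 1 as the first process that disagrees with rank 0.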
>
> Wei-keng
>
> On Jun 15, 2009, at 12:57 PM, Julien Bodart wrote:
>
> > Thanks everybody for your help.
> >
> > I am afraid I don't get the point "your code is defining netcdf
> > variables and attributes in
> > a slightly different way on some MPI processes than others"...
> > depending on what?
> >
> > Another test I could try is to disable the check made by ncmpi_enddef
> > if it is possible, and see which kind of output file I get.
> > I don't know if it is possible to do it easily without recompiling
> > the library.
> >
> > I will try anyway the binary debugging.
> >
> >
> > 2009/6/15 Rob Latham
> > On Fri, Jun 12, 2009 at 02:19:33PM +0200, Julien Bodart wrote:
> > > While it does not create any problems on small cases, bigger cases
> > > stop at the ncmpi_enddef call on some files (randomly, even with
> > > synchronisation in between), saying that there is a mismatch between
> > > dimensions. After many checks, it does not seem that there is
> > > anything wrong with the dimensions. I have no idea how to solve the
> > > problem. Has anyone had a similar problem?
> > > Thanks for your help.
> >
> > Hi Julien. Wei-keng is right: I know you've checked carefully, but
> > some part of your code is defining netcdf variables and attributes in
> > a slightly different way on some MPI processes than others.
> >
> > The main way people debug this is through binary search: comment out
> > half of the define-mode portion; if the problem persists, comment out
> > half of the remainder, else, try with the other half.
> >
> > You're not the first to encounter this problem.  Maybe this could be a
> > warning and not an error, and maybe we should just have the define
> > mode view as rank 0 sees it be the one that wins if there's a
> > discrepancy.   I don't know how many people (if any) rely on the
> > current behavior to find problems.
> >
> > ==rob
> >
> > --
> > Rob Latham
> > Mathematics and Computer Science Division
> > Argonne National Lab, IL USA
> >


