Problem on Blue Gene/P

Jim Edwards edwards.jim at gmail.com
Tue Jun 16 11:47:12 CDT 2009


Can we find a way to do (a) and keep it backward compatible? The only way
I can think of is to ignore those calls on processes other than root, but
not require them. Print something if in debug mode; otherwise just
ignore them.

On Tue, Jun 16, 2009 at 10:26 AM, Rob Ross <rross at mcs.anl.gov> wrote:

> Hey,
>
> I guess I feel like either (a) the API should require these calls from only
> one process, or (b) the values from the processes should match up. We can do
> (c) everyone passes something in, but we ignore everything from non-root
> processes, but that semantic doesn't make any sense really, right? Why
> bother to make the user make the call on the other processes if we're not
> going to look at the input?
>
> The only real argument for (c) is that it doesn't require existing users to
> change their code, which is definitely a factor...
>
> Rob
>
>
> On Jun 16, 2009, at 11:15 AM, Jim Edwards wrote:
>
>  If you make it collective but only valid on rank 0 the change should be
>> transparent to most users.
>>
>> I can't think of a reason at the moment that you shouldn't just make it
>> independent, but I don't want you to do anything that would require
>> developers to change code that they already have.   So when you say invalid
>> I hope you don't mean error generating.
>>
>> On Tue, Jun 16, 2009 at 10:06 AM, Rob Ross <rross at mcs.anl.gov> wrote:
>> Hi Jim,
>>
>> Yeah, I agree that (given that we're bcast()ing the header around to check
>> it anyway) there isn't much advantage to what we're doing now.
>>
>> So what would the API look like? Would we make all the define mode calls
>> independent and only valid on rank 0 of the communicator used to open the
>> dataset? Do we make them collective but ignore values from all other
>> processes?
>>
>> Rob
>>
>>
>> On Jun 16, 2009, at 10:46 AM, Jim Edwards wrote:
>>
>> Seems like it wouldn't be any harder to implement and would be more user
>> friendly if you use the value from the root process and ignore the others.
>> In my application there are several attributes which are set on the root
>> task and which are not needed on the other tasks except to meet the
>> requirement of pnetcdf. If we remove this requirement, I could get rid of
>> several otherwise unnecessary mpi_bcast calls.
>>
>>
>>
>> On Tue, Jun 16, 2009 at 9:38 AM, Rob Ross <rross at mcs.anl.gov> wrote:
>> Hi Julien,
>>
>> Huh. I wouldn't have expected that. Perhaps we should adjust the PnetCDF
>> semantics for the *values* such that you get a value from some process, not
>> defined which one if they are different?
>>
>> Rob
>>
>>
>> On Jun 16, 2009, at 10:27 AM, Julien Bodart wrote:
>>
>> Hi Rob,
>>
>> This was in the value actually:
>>
>> current_time = time(NULL);
>> GSTRING_ASS(my_string, ctime(&current_time));
>> status = ncmpi_put_att_text(nc_id, NC_GLOBAL, "time_stamp",
>>     (MPI_Offset)(my_string->len - 1), my_string->str);
>>
>> GSTRING_ASS is a macro that returns a string, where a string is a
>> char buffer (str) plus the number of elements in that buffer (len).
>>
>> Regards,
>>
>> Julien
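
One way to avoid this kind of mismatch, sketched here under some assumptions: take the time on a single process, share that one value, and only then format the attribute text, so that every process passes identical bytes to ncmpi_put_att_text. The helper name format_time_stamp is hypothetical (it is not part of PnetCDF), and the MPI_Bcast step appears only as a comment so the sketch compiles without MPI.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper (not part of PnetCDF): format t as ctime() text
 * without the trailing newline. Returns the text length, or 0 on failure. */
size_t format_time_stamp(time_t t, char *buf, size_t buflen)
{
    assert(buf != NULL);
    const char *text = ctime(&t);          /* "Www Mmm dd hh:mm:ss yyyy\n" */
    if (text == NULL)
        return 0;
    size_t len = strlen(text);
    if (len > 0 && text[len - 1] == '\n')
        len--;                             /* drop the newline */
    if (len + 1 > buflen)
        return 0;                          /* caller's buffer is too small */
    memcpy(buf, text, len);
    buf[len] = '\0';
    return len;
}

/* Intended use inside an MPI program (an assumption, shown as a comment):
 *
 *   long long stamp = (long long) time(NULL);      // each rank's own clock
 *   MPI_Bcast(&stamp, 1, MPI_LONG_LONG, 0, comm);  // keep rank 0's value
 *   len = format_time_stamp((time_t) stamp, buf, sizeof buf);
 *   ncmpi_put_att_text(nc_id, NC_GLOBAL, "time_stamp",
 *                      (MPI_Offset) len, buf);
 */
```

Two clock readings one second apart already give different text, which is enough to trip the consistency check; formatting from a single shared value removes that source of disagreement.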
>>
>> 2009/6/16 Rob Ross <rross at mcs.anl.gov>
>> Hi Julien,
>>
>> So you were using the time in the name of the attribute, not in the value?
>>
>> Thanks,
>>
>> Rob
>>
>>
>> On Jun 16, 2009, at 7:22 AM, Julien Bodart wrote:
>>
>> Hi everybody,
>>
>> I finally managed to remove this bug, which was of course coming from my
>> source code!
>> The culprit: a "time" global attribute coming from the "ctime" function,
>> which of course is different across a large number of processors... I know
>> this is silly, but I was not expecting a check on the global
>> attributes, especially with the error message "NC definitions mismatch".
>> So I have to apologize for such a stupid mistake, but at the same time, it
>> reinforces the idea of rethinking this check function.
>> Thanks again.
>>
>> Julien
>>
>> 2009/6/16 Wei-keng Liao <wkliao at ece.northwestern.edu>
>>
>>
>> I agree with this suggestion. So, there are three options.
>>
>> 1. When pnetcdf is built in debug mode (at configure time), we will enable
>>  consistency checking across all processes. In this case, the error is
>>  considered fatal. For debugging purposes, this should be fine. Note that
>>  the consistency checking may become costly, especially when the number of
>>  processes is large. We do not expect a production pnetcdf to be built in
>>  debug mode.
>>
>> 2. When built without debug mode, pnetcdf will only take process 0's
>>  inputs and ignore all others. Also, consistency checking is disabled.
>>
>> 3. A middle ground: enable consistency checking, but if an inconsistency is
>>  detected, use only process 0's inputs to define variables, attributes,
>>  etc. The error is not fatal; it only produces a warning message.
>>
>> If these are fine with the pnetcdf community, we will start to implement
>> them.
>>
>> Wei-keng
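
To make option 3 concrete, here is a small model in plain C. It is not PnetCDF's actual code: the per-process values are given as an array indexed by rank, standing in for whatever a real implementation would gather or broadcast over MPI. The function name resolve_with_root and its interface are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Model of option 3: rank 0's value wins and a mismatch only warns.
 * values[r] is the value rank r passed in (e.g. a dimension length).
 * Stores the value to use in *chosen and returns the number of ranks
 * whose input differed from rank 0's. */
int resolve_with_root(const long long *values, int nprocs, long long *chosen)
{
    assert(values != NULL && chosen != NULL && nprocs > 0);
    int mismatches = 0;
    *chosen = values[0];                   /* rank 0 is authoritative */
    for (int r = 1; r < nprocs; r++) {
        if (values[r] != values[0]) {
            fprintf(stderr,
                    "warning: rank %d passed %lld, rank 0 passed %lld\n",
                    r, values[r], values[0]);
            mismatches++;                  /* non-fatal: count and go on */
        }
    }
    return mismatches;
}
```

For instance, with the values {10, 11} from two processes, the function reports one mismatch and keeps 10. Option 1 would instead treat a nonzero count as fatal, and option 2 would skip the loop entirely.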
>>
>>
>> On Jun 15, 2009, at 8:21 PM, Yu-heng Tseng wrote:
>>
>> Thanks Rob and Wei-keng,
>> It would be a good idea to make this check a warning only. In most
>> realistic applications (including Community Atmospheric Model, within CCSM
>> development), it's almost impossible to have the same dimensions for the same
>> array variable on different processes. Usually they differ by a shift of 1 or 2. Cheers,
>> Yu-heng Tseng
>> Department of Atmospheric Sciences
>> National Taiwan University
>> No. 1, Sec. 4, Roosevelt Rd, Taipei 106, Taiwan
>> tel: 886-2-33663918
>> email: yhtseng at as.ntu.edu.tw
>>
>> ----- Original Message -----
>> From: wkliao at ece.northwestern.edu
>> To: wkliao at ece.northwestern.edu;parallel-netcdf<
>> parallel-netcdf at lists.mcs.anl.gov>
>> Sent: 2009-06-16 03:00:01
>> Subject: Re: Problem on Blue Gene/P
>>
>> For example, suppose a new 2D array variable is defined on two
>> processes, P0 and P1. The metadata (array dimensions, attributes, etc.
>> in define mode) must be the same on P0 and P1. If P0 uses 10x10
>> dimension values and P1 uses 10x11, then this error message will appear.
>>
>> Wei-keng
>>
>> On Jun 15, 2009, at 12:57 PM, Julien Bodart wrote:
>>
>> > Thanks everybody for your help.
>> >
>> > I am afraid I don't get the point "your code is defining netcdf
>> > variables and attributes in a slightly different way on some MPI
>> > processes than others"... depending on what?
>> >
>> > Another test I could try is to disable the check made by ncmpi_enddef,
>> > if that is possible, and see what kind of output file I get.
>> > I don't know if it is possible to do that easily without recompiling
>> > the library.
>> >
>> > I will try the binary-search debugging anyway.
>> >
>> >
>> > 2009/6/15 Rob Latham
>> > On Fri, Jun 12, 2009 at 02:19:33PM +0200, Julien Bodart wrote:
>> > > While it does not create any problems on small cases, bigger cases stop
>> > > at the ncmpi_enddef call on some files (randomly, even with
>> > > synchronisation in between), saying that there is a mismatch between
>> > > dimensions. After many checks it does not seem that there is anything
>> > > wrong with the dimensions. I have no idea how to solve the problem.
>> > > Has anyone had a similar problem?
>> > > Thanks for your help.
>> >
>> > Hi Julien. Wei-keng is right: I know you've checked carefully, but
>> > some part of your code is defining netcdf variables and attributes in
>> > a slightly different way on some MPI processes than others.
>> >
>> > The main way people debug this is through binary search: comment out
>> > half of the define-mode portion; if the problem persists, comment out
>> > half of the remainder, else, try with the other half.
>> >
>> > You're not the first to encounter this problem.  Maybe this could be a
>> > warning and not an error, and maybe we should just have the define
>> > mode view as rank 0 sees it be the one that wins if there's a
>> > discrepancy.   I don't know how many people (if any) rely on the
>> > current behavior to find problems.
>> >
>> > ==rob
>> >
>> > --
>> > Rob Latham
>> > Mathematics and Computer Science Division
>> > Argonne National Lab, IL USA
>> >
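
The binary search described in the message above can be sketched as a small bisection routine. This is a model only, under two assumptions that may not hold in a real code: exactly one define-mode call is responsible for the mismatch, and any contiguous subset of the calls can be enabled on its own (in practice variables depend on their dimensions, so some calls cannot be commented out independently). The names bisect_define_calls and contains_bad are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Callback: nonzero if the mismatch still shows up when only the
 * define-mode calls numbered lo .. hi-1 are left enabled. */
typedef int (*fails_fn)(int lo, int hi, void *ctx);

/* Returns the index of the single offending call among calls 0 .. n-1,
 * or -1 if the full set does not reproduce the problem. */
int bisect_define_calls(int n, fails_fn fails, void *ctx)
{
    assert(fails != NULL);
    int lo = 0, hi = n;
    if (n <= 0 || !fails(lo, hi, ctx))
        return -1;
    while (hi - lo > 1) {
        int mid = lo + (hi - lo) / 2;
        if (fails(lo, mid, ctx))
            hi = mid;      /* problem persists in the first half */
        else
            lo = mid;      /* otherwise it must be in the other half */
    }
    return lo;
}

/* Stand-in for "rebuild, rerun, look for the error": here the index of
 * the bad call is simply passed in through ctx. */
int contains_bad(int lo, int hi, void *ctx)
{
    int bad = *(const int *)ctx;
    return lo <= bad && bad < hi;
}
```

With n define-mode calls this needs about log2(n) rebuild-and-run cycles, which is why the approach stays practical even for long define-mode sections.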