Problem on Blue Gene/P
Jim Edwards
edwards.jim at gmail.com
Tue Jun 16 11:15:15 CDT 2009
If you make it collective but only valid on rank 0 the change should be
transparent to most users.
I can't think of a reason at the moment that you shouldn't just make it
independent, but I don't want you to do anything that would require
developers to change code that they already have. So when you say invalid
I hope you don't mean error generating.
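
To make the "rank 0 wins" idea concrete, here is a minimal sketch that simulates it in plain C without MPI. The `header_t` type and both function names are hypothetical and are not part of PnetCDF; the array index simply stands in for the MPI rank. It only illustrates the policy discussed in this thread: compare what each rank supplied, warn on a mismatch instead of failing, and keep rank 0's definitions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the define-mode metadata one process supplies.
 * This is NOT a PnetCDF type; it only illustrates the policy. */
typedef struct {
    long dims[2];        /* dimension lengths of a 2D variable */
    char time_stamp[32]; /* a global text attribute */
} header_t;

/* Consistency check: count how many ranks disagree with rank 0. */
static int count_mismatches(const header_t *per_rank, int nprocs)
{
    assert(per_rank != NULL && nprocs >= 1);
    int mismatches = 0;
    for (int r = 1; r < nprocs; r++) {
        if (per_rank[r].dims[0] != per_rank[0].dims[0] ||
            per_rank[r].dims[1] != per_rank[0].dims[1] ||
            strcmp(per_rank[r].time_stamp, per_rank[0].time_stamp) != 0)
            mismatches++;
    }
    return mismatches;
}

/* "Rank 0 wins": warn on a mismatch instead of failing, then
 * return the definitions supplied by rank 0. */
static header_t resolve_root_wins(const header_t *per_rank, int nprocs)
{
    int bad = count_mismatches(per_rank, nprocs);
    if (bad > 0)
        fprintf(stderr,
                "warning: %d rank(s) differ from rank 0; "
                "using rank 0's definitions\n", bad);
    return per_rank[0];
}
```

With two simulated ranks, one supplying 10x10 and the other 10x11, `count_mismatches` reports one mismatch and `resolve_root_wins` returns the 10x10 header, so existing code that passes identical values everywhere would behave as before.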
On Tue, Jun 16, 2009 at 10:06 AM, Rob Ross <rross at mcs.anl.gov> wrote:
> Hi Jim,
>
> Yeah, I agree that (given that we're bcast()ing the header around to check
> it anyway) there isn't much advantage to what we're doing now.
>
> So what would the API look like? Would we make all the define mode calls
> independent and only valid on rank 0 of the communicator used to open the
> dataset? Do we make them collective but ignore values from all other
> processes?
>
> Rob
>
>
> On Jun 16, 2009, at 10:46 AM, Jim Edwards wrote:
>
>> Seems like it wouldn't be any harder to implement and would be more user
>> friendly if you use the value from the root process and ignore the others.
>> In my application there are several attributes which are set on the root
>> task and which are not needed on the other tasks except to meet the
>> requirement of pnetcdf. If we remove this requirement I could get rid of
>> several otherwise unnecessary mpi_bcast calls.
>>
>>
>>
>> On Tue, Jun 16, 2009 at 9:38 AM, Rob Ross <rross at mcs.anl.gov> wrote:
>> Hi Julien,
>>
>> Huh. I wouldn't have expected that. Perhaps we should adjust the PnetCDF
>> semantics for the *values* such that you get a value from some process,
>> with which one left undefined if they differ?
>>
>> Rob
>>
>>
>> On Jun 16, 2009, at 10:27 AM, Julien Bodart wrote:
>>
>> Hi Rob,
>>
>> This was in the value actually:
>>
>> current_time = time(NULL);
>> GSTRING_ASS(my_string, ctime(&current_time));
>> status = ncmpi_put_att_text(nc_id, NC_GLOBAL, "time_stamp",
>>                             (MPI_Offset)(my_string->len - 1), my_string->str);
>>
>> GSTRING_ASS is a macro that returns a string, where a string holds a
>> char array (str) and the number of elements in that array (len).
>>
>> Regards,
>>
>> Julien
>>
>> 2009/6/16 Rob Ross <rross at mcs.anl.gov>
>> Hi Julien,
>>
>> So you were using the time in the name of the attribute, not in the value?
>>
>> Thanks,
>>
>> Rob
>>
>>
>> On Jun 16, 2009, at 7:22 AM, Julien Bodart wrote:
>>
>> Hi everybody,
>>
>> I finally managed to remove this bug, which was of course coming from my
>> source code!
>> The culprit: a "time" global attribute coming from the "ctime" function,
>> which of course differs across a large number of processors... I know
>> this is silly, but I was not expecting a check on the global
>> attributes, especially with an error message "NC definitions mismatch".
>> So I have to apologize for such a stupid mistake, but at the same time, it
>> reinforces the idea of rethinking this check function.
>> Thanks again.
>>
>> Julien
>>
>> 2009/6/16 Wei-keng Liao <wkliao at ece.northwestern.edu>
>>
>>
>> I agree with this suggestion. So, there are three options.
>>
>> 1. When pnetcdf is built in debug mode (at configure time), we will enable
>> the consistency checking across all processes. In this case, the error
>> is considered fatal. For debugging purposes, this should be fine. Note
>> that the consistency checking may become costly, especially when the
>> number of processes is large. We do not expect a production pnetcdf to
>> be built in debug mode.
>>
>> 2. When built without debug mode, pnetcdf will only take process 0's
>> inputs and ignore all others. Also, consistency checking is disabled.
>>
>> 3. A middle ground: enable consistency checking, but if an inconsistency
>> is detected, only process 0's inputs are used to define variables,
>> attributes, etc. The error is not fatal; only a warning message is given.
>>
>> If these are fine with the pnetcdf community, we will start to implement
>> them.
>>
>> Wei-keng
>>
>>
>> On Jun 15, 2009, at 8:21 PM, Yu-heng Tseng wrote:
>>
>> Thanks Rob and Wei-keng,
>> It would be a good idea to make this check a warning only. In most
>> realistic applications (including Community Atmospheric Model, within CCSM
>> development), it's almost impossible to have the same dimension for the same
>> array variable within different processes. Usually there is a shift of
>> 1 or 2. Cheers,
>> Yu-heng Tseng
>> Department of Atmospheric Sciences
>> National Taiwan University
>> No. 1, Sec. 4, Roosevelt Rd, Taipei 106, Taiwan
>> tel: 886-2-33663918
>> email: yhtseng at as.ntu.edu.tw
>>
>> ----- Original Message -----
>> From: wkliao at ece.northwestern.edu
>> To: wkliao at ece.northwestern.edu;parallel-netcdf<
>> parallel-netcdf at lists.mcs.anl.gov>
>> Sent: 2009-06-16 03:00:01
>> Subject: Re: Problem on Blue Gene/P
>>
>> For example, suppose a new 2D array variable is defined and the number of
>> processes is 2, P0 and P1. The metadata (array dimensions, attributes,
>> etc. in define mode) must be the same between P0 and P1. If P0 uses
>> 10x10 dimension values and P1 uses 10x11, then this error message will
>> appear.
>>
>> Wei-keng
>>
>> On Jun 15, 2009, at 12:57 PM, Julien Bodart wrote:
>>
>> > Thanks everybody for your help.
>> >
>> > I am afraid I don't get the point "your code is defining netcdf
>> > variables and attributes in
>> > a slightly different way on some MPI processes than others"...
>> > depending on what?
>> >
>> > Another test I could try is to disable the check made by ncmpi_enddef,
>> > if that is possible, and see what kind of output file I get.
>> > I don't know if it is possible to do this easily without recompiling
>> > the library.
>> >
>> > I will try the binary-search debugging anyway.
>> >
>> >
>> > 2009/6/15 Rob Latham
>> > On Fri, Jun 12, 2009 at 02:19:33PM +0200, Julien Bodart wrote:
>> > > While it does not create any problems on small cases, bigger cases stop at
>> > > the ncmpi_enddef call on some files (randomly, even with synchronisation in
>> > > between), saying that there is a mismatch between dimensions. After many
>> > > checks it does not seem that there is anything wrong with the dimensions. I
>> > > have no idea how to solve the problem. Has anyone had a similar problem?
>> > > Thanks for your help.
>> >
>> > Hi Julien. Wei-keng is right: I know you've checked carefully, but
>> > some part of your code is defining netcdf variables and attributes in
>> > a slightly different way on some MPI processes than others.
>> >
>> > The main way people debug this is through binary search: comment out
>> > half of the define-mode portion; if the problem persists, comment out
>> > half of the remainder, else, try with the other half.
>> >
>> > You're not the first to encounter this problem. Maybe this could be a
>> > warning and not an error, and maybe we should just have the define
>> > mode view as rank 0 sees it be the one that wins if there's a
>> > discrepancy. I don't know how many people (if any) rely on the
>> > current behavior to find problems.
>> >
>> > ==rob
>> >
>> > --
>> > Rob Latham
>> > Mathematics and Computer Science Division
>> > Argonne National Lab, IL USA
>> >
>
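
For the time stamp case quoted above, the attribute values diverge because each process calls time() on its own, and ctime() also appends a trailing newline to its result. Below is a minimal sketch of one way to get the same text everywhere, assuming the caller passes a `now` value that all ranks already agree on (in an MPI program that would typically mean obtaining it on rank 0 and sharing it with MPI_Bcast, which is left out here so the example stays standalone). The helper name `make_time_stamp` is hypothetical and is not part of PnetCDF.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper (not part of PnetCDF): format `now` as a time stamp
 * without the trailing newline that ctime() appends.
 * Returns the string length, or 0 if formatting fails or buf is too small. */
static size_t make_time_stamp(time_t now, char *buf, size_t cap)
{
    assert(buf != NULL);
    const char *s = ctime(&now);       /* "Www Mmm dd hh:mm:ss yyyy\n" */
    if (s == NULL)
        return 0;
    size_t len = strcspn(s, "\n");     /* length up to the newline */
    if (len + 1 > cap)                 /* need room for the terminating NUL */
        return 0;
    memcpy(buf, s, len);
    buf[len] = '\0';
    return len;
}
```

Every process that calls this with the same `now` gets an identical string, so the buffer and the returned length could then be passed as the value and length of the attribute. Under the "rank 0 wins" semantics proposed in this thread, the agreement step would no longer be needed.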