performance issue

Jim Edwards jedwards at ucar.edu
Wed Aug 9 18:17:35 CDT 2023


I think that your example case is too simple: it's doing a simple block
decomposition. In order to get the performance difference I am observing
I need to do a more complicated mapping. I will work on a program that
reproduces the problem without PIO, but it may take a while.

On Wed, Aug 9, 2023 at 5:07 PM Wei-Keng Liao <wkliao at northwestern.edu>
wrote:

> Hi, Jim
>
>
> FYI. This is what I used in my runs. The file size is about 46 GB.
>
> srun -n 1024 /tmp/wkliao_nonblocking_write -l 64 /pscratch/sd/w/wkliao/FS_1M_64/nonblocking_write_test
>
> ncmpidump -h /pscratch/sd/w/wkliao/FS_1M_64/nonblocking_write_test | more
> netcdf nonblocking_write_test {
> // file format: CDF-5 (big variables)
> dimensions:
>         time = UNLIMITED ; // (4 currently)
>         z = 1024 ;
>         y = 512 ;
>         x = 512 ;
> variables:
>         int scalar_var_0 ;
>         int scalar_var_1 ;
>         int scalar_var_2 ;
>         int scalar_var_3 ;
>         int scalar_var_4 ;
>         int fix_var_0(z, y, x) ;
>         int fix_var_1(z, y, x) ;
>         int fix_var_2(z, y, x) ;
>         int fix_var_3(z, y, x) ;
>         int fix_var_4(z, y, x) ;
>         int rec_var_0(time, z, y, x) ;
>         int rec_var_1(time, z, y, x) ;
>         int rec_var_2(time, z, y, x) ;
>         int rec_var_3(time, z, y, x) ;
>         int rec_var_4(time, z, y, x) ;
>         int rec_var_5(time, z, y, x) ;
>         int rec_var_6(time, z, y, x) ;
>         int rec_var_7(time, z, y, x) ;
>         int rec_var_8(time, z, y, x) ;
>         int rec_var_9(time, z, y, x) ;
> }
>
> Wei-keng
>
> On Aug 9, 2023, at 5:57 PM, Wei-Keng Liao <wkliao at northwestern.edu> wrote:
>
> I just ran the example program on Perlmutter and did not notice
> any significant difference between runs with and without scalar variables.
> I ran 1024 MPI processes on 8 compute nodes.
>
> Does this happen to you on 1024 processes?
> I can test 2048, but it may take longer.
>
> Wei-keng
>
> On Aug 9, 2023, at 5:16 PM, Jim Edwards <jedwards at ucar.edu> wrote:
>
> The cb_nodes was different between the test program and CESM, but I was
> able to figure out what the issue was and reproduce it in the test
> program, so now cb_nodes is the same for both files. Both are being
> written by the test program, and the only difference between them is
> that the slow one has one additional variable, which is a scalar and is
> not a record variable.
>
> On Wed, Aug 9, 2023 at 3:57 PM Wei-Keng Liao <wkliao at northwestern.edu>
> wrote:
>
>> I thought the cb_nodes values were different between the two runs, based
>> on one of your earlier emails. Can you try the example C program I
>> modified to include scalar and record variables? It reports timings and
>> the cb_nodes value.
>>
>>
>> Wei-keng
>>
>> On Aug 9, 2023, at 4:22 PM, Jim Edwards <jedwards at ucar.edu> wrote:
>>
>> The only difference in the two files is the addition of a single scalar
>> string variable.
>> Why would that significantly change this?
>>
>> I changed the striping on the directory using lfs setstripe -c -1
>> I did this because it exaggerates the performance difference.
>> lmm_stripe_count:  96
>> lmm_stripe_size:   1048576
>> lmm_pattern:       raid0
>> lmm_layout_gen:    0
>> lmm_stripe_offset: 6
>>
>> The original problem was on 32786 tasks - I can now see it on 2048 tasks.
>>
>> On Wed, Aug 9, 2023 at 3:13 PM Wei-Keng Liao <wkliao at northwestern.edu>
>> wrote:
>>
>>> Googling gave me this.
>>> "The number of the gaps and the size is related to how many seek
>>> operation happens and how much is the size of the file in bytes that is
>>> skipped to write the next part."
>>>
>>>
>>> Are you still using the default file striping settings?
>>>
>>> Wei-keng
>>>
>>> On Aug 9, 2023, at 3:51 PM, Jim Edwards <jedwards at ucar.edu> wrote:
>>>
>>> I spent a little time trying to do this but gave up and went back to
>>> using cray profiling tools to get more info.
>>> One thing really stands out to me:
>>>
>>> This is for the fast write:
>>> dec1793.hsn.de.hpc.ucar.edu 0: | number of write gaps = 2
>>> dec1793.hsn.de.hpc.ucar.edu 0: | ave write gap size = 9722924978
>>> dec1793.hsn.de.hpc.ucar.edu 0: --------------------------------------------------------
>>> dec1793.hsn.de.hpc.ucar.edu 0: RESULT: write SUBSET 1 16 64 4060.0217755460 4.5714040530
>>>
>>> And this is for the slow one:
>>> dec1793.hsn.de.hpc.ucar.edu 0: | number of write gaps = 1020
>>> dec1793.hsn.de.hpc.ucar.edu 0: | ave write gap size = 19079761
>>> dec1793.hsn.de.hpc.ucar.edu 0: --------------------------------------------------------
>>> dec1793.hsn.de.hpc.ucar.edu 0: RESULT: write SUBSET 1 16 64 76.2558020443 243.3913158400
>>>
>>>
>>> Do you understand?
>>>
>>> On Tue, Aug 8, 2023 at 11:50 AM Wei-Keng Liao <wkliao at northwestern.edu>
>>> wrote:
>>>
>>>> I have revised the example program to add writes to scalar and record
>>>> variables.
>>>> Let me know if that works for you. URL again is below.
>>>>
>>>>
>>>> https://github.com/Parallel-NetCDF/PnetCDF/blob/master/examples/C/nonblocking_write.c
>>>>
>>>> Wei-keng
>>>>
>>>> On Aug 7, 2023, at 6:10 PM, Jim Edwards <jedwards at ucar.edu> wrote:
>>>>
>>>> That example doesn't include record variables.  Do you have a similar
>>>> one with record vars?
>>>>
>>>>
>>>>
>>>> On Mon, Aug 7, 2023 at 4:32 PM Wei-Keng Liao <wkliao at northwestern.edu>
>>>> wrote:
>>>>
>>>>> Hi, Jim
>>>>>
>>>>> To eliminate the overheads of PIO, I suggest using this PnetCDF
>>>>> example program and adding a scalar variable to see if the same
>>>>> thing happens.
>>>>>
>>>>>
>>>>> https://github.com/Parallel-NetCDF/PnetCDF/blob/master/examples/C/nonblocking_write.c
>>>>>
>>>>> Wei-keng
>>>>>
>>>>> On Aug 7, 2023, at 4:28 PM, Jim Edwards <jedwards at ucar.edu> wrote:
>>>>>
>>>>> Hi Wei-Keng,
>>>>>
>>>>> The cb_nodes doesn't seem to be affected.
>>>>>
>>>>> Not using independent mode doesn't seem to have helped.  I have the
>>>>> pioperf program now writing two files.  One with only
>>>>> decomposed fields and one with one additional field, rundate, which is
>>>>> a string with the date in it.
>>>>>
>>>>> The performance is drastically different:
>>>>>
>>>>>                           IO tasks  vars              Mb/s         Time (s)
>>>>> RESULT: write SUBSET 1         256    64  12067.7548254854    25.1577347560  (without scalar)
>>>>> RESULT: write SUBSET 1         256    64    286.4615089145  1059.8190875640  (with scalar)
>>>>>
>>>>>
>>>>> On Mon, Aug 7, 2023 at 1:47 PM Wei-Keng Liao <wkliao at northwestern.edu>
>>>>> wrote:
>>>>>
>>>>>> Is that the reason why cb_nodes is 1?
>>>>>> Strange, because cb_nodes is set at file open time.
>>>>>>
>>>>>> Entering the independent data mode in PnetCDF can be avoided
>>>>>> completely by using the nonblocking APIs.
>>>>>>
>>>>>> I would suggest that your code use the nonblocking APIs in the
>>>>>> following way.
>>>>>>
>>>>>> /* for non-partitioned variables */
>>>>>> if (rank == 0) {
>>>>>>     /* write the whole variable */
>>>>>>     ncmpi_iput_var_int(fh, varid[0], data[0], &req[0]);
>>>>>>     ncmpi_iput_var_int(fh, varid[1], data[1], &req[1]);
>>>>>>     ...
>>>>>> }
>>>>>> /* for partitioned variables: start and count come before the buffer */
>>>>>> ncmpi_iput_vara_int(fh, varid[j], starts[j], counts[j], data[j],
>>>>>>                     &req[j]);
>>>>>> ...
>>>>>>
>>>>>>
>>>>>> /* commit all posted nonblocking requests (collective call) */
>>>>>> ncmpi_wait_all(fh, NC_REQ_ALL, NULL, NULL);
>>>>>>
>>>>>>
>>>>>> Wei-keng
>>>>>>
>>>>>> > On Aug 7, 2023, at 2:12 PM, Jim Edwards <jedwards at ucar.edu> wrote:
>>>>>> >
>>>>>> > Hi Wei-Keng,
>>>>>> >
>>>>>> > I think that I've found the problem. In the model I am writing a
>>>>>> > number of scalar variables to the file as well as the decomposed
>>>>>> > variables. For the scalar variables I use a code structure like:
>>>>>> >
>>>>>> > ncmpi_begin_indep_data(fh);
>>>>>> > ncmpi_put_vars_int(fh, varid, start, count, stride, data);
>>>>>> > ncmpi_end_indep_data(fh);
>>>>>> >
>>>>>> > In my pioperf test code I didn't write any scalars - this morning I
>>>>>> > added one and the write performance for the decomposed variables got
>>>>>> > very very bad.  What can I do about it?
>>>>>> >
>>>>>> > Jim
>>>>>> >
>>>>>> >
>>>>>> > --
>>>>>> > Jim Edwards
>>>>>> >
>>>>>> > CESM Software Engineer
>>>>>> > National Center for Atmospheric Research
>>>>>> > Boulder, CO
>>>>>>
>>>>>>

-- 
Jim Edwards

CESM Software Engineer
National Center for Atmospheric Research
Boulder, CO