performance issue

Wei-Keng Liao wkliao at northwestern.edu
Wed Aug 9 18:22:27 CDT 2023


In that case, I have the E3SM-IO benchmark, which has a fairly complicated I/O
partitioning pattern. It uses the decomposition maps generated from PIO.
https://github.com/Parallel-NetCDF/E3SM-IO

Wei-keng

On Aug 9, 2023, at 6:17 PM, Jim Edwards <jedwards at ucar.edu> wrote:

I think that your example case is too simple - it's doing a simple block decomposition.
In order to get the performance difference I am observing I need to do a more complicated
mapping.   I will work on a program that reproduces the problem without pio but it may take a
while.

On Wed, Aug 9, 2023 at 5:07 PM Wei-Keng Liao <wkliao at northwestern.edu> wrote:
Hi, Jim


FYI. This is what I used in my runs. The file size is about 46 GB.

srun -n 1024 /tmp/wkliao_nonblocking_write -l 64 /pscratch/sd/w/wkliao/FS_1M_64/nonblocking_write_test

ncmpidump -h /pscratch/sd/w/wkliao/FS_1M_64/nonblocking_write_test | more
netcdf nonblocking_write_test {
// file format: CDF-5 (big variables)
dimensions:
time = UNLIMITED ; // (4 currently)
z = 1024 ;
y = 512 ;
x = 512 ;
variables:
int scalar_var_0 ;
int scalar_var_1 ;
int scalar_var_2 ;
int scalar_var_3 ;
int scalar_var_4 ;
int fix_var_0(z, y, x) ;
int fix_var_1(z, y, x) ;
int fix_var_2(z, y, x) ;
int fix_var_3(z, y, x) ;
int fix_var_4(z, y, x) ;
int rec_var_0(time, z, y, x) ;
int rec_var_1(time, z, y, x) ;
int rec_var_2(time, z, y, x) ;
int rec_var_3(time, z, y, x) ;
int rec_var_4(time, z, y, x) ;
int rec_var_5(time, z, y, x) ;
int rec_var_6(time, z, y, x) ;
int rec_var_7(time, z, y, x) ;
int rec_var_8(time, z, y, x) ;
int rec_var_9(time, z, y, x) ;
}

Wei-keng

On Aug 9, 2023, at 5:57 PM, Wei-Keng Liao <wkliao at northwestern.edu> wrote:

I just ran the example program on Perlmutter and did not notice
any significant difference between the runs with and without scalar variables.
I ran 1024 MPI processes on 8 compute nodes.

Does this happen to you on 1024 processes?
I can test 2048, but it may take longer.

Wei-keng

On Aug 9, 2023, at 5:16 PM, Jim Edwards <jedwards at ucar.edu> wrote:

The cb_nodes value was different between the test program and CESM, but I was able
to figure out what the issue was and reproduce it in the test program. Now cb_nodes is the same for
both files: both are written by the test program, and the only difference between them is that
the slow one has one additional variable, which is a scalar and is not a record variable.

On Wed, Aug 9, 2023 at 3:57 PM Wei-Keng Liao <wkliao at northwestern.edu> wrote:
I thought the cb_nodes values were different between the two runs, based
on one of your earlier emails. Can you try the example C program I modified
to include scalar and record variables? It reports timings and the cb_nodes value.


Wei-keng

On Aug 9, 2023, at 4:22 PM, Jim Edwards <jedwards at ucar.edu> wrote:

The only difference in the two files is the addition of a single scalar string variable.
Why would that significantly change this?

I changed the striping on the directory using lfs setstripe -c -1
I did this because it exaggerates the performance difference.
lmm_stripe_count:  96
lmm_stripe_size:   1048576
lmm_pattern:       raid0
lmm_layout_gen:    0
lmm_stripe_offset: 6

The original problem was on 32786 tasks - I can now see it on 2048 tasks.

On Wed, Aug 9, 2023 at 3:13 PM Wei-Keng Liao <wkliao at northwestern.edu> wrote:
Googling gave me this.
"The number of the gaps and the size is related to how many seek operation happens and how much is the size of the file in bytes that is skipped to write the next part."


Are you still using the default file striping settings?

Wei-keng

On Aug 9, 2023, at 3:51 PM, Jim Edwards <jedwards at ucar.edu> wrote:

I spent a little time trying to do this but gave up and went back to using cray profiling tools to get more info.
One thing really stands out to me:

This is for the fast write:
dec1793.hsn.de.hpc.ucar.edu 0: | number of write gaps = 2
dec1793.hsn.de.hpc.ucar.edu 0: | ave write gap size = 9722924978
dec1793.hsn.de.hpc.ucar.edu 0: --------------------------------------------------------
dec1793.hsn.de.hpc.ucar.edu 0: RESULT: write SUBSET 1 16 64 4060.0217755460 4.5714040530

And this is for the slow one:
dec1793.hsn.de.hpc.ucar.edu 0: | number of write gaps = 1020
dec1793.hsn.de.hpc.ucar.edu 0: | ave write gap size = 19079761
dec1793.hsn.de.hpc.ucar.edu 0: --------------------------------------------------------
dec1793.hsn.de.hpc.ucar.edu 0: RESULT: write SUBSET 1 16 64 76.2558020443 243.3913158400


Do you understand?

On Tue, Aug 8, 2023 at 11:50 AM Wei-Keng Liao <wkliao at northwestern.edu> wrote:
I have revised the example program to add writes to scalar and record variables.
Let me know if that works for you. URL again is below.

https://github.com/Parallel-NetCDF/PnetCDF/blob/master/examples/C/nonblocking_write.c

Wei-keng

On Aug 7, 2023, at 6:10 PM, Jim Edwards <jedwards at ucar.edu> wrote:

That example doesn't include record variables.  Do you have a similar one with record vars?



On Mon, Aug 7, 2023 at 4:32 PM Wei-Keng Liao <wkliao at northwestern.edu> wrote:
Hi, Jim

To eliminate the overheads of PIO, I suggest using this PnetCDF example program
and adding a scalar variable to see if the same thing happens.

https://github.com/Parallel-NetCDF/PnetCDF/blob/master/examples/C/nonblocking_write.c

Wei-keng

On Aug 7, 2023, at 4:28 PM, Jim Edwards <jedwards at ucar.edu> wrote:

Hi Wei-Keng,

The cb_nodes doesn't seem to be affected.

Not using independent mode doesn't seem to have helped.  I have the pioperf program now writing two files.  One with only
decomposed fields and one with one additional field, rundate, which is a string with the date in it.

The performance is drastically different:
                                 IO tasks   vars                Mb/s            Time (s)
RESULT: write    SUBSET      1        256     64    12067.7548254854       25.1577347560   (without scalar)
RESULT: write    SUBSET      1        256     64      286.4615089145     1059.8190875640   (with scalar)


On Mon, Aug 7, 2023 at 1:47 PM Wei-Keng Liao <wkliao at northwestern.edu> wrote:
Is that the reason why cb_nodes is 1?
Strange, because cb_nodes is set at the file open time.

Entering the independent data mode in PnetCDF can be completely avoided
if using the nonblocking APIs.

I suggest that your code use the nonblocking APIs in the following way.

/* for non-partitioned variables */
if (rank == 0) {
    ncmpi_iput_var_int(fh, varid[0], data[0], &req[0]); /* write the whole variable */
    ncmpi_iput_var_int(fh, varid[1], data[1], &req[1]);
    ...
}
/* for partitioned variables */
ncmpi_iput_vara_int(fh, varid[j], starts[j], counts[j], data[j], &req[j]); /* start and count precede the buffer */
...


/* commit all posted nonblocking requests */
ncmpi_wait_all(fh, NC_REQ_ALL, NULL, NULL);


Wei-keng

> On Aug 7, 2023, at 2:12 PM, Jim Edwards <jedwards at ucar.edu> wrote:
>
> Hi Wei-Keng,
>
> I think that I've found the problem.   In the model I am writing a number of scalar variables to the file as well as the decomposed variables.
> for the scalar variables I use a code structure like:
>
> ncmpi_begin_indep_data(fh);
> ncmpi_put_vars_int(fh, varid, start, count, stride, data);
> ncmpi_end_indep_data(fh);
>
> In my pioperf test code I didn't write any scalars - this morning I added one and the write performance for the decomposed variables got very very
> bad.  What can I do about it?
>
> Jim
>
>
> --
> Jim Edwards
>
> CESM Software Engineer
> National Center for Atmospheric Research
> Boulder, CO



