performance issue

Jim Edwards jedwards at ucar.edu
Fri Aug 11 18:03:01 CDT 2023


Hi Wei-keng,

For this case I'm using a RoundRobin distribution, as shown here:

    if(doftype .eq. 'ROUNDROBIN') then
       do i=1,varsize
          compmap(i) = (i-1)*npe+mype+1
       enddo
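
As a point of reference, here is a minimal standalone sketch of the same round-robin mapping in C (this is not taken from pioperf; npe=4 and varsize=8 are made-up sizes). It makes explicit that rank mype owns every npe-th element, starting at 1-based offset mype+1, so with 4 ranks rank 0 owns offsets 1, 5, 9, ... and rank 1 owns 2, 6, 10, ...:

    /* Hedged sketch, not from the PIO sources: the round-robin decomposition
     * map above expressed in C, keeping the 1-based offsets of the Fortran
     * snippet.  npe and varsize are made-up sizes for illustration. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int npe = 4, varsize = 8;    /* hypothetical: 4 ranks, 8 elements each */

        for (int mype = 0; mype < npe; mype++) {
            long *compmap = malloc(varsize * sizeof(long));
            for (int i = 1; i <= varsize; i++)
                compmap[i - 1] = (long)(i - 1) * npe + mype + 1;
            printf("rank %d owns offsets %ld, %ld, ... %ld (stride %d)\n",
                   mype, compmap[0], compmap[1], compmap[varsize - 1], npe);
            free(compmap);
        }
        return 0;
    }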



On Fri, Aug 11, 2023 at 4:54 PM Wei-Keng Liao <wkliao at northwestern.edu>
wrote:

> Hi, Jim
>
> Can you please describe the data partitioning pattern used in pioperf?
>
> Wei-keng
>
> On Aug 11, 2023, at 5:46 PM, Jim Edwards <jedwards at ucar.edu> wrote:
>
> Yes, that line is called by all processes, but it in turn calls into
> pio_getput_int.c at line 1159:
>
>    ierr = ncmpi_bput_vars_text(file->fh, varid, start, count, fake_stride, buf, request);
>
> which is called only by MPI_ROOT (line 1088 of the same file).
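
For readers unfamiliar with the buffered (bput) interface, a minimal sketch of that pattern follows; it is not the PIO code path itself, and the file name, variable name, and buffer size are made up. Only rank 0 attaches an internal buffer and posts the nonblocking ncmpi_bput_vars_text request, but every rank must still enter the collective ncmpi_wait_all that flushes it:

    /* Hedged sketch for context (not the PIO code itself; the file name,
     * variable, and buffer size below are made up): only rank 0 posts the
     * buffered put request, while every rank enters the collective wait. */
    #include <string.h>
    #include <mpi.h>
    #include <pnetcdf.h>

    int main(int argc, char **argv)
    {
        int rank, ncid, dimid, varid, req = NC_REQ_NULL, st;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        ncmpi_create(MPI_COMM_WORLD, "sketch.nc", NC_CLOBBER | NC_64BIT_DATA,
                     MPI_INFO_NULL, &ncid);
        ncmpi_def_dim(ncid, "strlen", 16, &dimid);
        ncmpi_def_var(ncid, "rundate", NC_CHAR, 1, &dimid, &varid);
        ncmpi_enddef(ncid);

        if (rank == 0) {   /* only root attaches a buffer and posts the request */
            const char *date = "20230811 1803";
            MPI_Offset start[1]  = {0};
            MPI_Offset count[1]  = {(MPI_Offset) strlen(date)};
            MPI_Offset stride[1] = {1};
            ncmpi_buffer_attach(ncid, 1024);
            ncmpi_bput_vars_text(ncid, varid, start, count, stride, date, &req);
        }
        /* the wait is collective: non-root ranks call it with zero requests */
        ncmpi_wait_all(ncid, rank == 0 ? 1 : 0, &req, &st);
        if (rank == 0) ncmpi_buffer_detach(ncid);
        ncmpi_close(ncid);
        MPI_Finalize();
        return 0;
    }

Note that the sketch assumes the buffer attached with ncmpi_buffer_attach is large enough for all pending bput requests.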
>
> On Fri, Aug 11, 2023 at 4:41 PM Wei-Keng Liao <wkliao at northwestern.edu>
> wrote:
>
>> I can see line 344 of pioperformance.F90 is called by all processes.
>>
>>                    nvarmult= pio_put_var(File, rundate, date//'
>> '//time(1:4))
>>
>> How do I change it so it is called by rank 0 only?
>>
>>
>> Wei-keng
>>
>> On Aug 11, 2023, at 4:25 PM, Jim Edwards <jedwards at ucar.edu> wrote:
>>
>> Yes - src/clib/pio_darray_int.c and src/clib/pio_getput_int.c
>>
>> I've also attached a couple of darshan dxt profiles. Both use the same
>> Lustre file parameters; the first (dxt1.out) is the fast one without the
>> scalar write, and dxt2.out is the slow one. It seems like adding the
>> scalar causes all of the other writes to get broken up into smaller pieces.
>> I've also tried moving around where the scalar variable is defined and
>> written with respect to the record variables - that doesn't seem to make
>> any difference.
>>
>> On Fri, Aug 11, 2023 at 3:21 PM Wei-Keng Liao <wkliao at northwestern.edu>
>> wrote:
>>
>>> Yes, I have.
>>>
>>> Can you let me know the source codes files that make the PnetCDF API
>>> calls?
>>>
>>>
>>> Wei-keng
>>>
>>> On Aug 11, 2023, at 4:10 PM, Jim Edwards <jedwards at ucar.edu> wrote:
>>>
>>> Hi Wei-Keng,
>>>
>>> Sorry about the miscommunication earlier today - I just wanted to
>>> confirm that you've been able to reproduce the issue now?
>>>
>>> On Fri, Aug 11, 2023 at 1:01 PM Jim Edwards <jedwards at ucar.edu> wrote:
>>>
>>>> I'm sorry - I thought that I had provided that, but I guess not.
>>>> repo: git at github.com:jedwards4b/ParallelIO.git
>>>> branch: bugtest/lustre
>>>>
>>>> On Fri, Aug 11, 2023 at 12:46 PM Wei-Keng Liao <wkliao at northwestern.edu>
>>>> wrote:
>>>>
>>>>> Any particular github branch I should use?
>>>>>
>>>>> I got an error during make.
>>>>> /global/homes/w/wkliao/PIO/Github/ParallelIO/src/clib/pio_nc4.c:1481:18:
>>>>> error: call to undeclared function 'nc_inq_var_filter_ids'; ISO C99 and
>>>>> later do not support implicit function declarations
>>>>> [-Wimplicit-function-declaration]
>>>>>           ierr = nc_inq_var_filter_ids(file->fh, varid, nfiltersp, ids);
>>>>>                  ^
>>>>>
>>>>>
>>>>> Setting these two does not help:
>>>>> #undef NC_HAS_ZSTD
>>>>> #undef NC_HAS_BZ2
>>>>>
>>>>>
>>>>> Wei-keng
>>>>>
>>>>> > On Aug 11, 2023, at 12:44 PM, Jim Edwards <jedwards at ucar.edu> wrote:
>>>>> >
>>>>> > I see I missed answering one question - 2048 tasks total (16 nodes).
>>>>> >
>>>>> > On Fri, Aug 11, 2023 at 11:35 AM Jim Edwards <jedwards at ucar.edu>
>>>>> wrote:
>>>>> > Here is my run script on perlmutter:
>>>>> >
>>>>> > #!/usr/bin/env python
>>>>> > #
>>>>> > #SBATCH -A mp9
>>>>> > #SBATCH -C cpu
>>>>> > #SBATCH --qos=regular
>>>>> > #SBATCH --time=15
>>>>> > #SBATCH --nodes=16
>>>>> > #SBATCH --ntasks-per-node=128
>>>>> >
>>>>> > import os
>>>>> > import glob
>>>>> >
>>>>> > with open("pioperf.nl","w") as fd:
>>>>> >     fd.write("&pioperf\n")
>>>>> >     fd.write("  decompfile='ROUNDROBIN'\n")
>>>>> > #    for filename in decompfiles:
>>>>> > #        fd.write("   '"+filename+"',\n")
>>>>> >     fd.write(" varsize=18560\n");
>>>>> >     fd.write(" pio_typenames = 'pnetcdf','pnetcdf'\n");
>>>>> >     fd.write(" rearrangers = 2\n");
>>>>> >     fd.write(" nframes = 1\n");
>>>>> >     fd.write(" nvars = 64\n");
>>>>> >     fd.write(" niotasks = 16\n");
>>>>> >     fd.write(" /\n")
>>>>> >
>>>>> > os.system("srun -n 2048 ~/parallelio/bld/tests/performance/pioperf ")
>>>>> >
>>>>> >
>>>>> > Module environment:
>>>>> > Currently Loaded Modules:
>>>>> >   1) craype-x86-milan                        6) cpe/23.03                11) craype-accel-nvidia80        16) craype/2.7.20          21) cmake/3.24.3
>>>>> >   2) libfabric/1.15.2.0                      7) xalt/2.10.2              12) gpu/1.0                      17) cray-dsmml/0.2.2       22) cray-parallel-netcdf/1.12.3.3
>>>>> >   3) craype-network-ofi                      8) Nsight-Compute/2022.1.1  13) evp-patch                    18) cray-mpich/8.1.25      23) cray-hdf5/1.12.2.3
>>>>> >   4) xpmem/2.5.2-2.4_3.49__gd0f7936.shasta   9) Nsight-Systems/2022.2.1  14) python/3.9-anaconda-2021.11  19) cray-libsci/23.02.1.1  24) cray-netcdf/4.9.0.3
>>>>> >   5) perftools-base/23.03.0                 10) cudatoolkit/11.7         15) intel/2023.1.0               20) PrgEnv-intel/8.3.3
>>>>> >
>>>>> > cmake command:
>>>>> >  CC=mpicc FC=mpifort cmake -DPNETCDF_DIR=$CRAY_PARALLEL_NETCDF_DIR/intel/19.0 -DNETCDF_DIR=$CRAY_NETCDF_PREFIX -DHAVE_PAR_FILTERS=OFF ../
>>>>> >
>>>>> > There are a couple of issues with the build that can be fixed by
>>>>> > editing the file config.h (created in the bld directory by cmake).
>>>>> >
>>>>> > Add the following to config.h:
>>>>> >
>>>>> > #undef NC_HAS_ZSTD
>>>>> > #undef NC_HAS_BZ2
>>>>> >
>>>>> > then:
>>>>> > make pioperf
>>>>> >
>>>>> > once it's built, run the submit script from $SCRATCH
>>>>> >
>>>>> > On Fri, Aug 11, 2023 at 11:13 AM Wei-Keng Liao <
>>>>> wkliao at northwestern.edu> wrote:
>>>>> > OK. I will test it myself on Perlmutter.
>>>>> > Do you have a small test program to reproduce or is it still pioperf?
>>>>> > If pioperf, are the build instructions on Perlmutter the same?
>>>>> >
>>>>> > Please let me know how you run on Perlmutter, i.e. number of processes,
>>>>> > nodes, Lustre striping, problem size, etc.
>>>>> >
>>>>> > Does "1 16 64" in your results mean 16 I/O tasks and 64 variables,
>>>>> > yes this is correct
>>>>> >
>>>>> >   and only 16 MPI processes out of total ? processes call PnetCDF APIs?
>>>>> >
>>>>> > yes this is also correct.
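
As an aside, here is a hedged sketch of how a run like this (2048 ranks, 16 I/O tasks) can restrict the PnetCDF calls to a subset of ranks via a sub-communicator. This is not PIO's actual implementation; the file name and the stride-based selection of I/O ranks are made up for illustration:

    /* Hedged sketch, not PIO's implementation: 16 I/O tasks out of 2048 ranks
     * open the file on a sub-communicator, so only they call PnetCDF.
     * Assumes the total rank count is a multiple of niotasks (e.g. 2048). */
    #include <mpi.h>
    #include <pnetcdf.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs, ncid;
        MPI_Comm iocomm;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int niotasks = 16;                 /* as in the pioperf namelist */
        int stride   = nprocs / niotasks;  /* 2048 / 16 = 128 */
        int is_io    = (rank % stride == 0);

        /* non-I/O ranks get MPI_COMM_NULL and never touch the file */
        MPI_Comm_split(MPI_COMM_WORLD, is_io ? 0 : MPI_UNDEFINED, rank, &iocomm);

        if (is_io) {
            ncmpi_create(iocomm, "sketch_io.nc", NC_CLOBBER | NC_64BIT_DATA,
                         MPI_INFO_NULL, &ncid);
            ncmpi_enddef(ncid);
            ncmpi_close(ncid);
            MPI_Comm_free(&iocomm);
        }
        MPI_Finalize();
        return 0;
    }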
>>>>> >
>>>>> >   Wei-keng
>>>>> >
>>>>> >> On Aug 11, 2023, at 9:35 AM, Jim Edwards <jedwards at ucar.edu> wrote:
>>>>> >>
>>>>> >> I tried on Perlmutter and am seeing the same issue, only maybe even
>>>>> >> worse:
>>>>> >>
>>>>> >> RESULT: write    SUBSET         1        16        64     1261.0737058071       14.7176171500
>>>>> >> RESULT: write    SUBSET         1        16        64       90.3736534450      205.3695882870
>>>>> >>
>>>>> >>
>>>>> >> On Fri, Aug 11, 2023 at 8:17 AM Jim Edwards <jedwards at ucar.edu>
>>>>> wrote:
>>>>> >> Hi Wei-Keng,
>>>>> >>
>>>>> >> I realized that the numbers in this table all show the slow-performing
>>>>> >> file, and the fast file (the one without the scalar variable) is not
>>>>> >> represented - I will rerun and present these numbers again.
>>>>> >>
>>>>> >> Here are corrected numbers for a few cases:
>>>>> >> GPFS (/glade/work on derecho):
>>>>> >> RESULT: write    SUBSET         1        16        64     4570.2078677815        4.0610844270
>>>>> >> RESULT: write    SUBSET         1        16        64     4470.3231494386        4.1518251320
>>>>> >>
>>>>> >> Lustre, default PFLs:
>>>>> >> RESULT: write    SUBSET         1        16        64     2808.6570137094        6.6081404420
>>>>> >> RESULT: write    SUBSET         1        16        64     1025.1671656858       18.1043644600
>>>>> >>
>>>>> >> Lustre, no PFLs and very wide stripe:
>>>>> >> RESULT: write    SUBSET         1        16        64     4687.6852437580        3.9593102000
>>>>> >> RESULT: write    SUBSET         1        16        64     3001.4741125579        6.1836282120
>>>>> >>
>>>>> >> On Thu, Aug 10, 2023 at 11:34 AM Jim Edwards <jedwards at ucar.edu>
>>>>> wrote:
>>>>> >> the stripe settings:
>>>>> >> lfs setstripe -c 96 -S 128M logs/c96_S128M/
>>>>> >>
>>>>> >>
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Jim Edwards
>>>>> >
>>>>> > CESM Software Engineer
>>>>> > National Center for Atmospheric Research
>>>>> > Boulder, CO
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Jim Edwards
>>>>> >
>>>>> > CESM Software Engineer
>>>>> > National Center for Atmospheric Research
>>>>> > Boulder, CO
>>>>>
>>>>>
>>>>
>>>> --
>>>> Jim Edwards
>>>>
>>>> CESM Software Engineer
>>>> National Center for Atmospheric Research
>>>> Boulder, CO
>>>>
>>>
>>>
>>> --
>>> Jim Edwards
>>>
>>> CESM Software Engineer
>>> National Center for Atmospheric Research
>>> Boulder, CO
>>>
>>>
>>>
>>
>> --
>> Jim Edwards
>>
>> CESM Software Engineer
>> National Center for Atmospheric Research
>> Boulder, CO
>> <dxt2.out><dxt1.out>
>>
>>
>>
>
> --
> Jim Edwards
>
> CESM Software Engineer
> National Center for Atmospheric Research
> Boulder, CO
>
>
>

-- 
Jim Edwards

CESM Software Engineer
National Center for Atmospheric Research
Boulder, CO