performance issue

Wei-Keng Liao wkliao at northwestern.edu
Mon Aug 14 16:16:27 CDT 2023


I realized that because ROMIO's drivers for Lustre and GPFS use different
file-domain partitioning strategies, the Lustre case ends up doing many more
small-sized copies than the GPFS case.

When I increased the Lustre stripe size, which effectively reduces the
number of copy calls, the difference became less significant.
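
If it helps, below is a minimal C sketch of how the stripe settings could be
passed to ROMIO as MPI-IO hints at file-creation time instead of running
lfs setstripe. This is only an illustration under my assumptions, not PIO's
actual code; the function name and the 16 MiB / 48-OST values are placeholders.

    #include <mpi.h>
    #include <pnetcdf.h>

    /* Sketch: pass Lustre striping hints to ROMIO when the file is created.
     * "striping_unit" and "striping_factor" are standard ROMIO hints; the
     * values below (16 MiB stripes over 48 OSTs) are placeholders only. */
    int create_with_stripe_hints(MPI_Comm comm, const char *path, int *ncidp)
    {
        MPI_Info info;
        int err;

        MPI_Info_create(&info);
        MPI_Info_set(info, "striping_unit",   "16777216");
        MPI_Info_set(info, "striping_factor", "48");

        err = ncmpi_create(comm, path, NC_CLOBBER | NC_64BIT_DATA, info, ncidp);
        MPI_Info_free(&info);
        return err;
    }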


Wei-keng

On Aug 14, 2023, at 2:44 PM, Jim Edwards <jedwards at ucar.edu> wrote:

I tried GPFS on derecho again with the fix and got:

RESULT: write    SUBSET         1        16        64     5450.6689151051        3.4050866580
RESULT: write    SUBSET         1        16        64     5953.4351587908        3.1175278650

So maybe there is a measurable difference on GPFS as well - the problem is just particularly severe on Lustre.

On Mon, Aug 14, 2023 at 1:29 PM Jim Edwards <jedwards at ucar.edu> wrote:
Yes, the same test on Lustre and GPFS.

derecho:

GPFS (/glade/work on derecho):
RESULT: write SUBSET 1 16 64 4570.2078677815 4.0610844270
RESULT: write SUBSET 1 16 64 4470.3231494386 4.1518251320

Lustre, default PFLs:
RESULT: write SUBSET 1 16 64 2808.6570137094 6.6081404420
RESULT: write SUBSET 1 16 64 1025.1671656858 18.1043644600

Lustre, no PFLs, lfs setstripe -c 48 -S 128:
RESULT: write SUBSET 1 16 64 4687.6852437580 3.9593102000
RESULT: write SUBSET 1 16 64 3001.4741125579 6.1836282120

Perlmutter, no PFLs, lfs setstripe -c 48 -S 128:

RESULT: write  SUBSET     1    16    64   5174.0815629926    3.5871100550

RESULT: write  SUBSET     1    16    64   3176.2693942192    5.8433330730

Frontera (uses impi not cray-mpich)

RESULT: write    SUBSET         1        36        64      243.4676204728       75.0407794043
 RESULT: write    SUBSET         1        36        64       30.8135655567      592.9206721112


impi's Lustre optimizations can be turned off with environment variables:
unset I_MPI_EXTRA_FILESYSTEM_FORCE
unset I_MPI_EXTRA_FILESYSTEM
RESULT: write    SUBSET         1        36        64      315.4529589578       57.9167177901

RESULT: write    SUBSET         1        36        64      303.0899778031       60.2791294269


GPFS cheyenne:
  RESULT: write    SUBSET         1        16        64     6126.5760973514        3.0294245440
  RESULT: write    SUBSET         1        16        64     4638.4045534969        4.0013758580




On Mon, Aug 14, 2023 at 1:11 PM Wei-Keng Liao <wkliao at northwestern.edu> wrote:
Did you run the same tests on a non-Lustre file system and see no difference?
Can you show me the timings?

Wei-keng

On Aug 14, 2023, at 11:54 AM, Jim Edwards <jedwards at ucar.edu> wrote:

Hi Wei-Keng,

Thanks for looking into this. Because the allocations in pioperformance.F90 are done on the compute nodes and
not the IO nodes, I don't think that your suggestion would make any difference. I also wonder why this issue
appears to be so specific to the Lustre file system - presumably the ROMIO behavior you describe is general and not
specific to Lustre? Anyway, your analysis spurred me to try something else, which seems to work: prior to calling
ncmpi_iput_varn in pio_darray_int.c, I added a call to ncmpi_wait_all to make sure that any existing buffer was written.

This seems to have fixed the problem, and my writes are now:
RESULT: write    SUBSET         1        16        64     4787.6393342631        3.8766495770
RESULT: write    SUBSET         1        16        64     4803.9296372205        3.8635037150
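
For reference, the pattern looks roughly like the following sketch using the
PnetCDF C API. The wrapper name and arguments are placeholders, not the actual
pio_darray_int.c code; the only point is the wait-all before the next iput_varn.

    #include <mpi.h>
    #include <pnetcdf.h>

    /* Sketch of the workaround: complete any previously posted nonblocking
     * writes so their attached buffers are flushed before posting the next
     * ncmpi_iput_varn. All names here are placeholders. */
    int flush_then_iput(int ncid, int varid, int num,
                        MPI_Offset* const starts[], MPI_Offset* const counts[],
                        const double *buf, MPI_Offset bufcount)
    {
        int err, reqid;

        /* drain all pending nonblocking requests on this file */
        err = ncmpi_wait_all(ncid, NC_REQ_ALL, NULL, NULL);
        if (err != NC_NOERR) return err;

        /* post the next nonblocking varn write */
        return ncmpi_iput_varn(ncid, varid, num, starts, counts,
                               buf, bufcount, MPI_DOUBLE, &reqid);
    }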



On Mon, Aug 14, 2023 at 9:47 AM Wei-Keng Liao <wkliao at northwestern.edu> wrote:
Hi, Jim

Digging into the ROMIO source code, I found that the root cause of the timing
difference between the two test cases is whether or not the user buffer
passed to MPI_File_write_all is contiguous.

In your test program, the write buffers for all record variables are
allocated in one contiguous space, while the fixed-size variable is in a
separate memory space:
https://github.com/jedwards4b/ParallelIO/blob/25b471d5864db1cf7b8dfa26bd5d568eceba1a04/tests/performance/pioperformance.F90#L220-L227


Therefore, in the case that writes an extra fixed-size variable, the aggregated
write buffer is noncontiguous, while in the other case it is contiguous.

When the write buffer is not contiguous, ROMIO allocates an internal buffer,
copies the data over, and uses that buffer to perform communication. When the
buffer is contiguous, ROMIO uses the user buffer directly for communication.
Such copying can become expensive when the write amount is large.
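
To make the distinction concrete, here is a small illustration at the MPI-IO
level (illustration only, not ROMIO internals; the buffer names are placeholders).
The first call passes a contiguous user buffer, which ROMIO can use directly;
the second passes a strided memory datatype, which forces ROMIO to pack the
data into a temporary buffer first.

    #include <mpi.h>

    /* Illustration: the same number of doubles written collectively two ways.
     * Case 1: contiguous user buffer.
     * Case 2: every other element of a buffer twice as large, described by a
     *         strided datatype, i.e. a noncontiguous user buffer. */
    void write_two_ways(MPI_File fh, const double *contig,
                        const double *strided, int nelems)
    {
        MPI_Status status;
        MPI_Datatype vec;

        /* Case 1: ROMIO can communicate from the user buffer directly. */
        MPI_File_write_all(fh, contig, nelems, MPI_DOUBLE, &status);

        /* Case 2: ROMIO packs the strided data into an internal buffer. */
        MPI_Type_vector(nelems, 1, 2, MPI_DOUBLE, &vec);
        MPI_Type_commit(&vec);
        MPI_File_write_all(fh, strided, 1, vec, &status);
        MPI_Type_free(&vec);
    }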

If you want to verify this finding, please try allocating the buffers of the
individual record variables separately. Let me know how it goes.

Wei-keng

On Aug 11, 2023, at 6:03 PM, Jim Edwards <jedwards at ucar.edu> wrote:

Hi Wei-keng,

For this case I'm using a RoundRobin distribution as shown here.

    if(doftype .eq. 'ROUNDROBIN') then
       do i=1,varsize
          compmap(i) = (i-1)*npe+mype+1
       enddo

--
Jim Edwards

CESM Software Engineer
National Center for Atmospheric Research
Boulder, CO
