performance issue
Jim Edwards
jedwards at ucar.edu
Mon Aug 14 14:44:16 CDT 2023
I tried GPFS on derecho again with the fix and got:
RESULT: write SUBSET 1 16 64 5450.6689151051 3.4050866580
RESULT: write SUBSET 1 16 64 5953.4351587908 3.1175278650
So there may be a measurable difference on GPFS as well - it just hits
particularly hard on Lustre.
On Mon, Aug 14, 2023 at 1:29 PM Jim Edwards <jedwards at ucar.edu> wrote:
> Yes, the same test on Lustre and GPFS.
>
> derecho:
>
> GPFS (/glade/work on derecho):
> RESULT: write SUBSET 1 16 64 4570.2078677815 4.0610844270
> RESULT: write SUBSET 1 16 64 4470.3231494386 4.1518251320
>
> Lustre, default PFL's:
> RESULT: write SUBSET 1 16 64 2808.6570137094 6.6081404420
> RESULT: write SUBSET 1 16 64 1025.1671656858 18.1043644600
>
> Lustre, no PFL's: lfs setstripe -c 48 -S 128
> RESULT: write SUBSET 1 16 64 4687.6852437580 3.9593102000
> RESULT: write SUBSET 1 16 64 3001.4741125579 6.1836282120
>
> Perlmutter, no PFL's: lfs setstripe -c 48 -S 128
>
> RESULT: write SUBSET 1 16 64 5174.0815629926 3.5871100550
> RESULT: write SUBSET 1 16 64 3176.2693942192 5.8433330730
>
> Frontera (uses impi, not cray-mpich):
>
> RESULT: write SUBSET 1 36 64 243.4676204728 75.0407794043
> RESULT: write SUBSET 1 36 64 30.8135655567 592.9206721112
>
> impi's Lustre optimizations can be turned off with environment variables:
> unset I_MPI_EXTRA_FILESYSTEM_FORCE
> unset I_MPI_EXTRA_FILESYSTEM
> RESULT: write SUBSET 1 36 64 315.4529589578 57.9167177901
>
> RESULT: write SUBSET 1 36 64 303.0899778031 60.2791294269
>
> GPFS (cheyenne):
> RESULT: write SUBSET 1 16 64 6126.5760973514 3.0294245440
> RESULT: write SUBSET 1 16 64 4638.4045534969 4.0013758580
>
>
> On Mon, Aug 14, 2023 at 1:11 PM Wei-Keng Liao <wkliao at northwestern.edu>
> wrote:
>
>> Did you run the same tests on a non-Lustre file system and see no
>> difference?
>> Can you show me the timings?
>>
>> Wei-keng
>>
>> On Aug 14, 2023, at 11:54 AM, Jim Edwards <jedwards at ucar.edu> wrote:
>>
>> Hi Wei-Keng,
>>
>> Thanks for looking into this. Because the allocations in
>> pioperformance.F90 are done on the compute nodes and not the IO nodes,
>> I don't think your suggestion would make any difference. I also wonder
>> why this issue appears to be so specific to the Lustre file system -
>> presumably the ROMIO functionality you describe is general, not
>> specific to Lustre? Anyway, your analysis spurred me to try something
>> else, which seems to work: prior to calling ncmpi_iput_varn in
>> pio_darray_int.c I added a call to ncmpi_wait_all to make sure that
>> any existing buffer was written first.
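>>
>> As a minimal sketch of that pattern (a hypothetical helper, not the
>> actual pio_darray_int.c change; the varn arguments are assumed to come
>> from the caller):
>>
>> #include <pnetcdf.h>
>>
>> int flush_then_iput(int ncid, int varid, int num_regions,
>>                     MPI_Offset* const starts[], MPI_Offset* const counts[],
>>                     const double *buf, MPI_Offset bufcount)
>> {
>>     /* complete all outstanding nonblocking writes so no previously
>>        buffered data is still pending */
>>     int err = ncmpi_wait_all(ncid, NC_REQ_ALL, NULL, NULL);
>>     if (err != NC_NOERR) return err;
>>
>>     /* now post the new nonblocking varn write */
>>     int reqid;
>>     return ncmpi_iput_varn(ncid, varid, num_regions, starts, counts,
>>                            buf, bufcount, MPI_DOUBLE, &reqid);
>> }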
>>
>> This seems to have fixed the problem and my writes are now:
>> RESULT: write SUBSET 1 16 64 4787.6393342631 3.8766495770
>> RESULT: write SUBSET 1 16 64 4803.9296372205 3.8635037150
>>
>>
>>
>> On Mon, Aug 14, 2023 at 9:47 AM Wei-Keng Liao <wkliao at northwestern.edu>
>> wrote:
>>
>>> Hi, Jim
>>>
>>> Digging into the ROMIO source code, I found that the root cause of the
>>> timing difference between the two test cases is whether or not the user
>>> buffer passed to MPI_File_write_all is contiguous.
>>>
>>> In your test program, the write buffers for all record variables are
>>> allocated in one contiguous space, while the fixed-size variable is in
>>> a separate memory space.
>>>
>>> https://github.com/jedwards4b/ParallelIO/blob/25b471d5864db1cf7b8dfa26bd5d568eceba1a04/tests/performance/pioperformance.F90#L220-L227
>>>
>>>
>>> Therefore, when the extra fixed-size variable is written, the aggregated
>>> write buffer is noncontiguous, while in the other case it is contiguous.
>>>
>>> When the write buffer is not contiguous, ROMIO allocates an internal
>>> buffer, copies the data over, and uses it to perform communication.
>>> When the buffer is contiguous, ROMIO uses the user buffer directly for
>>> communication. Such copying can become expensive when the write amount
>>> is large.
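>>>
>>> To illustrate the distinction at the MPI-IO level, here is a minimal
>>> sketch (not from the test program; fh is assumed to be an open file
>>> handle). With a contiguous buffer ROMIO can hand the user memory
>>> straight to its communication phase; a strided datatype forces a pack
>>> into an intermediate buffer first:
>>>
>>> #include <mpi.h>
>>>
>>> /* contiguous: ROMIO uses buf directly for communication */
>>> void write_contig(MPI_File fh, const double *buf, int n)
>>> {
>>>     MPI_File_write_all(fh, buf, n, MPI_DOUBLE, MPI_STATUS_IGNORE);
>>> }
>>>
>>> /* noncontiguous: ROMIO allocates an internal buffer and copies the
>>>    strided data into it before communicating */
>>> void write_strided(MPI_File fh, const double *buf,
>>>                    int nblocks, int blocklen, int stride)
>>> {
>>>     MPI_Datatype strided;
>>>     MPI_Type_vector(nblocks, blocklen, stride, MPI_DOUBLE, &strided);
>>>     MPI_Type_commit(&strided);
>>>     MPI_File_write_all(fh, buf, 1, strided, MPI_STATUS_IGNORE);
>>>     MPI_Type_free(&strided);
>>> }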
>>>
>>> If you want to verify this finding, please try allocating the buffers
>>> of the individual record variables separately. Let me know how it
>>> goes.
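>>>
>>> In C terms (the test program itself is Fortran, so this is only an
>>> analogue of the suggested change), the two layouts look like:
>>>
>>> #include <stdlib.h>
>>>
>>> /* one slab: all record-variable buffers are contiguous in memory */
>>> double *alloc_one_slab(size_t nvars, size_t varsize)
>>> {
>>>     return malloc(nvars * varsize * sizeof(double));
>>> }
>>>
>>> /* separate allocations: each buffer lives in its own region, so the
>>>    aggregated write buffer becomes noncontiguous */
>>> double **alloc_separate(size_t nvars, size_t varsize)
>>> {
>>>     double **bufs = malloc(nvars * sizeof(double *));
>>>     for (size_t v = 0; v < nvars; v++)
>>>         bufs[v] = malloc(varsize * sizeof(double));
>>>     return bufs;
>>> }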
>>>
>>> Wei-keng
>>>
>>> On Aug 11, 2023, at 6:03 PM, Jim Edwards <jedwards at ucar.edu> wrote:
>>>
>>> Hi Wei-keng,
>>>
>>> For this case I'm using a RoundRobin distribution, as shown here:
>>>
>>>    if (doftype .eq. 'ROUNDROBIN') then
>>>       ! element i on this rank maps to global element (i-1)*npe+mype+1,
>>>       ! e.g. with npe=4, rank 0 gets global elements 1, 5, 9, ...
>>>       do i = 1, varsize
>>>          compmap(i) = (i-1)*npe + mype + 1
>>>       enddo
>>>    endif
>>>
>>> --
>>> Jim Edwards
>>>
>>> CESM Software Engineer
>>> National Center for Atmospheric Research
>>> Boulder, CO
>>>
>>>
>>>
>>
>> --
>> Jim Edwards
>>
>> CESM Software Engineer
>> National Center for Atmospheric Research
>> Boulder, CO
>>
>>
>>
>
> --
> Jim Edwards
>
> CESM Software Engineer
> National Center for Atmospheric Research
> Boulder, CO
>
--
Jim Edwards
CESM Software Engineer
National Center for Atmospheric Research
Boulder, CO