performance issue

Jim Edwards jedwards at ucar.edu
Mon Aug 14 14:29:54 CDT 2023


Yes, the same test on Lustre and GPFS.

derecho:

GPFS (/glade/work on derecho):
RESULT: write SUBSET 1 16 64 4570.2078677815 4.0610844270
RESULT: write SUBSET 1 16 64 4470.3231494386 4.1518251320

Lustre, default PFLs:
RESULT: write SUBSET 1 16 64 2808.6570137094 6.6081404420
RESULT: write SUBSET 1 16 64 1025.1671656858 18.1043644600

Lustre, no PFLs, lfs setstripe -c 48 -S 128:
RESULT: write SUBSET 1 16 64 4687.6852437580 3.9593102000
RESULT: write SUBSET 1 16 64 3001.4741125579 6.1836282120

Perlmutter, no PFLs, lfs setstripe -c 48 -S 128:

RESULT: write SUBSET 1 16 64 5174.0815629926 3.5871100550
RESULT: write SUBSET 1 16 64 3176.2693942192 5.8433330730

Frontera (uses impi, not cray-mpich):

RESULT: write SUBSET 1 36 64 243.4676204728 75.0407794043
RESULT: write SUBSET 1 36 64 30.8135655567 592.9206721112

impi's Lustre optimizations can be turned off with environment variables:
unset I_MPI_EXTRA_FILESYSTEM_FORCE
unset I_MPI_EXTRA_FILESYSTEM
RESULT: write SUBSET 1 36 64 315.4529589578 57.9167177901
RESULT: write SUBSET 1 36 64 303.0899778031 60.2791294269

GPFS cheyenne:
RESULT: write SUBSET 1 16 64 6126.5760973514 3.0294245440
RESULT: write SUBSET 1 16 64 4638.4045534969 4.0013758580




On Mon, Aug 14, 2023 at 1:11 PM Wei-Keng Liao <wkliao at northwestern.edu>
wrote:

> Did you run the same tests on a non-Lustre file system and see no
> difference?
> Can you show me the timings?
>
> Wei-keng
>
> On Aug 14, 2023, at 11:54 AM, Jim Edwards <jedwards at ucar.edu> wrote:
>
> Hi Wei-Keng,
>
> Thanks for looking into this. Because the allocations in
> pioperformance.F90 are done on the compute nodes and not the IO nodes,
> I don't think your suggestion would make any difference. I also wonder
> why this issue appears to be so specific to the Lustre file system;
> presumably the ROMIO functionality you describe is general and not
> specific to Lustre? Anyway, your analysis spurred me to try something
> else, which seems to work: prior to calling ncmpi_iput_varn in
> pio_darray_int.c, I added a call to ncmpi_wait_all to make sure that
> any existing buffer had already been written.
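>
> A minimal sketch of that ordering (not the exact pio_darray_int.c code;
> the variable names here are placeholders for whatever the real call site
> uses):
>
>     /* Flush any previously posted nonblocking writes so their attached
>        buffers have been written out before the next request is posted. */
>     ierr = ncmpi_wait_all(ncid, NC_REQ_ALL, NULL, NULL);
>
>     /* Then post the new nonblocking write as before. */
>     ierr = ncmpi_iput_varn(ncid, varid, nregions, startlist, countlist,
>                            iobuf, bufcount, MPI_DOUBLE, &reqid);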
>
> This seems to have fixed the problem and my writes are now
> RESULT: write SUBSET 1 16 64 4787.6393342631 3.8766495770
> RESULT: write SUBSET 1 16 64 4803.9296372205 3.8635037150
>
>
>
> On Mon, Aug 14, 2023 at 9:47 AM Wei-Keng Liao <wkliao at northwestern.edu>
> wrote:
>
>> Hi, Jim
>>
>> Digging into the ROMIO source code, I found that the root cause of the timing
>> difference between the two test cases is whether or not the user buffer
>> passed to MPI_File_write_all is contiguous.
>>
>> In your test program, the write buffers for all record variables are
>> allocated in a contiguous space, while the fixed-size variable is in a
>> separate memory space.
>>
>> https://github.com/jedwards4b/ParallelIO/blob/25b471d5864db1cf7b8dfa26bd5d568eceba1a04/tests/performance/pioperformance.F90#L220-L227
>>
>>
>> Therefore, in the case of writing an extra fixed-size variable, the aggregated
>> write buffer is noncontiguous, while in the other case it is contiguous.
>>
>> When the write buffer is not contiguous, ROMIO allocates an internal
>> buffer, copies the data over, and uses it to perform communication.
>> When the buffer is contiguous, ROMIO uses the user buffer directly for
>> communication. Such copying can become expensive when the write amount
>> is large.
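>>
>> As an illustration only (not the actual PIO or ROMIO code; fh, recbuf,
>> fixbuf, and the element counts are assumed to exist), the difference at
>> the MPI-IO level is roughly:
>>
>>     /* Contiguous case: all record variables live in one allocation,
>>        so ROMIO can use the user buffer directly. */
>>     MPI_File_write_all(fh, recbuf, nrec_elems, MPI_DOUBLE,
>>                        MPI_STATUS_IGNORE);
>>
>>     /* Noncontiguous case: the record buffer plus a separately allocated
>>        fixed-size variable, described by a derived datatype; ROMIO first
>>        packs the pieces into an internal buffer. */
>>     int          blens[2] = { nrec_elems, nfix_elems };
>>     MPI_Aint     disps[2];
>>     MPI_Datatype memtype;
>>     MPI_Get_address(recbuf, &disps[0]);
>>     MPI_Get_address(fixbuf, &disps[1]);
>>     MPI_Type_create_hindexed(2, blens, disps, MPI_DOUBLE, &memtype);
>>     MPI_Type_commit(&memtype);
>>     MPI_File_write_all(fh, MPI_BOTTOM, 1, memtype, MPI_STATUS_IGNORE);
>>     MPI_Type_free(&memtype);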
>>
>> If you want to verify this finding, please try allocating the buffers
>> of the individual record variables separately. Let me know how it goes.
>>
>> Wei-keng
>>
>> On Aug 11, 2023, at 6:03 PM, Jim Edwards <jedwards at ucar.edu> wrote:
>>
>> Hi Wei-keng,
>>
>> For this case I'm using a RoundRobin distribution as shown here:
>>
>>     if(doftype .eq. 'ROUNDROBIN') then
>>        do i=1,varsize
>>           compmap(i) = (i-1)*npe+mype+1
>>        enddo
>>
>> --
>> Jim Edwards
>>
>> CESM Software Engineer
>> National Center for Atmospheric Research
>> Boulder, CO
>>
>>
>>
>
> --
> Jim Edwards
>
> CESM Software Engineer
> National Center for Atmospheric Research
> Boulder, CO
>
>
>

-- 
Jim Edwards

CESM Software Engineer
National Center for Atmospheric Research
Boulder, CO