performance issue

Wei-Keng Liao wkliao at northwestern.edu
Mon Aug 14 14:11:44 CDT 2023


Did you run the same tests on a non-Lustre file system and see no difference?
Can you show me the timings?

Wei-keng

On Aug 14, 2023, at 11:54 AM, Jim Edwards <jedwards at ucar.edu> wrote:

Hi Wei-Keng,

Thanks for looking into this. Because the allocations in pioperformance.F90 are done on the compute nodes and
not the IO nodes, I don't think that your suggestion would make any difference. I also wonder why this issue
appears to be so specific to the Lustre file system; presumably the ROMIO functionality you speak of is general and not
specific to Lustre? Anyway, your analysis spurred me to try something else which seems to work: prior to calling
ncmpi_iput_varn in pio_darray_int.c, I added a call to ncmpi_wait_all to make sure that any existing buffer was written.

This seems to have fixed the problem, and my writes are now:
RESULT: write    SUBSET         1        16        64     4787.6393342631        3.8766495770
RESULT: write    SUBSET         1        16        64     4803.9296372205        3.8635037150



On Mon, Aug 14, 2023 at 9:47 AM Wei-Keng Liao <wkliao at northwestern.edu> wrote:
Hi, Jim

Digging into the ROMIO source code, I found that the root cause of the timing
difference between the two test cases is whether or not the user buffer
passed to MPI_File_write_all is contiguous.

In your test program, the write buffers for all record variables are
allocated in a contiguous space, while the fixed-size variable is in a
separate memory space.
https://github.com/jedwards4b/ParallelIO/blob/25b471d5864db1cf7b8dfa26bd5d568eceba1a04/tests/performance/pioperformance.F90#L220-L227


Therefore, when writing an extra fixed-size variable, the aggregated
write buffer is noncontiguous, while in the other case it is contiguous.

When the write buffer is not contiguous, ROMIO allocates an internal buffer,
copies the data over, and uses it to perform communication. When the buffer
is contiguous, ROMIO uses the user buffer directly for communication.
Such copying can become expensive when the write amount is large.
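
The copy-or-not decision described above can be sketched in plain C. This is
only an illustration of the idea, not ROMIO's actual code: the segment type
and the helpers is_contiguous and contiguous_view are hypothetical names
made up for this sketch, and no MPI calls are involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* One piece of a user write buffer: start address and length in bytes.
 * (Hypothetical type, for illustration only.) */
typedef struct {
    const char *ptr;
    size_t      len;
} segment;

/* Return 1 if every segment starts exactly where the previous one ends. */
int is_contiguous(const segment *segs, size_t nsegs)
{
    for (size_t i = 1; i < nsegs; i++)
        if (segs[i].ptr != segs[i - 1].ptr + segs[i - 1].len)
            return 0;
    return 1;
}

/* Return a pointer to a contiguous view of the data.
 * Contiguous input: the user buffer is returned directly, *tmp is NULL.
 * Noncontiguous input: the data is packed into a malloc'ed buffer, which
 * is returned and also stored in *tmp so the caller can free it. This
 * packing is the extra copy whose cost grows with the write amount.
 * Returns NULL if nsegs is 0 or the allocation fails. */
const char *contiguous_view(const segment *segs, size_t nsegs, char **tmp)
{
    assert(tmp != NULL);
    *tmp = NULL;
    if (nsegs == 0)
        return NULL;
    if (is_contiguous(segs, nsegs))
        return segs[0].ptr;              /* no copy needed */

    size_t total = 0;
    for (size_t i = 0; i < nsegs; i++)
        total += segs[i].len;

    char *buf = malloc(total);
    if (buf == NULL)
        return NULL;

    size_t off = 0;
    for (size_t i = 0; i < nsegs; i++) { /* pack segment by segment */
        memcpy(buf + off, segs[i].ptr, segs[i].len);
        off += segs[i].len;
    }
    *tmp = buf;
    return buf;
}
```

Two slices of one allocation take the no-copy path, while two separately
allocated arrays force the packing loop to touch every byte.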

If you want to verify this finding, please try allocating the
buffers of individual record variables separately. Let me know how it goes.

Wei-keng

On Aug 11, 2023, at 6:03 PM, Jim Edwards <jedwards at ucar.edu> wrote:

Hi Wei-keng,

For this case I'm using a RoundRobin distribution as shown here.

    if(doftype .eq. 'ROUNDROBIN') then
       do i=1,varsize
          compmap(i) = (i-1)*npe+mype+1
       enddo
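
For readers more at home in C, the loop above can be sketched as follows.
The function name roundrobin_compmap is hypothetical, and the sketch
assumes that mype is the 0-based task rank and that compmap holds 1-based
global offsets, as the "+1" in the Fortran suggests.

```c
#include <assert.h>
#include <stddef.h>

/* Fill compmap[0..varsize-1] for task `mype` (assumed 0-based) out of
 * `npe` tasks. The stored values are 1-based global offsets, mirroring
 * compmap(i) = (i-1)*npe + mype + 1 for i = 1..varsize. */
void roundrobin_compmap(long long *compmap, size_t varsize, int npe, int mype)
{
    assert(npe > 0 && mype >= 0 && mype < npe);
    for (size_t i = 0; i < varsize; i++)
        compmap[i] = (long long)i * npe + mype + 1;
}
```

With npe = 4 and mype = 1, the first three entries are 2, 6 and 10: each
task's elements are npe apart in the global array.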

--
Jim Edwards

CESM Software Engineer
National Center for Atmospheric Research
Boulder, CO



