performance issue

Jim Edwards jedwards at ucar.edu
Mon Aug 14 11:54:22 CDT 2023


Hi Wei-Keng,

Thanks for looking into this. Because the allocations in
pioperformance.F90 are done on the compute nodes and not the IO nodes, I
don't think your suggestion would make any difference. I also wonder why
this issue appears to be so specific to the Lustre file system; presumably
the ROMIO functionality you describe is general and not specific to Lustre?
Anyway, your analysis spurred me to try something else, which seems to
work: prior to calling ncmpi_iput_varn in pio_darray_int.c, I added a call
to ncmpi_wait_all to make sure that any existing buffer was written.

This seems to have fixed the problem, and my writes are now:
RESULT: write    SUBSET         1        16        64     4787.6393342631     3.8766495770
RESULT: write    SUBSET         1        16        64     4803.9296372205     3.8635037150



On Mon, Aug 14, 2023 at 9:47 AM Wei-Keng Liao <wkliao at northwestern.edu> wrote:

> Hi, Jim
>
> Digging into the ROMIO source code, I found that the root cause of the
> timing difference between the two test cases is whether or not the user
> buffer passed to MPI_File_write_all is contiguous.
>
> In your test program, the write buffers for all record variables are
> allocated in one contiguous space, while the fixed-size variable is in a
> separate memory space.
>
> https://github.com/jedwards4b/ParallelIO/blob/25b471d5864db1cf7b8dfa26bd5d568eceba1a04/tests/performance/pioperformance.F90#L220-L227
>
>
> Therefore, when an extra fixed-size variable is written, the aggregated
> write buffer is noncontiguous, while in the other case it is contiguous.
>
> When the write buffer is not contiguous, ROMIO allocates an internal
> buffer, copies the data over, and uses it to perform communication. When
> the buffer is contiguous, ROMIO uses the user buffer directly for
> communication. Such copying can become expensive when the write amount
> is large.
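
A toy illustration of the distinction described above, written in Python only to keep it self-contained. It is not ROMIO code: the function name prepare_send_buffer and the "list of segments" model are hypothetical. One segment stands for a contiguous user buffer and is used directly; several segments stand for a noncontiguous buffer and are packed into a freshly allocated staging buffer.

```python
# Illustrative sketch only; not ROMIO code. All names are hypothetical.

def prepare_send_buffer(segments):
    """Return (buffer, copied) for a list of bytes-like segments.

    A single segment models a contiguous user buffer and is used as-is.
    Several segments model a noncontiguous buffer and are packed into a
    newly allocated staging buffer first.
    """
    if len(segments) == 1:
        # Contiguous: hand back a view of the caller's memory, no copy.
        return memoryview(segments[0]), False

    total = sum(len(s) for s in segments)
    staging = bytearray(total)  # extra allocation
    offset = 0
    for s in segments:
        # Extra copy; the cost grows with the total number of bytes.
        staging[offset:offset + len(s)] = s
        offset += len(s)
    return memoryview(staging), True
```

In this model the record variables alone form one segment, while adding the separately allocated fixed-size variable gives two segments and triggers the copy path.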
>
> If you want to verify this finding, please try allocating the buffers of
> the individual record variables separately. Let me know how it goes.
>
> Wei-keng
>
> On Aug 11, 2023, at 6:03 PM, Jim Edwards <jedwards at ucar.edu> wrote:
>
> Hi Wei-keng,
>
> For this case I'm using a RoundRobin distribution as shown here.
>
>     if(doftype .eq. 'ROUNDROBIN') then
>        do i=1,varsize
>           compmap(i) = (i-1)*npe+mype+1
>        enddo
>
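
For readers who want to see which elements each task ends up with, the loop above can be mirrored in a few lines of Python. This is only a sketch: the function name roundrobin_compmap is hypothetical, and it assumes mype is the 0-based task rank and npe the number of tasks, which is what the "+1" in the Fortran suggests but is not stated in the thread.

```python
def roundrobin_compmap(varsize, npe, mype):
    """1-based global offsets held by task `mype` under the round-robin map.

    Mirrors compmap(i) = (i-1)*npe + mype + 1 for i = 1..varsize.
    Assumes `mype` is a 0-based rank and `npe` is the task count.
    """
    return [(i - 1) * npe + mype + 1 for i in range(1, varsize + 1)]


if __name__ == "__main__":
    # With 4 tasks and 3 elements per task, print each task's offsets.
    for rank in range(4):
        print(rank, roundrobin_compmap(3, 4, rank))
```

Each task's elements are strided by npe in the global array, so no task owns a contiguous range of the file.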
> --
> Jim Edwards
>
> CESM Software Engineer
> National Center for Atmospheric Research
> Boulder, CO
>
>
>

-- 
Jim Edwards

CESM Software Engineer
National Center for Atmospheric Research
Boulder, CO