performance issue

Wei-Keng Liao wkliao at northwestern.edu
Mon Aug 14 10:47:36 CDT 2023


Hi, Jim

Digging into the ROMIO source code, I found that the root cause of the timing
difference between the two test cases is whether or not the user buffer
passed to MPI_File_write_all is contiguous.

In your test program, the write buffers for all record variables are
allocated in one contiguous space, while the fixed-size variable is in a
separate memory space.
https://github.com/jedwards4b/ParallelIO/blob/25b471d5864db1cf7b8dfa26bd5d568eceba1a04/tests/performance/pioperformance.F90#L220-L227


Therefore, when an extra fixed-size variable is written, the aggregated
write buffer is noncontiguous, while in the other case it is contiguous.

When the write buffer is not contiguous, ROMIO allocates an internal buffer,
copies the data over, and uses it to perform communication. When the buffer
is contiguous, ROMIO uses the user buffer directly for communication.
Such copying can become expensive when the write amount is large.
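
The difference between the two paths can be illustrated with a toy model.
This is only a sketch in Python, not ROMIO's actual code: "memory" is a
bytearray, each variable's buffer is a (start, length) slice of it, and the
names is_contiguous and buffer_for_communication are hypothetical.

```python
# Toy model only: none of these names come from ROMIO or MPI.

def is_contiguous(segments):
    """True if each segment starts exactly where the previous one ends."""
    return all(prev_start + prev_len == start
               for (prev_start, prev_len), (start, _)
               in zip(segments, segments[1:]))


def buffer_for_communication(memory, segments):
    """Return (buffer, copied) for a non-empty list of segments.

    Contiguous input is used in place through a memoryview (no copy);
    noncontiguous input is packed into a newly allocated staging buffer.
    """
    if is_contiguous(segments):
        start = segments[0][0]
        total = sum(length for _, length in segments)
        return memoryview(memory)[start:start + total], False
    staging = bytearray()
    for start, length in segments:
        staging += memory[start:start + length]  # the extra copy
    return staging, True


memory = bytearray(range(64))
record_vars = [(0, 8), (8, 8), (16, 8)]      # allocated back to back
with_fixed_var = record_vars + [(40, 4)]     # fixed-size variable lives elsewhere

_, copied_a = buffer_for_communication(memory, record_vars)
_, copied_b = buffer_for_communication(memory, with_fixed_var)
print(copied_a, copied_b)
```

In this model only the second layout triggers a copy, and the cost of that
copy grows with the total amount of data being written.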

If you want to verify this finding, please try allocating the
buffers of the individual record variables separately. Let me know how it goes.

Wei-keng

On Aug 11, 2023, at 6:03 PM, Jim Edwards <jedwards at ucar.edu> wrote:

Hi Wei-keng,

For this case I'm using a RoundRobin distribution as shown here.

    if(doftype .eq. 'ROUNDROBIN') then
       do i=1,varsize
          compmap(i) = (i-1)*npe+mype+1
       enddo
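
For illustration, the mapping in the loop above can be reproduced with a
short Python sketch. The function name roundrobin_compmap is hypothetical,
only the ROUNDROBIN branch is modeled, and mype is assumed to be a
zero-based rank as in MPI.

```python
def roundrobin_compmap(varsize, npe, mype):
    """1-based global indices owned by rank mype (0-based) out of npe ranks.

    Mirrors compmap(i) = (i-1)*npe + mype + 1 for i = 1..varsize.
    """
    return [(i - 1) * npe + mype + 1 for i in range(1, varsize + 1)]


# With 4 ranks and 3 elements per rank, rank 1 owns every 4th element
# starting from global index 2.
print(roundrobin_compmap(3, 4, 1))  # [2, 6, 10]
```

Each rank therefore owns elements that are npe apart in the global index
space, so neighboring elements of the variable belong to different ranks.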

--
Jim Edwards

CESM Software Engineer
National Center for Atmospheric Research
Boulder, CO
