performance issue
Wei-Keng Liao
wkliao at northwestern.edu
Mon Aug 14 10:47:36 CDT 2023
Hi, Jim
Digging into the ROMIO source code, I found that the root cause of the timing
difference between the two test cases is whether or not the user buffer
passed to MPI_File_write_all is contiguous.
In your test program, the write buffers for all record variables are
allocated in one contiguous space, while the fixed-size variable is in a
separate memory space.
https://github.com/jedwards4b/ParallelIO/blob/25b471d5864db1cf7b8dfa26bd5d568eceba1a04/tests/performance/pioperformance.F90#L220-L227
Therefore, when an extra fixed-size variable is written, the aggregated
write buffer is noncontiguous, while in the other case it is contiguous.
When the write buffer is not contiguous, ROMIO allocates an internal buffer,
copies the data over, and uses it to perform communication. When the buffer
is contiguous, ROMIO uses the user buffer directly for communication.
Such copying can become expensive when the write amount is large.
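To illustrate the two paths described above, here is a minimal sketch in plain C. It is not ROMIO's actual code: the `segment` type and the functions `segments_contiguous` and `comm_buffer` are hypothetical names made up for this example, and the sketch assumes the user buffer can be described as a list of (pointer, length) pieces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical descriptor for one piece of a user write buffer. */
typedef struct {
    const char *ptr;
    size_t len;
} segment;

/* Return 1 if every segment starts exactly where the previous one ends. */
static int segments_contiguous(const segment *segs, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        if (segs[i - 1].ptr + segs[i - 1].len != segs[i].ptr)
            return 0;
    }
    return 1;
}

/*
 * Return the buffer to hand to the communication step.
 * Contiguous case: the user buffer itself, no copy (*owned = 0).
 * Noncontiguous case: a temporary buffer holding a packed copy
 * (*owned = 1, and the caller must free() it).
 */
static const char *comm_buffer(const segment *segs, size_t n, int *owned)
{
    assert(owned != NULL);
    *owned = 0;
    if (n == 0)
        return NULL;
    if (segments_contiguous(segs, n))
        return segs[0].ptr;

    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += segs[i].len;

    char *tmp = malloc(total);
    if (tmp == NULL)
        return NULL;

    /* This extra pass over the data is the cost that grows with write size. */
    size_t off = 0;
    for (size_t i = 0; i < n; i++) {
        memcpy(tmp + off, segs[i].ptr, segs[i].len);
        off += segs[i].len;
    }
    *owned = 1;
    return tmp;
}
```

In this sketch, two pieces taken back to back from one array take the no-copy path, while two pieces separated by a gap trigger the allocation and the memcpy loop.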
If you want to verify this finding, please try allocating the buffers of
the individual record variables separately. Let me know how it goes.
Wei-keng
On Aug 11, 2023, at 6:03 PM, Jim Edwards <jedwards at ucar.edu> wrote:
Hi Wei-keng,
For this case I'm using a RoundRobin distribution as shown here.
    if(doftype .eq. 'ROUNDROBIN') then
       do i=1,varsize
          compmap(i) = (i-1)*npe+mype+1
       enddo
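For readers who want to see what this map produces, here is a small C sketch of the same formula. The function name `roundrobin_compmap` is hypothetical, and the loop index is 0-based, so the Fortran `(i-1)` becomes `i`; the resulting offsets are still 1-based as in the snippet above.

```c
#include <stddef.h>

/*
 * Fill compmap[0..varsize-1] with the 1-based global offsets owned by
 * rank mype out of npe ranks, following the formula in the Fortran
 * snippet: compmap(i) = (i-1)*npe + mype + 1.
 */
static void roundrobin_compmap(long *compmap, size_t varsize,
                               long npe, long mype)
{
    for (size_t i = 0; i < varsize; i++)
        compmap[i] = (long)i * npe + mype + 1;
}
```

With npe = 4, mype = 1 and varsize = 3, this gives offsets 2, 6, 10: each rank owns every npe-th element of the global array.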
--
Jim Edwards
CESM Software Engineer
National Center for Atmospheric Research
Boulder, CO