[Darshan-users] Strange MPI I/O imbalance
Judit Planas
judit.planas at epfl.ch
Fri Apr 22 06:47:55 CDT 2016
Dear all,
While profiling my application with Darshan, I found a large imbalance
in MPI I/O completion times.
Summary question: If every MPI process writes a 4 MB block using
collective MPI_File_write_all, why would there be a 15x difference
between the fastest and slowest ranks? (On BG/Q GPFS with a 4 MB block size)
Here are more details:
My program is a pure-MPI application (no OpenMP, Pthreads, etc.) that
creates a single file and all processes write collectively to the file
through the MPI_File_write_all primitive. Further investigation with
HPCToolkit shows that some processes spend a very long time inside the
MPI I/O collective call (more than 10 times longer than other
processes; see the attached screenshots).
After simplifying my application as much as possible, this is the
configuration I'm running:
- BG/Q system (tried on MIRA, Juqueen and CSCS: absolute execution times
vary, but the imbalance stays the same)
- GPFS filesystem, 4 MB block size
- 1024 ranks in total
- 8 ranks / node, 128 nodes in total (also tried using full mid-plane
i.e. 512 nodes with no change in behavior)
- Each rank writes exactly a single, contiguous block of 4194304 bytes
(4 MB)
- Rank 0 writes its block at the beginning of the file (offset 0), then
rank 1, then rank 2, and so on
- File offsets are set through MPI_File_set_view (a minimal sketch of
this write pattern follows the list)
- Environment variable set: BGLOCKLESSMPIO_F_TYPE=0x47504653
- I can reproduce similar behavior with the IOR benchmark as well
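In case it helps, here is a minimal sketch of the write pattern in my
simplified reproducer (the output file name and buffer contents are
placeholders, and error checking is omitted):

    #include <mpi.h>
    #include <stdlib.h>

    #define BLOCK_SIZE (4 * 1024 * 1024)  /* 4194304 bytes, matches the GPFS block size */

    int main(int argc, char **argv)
    {
        int rank;
        MPI_File fh;
        char *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        buf = malloc(BLOCK_SIZE);  /* 4 MB of data per rank */

        MPI_File_open(MPI_COMM_WORLD, "output.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Rank i places its contiguous 4 MB block at offset i * BLOCK_SIZE */
        MPI_File_set_view(fh, (MPI_Offset)rank * BLOCK_SIZE,
                          MPI_BYTE, MPI_BYTE, "native", MPI_INFO_NULL);

        /* Every rank writes exactly one 4 MB block collectively */
        MPI_File_write_all(fh, buf, BLOCK_SIZE, MPI_BYTE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }

The IOR runs that show similar behavior use collective MPI-IO with a
4 MB block and a 4 MB transfer size per rank, i.e. something like
"ior -a MPIIO -c -b 4m -t 4m -s 1 -o <testfile>" (flags quoted from memory).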
Observations:
- More than half of the ranks (> 512) spend less than 0.05 seconds
inside MPI_File_write_all
- The rest of the ranks spend more than 0.05 seconds inside
MPI_File_write_all, and for some of them the time inside this call
reaches 1.2 seconds
I would be very thankful if someone could help me understand this
strange (or expected?) behavior.
Tracing details are provided in the attached pictures (screenshots of
HPCToolkit):
- 1-MPI_wr_all_imbalance.png:
General trace view of all the 1024 ranks. The MPI_File_write_all time is
shown in green. One can quickly see the imbalance between ranks.
- 2-MPI_wr_all_imbalance_zoom.png:
Zoom of the previous image, from rank 122 to 368, showing the imbalance
in more detail.
- 3-MPI_wr_all_call_stack_zoom.png:
Same zoom as the previous image, with a different color scheme to
illustrate the call stack of MPI_File_write_all and to spot the
functions that take the most time.
Could someone also explain when the __lseek_nocancel and
__write_nocancel functions are used, and what for?
Please, let me know if there is any further information that might be
helpful. I can also provide the simplified source code.
Thanks in advance,
Judit
Attachments (images scrubbed by the list archive):
- 1-MPI_wr_all_imbalance.png (image/png, 174055 bytes): <http://lists.mcs.anl.gov/pipermail/darshan-users/attachments/20160422/d9283a4e/attachment-0003.png>
- 2-MPI_wr_all_imbalance_zoom.png (image/png, 150390 bytes): <http://lists.mcs.anl.gov/pipermail/darshan-users/attachments/20160422/d9283a4e/attachment-0004.png>
- 3-MPI_wr_all_call_stack_zoom.png (image/png, 277756 bytes): <http://lists.mcs.anl.gov/pipermail/darshan-users/attachments/20160422/d9283a4e/attachment-0005.png>