[Darshan-users] Strange MPI I/O imbalance
Judit Planas
judit.planas at epfl.ch
Fri Apr 22 06:47:55 CDT 2016
Dear all,
While profiling my application with Darshan, I found a large imbalance
in MPI I/O completion times.
Summary question: If every MPI process writes a 4 MB block using
collective MPI_File_write_all, why would there be a 15x difference
between the fastest and slowest ranks? (On BG/Q GPFS with a 4 MB block size)
Here are more details:
My program is a pure-MPI application (no OpenMP, Pthreads, etc.) that
creates a single file and all processes write collectively to the file
through the MPI_File_write_all primitive. Further investigation with
HPCToolkit shows that some processes spend a very long time inside the
MPI I/O collective call (more than 10 times longer than other
processes; see the attached screenshots).
After simplifying my application as much as possible, this is the
configuration I'm running:
- BG/Q system (tried on MIRA, Juqueen and CSCS: absolute execution times
vary, but the imbalance stays the same)
- GPFS filesystem, 4 MB block size
- 1024 ranks in total
- 8 ranks / node, 128 nodes in total (also tried using full mid-plane
i.e. 512 nodes with no change in behavior)
- Each rank writes exactly a single, contiguous block of 4194304 bytes
(4 MB)
- Rank 0 writes its block at the beginning of the file (offset 0), then
rank 1, then rank 2, and so on
- File offsets are set through MPI_File_set_view (a minimal sketch of
this write pattern follows the list)
- Environment variable set: BGLOCKLESSMPIO_F_TYPE=0x47504653
- I can reproduce similar behavior with the IOR benchmark as well
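In case it helps, here is a minimal sketch of the write pattern in my
simplified reproducer (the output file name and buffer contents are
placeholders, and error checking is omitted):

    #include <mpi.h>
    #include <stdlib.h>

    #define BLOCK_SIZE (4 * 1024 * 1024)  /* 4194304 bytes, matches the GPFS block size */

    int main(int argc, char **argv)
    {
        int rank;
        MPI_File fh;
        char *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        buf = malloc(BLOCK_SIZE);  /* 4 MB of data per rank */

        MPI_File_open(MPI_COMM_WORLD, "output.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Rank i places its contiguous 4 MB block at offset i * BLOCK_SIZE */
        MPI_File_set_view(fh, (MPI_Offset)rank * BLOCK_SIZE,
                          MPI_BYTE, MPI_BYTE, "native", MPI_INFO_NULL);

        /* Every rank writes exactly one 4 MB block collectively */
        MPI_File_write_all(fh, buf, BLOCK_SIZE, MPI_BYTE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }

The IOR runs that show similar behavior use collective MPI-IO with a
4 MB block and a 4 MB transfer size per rank, i.e. something like
"ior -a MPIIO -c -b 4m -t 4m -s 1 -o <testfile>" (flags quoted from memory).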
Observations:
- More than half of the ranks (> 512) spend less than 0.05 seconds
inside MPI_File_write_all
- The rest of the ranks spend more than 0.05 seconds inside
MPI_File_write_all, and for some of them the time inside this call
reaches 1.2 seconds
I would be very thankful if someone could help me understand this
strange (or expected?) behavior.
Tracing details are provided in the attached pictures (screenshots of
HPCToolkit):
- 1-MPI_wr_all_imbalance.png:
General trace view of all the 1024 ranks. The MPI_File_write_all time is
shown in green. One can quickly see the imbalance between ranks.
- 2-MPI_wr_all_imbalance_zoom.png:
Zoom of the previous image, from rank 122 to 368, showing the imbalance
in more detail.
- 3-MPI_wr_all_call_stack_zoom.png:
Same zoom as the previous image, with a different color scheme to
illustrate the call stack of MPI_File_write_all and to spot the
functions that take the most time.
Could someone also explain when the __lseek_nocancel and
__write_nocancel functions are used, and what for?
Please, let me know if there is any further information that might be
helpful. I can also provide the simplified source code.
Thanks in advance,
Judit
Attachments (images scrubbed by the list archive):
- 1-MPI_wr_all_imbalance.png (image/png, 174055 bytes): <http://lists.mcs.anl.gov/pipermail/darshan-users/attachments/20160422/d9283a4e/attachment-0003.png>
- 2-MPI_wr_all_imbalance_zoom.png (image/png, 150390 bytes): <http://lists.mcs.anl.gov/pipermail/darshan-users/attachments/20160422/d9283a4e/attachment-0004.png>
- 3-MPI_wr_all_call_stack_zoom.png (image/png, 277756 bytes): <http://lists.mcs.anl.gov/pipermail/darshan-users/attachments/20160422/d9283a4e/attachment-0005.png>