[Darshan-users] Strange MPI I/O imbalance

Harms, Kevin harms at alcf.anl.gov
Fri Apr 22 09:44:41 CDT 2016


  In collective I/O, only a subset of ranks (the aggregators) perform the actual I/O; the other ranks just transmit their data to an aggregator rank. So a "load imbalance" is expected, since the time to communicate the data is likely much less than the time for an aggregator rank to write the buffered data out, especially on Blue Gene systems, which have high-performance interconnects.
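
  If you want to see how many aggregators the MPI-IO layer is actually
using, one thing you can try is to dump the hints in effect on the open
file handle. Below is a minimal sketch (my own illustration, not taken
from your code); hint names such as "cb_nodes" (the number of aggregator
nodes) and "romio_cb_write" are ROMIO-specific, so check what your
implementation actually reports:

/* Sketch: print the MPI-IO hints in effect on an open file handle.
 * Hint names like "cb_nodes" are implementation (ROMIO) specific. */
#include <mpi.h>
#include <stdio.h>

static void print_io_hints(MPI_File fh, int rank)
{
    if (rank != 0)
        return;                           /* one copy of the output is enough */

    MPI_Info info;
    int nkeys;
    MPI_File_get_info(fh, &info);         /* hints the implementation is using */
    MPI_Info_get_nkeys(info, &nkeys);

    for (int i = 0; i < nkeys; i++) {
        char key[MPI_MAX_INFO_KEY + 1], val[MPI_MAX_INFO_VAL + 1];
        int flag;
        MPI_Info_get_nthkey(info, i, key);
        MPI_Info_get(info, key, MPI_MAX_INFO_VAL, val, &flag);
        if (flag)
            printf("hint %-24s = %s\n", key, val);
    }
    MPI_Info_free(&info);
}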

  I did not look at all the details you sent, but hopefully this explains what you are seeing.

kevin





>Dear all,
>
>While profiling my application with Darshan, I found there was a large 
>imbalance in an MPI I/O completion.
>
>Summary question: If every MPI process writes a 4 MB block using a 
>collective MPI_File_write_all, why would there be a 15x difference 
>between the fastest and slowest ranks? (On BG/Q with GPFS, 4 MB block size)
>
>Here are more details:
>
>My program is a pure-MPI application (no OpenMP, Pthreads, etc.) that 
>creates a single file; all processes write collectively to the file 
>through the MPI_File_write_all primitive. Further investigation with 
>HPCToolkit shows that some processes spend a very long time inside the 
>MPI I/O collective call (more than 10 times longer than other 
>processes; see attached screenshots).
>
>After simplifying my application as much as possible, this is the 
>configuration I'm running:
>- BG/Q system (tried on MIRA, Juqueen and CSCS: absolute execution times 
>vary, but the imbalance stays the same)
>- GPFS filesystem, 4 MB block size
>- 1024 ranks in total
>- 8 ranks / node, 128 nodes in total (also tried using a full mid-plane, 
>i.e. 512 nodes, with no change in behavior)
>- Each rank writes exactly a single, contiguous block of 4194304 bytes 
>(4 MB)
>- Rank 0 writes its block at the beginning of the file (offset 0), then 
>rank 1, then rank 2, etc.
>- File offsets are set through MPI_File_set_view (a sketch of the 
>simplified code follows this list)
>- Environment variable set: BGLOCKLESSMPIO_F_TYPE=0x47504653
>- Able to reproduce similar behavior with the IOR benchmark as well
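>
>In essence, the simplified code boils down to the following (sketch 
>only; error checking is omitted and the output file name is just a 
>placeholder):
>
>    #include <mpi.h>
>    #include <stdlib.h>
>
>    #define BLOCK_SIZE (4 * 1024 * 1024)   /* 4194304 bytes per rank */
>
>    int main(int argc, char **argv)
>    {
>        int rank;
>        MPI_File fh;
>        char *buf;
>
>        MPI_Init(&argc, &argv);
>        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>        buf = malloc(BLOCK_SIZE);          /* filled with application data */
>
>        /* "output.dat" is a placeholder name. */
>        MPI_File_open(MPI_COMM_WORLD, "output.dat",
>                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
>
>        /* Rank r sees the file starting at offset r * BLOCK_SIZE. */
>        MPI_File_set_view(fh, (MPI_Offset)rank * BLOCK_SIZE,
>                          MPI_BYTE, MPI_BYTE, "native", MPI_INFO_NULL);
>
>        /* Collective write of one contiguous 4 MB block per rank. */
>        MPI_File_write_all(fh, buf, BLOCK_SIZE, MPI_BYTE, MPI_STATUS_IGNORE);
>
>        MPI_File_close(&fh);
>        free(buf);
>        MPI_Finalize();
>        return 0;
>    }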
>
>Observations:
>- More than half of the ranks (> 512) spend less than 0.05 seconds 
>inside MPI_File_write_all
>- The rest of the ranks spend more than 0.05 seconds inside 
>MPI_File_write_all, and the maximum time inside this call goes up to 1.2 
>seconds for some of the ranks
>
>I would be very thankful if someone could help me understand this 
>strange (or perhaps expected?) behavior.
>
>Tracing details are provided in the attached pictures (screenshots of 
>HPCToolkit):
>- 1-MPI_wr_all_imbalance.png:
>General trace view of all the 1024 ranks. The MPI_File_write_all time is 
>shown in green. One can quickly see the imbalance between ranks.
>
>- 2-MPI_wr_all_imbalance_zoom.png:
>Zoom of the previous image, from rank 122 to 368, showing the 
>imbalance in more detail.
>
>- 3-MPI_wr_all_call_stack_zoom.png:
>Same zoom as the previous image, with a different color scheme to 
>illustrate the call stack of MPI_File_write_all and to spot the 
>functions that take the most time.
>
>Could someone also explain when the __lseek_nocancel and 
>__write_nocancel functions are used, and what for?
>
>Please let me know if there is any further information that might be 
>helpful. I can also provide the simplified source code.
>
>Thanks in advance,
>Judit