[Darshan-users] Strange MPI I/O imbalance

Judit Planas judit.planas at epfl.ch
Fri Apr 22 10:34:09 CDT 2016


Dear Kevin,

Thanks for your quick reply.

Is there a way to find out which ranks are the aggregators?

Your answer makes a lot of sense and would explain the imbalance. 
However, in some cases the imbalance appears on every second rank, which 
suggests there is one aggregator for every pair of ranks. Should I 
expect the aggregators to be evenly distributed among the ranks? And 
what should the ratio of aggregators to total ranks be?
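
In case it helps, this is roughly what I had in mind to check it: a 
minimal sketch (assuming a ROMIO-based MPI-IO layer; "out.dat" is just a 
placeholder file name) that dumps the hints in effect for an open file. 
The standard cb_nodes hint reports how many aggregators the 
implementation chose, and I believe some ROMIO builds also expose the 
aggregator ranks through a romio_aggregator_list hint, though I have not 
verified which hints the BG/Q MPI actually exposes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File fh;
    /* Placeholder file name, opened the same way as in the real code */
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Ask the implementation which MPI-IO hints are in effect */
    MPI_Info info;
    MPI_File_get_info(fh, &info);

    if (rank == 0) {
        int nkeys;
        MPI_Info_get_nkeys(info, &nkeys);
        for (int i = 0; i < nkeys; i++) {
            char key[MPI_MAX_INFO_KEY], value[MPI_MAX_INFO_VAL];
            int flag;
            MPI_Info_get_nthkey(info, i, key);
            MPI_Info_get(info, key, MPI_MAX_INFO_VAL, value, &flag);
            /* Look for cb_nodes (number of aggregators) and, where
               available, the list of aggregator ranks */
            printf("%s = %s\n", key, value);
        }
    }

    MPI_Info_free(&info);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}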

Thanks!
Judit

On 22/04/16 16:44, Harms, Kevin wrote:
>    In collective I/O, only a subset of ranks (the aggregators) perform the actual I/O; the other ranks just transmit their data to an aggregator rank. So a "load imbalance" is expected, since the time to communicate the data is likely much less than the time for an aggregator to write the buffered data out, especially on Blue Gene systems, which have high-performance interconnects.
>
>    I did not look at all the details you sent, but hopefully this explains what you are seeing.
>
> kevin
>
>
>
>
>
>> Dear all,
>>
>> While profiling my application with Darshan, I found there was a large
>> imbalance in an MPI I/O completion.
>>
>> Summary question: If every MPI process writes a 4 MB block using
>> collective MPI_File_write_all, why would there be a 15x difference
>> between the fastest and slowest ranks? (On BG/Q GPFS with a 4 MB block
>> size.)
>>
>> Here are more details:
>>
>> My program is a pure-MPI application (no OpenMP, Pthreads, etc.) that
>> creates a single file, and all processes write collectively to the file
>> through the MPI_File_write_all primitive. Further investigation with
>> HPCToolkit shows that some processes spend a very long time inside the
>> MPI I/O collective call (more than 10 times longer than other
>> processes; see the attached screenshots).
>>
>> After simplifying my application as much as possible, this is the
>> configuration I'm running:
>> - BG/Q system (tried on MIRA, Juqueen and CSCS: absolute execution times
>> vary, but imbalance stays the same)
>> - GPFS filesystem, 4 MB block size
>> - 1024 ranks in total
>> - 8 ranks / node, 128 nodes in total (also tried using full mid-plane
>> i.e. 512 nodes with no change in behavior)
>> - Each rank writes exactly a single, contiguous block of 4194304 bytes
>> (4 MB)
>> - Rank 0 writes its block at the beginning of the file (offset 0), then
>> rank 1, then rank 2, etc.
>> - File offsets are set through MPI_File_set_view (see the sketch after
>> this list)
>> - Environment variable set: BGLOCKLESSMPIO_F_TYPE=0x47504653
>> - Able to reproduce similar behavior with IOR benchmark as well
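>>
>> A minimal sketch of this pattern (buffer contents, error checking and
>> the real file name are omitted):
>>
>> #include <mpi.h>
>> #include <stdlib.h>
>>
>> #define BLOCK_SIZE (4 * 1024 * 1024)   /* 4 MB per rank */
>>
>> int main(int argc, char **argv)
>> {
>>     MPI_Init(&argc, &argv);
>>     int rank;
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>
>>     char *buf = malloc(BLOCK_SIZE);   /* contents do not matter here */
>>
>>     MPI_File fh;
>>     MPI_File_open(MPI_COMM_WORLD, "output.dat",
>>                   MPI_MODE_CREATE | MPI_MODE_WRONLY,
>>                   MPI_INFO_NULL, &fh);
>>
>>     /* Rank r writes one contiguous 4 MB block at offset r * 4 MB */
>>     MPI_Offset offset = (MPI_Offset)rank * BLOCK_SIZE;
>>     MPI_File_set_view(fh, offset, MPI_BYTE, MPI_BYTE, "native",
>>                       MPI_INFO_NULL);
>>
>>     MPI_Status status;
>>     MPI_File_write_all(fh, buf, BLOCK_SIZE, MPI_BYTE, &status);
>>
>>     MPI_File_close(&fh);
>>     free(buf);
>>     MPI_Finalize();
>>     return 0;
>> }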
>>
>> Observations:
>> - More than half of the ranks (> 512) spend less than 0.05 seconds
>> inside MPI_File_write_all
>> - The rest of the ranks spend more than 0.05 seconds inside
>> MPI_File_write_all, and the maximum time inside this call goes up to 1.2
>> seconds for some of the ranks
>>
>> I would be very thankful if someone could help me understand this
>> strange (or expected?) behavior.
>>
>> Tracing details are provided in the attached pictures (screenshots of
>> HPCToolkit):
>> - 1-MPI_wr_all_imbalance.png:
>> General trace view of all the 1024 ranks. The MPI_File_write_all time is
>> shown in green. One can quickly see the imbalance between ranks.
>>
>> - 2-MPI_wr_all_imbalance_zoom.png:
>> Zoom of the previous image, from rank 122 to 368. Here the imbalance
>> can be seen in more detail.
>>
>> - 3-MPI_wr_all_call_stack_zoom.png:
>> Same zoom as the previous image, with a different color scheme to show
>> the call stack of MPI_File_write_all and spot the functions that take
>> the most time.
>>
>> Could someone also explain when the __lseek_nocancel and
>> __write_nocancel functions are used, and for what purpose?
>>
>> Please, let me know if there is any further information that might be
>> helpful. I can also provide the simplified source code.
>>
>> Thanks in advance,
>> Judit
