[Darshan-users] Strange MPI I/O imbalance

Rob Latham robl at mcs.anl.gov
Fri Apr 22 10:45:31 CDT 2016



On 04/22/2016 10:34 AM, Judit Planas wrote:
> Dear Kevin,
>
> Thanks for your quick reply.
>
> Is there a way to find out which ranks are the aggregators?
>
> Your answer makes a lot of sense and this would explain the imbalance.
> However, in some cases I see the imbalance happens for every second
> rank, so it seems that there is an aggregator for every pair of ranks.
> Should I expect aggregators to be equally distributed among all the
> ranks? What should be the ratio between aggregators and total ranks?

The answers to your questions depend a lot on your MPI implementation, 
but since the MPI-IO implementations on these systems are all descended 
from ROMIO, I'll give you some general pointers.

ROMIO will, by default, pick one process per node to be an I/O 
aggregator. For a lot of configurations, it does not make sense for 
multiple processes on one node to fight for the single network link to 
the storage.
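
If you want to experiment, collective buffering is controlled through 
hints passed at open time. A rough sketch ("testfile" and the hint 
value here are just placeholders, and the implementation is free to 
adjust or ignore the hint):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_nodes", "64");   /* request 64 aggregators */

    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* ... collective I/O here ... */

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}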

The story is a little more complicated on Blue Gene: I offer this link 
only to show you how flexible the MPI implementation can be in 
selecting aggregators.

https://press3.mcs.anl.gov/romio/2015/05/15/aggregation-selection-on-blue-gene/

At smaller scales, and if you have a fairly recent MPICH library (v3.1.1 
or newer), you can read the "romio_aggregator_list" hint to get the list 
of MPI ranks that are aggregators.


==rob




> Thanks!
> Judit
>
> On 22/04/16 16:44, Harms, Kevin wrote:
>>    In collective I/O, only a subset of ranks (aggregators) perform the
>> actual I/O. The other ranks just transmit data to an aggregator rank.
>> So having a "load imbalance" would be expected, as the time to
>> communicate the data is likely much less than the time for a rank to
>> write the buffered data out, especially on Blue Gene systems, which
>> have high-performance interconnects.
>>
>>    I did not look at all the details you sent, but hopefully this
>> explains what you are seeing.
>>
>> kevin
>>
>>
>>
>>
>>
>>> Dear all,
>>>
>>> While profiling my application with Darshan, I found there was a large
>>> imbalance in an MPI I/O completion.
>>>
>>> Summary question: If every MPI process writes a 4 MB block using
>>> collective MPI_File_write_all, why would there be a 15x difference
>>> between the fastest and slowest ranks? (On BG/Q GPFS with a 4 MB
>>> block size)
>>>
>>> Here are more details:
>>>
>>> My program is a pure-MPI application (no OpenMP, Pthreads, etc.) that
>>> creates a single file; all processes write collectively to the file
>>> through the MPI_File_write_all primitive. Further investigation with
>>> HPCToolkit shows that some processes spend a very long time inside the
>>> MPI I/O collective call (more than 10 times longer than other
>>> processes; see attached screenshots).
>>>
>>> After simplifying my application as much as possible, this is the
>>> configuration I'm running:
>>> - BG/Q system (tried on MIRA, Juqueen and CSCS: absolute execution times
>>> vary, but imbalance stays the same)
>>> - GPFS filesystem, 4 MB block size
>>> - 1024 ranks in total
>>> - 8 ranks / node, 128 nodes in total (also tried using full mid-plane
>>> i.e. 512 nodes with no change in behavior)
>>> - Each rank writes exactly a single, contiguous block of 4194304 bytes
>>> (4 MB)
>>> - Rank 0 writes its block at the beginning of the file (offset 0), then
>>> rank 1, then rank 2, etc.
>>> - File offsets are set through MPI_File_set_view
>>> - Environment variable set: BGLOCKLESSMPIO_F_TYPE=0x47504653
>>> - Able to reproduce similar behavior with IOR benchmark as well
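
For reference, the pattern described in the list above boils down to 
roughly the following sketch (file name and buffer contents are 
placeholders; error checking omitted):

#include <mpi.h>
#include <stdlib.h>

#define BLOCK_SIZE (4 * 1024 * 1024)    /* 4 MB per rank */

int main(int argc, char **argv)
{
    MPI_File fh;
    int rank;
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    buf = malloc(BLOCK_SIZE);           /* placeholder payload */

    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Rank 0 writes at offset 0, rank 1 at 4 MB, rank 2 at 8 MB, ... */
    MPI_File_set_view(fh, (MPI_Offset)rank * BLOCK_SIZE,
                      MPI_BYTE, MPI_BYTE, "native", MPI_INFO_NULL);

    /* Collective write: only the aggregator ranks touch the file system. */
    MPI_File_write_all(fh, buf, BLOCK_SIZE, MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
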
>>>
>>> Observations:
>>> - More than half of the ranks (> 512) spend less than 0.05 seconds
>>> inside MPI_File_write_all
>>> - The rest of the ranks spend more than 0.05 seconds inside
>>> MPI_File_write_all, and the maximum time inside this call goes up to 1.2
>>> seconds for some of the ranks
>>>
>>> I would be very thankful if someone could help me understand this
>>> strange (or expected?) behavior.
>>>
>>> Tracing details are provided in the attached pictures (screenshots of
>>> HPCToolkit):
>>> - 1-MPI_wr_all_imbalance.png:
>>> General trace view of all 1024 ranks. The MPI_File_write_all time is
>>> shown in green. One can quickly see the imbalance between ranks.
>>>
>>> - 2-MPI_wr_all_imbalance_zoom.png:
>>> Zoom of the previous image, from rank 122 to 368. Here we can see the
>>> imbalance in more detail.
>>>
>>> - 3-MPI_wr_all_call_stack_zoom.png:
>>> Same zoom as the previous image, different color scheme to illustrate
>>> the call stack of MPI_File_write_all and spot the functions that take
>>> more time.
>>>
>>> Could someone also explain when the __lseek_nocancel and
>>> __write_nocancel functions are used, and what for?
>>>
>>> Please, let me know if there is any further information that might be
>>> helpful. I can also provide the simplified source code.
>>>
>>> Thanks in advance,
>>> Judit
>
> _______________________________________________
> Darshan-users mailing list
> Darshan-users at lists.mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/darshan-users

