[codes-ross-users] Load imbalance when running TraceR in parallel

Caitlin Ross rossc3 at rpi.edu
Mon Jul 2 09:11:36 CDT 2018


Hi Philip,

As far as LP distribution goes, CODES maps LPs to PEs linearly based on the LPGROUPS section of the CODES configuration file you use. I don't think there's currently any way to change the mapping (besides going into src/util/codes_mapping.c and changing it to do something else). We've recently had some discussion about adding load balancing support to ROSS, but I'm unsure when that will happen.
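For reference, here's a rough sketch of what that section looks like (the group name, LP types, and counts below are just placeholders in a dragonfly-style setup, not your actual config):

    LPGROUPS
    {
       MODELNET_GRP
       {
          repetitions="8";
          nw-lp="4";
          modelnet_dragonfly="4";
          dragonfly_router="1";
       }
    }

The LPs generated from this section are laid out in order and, as mentioned above, handed to PEs linearly, so if your few hot endpoints happen to sit in the same chunk of that ordering they can all land on the same rank.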

I do have some tips for optimistic simulation settings. You may want to try the real-time optimistic mode (--sync=5); I think we've seen some improvement in CODES models with it. Normal optimistic (--sync=3) bases the synchronization frequency on the number of events that have been processed, whereas real-time sync performs it after some amount of wall-clock time has passed. If you choose --sync=5, then you probably want to set --gvt-interval=32, which means 32 ms between GVT computations (if you stick with --sync=3, a gvt-interval of 128 may work well). Also, regardless of whether you use sync 3 or 5, set --batch=1. This increases how often the network is polled for newly arrived events and can help reduce rollbacks. (See the example command line after the next paragraph.)

Another setting you can try (in addition to the ones above) is --max-opt-lookahead. I'm not sure of the exact value you should use, but I've had decent success setting it somewhere between 100 and 1000 when using the dragonfly model. This puts a window on the events that can be executed, keeping PEs from getting too far ahead of the other PEs in virtual time. So it should slow down the more lightly loaded PEs, hopefully keeping them from causing rollbacks on the heavily loaded PE.
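Putting those together, an invocation might look something like this (the binary path, process count, and config file names are just placeholders for whatever you're already running; the flag values are the starting points suggested above):

    mpirun -np 8 ./traceR --sync=5 --gvt-interval=32 --batch=1 \
        --max-opt-lookahead=500 -- network_config tracer_config

From there you can tune --gvt-interval and --max-opt-lookahead and watch how the rollback counts ROSS reports at the end of the run change.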

Hopefully this helps!
Caitlin

> On Jun 29, 2018, at 12:10 PM, Taffet, Philip Adam <taffet2 at llnl.gov> wrote:
> 
> Hi,
> I’m trying to run a TraceR OTF simulation with lots of messages and lots of congestion. This is the first time I’ve had a big enough simulation that I need to run it in parallel, and I’m having a really hard time getting any sort of parallel speedup. I tried running on 4-8 nodes with --sync=2 and --sync=3, as well as various values of --nkp, and the best I’ve gotten is only a few percent faster than serial. I looked into it some and found that the cause appears to be massive load imbalance. I’m attaching a screenshot from hpctraceviewer that shows that rank 0 does almost all the work while the other ranks spend a large amount of time in MPI_Allreduce, waiting for rank 0 to arrive. I don’t know this part of ROSS/CODES very well, but does this mean the LPs are not being distributed evenly? If so, how can I change the distribution?
> It wouldn’t surprise me if my traffic pattern caused some load imbalance because there are 4 endpoints that receive way more traffic than the others, but I don’t think the imbalance should be this bad.
> Thank you very much,
> Philip Taffet
> <Screen Shot 2018-06-26 at 4.16.09 PM[1].png>