<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">


<head>


<meta http-equiv="Content-Type" content="text/html; charset=utf-8">


<meta name="Generator" content="Microsoft Word 15 (filtered medium)">


<style><!--


/* Font Definitions */


@font-face


        {font-family:"Cambria Math";


        panose-1:2 4 5 3 5 4 6 3 2 4;}


@font-face


        {font-family:Calibri;


        panose-1:2 15 5 2 2 2 4 3 2 4;}


/* Style Definitions */


p.MsoNormal, li.MsoNormal, div.MsoNormal


        {margin:0in;


        margin-bottom:.0001pt;


        font-size:12.0pt;


        font-family:"Calibri",sans-serif;}


a:link, span.MsoHyperlink


        {mso-style-priority:99;


        color:#0563C1;


        text-decoration:underline;}


a:visited, span.MsoHyperlinkFollowed


        {mso-style-priority:99;


        color:#954F72;


        text-decoration:underline;}


span.EmailStyle17


        {mso-style-type:personal-compose;


        font-family:"Calibri",sans-serif;


        color:windowtext;}


.MsoChpDefault


        {mso-style-type:export-only;


        font-family:"Calibri",sans-serif;}


@page WordSection1


        {size:8.5in 11.0in;


        margin:1.0in 1.0in 1.0in 1.0in;}


div.WordSection1


        {page:WordSection1;}


--></style>


</head>


<body lang="EN-US" link="#0563C1" vlink="#954F72">


<div class="WordSection1">


<p class="MsoNormal"><span style="font-size:11.0pt">Hi,<o:p></o:p></span></p>


<p class="MsoNormal"><span style="font-size:11.0pt">I’m trying to run a TraceR OTF simulation with lots of messages and lots of congestion. This is the first time that I’ve had a big enough simulation that I need to run it in parallel, and I’m having a really


 hard time getting any sort of parallel speedup. I tried running on 4-8 nodes with --sync=2 and --sync=3, as well as various values of --nkp, and the best I’ve gotten is only a few percent faster than serial. I looked into it some, and found that the cause appears


 to be massive load imbalance. I’m attaching a screenshot from hpctraceviewer that shows that rank 0 does almost all the work while the other ranks spend a large amount of time in MPI_Allreduce, waiting for rank 0 to arrive. I don’t know this part of ROSS/CODES


 very well, but does this mean the LPs are not being distributed evenly? If so, how can I change the distribution?<o:p></o:p></span></p>


<p class="MsoNormal"><span style="font-size:11.0pt">It wouldn’t surprise me if my traffic pattern caused some load imbalance because there are 4 endpoints that receive way more traffic than the others, but I don’t think the imbalance should be this bad.<o:p></o:p></span></p>


<p class="MsoNormal"><span style="font-size:11.0pt">Thank you very much,<o:p></o:p></span></p>


<p class="MsoNormal"><span style="font-size:11.0pt">Philip Taffet<o:p></o:p></span></p>


</div>


</body>


</html>