<html><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Even without the specifics, it would be useful to have a test program that could run these tests and present the differences in performance. And a version that, like "stress", could run for a specified length of time, would be helpful in testing for races in the thread and SMP code. Could one of the summer students write such an example?<div><br></div><div>Bill</div><div><br><div><div>On Jul 27, 2009, at 1:25 PM, chong tan wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><span class="Apple-style-span" style="border-collapse: separate; color: rgb(0, 0, 0); font-family: Helvetica; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-align: auto; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-border-horizontal-spacing: 0px; -webkit-border-vertical-spacing: 0px; -webkit-text-decorations-in-effect: none; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; "><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; "><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; "><br class="Apple-interchange-newline">Nicholas and Pavan, and D,</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; "> </div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; ">Some tiny bit of info from my work. 
When we run our parallelized application,</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; ">we like to see time spent in communication, MPICH2 that is, being 30% or less.</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; ">That includes setting up the data exchange, and processes having to wait for other</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; ">processes. Our application will make a good number of communications, in the</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; ">100 billion to a few trillion, over a period of hours to months. </div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; "> </div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; ">The data sent between the processes is as little as 8 bytes, up to a few MBytes. 
Most of</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; ">the time, the data are small, &lt; 128 bytes.</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; "> </div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; ">MPICH2 does a good job of buffering sends, but that really did not help us</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; ">much, as there is one recv for each send.</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; "> </div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; ">Our monitoring shows that most recvs take around 0.6us on most of our boxes. 
We are</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; ">talking about $12K boxes here, so that is not bad.</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; "> </div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; ">Currently, we observe 5-10% in recv time that we hope to eliminate by parallelizing</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; ">the recv operations (when the main thread is doing meaningful work). That fits our</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; ">application very well as recv can be isolated easily.</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; "> </div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; ">We have tried these on tests that run for hours:</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; "> - non-threaded, blocking recv : this is still the fastest solution</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; "> - non-threaded, Irecv : bad</div><div style="margin-top: 0px; margin-right: 0px; 
margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; "> - non-threaded, pre-launched Irecv : bad, but not too bad</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; "> - thread multiple : very bad</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; "> - thread multiple with Irecv : not as bad as very bad</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; "> - thread funneled : super bad<br></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; "> </div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; ">We analyzed a few tests where the 'recv' thread could have run in parallel with the main thread</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; ">nicely, and yet negative gains were observed. 
We don't have any theory why that could have</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; ">happened.</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; "> </div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; ">So, we are particularly curious about what is happening with thread multiple. We have one</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; ">thing in common between the threaded and non-threaded Irecv tests: Waitall. Could this be the cause?</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; "> </div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; ">thanks</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; ">tan</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: 'times new roman', 'new york', times, serif; font-size: 12pt; "> </div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-family: arial, helvetica, sans-serif; font-size: 13px; "><font size="2" face="Tahoma"><hr size="1"><b><span style="font-weight: bold; ">From:</span></b><span class="Apple-converted-space"> </span>Pavan Balaji &lt;<a href="mailto:balaji@mcs.anl.gov">balaji@mcs.anl.gov</a>&gt;<br><b><span style="font-weight: 
bold; ">To:</span></b><span class="Apple-converted-space"> </span><a href="mailto:mpich-discuss@mcs.anl.gov">mpich-discuss@mcs.anl.gov</a><br><b><span style="font-weight: bold; ">Sent:</span></b><span class="Apple-converted-space"> </span>Saturday, July 25, 2009 2:40:45 PM<br><b><span style="font-weight: bold; ">Subject:</span></b><span class="Apple-converted-space"> </span>Re: [mpich-discuss] thread MPI calls<br></font><br>Nicholas,<br><br>From what I understand about your application, there are two approaches you can use:<br><br>1. Use no threads -- in this case, each worker posts Irecvs for all processes it's expecting messages from (using MPI_ANY_SOURCE if needed) and runs a loop similar to:<br><br>till_work_is_done {<br> compute();<br> Waitany() or Testany()<br>}<br><br>This is the approach most master-worker applications, such as mpiBLAST, tend to use.<br><br>2. Otherwise, you can use threads as you suggested below. It is true that there is some overhead, but unless you are using very small messages (e.g., &lt; 64 bytes), and performing a lot of communication (e.g., 90% of your application is communication time), you'll not notice any overhead.<br><br>In any case, we are working on some enhanced designs that will minimize threading overheads even in such rare cases (some of them are experimentally included in the mpich2-1.1.x series).<br><br>-- Pavan<br><br>On 07/25/2009 01:38 PM, Nicolas Rosner wrote:<br>> Sorry to shamelessly invade Tan's, but since we're in the middle of a<br>> thread about threads, I thought I'd rephrase an old question I once<br>> tried to state here -- with a lot less understanding of the problem<br>> back then. 
Context: My app is a rather typical<br>> single-centralized-master, N-1-worker, pool-of-tasks setup, except<br>> that workers don't just consume, but may also split tasks that seem<br>> too hard, which ends up pushing large series of new child tasks back<br>> to where the parent had been obtained, and so on, recursively.<br>><span class="Apple-converted-space"> </span><br>> Now, most task-related messages and structures don't handle the<br>> real/full tasks (~1 MB to 10 MB of data each), but lightweight proxy<br>> objects (~1 KB) that represent them -- which works fine in many<br>> situations where some subset of metadata suffices. However, after a<br>> worker gets a task assignment, at some point it does need the complete<br>> file. Conversely, when one is divided, the splitting worker needs to<br>> store the new files somewhere. The current approach is as follows:<br>><span class="Apple-converted-space"> </span><br>> - all new task files produced by a worker W are stored to a local<br>> HD on its host<br>><span class="Apple-converted-space"> </span><br>> - master is immediately notified about each creation (children are<br>> pushed back into queue,<br>> but by reference only, including a "currently located at ..." 
field),<br>><span class="Apple-converted-space"> </span><br>> - when master assigns a task to some other worker Z, the message<br>> includes said location, and<br>><span class="Apple-converted-space"> </span><br>> - worker Z then sends a msg to worker W requesting the task, which<br>> W sends as one fat MPI message.<br>><span class="Apple-converted-space"> </span><br>> This achieves decent load balancing, but although a worker can't<br>> really do anything productive while waiting for a totally new<br>> datafile, it may certainly not block if it is to stay deadlock-free --<br>> it [or *some* component, anyway] needs to be ready to serve any<br>> potential incoming "full datafile requests" from other workers within<br>> some constant amount of delay that may not depend on its own pending<br>> request.<br>><span class="Apple-converted-space"> </span><br>> So, I first tried the nonblocking approach; no good, worker + file<br>> server chores combined yield a giant statechart, way too complex to<br>> debug and maintain. Then, trying to stay away from hybrid thread+MPI<br>> code, I tried separating the worker and server as two different<br>> processes, and ran twice as many processes as available processors.<br>> Although Linux did a pretty good job scheduling them (not surprising<br>> since they're almost purely cpu and i/o bound, respectively), there is<br>> some performance penalty, plus it's messy to be keeping track of how<br>> mpiexec deals roles to hosts, e.g. 
lots of rank -> host mappings that<br>> were fine suddenly become unusable and must be avoided, etc.<br>><span class="Apple-converted-space"> </span><br>> Eventually I gave up on my "no-hybrid" and "don't want to depend on<br>> thread_multiple support" wishes, got myself a pthreads tutorial and<br>> ended up with a worker version that uses a 2nd thread (+ possibly<br>> several subthreads thereof) to keep serving files regardless of main<br>> thread (actual worker code) status -- and with cleaner, better<br>> separated code. (Note that there is little or no interaction between<br>> Wi and Si once launched -- Si just needs to keep serving files that Wi<br>> produced earlier on but doesn't even remember, let alone care about,<br>> anymore. Both need MPI messaging all the time, though.)<br>><span class="Apple-converted-space"> </span><br>> Bottom line: as scary as it looked to me to go hybrid, and despite<br>> several warnings from experts against it, in this particular case it<br>> turned out to be simpler than many of the clumsy attempts at avoiding<br>> it, and probably the "least ugly" approach I've found so far.<br>><span class="Apple-converted-space"> </span><br>> Questions: Do you think this solution is OK? Any suggestions for<br>> alternatives or improvements? Am I wrong in thinking that this kind<br>> of need must be a common one in this field? How do people normally<br>> address it? Distributed filesystems, perhaps? Or something like a<br>> lightweight http server on each host? Shared central storage is not an<br>> option -- its scalability hits the ceiling way too quickly. Is the<br>> performance penalty that Pavan just mentioned (for M_TH_MULTIPLE) of<br>> considerable magnitude? 
Do you think it could be avoided, while still<br>> keeping the worker code reasonably isolated and unaware of the serving<br>> part?<br>><span class="Apple-converted-space"> </span><br>> Thanks in advance for any comments, and in retrospect for all the<br>> useful info here.<br>><span class="Apple-converted-space"> </span><br>> Sincerely<br>> N.<br><br>-- Pavan Balaji<br><a href="http://www.mcs.anl.gov/~balaji" target="_blank">http://www.mcs.anl.gov/~balaji</a><br></div></div><br></span><br class="Apple-interchange-newline"></blockquote></div><br><div> <span class="Apple-style-span" style="border-collapse: separate; color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-align: auto; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-border-horizontal-spacing: 0px; -webkit-border-vertical-spacing: 0px; -webkit-text-decorations-in-effect: none; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0; "><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><span class="Apple-style-span" style="border-collapse: separate; -webkit-border-horizontal-spacing: 0px; -webkit-border-vertical-spacing: 0px; color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; -webkit-text-decorations-in-effect: none; text-indent: 0px; -webkit-text-size-adjust: auto; text-transform: none; orphans: 2; white-space: normal; widows: 2; word-spacing: 0px; "><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><div>William Gropp</div><div>Deputy Director for Research</div><div>Institute for Advanced Computing Applications and Technologies</div><div>Paul and Cynthia Saylor Professor of Computer 
Science</div><div>University of Illinois Urbana-Champaign</div><div><br class="khtml-block-placeholder"></div></div><br class="Apple-interchange-newline"></span></div></span><br class="Apple-interchange-newline"> </div><br></div></body></html>