<html><head><style type="text/css"><!-- DIV {margin:0px;} --></style></head><body><div style="font-family:times new roman, new york, times, serif;font-size:12pt"><DIV><BR></DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">Nicholas and Pavan, and D,</DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt"> </DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">A small bit of information from my work. When we run our parallelized application,</DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">we like to see the time spent in communication (in MPICH2, that is) at 30% or less.</DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">That includes setting up the data exchange and processes having to wait for other</DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">processes. Our application makes a large number of communication calls, from</DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">100 billion to a few trillion, over a period of hours to months. </DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt"> </DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">The data sent between processes ranges from as little as 8 bytes to a few MB. Most of</DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">the time the messages are small, &lt; 128 bytes.</DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt"> </DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">MPICH2 does a good job of buffering sends, but that did not really help us much,</DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">since there is one recv for each send.</DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt"> </DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">Our monitoring shows that most recvs take around 0.6 us on most of our boxes. We are</DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">talking about $12K boxes here, so that is not bad.</DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt"> </DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">Currently, we observe 5-10% of run time spent in recv, which we hope to eliminate by parallelizing</DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">the recv operations (while the main thread is doing meaningful work). That fits our </DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">application very well, as recv can be isolated easily.</DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt"> </DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">We have tried the following on tests that run for hours:</DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt"> - non-threaded, blocking recv : this is still the fastest solution</DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt"> - non-threaded, Irecv : bad</DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt"> - non-threaded, pre-launched Irecv : bad, but not too bad</DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt"> - thread multiple : very bad</DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt"> - thread multiple with Irecv : not as bad as very bad</DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt"> - thread funneled : super bad<BR></DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt"> </DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">We analyzed a few tests where the 'recv' thread could have run in parallel with the main thread</DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">nicely, and yet we observed negative gains. We don't have any theory as to why that</DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">happened.</DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt"> </DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">So we are particularly curious about what is happening with thread multiple. The one thing</DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">the threaded and the non-threaded Irecv tests have in common is MPI_Waitall; could this be the cause?</DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt"> </DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">thanks</DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">tan</DIV>
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt"> </DIV>
<DIV style="FONT-FAMILY: arial, helvetica, sans-serif; FONT-SIZE: 13px"><FONT size=2 face=Tahoma>
<HR SIZE=1>
<B><SPAN style="FONT-WEIGHT: bold">From:</SPAN></B> Pavan Balaji &lt;balaji@mcs.anl.gov&gt;<BR><B><SPAN style="FONT-WEIGHT: bold">To:</SPAN></B> mpich-discuss@mcs.anl.gov<BR><B><SPAN style="FONT-WEIGHT: bold">Sent:</SPAN></B> Saturday, July 25, 2009 2:40:45 PM<BR><B><SPAN style="FONT-WEIGHT: bold">Subject:</SPAN></B> Re: [mpich-discuss] thread MPI calls<BR></FONT><BR>Nicholas,<BR><BR>From what I understand about your application, there are two approaches you can use:<BR><BR>1. Use no threads -- in this case, each worker posts Irecv's from all processes it's expecting messages from (using MPI_ANY_SOURCE if needed) and runs a loop similar to:<BR><BR>till_work_is_done {<BR> compute();<BR> Waitany() or Testany()<BR>}<BR><BR>This is the approach most master-worker applications, such as mpiBLAST, tend to use.<BR><BR>2. Otherwise, you can use threads as you suggested below. It is true that there is some overhead, but unless
you are using very small messages (e.g., < 64 bytes), and performing a lot of communication (e.g., 90% of your application is communication time), you'll not notice any overhead.<BR><BR>In any case, we are working on some enhanced designs that will minimize threading overheads even in such rare cases (some of them are experimentally included in the mpich2-1.1.x series).<BR><BR>-- Pavan<BR><BR>On 07/25/2009 01:38 PM, Nicolas Rosner wrote:<BR>> Sorry to shamelessly invade Tan's, but since we're in the middle of a<BR>> thread about threads, I thought I'd rephrase an old question I once<BR>> tried to state here -- with a lot less understanding of the problem<BR>> back then. Context: My app is a rather typical<BR>> single-centralized-master, N-1-worker, pool-of-tasks setup, except<BR>> that workers don't just consume, but may also split tasks that seem<BR>> too hard, which ends up pushing large series of new child tasks
back<BR>> to where the parent had been obtained, and so on, recursively.<BR>> <BR>> Now, most task-related messages and structures don't handle the<BR>> real/full tasks (~1 MB to 10 MB of data each), but lightweight proxy<BR>> objects (~1 KB) that represent them -- which works fine in many<BR>> situations where some subset of metadata suffices. However, after a<BR>> worker gets a task assignment, at some point it does need the complete<BR>> file. Conversely, when one is divided, the splitting worker needs to<BR>> store the new files somewhere. The current approach is as follows:<BR>> <BR>> - all new task files produced by a worker W are stored to a local<BR>> HD on its host<BR>> <BR>> - master is immediately notified about each creation (children are<BR>> pushed back into queue,<BR>> but by reference only, including a "currently located at ..." field),<BR>>
<BR>> - when master assigns a task to some other worker Z, the message<BR>> includes said location, and<BR>> <BR>> - worker Z then sends a msg to worker W requesting the task, which<BR>> W sends as one fat MPI message.<BR>> <BR>> This achieves decent load balancing, but although a worker can't<BR>> really do anything productive while waiting for a totally new<BR>> datafile, it may certainly not block if it is to stay deadlock-free --<BR>> it [or *some* component, anyway] needs to be ready to serve any<BR>> potential incoming "full datafile requests" from other workers within<BR>> some constant amount of delay that may not depend on its own pending<BR>> request.<BR>> <BR>> So, I first tried the nonblocking approach; no good, worker + file<BR>> server chores combined yield a giant statechart, way too complex to<BR>> debug and maintain. Then, trying to stay away from hybrid
thread+MPI<BR>> code, I tried separating the worker and server as two different<BR>> processes, and ran twice as many processes as available processors.<BR>> Although Linux did a pretty good job scheduling them (not surprising<BR>> since they're almost purely cpu and i/o bound, respectively), there is<BR>> some performance penalty, plus it's messy to be keeping track of how<BR>> mpiexec deals roles to hosts, e.g. lots of rank -> host mappings that<BR>> were fine suddenly become unusable and must be avoided, etc.<BR>> <BR>> Eventually I gave up on my "no-hybrid" and "don't want to depend on<BR>> thread_multiple support" wishes, got myself a pthreads tutorial and<BR>> ended up with a worker version that uses a 2nd thread (+ possibly<BR>> several subthreads thereof) to keep serving files regardless of main<BR>> thread (actual worker code) status -- and with cleaner, better<BR>> separated code. (Note that there is
little or no interaction between<BR>> Wi and Si once launched -- Si just needs to keep serving files that Wi<BR>> produced earlier on but doesn't even remember, let alone care about,<BR>> anymore. Both need MPI messaging all the time, though.)<BR>> <BR>> Bottom line: as scary as it looked to me to go hybrid, and despite<BR>> several warnings from experts against it, in this particular case it<BR>> turned out to be simpler than many of the clumsy attempts at avoiding<BR>> it, and probably the "least ugly" approach I've found so far.<BR>> <BR>> Questions: Do you think this solution is OK? Any suggestions for<BR>> alternatives or improvements? Am I wrong in thinking that this kind<BR>> of need must be a common one in this field? How do people normally<BR>> address it? Distributed filesystems, perhaps? Or something like a<BR>> lightweight http server on each host? Shared central storage is
not an<BR>> option -- its scalability hits the ceiling way too quickly. Is the<BR>> performance penalty that Pavan just mentioned (for MPI_THREAD_MULTIPLE) of<BR>> considerable magnitude? Do you think it could be avoided, while still<BR>> keeping the worker code reasonably isolated and unaware of the serving<BR>> part?<BR>> <BR>> Thanks in advance for any comments, and in retrospect for all the<BR>> useful info here.<BR>> <BR>> Sincerely<BR>> N.<BR><BR>-- Pavan Balaji<BR><A href="http://www.mcs.anl.gov/~balaji" target=_blank>http://www.mcs.anl.gov/~balaji</A><BR></DIV></div><br>
</body></html>