<html><head><style type="text/css"><!-- DIV {margin:0px;} --></style></head><body><div style="font-family:times new roman, new york, times, serif;font-size:12pt"><DIV><BR></DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">We have done many more experiments on this subject, and I would like to share the</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">findings.&nbsp;&nbsp; (My employer will not be happy, but I think it is OK.)</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">Our original implementation, in abstracted form and not threaded, is as follows :</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;master process :&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; slave :</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp; repeat until done :&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; repeat until done</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;&nbsp;&nbsp;&nbsp; do_work()&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; do_work()</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;&nbsp;&nbsp;&nbsp; for n slaves</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; MPI_Recv()&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; MPI_Send( master )</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;&nbsp;&nbsp;&nbsp; minor_work()&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;&nbsp;&nbsp;&nbsp; for n slaves</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; MPI_Send()&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; MPI_Recv( master )</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">end repeat&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; end repeat</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">minor_work() is really minor, on the order of 3*(#slaves) instructions.&nbsp; The amount of work</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">done by each do_work() call in each process fluctuates a lot.</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">So far, this is the fastest implementation.&nbsp; We observed that the master process causes a</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">5-10% performance loss in some tests where its do_work() regularly does more work </DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">than the slaves' do_work().</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">Our solution is to thread this way :</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp; master process :</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp; main thread&nbsp;:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; recv thread :&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp; </DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;&nbsp; ...</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">repeat until done&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;&nbsp; enable recv_sema&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; block on recv_sema</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;&nbsp; do_work()</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;&nbsp; block on data_sema&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;for n slaves</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; MPI_Irecv()</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;&nbsp; minor_work()&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; MPI_Waitall()</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;&nbsp; for n slaves&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; enable data_sema</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; MPI_Send()&nbsp;&nbsp;&nbsp;</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;&nbsp; block on data_sema</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">end repeat&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;&nbsp; </DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">The slave processes remain unchanged.</DIV>
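
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">The semaphore handshake between the main thread and the recv thread can be sketched with Python's threading module. This is only a minimal simulation under stated assumptions, not the actual application: a queue.Queue stands in for the MPI messages from the slaves, inbox.get() stands in for MPI_Irecv() plus the wait, and run_master, inbox, received and log are hypothetical names. The second wait on data_sema at the end of the loop is left out, since the listing does not show what releases it.</DIV>

<PRE>
```python
import queue
import threading


def run_master(rounds, n_slaves):
    """Simulate the main-thread / recv-thread handshake for a fixed number of rounds."""
    recv_sema = threading.Semaphore(0)  # main thread tells recv thread to start receiving
    data_sema = threading.Semaphore(0)  # recv thread tells main thread the data is ready
    inbox = queue.Queue()               # assumption: stand-in for messages from the slaves
    received = []                       # one batch of messages per round
    log = []                            # what the main thread did, in order

    def recv_thread():
        for _ in range(rounds):
            recv_sema.acquire()         # block on recv_sema
            # Stand-in for the MPI_Irecv() loop followed by the wait.
            batch = [inbox.get() for _ in range(n_slaves)]
            received.append(batch)
            data_sema.release()         # enable data_sema

    worker = threading.Thread(target=recv_thread)
    worker.start()
    for r in range(rounds):
        recv_sema.release()             # enable recv_sema
        # Stand-in for the slaves: each one finishes its work and sends a result.
        for s in range(n_slaves):
            inbox.put((r, s))
        log.append(("do_work", r))      # do_work() overlaps with the receives
        data_sema.acquire()             # block on data_sema
        log.append(("minor_work", r, sorted(received[r])))
        # The MPI_Send() loop back to the slaves would go here.
    worker.join()
    return log
```
</PRE>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">Calling run_master(3, 2) returns six log entries, one do_work and one minor_work entry per round.</DIV>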

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">We have observed and determined the following :</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">1.&nbsp; Threaded MPI calls cause performance degradation in heavily MPI-bound applications.</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">2.&nbsp; The MPI_Waitall() call causes MPI to do inter-process probing of some sort, which can</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;&nbsp;&nbsp;&nbsp; cause significant application performance loss.</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">3.&nbsp; MPI_Irecv should be taken with a grain of salt.&nbsp; It will not deliver the expected </DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;&nbsp;&nbsp;&nbsp; results.&nbsp; Whatever gain you get, the subsequent MPI_Wait gives it all back.</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">I will go a little further on #2.&nbsp; In one of our tests, which is not the one showing the worst </DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">surprise, all 4 processes in our application originally spent between 30% and 40% of their</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">time in MPI.&nbsp; By going multi-threaded using the algorithm above, we saw a performance</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">degradation of &gt; 25%.&nbsp; Further monitoring and analysis showed that all slave processes spent</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">between 20% and 45% of their time in sys activity, compared to &lt;1.5% in the original code, while</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">the recv thread spent 80% of its time in sys activity.&nbsp; This caused all slaves to be slowed to</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">the point that the master process ended up spending 50% of its time idle (waiting for the data </DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">to arrive).</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">The lessons learned :</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">-&nbsp; MPI_Waitall() causes excessive inter-process probing; it is best to avoid it.&nbsp; If we </DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;&nbsp; absolutely must use it (due to Irecv or Isend), we want to call it as late as</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;&nbsp; possible to reduce the inter-process probing.&nbsp; With the code above, if we move the MPI_Waitall()</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;&nbsp; call from the recv thread to the main thread, right before minor_work(), our threaded application</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;&nbsp; gets a significant performance gain, making it quite close to the non-threaded implementation.</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">- Although we are still exploring all possibilities, threaded MPI is likely not worth the effort.</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">One potential performance request for the MPICH2 team :</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">-&nbsp; For the MPI_Wait* functions, avoid doing inter-process probing if the request is a recv.</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">thanks</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">tan</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt"><BR>&nbsp;</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt"><FONT size=2 face=Tahoma>

<HR SIZE=1>

<B><SPAN style="FONT-WEIGHT: bold">From:</SPAN></B> chong tan &lt;chong_guan_tan@yahoo.com&gt;<BR><B><SPAN style="FONT-WEIGHT: bold">To:</SPAN></B> mpich-discuss@mcs.anl.gov<BR><B><SPAN style="FONT-WEIGHT: bold">Sent:</SPAN></B> Monday, July 27, 2009 11:25:35 AM<BR><B><SPAN style="FONT-WEIGHT: bold">Subject:</SPAN></B> Re: [mpich-discuss] thread MPI calls<BR></FONT><BR>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">

<DIV><BR></DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">Nicholas and Pavan, and D,</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">A tiny bit of info from my work.&nbsp; When we run our parallelized application,</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">we like to see the time spent in communication, MPICH2 that is, at 30% or less.</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">That includes setting up the data exchange and processes having to wait for other</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">processes.&nbsp; Our application makes a good number of communications, in the</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">100 billion to a few trillion, over a period of hours to months.&nbsp; </DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">The data sent between the processes ranges from as little as 8 bytes to a few MBytes.&nbsp; Most of</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">the time, the data are small, &lt; 128 bytes.</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">MPICH2 does a good job of buffering sends, but that did not really help us much,</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">as there is one recv for each send.</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">Our monitoring shows that most recvs take around 0.6us on most of our boxes.&nbsp; We are</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">talking about $12K boxes here, so that is not bad.</DIV>
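
<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">To put the 0.6us figure in perspective against the message counts mentioned earlier, here is a back-of-the-envelope calculation. It assumes every recv costs a flat 0.6us, which is a simplification, and total_recv_hours is a hypothetical helper name.</DIV>

<PRE>
```python
RECV_SECONDS = 0.6e-6  # 0.6 us per recv, the figure reported above


def total_recv_hours(n_recvs, seconds_per_recv=RECV_SECONDS):
    """Total time spent in recv calls, in hours, assuming a flat per-call cost."""
    return n_recvs * seconds_per_recv / 3600.0


# 100 billion and one trillion recvs, the low end and a point inside the stated range.
for n in (100e9, 1e12):
    print(f"{n:.0e} recvs: about {total_recv_hours(n):.1f} hours in recv")
```
</PRE>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">Under that assumption, 100 billion recvs add up to roughly 17 hours, and one trillion to roughly 167 hours.</DIV>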

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">Currently, we observe 5-10% of time in recv that we hope to eliminate by parallelizing</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">the recv operations (while the main thread is doing meaningful work).&nbsp;&nbsp; That fits our </DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">application very well, as recv can be isolated easily.</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">We have tried these on tests that run for hours&nbsp;:</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;-&nbsp; non threaded, blocked recv&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; :&nbsp; this is still the fastest solution</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;-&nbsp; non thread, Irecv&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; :&nbsp; bad</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;-&nbsp; non-thread, pre-launched Irecv: bad, but not too bad</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;-&nbsp; thread multiple&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; : very bad</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;-&nbsp; thread multiple with irecv&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; : not as bad as very bad</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;-&nbsp; thread funnel&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; : super bad<BR></DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">We analyzed a few tests where the 'recv' thread could have run in parallel with the main thread</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">nicely, and yet negative gains were observed.&nbsp; We don't have any theory as to why that could have</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">happened.</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">So, we are particularly curious about what is happening with thread multiple.&nbsp;&nbsp; We have one</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">thing in common between the threaded and non-threaded Irecv tests : MPI_Waitall.&nbsp; Could this be the cause ?</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">thanks</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">tan</DIV>

<DIV style="FONT-FAMILY: times new roman, new york, times, serif; FONT-SIZE: 12pt">&nbsp;</DIV>

<DIV style="FONT-FAMILY: arial, helvetica, sans-serif; FONT-SIZE: 13px"><FONT size=2 face=Tahoma>

<HR SIZE=1>

<B><SPAN style="FONT-WEIGHT: bold">From:</SPAN></B> Pavan Balaji &lt;balaji@mcs.anl.gov&gt;<BR><B><SPAN style="FONT-WEIGHT: bold">To:</SPAN></B> mpich-discuss@mcs.anl.gov<BR><B><SPAN style="FONT-WEIGHT: bold">Sent:</SPAN></B> Saturday, July 25, 2009 2:40:45 PM<BR><B><SPAN style="FONT-WEIGHT: bold">Subject:</SPAN></B> Re: [mpich-discuss] thread MPI calls<BR></FONT><BR>Nicholas,<BR><BR>From what I understand about your application, there are two approaches you can use:<BR><BR>1. Use no threads -- in this case, each worker posts Irecv's from all processes its expecting messages from (using MPI_ANY_SOURCE if needed) and do some loop similar to:<BR><BR>till_work_is_done {<BR>&nbsp;&nbsp;&nbsp; compute();<BR>&nbsp;&nbsp;&nbsp; Waitany() or Testany()<BR>}<BR><BR>This is the approach most master-worker applications, such as mpiBLAST, tend to use.<BR><BR>2. Otherwise, you can use threads as you suggested below. It is true that there is some overhead, but unless

 you are using very small messages (e.g., &lt; 64 bytes), and performing a lot of communication (e.g., 90% of your application is communication time), you'll not notice any overhead.<BR><BR>In any case, we are working on some enhanced designs that will minimize threading overheads even in such rare cases (some of them are experimentally included in the mpich2-1.1.x series).<BR><BR>-- Pavan<BR><BR>On 07/25/2009 01:38 PM, Nicolas Rosner wrote:<BR>&gt; Sorry to shamelessly invade Tan's, but since we're in the middle of a<BR>&gt; thread about threads, I thought I'd rephrase an old question I once<BR>&gt; tried to state here -- with a lot less understanding of the problem<BR>&gt; back then. Context:&nbsp; My app is a rather typical<BR>&gt; single-centralized-master, N-1-worker, pool-of-tasks setup, except<BR>&gt; that workers don't just consume, but may also split tasks that seem<BR>&gt; too hard, which ends up pushing large series of new child tasks

 back<BR>&gt; to where the parent had been obtained, and so on, recursively.<BR>&gt; <BR>&gt; Now, most task-related messages and structures don't handle the<BR>&gt; real/full tasks (~1 MB to 10 MB of data each), but lightweight proxy<BR>&gt; objects (~1 KB) that represent them -- which works fine in many<BR>&gt; situations where some subset of metadata suffices. However, after a<BR>&gt; worker gets a task assignment, at some point it does need the complete<BR>&gt; file. Conversely, when one is divided, the splitting worker needs to<BR>&gt; store the new files somewhere. The current approach is as follows:<BR>&gt; <BR>&gt;&nbsp; &nbsp; - all new task files produced by a worker W are stored to a local<BR>&gt; HD on its host<BR>&gt; <BR>&gt;&nbsp; &nbsp; - master is immediately notified about each creation (children are<BR>&gt; pushed back into queue,<BR>&gt;&nbsp; &nbsp; &nbsp; but by reference only, including a "currently located at ..." field),<BR>&gt;

 <BR>&gt;&nbsp; &nbsp; - when master assigns a task to some other worker Z, the message<BR>&gt; includes said location, and<BR>&gt; <BR>&gt;&nbsp; &nbsp; - worker Z then sends a msg to worker W requesting the task, which<BR>&gt; W sends as one fat MPI message.<BR>&gt; <BR>&gt; This achieves decent load balancing, but although a worker can't<BR>&gt; really do anything productive while waiting for a totally new<BR>&gt; datafile, it may certainly not block if it is to stay deadlock-free --<BR>&gt; it [or *some* component, anyway] needs to be ready to serve any<BR>&gt; potential incoming "full datafile requests" from other workers within<BR>&gt; some constant amount of delay that may not depend on its own pending<BR>&gt; request.<BR>&gt; <BR>&gt; So, I first tried the nonblocking approach; no good, worker + file<BR>&gt; server chores combined yield a giant statechart, way too complex to<BR>&gt; debug and maintain. Then, trying to stay away from hybrid

 thread+MPI<BR>&gt; code, I tried separating the worker and server as two different<BR>&gt; processes, and ran twice as many processes as available processors.<BR>&gt; Although Linux did a pretty good job scheduling them (not surprising<BR>&gt; since they're almost purely cpu and i/o bound, respectively), there is<BR>&gt; some performance penalty, plus it's messy to be keeping track of how<BR>&gt; mpixec deals roles to hosts, e.g. lots of rank -&gt; host mappings that<BR>&gt; were fine suddenly become unusable and must be avoided, etc.<BR>&gt; <BR>&gt; Eventually I gave up on my "no-hybrid" and "don't want to depend on<BR>&gt; thread_multiple support" wishes, got myself a pthreads tutorial and<BR>&gt; ended up with a worker version that uses a 2nd thread (+ possibly<BR>&gt; several subthreads thereof) to keep serving files regardless of main<BR>&gt; thread (actual worker code) status -- and with cleaner, better<BR>&gt; separated code. (Note that there is

 little or no interaction between<BR>&gt; Wi and Si once launched -- Si just needs to keep serving files that Wi<BR>&gt; produced earlier on but doesn't even remember, let alone care about,<BR>&gt; anymore. Both need MPI messaging all the time, though.)<BR>&gt; <BR>&gt; Bottom line: as scary as it looked to me to go hybrid, and despite<BR>&gt; several warnings from experts against it, in this particular case it<BR>&gt; turned out to be simpler than many of the clumsy attempts at avoiding<BR>&gt; it, and probably the "least ugly" approach I've found so far.<BR>&gt; <BR>&gt; Questions: Do you think this solution is OK? Any suggestions for<BR>&gt; alternatives or improvements?&nbsp; Am I wrong in thinking that this kind<BR>&gt; of need must be a common one in this field?&nbsp; How do people normally<BR>&gt; address it?&nbsp; Distributed filesystems, perhaps?&nbsp; Or something like a<BR>&gt; lightweight http server on each host? Shared central storage is

 not an<BR>&gt; option -- its scalability hits the ceiling way too quickly. Is the<BR>&gt; performance penalty that Pavan just mentioned (for M_TH_MULTIPLE) of<BR>&gt; considerable magnitude?&nbsp; Do you think it could be avoided, while still<BR>&gt; keeping the worker code reasonably isolated and unaware of the serving<BR>&gt; part?<BR>&gt; <BR>&gt; Thanks in advance for any comments, and in retrospective for all the<BR>&gt; useful info here.<BR>&gt; <BR>&gt; Sincerely<BR>&gt; N.<BR><BR>-- Pavan Balaji<BR>http://www.mcs.anl.gov/~balaji<BR></DIV></DIV><BR></DIV></div><br>


      </body></html>