[mpich-discuss] thread MPI calls

Pavan Balaji balaji at mcs.anl.gov
Tue Jul 28 15:25:45 CDT 2009


Two things:

1. In your application, only the main thread or the slave thread is 
making MPI calls at any given time; they don't simultaneously make MPI 
calls. So, you can initialize MPI as THREAD_SERIALIZED, instead of 
THREAD_MULTIPLE.
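
For example (a minimal sketch; the rest of your initialization stays the
same):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int provided;

        MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);
        if (provided < MPI_THREAD_SERIALIZED)  /* library can't give that level */
            MPI_Abort(MPI_COMM_WORLD, 1);

        /* ... your existing code, with the two threads never calling MPI
         *     at the same time ... */

        MPI_Finalize();
        return 0;
    }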

2. In case you still need to use THREAD_MULTIPLE, I have a theory here; 
let's see if it'll work. Are your Irecv() calls using specific source/tag 
values? If yes, can you change them all to ANY_SOURCE and ANY_TAG? The 
reason is, in the worst case when requests come in exactly the reverse 
order of posted requests, for each incoming message, there is an O(N) 
search -- that's a total of O(N^2) search time, with N being the number 
of workers.
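
If you do switch to wildcards, the receive loop would look something like
this (a sketch; NSLAVES and the one-double payload are just placeholders):

    #include <mpi.h>

    #define NSLAVES 4                    /* placeholder */

    void recv_from_all(double buf[NSLAVES])
    {
        MPI_Request reqs[NSLAVES];
        MPI_Status  stats[NSLAVES];

        for (int i = 0; i < NSLAVES; i++)
            MPI_Irecv(&buf[i], 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                      MPI_COMM_WORLD, &reqs[i]);
        MPI_Waitall(NSLAVES, reqs, stats);

        /* note: with wildcards, buf[i] is no longer tied to slave i; the
         * actual sender of each message is stats[i].MPI_SOURCE */
    }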

Let me know if either of these helps.

  -- Pavan

On 07/28/2009 02:30 PM, chong tan wrote:
> 
> we have done a lot more experiments on this subject, and I'd like to share
> the findings.   (my employer will not be happy, but I think it is OK)
>  
> Our abstracted original implementation, which is not threaded, is as below :
>  
>   master process :                        slave :
>  
>   repeat until done :                     repeat until done :
>      do_work()                               do_work()
>      for n slaves
>           MPI_Recv()                         MPI_Send( master )
>      minor_work()
>      for n slaves
>           MPI_Send()                         MPI_Recv( master )
>   end repeat                              end repeat
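>  
> In C, the master side is roughly the following (a simplified sketch: the
> one-double-per-slave payload, the iteration count, and do_work()/minor_work()
> are only illustrative stand-ins for our real code) :
>  
>     #include <mpi.h>
>  
>     #define NSLAVES 4                        /* illustrative */
>  
>     static void do_work(void)    { /* heavy, fluctuating work   */ }
>     static void minor_work(void) { /* ~3 * NSLAVES instructions */ }
>  
>     int main(int argc, char **argv)
>     {
>         static double in[NSLAVES], out[NSLAVES];
>  
>         MPI_Init(&argc, &argv);
>         for (int iter = 0; iter < 100; iter++) {   /* "repeat until done"     */
>             do_work();
>             for (int i = 0; i < NSLAVES; i++)      /* blocking recv per slave */
>                 MPI_Recv(&in[i], 1, MPI_DOUBLE, i + 1, 0, MPI_COMM_WORLD,
>                          MPI_STATUS_IGNORE);
>             minor_work();
>             for (int i = 0; i < NSLAVES; i++)      /* send results back       */
>                 MPI_Send(&out[i], 1, MPI_DOUBLE, i + 1, 0, MPI_COMM_WORLD);
>         }
>         MPI_Finalize();
>         return 0;
>     }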
>  
> minor_work() is really minor, on the order of 3*(#slaves) instructions.  The
> amount of work done by each do_work() call in each process fluctuates a lot.
> So far, this is the fastest implementation.  We observed that the master
> process causes a 5-10% performance loss in some tests where its do_work()
> regularly does more work than the slaves'.
>  
> Our solution is to thread this way :
>  
>   master process :
>  
>   main thread :                     recv thread :
>  
>    ...
>   repeat until done
>      enable recv_sema                 block on recv_semaphore
>      do_work()
>      block on data_sema               for n slaves
>                                           MPI_Irecv()
>      minor_work()                     MPI_Waitall()
>      for n slaves                     enable data_sema
>         MPI_Send()
>      block data_sema
>   end repeat
>  
>  
> the slave processes remain unchanged.
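>  
> With pthreads and POSIX semaphores, the master side looks roughly like this
> (again a simplified sketch with illustrative payloads; error checking is
> omitted and the shutdown of the recv thread is simplified) :
>  
>     #include <mpi.h>
>     #include <pthread.h>
>     #include <semaphore.h>
>  
>     #define NSLAVES 4                        /* illustrative */
>  
>     static sem_t recv_sema, data_sema;
>     static MPI_Request reqs[NSLAVES];
>     static double in[NSLAVES], out[NSLAVES];
>     static volatile int done = 0;
>  
>     static void do_work(void)    { /* heavy, fluctuating work   */ }
>     static void minor_work(void) { /* ~3 * NSLAVES instructions */ }
>  
>     static void *recv_thread(void *arg)
>     {
>         (void)arg;
>         while (1) {
>             sem_wait(&recv_sema);                /* block on recv_sema */
>             if (done)
>                 break;
>             for (int i = 0; i < NSLAVES; i++)    /* post all receives  */
>                 MPI_Irecv(&in[i], 1, MPI_DOUBLE, i + 1, 0,
>                           MPI_COMM_WORLD, &reqs[i]);
>             MPI_Waitall(NSLAVES, reqs, MPI_STATUSES_IGNORE);
>             sem_post(&data_sema);                /* enable data_sema   */
>         }
>         return NULL;
>     }
>  
>     int main(int argc, char **argv)
>     {
>         int provided;
>         pthread_t tid;
>  
>         MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
>         sem_init(&recv_sema, 0, 0);
>         sem_init(&data_sema, 0, 0);
>         pthread_create(&tid, NULL, recv_thread, NULL);
>  
>         for (int iter = 0; iter < 100; iter++) { /* "repeat until done" */
>             sem_post(&recv_sema);                /* enable recv_sema    */
>             do_work();
>             sem_wait(&data_sema);                /* block on data_sema  */
>             minor_work();
>             for (int i = 0; i < NSLAVES; i++)
>                 MPI_Send(&out[i], 1, MPI_DOUBLE, i + 1, 0, MPI_COMM_WORLD);
>         }
>  
>         done = 1;
>         sem_post(&recv_sema);                    /* let recv thread exit */
>         pthread_join(tid, NULL);
>         MPI_Finalize();
>         return 0;
>     }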
>  
> We have observed and determined the following :
>  
> 1-  threaded MPI calls cause performance degradation in heavily MPI-bound
>     applications.
> 2-  the MPI_Waitall() call causes MPI to do inter-process probing of some
>     sort; it can cause significant application performance loss.
> 3-  MPI_Irecv should be used with a grain of salt.  It will not deliver the
>     expected results: whatever gain you get, the subsequent MPI_Wait gives
>     back.
>  
> I will go a little further on #2.  In one of our tests, which is not the one
> showing the worst surprise, all 4 processes in our application originally
> spent between 30-40% of their time in MPI; by going multi-threaded using the
> algorithm above, we saw a performance degradation of > 25%.  Further
> monitoring and analysis showed that all slave processes spent between 20-45%
> of their time in sys activity, compared to < 1.5% in the original code, while
> the recv thread spent 80% of its time in sys activity.  This caused all
> slaves to be slowed to the point that the master process ended up spending
> 50% of its time idle (waiting for the data to arrive).
>  
> The lessons learned :
> -  MPI_Waitall() causes excessive inter-process probing; it is best to avoid
>    it.  If we absolutely must use it (due to Irecv or Isend), we want to call
>    it as late as possible to reduce the inter-process probing.  With the code
>    above, if we move the MPI_Waitall() call from the recv thread to the main
>    thread, right before minor_work(), our threaded application gets a
>    significant performance gain, making it quite close to the non-threaded
>    implementation (see the sketch right after this list).
> -  Although we are still exploring all possibilities, threaded MPI is likely
>    not worth the effort.
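>  
> Concretely, the change that recovered the performance is roughly this (same
> sketch as the threaded master above, only the two loop bodies shown; the
> requests are now completed by the main thread) :
>  
>     /* recv thread: only posts the Irecvs, then signals the main thread */
>     sem_wait(&recv_sema);
>     for (int i = 0; i < NSLAVES; i++)
>         MPI_Irecv(&in[i], 1, MPI_DOUBLE, i + 1, 0, MPI_COMM_WORLD, &reqs[i]);
>     sem_post(&data_sema);
>  
>     /* main thread: MPI_Waitall moved here, as late as possible */
>     sem_post(&recv_sema);
>     do_work();
>     sem_wait(&data_sema);
>     MPI_Waitall(NSLAVES, reqs, MPI_STATUSES_IGNORE);  /* right before minor_work() */
>     minor_work();
>     for (int i = 0; i < NSLAVES; i++)
>         MPI_Send(&out[i], 1, MPI_DOUBLE, i + 1, 0, MPI_COMM_WORLD);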
>  
> One potential performance request for the MPICH2 team :
> -  for the MPI_Wait* functions, avoid doing inter-process probing if the
>    request is a recv.
>  
>  
> thanks
>  
> tan
> 
>  
> ------------------------------------------------------------------------
> *From:* chong tan <chong_guan_tan at yahoo.com>
> *To:* mpich-discuss at mcs.anl.gov
> *Sent:* Monday, July 27, 2009 11:25:35 AM
> *Subject:* Re: [mpich-discuss] thread MPI calls
> 
> 
> Nicholas and Pavan, and D,
>  
> A tiny bit of info from my work.  When we run our parallelized application,
> we like to see the time spent in communication, MPICH2 that is, being 30% or
> less.  That includes setting up the data exchange, and processes having to
> wait for other processes.  Our application will make a good number of
> communications, in the 100 billion to few trillion range, over a period of
> hours to months.
>  
> The data sent between the processes is as little as 8 bytes, up to a few
> MBytes.  Most of the time, the data are small, < 128 bytes.
>  
> MPICH2 does a good job of buffering sends, but that did not really help us
> much, as there is one recv for each send.
>  
> Our monitoring shows most recvs take around 0.6us on most of our boxes.  We
> are talking about $12K boxes here, so that is not bad.
>  
> Currently, we observe 5-10% of the time in recv that we hope to eliminate by
> parallelizing the recv operations (while the main thread is doing meaningful
> work).  That fits our application very well, as the recv can be isolated
> easily.
>  
> We have tried these on tests that run for hours :
>  -  non-threaded, blocking recv      : this is still the fastest solution
>  -  non-threaded, Irecv              : bad
>  -  non-threaded, pre-launched Irecv : bad, but not too bad
>  -  thread multiple                  : very bad
>  -  thread multiple with Irecv       : not as bad as very bad
>  -  thread funneled                  : super bad
>  
> We analyzed a few tests where the 'recv' thread could have run in parallel
> with the main thread nicely, and yet negative gains were observed.  We don't
> have any theory why that could have happened.
>  
> So, we are particularly curious about what is happening with thread
> multiple.  We have one thing in common between the threaded and the
> non-threaded Irecv tests : Waitall.  Could this be the cause ?
>  
> thanks
> tan
>  
> ------------------------------------------------------------------------
> *From:* Pavan Balaji <balaji at mcs.anl.gov>
> *To:* mpich-discuss at mcs.anl.gov
> *Sent:* Saturday, July 25, 2009 2:40:45 PM
> *Subject:* Re: [mpich-discuss] thread MPI calls
> 
> Nicholas,
> 
>  From what I understand about your application, there are two approaches 
> you can use:
> 
> 1. Use no threads -- in this case, each worker posts Irecv's from all
> processes it's expecting messages from (using MPI_ANY_SOURCE if needed)
> and does a loop similar to:
> 
> till_work_is_done {
>     compute();
>     Waitany() or Testany()
> }
> 
> This is the approach most master-worker applications, such as mpiBLAST, 
> tend to use.
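>  
> A minimal sketch of that loop (buffer sizes, tags, and the compute/handle
> routines are placeholders) :
>  
>     #include <mpi.h>
>  
>     #define NPEERS 4                          /* placeholder */
>  
>     static int  work_is_done(void) { return 0; /* real termination test here */ }
>     static void compute(void)      { /* application work */ }
>     static void handle(int src, char *msg) { (void)src; (void)msg; }
>  
>     void worker_loop(void)
>     {
>         static char buf[NPEERS][256];
>         MPI_Request reqs[NPEERS];
>  
>         for (int i = 0; i < NPEERS; i++)      /* one outstanding recv per peer */
>             MPI_Irecv(buf[i], 256, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG,
>                       MPI_COMM_WORLD, &reqs[i]);
>  
>         while (!work_is_done()) {
>             int idx;
>             MPI_Status st;
>  
>             compute();
>             MPI_Waitany(NPEERS, reqs, &idx, &st);    /* or MPI_Testany() */
>             if (idx != MPI_UNDEFINED) {
>                 handle(st.MPI_SOURCE, buf[idx]);     /* process message  */
>                 MPI_Irecv(buf[idx], 256, MPI_CHAR, MPI_ANY_SOURCE,
>                           MPI_ANY_TAG, MPI_COMM_WORLD, &reqs[idx]);
>             }
>         }
>     }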
> 
> 2. Otherwise, you can use threads as you suggested below. It is true 
> that there is some overhead, but unless you are using very small 
> messages (e.g., < 64 bytes), and performing a lot of communication 
> (e.g., 90% of your application is communication time), you'll not notice 
> any overhead.
> 
> In any case, we are working on some enhanced designs that will minimize 
> threading overheads even in such rare cases (some of them are 
> experimentally included in the mpich2-1.1.x series).
> 
> -- Pavan
> 
> On 07/25/2009 01:38 PM, Nicolas Rosner wrote:
>  > Sorry to shamelessly invade Tan's, but since we're in the middle of a
>  > thread about threads, I thought I'd rephrase an old question I once
>  > tried to state here -- with a lot less understanding of the problem
>  > back then. Context:  My app is a rather typical
>  > single-centralized-master, N-1-worker, pool-of-tasks setup, except
>  > that workers don't just consume, but may also split tasks that seem
>  > too hard, which ends up pushing large series of new child tasks back
>  > to where the parent had been obtained, and so on, recursively.
>  >
>  > Now, most task-related messages and structures don't handle the
>  > real/full tasks (~1 MB to 10 MB of data each), but lightweight proxy
>  > objects (~1 KB) that represent them -- which works fine in many
>  > situations where some subset of metadata suffices. However, after a
>  > worker gets a task assignment, at some point it does need the complete
>  > file. Conversely, when one is divided, the splitting worker needs to
>  > store the new files somewhere. The current approach is as follows:
>  >
>  >    - all new task files produced by a worker W are stored to a local
>  > HD on its host
>  >
>  >    - master is immediately notified about each creation (children are
>  > pushed back into queue,
>  >      but by reference only, including a "currently located at ..." 
> field),
>  >
>  >    - when master assigns a task to some other worker Z, the message
>  > includes said location, and
>  >
>  >    - worker Z then sends a msg to worker W requesting the task, which
>  > W sends as one fat MPI message.
>  >
>  > This achieves decent load balancing, but although a worker can't
>  > really do anything productive while waiting for a totally new
>  > datafile, it may certainly not block if it is to stay deadlock-free --
>  > it [or *some* component, anyway] needs to be ready to serve any
>  > potential incoming "full datafile requests" from other workers within
>  > some constant amount of delay that may not depend on its own pending
>  > request.
>  >
>  > So, I first tried the nonblocking approach; no good, worker + file
>  > server chores combined yield a giant statechart, way too complex to
>  > debug and maintain. Then, trying to stay away from hybrid thread+MPI
>  > code, I tried separating the worker and server as two different
>  > processes, and ran twice as many processes as available processors.
>  > Although Linux did a pretty good job scheduling them (not surprising
>  > since they're almost purely cpu and i/o bound, respectively), there is
>  > some performance penalty, plus it's messy to be keeping track of how
>  > mpiexec deals roles to hosts, e.g. lots of rank -> host mappings that
>  > were fine suddenly become unusable and must be avoided, etc.
>  >
>  > Eventually I gave up on my "no-hybrid" and "don't want to depend on
>  > thread_multiple support" wishes, got myself a pthreads tutorial and
>  > ended up with a worker version that uses a 2nd thread (+ possibly
>  > several subthreads thereof) to keep serving files regardless of main
>  > thread (actual worker code) status -- and with cleaner, better
>  > separated code. (Note that there is little or no interaction between
>  > Wi and Si once launched -- Si just needs to keep serving files that Wi
>  > produced earlier on but doesn't even remember, let alone care about,
>  > anymore. Both need MPI messaging all the time, though.)
>  >
>  > Bottom line: as scary as it looked to me to go hybrid, and despite
>  > several warnings from experts against it, in this particular case it
>  > turned out to be simpler than many of the clumsy attempts at avoiding
>  > it, and probably the "least ugly" approach I've found so far.
>  >
>  > Questions: Do you think this solution is OK? Any suggestions for
>  > alternatives or improvements?  Am I wrong in thinking that this kind
>  > of need must be a common one in this field?  How do people normally
>  > address it?  Distributed filesystems, perhaps?  Or something like a
>  > lightweight http server on each host? Shared central storage is not an
>  > option -- its scalability hits the ceiling way too quickly. Is the
>  > performance penalty that Pavan just mentioned (for MPI_THREAD_MULTIPLE) of
>  > considerable magnitude?  Do you think it could be avoided, while still
>  > keeping the worker code reasonably isolated and unaware of the serving
>  > part?
>  >
>  > Thanks in advance for any comments, and in retrospect for all the
>  > useful info here.
>  >
>  > Sincerely
>  > N.
> 
> -- Pavan Balaji
> http://www.mcs.anl.gov/~balaji
> 
> 

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji

