[mpich-discuss] thread MPI calls

chong tan chong_guan_tan at yahoo.com
Tue Jul 28 14:30:35 CDT 2009



We have done a lot more experiments on this subject, and I would like to share the
findings.   (My employer will not be happy, but I think it is OK.)

Our abstracted original implementation, which is not threaded, is as below:

 master process :                             slave :

  repeat until done                            repeat until done
     do_work()                                    do_work()
     for n slaves
          MPI_Recv()                              MPI_Send( master )
     minor_work()
     for n slaves
          MPI_Send()                              MPI_Recv( master )
  end repeat                                   end repeat

minor_work() is really minor, on the order of 3*(#slaves) instructions.  The amount of work
done by each do_work() call in each process fluctuates a lot.
So far, this is the fastest implementation.  We observed that the master process causes a
5-10% performance loss in some tests when its do_work() regularly does more work
than the slaves'.
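
Roughly, the master side of this loop looks like the C sketch below (do_work(),
minor_work() and check_done() are placeholders standing in for our real code, not
actual function names, and the buffer handling is simplified):

#include <mpi.h>

extern void do_work(void);
extern void minor_work(void);
extern int  check_done(void);

void master_loop(int nslaves)
{
    char buf[128];                 /* most of our payloads are small, < 128 bytes */

    while (!check_done()) {
        do_work();                 /* master's own share of the work */

        /* collect one message from every slave (buffer reuse is a simplification) */
        for (int s = 1; s <= nslaves; s++)
            MPI_Recv(buf, sizeof buf, MPI_BYTE, s, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        minor_work();              /* ~3 * nslaves instructions */

        /* send the results back to every slave */
        for (int s = 1; s <= nslaves; s++)
            MPI_Send(buf, sizeof buf, MPI_BYTE, s, 0, MPI_COMM_WORLD);
    }
}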

Our solution is to thread it this way:

  master process :

  main thread :                        recv thread :

  ...
  repeat until done
     enable recv_sema                  block on recv_sema
     do_work()
     block on data_sema                for n slaves
                                           MPI_Irecv()
     minor_work()                      MPI_Waitall()
     for n slaves                      enable data_sema
        MPI_Send()
     block data_sema
  end repeat
   

The slave processes remain unchanged.
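
A minimal sketch of the threaded master, assuming MPI was initialized with
MPI_THREAD_MULTIPLE (both threads make MPI calls) and using POSIX semaphores for the
handshake; the names are again placeholders, not our actual code:

#include <mpi.h>
#include <pthread.h>
#include <semaphore.h>

#define NSLAVES 4                          /* assumption for the sketch */

extern void do_work(void);
extern void minor_work(void);
extern int  check_done(void);

static sem_t recv_sema, data_sema;
static char  bufs[NSLAVES][128];

static void *recv_thread(void *arg)
{
    MPI_Request req[NSLAVES];
    (void)arg;
    for (;;) {
        sem_wait(&recv_sema);              /* block on recv_sema */
        for (int s = 0; s < NSLAVES; s++)
            MPI_Irecv(bufs[s], sizeof bufs[s], MPI_BYTE, s + 1, 0,
                      MPI_COMM_WORLD, &req[s]);
        MPI_Waitall(NSLAVES, req, MPI_STATUSES_IGNORE);
        sem_post(&data_sema);              /* enable data_sema: all data has arrived */
    }
    return NULL;
}

void threaded_master_loop(void)
{
    pthread_t tid;
    sem_init(&recv_sema, 0, 0);
    sem_init(&data_sema, 0, 0);
    pthread_create(&tid, NULL, recv_thread, NULL);

    while (!check_done()) {
        sem_post(&recv_sema);              /* enable recv_sema */
        do_work();
        sem_wait(&data_sema);              /* block until the recv thread is done */
        minor_work();
        for (int s = 0; s < NSLAVES; s++)
            MPI_Send(bufs[s], sizeof bufs[s], MPI_BYTE, s + 1, 0, MPI_COMM_WORLD);
    }
}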

We have observed and determined the following :

1-  Threaded MPI calls cause performance degradation in heavily MPI-bound applications.
2-  The MPI_Waitall() call causes MPI to do inter-process probing of some sort; it can
     cause significant application performance loss.
3-  MPI_Irecv should be used with a grain of salt.  It will not deliver the expected
     results: any gain you get, the subsequent MPI_Wait gives right back.

I will go a little further on #2.  In one of our tests, which is not the one showing the worst
surprise, all 4 processes in our application originally spent between 30-40% of their
time in MPI; by going multi-threaded using the algorithm above, we saw a performance
degradation of > 25%.  Further monitoring and analysis showed that all slave processes spent
between 20-45% of their time in sys activity, compared to < 1.5% in the original code, while
the recv thread spent 80% of its time in sys activity.  This caused all slaves to be slowed to
the point that the master process ended up spending 50% of its time idle (waiting for the data
to arrive).

The lessons learned:
-  MPI_Waitall() causes excessive inter-process probing; it is best to avoid it.  If we
   absolutely must use it (due to Irecv or Isend), we want to call it as late as
   possible to reduce the inter-process probing.  With the code above, if we move the MPI_Waitall()
   call from the recv thread to the main thread, right before minor_work(), our threaded application
   gets a significant performance gain, making it quite close to the non-threaded implementation
   (see the sketch after this list).
- Although we are still exploring all possibilities, threaded MPI is likely not worth the effort.
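
A sketch of that better-performing variant, as a modification of the previous sketch
(same placeholder declarations and setup): the recv thread only posts the MPI_Irecv()s,
and the main thread calls MPI_Waitall() itself, as late as possible, right before
minor_work():

static MPI_Request req[NSLAVES];

static void *recv_thread(void *arg)
{
    (void)arg;
    for (;;) {
        sem_wait(&recv_sema);
        for (int s = 0; s < NSLAVES; s++)
            MPI_Irecv(bufs[s], sizeof bufs[s], MPI_BYTE, s + 1, 0,
                      MPI_COMM_WORLD, &req[s]);
        sem_post(&data_sema);              /* only signals that the requests are posted */
    }
    return NULL;
}

void threaded_master_loop(void)
{
    pthread_t tid;
    sem_init(&recv_sema, 0, 0);
    sem_init(&data_sema, 0, 0);
    pthread_create(&tid, NULL, recv_thread, NULL);

    while (!check_done()) {
        sem_post(&recv_sema);
        do_work();
        sem_wait(&data_sema);                             /* requests are posted by now */
        MPI_Waitall(NSLAVES, req, MPI_STATUSES_IGNORE);   /* called as late as possible */
        minor_work();
        for (int s = 0; s < NSLAVES; s++)
            MPI_Send(bufs[s], sizeof bufs[s], MPI_BYTE, s + 1, 0, MPI_COMM_WORLD);
    }
}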

One potential performance request for the MPICH2 team:
-  for the MPI_Wait* functions, avoid doing inter-process probing if the request is a recv.


thanks

tan

 

________________________________
From: chong tan <chong_guan_tan at yahoo.com>
To: mpich-discuss at mcs.anl.gov
Sent: Monday, July 27, 2009 11:25:35 AM
Subject: Re: [mpich-discuss] thread MPI calls




Nicolas, Pavan, and D,

Some tiny bit of info from my work.  When we run our parallelized application,
we like to see the time spent in communication, MPICH2 that is, being 30% or less.
That includes setting up the data exchange, and processes having to wait for other
processes.  Our application makes a good number of communications, in the
100 billion to a few trillion range, over a period of hours to months.

The data sent between the processes range from as little as 8 bytes to a few MBytes.  Most of
the time, the data are small, < 128 bytes.

MPICH2 does a good job of buffering sends, but that did not really help us much, as
there is one recv for each send.

Our monitoring shows most recvs take around 0.6us on most of our boxes.  We are
talking about $12K boxes here, so that is not bad.

Currently, we observe 5-10% of time spent in recv that we hope to eliminate by parallelizing
the recv operations (while the main thread is doing meaningful work).   That fits our
application very well, as the recvs can be isolated easily.

We have tried these on tests that run for hours:
 -  non-threaded, blocking recv       : this is still the fastest solution
 -  non-threaded, Irecv               : bad
 -  non-threaded, pre-launched Irecv  : bad, but not too bad
 -  thread multiple                   : very bad
 -  thread multiple with Irecv        : not as bad as very bad
 -  thread funneled                   : super bad


We analyzed a few tests where the 'recv' thread could have run in parallel with the main thread
nicely, and yet negative gains were observed.  We don't have any theory as to why that could have
happened.

So, we are particularly curious about what is happening with thread multiple.   We have one
thing in common between the threaded and non-threaded Irecv tests: Waitall.  Could this be the cause?

thanks
tan


________________________________
From: Pavan Balaji <balaji at mcs.anl.gov>
To: mpich-discuss at mcs.anl.gov
Sent: Saturday, July 25, 2009 2:40:45 PM
Subject: Re: [mpich-discuss] thread MPI calls

Nicolas,

From what I understand about your application, there are two approaches you can use:

1. Use no threads -- in this case, each worker posts Irecv's from all processes it's expecting messages from (using MPI_ANY_SOURCE if needed) and does some loop similar to:

till_work_is_done {
    compute();
    Waitany() or Testany()
}

This is the approach most master-worker applications, such as mpiBLAST, tend to use.
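
In C, such a loop might look roughly like the sketch below; compute(), work_is_done()
and handle_message() are placeholders, each worker keeps one outstanding Irecv per
expected message, and Testany() is used for the polling (Waitany() works the same way
if you prefer to block):

#include <mpi.h>

#define NPEERS  8
#define BUFSZ   1024

extern int  work_is_done(void);
extern void compute(void);
extern void handle_message(void *buf, MPI_Status *status);

void worker_loop(void)
{
    static char bufs[NPEERS][BUFSZ];
    MPI_Request req[NPEERS];

    /* post all receives up front */
    for (int i = 0; i < NPEERS; i++)
        MPI_Irecv(bufs[i], BUFSZ, MPI_BYTE, MPI_ANY_SOURCE, 0,
                  MPI_COMM_WORLD, &req[i]);

    while (!work_is_done()) {
        compute();

        int idx, flag;
        MPI_Status status;
        MPI_Testany(NPEERS, req, &idx, &flag, &status);
        if (flag && idx != MPI_UNDEFINED) {
            handle_message(bufs[idx], &status);           /* consume the message */
            MPI_Irecv(bufs[idx], BUFSZ, MPI_BYTE, MPI_ANY_SOURCE, 0,
                      MPI_COMM_WORLD, &req[idx]);          /* re-post the receive */
        }
    }
}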

2. Otherwise, you can use threads as you suggested below. It is true that there is some overhead, but unless you are using very small messages (e.g., < 64 bytes), and performing a lot of communication (e.g., 90% of your application is communication time), you'll not notice any overhead.

In any case, we are working on some enhanced designs that will minimize threading overheads even in such rare cases (some of them are experimentally included in the mpich2-1.1.x series).

-- Pavan

On 07/25/2009 01:38 PM, Nicolas Rosner wrote:
> Sorry to shamelessly invade Tan's, but since we're in the middle of a
> thread about threads, I thought I'd rephrase an old question I once
> tried to state here -- with a lot less understanding of the problem
> back then. Context:  My app is a rather typical
> single-centralized-master, N-1-worker, pool-of-tasks setup, except
> that workers don't just consume, but may also split tasks that seem
> too hard, which ends up pushing large series of new child tasks back
> to where the parent had been obtained, and so on, recursively.
> 
> Now, most task-related messages and structures don't handle the
> real/full tasks (~1 MB to 10 MB of data each), but lightweight proxy
> objects (~1 KB) that represent them -- which works fine in many
> situations where some subset of metadata suffices. However, after a
> worker gets a task assignment, at some point it does need the complete
> file. Conversely, when one is divided, the splitting worker needs to
> store the new files somewhere. The current approach is as follows:
> 
>    - all new task files produced by a worker W are stored to a local
> HD on its host
> 
>    - master is immediately notified about each creation (children are
> pushed back into queue,
>      but by reference only, including a "currently located at ..." field),
> 
>    - when master assigns a task to some other worker Z, the message
> includes said location, and
> 
>    - worker Z then sends a msg to worker W requesting the task, which
> W sends as one fat MPI message.
> 
> This achieves decent load balancing, but although a worker can't
> really do anything productive while waiting for a totally new
> datafile, it may certainly not block if it is to stay deadlock-free --
> it [or *some* component, anyway] needs to be ready to serve any
> potential incoming "full datafile requests" from other workers within
> some constant amount of delay that may not depend on its own pending
> request.
> 
> So, I first tried the nonblocking approach; no good, worker + file
> server chores combined yield a giant statechart, way too complex to
> debug and maintain. Then, trying to stay away from hybrid thread+MPI
> code, I tried separating the worker and server as two different
> processes, and ran twice as many processes as available processors.
> Although Linux did a pretty good job scheduling them (not surprising
> since they're almost purely cpu and i/o bound, respectively), there is
> some performance penalty, plus it's messy to be keeping track of how
> mpiexec deals roles to hosts, e.g. lots of rank -> host mappings that
> were fine suddenly become unusable and must be avoided, etc.
> 
> Eventually I gave up on my "no-hybrid" and "don't want to depend on
> thread_multiple support" wishes, got myself a pthreads tutorial and
> ended up with a worker version that uses a 2nd thread (+ possibly
> several subthreads thereof) to keep serving files regardless of main
> thread (actual worker code) status -- and with cleaner, better
> separated code. (Note that there is little or no interaction between
> Wi and Si once launched -- Si just needs to keep serving files that Wi
> produced earlier on but doesn't even remember, let alone care about,
> anymore. Both need MPI messaging all the time, though.)
> 
> Bottom line: as scary as it looked to me to go hybrid, and despite
> several warnings from experts against it, in this particular case it
> turned out to be simpler than many of the clumsy attempts at avoiding
> it, and probably the "least ugly" approach I've found so far.
> 
> Questions: Do you think this solution is OK? Any suggestions for
> alternatives or improvements?  Am I wrong in thinking that this kind
> of need must be a common one in this field?  How do people normally
> address it?  Distributed filesystems, perhaps?  Or something like a
> lightweight http server on each host? Shared central storage is not an
> option -- its scalability hits the ceiling way too quickly. Is the
> performance penalty that Pavan just mentioned (for M_TH_MULTIPLE) of
> considerable magnitude?  Do you think it could be avoided, while still
> keeping the worker code reasonably isolated and unaware of the serving
> part?
> 
> Thanks in advance for any comments, and in retrospective for all the
> useful info here.
> 
> Sincerely
> N.

-- Pavan Balaji
http://www.mcs.anl.gov/~balaji


      