[mpich-discuss] thread MPI calls

Pavan Balaji balaji at mcs.anl.gov
Sat Jul 25 16:40:45 CDT 2009


Nicolas,

 From what I understand about your application, there are two approaches 
you can use:

1. Use no threads -- in this case, each worker posts Irecv's from all 
processes it's expecting messages from (using MPI_ANY_SOURCE if needed) 
and runs a loop similar to:

till_work_is_done {
	compute();
	Waitany() or Testany()
}

This is the approach most master-worker applications, such as mpiBLAST, 
tend to use.

2. Otherwise, you can use threads as you suggested below. It is true 
that there is some overhead, but unless you are using very small 
messages (e.g., < 64 bytes) and performing a lot of communication 
(e.g., 90% of your application time is communication), you won't 
notice any overhead.

In any case, we are working on some enhanced designs that will minimize 
threading overheads even in such rare cases (some of them are 
experimentally included in the mpich2-1.1.x series).

  -- Pavan

On 07/25/2009 01:38 PM, Nicolas Rosner wrote:
> Sorry to shamelessly invade Tan's, but since we're in the middle of a
> thread about threads, I thought I'd rephrase an old question I once
> tried to state here -- with a lot less understanding of the problem
> back then. Context:  My app is a rather typical
> single-centralized-master, N-1-worker, pool-of-tasks setup, except
> that workers don't just consume, but may also split tasks that seem
> too hard, which ends up pushing large series of new child tasks back
> to where the parent had been obtained, and so on, recursively.
> 
> Now, most task-related messages and structures don't handle the
> real/full tasks (~1 MB to 10 MB of data each), but lightweight proxy
> objects (~1 KB) that represent them -- which works fine in many
> situations where some subset of metadata suffices. However, after a
> worker gets a task assignment, at some point it does need the complete
> file. Conversely, when one is divided, the splitting worker needs to
> store the new files somewhere. The current approach is as follows:
> 
>    - all new task files produced by a worker W are stored to a local
> HD on its host
> 
>    - master is immediately notified about each creation (children are
> pushed back into queue,
>      but by reference only, including a "currently located at ..." field),
> 
>    - when master assigns a task to some other worker Z, the message
> includes said location, and
> 
>    - worker Z then sends a msg to worker W requesting the task, which
> W sends as one fat MPI message.
> 
> This achieves decent load balancing, but although a worker can't
> really do anything productive while waiting for a totally new
> datafile, it certainly must not block if it is to stay deadlock-free --
> it [or *some* component, anyway] needs to be ready to serve any
> potential incoming "full datafile requests" from other workers within
> some constant amount of delay that may not depend on its own pending
> request.
> 
> So, I first tried the nonblocking approach; no good, worker + file
> server chores combined yield a giant statechart, way too complex to
> debug and maintain. Then, trying to stay away from hybrid thread+MPI
> code, I tried separating the worker and server as two different
> processes, and ran twice as many processes as available processors.
> Although Linux did a pretty good job scheduling them (not surprising
> since they're almost purely cpu and i/o bound, respectively), there is
> some performance penalty, plus it's messy to keep track of how
> mpiexec deals roles to hosts, e.g. lots of rank -> host mappings that
> were fine suddenly become unusable and must be avoided, etc.
> 
> Eventually I gave up on my "no-hybrid" and "don't want to depend on
> thread_multiple support" wishes, got myself a pthreads tutorial and
> ended up with a worker version that uses a 2nd thread (+ possibly
> several subthreads thereof) to keep serving files regardless of main
> thread (actual worker code) status -- and with cleaner, better
> separated code. (Note that there is little or no interaction between
> Wi and Si once launched -- Si just needs to keep serving files that Wi
> produced earlier on but doesn't even remember, let alone care about,
> anymore. Both need MPI messaging all the time, though.)
> 
> Bottom line: as scary as it looked to me to go hybrid, and despite
> several warnings from experts against it, in this particular case it
> turned out to be simpler than many of the clumsy attempts at avoiding
> it, and probably the "least ugly" approach I've found so far.
> 
> Questions: Do you think this solution is OK? Any suggestions for
> alternatives or improvements?  Am I wrong in thinking that this kind
> of need must be a common one in this field?  How do people normally
> address it?  Distributed filesystems, perhaps?  Or something like a
> lightweight http server on each host? Shared central storage is not an
> option -- its scalability hits the ceiling way too quickly. Is the
> performance penalty that Pavan just mentioned (for MPI_THREAD_MULTIPLE) of
> considerable magnitude?  Do you think it could be avoided, while still
> keeping the worker code reasonably isolated and unaware of the serving
> part?
> 
> Thanks in advance for any comments, and in retrospective for all the
> useful info here.
> 
> Sincerely
> N.

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji
