[mpich-discuss] thread MPI calls

Sat Jul 25 13:38:48 CDT 2009

Sorry to shamelessly invade Tan's thread, but since we're in the middle of a
thread about threads, I thought I'd rephrase an old question I once
tried to state here -- with a lot less understanding of the problem
back then. Context:  My app is a rather typical
single-centralized-master, N-1-worker, pool-of-tasks setup, except
that workers don't just consume, but may also split tasks that seem
too hard, which ends up pushing large series of new child tasks back
to where the parent had been obtained, and so on, recursively.

Now, most task-related messages and structures don't handle the
real/full tasks (~1 MB to 10 MB of data each), but lightweight proxy
objects (~1 KB) that represent them -- which works fine in many
situations where some subset of metadata suffices. However, after a
worker gets a task assignment, at some point it does need the complete
file. Conversely, when one is divided, the splitting worker needs to
store the new files somewhere. The current approach is as follows:

   - all new task files produced by a worker W are stored to a local
     HD on its host,

   - master is immediately notified about each creation (children are
     pushed back into queue, but by reference only, including a
     "currently located at ..." field),

   - when master assigns a task to some other worker Z, the message
     includes said location, and

   - worker Z then sends a msg to worker W requesting the task, which
     W sends as one fat MPI message.
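
To make the four steps above concrete, here is a minimal C sketch of the
proxy record and the master's by-reference queue. All names (task_proxy,
proxy_queue, the TAG_* values, fetch_source) and the field layout are made
up for illustration, and the FIFO order is an assumption; the real proxies
carry about 1 KB of metadata. No MPI calls appear, so it compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message tags, one per step of the protocol. */
enum msg_tag {
    TAG_TASK_CREATED = 1,  /* W -> master: new child task, by reference */
    TAG_TASK_ASSIGN,       /* master -> Z: proxy incl. current location */
    TAG_FILE_REQUEST,      /* Z -> W: "send me the full file"           */
    TAG_FILE_DATA          /* W -> Z: the full task file in one message */
};

/* Lightweight proxy; this field set is an assumption, not the real layout. */
typedef struct {
    uint64_t task_id;
    uint64_t file_bytes;   /* size of the full task file */
    int      owner_rank;   /* the "currently located at ..." field */
} task_proxy;

#define QUEUE_CAP 1024

/* Master-side ring buffer of proxies; zero-initialize before use. */
typedef struct {
    task_proxy items[QUEUE_CAP];
    size_t head, count;
} proxy_queue;

/* Enqueue a proxy by value. Returns 0 on success, -1 if the queue is full. */
int queue_push(proxy_queue *q, const task_proxy *p) {
    assert(q != NULL && p != NULL);
    if (q->count == QUEUE_CAP) return -1;
    q->items[(q->head + q->count) % QUEUE_CAP] = *p;
    q->count++;
    return 0;
}

/* Dequeue the oldest proxy into *out. Returns 0, or -1 if the queue is empty. */
int queue_pop(proxy_queue *q, task_proxy *out) {
    assert(q != NULL && out != NULL);
    if (q->count == 0) return -1;
    *out = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    return 0;
}

/* Worker side: which rank must be asked for the full file?
   Returns -1 when the file already sits on this worker's own host. */
int fetch_source(const task_proxy *p, int my_rank) {
    assert(p != NULL);
    return p->owner_rank == my_rank ? -1 : p->owner_rank;
}
```

With this, worker Z would call fetch_source() on the proxy it was assigned
and, unless it gets -1, send a TAG_FILE_REQUEST to the returned rank.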

This achieves decent load balancing. However, although a worker can't
really do anything productive while waiting for a totally new
datafile, it certainly must not block if it is to stay deadlock-free:
it [or *some* component, anyway] needs to be ready to serve any
incoming "full datafile requests" from other workers within a bounded
delay that does not depend on its own pending request.

So, I first tried the nonblocking approach; no good, worker + file
server chores combined yield a giant statechart, way too complex to
debug and maintain. Then, trying to stay away from hybrid thread+MPI
code, I tried separating the worker and server as two different
processes, and ran twice as many processes as available processors.
Although Linux did a pretty good job scheduling them (not surprising
since they're almost purely cpu and i/o bound, respectively), there is
some performance penalty, plus it's messy to keep track of how
mpiexec deals roles to hosts, e.g. lots of rank -> host mappings that
were fine suddenly become unusable and must be avoided, etc.

Eventually I gave up on my "no-hybrid" and "don't want to depend on
thread_multiple support" wishes, got myself a pthreads tutorial and
ended up with a worker version that uses a 2nd thread (+ possibly
several subthreads thereof) to keep serving files regardless of main
thread (actual worker code) status -- and with cleaner, better
separated code. (Note that there is little or no interaction between
worker thread Wi and server thread Si once launched -- Si just needs to
keep serving files that Wi produced earlier on but doesn't even
remember, let alone care about, anymore. Both need MPI messaging all
the time, though.)
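
The shape of that server thread can be sketched with pthreads alone. This
is only an in-process model under loud assumptions: a mutex-protected
one-slot mailbox stands in for the MPI messages, string pointers stand in
for task files on disk, and every name (file_server, server_start,
server_publish, server_request, server_stop) is hypothetical. In the real
worker the wait would be a blocking MPI_Recv on a dedicated request tag,
which is only safe alongside the worker thread's own MPI calls if
MPI_Init_thread granted MPI_THREAD_MULTIPLE.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_FILES 16

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  wake;          /* signalled on every state change */
    const char *files[MAX_FILES];  /* stand-in for task files on local disk */
    int         pending_id;        /* requested file id, -1 if none */
    const char *reply;             /* answer to the current request */
    bool        has_reply;
    bool        shutdown;
    pthread_t   thread;
} file_server;

/* Server thread: sleep until a request arrives, answer it, repeat.
   It never inspects what the worker thread is doing. */
static void *serve_loop(void *arg) {
    file_server *s = arg;
    pthread_mutex_lock(&s->lock);
    for (;;) {
        while (s->pending_id < 0 && !s->shutdown)
            pthread_cond_wait(&s->wake, &s->lock);
        if (s->pending_id < 0) break;      /* shutdown, nothing left to serve */
        int id = s->pending_id;
        s->reply = (id < MAX_FILES) ? s->files[id] : NULL;
        s->pending_id = -1;
        s->has_reply = true;
        pthread_cond_broadcast(&s->wake);
    }
    pthread_mutex_unlock(&s->lock);
    return NULL;
}

/* Initialize the state and launch the server thread. Returns 0 on success. */
int server_start(file_server *s) {
    assert(s != NULL);
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->wake, NULL);
    for (int i = 0; i < MAX_FILES; i++) s->files[i] = NULL;
    s->pending_id = -1;
    s->reply = NULL;
    s->has_reply = false;
    s->shutdown = false;
    return pthread_create(&s->thread, NULL, serve_loop, s);
}

/* Worker thread: register a freshly produced file, then forget about it. */
int server_publish(file_server *s, int id, const char *data) {
    if (id < 0 || id >= MAX_FILES) return -1;
    pthread_mutex_lock(&s->lock);
    s->files[id] = data;
    pthread_mutex_unlock(&s->lock);
    return 0;
}

/* Stand-in for a remote worker's request; returns NULL for unknown ids. */
const char *server_request(file_server *s, int id) {
    if (id < 0 || id >= MAX_FILES) return NULL;
    pthread_mutex_lock(&s->lock);
    while (s->pending_id >= 0 || s->has_reply)   /* wait for a free slot */
        pthread_cond_wait(&s->wake, &s->lock);
    s->pending_id = id;
    pthread_cond_broadcast(&s->wake);
    while (!s->has_reply)                        /* wait for the answer */
        pthread_cond_wait(&s->wake, &s->lock);
    const char *r = s->reply;
    s->has_reply = false;
    pthread_cond_broadcast(&s->wake);
    pthread_mutex_unlock(&s->lock);
    return r;
}

/* Ask the server thread to exit, wait for it, and release resources. */
void server_stop(file_server *s) {
    pthread_mutex_lock(&s->lock);
    s->shutdown = true;
    pthread_cond_broadcast(&s->wake);
    pthread_mutex_unlock(&s->lock);
    pthread_join(s->thread, NULL);
    pthread_cond_destroy(&s->wake);
    pthread_mutex_destroy(&s->lock);
}
```

The point of the structure is that the worker thread only ever touches
server_publish(), so it stays unaware of who requests which file and when.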

Bottom line: as scary as it looked to me to go hybrid, and despite
several warnings from experts against it, in this particular case it
turned out to be simpler than many of the clumsy attempts at avoiding
it, and probably the "least ugly" approach I've found so far.

Questions: Do you think this solution is OK? Any suggestions for
alternatives or improvements?  Am I wrong in thinking that this kind
of need must be a common one in this field?  How do people normally
address it?  Distributed filesystems, perhaps?  Or something like a
lightweight http server on each host? Shared central storage is not an
option -- its scalability hits the ceiling way too quickly. Is the
performance penalty that Pavan just mentioned (for MPI_THREAD_MULTIPLE) of
considerable magnitude?  Do you think it could be avoided, while still
keeping the worker code reasonably isolated and unaware of the serving
part?

Thanks in advance for any comments, and in retrospect for all the
useful info here.

Sincerely
N.