[mpich-discuss] Large MPI messages?

Nicolas Rosner nrosner at gmail.com
Tue Sep 30 20:57:48 CDT 2008


Hello mpich-discuss,

What do you do if you need to send really big messages (say, 500 KB,
or even a few megabytes) over MPI? Is that possible at all? Does
anyone have experience with fragmentation schemes, in case there is a
hard limit on message size?
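
(To be concrete, the naive fragmentation scheme I'd try would be
something like the sketch below -- the chunk size, the tag values and
the function names are placeholders I made up for this message.)

#include <mpi.h>

#define CHUNK_BYTES (64 * 1024)   /* arbitrary fragment size */
#define TAG_SIZE    100           /* "here comes the total length" */
#define TAG_DATA    101           /* "here comes one fragment" */

/* Sender: announce the total length, then stream fixed-size chunks. */
void send_big(char *buf, long total, int dest, MPI_Comm comm)
{
    long off, left;
    int n;

    MPI_Send(&total, 1, MPI_LONG, dest, TAG_SIZE, comm);
    for (off = 0; off < total; off += CHUNK_BYTES) {
        left = total - off;
        n = (int)(left < CHUNK_BYTES ? left : CHUNK_BYTES);
        MPI_Send(buf + off, n, MPI_BYTE, dest, TAG_DATA, comm);
    }
}

/* Receiver: learn the size, then collect the chunks into one buffer. */
void recv_big(char *buf, long *total, int src, MPI_Comm comm)
{
    long off, left;
    int n;

    MPI_Recv(total, 1, MPI_LONG, src, TAG_SIZE, comm,
             MPI_STATUS_IGNORE);
    for (off = 0; off < *total; off += CHUNK_BYTES) {
        left = *total - off;
        n = (int)(left < CHUNK_BYTES ? left : CHUNK_BYTES);
        MPI_Recv(buf + off, n, MPI_BYTE, src, TAG_DATA, comm,
                 MPI_STATUS_IGNORE);
    }
}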

My app uses a "pool of tasks" approach. Agents both consume tasks from
the pool and push new (hopefully smaller) tasks back into it. A
dedicated "pool" process keeps the requests synchronized. The fact
that the pool is a centralized entity isn't much of a concern, because
the average task takes rather long to execute (on the order of
minutes) and the pool only deals in task IDs (basically strings that
identify files on secondary storage).
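
(Roughly, the request/reply part works along the lines of the sketch
below; the tag values, the fixed-length ID strings and the "rank 0 is
the pool" convention are simplifications for this message.)

#include <mpi.h>

#define ID_LEN    256   /* fixed-size task ID strings (file names) */
#define TAG_REQ   1     /* agent -> pool: "give me work" */
#define TAG_TASK  2     /* pool -> agent: a task ID */

/* Agent side: ask the pool (rank 0) for the next task ID. */
int fetch_task(char id[ID_LEN], MPI_Comm comm)
{
    int dummy = 0;

    MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQ, comm);
    MPI_Recv(id, ID_LEN, MPI_CHAR, 0, TAG_TASK, comm,
             MPI_STATUS_IGNORE);
    return id[0] != '\0';      /* empty string means "nothing left" */
}

/* Pool side (rank 0): answer whoever asks, one request at a time. */
void serve_one(char next_id[ID_LEN], MPI_Comm comm)
{
    int dummy;
    MPI_Status st;

    MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQ, comm, &st);
    MPI_Send(next_id, ID_LEN, MPI_CHAR, st.MPI_SOURCE, TAG_TASK, comm);
}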

The problem is that the "secondary storage" is currently just a shared
NFS directory (on this cluster, by default, your home dir -- which
lives on /home on the head node's hard drive -- is mounted via NFS on
every node). So you end up with a few hundred agents fighting each
other while concurrently creating, writing and reading thousands of
half-megabyte files in one shared directory, on a single physical
drive -- and that seriously does not scale.

Plus, the cluster features an InfiniBand switch, but apparently that
is only used by MPI via MVAPICH, while NFS traffic goes over plain old
Ethernet. IB's ultra-low latency isn't much of an advantage here
(message frequency is negligible compared to average task completion
time), but its high bandwidth could be very useful.

So, on the one hand, I really should distribute the storage load (at
the very least, the agents could store the new tasks they generate
locally, and the pool could serve [taskID, hostWhoHasIt] pairs instead
of bare task IDs). But, on top of that, it would be very nice if I
could make the agents exchange tasks through MPI directly, along the
lines of the sketch below; that would not only benefit from the IB
bandwidth, but also avoid going through the hard disk altogether.
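
(Again, just to make the idea concrete -- the tag values, the message
format and the function names below are made up, and I'm glossing over
how the owner finds the task it is holding in memory.)

#include <mpi.h>
#include <stdlib.h>

#define ID_LEN     256
#define TAG_PULL   10   /* consumer -> owner: "send me this task" */
#define TAG_LEN    11   /* owner -> consumer: payload size */
#define TAG_BODY   12   /* owner -> consumer: payload bytes */

/* Consumer: pull a task body straight from the agent that created it,
 * given the (taskID, ownerRank) pair obtained from the pool. */
char *pull_task(char id[ID_LEN], int owner, long *len, MPI_Comm comm)
{
    char *body;

    MPI_Send(id, ID_LEN, MPI_CHAR, owner, TAG_PULL, comm);
    MPI_Recv(len, 1, MPI_LONG, owner, TAG_LEN, comm,
             MPI_STATUS_IGNORE);
    body = malloc(*len);
    MPI_Recv(body, (int)*len, MPI_BYTE, owner, TAG_BODY, comm,
             MPI_STATUS_IGNORE);
    return body;
}

/* Owner: answer a pull request with a task kept in memory, so the
 * payload never touches the shared file system. */
void serve_pull(char *body, long len, MPI_Comm comm)
{
    char id[ID_LEN];
    MPI_Status st;

    MPI_Recv(id, ID_LEN, MPI_CHAR, MPI_ANY_SOURCE, TAG_PULL, comm, &st);
    MPI_Send(&len, 1, MPI_LONG, st.MPI_SOURCE, TAG_LEN, comm);
    MPI_Send(body, (int)len, MPI_BYTE, st.MPI_SOURCE, TAG_BODY, comm);
}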

Any input, ideas, suggestions, warnings etc will be greatly appreciated.

TIA,
Nicolás



