[mpich-discuss] mpich2 in a cluster (nfs?)

Nicolas Rosner nrosner at gmail.com
Fri Apr 24 15:35:23 CDT 2009


Hello,

> Is the problem that the 3 filesystems are not shared?

I'm afraid it would be hard for anyone here to answer that question,
since we don't know anything about what your client's program does.

Perhaps the problem is caused by the fact that this second program is
not a parallel program at all, i.e. just plain old sequential code?
If you run a non-MPI program on N machines, all you get are N copies
of the same thing running in parallel, each process totally isolated
from the rest.
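
To make that concrete, here is a tiny sketch (a made-up example, not
your client's code) of a purely sequential program that contains no
MPI calls at all:

    /* seq_sum.c -- hypothetical sequential example; no MPI anywhere */
    #include <stdio.h>

    int main(void)
    {
        long sum = 0;
        /* every copy of this process does ALL of the work */
        for (long i = 0; i < 10000; i++)
            sum += i;
        printf("sum = %ld\n", sum);
        return 0;
    }

Running "mpiexec -n 3 ./seq_sum" just starts three identical
processes, each computing the full sum on its own; the wall-clock
time is the same as (or worse than) running a single copy, never 1/3
of it.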


> Because the first program I tested was not ported to parallelization, and it worked well.

Hmm, that sounds unlikely.  If, as you said, the first program
achieved almost linear speedup (i.e. the total running time on 3 nodes
was about 1/3 of the total time it takes to run with the same input
on a single node), then it definitely *has* to be a parallel
implementation.

In other words, it somehow manages to split its workload evenly among
the 3 processes; this implies, at the very least, that each copy of
the executable is aware of the fact that it is not alone.
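
Just to illustrate what that awareness looks like in practice, here
is a minimal sketch (again, a made-up example) of how an MPI program
typically finds out its own rank and the total number of processes,
and splits a loop accordingly:

    /* par_sum.c -- hypothetical MPI version of the little loop above */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        long local = 0, total = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* who am I?        */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* how many are we? */

        /* each process handles every size-th iteration, so the
           workload is divided instead of duplicated */
        for (long i = rank; i < 10000; i += size)
            local += i;

        /* combine the partial sums on rank 0 */
        MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("total = %ld\n", total);

        MPI_Finalize();
        return 0;
    }

Compile it with mpicc and launch it with "mpiexec -n 3 ./par_sum":
each of the 3 processes gets a different rank, does roughly 1/3 of
the iterations, and rank 0 prints the combined result. That rank/size
pair is precisely the "awareness of not being alone" I mean.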


> If I use mpiexec -n 3 /home/cluster/program and this program only writes
> to an NFS-exported filesystem, maybe I have only one process shared
> across all 3 nodes. Really, the process was faster -- 1/3 of the time.

I'm not sure what you mean by "shared", but a piece of purely
sequential code is not going to automagically run 3x faster just
because you use mpiexec to run it on 3 machines. Whether it writes on
NFS-shared or node-local storage is not the issue; it sounds like you
should be working on basic questions (such as "does this program use
MPI calls at all") before shifting your focus to storage.


However, once you get those crucial aspects sorted out, there are a
few important facts about shared-vs-local storage that you'll want to
consider:

> a program where on each node the generated files are
> the same and are written to a /scratch/user path

1) Be careful with programs that write to fixed locations. The
semantics of "write to this fixed path" change radically depending on
whether multiple copies of the program are executed on different
machines or on the same one, and whether the fixed pathname denotes a
local file or one in NFS-shared common storage.

A few examples might help you here:

1a) If two processes write to a fixed local path (say, "/tmp/foo.txt")
while running on different physical machines, then each process will
have its own foo.txt file (which might be Very Good or Very Bad,
depending on what you're trying to achieve).

1b) If the same situation happens but the two processes happen to be
running on the same host (e.g. on two cores within the same node),
then both would try to write to the very same file -- probably Very
Bad under most circumstances.

1c) If two processes try to write to a fixed pathname in NFS-shared
space instead of node-local (say, "/shared/foo.txt"), the result would
be equivalent to 1b) above (most likely Very Bad) regardless of
whether the processes are running on different hosts or on the same
one.
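
By the way, if the program is (or ends up being) a real MPI program,
a common way to avoid the collisions described in 1b) and 1c) is to
build each process's output pathname from its rank. A rough sketch
(the /scratch/user location is just taken from your example):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        char path[256];
        FILE *f;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* each process writes to its own file, whether the directory
           is node-local (/scratch/...) or NFS-shared */
        snprintf(path, sizeof(path), "/scratch/user/output.%d.txt", rank);

        f = fopen(path, "w");
        if (f != NULL) {
            fprintf(f, "hello from rank %d\n", rank);
            fclose(f);
        }

        MPI_Finalize();
        return 0;
    }

With rank-specific names like that, it no longer matters
(correctness-wise) whether two processes land on the same host or on
different ones.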


2) Be careful with NFS abuse as well; shared storage is a centralized
facility, and as such can become a serious limitation for scalability
later on.

I mean, sure, as long as each process uses slightly different
pathnames to avoid collisions (e.g. "/shared/worker5/foo.txt"), it
looks like a great solution, since many problems related to live data
distribution just seem to disappear (e.g. no need to move big files
back and forth between nodes, etc).

But once you start using, say, 100 processes instead of 3, you might
run into the Ugly Side of shared storage. First problem is disk space:
if each process writes 10 GB, suddenly you need 1 TB of centralized
storage (instead of just 10 GB on each local disk, or, say, 40 GB per
host if using 4-core machines).

And even if you had the money to just buy a bigger HD for the NFS
server, the second problem is worse: imagine 100 processes trying to
simultaneously write large files to the same physical HD. Even with a
fast server, and perhaps a few separate disks RAIDed together,
there's a fixed limit to the volume of concurrent requests that it can
handle before collapsing.

That limit is quickly reached in parallel computing as you increase
the number of running processes. Sooner or later you will probably
need to distribute your storage needs, much in the same way you are
now trying to distribute your CPU usage.

Hope this helps,

Nicolas
