[mpich-discuss] mpich2 hangs on Ubuntu beowulf cluster(with NFS)

Nicolas Rosner nrosner at gmail.com
Wed Jan 4 18:06:40 CST 2012


>> One question. If the code deadlocks
>> when using the two machines,
>> shouldn't it deadlock when running
>> on each machine separately?

Hm, I wonder whether I misread what you meant by that.

Well, either way, the answer is still no -- although perhaps for
different reasons.

If by "on each machine separately" you still meant running N ranks on
one machine, then what I said in my previous email does not apply
(since situations like "A and B waiting for each other" are just as
possible within a machine as they are across machines).

In that case, the answer would still be "no" because of other factors,
from changes in latency, to MPICH2 switching to shared-memory-based
comm if network is not needed, to just about anything else, really.

If a program may deadlock (i.e. not deadlock-free code, and therefore
wrong code), in most cases *anything* you change in the running
conditions could (potentially) hide or expose symptoms.
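
To make that concrete, here is a toy sketch in plain Python threads. It is not
MPI code and not a description of how MPICH2 works internally. It assumes a
simplified model in which a "send" returns at once when the message fits
under some buffering threshold, and otherwise blocks until the other side has
posted its receive. The names Channel, exchange and eager_limit are made up
for illustration. Both "ranks" run the same send-then-receive code; only the
message size changes.

```python
import threading


class Channel:
    """Hypothetical one-way link between two ranks (illustration only)."""

    def __init__(self, eager_limit):
        self.eager_limit = eager_limit
        self.receiver_ready = threading.Event()  # set once a receive is posted
        self.delivered = threading.Event()       # set once the message is handed over

    def send(self, size, timeout):
        # Small message: buffered, so the sender returns immediately.
        if size <= self.eager_limit:
            self.delivered.set()
            return True
        # Large message: wait until the receiver has posted its receive.
        if not self.receiver_ready.wait(timeout):
            return False  # timed out; a real blocking send would hang here
        self.delivered.set()
        return True

    def recv(self, timeout):
        self.receiver_ready.set()
        return self.delivered.wait(timeout)


def exchange(message_size, eager_limit, timeout=0.5):
    """Run two ranks that both send first and receive second.

    Returns True if both ranks finished, False if they got stuck.
    """
    a_to_b = Channel(eager_limit)
    b_to_a = Channel(eager_limit)
    results = {}

    def rank(name, outgoing, incoming):
        # Same code on both ranks: send, then receive.
        results[name] = outgoing.send(message_size, timeout) and incoming.recv(timeout)

    threads = [
        threading.Thread(target=rank, args=("A", a_to_b, b_to_a)),
        threading.Thread(target=rank, args=("B", b_to_a, a_to_b)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return all(results.values())


if __name__ == "__main__":
    print("small message completes:", exchange(message_size=10, eager_limit=100))
    print("large message completes:", exchange(message_size=1000, eager_limit=100))
```

The timeout only stands in for "hangs forever" so that the script terminates.
The point is that the program is equally wrong in both runs; the small-message
run merely hides it.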

This is actually not surprising if we consider that even changing
nothing at all implies the same thing.  : )

(If, say, the network load changes abruptly, which is beyond your
control, you may see a deadlock or cease to see one. This is why the
only safe code is code that can never deadlock, period. Code that
"works well most of the time" is no better than code that never works,
and in most cases it's in fact worse.)

Hth, N.



Good luck! N.

