[mpich-discuss] Odd differences between runs on two clusters / Lost messages?

Nicolas Rosner nrosner at gmail.com
Mon Sep 22 06:29:16 CDT 2008


Hello, I hope this is the right place to ask about this.

I'm developing an MPI app using a "task pool" approach: there is a
"pool" process that manages a queue and services (hopefully safely
synchronized) PUSH and POP requests on that data structure, and there
are N agents that (see the short sketch after the list):

1) POP a new task from the pool
2) try to solve it for a while
3) either declare it solved and go back to step 1, or
4) declare it "too hard", split it up into subtasks, and
5) PUSH each generated subtask back into the pool
6) go back to step 1.
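
To make that loop concrete, here is a stripped-down sketch of each
agent's control flow. The helper names (pop_task, try_to_solve,
split_task, push_task) and MAX_SUBTASKS are simplified placeholders,
not my real code; pop_task() and push_task() wrap the actual MPI
messaging to and from the pool process, which I describe further down:

    /* Sketch of one agent's control flow -- placeholder names, not the
       real code.  pop_task()/push_task() hide the MPI traffic with the
       pool process. */

    #define MAX_SUBTASKS 64            /* arbitrary bound for the sketch */

    typedef struct task task_t;        /* opaque task description */

    task_t *pop_task(void);                        /* 1) POP from the pool     */
    int     try_to_solve(task_t *t);               /* 2) nonzero if solved     */
    int     split_task(task_t *t, task_t **subs);  /* 4) returns # of subtasks */
    void    push_task(task_t *sub);                /* 5) PUSH back to the pool */

    void agent_main(void)
    {
        for (;;) {
            task_t *t = pop_task();        /* 1) obtain a new task          */

            if (try_to_solve(t))           /* 2)-3) solved within the time  */
                continue;                  /*       budget: back to step 1  */

            task_t *subs[MAX_SUBTASKS];    /* 4) too hard: split it up      */
            int n = split_task(t, subs);

            for (int i = 0; i < n; i++)    /* 5) queue every subtask        */
                push_task(subs[i]);
                                           /* 6) back to step 1             */
        }
    }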

The tasks have hierarchical IDs, the root task being "1".  Thus, an
agent could obtain task 1.23.15 and, after massaging it for a while,
decide to partition it into 1.23.15.1, 1.23.15.2, 1.23.15.3, etc., all
of which would be queued at the central pool, waiting to be obtained
by other agents, and so on.
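
(For completeness: a child ID is just the parent ID with a numeric
suffix appended, i.e. something along these lines; subtask_id() is a
made-up name, for illustration only.)

    #include <stdio.h>

    /* Build the ID of the i-th subtask of `parent`, e.g.
       subtask_id(buf, sizeof buf, "1.23.15", 2) yields "1.23.15.2". */
    static void subtask_id(char *out, size_t outsz, const char *parent, int i)
    {
        snprintf(out, outsz, "%s.%d", parent, i);
    }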

My program seems to run fine on my test cluster, which consists of 3
dual-core PCs in my office running MPICH2 1.0.7.  But it's not working
well on the production one, consisting of 50 quad-core nodes, on an
InfiniBand network, running MVAPICH 1.0.

I have already asked the cluster admins whether it would be possible
to upgrade MPI on the cluster to the latest MVAPICH release (which
seems to be based on the same MPICH2 1.0.7 that is installed on the
test cluster). But the problem seems basic enough that I'd be
surprised if the rather old MVAPICH version were to blame. In other
words, my guess is that I probably have some bug that shows itself
quite easily on one platform and remains asymptomatic on the other,
yet is a bug all the same.

You can see an example of a (very verbose) logfile showing the
unwanted behavior here:

http://neurus.dc.uba.ar/rsat/logs/publ/112/knine.o122

The three lines where we last hear from agents 2, 3 and 4 are:

8.74 s -- Agent 3 sending PUSH message to pool for task 1.1.7
9.15 s -- Agent 4 sending PUSH message to pool for task 1.2.7
29.75 s -- Agent 2 sending PUSH message to pool for task 1.3.7

The agents use fully synchronous Ssend() calls for the PUSH messages,
and the pool process uses Iprobe() to test whether a new message has
arrived and, if that returns true, Recv() to actually receive it.
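
Boiled down, the exchange looks roughly like this (simplified: the tag
value, buffer sizes and the pool's queue bookkeeping are placeholders
or omitted):

    #include <mpi.h>
    #include <string.h>

    #define TAG_PUSH 3                 /* placeholder tag value */
    #define MAX_ID   256

    /* Agent side: PUSH a subtask ID to the pool process (rank `pool`).
       Ssend() only completes once the pool has started receiving the
       message, which is why an agent hangs here if it is never seen. */
    void push_to_pool(const char *id, int pool)
    {
        MPI_Ssend((void *)id, (int)strlen(id) + 1, MPI_CHAR,
                  pool, TAG_PUSH, MPI_COMM_WORLD);
    }

    /* Pool side: called from the pool's main loop; checks for a pending
       PUSH and receives it if Iprobe() says one has arrived. */
    void pool_poll_push(void)
    {
        int flag = 0;
        MPI_Status st;

        MPI_Iprobe(MPI_ANY_SOURCE, TAG_PUSH, MPI_COMM_WORLD, &flag, &st);
        if (flag) {
            char id[MAX_ID];
            MPI_Recv(id, MAX_ID, MPI_CHAR, st.MPI_SOURCE, TAG_PUSH,
                     MPI_COMM_WORLD, &st);
            /* ...enqueue `id` in the task queue... */
        }
    }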

Notice how several PUSHes in a row succeed, and then suddenly one of
them somehow gets lost. The pool process never seems to receive the
message (i.e. Iprobe() never returns true for it), so, naturally, the
sending agent blocks forever on its Ssend() call. Once this has
happened to every agent, progress stops.

If I run that same test, with the same inputs, parameters, PRNG seed,
number of agents, etc., but on the test cluster in my office, it runs
fine: no messages are lost, progress never stops, and the run
eventually ends normally.

Any input, comments or suggestions would be greatly appreciated.  I
can provide source code, more logs and further details upon request to
anyone interested in helping me out.

Thanks a lot in advance!

Nicolás



