[mpich-discuss] MPI Cluster Hangs

abhishek pandey hipandey at gmail.com
Fri Jan 15 08:57:15 CST 2010


Hi,

I am using MPI for communication in a cluster organized as one controller
and 5 workers. All of the workers are spawned by the controller.
Most of the time the communication between the controller and the workers
works fine, but sometimes a worker hangs. I am running the cluster on Windows
and my program is multi-threaded.

The flow is as follows:

Worker:

1. The worker posts an Irecv request to receive the message from the controller.
2. The worker sends a (blocking) message to the controller asking it to provide
data, which the worker receives into the buffer posted in step 1.
3. The worker tests the Irecv request. If the request has not completed, the
worker sleeps for some time and then tests again.


Controller:

1. The controller receives the message from the worker and sends a (blocking)
message back to the worker.

The controller does send the message to the worker, and this message is not
lost. But sometimes the Irecv request posted by the worker never completes,
and the worker hangs.

Any thoughts on this?

Thanks,
Abhishek