[mpich-discuss] MPI Cluster Hangs

Jayesh Krishna jayesh at mcs.anl.gov
Fri Jan 15 09:51:01 CST 2010


Hi,
 Can you send us a test program that shows the problem?

Regards,
Jayesh
----- Original Message -----
From: "abhishek pandey" <hipandey at gmail.com>
To: mpich-discuss at mcs.anl.gov
Sent: Friday, January 15, 2010 8:57:15 AM GMT -06:00 US/Canada Central
Subject: [mpich-discuss] MPI Cluster Hangs


Hi, 

I am using MPI for communication in a cluster made up of one controller and 5 workers. All of the workers are spawned by the controller.
Most of the time communication between the controller and the workers works fine, but sometimes a worker hangs. I am running the cluster on Windows, and my program is multi-threaded.

The flow is as follows: 

Worker:

1. The worker posts an Irecv request to receive the message from the controller.
2. The worker sends a (blocking) message to the controller asking for data, which the worker will receive in the buffer posted in step 1.
3. The worker tests the Irecv request. If the request has not completed yet, the worker sleeps for some time and then tests again.


Controller: 

1. The controller receives the message from the worker and sends a (blocking) message back to the worker.

The controller does send the message to the worker, and this message is not lost. But sometimes the Irecv request posted by the worker never completes, and the worker hangs.

Any thoughts on this?

Thanks, 
Abhishek 





_______________________________________________
mpich-discuss mailing list
mpich-discuss at mcs.anl.gov
https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss

