<html><head><style type="text/css"><!-- DIV {margin:0px;} --></style></head><body><div style="font-family:times new roman, new york, times, serif;font-size:12pt"><DIV>I have been seeing shm/ssm drop messages starting with 104, and I am working on localizing the problem.</DIV>
<DIV>The run consists of 1 master and 4 slave processes. The code looks like this:</DIV>
<DIV> </DIV>
<DIV> master : slave :</DIV>
<DIV> </DIV>
<DIV> loop { loop {</DIV>
<DIV> loop { loop {</DIV>
<DIV> forall slave {</DIV>
<DIV> recieve( slave ) send( master )</DIV>
<DIV> }</DIV>
<DIV> compute_status, inject to data</DIV>
<DIV> forall slave {</DIV>
<DIV> send( slave ) recieve( master )</DIV>
<DIV> }</DIV>
<DIV> if end( status ) if end( status )</DIV>
<DIV> terminate() terminate()</DIV>
<DIV> if no_more_data() if no_more_data( status )</DIV>
<DIV> break_inner_loop break_inner_loop</DIV>
<DIV> } }</DIV>
<DIV> </DIV>
<DIV> forall slave {</DIV>
<DIV> recieve_next( slave) send_next( master )</DIV>
<DIV> sync_check() </DIV>
<DIV> }</DIV>
<DIV> compute_next()</DIV>
<DIV> forall slave {</DIV>
<DIV> send_next( slave ) recieve_next( master )</DIV>
<DIV> } </DIV>
<DIV> move_to_next() move_to_next()</DIV>
<DIV> } }</DIV>
<DIV> </DIV>
<DIV>where all the sends are done with MPI_Send and all the receives with MPI_Recv. sync_check() is an assertion that master_current_step == slave's_current_step. Each slave's current step is sent as part of the data</DIV>
<DIV>exchanged in recieve_next(). All processes move through a step counter, so they must stay in sync for the algorithm to work.</DIV>
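<DIV>As a minimal sketch of the step check described above, assuming a fixed-size message that carries the sender's step counter: the struct layout and the names step_msg, pack_msg, unpack_msg and steps_in_sync are hypothetical and not taken from the real code. The packed buffer is what would be handed to MPI_Send and filled by MPI_Recv.</DIV>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical message layout; the real layout is not shown in this post. */
typedef struct {
    int32_t step;       /* sender's current step counter */
    int32_t status;     /* hypothetical status word */
    int32_t payload[2]; /* hypothetical payload */
} step_msg;

/* Copy a message into a byte buffer, e.g. one passed to MPI_Send as MPI_BYTE. */
void pack_msg(const step_msg *m, unsigned char buf[sizeof(step_msg)]) {
    memcpy(buf, m, sizeof *m);
}

/* Rebuild a message from a byte buffer, e.g. one filled by MPI_Recv. */
void unpack_msg(const unsigned char buf[sizeof(step_msg)], step_msg *m) {
    memcpy(m, buf, sizeof *m);
}

/* Returns 1 when the slave's reported step equals the master's, else 0. */
int steps_in_sync(int32_t master_step, const step_msg *from_slave) {
    return from_slave->step == master_step;
}

/* Assertion form of the check: aborts when the two steps differ. */
void sync_check(int32_t master_step, const step_msg *from_slave) {
    assert(steps_in_sync(master_step, from_slave));
}
```

<DIV>Keeping the comparison in steps_in_sync() separate from the assertion makes it possible to log both step values before aborting.</DIV>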
<DIV> </DIV>
<DIV>The problem happens at a random step. However, it always occurs with the 3rd slave, and on the 3rd time the inner </DIV>
<DIV>loop is executed. At that point, the lost message is 16 bytes long. It is always the message from the master</DIV>
<DIV>to the 3rd slave that is lost. Here is the abstracted last trace ((1..4) means 'do it for procs 1, 2, 3, 4'):</DIV>
<DIV> </DIV>
<DIV> master : slave 3:</DIV>
<DIV>step n : step n:</DIV>
<DIV> recieve_next( ) (1..4) send_next( 0 )</DIV>
<DIV> send_next( ) (1..4) recieve_next( 0 )</DIV>
<DIV>step n+m : step n+m :</DIV>
<DIV> recieve( ) (1..4) send( 0 )</DIV>
<DIV> send() (1..4) recieve( 0 )</DIV>
<DIV> recieve() (1..4) send( 0 )</DIV>
<DIV> send() (1..4) recieve( 0 )</DIV>
<DIV> recieve() (1..4) send( 0 ) </DIV>
<DIV> send() (1..4) // send(3) returns recieve( 0 ) <<<<--- message dropped, slave 3 keeps waiting</DIV>
<DIV> recieve_next() (1..4) <<<<-- stuck on trying to send to slave 3 </DIV>
<DIV> </DIV>
<DIV> </DIV>
<DIV>The same code worked OK when I was using 102, and it also works with nemesis. I suspect one of the following:</DIV>
<DIV> - a timeout when the master sends to slave 3</DIV>
<DIV> - a bug in the shared-memory channel</DIV>
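<DIV>To help tell these two cases apart, one possible approach is to number every message per peer and compare the counters on both sides when the hang occurs. The sketch below only illustrates that idea: seq_tracker, seq_next_send and seq_on_receive are hypothetical helpers, not part of MPI or of the real code, and it assumes messages between a given pair of ranks arrive in order.</DIV>

```c
#include <stdint.h>

#define MAX_PEERS 5 /* 1 master + 4 slaves, as in the run described above */

/* Per-process counters, indexed by the rank of the other side. */
typedef struct {
    uint32_t sent[MAX_PEERS];     /* messages sent to each rank */
    uint32_t received[MAX_PEERS]; /* last sequence number seen from each rank */
} seq_tracker;

/* Call just before sending to rank dest; returns the sequence number
 * to place in the outgoing message. */
uint32_t seq_next_send(seq_tracker *t, int dest) {
    return ++t->sent[dest];
}

/* Call just after receiving from rank src, with the sequence number read
 * from the message. Returns how many messages were skipped before this
 * one (0 when nothing is missing). Assumes in-order delivery. */
uint32_t seq_on_receive(seq_tracker *t, int src, uint32_t seq) {
    uint32_t expected = t->received[src] + 1;
    uint32_t gap = seq - expected;
    t->received[src] = seq;
    return gap;
}
```

<DIV>If the master's sent[3] is ahead of slave 3's received[0] at the time of the hang, the message left MPI_Send on the master but never completed an MPI_Recv on the slave.</DIV>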
<DIV> </DIV>
<DIV> </DIV>
<DIV>Please advise on how to proceed with debugging this problem.</DIV>
<DIV> </DIV>
<DIV>thanks</DIV>
<DIV> </DIV>
<DIV>tan</DIV>
<DIV> </DIV></div><br>
</body></html>