[MPICH] MPICH 1.0.5 shm drops messages on Sun Niagara

chong tan chong_guan_tan at yahoo.com
Thu Apr 26 12:56:41 CDT 2007


I have been running into the shm/ssm channels dropping messages starting with 1.0.4.  I am working on localizing the problem.
The run consists of 1 master and 4 slave processes.  The code looks like this:

     master :                                                     slave :
 
   loop {                                                         loop {
      loop {                                                          loop {
          forall slave {
            receive( slave )                                              send( master )
          }
          compute_status, inject into data
          forall slave {
                send( slave )                                             receive( master )
          }
          if end( status )                                                if end( status )
            terminate()                                                     terminate()
          if no_more_data( status )                                       if no_more_data( status )
              break_inner_loop                                              break_inner_loop
      }                                                               }

      forall slave {
          receive_next( slave )                                   send_next( master )
          sync_check()
      }
      compute_next()
      forall slave {
          send_next( slave )                                      receive_next( master )
      }
      move_to_next()                                              move_to_next()
  }                                                               }

where all the sends are done with MPI_Send and all the receives with MPI_Recv.  sync_check() is an assertion that master_current_step == slave_current_step.  The slave's current step is sent as part of the data
exchanged in receive_next().  All processes move through a step counter, so they must stay in sync for the algorithm to work.

The problem happens at a random step.  However, it always occurs with the 3rd slave, on the 3rd iteration of the
inner loop.  At that point, the lost message is 16 bytes long, and it is always the message from the master
to the 3rd slave that is lost.  Here is the abstracted final trace ((1..4) means 'do it for procs 1, 2, 3, 4'):

   master :                                                slave 3 :
step n :                                                   step n :
   receive_next()  (1..4)                                    send_next( 0 )
   send_next()     (1..4)                                    receive_next( 0 )
step n+m :                                                 step n+m :
   receive()  (1..4)                                         send( 0 )
   send()     (1..4)                                         receive( 0 )
   receive()  (1..4)                                         send( 0 )
   send()     (1..4)                                         receive( 0 )
   receive()  (1..4)                                         send( 0 )
   send()     (1..4)   // send(3) returns                    receive( 0 )   <<<<--- message dropped, 3 keeps waiting
   receive_next() (1..4)                           <<<<--- stuck on trying to send to slave 3
  

The same code works fine with 1.0.2.  It also works with the nemesis channel.  I suspect one of the following:
     -  a timeout when the master sends to slave 3
     -  a bug in the shared-memory channel


Please advise on how to proceed with debugging this problem.

thanks

tan



More information about the mpich-discuss mailing list