[MPICH] run cpi on windows and linux machines

Yu-Cheng Chou cycchou at ucdavis.edu
Fri Oct 7 18:40:35 CDT 2005


Hi,

I launched smpd on four machines: two Windows and two Linux.

I ran the following command on the local Linux host:

mpiexec -pwdfile password -path "/home/ycchou/tmp;C:\tmp" -n 4 -machinefile machines cpi

(1) It hung.
    The file "machines" lists the machine names in this order: Linux A
    (local host), Linux B, Windows A, Windows B.

(2) I changed the order to Linux A (local host), Windows A, Linux B,
    Windows B.
    The program exited after I entered the number of intervals, with
    the error message shown below.
 
    job aborted:
    rank: node: exit code[: error message]
    0: iel2: -1
    1: bird2: 13
    2: Mouse2: -1: Fatal error in MPI_Bcast: Other MPI error, error stack:
    MPI_Bcast(827): MPI_Bcast(buf=0xbffff748, count=1, MPI_INT, root=0,
    MPI_COMM_WORLD) failed
    MPIR_Bcast(229): 
    MPIC_Send(48): 
    MPIC_Wait(321): 
    MPIDI_CH3_Progress_wait(209): an error occurred while handling an
    event returned by MPIDU_Sock_Wait()
    MPIDI_CH3I_Progress_handle_sock_event(1120): [ch3:sock] failed to 
    connnect to remote process 234207FE36DCB2F24D062016BBEF915:3
    MPIDU_Sock_post_connect(114): unable to resolve host name to an
    address (set=1,sock=16777218,host=phoenix)
    3: phoenix: 13

(3) When I put the two Windows machines first in the file "machines",
    the program worked.

(4) When I changed the order to Linux B, Linux A (local host),
    Windows A, Windows B, the program exited with the error message
    shown below.

   op_read error on left context: socket connection closed, error stack:
   MPIDU_Socki_handle_read(632): connection closed by peer  
   (set=1,sock=16777216)
   unable to read the cmd header on the left context, socket connection
   closed, error stack:
   MPIDU_Socki_handle_read(632): connection closed by peer   
   (set=1,sock=16777216).

(5) When I changed the order to Linux B, Linux A (local host),
    Windows B, Windows A, the program hung with the error message
    shown below.

   op_read error on left context: socket connection failed, error stack:
   MPIDU_Socki_handle_read(658): connection failure  
   (set=1,sock=16777216,errno=104:Connection reset by peer)
   unable to read the cmd header on the left context, socket connection
   failed, error stack:
   MPIDU_Socki_handle_read(658): connection failure
   (set=1,sock=16777216,errno=104:Connection reset by peer).

Could anyone tell me why it works only when the Windows machines are
listed before the Linux machines in the machinefile?
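
The "unable to resolve host name to an address" line in case (2) names
host=phoenix, so one thing that could be checked on each machine is
whether every name in the machines file resolves there. Below is a
minimal sketch in Python, not part of MPICH. It assumes the machines
file holds one host name per line, optionally followed by ":count", with
"#" starting a comment; the helper names parse_machinefile and
check_hosts are made up for this illustration.

```python
import socket
import sys


def parse_machinefile(text):
    """Return host names from machinefile text.

    Assumed format (an assumption, not taken from the MPICH docs):
    one host per line, optional ":count" suffix, "#" starts a comment.
    """
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        # Keep only the first token and strip an optional ":count" suffix.
        hosts.append(line.split()[0].split(":", 1)[0])
    return hosts


def check_hosts(hosts, resolve=socket.gethostbyname):
    """Map each host name to its resolved address, or None on failure."""
    results = {}
    for host in hosts:
        try:
            results[host] = resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[host] = None
    return results


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_hosts.py machines
    with open(sys.argv[1]) as f:
        for host, addr in check_hosts(parse_machinefile(f.read())).items():
            print(f"{host}: {addr or 'UNRESOLVED'}")
```

Running it on each of the four machines with the same machines file
would show whether any machine fails to resolve one of the others, as
the rank 2 error suggests for phoenix. A clean result everywhere would
point away from name resolution as the cause.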




