[mpich-discuss] Unable to run simple mpi problem
    Jayesh Krishna 
    jayesh at mcs.anl.gov
       
    Tue Dec 15 16:59:33 CST 2009
    
    
  
Hi,
 The short answer is that I don't know. The process launching mechanism is different when you use the "-localonly" option, so there could be a bug there. I tried running your code with the "-localonly" option with up to 30 procs but was unable to reproduce the error.
 If there is no requirement for you to use the "-localonly" option, I would recommend getting rid of it.
Regards,
Jayesh
----- Original Message -----
From: "dave waite" <waitedm at gmail.com>
To: mpich-discuss at mcs.anl.gov
Sent: Tuesday, December 15, 2009 4:26:58 PM GMT -06:00 US/Canada Central
Subject: Re: [mpich-discuss] Unable to run simple mpi problem
Jayesh,
We are running 1.2.1
We get the error when running either 2 or 3 processes - haven't tried
anything else.
We did our test removing the -localonly and now hellompi runs fine.
Mpich2 output:
C:\MPI>mpiexec2 -n 3 hellompi
Hello from Rank 1 of 3 on usbospc126.americas.munters.com
Hello from Rank 2 of 3 on usbospc126.americas.munters.com
Nodes:1
Names:||usbospc126.americas.munters.com||
Smpd -d output:
[01:5368]\smpd_add_command_arg
[01:5368]/smpd_add_command_arg
[01:5368]creating an exit command for rank 0, pid 3308, exit code 0.
[01:5368]\smpd_post_write_command
[01:5368]\smpd_package_command
[01:5368]/smpd_package_command
[01:5368]\SMPDU_Sock_get_sock_id
[01:5368]/SMPDU_Sock_get_sock_id
[01:5368]smpd_post_write_command on the parent context sock 708: 98 bytes for command: "cmd=exit src=1 dest=0 tag=33 rank=0 code=0 kvs=2B33681B-F9DA-4d18-8E7A-29D855F2EC08 "
[01:5368]\SMPDU_Sock_post_writev
[01:5368]/SMPDU_Sock_post_writev
[01:5368]/smpd_post_write_command
[01:5368]\smpd_free_process_struct
[01:5368]/smpd_free_process_struct
[01:5368]\smpd_free_context
[01:5368]freeing stdout context.
[01:5368]\smpd_init_context
[01:5368]\smpd_init_command
[01:5368]/smpd_init_command
[01:5368]/smpd_init_context
[01:5368]/smpd_free_context
[01:5368]/smpd_handle_op_close
[01:5368]sock_waiting for the next event.
[01:5368]\SMPDU_Sock_wait
[01:5368]/SMPDU_Sock_wait
[01:5368]SOCK_OP_CLOSE
[01:5368]\smpd_handle_op_close
[01:5368]\smpd_get_state_string
[01:5368]/smpd_get_state_string
[01:5368]op_close received - SMPD_CLOSING state.
[01:5368]Unaffiliated stdin context closing.
[01:5368]\smpd_free_context
[01:5368]freeing stdin context.
[01:5368]\smpd_init_context
[01:5368]\smpd_init_command
[01:5368]/smpd_init_command
[01:5368]/smpd_init_context
[01:5368]/smpd_free_context
[01:5368]/smpd_handle_op_close
[01:5368]sock_waiting for the next event.
[01:5368]\SMPDU_Sock_wait
[01:5368]/SMPDU_Sock_wait
[01:5368]SOCK_OP_WRITE
[01:5368]\smpd_handle_op_write
[01:5368]\smpd_state_writing_cmd
[01:5368]wrote command
[01:5368]command written to parent: "cmd=exit src=1 dest=0 tag=33 rank=0 code=0 kvs=2B33681B-F9DA-4d18-8E7A-29D855F2EC08 "
[01:5368]\smpd_free_command
[01:5368]\smpd_init_command
[01:5368]/smpd_init_command
[01:5368]/smpd_free_command
[01:5368]/smpd_state_writing_cmd
[01:5368]/smpd_handle_op_write
[01:5368]sock_waiting for the next event.
[01:5368]\SMPDU_Sock_wait
[01:5368]ReadFile failed, error 109
[01:5368]ReadFile failed, error 109
[01:5368]*** smpd_piothread finishing pid:420 ***
[01:5368]*** smpd_piothread finishing pid:420 ***
[01:5368]/SMPDU_Sock_wait
[01:5368]SOCK_OP_READ event.error = -1, result = 0, context->type=4
[01:5368]\smpd_handle_op_read
[01:5368]\smpd_state_reading_stdouterr
[01:5368]/smpd_state_reading_stdouterr
[01:5368]/smpd_handle_op_read
[01:5368]SOCK_OP_READ failed - result = -1, closing stdout context.
[01:5368]\SMPDU_Sock_post_close
[01:5368]\SMPDU_Sock_post_read
[01:5368]\SMPDU_Sock_post_readv
[01:5368]/SMPDU_Sock_post_readv
[01:5368]/SMPDU_Sock_post_read
[01:5368]/SMPDU_Sock_post_close
[01:5368]sock_waiting for the next event.
[01:5368]\SMPDU_Sock_wait
[01:5368]/SMPDU_Sock_wait
[01:5368]SOCK_OP_CLOSE
[01:5368]\smpd_handle_op_close
[01:5368]\smpd_get_state_string
[01:5368]/smpd_get_state_string
[01:5368]op_close received - SMPD_CLOSING state.
[01:5368]process refcount == 1, stdout closed.
[01:5368]\smpd_free_context
[01:5368]freeing stdout context.
[01:5368]\smpd_init_context
[01:5368]\smpd_init_command
[01:5368]/smpd_init_command
[01:5368]/smpd_init_context
[01:5368]/smpd_free_context
[01:5368]/smpd_handle_op_close
[01:5368]sock_waiting for the next event.
[01:5368]\SMPDU_Sock_wait
[01:5368]/SMPDU_Sock_wait
[01:5368]SOCK_OP_READ event.error = -1, result = 0, context->type=6
[01:5368]\smpd_handle_op_read
[01:5368]\smpd_state_reading_stdouterr
[01:5368]/smpd_state_reading_stdouterr
[01:5368]/smpd_handle_op_read
[01:5368]SOCK_OP_READ failed - result = -1, closing stderr context.
[01:5368]\SMPDU_Sock_post_close
[01:5368]\SMPDU_Sock_post_read
[01:5368]\SMPDU_Sock_post_readv
[01:5368]/SMPDU_Sock_post_readv
[01:5368]/SMPDU_Sock_post_read
[01:5368]/SMPDU_Sock_post_close
[01:5368]sock_waiting for the next event.
[01:5368]\SMPDU_Sock_wait
[01:5368]closing output socket took 0.000 seconds
[01:5368]closing output socket took 0.000 seconds
[01:5368]/SMPDU_Sock_wait
[01:5368]SOCK_OP_CLOSE
[01:5368]\smpd_handle_op_close
[01:5368]\smpd_get_state_string
[01:5368]/smpd_get_state_string
[01:5368]op_close received - SMPD_CLOSING state.
[01:5368]process refcount == 0, waiting for the process to finish exiting.
[01:5368]\smpd_process_from_registry
[01:5368]/smpd_process_from_registry
[01:5368]\smpd_wait_process
[01:5368]/smpd_wait_process
[01:5368]\SMPDU_Sock_post_close
[01:5368]*** smpd_pinthread finishing pid:420 ***
[01:5368]\SMPDU_Sock_post_read
[01:5368]\SMPDU_Sock_post_readv
[01:5368]/SMPDU_Sock_post_readv
[01:5368]/SMPDU_Sock_post_read
[01:5368]/SMPDU_Sock_post_close
[01:5368]\smpd_create_command
[01:5368]\smpd_init_command
[01:5368]/smpd_init_command
[01:5368]/smpd_create_command
[01:5368]\smpd_add_command_int_arg
[01:5368]/smpd_add_command_int_arg
[01:5368]\smpd_add_command_int_arg
[01:5368]/smpd_add_command_int_arg
[01:5368]\smpd_add_command_arg
[01:5368]/smpd_add_command_arg
[01:5368]creating an exit command for rank 1, pid 420, exit code 0.
[01:5368]\smpd_post_write_command
[01:5368]\smpd_package_command
[01:5368]/smpd_package_command
[01:5368]\SMPDU_Sock_get_sock_id
[01:5368]/SMPDU_Sock_get_sock_id
[01:5368]smpd_post_write_command on the parent context sock 708: 98 bytes for command: "cmd=exit src=1 dest=0 tag=34 rank=1 code=0 kvs=2B33681B-F9DA-4d18-8E7A-29D855F2EC08 "
[01:5368]\SMPDU_Sock_post_writev
[01:5368]/SMPDU_Sock_post_writev
[01:5368]/smpd_post_write_command
[01:5368]\smpd_free_process_struct
[01:5368]/smpd_free_process_struct
[01:5368]\smpd_free_context
[01:5368]freeing stderr context.
[01:5368]\smpd_init_context
[01:5368]\smpd_init_command
[01:5368]/smpd_init_command
[01:5368]/smpd_init_context
[01:5368]/smpd_free_context
[01:5368]/smpd_handle_op_close
[01:5368]sock_waiting for the next event.
[01:5368]\SMPDU_Sock_wait
[01:5368]/SMPDU_Sock_wait
[01:5368]SOCK_OP_CLOSE
[01:5368]\smpd_handle_op_close
[01:5368]\smpd_get_state_string
[01:5368]/smpd_get_state_string
[01:5368]op_close received - SMPD_CLOSING state.
[01:5368]Unaffiliated stdin context closing.
[01:5368]\smpd_free_context
[01:5368]freeing stdin context.
[01:5368]\smpd_init_context
[01:5368]\smpd_init_command
[01:5368]/smpd_init_command
[01:5368]/smpd_init_context
[01:5368]/smpd_free_context
[01:5368]/smpd_handle_op_close
[01:5368]sock_waiting for the next event.
[01:5368]\SMPDU_Sock_wait
[01:5368]/SMPDU_Sock_wait
[01:5368]SOCK_OP_WRITE
[01:5368]\smpd_handle_op_write
[01:5368]\smpd_state_writing_cmd
[01:5368]wrote command
[01:5368]command written to parent: "cmd=exit src=1 dest=0 tag=34 rank=1 code=0 kvs=2B33681B-F9DA-4d18-8E7A-29D855F2EC08 "
[01:5368]\smpd_free_command
[01:5368]\smpd_init_command
[01:5368]/smpd_init_command
[01:5368]/smpd_free_command
[01:5368]/smpd_state_writing_cmd
[01:5368]/smpd_handle_op_write
[01:5368]sock_waiting for the next event.
[01:5368]\SMPDU_Sock_wait
[01:5368]/SMPDU_Sock_wait
[01:5368]SOCK_OP_READ event.error = 0, result = 0, context->type=8
[01:5368]\smpd_handle_op_read
[01:5368]\smpd_state_reading_cmd_header
[01:5368]read command header
[01:5368]command header read, posting read for data: 31 bytes
[01:5368]\SMPDU_Sock_post_read
[01:5368]\SMPDU_Sock_post_readv
[01:5368]/SMPDU_Sock_post_readv
[01:5368]/SMPDU_Sock_post_read
[01:5368]/smpd_state_reading_cmd_header
[01:5368]/smpd_handle_op_read
[01:5368]sock_waiting for the next event.
[01:5368]\SMPDU_Sock_wait
[01:5368]/SMPDU_Sock_wait
[01:5368]SOCK_OP_READ event.error = 0, result = 0, context->type=8
[01:5368]\smpd_handle_op_read
[01:5368]\smpd_state_reading_cmd
[01:5368]read command
[01:5368]\smpd_parse_command
[01:5368]/smpd_parse_command
[01:5368]read command: "cmd=close src=0 dest=1 tag=10 "
[01:5368]\smpd_handle_command
[01:5368]handling command:
[01:5368] src  = 0
[01:5368] dest = 1
[01:5368] cmd  = close
[01:5368] tag  = 10
[01:5368] ctx  = parent
[01:5368] len  = 31
[01:5368] str  = cmd=close src=0 dest=1 tag=10
[01:5368]\smpd_command_destination
[01:5368]1 -> 1 : returning NULL context
[01:5368]/smpd_command_destination
[01:5368]\smpd_handle_close_command
[01:5368]\smpd_create_command
[01:5368].\smpd_init_command
[01:5368]./smpd_init_command
[01:5368]/smpd_create_command
[01:5368]sending closed command to parent: "cmd=closed src=1 dest=0 tag=35 "
[01:5368]\smpd_post_write_command
[01:5368].\smpd_package_command
[01:5368]./smpd_package_command
[01:5368].\SMPDU_Sock_get_sock_id
[01:5368]./SMPDU_Sock_get_sock_id
[01:5368].smpd_post_write_command on the parent context sock 708: 45 bytes for command: "cmd=closed src=1 dest=0 tag=35 "
[01:5368].\SMPDU_Sock_post_writev
[01:5368]./SMPDU_Sock_post_writev
[01:5368]/smpd_post_write_command
[01:5368]posted closed command.
[01:5368]/smpd_handle_close_command
[01:5368]/smpd_handle_command
[01:5368]not posting read for another command because SMPD_CLOSE returned
[01:5368]/smpd_state_reading_cmd
[01:5368]/smpd_handle_op_read
[01:5368]sock_waiting for the next event.
[01:5368]\SMPDU_Sock_wait
[01:5368]/SMPDU_Sock_wait
[01:5368]SOCK_OP_WRITE
[01:5368]\smpd_handle_op_write
[01:5368]\smpd_state_writing_cmd
[01:5368]wrote command
[01:5368]command written to parent: "cmd=closed src=1 dest=0 tag=35 "
[01:5368]closed command written, posting close of the sock.
[01:5368]\SMPDU_Sock_get_sock_id
[01:5368]/SMPDU_Sock_get_sock_id
[01:5368]SMPDU_Sock_post_close(708)
[01:5368]\SMPDU_Sock_post_close
[01:5368]\SMPDU_Sock_post_read
[01:5368]\SMPDU_Sock_post_readv
[01:5368]/SMPDU_Sock_post_readv
[01:5368]/SMPDU_Sock_post_read
[01:5368]/SMPDU_Sock_post_close
[01:5368]\smpd_free_command
[01:5368]\smpd_init_command
[01:5368]/smpd_init_command
[01:5368]/smpd_free_command
[01:5368]/smpd_state_writing_cmd
[01:5368]/smpd_handle_op_write
[01:5368]sock_waiting for the next event.
[01:5368]\SMPDU_Sock_wait
[01:5368]/SMPDU_Sock_wait
[01:5368]SOCK_OP_CLOSE
[01:5368]\smpd_handle_op_close
[01:5368]\smpd_get_state_string
[01:5368]/smpd_get_state_string
[01:5368]op_close received - SMPD_CLOSING state.
[01:5368]Unaffiliated parent context closing.
[01:5368]\smpd_free_context
[01:5368]freeing parent context.
[01:5368]\smpd_init_context
[01:5368]\smpd_init_command
[01:5368]/smpd_init_command
[01:5368]/smpd_init_context
[01:5368]/smpd_free_context
[01:5368]all contexts closed, exiting state machine.
[01:5368]/smpd_handle_op_close
[01:5368]/smpd_enter_at_state
[01:5368]\smpd_exit
[01:5368]\smpd_kill_all_processes
[01:5368]/smpd_kill_all_processes
[01:5368]\smpd_finalize_drive_maps
[01:5368]/smpd_finalize_drive_maps
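When scanning a long smpd -d trace like the one above, it can help to pull the command strings (for example "cmd=exit src=1 dest=0 tag=33 rank=0 code=0 kvs=...") apart into fields. The following is a minimal sketch, not part of smpd or MPICH2. The function name parse_smpd_command is hypothetical, and the "space-separated key=value" layout is an assumption inferred only from the lines shown in this trace.

```cpp
#include <map>
#include <sstream>
#include <string>

// Split a command string copied from the trace into key/value pairs.
// Assumption: fields are separated by whitespace and each field has the
// form key=value, as in the lines quoted above. Values that themselves
// contain spaces are not handled by this sketch.
std::map<std::string, std::string> parse_smpd_command(const std::string& text) {
    std::map<std::string, std::string> fields;
    std::istringstream in(text);
    std::string token;
    while (in >> token) {
        const std::string::size_type eq = token.find('=');
        if (eq == std::string::npos || eq == 0) {
            continue;  // skip anything that is not key=value
        }
        fields[token.substr(0, eq)] = token.substr(eq + 1);
    }
    return fields;
}
```

Feeding it the two exit commands from the trace would make it easy to compare the rank and code fields side by side.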
Why is -localonly giving us problems?
Thanks,  Dave.
-----Original Message-----
From: mpich-discuss-bounces at mcs.anl.gov
[mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Jayesh Krishna
Sent: Tuesday, December 15, 2009 2:12 PM
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] Unable to run simple mpi problem
Hi,
 Which version of MPICH2 are you using (Use the latest stable version,
1.2.1, available at
http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads)?
 Do you get the error when running the MPI program on the local machine with 3
procs? Do you get the error if you remove the "-localonly" option?
Regards,
Jayesh
----- Original Message -----
From: "dave waite" <waitedm at gmail.com>
To: mpich-discuss at mcs.anl.gov
Sent: Tuesday, December 15, 2009 2:40:02 PM GMT -06:00 US/Canada Central
Subject: [mpich-discuss] Unable to run simple mpi problem
We are running MPICH2 applications on many Windows platforms. In a few
installations, we have a problem where the job dies while initializing MPI.
To examine this further, we ran a simple hellompi program:
// mpi2.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"

int master;
int n_workers;
MPI_Comm world, workers;
MPI_Group world_group, worker_group;
#define BSIZE MPI_MAX_PROCESSOR_NAME
char chrNames[MPI_MAX_PROCESSOR_NAME*64];

int _tmain(int argc, char * argv[])
{
    int nprocs=1;
    world = MPI_COMM_WORLD;
    int iVal=0;
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];
    MPI_Status reqstat;
    char * p;
    int iNodeCnt=1;
    SYSTEM_INFO info;
    GetSystemInfo(&info);
    int i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    if (rank==0)
    {
        // server commands
        chrNames[0]=0;
        strcat(chrNames, "||");
        strcat(chrNames,name);
        strcat(chrNames, "||");
        for (i=1;i<size;i++)
        {
            MPI_Recv(name,BSIZE,MPI_CHAR,i,999,MPI_COMM_WORLD,&reqstat);
            p=strstr(chrNames,name);
            if (p==NULL)
            {
                strcat(chrNames,name);
                strcat(chrNames, "||");
                iNodeCnt++;
            }
            //printf("Hello MPI!\n");
            printf("Hello from Rank %d of %d on %s\n",i,size,name);
        }
        printf("\nNodes:%d\n",iNodeCnt);
        printf("Names:%s\n",chrNames);
    }
    else
    {
        // client commands
        MPI_Send(name,BSIZE,MPI_CHAR,0,999,MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
We noted the same failure. Here is our output:
C:\MPI>mpiexec2 -localonly -n 3 hellompi 
unable to read the cmd header on the pmi context, Error = -1 
. 
[01:4792]......ERROR:result command received but the wait_list is empty. 
[01:4792]....ERROR:unable to handle the command: "cmd=result src=1 dest=1 tag=7 cmd_tag=2 cmd_orig=dbput ctx_key=1 result=DBS_SUCCESS "
[01:4792]...ERROR:sock_op_close returned while unknown context is in state: SMPD_IDLE
mpiexec aborting job...
SuspendThread failed with error 5 for process 0:3AB7E6A8-6169-4544-8282-D4D35207F564:'hellompi'
unable to suspend process.
received suspend command for a pmi context that doesn't exist: unmatched id = 1
unable to read the cmd header on the pmi context, Error = -1
.
Error posting readv, An existing connection was forcibly closed by the remote host.(10054)
received kill command for a pmi context that doesn't exist: unmatched id = 1
unable to read the cmd header on the pmi context, Error = -1
.
Error posting readv, An existing connection was forcibly closed by the remote host.(10054)
job aborted:
rank: node: exit code[: error message]
0: usbospc126.americas.munters.com: 123: process 0 exited without calling finalize
1: usbospc126.americas.munters.com: 123: process 1 exited without calling finalize
2: usbospc126.americas.munters.com: 123
2: usbospc126.americas.munters.com: 123 
Fatal error in MPI_Finalize: Invalid communicator, error stack: 
MPI_Finalize(307): MPI_Finalize failed 
MPI_Finalize(198): 
MPID_Finalize(92): 
PMPI_Barrier(476): MPI_Barrier(comm=0x44000002) failed 
PMPI_Barrier(396): Invalid communicator 
[0] unable to post a write of the abort command. 
This was run on a dual-core machine, running Windows XP, SP2. What do these
error messages tell us? 
What is the best way to proceed in debugging this kind of issue? 
Thanks, 
Dave Waite 
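
A side note on the hellompi listing above, separate from the -localonly failure: rank 0 decides whether a host is new with strstr(chrNames, name), which is a substring match. A host whose name is contained in an already recorded name (say "pc1" after "pc12") would not be counted. Below is a minimal sketch of the same counting step using exact matches. It leaves out the MPI calls so it can be tried on its own; summarize_nodes, NodeSummary, and the host names in the usage note are hypothetical.

```cpp
#include <set>
#include <string>
#include <vector>

// Result of the counting step: number of distinct hosts and the
// "||host1||host2||" string that the original listing prints.
struct NodeSummary {
    int count;
    std::string names;
};

// Count distinct host names by exact comparison. The input stands in for
// the names rank 0 would collect (its own name first, then one per
// MPI_Recv in the original listing).
NodeSummary summarize_nodes(const std::vector<std::string>& hosts) {
    std::set<std::string> seen;
    NodeSummary summary{0, "||"};
    for (const std::string& host : hosts) {
        if (seen.insert(host).second) {  // true only the first time a name is seen
            summary.names += host + "||";
            ++summary.count;
        }
    }
    return summary;
}
```

For a single-host run such as the one reported here, every rank reports the same name, so the count stays at 1, which matches the "Nodes:1" line in the working output. Using std::string also removes the fixed-size chrNames buffer.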
_______________________________________________
mpich-discuss mailing list
mpich-discuss at mcs.anl.gov
https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss