[mpich-discuss] unable to connect ?

Jayesh Krishna jayesh at mcs.anl.gov
Fri Feb 27 09:08:36 CST 2009


 Hi,
  From your debug logs the problem does not appear to be a network
connectivity issue. It looks more like a configuration issue:

============== snip ========================
...\smpd_state_reading_connect_result
....read connect result: 'FAIL'
....connection rejected, server returned - FAIL
============== snip ========================

  Your process manager (PM) connection can get rejected for the following
reasons (see the version check below),

# There is a mismatch in the version of MPICH2 installed on the
machines.
# There is a mismatch in the passphrase used on the machines (you
enter this "passphrase" during MPICH2 installation).
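
  A quick way to rule out the version mismatch is to compare what each
smpd reports. A minimal check, assuming your build's smpd supports the
-version flag (the output shown is illustrative):

D:\Program Files\MPICH2\bin>smpd -version
1.0.8

  (Note that in your smpd -d log below, the challenge string '1.0.8 7993'
already embeds the smpd version and smpd_verify_version succeeds on the
mpiexec side, so a passphrase mismatch looks like the more likely culprit.)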

  I would recommend the following,

# Uninstall MPICH2 on both the machines.
# Download the latest stable version (1.0.8) of MPICH2 from the downloads
page
(http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=dow
nloads).
# Install MPICH2 on the machines using the installer downloaded from the
downloads page.

------- Make sure that you keep the default passphrase settings during
the installation
------- Also make sure that all users have access to MPICH2 (change the
default option from "Just me" to "Everyone" during installation)

# If your machine is not part of a domain, don't specify any domain name
when registering the username/password with mpiexec. Also validate, as
before, after registering the user (see the example session below).
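
  For a workgroup (non-domain) setup the register/validate session looks
roughly like this (a sketch: the prompts and messages are paraphrased,
and mpiuser stands in for your account):

C:\Program Files\MPICH2\bin>mpiexec -register
account (domain\user): mpiuser
password: ********
confirm password: ********

C:\Program Files\MPICH2\bin>mpiexec -validate 10.0.0.13
SUCCESS

  If -validate prints FAIL after a fresh install with matching
passphrases, the credentials themselves (not the smpd configuration)
are the problem.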

 Let us know the results.

(PS: There is no specific configuration required, apart from the info
above, to get MPICH2 working across multiple Windows machines)
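
  Once reinstalled, a minimal smoke test uses only commands already shown
in this thread (the hostnames in the output are placeholders):

C:\Program Files\MPICH2\bin>smpd -status
smpd running on 10.0.0.10

C:\Program Files\MPICH2\bin>mpiexec -hosts 2 10.0.0.10 10.0.0.13 hostname
<hostname of 10.0.0.10>
<hostname of 10.0.0.13>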

Regards,
Jayesh

-----Original Message-----
From: kiss attila [mailto:kissattila2008 at gmail.com]
Sent: Thursday, February 26, 2009 11:45 PM
To: Jayesh Krishna
Cc: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] unable to connect ?

Hi

  I've now tried everything: I created the same user, I validated this
user (mpiuser), but still nothing... Can anyone send me some config
files from a  w o r k i n g  MPICH2 setup based on a Windows workgroup
(not a domain)? Until then, here is the output from my smpd -d and
mpiexec commands when I try to run hostname from 10.0.0.10 on the remote
computer (10.0.0.13):

D:\Program Files\MPICH2\bin>smpd -d

[00:2436]..\smpd_set_smpd_data
[00:2436]../smpd_set_smpd_data
[00:2436]..created a set for the listener: 1724
[00:2436]..smpd listening on port 8676
[00:2436]..\smpd_create_context
[00:2436]...\smpd_init_context
[00:2436]....\smpd_init_command
[00:2436]..../smpd_init_command
[00:2436].../smpd_init_context
[00:2436]../smpd_create_context
[00:2436]..\smpd_option_on
[00:2436]...\smpd_get_smpd_data
[00:2436]....\smpd_get_smpd_data_from_environment
[00:2436]..../smpd_get_smpd_data_from_environment
[00:2436]....\smpd_get_smpd_data_default
[00:2436]..../smpd_get_smpd_data_default
[00:2436]....Unable to get the data for the key 'no_dynamic_hosts'
[00:2436].../smpd_get_smpd_data
[00:2436]../smpd_option_on
[00:2436]..\smpd_insert_into_dynamic_hosts
[00:2436]../smpd_insert_into_dynamic_hosts
[00:2436]..\smpd_enter_at_state
[00:2436]...sock_waiting for the next event.
[00:2436]...SOCK_OP_ACCEPT
[00:2436]...\smpd_handle_op_accept
[00:2436]....\smpd_state_smpd_listening
[00:2436].....authenticating new connection
[00:2436].....\smpd_create_context
[00:2436]......\smpd_init_context
[00:2436].......\smpd_init_command
[00:2436]......./smpd_init_command
[00:2436]....../smpd_init_context
[00:2436]...../smpd_create_context
[00:2436].....\smpd_gen_authentication_strings
[00:2436]......\smpd_hash
[00:2436]....../smpd_hash
[00:2436]...../smpd_gen_authentication_strings
[00:2436].....posting a write of the challenge string: 1.0.8 7993
[00:2436]..../smpd_state_smpd_listening
[00:2436].../smpd_handle_op_accept
[00:2436]...sock_waiting for the next event.
[00:2436]...SOCK_OP_WRITE
[00:2436]...\smpd_handle_op_write
[00:2436]....\smpd_state_writing_challenge_string
[00:2436].....wrote challenge string: '1.0.8 7993'
[00:2436]..../smpd_state_writing_challenge_string
[00:2436].../smpd_handle_op_write
[00:2436]...sock_waiting for the next event.
[00:2436]...SOCK_OP_READ
[00:2436]...\smpd_handle_op_read
[00:2436]....\smpd_state_reading_challenge_response
[00:2436].....read challenge response: 'd6fdd96549e0c22c875ac55a2735a162'
[00:2436]..../smpd_state_reading_challenge_response
[00:2436].../smpd_handle_op_read
[00:2436]...sock_waiting for the next event.
[00:2436]...SOCK_OP_WRITE
[00:2436]...\smpd_handle_op_write
[00:2436]....\smpd_state_writing_connect_result
[00:2436].....wrote connect result: 'FAIL'
[00:2436].....connection reject string written, closing sock.
[00:2436]..../smpd_state_writing_connect_result
[00:2436].../smpd_handle_op_write
[00:2436]...sock_waiting for the next event.
[00:2436]...SOCK_OP_CLOSE
[00:2436]...\smpd_handle_op_close
[00:2436]....\smpd_get_state_string
[00:2436]..../smpd_get_state_string
[00:2436]....op_close received - SMPD_CLOSING state.
[00:2436]....Unaffiliated undetermined context closing.
[00:2436]....\smpd_free_context
[00:2436].....freeing undetermined context.
[00:2436].....\smpd_init_context
[00:2436]......\smpd_init_command
[00:2436]....../smpd_init_command
[00:2436]...../smpd_init_context
[00:2436]..../smpd_free_context
[00:2436].../smpd_handle_op_close
[00:2436]...sock_waiting for the next event.


C:\Program Files\MPICH2\bin>mpiexec -verbose -hosts 1 10.0.0.13 -user
mpiuser hostname

..\smpd_add_host_to_default_list
...\smpd_add_extended_host_to_default_list
.../smpd_add_extended_host_to_default_list
../smpd_add_host_to_default_list
..\smpd_hide_string_arg
...\first_token
.../first_token
...\compare_token
.../compare_token
...\next_token
....\first_token
..../first_token
....\first_token
..../first_token
.../next_token
../smpd_hide_string_arg
../smpd_hide_string_arg
..\smpd_hide_string_arg
...\first_token
.../first_token
...\compare_token
.../compare_token
...\next_token
....\first_token
..../first_token
....\first_token
..../first_token
.../next_token
../smpd_hide_string_arg
../smpd_hide_string_arg
..\smpd_get_full_path_name
...fixing up exe name: 'hostname' -> '(null)'
../smpd_get_full_path_name
..handling executable:
hostname.exe
..\smpd_get_next_host
...\smpd_get_host_id
.../smpd_get_host_id
../smpd_get_next_host
..\smpd_create_cliques
...\next_launch_node
.../next_launch_node
...\next_launch_node
.../next_launch_node
../smpd_create_cliques
..\smpd_fix_up_host_tree
../smpd_fix_up_host_tree
./mp_parse_command_args
.host tree:
. host: 10.0.0.13, parent: 0, id: 1
.launch nodes:
. iproc: 0, id: 1, exe: hostname.exe
.\smpd_get_smpd_data
..\smpd_get_smpd_data_from_environment
../smpd_get_smpd_data_from_environment
./smpd_get_smpd_data
.\smpd_create_context
..\smpd_init_context
...\smpd_init_command
.../smpd_init_command
../smpd_init_context
./smpd_create_context
.\smpd_make_socket_loop
..\smpd_get_hostname
../smpd_get_hostname
./smpd_make_socket_loop
.\smpd_create_context
..\smpd_init_context
...\smpd_init_command
.../smpd_init_command
../smpd_init_context
./smpd_create_context
.\smpd_enter_at_state
..sock_waiting for the next event.
..SOCK_OP_CONNECT
..\smpd_handle_op_connect
...connect succeeded, posting read of the challenge string
../smpd_handle_op_connect
..sock_waiting for the next event.
..SOCK_OP_READ
..\smpd_handle_op_read
...\smpd_state_reading_challenge_string
....read challenge string: '1.0.8 7993'
....\smpd_verify_version
..../smpd_verify_version
....\smpd_hash
..../smpd_hash
.../smpd_state_reading_challenge_string
../smpd_handle_op_read
..sock_waiting for the next event.
..SOCK_OP_WRITE
..\smpd_handle_op_write
...\smpd_state_writing_challenge_response
....wrote challenge response: 'd6fdd96549e0c22c875ac55a2735a162'
.../smpd_state_writing_challenge_response
../smpd_handle_op_write
..sock_waiting for the next event.
..SOCK_OP_READ
..\smpd_handle_op_read
...\smpd_state_reading_connect_result
....read connect result: 'FAIL'
....connection rejected, server returned - FAIL
....\smpd_post_abort_command
.....\smpd_create_command
......\smpd_init_command
....../smpd_init_command
...../smpd_create_command
.....\smpd_add_command_arg
...../smpd_add_command_arg
.....\smpd_command_destination
......0 -> 0 : returning NULL context
...../smpd_command_destination
Aborting: unable to connect to 10.0.0.13
..../smpd_post_abort_command
....\smpd_exit
.....\smpd_kill_all_processes
...../smpd_kill_all_processes
.....\smpd_finalize_drive_maps
...../smpd_finalize_drive_maps
.....\smpd_dbs_finalize
...../smpd_dbs_finalize


Thanks for any ideas.
regards
K.A. Albert

2009/2/26 Jayesh Krishna <jayesh at mcs.anl.gov>:
> Hi,
>
>>>.. I launch mpiexec.exe from an another windows user acount...
>
>  This could be your problem. You can try registering a
> username/password available on both the machines using the "-user"
> option (mpiexec -register -user 1) & launch your job using that user
> (mpiexec -n 2 -user 1 -hosts 2 10.0.0.10 10.0.0.13 hostname). You can
> also validate if the user credentials are capable of launching a job
> using the "-validate" option of mpiexec (mpiexec -validate -user 1
> 10.0.0.10 ; mpiexec -validate -user 1 10.0.0.13)
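>
>  Spelled out as a session (the same commands as above, shown in the
> order you would run them):
>
>  mpiexec -register -user 1
>  mpiexec -validate -user 1 10.0.0.10
>  mpiexec -validate -user 1 10.0.0.13
>  mpiexec -n 2 -user 1 -hosts 2 10.0.0.10 10.0.0.13 hostname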
>
> (PS: Did you copy-paste the complete output of the mpiexec command &
> the command itself ? Please don't remove any part of the output. This
> will help us in debugging your problem.)
>
> Regards,
> Jayesh
>
> -----Original Message-----
> From: kiss attila [mailto:kissattila2008 at gmail.com]
> Sent: Thursday, February 26, 2009 12:26 AM
> To: Jayesh Krishna
> Subject: Re: [mpich-discuss] unable to connect ?
>
> 1. Yes, the ping works fine. With wmpiconfig.exe I can see both
> machines.
> 2. MPICH2 1.0.8 installed on both.
> 3. No firewalls of any kind.
> 4. On  smpd -status i get:
> smpd running on 10.0.0.10
> smpd running on 10.0.0.13
>
> 5. from 10.0.0.10
> C:\Program Files\MPICH2\bin>mpiexec -hosts 2 10.0.0.10 10.0.0.13
> hostname
> abort: unable to connect to 10.0.0.13
>
> from 10.0.0.13
> C:\Program Files\MPICH2\bin>mpiexec -hosts 2 10.0.0.10 10.0.0.13
> hostname
> abort: unable to connect to 10.0.0.10
>
> and here is the -verbose mode:
>
> ...../first_token
> .....\compare_token
> ...../compare_token
> .....\next_token
> ......\first_token
> ....../first_token
> ......\first_token
> ....../first_token
> ...../next_token
> ..../smpd_hide_string_arg
> ..../smpd_hide_string_arg
> .....\smpd_option_on
> ......\smpd_get_smpd_data
> .......\smpd_get_smpd_data_from_environment
> ......./smpd_get_smpd_data_from_environment
> .......\smpd_get_smpd_data_default
> ......./smpd_get_smpd_data_default
> .......Unable to get the data for the key 'nocache'
> ....../smpd_get_smpd_data
> ...../smpd_option_on
> ....\smpd_hide_string_arg
> .....\first_token
> ...../first_token
> .....\compare_token
> ...../compare_token
> .....\next_token
> ......\first_token
> ....../first_token
> ......\first_token
> ....../first_token
> ...../next_token
> ..../smpd_hide_string_arg
> ..../smpd_hide_string_arg
> .../smpd_handle_op_read
> ...sock_waiting for the next event.
> ...SOCK_OP_WRITE
> ...\smpd_handle_op_write
> ....\smpd_state_writing_cred_ack_yes
> .....wrote cred request yes ack.
> ..../smpd_state_writing_cred_ack_yes
> .../smpd_handle_op_write
> ...sock_waiting for the next event.
> ...SOCK_OP_WRITE
> ...\smpd_handle_op_write
> ....\smpd_state_writing_account
> .....wrote account: 'mpiuser'
> .....\smpd_encrypt_data
> ...../smpd_encrypt_data
> ..../smpd_state_writing_account
> .../smpd_handle_op_write
> ...sock_waiting for the next event.
> ...SOCK_OP_WRITE
> ...\smpd_handle_op_write
> ....\smpd_hide_string_arg
> .....\first_token
> ...../first_token
> .....\compare_token
> ...../compare_token
> .....\next_token
> ......\first_token
> ....../first_token
> ......\first_token
> ....../first_token
> ...../next_token
> ..../smpd_hide_string_arg
> ..../smpd_hide_string_arg
> .....\smpd_hide_string_arg
> ......\first_token
> ....../first_token
> ......\compare_token
> ....../compare_token
> ......\next_token
> .......\first_token
> ......./first_token
> .......\first_token
> ......./first_token
> ....../next_token
> ...../smpd_hide_string_arg
> ...../smpd_hide_string_arg
> ....\smpd_hide_string_arg
> .....\first_token
> ...../first_token
> .....\compare_token
> ...../compare_token
> .....\next_token
> ......\first_token
> ....../first_token
> ......\first_token
> ....../first_token
> ...../next_token
> ..../smpd_hide_string_arg
> ..../smpd_hide_string_arg
> .../smpd_handle_op_write
> ...sock_waiting for the next event.
> ...SOCK_OP_READ
> ...\smpd_handle_op_read
> ....\smpd_state_reading_process_result
> .....read process session result: 'SUCCESS'
> ..../smpd_state_reading_process_result
> .../smpd_handle_op_read
> ...sock_waiting for the next event.
> ...SOCK_OP_READ
> ...\smpd_handle_op_read
> ....\smpd_state_reading_reconnect_request
> .....read re-connect request: '3972'
> .....closing the old socket in the left context.
> .....MPIDU_Sock_post_close(1720)
> .....connecting a new socket.
> .....\smpd_create_context
> ......\smpd_init_context
> .......\smpd_init_command
> ......./smpd_init_command
> ....../smpd_init_context
> ...../smpd_create_context
> .....posting a re-connect to 10.0.0.10:3972 in left context.
> ..../smpd_state_reading_reconnect_request
> .../smpd_handle_op_read
> ...sock_waiting for the next event.
> ...SOCK_OP_CLOSE
> ...\smpd_handle_op_close
> ....\smpd_get_state_string
> ..../smpd_get_state_string
> ....op_close received - SMPD_CLOSING state.
> ....Unaffiliated left context closing.
> ....\smpd_free_context
> .....freeing left context.
> .....\smpd_init_context
> ......\smpd_init_command
> ....../smpd_init_command
> ...../smpd_init_context
> ..../smpd_free_context
> .../smpd_handle_op_close
> ...sock_waiting for the next event.
> ...SOCK_OP_CONNECT
> ...\smpd_handle_op_connect
> ....\smpd_generate_session_header
> .....session header: (id=1 parent=0 level=0)
> ..../smpd_generate_session_header
> .../smpd_handle_op_connect
> ...sock_waiting for the next event.
> ...SOCK_OP_WRITE
> ...\smpd_handle_op_write
> ....\smpd_state_writing_session_header
> .....wrote session header: 'id=1 parent=0 level=0'
> .....\smpd_post_read_command
> ......posting a read for a command header on the left context, sock 1656
> ...../smpd_post_read_command
> .....creating connect command for left node
> .....creating connect command to '10.0.0.13'
> .....\smpd_create_command
> ......\smpd_init_command
> ....../smpd_init_command
> ...../smpd_create_command
> .....\smpd_add_command_arg
> ...../smpd_add_command_arg
> .....\smpd_add_command_int_arg
> ...../smpd_add_command_int_arg
> .....\smpd_post_write_command
> ......\smpd_package_command
> ....../smpd_package_command
> ......smpd_post_write_command on the left context sock 1656: 65 bytes
> for command: "cmd=connect src=0 dest=1 tag=0 host=10.0.0.13 id=2 "
> ...../smpd_post_write_command
> .....not connected yet: 10.0.0.13 not connected
> ..../smpd_state_writing_session_header
> .../smpd_handle_op_write
> ...sock_waiting for the next event.
> ...SOCK_OP_WRITE
> ...\smpd_handle_op_write
> ....\smpd_state_writing_cmd
> .....wrote command
> .....command written to left: "cmd=connect src=0 dest=1 tag=0
> host=10.0.0.13 id=2 "
> .....moving 'connect' command to the wait_list.
> ..../smpd_state_writing_cmd
> .../smpd_handle_op_write
> ...sock_waiting for the next event.
> ...SOCK_OP_READ
> ...\smpd_handle_op_read
> ....\smpd_state_reading_cmd_header
> .....read command header
> .....command header read, posting read for data: 69 bytes
> ..../smpd_state_reading_cmd_header
> .../smpd_handle_op_read
> ...sock_waiting for the next event.
> ...SOCK_OP_READ
> ...\smpd_handle_op_read
> ....\smpd_state_reading_cmd
> .....read command
> .....\smpd_parse_command
> ...../smpd_parse_command
> .....read command: "cmd=abort src=1 dest=0 tag=0 error="unable to
> connect to 10.0.0.13" "
> .....\smpd_handle_command
> ......handling command:
> ...... src  = 1
> ...... dest = 0
> ...... cmd  = abort
> ...... tag  = 0
> ...... ctx  = left
> ...... len  = 69
> ...... str  = cmd=abort src=1 dest=0 tag=0 error="unable to connect to
> 10.0.0.13"
> ......\smpd_command_destination
> .......0 -> 0 : returning NULL context
> ....../smpd_command_destination
> ......\smpd_handle_abort_command
> .......abort: unable to connect to 10.0.0.13
> ....../smpd_handle_abort_command
> ...../smpd_handle_command
> .....\smpd_post_read_command
> ......posting a read for a command header on the left context, sock 1656
> ...../smpd_post_read_command
> .....\smpd_create_command
> ......\smpd_init_command
> ....../smpd_init_command
> ...../smpd_create_command
> .....\smpd_post_write_command
> ......\smpd_package_command
> ....../smpd_package_command
> ......smpd_post_write_command on the left context sock 1656: 43 bytes
> for command: "cmd=close src=0 dest=1 tag=1 "
> ...../smpd_post_write_command
> ..../smpd_state_reading_cmd
> .../smpd_handle_op_read
> ...sock_waiting for the next event.
> ...SOCK_OP_READ
> ...\smpd_handle_op_read
> ....\smpd_state_reading_cmd_header
> .....read command header
> .....command header read, posting read for data: 31 bytes
> ..../smpd_state_reading_cmd_header
> .../smpd_handle_op_read
> ...sock_waiting for the next event.
> ...SOCK_OP_WRITE
> ...\smpd_handle_op_write
> ....\smpd_state_writing_cmd
> .....wrote command
> .....command written to left: "cmd=close src=0 dest=1 tag=1 "
> .....\smpd_free_command
> ......\smpd_init_command
> ....../smpd_init_command
> ...../smpd_free_command
> ..../smpd_state_writing_cmd
> .../smpd_handle_op_write
> ...sock_waiting for the next event.
> ...SOCK_OP_READ
> ...\smpd_handle_op_read
> ....\smpd_state_reading_cmd
> .....read command
> .....\smpd_parse_command
> ...../smpd_parse_command
> .....read command: "cmd=closed src=1 dest=0 tag=1 "
> .....\smpd_handle_command
> ......handling command:
> ...... src  = 1
> ...... dest = 0
> ...... cmd  = closed
> ...... tag  = 1
> ...... ctx  = left
> ...... len  = 31
> ...... str  = cmd=closed src=1 dest=0 tag=1
> ......\smpd_command_destination
> .......0 -> 0 : returning NULL context
> ....../smpd_command_destination
> ......\smpd_handle_closed_command
> .......closed command received from left child, closing sock.
> .......MPIDU_Sock_post_close(1656)
> .......received a closed at node with no parent context, assuming
> root, returning SMPD_EXITING.
> ....../smpd_handle_closed_command
> ...../smpd_handle_command
> .....not posting read for another command because SMPD_EXITING returned
> ..../smpd_state_reading_cmd
> .../smpd_handle_op_read
> ...sock_waiting for the next event.
> ...SOCK_OP_CLOSE
> ...\smpd_handle_op_close
> ....\smpd_get_state_string
> ..../smpd_get_state_string
> ....op_close received - SMPD_EXITING state.
> ....\smpd_free_context
> .....freeing left context.
> .....\smpd_init_context
> ......\smpd_init_command
> ....../smpd_init_command
> ...../smpd_init_context
> ..../smpd_free_context
> .../smpd_handle_op_close
> ../smpd_enter_at_state
> ./main
> .\smpd_exit
> ..\smpd_kill_all_processes
> ../smpd_kill_all_processes
> ..\smpd_finalize_drive_maps
> ../smpd_finalize_drive_maps
> ..\smpd_dbs_finalize
> ../smpd_dbs_finalize
>
> I have registered the same user with the same password on both
> computers using wmpiregister.exe, but I launch mpiexec.exe from a
> different Windows user account; could this be a problem? Thanks
>
> regards
> k.a.albert
>
>
>
>
> 2009/2/25 Jayesh Krishna <jayesh at mcs.anl.gov>:
>>  Hi,
>>
>> # Can you ping the machines from each other ?
>> # Make sure that you have the same version of MPICH2 installed on
>> both the machines.
>> # Do you have any firewalls (windows, third-party) running on the
>> machines (Turn off any firewalls running on the machines)?
>> # Make sure that you have the MPICH2 process manager, smpd.exe,
>> running as a service on both the machines (To check the status of the
>> process manager type, smpd -status, at the command prompt).
>> # Before trying to execute an MPI program like cpi.exe, try executing
>> a non-MPI program like hostname on the machines (mpiexec -hosts 2
>> 10.0.0.10
>> 10.0.0.13 hostname).
>>
>>  Let us know the results.
>>
>> (PS: In your reply please copy-paste the commands and the output)
>> Regards, Jayesh
>>
>>
>>
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of kiss attila
>> Sent: Wednesday, February 25, 2009 1:46 PM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: [mpich-discuss] unable to connect ?
>>
>> Hi
>>
>>   I have two WinXP machines (10.0.0.13, 10.0.0.10) with MPICH2
>> installed, and on this command:
>> "D:\Program Files\MPICH2\bin\mpiexec.exe" -hosts 2 10.0.0.10
>> 10.0.0.13 -noprompt c:\ex\cpi.exe
>>
>> I get:
>>
>> Aborting: unable to connect to 10.0.0.10
>>
>> Somehow I can't start any process on the remote machine (10.0.0.10).
>> It annoys me that a few days ago it worked, but I had to reinstall
>> one of the machines, and since then I couldn't figure out what's
>> wrong with my settings. Thanks.
>>
>> regards
>> K.A. Albert
>>
>