[mpich-discuss] SGE & Hydra Problem
Ursula Winkler
ursula.winkler at uni-graz.at
Wed Sep 22 06:49:38 CDT 2010
Pavan Balaji wrote:
> ----- "Ursula Winkler" <ursula.winkler at uni-graz.at> wrote:
>
>
>> No, when mpiexec is placed within the SGE job script, it works fine on
>> the second cluster. I meant that just the command "qrsh -inherit -V ...
>> hydra_pmi_proxy ..." placed within the SGE script results in the
>> mentioned error message (on both clusters).
>>
>
> Ok, just to confirm, if nodes X and Y are both in the $TMPDIR/machines file, you are running the qrsh command from node X to node Y, correct?
>
Yes.
> I'm surprised that this is not working on the second cluster, as this is exactly what Hydra does internally.
>
> Can you run mpiexec (from within an SGE script) for both clusters with the -verbose option and send me the outputs?
>
> % mpiexec -verbose /bin/hostname
>
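For context, a minimal sketch of the kind of SGE job script used for these
tests (the parallel environment name and slot count are placeholders, and
/bin/hostname stands in for the full hydra_pmi_proxy invocation, whose exact
arguments appear in the "Launch arguments" lines of the verbose output below):

  #!/bin/sh
  #$ -pe mpich 4      # hypothetical PE name; 4 slots as in the runs below
  #$ -cwd
  # Normal launch: Hydra uses SGE's qrsh -inherit internally, as the
  # verbose output shows.
  mpiexec -verbose -np 4 ./cpitest.x
  # Manual reproduction of Hydra's remote launch step (the qrsh test
  # discussed above); replace y12 with a second host taken from the
  # job's $TMPDIR/machines file:
  qrsh -inherit -V y12 /bin/hostname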
On the cluster on which it works:
mpiexec options:
----------------
Base path: /installadmin/software/mpich/1.3b1/intel/bin/
Bootstrap server: (null)
Debug level: 1
Enable X: -1
Global environment:
-------------------
REMOTEHOST=ZID178.KFUNIGRAZ.AC.AT
MANPATH=/installadmin/software/mpich/1.3b1/intel/share/man:/installadmin/software/intel/intel_fce_111/man:/installadmin/software/intel/intel_cce_111/man:/installadmin/sge/man:/usr/share/man/en:/usr/share/man:/usr/local/man:/usr/local/share/man
INTEL_LICENSE_FILE=/installadmin/software/intel/intel_fce_111/licenses:/opt/intel/licenses:/usr/people/edvz/winkl/intel/licenses:/installadmin/software/intel/intel_cce_111/licenses:/opt/intel/licenses:/usr/people/edvz/winkl/intel/licenses
HOST=emmy
TERM=xterm
HISTSIZE=1000
SSH_CLIENT=143.50.128.178 36866 22
SSH_TTY=/dev/pts/2
GROUP=edvz
LD_LIBRARY_PATH=/installadmin/software/mpich/1.3b1/intel/lib:/installadmin/software/intel/intel_fce_111/lib/intel64:/installadmin/software/intel/intel_cce_111/lib/intel64
LS_COLORS=no
HOSTTYPE=x86_64-linux
MAIL=/var/spool/mail/winkl
INPUTRC=/etc/inputrc
PWD=/usr/people/edvz/winkl/MPI-Test
SGE_ACCOUNT=sge
SGE_RSH_COMMAND=/installadmin/sge/utilbin/lx24-amd64/rsh
LANG=en_US.UTF-8
REQNAME=test_nodes.b2
SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
MPI=/installadmin/software/mpich/1.3b1/intel
SHLVL=2
SGE_CWD_PATH=/usr/people/edvz/winkl/MPI-Test
OSTYPE=linux
MPIHOME=/installadmin/software/mpich/1.3b1/intel
VENDOR=unknown
MACHTYPE=x86_64
REMOTEUSER=root
CVS_RSH=ssh
SSH_CONNECTION=143.50.128.178 36866 143.50.10.43 22
LESSOPEN=|/usr/bin/lesspipe.sh %s
G_BROKEN_FILENAMES=1
_=/installadmin/software/mpich/1.3b1/intel/bin/mpiexec
Proxy information:
*********************
Proxy ID: 1
-----------------
Proxy name: y23
Process count: 2
Start PID: 0
Proxy exec list:
....................
Exec: ./cpitest.x; Process count: 2
Proxy ID: 2
-----------------
Proxy name: y12
Process count: 2
Start PID: 2
Proxy exec list:
....................
Exec: ./cpitest.x; Process count: 2
==================================================================================================
[mpiexec at y23] Timeout set to -1 (-1 means infinite)
[mpiexec at y23] Got a control port string of y23:51464
Proxy launch args:
/installadmin/software/mpich/1.3b1/intel/bin/hydra_pmi_proxy
--control-port y23:51464 --debug --demux poll --pgid 0 --enable-stdin 1
--proxy-id
[mpiexec at y23] PMI FD: (null); PMI PORT: (null); PMI ID/RANK: -1
Arguments being passed to proxy 0:
--version 1.3b1 --interface-env-name MPICH_INTERFACE_NAME --hostname y23
--global-core-count 4 --global-process-count 4 --auto-cleanup 1
--pmi-rank -1 --pmi-kvsname kvs_15511_0 --pmi-process-mapping
(vector,(0,2,2)) --global-inherited-env 33
'REMOTEHOST=ZID178.KFUNIGRAZ.AC.AT'
'MANPATH=/installadmin/software/mpich/1.3b1/intel/share/man:/installadmin/software/intel/intel_fce_111/man:/installadmin/software/intel/intel_cce_111/man:/installadmin/sge/man:/usr/share/man/en:/usr/share/man:/usr/local/man:/usr/local/share/man'
'INTEL_LICENSE_FILE=/installadmin/software/intel/intel_fce_111/licenses:/opt/intel/licenses:/usr/people/edvz/winkl/intel/licenses:/installadmin/software/intel/intel_cce_111/licenses:/opt/intel/licenses:/usr/people/edvz/winkl/intel/licenses'
'HOST=emmy' 'TERM=xterm' 'HISTSIZE=1000' 'SSH_CLIENT=143.50.128.178
36866 22' 'SSH_TTY=/dev/pts/2' 'GROUP=edvz'
'LD_LIBRARY_PATH=/installadmin/software/mpich/1.3b1/intel/lib:/installadmin/software/intel/intel_fce_111/lib/intel64:/installadmin/software/intel/intel_cce_111/lib/intel64'
'LS_COLORS=no' 'HOSTTYPE=x86_64-linux' 'MAIL=/var/spool/mail/winkl'
'INPUTRC=/etc/inputrc' 'PWD=/usr/people/edvz/winkl/MPI-Test'
'SGE_ACCOUNT=sge'
'SGE_RSH_COMMAND=/installadmin/sge/utilbin/lx24-amd64/rsh'
'LANG=en_US.UTF-8' 'REQNAME=test_nodes.b2'
'SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass'
'MPI=/installadmin/software/mpich/1.3b1/intel' 'SHLVL=2'
'SGE_CWD_PATH=/usr/people/edvz/winkl/MPI-Test' 'OSTYPE=linux'
'MPIHOME=/installadmin/software/mpich/1.3b1/intel' 'VENDOR=unknown'
'MACHTYPE=x86_64' 'REMOTEUSER=root' 'CVS_RSH=ssh'
'SSH_CONNECTION=143.50.128.178 36866 143.50.10.43 22'
'LESSOPEN=|/usr/bin/lesspipe.sh %s' 'G_BROKEN_FILENAMES=1'
'_=/installadmin/software/mpich/1.3b1/intel/bin/mpiexec'
--global-user-env 0 --global-system-env 0 --start-pid 0
--proxy-core-count 2 --exec --exec-appnum 0 --exec-proc-count 2
--exec-local-env 0 --exec-wdir /usr/people/edvz/winkl/MPI-Test
--exec-args 1 ./cpitest.x
[mpiexec at y23] PMI FD: (null); PMI PORT: (null); PMI ID/RANK: -1
Arguments being passed to proxy 1:
--version 1.3b1 --interface-env-name MPICH_INTERFACE_NAME --hostname y12
--global-core-count 4 --global-process-count 4 --auto-cleanup 1
--pmi-rank -1 --pmi-kvsname kvs_15511_0 --pmi-process-mapping
(vector,(0,2,2)) --global-inherited-env 33
'REMOTEHOST=ZID178.KFUNIGRAZ.AC.AT'
'MANPATH=/installadmin/software/mpich/1.3b1/intel/share/man:/installadmin/software/intel/intel_fce_111/man:/installadmin/software/intel/intel_cce_111/man:/installadmin/sge/man:/usr/share/man/en:/usr/share/man:/usr/local/man:/usr/local/share/man'
'INTEL_LICENSE_FILE=/installadmin/software/intel/intel_fce_111/licenses:/opt/intel/licenses:/usr/people/edvz/winkl/intel/licenses:/installadmin/software/intel/intel_cce_111/licenses:/opt/intel/licenses:/usr/people/edvz/winkl/intel/licenses'
'HOST=emmy' 'TERM=xterm' 'HISTSIZE=1000' 'SSH_CLIENT=143.50.128.178
36866 22' 'SSH_TTY=/dev/pts/2' 'GROUP=edvz'
'LD_LIBRARY_PATH=/installadmin/software/mpich/1.3b1/intel/lib:/installadmin/software/intel/intel_fce_111/lib/intel64:/installadmin/software/intel/intel_cce_111/lib/intel64'
'LS_COLORS=no' 'HOSTTYPE=x86_64-linux' 'MAIL=/var/spool/mail/winkl'
'INPUTRC=/etc/inputrc' 'PWD=/usr/people/edvz/winkl/MPI-Test'
'SGE_ACCOUNT=sge'
'SGE_RSH_COMMAND=/installadmin/sge/utilbin/lx24-amd64/rsh'
'LANG=en_US.UTF-8' 'REQNAME=test_nodes.b2'
'SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass'
'MPI=/installadmin/software/mpich/1.3b1/intel' 'SHLVL=2'
'SGE_CWD_PATH=/usr/people/edvz/winkl/MPI-Test' 'OSTYPE=linux'
'MPIHOME=/installadmin/software/mpich/1.3b1/intel' 'VENDOR=unknown'
'MACHTYPE=x86_64' 'REMOTEUSER=root' 'CVS_RSH=ssh'
'SSH_CONNECTION=143.50.128.178 36866 143.50.10.43 22'
'LESSOPEN=|/usr/bin/lesspipe.sh %s' 'G_BROKEN_FILENAMES=1'
'_=/installadmin/software/mpich/1.3b1/intel/bin/mpiexec'
--global-user-env 0 --global-system-env 0 --start-pid 2
--proxy-core-count 2 --exec --exec-appnum 0 --exec-proc-count 2
--exec-local-env 0 --exec-wdir /usr/people/edvz/winkl/MPI-Test
--exec-args 1 ./cpitest.x
[mpiexec at y23] Launch arguments:
/installadmin/software/mpich/1.3b1/intel/bin/hydra_pmi_proxy
--control-port y23:51464 --debug --demux poll --pgid 0 --enable-stdin 1
--proxy-id 0
[mpiexec at y23] Launch arguments: /installadmin/sge/bin/lx24-amd64/qrsh
-inherit -V y12
/installadmin/software/mpich/1.3b1/intel/bin/hydra_pmi_proxy
--control-port y23:51464 --debug --demux poll --pgid 0 --enable-stdin 1
--proxy-id 1
[proxy:0:0 at y23] got pmi command (from 9): init
pmi_version=1 pmi_subversion=1
[proxy:0:0 at y23] PMI response: cmd=response_to_init pmi_version=1
pmi_subversion=1 rc=0
[proxy:0:0 at y23] got pmi command (from 9): get_maxes
[proxy:0:0 at y23] PMI response: cmd=maxes kvsname_max=256 keylen_max=64
vallen_max=1024
[proxy:0:0 at y23] got pmi command (from 9): get_appnum
[proxy:0:0 at y23] PMI response: cmd=appnum appnum=0
[proxy:0:0 at y23] got pmi command (from 9): get_my_kvsname
[proxy:0:0 at y23] PMI response: cmd=my_kvsname kvsname=kvs_15511_0
[proxy:0:0 at y23] got pmi command (from 9): get_my_kvsname
[proxy:0:0 at y23] PMI response: cmd=my_kvsname kvsname=kvs_15511_0
[proxy:0:0 at y23] got pmi command (from 9): get
kvsname=kvs_15511_0 key=PMI_process_mapping
[proxy:0:0 at y23] PMI response: cmd=get_result rc=0 msg=success
value=(vector,(0,2,2))
[proxy:0:0 at y23] got pmi command (from 9): barrier_in
[proxy:0:0 at y23] got pmi command (from 6): init
pmi_version=1 pmi_subversion=1
[proxy:0:0 at y23] PMI response: cmd=response_to_init pmi_version=1
pmi_subversion=1 rc=0
[proxy:0:0 at y23] got pmi command (from 6): get_maxes
[proxy:0:0 at y23] PMI response: cmd=maxes kvsname_max=256 keylen_max=64
vallen_max=1024
[proxy:0:0 at y23] got pmi command (from 6): get_appnum
[proxy:0:0 at y23] PMI response: cmd=appnum appnum=0
[proxy:0:0 at y23] got pmi command (from 6): get_my_kvsname
[proxy:0:0 at y23] PMI response: cmd=my_kvsname kvsname=kvs_15511_0
[proxy:0:0 at y23] got pmi command (from 6): get_my_kvsname
[proxy:0:0 at y23] PMI response: cmd=my_kvsname kvsname=kvs_15511_0
[proxy:0:0 at y23] got pmi command (from 6): get
kvsname=kvs_15511_0 key=PMI_process_mapping
[proxy:0:0 at y23] PMI response: cmd=get_result rc=0 msg=success
value=(vector,(0,2,2))
[proxy:0:0 at y23] got pmi command (from 6): put
kvsname=kvs_15511_0 key=sharedFilename[0]
value=/dev/shm/mpich_shar_tmpWjm2Xo
[proxy:0:0 at y23] we don't understand this command put; forwarding upstream
[mpiexec at y23] [pgid: 0] got PMI command: cmd=put kvsname=kvs_15511_0
key=sharedFilename[0] value=/dev/shm/mpich_shar_tmpWjm2Xo
[mpiexec at y23] PMI response to fd 6 pid 6: cmd=put_result rc=0 msg=success
[proxy:0:0 at y23] we don't understand the response put_result; forwarding
downstream
[proxy:0:0 at y23] got pmi command (from 6): barrier_in
[proxy:0:0 at y23] forwarding command (cmd=barrier_in) upstream
[mpiexec at y23] [pgid: 0] got PMI command: cmd=barrier_in
[proxy:0:1 at y12] got pmi command (from 4): init
pmi_version=1 pmi_subversion=1
[proxy:0:1 at y12] PMI response: cmd=response_to_init pmi_version=1
pmi_subversion=1 rc=0
[proxy:0:1 at y12] got pmi command (from 4): get_maxes
[proxy:0:1 at y12] PMI response: cmd=maxes kvsname_max=256 keylen_max=64
vallen_max=1024
[proxy:0:1 at y12] got pmi command (from 5): init
pmi_version=1 pmi_subversion=1
[proxy:0:1 at y12] PMI response: cmd=response_to_init pmi_version=1
pmi_subversion=1 rc=0
[proxy:0:1 at y12] got pmi command (from 4): get_appnum
[proxy:0:1 at y12] PMI response: cmd=appnum appnum=0
[proxy:0:1 at y12] got pmi command (from 4): get_my_kvsname
[proxy:0:1 at y12] PMI response: cmd=my_kvsname kvsname=kvs_15511_0
[proxy:0:1 at y12] got pmi command (from 5): get_maxes
[proxy:0:1 at y12] PMI response: cmd=maxes kvsname_max=256 keylen_max=64
vallen_max=1024
[proxy:0:1 at y12] got pmi command (from 4): get_my_kvsname
[proxy:0:1 at y12] PMI response: cmd=my_kvsname kvsname=kvs_15511_0
[proxy:0:1 at y12] got pmi command (from 4): get
kvsname=kvs_15511_0 key=PMI_process_mapping
[proxy:0:1 at y12] PMI response: cmd=get_result rc=0 msg=success
value=(vector,(0,2,2))
[proxy:0:1 at y12] got pmi command (from 5): get_appnum
[proxy:0:1 at y12] PMI response: cmd=appnum appnum=0
[proxy:0:1 at y12] got pmi command (from 4): put
kvsname=kvs_15511_0 key=sharedFilename[2]
value=/dev/shm/mpich_shar_tmpAqIkNK
[proxy:0:1 at y12] we don't understand this command put; forwarding upstream
[proxy:0:1 at y12] got pmi command (from 5): get_my_kvsname
[proxy:0:1 at y12] PMI response: cmd=my_kvsname kvsname=kvs_15511_0
[proxy:0:1 at y12] got pmi command (from 5): get_my_kvsname
[proxy:0:1 at y12] PMI response: cmd=my_kvsname kvsname=kvs_15511_0
[proxy:0:1 at y12] got pmi command (from 5): get
kvsname=kvs_15511_0 key=PMI_process_mapping
[proxy:0:1 at y12] PMI response: cmd=get_result rc=0 msg=success
value=(vector,(0,2,2))
[proxy:0:1 at y12] got pmi command (from 5): barrier_in
[mpiexec at y23] [pgid: 0] got PMI command: cmd=put kvsname=kvs_15511_0
key=sharedFilename[2] value=/dev/shm/mpich_shar_tmpAqIkNK
[mpiexec at y23] PMI response to fd 0 pid 4: cmd=put_result rc=0 msg=success
[mpiexec at y23] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at y23] PMI response to fd 6 pid 4: cmd=barrier_out
[mpiexec at y23] PMI response to fd 0 pid 4: cmd=barrier_out
[proxy:0:0 at y23] PMI response: cmd=barrier_out
[proxy:0:0 at y23] PMI response: cmd=barrier_out
[proxy:0:0 at y23] got pmi command (from 9): get
kvsname=kvs_15511_0 key=sharedFilename[0]
[proxy:0:0 at y23] forwarding command (cmd=get kvsname=kvs_15511_0
key=sharedFilename[0]) upstream
[proxy:0:1 at y12] we don't understand the response put_result; forwarding
downstream
[proxy:0:1 at y12] got pmi command (from 4): barrier_in
[proxy:0:1 at y12] forwarding command (cmd=barrier_in) upstream
[proxy:0:1 at y12] PMI response: cmd=barrier_out
[proxy:0:1 at y12] PMI response: cmd=barrier_out
[proxy:0:1 at y12] got pmi command (from 5): get
kvsname=kvs_15511_0 key=sharedFilename[2]
[proxy:0:1 at y12] forwarding command (cmd=get kvsname=kvs_15511_0
key=sharedFilename[2]) upstream
[mpiexec at y23] [pgid: 0] got PMI command: cmd=get kvsname=kvs_15511_0
key=sharedFilename[0]
[mpiexec at y23] PMI response to fd 6 pid 9: cmd=get_result rc=0
msg=success value=/dev/shm/mpich_shar_tmpWjm2Xo
[mpiexec at y23] [pgid: 0] got PMI command: cmd=get kvsname=kvs_15511_0
key=sharedFilename[2]
[mpiexec at y23] PMI response to fd 0 pid 5: cmd=get_result rc=0
msg=success value=/dev/shm/mpich_shar_tmpAqIkNK
[proxy:0:0 at y23] we don't understand the response get_result; forwarding
downstream
[proxy:0:0 at y23] got pmi command (from 6): put
kvsname=kvs_15511_0 key=P0-businesscard
value=description#y23$port#53836$ifname#10.143.41.63$
[proxy:0:0 at y23] we don't understand this command put; forwarding upstream
[proxy:0:0 at y23] got pmi command (from 9): put
kvsname=kvs_15511_0 key=P1-businesscard
value=description#y23$port#38784$ifname#10.143.41.63$
[proxy:0:0 at y23] we don't understand this command put; forwarding upstream
[mpiexec at y23] [pgid: 0] got PMI command: cmd=put kvsname=kvs_15511_0
key=P0-businesscard value=description#y23$port#53836$ifname#10.143.41.63$
[mpiexec at y23] PMI response to fd 6 pid 6: cmd=put_result rc=0 msg=success
[proxy:0:1 at y12] we don't understand the response get_result; forwarding
downstream
[proxy:0:1 at y12] got pmi command (from 4): put
kvsname=kvs_15511_0 key=P2-businesscard
value=description#y12$port#47251$ifname#10.143.41.52$
[proxy:0:1 at y12] we don't understand this command put; forwarding upstream
[proxy:0:1 at y12] got pmi command (from 5): put
kvsname=kvs_15511_0 key=P3-businesscard
value=description#y12$port#55610$ifname#10.143.41.52$
[proxy:0:1 at y12] we don't understand this command put; forwarding upstream
[mpiexec at y23] [pgid: 0] got PMI command: cmd=put kvsname=kvs_15511_0
key=P1-businesscard value=description#y23$port#38784$ifname#10.143.41.63$
[mpiexec at y23] PMI response to fd 6 pid 9: cmd=put_result rc=0 msg=success
[mpiexec at y23] [pgid: 0] got PMI command: cmd=put kvsname=kvs_15511_0
key=P2-businesscard value=description#y12$port#47251$ifname#10.143.41.52$
[mpiexec at y23] PMI response to fd 0 pid 4: cmd=put_result rc=0 msg=success
[proxy:0:0 at y23] we don't understand the response put_result; forwarding
downstream
[proxy:0:0 at y23] got pmi command (from 6): barrier_in
[proxy:0:0 at y23] we don't understand the response put_result; forwarding
downstream
[proxy:0:0 at y23] got pmi command (from 9): barrier_in
[proxy:0:0 at y23] forwarding command (cmd=barrier_in) upstream
[mpiexec at y23] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at y23] [pgid: 0] got PMI command: cmd=put kvsname=kvs_15511_0
key=P3-businesscard value=description#y12$port#55610$ifname#10.143.41.52$
[mpiexec at y23] PMI response to fd 0 pid 5: cmd=put_result rc=0 msg=success
[proxy:0:1 at y12] we don't understand the response put_result; forwarding
downstream
[proxy:0:1 at y12] got pmi command (from 4): barrier_in
[proxy:0:1 at y12] we don't understand the response put_result; forwarding
downstream
[proxy:0:1 at y12] got pmi command (from 5): barrier_in
[proxy:0:1 at y12] forwarding command (cmd=barrier_in) upstream
[mpiexec at y23] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at y23] PMI response to fd 6 pid 5: cmd=barrier_out
[mpiexec at y23] PMI response to fd 0 pid 5: cmd=barrier_out
[proxy:0:0 at y23] PMI response: cmd=barrier_out
[proxy:0:0 at y23] PMI response: cmd=barrier_out
Process 0 of 4 is on y23
Process 1 of 4 is on y23
[proxy:0:0 at y23] got pmi command (from 6): get
kvsname=kvs_15511_0 key=P2-businesscard
[proxy:0:0 at y23] forwarding command (cmd=get kvsname=kvs_15511_0
key=P2-businesscard) upstream
[mpiexec at y23] [pgid: 0] got PMI command: cmd=get kvsname=kvs_15511_0
key=P2-businesscard
[mpiexec at y23] PMI response to fd 6 pid 6: cmd=get_result rc=0
msg=success value=description#y12$port#47251$ifname#10.143.41.52$
[proxy:0:1 at y12] PMI response: cmd=barrier_out
[proxy:0:1 at y12] PMI response: cmd=barrier_out
Process 2 of 4 is on y12
Process 3 of 4 is on y12
[proxy:0:0 at y23] we don't understand the response get_result; forwarding
downstream
pi is approximately 3.1415926535897682, Error is 0.0000000000000249
wall clock time = 1.991206
[proxy:0:0 at y23] got pmi command (from 9): barrier_in
[proxy:0:0 at y23] got pmi command (from 6): barrier_in
[proxy:0:0 at y23] forwarding command (cmd=barrier_in) upstream
[proxy:0:1 at y12] got pmi command (from 5): barrier_in
[proxy:0:1 at y12] got pmi command (from 4): barrier_in
[proxy:0:1 at y12] forwarding command (cmd=barrier_in) upstream
[mpiexec at y23] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at y23] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at y23] PMI response to fd 6 pid 4: cmd=barrier_out
[mpiexec at y23] PMI response to fd 0 pid 4: cmd=barrier_out
[proxy:0:0 at y23] PMI response: cmd=barrier_out
[proxy:0:0 at y23] PMI response: cmd=barrier_out
[proxy:0:0 at y23] got pmi command (from 6): finalize
[proxy:0:0 at y23] PMI response: cmd=finalize_ack
[proxy:0:0 at y23] got pmi command (from 9): finalize
[proxy:0:0 at y23] PMI response: cmd=finalize_ack
[proxy:0:1 at y12] PMI response: cmd=barrier_out
[proxy:0:1 at y12] PMI response: cmd=barrier_out
[proxy:0:1 at y12] got pmi command (from 4): finalize
[proxy:0:1 at y12] PMI response: cmd=finalize_ack
[proxy:0:1 at y12] got pmi command (from 5): finalize
[proxy:0:1 at y12] PMI response: cmd=finalize_ack
-----------------------------------------------------------------------------------------------------
On the cluster on which it doesn't work:
mpiexec options:
----------------
Base path: /installadmin/mpich2/test/intel/bin/
Bootstrap server: (null)
Debug level: 1
Enable X: -1
Global environment:
-------------------
REMOTEHOST=ZID178.KFUNIGRAZ.AC.AT
MANPATH=/installadmin/sge/man:/software/mpich2/test/intel/share/man:/software/intel/intel_fce_111/man:/software/intel/intel_cce_111/man:/installadmin/sge/man:/usr/share/man/en:/usr/share/man:/usr/local/share/man
CONSOLE=/dev/console
SELINUX_INIT=YES
INTEL_LICENSE_FILE=/software/intel/intel_fce_111/licenses:/opt/intel/licenses:/usr/people/edvz/winkl/intel/licenses:/software/intel/intel_cce_111/licenses:/software/intel/licenses:/usr/people/edvz/winkl/intel/licenses
HOST=b00
TERM=xterm
HISTSIZE=1000
SSH_CLIENT=143.50.128.178 36871 22
SSH_TTY=/dev/pts/0
GROUP=edvz
LD_LIBRARY_PATH=/installadmin/mpich2/test/intel/lib:/software/intel/intel_fce_111/lib/intel64:/software/intel/intel_cce_111/lib/intel64
LS_COLORS=no
INIT_VERSION=sysvinit-2.86
HOSTTYPE=x86_64-linux
AUTOBOOT=YES
MAIL=/var/spool/mail/winkl
runlevel=3
RUNLEVEL=3
INPUTRC=/etc/inputrc
PWD=/usr/people/edvz/winkl/MPI-Test
SGE_ACCOUNT=sge
LANG=en_US.UTF-8
previous=N
PREVLEVEL=N
REQNAME=test_nodes.b2
SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
MPI=/installadmin/mpich2/test/intel
SHLVL=2
SGE_CWD_PATH=/usr/people/edvz/winkl/MPI-Test
OSTYPE=linux
BOOT_IMAGE=2.6.18-194.11.3
MPIHOME=/installadmin/mpich2/test/intel
VENDOR=unknown
MACHTYPE=x86_64
CVS_RSH=ssh
SSH_CONNECTION=143.50.128.178 36871 143.50.10.40 22
LESSOPEN=|/usr/bin/lesspipe.sh %s
G_BROKEN_FILENAMES=1
_=/installadmin/mpich2/test/intel/bin/mpiexec
Proxy information:
*********************
Proxy ID: 1
-----------------
Proxy name: b72
Process count: 2
Start PID: 0
Proxy exec list:
....................
Exec: ./cpitest.x; Process count: 2
Proxy ID: 2
-----------------
Proxy name: b60
Process count: 2
Start PID: 2
Proxy exec list:
....................
Exec: ./cpitest.x; Process count: 2
==================================================================================================
[mpiexec at b72] Timeout set to -1 (-1 means infinite)
[mpiexec at b72] Got a control port string of b72:53271
Proxy launch args: /installadmin/mpich2/test/intel/bin/hydra_pmi_proxy
--control-port b72:53271 --debug --demux poll --pgid 0 --enable-stdin 1
--proxy-id
[mpiexec at b72] PMI FD: (null); PMI PORT: (null); PMI ID/RANK: -1
Arguments being passed to proxy 0:
--version 1.3b1 --interface-env-name MPICH_INTERFACE_NAME --hostname b72
--global-core-count 4 --global-process-count 4 --auto-cleanup 1
--pmi-rank -1 --pmi-kvsname kvs_3249_0 --pmi-process-mapping
(vector,(0,2,2)) --global-inherited-env 40
'REMOTEHOST=ZID178.KFUNIGRAZ.AC.AT'
'MANPATH=/installadmin/sge/man:/software/mpich2/test/intel/share/man:/software/intel/intel_fce_111/man:/software/intel/intel_cce_111/man:/installadmin/sge/man:/usr/share/man/en:/usr/share/man:/usr/local/share/man'
'CONSOLE=/dev/console' 'SELINUX_INIT=YES'
'INTEL_LICENSE_FILE=/software/intel/intel_fce_111/licenses:/opt/intel/licenses:/usr/people/edvz/winkl/intel/licenses:/software/intel/intel_cce_111/licenses:/software/intel/licenses:/usr/people/edvz/winkl/intel/licenses'
'HOST=b00' 'TERM=xterm' 'HISTSIZE=1000' 'SSH_CLIENT=143.50.128.178 36871
22' 'SSH_TTY=/dev/pts/0' 'GROUP=edvz'
'LD_LIBRARY_PATH=/installadmin/mpich2/test/intel/lib:/software/intel/intel_fce_111/lib/intel64:/software/intel/intel_cce_111/lib/intel64'
'LS_COLORS=no' 'INIT_VERSION=sysvinit-2.86' 'HOSTTYPE=x86_64-linux'
'AUTOBOOT=YES' 'MAIL=/var/spool/mail/winkl' 'runlevel=3' 'RUNLEVEL=3'
'INPUTRC=/etc/inputrc' 'PWD=/usr/people/edvz/winkl/MPI-Test'
'SGE_ACCOUNT=sge' 'LANG=en_US.UTF-8' 'previous=N' 'PREVLEVEL=N'
'REQNAME=test_nodes.b2'
'SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass'
'MPI=/installadmin/mpich2/test/intel' 'SHLVL=2'
'SGE_CWD_PATH=/usr/people/edvz/winkl/MPI-Test' 'OSTYPE=linux'
'BOOT_IMAGE=2.6.18-194.11.3' 'MPIHOME=/installadmin/mpich2/test/intel'
'VENDOR=unknown' 'MACHTYPE=x86_64' 'CVS_RSH=ssh'
'SSH_CONNECTION=143.50.128.178 36871 143.50.10.40 22'
'LESSOPEN=|/usr/bin/lesspipe.sh %s' 'G_BROKEN_FILENAMES=1'
'_=/installadmin/mpich2/test/intel/bin/mpiexec' --global-user-env 0
--global-system-env 0 --start-pid 0 --proxy-core-count 2 --exec
--exec-appnum 0 --exec-proc-count 2 --exec-local-env 0 --exec-wdir
/usr/people/edvz/winkl/MPI-Test --exec-args 1 ./cpitest.x
[mpiexec at b72] PMI FD: (null); PMI PORT: (null); PMI ID/RANK: -1
Arguments being passed to proxy 1:
--version 1.3b1 --interface-env-name MPICH_INTERFACE_NAME --hostname b60
--global-core-count 4 --global-process-count 4 --auto-cleanup 1
--pmi-rank -1 --pmi-kvsname kvs_3249_0 --pmi-process-mapping
(vector,(0,2,2)) --global-inherited-env 40
'REMOTEHOST=ZID178.KFUNIGRAZ.AC.AT'
'MANPATH=/installadmin/sge/man:/software/mpich2/test/intel/share/man:/software/intel/intel_fce_111/man:/software/intel/intel_cce_111/man:/installadmin/sge/man:/usr/share/man/en:/usr/share/man:/usr/local/share/man'
'CONSOLE=/dev/console' 'SELINUX_INIT=YES'
'INTEL_LICENSE_FILE=/software/intel/intel_fce_111/licenses:/opt/intel/licenses:/usr/people/edvz/winkl/intel/licenses:/software/intel/intel_cce_111/licenses:/software/intel/licenses:/usr/people/edvz/winkl/intel/licenses'
'HOST=b00' 'TERM=xterm' 'HISTSIZE=1000' 'SSH_CLIENT=143.50.128.178 36871
22' 'SSH_TTY=/dev/pts/0' 'GROUP=edvz'
'LD_LIBRARY_PATH=/installadmin/mpich2/test/intel/lib:/software/intel/intel_fce_111/lib/intel64:/software/intel/intel_cce_111/lib/intel64'
'LS_COLORS=no' 'INIT_VERSION=sysvinit-2.86' 'HOSTTYPE=x86_64-linux'
'AUTOBOOT=YES' 'MAIL=/var/spool/mail/winkl' 'runlevel=3' 'RUNLEVEL=3'
'INPUTRC=/etc/inputrc' 'PWD=/usr/people/edvz/winkl/MPI-Test'
'SGE_ACCOUNT=sge' 'LANG=en_US.UTF-8' 'previous=N' 'PREVLEVEL=N'
'REQNAME=test_nodes.b2'
'SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass'
'MPI=/installadmin/mpich2/test/intel' 'SHLVL=2'
'SGE_CWD_PATH=/usr/people/edvz/winkl/MPI-Test' 'OSTYPE=linux'
'BOOT_IMAGE=2.6.18-194.11.3' 'MPIHOME=/installadmin/mpich2/test/intel'
'VENDOR=unknown' 'MACHTYPE=x86_64' 'CVS_RSH=ssh'
'SSH_CONNECTION=143.50.128.178 36871 143.50.10.40 22'
'LESSOPEN=|/usr/bin/lesspipe.sh %s' 'G_BROKEN_FILENAMES=1'
'_=/installadmin/mpich2/test/intel/bin/mpiexec' --global-user-env 0
--global-system-env 0 --start-pid 2 --proxy-core-count 2 --exec
--exec-appnum 0 --exec-proc-count 2 --exec-local-env 0 --exec-wdir
/usr/people/edvz/winkl/MPI-Test --exec-args 1 ./cpitest.x
[mpiexec at b72] Launch arguments:
/installadmin/mpich2/test/intel/bin/hydra_pmi_proxy --control-port
b72:53271 --debug --demux poll --pgid 0 --enable-stdin 1 --proxy-id 0
[mpiexec at b72] Launch arguments: /installadmin/sge/bin/lx24-amd64/qrsh
-inherit -V b60 /installadmin/mpich2/test/intel/bin/hydra_pmi_proxy
--control-port b72:53271 --debug --demux poll --pgid 0 --enable-stdin 1
--proxy-id 1
[proxy:0:0 at b72] got pmi command (from 6): init
pmi_version=1 pmi_subversion=1
[proxy:0:0 at b72] PMI response: cmd=response_to_init pmi_version=1
pmi_subversion=1 rc=0
[proxy:0:0 at b72] got pmi command (from 9): init
pmi_version=1 pmi_subversion=1
[proxy:0:0 at b72] PMI response: cmd=response_to_init pmi_version=1
pmi_subversion=1 rc=0
[proxy:0:0 at b72] got pmi command (from 9): get_maxes
[proxy:0:0 at b72] PMI response: cmd=maxes kvsname_max=256 keylen_max=64
vallen_max=1024
[proxy:0:0 at b72] got pmi command (from 6): get_maxes
[proxy:0:0 at b72] PMI response: cmd=maxes kvsname_max=256 keylen_max=64
vallen_max=1024
[proxy:0:0 at b72] got pmi command (from 9): get_appnum
[proxy:0:0 at b72] PMI response: cmd=appnum appnum=0
[proxy:0:0 at b72] got pmi command (from 9): get_my_kvsname
[proxy:0:0 at b72] PMI response: cmd=my_kvsname kvsname=kvs_3249_0
[proxy:0:0 at b72] got pmi command (from 6): get_appnum
[proxy:0:0 at b72] PMI response: cmd=appnum appnum=0
[proxy:0:0 at b72] got pmi command (from 9): get_my_kvsname
[proxy:0:0 at b72] PMI response: cmd=my_kvsname kvsname=kvs_3249_0
[proxy:0:0 at b72] got pmi command (from 9): get
kvsname=kvs_3249_0 key=PMI_process_mapping
[proxy:0:0 at b72] PMI response: cmd=get_result rc=0 msg=success
value=(vector,(0,2,2))
[proxy:0:0 at b72] got pmi command (from 6): get_my_kvsname
[proxy:0:0 at b72] PMI response: cmd=my_kvsname kvsname=kvs_3249_0
[proxy:0:0 at b72] got pmi command (from 9): barrier_in
[proxy:0:0 at b72] got pmi command (from 6): get_my_kvsname
[proxy:0:0 at b72] PMI response: cmd=my_kvsname kvsname=kvs_3249_0
[proxy:0:0 at b72] got pmi command (from 6): get
kvsname=kvs_3249_0 key=PMI_process_mapping
[proxy:0:0 at b72] PMI response: cmd=get_result rc=0 msg=success
value=(vector,(0,2,2))
[proxy:0:0 at b72] got pmi command (from 6): put
kvsname=kvs_3249_0 key=sharedFilename[0] value=/dev/shm/mpich_shar_tmp1BFE87
[proxy:0:0 at b72] we don't understand this command put; forwarding upstream
[mpiexec at b72] [pgid: 0] got PMI command: cmd=put kvsname=kvs_3249_0
key=sharedFilename[0] value=/dev/shm/mpich_shar_tmp1BFE87
[mpiexec at b72] PMI response to fd 6 pid 6: cmd=put_result rc=0 msg=success
[proxy:0:0 at b72] we don't understand the response put_result; forwarding
downstream
[proxy:0:0 at b72] got pmi command (from 6): barrier_in
[proxy:0:0 at b72] forwarding command (cmd=barrier_in) upstream
[mpiexec at b72] [pgid: 0] got PMI command: cmd=barrier_in
Note: There is no output from host b60 (the second participating host),
because no process is created on it.
Ursula