[mpich-discuss] SGE & Hydra Problem

Ursula Winkler ursula.winkler at uni-graz.at
Wed Sep 22 06:49:38 CDT 2010


Pavan Balaji wrote:
> ----- "Ursula Winkler" <ursula.winkler at uni-graz.at> wrote:
>
>   
>> No, when mpiexec is placed within the SGE job script, it works fine on
>> the second cluster. I meant that it is just the command "qrsh -inherit -V ...
>> hydra_pmi_proxy ..." placed within the SGE script that results in the
>> mentioned error message (on both clusters).
>>     
>
> Ok, just to confirm, if nodes X and Y are both in the $TMPDIR/machines file, you are running the qrsh command from node X to node Y, correct?
>   

yes

> I'm surprised that this is not working on the second cluster, as this is exactly what Hydra does internally.
>
> Can you run mpiexec (from within an SGE script) for both clusters with the -verbose option and send me the outputs?
>
> % mpiexec -verbose /bin/hostname
>   
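For reference, the job script is essentially just a call to mpiexec, roughly
like the following sketch (the shell, parallel-environment name and slot count
below are only placeholders, not the actual settings on our clusters):

  #!/bin/sh
  #$ -pe mpich 4      # PE name and slot count are placeholders (site-specific)
  #$ -cwd
  # Hydra detects the SGE allocation itself and starts its remote proxies
  # via "qrsh -inherit", so the script only needs to call mpiexec:
  $MPIHOME/bin/mpiexec -verbose ./cpitest.x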

The cluster on which it works:

mpiexec options:
----------------
  Base path: /installadmin/software/mpich/1.3b1/intel/bin/
  Bootstrap server: (null)
  Debug level: 1
  Enable X: -1

  Global environment:
  -------------------
    REMOTEHOST=ZID178.KFUNIGRAZ.AC.AT
    MANPATH=/installadmin/software/mpich/1.3b1/intel/share/man:/installadmin/software/intel/intel_fce_111/man:/installadmin/software/intel/intel_cce_111/man:/installadmin/sge/man:/usr/share/man/en:/usr/share/man:/usr/local/man:/usr/local/share/man
    INTEL_LICENSE_FILE=/installadmin/software/intel/intel_fce_111/licenses:/opt/intel/licenses:/usr/people/edvz/winkl/intel/licenses:/installadmin/software/intel/intel_cce_111/licenses:/opt/intel/licenses:/usr/people/edvz/winkl/intel/licenses
    HOST=emmy
    TERM=xterm
    HISTSIZE=1000
    SSH_CLIENT=143.50.128.178 36866 22
    SSH_TTY=/dev/pts/2
    GROUP=edvz
    LD_LIBRARY_PATH=/installadmin/software/mpich/1.3b1/intel/lib:/installadmin/software/intel/intel_fce_111/lib/intel64:/installadmin/software/intel/intel_cce_111/lib/intel64
    LS_COLORS=no
    HOSTTYPE=x86_64-linux
    MAIL=/var/spool/mail/winkl
    INPUTRC=/etc/inputrc
    PWD=/usr/people/edvz/winkl/MPI-Test
    SGE_ACCOUNT=sge
    SGE_RSH_COMMAND=/installadmin/sge/utilbin/lx24-amd64/rsh
    LANG=en_US.UTF-8
    REQNAME=test_nodes.b2
    SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
    MPI=/installadmin/software/mpich/1.3b1/intel
    SHLVL=2
    SGE_CWD_PATH=/usr/people/edvz/winkl/MPI-Test
    OSTYPE=linux
    MPIHOME=/installadmin/software/mpich/1.3b1/intel
    VENDOR=unknown
    MACHTYPE=x86_64
    REMOTEUSER=root
    CVS_RSH=ssh
    SSH_CONNECTION=143.50.128.178 36866 143.50.10.43 22
    LESSOPEN=|/usr/bin/lesspipe.sh %s
    G_BROKEN_FILENAMES=1
    _=/installadmin/software/mpich/1.3b1/intel/bin/mpiexec


    Proxy information:
    *********************
      Proxy ID:  1
      -----------------
        Proxy name: y23
        Process count: 2
        Start PID: 0

        Proxy exec list:
        ....................
          Exec: ./cpitest.x; Process count: 2
      Proxy ID:  2
      -----------------
        Proxy name: y12
        Process count: 2
        Start PID: 2

        Proxy exec list:
        ....................
          Exec: ./cpitest.x; Process count: 2

==================================================================================================

[mpiexec at y23] Timeout set to -1 (-1 means infinite)
[mpiexec at y23] Got a control port string of y23:51464

Proxy launch args: 
/installadmin/software/mpich/1.3b1/intel/bin/hydra_pmi_proxy 
--control-port y23:51464 --debug --demux poll --pgid 0 --enable-stdin 1 
--proxy-id

[mpiexec at y23] PMI FD: (null); PMI PORT: (null); PMI ID/RANK: -1
Arguments being passed to proxy 0:
--version 1.3b1 --interface-env-name MPICH_INTERFACE_NAME --hostname y23 
--global-core-count 4 --global-process-count 4 --auto-cleanup 1 
--pmi-rank -1 --pmi-kvsname kvs_15511_0 --pmi-process-mapping 
(vector,(0,2,2)) --global-inherited-env 33 
'REMOTEHOST=ZID178.KFUNIGRAZ.AC.AT' 
'MANPATH=/installadmin/software/mpich/1.3b1/intel/share/man:/installadmin/software/intel/intel_fce_111/man:/installadmin/software/intel/intel_cce_111/man:/installadmin/sge/man:/usr/share/man/en:/usr/share/man:/usr/local/man:/usr/local/share/man' 
'INTEL_LICENSE_FILE=/installadmin/software/intel/intel_fce_111/licenses:/opt/intel/licenses:/usr/people/edvz/winkl/intel/licenses:/installadmin/software/intel/intel_cce_111/licenses:/opt/intel/licenses:/usr/people/edvz/winkl/intel/licenses' 
'HOST=emmy' 'TERM=xterm' 'HISTSIZE=1000' 'SSH_CLIENT=143.50.128.178 
36866 22' 'SSH_TTY=/dev/pts/2' 'GROUP=edvz' 
'LD_LIBRARY_PATH=/installadmin/software/mpich/1.3b1/intel/lib:/installadmin/software/intel/intel_fce_111/lib/intel64:/installadmin/software/intel/intel_cce_111/lib/intel64' 
'LS_COLORS=no' 'HOSTTYPE=x86_64-linux' 'MAIL=/var/spool/mail/winkl' 
'INPUTRC=/etc/inputrc' 'PWD=/usr/people/edvz/winkl/MPI-Test' 
'SGE_ACCOUNT=sge' 
'SGE_RSH_COMMAND=/installadmin/sge/utilbin/lx24-amd64/rsh' 
'LANG=en_US.UTF-8' 'REQNAME=test_nodes.b2' 
'SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass' 
'MPI=/installadmin/software/mpich/1.3b1/intel' 'SHLVL=2' 
'SGE_CWD_PATH=/usr/people/edvz/winkl/MPI-Test' 'OSTYPE=linux' 
'MPIHOME=/installadmin/software/mpich/1.3b1/intel' 'VENDOR=unknown' 
'MACHTYPE=x86_64' 'REMOTEUSER=root' 'CVS_RSH=ssh' 
'SSH_CONNECTION=143.50.128.178 36866 143.50.10.43 22' 
'LESSOPEN=|/usr/bin/lesspipe.sh %s' 'G_BROKEN_FILENAMES=1' 
'_=/installadmin/software/mpich/1.3b1/intel/bin/mpiexec' 
--global-user-env 0 --global-system-env 0 --start-pid 0 
--proxy-core-count 2 --exec --exec-appnum 0 --exec-proc-count 2 
--exec-local-env 0 --exec-wdir /usr/people/edvz/winkl/MPI-Test 
--exec-args 1 ./cpitest.x

[mpiexec at y23] PMI FD: (null); PMI PORT: (null); PMI ID/RANK: -1
Arguments being passed to proxy 1:
--version 1.3b1 --interface-env-name MPICH_INTERFACE_NAME --hostname y12 
--global-core-count 4 --global-process-count 4 --auto-cleanup 1 
--pmi-rank -1 --pmi-kvsname kvs_15511_0 --pmi-process-mapping 
(vector,(0,2,2)) --global-inherited-env 33 
'REMOTEHOST=ZID178.KFUNIGRAZ.AC.AT' 
'MANPATH=/installadmin/software/mpich/1.3b1/intel/share/man:/installadmin/software/intel/intel_fce_111/man:/installadmin/software/intel/intel_cce_111/man:/installadmin/sge/man:/usr/share/man/en:/usr/share/man:/usr/local/man:/usr/local/share/man' 
'INTEL_LICENSE_FILE=/installadmin/software/intel/intel_fce_111/licenses:/opt/intel/licenses:/usr/people/edvz/winkl/intel/licenses:/installadmin/software/intel/intel_cce_111/licenses:/opt/intel/licenses:/usr/people/edvz/winkl/intel/licenses' 
'HOST=emmy' 'TERM=xterm' 'HISTSIZE=1000' 'SSH_CLIENT=143.50.128.178 
36866 22' 'SSH_TTY=/dev/pts/2' 'GROUP=edvz' 
'LD_LIBRARY_PATH=/installadmin/software/mpich/1.3b1/intel/lib:/installadmin/software/intel/intel_fce_111/lib/intel64:/installadmin/software/intel/intel_cce_111/lib/intel64' 
'LS_COLORS=no' 'HOSTTYPE=x86_64-linux' 'MAIL=/var/spool/mail/winkl' 
'INPUTRC=/etc/inputrc' 'PWD=/usr/people/edvz/winkl/MPI-Test' 
'SGE_ACCOUNT=sge' 
'SGE_RSH_COMMAND=/installadmin/sge/utilbin/lx24-amd64/rsh' 
'LANG=en_US.UTF-8' 'REQNAME=test_nodes.b2' 
'SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass' 
'MPI=/installadmin/software/mpich/1.3b1/intel' 'SHLVL=2' 
'SGE_CWD_PATH=/usr/people/edvz/winkl/MPI-Test' 'OSTYPE=linux' 
'MPIHOME=/installadmin/software/mpich/1.3b1/intel' 'VENDOR=unknown' 
'MACHTYPE=x86_64' 'REMOTEUSER=root' 'CVS_RSH=ssh' 
'SSH_CONNECTION=143.50.128.178 36866 143.50.10.43 22' 
'LESSOPEN=|/usr/bin/lesspipe.sh %s' 'G_BROKEN_FILENAMES=1' 
'_=/installadmin/software/mpich/1.3b1/intel/bin/mpiexec' 
--global-user-env 0 --global-system-env 0 --start-pid 2 
--proxy-core-count 2 --exec --exec-appnum 0 --exec-proc-count 2 
--exec-local-env 0 --exec-wdir /usr/people/edvz/winkl/MPI-Test 
--exec-args 1 ./cpitest.x

[mpiexec at y23] Launch arguments: 
/installadmin/software/mpich/1.3b1/intel/bin/hydra_pmi_proxy 
--control-port y23:51464 --debug --demux poll --pgid 0 --enable-stdin 1 
--proxy-id 0
[mpiexec at y23] Launch arguments: /installadmin/sge/bin/lx24-amd64/qrsh 
-inherit -V y12 
/installadmin/software/mpich/1.3b1/intel/bin/hydra_pmi_proxy 
--control-port y23:51464 --debug --demux poll --pgid 0 --enable-stdin 1 
--proxy-id 1
[proxy:0:0 at y23] got pmi command (from 9): init
pmi_version=1 pmi_subversion=1
[proxy:0:0 at y23] PMI response: cmd=response_to_init pmi_version=1 
pmi_subversion=1 rc=0
[proxy:0:0 at y23] got pmi command (from 9): get_maxes

[proxy:0:0 at y23] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 
vallen_max=1024
[proxy:0:0 at y23] got pmi command (from 9): get_appnum

[proxy:0:0 at y23] PMI response: cmd=appnum appnum=0
[proxy:0:0 at y23] got pmi command (from 9): get_my_kvsname

[proxy:0:0 at y23] PMI response: cmd=my_kvsname kvsname=kvs_15511_0
[proxy:0:0 at y23] got pmi command (from 9): get_my_kvsname

[proxy:0:0 at y23] PMI response: cmd=my_kvsname kvsname=kvs_15511_0
[proxy:0:0 at y23] got pmi command (from 9): get
kvsname=kvs_15511_0 key=PMI_process_mapping
[proxy:0:0 at y23] PMI response: cmd=get_result rc=0 msg=success 
value=(vector,(0,2,2))
[proxy:0:0 at y23] got pmi command (from 9): barrier_in

[proxy:0:0 at y23] got pmi command (from 6): init
pmi_version=1 pmi_subversion=1
[proxy:0:0 at y23] PMI response: cmd=response_to_init pmi_version=1 
pmi_subversion=1 rc=0
[proxy:0:0 at y23] got pmi command (from 6): get_maxes

[proxy:0:0 at y23] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 
vallen_max=1024
[proxy:0:0 at y23] got pmi command (from 6): get_appnum

[proxy:0:0 at y23] PMI response: cmd=appnum appnum=0
[proxy:0:0 at y23] got pmi command (from 6): get_my_kvsname

[proxy:0:0 at y23] PMI response: cmd=my_kvsname kvsname=kvs_15511_0
[proxy:0:0 at y23] got pmi command (from 6): get_my_kvsname

[proxy:0:0 at y23] PMI response: cmd=my_kvsname kvsname=kvs_15511_0
[proxy:0:0 at y23] got pmi command (from 6): get
kvsname=kvs_15511_0 key=PMI_process_mapping
[proxy:0:0 at y23] PMI response: cmd=get_result rc=0 msg=success 
value=(vector,(0,2,2))
[proxy:0:0 at y23] got pmi command (from 6): put
kvsname=kvs_15511_0 key=sharedFilename[0] 
value=/dev/shm/mpich_shar_tmpWjm2Xo
[proxy:0:0 at y23] we don't understand this command put; forwarding upstream
[mpiexec at y23] [pgid: 0] got PMI command: cmd=put kvsname=kvs_15511_0 
key=sharedFilename[0] value=/dev/shm/mpich_shar_tmpWjm2Xo
[mpiexec at y23] PMI response to fd 6 pid 6: cmd=put_result rc=0 msg=success
[proxy:0:0 at y23] we don't understand the response put_result; forwarding 
downstream
[proxy:0:0 at y23] got pmi command (from 6): barrier_in

[proxy:0:0 at y23] forwarding command (cmd=barrier_in) upstream
[mpiexec at y23] [pgid: 0] got PMI command: cmd=barrier_in
[proxy:0:1 at y12] got pmi command (from 4): init
pmi_version=1 pmi_subversion=1
[proxy:0:1 at y12] PMI response: cmd=response_to_init pmi_version=1 
pmi_subversion=1 rc=0
[proxy:0:1 at y12] got pmi command (from 4): get_maxes

[proxy:0:1 at y12] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 
vallen_max=1024
[proxy:0:1 at y12] got pmi command (from 5): init
pmi_version=1 pmi_subversion=1
[proxy:0:1 at y12] PMI response: cmd=response_to_init pmi_version=1 
pmi_subversion=1 rc=0
[proxy:0:1 at y12] got pmi command (from 4): get_appnum

[proxy:0:1 at y12] PMI response: cmd=appnum appnum=0
[proxy:0:1 at y12] got pmi command (from 4): get_my_kvsname

[proxy:0:1 at y12] PMI response: cmd=my_kvsname kvsname=kvs_15511_0
[proxy:0:1 at y12] got pmi command (from 5): get_maxes

[proxy:0:1 at y12] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 
vallen_max=1024
[proxy:0:1 at y12] got pmi command (from 4): get_my_kvsname

[proxy:0:1 at y12] PMI response: cmd=my_kvsname kvsname=kvs_15511_0
[proxy:0:1 at y12] got pmi command (from 4): get
kvsname=kvs_15511_0 key=PMI_process_mapping
[proxy:0:1 at y12] PMI response: cmd=get_result rc=0 msg=success 
value=(vector,(0,2,2))
[proxy:0:1 at y12] got pmi command (from 5): get_appnum

[proxy:0:1 at y12] PMI response: cmd=appnum appnum=0
[proxy:0:1 at y12] got pmi command (from 4): put
kvsname=kvs_15511_0 key=sharedFilename[2] 
value=/dev/shm/mpich_shar_tmpAqIkNK
[proxy:0:1 at y12] we don't understand this command put; forwarding upstream
[proxy:0:1 at y12] got pmi command (from 5): get_my_kvsname

[proxy:0:1 at y12] PMI response: cmd=my_kvsname kvsname=kvs_15511_0
[proxy:0:1 at y12] got pmi command (from 5): get_my_kvsname

[proxy:0:1 at y12] PMI response: cmd=my_kvsname kvsname=kvs_15511_0
[proxy:0:1 at y12] got pmi command (from 5): get
kvsname=kvs_15511_0 key=PMI_process_mapping
[proxy:0:1 at y12] PMI response: cmd=get_result rc=0 msg=success 
value=(vector,(0,2,2))
[proxy:0:1 at y12] got pmi command (from 5): barrier_in

[mpiexec at y23] [pgid: 0] got PMI command: cmd=put kvsname=kvs_15511_0 
key=sharedFilename[2] value=/dev/shm/mpich_shar_tmpAqIkNK
[mpiexec at y23] PMI response to fd 0 pid 4: cmd=put_result rc=0 msg=success
[mpiexec at y23] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at y23] PMI response to fd 6 pid 4: cmd=barrier_out
[mpiexec at y23] PMI response to fd 0 pid 4: cmd=barrier_out
[proxy:0:0 at y23] PMI response: cmd=barrier_out
[proxy:0:0 at y23] PMI response: cmd=barrier_out
[proxy:0:0 at y23] got pmi command (from 9): get
kvsname=kvs_15511_0 key=sharedFilename[0]
[proxy:0:0 at y23] forwarding command (cmd=get kvsname=kvs_15511_0 
key=sharedFilename[0]) upstream
[proxy:0:1 at y12] we don't understand the response put_result; forwarding 
downstream
[proxy:0:1 at y12] got pmi command (from 4): barrier_in

[proxy:0:1 at y12] forwarding command (cmd=barrier_in) upstream
[proxy:0:1 at y12] PMI response: cmd=barrier_out
[proxy:0:1 at y12] PMI response: cmd=barrier_out
[proxy:0:1 at y12] got pmi command (from 5): get
kvsname=kvs_15511_0 key=sharedFilename[2]
[proxy:0:1 at y12] forwarding command (cmd=get kvsname=kvs_15511_0 
key=sharedFilename[2]) upstream
[mpiexec at y23] [pgid: 0] got PMI command: cmd=get kvsname=kvs_15511_0 
key=sharedFilename[0]
[mpiexec at y23] PMI response to fd 6 pid 9: cmd=get_result rc=0 
msg=success value=/dev/shm/mpich_shar_tmpWjm2Xo
[mpiexec at y23] [pgid: 0] got PMI command: cmd=get kvsname=kvs_15511_0 
key=sharedFilename[2]
[mpiexec at y23] PMI response to fd 0 pid 5: cmd=get_result rc=0 
msg=success value=/dev/shm/mpich_shar_tmpAqIkNK
[proxy:0:0 at y23] we don't understand the response get_result; forwarding 
downstream
[proxy:0:0 at y23] got pmi command (from 6): put
kvsname=kvs_15511_0 key=P0-businesscard 
value=description#y23$port#53836$ifname#10.143.41.63$
[proxy:0:0 at y23] we don't understand this command put; forwarding upstream
[proxy:0:0 at y23] got pmi command (from 9): put
kvsname=kvs_15511_0 key=P1-businesscard 
value=description#y23$port#38784$ifname#10.143.41.63$
[proxy:0:0 at y23] we don't understand this command put; forwarding upstream
[mpiexec at y23] [pgid: 0] got PMI command: cmd=put kvsname=kvs_15511_0 
key=P0-businesscard value=description#y23$port#53836$ifname#10.143.41.63$
[mpiexec at y23] PMI response to fd 6 pid 6: cmd=put_result rc=0 msg=success
[proxy:0:1 at y12] we don't understand the response get_result; forwarding 
downstream
[proxy:0:1 at y12] got pmi command (from 4): put
kvsname=kvs_15511_0 key=P2-businesscard 
value=description#y12$port#47251$ifname#10.143.41.52$
[proxy:0:1 at y12] we don't understand this command put; forwarding upstream
[proxy:0:1 at y12] got pmi command (from 5): put
kvsname=kvs_15511_0 key=P3-businesscard 
value=description#y12$port#55610$ifname#10.143.41.52$
[proxy:0:1 at y12] we don't understand this command put; forwarding upstream
[mpiexec at y23] [pgid: 0] got PMI command: cmd=put kvsname=kvs_15511_0 
key=P1-businesscard value=description#y23$port#38784$ifname#10.143.41.63$
[mpiexec at y23] PMI response to fd 6 pid 9: cmd=put_result rc=0 msg=success
[mpiexec at y23] [pgid: 0] got PMI command: cmd=put kvsname=kvs_15511_0 
key=P2-businesscard value=description#y12$port#47251$ifname#10.143.41.52$
[mpiexec at y23] PMI response to fd 0 pid 4: cmd=put_result rc=0 msg=success
[proxy:0:0 at y23] we don't understand the response put_result; forwarding 
downstream
[proxy:0:0 at y23] got pmi command (from 6): barrier_in

[proxy:0:0 at y23] we don't understand the response put_result; forwarding 
downstream
[proxy:0:0 at y23] got pmi command (from 9): barrier_in

[proxy:0:0 at y23] forwarding command (cmd=barrier_in) upstream
[mpiexec at y23] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at y23] [pgid: 0] got PMI command: cmd=put kvsname=kvs_15511_0 
key=P3-businesscard value=description#y12$port#55610$ifname#10.143.41.52$
[mpiexec at y23] PMI response to fd 0 pid 5: cmd=put_result rc=0 msg=success
[proxy:0:1 at y12] we don't understand the response put_result; forwarding 
downstream
[proxy:0:1 at y12] got pmi command (from 4): barrier_in

[proxy:0:1 at y12] we don't understand the response put_result; forwarding 
downstream
[proxy:0:1 at y12] got pmi command (from 5): barrier_in

[proxy:0:1 at y12] forwarding command (cmd=barrier_in) upstream
[mpiexec at y23] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at y23] PMI response to fd 6 pid 5: cmd=barrier_out
[mpiexec at y23] PMI response to fd 0 pid 5: cmd=barrier_out
[proxy:0:0 at y23] PMI response: cmd=barrier_out
[proxy:0:0 at y23] PMI response: cmd=barrier_out
Process 0 of 4 is on y23
Process 1 of 4 is on y23
[proxy:0:0 at y23] got pmi command (from 6): get
kvsname=kvs_15511_0 key=P2-businesscard
[proxy:0:0 at y23] forwarding command (cmd=get kvsname=kvs_15511_0 
key=P2-businesscard) upstream
[mpiexec at y23] [pgid: 0] got PMI command: cmd=get kvsname=kvs_15511_0 
key=P2-businesscard
[mpiexec at y23] PMI response to fd 6 pid 6: cmd=get_result rc=0 
msg=success value=description#y12$port#47251$ifname#10.143.41.52$
[proxy:0:1 at y12] PMI response: cmd=barrier_out
[proxy:0:1 at y12] PMI response: cmd=barrier_out
Process 2 of 4 is on y12
Process 3 of 4 is on y12
[proxy:0:0 at y23] we don't understand the response get_result; forwarding 
downstream
pi is approximately 3.1415926535897682, Error is 0.0000000000000249
wall clock time = 1.991206
[proxy:0:0 at y23] got pmi command (from 9): barrier_in

[proxy:0:0 at y23] got pmi command (from 6): barrier_in

[proxy:0:0 at y23] forwarding command (cmd=barrier_in) upstream
[proxy:0:1 at y12] got pmi command (from 5): barrier_in

[proxy:0:1 at y12] got pmi command (from 4): barrier_in

[proxy:0:1 at y12] forwarding command (cmd=barrier_in) upstream
[mpiexec at y23] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at y23] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at y23] PMI response to fd 6 pid 4: cmd=barrier_out
[mpiexec at y23] PMI response to fd 0 pid 4: cmd=barrier_out
[proxy:0:0 at y23] PMI response: cmd=barrier_out
[proxy:0:0 at y23] PMI response: cmd=barrier_out
[proxy:0:0 at y23] got pmi command (from 6): finalize

[proxy:0:0 at y23] PMI response: cmd=finalize_ack
[proxy:0:0 at y23] got pmi command (from 9): finalize

[proxy:0:0 at y23] PMI response: cmd=finalize_ack
[proxy:0:1 at y12] PMI response: cmd=barrier_out
[proxy:0:1 at y12] PMI response: cmd=barrier_out
[proxy:0:1 at y12] got pmi command (from 4): finalize

[proxy:0:1 at y12] PMI response: cmd=finalize_ack
[proxy:0:1 at y12] got pmi command (from 5): finalize

[proxy:0:1 at y12] PMI response: cmd=finalize_ack

-----------------------------------------------------------------------------------------------------

On the cluster on which it doesn't work:

mpiexec options:
----------------
  Base path: /installadmin/mpich2/test/intel/bin/
  Bootstrap server: (null)
  Debug level: 1
  Enable X: -1

  Global environment:
  -------------------
    REMOTEHOST=ZID178.KFUNIGRAZ.AC.AT
    MANPATH=/installadmin/sge/man:/software/mpich2/test/intel/share/man:/software/intel/intel_fce_111/man:/software/intel/intel_cce_111/man:/installadmin/sge/man:/usr/share/man/en:/usr/share/man:/usr/local/share/man
    CONSOLE=/dev/console
    SELINUX_INIT=YES
    INTEL_LICENSE_FILE=/software/intel/intel_fce_111/licenses:/opt/intel/licenses:/usr/people/edvz/winkl/intel/licenses:/software/intel/intel_cce_111/licenses:/software/intel/licenses:/usr/people/edvz/winkl/intel/licenses
    HOST=b00
    TERM=xterm
    HISTSIZE=1000
    SSH_CLIENT=143.50.128.178 36871 22
    SSH_TTY=/dev/pts/0
    GROUP=edvz
    LD_LIBRARY_PATH=/installadmin/mpich2/test/intel/lib:/software/intel/intel_fce_111/lib/intel64:/software/intel/intel_cce_111/lib/intel64
    LS_COLORS=no
    INIT_VERSION=sysvinit-2.86
    HOSTTYPE=x86_64-linux
    AUTOBOOT=YES
    MAIL=/var/spool/mail/winkl
    runlevel=3
    RUNLEVEL=3
    INPUTRC=/etc/inputrc
    PWD=/usr/people/edvz/winkl/MPI-Test
    SGE_ACCOUNT=sge
    LANG=en_US.UTF-8
    previous=N
    PREVLEVEL=N
    REQNAME=test_nodes.b2
    SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
    MPI=/installadmin/mpich2/test/intel
    SHLVL=2
    SGE_CWD_PATH=/usr/people/edvz/winkl/MPI-Test
    OSTYPE=linux
    BOOT_IMAGE=2.6.18-194.11.3
    MPIHOME=/installadmin/mpich2/test/intel
    VENDOR=unknown
    MACHTYPE=x86_64
    CVS_RSH=ssh
    SSH_CONNECTION=143.50.128.178 36871 143.50.10.40 22
    LESSOPEN=|/usr/bin/lesspipe.sh %s
    G_BROKEN_FILENAMES=1
    _=/installadmin/mpich2/test/intel/bin/mpiexec


    Proxy information:
    *********************
      Proxy ID:  1
      -----------------
        Proxy name: b72
        Process count: 2
        Start PID: 0

        Proxy exec list:
        ....................
          Exec: ./cpitest.x; Process count: 2
      Proxy ID:  2
      -----------------
        Proxy name: b60
        Process count: 2
        Start PID: 2

        Proxy exec list:
        ....................
          Exec: ./cpitest.x; Process count: 2

==================================================================================================

[mpiexec at b72] Timeout set to -1 (-1 means infinite)
[mpiexec at b72] Got a control port string of b72:53271

Proxy launch args: /installadmin/mpich2/test/intel/bin/hydra_pmi_proxy 
--control-port b72:53271 --debug --demux poll --pgid 0 --enable-stdin 1 
--proxy-id

[mpiexec at b72] PMI FD: (null); PMI PORT: (null); PMI ID/RANK: -1
Arguments being passed to proxy 0:
--version 1.3b1 --interface-env-name MPICH_INTERFACE_NAME --hostname b72 
--global-core-count 4 --global-process-count 4 --auto-cleanup 1 
--pmi-rank -1 --pmi-kvsname kvs_3249_0 --pmi-process-mapping 
(vector,(0,2,2)) --global-inherited-env 40 
'REMOTEHOST=ZID178.KFUNIGRAZ.AC.AT' 
'MANPATH=/installadmin/sge/man:/software/mpich2/test/intel/share/man:/software/intel/intel_fce_111/man:/software/intel/intel_cce_111/man:/installadmin/sge/man:/usr/share/man/en:/usr/share/man:/usr/local/share/man' 
'CONSOLE=/dev/console' 'SELINUX_INIT=YES' 
'INTEL_LICENSE_FILE=/software/intel/intel_fce_111/licenses:/opt/intel/licenses:/usr/people/edvz/winkl/intel/licenses:/software/intel/intel_cce_111/licenses:/software/intel/licenses:/usr/people/edvz/winkl/intel/licenses' 
'HOST=b00' 'TERM=xterm' 'HISTSIZE=1000' 'SSH_CLIENT=143.50.128.178 36871 
22' 'SSH_TTY=/dev/pts/0' 'GROUP=edvz' 
'LD_LIBRARY_PATH=/installadmin/mpich2/test/intel/lib:/software/intel/intel_fce_111/lib/intel64:/software/intel/intel_cce_111/lib/intel64' 
'LS_COLORS=no' 'INIT_VERSION=sysvinit-2.86' 'HOSTTYPE=x86_64-linux' 
'AUTOBOOT=YES' 'MAIL=/var/spool/mail/winkl' 'runlevel=3' 'RUNLEVEL=3' 
'INPUTRC=/etc/inputrc' 'PWD=/usr/people/edvz/winkl/MPI-Test' 
'SGE_ACCOUNT=sge' 'LANG=en_US.UTF-8' 'previous=N' 'PREVLEVEL=N' 
'REQNAME=test_nodes.b2' 
'SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass' 
'MPI=/installadmin/mpich2/test/intel' 'SHLVL=2' 
'SGE_CWD_PATH=/usr/people/edvz/winkl/MPI-Test' 'OSTYPE=linux' 
'BOOT_IMAGE=2.6.18-194.11.3' 'MPIHOME=/installadmin/mpich2/test/intel' 
'VENDOR=unknown' 'MACHTYPE=x86_64' 'CVS_RSH=ssh' 
'SSH_CONNECTION=143.50.128.178 36871 143.50.10.40 22' 
'LESSOPEN=|/usr/bin/lesspipe.sh %s' 'G_BROKEN_FILENAMES=1' 
'_=/installadmin/mpich2/test/intel/bin/mpiexec' --global-user-env 0 
--global-system-env 0 --start-pid 0 --proxy-core-count 2 --exec 
--exec-appnum 0 --exec-proc-count 2 --exec-local-env 0 --exec-wdir 
/usr/people/edvz/winkl/MPI-Test --exec-args 1 ./cpitest.x

[mpiexec at b72] PMI FD: (null); PMI PORT: (null); PMI ID/RANK: -1
Arguments being passed to proxy 1:
--version 1.3b1 --interface-env-name MPICH_INTERFACE_NAME --hostname b60 
--global-core-count 4 --global-process-count 4 --auto-cleanup 1 
--pmi-rank -1 --pmi-kvsname kvs_3249_0 --pmi-process-mapping 
(vector,(0,2,2)) --global-inherited-env 40 
'REMOTEHOST=ZID178.KFUNIGRAZ.AC.AT' 
'MANPATH=/installadmin/sge/man:/software/mpich2/test/intel/share/man:/software/intel/intel_fce_111/man:/software/intel/intel_cce_111/man:/installadmin/sge/man:/usr/share/man/en:/usr/share/man:/usr/local/share/man' 
'CONSOLE=/dev/console' 'SELINUX_INIT=YES' 
'INTEL_LICENSE_FILE=/software/intel/intel_fce_111/licenses:/opt/intel/licenses:/usr/people/edvz/winkl/intel/licenses:/software/intel/intel_cce_111/licenses:/software/intel/licenses:/usr/people/edvz/winkl/intel/licenses' 
'HOST=b00' 'TERM=xterm' 'HISTSIZE=1000' 'SSH_CLIENT=143.50.128.178 36871 
22' 'SSH_TTY=/dev/pts/0' 'GROUP=edvz' 
'LD_LIBRARY_PATH=/installadmin/mpich2/test/intel/lib:/software/intel/intel_fce_111/lib/intel64:/software/intel/intel_cce_111/lib/intel64' 
'LS_COLORS=no' 'INIT_VERSION=sysvinit-2.86' 'HOSTTYPE=x86_64-linux' 
'AUTOBOOT=YES' 'MAIL=/var/spool/mail/winkl' 'runlevel=3' 'RUNLEVEL=3' 
'INPUTRC=/etc/inputrc' 'PWD=/usr/people/edvz/winkl/MPI-Test' 
'SGE_ACCOUNT=sge' 'LANG=en_US.UTF-8' 'previous=N' 'PREVLEVEL=N' 
'REQNAME=test_nodes.b2' 
'SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass' 
'MPI=/installadmin/mpich2/test/intel' 'SHLVL=2' 
'SGE_CWD_PATH=/usr/people/edvz/winkl/MPI-Test' 'OSTYPE=linux' 
'BOOT_IMAGE=2.6.18-194.11.3' 'MPIHOME=/installadmin/mpich2/test/intel' 
'VENDOR=unknown' 'MACHTYPE=x86_64' 'CVS_RSH=ssh' 
'SSH_CONNECTION=143.50.128.178 36871 143.50.10.40 22' 
'LESSOPEN=|/usr/bin/lesspipe.sh %s' 'G_BROKEN_FILENAMES=1' 
'_=/installadmin/mpich2/test/intel/bin/mpiexec' --global-user-env 0 
--global-system-env 0 --start-pid 2 --proxy-core-count 2 --exec 
--exec-appnum 0 --exec-proc-count 2 --exec-local-env 0 --exec-wdir 
/usr/people/edvz/winkl/MPI-Test --exec-args 1 ./cpitest.x

[mpiexec at b72] Launch arguments: 
/installadmin/mpich2/test/intel/bin/hydra_pmi_proxy --control-port 
b72:53271 --debug --demux poll --pgid 0 --enable-stdin 1 --proxy-id 0
[mpiexec at b72] Launch arguments: /installadmin/sge/bin/lx24-amd64/qrsh 
-inherit -V b60 /installadmin/mpich2/test/intel/bin/hydra_pmi_proxy 
--control-port b72:53271 --debug --demux poll --pgid 0 --enable-stdin 1 
--proxy-id 1
[proxy:0:0 at b72] got pmi command (from 6): init
pmi_version=1 pmi_subversion=1
[proxy:0:0 at b72] PMI response: cmd=response_to_init pmi_version=1 
pmi_subversion=1 rc=0
[proxy:0:0 at b72] got pmi command (from 9): init
pmi_version=1 pmi_subversion=1
[proxy:0:0 at b72] PMI response: cmd=response_to_init pmi_version=1 
pmi_subversion=1 rc=0
[proxy:0:0 at b72] got pmi command (from 9): get_maxes

[proxy:0:0 at b72] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 
vallen_max=1024
[proxy:0:0 at b72] got pmi command (from 6): get_maxes

[proxy:0:0 at b72] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 
vallen_max=1024
[proxy:0:0 at b72] got pmi command (from 9): get_appnum

[proxy:0:0 at b72] PMI response: cmd=appnum appnum=0
[proxy:0:0 at b72] got pmi command (from 9): get_my_kvsname

[proxy:0:0 at b72] PMI response: cmd=my_kvsname kvsname=kvs_3249_0
[proxy:0:0 at b72] got pmi command (from 6): get_appnum

[proxy:0:0 at b72] PMI response: cmd=appnum appnum=0
[proxy:0:0 at b72] got pmi command (from 9): get_my_kvsname

[proxy:0:0 at b72] PMI response: cmd=my_kvsname kvsname=kvs_3249_0
[proxy:0:0 at b72] got pmi command (from 9): get
kvsname=kvs_3249_0 key=PMI_process_mapping
[proxy:0:0 at b72] PMI response: cmd=get_result rc=0 msg=success 
value=(vector,(0,2,2))
[proxy:0:0 at b72] got pmi command (from 6): get_my_kvsname

[proxy:0:0 at b72] PMI response: cmd=my_kvsname kvsname=kvs_3249_0
[proxy:0:0 at b72] got pmi command (from 9): barrier_in

[proxy:0:0 at b72] got pmi command (from 6): get_my_kvsname

[proxy:0:0 at b72] PMI response: cmd=my_kvsname kvsname=kvs_3249_0
[proxy:0:0 at b72] got pmi command (from 6): get
kvsname=kvs_3249_0 key=PMI_process_mapping
[proxy:0:0 at b72] PMI response: cmd=get_result rc=0 msg=success 
value=(vector,(0,2,2))
[proxy:0:0 at b72] got pmi command (from 6): put
kvsname=kvs_3249_0 key=sharedFilename[0] value=/dev/shm/mpich_shar_tmp1BFE87
[proxy:0:0 at b72] we don't understand this command put; forwarding upstream
[mpiexec at b72] [pgid: 0] got PMI command: cmd=put kvsname=kvs_3249_0 
key=sharedFilename[0] value=/dev/shm/mpich_shar_tmp1BFE87
[mpiexec at b72] PMI response to fd 6 pid 6: cmd=put_result rc=0 msg=success
[proxy:0:0 at b72] we don't understand the response put_result; forwarding 
downstream
[proxy:0:0 at b72] got pmi command (from 6): barrier_in

[proxy:0:0 at b72] forwarding command (cmd=barrier_in) upstream
[mpiexec at b72] [pgid: 0] got PMI command: cmd=barrier_in

- Note: There is no output from host b60 (the second participating host),
because no process was created on it.
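The qrsh step that Hydra issues for b60 can also be tried by hand from inside
a job on b72 to see its error directly; a minimal check along these lines,
with "hostname" only standing in for hydra_pmi_proxy so that the qrsh -inherit
transport itself is exercised:

  # run from within the SGE job script on b72 (the first node of the allocation);
  # "hostname" is just a stand-in command to test qrsh -inherit itself
  /installadmin/sge/bin/lx24-amd64/qrsh -inherit -V b60 hostname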

Ursula

