[mpich-discuss] SGE & Hydra Problem
Ursula Winkler
ursula.winkler at uni-graz.at
Mon Sep 13 03:01:10 CDT 2010
Hi all,
I'm trying to get mpich2-1.3b1 working with the Hydra process manager
on a Scientific Linux cluster with SGE. Although it sometimes works,
most of the time it fails with this error message:
error: getting configuration: unable to contact qmaster using port 536 on host "b00"
error: Cannot get configuration from qmaster.
[mpiexec@b45] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:98): one of the processes terminated badly; timing out
[mpiexec@b45] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
[mpiexec@b45] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:325): bootstrap server returned error waiting for completion
[mpiexec@b45] main (./ui/mpich/mpiexec.c:293): process manager error waiting for completion
Job specification:
qsub: 4 cores (2 on each host):
mpiexec -bootstrap rsh -bootstrap-exec rsh -f $TMPDIR/machines -n $NSLOTS ./cpitest.x
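To confirm which slot layout the job actually received, the machines file passed to mpiexec can be summarized per host before the run. The following is only a minimal sketch: the helper name summarize_machines is hypothetical, and it assumes the file holds one hostname per line, repeated once per slot (other parallel environment setups may write a different format).

```shell
#!/bin/sh
# Hypothetical helper: print "<host> <slots>" for each host in a machines file.
# Assumption: one hostname per line, repeated once per granted slot.
summarize_machines() {
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}

# Possible use inside the job script, just before the mpiexec line:
#   summarize_machines "$TMPDIR/machines"
```

For the job described here one would expect two hosts with two slots each; any other result would point at the parallel environment setup rather than at Hydra.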
$ qstat -u winkl -t
job-ID  prior    name        user   state  submit/start at      queue        master  ja-task-ID  task-ID  state  cpu       mem      io       stat  failed
---------------------------------------------------------------------------------------------------------------------------------------------------------
158180  0.51468  test_nodes  winkl  r      09/10/2010 14:55:48  fastmpi@b45  MASTER                       r      00:02:27  2.54263  0.00000
                                                                fastmpi@b45  SLAVE               1.b45    r      00:00:00  0.00035  0.00000
                                                                fastmpi@b45  SLAVE
158180  0.51468  test_nodes  winkl  r      09/10/2010 14:55:48  fastmpi@b46  SLAVE
                                                                fastmpi@b46  SLAVE
On host b45 (Job Master Host):
31387 sgeadmin \_ sge_shepherd-158180 -bg
31420 winkl | \_ -sh /installadmin/sge/cluster/spool/b45/job_scripts/158180
31482 winkl | \_ mpiexec -bootstrap rsh -bootstrap-exec rsh -f /tmp/158180.1.fastmpi/machines -n 4 ./cpitest.x
31483 winkl | \_ /installadmin/sge/bin/lx24-amd64/qrsh -inherit b45 /installadmin/mpich2/test/intel/bin/hy
31507 winkl | | \_ /installadmin/sge/utilbin/lx24-amd64/rsh -p 32913 b45 exec '/installadmin/sge/utilbin
31509 winkl | | \_ [rsh] <defunct>
31484 winkl | \_ /installadmin/sge/bin/lx24-amd64/qrsh -inherit b46 /installadmin/mpich2/test/intel/bin/hy
31505 sgeadmin \_ sge_shepherd-158180 -bg
31506 root \_ /installadmin/sge/utilbin/lx24-amd64/rshd -l
31508 winkl \_ /installadmin/sge/utilbin/lx24-amd64/qrsh_starter /installadmin/sge/cluster/spool/b45/active_
31576 winkl \_ tcsh -c /installadmin/mpich2/test/intel/bin/hydra_pmi_proxy --control-port b45:57313 --bo
31643 winkl \_ /installadmin/mpich2/test/intel/bin/hydra_pmi_proxy --control-port b45:57313 --bootst
31644 winkl \_ ./cpitest.x
31645 winkl \_ ./cpitest.x
On host b46 (Job Slave Host):
There are no processes, although qstat reports two SGE tasks there.
Has anyone seen this problem before, or does anyone have any ideas?
I don't have any trouble with the mpich2-1.3b1 smpd (daemon) version on
the cluster.
Cheers,
Ursula