[mpich-discuss] SGE & Hydra Problem

Ursula Winkler ursula.winkler at uni-graz.at
Mon Sep 13 03:01:10 CDT 2010


Hi to all,

I'm trying to get mpich2-1.3b1 working with the Hydra process manager on a
Scientific Linux cluster with SGE. Although it sometimes works, most of the
time it fails with this error message:

      error: getting configuration: unable to contact qmaster using port 536 on host "b00"
      error:
      Cannot get configuration from qmaster.
      [mpiexec@b45] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:98): one of the processes terminated badly; timing out
      [mpiexec@b45] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
      [mpiexec@b45] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:325): bootstrap server returned error waiting for completion
      [mpiexec@b45] main (./ui/mpich/mpiexec.c:293): process manager error waiting for completion
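
Since the first error is a failure to contact the qmaster, one way to narrow
this down would be to check from the compute node whether that port is
reachable at all. Below is a minimal Python sketch of such a check. It assumes
that a plain TCP connect is a meaningful test for the qmaster port; the
function name can_connect is hypothetical, and the host "b00" and port 536 are
simply the values from the error message above.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and attempts the connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Example call on a compute node (values taken from the SGE error message):
#   can_connect("b00", 536)
```

Because the failure is intermittent, a single successful connect would not
rule the network out; calling the function repeatedly while a job is failing
would be more telling.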


Job specifications:

Submitted with qsub, requesting 4 cores (2 per host):

   mpiexec -bootstrap rsh -bootstrap-exec rsh -f $TMPDIR/machines -n $NSLOTS ./cpitest.x


$ qstat -u winkl -t
job-ID  prior   name       user  state submit/start at     queue       master ja-task-ID task-ID state cpu      mem     io      stat failed
-------------------------------------------------------------------------------------------------------------------------------------------
 158180 0.51468 test_nodes winkl r     09/10/2010 14:55:48 fastmpi@b45 MASTER                    r     00:02:27 2.54263 0.00000
                                                           fastmpi@b45 SLAVE             1.b45   r     00:00:00 0.00035 0.00000
                                                           fastmpi@b45 SLAVE
 158180 0.51468 test_nodes winkl r     09/10/2010 14:55:48 fastmpi@b46 SLAVE
                                                           fastmpi@b46 SLAVE

On host b45 (Job Master Host):
  31387 sgeadmin  \_ sge_shepherd-158180 -bg
  31420 winkl     |   \_ -sh /installadmin/sge/cluster/spool/b45/job_scripts/158180
  31482 winkl     |       \_ mpiexec -bootstrap rsh -bootstrap-exec rsh -f /tmp/158180.1.fastmpi/machines -n 4 ./cpitest.x
  31483 winkl     |           \_ /installadmin/sge/bin/lx24-amd64/qrsh -inherit b45 /installadmin/mpich2/test/intel/bin/hy
  31507 winkl     |           |   \_ /installadmin/sge/utilbin/lx24-amd64/rsh -p 32913 b45 exec '/installadmin/sge/utilbin
  31509 winkl     |           |       \_ [rsh] <defunct>
  31484 winkl     |           \_ /installadmin/sge/bin/lx24-amd64/qrsh -inherit b46 /installadmin/mpich2/test/intel/bin/hy
  31505 sgeadmin  \_ sge_shepherd-158180 -bg
  31506 root          \_ /installadmin/sge/utilbin/lx24-amd64/rshd -l
  31508 winkl             \_ /installadmin/sge/utilbin/lx24-amd64/qrsh_starter /installadmin/sge/cluster/spool/b45/active_
  31576 winkl                 \_ tcsh -c /installadmin/mpich2/test/intel/bin/hydra_pmi_proxy --control-port b45:57313 --bo
  31643 winkl                     \_ /installadmin/mpich2/test/intel/bin/hydra_pmi_proxy --control-port b45:57313 --bootst
  31644 winkl                         \_ ./cpitest.x
  31645 winkl                         \_ ./cpitest.x


On host b46 (Job Slave Host):
   No job processes are running there, although qstat reports 2 slave tasks on b46.
   
Has anyone seen this problem before, or does anyone have any ideas?

I don't have any trouble with the smpd (daemon) version of mpich2-1.3b1 on the
cluster.

Cheers,
Ursula

