[mpich-discuss] SGE & Hydra Problem

Reuti reuti at staff.uni-marburg.de
Mon Sep 13 12:15:00 CDT 2010


Hi,

On 13.09.2010 at 10:01, Ursula Winkler wrote:

> I'm trying to get mpich2-1.3b1 with the Hydra process manager working
> on a Scientific Linux cluster with SGE. Although it works sometimes,
> most of the time it fails with this error message:

Which version of SGE are you running? Port 536 was used in former  
times; nowadays the official ports are 6444 (sge_qmaster) and 6445  
(sge_execd).
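For reference, you can check which port and qmaster host your installation actually uses with a few commands (a sketch, assuming a standard $SGE_ROOT/$SGE_CELL layout):

```shell
# Port the client tools will use to contact the qmaster, if set explicitly:
echo "$SGE_QMASTER_PORT"

# Otherwise SGE falls back to the services database:
grep sge_qmaster /etc/services   # e.g. "sge_qmaster 6444/tcp"

# Host on which the clients expect the qmaster to run:
cat "$SGE_ROOT/$SGE_CELL/common/act_qmaster"
```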


>     error: getting configuration: unable to contact qmaster using port 536 on host "b00"
>     error:
>     Cannot get configuration from qmaster.

This looks more like a network problem, unrelated to SGE or MPICH2.  
Do you have a firewall on the machines? Do other applications run  
across the nodes? As far as I can see below, SGE is using rsh rather  
than the -builtin- default of newer SGE versions (with which there is  
no rsh/rshd any longer) - nevertheless, your setup should work.
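To rule out a firewall or routing problem, one could probe the qmaster port directly from an execution node (a diagnostic sketch; hostname and port are taken from your error output, the qping path assumes your lx24-amd64 architecture directory):

```shell
# From b45/b46: is the qmaster's TCP port reachable at all?
nc -z -w 3 b00 536 && echo "port reachable" || echo "port blocked or closed"

# SGE's own probe, shipped with the distribution:
$SGE_ROOT/bin/lx24-amd64/qping -info b00 536 qmaster 1
```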


>     [mpiexec at b45] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:98): one of the processes terminated badly; timing out
>     [mpiexec at b45] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
>     [mpiexec at b45] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:325): bootstrap server returned error waiting for completion
>     [mpiexec at b45] main (./ui/mpich/mpiexec.c:293): process manager error waiting for completion
>
>
> Job specifications:
>
> qsub: 4 Cores (2 on one host):
>
>  mpiexec -bootstrap rsh -bootstrap-exec rsh -f $TMPDIR/machines -n $NSLOTS ./cpitest.x
>
>
> $ qstat -u winkl -t
> job-ID  prior   name       user   state submit/start at     queue          master ja-task-ID task-ID state cpu      mem     io      stat failed
> -----------------------------------------------------------------------------------------------------------------------------------------------
> 158180 0.51468 test_nodes winkl    r   09/10/2010 14:55:48 fastmpi at b45  MASTER                   r     00:02:27 2.54263 0.00000
>                                                            fastmpi at b45  SLAVE             1.b45  r     00:00:00 0.00035 0.00000
>                                                            fastmpi at b45  SLAVE
> 158180 0.51468 test_nodes winkl    r   09/10/2010 14:55:48 fastmpi at b46  SLAVE
>                                                            fastmpi at b46  SLAVE
>
> On host b45 (Job Master Host):
> 31387 sgeadmin  \_ sge_shepherd-158180 -bg
> 31420 winkl     |   \_ -sh /installadmin/sge/cluster/spool/b45/job_scripts/158180
> 31482 winkl     |       \_ mpiexec -bootstrap rsh -bootstrap-exec rsh -f /tmp/158180.1.fastmpi/machines -n 4 ./cpitest.x
> 31483 winkl     |           \_ /installadmin/sge/bin/lx24-amd64/qrsh -inherit b45 /installadmin/mpich2/test/intel/bin/hy
> 31507 winkl     |           |   \_ /installadmin/sge/utilbin/lx24-amd64/rsh -p 32913 b45 exec '/installadmin/sge/utilbin
> 31509 winkl     |           |       \_ [rsh] <defunct>
> 31484 winkl     |           \_ /installadmin/sge/bin/lx24-amd64/qrsh -inherit b46 /installadmin/mpich2/test/intel/bin/hy
> 31505 sgeadmin  \_ sge_shepherd-158180 -bg
> 31506 root          \_ /installadmin/sge/utilbin/lx24-amd64/rshd -l
> 31508 winkl             \_ /installadmin/sge/utilbin/lx24-amd64/qrsh_starter /installadmin/sge/cluster/spool/b45/active_
> 31576 winkl                 \_ tcsh -c /installadmin/mpich2/test/intel/bin/hydra_pmi_proxy --control-port b45:57313 --bo
> 31643 winkl                     \_ /installadmin/mpich2/test/intel/bin/hydra_pmi_proxy --control-port b45:57313 --bootst
> 31644 winkl                         \_ ./cpitest.x
> 31645 winkl                         \_ ./cpitest.x

In principle this looks fine, as all the processes are bound to the  
sge_execd. This is what I tried to achieve in:

http://lists.mcs.anl.gov/pipermail/mpich-discuss/2010-August/007678.html

But in the meantime, SGE is supported out-of-the-box by MPICH2: you  
can just issue a plain "mpiexec ./cpitest.x". With a proper request of  
a PE in SGE (/bin/true is sufficient for the start/stop_proc_args),  
Hydra should get the number of slots and the list of nodes  
automatically (in 1.3b1, which you are referring to).
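For reference, a tightly integrated PE along those lines could look like the following (a sketch only; the PE name "mpich2" and the slot count are placeholders, the attribute names are standard sge_pe fields):

```shell
# Shown by "qconf -sp mpich2" after creating the PE with "qconf -ap mpich2":
pe_name            mpich2
slots              64
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true      # no MPD ring needed with Hydra
stop_proc_args     /bin/true
allocation_rule    $round_robin   # or $fill_up / a fixed number per host
control_slaves     TRUE           # allow "qrsh -inherit" on the slave hosts
job_is_first_task  FALSE          # mpiexec itself is not an MPI rank
```

The job is then submitted with e.g. "qsub -pe mpich2 4 job.sh", where job.sh contains nothing more than the plain "mpiexec ./cpitest.x".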

-- Reuti

> On host b46 (Job Slave Host):
>  there are no processes, though SGE reports 2 with qstat.
>  Has anyone seen this problem before, or does anyone have any ideas?
>
> I don't have any trouble with the mpich2-1.3b1 smpd (daemon) version  
> on the cluster.
>
> Cheers,
> Ursula
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss


