[mpich-discuss] SGE & Hydra Problem

Mon Sep 20 02:01:02 CDT 2010

Pavan Balaji wrote:
> On 09/16/2010 05:17 AM, Ursula Winkler wrote:
>   
>>>> [proxy:0:1 at b46] HYDU_sock_connect (./utils/sock/sock.c:151): connect error (Connection refused)
>>>> [proxy:0:1 at b46] main (./pm/pmiserv/pmip.c:202): unable to connect to server b45 at port 52298 (check for firewalls!)
>>>>
>>>>         
>>> It might be normal. SGE checks whether `qrsh -inherit ...` is allowed to this particular host, i.e. whether the host is in the list of granted slaves, and rejects the call otherwise. Also worth mentioning: "job_is_first_task" controls whether a local `qrsh -inherit ...` call is allowed in addition, since it depends on the parallel library whether it makes a local call or not. I.e. you have N-1 (TRUE) or N (FALSE) `qrsh -inherit ...` calls allowed.
>>>
>>>       
>> I guessed that it must be normal because the other cluster, where Hydra
>> works, gives the same error when the command is placed within the SGE
>> job script. Thank you for the explanation.
>>     
>
> The above error is coming from Hydra. So do you mean that Hydra is 
> throwing an error on the "cluster that works fine" when mpiexec is 
> placed within the SGE job script?
>   

No, when mpiexec is placed within the SGE job script, it works fine on
the second cluster. I meant only that the command
"qrsh -inherit -V ... hydra_pmi_proxy ...", placed directly within the
SGE script, results in the mentioned error message (on both clusters).

> mpiexec should always be called from within the SGE job script.
>
>   

I often test mpiexec without SGE when the queues are full and I have
installed a new version of mpich/mvapich. But the ultimate test, of
course, is with SGE.
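
For reference, the N-1 versus N rule from the quoted explanation can be written down as a small shell helper. This is only a sketch of the arithmetic as stated above: the function name allowed_inherit_calls is made up for illustration, it is not part of SGE or Hydra, and it assumes N is the task count the quoted explanation refers to.

```shell
#!/bin/sh
# Hypothetical helper (not an SGE or Hydra command): prints how many
# `qrsh -inherit` calls are allowed, following the rule quoted above:
#   job_is_first_task TRUE  -> N-1 calls
#   job_is_first_task FALSE -> N calls
allowed_inherit_calls() {
    n=$1
    case "$2" in
        TRUE|true)   echo $((n - 1)) ;;
        FALSE|false) echo "$n" ;;
        *)
            echo "usage: allowed_inherit_calls N TRUE|FALSE" >&2
            return 1
            ;;
    esac
}
```

For example, `allowed_inherit_calls 4 TRUE` covers the case where the local task does not need its own `qrsh -inherit` call, and `allowed_inherit_calls 4 FALSE` the case where the parallel library also makes a local call.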