[mpich-discuss] SGE & Hydra Problem
Ursula Winkler
ursula.winkler at uni-graz.at
Mon Sep 20 02:01:02 CDT 2010
Pavan Balaji wrote:
> On 09/16/2010 05:17 AM, Ursula Winkler wrote:
>
>>>> [proxy:0:1 at b46] HYDU_sock_connect (./utils/sock/sock.c:151): connect error (Connection refused)
>>>> [proxy:0:1 at b46] main (./pm/pmiserv/pmip.c:202): unable to connect to server b45 at port 52298 (check for firewalls!)
>>>>
>>>>
>>> It might be normal. SGE checks whether `qrsh -inherit ...` is allowed to this particular host, i.e. whether the host is in the list of granted slaves, and rejects the call otherwise. Also worth mentioning: "job_is_first_task" controls whether a local `qrsh -inherit ...` call is allowed in addition, since it depends on the parallel library whether it makes a local call or not. That is, N-1 (TRUE) or N (FALSE) `qrsh -inherit ...` calls are allowed.
>>>
>>>
>> I guessed that it must be normal because the other cluster, where Hydra
>> works, produces the same error when the command is placed within the SGE
>> job script. Thank you for the explanation.
>>
>
> The above error is coming from Hydra. So do you mean that Hydra is
> throwing an error on the "cluster that works fine" when mpiexec is
> placed within the SGE job script?
>
No, when mpiexec is placed within the SGE job script, it works fine on
the second cluster. I meant only that the bare command
"qrsh -inherit -V ... hydra_pmi_proxy ...", placed within the SGE script,
results in the mentioned error message (on both clusters).
> mpiexec should always be called from within the SGE job script.
>
>
I often test mpiexec without SGE when the queues are full and I have
just installed a new version of mpich/mvapich. But the ultimate test is
of course with SGE.
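
For reference, a minimal sketch of the kind of job script meant above,
with mpiexec called from within the SGE job script. The job name, the
parallel environment name "mpich", the slot count, and the binary
./my_mpi_program are placeholders I made up for illustration, not taken
from either cluster. NSLOTS is set by SGE inside a job; outside SGE it
is unset, so the sketch falls back to a small local run, which matches
the quick test without a queue.

```shell
#!/bin/sh
#$ -N hydra_test        # job name (hypothetical)
#$ -cwd                 # run the job in the submission directory
#$ -V                   # export the submission environment to the job
#$ -pe mpich 8          # "mpich" is an assumed PE name; use your site's PE

# SGE sets NSLOTS to the number of granted slots inside a job.
# Outside SGE it is unset, so fall back to 2 processes for a local test.
NP="${NSLOTS:-2}"
PROG="./my_mpi_program"   # hypothetical MPI binary

echo "launching: mpiexec -n $NP $PROG"

# Only launch if mpiexec is on the PATH and the binary exists,
# so the script does nothing harmful on a machine without MPICH.
if command -v mpiexec >/dev/null 2>&1 && [ -x "$PROG" ]; then
    mpiexec -n "$NP" "$PROG"
fi
```

Submitted with qsub, the same file exercises the SGE path; run directly
with sh, it exercises the path without SGE.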