[mpich-discuss] SGE & Hydra Problem

Pavan Balaji balaji at mcs.anl.gov
Fri Sep 17 10:53:57 CDT 2010


On 09/16/2010 05:17 AM, Ursula Winkler wrote:
>>> [proxy:0:1 at b46] HYDU_sock_connect (./utils/sock/sock.c:151): connect error (Connection refused)
>>> [proxy:0:1 at b46] main (./pm/pmiserv/pmip.c:202): unable to connect to server b45 at port 52298 (check for firewalls!)
>>>
>>
>> It might be normal. SGE checks whether `qrsh -inherit ...` is allowed to this particular host, i.e. whether the host is in the list of granted slaves, and rejects the call otherwise. Also worth mentioning: "job_is_first_task" determines whether a local `qrsh -inherit ...` call is allowed in addition, since it depends on the parallel library whether it makes a local call or not. I.e. you have N-1 (TRUE) or N (FALSE) `qrsh -inherit ...` calls allowed.
>>
> I guessed that it must be normal because the other cluster, where Hydra
> works, gives the same error when the command is placed within the SGE
> job script. Thank you for the explanation.

The above error is coming from Hydra. So do you mean that Hydra is 
throwing an error on the "cluster that works fine" when mpiexec is 
placed within the SGE job script?

mpiexec should always be called from within the SGE job script.
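
To illustrate what "within the SGE job script" could look like, here is a minimal sketch, not a tested recipe for your site. The parallel environment name "mpich", the slot count, the binary name ./my_mpi_app and the helper function in_sge_parallel_job are all assumptions made up for this example; JOB_ID, PE_HOSTFILE and NSLOTS are the variables SGE sets inside a parallel job.

```shell
#!/bin/sh
# Hypothetical parallel environment name and slot count; adjust for your site.
#$ -pe mpich 4
#$ -cwd

# Illustrative helper (not part of SGE or Hydra): succeeds only when SGE
# has set up a parallel job environment for this shell.
in_sge_parallel_job() {
    [ -n "${JOB_ID:-}" ] && [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]
}

if in_sge_parallel_job; then
    # NSLOTS holds the number of slots SGE granted to this job.
    mpiexec -n "$NSLOTS" ./my_mpi_app
else
    echo "not inside an SGE parallel job; submit this script with qsub" >&2
fi
```

Run directly on a login node, the script only prints the message; submitted with qsub, it starts mpiexec on the granted slots.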

  -- Pavan

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji

