[mpich-discuss] SGE & Hydra Problem
Pavan Balaji
balaji at mcs.anl.gov
Fri Sep 17 10:53:57 CDT 2010
On 09/16/2010 05:17 AM, Ursula Winkler wrote:
>>> [proxy:0:1@b46] HYDU_sock_connect (./utils/sock/sock.c:151): connect error (Connection refused)
>>> [proxy:0:1@b46] main (./pm/pmiserv/pmip.c:202): unable to connect to server b45 at port 52298 (check for firewalls!)
>>>
>>
>> It might be normal. SGE checks whether `qrsh -inherit ...` is allowed for that particular host, i.e. whether the host is in the list of granted slaves, and rejects the call otherwise. Also worth mentioning: "job_is_first_task" additionally controls whether a local `qrsh -inherit ...` call is allowed, since it depends on the parallel library whether it makes a local call or not. That is, N-1 (TRUE) or N (FALSE) `qrsh -inherit ...` calls are allowed.
>>
> I guessed that it must be normal because the other cluster, where Hydra
> works, gives the same error when the command is placed within the SGE
> job script. Thank you for the explanation.
The above error is coming from Hydra. So do you mean that Hydra is
throwing an error on the "cluster that works fine" when mpiexec is
placed within the SGE job script?
mpiexec should always be called from within the SGE job script.
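
For illustration, a minimal sketch of what such a job script could look like. Everything specific in it is an assumption, not something taken from this thread: the job name, the parallel environment name "mpich", the slot count, the helper function pe_to_hydra_hosts, and the binary ./my_app are all placeholders. It relies on SGE setting $PE_HOSTFILE (lines of the form "host slots queue processor-range") and $NSLOTS inside a parallel job, and on Hydra's mpiexec accepting a host file with "host:slots" entries via -f. How Hydra starts its remote proxies (ssh or `qrsh -inherit ...`) is not covered here.

```shell
#!/bin/sh
# Hypothetical SGE job script. The job name, the PE name "mpich", the slot
# count, and ./my_app are placeholders; adjust them to the local setup.
#$ -N hydra_test
#$ -cwd
#$ -pe mpich 8

# Convert SGE's PE host file (one line per host: "host slots queue range")
# into the "host:slots" format that Hydra's mpiexec reads with -f.
pe_to_hydra_hosts() {
    awk '{ print $1 ":" $2 }' "$1"
}

# Launch only when SGE has actually granted a parallel environment,
# i.e. when this script runs as a job and not interactively.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    hosts="${TMPDIR:-/tmp}/hydra_hosts.$$"
    pe_to_hydra_hosts "$PE_HOSTFILE" > "$hosts"
    mpiexec -f "$hosts" -n "$NSLOTS" ./my_app
    rm -f "$hosts"
fi
```

Submitted with qsub, the script runs on the first granted host, so mpiexec starts inside the allocation and only uses hosts that SGE has handed out.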
-- Pavan
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji