[mpich-discuss] SGE & Hydra Problem

Ursula Winkler ursula.winkler at uni-graz.at
Thu Sep 16 05:17:11 CDT 2010


Reuti wrote:
> On 16.09.2010 at 10:11, Ursula Winkler wrote:
>
>   
>> <snip>
>> Well, it won't work as long as all participating hosts aren't in the $TMPDIR/machines file.
>> If that's the case then the command doesn't hang and I get the error (again on both clusters):
>>     
>
> What is "If that's the case" - a host not in the hostlist?
>   

Sorry, I meant that the host must be in the host list, of course.
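
A quick way to see which hosts are missing from the list is a small check like the one below. It is only a sketch: it assumes the $TMPDIR/machines file holds one hostname per line (possibly repeated once per slot), and the function name and the host "b47" are made up for illustration.

```python
def missing_hosts(machines_text, required_hosts):
    """Return the hosts from required_hosts that do not appear in the machines file text.

    Assumption: one hostname per line, first whitespace-separated field;
    repeated lines for the same host (one per slot) are fine.
    """
    listed = {line.split()[0] for line in machines_text.splitlines() if line.strip()}
    return [host for host in required_hosts if host not in listed]


# Example with inline text; in a job script one would read $TMPDIR/machines instead.
sample = "b45\nb45\nb46\n"
print(missing_hosts(sample, ["b45", "b46", "b47"]))
```

An empty result means every participating host is in the list.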

>
>   
>> [proxy:0:1 at b46] HYDU_sock_connect (./utils/sock/sock.c:151): connect error (Connection refused)
>> [proxy:0:1 at b46] main (./pm/pmiserv/pmip.c:202): unable to connect to server b45 at port 52298 (check for firewalls!)
>>     
>
> It might be normal. SGE will check whether `qrsh -inherit ...` is allowed to this particular host, i.e. whether it's in the list of granted slaves, and reject the call otherwise. Also worth mentioning: setting "job_is_first_task" to FALSE additionally allows a local call of `qrsh -inherit ...`, since it depends on the parallel library whether it makes a local call or not. I.e. you have N-1 (TRUE) or N (FALSE) `qrsh -inherit ...` calls allowed.
>   
I guessed that it must be normal, because the other cluster where hydra 
works gives the same error when the command is placed within the SGE 
job script. Thank you for the explanation.
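
For the record, the N-1 versus N rule can be written down as a tiny sketch. The function name is hypothetical (it is not part of SGE), and it assumes N means the number of slots granted to the job.

```python
def allowed_qrsh_inherit_calls(granted_slots, job_is_first_task):
    """Number of `qrsh -inherit ...` calls allowed, following the rule quoted above.

    Assumption: granted_slots is the N from the explanation.
    TRUE  -> N-1 calls (the job script itself counts as the first task)
    FALSE -> N calls   (an additional local call is allowed)
    """
    if granted_slots < 1:
        raise ValueError("granted_slots must be at least 1")
    return granted_slots - 1 if job_is_first_task else granted_slots


print(allowed_qrsh_inherit_calls(4, True))
print(allowed_qrsh_inherit_calls(4, False))
```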

Ursula


