[mpich-discuss] SGE & Hydra Problem

Reuti reuti at staff.uni-marburg.de
Thu Sep 16 05:04:01 CDT 2010


On 16.09.2010 at 10:11, Ursula Winkler wrote:

> <snip>
> Well, it won't work as long as all participating hosts aren't in the $TMPDIR/machines file.
> If that's the case then the command doesn't hang and I get the error (again on both clusters):

What does "If that's the case" refer to: a host that is not in the hostlist?


> [proxy:0:1 at b46] HYDU_sock_connect (./utils/sock/sock.c:151): connect error (Connection refused)
> [proxy:0:1 at b46] main (./pm/pmiserv/pmip.c:202): unable to connect to server b45 at port 52298 (check for firewalls!)

It might be normal. SGE checks whether a `qrsh -inherit ...` call to this particular host is allowed, i.e. whether the host is in the list of granted slaves, and rejects the call otherwise. Also worth mentioning: "job_is_first_task" decides whether a local `qrsh -inherit ...` call is allowed in addition, since it depends on the parallel library whether it makes a local call or not. I.e. for N granted slots you have N-1 (TRUE) or N (FALSE) `qrsh -inherit ...` calls allowed.
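
To make the counting concrete, here is a rough Python sketch. It is only an illustration, not SGE code: the function names are made up, and the parser assumes the machines file has one entry per line with the hostname first and an optional ":slots" suffix, which may not match what your PE start script writes.

```python
from typing import Set


def allowed_inherit_calls(granted_slots: int, job_is_first_task: bool) -> int:
    """Return how many `qrsh -inherit` calls are allowed for N granted slots.

    Hypothetical helper: N-1 calls if job_is_first_task is TRUE,
    N calls if it is FALSE.
    """
    if granted_slots < 1:
        raise ValueError("a parallel job needs at least one granted slot")
    return granted_slots - 1 if job_is_first_task else granted_slots


def hosts_in_machines_file(text: str) -> Set[str]:
    """Collect the hostnames listed in the contents of a machines file.

    Assumed format: one entry per line, hostname first, optionally
    followed by ':<slots>'. Blank lines are skipped.
    """
    hosts = set()
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Drop an optional ':<slots>' suffix such as 'b46:2'.
        hosts.add(fields[0].split(":", 1)[0])
    return hosts


if __name__ == "__main__":
    machines = "b45\nb45\nb46:2\n"
    print(sorted(hosts_in_machines_file(machines)))
    print(allowed_inherit_calls(4, job_is_first_task=True))
    print(allowed_inherit_calls(4, job_is_first_task=False))
```

A host missing from the set returned by `hosts_in_machines_file` is one that the startup will not be pointed at via the machines file; whether SGE accepts a `qrsh -inherit ...` to a host is still decided by SGE itself from the granted host list, not by this file.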

-- Reuti

