[Swift-devel] Swift run: hanging up when submitting a job

Sun Aug 10 15:58:33 CDT 2008

On Sun, 2008-08-10 at 15:43 -0500, lixi at uchicago.edu wrote:
> Hi,
> 
> Today I ran a workflow of 3000 jobs with replication 
> enabled. 2999 jobs finished successfully, but one job 
> hung. Looking closely at the log file, I found that the 
> hanging job's id is 0-2800, so I ran the following 
> command to check the job:
> 
> [...]
> 2008-08-10 10:46:17,377-0500 DEBUG TaskImpl Task
> (type=JOB_SUBMISSION, identity=urn:0-2800-1-1218380053474) 
> setting status to Submitting
> 2008-08-10 10:46:18,848-0500 DEBUG TaskImpl Task
> (type=JOB_SUBMISSION, identity=urn:0-2800-1-1218380053474) 
> setting status to Submitted
> 2008-08-10 10:46:18,848-0500 DEBUG 
> WeightedHostScoreScheduler Submission time for Task
> (type=JOB_SUBMISSION, identity=urn:0-2800-1-1218380053474): 
> 1471ms. Score delta: -0.024897435897435895
> 2008-08-10 10:46:30,063-0500 DEBUG TaskImpl Task
> (type=JOB_SUBMISSION, identity=urn:0-2800-1-1218380053474) 
> setting status to Active
> 
> From the log file, we can see that the submission of this 
> job wasn't finished.

Actually the job was submitted and it appears to be running.

>  So I think that this is why no 
> replication job was generated for this job, even after such 
> a long time with replication enabled.

Replication only applies while a job is still queued. This job seems to
be running. We're probably looking at the site going bad after the job
started to run, so the notification of the job completing or failing
was never sent.

> 
> This is my understanding. Have I misunderstood anything? 
> If my understanding is right, is there any 
> solution to this kind of situation?

It's not simple. If notification is unreliable, it's impossible to
distinguish between a really long process and a lost notification,
unless there is some information about how long the process should
take.

So one solution would be to make "notifications" more reliable by
polling for the job status. But GRAM makes it really hard to do this
efficiently (each poll for each job involves one full SSL session
establishment).
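
To get a rough feel for why per-job polling is expensive, here is a
back-of-the-envelope sketch in Python. It only counts session
establishments, one per poll per job as described above. The function
name and the 60-second interval are made up for illustration; nothing
here is measured from GRAM.

```python
def handshakes_per_hour(active_jobs: int, poll_interval_s: float) -> float:
    """Estimate SSL session establishments per hour.

    Assumes one full session establishment per poll per job, and that
    every active job is polled once per interval (both assumptions).
    """
    if poll_interval_s <= 0:
        raise ValueError("poll_interval_s must be positive")
    return active_jobs * 3600.0 / poll_interval_s


# Hypothetical numbers: 3000 active jobs, each polled once a minute.
estimate = handshakes_per_hour(active_jobs=3000, poll_interval_s=60)
print(f"{estimate:.0f} session establishments per hour")
```

Lengthening the interval cuts the load proportionally, but it also
delays detection of a finished job by up to one interval.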

The other solution is to put a cap on the process duration. So if the
job has a walltime spec, consider notifications lost if the job doesn't
complete in walltime + some_margin_of_error.
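
The walltime cap could look roughly like the Python sketch below. It is
not Swift or CoG code. The function name, the 30-minute walltime and
the 5-minute margin are assumptions for illustration; only the Active
timestamp comes from the log quoted above.

```python
from datetime import datetime, timedelta


def notification_presumed_lost(active_since: datetime,
                               walltime: timedelta,
                               margin: timedelta,
                               now: datetime) -> bool:
    """True once the job has been Active longer than walltime + margin."""
    return now - active_since > walltime + margin


# Job 0-2800 went Active at 10:46:30 according to the log.
active_since = datetime(2008, 8, 10, 10, 46, 30)
walltime = timedelta(minutes=30)  # hypothetical walltime spec
margin = timedelta(minutes=5)     # hypothetical margin of error

# At 15:43, when the problem was reported, the cap is long past.
print(notification_presumed_lost(active_since, walltime, margin,
                                 now=datetime(2008, 8, 10, 15, 43, 0)))
```

Once this returns True, the scheduler could treat the job as failed and
resubmit it. A job with no walltime spec gets no such cap, which is the
case where the two situations cannot be told apart.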

Mihael