[Swift-devel] Swift run: hanging up when submitting a job

lixi at uchicago.edu lixi at uchicago.edu
Sun Aug 10 15:43:17 CDT 2008


Hi,

Today I ran a workflow of 3000 jobs with replication 
enabled. 2999 jobs finished successfully, but one job hung. 
Looking closely at the log file, I found that the ID of the 
hanging job is 0-2800, so I ran the following command to 
check it:

[lixi at communicado 3000]$ grep 0-2800 testworkflow-20080810-0953-mlj2nsc4.log
2008-08-10 09:53:53,032-0500 INFO  worknode PROCEDURE thread=0-2800 name=worknode
2008-08-10 09:53:54,200-0500 INFO  vdl:parameterlog PARAM thread=0-2800 direction=input variable=input provenanceid=tag:benc at ci.uchicago.edu,2008:swift:dataset:20080810-0953-d6p5ul9d:720000000006
2008-08-10 09:53:55,708-0500 INFO  vdl:parameterlog PARAM thread=0-2800 direction=output variable=output provenanceid=tag:benc at ci.uchicago.edu,2008:swift:dataset:20080810-0953-d6p5ul9d:720000005789
2008-08-10 09:54:05,612-0500 INFO  vdl:execute START thread=0-2800 tr=node10
2008-08-10 10:46:10,044-0500 DEBUG vdl:execute2 THREAD_ASSOCIATION jobid=node10-19x1krxi thread=0-2800-1 host=AGLT2 replicationGroup=fot1krxi
2008-08-10 10:46:15,494-0500 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-2800-1-1-1218380053396) setting status to Submitting
2008-08-10 10:46:15,494-0500 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-2800-1-1-1218380053396) setting status to Submitted
2008-08-10 10:46:15,494-0500 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-2800-1-1-1218380053396) setting status to Active
2008-08-10 10:46:15,494-0500 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-2800-1-1-1218380053396) setting status to Completed
2008-08-10 10:46:15,494-0500 INFO  LateBindingScheduler Task(type=FILE_OPERATION, identity=urn:0-2800-1-1-1218380053396) Completed. Waiting: 2472, Running: 66. Heap size: 355M, Heap free: 141M, Max heap: 986M
2008-08-10 10:46:17,377-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-2800-1-1218380053474) setting status to Submitting
2008-08-10 10:46:18,848-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-2800-1-1218380053474) setting status to Submitted
2008-08-10 10:46:18,848-0500 DEBUG WeightedHostScoreScheduler Submission time for Task(type=JOB_SUBMISSION, identity=urn:0-2800-1-1218380053474): 1471ms. Score delta: -0.024897435897435895
2008-08-10 10:46:30,063-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-2800-1-1218380053474) setting status to Active

From the log file, we can see that the submission of this 
job never finished: the last status recorded for it is 
Active. I think this is why no replica was generated for 
this job, even after such a long time with replication 
enabled.
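
The same check can be done for every task in the log rather 
than for one thread at a time. The script below is only a 
rough sketch: it assumes each log record sits on a single 
line and that status changes are logged as "setting status 
to ...", as in the excerpt. The function name 
unfinished_tasks and the set of terminal statuses are my own 
assumptions; only Completed actually appears in this log.

```python
import re

# Matches records such as:
#   ... DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:...) setting status to Active
STATUS_RE = re.compile(
    r"Task\s*\(type=(?P<type>\w+), identity=(?P<id>[^)]+)\)\s+"
    r"setting status to (?P<status>\w+)"
)

# Assumption: statuses after which a task needs no further attention.
# Only "Completed" is seen in the excerpt; "Failed" and "Canceled" are guesses.
TERMINAL = {"Completed", "Failed", "Canceled"}


def unfinished_tasks(lines):
    """Return {identity: (type, last_status)} for tasks whose
    last recorded status is not a terminal one."""
    last = {}
    for line in lines:
        m = STATUS_RE.search(line)
        if m:
            # Later records overwrite earlier ones, so only the last status is kept.
            last[m.group("id")] = (m.group("type"), m.group("status"))
    return {k: v for k, v in last.items() if v[1] not in TERMINAL}


# Four records taken from the excerpt, one per line.
sample = [
    "2008-08-10 10:46:15,494-0500 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-2800-1-1-1218380053396) setting status to Active",
    "2008-08-10 10:46:15,494-0500 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-2800-1-1-1218380053396) setting status to Completed",
    "2008-08-10 10:46:18,848-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-2800-1-1218380053474) setting status to Submitted",
    "2008-08-10 10:46:30,063-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-2800-1-1218380053474) setting status to Active",
]

for ident, (task_type, status) in unfinished_tasks(sample).items():
    print(ident, task_type, status)
```

Run on the four sample records, this reports only the 
JOB_SUBMISSION task, whose last status is Active. To scan a 
whole log, pass an open file object instead of the list.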

This is my understanding; please correct me if I have 
misunderstood something. If it is right, is there a solution 
for this kind of situation? The log file is:
/home/lixi/performancetest/2/application/3000/testworkflow-20080810-0953-mlj2nsc4.log

Thanks,

Xi


