[Swift-devel] Swift run: hanging up when submitting a job
lixi at uchicago.edu
Sun Aug 10 15:43:17 CDT 2008
Hi,
Today I ran a workflow of 3000 jobs with replication
enabled. 2999 jobs finished successfully, but one job
hung. Looking closely at the log file, I found that the id
of the hanging job is 0-2800, so I ran the following
command to check it:
[lixi at communicado 3000]$ grep 0-2800 testworkflow-20080810-0953-mlj2nsc4.log
2008-08-10 09:53:53,032-0500 INFO worknode PROCEDURE thread=0-2800 name=worknode
2008-08-10 09:53:54,200-0500 INFO vdl:parameterlog PARAM thread=0-2800 direction=input variable=input provenanceid=tag:benc at ci.uchicago.edu,2008:swift:dataset:20080810-0953-d6p5ul9d:720000000006
2008-08-10 09:53:55,708-0500 INFO vdl:parameterlog PARAM thread=0-2800 direction=output variable=output provenanceid=tag:benc at ci.uchicago.edu,2008:swift:dataset:20080810-0953-d6p5ul9d:720000005789
2008-08-10 09:54:05,612-0500 INFO vdl:execute START thread=0-2800 tr=node10
2008-08-10 10:46:10,044-0500 DEBUG vdl:execute2 THREAD_ASSOCIATION jobid=node10-19x1krxi thread=0-2800-1 host=AGLT2 replicationGroup=fot1krxi
2008-08-10 10:46:15,494-0500 DEBUG TaskImpl Task (type=FILE_OPERATION, identity=urn:0-2800-1-1-1218380053396) setting status to Submitting
2008-08-10 10:46:15,494-0500 DEBUG TaskImpl Task (type=FILE_OPERATION, identity=urn:0-2800-1-1-1218380053396) setting status to Submitted
2008-08-10 10:46:15,494-0500 DEBUG TaskImpl Task (type=FILE_OPERATION, identity=urn:0-2800-1-1-1218380053396) setting status to Active
2008-08-10 10:46:15,494-0500 DEBUG TaskImpl Task (type=FILE_OPERATION, identity=urn:0-2800-1-1-1218380053396) setting status to Completed
2008-08-10 10:46:15,494-0500 INFO LateBindingScheduler Task (type=FILE_OPERATION, identity=urn:0-2800-1-1-1218380053396) Completed. Waiting: 2472, Running: 66. Heap size: 355M, Heap free: 141M, Max heap: 986M
2008-08-10 10:46:17,377-0500 DEBUG TaskImpl Task (type=JOB_SUBMISSION, identity=urn:0-2800-1-1218380053474) setting status to Submitting
2008-08-10 10:46:18,848-0500 DEBUG TaskImpl Task (type=JOB_SUBMISSION, identity=urn:0-2800-1-1218380053474) setting status to Submitted
2008-08-10 10:46:18,848-0500 DEBUG WeightedHostScoreScheduler Submission time for Task (type=JOB_SUBMISSION, identity=urn:0-2800-1-1218380053474): 1471ms. Score delta: -0.024897435897435895
2008-08-10 10:46:30,063-0500 DEBUG TaskImpl Task (type=JOB_SUBMISSION, identity=urn:0-2800-1-1218380053474) setting status to Active
From the log file, we can see that the JOB_SUBMISSION task
for this job never finished: its last recorded status is
Active, and there is no Completed entry. I think this is why
no replication job was generated for it, even after such a
long time with replication enabled.
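One way to confirm this from the log is to record the last
status logged for each task and list the tasks that never
reached Completed. The following is only a minimal sketch in
Python. The regular expression is an assumption based on the
"TaskImpl ... setting status to ..." lines in the excerpt
above, and the function names (last_statuses, unfinished) are
made up for illustration; they are not part of Swift.

```python
import re

# Assumed line shape, taken from the excerpt above (not from Swift docs):
#   <date> <time> DEBUG TaskImpl Task (type=..., identity=...) setting status to <Status>
STATUS_RE = re.compile(
    r"^(?P<ts>\S+ \S+)\s+DEBUG\s+TaskImpl\s+Task\s*"
    r"\(type=(?P<type>\w+), identity=(?P<identity>[^)]+)\)\s+"
    r"setting status to (?P<status>\w+)"
)


def last_statuses(lines):
    """Map each task identity to (type, last status seen, timestamp)."""
    latest = {}
    for line in lines:
        m = STATUS_RE.match(line)
        if m:
            # Later lines overwrite earlier ones, so the final value
            # is the last status logged for that identity.
            latest[m["identity"]] = (m["type"], m["status"], m["ts"])
    return latest


def unfinished(lines, terminal=("Completed",)):
    """Return the tasks whose last logged status is not a terminal one.

    Only "Completed" is treated as terminal by default, because it is
    the only final status visible in the excerpt; pass other names in
    `terminal` if your log uses them.
    """
    return {
        identity: info
        for identity, info in last_statuses(lines).items()
        if info[1] not in terminal
    }
```

Calling unfinished() on the lines of the log file (one log
entry per line) should then report the JOB_SUBMISSION task
urn:0-2800-1-1218380053474 with status Active, if the log
matches the excerpt.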
This is my understanding; please correct me if I have
misunderstood something. If it is right, is there a
solution for this kind of situation? The log file is:
/home/lixi/performancetest/2/application/3000/testworkflow-20080810-0953-mlj2nsc4.log
Thanks,
Xi