[Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
Ioan Raicu
iraicu at cs.uchicago.edu
Tue Jul 17 22:33:02 CDT 2007
Ian Foster wrote:
> Another (perhaps dumb?) question--it would seem desirable that we be
> able to quickly determine what tasks failed and then (attempt to)
> rerun them in such circumstances.
I think Swift already does this, retrying each failed task up to a
fixed number of times (3 or 5, if I remember right).
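If I remember right, the retry count is configurable in
swift.properties; a minimal sketch (treat the property name here as an
assumption and double-check it against the Swift docs):

  # swift.properties: retry each failed task up to 3 times
  # (property name recalled from memory, not verified)
  execution.retries=3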
>
> Here it seems that a lot of effort is required just to determine what
> tasks failed, and I am not sure that the information extracted is
> enough to rerun them.
The failed tasks are pretty easy to find in the logs based on the exit
code. If we were to do a resume from Swift, I think it would
automatically resubmit just the failed tasks... but unless we figure out
why they failed and fix the problem, they will likely fail again.
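For example, something like this pulls the failures straight out of the
worker logs (a sketch -- the 'completed with exit code' wording matches
the excerpts below, but the worker log file names are my assumption):

  # list every task that finished with a non-zero exit code
  grep 'completed with exit code' worker*.log | grep -v 'exit code 0 '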
>
> It also seems that we can't easily determine which output files are
> missing.
I don't know about this one; maybe Nika can comment on it.
Ioan
>
> Ian.
>
> Ian Foster wrote:
>> Ioan:
>>
>> a) I think this information should be in the bugzilla summary,
>> according to our processes?
>>
>> b) Why did it take so long to get all of the workers working?
>>
>> c) Can we debug using less than O(800) node hours?
>>
>> Ian.
>>
>> bugzilla-daemon at mcs.anl.gov wrote:
>>> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72
>>>
>>> ------- Comment #24 from iraicu at cs.uchicago.edu 2007-07-17 16:08 -------
>>> So the latest MolDyn 244-molecule run also failed... but I think it
>>> made it all the way to the final few jobs.
>>>
>>> I put all the information about the run at:
>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/
>>>
>>> Here are the graphs:
>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/summary_graph_med.jpg
>>>
>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/task_graph_med.jpg
>>>
>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/executor_graph_med.jpg
>>>
>>> The Swift log can be found at:
>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/logs/swift/MolDyn-244-ja4ya01d6cti1.log
>>>
>>> The Falkon logs are at:
>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/logs/falkon/
>>>
>>> The 244-molecule run was supposed to have 20497 tasks, broken down
>>> per stage as follows:
>>>
>>>      1 x   1 =     1
>>>      1 x 244 =   244
>>>      1 x 244 =   244
>>>     68 x 244 = 16592
>>>      1 x 244 =   244
>>>     11 x 244 =  2684
>>>      1 x 244 =   244
>>>      1 x 244 =   244
>>>    ======================
>>>    total:        20497
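>>>
>>> (A quick sanity check on the total -- just the per-stage products
>>> from the table above, summed with bc:)
>>>
>>>   echo '1*1 + 1*244 + 1*244 + 68*244 + 1*244 + 11*244 + 1*244 + 1*244' | bc
>>>   # prints 20497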
>>>
>>> We had 20495 tasks exit with code 0, and 6 tasks exit with code -3.
>>> The worker logs show nothing on the stdout or stderr of the failed
>>> jobs. I looked online for what an exit code of -3 could mean, but
>>> didn't find anything. (Since POSIX exit statuses are unsigned 8-bit
>>> values, a negative code presumably comes from the Falkon/Java layer
>>> rather than from the application itself.) Here are the 6 failed
>>> tasks:
>>> Executing task urn:0-9408-1184616132483... Building executable
>>> command...Executing: /bin/sh shared/wrapper.sh fepl-zqtloeei
>>> fe_stdout_m112
>>> stderr.txt wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
>>> solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
>>> solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
>>> solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
>>> solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
>>> fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>>> --resultonly --wham_outputs wf_m112 --solv_lrc_file
>>> solv_chg_a10_m112_done
>>> --fe_file fe_solv_m112 Task urn:0-9408-1184616132483 completed with
>>> exit code -3 in 238 ms
>>>
>>> Executing task urn:0-9408-1184616133199... Building executable
>>> command...Executing: /bin/sh shared/wrapper.sh fepl-2rtloeei
>>> fe_stdout_m112
>>> stderr.txt wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
>>> solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
>>> solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
>>> solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
>>> solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
>>> fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>>> --resultonly --wham_outputs wf_m112 --solv_lrc_file
>>> solv_chg_a10_m112_done
>>> --fe_file fe_solv_m112 Task urn:0-9408-1184616133199 completed with
>>> exit code -3 in 201 ms
>>>
>>> Executing task urn:0-15036-1184616133342... Building executable
>>> command...Executing: /bin/sh shared/wrapper.sh fepl-5rtloeei
>>> fe_stdout_m179
>>> stderr.txt wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
>>> solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
>>> solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
>>> solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
>>> solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
>>> fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>>> --resultonly --wham_outputs wf_m179 --solv_lrc_file
>>> solv_chg_a10_m179_done
>>> --fe_file fe_solv_m179 Task urn:0-15036-1184616133342 completed with
>>> exit code -3 in 267 ms
>>>
>>> Executing task urn:0-15036-1184616133628... Building executable
>>> command...Executing: /bin/sh shared/wrapper.sh fepl-9rtloeei
>>> fe_stdout_m179
>>> stderr.txt wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
>>> solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
>>> solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
>>> solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
>>> solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
>>> fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>>> --resultonly --wham_outputs wf_m179 --solv_lrc_file
>>> solv_chg_a10_m179_done
>>> --fe_file fe_solv_m179 Task urn:0-15036-1184616133628 completed with
>>> exit code -3 in 2368 ms
>>>
>>> Executing task urn:0-15036-1184616133528... Building executable
>>> command...Executing: /bin/sh shared/wrapper.sh fepl-8rtloeei
>>> fe_stdout_m179
>>> stderr.txt wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
>>> solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
>>> solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
>>> solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
>>> solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
>>> fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>>> --resultonly --wham_outputs wf_m179 --solv_lrc_file
>>> solv_chg_a10_m179_done
>>> --fe_file fe_solv_m179 Task urn:0-15036-1184616133528 completed with
>>> exit code -3 in 311 ms
>>>
>>> Executing task urn:0-9408-1184616130688... Building executable
>>> command...Executing: /bin/sh shared/wrapper.sh fepl-9ptloeei
>>> fe_stdout_m112
>>> stderr.txt wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
>>> solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
>>> solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
>>> solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
>>> solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
>>> fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>>> --resultonly --wham_outputs wf_m112 --solv_lrc_file
>>> solv_chg_a10_m112_done
>>> --fe_file fe_solv_m112 Task urn:0-9408-1184616130688 completed with
>>> exit code -3 in 464 ms
>>>
>>> Both the Falkon logs and the Swift logs agree on the number of
>>> submitted, successful, and failed tasks. There were no outstanding
>>> tasks at the time the workflow failed. BTW, I checked the disk space
>>> usage about an hour after the whole experiment finished, and there
>>> was plenty of disk space left.
>>>
>>> Yong mentioned that he looked through the output of MolDyn and found
>>> only 242 'fe_solv_*' files, so 2 molecules' outputs were missing. One
>>> question for Nika: are the 6 failed tasks the same job, resubmitted?
>>> Nika, can you add anything more to this? Is there anything else to be
>>> learned from the Swift log as to why those last few jobs failed? Once
>>> we have figured out what happened, can we resume the workflow and
>>> hopefully finish the last few jobs in another run?
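>>>
>>> To pin down which 2 molecules are missing, something like this should
>>> work (a sketch -- it assumes the outputs are named fe_solv_m1 ...
>>> fe_solv_m244, matching the fe_solv_m112/fe_solv_m179 names in the
>>> task logs above, and that it is run in the output directory):
>>>
>>>   # print the molecule indices that have no fe_solv output file
>>>   for i in $(seq 1 244); do
>>>     [ -e "fe_solv_m$i" ] || echo "missing: fe_solv_m$i"
>>>   done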
>>>
>>> Ioan
>>>
>>
>
--
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dsl.cs.uchicago.edu/
============================================