[Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
Ian Foster
foster at mcs.anl.gov
Tue Jul 17 21:43:52 CDT 2007
Another (perhaps dumb?) question: it would seem desirable that we be
able to quickly determine which tasks failed and then (attempt to)
rerun them in such circumstances.
Here it seems that a lot of effort is required just to determine which
tasks failed, and I am not sure that the information extracted is
enough to rerun them.
It also seems that we can't easily determine which output files are missing.
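For example, a minimal sketch along these lines (the output directory
layout and the fe_solv_m1 .. fe_solv_m244 naming are assumptions taken
from the listing quoted below) would report which per-molecule outputs
never appeared:

import os
import sys

# Minimal sketch: report which expected per-molecule outputs are missing.
# Assumes outputs named fe_solv_m1 .. fe_solv_m244 in a single directory;
# adjust the prefix, numbering, and path to match the actual run.
def missing_outputs(outdir, n_molecules=244, prefix="fe_solv_m"):
    expected = {prefix + str(i) for i in range(1, n_molecules + 1)}
    present = set(os.listdir(outdir))
    return sorted(expected - present, key=lambda n: int(n[len(prefix):]))

if __name__ == "__main__":
    outdir = sys.argv[1] if len(sys.argv) > 1 else "."
    for name in missing_outputs(outdir):
        print("missing:", name)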
Ian.
Ian Foster wrote:
> Ioan:
>
> a) I think this information should be in the bugzilla summary,
> according to our processes?
>
> b) Why did it take so long to get all of the workers working?
>
> c) Can we debug using less than O(800) node hours?
>
> Ian.
>
> bugzilla-daemon at mcs.anl.gov wrote:
>> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72
>>
>> ------- Comment #24 from iraicu at cs.uchicago.edu 2007-07-17 16:08 -------
>> So the latest MolDyn 244-molecule run also failed... but I think it
>> made it all the way to the final few jobs...
>>
>> The place where I put all the information about the run is at:
>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/
>>
>>
>> Here are the graphs:
>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/summary_graph_med.jpg
>>
>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/task_graph_med.jpg
>>
>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/executor_graph_med.jpg
>>
>>
>> The Swift log can be found at:
>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/logs/swift/MolDyn-244-ja4ya01d6cti1.log
>>
>>
>> The Falkon logs are at:
>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/logs/falkon/
>>
>>
>> The 244 mol run was supposed to have 20497 tasks, broken down as
>> follows (count x tasks each = total):
>>
>>      1 x   1 =     1
>>      1 x 244 =   244
>>      1 x 244 =   244
>>     68 x 244 = 16592
>>      1 x 244 =   244
>>     11 x 244 =  2684
>>      1 x 244 =   244
>>      1 x 244 =   244
>>   ======================
>>   total:        20497
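For reference, the breakdown above works out to 84 tasks per molecule
plus one initial task: 1 + 244 * (1 + 1 + 68 + 1 + 11 + 1 + 1) =
1 + 244 * 84 = 20497. A minimal sketch of that check (treating the
single-task row as a one-off setup step is an assumption):

# Sketch: expected task count for an N-molecule run, using the
# per-row multiplicities from the breakdown above.
def expected_tasks(n_molecules):
    per_molecule = 1 + 1 + 68 + 1 + 11 + 1 + 1   # = 84 tasks per molecule
    return 1 + n_molecules * per_molecule        # plus 1 initial task

assert expected_tasks(244) == 20497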
>>
>> We had 20495 tasks that exited with an exit code of 0, and 6 tasks
>> that exited with an exit code of -3. The worker logs show nothing on
>> the stdout or stderr of the failed jobs. I looked online for what an
>> exit code of -3 could mean, but didn't find anything.
>> Here are the 6 failed tasks:
>> Executing task urn:0-9408-1184616132483... Building executable
>> command...Executing: /bin/sh shared/wrapper.sh fepl-zqtloeei
>> fe_stdout_m112
>> stderr.txt wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
>> solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
>> solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
>> solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
>> solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
>> fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>> --resultonly --wham_outputs wf_m112 --solv_lrc_file
>> solv_chg_a10_m112_done
>> --fe_file fe_solv_m112 Task urn:0-9408-1184616132483 completed with
>> exit code -3 in 238 ms
>>
>> Executing task urn:0-9408-1184616133199... Building executable
>> command...Executing: /bin/sh shared/wrapper.sh fepl-2rtloeei
>> fe_stdout_m112
>> stderr.txt wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
>> solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
>> solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
>> solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
>> solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
>> fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>> --resultonly --wham_outputs wf_m112 --solv_lrc_file
>> solv_chg_a10_m112_done
>> --fe_file fe_solv_m112 Task urn:0-9408-1184616133199 completed with
>> exit code -3 in 201 ms
>>
>> Executing task urn:0-15036-1184616133342... Building executable
>> command...Executing: /bin/sh shared/wrapper.sh fepl-5rtloeei
>> fe_stdout_m179
>> stderr.txt wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
>> solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
>> solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
>> solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
>> solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
>> fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>> --resultonly --wham_outputs wf_m179 --solv_lrc_file
>> solv_chg_a10_m179_done
>> --fe_file fe_solv_m179 Task urn:0-15036-1184616133342 completed with
>> exit code -3 in 267 ms
>>
>> Executing task urn:0-15036-1184616133628... Building executable
>> command...Executing: /bin/sh shared/wrapper.sh fepl-9rtloeei
>> fe_stdout_m179
>> stderr.txt wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
>> solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
>> solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
>> solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
>> solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
>> fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>> --resultonly --wham_outputs wf_m179 --solv_lrc_file
>> solv_chg_a10_m179_done
>> --fe_file fe_solv_m179 Task urn:0-15036-1184616133628 completed with
>> exit code -3 in 2368 ms
>>
>> Executing task urn:0-15036-1184616133528... Building executable
>> command...Executing: /bin/sh shared/wrapper.sh fepl-8rtloeei
>> fe_stdout_m179
>> stderr.txt wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
>> solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
>> solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
>> solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
>> solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
>> fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>> --resultonly --wham_outputs wf_m179 --solv_lrc_file
>> solv_chg_a10_m179_done
>> --fe_file fe_solv_m179 Task urn:0-15036-1184616133528 completed with
>> exit code -3 in 311 ms
>>
>> Executing task urn:0-9408-1184616130688... Building executable
>> command...Executing: /bin/sh shared/wrapper.sh fepl-9ptloeei
>> fe_stdout_m112
>> stderr.txt wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
>> solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
>> solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
>> solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
>> solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
>> fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>> --resultonly --wham_outputs wf_m112 --solv_lrc_file
>> solv_chg_a10_m112_done
>> --fe_file fe_solv_m112 Task urn:0-9408-1184616130688 completed with
>> exit code -3 in 464 ms
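The six excerpts above all follow the same "Task <urn> completed with
exit code <N> in <T> ms" pattern, so every non-zero exit can be pulled
out in a single pass over the worker logs. A minimal sketch (the log
location and the exact line format are assumptions based on the
excerpts above):

import re
import sys
from collections import Counter, defaultdict

# Sketch: summarize task exit codes from Falkon worker logs.
# Assumes lines of the form "Task <urn> completed with exit code <N> in <T> ms",
# as in the excerpts above; pass one or more log files on the command line.
LINE_RE = re.compile(r"Task (\S+) completed with exit code (-?\d+) in (\d+) ms")

def summarize(paths):
    counts = Counter()
    failures = defaultdict(list)  # exit code -> list of task URNs
    for path in paths:
        with open(path, errors="replace") as f:
            for line in f:
                m = LINE_RE.search(line)
                if not m:
                    continue
                urn, code = m.group(1), int(m.group(2))
                counts[code] += 1
                if code != 0:
                    failures[code].append(urn)
    return counts, failures

if __name__ == "__main__":
    counts, failures = summarize(sys.argv[1:])
    for code, n in sorted(counts.items()):
        print(f"exit code {code}: {n} tasks")
    for code, urns in failures.items():
        for urn in urns:
            print(f"  failed ({code}): {urn}")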
>>
>>
>> Both the Falkon logs and the Swift logs agree on the number of
>> submitted tasks, the number of successful tasks, and the number of
>> failed tasks. There were no outstanding tasks at the time the
>> workflow failed. BTW, I checked the disk space usage about an hour
>> after the whole experiment finished, and there was plenty of disk
>> space left.
>>
>> Yong mentioned that he looked through the output of MolDyn and found
>> only 242 'fe_solv_*' files, so 2 molecule files were missing... One
>> question for Nika: are the 6 failed tasks the same jobs, resubmitted?
>> Nika, can you add anything more to this? Is there anything else to be
>> learned from the Swift log about why those last few jobs failed?
>> After we have tried to figure out what happened, can we resume the
>> workflow and hopefully finish the last few jobs in another run?
>>
>> Ioan
>>
>>
>>
>
--
Ian Foster, Director, Computation Institute
Argonne National Laboratory & University of Chicago
Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
Tel: +1 630 252 4619. Web: www.ci.uchicago.edu.
Globus Alliance: www.globus.org.