[Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules

Yong Zhao yongzh at cs.uchicago.edu
Tue Jul 17 21:50:12 CDT 2007


We already have a retry mechanism there. I suspect the failed jobs were
retried but failed again. The server-side logs should have something about
which files were missing.
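
For example, a quick way to see which of the expected outputs are missing
would be something like the sketch below (this assumes the fe_solv_m<N>
naming visible in the worker logs, numbered 1 to 244 without zero padding,
and a made-up output directory; both are assumptions, not taken from the
actual run config):

    # Sketch: list which per-molecule fe_solv outputs are missing.
    # The naming convention and directory are assumptions.
    import os

    outdir = "/path/to/moldyn/output"  # hypothetical location
    expected = {f"fe_solv_m{i}" for i in range(1, 245)}  # 244 molecules
    present = set(os.listdir(outdir))
    for name in sorted(expected - present):
        print("missing:", name)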

Yong.

On Tue, 17 Jul 2007, Ian Foster wrote:

> Another (perhaps dumb?) question--it would seem desirable that we be
> able to quickly determine what tasks failed and then (attempt to) rerun
> them in such circumstances.
>
> Here it seems that a lot of effort is required just to determine what
> tasks failed, and I am not sure that the information extracted is enough
> to rerun them.
>
> It also seems that we can't easily determine which output files are missing.
>
> Ian.
>
> Ian Foster wrote:
> > Ioan:
> >
> > a) I think this information should be in the bugzilla summary,
> > according to our processes?
> >
> > b) Why did it take so long to get all of the workers working?
> >
> > c) Can we debug using less than O(800) node hours?
> >
> > Ian.
> >
> > bugzilla-daemon at mcs.anl.gov wrote:
> >> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72
> >>
> >> ------- Comment #24 from iraicu at cs.uchicago.edu  2007-07-17 16:08 -------
> >> So the latest MolDyn 244-mol run also failed... but I think it made it
> >> all the way to the final few jobs...
> >>
> >> All the information about the run is posted at:
> >> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/
> >>
> >> Here are the graphs:
> >> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/summary_graph_med.jpg
> >> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/task_graph_med.jpg
> >> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/executor_graph_med.jpg
> >>
> >> The Swift log can be found at:
> >> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/logs/swift/MolDyn-244-ja4ya01d6cti1.log
> >>
> >> The Falkon logs are at:
> >> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/logs/falkon/
> >>
> >> The 244 mol run was supposed to have 20497 tasks, broken down as
> >> follows (count x tasks each = total):
> >>
> >>      1 x   1 =     1
> >>      1 x 244 =   244
> >>      1 x 244 =   244
> >>     68 x 244 = 16592
> >>      1 x 244 =   244
> >>     11 x 244 =  2684
> >>      1 x 244 =   244
> >>      1 x 244 =   244
> >>     ==================
> >>       total  = 20497
> >>
> >> We had 20495 tasks that exited with an exit code of 0, and 6 tasks that
> >> exited with an exit code of -3.  The worker logs don't show anything on
> >> the stdout or stderr of the failed jobs.  I looked online for what an
> >> exit code of -3 could mean, but didn't find anything.
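> >>
> >> One possibility (an assumption on my part, not confirmed against the
> >> Falkon code): some tools report a child that was killed by signal N as
> >> exit code -N, in which case -3 would mean SIGQUIT. Python's subprocess
> >> module uses exactly that convention, for example:
> >>
> >>     import signal
> >>     import subprocess
> >>
> >>     # Start a long-running child, kill it with SIGQUIT (signal 3),
> >>     # and observe the negative returncode Python reports for it.
> >>     p = subprocess.Popen(["sleep", "60"])
> >>     p.send_signal(signal.SIGQUIT)
> >>     p.wait()
> >>     print(p.returncode)  # -3: child terminated by signal 3
> >>
> >> Whether Falkon's -3 follows the same signal convention or is an
> >> internal error code is something we would have to check in the
> >> executor source.
> >>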
> >> Here are the 6 failed tasks:
> >> Executing task urn:0-9408-1184616132483... Building executable
> >> command... Executing: /bin/sh shared/wrapper.sh fepl-zqtloeei
> >> fe_stdout_m112 stderr.txt wf_m112 solv_chg_a10_m112_done
> >> solv_repu_0.2_0.3_m112.out solv_repu_0_0.2_m112.out
> >> solv_repu_0.9_1_m112.out solv_disp_m112.out solv_chg_m112.out
> >> solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
> >> solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
> >> solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
> >> fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
> >> --resultonly --wham_outputs wf_m112 --solv_lrc_file
> >> solv_chg_a10_m112_done --fe_file fe_solv_m112
> >> Task urn:0-9408-1184616132483 completed with exit code -3 in 238 ms
> >>
> >> Executing task urn:0-9408-1184616133199... Building executable
> >> command... Executing: /bin/sh shared/wrapper.sh fepl-2rtloeei
> >> fe_stdout_m112 stderr.txt wf_m112 solv_chg_a10_m112_done
> >> solv_repu_0.2_0.3_m112.out solv_repu_0_0.2_m112.out
> >> solv_repu_0.9_1_m112.out solv_disp_m112.out solv_chg_m112.out
> >> solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
> >> solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
> >> solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
> >> fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
> >> --resultonly --wham_outputs wf_m112 --solv_lrc_file
> >> solv_chg_a10_m112_done --fe_file fe_solv_m112
> >> Task urn:0-9408-1184616133199 completed with exit code -3 in 201 ms
> >>
> >> Executing task urn:0-15036-1184616133342... Building executable
> >> command... Executing: /bin/sh shared/wrapper.sh fepl-5rtloeei
> >> fe_stdout_m179 stderr.txt wf_m179 solv_chg_a10_m179_done
> >> solv_repu_0.2_0.3_m179.out solv_repu_0_0.2_m179.out
> >> solv_repu_0.9_1_m179.out solv_disp_m179.out solv_chg_m179.out
> >> solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
> >> solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
> >> solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
> >> fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
> >> --resultonly --wham_outputs wf_m179 --solv_lrc_file
> >> solv_chg_a10_m179_done --fe_file fe_solv_m179
> >> Task urn:0-15036-1184616133342 completed with exit code -3 in 267 ms
> >>
> >> Executing task urn:0-15036-1184616133628... Building executable
> >> command... Executing: /bin/sh shared/wrapper.sh fepl-9rtloeei
> >> fe_stdout_m179 stderr.txt wf_m179 solv_chg_a10_m179_done
> >> solv_repu_0.2_0.3_m179.out solv_repu_0_0.2_m179.out
> >> solv_repu_0.9_1_m179.out solv_disp_m179.out solv_chg_m179.out
> >> solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
> >> solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
> >> solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
> >> fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
> >> --resultonly --wham_outputs wf_m179 --solv_lrc_file
> >> solv_chg_a10_m179_done --fe_file fe_solv_m179
> >> Task urn:0-15036-1184616133628 completed with exit code -3 in 2368 ms
> >>
> >> Executing task urn:0-15036-1184616133528... Building executable
> >> command... Executing: /bin/sh shared/wrapper.sh fepl-8rtloeei
> >> fe_stdout_m179 stderr.txt wf_m179 solv_chg_a10_m179_done
> >> solv_repu_0.2_0.3_m179.out solv_repu_0_0.2_m179.out
> >> solv_repu_0.9_1_m179.out solv_disp_m179.out solv_chg_m179.out
> >> solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
> >> solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
> >> solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
> >> fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
> >> --resultonly --wham_outputs wf_m179 --solv_lrc_file
> >> solv_chg_a10_m179_done --fe_file fe_solv_m179
> >> Task urn:0-15036-1184616133528 completed with exit code -3 in 311 ms
> >>
> >> Executing task urn:0-9408-1184616130688... Building executable
> >> command... Executing: /bin/sh shared/wrapper.sh fepl-9ptloeei
> >> fe_stdout_m112 stderr.txt wf_m112 solv_chg_a10_m112_done
> >> solv_repu_0.2_0.3_m112.out solv_repu_0_0.2_m112.out
> >> solv_repu_0.9_1_m112.out solv_disp_m112.out solv_chg_m112.out
> >> solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
> >> solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
> >> solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
> >> fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
> >> --resultonly --wham_outputs wf_m112 --solv_lrc_file
> >> solv_chg_a10_m112_done --fe_file fe_solv_m112
> >> Task urn:0-9408-1184616130688 completed with exit code -3 in 464 ms
> >>
> >> Both the Falkon logs and the Swift logs agree on the number of
> >> submitted tasks, the number of successful tasks, and the number of
> >> failed tasks.  There were no outstanding tasks at the time the workflow
> >> failed.  BTW, I checked the disk space usage about an hour after the
> >> whole experiment finished, and there was plenty of disk space left.
> >>
> >> Yong mentioned that he looked through the output of MolDyn and found
> >> only 242 'fe_solv_*' files, so 2 molecule files were missing...  One
> >> question for Nika: are the 6 failed tasks the same jobs, resubmitted?
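> >>
> >> (The log excerpts above already hint at the answer: 3 of the 6 failures
> >> are for m112 and 3 are for m179, and 20495 + 6 = 20501 executions is 4
> >> more than the 20497 planned tasks, both of which look consistent with 2
> >> jobs being retried and never succeeding.) A sketch of how that grouping
> >> could be pulled out of a log, assuming a hypothetical log file name,
> >> the line format of the excerpts above, and records that are not
> >> interleaved across workers:
> >>
> >>     # Sketch: count "exit code -3" failures per molecule. The file
> >>     # name and line format are assumptions based on the excerpts.
> >>     import re
> >>     from collections import Counter
> >>
> >>     failures = Counter()
> >>     current_mol = None
> >>     with open("worker.log") as log:  # hypothetical file name
> >>         for line in log:
> >>             m = re.search(r"--fe_file fe_solv_(m\d+)", line)
> >>             if m:
> >>                 current_mol = m.group(1)
> >>             if "completed with exit code -3" in line and current_mol:
> >>                 failures[current_mol] += 1
> >>
> >>     for mol, count in failures.most_common():
> >>         print(mol, count)  # expect something like: m112 3, m179 3
> >>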
> >> Nika, can you add anything more to this?  Is there anything else to be
> >> learned from the Swift log as to why those last few jobs failed?  After
> >> we have tried to figure out what happened, can we resume the workflow
> >> and hopefully finish the last few jobs in another run?
> >>
> >> Ioan
> >>
> >>
> >>
> >
>
> --
>
>    Ian Foster, Director, Computation Institute
> Argonne National Laboratory & University of Chicago
> Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
> Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
> Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
>       Globus Alliance: www.globus.org.
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>


