[Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules

Tiberiu Stef-Praun tiberius at ci.uchicago.edu
Tue Jul 17 23:18:49 CDT 2007


I also had jobs failing at the Argonne site today.
It seems that the ia_32 nodes were randomly failing to execute some of
my jobs, so I had to switch my apps to the ia_64 nodes to get a full,
successful execution.

Tibi

On 7/17/07, Ioan Raicu <iraicu at cs.uchicago.edu> wrote:
>
>
>
>  Mihael Hategan wrote:
>  On Tue, 2007-07-17 at 21:43 -0500, Ian Foster wrote:
>
>
>  Another (perhaps dumb?) question--it would seem desirable that we be
> able to quickly determine what tasks failed and then (attempt to) rerun
> them in such circumstances.
>
> Here it seems that a lot of effort is required just to determine what
> tasks failed, and I am not sure that the information extracted is enough
> to rerun them.
>
>  Normally, a summary of what failed with the reasons is printed on
> stderr, together with the stdout and stderr of the jobs. Perhaps it
> should also go to the log file.
>
> In this case, 2 jobs failed. The 6 reported failures are due to
> restarts (3 attempts per job), which is in agreement with the 2
> missing molecules.
>
> When jobs fail, swift should not clean up the job directories so that
> one can do post-mortem debugging. I suggest invoking the application
> manually to see if it's a matter of a bad node or bad data.
>
>  The errors happened on 3 different nodes, so I suspect that it's not
> bad nodes (as we have previously experienced with the stale NFS
> handle).
>
>  Nika, I sent out the actual commands that failed... can you try to run
> them manually to see what happens, and possibly determine why they
> failed?  Can you also find out what an exit code of -3 means within the
> application that failed?  (You might have to look at the app source
> code, or contact the original author.)
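>
>  Something along these lines should do it (a rough sketch, untested;
> the command line is copied from one of the failed task logs below, the
> "..." stands for the remaining solv_*.out arguments listed there, and
> it assumes the Swift job directory with shared/wrapper.sh was not yet
> cleaned up):
>
>    # Re-run one failed command by hand and capture its exit code.
>    cd <the run directory>   # hypothetical: wherever shared/ lives
>    /bin/sh shared/wrapper.sh fepl-zqtloeei fe_stdout_m112 stderr.txt \
>        wf_m112 solv_chg_a10_m112_done ... fe_solv_m112 fe_stdout_m112 \
>        /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite \
>        --resultonly --wham_outputs wf_m112 \
>        --solv_lrc_file solv_chg_a10_m112_done --fe_file fe_solv_m112
>    echo "exit code: $?"
>    # Then inspect whatever the app wrote:
>    cat stderr.txt fe_stdout_m112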
>
>  Ioan
>
>
>
>
>  It also seems that we can't easily determine which output files are
> missing.
>
>  In the general case we wouldn't be able to, because the exact outputs
> may only be known at run-time. Granted, that kind of dynamic behavior
> would depend on our ability to have nondeterministic sets of files
> returned, which we haven't gotten around to implementing. But there is
> a question of whether we should implement a short-term solution that
> would be invalidated by our own plans.
>
>
>
>  Ian.
>
> Ian Foster wrote:
>
>
>  Ioan:
>
> a) I think this information should be in the bugzilla summary,
> according to our processes?
>
> b) Why did it take so long to get all of the workers working?
>
> c) Can we debug using less than O(800) node hours?
>
> Ian.
>
> bugzilla-daemon at mcs.anl.gov wrote:
>
>
>  http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72
>
>
>
>
>
> ------- Comment #24 from iraicu at cs.uchicago.edu 2007-07-17 16:08
> -------
> So the latest MolDyn 244-molecule run also failed... but I think it
> made it all the way to the final few jobs...
>
> The place where I put all the information about the run is at:
> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/
>
>
> Here are the graphs:
> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/summary_graph_med.jpg
>
> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/task_graph_med.jpg
>
> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/executor_graph_med.jpg
>
>
> The Swift log can be found at:
> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/logs/swift/MolDyn-244-ja4ya01d6cti1.log
>
>
> The Falkon logs are at:
> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/logs/falkon/
>
>
> The 244 mol run was supposed to have 20497 tasks, broken down as
> follows (count x molecules = tasks):
>
>     1 x   1 =     1
>     1 x 244 =   244
>     1 x 244 =   244
>    68 x 244 = 16592
>     1 x 244 =   244
>    11 x 244 =  2684
>     1 x 244 =   244
>     1 x 244 =   244
>    =================
>    total      20497
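>
> As a sanity check, the sum can be recomputed from the breakdown above
> with a throwaway one-liner (the numbers are just the ones listed):
>
>    awk 'BEGIN { split("1 1 1 68 1 11 1 1", c);
>                 split("1 244 244 244 244 244 244 244", m);
>                 for (i = 1; i <= 8; i++) t += c[i] * m[i];
>                 print t }'   # prints 20497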
>
> We had 20495 tasks that exited with an exit code of 0, and 6 task
> attempts that exited with an exit code of -3. The worker logs don't
> show anything on the stdout or stderr of the failed jobs. I looked
> online for what an exit code of -3 could mean, but didn't find
> anything.
> Here are the 6 failed task attempts:
> Executing task urn:0-9408-1184616132483... Building executable
> command...Executing: /bin/sh shared/wrapper.sh fepl-zqtloeei
> fe_stdout_m112
> stderr.txt wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
> solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
> solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
> solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
> solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
> fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl
> --nosite
> --resultonly --wham_outputs wf_m112 --solv_lrc_file
> solv_chg_a10_m112_done
> --fe_file fe_solv_m112 Task urn:0-9408-1184616132483 completed with
> exit code -3 in 238 ms
>
> Executing task urn:0-9408-1184616133199... Building executable
> command...Executing: /bin/sh shared/wrapper.sh fepl-2rtloeei
> fe_stdout_m112
> stderr.txt wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
> solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
> solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
> solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
> solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
> fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl
> --nosite
> --resultonly --wham_outputs wf_m112 --solv_lrc_file
> solv_chg_a10_m112_done
> --fe_file fe_solv_m112 Task urn:0-9408-1184616133199 completed with
> exit code -3 in 201 ms
>
> Executing task urn:0-15036-1184616133342... Building executable
> command...Executing: /bin/sh shared/wrapper.sh fepl-5rtloeei
> fe_stdout_m179
> stderr.txt wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
> solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
> solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
> solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
> solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
> fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl
> --nosite
> --resultonly --wham_outputs wf_m179 --solv_lrc_file
> solv_chg_a10_m179_done
> --fe_file fe_solv_m179 Task urn:0-15036-1184616133342 completed with
> exit code -3 in 267 ms
>
> Executing task urn:0-15036-1184616133628... Building executable
> command...Executing: /bin/sh shared/wrapper.sh fepl-9rtloeei
> fe_stdout_m179
> stderr.txt wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
> solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
> solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
> solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
> solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
> fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl
> --nosite
> --resultonly --wham_outputs wf_m179 --solv_lrc_file
> solv_chg_a10_m179_done
> --fe_file fe_solv_m179 Task urn:0-15036-1184616133628 completed with
> exit code -3 in 2368 ms
>
> Executing task urn:0-15036-1184616133528... Building executable
> command...Executing: /bin/sh shared/wrapper.sh fepl-8rtloeei
> fe_stdout_m179
> stderr.txt wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
> solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
> solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
> solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
> solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
> fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl
> --nosite
> --resultonly --wham_outputs wf_m179 --solv_lrc_file
> solv_chg_a10_m179_done
> --fe_file fe_solv_m179 Task urn:0-15036-1184616133528 completed with
> exit code -3 in 311 ms
>
> Executing task urn:0-9408-1184616130688... Building executable
> command...Executing: /bin/sh shared/wrapper.sh fepl-9ptloeei
> fe_stdout_m112
> stderr.txt wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
> solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
> solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
> solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
> solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
> fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl
> --nosite
> --resultonly --wham_outputs wf_m112 --solv_lrc_file
> solv_chg_a10_m112_done
> --fe_file fe_solv_m112 Task urn:0-9408-1184616130688 completed with
> exit code -3 in 464 ms
>
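> For the record, the success/failure tallies can be reproduced straight
> from the worker logs with something like this (a rough sketch; the
> "completed with exit code" format is taken from the excerpts above,
> but the worker log file names under logs/falkon/ are a guess):
>
>    # Tally exit codes across all worker logs.
>    grep -h "completed with exit code" *.log \
>        | awk '{ for (i = 1; i < NF; i++)
>                     if ($i == "code") print $(i+1) }' \
>        | sort | uniq -c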
>
> Both the Falkon logs and the Swift logs agree on the number of
> submitted tasks, the number of successful tasks, and the number of
> failed tasks. There were no outstanding tasks at the time the workflow
> failed. BTW, I checked the disk space usage about an hour after the
> whole experiment finished, and there was plenty of disk space left.
>
> Yong mentioned that he looked through the output of MolDyn and found
> only 242 'fe_solv_*' files, so 2 molecules' outputs were missing...
> one question for Nika: are the 6 failed tasks the same jobs,
> resubmitted? Nika, can you add anything more to this? Is there
> anything else to be learned from the Swift log as to why those last
> few jobs failed? After we have tried to figure out what happened, can
> we resume the workflow and hopefully finish the last few jobs in
> another run? (A quick way to list the missing molecules is sketched
> below.)
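>
> Something like this would show exactly which molecules are missing (a
> rough sketch; it assumes the outputs sit together in one directory and
> that molecules are numbered m1..m244, as in the file names above):
>
>    # Report molecule numbers that have no fe_solv_* output file.
>    for i in $(seq 1 244); do
>        [ -e "fe_solv_m$i" ] || echo "missing: fe_solv_m$i"
>    done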
>
> Ioan
>
>
>
>
>  --
>
>  Ian Foster, Director, Computation Institute
> Argonne National Laboratory & University of Chicago
> Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
> Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
> Tel: +1 630 252 4619. Web: www.ci.uchicago.edu.
>  Globus Alliance: www.globus.org.
>
>
>
>
>  --
> ============================================
> Ioan Raicu
> Ph.D. Student
> ============================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ============================================
> Email: iraicu at cs.uchicago.edu
> Web: http://www.cs.uchicago.edu/~iraicu
>  http://dsl.cs.uchicago.edu/
> ============================================
> ============================================
>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
>


-- 
Tiberiu (Tibi) Stef-Praun, PhD
Research Staff, Computation Institute
5640 S. Ellis Ave, #405
University of Chicago
http://www-unix.mcs.anl.gov/~tiberius/


