<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
The 4 machines on which the 6 jobs failed were:<br>
tg-c055<br>
tg-v028<br>
tg-v092<br>
tg-v023<br>
<br>
Note that one of these is a 64-bit node and the other three are 32-bit.
Also, I had two workers on each machine, and only one of the two workers on
each machine failed any jobs... if it were indeed a node hardware problem, I
would have expected both workers on that machine to have failed jobs.<br>
<br>
I concur with Mihael that there might have been incomplete or missing
data... we just have to find out whether that is possible even though the
previous stages all exited with an exit code of 0.  Yuqing (the
domain/app-specific expert) is probably the key to finding out what happened
to these 6 failed jobs in this run.  Nika, did you try to run the jobs
manually to see if they fail with the same -3 exit code?<br>
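For reference, here is a minimal sketch of what "running one manually" could
look like; the fe.pl path and arguments are copied verbatim from the log
quoted below, and the only assumption is that it is run from a directory
that still holds the m112 inputs (wf_m112, solv_chg_a10_m112_done):<br>
<pre>
# Minimal sketch -- the working directory below is a placeholder, not a real path.
cd /path/to/the/preserved/job/directory
/disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite --resultonly \
    --wham_outputs wf_m112 \
    --solv_lrc_file solv_chg_a10_m112_done \
    --fe_file fe_solv_m112
echo "fe.pl exit status: $?"   # a shell can only report 0-255 here, never -3
</pre>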
<br>
Ioan<br>
<br>
Mihael Hategan wrote:
<blockquote cite="mid:1184733164.14719.5.camel@blabla.mcs.anl.gov"
 type="cite">
  <pre wrap="">I don't think these are random failures. In the whole workflow there
were exactly 6 tasks failed. 3 belonging to one job and 3 to the other.
Statistically, and if Ioan's assertion that they were not sent to the
exact same worker is correct, I'd be pretty confident saying that it was
due to specific executables failing on specific data (and by that I
would include the possibility of missing data).

Mihael

On Tue, 2007-07-17 at 23:18 -0500, Tiberiu Stef-Praun wrote:
  </pre>
  <blockquote type="cite">
    <pre wrap="">I also had jobs failing at the Argonne site today.
It seems that the ia_32 were randomly fail on executing some of my
jobs, so I had to switch my apps to the ia_64 to get a full,
successful execution.

Tibi

On 7/17/07, Ioan Raicu <a class="moz-txt-link-rfc2396E" href="mailto:iraicu@cs.uchicago.edu"><iraicu@cs.uchicago.edu></a> wrote:
    </pre>
    <blockquote type="cite">
      <pre wrap="">

 Mihael Hategan wrote:
 On Tue, 2007-07-17 at 21:43 -0500, Ian Foster wrote:


 Another (perhaps dumb?) question--it would seem desirable that we be
able to quickly determine what tasks failed and then (attempt to) rerun
them in such circumstances.

Here it seems that a lot of effort is required just to determine what
tasks failed, and I am not sure that the information extracted is enough
to rerun them.

 Normally, a summary of what failed, along with the reasons, is printed on
stderr, together with the stdout and stderr of the jobs. Perhaps it
should also go to the log file.

In this case, 2 jobs failed. The 6 failures are due to restarts, which
is in agreement with the 2 missing molecules.

When jobs fail, swift should not clean up the job directories so that
one can do post-mortem debugging. I suggest invoking the application
manually to see if it's a matter of a bad node or bad data.

 The errors happened on 3 different nodes, so I suspect that it's not a
bad-node problem (as we previously experienced with the stale NFS handle).

 Nika, I sent out the actual commands that failed... can you try to run them
manually to see what happens, and possibly determine why they failed?  Can
you also find out what an exit code of -3 means within the application that
failed? (You might have to look at the app source code, or contact the
original author of the code.)
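
 One note on the -3 itself: on Unix a process can only hand back an exit
status in the range 0-255, so fe.pl cannot literally have returned -3 to
whoever launched it; the -3 has to be assigned somewhere in the Falkon
worker / launching layer. One guess worth checking in the worker code is
that the real status was 253 -- which is what exit(-3) becomes after
truncation to an unsigned byte -- and that it is being printed back as a
signed byte. A quick check of the truncation from any shell (sketch only):

    # exit statuses are truncated to an unsigned byte: -3 is reported as 253,
    # and 253 read back as a signed byte is -3 again
    perl -e 'exit(-3)'; echo "status seen by shell: $?"    # prints 253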

 Ioan




 It also seems that we can't easily determine which output files are
missing.

 In the general case we wouldn't be able to, because the exact outputs
may only be known at run-time. Granted, that kind of dynamism would
depend on our ability to have nondeterministic files being returned,
which we haven't gotten around to implementing. But there is a question
of whether we should try to implement a short-term solution that would
be invalidated by our own plans.



 Ian.

Ian Foster wrote:


 Ioan:

a) I think this information should be in the bugzilla summary,
according to our processes?

b) Why did it take so long to get all of the workers working?

c) Can we debug using less than O(800) node hours?

Ian.

<a class="moz-txt-link-abbreviated" href="mailto:bugzilla-daemon@mcs.anl.gov">bugzilla-daemon@mcs.anl.gov</a> wrote:


 <a class="moz-txt-link-freetext" href="http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72">http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72</a>





------- Comment #24 from <a class="moz-txt-link-abbreviated" href="mailto:iraicu@cs.uchicago.edu">iraicu@cs.uchicago.edu</a> 2007-07-17 16:08
-------
So the latest MolDyn 244-mol run also failed... but I think it made it all
the way to the final few jobs...

The place where I put all the information about the run is at:
<a class="moz-txt-link-freetext" href="http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/">http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/</a>


Here are the graphs:
<a class="moz-txt-link-freetext" href="http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/summary_graph_med.jpg">http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/summary_graph_med.jpg</a>

<a class="moz-txt-link-freetext" href="http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/task_graph_med.jpg">http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/task_graph_med.jpg</a>

<a class="moz-txt-link-freetext" href="http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/executor_graph_med.jpg">http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/executor_graph_med.jpg</a>


The Swift log can be found at:
<a class="moz-txt-link-freetext" href="http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/logs/swift/MolDyn-244-ja4ya01d6cti1.log">http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/logs/swift/MolDyn-244-ja4ya01d6cti1.log</a>


The Falkon logs are at:
<a class="moz-txt-link-freetext" href="http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/logs/falkon/">http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/logs/falkon/</a>


The 244-mol run was supposed to have 20497 tasks, broken down as
follows:
   1 x   1 =     1
   1 x 244 =   244
   1 x 244 =   244
  68 x 244 = 16592
   1 x 244 =   244
  11 x 244 =  2684
   1 x 244 =   244
   1 x 244 =   244
  =================
     total = 20497

We had 20495 tasks that exited with an exit code of 0, and 6 tasks that
exited with an exit code of -3. The worker logs don't show anything on the
stdout or stderr of the failed jobs. I searched online for what an exit
code of -3 could mean, but didn't find anything.
Here are the 6 failed tasks:
Executing task urn:0-9408-1184616132483... Building executable
command...Executing: /bin/sh shared/wrapper.sh fepl-zqtloeei
fe_stdout_m112
stderr.txt wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl
--nosite
--resultonly --wham_outputs wf_m112 --solv_lrc_file
solv_chg_a10_m112_done
--fe_file fe_solv_m112 Task urn:0-9408-1184616132483 completed with
exit code -3 in 238 ms

Executing task urn:0-9408-1184616133199... Building executable
command...Executing: /bin/sh shared/wrapper.sh fepl-2rtloeei
fe_stdout_m112
stderr.txt wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl
--nosite
--resultonly --wham_outputs wf_m112 --solv_lrc_file
solv_chg_a10_m112_done
--fe_file fe_solv_m112 Task urn:0-9408-1184616133199 completed with
exit code -3 in 201 ms

Executing task urn:0-15036-1184616133342... Building executable
command...Executing: /bin/sh shared/wrapper.sh fepl-5rtloeei
fe_stdout_m179
stderr.txt wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl
--nosite
--resultonly --wham_outputs wf_m179 --solv_lrc_file
solv_chg_a10_m179_done
--fe_file fe_solv_m179 Task urn:0-15036-1184616133342 completed with
exit code -3 in 267 ms

Executing task urn:0-15036-1184616133628... Building executable
command...Executing: /bin/sh shared/wrapper.sh fepl-9rtloeei
fe_stdout_m179
stderr.txt wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl
--nosite
--resultonly --wham_outputs wf_m179 --solv_lrc_file
solv_chg_a10_m179_done
--fe_file fe_solv_m179 Task urn:0-15036-1184616133628 completed with
exit code -3 in 2368 ms

Executing task urn:0-15036-1184616133528... Building executable
command...Executing: /bin/sh shared/wrapper.sh fepl-8rtloeei
fe_stdout_m179
stderr.txt wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl
--nosite
--resultonly --wham_outputs wf_m179 --solv_lrc_file
solv_chg_a10_m179_done
--fe_file fe_solv_m179 Task urn:0-15036-1184616133528 completed with
exit code -3 in 311 ms

Executing task urn:0-9408-1184616130688... Building executable
command...Executing: /bin/sh shared/wrapper.sh fepl-9ptloeei
fe_stdout_m112
stderr.txt wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl
--nosite
--resultonly --wham_outputs wf_m112 --solv_lrc_file
solv_chg_a10_m112_done
--fe_file fe_solv_m112 Task urn:0-9408-1184616130688 completed with
exit code -3 in 464 ms
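
For reference, a rough way to pull these entries back out of the Falkon
worker logs (this assumes the log lines really look like the wrapped
excerpts above, and that the logs sit under the logs/falkon/ directory
linked earlier):

    grep -l "exit code -3" *                    # which worker logs saw failures
    grep "completed with exit code -3" * \
        | sed 's/.*Task \(urn:[^ ]*\) completed.*/\1/' \
        | sort -u                               # the distinct failed task URNs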


Both the Falkon logs and the Swift logs agree on the number of submitted
tasks, the number of successful tasks, and the number of failed tasks. There
were no outstanding tasks at the time the workflow failed. BTW, I checked
the disk space usage about an hour after the whole experiment finished, and
there was plenty of disk space left.

Yong mentioned that he looked through the output of MolDyn, and there were
only 242 'fe_solv_*' files, so the outputs for 2 molecules were missing (see
the sketch below for a quick way to identify which ones)... one question for
Nika: are the 6 failed tasks the same job, resubmitted?
Nika, can you add anything more to this? Is there anything else to
be learned
from the Swift log, as to why those last few jobs failed? After we
have tried
to figure out what happened, can we resume the workflow, and
hopefully finish
the last few jobs in another run?
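
As a side note, here is a rough way to identify exactly which molecules are
missing their free-energy output; it assumes the fe_solv_mNNN / wf_mNNN
naming seen in the commands above and that both sets of files sit in the
same directory (a sketch only, not the actual MolDyn tooling):

    # every molecule with a wf_m* input should have a matching fe_solv_m* output
    for f in wf_m*; do
        m=${f#wf_}                              # e.g. wf_m112 -> m112
        [ -e "fe_solv_$m" ] || echo "missing fe_solv_$m"
    done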

Ioan




 --

 Ian Foster, Director, Computation Institute
Argonne National Laboratory & University of Chicago
Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
Tel: +1 630 252 4619. Web: <a class="moz-txt-link-abbreviated" href="http://www.ci.uchicago.edu">www.ci.uchicago.edu</a>.
 Globus Alliance: <a class="moz-txt-link-abbreviated" href="http://www.globus.org">www.globus.org</a>.

_______________________________________________
Swift-devel mailing list
<a class="moz-txt-link-abbreviated" href="mailto:Swift-devel@ci.uchicago.edu">Swift-devel@ci.uchicago.edu</a>
<a class="moz-txt-link-freetext" href="http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel">http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel</a>






      </pre>
    </blockquote>
  </blockquote>
</blockquote>
<br>
<pre class="moz-signature" cols="72">-- 
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: <a class="moz-txt-link-abbreviated" href="mailto:iraicu@cs.uchicago.edu">iraicu@cs.uchicago.edu</a>
Web:   <a class="moz-txt-link-freetext" href="http://www.cs.uchicago.edu/~iraicu">http://www.cs.uchicago.edu/~iraicu</a>
       <a class="moz-txt-link-freetext" href="http://dsl.cs.uchicago.edu/">http://dsl.cs.uchicago.edu/</a>
============================================
============================================
</pre>
</body>
</html>