<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
Aha, OK, it didn't click that (2) was referring to the Swift log that I
was asking about. So, in that case, we can't do much else on this run,
other than make sure we fix the infamous m179 molecule, turn on all
debugging (and make sure it's actually printing debug statements), and
try the run again!<br>
<br>
Ioan<br>
<br>
Veronika Nefedova wrote:
<blockquote cite="mid:D20E5057-ED37-4C44-A841-1957EBC3A9C1@mcs.anl.gov"
type="cite">Ioan, I can't answer any of your questions -- read my
point number 2 below );
<div><br class="khtml-block-placeholder">
</div>
<div>Nika</div>
<div><br>
<div>
<div>On Aug 6, 2007, at 11:25 AM, Ioan Raicu wrote:</div>
<br class="Apple-interchange-newline">
<blockquote type="cite"> Hi,<br>
<br>
Veronika Nefedova wrote:
<blockquote
cite="mid:1E69C5C1-ACB4-4540-93BD-3BBDC2AD6C1A@mcs.anl.gov" type="cite">Ok,
here is what happened with the last 244-molecule run.
<div><br class="khtml-block-placeholder">
</div>
<div>1. First of all, the new swift code (with loops etc) was
used. The code's size is dramatically reduced:</div>
<div><br class="khtml-block-placeholder">
</div>
<div>-rw-r--r-- 1 nefedova users 13342526 2007-07-05 12:01
MolDyn-244.dtm</div>
<div>-rw-r--r-- 1 nefedova users 21898 2007-08-03 11:00
MolDyn-244-loops.swift</div>
<div><br class="khtml-block-placeholder">
</div>
<div><br class="khtml-block-placeholder">
</div>
<div>2. I do not have the log on the Swift side (probably it was
not produced because I put in the hack for output reduction and log
output was suppressed -- it can be fixed easily)</div>
<div><br class="khtml-block-placeholder">
</div>
<div>3. There were 2 molecules that failed. That
infamous m179 failed at the last step (3 re-tries). Yuqing -- it's the
same molecule you said you fixed the antechamber code for. You told me
to use the code in your home directory /home/ydeng/antechamber-1.27; I
assumed it was on tg-uc. Is that correct? Or is it on another host?
Anyway, I used the code from the directory above and it didn't work.
The output
is @tg-login1:/disks/scratchgpfs1/iraicu/ModLyn/MolDyn-244-loos-bm66sjz1li5h1/shared.
I could try to run this molecule again specifically in case it works
for you.</div>
<div><br class="khtml-block-placeholder">
</div>
<div>4. The second molecule that failed is m050. It's quite
a mystery why it failed: it finished the 4th stage (those 68 charm
jobs) successfully (I have the data in the shared directory on tg-uc) but
then the 5th stage never started! I do not see any leftover
directories from the 5th stage for m050 (or any other stages for m050
for that matter). So it was not a job failure, but a job submission
failure (since no directories were even created). It had to be a job
called 'generator_cat' with a parameter 'm050'. Ioan -- is it possible
to track what happened to this job in the Falkon logs?</div>
<div><br class="khtml-block-placeholder">
</div>
</blockquote>
There were only 3 jobs that failed in the Falkon logs, so I presume
that those were from (3) above. I also forgot to enable any debug
logging, as the settings were from some older high throughput
experiments, so I don't have a trace of all the task descriptions and
STDOUT and STDERR. About the only thing I can think of is... can you
summarize from the Swift log how many submitted jobs there were, how
many succeeded, and how many failed? At least maybe we can make sure that
the Swift log is consistent with the Falkon logs. Could it be that a
task actually fails (say it doesn't produce all the output files), but
still returns an exit code of 0 (success)? If yes, then would Swift
attempt the next task that needed the missing files and likely fail
while executing due to not finding all the files?<br>
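<br>
A minimal sketch of how such a silent failure could be detected, assuming the list of output files each task should produce is known in advance. The file names below are hypothetical, and this is not how Swift or Falkon actually checks results:<br>

```python
import os
import tempfile


def missing_outputs(expected_files, workdir):
    """Return the expected output files that are absent or empty in workdir."""
    missing = []
    for name in expected_files:
        path = os.path.join(workdir, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing


def task_really_succeeded(exit_code, expected_files, workdir):
    """Count a task as successful only if it exited 0 AND left all its outputs."""
    return exit_code == 0 and not missing_outputs(expected_files, workdir)


# Example: a task that exits 0 but writes only one of its two outputs.
# The names "m050.out" and "m050.log" are made up for illustration.
with tempfile.TemporaryDirectory() as workdir:
    with open(os.path.join(workdir, "m050.out"), "w") as handle:
        handle.write("data\n")
    expected = ["m050.out", "m050.log"]
    print("missing:", missing_outputs(expected, workdir))
    print("succeeded:", task_really_succeeded(0, expected, workdir))
```

Running a check like this right after each task would make the failure show up at the task that lost the file, rather than at the next task that needs it.<br>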
<br>
Now, you mention that it could be a job submission failure... but
wouldn't this be explicit in the Swift logs, that it tried to submit
and it failed? <br>
<br>
Here is the list of all tasks that Falkon knows of: <a
moz-do-not-send="true" class="moz-txt-link-freetext"
href="http://tg-viz-login1.uc.teragrid.org:51000/service_logs/GenericPortalWS_taskPerf.txt">http://tg-viz-login1.uc.teragrid.org:51000/service_logs/GenericPortalWS_taskPerf.txt</a><br>
<br>
Can you produce a similar list of tasks (from the Swift logs), with the
task ID (<a moz-do-not-send="true" href="urn:0-1-10-0-1186176957479">urn:0-1-10-0-1186176957479</a>)
and the status (i.e. submitted, success, failed, etc.)? I believe that
the latest provisioner code you had (which I hope did not get
overwritten by SVN, as I don't know if it was ever checked in, and I
don't remember whether it was changed before or after the commit to SVN)
should have printed, at each submission to Falkon, the task ID in the
form shown above and the status of the task at that point in time.
Assuming this information is in the Swift log, you should be able to
grep for these lines and produce a summary of all the tasks, that we
can then cross-match with Falkon's logs. Which one is the Swift log
for this latest run on viper? There are so many, and I can't tell
which one it is.<br>
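<br>
The cross-match could be sketched roughly as below. The one-entry-per-line "TASK_ID STATUS" layout and the task ID "urn:example-2" are assumptions for illustration only; the real Swift and Falkon log formats are not shown in this thread, so the parsing would need to be adapted:<br>

```python
def parse_status_lines(lines):
    """Map task ID to the last status seen, assuming 'TASK_ID STATUS' per line."""
    statuses = {}
    for line in lines:
        parts = line.split()
        # Skip anything that does not look like "urn:... status".
        if len(parts) == 2 and parts[0].startswith("urn:"):
            statuses[parts[0]] = parts[1]
    return statuses


def cross_match(swift, falkon):
    """Return (IDs only in Swift, IDs only in Falkon, IDs whose status differs)."""
    only_swift = sorted(set(swift) - set(falkon))
    only_falkon = sorted(set(falkon) - set(swift))
    shared = set(swift).intersection(falkon)
    mismatched = sorted(t for t in shared if swift[t] != falkon[t])
    return only_swift, only_falkon, mismatched


# Hypothetical extracts, e.g. the result of grepping each log for "urn:".
swift_lines = [
    "urn:0-1-10-0-1186176957479 success",
    "urn:example-2 submitted",
]
falkon_lines = [
    "urn:0-1-10-0-1186176957479 success",
]

only_swift, only_falkon, mismatched = cross_match(
    parse_status_lines(swift_lines), parse_status_lines(falkon_lines)
)
print("submitted by Swift but unknown to Falkon:", only_swift)
print("known to Falkon but not in the Swift log:", only_falkon)
print("status differs between the two logs:", mismatched)
```

A task that Swift believes it submitted but Falkon never saw would land in the first list, which is the case to look for with the m050 'generator_cat' job.<br>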
<br>
Ioan<br>
<blockquote
cite="mid:1E69C5C1-ACB4-4540-93BD-3BBDC2AD6C1A@mcs.anl.gov" type="cite">
<div>5. I can't restart the workflow since this bug/feature has
not been fixed: <a moz-do-not-send="true"
href="http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=29">http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=29</a>
(as long as I use the hack for output reduction -- restarts do not
work).</div>
<div><br class="khtml-block-placeholder">
</div>
<div>Nika</div>
<div><br>
<div>
<div>On Aug 3, 2007, at 11:03 PM, Ioan Raicu wrote:</div>
<br class="Apple-interchange-newline">
<blockquote type="cite"> Hi,<br>
Nika can probably be more specific, but the last time we ran the 244
molecule MolDyn, the workflow failed on the last few jobs, and the
failures were application specific, not Swift or Falkon. I believe the
specific issue that caused those jobs to fail has been resolved. <br>
<br>
We have made another attempt at the MolDyn 244 molecule run, and from
what I can tell, it again did not complete successfully. We were
supposed to have 20497 jobs...<br>
<br>
<table x:str="" style="border-collapse: collapse; width: 144pt;"
border="0" cellpadding="0" cellspacing="0" width="192">
<col style="width: 48pt;" span="3" width="64"> <tbody>
<tr style="height: 12.75pt;" height="17">
<td style="height: 12.75pt; width: 48pt;" x:num=""
align="right" height="17" width="64">1</td>
<td style="width: 48pt;" x:num="" align="right" width="64">1</td>
<td style="width: 48pt;" x:num="" x:fmla="=A1*B1"
align="right" width="64">1</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td style="height: 12.75pt;" x:num="" align="right"
height="17">1</td>
<td x:num="" align="right">244</td>
<td x:num="" x:fmla="=A2*B2" align="right">244</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td style="height: 12.75pt;" x:num="" align="right"
height="17">1</td>
<td x:num="" align="right">244</td>
<td x:num="" x:fmla="=A3*B3" align="right">244</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td style="height: 12.75pt;" x:num="" align="right"
height="17">68</td>
<td x:num="" align="right">244</td>
<td x:num="" x:fmla="=A4*B4" align="right">16592</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td style="height: 12.75pt;" x:num="" align="right"
height="17">1</td>
<td x:num="" align="right">244</td>
<td x:num="" x:fmla="=A5*B5" align="right">244</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td style="height: 12.75pt;" x:num="" align="right"
height="17">11</td>
<td x:num="" align="right">244</td>
<td x:num="" x:fmla="=A6*B6" align="right">2684</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td style="height: 12.75pt;" x:num="" align="right"
height="17">1</td>
<td x:num="" align="right">244</td>
<td x:num="" x:fmla="=A7*B7" align="right">244</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td style="height: 12.75pt;" x:num="" align="right"
height="17">1</td>
<td x:num="" align="right">244</td>
<td x:num="" x:fmla="=A8*B8" align="right">244</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td style="height: 12.75pt;" height="17"><br>
</td>
<td><br>
</td>
<td><br>
</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td style="height: 12.75pt;" height="17"><br>
</td>
<td><br>
</td>
<td x:num="" x:fmla="=SUM(C1:C9)" align="right">20497</td>
</tr>
</tbody>
</table>
<br>
but we have:<br>
20482 with exit code 0<br>
1 with exit code -3<br>
2 with exit code 253<br>
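<br>
For reference, the arithmetic behind these numbers can be reproduced with a short script. The per-stage counts are copied from the table above and the exit-code tallies from the list just given; the variable names are only illustrative:<br>

```python
# (jobs per unit, number of units) for each stage, as in the table above.
# The first stage runs once; every other stage runs once per molecule.
MOLECULES = 244
STAGES = [
    (1, 1),
    (1, MOLECULES),
    (1, MOLECULES),
    (68, MOLECULES),
    (1, MOLECULES),
    (11, MOLECULES),
    (1, MOLECULES),
    (1, MOLECULES),
]

expected = sum(jobs * units for jobs, units in STAGES)

# Exit-code tallies reported for this run.
exit_codes = {0: 20482, -3: 1, 253: 2}
succeeded = exit_codes[0]
failed_attempts = sum(n for code, n in exit_codes.items() if code != 0)

print("expected jobs:", expected)                       # 20497
print("succeeded:", succeeded)                          # 20482
print("failed attempts:", failed_attempts)              # 3
print("jobs still to complete:", expected - succeeded)  # 15
```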
<br>
I forgot to enable debug logging at the workers, so I don't know what the
STDOUT and STDERR were for these 3 jobs. Given that Swift retries a job 3
times before it fails the workflow, my guess is that these 3 jobs
were really the same job failing 3 times. The failure occurred on 3
different machines, so I don't think it was machine related. Nika, can
you tell from the various Swift logs what happened to these 3 jobs? Is
this the same issue as we had on the last 244 mol run? It looks like
we failed the workflow with 15 jobs to go. <br>
<br>
The graphs all look nice, similar to the last ones we had. If people
really want to see them, I can generate them again. Otherwise, look at
<a moz-do-not-send="true" class="moz-txt-link-freetext"
href="http://tg-viz-login1.uc.teragrid.org:51000/index.htm">http://tg-viz-login1.uc.teragrid.org:51000/index.htm</a>
to see the last 10K samples of the experiment.<br>
<br>
Nika, after you try to figure out what happened, can you simply retry
the workflow? Maybe it will manage to finish the last 15 jobs.
Depending on what problem we find, I think we might conclude that 3
retries is not enough, and we might want to have a higher number as the
default when running with Falkon. If the error was an application
error, then no matter how many retries we have, it won't make any
difference.<br>
<br>
Ioan<br>
<br>
</blockquote>
</div>
<br>
</div>
</blockquote>
</blockquote>
</div>
<br>
</div>
</blockquote>
</body>
</html>