<HTML><BODY style="word-wrap: break-word; -khtml-nbsp-mode: space; -khtml-line-break: after-white-space; ">Ok, here is what happened with the last 244-molecule run.<DIV><BR class="khtml-block-placeholder"></DIV><DIV>1. First of all, the new Swift code (with loops etc.) was used. The code's size is dramatically reduced:</DIV><DIV><BR class="khtml-block-placeholder"></DIV><DIV>-rw-r--r-- 1 nefedova users 13342526 2007-07-05 12:01 MolDyn-244.dtm</DIV><DIV>-rw-r--r-- 1 nefedova users 21898 2007-08-03 11:00 MolDyn-244-loops.swift</DIV><DIV><BR class="khtml-block-placeholder"></DIV><DIV><BR class="khtml-block-placeholder"></DIV><DIV>2. I do not have the log on the Swift side (it was probably not produced because I put in the hack for output reduction, which suppressed the log output -- this can be fixed easily).</DIV><DIV><BR class="khtml-block-placeholder"></DIV><DIV>3. There were 2 molecules that failed. <SPAN class="Apple-tab-span" style="white-space:pre"> </SPAN>That infamous m179 failed at the last step (3 retries). Yuqing -- it's the same molecule you said you fixed the antechamber code for. You told me to use the code in your home directory, /home/ydeng/antechamber-1.27; I assumed it was on tg-uc. Is that correct, or is it on another host? Anyway, I used the code from the directory above and it didn't work. The output is @tg-login1:/disks/scratchgpfs1/iraicu/ModLyn/MolDyn-244-loos-bm66sjz1li5h1/shared. I could try to run this specific molecule again in case it works for you.</DIV><DIV><BR class="khtml-block-placeholder"></DIV><DIV>4. The second molecule that failed is m050. It's quite a mystery why it failed: it finished the 4th stage (those 68 charm jobs) successfully (I have the data in the shared directory on tg-uc), but then the 5th stage never started! I do not see any leftover directories from the 5th stage for m050 (or any other stages for m050, for that matter). So it was not a job failure but a job submission failure (since no directories were even created). 
It should have been a job called 'generator_cat' with the parameter 'm050'. Ioan -- is it possible to track what happened to this job in the Falkon logs?</DIV><DIV><BR class="khtml-block-placeholder"></DIV><DIV>5. I can't restart the workflow since this bug/feature has not been fixed: <A href="http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=29">http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=29</A> (restarts do not work as long as I use the hack for output reduction).</DIV><DIV><BR class="khtml-block-placeholder"></DIV><DIV>Nika</DIV><DIV><BR><DIV><DIV>On Aug 3, 2007, at 11:03 PM, Ioan Raicu wrote:</DIV><BR class="Apple-interchange-newline"><BLOCKQUOTE type="cite"> Hi,<BR> Nika can probably be more specific, but the last time we ran the 244-molecule MolDyn, the workflow failed on the last few jobs, and the failures were application specific, not caused by Swift or Falkon. I believe the specific issue that caused those jobs to fail has been resolved. <BR> <BR> We have made another attempt at the 244-molecule MolDyn run, and from what I can tell, it again did not complete successfully. 
We were supposed to have 20497 jobs...<BR> <BR> <TABLE x:str="" style="border-collapse: collapse; width: 144pt;" border="0" cellpadding="0" cellspacing="0" width="192"> <COL style="width: 48pt;" span="3" width="64"> <TBODY><TR style="height: 12.75pt;" height="17"><TD style="height: 12.75pt; width: 48pt;" x:num="" align="right" height="17" width="64">1</TD><TD style="width: 48pt;" x:num="" align="right" width="64">1</TD><TD style="width: 48pt;" x:num="" x:fmla="=A1*B1" align="right" width="64">1</TD></TR><TR style="height: 12.75pt;" height="17"><TD style="height: 12.75pt;" x:num="" align="right" height="17">1</TD><TD x:num="" align="right">244</TD><TD x:num="" x:fmla="=A2*B2" align="right">244</TD></TR><TR style="height: 12.75pt;" height="17"><TD style="height: 12.75pt;" x:num="" align="right" height="17">1</TD><TD x:num="" align="right">244</TD><TD x:num="" x:fmla="=A3*B3" align="right">244</TD></TR><TR style="height: 12.75pt;" height="17"><TD style="height: 12.75pt;" x:num="" align="right" height="17">68</TD><TD x:num="" align="right">244</TD><TD x:num="" x:fmla="=A4*B4" align="right">16592</TD></TR><TR style="height: 12.75pt;" height="17"><TD style="height: 12.75pt;" x:num="" align="right" height="17">1</TD><TD x:num="" align="right">244</TD><TD x:num="" x:fmla="=A5*B5" align="right">244</TD></TR><TR style="height: 12.75pt;" height="17"><TD style="height: 12.75pt;" x:num="" align="right" height="17">11</TD><TD x:num="" align="right">244</TD><TD x:num="" x:fmla="=A6*B6" align="right">2684</TD></TR><TR style="height: 12.75pt;" height="17"><TD style="height: 12.75pt;" x:num="" align="right" height="17">1</TD><TD x:num="" align="right">244</TD><TD x:num="" x:fmla="=A7*B7" align="right">244</TD></TR><TR style="height: 12.75pt;" height="17"><TD style="height: 12.75pt;" x:num="" align="right" height="17">1</TD><TD x:num="" align="right">244</TD><TD x:num="" x:fmla="=A8*B8" align="right">244</TD></TR><TR style="height: 12.75pt;" height="17"><TD style="height: 12.75pt;" 
height="17"><BR> </TD><TD><BR> </TD><TD><BR> </TD></TR><TR style="height: 12.75pt;" height="17"><TD style="height: 12.75pt;" height="17"><BR> </TD><TD><BR> </TD><TD x:num="" x:fmla="=SUM(C1:C9)" align="right">20497</TD></TR></TBODY> </TABLE> <BR> but we have:<BR> 20482 with exit code 0<BR> 1 with exit code -3<BR> 2 with exit code 253<BR> <BR> I forgot to enable debug output at the workers, so I don't know what the STDOUT and STDERR were for these 3 jobs. Given that Swift retries a job 3 times before it fails the workflow, my guess is that these 3 jobs were really the same job failing 3 times. The failures occurred on 3 different machines, so I don't think they were machine related. Nika, can you tell from the various Swift logs what happened to these 3 jobs? Is this the same issue we had on the last 244-molecule run? It looks like we failed the workflow with 15 jobs to go. <BR> <BR> The graphs all look nice, similar to the last ones we had. If people really want to see them, I can generate them again. Otherwise, look at <A class="moz-txt-link-freetext" href="http://tg-viz-login1.uc.teragrid.org:51000/index.htm">http://tg-viz-login1.uc.teragrid.org:51000/index.htm</A> to see the last 10K samples of the experiment.<BR> <BR> Nika, after you try to figure out what happened, can you simply retry the workflow? Maybe it will manage to finish the last 15 jobs. Depending on what problem we find, we might conclude that 3 retries is not enough and that we want a higher number as the default when running with Falkon. If the error was an application error, though, no number of retries will make any difference.<BR> <BR> Ioan<BR> <BR></BLOCKQUOTE></DIV><BR></DIV></BODY></HTML>