[Swift-devel] Q about MolDyn
Veronika Nefedova
nefedova at mcs.anl.gov
Mon Aug 6 11:01:10 CDT 2007
Ok, here is what happened with the last 244-molecule run.
1. First of all, the new Swift code (with loops etc.) was used. The
code's size is dramatically reduced:
-rw-r--r-- 1 nefedova users 13342526 2007-07-05 12:01 MolDyn-244.dtm
-rw-r--r-- 1 nefedova users 21898 2007-08-03 11:00 MolDyn-244-loops.swift
2. I do not have the log on the Swift side (it was probably not
produced because I put in the hack for output reduction, which
suppressed log output -- this can be fixed easily).
3. There were 2 molecules that failed. That infamous m179 failed at
the last step (3 retries). Yuqing -- it's the same molecule you said
you fixed the antechamber code for. You told me to use the code in
your home directory, /home/ydeng/antechamber-1.27; I assumed it was
on tg-uc. Is that correct, or is it on another host? Anyway, I used the
code from the directory above and it didn't work. The output is at
tg-login1:/disks/scratchgpfs1/iraicu/ModLyn/MolDyn-244-loos-bm66sjz1li5h1/shared.
I could try running this molecule again on its own, in case it works
for you.
4. The second molecule that failed is m050. It's quite a mystery why
it failed: it finished the 4th stage (those 68 charm jobs)
successfully (I have the data in the shared directory on tg-uc), but
the 5th stage never started! I do not see any leftover
directories from the 5th stage for m050 (or any other stages for
m050, for that matter). So it was not a job failure but a job
submission failure (since no directories were even created). It would
have been a job called 'generator_cat' with the parameter 'm050'.
Ioan -- is it possible to track what happened to this job in the
Falkon logs?
5. I can't restart the workflow since this bug/feature has not been
fixed: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=29 (as long
as I use the hack for output reduction -- restarts do not work).
Nika
On Aug 3, 2007, at 11:03 PM, Ioan Raicu wrote:
> Hi,
> Nika can probably be more specific, but the last time we ran the
> 244 molecule MolDyn, the workflow failed on the last few jobs, and
> the failures were application specific, not Swift or Falkon. I
> believe the specific issue that caused those jobs to fail has been
> resolved.
>
> We have made another attempt at the MolDyn 244 molecule run, and
> from what I can tell, it did not complete successfully again. We
> were supposed to have 20497 jobs...
>
>    1    1       1
>    1  244     244
>    1  244     244
>   68  244   16592
>    1  244     244
>   11  244    2684
>    1  244     244
>    1  244     244
>  ----------------
>  Total      20497
>
> but we have:
> 20482 with exit code 0
> 1 with exit code -3
> 2 with exit code 253
>
> I forgot to enable debugging at the workers, so I don't know what
> the STDOUT and STDERR were for these 3 jobs. Given that Swift
> retries a job 3 times before it fails the workflow, my guess is
> that these 3 jobs were really the same job failing 3 times. The
> failure occurred on 3 different machines, so I don't think it was
> machine related. Nika, can you tell from the various Swift logs
> what happened to these 3 jobs? Is this the same issue as we had on
> the last 244 mol run? It looks like we failed the workflow with 15
> jobs to go.
>
> The graphs all look nice, similar to the last ones we had. If
> people really want to see them, I can generate them again.
> Otherwise, look at http://tg-viz-login1.uc.teragrid.org:51000/index.htm
> to see the last 10K samples of the experiment.
>
> Nika, after you try to figure out what happened, can you simply
> retry the workflow? Maybe it will manage to finish the last 15
> jobs. Depending on what problem we find, I think we might conclude
> that 3 retries is not enough, and we might want to have a higher
> number as the default when running with Falkon. If the error was
> an application error, then no matter how many retries we have, it
> won't make any difference.
>
> Ioan
>
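
The job counts in the quoted message can be sanity-checked with a few lines of Python. This is only a sketch: reading the first two columns of the table as "jobs per molecule" and "number of molecules" is an assumption inferred from the numbers, and the variable names are illustrative.

```python
# Per-stage rows from the quoted table, read as
# (jobs per molecule, number of molecules). This reading is an assumption.
stages = [(1, 1), (1, 244), (1, 244), (68, 244),
          (1, 244), (11, 244), (1, 244), (1, 244)]

# Total number of jobs the workflow was expected to run.
expected_total = sum(per_molecule * molecules for per_molecule, molecules in stages)

# Exit-code tallies as reported in the message: {exit code: job count}.
exit_codes = {0: 20482, -3: 1, 253: 2}
succeeded = exit_codes[0]
failed_attempts = sum(count for code, count in exit_codes.items() if code != 0)

# Jobs that never completed successfully.
not_completed = expected_total - succeeded

print(expected_total, not_completed, failed_attempts)
```

The total matches the 20497 jobs quoted, and the difference from the 20482 successful jobs gives the "15 jobs to go". The three non-zero exits are counted separately, since they may be retries of a single job.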