[Swift-devel] Q about MolDyn

Veronika Nefedova nefedova at mcs.anl.gov
Mon Aug 6 11:01:10 CDT 2007


Ok, here is what happened with the last 244-molecule run.

1. First of all, the new swift code (with loops etc) was used. The  
code's size is dramatically reduced:

-rw-r--r--  1 nefedova users 13342526 2007-07-05 12:01 MolDyn-244.dtm
-rw-r--r--  1 nefedova users    21898 2007-08-03 11:00 MolDyn-244-loops.swift


2. I do not have the log on the swift side (probably it was not
produced because I put in the hack for output reduction and log
output was suppressed -- it can be fixed easily).

3. There were 2 molecules that failed. That infamous m179 failed at
the last step (3 re-tries). Yuqing -- it's the same molecule you said
you fixed the antechamber code for. You told me to use the code in
your home directory, /home/ydeng/antechamber-1.27; I assumed it was
on tg-uc. Is that correct? Or is it on another host? Anyway, I used
the code from the directory above and it didn't work. The output is at
tg-login1:/disks/scratchgpfs1/iraicu/ModLyn/MolDyn-244-loos-bm66sjz1li5h1/shared.
I could try to run this molecule again specifically, in case it works
for you.

4. The second molecule that failed is m050. It's quite a mystery why
it failed: it finished the 4th stage (those 68 charm jobs)
successfully (I have the data in the shared directory on tg-uc), but
then the 5th stage never started! I do not see any leftover
directories from the 5th stage for m050 (or any other stages for
m050, for that matter). So it was not a job failure but a job
submission failure (since no directories were even created). The
missing job would have been 'generator_cat' with the parameter
'm050'. Ioan, is it possible to track what happened to this job in
the Falkon logs?

5. I can't restart the workflow since this bug/feature has not been  
fixed: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=29 (as long  
as I use the hack for output reduction -- restarts do not work).
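
A quick way to double-check which molecules never reached a given
stage is to compare the expected molecule IDs against the job
directory names. Below is a minimal sketch, assuming the molecules are
numbered m001 through m244 and that each job directory name contains
its molecule ID. Both the numbering and the sample directory names are
assumptions for illustration, not taken from the actual run.

```python
import re

# Matches an ID such as "m050" that is not embedded in a longer token.
MOLECULE_RE = re.compile(r"(?<![A-Za-z0-9])m\d{3}(?!\d)")


def molecules_seen(names):
    """Collect every molecule ID that appears in the given names."""
    found = set()
    for name in names:
        found.update(MOLECULE_RE.findall(name))
    return found


def missing_molecules(total, names):
    """Return the expected IDs (m001..m<total>) absent from all names."""
    expected = {f"m{i:03d}" for i in range(1, total + 1)}
    return sorted(expected - molecules_seen(names))


if __name__ == "__main__":
    # Hypothetical directory names; in practice these could come from
    # os.listdir() on the stage's job directory.
    sample = ["generator_cat-m001-a1", "generator_cat-m003-b2"]
    print(missing_molecules(3, sample))
```

With the real directory listing for the 5th stage, m050 would be
expected to show up in the result if its job was never submitted.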

Nika

On Aug 3, 2007, at 11:03 PM, Ioan Raicu wrote:

> Hi,
> Nika can probably be more specific, but the last time we ran the  
> 244 molecule MolDyn, the workflow failed on the last few jobs, and  
> the failures were application specific, not Swift or Falkon.  I  
> believe the specific issue that caused those jobs to fail has been  
> resolved.
>
> We have made another attempt at the MolDyn 244 molecule run, and  
> from what I can tell, it did not complete successfully again.  We  
> were supposed to have 20497 jobs...
>
>  1     1       1
>  1   244     244
>  1   244     244
> 68   244   16592
>  1   244     244
> 11   244    2684
>  1   244     244
>  1   244     244
>            -----
>            20497
>
> but we have:
> 20482 with exit code 0
> 1 with exit code -3
> 2 with exit code 253
>
> I forgot to enable the debug at the workers, so I don't know what
> the STDOUT and STDERR were for these 3 jobs.  Given that Swift
> retries a job 3 times before it fails the workflow, my guess is
> that these 3 jobs were really the same job failing 3 times.  The
> failure occurred on 3 different machines, so I don't think it was  
> machine related.  Nika, can you tell from the various Swift logs  
> what happened to these 3 jobs?  Is this the same issue as we had on  
> the last 244 mol run?  It looks like we failed the workflow with 15  
> jobs to go.
>
> The graphs all look nice, similar to the last ones we had.  If  
> people really want to see them, I can generate them again.   
> Otherwise, look at http://tg-viz-login1.uc.teragrid.org:51000/index.htm
> to see the last 10K samples of the experiment.
>
> Nika, after you try to figure out what happened, can you simply
> retry the workflow?  Maybe it will manage to finish the last 15
> jobs.  Depending on what problem we find, I think we might conclude
> that 3 retries is not enough, and we might want to have a higher  
> number as the default when running with Falkon.  If the error was  
> an application error, then no matter how many retries we have, it  
> won't make any difference.
>
> Ioan
>
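
The job accounting in the quoted message can be checked with a few
lines of Python. This is only a sketch: it reads the first column of
the table as jobs per molecule and the second as the number of
molecules in that stage, which is an interpretation rather than
something the message states (it does agree with the 68 jobs per
molecule in the 4th stage). The function and variable names are
illustrative.

```python
# (jobs per molecule, molecules in stage), read from the quoted table.
# The column meanings are an assumption; only the numbers are from the mail.
STAGES = [
    (1, 1),
    (1, 244),
    (1, 244),
    (68, 244),
    (1, 244),
    (11, 244),
    (1, 244),
    (1, 244),
]

# Number of job executions per exit code, as reported in the message.
EXIT_CODES = {0: 20482, -3: 1, 253: 2}


def expected_jobs(stages):
    """Total jobs the workflow runs if every stage completes."""
    return sum(per_molecule * molecules for per_molecule, molecules in stages)


def summarize(stages, exit_codes):
    """Return (expected, succeeded, failed_attempts, not_completed)."""
    expected = expected_jobs(stages)
    succeeded = exit_codes.get(0, 0)
    failed_attempts = sum(n for code, n in exit_codes.items() if code != 0)
    return expected, succeeded, failed_attempts, expected - succeeded


if __name__ == "__main__":
    expected, ok, failed, missing = summarize(STAGES, EXIT_CODES)
    print(f"expected={expected} succeeded={ok} "
          f"failed_attempts={failed} not_completed={missing}")
```

The 3 failed attempts line up with one job being retried 3 times, and
the difference between expected and succeeded jobs is the 15 jobs that
were left to go.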
