[Swift-devel] Q about MolDyn
    Yuqing Deng 
    yuqing.deng at gmail.com
       
    Mon Aug  6 13:24:09 CDT 2007
    
    
  
Nika,
  The fix is in one of the data files that are loaded by antechamber.
The ACHOME environment variable has to be set to
/home/ydeng/antechamber-1.27/ too.
  Yuqing
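
A minimal sketch of setting ACHOME for a child process from Python, assuming the job is launched from a Python wrapper. The helper name antechamber_env is hypothetical and not part of antechamber or Swift; only the ACHOME value comes from the message above.

```python
import os

# Path quoted in the message above; assumed to exist on the execution host.
ACHOME = "/home/ydeng/antechamber-1.27/"


def antechamber_env(base=None):
    """Return a copy of the environment with ACHOME set (hypothetical helper)."""
    env = dict(os.environ if base is None else base)
    env["ACHOME"] = ACHOME
    return env
```

The returned dictionary can be passed as the env argument of subprocess.run when starting the job, so the variable is set for that process only and the caller's own environment is left untouched.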
On 8/6/07, Veronika Nefedova <nefedova at mcs.anl.gov> wrote:
> Ok, here is what happened with the last 244-molecule run.
>
> 1. First of all, the new swift code (with loops etc) was used. The code's
> size is dramatically reduced:
>
> -rw-r--r--  1 nefedova users 13342526 2007-07-05 12:01 MolDyn-244.dtm
> -rw-r--r--  1 nefedova users    21898 2007-08-03 11:00 MolDyn-244-loops.swift
>
>
> 2. I do not have the log on the Swift side (it was probably not produced
> because I put in the hack for output reduction and log output was suppressed
> -- it can be fixed easily).
>
> 3. There were 2 molecules that failed.  That infamous m179 failed at the
> last step (3 re-tries). Yuqing -- it's the same molecule you said you fixed
> the antechamber code for. You told me to use the code in your home
> directory  /home/ydeng/antechamber-1.27, and I assumed it was
> on tg-uc. Is that correct? Or is it on another host? Anyway, I used the code
> from the directory above and it didn't work. The output
> is @tg-login1:/disks/scratchgpfs1/iraicu/ModLyn/MolDyn-244-loos-bm66sjz1li5h1/shared.
> I could try to run this molecule again specifically, in case it works for
> you.
>
> 4.  The second molecule that failed is m050. It's quite a mystery why it
> failed: it finished the 4th stage (those 68 charm jobs) successfully (I
> have the data in the shared directory on tg-uc) but then the 5th stage
> never started! I do not see any leftover directories from the 5th stage for
> m050 (or any other stages for m050 for that matter). So it was not a job
> failure, but a job submission failure (since no directories were even
> created). It had to be a job called 'generator_cat' with a parameter 'm050'.
> Ioan - is it possible to track what happened to this job in the Falkon logs?
>
> 5. I can't restart the workflow since this bug/feature has not been
> fixed: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=29
> (as long as I use the hack for output reduction -- restarts do not work).
>
> Nika
>
>
> On Aug 3, 2007, at 11:03 PM, Ioan Raicu wrote:
>  Hi,
>  Nika can probably be more specific, but the last time we ran the 244
> molecule MolDyn, the workflow failed on the last few jobs, and the failures
> were application specific, not Swift or Falkon.  I believe the specific
> issue that caused those jobs to fail has been resolved.
>
>  We have made another attempt at the MolDyn 244 molecule run, and from what
> I can tell, it did not complete successfully again.  We were supposed to
> have 20497 jobs...
>
>
>   1    1      1
>   1  244    244
>   1  244    244
>  68  244  16592
>   1  244    244
>  11  244   2684
>   1  244    244
>   1  244    244
>
>  20497
>  but we have:
>  20482 with exit code 0
>  1 with exit code -3
>  2 with exit code 253
>
>  I forgot to enable debugging at the workers, so I don't know what the
> STDOUT and STDERR were for these 3 jobs.  Given that Swift retries a job
> 3 times before it fails the workflow, my guess is that these 3 jobs were really
> the same job failing 3 times.  The failure occurred on 3 different machines,
> so I don't think it was machine related.  Nika, can you tell from the
> various Swift logs what happened to these 3 jobs?  Is this the same issue as
> we had on the last 244 mol run?  It looks like we failed the workflow with
> 15 jobs to go.
>
>  The graphs all look nice, similar to the last ones we had.  If people
> really want to see them, I can generate them again.  Otherwise, look at
> http://tg-viz-login1.uc.teragrid.org:51000/index.htm to see
> the last 10K samples of the experiment.
>
>  Nika, after you try to figure out what happened, can you simply retry the
> workflow? Maybe it will manage to finish the last 15 jobs.  Depending on
> what problem we find, I think we might conclude that 3 retries is not
> enough, and we might want to have a higher number as the default when
> running with Falkon.  If the error was an application error, then no matter
> how many retries we have, it won't make any difference.
>
>  Ioan
>
>
>
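
The job counts quoted in Ioan's message can be checked with a short sketch. It assumes each row of his per-stage figures reads as (jobs per molecule, number of molecules), which is an interpretation rather than something stated in the thread, although it is consistent with the "68 charm jobs" mentioned for the 4th stage. The names STAGES and EXIT_CODES are illustrative only.

```python
# Per-stage figures from the quoted message, read as
# (jobs per molecule, molecules). This reading is an assumption.
STAGES = [
    (1, 1),
    (1, 244),
    (1, 244),
    (68, 244),
    (1, 244),
    (11, 244),
    (1, 244),
    (1, 244),
]

# Exit-code tallies reported by Ioan.
EXIT_CODES = {0: 20482, -3: 1, 253: 2}


def expected_jobs(stages):
    """Total number of jobs across all stages."""
    return sum(jobs * molecules for jobs, molecules in stages)


expected = expected_jobs(STAGES)
remaining = expected - EXIT_CODES[0]
failed_attempts = sum(n for code, n in EXIT_CODES.items() if code != 0)

print(expected, remaining, failed_attempts)
```

Under that reading the stage totals add up to the 20497 jobs expected, the successful jobs leave 15 outstanding, and the 3 non-zero exit codes match the guess of one job failing on each of its 3 attempts.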
    
    
More information about the Swift-devel
mailing list