[Swift-devel] Q about MolDyn
Veronika Nefedova
nefedova at mcs.anl.gov
Mon Aug 6 11:31:34 CDT 2007
Ioan, I can't answer any of your questions -- see my point number 2
below ;)
Nika
On Aug 6, 2007, at 11:25 AM, Ioan Raicu wrote:
> Hi,
>
> Veronika Nefedova wrote:
>> Ok, here is what happened with the last 244-molecule run.
>>
>> 1. First of all, the new Swift code (with loops, etc.) was used. The
>> code size is dramatically reduced:
>>
>> -rw-r--r-- 1 nefedova users 13342526 2007-07-05 12:01 MolDyn-244.dtm
>> -rw-r--r-- 1 nefedova users    21898 2007-08-03 11:00 MolDyn-244-loops.swift
>>
>>
>> 2. I do not have the log on the Swift side (it was probably not
>> produced because I put in the hack for output reduction and the
>> log output was suppressed -- it can be fixed easily).
>>
>> 3. There were 2 molecules that failed. That infamous m179 failed
>> at the last step (3 re-tries). Yuqing -- it's the same molecule you
>> said you fixed the antechamber code for. You told me to use the
>> code in your home directory /home/ydeng/antechamber-1.27; I
>> assumed it was on tg-uc. Is that correct? Or is it on another host?
>> Anyway, I used the code from the directory above and it didn't
>> work. The output is @tg-login1:/disks/scratchgpfs1/iraicu/ModLyn/
>> MolDyn-244-loos-bm66sjz1li5h1/shared. I could try to run this
>> molecule again specifically, in case it works for you.
>>
>> 4. The second molecule that failed is m050. It's quite a mystery
>> why it failed: it finished the 4th stage (those 68 charm jobs)
>> successfully (I have the data in the shared directory on tg-uc),
>> but then the 5th stage never started! I do not see any leftover
>> directories from the 5th stage for m050 (or from any other stages
>> for m050, for that matter). So it was not a job failure but a job
>> submission failure (since no directories were even created). It
>> had to be a job called 'generator_cat' with a parameter 'm050'.
>> Ioan -- is it possible to track what happened to this job in the
>> Falkon logs?
>>
> There were only 3 jobs that failed in the Falkon logs, so I presume
> that those were from (3) above. I also forgot to enable any debug
> logging, as the settings were left over from some older high-throughput
> experiments, so I don't have a trace of the task descriptions or
> their STDOUT and STDERR. About the only thing I can think of is...
> can you summarize from the Swift log how many jobs were submitted,
> how many succeeded, and how many failed? At least then we can
> make sure that the Swift log is consistent with the Falkon logs.
> Could it be that a task actually fails (say it doesn't produce all
> the output files) but still returns an exit code of 0 (success)?
> If so, would Swift then attempt the next task that needs the
> missing files, and likely fail during execution because it can't
> find them?
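>
> Something along these lines might do for the counting -- just a
> sketch, and I'm guessing at the exact format of the log lines (a
> task urn plus a status keyword somewhere on the same line), so the
> regexes will almost certainly need tweaking:
>
>     import re, sys
>     from collections import defaultdict
>
>     # assumed log format: lines mentioning a task ID like
>     # urn:0-1-10-0-1186176957479 together with a status keyword
>     task_re   = re.compile(r'urn:[0-9-]+')
>     status_re = re.compile(r'(submitted|success|completed|failed)', re.I)
>
>     counts = defaultdict(int)
>     for line in open(sys.argv[1]):        # the Swift log for this run
>         if task_re.search(line):
>             s = status_re.search(line)
>             if s:
>                 counts[s.group(1).lower()] += 1
>
>     for status in sorted(counts):
>         print("%s: %d" % (status, counts[status]))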
>
> Now, you mention that it could be a job submission failure... but
> wouldn't that be explicit in the Swift logs, i.e. that it tried to
> submit and the submission failed?
>
> Here is the list of all tasks that Falkon knows of:
> http://tg-viz-login1.uc.teragrid.org:51000/service_logs/GenericPortalWS_taskPerf.txt
>
> Can you produce a similar list of tasks (from the Swift logs), with
> the task ID (e.g. urn:0-1-10-0-1186176957479) and the status (i.e.
> submitted, success, failed, etc.)? I believe the latest
> provisioner code you had (which I hope did not get overwritten by
> an SVN update -- I don't know if it was ever checked in, and I
> don't remember whether it was changed before or after the commit
> to SVN) should have printed, at each submission to Falkon, the task
> ID in the form above and the status of the task at that point in
> time. Assuming this information is in the Swift log, you should be
> able to grep for these lines and produce a summary of all the
> tasks, which we can then cross-match with Falkon's logs. Which one
> is the Swift log for this latest run on viper? There are so many,
> and I can't tell which one it is.
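>
> For the cross-matching itself, a rough sketch like the one below
> could work, assuming the urn:... IDs appear verbatim in both the
> Swift log and a saved copy of GenericPortalWS_taskPerf.txt (I'm not
> sure off-hand whether that file carries the urns or only Falkon's
> internal task numbers, so treat this purely as a starting point):
>
>     import re, sys
>
>     urn_re = re.compile(r'urn:[0-9-]+')
>
>     def urns(path):
>         # collect every urn:... token that appears in the file
>         ids = set()
>         for line in open(path):
>             ids.update(urn_re.findall(line))
>         return ids
>
>     swift_ids  = urns(sys.argv[1])   # Swift log
>     falkon_ids = urns(sys.argv[2])   # saved GenericPortalWS_taskPerf.txt
>
>     print("in Swift log only:  %d" % len(swift_ids - falkon_ids))
>     print("in Falkon log only: %d" % len(falkon_ids - swift_ids))
>     print("in both:            %d" % len(swift_ids & falkon_ids))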
>
> Ioan
>> 5. I can't restart the workflow since this bug/feature has not
>> been fixed: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=29
>> (as long as I use the hack for output reduction -- restarts do not
>> work).
>>
>> Nika
>>
>> On Aug 3, 2007, at 11:03 PM, Ioan Raicu wrote:
>>
>>> Hi,
>>> Nika can probably be more specific, but the last time we ran the
>>> 244-molecule MolDyn, the workflow failed on the last few jobs,
>>> and the failures were application-specific, not Swift or Falkon
>>> related. I believe the specific issue that caused those jobs to
>>> fail has been resolved.
>>>
>>> We have made another attempt at the MolDyn 244-molecule run, and
>>> from what I can tell, it again did not complete successfully. We
>>> were supposed to have 20497 jobs...
>>>
>>>      1 x   1 =     1
>>>      1 x 244 =   244
>>>      1 x 244 =   244
>>>     68 x 244 = 16592
>>>      1 x 244 =   244
>>>     11 x 244 =  2684
>>>      1 x 244 =   244
>>>      1 x 244 =   244
>>>     -----------------
>>>        total  = 20497
>>>
>>> but we have:
>>> 20482 with exit code 0
>>> 1 with exit code -3
>>> 2 with exit code 253
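>>>
>>> (Just to tally: 20482 + 1 + 2 = 20485 tasks show up in the Falkon
>>> counts at all, so 12 of the expected 20497 apparently never
>>> reached Falkon; together with the 3 failed executions, that
>>> accounts for the 15 unfinished jobs mentioned below.)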
>>>
>>> I forgot to enable the debug logging at the workers, so I don't
>>> know what the STDOUT and STDERR were for these 3 jobs. Given that
>>> Swift retries a job 3 times before it fails the workflow, my guess
>>> is that these 3 jobs were really the same job failing 3 times. The
>>> failures occurred on 3 different machines, so I don't think it was
>>> machine-related. Nika, can you tell from the various Swift logs
>>> what happened to these 3 jobs? Is this the same issue as we had
>>> on the last 244-molecule run? It looks like we failed the workflow
>>> with 15 jobs to go.
>>>
>>> The graphs all look nice, similar to the last ones we had. If
>>> people really want to see them, I can generate them again.
>>> Otherwise, look at http://tg-viz-login1.uc.teragrid.org:51000/index.htm
>>> to see the last 10K samples of the experiment.
>>>
>>> Nika, after you try to figure out what happened, can you simply
>>> retry the workflow? Maybe it will manage to finish the last 15
>>> jobs. Depending on what problem we find, I think we might
>>> conclude that 3 retries is not enough, and we might want a
>>> higher number as the default when running with Falkon. If the
>>> error was an application error, then no matter how many retries
>>> we allow, they won't make any difference.
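>>>
>>> If we do want more retries, I believe that is the execution.retries
>>> setting in swift.properties (I'm going from memory, so please
>>> double-check the property name against the user guide), e.g.
>>> something like:
>>>
>>>     # allow more automatic retries per job before failing the workflow
>>>     execution.retries=5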
>>>
>>> Ioan
>>>
>>