[Swift-devel] Q about MolDyn
Veronika Nefedova
nefedova at mcs.anl.gov
Mon Aug 6 12:22:28 CDT 2007
BTW - the Swift log issue was fixed, thanks to Ben -- it was not the
output reduction hack but some discrepancies in the log4j.properties
file that were introduced during the latest SVN update.
Nika
On Aug 6, 2007, at 11:34 AM, Ioan Raicu wrote:
> Aha, OK, it didn't click that (2) referred to the Swift log I was
> asking about. So, in that case, we can't do much else on this run,
> other than make sure we fix the infamous m179 molecule, turn on all
> debugging (and make sure it's actually printing debug statements),
> and try the run again!
>
> Ioan
>
> Veronika Nefedova wrote:
>> Ioan, I can't answer any of your questions -- read my point number
>> 2 below ;)
>>
>> Nika
>>
>> On Aug 6, 2007, at 11:25 AM, Ioan Raicu wrote:
>>
>>> Hi,
>>>
>>> Veronika Nefedova wrote:
>>>> Ok, here is what happened with the last 244-molecule run.
>>>>
>>>> 1. First of all, the new Swift code (with loops, etc.) was used.
>>>> The code size is dramatically reduced:
>>>>
>>>> -rw-r--r-- 1 nefedova users 13342526 2007-07-05 12:01 MolDyn-244.dtm
>>>> -rw-r--r-- 1 nefedova users    21898 2007-08-03 11:00 MolDyn-244-loops.swift
>>>>
>>>>
>>>> 2. I do not have the log on the Swift side (probably it was not
>>>> produced because I put in the hack for output reduction and log
>>>> output was suppressed -- it can be fixed easily).
>>>>
>>>> 3. There were 2 molecules that failed. That infamous m179
>>>> failed at the last step (3 re-tries). Yuqing -- it's the same
>>>> molecule you said you fixed the antechamber code for. You told
>>>> me to use the code in your home directory
>>>> /home/ydeng/antechamber-1.27; I assumed it was on tg-uc. Is that
>>>> correct, or is it on another host? Anyway, I used the code from
>>>> the directory above and it didn't work. The output is on
>>>> tg-login1 at
>>>> /disks/scratchgpfs1/iraicu/ModLyn/MolDyn-244-loos-bm66sjz1li5h1/shared.
>>>> I could try to run this molecule again specifically in case it
>>>> works for you.
>>>>
>>>> 4. The second molecule that failed is m050. It's quite a mystery
>>>> why it failed: it finished the 4th stage (those 68 charm jobs)
>>>> successfully (I have the data in the shared directory on tg-uc),
>>>> but then the 5th stage never started! I do not see any leftover
>>>> directories from the 5th stage for m050 (or any other stages for
>>>> m050, for that matter). So it was not a job failure but a job
>>>> submission failure (since no directories were even created). It
>>>> had to be a job called 'generator_cat' with a parameter 'm050'.
>>>> Ioan - is it possible to track what happened to this job in the
>>>> Falkon logs?
>>>>
>>> There were only 3 jobs that failed in the Falkon logs, so I
>>> presume those were from (3) above. I also forgot to enable any
>>> debug logging, as the settings were left over from some older
>>> high-throughput experiments, so I don't have a trace of all the
>>> task descriptions and STDOUT/STDERR. About the only thing I can
>>> think of is: can you summarize from the Swift log how many jobs
>>> were submitted, how many succeeded, and how many failed? At least
>>> then we can make sure that the Swift log is consistent with the
>>> Falkon logs. Could it be that a task actually fails (say it
>>> doesn't produce all the output files) but still returns an exit
>>> code of 0 (success)? If so, would Swift attempt the next task
>>> that needed the missing files, and likely fail while executing
>>> because it couldn't find all the files?
>>>
>>> Now, you mention that it could be a job submission failure... but
>>> wouldn't that be explicit in the Swift logs -- that it tried to
>>> submit and the submission failed?
>>>
>>> Here is the list of all tasks that Falkon knows of:
>>> http://tg-viz-login1.uc.teragrid.org:51000/service_logs/GenericPortalWS_taskPerf.txt
>>>
>>> Can you produce a similar list of tasks from the Swift logs, with
>>> the task ID (e.g. urn:0-1-10-0-1186176957479) and the status (i.e.
>>> submitted, success, failed, etc.)? I believe the latest
>>> provisioner code you had (which I hope did not get overwritten by
>>> the SVN update -- I don't know if it was ever checked in, and I
>>> don't remember whether it was changed before or after the commit)
>>> should have printed, at each submission to Falkon, the task ID in
>>> the form above and the status of the task at that point in time.
>>> Assuming this information is in the Swift log, you should be able
>>> to grep for these lines and produce a summary of all the tasks,
>>> which we can then cross-match with Falkon's logs. Which one is
>>> the Swift log for this latest run on viper? There are so many,
>>> and I can't tell which one it is.
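>>>
>>> If those task IDs do show up, something along these lines (an
>>> untested sketch -- it assumes each relevant Swift log line carries
>>> a urn:... task ID plus a status word, and that the first column of
>>> the taskPerf file is the task ID; adjust to whatever the real
>>> formats turn out to be) could build the summary and do the
>>> cross-match:
>>>
>>>     import re
>>>     import sys
>>>     from collections import Counter
>>>
>>>     # Last known status per Swift task ID.  Format is assumed, not
>>>     # verified: a "urn:..." ID and a status word on the same line.
>>>     urn_re = re.compile(r'(urn:[\w.\-]+)')
>>>     status_re = re.compile(r'\b(submitted|active|completed|failed)\b', re.I)
>>>
>>>     swift_status = {}
>>>     with open(sys.argv[1]) as log:          # the Swift log for this run
>>>         for line in log:
>>>             m = urn_re.search(line)
>>>             s = status_re.search(line)
>>>             if m and s:
>>>                 swift_status[m.group(1)] = s.group(1).lower()
>>>
>>>     print(Counter(swift_status.values()))
>>>
>>>     # Falkon side: GenericPortalWS_taskPerf.txt, assuming the task ID
>>>     # is the first whitespace-separated column of each line.
>>>     with open(sys.argv[2]) as perf:
>>>         falkon_ids = {line.split()[0] for line in perf if line.strip()}
>>>
>>>     print('in Swift log but not Falkon:', sorted(set(swift_status) - falkon_ids))
>>>     print('in Falkon but not Swift log:', sorted(falkon_ids - set(swift_status)))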
>>>
>>> Ioan
>>>> 5. I can't restart the workflow since this bug/feature has not
>>>> been fixed: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=29
>>>> (as long as I use the hack for output reduction -- restarts do
>>>> not work).
>>>>
>>>> Nika
>>>>
>>>> On Aug 3, 2007, at 11:03 PM, Ioan Raicu wrote:
>>>>
>>>>> Hi,
>>>>> Nika can probably be more specific, but the last time we ran
>>>>> the 244-molecule MolDyn, the workflow failed on the last few
>>>>> jobs, and the failures were application-specific, not Swift or
>>>>> Falkon issues. I believe the specific issue that caused those
>>>>> jobs to fail has been resolved.
>>>>>
>>>>> We have made another attempt at the MolDyn 244-molecule run,
>>>>> and from what I can tell, it again did not complete
>>>>> successfully. We were supposed to have 20497 jobs:
>>>>>
>>>>>  jobs/molecule   molecules   total
>>>>>        1              1          1
>>>>>        1            244        244
>>>>>        1            244        244
>>>>>       68            244      16592
>>>>>        1            244        244
>>>>>       11            244       2684
>>>>>        1            244        244
>>>>>        1            244        244
>>>>>  ----------------------------------
>>>>>  total                        20497
>>>>>
>>>>> but we have:
>>>>> 20482 with exit code 0
>>>>> 1 with exit code -3
>>>>> 2 with exit code 253
>>>>>
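>>>>> (That tally came off the taskPerf file; a minimal sketch like
>>>>> the one below would reproduce it, assuming -- and this is only
>>>>> an assumption about the file layout -- that the exit code is the
>>>>> last column of each line.)
>>>>>
>>>>>     import sys
>>>>>     from collections import Counter
>>>>>
>>>>>     # Tally exit codes from Falkon's GenericPortalWS_taskPerf.txt.
>>>>>     # Assumed layout: the last whitespace-separated field on each
>>>>>     # line is the task's exit code.
>>>>>     codes = Counter()
>>>>>     with open(sys.argv[1]) as perf:
>>>>>         for line in perf:
>>>>>             cols = line.split()
>>>>>             if cols:
>>>>>                 codes[cols[-1]] += 1
>>>>>
>>>>>     for code, count in codes.most_common():
>>>>>         print(count, 'with exit code', code)
>>>>>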
>>>>> I forgot to enable the debug logging at the workers, so I don't
>>>>> know what the STDOUT and STDERR were for these 3 jobs. Given
>>>>> that Swift retries a job 3 times before it fails the workflow,
>>>>> my guess is that these 3 jobs were really the same job failing
>>>>> 3 times. The failures occurred on 3 different machines, so I
>>>>> don't think it was machine-related. Nika, can you tell from
>>>>> the various Swift logs what happened to these 3 jobs? Is this
>>>>> the same issue we had on the last 244-molecule run? It looks
>>>>> like we failed the workflow with 15 jobs to go.
>>>>>
>>>>> The graphs all look nice, similar to the last ones we had. If
>>>>> people really want to see them, I can generate them again.
>>>>> Otherwise, look at
>>>>> http://tg-viz-login1.uc.teragrid.org:51000/index.htm to see the
>>>>> last 10K samples of the experiment.
>>>>>
>>>>> Nika, after you try to figure out what happened, can you simply
>>>>> retry the workflow? Maybe it will manage to finish the last 15
>>>>> jobs. Depending on what problem we find, we might conclude that
>>>>> 3 retries is not enough, and we might want a higher default
>>>>> retry count when running with Falkon. Of course, if the error
>>>>> was an application error, then no matter how many retries we
>>>>> allow, it won't make any difference.
>>>>>
>>>>> Ioan
>>>>>
>>>>
>>