[Swift-devel] Q about MolDyn
Ioan Raicu
iraicu at cs.uchicago.edu
Mon Aug 6 11:36:19 CDT 2007
Hi,
ANL/UC seems almost idle, I bet we could get 244 processors if we try it
again soon!
Ioan
iraicu at tg-viz-login1:~/java/Falkon_v0.8.1/service/logs/244-mol-08-03-07> showq

active jobs------------------------
JOBID     USERNAME  STATE    PROCS  REMAINING        STARTTIME
1479856   leggett   Running      2    3:27:15  Mon Aug  6 10:02:18
1479840   leggett   Running      2    1:05:51  Mon Aug  6 07:40:54

2 active jobs             4 of 260 processors in use by local jobs (1.54%)
                          2 of 130 nodes active (1.54%)

eligible jobs----------------------
JOBID     USERNAME  STATE    PROCS  WCLIMIT          QUEUETIME

0 eligible jobs

blocked jobs-----------------------
JOBID     USERNAME  STATE    PROCS  WCLIMIT          QUEUETIME

0 blocked jobs

Total jobs: 2
Veronika Nefedova wrote:
> Ioan, I can't answer any of your questions -- read my point number 2
> below );
>
> Nika
>
> On Aug 6, 2007, at 11:25 AM, Ioan Raicu wrote:
>
>> Hi,
>>
>> Veronika Nefedova wrote:
>>> Ok, here is what happened with the last 244-molecule run.
>>>
>>> 1. First of all, the new Swift code (with loops etc.) was used. The
>>> code size is dramatically reduced:
>>>
>>> -rw-r--r-- 1 nefedova users 13342526 2007-07-05 12:01 MolDyn-244.dtm
>>> -rw-r--r-- 1 nefedova users    21898 2007-08-03 11:00 MolDyn-244-loops.swift
>>>
>>>
>>> 2. I do not have the log on the Swift side (it probably was not
>>> produced because I put in the hack for output reduction and log
>>> output was suppressed -- it can be fixed easily).
>>>
>>> 3. There were 2 molecules that failed. That infamous m179 failed at
>>> the last step (after 3 retries). Yuqing -- it's the same molecule you
>>> said you fixed the antechamber code for. You told me to use the code
>>> in your home directory /home/ydeng/antechamber-1.27; I assumed it was
>>> on tg-uc. Is that correct? Or is it on another host? Anyway, I used
>>> the code from the directory above and it didn't work. The output
>>> is @tg-login1:/disks/scratchgpfs1/iraicu/ModLyn/MolDyn-244-loos-bm66sjz1li5h1/shared.
>>> I could try to run this molecule again specifically, in case it works
>>> for you.
>>>
>>> 4. The second molecule that failed is m050. It's quite a mystery why
>>> it failed: it finished the 4th stage (those 68 charm jobs)
>>> successfully (I have the data in the shared directory on tg-uc), but
>>> then the 5th stage never started! I do not see any leftover
>>> directories from the 5th stage for m050 (or any other stages for
>>> m050, for that matter). So it was not a job failure, but a job
>>> submission failure (since no directories were even created). It had
>>> to be a job called 'generator_cat' with the parameter 'm050'. Ioan --
>>> is it possible to track what happened to this job in the Falkon logs?
>>>
>> There were only 3 jobs that failed in the Falkon logs, so I presume
>> those were from (3) above. I also forgot to enable any debug
>> logging, as the settings were left over from some older high-throughput
>> experiments, so I don't have a trace of all the task descriptions and
>> their STDOUT and STDERR. About the only thing I can think of is: can
>> you summarize from the Swift log how many jobs were submitted, how
>> many succeeded, and how many failed? At the very least we can make
>> sure that the Swift log is consistent with the Falkon logs. Could it
>> be that a task actually fails (say it doesn't produce all the output
>> files) but still returns an exit code of 0 (success)? If so, would
>> Swift then attempt the next task that needs the missing files, and
>> likely fail during execution because it can't find them all?
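>>
>> Something like this rough Python sketch is what I have in mind for the
>> summary -- the status keywords ("Submitted", "Completed", "Failed") are
>> just my guess at what the Swift log prints, so the pattern will need
>> adjusting to the real wording:
>>
>>     import re
>>     import sys
>>     from collections import Counter
>>
>>     # The status keywords below are assumptions, not the confirmed
>>     # Swift log format -- adjust them to whatever the log really says.
>>     STATUS_RE = re.compile(r'\b(Submitted|Completed|Failed)\b')
>>
>>     counts = Counter()
>>     with open(sys.argv[1]) as log:        # path to the Swift log
>>         for line in log:
>>             match = STATUS_RE.search(line)
>>             if match:
>>                 counts[match.group(1)] += 1
>>
>>     for status, n in sorted(counts.items()):
>>         print(status, n)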
>>
>> Now, you mention that it could be a job submission failure... but
>> wouldn't this be explicit in the Swift logs, that it tried to submit
>> and it failed?
>>
>> Here is the list of all tasks that Falkon knows of:
>> http://tg-viz-login1.uc.teragrid.org:51000/service_logs/GenericPortalWS_taskPerf.txt
>>
>> Can you produce a similar list of tasks (from the Swift logs), with
>> the task ID (e.g. urn:0-1-10-0-1186176957479) and the status (i.e.
>> submitted, success, failed, etc.)? I believe the latest provisioner
>> code you had (which I hope did not get overwritten by SVN -- I don't
>> know if it was ever checked in, and I don't remember whether it was
>> changed before or after the commit to SVN) should have printed, at
>> each submission to Falkon, the task ID in the form above and the
>> status of the task at that point in time. Assuming this information
>> is in the Swift log, you should be able to grep for these lines and
>> produce a summary of all the tasks, which we can then cross-match
>> with Falkon's logs. Which one is the Swift log for this latest run on
>> viper? There are so many, and I can't tell which one it is.
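>>
>> For the cross-match itself, a rough sketch along these lines should
>> work -- I'm assuming here that both logs contain the task IDs in the
>> urn:... form (the exact file formats may differ, so treat this as a
>> starting point rather than a finished script):
>>
>>     import re
>>     import sys
>>
>>     URN_RE = re.compile(r'urn:[0-9][0-9-]*')
>>
>>     def ids_in(path):
>>         """Collect every urn:... task ID that appears in a log file."""
>>         found = set()
>>         with open(path) as f:
>>             for line in f:
>>                 found.update(URN_RE.findall(line))
>>         return found
>>
>>     swift_ids = ids_in(sys.argv[1])    # the Swift log
>>     falkon_ids = ids_in(sys.argv[2])   # GenericPortalWS_taskPerf.txt
>>
>>     print("in Swift log only: ", len(swift_ids - falkon_ids))
>>     print("in Falkon log only:", len(falkon_ids - swift_ids))
>>     print("in both logs:      ", len(swift_ids & falkon_ids))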
>>
>> Ioan
>>> 5. I can't restart the workflow since this bug/feature has not been
>>> fixed: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=29 (as long
>>> as I use the hack for output reduction -- restarts do not work).
>>>
>>> Nika
>>>
>>> On Aug 3, 2007, at 11:03 PM, Ioan Raicu wrote:
>>>
>>>> Hi,
>>>> Nika can probably be more specific, but the last time we ran the
>>>> 244-molecule MolDyn, the workflow failed on the last few jobs, and
>>>> the failures were application-specific, not Swift or Falkon issues.
>>>> I believe the specific issue that caused those jobs to fail has
>>>> been resolved.
>>>>
>>>> We have made another attempt at the MolDyn 244-molecule run, and
>>>> from what I can tell, it once again did not complete successfully.
>>>> We were supposed to have 20497 jobs...
>>>>
>>>> jobs per molecule   molecules   total jobs
>>>>                 1           1            1
>>>>                 1         244          244
>>>>                 1         244          244
>>>>                68         244        16592
>>>>                 1         244          244
>>>>                11         244         2684
>>>>                 1         244          244
>>>>                 1         244          244
>>>>                                   ----------
>>>>                                        20497
>>>>
>>>> but we have:
>>>> 20482 with exit code 0
>>>> 1 with exit code -3
>>>> 2 with exit code 253
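>>>>
>>>> Just to double-check the numbers, a trivial Python sketch (nothing
>>>> here beyond the table above and the exit-code counts):
>>>>
>>>>     # (jobs per molecule, molecules) pairs, per stage, from the table above
>>>>     stages = [(1, 1), (1, 244), (1, 244), (68, 244),
>>>>               (1, 244), (11, 244), (1, 244), (1, 244)]
>>>>     expected = sum(per_mol * mols for per_mol, mols in stages)
>>>>     finished_ok = 20482                  # exit code 0
>>>>     seen_by_falkon = 20482 + 1 + 2       # all exit codes combined
>>>>
>>>>     print(expected)                      # 20497 jobs expected
>>>>     print(seen_by_falkon)                # 20485 task executions in Falkon's log
>>>>     print(expected - finished_ok)        # 15 jobs short of a complete run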
>>>>
>>>> I forgot to enable debugging at the workers, so I don't know what
>>>> the STDOUT and STDERR were for these 3 jobs. Given that Swift
>>>> retries a job 3 times before it fails the workflow, my guess is
>>>> that these 3 jobs were really the same job failing 3 times. The
>>>> failures occurred on 3 different machines, so I don't think it was
>>>> machine-related. Nika, can you tell from the various Swift logs
>>>> what happened to these 3 jobs? Is this the same issue we had on
>>>> the last 244-molecule run? It looks like we failed the workflow
>>>> with 15 jobs to go.
>>>>
>>>> The graphs all look nice, similar to the last ones we had. If
>>>> people really want to see them, I can generate them again.
>>>> Otherwise, look at
>>>> http://tg-viz-login1.uc.teragrid.org:51000/index.htm to see the
>>>> last 10K samples of the experiment.
>>>>
>>>> Nika, after you try to figure out what happened, can you simply
>>>> retry the workflow? Maybe it will manage to finish the last 15
>>>> jobs. Depending on what problem we find, I think we might conclude
>>>> that 3 retries is not enough, and we might want a higher number as
>>>> the default when running with Falkon. If the error was an
>>>> application error, then no matter how many retries we have, it
>>>> won't make any difference.
>>>>
>>>> Ioan
>>>>
>>>
>