[Swift-devel] Q about MolDyn

Ioan Raicu iraicu at cs.uchicago.edu
Mon Aug 6 11:36:19 CDT 2007


Hi,
ANL/UC seems almost idle; I bet we could get 244 processors if we try
again soon!
Ioan

iraicu at tg-viz-login1:~/java/Falkon_v0.8.1/service/logs/244-mol-08-03-07> showq

active jobs------------------------
JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME

1479856             leggett    Running     2     3:27:15  Mon Aug  6 10:02:18
1479840             leggett    Running     2     1:05:51  Mon Aug  6 07:40:54

2 active jobs             4 of 260 processors in use by local jobs (1.54%)
                          2 of 130 nodes active      (1.54%)

eligible jobs----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME

0 eligible jobs

blocked jobs-----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME

0 blocked jobs

Total jobs:  2
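
If it helps, here is a quick, throwaway sketch of how one could script
that check before grabbing the allocation -- it just parses showq's
"N of M processors in use" summary line; the 244-worker target and the
regex are assumptions, nothing official:

#!/usr/bin/env python
# Throwaway check: parse `showq` output and report whether enough
# processors are free for another Falkon allocation.
import re, subprocess

NEEDED = 244  # number of workers we would like to grab

out = subprocess.run(["showq"], capture_output=True, text=True).stdout
m = re.search(r"(\d+) of (\d+) processors in use", out)
if m:
    used, total = int(m.group(1)), int(m.group(2))
    free = total - used
    print("%d of %d processors free (%s for %d workers)"
          % (free, total, "enough" if free >= NEEDED else "not enough",
             NEEDED))
else:
    print("could not find the processor summary line in showq output")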


Veronika Nefedova wrote:
> Ioan, I can't answer any of your questions -- read my point number 2
> below ;)
>
> Nika
>
> On Aug 6, 2007, at 11:25 AM, Ioan Raicu wrote:
>
>> Hi,
>>
>> Veronika Nefedova wrote:
>>> Ok, here is what happened with the last 244-molecule run.
>>>
>>> 1. First of all, the new Swift code (with loops, etc.) was used.
>>> The code size is dramatically reduced:
>>>
>>> -rw-r--r--  1 nefedova users 13342526 2007-07-05 12:01 MolDyn-244.dtm
>>> -rw-r--r--  1 nefedova users    21898 2007-08-03 11:00 
>>> MolDyn-244-loops.swift
>>>
>>>
>>> 2. I do not have the log on the Swift side (it probably was not
>>> produced because I put in the hack for output reduction and log
>>> output was suppressed -- this can be fixed easily).
>>>
>>> 3. There were 2 molecules that failed.  That infamous m179 failed
>>> at the last step (3 retries).  Yuqing -- it's the same molecule you
>>> said you fixed the antechamber code for.  You told me to use the
>>> code in your home directory /home/ydeng/antechamber-1.27; I assumed
>>> it was on tg-uc.  Is that correct, or is it on another host?
>>> Anyway, I used the code from that directory and it didn't work.
>>> The output is
>>> @tg-login1:/disks/scratchgpfs1/iraicu/ModLyn/MolDyn-244-loos-bm66sjz1li5h1/shared.
>>> I could try to run this molecule again specifically, in case it
>>> works for you.
>>>
>>> 4. The second molecule that failed is m050.  It's quite a mystery
>>> why it failed: it finished the 4th stage (those 68 charm jobs)
>>> successfully (I have the data in the shared directory on tg-uc),
>>> but the 5th stage never started!  I do not see any leftover
>>> directories from the 5th stage for m050 (or from any other stages
>>> for m050, for that matter).  So it was not a job failure but a job
>>> submission failure (since no directories were even created).  It
>>> had to be a job called 'generator_cat' with a parameter 'm050'.
>>> Ioan -- is it possible to track what happened to this job in the
>>> Falkon logs?
>>>
>> There were only 3 jobs that failed in the Falkon logs, so I presume
>> those were from (3) above.  I also forgot to enable any debug
>> logging (the settings were left over from some older high-throughput
>> experiments), so I don't have a trace of the task descriptions or
>> their STDOUT and STDERR.  About the only thing I can think of is...
>> can you summarize from the Swift log how many jobs were submitted,
>> how many succeeded, and how many failed?  At least then we can make
>> sure that the Swift log is consistent with the Falkon logs.  Could
>> it be that a task actually fails (say, it doesn't produce all of its
>> output files) but still returns an exit code of 0 (success)?  If so,
>> would Swift attempt the next task that needed the missing files,
>> which would then likely fail during execution because it can't find
>> all of its input files?
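>>
>> (To catch that case, something like the following hypothetical
>> wrapper -- not part of MolDyn or Swift, just a sketch -- could run
>> the real application and refuse to report success unless every
>> expected output file exists and is non-empty:)
>>
>> #!/usr/bin/env python
>> # Hypothetical wrapper: run the real command, then verify its outputs.
>> # EXPECTED_OUTPUTS is an assumed convention (space-separated names).
>> import os, subprocess, sys
>>
>> cmd = sys.argv[1:]                       # the real application command
>> expected = os.environ.get("EXPECTED_OUTPUTS", "").split()
>>
>> rc = subprocess.call(cmd)
>> missing = [f for f in expected
>>            if not (os.path.isfile(f) and os.path.getsize(f) > 0)]
>>
>> if rc == 0 and missing:
>>     sys.stderr.write("exit 0 but outputs missing: %s\n"
>>                      % " ".join(missing))
>>     rc = 1                               # turn it into a visible failure
>> sys.exit(rc)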
>>
>> Now, you mention that it could be a job submission failure... but
>> wouldn't that be explicit in the Swift logs, i.e., wouldn't they
>> show that it tried to submit and the submission failed?
>>
>> Here is the list of all tasks that Falkon knows of: 
>> http://tg-viz-login1.uc.teragrid.org:51000/service_logs/GenericPortalWS_taskPerf.txt
>>
>> Can you produce a similar list of tasks (from the Swift logs), with
>> the task ID (e.g., urn:0-1-10-0-1186176957479) and the status (i.e.,
>> submitted, success, failed, etc.)?  I believe that the latest
>> provisioner code you had (which I hope did not get overwritten by an
>> SVN update -- I don't know if it was ever checked in, and I don't
>> remember whether it was changed before or after the commit to SVN)
>> should have printed, at each submission to Falkon, the task ID in
>> the form shown above and the status of the task at that point in
>> time.  Assuming this information is in the Swift log, you should be
>> able to grep for these lines and produce a summary of all the tasks,
>> which we can then cross-match with Falkon's logs.  Which one is the
>> Swift log for this latest run on viper?  There are so many, and I
>> can't tell which one it is.
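>>
>> (Something along these lines might do the summarizing and
>> cross-matching; it is only a sketch, and the regex assumes the
>> provisioner prints lines like
>> "... urn:0-1-10-0-1186176957479 ... status SUBMITTED ..." -- adjust
>> it to whatever the log actually contains:)
>>
>> #!/usr/bin/env python
>> # Sketch: tally per-task status from the Swift log and flag task IDs
>> # that never show up in Falkon's GenericPortalWS_taskPerf.txt.  The
>> # Swift log line format assumed below may not match the real output.
>> import re, sys
>> from collections import Counter
>>
>> swift_log, falkon_tasks = sys.argv[1], sys.argv[2]
>>
>> pat = re.compile(r"(urn:[\w.-]+).*?status[=:\s]+(\w+)", re.IGNORECASE)
>> status = {}
>> with open(swift_log) as f:
>>     for line in f:
>>         m = pat.search(line)
>>         if m:
>>             status[m.group(1)] = m.group(2).upper()  # keep last status
>>
>> print(Counter(status.values()))    # submitted / success / failed counts
>>
>> falkon_ids = set()
>> with open(falkon_tasks) as f:
>>     for line in f:
>>         falkon_ids.update(re.findall(r"urn:[\w.-]+", line))
>>
>> print("in Swift log but not in Falkon's list:",
>>       sorted(set(status) - falkon_ids)[:20])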
>>
>> Ioan
>>> 5. I can't restart the workflow since this bug/feature has not been 
>>> fixed: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=29 (as long 
>>> as I use the hack for output reduction -- restarts do not work).
>>>
>>> Nika
>>>
>>> On Aug 3, 2007, at 11:03 PM, Ioan Raicu wrote:
>>>
>>>> Hi,
>>>> Nika can probably be more specific, but the last time we ran the
>>>> 244-molecule MolDyn, the workflow failed on the last few jobs, and
>>>> the failures were application-specific, not Swift or Falkon.  I
>>>> believe the specific issue that caused those jobs to fail has been
>>>> resolved.
>>>>
>>>> We have made another attempt at the 244-molecule MolDyn run, and
>>>> from what I can tell, it did not complete successfully again.  We
>>>> were supposed to have 20497 jobs:
>>>>
>>>> Stage   Jobs/molecule   Molecules    Jobs
>>>>   1           1               1         1
>>>>   2           1             244       244
>>>>   3           1             244       244
>>>>   4          68             244     16592
>>>>   5           1             244       244
>>>>   6          11             244      2684
>>>>   7           1             244       244
>>>>   8           1             244       244
>>>> -------------------------------------------
>>>> Total                               20497
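>>>>
>>>> (Quick sanity check on that total, from the per-stage
>>>> jobs-per-molecule and molecule counts above:)
>>>>
>>>> stages = [(1, 1), (1, 244), (1, 244), (68, 244),
>>>>           (1, 244), (11, 244), (1, 244), (1, 244)]
>>>> print(sum(jobs * mols for jobs, mols in stages))   # -> 20497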
>>>>
>>>>
>>>> but we have:
>>>> 20482 with exit code 0
>>>> 1 with exit code -3
>>>> 2 with exit code 253
>>>>
>>>> I forgot to enable the debug logging at the workers, so I don't
>>>> know what the STDOUT and STDERR were for these 3 jobs.  Given that
>>>> Swift retries a job 3 times before it fails the workflow, my guess
>>>> is that these 3 failures were really the same job failing 3 times.
>>>> The failures occurred on 3 different machines, so I don't think
>>>> they were machine-related.  Nika, can you tell from the various
>>>> Swift logs what happened to these 3 jobs?  Is this the same issue
>>>> we had on the last 244-molecule run?  It looks like the workflow
>>>> failed with 15 jobs to go.
>>>>
>>>> The graphs all look nice, similar to the last ones we had.  If 
>>>> people really want to see them, I can generate them again.  
>>>> Otherwise, look at 
>>>> http://tg-viz-login1.uc.teragrid.org:51000/index.htm to see the 
>>>> last 10K samples of the experiment.
>>>>
>>>> Nika, after you try to figure out what happened, can you simply
>>>> retry the workflow?  Maybe it will manage to finish the last 15
>>>> jobs.  Depending on what problem we find, we might conclude that 3
>>>> retries is not enough and want a higher default when running with
>>>> Falkon.  Of course, if the error was an application error, then no
>>>> matter how many retries we allow, it won't make any difference.
>>>>
>>>> Ioan
>>>>
>>>
>