[Swift-devel] Q about MolDyn

Veronika Nefedova nefedova at mcs.anl.gov
Mon Aug 6 12:22:28 CDT 2007


BTW - the Swift log issue was fixed, thanks to Ben -- it was not the
output reduction hack but rather some discrepancies in the
log4j.properties file that were introduced during the latest SVN update.

Nika

On Aug 6, 2007, at 11:34 AM, Ioan Raicu wrote:

> Aha, OK, it didn't click that (2) referred to the Swift log I was
> asking about.  So, in that case, we can't do much else on this run,
> other than make sure we fix the infamous m179 molecule, turn on all
> debugging (and make sure it's actually printing debug statements), and
> try the run again!
>
> Ioan
>
> Veronika Nefedova wrote:
>> Ioan, I can't answer any of your questions -- see my point number
>> 2 below :)
>>
>> Nika
>>
>> On Aug 6, 2007, at 11:25 AM, Ioan Raicu wrote:
>>
>>> Hi,
>>>
>>> Veronika Nefedova wrote:
>>>> Ok, here is what happened with the last 244-molecule run.
>>>>
>>>> 1. First of all, the new Swift code (with loops etc.) was used.
>>>> Its size is dramatically reduced:
>>>>
>>>> -rw-r--r--  1 nefedova users 13342526 2007-07-05 12:01 MolDyn-244.dtm
>>>> -rw-r--r--  1 nefedova users    21898 2007-08-03 11:00 MolDyn-244-loops.swift
>>>>
>>>>
>>>> 2. I do not have the log on the Swift side (probably it was not
>>>> produced because I put in the hack for output reduction and log
>>>> output was suppressed -- it can be fixed easily).
>>>>
>>>> 3. There were 2 molecules that failed.  That infamous m179 failed at
>>>> the last step (3 re-tries). Yuqing -- it's the same molecule you said
>>>> you fixed the antechamber code for. You told me to use the code in
>>>> your home directory /home/ydeng/antechamber-1.27; I assumed it was on
>>>> tg-uc. Is that correct? Or is it on another host? Anyway, I used the
>>>> code from the directory above and it didn't work. The output is at
>>>> tg-login1:/disks/scratchgpfs1/iraicu/ModLyn/MolDyn-244-loos-bm66sjz1li5h1/shared.
>>>> I could try running this molecule again specifically in case it
>>>> works for you.
>>>>
>>>> 4. The second molecule that failed is m050. It's quite a mystery why
>>>> it failed: it finished the 4th stage (those 68 charm jobs)
>>>> successfully (I have the data in the shared directory on tg-uc), but
>>>> then the 5th stage never started! I do not see any leftover
>>>> directories from the 5th stage for m050 (or any other stages for
>>>> m050, for that matter). So it was not a job failure but a job
>>>> submission failure (since no directories were even created). It had
>>>> to be a job called 'generator_cat' with a parameter 'm050'. Ioan --
>>>> is it possible to track what happened to this job in the Falkon logs?
>>>>
>>> There were only 3 jobs that failed in the Falkon logs, so I presume
>>> that those were from (3) above.  I also forgot to enable any debug
>>> logging, as the settings were left over from some older
>>> high-throughput experiments, so I don't have a trace of all the task
>>> descriptions and their STDOUT and STDERR.  About the only thing I can
>>> think of is... can you summarize from the Swift log how many jobs
>>> were submitted, how many succeeded, and how many failed?  At least
>>> that way we can make sure that the Swift log is consistent with the
>>> Falkon logs.  Could it be that a task actually fails (say it doesn't
>>> produce all the output files) but still returns an exit code of 0
>>> (success)?  If so, would Swift attempt the next task that needs the
>>> missing files, and likely fail during execution because it can't find
>>> all the files?
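>>>
>>> Just as a rough sketch of the kind of summary I mean (in Python; the
>>> status keywords "Submitted", "Active", "Completed" and "Failed" are
>>> an assumption and may not match the wording actually used in the
>>> Swift log):
>>>
>>> # summarize_swift_log.py -- count status keywords in a Swift log (sketch)
>>> import sys
>>> from collections import Counter
>>>
>>> statuses = ("Submitted", "Active", "Completed", "Failed")
>>> counts = Counter()
>>> with open(sys.argv[1]) as log:            # path to the Swift log file
>>>     for line in log:
>>>         for status in statuses:
>>>             if status in line:
>>>                 counts[status] += 1
>>> for status in statuses:
>>>     print(status, counts[status])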
>>>
>>> Now, you mention that it could be a job submission failure... but  
>>> wouldn't this be explicit in the Swift logs, that it tried to  
>>> submit and it failed?
>>>
>>> Here is the list of all tasks that Falkon knows of:
>>> http://tg-viz-login1.uc.teragrid.org:51000/service_logs/GenericPortalWS_taskPerf.txt
>>>
>>> Can you produce a similar list of tasks (from the Swift logs), with
>>> the task ID (e.g. urn:0-1-10-0-1186176957479) and the status (i.e.
>>> submitted, success, failed, etc.)?  I believe the latest provisioner
>>> code you had (which I hope did not get overwritten by the SVN update
>>> -- I don't know if it was ever checked in, and I don't remember
>>> whether it was changed before or after the commit to SVN) should have
>>> printed, at each submission to Falkon, the task ID in the form shown
>>> above along with the status of the task at that point in time.
>>> Assuming this information is in the Swift log, you should be able to
>>> grep for these lines and produce a summary of all the tasks, which we
>>> can then cross-match with Falkon's logs.  Which one is the Swift log
>>> for this latest run on viper?  There are so many, and I can't tell
>>> which one it is.
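>>>
>>> Something along these lines is what I have in mind (Python; the urn
>>> regular expression, the status keywords, and the assumption that the
>>> task ID is the first column of GenericPortalWS_taskPerf.txt are all
>>> guesses that would need to be checked against the real formats):
>>>
>>> # crosscheck_tasks.py -- compare Swift log task IDs against Falkon's (sketch)
>>> import re
>>> import sys
>>>
>>> # task ID -> last seen status, from the Swift log
>>> pattern = re.compile(r"(urn:[0-9-]+).*?(Submitted|Active|Completed|Failed)")
>>> swift_tasks = {}
>>> with open(sys.argv[1]) as swift_log:      # the Swift log for this run
>>>     for line in swift_log:
>>>         m = pattern.search(line)
>>>         if m:
>>>             swift_tasks[m.group(1)] = m.group(2)
>>>
>>> # task IDs from Falkon, assuming the ID is the first field on each line
>>> falkon_tasks = set()
>>> with open(sys.argv[2]) as falkon_log:     # GenericPortalWS_taskPerf.txt
>>>     for line in falkon_log:
>>>         fields = line.split()
>>>         if fields and fields[0].startswith("urn:"):
>>>             falkon_tasks.add(fields[0])
>>>
>>> print("Swift tasks:", len(swift_tasks), "Falkon tasks:", len(falkon_tasks))
>>> print("in Swift but not in Falkon:", sorted(set(swift_tasks) - falkon_tasks))
>>> print("in Falkon but not in Swift:", sorted(falkon_tasks - set(swift_tasks)))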
>>>
>>> Ioan
>>>> 5. I can't restart the workflow since this bug/feature has not  
>>>> been fixed: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=29  
>>>> (as long as I use the hack for output reduction -- restarts do  
>>>> not work).
>>>>
>>>> Nika
>>>>
>>>> On Aug 3, 2007, at 11:03 PM, Ioan Raicu wrote:
>>>>
>>>>> Hi,
>>>>> Nika can probably be more specific, but the last time we ran the
>>>>> 244-molecule MolDyn, the workflow failed on the last few jobs, and
>>>>> the failures were application-specific, not caused by Swift or
>>>>> Falkon.  I believe the specific issue that caused those jobs to
>>>>> fail has been resolved.
>>>>>
>>>>> We have made another attempt at the MolDyn 244-molecule run, and
>>>>> from what I can tell, it again did not complete successfully.  We
>>>>> were supposed to have 20497 jobs...
>>>>>
>>>>> jobs/molecule   molecules   total jobs
>>>>>       1               1             1
>>>>>       1             244           244
>>>>>       1             244           244
>>>>>      68             244         16592
>>>>>       1             244           244
>>>>>      11             244          2684
>>>>>       1             244           244
>>>>>       1             244           244
>>>>> ---------------------------------------
>>>>> total                           20497
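>>>>>
>>>>> (A quick sanity check of that expected total in Python -- nothing
>>>>> assumed beyond the per-stage counts above:)
>>>>>
>>>>> jobs_per_molecule = [1, 1, 1, 68, 1, 11, 1, 1]
>>>>> molecules_per_stage = [1, 244, 244, 244, 244, 244, 244, 244]
>>>>> # 1 + 244 + 244 + 16592 + 244 + 2684 + 244 + 244 = 20497
>>>>> print(sum(j * m for j, m in zip(jobs_per_molecule, molecules_per_stage)))  # -> 20497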
>>>>>
>>>>> but we have:
>>>>> 20482 with exit code 0
>>>>> 1 with exit code -3
>>>>> 2 with exit code 253
>>>>>
>>>>> I forgot to enable debug logging at the workers, so I don't know
>>>>> what the STDOUT and STDERR were for these 3 jobs.  Given that Swift
>>>>> retries a job 3 times before it fails the workflow, my guess is
>>>>> that these 3 jobs were really the same job failing 3 times.  The
>>>>> failures occurred on 3 different machines, so I don't think it was
>>>>> machine-related.  Nika, can you tell from the various Swift logs
>>>>> what happened to these 3 jobs?  Is this the same issue we had on
>>>>> the last 244-molecule run?  It looks like we failed the workflow
>>>>> with 15 jobs to go.
>>>>>
>>>>> The graphs all look nice, similar to the last ones we had.  If  
>>>>> people really want to see them, I can generate them again.   
>>>>> Otherwise, look at http://tg-viz-login1.uc.teragrid.org:51000/index.htm
>>>>> to see the last 10K samples of the experiment.
>>>>>
>>>>> Nika, after you try to figure out what happened, can you simply
>>>>> retry the workflow?  Maybe it will manage to finish the last 15
>>>>> jobs.  Depending on what problem we find, we might conclude that 3
>>>>> retries is not enough and that we want a higher number as the
>>>>> default when running with Falkon.  If the error was an application
>>>>> error, then no matter how many retries we have, it won't make any
>>>>> difference.
>>>>>
>>>>> Ioan
>>>>>
>>>>
>>


