[Swift-devel] Q about MolDyn

Veronika Nefedova nefedova at mcs.anl.gov
Mon Aug 6 11:31:34 CDT 2007


Ioan, I can't answer any of your questions -- see my point number 2
below ;)

Nika

On Aug 6, 2007, at 11:25 AM, Ioan Raicu wrote:

> Hi,
>
> Veronika Nefedova wrote:
>> Ok, here is what happened with the last 244-molecule run.
>>
>> 1. First of all, the new Swift code (with loops, etc.) was used. The
>> code is dramatically smaller:
>>
>> -rw-r--r--  1 nefedova users 13342526 2007-07-05 12:01 MolDyn-244.dtm
>> -rw-r--r--  1 nefedova users    21898 2007-08-03 11:00 MolDyn-244-loops.swift
>>
>>
>> 2. I do not have the log on the Swift side (probably it was not
>> produced because I put in the hack for output reduction and log
>> output was suppressed -- that can be fixed easily).
>>
>> 3. There were 2 molecules that failed. That infamous m179 failed at
>> the last step (after 3 retries). Yuqing -- it's the same molecule you
>> said you fixed the antechamber code for. You told me to use the code
>> in your home directory /home/ydeng/antechamber-1.27; I assumed that
>> was on tg-uc. Is that correct, or is it on another host? In any case,
>> I used the code from the directory above and it didn't work. The
>> output is @tg-login1:/disks/scratchgpfs1/iraicu/ModLyn/MolDyn-244-loos-bm66sjz1li5h1/shared.
>> I could try to run this molecule again on its own in case it works
>> for you.
>>
>> 4. The second molecule that failed is m050. It's quite a mystery why
>> it failed: it finished the 4th stage (those 68 charm jobs)
>> successfully (I have the data in the shared directory on tg-uc), but
>> then the 5th stage never started! I do not see any leftover
>> directories from the 5th stage for m050 (or from any other stage for
>> m050, for that matter). So it was not a job failure but a job
>> submission failure (since no directories were even created). It
>> should have been a job called 'generator_cat' with the parameter
>> 'm050'. Ioan -- is it possible to track what happened to this job in
>> the Falkon logs?
>>
> There were only 3 jobs that failed in the Falkon logs, so I presume
> those were from (3) above. I also forgot to enable any debug logging
> (the settings were left over from some older high-throughput
> experiments), so I don't have a trace of the task descriptions or
> their STDOUT and STDERR. About the only thing I can think of is: can
> you summarize, from the Swift log, how many jobs were submitted, how
> many succeeded, and how many failed? At least then we can make sure
> the Swift log is consistent with the Falkon logs. Could it be that a
> task actually fails (say, it doesn't produce all of its output files)
> but still returns an exit code of 0 (success)? If so, would Swift
> attempt the next task that needs the missing files, which would then
> likely fail during execution because it can't find them?
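>
> A minimal sketch of the kind of summary I have in mind, assuming the
> Swift log prints one recognizable status keyword per state change (the
> keywords below are my guesses -- substitute whatever strings the log
> actually emits):
>
> import sys
> from collections import Counter
>
> # Assumed status keywords -- replace with what the Swift log really prints.
> STATUSES = ("Submitted", "Completed", "Failed")
>
> counts = Counter()
> with open(sys.argv[1]) as log:   # path to the Swift log for this run
>     for line in log:
>         for status in STATUSES:
>             if status in line:
>                 counts[status] += 1
>
> # Rough per-status totals to compare against the Falkon numbers.
> print(counts)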
>
> Now, you mention that it could have been a job submission failure...
> but wouldn't that be explicit in the Swift logs, i.e. an entry showing
> that it tried to submit and failed?
>
> Here is the list of all tasks that Falkon knows of:
> http://tg-viz-login1.uc.teragrid.org:51000/service_logs/GenericPortalWS_taskPerf.txt
>
> Can you produce a similar list of tasks (from the Swift logs), with
> the task ID (e.g. urn:0-1-10-0-1186176957479) and the status (i.e.
> submitted, success, failed, etc.)? I believe the latest provisioner
> code you had (which I hope did not get overwritten by an SVN update --
> I don't know whether it was ever checked in, and I don't remember
> whether it was changed before or after the commit to SVN) should have
> printed, at each submission to Falkon, the task ID in the form above
> and the status of the task at that point in time. Assuming this
> information is in the Swift log, you should be able to grep for these
> lines and produce a summary of all the tasks, which we can then
> cross-match with Falkon's logs. Which one is the Swift log for this
> latest run on viper? There are so many, and I can't tell which one it
> is.
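>
> And as a sketch of the cross-match itself, assuming both the Swift log
> and a saved copy of the taskPerf file contain the same urn:... IDs
> somewhere on each line (if taskPerf.txt records IDs in a different
> form, the regular expression needs adjusting):
>
> import re
> import sys
>
> def task_ids(path):
>     """Collect every urn:... style task ID found anywhere in the file."""
>     with open(path) as f:
>         return set(re.findall(r"urn:[0-9-]+", f.read()))
>
> swift_ids = task_ids(sys.argv[1])    # the Swift log from viper
> falkon_ids = task_ids(sys.argv[2])   # saved GenericPortalWS_taskPerf.txt
>
> print(len(swift_ids), "IDs in the Swift log,", len(falkon_ids), "in the Falkon log")
> print("in Swift but not in Falkon:", sorted(swift_ids - falkon_ids))
> print("in Falkon but not in Swift:", sorted(falkon_ids - swift_ids))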
>
> Ioan
>> 5. I can't restart the workflow since this bug/feature has not  
>> been fixed: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=29  
>> (as long as I use the hack for output reduction -- restarts do not  
>> work).
>>
>> Nika
>>
>> On Aug 3, 2007, at 11:03 PM, Ioan Raicu wrote:
>>
>>> Hi,
>>> Nika can probably be more specific, but the last time we ran the
>>> 244-molecule MolDyn workflow, it failed on the last few jobs, and
>>> the failures were application-specific, not in Swift or Falkon. I
>>> believe the specific issue that caused those jobs to fail has since
>>> been resolved.
>>>
>>> We have made another attempt at the 244-molecule MolDyn run, and
>>> from what I can tell, it again did not complete successfully. We
>>> were supposed to have 20497 jobs:
>>>
>>>  1 x    1 =     1
>>>  1 x  244 =   244
>>>  1 x  244 =   244
>>> 68 x  244 = 16592
>>>  1 x  244 =   244
>>> 11 x  244 =  2684
>>>  1 x  244 =   244
>>>  1 x  244 =   244
>>> -----------------
>>>     total = 20497
>>>
>>> but we have:
>>> 20482 with exit code 0
>>> 1 with exit code -3
>>> 2 with exit code 253
>>>
>>> I forgot to enable the debug output at the workers, so I don't know
>>> what the STDOUT and STDERR were for these 3 jobs. Given that Swift
>>> retries a job 3 times before it fails the workflow, my guess is that
>>> these 3 failures were really the same job failing 3 times. The
>>> failures occurred on 3 different machines, so I don't think it was
>>> machine-related. Nika, can you tell from the various Swift logs what
>>> happened to these 3 jobs? Is this the same issue we had on the last
>>> 244-molecule run? It looks like the workflow failed with 15 jobs to
>>> go.
>>>
>>> The graphs all look nice, similar to the last ones we had. If people
>>> really want to see them, I can generate them again. Otherwise, look
>>> at http://tg-viz-login1.uc.teragrid.org:51000/index.htm to see the
>>> last 10K samples of the experiment.
>>>
>>> Nika, after you try to figure out what happened, can you simply
>>> retry the workflow? Maybe it will manage to finish the last 15 jobs.
>>> Depending on what problem we find, we might conclude that 3 retries
>>> is not enough, and we might want a higher default when running with
>>> Falkon. Of course, if the error was an application error, then no
>>> number of retries will make any difference.
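>>>
>>> If we do decide to raise it, my recollection is that the retry count
>>> is set by a property in swift.properties along the lines of the
>>> snippet below -- treat the property name as an assumption and
>>> double-check it against the Swift documentation:
>>>
>>> # swift.properties (property name assumed -- verify in the Swift docs)
>>> # allow up to 5 retries per job instead of the current default
>>> execution.retries=5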
>>>
>>> Ioan
>>>
>>
