[Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules

Ioan Raicu iraicu at cs.uchicago.edu
Tue Jul 17 22:36:36 CDT 2007



Mihael Hategan wrote:
> On Tue, 2007-07-17 at 21:43 -0500, Ian Foster wrote:
>   
>> Another (perhaps dumb?) question--it would seem desirable that we be 
>> able to quickly determine what tasks failed and then (attempt to) rerun 
>> them in such circumstances.
>>
>> Here it seems that a lot of effort is required just to determine what 
>> tasks failed, and I am not sure that the information extracted is enough 
>> to rerun them.
>>     
>
> Normally, a summary of what failed with the reasons is printed on
> stderr, together with the stdout and stderr of the jobs. Perhaps it
> should also go to the log file.
>
> In this case, 2 jobs failed; the 6 failures are due to restarts, which
> is in agreement with the 2 missing molecules.
>
> When jobs fail, swift should not clean up the job directories so that
> one can do post-mortem debugging. I suggest invoking the application
> manually to see if it's a matter of a bad node or bad data.
>   
The errors happened on 3 different nodes, so I suspect that it's not bad 
nodes (as we previously experienced with the stale NFS handle). 

Nika, I sent out the actual commands that failed... can you try to run 
them manually to see what happens, and possibly determine why they 
failed?  Can you also find out what an exit code of -3 means within the 
application that failed?  (You might have to look at the app source code, 
or contact its original author.)
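
For reference, here is a minimal sketch of such a manual re-run for the first
failed task. The fe.pl command line is copied from the worker log quoted below;
the job directory is an assumption -- point it at whichever directory Swift
left behind for the failed job (per Mihael's note, failed job directories
should not be cleaned up). Also note that a raw shell exit status is an
unsigned 8-bit value (0-255), so the literal -3 is presumably reported by a
layer above the shell; running the command by hand should show the real status.

#!/bin/sh
# Hedged sketch: re-run the first failed fe.pl invocation by hand and report
# its exit status.  The fe.pl arguments are copied from the worker log; the
# job directory argument is an assumption (whatever directory Swift left
# behind for the failed job).
JOBDIR=${1:?"usage: rerun.sh FAILED_JOB_DIR"}
cd "$JOBDIR" || exit 1
/disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite --resultonly \
    --wham_outputs wf_m112 \
    --solv_lrc_file solv_chg_a10_m112_done \
    --fe_file fe_solv_m112
status=$?
echo "fe.pl exited with status $status"
exit $status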

Ioan
>   
>> It also seems that we can't easily determine which output files are missing.
>>     
>
> In the general case we wouldn't be able to, because the exact outputs
> may only be known at run-time. Granted, that kind of dynamics would
> depend on our ability to have nondeterministic files being returned,
> which we haven't gotten around to implementing. But there is a question
> of whether we should try to implement a short term solution that would
> be invalidated by our own plans.
>
>   
>> Ian.
>>
>> Ian Foster wrote:
>>     
>>> Ioan:
>>>
>>> a) I think this information should be in the bugzilla summary, 
>>> according to our processes?
>>>
>>> b) Why did it take so long to get all of the workers working?
>>>
>>> c) Can we debug using less than O(800) node hours?
>>>
>>> Ian.
>>>
>>> bugzilla-daemon at mcs.anl.gov wrote:
>>>       
>>>> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ------- Comment #24 from iraicu at cs.uchicago.edu  2007-07-17 16:08 
>>>> -------
>>>> So the latest MolDyn 244-molecule run also failed... but I think it made
>>>> it all the way to the final few jobs...
>>>>
>>>> The place where I put all the information about the run is at:
>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/ 
>>>>
>>>>
>>>> Here are the graphs:
>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/summary_graph_med.jpg 
>>>>
>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/task_graph_med.jpg 
>>>>
>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/executor_graph_med.jpg 
>>>>
>>>>
>>>> The Swift log can be found at:
>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/logs/swift/MolDyn-244-ja4ya01d6cti1.log 
>>>>
>>>>
>>>> The Falkon logs are at:
>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/logs/falkon/ 
>>>>
>>>>
>>>> The 244 mol run was supposed to have 20497 tasks, broken down as follows:
>>>>     1 x   1 =     1
>>>>     1 x 244 =   244
>>>>     1 x 244 =   244
>>>>    68 x 244 = 16592
>>>>     1 x 244 =   244
>>>>    11 x 244 =  2684
>>>>     1 x 244 =   244
>>>>     1 x 244 =   244
>>>>    ==================
>>>>    total       20497
>>>>
>>>> We had 20495 tasks that exited with an exit code of 0, and 6 tasks that
>>>> exited with an exit code of -3.  The worker logs don't show anything on
>>>> the stdout or stderr of the failed jobs.  I looked online for what an
>>>> exit code of -3 could mean, but didn't find anything.
>>>> Here are the 6 failed tasks:
>>>> Executing task urn:0-9408-1184616132483...
>>>> Building executable command...
>>>> Executing: /bin/sh shared/wrapper.sh fepl-zqtloeei fe_stdout_m112 stderr.txt
>>>>   wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
>>>>   solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
>>>>   solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
>>>>   solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
>>>>   solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
>>>>   fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>>>>   --resultonly --wham_outputs wf_m112 --solv_lrc_file solv_chg_a10_m112_done
>>>>   --fe_file fe_solv_m112
>>>> Task urn:0-9408-1184616132483 completed with exit code -3 in 238 ms
>>>>
>>>> Executing task urn:0-9408-1184616133199...
>>>> Building executable command...
>>>> Executing: /bin/sh shared/wrapper.sh fepl-2rtloeei fe_stdout_m112 stderr.txt
>>>>   wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
>>>>   solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
>>>>   solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
>>>>   solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
>>>>   solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
>>>>   fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>>>>   --resultonly --wham_outputs wf_m112 --solv_lrc_file solv_chg_a10_m112_done
>>>>   --fe_file fe_solv_m112
>>>> Task urn:0-9408-1184616133199 completed with exit code -3 in 201 ms
>>>>
>>>> Executing task urn:0-15036-1184616133342...
>>>> Building executable command...
>>>> Executing: /bin/sh shared/wrapper.sh fepl-5rtloeei fe_stdout_m179 stderr.txt
>>>>   wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
>>>>   solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
>>>>   solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
>>>>   solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
>>>>   solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
>>>>   fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>>>>   --resultonly --wham_outputs wf_m179 --solv_lrc_file solv_chg_a10_m179_done
>>>>   --fe_file fe_solv_m179
>>>> Task urn:0-15036-1184616133342 completed with exit code -3 in 267 ms
>>>>
>>>> Executing task urn:0-15036-1184616133628...
>>>> Building executable command...
>>>> Executing: /bin/sh shared/wrapper.sh fepl-9rtloeei fe_stdout_m179 stderr.txt
>>>>   wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
>>>>   solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
>>>>   solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
>>>>   solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
>>>>   solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
>>>>   fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>>>>   --resultonly --wham_outputs wf_m179 --solv_lrc_file solv_chg_a10_m179_done
>>>>   --fe_file fe_solv_m179
>>>> Task urn:0-15036-1184616133628 completed with exit code -3 in 2368 ms
>>>>
>>>> Executing task urn:0-15036-1184616133528...
>>>> Building executable command...
>>>> Executing: /bin/sh shared/wrapper.sh fepl-8rtloeei fe_stdout_m179 stderr.txt
>>>>   wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
>>>>   solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
>>>>   solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
>>>>   solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
>>>>   solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
>>>>   fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>>>>   --resultonly --wham_outputs wf_m179 --solv_lrc_file solv_chg_a10_m179_done
>>>>   --fe_file fe_solv_m179
>>>> Task urn:0-15036-1184616133528 completed with exit code -3 in 311 ms
>>>>
>>>> Executing task urn:0-9408-1184616130688...
>>>> Building executable command...
>>>> Executing: /bin/sh shared/wrapper.sh fepl-9ptloeei fe_stdout_m112 stderr.txt
>>>>   wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
>>>>   solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
>>>>   solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
>>>>   solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
>>>>   solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
>>>>   fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>>>>   --resultonly --wham_outputs wf_m112 --solv_lrc_file solv_chg_a10_m112_done
>>>>   --fe_file fe_solv_m112
>>>> Task urn:0-9408-1184616130688 completed with exit code -3 in 464 ms
>>>>
>>>>
>>>> Both the Falkon logs and the Swift logs agree on the number of submitted
>>>> tasks, the number of successful tasks, and the number of failed tasks.
>>>> There were no outstanding tasks at the time the workflow failed.  BTW, I
>>>> checked the disk space usage about an hour after the whole experiment
>>>> finished, and there was plenty of disk space left.
>>>>
>>>> Yong mentioned that he looked through the output of MolDyn, and there were
>>>> only 242 'fe_solv_*' files, so 2 molecule files were missing...  One
>>>> question for Nika: are the 6 failed tasks the same job, resubmitted?
>>>> Nika, can you add anything more to this?  Is there anything else to be
>>>> learned from the Swift log as to why those last few jobs failed?  After we
>>>> have tried to figure out what happened, can we resume the workflow and
>>>> hopefully finish the last few jobs in another run?
>>>>
>>>> Ioan
>>>>
>>>>
>>>>   
>>>>         
>> -- 
>>
>>    Ian Foster, Director, Computation Institute
>> Argonne National Laboratory & University of Chicago
>> Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
>> Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
>> Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
>>       Globus Alliance: www.globus.org.
>>
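
As an aside, a quick way to see which of the 244 'fe_solv_*' outputs actually
came back (along the lines of what Yong did by hand) might look like the
sketch below. It assumes the molecules are numbered m1 through m244 and that
the outputs were staged back to a single directory; both are assumptions, so
adjust to the actual naming and layout.

#!/bin/sh
# Hedged sketch: list which fe_solv_m<N> output files are missing.
# Assumptions: molecules are numbered 1..244 and the outputs sit in a single
# directory, passed as the first argument.
OUTDIR=${1:?"usage: missing.sh OUTPUT_DIR"}
i=1
while [ "$i" -le 244 ]; do
    [ -e "$OUTDIR/fe_solv_m$i" ] || echo "missing: fe_solv_m$i"
    i=$((i + 1))
done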

-- 
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dsl.cs.uchicago.edu/
============================================
============================================
