[Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
Ioan Raicu
iraicu at cs.uchicago.edu
Tue Jul 17 22:33:02 CDT 2007
Ian Foster wrote:
> Another (perhaps dumb?) question--it would seem desirable that we be
> able to quickly determine what tasks failed and then (attempt to)
> rerun them in such circumstances.
I think Swift already does this, retrying each failed task up to a
fixed number of times (3 or 5, if I remember right).
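If I remember right, the retry count is configurable in
swift.properties; a minimal sketch (treat the property name here as an
assumption and double-check it against the Swift docs):

  # swift.properties: retry each failed task up to 3 times
  # (property name recalled from memory, not verified)
  execution.retries=3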
>
> Here it seems that a lot of effort is required just to determine what
> tasks failed, and I am not sure that the information extracted is
> enough to rerun them.
The failed tasks are pretty easy to find in the logs based on the exit
code. If we were to do a resume from Swift, I think it would
automatically resubmit just the failed tasks... but unless we figure out
why they failed and fix the problem, they will likely fail again.
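For example, something like this pulls the failures straight out of the
worker logs (a sketch -- the 'completed with exit code' wording matches
the excerpts below, but the worker log file names are my assumption):

  # list every task that finished with a non-zero exit code
  grep 'completed with exit code' worker*.log | grep -v 'exit code 0 '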
>
> It also seems that we can't easily determine which output files are
> missing.
I don't know about this one; maybe Nika can comment on it.
Ioan
>
> Ian.
>
> Ian Foster wrote:
>> Ioan:
>>
>> a) I think this information should be in the bugzilla summary,
>> according to our processes?
>>
>> b) Why did it take so long to get all of the workers working?
>>
>> c) Can we debug using less than O(800) node hours?
>>
>> Ian.
>>
>> bugzilla-daemon at mcs.anl.gov wrote:
>>> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72
>>>
>>> ------- Comment #24 from iraicu at cs.uchicago.edu 2007-07-17 16:08 -------
>>> So the latest MolDyn 244-molecule run also failed... but I think it
>>> made it all the way to the final few jobs.
>>>
>>> I put all the information about the run at:
>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/
>>>
>>> Here are the graphs:
>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/summary_graph_med.jpg
>>>
>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/task_graph_med.jpg
>>>
>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/executor_graph_med.jpg
>>>
>>> The Swift log can be found at:
>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/logs/swift/MolDyn-244-ja4ya01d6cti1.log
>>>
>>> The Falkon logs are at:
>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/logs/falkon/
>>>
>>> The 244-molecule run was supposed to have 20497 tasks, broken down
>>> per stage as follows:
>>>
>>>      1 x   1 =     1
>>>      1 x 244 =   244
>>>      1 x 244 =   244
>>>     68 x 244 = 16592
>>>      1 x 244 =   244
>>>     11 x 244 =  2684
>>>      1 x 244 =   244
>>>      1 x 244 =   244
>>>    ======================
>>>    total:        20497
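>>>
>>> (A quick sanity check on the total -- just the per-stage products
>>> from the table above, summed with bc:)
>>>
>>>   echo '1*1 + 1*244 + 1*244 + 68*244 + 1*244 + 11*244 + 1*244 + 1*244' | bc
>>>   # prints 20497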
>>>
>>> We had 20495 tasks exit with code 0, and 6 tasks exit with code -3.
>>> The worker logs show nothing on the stdout or stderr of the failed
>>> jobs. I looked online for what an exit code of -3 could mean, but
>>> didn't find anything. (Since POSIX exit statuses are unsigned 8-bit
>>> values, a negative code presumably comes from the Falkon/Java layer
>>> rather than from the application itself.) Here are the 6 failed
>>> tasks:
>>> Executing task urn:0-9408-1184616132483... Building executable
>>> command...Executing: /bin/sh shared/wrapper.sh fepl-zqtloeei
>>> fe_stdout_m112
>>> stderr.txt wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
>>> solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
>>> solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
>>> solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
>>> solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
>>> fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>>> --resultonly --wham_outputs wf_m112 --solv_lrc_file
>>> solv_chg_a10_m112_done
>>> --fe_file fe_solv_m112 Task urn:0-9408-1184616132483 completed with
>>> exit code -3 in 238 ms
>>>
>>> Executing task urn:0-9408-1184616133199... Building executable
>>> command...Executing: /bin/sh shared/wrapper.sh fepl-2rtloeei
>>> fe_stdout_m112
>>> stderr.txt wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
>>> solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
>>> solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
>>> solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
>>> solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
>>> fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>>> --resultonly --wham_outputs wf_m112 --solv_lrc_file
>>> solv_chg_a10_m112_done
>>> --fe_file fe_solv_m112 Task urn:0-9408-1184616133199 completed with
>>> exit code -3 in 201 ms
>>>
>>> Executing task urn:0-15036-1184616133342... Building executable
>>> command...Executing: /bin/sh shared/wrapper.sh fepl-5rtloeei
>>> fe_stdout_m179
>>> stderr.txt wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
>>> solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
>>> solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
>>> solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
>>> solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
>>> fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>>> --resultonly --wham_outputs wf_m179 --solv_lrc_file
>>> solv_chg_a10_m179_done
>>> --fe_file fe_solv_m179 Task urn:0-15036-1184616133342 completed with
>>> exit code -3 in 267 ms
>>>
>>> Executing task urn:0-15036-1184616133628... Building executable
>>> command...Executing: /bin/sh shared/wrapper.sh fepl-9rtloeei
>>> fe_stdout_m179
>>> stderr.txt wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
>>> solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
>>> solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
>>> solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
>>> solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
>>> fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>>> --resultonly --wham_outputs wf_m179 --solv_lrc_file
>>> solv_chg_a10_m179_done
>>> --fe_file fe_solv_m179 Task urn:0-15036-1184616133628 completed with
>>> exit code -3 in 2368 ms
>>>
>>> Executing task urn:0-15036-1184616133528... Building executable
>>> command...Executing: /bin/sh shared/wrapper.sh fepl-8rtloeei
>>> fe_stdout_m179
>>> stderr.txt wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
>>> solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
>>> solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
>>> solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
>>> solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
>>> fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>>> --resultonly --wham_outputs wf_m179 --solv_lrc_file
>>> solv_chg_a10_m179_done
>>> --fe_file fe_solv_m179 Task urn:0-15036-1184616133528 completed with
>>> exit code -3 in 311 ms
>>>
>>> Executing task urn:0-9408-1184616130688... Building executable
>>> command...Executing: /bin/sh shared/wrapper.sh fepl-9ptloeei
>>> fe_stdout_m112
>>> stderr.txt wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
>>> solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
>>> solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
>>> solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
>>> solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
>>> fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>>> --resultonly --wham_outputs wf_m112 --solv_lrc_file
>>> solv_chg_a10_m112_done
>>> --fe_file fe_solv_m112 Task urn:0-9408-1184616130688 completed with
>>> exit code -3 in 464 ms
>>>
>>> Both the Falkon logs and the Swift logs agree on the number of
>>> submitted, successful, and failed tasks. There were no outstanding
>>> tasks at the time the workflow failed. BTW, I checked the disk space
>>> usage about an hour after the whole experiment finished, and there
>>> was plenty of disk space left.
>>>
>>> Yong mentioned that he looked through the output of MolDyn and found
>>> only 242 'fe_solv_*' files, so 2 molecules' outputs were missing. One
>>> question for Nika: are the 6 failed tasks the same job, resubmitted?
>>> Nika, can you add anything more to this? Is there anything else to be
>>> learned from the Swift log as to why those last few jobs failed? Once
>>> we have figured out what happened, can we resume the workflow and
>>> hopefully finish the last few jobs in another run?
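>>>
>>> To pin down which 2 molecules are missing, something like this should
>>> work (a sketch -- it assumes the outputs are named fe_solv_m1 ...
>>> fe_solv_m244, matching the fe_solv_m112/fe_solv_m179 names in the
>>> task logs above, and that it is run in the output directory):
>>>
>>>   # print the molecule indices that have no fe_solv output file
>>>   for i in $(seq 1 244); do
>>>     [ -e "fe_solv_m$i" ] || echo "missing: fe_solv_m$i"
>>>   done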
>>>
>>> Ioan
>>>
>>
>
--
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dsl.cs.uchicago.edu/
============================================