[Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules

Ian Foster foster at mcs.anl.gov
Tue Jul 17 22:37:29 CDT 2007


Sorry, I was unclear. What I meant was: in the event that Swift decides 
that things have "failed" (definitively), it would be good to have 
something like a DAGMan "rescue DAG" that would show exactly what needed 
to be done to resubmit a task manually.
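
For concreteness, here is a minimal sketch of what such a "rescue list" 
generator could look like. It is only an illustration: it assumes the 
worker-log record format quoted further down in this thread ("Executing: 
/bin/sh shared/wrapper.sh ... Task urn:... completed with exit code N in 
M ms") and simply re-emits the command line for every task whose exit 
code was non-zero:

    import re
    import sys

    # Matches the worker-log records quoted later in this thread, e.g.:
    #   Executing: /bin/sh shared/wrapper.sh ... Task urn:0-9408-...
    #   completed with exit code -3 in 238 ms
    RECORD = re.compile(
        r"Executing: (?P<cmd>.+?) "
        r"Task (?P<urn>urn:\S+) completed with exit code (?P<code>-?\d+)",
        re.DOTALL)

    def rescue_commands(log_path):
        """Yield (urn, command) for every task whose exit code was not 0."""
        text = open(log_path).read()
        for m in RECORD.finditer(text):
            if int(m.group("code")) != 0:
                # Collapse any line-wrapping inside the command.
                yield m.group("urn"), " ".join(m.group("cmd").split())

    if __name__ == "__main__":
        for urn, cmd in rescue_commands(sys.argv[1]):
            print("# failed task " + urn)
            print(cmd)

Feeding it a worker log would print, for each failed task, the exact 
wrapper.sh command line to rerun by hand -- morally the same thing a 
DAGMan rescue DAG gives you.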

Your comment that "If we were to do a resume from Swift, I think it 
would automatically resubmit just the failed tasks" suggests that (in 
effect) we already have this.
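
(If the restart log from this run is still around, that presumably just 
means pointing Swift at it -- something along the lines of

    swift -resume <run-id>.0.rlog MolDyn-244.swift

-- though I am going from memory on the -resume option and the .rlog 
name, and the script name here is made up, so someone should confirm the 
exact invocation.)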

Ian.

Ioan Raicu wrote:
>
>
> Ian Foster wrote:
>> Another (perhaps dumb?) question--it would seem desirable that we be 
>> able to quickly determine what tasks failed and then (attempt to) 
>> rerun them in such circumstances.
> I think Swift already does this up to a fixed # of times (I think it 
> is 3 or 5).
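> If we need to bump that, I believe the knob is the execution.retries 
> property in swift.properties (the exact name and default are worth 
> double-checking), e.g.:
>
>     # retry a failed job this many times before giving up on it
>     execution.retries=3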
>>
>> Here it seems that a lot of effort is required just to determine what 
>> tasks failed, and I am not sure that the information extracted is 
>> enough to rerun them.
> The failed tasks are pretty easy to find in the logs based on the exit 
> code.  If we were to do a resume from Swift, I think it would 
> automatically resubmit just the failed tasks... but unless we figure 
> out why they failed and fix the problem, they will likely fail again.
>>
>> It also seems that we can't easily determine which output files are 
>> missing.
> I don't know about this one; maybe Nika can comment on it.
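>
> A quick way to check, though, would be to diff the molecule indices 
> against the fe_solv_* files that did get produced. A minimal sketch -- 
> the run directory and the fe_solv_m<index>, m1..m244 naming are my 
> assumptions based on the task excerpts below:
>
>     import glob
>     import re
>     import sys
>
>     # Directory holding the run's shared output files (adjust to the
>     # real run directory).
>     outdir = sys.argv[1] if len(sys.argv) > 1 else "."
>
>     # Molecule indices for which an fe_solv_m<index> file exists.
>     have = set()
>     for path in glob.glob(outdir + "/fe_solv_m*"):
>         m = re.search(r"fe_solv_m(\d+)$", path)
>         if m:
>             have.add(int(m.group(1)))
>
>     # One fe_solv file expected per molecule, assumed numbered m1..m244.
>     missing = sorted(set(range(1, 245)) - have)
>     print("missing fe_solv outputs: %s" % missing)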
>
> Ioan
>>
>> Ian.
>>
>> Ian Foster wrote:
>>> Ioan:
>>>
>>> a) I think this information should be in the bugzilla summary, 
>>> according to our processes?
>>>
>>> b) Why did it take so long to get all of the workers working?
>>>
>>> c) Can we debug using less than O(800) node hours?
>>>
>>> Ian.
>>>
>>> bugzilla-daemon at mcs.anl.gov wrote:
>>>> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ------- Comment #24 from iraicu at cs.uchicago.edu  2007-07-17 16:08 
>>>> -------
>>>> So the latest MolDyn 244-mol run also failed... but I think it 
>>>> made it all the way to the final few jobs...
>>>>
>>>> The place where I put all the information about the run is at:
>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/ 
>>>>
>>>>
>>>> Here are the graphs:
>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/summary_graph_med.jpg 
>>>>
>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/task_graph_med.jpg 
>>>>
>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/executor_graph_med.jpg 
>>>>
>>>>
>>>> The Swift log can be found at:
>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/logs/swift/MolDyn-244-ja4ya01d6cti1.log 
>>>>
>>>>
>>>> The Falkon logs are at:
>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/logs/falkon/ 
>>>>
>>>>
>>>> The 244-mol run was supposed to have 20497 tasks, broken down as 
>>>> follows:
>>>>    1 x   1 =     1
>>>>    1 x 244 =   244
>>>>    1 x 244 =   244
>>>>   68 x 244 = 16592
>>>>    1 x 244 =   244
>>>>   11 x 244 =  2684
>>>>    1 x 244 =   244
>>>>    1 x 244 =   244
>>>> ======================
>>>>      total  = 20497
>>>>
>>>> We had 20495 tasks that exited with an exit code of 0, and 6 tasks 
>>>> that exited with an exit code of -3.  The worker logs don't show 
>>>> anything on the stdout or stderr of the failed jobs.  I looked online 
>>>> for what an exit code of -3 could mean, but didn't find anything. 
>>>> Here are the 6 failed tasks:
>>>> Executing task urn:0-9408-1184616132483... Building executable
>>>> command...Executing: /bin/sh shared/wrapper.sh fepl-zqtloeei 
>>>> fe_stdout_m112
>>>> stderr.txt   wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
>>>> solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
>>>> solv_chg_m112.out solv_repu_0.6_0.7_m112.out 
>>>> solv_repu_0.5_0.6_m112.out
>>>> solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
>>>> solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
>>>> fe_stdout_m112  /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>>>> --resultonly --wham_outputs wf_m112 --solv_lrc_file 
>>>> solv_chg_a10_m112_done
>>>> --fe_file fe_solv_m112 Task urn:0-9408-1184616132483 completed with 
>>>> exit code -3 in 238 ms
>>>>
>>>> Executing task urn:0-9408-1184616133199... Building executable
>>>> command...Executing: /bin/sh shared/wrapper.sh fepl-2rtloeei 
>>>> fe_stdout_m112
>>>> stderr.txt   wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
>>>> solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
>>>> solv_chg_m112.out solv_repu_0.6_0.7_m112.out 
>>>> solv_repu_0.5_0.6_m112.out
>>>> solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
>>>> solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
>>>> fe_stdout_m112  /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>>>> --resultonly --wham_outputs wf_m112 --solv_lrc_file 
>>>> solv_chg_a10_m112_done
>>>> --fe_file fe_solv_m112 Task urn:0-9408-1184616133199 completed with 
>>>> exit code -3 in 201 ms
>>>>
>>>> Executing task urn:0-15036-1184616133342... Building executable
>>>> command...Executing: /bin/sh shared/wrapper.sh fepl-5rtloeei 
>>>> fe_stdout_m179
>>>> stderr.txt   wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
>>>> solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
>>>> solv_chg_m179.out solv_repu_0.6_0.7_m179.out 
>>>> solv_repu_0.5_0.6_m179.out
>>>> solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
>>>> solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
>>>> fe_stdout_m179  /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>>>> --resultonly --wham_outputs wf_m179 --solv_lrc_file 
>>>> solv_chg_a10_m179_done
>>>> --fe_file fe_solv_m179 Task urn:0-15036-1184616133342 completed 
>>>> with exit code -3 in 267 ms
>>>>
>>>> Executing task urn:0-15036-1184616133628... Building executable
>>>> command...Executing: /bin/sh shared/wrapper.sh fepl-9rtloeei 
>>>> fe_stdout_m179
>>>> stderr.txt   wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
>>>> solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
>>>> solv_chg_m179.out solv_repu_0.6_0.7_m179.out 
>>>> solv_repu_0.5_0.6_m179.out
>>>> solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
>>>> solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
>>>> fe_stdout_m179  /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>>>> --resultonly --wham_outputs wf_m179 --solv_lrc_file 
>>>> solv_chg_a10_m179_done
>>>> --fe_file fe_solv_m179 Task urn:0-15036-1184616133628 completed 
>>>> with exit code -3 in 2368 ms
>>>>
>>>> Executing task urn:0-15036-1184616133528... Building executable
>>>> command...Executing: /bin/sh shared/wrapper.sh fepl-8rtloeei 
>>>> fe_stdout_m179
>>>> stderr.txt   wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
>>>> solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
>>>> solv_chg_m179.out solv_repu_0.6_0.7_m179.out 
>>>> solv_repu_0.5_0.6_m179.out
>>>> solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
>>>> solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
>>>> fe_stdout_m179  /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>>>> --resultonly --wham_outputs wf_m179 --solv_lrc_file 
>>>> solv_chg_a10_m179_done
>>>> --fe_file fe_solv_m179 Task urn:0-15036-1184616133528 completed 
>>>> with exit code -3 in 311 ms
>>>>
>>>> Executing task urn:0-9408-1184616130688... Building executable
>>>> command...Executing: /bin/sh shared/wrapper.sh fepl-9ptloeei 
>>>> fe_stdout_m112
>>>> stderr.txt   wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
>>>> solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
>>>> solv_chg_m112.out solv_repu_0.6_0.7_m112.out 
>>>> solv_repu_0.5_0.6_m112.out
>>>> solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
>>>> solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
>>>> fe_stdout_m112  /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>>>> --resultonly --wham_outputs wf_m112 --solv_lrc_file 
>>>> solv_chg_a10_m112_done
>>>> --fe_file fe_solv_m112 Task urn:0-9408-1184616130688 completed with 
>>>> exit code -3 in 464 ms
>>>>
>>>>
>>>> Both the Falkon logs and the Swift logs agree on the number of 
>>>> submitted tasks,
>>>> number of successful tasks, and number of failed tasks.  There were no
>>>> outstanding tasks at the time when the workflow failed.  BTW, I 
>>>> checked the disk space usage about an hour after the whole experiment 
>>>> finished, and there was plenty of disk space left.
>>>>
>>>> Yong mentioned that he looked through the output of MolDyn, and 
>>>> there were only 242 'fe_solv_*' files, so 2 molecules' files were 
>>>> missing...  One question for Nika: are the 6 failed tasks the same 
>>>> jobs, resubmitted?  Nika, can you add anything more to this?  Is 
>>>> there anything else to be learned from the Swift log as to why those 
>>>> last few jobs failed?  After we have tried to figure out what 
>>>> happened, can we resume the workflow and hopefully finish the last 
>>>> few jobs in another run?
>>>>
>>>> Ioan
>>>>
>>>>
>>>>   
>>>
>>
>

-- 

   Ian Foster, Director, Computation Institute
Argonne National Laboratory & University of Chicago
Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
      Globus Alliance: www.globus.org.



