[Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
Ian Foster
foster at mcs.anl.gov
Tue Jul 17 21:43:52 CDT 2007
Another (perhaps dumb?) question: it would seem desirable that we be
able to quickly determine which tasks failed and then (attempt to)
rerun them in such circumstances.
Here it seems that a lot of effort is required just to determine which
tasks failed, and I am not sure that the information extracted is
enough to rerun them.
It also seems that we can't easily determine which output files are missing.
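For example, a minimal sketch along these lines (the output directory
layout and the fe_solv_m1 .. fe_solv_m244 naming are assumptions taken
from the listing quoted below) would report which per-molecule outputs
never appeared:

import os
import sys

# Minimal sketch: report which expected per-molecule outputs are missing.
# Assumes outputs named fe_solv_m1 .. fe_solv_m244 in a single directory;
# adjust the prefix, numbering, and path to match the actual run.
def missing_outputs(outdir, n_molecules=244, prefix="fe_solv_m"):
    expected = {prefix + str(i) for i in range(1, n_molecules + 1)}
    present = set(os.listdir(outdir))
    return sorted(expected - present, key=lambda n: int(n[len(prefix):]))

if __name__ == "__main__":
    outdir = sys.argv[1] if len(sys.argv) > 1 else "."
    for name in missing_outputs(outdir):
        print("missing:", name)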
Ian.
Ian Foster wrote:
> Ioan:
>
> a) I think this information should be in the bugzilla summary,
> according to our processes?
>
> b) Why did it take so long to get all of the workers working?
>
> c) Can we debug using less than O(800) node hours?
>
> Ian.
>
> bugzilla-daemon at mcs.anl.gov wrote:
>> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72
>>
>> ------- Comment #24 from iraicu at cs.uchicago.edu 2007-07-17 16:08 -------
>> So the latest MolDyn 244-molecule run also failed... but I think it
>> made it all the way to the final few jobs...
>>
>> The place where I put all the information about the run is at:
>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/
>>
>>
>> Here are the graphs:
>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/summary_graph_med.jpg
>>
>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/task_graph_med.jpg
>>
>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/executor_graph_med.jpg
>>
>>
>> The Swift log can be found at:
>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/logs/swift/MolDyn-244-ja4ya01d6cti1.log
>>
>>
>> The Falkon logs are at:
>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/logs/falkon/
>>
>>
>> The 244 mol run was supposed to have 20497 tasks, broken down as
>> follows (count x tasks each = total):
>>
>>      1 x   1 =     1
>>      1 x 244 =   244
>>      1 x 244 =   244
>>     68 x 244 = 16592
>>      1 x 244 =   244
>>     11 x 244 =  2684
>>      1 x 244 =   244
>>      1 x 244 =   244
>>   ======================
>>   total:        20497
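For reference, the breakdown above works out to 84 tasks per molecule
plus one initial task: 1 + 244 * (1 + 1 + 68 + 1 + 11 + 1 + 1) =
1 + 244 * 84 = 20497. A minimal sketch of that check (treating the
single-task row as a one-off setup step is an assumption):

# Sketch: expected task count for an N-molecule run, using the
# per-row multiplicities from the breakdown above.
def expected_tasks(n_molecules):
    per_molecule = 1 + 1 + 68 + 1 + 11 + 1 + 1   # = 84 tasks per molecule
    return 1 + n_molecules * per_molecule        # plus 1 initial task

assert expected_tasks(244) == 20497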
>>
>> We had 20495 tasks that exited with an exit code of 0, and 6 tasks
>> that exited with an exit code of -3. The worker logs show nothing on
>> the stdout or stderr of the failed jobs. I looked online for what an
>> exit code of -3 could mean, but didn't find anything.
>> Here are the 6 failed tasks:
>> Executing task urn:0-9408-1184616132483... Building executable
>> command...Executing: /bin/sh shared/wrapper.sh fepl-zqtloeei
>> fe_stdout_m112
>> stderr.txt wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
>> solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
>> solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
>> solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
>> solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
>> fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>> --resultonly --wham_outputs wf_m112 --solv_lrc_file
>> solv_chg_a10_m112_done
>> --fe_file fe_solv_m112 Task urn:0-9408-1184616132483 completed with
>> exit code -3 in 238 ms
>>
>> Executing task urn:0-9408-1184616133199... Building executable
>> command...Executing: /bin/sh shared/wrapper.sh fepl-2rtloeei
>> fe_stdout_m112
>> stderr.txt wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
>> solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
>> solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
>> solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
>> solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
>> fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>> --resultonly --wham_outputs wf_m112 --solv_lrc_file
>> solv_chg_a10_m112_done
>> --fe_file fe_solv_m112 Task urn:0-9408-1184616133199 completed with
>> exit code -3 in 201 ms
>>
>> Executing task urn:0-15036-1184616133342... Building executable
>> command...Executing: /bin/sh shared/wrapper.sh fepl-5rtloeei
>> fe_stdout_m179
>> stderr.txt wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
>> solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
>> solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
>> solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
>> solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
>> fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>> --resultonly --wham_outputs wf_m179 --solv_lrc_file
>> solv_chg_a10_m179_done
>> --fe_file fe_solv_m179 Task urn:0-15036-1184616133342 completed with
>> exit code -3 in 267 ms
>>
>> Executing task urn:0-15036-1184616133628... Building executable
>> command...Executing: /bin/sh shared/wrapper.sh fepl-9rtloeei
>> fe_stdout_m179
>> stderr.txt wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
>> solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
>> solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
>> solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
>> solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
>> fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>> --resultonly --wham_outputs wf_m179 --solv_lrc_file
>> solv_chg_a10_m179_done
>> --fe_file fe_solv_m179 Task urn:0-15036-1184616133628 completed with
>> exit code -3 in 2368 ms
>>
>> Executing task urn:0-15036-1184616133528... Building executable
>> command...Executing: /bin/sh shared/wrapper.sh fepl-8rtloeei
>> fe_stdout_m179
>> stderr.txt wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
>> solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
>> solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
>> solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
>> solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
>> fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>> --resultonly --wham_outputs wf_m179 --solv_lrc_file
>> solv_chg_a10_m179_done
>> --fe_file fe_solv_m179 Task urn:0-15036-1184616133528 completed with
>> exit code -3 in 311 ms
>>
>> Executing task urn:0-9408-1184616130688... Building executable
>> command...Executing: /bin/sh shared/wrapper.sh fepl-9ptloeei
>> fe_stdout_m112
>> stderr.txt wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
>> solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
>> solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
>> solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
>> solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
>> fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>> --resultonly --wham_outputs wf_m112 --solv_lrc_file
>> solv_chg_a10_m112_done
>> --fe_file fe_solv_m112 Task urn:0-9408-1184616130688 completed with
>> exit code -3 in 464 ms
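The six excerpts above all follow the same "Task <urn> completed with
exit code <N> in <T> ms" pattern, so every non-zero exit can be pulled
out in a single pass over the worker logs. A minimal sketch (the log
location and the exact line format are assumptions based on the
excerpts above):

import re
import sys
from collections import Counter, defaultdict

# Sketch: summarize task exit codes from Falkon worker logs.
# Assumes lines of the form "Task <urn> completed with exit code <N> in <T> ms",
# as in the excerpts above; pass one or more log files on the command line.
LINE_RE = re.compile(r"Task (\S+) completed with exit code (-?\d+) in (\d+) ms")

def summarize(paths):
    counts = Counter()
    failures = defaultdict(list)  # exit code -> list of task URNs
    for path in paths:
        with open(path, errors="replace") as f:
            for line in f:
                m = LINE_RE.search(line)
                if not m:
                    continue
                urn, code = m.group(1), int(m.group(2))
                counts[code] += 1
                if code != 0:
                    failures[code].append(urn)
    return counts, failures

if __name__ == "__main__":
    counts, failures = summarize(sys.argv[1:])
    for code, n in sorted(counts.items()):
        print(f"exit code {code}: {n} tasks")
    for code, urns in failures.items():
        for urn in urns:
            print(f"  failed ({code}): {urn}")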
>>
>>
>> Both the Falkon logs and the Swift logs agree on the number of
>> submitted tasks, the number of successful tasks, and the number of
>> failed tasks. There were no outstanding tasks at the time the
>> workflow failed. BTW, I checked the disk space usage about an hour
>> after the whole experiment finished, and there was plenty of disk
>> space left.
>>
>> Yong mentioned that he looked through the output of MolDyn and found
>> only 242 'fe_solv_*' files, so 2 molecule files were missing... One
>> question for Nika: are the 6 failed tasks the same jobs, resubmitted?
>> Nika, can you add anything more to this? Is there anything else to be
>> learned from the Swift log about why those last few jobs failed?
>> After we have tried to figure out what happened, can we resume the
>> workflow and hopefully finish the last few jobs in another run?
>>
>> Ioan
>>
>>
>>
>
--
Ian Foster, Director, Computation Institute
Argonne National Laboratory & University of Chicago
Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
Tel: +1 630 252 4619. Web: www.ci.uchicago.edu.
Globus Alliance: www.globus.org.