[Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
Ioan Raicu
iraicu at cs.uchicago.edu
Tue Jul 17 22:30:34 CDT 2007
Hi,
See below:
Ian Foster wrote:
> Ioan:
>
> a) I think this information should be in the bugzilla summary,
> according to our processes?
>
I posted all this to bugzilla, didn't I?
> b) Why did it take so long to get all of the workers working?
I finally had enough confidence in the dynamic resource provisioning
that we wouldn't lose any jobs across resource allocation boundaries (I
ran lots of tests and they all passed), so I enabled it for this run. I
set the max to be the entire ANL site (274 processors). We got 146
processors at the beginning, and over time the pool kept growing up to a
peak of around 208; the rest, up to 274, stayed queued in the PBS wait
queue. The difference between the 146 at the start and the 208 at the
end was that other users who were in the system at the beginning
finished their work and released nodes, so our queued processors moved
from the wait queue into the run queue. I would actually be curious to
try the latest DRP code on a busy site, such as Purdue or NCSA, and see
whether we can maintain a decent pool size over a period of time despite
the site being busy...
BTW, in the previous MolDyn runs we normally set the min and max to,
say, 100 or 200 processors, and we would wait until we had all of them
before we started; sometimes that meant waiting 12~24 hours for enough
nodes to become free so the large job could start. With DRP, you start
with whatever the site has available and get more over time as your jobs
make it through the wait queue and other running jobs complete...
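The idea behind the provisioner is roughly the loop sketched below. This
is just a minimal sketch to show the pool-growth behavior, not the actual
Falkon DRP code; the numbers and the processors_freed_since_last_check()
helper are stand-ins.

import random
import time

SITE_MAX = 274        # cap the pool at the whole ANL site
allocated = 146       # what the site could hand us right away in this run

def processors_freed_since_last_check():
    """Stand-in for asking PBS how many nodes other users just released."""
    return random.randint(0, 5)

# Keep topping the pool up toward SITE_MAX as other users' jobs finish;
# anything we ask for beyond what is free just waits in the PBS queue.
for minute in range(10):                 # a short simulated run
    grant = min(SITE_MAX - allocated, processors_freed_since_last_check())
    allocated += grant
    print("minute %2d: pool size = %d" % (minute, allocated))
    time.sleep(0)                        # no real waiting in the sketch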
>
> c) Can we debug using less than O(800) node hours?
The real MolDyn run for 244 molecules takes on the order of O(20K) node
hours, so O(0.8K) is still an improvement. Remember that we can run the
smaller workflows fine; it's the bigger ones that are giving us a hard
time. Nika, if you have any suggestions on how we can further reduce the
runtime of each job while still exercising the same number of jobs and
the same number of input/output files, let us know.
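For example, one option (purely hypothetical, nothing like this exists in
the workflow yet) would be to swap each app for a tiny stand-in that
sleeps a few seconds and touches the same output files the real code
would produce, so Swift and Falkon still see the same number of jobs and
files:

#!/usr/bin/env python
# Hypothetical stand-in for a real MolDyn app: burns a few seconds and
# creates the requested output files so the workflow's I/O pattern is kept.
import sys
import time

def main():
    # usage: mock_job.py <seconds> <output-file> [<output-file> ...]
    seconds = float(sys.argv[1])
    time.sleep(seconds)                   # simulate a (short) compute phase
    for path in sys.argv[2:]:
        out = open(path, "w")
        out.write("placeholder output\n") # same file count, tiny contents
        out.close()

if __name__ == "__main__":
    main()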
Ioan
>
> Ian.
>
> bugzilla-daemon at mcs.anl.gov wrote:
>> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72
>>
>> ------- Comment #24 from iraicu at cs.uchicago.edu 2007-07-17 16:08 -------
>> So the latest MolDyn 244-molecule run also failed... but I think it
>> made it all the way to the final few jobs...
>>
>> The place where I put all the information about the run is at:
>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/
>>
>>
>> Here are the graphs:
>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/summary_graph_med.jpg
>>
>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/task_graph_med.jpg
>>
>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/executor_graph_med.jpg
>>
>>
>> The Swift log can be found at:
>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/logs/swift/MolDyn-244-ja4ya01d6cti1.log
>>
>>
>> The Falkon logs are at:
>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/logs/falkon/
>>
>>
>> The 244 mol run was supposed to have 20497 tasks, broken down as
>> follows (per-stage multiplier x molecules = tasks):
>>    1 x   1 =     1
>>    1 x 244 =   244
>>    1 x 244 =   244
>>   68 x 244 = 16592
>>    1 x 244 =   244
>>   11 x 244 =  2684
>>    1 x 244 =   244
>>    1 x 244 =   244
>> ======================
>>          total: 20497
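(As a quick sanity check on that total, summing multiplier times
molecules per stage:)

# Per-stage (multiplier, molecules) pairs from the breakdown above.
stages = [(1, 1), (1, 244), (1, 244), (68, 244),
          (1, 244), (11, 244), (1, 244), (1, 244)]
print(sum(count * mols for count, mols in stages))   # 20497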
>>
>> We had 20495 tasks that exited with an exit code of 0, and 6 tasks
>> that exited with an exit code of -3. The worker logs don't show
>> anything on the stdout or stderr of the failed jobs. I looked online
>> for what an exit code of -3 could mean, but didn't find anything.
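One convention worth keeping in mind, though I don't know whether Falkon
follows it: some process APIs report a negative exit code when the child
is killed by a signal, so -3 would mean signal 3 (SIGQUIT). Python's
subprocess module behaves that way, for instance:

import subprocess

# Launch a child that immediately sends itself SIGQUIT (signal 3).
rc = subprocess.call(["python", "-c",
                      "import os, signal; os.kill(os.getpid(), signal.SIGQUIT)"])
print(rc)   # prints -3 on Linux: the child was killed by signal 3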
>> Here are the 6 failed tasks:
>> Executing task urn:0-9408-1184616132483... Building executable
>> command...Executing: /bin/sh shared/wrapper.sh fepl-zqtloeei
>> fe_stdout_m112
>> stderr.txt wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
>> solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
>> solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
>> solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
>> solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
>> fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>> --resultonly --wham_outputs wf_m112 --solv_lrc_file
>> solv_chg_a10_m112_done
>> --fe_file fe_solv_m112 Task urn:0-9408-1184616132483 completed with
>> exit code -3 in 238 ms
>>
>> Executing task urn:0-9408-1184616133199... Building executable
>> command...Executing: /bin/sh shared/wrapper.sh fepl-2rtloeei
>> fe_stdout_m112
>> stderr.txt wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
>> solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
>> solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
>> solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
>> solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
>> fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>> --resultonly --wham_outputs wf_m112 --solv_lrc_file
>> solv_chg_a10_m112_done
>> --fe_file fe_solv_m112 Task urn:0-9408-1184616133199 completed with
>> exit code -3 in 201 ms
>>
>> Executing task urn:0-15036-1184616133342... Building executable
>> command...Executing: /bin/sh shared/wrapper.sh fepl-5rtloeei
>> fe_stdout_m179
>> stderr.txt wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
>> solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
>> solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
>> solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
>> solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
>> fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>> --resultonly --wham_outputs wf_m179 --solv_lrc_file
>> solv_chg_a10_m179_done
>> --fe_file fe_solv_m179 Task urn:0-15036-1184616133342 completed with
>> exit code -3 in 267 ms
>>
>> Executing task urn:0-15036-1184616133628... Building executable
>> command...Executing: /bin/sh shared/wrapper.sh fepl-9rtloeei
>> fe_stdout_m179
>> stderr.txt wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
>> solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
>> solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
>> solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
>> solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
>> fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>> --resultonly --wham_outputs wf_m179 --solv_lrc_file
>> solv_chg_a10_m179_done
>> --fe_file fe_solv_m179 Task urn:0-15036-1184616133628 completed with
>> exit code -3 in 2368 ms
>>
>> Executing task urn:0-15036-1184616133528... Building executable
>> command...Executing: /bin/sh shared/wrapper.sh fepl-8rtloeei
>> fe_stdout_m179
>> stderr.txt wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
>> solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
>> solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
>> solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
>> solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
>> fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>> --resultonly --wham_outputs wf_m179 --solv_lrc_file
>> solv_chg_a10_m179_done
>> --fe_file fe_solv_m179 Task urn:0-15036-1184616133528 completed with
>> exit code -3 in 311 ms
>>
>> Executing task urn:0-9408-1184616130688... Building executable
>> command...Executing: /bin/sh shared/wrapper.sh fepl-9ptloeei
>> fe_stdout_m112
>> stderr.txt wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
>> solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
>> solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
>> solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
>> solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
>> fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>> --resultonly --wham_outputs wf_m112 --solv_lrc_file
>> solv_chg_a10_m112_done
>> --fe_file fe_solv_m112 Task urn:0-9408-1184616130688 completed with
>> exit code -3 in 464 ms
>>
>>
>> Both the Falkon logs and the Swift logs agree on the number of
>> submitted tasks, the number of successful tasks, and the number of
>> failed tasks. There were no outstanding tasks at the time the workflow
>> failed. BTW, I checked the disk space usage about an hour after the
>> whole experiment finished, and there was plenty of disk space left.
>>
>> Yong mentioned that he looked through the MolDyn output and there
>> were only 242 'fe_solv_*' files, so 2 molecules' files were missing...
>> One question for Nika: are the 6 failed tasks the same job,
>> resubmitted? Nika, can you add anything more to this? Is there
>> anything else to be learned from the Swift log as to why those last
>> few jobs failed? After we have tried to figure out what happened, can
>> we resume the workflow and hopefully finish the last few jobs in
>> another run?
>>
>> Ioan
>>
>>
>>
>
--
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dsl.cs.uchicago.edu/
============================================