[Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
Ioan Raicu
iraicu at cs.uchicago.edu
Tue Jul 17 22:30:34 CDT 2007
Hi,
See below:
Ian Foster wrote:
> Ioan:
>
> a) I think this information should be in the bugzilla summary,
> according to our processes?
>
I posted all this to bugzilla, didn't I?
> b) Why did it take so long to get all of the workers working?
I finally had enough confidence in the dynamic resource provisioning
that we wouldn't lose any jobs across resource allocation boundaries (I
ran lots of tests and they all passed), so I enabled it for this run. I
set the max to be the entire ANL site (274 processors). We got 146
processors at the beginning, and over time the pool kept growing up to a
peak of around 208; the rest, up to 274, stayed queued in the PBS wait
queue. The difference between the 146 at the start and the 208 at the
end was that other users who were in the system at the beginning
finished their work and released nodes, so our queued processors moved
from the wait queue into the run queue. I would actually be curious to
try the latest DRP code on a busy site, such as Purdue or NCSA, and see
whether we can maintain a decent pool size over a period of time despite
the site being busy...
BTW, in the previous MolDyn runs we normally set the min and max to,
say, 100 or 200 processors, and we would wait until we had all of them
before we started; sometimes that meant waiting 12~24 hours for enough
nodes to become free so the large job could start. With DRP, you start
with whatever the site has available and get more over time as your jobs
make it through the wait queue and other running jobs complete...
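The idea behind the provisioner is roughly the loop sketched below. This
is just a minimal sketch to show the pool-growth behavior, not the actual
Falkon DRP code; the numbers and the processors_freed_since_last_check()
helper are stand-ins.

import random
import time

SITE_MAX = 274        # cap the pool at the whole ANL site
allocated = 146       # what the site could hand us right away in this run

def processors_freed_since_last_check():
    """Stand-in for asking PBS how many nodes other users just released."""
    return random.randint(0, 5)

# Keep topping the pool up toward SITE_MAX as other users' jobs finish;
# anything we ask for beyond what is free just waits in the PBS queue.
for minute in range(10):                 # a short simulated run
    grant = min(SITE_MAX - allocated, processors_freed_since_last_check())
    allocated += grant
    print("minute %2d: pool size = %d" % (minute, allocated))
    time.sleep(0)                        # no real waiting in the sketch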
>
> c) Can we debug using less than O(800) node hours?
The real MolDyn run for 244 molecules takes on the order of O(20K) node
hours, so O(0.8K) is still an improvement. Remember that we can run the
smaller workflows fine; it's the bigger ones that are giving us a hard
time. Nika, if you have any suggestions on how we can further reduce the
runtime of each job while still exercising the same number of jobs and
the same number of input/output files, let us know.
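For example, one option (purely hypothetical, nothing like this exists in
the workflow yet) would be to swap each app for a tiny stand-in that
sleeps a few seconds and touches the same output files the real code
would produce, so Swift and Falkon still see the same number of jobs and
files:

#!/usr/bin/env python
# Hypothetical stand-in for a real MolDyn app: burns a few seconds and
# creates the requested output files so the workflow's I/O pattern is kept.
import sys
import time

def main():
    # usage: mock_job.py <seconds> <output-file> [<output-file> ...]
    seconds = float(sys.argv[1])
    time.sleep(seconds)                   # simulate a (short) compute phase
    for path in sys.argv[2:]:
        out = open(path, "w")
        out.write("placeholder output\n") # same file count, tiny contents
        out.close()

if __name__ == "__main__":
    main()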
Ioan
>
> Ian.
>
> bugzilla-daemon at mcs.anl.gov wrote:
>> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72
>>
>> ------- Comment #24 from iraicu at cs.uchicago.edu 2007-07-17 16:08 -------
>> So the latest MolDyn 244-molecule run also failed... but I think it
>> made it all the way to the final few jobs...
>>
>> The place where I put all the information about the run is at:
>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/
>>
>>
>> Here are the graphs:
>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/summary_graph_med.jpg
>>
>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/task_graph_med.jpg
>>
>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/executor_graph_med.jpg
>>
>>
>> The Swift log can be found at:
>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/logs/swift/MolDyn-244-ja4ya01d6cti1.log
>>
>>
>> The Falkon logs are at:
>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/logs/falkon/
>>
>>
>> The 244 mol run was supposed to have 20497 tasks, broken down as
>> follows (per-stage multiplier x molecules = tasks):
>>    1 x   1 =     1
>>    1 x 244 =   244
>>    1 x 244 =   244
>>   68 x 244 = 16592
>>    1 x 244 =   244
>>   11 x 244 =  2684
>>    1 x 244 =   244
>>    1 x 244 =   244
>> ======================
>>          total: 20497
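(As a quick sanity check on that total, summing multiplier times
molecules per stage:)

# Per-stage (multiplier, molecules) pairs from the breakdown above.
stages = [(1, 1), (1, 244), (1, 244), (68, 244),
          (1, 244), (11, 244), (1, 244), (1, 244)]
print(sum(count * mols for count, mols in stages))   # 20497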
>>
>> We had 20495 tasks that exited with an exit code of 0, and 6 tasks
>> that exited with an exit code of -3. The worker logs don't show
>> anything on the stdout or stderr of the failed jobs. I looked online
>> for what an exit code of -3 could mean, but didn't find anything.
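One convention worth keeping in mind, though I don't know whether Falkon
follows it: some process APIs report a negative exit code when the child
is killed by a signal, so -3 would mean signal 3 (SIGQUIT). Python's
subprocess module behaves that way, for instance:

import subprocess

# Launch a child that immediately sends itself SIGQUIT (signal 3).
rc = subprocess.call(["python", "-c",
                      "import os, signal; os.kill(os.getpid(), signal.SIGQUIT)"])
print(rc)   # prints -3 on Linux: the child was killed by signal 3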
>> Here are the 6 failed tasks:
>> Executing task urn:0-9408-1184616132483... Building executable
>> command...Executing: /bin/sh shared/wrapper.sh fepl-zqtloeei
>> fe_stdout_m112
>> stderr.txt wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
>> solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
>> solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
>> solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
>> solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
>> fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>> --resultonly --wham_outputs wf_m112 --solv_lrc_file
>> solv_chg_a10_m112_done
>> --fe_file fe_solv_m112 Task urn:0-9408-1184616132483 completed with
>> exit code -3 in 238 ms
>>
>> Executing task urn:0-9408-1184616133199... Building executable
>> command...Executing: /bin/sh shared/wrapper.sh fepl-2rtloeei
>> fe_stdout_m112
>> stderr.txt wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
>> solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
>> solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
>> solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
>> solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
>> fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>> --resultonly --wham_outputs wf_m112 --solv_lrc_file
>> solv_chg_a10_m112_done
>> --fe_file fe_solv_m112 Task urn:0-9408-1184616133199 completed with
>> exit code -3 in 201 ms
>>
>> Executing task urn:0-15036-1184616133342... Building executable
>> command...Executing: /bin/sh shared/wrapper.sh fepl-5rtloeei
>> fe_stdout_m179
>> stderr.txt wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
>> solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
>> solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
>> solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
>> solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
>> fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>> --resultonly --wham_outputs wf_m179 --solv_lrc_file
>> solv_chg_a10_m179_done
>> --fe_file fe_solv_m179 Task urn:0-15036-1184616133342 completed with
>> exit code -3 in 267 ms
>>
>> Executing task urn:0-15036-1184616133628... Building executable
>> command...Executing: /bin/sh shared/wrapper.sh fepl-9rtloeei
>> fe_stdout_m179
>> stderr.txt wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
>> solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
>> solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
>> solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
>> solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
>> fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>> --resultonly --wham_outputs wf_m179 --solv_lrc_file
>> solv_chg_a10_m179_done
>> --fe_file fe_solv_m179 Task urn:0-15036-1184616133628 completed with
>> exit code -3 in 2368 ms
>>
>> Executing task urn:0-15036-1184616133528... Building executable
>> command...Executing: /bin/sh shared/wrapper.sh fepl-8rtloeei
>> fe_stdout_m179
>> stderr.txt wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
>> solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
>> solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
>> solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
>> solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
>> fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>> --resultonly --wham_outputs wf_m179 --solv_lrc_file
>> solv_chg_a10_m179_done
>> --fe_file fe_solv_m179 Task urn:0-15036-1184616133528 completed with
>> exit code -3 in 311 ms
>>
>> Executing task urn:0-9408-1184616130688... Building executable
>> command...Executing: /bin/sh shared/wrapper.sh fepl-9ptloeei
>> fe_stdout_m112
>> stderr.txt wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
>> solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
>> solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
>> solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
>> solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
>> fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
>> --resultonly --wham_outputs wf_m112 --solv_lrc_file
>> solv_chg_a10_m112_done
>> --fe_file fe_solv_m112 Task urn:0-9408-1184616130688 completed with
>> exit code -3 in 464 ms
>>
>>
>> Both the Falkon logs and the Swift logs agree on the number of
>> submitted tasks, the number of successful tasks, and the number of
>> failed tasks. There were no outstanding tasks at the time the workflow
>> failed. BTW, I checked the disk space usage about an hour after the
>> whole experiment finished, and there was plenty of disk space left.
>>
>> Yong mentioned that he looked through the MolDyn output and there
>> were only 242 'fe_solv_*' files, so 2 molecules' files were missing...
>> One question for Nika: are the 6 failed tasks the same job,
>> resubmitted? Nika, can you add anything more to this? Is there
>> anything else to be learned from the Swift log as to why those last
>> few jobs failed? After we have tried to figure out what happened, can
>> we resume the workflow and hopefully finish the last few jobs in
>> another run?
>>
>> Ioan
>>
>>
>>
>
--
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dsl.cs.uchicago.edu/
============================================