[Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules

bugzilla-daemon at mcs.anl.gov
Tue Jul 17 16:08:59 CDT 2007


http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72





------- Comment #24 from iraicu at cs.uchicago.edu  2007-07-17 16:08 -------
So the latest MolDyn 244-molecule run also failed... but I think it made it all
the way to the final few jobs...

I have posted all the information about the run at:
http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/

Here are the graphs:
http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/summary_graph_med.jpg
http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/task_graph_med.jpg
http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/executor_graph_med.jpg

The Swift log can be found at:
http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/logs/swift/MolDyn-244-ja4ya01d6cti1.log

The Falkon logs are at:
http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/logs/falkon/

The 244-molecule run was supposed to have 20497 tasks, broken down as follows
(tasks per molecule x number of molecules = tasks per stage):
 tasks/mol   molecules     tasks
     1             1           1
     1           244         244
     1           244         244
    68           244       16592
     1           244         244
    11           244        2684
     1           244         244
     1           244         244
 ================================
 total                     20497
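
As a quick sanity check on the arithmetic, the per-stage products do add up:
1x1 + (1+1+68+1+11+1+1)x244 = 1 + 84x244 = 1 + 20496 = 20497.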

We had 20495 tasks that exited with an exit code of 0, and 6 tasks that exited
with an exit code of -3.  The worker logs don't show anything on the stdout or
stderr of the failed jobs.  I looked online for what an exit code of -3 could
mean, but didn't find anything.
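
If anyone wants to dig for more context, one thing we can try is pulling the
surrounding worker-log lines for each of the failed task URNs.  A minimal
sketch, assuming the Falkon worker logs are plain-text files under the
logs/falkon/ directory linked above (the *.log glob is an assumption):

    # hypothetical: show a few lines of context around each failed task URN
    for urn in urn:0-9408-1184616132483 urn:0-9408-1184616133199 \
               urn:0-15036-1184616133342 urn:0-15036-1184616133628 \
               urn:0-15036-1184616133528 urn:0-9408-1184616130688; do
        grep -B 2 -A 5 "$urn" logs/falkon/*.log
    done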

Here are the 6 failed tasks:
Executing task urn:0-9408-1184616132483... Building executable
command...Executing: /bin/sh shared/wrapper.sh fepl-zqtloeei fe_stdout_m112
stderr.txt   wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
fe_stdout_m112  /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
--resultonly --wham_outputs wf_m112 --solv_lrc_file solv_chg_a10_m112_done
--fe_file fe_solv_m112 
Task urn:0-9408-1184616132483 completed with exit code -3 in 238 ms

Executing task urn:0-9408-1184616133199... Building executable
command...Executing: /bin/sh shared/wrapper.sh fepl-2rtloeei fe_stdout_m112
stderr.txt   wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
fe_stdout_m112  /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
--resultonly --wham_outputs wf_m112 --solv_lrc_file solv_chg_a10_m112_done
--fe_file fe_solv_m112 
Task urn:0-9408-1184616133199 completed with exit code -3 in 201 ms

Executing task urn:0-15036-1184616133342... Building executable
command...Executing: /bin/sh shared/wrapper.sh fepl-5rtloeei fe_stdout_m179
stderr.txt   wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
fe_stdout_m179  /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
--resultonly --wham_outputs wf_m179 --solv_lrc_file solv_chg_a10_m179_done
--fe_file fe_solv_m179 
Task urn:0-15036-1184616133342 completed with exit code -3 in 267 ms

Executing task urn:0-15036-1184616133628... Building executable
command...Executing: /bin/sh shared/wrapper.sh fepl-9rtloeei fe_stdout_m179
stderr.txt   wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
fe_stdout_m179  /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
--resultonly --wham_outputs wf_m179 --solv_lrc_file solv_chg_a10_m179_done
--fe_file fe_solv_m179 
Task urn:0-15036-1184616133628 completed with exit code -3 in 2368 ms

Executing task urn:0-15036-1184616133528... Building executable
command...Executing: /bin/sh shared/wrapper.sh fepl-8rtloeei fe_stdout_m179
stderr.txt   wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out
solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out
solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out
solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out
solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179
fe_stdout_m179  /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
--resultonly --wham_outputs wf_m179 --solv_lrc_file solv_chg_a10_m179_done
--fe_file fe_solv_m179 
Task urn:0-15036-1184616133528 completed with exit code -3 in 311 ms

Executing task urn:0-9408-1184616130688... Building executable
command...Executing: /bin/sh shared/wrapper.sh fepl-9ptloeei fe_stdout_m112
stderr.txt   wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out
solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out
solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out
solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out
solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112
fe_stdout_m112  /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite
--resultonly --wham_outputs wf_m112 --solv_lrc_file solv_chg_a10_m112_done
--fe_file fe_solv_m112 
Task urn:0-9408-1184616130688 completed with exit code -3 in 464 ms


Both the Falkon logs and the Swift logs agree on the number of submitted tasks,
the number of successful tasks, and the number of failed tasks.  There were no
outstanding tasks at the time the workflow failed.  BTW, I checked the disk
space usage about an hour after the whole experiment finished, and there was
plenty of disk space left.
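
For the record, a tally of exit codes straight from the worker logs should
match those numbers.  A minimal sketch, assuming the logs contain lines of the
form "completed with exit code N" as in the excerpts above, and that they sit
under a local logs/falkon/ directory (file names assumed):

    # hypothetical: count tasks per exit code across all worker logs
    grep -ho "exit code [0-9-]*" logs/falkon/*.log | sort | uniq -c

This should print 20495 for "exit code 0" and 6 for "exit code -3".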

Yong mentioned that he looked through the output of MolDyn, and there were only
242 'fe_solv_*' files, so the outputs for 2 molecules are missing.  Judging
from the listings above, the 6 failures cover only two molecules (3 tasks for
m112 and 3 for m179), which would match the 2 missing files.  One question for
Nika: are the 6 failed tasks the same jobs, resubmitted?
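
To pin down exactly which molecules are missing their free-energy output, a
check along these lines should work.  A minimal sketch, assuming the outputs
follow the fe_solv_m<N> naming seen in the task listings, molecules are
numbered 1 through 244, and everything sits in one output directory (the path
below is hypothetical):

    # hypothetical: list molecules that have no fe_solv_m<N> file
    cd /path/to/MolDyn/output
    for i in $(seq 1 244); do
        [ -e "fe_solv_m$i" ] || echo "missing fe_solv_m$i"
    done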

Nika, can you add anything more to this?  Is there anything else to be learned
from the Swift log about why those last few jobs failed?  Once we have figured
out what happened, can we resume the workflow and hopefully finish the last few
jobs in another run?
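
If the restart support in the Swift build we are using handles this case, I am
hoping the resume could be as simple as pointing Swift at the restart log from
this run, something along the lines of:

    # hypothetical: resume from the restart log of the failed run
    # (exact option name, .rlog file name, and script name are assumptions --
    #  Nika would know the right incantation)
    swift -resume MolDyn-244-ja4ya01d6cti1.0.rlog MolDyn-244.swift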

Ioan


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.


