[Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules

Veronika Nefedova nefedova at mcs.anl.gov
Wed Jul 18 08:27:56 CDT 2007


Sorry, I was offline (sick with a cold/fever). I am taking today off as well.

I've checked the stderr files from the last run - it looks like 2
jobs failed due to application-specific reasons. I am Cc'ing Yuqing
to see if he has any insights... Here is what I had:

WHAM is not converged for solv_chg_m112
WHAM is not converged for solv_chg_m179

So it looks like 2 molecules (out of 244) failed. The last stage of
the workflow failed for these molecules because the previous stage(s)
produced some wrong/incomplete (?) results.
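
A quick way to repeat this check, as a rough sketch (it assumes each
job left its stderr output under the run's scratch directory - the
exact layout may differ):

  # Search all per-job stderr files for WHAM convergence failures.
  cd /disks/scratchgpfs1/iraicu/ModLyn/MolDyn-244-ja4ya01d6cti1
  grep -r "WHAM is not converged" .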

Yuqing, there are 6 directories on
tg-login1:/disks/scratchgpfs1/iraicu/ModLyn/MolDyn-244-ja4ya01d6cti1
(3 for each of the failed molecules). Any ideas what went wrong with
these 2 molecules?

Nika


On Jul 17, 2007, at 11:32 PM, Mihael Hategan wrote:

> I don't think these are random failures. In the whole workflow there
> were exactly 6 failed tasks: 3 belonging to one job and 3 to the
> other. Statistically, and if Ioan's assertion that they were not sent
> to the exact same worker is correct, I'd be pretty confident saying
> that it was due to specific executables failing on specific data (and
> by that I would include the possibility of missing data).
>
> Mihael
>
> On Tue, 2007-07-17 at 23:18 -0500, Tiberiu Stef-Praun wrote:
>> I also had jobs failing at the Argonne site today.
>> It seems that the ia_32 nodes were randomly failing on some of my
>> jobs, so I had to switch my apps to the ia_64 nodes to get a full,
>> successful execution.
>>
>> Tibi
>>
>> On 7/17/07, Ioan Raicu <iraicu at cs.uchicago.edu> wrote:
>>>
>>>
>>>
>>>  Mihael Hategan wrote:
>>>  On Tue, 2007-07-17 at 21:43 -0500, Ian Foster wrote:
>>>
>>>
>>>  Another (perhaps dumb?) question: it would seem desirable that we
>>> be able to quickly determine which tasks failed and then (attempt
>>> to) rerun them in such circumstances.
>>>
>>> Here it seems that a lot of effort is required just to determine
>>> which tasks failed, and I am not sure that the information extracted
>>> is enough to rerun them.
>>>
>>>  Normally, a summary of what failed, with the reasons, is printed on
>>> stderr, together with the stdout and stderr of the jobs. Perhaps it
>>> should also go to the log file.
>>>
>>> In this case, 2 jobs failed. The 6 failures are due to restarts,
>>> which agrees with the 2 missing molecules.
>>>
>>> When jobs fail, Swift should not clean up the job directories, so
>>> that one can do post-mortem debugging. I suggest invoking the
>>> application manually to see if it's a matter of a bad node or bad
>>> data.
>>>
>>>  The errors happened on 3 different nodes, so I suspect that it's
>>> not bad nodes (as we previously experienced with the stale NFS
>>> handle).
>>>
>>>  Nika, I sent out the actual commands that failed... can you try to
>>> run them manually to see what happens, and possibly determine why
>>> they failed? Can you also find out what an exit code of -3 means
>>> within the application that failed (you might have to look at the
>>> app source code, or contact its original author)?
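>>>
>>> Something along these lines could reproduce one failure by hand (a
>>> rough sketch; it assumes the run directory and its shared/wrapper.sh
>>> are still intact on tg-login1):
>>>
>>>   cd /disks/scratchgpfs1/iraicu/ModLyn/MolDyn-244-ja4ya01d6cti1
>>>   # Re-run one failed command exactly as the worker did (full
>>>   # argument list as in the logs below), then print the
>>>   # shell-visible exit status.
>>>   /bin/sh shared/wrapper.sh fepl-zqtloeei fe_stdout_m112 stderr.txt \
>>>     wf_m112 ... (remaining arguments as in the log below)
>>>   echo "exit status: $?"
>>>
>>> Note that POSIX exit statuses are 0-255, so a reported -3 may simply
>>> be 253 seen through a signed byte (e.g. a Perl exit(-3)) - that is an
>>> assumption to verify, not a diagnosis.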
>>>
>>>  Ioan
>>>
>>>
>>>
>>>
>>>  It also seems that we can't easily determine which output files are
>>> missing.
>>>
>>>  In the general case we wouldn't be able to, because the exact
>>> outputs may only be known at run-time. Granted, that kind of
>>> dynamism would depend on our ability to have nondeterministic files
>>> being returned, which we haven't gotten around to implementing. But
>>> there is a question of whether we should try to implement a
>>> short-term solution that would be invalidated by our own plans.
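>>>
>>> For this particular run, though, the expected outputs follow a fixed
>>> naming scheme, so a quick check would have found the gaps (a sketch;
>>> it assumes molecules are numbered m1 through m244 and that the
>>> fe_solv_* files land in the run directory):
>>>
>>>   # List molecules whose final free-energy output never appeared.
>>>   for i in $(seq 1 244); do
>>>     [ -e "fe_solv_m${i}" ] || echo "missing: fe_solv_m${i}"
>>>   done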
>>>
>>>
>>>
>>>  Ian.
>>>
>>> Ian Foster wrote:
>>>
>>>
>>>  Ioan:
>>>
>>> a) I think this information should be in the bugzilla summary,
>>> according to our processes?
>>>
>>> b) Why did it take so long to get all of the workers working?
>>>
>>> c) Can we debug using less than O(800) node hours?
>>>
>>> Ian.
>>>
>>> bugzilla-daemon at mcs.anl.gov wrote:
>>>
>>>
>>>  http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72
>>>
>>>
>>>
>>>
>>>
>>> ------- Comment #24 from iraicu at cs.uchicago.edu 2007-07-17 16:08 -------
>>> So the latest MolDyn 244-molecule run also failed... but I think it
>>> made it all the way to the final few jobs...
>>>
>>> The place where I put all the information about the run is at:
>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/
>>>
>>>
>>> Here are the graphs:
>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/summary_graph_med.jpg
>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/task_graph_med.jpg
>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/executor_graph_med.jpg
>>>
>>>
>>> The Swift log can be found at:
>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/logs/swift/MolDyn-244-ja4ya01d6cti1.log
>>>
>>>
>>> The Falkon logs are at:
>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/logs/falkon/
>>>
>>>
>>> The 244 mol run was supposed to have 20497 tasks, broken down as
>>> follows (each row reads: N x M = tasks):
>>>
>>>    1 x   1 =     1
>>>    1 x 244 =   244
>>>    1 x 244 =   244
>>>   68 x 244 = 16592
>>>    1 x 244 =   244
>>>   11 x 244 =  2684
>>>    1 x 244 =   244
>>>    1 x 244 =   244
>>>   ======================
>>>   total      20497
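>>>
>>> (As a sanity check, the grand total works out in plain shell
>>> arithmetic:
>>>
>>>   $ echo $((1 + 244 + 244 + 68*244 + 244 + 11*244 + 244 + 244))
>>>   20497
>>>
>>> which matches the planned task count.)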
>>>
>>> We had 20495 tasks that exited with an exit code of 0, and 6 tasks
>>> that exited with an exit code of -3 (the 2 remaining planned tasks
>>> were each attempted 3 times, giving the 6 failures). The worker logs
>>> don't show anything on the stdout or stderr of the failed jobs. I
>>> looked online for what an exit code of -3 could mean, but didn't
>>> find anything.
>>> Here are the 6 failed tasks:
>>> Executing task urn:0-9408-1184616132483... Building executable command...
>>> Executing: /bin/sh shared/wrapper.sh fepl-zqtloeei fe_stdout_m112 stderr.txt wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112 fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite --resultonly --wham_outputs wf_m112 --solv_lrc_file solv_chg_a10_m112_done --fe_file fe_solv_m112
>>> Task urn:0-9408-1184616132483 completed with exit code -3 in 238 ms
>>>
>>> Executing task urn:0-9408-1184616133199... Building executable command...
>>> Executing: /bin/sh shared/wrapper.sh fepl-2rtloeei fe_stdout_m112 stderr.txt wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112 fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite --resultonly --wham_outputs wf_m112 --solv_lrc_file solv_chg_a10_m112_done --fe_file fe_solv_m112
>>> Task urn:0-9408-1184616133199 completed with exit code -3 in 201 ms
>>>
>>> Executing task urn:0-15036-1184616133342... Building executable command...
>>> Executing: /bin/sh shared/wrapper.sh fepl-5rtloeei fe_stdout_m179 stderr.txt wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179 fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite --resultonly --wham_outputs wf_m179 --solv_lrc_file solv_chg_a10_m179_done --fe_file fe_solv_m179
>>> Task urn:0-15036-1184616133342 completed with exit code -3 in 267 ms
>>>
>>> Executing task urn:0-15036-1184616133628... Building executable command...
>>> Executing: /bin/sh shared/wrapper.sh fepl-9rtloeei fe_stdout_m179 stderr.txt wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179 fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite --resultonly --wham_outputs wf_m179 --solv_lrc_file solv_chg_a10_m179_done --fe_file fe_solv_m179
>>> Task urn:0-15036-1184616133628 completed with exit code -3 in 2368 ms
>>>
>>> Executing task urn:0-15036-1184616133528... Building executable command...
>>> Executing: /bin/sh shared/wrapper.sh fepl-8rtloeei fe_stdout_m179 stderr.txt wf_m179 solv_chg_a10_m179_done solv_repu_0.2_0.3_m179.out solv_repu_0_0.2_m179.out solv_repu_0.9_1_m179.out solv_disp_m179.out solv_chg_m179.out solv_repu_0.6_0.7_m179.out solv_repu_0.5_0.6_m179.out solv_repu_0.4_0.5_m179.out solv_repu_0.3_0.4_m179.out solv_repu_0.8_0.9_m179.out solv_repu_0.7_0.8_m179.out fe_solv_m179 fe_stdout_m179 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite --resultonly --wham_outputs wf_m179 --solv_lrc_file solv_chg_a10_m179_done --fe_file fe_solv_m179
>>> Task urn:0-15036-1184616133528 completed with exit code -3 in 311 ms
>>>
>>> Executing task urn:0-9408-1184616130688... Building executable command...
>>> Executing: /bin/sh shared/wrapper.sh fepl-9ptloeei fe_stdout_m112 stderr.txt wf_m112 solv_chg_a10_m112_done solv_repu_0.2_0.3_m112.out solv_repu_0_0.2_m112.out solv_repu_0.9_1_m112.out solv_disp_m112.out solv_chg_m112.out solv_repu_0.6_0.7_m112.out solv_repu_0.5_0.6_m112.out solv_repu_0.4_0.5_m112.out solv_repu_0.3_0.4_m112.out solv_repu_0.8_0.9_m112.out solv_repu_0.7_0.8_m112.out fe_solv_m112 fe_stdout_m112 /disks/scratchgpfs1/iraicu/ModLyn/bin/fe.pl --nosite --resultonly --wham_outputs wf_m112 --solv_lrc_file solv_chg_a10_m112_done --fe_file fe_solv_m112
>>> Task urn:0-9408-1184616130688 completed with exit code -3 in 464 ms
>>>
>>>
>>> Both the Falkon logs and the Swift logs agree on the number of
>>> submitted tasks, the number of successful tasks, and the number of
>>> failed tasks. There were no outstanding tasks at the time the
>>> workflow failed. BTW, I checked the disk space usage about an hour
>>> after the whole experiment finished, and there was plenty of disk
>>> space left.
>>>
>>> Yong mentioned that he looked through the output of MolDyn, and
>>> there were only 242 'fe_solv_*' files, so 2 molecules' files were
>>> missing... One question for Nika: are the 6 failed tasks the same
>>> jobs, resubmitted? Nika, can you add anything more to this? Is there
>>> anything else to be learned from the Swift log as to why those last
>>> few jobs failed? After we have tried to figure out what happened,
>>> can we resume the workflow and hopefully finish the last few jobs in
>>> another run?
>>>
>>> Ioan
>>>
>>>
>>>
>>>
>>>  --
>>>
>>>  Ian Foster, Director, Computation Institute
>>> Argonne National Laboratory & University of Chicago
>>> Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
>>> Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
>>> Tel: +1 630 252 4619. Web: www.ci.uchicago.edu.
>>>  Globus Alliance: www.globus.org.
>>>
>>>  --
>>> ============================================
>>> Ioan Raicu
>>> Ph.D. Student
>>> ============================================
>>> Distributed Systems Laboratory
>>> Computer Science Department
>>> University of Chicago
>>> 1100 E. 58th Street, Ryerson Hall
>>> Chicago, IL 60637
>>> ============================================
>>> Email: iraicu at cs.uchicago.edu
>>> Web: http://www.cs.uchicago.edu/~iraicu
>>>  http://dsl.cs.uchicago.edu/
>>> ============================================
>>> ============================================
>>>
>>>
>>
>>
>




More information about the Swift-devel mailing list