From benc at hawaga.org.uk Mon Feb 26 09:10:21 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 26 Feb 2007 15:10:21 +0000 (GMT) Subject: [Swift-devel] Re: vdl2-devel list changing. In-Reply-To: References: Message-ID: On Mon, 26 Feb 2007, Ben Clifford wrote: > You are subscribed to vdl2-devel. Shortly this list will change to > swift-devel. If you receive this message, you are now on swift-devel! -- From benc at hawaga.org.uk Mon Feb 26 10:33:30 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 26 Feb 2007 16:33:30 +0000 (GMT) Subject: [Swift-devel] mapping structures Message-ID: I was thinking about some fmri stuff for a tutorial exercise and came up with this question: .img and .hdr files come in the input in pairs, like this: $ ls Raw/ anatomy1.hdr anatomy2.hdr anatomy3.hdr anatomy4.hdr reference.hdr anatomy1.img anatomy2.img anatomy3.img anatomy4.img reference.img so from a data type perspective it makes sense to say something like: type inputimage { file img; file hdr; } but then do any of the existing mappers let me map into this structure? or do I have to write a custom mapper? (to torture syntax, express something like: inputimage ref; ref.img <"reference.img">; ref.hdr <"reference.hdr">; ) I think I can do this with the CSV mapper, with the input file pairs specified as CSV rows; but this needs a CSV to be externally generated from information that is already expressed in the metadata in the filename so doesn't particularly help. -- From tiberius at ci.uchicago.edu Mon Feb 26 11:04:29 2007 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Mon, 26 Feb 2007 11:04:29 -0600 Subject: [Swift-devel] Fwd: [VDL2-devel] Re: test v0.1rc1 In-Reply-To: References: Message-ID: Some early observations: I'm running the SIDGrid workflow from teraport, and some early observations: ( I have not finished a full run yet): - data seemed to be delayed in transferring back (from the NCSA site). 
I waited 5 minutes after the execution apparently finished on the remote site (I was logged in and was monitoring the output files) then stopped the workflow. Still investigating - the workflow chose 3 sites (I had 4 available) and it started 6 parallel jobs on each site. Strange not to choose all the available sites. Still investigating Tibi On 2/26/07, Tiberiu Stef-Praun wrote: > I have to do measurements of the SIDGrid, so I'll use this new release > to do that. > You'll hear from me. > > Tibi > > On 2/26/07, Ben Clifford wrote: > > > > On Mon, 26 Feb 2007, Ben Clifford wrote: > > > > > > > > v0.1rc1 was built at the end of last week. please spend some time testing > > > > here's the URL for download: > > > > http://www.ci.uchicago.edu/swift/tests/vdsk-0.1rc1.tar.gz > > > > -- > > > > > -- > Tiberiu (Tibi) Stef-Praun, PhD > Research Staff, Computation Institute > 5640 S. Ellis Ave, #405 > University of Chicago > http://www-unix.mcs.anl.gov/~tiberius/ > -- Tiberiu (Tibi) Stef-Praun, PhD Research Staff, Computation Institute 5640 S. Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From benc at hawaga.org.uk Mon Feb 26 11:14:09 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 26 Feb 2007 17:14:09 +0000 (GMT) Subject: [Swift-devel] Re: [VDL2-devel] Re: test v0.1rc1 Message-ID: is this different behaviour from what you've observed with other versions? On Mon, 26 Feb 2007, Tiberiu Stef-Praun wrote: > Some early observations: > > I'm running the SIDGrid workflow from teraport, and some early > observations: ( I have not finished a full run yet): > > - data seemed to be delayed in transferring back (from the NCSA > site). 
I waited 5 minutes after the execution apparently finished on > the remote site (I was logged in and was monitoring the output files) > then stopped the workflow. Still investigating > - the workflow chose 3 sites (I had 4 available) and it started 6 > parallel jobs on each site. Strange not to choose all the available > sites. Still investigating > > Tibi > > > On 2/26/07, Tiberiu Stef-Praun wrote: > > I have to do measurements of the SIDGrid, so I'll use this new release > > to do that. > > You'll hear from me. > > > > Tibi > > > > On 2/26/07, Ben Clifford wrote: > > > > > > On Mon, 26 Feb 2007, Ben Clifford wrote: > > > > > > > > > > > v0.1rc1 was built at the end of last week. please spend some time > > testing > > > > > > here's the URL for download: > > > > > > http://www.ci.uchicago.edu/swift/tests/vdsk-0.1rc1.tar.gz > > > > > > -- > > > > > > > > > -- > > Tiberiu (Tibi) Stef-Praun, PhD > > Research Staff, Computation Institute > > 5640 S. Ellis Ave, #405 > > University of Chicago > > http://www-unix.mcs.anl.gov/~tiberius/ > > > > > From foster at mcs.anl.gov Mon Feb 26 11:14:56 2007 From: foster at mcs.anl.gov (Ian Foster) Date: Mon, 26 Feb 2007 11:14:56 -0600 Subject: [Swift-devel] Re: [VDL2-devel] Re: test v0.1rc1 In-Reply-To: References: Message-ID: <45E31590.3000000@mcs.anl.gov> I am puzzled why we are testing on multiple sites, rather than a single site. Ben Clifford wrote: > is this different behaviour from what you've observed with other versions? > > On Mon, 26 Feb 2007, Tiberiu Stef-Praun wrote: > > >> Some early observations: >> >> I'm running the SIDGrid workflow from teraport, and some early >> observations: ( I have not finished a full run yet): >> >> - data seemed to be delayed in transferring back (from the NCSA >> site). I waited 5 minutes after the execution apparently finished on >> the remote site (I was logged in and was monitoring the output files) >> then stopped the workflow. 
Still investigating >> - the workflow chose 3 sites (I had 4 available) and it started 6 >> parallel jobs on each site. Strange not to choose all the available >> sites. Still investigating >> >> Tibi >> >> >> On 2/26/07, Tiberiu Stef-Praun wrote: >> >>> I have to do measurements of the SIDGrid, so I'll use this new release >>> to do that. >>> You'll hear from me. >>> >>> Tibi >>> >>> On 2/26/07, Ben Clifford wrote: >>> >>>> On Mon, 26 Feb 2007, Ben Clifford wrote: >>>> >>>> >>>>> v0.1rc1 was built at the end of last week. please spend some time >>>>> >>> testing >>> >>>> here's the URL for download: >>>> >>>> http://www.ci.uchicago.edu/swift/tests/vdsk-0.1rc1.tar.gz >>>> >>>> -- >>>> >>>> >>> -- >>> Tiberiu (Tibi) Stef-Praun, PhD >>> Research Staff, Computation Institute >>> 5640 S. Ellis Ave, #405 >>> University of Chicago >>> http://www-unix.mcs.anl.gov/~tiberius/ >>> >>> >> >> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- Ian Foster, Director, Computation Institute Argonne National Laboratory & University of Chicago Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. Globus Alliance: www.globus.org. From tiberius at ci.uchicago.edu Mon Feb 26 11:22:19 2007 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Mon, 26 Feb 2007 11:22:19 -0600 Subject: [Swift-devel] Re: [VDL2-devel] Re: test v0.1rc1 In-Reply-To: <45E31590.3000000@mcs.anl.gov> References: <45E31590.3000000@mcs.anl.gov> Message-ID: The output of running the workflow is 200GB, so I have to find a site that allows me to manipulate that much data. I have chosen teraport, but it is not yet fully usable. 
So I had two choices: use multiple sites (a realistic test case) or move to another site that has more than 200GB available. The SIDGrid users do not yet have TeraGrid accounts, so I went with the first choice for now. I'm planning to try the second choice as well. On 2/26/07, Ian Foster wrote: > > I am puzzled why we are testing on multiple sites, rather than a single > site. > > Ben Clifford wrote: > is this different behaviour from what you've observed with other versions? > > On Mon, 26 Feb 2007, Tiberiu Stef-Praun wrote: > > > > Some early observations: > > I'm running the SIDGrid workflow from teraport, and some early > observations: ( I have not finished a full run yet): > > - data seemed to be delayed in transferring back (from the NCSA > site). I waited 5 minutes after the execution apparently finished on > the remote site (I was logged in and was monitoring the output files) > then stopped the workflow. Still investigating > - the workflow chose 3 sites (I had 4 available) and it started 6 > parallel jobs on each site. Strange not to choose all the available > sites. Still investigating > > Tibi > > > On 2/26/07, Tiberiu Stef-Praun wrote: > > > I have to do measurements of the SIDGrid, so I'll use this new release > to do that. > You'll hear from me. > > Tibi > > On 2/26/07, Ben Clifford wrote: > > > On Mon, 26 Feb 2007, Ben Clifford wrote: > > > > v0.1rc1 was built at the end of last week. please spend some time > > testing > > > here's the URL for download: > > http://www.ci.uchicago.edu/swift/tests/vdsk-0.1rc1.tar.gz > > -- > > > -- > Tiberiu (Tibi) Stef-Praun, PhD > Research Staff, Computation Institute > 5640 S. 
Ellis Ave, #405 > University of Chicago > http://www-unix.mcs.anl.gov/~tiberius/ > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > -- > > Ian Foster, Director, Computation Institute > Argonne National Laboratory & University of Chicago > Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 > Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 > Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. > Globus Alliance: www.globus.org. > > -- Tiberiu (Tibi) Stef-Praun, PhD Research Staff, Computation Institute 5640 S. Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From wilde at mcs.anl.gov Mon Feb 26 11:42:27 2007 From: wilde at mcs.anl.gov (Mike Wilde) Date: Mon, 26 Feb 2007 11:42:27 -0600 Subject: [Swift-devel] Re: [VDL2-devel] Re: test v0.1rc1 In-Reply-To: <45E31590.3000000@mcs.anl.gov> References: <45E31590.3000000@mcs.anl.gov> Message-ID: <45E31C03.302@mcs.anl.gov> Ian Foster wrote, On 2/26/2007 11:14 AM: > I am puzzled why we are testing on multiple sites, rather than a single > site. I believe that we need to, that the code is there to do this, and needs testing. This is not an unreasonable time to start trying this. If we are to be taken seriously by OSG, and as a Grid, we need to use multiple sites. We can set the priority low for the last anomaly that Tibi pointed out below (using 3 sites out of 4). We decided in 0.1 that we would enable more sites to run (although we said there "not multiple"). 0.1 is virtually done; we have not yet frozen the feature set of 0.2. This is a reasonable candidate feature to consider for 0.2. People can be focusing and still mention (and file as bugs) things that they encounter in testing. They should be able to try things a bit out of the box as long as we are making good progress, which we are. 
- Mike > > Ben Clifford wrote: >> is this different behaviour from what you've observed with other versions? >> >> On Mon, 26 Feb 2007, Tiberiu Stef-Praun wrote: >> >> >>> Some early observations: >>> >>> I'm running the SIDGrid workflow from teraport, and some early >>> observations: ( I have not finished a full run yet): >>> >>> - data seemed to be delayed in transferring back (from the NCSA >>> site). I waited 5 minutes after the execution apparently finished on >>> the remote site (I was logged in and was monitoring the output files) >>> then stopped the workflow. Still investigating >>> - the workflow chose 3 sites (I had 4 available) and it started 6 >>> parallel jobs on each site. Strange not to choose all the available >>> sites. Still investigating >>> >>> Tibi >>> >>> >>> On 2/26/07, Tiberiu Stef-Praun wrote: >>> >>>> I have to do measurements of the SIDGrid, so I'll use this new release >>>> to do that. >>>> You'll hear from me. >>>> >>>> Tibi >>>> >>>> On 2/26/07, Ben Clifford wrote: >>>> >>>>> On Mon, 26 Feb 2007, Ben Clifford wrote: >>>>> >>>>> >>>>>> v0.1rc1 was built at the end of last week. please spend some time >>>>>> >>>> testing >>>> >>>>> here's the URL for download: >>>>> >>>>> http://www.ci.uchicago.edu/swift/tests/vdsk-0.1rc1.tar.gz >>>>> >>>>> -- >>>>> >>>>> >>>> -- >>>> Tiberiu (Tibi) Stef-Praun, PhD >>>> Research Staff, Computation Institute >>>> 5640 S. Ellis Ave, #405 >>>> University of Chicago >>>> http://www-unix.mcs.anl.gov/~tiberius/ >>>> >>>> >>> >>> >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> > > -- > > Ian Foster, Director, Computation Institute > Argonne National Laboratory & University of Chicago > Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 > Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 > Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. > Globus Alliance: www.globus.org. 
> > > ------------------------------------------------------------------------ > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997 From benc at hawaga.org.uk Mon Feb 26 11:45:54 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 26 Feb 2007 17:45:54 +0000 (GMT) Subject: [Swift-devel] Re: [VDL2-devel] Re: test v0.1rc1 In-Reply-To: References: Message-ID: On Mon, 26 Feb 2007, Ben Clifford wrote: > > - data seemed to be delayed in transferring back (from the NCSA > > site). I waited 5 minutes after the execution apparently finished on > > the remote site (I was logged in and was monitoring the output files) > > then stopped the workflow. Still investigating so is this a general problem with submitting work to NCSA? It's in our standard sites.xml so it's probably useful that it actually works. > > - the workflow chose 3 sites (I had 4 available) and it started 6 > > parallel jobs on each site. Strange not to choose all the available > > sites. Still investigating this is not something that will affect the 0.1 release, I think. -- From foster at mcs.anl.gov Mon Feb 26 11:49:11 2007 From: foster at mcs.anl.gov (Ian Foster) Date: Mon, 26 Feb 2007 11:49:11 -0600 Subject: [Swift-devel] Re: [VDL2-devel] Re: test v0.1rc1 In-Reply-To: <45E31C03.302@mcs.anl.gov> References: <45E31590.3000000@mcs.anl.gov> <45E31C03.302@mcs.anl.gov> Message-ID: <45E31D97.5010708@mcs.anl.gov> Mike: My big concern continues to be that we diffuse our efforts by trying to do too many things at once. I asked how things were going last week, and someone (can't recall who) replied: "we are consumed getting new grid sites to work." 
Meanwhile, there is no documentation on the Swift web site of a single application that runs end to end, showing the code run and the performance achieved. That doesn't seem good to me, if we want to deliver value to users. We of course do want to run across multiple sites, and soon--but I continue to hope that we can be disciplined about approaching things one step at a time. Ian. Mike Wilde wrote: > Ian Foster wrote, On 2/26/2007 11:14 AM: >> I am puzzled why we are testing on multiple sites, rather than a >> single site. > > I believe that we need to, that the code is there to do this, and > needs testing. This is not an unreasonable time to start trying this. > > If we are to be taken seriously by OSG, and as a Grid, we need to use > multiple sites. > > We can set the priority low for the last anomaly that Tibi pointed out > below (using 3 sites out of 4). We decided in 0.1 that we would > enable more sites to run (although we said there "not multiple"). 0.1 > is virtually done; we have not yet frozen the feature set of 0.2. > This is a reasonable candidate feature to consider for 0.2. > > People can be focusing and still mention (and file as bugs) things > that they encounter in testing. They should be able to try things a > bit out of the box as long as we are making good progress, which we are. > > - Mike > >> >> Ben Clifford wrote: >>> is this different behaviour from what you've observed with other >>> versions? >>> >>> On Mon, 26 Feb 2007, Tiberiu Stef-Praun wrote: >>> >>> >>>> Some early observations: >>>> >>>> I'm running the SIDGrid workflow from teraport, and some early >>>> observations: ( I have not finished a full run yet): >>>> >>>> - data seemed to be delayed in transferring back (from the NCSA >>>> site). I waited 5 minutes after the execution apparently finished on >>>> the remote site (I was logged in and was monitoring the output files) >>>> then stopped the workflow. 
Still investigating >>>> - the workflow chose 3 sites (I had 4 available) and it started 6 >>>> parallel jobs on each site. Strange not to choose all the available >>>> sites. Still investigating >>>> >>>> Tibi >>>> >>>> >>>> On 2/26/07, Tiberiu Stef-Praun wrote: >>>> >>>>> I have to do measurements of the SIDGrid, so I'll use this new >>>>> release >>>>> to do that. >>>>> You'll hear from me. >>>>> >>>>> Tibi >>>>> >>>>> On 2/26/07, Ben Clifford wrote: >>>>> >>>>>> On Mon, 26 Feb 2007, Ben Clifford wrote: >>>>>> >>>>>> >>>>>>> v0.1rc1 was built at the end of last week. please spend some time >>>>>>> >>>>> testing >>>>> >>>>>> here's the URL for download: >>>>>> >>>>>> http://www.ci.uchicago.edu/swift/tests/vdsk-0.1rc1.tar.gz >>>>>> >>>>>> -- >>>>>> >>>>>> >>>>> -- >>>>> Tiberiu (Tibi) Stef-Praun, PhD >>>>> Research Staff, Computation Institute >>>>> 5640 S. Ellis Ave, #405 >>>>> University of Chicago >>>>> http://www-unix.mcs.anl.gov/~tiberius/ >>>>> >>>>> >>>> >>>> >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >>> >> >> -- >> >> Ian Foster, Director, Computation Institute >> Argonne National Laboratory & University of Chicago >> Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 >> Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 >> Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. >> Globus Alliance: www.globus.org. >> >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- Ian Foster, Director, Computation Institute Argonne National Laboratory & University of Chicago Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 Tel: +1 630 252 4619. 
Web: www.ci.uchicago.edu. Globus Alliance: www.globus.org. From benc at hawaga.org.uk Mon Feb 26 11:49:18 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 26 Feb 2007 17:49:18 +0000 (GMT) Subject: [Swift-devel] csv deliminators Message-ID: The CSV mapper doc suggests: > delim: Content delimiter, default is white space That seems an 'interesting' default for *comma* separated values. -- From yongzh at cs.uchicago.edu Mon Feb 26 11:53:40 2007 From: yongzh at cs.uchicago.edu (Yong Zhao) Date: Mon, 26 Feb 2007 11:53:40 -0600 (CST) Subject: [Swift-devel] csv deliminators In-Reply-To: References: Message-ID: I think comma was also in there, just the string tokenizer thing. Yong. On Mon, 26 Feb 2007, Ben Clifford wrote: > > The CSV mapper doc suggests: > > > delim: Content delimiter, default is white space > > That seems an 'interesting' default for *comma* separated values. > > -- > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From yongzh at cs.uchicago.edu Mon Feb 26 11:54:43 2007 From: yongzh at cs.uchicago.edu (Yong Zhao) Date: Mon, 26 Feb 2007 11:54:43 -0600 (CST) Subject: [Swift-devel] mapping structures In-Reply-To: References: Message-ID: there is an airsn_mapper that maps runs and volumes. Yong. On Mon, 26 Feb 2007, Ben Clifford wrote: > > I was thinking about some fmri stuff for a tutorial exercise and came up > with this question: > > .img and .hdr files come in the input in pairs, like this: > > $ ls Raw/ > anatomy1.hdr anatomy2.hdr anatomy3.hdr anatomy4.hdr reference.hdr > anatomy1.img anatomy2.img anatomy3.img anatomy4.img reference.img > > so from a data type perspective it makes sense to say something like: > > type inputimage { > file img; > file data; > } > > but then do any of the existing mappers let me map into this structure? or > do I have to write a custom mapper? 
> > (to torture syntax, express something like: > inputimg ref; > ref.img <"reference.img">; > ref.hdr <"reference.hdr">; > ) > > I think I can do this with the CSV mapper, with the input file pairs > specified as CSV rows; but this needs a CSV to be externally generated > from information that is already expressed in the metadata in the filename > so doesn't particularly help. > > -- > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Mon Feb 26 11:56:13 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 26 Feb 2007 17:56:13 +0000 (GMT) Subject: [Swift-devel] mapping structures In-Reply-To: References: Message-ID: but that's not part of mainstream swift - it's part of custom application support code, right? (i.e. it needed to be written for airsn) On Mon, 26 Feb 2007, Yong Zhao wrote: > there is an airsn_mapper that maps runs and volumes. > > Yong. > > On Mon, 26 Feb 2007, Ben Clifford wrote: > > > > > I was thinking about some fmri stuff for a tutorial exercise and came up > > with this question: > > > > .img and .hdr files come in the input in pairs, like this: > > > > $ ls Raw/ > > anatomy1.hdr anatomy2.hdr anatomy3.hdr anatomy4.hdr reference.hdr > > anatomy1.img anatomy2.img anatomy3.img anatomy4.img reference.img > > > > so from a data type perspective it makes sense to say something like: > > > > type inputimage { > > file img; > > file data; > > } > > > > but then do any of the existing mappers let me map into this structure? or > > do I have to write a custom mapper? 
> > > > (to torture syntax, express something like: > > inputimg ref; > > ref.img <"reference.img">; > > ref.hdr <"reference.hdr">; > > ) > > > > I think I can do this with the CSV mapper, with the input file pairs > > specified as CSV rows; but this needs a CSV to be externally generated > > from information that is already expressed in the metadata in the filename > > so doesn't particularly help. > > > > -- > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > From hategan at mcs.anl.gov Mon Feb 26 12:12:15 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 26 Feb 2007 12:12:15 -0600 Subject: [Swift-devel] Re: [VDL2-devel] Re: test v0.1rc1 In-Reply-To: <45E31D97.5010708@mcs.anl.gov> References: <45E31590.3000000@mcs.anl.gov> <45E31C03.302@mcs.anl.gov> <45E31D97.5010708@mcs.anl.gov> Message-ID: <1172513536.22253.24.camel@blabla.mcs.anl.gov> On Mon, 2007-02-26 at 11:49 -0600, Ian Foster wrote: > Mike: > > My big concern continues to be that we diffuse our efforts by trying to > do too many things at once. I asked how things were going last week, and > someone (can't recall who) replied: "we are consumed getting new grid > sites to work." And whether that statement is actually accurate and reflects the state of things is another story. Also whether it refers to something that is done to swift or it's about figuring out where Globus services are installed on the sites, and whether they work, is yet a different dimension. What I know: - I haven't changed the scheduler code in a while. The only change was to support file operation throttling, but that was necessary even for one site. - The nightly tests have been running for quite a while on two sites, and it seems to work fine (the % of successful workflows is >50% even if one of the sites is down). 
- I am "consumed" by updating documentation and polishing various rough edges in Swift and thinking about addressing various problems that seem to be of high importance to certain applications and to have long-term benefits. Not much new there. So I would like to ask folks to be careful about generic statements, and be clear about what the problems are. > Meanwhile, there is no documentation on the Swift web > site of a single application that runs end to end, showing the code run > and the performance achieved. That doesn't seem good to me, if we want > to deliver value to users. > > We of course do want to run across multiple sites, and soon--but I > continue to hope that we can be disciplined about approaching things one > step at a time. The multiple sites step has been done for a long time now, mostly because Karajan comes with it by default. I think it may be time to retire this discussion. It looks to me as though, most of the time it comes up, it is triggered by bits and pieces of information, like the above, which are either not accurate or easy to misinterpret. Mihael > > Ian. > > Mike Wilde wrote: > > Ian Foster wrote, On 2/26/2007 11:14 AM: > >> I am puzzled why we are testing on multiple sites, rather than a > >> single site. > > > > I believe that we need to, that the code is there to do this, and > > needs testing. This is not an unreasonable time to start trying this. > > > > If we are to be taken seriously by OSG, and as a Grid, we need to use > > multiple sites. > > > > We can set the priority low for the last anomaly that Tibi pointed out > > below (using 3 sites out of 4). We decided in 0.1 that we would > > enable more sites to run (although we said there "not multiple"). 0.1 > > is virtually done; we have not yet frozen the feature set of 0.2. > > This is a reasonable candidate feature to consider for 0.2. > > > > People can be focusing and still mention (and file as bugs) things > > that they encounter in testing. 
They should be able to try things a > > bit out of the box as long as we are making good progress, which we are. > > > > - Mike > > > >> > >> Ben Clifford wrote: > >>> is this different behaviour from what you've observed with other > >>> versions? > >>> > >>> On Mon, 26 Feb 2007, Tiberiu Stef-Praun wrote: > >>> > >>> > >>>> Some early observations: > >>>> > >>>> I'm running the SIDGrid workflow from teraport, and some early > >>>> observations: ( I have not finished a full run yet): > >>>> > >>>> - data seemed to be delayed in transferring back (from the NCSA > >>>> site). I waited 5 minutes after the execution apparently finished on > >>>> the remote site (I was logged in and was monitoring the output files) > >>>> then stopped the workflow. Still investigating > >>>> - the workflow chose 3 sites (I had 4 available) and it started 6 > >>>> parallel jobs on each site. Strange not to choose all the available > >>>> sites. Still investigating > >>>> > >>>> Tibi > >>>> > >>>> > >>>> On 2/26/07, Tiberiu Stef-Praun wrote: > >>>> > >>>>> I have to do measurements of the SIDGrid, so I'll use this new > >>>>> release > >>>>> to do that. > >>>>> You'll hear from me. > >>>>> > >>>>> Tibi > >>>>> > >>>>> On 2/26/07, Ben Clifford wrote: > >>>>> > >>>>>> On Mon, 26 Feb 2007, Ben Clifford wrote: > >>>>>> > >>>>>> > >>>>>>> v0.1rc1 was built at the end of last week. please spend some time > >>>>>>> > >>>>> testing > >>>>> > >>>>>> here's the URL for download: > >>>>>> > >>>>>> http://www.ci.uchicago.edu/swift/tests/vdsk-0.1rc1.tar.gz > >>>>>> > >>>>>> -- > >>>>>> > >>>>>> > >>>>> -- > >>>>> Tiberiu (Tibi) Stef-Praun, PhD > >>>>> Research Staff, Computation Institute > >>>>> 5640 S. 
Ellis Ave, #405 > >>>>> University of Chicago > >>>>> http://www-unix.mcs.anl.gov/~tiberius/ > >>>>> > >>>>> > >>>> > >>>> > >>> > >>> _______________________________________________ > >>> Swift-devel mailing list > >>> Swift-devel at ci.uchicago.edu > >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>> > >>> > >> > >> -- > >> > >> Ian Foster, Director, Computation Institute > >> Argonne National Laboratory & University of Chicago > >> Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 > >> Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 > >> Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. > >> Globus Alliance: www.globus.org. > >> > >> > >> ------------------------------------------------------------------------ > >> > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > -- > > Ian Foster, Director, Computation Institute > Argonne National Laboratory & University of Chicago > Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 > Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 > Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. > Globus Alliance: www.globus.org. > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From nefedova at mcs.anl.gov Mon Feb 26 13:05:32 2007 From: nefedova at mcs.anl.gov (Veronika V. 
Nefedova) Date: Mon, 26 Feb 2007 13:05:32 -0600 Subject: [Swift-devel] Re: test v0.1rc1 Message-ID: <6.0.0.22.2.20070226125640.059527f0@mail.mcs.anl.gov> When I tried to run my working workflow with a new version, it gave me an exception: Warning: Task handler throws exception but does not set status org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: TaskHandler can only handle unsubmitted tasks at org.globus.cog.abstraction.impl.common.task.CachingFileOperationTaskHandler.submit(CachingFileOperationTaskHandler.java:20) at org.globus.cog.karajan.scheduler.submitQueue.NonBlockingSubmit.run(NonBlockingSubmit.java:78) at java.lang.Thread.run(Thread.java:534) [349] wiggum /sandbox/ydeng/alamines > \\ This does not happen with the 070219 build. Nika At 06:12 AM 2/26/2007, Ben Clifford wrote: >On Mon, 26 Feb 2007, Ben Clifford wrote: > > > > > v0.1rc1 was built at the end of last week. please spend some time testing > >here's the URL for download: > >http://www.ci.uchicago.edu/swift/tests/vdsk-0.1rc1.tar.gz > >-- From benc at hawaga.org.uk Mon Feb 26 13:08:12 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 26 Feb 2007 19:08:12 +0000 (GMT) Subject: [Swift-devel] Re: [VDL2-user] GridFTP timeout exception Message-ID: On Mon, 26 Feb 2007, Mihael Hategan wrote: > On Fri, 2007-02-23 at 16:58 +0000, Ben Clifford wrote: > > > > On Fri, 23 Feb 2007, Mihael Hategan wrote: > > > > > Since this is non-functional (failing to shut down a GridFTP client > > > that's not in use any more), I think the message could be moved to info, > > > and the stack trace to debug. > > > > sounds good. > > Seems like it was at info for about a month now. I split it however to > only log the exception in debug. ok. I guess this error message came from something like 0rc3, as chad reported it originally, I think. 
-- From hategan at mcs.anl.gov Mon Feb 26 13:15:02 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 26 Feb 2007 13:15:02 -0600 Subject: [Swift-devel] Re: test v0.1rc1 In-Reply-To: <6.0.0.22.2.20070226125640.059527f0@mail.mcs.anl.gov> References: <6.0.0.22.2.20070226125640.059527f0@mail.mcs.anl.gov> Message-ID: <1172517302.25410.0.camel@blabla.mcs.anl.gov> On Mon, 2007-02-26 at 13:05 -0600, Veronika V. Nefedova wrote: > When I tried to run my working workflow with a new version, it gave me an > exception: Which new version? Mihael > > Warning: Task handler throws exception but does not set status > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > TaskHandler can only handle unsubmitted tasks > at > org.globus.cog.abstraction.impl.common.task.CachingFileOperationTaskHandler.submit(CachingFileOperationTaskHandler.java:20) > at > org.globus.cog.karajan.scheduler.submitQueue.NonBlockingSubmit.run(NonBlockingSubmit.java:78) > at java.lang.Thread.run(Thread.java:534) > > [349] wiggum /sandbox/ydeng/alamines > \\ > > > I do not have this happening with 070219 built. > > Nika > > At 06:12 AM 2/26/2007, Ben Clifford wrote: > > >On Mon, 26 Feb 2007, Ben Clifford wrote: > > > > > > > > v0.1rc1 was built at the end of last week. please spend some time testing > > > >here's the URL for download: > > > >http://www.ci.uchicago.edu/swift/tests/vdsk-0.1rc1.tar.gz > > > >-- > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From nefedova at mcs.anl.gov Mon Feb 26 13:21:45 2007 From: nefedova at mcs.anl.gov (Veronika V. 
Nefedova) Date: Mon, 26 Feb 2007 13:21:45 -0600 Subject: [Swift-devel] Re: test v0.1rc1 In-Reply-To: <1172517302.25410.0.camel@blabla.mcs.anl.gov> References: <6.0.0.22.2.20070226125640.059527f0@mail.mcs.anl.gov> <1172517302.25410.0.camel@blabla.mcs.anl.gov> Message-ID: <6.0.0.22.2.20070226132104.05733c30@mail.mcs.anl.gov> The one Ben asked us all to test: >http://www.ci.uchicago.edu/swift/tests/vdsk-0.1rc1.tar.gz At 01:15 PM 2/26/2007, Mihael Hategan wrote: >On Mon, 2007-02-26 at 13:05 -0600, Veronika V. Nefedova wrote: > > When I tried to run my working workflow with a new version, it gave me an > > exception: > >Which new version? > >Mihael > > > > > Warning: Task handler throws exception but does not set status > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > TaskHandler can only handle unsubmitted tasks > > at > > > org.globus.cog.abstraction.impl.common.task.CachingFileOperationTaskHandler.submit(CachingFileOperationTaskHandler.java:20) > > at > > > org.globus.cog.karajan.scheduler.submitQueue.NonBlockingSubmit.run(NonBlockingSubmit.java:78) > > at java.lang.Thread.run(Thread.java:534) > > > > [349] wiggum /sandbox/ydeng/alamines > \\ > > > > > > I do not have this happening with 070219 built. > > > > Nika > > > > At 06:12 AM 2/26/2007, Ben Clifford wrote: > > > > >On Mon, 26 Feb 2007, Ben Clifford wrote: > > > > > > > > > > > v0.1rc1 was built at the end of last week. 
please spend some time > testing > > > > > >here's the URL for download: > > > > > >http://www.ci.uchicago.edu/swift/tests/vdsk-0.1rc1.tar.gz > > > > > >-- > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From yongzh at cs.uchicago.edu Mon Feb 26 13:28:42 2007 From: yongzh at cs.uchicago.edu (Yong Zhao) Date: Mon, 26 Feb 2007 13:28:42 -0600 (CST) Subject: [Swift-devel] mapping structures In-Reply-To: References: Message-ID: right, but I remember that we decided not to separate them out at this time. Yong. On Mon, 26 Feb 2007, Ben Clifford wrote: > > but that's not part of mainstream swift - its part of custom application > support code, right? (i.e. it needed to be written for airsn) > > On Mon, 26 Feb 2007, Yong Zhao wrote: > > > there is an airsn_mapper that maps runs and volumes. > > > > Yong. > > > > On Mon, 26 Feb 2007, Ben Clifford wrote: > > > > > > > > I was thinking about some fmri stuff for a tutorial exercise and came up > > > with this question: > > > > > > .img and .hdr files come in the input in pairs, like this: > > > > > > $ ls Raw/ > > > anatomy1.hdr anatomy2.hdr anatomy3.hdr anatomy4.hdr reference.hdr > > > anatomy1.img anatomy2.img anatomy3.img anatomy4.img reference.img > > > > > > so from a data type perspective it makes sense to say something like: > > > > > > type inputimage { > > > file img; > > > file data; > > > } > > > > > > but then do any of the existing mappers let me map into this structure? or > > > do I have to write a custom mapper? 
> > > > > > (to torture syntax, express something like: > > > inputimg ref; > > > ref.img <"reference.img">; > > > ref.hdr <"reference.hdr">; > > > ) > > > > > > I think I can do this with the CSV mapper, with the input file pairs > > > specified as CSV rows; but this needs a CSV to be externally generated > > > from information that is already expressed in the metadata in the filename > > > so doesn't particularly help. > > > > > > -- > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > From hategan at mcs.anl.gov Mon Feb 26 13:39:18 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 26 Feb 2007 13:39:18 -0600 Subject: [Swift-devel] Re: test v0.1rc1 In-Reply-To: <6.0.0.22.2.20070226132104.05733c30@mail.mcs.anl.gov> References: <6.0.0.22.2.20070226125640.059527f0@mail.mcs.anl.gov> <1172517302.25410.0.camel@blabla.mcs.anl.gov> <6.0.0.22.2.20070226132104.05733c30@mail.mcs.anl.gov> Message-ID: <1172518758.26112.0.camel@blabla.mcs.anl.gov> That doesn't sound good. How do I reproduce this? Mihael On Mon, 2007-02-26 at 13:21 -0600, Veronika V. Nefedova wrote: > The one Ben asked us all to test: > > >http://www.ci.uchicago.edu/swift/tests/vdsk-0.1rc1.tar.gz > > At 01:15 PM 2/26/2007, Mihael Hategan wrote: > >On Mon, 2007-02-26 at 13:05 -0600, Veronika V. Nefedova wrote: > > > When I tried to run my working workflow with a new version, it gave me an > > > exception: > > > >Which new version? 
> > > >Mihael > > > > > > > > Warning: Task handler throws exception but does not set status > > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > > TaskHandler can only handle unsubmitted tasks > > > at > > > > > org.globus.cog.abstraction.impl.common.task.CachingFileOperationTaskHandler.submit(CachingFileOperationTaskHandler.java:20) > > > at > > > > > org.globus.cog.karajan.scheduler.submitQueue.NonBlockingSubmit.run(NonBlockingSubmit.java:78) > > > at java.lang.Thread.run(Thread.java:534) > > > > > > [349] wiggum /sandbox/ydeng/alamines > \\ > > > > > > > > > I do not have this happening with 070219 built. > > > > > > Nika > > > > > > At 06:12 AM 2/26/2007, Ben Clifford wrote: > > > > > > >On Mon, 26 Feb 2007, Ben Clifford wrote: > > > > > > > > > > > > > > v0.1rc1 was built at the end of last week. please spend some time > > testing > > > > > > > >here's the URL for download: > > > > > > > >http://www.ci.uchicago.edu/swift/tests/vdsk-0.1rc1.tar.gz > > > > > > > >-- > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > From benc at hawaga.org.uk Mon Feb 26 13:53:19 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 26 Feb 2007 19:53:19 +0000 (GMT) Subject: [Swift-devel] mapping structures In-Reply-To: References: Message-ID: On Mon, 26 Feb 2007, Yong Zhao wrote: > right, but I remember that we decided not to separate them out at this > time. not code-wise, no - but in terms of explaining to someone what is going on to embed an application inside there's a difference. that's pretty much what I wanted to check. 
--

From tiberius at ci.uchicago.edu Mon Feb 26 13:53:51 2007
From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun)
Date: Mon, 26 Feb 2007 13:53:51 -0600
Subject: [Swift-devel] Re: [VDL2-user] GridFTP timeout exception
In-Reply-To:
References:
Message-ID:

How do I know that a GridFTP client timeout occurs?
It seems that my SIDGrid workflow has stopped performing.
Normally it should process 200 parallel tasks, each of which produces
1GB in 28 files.

I am testing with 0.1 rc 1, but the same has happened to me before
(SVN checkout, v0.rc3, etc).

The workflow freezes after a while (after processing the first round
of jobs submitted? = I received 16G and that's it). The current
scenario is me using 3 teragrid sites (UC, Purdue, NCSA), but the same
behavior (workflow freeze) happened when I ran the workflow on
teraport only. Since it hung, I was always forced to terminate it, so
we never had a full SIDGrid run.

Any suggestions?

BTW, the NCSA problem is a non-issue, I solved it. The only other
small issue is taking full advantage of all the sites in the
sites.xml. And the big issue is what I listed above.

Next I will try running the workflow fully at the UC teragrid site.

Tibi


On 2/26/07, Ben Clifford wrote:
>
> On Mon, 26 Feb 2007, Mihael Hategan wrote:
>
> > On Fri, 2007-02-23 at 16:58 +0000, Ben Clifford wrote:
> > >
> > > On Fri, 23 Feb 2007, Mihael Hategan wrote:
> > >
> > > > Since this is non-functional (failing to shut down a GridFTP client
> > > > that's not in use any more), I think the message could be moved to info,
> > > > and the stack trace to debug.
> > >
> > > sounds good.
> >
> > Seems like it was at info for about a month now. I split it however to
> > only log the exception in debug.
>
> ok. I guess this error message came from something like 0rc3 as chad
> reported it originally I think.
> -- > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- Tiberiu (Tibi) Stef-Praun, PhD Research Staff, Computation Institute 5640 S. Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From hategan at mcs.anl.gov Mon Feb 26 13:55:22 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 26 Feb 2007 13:55:22 -0600 Subject: [Swift-devel] Re: [VDL2-user] GridFTP timeout exception In-Reply-To: References: Message-ID: <1172519722.27092.0.camel@blabla.mcs.anl.gov> Ip address maybe? On Mon, 2007-02-26 at 13:53 -0600, Tiberiu Stef-Praun wrote: > How do I know that a GridFTP client timeout occurs ? > It seems that my SIDGrid workflow has stopped performing. > Normally it should process 200 parallel tasks, each of which producess > 1GB in 28 files. > > I am testing with 0.1 rc 1, but the same has happened to me before > (SVn checkout, v0.rc3,etc). > > The workflow freezes after a while (after processing the first round > of jobs submitted? = I received 16G and that's it). The current > scenario is me using 3 teragrid sites (UC, Purdue, NCSA), but the same > behavior (workflow freeze) happened when I ran the workflow on > teraport only. Since It hang, I was always forced to terminate it, so > we never had a full SIDGrid run. > > Any suggestions ? > > BTW, the NCSA problem is a non-issue, I solved it. The only other > small issue is taking full advantage of all the sites in the > sites.xml. And the big issue is what I listed above. > > Next I will try running the workflow fully at the UC teragrid site. 
> > Tibi > > > On 2/26/07, Ben Clifford wrote: > > > > On Mon, 26 Feb 2007, Mihael Hategan wrote: > > > > > On Fri, 2007-02-23 at 16:58 +0000, Ben Clifford wrote: > > > > > > > > On Fri, 23 Feb 2007, Mihael Hategan wrote: > > > > > > > > > Since this is non-functional (failing to shut down a GridFTP client > > > > > that's not in use any more), I think the message could be moved to info, > > > > > and the stack trace to debug. > > > > > > > > sounds good. > > > > > > Seems like it was at info for about a month now. I split it however to > > > only log the exception in debug. > > > > ok. I guess this error message came from something like 0rc3 as chad > > reported it originally I think. > > -- > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > From benc at hawaga.org.uk Mon Feb 26 13:57:51 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 26 Feb 2007 19:57:51 +0000 (GMT) Subject: [Swift-devel] Re: [VDL2-user] GridFTP timeout exception In-Reply-To: References: Message-ID: On Mon, 26 Feb 2007, Tiberiu Stef-Praun wrote: > BTW, the NCSA problem is a non-issue, I solved it. how? > The only other > small issue is taking full advantage of all the sites in the > sites.xml. For 0.1, each site needs to be usable as the single site to run everything one. It doesn't matter so much (at all) that you can't use all sites in the same run... -- From yongzh at cs.uchicago.edu Mon Feb 26 13:58:43 2007 From: yongzh at cs.uchicago.edu (Yong Zhao) Date: Mon, 26 Feb 2007 13:58:43 -0600 (CST) Subject: [Swift-devel] mapping structures In-Reply-To: References: Message-ID: In that sense, pretty much each customized type that can not be mapped from the csv_mapper should have a mapper implementation. Yong. 
On Mon, 26 Feb 2007, Ben Clifford wrote: > > > On Mon, 26 Feb 2007, Yong Zhao wrote: > > > right, but I remember that we decided not to separate them out at this > > time. > > > not code-wise, no - but in terms of explaining to someone what is going on > to embed an application inside there's a difference. that's pretty much > what I wanted to check. > > -- > From benc at hawaga.org.uk Mon Feb 26 14:00:12 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 26 Feb 2007 20:00:12 +0000 (GMT) Subject: [Swift-devel] mapping structures In-Reply-To: References: Message-ID: On Mon, 26 Feb 2007, Yong Zhao wrote: > In that sense, pretty much each customized type that can not be mapped > from the csv_mapper should have a mapper implementation. right. I guess mapper implementation documention needs to happen. Want to write some for 0.2? ;-) -- From hategan at mcs.anl.gov Mon Feb 26 14:01:23 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 26 Feb 2007 14:01:23 -0600 Subject: [Swift-devel] Re: [VDL2-user] GridFTP timeout exception In-Reply-To: References: <1172519722.27092.0.camel@blabla.mcs.anl.gov> Message-ID: <1172520083.27374.0.camel@blabla.mcs.anl.gov> Well, post the log. On Mon, 2007-02-26 at 14:00 -0600, Tiberiu Stef-Praun wrote: > I fixed that, I am getting back some of the results. > Aparently the wf is stuck at the point where it needs to delete the > remote files > Although that might not be the actual root of all evils, because when > running on a single site (teraport), several iterations of sets of > jobs were sent out before the wf stopped completely. > > > > On 2/26/07, Mihael Hategan wrote: > > Ip address maybe? > > > > On Mon, 2007-02-26 at 13:53 -0600, Tiberiu Stef-Praun wrote: > > > How do I know that a GridFTP client timeout occurs ? > > > It seems that my SIDGrid workflow has stopped performing. > > > Normally it should process 200 parallel tasks, each of which producess > > > 1GB in 28 files. 
> > > > > > I am testing with 0.1 rc 1, but the same has happened to me before > > > (SVn checkout, v0.rc3,etc). > > > > > > The workflow freezes after a while (after processing the first round > > > of jobs submitted? = I received 16G and that's it). The current > > > scenario is me using 3 teragrid sites (UC, Purdue, NCSA), but the same > > > behavior (workflow freeze) happened when I ran the workflow on > > > teraport only. Since It hang, I was always forced to terminate it, so > > > we never had a full SIDGrid run. > > > > > > Any suggestions ? > > > > > > BTW, the NCSA problem is a non-issue, I solved it. The only other > > > small issue is taking full advantage of all the sites in the > > > sites.xml. And the big issue is what I listed above. > > > > > > Next I will try running the workflow fully at the UC teragrid site. > > > > > > Tibi > > > > > > > > > On 2/26/07, Ben Clifford wrote: > > > > > > > > On Mon, 26 Feb 2007, Mihael Hategan wrote: > > > > > > > > > On Fri, 2007-02-23 at 16:58 +0000, Ben Clifford wrote: > > > > > > > > > > > > On Fri, 23 Feb 2007, Mihael Hategan wrote: > > > > > > > > > > > > > Since this is non-functional (failing to shut down a GridFTP client > > > > > > > that's not in use any more), I think the message could be moved to info, > > > > > > > and the stack trace to debug. > > > > > > > > > > > > sounds good. > > > > > > > > > > Seems like it was at info for about a month now. I split it however to > > > > > only log the exception in debug. > > > > > > > > ok. I guess this error message came from something like 0rc3 as chad > > > > reported it originally I think. 
> > > > -- > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > From tiberius at ci.uchicago.edu Mon Feb 26 14:00:08 2007 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Mon, 26 Feb 2007 14:00:08 -0600 Subject: [Swift-devel] Re: [VDL2-user] GridFTP timeout exception In-Reply-To: <1172519722.27092.0.camel@blabla.mcs.anl.gov> References: <1172519722.27092.0.camel@blabla.mcs.anl.gov> Message-ID: I fixed that, I am getting back some of the results. Aparently the wf is stuck at the point where it needs to delete the remote files Although that might not be the actual root of all evils, because when running on a single site (teraport), several iterations of sets of jobs were sent out before the wf stopped completely. On 2/26/07, Mihael Hategan wrote: > Ip address maybe? > > On Mon, 2007-02-26 at 13:53 -0600, Tiberiu Stef-Praun wrote: > > How do I know that a GridFTP client timeout occurs ? > > It seems that my SIDGrid workflow has stopped performing. > > Normally it should process 200 parallel tasks, each of which producess > > 1GB in 28 files. > > > > I am testing with 0.1 rc 1, but the same has happened to me before > > (SVn checkout, v0.rc3,etc). > > > > The workflow freezes after a while (after processing the first round > > of jobs submitted? = I received 16G and that's it). The current > > scenario is me using 3 teragrid sites (UC, Purdue, NCSA), but the same > > behavior (workflow freeze) happened when I ran the workflow on > > teraport only. Since It hang, I was always forced to terminate it, so > > we never had a full SIDGrid run. > > > > Any suggestions ? > > > > BTW, the NCSA problem is a non-issue, I solved it. The only other > > small issue is taking full advantage of all the sites in the > > sites.xml. And the big issue is what I listed above. 
> > > > Next I will try running the workflow fully at the UC teragrid site. > > > > Tibi > > > > > > On 2/26/07, Ben Clifford wrote: > > > > > > On Mon, 26 Feb 2007, Mihael Hategan wrote: > > > > > > > On Fri, 2007-02-23 at 16:58 +0000, Ben Clifford wrote: > > > > > > > > > > On Fri, 23 Feb 2007, Mihael Hategan wrote: > > > > > > > > > > > Since this is non-functional (failing to shut down a GridFTP client > > > > > > that's not in use any more), I think the message could be moved to info, > > > > > > and the stack trace to debug. > > > > > > > > > > sounds good. > > > > > > > > Seems like it was at info for about a month now. I split it however to > > > > only log the exception in debug. > > > > > > ok. I guess this error message came from something like 0rc3 as chad > > > reported it originally I think. > > > -- > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > -- Tiberiu (Tibi) Stef-Praun, PhD Research Staff, Computation Institute 5640 S. Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From nefedova at mcs.anl.gov Mon Feb 26 14:17:55 2007 From: nefedova at mcs.anl.gov (Veronika V. Nefedova) Date: Mon, 26 Feb 2007 14:17:55 -0600 Subject: [Swift-devel] Re: test v0.1rc1 In-Reply-To: <1172518758.26112.0.camel@blabla.mcs.anl.gov> References: <6.0.0.22.2.20070226125640.059527f0@mail.mcs.anl.gov> <1172517302.25410.0.camel@blabla.mcs.anl.gov> <6.0.0.22.2.20070226132104.05733c30@mail.mcs.anl.gov> <1172518758.26112.0.camel@blabla.mcs.anl.gov> Message-ID: <6.0.0.22.2.20070226140905.05adc960@mail.mcs.anl.gov> You can try to run my application, or look in the logs. I ran it all on wiggum. 
The log is: /sandbox/ydeng/alamines/swift-MolDyn-8q6ygr7cy15c2.log the dtm file I am running is /sandbox/ydeng/alamines/swift-MolDyn.dtm Nika At 01:39 PM 2/26/2007, Mihael Hategan wrote: >That doesn't sound good. How do I reproduce this? > >Mihael > >On Mon, 2007-02-26 at 13:21 -0600, Veronika V. Nefedova wrote: > > The one Ben asked us all to test: > > > > >http://www.ci.uchicago.edu/swift/tests/vdsk-0.1rc1.tar.gz > > > > At 01:15 PM 2/26/2007, Mihael Hategan wrote: > > >On Mon, 2007-02-26 at 13:05 -0600, Veronika V. Nefedova wrote: > > > > When I tried to run my working workflow with a new version, it gave > me an > > > > exception: > > > > > >Which new version? > > > > > >Mihael > > > > > > > > > > > Warning: Task handler throws exception but does not set status > > > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > > > TaskHandler can only handle unsubmitted tasks > > > > at > > > > > > > > org.globus.cog.abstraction.impl.common.task.CachingFileOperationTaskHandler.submit(CachingFileOperationTaskHandler.java:20) > > > > at > > > > > > > > org.globus.cog.karajan.scheduler.submitQueue.NonBlockingSubmit.run(NonBlockingSubmit.java:78) > > > > at java.lang.Thread.run(Thread.java:534) > > > > > > > > [349] wiggum /sandbox/ydeng/alamines > \\ > > > > > > > > > > > > I do not have this happening with 070219 built. > > > > > > > > Nika > > > > > > > > At 06:12 AM 2/26/2007, Ben Clifford wrote: > > > > > > > > >On Mon, 26 Feb 2007, Ben Clifford wrote: > > > > > > > > > > > > > > > > > v0.1rc1 was built at the end of last week. 
please spend some time > > > testing > > > > > > > > > >here's the URL for download: > > > > > > > > > >http://www.ci.uchicago.edu/swift/tests/vdsk-0.1rc1.tar.gz > > > > > > > > > >-- > > > > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > >

From benc at hawaga.org.uk Mon Feb 26 14:19:27 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 26 Feb 2007 20:19:27 +0000 (GMT)
Subject: [Swift-devel] remote file/directory stuff (bug 22)
Message-ID:

I want to work out what the 0.2-scoped work for this should be.

The description in bugzilla for bug 22 is a little misleading, in as much
as I want this to solve a specific app problem that Nika is having, and
I've changed my understanding of what her problem is since I submitted
that bug.

There's a broad issue of how to deal with applications that want a
particular remote directory/filename layout for input and output; in the
shorter term, I'm using Nika's app as the motivation for 0.2.

This is the general problem that Nika has, as far as I see it:

One of the programs insists on placing output files in the same directory
as the input, with each output name being the input file name plus a
suffix, with an interface that looks something like this:

# the input files in a directory
$ ls foodir/
m002_am1

# run the job
$ antch foodir/m002_am1

# now the directory that had the input file has a bunch of
# intermediate files, as well as the output files
$ ls foodir/
m002_am1 m002_am1.crd m002_am1.prm m002_am1.rtf

Swift cannot tell this program where to place the output files on the
remote end. However, given knowledge about where swift has staged the
input file into the job directory, it is straightforward to specify where
the interesting output files have appeared under the job directory.
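
To make that last point concrete, here is a minimal Python sketch (not Swift
code) of deriving the output locations from the staged input path. The
function name and the fixed suffix list are assumptions for illustration;
the suffixes are simply the ones visible in the `ls foodir/` listing above.

```python
from pathlib import PurePosixPath

# Suffixes seen in the directory listing above; treating them as a fixed
# list is an assumption made for this sketch.
OUTPUT_SUFFIXES = (".crd", ".prm", ".rtf")


def expected_outputs(staged_input, suffixes=OUTPUT_SUFFIXES):
    """Return the paths where the program is expected to leave its outputs.

    The program writes next to its input and appends a suffix to the
    input's file name, so each output path is the input path plus a suffix.
    """
    staged = PurePosixPath(staged_input)
    return [str(staged.parent / (staged.name + suffix)) for suffix in suffixes]


if __name__ == "__main__":
    for path in expected_outputs("foodir/m002_am1"):
        print(path)
```

Nothing here depends on where the submit side maps the outputs; it only needs
the relative path at which the input was staged.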
At present the syntax in swift is this: (actually I've cut out other
irrelevant parameters)

(file k, file m, file n) Antchmbr (file l) {
  app {
    antch "-i" @l ;
  }
}

If by chance k, m, and n happened to be mapped to submit-side paths that
coincide in the right way with the input path of l, then this happens to
work. But in the general case, with k, m and n being mapped to arbitrary
locations, this does not work, because at present swift assumes that the
app is outputting to the (relative remote) paths k, m, n. This cannot be
assumed, because that information about k, m, n is never passed to the
application.

Here are a few potential solutions for 22. I'm interested in opinions on
implementation ease, applicability to other people's app problems that
they have now, use to end users in general, and of course other
solutions.

1. assert that remote end application-specific wrappers are the way to go,
with those remote end wrappers remoulding the program interface to
something more palatable to swift (in this case by moving input and
output files around appropriately, on either side of an application
invocation).

2. extend the SwiftScript app {} syntax to allow different expression of
output files, by specifying that swift should expect to find them in more
explicitly named files, rather than passing on the app commandline.

Thus we might change the above app code fragment to something like this:

(file k, file m, file n) Antchmbr (file l) {
  app {
    antch "-i" @l ;
    k <@strcat(@l,".crd");
    m <@strcat(@l,".prm");
    n <@strcat(@l,".rtf");
  }
}

which says that when the application has run, the three output files will
be located in/under the run directory in those locations. This would then
give swift sufficient information to locate the three output files rather
than assuming the output names.

My opinions:

sol1. I dislike the concept of remote wrappers from a user's perspective -
conceptually, app {} blocks are where we tell swift how to interface to
programs.
Using remote wrappers basically distributes that information between the
app {} block and a remote wrapper; I find that unpleasant and would rather
keep the information in one place - the app {} block.

sol2. more development work. from a user perspective I find it more
pleasing because it keeps the interface description in one place, the app
block. it introduces something that starts to look like mappers for the
execution side of things - this might be an interesting longer term
strategy for dealing with non-file-based data too; or might not -
Yong/Mike have probably thought about this the most, so I'm interested in
comment there.

Like I said at the top, I'm most interested in scoping 0.2 work, rather
than long term stuff, but I want it to be coherent with the long term
vision.

--

From benc at hawaga.org.uk Mon Feb 26 14:20:27 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 26 Feb 2007 20:20:27 +0000 (GMT)
Subject: [Swift-devel] bug 23
In-Reply-To:
References:
Message-ID:

Something else which ties in is the whole exception discussion; however, I
realise I don't actually understand the problem very well in the sense of
what the application outputs. It would help my understanding if the app
person for the app(s) that need this (which I think is mihael for BRIC,
according to rumour?) could give a summary, in the style of my bug 22
summary of nika's problem, of what files exist before and after (in both
the 'ok' and 'exceptional' cases), and how that should affect the on-going
workflow, in terms of swift pseudo code fragments.
-- From tiberius at ci.uchicago.edu Mon Feb 26 14:22:48 2007 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Mon, 26 Feb 2007 14:22:48 -0600 Subject: [Swift-devel] Re: [VDL2-user] GridFTP timeout exception In-Reply-To: References: <1172519722.27092.0.camel@blabla.mcs.anl.gov> <1172520083.27374.0.camel@blabla.mcs.anl.gov> Message-ID: There is a limit on attachments on the mailing list, so you can see the log here: http://teraport.uchicago.edu/~tiberius/sid-wf-pers-channel-type-rcm0oqd5bk4l1.log On 2/26/07, Tiberiu Stef-Praun wrote: > Here it is, attached. > > I run the command this way: > swift -d -v sid-wf-pers-channel-type.dtm > on teraport in /home/tiberius/scratch > > > On 2/26/07, Mihael Hategan wrote: > > Well, post the log. > > > > On Mon, 2007-02-26 at 14:00 -0600, Tiberiu Stef-Praun wrote: > > > I fixed that, I am getting back some of the results. > > > Aparently the wf is stuck at the point where it needs to delete the > > > remote files > > > Although that might not be the actual root of all evils, because when > > > running on a single site (teraport), several iterations of sets of > > > jobs were sent out before the wf stopped completely. > > > > > > > > > > > > On 2/26/07, Mihael Hategan wrote: > > > > Ip address maybe? > > > > > > > > On Mon, 2007-02-26 at 13:53 -0600, Tiberiu Stef-Praun wrote: > > > > > How do I know that a GridFTP client timeout occurs ? > > > > > It seems that my SIDGrid workflow has stopped performing. > > > > > Normally it should process 200 parallel tasks, each of which producess > > > > > 1GB in 28 files. > > > > > > > > > > I am testing with 0.1 rc 1, but the same has happened to me before > > > > > (SVn checkout, v0.rc3,etc). > > > > > > > > > > The workflow freezes after a while (after processing the first round > > > > > of jobs submitted? = I received 16G and that's it). 
The current > > > > > scenario is me using 3 teragrid sites (UC, Purdue, NCSA), but the same > > > > > behavior (workflow freeze) happened when I ran the workflow on > > > > > teraport only. Since It hang, I was always forced to terminate it, so > > > > > we never had a full SIDGrid run. > > > > > > > > > > Any suggestions ? > > > > > > > > > > BTW, the NCSA problem is a non-issue, I solved it. The only other > > > > > small issue is taking full advantage of all the sites in the > > > > > sites.xml. And the big issue is what I listed above. > > > > > > > > > > Next I will try running the workflow fully at the UC teragrid site. > > > > > > > > > > Tibi > > > > > > > > > > > > > > > On 2/26/07, Ben Clifford wrote: > > > > > > > > > > > > On Mon, 26 Feb 2007, Mihael Hategan wrote: > > > > > > > > > > > > > On Fri, 2007-02-23 at 16:58 +0000, Ben Clifford wrote: > > > > > > > > > > > > > > > > On Fri, 23 Feb 2007, Mihael Hategan wrote: > > > > > > > > > > > > > > > > > Since this is non-functional (failing to shut down a GridFTP client > > > > > > > > > that's not in use any more), I think the message could be moved to info, > > > > > > > > > and the stack trace to debug. > > > > > > > > > > > > > > > > sounds good. > > > > > > > > > > > > > > Seems like it was at info for about a month now. I split it however to > > > > > > > only log the exception in debug. > > > > > > > > > > > > ok. I guess this error message came from something like 0rc3 as chad > > > > > > reported it originally I think. > > > > > > -- > > > > > > > > > > > > _______________________________________________ > > > > > > Swift-devel mailing list > > > > > > Swift-devel at ci.uchicago.edu > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > Tiberiu (Tibi) Stef-Praun, PhD > Research Staff, Computation Institute > 5640 S. 
Ellis Ave, #405 > University of Chicago > http://www-unix.mcs.anl.gov/~tiberius/ > > -- Tiberiu (Tibi) Stef-Praun, PhD Research Staff, Computation Institute 5640 S. Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From nefedova at mcs.anl.gov Mon Feb 26 14:26:40 2007 From: nefedova at mcs.anl.gov (Veronika V. Nefedova) Date: Mon, 26 Feb 2007 14:26:40 -0600 Subject: [Swift-devel] Re: test v0.1rc1 In-Reply-To: <6.0.0.22.2.20070226140905.05adc960@mail.mcs.anl.gov> References: <6.0.0.22.2.20070226125640.059527f0@mail.mcs.anl.gov> <1172517302.25410.0.camel@blabla.mcs.anl.gov> <6.0.0.22.2.20070226132104.05733c30@mail.mcs.anl.gov> <1172518758.26112.0.camel@blabla.mcs.anl.gov> <6.0.0.22.2.20070226140905.05adc960@mail.mcs.anl.gov> Message-ID: <6.0.0.22.2.20070226142008.05ee77f0@mail.mcs.anl.gov> And now its getting interesting! I have now the same failure (as below) with 070219 as I had on localhost with v0.1rc1 *BUT* when running on TG. Failed at the same point (while trying to run the last app in the workflow), with the same exceptions. Strange that 070219 worked on localhost (and still working). The log is on wiggum: /sandbox/ydeng/alamines/swift-MolDyn-690y7r1skc8z0.log 2007-02-26 14:10:16,543 INFO vdl:execute2 Running job chrm-rmnoet7i chrm with arguments [system:solv_m001, title:solv, stitle:m001, rtffile:parm03_gaff_all.rtf, paramfile:parm03_gaffnb_all.prm, gaff:m001_am1, nwater:400, ligcrd:lyz, rforce:0, iseed:3131887, rwater:15, nstep:100, minstep:100, skipstep:100, startstep:10000] in swift-MolDyn-690y7r1skc8z0/chrm-rmnoet7i on TG-NCSA 2007-02-26 14:11:18,586 DEBUG vdl:execute2 Application exception: Job chrm failed with an exit code of 174 All input files are staged in... Nika At 02:17 PM 2/26/2007, Veronika V. Nefedova wrote: >You can try to run my application, or look in the logs. I ran it all on >wiggum. 
The log is: >/sandbox/ydeng/alamines/swift-MolDyn-8q6ygr7cy15c2.log > >the dtm file I am running is /sandbox/ydeng/alamines/swift-MolDyn.dtm > >Nika > >At 01:39 PM 2/26/2007, Mihael Hategan wrote: >>That doesn't sound good. How do I reproduce this? >> >>Mihael >> >>On Mon, 2007-02-26 at 13:21 -0600, Veronika V. Nefedova wrote: >> > The one Ben asked us all to test: >> > >> > >http://www.ci.uchicago.edu/swift/tests/vdsk-0.1rc1.tar.gz >> > >> > At 01:15 PM 2/26/2007, Mihael Hategan wrote: >> > >On Mon, 2007-02-26 at 13:05 -0600, Veronika V. Nefedova wrote: >> > > > When I tried to run my working workflow with a new version, it >> gave me an >> > > > exception: >> > > >> > >Which new version? >> > > >> > >Mihael >> > > >> > > > >> > > > Warning: Task handler throws exception but does not set status >> > > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: >> > > > TaskHandler can only handle unsubmitted tasks >> > > > at >> > > > >> > > >> org.globus.cog.abstraction.impl.common.task.CachingFileOperationTaskHandler.submit(CachingFileOperationTaskHandler.java:20) >> > > > at >> > > > >> > > >> org.globus.cog.karajan.scheduler.submitQueue.NonBlockingSubmit.run(NonBlockingSubmit.java:78) >> > > > at java.lang.Thread.run(Thread.java:534) >> > > > >> > > > [349] wiggum /sandbox/ydeng/alamines > \\ >> > > > >> > > > >> > > > I do not have this happening with 070219 built. >> > > > >> > > > Nika >> > > > >> > > > At 06:12 AM 2/26/2007, Ben Clifford wrote: >> > > > >> > > > >On Mon, 26 Feb 2007, Ben Clifford wrote: >> > > > > >> > > > > > >> > > > > > v0.1rc1 was built at the end of last week. 
please spend some time >> > > testing >> > > > > >> > > > >here's the URL for download: >> > > > > >> > > > >http://www.ci.uchicago.edu/swift/tests/vdsk-0.1rc1.tar.gz >> > > > > >> > > > >-- >> > > > >> > > > >> > > > _______________________________________________ >> > > > Swift-devel mailing list >> > > > Swift-devel at ci.uchicago.edu >> > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > > > >> > >> > > > >_______________________________________________ >Swift-devel mailing list >Swift-devel at ci.uchicago.edu >http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From tiberius at ci.uchicago.edu Mon Feb 26 14:28:01 2007 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Mon, 26 Feb 2007 14:28:01 -0600 Subject: [Swift-devel] bug 23 In-Reply-To: References: Message-ID: I wrote something. On 2/26/07, Ben Clifford wrote: > > Something else which ties in is the whole exception discussion; however, I > realise I don't actually understand the problem very well in the sense of > what the application outputs. It would be nice for my understanding of > that for the app person for the app(s) that need (which I think is mihael > for BRIC, according to rumour?) could give a summary in the style of my > bug 22 summary of nika's problem about what files exist before and after > (in both the 'ok' and 'exceptional' cases), and how that should affect the > on-going workflow, in the terms of swift pseudo code fragments. > > -- > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- Tiberiu (Tibi) Stef-Praun, PhD Research Staff, Computation Institute 5640 S. 
Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From benc at hawaga.org.uk Mon Feb 26 14:28:51 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 26 Feb 2007 20:28:51 +0000 (GMT) Subject: [Swift-devel] bug 23 In-Reply-To: References: Message-ID: On Mon, 26 Feb 2007, Tiberiu Stef-Praun wrote: > I wrote something. ... -- From hategan at mcs.anl.gov Mon Feb 26 14:27:56 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 26 Feb 2007 14:27:56 -0600 Subject: [Swift-devel] remote file/directory stuff (bug 22) In-Reply-To: References: Message-ID: <1172521676.27811.9.camel@blabla.mcs.anl.gov> On Mon, 2007-02-26 at 20:19 +0000, Ben Clifford wrote: [...] > 2. extend the SwiftScript app {} syntax to allow different expression of > output files, by specifying that swift should expect to find them in more > explicitly named files, rather than passing on the app commandline. > > Thus we might change the above app code fragment to something like this: > > (file k, file m, file n) Antchmbr (file l) { > app { > antch "-i" @l ; > k <@strcat(@l,".crd"); > m <@strcat(@l,".prd"); > n <@strcat(@l,".rtf"); > } > } Right. This would be the "application mapper". Now, there are a few things here: We may also want to do the same to the input, because some even more twisted apps will not even accept that as a parameter. So:

(file k, file m, file n) myapp(file l) {
  app {
    l>"input.txt";
    myapp;
    k<"output.crd";
    m<"output.prd";
    n<"output.rtf";
  }
}

This may become a little trickier when inputs (or even outputs) are arrays, so we may need nicer schemes:

(file o) myapp(file i[]) {
  app {
    i[x=*] > "input"+$1;   (or something like that)
    myapp;
    ...
  }
}

Mihael > > which says that when the application has run, the three output files will > be located in/under the run directory in those locations. This would then > give swift sufficient information to locate the three output files rather > than assuming the output names. > > [...] 
From hategan at mcs.anl.gov Mon Feb 26 14:31:15 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 26 Feb 2007 14:31:15 -0600 Subject: [Swift-devel] bug 23 In-Reply-To: References: Message-ID: <1172521875.27811.13.camel@blabla.mcs.anl.gov> On Mon, 2007-02-26 at 20:20 +0000, Ben Clifford wrote: > Something else which ties in is the whole exception discussion; however, I > realise I don't actually understand the problem very well in the sense of > what the application outputs. It would be nice for my understanding of > that for the app person for the app(s) that need (which I think is mihael > for BRIC, according to rumour?) could give a summary in the style of my > bug 22 summary of nika's problem about what files exist before and after > (in both the 'ok' and 'exceptional' cases), and how that should affect the > on-going workflow, in the terms of swift pseudo code fragments. There is no rule, but a wrapper can be used to adapt arbitrary behavior to a consistent thing. In the BRIC case, the output file exists, but it contains a specific string (something to the extent of "no clusters found"). In general, however, any number of schemes may be adopted by applications: a message on STD*, an exit code !=0, no output file produced, or a combination of these.
> From hategan at mcs.anl.gov Mon Feb 26 14:31:59 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 26 Feb 2007 14:31:59 -0600 Subject: [Swift-devel] Re: [VDL2-user] GridFTP timeout exception In-Reply-To: References: <1172519722.27092.0.camel@blabla.mcs.anl.gov> <1172520083.27374.0.camel@blabla.mcs.anl.gov> Message-ID: <1172521919.27811.15.camel@blabla.mcs.anl.gov> There is no such thing: hategan at ci.uchicago.edu On Mon, 2007-02-26 at 14:22 -0600, Tiberiu Stef-Praun wrote: > There is a limit on attachments on the mailing list, so you can see > the log here: > http://teraport.uchicago.edu/~tiberius/sid-wf-pers-channel-type-rcm0oqd5bk4l1.log > > > On 2/26/07, Tiberiu Stef-Praun wrote: > > Here it is, attached. > > > > I run the command this way: > > swift -d -v sid-wf-pers-channel-type.dtm > > on teraport in /home/tiberius/scratch > > > > > > On 2/26/07, Mihael Hategan wrote: > > > Well, post the log. > > > > > > On Mon, 2007-02-26 at 14:00 -0600, Tiberiu Stef-Praun wrote: > > > > I fixed that, I am getting back some of the results. > > > > Aparently the wf is stuck at the point where it needs to delete the > > > > remote files > > > > Although that might not be the actual root of all evils, because when > > > > running on a single site (teraport), several iterations of sets of > > > > jobs were sent out before the wf stopped completely. > > > > > > > > > > > > > > > > On 2/26/07, Mihael Hategan wrote: > > > > > Ip address maybe? > > > > > > > > > > On Mon, 2007-02-26 at 13:53 -0600, Tiberiu Stef-Praun wrote: > > > > > > How do I know that a GridFTP client timeout occurs ? > > > > > > It seems that my SIDGrid workflow has stopped performing. > > > > > > Normally it should process 200 parallel tasks, each of which producess > > > > > > 1GB in 28 files. > > > > > > > > > > > > I am testing with 0.1 rc 1, but the same has happened to me before > > > > > > (SVn checkout, v0.rc3,etc). 
> > > > > > > > > > > > The workflow freezes after a while (after processing the first round > > > > > > of jobs submitted? = I received 16G and that's it). The current > > > > > > scenario is me using 3 teragrid sites (UC, Purdue, NCSA), but the same > > > > > > behavior (workflow freeze) happened when I ran the workflow on > > > > > > teraport only. Since It hang, I was always forced to terminate it, so > > > > > > we never had a full SIDGrid run. > > > > > > > > > > > > Any suggestions ? > > > > > > > > > > > > BTW, the NCSA problem is a non-issue, I solved it. The only other > > > > > > small issue is taking full advantage of all the sites in the > > > > > > sites.xml. And the big issue is what I listed above. > > > > > > > > > > > > Next I will try running the workflow fully at the UC teragrid site. > > > > > > > > > > > > Tibi > > > > > > > > > > > > > > > > > > On 2/26/07, Ben Clifford wrote: > > > > > > > > > > > > > > On Mon, 26 Feb 2007, Mihael Hategan wrote: > > > > > > > > > > > > > > > On Fri, 2007-02-23 at 16:58 +0000, Ben Clifford wrote: > > > > > > > > > > > > > > > > > > On Fri, 23 Feb 2007, Mihael Hategan wrote: > > > > > > > > > > > > > > > > > > > Since this is non-functional (failing to shut down a GridFTP client > > > > > > > > > > that's not in use any more), I think the message could be moved to info, > > > > > > > > > > and the stack trace to debug. > > > > > > > > > > > > > > > > > > sounds good. > > > > > > > > > > > > > > > > Seems like it was at info for about a month now. I split it however to > > > > > > > > only log the exception in debug. > > > > > > > > > > > > > > ok. I guess this error message came from something like 0rc3 as chad > > > > > > > reported it originally I think. 
> > > > > > > -- > > > > > > > > > > > > > > _______________________________________________ > > > > > > > Swift-devel mailing list > > > > > > > Swift-devel at ci.uchicago.edu > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > Tiberiu (Tibi) Stef-Praun, PhD > > Research Staff, Computation Institute > > 5640 S. Ellis Ave, #405 > > University of Chicago > > http://www-unix.mcs.anl.gov/~tiberius/ > > > > > > From benc at hawaga.org.uk Mon Feb 26 14:34:35 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 26 Feb 2007 20:34:35 +0000 (GMT) Subject: [Swift-devel] bug 23 In-Reply-To: <1172521875.27811.13.camel@blabla.mcs.anl.gov> References: <1172521875.27811.13.camel@blabla.mcs.anl.gov> Message-ID: On Mon, 26 Feb 2007, Mihael Hategan wrote: > In the BRIC case, the output file exists, but it contains a specific > string (something to the extent of "no clusters found"). So what goes in the output file in the successful case? -- From hategan at mcs.anl.gov Mon Feb 26 14:33:14 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 26 Feb 2007 14:33:14 -0600 Subject: [Swift-devel] Re: test v0.1rc1 In-Reply-To: <6.0.0.22.2.20070226142008.05ee77f0@mail.mcs.anl.gov> References: <6.0.0.22.2.20070226125640.059527f0@mail.mcs.anl.gov> <1172517302.25410.0.camel@blabla.mcs.anl.gov> <6.0.0.22.2.20070226132104.05733c30@mail.mcs.anl.gov> <1172518758.26112.0.camel@blabla.mcs.anl.gov> <6.0.0.22.2.20070226140905.05adc960@mail.mcs.anl.gov> <6.0.0.22.2.20070226142008.05ee77f0@mail.mcs.anl.gov> Message-ID: <1172521994.27811.17.camel@blabla.mcs.anl.gov> Wait, because I'm missing something. Wasn't the error supposed to be "TaskHandler can only handle unsubmitted tasks"? On Mon, 2007-02-26 at 14:26 -0600, Veronika V. Nefedova wrote: > And now its getting interesting! 
> [...] From benc at hawaga.org.uk Mon Feb 26 14:35:15 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 26 Feb 2007 20:35:15 +0000 (GMT) Subject: [Swift-devel] Re: [VDL2-user] GridFTP timeout exception In-Reply-To: <1172521919.27811.15.camel@blabla.mcs.anl.gov> References: <1172519722.27092.0.camel@blabla.mcs.anl.gov> <1172520083.27374.0.camel@blabla.mcs.anl.gov>
<1172521919.27811.15.camel@blabla.mcs.anl.gov> Message-ID: On Mon, 26 Feb 2007, Mihael Hategan wrote: > There is no such thing: hategan at ci.uchicago.edu I know - I added it by CC by accident. Pretty much everyone who is at CI but doesn't have active CI email address ends up with me doing that to them every so often. -- From nefedova at mcs.anl.gov Mon Feb 26 14:37:15 2007 From: nefedova at mcs.anl.gov (Veronika V. Nefedova) Date: Mon, 26 Feb 2007 14:37:15 -0600 Subject: [Swift-devel] Re: test v0.1rc1 In-Reply-To: <1172521994.27811.17.camel@blabla.mcs.anl.gov> References: <6.0.0.22.2.20070226125640.059527f0@mail.mcs.anl.gov> <1172517302.25410.0.camel@blabla.mcs.anl.gov> <6.0.0.22.2.20070226132104.05733c30@mail.mcs.anl.gov> <1172518758.26112.0.camel@blabla.mcs.anl.gov> <6.0.0.22.2.20070226140905.05adc960@mail.mcs.anl.gov> <6.0.0.22.2.20070226142008.05ee77f0@mail.mcs.anl.gov> <1172521994.27811.17.camel@blabla.mcs.anl.gov> Message-ID: <6.0.0.22.2.20070226143643.05e74cc0@mail.mcs.anl.gov> Yes, I didn't paste it -- its all in the log. If you'd like I can send you the log as an attachment... Nika At 02:33 PM 2/26/2007, Mihael Hategan wrote: >Wait, because I'm missing something. Wasn't the error supposed to be >"TaskHandler can only handle unsubmitted tasks"? > >On Mon, 2007-02-26 at 14:26 -0600, Veronika V. Nefedova wrote: > > And now its getting interesting! > > > > I have now the same failure (as below) with 070219 as I had on localhost > > with v0.1rc1 *BUT* when running on TG. Failed at the same point (while > > trying to run the last app in the workflow), with the same exceptions. > > Strange that 070219 worked on localhost (and still working). 
> > [...] From hategan at mcs.anl.gov Mon Feb 26 14:35:27 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 26 Feb 2007 14:35:27 -0600 Subject: [Swift-devel] bug 23 In-Reply-To: References: <1172521875.27811.13.camel@blabla.mcs.anl.gov> Message-ID: <1172522128.27811.20.camel@blabla.mcs.anl.gov> On Mon, 2007-02-26 at 20:34 +0000, Ben Clifford wrote: > > On Mon, 26 Feb 2007, Mihael Hategan wrote: > > > In the BRIC case, the output file exists, but it contains a specific > > string (something to the extent of "no clusters found"). > > So what goes in the output file in the successful case? Digits. Lots of digits. Separated by whitespace. And an occasional ".". And certainly the string "no clusters found" is absent in that case.
> From benc at hawaga.org.uk Mon Feb 26 14:38:54 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 26 Feb 2007 20:38:54 +0000 (GMT) Subject: [Swift-devel] bug 23 In-Reply-To: <1172522128.27811.20.camel@blabla.mcs.anl.gov> References: <1172521875.27811.13.camel@blabla.mcs.anl.gov> <1172522128.27811.20.camel@blabla.mcs.anl.gov> Message-ID: On Mon, 26 Feb 2007, Mihael Hategan wrote: > On Mon, 2007-02-26 at 20:34 +0000, Ben Clifford wrote: > > > > On Mon, 26 Feb 2007, Mihael Hategan wrote: > > > > > In the BRIC case, the output file exists, but it contains a specific > > > string (something to the extent of "no clusters found"). > > > > So what goes in the output file in the successful case? > > Digits. Lots of digits. Separated by whitespace. And an occasional ".". > And certainly the string "no clusters found" is absent in that case. ok. So that's the actual output data, presumably? -- From hategan at mcs.anl.gov Mon Feb 26 14:38:57 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 26 Feb 2007 14:38:57 -0600 Subject: [Swift-devel] Re: test v0.1rc1 In-Reply-To: <6.0.0.22.2.20070226143643.05e74cc0@mail.mcs.anl.gov> References: <6.0.0.22.2.20070226125640.059527f0@mail.mcs.anl.gov> <1172517302.25410.0.camel@blabla.mcs.anl.gov> <6.0.0.22.2.20070226132104.05733c30@mail.mcs.anl.gov> <1172518758.26112.0.camel@blabla.mcs.anl.gov> <6.0.0.22.2.20070226140905.05adc960@mail.mcs.anl.gov> <6.0.0.22.2.20070226142008.05ee77f0@mail.mcs.anl.gov> <1172521994.27811.17.camel@blabla.mcs.anl.gov> <6.0.0.22.2.20070226143643.05e74cc0@mail.mcs.anl.gov> Message-ID: <1172522337.27811.24.camel@blabla.mcs.anl.gov> That's fine. Just wanted to be clear that we're talking about the same error. It's good that it also occurs in 070219, because there are no recent changes I could remember that could trigger it. It's also good to know that it may or may not occur, because I know approximately what class of problem we're dealing with. 
Mihael On Mon, 2007-02-26 at 14:37 -0600, Veronika V. Nefedova wrote: > Yes, I didn't paste it -- its all in the log. If you'd like I can send you > the log as an attachment... > > Nika > > [...] From tiberius at ci.uchicago.edu Mon Feb 26 14:41:30 2007 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Mon, 26 Feb 2007 14:41:30 -0600 Subject: [Swift-devel] IS the PBS provider in v0.1 rc 1 Message-ID: How can I use it ? A quick example please. Thanks Tibi -- Tiberiu (Tibi) Stef-Praun, PhD Research Staff, Computation Institute 5640 S. Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From hategan at mcs.anl.gov Mon Feb 26 14:39:44 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 26 Feb 2007 14:39:44 -0600 Subject: [Swift-devel] bug 23 In-Reply-To: References: <1172521875.27811.13.camel@blabla.mcs.anl.gov> <1172522128.27811.20.camel@blabla.mcs.anl.gov> Message-ID: <1172522384.27811.26.camel@blabla.mcs.anl.gov> On Mon, 2007-02-26 at 20:38 +0000, Ben Clifford wrote: > > On Mon, 26 Feb 2007, Mihael Hategan wrote: > > > On Mon, 2007-02-26 at 20:34 +0000, Ben Clifford wrote: > > > > > > On Mon, 26 Feb 2007, Mihael Hategan wrote: > > > > > > > In the BRIC case, the output file exists, but it contains a specific > > > > string (something to the extent of "no clusters found"). > > > > > > So what goes in the output file in the successful case? > > > > Digits. Lots of digits.
Separated by whitespace. And an occasional ".". > > And certainly the string "no clusters found" is absent in that case. > > ok. So that's the actual output data, presumably? Yes. It is the output data file. > From nefedova at mcs.anl.gov Mon Feb 26 14:46:34 2007 From: nefedova at mcs.anl.gov (Veronika V. Nefedova) Date: Mon, 26 Feb 2007 14:46:34 -0600 Subject: [Swift-devel] Re: test v0.1rc1 In-Reply-To: <1172522337.27811.24.camel@blabla.mcs.anl.gov> References: <6.0.0.22.2.20070226125640.059527f0@mail.mcs.anl.gov> <1172517302.25410.0.camel@blabla.mcs.anl.gov> <6.0.0.22.2.20070226132104.05733c30@mail.mcs.anl.gov> <1172518758.26112.0.camel@blabla.mcs.anl.gov> <6.0.0.22.2.20070226140905.05adc960@mail.mcs.anl.gov> <6.0.0.22.2.20070226142008.05ee77f0@mail.mcs.anl.gov> <1172521994.27811.17.camel@blabla.mcs.anl.gov> <6.0.0.22.2.20070226143643.05e74cc0@mail.mcs.anl.gov> <1172522337.27811.24.camel@blabla.mcs.anl.gov> Message-ID: <6.0.0.22.2.20070226144143.05e71e60@mail.mcs.anl.gov> An additional info: This failure happened on TG with 070219 when I was running 2 molecules at the same time (i.e. two executables at the same time). When I tried to run just one, it failed with the same exitcode, but didn't have that handle exception:

2007-02-26 14:34:41,986 DEBUG vdl:execute2 Application exception: Job chrm failed with an exit code of 174
    sys:throw @ vdl-int.k, line: 108
    vdl:checkexitcode @ vdl-int.k, line: 367
    vdl:execute2 @ execute-default.k, line: 22
    vdl:execute @ swift-MolDyn.kml, line: 69
    charmm @ swift-MolDyn.kml, line: 279
    vdl:mains @ swift-MolDyn.kml, line: 261

Again, the failure with 070219 happens only on TG; on localhost (wiggum) it's working just fine. Nika At 02:38 PM 2/26/2007, Mihael Hategan wrote: >That's fine. Just wanted to be clear that we're talking about the same >error. It's good that it also occurs in 070219, because there are no >recent changes I could remember that could trigger it.
It's also good to >know that it may or may not occur, because I know approximately what >class of problem we're dealing with. > >Mihael > >[...] From hategan at mcs.anl.gov Mon Feb 26 14:45:57 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 26 Feb 2007 14:45:57 -0600 Subject: [Swift-devel] IS the PBS provider in v0.1 rc 1 In-Reply-To: References: Message-ID: <1172522757.27811.31.camel@blabla.mcs.anl.gov> Edit libexec/vdl-sc.k. In element(jobmanager..., add the following under url == "local://localhost" "local": url == "pbs://localhost" "pbs" Then, in your site catalog, when it comes to the local host, use that magic url for the job manager: Of course, you should initially fetch that provider. You can find it on http://wiki.cogkit.org/index.php/V:4.1.5/Java_CoG_Kit_Release_Page Unpack it and make sure etc files go into etc, and jar files go into lib. Mihael On Mon, 2007-02-26 at 14:41 -0600, Tiberiu Stef-Praun wrote: > How can I use it ? > A quick example please.
>
> Thanks
> Tibi
>

From benc at hawaga.org.uk Mon Feb 26 14:47:50 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 26 Feb 2007 20:47:50 +0000 (GMT)
Subject: [Swift-devel] bug 23
In-Reply-To: <1172521875.27811.13.camel@blabla.mcs.anl.gov>
References: <1172521875.27811.13.camel@blabla.mcs.anl.gov>
Message-ID:

On Mon, 26 Feb 2007, Mihael Hategan wrote:

> There is no rule, but a wrapper can be used to adapt arbitrary behavior
> to a consistent thing.

wrappers do well for that, yes.

what i'm wondering about is how to avoid wrappers in the common cases -
which is some combination of 'what is a common case?' and 'how to avoid
that case?'

without straying too far from coming up with something to implement for
0.2

--

From hategan at mcs.anl.gov Mon Feb 26 14:47:40 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 26 Feb 2007 14:47:40 -0600
Subject: [Swift-devel] Re: test v0.1rc1
In-Reply-To: <6.0.0.22.2.20070226144143.05e71e60@mail.mcs.anl.gov>
References: <6.0.0.22.2.20070226125640.059527f0@mail.mcs.anl.gov>
	<1172517302.25410.0.camel@blabla.mcs.anl.gov>
	<6.0.0.22.2.20070226132104.05733c30@mail.mcs.anl.gov>
	<1172518758.26112.0.camel@blabla.mcs.anl.gov>
	<6.0.0.22.2.20070226140905.05adc960@mail.mcs.anl.gov>
	<6.0.0.22.2.20070226142008.05ee77f0@mail.mcs.anl.gov>
	<1172521994.27811.17.camel@blabla.mcs.anl.gov>
	<6.0.0.22.2.20070226143643.05e74cc0@mail.mcs.anl.gov>
	<1172522337.27811.24.camel@blabla.mcs.anl.gov>
	<6.0.0.22.2.20070226144143.05e71e60@mail.mcs.anl.gov>
Message-ID: <1172522860.27811.34.camel@blabla.mcs.anl.gov>

On Mon, 2007-02-26 at 14:46 -0600, Veronika V. Nefedova wrote:
> An additional info: This failure happened on TG with 070219 when I was
> running 2 molecules at the same time (i.e. two executables at the same
> time). When I tried to run just one, it failed with the same exitcode, but
> didn't have that handle exception:

Right. This seems like a different problem, and I'm not sure if it's
Swift or some problem with TP or the application.
That needs to be investigated.

>
> 2007-02-26 14:34:41,986 DEBUG vdl:execute2 Application exception: Job chrm
> failed with an exit code of 174
> sys:throw @ vdl-int.k, line: 108
> vdl:checkexitcode @ vdl-int.k, line: 367
> vdl:execute2 @ execute-default.k, line: 22
> vdl:execute @ swift-MolDyn.kml, line: 69
> charmm @ swift-MolDyn.kml, line: 279
> vdl:mains @ swift-MolDyn.kml, line: 261
>
>
> Again, the failure with 070219 happens only on TG, on localhost (wiggum)
> its working just fine.
>
> Nika

From hategan at mcs.anl.gov Mon Feb 26 14:50:32 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 26 Feb 2007 14:50:32 -0600
Subject: [Swift-devel] bug 23
In-Reply-To:
References: <1172521875.27811.13.camel@blabla.mcs.anl.gov>
Message-ID: <1172523032.27811.37.camel@blabla.mcs.anl.gov>

On Mon, 2007-02-26 at 20:47 +0000, Ben Clifford wrote:
>
> On Mon, 26 Feb 2007, Mihael Hategan wrote:
>
> > There is no rule, but a wrapper can be used to adapt arbitrary behavior
> > to a consistent thing.
>
> wrappers do well for that, yes.
>
> what i'm wondering about is how to avoid wrappers in the common cases -
> which is some combination of 'what is a common case?' and 'how to avoid
> that case?'

And whether there is a common case. We should not do 90:10 optimizations
on a uniform distribution.
>
> without straying too far from coming up with something to implement for
> 0.2
>

From benc at hawaga.org.uk Mon Feb 26 14:53:47 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 26 Feb 2007 20:53:47 +0000 (GMT)
Subject: [Swift-devel] bug 23
In-Reply-To: <1172523032.27811.37.camel@blabla.mcs.anl.gov>
References: <1172521875.27811.13.camel@blabla.mcs.anl.gov>
	<1172523032.27811.37.camel@blabla.mcs.anl.gov>
Message-ID:

On Mon, 26 Feb 2007, Mihael Hategan wrote:

> And whether there is a common case. We should not do 90:10 optimizations
> on a uniform distribution.

well, its pretty much an assumption of the way that app {} was defined
that one common case is that the inputs and outputs get specified on the
command line...

--

From benc at hawaga.org.uk Mon Feb 26 14:54:07 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 26 Feb 2007 20:54:07 +0000 (GMT)
Subject: [Swift-devel] IS the PBS provider in v0.1 rc 1
In-Reply-To: <1172522757.27811.31.camel@blabla.mcs.anl.gov>
References: <1172522757.27811.31.camel@blabla.mcs.anl.gov>
Message-ID:

mmmm.

If this goes anywhere near being sold to someone as an actual application
solution that this group is expected to support, it should look a little
nicer than this...

On Mon, 26 Feb 2007, Mihael Hategan wrote:

> Edit libexec/vdl-sc.k.
> In element(jobmanager...,
>
> add the following under url == "local://localhost" "local":
> url == "pbs://localhost" "pbs"
>
> Then, in your site catalog, when it comes to the local host, use that
> magic url for the job manager:
>
>
> Of course, you should initially fetch that provider. You can find it on
> http://wiki.cogkit.org/index.php/V:4.1.5/Java_CoG_Kit_Release_Page
>
> Unpack it and make sure etc files go into etc, and jar files go into
> lib.
>
> Mihael
>
> On Mon, 2007-02-26 at 14:41 -0600, Tiberiu Stef-Praun wrote:
> > How can I use it ?
> > A quick example please.
> >
> > Thanks
> > Tibi
> >
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
>

From hategan at mcs.anl.gov Mon Feb 26 14:54:07 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 26 Feb 2007 14:54:07 -0600
Subject: [Swift-devel] bug 23
In-Reply-To: <1172523032.27811.37.camel@blabla.mcs.anl.gov>
References: <1172521875.27811.13.camel@blabla.mcs.anl.gov>
	<1172523032.27811.37.camel@blabla.mcs.anl.gov>
Message-ID: <1172523247.29557.0.camel@blabla.mcs.anl.gov>

On Mon, 2007-02-26 at 14:50 -0600, Mihael Hategan wrote:
> On Mon, 2007-02-26 at 20:47 +0000, Ben Clifford wrote:
> >
> > On Mon, 26 Feb 2007, Mihael Hategan wrote:
> >
> > > There is no rule, but a wrapper can be used to adapt arbitrary behavior
> > > to a consistent thing.
> >
> > wrappers do well for that, yes.
> >
> > what i'm wondering about is how to avoid wrappers in the common cases -
> > which is some combination of 'what is a common case?' and 'how to avoid
> > that case?'
>
> And whether there is a common case. We should not

... try to ...

> do 90:10 optimizations
> on a uniform distribution.

... because we can't.

>
> >
> > without straying too far from coming up with something to implement for
> > 0.2
> >
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>

From hategan at mcs.anl.gov Mon Feb 26 14:54:52 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 26 Feb 2007 14:54:52 -0600
Subject: [Swift-devel] bug 23
In-Reply-To:
References: <1172521875.27811.13.camel@blabla.mcs.anl.gov>
	<1172523032.27811.37.camel@blabla.mcs.anl.gov>
Message-ID: <1172523292.29557.2.camel@blabla.mcs.anl.gov>

On Mon, 2007-02-26 at 20:53 +0000, Ben Clifford wrote:
>
> On Mon, 26 Feb 2007, Mihael Hategan wrote:
>
> > And whether there is a common case. We should not do 90:10 optimizations
> > on a uniform distribution.
>
> well, its pretty much an assumption of the way that app {} was defined
> that one common case is that the inputs and outputs get specified on the
> command line...

Different distribution. Most apps would revolve around that to a certain
extent.

>

From hategan at mcs.anl.gov Mon Feb 26 14:56:33 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 26 Feb 2007 14:56:33 -0600
Subject: [Swift-devel] IS the PBS provider in v0.1 rc 1
In-Reply-To:
References: <1172522757.27811.31.camel@blabla.mcs.anl.gov>
Message-ID: <1172523393.29557.5.camel@blabla.mcs.anl.gov>

On Mon, 2007-02-26 at 20:54 +0000, Ben Clifford wrote:
> mmmm.
>
> If this goes anywhere near being sold to someone as an actual application
> solution that this group is expected to support, it should look a little
> nicer than this...

Of course. We're only trying it out between ourselves for now. There are
many more questions than how to enable it here, and I don't think I want
to focus on this right now.

>
> On Mon, 26 Feb 2007, Mihael Hategan wrote:
>
> > Edit libexec/vdl-sc.k.
> > In element(jobmanager...,
> >
> > add the following under url == "local://localhost" "local":
> > url == "pbs://localhost" "pbs"
> >
> > Then, in your site catalog, when it comes to the local host, use that
> > magic url for the job manager:
> >
> >
> > Of course, you should initially fetch that provider. You can find it on
> > http://wiki.cogkit.org/index.php/V:4.1.5/Java_CoG_Kit_Release_Page
> >
> > Unpack it and make sure etc files go into etc, and jar files go into
> > lib.
> >
> > Mihael
> >
> > On Mon, 2007-02-26 at 14:41 -0600, Tiberiu Stef-Praun wrote:
> > > How can I use it ?
> > > A quick example please.
> > >
> > > Thanks
> > > Tibi
> > >
> >
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >
> >
>

From hategan at mcs.anl.gov Mon Feb 26 14:57:28 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 26 Feb 2007 14:57:28 -0600
Subject: [Swift-devel] IS the PBS provider in v0.1 rc 1
In-Reply-To: <1172523187.27811.41.camel@blabla.mcs.anl.gov>
References: <1172522757.27811.31.camel@blabla.mcs.anl.gov>
	<1172523187.27811.41.camel@blabla.mcs.anl.gov>
Message-ID: <1172523448.29557.7.camel@blabla.mcs.anl.gov>

Ooops. Wrong mailing list.

On Mon, 2007-02-26 at 14:53 -0600, Mihael Hategan wrote:
> It's not by default in v0.1rc1.
> Specifying the number of nodes is no different from how you would do it
> with GTx. GLOBUS::count=n. But for some reason, I'm not sure this is
> what you actually want, unless it's an MPI job.
>
> Mihael
>
> On Mon, 2007-02-26 at 14:50 -0600, Tiberiu Stef-Praun wrote:
> > So the answer is: it's not in v01.rc1
> > How about the workflow, where do I specify how many notes to allocate
> > for my tasks ?
> > Hmm... Let me weigh my chances of doing a successful testing of the
> > PBS provider on my own.
> >
> >
> > On 2/26/07, Mihael Hategan wrote:
> > > Edit libexec/vdl-sc.k.
> > > In element(jobmanager...,
> > >
> > > add the following under url == "local://localhost" "local":
> > > url == "pbs://localhost" "pbs"
> > >
> > > Then, in your site catalog, when it comes to the local host, use that
> > > magic url for the job manager:
> > >
> > >
> > > Of course, you should initially fetch that provider. You can find it on
> > > http://wiki.cogkit.org/index.php/V:4.1.5/Java_CoG_Kit_Release_Page
> > >
> > > Unpack it and make sure etc files go into etc, and jar files go into
> > > lib.
> > >
> > > Mihael
> > >
> > > On Mon, 2007-02-26 at 14:41 -0600, Tiberiu Stef-Praun wrote:
> > > > How can I use it ?
> > > > A quick example please.
> > > >
> > > > Thanks
> > > > Tibi
> > > >
> > >
> > >
> >
> >
>

From hategan at mcs.anl.gov Mon Feb 26 15:01:00 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 26 Feb 2007 15:01:00 -0600
Subject: [Swift-devel] IS the PBS provider in v0.1 rc 1
In-Reply-To:
References: <1172522757.27811.31.camel@blabla.mcs.anl.gov>
	<1172523187.27811.41.camel@blabla.mcs.anl.gov>
Message-ID: <1172523661.29844.2.camel@blabla.mcs.anl.gov>

That's a "hidden" throttling setting. libexec/scheduler.xml . You
probably want to increase that.

On Mon, 2007-02-26 at 14:56 -0600, Tiberiu Stef-Praun wrote:
> I want to be able to tell my workflow to submit all 200 jobs in one
> shot. I assumed that specifying the number of nodes would help me in
> achieving that.
>
> On 2/26/07, Mihael Hategan wrote:
> > It's not by default in v0.1rc1.
> > Specifying the number of nodes is no different from how you would do it
> > with GTx. GLOBUS::count=n. But for some reason, I'm not sure this is
> > what you actually want, unless it's an MPI job.
> >
> > Mihael
> >
> > On Mon, 2007-02-26 at 14:50 -0600, Tiberiu Stef-Praun wrote:
> > > So the answer is: it's not in v01.rc1
> > > How about the workflow, where do I specify how many notes to allocate
> > > for my tasks ?
> > > Hmm... Let me weigh my chances of doing a successful testing of the
> > > PBS provider on my own.
> > >
> > >
> > > On 2/26/07, Mihael Hategan wrote:
> > > > Edit libexec/vdl-sc.k.
> > > > In element(jobmanager...,
> > > >
> > > > add the following under url == "local://localhost" "local":
> > > > url == "pbs://localhost" "pbs"
> > > >
> > > > Then, in your site catalog, when it comes to the local host, use that
> > > > magic url for the job manager:
> > > >
> > > >
> > > > Of course, you should initially fetch that provider. You can find it on
> > > > http://wiki.cogkit.org/index.php/V:4.1.5/Java_CoG_Kit_Release_Page
> > > >
> > > > Unpack it and make sure etc files go into etc, and jar files go into
> > > > lib.
> > > >
> > > > Mihael
> > > >
> > > > On Mon, 2007-02-26 at 14:41 -0600, Tiberiu Stef-Praun wrote:
> > > > > How can I use it ?
> > > > > A quick example please.
> > > > >
> > > > > Thanks
> > > > > Tibi
> > > > >
> > > >
> > > >
> > >
> > >
> >
> >
>

From benc at hawaga.org.uk Mon Feb 26 15:14:55 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 26 Feb 2007 21:14:55 +0000 (GMT)
Subject: [Swift-devel] remote file/directory stuff (bug 22)
In-Reply-To: <1172521676.27811.9.camel@blabla.mcs.anl.gov>
References: <1172521676.27811.9.camel@blabla.mcs.anl.gov>
Message-ID:

On Mon, 26 Feb 2007, Mihael Hategan wrote:

> We may also want to do the same to the input, because some even more
> twisted apps will not even accept that as a parameter. So:
> (file k, file m, file n) myapp(file l) {
> app{
> l>"input.txt";
> myapp;
> k<"output.crd"
> m<"output.prd"
> n<"output.rtf"
> }
> }

somewhat related to that is the way that the air align_warp program wants
its input files:

$ ls Raw/
anatomy1.hdr anatomy2.hdr anatomy3.hdr anatomy4.hdr reference.hdr
anatomy1.img anatomy2.img anatomy3.img anatomy4.img reference.img

All paired up - but the app is told just:

$ align_warp /path/to/anatomy1.img /otherpath/reference.img

with it being implicit that the corresponding hdr files exists at:

"/path/to/anatomy1.img" - "img" + "hdr"
"/otherpath/reference.img" - "img" + "hdr"

In practice, I think this is OK in the short-medium term because as long
as the corresponding hdr and img files sit next to each other in the
submit side, they'll map through to execute side files that also sit next
to each other, because of the way that we map.
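
A minimal sketch of the implicit pairing rule described above, assuming the header sits beside the image and differs only in its suffix. `companion_header` is a hypothetical helper written for illustration; it is not part of Swift or of align_warp:

```python
from pathlib import PurePosixPath


def companion_header(img_path: str) -> str:
    """Return the .hdr path implied by an .img path (path - "img" + "hdr")."""
    path = PurePosixPath(img_path)
    # Refuse anything that is not an .img file, so a wrong argument fails early.
    if path.suffix != ".img":
        raise ValueError(f"expected a .img file, got {img_path!r}")
    # Same directory, same base name, only the suffix changes.
    return str(path.with_suffix(".hdr"))


if __name__ == "__main__":
    for img in ("/path/to/anatomy1.img", "/otherpath/reference.img"):
        print(img, "->", companion_header(img))
```

The derived path keeps the directory and base name of the image, which is exactly the property the execute-side layout has to preserve for the application to find its header.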

But in the longer term future, that is not necessarily going to be the
case - as source-side mapping becomes richer, to pull data from different
places that maybe don't have a single source filename (eg replicas, other
kinds of data store), the above invariant may become untrue. So there's at
least consideration needed there - either in terms of the invariants that
we assume are true but are never written down in a spec, or in terms of
how we make that stuff map properly on the execute side.

--

From nefedova at mcs.anl.gov Mon Feb 26 15:17:45 2007
From: nefedova at mcs.anl.gov (Veronika V. Nefedova)
Date: Mon, 26 Feb 2007 15:17:45 -0600
Subject: [Swift-devel] Re: test v0.1rc1
In-Reply-To: <1172522860.27811.34.camel@blabla.mcs.anl.gov>
References: <6.0.0.22.2.20070226125640.059527f0@mail.mcs.anl.gov>
	<1172517302.25410.0.camel@blabla.mcs.anl.gov>
	<6.0.0.22.2.20070226132104.05733c30@mail.mcs.anl.gov>
	<1172518758.26112.0.camel@blabla.mcs.anl.gov>
	<6.0.0.22.2.20070226140905.05adc960@mail.mcs.anl.gov>
	<6.0.0.22.2.20070226142008.05ee77f0@mail.mcs.anl.gov>
	<1172521994.27811.17.camel@blabla.mcs.anl.gov>
	<6.0.0.22.2.20070226143643.05e74cc0@mail.mcs.anl.gov>
	<1172522337.27811.24.camel@blabla.mcs.anl.gov>
	<6.0.0.22.2.20070226144143.05e71e60@mail.mcs.anl.gov>
	<1172522860.27811.34.camel@blabla.mcs.anl.gov>
Message-ID: <6.0.0.22.2.20070226151711.05ea3e60@mail.mcs.anl.gov>

OK. Cool. Please let me know when you have a solution so I could test it.

Thanks!

Nika

At 02:47 PM 2/26/2007, Mihael Hategan wrote:
>On Mon, 2007-02-26 at 14:46 -0600, Veronika V. Nefedova wrote:
> > An additional info: This failure happened on TG with 070219 when I was
> > running 2 molecules at the same time (i.e. two executables at the same
> > time). When I tried to run just one, it failed with the same exitcode, but
> > didn't have that handle exception:
>
>Right. This seems like a different problem, and I'm not sure if it's
>Swift or some problem with TP or the application. That needs to be
>investigated.

From hategan at mcs.anl.gov Mon Feb 26 15:20:14 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 26 Feb 2007 15:20:14 -0600
Subject: [Swift-devel] remote file/directory stuff (bug 22)
In-Reply-To:
References: <1172521676.27811.9.camel@blabla.mcs.anl.gov>
Message-ID: <1172524814.30498.0.camel@blabla.mcs.anl.gov>

I thought we had this discussion before: swft at ... Subject "function vs.
mapper".

On Mon, 2007-02-26 at 21:14 +0000, Ben Clifford wrote:
>
> On Mon, 26 Feb 2007, Mihael Hategan wrote:
>
> > We may also want to do the same to the input, because some even more
> > twisted apps will not even accept that as a parameter.
> > So: > > (file k, file m, file n) myapp(file l) { > > app{ > > l>"input.txt"; > > myapp; > > k<"output.crd" > > m<"output.prd" > > n<"output.rtf" > > } > > } > > somewhat related to that is the way that the air align_warp program wants > its input files: > > $ ls Raw/ > anatomy1.hdr anatomy2.hdr anatomy3.hdr anatomy4.hdr reference.hdr > anatomy1.img anatomy2.img anatomy3.img anatomy4.img reference.img > > All paired up - but the app is told just: > > $ align_warp /path/to/anatomy1.img /otherpath/reference.img > > with it being implicit that the corresponding hdr files exists at: > > "/path/to/anatomy1.img" - "img" + "hdr" > "/otherpath/reference.img" - "img" + "hdr" > > In practice, I think this is OK in the short-medium term because as long > as the corresponding hdr and img files sit next to each other in the > submit side, they'll map through to execute side files that also sit next > to each other, because of the way that we map. > > But in the longer term future, that is not necessarily going to be the > case - as source-side mapping becomes richer, to pull data from different > places that maybe don't have a single source filename (eg replicas, other > kinds of data store), the above invariant may become untrue. So there's at > least consideration needed there - either in terms of the invariants that > we assume are true but are never written down in a spec, or in terms of > how we make that stuff map properly on the execute side. > From benc at hawaga.org.uk Mon Feb 26 15:37:33 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 26 Feb 2007 21:37:33 +0000 (GMT) Subject: [Swift-devel] remote file/directory stuff (bug 22) In-Reply-To: <1172524814.30498.0.camel@blabla.mcs.anl.gov> References: <1172521676.27811.9.camel@blabla.mcs.anl.gov> <1172524814.30498.0.camel@blabla.mcs.anl.gov> Message-ID: On Mon, 26 Feb 2007, Mihael Hategan wrote: > I thought we had this discussion before: swft at ... Subject "function vs. > mapper". 
that was discussed a little bit - that bit of the thread was wandering off
into abstractions pretty hard though and never got to a conclusion.

Mostly I think I want to note that the fact that align_warp works at the
moment is because it is the case that we preserve path locality (files in
the same submit directory end up in the same execute-side directory) and
that those filenames are preserved.

So perhaps I should write that into the bit of the documentation that
specifies how app{} behaves...

--

From tiberius at ci.uchicago.edu Mon Feb 26 14:01:02 2007
From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun)
Date: Mon, 26 Feb 2007 14:01:02 -0600
Subject: [Swift-devel] Re: [VDL2-user] GridFTP timeout exception
In-Reply-To:
References:
Message-ID:

On 2/26/07, Ben Clifford wrote:
>
> On Mon, 26 Feb 2007, Tiberiu Stef-Praun wrote:
>
> > BTW, the NCSA problem is a non-issue, I solved it.
>
> how?

tiberius at tp-login2:~/scratch> cat ~/.globus/cog.properties
ip=128.135.158.222

> > The only other
> > small issue is taking full advantage of all the sites in the
> > sites.xml.
>
> For 0.1, each site needs to be usable as the single site to run everything
> on. It doesn't matter so much (at all) that you can't use all sites in
> the same run...
>
> --
>

--
Tiberiu (Tibi) Stef-Praun, PhD
Research Staff, Computation Institute
5640 S.
Ellis Ave, #405
University of Chicago
http://www-unix.mcs.anl.gov/~tiberius/

From hategan at mcs.anl.gov Mon Feb 26 21:57:10 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 26 Feb 2007 21:57:10 -0600
Subject: [Swift-devel] Re: test v0.1rc1
In-Reply-To: <1172522860.27811.34.camel@blabla.mcs.anl.gov>
References: <6.0.0.22.2.20070226125640.059527f0@mail.mcs.anl.gov>
 <1172517302.25410.0.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070226132104.05733c30@mail.mcs.anl.gov>
 <1172518758.26112.0.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070226140905.05adc960@mail.mcs.anl.gov>
 <6.0.0.22.2.20070226142008.05ee77f0@mail.mcs.anl.gov>
 <1172521994.27811.17.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070226143643.05e74cc0@mail.mcs.anl.gov>
 <1172522337.27811.24.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070226144143.05e71e60@mail.mcs.anl.gov>
 <1172522860.27811.34.camel@blabla.mcs.anl.gov>
Message-ID: <1172548630.12021.3.camel@blabla.mcs.anl.gov>

Hmm. I made a change to the code that did not seem to be the cause, but
some other, smaller issue and enabled some more debugging in log4j. With
this, I've been running the workflow in a loop on wiggum for two hours
now, and got nothing yet. I don't know what to make of it.

I'll keep running and eventually revert the changes to see if they are
the source.

Mihael

On Mon, 2007-02-26 at 14:47 -0600, Mihael Hategan wrote:
> On Mon, 2007-02-26 at 14:46 -0600, Veronika V. Nefedova wrote:
> > An additional info: This failure happened on TG with 070219 when I was
> > running 2 molecules at the same time (i.e. two executables at the same
> > time). When I tried to run just one, it failed with the same exitcode, but
> > didn't have that handle exception:
>
> Right. This seems like a different problem, and I'm not sure if it's
> Swift or some problem with TP or the application. That needs to be
> investigated.
> > > > > 2007-02-26 14:34:41,986 DEBUG vdl:execute2 Application exception: Job chrm > > failed with an exit code of 174 > > sys:throw @ vdl-int.k, line: 108 > > vdl:checkexitcode @ vdl-int.k, line: 367 > > vdl:execute2 @ execute-default.k, line: 22 > > vdl:execute @ swift-MolDyn.kml, line: 69 > > charmm @ swift-MolDyn.kml, line: 279 > > vdl:mains @ swift-MolDyn.kml, line: 261 > > > > > > Again, the failure with 070219 happens only on TG, on localhost (wiggum) > > its working just fine. > > > > Nika > > > > > > At 02:38 PM 2/26/2007, Mihael Hategan wrote: > > >That's fine. Just wanted to be clear that we're talking about the same > > >error. It's good that it also occurs in 070219, because there are no > > >recent changes I could remember that could trigger it. It's also good to > > >know that it may or may not occur, because I know approximately what > > >class of problem we're dealing with. > > > > > >Mihael > > > > > >On Mon, 2007-02-26 at 14:37 -0600, Veronika V. Nefedova wrote: > > > > Yes, I didn't paste it -- its all in the log. If you'd like I can send you > > > > the log as an attachment... > > > > > > > > Nika > > > > > > > > At 02:33 PM 2/26/2007, Mihael Hategan wrote: > > > > >Wait, because I'm missing something. Wasn't the error supposed to be > > > > >"TaskHandler can only handle unsubmitted tasks"? > > > > > > > > > >On Mon, 2007-02-26 at 14:26 -0600, Veronika V. Nefedova wrote: > > > > > > And now its getting interesting! > > > > > > > > > > > > I have now the same failure (as below) with 070219 as I had on > > > localhost > > > > > > with v0.1rc1 *BUT* when running on TG. Failed at the same point (while > > > > > > trying to run the last app in the workflow), with the same exceptions. > > > > > > Strange that 070219 worked on localhost (and still working). 
> > > > > > > > > > > > The log is on wiggum: > > > > > /sandbox/ydeng/alamines/swift-MolDyn-690y7r1skc8z0.log > > > > > > > > > > > > 2007-02-26 14:10:16,543 INFO vdl:execute2 Running job > > > chrm-rmnoet7i chrm > > > > > > with arguments [system:solv_m001, title:solv, stitle:m001, > > > > > > rtffile:parm03_gaff_all.rtf, paramfile:parm03_gaffnb_all.prm, > > > > > > gaff:m001_am1, nwater:400, ligcrd:lyz, rforce:0, iseed:3131887, > > > rwater:15, > > > > > > nstep:100, minstep:100, skipstep:100, startstep:10000] in > > > > > > swift-MolDyn-690y7r1skc8z0/chrm-rmnoet7i on TG-NCSA > > > > > > 2007-02-26 14:11:18,586 DEBUG vdl:execute2 Application exception: > > > Job chrm > > > > > > failed with an exit code of 174 > > > > > > > > > > > > All input files are staged in... > > > > > > > > > > > > > > > > > > Nika > > > > > > > > > > > > At 02:17 PM 2/26/2007, Veronika V. Nefedova wrote: > > > > > > >You can try to run my application, or look in the logs. I ran it > > > all on > > > > > > >wiggum. The log is: > > > > > > >/sandbox/ydeng/alamines/swift-MolDyn-8q6ygr7cy15c2.log > > > > > > > > > > > > > >the dtm file I am running is /sandbox/ydeng/alamines/swift-MolDyn.dtm > > > > > > > > > > > > > >Nika > > > > > > > > > > > > > >At 01:39 PM 2/26/2007, Mihael Hategan wrote: > > > > > > >>That doesn't sound good. How do I reproduce this? > > > > > > >> > > > > > > >>Mihael > > > > > > >> > > > > > > >>On Mon, 2007-02-26 at 13:21 -0600, Veronika V. Nefedova wrote: > > > > > > >> > The one Ben asked us all to test: > > > > > > >> > > > > > > > >> > >http://www.ci.uchicago.edu/swift/tests/vdsk-0.1rc1.tar.gz > > > > > > >> > > > > > > > >> > At 01:15 PM 2/26/2007, Mihael Hategan wrote: > > > > > > >> > >On Mon, 2007-02-26 at 13:05 -0600, Veronika V. Nefedova wrote: > > > > > > >> > > > When I tried to run my working workflow with a new version, it > > > > > > >> gave me an > > > > > > >> > > > exception: > > > > > > >> > > > > > > > > >> > >Which new version? 
> > > > > > >> > > > > > > > > >> > >Mihael > > > > > > >> > > > > > > > > >> > > > > > > > > > >> > > > Warning: Task handler throws exception but does not set status > > > > > > >> > > > > > > > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > > > > > >> > > > TaskHandler can only handle unsubmitted tasks > > > > > > >> > > > at > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > > org.globus.cog.abstraction.impl.common.task.CachingFileOperationTaskHandler.submit(CachingFileOperationTaskHandler.java:20) > > > > > > >> > > > at > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > > org.globus.cog.karajan.scheduler.submitQueue.NonBlockingSubmit.run(NonBlockingSubmit.java:78) > > > > > > >> > > > at java.lang.Thread.run(Thread.java:534) > > > > > > >> > > > > > > > > > >> > > > [349] wiggum /sandbox/ydeng/alamines > \\ > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > I do not have this happening with 070219 built. > > > > > > >> > > > > > > > > > >> > > > Nika > > > > > > >> > > > > > > > > > >> > > > At 06:12 AM 2/26/2007, Ben Clifford wrote: > > > > > > >> > > > > > > > > > >> > > > >On Mon, 26 Feb 2007, Ben Clifford wrote: > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > v0.1rc1 was built at the end of last week. 
please spend > > > > > some time > > > > > > >> > > testing > > > > > > >> > > > > > > > > > > >> > > > >here's the URL for download: > > > > > > >> > > > > > > > > > > >> > > > >http://www.ci.uchicago.edu/swift/tests/vdsk-0.1rc1.tar.gz > > > > > > >> > > > > > > > > > > >> > > > >-- > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > _______________________________________________ > > > > > > >> > > > Swift-devel mailing list > > > > > > >> > > > Swift-devel at ci.uchicago.edu > > > > > > >> > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > >> > > > > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > >_______________________________________________ > > > > > > >Swift-devel mailing list > > > > > > >Swift-devel at ci.uchicago.edu > > > > > > >http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From nefedova at mcs.anl.gov Mon Feb 26 22:17:11 2007 From: nefedova at mcs.anl.gov (Veronika V. 
Nefedova)
Date: Mon, 26 Feb 2007 22:17:11 -0600
Subject: [Swift-devel] Re: test v0.1rc1
In-Reply-To: <1172548630.12021.3.camel@blabla.mcs.anl.gov>
References: <6.0.0.22.2.20070226125640.059527f0@mail.mcs.anl.gov>
 <1172517302.25410.0.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070226132104.05733c30@mail.mcs.anl.gov>
 <1172518758.26112.0.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070226140905.05adc960@mail.mcs.anl.gov>
 <6.0.0.22.2.20070226142008.05ee77f0@mail.mcs.anl.gov>
 <1172521994.27811.17.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070226143643.05e74cc0@mail.mcs.anl.gov>
 <1172522337.27811.24.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070226144143.05e71e60@mail.mcs.anl.gov>
 <1172522860.27811.34.camel@blabla.mcs.anl.gov>
 <1172548630.12021.3.camel@blabla.mcs.anl.gov>
Message-ID: <6.0.0.22.2.20070226221618.032608e0@mail.mcs.anl.gov>

If you give me your vds_home location - I can try to run the workflow and
see if its working...

NIka

From hategan at mcs.anl.gov Mon Feb 26 22:19:23 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 26 Feb 2007 22:19:23 -0600
Subject: [Swift-devel] Re: test v0.1rc1
In-Reply-To: <6.0.0.22.2.20070226221618.032608e0@mail.mcs.anl.gov>
References: <6.0.0.22.2.20070226125640.059527f0@mail.mcs.anl.gov>
 <1172517302.25410.0.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070226132104.05733c30@mail.mcs.anl.gov>
 <1172518758.26112.0.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070226140905.05adc960@mail.mcs.anl.gov>
 <6.0.0.22.2.20070226142008.05ee77f0@mail.mcs.anl.gov>
 <1172521994.27811.17.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070226143643.05e74cc0@mail.mcs.anl.gov>
 <1172522337.27811.24.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070226144143.05e71e60@mail.mcs.anl.gov>
 <1172522860.27811.34.camel@blabla.mcs.anl.gov>
 <1172548630.12021.3.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070226221618.032608e0@mail.mcs.anl.gov>
Message-ID: <1172549963.12914.0.camel@blabla.mcs.anl.gov>

Nevermind. It happened right after I sent the email :)

From wilde at mcs.anl.gov Tue Feb 27 09:02:24 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Tue, 27 Feb 2007 09:02:24 -0600
Subject: [Swift-devel] Re: Swift mailing lists
In-Reply-To: <45E4125D.7020607@mcs.anl.gov>
References: <45E4125D.7020607@mcs.anl.gov>
Message-ID: <45E44800.1010501@mcs.anl.gov>

Absolutely! That would be great, Ravi.

Just sign up for both lists at:
http://www.ci.uchicago.edu/swift/contact.shtml

I'll also forward you the swift proposal we did recently

- Mike

Ravi Madduri wrote, On 2/27/2007 5:13 AM:
> Mike:
> Do you think I can be part of swift/vdl2 mailing lists ?
--
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL 60439 USA
tel 630-252-7497 fax 630-252-1997

From tiberius at ci.uchicago.edu Tue Feb 27 11:08:35 2007
From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun)
Date: Tue, 27 Feb 2007 11:08:35 -0600
Subject: [Swift-devel] Double loop, please suggest options:
Message-ID:

type file {}

//define the wavelet procedure
(file wavelets) waveletTransf (file waveletScript, int subjNo, string trialType, file dataFiles) {
    app {
        cwtPacksmall @filename(waveletScript) subjNo trialType @filename(wavelets);
    }
}

(file outputs[]) batchTrials ( string trialTypes[], string subjectNo[] ){
    file waveletScript;
    foreach string t,j in subjectNo {
        foreach string s,i in trialTypes {
            //example 101.FB.tgz (I can name this in any way,
            //I still need to have the subjectNo and trialTypes in the name, for clarity purposes)
            // SUGGEST HERE
            file output;
            // example: 101.FB (this is a symlink from the original input file)
            file dataFiles;
            output = waveletTransf(waveletScript,t,s,dataFiles);
            //SUGGEST HERE
            outputs[i*j]=output;
        }
    }
}

//string subjectNo[]=["101"];
string subjectNo[]= ["101","102","103","104","105","107","110","111","112","113","114","115","116","117","118","120","121","122","124","126","128","129","130","131","132","133","134","135","137","137","138","139","140"];
string trialTypes[] = ["FB", "FC", "FI", "SB", "SC", "SI" ];
//string trialTypes[] = [ "FB", "FC" ];

//SUGGEST HERE
file outputs[];
outputs = batchTrials (trialTypes, subjectNo);

--
Tiberiu (Tibi) Stef-Praun, PhD
Research Staff, Computation Institute
5640 S.
Ellis Ave, #405
University of Chicago
http://www-unix.mcs.anl.gov/~tiberius/

From benc at hawaga.org.uk Tue Feb 27 13:22:59 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 27 Feb 2007 19:22:59 +0000 (GMT)
Subject: [Swift-devel] remote file/directory stuff (bug 22)
In-Reply-To: <1172521676.27811.9.camel@blabla.mcs.anl.gov>
References: <1172521676.27811.9.camel@blabla.mcs.anl.gov>
Message-ID:

On Mon, 26 Feb 2007, Mihael Hategan wrote:

> Right. This would be the "application mapper".
> Now, there are a few things here:
> We may also want to do the same to the input, because some even more
> twisted apps will not even accept that as a parameter. So:
> (file k, file m, file n) myapp(file l) {
>   app{
>     l>"input.txt";
>     myapp;
>     k<"output.crd"
>     m<"output.prd"
>     n<"output.rtf"
>   }
> }

so something like the above syntax, sufficient to address Nika's cases, is
probably the way to go for bug 22, without necessarily implementing more
complicated stuff like arrays below.

That should give us a feel for how the concept works in practice too.

> This may become a little trickier when inputs (or even outputs) are
> arrays, so we may need nicer schemes:
> (file o) myapp(file i[]){
>   app{
>     i[x=*] > "input"+$1; (or something like that)
>     myapp;
>     ...
>   }
> }

--

From hategan at mcs.anl.gov Tue Feb 27 13:26:17 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 27 Feb 2007 13:26:17 -0600
Subject: [Swift-devel] remote file/directory stuff (bug 22)
In-Reply-To:
References: <1172521676.27811.9.camel@blabla.mcs.anl.gov>
Message-ID: <1172604377.25936.2.camel@blabla.mcs.anl.gov>

If you can make this translate into something like
vdl:(in|out)appmapping(var, path, dest), preferably after the
stagein/stageout directives, I can probably make it work.

On Tue, 2007-02-27 at 19:22 +0000, Ben Clifford wrote:
>
> On Mon, 26 Feb 2007, Mihael Hategan wrote:
>
> > Right. This would be the "application mapper".
> > Now, there are a few things here:
> > We may also want to do the same to the input, because some even more
> > twisted apps will not even accept that as a parameter. So:
> > (file k, file m, file n) myapp(file l) {
> >   app{
> >     l>"input.txt";
> >     myapp;
> >     k<"output.crd"
> >     m<"output.prd"
> >     n<"output.rtf"
> >   }
> > }
>
> so something like the above syntax, sufficient to address Nika's cases, is
> probably the way to go for bug 22, without necessarily implementing more
> complicated stuff like arrays below.
>
> That should give us a feel for how the concept works in practice too.
>
> > This may become a little trickier when inputs (or even outputs) are
> > arrays, so we may need nicer schemes:
> > (file o) myapp(file i[]){
> >   app{
> >     i[x=*] > "input"+$1; (or something like that)
> >     myapp;
> >     ...
> >   }
> > }
>

From hategan at mcs.anl.gov Tue Feb 27 14:04:19 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 27 Feb 2007 14:04:19 -0600
Subject: [Swift-devel] Re: test v0.1rc1
In-Reply-To: <6.0.0.22.2.20070226221618.032608e0@mail.mcs.anl.gov>
References: <6.0.0.22.2.20070226125640.059527f0@mail.mcs.anl.gov>
 <1172517302.25410.0.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070226132104.05733c30@mail.mcs.anl.gov>
 <1172518758.26112.0.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070226140905.05adc960@mail.mcs.anl.gov>
 <6.0.0.22.2.20070226142008.05ee77f0@mail.mcs.anl.gov>
 <1172521994.27811.17.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070226143643.05e74cc0@mail.mcs.anl.gov>
 <1172522337.27811.24.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070226144143.05e71e60@mail.mcs.anl.gov>
 <1172522860.27811.34.camel@blabla.mcs.anl.gov>
 <1172548630.12021.3.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070226221618.032608e0@mail.mcs.anl.gov>
Message-ID: <1172606659.27311.2.camel@blabla.mcs.anl.gov>

Try it now. The latest nightly build should contain the fix.
The problem was an inner class having synchronized methods and me idiotically assuming that they will use the outer class' instance monitor. Mihael On Mon, 2007-02-26 at 22:17 -0600, Veronika V. Nefedova wrote: > If you give me your vds_home location - I can try to run the workflow and > see if its working... > > NIka > At 09:57 PM 2/26/2007, Mihael Hategan wrote: > >Hmm. I made a change to the code that did not seem to be the cause, but > >some other, smaller issue and enabled some more debugging in log4j. With > >this, I've been running the workflow in a loop on wiggum for two hours > >now, and got nothing yet. I don't know what to make of it. > > > >I'll keep running and eventually revert the changes to see if they are > >the source. > > > >Mihael > > > >On Mon, 2007-02-26 at 14:47 -0600, Mihael Hategan wrote: > > > On Mon, 2007-02-26 at 14:46 -0600, Veronika V. Nefedova wrote: > > > > An additional info: This failure happened on TG with 070219 when I was > > > > running 2 molecules at the same time (i.e. two executables at the same > > > > time). When I tried to run just one, it failed with the same > > exitcode, but > > > > didn't have that handle exception: > > > > > > Right. This seems like a different problem, and I'm not sure if it's > > > Swift or some problem with TP or the application. That needs to be > > > investigated. > > > > > > > > > > > 2007-02-26 14:34:41,986 DEBUG vdl:execute2 Application exception: Job > > chrm > > > > failed with an exit code of 174 > > > > sys:throw @ vdl-int.k, line: 108 > > > > vdl:checkexitcode @ vdl-int.k, line: 367 > > > > vdl:execute2 @ execute-default.k, line: 22 > > > > vdl:execute @ swift-MolDyn.kml, line: 69 > > > > charmm @ swift-MolDyn.kml, line: 279 > > > > vdl:mains @ swift-MolDyn.kml, line: 261 > > > > > > > > > > > > Again, the failure with 070219 happens only on TG, on localhost (wiggum) > > > > its working just fine. 
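[Editorial note] The locking pitfall Mihael describes above (synchronized methods on an inner class do not take the outer instance's monitor) can be reproduced in a few lines of Java. This is only a minimal sketch under stated assumptions: the class names OuterMonitorDemo and Worker are hypothetical and are not taken from the Swift or CoG source, and the actual fix in the nightly build is not shown in this thread.

```java
// Hypothetical illustration; not code from Swift or the CoG kit.
class OuterMonitorDemo {

    // Non-static inner class: each Worker is tied to an OuterMonitorDemo instance.
    class Worker {
        // A synchronized instance method locks on "this", i.e. the Worker
        // object, so the enclosing instance's monitor is NOT held here.
        synchronized boolean holdsOuterMonitor() {
            return Thread.holdsLock(OuterMonitorDemo.this);
        }

        // The same method does hold the Worker's own monitor.
        synchronized boolean holdsOwnMonitor() {
            return Thread.holdsLock(this);
        }

        // To share one lock with the enclosing object, name it explicitly.
        boolean holdsOuterWhenExplicit() {
            synchronized (OuterMonitorDemo.this) {
                return Thread.holdsLock(OuterMonitorDemo.this);
            }
        }
    }

    public static void main(String[] args) {
        OuterMonitorDemo outer = new OuterMonitorDemo();
        Worker w = outer.new Worker();
        System.out.println(w.holdsOuterMonitor());      // false
        System.out.println(w.holdsOwnMonitor());        // true
        System.out.println(w.holdsOuterWhenExplicit()); // true
    }
}
```

If state guarded by the outer object's monitor is also touched from the inner class, an explicit synchronized (Outer.this) block (or a shared lock object) is one way to make both sides use the same lock.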
> > > > > > > > Nika > > > > > > > > > > > > At 02:38 PM 2/26/2007, Mihael Hategan wrote: > > > > >That's fine. Just wanted to be clear that we're talking about the same > > > > >error. It's good that it also occurs in 070219, because there are no > > > > >recent changes I could remember that could trigger it. It's also good to > > > > >know that it may or may not occur, because I know approximately what > > > > >class of problem we're dealing with. > > > > > > > > > >Mihael > > > > > > > > > >On Mon, 2007-02-26 at 14:37 -0600, Veronika V. Nefedova wrote: > > > > > > Yes, I didn't paste it -- its all in the log. If you'd like I can > > send you > > > > > > the log as an attachment... > > > > > > > > > > > > Nika > > > > > > > > > > > > At 02:33 PM 2/26/2007, Mihael Hategan wrote: > > > > > > >Wait, because I'm missing something. Wasn't the error supposed to be > > > > > > >"TaskHandler can only handle unsubmitted tasks"? > > > > > > > > > > > > > >On Mon, 2007-02-26 at 14:26 -0600, Veronika V. Nefedova wrote: > > > > > > > > And now its getting interesting! > > > > > > > > > > > > > > > > I have now the same failure (as below) with 070219 as I had on > > > > > localhost > > > > > > > > with v0.1rc1 *BUT* when running on TG. Failed at the same > > point (while > > > > > > > > trying to run the last app in the workflow), with the same > > exceptions. > > > > > > > > Strange that 070219 worked on localhost (and still working). 
> > > > > > > > > > > > > > > > The log is on wiggum: > > > > > > > /sandbox/ydeng/alamines/swift-MolDyn-690y7r1skc8z0.log > > > > > > > > > > > > > > > > 2007-02-26 14:10:16,543 INFO vdl:execute2 Running job > > > > > chrm-rmnoet7i chrm > > > > > > > > with arguments [system:solv_m001, title:solv, stitle:m001, > > > > > > > > rtffile:parm03_gaff_all.rtf, paramfile:parm03_gaffnb_all.prm, > > > > > > > > gaff:m001_am1, nwater:400, ligcrd:lyz, rforce:0, iseed:3131887, > > > > > rwater:15, > > > > > > > > nstep:100, minstep:100, skipstep:100, startstep:10000] in > > > > > > > > swift-MolDyn-690y7r1skc8z0/chrm-rmnoet7i on TG-NCSA > > > > > > > > 2007-02-26 14:11:18,586 DEBUG vdl:execute2 Application > > exception: > > > > > Job chrm > > > > > > > > failed with an exit code of 174 > > > > > > > > > > > > > > > > All input files are staged in... > > > > > > > > > > > > > > > > > > > > > > > > Nika > > > > > > > > > > > > > > > > At 02:17 PM 2/26/2007, Veronika V. Nefedova wrote: > > > > > > > > >You can try to run my application, or look in the logs. I > > ran it > > > > > all on > > > > > > > > >wiggum. The log is: > > > > > > > > >/sandbox/ydeng/alamines/swift-MolDyn-8q6ygr7cy15c2.log > > > > > > > > > > > > > > > > > >the dtm file I am running is > > /sandbox/ydeng/alamines/swift-MolDyn.dtm > > > > > > > > > > > > > > > > > >Nika > > > > > > > > > > > > > > > > > >At 01:39 PM 2/26/2007, Mihael Hategan wrote: > > > > > > > > >>That doesn't sound good. How do I reproduce this? > > > > > > > > >> > > > > > > > > >>Mihael > > > > > > > > >> > > > > > > > > >>On Mon, 2007-02-26 at 13:21 -0600, Veronika V. Nefedova wrote: > > > > > > > > >> > The one Ben asked us all to test: > > > > > > > > >> > > > > > > > > > >> > >http://www.ci.uchicago.edu/swift/tests/vdsk-0.1rc1.tar.gz > > > > > > > > >> > > > > > > > > > >> > At 01:15 PM 2/26/2007, Mihael Hategan wrote: > > > > > > > > >> > >On Mon, 2007-02-26 at 13:05 -0600, Veronika V. 
Nefedova > > wrote: > > > > > > > > >> > > > When I tried to run my working workflow with a new > > version, it > > > > > > > > >> gave me an > > > > > > > > >> > > > exception: > > > > > > > > >> > > > > > > > > > > >> > >Which new version? > > > > > > > > >> > > > > > > > > > > >> > >Mihael > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > > >> > > > Warning: Task handler throws exception but does not > > set status > > > > > > > > >> > > > > > > > > > > > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > > > > > > > >> > > > TaskHandler can only handle unsubmitted tasks > > > > > > > > >> > > > at > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > > > > > org.globus.cog.abstraction.impl.common.task.CachingFileOperationTaskHandler.submit(CachingFileOperationTaskHandler.java:20) > > > > > > > > >> > > > at > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > > > > > org.globus.cog.karajan.scheduler.submitQueue.NonBlockingSubmit.run(NonBlockingSubmit.java:78) > > > > > > > > >> > > > at java.lang.Thread.run(Thread.java:534) > > > > > > > > >> > > > > > > > > > > > >> > > > [349] wiggum /sandbox/ydeng/alamines > \\ > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > > >> > > > I do not have this happening with 070219 built. > > > > > > > > >> > > > > > > > > > > > >> > > > Nika > > > > > > > > >> > > > > > > > > > > > >> > > > At 06:12 AM 2/26/2007, Ben Clifford wrote: > > > > > > > > >> > > > > > > > > > > > >> > > > >On Mon, 26 Feb 2007, Ben Clifford wrote: > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > v0.1rc1 was built at the end of last week. 
> > please spend > > > > > > > some time > > > > > > > > >> > > testing > > > > > > > > >> > > > > > > > > > > > > >> > > > >here's the URL for download: > > > > > > > > >> > > > > > > > > > > > > >> > > > >http://www.ci.uchicago.edu/swift/tests/vdsk-0.1rc1.t > > ar.gz > > > > > > > > >> > > > > > > > > > > > > >> > > > >-- > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > > >> > > > _______________________________________________ > > > > > > > > >> > > > Swift-devel mailing list > > > > > > > > >> > > > Swift-devel at ci.uchicago.edu > > > > > > > > >> > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > >_______________________________________________ > > > > > > > > >Swift-devel mailing list > > > > > > > > >Swift-devel at ci.uchicago.edu > > > > > > > > >http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > From nefedova at mcs.anl.gov Tue Feb 27 14:19:15 2007 From: nefedova at mcs.anl.gov (Veronika V. 
Nefedova) Date: Tue, 27 Feb 2007 14:19:15 -0600 Subject: [Swift-devel] Re: test v0.1rc1 In-Reply-To: <1172606659.27311.2.camel@blabla.mcs.anl.gov> References: <6.0.0.22.2.20070226125640.059527f0@mail.mcs.anl.gov> <1172517302.25410.0.camel@blabla.mcs.anl.gov> <6.0.0.22.2.20070226132104.05733c30@mail.mcs.anl.gov> <1172518758.26112.0.camel@blabla.mcs.anl.gov> <6.0.0.22.2.20070226140905.05adc960@mail.mcs.anl.gov> <6.0.0.22.2.20070226142008.05ee77f0@mail.mcs.anl.gov> <1172521994.27811.17.camel@blabla.mcs.anl.gov> <6.0.0.22.2.20070226143643.05e74cc0@mail.mcs.anl.gov> <1172522337.27811.24.camel@blabla.mcs.anl.gov> <6.0.0.22.2.20070226144143.05e71e60@mail.mcs.anl.gov> <1172522860.27811.34.camel@blabla.mcs.anl.gov> <1172548630.12021.3.camel@blabla.mcs.anl.gov> <6.0.0.22.2.20070226221618.032608e0@mail.mcs.anl.gov> <1172606659.27311.2.camel@blabla.mcs.anl.gov> Message-ID: <6.0.0.22.2.20070227141852.05fba7c0@mail.mcs.anl.gov> Thanks, Mihael! I'll try it now Nika
From benc at hawaga.org.uk Tue Feb 27 14:31:13 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 27 Feb 2007 20:31:13 +0000 (GMT) Subject: [Swift-devel] remote file/directory stuff (bug 22) In-Reply-To: <1172604377.25936.2.camel@blabla.mcs.anl.gov> References: <1172521676.27811.9.camel@blabla.mcs.anl.gov> <1172604377.25936.2.camel@blabla.mcs.anl.gov> Message-ID: On Tue, 27 Feb 2007, Mihael Hategan wrote: > If you can make this translate into something like vdl:(in| > out)appmapping(var, path, dest), preferably after the stagein/stageout > directives, I can probably make it work. what are path and dest? the source form only has a left and a right...
> > > l>"input.txt"; -- From hategan at mcs.anl.gov Tue Feb 27 14:31:09 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 27 Feb 2007 14:31:09 -0600 Subject: [Swift-devel] remote file/directory stuff (bug 22) In-Reply-To: References: <1172521676.27811.9.camel@blabla.mcs.anl.gov> <1172604377.25936.2.camel@blabla.mcs.anl.gov> Message-ID: <1172608269.28337.0.camel@blabla.mcs.anl.gov> On Tue, 2007-02-27 at 20:31 +0000, Ben Clifford wrote: > > On Tue, 27 Feb 2007, Mihael Hategan wrote: > > > If you can make this translate into something like vdl:(in| > > out)appmapping(var, path, dest), preferably after the stagein/stageout > > directives, I can probably make it work. > > what are path and dest? > > the source form only has a left and a right... l.x.y > "foo.txt" l - var x.y - path "foo.txt" - dest > > > > > l>"input.txt"; > From benc at hawaga.org.uk Tue Feb 27 14:37:38 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 27 Feb 2007 20:37:38 +0000 (GMT) Subject: [Swift-devel] remote file/directory stuff (bug 22) In-Reply-To: <1172608269.28337.0.camel@blabla.mcs.anl.gov> References: <1172521676.27811.9.camel@blabla.mcs.anl.gov> <1172604377.25936.2.camel@blabla.mcs.anl.gov> <1172608269.28337.0.camel@blabla.mcs.anl.gov> Message-ID: On Tue, 27 Feb 2007, Mihael Hategan wrote: > On Tue, 2007-02-27 at 20:31 +0000, Ben Clifford wrote: > > > > On Tue, 27 Feb 2007, Mihael Hategan wrote: > > > > > If you can make this translate into something like vdl:(in| > > > out)appmapping(var, path, dest), preferably after the stagein/stageout > > > directives, I can probably make it work. > > > > what are path and dest? > > > > the source form only has a left and a right... > > l.x.y > "foo.txt" > l - var > x.y - path > "foo.txt" - dest oh, ok. I think that should be straightforward to implement. I'd be interested if yong has comment - he knows the swiftscript->kml layer more intimately than I. 
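[Editorial note] The var/path/dest split that Mihael gives for l.x.y > "foo.txt" can be sketched as a small parser. This is a rough illustration only: the class AppMapping and its fields are hypothetical names, the reading of '>' as an input mapping and '<' as an output mapping is inferred from the l>"input.txt" and k<"output.crd" examples earlier in the thread, and nothing here is the actual SwiftScript-to-kml translation code.

```java
// Hypothetical helper; not part of the Swift compiler.
final class AppMapping {
    final boolean input; // true for '>', false for '<' (assumed convention)
    final String var;    // variable name, e.g. "l"
    final String path;   // field path inside the variable, e.g. "x.y" (may be empty)
    final String dest;   // file name the application expects, e.g. "foo.txt"

    private AppMapping(boolean input, String var, String path, String dest) {
        this.input = input;
        this.var = var;
        this.path = path;
        this.dest = dest;
    }

    // Parses statements of the form: ref > "name"  or  ref < "name"
    // with an optional trailing semicolon.
    static AppMapping parse(String statement) {
        String s = statement.trim();
        if (s.endsWith(";")) {
            s = s.substring(0, s.length() - 1).trim();
        }
        int open = s.indexOf('"');
        int close = s.lastIndexOf('"');
        if (open < 1 || close <= open || close != s.length() - 1) {
            throw new IllegalArgumentException("expected ref > \"file\": " + statement);
        }
        String left = s.substring(0, open).trim();
        if (left.isEmpty()) {
            throw new IllegalArgumentException("missing left-hand side: " + statement);
        }
        char op = left.charAt(left.length() - 1);
        if (op != '>' && op != '<') {
            throw new IllegalArgumentException("expected '>' or '<': " + statement);
        }
        String ref = left.substring(0, left.length() - 1).trim();
        if (ref.isEmpty()) {
            throw new IllegalArgumentException("missing variable: " + statement);
        }
        // Everything before the first dot is the variable; the rest is the path.
        int dot = ref.indexOf('.');
        String var = dot < 0 ? ref : ref.substring(0, dot);
        String path = dot < 0 ? "" : ref.substring(dot + 1);
        return new AppMapping(op == '>', var, path, s.substring(open + 1, close));
    }

    public static void main(String[] args) {
        AppMapping m = AppMapping.parse("l.x.y > \"foo.txt\"");
        System.out.println(m.var + " | " + m.path + " | " + m.dest); // l | x.y | foo.txt
    }
}
```

A translator along these lines would then hand the three parts to whatever element the runtime ends up providing for the proposed (in|out) app mapping.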
-- From hategan at mcs.anl.gov Tue Feb 27 14:39:06 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 27 Feb 2007 14:39:06 -0600 Subject: [Swift-devel] remote file/directory stuff (bug 22) In-Reply-To: References: <1172521676.27811.9.camel@blabla.mcs.anl.gov> <1172604377.25936.2.camel@blabla.mcs.anl.gov> <1172608269.28337.0.camel@blabla.mcs.anl.gov> Message-ID: <1172608746.28687.0.camel@blabla.mcs.anl.gov> On Tue, 2007-02-27 at 20:37 +0000, Ben Clifford wrote: > > On Tue, 27 Feb 2007, Mihael Hategan wrote: > > > On Tue, 2007-02-27 at 20:31 +0000, Ben Clifford wrote: > > > > > > On Tue, 27 Feb 2007, Mihael Hategan wrote: > > > > > > > If you can make this translate into something like vdl:(in| > > > > out)appmapping(var, path, dest), preferably after the stagein/stageout > > > > directives, I can probably make it work. > > > > > > what are path and dest? > > > > > > the source form only has a left and a right... > > > > l.x.y > "foo.txt" > > l - var > > x.y - path > > "foo.txt" - dest > > oh, ok. > > I think that should be straightforward to implement. I'd be interested if > yong has comment - he knows the swiftscript->kml layer more intimately > than I. You can replicate certain assignments. The ones that translate to vdl:setfieldvalue. It works in a similar way. > From nefedova at mcs.anl.gov Tue Feb 27 14:54:20 2007 From: nefedova at mcs.anl.gov (Veronika V. 
Nefedova) Date: Tue, 27 Feb 2007 14:54:20 -0600 Subject: [Swift-devel] Re: test v0.1rc1 In-Reply-To: <6.0.0.22.2.20070227141852.05fba7c0@mail.mcs.anl.gov> References: <6.0.0.22.2.20070226125640.059527f0@mail.mcs.anl.gov> <1172517302.25410.0.camel@blabla.mcs.anl.gov> <6.0.0.22.2.20070226132104.05733c30@mail.mcs.anl.gov> <1172518758.26112.0.camel@blabla.mcs.anl.gov> <6.0.0.22.2.20070226140905.05adc960@mail.mcs.anl.gov> <6.0.0.22.2.20070226142008.05ee77f0@mail.mcs.anl.gov> <1172521994.27811.17.camel@blabla.mcs.anl.gov> <6.0.0.22.2.20070226143643.05e74cc0@mail.mcs.anl.gov> <1172522337.27811.24.camel@blabla.mcs.anl.gov> <6.0.0.22.2.20070226144143.05e71e60@mail.mcs.anl.gov> <1172522860.27811.34.camel@blabla.mcs.anl.gov> <1172548630.12021.3.camel@blabla.mcs.anl.gov> <6.0.0.22.2.20070226221618.032608e0@mail.mcs.anl.gov> <1172606659.27311.2.camel@blabla.mcs.anl.gov> <6.0.0.22.2.20070227141852.05fba7c0@mail.mcs.anl.gov> Message-ID: <6.0.0.22.2.20070227145017.06043760@mail.mcs.anl.gov> It worked! Thank you very much for tracking it down, Mihael. Anyway -- a complete MolDyn workflow just finished its first run on Teragrid (for 3 molecules only). (-; Nika
From hategan at mcs.anl.gov Tue Feb 27 14:54:42 2007 From: hategan
at mcs.anl.gov (Mihael Hategan)
Date: Tue, 27 Feb 2007 14:54:42 -0600
Subject: [Swift-devel] Re: test v0.1rc1
In-Reply-To: <6.0.0.22.2.20070227145017.06043760@mail.mcs.anl.gov>
References: <6.0.0.22.2.20070226125640.059527f0@mail.mcs.anl.gov>
    <1172517302.25410.0.camel@blabla.mcs.anl.gov>
    <6.0.0.22.2.20070226132104.05733c30@mail.mcs.anl.gov>
    <1172518758.26112.0.camel@blabla.mcs.anl.gov>
    <6.0.0.22.2.20070226140905.05adc960@mail.mcs.anl.gov>
    <6.0.0.22.2.20070226142008.05ee77f0@mail.mcs.anl.gov>
    <1172521994.27811.17.camel@blabla.mcs.anl.gov>
    <6.0.0.22.2.20070226143643.05e74cc0@mail.mcs.anl.gov>
    <1172522337.27811.24.camel@blabla.mcs.anl.gov>
    <6.0.0.22.2.20070226144143.05e71e60@mail.mcs.anl.gov>
    <1172522860.27811.34.camel@blabla.mcs.anl.gov>
    <1172548630.12021.3.camel@blabla.mcs.anl.gov>
    <6.0.0.22.2.20070226221618.032608e0@mail.mcs.anl.gov>
    <1172606659.27311.2.camel@blabla.mcs.anl.gov>
    <6.0.0.22.2.20070227141852.05fba7c0@mail.mcs.anl.gov>
    <6.0.0.22.2.20070227145017.06043760@mail.mcs.anl.gov>
Message-ID: <1172609682.29302.0.camel@blabla.mcs.anl.gov>

On Tue, 2007-02-27 at 14:54 -0600, Veronika V. Nefedova wrote:
> It worked! Thank you very much for tracking it down, Mihael.
>
> Anyway -- a complete MolDyn workflow just finished its first run on
> Teragrid (for 3 molecules only).

Groovy. Can it be pumped up?

>
> (-;
>
> Nika
>
> At 02:19 PM 2/27/2007, Veronika V. Nefedova wrote:
>
> >Thanks, Mihael!
> >I'll try it now
> >
> >Nika
> >
> >At 02:04 PM 2/27/2007, Mihael Hategan wrote:
> >>Try it now. The latest nightly build should contain the fix.
> >>
> >>The problem was an inner class having synchronized methods and me
> >>idiotically assuming that they will use the outer class' instance
> >>monitor.
> >>
> >>Mihael
> >>
> >>On Mon, 2007-02-26 at 22:17 -0600, Veronika V. Nefedova wrote:
> >> > If you give me your vds_home location - I can try to run the workflow and
> >> > see if its working...
> >> >
> >> > NIka
> >> > At 09:57 PM 2/26/2007, Mihael Hategan wrote:
> >> > >Hmm. I made a change to the code that did not seem to be the cause, but
> >> > >some other, smaller issue and enabled some more debugging in log4j. With
> >> > >this, I've been running the workflow in a loop on wiggum for two hours
> >> > >now, and got nothing yet. I don't know what to make of it.
> >> > >
> >> > >I'll keep running and eventually revert the changes to see if they are
> >> > >the source.
> >> > >
> >> > >Mihael
> >> > >
> >> > >On Mon, 2007-02-26 at 14:47 -0600, Mihael Hategan wrote:
> >> > > > On Mon, 2007-02-26 at 14:46 -0600, Veronika V. Nefedova wrote:
> >> > > > > An additional info: This failure happened on TG with 070219 when I was
> >> > > > > running 2 molecules at the same time (i.e. two executables at the same
> >> > > > > time). When I tried to run just one, it failed with the same exitcode, but
> >> > > > > didn't have that handle exception:
> >> > > >
> >> > > > Right. This seems like a different problem, and I'm not sure if it's
> >> > > > Swift or some problem with TP or the application. That needs to be
> >> > > > investigated.
> >> > > >
> >> > > > >
> >> > > > > 2007-02-26 14:34:41,986 DEBUG vdl:execute2 Application exception: Job chrm
> >> > > > > failed with an exit code of 174
> >> > > > >          sys:throw @ vdl-int.k, line: 108
> >> > > > >          vdl:checkexitcode @ vdl-int.k, line: 367
> >> > > > >          vdl:execute2 @ execute-default.k, line: 22
> >> > > > >          vdl:execute @ swift-MolDyn.kml, line: 69
> >> > > > >          charmm @ swift-MolDyn.kml, line: 279
> >> > > > >          vdl:mains @ swift-MolDyn.kml, line: 261
> >> > > > >
> >> > > > >
> >> > > > > Again, the failure with 070219 happens only on TG, on localhost (wiggum)
> >> > > > > its working just fine.
> >> > > > >
> >> > > > > Nika
> >> > > > >
> >> > > > >
> >> > > > > At 02:38 PM 2/26/2007, Mihael Hategan wrote:
> >> > > > > >That's fine. Just wanted to be clear that we're talking about the same
> >> > > > > >error. It's good that it also occurs in 070219, because there are no
> >> > > > > >recent changes I could remember that could trigger it. It's also good to
> >> > > > > >know that it may or may not occur, because I know approximately what
> >> > > > > >class of problem we're dealing with.
> >> > > > > >
> >> > > > > >Mihael
> >> > > > > >
> >> > > > > >On Mon, 2007-02-26 at 14:37 -0600, Veronika V. Nefedova wrote:
> >> > > > > > > Yes, I didn't paste it -- its all in the log. If you'd like I can send you
> >> > > > > > > the log as an attachment...
> >> > > > > > >
> >> > > > > > > Nika
> >> > > > > > >
> >> > > > > > > At 02:33 PM 2/26/2007, Mihael Hategan wrote:
> >> > > > > > > >Wait, because I'm missing something. Wasn't the error supposed to be
> >> > > > > > > >"TaskHandler can only handle unsubmitted tasks"?
> >> > > > > > > >
> >> > > > > > > >On Mon, 2007-02-26 at 14:26 -0600, Veronika V. Nefedova wrote:
> >> > > > > > > > > And now its getting interesting!
> >> > > > > > > > >
> >> > > > > > > > > I have now the same failure (as below) with 070219 as I had on localhost
> >> > > > > > > > > with v0.1rc1 *BUT* when running on TG. Failed at the same point (while
> >> > > > > > > > > trying to run the last app in the workflow), with the same exceptions.
> >> > > > > > > > > Strange that 070219 worked on localhost (and still working).
> >> > > > > > > > >
> >> > > > > > > > > The log is on wiggum:
> >> > > > > > > > > /sandbox/ydeng/alamines/swift-MolDyn-690y7r1skc8z0.log
> >> > > > > > > > >
> >> > > > > > > > > 2007-02-26 14:10:16,543 INFO vdl:execute2 Running job chrm-rmnoet7i chrm
> >> > > > > > > > > with arguments [system:solv_m001, title:solv, stitle:m001,
> >> > > > > > > > > rtffile:parm03_gaff_all.rtf, paramfile:parm03_gaffnb_all.prm,
> >> > > > > > > > > gaff:m001_am1, nwater:400, ligcrd:lyz, rforce:0, iseed:3131887, rwater:15,
> >> > > > > > > > > nstep:100, minstep:100, skipstep:100, startstep:10000] in
> >> > > > > > > > > swift-MolDyn-690y7r1skc8z0/chrm-rmnoet7i on TG-NCSA
> >> > > > > > > > > 2007-02-26 14:11:18,586 DEBUG vdl:execute2 Application exception: Job chrm
> >> > > > > > > > > failed with an exit code of 174
> >> > > > > > > > >
> >> > > > > > > > > All input files are staged in...
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > Nika
> >> > > > > > > > >
> >> > > > > > > > > At 02:17 PM 2/26/2007, Veronika V. Nefedova wrote:
> >> > > > > > > > > >You can try to run my application, or look in the logs. I ran it all on
> >> > > > > > > > > >wiggum. The log is:
> >> > > > > > > > > >/sandbox/ydeng/alamines/swift-MolDyn-8q6ygr7cy15c2.log
> >> > > > > > > > > >
> >> > > > > > > > > >the dtm file I am running is /sandbox/ydeng/alamines/swift-MolDyn.dtm
> >> > > > > > > > > >
> >> > > > > > > > > >Nika
> >> > > > > > > > > >
> >> > > > > > > > > >At 01:39 PM 2/26/2007, Mihael Hategan wrote:
> >> > > > > > > > > >>That doesn't sound good. How do I reproduce this?
> >> > > > > > > > > >>
> >> > > > > > > > > >>Mihael
> >> > > > > > > > > >>
> >> > > > > > > > > >>On Mon, 2007-02-26 at 13:21 -0600, Veronika V. Nefedova wrote:
> >> > > > > > > > > >> > The one Ben asked us all to test:
> >> > > > > > > > > >> >
> >> > > > > > > > > >> > >http://www.ci.uchicago.edu/swift/tests/vdsk-0.1rc1.tar.gz
> >> > > > > > > > > >> >
> >> > > > > > > > > >> > At 01:15 PM 2/26/2007, Mihael Hategan wrote:
> >> > > > > > > > > >> > >On Mon, 2007-02-26 at 13:05 -0600, Veronika V. Nefedova wrote:
> >> > > > > > > > > >> > > > When I tried to run my working workflow with a new version, it gave me an
> >> > > > > > > > > >> > > > exception:
> >> > > > > > > > > >> > >
> >> > > > > > > > > >> > >Which new version?
> >> > > > > > > > > >> > >
> >> > > > > > > > > >> > >Mihael
> >> > > > > > > > > >> > >
> >> > > > > > > > > >> > > >
> >> > > > > > > > > >> > > > Warning: Task handler throws exception but does not set status
> >> > > > > > > > > >> > > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> >> > > > > > > > > >> > > > TaskHandler can only handle unsubmitted tasks
> >> > > > > > > > > >> > > >          at org.globus.cog.abstraction.impl.common.task.CachingFileOperationTaskHandler.submit(CachingFileOperationTaskHandler.java:20)
> >> > > > > > > > > >> > > >          at org.globus.cog.karajan.scheduler.submitQueue.NonBlockingSubmit.run(NonBlockingSubmit.java:78)
> >> > > > > > > > > >> > > >          at java.lang.Thread.run(Thread.java:534)
> >> > > > > > > > > >> > > >
> >> > > > > > > > > >> > > > [349] wiggum /sandbox/ydeng/alamines > \\
> >> > > > > > > > > >> > > >
> >> > > > > > > > > >> > > >
> >> > > > > > > > > >> > > > I do not have this happening with 070219 built.
> >> > > > > > > > > >> > > >
> >> > > > > > > > > >> > > > Nika
> >> > > > > > > > > >> > > >
> >> > > > > > > > > >> > > > At 06:12 AM 2/26/2007, Ben Clifford wrote:
> >> > > > > > > > > >> > > >
> >> > > > > > > > > >> > > > >On Mon, 26 Feb 2007, Ben Clifford wrote:
> >> > > > > > > > > >> > > > >
> >> > > > > > > > > >> > > > > >
> >> > > > > > > > > >> > > > > > v0.1rc1 was built at the end of last week. please spend some time testing
> >> > > > > > > > > >> > > > >
> >> > > > > > > > > >> > > > >here's the URL for download:
> >> > > > > > > > > >> > > > >
> >> > > > > > > > > >> > > > >http://www.ci.uchicago.edu/swift/tests/vdsk-0.1rc1.tar.gz
> >> > > > > > > > > >> > > > >
> >> > > > > > > > > >> > > > >--
> >> > > > > > > > > >> > > >
> >> > > > > > > > > >> > > >
> >> > > > > > > > > >> > > > _______________________________________________
> >> > > > > > > > > >> > > > Swift-devel mailing list
> >> > > > > > > > > >> > > > Swift-devel at ci.uchicago.edu
> >> > > > > > > > > >> > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >> > > > > > > > > >> > > >
> >> > > > > > > > > >> >
> >> > > > > > > > > >
> >> > > > > > > > > >_______________________________________________
> >> > > > > > > > > >Swift-devel mailing list
> >> > > > > > > > > >Swift-devel at ci.uchicago.edu
> >> > > > > > > > > >http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >> > > > > > > > >
> >> > > > > > >
> >> > > > >
> >> > > >
> >> > > > _______________________________________________
> >> > > > Swift-devel mailing list
> >> > > > Swift-devel at ci.uchicago.edu
> >> > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >> > > >
> >> >
> >>
> >
> >_______________________________________________
> >Swift-devel mailing list
> >Swift-devel at ci.uchicago.edu
> >http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >

From nefedova at mcs.anl.gov Tue Feb 27 15:00:03 2007
From: nefedova at mcs.anl.gov (Veronika V. Nefedova)
Date: Tue, 27 Feb 2007 15:00:03 -0600
Subject: [Swift-devel] tmp dir cleanup?
Message-ID: <6.0.0.22.2.20070227145605.05fbd2f0@mail.mcs.anl.gov>

Hi,

I am wondering if Swift is supposed to clean up after itself (after a
successful run) ? I have a bunch of tmp directories created by swift both
on localhost and on Teragrid which were left there after a normal
completions of the workflow...

Thanks,

Nika

From nefedova at mcs.anl.gov Tue Feb 27 15:03:47 2007
From: nefedova at mcs.anl.gov (Veronika V. Nefedova)
Date: Tue, 27 Feb 2007 15:03:47 -0600
Subject: [Swift-devel] Re: test v0.1rc1
In-Reply-To: <1172609682.29302.0.camel@blabla.mcs.anl.gov>
References: <6.0.0.22.2.20070226125640.059527f0@mail.mcs.anl.gov>
    <1172517302.25410.0.camel@blabla.mcs.anl.gov>
    <6.0.0.22.2.20070226132104.05733c30@mail.mcs.anl.gov>
    <1172518758.26112.0.camel@blabla.mcs.anl.gov>
    <6.0.0.22.2.20070226140905.05adc960@mail.mcs.anl.gov>
    <6.0.0.22.2.20070226142008.05ee77f0@mail.mcs.anl.gov>
    <1172521994.27811.17.camel@blabla.mcs.anl.gov>
    <6.0.0.22.2.20070226143643.05e74cc0@mail.mcs.anl.gov>
    <1172522337.27811.24.camel@blabla.mcs.anl.gov>
    <6.0.0.22.2.20070226144143.05e71e60@mail.mcs.anl.gov>
    <1172522860.27811.34.camel@blabla.mcs.anl.gov>
    <1172548630.12021.3.camel@blabla.mcs.anl.gov>
    <6.0.0.22.2.20070226221618.032608e0@mail.mcs.anl.gov>
    <1172606659.27311.2.camel@blabla.mcs.anl.gov>
    <6.0.0.22.2.20070227141852.05fba7c0@mail.mcs.anl.gov>
    <6.0.0.22.2.20070227145017.06043760@mail.mcs.anl.gov>
    <1172609682.29302.0.camel@blabla.mcs.anl.gov>
Message-ID: <6.0.0.22.2.20070227150116.06c309d0@mail.mcs.anl.gov>

I have just 16 molecules (at present) for testing - I am running all of
them now. Will let you know how did it go.
I've already asked Yuqing to send my way all 350 molecules (;

Nika

At 02:54 PM 2/27/2007, Mihael Hategan wrote:
>On Tue, 2007-02-27 at 14:54 -0600, Veronika V. Nefedova wrote:
> > It worked! Thank you very much for tracking it down, Mihael.
> >
> > Anyway -- a complete MolDyn workflow just finished its first run on
> > Teragrid (for 3 molecules only).
>
>Groovy. Can it be pumped up?

From hategan at mcs.anl.gov Tue Feb 27 15:05:40 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 27 Feb 2007 15:05:40 -0600
Subject: [Swift-devel] tmp dir cleanup?
In-Reply-To: <6.0.0.22.2.20070227145605.05fbd2f0@mail.mcs.anl.gov>
References: <6.0.0.22.2.20070227145605.05fbd2f0@mail.mcs.anl.gov>
Message-ID: <1172610340.29528.2.camel@blabla.mcs.anl.gov>

On Tue, 2007-02-27 at 15:00 -0600, Veronika V. Nefedova wrote:
> Hi,
>
> I am wondering if Swift is supposed to clean up after itself (after a
> successful run) ? I have a bunch of tmp directories created by swift both
> on localhost and on Teragrid which were left there after a normal
> completions of the workflow...

The whole run directory should be deleted at the end of a successful
run. With rm -rf.

>
> Thanks,
>
> Nika
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>

From nefedova at mcs.anl.gov Tue Feb 27 15:13:34 2007
From: nefedova at mcs.anl.gov (Veronika V. Nefedova)
Date: Tue, 27 Feb 2007 15:13:34 -0600
Subject: [Swift-devel] tmp dir cleanup?
In-Reply-To: <1172610340.29528.2.camel@blabla.mcs.anl.gov>
References: <6.0.0.22.2.20070227145605.05fbd2f0@mail.mcs.anl.gov>
    <1172610340.29528.2.camel@blabla.mcs.anl.gov>
Message-ID: <6.0.0.22.2.20070227150928.05fbdd80@mail.mcs.anl.gov>

At 03:05 PM 2/27/2007, Mihael Hategan wrote:
>On Tue, 2007-02-27 at 15:00 -0600, Veronika V. Nefedova wrote:
> > Hi,
> >
> > I am wondering if Swift is supposed to clean up after itself (after a
> > successful run) ? I have a bunch of tmp directories created by swift both
> > on localhost and on Teragrid which were left there after a normal
> > completions of the workflow...
>
>The whole run directory should be deleted at the end of a successful
>run. With rm -rf.

By whom? Me or Swift ? Swift is not deleting anything...

Nika

> >
> > Thanks,
> >
> > Nika
> >
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >

From hategan at mcs.anl.gov Tue Feb 27 15:13:07 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 27 Feb 2007 15:13:07 -0600
Subject: [Swift-devel] tmp dir cleanup?
In-Reply-To: <6.0.0.22.2.20070227150928.05fbdd80@mail.mcs.anl.gov>
References: <6.0.0.22.2.20070227145605.05fbd2f0@mail.mcs.anl.gov>
    <1172610340.29528.2.camel@blabla.mcs.anl.gov>
    <6.0.0.22.2.20070227150928.05fbdd80@mail.mcs.anl.gov>
Message-ID: <1172610787.30050.0.camel@blabla.mcs.anl.gov>

On Tue, 2007-02-27 at 15:13 -0600, Veronika V. Nefedova wrote:
> At 03:05 PM 2/27/2007, Mihael Hategan wrote:
> >On Tue, 2007-02-27 at 15:00 -0600, Veronika V. Nefedova wrote:
> > > Hi,
> > >
> > > I am wondering if Swift is supposed to clean up after itself (after a
> > > successful run) ? I have a bunch of tmp directories created by swift both
> > > on localhost and on Teragrid which were left there after a normal
> > > completions of the workflow...
> >
> >The whole run directory should be deleted at the end of a successful
> >run. With rm -rf.
>
> By whom? Me or Swift ? Swift is not deleting anything...

It should. Unless there are errors.

>
> Nika
>
> > >
> > > Thanks,
> > >
> > > Nika
> > >
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > >
>

From hategan at mcs.anl.gov Tue Feb 27 15:15:11 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 27 Feb 2007 15:15:11 -0600
Subject: [Swift-devel] tmp dir cleanup?
In-Reply-To: <1172610787.30050.0.camel@blabla.mcs.anl.gov>
References: <6.0.0.22.2.20070227145605.05fbd2f0@mail.mcs.anl.gov>
    <1172610340.29528.2.camel@blabla.mcs.anl.gov>
    <6.0.0.22.2.20070227150928.05fbdd80@mail.mcs.anl.gov>
    <1172610787.30050.0.camel@blabla.mcs.anl.gov>
Message-ID: <1172610911.30144.0.camel@blabla.mcs.anl.gov>

On Tue, 2007-02-27 at 15:13 -0600, Mihael Hategan wrote:
> On Tue, 2007-02-27 at 15:13 -0600, Veronika V. Nefedova wrote:
> > At 03:05 PM 2/27/2007, Mihael Hategan wrote:
> > >On Tue, 2007-02-27 at 15:00 -0600, Veronika V. Nefedova wrote:
> > > > Hi,
> > > >
> > > > I am wondering if Swift is supposed to clean up after itself (after a
> > > > successful run) ? I have a bunch of tmp directories created by swift both
> > > > on localhost and on Teragrid which were left there after a normal
> > > > completions of the workflow...
> > >
> > >The whole run directory should be deleted at the end of a successful
> > >run. With rm -rf.
> >
> > By whom? Me or Swift ? Swift is not deleting anything...
>
> It should. Unless there are errors.

If running with -v, it should say what's up at the end. Either
"Cleanups: [list of stuff]" or "Errors detected, cleanup not done".

>
> >
> > Nika
> >
> > > >
> > > > Thanks,
> > > >
> > > > Nika
> > > >
> > > > _______________________________________________
> > > > Swift-devel mailing list
> > > > Swift-devel at ci.uchicago.edu
> > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > > >
> >
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>

From nefedova at mcs.anl.gov Tue Feb 27 15:26:11 2007
From: nefedova at mcs.anl.gov (Veronika V. Nefedova)
Date: Tue, 27 Feb 2007 15:26:11 -0600
Subject: [Swift-devel] log
Message-ID: <6.0.0.22.2.20070227152152.060105c0@mail.mcs.anl.gov>

HI,

I have a question about the log. My log shows 16 identical directory
creation (one per molecule?) with the same name:

2007-02-27 15:04:38,680 INFO vdl:createdirs Creating directory structure in
swift-MolDyn-4brc2qqmmvfe0/shared (swift-MolDyn-4brc2qqmmvfe0/shared/)
2007-02-27 15:04:38,694 INFO vdl:createdirs Creating directory structure in
swift-MolDyn-4brc2qqmmvfe0/shared (swift-MolDyn-4brc2qqmmvfe0/shared/)
<16 times, with the same name but different time stamps>

Probably this should be taken out of the loop ?

Nika

From nefedova at mcs.anl.gov Tue Feb 27 15:27:41 2007
From: nefedova at mcs.anl.gov (Veronika V. Nefedova)
Date: Tue, 27 Feb 2007 15:27:41 -0600
Subject: [Swift-devel] Re: test v0.1rc1
In-Reply-To: <6.0.0.22.2.20070227150116.06c309d0@mail.mcs.anl.gov>
References: <6.0.0.22.2.20070226125640.059527f0@mail.mcs.anl.gov>
    <1172517302.25410.0.camel@blabla.mcs.anl.gov>
    <6.0.0.22.2.20070226132104.05733c30@mail.mcs.anl.gov>
    <1172518758.26112.0.camel@blabla.mcs.anl.gov>
    <6.0.0.22.2.20070226140905.05adc960@mail.mcs.anl.gov>
    <6.0.0.22.2.20070226142008.05ee77f0@mail.mcs.anl.gov>
    <1172521994.27811.17.camel@blabla.mcs.anl.gov>
    <6.0.0.22.2.20070226143643.05e74cc0@mail.mcs.anl.gov>
    <1172522337.27811.24.camel@blabla.mcs.anl.gov>
    <6.0.0.22.2.20070226144143.05e71e60@mail.mcs.anl.gov>
    <1172522860.27811.34.camel@blabla.mcs.anl.gov>
    <1172548630.12021.3.camel@blabla.mcs.anl.gov>
    <6.0.0.22.2.20070226221618.032608e0@mail.mcs.anl.gov>
    <1172606659.27311.2.camel@blabla.mcs.anl.gov>
    <6.0.0.22.2.20070227141852.05fba7c0@mail.mcs.anl.gov>
    <6.0.0.22.2.20070227145017.06043760@mail.mcs.anl.gov>
    <1172609682.29302.0.camel@blabla.mcs.anl.gov>
    <6.0.0.22.2.20070227150116.06c309d0@mail.mcs.anl.gov>
Message-ID: <6.0.0.22.2.20070227152655.05fc36b0@mail.mcs.anl.gov>

Workflow with 16 molecules finished on TG just swiftly (-;

At 03:03 PM 2/27/2007, Veronika V. Nefedova wrote:
>I have just 16 molecules (at present) for testing - I am running all of
>them now. Will let you know how did it go.
I've already asked Yuqing to >send my way all 350 molecules (; > >Nika > >At 02:54 PM 2/27/2007, Mihael Hategan wrote: >>On Tue, 2007-02-27 at 14:54 -0600, Veronika V. Nefedova wrote: >> > It worked! Thank you very much for tracking it down, Mihael. >> > >> > Anyway -- a complete MolDyn workflow just finished its first run on >> > Teragrid (for 3 molecules only). >> >>Groovy. Can it be pumped up? >> >> > >> > (-; >> > >> > Nika >> > >> > At 02:19 PM 2/27/2007, Veronika V. Nefedova wrote: >> > >> > >Thanks, Mihael! >> > >I'll try it now >> > > >> > >Nika >> > > >> > >At 02:04 PM 2/27/2007, Mihael Hategan wrote: >> > >>Try it now. The latest nightly build should contain the fix. >> > >> >> > >>The problem was an inner class having synchronized methods and me >> > >>idiotically assuming that they will use the outer class' instance >> > >>monitor. >> > >> >> > >>Mihael >> > >> >> > >>On Mon, 2007-02-26 at 22:17 -0600, Veronika V. Nefedova wrote: >> > >> > If you give me your vds_home location - I can try to run the >> workflow and >> > >> > see if its working... >> > >> > >> > >> > NIka >> > >> > At 09:57 PM 2/26/2007, Mihael Hategan wrote: >> > >> > >Hmm. I made a change to the code that did not seem to be the >> cause, but >> > >> > >some other, smaller issue and enabled some more debugging in >> log4j. With >> > >> > >this, I've been running the workflow in a loop on wiggum for two >> hours >> > >> > >now, and got nothing yet. I don't know what to make of it. >> > >> > > >> > >> > >I'll keep running and eventually revert the changes to see if >> they are >> > >> > >the source. >> > >> > > >> > >> > >Mihael >> > >> > > >> > >> > >On Mon, 2007-02-26 at 14:47 -0600, Mihael Hategan wrote: >> > >> > > > On Mon, 2007-02-26 at 14:46 -0600, Veronika V. Nefedova wrote: >> > >> > > > > An additional info: This failure happened on TG with 070219 >> when >> > >> I was >> > >> > > > > running 2 molecules at the same time (i.e. 
two executables at >> > >> the same >> > >> > > > > time). When I tried to run just one, it failed with the same >> > >> > > exitcode, but >> > >> > > > > didn't have that handle exception: >> > >> > > > >> > >> > > > Right. This seems like a different problem, and I'm not sure >> if it's >> > >> > > > Swift or some problem with TP or the application. That needs >> to be >> > >> > > > investigated. >> > >> > > > >> > >> > > > > >> > >> > > > > 2007-02-26 14:34:41,986 DEBUG vdl:execute2 Application >> > >> exception: Job >> > >> > > chrm >> > >> > > > > failed with an exit code of 174 >> > >> > > > > sys:throw @ vdl-int.k, line: 108 >> > >> > > > > vdl:checkexitcode @ vdl-int.k, line: 367 >> > >> > > > > vdl:execute2 @ execute-default.k, line: 22 >> > >> > > > > vdl:execute @ swift-MolDyn.kml, line: 69 >> > >> > > > > charmm @ swift-MolDyn.kml, line: 279 >> > >> > > > > vdl:mains @ swift-MolDyn.kml, line: 261 >> > >> > > > > >> > >> > > > > >> > >> > > > > Again, the failure with 070219 happens only on TG, on localhost >> > >> (wiggum) >> > >> > > > > its working just fine. >> > >> > > > > >> > >> > > > > Nika >> > >> > > > > >> > >> > > > > >> > >> > > > > At 02:38 PM 2/26/2007, Mihael Hategan wrote: >> > >> > > > > >That's fine. Just wanted to be clear that we're talking about >> > >> the same >> > >> > > > > >error. It's good that it also occurs in 070219, because there >> > >> are no >> > >> > > > > >recent changes I could remember that could trigger it. It's >> > >> also good to >> > >> > > > > >know that it may or may not occur, because I know >> approximately >> > >> what >> > >> > > > > >class of problem we're dealing with. >> > >> > > > > > >> > >> > > > > >Mihael >> > >> > > > > > >> > >> > > > > >On Mon, 2007-02-26 at 14:37 -0600, Veronika V. Nefedova wrote: >> > >> > > > > > > Yes, I didn't paste it -- its all in the log. If you'd like >> > >> I can >> > >> > > send you >> > >> > > > > > > the log as an attachment... 
>> > >> > > > > > > >> > >> > > > > > > Nika >> > >> > > > > > > >> > >> > > > > > > At 02:33 PM 2/26/2007, Mihael Hategan wrote: >> > >> > > > > > > >Wait, because I'm missing something. Wasn't the error >> > >> supposed to be >> > >> > > > > > > >"TaskHandler can only handle unsubmitted tasks"? >> > >> > > > > > > > >> > >> > > > > > > >On Mon, 2007-02-26 at 14:26 -0600, Veronika V. >> Nefedova wrote: >> > >> > > > > > > > > And now its getting interesting! >> > >> > > > > > > > > >> > >> > > > > > > > > I have now the same failure (as below) with 070219 as I >> > >> had on >> > >> > > > > > localhost >> > >> > > > > > > > > with v0.1rc1 *BUT* when running on TG. Failed at >> the same >> > >> > > point (while >> > >> > > > > > > > > trying to run the last app in the workflow), with >> the same >> > >> > > exceptions. >> > >> > > > > > > > > Strange that 070219 worked on localhost (and still >> working). >> > >> > > > > > > > > >> > >> > > > > > > > > The log is on wiggum: >> > >> > > > > > > > /sandbox/ydeng/alamines/swift-MolDyn-690y7r1skc8z0.log >> > >> > > > > > > > > >> > >> > > > > > > > > 2007-02-26 14:10:16,543 INFO vdl:execute2 Running job >> > >> > > > > > chrm-rmnoet7i chrm >> > >> > > > > > > > > with arguments [system:solv_m001, title:solv, >> stitle:m001, >> > >> > > > > > > > > rtffile:parm03_gaff_all.rtf, >> > >> paramfile:parm03_gaffnb_all.prm, >> > >> > > > > > > > > gaff:m001_am1, nwater:400, ligcrd:lyz, rforce:0, >> > >> iseed:3131887, >> > >> > > > > > rwater:15, >> > >> > > > > > > > > nstep:100, minstep:100, skipstep:100, >> startstep:10000] in >> > >> > > > > > > > > swift-MolDyn-690y7r1skc8z0/chrm-rmnoet7i on TG-NCSA >> > >> > > > > > > > > 2007-02-26 14:11:18,586 DEBUG vdl:execute2 Application >> > >> > > exception: >> > >> > > > > > Job chrm >> > >> > > > > > > > > failed with an exit code of 174 >> > >> > > > > > > > > >> > >> > > > > > > > > All input files are staged in... 
>> > >> > > > > > > > > >> > >> > > > > > > > > >> > >> > > > > > > > > Nika >> > >> > > > > > > > > >> > >> > > > > > > > > At 02:17 PM 2/26/2007, Veronika V. Nefedova wrote: >> > >> > > > > > > > > >You can try to run my application, or look in the >> logs. I >> > >> > > ran it >> > >> > > > > > all on >> > >> > > > > > > > > >wiggum. The log is: >> > >> > > > > > > > > >/sandbox/ydeng/alamines/swift-MolDyn-8q6ygr7cy15c2.log >> > >> > > > > > > > > > >> > >> > > > > > > > > >the dtm file I am running is >> > >> > > /sandbox/ydeng/alamines/swift-MolDyn.dtm >> > >> > > > > > > > > > >> > >> > > > > > > > > >Nika >> > >> > > > > > > > > > >> > >> > > > > > > > > >At 01:39 PM 2/26/2007, Mihael Hategan wrote: >> > >> > > > > > > > > >>That doesn't sound good. How do I reproduce this? >> > >> > > > > > > > > >> >> > >> > > > > > > > > >>Mihael >> > >> > > > > > > > > >> >> > >> > > > > > > > > >>On Mon, 2007-02-26 at 13:21 -0600, Veronika V. >> > >> Nefedova wrote: >> > >> > > > > > > > > >> > The one Ben asked us all to test: >> > >> > > > > > > > > >> > >> > >> > > > > > > > > >> > >http://www.ci.uchicago.edu/swift/tests/vdsk-0. >> 1 rc1. >> > >> t ar.gz >> > >> > > > > > > > > >> > >> > >> > > > > > > > > >> > At 01:15 PM 2/26/2007, Mihael Hategan wrote: >> > >> > > > > > > > > >> > >On Mon, 2007-02-26 at 13:05 -0600, Veronika V. >> > >> Nefedova >> > >> > > wrote: >> > >> > > > > > > > > >> > > > When I tried to run my working workflow >> with a new >> > >> > > version, it >> > >> > > > > > > > > >> gave me an >> > >> > > > > > > > > >> > > > exception: >> > >> > > > > > > > > >> > > >> > >> > > > > > > > > >> > >Which new version? 
>> > >> > > > > > > > > >> > > >> > >> > > > > > > > > >> > >Mihael >> > >> > > > > > > > > >> > > >> > >> > > > > > > > > >> > > > >> > >> > > > > > > > > >> > > > Warning: Task handler throws exception but >> does not >> > >> > > set status >> > >> > > > > > > > > >> > > > >> > >> > > > > > > > >> > >> > > >> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: >> > >> > > > > > > > > >> > > > TaskHandler can only handle unsubmitted tasks >> > >> > > > > > > > > >> > > > at >> > >> > > > > > > > > >> > > > >> > >> > > > > > > > > >> > > >> > >> > > > > > > > > >> >> > >> > > > > > > > >> > >> > > > > > >> > >> > > >> > >> >> org.globus.cog.abstraction.impl.common.task.CachingFileOperationTaskHandler.submit(CachingFileOperationTaskHandler.java:20) >> > >> > > > > > > > > >> > > > at >> > >> > > > > > > > > >> > > > >> > >> > > > > > > > > >> > > >> > >> > > > > > > > > >> >> > >> > > > > > > > >> > >> > > > > > >> > >> > > >> > >> >> org.globus.cog.karajan.scheduler.submitQueue.NonBlockingSubmit.run(NonBlockingSubmit.java:78) >> > >> > > > > > > > > >> > > > at >> java.lang.Thread.run(Thread.java:534) >> > >> > > > > > > > > >> > > > >> > >> > > > > > > > > >> > > > [349] wiggum /sandbox/ydeng/alamines > \\ >> > >> > > > > > > > > >> > > > >> > >> > > > > > > > > >> > > > >> > >> > > > > > > > > >> > > > I do not have this happening with 070219 >> built. >> > >> > > > > > > > > >> > > > >> > >> > > > > > > > > >> > > > Nika >> > >> > > > > > > > > >> > > > >> > >> > > > > > > > > >> > > > At 06:12 AM 2/26/2007, Ben Clifford wrote: >> > >> > > > > > > > > >> > > > >> > >> > > > > > > > > >> > > > >On Mon, 26 Feb 2007, Ben Clifford wrote: >> > >> > > > > > > > > >> > > > > >> > >> > > > > > > > > >> > > > > > >> > >> > > > > > > > > >> > > > > > v0.1rc1 was built at the end of last week. 
>> > >> > > please spend >> > >> > > > > > > > some time >> > >> > > > > > > > > >> > > testing >> > >> > > > > > > > > >> > > > > >> > >> > > > > > > > > >> > > > >here's the URL for download: >> > >> > > > > > > > > >> > > > > >> > >> > > > > > > > > >> > > > >http://www.ci.uchicago.edu/swift/tests/vds >> k -0.1 >> > >> r c1.t >> > >> > > ar.gz >> > >> > > > > > > > > >> > > > > >> > >> > > > > > > > > >> > > > >-- >> > >> > > > > > > > > >> > > > >> > >> > > > > > > > > >> > > > >> > >> > > > > > > > > >> > > > >> _______________________________________________ >> > >> > > > > > > > > >> > > > Swift-devel mailing list >> > >> > > > > > > > > >> > > > Swift-devel at ci.uchicago.edu >> > >> > > > > > > > > >> > > > >> > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > >> > > > > > > > > >> > > > >> > >> > > > > > > > > >> > >> > >> > > > > > > > > >> > >> > >> > > > > > > > > > >> > >> > > > > > > > > > >> > >> > > > > > > > > >_______________________________________________ >> > >> > > > > > > > > >Swift-devel mailing list >> > >> > > > > > > > > >Swift-devel at ci.uchicago.edu >> > >> > > > > > > > > >http://mail.ci.uchicago.edu/mailman/listinfo/swift- >> d evel >> > >> > > > > > > > > >> > >> > > > > > > > > >> > >> > > > > > > >> > >> > > > > > > >> > >> > > > > >> > >> > > > > >> > >> > > > >> > >> > > > _______________________________________________ >> > >> > > > Swift-devel mailing list >> > >> > > > Swift-devel at ci.uchicago.edu >> > >> > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > >> > > > >> > >> > >> > >> > >> > > >> > > >> > >_______________________________________________ >> > >Swift-devel mailing list >> > >Swift-devel at ci.uchicago.edu >> > >http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > >> > > > >_______________________________________________ >Swift-devel mailing list >Swift-devel at ci.uchicago.edu >http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Tue Feb 
27 15:30:00 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 27 Feb 2007 15:30:00 -0600
Subject: [Swift-devel] log
In-Reply-To: <6.0.0.22.2.20070227152152.060105c0@mail.mcs.anl.gov>
References: <6.0.0.22.2.20070227152152.060105c0@mail.mcs.anl.gov>
Message-ID: <1172611801.30685.0.camel@blabla.mcs.anl.gov>

On Tue, 2007-02-27 at 15:26 -0600, Veronika V. Nefedova wrote:
> HI,
>
> I have a question about the log. My log shows 16 identical directory
> creation (one per molecule?) with the same name:
>
> 2007-02-27 15:04:38,680 INFO vdl:createdirs Creating directory
> structure in swift-MolDyn-4brc2qqmmvfe0/shared (swift-MolDyn-4brc2qqmmvfe0/shared/)
> 2007-02-27 15:04:38,694 INFO vdl:createdirs Creating directory
> structure in swift-MolDyn-4brc2qqmmvfe0/shared (swift-MolDyn-4brc2qqmmvfe0/shared/)
>
> <16 times, with the same name but different time stamps>
>
> Probably this should be taken out of the loop ?

It doesn't do much since the directory structure (nil) is already there.
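
A minimal sketch of why the repeated vdl:createdirs entries are harmless, assuming the underlying operation behaves like an idempotent "mkdir -p". This is not Swift's or Karajan's code: the helper name ensure_shared_dir and the run ID below are hypothetical, and Python is used only for illustration.

```python
import os
import tempfile


def ensure_shared_dir(root: str, run_id: str) -> str:
    """Create <root>/<run_id>/shared if missing; do nothing if it already exists."""
    path = os.path.join(root, run_id, "shared")
    # exist_ok=True makes the call idempotent: a second call finds the
    # directory in place and returns without touching its contents.
    os.makedirs(path, exist_ok=True)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # One call per job, mirroring the 16 log entries (hypothetical run ID).
        paths = [ensure_shared_dir(root, "swift-MolDyn-example") for _ in range(16)]
        # All 16 calls resolve to the same single directory.
        print(len(set(paths)), os.path.isdir(paths[0]))
```

Under that assumption, hoisting the call out of the loop would only save the existence checks and tidy up the log; it would not change the resulting directory layout.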
>
> Nika
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>

From wilde at mcs.anl.gov Tue Feb 27 15:33:38 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Tue, 27 Feb 2007 15:33:38 -0600
Subject: [Swift-devel] Re: test v0.1rc1
In-Reply-To: <6.0.0.22.2.20070227152655.05fc36b0@mail.mcs.anl.gov>
References: <6.0.0.22.2.20070226125640.059527f0@mail.mcs.anl.gov>
 <1172517302.25410.0.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070226132104.05733c30@mail.mcs.anl.gov>
 <1172518758.26112.0.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070226140905.05adc960@mail.mcs.anl.gov>
 <6.0.0.22.2.20070226142008.05ee77f0@mail.mcs.anl.gov>
 <1172521994.27811.17.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070226143643.05e74cc0@mail.mcs.anl.gov>
 <1172522337.27811.24.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070226144143.05e71e60@mail.mcs.anl.gov>
 <1172522860.27811.34.camel@blabla.mcs.anl.gov>
 <1172548630.12021.3.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070226221618.032608e0@mail.mcs.anl.gov>
 <1172606659.27311.2.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070227141852.05fba7c0@mail.mcs.anl.gov>
 <6.0.0.22.2.20070227145017.06043760@mail.mcs.anl.gov>
 <1172609682.29302.0.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070227150116.06c309d0@mail.mcs.anl.gov>
 <6.0.0.22.2.20070227152655.05fc36b0@mail.mcs.anl.gov>
Message-ID: <45E4A3B2.7080000@mcs.anl.gov>

Awesome! Nice work Nika and everyone!!!

Just how "swift" was it?

:) Mike

^ thanks, I needed that!

Veronika V. Nefedova wrote, On 2/27/2007 3:27 PM:
> Workflow with 16 molecules finished on TG just swiftly (-;
>
> At 03:03 PM 2/27/2007, Veronika V. Nefedova wrote:
>> I have just 16 molecules (at present) for testing - I am running all
>> of them now. Will let you know how did it go.
I've already asked
>> Yuqing to send my way all 350 molecules (;
>>
>> Nika

--
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL 60439 USA
tel 630-252-7497 fax 630-252-1997

From nefedova at mcs.anl.gov Tue Feb 27 15:40:50 2007
From: nefedova at mcs.anl.gov (Veronika V.
Nefedova)
Date: Tue, 27 Feb 2007 15:40:50 -0600
Subject: [Swift-devel] Re: test v0.1rc1
In-Reply-To: <45E4A3B2.7080000@mcs.anl.gov>
References: <6.0.0.22.2.20070226125640.059527f0@mail.mcs.anl.gov>
 <1172517302.25410.0.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070226132104.05733c30@mail.mcs.anl.gov>
 <1172518758.26112.0.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070226140905.05adc960@mail.mcs.anl.gov>
 <6.0.0.22.2.20070226142008.05ee77f0@mail.mcs.anl.gov>
 <1172521994.27811.17.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070226143643.05e74cc0@mail.mcs.anl.gov>
 <1172522337.27811.24.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070226144143.05e71e60@mail.mcs.anl.gov>
 <1172522860.27811.34.camel@blabla.mcs.anl.gov>
 <1172548630.12021.3.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070226221618.032608e0@mail.mcs.anl.gov>
 <1172606659.27311.2.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070227141852.05fba7c0@mail.mcs.anl.gov>
 <6.0.0.22.2.20070227145017.06043760@mail.mcs.anl.gov>
 <1172609682.29302.0.camel@blabla.mcs.anl.gov>
 <6.0.0.22.2.20070227150116.06c309d0@mail.mcs.anl.gov>
 <6.0.0.22.2.20070227152655.05fc36b0@mail.mcs.anl.gov>
 <45E4A3B2.7080000@mcs.anl.gov>
Message-ID: <6.0.0.22.2.20070227153537.06c34cd0@mail.mcs.anl.gov>

Yes, thank you to everybody who helped me!

The log shows these time stamps:

Start: 15:02:49 (i.e. first entry in the log)
Finish: 15:15:22 (i.e. the last entry in the log)

It is the total time (including all the staging in/out, waiting in the queue, etc). I need yet to find out what compute times were before swift.

Nika

At 03:33 PM 2/27/2007, Mike Wilde wrote:
>Awesome! Nice work Nika and everyone!!!
>
>Just how "swift" was it?
>
>:) Mike
>
>^ thanks, I needed that!
>
>Veronika V. Nefedova wrote, On 2/27/2007 3:27 PM:
>>Workflow with 16 molecules finished on TG just swiftly (-;
>>At 03:03 PM 2/27/2007, Veronika V. Nefedova wrote:
>>>I have just 16 molecules (at present) for testing - I am running all of
>>>them now. Will let you know how did it go.
I've already asked Yuqing to
>>>send my way all 350 molecules (;
>>>
>>>Nika
>Swift-devel mailing list >>>> > >> > > > > > > > > >Swift-devel at ci.uchicago.edu >>>> > >> > > > > > > > > >http://mail.ci.uchicago.edu/mailman/listinfo/swif >>>> t- d evel >>>> > >> > > > > > > > > >>>> > >> > > > > > > > > >>>> > >> > > > > > > >>>> > >> > > > > > > >>>> > >> > > > > >>>> > >> > > > > >>>> > >> > > > >>>> > >> > > > _______________________________________________ >>>> > >> > > > Swift-devel mailing list >>>> > >> > > > Swift-devel at ci.uchicago.edu >>>> > >> > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>> > >> > > > >>>> > >> > >>>> > >> > >>>> > > >>>> > > >>>> > >_______________________________________________ >>>> > >Swift-devel mailing list >>>> > >Swift-devel at ci.uchicago.edu >>>> > >http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>> > >>>> > >>> >>> >>>_______________________________________________ >>>Swift-devel mailing list >>>Swift-devel at ci.uchicago.edu >>>http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >>_______________________________________________ >>Swift-devel mailing list >>Swift-devel at ci.uchicago.edu >>http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >-- >Mike Wilde >Computation Institute, University of Chicago >Math & Computer Science Division >Argonne National Laboratory >Argonne, IL 60439 USA >tel 630-252-7497 fax 630-252-1997 From benc at hawaga.org.uk Tue Feb 27 15:45:50 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 27 Feb 2007 21:45:50 +0000 (GMT) Subject: [Swift-devel] website Message-ID: I'm merging website changes now. Expect trouble if you intend to merge over the next hour or so. -- From hategan at mcs.anl.gov Tue Feb 27 15:45:39 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 27 Feb 2007 15:45:39 -0600 Subject: [Swift-devel] website In-Reply-To: References: Message-ID: <1172612739.31124.0.camel@blabla.mcs.anl.gov> On Tue, 2007-02-27 at 21:45 +0000, Ben Clifford wrote: > I'm merging website changes now. 
Expect trouble if you intend to merge > over the next hour or so. You mean the cool one? From benc at hawaga.org.uk Tue Feb 27 15:48:15 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 27 Feb 2007 21:48:15 +0000 (GMT) Subject: [Swift-devel] website In-Reply-To: <1172612739.31124.0.camel@blabla.mcs.anl.gov> References: <1172612739.31124.0.camel@blabla.mcs.anl.gov> Message-ID: On Tue, 27 Feb 2007, Mihael Hategan wrote: > On Tue, 2007-02-27 at 21:45 +0000, Ben Clifford wrote: > > I'm merging website changes now. Expect trouble if you intend to merge > > over the next hour or so. > > You mean the cool one? well, its red. -- From wilde at mcs.anl.gov Tue Feb 27 15:54:21 2007 From: wilde at mcs.anl.gov (Mike Wilde) Date: Tue, 27 Feb 2007 15:54:21 -0600 Subject: [Swift-devel] [Fwd: Re: MPI via globus (fwd)] Message-ID: <45E4A88D.2050908@mcs.anl.gov> Not sure if this is relevant to us? - Mike -------- Original Message -------- Subject: Re: MPI via globus (fwd) Date: Tue, 27 Feb 2007 15:01:24 -0600 From: Alain Roy To: Stuart Martin , Steve Gallo CC: OSG ITB , Abhishek Singh Rana , Steve Gallo , JP Navarro References:

>There is a patch that TG gave to Alain that was added in VDT 1.6.1 >(correct me if I am wrong Alain). It creates a new flag at the top >of the pbs.pm script to turn off this default behavior for using the >mpi paths found at post-install time. If I understand it correctly, it's in VDT 1.6.1. I see a flag at the top of pbs.pm that is: $mpisoftenv = 0; # 0=false, 1=true Is that it? -alain -- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997 From benc at hawaga.org.uk Tue Feb 27 17:11:08 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 27 Feb 2007 23:11:08 +0000 (GMT) Subject: [Swift-devel] website In-Reply-To: References: <1172612739.31124.0.camel@blabla.mcs.anl.gov> Message-ID: turns out it doesn't work quite right -- php works on beth's test site but not on the real site. I have a ci support req in to change that behaviour. until then, site is as it was. -- From benc at hawaga.org.uk Tue Feb 27 21:59:08 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 28 Feb 2007 03:59:08 +0000 (GMT) Subject: [Swift-devel] 0.1rc2 Message-ID: mihael built an rc2. it is at: http://www.ci.uchicago.edu/swift/tests/vdsk-0.1rc2.tar.gz I haven't tried it yet. As before, try it, see what breaks, and if it survives 24h from this message then it turns into 0.1 release. -- From tiberius at ci.uchicago.edu Tue Feb 27 22:42:09 2007 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Tue, 27 Feb 2007 22:42:09 -0600 Subject: [Swift-devel] 0.1rc2 In-Reply-To: References: Message-ID: Where is the changelog file for it ? (probably there is one inside the archive, but I wish there was one available on the web somewhere). Tibi On 2/27/07, Ben Clifford wrote: > > mihael built an rc2. it is at: > > http://www.ci.uchicago.edu/swift/tests/vdsk-0.1rc2.tar.gz > > I haven't tried it yet. 
> > As before, try it, see what breaks, and if it survives 24h from this > message then it turns into 0.1 release. > > -- > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- Tiberiu (Tibi) Stef-Praun, PhD Research Staff, Computation Institute 5640 S. Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From hategan at mcs.anl.gov Tue Feb 27 22:51:21 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 27 Feb 2007 22:51:21 -0600 Subject: [Swift-devel] 0.1rc2 In-Reply-To: References: Message-ID: <1172638281.15952.2.camel@blabla.mcs.anl.gov> Good point. The builds do not contain the changelog. It may also be meaningful to aggregate the changelogs from CoG and stuff. I'll see to this. On Tue, 2007-02-27 at 22:42 -0600, Tiberiu Stef-Praun wrote: > Where is the changelog file for it ? (probably there is one inside the > archive, but I wish there was one available on the web somewhere). > > Tibi > > On 2/27/07, Ben Clifford wrote: > > > > mihael built an rc2. it is at: > > > > http://www.ci.uchicago.edu/swift/tests/vdsk-0.1rc2.tar.gz > > > > I haven't tried it yet. > > > > As before, try it, see what breaks, and if it survives 24h from this > > message then it turns into 0.1 release. > > > > -- > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > From wilde at mcs.anl.gov Wed Feb 28 11:41:49 2007 From: wilde at mcs.anl.gov (Mike Wilde) Date: Wed, 28 Feb 2007 11:41:49 -0600 Subject: [Swift-devel] Update on Teraport problems with wavlet workflow Message-ID: <45E5BEDD.2070901@mcs.anl.gov> Mihael informs me that the latest problems with the wavlet workflow indicate that some number of jobs in the workflow are failing to launch under PBS through the pre-WS GRAM provider. 
These failing jobs seem to give no indication whatsoever where the underlying failure is occurring. I think Tibi indicated yesterday that about 25 jobs out of 200 parallel jobs are failing in this manner (not sure I have these numbers right). Mihael is continuing to experiment to characterize the failure better and will report back to the group (and involve the TP and GRAM support teams) when he knows more. - Mike -- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997 From tiberius at ci.uchicago.edu Wed Feb 28 12:14:19 2007 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Wed, 28 Feb 2007 12:14:19 -0600 Subject: [Swift-devel] Update on Teraport problems with wavlet workflow In-Reply-To: <45E5BEDD.2070901@mcs.anl.gov> References: <45E5BEDD.2070901@mcs.anl.gov> Message-ID: Here is more info: Indeed, yesterday I got 175 successful jobs out of the total of 192, and the workflow never ended (it kept retrying to transfer files from the failed ones, which failed because they did not exist). Looking at the processor load and at the transfer load, the 175 jobs were done in about 75 minutes (about 10x speedup from a serialized execution). At Mihael's suggestion I started with smaller workflows, so here are the numbers (for the ones that completed successfully): 1 job: 4 minutes; 6 jobs: 6 minutes; 24 jobs: 20 minutes; 36 jobs: 25 minutes (10 minutes execution + 15 minutes data transfer). I have a total of 192 jobs to run. I have retried running some of the failed workflows, and they fail because some task in the workflow is not run correctly. For instance, the most troubling one was the latest run: the jobs submitted failed right at the beginning, even though they had run successfully in the previous run. My current assumption is that one (or several?) cluster nodes are bad.
The failure can be observed in the log in the following way: a job gets submitted, and 20 seconds later GRAM declares it finished (normal execution time is about 3 minutes), so the workflow attempts to transfer back some nonexistent files (nothing gets generated: neither outputs, nor stdout, stderr, or kickstart records in the job's working directory), and it creates files of size zero on the submission machine. That is not good because when attempting a -resume, those failed jobs are not reconsidered for execution. Summary/Speculation: a bad teraport node causes a job to be declared as done even though the execution failed. I will move to another Grid site, to run there locally, and hopefully not get the same behavior as on teraport. Tibi On 2/28/07, Mike Wilde wrote: > Mihael informs me that the latest problems with the wavlet workflow indicate > that some number of jobs in the workflow are failing to launch under PBS > through the pre-WS GRAM provider. These failing jobs seem to give no > indication whatsoever where the underlying failure is occurring. > > I think Tibi indicated yesterday that about 25 jobs out of 200 parallel jobs > are failing in this manner (not sure I have these numbers right). > > Mihael is continuing to experiment to characterize the failure better and will > report back to the group (and involve the TP and GRAM support teams) when he > knows more. > > - Mike > > -- > Mike Wilde > Computation Institute, University of Chicago > Math & Computer Science Division > Argonne National Laboratory > Argonne, IL 60439 USA > tel 630-252-7497 fax 630-252-1997 > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- Tiberiu (Tibi) Stef-Praun, PhD Research Staff, Computation Institute 5640 S.
Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From benc at hawaga.org.uk Wed Feb 28 12:16:12 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 28 Feb 2007 18:16:12 +0000 (GMT) Subject: [Swift-devel] Update on Teraport problems with wavlet workflow In-Reply-To: References: <45E5BEDD.2070901@mcs.anl.gov> Message-ID: do you have kickstart records for the jobs that are failing? On Wed, 28 Feb 2007, Tiberiu Stef-Praun wrote: > Here is more info: > Indeed, yesterday I got 175 successful jobs from the total of 192, and > the workflow never ended (it kept retrying transferring fiiles from > the the failed ones, which it failed because they did not exist). > Looking at the processors load and at the transfer load, the total > 175jobs were done in about 75minutes (about 10x speedup from a > serialized execution). > > At Mihael's suggestion I started with smaller workflows, so here are > the numbers (for the ones that completed successfully): > 1 job: 4 minutes > 6jobs: 6 minutes > 24 jobs: 20 minutes > 36 jobs: 25 minutes(10 minutes execution+15minutes data transfer). > > I have a total of 192 jobs to run. > > > I have retried running some of the failed workflows, and they fail > because some task in the workflow is not run correctly. For instance, > the most troubling one was the latest run: the jobs submitted failed > right at the beginning, even though they have run successfully in the > previous run. > My current assumption is that one (?several) cluster nodes are bad. > The failure can be observed in the log in the following way: job gets > submitted, andd 20 seconds later, gram declares is finished (normal > execution time is about 3 minutes), so the workflow attempts to > transfer back some inexistent files (nothing gets generated, neither > outputs, nor stdout,stderr,kickstart in the job's working directory), > and it creates on the submission machine files of size zero. 
That is > not good because when attempting a -resume, those failed jobs are not > re-considered for execution. > > Summary/Speculation: bad teraport node causes job to be declared as > done even though the execution failed > > I will move to another Grid site, to run in there locally, and > hopefully not get the same behavior as on teraport. > > Tibi > > On 2/28/07, Mike Wilde wrote: > > Mihael informs me that the latest problems with the wavlet workflow indicate > > that some number of jobs in the workflow are failing to launch under PBS > > through the pre-WS GRAM provider. These failing jobs seem to give no > > indication whatsoever where the underlying failure is occurring. > > > > I think Tibi indicated yesterday that about 25 jobs out of 200 parallel jobs > > are failing in this manner (not sure I have these numbers right). > > > > Mihael is continuing to experiment to characterize the failure better and > > will > > report back to the group (and involve the TP and GRAM support teams) when he > > knows more. > > > > - Mike > > > > -- > > Mike Wilde > > Computation Institute, University of Chicago > > Math & Computer Science Division > > Argonne National Laboratory > > Argonne, IL 60439 USA > > tel 630-252-7497 fax 630-252-1997 > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > From benc at hawaga.org.uk Wed Feb 28 12:17:50 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 28 Feb 2007 18:17:50 +0000 (GMT) Subject: [Swift-devel] Update on Teraport problems with wavlet workflow In-Reply-To: References: <45E5BEDD.2070901@mcs.anl.gov> Message-ID: On Wed, 28 Feb 2007, Ben Clifford wrote: > do you have kickstart records for the jobs that are failing? 
if you do, then: > > Summary/Speculation: bad teraport node causes job to be declared as > > done even though the execution failed this speculation can be investigated further by: finding a job that breaks. finding the node name from the kickstart record. grepping all the kickstart records to find other kickstart records for those jobs. looking to see if they all fail, or if some work and some fail. then report back findings here. -- From tiberius at ci.uchicago.edu Wed Feb 28 12:21:37 2007 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Wed, 28 Feb 2007 12:21:37 -0600 Subject: [Swift-devel] Update on Teraport problems with wavlet workflow In-Reply-To: References: <45E5BEDD.2070901@mcs.anl.gov> Message-ID: Nothing gets generated in the individual job's temporary directories. There is no kickstart record. It would be really useful finding out the hostname of the node on which these jobs ran. Let me retry some more workflow runs. On 2/28/07, Ben Clifford wrote: > > > On Wed, 28 Feb 2007, Ben Clifford wrote: > > > do you have kickstart records for the jobs that are failing? > > if you do, then: > > > > Summary/Speculation: bad teraport node causes job to be declared as > > > done even though the execution failed > > this speculation can be investigated further by: > > finding a job that breaks. finding the node name from the kickstart > record. grepping all the kickstart records to find other kickstart records > for those jobs. looking to see if they all fail, or if some work and some > fail. then report back findings here. > > -- > -- Tiberiu (Tibi) Stef-Praun, PhD Research Staff, Computation Institute 5640 S. 
Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From benc at hawaga.org.uk Wed Feb 28 12:21:49 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 28 Feb 2007 18:21:49 +0000 (GMT) Subject: [Swift-devel] the big red swift site Message-ID: is mostly up: http://www.ci.uchicago.edu/swift/ The docbook docs are not restyled yet, and I know a couple of links have got lost in the transition. I'm fixing those that I know of now, but point out any that you notice. -- From benc at hawaga.org.uk Wed Feb 28 12:24:13 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 28 Feb 2007 18:24:13 +0000 (GMT) Subject: [Swift-devel] Update on Teraport problems with wavlet workflow In-Reply-To: References: <45E5BEDD.2070901@mcs.anl.gov> Message-ID: do you have kickstart records for the nodes that *do* run? On Wed, 28 Feb 2007, Tiberiu Stef-Praun wrote: > Nothing gets generated in the individual job's temporary directories. > There is no kickstart record. > It would be really useful finding out the hostname of the node on > which these jobs ran. > > Let me retry some more workflow runs. > > On 2/28/07, Ben Clifford wrote: > > > > > > On Wed, 28 Feb 2007, Ben Clifford wrote: > > > > > do you have kickstart records for the jobs that are failing? > > > > if you do, then: > > > > > > Summary/Speculation: bad teraport node causes job to be declared as > > > > done even though the execution failed > > > > this speculation can be investigated further by: > > > > finding a job that breaks. finding the node name from the kickstart > > record. grepping all the kickstart records to find other kickstart records > > for those jobs. looking to see if they all fail, or if some work and some > > fail. then report back findings here.
> > > > -- > > > > > From tiberius at ci.uchicago.edu Wed Feb 28 12:27:07 2007 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Wed, 28 Feb 2007 12:27:07 -0600 Subject: [Swift-devel] Update on Teraport problems with wavlet workflow In-Reply-To: References: <45E5BEDD.2070901@mcs.anl.gov> Message-ID: In this case, everything that was submitted failed, and as soon as I noticed the failures, I stopped the workflow, to investigate the cause of the failure. Will re-run and let you know about kickstart records. On 2/28/07, Ben Clifford wrote: > > do you have kickstart records for the nodes that *do* run? > > On Wed, 28 Feb 2007, Tiberiu Stef-Praun wrote: > > > Nothing gets generated in the individual job's temporary directories. > > There is no kickstart record. > > It would be really useful finding out the hostname of the node on > > which these jobs ran. > > > > Let me retry some more workflow runs. > > > > On 2/28/07, Ben Clifford wrote: > > > > > > > > > On Wed, 28 Feb 2007, Ben Clifford wrote: > > > > > > > do you have kickstart records for the jobs that are failing? > > > > > > if you do, then: > > > > > > > > Summary/Speculation: bad teraport node causes job to be declared as > > > > > done even though the execution failed > > > > > > this speculation can be investigated further by: > > > > > > finding a job that breaks. finding the node name from the kickstart > > > record. grepping all the kickstart records to find other kickstart records > > > for those jobs. looking to see if they all fail, or if some work and some > > > fail. then report back findings here. > > > > > > -- > > > > > > > > > > -- Tiberiu (Tibi) Stef-Praun, PhD Research Staff, Computation Institute 5640 S. 
Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From hategan at mcs.anl.gov Wed Feb 28 12:31:43 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 28 Feb 2007 12:31:43 -0600 Subject: [Swift-devel] Update on Teraport problems with wavlet workflow In-Reply-To: References: <45E5BEDD.2070901@mcs.anl.gov> Message-ID: <1172687504.26377.1.camel@blabla.mcs.anl.gov> On Wed, 2007-02-28 at 12:14 -0600, Tiberiu Stef-Praun wrote: > Here is more info: > Indeed, yesterday I got 175 successful jobs from the total of 192, and > the workflow never ended Did the workflow lock up or did you interrupt it because you got tired of it trying to transfer all the missing files? > (it kept retrying transferring fiiles from > the the failed ones, which it failed because they did not exist). > Looking at the processors load and at the transfer load, the total > 175jobs were done in about 75minutes (about 10x speedup from a > serialized execution). > > At Mihael's suggestion I started with smaller workflows, so here are > the numbers (for the ones that completed successfully): > 1 job: 4 minutes > 6jobs: 6 minutes > 24 jobs: 20 minutes > 36 jobs: 25 minutes(10 minutes execution+15minutes data transfer). > > I have a total of 192 jobs to run. > > > I have retried running some of the failed workflows, and they fail > because some task in the workflow is not run correctly. For instance, > the most troubling one was the latest run: the jobs submitted failed > right at the beginning, even though they have run successfully in the > previous run. > My current assumption is that one (?several) cluster nodes are bad. 
> The failure can be observed in the log in the following way: job gets > submitted, andd 20 seconds later, gram declares is finished (normal > execution time is about 3 minutes), so the workflow attempts to > transfer back some inexistent files (nothing gets generated, neither > outputs, nor stdout,stderr,kickstart in the job's working directory), > and it creates on the submission machine files of size zero. That is > not good because when attempting a -resume, those failed jobs are not > re-considered for execution. > > Summary/Speculation: bad teraport node causes job to be declared as > done even though the execution failed > > I will move to another Grid site, to run in there locally, and > hopefully not get the same behavior as on teraport. > > Tibi > > On 2/28/07, Mike Wilde wrote: > > Mihael informs me that the latest problems with the wavlet workflow indicate > > that some number of jobs in the workflow are failing to launch under PBS > > through the pre-WS GRAM provider. These failing jobs seem to give no > > indication whatsoever where the underlying failure is occurring. > > > > I think Tibi indicated yesterday that about 25 jobs out of 200 parallel jobs > > are failing in this manner (not sure I have these numbers right). > > > > Mihael is continuing to experiment to characterize the failure better and will > > report back to the group (and involve the TP and GRAM support teams) when he > > knows more. 
> > > > - Mike > > > > -- > > Mike Wilde > > Computation Institute, University of Chicago > > Math & Computer Science Division > > Argonne National Laboratory > > Argonne, IL 60439 USA > > tel 630-252-7497 fax 630-252-1997 > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > From hategan at mcs.anl.gov Wed Feb 28 12:33:03 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 28 Feb 2007 12:33:03 -0600 Subject: [Swift-devel] Update on Teraport problems with wavlet workflow In-Reply-To: References: <45E5BEDD.2070901@mcs.anl.gov> Message-ID: <1172687583.26377.3.camel@blabla.mcs.anl.gov> On Wed, 2007-02-28 at 18:16 +0000, Ben Clifford wrote: > do you have kickstart records for the jobs that are failing? Thing is, the wrapper for those jobs doesn't seem to put any output in the wrapper log. Which seems to indicate that the wrapper was never started. Which in turn may mean that kickstart would not have been started either. > > On Wed, 28 Feb 2007, Tiberiu Stef-Praun wrote: > > > Here is more info: > > Indeed, yesterday I got 175 successful jobs from the total of 192, and > > the workflow never ended (it kept retrying transferring fiiles from > > the the failed ones, which it failed because they did not exist). > > Looking at the processors load and at the transfer load, the total > > 175jobs were done in about 75minutes (about 10x speedup from a > > serialized execution). > > > > At Mihael's suggestion I started with smaller workflows, so here are > > the numbers (for the ones that completed successfully): > > 1 job: 4 minutes > > 6jobs: 6 minutes > > 24 jobs: 20 minutes > > 36 jobs: 25 minutes(10 minutes execution+15minutes data transfer). > > > > I have a total of 192 jobs to run. > > > > > > I have retried running some of the failed workflows, and they fail > > because some task in the workflow is not run correctly. 
For instance, > > the most troubling one was the latest run: the jobs submitted failed > > right at the beginning, even though they have run successfully in the > > previous run. > > My current assumption is that one (?several) cluster nodes are bad. > > The failure can be observed in the log in the following way: job gets > > submitted, andd 20 seconds later, gram declares is finished (normal > > execution time is about 3 minutes), so the workflow attempts to > > transfer back some inexistent files (nothing gets generated, neither > > outputs, nor stdout,stderr,kickstart in the job's working directory), > > and it creates on the submission machine files of size zero. That is > > not good because when attempting a -resume, those failed jobs are not > > re-considered for execution. > > > > Summary/Speculation: bad teraport node causes job to be declared as > > done even though the execution failed > > > > I will move to another Grid site, to run in there locally, and > > hopefully not get the same behavior as on teraport. > > > > Tibi > > > > On 2/28/07, Mike Wilde wrote: > > > Mihael informs me that the latest problems with the wavlet workflow indicate > > > that some number of jobs in the workflow are failing to launch under PBS > > > through the pre-WS GRAM provider. These failing jobs seem to give no > > > indication whatsoever where the underlying failure is occurring. > > > > > > I think Tibi indicated yesterday that about 25 jobs out of 200 parallel jobs > > > are failing in this manner (not sure I have these numbers right). > > > > > > Mihael is continuing to experiment to characterize the failure better and > > > will > > > report back to the group (and involve the TP and GRAM support teams) when he > > > knows more. 
> > > > > > - Mike > > > > > > -- > > > Mike Wilde > > > Computation Institute, University of Chicago > > > Math & Computer Science Division > > > Argonne National Laboratory > > > Argonne, IL 60439 USA > > > tel 630-252-7497 fax 630-252-1997 > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From tiberius at ci.uchicago.edu Wed Feb 28 12:36:45 2007 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Wed, 28 Feb 2007 12:36:45 -0600 Subject: [Swift-devel] Update on Teraport problems with wavlet workflow In-Reply-To: <1172687504.26377.1.camel@blabla.mcs.anl.gov> References: <45E5BEDD.2070901@mcs.anl.gov> <1172687504.26377.1.camel@blabla.mcs.anl.gov> Message-ID: I stopped it On 2/28/07, Mihael Hategan wrote: > On Wed, 2007-02-28 at 12:14 -0600, Tiberiu Stef-Praun wrote: > > Here is more info: > > Indeed, yesterday I got 175 successful jobs from the total of 192, and > > the workflow never ended > > Did the workflow lock up or did you interrupt it because you got tired > of it trying to transfer all the missing files? > > > (it kept retrying transferring fiiles from > > the the failed ones, which it failed because they did not exist). > > Looking at the processors load and at the transfer load, the total > > 175jobs were done in about 75minutes (about 10x speedup from a > > serialized execution). > > > > At Mihael's suggestion I started with smaller workflows, so here are > > the numbers (for the ones that completed successfully): > > 1 job: 4 minutes > > 6jobs: 6 minutes > > 24 jobs: 20 minutes > > 36 jobs: 25 minutes(10 minutes execution+15minutes data transfer). > > > > I have a total of 192 jobs to run. 
> > > > > > I have retried running some of the failed workflows, and they fail > > because some task in the workflow is not run correctly. For instance, > > the most troubling one was the latest run: the jobs submitted failed > > right at the beginning, even though they have run successfully in the > > previous run. > > My current assumption is that one (?several) cluster nodes are bad. > > The failure can be observed in the log in the following way: job gets > > submitted, andd 20 seconds later, gram declares is finished (normal > > execution time is about 3 minutes), so the workflow attempts to > > transfer back some inexistent files (nothing gets generated, neither > > outputs, nor stdout,stderr,kickstart in the job's working directory), > > and it creates on the submission machine files of size zero. That is > > not good because when attempting a -resume, those failed jobs are not > > re-considered for execution. > > > > Summary/Speculation: bad teraport node causes job to be declared as > > done even though the execution failed > > > > I will move to another Grid site, to run in there locally, and > > hopefully not get the same behavior as on teraport. > > > > Tibi > > > > On 2/28/07, Mike Wilde wrote: > > > Mihael informs me that the latest problems with the wavlet workflow indicate > > > that some number of jobs in the workflow are failing to launch under PBS > > > through the pre-WS GRAM provider. These failing jobs seem to give no > > > indication whatsoever where the underlying failure is occurring. > > > > > > I think Tibi indicated yesterday that about 25 jobs out of 200 parallel jobs > > > are failing in this manner (not sure I have these numbers right). > > > > > > Mihael is continuing to experiment to characterize the failure better and will > > > report back to the group (and involve the TP and GRAM support teams) when he > > > knows more. 
> > > > > > - Mike > > > > > > -- > > > Mike Wilde > > > Computation Institute, University of Chicago > > > Math & Computer Science Division > > > Argonne National Laboratory > > > Argonne, IL 60439 USA > > > tel 630-252-7497 fax 630-252-1997 > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > -- Tiberiu (Tibi) Stef-Praun, PhD Research Staff, Computation Institute 5640 S. Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From wilde at mcs.anl.gov Wed Feb 28 12:37:31 2007 From: wilde at mcs.anl.gov (Mike Wilde) Date: Wed, 28 Feb 2007 12:37:31 -0600 Subject: [Swift-devel] Update on Teraport problems with wavlet workflow In-Reply-To: <1172687504.26377.1.camel@blabla.mcs.anl.gov> References: <45E5BEDD.2070901@mcs.anl.gov> <1172687504.26377.1.camel@blabla.mcs.anl.gov> Message-ID: <45E5CBEB.3040402@mcs.anl.gov> Do we need to file a bug to improve the processing of the missing-file case? Ie., if the file is truly missing, this should (typically?) not cause great delays in the workflow proceeding, or proceeding to fail quickly. - Mike Mihael Hategan wrote, On 2/28/2007 12:31 PM: > On Wed, 2007-02-28 at 12:14 -0600, Tiberiu Stef-Praun wrote: >> Here is more info: >> Indeed, yesterday I got 175 successful jobs from the total of 192, and >> the workflow never ended > > Did the workflow lock up or did you interrupt it because you got tired > of it trying to transfer all the missing files? > >> (it kept retrying transferring fiiles from >> the the failed ones, which it failed because they did not exist). >> Looking at the processors load and at the transfer load, the total >> 175jobs were done in about 75minutes (about 10x speedup from a >> serialized execution). 
>> >> At Mihael's suggestion I started with smaller workflows, so here are >> the numbers (for the ones that completed successfully): >> 1 job: 4 minutes >> 6jobs: 6 minutes >> 24 jobs: 20 minutes >> 36 jobs: 25 minutes(10 minutes execution+15minutes data transfer). >> >> I have a total of 192 jobs to run. >> >> >> I have retried running some of the failed workflows, and they fail >> because some task in the workflow is not run correctly. For instance, >> the most troubling one was the latest run: the jobs submitted failed >> right at the beginning, even though they have run successfully in the >> previous run. >> My current assumption is that one (?several) cluster nodes are bad. >> The failure can be observed in the log in the following way: job gets >> submitted, andd 20 seconds later, gram declares is finished (normal >> execution time is about 3 minutes), so the workflow attempts to >> transfer back some inexistent files (nothing gets generated, neither >> outputs, nor stdout,stderr,kickstart in the job's working directory), >> and it creates on the submission machine files of size zero. That is >> not good because when attempting a -resume, those failed jobs are not >> re-considered for execution. >> >> Summary/Speculation: bad teraport node causes job to be declared as >> done even though the execution failed >> >> I will move to another Grid site, to run in there locally, and >> hopefully not get the same behavior as on teraport. >> >> Tibi >> >> On 2/28/07, Mike Wilde wrote: >>> Mihael informs me that the latest problems with the wavlet workflow indicate >>> that some number of jobs in the workflow are failing to launch under PBS >>> through the pre-WS GRAM provider. These failing jobs seem to give no >>> indication whatsoever where the underlying failure is occurring. >>> >>> I think Tibi indicated yesterday that about 25 jobs out of 200 parallel jobs >>> are failing in this manner (not sure I have these numbers right). 
>>> >>> Mihael is continuing to experiment to characterize the failure better and will >>> report back to the group (and involve the TP and GRAM support teams) when he >>> knows more. >>> >>> - Mike >>> >>> -- >>> Mike Wilde >>> Computation Institute, University of Chicago >>> Math & Computer Science Division >>> Argonne National Laboratory >>> Argonne, IL 60439 USA >>> tel 630-252-7497 fax 630-252-1997 >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >> > > -- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997 From tiberius at ci.uchicago.edu Wed Feb 28 12:39:15 2007 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Wed, 28 Feb 2007 12:39:15 -0600 Subject: [Swift-devel] Update on Teraport problems with wavlet workflow In-Reply-To: <45E5CBEB.3040402@mcs.anl.gov> References: <45E5BEDD.2070901@mcs.anl.gov> <1172687504.26377.1.camel@blabla.mcs.anl.gov> <45E5CBEB.3040402@mcs.anl.gov> Message-ID: I would say that the solution is: on a missing file, fully resubmit the job, not just try to re-read it. Anyway, I'm trying to discover what causes the file not to be generated. On 2/28/07, Mike Wilde wrote: > Do we need to file a bug to improve the processing of the missing-file case? > > Ie., if the file is truly missing, this should (typically?) not cause great > delays in the workflow proceeding, or proceeding to fail quickly. > > - Mike > > Mihael Hategan wrote, On 2/28/2007 12:31 PM: > > On Wed, 2007-02-28 at 12:14 -0600, Tiberiu Stef-Praun wrote: > >> Here is more info: > >> Indeed, yesterday I got 175 successful jobs from the total of 192, and > >> the workflow never ended > > > > Did the workflow lock up or did you interrupt it because you got tired > > of it trying to transfer all the missing files? 
> > > >> (it kept retrying transferring fiiles from > >> the the failed ones, which it failed because they did not exist). > >> Looking at the processors load and at the transfer load, the total > >> 175jobs were done in about 75minutes (about 10x speedup from a > >> serialized execution). > >> > >> At Mihael's suggestion I started with smaller workflows, so here are > >> the numbers (for the ones that completed successfully): > >> 1 job: 4 minutes > >> 6jobs: 6 minutes > >> 24 jobs: 20 minutes > >> 36 jobs: 25 minutes(10 minutes execution+15minutes data transfer). > >> > >> I have a total of 192 jobs to run. > >> > >> > >> I have retried running some of the failed workflows, and they fail > >> because some task in the workflow is not run correctly. For instance, > >> the most troubling one was the latest run: the jobs submitted failed > >> right at the beginning, even though they have run successfully in the > >> previous run. > >> My current assumption is that one (?several) cluster nodes are bad. > >> The failure can be observed in the log in the following way: job gets > >> submitted, andd 20 seconds later, gram declares is finished (normal > >> execution time is about 3 minutes), so the workflow attempts to > >> transfer back some inexistent files (nothing gets generated, neither > >> outputs, nor stdout,stderr,kickstart in the job's working directory), > >> and it creates on the submission machine files of size zero. That is > >> not good because when attempting a -resume, those failed jobs are not > >> re-considered for execution. > >> > >> Summary/Speculation: bad teraport node causes job to be declared as > >> done even though the execution failed > >> > >> I will move to another Grid site, to run in there locally, and > >> hopefully not get the same behavior as on teraport. 
> >> > >> Tibi > >> > >> On 2/28/07, Mike Wilde wrote: > >>> Mihael informs me that the latest problems with the wavlet workflow indicate > >>> that some number of jobs in the workflow are failing to launch under PBS > >>> through the pre-WS GRAM provider. These failing jobs seem to give no > >>> indication whatsoever where the underlying failure is occurring. > >>> > >>> I think Tibi indicated yesterday that about 25 jobs out of 200 parallel jobs > >>> are failing in this manner (not sure I have these numbers right). > >>> > >>> Mihael is continuing to experiment to characterize the failure better and will > >>> report back to the group (and involve the TP and GRAM support teams) when he > >>> knows more. > >>> > >>> - Mike > >>> > >>> -- > >>> Mike Wilde > >>> Computation Institute, University of Chicago > >>> Math & Computer Science Division > >>> Argonne National Laboratory > >>> Argonne, IL 60439 USA > >>> tel 630-252-7497 fax 630-252-1997 > >>> _______________________________________________ > >>> Swift-devel mailing list > >>> Swift-devel at ci.uchicago.edu > >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>> > >> > > > > > > -- > Mike Wilde > Computation Institute, University of Chicago > Math & Computer Science Division > Argonne National Laboratory > Argonne, IL 60439 USA > tel 630-252-7497 fax 630-252-1997 > -- Tiberiu (Tibi) Stef-Praun, PhD Research Staff, Computation Institute 5640 S. 
Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From benc at hawaga.org.uk Wed Feb 28 12:39:16 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 28 Feb 2007 18:39:16 +0000 (GMT) Subject: [Swift-devel] Update on Teraport problems with wavlet workflow In-Reply-To: <1172687583.26377.3.camel@blabla.mcs.anl.gov> References: <45E5BEDD.2070901@mcs.anl.gov> <1172687583.26377.3.camel@blabla.mcs.anl.gov> Message-ID: On Wed, 28 Feb 2007, Mihael Hategan wrote: > On Wed, 2007-02-28 at 18:16 +0000, Ben Clifford wrote: > > do you have kickstart records for the jobs that are failing? > > Thing is, the wrapper for those jobs doesn't seem to put any output in > the wrapper log. Which seems to indicate that the wrapper was never > started. Which in turn may mean that kickstart would not have been > started either. quite possibly yes. grr. I have some memory of a similar problem once before that I diagnosed to something like full-/tmp or something, but I have no recollection really of how I diagnosed that. maybe I just checked each node for /tmp space. -- From hategan at mcs.anl.gov Wed Feb 28 12:40:49 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 28 Feb 2007 12:40:49 -0600 Subject: [Swift-devel] Update on Teraport problems with wavlet workflow In-Reply-To: References: <45E5BEDD.2070901@mcs.anl.gov> <1172687504.26377.1.camel@blabla.mcs.anl.gov> <45E5CBEB.3040402@mcs.anl.gov> Message-ID: <1172688049.26377.10.camel@blabla.mcs.anl.gov> On Wed, 2007-02-28 at 12:39 -0600, Tiberiu Stef-Praun wrote: > I would say that the solution is: on a missing file, fully resubmit > the job, not just try to re-read it. That does happen. But it also tries to re-read it. A filter could be added to not restart transfers that fail because the file is missing. > Anyway, I'm trying to discover what causes the file not to be generated. > > On 2/28/07, Mike Wilde wrote: > > Do we need to file a bug to improve the processing of the missing-file case? 
> > > > Ie., if the file is truly missing, this should (typically?) not cause great > > delays in the workflow proceeding, or proceeding to fail quickly. > > > > - Mike > > > > Mihael Hategan wrote, On 2/28/2007 12:31 PM: > > > On Wed, 2007-02-28 at 12:14 -0600, Tiberiu Stef-Praun wrote: > > >> Here is more info: > > >> Indeed, yesterday I got 175 successful jobs from the total of 192, and > > >> the workflow never ended > > > > > > Did the workflow lock up or did you interrupt it because you got tired > > > of it trying to transfer all the missing files? > > > > > >> (it kept retrying transferring fiiles from > > >> the the failed ones, which it failed because they did not exist). > > >> Looking at the processors load and at the transfer load, the total > > >> 175jobs were done in about 75minutes (about 10x speedup from a > > >> serialized execution). > > >> > > >> At Mihael's suggestion I started with smaller workflows, so here are > > >> the numbers (for the ones that completed successfully): > > >> 1 job: 4 minutes > > >> 6jobs: 6 minutes > > >> 24 jobs: 20 minutes > > >> 36 jobs: 25 minutes(10 minutes execution+15minutes data transfer). > > >> > > >> I have a total of 192 jobs to run. > > >> > > >> > > >> I have retried running some of the failed workflows, and they fail > > >> because some task in the workflow is not run correctly. For instance, > > >> the most troubling one was the latest run: the jobs submitted failed > > >> right at the beginning, even though they have run successfully in the > > >> previous run. > > >> My current assumption is that one (?several) cluster nodes are bad. 
> > >> The failure can be observed in the log in the following way: job gets > > >> submitted, andd 20 seconds later, gram declares is finished (normal > > >> execution time is about 3 minutes), so the workflow attempts to > > >> transfer back some inexistent files (nothing gets generated, neither > > >> outputs, nor stdout,stderr,kickstart in the job's working directory), > > >> and it creates on the submission machine files of size zero. That is > > >> not good because when attempting a -resume, those failed jobs are not > > >> re-considered for execution. > > >> > > >> Summary/Speculation: bad teraport node causes job to be declared as > > >> done even though the execution failed > > >> > > >> I will move to another Grid site, to run in there locally, and > > >> hopefully not get the same behavior as on teraport. > > >> > > >> Tibi > > >> > > >> On 2/28/07, Mike Wilde wrote: > > >>> Mihael informs me that the latest problems with the wavlet workflow indicate > > >>> that some number of jobs in the workflow are failing to launch under PBS > > >>> through the pre-WS GRAM provider. These failing jobs seem to give no > > >>> indication whatsoever where the underlying failure is occurring. > > >>> > > >>> I think Tibi indicated yesterday that about 25 jobs out of 200 parallel jobs > > >>> are failing in this manner (not sure I have these numbers right). > > >>> > > >>> Mihael is continuing to experiment to characterize the failure better and will > > >>> report back to the group (and involve the TP and GRAM support teams) when he > > >>> knows more. 
> > >>> > > >>> - Mike > > >>> > > >>> -- > > >>> Mike Wilde > > >>> Computation Institute, University of Chicago > > >>> Math & Computer Science Division > > >>> Argonne National Laboratory > > >>> Argonne, IL 60439 USA > > >>> tel 630-252-7497 fax 630-252-1997 > > >>> _______________________________________________ > > >>> Swift-devel mailing list > > >>> Swift-devel at ci.uchicago.edu > > >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > >>> > > >> > > > > > > > > > > -- > > Mike Wilde > > Computation Institute, University of Chicago > > Math & Computer Science Division > > Argonne National Laboratory > > Argonne, IL 60439 USA > > tel 630-252-7497 fax 630-252-1997 > > > > > -- > Tiberiu (Tibi) Stef-Praun, PhD > Research Staff, Computation Institute > 5640 S. Ellis Ave, #405 > University of Chicago > http://www-unix.mcs.anl.gov/~tiberius/ > From hategan at mcs.anl.gov Wed Feb 28 12:45:30 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 28 Feb 2007 12:45:30 -0600 Subject: [Swift-devel] the big red swift site In-Reply-To: References: Message-ID: <1172688330.26982.0.camel@blabla.mcs.anl.gov> "ABOUT Swift incorporates several existing toolkits and engines to achieve its results. \em{What can Swift do for you?} " There's a thin line between sublime and ridiculous. On Wed, 2007-02-28 at 18:21 +0000, Ben Clifford wrote: > is mostly up: > > http://www.ci.uchicago.edu/swift/ > > The docbook docs are not yet restyled yet, and I know a couple of links > have got lost in the transistion. I'm fixing those that I know of now, but > point any that you notice. > From benc at hawaga.org.uk Wed Feb 28 12:50:15 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 28 Feb 2007 18:50:15 +0000 (GMT) Subject: [Swift-devel] the big red swift site In-Reply-To: <1172688330.26982.0.camel@blabla.mcs.anl.gov> References: <1172688330.26982.0.camel@blabla.mcs.anl.gov> Message-ID: content is nothing to do with me! 
On Wed, 28 Feb 2007, Mihael Hategan wrote: > "ABOUT > Swift incorporates several existing toolkits and engines to achieve its > results. \em{What can Swift do for you?} " > > There's a thin line between sublime and ridiculous. > > On Wed, 2007-02-28 at 18:21 +0000, Ben Clifford wrote: > > is mostly up: > > > > http://www.ci.uchicago.edu/swift/ > > > > The docbook docs are not yet restyled, and I know a couple of links > > have got lost in the transition. I'm fixing those that I know of now, but > > point out any that you notice. > > > > From wilde at mcs.anl.gov Wed Feb 28 12:52:50 2007 From: wilde at mcs.anl.gov (Mike Wilde) Date: Wed, 28 Feb 2007 12:52:50 -0600 Subject: [Swift-devel] the big red swift site In-Reply-To: <1172688330.26982.0.camel@blabla.mcs.anl.gov> References: <1172688330.26982.0.camel@blabla.mcs.anl.gov> Message-ID: <45E5CF82.5060702@mcs.anl.gov> Can we put up an "under construction" note on pages whose content is not yet ready? This is one of them; as I indicated in email yesterday, it's on my list to do. I suggested that we don't go beyond the original Swift site in basic content until we have a chance to review. Sound OK? Ben, Beth, can you do this? Thanks, Mike Mihael Hategan wrote, On 2/28/2007 12:45 PM: > "ABOUT > Swift incorporates several existing toolkits and engines to achieve its > results. \em{What can Swift do for you?} " > > There's a thin line between sublime and ridiculous. > > On Wed, 2007-02-28 at 18:21 +0000, Ben Clifford wrote: >> is mostly up: >> >> http://www.ci.uchicago.edu/swift/ >> >> The docbook docs are not yet restyled, and I know a couple of links >> have got lost in the transition. I'm fixing those that I know of now, but >> point out any that you notice.
>> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997 From benc at hawaga.org.uk Wed Feb 28 12:55:16 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 28 Feb 2007 18:55:16 +0000 (GMT) Subject: [Swift-devel] the big red swift site In-Reply-To: <45E5CF82.5060702@mcs.anl.gov> References: <1172688330.26982.0.camel@blabla.mcs.anl.gov> <45E5CF82.5060702@mcs.anl.gov> Message-ID: On Wed, 28 Feb 2007, Mike Wilde wrote: > Can we put up an "under construction" note welcome to the 1990s! > I suggested that we dont go beyond the original swift site in basic content > until we have a chance to review. > > Sound OK? Ben, Beth, can you do this? I want to stay away from actual content preparation and limit my involvement to the technical side, at least in the next week or so. is beth on this list, btw? -- From benc at hawaga.org.uk Wed Feb 28 13:01:27 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 28 Feb 2007 19:01:27 +0000 (GMT) Subject: [Swift-devel] Update on Teraport problems with wavlet workflow In-Reply-To: References: <45E5BEDD.2070901@mcs.anl.gov> <1172687583.26377.3.camel@blabla.mcs.anl.gov> Message-ID: On Wed, 28 Feb 2007, Ben Clifford wrote: > I have some memory of a similar problem once before that I diagnosed to > something like full-/tmp or something, but I have no recollection really > of how I diagnosed that. maybe I just checked each node for /tmp space. maybe I found something in the tp:~/.globus/job directories for the failing jobs. 
mumble mumble -- From bcerny at mcs.anl.gov Wed Feb 28 13:04:42 2007 From: bcerny at mcs.anl.gov (Beth Cerny Patino) Date: Wed, 28 Feb 2007 13:04:42 -0600 Subject: [Swift-devel] the big red swift site In-Reply-To: References: <1172688330.26982.0.camel@blabla.mcs.anl.gov> <45E5CF82.5060702@mcs.anl.gov> Message-ID: <45E5D24A.8070908@mcs.anl.gov> Hi, I made it really easy to add new content to the website so that site maintenance could be done by the Swift group. I am happy to teach whoever needs it how to make changes. I will be available to make some changes, but given my workload, it may be best to have someone else be responsible for the site so that things get done in a timely manner. Beth Ben Clifford wrote: > On Wed, 28 Feb 2007, Mike Wilde wrote: > > >> Can we put up an "under construction" note >> > > welcome to the 1990s! > > >> I suggested that we don't go beyond the original swift site in basic content >> until we have a chance to review. >> >> Sound OK? Ben, Beth, can you do this? >> > > I want to stay away from actual content preparation and limit my > involvement to the technical side, at least in the next week or so. > > is beth on this list, btw? > > From hategan at mcs.anl.gov Wed Feb 28 13:06:11 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 28 Feb 2007 13:06:11 -0600 Subject: [Swift-devel] the big red swift site In-Reply-To: References: <1172688330.26982.0.camel@blabla.mcs.anl.gov> <45E5CF82.5060702@mcs.anl.gov> Message-ID: <1172689571.27806.3.camel@blabla.mcs.anl.gov> On Wed, 2007-02-28 at 18:55 +0000, Ben Clifford wrote: > On Wed, 28 Feb 2007, Mike Wilde wrote: > > > Can we put up an "under construction" note > > welcome to the 1990s! Yeah! And we can have animated GIFs with folk and a shovel. Sorry. > > > I suggested that we don't go beyond the original swift site in basic content > > until we have a chance to review. > > > > Sound OK? Ben, Beth, can you do this?
> > I want to stay away from actual content preparation and limit my > involvement to the technical side, at least in the next week or so. I'm going to change a few things, like remove certain sentences and links opening in new windows. I find the latter justified in too few cases and annoying in most. I'll also try to integrate the docs. So if it's ok with everybody, lock on www in about 45 minutes. Mihael > > is beth on this list, btw? > From wilde at mcs.anl.gov Wed Feb 28 13:08:28 2007 From: wilde at mcs.anl.gov (Mike Wilde) Date: Wed, 28 Feb 2007 13:08:28 -0600 Subject: [Swift-devel] [Fwd: Swift-devel post from bcerny@mcs.anl.gov requires approval] Message-ID: <45E5D32C.2050900@mcs.anl.gov> Add Beth to this list? At least for now? - Mike -------- Original Message -------- Subject: Swift-devel post from bcerny at mcs.anl.gov requires approval Date: Wed, 28 Feb 2007 13:04:42 -0600 From: swift-devel-owner at ci.uchicago.edu To: swift-devel-owner at ci.uchicago.edu As list administrator, your authorization is requested for the following mailing list posting: List: Swift-devel at ci.uchicago.edu From: bcerny at mcs.anl.gov Subject: Re: [Swift-devel] the big red swift site Reason: Post by non-member to a members-only list At your convenience, visit: http://mail.ci.uchicago.edu/mailman/admindb/swift-devel to approve or deny the request. -- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997 -------------- next part -------------- An embedded message was scrubbed... From: Beth Cerny Patino Subject: Re: [Swift-devel] the big red swift site Date: Wed, 28 Feb 2007 13:04:42 -0600 Size: 2746 URL: -------------- next part -------------- An embedded message was scrubbed... 
From: swift-devel-request at ci.uchicago.edu Subject: confirm c92397e705a303e8a5bd8d84fcd327d885f2a7b9 Date: no date Size: 643 URL: From benc at hawaga.org.uk Wed Feb 28 13:09:20 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 28 Feb 2007 19:09:20 +0000 (GMT) Subject: [Swift-devel] the big red swift site In-Reply-To: <1172689571.27806.3.camel@blabla.mcs.anl.gov> References: <1172688330.26982.0.camel@blabla.mcs.anl.gov> <45E5CF82.5060702@mcs.anl.gov> <1172689571.27806.3.camel@blabla.mcs.anl.gov> Message-ID: On Wed, 28 Feb 2007, Mihael Hategan wrote: > So if it's ok with everybody, lock on www in about 45 minutes. I'm fixing links on downloads/index.php at the moment. I'll commit those when I'm done and then stop changing stuff. -- From itf at mcs.anl.gov Wed Feb 28 13:21:16 2007 From: itf at mcs.anl.gov (Ian Foster) Date: Wed, 28 Feb 2007 19:21:16 +0000 Subject: [Swift-devel] the big red swift site In-Reply-To: References: <1172688330.26982.0.camel@blabla.mcs.anl.gov> Message-ID: <170188944-1172690535-cardhu_blackberry.rim.net-1432056332-@bwe023-cell00.bisx.prod.on.blackberry> The about page has little useful information. The user doesn't care that it uses Globus, cig, TeraGrid, etc. Sent via BlackBerry from T-Mobile -----Original Message----- From: Ben Clifford Date: Wed, 28 Feb 2007 18:50:15 To:Mihael Hategan Cc:swift-devel at ci.uchicago.edu Subject: Re: [Swift-devel] the big red swift site content is nothing to do with me! On Wed, 28 Feb 2007, Mihael Hategan wrote: > "ABOUT > Swift incorporates several existing toolkits and engines to achieve its > results. \em{What can Swift do for you?} " > > There's a thin line between sublime and ridiculous. > > On Wed, 2007-02-28 at 18:21 +0000, Ben Clifford wrote: > > is mostly up: > > > > http://www.ci.uchicago.edu/swift/ > > > > The docbook docs are not yet restyled, and I know a couple of links > > have got lost in the transition.
I'm fixing those that I know of now, but > > point any that you notice. > > > > _______________________________________________ Swift-devel mailing list Swift-devel at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Wed Feb 28 13:38:24 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 28 Feb 2007 13:38:24 -0600 Subject: [Swift-devel] the big red swift site In-Reply-To: <170188944-1172690535-cardhu_blackberry.rim.net-1432056332-@bwe023-cell00.bisx.prod.on.blackberry> References: <1172688330.26982.0.camel@blabla.mcs.anl.gov> <170188944-1172690535-cardhu_blackberry.rim.net-1432056332-@bwe023-cell00.bisx.prod.on.blackberry> Message-ID: <1172691504.28366.0.camel@blabla.mcs.anl.gov> On Wed, 2007-02-28 at 19:21 +0000, Ian Foster wrote: > The about page has little useful information. The user doesn't care that it uses globus cig teragird etc. It was called "links" initially. > > > Sent via BlackBerry from T-Mobile > > -----Original Message----- > From: Ben Clifford > Date: Wed, 28 Feb 2007 18:50:15 > To:Mihael Hategan > Cc:swift-devel at ci.uchicago.edu > Subject: Re: [Swift-devel] the big red swift site > > > content is nothing to do with me! > > On Wed, 28 Feb 2007, Mihael Hategan wrote: > > > "ABOUT > > Swift incorporates several existing toolkits and engines to achieve its > > results. \em{What can Swift do for you?} " > > > > There's a thin line between sublime and ridiculous. > > > > On Wed, 2007-02-28 at 18:21 +0000, Ben Clifford wrote: > > > is mostly up: > > > > > > http://www.ci.uchicago.edu/swift/ > > > > > > The docbook docs are not yet restyled yet, and I know a couple of links > > > have got lost in the transistion. I'm fixing those that I know of now, but > > > point any that you notice. 
> > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From itf at mcs.anl.gov Wed Feb 28 13:50:53 2007 From: itf at mcs.anl.gov (=?UTF-8?B?SWFuIEZvc3Rlcg==?=) Date: Wed, 28 Feb 2007 19:50:53 +0000 Subject: [Swift-devel] the big red swift site In-Reply-To: <170188944-1172690535-cardhu_blackberry.rim.net-1432056332-@bwe023-cell00.bisx.prod.on.blackberry> References: <1172688330.26982.0.camel@blabla.mcs.anl.gov> <170188944-1172690535-cardhu_blackberry.rim.net-1432056332-@bwe023-cell00.bisx.prod.on.blackberry> Message-ID: <94971556-1172692312-cardhu_blackberry.rim.net-1080186105-@bwe035-cell00.bisx.prod.on.blackberry> Normally "about" has more technical detail. Ian Sent via BlackBerry from T-Mobile -----Original Message----- From: "Ian Foster" Date: Wed, 28 Feb 2007 19:21:16 To:"Ben Clifford" , swift-devel-bounces at ci.uchicago.edu, "Mihael Hategan" Cc:swift-devel at ci.uchicago.edu Subject: Re: [Swift-devel] the big red swift site The about page has little useful information. The user doesn't care that it uses globus cig teragird etc. Sent via BlackBerry from T-Mobile -----Original Message----- From: Ben Clifford Date: Wed, 28 Feb 2007 18:50:15 To:Mihael Hategan Cc:swift-devel at ci.uchicago.edu Subject: Re: [Swift-devel] the big red swift site content is nothing to do with me! On Wed, 28 Feb 2007, Mihael Hategan wrote: > "ABOUT > Swift incorporates several existing toolkits and engines to achieve its > results. \em{What can Swift do for you?} " > > There's a thin line between sublime and ridiculous. > > On Wed, 2007-02-28 at 18:21 +0000, Ben Clifford wrote: > > is mostly up: > > > > http://www.ci.uchicago.edu/swift/ > > > > The docbook docs are not yet restyled yet, and I know a couple of links > > have got lost in the transistion. I'm fixing those that I know of now, but > > point any that you notice. 
> > > > _______________________________________________ Swift-devel mailing list Swift-devel at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel _______________________________________________ Swift-devel mailing list Swift-devel at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Wed Feb 28 14:18:42 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 28 Feb 2007 14:18:42 -0600 Subject: [Swift-devel] the big red swift site In-Reply-To: <94971556-1172692312-cardhu_blackberry.rim.net-1080186105-@bwe035-cell00.bisx.prod.on.blackberry> References: <1172688330.26982.0.camel@blabla.mcs.anl.gov>