From hategan at mcs.anl.gov Wed Aug 3 00:22:21 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 02 Aug 2011 22:22:21 -0700
Subject: [Swift-devel] iterate behaviour round II
Message-ID: <1312348941.21371.12.camel@blabla>

So I think we decided that:

iterate v {
  trace(v);
} until (v >= 10);

would do the test after v was incremented and would always execute at
least once (so 0 to 9 would be printed).

But then the tutorial has the following (adapted a bit):

int a[]; a[0] = 10;
iterate v {
  a[v + 1] = a[v] - 1;
} until(a[v+1] < 1);

It's all peachy in concept, except if v is incremented before the check,
an access to a[v+1] will hang. a[v] is now the correct expression in the
test, but then it's not quite intuitive.

Proposal 1: change documentation and tests to "until(a[v] < 1)" (this
does not solve the problem in general since a[v+1] would still lead to a
hang, not unlike bug 481).
Proposal 2: Proposal 1 + deprecate iterate and suggest foreach instead.

Opinions? Other ideas?

Mihael

From hategan at mcs.anl.gov Wed Aug 3 00:23:59 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 02 Aug 2011 22:23:59 -0700
Subject: [Swift-devel] trunk coasters
Message-ID: <1312349039.21371.13.camel@blabla>

Has anybody run a job with trunk coasters recently?

Mihael

From hategan at mcs.anl.gov Wed Aug 3 00:27:43 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 02 Aug 2011 22:27:43 -0700
Subject: [Swift-devel] extractint
Message-ID: <1312349263.21792.0.camel@blabla>

Why is extractint returning a float?

From benc at hawaga.org.uk Wed Aug 3 06:38:36 2011
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 3 Aug 2011 13:38:36 +0200
Subject: [Swift-devel] iterate behaviour round II
In-Reply-To: <1312348941.21371.12.camel@blabla>
References: <1312348941.21371.12.camel@blabla>
Message-ID:

I dislike using array indices for each step.
I wanted to figure out something that looked more like a fold/unfold,
where the body of the iterate only has access to "previous" and "next"
so that you write something like this:

file a[];
file seed <"foo">;

a = iterate from seed {
  next = f(previous)
} until(g(next)=false) ;

but I never figured out a syntax that I liked.

(contrast to haskell unfold syntax)

From benc at hawaga.org.uk Wed Aug 3 06:39:11 2011
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 3 Aug 2011 13:39:11 +0200
Subject: [Swift-devel] extractint
In-Reply-To: <1312349263.21792.0.camel@blabla>
References: <1312349263.21792.0.camel@blabla>
Message-ID:

On Aug 3, 2011, at 7:27 AM, Mihael Hategan wrote:

> Why is extractint returning a float?

because swift numerical types are poorly defined?

From jonmon at mcs.anl.gov Wed Aug 3 08:51:58 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Wed, 3 Aug 2011 08:51:58 -0500
Subject: [Swift-devel] trunk coasters
In-Reply-To: <1312349039.21371.13.camel@blabla>
References: <1312349039.21371.13.camel@blabla>
Message-ID: <264C2AE9-C8DC-47C9-9EFA-A31E382FDA9D@mcs.anl.gov>

I have. A small 45 task run.

On Aug 3, 2011, at 12:23 AM, Mihael Hategan wrote:

> Has anybody ran a job trunk coasters recently?
>
> Mihael
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

From wilde at mcs.anl.gov Wed Aug 3 08:53:51 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 3 Aug 2011 08:53:51 -0500
Subject: [Swift-devel] trunk coasters
In-Reply-To: <1312349039.21371.13.camel@blabla>
References: <1312349039.21371.13.camel@blabla>
Message-ID:

They were failing for me using persistent coasters to osg sites; will
be testing further today and file bugs as needed.

On 8/3/11, Mihael Hategan wrote:
> Has anybody ran a job trunk coasters recently?
> > Mihael > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > -- Sent from my mobile device From jonmon at mcs.anl.gov Wed Aug 3 08:54:48 2011 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Wed, 3 Aug 2011 08:54:48 -0500 Subject: [Swift-devel] trunk coasters In-Reply-To: References: <1312349039.21371.13.camel@blabla> Message-ID: <48D1727A-0C16-4D0D-AC3B-F1F9DFE5894B@mcs.anl.gov> I have been using automatic coasters and submitting to PADS. I haven't tried any large scale runs recently though. On Aug 3, 2011, at 8:53 AM, Michael Wilde wrote: > They were failing for me using persistent coasters to osg sites; will > be testing further today and file bugs as needed. > > On 8/3/11, Mihael Hategan wrote: >> Has anybody ran a job trunk coasters recently? >> >> Mihael >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> > > -- > Sent from my mobile device > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Wed Aug 3 09:08:54 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 3 Aug 2011 09:08:54 -0500 (CDT) Subject: [Swift-devel] trunk coasters In-Reply-To: <48D1727A-0C16-4D0D-AC3B-F1F9DFE5894B@mcs.anl.gov> Message-ID: <1723865632.184066.1312380534696.JavaMail.root@zimbra.anl.gov> Last night (on current trunk) I was getting this: 2011-08-02 23:24:49,863-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=cat-4q4lmvdk - Application exception: null Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not submit job Caused by: org.globus.cog.karajan.workflow.service.ProtocolException: Failed to set configuration: For 
input string: "" Caused by: org.globus.cog.karajan.workflow.service.RemoteException: Failed to set configuration: For input string: "" 2011-08-02 23:24:49,866-0500 INFO vdl:execute END_FAILURE thread=0-3-0-1 tr=cat 2011-08-02 23:24:49,868-0500 DEBUG VDL2ExecutionContext Exception in cat: Arguments: [data.txt] Host: localhost Directory: catsn-20110802-2324-ze1lfx8f/jobs/4/cat-4q4lmvdk - - - Exception in cat: Arguments: [data.txt] Host: localhost Directory: catsn-20110802-2324-ze1lfx8f/jobs/4/cat-4q4lmvdk - - - Caused by: null Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not submit job Caused by: org.globus.cog.karajan.workflow.service.ProtocolException: Failed to set configuration: For input string: "" Caused by: org.globus.cog.karajan.workflow.service.RemoteException: Failed to set configuration: For input string: "" at org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) --- But that was a rather new configuration, so I need to do more diagnosis. Why did you ask, Mihael - are you seeing problems too? I hope to work with Alberto this week to resume site config testing with a focus on coaster configs. - Mike ----- Original Message ----- > From: "Jonathan Monette" > To: "Michael Wilde" > Cc: "Mihael Hategan" , "Swift Devel" > Sent: Wednesday, August 3, 2011 8:54:48 AM > Subject: Re: [Swift-devel] trunk coasters > I have been using automatic coasters and submitting to PADS. I haven't > tried any large scale runs recently though. > On Aug 3, 2011, at 8:53 AM, Michael Wilde wrote: > > > They were failing for me using persistent coasters to osg sites; > > will > > be testing further today and file bugs as needed. > > > > On 8/3/11, Mihael Hategan wrote: > >> Has anybody ran a job trunk coasters recently? 
> >>
> >> Mihael
> >
> > --
> > Sent from my mobile device

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From wilde at mcs.anl.gov Wed Aug 3 09:23:29 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 3 Aug 2011 09:23:29 -0500 (CDT)
Subject: [Swift-devel] iterate behaviour round II
In-Reply-To: <1312348941.21371.12.camel@blabla>
Message-ID: <1654887811.184149.1312381409266.JavaMail.root@zimbra.anl.gov>

I propose that we do everything possible to ensure that the semantics of
iterate do not change from 0.92.1, to avoid breaking code. NCAR and the
DOE ParVis project, in particular, have a very large Swift script that
they are testing for production use, and we really don't want that to
break. We should not allow 0.93 to break current user code -- if at all
possible.

I propose instead that we experiment with new iterate semantics using
one or more new statements (fold, do, while, for). Ben, Mihael, and
others are interested in functional-style statements; I am in favor of
C-like statements. I would favor having both in the language as long as
we keep it simple (which I understand is complex ;)

To start a parallel thread here: how feasible is it (within the
write-once variable model and scope-creation semantics that iterate
uses) to provide the 3 C iteration statements with syntax and semantics
as close to C as possible?
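
For anyone trying to reason about the two until() expressions without a Swift build at hand, the behaviour discussed in this thread can be modelled outside Swift. What follows is only a rough Python sketch, assuming the semantics described earlier in the thread (body runs first, v is incremented, then the test is evaluated). The names iterate, WriteOnceArray, WouldBlock and countdown are made up for illustration and are not Swift APIs, and a read of an element that is never assigned is modelled as an exception rather than an actual hang.

```python
class WouldBlock(Exception):
    """Stands in for a read that would wait forever on an unset element."""


class WriteOnceArray:
    """Hypothetical model of a write-once array: each element is set once."""

    def __init__(self):
        self._items = {}

    def __setitem__(self, index, value):
        if index in self._items:
            raise ValueError(f"element {index} already assigned")
        self._items[index] = value

    def __getitem__(self, index):
        if index not in self._items:
            # A real run would wait for a writer; the model fails loudly.
            raise WouldBlock(f"a[{index}] has not been assigned")
        return self._items[index]


def iterate(body, until):
    """Run body(v) for v = 0, 1, ...; evaluate until(v) after incrementing v."""
    v = 0
    while True:
        body(v)          # the body always runs at least once
        v += 1
        if until(v):     # the test sees the already incremented v
            return v


# Model of: iterate v { trace(v); } until (v >= 10);
traced = []
iterate(traced.append, lambda v: v >= 10)
print(traced)


def countdown(test):
    """Model of the tutorial loop; test(a, v) plays the role of until(...)."""
    a = WriteOnceArray()
    a[0] = 10

    def body(v):
        a[v + 1] = a[v] - 1

    return iterate(body, lambda v: test(a, v))


# until(a[v] < 1): the element read by the test was written by the body.
print(countdown(lambda a, v: a[v] < 1))

# until(a[v+1] < 1): the test reads an element nobody has written yet.
try:
    countdown(lambda a, v: a[v + 1] < 1)
except WouldBlock as exc:
    print("would hang:", exc)
```

In this model the trace loop collects 0 through 9, the a[v] test stops once a[10] reaches 0, and the a[v+1] test fails on the very first check because a[2] is still unset. Whether 0.92.1 behaves this way is exactly the open question, so the sketch says nothing about existing user scripts.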
- Mike ----- Original Message ----- > From: "Mihael Hategan" > To: "Swift Devel" > Sent: Wednesday, August 3, 2011 12:22:21 AM > Subject: [Swift-devel] iterate behaviour round II > So I think we decided that: > > iterate v { > trace(v); > } until (v >= 10); > > would do the test after v was incremented and would always execute at > least once (so 0 to 9 would be printed). > > But then the tutorial has the following (adapted a bit): > > > int a[]; a[0] = 10; > iterate v { > a[v + 1] = a[v] - 1; > } until(a[v+1] < 1); > > It's all peachy in concept, except if v is incremented before the > check, > an access to a[v+1] will hang. a[v] is now the correct expression in > the > test, but then it's not quite intuitive. > > Proposal 1: change documentation and tests to "until(a[v] < 1)" (this > does not solve the problem in general since a[v+1] would still lead to > a > hang, not unlike bug 481 > Proposal 2: Proposal 1 + deprecate iterate and suggest foreach > instead. > > Opinions? Other ideas? > > Mihael > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Wed Aug 3 09:27:03 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 3 Aug 2011 09:27:03 -0500 (CDT) Subject: [Swift-devel] extractint In-Reply-To: <1312349263.21792.0.camel@blabla> Message-ID: <2111872963.184160.1312381623643.JavaMail.root@zimbra.anl.gov> Do we have a test for this in the test suite? Is that a change from 0.92.1 ? - Mike ----- Original Message ----- > From: "Mihael Hategan" > To: "Swift Devel" > Sent: Wednesday, August 3, 2011 12:27:43 AM > Subject: [Swift-devel] extractint > Why is extractint returning a float? 
> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From jonmon at mcs.anl.gov Wed Aug 3 09:27:54 2011 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Wed, 3 Aug 2011 09:27:54 -0500 Subject: [Swift-devel] extractint In-Reply-To: <2111872963.184160.1312381623643.JavaMail.root@zimbra.anl.gov> References: <2111872963.184160.1312381623643.JavaMail.root@zimbra.anl.gov> Message-ID: I think it was like that because until recently ints were really java doubles underneath. At least that is what it looked like when going through the code. On Aug 3, 2011, at 9:27 AM, Michael Wilde wrote: > Do we have a test for this in the test suite? > Is that a change from 0.92.1 ? > > - Mike > > > ----- Original Message ----- >> From: "Mihael Hategan" >> To: "Swift Devel" >> Sent: Wednesday, August 3, 2011 12:27:43 AM >> Subject: [Swift-devel] extractint >> Why is extractint returning a float? 
>> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From ketancmaheshwari at gmail.com Wed Aug 3 09:28:14 2011 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Wed, 3 Aug 2011 09:28:14 -0500 Subject: [Swift-devel] trunk coasters In-Reply-To: <1723865632.184066.1312380534696.JavaMail.root@zimbra.anl.gov> References: <48D1727A-0C16-4D0D-AC3B-F1F9DFE5894B@mcs.anl.gov> <1723865632.184066.1312380534696.JavaMail.root@zimbra.anl.gov> Message-ID: I have been using trunk persistent coasters on mcs resources and did not see any issues. 
--
Ketan
From wilde at mcs.anl.gov Wed Aug 3 10:12:41 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 3 Aug 2011 10:12:41 -0500 (CDT)
Subject: [Swift-devel] trunk coasters
In-Reply-To: <1723865632.184066.1312380534696.JavaMail.root@zimbra.anl.gov>
Message-ID: <570047635.184499.1312384361864.JavaMail.root@zimbra.anl.gov>

I'm testing the persistent coaster setup that was failing as below, but
instead of starting the workers on remote OSG sites I'm starting a
single worker locally on communicado, where the service is running.

This seems to fail in a different manner than the test to OSG sites.

I see this in the service log (swift.log file):

2011-08-03 10:01:32,000-0500 INFO Settings Local contacts: [http://128.135.125.17:35852]
2011-08-03 10:01:32,014-0500 INFO CoasterService Started local service: http://128.135.125.17:35852
2011-08-03 10:01:32,014-0500 INFO CoasterService Started coaster service: http://128.135.125.17:41176
2011-08-03 10:05:50,884-0500 INFO AbstractStreamKarajanChannel$Multiplexer Multiplexer 0 started
2011-08-03 10:05:50,884-0500 INFO AbstractStreamKarajanChannel$Multiplexer (0) Scheduling SC-null for addition
2011-08-03 10:05:50,885-0500 INFO AbstractStreamKarajanChannel nullChannel started
2011-08-03 10:05:50,885-0500 INFO AbstractStreamKarajanChannel$Multiplexer Multiplexer 1 started
2011-08-03 10:05:50,909-0500 INFO LocalTCPService Received registration: blockid = twork, url = communicado.ci.uchicago.edu
2011-08-03 10:05:50,919-0500 INFO AbstractKarajanChannel MetaChannel: 700804192[1615734796: {}] -> null: Disabling heartbeats (config is null)
2011-08-03 10:05:50,920-0500 INFO MetaChannel MetaChannel: 700804192[1615734796: {}] -> null.bind -> SC-null
2011-08-03 10:05:50,922-0500 DEBUG Cpu workerStarted: twork:communicado.ci.uchicago.edu:0
2011-08-03 10:05:50,922-0500 DEBUG Cpu twork:0 pullLater
2011-08-03 10:05:50,924-0500 INFO Block Started CPU 0:1312383950s
2011-08-03 10:05:50,924-0500 INFO Block Started worker twork:000000
2011-08-03
10:05:50,924-0500 INFO Cpu twork:0 pull 2011-08-03 10:05:50,926-0500 WARN BlockQueueProcessor Failed to send worker status update to client java.lang.NullPointerException at org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:434) at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:227) at org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.registrationReceived(PassiveQueueProcessor.j\ ava:72) at org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.registrationReceived(JobQueue.java:143) at org.globus.cog.abstraction.coaster.service.LocalTCPService.registrationReceived(LocalTCPService.java:64) at org.globus.cog.abstraction.coaster.service.local.RegistrationHandler.requestComplete(RegistrationHandler.java:57) at org.globus.cog.karajan.workflow.service.handlers.RequestHandler.receiveCompleted(RequestHandler.java:84) at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleRequest(AbstractKarajanChannel.java:416) at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.step(AbstractStreamKarajanChannel.java:157) at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Multiplexer.run(AbstractStreamKarajanChanne\ l.java:375) 2011-08-03 10:06:00,893-0500 INFO AbstractStreamKarajanChannel$Multiplexer Avg stream buf: 0 2011-08-03 10:06:00,971-0500 INFO PullThread runTime: 4, sleepTime: 10043 2011-08-03 10:06:10,904-0500 INFO AbstractStreamKarajanChannel$Multiplexer Avg stream buf: 0 2011-08-03 10:06:11,012-0500 INFO PullThread runTime: 1, sleepTime: 10040 2011-08-03 10:06:20,911-0500 INFO AbstractStreamKarajanChannel$Multiplexer Avg stream buf: 0 2011-08-03 10:06:21,050-0500 INFO PullThread runTime: 2, sleepTime: 10036 etc... 
=== and this in the service's std out/err log: Local contacts: [http://128.135.125.17:35852] Started local service: http://128.135.125.17:35852 Started coaster service: http://128.135.125.17:41176 Started coaster service: http://128.135.125.17:41176 Multiplexer 0 started (0) Scheduling SC-null for addition nullChannel started Multiplexer 1 started Received registration: blockid = twork, url = communicado.ci.uchicago.edu MetaChannel: 700804192[1615734796: {}] -> null: Disabling heartbeats (config is null) MetaChannel: 700804192[1615734796: {}] -> null.bind -> SC-null Started CPU 0:1312383950s Started worker twork:000000 twork:0 pull Failed to send worker status update to client java.lang.NullPointerException at org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:434) at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:227) at org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.registrationReceived(PassiveQueueProcessor.j\ ava:72) at org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.registrationReceived(JobQueue.java:143) at org.globus.cog.abstraction.coaster.service.LocalTCPService.registrationReceived(LocalTCPService.java:64) at org.globus.cog.abstraction.coaster.service.local.RegistrationHandler.requestComplete(RegistrationHandler.java:57) at org.globus.cog.karajan.workflow.service.handlers.RequestHandler.receiveCompleted(RequestHandler.java:84) at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleRequest(AbstractKarajanChannel.java:416) at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.step(AbstractStreamKarajanChannel.java:157) at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Multiplexer.run(AbstractStreamKarajanChanne\ l.java:375) Avg stream buf: 0 runTime: 4, sleepTime: 10043 Avg stream buf: 0 runTime: 1, sleepTime: 10040 Avg stream buf: 
0 runTime: 2, sleepTime: 10036 Sender 742510685 queue size: 1 Avg stream buf: 0 runTime: 1, sleepTime: 10042 etc... === and this in the worker log (log level DEBUG): 1312383950.848 INFO - twork Logging started: Wed Aug 3 10:05:50 2011 1312383950.848 INFO - Running on node communicado.ci.uchicago.edu 1312383950.848 DEBUG - uri=http://communicado.ci.uchicago.edu:35852 1312383950.848 DEBUG - scheme=http 1312383950.848 DEBUG - host=communicado.ci.uchicago.edu 1312383950.848 DEBUG - port=35852 1312383950.848 DEBUG - blockid=twork 1312383950.848 INFO - Connecting (0)... 1312383950.848 DEBUG - Trying communicado.ci.uchicago.edu:35852... 1312383950.862 INFO - Connected 1312383950.862 DEBUG - Replies: {} 1312383950.862 DEBUG - OUT: len=8, tag=0, flags=0 1312383950.863 DEBUG - OUT: len=5, tag=0, flags=0 1312383950.863 DEBUG - OUT: len=27, tag=0, flags=0 1312383950.863 DEBUG - OUT: len=16, tag=0, flags=2 1312383950.863 DEBUG - done sending frags for 0 1312383950.931 DEBUG - Fin flag set 1312383950.931 INFO 000000 Registration successful. ID=000000 1312383980.863 DEBUG 000000 Replies: {} 1312383980.863 DEBUG 000000 OUT: len=9, tag=1, flags=2 1312383980.864 DEBUG 000000 done sending frags for 1 1312383980.868 DEBUG 000000 Fin flag set 1312383980.868 DEBUG 000000 Heartbeat acknowledged 1312383986.739 DEBUG 000000 New request (1) 1312383986.739 DEBUG 000000 Fin flag set 1312383986.739 DEBUG 000000 Processing request 1312383986.739 DEBUG 000000 Cmd is HEARTBEAT 1312383986.739 DEBUG 000000 OUT: len=2, tag=1, flags=3 1312383986.739 DEBUG 000000 done sending frags for 1 etc... 
----- Original Message ----- > From: "Michael Wilde" > To: "Jonathan Monette" > Cc: "Swift Devel" > Sent: Wednesday, August 3, 2011 9:08:54 AM > Subject: Re: [Swift-devel] trunk coasters > Last night (on current trunk) I was getting this: > > 2011-08-02 23:24:49,863-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION > jobid=cat-4q4lmvdk - Application exception: null > Caused by: > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Could not submit job > Caused by: org.globus.cog.karajan.workflow.service.ProtocolException: > Failed to set configuration: For input string: "" > Caused by: org.globus.cog.karajan.workflow.service.RemoteException: > Failed to set configuration: For input string: "" > 2011-08-02 23:24:49,866-0500 INFO vdl:execute END_FAILURE > thread=0-3-0-1 tr=cat > 2011-08-02 23:24:49,868-0500 DEBUG VDL2ExecutionContext Exception in > cat: > Arguments: [data.txt] > Host: localhost > Directory: catsn-20110802-2324-ze1lfx8f/jobs/4/cat-4q4lmvdk > - - - > > Exception in cat: > Arguments: [data.txt] > Host: localhost > Directory: catsn-20110802-2324-ze1lfx8f/jobs/4/cat-4q4lmvdk > - - - > > Caused by: null > Caused by: > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Could not submit job > Caused by: org.globus.cog.karajan.workflow.service.ProtocolException: > Failed to set configuration: For input string: "" > Caused by: org.globus.cog.karajan.workflow.service.RemoteException: > Failed to set configuration: For input string: "" > at > org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) > > --- But that was a rather new configuration, so I need to do more > diagnosis. > > Why did you ask, Mihael - are you seeing problems too? > > I hope to work with Alberto this week to resume site config testing > with a focus on coaster configs. 
> > - Mike > > > > > ----- Original Message ----- > > From: "Jonathan Monette" > > To: "Michael Wilde" > > Cc: "Mihael Hategan" , "Swift Devel" > > > > Sent: Wednesday, August 3, 2011 8:54:48 AM > > Subject: Re: [Swift-devel] trunk coasters > > I have been using automatic coasters and submitting to PADS. I > > haven't > > tried any large scale runs recently though. > > On Aug 3, 2011, at 8:53 AM, Michael Wilde wrote: > > > > > They were failing for me using persistent coasters to osg sites; > > > will > > > be testing further today and file bugs as needed. > > > > > > On 8/3/11, Mihael Hategan wrote: > > >> Has anybody ran a job trunk coasters recently? > > >> > > >> Mihael > > >> > > >> _______________________________________________ > > >> Swift-devel mailing list > > >> Swift-devel at ci.uchicago.edu > > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > >> > > > > > > -- > > > Sent from my mobile device > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Wed Aug 3 10:16:34 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 3 Aug 2011 10:16:34 -0500 (CDT) Subject: [Swift-devel] trunk coasters In-Reply-To: <570047635.184499.1312384361864.JavaMail.root@zimbra.anl.gov> Message-ID: <746307103.184537.1312384594305.JavaMail.root@zimbra.anl.gov> Correction: the simple local worker test below *is* failing in the same manner 
as the test to OSG sites.

A swift run against the service with a single local worker returns the
same error as I reported earlier in this thread:

com$ swift -config cf.ps -tc.file tc -sites.file sites.grid-ps.xml catsn.swift -n=1
Swift svn swift-r4934 (swift modified locally) cog-r3184 (cog modified locally)

RunID: 20110803-1013-2v63ui0g
Progress: time: Wed, 03 Aug 2011 10:13:49 -0500
Find: http://localhost:41176
Find: keepalive(120), reconnect - http://localhost:41176
Execution failed:
Failed to set configuration: For input string: ""
com$

- Mike

----- Original Message -----
> From: "Michael Wilde"
> To: "Jonathan Monette"
> Cc: "Swift Devel"
> Sent: Wednesday, August 3, 2011 10:12:41 AM
> Subject: Re: [Swift-devel] trunk coasters
> Im testing the persistent coaster setup that was failing as below, but
> instead of starting the workers on remote OSG sites Im starting a
> single worker locally on communicado, where the service is running.
>
> This seems to fail in a different manner than the test to OSG sites.
> > > > ----- Original Message ----- > > From: "Michael Wilde" > > To: "Jonathan Monette" > > Cc: "Swift Devel" > > Sent: Wednesday, August 3, 2011 9:08:54 AM > > Subject: Re: [Swift-devel] trunk coasters > > Last night (on current trunk) I was getting this: > > > > 2011-08-02 23:24:49,863-0500 DEBUG vdl:execute2 > > APPLICATION_EXCEPTION > > jobid=cat-4q4lmvdk - Application exception: null > > Caused by: > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > Could not submit job > > Caused by: > > org.globus.cog.karajan.workflow.service.ProtocolException: > > Failed to set configuration: For input string: "" > > Caused by: org.globus.cog.karajan.workflow.service.RemoteException: > > Failed to set configuration: For input string: "" > > 2011-08-02 23:24:49,866-0500 INFO vdl:execute END_FAILURE > > thread=0-3-0-1 tr=cat > > 2011-08-02 23:24:49,868-0500 DEBUG VDL2ExecutionContext Exception in > > cat: > > Arguments: [data.txt] > > Host: localhost > > Directory: catsn-20110802-2324-ze1lfx8f/jobs/4/cat-4q4lmvdk > > - - - > > > > Exception in cat: > > Arguments: [data.txt] > > Host: localhost > > Directory: catsn-20110802-2324-ze1lfx8f/jobs/4/cat-4q4lmvdk > > - - - > > > > Caused by: null > > Caused by: > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > Could not submit job > > Caused by: > > org.globus.cog.karajan.workflow.service.ProtocolException: > > Failed to set configuration: For input string: "" > > Caused by: org.globus.cog.karajan.workflow.service.RemoteException: > > Failed to set configuration: For input string: "" > > at > > org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) > > > > --- But that was a rather new configuration, so I need to do more > > diagnosis. > > > > Why did you ask, Mihael - are you seeing problems too? > > > > I hope to work with Alberto this week to resume site config testing > > with a focus on coaster configs. 
> > > > - Mike > > > > > > > > > > ----- Original Message ----- > > > From: "Jonathan Monette" > > > To: "Michael Wilde" > > > Cc: "Mihael Hategan" , "Swift Devel" > > > > > > Sent: Wednesday, August 3, 2011 8:54:48 AM > > > Subject: Re: [Swift-devel] trunk coasters > > > I have been using automatic coasters and submitting to PADS. I > > > haven't > > > tried any large scale runs recently though. > > > On Aug 3, 2011, at 8:53 AM, Michael Wilde wrote: > > > > > > > They were failing for me using persistent coasters to osg sites; > > > > will > > > > be testing further today and file bugs as needed. > > > > > > > > On 8/3/11, Mihael Hategan wrote: > > > >> Has anybody ran a job trunk coasters recently? > > > >> > > > >> Mihael > > > >> > > > >> _______________________________________________ > > > >> Swift-devel mailing list > > > >> Swift-devel at ci.uchicago.edu > > > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > >> > > > > > > > > -- > > > > Sent from my mobile device > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division 
Argonne National Laboratory

From benc at hawaga.org.uk Wed Aug 3 10:20:46 2011
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 3 Aug 2011 17:20:46 +0200
Subject: [Swift-devel] iterate behaviour round II
In-Reply-To: <1654887811.184149.1312381409266.JavaMail.root@zimbra.anl.gov>
References: <1654887811.184149.1312381409266.JavaMail.root@zimbra.anl.gov>
Message-ID:

On Aug 3, 2011, at 4:23 PM, Michael Wilde wrote:

>
> To start a parallel thread here: how feasible is it (within the write-once variable model and scope-creation semantics that iterate uses) to provide the 3 C iteration statements with syntax and semantics as close to C as possible?

the independent-iterations use cases are pretty well covered by foreach,
I think. (where each iteration is independent of other iterations).

the non-independent-iterations use cases are fairly poorly defined, and
it's hard to throw syntax suggestions around without those.

swift had 'while' in the past, 'iterate' which replaced it but which is
fairly ugly, and the suggestion i made earlier in this thread which I
like better than iterate but is still ugly.

More suggestions would be interesting, even if just for scoping out what
people want from such a construct.

--

From hategan at mcs.anl.gov Wed Aug 3 12:05:54 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 03 Aug 2011 10:05:54 -0700
Subject: [Swift-devel] extractint
In-Reply-To:
References: <1312349263.21792.0.camel@blabla>
Message-ID: <1312391154.24112.0.camel@blabla>

On Wed, 2011-08-03 at 13:39 +0200, Ben Clifford wrote:
> On Aug 3, 2011, at 7:27 AM, Mihael Hategan wrote:
>
> > Why is extractint returning a float?
>
> because swift numerical types are poorly defined?

The internal representation used to be. I changed that.

But that still doesn't explain why extractint returns a Swift float.
From wilde at mcs.anl.gov Wed Aug 3 12:07:37 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 3 Aug 2011 12:07:37 -0500 (CDT)
Subject: [Swift-devel] extractint
In-Reply-To: <1312391154.24112.0.camel@blabla>
Message-ID: <1007137.185235.1312391257382.JavaMail.root@zimbra.anl.gov>

Can you double check? In looking at the iterate issues this morning I
noticed that trace() is printing ints as if they are floats. Is that
perhaps what you are seeing?

- Mike

----- Original Message -----
> From: "Mihael Hategan"
> To: "Ben Clifford"
> Cc: "Swift Devel"
> Sent: Wednesday, August 3, 2011 12:05:54 PM
> Subject: Re: [Swift-devel] extractint
> On Wed, 2011-08-03 at 13:39 +0200, Ben Clifford wrote:
> > On Aug 3, 2011, at 7:27 AM, Mihael Hategan wrote:
> >
> > > Why is extractint returning a float?
> >
> > because swift numerical types are poorly defined?
>
> The internal representation used to be. I changed that.
>
> But that still doesn't explain why extractint returns a Swift float.
>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov Wed Aug 3 12:08:10 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 03 Aug 2011 10:08:10 -0700
Subject: [Swift-devel] trunk coasters
In-Reply-To: <1723865632.184066.1312380534696.JavaMail.root@zimbra.anl.gov>
References: <1723865632.184066.1312380534696.JavaMail.root@zimbra.anl.gov>
Message-ID: <1312391290.24112.2.camel@blabla>

On Wed, 2011-08-03 at 09:08 -0500, Michael Wilde wrote:
> Why did you ask, Mihael - are you seeing problems too?

I asked because when I tried to use my local copy I was getting problems
with the service not being able to connect back to the client due to
channels not being found and other weirdness.
I fixed it in my local copy, but if it actually works in a clean trunk
checkout, I'd rather not fix that.

From hategan at mcs.anl.gov Wed Aug 3 12:11:54 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 03 Aug 2011 10:11:54 -0700
Subject: [Swift-devel] trunk coasters
In-Reply-To: <570047635.184499.1312384361864.JavaMail.root@zimbra.anl.gov>
References: <570047635.184499.1312384361864.JavaMail.root@zimbra.anl.gov>
Message-ID: <1312391514.24112.5.camel@blabla>

On Wed, 2011-08-03 at 10:12 -0500, Michael Wilde wrote:
> I'm testing the persistent coaster setup that was failing as below, but
> instead of starting the workers on remote OSG sites I'm starting a
> single worker locally on communicado, where the service is running.
>
> This seems to fail in a different manner than the test to OSG sites.
>
> I see this in the service log (swift.log file):
> [...]
> 2011-08-03 10:05:50,926-0500 WARN BlockQueueProcessor Failed to send worker status update to client
> java.lang.NullPointerException
> at org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:434)

Great. That's what I was looking for. It seems it needs fixing.

From hategan at mcs.anl.gov Wed Aug 3 12:13:49 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 03 Aug 2011 10:13:49 -0700
Subject: [Swift-devel] extractint
In-Reply-To: <1007137.185235.1312391257382.JavaMail.root@zimbra.anl.gov>
References: <1007137.185235.1312391257382.JavaMail.root@zimbra.anl.gov>
Message-ID: <1312391629.24112.6.camel@blabla>

On Wed, 2011-08-03 at 12:07 -0500, Michael Wilde wrote:
> Can you double check? In looking at the iterate issues this morning I
> noticed that trace() is printing ints as if they are floats. Is that
> perhaps what you are seeing?
The code is pretty unambiguous:

  DSHandle result = new RootDataNode(Types.FLOAT, i);

From hategan at mcs.anl.gov Wed Aug 3 12:19:28 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 03 Aug 2011 10:19:28 -0700
Subject: [Swift-devel] iterate behaviour round II
In-Reply-To:
References: <1654887811.184149.1312381409266.JavaMail.root@zimbra.anl.gov>
Message-ID: <1312391968.24112.11.camel@blabla>

On Wed, 2011-08-03 at 17:20 +0200, Ben Clifford wrote:
> On Aug 3, 2011, at 4:23 PM, Michael Wilde wrote:
>
> >
> > To start a parallel thread here: how feasible is it (within the
> write-once variable model and scope-creation semantics that iterate
> uses) to provide the 3 C iteration statements with syntax and
> semantics as close to C as possible?
>
> the independent-iterations use cases are pretty well covered by
> foreach, I think. (where each iteration is independent of other
> iterations)

The dependent iterations are also covered by foreach, and you can't
deadlock as easily as you can in the iterate case:

  int a[]; a[0] = 10;
  foreach v, k in a {
    if (v > 1) { a[k + 1] = v - 1; }
  }

Though now you could equally do:

  int a[auto]; a << 10;
  foreach v in a {
    if (v > 1) { a << v - 1; }
  }

From hategan at mcs.anl.gov Wed Aug 3 12:24:12 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 03 Aug 2011 10:24:12 -0700
Subject: [Swift-devel] iterate behaviour round II
In-Reply-To: <1654887811.184149.1312381409266.JavaMail.root@zimbra.anl.gov>
References: <1654887811.184149.1312381409266.JavaMail.root@zimbra.anl.gov>
Message-ID: <1312392252.24112.16.camel@blabla>

On Wed, 2011-08-03 at 09:23 -0500, Michael Wilde wrote:
> I propose that we do everything possible to ensure that the semantics
> of iterate does not change from 0.92.1, to avoid breaking code. NCAR
> and the DOE ParVis project, in particular, has a very large Swift
> script that they are testing for production use, and we really don't
> want that to break.
>
> We should not allow 0.93 to break current user code -- if at all possible.

On one hand, I agree with you.

On the other hand, I do not think that backwards compatibility, in the
long run, is a good justification for keeping something that is really
poorly done.

From wilde at mcs.anl.gov Wed Aug 3 12:33:41 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 3 Aug 2011 12:33:41 -0500 (CDT)
Subject: [Swift-devel] Meeting 4PM to talk about release
In-Reply-To: <1328230364.185173.1312390766463.JavaMail.root@zimbra.anl.gov>
Message-ID: <1692079976.185348.1312392821704.JavaMail.root@zimbra.anl.gov>

Let's meet at 4PM Central to discuss 0.93. Dial in: (218) 862-6420 access code 815549

Justin is on vacation. Mihael, Alberto, David, Ketan, Jon - can you join?

Topics:

- testing 0.93 branch(es) vs trunk
- 0.93 was cut Jul 5 - do we want to keep that, or get trunk in shape
  and make another branch? Can we do "0.NNrcX" branches?
  How much has been committed to 0.93 so far?

- identifying blocker bugs for 0.93 - especially for Mihael to focus on

-- resolution of the iterate statement for 0.93 - do bugs remain?
   did it change from 0.92.1? Are any issues in this statement
   related to semantic definitions, or due to subtleties in synchronization?
   I have a test I'd like to hand over: works as expected in 0.92, fails in trunk.

-- jobsPerNode not working in trunk (due to Justin's attr mods?) Ketan taking this.

-- PBS trunk issues related to attributes? (similar to prior? getting cray attr by mistake)

-- fix SGE provider for limited set of machines? (Ranger, ibicluster, ?)

- coordination of site testing and approach for that

- docs and web cleanup; esp site config guide and gensites, and tutorial route

- set target date for 0.93 - is Aug 30 possible?
Thanks,

- Mike

From benc at hawaga.org.uk Wed Aug 3 12:50:59 2011
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 3 Aug 2011 17:50:59 +0000 (GMT)
Subject: [Swift-devel] iterate behaviour round II
In-Reply-To: <1312391968.24112.11.camel@blabla>
References: <1654887811.184149.1312381409266.JavaMail.root@zimbra.anl.gov> <1312391968.24112.11.camel@blabla>
Message-ID:

> The dependent iterations are also covered by foreach, and you can't
> deadlock as easily as you can in the iterate case:

I think I disliked that approach before because my thinking was that
Swift should require a to be fully defined outside of the foreach. But
there's no really strong reason for requiring that (after all, it was
the whole point of iterate...) and I think it looks nice with the <<
syntax.

> Though now you could equally do:
> int a[auto]; a << 10;
> foreach v in a {
>   if (v > 1) { a << v - 1; }
> }

--

From wilde at mcs.anl.gov Wed Aug 3 12:53:37 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 3 Aug 2011 12:53:37 -0500 (CDT)
Subject: [Swift-devel] iterate behaviour round II
In-Reply-To: <1312392252.24112.16.camel@blabla>
Message-ID: <1050281184.185428.1312394017994.JavaMail.root@zimbra.anl.gov>

Right. So let's keep iterate semantically unchanged for now, then
deprecate it when we have an approach that's clearly better. For 0.93
let's focus on making it work as currently described.

Let's consider deprecating it in some not-too-distant release and if
possible remove it much later with sufficient notice to the user
community.

- Mike

(Also note that where iterate is first mentioned in the user guide it
has no until() clause.)
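The foreach-with-append pattern quoted earlier in this message can be
pictured as a worklist that grows while it is being traversed. The Python
sketch below is only a sequential model of that idea, assuming the body
eventually sees every element that gets appended. The name `countdown` is
hypothetical, and real Swift foreach iterations are driven by data
availability rather than by a loop counter.

```python
def countdown(seed):
    """Sequential model of:
         int a[auto]; a << seed;
         foreach v in a { if (v > 1) { a << v - 1; } }
    Hypothetical helper, not Swift code.
    """
    a = [seed]               # a << seed
    k = 0
    while k < len(a):        # also visits elements appended by the body
        v = a[k]
        if v > 1:
            a.append(v - 1)  # a << v - 1
        k += 1
    return a


print(countdown(10))
```

The traversal stops by itself once an iteration appends nothing, which is
why no separate until() test is needed in this formulation.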
----- Original Message -----
> From: "Mihael Hategan"
> To: "Michael Wilde"
> Cc: "Swift Devel"
> Sent: Wednesday, August 3, 2011 12:24:12 PM
> Subject: Re: [Swift-devel] iterate behaviour round II
> On Wed, 2011-08-03 at 09:23 -0500, Michael Wilde wrote:
> > I propose that we do everything possible to ensure that the
> > semantics
> > of iterate does not change from 0.92.1, to avoid breaking code. NCAR
> > and the DOE ParVis project, in particular, has a very large Swift
> > script that they are testing for production use, and we really don't
> > want that to break.
> >
> > We should not allow 0.93 to break current user code -- if at all
> > possible.
>
> On one hand, I agree with you.
>
> On the other hand, I do not think that backwards compatibility, in the
> long run, is a good justification for keeping something that is really
> poorly done.

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov Wed Aug 3 12:56:25 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 03 Aug 2011 10:56:25 -0700
Subject: [Swift-devel] iterate behaviour round II
In-Reply-To: <1050281184.185428.1312394017994.JavaMail.root@zimbra.anl.gov>
References: <1050281184.185428.1312394017994.JavaMail.root@zimbra.anl.gov>
Message-ID: <1312394185.24839.0.camel@blabla>

On Wed, 2011-08-03 at 12:53 -0500, Michael Wilde wrote:
> Right. So let's keep iterate semantically unchanged for now, then
> deprecate it when we have an approach that's clearly better. For 0.93
> let's focus on making it work as currently described.

I agree. I will revert the change I did before.
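Since the thread turns on when the until() test runs relative to the
increment of the loop variable, the two possible readings can be compared
with a small Python model. This is an illustrative sketch only, not
Swift's implementation: the function `iterate`, its
`check_after_increment` flag and the `limit` guard are hypothetical
names. The thread does not state which reading 0.92.1 implements; the
first message of the thread describes the test-after-increment reading,
under which until (v >= 10) traces 0 to 9.

```python
def iterate(until, check_after_increment=True, limit=1000):
    """Model of `iterate v { trace(v); } until (cond);`.

    Returns the values the body would trace. Hypothetical helper.
    """
    traced = []
    v = 0
    while len(traced) < limit:   # guard in case the condition never holds
        traced.append(v)         # the body always runs at least once
        if check_after_increment:
            v += 1               # increment first, then test
            if until(v):
                break
        else:
            if until(v):         # test first, then increment
                break
            v += 1
    return traced


print(iterate(lambda v: v >= 10))                               # 0 .. 9
print(iterate(lambda v: v >= 10, check_after_increment=False))  # 0 .. 10
print(iterate(lambda i: i > 3))                                 # 0 .. 3
print(iterate(lambda i: i == 3))                                # 0 .. 2
```

Under either reading an equality test such as until (i == 3) terminates
in this model; the two readings differ only in whether the last traced
value is 2 or 3.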
From hategan at mcs.anl.gov Wed Aug 3 12:56:35 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 03 Aug 2011 10:56:35 -0700
Subject: [Swift-devel] Meeting 4PM to talk about release
In-Reply-To: <1692079976.185348.1312392821704.JavaMail.root@zimbra.anl.gov>
References: <1692079976.185348.1312392821704.JavaMail.root@zimbra.anl.gov>
Message-ID: <1312394195.24839.1.camel@blabla>

On Wed, 2011-08-03 at 12:33 -0500, Michael Wilde wrote:
> Let's meet at 4PM Central to discuss 0.93. Dial in: (218) 862-6420 access code 815549
>
> Justin is on vacation. Mihael, Alberto, David, Ketan, Jon - can you join?

I'm in.

From wilde at mcs.anl.gov Wed Aug 3 13:22:43 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 3 Aug 2011 13:22:43 -0500 (CDT)
Subject: [Swift-devel] iterate behaviour round II
In-Reply-To: <1312394185.24839.0.camel@blabla>
Message-ID: <400213490.185542.1312395763634.JavaMail.root@zimbra.anl.gov>

Cool. Related to current semantics, I'm seeing a case where iterate seems
to not terminate correctly with an == test but does with a > test. Is
there some float funkiness going on in there too?
">" termination condition works OK:

com$ cat iterategt.swift
iterate i
{
  trace(i);
} until(i > 3);
com$ swift iterategt.swift | head -15
no sites file specified, setting to default: /scratch/local/wilde/swift/src/devtrunk/cog/modules/swift/dist/swift-svn/etc/sites.xml
Swift svn swift-r4934 (swift modified locally) cog-r3184 (cog modified locally)

RunID: 20110803-1315-qn3cr6s8
Progress: time: Wed, 03 Aug 2011 13:15:38 -0500
SwiftScript trace: 0
SwiftScript trace: 1
SwiftScript trace: 2
SwiftScript trace: 3
Final status: time: Wed, 03 Aug 2011 13:15:38 -0500
com$

"==" termination condition never terminates:

com$
com$ cat iterateeq.swift
iterate i
{
  trace(i);
} until(i == 3);
com$

com$ swift iterateeq.swift | head -15
no sites file specified, setting to default: /scratch/local/wilde/swift/src/devtrunk/cog/modules/swift/dist/swift-svn/etc/sites.xml
Swift svn swift-r4934 (swift modified locally) cog-r3184 (cog modified locally)

RunID: 20110803-1316-f9qhsxig
SwiftScript trace: 0
Progress: time: Wed, 03 Aug 2011 13:16:06 -0500
SwiftScript trace: 1
SwiftScript trace: 2
SwiftScript trace: 3
SwiftScript trace: 4
SwiftScript trace: 5
SwiftScript trace: 6
SwiftScript trace: 7
SwiftScript trace: 8
SwiftScript trace: 9
SwiftScript trace: 10

^C
com$ swift -version
no sites file specified, setting to default: /scratch/local/wilde/swift/src/devtrunk/cog/modules/swift/dist/swift-svn/etc/sites.xml
Swift svn swift-r4934 (swift modified locally) cog-r3184 (cog modified locally)

com$ which swift
/scratch/local/wilde/swift/src/devtrunk/cog/modules/swift/dist/swift-svn/bin/swift
com$



----- Original Message -----
> From: "Mihael Hategan"
> To: "Michael Wilde"
> Cc: "Swift Devel"
> Sent: Wednesday, August 3, 2011 12:56:25 PM
> Subject: Re: [Swift-devel] iterate behaviour round II
> On Wed, 2011-08-03 at 12:53 -0500, Michael Wilde wrote:
> > Right. So let's keep iterate semantically unchanged for now, then
> > deprecate it when we have an approach that's clearly better.
For 0.93 > > lets focus on making it work as currently described. > > I agree. I will revert the change I did before. -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Wed Aug 3 13:41:27 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 3 Aug 2011 13:41:27 -0500 (CDT) Subject: [Swift-devel] Meeting 4PM to talk about release In-Reply-To: Message-ID: <1199301879.185637.1312396887107.JavaMail.root@zimbra.anl.gov> This page describes what was done for site testing on prior releases: https://sites.google.com/site/swiftdevel/site-specific-testing Alberto, you should update that page with a plan for site testing for 0.93. I'll help you with that this week and you can continue with Justin when he returns next week. - Mike ----- Original Message ----- From: "Alberto Chavez" To: "Mike Wilde" Sent: Wednesday, August 3, 2011 12:54:48 PM Subject: RE: [Swift-devel] Meeting 4PM to talk about release 4PM sounds good > Date: Wed, 3 Aug 2011 12:33:41 -0500 > From: wilde at mcs.anl.gov > To: swift-devel at ci.uchicago.edu > Subject: [Swift-devel] Meeting 4PM to talk about release > > Lets meet at 4PM Central to discuss 0.93. Dial in: (218) 862-6420 access code 815549 > > Justin is on vacation. Mihael, Alberto, David, Ketan, Jon - can you join? > > Topics: > > - testing 0.93 branch(es) vs trunk > - 0.93 was cut Jul 5 - do we want to keep that, or get trunk in shape > and make another branch? Can we do "0.NNrcX" branches? > How much has been committed to 0.93 so far? > > - identifying blocker bugs for 0.93 - especially for Mihael to focus on > > -- resolution of the iterate statement for 0.93 - do bugs remain? > did it change from 0.92.1? Are any issues in this statement > related to semantic definitions, or due to subtleties in synchronization? > I have a test I'd like to hand over: works as expected in 0.92, fails in trunk. 
> > -- jobsPerNode not working in trunk (due to Justin's attr mods?) Ketan taking this. > > -- PBS trunk issues related to attributes? (similar to prior? getting cray attr by mistake) > > -- fix SGE provider for limited set of machines? (Ranger, ibicluster, ?) > > - coordination of site testing and approach for that > > - docs and web cleanup; esp site config guide and gensites, and tutorial route > > - set target date for 0.93 - is Aug 30 possible? > > Thanks, > > - Mike > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Wed Aug 3 15:36:55 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 03 Aug 2011 13:36:55 -0700 Subject: [Swift-devel] iterate behaviour round II In-Reply-To: <400213490.185542.1312395763634.JavaMail.root@zimbra.anl.gov> References: <400213490.185542.1312395763634.JavaMail.root@zimbra.anl.gov> Message-ID: <1312403815.26065.0.camel@blabla> Can you do an svn up an recheck? On Wed, 2011-08-03 at 13:22 -0500, Michael Wilde wrote: > Cool. Related to current semantics, Im seeing a case where iterate seems to not terminate correctly with an == test but does with a > test. Is there some float funkiness going on in there too? 
> > ">" termination condition works OK: > > com$ cat iterategt.swift > iterate i > { > trace(i); > } until(i > 3); > com$ swift iterategt.swift | head -15 > no sites file specified, setting to default: /scratch/local/wilde/swift/src/devtrunk/cog/modules/swift/dist/swift-svn/etc/sites.xml > Swift svn swift-r4934 (swift modified locally) cog-r3184 (cog modified locally) > > RunID: 20110803-1315-qn3cr6s8 > Progress: time: Wed, 03 Aug 2011 13:15:38 -0500 > SwiftScript trace: 0 > SwiftScript trace: 1 > SwiftScript trace: 2 > SwiftScript trace: 3 > Final status: time: Wed, 03 Aug 2011 13:15:38 -0500 > com$ > > "==" termination condition never terminates: > > com$ > com$ cat iterateeq.swift > iterate i > { > trace(i); > } until(i == 3); > com$ > > com$ swift iterateeq.swift | head -15 > no sites file specified, setting to default: /scratch/local/wilde/swift/src/devtrunk/cog/modules/swift/dist/swift-svn/etc/sites.xml > Swift svn swift-r4934 (swift modified locally) cog-r3184 (cog modified locally) > > RunID: 20110803-1316-f9qhsxig > SwiftScript trace: 0 > Progress: time: Wed, 03 Aug 2011 13:16:06 -0500 > SwiftScript trace: 1 > SwiftScript trace: 2 > SwiftScript trace: 3 > SwiftScript trace: 4 > SwiftScript trace: 5 > SwiftScript trace: 6 > SwiftScript trace: 7 > SwiftScript trace: 8 > SwiftScript trace: 9 > SwiftScript trace: 10 > > ^C > com$ swift -version > no sites file specified, setting to default: /scratch/local/wilde/swift/src/devtrunk/cog/modules/swift/dist/swift-svn/etc/sites.xml > Swift svn swift-r4934 (swift modified locally) cog-r3184 (cog modified locally) > > com$ which swift > /scratch/local/wilde/swift/src/devtrunk/cog/modules/swift/dist/swift-svn/bin/swift > com$ > > > > ----- Original Message ----- > > From: "Mihael Hategan" > > To: "Michael Wilde" > > Cc: "Swift Devel" > > Sent: Wednesday, August 3, 2011 12:56:25 PM > > Subject: Re: [Swift-devel] iterate behaviour round II > > On Wed, 2011-08-03 at 12:53 -0500, Michael Wilde wrote: > > > Right. 
So lets keep iterate semantically unchanged for now, then > > > deprecate it when we have an approach thats clearly better. For 0.93 > > > lets focus on making it work as currently described. > > > > I agree. I will revert the change I did before. > From wilde at mcs.anl.gov Wed Aug 3 15:42:41 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 3 Aug 2011 15:42:41 -0500 (CDT) Subject: [Swift-devel] iterate behaviour round II In-Reply-To: <1312403815.26065.0.camel@blabla> Message-ID: <929176831.186312.1312404161065.JavaMail.root@zimbra.anl.gov> Great, that fixed the failing eq case. - Mike ----- Original Message ----- > From: "Mihael Hategan" > To: "Michael Wilde" > Cc: "Swift Devel" > Sent: Wednesday, August 3, 2011 3:36:55 PM > Subject: Re: [Swift-devel] iterate behaviour round II > Can you do an svn up an recheck? > > On Wed, 2011-08-03 at 13:22 -0500, Michael Wilde wrote: > > Cool. Related to current semantics, Im seeing a case where iterate > > seems to not terminate correctly with an == test but does with a > > > test. Is there some float funkiness going on in there too? 
> > > > ">" termination condition works OK: > > > > com$ cat iterategt.swift > > iterate i > > { > > trace(i); > > } until(i > 3); > > com$ swift iterategt.swift | head -15 > > no sites file specified, setting to default: > > /scratch/local/wilde/swift/src/devtrunk/cog/modules/swift/dist/swift-svn/etc/sites.xml > > Swift svn swift-r4934 (swift modified locally) cog-r3184 (cog > > modified locally) > > > > RunID: 20110803-1315-qn3cr6s8 > > Progress: time: Wed, 03 Aug 2011 13:15:38 -0500 > > SwiftScript trace: 0 > > SwiftScript trace: 1 > > SwiftScript trace: 2 > > SwiftScript trace: 3 > > Final status: time: Wed, 03 Aug 2011 13:15:38 -0500 > > com$ > > > > "==" termination condition never terminates: > > > > com$ > > com$ cat iterateeq.swift > > iterate i > > { > > trace(i); > > } until(i == 3); > > com$ > > > > com$ swift iterateeq.swift | head -15 > > no sites file specified, setting to default: > > /scratch/local/wilde/swift/src/devtrunk/cog/modules/swift/dist/swift-svn/etc/sites.xml > > Swift svn swift-r4934 (swift modified locally) cog-r3184 (cog > > modified locally) > > > > RunID: 20110803-1316-f9qhsxig > > SwiftScript trace: 0 > > Progress: time: Wed, 03 Aug 2011 13:16:06 -0500 > > SwiftScript trace: 1 > > SwiftScript trace: 2 > > SwiftScript trace: 3 > > SwiftScript trace: 4 > > SwiftScript trace: 5 > > SwiftScript trace: 6 > > SwiftScript trace: 7 > > SwiftScript trace: 8 > > SwiftScript trace: 9 > > SwiftScript trace: 10 > > > > ^C > > com$ swift -version > > no sites file specified, setting to default: > > /scratch/local/wilde/swift/src/devtrunk/cog/modules/swift/dist/swift-svn/etc/sites.xml > > Swift svn swift-r4934 (swift modified locally) cog-r3184 (cog > > modified locally) > > > > com$ which swift > > /scratch/local/wilde/swift/src/devtrunk/cog/modules/swift/dist/swift-svn/bin/swift > > com$ > > > > > > > > ----- Original Message ----- > > > From: "Mihael Hategan" > > > To: "Michael Wilde" > > > Cc: "Swift Devel" > > > Sent: Wednesday, August 3, 
2011 12:56:25 PM
> > > Subject: Re: [Swift-devel] iterate behaviour round II
> > > On Wed, 2011-08-03 at 12:53 -0500, Michael Wilde wrote:
> > > > Right. So let's keep iterate semantically unchanged for now, then
> > > > deprecate it when we have an approach that's clearly better. For
> > > > 0.93
> > > > let's focus on making it work as currently described.
> > >
> > > I agree. I will revert the change I did before.
> >

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From davidk at ci.uchicago.edu Wed Aug 3 20:57:07 2011
From: davidk at ci.uchicago.edu (David Kelly)
Date: Wed, 3 Aug 2011 20:57:07 -0500 (CDT)
Subject: [Swift-devel] 0.93 site testing
In-Reply-To: <1009000252.54108.1312422324637.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID: <519491437.54124.1312423027539.JavaMail.root@zimbra-mb2.anl.gov>

Hello,

I updated the swift devel website tonight with plans for 0.93 site
testing. I am starting with the same site tests that we performed with
the last release. The page is at:

https://sites.google.com/site/swiftdevel/site-specific-testing

Feel free to edit that page as the tests get run. The tests are located
in swift/tests/providers, but they likely need to be tweaked. I'm pretty
sure the PADS template is incorrect; others may be as well, so it is
probably worthwhile to double check everything.

David

From wilde at mcs.anl.gov Thu Aug 4 04:44:29 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 4 Aug 2011 04:44:29 -0500 (CDT)
Subject: [Swift-devel] 0.93 site testing
In-Reply-To: <519491437.54124.1312423027539.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID: <488459950.187367.1312451069961.JavaMail.root@zimbra.anl.gov>

David, Alberto,

The test list looks good. We can maybe shuffle the names assigned:
Alberto on PADS and Fusion; Justin can maybe help on the BG/P's and
Franklin; can add the Cray crow test system.
It might be good to add tests of the plain ssh provider, and then the ssh:local coaster configuration. (Jon is using this, e.g., to run on PADS and Beagle from Globus Online.) (Aside: I see very excessive logging from the ssh provider - let's investigate; I'll file a ticket.)

I noticed that in several past incidents our site tests were fooled into thinking they passed, when in fact the actual application invocations took place in an environment different than intended. Some cases that come to mind:

- thinking that we were running on a cluster via coasters when in fact the apps ran on localhost. This was the incorrect PADS sites entry you mention below.

- thinking that we were running on Cray compute nodes when in fact the apps ran on the Cray PBS service node (on Beagle, again this was a login node)

- asking to run N apps per compute node (1 per core) when in fact we ran 1 app per node

- asking to run N apps per compute node when in fact we ran N^2 apps per node

In this next round of testing, can we enhance the tests (or add new ones) so that:

1) part of the app execution records the node(s) it executes on and ensures that we are running on a compute node. (We can do this in a site-independent fashion by adding a "compute node hostname pattern" to the siteTester script: https://trac.ci.uchicago.edu/swift/browser/trunk/tests/sitetester?desc=1 and passing the name pattern to the test.)

2) the expected number of apps are running on the compute node (sleep; do ps; count the number of app shells running, and ensure that there are >1 and <= N)

- Mike

----- Original Message -----
> From: "David Kelly"
> To: swift-devel at ci.uchicago.edu
> Sent: Wednesday, August 3, 2011 8:57:07 PM
> Subject: [Swift-devel] 0.93 site testing

> Hello,
>
> I updated the swift devel website tonight with plans for 0.93 site
> testing. I am starting with the same site tests that we performed with
> the last release.
> The page is at:
>
> https://sites.google.com/site/swiftdevel/site-specific-testing
>
> Feel free to edit that page as the tests get run. The tests are
> located in swift/tests/providers, but they likely need to be tweaked. I'm
> pretty sure the PADS template is incorrect; others may be as well, so
> it is probably worthwhile to double check everything.
>
> David

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From wilde at mcs.anl.gov Thu Aug 4 04:52:41 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 4 Aug 2011 04:52:41 -0500 (CDT)
Subject: [Swift-devel] 0.93 site testing
In-Reply-To: <488459950.187367.1312451069961.JavaMail.root@zimbra.anl.gov>
Message-ID: <1043952745.187374.1312451561434.JavaMail.root@zimbra.anl.gov>

A question about 0.93 testing in general: We agreed yesterday to stay with the current 0.93 branch. Should we be testing 0.93 with the tests in 0.93 itself, or the tests in trunk, or both?

I think if we do this rigorously, the answer is that we test 0.93 with the 0.93 test suite, but that we integrate relevant test corrections and enhancements in both directions. The same of course applies to code, but in the case of the tests we may find that more cross-branch integration is needed, especially if we are going to do a lot of improvement on site tests in the next 2 weeks.

I'm wondering if we can afford the 2-way integration process, or should take the shortcut of testing 0.93 with trunk tests? I'm in favor of the two-way approach, but we should keep an eye on the process and reconsider if it becomes too costly.
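The two site-test checks proposed earlier in this thread (confirm the app ran on a compute node; confirm the per-node app count is sane) could be sketched roughly as below. This is only an illustration under stated assumptions: the function names are hypothetical, the pattern is assumed to be a POSIX extended regular expression, and nothing here reflects how the existing sitetester script is actually structured.

```shell
#!/bin/sh
# Hypothetical helpers for a site test; the names and the regex-based
# pattern are assumptions, not part of the existing sitetester script.

# Succeeds if the host name matches the expected compute-node pattern.
# $1 = extended regular expression, $2 = host name (defaults to this host).
is_compute_node() {
    pattern=$1
    host=${2:-$(hostname)}
    printf '%s\n' "$host" | grep -Eq -- "$pattern"
}

# Succeeds if the observed number of app shells is >1 and <= N,
# which is the bound suggested in the thread.
# $1 = observed count, $2 = N (apps requested per node).
app_count_ok() {
    count=$1
    max=$2
    [ "$count" -gt 1 ] && [ "$count" -le "$max" ]
}
```

A test app might call `is_compute_node "$PATTERN" || exit 1` before doing any work, with the pattern passed in per site, and feed a count obtained from `ps` into `app_count_ok` after a short sleep.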
- Mike ----- Original Message ----- > From: "Michael Wilde" > To: "David Kelly" , "Alberto Chavez" > Cc: swift-devel at ci.uchicago.edu > Sent: Thursday, August 4, 2011 4:44:29 AM > Subject: Re: [Swift-devel] 0.93 site testing > David, Alberto, > > The test list looks good. We can maybe shuffle the names assigned: > Alberto on PADS and Fusion; Justin can maybe help on the BG/P's and > Frankin; can add the Cray crow test system. > > It might be good to add tests of the plain ssh provider, and then the > ssh:local coaster configuration. (Jon is using this eg to run on PADS > and Beagle from Globus Online).(Aside: I see very excessive logging > from the ssh provider - lets investigate, I'll file a ticket) > > I noticed that in several past incidents our site tests were fooled > into thinking they passed, when in fact the actual application > invocations took place in an environment different than intended. Some > cases that come to mind: > > - thinking that we were running on a cluster via coasters when in fact > the apps ran on localhost. This was the incorrect PADS sites entry you > mention below. > > - thinking that we were running on Cray compute nodes when in fact the > apps ran on the Cray PBS service node (on Beagle, again this was a > login node) > > - asking to run N apps per compute node (1 per core) when in fact we > ran 1 app per node > > - asking to run N apps per compute node when in fact we ran N^2 apps > per node > > In this next round of testing, can we enhance the tests (or add new > ones) so that: > > 1) part of the app execution records the node(s) it executes on and > ensures that we are running on a compute node (We can do this in a > site-independent fashion by adding a "compute node hostname pattern" > to the siteTester script: > https://trac.ci.uchicago.edu/swift/browser/trunk/tests/sitetester?desc=1 > and passing the name pattern to the test. 
> > 2) the expected number of apps are running on the compute node (sleep; > do ps; count the number of app shells running, and ensure that there > are >1 and <= N) > > - Mike > > ----- Original Message ----- > > From: "David Kelly" > > To: swift-devel at ci.uchicago.edu > > Sent: Wednesday, August 3, 2011 8:57:07 PM > > Subject: [Swift-devel] 0.93 site testing > > Hello, > > > > I updated the swift devel website tonight with plans for 0.93 site > > testing. I am starting with the same site tests that we performed > > with > > the last release. The page is at: > > > > https://sites.google.com/site/swiftdevel/site-specific-testing > > > > Feel free to edit that page as the tests get run. The tests are > > located in swift/tests/providers, but they likely need tweaked. I'm > > pretty sure the PADS template is incorrect, others may be as well so > > it is probably worthwhile to double check everything. > > > > David > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Thu Aug 4 14:40:32 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 04 Aug 2011 12:40:32 -0700 Subject: [Swift-devel] 0.93 site testing In-Reply-To: <1043952745.187374.1312451561434.JavaMail.root@zimbra.anl.gov> References: <1043952745.187374.1312451561434.JavaMail.root@zimbra.anl.gov> Message-ID: <1312486832.30524.0.camel@blabla> On Thu, 2011-08-04 at 04:52 -0500, Michael Wilde 
wrote:
> A question about 0.93 testing in general: We agreed yesterday to stay with the current 0.93 branch. Should we be testing 0.93 with the tests in 0.93 itself, or the tests in trunk, or both?
>
> I think if we do this rigorously, the answer is that we test 0.93 with the 0.93 test suite, but that we integrate relevant test corrections and enhancements in both directions. The same of course applies to code, but in the case of the tests we may find that more cross-branch integration is needed, especially if we are going to do a lot of improvement on site tests in the next 2 weeks.
>
> I'm wondering if we can afford the 2-way integration process, or should
> take the shortcut of testing 0.93 with trunk tests? I'm in favor of
> the two-way approach, but we should keep an eye on the process and reconsider
> if it becomes too costly.

There were fixes to the tests that I committed to trunk. I think most of them should be backported to the branch by virtue of them being fixes.

From hategan at mcs.anl.gov Thu Aug 4 18:09:56 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 04 Aug 2011 16:09:56 -0700
Subject: [Swift-devel] trunk coasters
In-Reply-To: <2098979040.190576.1312489752827.JavaMail.root@zimbra.anl.gov>
References: <2098979040.190576.1312489752827.JavaMail.root@zimbra.anl.gov>
Message-ID: <1312499396.2896.2.camel@blabla>

On Thu, 2011-08-04 at 15:29 -0500, Michael Wilde wrote:
> So the other error - the failing service - is not happening on local tests on 0.93; next I'll try the remote cases.

Ok. I committed a number of things to trunk, one of which is a fix for the messed up channel lookup problem.

I used it previously for auto-deployed services on ranger and pads, but haven't tried it with the stand-alone service. So please test that if you can and let me know.

I'll now move to dealing with 0.93 issues.
From wilde at mcs.anl.gov Thu Aug 4 23:22:31 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 4 Aug 2011 23:22:31 -0500 (CDT) Subject: [Swift-devel] trunk coasters In-Reply-To: <1312499396.2896.2.camel@blabla> Message-ID: <404855649.191251.1312518151043.JavaMail.root@zimbra.anl.gov> Im still getting the errors below (which I think are what I reported prior to this fix). I'll double check that I got the latest fix in, but I think I do. - Mike 2011-08-04 23:17:47,757-0500 DEBUG Cpu workerStarted: swork:node016:0 2011-08-04 23:17:47,757-0500 DEBUG Cpu swork:0 pullLater 2011-08-04 23:17:47,758-0500 INFO Block Started CPU 0:1312517867s 2011-08-04 23:17:47,758-0500 INFO Block Started worker swork:000000 2011-08-04 23:17:47,758-0500 INFO Cpu swork:0 pull 2011-08-04 23:17:47,761-0500 WARN BlockQueueProcessor Failed to send worker status update to client java.lang.NullPointerException at org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:433) at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:226) at org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.registrationReceived(PassiveQueueProcessor.java:72) at org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.registrationReceived(JobQueue.java:143) at org.globus.cog.abstraction.coaster.service.LocalTCPService.registrationReceived(LocalTCPService.java:64) at org.globus.cog.abstraction.coaster.service.local.RegistrationHandler.requestComplete(RegistrationHandler.java:57) at org.globus.cog.karajan.workflow.service.handlers.RequestHandler.receiveCompleted(RequestHandler.java:84) at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleRequest(AbstractKarajanChannel.java:416) at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.step(AbstractStreamKarajanChannel.java:157) at 
org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Multiplexer.run(AbstractStreamKarajanChannel.java:375) 2011-08-04 23:17:47,764-0500 INFO LocalTCPService Received registration: blockid = swork, url = node016 2011-08-04 23:17:47,765-0500 INFO AbstractKarajanChannel MetaChannel: 467772424[15735326: {}] -> null: Disabling heartbeats (config is null) 2011-08-04 23:17:47,765-0500 INFO MetaChannel MetaChannel: 467772424[15735326: {}] -> null.bind -> SC-null 2011-08-04 23:17:47,765-0500 DEBUG Cpu workerStarted: swork:node016:1 2011-08-04 23:17:47,765-0500 DEBUG Cpu swork:1 pullLater 2011-08-04 23:17:47,765-0500 INFO Block Started CPU 1:1312517867s 2011-08-04 23:17:47,765-0500 INFO Cpu swork:1 pull 2011-08-04 23:17:47,765-0500 INFO Block Started worker swork:000001 2011-08-04 23:17:47,766-0500 WARN BlockQueueProcessor Failed to send worker status update to client java.lang.NullPointerException at org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:433) at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:226) at org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.registrationReceived(PassiveQueueProcessor.java:72) at org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.registrationReceived(JobQueue.java:143) at org.globus.cog.abstraction.coaster.service.LocalTCPService.registrationReceived(LocalTCPService.java:64) at org.globus.cog.abstraction.coaster.service.local.RegistrationHandler.requestComplete(RegistrationHandler.java:57) at org.globus.cog.karajan.workflow.service.handlers.RequestHandler.receiveCompleted(RequestHandler.java:84) at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleRequest(AbstractKarajanChannel.java:416) at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.step(AbstractStreamKarajanChannel.java:157) at 
org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Multiplexer.run(AbstractStreamKarajanChannel.java:375) 2011-08-04 23:17:48,568-0500 INFO TCPBufferManager Adjusting buffer size to 524288 ----- Original Message ----- > From: "Mihael Hategan" > To: "Michael Wilde" > Cc: "Justin M Wozniak" , "Swift Devel" > Sent: Thursday, August 4, 2011 6:09:56 PM > Subject: Re: [Swift-devel] trunk coasters > On Thu, 2011-08-04 at 15:29 -0500, Michael Wilde wrote: > > > So the other error - the failing service - is not happing on local > > tests on 0.93; next I'll try the remote cases. > > Ok. I committed a number of things to trunk, one of which is a fix for > the messed up channel lookup problem. > > I used it previously for auto-deployed services on ranger and pads, > but > haven't tried it with the stand-alone service. So please test that if > you can and let me know. > > I'll now move to dealing with 0.93 issues. -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Fri Aug 5 01:46:51 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 04 Aug 2011 23:46:51 -0700 Subject: [Swift-devel] testing and bugs Message-ID: <1312526811.16208.2.camel@blabla> I was thinking it would be useful, if - when reporting a bug that, for a release, is meant to be fixed - you have access to SVN, to commit a test for it along with the bug report. And that sentence could have been simpler. From wilde at mcs.anl.gov Fri Aug 5 09:25:46 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 5 Aug 2011 09:25:46 -0500 (CDT) Subject: [Swift-devel] Try Ranger with nodeGranularity=16 Message-ID: <1943578453.191714.1312554346440.JavaMail.root@zimbra.anl.gov> I learned from Mihael yesterday that the SGE provider should in fact work on Ranger in 0.93 and trunk if you set nodeGranularity to 16. 
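For anyone trying this: coaster settings such as nodeGranularity are normally given as globus-namespace profile entries inside the site's pool definition in sites.xml. The snippet below is a minimal sketch that just prints such an entry; the helper name is made up for illustration, and the other profile keys a Ranger entry needs are not shown and would have to come from the existing site template.

```shell
# Hypothetical helper: print the sites.xml profile element that sets
# nodeGranularity to the given value (16 for Ranger, per the note above).
granularity_profile() {
    printf '<profile namespace="globus" key="nodeGranularity">%s</profile>\n' "$1"
}

granularity_profile 16
```

The printed element would be placed inside the Ranger pool entry alongside its other profile lines.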
This is a confusion between nodes and cores that should get fixed in 0.94 unless testing indicates that the above setting doesn't work in 0.93.

- Mike

From jonmon at mcs.anl.gov Fri Aug 5 09:42:03 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Fri, 5 Aug 2011 09:42:03 -0500
Subject: [Swift-devel] Try Ranger with nodeGranularity=16
In-Reply-To: <1943578453.191714.1312554346440.JavaMail.root@zimbra.anl.gov>
References: <1943578453.191714.1312554346440.JavaMail.root@zimbra.anl.gov>
Message-ID: <04BE1E8B-F006-4D30-80DA-6AF031346C91@mcs.anl.gov>

On Aug 5, 2011, at 9:25 AM, Michael Wilde wrote:

> I learned from Mihael yesterday that the SGE provider should in fact work on Ranger in 0.93 and trunk if you set nodeGranularity to 16.
>
> This is a confusion between nodes and cores that should get fixed in 0.94 unless testing indicates that the above setting doesn't work in 0.93.
>
> - Mike

From jonmon at mcs.anl.gov Fri Aug 5 09:42:13 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Fri, 5 Aug 2011 09:42:13 -0500
Subject: [Swift-devel] Try Ranger with nodeGranularity=16
In-Reply-To: <1943578453.191714.1312554346440.JavaMail.root@zimbra.anl.gov>
References: <1943578453.191714.1312554346440.JavaMail.root@zimbra.anl.gov>
Message-ID:

I will certainly give this a try.

I think we mentioned this in the con call on Wednesday, but could someone post the svn co procedure for 0.93, perhaps to the swift-devel google site? The swift 0.93 branch is easy to figure out but cog not so much.

On Aug 5, 2011, at 9:25 AM, Michael Wilde wrote:

> I learned from Mihael yesterday that the SGE provider should in fact work on Ranger in 0.93 and trunk if you set nodeGranularity to 16.
>
> This is a confusion between nodes and cores that should get fixed in 0.94 unless testing indicates that the above setting doesn't work in 0.93.
>
> - Mike

From wilde at mcs.anl.gov Fri Aug 5 09:55:36 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 5 Aug 2011 09:55:36 -0500 (CDT)
Subject: [Swift-devel] Try Ranger with nodeGranularity=16
In-Reply-To: <37E2636A-7D90-44DD-90D2-CE9F96570881@gmail.com>
Message-ID: <147752740.191850.1312556136873.JavaMail.root@zimbra.anl.gov>

> From: "Jonathan Monette"
>
> I think we mentioned this in the con call on Wednesday but could
> someone post the svn co procedure for 0.93 perhaps to the swift-devel
> google site? The swift 0.93 branch is easy to figure out but cog not
> so much.
>

svn co https://cogkit.svn.sourceforge.net/svnroot/cogkit/branches/4.1.9/src/cog
cd cog/modules
svn co https://svn.ci.uchicago.edu/svn/vdl2/branches/release-0.93 swift

From wilde at mcs.anl.gov Fri Aug 5 10:00:58 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 5 Aug 2011 10:00:58 -0500 (CDT)
Subject: [Swift-devel] trunk coasters
In-Reply-To: <404855649.191251.1312518151043.JavaMail.root@zimbra.anl.gov>
Message-ID: <2050741312.191916.1312556458623.JavaMail.root@zimbra.anl.gov>

Mihael,

Persistent coasters work well so far in 0.93; the problem below seems to be in trunk.

I'm able to run to many remote OSG sites now, with good performance, using provider staging, with one coaster service.

I've seen one script of 100 jobs hang after 97 completed (once), but all other tests up to 1000 jobs have succeeded. I'll try to recreate that hang and capture logs etc.
- Mike ----- Original Message ----- > From: "Michael Wilde" > To: "Mihael Hategan" > Cc: "Justin M Wozniak" , "Swift Devel" > Sent: Thursday, August 4, 2011 11:22:31 PM > Subject: Re: [Swift-devel] trunk coasters > Im still getting the errors below (which I think are what I reported > prior to this fix). I'll double check that I got the latest fix in, > but I think I do. > > - Mike > > 2011-08-04 23:17:47,757-0500 DEBUG Cpu workerStarted: swork:node016:0 > 2011-08-04 23:17:47,757-0500 DEBUG Cpu swork:0 pullLater > 2011-08-04 23:17:47,758-0500 INFO Block Started CPU 0:1312517867s > 2011-08-04 23:17:47,758-0500 INFO Block Started worker swork:000000 > 2011-08-04 23:17:47,758-0500 INFO Cpu swork:0 pull > 2011-08-04 23:17:47,761-0500 WARN BlockQueueProcessor Failed to send > worker status update to client > java.lang.NullPointerException > at > org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:433) > at > org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:226) > at > org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.registrationReceived(PassiveQueueProcessor.java:72) > at > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.registrationReceived(JobQueue.java:143) > at > org.globus.cog.abstraction.coaster.service.LocalTCPService.registrationReceived(LocalTCPService.java:64) > at > org.globus.cog.abstraction.coaster.service.local.RegistrationHandler.requestComplete(RegistrationHandler.java:57) > at > org.globus.cog.karajan.workflow.service.handlers.RequestHandler.receiveCompleted(RequestHandler.java:84) > at > org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleRequest(AbstractKarajanChannel.java:416) > at > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.step(AbstractStreamKarajanChannel.java:157) > at > 
org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Multiplexer.run(AbstractStreamKarajanChannel.java:375) > 2011-08-04 23:17:47,764-0500 INFO LocalTCPService Received > registration: blockid = swork, url = node016 > 2011-08-04 23:17:47,765-0500 INFO AbstractKarajanChannel MetaChannel: > 467772424[15735326: {}] -> null: Disabling heartbeats (config is null) > 2011-08-04 23:17:47,765-0500 INFO MetaChannel MetaChannel: > 467772424[15735326: {}] -> null.bind -> SC-null > 2011-08-04 23:17:47,765-0500 DEBUG Cpu workerStarted: swork:node016:1 > 2011-08-04 23:17:47,765-0500 DEBUG Cpu swork:1 pullLater > 2011-08-04 23:17:47,765-0500 INFO Block Started CPU 1:1312517867s > 2011-08-04 23:17:47,765-0500 INFO Cpu swork:1 pull > 2011-08-04 23:17:47,765-0500 INFO Block Started worker swork:000001 > 2011-08-04 23:17:47,766-0500 WARN BlockQueueProcessor Failed to send > worker status update to client > java.lang.NullPointerException > at > org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:433) > at > org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:226) > at > org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.registrationReceived(PassiveQueueProcessor.java:72) > at > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.registrationReceived(JobQueue.java:143) > at > org.globus.cog.abstraction.coaster.service.LocalTCPService.registrationReceived(LocalTCPService.java:64) > at > org.globus.cog.abstraction.coaster.service.local.RegistrationHandler.requestComplete(RegistrationHandler.java:57) > at > org.globus.cog.karajan.workflow.service.handlers.RequestHandler.receiveCompleted(RequestHandler.java:84) > at > org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleRequest(AbstractKarajanChannel.java:416) > at > 
org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.step(AbstractStreamKarajanChannel.java:157) > at > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Multiplexer.run(AbstractStreamKarajanChannel.java:375) > 2011-08-04 23:17:48,568-0500 INFO TCPBufferManager Adjusting buffer > size to 524288 > > > ----- Original Message ----- > > From: "Mihael Hategan" > > To: "Michael Wilde" > > Cc: "Justin M Wozniak" , "Swift Devel" > > > > Sent: Thursday, August 4, 2011 6:09:56 PM > > Subject: Re: [Swift-devel] trunk coasters > > On Thu, 2011-08-04 at 15:29 -0500, Michael Wilde wrote: > > > > > So the other error - the failing service - is not happing on local > > > tests on 0.93; next I'll try the remote cases. > > > > Ok. I committed a number of things to trunk, one of which is a fix > > for > > the messed up channel lookup problem. > > > > I used it previously for auto-deployed services on ranger and pads, > > but > > haven't tried it with the stand-alone service. So please test that > > if > > you can and let me know. > > > > I'll now move to dealing with 0.93 issues. > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Fri Aug 5 13:37:10 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 05 Aug 2011 11:37:10 -0700 Subject: [Swift-devel] trunk coasters In-Reply-To: <404855649.191251.1312518151043.JavaMail.root@zimbra.anl.gov> References: <404855649.191251.1312518151043.JavaMail.root@zimbra.anl.gov> Message-ID: <1312569430.4824.3.camel@blabla> Can you post the sites file? On Thu, 2011-08-04 at 23:22 -0500, Michael Wilde wrote: > Im still getting the errors below (which I think are what I reported prior to this fix). 
I'll double check that I got the latest fix in, but I think I do. > > - Mike > > 2011-08-04 23:17:47,757-0500 DEBUG Cpu workerStarted: swork:node016:0 > 2011-08-04 23:17:47,757-0500 DEBUG Cpu swork:0 pullLater > 2011-08-04 23:17:47,758-0500 INFO Block Started CPU 0:1312517867s > 2011-08-04 23:17:47,758-0500 INFO Block Started worker swork:000000 > 2011-08-04 23:17:47,758-0500 INFO Cpu swork:0 pull > 2011-08-04 23:17:47,761-0500 WARN BlockQueueProcessor Failed to send worker status update to client > java.lang.NullPointerException > at org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:433) > at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:226) > at org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.registrationReceived(PassiveQueueProcessor.java:72) > at org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.registrationReceived(JobQueue.java:143) > at org.globus.cog.abstraction.coaster.service.LocalTCPService.registrationReceived(LocalTCPService.java:64) > at org.globus.cog.abstraction.coaster.service.local.RegistrationHandler.requestComplete(RegistrationHandler.java:57) > at org.globus.cog.karajan.workflow.service.handlers.RequestHandler.receiveCompleted(RequestHandler.java:84) > at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleRequest(AbstractKarajanChannel.java:416) > at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.step(AbstractStreamKarajanChannel.java:157) > at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Multiplexer.run(AbstractStreamKarajanChannel.java:375) > 2011-08-04 23:17:47,764-0500 INFO LocalTCPService Received registration: blockid = swork, url = node016 > 2011-08-04 23:17:47,765-0500 INFO AbstractKarajanChannel MetaChannel: 467772424[15735326: {}] -> null: Disabling heartbeats (config is null) > 2011-08-04 
23:17:47,765-0500 INFO MetaChannel MetaChannel: 467772424[15735326: {}] -> null.bind -> SC-null > 2011-08-04 23:17:47,765-0500 DEBUG Cpu workerStarted: swork:node016:1 > 2011-08-04 23:17:47,765-0500 DEBUG Cpu swork:1 pullLater > 2011-08-04 23:17:47,765-0500 INFO Block Started CPU 1:1312517867s > 2011-08-04 23:17:47,765-0500 INFO Cpu swork:1 pull > 2011-08-04 23:17:47,765-0500 INFO Block Started worker swork:000001 > 2011-08-04 23:17:47,766-0500 WARN BlockQueueProcessor Failed to send worker status update to client > java.lang.NullPointerException > at org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:433) > at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:226) > at org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.registrationReceived(PassiveQueueProcessor.java:72) > at org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.registrationReceived(JobQueue.java:143) > at org.globus.cog.abstraction.coaster.service.LocalTCPService.registrationReceived(LocalTCPService.java:64) > at org.globus.cog.abstraction.coaster.service.local.RegistrationHandler.requestComplete(RegistrationHandler.java:57) > at org.globus.cog.karajan.workflow.service.handlers.RequestHandler.receiveCompleted(RequestHandler.java:84) > at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleRequest(AbstractKarajanChannel.java:416) > at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.step(AbstractStreamKarajanChannel.java:157) > at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Multiplexer.run(AbstractStreamKarajanChannel.java:375) > 2011-08-04 23:17:48,568-0500 INFO TCPBufferManager Adjusting buffer size to 524288 > > > ----- Original Message ----- > > From: "Mihael Hategan" > > To: "Michael Wilde" > > Cc: "Justin M Wozniak" , "Swift Devel" > > Sent: Thursday, August 4, 2011 6:09:56 
PM > > Subject: Re: [Swift-devel] trunk coasters > > On Thu, 2011-08-04 at 15:29 -0500, Michael Wilde wrote: > > > > > So the other error - the failing service - is not happing on local > > > tests on 0.93; next I'll try the remote cases. > > > > Ok. I committed a number of things to trunk, one of which is a fix for > > the messed up channel lookup problem. > > > > I used it previously for auto-deployed services on ranger and pads, > > but > > haven't tried it with the stand-alone service. So please test that if > > you can and let me know. > > > > I'll now move to dealing with 0.93 issues. > From ketancmaheshwari at gmail.com Fri Aug 5 14:47:30 2011 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Fri, 5 Aug 2011 14:47:30 -0500 Subject: [Swift-devel] How to get Engage VO membership Message-ID: Hello, Per discussion with Mike, here is how one can get a membership to Engage VO: Step1. Apply for a certificate: https://pki1.doegrids.org/ca/; use ANL as affiliation (registration authority) in the form. Step2. When you receive your certificate via a link by mail, download and install it in your browser; I have only tested it for firefox on linux and mac. Jon says, it works for Chrome on mac. And I know that it does not work on Chrome+linux. On firefox, as you click the link that you received in the mail, you will be prompted to install it by firefox, passphrase it and click install. Next take a backup of this certificate in the form of .p12. This is in Preferences > Advanced > Encryption > View Certi > Your Certi Step3. Install DOE CA and ESnet root CA into the browser by clicking the top left links on this page: http://www.doegrids.org/; I do not know if ESnet CA cert is necessary or not but I install both anyways. I know that DOE CA is necessary. Step4. Go to the Engage VO registration point here: https://osg-engage.renci.org:8443/vomrs/Engage/vomrs from the same browser that has the above certs installed. 
Also see this: https://twiki.grid.iu.edu/bin/view/Engagement/EngageNewUserGuide for more details.

Step5. Once you have the membership of the VO, you need to put the certificate that is in the browser into your .globus directory on the machine from which you want to access OSG resources. The certificate has to be in the form of .pem files, with a separate file for key and cert. For this, use the .p12 file backed up above as follows:

$ openssl pkcs12 -in your.p12 -out usercert.pem -nodes -clcerts -nokeys
$ openssl pkcs12 -in your.p12 -out userkey.pem -nodes -nocerts

Above commands are taken from: http://security.ncsa.illinois.edu/research/grid-howtos/usefulopenssl.html
For more on openssl: http://www.openssl.org/docs/apps/openssl.html

Step6. Test it:

$ voms-proxy-init --voms Engage -hours 48

Regards,
--
Ketan

From wilde at mcs.anl.gov Fri Aug 5 23:27:01 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 5 Aug 2011 23:27:01 -0500 (CDT)
Subject: [Swift-devel] Coaster test failed at 86K of 100K jobs
In-Reply-To: <578229958.194613.1312604520459.JavaMail.root@zimbra.anl.gov>
Message-ID: <1077362111.194622.1312604821209.JavaMail.root@zimbra.anl.gov>

Mihael,

I was running catsn.swift with 100K jobs (-n=100000) to a single-server persistent coaster pool to about 50 OSG worker nodes. Using 0.93.
It failed after about 86K jobs with this error: Submitted:82 Active:2 Finished successfully:86521 Progress: time: Fri, 05 Aug 2011 22:15:50 -0500 Selecting site:921 Submitting:16 Submitted:83 Active:2 Finished successfully:86531 Progress: time: Fri, 05 Aug 2011 22:15:51 -0500 Selecting site:922 Submitting:12 Submitted:76 Active:13 Finished successfully:86534 Progress: time: Fri, 05 Aug 2011 22:15:54 -0500 Selecting site:918 Submitting:16 Submitted:83 Active:1 Finished successfully:86548 Execution failed: java.util.ConcurrentModificationException The first exception in the logs shows: 2011-08-05 22:15:54,845-0500 DEBUG vdl:mains FOREACH_IT_END line=9 thread=0-3-87187 2011-08-05 22:15:54,845-0500 DEBUG VDL2ExecutionContext java.util.ConcurrentModificationException java.util.ConcurrentModificationException Caused by: java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) at java.util.AbstractList$Itr.next(AbstractList.java:343) at org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.getSummary(RuntimeStats.java:177) at org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.printStates(RuntimeStats.java:194) at org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.dumpState(RuntimeStats.java:159) at org.griphyn.vdl.karajan.lib.RuntimeStats.setProgress(RuntimeStats.java:88) at org.griphyn.vdl.karajan.lib.RuntimeStats.vdl_setprogress(RuntimeStats.java:82) at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) Ive moved the logs to: /home/wilde/swiftgrid/test.swift-workers/logs.05 - Mike From hategan at mcs.anl.gov Sat Aug 6 00:02:16 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 05 Aug 2011 22:02:16 -0700 Subject: [Swift-devel] Coaster test failed at 86K of 100K jobs In-Reply-To: <1077362111.194622.1312604821209.JavaMail.root@zimbra.anl.gov> 
References: <1077362111.194622.1312604821209.JavaMail.root@zimbra.anl.gov> Message-ID: <1312606936.18332.1.camel@blabla> Amazing how that bug in what would otherwise be a relatively simple class (CopyOnWriteArrayList) has managed to survive so long. Concurrency ain't easy! I'll have a fix committed after I do a bit of testing. On Fri, 2011-08-05 at 23:27 -0500, Michael Wilde wrote: > Mihael, > > I was running catsn.swift with 100K jobs (-n=100000) to a single-server persistent coaster pool to about 50 OSG worker nodes. Using 0.93. > > It failed after about 86K jobs with this error: > > Submitted:82 Active:2 Finished successfully:86521 > Progress: time: Fri, 05 Aug 2011 22:15:50 -0500 Selecting site:921 Submitting:16 Submitted:83 Active:2 Finished successfully:86531 > Progress: time: Fri, 05 Aug 2011 22:15:51 -0500 Selecting site:922 Submitting:12 Submitted:76 Active:13 Finished successfully:86534 > Progress: time: Fri, 05 Aug 2011 22:15:54 -0500 Selecting site:918 Submitting:16 Submitted:83 Active:1 Finished successfully:86548 > Execution failed: > java.util.ConcurrentModificationException > > > The first exception in the logs shows: > > 2011-08-05 22:15:54,845-0500 DEBUG vdl:mains FOREACH_IT_END line=9 thread=0-3-87187 > 2011-08-05 22:15:54,845-0500 DEBUG VDL2ExecutionContext java.util.ConcurrentModificationException > java.util.ConcurrentModificationException > Caused by: java.util.ConcurrentModificationException > at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) > at java.util.AbstractList$Itr.next(AbstractList.java:343) > at org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.getSummary(RuntimeStats.java:177) > at org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.printStates(RuntimeStats.java:194) > at org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.dumpState(RuntimeStats.java:159) > at org.griphyn.vdl.karajan.lib.RuntimeStats.setProgress(RuntimeStats.java:88) > at 
org.griphyn.vdl.karajan.lib.RuntimeStats.vdl_setprogress(RuntimeStats.java:82) > at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > > > Ive moved the logs to: /home/wilde/swiftgrid/test.swift-workers/logs.05 > > - Mike > > > From hategan at mcs.anl.gov Sat Aug 6 01:02:48 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 05 Aug 2011 23:02:48 -0700 Subject: [Swift-devel] Coaster test failed at 86K of 100K jobs In-Reply-To: <1312606936.18332.1.camel@blabla> References: <1077362111.194622.1312604821209.JavaMail.root@zimbra.anl.gov> <1312606936.18332.1.camel@blabla> Message-ID: <1312610568.18332.12.camel@blabla> Potential fix is in the 0.93 branch. I'm not entirely sure that this was the problem, but it's the only one I can see right now. The issue is as follows. There is a "special" implementation of a CopyOnWriteArrayList in the util module. The standard java one does a copy of the underlying array for EVERY operation that changes the list. This guarantees that ongoing iterations will not be messed up by concurrent modifications to the list, but is very bad if you have many operations that change the list. The version in util only does a copy if there is an ongoing iteration on a particular underlying array. If no concurrent changes and iterations occur, this works at the speed of a normal synchronized list. If concurrent changes and iterations occur, there is a copy penalty for each iteration (but only once for each iteration). This requires the user code to notify the implementation when an iteration is done (release). The problem was with the way that the lock was implemented. It would be increased for every iteration, set to 0 for each mutation operation and decreased if > 0 for a release. 
That was broken; the following sequence could have occurred:

iteration1start - lock = 1, with array1
add - lock > 0, copy to array2, lock = 0
iteration2start - lock = 1, with array2
iteration1end - lock = 0
add - lock == 0, add to array2 -> ConcurrentModificationException on iteration2.

Though I don't see how the usage stats got to iterate twice at the same time through stuff.

Mihael

On Fri, 2011-08-05 at 22:02 -0700, Mihael Hategan wrote:
> Amazing how that bug in what would otherwise be a relatively simple
> class (CopyOnWriteArrayList) has managed to survive so long. Concurrency
> ain't easy!
>
> I'll have a fix committed after I do a bit of testing.

From wilde at mcs.anl.gov Sat Aug 6 07:38:45 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sat, 6 Aug 2011 07:38:45 -0500 (CDT)
Subject: [Swift-devel] Coaster test failed at 86K of 100K jobs
In-Reply-To: <1312610568.18332.12.camel@blabla>
Message-ID: <1251862264.194810.1312634325846.JavaMail.root@zimbra.anl.gov>

Mihael,

I rebuilt with that fix. Now I'm getting this error on runs as small as 1,000 jobs.

Logs are in: /home/wilde/swiftgrid/test.swift-workers/logs.06
The failing run was *8za.log.
I copied the sites etc. files there as well.
com$ ls -lt logs.06 total 1920 -rw-r--r-- 1 wilde ci-users 918172 Aug 6 07:34 swift.log -rw-r--r-- 1 wilde ci-users 526 Aug 6 07:34 start-grid-service.out -rw-r--r-- 1 wilde ci-users 11279 Aug 6 07:34 swift-workers.out -rw-r--r-- 1 wilde ci-users 69555 Aug 6 07:33 condor.log -rw-r--r-- 1 wilde ci-users 488616 Aug 6 07:28 catsn-20110806-0728-tpo2b8za.log drwxr-xr-x 2 wilde ci-users 9 Aug 6 07:28 catsn-20110806-0728-tpo2b8za.d/ -rw-r--r-- 1 wilde ci-users 136 Aug 6 07:28 catsn-20110806-0728-tpo2b8za.0.rlog -rw-r--r-- 1 wilde ci-users 200148 Aug 6 07:28 catsn-20110806-0728-8lecscl7.log drwxr-xr-x 2 wilde ci-users 102 Aug 6 07:28 catsn-20110806-0728-8lecscl7.d/ -rw-r--r-- 1 wilde ci-users 23388 Aug 6 07:28 catsn-20110806-0728-jvvxoqdg.log drwxr-xr-x 2 wilde ci-users 12 Aug 6 07:28 catsn-20110806-0728-jvvxoqdg.d/ -rw-r--r-- 1 wilde ci-users 5940 Aug 6 07:28 catsn-20110806-0728-lge9pvy3.log drwxr-xr-x 2 wilde ci-users 3 Aug 6 07:28 catsn-20110806-0728-lge9pvy3.d/ com$ 2011-08-06 07:28:46,432-0500 DEBUG vdl:execute2 JOB_START jobid=cat-j2tn42ek tr=cat arguments=[data.txt] tmpdir=catsn-20110806-0728-\ tpo2b8za/jobs/j/cat-j2tn42ek host=localhost 2011-08-06 07:28:46,432-0500 DEBUG VDL2ExecutionContext org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not convert v\ alue to boolean: null org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not convert value to boolean: null Caused by: org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not convert value to boolean: null at org.globus.cog.karajan.util.TypeUtil.toBoolean(TypeUtil.java:131) at org.griphyn.vdl.karajan.lib.Mark.function(Mark.java:30) at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:62) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) at 
org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:29) at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:20) at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:139) at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:197) at org.globus.cog.karajan.workflow.FlowElementWrapper.start(FlowElementWrapper.java:227) at org.globus.cog.karajan.workflow.events.EventBus.start(EventBus.java:104) at org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:40) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) Caused by: org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not convert value to boolean: null at org.globus.cog.karajan.util.TypeUtil.toBoolean(TypeUtil.java:127) ... 20 more ----- Original Message ----- > From: "Mihael Hategan" > To: "Michael Wilde" > Cc: "Swift Devel" > Sent: Saturday, August 6, 2011 1:02:48 AM > Subject: Re: [Swift-devel] Coaster test failed at 86K of 100K jobs > Potential fix is in the 0.93 branch. > > I'm not entirely sure that this was the problem, but it's the only one > I > can see right now. > > The issue is as follows. There is a "special" implementation of a > CopyOnWriteArrayList in the util module. 
The standard java one does a > copy of the underlying array for EVERY operation that changes the > list. > This guarantees that ongoing iterations will not be messed up by > concurrent modifications to the list, but is very bad if you have many > operations that change the list. > > The version in util only does a copy if there is an ongoing iteration > on > a particular underlying array. If no concurrent changes and iterations > occur, this works at the speed of a normal synchronized list. If > concurrent changes and iterations occur, there is a copy penalty for > each iteration (but only once for each iteration). This requires the > user code to notify the implementation when an iteration is done > (release). > > The problem was with the way that the lock was implemented. It would > be > increased for every iteration, set to 0 for each mutation operation > and > decreased if > 0 for a release. That was broken, the following could > have occurred: > > iteration1start - lock = 1, with array1 > add - lock > 0, copy to array2, lock = 0 > iteration2start - lock = 1, with array2 > iteration1end - lock = 0 > add - lock == 0, add to array2 -> ConcurrentModificationException on > iteration2. > > Though I don't see how the usage stats got to iterate twice at the > same > time through stuff. > > Mihael > > > On Fri, 2011-08-05 at 22:02 -0700, Mihael Hategan wrote: > > Amazing how that bug in what would otherwise be a relatively simple > > class (CopyOnWriteArrayList) has managed to survive so long. > > Concurrency > > ain't easy! > > > > I'll have a fix committed after I do a bit of testing. 
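For illustration, the interleaving described in the quoted message above can be replayed in a single thread. This is only a minimal sketch, not the actual class from the util module: BrokenLazyCopyList, its method names, and LockRaceSketch are made up here, and the sketch assumes the lock counter behaves exactly as described (increased per iteration, set to 0 by a mutation, decreased if > 0 on release).

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.List;

public class LockRaceSketch {

    // Hypothetical stand-in for the lazy copy-on-write list; NOT the real util class.
    static class BrokenLazyCopyList {
        private List<String> current = new ArrayList<String>();
        private int lock = 0; // one shared counter, which is the flaw being illustrated

        Iterator<String> startIteration() {
            lock++;                      // "increased for every iteration"
            return current.iterator();   // iterates over whatever array is current
        }

        void release() {
            if (lock > 0) {
                lock--;                  // "decreased if > 0 for a release"
            }
        }

        void add(String s) {
            if (lock > 0) {
                // An iteration is believed to be in progress: copy first.
                current = new ArrayList<String>(current);
                lock = 0;                // "set to 0 for each mutation operation"
            }
            current.add(s);              // otherwise mutate the current array in place
        }
    }

    // Replays the five steps from the message; returns true if iteration2 fails.
    static boolean replayFailingSequence() {
        BrokenLazyCopyList list = new BrokenLazyCopyList();
        list.add("a");
        Iterator<String> it1 = list.startIteration(); // iteration1start: lock = 1, array1
        list.add("b");                                // add: copy to array2, lock = 0
        Iterator<String> it2 = list.startIteration(); // iteration2start: lock = 1, array2
        it1.next();                                   // iteration1 is unaffected (old array)
        list.release();                               // iteration1end: lock = 0
        list.add("c");                                // add: lock == 0, in-place add to array2
        try {
            it2.next();                               // iteration2 sees the in-place change
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("iteration2 failed: " + replayFailingSequence());
    }
}
```

The release of iteration1 is counted against the array that iteration2 is using, so the second add wrongly concludes that nothing is iterating. Tracking the count per underlying array, rather than in one shared counter, would be one way to avoid this; whether that is what the committed fix does is not stated in the thread.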
-- Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From wilde at mcs.anl.gov Sat Aug 6 13:34:45 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sat, 6 Aug 2011 13:34:45 -0500 (CDT)
Subject: [Swift-devel] 100K job script hangs at 30K jobs
Message-ID: <1296444624.194996.1312655685765.JavaMail.root@zimbra.anl.gov>

Mihael,

A later catsn test, started this morning, hung at 30K of 100K catsn jobs.

Swift was still printing progress but not progressing beyond:

Progress: time: Sat, 06 Aug 2011 13:29:08 -0500 Selecting site:1014 Submitted:10 Finished successfully:30329

I had stopped it earlier in the morning, then resumed it to get a jstack.

Logs and stack traces of both the swift and coaster service JVMs are in:
/home/wilde/swiftgrid/test.swift-workers/logs.07

- Mike

From hategan at mcs.anl.gov Sat Aug 6 21:29:48 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 06 Aug 2011 19:29:48 -0700
Subject: [Swift-devel] 100K job script hangs at 30K jobs
In-Reply-To: <1296444624.194996.1312655685765.JavaMail.root@zimbra.anl.gov>
References: <1296444624.194996.1312655685765.JavaMail.root@zimbra.anl.gov>
Message-ID: <1312684188.29942.2.camel@blabla>

So this problem was caused by dying workers combined with the system not noticing that they had died, so zombie jobs would slowly fill the throttle (which was set to 10 in this case). I backported the dead worker detection code from trunk. Combined with retries, this should take care of the problem, but it may be worth looking into why the workers were dying.

On Sat, 2011-08-06 at 13:34 -0500, Michael Wilde wrote: > Mihael, > > A later catsn test, started this morning, hung at 30K of 100K catsn jobs.
> > Swift was still printing progress but not progressing beyond: > > Progress: time: Sat, 06 Aug 2011 13:29:08 -0500 Selecting site:1014 Submitted:10 Finished successfully:30329 > > I had stopped it earlier in the morning, then resumed it to get a jstack. > > Logs and stack traces of both the swift and coaster service JVMs are in: > /home/wilde/swiftgrid/test.swift-workers/logs.07 > > - Mike From davidk at ci.uchicago.edu Sun Aug 7 00:54:59 2011 From: davidk at ci.uchicago.edu (David Kelly) Date: Sun, 7 Aug 2011 00:54:59 -0500 (CDT) Subject: [Swift-devel] Can't build latest 0.93 Message-ID: <1601513232.57066.1312696499687.JavaMail.root@zimbra-mb2.anl.gov> I am getting this error while running ant dist on 0.93: package.list: [echo] [provider-coaster]: PACKAGE LIST [java] Missing package: backport-util-concurrent.jar BUILD FAILED /home/david/cog.093/modules/swift/build.xml:73: The following error occurred while executing this line: /home/david/cog.093/mbuild.xml:445: The following error occurred while executing this line: /home/david/cog.093/mbuild.xml:79: The following error occurred while executing this line: /home/david/cog.093/mbuild.xml:52: The following error occurred while executing this line: /home/david/cog.093/modules/swift/dependencies.xml:13: The following error occurred while executing this line: /home/david/cog.093/mbuild.xml:163: The following error occurred while executing this line: /home/david/cog.093/mbuild.xml:168: The following error occurred while executing this line: /home/david/cog.093/modules/provider-coaster/build.xml:60: The following error occurred while executing this line: /home/david/cog.093/modules/provider-coaster/build.xml:168: Java returned: 1 From hategan at mcs.anl.gov Sun Aug 7 01:57:40 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 06 Aug 2011 23:57:40 -0700 Subject: [Swift-devel] Can't build latest 0.93 In-Reply-To: <1601513232.57066.1312696499687.JavaMail.root@zimbra-mb2.anl.gov> References: 
<1601513232.57066.1312696499687.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <1312700260.12834.0.camel@blabla> Try now (cog r3215) On Sun, 2011-08-07 at 00:54 -0500, David Kelly wrote: > I am getting this error while running ant dist on 0.93: > > package.list: > [echo] [provider-coaster]: PACKAGE LIST > [java] Missing package: backport-util-concurrent.jar > > BUILD FAILED > /home/david/cog.093/modules/swift/build.xml:73: The following error occurred while executing this line: > /home/david/cog.093/mbuild.xml:445: The following error occurred while executing this line: > /home/david/cog.093/mbuild.xml:79: The following error occurred while executing this line: > /home/david/cog.093/mbuild.xml:52: The following error occurred while executing this line: > /home/david/cog.093/modules/swift/dependencies.xml:13: The following error occurred while executing this line: > /home/david/cog.093/mbuild.xml:163: The following error occurred while executing this line: > /home/david/cog.093/mbuild.xml:168: The following error occurred while executing this line: > /home/david/cog.093/modules/provider-coaster/build.xml:60: The following error occurred while executing this line: > /home/david/cog.093/modules/provider-coaster/build.xml:168: Java returned: 1 > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Sun Aug 7 22:59:27 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 7 Aug 2011 22:59:27 -0500 (CDT) Subject: [Swift-devel] [Bug 359] Add ability to set ENV vars, maxwalltime, and RAM requirements on app invocation In-Reply-To: <20110808034743.1D613563AA@wind.mcs.anl.gov> Message-ID: <410964285.195935.1312775967623.JavaMail.root@zimbra.anl.gov> I'd rather see features that were slotted for a release get moved to the next release if they dont fit, rather than put back into a "floating" state. 
This feature was slotted for 0.93 before 0.93 was sealed, so it should move to 0.94 for consideration. - Mike ----- Original Message ----- > From: bugzilla-daemon at mcs.anl.gov > To: wilde at mcs.anl.gov > Sent: Sunday, August 7, 2011 10:47:43 PM > Subject: [Bug 359] Add ability to set ENV vars, maxwalltime, and RAM requirements on app invocation > https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=359 > > > Mihael Hategan changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > CC| |hategan at mcs.anl.gov > Target Milestone|v0.93 |UNDEFINED > > > > > --- Comment #2 from Mihael Hategan 2011-08-07 > 22:47:42 --- > No new features in 0.93 at this point. Removing milestone. > > -- > Configure bugmail: > https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email > ------- You are receiving this mail because: ------- > You reported the bug. -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Sun Aug 7 23:05:43 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 07 Aug 2011 21:05:43 -0700 Subject: [Swift-devel] [Bug 359] Add ability to set ENV vars, maxwalltime, and RAM requirements on app invocation In-Reply-To: <410964285.195935.1312775967623.JavaMail.root@zimbra.anl.gov> References: <410964285.195935.1312775967623.JavaMail.root@zimbra.anl.gov> Message-ID: <1312776343.7851.2.camel@blabla> My instinct was not to assign things that weren't debated on the mailing list to any particular release. But please re-target as needed. Mihael On Sun, 2011-08-07 at 22:59 -0500, Michael Wilde wrote: > I'd rather see features that were slotted for a release get moved to the next release if they dont fit, rather than put back into a "floating" state. > > This feature was slotted for 0.93 before 0.93 was sealed, so it should move to 0.94 for consideration. 
> > - Mike > > > ----- Original Message ----- > > From: bugzilla-daemon at mcs.anl.gov > > To: wilde at mcs.anl.gov > > Sent: Sunday, August 7, 2011 10:47:43 PM > > Subject: [Bug 359] Add ability to set ENV vars, maxwalltime, and RAM requirements on app invocation > > https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=359 > > > > > > Mihael Hategan changed: > > > > What |Removed |Added > > ---------------------------------------------------------------------------- > > CC| |hategan at mcs.anl.gov > > Target Milestone|v0.93 |UNDEFINED > > > > > > > > > > --- Comment #2 from Mihael Hategan 2011-08-07 > > 22:47:42 --- > > No new features in 0.93 at this point. Removing milestone. > > > > -- > > Configure bugmail: > > https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email > > ------- You are receiving this mail because: ------- > > You reported the bug. > From wilde at mcs.anl.gov Mon Aug 8 16:29:02 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 8 Aug 2011 16:29:02 -0500 (CDT) Subject: [Swift-devel] New 0.93 problem: .error No such file or directory In-Reply-To: <1312747967.22082.1.camel@blabla> Message-ID: <388085554.202204.1312838942710.JavaMail.root@zimbra.anl.gov> Mihael, I ran one test to 100K jobs - ran fine. Second test failed after ~15K jobs with the following error: catsn-20110808-1558-6tm450a1.d/cat-ze1806ek.error (No such file or directory) (partial traceback below). Is this related to your change on handling of the status file? I was seeing the same error on sporadic, shorter tests last night but did not yet have a chance to investigate. 
The full log for this error is catsn-20110808-1558-6tm450a1.log in /home/wilde/swiftgrid/test.swift-workers/logs.10 - Mike 2011-08-08 16:01:27,952-0500 INFO GridExec TASK_DEFINITION: Task(type=JOB_SUBMISSION, identity=urn:0-3-14624-1-1-1312837151244) is \ /bin/bash _swiftwrap.staging -e /bin/cat -out outdir/f.14625.out -err stderr.txt -i -d outdir -if data.txt -of outdir/f.14625.out -k\ -cdmfile -status provider -a data.txt 2011-08-08 16:01:27,960-0500 INFO ExecutionContext Detailed exception: Exception in cat: Arguments: [data.txt] Host: localhost Directory: catsn-20110808-1558-6tm450a1/jobs/z/cat-ze1806ek - - - Caused by: /autonfs/home/wilde/swiftgrid/test.swift-workers/./catsn-20110808-1558-6tm450a1.d/cat-ze1806ek.error (No such file or dir\ ectory) at org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) ----- Original Message ----- > From: "Mihael Hategan" > To: "Michael Wilde" > Sent: Sunday, August 7, 2011 3:12:47 PM > Subject: Re: 100K job script hangs at 30K jobs > Ok. I ran 65k jobs with a script that randomly killed and added > workers. > It finished fine, but it needs testing on more environments. > > On Sun, 2011-08-07 at 09:39 -0500, Michael Wilde wrote: > > I'll try to trap that next chance I get, and try to ship back worker > > logs. 
> > > > ----- Original Message ----- > > > From: "Mihael Hategan" > > > To: "Michael Wilde" > > > Cc: "Swift Devel" > > > Sent: Saturday, August 6, 2011 9:29:48 PM > > > Subject: Re: 100K job script hangs at 30K jobs > > > So this problem was the problem of dying workers combined with the > > > system not noticing it and so zombie jobs would slowly fill the > > > throttle > > > (which was set to 10 in this case). I backported the dead worker > > > detection code from trunk. Combined with retries, this should take > > > care > > > of the problem, but it may be worth looking into why the workers > > > were > > > dying. > > > > > > On Sat, 2011-08-06 at 13:34 -0500, Michael Wilde wrote: > > > > Mihael, > > > > > > > > A later catsn test, started this morning, hung at 30K or 100K > > > > catsn > > > > jobs. > > > > > > > > Swift was still printing progress but not progressing beyond: > > > > > > > > Progress: time: Sat, 06 Aug 2011 13:29:08 -0500 Selecting > > > > site:1014 > > > > Submitted:10 Finished successfully:30329 > > > > > > > > I had stopped it earlier in the morning, then resumed it to get > > > > a > > > > jstack. > > > > > > > > Logs and stack traces of both the swift and coaster service JVMs > > > > are > > > > in: > > > > /home/wilde/swiftgrid/test.swift-workers/logs.07 > > > > > > > > - Mike > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From alberto_chavez at live.com Mon Aug 8 17:14:16 2011 From: alberto_chavez at live.com (Alberto Chavez) Date: Mon, 8 Aug 2011 17:14:16 -0500 Subject: [Swift-devel] ssh-pbs-coasters test case on PADS. 
Message-ID:

Hello,

I am going through the test cases for different providers in the test suite directory. I am manually running the ssh-pbs-coasters test case with the following command:

swift 001-catsn-ssh-pbs-coasters.swift -tc.file tc.template.data -sites.file sites.template.xml

I am getting the following output:

Swift svn swift-r4861 (swift modified locally) cog-r3183

RunID: 20110808-1703-mz2tcfha
Progress: time: Mon, 08 Aug 2011 17:03:49 -0500
Progress: time: Mon, 08 Aug 2011 17:03:55 -0500 Selecting site:8 Initializing site shared directory:1 Stage in:1
Progress: time: Mon, 08 Aug 2011 17:03:59 -0500 Submitted:1 Failed but can retry:9
Failed to transfer wrapper log for job cat-2jvs26ek
Progress: time: Mon, 08 Aug 2011 17:04:02 -0500 Stage in:1 Failed but can retry:9
Failed to transfer wrapper log for job cat-uivs26ek
Failed to transfer wrapper log for job cat-xivs26ek
Failed to transfer wrapper log for job cat-zivs26ek
Failed to transfer wrapper log for job cat-4jvs26ek
Failed to transfer wrapper log for job cat-0jvs26ek
Failed to transfer wrapper log for job cat-yivs26ek
Progress: time: Mon, 08 Aug 2011 17:04:03 -0500 Stage in:1 Submitting:1 Failed but can retry:8
Failed to transfer wrapper log for job cat-vivs26ek
Failed to transfer wrapper log for job cat-1jvs26ek
Failed to transfer wrapper log for job cat-3jvs26ek
Progress: time: Mon, 08 Aug 2011 17:04:04 -0500 Stage in:1 Submitting:1 Failed but can retry:8
Progress: time: Mon, 08 Aug 2011 17:04:07 -0500 Submitting:1 Submitted:1 Failed but can retry:8
Failed to transfer wrapper log for job cat-6jvs26ek
Progress: time: Mon, 08 Aug 2011 17:04:09 -0500 Stage in:1 Failed but can retry:9
Failed to transfer wrapper log for job cat-8jvs26ek
Failed to transfer wrapper log for job cat-ajvs26ek
Failed to transfer wrapper log for job cat-cjvs26ek
Failed to transfer wrapper log for job cat-ejvs26ek
Failed to transfer wrapper log for job cat-gjvs26ek
Failed to transfer wrapper log for job cat-ijvs26ek
Progress: time: Mon, 08 Aug 2011 17:04:10 -0500 Stage in:1 Submitting:1 Failed but can retry:8
Failed to transfer wrapper log for job cat-kjvs26ek
Failed to transfer wrapper log for job cat-mjvs26ek
Progress: time: Mon, 08 Aug 2011 17:04:11 -0500 Stage in:1 Failed but can retry:9
Failed to transfer wrapper log for job cat-ojvs26ek
Progress: time: Mon, 08 Aug 2011 17:04:16 -0500 Submitting:1 Submitted:1 Failed but can retry:8
Failed to transfer wrapper log for job cat-qjvs26ek
Progress: time: Mon, 08 Aug 2011 17:04:17 -0500 Failed:1 Failed but can retry:9

These are the contents of the sites.template.xml file:

3000 8 1 1 10 short 0.5 10000 /home/achavez/swiftwork

And this is the SwiftScript that I am trying to run:

type file;

app (file o) cat (file i) {
  cat @i stdout=@o;
}

string t = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
string char[] = @strsplit(t, "");

file out[];
foreach j in [1:@toint(@arg("n","10"))] {
  file data<"data.txt">;
  out[j] = cat(data);
}

I am pretty sure the test is failing, and I guess that something is wrong on my side. I just don't know what it is, so any help figuring out what I'm doing wrong will be greatly appreciated.

Every time I run the test, a dialog box pops up and asks me for my username to log in on PADS, and then it asks for my password. Then it shows the messages:

Failed to transfer wrapper log for job XXXXX

and then asks three more times for my username and password.

Thank you,
Alberto.
-------------- next part -------------- An HTML attachment was scrubbed...
URL: From hategan at mcs.anl.gov Mon Aug 8 17:23:15 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 08 Aug 2011 15:23:15 -0700 Subject: [Swift-devel] New 0.93 problem: .error No such file or directory In-Reply-To: <388085554.202204.1312838942710.JavaMail.root@zimbra.anl.gov> References: <388085554.202204.1312838942710.JavaMail.root@zimbra.anl.gov> Message-ID: <1312842195.13185.1.camel@blabla> On Mon, 2011-08-08 at 16:29 -0500, Michael Wilde wrote: > catsn-20110808-1558-6tm450a1.d/cat-ze1806ek.error (No such file or directory) > (partial traceback below). > > Is this related to your change on handling of the status file? Yes, but I thought I fixed it. Make sure you have at least swift r4963. > > I was seeing the same error on sporadic, shorter tests last night but did not yet have a chance to investigate. > > The full log for this error is catsn-20110808-1558-6tm450a1.log in > /home/wilde/swiftgrid/test.swift-workers/logs.10 > > - Mike > > > 2011-08-08 16:01:27,952-0500 INFO GridExec TASK_DEFINITION: Task(type=JOB_SUBMISSION, identity=urn:0-3-14624-1-1-1312837151244) is \ > /bin/bash _swiftwrap.staging -e /bin/cat -out outdir/f.14625.out -err stderr.txt -i -d outdir -if data.txt -of outdir/f.14625.out -k\ > -cdmfile -status provider -a data.txt > 2011-08-08 16:01:27,960-0500 INFO ExecutionContext Detailed exception: > Exception in cat: > Arguments: [data.txt] > Host: localhost > Directory: catsn-20110808-1558-6tm450a1/jobs/z/cat-ze1806ek > - - - > > Caused by: /autonfs/home/wilde/swiftgrid/test.swift-workers/./catsn-20110808-1558-6tm450a1.d/cat-ze1806ek.error (No such file or dir\ > ectory) > > at org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) > at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27) > at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) > at 
org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) > at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) > > > ----- Original Message ----- > > From: "Mihael Hategan" > > To: "Michael Wilde" > > Sent: Sunday, August 7, 2011 3:12:47 PM > > Subject: Re: 100K job script hangs at 30K jobs > > Ok. I ran 65k jobs with a script that randomly killed and added > > workers. > > It finished fine, but it needs testing on more environments. > > > > On Sun, 2011-08-07 at 09:39 -0500, Michael Wilde wrote: > > > I'll try to trap that next chance I get, and try to ship back worker > > > logs. > > > > > > ----- Original Message ----- > > > > From: "Mihael Hategan" > > > > To: "Michael Wilde" > > > > Cc: "Swift Devel" > > > > Sent: Saturday, August 6, 2011 9:29:48 PM > > > > Subject: Re: 100K job script hangs at 30K jobs > > > > So this problem was the problem of dying workers combined with the > > > > system not noticing it and so zombie jobs would slowly fill the > > > > throttle > > > > (which was set to 10 in this case). I backported the dead worker > > > > detection code from trunk. Combined with retries, this should take > > > > care > > > > of the problem, but it may be worth looking into why the workers > > > > were > > > > dying. > > > > > > > > On Sat, 2011-08-06 at 13:34 -0500, Michael Wilde wrote: > > > > > Mihael, > > > > > > > > > > A later catsn test, started this morning, hung at 30K or 100K > > > > > catsn > > > > > jobs. > > > > > > > > > > Swift was still printing progress but not progressing beyond: > > > > > > > > > > Progress: time: Sat, 06 Aug 2011 13:29:08 -0500 Selecting > > > > > site:1014 > > > > > Submitted:10 Finished successfully:30329 > > > > > > > > > > I had stopped it earlier in the morning, then resumed it to get > > > > > a > > > > > jstack. 
> > > > >
> > > > > Logs and stack traces of both the swift and coaster service JVMs
> > > > > are
> > > > > in:
> > > > > /home/wilde/swiftgrid/test.swift-workers/logs.07
> > > > >
> > > > > - Mike
> > > >

From ketancmaheshwari at gmail.com Mon Aug 8 19:08:53 2011
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Mon, 8 Aug 2011 19:08:53 -0500
Subject: [Swift-devel] ssh-pbs-coasters test case on PADS.
In-Reply-To:
References:
Message-ID:

Alberto,

Create an auth.defaults file in your ~/.ssh directory.

Add contents of the following form:

bridled.ci.uchicago.edu.type=key
bridled.ci.uchicago.edu.username=urusername
bridled.ci.uchicago.edu.key=/path/to/your/id_rsa
bridled.ci.uchicago.edu.passphrase=yourpassphrase

The permissions on this file should be 600.

The example above is for bridled. You will need to add equivalent
entries for each machine you are connecting from and to.

About the wrapper log transfer issue: do you have provider staging on in
your config?

--
Ketan

On Mon, Aug 8, 2011 at 5:14 PM, Alberto Chavez wrote:
> Hello,
>
> I am going through the test cases for different providers in the test suite
> directory,
> I am manually running ssh-pbs-coasters test case with the following
> command:
>
> swift 001-catsn-ssh-pbs-coasters.swift -tc.file tc.template.data
> -sites.file sites.template.xml
>
> I am getting the following output:
>
> Swift svn swift-r4861 (swift modified locally) cog-r3183
>
> RunID: 20110808-1703-mz2tcfha
> Progress: time: Mon, 08 Aug 2011 17:03:49 -0500
> Progress: time: Mon, 08 Aug 2011 17:03:55 -0500 Selecting site:8
> Initializing site shared directory:1 Stage in:1
> Progress: time: Mon, 08 Aug 2011 17:03:59 -0500 Submitted:1 Failed but
> can retry:9
> Failed to transfer wrapper log for job cat-2jvs26ek
> Progress: time: Mon, 08 Aug 2011 17:04:02 -0500 Stage in:1 Failed but can
> retry:9
> Failed to transfer wrapper log for job cat-uivs26ek
> Failed to transfer wrapper log for job cat-xivs26ek
> Failed to transfer wrapper log for job cat-zivs26ek
>
Failed to transfer wrapper log for job cat-4jvs26ek > Failed to transfer wrapper log for job cat-0jvs26ek > Failed to transfer wrapper log for job cat-yivs26ek > Progress: time: Mon, 08 Aug 2011 17:04:03 -0500 Stage in:1 Submitting:1 > Failed but can retry:8 > Failed to transfer wrapper log for job cat-vivs26ek > Failed to transfer wrapper log for job cat-1jvs26ek > Failed to transfer wrapper log for job cat-3jvs26ek > Progress: time: Mon, 08 Aug 2011 17:04:04 -0500 Stage in:1 Submitting:1 > Failed but can retry:8 > Progress: time: Mon, 08 Aug 2011 17:04:07 -0500 Submitting:1 Submitted:1 > Failed but can retry:8 > Failed to transfer wrapper log for job cat-6jvs26ek > Progress: time: Mon, 08 Aug 2011 17:04:09 -0500 Stage in:1 Failed but can > retry:9 > Failed to transfer wrapper log for job cat-8jvs26ek > Failed to transfer wrapper log for job cat-ajvs26ek > Failed to transfer wrapper log for job cat-cjvs26ek > Failed to transfer wrapper log for job cat-ejvs26ek > Failed to transfer wrapper log for job cat-gjvs26ek > Failed to transfer wrapper log for job cat-ijvs26ek > Progress: time: Mon, 08 Aug 2011 17:04:10 -0500 Stage in:1 Submitting:1 > Failed but can retry:8 > Failed to transfer wrapper log for job cat-kjvs26ek > Failed to transfer wrapper log for job cat-mjvs26ek > Progress: time: Mon, 08 Aug 2011 17:04:11 -0500 Stage in:1 Failed but can > retry:9 > Failed to transfer wrapper log for job cat-ojvs26ek > Progress: time: Mon, 08 Aug 2011 17:04:16 -0500 Submitting:1 Submitted:1 > Failed but can retry:8 > Failed to transfer wrapper log for job cat-qjvs26ek > Progress: time: Mon, 08 Aug 2011 17:04:17 -0500 Failed:1 Failed but can > retry:9 > > these are the contents of sites.template.xml file: > > > > > > 3000 > 8 > 1 > 1 > 10 > short > 0.5 > 10000 > /home/achavez/swiftwork > > > > and this is the swiftscript that I am trying to run: > > type file; > > app (file o) cat (file i) > { > cat @i stdout=@o; > } > > string t = > 
"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"; > string char[] = @strsplit(t, ""); > > file out[]; > foreach j in [1:@toint(@arg("n","10"))] { > file data<"data.txt">; > out[j] = cat(data); > } > > > I am pretty sure the test is failing, and I guess that it's something wrong > on my side, I just don't know what that is, so any help figuring out what > I'm doing wrong will be strongly appreciated. > > Everytime I run the test, a dialog box pops up and asks me for my username > to login on pads, and then it asks for my password, > then it shows the messages: > Failed to transfer wrapper log for job XXXXX > and then asks three more times for my username and password. > > Thank you, > > Alberto. > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Mon Aug 8 19:31:52 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 08 Aug 2011 17:31:52 -0700 Subject: [Swift-devel] ssh-pbs-coasters test case on PADS. In-Reply-To: References: Message-ID: <1312849912.14688.1.camel@blabla> On Mon, 2011-08-08 at 19:08 -0500, Ketan Maheshwari wrote: > Alberto, > > > Create an auth.defaults file in your ~/.ssh directory. > > > Add contents of the following form: > > > bridled.ci.uchicago.edu.type=key > bridled.ci.uchicago.edu.username=urusername > bridled.ci.uchicago.edu.key=/path/to/your/id_rsa > bridled.ci.uchicago.edu.passphrase=yourpassphrase RIght. If you feel that having your passphrase there is not ok, you can omit it, but you will get a prompt for it. I think you should only get the prompt once, but I've seen it pop twice in the same run, so I'm going to check that. 
From wilde at mcs.anl.gov Mon Aug 8 20:39:34 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 8 Aug 2011 20:39:34 -0500 (CDT)
Subject: [Swift-devel] New 0.93 problem: .error No such file or directory
In-Reply-To: <1312842195.13185.1.camel@blabla>
Message-ID: <298081939.202648.1312853974463.JavaMail.root@zimbra.anl.gov>

I'm now running Swift svn swift-r4965 cog-r3225

A 100K-catsn script ran to completion.

Then a 500K-catsn script terminated at ~15K jobs with the error below.

Logs are in /home/wilde/swiftgrid/test.swift-workers
Failing run was *pe.log

- Mike

2011-08-08 18:37:59,452-0500 DEBUG vdl:execute2 THREAD_ASSOCIATION jobid=cat-1fkb66ek thread=0-3-29294-1-1 host=localhost replicationGroup=8shb66ek
2011-08-08 18:37:59,452-0500 DEBUG vdl:execute2 THREAD_ASSOCIATION jobid=cat-2fkb66ek thread=0-3-29296-1-1 host=localhost replicationGroup=9shb66ek
2011-08-08 18:37:59,452-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=cat-eakb66ek - Application exception: Task failed: Connection to worker lost
java.net.SocketException: Connection reset
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:124)
    at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.send(AbstractStreamKarajanChannel.java:305)
    at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:251)

2011-08-08 18:37:59,452-0500 INFO GridExec TASK_DEFINITION: Task(type=JOB_SUBMISSION, identity=urn:0-3-29290-1-1-1312846318323) is /bin/bash _swiftwrap.staging -e /bin/cat -out outdir/f.29291.out -err stderr.txt -i -d outdir -if data.txt -of outdir/f.29291.out -k -cdmfile -status provider -a data.txt
2011-08-08 18:37:59,452-0500 INFO vdl:execute START thread=0-3-30899-1 tr=cat
2011-08-08 18:37:59,455-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=cat-oakb66ek - Application exception: Task failed: Connection
to worker lost java.net.SocketException: Broken pipe at java.net.SocketOutputStream.socketWrite0(Native Method) at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92) at java.net.SocketOutputStream.write(SocketOutputStream.java:124) at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.send(AbstractStreamKarajanChannel.j\ ava:305) at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.ja\ va:251) ----- Original Message ----- > From: "Mihael Hategan" > To: "Michael Wilde" > Cc: "Swift Devel" > Sent: Monday, August 8, 2011 5:23:15 PM > Subject: Re: New 0.93 problem: .error No such file or directory > On Mon, 2011-08-08 at 16:29 -0500, Michael Wilde wrote: > > catsn-20110808-1558-6tm450a1.d/cat-ze1806ek.error (No such file or > > directory) > > (partial traceback below). > > > > Is this related to your change on handling of the status file? > > Yes, but I thought I fixed it. Make sure you have at least swift > r4963. > > > > > I was seeing the same error on sporadic, shorter tests last night > > but did not yet have a chance to investigate. 
> > > > The full log for this error is catsn-20110808-1558-6tm450a1.log in > > /home/wilde/swiftgrid/test.swift-workers/logs.10 > > > > - Mike > > > > > > 2011-08-08 16:01:27,952-0500 INFO GridExec TASK_DEFINITION: > > Task(type=JOB_SUBMISSION, identity=urn:0-3-14624-1-1-1312837151244) > > is \ > > /bin/bash _swiftwrap.staging -e /bin/cat -out outdir/f.14625.out > > -err stderr.txt -i -d outdir -if data.txt -of outdir/f.14625.out -k\ > > -cdmfile -status provider -a data.txt > > 2011-08-08 16:01:27,960-0500 INFO ExecutionContext Detailed > > exception: > > Exception in cat: > > Arguments: [data.txt] > > Host: localhost > > Directory: catsn-20110808-1558-6tm450a1/jobs/z/cat-ze1806ek > > - - - > > > > Caused by: > > /autonfs/home/wilde/swiftgrid/test.swift-workers/./catsn-20110808-1558-6tm450a1.d/cat-ze1806ek.error > > (No such file or dir\ > > ectory) > > > > at > > org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) > > at > > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27) > > at > > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) > > at > > org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > > at > > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) > > > > > > ----- Original Message ----- > > > From: "Mihael Hategan" > > > To: "Michael Wilde" > > > Sent: Sunday, August 7, 2011 3:12:47 PM > > > Subject: Re: 100K job script hangs at 30K jobs > > > Ok. I ran 65k jobs with a script that randomly killed and added > > > workers. > > > It finished fine, but it needs testing on more environments. 
> > > > > > On Sun, 2011-08-07 at 09:39 -0500, Michael Wilde wrote: > > > > I'll try to trap that next chance I get, and try to ship back > > > > worker > > > > logs. > > > > > > > > ----- Original Message ----- > > > > > From: "Mihael Hategan" > > > > > To: "Michael Wilde" > > > > > Cc: "Swift Devel" > > > > > Sent: Saturday, August 6, 2011 9:29:48 PM > > > > > Subject: Re: 100K job script hangs at 30K jobs > > > > > So this problem was the problem of dying workers combined with > > > > > the > > > > > system not noticing it and so zombie jobs would slowly fill > > > > > the > > > > > throttle > > > > > (which was set to 10 in this case). I backported the dead > > > > > worker > > > > > detection code from trunk. Combined with retries, this should > > > > > take > > > > > care > > > > > of the problem, but it may be worth looking into why the > > > > > workers > > > > > were > > > > > dying. > > > > > > > > > > On Sat, 2011-08-06 at 13:34 -0500, Michael Wilde wrote: > > > > > > Mihael, > > > > > > > > > > > > A later catsn test, started this morning, hung at 30K or > > > > > > 100K > > > > > > catsn > > > > > > jobs. > > > > > > > > > > > > Swift was still printing progress but not progressing > > > > > > beyond: > > > > > > > > > > > > Progress: time: Sat, 06 Aug 2011 13:29:08 -0500 Selecting > > > > > > site:1014 > > > > > > Submitted:10 Finished successfully:30329 > > > > > > > > > > > > I had stopped it earlier in the morning, then resumed it to > > > > > > get > > > > > > a > > > > > > jstack. 
> > > > > > > > > > > > Logs and stack traces of both the swift and coaster service > > > > > > JVMs > > > > > > are > > > > > > in: > > > > > > /home/wilde/swiftgrid/test.swift-workers/logs.07 > > > > > > > > > > > > - Mike > > > > > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Mon Aug 8 20:58:24 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 08 Aug 2011 18:58:24 -0700 Subject: [Swift-devel] New 0.93 problem: .error No such file or directory In-Reply-To: <298081939.202648.1312853974463.JavaMail.root@zimbra.anl.gov> References: <298081939.202648.1312853974463.JavaMail.root@zimbra.anl.gov> Message-ID: <1312855112.15215.0.camel@blabla> On Mon, 2011-08-08 at 20:39 -0500, Michael Wilde wrote: > Im now running Swift svn swift-r4965 cog-r3225 > > A 100K-catsn script ran to completion. > > Then a 500K-catsn script terminated at ~ 15K jobs with the error below. > > Logs are in /home/wilde/swiftgrid/test.swift-workers Judging from the error message, your workers are dying for unknown reasons. I see only two applications that failed (and they have distinct arguments), so I'm guessing you turned off retries. At 2/15K failure probability, if you set retries to at least 1, you would get a dramatic decrease in the odds that the failure will happen twice for the same app. Do you know where swork:14 and swork:29 ran? (it may be useful to name workers based on their site). Also, if you want to troubleshoot the workers, worker logging may help. Mihael From ketancmaheshwari at gmail.com Mon Aug 8 21:44:18 2011 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Mon, 8 Aug 2011 21:44:18 -0500 Subject: [Swift-devel] int to string Message-ID: Hello, I was wondering if we can convert an int to string in Swift. I think @tostr method doesn't exist. Any clues? -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From hategan at mcs.anl.gov Mon Aug 8 21:52:54 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 08 Aug 2011 19:52:54 -0700
Subject: [Swift-devel] int to string
In-Reply-To:
References:
Message-ID: <1312858374.15649.1.camel@blabla>

On Mon, 2011-08-08 at 21:44 -0500, Ketan Maheshwari wrote:
> Hello,
>
>
> I was wondering if we can convert an int to string in Swift. I think
> @tostr method doesn't exist.

Strcat will do implicit conversion of its arguments to string. So
@strcat(2) should work.

Though we should have @tostr.

From wilde at mcs.anl.gov Mon Aug 8 21:54:35 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 8 Aug 2011 21:54:35 -0500 (CDT)
Subject: [Swift-devel] New 0.93 problem: .error No such file or directory
In-Reply-To: <1312855112.15215.0.camel@blabla>
Message-ID: <1602901548.202696.1312858475726.JavaMail.root@zimbra.anl.gov>

> Judging from the error message, your workers are dying for unknown
> reasons. I see only two applications that failed (and they have
> distinct
> arguments), so I'm guessing you turned off retries. At 2/15K failure
> probability, if you set retries to at least 1, you would get a
> dramatic
> decrease in the odds that the failure will happen twice for the same
> app.

Good idea, will do.

So I just realized what's happening here. Workers can fail (i.e. you
tested killing them, you said) and Swift will keep running, *but* the
apps that were running on failed workers receive failures and need to
get retried through normal retry, as if the apps themselves had failed,
correct? That just dawned on me.

> Do you know where swork:14 and swork:29 ran? (it may be useful to name
> workers based on their site).

Good idea, will do.

> Also, if you want to troubleshoot the workers, worker logging may
> help.

I have worker logging on; I'm not sure why I'm not (yet) getting the
logs back. My Condor jobs are coded to transfer the worker log back
after workers exit. I'll try to get these logs.
I saw two apps fail because the site didn't set OSG_WN_TMP (where I
place the logs). I thought that in those two cases the worker never
started, but perhaps those two failures are related to these two app
failures. More digging.

- Mike

>
> Mihael

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov Mon Aug 8 21:58:59 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 08 Aug 2011 19:58:59 -0700
Subject: [Swift-devel] New 0.93 problem: .error No such file or directory
In-Reply-To: <1602901548.202696.1312858475726.JavaMail.root@zimbra.anl.gov>
References: <1602901548.202696.1312858475726.JavaMail.root@zimbra.anl.gov>
Message-ID: <1312858739.15790.3.camel@blabla>

On Mon, 2011-08-08 at 21:54 -0500, Michael Wilde wrote:
> > Judging from the error message, your workers are dying for unknown
> > reasons. I see only two applications that failed (and they have
> > distinct
> > arguments), so I'm guessing you turned off retries. At 2/15K failure
> > probability, if you set retries to at least 1, you would get a
> > dramatic
> > decrease in the odds that the failure will happen twice for the same
> > app.
>
> Good idea, will do.
>
> So I just realized what's happening here. Workers can fail (i.e. you
> tested killing them, you said) and Swift will keep running, *but* the
> apps that were running on failed workers receive failures and need to
> get retried through normal retry, as if the apps themselves had
> failed, correct? That just dawned on me.

Yep.

[...]
>
> I saw two apps fail because the site didn't set OSG_WN_TMP (where I
> place the logs). I thought that in those two cases the worker never
> started, but perhaps those two failures are related to these two app
> failures.

In your case there is an actual TCP connection, so the workers must have
started.
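
To put rough numbers on the retry argument quoted above: the sketch below takes the 2/15K per-attempt failure rate and the 500K job count from this thread, and assumes (this is an assumption, not something measured) that each attempt fails independently. The function names are made up for illustration.

```python
def failure_after_retries(p_fail, retries):
    """Probability that one app fails on every attempt.

    An app gets 1 initial attempt plus `retries` further attempts.
    Assumes attempts fail independently with probability p_fail.
    """
    return p_fail ** (retries + 1)


def expected_failed_apps(n_apps, p_fail, retries):
    """Expected number of apps that exhaust all their attempts."""
    return n_apps * failure_after_retries(p_fail, retries)


if __name__ == "__main__":
    p = 2 / 15000  # two failures seen in roughly 15K jobs
    for retries in (0, 1, 2):
        print(retries, expected_failed_apps(500_000, p, retries))
```

Under those assumptions a 500K-job run would expect about 67 failed apps with no retries and fewer than 0.01 with a single retry. If worker deaths are correlated (for example, one bad site), the independence assumption breaks and the real improvement would be smaller.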
From ketancmaheshwari at gmail.com Mon Aug 8 22:04:25 2011 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Mon, 8 Aug 2011 22:04:25 -0500 Subject: [Swift-devel] Exception whilst logging dataset Message-ID: Hi, Testing 0.93 on Beagle, I am seeing this exception for modftdock script: Exception whilst logging dataset content for ?:string = 100 - Closed java.lang.NullPointerException at org.griphyn.vdl.mapping.RootDataNode.getMapper(RootDataNode.java:213) at org.griphyn.vdl.mapping.AbstractDataNode.logContent(AbstractDataNode.java:460) at org.griphyn.vdl.mapping.AbstractDataNode.closeShallow(AbstractDataNode.java:422) at org.griphyn.vdl.mapping.AbstractDataNode.setValue(AbstractDataNode.java:361) at org.griphyn.vdl.mapping.RootDataNode.setValue(RootDataNode.java:221) at org.griphyn.vdl.mapping.RootDataNode.newNode(RootDataNode.java:27) at org.griphyn.vdl.karajan.lib.swiftscript.FnArg.function(FnArg.java:71) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:29) at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:20) at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:139) at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:197) at org.globus.cog.karajan.workflow.FlowElementWrapper.start(FlowElementWrapper.java:227) at org.globus.cog.karajan.workflow.events.EventBus.start(EventBus.java:104) at 
org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:40) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:452) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:309) at java.util.concurrent.FutureTask.run(FutureTask.java:149) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:897) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:919) at java.lang.Thread.run(Thread.java:736) For all the variables in the script. The script runs to completion however. ftdock.swift attached. I tried to comment out the trace and converting the int variable "mod_index" to string but the exception persisted. -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ftdock.swift Type: application/octet-stream Size: 1459 bytes Desc: not available URL: From wilde at mcs.anl.gov Mon Aug 8 23:14:22 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 8 Aug 2011 23:14:22 -0500 (CDT) Subject: [Swift-devel] New 0.93 problem: .error No such file or directory In-Reply-To: <1312858739.15790.3.camel@blabla> Message-ID: <1913671608.202766.1312863262707.JavaMail.root@zimbra.anl.gov> OK, with retry on, the same run has now passed 250K jobs, and retried 2 failures successfully. Its running at about 100 jobs/sec to about 38 workers over 22 sites. Once this tests out I'll increase the number of workers. - Mike ----- Original Message ----- > From: "Mihael Hategan" > To: "Michael Wilde" > Cc: "Swift Devel" > Sent: Monday, August 8, 2011 9:58:59 PM > Subject: Re: New 0.93 problem: .error No such file or directory > On Mon, 2011-08-08 at 21:54 -0500, Michael Wilde wrote: > > > Judging from the error message, your workers are dying for unknown > > > reasons. 
I see only two applications that failed (and they have
> > > distinct
> > > arguments), so I'm guessing you turned off retries. At 2/15K
> > > failure
> > > probability, if you set retries to at least 1, you would get a
> > > dramatic
> > > decrease in the odds that the failure will happen twice for the
> > > same
> > > app.
> >
> > Good idea, will do.
> >
> > So I just realized what's happening here. Workers can fail (i.e. you
> > tested killing them, you said) and Swift will keep running, *but*
> > the
> > apps that were running on failed workers receive failures and need
> > to
> > get retried through normal retry, as if the apps themselves had
> > failed, correct? That just dawned on me.
>
> Yep.
>
> [...]
> >
> > I saw two apps fail because the site didn't set OSG_WN_TMP (where I
> > place the logs). I thought that in those two cases the worker never
> > started, but perhaps those two failures are related to these two app
> > failures.
>
> In your case there is an actual TCP connection, so the workers must
> have started.

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From jonmon at mcs.anl.gov Mon Aug 8 23:18:41 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Mon, 08 Aug 2011 23:18:41 -0500
Subject: [Swift-devel] int to string
Message-ID: <20110809041823.9B0CF124A1@zimbra.anl.gov>

I have been wanting this function in Swift. I had a need for it a while
back but came up with a workaround. I can't remember exactly what the
need was, though.

----- Reply message -----
From: "Mihael Hategan"
Date: Mon, Aug 8, 2011 9:52 pm
Subject: [Swift-devel] int to string
To: "Ketan Maheshwari"
Cc: "Swift Devel"

On Mon, 2011-08-08 at 21:44 -0500, Ketan Maheshwari wrote:
> Hello,
>
>
> I was wondering if we can convert an int to string in Swift. I think
> @tostr method doesn't exist.

Strcat will do implicit conversion of its arguments to string.
So @strcat(2) should work. Though we should have @tostr. _______________________________________________ Swift-devel mailing list Swift-devel at ci.uchicago.edu https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Tue Aug 9 01:23:35 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 08 Aug 2011 23:23:35 -0700 Subject: [Swift-devel] Exception whilst logging dataset In-Reply-To: References: Message-ID: <1312871015.17110.0.camel@blabla> Yep. I can reproduce this. In the mean time, if you need to run stuff, disable provenance logging in swift.properties. On Mon, 2011-08-08 at 22:04 -0500, Ketan Maheshwari wrote: > Hi, > > > Testing 0.93 on Beagle, I am seeing this exception for modftdock > script: > > > Exception whilst logging dataset content for ?:string = 100 - Closed > java.lang.NullPointerException > at > org.griphyn.vdl.mapping.RootDataNode.getMapper(RootDataNode.java:213) > at > org.griphyn.vdl.mapping.AbstractDataNode.logContent(AbstractDataNode.java:460) > at > org.griphyn.vdl.mapping.AbstractDataNode.closeShallow(AbstractDataNode.java:422) > at > org.griphyn.vdl.mapping.AbstractDataNode.setValue(AbstractDataNode.java:361) > at > org.griphyn.vdl.mapping.RootDataNode.setValue(RootDataNode.java:221) > at org.griphyn.vdl.mapping.RootDataNode.newNode(RootDataNode.java:27) > at > org.griphyn.vdl.karajan.lib.swiftscript.FnArg.function(FnArg.java:71) > at > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27) > at > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) > at > org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > at > 
org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) > at > org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:29) > at > org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:20) > at > org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:139) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:197) > at > org.globus.cog.karajan.workflow.FlowElementWrapper.start(FlowElementWrapper.java:227) > at > org.globus.cog.karajan.workflow.events.EventBus.start(EventBus.java:104) > at > org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:40) > at java.util.concurrent.Executors > $RunnableAdapter.call(Executors.java:452) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:309) > at java.util.concurrent.FutureTask.run(FutureTask.java:149) > at java.util.concurrent.ThreadPoolExecutor > $Worker.runTask(ThreadPoolExecutor.java:897) > at java.util.concurrent.ThreadPoolExecutor > $Worker.run(ThreadPoolExecutor.java:919) > at java.lang.Thread.run(Thread.java:736) > > > For all the variables in the script. > > > The script runs to completion however. > > > ftdock.swift attached. > > > I tried to comment out the trace and converting the int variable > "mod_index" to string but the exception persisted. 
> > -- > Ketan > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Tue Aug 9 04:34:13 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 09 Aug 2011 02:34:13 -0700 Subject: [Swift-devel] Exception whilst logging dataset In-Reply-To: <1312871015.17110.0.camel@blabla> References: <1312871015.17110.0.camel@blabla> Message-ID: <1312882453.1194.0.camel@blabla> Fixed in swift r4966. On Mon, 2011-08-08 at 23:23 -0700, Mihael Hategan wrote: > Yep. I can reproduce this. > > In the mean time, if you need to run stuff, disable provenance logging > in swift.properties. > > On Mon, 2011-08-08 at 22:04 -0500, Ketan Maheshwari wrote: > > Hi, > > > > > > Testing 0.93 on Beagle, I am seeing this exception for modftdock > > script: > > > > > > Exception whilst logging dataset content for ?:string = 100 - Closed > > java.lang.NullPointerException > > at > > org.griphyn.vdl.mapping.RootDataNode.getMapper(RootDataNode.java:213) > > at > > org.griphyn.vdl.mapping.AbstractDataNode.logContent(AbstractDataNode.java:460) > > at > > org.griphyn.vdl.mapping.AbstractDataNode.closeShallow(AbstractDataNode.java:422) > > at > > org.griphyn.vdl.mapping.AbstractDataNode.setValue(AbstractDataNode.java:361) > > at > > org.griphyn.vdl.mapping.RootDataNode.setValue(RootDataNode.java:221) > > at org.griphyn.vdl.mapping.RootDataNode.newNode(RootDataNode.java:27) > > at > > org.griphyn.vdl.karajan.lib.swiftscript.FnArg.function(FnArg.java:71) > > at > > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27) > > at > > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) > > at > > 
org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > > at > > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) > > at > > org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:29) > > at > > org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:20) > > at > > org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:139) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:197) > > at > > org.globus.cog.karajan.workflow.FlowElementWrapper.start(FlowElementWrapper.java:227) > > at > > org.globus.cog.karajan.workflow.events.EventBus.start(EventBus.java:104) > > at > > org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:40) > > at java.util.concurrent.Executors > > $RunnableAdapter.call(Executors.java:452) > > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:309) > > at java.util.concurrent.FutureTask.run(FutureTask.java:149) > > at java.util.concurrent.ThreadPoolExecutor > > $Worker.runTask(ThreadPoolExecutor.java:897) > > at java.util.concurrent.ThreadPoolExecutor > > $Worker.run(ThreadPoolExecutor.java:919) > > at java.lang.Thread.run(Thread.java:736) > > > > > > For all the variables in the script. > > > > > > The script runs to completion however. > > > > > > ftdock.swift attached. > > > > > > I tried to comment out the trace and converting the int variable > > "mod_index" to string but the exception persisted. 
> >
> > --
> > Ketan

From wilde at mcs.anl.gov Tue Aug 9 07:16:39 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 9 Aug 2011 07:16:39 -0500 (CDT)
Subject: [Swift-devel] New 0.93 problem: .error No such file or directory
In-Reply-To: <1913671608.202766.1312863262707.JavaMail.root@zimbra.anl.gov>
Message-ID: <1979181272.203035.1312892199200.JavaMail.root@zimbra.anl.gov>

I stopped this run and started a larger one: 5M catsn jobs to a pool of
300-400 workers (varies over time). It finished 2.2M and was still
running, albeit slowly, when I ended it.

The job rate ramped up quickly as the external QueueN script obtained
workers. After about 15 minutes it had obtained 80 workers and seemed to
be running at several hundred tasks per second. I had moved all the test
clients, IO, and logging to local hard disk on communicado for speed.

I set a retry count of 5, and turned on lazy failure mode.

After about 6 hours, the test had passed 2.2M jobs and was still
progressing, but seemed to have drastically slowed down from its earlier
rate. It seemed to have dropped below a few jobs per second. Possibly it
ate through its throttle due to failed/hung workers. The throttle was
300 jobs, and it seemed to have about 400 running workers (the QueueN
algorithm was grabbing more workers than the artificial "demand" I had
set of 250).

I then killed the run and captured all the logs, including jstacks and a
trace of top output every minute, mainly because I wanted to free up the
workers and study the run before continuing.
I see about 3 worker failure scenarios in the Condor logs: 1) _swiftwrap.staging: line 331: warning: here-document at line 303 delimited by end-of-file (wanted `$STDERR') 2) com$ cat 2.err Send failed: Transport endpoint is not connected at ./worker.pl line 384. com$ cat 2.out OSG_WN_TMP=/state/partition1/tmp === contact: http://communicado.ci.uchicago.edu:56323 === name: Firefly Running in dir /grid_home/engage/gram_scratch_7Xkg2fpMUc === cwd: /grid_home/engage/gram_scratch_7Xkg2fpMUc === logdir: /state/partition1/tmp/Firefly.workerdir.Q18464 =============================================== === exit: worker.pl exited with code=107 === worker log - last 1000 lines: ==> /state/partition1/tmp/Firefly.workerdir.Q18464/worker-Firefly.log <== 1312882398.535 INFO - Firefly Logging started: Tue Aug 9 04:33:18 2011 1312882398.535 INFO - Running on node c1511.local 1312882398.535 INFO - Connecting (0)... 1312882398.566 INFO - Connected 1312882398.604 INFO 000101 Registration successful. ID=000101 1312890065.197 WARN 000101 Send failed: Transport endpoint is not connected com$ 3) only occurred once or twice, and I need to hunt it down. ---- I see 1234 messages containing "worker lost", like: 2011-08-09 01:50:03,438-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=cat-fld0l6ek - Application exception: Task failed: Conne ction to worker lost 1234 is >> the throttle of 300, so it seems to be running past that problem. I'll investigate more, but since its working so well I need to first get the application users going that are waiting on this. I wonder if these issues will show up more local stress testing on the MCS hosts, as Alberto and Ketan are working on. - Mike ----- Original Message ----- > From: "Michael Wilde" > To: "Mihael Hategan" > Cc: "Swift Devel" > Sent: Monday, August 8, 2011 11:14:22 PM > Subject: Re: New 0.93 problem: .error No such file or directory > OK, with retry on, the same run has now passed 250K jobs, and retried > 2 failures successfully. 
Its running at about 100 jobs/sec to about 38 > workers over 22 sites. > > Once this tests out I'll increase the number of workers. > > - Mike > > > ----- Original Message ----- > > From: "Mihael Hategan" > > To: "Michael Wilde" > > Cc: "Swift Devel" > > Sent: Monday, August 8, 2011 9:58:59 PM > > Subject: Re: New 0.93 problem: .error No such file or > > directory > > On Mon, 2011-08-08 at 21:54 -0500, Michael Wilde wrote: > > > > Judging from the error message, your workers are dying for > > > > unknown > > > > reasons. I see only two applications that failed (and they have > > > > distinct > > > > arguments), so I'm guessing you turned off retries. At 2/15K > > > > failure > > > > probability, if you set retries to at least 1, you would get a > > > > dramatic > > > > decrease in the odds that the failure will happen twice for the > > > > same > > > > app. > > > > > > Good idea, will do. > > > > > > So I just realized whats happening here. Workers can fail (ie you > > > tested killing them, you said) and Swift will keep running, *but* > > > the > > > apps that were running on failed workers receive failures and need > > > to > > > get retried through normal retry, as if the apps themselves had > > > failed, correct? That just dawned on me. > > > > Yep. > > > > [...] > > > > > > I saw two apps fail because the site didnt set OSG_WM_TMP (where I > > > place the logs). I thought that in those two cases the worker > > > never > > > started, but perhaps those two failures are related to these two > > > app > > > failures. > > > > In your case there is an actual TCP connections, so the workers must > > have started. 
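
The retry argument quoted above (2 failures in roughly 15K jobs, so one retry should make a repeat failure for the same app very unlikely) can be checked with a little arithmetic. This is only a rough sketch in Python, not anything from Swift itself: the function name is hypothetical, and it assumes each attempt fails independently, which may not hold if worker deaths are correlated (for example, one bad site).

```python
def all_attempts_fail(p_fail: float, retries: int) -> float:
    """Probability that the first attempt and every retry all fail.

    Assumes attempts are independent, which is an idealisation.
    """
    return p_fail ** (retries + 1)


# Figures taken from the thread: about 2 failed apps out of about 15,000.
p = 2 / 15_000

for retries in (0, 1, 5):
    prob = all_attempts_fail(p, retries)
    print(f"retries={retries}: P(app ultimately fails) ~ {prob:.3g}")
```

Under that independence assumption a single retry squares the per-app failure probability, which is the "dramatic decrease" the message refers to.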
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From wilde at mcs.anl.gov Tue Aug 9 07:25:41 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 9 Aug 2011 07:25:41 -0500 (CDT)
Subject: [Swift-devel] New 0.93 problem: .error No such file or directory
In-Reply-To: <1979181272.203035.1312892199200.JavaMail.root@zimbra.anl.gov>
Message-ID: <1774137108.203043.1312892741224.JavaMail.root@zimbra.anl.gov>

Forgot to mention two things:

- the logs are on communicado on local dir /scratch/local/wilde/swift/test.swift-workers/logs.14

- this is a really cool milestone: 2.2M jobs and counting from one swift
script to OSG; at about 20 mins into the run it was pushing 138 jobs/sec
in one arbitrary 10 min period that I looked at.

Nice work, Mihael!

- Mike

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From yadudoc1729 at gmail.com Tue Aug 9 08:51:00 2011
From: yadudoc1729 at gmail.com (Yadu Nand)
Date: Tue, 9 Aug 2011 19:21:00 +0530
Subject: [Swift-devel] Overwriting procedures in swift.
Message-ID:

Hi,

I've been working on getting an implementation for a feature
which allows calling a function by the string-identifier. During
the discussion with Mihael, we found that swift allows us to
redefine a function with no complaints. We think this is a bug
and a check should be put to prevent this. Inputs on this are
welcome.

Eg.
(int o) f (int i){
    o=i;
}
(int z) f (int x){
    z= x*5;
}
trace ( f(8) );

Gives output as 40. while swift should throw an error instead.

--
Thanks and Regards,
Yadu Nand B

From benc at hawaga.org.uk Tue Aug 9 09:21:15 2011
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 9 Aug 2011 16:21:15 +0200
Subject: [Swift-devel] Overwriting procedures in swift.
In-Reply-To:
References:
Message-ID: <93D83E81-B514-4B8A-86BF-F7E1C306184E@hawaga.org.uk>

On Aug 9, 2011, at 3:51 PM, Yadu Nand wrote:

> I've been working on getting an implementation for a feature
> which allows calling a function by the string-identifier. During
> the discussion with Mihael, we found that swift allows us to
> redefine a function with no complaints. We think this is a bug
> and a check should be put to prevent this. Inputs on this are
> welcome.

Yes, I think this is a bug and a check should be put in to prevent this.

Ben

From yadudoc1729 at gmail.com Tue Aug 9 09:30:18 2011
From: yadudoc1729 at gmail.com (Yadu Nand)
Date: Tue, 9 Aug 2011 20:00:18 +0530
Subject: [Swift-devel] Overwriting procedures in swift.
In-Reply-To: <93D83E81-B514-4B8A-86BF-F7E1C306184E@hawaga.org.uk>
References: <93D83E81-B514-4B8A-86BF-F7E1C306184E@hawaga.org.uk>
Message-ID:

> Yes, I think this is a bug and a check should be put in to prevent this.

Great :)
Can someone review the patch attached, please?

--
Thanks and Regards,
Yadu Nand B
-------------- next part --------------
A non-text attachment was scrubbed...
Name: overwrite.patch
Type: text/x-patch
Size: 1181 bytes
Desc: not available
URL:

From benc at hawaga.org.uk Tue Aug 9 09:33:06 2011
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 9 Aug 2011 14:33:06 +0000 (GMT)
Subject: [Swift-devel] Overwriting procedures in swift.
In-Reply-To:
References: <93D83E81-B514-4B8A-86BF-F7E1C306184E@hawaga.org.uk>
Message-ID:

what happens with case? (and, what *should* happen with case?)

I think karajan identifiers are case insensitive (?) but this patch looks
like it is case-sensitive.

--
http://www.hawaga.org.uk/ben/

From yadudoc1729 at gmail.com Tue Aug 9 10:01:22 2011
From: yadudoc1729 at gmail.com (Yadu Nand)
Date: Tue, 9 Aug 2011 20:31:22 +0530
Subject: [Swift-devel] Overwriting procedures in swift.
In-Reply-To:
References: <93D83E81-B514-4B8A-86BF-F7E1C306184E@hawaga.org.uk>
Message-ID:

> what happens with case? (and, what *should* happen with case?)

(int o) f ( int i) {
    o = i;
}
(int z) F (int a){
    z = a * 5 ;
}
trace ( f (5) , F(5) );

For the above snippet, trace returns 25, 25, so F is overwriting f
anyway. I don't think this is right.

> I think karajan identifiers are case insensitive (?) but this patch looks
> like it is case-sensitive.

Fixed it. Please check the new patch attached.

--
Thanks and Regards,
Yadu Nand B
-------------- next part --------------
A non-text attachment was scrubbed...
Name: overwrite_case_sensitive.patch
Type: text/x-patch
Size: 1209 bytes
Desc: not available
URL:

From alberto_chavez at live.com Tue Aug 9 13:43:59 2011
From: alberto_chavez at live.com (Alberto Chavez)
Date: Tue, 9 Aug 2011 13:43:59 -0500
Subject: [Swift-devel] ssh test case on pads/beagle
Message-ID:

Hello,

I am trying to run a simpler case than the ssh-pbs-coaster test case, and
I'm still having the same error.
Now I am running only the ssh test case
(/tests/providers/ssh/001-catsn-ssh.swift)

The command line is:
swift -config cf -tc.file tc.template.data -sites.file sites.template.xml 001-catsn-ssh.swift

The output:
Swift svn swift-r4861 (swift modified locally) cog-r3183

RunID: 20110809-1336-ohte788a
Progress: time: Tue, 09 Aug 2011 13:36:42 -0500
Exception in cat:
Arguments: [data.txt]
Host: ssh
Directory: 001-catsn-ssh-20110809-1336-ohte788a/jobs/m/cat-mq74h7ek
- - -

Caused by: null
Caused by: org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid private key or passphrase
Caused by: com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: Can't read key due to cryptography problems: java.security.NoSuchAlgorithmException: Unsupported passphrase algorithm: AES-128-CBC
Progress: time: Tue, 09 Aug 2011 13:36:43 -0500 Selecting site:8 Submitting:1 Failed:1
Exception in cat:
Arguments: [data.txt]
Host: ssh
Directory: 001-catsn-ssh-20110809-1336-ohte788a/jobs/n/cat-nq74h7ek
- - -

Caused by: null
Caused by: org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid private key or passphrase
Caused by: com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: Can't read key due to cryptography problems: java.security.NoSuchAlgorithmException: Unsupported passphrase algorithm: AES-128-CBC
Progress: time: Tue, 09 Aug 2011 13:36:44 -0500 Selecting site:7 Submitting:1 Failed:2
Exception in cat:
Arguments: [data.txt]
Host: ssh
Directory: 001-catsn-ssh-20110809-1336-ohte788a/jobs/o/cat-oq74h7ek
- - -

Caused by: null
Caused by: org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid private key or passphrase
Caused by: com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: Can't read key due to cryptography problems: java.security.NoSuchAlgorithmException: Unsupported passphrase algorithm: AES-128-CBC
"error_log.log" 105L, 5770C

My auth.defaults reads:

login1.beagle.ci.uchicago.edu.type=key
login1.beagle.ci.uchicago.edu.username=achavez
login1.beagle.ci.uchicago.edu.key=/home/Alberto/.ssh/identity

login1.pads.ci.uchicago.edu.type=key
login1.pads.ci.uchicago.edu.username=achavez
login1.pads.ci.uchicago.edu.key=/home/Alberto/.ssh/identity

and it has been set to 600. I omitted the passphrase line, but it is
there, and the passphrase is right because I just verified it in two
ways:
1) by logging in to pads and beagle without providing a password
2) by "changing" the password; the "new" password is the same as the
"old" one.

sites.templates.xml:

jobmanager="ssh"/>
0
/home/achavez/swiftwork

config file:

wrapperlog.always.transfer=true
sitedir.keep=true
execution.retries=0
lazy.errors=true
status.mode=provider
use.provider.staging=true
provider.staging.pin.swiftfiles=false
foreach.max.threads=10
provenance.log=true

I also tried a simpler SwiftScript:

type filemsg;

app (filemsg output) hello(string s)
{
    echo s stdout=@filename(output);
}

filemsg myfile<"dogcatdinosaur.out">;
myfile = hello("dog,cat,dinosaur");

and I get the following output:

Swift svn swift-r4861 (swift modified locally) cog-r3183

RunID: 20110809-1343-2es2hel2
Progress: time: Tue, 09 Aug 2011 13:43:25 -0500
Exception in echo:
Arguments: [dog,cat,dinosaur]
Host: ssh
Directory: hello_swift-20110809-1343-2es2hel2/jobs/0/echo-0oldh7ek
- - -

Caused by: null
Caused by: org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid private key or passphrase
Caused by: com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: Can't read key due to cryptography problems: java.security.NoSuchAlgorithmException: Unsupported passphrase algorithm: AES-128-CBC
Final status: time: Tue, 09 Aug 2011 13:43:26 -0500 Failed:1
The following errors have occurred:
1. Can't read key due to cryptography problems: java.security.NoSuchAlgorithmException: Unsupported passphrase algorithm: AES-128-CBC

Any thoughts on this?

From ketancmaheshwari at gmail.com Tue Aug 9 13:47:22 2011
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Tue, 9 Aug 2011 13:47:22 -0500
Subject: [Swift-devel] Persistent coasters running one job per worker
Message-ID:

Mihael,

I was discussing this with Justin and we thought you could help:

I am observing that persistent coasters are running one job per worker
as opposed to the number specified in jobspernode (I also tried
nodegranularity) in sites.xml.

Attaching the log, and the sites.xml for the run. Swift is 0.93 (Swift svn swift-r4968 cog-r3225).
The script is Mike's catsnsleep that sleeps for 20s with n=10.

--
Ketan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sites.pecos.xml
Type: text/xml
Size: 605 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: catsnsleep-20110809-1324-ouf3x44c.log
Type: application/octet-stream
Size: 27481 bytes
Desc: not available
URL:

From hategan at mcs.anl.gov Tue Aug 9 13:57:06 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 09 Aug 2011 11:57:06 -0700
Subject: [Swift-devel] ssh test case on pads/beagle
In-Reply-To:
References:
Message-ID: <1312916226.2671.2.camel@blabla>

Hmm: Unsupported passphrase algorithm: AES-128-CBC

I'll try to see how that can be fixed. In the mean time, can you
generate a new key pair with 3DES encryption instead and use that?

From wilde at mcs.anl.gov Tue Aug 9 13:58:20 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 9 Aug 2011 13:58:20 -0500 (CDT)
Subject: [Swift-devel] ssh test case on pads/beagle
In-Reply-To:
Message-ID: <1181722696.205031.1312916300484.JavaMail.root@zimbra.anl.gov>

Alberto, I suspect that the problem is that your SSH key is of a form
that's not compatible with the Java SSH library that Swift is using,
based on this message:

Caused by: com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: Can't read key due to cryptography problems: java.security.NoSuchAlgorithmException: Unsupported passphrase algorithm: AES-128-CBC

Can you try again with a new key, generated on Linux, using say RSA
encryption? Try using one of the recipes for generating ssh keys that
are posted on the MCS or CI web sites.

- Mike

----- Original Message -----
From: "Alberto Chavez"
To: "Swift Devel"
Sent: Tuesday, August 9, 2011 1:43:59 PM
Subject: [Swift-devel] ssh test case on pads/beagle

Hello,

I am trying to run a simpler case than ssh-pbs-coaster test case, and I'm still having the same error.
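
The "Unsupported passphrase algorithm: AES-128-CBC" error in this thread suggests that the private key file is a passphrase-protected PEM key whose header names a cipher the Java SSH library does not handle. As a rough sketch (Python, not part of Swift or its SSH provider), the cipher can be read from the `DEK-Info` header that traditional OpenSSL-style encrypted PEM keys carry. The helper name `pem_cipher` and the sample header values are hypothetical, and keys stored in other formats may not have this header at all.

```python
from typing import Optional


def pem_cipher(pem_text: str) -> Optional[str]:
    """Return the cipher named in a PEM key's DEK-Info header, or None if absent."""
    for line in pem_text.splitlines():
        if line.startswith("DEK-Info:"):
            # The header has the form "DEK-Info: <cipher>,<hex IV>".
            return line.split(":", 1)[1].split(",", 1)[0].strip()
    return None


# Illustrative header only: the IV and body are placeholders, not a real key.
sample = """-----BEGIN RSA PRIVATE KEY-----
Proc-Type: 4,ENCRYPTED
DEK-Info: AES-128-CBC,0123456789ABCDEF0123456789ABCDEF

(base64 key data omitted)
-----END RSA PRIVATE KEY-----
"""

print(pem_cipher(sample))
```

A key re-encrypted with 3DES, as suggested in the replies, would be expected to report `DES-EDE3-CBC` in the same header.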
_______________________________________________
Swift-devel mailing list
Swift-devel at ci.uchicago.edu
https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov Tue Aug 9 13:59:27 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 09 Aug 2011 11:59:27 -0700
Subject: [Swift-devel] Persistent coasters running one job per worker
In-Reply-To:
References:
Message-ID: <1312916367.2671.4.camel@blabla>

but but but I checked this, and it worked fine...

Can you also post the coasters log (on the machine the coaster service
is on, in ~/.globus/coasters)?

On Tue, 2011-08-09 at 13:47 -0500, Ketan Maheshwari wrote:
> Mihael,
>
> I was discussing this with Justin and we thought you could help:
>
> I am observing that persistent coasters are running one job per worker
> as opposed to the number specified in jobspernode (I also tried
> nodegranularity) on sites.xml.
>
> Attaching the log, and the sites.xml for the run. Swift is 0.93 (Swift
> svn swift-r4968 cog-r3225).
>
> The script is Mike's catsnsleep that sleeps for 20s with n=10.
>
> --
> Ketan

From hategan at mcs.anl.gov Tue Aug 9 14:05:36 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 09 Aug 2011 12:05:36 -0700
Subject: [Swift-devel] New 0.93 problem: .error No such file or directory
In-Reply-To: <1979181272.203035.1312892199200.JavaMail.root@zimbra.anl.gov>
References: <1979181272.203035.1312892199200.JavaMail.root@zimbra.anl.gov>
Message-ID: <1312916736.2671.7.camel@blabla>

On Tue, 2011-08-09 at 07:16 -0500, Michael Wilde wrote:
> I stopped this run and started a larger one: 5M catsn jobs to a pool
> of 300-400 workers (varies over time). It finished 2.2M and was still
> running, albeit slowly, when I ended it.
>
> The job rate ramped up quickly as the external QueueN script obtained
> workers. After about 15 mins had obtained 80 workers and seemed to be
> running at several hundred tasks per second. I had moved all the test
> clients, IO, and logging to local hard disk on communicado for speed.
> I set a retry count of 5, and turned on lazy failure mode.
>
> After about 6 hours, the test had passed 2.2M jobs and was still
> progressing, but seemed to have drastically slowed down from its
> earlier rate. Seemed to have dropped below a few jobs per second.
> Possibly it ate through its throttle due to failed/hung workers.

Shouldn't be the case any more. My first suspicion would be that swift
is running out of memory. But then it could also be some leak in the
coaster staging buffers. I'll look at the logs later today.

From wilde at mcs.anl.gov Tue Aug 9 14:08:01 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 9 Aug 2011 14:08:01 -0500 (CDT)
Subject: [Swift-devel] New 0.93 problem: .error No such file or directory
In-Reply-To: <1312916736.2671.7.camel@blabla>
Message-ID: <1578546213.205094.1312916881011.JavaMail.root@zimbra.anl.gov>

> Shouldn't be the case any more. My first suspicion would be that swift
> is running out of memory. But then it could also be some leak in the
> coaster staging buffers. I'll look at the logs later today.

Cool, thanks. It would be great if the latest log plotting tools could
run on this log to plot the activity rate over the test period.

From ketancmaheshwari at gmail.com Tue Aug 9 14:09:07 2011
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Tue, 9 Aug 2011 14:09:07 -0500
Subject: [Swift-devel] Persistent coasters running one job per worker
In-Reply-To: <1312916367.2671.4.camel@blabla>
References: <1312916367.2671.4.camel@blabla>
Message-ID:

I do not see any recent log in ~/.globus/coasters. The stdout/err of the
coaster service run is in the attached service.log and the coaster.log
is in the attached swift.log.
-- Ketan -------------- next part -------------- A non-text attachment was scrubbed... Name: service.log Type: application/octet-stream Size: 24692 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: swift.log Type: application/octet-stream Size: 74296 bytes Desc: not available URL: From ketancmaheshwari at gmail.com Tue Aug 9 14:10:34 2011 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Tue, 9 Aug 2011 14:10:34 -0500 Subject: [Swift-devel] New 0.93 problem: .error No such file or directory In-Reply-To: <1578546213.205094.1312916881011.JavaMail.root@zimbra.anl.gov> References: <1312916736.2671.7.camel@blabla> <1578546213.205094.1312916881011.JavaMail.root@zimbra.anl.gov> Message-ID: On Tue, Aug 9, 2011 at 2:08 PM, Michael Wilde wrote: > > Shouldn't be the case any more. My first suspicion would be that swift > > is running out of memory. But then it could also be some leak in the > > coaster staging buffers. I'll look at the logs later today. > > Cool, thanks.
It would be great if the latest log plotting tools could run > on this log to plot the activity rate over the test period. > I will try this. -- Ketan From hategan at mcs.anl.gov Tue Aug 9 14:16:50 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 09 Aug 2011 12:16:50 -0700 Subject: [Swift-devel] Persistent coasters running one job per worker In-Reply-To: References: <1312916367.2671.4.camel@blabla> Message-ID: <1312917410.3416.2.camel@blabla> Ah! If the workers connect before the client does, then jobsPerNode does not make it to the coaster service. I'll think about this. In the mean time, you could have the workers started after the client sends its first job to the service. I'm thinking that maybe jobsPerNode should be a setting that the workers themselves could be started with. From wilde at mcs.anl.gov Tue Aug 9 14:28:39 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 9 Aug 2011 14:28:39 -0500 (CDT) Subject: [Swift-devel] Persistent coasters running one job per worker In-Reply-To: <1312917410.3416.2.camel@blabla> Message-ID: <1875366538.205206.1312918119390.JavaMail.root@zimbra.anl.gov> ----- Original Message ----- > From: "Mihael Hategan" > To: "Ketan Maheshwari" > Cc: "Swift Devel" > Sent: Tuesday, August 9, 2011 2:16:50 PM > Subject: Re: [Swift-devel] Persistent coasters running one job per worker > Ah! > > If the workers connect before the client does, then jobsPerNode does > not > make it to the coaster service. > > I'll think about this. In the mean time, you could have the workers > started after the client sends its first job to the service. > > I'm thinking that maybe jobsPerNode should be a setting that the > workers > themselves could be started with. That sounds OK for now, for persistent coasters I assume you mean. - Mike -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Tue Aug 9 14:31:54 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 9 Aug 2011 14:31:54 -0500 (CDT) Subject: [Swift-devel] Bringing back the coaster worker timeout feature? In-Reply-To: <1312917410.3416.2.camel@blabla> Message-ID: <705833790.205231.1312918314735.JavaMail.root@zimbra.anl.gov> Related to the idea of adding a worker option for jobsPerNode: I'd like to propose/discuss adding back the option for workers to time out when they have been idle for some settable period. This would be useful in configurations like we're running for OSG and TeraGrid, where we may at some points have more workers running than the Swift script has demand for, because of the fairly loose coupling between the script and the worker factory, along with queuing delays, etc.
- Mike -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Tue Aug 9 14:33:17 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 09 Aug 2011 12:33:17 -0700 Subject: [Swift-devel] Persistent coasters running one job per worker In-Reply-To: <1875366538.205206.1312918119390.JavaMail.root@zimbra.anl.gov> References: <1875366538.205206.1312918119390.JavaMail.root@zimbra.anl.gov> Message-ID: <1312918397.3539.2.camel@blabla> On Tue, 2011-08-09 at 14:28 -0500, Michael Wilde wrote: > > > > > I'm thinking that maybe jobsPerNode should be a setting that the > > workers > > themselves could be started with. > > That sounds OK for now, for persistent coasters I assume you mean. Yes. In the same spirit, one could also pass a walltime that way. Trunk has some code allowing a worker to pass a bunch of key-value pairs when registering, but it's not being used. From yadudoc1729 at gmail.com Tue Aug 9 14:47:58 2011 From: yadudoc1729 at gmail.com (Yadu Nand) Date: Wed, 10 Aug 2011 01:17:58 +0530 Subject: [Swift-devel] Call function Map-Reduce Message-ID: Hi, I've been working on implementing a call function which would allow function calls in swift using string identifiers for procedures. In order to do this we planned to use karajan's executeElement which I think needs a slightly different definition for user defined elements. int x=5; (int out) old_func (int inp) { } old_func(x); used to translate into the following format : x In order to have the calls using executeElement we need the following style ... new_func inp But how do we handle the output variable ?
I don't see any documentation on this ? -- Thanks and Regards, Yadu Nand B From hategan at mcs.anl.gov Tue Aug 9 16:28:45 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 09 Aug 2011 14:28:45 -0700 Subject: [Swift-devel] Call function Map-Reduce In-Reply-To: References: Message-ID: <1312925325.4795.1.camel@blabla> On Wed, 2011-08-10 at 01:17 +0530, Yadu Nand wrote: > Hi, > > I've been working on implementing a call function which > would allow function calls in swift using string identifiers > for procedures. > > In order to do this we planned to use karajan's > executeElement which I think needs a slightly different > definition for user defined elements. > > int x=5; > (int out) old_func (int inp) { > > } > old_func(x); > > used to translate into the following format : > > > > > > > > x > > > > In order to have the calls using executeElement we need > the following style > > > > ... > > > > new_func > inp > We'll do it like this: ... > > > But how do we handle the output variable ? I don't see any > documentation on this ? > Return values are passed by reference. So y = f(x) would be y x From yadudoc1729 at gmail.com Wed Aug 10 00:56:35 2011 From: yadudoc1729 at gmail.com (Yadu Nand) Date: Wed, 10 Aug 2011 11:26:35 +0530 Subject: [Swift-devel] Procedure re-defintion, feature or bug ? Message-ID: Hi, Following what Justin said about function redefinition being a feature in the shell I tried a test to check if that really works in swift. (int o) f (int i){ o = i ; } trace ( "first" , f(5) ); (int z) f (int a){ z = a * 10; } trace ( "second", f(5) ); In swift this would give : second, 50 first , 50 What I think is, the procedures are overwritten around compile time which allows this behavior, in which only the last definition is valid by execution time. 
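One way to picture the behaviour reported above, as a toy Python model rather than anything taken from the Swift compiler: if every procedure definition is registered before any call runs, then the last definition of f is the only one either call can see. The run function, the tuple encoding of the program, and the strict flag are all assumptions made for illustration only.

```python
def run(program, strict=False):
    """Toy model: register every definition first, then run the calls.

    program is a list of ("def", name, fn) and ("call", label, name, arg)
    tuples in lexical order. This is NOT the Swift compiler, only an
    illustration of definitions being processed before any invocation.
    """
    table = {}
    # Phase 1: collect all definitions, regardless of where they appear.
    for stmt in program:
        if stmt[0] == "def":
            _, name, fn = stmt
            if strict and name in table:
                raise ValueError(f"procedure {name!r} is already defined")
            table[name] = fn  # a later definition replaces an earlier one
    # Phase 2: run the calls; each one sees only the final table.
    results = []
    for stmt in program:
        if stmt[0] == "call":
            _, label, name, arg = stmt
            results.append((label, table[name](arg)))
    return results


# Mirrors the two definitions of f and the two trace calls in the script above.
program = [
    ("def", "f", lambda i: i),
    ("call", "first", "f", 5),
    ("def", "f", lambda a: a * 10),
    ("call", "second", "f", 5),
]

if __name__ == "__main__":
    print(run(program))
```

In this model both calls return 50, matching the trace output above, and passing strict=True turns the second definition into an error instead of a silent replacement.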
-- Thanks and Regards, Yadu Nand B From hategan at mcs.anl.gov Wed Aug 10 01:28:44 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 09 Aug 2011 23:28:44 -0700 Subject: [Swift-devel] Procedure re-defintion, feature or bug ? In-Reply-To: References: Message-ID: <1312957724.7413.1.camel@blabla> On Wed, 2011-08-10 at 11:26 +0530, Yadu Nand wrote: > What I think is, the procedures are overwritten around compile > time Not quite. Just that they are both defined before any of the invocations. The swift compiler re-orders instructions quite a bit, so the actual execution order is quite unrelated to the lexical order. That's exactly why I think that this shouldn't happen. There is no way to invoke the first definition, so why allow it? > which allows this behavior, in which only the last definition > is valid by execution time. > > From benc at hawaga.org.uk Wed Aug 10 02:03:12 2011 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 10 Aug 2011 09:03:12 +0200 Subject: [Swift-devel] Procedure re-defintion, feature or bug ? In-Reply-To: References: Message-ID: All the definitions should be happening "simultaneously". Mostly this is how I think definitions should work: Imagine instead of defining f in Swift, you are defining x in the following simultaneous equation: x=3 x=4 What is the value of x? (remember its a simultaneous equation, not a program...) -- From hategan at mcs.anl.gov Wed Aug 10 02:55:12 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 10 Aug 2011 00:55:12 -0700 Subject: [Swift-devel] Call function Map-Reduce In-Reply-To: <1312925325.4795.1.camel@blabla> References: <1312925325.4795.1.camel@blabla> Message-ID: <1312962912.8757.3.camel@blabla> On Tue, 2011-08-09 at 14:28 -0700, Mihael Hategan wrote: > We'll do it like this: > > ... > This should now work in trunk: import("sys.k") element(bla, [...] echo(each(...)) ) executeElement("bla", "test", 1, 2, 3) in xml it's ... 
It will only work for user defined functions, so executeElement("echo", "test") won't work. But then we won't need that. From wozniak at mcs.anl.gov Wed Aug 10 10:00:11 2011 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Wed, 10 Aug 2011 10:00:11 -0500 (Central Daylight Time) Subject: [Swift-devel] Procedure re-defintion, feature or bug ? In-Reply-To: References: Message-ID: On Wed, 10 Aug 2011, Ben Clifford wrote: > All the definitions should be happening "simultaneously". > > Mostly this is how I think definitions should work: > > Imagine instead of defining f in Swift, you are defining x in the following simultaneous equation: > > x=3 > x=4 > > What is the value of x? (remember its a simultaneous equation, not a program...) Yes, the definitions are simultaneous and inconsistent and Swift should report an error. Justin -- Justin M Wozniak From yadudoc1729 at gmail.com Wed Aug 10 12:47:09 2011 From: yadudoc1729 at gmail.com (Yadu Nand) Date: Wed, 10 Aug 2011 23:17:09 +0530 Subject: [Swift-devel] Procedure re-defintion, feature or bug ? In-Reply-To: References: Message-ID: On Wed, Aug 10, 2011 at 8:30 PM, Justin M Wozniak wrote: > Yes, the definitions are simultaneous and inconsistent and Swift should > report an error. Okay I think I understand better now. I'm attaching a patch with a one line change to the earlier patch. Please let me know if this needs additional fixing. -- Thanks and Regards, Yadu Nand B -------------- next part -------------- A non-text attachment was scrubbed... Name: check_proc_redefintion.patch Type: text/x-patch Size: 1313 bytes Desc: not available URL: From yadudoc1729 at gmail.com Wed Aug 10 15:44:34 2011 From: yadudoc1729 at gmail.com (Yadu Nand) Date: Thu, 11 Aug 2011 02:14:34 +0530 Subject: [Swift-devel] Call function. Message-ID: Hi, I'm working on implementing a call function which takes procedure identifiers as strings. 
This will allow us to do some cool stuff like : int assoc_array [ string ] [ int ]; assoc_array["do_x"][1] = 1000; assoc_array["do_x"][2] = 1001; assoc_array["do_y"][1] = 5000; assoc_array["do_y"][2] = 5002; foreach i in assoc_array { foreach vi in assoc_array[ i ] { call ( i , vi ); }} I have considered two alternative structures for "call " : 1. = call ( , < args > ) ; 2. call ( , , < args > ) ; Any ideas on this are welcome. -- Thanks and Regards, Yadu Nand B From wilde at mcs.anl.gov Wed Aug 10 17:12:58 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 10 Aug 2011 17:12:58 -0500 (CDT) Subject: [Swift-devel] swiftdel Message-ID: <172562579.209954.1313014378362.JavaMail.root@zimbra.anl.gov> I updated the swiftdevel release-plans page with the results from today's 0.93 meeting. David, please move everything needed to Bugzilla tickets and close this list down except for non-ticket procedural notes. Thanks, - Mike From alberto_chavez at live.com Wed Aug 10 18:41:25 2011 From: alberto_chavez at live.com (Alberto Chavez) Date: Wed, 10 Aug 2011 18:41:25 -0500 Subject: [Swift-devel] ssh test case on pads/beagle In-Reply-To: <1312916226.2671.2.camel@blabla> References: , <1312916226.2671.2.camel@blabla> Message-ID: I changed my ssh-key, and they worked on the MCS machines because the authorized_keys file has not been updated yet on the CI Machines. I created a new ssh-key using: ssh-keygen -t rsa -b 2048 exactly as the MCS site suggested. On the other hand, I still have a problem, I am getting the following error: Swift svn swift-r4978 cog-r3226 RunID: 20110810-1819-1cdo2o62 Progress: time: Wed, 10 Aug 2011 18:19:42 -0500 Exception in cat: Arguments: [data.txt] Host: ssh Directory: 001-catsn-ssh-20110810-1819-1cdo2o62/jobs/9/cat-9jd0g9ek - - - Caused by: null Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 127 Final status: time: Wed, 10 Aug 2011 18:20:00 -0500 Failed:10 The following errors have occurred: 1.
Job failed with an exit code of 127 (10 times) These are the contents of the log: Execution completed with errors 2011-08-10 18:19:43,251-0500 INFO ConnectionProtocol Freeing channel 0 [Unnamed Channel] 2011-08-10 18:19:43,263-0500 INFO Exec Exit code 127 2011-08-10 18:19:43,269-0500 INFO ConnectionProtocol Freeing channel 0 [Unnamed Channel] 2011-08-10 18:19:43,277-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=cat-9jd0g9ek - Application exception: null Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 127 2011-08-10 18:19:43,280-0500 INFO vdl:execute END_FAILURE thread=0-5-3-1 tr=cat 2011-08-10 18:19:43,281-0500 INFO vdl:execute Exception in cat: at org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:250) at org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:254) at org.globus.cog.karajan.workflow.nodes.GenerateErrorNode.post(GenerateErrorNode.java:27) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:29) at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:20) at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:139) at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:197) at org.globus.cog.karajan.workflow.FlowElementWrapper.start(FlowElementWrapper.java:227) at org.globus.cog.karajan.workflow.events.EventBus.start(EventBus.java:104) at org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:40) at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:636) 2011-08-10 18:20:00,332-0500 INFO ExecutionContext Detailed exception: Execution completed with errors at org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:250) at org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:254) at org.globus.cog.karajan.workflow.nodes.GenerateErrorNode.post(GenerateErrorNode.java:27) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:29) at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:20) at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:139) at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:197) at org.globus.cog.karajan.workflow.FlowElementWrapper.start(FlowElementWrapper.java:227) at org.globus.cog.karajan.workflow.events.EventBus.start(EventBus.java:104) at org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:40) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:636) I believe that the problem resides in the TC file because when I run a much simpler SwiftScript like: int i = 9; trace(i); I get the following output: swift traceme.swift -tc.file tc.template.data -sites.file sites.template.xml -config cf Swift svn swift-r4978 cog-r3226 RunID: 20110810-1832-buktjj3d Progress: time: Wed, 10 Aug 2011 18:32:30 -0500 SwiftScript trace: 9.0 Final status: time: Wed, 10 Aug 2011 18:32:30 -0500 but as soon as I start using the commands stated in the TC file, I get the "exit code 127" My tc file reads: ssh echo /bin/echo null null ssh cat /bin/cat null null ssh ls /bin/ls null null ssh grep /bin/grep null null ssh sort /bin/sort null null ssh paste /bin/paste null null ssh wc /usr/bin/wc null null I am working on the login node of the MCS machine trying to ssh via Swift to steamroller. > Subject: Re: [Swift-devel] ssh test case on pads/beagle > From: hategan at mcs.anl.gov > To: alberto_chavez at live.com > CC: swift-devel at ci.uchicago.edu > Date: Tue, 9 Aug 2011 11:57:06 -0700 > > Hmm: Unsupported passphrase algorithm: AES-128-CBC > > I'll try to see how that can be fixed. In the mean time, can you > generate a new key pair with 3DES encryption instead and use that? > > On Tue, 2011-08-09 at 13:43 -0500, Alberto Chavez wrote: > > Hello, > > > > > > I am trying to run a simpler case than ssh-pbs-coaster test case, and > > I'm still having the same error.
> > Now I am running only ssh test case > > (/tests/providers/ssh/001-catsn-ssn.swift) > > > > > > The command line is: > > swift -config cf -tc.file tc.template.data -sites.file > > sites.template.xml 001-catsn-ssh.swift > > > > > > The output: > > Swift svn swift-r4861 (swift modified locally) cog-r3183 > > > > > > RunID: 20110809-1336-ohte788a > > Progress: time: Tue, 09 Aug 2011 13:36:42 -0500 > > Exception in cat: > > Arguments: [data.txt] > > Host: ssh > > Directory: 001-catsn-ssh-20110809-1336-ohte788a/jobs/m/cat-mq74h7ek > > - - - > > > > > > Caused by: null > > Caused by: > > org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid private key or passphrase > > Caused by: > > com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: Can't > > read key due to cryptography problems: > > java.security.NoSuchAlgorithmException: Unsupported passphrase > > algorithm: AES-128-CBC > > Progress: time: Tue, 09 Aug 2011 13:36:43 -0500 Selecting site:8 > > Submitting:1 Failed:1 > > Exception in cat: > > Arguments: [data.txt] > > Host: ssh > > Directory: 001-catsn-ssh-20110809-1336-ohte788a/jobs/n/cat-nq74h7ek > > - - - > > > > > > Caused by: null > > Caused by: > > org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid private key or passphrase > > Caused by: > > com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: Can't > > read key due to cryptography problems: > > java.security.NoSuchAlgorithmException: Unsupported passphrase > > algorithm: AES-128-CBC > > Progress: time: Tue, 09 Aug 2011 13:36:44 -0500 Selecting site:7 > > Submitting:1 Failed:2 > > Exception in cat: > > Arguments: [data.txt] > > Host: ssh > > Directory: 001-catsn-ssh-20110809-1336-ohte788a/jobs/o/cat-oq74h7ek > > - - - > > > > > > Caused by: null > > Caused by: > > org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid private key or passphrase > > Caused by: > > 
com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: Can't > > read key due to cryptography problems: > > java.security.NoSuchAlgorithmException: Unsupported passphrase > > algorithm: AES-128-CBC > > "error_log.log" 105L, 5770C > > > > > > My auth.defaults reads: > > > > > > login1.beagle.ci.uchicago.edu.type=key > > login1.beagle.ci.uchicago.edu.username=achavez > > login1.beagle.ci.uchicago.edu.key=/home/Alberto/.ssh/identity > > > > > > login1.pads.ci.uchicago.edu.type=key > > login1.pads.ci.uchicago.edu.username=achavez > > login1.pads.ci.uchicago.edu.key=/home/Alberto/.ssh/identity > > > > > > > > > > and it has been set to 600, I ommited the passphrase line, but it is > > there, and the passphrase is right because I just verified it in two > > ways: > > 1) by logging to pads and beagle without providing a password > > 2) "changed" the password. I the "new" password is the same as the > > "old" one. > > > > sites.templates.xml: > > > > > > > > > jobmanager="ssh"/> > > > > 0 > > /home/achavez/swiftwork > > > > > > > > > > config file: > > > > wrapperlog.always.transfer=true > > sitedir.keep=true > > execution.retries=0 > > lazy.errors=true > > status.mode=provider > > use.provider.staging=true > > provider.staging.pin.swiftfiles=false > > foreach.max.threads=10 > > provenance.log=true > > > > > > > > > > > > I also tried a simpler SwiftScript: > > > > > > type filemsg; > > > > > > app (filemsg output) hello(string s) > > { > > echo s stdout=@filename(output); > > } > > > > > > filemsg myfile<"dogcatdinosaur.out">; > > myfile = hello("dog,cat,dinosaur"); > > > > > > and I get the following output: > > > > > > Swift svn swift-r4861 (swift modified locally) cog-r3183 > > > > > > RunID: 20110809-1343-2es2hel2 > > Progress: time: Tue, 09 Aug 2011 13:43:25 -0500 > > Exception in echo: > > Arguments: [dog,cat,dinosaur] > > Host: ssh > > Directory: hello_swift-20110809-1343-2es2hel2/jobs/0/echo-0oldh7ek > > - - - > > > > > > Caused by: null > > Caused by: > > 
org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid private key or passphrase > > Caused by: > > com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: Can't > > read key due to cryptography problems: > > java.security.NoSuchAlgorithmException: Unsupported passphrase > > algorithm: AES-128-CBC > > Final status: time: Tue, 09 Aug 2011 13:43:26 -0500 Failed:1 > > The following errors have occurred: > > 1. Can't read key due to cryptography problems: > > java.security.NoSuchAlgorithmException: Unsupported passphrase > > algorithm: AES-128-CBC > > > > > > > > > > any thoughts on this? From jonmon at mcs.anl.gov Wed Aug 10 19:09:03 2011 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Wed, 10 Aug 2011 19:09:03 -0500 Subject: [Swift-devel] ssh test case on pads/beagle In-Reply-To: References: , <1312916226.2671.2.camel@blabla> Message-ID: <47D22625-F683-4FCF-A1B5-2DE0D789911E@mcs.anl.gov> Exit code "127" normally means that a particular command doesn't exist. Are you sure that all those paths to apps exist? Also, I am not sure if this is a problem but shouldn't there be a third column in the app file? Like "ssh echo /bin/echo null null null" On Aug 10, 2011, at 6:41 PM, Alberto Chavez wrote: > I changed my ssh-key, and they worked on the MCS machines because the authorized_keys file has not been updated yet on the CI Machines.
> I created a new ssh-key using: > ssh-keygen -t rsa -b 2048 > exactly as the MCS site suggested, > On the other hand, I still have a problem, I am getting the following error: > > Swift svn swift-r4978 cog-r3226 > > RunID: 20110810-1819-1cdo2o62 > Progress: time: Wed, 10 Aug 2011 18:19:42 -0500 > Exception in cat: > Arguments: [data.txt] > Host: ssh > Directory: 001-catsn-ssh-20110810-1819-1cdo2o62/jobs/9/cat-9jd0g9ek > - - - > Caused by: null > Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 127 > Final status: time: Wed, 10 Aug 2011 18:20:00 -0500 Failed:10 > The following errors have occurred: > 1. Job failed with an exit code of 127 (10 times) > > > These are the contents of the log: > > Execution completed with errors > > 2011-08-10 18:19:43,251-0500 INFO ConnectionProtocol Freeing channel 0 [Unnamed Channel] > 2011-08-10 18:19:43,263-0500 INFO Exec Exit code 127 > 2011-08-10 18:19:43,269-0500 INFO ConnectionProtocol Freeing channel 0 [Unnamed Channel] > 2011-08-10 18:19:43,277-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=cat-9jd0g9ek - Application exception: null > Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 127 > 2011-08-10 18:19:43,280-0500 INFO vdl:execute END_FAILURE thread=0-5-3-1 tr=cat > 2011-08-10 18:19:43,281-0500 INFO vdl:execute Exception in cat: > at org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:250) > at org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:254) > at org.globus.cog.karajan.workflow.nodes.GenerateErrorNode.post(GenerateErrorNode.java:27) > at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) > at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) > at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > at 
org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) > at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:29) > at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:20) > at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:139) > at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:197) > at org.globus.cog.karajan.workflow.FlowElementWrapper.start(FlowElementWrapper.java:227) > at org.globus.cog.karajan.workflow.events.EventBus.start(EventBus.java:104) > at org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:40) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) > at java.util.concurrent.FutureTask.run(FutureTask.java:166) > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:636) > 2011-08-10 18:20:00,332-0500 INFO ExecutionContext Detailed exception: > > Execution completed with errors > > at org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:250) > at org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:254) > at org.globus.cog.karajan.workflow.nodes.GenerateErrorNode.post(GenerateErrorNode.java:27) > at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) > at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) > at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) > at 
org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:29) > at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:20) > at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:139) > at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:197) > at org.globus.cog.karajan.workflow.FlowElementWrapper.start(FlowElementWrapper.java:227) > at org.globus.cog.karajan.workflow.events.EventBus.start(EventBus.java:104) > at org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:40) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) > at java.util.concurrent.FutureTask.run(FutureTask.java:166) > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:636) > > I believe that the problem resides on the TC file because when I run a much simpler SwiftScript like: > > int i = 9; > trace(i); > > I get the following output: > > swift traceme.swift -tc.file tc.template.data -sites.file sites.template.xml -config cf > Swift svn swift-r4978 cog-r3226 > > RunID: 20110810-1832-buktjj3d > Progress: time: Wed, 10 Aug 2011 18:32:30 -0500 > SwiftScript trace: 9.0 > Final status: time: Wed, 10 Aug 2011 18:32:30 -0500 > > but as soon as I start using the commands stated the TC file, I get the "exit code 127" > > My tc file reads: > > ssh echo /bin/echo null null > ssh cat /bin/cat null null > ssh ls /bin/ls null null > ssh grep /bin/grep null null > ssh sort /bin/sort null null > ssh paste /bin/paste null null > ssh wc /usr/bin/wc null null > > I am working on the login node of the MCS machine trying to ssh via Swift to steamroller. 
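For reference on the "exit code 127" reported above: a POSIX shell returns status 127 when it cannot find the command it was asked to run. The error therefore usually means that something on the remote side could not be located, which may or may not be the executable named in the tc file. A minimal Python sketch of that convention, assuming a POSIX `sh` is available locally; the command name is a deliberately nonexistent, hypothetical one and nothing here is Swift-specific:

```python
import subprocess


def shell_exit_code(command: str) -> int:
    """Run `command` through sh and return the shell's exit status."""
    completed = subprocess.run(
        ["sh", "-c", command],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode


# Hypothetical command name, chosen so that it is not found on PATH.
missing = shell_exit_code("definitely-not-a-real-command-xyz")
# `true` exists on any POSIX system and exits successfully.
present = shell_exit_code("true")
print(missing, present)
```

Running the same kind of check by hand over ssh on the target host (for example against /bin/cat) is one way to separate a path problem from a problem in the execution environment.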
> > > Subject: Re: [Swift-devel] ssh test case on pads/beagle > > From: hategan at mcs.anl.gov > > To: alberto_chavez at live.com > > CC: swift-devel at ci.uchicago.edu > > Date: Tue, 9 Aug 2011 11:57:06 -0700 > > > > Hmm: Unsupported passphrase algorithm: AES-128-CBC > > > > I'll try to see how that can be fixed. In the mean time, can you > > generate a new key pair with 3DES encryption instead and use that? > > > > On Tue, 2011-08-09 at 13:43 -0500, Alberto Chavez wrote: > > > Hello, > > > > > > > > > I am trying to run a simpler case than ssh-pbs-coaster test case, and > > > I'm still having the same error. > > > Now I am running only ssh test case > > > (/tests/providers/ssh/001-catsn-ssn.swift) > > > > > > > > > The command line is: > > > swift -config cf -tc.file tc.template.data -sites.file > > > sites.template.xml 001-catsn-ssh.swift > > > > > > > > > The output: > > > Swift svn swift-r4861 (swift modified locally) cog-r3183 > > > > > > > > > RunID: 20110809-1336-ohte788a > > > Progress: time: Tue, 09 Aug 2011 13:36:42 -0500 > > > Exception in cat: > > > Arguments: [data.txt] > > > Host: ssh > > > Directory: 001-catsn-ssh-20110809-1336-ohte788a/jobs/m/cat-mq74h7ek > > > - - - > > > > > > > > > Caused by: null > > > Caused by: > > > org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid private key or passphrase > > > Caused by: > > > com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: Can't > > > read key due to cryptography problems: > > > java.security.NoSuchAlgorithmException: Unsupported passphrase > > > algorithm: AES-128-CBC > > > Progress: time: Tue, 09 Aug 2011 13:36:43 -0500 Selecting site:8 > > > Submitting:1 Failed:1 > > > Exception in cat: > > > Arguments: [data.txt] > > > Host: ssh > > > Directory: 001-catsn-ssh-20110809-1336-ohte788a/jobs/n/cat-nq74h7ek > > > - - - > > > > > > > > > Caused by: null > > > Caused by: > > > org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: 
Invalid private key or passphrase > > > Caused by: > > > com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: Can't > > > read key due to cryptography problems: > > > java.security.NoSuchAlgorithmException: Unsupported passphrase > > > algorithm: AES-128-CBC > > > Progress: time: Tue, 09 Aug 2011 13:36:44 -0500 Selecting site:7 > > > Submitting:1 Failed:2 > > > Exception in cat: > > > Arguments: [data.txt] > > > Host: ssh > > > Directory: 001-catsn-ssh-20110809-1336-ohte788a/jobs/o/cat-oq74h7ek > > > - - - > > > > > > > > > Caused by: null > > > Caused by: > > > org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid private key or passphrase > > > Caused by: > > > com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: Can't > > > read key due to cryptography problems: > > > java.security.NoSuchAlgorithmException: Unsupported passphrase > > > algorithm: AES-128-CBC > > > "error_log.log" 105L, 5770C > > > > > > > > > My auth.defaults reads: > > > > > > > > > login1.beagle.ci.uchicago.edu.type=key > > > login1.beagle.ci.uchicago.edu.username=achavez > > > login1.beagle.ci.uchicago.edu.key=/home/Alberto/.ssh/identity > > > > > > > > > login1.pads.ci.uchicago.edu.type=key > > > login1.pads.ci.uchicago.edu.username=achavez > > > login1.pads.ci.uchicago.edu.key=/home/Alberto/.ssh/identity > > > > > > > > > > > > > > > and it has been set to 600, I ommited the passphrase line, but it is > > > there, and the passphrase is right because I just verified it in two > > > ways: > > > 1) by logging to pads and beagle without providing a password > > > 2) "changed" the password. I the "new" password is the same as the > > > "old" one. 
> > > > > > sites.templates.xml: > > > > > > > > > > > > > > jobmanager="ssh"/> > > > > > > 0 > > > /home/achavez/swiftwork > > > > > > > > > > > > > > > config file: > > > > > > wrapperlog.always.transfer=true > > > sitedir.keep=true > > > execution.retries=0 > > > lazy.errors=true > > > status.mode=provider > > > use.provider.staging=true > > > provider.staging.pin.swiftfiles=false > > > foreach.max.threads=10 > > > provenance.log=true > > > > > > > > > > > > > > > > > > I also tried a simpler SwiftScript: > > > > > > > > > type filemsg; > > > > > > > > > app (filemsg output) hello(string s) > > > { > > > echo s stdout=@filename(output); > > > } > > > > > > > > > filemsg myfile<"dogcatdinosaur.out">; > > > myfile = hello("dog,cat,dinosaur"); > > > > > > > > > and I get the following output: > > > > > > > > > Swift svn swift-r4861 (swift modified locally) cog-r3183 > > > > > > > > > RunID: 20110809-1343-2es2hel2 > > > Progress: time: Tue, 09 Aug 2011 13:43:25 -0500 > > > Exception in echo: > > > Arguments: [dog,cat,dinosaur] > > > Host: ssh > > > Directory: hello_swift-20110809-1343-2es2hel2/jobs/0/echo-0oldh7ek > > > - - - > > > > > > > > > Caused by: null > > > Caused by: > > > org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid private key or passphrase > > > Caused by: > > > com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: Can't > > > read key due to cryptography problems: > > > java.security.NoSuchAlgorithmException: Unsupported passphrase > > > algorithm: AES-128-CBC > > > Final status: time: Tue, 09 Aug 2011 13:43:26 -0500 Failed:1 > > > The following errors have occurred: > > > 1. Can't read key due to cryptography problems: > > > java.security.NoSuchAlgorithmException: Unsupported passphrase > > > algorithm: AES-128-CBC > > > > > > > > > > > > > > > any thoughts on this? 
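The "Unsupported passphrase algorithm: AES-128-CBC" error quoted above names the cipher used to encrypt the private key file. In the traditional PEM private key format, that cipher is recorded in a `DEK-Info:` header near the top of the file, so it can be inspected before the key is handed to Swift. A minimal Python sketch, assuming a traditional PEM-format key; the function name and the sample text are illustrative and not taken from the thread:

```python
from typing import Optional


def pem_cipher(pem_text: str) -> Optional[str]:
    """Return the cipher named in a PEM 'DEK-Info' header, or None if absent."""
    for line in pem_text.splitlines():
        if line.startswith("DEK-Info:"):
            # The header has the form "DEK-Info: <cipher>,<hex IV>".
            value = line.split(":", 1)[1].strip()
            return value.split(",", 1)[0]
    return None


# Illustrative header only: the key body is omitted and the IV is made up.
sample = """-----BEGIN RSA PRIVATE KEY-----
Proc-Type: 4,ENCRYPTED
DEK-Info: AES-128-CBC,0123456789ABCDEF0123456789ABCDEF

(base64 key data omitted)
-----END RSA PRIVATE KEY-----
"""
print(pem_cipher(sample))
```

A key encrypted with 3DES, as suggested in the reply, would typically show `DES-EDE3-CBC` in this header instead.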
From alberto_chavez at live.com Wed Aug 10 19:16:32 2011
From: alberto_chavez at live.com (Alberto Chavez)
Date: Wed, 10 Aug 2011 19:16:32 -0500
Subject: [Swift-devel] ssh test case on pads/beagle
In-Reply-To: <47D22625-F683-4FCF-A1B5-2DE0D789911E@mcs.anl.gov>
References: , <1312916226.2671.2.camel@blabla> , <47D22625-F683-4FCF-A1B5-2DE0D789911E@mcs.anl.gov>
Message-ID:

Exit code "127" normally means that a particular function doesn't exist. Are you sure that all those paths to apps exist?

> Yes, I double-checked that and those are the right paths to the apps.

Also, I am not sure if this is a problem but shouldn't there be a third column in the app file? Like
"ssh echo /bin/echo null null null"

Looking at the documentation for the transformation catalog, the structure should be:

site, transformation name, executable path, installation status, platform, and profile entries.

The installation status and platform fields are not used. Set them to INSTALLED and INTEL32::LINUX respectively.

The profiles field should be set to null if no profile entries are to be specified, or should contain the profile entries separated by semicolons.

but even when I switch the columns to INSTALLED and INTEL32::LINUX and keep the profiles field set to null, I'm still getting the same exit code.

On Aug 10, 2011, at 6:41 PM, Alberto Chavez wrote:

I changed my ssh-key, and they worked on the MCS machines because the authorized_keys file has not been updated yet on the CI Machines.
From jonmon at mcs.anl.gov Wed Aug 10 23:54:24 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Wed, 10 Aug 2011 23:54:24 -0500
Subject: [Swift-devel] ssh test case on pads/beagle
Message-ID: <20110811045409.99852121A8@zimbra.anl.gov>

Could you post the sites file?

----- Reply message -----
From: "Alberto Chavez"
Date: Wed, Aug 10, 2011 7:16 pm
Subject: [Swift-devel] ssh test case on pads/beagle
To:
Cc: "Mihael Hategan" , "Swift Devel"

From alberto_chavez at live.com Thu Aug 11 01:17:40 2011
From: alberto_chavez at live.com (Alberto Chavez)
Date: Thu, 11 Aug 2011 01:17:40 -0500
Subject: [Swift-devel] ssh test case on pads/beagle
In-Reply-To: <20110811045409.99852121A8@zimbra.anl.gov>
References: <20110811045409.99852121A8@zimbra.anl.gov>
Message-ID:

Sure:

0
/home/achavez/swiftwork

To: alberto_chavez at live.com
From: jonmon at mcs.anl.gov
CC: hategan at mcs.anl.gov; swift-devel at ci.uchicago.edu
Subject: Re: [Swift-devel] ssh test case on pads/beagle
Date: Wed, 10 Aug 2011 23:54:24 -0500

Could you post the sites file?

From hategan at mcs.anl.gov Thu Aug 11 02:18:07 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 11 Aug 2011 00:18:07 -0700
Subject: [Swift-devel] ssh test case on pads/beagle
In-Reply-To:
References: <20110811045409.99852121A8@zimbra.anl.gov>
Message-ID: <1313047087.3215.6.camel@blabla>

Can you post (a link to) the entire log file? Since it contains both the tc.data and sites.xml and the error, it's probably always better to post that than individual snippets.

On Thu, 2011-08-11 at 01:17 -0500, Alberto Chavez wrote:
> Sure:
>
> 0
> /home/achavez/swiftwork
>
> ______________________________________________________________________
> To: alberto_chavez at live.com
> From: jonmon at mcs.anl.gov
> CC: hategan at mcs.anl.gov; swift-devel at ci.uchicago.edu
> Subject: Re: [Swift-devel] ssh test case on pads/beagle
> Date: Wed, 10 Aug 2011 23:54:24 -0500
>
> Could you post the sites file?
> > ----- Reply message ----- > From: "Alberto Chavez" > Date: Wed, Aug 10, 2011 7:16 pm > Subject: [Swift-devel] ssh test case on pads/beagle > To: > Cc: "Mihael Hategan" , "Swift Devel" > > > > > Exit code "127" normally means that a particular function doesn't > exist. Are you sure that all those paths to apps exist? > > Yes, I doubled check that and those are the right paths to the apps. > > > Also, I am not sure if this is a problem but shouldn't there be a > third column in the app file? LIke > "ssh echo /bin/echo null null null" > > > > > Looking at the documentation for the transformation catalog, the > structure should be: > > site, transformation name, executable path, installation status, > platform, and profile entries. > > > > > > The installation status and platform fields are not used. Set them > to INSTALLED and INTEL32::LINUX respectively. > > The profiles field should be set to null if no profile entries are to > be specified, or should contain the profile entries separated by > semicolons. > > > but even when I switch the columns to INSTALLED and INTEL32::LINUX and > keep the profiles field set to null, I'm still getting the same exit > code. > > > On Aug 10, 2011, at 6:41 PM, Alberto Chavez wrote: > > I changed my ssh-key, and they worked on the MCS machines > because the authorized_keys file has not been updated yet on > the CI Machines. 
> I created a new ssh-key using: > ssh-keygen -t rsa -b 2048 > exactly as the MCS site suggested, > On the other hand, I still have a problem, I am getting the > following error: > > > Swift svn swift-r4978 cog-r3226 > > > RunID: 20110810-1819-1cdo2o62 > Progress: time: Wed, 10 Aug 2011 18:19:42 -0500 > Exception in cat: > Arguments: [data.txt] > Host: ssh > Directory: > 001-catsn-ssh-20110810-1819-1cdo2o62/jobs/9/cat-9jd0g9ek > - - - > Caused by: null > Caused by: > org.globus.cog.abstraction.impl.common.execution.JobException: > Job failed with an exit code of 127 > Final status: time: Wed, 10 Aug 2011 18:20:00 -0500 > Failed:10 > The following errors have occurred: > 1. Job failed with an exit code of 127 (10 times) > > > > > These are the contents of the log: > > > Execution completed with errors > > > 2011-08-10 18:19:43,251-0500 INFO ConnectionProtocol Freeing > channel 0 [Unnamed Channel] > 2011-08-10 18:19:43,263-0500 INFO Exec Exit code 127 > 2011-08-10 18:19:43,269-0500 INFO ConnectionProtocol Freeing > channel 0 [Unnamed Channel] > 2011-08-10 18:19:43,277-0500 DEBUG vdl:execute2 > APPLICATION_EXCEPTION jobid=cat-9jd0g9ek - Application > exception: null > Caused by: > org.globus.cog.abstraction.impl.common.execution.JobException: > Job failed with an exit code of 127 > 2011-08-10 18:19:43,280-0500 INFO vdl:execute END_FAILURE > thread=0-5-3-1 tr=cat > 2011-08-10 18:19:43,281-0500 INFO vdl:execute Exception in > cat: > at > org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:250) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:254) > at > org.globus.cog.karajan.workflow.nodes.GenerateErrorNode.post(GenerateErrorNode.java:27) > at > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) > at > org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) 
> at > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) > at > org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:29) > at > org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:20) > at > org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:139) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:197) > at > org.globus.cog.karajan.workflow.FlowElementWrapper.start(FlowElementWrapper.java:227) > at > org.globus.cog.karajan.workflow.events.EventBus.start(EventBus.java:104) > at > org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:40) > at java.util.concurrent.Executors > $RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask > $Sync.innerRun(FutureTask.java:334) > at > java.util.concurrent.FutureTask.run(FutureTask.java:166) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at java.util.concurrent.ThreadPoolExecutor > $Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:636) > 2011-08-10 18:20:00,332-0500 INFO ExecutionContext Detailed > exception: > > > Execution completed with errors > > > at > org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:250) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:254) > at > org.globus.cog.karajan.workflow.nodes.GenerateErrorNode.post(GenerateErrorNode.java:27) > at > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) > at > org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > at > 
org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) > at > org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:29) > at > org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:20) > at > org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:139) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:197) > at > org.globus.cog.karajan.workflow.FlowElementWrapper.start(FlowElementWrapper.java:227) > at > org.globus.cog.karajan.workflow.events.EventBus.start(EventBus.java:104) > at > org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:40) > at java.util.concurrent.Executors > $RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask > $Sync.innerRun(FutureTask.java:334) > at > java.util.concurrent.FutureTask.run(FutureTask.java:166) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at java.util.concurrent.ThreadPoolExecutor > $Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:636) > > I believe that the problem resides on the TC file because when > I run a much simpler SwiftScript like: > > > int i = 9; > trace(i); > > > I get the following output: > > > swift traceme.swift -tc.file tc.template.data > -sites.file sites.template.xml -config cf > Swift svn swift-r4978 cog-r3226 > > > RunID: 20110810-1832-buktjj3d > Progress: time: Wed, 10 Aug 2011 18:32:30 -0500 > SwiftScript trace: 9.0 > Final status: time: Wed, 10 Aug 2011 18:32:30 -0500 > > > but as soon as I start using the commands stated the TC file, > I get the "exit code 127" > > > My tc file reads: > > > ssh echo /bin/echo null null > ssh cat /bin/cat null null > ssh ls /bin/ls null null > ssh grep /bin/grep null null > ssh sort /bin/sort null null > ssh paste /bin/paste null null > 
ssh wc /usr/bin/wc null null > > > I am working on the login node of the MCS machine trying to > ssh via Swift to steamroller. > > > > Subject: Re: [Swift-devel] ssh test case on pads/beagle > > From: hategan at mcs.anl.gov > > To: alberto_chavez at live.com > > CC: swift-devel at ci.uchicago.edu > > Date: Tue, 9 Aug 2011 11:57:06 -0700 > > > > Hmm: Unsupported passphrase algorithm: AES-128-CBC > > > > I'll try to see how that can be fixed. In the mean time, can > you > > generate a new key pair with 3DES encryption instead and use > that? > > > > On Tue, 2011-08-09 at 13:43 -0500, Alberto Chavez wrote: > > > Hello, > > > > > > > > > I am trying to run a simpler case than ssh-pbs-coaster > test case, and > > > I'm still having the same error. > > > Now I am running only ssh test case > > > (/tests/providers/ssh/001-catsn-ssn.swift) > > > > > > > > > The command line is: > > > swift -config cf -tc.file tc.template.data -sites.file > > > sites.template.xml 001-catsn-ssh.swift > > > > > > > > > The output: > > > Swift svn swift-r4861 (swift modified locally) cog-r3183 > > > > > > > > > RunID: 20110809-1336-ohte788a > > > Progress: time: Tue, 09 Aug 2011 13:36:42 -0500 > > > Exception in cat: > > > Arguments: [data.txt] > > > Host: ssh > > > Directory: > 001-catsn-ssh-20110809-1336-ohte788a/jobs/m/cat-mq74h7ek > > > - - - > > > > > > > > > Caused by: null > > > Caused by: > > > > org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid private key or passphrase > > > Caused by: > > > > com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: > Can't > > > read key due to cryptography problems: > > > java.security.NoSuchAlgorithmException: Unsupported > passphrase > > > algorithm: AES-128-CBC > > > Progress: time: Tue, 09 Aug 2011 13:36:43 -0500 Selecting > site:8 > > > Submitting:1 Failed:1 > > > Exception in cat: > > > Arguments: [data.txt] > > > Host: ssh > > > Directory: > 001-catsn-ssh-20110809-1336-ohte788a/jobs/n/cat-nq74h7ek > > 
> - - - > > > > > > > > > Caused by: null > > > Caused by: > > > > org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid private key or passphrase > > > Caused by: > > > > com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: > Can't > > > read key due to cryptography problems: > > > java.security.NoSuchAlgorithmException: Unsupported > passphrase > > > algorithm: AES-128-CBC > > > Progress: time: Tue, 09 Aug 2011 13:36:44 -0500 Selecting > site:7 > > > Submitting:1 Failed:2 > > > Exception in cat: > > > Arguments: [data.txt] > > > Host: ssh > > > Directory: > 001-catsn-ssh-20110809-1336-ohte788a/jobs/o/cat-oq74h7ek > > > - - - > > > > > > > > > Caused by: null > > > Caused by: > > > > org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid private key or passphrase > > > Caused by: > > > > com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: > Can't > > > read key due to cryptography problems: > > > java.security.NoSuchAlgorithmException: Unsupported > passphrase > > > algorithm: AES-128-CBC > > > "error_log.log" 105L, 5770C > > > > > > > > > My auth.defaults reads: > > > > > > > > > login1.beagle.ci.uchicago.edu.type=key > > > login1.beagle.ci.uchicago.edu.username=achavez > > > > login1.beagle.ci.uchicago.edu.key=/home/Alberto/.ssh/identity > > > > > > > > > login1.pads.ci.uchicago.edu.type=key > > > login1.pads.ci.uchicago.edu.username=achavez > > > > login1.pads.ci.uchicago.edu.key=/home/Alberto/.ssh/identity > > > > > > > > > > > > > > > and it has been set to 600, I ommited the passphrase line, > but it is > > > there, and the passphrase is right because I just verified > it in two > > > ways: > > > 1) by logging to pads and beagle without providing a > password > > > 2) "changed" the password. I the "new" password is the > same as the > > > "old" one. 
> > > > > > sites.templates.xml: > > > > > > > > > > > > url="login1.pads.ci.uchicago.edu" > > > jobmanager="ssh"/> > > > url="login1.pads.ci.uchicago.edu" /> > > > 0 > > > /home/achavez/swiftwork > > > > > > > > > > > > > > > config file: > > > > > > wrapperlog.always.transfer=true > > > sitedir.keep=true > > > execution.retries=0 > > > lazy.errors=true > > > status.mode=provider > > > use.provider.staging=true > > > provider.staging.pin.swiftfiles=false > > > foreach.max.threads=10 > > > provenance.log=true > > > > > > > > > > > > > > > > > > I also tried a simpler SwiftScript: > > > > > > > > > type filemsg; > > > > > > > > > app (filemsg output) hello(string s) > > > { > > > echo s stdout=@filename(output); > > > } > > > > > > > > > filemsg myfile<"dogcatdinosaur.out">; > > > myfile = hello("dog,cat,dinosaur"); > > > > > > > > > and I get the following output: > > > > > > > > > Swift svn swift-r4861 (swift modified locally) cog-r3183 > > > > > > > > > RunID: 20110809-1343-2es2hel2 > > > Progress: time: Tue, 09 Aug 2011 13:43:25 -0500 > > > Exception in echo: > > > Arguments: [dog,cat,dinosaur] > > > Host: ssh > > > Directory: > hello_swift-20110809-1343-2es2hel2/jobs/0/echo-0oldh7ek > > > - - - > > > > > > > > > Caused by: null > > > Caused by: > > > > org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid private key or passphrase > > > Caused by: > > > > com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: > Can't > > > read key due to cryptography problems: > > > java.security.NoSuchAlgorithmException: Unsupported > passphrase > > > algorithm: AES-128-CBC > > > Final status: time: Tue, 09 Aug 2011 13:43:26 -0500 > Failed:1 > > > The following errors have occurred: > > > 1. Can't read key due to cryptography problems: > > > java.security.NoSuchAlgorithmException: Unsupported > passphrase > > > algorithm: AES-128-CBC > > > > > > > > > > > > > > > any thoughts on this? 
From benc at hawaga.org.uk Thu Aug 11 04:57:15 2011
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 11 Aug 2011 11:57:15 +0200
Subject: [Swift-devel] Call function.
Message-ID: <84DC8B4A-4324-46AD-929C-2ADA0EFBDA62@hawaga.org.uk>

That's a big jump away from compile-time type checking: you can't check the return types if you don't know anything about the function to call.

Does that matter for Swift? It's nice to find errors before you embark on a long run. But the strongly-typed-ness of Swift doesn't otherwise seem too useful.

Do you need general string-based invocation? Where are you getting these strings from?

If you want to preserve type checking, then in the example in your message you could use first-class function references (e.g. functions in Haskell, function pointers in C), which can carry their type with them. Your example might then look like:

int assoc_array [ string ] [ int ];
assoc_array[do_x][1] = 1000; assoc_array[do_x][2] = 1001;
assoc_array[do_y][1] = 5000; assoc_array[do_y][2] = 5002;
foreach i in assoc_array {
    foreach vi in assoc_array[ i ] {
        call ( i , vi );
    }
}

All I did there was remove the quotes, and make each function usable as a variable name (which happens in other languages too - both C and Haskell).

The return type of call( f, ...) is then the return type of f, and the types of the other parameters of the call are the types of the other parameters of f.

That's making the type system more fancy, though, in a way that might not actually make this more useful for users doing actual things (but I don't know what your real application use case is).
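The idea of function references that carry their own type can be sketched outside SwiftScript. The following is a minimal Python illustration, not Swift or Karajan code: `do_x` and `do_y` are taken from the example in the thread, their bodies are invented, and `check_call`, `call`, and `run` are hypothetical helpers. It assumes every parameter is annotated with a plain class such as `int`.

```python
import inspect


def do_x(v: int) -> int:
    # Placeholder body; the thread does not say what do_x does.
    return v + 1


def do_y(v: int) -> int:
    # Placeholder body; the thread does not say what do_y does.
    return v * 2


def check_call(f, *args):
    """Check argument count and annotated types without running f.

    Returns the declared return type of f.
    """
    sig = inspect.signature(f)
    params = list(sig.parameters.values())
    if len(params) != len(args):
        raise TypeError(
            f"{f.__name__} expects {len(params)} argument(s), got {len(args)}"
        )
    for p, a in zip(params, args):
        declared = p.annotation
        if declared is not inspect.Parameter.empty and not isinstance(a, declared):
            raise TypeError(
                f"{f.__name__}: argument {p.name!r} should be {declared.__name__}"
            )
    return sig.return_annotation


def call(f, *args):
    # The result type of call(f, ...) is whatever f returns.
    return f(*args)


# Keyed by the function itself rather than by the string "do_x".
assoc_array = {do_x: {1: 1000, 2: 1001}, do_y: {1: 5000, 2: 5002}}


def run(table):
    # Pass 1: check every planned call before any work starts.
    for f, values in table.items():
        for v in values.values():
            check_call(f, v)
    # Pass 2: execute the calls.
    return {
        f.__name__: [call(f, v) for v in values.values()]
        for f, values in table.items()
    }
```

Because the table holds the functions themselves, every planned call can be checked in a first pass, so a mismatched argument is reported before any long-running step starts.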
It also excludes the use case of the function names really being dynamic - for example, something like:

s = read(file_containing_a_function_name);
call(s,4);

Is that a use case you expect?

From yadudoc1729 at gmail.com Thu Aug 11 07:45:38 2011
From: yadudoc1729 at gmail.com (Yadu Nand)
Date: Thu, 11 Aug 2011 18:15:38 +0530
Subject: [Swift-devel] Call function.
In-Reply-To: <84DC8B4A-4324-46AD-929C-2ADA0EFBDA62@hawaga.org.uk>
References: <84DC8B4A-4324-46AD-929C-2ADA0EFBDA62@hawaga.org.uk>
Message-ID:

> That's a big jump away from compile-time type checking: you can't check the return types if you don't know anything about the function to call.
> Does that matter for Swift? It's nice to find errors before you embark on a long run. But the strongly-typed-ness of Swift doesn't otherwise seem too useful.

Well, what I plan on doing is this: the first string passed to a "call" function will need to be a function identifier, and as we translate to Karajan we look up the type of the function and ensure that the return and input args match. I haven't gotten there yet; I'm still arm-twisting the parser to accept the new syntax.

> Do you need general string-based invocation? Where are you getting these strings from?

Well, I don't understand whether it makes a difference. It's easier with strings, because we then just need to pass them on to executeElement, which now accepts the string identifier of a procedure.

> If you want to preserve type checking, then in the example in your message you could use first-class function references (e.g. functions in Haskell, function pointers in C), which can carry their type with them.
>
> Your example might then look like:
>
> int assoc_array [ string ] [ int ];
> assoc_array[do_x][1] = 1000; assoc_array[do_x][2] = 1001;
> assoc_array[do_y][1] = 5000; assoc_array[do_y][2] = 5002;
> foreach i in assoc_array {
>     foreach vi in assoc_array[ i ] {
>         call ( i , vi );
>     }
> }
>
> All I did there was remove the quotes, and make each function usable as a variable name (which happens in other languages too - both C and Haskell).
>
> The return type of call( f, ...) is then the return type of f, and the types of the other parameters of the call are the types of the other parameters of f.

I don't know enough to actually comment on that, I think.

> That's making the type system more fancy, though, in a way that might not actually make this more useful for users doing actual things (but I don't know what your real application use case is).
> It also excludes the use case of the function names really being dynamic - for example, something like:
> s = read(file_containing_a_function_name);
> call(s,4);
> Is that a use case you expect?

I don't understand this example. What I need is a way to pass functions to other functions. In map, reduce, and fold style functions we need to pass a function and the list of items to operate on. I'm trying to make that possible here. The end result I'm looking for is the map-reduce style.

--
Thanks and Regards,
Yadu Nand B

From benc at hawaga.org.uk Thu Aug 11 08:03:35 2011
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 11 Aug 2011 13:03:35 +0000 (GMT)
Subject: [Swift-devel] Call function.
In-Reply-To:
References: <84DC8B4A-4324-46AD-929C-2ADA0EFBDA62@hawaga.org.uk>
Message-ID:

> > That's a big jump away from compile-time type checking: you
> > can't check the return types if you don't know anything about the
> > function to call. Does that matter for Swift? It's nice to find errors
> > before you embark on a long run. But the strongly-typed-ness of Swift
> > doesn't otherwise seem too useful.

> Well, what I plan on doing is this: the first string passed to a "call" function
> will need to be a function identifier, and as we translate to Karajan we look up
> the type of the function and ensure that the return and input args match.
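The string-based lookup described in the quoted plan can be sketched as follows. This is a minimal Python illustration, not the Swift-to-Karajan translator: `PROCEDURES` is a hypothetical stand-in for the translator's table of declared procedures, and `check`, `call`, and `run` are invented names. It separates names written literally in the script, which are known before the run, from a name that only becomes available while the run is in progress.

```python
# Hypothetical registry of declared procedures and their types.
PROCEDURES = {
    "do_x": {"params": (int,), "returns": int, "impl": lambda v: v + 1},
    "do_y": {"params": (int,), "returns": int, "impl": lambda v: v * 2},
}


def check(name, *args):
    """Raise if name is unknown or args do not match its declared parameters."""
    if name not in PROCEDURES:
        raise NameError(f"no procedure named {name!r}")
    params = PROCEDURES[name]["params"]
    if len(args) != len(params) or not all(
        isinstance(a, t) for a, t in zip(args, params)
    ):
        raise TypeError(f"bad arguments for {name!r}: {args!r}")


def call(name, *args):
    # The check is repeated here because a dynamic name has had no earlier check.
    check(name, *args)
    return PROCEDURES[name]["impl"](*args)


def run(planned_calls, dynamic_name_source):
    # Literal names are known up front, so all of them are checked
    # before any work is done.
    for name, args in planned_calls:
        check(name, *args)
    results = [call(name, *args) for name, args in planned_calls]
    # A name produced during the run can only be checked once it exists,
    # which is after the earlier calls have already executed.
    results.append(call(dynamic_name_source(), 4))
    return results
```

In this sketch a misspelled literal name stops the run before anything executes, while a bad name returned by `dynamic_name_source` surfaces only after the earlier calls have completed.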
If you have an arbitrary string, you can't know what is in that string until runtime - potentially after a lot of other stuff has run. So you can only do that check at runtime - potentially after a lot of other stuff has run. You will eventually be able to check - but I was mostly highlighting the fact that this can't happen at compile time, in general; and then asking (swift people in general) if compile time (i.e. start of the run) type checking matters here. -- From alberto_chavez at live.com Thu Aug 11 08:31:53 2011 From: alberto_chavez at live.com (Alberto Chavez) Date: Thu, 11 Aug 2011 08:31:53 -0500 Subject: [Swift-devel] ssh test case on pads/beagle In-Reply-To: <1313047087.3215.6.camel@blabla> References: <20110811045409.99852121A8@zimbra.anl.gov>, , <1313047087.3215.6.camel@blabla> Message-ID: Sure, attached are the output of stdout and stderror, and the log generated by swift. > Subject: RE: [Swift-devel] ssh test case on pads/beagle > From: hategan at mcs.anl.gov > To: alberto_chavez at live.com > CC: jonmon at mcs.anl.gov; swift-devel at ci.uchicago.edu > Date: Thu, 11 Aug 2011 00:18:07 -0700 > > Can you post (a link to) the entire log file? Since it contains both the > tc.data and sites.xml and the error, it's probably always better to post > than individual snippets. > > On Thu, 2011-08-11 at 01:17 -0500, Alberto Chavez wrote: > > Sure: > > > > > > > > > > > > 0 > > /home/achavez/swiftwork > > > > > > > > > > ______________________________________________________________________ > > To: alberto_chavez at live.com > > From: jonmon at mcs.anl.gov > > CC: hategan at mcs.anl.gov; swift-devel at ci.uchicago.edu > > Subject: Re: [Swift-devel] ssh test case on pads/beagle > > Date: Wed, 10 Aug 2011 23:54:24 -0500 > > > > Could you post the sites file? 
> > > > ----- Reply message ----- > > From: "Alberto Chavez" > > Date: Wed, Aug 10, 2011 7:16 pm > > Subject: [Swift-devel] ssh test case on pads/beagle > > To: > > Cc: "Mihael Hategan" , "Swift Devel" > > > > > > > > > > Exit code "127" normally means that a particular function doesn't > > exist. Are you sure that all those paths to apps exist? > > > Yes, I doubled check that and those are the right paths to the apps. > > > > > > Also, I am not sure if this is a problem but shouldn't there be a > > third column in the app file? LIke > > "ssh echo /bin/echo null null null" > > > > > > > > > > Looking at the documentation for the transformation catalog, the > > structure should be: > > > > site, transformation name, executable path, installation status, > > platform, and profile entries. > > > > > > > > > > > > The installation status and platform fields are not used. Set them > > to INSTALLED and INTEL32::LINUX respectively. > > > > The profiles field should be set to null if no profile entries are to > > be specified, or should contain the profile entries separated by > > semicolons. > > > > > > but even when I switch the columns to INSTALLED and INTEL32::LINUX and > > keep the profiles field set to null, I'm still getting the same exit > > code. > > > > > > On Aug 10, 2011, at 6:41 PM, Alberto Chavez wrote: > > > > I changed my ssh-key, and they worked on the MCS machines > > because the authorized_keys file has not been updated yet on > > the CI Machines. 
> > I created a new ssh-key using: > > ssh-keygen -t rsa -b 2048 > > exactly as the MCS site suggested, > > On the other hand, I still have a problem, I am getting the > > following error: > > > > > > Swift svn swift-r4978 cog-r3226 > > > > > > RunID: 20110810-1819-1cdo2o62 > > Progress: time: Wed, 10 Aug 2011 18:19:42 -0500 > > Exception in cat: > > Arguments: [data.txt] > > Host: ssh > > Directory: > > 001-catsn-ssh-20110810-1819-1cdo2o62/jobs/9/cat-9jd0g9ek > > - - - > > Caused by: null > > Caused by: > > org.globus.cog.abstraction.impl.common.execution.JobException: > > Job failed with an exit code of 127 > > Final status: time: Wed, 10 Aug 2011 18:20:00 -0500 > > Failed:10 > > The following errors have occurred: > > 1. Job failed with an exit code of 127 (10 times) > > > > > > > > > > These are the contents of the log: > > > > > > Execution completed with errors > > > > > > 2011-08-10 18:19:43,251-0500 INFO ConnectionProtocol Freeing > > channel 0 [Unnamed Channel] > > 2011-08-10 18:19:43,263-0500 INFO Exec Exit code 127 > > 2011-08-10 18:19:43,269-0500 INFO ConnectionProtocol Freeing > > channel 0 [Unnamed Channel] > > 2011-08-10 18:19:43,277-0500 DEBUG vdl:execute2 > > APPLICATION_EXCEPTION jobid=cat-9jd0g9ek - Application > > exception: null > > Caused by: > > org.globus.cog.abstraction.impl.common.execution.JobException: > > Job failed with an exit code of 127 > > 2011-08-10 18:19:43,280-0500 INFO vdl:execute END_FAILURE > > thread=0-5-3-1 tr=cat > > 2011-08-10 18:19:43,281-0500 INFO vdl:execute Exception in > > cat: > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:250) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:254) > > at > > org.globus.cog.karajan.workflow.nodes.GenerateErrorNode.post(GenerateErrorNode.java:27) > > at > > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) > > at > > 
org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) > > at > > org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > > at > > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) > > at > > org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:29) > > at > > org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:20) > > at > > org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:139) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:197) > > at > > org.globus.cog.karajan.workflow.FlowElementWrapper.start(FlowElementWrapper.java:227) > > at > > org.globus.cog.karajan.workflow.events.EventBus.start(EventBus.java:104) > > at > > org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:40) > > at java.util.concurrent.Executors > > $RunnableAdapter.call(Executors.java:471) > > at java.util.concurrent.FutureTask > > $Sync.innerRun(FutureTask.java:334) > > at > > java.util.concurrent.FutureTask.run(FutureTask.java:166) > > at > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > > at java.util.concurrent.ThreadPoolExecutor > > $Worker.run(ThreadPoolExecutor.java:603) > > at java.lang.Thread.run(Thread.java:636) > > 2011-08-10 18:20:00,332-0500 INFO ExecutionContext Detailed > > exception: > > > > > > Execution completed with errors > > > > > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:250) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:254) > > at > > org.globus.cog.karajan.workflow.nodes.GenerateErrorNode.post(GenerateErrorNode.java:27) > > at > > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) > > at > > 
org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) > > at > > org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > > at > > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) > > at > > org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:29) > > at > > org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:20) > > at > > org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:139) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:197) > > at > > org.globus.cog.karajan.workflow.FlowElementWrapper.start(FlowElementWrapper.java:227) > > at > > org.globus.cog.karajan.workflow.events.EventBus.start(EventBus.java:104) > > at > > org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:40) > > at java.util.concurrent.Executors > > $RunnableAdapter.call(Executors.java:471) > > at java.util.concurrent.FutureTask > > $Sync.innerRun(FutureTask.java:334) > > at > > java.util.concurrent.FutureTask.run(FutureTask.java:166) > > at > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > > at java.util.concurrent.ThreadPoolExecutor > > $Worker.run(ThreadPoolExecutor.java:603) > > at java.lang.Thread.run(Thread.java:636) > > > > I believe that the problem resides on the TC file because when > > I run a much simpler SwiftScript like: > > > > > > int i = 9; > > trace(i); > > > > > > I get the following output: > > > > > > swift traceme.swift -tc.file tc.template.data > > -sites.file sites.template.xml -config cf > > Swift svn swift-r4978 cog-r3226 > > > > > > RunID: 20110810-1832-buktjj3d > > Progress: time: Wed, 10 Aug 2011 18:32:30 -0500 > > SwiftScript trace: 9.0 > > Final status: time: Wed, 10 Aug 2011 18:32:30 -0500 > > > > > > but as soon as I start 
using the commands stated the TC file, > > I get the "exit code 127" > > > > > > My tc file reads: > > > > > > ssh echo /bin/echo null null > > ssh cat /bin/cat null null > > ssh ls /bin/ls null null > > ssh grep /bin/grep null null > > ssh sort /bin/sort null null > > ssh paste /bin/paste null null > > ssh wc /usr/bin/wc null null > > > > > > I am working on the login node of the MCS machine trying to > > ssh via Swift to steamroller. > > > > > > > Subject: Re: [Swift-devel] ssh test case on pads/beagle > > > From: hategan at mcs.anl.gov > > > To: alberto_chavez at live.com > > > CC: swift-devel at ci.uchicago.edu > > > Date: Tue, 9 Aug 2011 11:57:06 -0700 > > > > > > Hmm: Unsupported passphrase algorithm: AES-128-CBC > > > > > > I'll try to see how that can be fixed. In the mean time, can > > you > > > generate a new key pair with 3DES encryption instead and use > > that? > > > > > > On Tue, 2011-08-09 at 13:43 -0500, Alberto Chavez wrote: > > > > Hello, > > > > > > > > > > > > I am trying to run a simpler case than ssh-pbs-coaster > > test case, and > > > > I'm still having the same error. 
> > > > Now I am running only ssh test case > > > > (/tests/providers/ssh/001-catsn-ssn.swift) > > > > > > > > > > > > The command line is: > > > > swift -config cf -tc.file tc.template.data -sites.file > > > > sites.template.xml 001-catsn-ssh.swift > > > > > > > > > > > > The output: > > > > Swift svn swift-r4861 (swift modified locally) cog-r3183 > > > > > > > > > > > > RunID: 20110809-1336-ohte788a > > > > Progress: time: Tue, 09 Aug 2011 13:36:42 -0500 > > > > Exception in cat: > > > > Arguments: [data.txt] > > > > Host: ssh > > > > Directory: > > 001-catsn-ssh-20110809-1336-ohte788a/jobs/m/cat-mq74h7ek > > > > - - - > > > > > > > > > > > > Caused by: null > > > > Caused by: > > > > > > org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid private key or passphrase > > > > Caused by: > > > > > > com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: > > Can't > > > > read key due to cryptography problems: > > > > java.security.NoSuchAlgorithmException: Unsupported > > passphrase > > > > algorithm: AES-128-CBC > > > > Progress: time: Tue, 09 Aug 2011 13:36:43 -0500 Selecting > > site:8 > > > > Submitting:1 Failed:1 > > > > Exception in cat: > > > > Arguments: [data.txt] > > > > Host: ssh > > > > Directory: > > 001-catsn-ssh-20110809-1336-ohte788a/jobs/n/cat-nq74h7ek > > > > - - - > > > > > > > > > > > > Caused by: null > > > > Caused by: > > > > > > org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid private key or passphrase > > > > Caused by: > > > > > > com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: > > Can't > > > > read key due to cryptography problems: > > > > java.security.NoSuchAlgorithmException: Unsupported > > passphrase > > > > algorithm: AES-128-CBC > > > > Progress: time: Tue, 09 Aug 2011 13:36:44 -0500 Selecting > > site:7 > > > > Submitting:1 Failed:2 > > > > Exception in cat: > > > > Arguments: [data.txt] > > > > Host: ssh > > > > Directory: > > 
001-catsn-ssh-20110809-1336-ohte788a/jobs/o/cat-oq74h7ek > > > > - - - > > > > > > > > > > > > Caused by: null > > > > Caused by: > > > > > > org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid private key or passphrase > > > > Caused by: > > > > > > com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: > > Can't > > > > read key due to cryptography problems: > > > > java.security.NoSuchAlgorithmException: Unsupported > > passphrase > > > > algorithm: AES-128-CBC > > > > "error_log.log" 105L, 5770C > > > > > > > > > > > > My auth.defaults reads: > > > > > > > > > > > > login1.beagle.ci.uchicago.edu.type=key > > > > login1.beagle.ci.uchicago.edu.username=achavez > > > > > > login1.beagle.ci.uchicago.edu.key=/home/Alberto/.ssh/identity > > > > > > > > > > > > login1.pads.ci.uchicago.edu.type=key > > > > login1.pads.ci.uchicago.edu.username=achavez > > > > > > login1.pads.ci.uchicago.edu.key=/home/Alberto/.ssh/identity > > > > > > > > > > > > > > > > > > > > and it has been set to 600, I ommited the passphrase line, > > but it is > > > > there, and the passphrase is right because I just verified > > it in two > > > > ways: > > > > 1) by logging to pads and beagle without providing a > > password > > > > 2) "changed" the password. I the "new" password is the > > same as the > > > > "old" one. 
> > > > > > > > sites.templates.xml: > > > > > > > > > > > > > > > > > url="login1.pads.ci.uchicago.edu" > > > > jobmanager="ssh"/> > > > > > url="login1.pads.ci.uchicago.edu" /> > > > > 0 > > > > /home/achavez/swiftwork > > > > > > > > > > > > > > > > > > > > config file: > > > > > > > > wrapperlog.always.transfer=true > > > > sitedir.keep=true > > > > execution.retries=0 > > > > lazy.errors=true > > > > status.mode=provider > > > > use.provider.staging=true > > > > provider.staging.pin.swiftfiles=false > > > > foreach.max.threads=10 > > > > provenance.log=true > > > > > > > > > > > > > > > > > > > > > > > > I also tried a simpler SwiftScript: > > > > > > > > > > > > type filemsg; > > > > > > > > > > > > app (filemsg output) hello(string s) > > > > { > > > > echo s stdout=@filename(output); > > > > } > > > > > > > > > > > > filemsg myfile<"dogcatdinosaur.out">; > > > > myfile = hello("dog,cat,dinosaur"); > > > > > > > > > > > > and I get the following output: > > > > > > > > > > > > Swift svn swift-r4861 (swift modified locally) cog-r3183 > > > > > > > > > > > > RunID: 20110809-1343-2es2hel2 > > > > Progress: time: Tue, 09 Aug 2011 13:43:25 -0500 > > > > Exception in echo: > > > > Arguments: [dog,cat,dinosaur] > > > > Host: ssh > > > > Directory: > > hello_swift-20110809-1343-2es2hel2/jobs/0/echo-0oldh7ek > > > > - - - > > > > > > > > > > > > Caused by: null > > > > Caused by: > > > > > > org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid private key or passphrase > > > > Caused by: > > > > > > com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: > > Can't > > > > read key due to cryptography problems: > > > > java.security.NoSuchAlgorithmException: Unsupported > > passphrase > > > > algorithm: AES-128-CBC > > > > Final status: time: Tue, 09 Aug 2011 13:43:26 -0500 > > Failed:1 > > > > The following errors have occurred: > > > > 1. 
Can't read key due to cryptography problems: > > > > java.security.NoSuchAlgorithmException: Unsupported > > passphrase > > > > algorithm: AES-128-CBC > > > > > > > > > > > > > > > > > > > > any thoughts on this? > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > >
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 001-catsn-ssh-20110811-0828-s51oubu6.log
Type: text/x-log
Size: 167486 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ssh-test-output.log
Type: text/x-log
Size: 3559 bytes
Desc: not available
URL:

From wilde at mcs.anl.gov Thu Aug 11 08:57:36 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 11 Aug 2011 08:57:36 -0500 (CDT)
Subject: [Swift-devel] ssh test case on pads/beagle
In-Reply-To:
Message-ID: <2078636548.210814.1313071056159.JavaMail.root@zimbra.anl.gov>

Mihael, I've never seen sites.xml entries showing up in the log - are they supposed to be there now? They are not in the log Alberto attached, nor have I seen them in any other log yet.

Can we log all the files mentioned in the command line report (the first line of the log) right at the front, along with the source text? I.e., script, tc, sites, and config? Ideally values for all of the swift.properties? Ideally auth.defaults with suitable masking? 0.94 feature?

>> 2011-08-11 08:28:03,762-0500 DEBUG Loader arguments: [001-catsn-ssh.swift, -tc.file, tc.template.data, -sites.file, sites.template.xml, -config, cf]

Alberto, stop by and we can try to debug this in person, as ssh requires a fair bit of correct configuration to work.

We need to look at the cf, sites.template.xml, and tc.template.data files.

- Mike

----- Original Message -----
From: "Alberto Chavez"
To: "Mihael Hategan"
Cc: "Swift Devel"
Sent: Thursday, August 11, 2011 8:31:53 AM
Subject: Re: [Swift-devel] ssh test case on pads/beagle

Sure, attached are the output of stdout and stderror, and the log generated by swift.

> Subject: RE: [Swift-devel] ssh test case on pads/beagle
> From: hategan at mcs.anl.gov
> To: alberto_chavez at live.com
> CC: jonmon at mcs.anl.gov; swift-devel at ci.uchicago.edu
> Date: Thu, 11 Aug 2011 00:18:07 -0700
>
> Can you post (a link to) the entire log file? Since it contains both the
> tc.data and sites.xml and the error, it's probably always better to post
> than individual snippets.
>
> On Thu, 2011-08-11 at 01:17 -0500, Alberto Chavez wrote:
> > Sure:
> >
> > 0
> > /home/achavez/swiftwork
> >
> > ______________________________________________________________________
> > To: alberto_chavez at live.com
> > From: jonmon at mcs.anl.gov
> > CC: hategan at mcs.anl.gov; swift-devel at ci.uchicago.edu
> > Subject: Re: [Swift-devel] ssh test case on pads/beagle
> > Date: Wed, 10 Aug 2011 23:54:24 -0500
> >
> > Could you post the sites file?
> >
> > ----- Reply message -----
> > From: "Alberto Chavez"
> > Date: Wed, Aug 10, 2011 7:16 pm
> > Subject: [Swift-devel] ssh test case on pads/beagle
> > To:
> > Cc: "Mihael Hategan" , "Swift Devel"
> >
> > Exit code "127" normally means that a particular function doesn't
> > exist. Are you sure that all those paths to apps exist?
> >
> > Yes, I doubled check that and those are the right paths to the apps.
> > > > > > Also, I am not sure if this is a problem but shouldn't there be a > > third column in the app file? LIke > > "ssh echo /bin/echo null null null" > > > > > > > > > > Looking at the documentation for the transformation catalog, the > > structure should be: > > > > site, transformation name, executable path, installation status, > > platform, and profile entries. > > > > > > > > > > > > The installation status and platform fields are not used. Set them > > to INSTALLED and INTEL32::LINUX respectively. > > > > The profiles field should be set to null if no profile entries are to > > be specified, or should contain the profile entries separated by > > semicolons. > > > > > > but even when I switch the columns to INSTALLED and INTEL32::LINUX and > > keep the profiles field set to null, I'm still getting the same exit > > code. > > > > > > On Aug 10, 2011, at 6:41 PM, Alberto Chavez wrote: > > > > I changed my ssh-key, and they worked on the MCS machines > > because the authorized_keys file has not been updated yet on > > the CI Machines. > > I created a new ssh-key using: > > ssh-keygen -t rsa -b 2048 > > exactly as the MCS site suggested, > > On the other hand, I still have a problem, I am getting the > > following error: > > > > > > Swift svn swift-r4978 cog-r3226 > > > > > > RunID: 20110810-1819-1cdo2o62 > > Progress: time: Wed, 10 Aug 2011 18:19:42 -0500 > > Exception in cat: > > Arguments: [data.txt] > > Host: ssh > > Directory: > > 001-catsn-ssh-20110810-1819-1cdo2o62/jobs/9/cat-9jd0g9ek > > - - - > > Caused by: null > > Caused by: > > org.globus.cog.abstraction.impl.common.execution.JobException: > > Job failed with an exit code of 127 > > Final status: time: Wed, 10 Aug 2011 18:20:00 -0500 > > Failed:10 > > The following errors have occurred: > > 1. 
Job failed with an exit code of 127 (10 times) > > > > > > > > > > These are the contents of the log: > > > > > > Execution completed with errors > > > > > > 2011-08-10 18:19:43,251-0500 INFO ConnectionProtocol Freeing > > channel 0 [Unnamed Channel] > > 2011-08-10 18:19:43,263-0500 INFO Exec Exit code 127 > > 2011-08-10 18:19:43,269-0500 INFO ConnectionProtocol Freeing > > channel 0 [Unnamed Channel] > > 2011-08-10 18:19:43,277-0500 DEBUG vdl:execute2 > > APPLICATION_EXCEPTION jobid=cat-9jd0g9ek - Application > > exception: null > > Caused by: > > org.globus.cog.abstraction.impl.common.execution.JobException: > > Job failed with an exit code of 127 > > 2011-08-10 18:19:43,280-0500 INFO vdl:execute END_FAILURE > > thread=0-5-3-1 tr=cat > > 2011-08-10 18:19:43,281-0500 INFO vdl:execute Exception in > > cat: > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:250) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:254) > > at > > org.globus.cog.karajan.workflow.nodes.GenerateErrorNode.post(GenerateErrorNode.java:27) > > at > > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) > > at > > org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > > at > > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) > > at > > org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:29) > > at > > org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:20) > > at > > org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:139) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:197) > > at > > 
org.globus.cog.karajan.workflow.FlowElementWrapper.start(FlowElementWrapper.java:227) > > at > > org.globus.cog.karajan.workflow.events.EventBus.start(EventBus.java:104) > > at > > org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:40) > > at java.util.concurrent.Executors > > $RunnableAdapter.call(Executors.java:471) > > at java.util.concurrent.FutureTask > > $Sync.innerRun(FutureTask.java:334) > > at > > java.util.concurrent.FutureTask.run(FutureTask.java:166) > > at > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > > at java.util.concurrent.ThreadPoolExecutor > > $Worker.run(ThreadPoolExecutor.java:603) > > at java.lang.Thread.run(Thread.java:636) > > 2011-08-10 18:20:00,332-0500 INFO ExecutionContext Detailed > > exception: > > > > > > Execution completed with errors > > > > > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:250) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:254) > > at > > org.globus.cog.karajan.workflow.nodes.GenerateErrorNode.post(GenerateErrorNode.java:27) > > at > > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) > > at > > org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > > at > > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) > > at > > org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:29) > > at > > org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:20) > > at > > org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:139) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:197) > > at > > 
org.globus.cog.karajan.workflow.FlowElementWrapper.start(FlowElementWrapper.java:227) > > at > > org.globus.cog.karajan.workflow.events.EventBus.start(EventBus.java:104) > > at > > org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:40) > > at java.util.concurrent.Executors > > $RunnableAdapter.call(Executors.java:471) > > at java.util.concurrent.FutureTask > > $Sync.innerRun(FutureTask.java:334) > > at > > java.util.concurrent.FutureTask.run(FutureTask.java:166) > > at > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > > at java.util.concurrent.ThreadPoolExecutor > > $Worker.run(ThreadPoolExecutor.java:603) > > at java.lang.Thread.run(Thread.java:636) > > > > I believe that the problem resides on the TC file because when > > I run a much simpler SwiftScript like: > > > > > > int i = 9; > > trace(i); > > > > > > I get the following output: > > > > > > swift traceme.swift -tc.file tc.template.data > > -sites.file sites.template.xml -config cf > > Swift svn swift-r4978 cog-r3226 > > > > > > RunID: 20110810-1832-buktjj3d > > Progress: time: Wed, 10 Aug 2011 18:32:30 -0500 > > SwiftScript trace: 9.0 > > Final status: time: Wed, 10 Aug 2011 18:32:30 -0500 > > > > > > but as soon as I start using the commands stated the TC file, > > I get the "exit code 127" > > > > > > My tc file reads: > > > > > > ssh echo /bin/echo null null > > ssh cat /bin/cat null null > > ssh ls /bin/ls null null > > ssh grep /bin/grep null null > > ssh sort /bin/sort null null > > ssh paste /bin/paste null null > > ssh wc /usr/bin/wc null null > > > > > > I am working on the login node of the MCS machine trying to > > ssh via Swift to steamroller. 
> > > > > > > Subject: Re: [Swift-devel] ssh test case on pads/beagle > > > From: hategan at mcs.anl.gov > > > To: alberto_chavez at live.com > > > CC: swift-devel at ci.uchicago.edu > > > Date: Tue, 9 Aug 2011 11:57:06 -0700 > > > > > > Hmm: Unsupported passphrase algorithm: AES-128-CBC > > > > > > I'll try to see how that can be fixed. In the mean time, can > > you > > > generate a new key pair with 3DES encryption instead and use > > that? > > > > > > On Tue, 2011-08-09 at 13:43 -0500, Alberto Chavez wrote: > > > > Hello, > > > > > > > > > > > > I am trying to run a simpler case than ssh-pbs-coaster > > test case, and > > > > I'm still having the same error. > > > > Now I am running only ssh test case > > > > (/tests/providers/ssh/001-catsn-ssn.swift) > > > > > > > > > > > > The command line is: > > > > swift -config cf -tc.file tc.template.data -sites.file > > > > sites.template.xml 001-catsn-ssh.swift > > > > > > > > > > > > The output: > > > > Swift svn swift-r4861 (swift modified locally) cog-r3183 > > > > > > > > > > > > RunID: 20110809-1336-ohte788a > > > > Progress: time: Tue, 09 Aug 2011 13:36:42 -0500 > > > > Exception in cat: > > > > Arguments: [data.txt] > > > > Host: ssh > > > > Directory: > > 001-catsn-ssh-20110809-1336-ohte788a/jobs/m/cat-mq74h7ek > > > > - - - > > > > > > > > > > > > Caused by: null > > > > Caused by: > > > > > > org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid private key or passphrase > > > > Caused by: > > > > > > com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: > > Can't > > > > read key due to cryptography problems: > > > > java.security.NoSuchAlgorithmException: Unsupported > > passphrase > > > > algorithm: AES-128-CBC > > > > Progress: time: Tue, 09 Aug 2011 13:36:43 -0500 Selecting > > site:8 > > > > Submitting:1 Failed:1 > > > > Exception in cat: > > > > Arguments: [data.txt] > > > > Host: ssh > > > > Directory: > > 001-catsn-ssh-20110809-1336-ohte788a/jobs/n/cat-nq74h7ek 
> > > > - - - > > > > > > > > > > > > Caused by: null > > > > Caused by: > > > > > > org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid private key or passphrase > > > > Caused by: > > > > > > com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: > > Can't > > > > read key due to cryptography problems: > > > > java.security.NoSuchAlgorithmException: Unsupported > > passphrase > > > > algorithm: AES-128-CBC > > > > Progress: time: Tue, 09 Aug 2011 13:36:44 -0500 Selecting > > site:7 > > > > Submitting:1 Failed:2 > > > > Exception in cat: > > > > Arguments: [data.txt] > > > > Host: ssh > > > > Directory: > > 001-catsn-ssh-20110809-1336-ohte788a/jobs/o/cat-oq74h7ek > > > > - - - > > > > > > > > > > > > Caused by: null > > > > Caused by: > > > > > > org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid private key or passphrase > > > > Caused by: > > > > > > com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: > > Can't > > > > read key due to cryptography problems: > > > > java.security.NoSuchAlgorithmException: Unsupported > > passphrase > > > > algorithm: AES-128-CBC > > > > "error_log.log" 105L, 5770C > > > > > > > > > > > > My auth.defaults reads: > > > > > > > > > > > > login1.beagle.ci.uchicago.edu.type=key > > > > login1.beagle.ci.uchicago.edu.username=achavez > > > > > > login1.beagle.ci.uchicago.edu.key=/home/Alberto/.ssh/identity > > > > > > > > > > > > login1.pads.ci.uchicago.edu.type=key > > > > login1.pads.ci.uchicago.edu.username=achavez > > > > > > login1.pads.ci.uchicago.edu.key=/home/Alberto/.ssh/identity > > > > > > > > > > > > > > > > > > > > and it has been set to 600, I ommited the passphrase line, > > but it is > > > > there, and the passphrase is right because I just verified > > it in two > > > > ways: > > > > 1) by logging to pads and beagle without providing a > > password > > > > 2) "changed" the password. 
I the "new" password is the > > same as the > > > > "old" one. > > > > > > > > sites.templates.xml: > > > > > > > > > > > > > > > > > url="login1.pads.ci.uchicago.edu" > > > > jobmanager="ssh"/> > > > > > url="login1.pads.ci.uchicago.edu" /> > > > > 0 > > > > /home/achavez/swiftwork > > > > > > > > > > > > > > > > > > > > config file: > > > > > > > > wrapperlog.always.transfer=true > > > > sitedir.keep=true > > > > execution.retries=0 > > > > lazy.errors=true > > > > status.mode=provider > > > > use.provider.staging=true > > > > provider.staging.pin.swiftfiles=false > > > > foreach.max.threads=10 > > > > provenance.log=true > > > > > > > > > > > > > > > > > > > > > > > > I also tried a simpler SwiftScript: > > > > > > > > > > > > type filemsg; > > > > > > > > > > > > app (filemsg output) hello(string s) > > > > { > > > > echo s stdout=@filename(output); > > > > } > > > > > > > > > > > > filemsg myfile<"dogcatdinosaur.out">; > > > > myfile = hello("dog,cat,dinosaur"); > > > > > > > > > > > > and I get the following output: > > > > > > > > > > > > Swift svn swift-r4861 (swift modified locally) cog-r3183 > > > > > > > > > > > > RunID: 20110809-1343-2es2hel2 > > > > Progress: time: Tue, 09 Aug 2011 13:43:25 -0500 > > > > Exception in echo: > > > > Arguments: [dog,cat,dinosaur] > > > > Host: ssh > > > > Directory: > > hello_swift-20110809-1343-2es2hel2/jobs/0/echo-0oldh7ek > > > > - - - > > > > > > > > > > > > Caused by: null > > > > Caused by: > > > > > > org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid private key or passphrase > > > > Caused by: > > > > > > com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: > > Can't > > > > read key due to cryptography problems: > > > > java.security.NoSuchAlgorithmException: Unsupported > > passphrase > > > > algorithm: AES-128-CBC > > > > Final status: time: Tue, 09 Aug 2011 13:43:26 -0500 > > Failed:1 > > > > The following errors have occurred: > > > > 1. 
Can't read key due to cryptography problems: > > > > java.security.NoSuchAlgorithmException: Unsupported > > passphrase > > > > algorithm: AES-128-CBC > > > > > > > > > > > > > > > > > > > > any thoughts on this? > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > >
_______________________________________________
Swift-devel mailing list
Swift-devel at ci.uchicago.edu
https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From alberto_chavez at live.com Thu Aug 11 10:04:02 2011
From: alberto_chavez at live.com (Alberto Chavez)
Date: Thu, 11 Aug 2011 10:04:02 -0500
Subject: [Swift-devel] ssh test case on pads/beagle
In-Reply-To: <2078636548.210814.1313071056159.JavaMail.root@zimbra.anl.gov>
References: , <2078636548.210814.1313071056159.JavaMail.root@zimbra.anl.gov>
Message-ID:

Mike helped me track the problem down to the configuration file. Since I am not using coasters, the line use.provider.staging should be set to false:

$ cat cf
wrapperlog.always.transfer=true
sitedir.keep=true
execution.retries=0
lazy.errors=true
status.mode=provider
use.provider.staging=false
provider.staging.pin.swiftfiles=false
foreach.max.threads=10
provenance.log=true

$ swift -tc.file tc.template.data -sites.file sites.template.xml -config cf 001-catsn-ssh.swift -n=1
Swift svn swift-r4978 cog-r3226

RunID: 20110811-1002-pylik8vg
Progress: time: Thu, 11 Aug 2011 10:02:39 -0500
Progress: time: Thu, 11 Aug 2011 10:02:40 -0500 Submitted:1
Final status: time: Thu, 11 Aug 2011 10:02:40 -0500 Finished successfully:1

> Date: Thu, 11 Aug 2011 08:57:36 -0500
> From: wilde at mcs.anl.gov
> To: alberto_chavez at live.com
> CC: swift-devel at ci.uchicago.edu; hategan at mcs.anl.gov
> Subject: Re: [Swift-devel] ssh test case on pads/beagle
>
> Mihael, Ive never seen sites.xml entries showing up in the log - are they supposed to be now? They are not in the log Alberto attached, nor have I seen them in any other log yet.
>
> Can we log all the files mentioned in the command line report (the first line of the log) right at the front, along with the source text? Ie, script, tc, sites, and config? Ideally values for all of the swift.properties? Ideally auth.defaults with suitable masking? 0.94 feature?
>
> >> 2011-08-11 08:28:03,762-0500 DEBUG Loader arguments: [001-catsn-ssh.swift, -tc.file, tc.template.data, -sites.file, sites.template.xml, -config, cf]
>
> Alberto, stop by and we can try to debug this in person, as ssh requires a fair bit of correct configuration to work.
>
> We need to look at the cf, sites.template.xml, and cf file.
>
> - Mike
>
>
> ----- Original Message -----
>
>
> From: "Alberto Chavez"
> To: "Mihael Hategan"
> Cc: "Swift Devel"
> Sent: Thursday, August 11, 2011 8:31:53 AM
> Subject: Re: [Swift-devel] ssh test case on pads/beagle
>
>
> Sure, attached are the output of stdout and stderror, and the log generated by swift.
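The configuration change reported in Alberto's message above (use.provider.staging switched from true to false for a plain ssh site) is easy to check before launching a run. What follows is only a minimal sketch in Python and not something posted in the thread: the helper names parse_properties and provider_staging_enabled are hypothetical, and treating a missing key as "off" is an assumption rather than documented Swift behavior. The two sample configs are the cf contents quoted in this thread.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def provider_staging_enabled(props):
    """Return True if use.provider.staging is set to true.

    Assumption for this sketch: a missing key counts as 'off'.
    """
    return props.get("use.provider.staging", "false").lower() == "true"


# cf as posted earlier in the thread (runs failed).
FAILING_CF = """\
wrapperlog.always.transfer=true
sitedir.keep=true
execution.retries=0
lazy.errors=true
status.mode=provider
use.provider.staging=true
provider.staging.pin.swiftfiles=false
foreach.max.threads=10
provenance.log=true
"""

# cf from the run that finished successfully.
WORKING_CF = FAILING_CF.replace(
    "use.provider.staging=true", "use.provider.staging=false"
)

if __name__ == "__main__":
    for label, text in (("failing cf", FAILING_CF), ("working cf", WORKING_CF)):
        enabled = provider_staging_enabled(parse_properties(text))
        print(f"{label}: use.provider.staging enabled = {enabled}")
```

Whether provider staging is appropriate still depends on the site definition (coasters or not), which this sketch does not inspect.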
> > > > > Subject: RE: [Swift-devel] ssh test case on pads/beagle > > From: hategan at mcs.anl.gov > > To: alberto_chavez at live.com > > CC: jonmon at mcs.anl.gov; swift-devel at ci.uchicago.edu > > Date: Thu, 11 Aug 2011 00:18:07 -0700 > > > > Can you post (a link to) the entire log file? Since it contains both the > > tc.data and sites.xml and the error, it's probably always better to post > > than individual snippets. > > > > On Thu, 2011-08-11 at 01:17 -0500, Alberto Chavez wrote: > > > Sure: > > > > > > > > > > > > > > > > > > 0 > > > /home/achavez/swiftwork > > > > > > > > > > > > > > > ______________________________________________________________________ > > > To: alberto_chavez at live.com > > > From: jonmon at mcs.anl.gov > > > CC: hategan at mcs.anl.gov; swift-devel at ci.uchicago.edu > > > Subject: Re: [Swift-devel] ssh test case on pads/beagle > > > Date: Wed, 10 Aug 2011 23:54:24 -0500 > > > > > > Could you post the sites file? > > > > > > ----- Reply message ----- > > > From: "Alberto Chavez" > > > Date: Wed, Aug 10, 2011 7:16 pm > > > Subject: [Swift-devel] ssh test case on pads/beagle > > > To: > > > Cc: "Mihael Hategan" , "Swift Devel" > > > > > > > > > > > > > > > Exit code "127" normally means that a particular function doesn't > > > exist. Are you sure that all those paths to apps exist? > > > > Yes, I doubled check that and those are the right paths to the apps. > > > > > > > > > Also, I am not sure if this is a problem but shouldn't there be a > > > third column in the app file? LIke > > > "ssh echo /bin/echo null null null" > > > > > > > > > > > > > > > Looking at the documentation for the transformation catalog, the > > > structure should be: > > > > > > site, transformation name, executable path, installation status, > > > platform, and profile entries. > > > > > > > > > > > > > > > > > > The installation status and platform fields are not used. Set them > > > to INSTALLED and INTEL32::LINUX respectively. 
> > > > > > The profiles field should be set to null if no profile entries are to > > > be specified, or should contain the profile entries separated by > > > semicolons. > > > > > > > > > but even when I switch the columns to INSTALLED and INTEL32::LINUX and > > > keep the profiles field set to null, I'm still getting the same exit > > > code. > > > > > > > > > On Aug 10, 2011, at 6:41 PM, Alberto Chavez wrote: > > > > > > I changed my ssh-key, and they worked on the MCS machines > > > because the authorized_keys file has not been updated yet on > > > the CI Machines. > > > I created a new ssh-key using: > > > ssh-keygen -t rsa -b 2048 > > > exactly as the MCS site suggested, > > > On the other hand, I still have a problem, I am getting the > > > following error: > > > > > > > > > Swift svn swift-r4978 cog-r3226 > > > > > > > > > RunID: 20110810-1819-1cdo2o62 > > > Progress: time: Wed, 10 Aug 2011 18:19:42 -0500 > > > Exception in cat: > > > Arguments: [data.txt] > > > Host: ssh > > > Directory: > > > 001-catsn-ssh-20110810-1819-1cdo2o62/jobs/9/cat-9jd0g9ek > > > - - - > > > Caused by: null > > > Caused by: > > > org.globus.cog.abstraction.impl.common.execution.JobException: > > > Job failed with an exit code of 127 > > > Final status: time: Wed, 10 Aug 2011 18:20:00 -0500 > > > Failed:10 > > > The following errors have occurred: > > > 1. 
Job failed with an exit code of 127 (10 times) > > > > > > > > > > > > > > > These are the contents of the log: > > > > > > > > > Execution completed with errors > > > > > > > > > 2011-08-10 18:19:43,251-0500 INFO ConnectionProtocol Freeing > > > channel 0 [Unnamed Channel] > > > 2011-08-10 18:19:43,263-0500 INFO Exec Exit code 127 > > > 2011-08-10 18:19:43,269-0500 INFO ConnectionProtocol Freeing > > > channel 0 [Unnamed Channel] > > > 2011-08-10 18:19:43,277-0500 DEBUG vdl:execute2 > > > APPLICATION_EXCEPTION jobid=cat-9jd0g9ek - Application > > > exception: null > > > Caused by: > > > org.globus.cog.abstraction.impl.common.execution.JobException: > > > Job failed with an exit code of 127 > > > 2011-08-10 18:19:43,280-0500 INFO vdl:execute END_FAILURE > > > thread=0-5-3-1 tr=cat > > > 2011-08-10 18:19:43,281-0500 INFO vdl:execute Exception in > > > cat: > > > at > > > org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:250) > > > at > > > org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:254) > > > at > > > org.globus.cog.karajan.workflow.nodes.GenerateErrorNode.post(GenerateErrorNode.java:27) > > > at > > > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) > > > at > > > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) > > > at > > > org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > > > at > > > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) > > > at > > > org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:29) > > > at > > > org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:20) > > > at > > > org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > > > at > > > org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:139) > > > at > > > 
org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:197) > > > at > > > org.globus.cog.karajan.workflow.FlowElementWrapper.start(FlowElementWrapper.java:227) > > > at > > > org.globus.cog.karajan.workflow.events.EventBus.start(EventBus.java:104) > > > at > > > org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:40) > > > at java.util.concurrent.Executors > > > $RunnableAdapter.call(Executors.java:471) > > > at java.util.concurrent.FutureTask > > > $Sync.innerRun(FutureTask.java:334) > > > at > > > java.util.concurrent.FutureTask.run(FutureTask.java:166) > > > at > > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > > > at java.util.concurrent.ThreadPoolExecutor > > > $Worker.run(ThreadPoolExecutor.java:603) > > > at java.lang.Thread.run(Thread.java:636) > > > 2011-08-10 18:20:00,332-0500 INFO ExecutionContext Detailed > > > exception: > > > > > > > > > Execution completed with errors > > > > > > > > > at > > > org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:250) > > > at > > > org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:254) > > > at > > > org.globus.cog.karajan.workflow.nodes.GenerateErrorNode.post(GenerateErrorNode.java:27) > > > at > > > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) > > > at > > > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) > > > at > > > org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > > > at > > > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) > > > at > > > org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:29) > > > at > > > org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:20) > > > at > > > org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > > > at > > 
> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:139) > > > at > > > org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:197) > > > at > > > org.globus.cog.karajan.workflow.FlowElementWrapper.start(FlowElementWrapper.java:227) > > > at > > > org.globus.cog.karajan.workflow.events.EventBus.start(EventBus.java:104) > > > at > > > org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:40) > > > at java.util.concurrent.Executors > > > $RunnableAdapter.call(Executors.java:471) > > > at java.util.concurrent.FutureTask > > > $Sync.innerRun(FutureTask.java:334) > > > at > > > java.util.concurrent.FutureTask.run(FutureTask.java:166) > > > at > > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > > > at java.util.concurrent.ThreadPoolExecutor > > > $Worker.run(ThreadPoolExecutor.java:603) > > > at java.lang.Thread.run(Thread.java:636) > > > > > > I believe that the problem resides on the TC file because when > > > I run a much simpler SwiftScript like: > > > > > > > > > int i = 9; > > > trace(i); > > > > > > > > > I get the following output: > > > > > > > > > swift traceme.swift -tc.file tc.template.data > > > -sites.file sites.template.xml -config cf > > > Swift svn swift-r4978 cog-r3226 > > > > > > > > > RunID: 20110810-1832-buktjj3d > > > Progress: time: Wed, 10 Aug 2011 18:32:30 -0500 > > > SwiftScript trace: 9.0 > > > Final status: time: Wed, 10 Aug 2011 18:32:30 -0500 > > > > > > > > > but as soon as I start using the commands stated the TC file, > > > I get the "exit code 127" > > > > > > > > > My tc file reads: > > > > > > > > > ssh echo /bin/echo null null > > > ssh cat /bin/cat null null > > > ssh ls /bin/ls null null > > > ssh grep /bin/grep null null > > > ssh sort /bin/sort null null > > > ssh paste /bin/paste null null > > > ssh wc /usr/bin/wc null null > > > > > > > > > I am working on the login node of the MCS machine trying to > > > ssh via Swift to steamroller. 
> > > > > > > > > > Subject: Re: [Swift-devel] ssh test case on pads/beagle > > > > From: hategan at mcs.anl.gov > > > > To: alberto_chavez at live.com > > > > CC: swift-devel at ci.uchicago.edu > > > > Date: Tue, 9 Aug 2011 11:57:06 -0700 > > > > > > > > Hmm: Unsupported passphrase algorithm: AES-128-CBC > > > > > > > > I'll try to see how that can be fixed. In the mean time, can > > > you > > > > generate a new key pair with 3DES encryption instead and use > > > that? > > > > > > > > On Tue, 2011-08-09 at 13:43 -0500, Alberto Chavez wrote: > > > > > Hello, > > > > > > > > > > > > > > > I am trying to run a simpler case than ssh-pbs-coaster > > > test case, and > > > > > I'm still having the same error. > > > > > Now I am running only ssh test case > > > > > (/tests/providers/ssh/001-catsn-ssn.swift) > > > > > > > > > > > > > > > The command line is: > > > > > swift -config cf -tc.file tc.template.data -sites.file > > > > > sites.template.xml 001-catsn-ssh.swift > > > > > > > > > > > > > > > The output: > > > > > Swift svn swift-r4861 (swift modified locally) cog-r3183 > > > > > > > > > > > > > > > RunID: 20110809-1336-ohte788a > > > > > Progress: time: Tue, 09 Aug 2011 13:36:42 -0500 > > > > > Exception in cat: > > > > > Arguments: [data.txt] > > > > > Host: ssh > > > > > Directory: > > > 001-catsn-ssh-20110809-1336-ohte788a/jobs/m/cat-mq74h7ek > > > > > - - - > > > > > > > > > > > > > > > Caused by: null > > > > > Caused by: > > > > > > > > org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid private key or passphrase > > > > > Caused by: > > > > > > > > com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: > > > Can't > > > > > read key due to cryptography problems: > > > > > java.security.NoSuchAlgorithmException: Unsupported > > > passphrase > > > > > algorithm: AES-128-CBC > > > > > Progress: time: Tue, 09 Aug 2011 13:36:43 -0500 Selecting > > > site:8 > > > > > Submitting:1 Failed:1 > > > > > Exception in cat: > > > 
> > Arguments: [data.txt] > > > > > Host: ssh > > > > > Directory: > > > 001-catsn-ssh-20110809-1336-ohte788a/jobs/n/cat-nq74h7ek > > > > > - - - > > > > > > > > > > > > > > > Caused by: null > > > > > Caused by: > > > > > > > > org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid private key or passphrase > > > > > Caused by: > > > > > > > > com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: > > > Can't > > > > > read key due to cryptography problems: > > > > > java.security.NoSuchAlgorithmException: Unsupported > > > passphrase > > > > > algorithm: AES-128-CBC > > > > > Progress: time: Tue, 09 Aug 2011 13:36:44 -0500 Selecting > > > site:7 > > > > > Submitting:1 Failed:2 > > > > > Exception in cat: > > > > > Arguments: [data.txt] > > > > > Host: ssh > > > > > Directory: > > > 001-catsn-ssh-20110809-1336-ohte788a/jobs/o/cat-oq74h7ek > > > > > - - - > > > > > > > > > > > > > > > Caused by: null > > > > > Caused by: > > > > > > > > org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid private key or passphrase > > > > > Caused by: > > > > > > > > com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: > > > Can't > > > > > read key due to cryptography problems: > > > > > java.security.NoSuchAlgorithmException: Unsupported > > > passphrase > > > > > algorithm: AES-128-CBC > > > > > "error_log.log" 105L, 5770C > > > > > > > > > > > > > > > My auth.defaults reads: > > > > > > > > > > > > > > > login1.beagle.ci.uchicago.edu.type=key > > > > > login1.beagle.ci.uchicago.edu.username=achavez > > > > > > > > login1.beagle.ci.uchicago.edu.key=/home/Alberto/.ssh/identity > > > > > > > > > > > > > > > login1.pads.ci.uchicago.edu.type=key > > > > > login1.pads.ci.uchicago.edu.username=achavez > > > > > > > > login1.pads.ci.uchicago.edu.key=/home/Alberto/.ssh/identity > > > > > > > > > > > > > > > > > > > > > > > > > and it has been set to 600, I ommited the passphrase line, > > > but it is > > > > 
> there, and the passphrase is right because I just verified > > > it in two > > > > > ways: > > > > > 1) by logging to pads and beagle without providing a > > > password > > > > > 2) "changed" the password. I the "new" password is the > > > same as the > > > > > "old" one. > > > > > > > > > > sites.templates.xml: > > > > > > > > > > > > > > > > > > > > > > url="login1.pads.ci.uchicago.edu" > > > > > jobmanager="ssh"/> > > > > > > > url="login1.pads.ci.uchicago.edu" /> > > > > > 0 > > > > > /home/achavez/swiftwork > > > > > > > > > > > > > > > > > > > > > > > > > config file: > > > > > > > > > > wrapperlog.always.transfer=true > > > > > sitedir.keep=true > > > > > execution.retries=0 > > > > > lazy.errors=true > > > > > status.mode=provider > > > > > use.provider.staging=true > > > > > provider.staging.pin.swiftfiles=false > > > > > foreach.max.threads=10 > > > > > provenance.log=true > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I also tried a simpler SwiftScript: > > > > > > > > > > > > > > > type filemsg; > > > > > > > > > > > > > > > app (filemsg output) hello(string s) > > > > > { > > > > > echo s stdout=@filename(output); > > > > > } > > > > > > > > > > > > > > > filemsg myfile<"dogcatdinosaur.out">; > > > > > myfile = hello("dog,cat,dinosaur"); > > > > > > > > > > > > > > > and I get the following output: > > > > > > > > > > > > > > > Swift svn swift-r4861 (swift modified locally) cog-r3183 > > > > > > > > > > > > > > > RunID: 20110809-1343-2es2hel2 > > > > > Progress: time: Tue, 09 Aug 2011 13:43:25 -0500 > > > > > Exception in echo: > > > > > Arguments: [dog,cat,dinosaur] > > > > > Host: ssh > > > > > Directory: > > > hello_swift-20110809-1343-2es2hel2/jobs/0/echo-0oldh7ek > > > > > - - - > > > > > > > > > > > > > > > Caused by: null > > > > > Caused by: > > > > > > > > org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid private key or passphrase > > > > > Caused by: > > > > > > > > 
com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: > > > Can't > > > > > read key due to cryptography problems: > > > > > java.security.NoSuchAlgorithmException: Unsupported > > > passphrase > > > > > algorithm: AES-128-CBC > > > > > Final status: time: Tue, 09 Aug 2011 13:43:26 -0500 > > > Failed:1 > > > > > The following errors have occurred: > > > > > 1. Can't read key due to cryptography problems: > > > > > java.security.NoSuchAlgorithmException: Unsupported > > > passphrase > > > > > algorithm: AES-128-CBC > > > > > > > > > > > > > > > > > > > > > > > > > any thoughts on this? > > > > > _______________________________________________ > > > > > Swift-devel mailing list > > > > > Swift-devel at ci.uchicago.edu > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Thu Aug 11 10:08:47 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 11 Aug 2011 10:08:47 -0500 (CDT) Subject: [Swift-devel] ssh test case on pads/beagle In-Reply-To: <2078636548.210814.1313071056159.JavaMail.root@zimbra.anl.gov> Message-ID: <1715921449.211141.1313075327075.JavaMail.root@zimbra.anl.gov> Alberto's ssh test now runs. It was failing because provider staging was specified in the -config file; that seemed to cause the error code 127. 
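For illustration only: the conflict described here (use.provider.staging=true in the -config file, combined with a site that uses the ssh provider) could in principle be caught before any job is submitted. The following is a minimal Python sketch of such a pre-flight check, not Swift code. The function names and the PROVIDERS_WITHOUT_STAGING set are hypothetical, and ssh is the only provider this thread identifies as lacking provider staging support.

```python
# Hypothetical pre-flight check; not part of Swift itself.
# Assumption: providers listed here cannot do provider staging.
# Only "ssh" is confirmed by this thread.
PROVIDERS_WITHOUT_STAGING = {"ssh"}


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def staging_conflicts(config_text, providers):
    """Return the providers that conflict with use.provider.staging=true."""
    props = parse_properties(config_text)
    if props.get("use.provider.staging", "false").lower() != "true":
        return []
    return sorted(p for p in providers if p in PROVIDERS_WITHOUT_STAGING)


if __name__ == "__main__":
    # Two of the settings from the cf file quoted in this thread.
    cf = "execution.retries=0\nuse.provider.staging=true\n"
    for provider in staging_conflicts(cf, ["ssh"]):
        print("provider staging is enabled but '%s' does not support it" % provider)
```

The list of providers would have to come from the sites file in a real check; here it is passed in by hand to keep the sketch self-contained.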
I did not go back and search for a message to that effect in the prior log Alberto sent, but we should, to see if it was reported in some reasonable fashion which could be presented more clearly to the user.

We might want to check to ensure that provider staging is not specified for providers that can't support it. Is such a check feasible and sensible?

Also, this case illustrates the benefit of having the properties settings (and -config overrides) echoed in the .log file.

- Mike

----- Original Message -----
> From: "Michael Wilde"
> To: "Alberto Chavez"
> Cc: "Swift Devel"
> Sent: Thursday, August 11, 2011 8:57:36 AM
> Subject: Re: [Swift-devel] ssh test case on pads/beagle
> Mihael, I've never seen sites.xml entries showing up in the log - are
> they supposed to be now? They are not in the log Alberto attached, nor
> have I seen them in any other log yet.
>
> Can we log all the files mentioned in the command line report (the
> first line of the log) right at the front, along with the source text?
> I.e., script, tc, sites, and config? Ideally values for all of the
> swift.properties? Ideally auth.defaults with suitable masking? 0.94
> feature?
>
> >> 2011-08-11 08:28:03,762-0500 DEBUG Loader arguments:
> >> [001-catsn-ssh.swift, -tc.file, tc.template.data, -sites.file,
> >> sites.template.xml, -config, cf]
>
> Alberto, stop by and we can try to debug this in person, as ssh
> requires a fair bit of correct configuration to work.
>
> We need to look at the cf, sites.template.xml, and tc.template.data files.
>
> - Mike
>
>
> ----- Original Message -----
>
>
> From: "Alberto Chavez"
> To: "Mihael Hategan"
> Cc: "Swift Devel"
> Sent: Thursday, August 11, 2011 8:31:53 AM
> Subject: Re: [Swift-devel] ssh test case on pads/beagle
>
>
> Sure, attached are the output of stdout and stderr, and the log generated by swift.
> > > > > Subject: RE: [Swift-devel] ssh test case on pads/beagle > > From: hategan at mcs.anl.gov > > To: alberto_chavez at live.com > > CC: jonmon at mcs.anl.gov; swift-devel at ci.uchicago.edu > > Date: Thu, 11 Aug 2011 00:18:07 -0700 > > > > Can you post (a link to) the entire log file? Since it contains both > > the > > tc.data and sites.xml and the error, it's probably always better to > > post > > than individual snippets. > > > > On Thu, 2011-08-11 at 01:17 -0500, Alberto Chavez wrote: > > > Sure: > > > > > > > > > > > > > > > > > > 0 > > > /home/achavez/swiftwork > > > > > > > > > > > > > > > ______________________________________________________________________ > > > To: alberto_chavez at live.com > > > From: jonmon at mcs.anl.gov > > > CC: hategan at mcs.anl.gov; swift-devel at ci.uchicago.edu > > > Subject: Re: [Swift-devel] ssh test case on pads/beagle > > > Date: Wed, 10 Aug 2011 23:54:24 -0500 > > > > > > Could you post the sites file? > > > > > > ----- Reply message ----- > > > From: "Alberto Chavez" > > > Date: Wed, Aug 10, 2011 7:16 pm > > > Subject: [Swift-devel] ssh test case on pads/beagle > > > To: > > > Cc: "Mihael Hategan" , "Swift Devel" > > > > > > > > > > > > > > > Exit code "127" normally means that a particular function doesn't > > > exist. Are you sure that all those paths to apps exist? > > > > Yes, I doubled check that and those are the right paths to the > > > > apps. > > > > > > > > > Also, I am not sure if this is a problem but shouldn't there be a > > > third column in the app file? LIke > > > "ssh echo /bin/echo null null null" > > > > > > > > > > > > > > > Looking at the documentation for the transformation catalog, the > > > structure should be: > > > > > > site, transformation name, executable path, installation status, > > > platform, and profile entries. > > > > > > > > > > > > > > > > > > The installation status and platform fields are not used. Set them > > > to INSTALLED and INTEL32::LINUX respectively. 
> > > > > > The profiles field should be set to null if no profile entries are > > > to > > > be specified, or should contain the profile entries separated by > > > semicolons. > > > > > > > > > but even when I switch the columns to INSTALLED and INTEL32::LINUX > > > and > > > keep the profiles field set to null, I'm still getting the same > > > exit > > > code. > > > > > > > > > On Aug 10, 2011, at 6:41 PM, Alberto Chavez wrote: > > > > > > I changed my ssh-key, and they worked on the MCS machines > > > because the authorized_keys file has not been updated yet on > > > the CI Machines. > > > I created a new ssh-key using: > > > ssh-keygen -t rsa -b 2048 > > > exactly as the MCS site suggested, > > > On the other hand, I still have a problem, I am getting the > > > following error: > > > > > > > > > Swift svn swift-r4978 cog-r3226 > > > > > > > > > RunID: 20110810-1819-1cdo2o62 > > > Progress: time: Wed, 10 Aug 2011 18:19:42 -0500 > > > Exception in cat: > > > Arguments: [data.txt] > > > Host: ssh > > > Directory: > > > 001-catsn-ssh-20110810-1819-1cdo2o62/jobs/9/cat-9jd0g9ek > > > - - - > > > Caused by: null > > > Caused by: > > > org.globus.cog.abstraction.impl.common.execution.JobException: > > > Job failed with an exit code of 127 > > > Final status: time: Wed, 10 Aug 2011 18:20:00 -0500 > > > Failed:10 > > > The following errors have occurred: > > > 1. 
Job failed with an exit code of 127 (10 times) > > > > > > > > > > > > > > > These are the contents of the log: > > > > > > > > > Execution completed with errors > > > > > > > > > 2011-08-10 18:19:43,251-0500 INFO ConnectionProtocol Freeing > > > channel 0 [Unnamed Channel] > > > 2011-08-10 18:19:43,263-0500 INFO Exec Exit code 127 > > > 2011-08-10 18:19:43,269-0500 INFO ConnectionProtocol Freeing > > > channel 0 [Unnamed Channel] > > > 2011-08-10 18:19:43,277-0500 DEBUG vdl:execute2 > > > APPLICATION_EXCEPTION jobid=cat-9jd0g9ek - Application > > > exception: null > > > Caused by: > > > org.globus.cog.abstraction.impl.common.execution.JobException: > > > Job failed with an exit code of 127 > > > 2011-08-10 18:19:43,280-0500 INFO vdl:execute END_FAILURE > > > thread=0-5-3-1 tr=cat > > > 2011-08-10 18:19:43,281-0500 INFO vdl:execute Exception in > > > cat: > > > at > > > org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:250) > > > at > > > org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:254) > > > at > > > org.globus.cog.karajan.workflow.nodes.GenerateErrorNode.post(GenerateErrorNode.java:27) > > > at > > > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) > > > at > > > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) > > > at > > > org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > > > at > > > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) > > > at > > > org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:29) > > > at > > > org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:20) > > > at > > > org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > > > at > > > org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:139) > > > at > > > 
org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:197) > > > at > > > org.globus.cog.karajan.workflow.FlowElementWrapper.start(FlowElementWrapper.java:227) > > > at > > > org.globus.cog.karajan.workflow.events.EventBus.start(EventBus.java:104) > > > at > > > org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:40) > > > at java.util.concurrent.Executors > > > $RunnableAdapter.call(Executors.java:471) > > > at java.util.concurrent.FutureTask > > > $Sync.innerRun(FutureTask.java:334) > > > at > > > java.util.concurrent.FutureTask.run(FutureTask.java:166) > > > at > > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > > > at java.util.concurrent.ThreadPoolExecutor > > > $Worker.run(ThreadPoolExecutor.java:603) > > > at java.lang.Thread.run(Thread.java:636) > > > 2011-08-10 18:20:00,332-0500 INFO ExecutionContext Detailed > > > exception: > > > > > > > > > Execution completed with errors > > > > > > > > > at > > > org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:250) > > > at > > > org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:254) > > > at > > > org.globus.cog.karajan.workflow.nodes.GenerateErrorNode.post(GenerateErrorNode.java:27) > > > at > > > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) > > > at > > > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) > > > at > > > org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > > > at > > > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) > > > at > > > org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:29) > > > at > > > org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:20) > > > at > > > org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > > > at > > 
> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:139) > > > at > > > org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:197) > > > at > > > org.globus.cog.karajan.workflow.FlowElementWrapper.start(FlowElementWrapper.java:227) > > > at > > > org.globus.cog.karajan.workflow.events.EventBus.start(EventBus.java:104) > > > at > > > org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:40) > > > at java.util.concurrent.Executors > > > $RunnableAdapter.call(Executors.java:471) > > > at java.util.concurrent.FutureTask > > > $Sync.innerRun(FutureTask.java:334) > > > at > > > java.util.concurrent.FutureTask.run(FutureTask.java:166) > > > at > > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > > > at java.util.concurrent.ThreadPoolExecutor > > > $Worker.run(ThreadPoolExecutor.java:603) > > > at java.lang.Thread.run(Thread.java:636) > > > > > > I believe that the problem resides on the TC file because when > > > I run a much simpler SwiftScript like: > > > > > > > > > int i = 9; > > > trace(i); > > > > > > > > > I get the following output: > > > > > > > > > swift traceme.swift -tc.file tc.template.data > > > -sites.file sites.template.xml -config cf > > > Swift svn swift-r4978 cog-r3226 > > > > > > > > > RunID: 20110810-1832-buktjj3d > > > Progress: time: Wed, 10 Aug 2011 18:32:30 -0500 > > > SwiftScript trace: 9.0 > > > Final status: time: Wed, 10 Aug 2011 18:32:30 -0500 > > > > > > > > > but as soon as I start using the commands stated the TC file, > > > I get the "exit code 127" > > > > > > > > > My tc file reads: > > > > > > > > > ssh echo /bin/echo null null > > > ssh cat /bin/cat null null > > > ssh ls /bin/ls null null > > > ssh grep /bin/grep null null > > > ssh sort /bin/sort null null > > > ssh paste /bin/paste null null > > > ssh wc /usr/bin/wc null null > > > > > > > > > I am working on the login node of the MCS machine trying to > > > ssh via Swift to steamroller. 
> > > > > > > > > > Subject: Re: [Swift-devel] ssh test case on pads/beagle > > > > From: hategan at mcs.anl.gov > > > > To: alberto_chavez at live.com > > > > CC: swift-devel at ci.uchicago.edu > > > > Date: Tue, 9 Aug 2011 11:57:06 -0700 > > > > > > > > Hmm: Unsupported passphrase algorithm: AES-128-CBC > > > > > > > > I'll try to see how that can be fixed. In the mean time, can > > > you > > > > generate a new key pair with 3DES encryption instead and use > > > that? > > > > > > > > On Tue, 2011-08-09 at 13:43 -0500, Alberto Chavez wrote: > > > > > Hello, > > > > > > > > > > > > > > > I am trying to run a simpler case than ssh-pbs-coaster > > > test case, and > > > > > I'm still having the same error. > > > > > Now I am running only ssh test case > > > > > (/tests/providers/ssh/001-catsn-ssn.swift) > > > > > > > > > > > > > > > The command line is: > > > > > swift -config cf -tc.file tc.template.data -sites.file > > > > > sites.template.xml 001-catsn-ssh.swift > > > > > > > > > > > > > > > The output: > > > > > Swift svn swift-r4861 (swift modified locally) cog-r3183 > > > > > > > > > > > > > > > RunID: 20110809-1336-ohte788a > > > > > Progress: time: Tue, 09 Aug 2011 13:36:42 -0500 > > > > > Exception in cat: > > > > > Arguments: [data.txt] > > > > > Host: ssh > > > > > Directory: > > > 001-catsn-ssh-20110809-1336-ohte788a/jobs/m/cat-mq74h7ek > > > > > - - - > > > > > > > > > > > > > > > Caused by: null > > > > > Caused by: > > > > > > > > org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: > > > Invalid private key or passphrase > > > > > Caused by: > > > > > > > > com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: > > > Can't > > > > > read key due to cryptography problems: > > > > > java.security.NoSuchAlgorithmException: Unsupported > > > passphrase > > > > > algorithm: AES-128-CBC > > > > > Progress: time: Tue, 09 Aug 2011 13:36:43 -0500 Selecting > > > site:8 > > > > > Submitting:1 Failed:1 > > > > > Exception in cat: 
> > > > > Arguments: [data.txt] > > > > > Host: ssh > > > > > Directory: > > > 001-catsn-ssh-20110809-1336-ohte788a/jobs/n/cat-nq74h7ek > > > > > - - - > > > > > > > > > > > > > > > Caused by: null > > > > > Caused by: > > > > > > > > org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: > > > Invalid private key or passphrase > > > > > Caused by: > > > > > > > > com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: > > > Can't > > > > > read key due to cryptography problems: > > > > > java.security.NoSuchAlgorithmException: Unsupported > > > passphrase > > > > > algorithm: AES-128-CBC > > > > > Progress: time: Tue, 09 Aug 2011 13:36:44 -0500 Selecting > > > site:7 > > > > > Submitting:1 Failed:2 > > > > > Exception in cat: > > > > > Arguments: [data.txt] > > > > > Host: ssh > > > > > Directory: > > > 001-catsn-ssh-20110809-1336-ohte788a/jobs/o/cat-oq74h7ek > > > > > - - - > > > > > > > > > > > > > > > Caused by: null > > > > > Caused by: > > > > > > > > org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: > > > Invalid private key or passphrase > > > > > Caused by: > > > > > > > > com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: > > > Can't > > > > > read key due to cryptography problems: > > > > > java.security.NoSuchAlgorithmException: Unsupported > > > passphrase > > > > > algorithm: AES-128-CBC > > > > > "error_log.log" 105L, 5770C > > > > > > > > > > > > > > > My auth.defaults reads: > > > > > > > > > > > > > > > login1.beagle.ci.uchicago.edu.type=key > > > > > login1.beagle.ci.uchicago.edu.username=achavez > > > > > > > > login1.beagle.ci.uchicago.edu.key=/home/Alberto/.ssh/identity > > > > > > > > > > > > > > > login1.pads.ci.uchicago.edu.type=key > > > > > login1.pads.ci.uchicago.edu.username=achavez > > > > > > > > login1.pads.ci.uchicago.edu.key=/home/Alberto/.ssh/identity > > > > > > > > > > > > > > > > > > > > > > > > > and it has been set to 600, I ommited the passphrase line, > > > 
but it is > > > > > there, and the passphrase is right because I just verified > > > it in two > > > > > ways: > > > > > 1) by logging to pads and beagle without providing a > > > password > > > > > 2) "changed" the password. I the "new" password is the > > > same as the > > > > > "old" one. > > > > > > > > > > sites.templates.xml: > > > > > > > > > > > > > > > > > > > > > > url="login1.pads.ci.uchicago.edu" > > > > > jobmanager="ssh"/> > > > > > > > url="login1.pads.ci.uchicago.edu" /> > > > > > 0 > > > > > /home/achavez/swiftwork > > > > > > > > > > > > > > > > > > > > > > > > > config file: > > > > > > > > > > wrapperlog.always.transfer=true > > > > > sitedir.keep=true > > > > > execution.retries=0 > > > > > lazy.errors=true > > > > > status.mode=provider > > > > > use.provider.staging=true > > > > > provider.staging.pin.swiftfiles=false > > > > > foreach.max.threads=10 > > > > > provenance.log=true > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I also tried a simpler SwiftScript: > > > > > > > > > > > > > > > type filemsg; > > > > > > > > > > > > > > > app (filemsg output) hello(string s) > > > > > { > > > > > echo s stdout=@filename(output); > > > > > } > > > > > > > > > > > > > > > filemsg myfile<"dogcatdinosaur.out">; > > > > > myfile = hello("dog,cat,dinosaur"); > > > > > > > > > > > > > > > and I get the following output: > > > > > > > > > > > > > > > Swift svn swift-r4861 (swift modified locally) cog-r3183 > > > > > > > > > > > > > > > RunID: 20110809-1343-2es2hel2 > > > > > Progress: time: Tue, 09 Aug 2011 13:43:25 -0500 > > > > > Exception in echo: > > > > > Arguments: [dog,cat,dinosaur] > > > > > Host: ssh > > > > > Directory: > > > hello_swift-20110809-1343-2es2hel2/jobs/0/echo-0oldh7ek > > > > > - - - > > > > > > > > > > > > > > > Caused by: null > > > > > Caused by: > > > > > > > > org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: > > > Invalid private key or passphrase > > > > > Caused by: > > > > > > > > 
com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: > > > Can't > > > > > read key due to cryptography problems: > > > > > java.security.NoSuchAlgorithmException: Unsupported > > > passphrase > > > > > algorithm: AES-128-CBC > > > > > Final status: time: Tue, 09 Aug 2011 13:43:26 -0500 > > > Failed:1 > > > > > The following errors have occurred: > > > > > 1. Can't read key due to cryptography problems: > > > > > java.security.NoSuchAlgorithmException: Unsupported > > > passphrase > > > > > algorithm: AES-128-CBC > > > > > > > > > > > > > > > > > > > > > > > > > any thoughts on this? > > > > > _______________________________________________ > > > > > Swift-devel mailing list > > > > > Swift-devel at ci.uchicago.edu > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From ketancmaheshwari at gmail.com Thu Aug 11 10:17:16 2011 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Thu, 11 Aug 2011 10:17:16 -0500 Subject: [Swift-devel] Persistent coasters running one job per worker In-Reply-To: <1312917410.3416.2.camel@blabla> References: 
<1312916367.2671.4.camel@blabla> <1312917410.3416.2.camel@blabla>
Message-ID:

On Tue, Aug 9, 2011 at 2:16 PM, Mihael Hategan wrote:

> Ah!
>
> If the workers connect before the client does, then jobsPerNode does not
> make it to the coaster service.
>
> I'll think about this. In the mean time, you could have the workers
> started after the client sends its first job to the service.
>

I did this and it worked. Thanks Mihael.

>
> I'm thinking that maybe jobsPerNode should be a setting that the workers
> themselves could be started with.
>
> On Tue, 2011-08-09 at 14:09 -0500, Ketan Maheshwari wrote:
> > I do not see any recent log in ~/.globus/coasters. The stdout/err of
> > the coaster service run is in the attached service.log and the
> > coaster.log is in the attached swift.log.
> >
> > On Tue, Aug 9, 2011 at 1:59 PM, Mihael Hategan wrote:
> > but but but I checked this, and it worked fine...
> >
> > Can you also post the coasters log (on the machine the coaster service
> > is on, in ~/.globus/coasters)?
> >
> > On Tue, 2011-08-09 at 13:47 -0500, Ketan Maheshwari wrote:
> > > Mihael,
> > >
> > > I was discussing this with Justin and we thought you could help:
> > >
> > > I am observing that persistent coasters are running one job per worker
> > > as opposed to the number specified in jobspernode (I also tried
> > > nodegranularity) on sites.xml.
> > >
> > > Attaching the log, and the sites.xml for the run. Swift is 0.93 (Swift
> > > svn swift-r4968 cog-r3225).
> > >
> > > The script is Mike's catsnsleep that sleeps for 20s with n=10.
> > >
> > > --
> > > Ketan
> >
> > --
> > Ketan
>

--
Ketan

From davidk at ci.uchicago.edu Thu Aug 11 12:45:28 2011
From: davidk at ci.uchicago.edu (David Kelly)
Date: Thu, 11 Aug 2011 12:45:28 -0500 (CDT)
Subject: [Swift-devel] Cogkit SVN access
Message-ID: <1287204456.62270.1313084728219.JavaMail.root@zimbra-mb2.anl.gov>

Hello,

How can I request access to the cogkit SVN repo? I have a patch I'd like to apply that allows 0.93 to compile under Java 1.5. My sourceforge username is davidkelly999.

Thanks,
David

From hategan at mcs.anl.gov Thu Aug 11 13:23:18 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 11 Aug 2011 11:23:18 -0700
Subject: [Swift-devel] ssh test case on pads/beagle
In-Reply-To:
References: <20110811045409.99852121A8@zimbra.anl.gov> , ,<1313047087.3215.6.camel@blabla>
Message-ID: <1313086998.8503.0.camel@blabla>

You have provider staging enabled, and ssh doesn't support that. I'll make sure it actually throws an exception instead of trying to run jobs with staging directives.

On Thu, 2011-08-11 at 08:31 -0500, Alberto Chavez wrote:
> Sure, attached are the output of stdout and stderror, and the log
> generated by swift.
>
>
> > Subject: RE: [Swift-devel] ssh test case on pads/beagle
> > From: hategan at mcs.anl.gov
> > To: alberto_chavez at live.com
> > CC: jonmon at mcs.anl.gov; swift-devel at ci.uchicago.edu
> > Date: Thu, 11 Aug 2011 00:18:07 -0700
> >
> > Can you post (a link to) the entire log file? Since it contains both the
> > tc.data and sites.xml and the error, it's probably always better to post
> > than individual snippets.
> > > > On Thu, 2011-08-11 at 01:17 -0500, Alberto Chavez wrote: > > > Sure: > > > > > > > > > > > > > > > > > > 0 > > > /home/achavez/swiftwork > > > > > > > > > > > > > > > > ______________________________________________________________________ > > > To: alberto_chavez at live.com > > > From: jonmon at mcs.anl.gov > > > CC: hategan at mcs.anl.gov; swift-devel at ci.uchicago.edu > > > Subject: Re: [Swift-devel] ssh test case on pads/beagle > > > Date: Wed, 10 Aug 2011 23:54:24 -0500 > > > > > > Could you post the sites file? > > > > > > ----- Reply message ----- > > > From: "Alberto Chavez" > > > Date: Wed, Aug 10, 2011 7:16 pm > > > Subject: [Swift-devel] ssh test case on pads/beagle > > > To: > > > Cc: "Mihael Hategan" , "Swift Devel" > > > > > > > > > > > > > > > Exit code "127" normally means that a particular function doesn't > > > exist. Are you sure that all those paths to apps exist? > > > > Yes, I doubled check that and those are the right paths to the > apps. > > > > > > > > > Also, I am not sure if this is a problem but shouldn't there be a > > > third column in the app file? LIke > > > "ssh echo /bin/echo null null null" > > > > > > > > > > > > > > > Looking at the documentation for the transformation catalog, the > > > structure should be: > > > > > > site, transformation name, executable path, installation status, > > > platform, and profile entries. > > > > > > > > > > > > > > > > > > The installation status and platform fields are not used. Set them > > > to INSTALLED and INTEL32::LINUX respectively. > > > > > > The profiles field should be set to null if no profile entries are > to > > > be specified, or should contain the profile entries separated by > > > semicolons. > > > > > > > > > but even when I switch the columns to INSTALLED and INTEL32::LINUX > and > > > keep the profiles field set to null, I'm still getting the same > exit > > > code. 
> > > > > > > > > On Aug 10, 2011, at 6:41 PM, Alberto Chavez wrote: > > > > > > I changed my ssh-key, and they worked on the MCS machines > > > because the authorized_keys file has not been updated yet on > > > the CI Machines. > > > I created a new ssh-key using: > > > ssh-keygen -t rsa -b 2048 > > > exactly as the MCS site suggested, > > > On the other hand, I still have a problem, I am getting the > > > following error: > > > > > > > > > Swift svn swift-r4978 cog-r3226 > > > > > > > > > RunID: 20110810-1819-1cdo2o62 > > > Progress: time: Wed, 10 Aug 2011 18:19:42 -0500 > > > Exception in cat: > > > Arguments: [data.txt] > > > Host: ssh > > > Directory: > > > 001-catsn-ssh-20110810-1819-1cdo2o62/jobs/9/cat-9jd0g9ek > > > - - - > > > Caused by: null > > > Caused by: > > > org.globus.cog.abstraction.impl.common.execution.JobException: > > > Job failed with an exit code of 127 > > > Final status: time: Wed, 10 Aug 2011 18:20:00 -0500 > > > Failed:10 > > > The following errors have occurred: > > > 1. 
Job failed with an exit code of 127 (10 times) > > > > > > > > > > > > > > > These are the contents of the log: > > > > > > > > > Execution completed with errors > > > > > > > > > 2011-08-10 18:19:43,251-0500 INFO ConnectionProtocol Freeing > > > channel 0 [Unnamed Channel] > > > 2011-08-10 18:19:43,263-0500 INFO Exec Exit code 127 > > > 2011-08-10 18:19:43,269-0500 INFO ConnectionProtocol Freeing > > > channel 0 [Unnamed Channel] > > > 2011-08-10 18:19:43,277-0500 DEBUG vdl:execute2 > > > APPLICATION_EXCEPTION jobid=cat-9jd0g9ek - Application > > > exception: null > > > Caused by: > > > org.globus.cog.abstraction.impl.common.execution.JobException: > > > Job failed with an exit code of 127 > > > 2011-08-10 18:19:43,280-0500 INFO vdl:execute END_FAILURE > > > thread=0-5-3-1 tr=cat > > > 2011-08-10 18:19:43,281-0500 INFO vdl:execute Exception in > > > cat: > > > at > > > > org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:250) > > > at > > > > org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:254) > > > at > > > > org.globus.cog.karajan.workflow.nodes.GenerateErrorNode.post(GenerateErrorNode.java:27) > > > at > > > > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) > > > at > > > > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) > > > at > > > > org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > > > at > > > > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) > > > at > > > > org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:29) > > > at > > > > org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:20) > > > at > > > > org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > > > at > > > > org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:139) > > > at > > > > 
org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:197) > > > at > > > > org.globus.cog.karajan.workflow.FlowElementWrapper.start(FlowElementWrapper.java:227) > > > at > > > > org.globus.cog.karajan.workflow.events.EventBus.start(EventBus.java:104) > > > at > > > > org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:40) > > > at java.util.concurrent.Executors > > > $RunnableAdapter.call(Executors.java:471) > > > at java.util.concurrent.FutureTask > > > $Sync.innerRun(FutureTask.java:334) > > > at > > > java.util.concurrent.FutureTask.run(FutureTask.java:166) > > > at > > > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > > > at java.util.concurrent.ThreadPoolExecutor > > > $Worker.run(ThreadPoolExecutor.java:603) > > > at java.lang.Thread.run(Thread.java:636) > > > 2011-08-10 18:20:00,332-0500 INFO ExecutionContext Detailed > > > exception: > > > > > > > > > Execution completed with errors > > > > > > > > > at > > > > org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:250) > > > at > > > > org.globus.cog.karajan.workflow.nodes.FlowNode.fail(FlowNode.java:254) > > > at > > > > org.globus.cog.karajan.workflow.nodes.GenerateErrorNode.post(GenerateErrorNode.java:27) > > > at > > > > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) > > > at > > > > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) > > > at > > > > org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > > > at > > > > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) > > > at > > > > org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:29) > > > at > > > > org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:20) > > > at > > > > 
org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > > > at > > > > org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:139) > > > at > > > > org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:197) > > > at > > > > org.globus.cog.karajan.workflow.FlowElementWrapper.start(FlowElementWrapper.java:227) > > > at > > > > org.globus.cog.karajan.workflow.events.EventBus.start(EventBus.java:104) > > > at > > > > org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:40) > > > at java.util.concurrent.Executors > > > $RunnableAdapter.call(Executors.java:471) > > > at java.util.concurrent.FutureTask > > > $Sync.innerRun(FutureTask.java:334) > > > at > > > java.util.concurrent.FutureTask.run(FutureTask.java:166) > > > at > > > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > > > at java.util.concurrent.ThreadPoolExecutor > > > $Worker.run(ThreadPoolExecutor.java:603) > > > at java.lang.Thread.run(Thread.java:636) > > > > > > I believe that the problem resides on the TC file because when > > > I run a much simpler SwiftScript like: > > > > > > > > > int i = 9; > > > trace(i); > > > > > > > > > I get the following output: > > > > > > > > > swift traceme.swift -tc.file tc.template.data > > > -sites.file sites.template.xml -config cf > > > Swift svn swift-r4978 cog-r3226 > > > > > > > > > RunID: 20110810-1832-buktjj3d > > > Progress: time: Wed, 10 Aug 2011 18:32:30 -0500 > > > SwiftScript trace: 9.0 > > > Final status: time: Wed, 10 Aug 2011 18:32:30 -0500 > > > > > > > > > but as soon as I start using the commands stated the TC file, > > > I get the "exit code 127" > > > > > > > > > My tc file reads: > > > > > > > > > ssh echo /bin/echo null null > > > ssh cat /bin/cat null null > > > ssh ls /bin/ls null null > > > ssh grep /bin/grep null null > > > ssh sort /bin/sort null null > > > ssh paste /bin/paste null null > > > ssh wc /usr/bin/wc null null > > > > 
> > > > > I am working on the login node of the MCS machine trying to > > > ssh via Swift to steamroller. > > > > > > > > > > Subject: Re: [Swift-devel] ssh test case on pads/beagle > > > > From: hategan at mcs.anl.gov > > > > To: alberto_chavez at live.com > > > > CC: swift-devel at ci.uchicago.edu > > > > Date: Tue, 9 Aug 2011 11:57:06 -0700 > > > > > > > > Hmm: Unsupported passphrase algorithm: AES-128-CBC > > > > > > > > I'll try to see how that can be fixed. In the mean time, can > > > you > > > > generate a new key pair with 3DES encryption instead and use > > > that? > > > > > > > > On Tue, 2011-08-09 at 13:43 -0500, Alberto Chavez wrote: > > > > > Hello, > > > > > > > > > > > > > > > I am trying to run a simpler case than ssh-pbs-coaster > > > test case, and > > > > > I'm still having the same error. > > > > > Now I am running only ssh test case > > > > > (/tests/providers/ssh/001-catsn-ssn.swift) > > > > > > > > > > > > > > > The command line is: > > > > > swift -config cf -tc.file tc.template.data -sites.file > > > > > sites.template.xml 001-catsn-ssh.swift > > > > > > > > > > > > > > > The output: > > > > > Swift svn swift-r4861 (swift modified locally) cog-r3183 > > > > > > > > > > > > > > > RunID: 20110809-1336-ohte788a > > > > > Progress: time: Tue, 09 Aug 2011 13:36:42 -0500 > > > > > Exception in cat: > > > > > Arguments: [data.txt] > > > > > Host: ssh > > > > > Directory: > > > 001-catsn-ssh-20110809-1336-ohte788a/jobs/m/cat-mq74h7ek > > > > > - - - > > > > > > > > > > > > > > > Caused by: null > > > > > Caused by: > > > > > > > > > org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid private key or passphrase > > > > > Caused by: > > > > > > > > com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: > > > Can't > > > > > read key due to cryptography problems: > > > > > java.security.NoSuchAlgorithmException: Unsupported > > > passphrase > > > > > algorithm: AES-128-CBC > > > > > Progress: time: Tue, 09 Aug 
2011 13:36:43 -0500 Selecting > > > site:8 > > > > > Submitting:1 Failed:1 > > > > > Exception in cat: > > > > > Arguments: [data.txt] > > > > > Host: ssh > > > > > Directory: > > > 001-catsn-ssh-20110809-1336-ohte788a/jobs/n/cat-nq74h7ek > > > > > - - - > > > > > > > > > > > > > > > Caused by: null > > > > > Caused by: > > > > > > > > > org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid private key or passphrase > > > > > Caused by: > > > > > > > > com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: > > > Can't > > > > > read key due to cryptography problems: > > > > > java.security.NoSuchAlgorithmException: Unsupported > > > passphrase > > > > > algorithm: AES-128-CBC > > > > > Progress: time: Tue, 09 Aug 2011 13:36:44 -0500 Selecting > > > site:7 > > > > > Submitting:1 Failed:2 > > > > > Exception in cat: > > > > > Arguments: [data.txt] > > > > > Host: ssh > > > > > Directory: > > > 001-catsn-ssh-20110809-1336-ohte788a/jobs/o/cat-oq74h7ek > > > > > - - - > > > > > > > > > > > > > > > Caused by: null > > > > > Caused by: > > > > > > > > > org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid private key or passphrase > > > > > Caused by: > > > > > > > > com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: > > > Can't > > > > > read key due to cryptography problems: > > > > > java.security.NoSuchAlgorithmException: Unsupported > > > passphrase > > > > > algorithm: AES-128-CBC > > > > > "error_log.log" 105L, 5770C > > > > > > > > > > > > > > > My auth.defaults reads: > > > > > > > > > > > > > > > login1.beagle.ci.uchicago.edu.type=key > > > > > login1.beagle.ci.uchicago.edu.username=achavez > > > > > > > > login1.beagle.ci.uchicago.edu.key=/home/Alberto/.ssh/identity > > > > > > > > > > > > > > > login1.pads.ci.uchicago.edu.type=key > > > > > login1.pads.ci.uchicago.edu.username=achavez > > > > > > > > login1.pads.ci.uchicago.edu.key=/home/Alberto/.ssh/identity > > > > > > > > > > 
> > > > > > > > > > > > > > > and it has been set to 600, I ommited the passphrase line, > > > but it is > > > > > there, and the passphrase is right because I just verified > > > it in two > > > > > ways: > > > > > 1) by logging to pads and beagle without providing a > > > password > > > > > 2) "changed" the password. I the "new" password is the > > > same as the > > > > > "old" one. > > > > > > > > > > sites.templates.xml: > > > > > > > > > > > > > > > > > > > > > > url="login1.pads.ci.uchicago.edu" > > > > > jobmanager="ssh"/> > > > > > > > url="login1.pads.ci.uchicago.edu" /> > > > > > 0 > > > > > /home/achavez/swiftwork > > > > > > > > > > > > > > > > > > > > > > > > > config file: > > > > > > > > > > wrapperlog.always.transfer=true > > > > > sitedir.keep=true > > > > > execution.retries=0 > > > > > lazy.errors=true > > > > > status.mode=provider > > > > > use.provider.staging=true > > > > > provider.staging.pin.swiftfiles=false > > > > > foreach.max.threads=10 > > > > > provenance.log=true > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I also tried a simpler SwiftScript: > > > > > > > > > > > > > > > type filemsg; > > > > > > > > > > > > > > > app (filemsg output) hello(string s) > > > > > { > > > > > echo s stdout=@filename(output); > > > > > } > > > > > > > > > > > > > > > filemsg myfile<"dogcatdinosaur.out">; > > > > > myfile = hello("dog,cat,dinosaur"); > > > > > > > > > > > > > > > and I get the following output: > > > > > > > > > > > > > > > Swift svn swift-r4861 (swift modified locally) cog-r3183 > > > > > > > > > > > > > > > RunID: 20110809-1343-2es2hel2 > > > > > Progress: time: Tue, 09 Aug 2011 13:43:25 -0500 > > > > > Exception in echo: > > > > > Arguments: [dog,cat,dinosaur] > > > > > Host: ssh > > > > > Directory: > > > hello_swift-20110809-1343-2es2hel2/jobs/0/echo-0oldh7ek > > > > > - - - > > > > > > > > > > > > > > > Caused by: null > > > > > Caused by: > > > > > > > > > 
org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid private key or passphrase > > > > > Caused by: > > > > > > > > com.sshtools.j2ssh.transport.publickey.InvalidSshKeyException: > > > Can't > > > > > read key due to cryptography problems: > > > > > java.security.NoSuchAlgorithmException: Unsupported > > > passphrase > > > > > algorithm: AES-128-CBC > > > > > Final status: time: Tue, 09 Aug 2011 13:43:26 -0500 > > > Failed:1 > > > > > The following errors have occurred: > > > > > 1. Can't read key due to cryptography problems: > > > > > java.security.NoSuchAlgorithmException: Unsupported > > > passphrase > > > > > algorithm: AES-128-CBC > > > > > > > > > > > > > > > > > > > > > > > > > any thoughts on this? > > > > > _______________________________________________ > > > > > Swift-devel mailing list > > > > > Swift-devel at ci.uchicago.edu > > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > > > From hategan at mcs.anl.gov Thu Aug 11 13:50:43 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 11 Aug 2011 11:50:43 -0700 Subject: [Swift-devel] Call function. In-Reply-To: References: <84DC8B4A-4324-46AD-929C-2ADA0EFBDA62@hawaga.org.uk> Message-ID: <1313088643.8503.21.camel@blabla> On Thu, 2011-08-11 at 18:15 +0530, Yadu Nand wrote: > > That's moving a big jump away from compile time type checking: you can't check the return types if you don't know anything about the function to call. > > Does that matter for Swift? Its nice to find errors before you embark on a long run. But the strongly-typed-ness of swift doesn't otherwise seem too useful. 
> Well, What I plan on doing is the first string passed to a "call" function
> will need to be a function identifier and as we translate to karajan lookup
> the type of the function and ensure that the return and input args match.
> I haven't gotten there yet, I'm still arm-twisting the parser to accept the
> new syntax.

Right. Well, Ben nails it here.

>
> > Do you need general string based invocation? Where are you getting
> > these strings from?
> Well, I don't understand if it makes a difference. Its easier with strings,
> because we then just need to pass them on to executeElement which
> now accepts the string identifier of a procedure.

A string is not a function. It's as simple as that. One important
quality of strong typing is that values of a type don't magically
transform into values of another type. So a string shouldn't mean a
bunch of characters in one context and a function in others.

So that means that we need function types. These will have to look like
signatures without actual bodies:

    (file b) proc(file a) mycat;

That's somewhat clear. What is unclear is how (and what) we assign to
mycat. Given a standard cat (matching the above signature), we could
have:

    mycat = cat;

But the issue there is that now variables and procedures appear to live
in the same namespace. So the semantics of the following code are
unclear:

    int f = 2;
    (int r) f(int i) {...};
    x = f;

What is assigned to x?

There are three resolutions I can think of:

1. Keep them in the same namespace, treat all procs as if they were
equivalent to name = signature {body}. Disallow variables and procedures
with the same name.

2. Do not keep them in the same namespace and consider the namespace
implicit in the type. So if I assign to a non-procedure type then it's a
normal variable, and if I assign to (or use in the context of) a
procedure type, then it's a procedure.
This could be confusing to a user and it requires one to look at the
context of an expression to determine its type, which complicates the
compiler code.

3. Have some special keyword that indicates the procedure namespace:
myfn = proc f (or myfn = proc(f) or myfn = proc:f).

From hategan at mcs.anl.gov Thu Aug 11 13:54:12 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 11 Aug 2011 11:54:12 -0700
Subject: [Swift-devel] Cogkit SVN access
In-Reply-To: <1287204456.62270.1313084728219.JavaMail.root@zimbra-mb2.anl.gov>
References: <1287204456.62270.1313084728219.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID: <1313088852.8503.24.camel@blabla>

I gave you access to the cog svn, but the question is whether we still
care about java 1.5. Are there any machines that haven't moved to 1.6?

On the other hand, it could be argued that the cost of keeping 1.5
compatibility is relatively low, so we might as well just do it.

On Thu, 2011-08-11 at 12:45 -0500, David Kelly wrote:
> Hello,
>
> How can I request access to the cogkit SVN repo? I have a patch I'd like to apply that allows 0.93 to compile under Java 1.5. My sourceforge username is davidkelly999.
>
> Thanks,
> David
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

From benc at hawaga.org.uk Thu Aug 11 14:09:01 2011
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 11 Aug 2011 19:09:01 +0000 (GMT)
Subject: [Swift-devel] Call function.
In-Reply-To: <1313088643.8503.21.camel@blabla>
References: <84DC8B4A-4324-46AD-929C-2ADA0EFBDA62@hawaga.org.uk> <1313088643.8503.21.camel@blabla>
Message-ID:

> 1. Keep them in the same namespace, treat all procs as if they were
> equivalent to name = signature {body}. Disallow variables and procedures
> with the same name.

This is what both C and Haskell do, very roughly.
C has a distinction between functions directly defined, and functions
being referenced (by a pointer) - the function is invoked by name as
f(x) in the first case, and as (*f)(x) in the second case. Haskell
treats them the same: f x

The other two options you give look ugly to me.

But this project is not about abstract PL research/development - what
makes this stuff easier to use?

I'm inclined to think some function type (rather than strings) does (but
I'm very into compile time safety so that's unsurprising), but I'm not
sure if it's worth the effort for what I think is a fairly restricted
use case.

--

From davidk at ci.uchicago.edu Thu Aug 11 14:25:55 2011
From: davidk at ci.uchicago.edu (David Kelly)
Date: Thu, 11 Aug 2011 14:25:55 -0500 (CDT)
Subject: [Swift-devel] Cogkit SVN access
In-Reply-To: <1313088852.8503.24.camel@blabla>
Message-ID: <633281416.62517.1313090755025.JavaMail.root@zimbra-mb2.anl.gov>

Thanks Mihael.

I think there are a few systems still out there running 1.5, but not
many. Intrepid runs 1.5 by default, but I think you can modify it with
softenv. There was another machine called sisboombah that only had 1.5.
I don't have any strong feelings about it one way or another, but it's
probably a good idea to come up with a list of what is supported
(version, IBM java, openjdk?) and add it to the list of things we test
before a new release.

David

----- Original Message -----
> From: "Mihael Hategan"
> To: "David Kelly"
> Cc: swift-devel at ci.uchicago.edu
> Sent: Thursday, August 11, 2011 1:54:12 PM
> Subject: Re: [Swift-devel] Cogkit SVN access
> I gave you access to the cog svn, but the question is whether we still
> care about java 1.5. Are there any machines that haven't moved to 1.6?
>
> On the other hand, it could be argued that the cost of keeping 1.5
> compatibility is relatively low, so we might as well just do it.
>
> On Thu, 2011-08-11 at 12:45 -0500, David Kelly wrote:
> > Hello,
> >
> > How can I request access to the cogkit SVN repo?
I have a patch I'd
> > like to apply that allows 0.93 to compile under Java 1.5. My
> > sourceforge username is davidkelly999.
> >
> > Thanks,
> > David
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

From hategan at mcs.anl.gov Thu Aug 11 14:40:02 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 11 Aug 2011 12:40:02 -0700
Subject: [Swift-devel] Call function.
In-Reply-To:
References: <84DC8B4A-4324-46AD-929C-2ADA0EFBDA62@hawaga.org.uk> <1313088643.8503.21.camel@blabla>
Message-ID: <1313091602.9185.5.camel@blabla>

On Thu, 2011-08-11 at 19:09 +0000, Ben Clifford wrote:
> > 1. Keep them in the same namespace, treat all procs as if they were
> > equivalent to name = signature {body}. Disallow variables and procedures
> > with the same name.
>
> This is what both C and Haskell do, very roughly.

I'd argue that C does a bit of both the above and x = proc f (x = &f).
And let's not take the credit from ML because it did what Haskell does
before there was a Haskell.

>
> C has a distinction between functions directly defined, and functions
> being referenced (by a pointer) - the name of the function invoked as
> f(x); in the first case, and (*f)(x) in the second case. Haskell treats
> them the same: f x
>
> The other two options you give look ugly to me.
>
> But this project is not about abstract PL research/development - what
> makes this stuff easier to use?

x = proc f. The name clash I think would be annoying. Inference (i.e.
context dependent meanings) is not intuitive.

>
> I'm inclined to think some function type (rather than strings) does, (but
> I'm very into compile time safety so that's unsurprising) but I'm not sure
> if its worth the effort for what I think is a fairly restricted use case.
>

It's restricted because it's not implemented.
But we have no (easy) way of knowing how much it will actually be used should it be there. From wozniak at mcs.anl.gov Thu Aug 11 16:06:37 2011 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Thu, 11 Aug 2011 16:06:37 -0500 (CDT) Subject: [Swift-devel] Cogkit SVN access In-Reply-To: <633281416.62517.1313090755025.JavaMail.root@zimbra-mb2.anl.gov> References: <633281416.62517.1313090755025.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: Putting this in the release notes is a good idea. The BG/P and Cray both have 1.6. Justin On Thu, 11 Aug 2011, David Kelly wrote: > Thanks Mihael. > > I think there are a few systems still out there running 1.5, but not > many. Intrepid runs 1.5 by default, but I think you can modify it with > softenv. There was another machine called sisboombah that only had 1.5. > I don't have any strong feelings about it one way or another, but it's a > probably a good idea to come up with a list of what is supported > (version, IBM java, openjdk?) and add it to the list of things we test > before a new release. > > David > > ----- Original Message ----- >> From: "Mihael Hategan" >> To: "David Kelly" >> Cc: swift-devel at ci.uchicago.edu >> Sent: Thursday, August 11, 2011 1:54:12 PM >> Subject: Re: [Swift-devel] Cogkit SVN access >> I gave you access to the cog svn, but the question is whether we still >> care about java 1.5. Are there any machines that haven't moved to 1.6? >> >> On the other hand, it could be argued that the cost of keeping 1.5 >> compatibility is relatively low, so we might as well just do it. >> >> On Thu, 2011-08-11 at 12:45 -0500, David Kelly wrote: >>> Hello, >>> >>> How can I request access to the cogkit SVN repo? I have a patch I'd >>> like to apply that allows 0.93 to compile under Java 1.5. My >>> sourceforge username is davidkelly999. 
>>> >>> Thanks, >>> David >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > -- Justin M Wozniak From jonmon at mcs.anl.gov Thu Aug 11 17:13:37 2011 From: jonmon at mcs.anl.gov (=?utf-8?B?Sm9uYXRoYW4gTW9uZXR0ZQ==?=) Date: Thu, 11 Aug 2011 17:13:37 -0500 Subject: [Swift-devel] =?utf-8?q?Cogkit_SVN_access?= Message-ID: <20110811221323.C0F05126F4@zimbra.anl.gov> I don't have access to BG/P but I know on Beagle there was an issue with the IBMs Java. It was throwing an EOFException and you needed to install Suns jre. Ketan reported this awhile back and I experienced the issue as well. I do not know if this problem has been resolved though. ----- Reply message ----- From: "Justin M Wozniak" Date: Thu, Aug 11, 2011 4:06 pm Subject: [Swift-devel] Cogkit SVN access To: "David Kelly" Cc: Putting this in the release notes is a good idea. The BG/P and Cray both have 1.6. Justin On Thu, 11 Aug 2011, David Kelly wrote: > Thanks Mihael. > > I think there are a few systems still out there running 1.5, but not > many. Intrepid runs 1.5 by default, but I think you can modify it with > softenv. There was another machine called sisboombah that only had 1.5. > I don't have any strong feelings about it one way or another, but it's a > probably a good idea to come up with a list of what is supported > (version, IBM java, openjdk?) and add it to the list of things we test > before a new release. 
> > David > > ----- Original Message ----- >> From: "Mihael Hategan" >> To: "David Kelly" >> Cc: swift-devel at ci.uchicago.edu >> Sent: Thursday, August 11, 2011 1:54:12 PM >> Subject: Re: [Swift-devel] Cogkit SVN access >> I gave you access to the cog svn, but the question is whether we still >> care about java 1.5. Are there any machines that haven't moved to 1.6? >> >> On the other hand, it could be argued that the cost of keeping 1.5 >> compatibility is relatively low, so we might as well just do it. >> >> On Thu, 2011-08-11 at 12:45 -0500, David Kelly wrote: >>> Hello, >>> >>> How can I request access to the cogkit SVN repo? I have a patch I'd >>> like to apply that allows 0.93 to compile under Java 1.5. My >>> sourceforge username is davidkelly999. >>> >>> Thanks, >>> David >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > -- Justin M Wozniak _______________________________________________ Swift-devel mailing list Swift-devel at ci.uchicago.edu https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From yadudoc1729 at gmail.com Fri Aug 12 07:25:51 2011 From: yadudoc1729 at gmail.com (Yadu Nand) Date: Fri, 12 Aug 2011 17:55:51 +0530 Subject: [Swift-devel] Call function. In-Reply-To: <1313091602.9185.5.camel@blabla> References: <84DC8B4A-4324-46AD-929C-2ADA0EFBDA62@hawaga.org.uk> <1313088643.8503.21.camel@blabla> <1313091602.9185.5.camel@blabla> Message-ID: Hi, Do all procedures have the same default namespace in swift ? (not considering imports) If that is the case, the reason for having functions passed to other functions almost goes away. 
I can't imagine any scenario in which we might need that.

What remains are the functional-style iterators like map and reduce,
which by convention use functions passed to them. Is this the style we
need?

    int sum1 = call ( (int)func1(int[ ]) , [1,2,3] );

(I can't help but say that if we gave call a bit of freedom to ignore
the strong-typedness of swift, it could try some really cool things.
Just saying, that's all.)

>> I'm inclined to think some function type (rather than strings) does, (but
>> I'm very into compile time safety so that's unsurprising) but I'm not sure
>> if its worth the effort for what I think is a fairly restricted use case.

Well, how else would we do the map-reduce style functionality? Sure, we
can probably do map by writing a separate procedure for applying a
function over every item in a list, but I think map is easier (and
cooler!)

> It's restricted because it's not implemented. But we have no (easy) way
> of knowing how much it will actually be used should it be there.

:)
In erlang you can write code to do map, reduce and fold, but users like
these functions, which help them save probably a couple more lines of
code.

--
Thanks and Regards,
Yadu Nand B

From benc at hawaga.org.uk Fri Aug 12 07:46:43 2011
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 12 Aug 2011 14:46:43 +0200
Subject: [Swift-devel] Call function.
In-Reply-To:
References: <84DC8B4A-4324-46AD-929C-2ADA0EFBDA62@hawaga.org.uk> <1313088643.8503.21.camel@blabla> <1313091602.9185.5.camel@blabla>
Message-ID:

On Aug 12, 2011, at 2:25 PM, Yadu Nand wrote:

> Do all procedures have the same default namespace in swift ?
> (not considering imports)

I don't really understand what you mean by that question.

There's only one function namespace at the moment.

VDS/VDL, the predecessor to swift, had more namespace structure, but
nothing has really driven that to happen in swift.
(something that might drive that, for example, could be people wanting
to write libraries in swift (rather than libraries to use *with* swift,
but written in a different language) - so maybe they'll appear in swift
too, one day)

> Sure, we
> can probably do map by writing a separate procedure for applying a function
> over every item in a list, but I think map is easier (and cooler!)

You can do a map now using foreach, without writing a separate
procedure. That was one of the original "interesting things" that swift
did beyond VDL.

hategan posted recently about using foreach to iterate "sequentially"
over data (meaning a value in the output array can depend on everything
to the left of it), which looks like it could do a lot of fold-like
stuff too (in a thread about getting rid of iterate).

(That needed the ability to access "the previous" element - when you're
numbering your output array with an integer, that's easy: here-1. But
when you're using 'auto', then that doesn't work (and the meaning of
"the predecessor" is not immediately apparent in the case of 'auto' -
there are a few different things it could mean))

What you can't do at the moment is use some function identity (be it a
string or be it some richer function reference) - and given that, I find
your comment:

> If that is the case, the reason for having
> functions passed to other functions almost goes away. I can't
> imagine any scenario in which we might need that.

a bit perplexing because I thought that throwing functions around as
values was exactly what you wanted to do?

--

From yadudoc1729 at gmail.com Fri Aug 12 08:34:11 2011
From: yadudoc1729 at gmail.com (Yadu Nand)
Date: Fri, 12 Aug 2011 19:04:11 +0530
Subject: [Swift-devel] Call function.
In-Reply-To:
References: <84DC8B4A-4324-46AD-929C-2ADA0EFBDA62@hawaga.org.uk> <1313088643.8503.21.camel@blabla> <1313091602.9185.5.camel@blabla>
Message-ID:

> There's only one function namespace at the moment.

Yes, that answers the question.
> You can do a map now using foreach, without writing a separate
> procedure. that was one of the original "interesting things" that
> swift did beyond VDL.
>
> hategan posted recently about using foreach to iterate "sequentially"
> over data (meaning a value in the output array can depend on
> everything to the left of it) which looks like it could do a lot of
> fold-like stuff too. (in a thread about getting rid of iterate).
>
> (That needed the ability to access "the previous" element - when
> you're numbering your output array with an integer, thats easy:
> here-1. But when you're using 'auto', then that doesn't work (and the
> meaning of "the predecessor" is not immediately apparent in the case
> of 'auto' - there are a few different things it could mean))

So we don't need functional iterators in swift? Why was this on the
gsoc-ideas list? I was under the impression that functional iterators
would be useful in some way :(

> What you can't do at the moment is use some function identity (be it a
> string or be it some richer function reference) - and given that, I
> find your comment:
>
>> If that is the case, the reason for having
>> functions passed to other functions almost goes away. I can't
>> imagine any scenario in which we might need that.
>
>
> a bit perplexing because I thought that throwing functions around as
> values was exactly what you wanted to do?

If all the functions existed in the same namespace (or if there is no
concept of namespace), we don't need to pass functions to other
functions, do we? What I understand is that we pass function-a to
function-b so that function-a becomes available under function-b.

--
Thanks and Regards,
Yadu Nand B

From hategan at mcs.anl.gov Fri Aug 12 10:15:19 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 12 Aug 2011 08:15:19 -0700
Subject: [Swift-devel] Call function.
In-Reply-To: References: <84DC8B4A-4324-46AD-929C-2ADA0EFBDA62@hawaga.org.uk> <1313088643.8503.21.camel@blabla> <1313091602.9185.5.camel@blabla> Message-ID: <1313162119.15746.7.camel@blabla> On Fri, 2011-08-12 at 19:04 +0530, Yadu Nand wrote: > If all the functions existed in the same namespace (or if there is no > concept of namespace), > we don't need to pass functions to other functions, do we ? What I > understand is that, we > pass function-a to function-b so that function-a becomes available > under function-b. > No. You use first class functions of type T to be able to write a generic function G that can use functions of type T without being tied to a specific function of type T (say F) at the time you write G. In other words, you can produce a G that works with a class of functions (T) rather than a specific function. In a sense you are correct. You don't really need it. You can always copy-and-paste the body of G and manually replace calls to F. But that seems to be a crappy way to do things. From benc at hawaga.org.uk Fri Aug 12 11:39:57 2011 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 12 Aug 2011 16:39:57 +0000 (GMT) Subject: [Swift-devel] Call function. In-Reply-To: <1313162119.15746.7.camel@blabla> References: <84DC8B4A-4324-46AD-929C-2ADA0EFBDA62@hawaga.org.uk> <1313088643.8503.21.camel@blabla> <1313091602.9185.5.camel@blabla> <1313162119.15746.7.camel@blabla> Message-ID: What would be needed to write 'map' in swift(script) for use as a library function by other swift code? (rather than writing it in karajan or java) It might look something like this: (standby for bleeding eyes on the first line) (X out[]) map( (X)f(Y), Y inp[]) { foreach v,i in inp { out[i] = f(v); // or equivalently out[i] = f( inp[i] ); } } The above adds syntax for passing a function f which takes a value of type Y and returns a value of type X. f is invoked by juxtaposition rather than by an explicit call, though I think that is irrelevant for this message. 
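A rough sketch of the map proposed above, written in Python since Swift has no syntax for function parameters yet. The names map_array, X and Y are illustrative only. X and Y are declared as type variables here so that the sketch is valid for any element types, and a dict keyed by int stands in for a Swift array.

```python
from typing import Callable, Dict, TypeVar

X = TypeVar("X")  # result type of f (hypothetical name)
Y = TypeVar("Y")  # argument type of f (hypothetical name)


def map_array(f: Callable[[Y], X], inp: Dict[int, Y]) -> Dict[int, X]:
    """Apply f to every element of inp, keeping the same indices."""
    out: Dict[int, X] = {}
    for i, v in inp.items():  # plays the role of: foreach v,i in inp
        out[i] = f(v)
    return out


def square(b: int) -> int:
    return b * b


# The same map_array works with any one-argument function.
results = map_array(square, {0: 1, 1: 2, 2: 3})
labels = map_array(str, {5: 1.5})
```

map_array is written once and never mentions square, which is the point of passing the function in as a value.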
But for map to work for arbitrary 1-d arrays, X and Y need to be type variables of some kind, not actual concrete types. We haven't discussed that at all in this thread, but I think it would be needed to do the above kind of thing. That's nothing particularly fancy though its another shift away from fortran-era types - C++ templates, java generics can both express this in some form, as can haskell. Comments? -- http://www.hawaga.org.uk/ben/ From hategan at mcs.anl.gov Fri Aug 12 11:48:50 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 12 Aug 2011 09:48:50 -0700 Subject: [Swift-devel] Call function. In-Reply-To: References: <84DC8B4A-4324-46AD-929C-2ADA0EFBDA62@hawaga.org.uk> <1313088643.8503.21.camel@blabla> <1313091602.9185.5.camel@blabla> Message-ID: <1313167730.15746.13.camel@blabla> On Fri, 2011-08-12 at 17:55 +0530, Yadu Nand wrote: > >> I'm inclined to think some function type (rather than strings) does, (but > >> I'm very into compile time safety so that's unsurprising) but I'm not sure > >> if its worth the effort for what I think is a fairly restricted use case. > Well, how else would we do the map - reduce style functionality. Sure, we > can probably do map by writing a separate procedure for applying a function > over every item in a list, but I think map is easier (and cooler!) We have to distinguish between "map" as it appears in standard functional languages (i.e. map(S -> T, [S]) -> [T]) and "map" as in Google map/reduce function signature (i.e. map(K1, T1) -> [(K2, T2)]). The semantics of the functional map can be achieved in swift using foreach. From hategan at mcs.anl.gov Fri Aug 12 12:11:38 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 12 Aug 2011 10:11:38 -0700 Subject: [Swift-devel] Call function. 
In-Reply-To: References: <84DC8B4A-4324-46AD-929C-2ADA0EFBDA62@hawaga.org.uk> <1313088643.8503.21.camel@blabla> <1313091602.9185.5.camel@blabla> <1313162119.15746.7.camel@blabla> Message-ID: <1313169098.15746.24.camel@blabla> On Fri, 2011-08-12 at 16:39 +0000, Ben Clifford wrote: > What would be needed to write 'map' in swift(script) for use as a library > function by other swift code? (rather than writing it in karajan or java) > > It might look something like this: (standby for bleeding eyes on the first > line) > > (X out[]) map( (X)f(Y), Y inp[]) { > foreach v,i in inp { > out[i] = f(v); // or equivalently out[i] = f( inp[i] ); > } > } > > The above adds syntax for passing a function f which takes a value of type > Y and returns a value of type X. f is invoked by juxtaposition rather than > by an explicit call, though I think that is irrelevant for this message. > > But for map to work for arbitrary 1-d arrays, X and Y need to be type > variables of some kind, not actual concrete types. > > We haven't discussed that at all in this thread, but I think it would be > needed to do the above kind of thing. That's nothing particularly fancy > though its another shift away from fortran-era types - C++ templates, java > generics can both express this in some form, as can haskell. > > Comments? > True, but we can still cover a lot of cases with: type mrfile; type kv { string key; mrfile value; } (kv[] intermediate) map_((kv[] r) proc(string key, mrfile value), kv[] input) { ... } (mrfile[string][] out2) collect_(kv[] intermediate) { ... } (mrfile[] result) reduce_((mrfile[]) proc(mrfile[string]), mrfile[string][]) { ... } I agree though. A full treatment of the issue would require type parameters. Mihael From yadudoc1729 at gmail.com Sun Aug 14 12:54:39 2011 From: yadudoc1729 at gmail.com (Yadu Nand) Date: Sun, 14 Aug 2011 23:24:39 +0530 Subject: [Swift-devel] Call function. 
In-Reply-To: <1313169098.15746.24.camel@blabla> References: <84DC8B4A-4324-46AD-929C-2ADA0EFBDA62@hawaga.org.uk> <1313088643.8503.21.camel@blabla> <1313091602.9185.5.camel@blabla> <1313162119.15746.7.camel@blabla> <1313169098.15746.24.camel@blabla> Message-ID: Hi, I've got the following syntax working for call : (int a) square (int b) { a = b * b ; } int x = call ( square , 5 ); The "call" syntax uses the usual procedure-call syntax checks (its basically the same code copied over, with very limited changes). The major difference being that while swift is being converted to xml the function call is essentially a string and later on checked for type conformance of the input and output variables. It is again treated as a string when converted to karajan as a string arg to executeElement. Sadly now, it looks like the usual procedure-call except for a slightly different syntax. I am now working on implementing a functional style map which accepts arrays. Right now we use a foreach to iterate through arrays, as functions don't accept or return arrays. int results[int]; foreach v,i in [1,2,3,4,5] { results [i] = square (v) ; } What I'm hoping to achieve is to make this simpler like : int results[int] = map ( square , [1,2,3,4,5] ); How does this look ? Will this be useful ? -- Thanks and Regards, Yadu Nand B From hategan at mcs.anl.gov Sun Aug 14 15:03:41 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 14 Aug 2011 13:03:41 -0700 Subject: [Swift-devel] Call function. 
In-Reply-To: References: <84DC8B4A-4324-46AD-929C-2ADA0EFBDA62@hawaga.org.uk> <1313088643.8503.21.camel@blabla> <1313091602.9185.5.camel@blabla> <1313162119.15746.7.camel@blabla> <1313169098.15746.24.camel@blabla> Message-ID: <1313352221.1660.1.camel@blabla> On Sun, 2011-08-14 at 23:24 +0530, Yadu Nand wrote: > int results[int]; > foreach v,i in [1,2,3,4,5] { > results [i] = square (v) ; > } > > What I'm hoping to achieve is to make this simpler like : > int results[int] = map ( square , [1,2,3,4,5] ); > > How does this look ? Will this be useful ? > I'm not sure I would say useful as much as convenient. But it goes back to an issue mentioned earlier: what is the type signature of "map"? From wilde at mcs.anl.gov Sun Aug 14 15:14:15 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 14 Aug 2011 15:14:15 -0500 (CDT) Subject: [Swift-devel] Call function. In-Reply-To: Message-ID: <704756509.218261.1313352855506.JavaMail.root@zimbra.anl.gov> ----- Original Message ----- > From: "Yadu Nand" ... > What I'm hoping to achieve is to make this simpler like : > int results[int] = map ( square , [1,2,3,4,5] ); > > How does this look ? Will this be useful ? How is this more useful than simply doing: foreach i in [1:5] { results[i] = square(i) } I think that the call() function is useful, in that it gives Swift the equivalent of function pointers. But if your goal, Yadu, is to explore map-reduce in Swift, I would look more closely at (a) use cases of processing key-value data and (b) how to do distributed parallel reduction in Swift. Feel free to discuss some of the notes I sent you privately, on the swift-devel list. I think two of the issues for parallel reduction are: - how to specify intermediate reduction functions that take some arbitrary subset of a map()'s output (i.e., a foreach's result set) and process those values to return an intermediate result, and then to repeat that at one or more levels.
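One way to picture the multi-level reduction described in the point above, as a Python sketch. The name tree_reduce and the group_size parameter are made up for illustration, and nothing here reflects an existing Swift feature. The sketch assumes the reducer is associative and returns a value of the same type as its inputs, so that intermediate results can be fed back in at the next level.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def tree_reduce(reducer: Callable[[Sequence[T]], T],
                values: Sequence[T],
                group_size: int = 2) -> T:
    """Reduce values level by level, group_size items at a time."""
    if not values:
        raise ValueError("nothing to reduce")
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    level: List[T] = list(values)
    while len(level) > 1:
        # Each chunk is independent of the others, so the chunks of one
        # level could be reduced in parallel on different nodes.
        level = [reducer(level[i:i + group_size])
                 for i in range(0, len(level), group_size)]
    return level[0]


# "map" phase: square each input; "reduce" phase: sum in groups of two.
mapped = [v * v for v in [1, 2, 3, 4, 5]]
total = tree_reduce(sum, mapped, group_size=2)
```

With five inputs and groups of two, the levels hold three, two and then one value. How those chunks would be placed on compute sites is exactly the part this sketch leaves open.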
I think what we're looking for is a way to couch the abstractions in the semi-functional Swift language, in a way that enables data-movement-efficient and highly parallel processing to take place at runtime with no specification of the physical computing locations in the abstract program. It's possible that we want to do this via some kind of new reduce() operator, which in turn may require an analogous map operator. If this leads you there, then you could re-visit (and challenge) my comment above about map() not being substantially different than foreach(). What I say here is still somewhat vague, as I'm not able to devote enough thinking to this at the moment. I would proceed by detailing carefully the reduction phase of classic Google map-reduce and proposing a way to both abstract and implement reduction with similar efficiency (in terms of parallelism and distributed operation) in Swift. - Mike From yadudoc1729 at gmail.com Sun Aug 14 16:57:58 2011 From: yadudoc1729 at gmail.com (Yadu Nand) Date: Mon, 15 Aug 2011 03:27:58 +0530 Subject: [Swift-devel] Call function. In-Reply-To: <1313352221.1660.1.camel@blabla> References: <84DC8B4A-4324-46AD-929C-2ADA0EFBDA62@hawaga.org.uk> <1313088643.8503.21.camel@blabla> <1313091602.9185.5.camel@blabla> <1313162119.15746.7.camel@blabla> <1313169098.15746.24.camel@blabla> <1313352221.1660.1.camel@blabla> Message-ID: >> What I'm hoping to achieve is to make this simpler like : >> int results[int] = map ( square , [1,2,3,4,5] ); >> >> How does this look ? Will this be useful ? > I'm not sure I would say useful as much as convenient. But it goes back > to an issue mentioned earlier: what is the type signature of "map"?
Given a function func of type : (type1) func (type2) The type signature for map will be like : ( type1[ type3] ) map ( , type2[type3] ); -- Thanks and Regards, Yadu Nand B From yadudoc1729 at gmail.com Sun Aug 14 17:04:00 2011 From: yadudoc1729 at gmail.com (Yadu Nand) Date: Mon, 15 Aug 2011 03:34:00 +0530 Subject: [Swift-devel] Call function. In-Reply-To: <704756509.218261.1313352855506.JavaMail.root@zimbra.anl.gov> References: <704756509.218261.1313352855506.JavaMail.root@zimbra.anl.gov> Message-ID: > How is this more useful than simply doing: > > foreach i in [1:5] { > results[i] = square(i) > } > > I think that the call() function is useful, in that it gives Swift the equivalent of function pointers. I don't think it gives any greater advantage than a slightly easier syntax. > But if your goal, Yadu, is to explore map-reduce in Swift, I would look more closely at (a) use cases of processing key-value data and (b) how to do distributed parallel reduction in Swift. Feel free to discuss some of the notes I sent you privately, on the swift-devel list. I think I have veered off from the original plan of MapReduce, but then again, one of my proposals was about functional style iterators. I should probably read up more and then post about this. > I think two of the issues for parallel reduction are: > > - how to specify intermediate reduction functions that take some arbitrary subset of a map()'s output (i.e., a foreach's result set) and process those values to return an intermediate result, and then to repeat that at one or more levels. > > I think what we're looking for is a way to couch the abstractions in the semi-functional Swift language, in a way that enables data-movement-efficient and highly parallel processing to take place at runtime with no specification of the physical computing locations in the abstract program. > > It's possible that we want to do this via some kind of new reduce() operator, which in turn may require an analogous map operator.
If this leads you there, then you could re-visit (and challenge) my comment above about map() not being substantially different than foreach(). > > What I say here is still somewhat vague, as I'm not able to devote enough thinking to this at the moment. I would proceed by detailing carefully the reduction phase of classic Google map-reduce and proposing a way to both abstract and implement reduction with similar efficiency (in terms of parallelism and distributed operation) in Swift. > > - Mike > > -- Thanks and Regards, Yadu Nand B From hategan at mcs.anl.gov Sun Aug 14 17:05:55 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 14 Aug 2011 15:05:55 -0700 Subject: [Swift-devel] Call function. In-Reply-To: References: <84DC8B4A-4324-46AD-929C-2ADA0EFBDA62@hawaga.org.uk> <1313088643.8503.21.camel@blabla> <1313091602.9185.5.camel@blabla> <1313162119.15746.7.camel@blabla> <1313169098.15746.24.camel@blabla> <1313352221.1660.1.camel@blabla> Message-ID: <1313359555.2226.2.camel@blabla> On Mon, 2011-08-15 at 03:27 +0530, Yadu Nand wrote: > >> What I'm hoping to achieve is to make this simpler like : > >> int results[int] = map ( square , [1,2,3,4,5] ); > >> > >> How does this look ? Will this be useful ? > > > I'm not sure I would say useful as much as convenient. But it goes back > > to an issue mentioned earlier: what is the type signature of "map"? > > Given a function func of type : > (type1) func (type2) > > The type signature for map will be like : > ( type1[ type3] ) map ( , type2[type3] ); > What if I have two functions: (type1) f(type2) (type4) f(type5) ? From yadudoc1729 at gmail.com Mon Aug 15 00:56:10 2011 From: yadudoc1729 at gmail.com (Yadu Nand) Date: Mon, 15 Aug 2011 11:26:10 +0530 Subject: [Swift-devel] Call function.
In-Reply-To: <1313359555.2226.2.camel@blabla> References: <84DC8B4A-4324-46AD-929C-2ADA0EFBDA62@hawaga.org.uk> <1313088643.8503.21.camel@blabla> <1313091602.9185.5.camel@blabla> <1313162119.15746.7.camel@blabla> <1313169098.15746.24.camel@blabla> <1313352221.1660.1.camel@blabla> <1313359555.2226.2.camel@blabla> Message-ID: > What if I have two functions: > (type1) f(type2) > (type4) f(type5) > ? Wouldn't the second definition overwrite the first ? In effect only the second one would be valid unless, of course, the checks for redefinition of procedures have gone into trunk and made this illegal. Or , did you mean something else ? -- Thanks and Regards, Yadu Nand B From jonmon at mcs.anl.gov Mon Aug 15 01:01:20 2011 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Mon, 15 Aug 2011 01:01:20 -0500 Subject: [Swift-devel] Call function. In-Reply-To: References: <84DC8B4A-4324-46AD-929C-2ADA0EFBDA62@hawaga.org.uk> <1313088643.8503.21.camel@blabla> <1313091602.9185.5.camel@blabla> <1313162119.15746.7.camel@blabla> <1313169098.15746.24.camel@blabla> <1313352221.1660.1.camel@blabla> <1313359555.2226.2.camel@blabla> Message-ID: Well, that case should be valid, shouldn't it? I may have misunderstood the validity of overwritten functions. I know in Java you can have several methods named with the same identifier, and which method is called is determined at compile time based on the argument signature. So this seems like it should be a valid case, but again I may have misunderstood the discussion on re-defining functions. On Aug 15, 2011, at 12:56 AM, Yadu Nand wrote: >> What if I have two functions: >> (type1) f(type2) >> (type4) f(type5) >> ? > > Wouldn't the second definition overwrite the first ? In effect only the > second one would be valid unless of course the checks for redefinition > of procedures has gone into trunk and made this illegal. > Or , did you mean something else ?
> > -- > Thanks and Regards, > Yadu Nand B From yadudoc1729 at gmail.com Mon Aug 15 01:17:07 2011 From: yadudoc1729 at gmail.com (Yadu Nand) Date: Mon, 15 Aug 2011 11:47:07 +0530 Subject: [Swift-devel] Call function. In-Reply-To: References: <84DC8B4A-4324-46AD-929C-2ADA0EFBDA62@hawaga.org.uk> <1313088643.8503.21.camel@blabla> <1313091602.9185.5.camel@blabla> <1313162119.15746.7.camel@blabla> <1313169098.15746.24.camel@blabla> <1313352221.1660.1.camel@blabla> <1313359555.2226.2.camel@blabla> Message-ID: On Mon, Aug 15, 2011 at 11:31 AM, Jonathan Monette wrote: > Well that case should be valid shouldn't it? I may have misunderstood the validity of overwritten functions. I know in Java you could have several methods named with the same identifier and which method that was called was determined at compile time based on the argument signature. So this seems like it should be a valid case but again I may have misunderstood the discussion on re-defining functions. Yes, Java and C++ both support function polymorphism (overloading). I am not sure how karajan deals with this underneath, but at the level where checks are made in swift, the code definitely doesn't support it yet. -- Thanks and Regards, Yadu Nand B From hategan at mcs.anl.gov Mon Aug 15 01:20:03 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 14 Aug 2011 23:20:03 -0700 Subject: [Swift-devel] Call function.
In-Reply-To: References: <84DC8B4A-4324-46AD-929C-2ADA0EFBDA62@hawaga.org.uk> <1313088643.8503.21.camel@blabla> <1313091602.9185.5.camel@blabla> <1313162119.15746.7.camel@blabla> <1313169098.15746.24.camel@blabla> <1313352221.1660.1.camel@blabla> <1313359555.2226.2.camel@blabla> Message-ID: <1313389203.3577.5.camel@blabla> On Mon, 2011-08-15 at 11:26 +0530, Yadu Nand wrote: > > What if I have two functions: > > (type1) f(type2) > > (type4) f(type5) > > ? > > Wouldn't the second definition overwrite the first ? In effect only the > second one would be valid unless of course the checks for redefinition > of procedures has gone into trunk and made this illegal. > Or , did you mean something else ? (type1) f(type2) (type4) g(type5) The point is that the signature of map() will need to accommodate both if it is to be general enough. So you'd need something like generics: T[] map(T proc(S), S[]); where T and S are not concrete types. From hategan at mcs.anl.gov Mon Aug 15 01:21:47 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 14 Aug 2011 23:21:47 -0700 Subject: [Swift-devel] Call function. In-Reply-To: References: <84DC8B4A-4324-46AD-929C-2ADA0EFBDA62@hawaga.org.uk> <1313088643.8503.21.camel@blabla> <1313091602.9185.5.camel@blabla> <1313162119.15746.7.camel@blabla> <1313169098.15746.24.camel@blabla> <1313352221.1660.1.camel@blabla> <1313359555.2226.2.camel@blabla> Message-ID: <1313389307.3577.7.camel@blabla> On Mon, 2011-08-15 at 01:01 -0500, Jonathan Monette wrote: > Well that case should be valid shouldn't it? I may have misunderstood > the validity of overwritten functions. I know in Java you could have > several methods named with the same identifier and which method that > was called was determined at compile time based on the argument > signature. So this seems like it should be a valid case but again I > may have misunderstood the discussion on re-defining functions. Sorry. I meant them to have different names. But you make a valid point.
So the question is whether we want to support simple polymorphism. From hategan at mcs.anl.gov Mon Aug 15 01:42:44 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 14 Aug 2011 23:42:44 -0700 Subject: [Swift-devel] Call function. In-Reply-To: References: <84DC8B4A-4324-46AD-929C-2ADA0EFBDA62@hawaga.org.uk> <1313088643.8503.21.camel@blabla> <1313091602.9185.5.camel@blabla> <1313162119.15746.7.camel@blabla> <1313169098.15746.24.camel@blabla> <1313352221.1660.1.camel@blabla> <1313359555.2226.2.camel@blabla> Message-ID: <1313390564.3577.17.camel@blabla> On Mon, 2011-08-15 at 11:47 +0530, Yadu Nand wrote: > On Mon, Aug 15, 2011 at 11:31 AM, Jonathan Monette wrote: > > Well that case should be valid shouldn't it? I may have misunderstood the validity of overwritten functions. I know in Java you could have several methods named with the same identifier and which method that was called was determined at compile time based on the argument signature. So this seems like it should be a valid case but again I may have misunderstood the discussion on re-defining functions. > > Yes java, C++ all support functional-polymorphism. I am not > sure how karajan deals with this underneath, It doesn't. It's dynamically typed, so it has no function polymorphism. > but at the level > where checks are made in swift, the code definitely doesn't > support it yet. > Since it's static typing, it would be known at compile time which exact function each call refers to. So given: (int r) plus(int a, int b); (float r) plus(float a, float b); The compiler would map the two to two different karajan functions (say plus1 and plus2) and invocations of plus(1, 2) would become plus1(1, 2) and those of plus(1.0, 2.0) would become plus2(1.0, 2.0). 
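The renaming step can be illustrated with a small Python sketch. The mangling scheme (name plus argument type names) and the helpers mangle, declare and resolve are invented for this illustration; they are not how the Swift compiler or karajan actually name things. A real compiler would pick the overload from static types at compile time, whereas this sketch uses the runtime types of the arguments as a stand-in.

```python
from typing import Callable, Dict, Tuple

# Table a compiler might build: mangled name -> implementation.
functions: Dict[str, Callable] = {}


def mangle(name: str, arg_types: Tuple[type, ...]) -> str:
    """Build a unique name from the function name and its argument types."""
    return name + "__" + "_".join(t.__name__ for t in arg_types)


def declare(name: str, arg_types: Tuple[type, ...], impl: Callable) -> str:
    """Register one overload under its mangled name."""
    mangled = mangle(name, arg_types)
    if mangled in functions:
        raise ValueError("duplicate definition: " + mangled)
    functions[mangled] = impl
    return mangled


def resolve(name: str, *args) -> str:
    """Return the mangled name that a call with these arguments refers to."""
    mangled = mangle(name, tuple(type(a) for a in args))
    if mangled not in functions:
        raise TypeError("no overload matches " + mangled)
    return mangled


# Two overloads of plus, as in the example above.
declare("plus", (int, int), lambda a, b: a + b)
declare("plus", (float, float), lambda a, b: a + b)

int_call = resolve("plus", 1, 2)
float_call = resolve("plus", 1.0, 2.0)
```

After resolution each call site refers to exactly one entry in the table, so the backend never needs to know that overloading existed.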
For a more detailed explanation of how this is done in various compilers, see http://en.wikipedia.org/wiki/Name_mangling From hategan at mcs.anl.gov Mon Aug 15 11:47:18 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 15 Aug 2011 09:47:18 -0700 Subject: [Swift-devel] Call function. In-Reply-To: <1313088643.8503.21.camel@blabla> References: <84DC8B4A-4324-46AD-929C-2ADA0EFBDA62@hawaga.org.uk> <1313088643.8503.21.camel@blabla> Message-ID: <1313426838.9225.1.camel@blabla> On Thu, 2011-08-11 at 11:50 -0700, Mihael Hategan wrote: > Given a standard cat (matching the above signature), we could have: > > mycat = cat; > > But the issue there is that now variables and procedures appear to live > in the same namespace. So the semantics of the following code are > unclear: > > int f = 2; > (int r) f(int i) {...}; > x = f; > > What is assigned to x? > > There are three resolutions I can think of: > 1. Keep them in the same namespace, treat all procs as if they were > equivalent to name = signature {body}. Disallow variables and procedures > with the same name. > 2. Do not keep them in the same namespace and consider the namespace > implicit in the type. So if I assign to a non-procedure type then it's a > normal variable, and if I assign to (or use in the context of) a > procedure type, then it's a procedure. This could be confusing to a user > and it requires one to look at the context of an expression to determine > its type, which complicates the compiler code. > 3. Have some special keyword that indicates the procedure namespace: > myfn= proc f (or myfn = proc(f) or myfn = proc:f). > Ok, so let's make this democratic. Opinions, preferences? From benc at hawaga.org.uk Tue Aug 16 07:55:59 2011 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 16 Aug 2011 14:55:59 +0200 Subject: [Swift-devel] Call function. 
In-Reply-To: <1313426838.9225.1.camel@blabla> References: <84DC8B4A-4324-46AD-929C-2ADA0EFBDA62@hawaga.org.uk> <1313088643.8503.21.camel@blabla> <1313426838.9225.1.camel@blabla> Message-ID: <65740AD6-3B0A-4AA4-8913-F5C074B747D4@hawaga.org.uk> On Aug 15, 2011, at 6:47 PM, Mihael Hategan wrote: >> >> 1. Keep them in the same namespace, treat all procs as if they were >> equivalent to name = signature {body}. Disallow variables and procedures >> with the same name. >> 2. Do not keep them in the same namespace and consider the namespace >> implicit in the type. So if I assign to a non-procedure type then it's a >> normal variable, and if I assign to (or use in the context of) a >> procedure type, then it's a procedure. This could be confusing to a user >> and it requires one to look at the context of an expression to determine >> its type, which complicates the compiler code. >> 3. Have some special keyword that indicates the procedure namespace: >> myfn= proc f (or myfn = proc(f) or myfn = proc:f). >> > > Ok, so let's make this democratic. Opinions, preferences? I prefer something like 1. 3 is ok - I think its unnecessary keywordage but fine otherwise. not 2. -- From wilde at mcs.anl.gov Tue Aug 16 08:42:13 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 16 Aug 2011 08:42:13 -0500 (CDT) Subject: [Swift-devel] Call function. In-Reply-To: <65740AD6-3B0A-4AA4-8913-F5C074B747D4@hawaga.org.uk> Message-ID: <125845973.222565.1313502133183.JavaMail.root@zimbra.anl.gov> I vote for 1. - Mike ----- Original Message ----- > From: "Ben Clifford" > To: "Mihael Hategan" > Cc: "swift-devel Devel" > Sent: Tuesday, August 16, 2011 7:55:59 AM > Subject: Re: [Swift-devel] Call function. > On Aug 15, 2011, at 6:47 PM, Mihael Hategan wrote: > > >> > >> 1. Keep them in the same namespace, treat all procs as if they were > >> equivalent to name = signature {body}. Disallow variables and > >> procedures > >> with the same name. > >> 2. 
Do not keep them in the same namespace and consider the > >> namespace > >> implicit in the type. So if I assign to a non-procedure type then > >> it's a > >> normal variable, and if I assign to (or use in the context of) a > >> procedure type, then it's a procedure. This could be confusing to a > >> user > >> and it requires one to look at the context of an expression to > >> determine > >> its type, which complicates the compiler code. > >> 3. Have some special keyword that indicates the procedure > >> namespace: > >> myfn= proc f (or myfn = proc(f) or myfn = proc:f). > >> > > > > Ok, so let's make this democratic. Opinions, preferences? > > > I prefer something like 1. 3 is ok - I think its unnecessary > keywordage but fine otherwise. not 2. > > -- > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From jonmon at mcs.anl.gov Tue Aug 16 18:04:56 2011 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Tue, 16 Aug 2011 18:04:56 -0500 Subject: [Swift-devel] chxml Message-ID: <9A987CD0-0C64-49B0-93CF-CCC4BF00E3D7@mcs.anl.gov> Hello, Is there any way to report what the chxml script saw that was inappropriate in the sites.xml file? When I run Swift with the 0.93 release on Beagle I get an error saying that some parameters do not apply to the selected execution provider. It doesn't cause the script to fail; it just says there is an error. I would just like to know which one it is so I can remove it if it is indeed unnecessary. Maybe log the entries that are inappropriate to the log file? Perhaps the swift.log file that normally only contains 1 or 2 lines?
From wilde at mcs.anl.gov Tue Aug 16 20:11:52 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 16 Aug 2011 20:11:52 -0500 (CDT) Subject: [Swift-devel] chxml In-Reply-To: <9A987CD0-0C64-49B0-93CF-CCC4BF00E3D7@mcs.anl.gov> Message-ID: <1788602715.225846.1313543512191.JavaMail.root@zimbra.anl.gov> Agreed; filed as bug 513. Thanks, Jon. - Mike ----- Original Message ----- > From: "Jonathan Monette" > To: "swift-devel Devel" > Sent: Tuesday, August 16, 2011 6:04:56 PM > Subject: [Swift-devel] chxml > Hello, > Is there any way to report what the chxml script saw that was > inappropriate in the sites.xml file? When I run Swift with 0.93 > release on beagle I get an error saying that some parameters do not > apply to the selected execution provider. It doesn't cause the > script to fail it just says there is an error. I would just like to > know which one it is so I can remove it if it is indeed unnecessary. > Maybe log the entries that are inappropriate to the log file? Perhaps > the swift.log file that normally only contains 1 or 2 lines? -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Tue Aug 16 20:15:54 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 16 Aug 2011 20:15:54 -0500 (CDT) Subject: [Swift-devel] provider staging question In-Reply-To: <5EFEF13E-2B52-40AA-8292-E841B456335C@mcs.anl.gov> Message-ID: <156074162.225848.1313543754474.JavaMail.root@zimbra.anl.gov> Hi Jon, I'll try a short answer, but this needs more thought, and much testing: - Don't use CDM with provider staging (yet; maybe someday that will make sense...) I don't think the two will work together well.
- Provider staging should stage to the local hard disk; I *think* it may honor the workdirectory tag as where to stage, though. Or maybe the scratch tag? I think it defaults to /tmp. This needs to be tested and documented. Mihael, can you clarify? Mike ----- Original Message ----- > From: "Jonathan Monette" > To: "Michael Wilde" > Sent: Tuesday, August 16, 2011 6:08:47 PM > Subject: provider staging question > Mike, > I am configuring the run to test the SwiftMontage runs with provider > staging turned on as you suggested. When I turn on provider staging > should I not use CDM? The issue that I was experiencing that led me to > start using CDM was extremely long copy times from the cwd to the job > directory specified by the in the sites.xml file. Does > provider staging circumvent that issue? I know provider staging copies > the input files directly onto the compute nodes local disk but that is > about all I know. Could you fill me in a bit on what exactly provider > staging does? Would using both CDM "direct" directives and provider > staging cause some degraded performance? -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From jonmon at mcs.anl.gov Tue Aug 16 21:14:24 2011 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Tue, 16 Aug 2011 21:14:24 -0500 Subject: [Swift-devel] provider staging question In-Reply-To: <156074162.225848.1313543754474.JavaMail.root@zimbra.anl.gov> References: <156074162.225848.1313543754474.JavaMail.root@zimbra.anl.gov> Message-ID: <79B0E471-BFBD-4B2C-9AD9-5ED493F636E7@mcs.anl.gov> That is what I was thinking about CDM, but I wasn't sure if that is what you meant in the email a couple of days ago. I thought that the scratch tag in the sites file was broken. I thought it was mentioned before that Papia was experiencing problems when using the scratch tag in her sites file.
On Aug 16, 2011, at 8:15 PM, Michael Wilde wrote: > Hi Jon, > > I'll try a short answer, but this needs more thoight, and much testing: > > - Dont use CDM with provider staging (yet; maybe someday that will make sense...) > I dont tthink the two will work together well. > > - ps should stage to the local hard disk; I *think* it may honor the workdirectory tag as where to stage, though. Or maybe the scratch tag? I think it defaults to /tmp. This needs to be tested and documented. Mihael, can you clarify? > > > Miek > > > ----- Original Message ----- >> From: "Jonathan Monette" >> To: "Michael Wilde" >> Sent: Tuesday, August 16, 2011 6:08:47 PM >> Subject: provider staging question >> Mike, >> I am configuring the run to test the SwiftMontage runs with provider >> staging turned on as you suggested. When I turn on provider staging >> should I not use CDM? The issue that I was experiencing that led me to >> start using CDM was extremely long copy times from the cwd to the job >> directory specified by the in the sites.xml file. Does >> provider staging circumvent that issue? I know provider staging copies >> the input files directly onto the compute nodes local disk but that is >> about all I know. Could you fill me in a bit on what exactly provider >> staging does? Would using both CDM "direct" directives and provider >> staging cause some degraded performance? 
> > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > From wilde at mcs.anl.gov Tue Aug 16 22:08:22 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 16 Aug 2011 22:08:22 -0500 (CDT) Subject: [Swift-devel] provider staging question In-Reply-To: <79B0E471-BFBD-4B2C-9AD9-5ED493F636E7@mcs.anl.gov> Message-ID: <1900504312.225979.1313550502282.JavaMail.root@zimbra.anl.gov> Right, I remember that now, but I can't recall if the scratch tag was completely broken, was set incorrectly, was interfering with debugging, or was interacting incorrectly with some other setting. Papia, if you recall, please clarify. Thanks, - Mike ----- Original Message ----- > From: "Jonathan Monette" > To: "Michael Wilde" > Cc: "swift-devel Devel" > Sent: Tuesday, August 16, 2011 9:14:24 PM > Subject: Re: provider staging question > That is what I was thinking about CDM but wasn't sure if that is what > you meant in the email a couple days ago. > > I thought that the scratch tag was broken from the sites file. I > thought it was mentioned before that Papia was experiencing problems > when using the scratch tag in her sites file. > > On Aug 16, 2011, at 8:15 PM, Michael Wilde wrote: > > > Hi Jon, > > > > I'll try a short answer, but this needs more thought, and much > > testing: > > > > - Don't use CDM with provider staging (yet; maybe someday that will > > make sense...) > > I don't think the two will work together well. > > > > - ps should stage to the local hard disk; I *think* it may honor the > > workdirectory tag as where to stage, though. Or maybe the scratch > > tag? I think it defaults to /tmp. This needs to be tested and > > documented. Mihael, can you clarify?
> > > > > > Miek > > > > > > ----- Original Message ----- > >> From: "Jonathan Monette" > >> To: "Michael Wilde" > >> Sent: Tuesday, August 16, 2011 6:08:47 PM > >> Subject: provider staging question > >> Mike, > >> I am configuring the run to test the SwiftMontage runs with > >> provider > >> staging turned on as you suggested. When I turn on provider staging > >> should I not use CDM? The issue that I was experiencing that led me > >> to > >> start using CDM was extremely long copy times from the cwd to the > >> job > >> directory specified by the in the sites.xml file. > >> Does > >> provider staging circumvent that issue? I know provider staging > >> copies > >> the input files directly onto the compute nodes local disk but that > >> is > >> about all I know. Could you fill me in a bit on what exactly > >> provider > >> staging does? Would using both CDM "direct" directives and provider > >> staging cause some degraded performance? > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From ketancmaheshwari at gmail.com Tue Aug 16 22:33:14 2011 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Tue, 16 Aug 2011 22:33:14 -0500 Subject: [Swift-devel] provider staging question In-Reply-To: <1900504312.225979.1313550502282.JavaMail.root@zimbra.anl.gov> References: <79B0E471-BFBD-4B2C-9AD9-5ED493F636E7@mcs.anl.gov> <1900504312.225979.1313550502282.JavaMail.root@zimbra.anl.gov> Message-ID: Mike, Here is an old mail snippet (May) that you sent about the scratch tag: " I was surprised to see that coaster provider staging used the tag to determine the jobdir on the compute node, on Beagle where /tmp is not writeable. I always thought that it would honor the < scratch> tag to let the user specify the provider staging jobdir. 
But this seems not to be the case. " So, it seems that the scratch tag is not really honored. On Tue, Aug 16, 2011 at 10:08 PM, Michael Wilde wrote: > Right, I remember that now, but I cant recall if the scratch tag was > completely broken, was set incorrectly, was interfering with debugging, or > was interacting incorrectly with some other setting. > > Papia, if you recall, please clarify. > > Thanks, > > - Mike > > ----- Original Message ----- > > From: "Jonathan Monette" > > To: "Michael Wilde" > > Cc: "swift-devel Devel" > > Sent: Tuesday, August 16, 2011 9:14:24 PM > > Subject: Re: provider staging question > > That is what I was thinking about CDM but wasn't sure if that is what > > you meant in the email a couple days ago. > > > > I thought that the scratch tag was broken from the sites file. I > > thought it was mentioned before that Papia was experiencing problems > > when using the scratch tag in her sites file. > > > > On Aug 16, 2011, at 8:15 PM, Michael Wilde wrote: > > > > > Hi Jon, > > > > > > I'll try a short answer, but this needs more thoight, and much > > > testing: > > > > > > - Dont use CDM with provider staging (yet; maybe someday that will > > > make sense...) > > > I dont tthink the two will work together well. > > > > > > - ps should stage to the local hard disk; I *think* it may honor the > > > workdirectory tag as where to stage, though. Or maybe the scratch > > > tag? I think it defaults to /tmp. This needs to be tested and > > > documented. Mihael, can you clarify? > > > > > > > > > Miek > > > > > > > > > ----- Original Message ----- > > >> From: "Jonathan Monette" > > >> To: "Michael Wilde" > > >> Sent: Tuesday, August 16, 2011 6:08:47 PM > > >> Subject: provider staging question > > >> Mike, > > >> I am configuring the run to test the SwiftMontage runs with > > >> provider > > >> staging turned on as you suggested. When I turn on provider staging > > >> should I not use CDM? 
The issue that I was experiencing that led me > > >> to > > >> start using CDM was extremely long copy times from the cwd to the > > >> job > > >> directory specified by the in the sites.xml file. > > >> Does > > >> provider staging circumvent that issue? I know provider staging > > >> copies > > >> the input files directly onto the compute nodes local disk but that > > >> is > > >> about all I know. Could you fill me in a bit on what exactly > > >> provider > > >> staging does? Would using both CDM "direct" directives and provider > > >> staging cause some degraded performance? > > > > > > -- > > > Michael Wilde > > > Computation Institute, University of Chicago > > > Mathematics and Computer Science Division > > > Argonne National Laboratory > > > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > -- Ketan -------------- next part -------------- An HTML attachment was scrubbed...
URL: From davidk at ci.uchicago.edu Wed Aug 17 11:34:06 2011 From: davidk at ci.uchicago.edu (David Kelly) Date: Wed, 17 Aug 2011 11:34:06 -0500 (CDT) Subject: [Swift-devel] Swift 0.93 exception on Fusion In-Reply-To: <136619707.68715.1313598225179.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <1911735576.68747.1313598846556.JavaMail.root@zimbra-mb2.anl.gov> Hello, When testing 0.93 on Fusion, Swift throws an exception. I am running with the catsn script.
It runs, creates the output, but then gives this error when cleaning up: Final status: time: Wed, 17 Aug 2011 11:17:38 -0500 Finished successfully:10 org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Cannot submit job at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:67) at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:45) at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:57) at org.globus.cog.abstraction.coaster.service.job.manager.LocalQueueProcessor.run(LocalQueueProcessor.java:40) Caused by: java.lang.NullPointerException at org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor.makeName(PBSExecutor.java:304) at org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor.writeScript(PBSExecutor.java:205) at org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.buildCommandLine(AbstractExecutor.java:169) at org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.start(AbstractExecutor.java:89) at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:53) ... 3 more The config and log files are attached. They can also be found on fusion in ~davidk/temp. This is filed in bugzilla as bug #515. David -------------- next part -------------- A non-text attachment was scrubbed... 
Name: temp.tar.gz Type: application/x-compressed-tar Size: 11533 bytes Desc: not available URL: From wilde at mcs.anl.gov Wed Aug 17 11:43:38 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 17 Aug 2011 11:43:38 -0500 (CDT) Subject: [Swift-devel] Swift 0.93 exception on Fusion In-Reply-To: <1911735576.68747.1313598846556.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <354639862.227334.1313599418427.JavaMail.root@zimbra.anl.gov> David, one possibility here is that (automatic) coasters is generating an additional job with invalid PBS attributes. Can you set debug=true in etc/pbs.properties and take a look to see if that's the case? The NPE is an issue in the pbs/localsched provider, but perhaps caused by an unexpected problem or state in the pbs jobs. - Mike ----- Original Message ----- > From: "David Kelly" > To: swift-devel at ci.uchicago.edu > Sent: Wednesday, August 17, 2011 11:34:06 AM > Subject: [Swift-devel] Swift 0.93 exception on Fusion > Hello, > > When testing 0.93 on Fusion, Swift throws an exception. I am running > with the catsn script.
It runs, creates the output, but then gives > this error when cleaning up: > > Final status: time: Wed, 17 Aug 2011 11:17:38 -0500 Finished > successfully:10 > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Cannot submit job > at > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:67) > at > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:45) > at > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:57) > at > org.globus.cog.abstraction.coaster.service.job.manager.LocalQueueProcessor.run(LocalQueueProcessor.java:40) > Caused by: java.lang.NullPointerException > at > org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor.makeName(PBSExecutor.java:304) > at > org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor.writeScript(PBSExecutor.java:205) > at > org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.buildCommandLine(AbstractExecutor.java:169) > at > org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.start(AbstractExecutor.java:89) > at > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:53) > ... 3 more > > The config and log files are attached. They can also be found on > fusion in ~davidk/temp. This is filed in bugzilla as bug #515. 
> > David > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From davidk at ci.uchicago.edu Wed Aug 17 12:28:31 2011 From: davidk at ci.uchicago.edu (David Kelly) Date: Wed, 17 Aug 2011 12:28:31 -0500 (CDT) Subject: [Swift-devel] Swift 0.93 exception on Fusion In-Reply-To: <354639862.227334.1313599418427.JavaMail.root@zimbra.anl.gov> Message-ID: <304378190.68870.1313602111443.JavaMail.root@zimbra-mb2.anl.gov> Two .submit files get created in .globus/scripts: PBS8380254050153377753.submit and PBS5442733515255786709.submit. PBS5442733515255786709.submit is empty. I'm guessing this is where the problem is. PBS8380254050153377753.submit looks ok. There is nothing out of the ordinary in the stderr and stdout files for this. I can resubmit this using qsub and it gets queued without any immediate errors. Attached a tar file of the PBS files. David ----- Original Message ----- > From: "Michael Wilde" > To: "David Kelly" > Cc: swift-devel at ci.uchicago.edu > Sent: Wednesday, August 17, 2011 11:43:38 AM > Subject: Re: [Swift-devel] Swift 0.93 exception on Fusion > David, one possibility here is that (automatic) coasters is generating > an additional job with invalid PBS attributes. > > Can you turn on debug=true in etc/pbs.properties and take a look to > see if thats the case? > > The NPE is an issue in the pbs/localsched provider, but perhaps caused > by na unexpected problem or state in the pbs jobs. > > - Mike > > > ----- Original Message ----- > > From: "David Kelly" > > To: swift-devel at ci.uchicago.edu > > Sent: Wednesday, August 17, 2011 11:34:06 AM > > Subject: [Swift-devel] Swift 0.93 exception on Fusion > > Hello, > > > > When testing 0.93 on Fusion, Swift throws an exception. 
I am running > > with the catsn script. It runs, creates the output, but then gives > > this error when cleaning up: > > > > Final status: time: Wed, 17 Aug 2011 11:17:38 -0500 Finished > > successfully:10 > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > Cannot submit job > > at > > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:67) > > at > > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:45) > > at > > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:57) > > at > > org.globus.cog.abstraction.coaster.service.job.manager.LocalQueueProcessor.run(LocalQueueProcessor.java:40) > > Caused by: java.lang.NullPointerException > > at > > org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor.makeName(PBSExecutor.java:304) > > at > > org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor.writeScript(PBSExecutor.java:205) > > at > > org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.buildCommandLine(AbstractExecutor.java:169) > > at > > org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.start(AbstractExecutor.java:89) > > at > > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:53) > > ... 3 more > > > > The config and log files are attached. They can also be found on > > fusion in ~davidk/temp. This is filed in bugzilla as bug #515. > > > > David > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... 
Name: fusion-pbs.tar Type: application/x-tar Size: 10240 bytes Desc: not available URL: From davidk at ci.uchicago.edu Wed Aug 17 14:37:37 2011 From: davidk at ci.uchicago.edu (David Kelly) Date: Wed, 17 Aug 2011 14:37:37 -0500 (CDT) Subject: [Swift-devel] Swift 0.93 exception on Fusion In-Reply-To: <304378190.68870.1313602111443.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <79054368.69054.1313609857383.JavaMail.root@zimbra-mb2.anl.gov> I am seeing the same issue when running directly on login.pads with pbs and coasters. Alberto and Jon both had success running from communicado to PADS using ssh:pbs, but using local:pbs with coasters seems to trigger whatever is causing this. The files are in ~davidk/temp2. David ----- Original Message ----- > From: "David Kelly" > To: "Michael Wilde" > Cc: swift-devel at ci.uchicago.edu > Sent: Wednesday, August 17, 2011 12:28:31 PM > Subject: Re: [Swift-devel] Swift 0.93 exception on Fusion > Two .submit files get created in .globus/scripts: > PBS8380254050153377753.submit and PBS5442733515255786709.submit. > > PBS5442733515255786709.submit is empty. I'm guessing this is where the > problem is. > > PBS8380254050153377753.submit looks ok. There is nothing out of the > ordinary in the stderr and stdout files for this. I can resubmit this > using qsub and it gets queued without any immediate errors. > > Attached a tar file of the PBS files. > > David > > ----- Original Message ----- > > From: "Michael Wilde" > > To: "David Kelly" > > Cc: swift-devel at ci.uchicago.edu > > Sent: Wednesday, August 17, 2011 11:43:38 AM > > Subject: Re: [Swift-devel] Swift 0.93 exception on Fusion > > David, one possibility here is that (automatic) coasters is > > generating > > an additional job with invalid PBS attributes. > > > > Can you turn on debug=true in etc/pbs.properties and take a look to > > see if thats the case? 
> > > > The NPE is an issue in the pbs/localsched provider, but perhaps > > caused > > by na unexpected problem or state in the pbs jobs. > > > > - Mike > > > > > > ----- Original Message ----- > > > From: "David Kelly" > > > To: swift-devel at ci.uchicago.edu > > > Sent: Wednesday, August 17, 2011 11:34:06 AM > > > Subject: [Swift-devel] Swift 0.93 exception on Fusion > > > Hello, > > > > > > When testing 0.93 on Fusion, Swift throws an exception. I am > > > running > > > with the catsn script. It runs, creates the output, but then gives > > > this error when cleaning up: > > > > > > Final status: time: Wed, 17 Aug 2011 11:17:38 -0500 Finished > > > successfully:10 > > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > > Cannot submit job > > > at > > > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:67) > > > at > > > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:45) > > > at > > > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:57) > > > at > > > org.globus.cog.abstraction.coaster.service.job.manager.LocalQueueProcessor.run(LocalQueueProcessor.java:40) > > > Caused by: java.lang.NullPointerException > > > at > > > org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor.makeName(PBSExecutor.java:304) > > > at > > > org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor.writeScript(PBSExecutor.java:205) > > > at > > > org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.buildCommandLine(AbstractExecutor.java:169) > > > at > > > org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.start(AbstractExecutor.java:89) > > > at > > > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:53) > > > ... 3 more > > > > > > The config and log files are attached. 
They can also be found on > > > fusion in ~davidk/temp. This is filed in bugzilla as bug #515. > > > > > > David > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Wed Aug 17 14:47:38 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 17 Aug 2011 12:47:38 -0700 Subject: [Swift-devel] Swift 0.93 exception on Fusion In-Reply-To: <1911735576.68747.1313598846556.JavaMail.root@zimbra-mb2.anl.gov> References: <1911735576.68747.1313598846556.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <1313610458.24629.0.camel@blabla> Fixed in svn. On Wed, 2011-08-17 at 11:34 -0500, David Kelly wrote: > Hello, > > When testing 0.93 on Fusion, Swift throws an exception. I am running with the catsn script. 
It runs, creates the output, but then gives this error when cleaning up: > > Final status: time: Wed, 17 Aug 2011 11:17:38 -0500 Finished successfully:10 > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Cannot submit job > at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:67) > at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:45) > at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:57) > at org.globus.cog.abstraction.coaster.service.job.manager.LocalQueueProcessor.run(LocalQueueProcessor.java:40) > Caused by: java.lang.NullPointerException > at org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor.makeName(PBSExecutor.java:304) > at org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor.writeScript(PBSExecutor.java:205) > at org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.buildCommandLine(AbstractExecutor.java:169) > at org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.start(AbstractExecutor.java:89) > at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:53) > ... 3 more > > The config and log files are attached. They can also be found on fusion in ~davidk/temp. This is filed in bugzilla as bug #515. > > David > _______________________________________________ Swift-devel mailing list Swift-devel at ci.uchicago.edu https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From davidk at ci.uchicago.edu Wed Aug 17 14:53:41 2011 From: davidk at ci.uchicago.edu (David Kelly) Date: Wed, 17 Aug 2011 14:53:41 -0500 (CDT) Subject: [Swift-devel] Swift 0.93 exception on Fusion In-Reply-To: <1313610458.24629.0.camel@blabla> Message-ID: <824503992.69090.1313610821222.JavaMail.root@zimbra-mb2.anl.gov> Thanks Mihael, that fixed it. 
David ----- Original Message ----- > From: "Mihael Hategan" > To: "David Kelly" > Cc: swift-devel at ci.uchicago.edu > Sent: Wednesday, August 17, 2011 2:47:38 PM > Subject: Re: [Swift-devel] Swift 0.93 exception on Fusion > Fixed in svn. > > On Wed, 2011-08-17 at 11:34 -0500, David Kelly wrote: > > Hello, > > > > When testing 0.93 on Fusion, Swift throws an exception. I am running > > with the catsn script. It runs, creates the output, but then gives > > this error when cleaning up: > > > > Final status: time: Wed, 17 Aug 2011 11:17:38 -0500 Finished > > successfully:10 > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > Cannot submit job > > at > > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:67) > > at > > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:45) > > at > > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:57) > > at > > org.globus.cog.abstraction.coaster.service.job.manager.LocalQueueProcessor.run(LocalQueueProcessor.java:40) > > Caused by: java.lang.NullPointerException > > at > > org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor.makeName(PBSExecutor.java:304) > > at > > org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor.writeScript(PBSExecutor.java:205) > > at > > org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.buildCommandLine(AbstractExecutor.java:169) > > at > > org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.start(AbstractExecutor.java:89) > > at > > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:53) > > ... 3 more > > > > The config and log files are attached. They can also be found on > > fusion in ~davidk/temp. This is filed in bugzilla as bug #515. 
> > > > David > > _______________________________________________ Swift-devel mailing > > list Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From jonmon at mcs.anl.gov Thu Aug 18 13:09:38 2011 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Thu, 18 Aug 2011 13:09:38 -0500 Subject: [Swift-devel] Index out of bounds Message-ID: <20110818180927.88E2A1212A@zimbra.anl.gov> Hello, I was running 0.93 with a relatively small run, a 350-task run. The run failed on one of the final tasks. I checked the log file and saw some index out of bounds errors. I tried with a smaller run and didn't see the error. This run was using beagle, pads, and communicado. I was also using cdm. I will post the log in a bit. I am seeing if I can replicate it without using cdm and with a smaller site pool. -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Thu Aug 18 17:56:28 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 18 Aug 2011 15:56:28 -0700 Subject: [Swift-devel] Index out of bounds In-Reply-To: <20110818180927.88E2A1212A@zimbra.anl.gov> References: <20110818180927.88E2A1212A@zimbra.anl.gov> Message-ID: <1313708188.8673.0.camel@blabla> It's probably a good idea to post the stack trace of that exception now rather than later. On Thu, 2011-08-18 at 13:09 -0500, Jonathan Monette wrote: > Hello, > I was running 0.93 with a relatively small run, a 350-task run. > The run failed on one of the final tasks. I checked the log file and > saw some index out of bounds errors. I tried with a smaller run and > didn't see the error. > > This run was using beagle, pads, and communicado. I was also using > cdm. I will post the log in a bit. I am seeing if I can replicate it > without using cdm and with a smaller site pool.
> From jonmon at mcs.anl.gov Thu Aug 18 23:14:30 2011 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Thu, 18 Aug 2011 23:14:30 -0500 Subject: [Swift-devel] Index out of bounds In-Reply-To: <1313708188.8673.0.camel@blabla> References: <20110818180927.88E2A1212A@zimbra.anl.gov> <1313708188.8673.0.camel@blabla> Message-ID: Ok. The log is at www.ci.uchicago.edu/~jonmon/logs/montage-1.log On Aug 18, 2011, at 5:56 PM, Mihael Hategan wrote: > It's probably a good idea to post the stack trace of that exception now > rather than later. > > On Thu, 2011-08-18 at 13:09 -0500, Jonathan Monette wrote: >> Hello, >> I was running 0.93 with one a relatively small run, a 350 task run. >> The run failed on one of the final tasks. I checked the log file and >> saw some index out of bounds errors. I tried with a smaller run and >> didn't see the error. >> >> This run was using beagle, pads, and communicado. I was also using >> cdm. I will post the log in a bit. I am seeing if I cam replicate it >> without using cdm and with a smaller site pool. >> > > From hategan at mcs.anl.gov Fri Aug 19 14:03:55 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 19 Aug 2011 12:03:55 -0700 Subject: [Swift-devel] Index out of bounds In-Reply-To: References: <20110818180927.88E2A1212A@zimbra.anl.gov> <1313708188.8673.0.camel@blabla> Message-ID: <1313780635.23860.1.camel@blabla> Hmm. So I can't see how this manages to happen. I added some checks and debugging statements. Can you update, set log level of org.globus.cog.abstraction.impl.file.local to DEBUG, re-run and then post the log when the exception pops up? Mihael On Thu, 2011-08-18 at 23:14 -0500, Jonathan Monette wrote: > Ok. The log is at > www.ci.uchicago.edu/~jonmon/logs/montage-1.log > On Aug 18, 2011, at 5:56 PM, Mihael Hategan wrote: > > > It's probably a good idea to post the stack trace of that exception now > > rather than later. 
> > > > On Thu, 2011-08-18 at 13:09 -0500, Jonathan Monette wrote: > >> Hello, > >> I was running 0.93 with a relatively small run, a 350-task run. > >> The run failed on one of the final tasks. I checked the log file and > >> saw some index out of bounds errors. I tried with a smaller run and > >> didn't see the error. > >> > >> This run was using beagle, pads, and communicado. I was also using > >> cdm. I will post the log in a bit. I am seeing if I can replicate it > >> without using cdm and with a smaller site pool. > >> > > > > > From jonmon at mcs.anl.gov Fri Aug 19 14:45:52 2011 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Fri, 19 Aug 2011 14:45:52 -0500 Subject: [Swift-devel] Index out of bounds Message-ID: <20110819194537.51A691263B@zimbra.anl.gov> Sure can. Do I add that line to the log4j file or to a different properties file? ----- Reply message ----- From: "Mihael Hategan" Date: Fri, Aug 19, 2011 2:03 pm Subject: Index out of bounds To: "Jonathan Monette" Cc: Hmm. So I can't see how this manages to happen. I added some checks and debugging statements. Can you update, set log level of org.globus.cog.abstraction.impl.file.local to DEBUG, re-run and then post the log when the exception pops up? Mihael On Thu, 2011-08-18 at 23:14 -0500, Jonathan Monette wrote: > Ok. The log is at > www.ci.uchicago.edu/~jonmon/logs/montage-1.log > On Aug 18, 2011, at 5:56 PM, Mihael Hategan wrote: > > > It's probably a good idea to post the stack trace of that exception now > > rather than later. > > > > On Thu, 2011-08-18 at 13:09 -0500, Jonathan Monette wrote: > >> Hello, > >> I was running 0.93 with a relatively small run, a 350-task run. > >> The run failed on one of the final tasks. I checked the log file and > >> saw some index out of bounds errors. I tried with a smaller run and > >> didn't see the error. > >> > >> This run was using beagle, pads, and communicado. I was also using > >> cdm. I will post the log in a bit.
I am seeing if I cam replicate it > >> without using cdm and with a smaller site pool. > >> > > > > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From jonmon at mcs.anl.gov Sat Aug 20 00:03:21 2011 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Sat, 20 Aug 2011 00:03:21 -0500 Subject: [Swift-devel] Index out of bounds In-Reply-To: <20110819194623.CDC4112637@zimbra.anl.gov> References: <20110819194623.CDC4112637@zimbra.anl.gov> Message-ID: I updated and rebuilt and added that line to my log4j properties. Does anyone know if Beagle is down? showq says there is no service listening to sdb:. qstat shows that I have a job sitting in the queue but it doesn't look like jobs are running. I am using both PADS and Beagle for this execution. In this case where jobs are not executing on Beagle shouldn't Swift start submitting jobs to PADS? I do not see that behavior. This run is still executing. But if you would like to look at the log it is at www.ci.uchicago.edu/~jonmon/logs/montage-2.log. Only 23 tasks have finished before it just sits there waiting for Beagle to run. On Aug 19, 2011, at 2:46 PM, Jonathan Monette wrote: > Sure can. I add that line to the log4j file or in a different properties file. > > ----- Reply message ----- > From: "Mihael Hategan" > Date: Fri, Aug 19, 2011 2:03 pm > Subject: Index out of bounds > To: "Jonathan Monette" > Cc: > > > Hmm. So I can't see how this manages to happen. > > I added some checks and debugging statements. Can you update, set log > level of org.globus.cog.abstraction.impl.file.local to DEBUG, re-run and > then post the log when the exception pops up? > > Mihael > > On Thu, 2011-08-18 at 23:14 -0500, Jonathan Monette wrote: > > Ok. The log is at > > www.ci.uchicago.edu/~jonmon/logs/montage-1.log > > On Aug 18, 2011, at 5:56 PM, Mihael Hategan wrote: > > > > > It's probably a good idea to post the stack trace of that exception now > > > rather than later. > > > > > > On Thu, 2011-08-18 at 13:09 -0500, Jonathan Monette wrote: > > >> Hello, > > >> I was running 0.93 with one a relatively small run, a 350 task run. > > >> The run failed on one of the final tasks. 
I checked the log file and > > >> saw some index out of bounds errors. I tried with a smaller run and > > >> didn't see the error. > > >> > > >> This run was using beagle, pads, and communicado. I was also using > > >> cdm. I will post the log in a bit. I am seeing if I cam replicate it > > >> without using cdm and with a smaller site pool. > > >> > > > > > > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From ketancmaheshwari at gmail.com Sat Aug 20 07:45:18 2011 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Sat, 20 Aug 2011 07:45:18 -0500 Subject: [Swift-devel] Index out of bounds In-Reply-To: References: <20110819194623.CDC4112637@zimbra.anl.gov> Message-ID: Yes, Beagle went down yesterday. There was a notice. Current status as of Aug 19, 5.30PM: == At this time, Lustre is not starting properly on Beagle. This may be related to a configuration change that was made during the last outage. The effort to restore system availability is still in active progress. == Ketan On Sat, Aug 20, 2011 at 12:03 AM, Jonathan Monette wrote: > I updated and rebuilt and added that line to my log4j properties. Does > anyone know if Beagle is down? showq says there is no service listening to > sdb:. qstat shows that I have a job sitting in the queue but it > doesn't look like jobs are running. > > I am using both PADS and Beagle for this execution. In this case where > jobs are not executing on Beagle shouldn't Swift start submitting jobs to > PADS? I do not see that behavior. > > This run is still executing. But if you would like to look at the log it > is at www.ci.uchicago.edu/~jonmon/logs/montage-2.log. Only 23 tasks have > finished before it just sits there waiting for Beagle to run. > On Aug 19, 2011, at 2:46 PM, Jonathan Monette wrote: > > > Sure can. I add that line to the log4j file or in a different properties > file. 
> > > > ----- Reply message ----- > > From: "Mihael Hategan" > > Date: Fri, Aug 19, 2011 2:03 pm > > Subject: Index out of bounds > > To: "Jonathan Monette" > > Cc: > > > > > > Hmm. So I can't see how this manages to happen. > > > > I added some checks and debugging statements. Can you update, set log > > level of org.globus.cog.abstraction.impl.file.local to DEBUG, re-run and > > then post the log when the exception pops up? > > > > Mihael > > > > On Thu, 2011-08-18 at 23:14 -0500, Jonathan Monette wrote: > > > Ok. The log is at > > > www.ci.uchicago.edu/~jonmon/logs/montage-1.log > > > On Aug 18, 2011, at 5:56 PM, Mihael Hategan wrote: > > > > > > > It's probably a good idea to post the stack trace of that exception > now > > > > rather than later. > > > > > > > > On Thu, 2011-08-18 at 13:09 -0500, Jonathan Monette wrote: > > > >> Hello, > > > >> I was running 0.93 with one a relatively small run, a 350 task run. > > > >> The run failed on one of the final tasks. I checked the log file and > > > >> saw some index out of bounds errors. I tried with a smaller run and > > > >> didn't see the error. > > > >> > > > >> This run was using beagle, pads, and communicado. I was also using > > > >> cdm. I will post the log in a bit. I am seeing if I cam replicate it > > > >> without using cdm and with a smaller site pool. > > > >> > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jonmon at mcs.anl.gov Sat Aug 20 14:52:13 2011 From: jonmon at mcs.anl.gov (=?utf-8?B?Sm9uYXRoYW4gTW9uZXR0ZQ==?=) Date: Sat, 20 Aug 2011 14:52:13 -0500 Subject: [Swift-devel] =?utf-8?q?Index_out_of_bounds?= Message-ID: <20110820195159.ADE2612260@zimbra.anl.gov> Ok thanks. It seems that I was not added to the beagle-notify list. Could someone point me to a link I can subscribe to? Or do I subscribe by sending mail to beagle-support? ----- Reply message ----- From: "Ketan Maheshwari" Date: Sat, Aug 20, 2011 7:45 am Subject: [Swift-devel] Index out of bounds To: "Jonathan Monette" Cc: -------------- next part -------------- An HTML attachment was scrubbed... URL: From dsk at ci.uchicago.edu Sat Aug 20 15:14:42 2011 From: dsk at ci.uchicago.edu (Daniel S. Katz) Date: Sat, 20 Aug 2011 15:14:42 -0500 Subject: [Swift-devel] Index out of bounds In-Reply-To: <20110820195159.ADE2612260@zimbra.anl.gov> References: <20110820195159.ADE2612260@zimbra.anl.gov> Message-ID: <593A4C01-7DA4-4DE8-AF2A-442AABEA968A@ci.uchicago.edu> Yes, write to beagle-support. On Aug 20, 2011, at 14:52, "Jonathan Monette" wrote: > Ok thanks. It seems that I was not added to the beagle-notify list. Could someone point me to a link I can subscribe to? Or do I subscribe by sending mail to beagle-support? > > ----- Reply message ----- > From: "Ketan Maheshwari" > Date: Sat, Aug 20, 2011 7:45 am > Subject: [Swift-devel] Index out of bounds > To: "Jonathan Monette" > Cc: > > > Yes, Beagle went down yesterday. There was a notice. > > Current status as of Aug 19, 5.30PM: > > == > At this time, Lustre is not starting properly on Beagle. This may be related to a configuration change that was made during the last outage. The effort to restore system availability is still in active progress. > == > > > Ketan > > On Sat, Aug 20, 2011 at 12:03 AM, Jonathan Monette wrote: > I updated and rebuilt and added that line to my log4j properties. Does anyone know if Beagle is down? 
showq says there is no service listening to sdb:. qstat shows that I have a job sitting in the queue but it doesn't look like jobs are running. > > I am using both PADS and Beagle for this execution. In this case where jobs are not executing on Beagle shouldn't Swift start submitting jobs to PADS? I do not see that behavior. > > This run is still executing. But if you would like to look at the log it is at www.ci.uchicago.edu/~jonmon/logs/montage-2.log. Only 23 tasks have finished before it just sits there waiting for Beagle to run. > On Aug 19, 2011, at 2:46 PM, Jonathan Monette wrote: > > > Sure can. I add that line to the log4j file or in a different properties file. > > > > ----- Reply message ----- > > From: "Mihael Hategan" > > Date: Fri, Aug 19, 2011 2:03 pm > > Subject: Index out of bounds > > To: "Jonathan Monette" > > Cc: > > > > > > Hmm. So I can't see how this manages to happen. > > > > I added some checks and debugging statements. Can you update, set log > > level of org.globus.cog.abstraction.impl.file.local to DEBUG, re-run and > > then post the log when the exception pops up? > > > > Mihael > > > > On Thu, 2011-08-18 at 23:14 -0500, Jonathan Monette wrote: > > > Ok. The log is at > > > www.ci.uchicago.edu/~jonmon/logs/montage-1.log > > > On Aug 18, 2011, at 5:56 PM, Mihael Hategan wrote: > > > > > > > It's probably a good idea to post the stack trace of that exception now > > > > rather than later. > > > > > > > > On Thu, 2011-08-18 at 13:09 -0500, Jonathan Monette wrote: > > > >> Hello, > > > >> I was running 0.93 with one a relatively small run, a 350 task run. > > > >> The run failed on one of the final tasks. I checked the log file and > > > >> saw some index out of bounds errors. I tried with a smaller run and > > > >> didn't see the error. > > > >> > > > >> This run was using beagle, pads, and communicado. I was also using > > > >> cdm. I will post the log in a bit. 
I am seeing if I cam replicate it > > > >> without using cdm and with a smaller site pool. > > > >> > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > -- > Ketan > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonmon at mcs.anl.gov Sat Aug 20 16:20:35 2011 From: jonmon at mcs.anl.gov (=?utf-8?B?Sm9uYXRoYW4gTW9uZXR0ZQ==?=) Date: Sat, 20 Aug 2011 16:20:35 -0500 Subject: [Swift-devel] =?utf-8?q?Index_out_of_bounds?= Message-ID: <20110820212018.1BAAC1225E@zimbra.anl.gov> Thanks. In the meantime could someone let me know when beagle is back in production so I can check my run? ----- Reply message ----- From: "Daniel S. Katz" Date: Sat, Aug 20, 2011 3:14 pm Subject: [Swift-devel] Index out of bounds To: "Jonathan Monette" Cc: "Ketan Maheshwari" , "swift-devel at ci.uchicago.edu" -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Sat Aug 20 21:03:35 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sat, 20 Aug 2011 21:03:35 -0500 (CDT) Subject: [Swift-devel] Index out of bounds In-Reply-To: <20110820212018.1BAAC1225E@zimbra.anl.gov> Message-ID: <263354065.236370.1313892215268.JavaMail.root@zimbra.anl.gov> Jon, the list you want for Beagle issue notifications is beagle-users. 
You can subscribe via the link: https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/beagle-users - Mike ----- Forwarded Message ----- From: "Greg Cross" To: beagle-users at ci.uchicago.edu Sent: Saturday, August 20, 2011 2:12:45 PM Subject: [beagle-users] Outage update Lustre is mounting properly but there is a communication failure between the Moab and ALPS scheduler components. This issue is under investigation and has been escalated to Cray. As a reminder, please DO NOT attempt to log into the system during this or any other maintenance period. While logins should be denied at this time, any user processes found running on login or sandbox nodes will be terminated without warning. Users who do not respect this may be contacted individually. Definitive notification will be sent to this mailing list when the system is available for use. _______________________________________________ beagle-users mailing list beagle-users at ci.uchicago.edu https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/beagle-users ----- Original Message ----- From: "Jonathan Monette" To: "Daniel S. Katz" Cc: swift-devel at ci.uchicago.edu Sent: Saturday, August 20, 2011 4:20:35 PM Subject: Re: [Swift-devel] Index out of bounds Thanks. In the meantime could someone let me know when beagle is back in production so I can check my run? ----- Reply message ----- From: "Daniel S. Katz" Date: Sat, Aug 20, 2011 3:14 pm Subject: [Swift-devel] Index out of bounds To: "Jonathan Monette" Cc: "Ketan Maheshwari" , "swift-devel at ci.uchicago.edu" Yes, write to beagle-support. On Aug 20, 2011, at 14:52, "Jonathan Monette" < jonmon at mcs.anl.gov > wrote: Ok thanks. It seems that I was not added to the beagle-notify list. Could someone point me to a link I can subscribe to? Or do I subscribe by sending mail to beagle-support? 
----- Reply message ----- From: "Ketan Maheshwari" < ketancmaheshwari at gmail.com > Date: Sat, Aug 20, 2011 7:45 am Subject: [Swift-devel] Index out of bounds To: "Jonathan Monette" < jonmon at mcs.anl.gov > Cc: < swift-devel at ci.uchicago.edu > Yes, Beagle went down yesterday. There was a notice. Current status as of Aug 19, 5.30PM: == At this time, Lustre is not starting properly on Beagle. This may be related to a configuration change that was made during the last outage. The effort to restore system availability is still in active progress. == Ketan On Sat, Aug 20, 2011 at 12:03 AM, Jonathan Monette < jonmon at mcs.anl.gov > wrote: I updated and rebuilt and added that line to my log4j properties. Does anyone know if Beagle is down? showq says there is no service listening to sdb:. qstat shows that I have a job sitting in the queue but it doesn't look like jobs are running. I am using both PADS and Beagle for this execution. In this case where jobs are not executing on Beagle shouldn't Swift start submitting jobs to PADS? I do not see that behavior. This run is still executing. But if you would like to look at the log it is at www.ci.uchicago.edu/~jonmon/logs/montage-2.log . Only 23 tasks have finished before it just sits there waiting for Beagle to run. On Aug 19, 2011, at 2:46 PM, Jonathan Monette wrote: > Sure can. I add that line to the log4j file or in a different properties file. > > ----- Reply message ----- > From: "Mihael Hategan" < hategan at mcs.anl.gov > > Date: Fri, Aug 19, 2011 2:03 pm > Subject: Index out of bounds > To: "Jonathan Monette" < jonmon at mcs.anl.gov > > Cc: < swift-devel at ci.uchicago.edu > > > > Hmm. So I can't see how this manages to happen. > > I added some checks and debugging statements. Can you update, set log > level of org.globus.cog.abstraction.impl.file.local to DEBUG, re-run and > then post the log when the exception pops up? > > Mihael > > On Thu, 2011-08-18 at 23:14 -0500, Jonathan Monette wrote: > > Ok. 
The log is at > > www.ci.uchicago.edu/~jonmon/logs/montage-1.log > > On Aug 18, 2011, at 5:56 PM, Mihael Hategan wrote: > > > > > It's probably a good idea to post the stack trace of that exception now > > > rather than later. > > > > > > On Thu, 2011-08-18 at 13:09 -0500, Jonathan Monette wrote: > > >> Hello, > > >> I was running 0.93 with one a relatively small run, a 350 task run. > > >> The run failed on one of the final tasks. I checked the log file and > > >> saw some index out of bounds errors. I tried with a smaller run and > > >> didn't see the error. > > >> > > >> This run was using beagle, pads, and communicado. I was also using > > >> cdm. I will post the log in a bit. I am seeing if I cam replicate it > > >> without using cdm and with a smaller site pool. > > >> > > > > > > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel _______________________________________________ Swift-devel mailing list Swift-devel at ci.uchicago.edu https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Ketan _______________________________________________ Swift-devel mailing list Swift-devel at ci.uchicago.edu https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel _______________________________________________ Swift-devel mailing list Swift-devel at ci.uchicago.edu https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonmon at mcs.anl.gov Sat Aug 20 22:13:35 2011 From: jonmon at mcs.anl.gov (=?utf-8?B?Sm9uYXRoYW4gTW9uZXR0ZQ==?=) Date: Sat, 20 Aug 2011 22:13:35 -0500 Subject: [Swift-devel] =?utf-8?q?Index_out_of_bounds?= Message-ID: <20110821031320.9B8D01225E@zimbra.anl.gov> Thanks. 
I sent mail to beagle-support already, but I will subscribe to that list and respond to beagle-support about it. Thanks again.

----- Reply message -----
From: "Michael Wilde"
Date: Sat, Aug 20, 2011 9:03 pm
Subject: [Swift-devel] Index out of bounds
To: "Jonathan Monette"
Cc: , "Daniel S. Katz"

From wilde at mcs.anl.gov Mon Aug 22 10:24:52 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 22 Aug 2011 10:24:52 -0500 (CDT)
Subject: [Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?
Message-ID: <243878065.238332.1314026692263.JavaMail.root@zimbra.anl.gov>

Papia, Ketan,

In reviewing the remaining 0.93 work with David, I remembered this issue.

You both reported that the DSSAT application script doesn't finish on PADS: it seems not to start the second round of coaster blocks that it needs to complete (as I recall, but this may not be correct). This needs to be researched and filed as a bug (or, if an error in the sites spec turns out to be the problem, that error needs to be identified and made clear in the site guide).

Possibly there is an issue with jobs failing at the end of the coaster blocks, and you don't have the necessary retry values set for the PADS site?

We need an example run with logs and full details. Can you try to re-create this with a much smaller initial allocation, and see if coasters is transitioning from its initial blocks to the next blocks?

Can you give this high priority for today?

Thanks,

- Mike

From ketancmaheshwari at gmail.com Mon Aug 22 10:32:31 2011
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Mon, 22 Aug 2011 10:32:31 -0500
Subject: [Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?
In-Reply-To: <243878065.238332.1314026692263.JavaMail.root@zimbra.anl.gov> References: <243878065.238332.1314026692263.JavaMail.root@zimbra.anl.gov> Message-ID: Mike, If I recall correctly, Papia has always been running her DSSAT app with 0.92. She has not yet tried with 0.93. I too tried with 0.92 with her sites file settings. I once tried it with 0.93 on pads but could never get in the running from the queue. I will give another try today as it might be that PADS was too busy last week. As I recall Jon was also struggling to get access. Regards, Ketan On Mon, Aug 22, 2011 at 10:24 AM, Michael Wilde wrote: > Papia, Ketan, > > In reviewing 0.93 work remaining with David, I remembered this issue. > > You both reported that the DSSAT application script doesnt finish on PADS - > it seems not to start the second round of coaster blocks that it needs to > complete (as I recall, but this may not be correct). This needs to be > researched and filed as a bug (or, an error in the sites spec needs to be > identified and made clear in the site guide if it turns out to be the > problem). > > Possible there is an issue with jobs failing at the end of the coaster > blocks, and you dont have the necessary retry values set for the PADS > site??? > > We need an example run with logs and full details. Can you try to re-create > this with a much smaller initial allocation, and see if coasters is > transitioning from its initial blocks to the next blocks? > > Can you give this high prio for today? > > Thanks, > > - Mike > -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... 
URL:
From wilde at mcs.anl.gov Mon Aug 22 10:41:43 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 22 Aug 2011 10:41:43 -0500 (CDT)
Subject: [Swift-devel] Performance problem with CDM direct processing
In-Reply-To: <1560250488.238376.1314027328841.JavaMail.root@zimbra.anl.gov>
Message-ID: <1432364152.238400.1314027703652.JavaMail.root@zimbra.anl.gov>

Justin,

In testing Montage, Jon observed what looks like a performance bottleneck in the processing of CDM direct output passing.

I *think* what was happening was that a large number of jobs (say 25,000 or more; I don't recall the exact number, and it may have been larger) produced an output file, and all those files were being passed as input to a merge job.

What we observed was that the scripts being called from _swiftwrap (and perhaps some processing at the vdl-int.k level as well?) were running very slowly, and that a fairly large number of scripts were being invoked per file. I think (but am not sure) that the high overhead was being observed at the start of the merge job, in CDM scripts called by _swiftwrap.

Jon, can you explain what you know about this problem, and then let's see if we can enhance the performance? This is now the main bottleneck in this application, which is otherwise performing quite well.

Thanks,

- Mike

From wilde at mcs.anl.gov Mon Aug 22 10:47:56 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 22 Aug 2011 10:47:56 -0500 (CDT)
Subject: [Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?
In-Reply-To:
Message-ID: <1692446633.238425.1314028076127.JavaMail.root@zimbra.anl.gov>

Can you try this on PADS using small jobs in the fast queue?

I have not thought this all the way through, but perhaps coasters will honor maxtime and maxwalltime on any coaster block, even if it's not running on a batch scheduler. In that case perhaps you can replicate the problem on the MCS pool or, better yet, on localhost.
In these runs, what was the value of the execution.retries and lazy.errors flags? Mihael, do those properties need to be set to >0 and true, respectively, in order for coasters to start new blocks correctly, assuming that in some cases a job will run longer than its maxwalltime? - Mike ----- Original Message ----- From: "Ketan Maheshwari" To: "Michael Wilde" Cc: "Papia Rizwan" , "swift-devel Devel" Sent: Monday, August 22, 2011 10:32:31 AM Subject: Re: Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start? Mike, If I recall correctly, Papia has always been running her DSSAT app with 0.92. She has not yet tried with 0.93. I too tried with 0.92 with her sites file settings. I once tried it with 0.93 on pads but could never get in the running from the queue. I will give another try today as it might be that PADS was too busy last week. As I recall Jon was also struggling to get access. Regards, Ketan On Mon, Aug 22, 2011 at 10:24 AM, Michael Wilde < wilde at mcs.anl.gov > wrote: Papia, Ketan, In reviewing 0.93 work remaining with David, I remembered this issue. You both reported that the DSSAT application script doesnt finish on PADS - it seems not to start the second round of coaster blocks that it needs to complete (as I recall, but this may not be correct). This needs to be researched and filed as a bug (or, an error in the sites spec needs to be identified and made clear in the site guide if it turns out to be the problem). Possible there is an issue with jobs failing at the end of the coaster blocks, and you dont have the necessary retry values set for the PADS site??? We need an example run with logs and full details. Can you try to re-create this with a much smaller initial allocation, and see if coasters is transitioning from its initial blocks to the next blocks? Can you give this high prio for today? 
Thanks,

- Mike

--
Ketan

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From jonmon at mcs.anl.gov Mon Aug 22 11:35:20 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Mon, 22 Aug 2011 11:35:20 -0500
Subject: [Swift-devel] Performance problem with CDM direct processing
Message-ID: <20110822163505.AF84E1245F@zimbra.anl.gov>

Correct. I suspect that if we can improve the performance of this section we can go from a 12-hour run to a 6-8 hour run.

The number of files being processed by the CDM lookup is 320K. What we observed was that several processes were spawned for each file and took maybe a second to run (I think that was the time).

Mike and I had a discussion on how we can replicate it with a simple test case to show the delay, as well as some simple fixes to try out.

----- Reply message -----
From: "Michael Wilde"
Date: Mon, Aug 22, 2011 10:41 am
Subject: [Swift-devel] Performance problem with CDM direct processing
To: "Jonathan Monette" , "Justin M Wozniak"
Cc: "swift-devel Devel"

Justin,

In testing Montage, Jon observed what looks like a performance bottleneck in the processing of CDM direct output passing.

I *think* what was happening was that a large number of jobs (say 25,000 or more; I don't recall the exact number, and it may have been larger) produced an output file, and all those files were being passed as input to a merge job.

What we observed was that the scripts being called from _swiftwrap (and perhaps some processing at the vdl-int.k level as well?) were running very slowly, and that a fairly large number of scripts were being invoked per file. I think (but am not sure) that the high overhead was being observed at the start of the merge job, in CDM scripts called by _swiftwrap.
Jon, can you explain what you know about this problem, and then lets see if we can enhance the performance? This is now the main bottleneck in this application, which is otherwise now performing quite well. Thanks, - Mike _______________________________________________ Swift-devel mailing list Swift-devel at ci.uchicago.edu https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From wozniak at mcs.anl.gov Mon Aug 22 12:46:37 2011 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Mon, 22 Aug 2011 12:46:37 -0500 (Central Daylight Time) Subject: [Swift-devel] Performance problem with CDM direct processing In-Reply-To: <20110822163505.AF84E1245F@zimbra.anl.gov> References: <20110822163505.AF84E1245F@zimbra.anl.gov> Message-ID: This has to do with the way the _swiftwrap shell script looks up those files. To avoid the external use of perl, I will take a look at using bash to do the wildcard matching and lookup. Either that or I will batch multiple lookups into one perl call. Justin On Mon, 22 Aug 2011, Jonathan Monette wrote: > Correct. I suspect if we can improve the performance of this section we > can go from a run 12 hour run to a 6-8 hour run. > > The number of files that are being procesed by cdm look up is 320K. > What was observed was several processes were spawned for each file and > took maybe a second to run(i think that was the time). > > Mike and me had a discussion on how we can replicate it with a simple > test case to show the delay as well as some simple fixes to try out. > > ----- Reply message ----- > From: "Michael Wilde" > Date: Mon, Aug 22, 2011 10:41 am > Subject: [Swift-devel] Performance problem with CDM direct processing > To: "Jonathan Monette" , "Justin M Wozniak" > Cc: "swift-devel Devel" > > > Justin, > > In testing Montage, Jon observed what looks like a performance bottleneck in the processing of CDM direct output passing. 
> > I *think* what was happening was that a large number of jobs (say 25,000 or more, but I dont recall the exact number, it may have been larger) produced an output file, and all those files were being passed as input to a merge job. > > What we observed was that the scripts being called from _swiftwrap (and perhaps some processing at the vdl-int.k level??? as well) were running very slowly, and that a fairly large number of scripts were being invoked per file. I think (but am not sure) that the high overhead was being observed at the start of the merge job in CDM scripts called by _swiftwrap. > > Jon, can you explain what you know about this problem, and then lets see if we can enhance the performance? This is now the main bottleneck in this application, which is otherwise now performing quite well. > > Thanks, > > - Mike > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Justin M Wozniak From jonmon at mcs.anl.gov Mon Aug 22 13:02:42 2011 From: jonmon at mcs.anl.gov (=?utf-8?B?Sm9uYXRoYW4gTW9uZXR0ZQ==?=) Date: Mon, 22 Aug 2011 13:02:42 -0500 Subject: [Swift-devel] =?utf-8?q?Performance_problem_with_CDM_direct_proce?= =?utf-8?q?ssing?= Message-ID: <20110822180226.B438A12660@zimbra.anl.gov> Using bash to do the wildcard matching was one of the ideas we came up with. ----- Reply message ----- From: "Justin M Wozniak" Date: Mon, Aug 22, 2011 12:46 pm Subject: [Swift-devel] Performance problem with CDM direct processing To: "Jonathan Monette" Cc: "Michael Wilde" , "Jonathan Monette" , "swift-devel Devel" This has to do with the way the _swiftwrap shell script looks up those files. To avoid the external use of perl, I will take a look at using bash to do the wildcard matching and lookup. Either that or I will batch multiple lookups into one perl call. Justin On Mon, 22 Aug 2011, Jonathan Monette wrote: > Correct. 
I suspect if we can improve the performance of this section we > can go from a run 12 hour run to a 6-8 hour run. > > The number of files that are being procesed by cdm look up is 320K. > What was observed was several processes were spawned for each file and > took maybe a second to run(i think that was the time). > > Mike and me had a discussion on how we can replicate it with a simple > test case to show the delay as well as some simple fixes to try out. > > ----- Reply message ----- > From: "Michael Wilde" > Date: Mon, Aug 22, 2011 10:41 am > Subject: [Swift-devel] Performance problem with CDM direct processing > To: "Jonathan Monette" , "Justin M Wozniak" > Cc: "swift-devel Devel" > > > Justin, > > In testing Montage, Jon observed what looks like a performance bottleneck in the processing of CDM direct output passing. > > I *think* what was happening was that a large number of jobs (say 25,000 or more, but I dont recall the exact number, it may have been larger) produced an output file, and all those files were being passed as input to a merge job. > > What we observed was that the scripts being called from _swiftwrap (and perhaps some processing at the vdl-int.k level??? as well) were running very slowly, and that a fairly large number of scripts were being invoked per file. I think (but am not sure) that the high overhead was being observed at the start of the merge job in CDM scripts called by _swiftwrap. > > Jon, can you explain what you know about this problem, and then lets see if we can enhance the performance? This is now the main bottleneck in this application, which is otherwise now performing quite well. > > Thanks, > > - Mike > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Justin M Wozniak -------------- next part -------------- An HTML attachment was scrubbed... 
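
Note on the wildcard-matching idea discussed in this thread: the cost described is one or more external processes per file, and shell pattern matching can be done without starting any. The sketch below uses the `case` builtin, which bash shares with plain sh, to match file names against glob rules entirely inside the shell. The rule list, the `lookup_policy` name, and the `UNMATCHED` fallback are made up for illustration; this is not the real CDM policy format or the actual _swiftwrap code.

```shell
#!/bin/sh
# Sketch only: look up a policy for each file name with shell builtins,
# so no external process (perl, grep, ...) is started per file.
# RULES is a hypothetical "pattern policy" list, one rule per line.
RULES='*.fits DIRECT
*.tbl DEFAULT'

# lookup_policy FILE: set POLICY to the policy of the first matching rule.
# The result is returned in a variable rather than printed, which avoids
# the subshell that a $(...) command substitution would create.
lookup_policy() {
    file=$1
    POLICY=UNMATCHED
    old_ifs=$IFS
    # Split RULES on newlines only; disable globbing so patterns stay literal.
    IFS='
'
    set -f
    for rule in $RULES; do
        pattern=${rule%% *}   # text before the first space
        policy=${rule##* }    # text after the last space
        case $file in
            # An unquoted expansion here is used as a glob pattern.
            $pattern) POLICY=$policy; break ;;
        esac
    done
    set +f
    IFS=$old_ifs
}

for f in proj_0001.fits diffs.tbl notes.txt; do
    lookup_policy "$f"
    printf '%s %s\n' "$f" "$POLICY"
done
```

Batching all lookups into a single perl invocation, the other option mentioned in the thread, would also remove the per-file process cost; the in-shell version simply avoids the external dependency altogether.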
URL: From ketancmaheshwari at gmail.com Mon Aug 22 13:45:32 2011 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Mon, 22 Aug 2011 13:45:32 -0500 Subject: [Swift-devel] persistent coasters on OSG Message-ID: Hi Mihael, All, I am trying to test the persistent coasters setup with OSG sites from communicado and am seeing some intermittent exceptions and job-failed errors; the affected jobs eventually succeed on retry. The exceptions I see in the log are mostly low-level network exceptions (channel exceptions, broken-pipe SocketExceptions, timeouts, etc.). The runs that I tried were incremental catsn runs with n=1, 10, 50 and 100 and data.txt=100MB and 200MB. The only runs that had the above-mentioned errors were the ones with n=100 and data.txt=200MB. The other runs completed without any errors. I used just one OSG site for these runs. Attaching the sites file, the log files, and a file that contains the exception messages grepped from the log files. Any clues as to how to harden this? I had about 5 errors on today's run and about 11 on a similar run last week. Regards, -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- 2011-08-22 11:31:26,251-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=cat-6jxfpsek - Application exception: Task failed: Connection to worker lost java.net.SocketException: Broken pipe 2011-08-22 11:31:50,808-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=cat-ajxfpsek - Application exception: Task failed: Connection to worker lost java.net.SocketException: Broken pipe 2011-08-22 11:35:38,899-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=cat-ljxfpsek - Application exception: Task failed: Connection to worker lost java.net.SocketException: Broken pipe 2011-08-22 11:35:57,531-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=cat-jjxfpsek - Application exception: Task failed: Connection to worker lost java.net.SocketException: Broken pipe 2011-08-22 11:40:12,334-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=cat-yjxfpsek - Application exception: Task failed: Connection to worker lost java.net.SocketException: Broken pipe 2011-08-22 11:42:16,427-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=cat-6kxfpsek - Application exception: Task failed: Connection to worker lost java.net.SocketException: Broken pipe 2011-08-22 11:28:29,903-0500 INFO AbstractStreamKarajanChannel Channel IOException java.net.SocketException: Broken pipe 2011-08-22 11:28:29,913-0500 INFO ChannelManager Handling channel exception java.net.SocketException: Broken pipe java.net.SocketException: Broken pipe 2011-08-22 11:28:29,914-0500 INFO ChannelManager Channel exception handled org.globus.cog.karajan.workflow.service.ReplyTimeoutException org.globus.cog.karajan.workflow.service.ReplyTimeoutException org.globus.cog.karajan.workflow.service.ReplyTimeoutException org.globus.cog.karajan.workflow.service.ReplyTimeoutException 2011-08-22 11:30:40,430-0500 INFO AbstractStreamKarajanChannel Channel IOException java.net.SocketException: Broken pipe 2011-08-22 11:30:40,431-0500 INFO ChannelManager Handling channel exception 
java.net.SocketException: Broken pipe java.net.SocketException: Broken pipe 2011-08-22 11:30:40,431-0500 INFO ChannelManager Channel exception handled 2011-08-22 11:31:26,235-0500 INFO AbstractStreamKarajanChannel Channel IOException java.net.SocketException: Broken pipe 2011-08-22 11:31:26,235-0500 INFO ChannelManager Handling channel exception java.net.SocketException: Broken pipe java.net.SocketException: Broken pipe 2011-08-22 11:31:26,237-0500 INFO ChannelManager Channel exception handled 2011-08-22 11:31:50,780-0500 INFO AbstractStreamKarajanChannel Channel IOException java.net.SocketException: Broken pipe 2011-08-22 11:31:50,781-0500 INFO ChannelManager Handling channel exception java.net.SocketException: Broken pipe java.net.SocketException: Broken pipe 2011-08-22 11:31:50,783-0500 INFO ChannelManager Channel exception handled org.globus.cog.karajan.workflow.service.ReplyTimeoutException org.globus.cog.karajan.workflow.service.ReplyTimeoutException org.globus.cog.karajan.workflow.service.ReplyTimeoutException org.globus.cog.karajan.workflow.service.ReplyTimeoutException org.globus.cog.karajan.workflow.service.ReplyTimeoutException org.globus.cog.karajan.workflow.service.ReplyTimeoutException org.globus.cog.karajan.workflow.service.ReplyTimeoutException org.globus.cog.karajan.workflow.service.ReplyTimeoutException org.globus.cog.karajan.workflow.service.ReplyTimeoutException org.globus.cog.karajan.workflow.service.ReplyTimeoutException org.globus.cog.karajan.workflow.service.ReplyTimeoutException 2011-08-22 11:35:38,887-0500 INFO AbstractStreamKarajanChannel Channel IOException java.net.SocketException: Broken pipe 2011-08-22 11:35:38,887-0500 INFO ChannelManager Handling channel exception java.net.SocketException: Broken pipe java.net.SocketException: Broken pipe 2011-08-22 11:35:38,889-0500 INFO ChannelManager Channel exception handled 2011-08-22 11:35:57,527-0500 INFO AbstractStreamKarajanChannel Channel IOException java.net.SocketException: Broken pipe 
2011-08-22 11:35:57,527-0500 INFO ChannelManager Handling channel exception java.net.SocketException: Broken pipe java.net.SocketException: Broken pipe 2011-08-22 11:35:57,528-0500 INFO ChannelManager Channel exception handled org.globus.cog.karajan.workflow.service.ReplyTimeoutException org.globus.cog.karajan.workflow.service.ReplyTimeoutException org.globus.cog.karajan.workflow.service.ReplyTimeoutException org.globus.cog.karajan.workflow.service.ReplyTimeoutException org.globus.cog.karajan.workflow.service.ReplyTimeoutException org.globus.cog.karajan.workflow.service.ReplyTimeoutException org.globus.cog.karajan.workflow.service.ReplyTimeoutException org.globus.cog.karajan.workflow.service.ReplyTimeoutException 2011-08-22 11:40:12,320-0500 INFO AbstractStreamKarajanChannel Channel IOException java.net.SocketException: Broken pipe 2011-08-22 11:40:12,321-0500 INFO ChannelManager Handling channel exception java.net.SocketException: Broken pipe java.net.SocketException: Broken pipe 2011-08-22 11:40:12,322-0500 INFO ChannelManager Channel exception handled org.globus.cog.karajan.workflow.service.ReplyTimeoutException org.globus.cog.karajan.workflow.service.ReplyTimeoutException org.globus.cog.karajan.workflow.service.ReplyTimeoutException org.globus.cog.karajan.workflow.service.ReplyTimeoutException 2011-08-22 11:42:16,407-0500 INFO AbstractStreamKarajanChannel Channel IOException java.net.SocketException: Broken pipe 2011-08-22 11:42:16,407-0500 INFO ChannelManager Handling channel exception java.net.SocketException: Broken pipe java.net.SocketException: Broken pipe 2011-08-22 11:42:16,410-0500 INFO ChannelManager Channel exception handled -------------- next part -------------- A non-text attachment was scrubbed... Name: service-0.out Type: application/octet-stream Size: 277424 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: catsn-catsn-osg-n50-d200mb.log Type: application/octet-stream Size: 117353 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: swift.log Type: application/octet-stream Size: 819984 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: sites.grid-ps.xml Type: text/xml Size: 616 bytes Desc: not available URL: From yadudoc1729 at gmail.com Mon Aug 22 13:47:09 2011 From: yadudoc1729 at gmail.com (Yadu Nand) Date: Tue, 23 Aug 2011 00:17:09 +0530 Subject: [Swift-devel] Patch for "call" Message-ID: Hi, I'm attaching a patch for the call function that I "almost" implemented. It's done in such a way that you can easily add support for "map", "reduce", or any other calls. Hopefully this will make syntactic changes easier. The patch is by far the biggest in terms of code I've given, so I am sure there is room for improvement. Please point out things I could fix and improve. I have also added some documentation to the swiftdevel site [1][2]. GSoC is pretty much over, and working on Swift this summer has been awesome. I give my thanks to everyone on this list and my mentors Justin, Mihael and Michael for all their support and help. I don't want to stop working with you all and I hope that I will be able to balance everything. [1] https://sites.google.com/site/swiftdevel/internals/adding-a-new-feature-to-swift [2] https://sites.google.com/site/swiftdevel/internals/progress-model/associative-arrays -- Thanks and Regards, Yadu Nand B -------------- next part -------------- A non-text attachment was scrubbed... 
Name: call_func.patch Type: text/x-patch Size: 25128 bytes Desc: not available URL: From davidk at ci.uchicago.edu Mon Aug 22 15:47:25 2011 From: davidk at ci.uchicago.edu (David Kelly) Date: Mon, 22 Aug 2011 15:47:25 -0500 (CDT) Subject: [Swift-devel] To-do for 0.93 In-Reply-To: <966663064.74411.1314042594005.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <1585311239.74572.1314046045926.JavaMail.root@zimbra-mb2.anl.gov> Hello, Just wanted to post an update on the status of 0.93, and try to get an idea of what remains before we can release it. After some fixes for various systems, nearly all the provider testing is completed and working. This list includes Beagle (with and without coasters), CI machines, Crow, Fusion, Intrepid, local, local with coasters, MCS with persistent coasters, PADS (with and without coasters), Queenbee, Ranger with sge/coasters, and Surveyor. The last thing to be tested from the list is Ranger with gt2:sge. I'm running into some issues while trying to renew my DOE certs, so if anyone wants to help out and has access, that would be great. I am aiming to have a more complete site guide finished by the end of this week. Mike and I will also be working together on the finishing touches for the new website. What is the list of potential blockers? Here is the partial list I have (I created new bugzilla tickets for the ones I couldn't find; let me know if any are dupes): Out of Bounds issue - Jon reported seeing an OOB exception during a run on Beagle/PADS using 0.93. https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=521 State reporting on Beagle - Swift is not correctly reporting on the state of the run. https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=519 DSSAT script does not complete, 2nd coaster blocks don't start - https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=523 Is there anything else I am missing? Thanks! 
Regards, David From wozniak at mcs.anl.gov Mon Aug 22 21:22:16 2011 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Mon, 22 Aug 2011 21:22:16 -0500 (Central Daylight Time) Subject: [Swift-devel] Performance problem with CDM direct processing In-Reply-To: <20110822180226.B438A12660@zimbra.anl.gov> References: <20110822180226.B438A12660@zimbra.anl.gov> Message-ID: Ok, great. Another idea that I think Mihael suggested a while ago would be to rewrite _swiftwrap in perl. A lot of things might come out of that. For example, it would be pretty neat if the Coasters worker could be configured to only read the functions from that file and thus not require an external call to perl to start a Swift task. Justin On Mon, 22 Aug 2011, Jonathan Monette wrote: > Using bash to do the wildcard matching was one of the ideas we came up with. > > ----- Reply message ----- > From: "Justin M Wozniak" > Date: Mon, Aug 22, 2011 12:46 pm > Subject: [Swift-devel] Performance problem with CDM direct processing > To: "Jonathan Monette" > Cc: "Michael Wilde" , "Jonathan Monette" , "swift-devel Devel" > > > > This has to do with the way the _swiftwrap shell script looks up those > files. To avoid the external use of perl, I will take a look at using > bash to do the wildcard matching and lookup. Either that or I will batch > multiple lookups into one perl call. > Justin > > On Mon, 22 Aug 2011, Jonathan Monette wrote: > >> Correct. I suspect if we can improve the performance of this section we >> can go from a run 12 hour run to a 6-8 hour run. >> >> The number of files that are being procesed by cdm look up is 320K. >> What was observed was several processes were spawned for each file and >> took maybe a second to run(i think that was the time). >> >> Mike and me had a discussion on how we can replicate it with a simple >> test case to show the delay as well as some simple fixes to try out. 
>> >> ----- Reply message ----- >> From: "Michael Wilde" >> Date: Mon, Aug 22, 2011 10:41 am >> Subject: [Swift-devel] Performance problem with CDM direct processing >> To: "Jonathan Monette" , "Justin M Wozniak" >> Cc: "swift-devel Devel" >> >> >> Justin, >> >> In testing Montage, Jon observed what looks like a performance bottleneck in the processing of CDM direct output passing. >> >> I *think* what was happening was that a large number of jobs (say 25,000 or more, but I dont recall the exact number, it may have been larger) produced an output file, and all those files were being passed as input to a merge job. >> >> What we observed was that the scripts being called from _swiftwrap (and perhaps some processing at the vdl-int.k level??? as well) were running very slowly, and that a fairly large number of scripts were being invoked per file. I think (but am not sure) that the high overhead was being observed at the start of the merge job in CDM scripts called by _swiftwrap. >> >> Jon, can you explain what you know about this problem, and then lets see if we can enhance the performance? This is now the main bottleneck in this application, which is otherwise now performing quite well. >> >> Thanks, >> >> - Mike >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Justin M Wozniak -- Justin M Wozniak From jonmon at mcs.anl.gov Mon Aug 22 21:32:44 2011 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Mon, 22 Aug 2011 21:32:44 -0500 Subject: [Swift-devel] Performance problem with CDM direct processing In-Reply-To: References: <20110822180226.B438A12660@zimbra.anl.gov> Message-ID: How difficult would that be? I am all for re-writing _swiftwrap if there is other benefits that come out of it. 
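For reference, the bash-only matching idea quoted further down in this thread might look roughly like the following. This is only a sketch under stated assumptions, not the actual CDM code: the policy-file format (one `rule <pattern> <policy>` entry per line), the names `cdm_load`, `cdm_lookup` and `CDM_RESULT`, and the `DEFAULT` fallback are all hypothetical and chosen for illustration. The point is that the rules are read once and each file name is matched with bash's built-in `[[ ... == pattern ]]`, so no external perl process is started per file.

```bash
#!/bin/bash
# Sketch only: the file format and all names below are assumptions,
# not taken from the real _swiftwrap or cdm scripts.

declare -a CDM_PATTERNS CDM_POLICIES

# Read the (hypothetical) policy file once and keep the rules in memory.
cdm_load() {
    local keyword pattern policy
    CDM_PATTERNS=()
    CDM_POLICIES=()
    while read -r keyword pattern policy; do
        # Ignore anything that is not a "rule" line.
        [[ $keyword == rule ]] || continue
        CDM_PATTERNS+=("$pattern")
        CDM_POLICIES+=("$policy")
    done < "$1"
}

# Match one file name against the loaded rules; first match wins.
# The result goes into CDM_RESULT rather than stdout, so callers do not
# need $(...) and therefore do not fork a subshell per lookup.
cdm_lookup() {
    local file=$1 i
    for i in "${!CDM_PATTERNS[@]}"; do
        # An unquoted right-hand side inside [[ ]] is treated as a glob.
        if [[ $file == ${CDM_PATTERNS[i]} ]]; then
            CDM_RESULT=${CDM_POLICIES[i]}
            return 0
        fi
    done
    CDM_RESULT=DEFAULT
    return 0
}
```

A caller would run `cdm_load` once at startup and then call `cdm_lookup "$f"` in a loop over all input files, reading `$CDM_RESULT` after each call. With a few hundred thousand files, avoiding one or more process spawns per file is where the saving would come from, if the per-file spawning is indeed the bottleneck.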
I think keeping all the utility scripts (cdm, _swiftwrap, worker.pl, and any others that might appear) in one consistent language is beneficial from a maintenance point of view as well as for performance. On Aug 22, 2011, at 9:22 PM, Justin M Wozniak wrote: > > Ok, great. Another idea that I think Mihael suggested a while ago would be to rewrite _swiftwrap in perl. A lot of things might come out of that. For example, it would be pretty neat if the Coasters worker could be configured to only read the functions from that file and thus not require an external call to perl to start a Swift task. > Justin > > On Mon, 22 Aug 2011, Jonathan Monette wrote: > >> Using bash to do the wildcard matching was one of the ideas we came up with. >> ----- Reply message ----- >> From: "Justin M Wozniak" >> Date: Mon, Aug 22, 2011 12:46 pm >> Subject: [Swift-devel] Performance problem with CDM direct processing >> To: "Jonathan Monette" >> Cc: "Michael Wilde" , "Jonathan Monette" , "swift-devel Devel" >> >> >> >> This has to do with the way the _swiftwrap shell script looks up those files. To avoid the external use of perl, I will take a look at using bash to do the wildcard matching and lookup. Either that or I will batch multiple lookups into one perl call. >> Justin >> >> On Mon, 22 Aug 2011, Jonathan Monette wrote: >> >>> Correct. I suspect if we can improve the performance of this section we can go from a run 12 hour run to a 6-8 hour run. >>> >>> The number of files that are being procesed by cdm look up is 320K. What was observed was several processes were spawned for each file and took maybe a second to run(i think that was the time). >>> >>> Mike and me had a discussion on how we can replicate it with a simple test case to show the delay as well as some simple fixes to try out. 
>>> >>> ----- Reply message ----- >>> From: "Michael Wilde" >>> Date: Mon, Aug 22, 2011 10:41 am >>> Subject: [Swift-devel] Performance problem with CDM direct processing >>> To: "Jonathan Monette" , "Justin M Wozniak" >>> Cc: "swift-devel Devel" >>> >>> >>> Justin, >>> >>> In testing Montage, Jon observed what looks like a performance bottleneck in the processing of CDM direct output passing. >>> >>> I *think* what was happening was that a large number of jobs (say 25,000 or more, but I dont recall the exact number, it may have been larger) produced an output file, and all those files were being passed as input to a merge job. >>> >>> What we observed was that the scripts being called from _swiftwrap (and perhaps some processing at the vdl-int.k level??? as well) were running very slowly, and that a fairly large number of scripts were being invoked per file. I think (but am not sure) that the high overhead was being observed at the start of the merge job in CDM scripts called by _swiftwrap. >>> >>> Jon, can you explain what you know about this problem, and then lets see if we can enhance the performance? This is now the main bottleneck in this application, which is otherwise now performing quite well. >>> >>> Thanks, >>> >>> - Mike >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> >> -- >> Justin M Wozniak > > -- > Justin M Wozniak From jonmon at mcs.anl.gov Mon Aug 22 21:37:22 2011 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Mon, 22 Aug 2011 21:37:22 -0500 Subject: [Swift-devel] Index out of bounds In-Reply-To: <263354065.236370.1313892215268.JavaMail.root@zimbra.anl.gov> References: <263354065.236370.1313892215268.JavaMail.root@zimbra.anl.gov> Message-ID: Ok. I have ran the test case after updating the and rebuilding the 0.93 release. I am not sure why the IndexOutOfBounds error was appearing but now it is not. 
I have ran my scripts around 10 times and the error has not appeared. I am not really sure what happened but I cannot reproduce the error. I am not sure why it was appearing in the first place. On Aug 20, 2011, at 9:03 PM, Michael Wilde wrote: > Jon, the list you want for Beagle issue notifications is beagle-users. You can subscribe via the link: > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/beagle-users > > - Mike > > ----- Forwarded Message ----- > From: "Greg Cross" > To: beagle-users at ci.uchicago.edu > Sent: Saturday, August 20, 2011 2:12:45 PM > Subject: [beagle-users] Outage update > > Lustre is mounting properly but there is a communication failure between the Moab and ALPS scheduler components. This issue is under investigation and has been escalated to Cray. > > As a reminder, please DO NOT attempt to log into the system during this or any other maintenance period. While logins should be denied at this time, any user processes found running on login or sandbox nodes will be terminated without warning. Users who do not respect this may be contacted individually. > > Definitive notification will be sent to this mailing list when the system is available for use. > > > _______________________________________________ > beagle-users mailing list > beagle-users at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/beagle-users > > > From: "Jonathan Monette" > To: "Daniel S. Katz" > Cc: swift-devel at ci.uchicago.edu > Sent: Saturday, August 20, 2011 4:20:35 PM > Subject: Re: [Swift-devel] Index out of bounds > > Thanks. In the meantime could someone let me know when beagle is back in production so I can check my run? > > ----- Reply message ----- > From: "Daniel S. Katz" > Date: Sat, Aug 20, 2011 3:14 pm > Subject: [Swift-devel] Index out of bounds > To: "Jonathan Monette" > Cc: "Ketan Maheshwari" , "swift-devel at ci.uchicago.edu" > > > Yes, write to beagle-support. 
> > On Aug 20, 2011, at 14:52, "Jonathan Monette" wrote: > > Ok thanks. It seems that I was not added to the beagle-notify list. Could someone point me to a link I can subscribe to? Or do I subscribe by sending mail to beagle-support? > > ----- Reply message ----- > From: "Ketan Maheshwari" > Date: Sat, Aug 20, 2011 7:45 am > Subject: [Swift-devel] Index out of bounds > To: "Jonathan Monette" > Cc: > > > Yes, Beagle went down yesterday. There was a notice. > > Current status as of Aug 19, 5.30PM: > > == > At this time, Lustre is not starting properly on Beagle. This may be related to a configuration change that was made during the last outage. The effort to restore system availability is still in active progress. > == > > > Ketan > > On Sat, Aug 20, 2011 at 12:03 AM, Jonathan Monette wrote: > I updated and rebuilt and added that line to my log4j properties. Does anyone know if Beagle is down? showq says there is no service listening to sdb:. qstat shows that I have a job sitting in the queue but it doesn't look like jobs are running. > > I am using both PADS and Beagle for this execution. In this case where jobs are not executing on Beagle shouldn't Swift start submitting jobs to PADS? I do not see that behavior. > > This run is still executing. But if you would like to look at the log it is at www.ci.uchicago.edu/~jonmon/logs/montage-2.log. Only 23 tasks have finished before it just sits there waiting for Beagle to run. > On Aug 19, 2011, at 2:46 PM, Jonathan Monette wrote: > > > Sure can. I add that line to the log4j file or in a different properties file. > > > > ----- Reply message ----- > > From: "Mihael Hategan" > > Date: Fri, Aug 19, 2011 2:03 pm > > Subject: Index out of bounds > > To: "Jonathan Monette" > > Cc: > > > > > > Hmm. So I can't see how this manages to happen. > > > > I added some checks and debugging statements. 
Can you update, set log > > level of org.globus.cog.abstraction.impl.file.local to DEBUG, re-run and > > then post the log when the exception pops up? > > > > Mihael > > > > On Thu, 2011-08-18 at 23:14 -0500, Jonathan Monette wrote: > > > Ok. The log is at > > > www.ci.uchicago.edu/~jonmon/logs/montage-1.log > > > On Aug 18, 2011, at 5:56 PM, Mihael Hategan wrote: > > > > > > > It's probably a good idea to post the stack trace of that exception now > > > > rather than later. > > > > > > > > On Thu, 2011-08-18 at 13:09 -0500, Jonathan Monette wrote: > > > >> Hello, > > > >> I was running 0.93 with one a relatively small run, a 350 task run. > > > >> The run failed on one of the final tasks. I checked the log file and > > > >> saw some index out of bounds errors. I tried with a smaller run and > > > >> didn't see the error. > > > >> > > > >> This run was using beagle, pads, and communicado. I was also using > > > >> cdm. I will post the log in a bit. I am seeing if I cam replicate it > > > >> without using cdm and with a smaller site pool. 
> > > >> > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > -- > Ketan > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Mon Aug 22 21:47:05 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 22 Aug 2011 19:47:05 -0700 Subject: [Swift-devel] Index out of bounds In-Reply-To: References: <263354065.236370.1313892215268.JavaMail.root@zimbra.anl.gov> Message-ID: <1314067625.15659.0.camel@blabla> On Mon, 2011-08-22 at 21:37 -0500, Jonathan Monette wrote: > Ok. I have ran the test case after updating the and rebuilding the > 0.93 release. I am not sure why the IndexOutOfBounds error was > appearing but now it is not. Ok, then I might know. Was any of your files over 2G in size? > I have ran my scripts around 10 times and the error has not appeared. > I am not really sure what happened but I cannot reproduce the error. > I am not sure why it was appearing in the first place. > > On Aug 20, 2011, at 9:03 PM, Michael Wilde wrote: > > > Jon, the list you want for Beagle issue notifications is > > beagle-users. 
You can subscribe via the link: > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/beagle-users > > > > > > - Mike > > > > > > ----- Forwarded Message ----- > > From: "Greg Cross" > > To: beagle-users at ci.uchicago.edu > > Sent: Saturday, August 20, 2011 2:12:45 PM > > Subject: [beagle-users] Outage update > > > > > > Lustre is mounting properly but there is a communication failure > > between the Moab and ALPS scheduler components. This issue is under > > investigation and has been escalated to Cray. > > > > > > As a reminder, please DO NOT attempt to log into the system during > > this or any other maintenance period. While logins should be denied > > at this time, any user processes found running on login or sandbox > > nodes will be terminated without warning. Users who do not respect > > this may be contacted individually. > > > > > > Definitive notification will be sent to this mailing list when the > > system is available for use. > > > > > > > > > > _______________________________________________ > > beagle-users mailing list > > beagle-users at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/beagle-users > > > > > > > > > > ____________________________________________________________________ > > From: "Jonathan Monette" > > To: "Daniel S. Katz" > > Cc: swift-devel at ci.uchicago.edu > > Sent: Saturday, August 20, 2011 4:20:35 PM > > Subject: Re: [Swift-devel] Index out of bounds > > > > Thanks. In the meantime could someone let me know when > > beagle is back in production so I can check my run? > > > > ----- Reply message ----- > > From: "Daniel S. Katz" > > Date: Sat, Aug 20, 2011 3:14 pm > > Subject: [Swift-devel] Index out of bounds > > To: "Jonathan Monette" > > Cc: "Ketan Maheshwari" , > > "swift-devel at ci.uchicago.edu" > > > > > > > > Yes, write to beagle-support. > > > > On Aug 20, 2011, at 14:52, "Jonathan Monette" > > wrote: > > > > > > > > Ok thanks. 
It seems that I was not added to the > > beagle-notify list. Could someone point me to a link > > I can subscribe to? Or do I subscribe by sending > > mail to beagle-support? > > > > ----- Reply message ----- > > From: "Ketan Maheshwari" > > > > Date: Sat, Aug 20, 2011 7:45 am > > Subject: [Swift-devel] Index out of bounds > > To: "Jonathan Monette" > > Cc: > > > > > > > > Yes, Beagle went down yesterday. There was a notice. > > > > > > Current status as of Aug 19, 5.30PM: > > > > > > == > > At this time, Lustre is not starting properly on > > Beagle. This may be related to a configuration > > change that was made during the last outage. The > > effort to restore system availability is still in > > active progress. > > == > > > > > > > > > > Ketan > > > > On Sat, Aug 20, 2011 at 12:03 AM, Jonathan > > Monette wrote: > > I updated and rebuilt and added that line to > > my log4j properties. Does anyone know if > > Beagle is down? showq says there is no > > service listening to sdb:. qstat > > shows that I have a job sitting in the queue > > but it doesn't look like jobs are running. > > > > I am using both PADS and Beagle for this > > execution. In this case where jobs are not > > executing on Beagle shouldn't Swift start > > submitting jobs to PADS? I do not see that > > behavior. > > > > This run is still executing. But if you > > would like to look at the log it is > > at www.ci.uchicago.edu/~jonmon/logs/montage-2.log. Only 23 tasks have finished before it just sits there waiting for Beagle to run. > > > > On Aug 19, 2011, at 2:46 PM, Jonathan > > Monette wrote: > > > > > Sure can. I add that line to the log4j > > file or in a different properties file. > > > > > > ----- Reply message ----- > > > From: "Mihael Hategan" > > > > > Date: Fri, Aug 19, 2011 2:03 pm > > > Subject: Index out of bounds > > > To: "Jonathan Monette" > > > > > Cc: > > > > > > > > > Hmm. So I can't see how this manages to > > happen. 
> > > > > > I added some checks and debugging > > statements. Can you update, set log > > > level of > > org.globus.cog.abstraction.impl.file.local > > to DEBUG, re-run and > > > then post the log when the exception pops > > up? > > > > > > Mihael > > > > > > On Thu, 2011-08-18 at 23:14 -0500, > > Jonathan Monette wrote: > > > > Ok. The log is at > > > > > > www.ci.uchicago.edu/~jonmon/logs/montage-1.log > > > > On Aug 18, 2011, at 5:56 PM, Mihael > > Hategan wrote: > > > > > > > > > It's probably a good idea to post the > > stack trace of that exception now > > > > > rather than later. > > > > > > > > > > On Thu, 2011-08-18 at 13:09 -0500, > > Jonathan Monette wrote: > > > > >> Hello, > > > > >> I was running 0.93 with one a > > relatively small run, a 350 task run. > > > > >> The run failed on one of the final > > tasks. I checked the log file and > > > > >> saw some index out of bounds errors. > > I tried with a smaller run and > > > > >> didn't see the error. > > > > >> > > > > >> This run was using beagle, pads, and > > communicado. I was also using > > > > >> cdm. I will post the log in a bit. I > > am seeing if I cam replicate it > > > > >> without using cdm and with a smaller > > site pool. 
> > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > -- > > Ketan > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From jonmon at mcs.anl.gov Mon Aug 22 21:48:17 2011 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Mon, 22 Aug 2011 21:48:17 -0500 Subject: [Swift-devel] Index out of bounds In-Reply-To: <1314067625.15659.0.camel@blabla> References: <263354065.236370.1313892215268.JavaMail.root@zimbra.anl.gov> <1314067625.15659.0.camel@blabla> Message-ID: Yes as a matter of fact. The output files of the app that was failing are both 2.6GB. On Aug 22, 2011, at 9:47 PM, Mihael Hategan wrote: > On Mon, 2011-08-22 at 21:37 -0500, Jonathan Monette wrote: >> Ok. I have ran the test case after updating the and rebuilding the >> 0.93 release. I am not sure why the IndexOutOfBounds error was >> appearing but now it is not. > > Ok, then I might know. 
Was any of your files over 2G in size? > >> I have ran my scripts around 10 times and the error has not appeared. >> I am not really sure what happened but I cannot reproduce the error. >> I am not sure why it was appearing in the first place. >> >> On Aug 20, 2011, at 9:03 PM, Michael Wilde wrote: >> >>> Jon, the list you want for Beagle issue notifications is >>> beagle-users. You can subscribe via the link: >>> >>> >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/beagle-users >>> >>> >>> - Mike >>> >>> >>> ----- Forwarded Message ----- >>> From: "Greg Cross" >>> To: beagle-users at ci.uchicago.edu >>> Sent: Saturday, August 20, 2011 2:12:45 PM >>> Subject: [beagle-users] Outage update >>> >>> >>> Lustre is mounting properly but there is a communication failure >>> between the Moab and ALPS scheduler components. This issue is under >>> investigation and has been escalated to Cray. >>> >>> >>> As a reminder, please DO NOT attempt to log into the system during >>> this or any other maintenance period. While logins should be denied >>> at this time, any user processes found running on login or sandbox >>> nodes will be terminated without warning. Users who do not respect >>> this may be contacted individually. >>> >>> >>> Definitive notification will be sent to this mailing list when the >>> system is available for use. >>> >>> >>> >>> >>> _______________________________________________ >>> beagle-users mailing list >>> beagle-users at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/beagle-users >>> >>> >>> >>> >>> ____________________________________________________________________ >>> From: "Jonathan Monette" >>> To: "Daniel S. Katz" >>> Cc: swift-devel at ci.uchicago.edu >>> Sent: Saturday, August 20, 2011 4:20:35 PM >>> Subject: Re: [Swift-devel] Index out of bounds >>> >>> Thanks. In the meantime could someone let me know when >>> beagle is back in production so I can check my run? 
>>> >>> ----- Reply message ----- >>> From: "Daniel S. Katz" >>> Date: Sat, Aug 20, 2011 3:14 pm >>> Subject: [Swift-devel] Index out of bounds >>> To: "Jonathan Monette" >>> Cc: "Ketan Maheshwari" , >>> "swift-devel at ci.uchicago.edu" >>> >>> >>> >>> Yes, write to beagle-support. >>> >>> On Aug 20, 2011, at 14:52, "Jonathan Monette" >>> wrote: >>> >>> >>> >>> Ok thanks. It seems that I was not added to the >>> beagle-notify list. Could someone point me to a link >>> I can subscribe to? Or do I subscribe by sending >>> mail to beagle-support? >>> >>> ----- Reply message ----- >>> From: "Ketan Maheshwari" >>> >>> Date: Sat, Aug 20, 2011 7:45 am >>> Subject: [Swift-devel] Index out of bounds >>> To: "Jonathan Monette" >>> Cc: >>> >>> >>> >>> Yes, Beagle went down yesterday. There was a notice. >>> >>> >>> Current status as of Aug 19, 5.30PM: >>> >>> >>> == >>> At this time, Lustre is not starting properly on >>> Beagle. This may be related to a configuration >>> change that was made during the last outage. The >>> effort to restore system availability is still in >>> active progress. >>> == >>> >>> >>> >>> >>> Ketan >>> >>> On Sat, Aug 20, 2011 at 12:03 AM, Jonathan >>> Monette wrote: >>> I updated and rebuilt and added that line to >>> my log4j properties. Does anyone know if >>> Beagle is down? showq says there is no >>> service listening to sdb:. qstat >>> shows that I have a job sitting in the queue >>> but it doesn't look like jobs are running. >>> >>> I am using both PADS and Beagle for this >>> execution. In this case where jobs are not >>> executing on Beagle shouldn't Swift start >>> submitting jobs to PADS? I do not see that >>> behavior. >>> >>> This run is still executing. But if you >>> would like to look at the log it is >>> at www.ci.uchicago.edu/~jonmon/logs/montage-2.log. Only 23 tasks have finished before it just sits there waiting for Beagle to run. >>> >>> On Aug 19, 2011, at 2:46 PM, Jonathan >>> Monette wrote: >>> >>>> Sure can. 
I add that line to the log4j >>> file or in a different properties file. >>>> >>>> ----- Reply message ----- >>>> From: "Mihael Hategan" >>> >>>> Date: Fri, Aug 19, 2011 2:03 pm >>>> Subject: Index out of bounds >>>> To: "Jonathan Monette" >>> >>>> Cc: >>>> >>>> >>>> Hmm. So I can't see how this manages to >>> happen. >>>> >>>> I added some checks and debugging >>> statements. Can you update, set log >>>> level of >>> org.globus.cog.abstraction.impl.file.local >>> to DEBUG, re-run and >>>> then post the log when the exception pops >>> up? >>>> >>>> Mihael >>>> >>>> On Thu, 2011-08-18 at 23:14 -0500, >>> Jonathan Monette wrote: >>>>> Ok. The log is at >>>> >>>> www.ci.uchicago.edu/~jonmon/logs/montage-1.log >>>>> On Aug 18, 2011, at 5:56 PM, Mihael >>> Hategan wrote: >>>>> >>>>>> It's probably a good idea to post the >>> stack trace of that exception now >>>>>> rather than later. >>>>>> >>>>>> On Thu, 2011-08-18 at 13:09 -0500, >>> Jonathan Monette wrote: >>>>>>> Hello, >>>>>>> I was running 0.93 with one a >>> relatively small run, a 350 task run. >>>>>>> The run failed on one of the final >>> tasks. I checked the log file and >>>>>>> saw some index out of bounds errors. >>> I tried with a smaller run and >>>>>>> didn't see the error. >>>>>>> >>>>>>> This run was using beagle, pads, and >>> communicado. I was also using >>>>>>> cdm. I will post the log in a bit. I >>> am seeing if I cam replicate it >>>>>>> without using cdm and with a smaller >>> site pool. 
>>>>>>> >>>>>> >>>>>> >>>>> >>>> >>>> >>>> >>>> >>> >>> >>>> >>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>> >>> >>> >>> >>> >>> -- >>> Ketan >>> >>> >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>> >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>> >>> >>> >>> -- >>> Michael Wilde >>> Computation Institute, University of Chicago >>> Mathematics and Computer Science Division >>> Argonne National Laboratory >>> >>> >> >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > From ketancmaheshwari at gmail.com Mon Aug 22 21:59:15 2011 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Mon, 22 Aug 2011 21:59:15 -0500 Subject: [Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start? In-Reply-To: <1692446633.238425.1314028076127.JavaMail.root@zimbra.anl.gov> References: <1692446633.238425.1314028076127.JavaMail.root@zimbra.anl.gov> Message-ID: Hi, I tried a big catsn run with 0.93 on PADS. The number of tasks i set were to 100K. I saw that at about 18K-19K, there were few error messages: error shutdown of block and some replytimeout exceptions. The run was put so as to test the coasters block restart so it was on a fast queue with walltime of 16 mins. 
The log for the run is : http://ci.uchicago.edu/~ketan/catsn-20110822-1547-1ajivxte.log The execution.retry value was 1 for these runs. Regards, Ketan On Mon, Aug 22, 2011 at 10:47 AM, Michael Wilde wrote: > Can you try this on PADS using small jobs in the fast queue? > > I have not thought this all the way through, but perhaps coasters will > honor maxtime and maxwalltime on any coaster block, even if its not running > on a batch scheduler. In that case perhaps you can replicate the problem on > the MCS pool or better yet on localhost. > > In these runs, what was the value of the execution.retries and lazy.errors > flags? Mihael, do those properties need to be set to >0 and true, > respectively, in order for coasters to start new blocks correctly, assuming > that in some cases a job will run longer than its maxwalltime? > > - Mike > > ------------------------------ > > *From: *"Ketan Maheshwari" > *To: *"Michael Wilde" > *Cc: *"Papia Rizwan" , "swift-devel Devel" < > swift-devel at ci.uchicago.edu> > *Sent: *Monday, August 22, 2011 10:32:31 AM > *Subject: *Re: Blocker issue for 0.93: DSSAT script does not complete, 2nd > coaster blocks dont start? > > > Mike, > > If I recall correctly, Papia has always been running her DSSAT app with > 0.92. She has not yet tried with 0.93. I too tried with 0.92 with her sites > file settings. > > I once tried it with 0.93 on pads but could never get in the running from > the queue. > > I will give another try today as it might be that PADS was too busy last > week. As I recall Jon was also struggling to get access. > > Regards, > Ketan > > On Mon, Aug 22, 2011 at 10:24 AM, Michael Wilde wrote: > >> Papia, Ketan, >> >> In reviewing 0.93 work remaining with David, I remembered this issue. >> >> You both reported that the DSSAT application script doesnt finish on PADS >> - it seems not to start the second round of coaster blocks that it needs to >> complete (as I recall, but this may not be correct). 
This needs to be >> researched and filed as a bug (or, an error in the sites spec needs to be >> identified and made clear in the site guide if it turns out to be the >> problem). >> >> Possible there is an issue with jobs failing at the end of the coaster >> blocks, and you dont have the necessary retry values set for the PADS >> site??? >> >> We need an example run with logs and full details. Can you try to >> re-create this with a much smaller initial allocation, and see if coasters >> is transitioning from its initial blocks to the next blocks? >> >> Can you give this high prio for today? >> >> Thanks, >> >> - Mike >> > > > > -- > Ketan > > > > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... URL:

From hategan at mcs.anl.gov Mon Aug 22 22:37:14 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 22 Aug 2011 20:37:14 -0700 Subject: [Swift-devel] Index out of bounds In-Reply-To: References: <263354065.236370.1313892215268.JavaMail.root@zimbra.anl.gov> <1314067625.15659.0.camel@blabla> Message-ID: <1314070634.15774.2.camel@blabla>

On Mon, 2011-08-22 at 21:48 -0500, Jonathan Monette wrote:
> Yes as a matter of fact. The output files of the app that was failing are both 2.6GB.

That was the problem then:

- int read = remoteStream.read(buf, 0, Math.min(buf.length, (int) (total - crt)));
+ int read = remoteStream.read(buf, 0, (int) Math.min(buf.length, total - crt));

From jonmon at mcs.anl.gov Mon Aug 22 22:40:11 2011 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Mon, 22 Aug 2011 22:40:11 -0500 Subject: [Swift-devel] Index out of bounds In-Reply-To: <1314070634.15774.2.camel@blabla> References: <263354065.236370.1313892215268.JavaMail.root@zimbra.anl.gov> <1314067625.15659.0.camel@blabla> <1314070634.15774.2.camel@blabla> Message-ID:

Ok.
So then it seems that the issue has been resolved? On Aug 22, 2011, at 10:37 PM, Mihael Hategan wrote: > On Mon, 2011-08-22 at 21:48 -0500, Jonathan Monette wrote: >> Yes as a matter of fact. The output files of the app that was failing are both 2.6GB. > > That was the problem then: > > - int read = remoteStream.read(buf, 0, Math.min(buf.length, (int) (total > - crt))); > + int read = remoteStream.read(buf, 0, (int) Math.min(buf.length, total > - crt)); > > From hategan at mcs.anl.gov Tue Aug 23 01:03:53 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 22 Aug 2011 23:03:53 -0700 Subject: [Swift-devel] Index out of bounds In-Reply-To: References: <263354065.236370.1313892215268.JavaMail.root@zimbra.anl.gov> <1314067625.15659.0.camel@blabla> <1314070634.15774.2.camel@blabla> Message-ID: <1314079433.15958.0.camel@blabla> I would say so: 1. There is a reasonable explanation and fix for the problem. 2. It's not occurring any more. On Mon, 2011-08-22 at 22:40 -0500, Jonathan Monette wrote: > Ok. So then it seems that the issue has been resolved? > On Aug 22, 2011, at 10:37 PM, Mihael Hategan wrote: > > > On Mon, 2011-08-22 at 21:48 -0500, Jonathan Monette wrote: > >> Yes as a matter of fact. The output files of the app that was failing are both 2.6GB. > > > > That was the problem then: > > > > - int read = remoteStream.read(buf, 0, Math.min(buf.length, (int) (total > > - crt))); > > + int read = remoteStream.read(buf, 0, (int) Math.min(buf.length, total > > - crt)); > > > > > From hategan at mcs.anl.gov Tue Aug 23 01:05:30 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 22 Aug 2011 23:05:30 -0700 Subject: [Swift-devel] [Fwd: [Bug 525] Requested file offset error with local provider] Message-ID: <1314079530.15958.1.camel@blabla> Thanks for catching and fixing this, David. 
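The one-line change quoted in this thread moves the (int) cast from inside Math.min to outside it. A minimal sketch of why that matters for files over 2 GB follows, assuming total and crt are long byte counts (the cast in the diff suggests this, but the surrounding code is not shown in the thread). The class name, the 16384-byte buffer size and the exact 2,600,000,000-byte file size are hypothetical values chosen for illustration only.

```java
public class ReadLengthOverflow {
    // Old form from the diff: the remaining byte count is narrowed to int
    // BEFORE the comparison, so a remainder above Integer.MAX_VALUE wraps
    // around to a negative number and Math.min returns that negative value.
    static int oldLength(int bufLength, long total, long crt) {
        return Math.min(bufLength, (int) (total - crt));
    }

    // Fixed form from the diff: the comparison is done in long arithmetic and
    // only the result is narrowed. The result never exceeds bufLength, so the
    // cast cannot overflow for a non-negative remainder.
    static int newLength(int bufLength, long total, long crt) {
        return (int) Math.min(bufLength, total - crt);
    }

    public static void main(String[] args) {
        int bufLength = 16384;       // hypothetical buffer size
        long total = 2600000000L;    // roughly 2.6 GB, hypothetical exact size
        long crt = 0L;               // nothing read yet

        System.out.println(oldLength(bufLength, total, crt)); // prints -1694967296
        System.out.println(newLength(bufLength, total, crt)); // prints 16384
    }
}
```

A negative length passed to InputStream.read(byte[], int, int) is rejected with an IndexOutOfBoundsException, which is consistent with the errors reported earlier in the thread and with the failure only showing up on the runs that produced 2.6GB output files.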
Mihael

-------- Forwarded Message --------
From: bugzilla-daemon at mcs.anl.gov
To: hategan at mcs.anl.gov
Subject: [Bug 525] Requested file offset error with local provider
Date: Mon, 22 Aug 2011 18:09:39 -0500 (CDT)

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=525

David Kelly changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED

--- Comment #1 from David Kelly 2011-08-22 18:09:38 ---
Fixed in 3233

--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.

From ketancmaheshwari at gmail.com Tue Aug 23 11:47:48 2011 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Tue, 23 Aug 2011 11:47:48 -0500 Subject: [Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start? In-Reply-To: References: <1692446633.238425.1314028076127.JavaMail.root@zimbra.anl.gov> <1099993048.241780.1314115473917.JavaMail.root@zimbra.anl.gov> Message-ID:

Hello Mike,

I tried another run with 30K tasks on PADS. This run stopped after completing 16K+ tasks.

The log file is: http://www.ci.uchicago.edu/~ketan/catsn-20110823-1116-94roxc18.log

The exception messages I get are attached with the mail.

Looking at the messages, it seems the coasters are unable to restart the submit block once the walltime is expired for a run.

Regards,
Ketan

On Tue, Aug 23, 2011 at 11:06 AM, Ketan Maheshwari < ketancmaheshwari at gmail.com> wrote: > Mike, > > This looks like the coasters blocks not restarting issue. I can try to run > the same run again and see if this persists. > > On Tue, Aug 23, 2011 at 11:04 AM, Michael Wilde wrote: > >> Ketan, >> >> Should I ask David to try to replicate this problem? >> >> Did you figure out why your jobs are not starting on PADS?
>> >> - Mike >> >> >> ------------------------------ >> >> *From: *"Michael Wilde" >> *To: *"Ketan Maheshwari" , "Mihael Hategan" < >> hategan at mcs.anl.gov> >> *Cc: *"swift-devel Devel" , "Papia Rizwan" < >> papia.rizwan at gmail.com> >> *Sent: *Monday, August 22, 2011 10:47:56 AM >> >> *Subject: *Re: Blocker issue for 0.93: DSSAT script does not complete, >> 2nd coaster blocks dont start? >> >> Can you try this on PADS using small jobs in the fast queue? >> >> I have not thought this all the way through, but perhaps coasters will >> honor maxtime and maxwalltime on any coaster block, even if its not running >> on a batch scheduler. In that case perhaps you can replicate the problem on >> the MCS pool or better yet on localhost. >> >> In these runs, what was the value of the execution.retries and lazy.errors >> flags? Mihael, do those properties need to be set to >0 and true, >> respectively, in order for coasters to start new blocks correctly, assuming >> that in some cases a job will run longer than its maxwalltime? >> >> - Mike >> >> ------------------------------ >> >> *From: *"Ketan Maheshwari" >> *To: *"Michael Wilde" >> *Cc: *"Papia Rizwan" , "swift-devel Devel" < >> swift-devel at ci.uchicago.edu> >> *Sent: *Monday, August 22, 2011 10:32:31 AM >> *Subject: *Re: Blocker issue for 0.93: DSSAT script does not complete, >> 2nd coaster blocks dont start? >> >> Mike, >> >> If I recall correctly, Papia has always been running her DSSAT app with >> 0.92. She has not yet tried with 0.93. I too tried with 0.92 with her sites >> file settings. >> >> I once tried it with 0.93 on pads but could never get in the running from >> the queue. >> >> I will give another try today as it might be that PADS was too busy last >> week. As I recall Jon was also struggling to get access. >> >> Regards, >> Ketan >> >> On Mon, Aug 22, 2011 at 10:24 AM, Michael Wilde wrote: >> >>> Papia, Ketan, >>> >>> In reviewing 0.93 work remaining with David, I remembered this issue. 
>>> >>> You both reported that the DSSAT application script doesnt finish on PADS >>> - it seems not to start the second round of coaster blocks that it needs to >>> complete (as I recall, but this may not be correct). This needs to be >>> researched and filed as a bug (or, an error in the sites spec needs to be >>> identified and made clear in the site guide if it turns out to be the >>> problem). >>> >>> Possible there is an issue with jobs failing at the end of the coaster >>> blocks, and you dont have the necessary retry values set for the PADS >>> site??? >>> >>> We need an example run with logs and full details. Can you try to >>> re-create this with a much smaller initial allocation, and see if coasters >>> is transitioning from its initial blocks to the next blocks? >>> >>> Can you give this high prio for today? >>> >>> Thanks, >>> >>> - Mike >>> >> >> >> >> -- >> Ketan >> >> >> >> >> >> -- >> Michael Wilde >> Computation Institute, University of Chicago >> Mathematics and Computer Science Division >> Argonne National Laboratory >> >> >> >> >> -- >> Michael Wilde >> Computation Institute, University of Chicago >> Mathematics and Computer Science Division >> Argonne National Laboratory >> >> > > > -- > Ketan > > > -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: exceptions.pads Type: application/octet-stream Size: 4709 bytes Desc: not available URL: From hategan at mcs.anl.gov Tue Aug 23 14:15:03 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 23 Aug 2011 12:15:03 -0700 Subject: [Swift-devel] persistent coasters on OSG In-Reply-To: References: Message-ID: <1314126903.21733.1.camel@blabla> It looks like workers on nemo are somehow messed up. Can you find out why? 
On Mon, 2011-08-22 at 13:45 -0500, Ketan Maheshwari wrote: > Hi Mihael, All, > > > I am trying to test the persistent coasters setup with OSG sites from > communicado and see some intermittent exceptions/ jobs failed errors > which eventually succeed on retries. > > > The exceptions I see from the log are mostly low-level network > exceptions: (Channel Exceptions, Broken Pipe SocketExceptions, > Timeout, etc.). > > > The runs that I tried were incremental catsn runs with n=1,10,50 and > 100 and data.txt=100MB and 200MB. > > > The only run that had the above mentioned errors were the ones with > n=100 and data.txt=200MB. > > > The other runs completed without any errors. > > > I used just one OSG site for these runs. > > > Attaching the sites, log files and a file that contains exception > messages grepped from log files. > > > Any clues as to harden this, I had about 5 errors on today's run and > about 11 on a similar run last week. > > > > > Regards, > -- > Ketan > > > From wilde at mcs.anl.gov Tue Aug 23 14:17:31 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 23 Aug 2011 14:17:31 -0500 (CDT) Subject: [Swift-devel] persistent coasters on OSG In-Reply-To: <1314126903.21733.1.camel@blabla> Message-ID: <1475609775.242867.1314127051793.JavaMail.root@zimbra.anl.gov> Can you describe what you are seeing on Nemo and what to look for there? - Mike ----- Original Message ----- > From: "Mihael Hategan" > To: "Ketan Maheshwari" > Cc: "Swift Devel" > Sent: Tuesday, August 23, 2011 2:15:03 PM > Subject: Re: [Swift-devel] persistent coasters on OSG > It looks like workers on nemo are somehow messed up. Can you find out > why? > > On Mon, 2011-08-22 at 13:45 -0500, Ketan Maheshwari wrote: > > Hi Mihael, All, > > > > > > I am trying to test the persistent coasters setup with OSG sites > > from > > communicado and see some intermittent exceptions/ jobs failed errors > > which eventually succeed on retries. 
> > > > > > The exceptions I see from the log are mostly low-level network > > exceptions: (Channel Exceptions, Broken Pipe SocketExceptions, > > Timeout, etc.). > > > > > > The runs that I tried were incremental catsn runs with n=1,10,50 and > > 100 and data.txt=100MB and 200MB. > > > > > > The only run that had the above mentioned errors were the ones with > > n=100 and data.txt=200MB. > > > > > > The other runs completed without any errors. > > > > > > I used just one OSG site for these runs. > > > > > > Attaching the sites, log files and a file that contains exception > > messages grepped from log files. > > > > > > Any clues as to harden this, I had about 5 errors on today's run and > > about 11 on a similar run last week. > > > > > > > > > > Regards, > > -- > > Ketan > > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Tue Aug 23 14:27:54 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 23 Aug 2011 12:27:54 -0700 Subject: [Swift-devel] persistent coasters on OSG In-Reply-To: <1475609775.242867.1314127051793.JavaMail.root@zimbra.anl.gov> References: <1475609775.242867.1314127051793.JavaMail.root@zimbra.anl.gov> Message-ID: <1314127674.21733.3.camel@blabla> If you look through the service log, you see that all "lost connection to worker" messages come from workers on nemo. That implies that something is wrong there, but I can't tell what it is. Perhaps enabling worker logging for workers on nemo might shed some light on the issue. On Tue, 2011-08-23 at 14:17 -0500, Michael Wilde wrote: > Can you describe what you are seeing on Nemo and what to look for there? 
> > - Mike > > ----- Original Message ----- > > From: "Mihael Hategan" > > To: "Ketan Maheshwari" > > Cc: "Swift Devel" > > Sent: Tuesday, August 23, 2011 2:15:03 PM > > Subject: Re: [Swift-devel] persistent coasters on OSG > > It looks like workers on nemo are somehow messed up. Can you find out > > why? > > > > On Mon, 2011-08-22 at 13:45 -0500, Ketan Maheshwari wrote: > > > Hi Mihael, All, > > > > > > > > > I am trying to test the persistent coasters setup with OSG sites > > > from > > > communicado and see some intermittent exceptions/ jobs failed errors > > > which eventually succeed on retries. > > > > > > > > > The exceptions I see from the log are mostly low-level network > > > exceptions: (Channel Exceptions, Broken Pipe SocketExceptions, > > > Timeout, etc.). > > > > > > > > > The runs that I tried were incremental catsn runs with n=1,10,50 and > > > 100 and data.txt=100MB and 200MB. > > > > > > > > > The only run that had the above mentioned errors were the ones with > > > n=100 and data.txt=200MB. > > > > > > > > > The other runs completed without any errors. > > > > > > > > > I used just one OSG site for these runs. > > > > > > > > > Attaching the sites, log files and a file that contains exception > > > messages grepped from log files. > > > > > > > > > Any clues as to harden this, I had about 5 errors on today's run and > > > about 11 on a similar run last week. 
> > > > > > > > > > > > > > > Regards, > > > -- > > > Ketan > > > > > > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > From ketancmaheshwari at gmail.com Tue Aug 23 14:31:40 2011 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Tue, 23 Aug 2011 14:31:40 -0500 Subject: [Swift-devel] JVM shutting down message on persistent coaster: service side Message-ID: Hi, Of late, I am getting this message after a while starting coaster service: .... Timer cancelled due to JVM shutting down. Going without timeouts. Timer cancelled due to JVM shutting down. Going without timeouts. Timer cancelled due to JVM shutting down. Going without timeouts. Timer cancelled due to JVM shutting down. Going without timeouts. Congestion queue size: 68 Timer cancelled due to JVM shutting down. Going without timeouts. Timer cancelled due to JVM shutting down. Going without timeouts. Timer cancelled due to JVM shutting down. Going without timeouts. .... I start the coaster service as follows: coaster-service -port 1984 -localport 35753 -nosec This is from Swift version 0.93: $ which coaster-service coaster-service is /home/ketan/swift-install/0.93/cog/modules/swift/dist/swift-svn/bin/coaster-service -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Tue Aug 23 14:34:17 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 23 Aug 2011 14:34:17 -0500 (CDT) Subject: [Swift-devel] persistent coasters on OSG In-Reply-To: <1314127674.21733.3.camel@blabla> Message-ID: <93792166.242945.1314128057254.JavaMail.root@zimbra.anl.gov> OK, thanks. I pointed Ketan to the wrapper script which launches the workers (and which is run as a Condor-G job). This script sets logging on, and tries to send the log back on stdout or stderr. 
That needs to be tested, as its tricky to get the log to come back when the jobs are killed. And its hard to get the logs from OSG. Maybe we can do a test run where the coaster logs are copied or tee'd to a shared filesystem file on any signal, and/or occasionally, etc. - Mike ----- Original Message ----- > From: "Mihael Hategan" > To: "Michael Wilde" > Cc: "Swift Devel" , "Ketan Maheshwari" > Sent: Tuesday, August 23, 2011 2:27:54 PM > Subject: Re: [Swift-devel] persistent coasters on OSG > If you look through the service log, you see that all "lost connection > to worker" messages come from workers on nemo. That implies that > something is wrong there, but I can't tell what it is. > > Perhaps enabling worker logging for workers on nemo might shed some > light on the issue. > > On Tue, 2011-08-23 at 14:17 -0500, Michael Wilde wrote: > > Can you describe what you are seeing on Nemo and what to look for > > there? > > > > - Mike > > > > ----- Original Message ----- > > > From: "Mihael Hategan" > > > To: "Ketan Maheshwari" > > > Cc: "Swift Devel" > > > Sent: Tuesday, August 23, 2011 2:15:03 PM > > > Subject: Re: [Swift-devel] persistent coasters on OSG > > > It looks like workers on nemo are somehow messed up. Can you find > > > out > > > why? > > > > > > On Mon, 2011-08-22 at 13:45 -0500, Ketan Maheshwari wrote: > > > > Hi Mihael, All, > > > > > > > > > > > > I am trying to test the persistent coasters setup with OSG sites > > > > from > > > > communicado and see some intermittent exceptions/ jobs failed > > > > errors > > > > which eventually succeed on retries. > > > > > > > > > > > > The exceptions I see from the log are mostly low-level network > > > > exceptions: (Channel Exceptions, Broken Pipe SocketExceptions, > > > > Timeout, etc.). > > > > > > > > > > > > The runs that I tried were incremental catsn runs with n=1,10,50 > > > > and > > > > 100 and data.txt=100MB and 200MB. 
> > > > > > > > > > > > The only run that had the above mentioned errors were the ones > > > > with > > > > n=100 and data.txt=200MB. > > > > > > > > > > > > The other runs completed without any errors. > > > > > > > > > > > > I used just one OSG site for these runs. > > > > > > > > > > > > Attaching the sites, log files and a file that contains > > > > exception > > > > messages grepped from log files. > > > > > > > > > > > > Any clues as to harden this, I had about 5 errors on today's run > > > > and > > > > about 11 on a similar run last week. > > > > > > > > > > > > > > > > > > > > Regards, > > > > -- > > > > Ketan > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Tue Aug 23 14:36:00 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 23 Aug 2011 12:36:00 -0700 Subject: [Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start? 
In-Reply-To: References: <1692446633.238425.1314028076127.JavaMail.root@zimbra.anl.gov> <1099993048.241780.1314115473917.JavaMail.root@zimbra.anl.gov> Message-ID: <1314128160.21733.4.camel@blabla>

mike at blabla:~/tmp$ grep "heap" catsn-20110823-1116-94roxc18.log
2011-08-23 11:16:49,904-0500 DEBUG Loader Max heap: 257294336
2011-08-23 11:38:50,957-0500 DEBUG VDL2ExecutionContext java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
Caused by: java.lang.OutOfMemoryError: Java heap space
Caused by: java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
Caused by: java.lang.OutOfMemoryError: Java heap space
Caused by: java.lang.OutOfMemoryError: Java heap space

On Tue, 2011-08-23 at 11:47 -0500, Ketan Maheshwari wrote: > Hello Mike, > > > I tried another run with 30K tasks on PADS. This run stopped after > completing 16K+ tasks. > > > The log file is: > http://www.ci.uchicago.edu/~ketan/catsn-20110823-1116-94roxc18.log > > > The exception messages I get are attached with the mail. > > > Looking at the messages, it seems the coasters are unable to restart > the submit block once the walltime is expired for a run. > > > Regards, > Ketan > > On Tue, Aug 23, 2011 at 11:06 AM, Ketan Maheshwari > wrote: > Mike, > > > This looks like the coasters blocks not restarting issue. I > can try to run the same run again and see if this persists. > > > On Tue, Aug 23, 2011 at 11:04 AM, Michael Wilde > wrote: > Ketan, > > > Should I ask David to try to replicate this problem? > > > Did you figure out why your jobs are not starting on > PADS? > > > - Mike > > > > ______________________________________________________ > From: "Michael Wilde" > To: "Ketan Maheshwari" > , "Mihael Hategan" > > Cc: "swift-devel Devel" > , "Papia Rizwan" > > Sent: Monday, August 22, 2011 10:47:56 AM > > > Subject: Re: Blocker issue for 0.93: DSSAT > script does not complete, 2nd coaster blocks > dont start?
> > Can you try this on PADS using small jobs in > the fast queue? > > I have not thought this all the way through, > but perhaps coasters will honor maxtime and > maxwalltime on any coaster block, even if its > not running on a batch scheduler. In that > case perhaps you can replicate the problem on > the MCS pool or better yet on localhost. > > > In these runs, what was the value of > the execution.retries and lazy.errors flags? > Mihael, do those properties need to be set to > >0 and true, respectively, in order for > coasters to start new blocks correctly, > assuming that in some cases a job will run > longer than its maxwalltime? > > > - Mike > > > > ______________________________________________ > From: "Ketan Maheshwari" > > To: "Michael Wilde" > > Cc: "Papia Rizwan" > , "swift-devel > Devel" > Sent: Monday, August 22, 2011 10:32:31 > AM > Subject: Re: Blocker issue for 0.93: > DSSAT script does not complete, 2nd > coaster blocks dont start? > > Mike, > > > If I recall correctly, Papia has > always been running her DSSAT app with > 0.92. She has not yet tried with 0.93. > I too tried with 0.92 with her sites > file settings. > > > I once tried it with 0.93 on pads but > could never get in the running from > the queue. > > > I will give another try today as it > might be that PADS was too busy last > week. As I recall Jon was also > struggling to get access. > > > Regards, > Ketan > > On Mon, Aug 22, 2011 at 10:24 AM, > Michael Wilde > wrote: > Papia, Ketan, > > In reviewing 0.93 work > remaining with David, I > remembered this issue. > > You both reported that the > DSSAT application script > doesnt finish on PADS - it > seems not to start the second > round of coaster blocks that > it needs to complete (as I > recall, but this may not be > correct). This needs to be > researched and filed as a bug > (or, an error in the sites > spec needs to be identified > and made clear in the site > guide if it turns out to be > the problem). 
> > Possible there is an issue > with jobs failing at the end > of the coaster blocks, and you > dont have the necessary retry > values set for the PADS > site??? > > We need an example run with > logs and full details. Can you > try to re-create this with a > much smaller initial > allocation, and see if > coasters is transitioning from > its initial blocks to the next > blocks? > > Can you give this high prio > for today? > > Thanks, > > - Mike > > > > > -- > Ketan > > > > > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > > > > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > > > > > > -- > Ketan > > > > > > > > -- > Ketan > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Tue Aug 23 14:46:01 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 23 Aug 2011 14:46:01 -0500 (CDT) Subject: [Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start? In-Reply-To: <1314128160.21733.4.camel@blabla> Message-ID: <488203920.243001.1314128761894.JavaMail.root@zimbra.anl.gov> Well, I raised that issue, but Ketan claimed that the failure to start more jobs occurs without that message as well. Do you believe that the Out of Mem error is the root cause? Ketan, can you point to logs without the OOM error? Can you re-run the catsn with more memory? And more importantly: can you run a *very small* catsnsleep test where you carefully craft the sleep times and settings to cause one (very short duration) coaster block to time out and verify that a new block is submitted and in new job and that the script runs to completion? 
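[Editorial sketch, not part of the original message.] The question of whether the OutOfMemoryError is the root cause can be checked against any run log in the same way as the grep quoted earlier in this thread. The following is a minimal sketch, assuming the log uses the "Max heap: <bytes>" and "java.lang.OutOfMemoryError" line formats shown in that grep output; the function name summarize_heap is hypothetical and the sample text is an excerpt of those quoted lines, not a full log.

```python
import re


def summarize_heap(log_text):
    """Return (max_heap_bytes_or_None, count_of_lines_mentioning_OutOfMemoryError).

    Assumes the "Max heap: <bytes>" format seen in the grep output quoted
    in this thread; other log formats are not handled.
    """
    match = re.search(r"Max heap: (\d+)", log_text)
    max_heap = int(match.group(1)) if match else None
    oom_lines = sum(
        1 for line in log_text.splitlines()
        if "java.lang.OutOfMemoryError" in line
    )
    return max_heap, oom_lines


# Excerpt of the lines quoted from catsn-20110823-1116-94roxc18.log
sample = (
    "2011-08-23 11:16:49,904-0500 DEBUG Loader Max heap: 257294336\n"
    "2011-08-23 11:38:50,957-0500 DEBUG VDL2ExecutionContext "
    "java.lang.OutOfMemoryError: Java heap space\n"
    "Caused by: java.lang.OutOfMemoryError: Java heap space\n"
)

max_heap, oom_lines = summarize_heap(sample)
print("max heap (MiB):", max_heap / 2**20)
print("lines mentioning OutOfMemoryError:", oom_lines)
```

A log from a failing run that reports zero such lines would support the claim that blocks fail to restart independently of the heap limit. The reported limit of 257294336 bytes is roughly 245 MiB.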
I suggested in the ticket that David do this; can you both discuss and see who is better positioned to do this sooner, so we can decide if we have a blocker here, or just something that needs better configuration and perhaps a note in the user guide telling users what to watch out for in this regard? (I think for example we do not tell how and when to increase memory in the user guide, at the moment). Nor are we clear enough on the issues around maxtime, maxwalltime, and the sizing of coaster blocks. Thanks, - Mike ----- Original Message ----- > From: "Mihael Hategan" > To: "Ketan Maheshwari" > Cc: "Michael Wilde" , "Swift Devel" > Sent: Tuesday, August 23, 2011 2:36:00 PM > Subject: Re: [Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start? > mike at blabla:~/tmp$ grep "heap" catsn-20110823-1116-94roxc18.log > 2011-08-23 11:16:49,904-0500 DEBUG Loader Max heap: 257294336 > 2011-08-23 11:38:50,957-0500 DEBUG VDL2ExecutionContext > java.lang.OutOfMemoryError: Java heap space > java.lang.OutOfMemoryError: Java heap space > Caused by: java.lang.OutOfMemoryError: Java heap space > Caused by: java.lang.OutOfMemoryError: Java heap space > java.lang.OutOfMemoryError: Java heap space > Caused by: java.lang.OutOfMemoryError: Java heap space > Caused by: java.lang.OutOfMemoryError: Java heap space > > > On Tue, 2011-08-23 at 11:47 -0500, Ketan Maheshwari wrote: > > Hello Mike, > > > > > > I tried another run with 30K tasks on PADS. This run stopped after > > completing 16K+ tasks. > > > > > > The log file is: > > http://www.ci.uchicago.edu/~ketan/catsn-20110823-1116-94roxc18.log > > > > > > The exception messages I get are attached with the mail. > > > > > > Looking at the messages, it seems the coasters are unable to restart > > the submit block once the walltime is expired for a run. 
-- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory
From hategan at mcs.anl.gov Tue Aug 23 15:21:59 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 23 Aug 2011 13:21:59 -0700 Subject: [Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start? In-Reply-To: <488203920.243001.1314128761894.JavaMail.root@zimbra.anl.gov> References: <488203920.243001.1314128761894.JavaMail.root@zimbra.anl.gov> Message-ID: <1314130919.23088.0.camel@blabla> On Tue, 2011-08-23 at 14:46 -0500, Michael Wilde wrote: > Well, I raised that issue, but Ketan claimed that the failure to start more jobs occurs without that message as well. Fine. I'll need the log from that then. > > Do you believe that the Out of Mem error is the root cause? > > Ketan, can you point to logs without the OOM error? > > Can you re-run the catsn with more memory? > > And more importantly: can you run a *very small* catsnsleep test where you carefully craft the sleep times and settings to cause one (very short duration) coaster block to time out and verify that a new block is submitted and in new job and that the script runs to completion?
> > I suggested in the ticket that David do this; can you both discuss and see who is better positioned to do this sooner, so we can decide if we have a blocker here, or just something that needs better configuration and perhaps a note in the user guide telling users what to watch out for in this regard? (I think for example we do not tell how and when to increase memory in the user guide, at the moment). Nor are we clear enough on the issues around maxtime, maxwalltime, and the sizing of coaster blocks. > > Thanks, > > - Mike > > > ----- Original Message ----- > > From: "Mihael Hategan" > > To: "Ketan Maheshwari" > > Cc: "Michael Wilde" , "Swift Devel" > > Sent: Tuesday, August 23, 2011 2:36:00 PM > > Subject: Re: [Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start? > > mike at blabla:~/tmp$ grep "heap" catsn-20110823-1116-94roxc18.log > > 2011-08-23 11:16:49,904-0500 DEBUG Loader Max heap: 257294336 > > 2011-08-23 11:38:50,957-0500 DEBUG VDL2ExecutionContext > > java.lang.OutOfMemoryError: Java heap space > > java.lang.OutOfMemoryError: Java heap space > > Caused by: java.lang.OutOfMemoryError: Java heap space > > Caused by: java.lang.OutOfMemoryError: Java heap space > > java.lang.OutOfMemoryError: Java heap space > > Caused by: java.lang.OutOfMemoryError: Java heap space > > Caused by: java.lang.OutOfMemoryError: Java heap space > > > > > > On Tue, 2011-08-23 at 11:47 -0500, Ketan Maheshwari wrote: > > > Hello Mike, > > > > > > > > > I tried another run with 30K tasks on PADS. This run stopped after > > > completing 16K+ tasks. > > > > > > > > > The log file is: > > > http://www.ci.uchicago.edu/~ketan/catsn-20110823-1116-94roxc18.log > > > > > > > > > The exception messages I get are attached with the mail. > > > > > > > > > Looking at the messages, it seems the coasters are unable to restart > > > the submit block once the walltime is expired for a run. 
From ketancmaheshwari at gmail.com Tue Aug 23 15:46:04 2011 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Tue, 23 Aug 2011 15:46:04 -0500 Subject: [Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?
In-Reply-To: <1314130919.23088.0.camel@blabla> References: <488203920.243001.1314128761894.JavaMail.root@zimbra.anl.gov> <1314130919.23088.0.camel@blabla> Message-ID: Hi, I tried a smaller run, catsnsleep, sleeptime 60sec, n=20, walltime=2min The run indeed completed but I saw this in the middle (I suppose at the end of first walltime slot): Command(13, HEARTBEAT): handling reply timeout; sendReqTime=110823-153059.847, sendTime=110823-153059.847, now=110823-153259.860 Command(13, HEARTBEAT)fault was: Reply timeout org.globus.cog.karajan.workflow.service.ReplyTimeoutException at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288) at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293) at java.util.TimerThread.mainLoop(Timer.java:512) at java.util.TimerThread.run(Timer.java:462) Command(13, HEARTBEAT)fault was: Invalid channel: 914784201: {} org.globus.cog.karajan.workflow.service.ReplyTimeoutException at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288) at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293) at java.util.TimerThread.mainLoop(Timer.java:512) at java.util.TimerThread.run(Timer.java:462) Heartbeat failed: Invalid channel: 914784201: {} org.globus.cog.karajan.workflow.service.ReplyTimeoutException at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288) at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293) at java.util.TimerThread.mainLoop(Timer.java:512) at java.util.TimerThread.run(Timer.java:462) The log is attached. I will try a longish run with more heap memory. Regards, Ketan On Tue, Aug 23, 2011 at 3:21 PM, Mihael Hategan wrote: > On Tue, 2011-08-23 at 14:46 -0500, Michael Wilde wrote: > > Well, I raised that issue, but Ketan claimed that the failure to start > more jobs occurs without that message as well. > > Fine. 
I'll need the log from that then. > > > > > Do you believe that the Out of Mem error is the root cause? > > > > Ketan, can you point to logs without the OOM error? > > > > Can you re-run the catsn with more memory? > > > > And more importantly: can you run a *very small* catsnsleep test where > you carefully craft the sleep times and settings to cause one (very short > duration) coaster block to time out and verify that a new block is submitted > and in new job and that the script runs to completion? > > > > I suggested in the ticket that David do this; can you both discuss and > see who is better positioned to do this sooner, so we can decide if we have > a blocker here, or just something that needs better configuration and > perhaps a note in the user guide telling users what to watch out for in this > regard? (I think for example we do not tell how and when to increase memory > in the user guide, at the moment). Nor are we clear enough on the issues > around maxtime, maxwalltime, and the sizing of coaster blocks. > > > > Thanks, > > > > - Mike
-- Ketan -------------- next part -------------- A non-text attachment was scrubbed... Name: catsnsleep-20110823-1527-b6kpd8h9.log Type: application/octet-stream Size: 147190 bytes Desc: not available URL:
From ketancmaheshwari at gmail.com Tue Aug 23 15:59:26 2011 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Tue, 23 Aug 2011 15:59:26 -0500 Subject: [Swift-devel] Java stackoverflow Message-ID: Hello, Today while running the SCEC workflow, I saw java stackoverflow exception. I haven't seen this in a while.
Following is a partial stack trace: Uncaught exception: java.lang.StackOverflowError in kernel:variable @ postproc.kml, line: 1868 java.lang.StackOverflowError at java.nio.CharBuffer.arrayOffset(CharBuffer.java:967) at sun.nio.cs.UTF_8.updatePositions(UTF_8.java:58) at sun.nio.cs.UTF_8$Encoder.overflow(UTF_8.java:328) at sun.nio.cs.UTF_8$Encoder.encodeArrayLoop(UTF_8.java:358) at sun.nio.cs.UTF_8$Encoder.encodeLoop(UTF_8.java:447) at java.nio.charset.CharsetEncoder.encode(CharsetEncoder.java:544) at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:252) at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:106) at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:116) at java.io.OutputStreamWriter.write(OutputStreamWriter.java:203) at java.io.Writer.write(Writer.java:140) at org.apache.log4j.helpers.QuietWriter.write(QuietWriter.java:48) at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:310) at org.apache.log4j.WriterAppender.append(WriterAppender.java:162) at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251) at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66) at org.apache.log4j.Category.callAppenders(Category.java:206) at org.apache.log4j.Category.forcedLog(Category.java:391) at org.apache.log4j.Category.info(Category.java:666) at org.griphyn.vdl.karajan.lib.SetFieldValue.log(SetFieldValue.java:73) at org.griphyn.vdl.karajan.lib.SetFieldValue.function(SetFieldValue.java:38) at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:62) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) at org.globus.cog.karajan.workflow.nodes.functions.Argument.post(Argument.java:48) at 
org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.futureModified(AbstractSequentialWithArguments.java:208) at org.griphyn.vdl.karajan.DSHandleFutureWrapper.notifyListeners(DSHandleFutureWrapper.java:60) at org.griphyn.vdl.mapping.AbstractDataNode.notifyListeners(AbstractDataNode.java:626) at org.griphyn.vdl.mapping.AbstractDataNode.closeShallow(AbstractDataNode.java:414) at org.griphyn.vdl.mapping.AbstractDataNode.setValue(AbstractDataNode.java:361) at org.griphyn.vdl.karajan.lib.SetFieldValue.deepCopy(SetFieldValue.java:103) at org.griphyn.vdl.karajan.lib.SetFieldValue.function(SetFieldValue.java:46) at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:62) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) at org.globus.cog.karajan.workflow.nodes.functions.Argument.post(Argument.java:48) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.futureModified(AbstractSequentialWithArguments.java:208) at 
org.griphyn.vdl.karajan.DSHandleFutureWrapper.notifyListeners(DSHandleFutureWrapper.java:60) at org.griphyn.vdl.mapping.AbstractDataNode.notifyListeners(AbstractDataNode.java:626) at org.griphyn.vdl.mapping.AbstractDataNode.closeShallow(AbstractDataNode.java:414) at org.griphyn.vdl.mapping.AbstractDataNode.setValue(AbstractDataNode.java:361) at org.griphyn.vdl.karajan.lib.SetFieldValue.deepCopy(SetFieldValue.java:103) at org.griphyn.vdl.karajan.lib.SetFieldValue.function(SetFieldValue.java:46) at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:62) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) at org.globus.cog.karajan.workflow.nodes.functions.Argument.post(Argument.java:48) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.futureModified(AbstractSequentialWithArguments.java:208) at org.griphyn.vdl.karajan.DSHandleFutureWrapper.notifyListeners(DSHandleFutureWrapper.java:60) at org.griphyn.vdl.mapping.AbstractDataNode.notifyListeners(AbstractDataNode.java:626) at org.griphyn.vdl.mapping.AbstractDataNode.closeShallow(AbstractDataNode.java:414) at org.griphyn.vdl.mapping.AbstractDataNode.setValue(AbstractDataNode.java:361) at org.griphyn.vdl.karajan.lib.SetFieldValue.deepCopy(SetFieldValue.java:103) at org.griphyn.vdl.karajan.lib.SetFieldValue.function(SetFieldValue.java:46) at 
org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:62) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) at org.globus.cog.karajan.workflow.nodes.functions.Argument.post(Argument.java:48) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.futureModified(AbstractSequentialWithArguments.java:208) at org.griphyn.vdl.karajan.DSHandleFutureWrapper.notifyListeners(DSHandleFutureWrapper.java:60) at org.griphyn.vdl.mapping.AbstractDataNode.notifyListeners(AbstractDataNode.java:626) at org.griphyn.vdl.mapping.AbstractDataNode.closeShallow(AbstractDataNode.java:414) at org.griphyn.vdl.mapping.AbstractDataNode.setValue(AbstractDataNode.java:361) at org.griphyn.vdl.karajan.lib.SetFieldValue.deepCopy(SetFieldValue.java:103) at org.griphyn.vdl.karajan.lib.SetFieldValue.function(SetFieldValue.java:46) at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:62) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) at org.globus.cog.karajan.workflow.nodes.functions.Argument.post(Argument.java:48) at 
org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.futureModified(AbstractSequentialWithArguments.java:208) at org.griphyn.vdl.karajan.DSHandleFutureWrapper.notifyListeners(DSHandleFutureWrapper.java:60) at org.griphyn.vdl.mapping.AbstractDataNode.notifyListeners(AbstractDataNode.java:626) at org.griphyn.vdl.mapping.AbstractDataNode.closeShallow(AbstractDataNode.java:414) at org.griphyn.vdl.mapping.AbstractDataNode.setValue(AbstractDataNode.java:361) at org.griphyn.vdl.karajan.lib.SetFieldValue.deepCopy(SetFieldValue.java:103) at org.griphyn.vdl.karajan.lib.SetFieldValue.function(SetFieldValue.java:46) at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:62) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) at org.globus.cog.karajan.workflow.nodes.functions.Argument.post(Argument.java:48) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java: Following is the log file of this run: http://www.ci.uchicago.edu/~ketan/postproc-20110823-1501-rtqj0ks1.log Regards -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... 
URL:
From hategan at mcs.anl.gov Tue Aug 23 16:05:13 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 23 Aug 2011 14:05:13 -0700
Subject: [Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?
In-Reply-To:
References: <488203920.243001.1314128761894.JavaMail.root@zimbra.anl.gov> <1314130919.23088.0.camel@blabla>
Message-ID: <1314133513.23324.1.camel@blabla>

That's benign, but I committed a patch to prevent it from happening in
cog r3237.

On Tue, 2011-08-23 at 15:46 -0500, Ketan Maheshwari wrote:
> Hi,
>
> I tried a smaller run, catsnsleep, sleeptime 60sec, n=20,
> walltime=2min
>
> The run indeed completed but I saw this in the middle (I suppose at
> the end of first walltime slot):
>
> Command(13, HEARTBEAT): handling reply timeout; sendReqTime=110823-153059.847, sendTime=110823-153059.847, now=110823-153259.860
> Command(13, HEARTBEAT)fault was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>     at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
>     at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
>     at java.util.TimerThread.mainLoop(Timer.java:512)
>     at java.util.TimerThread.run(Timer.java:462)
> Command(13, HEARTBEAT)fault was: Invalid channel: 914784201: {}
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>     at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
>     at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
>     at java.util.TimerThread.mainLoop(Timer.java:512)
>     at java.util.TimerThread.run(Timer.java:462)
> Heartbeat failed: Invalid channel: 914784201: {}
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>     at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
>     at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
>     at java.util.TimerThread.mainLoop(Timer.java:512)
>     at java.util.TimerThread.run(Timer.java:462)
>
> The log is attached. I will try a longish run with more heap memory.
>
> Regards,
> Ketan
>
> On Tue, Aug 23, 2011 at 3:21 PM, Mihael Hategan wrote:
>     On Tue, 2011-08-23 at 14:46 -0500, Michael Wilde wrote:
>     > Well, I raised that issue, but Ketan claimed that the failure to
>     > start more jobs occurs without that message as well.
>
>     Fine. I'll need the log from that then.
>
>     > Do you believe that the Out of Mem error is the root cause?
>     >
>     > Ketan, can you point to logs without the OOM error?
>     >
>     > Can you re-run the catsn with more memory?
>     >
>     > And more importantly: can you run a *very small* catsnsleep test
>     > where you carefully craft the sleep times and settings to cause one
>     > (very short duration) coaster block to time out and verify that a
>     > new block is submitted and in new job and that the script runs to
>     > completion?
>     >
>     > I suggested in the ticket that David do this; can you both discuss
>     > and see who is better positioned to do this sooner, so we can
>     > decide if we have a blocker here, or just something that needs
>     > better configuration and perhaps a note in the user guide telling
>     > users what to watch out for in this regard? (I think for example we
>     > do not tell how and when to increase memory in the user guide, at
>     > the moment). Nor are we clear enough on the issues around maxtime,
>     > maxwalltime, and the sizing of coaster blocks.
>     >
>     > Thanks,
>     >
>     > - Mike
>     >
>     > ----- Original Message -----
>     > > From: "Mihael Hategan"
>     > > To: "Ketan Maheshwari"
>     > > Cc: "Michael Wilde" , "Swift Devel"
>     > > Sent: Tuesday, August 23, 2011 2:36:00 PM
>     > > Subject: Re: [Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?
>     > > mike at blabla:~/tmp$ grep "heap" catsn-20110823-1116-94roxc18.log
>     > > 2011-08-23 11:16:49,904-0500 DEBUG Loader Max heap: 257294336
>     > > 2011-08-23 11:38:50,957-0500 DEBUG VDL2ExecutionContext java.lang.OutOfMemoryError: Java heap space
>     > > java.lang.OutOfMemoryError: Java heap space
>     > > Caused by: java.lang.OutOfMemoryError: Java heap space
>     > > Caused by: java.lang.OutOfMemoryError: Java heap space
>     > > java.lang.OutOfMemoryError: Java heap space
>     > > Caused by: java.lang.OutOfMemoryError: Java heap space
>     > > Caused by: java.lang.OutOfMemoryError: Java heap space
>     > >
>     > > On Tue, 2011-08-23 at 11:47 -0500, Ketan Maheshwari wrote:
>     > > > Hello Mike,
>     > > >
>     > > > I tried another run with 30K tasks on PADS. This run stopped after
>     > > > completing 16K+ tasks.
>     > > >
>     > > > The log file is:
>     > > > http://www.ci.uchicago.edu/~ketan/catsn-20110823-1116-94roxc18.log
>     > > >
>     > > > The exception messages I get are attached with the mail.
>     > > >
>     > > > Looking at the messages, it seems the coasters are unable to restart
>     > > > the submit block once the walltime is expired for a run.
>     > > >
>     > > > Regards,
>     > > > Ketan
>     > > >
>     > > > On Tue, Aug 23, 2011 at 11:06 AM, Ketan Maheshwari wrote:
>     > > >     Mike,
>     > > >
>     > > >     This looks like the coasters blocks not restarting issue. I
>     > > >     can try to run the same run again and see if this persists.
>     > > >
>     > > >     On Tue, Aug 23, 2011 at 11:04 AM, Michael Wilde wrote:
>     > > >         Ketan,
>     > > >
>     > > >         Should I ask David to try to replicate this problem?
>     > > >
>     > > >         Did you figure out why your jobs are not starting on PADS?
>     > > >
>     > > >         - Mike
>     > > >
>     > > >         ______________________________________________________
>     > > >         From: "Michael Wilde"
>     > > >         To: "Ketan Maheshwari" , "Mihael Hategan"
>     > > >         Cc: "swift-devel Devel" , "Papia Rizwan"
>     > > >         Sent: Monday, August 22, 2011 10:47:56 AM
>     > > >         Subject: Re: Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?
>     > > >
>     > > >         Can you try this on PADS using small jobs in the fast queue?
>     > > >
>     > > >         I have not thought this all the way through, but perhaps
>     > > >         coasters will honor maxtime and maxwalltime on any coaster
>     > > >         block, even if its not running on a batch scheduler. In that
>     > > >         case perhaps you can replicate the problem on the MCS pool
>     > > >         or better yet on localhost.
>     > > >
>     > > >         In these runs, what was the value of the execution.retries
>     > > >         and lazy.errors flags? Mihael, do those properties need to
>     > > >         be set to >0 and true, respectively, in order for coasters
>     > > >         to start new blocks correctly, assuming that in some cases a
>     > > >         job will run longer than its maxwalltime?
>     > > >
>     > > >         - Mike
>     > > >
>     > > >         ______________________________________________
>     > > >             From: "Ketan Maheshwari"
>     > > >             To: "Michael Wilde"
>     > > >             Cc: "Papia Rizwan" , "swift-devel Devel"
>     > > >             Sent: Monday, August 22, 2011 10:32:31 AM
>     > > >             Subject: Re: Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?
>     > > >
>     > > >             Mike,
>     > > >
>     > > >             If I recall correctly, Papia has always been running her
>     > > >             DSSAT app with 0.92. She has not yet tried with 0.93.
>     > > >             I too tried with 0.92 with her sites file settings.
>     > > >
>     > > >             I once tried it with 0.93 on pads but could never get
>     > > >             in the running from the queue.
>     > > >
>     > > >             I will give another try today as it might be that PADS
>     > > >             was too busy last week. As I recall Jon was also
>     > > >             struggling to get access.
>     > > >
>     > > >             Regards,
>     > > >             Ketan
>     > > >
>     > > >             On Mon, Aug 22, 2011 at 10:24 AM, Michael Wilde wrote:
>     > > >                 Papia, Ketan,
>     > > >
>     > > >                 In reviewing 0.93 work remaining with David, I
>     > > >                 remembered this issue.
>     > > >
>     > > >                 You both reported that the DSSAT application script
>     > > >                 doesnt finish on PADS - it seems not to start the
>     > > >                 second round of coaster blocks that it needs to
>     > > >                 complete (as I recall, but this may not be correct).
>     > > >                 This needs to be researched and filed as a bug (or,
>     > > >                 an error in the sites spec needs to be identified
>     > > >                 and made clear in the site guide if it turns out to
>     > > >                 be the problem).
>     > > >
>     > > >                 Possible there is an issue with jobs failing at the
>     > > >                 end of the coaster blocks, and you dont have the
>     > > >                 necessary retry values set for the PADS site???
>     > > >
>     > > >                 We need an example run with logs and full details.
>     > > >                 Can you try to re-create this with a much smaller
>     > > >                 initial allocation, and see if coasters is
>     > > >                 transitioning from its initial blocks to the next
>     > > >                 blocks?
>     > > >
>     > > >                 Can you give this high prio for today?
>     > > >
>     > > >                 Thanks,
>     > > >
>     > > >                 - Mike
>     > > >
>     > > >             --
>     > > >             Ketan
>     > > >
>     > > >         --
>     > > >         Michael Wilde
>     > > >         Computation Institute, University of Chicago
>     > > >         Mathematics and Computer Science Division
>     > > >         Argonne National Laboratory
>     > > >
>     > > >         --
>     > > >         Michael Wilde
>     > > >         Computation Institute, University of Chicago
>     > > >         Mathematics and Computer Science Division
>     > > >         Argonne National Laboratory
>     > > >
>     > > >     --
>     > > >     Ketan
>     > > >
>     > > > --
>     > > > Ketan
>     > > >
>     > > > _______________________________________________
>     > > > Swift-devel mailing list
>     > > > Swift-devel at ci.uchicago.edu
>     > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>
>     _______________________________________________
>     Swift-devel mailing list
>     Swift-devel at ci.uchicago.edu
>     https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>
> --
> Ketan

From ketancmaheshwari at gmail.com Tue Aug 23 20:12:08 2011
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Tue, 23 Aug 2011 20:12:08 -0500
Subject: [Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?
In-Reply-To: <1314133513.23324.1.camel@blabla>
References: <488203920.243001.1314128761894.JavaMail.root@zimbra.anl.gov> <1314130919.23088.0.camel@blabla> <1314133513.23324.1.camel@blabla>
Message-ID:

Hi again,

Tried a larger run on PADS with similar sleep but large n parameters.
The run seemed to be progressing well (I killed it by mistake), but the
log does show some coaster block shutdown and network-related exception
messages.

Attached is the log.
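A log of this size is easier to compare across runs once the messages discussed in this thread are counted. The following is only a sketch in Python, not part of Swift: the function name summarize_log and the labels in MARKERS are made up for illustration, and the marker strings are copied from the excerpts quoted in this thread, so anything a given log spells differently will simply not be counted.

```python
import re

# Marker substrings copied from the log excerpts quoted in this thread.
# The labels on the left are illustrative names, not Swift terminology.
MARKERS = {
    "out_of_memory": "java.lang.OutOfMemoryError",
    "reply_timeout": "fault was: Reply timeout",
    "invalid_channel": "Invalid channel",
}

# Matches a "Max heap: <bytes>" entry as it appears in the quoted grep output.
MAX_HEAP = re.compile(r"Max heap: (\d+)")


def summarize_log(lines):
    """Return (max_heap_bytes or None, {label: number of matching lines})."""
    counts = {label: 0 for label in MARKERS}
    max_heap = None
    for line in lines:
        match = MAX_HEAP.search(line)
        if match:
            max_heap = int(match.group(1))
        for label, needle in MARKERS.items():
            if needle in line:
                counts[label] += 1
    return max_heap, counts
```

Passing an open file, e.g. summarize_log(open(path, errors="replace")), reads the log line by line, so a multi-megabyte file does not have to be held in memory. A nonzero out_of_memory count together with the reported max heap gives a quick hint on whether a rerun with a larger heap is worth trying before looking at coaster block behaviour.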

Regards,
Ketan

On Tue, Aug 23, 2011 at 4:05 PM, Mihael Hategan wrote:
> That's benign, but I committed a patch to prevent it from happening in
> cog r3237.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: catsnsleep-20110823-1714-9xk5y0b2.log
Type: application/octet-stream
Size: 2852009 bytes
Desc: not available
URL:
From hategan at mcs.anl.gov Tue Aug 23 23:04:03 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 23 Aug 2011 21:04:03 -0700
Subject: [Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?
In-Reply-To:
References: <488203920.243001.1314128761894.JavaMail.root@zimbra.anl.gov> <1314130919.23088.0.camel@blabla> <1314133513.23324.1.camel@blabla>
Message-ID: <1314158643.26958.0.camel@blabla>

Also benign, but annoying. So I'd like to nail these out. Can you try
r3240?

On Tue, 2011-08-23 at 20:12 -0500, Ketan Maheshwari wrote:
> Hi again,
>
> Tried a larger run on PADS with similar sleep but large n parameters.
> The run seemed to be progressing well (I killed it by mistake), but the
> log does show some coaster block shutdown and network-related exception
> messages.
>
> Attached is the log.
>
> Regards,
> Ketan
>
> On Tue, Aug 23, 2011 at 4:05 PM, Mihael Hategan wrote:
>     That's benign, but I committed a patch to prevent it from
>     happening in cog r3237.
>
> > > On Tue, 2011-08-23 at 15:46 -0500, Ketan Maheshwari wrote: > > Hi, > > > > > > I tried a smaller run, catsnsleep, sleeptime 60sec, n=20, > > walltime=2min > > > > > > The run indeed completed but I saw this in the middle (I > suppose at > > the end of first walltime slot): > > > > > > Command(13, HEARTBEAT): handling reply timeout; > > sendReqTime=110823-153059.847, sendTime=110823-153059.847, > > now=110823-153259.860 > > Command(13, HEARTBEAT)fault was: Reply timeout > > > org.globus.cog.karajan.workflow.service.ReplyTimeoutException > > at > > > org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288) > > at org.globus.cog.karajan.workflow.service.commands.Command > > $Timeout.run(Command.java:293) > > at java.util.TimerThread.mainLoop(Timer.java:512) > > at java.util.TimerThread.run(Timer.java:462) > > Command(13, HEARTBEAT)fault was: Invalid channel: 914784201: > {} > > > org.globus.cog.karajan.workflow.service.ReplyTimeoutException > > at > > > org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288) > > at org.globus.cog.karajan.workflow.service.commands.Command > > $Timeout.run(Command.java:293) > > at java.util.TimerThread.mainLoop(Timer.java:512) > > at java.util.TimerThread.run(Timer.java:462) > > Heartbeat failed: Invalid channel: 914784201: {} > > > org.globus.cog.karajan.workflow.service.ReplyTimeoutException > > at > > > org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288) > > at org.globus.cog.karajan.workflow.service.commands.Command > > $Timeout.run(Command.java:293) > > at java.util.TimerThread.mainLoop(Timer.java:512) > > at java.util.TimerThread.run(Timer.java:462) > > > > > > The log is attached. I will try a longish run with more heap > memory. 
> > > > > > > > > > Regards, > > Ketan > > > > > > > > On Tue, Aug 23, 2011 at 3:21 PM, Mihael Hategan > > > wrote: > > On Tue, 2011-08-23 at 14:46 -0500, Michael Wilde > wrote: > > > Well, I raised that issue, but Ketan claimed that > the > > failure to start more jobs occurs without that > message as > > well. > > > > > > Fine. I'll need the log from that then. > > > > > > > > > > Do you believe that the Out of Mem error is the > root cause? > > > > > > Ketan, can you point to logs without the OOM > error? > > > > > > Can you re-run the catsn with more memory? > > > > > > And more importantly: can you run a *very small* > catsnsleep > > test where you carefully craft the sleep times and > settings to > > cause one (very short duration) coaster block to > time out and > > verify that a new block is submitted and in new job > and that > > the script runs to completion? > > > > > > I suggested in the ticket that David do this; can > you both > > discuss and see who is better positioned to do this > sooner, so > > we can decide if we have a blocker here, or just > something > > that needs better configuration and perhaps a note > in the user > > guide telling users what to watch out for in this > regard? (I > > think for example we do not tell how and when to > increase > > memory in the user guide, at the moment). Nor are > we clear > > enough on the issues around maxtime, maxwalltime, > and the > > sizing of coaster blocks. > > > > > > Thanks, > > > > > > - Mike > > > > > > > > > ----- Original Message ----- > > > > From: "Mihael Hategan" > > > > To: "Ketan Maheshwari" > > > > > Cc: "Michael Wilde" , "Swift > Devel" > > > > > > Sent: Tuesday, August 23, 2011 2:36:00 PM > > > > Subject: Re: [Swift-devel] Blocker issue for > 0.93: DSSAT > > script does not complete, 2nd coaster blocks dont > start? 
> > > > mike at blabla:~/tmp$ grep "heap" > > catsn-20110823-1116-94roxc18.log > > > > 2011-08-23 11:16:49,904-0500 DEBUG Loader Max > heap: > > 257294336 > > > > 2011-08-23 11:38:50,957-0500 DEBUG > VDL2ExecutionContext > > > > java.lang.OutOfMemoryError: Java heap space > > > > java.lang.OutOfMemoryError: Java heap space > > > > Caused by: java.lang.OutOfMemoryError: Java heap > space > > > > Caused by: java.lang.OutOfMemoryError: Java heap > space > > > > java.lang.OutOfMemoryError: Java heap space > > > > Caused by: java.lang.OutOfMemoryError: Java heap > space > > > > Caused by: java.lang.OutOfMemoryError: Java heap > space > > > > > > > > > > > > On Tue, 2011-08-23 at 11:47 -0500, Ketan > Maheshwari wrote: > > > > > Hello Mike, > > > > > > > > > > > > > > > I tried another run with 30K tasks on PADS. > This run > > stopped after > > > > > completing 16K+ tasks. > > > > > > > > > > > > > > > The log file is: > > > > > > > > http://www.ci.uchicago.edu/~ketan/catsn-20110823-1116-94roxc18.log > > > > > > > > > > > > > > > The exception messages I get are attached with > the mail. > > > > > > > > > > > > > > > Looking at the messages, it seems the coasters > are > > unable to restart > > > > > the submit block once the walltime is expired > for a run. > > > > > > > > > > > > > > > Regards, > > > > > Ketan > > > > > > > > > > On Tue, Aug 23, 2011 at 11:06 AM, Ketan > Maheshwari > > > > > wrote: > > > > > Mike, > > > > > > > > > > > > > > > This looks like the coasters blocks > not > > restarting issue. I > > > > > can try to run the same run again and > see if > > this persists. > > > > > > > > > > > > > > > On Tue, Aug 23, 2011 at 11:04 AM, > Michael Wilde > > > > > wrote: > > > > > Ketan, > > > > > > > > > > > > > > > Should I ask David to try to > replicate > > this problem? > > > > > > > > > > > > > > > Did you figure out why your > jobs are not > > starting on > > > > > PADS? 
> > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > > ______________________________________________________ > > > > > From: "Michael Wilde" > > > > > > > To: "Ketan Maheshwari" > > > > > > , > > "Mihael > > > > > Hategan" > > > > > > > > > > Cc: "swift-devel > Devel" > > > > > > , > > "Papia > > > > > Rizwan" > > > > > > > > > > > Sent: Monday, August > 22, 2011 > > 10:47:56 AM > > > > > > > > > > > > > > > Subject: Re: Blocker > issue for > > 0.93: DSSAT > > > > > script does not > complete, 2nd > > coaster blocks > > > > > dont start? > > > > > > > > > > Can you try this on > PADS using > > small jobs in > > > > > the fast queue? > > > > > > > > > > I have not thought > this all the > > way through, > > > > > but perhaps coasters > will honor > > maxtime and > > > > > maxwalltime on any > coaster > > block, even if > > > > > its > > > > > not running on a batch > > scheduler. In that > > > > > case perhaps you can > replicate > > the problem > > > > > on > > > > > the MCS pool or better > yet on > > localhost. > > > > > > > > > > > > > > > In these runs, what > was the > > value of > > > > > the execution.retries > and > > lazy.errors flags? > > > > > Mihael, do those > properties > > need to be set > > > > > to > > > > > >0 and true, > respectively, in > > order for > > > > > coasters to start new > blocks > > correctly, > > > > > assuming that in some > cases a > > job will run > > > > > longer than its > maxwalltime? 
> > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > ______________________________________________ > > > > > From: "Ketan > Maheshwari" > > > > > > > > > > > > To: "Michael > Wilde" > > > > > > > > > > > Cc: "Papia > Rizwan" > > > > > > > , > > > > > "swift-devel > > > > > Devel" > > > > > > > Sent: Monday, > August 22, > > 2011 > > > > > 10:32:31 > > > > > AM > > > > > Subject: Re: > Blocker > > issue for 0.93: > > > > > DSSAT script > does not > > complete, 2nd > > > > > coaster blocks > dont > > start? > > > > > > > > > > Mike, > > > > > > > > > > > > > > > If I recall > correctly, > > Papia has > > > > > always been > running her > > DSSAT app > > > > > with > > > > > 0.92. She has > not yet > > tried with > > > > > 0.93. > > > > > I too tried > with 0.92 > > with her sites > > > > > file settings. > > > > > > > > > > > > > > > I once tried > it with > > 0.93 on pads > > > > > but > > > > > could never > get in the > > running from > > > > > the queue. > > > > > > > > > > > > > > > I will give > another try > > today as it > > > > > might be that > PADS was > > too busy last > > > > > week. As I > recall Jon > > was also > > > > > struggling to > get > > access. > > > > > > > > > > > > > > > Regards, > > > > > Ketan > > > > > > > > > > On Mon, Aug > 22, 2011 at > > 10:24 AM, > > > > > Michael Wilde > > > > > > > wrote: > > > > > Papia, > Ketan, > > > > > > > > > > In > reviewing > > 0.93 work > > > > > > remaining with > > David, I > > > > > > remembered this > > issue. > > > > > > > > > > You > both > > reported that the > > > > > DSSAT > > application script > > > > > doesnt > finish on > > PADS - it > > > > > seems > not to > > start the > > > > > second > > > > > round > of coaster > > blocks that > > > > > it > needs to > > complete (as I > > > > > > recall, but this > > may not be > > > > > > correct). 
This > > needs to be > > > > > > researched and > > filed as a > > > > > bug > > > > > (or, > an error in > > the sites > > > > > spec > needs to be > > identified > > > > > and > made clear > > in the site > > > > > guide > if it > > turns out to be > > > > > the > problem). > > > > > > > > > > > Possible there > > is an issue > > > > > with > jobs > > failing at the end > > > > > of the > coaster > > blocks, and > > > > > you > > > > > dont > have the > > necessary > > > > > retry > > > > > values > set for > > the PADS > > > > > > site??? > > > > > > > > > > We > need an > > example run with > > > > > logs > and full > > details. Can > > > > > you > > > > > try to > re-create > > this with a > > > > > much > smaller > > initial > > > > > > allocation, and > > see if > > > > > > coasters is > > transitioning > > > > > from > > > > > its > initial > > blocks to the > > > > > next > > > > > > blocks? > > > > > > > > > > Can > you give > > this high prio > > > > > for > today? > > > > > > > > > > > Thanks, > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Ketan > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Michael Wilde > > > > > Computation Institute, > > University of Chicago > > > > > Mathematics and > Computer Science > > Division > > > > > Argonne National > Laboratory > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Michael Wilde > > > > > Computation Institute, > University of > > Chicago > > > > > Mathematics and Computer > Science > > Division > > > > > Argonne National Laboratory > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Ketan > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Ketan > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > Swift-devel mailing list > > > > > Swift-devel at ci.uchicago.edu > > > > > > > > 
https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > -- > > Ketan > > > > > > > > > > > > > > -- > Ketan > > > From davidk at ci.uchicago.edu Wed Aug 24 01:09:23 2011 From: davidk at ci.uchicago.edu (David Kelly) Date: Wed, 24 Aug 2011 01:09:23 -0500 (CDT) Subject: [Swift-devel] Log processing and language reference documentation In-Reply-To: <1310252657.76167.1314163798498.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <2055964052.76197.1314166162999.JavaMail.root@zimbra-mb2.anl.gov> Hello, I was just wondering if the log processing documentation on the website is up to date? Is it worth converting to asciidoc and putting on the new website? I have a similar question for the language reference. Do people still make changes to it, and is it worth converting? Also, is there any place to get an up to date list of swift reserved words? It took me a few minutes to figure out why my swift code wasn't compiling. I had named a variable 'in' and it was conflicting with the 'in' used by foreach. Thanks, David From wilde at mcs.anl.gov Wed Aug 24 12:46:11 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 24 Aug 2011 12:46:11 -0500 (CDT) Subject: [Swift-devel] Log processing and language reference documentation In-Reply-To: <2055964052.76197.1314166162999.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <1960826007.245715.1314207971273.JavaMail.root@zimbra.anl.gov> Hi David, ----- Original Message ----- > From: "David Kelly" > To: swift-devel at ci.uchicago.edu > Sent: Wednesday, August 24, 2011 1:09:23 AM > Subject: [Swift-devel] Log processing and language reference documentation > Hello, > > I was just wondering if the log processing documentation on the > website is up to date? 
Justin and any users of the log tools, can you comment?

> Is it worth converting to asciidoc and putting
> on the new website?

I feel yes - these tools are pretty important and we want to keep pushing to make them more usable and more used.

> I have a similar question for the language reference. Do people still
> make changes to it, and is it worth converting?

It hasn't changed in years and is woefully out of date. I feel it's worth converting, but not now: probably as part of a rewrite at some much later date. Till then let's focus on making the user guide the definitive definition of the language. We should incorporate any relevant info from the recent language paper into the user guide.

> Also, is there any place to get an up to date list of swift reserved
> words? It took me a few minutes to figure out why my swift code wasn't
> compiling. I had named a variable 'in' and it was conflicting with the
> 'in' used by foreach.

That would be good to state in the user guide. Mihael, Justin, or Ben, can you comment on this based on your knowledge of the parser? Is this encoded into an ANTLR spec that can be translated into a user guide table?

- Mike

>
> Thanks,
> David

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From jonmon at mcs.anl.gov Wed Aug 24 14:25:23 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Wed, 24 Aug 2011 14:25:23 -0500
Subject: [Swift-devel] home file system
Message-ID: <8071D178-9A5B-47C4-9856-2A35113244B5@mcs.anl.gov>

Is anyone experiencing problems on the ci machines right now? I cannot ssh to any of the machines (communicado, bridled, pads, beagle). Bridled let me in (but asked for my password) then said that the directory '/home/jonmon' did not exist.
Is anyone experiencing similar issues? Was there some scheduled maintenance today that I may have overlooked?

From ketancmaheshwari at gmail.com Wed Aug 24 14:28:09 2011
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Wed, 24 Aug 2011 14:28:09 -0500
Subject: [Swift-devel] home file system
In-Reply-To: <8071D178-9A5B-47C4-9856-2A35113244B5@mcs.anl.gov>
References: <8071D178-9A5B-47C4-9856-2A35113244B5@mcs.anl.gov>
Message-ID:

Yes, I just lost connection to Communicado and PADS.

On Wed, Aug 24, 2011 at 2:25 PM, Jonathan Monette wrote:
> Is anyone experiencing problems on the ci machines right now? I cannot ssh
> to any of the machines(communicado, bridled, pads, beagle). Bridled let me
> in(but asked for my password) then said that the directory '/home/jonmon'
> did not exist. Is anyone experiencing similar issues? Was there some
> scheduled maintenance today that I may have overlooked?

--
Ketan

From wozniak at mcs.anl.gov Wed Aug 24 15:02:47 2011
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Wed, 24 Aug 2011 15:02:47 -0500 (CDT)
Subject: [Swift-devel] Log processing and language reference documentation
In-Reply-To: <1960826007.245715.1314207971273.JavaMail.root@zimbra.anl.gov>
References: <1960826007.245715.1314207971273.JavaMail.root@zimbra.anl.gov>
Message-ID:

>> I was just wondering if the log processing documentation on the
>> website is up to date?
>
> Justin and any users of the log tools, can you comment?

All of the log-processing scripts that I know how to use are documented in:

swift/libexec/log-processing/README.txt

This is an asciidoc-formatted file.

>> Also, is there any place to get an up to date list of swift reserved
>> words?
It took me a few minutes to figure out why my swift code wasn't >> compiling. I had named a variable 'in' and it was conflicting with the >> 'in' used by foreach. > > That would be good to state in the user guide. Mihael, Justin, or Ben, > can you comment on this based on your knowledge of the parser? Is this > encoded into an ANTLR spec that can be translated into a user guide > table? I just took another look at swiftscript.g and I don't think there is an easy way to automate this. Justin -- Justin M Wozniak From davidk at ci.uchicago.edu Wed Aug 24 15:25:27 2011 From: davidk at ci.uchicago.edu (David Kelly) Date: Wed, 24 Aug 2011 15:25:27 -0500 (CDT) Subject: [Swift-devel] Log processing and language reference documentation In-Reply-To: Message-ID: <1286191860.77201.1314217527396.JavaMail.root@zimbra-mb2.anl.gov> Thanks Justin. I will remove references to the older guide and use the one you mentioned. I'll copy the document to swift/docs/log-processing/log-processing.txt so it can become a part of the nightly build process. David ----- Original Message ----- > From: "Justin M Wozniak" > To: "Michael Wilde" > Cc: "David Kelly" , swift-devel at ci.uchicago.edu > Sent: Wednesday, August 24, 2011 3:02:47 PM > Subject: Re: [Swift-devel] Log processing and language reference documentation > >> I was just wondering if the log processing documentation on the > >> website is up to date? > > > > Justin and any users of the log tools, can you comment? > > All of the log-processing scripts that I know how to use are > documented > in: > > swift/libexec/log-processing/README.txt > > This is an asciidoc-formatted file. > > >> Also, is there any place to get an up to date list of swift > >> reserved > >> words? It took me a few minutes to figure out why my swift code > >> wasn't > >> compiling. I had named a variable 'in' and it was conflicting with > >> the > >> 'in' used by foreach. > > > > That would be good to state in the user guide. 
Mihael, Justin, or > > Ben, > > can you comment on this based on your knowledge of the parser? Is > > this > > encoded into an ANTLR spec that can be translated into a user guide > > table? > > I just took another look at swiftscript.g and I don't think there is > an > easy way to automate this. > > Justin > > -- > Justin M Wozniak From wilde at mcs.anl.gov Wed Aug 24 15:27:07 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 24 Aug 2011 15:27:07 -0500 (CDT) Subject: [Swift-devel] Log processing and language reference documentation In-Reply-To: <1286191860.77201.1314217527396.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <807189075.246525.1314217627424.JavaMail.root@zimbra.anl.gov> I vote for including the log doc in the user guide. - Mike ----- Original Message ----- > From: "David Kelly" > To: "Justin M Wozniak" > Cc: swift-devel at ci.uchicago.edu, "Michael Wilde" > Sent: Wednesday, August 24, 2011 3:25:27 PM > Subject: Re: [Swift-devel] Log processing and language reference documentation > Thanks Justin. I will remove references to the older guide and use the > one you mentioned. I'll copy the document to > swift/docs/log-processing/log-processing.txt so it can become a part > of the nightly build process. > > David > > ----- Original Message ----- > > From: "Justin M Wozniak" > > To: "Michael Wilde" > > Cc: "David Kelly" , > > swift-devel at ci.uchicago.edu > > Sent: Wednesday, August 24, 2011 3:02:47 PM > > Subject: Re: [Swift-devel] Log processing and language reference > > documentation > > >> I was just wondering if the log processing documentation on the > > >> website is up to date? > > > > > > Justin and any users of the log tools, can you comment? > > > > All of the log-processing scripts that I know how to use are > > documented > > in: > > > > swift/libexec/log-processing/README.txt > > > > This is an asciidoc-formatted file. > > > > >> Also, is there any place to get an up to date list of swift > > >> reserved > > >> words? 
It took me a few minutes to figure out why my swift code > > >> wasn't > > >> compiling. I had named a variable 'in' and it was conflicting > > >> with > > >> the > > >> 'in' used by foreach. > > > > > > That would be good to state in the user guide. Mihael, Justin, or > > > Ben, > > > can you comment on this based on your knowledge of the parser? Is > > > this > > > encoded into an ANTLR spec that can be translated into a user > > > guide > > > table? > > > > I just took another look at swiftscript.g and I don't think there is > > an > > easy way to automate this. > > > > Justin > > > > -- > > Justin M Wozniak -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From davidk at ci.uchicago.edu Wed Aug 24 15:56:52 2011 From: davidk at ci.uchicago.edu (David Kelly) Date: Wed, 24 Aug 2011 15:56:52 -0500 (CDT) Subject: [Swift-devel] Log processing and language reference documentation In-Reply-To: <807189075.246525.1314217627424.JavaMail.root@zimbra.anl.gov> Message-ID: <1133196265.77259.1314219412921.JavaMail.root@zimbra-mb2.anl.gov> Added it as a chapter in the userguide - docs/userguide/log-processing ----- Original Message ----- > From: "Michael Wilde" > To: "David Kelly" > Cc: swift-devel at ci.uchicago.edu, "Justin M Wozniak" > Sent: Wednesday, August 24, 2011 3:27:07 PM > Subject: Re: [Swift-devel] Log processing and language reference documentation > I vote for including the log doc in the user guide. > > - Mike > > ----- Original Message ----- > > From: "David Kelly" > > To: "Justin M Wozniak" > > Cc: swift-devel at ci.uchicago.edu, "Michael Wilde" > > Sent: Wednesday, August 24, 2011 3:25:27 PM > > Subject: Re: [Swift-devel] Log processing and language reference > > documentation > > Thanks Justin. I will remove references to the older guide and use > > the > > one you mentioned. 
I'll copy the document to > > swift/docs/log-processing/log-processing.txt so it can become a part > > of the nightly build process. > > > > David > > > > ----- Original Message ----- > > > From: "Justin M Wozniak" > > > To: "Michael Wilde" > > > Cc: "David Kelly" , > > > swift-devel at ci.uchicago.edu > > > Sent: Wednesday, August 24, 2011 3:02:47 PM > > > Subject: Re: [Swift-devel] Log processing and language reference > > > documentation > > > >> I was just wondering if the log processing documentation on the > > > >> website is up to date? > > > > > > > > Justin and any users of the log tools, can you comment? > > > > > > All of the log-processing scripts that I know how to use are > > > documented > > > in: > > > > > > swift/libexec/log-processing/README.txt > > > > > > This is an asciidoc-formatted file. > > > > > > >> Also, is there any place to get an up to date list of swift > > > >> reserved > > > >> words? It took me a few minutes to figure out why my swift code > > > >> wasn't > > > >> compiling. I had named a variable 'in' and it was conflicting > > > >> with > > > >> the > > > >> 'in' used by foreach. > > > > > > > > That would be good to state in the user guide. Mihael, Justin, > > > > or > > > > Ben, > > > > can you comment on this based on your knowledge of the parser? > > > > Is > > > > this > > > > encoded into an ANTLR spec that can be translated into a user > > > > guide > > > > table? > > > > > > I just took another look at swiftscript.g and I don't think there > > > is > > > an > > > easy way to automate this. 
> > >
> > > Justin
> > >
> > > --
> > > Justin M Wozniak
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory

From wilde at mcs.anl.gov Wed Aug 24 17:51:18 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 24 Aug 2011 17:51:18 -0500 (CDT)
Subject: [Swift-devel] Notes from 0.93 meeting
In-Reply-To: <977109412.246682.1314220243168.JavaMail.root@zimbra.anl.gov>
Message-ID: <1104967546.247058.1314226278053.JavaMail.root@zimbra.anl.gov>

David: add multinode tests for PBS and SGE sites

Need valid DOEGrids cert or volunteer gt2 for ranger tests

Cobalt-workaround Eureka test - Justin to do next week. Intrepid works; Fusion works. David to re-run after release stabilizes.

Mihael to merge mods from 0.93 branch to trunk after 0.93 is completed (see svn merge information in subversion book)

David to circulate site guide for review near end of week or early next.
New asciidoc info to gen doc

David, Justin, Mihael to close/resolve/push remaining 0.93 tickets.

David: remove kickstart info from user guide

Mike: work on or remove case studies info in new web; revise intro text

David: add quickstart path to Download page; ensure that new site guide has a "quick start" for each site config.
From skenny at uchicago.edu Wed Aug 24 18:03:51 2011 From: skenny at uchicago.edu (Sarah Kenny) Date: Wed, 24 Aug 2011 16:03:51 -0700 Subject: [Swift-devel] Notes from 0.93 meeting In-Reply-To: <1104967546.247058.1314226278053.JavaMail.root@zimbra.anl.gov> References: <977109412.246682.1314220243168.JavaMail.root@zimbra.anl.gov> <1104967546.247058.1314226278053.JavaMail.root@zimbra.anl.gov> Message-ID: On Wed, Aug 24, 2011 at 3:51 PM, Michael Wilde wrote: > > David: add multinode tests for PBS and SGE sites > > Need valid DOEGrids cert or volunteer gt2 for ranger tests > if you just need someone with a valid cert to shoot off some tests to ranger and give you output i should be able to do this, lemme know. ~sk -- Sarah Kenny Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III University of California Irvine, Dept. of Neurology ~ 773-818-8300 -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Wed Aug 24 18:21:14 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 24 Aug 2011 18:21:14 -0500 (CDT) Subject: [Swift-devel] Notes from 0.93 meeting In-Reply-To: Message-ID: <1735271533.247090.1314228074106.JavaMail.root@zimbra.anl.gov> Sarah, thanks! More testing is always welcome; maybe David can post what he needs done on Ranger. Also, David: I think you can get a proxy from NCSA myproxy, and run GT2 jobs under that. This would use your TeraGrid credentials rather than your DOEGrids credentials. 
- Mike ----- Original Message ----- From: "Sarah Kenny" To: "Michael Wilde" Cc: "swift-devel Devel" Sent: Wednesday, August 24, 2011 6:03:51 PM Subject: Re: [Swift-devel] Notes from 0.93 meeting On Wed, Aug 24, 2011 at 3:51 PM, Michael Wilde < wilde at mcs.anl.gov > wrote: David: add multinode tests for PBS and SGE sites Need valid DOEGrids cert or volunteer gt2 for ranger tests if you just need someone with a valid cert to shoot off some tests to ranger and give you output i should be able to do this, lemme know. ~sk -- Sarah Kenny Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III University of California Irvine, Dept. of Neurology ~ 773-818-8300 -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonmon at mcs.anl.gov Wed Aug 24 19:56:51 2011 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Wed, 24 Aug 2011 19:56:51 -0500 Subject: [Swift-devel] Notes from 0.93 meeting In-Reply-To: <1735271533.247090.1314228074106.JavaMail.root@zimbra.anl.gov> References: <1735271533.247090.1314228074106.JavaMail.root@zimbra.anl.gov> Message-ID: <33C0384B-4DB1-443C-8ABA-F52BB883ECAC@mcs.anl.gov> I was going to suggest the NCSA route but I wasn't sure if David had TeraGrid credentials. I think the server is myproxy.teragrid.org so myproxy-logon -s myproxy.teragrid.org will retrieve a proxy for 11 hours. If you need help working with myproxy let me know. On Aug 24, 2011, at 6:21 PM, Michael Wilde wrote: > Sarah, thanks! More testing is always welcome; maybe David can post what he needs done on Ranger. > > Also, David: I think you can get a proxy from NCSA myproxy, and run GT2 jobs under that. This would use your TeraGrid credentials rather than your DOEGrids credentials. 
> > - Mike > > From: "Sarah Kenny" > To: "Michael Wilde" > Cc: "swift-devel Devel" > Sent: Wednesday, August 24, 2011 6:03:51 PM > Subject: Re: [Swift-devel] Notes from 0.93 meeting > > > > On Wed, Aug 24, 2011 at 3:51 PM, Michael Wilde wrote: > > David: add multinode tests for PBS and SGE sites > > Need valid DOEGrids cert or volunteer gt2 for ranger tests > > if you just need someone with a valid cert to shoot off some tests to ranger and give you output i should be able to do this, lemme know. > > ~sk > > > -- > Sarah Kenny > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > University of California Irvine, Dept. of Neurology ~ 773-818-8300 > > > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonmon at mcs.anl.gov Thu Aug 25 12:44:51 2011 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Thu, 25 Aug 2011 12:44:51 -0500 Subject: [Swift-devel] Swift jobs failing Message-ID: I started a run of my SwiftMontage work and all the jobs keep failing. No progress is being made and the swift stdout will have the line "failed but can retry: 'some number'". The log is located at www.ci.uchicago.edu/~jonmon/logs/montage-20110825-1232-y104fa88.log. This is with the most recent version of 0.93. 
From davidk at ci.uchicago.edu Thu Aug 25 13:29:14 2011
From: davidk at ci.uchicago.edu (David Kelly)
Date: Thu, 25 Aug 2011 13:29:14 -0500 (CDT)
Subject: [Swift-devel] Notes from 0.93 meeting
In-Reply-To: <33C0384B-4DB1-443C-8ABA-F52BB883ECAC@mcs.anl.gov>
Message-ID: <928140200.78521.1314296954524.JavaMail.root@zimbra-mb2.anl.gov>

Thanks Jon,

Here is what happens when I try this from communicado:

[davidk at communicado ~]$ myproxy-logon -l dkelly -s myproxy.teragrid.org
Enter MyProxy pass phrase:
A credential has been received for user dkelly in /tmp/x509up_u1878.

[davidk at communicado ranger]$ swift -sites.file sites.xml -tc.file tc.data 001-catsn-ranger.swift
Swift svn swift-r4987 (swift modified locally) cog-r3229

RunID: 20110825-1326-o3e38fe0
Progress: time: Thu, 25 Aug 2011 13:26:43 -0500
Progress: time: Thu, 25 Aug 2011 13:26:44 -0500 Selecting site:8 Initializing site shared directory:2
Execution failed:
Authentication failed [Caused by: Failure unspecified at GSS-API level [Caused by: Unknown CA]]

Any ideas?

Thanks,
David

From jonmon at mcs.anl.gov Thu Aug 25 13:36:45 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Thu, 25 Aug 2011 13:36:45 -0500
Subject: [Swift-devel] Notes from 0.93 meeting
In-Reply-To: <928140200.78521.1314296954524.JavaMail.root@zimbra-mb2.anl.gov>
References: <928140200.78521.1314296954524.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID: <2E7C884C-EAF8-4BFF-8D1C-8665F654FA7A@mcs.anl.gov>

That Unknown CA error normally means that the CA is not trusted. Since Swift is the one throwing the error, this would indicate that Swift does not know about the signing files that it needs to trust the CA. Do you set X509_CERT_DIR or X509_CA_DIR to anything?
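
One way to answer that question from the shell is sketched below. This is only an illustration and not something posted in the thread: check_ca_dir is a hypothetical helper name, and the two variable names are simply the ones asked about above. It reports whether each variable is set and whether it points at an existing directory; it does not check that the directory actually holds the CA's certificate and signing-policy files.

```shell
#!/bin/sh
# Hypothetical helper (not part of Swift or Globus): report whether the
# environment variable named in $1 is set and points at a directory.
check_ca_dir() {
    name="$1"
    # Indirect lookup of the variable whose name is stored in $name.
    eval "dir=\${$name:-}"
    if [ -z "$dir" ]; then
        echo "$name: not set"
    elif [ ! -d "$dir" ]; then
        echo "$name: $dir (not a directory)"
    else
        echo "$name: $dir (directory exists)"
    fi
}

# The two variable names mentioned in the message above.
check_ca_dir X509_CERT_DIR
check_ca_dir X509_CA_DIR
```

If a variable comes back "not set", exporting it in the same shell before re-running swift is the obvious next thing to try.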
On Aug 25, 2011, at 1:29 PM, David Kelly wrote: > Thanks Jon, > > Here is what happens when I try this from communicado: > > [davidk at communicado ~]$ myproxy-logon -l dkelly -s myproxy.teragrid.org > Enter MyProxy pass phrase: > A credential has been received for user dkelly in /tmp/x509up_u1878. > > [davidk at communicado ranger]$ swift -sites.file sites.xml -tc.file tc.data 001-catsn-ranger.swift > Swift svn swift-r4987 (swift modified locally) cog-r3229 > > RunID: 20110825-1326-o3e38fe0 > Progress: time: Thu, 25 Aug 2011 13:26:43 -0500 > Progress: time: Thu, 25 Aug 2011 13:26:44 -0500 Selecting site:8 Initializing site shared directory:2 > Execution failed: > Authentication failed [Caused by: Failure unspecified at GSS-API level [Caused by: Unknown CA]] > > Any ideas? > > Thanks, > David From ketancmaheshwari at gmail.com Thu Aug 25 13:32:50 2011 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Thu, 25 Aug 2011 13:32:50 -0500 Subject: [Swift-devel] Notes from 0.93 meeting In-Reply-To: <928140200.78521.1314296954524.JavaMail.root@zimbra-mb2.anl.gov> References: <33C0384B-4DB1-443C-8ABA-F52BB883ECAC@mcs.anl.gov> <928140200.78521.1314296954524.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: Hi, Are your CADIR and CACERT env vars set up? [communicado:swiftgrid]$ echo $X509_CADIR /opt/osg-1.2.16/globus/TRUSTED_CA [communicado:swiftgrid]$ echo $X509_CERT_DIR /opt/osg-1.2.16/globus/TRUSTED_CA On Thu, Aug 25, 2011 at 1:29 PM, David Kelly wrote: > Thanks Jon, > > Here is what happens when I try this from communicado: > > [davidk at communicado ~]$ myproxy-logon -l dkelly -s myproxy.teragrid.org > Enter MyProxy pass phrase: > A credential has been received for user dkelly in /tmp/x509up_u1878. 
>
> [davidk at communicado ranger]$ swift -sites.file sites.xml -tc.file tc.data 001-catsn-ranger.swift
> Swift svn swift-r4987 (swift modified locally) cog-r3229
>
> RunID: 20110825-1326-o3e38fe0
> Progress: time: Thu, 25 Aug 2011 13:26:43 -0500
> Progress: time: Thu, 25 Aug 2011 13:26:44 -0500 Selecting site:8 Initializing site shared directory:2
> Execution failed:
> Authentication failed [Caused by: Failure unspecified at GSS-API level [Caused by: Unknown CA]]
>
> Any ideas?
>
> Thanks,
> David

--
Ketan

From jonmon at mcs.anl.gov Thu Aug 25 13:42:45 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Thu, 25 Aug 2011 13:42:45 -0500
Subject: [Swift-devel] Swift jobs failing
In-Reply-To:
References:
Message-ID: <7D2BE7AB-BF33-446E-A0F4-B708D325CCFA@mcs.anl.gov>

This also seems to be happening in trunk. Is anyone seeing this issue with their code?

On Aug 25, 2011, at 12:44 PM, Jonathan Monette wrote:
> I started a run of my SwiftMontage work and all the jobs keep failing. No progress is being made and the swift stdout will have the line "failed but can retry: 'some number'". The log is located at www.ci.uchicago.edu/~jonmon/logs/montage-20110825-1232-y104fa88.log. This is with the most recent version of 0.93.
> _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From davidk at ci.uchicago.edu Thu Aug 25 13:56:02 2011 From: davidk at ci.uchicago.edu (David Kelly) Date: Thu, 25 Aug 2011 13:56:02 -0500 (CDT) Subject: [Swift-devel] Notes from 0.93 meeting In-Reply-To: Message-ID: <839210623.78583.1314298562032.JavaMail.root@zimbra-mb2.anl.gov> Those environment variables were not set up. I have them defined now, but I'm still getting the same error. [davidk at communicado ranger]$ env |grep 509 X509_CERT_DIR=/opt/osg-1.2.16/globus/TRUSTED_CA X509_CADIR=/opt/osg-1.2.16/globus/TRUSTED_CA [davidk at communicado ranger]$ swift -sites.file sites.xml -tc.file tc.data 001-catsn-ranger.swift Swift svn swift-r4987 (swift modified locally) cog-r3229 RunID: 20110825-1352-f1v940b4 Progress: time: Thu, 25 Aug 2011 13:52:59 -0500 Progress: time: Thu, 25 Aug 2011 13:53:00 -0500 Selecting site:7 Initializing site shared directory:3 Execution failed: Authentication failed [Caused by: Failure unspecified at GSS-API level [Caused by: Unknown CA]] ----- Original Message ----- > From: "Ketan Maheshwari" > To: "David Kelly" > Cc: "Jonathan Monette" , "swift-devel Devel" > Sent: Thursday, August 25, 2011 1:32:50 PM > Subject: Re: [Swift-devel] Notes from 0.93 meeting > Hi, > > > Are your CADIR and CACERT env vars set up? > > > > [communicado:swiftgrid]$ echo $X509_CADIR > /opt/osg-1.2.16/globus/TRUSTED_CA > > > > [communicado:swiftgrid]$ echo $X509_CERT_DIR > /opt/osg-1.2.16/globus/TRUSTED_CA > > > > On Thu, Aug 25, 2011 at 1:29 PM, David Kelly < davidk at ci.uchicago.edu > > wrote: > > > Thanks Jon, > > Here is what happens when I try this from communicado: > > [davidk at communicado ~]$ myproxy-logon -l dkelly -s > myproxy.teragrid.org > Enter MyProxy pass phrase: > A credential has been received for user dkelly in /tmp/x509up_u1878. 
> > [davidk at communicado ranger]$ swift -sites.file sites.xml -tc.file > tc.data 001-catsn-ranger.swift > Swift svn swift-r4987 (swift modified locally) cog-r3229 > > RunID: 20110825-1326-o3e38fe0 > Progress: time: Thu, 25 Aug 2011 13:26:43 -0500 > Progress: time: Thu, 25 Aug 2011 13:26:44 -0500 Selecting site:8 > Initializing site shared directory:2 > Execution failed: > Authentication failed [Caused by: Failure unspecified at GSS-API level > [Caused by: Unknown CA]] > > Any ideas? > > Thanks, > David > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > -- > Ketan From jonmon at mcs.anl.gov Thu Aug 25 14:03:37 2011 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Thu, 25 Aug 2011 14:03:37 -0500 Subject: [Swift-devel] Notes from 0.93 meeting In-Reply-To: <839210623.78583.1314298562032.JavaMail.root@zimbra-mb2.anl.gov> References: <839210623.78583.1314298562032.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <2CA96926-491B-4688-87A4-7CE8CBA30AED@mcs.anl.gov> With that proxy can you gsissh to ranger? On Aug 25, 2011, at 1:56 PM, David Kelly wrote: > > Those environment variables were not set up. I have them defined now, but I'm still getting the same error. 
> > [davidk at communicado ranger]$ env |grep 509 > X509_CERT_DIR=/opt/osg-1.2.16/globus/TRUSTED_CA > X509_CADIR=/opt/osg-1.2.16/globus/TRUSTED_CA > > [davidk at communicado ranger]$ swift -sites.file sites.xml -tc.file tc.data 001-catsn-ranger.swift > Swift svn swift-r4987 (swift modified locally) cog-r3229 > > RunID: 20110825-1352-f1v940b4 > Progress: time: Thu, 25 Aug 2011 13:52:59 -0500 > Progress: time: Thu, 25 Aug 2011 13:53:00 -0500 Selecting site:7 Initializing site shared directory:3 > Execution failed: > Authentication failed [Caused by: Failure unspecified at GSS-API level [Caused by: Unknown CA]] > > > ----- Original Message ----- >> From: "Ketan Maheshwari" >> To: "David Kelly" >> Cc: "Jonathan Monette" , "swift-devel Devel" >> Sent: Thursday, August 25, 2011 1:32:50 PM >> Subject: Re: [Swift-devel] Notes from 0.93 meeting >> Hi, >> >> >> Are your CADIR and CACERT env vars set up? >> >> >> >> [communicado:swiftgrid]$ echo $X509_CADIR >> /opt/osg-1.2.16/globus/TRUSTED_CA >> >> >> >> [communicado:swiftgrid]$ echo $X509_CERT_DIR >> /opt/osg-1.2.16/globus/TRUSTED_CA >> >> >> >> On Thu, Aug 25, 2011 at 1:29 PM, David Kelly < davidk at ci.uchicago.edu >>> wrote: >> >> >> Thanks Jon, >> >> Here is what happens when I try this from communicado: >> >> [davidk at communicado ~]$ myproxy-logon -l dkelly -s >> myproxy.teragrid.org >> Enter MyProxy pass phrase: >> A credential has been received for user dkelly in /tmp/x509up_u1878. >> >> [davidk at communicado ranger]$ swift -sites.file sites.xml -tc.file >> tc.data 001-catsn-ranger.swift >> Swift svn swift-r4987 (swift modified locally) cog-r3229 >> >> RunID: 20110825-1326-o3e38fe0 >> Progress: time: Thu, 25 Aug 2011 13:26:43 -0500 >> Progress: time: Thu, 25 Aug 2011 13:26:44 -0500 Selecting site:8 >> Initializing site shared directory:2 >> Execution failed: >> Authentication failed [Caused by: Failure unspecified at GSS-API level >> [Caused by: Unknown CA]] >> >> Any ideas? 
>> >> Thanks, >> David >> >> >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> >> >> >> >> -- >> Ketan From ketancmaheshwari at gmail.com Thu Aug 25 14:05:36 2011 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Thu, 25 Aug 2011 14:05:36 -0500 Subject: [Swift-devel] Swift jobs failing In-Reply-To: <7D2BE7AB-BF33-446E-A0F4-B708D325CCFA@mcs.anl.gov> References: <7D2BE7AB-BF33-446E-A0F4-B708D325CCFA@mcs.anl.gov> Message-ID: I am using 0.93 from Communicado to OSG using persistent-coasters. I do not see such messages. On Thu, Aug 25, 2011 at 1:42 PM, Jonathan Monette wrote: > This also seems to be happening in trunk. Is anyone seeing this issue with > their code? > On Aug 25, 2011, at 12:44 PM, Jonathan Monette wrote: > > > I started a run of my SwiftMontage work and all the jobs keep failing. > No progress is being made and the swift stdout will have the line "failed > but can retry: 'some number'". The log is located at > www.ci.uchicago.edu/~jonmon/logs/montage-20110825-1232-y104fa88.log. This > is with the most recent version of 0.93. > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > -- Ketan From jonmon at mcs.anl.gov Thu Aug 25 14:07:37 2011 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Thu, 25 Aug 2011 14:07:37 -0500 Subject: [Swift-devel] Swift jobs failing In-Reply-To: References: <7D2BE7AB-BF33-446E-A0F4-B708D325CCFA@mcs.anl.gov> Message-ID: Ok. I will check my configuration.
I don't see any very helpful messages in the log file, but I will take a closer look. More information to follow. On Aug 25, 2011, at 2:05 PM, Ketan Maheshwari wrote: > I am using 0.93 from Communicado to OSG using persistent-coasters. I do not see such messages. > > > On Thu, Aug 25, 2011 at 1:42 PM, Jonathan Monette wrote: > This also seems to be happening in trunk. Is anyone seeing this issue with their code? > On Aug 25, 2011, at 12:44 PM, Jonathan Monette wrote: > > > I started a run of my SwiftMontage work and all the jobs keep failing. No progress is being made and the swift stdout will have the line "failed but can retry: 'some number'". The log is located at www.ci.uchicago.edu/~jonmon/logs/montage-20110825-1232-y104fa88.log. This is with the most recent version of 0.93. > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > -- > Ketan > > From ketancmaheshwari at gmail.com Thu Aug 25 14:11:23 2011 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Thu, 25 Aug 2011 14:11:23 -0500 Subject: [Swift-devel] Swift jobs failing In-Reply-To: References: <7D2BE7AB-BF33-446E-A0F4-B708D325CCFA@mcs.anl.gov> Message-ID: I see this message near the only application_exception that appears in the log: "Service does not appear to be registered with this manager". It seems the coaster service is not starting properly. Could this be the cause? On Thu, Aug 25, 2011 at 2:07 PM, Jonathan Monette wrote: > Ok. I will check my configuration. I don't see any very helpful messages in > the log file, but I will take a closer look. More information to follow.
> > On Aug 25, 2011, at 2:05 PM, Ketan Maheshwari wrote: > > I am using 0.93 from Communicado to OSG using persistent-coasters. I do not > see such messages. > > > On Thu, Aug 25, 2011 at 1:42 PM, Jonathan Monette wrote: > >> This also seems to be happening in trunk. Is anyone seeing this issue >> with their code? >> On Aug 25, 2011, at 12:44 PM, Jonathan Monette wrote: >> >> > I started a run of my SwiftMontage work and all the jobs keep failing. >> No progress is being made and the swift stdout will have the line "failed >> but can retry: 'some number'". The log is located at >> www.ci.uchicago.edu/~jonmon/logs/montage-20110825-1232-y104fa88.log. >> This is with the most recent version of 0.93. >> > _______________________________________________ >> > Swift-devel mailing list >> > Swift-devel at ci.uchicago.edu >> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> > > > > -- > Ketan > > > > -- Ketan From hategan at mcs.anl.gov Thu Aug 25 14:11:28 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 25 Aug 2011 12:11:28 -0700 Subject: [Swift-devel] Swift jobs failing In-Reply-To: References: <7D2BE7AB-BF33-446E-A0F4-B708D325CCFA@mcs.anl.gov> Message-ID: <1314299488.10772.1.camel@blabla> Can you post the coaster log from the remote machine? On Thu, 2011-08-25 at 14:07 -0500, Jonathan Monette wrote: > Ok. I will check my configuration. I don't see any very helpful messages > in the log file, but I will take a closer look. More information to > follow. > > On Aug 25, 2011, at 2:05 PM, Ketan Maheshwari wrote: > > > I am using 0.93 from Communicado to OSG using persistent-coasters. I > > do not see such messages.
> > > > > > On Thu, Aug 25, 2011 at 1:42 PM, Jonathan Monette > > wrote: > > This also seems to be happening in trunk. Is anyone seeing > > this issue with their code? > > > > On Aug 25, 2011, at 12:44 PM, Jonathan Monette wrote: > > > > > I started a run of my SwiftMontage work and all the jobs > > keep failing. No progress is being made and the swift > > stdout will have the line "failed but can retry: 'some > > number'". The log is located at > > www.ci.uchicago.edu/~jonmon/logs/montage-20110825-1232-y104fa88.log. This is with the most recent version of 0.93. > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > -- > > Ketan > > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From davidk at ci.uchicago.edu Thu Aug 25 14:21:47 2011 From: davidk at ci.uchicago.edu (David Kelly) Date: Thu, 25 Aug 2011 14:21:47 -0500 (CDT) Subject: [Swift-devel] Notes from 0.93 meeting In-Reply-To: <2CA96926-491B-4688-87A4-7CE8CBA30AED@mcs.anl.gov> Message-ID: <1985642119.78654.1314300107217.JavaMail.root@zimbra-mb2.anl.gov> Yes, I can gsissh into davidkel at ranger.tacc.utexas.edu with that proxy. One weird thing to note - my username on communicado is davidk, my username for teragrid is dkelly, and my username on ranger is davidkel. I don't think that would cause the CA error, but thought I should mention it.
David ----- Original Message ----- > From: "Jonathan Monette" > To: "David Kelly" > Cc: "Ketan Maheshwari" , "swift-devel Devel" > Sent: Thursday, August 25, 2011 2:03:37 PM > Subject: Re: [Swift-devel] Notes from 0.93 meeting > With that proxy can you gsissh to ranger? > > On Aug 25, 2011, at 1:56 PM, David Kelly wrote: > > > > > Those environment variables were not set up. I have them defined > > now, but I'm still getting the same error. > > > > [davidk at communicado ranger]$ env |grep 509 > > X509_CERT_DIR=/opt/osg-1.2.16/globus/TRUSTED_CA > > X509_CADIR=/opt/osg-1.2.16/globus/TRUSTED_CA > > > > [davidk at communicado ranger]$ swift -sites.file sites.xml -tc.file > > tc.data 001-catsn-ranger.swift > > Swift svn swift-r4987 (swift modified locally) cog-r3229 > > > > RunID: 20110825-1352-f1v940b4 > > Progress: time: Thu, 25 Aug 2011 13:52:59 -0500 > > Progress: time: Thu, 25 Aug 2011 13:53:00 -0500 Selecting site:7 > > Initializing site shared directory:3 > > Execution failed: > > Authentication failed [Caused by: Failure unspecified at GSS-API > > level [Caused by: Unknown CA]] > > > > > > ----- Original Message ----- > >> From: "Ketan Maheshwari" > >> To: "David Kelly" > >> Cc: "Jonathan Monette" , "swift-devel Devel" > >> > >> Sent: Thursday, August 25, 2011 1:32:50 PM > >> Subject: Re: [Swift-devel] Notes from 0.93 meeting > >> Hi, > >> > >> > >> Are your CADIR and CACERT env vars set up? 
> >> > >> > >> > >> [communicado:swiftgrid]$ echo $X509_CADIR > >> /opt/osg-1.2.16/globus/TRUSTED_CA > >> > >> > >> > >> [communicado:swiftgrid]$ echo $X509_CERT_DIR > >> /opt/osg-1.2.16/globus/TRUSTED_CA > >> > >> > >> > >> On Thu, Aug 25, 2011 at 1:29 PM, David Kelly < > >> davidk at ci.uchicago.edu > >>> wrote: > >> > >> > >> Thanks Jon, > >> > >> Here is what happens when I try this from communicado: > >> > >> [davidk at communicado ~]$ myproxy-logon -l dkelly -s > >> myproxy.teragrid.org > >> Enter MyProxy pass phrase: > >> A credential has been received for user dkelly in > >> /tmp/x509up_u1878. > >> > >> [davidk at communicado ranger]$ swift -sites.file sites.xml -tc.file > >> tc.data 001-catsn-ranger.swift > >> Swift svn swift-r4987 (swift modified locally) cog-r3229 > >> > >> RunID: 20110825-1326-o3e38fe0 > >> Progress: time: Thu, 25 Aug 2011 13:26:43 -0500 > >> Progress: time: Thu, 25 Aug 2011 13:26:44 -0500 Selecting site:8 > >> Initializing site shared directory:2 > >> Execution failed: > >> Authentication failed [Caused by: Failure unspecified at GSS-API > >> level > >> [Caused by: Unknown CA]] > >> > >> Any ideas? > >> > >> Thanks, > >> David > >> > >> > >> > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > >> > >> > >> > >> > >> -- > >> Ketan From jonmon at mcs.anl.gov Thu Aug 25 14:24:10 2011 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Thu, 25 Aug 2011 14:24:10 -0500 Subject: [Swift-devel] Notes from 0.93 meeting In-Reply-To: <1985642119.78654.1314300107217.JavaMail.root@zimbra-mb2.anl.gov> References: <1985642119.78654.1314300107217.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <21FA6397-13BA-4AE8-AEF5-D526E14D29FE@mcs.anl.gov> I am not sure it matters (but it might), but this is the login node for teragrid on ranger: tg-login.ranger.tacc.teragrid.org.
You requested a proxy from myproxy-logon with username=dkelly. If you want to use the proxy on ranger, I think the command should have been 'myproxy-logon -l davidkel -s myproxy.teragrid.org'. On Aug 25, 2011, at 2:21 PM, David Kelly wrote: > > Yes, I can gsissh into davidkel at ranger.tacc.utexas.edu with that proxy. One weird thing to note - my username on communicado is davidk, my username for teragrid is dkelly, and my username on ranger is davidkel. I don't think that would cause the CA error, but thought I should mention it. > > David > > ----- Original Message ----- >> From: "Jonathan Monette" >> To: "David Kelly" >> Cc: "Ketan Maheshwari" , "swift-devel Devel" >> Sent: Thursday, August 25, 2011 2:03:37 PM >> Subject: Re: [Swift-devel] Notes from 0.93 meeting >> With that proxy can you gsissh to ranger? >> >> On Aug 25, 2011, at 1:56 PM, David Kelly wrote: >> >>> >>> Those environment variables were not set up. I have them defined >>> now, but I'm still getting the same error. >>> >>> [davidk at communicado ranger]$ env |grep 509 >>> X509_CERT_DIR=/opt/osg-1.2.16/globus/TRUSTED_CA >>> X509_CADIR=/opt/osg-1.2.16/globus/TRUSTED_CA >>> >>> [davidk at communicado ranger]$ swift -sites.file sites.xml -tc.file >>> tc.data 001-catsn-ranger.swift >>> Swift svn swift-r4987 (swift modified locally) cog-r3229 >>> >>> RunID: 20110825-1352-f1v940b4 >>> Progress: time: Thu, 25 Aug 2011 13:52:59 -0500 >>> Progress: time: Thu, 25 Aug 2011 13:53:00 -0500 Selecting site:7 >>> Initializing site shared directory:3 >>> Execution failed: >>> Authentication failed [Caused by: Failure unspecified at GSS-API >>> level [Caused by: Unknown CA]] >>> >>> >>> ----- Original Message ----- >>>> From: "Ketan Maheshwari" >>>> To: "David Kelly" >>>> Cc: "Jonathan Monette" , "swift-devel Devel" >>>> >>>> Sent: Thursday, August 25, 2011 1:32:50 PM >>>> Subject: Re: [Swift-devel] Notes from 0.93 meeting >>>> Hi, >>>> >>>> >>>> Are your CADIR and CACERT env vars set up?
>>>> >>>> >>>> >>>> [communicado:swiftgrid]$ echo $X509_CADIR >>>> /opt/osg-1.2.16/globus/TRUSTED_CA >>>> >>>> >>>> >>>> [communicado:swiftgrid]$ echo $X509_CERT_DIR >>>> /opt/osg-1.2.16/globus/TRUSTED_CA >>>> >>>> >>>> >>>> On Thu, Aug 25, 2011 at 1:29 PM, David Kelly < >>>> davidk at ci.uchicago.edu >>>>> wrote: >>>> >>>> >>>> Thanks Jon, >>>> >>>> Here is what happens when I try this from communicado: >>>> >>>> [davidk at communicado ~]$ myproxy-logon -l dkelly -s >>>> myproxy.teragrid.org >>>> Enter MyProxy pass phrase: >>>> A credential has been received for user dkelly in >>>> /tmp/x509up_u1878. >>>> >>>> [davidk at communicado ranger]$ swift -sites.file sites.xml -tc.file >>>> tc.data 001-catsn-ranger.swift >>>> Swift svn swift-r4987 (swift modified locally) cog-r3229 >>>> >>>> RunID: 20110825-1326-o3e38fe0 >>>> Progress: time: Thu, 25 Aug 2011 13:26:43 -0500 >>>> Progress: time: Thu, 25 Aug 2011 13:26:44 -0500 Selecting site:8 >>>> Initializing site shared directory:2 >>>> Execution failed: >>>> Authentication failed [Caused by: Failure unspecified at GSS-API >>>> level >>>> [Caused by: Unknown CA]] >>>> >>>> Any ideas? >>>> >>>> Thanks, >>>> David >>>> >>>> >>>> >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>>> >>>> >>>> >>>> >>>> -- >>>> Ketan From davidk at ci.uchicago.edu Thu Aug 25 14:35:06 2011 From: davidk at ci.uchicago.edu (David Kelly) Date: Thu, 25 Aug 2011 14:35:06 -0500 (CDT) Subject: [Swift-devel] Notes from 0.93 meeting In-Reply-To: <21FA6397-13BA-4AE8-AEF5-D526E14D29FE@mcs.anl.gov> Message-ID: <1011819083.78677.1314300906335.JavaMail.root@zimbra-mb2.anl.gov> I can also gsissh into davidkel at tg-login.ranger.tacc.teragrid.org. 
But when I try to request a proxy for davidkel from myproxy.teragrid.org, I get: ERROR from myproxy-server (myproxy.teragrid.org): unknown username: davidkel > I am not sure it matters (but it might), but this is the login node for > teragrid on ranger: tg-login.ranger.tacc.teragrid.org. You requested a > proxy from myproxy-logon with username=dkelly. If you want to use the > proxy on ranger, I think the command should have been 'myproxy-logon -l > davidkel -s myproxy.teragrid.org'. From hategan at mcs.anl.gov Thu Aug 25 14:43:31 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 25 Aug 2011 12:43:31 -0700 Subject: [Swift-devel] Notes from 0.93 meeting In-Reply-To: <839210623.78583.1314298562032.JavaMail.root@zimbra-mb2.anl.gov> References: <839210623.78583.1314298562032.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <1314301411.11593.0.camel@blabla> It's possible that the CA dir on Ranger is not properly set up. Can you post the full log? On Thu, 2011-08-25 at 13:56 -0500, David Kelly wrote: > Those environment variables were not set up. I have them defined now, but I'm still getting the same error.
> > [davidk at communicado ranger]$ env |grep 509 > X509_CERT_DIR=/opt/osg-1.2.16/globus/TRUSTED_CA > X509_CADIR=/opt/osg-1.2.16/globus/TRUSTED_CA > > [davidk at communicado ranger]$ swift -sites.file sites.xml -tc.file tc.data 001-catsn-ranger.swift > Swift svn swift-r4987 (swift modified locally) cog-r3229 > > RunID: 20110825-1352-f1v940b4 > Progress: time: Thu, 25 Aug 2011 13:52:59 -0500 > Progress: time: Thu, 25 Aug 2011 13:53:00 -0500 Selecting site:7 Initializing site shared directory:3 > Execution failed: > Authentication failed [Caused by: Failure unspecified at GSS-API level [Caused by: Unknown CA]] > > > ----- Original Message ----- > > From: "Ketan Maheshwari" > > To: "David Kelly" > > Cc: "Jonathan Monette" , "swift-devel Devel" > > Sent: Thursday, August 25, 2011 1:32:50 PM > > Subject: Re: [Swift-devel] Notes from 0.93 meeting > > Hi, > > > > > > Are your CADIR and CACERT env vars set up? > > > > > > > > [communicado:swiftgrid]$ echo $X509_CADIR > > /opt/osg-1.2.16/globus/TRUSTED_CA > > > > > > > > [communicado:swiftgrid]$ echo $X509_CERT_DIR > > /opt/osg-1.2.16/globus/TRUSTED_CA > > > > > > > > On Thu, Aug 25, 2011 at 1:29 PM, David Kelly < davidk at ci.uchicago.edu > > > wrote: > > > > > > Thanks Jon, > > > > Here is what happens when I try this from communicado: > > > > [davidk at communicado ~]$ myproxy-logon -l dkelly -s > > myproxy.teragrid.org > > Enter MyProxy pass phrase: > > A credential has been received for user dkelly in /tmp/x509up_u1878. 
> > > > [davidk at communicado ranger]$ swift -sites.file sites.xml -tc.file > > tc.data 001-catsn-ranger.swift > > Swift svn swift-r4987 (swift modified locally) cog-r3229 > > > > RunID: 20110825-1326-o3e38fe0 > > Progress: time: Thu, 25 Aug 2011 13:26:43 -0500 > > Progress: time: Thu, 25 Aug 2011 13:26:44 -0500 Selecting site:8 > > Initializing site shared directory:2 > > Execution failed: > > Authentication failed [Caused by: Failure unspecified at GSS-API level > > [Caused by: Unknown CA]] > > > > Any ideas? > > > > Thanks, > > David > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > -- > > Ketan > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From davidk at ci.uchicago.edu Thu Aug 25 15:18:49 2011 From: davidk at ci.uchicago.edu (David Kelly) Date: Thu, 25 Aug 2011 15:18:49 -0500 (CDT) Subject: [Swift-devel] Notes from 0.93 meeting In-Reply-To: <1314301411.11593.0.camel@blabla> Message-ID: <1757176177.78866.1314303529517.JavaMail.root@zimbra-mb2.anl.gov> Sure, here is the full log: http://www.ci.uchicago.edu/~davidk/001-catsn-ranger-20110825-1515-5tydro91.log ----- Original Message ----- > From: "Mihael Hategan" > To: "David Kelly" > Cc: "Ketan Maheshwari" , "swift-devel Devel" > Sent: Thursday, August 25, 2011 2:43:31 PM > Subject: Re: [Swift-devel] Notes from 0.93 meeting > It's possible that the CA dir on Ranger is not properly set up. Can > you > post the full log? > > On Thu, 2011-08-25 at 13:56 -0500, David Kelly wrote: > > Those environment variables were not set up. I have them defined > > now, but I'm still getting the same error. 
> > > > [davidk at communicado ranger]$ env |grep 509 > > X509_CERT_DIR=/opt/osg-1.2.16/globus/TRUSTED_CA > > X509_CADIR=/opt/osg-1.2.16/globus/TRUSTED_CA > > > > [davidk at communicado ranger]$ swift -sites.file sites.xml -tc.file > > tc.data 001-catsn-ranger.swift > > Swift svn swift-r4987 (swift modified locally) cog-r3229 > > > > RunID: 20110825-1352-f1v940b4 > > Progress: time: Thu, 25 Aug 2011 13:52:59 -0500 > > Progress: time: Thu, 25 Aug 2011 13:53:00 -0500 Selecting site:7 > > Initializing site shared directory:3 > > Execution failed: > > Authentication failed [Caused by: Failure unspecified at GSS-API > > level [Caused by: Unknown CA]] > > > > > > ----- Original Message ----- > > > From: "Ketan Maheshwari" > > > To: "David Kelly" > > > Cc: "Jonathan Monette" , "swift-devel Devel" > > > > > > Sent: Thursday, August 25, 2011 1:32:50 PM > > > Subject: Re: [Swift-devel] Notes from 0.93 meeting > > > Hi, > > > > > > > > > Are your CADIR and CACERT env vars set up? > > > > > > > > > > > > [communicado:swiftgrid]$ echo $X509_CADIR > > > /opt/osg-1.2.16/globus/TRUSTED_CA > > > > > > > > > > > > [communicado:swiftgrid]$ echo $X509_CERT_DIR > > > /opt/osg-1.2.16/globus/TRUSTED_CA > > > > > > > > > > > > On Thu, Aug 25, 2011 at 1:29 PM, David Kelly < > > > davidk at ci.uchicago.edu > > > > wrote: > > > > > > > > > Thanks Jon, > > > > > > Here is what happens when I try this from communicado: > > > > > > [davidk at communicado ~]$ myproxy-logon -l dkelly -s > > > myproxy.teragrid.org > > > Enter MyProxy pass phrase: > > > A credential has been received for user dkelly in > > > /tmp/x509up_u1878. 
> > > > > > [davidk at communicado ranger]$ swift -sites.file sites.xml -tc.file > > > tc.data 001-catsn-ranger.swift > > > Swift svn swift-r4987 (swift modified locally) cog-r3229 > > > > > > RunID: 20110825-1326-o3e38fe0 > > > Progress: time: Thu, 25 Aug 2011 13:26:43 -0500 > > > Progress: time: Thu, 25 Aug 2011 13:26:44 -0500 Selecting site:8 > > > Initializing site shared directory:2 > > > Execution failed: > > > Authentication failed [Caused by: Failure unspecified at GSS-API > > > level > > > [Caused by: Unknown CA]] > > > > > > Any ideas? > > > > > > Thanks, > > > David > > > > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > -- > > > Ketan > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Thu Aug 25 15:42:57 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 25 Aug 2011 13:42:57 -0700 Subject: [Swift-devel] Notes from 0.93 meeting In-Reply-To: <1757176177.78866.1314303529517.JavaMail.root@zimbra-mb2.anl.gov> References: <1757176177.78866.1314303529517.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <1314304977.11910.1.camel@blabla> Odd. Can you paste the output of 'grid-proxy-info -all'? On Thu, 2011-08-25 at 15:18 -0500, David Kelly wrote: > Sure, here is the full log: > > http://www.ci.uchicago.edu/~davidk/001-catsn-ranger-20110825-1515-5tydro91.log > > ----- Original Message ----- > > From: "Mihael Hategan" > > To: "David Kelly" > > Cc: "Ketan Maheshwari" , "swift-devel Devel" > > Sent: Thursday, August 25, 2011 2:43:31 PM > > Subject: Re: [Swift-devel] Notes from 0.93 meeting > > It's possible that the CA dir on Ranger is not properly set up. Can > > you > > post the full log? 
> > > > On Thu, 2011-08-25 at 13:56 -0500, David Kelly wrote: > > > Those environment variables were not set up. I have them defined > > > now, but I'm still getting the same error. > > > > > > [davidk at communicado ranger]$ env |grep 509 > > > X509_CERT_DIR=/opt/osg-1.2.16/globus/TRUSTED_CA > > > X509_CADIR=/opt/osg-1.2.16/globus/TRUSTED_CA > > > > > > [davidk at communicado ranger]$ swift -sites.file sites.xml -tc.file > > > tc.data 001-catsn-ranger.swift > > > Swift svn swift-r4987 (swift modified locally) cog-r3229 > > > > > > RunID: 20110825-1352-f1v940b4 > > > Progress: time: Thu, 25 Aug 2011 13:52:59 -0500 > > > Progress: time: Thu, 25 Aug 2011 13:53:00 -0500 Selecting site:7 > > > Initializing site shared directory:3 > > > Execution failed: > > > Authentication failed [Caused by: Failure unspecified at GSS-API > > > level [Caused by: Unknown CA]] > > > > > > > > > ----- Original Message ----- > > > > From: "Ketan Maheshwari" > > > > To: "David Kelly" > > > > Cc: "Jonathan Monette" , "swift-devel Devel" > > > > > > > > Sent: Thursday, August 25, 2011 1:32:50 PM > > > > Subject: Re: [Swift-devel] Notes from 0.93 meeting > > > > Hi, > > > > > > > > > > > > Are your CADIR and CACERT env vars set up? > > > > > > > > > > > > > > > > [communicado:swiftgrid]$ echo $X509_CADIR > > > > /opt/osg-1.2.16/globus/TRUSTED_CA > > > > > > > > > > > > > > > > [communicado:swiftgrid]$ echo $X509_CERT_DIR > > > > /opt/osg-1.2.16/globus/TRUSTED_CA > > > > > > > > > > > > > > > > On Thu, Aug 25, 2011 at 1:29 PM, David Kelly < > > > > davidk at ci.uchicago.edu > > > > > wrote: > > > > > > > > > > > > Thanks Jon, > > > > > > > > Here is what happens when I try this from communicado: > > > > > > > > [davidk at communicado ~]$ myproxy-logon -l dkelly -s > > > > myproxy.teragrid.org > > > > Enter MyProxy pass phrase: > > > > A credential has been received for user dkelly in > > > > /tmp/x509up_u1878. 
> > > > > > > > [davidk at communicado ranger]$ swift -sites.file sites.xml -tc.file > > > > tc.data 001-catsn-ranger.swift > > > > Swift svn swift-r4987 (swift modified locally) cog-r3229 > > > > > > > > RunID: 20110825-1326-o3e38fe0 > > > > Progress: time: Thu, 25 Aug 2011 13:26:43 -0500 > > > > Progress: time: Thu, 25 Aug 2011 13:26:44 -0500 Selecting site:8 > > > > Initializing site shared directory:2 > > > > Execution failed: > > > > Authentication failed [Caused by: Failure unspecified at GSS-API > > > > level > > > > [Caused by: Unknown CA]] > > > > > > > > Any ideas? > > > > > > > > Thanks, > > > > David > > > > > > > > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > -- > > > > Ketan > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From davidk at ci.uchicago.edu Thu Aug 25 15:55:02 2011 From: davidk at ci.uchicago.edu (David Kelly) Date: Thu, 25 Aug 2011 15:55:02 -0500 (CDT) Subject: [Swift-devel] Notes from 0.93 meeting In-Reply-To: <1314304977.11910.1.camel@blabla> Message-ID: <1905958187.78949.1314305702246.JavaMail.root@zimbra-mb2.anl.gov> $ grid-proxy-info -all subject : /C=US/O=National Center for Supercomputing Applications/CN=David Kelly issuer : /C=US/O=National Center for Supercomputing Applications/OU=Certificate Authorities/CN=MyProxy identity : /C=US/O=National Center for Supercomputing Applications/CN=David Kelly type : end entity credential strength : 1024 bits path : /tmp/x509up_u1878 timeleft : 9:56:53 ----- Original Message ----- > From: "Mihael Hategan" > To: "David Kelly" > Cc: "Ketan Maheshwari" , "swift-devel Devel" > Sent: Thursday, August 25, 2011 3:42:57 PM > Subject: Re: [Swift-devel] 
Notes from 0.93 meeting > Odd. Can you paste the output of 'grid-proxy-info -all'? > > On Thu, 2011-08-25 at 15:18 -0500, David Kelly wrote: > > Sure, here is the full log: > > > > http://www.ci.uchicago.edu/~davidk/001-catsn-ranger-20110825-1515-5tydro91.log > > > > ----- Original Message ----- > > > From: "Mihael Hategan" > > > To: "David Kelly" > > > Cc: "Ketan Maheshwari" , "swift-devel > > > Devel" > > > Sent: Thursday, August 25, 2011 2:43:31 PM > > > Subject: Re: [Swift-devel] Notes from 0.93 meeting > > > It's possible that the CA dir on Ranger is not properly set up. > > > Can > > > you > > > post the full log? > > > > > > On Thu, 2011-08-25 at 13:56 -0500, David Kelly wrote: > > > > Those environment variables were not set up. I have them defined > > > > now, but I'm still getting the same error. > > > > > > > > [davidk at communicado ranger]$ env |grep 509 > > > > X509_CERT_DIR=/opt/osg-1.2.16/globus/TRUSTED_CA > > > > X509_CADIR=/opt/osg-1.2.16/globus/TRUSTED_CA > > > > > > > > [davidk at communicado ranger]$ swift -sites.file sites.xml > > > > -tc.file > > > > tc.data 001-catsn-ranger.swift > > > > Swift svn swift-r4987 (swift modified locally) cog-r3229 > > > > > > > > RunID: 20110825-1352-f1v940b4 > > > > Progress: time: Thu, 25 Aug 2011 13:52:59 -0500 > > > > Progress: time: Thu, 25 Aug 2011 13:53:00 -0500 Selecting site:7 > > > > Initializing site shared directory:3 > > > > Execution failed: > > > > Authentication failed [Caused by: Failure unspecified at > > > > GSS-API > > > > level [Caused by: Unknown CA]] > > > > > > > > > > > > ----- Original Message ----- > > > > > From: "Ketan Maheshwari" > > > > > To: "David Kelly" > > > > > Cc: "Jonathan Monette" , "swift-devel > > > > > Devel" > > > > > > > > > > Sent: Thursday, August 25, 2011 1:32:50 PM > > > > > Subject: Re: [Swift-devel] Notes from 0.93 meeting > > > > > Hi, > > > > > > > > > > > > > > > Are your CADIR and CACERT env vars set up? 
> > > > > > > > > > > > > > > > > > > > [communicado:swiftgrid]$ echo $X509_CADIR > > > > > /opt/osg-1.2.16/globus/TRUSTED_CA > > > > > > > > > > > > > > > > > > > > [communicado:swiftgrid]$ echo $X509_CERT_DIR > > > > > /opt/osg-1.2.16/globus/TRUSTED_CA > > > > > > > > > > > > > > > > > > > > On Thu, Aug 25, 2011 at 1:29 PM, David Kelly < > > > > > davidk at ci.uchicago.edu > > > > > > wrote: > > > > > > > > > > > > > > > Thanks Jon, > > > > > > > > > > Here is what happens when I try this from communicado: > > > > > > > > > > [davidk at communicado ~]$ myproxy-logon -l dkelly -s > > > > > myproxy.teragrid.org > > > > > Enter MyProxy pass phrase: > > > > > A credential has been received for user dkelly in > > > > > /tmp/x509up_u1878. > > > > > > > > > > [davidk at communicado ranger]$ swift -sites.file sites.xml > > > > > -tc.file > > > > > tc.data 001-catsn-ranger.swift > > > > > Swift svn swift-r4987 (swift modified locally) cog-r3229 > > > > > > > > > > RunID: 20110825-1326-o3e38fe0 > > > > > Progress: time: Thu, 25 Aug 2011 13:26:43 -0500 > > > > > Progress: time: Thu, 25 Aug 2011 13:26:44 -0500 Selecting > > > > > site:8 > > > > > Initializing site shared directory:2 > > > > > Execution failed: > > > > > Authentication failed [Caused by: Failure unspecified at > > > > > GSS-API > > > > > level > > > > > [Caused by: Unknown CA]] > > > > > > > > > > Any ideas? 
> > > > >
> > > > > Thanks,
> > > > > David
> > > > >
> > > > > _______________________________________________
> > > > > Swift-devel mailing list
> > > > > Swift-devel at ci.uchicago.edu
> > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> > > > >
> > > > > --
> > > > > Ketan
> > > > _______________________________________________
> > > > Swift-devel mailing list
> > > > Swift-devel at ci.uchicago.edu
> > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

From hategan at mcs.anl.gov Thu Aug 25 16:17:53 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 25 Aug 2011 14:17:53 -0700
Subject: [Swift-devel] Notes from 0.93 meeting
In-Reply-To: <1905958187.78949.1314305702246.JavaMail.root@zimbra-mb2.anl.gov>
References: <1905958187.78949.1314305702246.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID: <1314307073.12144.4.camel@blabla>

Can you try a globus-url-copy to gridftp.ranger?

gridftp.ranger seems to have the NCSA myproxy CA. You say you have the
proper certificates dir in your X509_CERT_DIR, and that directory
contains the TACC root cert. So it should work. And so should swift.

Though I think that jglobus should be more clear about "Unknown ca"
errors. At least the name of the unknown CA should be part of the error
message.
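
A minimal sketch of the check implied above, assuming the usual Globus/OpenSSL trust-directory layout in which each CA certificate is stored as <hash>.0 under X509_CERT_DIR. The helper name ca_present is hypothetical and not part of any Globus tool; it only tests for the file and does not validate the certificate itself.

```shell
# Sketch only: report whether a CA certificate with the given subject hash
# is present in a Globus-style trust directory (files named <hash>.0).
# ca_present is a hypothetical helper, not a Globus command.
ca_present() {
    dir=$1    # trust directory, e.g. the value of X509_CERT_DIR
    hash=$2   # 8-hex-digit subject hash of the CA certificate
    if [ -f "$dir/$hash.0" ]; then
        echo "present"
    else
        echo "missing"
    fi
}
```

It could be called as: ca_present "$X509_CERT_DIR" 0123abcd (the hash here is a placeholder; the real value would be the hash of the CA that signed the remote host's certificate).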
On Thu, 2011-08-25 at 15:55 -0500, David Kelly wrote:
> $ grid-proxy-info -all
> subject : /C=US/O=National Center for Supercomputing Applications/CN=David Kelly
> issuer : /C=US/O=National Center for Supercomputing Applications/OU=Certificate Authorities/CN=MyProxy
> identity : /C=US/O=National Center for Supercomputing Applications/CN=David Kelly
> type : end entity credential
> strength : 1024 bits
> path : /tmp/x509up_u1878
> timeleft : 9:56:53

From davidk at ci.uchicago.edu Thu Aug 25 16:37:06 2011
From: davidk at ci.uchicago.edu (David Kelly)
Date: Thu, 25 Aug 2011 16:37:06 -0500 (CDT)
Subject: [Swift-devel] Notes from 0.93 meeting
In-Reply-To: <1314307073.12144.4.camel@blabla>
Message-ID: <2064139658.79032.1314308226318.JavaMail.root@zimbra-mb2.anl.gov>

That gives a slightly more detailed error..
$ globus-url-copy file:///homes/davidk/ranger/sites.xml gsiftp://gridftp.ranger.tacc.teragrid.org/share/home/01503/davidkel
error: globus_ftp_control: gss_init_sec_context failed
OpenSSL Error: s3_clnt.c:915: in library: SSL routines, function SSL3_GET_SERVER_CERTIFICATE: certificate verify failed
globus_gsi_callback_module: Could not verify credential
globus_gsi_callback_module: Can't get the local trusted CA certificate: Untrusted self-signed certificate in chain with hash 9a1da9f9

From skenny at uchicago.edu Thu Aug 25 16:38:52 2011
From: skenny at uchicago.edu (Sarah Kenny)
Date: Thu, 25 Aug 2011 14:38:52 -0700
Subject: [Swift-devel] Notes from 0.93 meeting
In-Reply-To: <1314307073.12144.4.camel@blabla>
References: <1905958187.78949.1314305702246.JavaMail.root@zimbra-mb2.anl.gov> <1314307073.12144.4.camel@blabla>
Message-ID:

communicado's certs (/etc/grid-security/certificates) are out-of-date...if you copy ranger's /etc/grid-security/certificates directory to communicado and point yr X509_CERT_DIR to it you can get a job thru (a simple globus-job-run with my vaild cert fails from communicado at the moment if i don't do this).

i set our machines at uci to update daily...i think it's less frequently at ci...

On Thu, Aug 25, 2011 at 2:17 PM, Mihael Hategan wrote:
> Can you try a globus-url-copy to gridftp.ranger?
>
> gridftp.ranger seems to have the NCSA myproxy CA. You say you have the
> proper certificates dir in your X509_CERT_DIR, and that directory
> contains the TACC root cert. So it should work. And so should swift.
>
> Though I think that jglobus should be more clear about "Unknown ca"
> errors. At least the name of the unknown CA should be part of the error
> message.
--
Sarah Kenny
Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III
University of California Irvine, Dept. of Neurology ~ 773-818-8300

From jonmon at mcs.anl.gov Thu Aug 25 16:40:43 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Thu, 25 Aug 2011 16:40:43 -0500
Subject: [Swift-devel] Notes from 0.93 meeting
In-Reply-To:
References: <1905958187.78949.1314305702246.JavaMail.root@zimbra-mb2.anl.gov> <1314307073.12144.4.camel@blabla>
Message-ID:

That is weird. If you were able to gsissh to ranger I would assume that you are able to globus-url-copy to ranger.

Anyways, what Sarah said should work. I would assume that ci would update more frequently to avoid this problem.

On Aug 25, 2011, at 4:38 PM, Sarah Kenny wrote:

> communicado's certs (/etc/grid-security/certificates) are out-of-date...if you copy ranger's /etc/grid-security/certificates directory to communicado and point yr X509_CERT_DIR to it you can get a job thru (a simple globus-job-run with my vaild cert fails from communicado at the moment if i don't do this).
>
> i set our machines at uci to update daily...i think it's less frequently at ci...

From hategan at mcs.anl.gov Thu Aug 25 16:42:04 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 25 Aug 2011 14:42:04 -0700
Subject: [Swift-devel] Notes from 0.93 meeting
In-Reply-To:
References: <1905958187.78949.1314305702246.JavaMail.root@zimbra-mb2.anl.gov> <1314307073.12144.4.camel@blabla>
Message-ID: <1314308524.14490.0.camel@blabla>

Right.
Or just grab the tacc legacy CA from here:
http://www.tacc.utexas.edu/CA/

From hategan at mcs.anl.gov Thu Aug 25 16:42:58 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 25 Aug 2011 14:42:58 -0700
Subject: [Swift-devel] Notes from 0.93 meeting
In-Reply-To:
References: <1905958187.78949.1314305702246.JavaMail.root@zimbra-mb2.anl.gov> <1314307073.12144.4.camel@blabla>
Message-ID: <1314308578.14490.1.camel@blabla>

On Thu, 2011-08-25 at 16:40 -0500, Jonathan Monette wrote:
> That is weird. If you were able to gsissh to ranger I would assume
> that you are able to globus-url-copy to ranger.

Not if the two use different CAs. Or if a password was typed at the ssh
login.

> Anyways, what Sarah said should work. I would assume that ci would
> update more frequently to avoid this problem.
> On Aug 25, 2011, at 4:38 PM, Sarah Kenny wrote: > > > communicado's certs (/etc/grid-security/certificates) are > > out-of-date...if you copy ranger's /etc/grid-security/certificates > > directory to communicado and point yr X509_CERT_DIR to it you can > > get a job thru (a simple globus-job-run with my vaild cert fails > > from communicado at the moment if i don't do this). > > > > i set our machines at uci to update daily...i think it's less > > frequently at ci... > > > > On Thu, Aug 25, 2011 at 2:17 PM, Mihael Hategan > > wrote: > > Can you try a globus-url-copy to gridftp.ranger? > > > > gridftp.ranger seems to have the NCSA myproxy CA. You say > > you have the > > proper certificates dir in your X509_CERT_DIR, and that > > directory > > contains the TACC root cert. So it should work. And so > > should swift. > > > > Though I think that jglobus should be more clear about > > "Unknown ca" > > errors. At least the name of the unknown CA should be part > > of the error > > message. > > > > > > On Thu, 2011-08-25 at 15:55 -0500, David Kelly wrote: > > > $ grid-proxy-info -all > > > subject : /C=US/O=National Center for Supercomputing > > Applications/CN=David Kelly > > > issuer : /C=US/O=National Center for Supercomputing > > Applications/OU=Certificate Authorities/CN=MyProxy > > > identity : /C=US/O=National Center for Supercomputing > > Applications/CN=David Kelly > > > type : end entity credential > > > strength : 1024 bits > > > path : /tmp/x509up_u1878 > > > timeleft : 9:56:53 > > > > > > > > > ----- Original Message ----- > > > > From: "Mihael Hategan" > > > > To: "David Kelly" > > > > Cc: "Ketan Maheshwari" , > > "swift-devel Devel" > > > > Sent: Thursday, August 25, 2011 3:42:57 PM > > > > Subject: Re: [Swift-devel] Notes from 0.93 meeting > > > > Odd. Can you paste the output of 'grid-proxy-info -all'? 

From jonmon at mcs.anl.gov Thu Aug 25 16:43:23 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Thu, 25 Aug 2011 16:43:23 -0500
Subject: [Swift-devel] Swift jobs failing
In-Reply-To: <1314299488.10772.1.camel@blabla>
References: <7D2BE7AB-BF33-446E-A0F4-B708D325CCFA@mcs.anl.gov> <1314299488.10772.1.camel@blabla>
Message-ID:

Here is the link. The log is very long (it has entries from 2010), but toward the bottom it does mention that it could not register the coaster service.

www.ci.uchicago.edu/~jonmon/log/coaster.log

On Aug 25, 2011, at 2:11 PM, Mihael Hategan wrote:

> Can you post the coaster log on the remote machine?
>
>
> On Thu, 2011-08-25 at 14:07 -0500, Jonathan Monette wrote:
>> Ok. I will check my configuration. I don't see very helpful messages
>> in the log file but I will give a closer look. More information to
>> follow.
>>
>> On Aug 25, 2011, at 2:05 PM, Ketan Maheshwari wrote:
>>
>>> I am using 0.93 from Communicado to OSG using persistent-coasters. I
>>> do not see such messages.
>>>
>>>
>>> On Thu, Aug 25, 2011 at 1:42 PM, Jonathan Monette
>>> wrote:
>>> This also seems to be happening in trunk. Is anyone seeing
>>> this issue with their code?
>>>
>>> On Aug 25, 2011, at 12:44 PM, Jonathan Monette wrote:
>>>
>>>> I started a run of my SwiftMontage work and all the jobs
>>>> keep failing. No progress is being made and the swift
>>>> stdout will have the line "failed but can retry: 'some
>>>> number'". The log is located at
>>>> www.ci.uchicago.edu/~jonmon/logs/montage-20110825-1232-y104fa88.log. This is with the most recent version of 0.93.

From jonmon at mcs.anl.gov Thu Aug 25 16:46:26 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Thu, 25 Aug 2011 16:46:26 -0500
Subject: [Swift-devel] Notes from 0.93 meeting
In-Reply-To: <1314308578.14490.1.camel@blabla>
References: <1905958187.78949.1314305702246.JavaMail.root@zimbra-mb2.anl.gov> <1314307073.12144.4.camel@blabla> <1314308578.14490.1.camel@blabla>
Message-ID:

True. I did not think that each mechanism would use different CAs. We might want to ask ci support to update the grid certs more frequently, then, to avoid this situation.

On Aug 25, 2011, at 4:42 PM, Mihael Hategan wrote:

> On Thu, 2011-08-25 at 16:40 -0500, Jonathan Monette wrote:
>> That is weird. If you were able to gsissh to ranger I would assume
>> that you are able to globus-url-copy to ranger.
>
> Not if the two use different CAs. Or if a password was typed at the ssh
> login.
>
>> Anyways, what Sarah said should work. I would assume that ci would
>> update more frequently to avoid this problem.

From skenny at uchicago.edu Thu Aug 25 17:11:57 2011
From: skenny at uchicago.edu (Sarah Kenny)
Date: Thu, 25 Aug 2011 15:11:57 -0700
Subject: [Swift-devel] Notes from 0.93 meeting
In-Reply-To:
References: <1905958187.78949.1314305702246.JavaMail.root@zimbra-mb2.anl.gov> <1314307073.12144.4.camel@blabla> <1314308578.14490.1.camel@blabla>
Message-ID:

if i had a nickel for every time i dealt with this i'd be rich! :)
actually, now that i'm looking at our uci machines i actually have them updating hourly...so, maybe you want to ask the admins to do that to avoid a full day of confusion whenever they expire :P

*usually* i can't gsissh either if the certs have expired but, yeah, they must be using different CA's now for that on ranger as mihael suggests...

On Thu, Aug 25, 2011 at 2:46 PM, Jonathan Monette wrote:

> True. I did not think that each mechanism would use different CAs. We
> might want to ask ci support to update the grid certs more frequently,
> then, to avoid this situation.
>
> On Aug 25, 2011, at 4:42 PM, Mihael Hategan wrote:
>
> > On Thu, 2011-08-25 at 16:40 -0500, Jonathan Monette wrote:
> >> That is weird. If you were able to gsissh to ranger I would assume
> >> that you are able to globus-url-copy to ranger.
> >
> > Not if the two use different CAs. Or if a password was typed at the ssh
> > login.
> >
> >> Anyways, what Sarah said should work. I would assume that ci would
> >> update more frequently to avoid this problem.

--
Sarah Kenny
Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III
University of California Irvine, Dept. of Neurology ~ 773-818-8300

From jonmon at mcs.anl.gov Thu Aug 25 17:18:11 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Thu, 25 Aug 2011 17:18:11 -0500
Subject: [Swift-devel] Notes from 0.93 meeting
Message-ID: <20110825221756.AA96612815@zimbra.anl.gov>

I can send mail to ci support, cc Mike on it, and ask what they can do.

Mihael, is there any way for Swift to give a little more feedback besides "Unknown CA", or is that a jglobus problem?
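
For reference, the workaround Sarah describes in this thread (copy an up-to-date certificates directory, then point X509_CERT_DIR at it) could be sketched in shell roughly as follows. This is only a sketch under stated assumptions: `use_cert_dir` is a hypothetical helper name, not a Swift or Globus command, and the `$HOME/ranger-certs` path in the usage note is made up for illustration. How the directory gets copied from ranger is left out on purpose.

```shell
# Hypothetical helper (not part of Swift or Globus): point the grid tools
# at a CA certificates directory that was copied from an up-to-date host.
use_cert_dir() {
    _dir="$1"
    # Refuse to switch if the directory is missing, so that the previous
    # X509_CERT_DIR setting stays in effect.
    if [ ! -d "$_dir" ]; then
        echo "no such directory: $_dir" >&2
        return 1
    fi
    X509_CERT_DIR="$_dir"
    export X509_CERT_DIR
}
```

Assuming ranger's /etc/grid-security/certificates has already been copied to a local directory, running `use_cert_dir "$HOME/ranger-certs"` in the same shell that then launches swift should make that run use the refreshed CA files.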

From hategan at mcs.anl.gov Thu Aug 25 17:31:34 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 25 Aug 2011 15:31:34 -0700
Subject: [Swift-devel] Notes from 0.93 meeting
In-Reply-To: <20110825221756.AA96612815@zimbra.anl.gov>
References: <20110825221756.AA96612815@zimbra.anl.gov>
Message-ID: <1314311494.14675.2.camel@blabla>

On Thu, 2011-08-25 at 17:18 -0500, Jonathan Monette wrote:
> I can send mail to ci support and cc mike to it and ask what they can
> do.
>
> Mihael, is there anyway for Swift to give a little more feedback
> besides unknown CA or is that a jglobus problem?

It's a jglobus problem. That in itself may not be a big issue, but jglobus is now being heavily re-organized by the globus team, so I'm not sure what the best long-term strategy is here.

>
> ----- Reply message -----
> From: "Sarah Kenny"
> Date: Thu, Aug 25, 2011 5:11 pm
> Subject: [Swift-devel] Notes from 0.93 meeting
> To: "Jonathan Monette"
> Cc: "Mihael Hategan", "swift-devel Devel"
>
> if i had a nickel for every time i dealt with this i'd be rich! :)
> actually, now that i'm looking at our uci machines i actually have
> them updating hourly...so, maybe you want to ask the admins to do that
> to avoid a full day of confusion whenever they expire :P
>
> *usually* i can't gsissh either if the certs have expired but, yeah,
> they must be using different CA's now for that on ranger as mihael
> suggests...
>
> On Thu, Aug 25, 2011 at 2:46 PM, Jonathan Monette
> wrote:
> True. I did not think that each mechanism would use different
> CAs. We might want to ask ci support to update the grid certs
> more frequently then to avoid this situation.
> >>>>>>>>> > >>>>>>>>> Thanks, > >>>>>>>>> David > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> _______________________________________________ > >>>>>>>>> Swift-devel mailing list > >>>>>>>>> Swift-devel at ci.uchicago.edu > >>>>>>>>> > >>> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> -- > >>>>>>>>> Ketan > >>>>>>>> _______________________________________________ > >>>>>>>> Swift-devel mailing list > >>>>>>>> Swift-devel at ci.uchicago.edu > >>>>>>>> > >>> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > >>> > >>> > >>> _______________________________________________ > >>> Swift-devel mailing list > >>> Swift-devel at ci.uchicago.edu > >>> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > >>> > >>> > >>> > >>> > >>> -- > >>> Sarah Kenny > >>> Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci > III > >>> University of California Irvine, Dept. of Neurology ~ > 773-818-8300 > >>> > >>> _______________________________________________ > >>> Swift-devel mailing list > >>> Swift-devel at ci.uchicago.edu > >>> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > >> > > > > > > > > > > -- > Sarah Kenny > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > University of California Irvine, Dept. of Neurology ~ 773-818-8300 > From hategan at mcs.anl.gov Thu Aug 25 17:41:28 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 25 Aug 2011 15:41:28 -0700 Subject: [Swift-devel] Swift jobs failing In-Reply-To: References: <7D2BE7AB-BF33-446E-A0F4-B708D325CCFA@mcs.anl.gov> <1314299488.10772.1.camel@blabla> Message-ID: <1314312088.14881.0.camel@blabla> mike at blabla:~/tmp$ wget http://www.ci.uchicago.edu/~jonmon/log/coaster.log --2011-08-25 15:40:31-- http://www.ci.uchicago.edu/~jonmon/log/coaster.log Resolving www.ci.uchicago.edu... 192.5.86.67 Connecting to www.ci.uchicago.edu|192.5.86.67|:80... connected. 
HTTP request sent, awaiting response... 404 Not Found 2011-08-25 15:40:32 ERROR 404: Not Found. On Thu, 2011-08-25 at 16:43 -0500, Jonathan Monette wrote: > Here is the link. The log is very long(has entries from 2010) but the log toward the bottom does mention that it could not register the coaster service. > www.ci.uchicago.edu/~jonmon/log/coaster.log > > On Aug 25, 2011, at 2:11 PM, Mihael Hategan wrote: > > > Can you post the coaster log on the remote machine? > > > > > > On Thu, 2011-08-25 at 14:07 -0500, Jonathan Monette wrote: > >> Ok. I will check my configuration. I don't see very helpful messages > >> in the log file but I will give a closer look. More information to > >> follow. > >> > >> On Aug 25, 2011, at 2:05 PM, Ketan Maheshwari wrote: > >> > >>> I am using 0.93 from Communicado to OSG using persistent-coasters. I > >>> do not see such messages. > >>> > >>> > >>> On Thu, Aug 25, 2011 at 1:42 PM, Jonathan Monette > >>> wrote: > >>> This also seems to be happening in trunk. Is anyone seeing > >>> this issue with there code? > >>> > >>> On Aug 25, 2011, at 12:44 PM, Jonathan Monette wrote: > >>> > >>>> I started a run of my SwiftMontage work and all the jobs > >>> keep failing. No progress is being made and the swift > >>> stdout will have the line "failed but can retry: 'some > >>> number'". The log is located at > >>> www.ci.uchicago.edu/~jonmon/logs/montage-20110825-1232-y104fa88.log. This is with the most recent version of 0.93. 
> >>>> _______________________________________________ > >>>> Swift-devel mailing list > >>>> Swift-devel at ci.uchicago.edu > >>>> > >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > >>> > >>> _______________________________________________ > >>> Swift-devel mailing list > >>> Swift-devel at ci.uchicago.edu > >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > >>> > >>> > >>> > >>> > >>> > >>> -- > >>> Ketan > >>> > >>> > >>> > >> > >> > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > From jonmon at mcs.anl.gov Thu Aug 25 17:51:54 2011 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Thu, 25 Aug 2011 17:51:54 -0500 Subject: [Swift-devel] Swift jobs failing In-Reply-To: <1314312088.14881.0.camel@blabla> References: <7D2BE7AB-BF33-446E-A0F4-B708D325CCFA@mcs.anl.gov> <1314299488.10772.1.camel@blabla> <1314312088.14881.0.camel@blabla> Message-ID: my bad... try www.ci.uchicago.edu/~jonmon/logs/coasters.log On Aug 25, 2011, at 5:41 PM, Mihael Hategan wrote: > mike at blabla:~/tmp$ wget > http://www.ci.uchicago.edu/~jonmon/log/coaster.log > --2011-08-25 15:40:31-- > http://www.ci.uchicago.edu/~jonmon/log/coaster.log > Resolving www.ci.uchicago.edu... 192.5.86.67 > Connecting to www.ci.uchicago.edu|192.5.86.67|:80... connected. > HTTP request sent, awaiting response... 404 Not Found > 2011-08-25 15:40:32 ERROR 404: Not Found. > > > > On Thu, 2011-08-25 at 16:43 -0500, Jonathan Monette wrote: >> Here is the link. The log is very long(has entries from 2010) but the log toward the bottom does mention that it could not register the coaster service. >> www.ci.uchicago.edu/~jonmon/log/coaster.log >> >> On Aug 25, 2011, at 2:11 PM, Mihael Hategan wrote: >> >>> Can you post the coaster log on the remote machine?
>>> >>> >>> On Thu, 2011-08-25 at 14:07 -0500, Jonathan Monette wrote: >>>> Ok. I will check my configuration. I don't see very helpful messages >>>> in the log file but I will give a closer look. More information to >>>> follow. >>>> >>>> On Aug 25, 2011, at 2:05 PM, Ketan Maheshwari wrote: >>>> >>>>> I am using 0.93 from Communicado to OSG using persistent-coasters. I >>>>> do not see such messages. >>>>> >>>>> >>>>> On Thu, Aug 25, 2011 at 1:42 PM, Jonathan Monette >>>>> wrote: >>>>> This also seems to be happening in trunk. Is anyone seeing >>>>> this issue with there code? >>>>> >>>>> On Aug 25, 2011, at 12:44 PM, Jonathan Monette wrote: >>>>> >>>>>> I started a run of my SwiftMontage work and all the jobs >>>>> keep failing. No progress is being made and the swift >>>>> stdout will have the line "failed but can retry: 'some >>>>> number'". The log is located at >>>>> www.ci.uchicago.edu/~jonmon/logs/montage-20110825-1232-y104fa88.log. This is with the most recent version of 0.93. >>>>>> _______________________________________________ >>>>>> Swift-devel mailing list >>>>>> Swift-devel at ci.uchicago.edu >>>>>> >>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>>>> >>>>> _______________________________________________ >>>>> Swift-devel mailing list >>>>> Swift-devel at ci.uchicago.edu >>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> Ketan >>>>> >>>>> >>>>> >>>> >>>> >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>> >>> >> > > From hategan at mcs.anl.gov Thu Aug 25 18:03:59 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 25 Aug 2011 16:03:59 -0700 Subject: [Swift-devel] Swift jobs failing In-Reply-To: References: <7D2BE7AB-BF33-446E-A0F4-B708D325CCFA@mcs.anl.gov> <1314299488.10772.1.camel@blabla> 
<1314312088.14881.0.camel@blabla> Message-ID: <1314313439.15015.0.camel@blabla> On Thu, 2011-08-25 at 17:51 -0500, Jonathan Monette wrote: > my bad?.try www.ci.uchicago.edu/~jonmon/logs/coasters.log Could you gzip that? > > On Aug 25, 2011, at 5:41 PM, Mihael Hategan wrote: > > > mike at blabla:~/tmp$ wget > > http://www.ci.uchicago.edu/~jonmon/log/coaster.log > > --2011-08-25 15:40:31-- > > http://www.ci.uchicago.edu/~jonmon/log/coaster.log > > Resolving www.ci.uchicago.edu... 192.5.86.67 > > Connecting to www.ci.uchicago.edu|192.5.86.67|:80... connected. > > HTTP request sent, awaiting response... 404 Not Found > > 2011-08-25 15:40:32 ERROR 404: Not Found. > > > > > > > > On Thu, 2011-08-25 at 16:43 -0500, Jonathan Monette wrote: > >> Here is the link. The log is very long(has entries from 2010) but the log toward the bottom does mention that it could not register the coaster service. > >> www.ci.uchicago.edu/~jonmon/log/coaster.log > >> > >> On Aug 25, 2011, at 2:11 PM, Mihael Hategan wrote: > >> > >>> Can you post the coaster log on the remote machine? > >>> > >>> > >>> On Thu, 2011-08-25 at 14:07 -0500, Jonathan Monette wrote: > >>>> Ok. I will check my configuration. I don't see very helpful messages > >>>> in the log file but I will give a closer look. More information to > >>>> follow. > >>>> > >>>> On Aug 25, 2011, at 2:05 PM, Ketan Maheshwari wrote: > >>>> > >>>>> I am using 0.93 from Communicado to OSG using persistent-coasters. I > >>>>> do not see such messages. > >>>>> > >>>>> > >>>>> On Thu, Aug 25, 2011 at 1:42 PM, Jonathan Monette > >>>>> wrote: > >>>>> This also seems to be happening in trunk. Is anyone seeing > >>>>> this issue with there code? > >>>>> > >>>>> On Aug 25, 2011, at 12:44 PM, Jonathan Monette wrote: > >>>>> > >>>>>> I started a run of my SwiftMontage work and all the jobs > >>>>> keep failing. No progress is being made and the swift > >>>>> stdout will have the line "failed but can retry: 'some > >>>>> number'". 
The log is located at > >>>>> www.ci.uchicago.edu/~jonmon/logs/montage-20110825-1232-y104fa88.log. This is with the most recent version of 0.93. > >>>>>> _______________________________________________ > >>>>>> Swift-devel mailing list > >>>>>> Swift-devel at ci.uchicago.edu > >>>>>> > >>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > >>>>> > >>>>> _______________________________________________ > >>>>> Swift-devel mailing list > >>>>> Swift-devel at ci.uchicago.edu > >>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> -- > >>>>> Ketan > >>>>> > >>>>> > >>>>> > >>>> > >>>> > >>>> _______________________________________________ > >>>> Swift-devel mailing list > >>>> Swift-devel at ci.uchicago.edu > >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > >>> > >>> > >> > > > > > From jonmon at mcs.anl.gov Thu Aug 25 18:13:12 2011 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Thu, 25 Aug 2011 18:13:12 -0500 Subject: [Swift-devel] Swift jobs failing In-Reply-To: <1314313439.15015.0.camel@blabla> References: <7D2BE7AB-BF33-446E-A0F4-B708D325CCFA@mcs.anl.gov> <1314299488.10772.1.camel@blabla> <1314312088.14881.0.camel@blabla> <1314313439.15015.0.camel@blabla> Message-ID: <445AB9EC-97BA-4FC0-B9C3-485B07969B35@mcs.anl.gov> www.ci.uchicago.edu/~jonmon/logs/coasters.log.tar.gz On Aug 25, 2011, at 6:03 PM, Mihael Hategan wrote: > On Thu, 2011-08-25 at 17:51 -0500, Jonathan Monette wrote: >> my bad?.try www.ci.uchicago.edu/~jonmon/logs/coasters.log > > Could you gzip that? > >> >> On Aug 25, 2011, at 5:41 PM, Mihael Hategan wrote: >> >>> mike at blabla:~/tmp$ wget >>> http://www.ci.uchicago.edu/~jonmon/log/coaster.log >>> --2011-08-25 15:40:31-- >>> http://www.ci.uchicago.edu/~jonmon/log/coaster.log >>> Resolving www.ci.uchicago.edu... 192.5.86.67 >>> Connecting to www.ci.uchicago.edu|192.5.86.67|:80... connected. >>> HTTP request sent, awaiting response... 
404 Not Found >>> 2011-08-25 15:40:32 ERROR 404: Not Found. >>> >>> >>> >>> On Thu, 2011-08-25 at 16:43 -0500, Jonathan Monette wrote: >>>> Here is the link. The log is very long(has entries from 2010) but the log toward the bottom does mention that it could not register the coaster service. >>>> www.ci.uchicago.edu/~jonmon/log/coaster.log >>>> >>>> On Aug 25, 2011, at 2:11 PM, Mihael Hategan wrote: >>>> >>>>> Can you post the coaster log on the remote machine? >>>>> >>>>> >>>>> On Thu, 2011-08-25 at 14:07 -0500, Jonathan Monette wrote: >>>>>> Ok. I will check my configuration. I don't see very helpful messages >>>>>> in the log file but I will give a closer look. More information to >>>>>> follow. >>>>>> >>>>>> On Aug 25, 2011, at 2:05 PM, Ketan Maheshwari wrote: >>>>>> >>>>>>> I am using 0.93 from Communicado to OSG using persistent-coasters. I >>>>>>> do not see such messages. >>>>>>> >>>>>>> >>>>>>> On Thu, Aug 25, 2011 at 1:42 PM, Jonathan Monette >>>>>>> wrote: >>>>>>> This also seems to be happening in trunk. Is anyone seeing >>>>>>> this issue with there code? >>>>>>> >>>>>>> On Aug 25, 2011, at 12:44 PM, Jonathan Monette wrote: >>>>>>> >>>>>>>> I started a run of my SwiftMontage work and all the jobs >>>>>>> keep failing. No progress is being made and the swift >>>>>>> stdout will have the line "failed but can retry: 'some >>>>>>> number'". The log is located at >>>>>>> www.ci.uchicago.edu/~jonmon/logs/montage-20110825-1232-y104fa88.log. This is with the most recent version of 0.93. 
>>>>>>>> _______________________________________________ >>>>>>>> Swift-devel mailing list >>>>>>>> Swift-devel at ci.uchicago.edu >>>>>>>> >>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Swift-devel mailing list >>>>>>> Swift-devel at ci.uchicago.edu >>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Ketan >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> Swift-devel mailing list >>>>>> Swift-devel at ci.uchicago.edu >>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>>>> >>>>> >>>> >>> >>> >> > > From hategan at mcs.anl.gov Thu Aug 25 20:17:15 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 25 Aug 2011 18:17:15 -0700 Subject: [Swift-devel] Swift jobs failing In-Reply-To: <445AB9EC-97BA-4FC0-B9C3-485B07969B35@mcs.anl.gov> References: <7D2BE7AB-BF33-446E-A0F4-B708D325CCFA@mcs.anl.gov> <1314299488.10772.1.camel@blabla> <1314312088.14881.0.camel@blabla> <1314313439.15015.0.camel@blabla> <445AB9EC-97BA-4FC0-B9C3-485B07969B35@mcs.anl.gov> Message-ID: <1314321435.17114.5.camel@blabla> Authentication failed [Caused by: Defective credential detected [Caused by: CRL for CA "C=US,O=National Center for Supercomputing Applications,OU=Certificate Authorities,CN=MyProxy" has expired.]] Sorry, but things cannot work properly if certificates are messed up. Please complain to support at ci. In the mean time I committed a patch to print better error messages in such cases (i.e. the above message). So please test that. 
On Thu, 2011-08-25 at 18:13 -0500, Jonathan Monette wrote: > www.ci.uchicago.edu/~jonmon/logs/coasters.log.tar.gz > On Aug 25, 2011, at 6:03 PM, Mihael Hategan wrote: > > > On Thu, 2011-08-25 at 17:51 -0500, Jonathan Monette wrote: > >> my bad?.try www.ci.uchicago.edu/~jonmon/logs/coasters.log > > > > Could you gzip that? > > > >> > >> On Aug 25, 2011, at 5:41 PM, Mihael Hategan wrote: > >> > >>> mike at blabla:~/tmp$ wget > >>> http://www.ci.uchicago.edu/~jonmon/log/coaster.log > >>> --2011-08-25 15:40:31-- > >>> http://www.ci.uchicago.edu/~jonmon/log/coaster.log > >>> Resolving www.ci.uchicago.edu... 192.5.86.67 > >>> Connecting to www.ci.uchicago.edu|192.5.86.67|:80... connected. > >>> HTTP request sent, awaiting response... 404 Not Found > >>> 2011-08-25 15:40:32 ERROR 404: Not Found. > >>> > >>> > >>> > >>> On Thu, 2011-08-25 at 16:43 -0500, Jonathan Monette wrote: > >>>> Here is the link. The log is very long(has entries from 2010) but the log toward the bottom does mention that it could not register the coaster service. > >>>> www.ci.uchicago.edu/~jonmon/log/coaster.log > >>>> > >>>> On Aug 25, 2011, at 2:11 PM, Mihael Hategan wrote: > >>>> > >>>>> Can you post the coaster log on the remote machine? > >>>>> > >>>>> > >>>>> On Thu, 2011-08-25 at 14:07 -0500, Jonathan Monette wrote: > >>>>>> Ok. I will check my configuration. I don't see very helpful messages > >>>>>> in the log file but I will give a closer look. More information to > >>>>>> follow. > >>>>>> > >>>>>> On Aug 25, 2011, at 2:05 PM, Ketan Maheshwari wrote: > >>>>>> > >>>>>>> I am using 0.93 from Communicado to OSG using persistent-coasters. I > >>>>>>> do not see such messages. > >>>>>>> > >>>>>>> > >>>>>>> On Thu, Aug 25, 2011 at 1:42 PM, Jonathan Monette > >>>>>>> wrote: > >>>>>>> This also seems to be happening in trunk. Is anyone seeing > >>>>>>> this issue with there code? 
> >>>>>>> > >>>>>>> On Aug 25, 2011, at 12:44 PM, Jonathan Monette wrote: > >>>>>>> > >>>>>>>> I started a run of my SwiftMontage work and all the jobs > >>>>>>> keep failing. No progress is being made and the swift > >>>>>>> stdout will have the line "failed but can retry: 'some > >>>>>>> number'". The log is located at > >>>>>>> www.ci.uchicago.edu/~jonmon/logs/montage-20110825-1232-y104fa88.log. This is with the most recent version of 0.93. > >>>>>>>> _______________________________________________ > >>>>>>>> Swift-devel mailing list > >>>>>>>> Swift-devel at ci.uchicago.edu > >>>>>>>> > >>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > >>>>>>> > >>>>>>> _______________________________________________ > >>>>>>> Swift-devel mailing list > >>>>>>> Swift-devel at ci.uchicago.edu > >>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> -- > >>>>>>> Ketan > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>> > >>>>>> > >>>>>> _______________________________________________ > >>>>>> Swift-devel mailing list > >>>>>> Swift-devel at ci.uchicago.edu > >>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > >>>>> > >>>>> > >>>> > >>> > >>> > >> > > > > > From jonmon at mcs.anl.gov Thu Aug 25 21:15:37 2011 From: jonmon at mcs.anl.gov (=?utf-8?B?Sm9uYXRoYW4gTW9uZXR0ZQ==?=) Date: Thu, 25 Aug 2011 21:15:37 -0500 Subject: [Swift-devel] =?utf-8?q?Swift_jobs_failing?= Message-ID: <20110826021521.EBE3912824@zimbra.anl.gov> I will try a different run and get you a fresh log. I think that was when I attempted to see why David couldn't run with the NCSA proxy. Let me start a new run. 
----- Reply message ----- From: "Mihael Hategan" Date: Thu, Aug 25, 2011 8:17 pm Subject: [Swift-devel] Swift jobs failing To: "Jonathan Monette" Cc: "Ketan Maheshwari" , "swift-devel Devel" Authentication failed [Caused by: Defective credential detected [Caused by: CRL for CA "C=US,O=National Center for Supercomputing Applications,OU=Certificate Authorities,CN=MyProxy" has expired.]] Sorry, but things cannot work properly if certificates are messed up. Please complain to support at ci. In the mean time I committed a patch to print better error messages in such cases (i.e. the above message). So please test that. On Thu, 2011-08-25 at 18:13 -0500, Jonathan Monette wrote: > www.ci.uchicago.edu/~jonmon/logs/coasters.log.tar.gz > On Aug 25, 2011, at 6:03 PM, Mihael Hategan wrote: > > > On Thu, 2011-08-25 at 17:51 -0500, Jonathan Monette wrote: > >> my bad?.try www.ci.uchicago.edu/~jonmon/logs/coasters.log > > > > Could you gzip that? > > > >> > >> On Aug 25, 2011, at 5:41 PM, Mihael Hategan wrote: > >> > >>> mike at blabla:~/tmp$ wget > >>> http://www.ci.uchicago.edu/~jonmon/log/coaster.log > >>> --2011-08-25 15:40:31-- > >>> http://www.ci.uchicago.edu/~jonmon/log/coaster.log > >>> Resolving www.ci.uchicago.edu... 192.5.86.67 > >>> Connecting to www.ci.uchicago.edu|192.5.86.67|:80... connected. > >>> HTTP request sent, awaiting response... 404 Not Found > >>> 2011-08-25 15:40:32 ERROR 404: Not Found. > >>> > >>> > >>> > >>> On Thu, 2011-08-25 at 16:43 -0500, Jonathan Monette wrote: > >>>> Here is the link. The log is very long(has entries from 2010) but the log toward the bottom does mention that it could not register the coaster service. > >>>> www.ci.uchicago.edu/~jonmon/log/coaster.log > >>>> > >>>> On Aug 25, 2011, at 2:11 PM, Mihael Hategan wrote: > >>>> > >>>>> Can you post the coaster log on the remote machine? > >>>>> > >>>>> > >>>>> On Thu, 2011-08-25 at 14:07 -0500, Jonathan Monette wrote: > >>>>>> Ok. I will check my configuration. 
I don't see very helpful messages > >>>>>> in the log file but I will give a closer look. More information to > >>>>>> follow. > >>>>>> > >>>>>> On Aug 25, 2011, at 2:05 PM, Ketan Maheshwari wrote: > >>>>>> > >>>>>>> I am using 0.93 from Communicado to OSG using persistent-coasters. I > >>>>>>> do not see such messages. > >>>>>>> > >>>>>>> > >>>>>>> On Thu, Aug 25, 2011 at 1:42 PM, Jonathan Monette > >>>>>>> wrote: > >>>>>>> This also seems to be happening in trunk. Is anyone seeing > >>>>>>> this issue with there code? > >>>>>>> > >>>>>>> On Aug 25, 2011, at 12:44 PM, Jonathan Monette wrote: > >>>>>>> > >>>>>>>> I started a run of my SwiftMontage work and all the jobs > >>>>>>> keep failing. No progress is being made and the swift > >>>>>>> stdout will have the line "failed but can retry: 'some > >>>>>>> number'". The log is located at > >>>>>>> www.ci.uchicago.edu/~jonmon/logs/montage-20110825-1232-y104fa88.log. This is with the most recent version of 0.93. > >>>>>>>> _______________________________________________ > >>>>>>>> Swift-devel mailing list > >>>>>>>> Swift-devel at ci.uchicago.edu > >>>>>>>> > >>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > >>>>>>> > >>>>>>> _______________________________________________ > >>>>>>> Swift-devel mailing list > >>>>>>> Swift-devel at ci.uchicago.edu > >>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> -- > >>>>>>> Ketan > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>> > >>>>>> > >>>>>> _______________________________________________ > >>>>>> Swift-devel mailing list > >>>>>> Swift-devel at ci.uchicago.edu > >>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > >>>>> > >>>>> > >>>> > >>> > >>> > >> > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
From jonmon at mcs.anl.gov Thu Aug 25 21:28:35 2011 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Thu, 25 Aug 2011 21:28:35 -0500 Subject: [Swift-devel] Swift jobs failing In-Reply-To: <1314321435.17114.5.camel@blabla> References: <7D2BE7AB-BF33-446E-A0F4-B708D325CCFA@mcs.anl.gov> <1314299488.10772.1.camel@blabla> <1314312088.14881.0.camel@blabla> <1314313439.15015.0.camel@blabla> <445AB9EC-97BA-4FC0-B9C3-485B07969B35@mcs.anl.gov> <1314321435.17114.5.camel@blabla> Message-ID: <12E9D9B8-F542-4BE1-985C-D42BFC83EF57@mcs.anl.gov> Never mind. I found a similar error: [Caused by: Defective credential detected [Caused by: CRL for CA "DC=org,DC=DOEGrids,OU=Certificate Authorities,CN=DOEGrids CA 1" has expired.]] I will email support at ci to let them know of the situation and ask whether they can update the CAs more frequently. On Aug 25, 2011, at 8:17 PM, Mihael Hategan wrote: > Authentication failed [Caused by: Defective credential detected [Caused > by: CRL for CA "C=US,O=National Center for Supercomputing > Applications,OU=Certificate Authorities,CN=MyProxy" has expired.]] > > Sorry, but things cannot work properly if certificates are messed up. > Please complain to support at ci. > > In the mean time I committed a patch to print better error messages in > such cases (i.e. the above message). So please test that. > > On Thu, 2011-08-25 at 18:13 -0500, Jonathan Monette wrote: >> www.ci.uchicago.edu/~jonmon/logs/coasters.log.tar.gz >> On Aug 25, 2011, at 6:03 PM, Mihael Hategan wrote: >> >>> On Thu, 2011-08-25 at 17:51 -0500, Jonathan Monette wrote: >>>> my bad... try www.ci.uchicago.edu/~jonmon/logs/coasters.log >>> >>> Could you gzip that? >>> >>>> >>>> On Aug 25, 2011, at 5:41 PM, Mihael Hategan wrote: >>>> >>>>> mike at blabla:~/tmp$ wget >>>>> http://www.ci.uchicago.edu/~jonmon/log/coaster.log >>>>> --2011-08-25 15:40:31-- >>>>> http://www.ci.uchicago.edu/~jonmon/log/coaster.log >>>>> Resolving www.ci.uchicago.edu...
192.5.86.67 >>>>> Connecting to www.ci.uchicago.edu|192.5.86.67|:80... connected. >>>>> HTTP request sent, awaiting response... 404 Not Found >>>>> 2011-08-25 15:40:32 ERROR 404: Not Found. >>>>> >>>>> >>>>> >>>>> On Thu, 2011-08-25 at 16:43 -0500, Jonathan Monette wrote: >>>>>> Here is the link. The log is very long(has entries from 2010) but the log toward the bottom does mention that it could not register the coaster service. >>>>>> www.ci.uchicago.edu/~jonmon/log/coaster.log >>>>>> >>>>>> On Aug 25, 2011, at 2:11 PM, Mihael Hategan wrote: >>>>>> >>>>>>> Can you post the coaster log on the remote machine? >>>>>>> >>>>>>> >>>>>>> On Thu, 2011-08-25 at 14:07 -0500, Jonathan Monette wrote: >>>>>>>> Ok. I will check my configuration. I don't see very helpful messages >>>>>>>> in the log file but I will give a closer look. More information to >>>>>>>> follow. >>>>>>>> >>>>>>>> On Aug 25, 2011, at 2:05 PM, Ketan Maheshwari wrote: >>>>>>>> >>>>>>>>> I am using 0.93 from Communicado to OSG using persistent-coasters. I >>>>>>>>> do not see such messages. >>>>>>>>> >>>>>>>>> >>>>>>>>> On Thu, Aug 25, 2011 at 1:42 PM, Jonathan Monette >>>>>>>>> wrote: >>>>>>>>> This also seems to be happening in trunk. Is anyone seeing >>>>>>>>> this issue with there code? >>>>>>>>> >>>>>>>>> On Aug 25, 2011, at 12:44 PM, Jonathan Monette wrote: >>>>>>>>> >>>>>>>>>> I started a run of my SwiftMontage work and all the jobs >>>>>>>>> keep failing. No progress is being made and the swift >>>>>>>>> stdout will have the line "failed but can retry: 'some >>>>>>>>> number'". The log is located at >>>>>>>>> www.ci.uchicago.edu/~jonmon/logs/montage-20110825-1232-y104fa88.log. This is with the most recent version of 0.93. 
>>>>>>>>>> _______________________________________________ >>>>>>>>>> Swift-devel mailing list >>>>>>>>>> Swift-devel at ci.uchicago.edu >>>>>>>>>> >>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Swift-devel mailing list >>>>>>>>> Swift-devel at ci.uchicago.edu >>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Ketan >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Swift-devel mailing list >>>>>>>> Swift-devel at ci.uchicago.edu >>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>>>>>> >>>>>>> >>>>>> >>>>> >>>>> >>>> >>> >>> >> > > From wilde at mcs.anl.gov Thu Aug 25 22:25:09 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 25 Aug 2011 22:25:09 -0500 (CDT) Subject: [Swift-devel] Swift jobs failing In-Reply-To: <12E9D9B8-F542-4BE1-985C-D42BFC83EF57@mcs.anl.gov> Message-ID: <2059566375.250933.1314329109996.JavaMail.root@zimbra.anl.gov> The ca certs in these directories work for the NCSA CA for accessing Ranger: com$ export X509_CADIR=/home/wilde/TRUSTEDCA com$ com$ com$ export X509_CERT_DIR=/home/wilde/TRUSTEDCA com$ globus-job-run gatekeeper.ranger.tacc.teragrid.org:2119 /usr/bin/id uid=455797(tg455797) gid=80243(G-80243) groups=80243(G-80243),81031(G-81031),81411(G-81411),81611(G-81611),81613(G-81613),81621(G-81621),81747(G-81747),81792(G-81792),800744(G-800744),800745(G-800745),800889(G-800889),800981(G-800981),800983(G-800983),801271(G-801271),801364(G-801364),801525(G-801525),801551(G-801551),801694(G-801694),801708(G-801708),801758(G-801758),801759(G-801759),801897(G-801897),802865(G-802865) com$ ----- Original Message ----- > From: "Jonathan Monette" > To: "Mihael Hategan" > Cc: "swift-devel Devel" > Sent: Thursday, August 25, 2011 9:28:35 PM > Subject: Re: 
[Swift-devel] Swift jobs failing > Never mind. I found a similar error. > [Caused by: Defective credential detected [Caused by: CRL for CA > "DC=org,DC=DOEGrids,OU=Certificate Authorities,CN=DOEGrids CA 1" has > expired.]] > > I will email support at ci and let them know of the situation and if they > can up the frequency they update the CAs. > > On Aug 25, 2011, at 8:17 PM, Mihael Hategan wrote: > > > Authentication failed [Caused by: Defective credential detected > > [Caused > > by: CRL for CA "C=US,O=National Center for Supercomputing > > Applications,OU=Certificate Authorities,CN=MyProxy" has expired.]] > > > > Sorry, but things cannot work properly if certificates are messed > > up. > > Please complain to support at ci. > > > > In the mean time I committed a patch to print better error messages > > in > > such cases (i.e. the above message). So please test that. > > > > On Thu, 2011-08-25 at 18:13 -0500, Jonathan Monette wrote: > >> www.ci.uchicago.edu/~jonmon/logs/coasters.log.tar.gz > >> On Aug 25, 2011, at 6:03 PM, Mihael Hategan wrote: > >> > >>> On Thu, 2011-08-25 at 17:51 -0500, Jonathan Monette wrote: > >>>> my bad?.try www.ci.uchicago.edu/~jonmon/logs/coasters.log > >>> > >>> Could you gzip that? > >>> > >>>> > >>>> On Aug 25, 2011, at 5:41 PM, Mihael Hategan wrote: > >>>> > >>>>> mike at blabla:~/tmp$ wget > >>>>> http://www.ci.uchicago.edu/~jonmon/log/coaster.log > >>>>> --2011-08-25 15:40:31-- > >>>>> http://www.ci.uchicago.edu/~jonmon/log/coaster.log > >>>>> Resolving www.ci.uchicago.edu... 192.5.86.67 > >>>>> Connecting to www.ci.uchicago.edu|192.5.86.67|:80... connected. > >>>>> HTTP request sent, awaiting response... 404 Not Found > >>>>> 2011-08-25 15:40:32 ERROR 404: Not Found. > >>>>> > >>>>> > >>>>> > >>>>> On Thu, 2011-08-25 at 16:43 -0500, Jonathan Monette wrote: > >>>>>> Here is the link. The log is very long(has entries from 2010) > >>>>>> but the log toward the bottom does mention that it could not > >>>>>> register the coaster service. 
> >>>>>> www.ci.uchicago.edu/~jonmon/log/coaster.log > >>>>>> > >>>>>> On Aug 25, 2011, at 2:11 PM, Mihael Hategan wrote: > >>>>>> > >>>>>>> Can you post the coaster log on the remote machine? > >>>>>>> > >>>>>>> > >>>>>>> On Thu, 2011-08-25 at 14:07 -0500, Jonathan Monette wrote: > >>>>>>>> Ok. I will check my configuration. I don't see very helpful > >>>>>>>> messages > >>>>>>>> in the log file but I will give a closer look. More > >>>>>>>> information to > >>>>>>>> follow. > >>>>>>>> > >>>>>>>> On Aug 25, 2011, at 2:05 PM, Ketan Maheshwari wrote: > >>>>>>>> > >>>>>>>>> I am using 0.93 from Communicado to OSG using > >>>>>>>>> persistent-coasters. I > >>>>>>>>> do not see such messages. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Thu, Aug 25, 2011 at 1:42 PM, Jonathan Monette > >>>>>>>>> wrote: > >>>>>>>>> This also seems to be happening in trunk. Is anyone > >>>>>>>>> seeing > >>>>>>>>> this issue with there code? > >>>>>>>>> > >>>>>>>>> On Aug 25, 2011, at 12:44 PM, Jonathan Monette wrote: > >>>>>>>>> > >>>>>>>>>> I started a run of my SwiftMontage work and all the jobs > >>>>>>>>> keep failing. No progress is being made and the swift > >>>>>>>>> stdout will have the line "failed but can retry: 'some > >>>>>>>>> number'". The log is located at > >>>>>>>>> www.ci.uchicago.edu/~jonmon/logs/montage-20110825-1232-y104fa88.log. > >>>>>>>>> This is with the most recent version of 0.93. 
> >>>>>>>>>> _______________________________________________ > >>>>>>>>>> Swift-devel mailing list > >>>>>>>>>> Swift-devel at ci.uchicago.edu > >>>>>>>>>> > >>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > >>>>>>>>> > >>>>>>>>> _______________________________________________ > >>>>>>>>> Swift-devel mailing list > >>>>>>>>> Swift-devel at ci.uchicago.edu > >>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> -- > >>>>>>>>> Ketan > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> _______________________________________________ > >>>>>>>> Swift-devel mailing list > >>>>>>>> Swift-devel at ci.uchicago.edu > >>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > >>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>>> > >>>> > >>> > >>> > >> > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Thu Aug 25 22:34:49 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 25 Aug 2011 20:34:49 -0700 Subject: [Swift-devel] Swift jobs failing In-Reply-To: <2059566375.250933.1314329109996.JavaMail.root@zimbra.anl.gov> References: <2059566375.250933.1314329109996.JavaMail.root@zimbra.anl.gov> Message-ID: <1314329689.17678.0.camel@blabla> Right. That's a choice: have your own for which you update the CRLs. There was some script around to do that automatically. 
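A minimal sketch of how expired CRLs like the ones reported in this thread could be spotted ahead of time. This is not the script referred to above. It assumes the CRLs are stored as `*.r0` files in the CA directory (the usual layout for a Globus trusted-certificates directory) and that the `openssl` command-line tool is on the PATH. The function names are hypothetical and chosen only for illustration.

```python
import datetime
import glob
import os
import subprocess


def parse_next_update(openssl_line):
    """Parse a line such as 'nextUpdate=Aug 25 12:00:00 2011 GMT' into a UTC datetime."""
    value = openssl_line.strip().split("=", 1)[1]
    # openssl pads single-digit days with an extra space; collapse whitespace first.
    value = " ".join(value.split())
    parsed = datetime.datetime.strptime(value, "%b %d %H:%M:%S %Y GMT")
    return parsed.replace(tzinfo=datetime.timezone.utc)


def is_expired(next_update, now=None):
    """Return True if the CRL's nextUpdate time is already in the past."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    return next_update < now


def expired_crls(cert_dir):
    """List the *.r0 CRL files in cert_dir whose nextUpdate time has passed."""
    expired = []
    for path in sorted(glob.glob(os.path.join(cert_dir, "*.r0"))):
        # 'openssl crl -noout -nextupdate' prints a single 'nextUpdate=...' line.
        out = subprocess.run(
            ["openssl", "crl", "-in", path, "-noout", "-nextupdate"],
            capture_output=True, text=True, check=True,
        ).stdout
        if is_expired(parse_next_update(out)):
            expired.append(path)
    return expired
```

Calling `expired_crls(os.environ["X509_CERT_DIR"])` before a run would list any CRL files that are likely to trigger the "CRL for CA ... has expired" failure. Refreshing them is a separate step that this sketch does not attempt.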
On Thu, 2011-08-25 at 22:25 -0500, Michael Wilde wrote: > The ca certs in these directories work for the NCSA CA for accessing Ranger: > > com$ export X509_CADIR=/home/wilde/TRUSTEDCA > com$ > com$ > com$ export X509_CERT_DIR=/home/wilde/TRUSTEDCA > com$ globus-job-run gatekeeper.ranger.tacc.teragrid.org:2119 /usr/bin/id > uid=455797(tg455797) gid=80243(G-80243) groups=80243(G-80243),81031(G-81031),81411(G-81411),81611(G-81611),81613(G-81613),81621(G-81621),81747(G-81747),81792(G-81792),800744(G-800744),800745(G-800745),800889(G-800889),800981(G-800981),800983(G-800983),801271(G-801271),801364(G-801364),801525(G-801525),801551(G-801551),801694(G-801694),801708(G-801708),801758(G-801758),801759(G-801759),801897(G-801897),802865(G-802865) > com$ > > > > ----- Original Message ----- > > From: "Jonathan Monette" > > To: "Mihael Hategan" > > Cc: "swift-devel Devel" > > Sent: Thursday, August 25, 2011 9:28:35 PM > > Subject: Re: [Swift-devel] Swift jobs failing > > Never mind. I found a similar error. > > [Caused by: Defective credential detected [Caused by: CRL for CA > > "DC=org,DC=DOEGrids,OU=Certificate Authorities,CN=DOEGrids CA 1" has > > expired.]] > > > > I will email support at ci and let them know of the situation and if they > > can up the frequency they update the CAs. > > > > On Aug 25, 2011, at 8:17 PM, Mihael Hategan wrote: > > > > > Authentication failed [Caused by: Defective credential detected > > > [Caused > > > by: CRL for CA "C=US,O=National Center for Supercomputing > > > Applications,OU=Certificate Authorities,CN=MyProxy" has expired.]] > > > > > > Sorry, but things cannot work properly if certificates are messed > > > up. > > > Please complain to support at ci. > > > > > > In the mean time I committed a patch to print better error messages > > > in > > > such cases (i.e. the above message). So please test that. 
> > > > > > On Thu, 2011-08-25 at 18:13 -0500, Jonathan Monette wrote: > > >> www.ci.uchicago.edu/~jonmon/logs/coasters.log.tar.gz > > >> On Aug 25, 2011, at 6:03 PM, Mihael Hategan wrote: > > >> > > >>> On Thu, 2011-08-25 at 17:51 -0500, Jonathan Monette wrote: > > >>>> my bad?.try www.ci.uchicago.edu/~jonmon/logs/coasters.log > > >>> > > >>> Could you gzip that? > > >>> > > >>>> > > >>>> On Aug 25, 2011, at 5:41 PM, Mihael Hategan wrote: > > >>>> > > >>>>> mike at blabla:~/tmp$ wget > > >>>>> http://www.ci.uchicago.edu/~jonmon/log/coaster.log > > >>>>> --2011-08-25 15:40:31-- > > >>>>> http://www.ci.uchicago.edu/~jonmon/log/coaster.log > > >>>>> Resolving www.ci.uchicago.edu... 192.5.86.67 > > >>>>> Connecting to www.ci.uchicago.edu|192.5.86.67|:80... connected. > > >>>>> HTTP request sent, awaiting response... 404 Not Found > > >>>>> 2011-08-25 15:40:32 ERROR 404: Not Found. > > >>>>> > > >>>>> > > >>>>> > > >>>>> On Thu, 2011-08-25 at 16:43 -0500, Jonathan Monette wrote: > > >>>>>> Here is the link. The log is very long(has entries from 2010) > > >>>>>> but the log toward the bottom does mention that it could not > > >>>>>> register the coaster service. > > >>>>>> www.ci.uchicago.edu/~jonmon/log/coaster.log > > >>>>>> > > >>>>>> On Aug 25, 2011, at 2:11 PM, Mihael Hategan wrote: > > >>>>>> > > >>>>>>> Can you post the coaster log on the remote machine? > > >>>>>>> > > >>>>>>> > > >>>>>>> On Thu, 2011-08-25 at 14:07 -0500, Jonathan Monette wrote: > > >>>>>>>> Ok. I will check my configuration. I don't see very helpful > > >>>>>>>> messages > > >>>>>>>> in the log file but I will give a closer look. More > > >>>>>>>> information to > > >>>>>>>> follow. > > >>>>>>>> > > >>>>>>>> On Aug 25, 2011, at 2:05 PM, Ketan Maheshwari wrote: > > >>>>>>>> > > >>>>>>>>> I am using 0.93 from Communicado to OSG using > > >>>>>>>>> persistent-coasters. I > > >>>>>>>>> do not see such messages. 
> > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> On Thu, Aug 25, 2011 at 1:42 PM, Jonathan Monette > > >>>>>>>>> wrote: > > >>>>>>>>> This also seems to be happening in trunk. Is anyone > > >>>>>>>>> seeing > > >>>>>>>>> this issue with there code? > > >>>>>>>>> > > >>>>>>>>> On Aug 25, 2011, at 12:44 PM, Jonathan Monette wrote: > > >>>>>>>>> > > >>>>>>>>>> I started a run of my SwiftMontage work and all the jobs > > >>>>>>>>> keep failing. No progress is being made and the swift > > >>>>>>>>> stdout will have the line "failed but can retry: 'some > > >>>>>>>>> number'". The log is located at > > >>>>>>>>> www.ci.uchicago.edu/~jonmon/logs/montage-20110825-1232-y104fa88.log. > > >>>>>>>>> This is with the most recent version of 0.93. > > >>>>>>>>>> _______________________________________________ > > >>>>>>>>>> Swift-devel mailing list > > >>>>>>>>>> Swift-devel at ci.uchicago.edu > > >>>>>>>>>> > > >>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > >>>>>>>>> > > >>>>>>>>> _______________________________________________ > > >>>>>>>>> Swift-devel mailing list > > >>>>>>>>> Swift-devel at ci.uchicago.edu > > >>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> -- > > >>>>>>>>> Ketan > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> _______________________________________________ > > >>>>>>>> Swift-devel mailing list > > >>>>>>>> Swift-devel at ci.uchicago.edu > > >>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > >>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>>> > > >>>> > > >>> > > >>> > > >> > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > From davidk at ci.uchicago.edu Thu Aug 25 22:36:46 2011 From: davidk at ci.uchicago.edu 
(David Kelly) Date: Thu, 25 Aug 2011 22:36:46 -0500 (CDT) Subject: [Swift-devel] Notes from 0.93 meeting In-Reply-To: Message-ID: <1525902186.79404.1314329806351.JavaMail.root@zimbra-mb2.anl.gov> Thanks for the help everyone. Copying /etc/grid-security/certificates from ranger and updating my X509 variables fixed my certificate issues. I'm seeing another issue now where it's not submitting jobs correctly. I think Sarah saw something similar yesterday. I'll try to troubleshoot this a little more.. David ----- Original Message ----- > From: "Sarah Kenny" > To: "Mihael Hategan" > Cc: "David Kelly" , "swift-devel Devel" > Sent: Thursday, August 25, 2011 4:38:52 PM > Subject: Re: [Swift-devel] Notes from 0.93 meeting > communicado's certs (/etc/grid-security/certificates) are > out-of-date...if you copy ranger's /etc/grid-security/certificates > directory to communicado and point yr X509_CERT_DIR to it you can get > a job thru (a simple globus-job-run with my vaild cert fails from > communicado at the moment if i don't do this). > > i set our machines at uci to update daily...i think it's less > frequently at ci... > > > On Thu, Aug 25, 2011 at 2:17 PM, Mihael Hategan < hategan at mcs.anl.gov > > wrote: > > > Can you try a globus-url-copy to gridftp.ranger? > > gridftp.ranger seems to have the NCSA myproxy CA. You say you have the > proper certificates dir in your X509_CERT_DIR, and that directory > contains the TACC root cert. So it should work. And so should swift. > > Though I think that jglobus should be more clear about "Unknown ca" > errors. At least the name of the unknown CA should be part of the > error > message. 
> > > > > On Thu, 2011-08-25 at 15:55 -0500, David Kelly wrote: > > $ grid-proxy-info -all > > subject : /C=US/O=National Center for Supercomputing > > Applications/CN=David Kelly > > issuer : /C=US/O=National Center for Supercomputing > > Applications/OU=Certificate Authorities/CN=MyProxy > > identity : /C=US/O=National Center for Supercomputing > > Applications/CN=David Kelly > > type : end entity credential > > strength : 1024 bits > > path : /tmp/x509up_u1878 > > timeleft : 9:56:53 > > > > > > ----- Original Message ----- > > > From: "Mihael Hategan" < hategan at mcs.anl.gov > > > > To: "David Kelly" < davidk at ci.uchicago.edu > > > > Cc: "Ketan Maheshwari" < ketancmaheshwari at gmail.com >, > > > "swift-devel Devel" < swift-devel at ci.uchicago.edu > > > > Sent: Thursday, August 25, 2011 3:42:57 PM > > > Subject: Re: [Swift-devel] Notes from 0.93 meeting > > > Odd. Can you paste the output of 'grid-proxy-info -all'? > > > > > > On Thu, 2011-08-25 at 15:18 -0500, David Kelly wrote: > > > > Sure, here is the full log: > > > > > > > > http://www.ci.uchicago.edu/~davidk/001-catsn-ranger-20110825-1515-5tydro91.log > > > > > > > > ----- Original Message ----- > > > > > From: "Mihael Hategan" < hategan at mcs.anl.gov > > > > > > To: "David Kelly" < davidk at ci.uchicago.edu > > > > > > Cc: "Ketan Maheshwari" < ketancmaheshwari at gmail.com >, > > > > > "swift-devel > > > > > Devel" < swift-devel at ci.uchicago.edu > > > > > > Sent: Thursday, August 25, 2011 2:43:31 PM > > > > > Subject: Re: [Swift-devel] Notes from 0.93 meeting > > > > > It's possible that the CA dir on Ranger is not properly set > > > > > up. > > > > > Can > > > > > you > > > > > post the full log? > > > > > > > > > > On Thu, 2011-08-25 at 13:56 -0500, David Kelly wrote: > > > > > > Those environment variables were not set up. I have them > > > > > > defined > > > > > > now, but I'm still getting the same error. 
> > > > > > > > > > > > [davidk at communicado ranger]$ env |grep 509 > > > > > > X509_CERT_DIR=/opt/osg-1.2.16/globus/TRUSTED_CA > > > > > > X509_CADIR=/opt/osg-1.2.16/globus/TRUSTED_CA > > > > > > > > > > > > [davidk at communicado ranger]$ swift -sites.file sites.xml > > > > > > -tc.file > > > > > > tc.data 001-catsn-ranger.swift > > > > > > Swift svn swift-r4987 (swift modified locally) cog-r3229 > > > > > > > > > > > > RunID: 20110825-1352-f1v940b4 > > > > > > Progress: time: Thu, 25 Aug 2011 13:52:59 -0500 > > > > > > Progress: time: Thu, 25 Aug 2011 13:53:00 -0500 Selecting > > > > > > site:7 > > > > > > Initializing site shared directory:3 > > > > > > Execution failed: > > > > > > Authentication failed [Caused by: Failure unspecified at > > > > > > GSS-API > > > > > > level [Caused by: Unknown CA]] > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > From: "Ketan Maheshwari" < ketancmaheshwari at gmail.com > > > > > > > > To: "David Kelly" < davidk at ci.uchicago.edu > > > > > > > > Cc: "Jonathan Monette" < jonmon at mcs.anl.gov >, > > > > > > > "swift-devel > > > > > > > Devel" > > > > > > > < swift-devel at ci.uchicago.edu > > > > > > > > Sent: Thursday, August 25, 2011 1:32:50 PM > > > > > > > Subject: Re: [Swift-devel] Notes from 0.93 meeting > > > > > > > Hi, > > > > > > > > > > > > > > > > > > > > > Are your CADIR and CACERT env vars set up? 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > [communicado:swiftgrid]$ echo $X509_CADIR > > > > > > > /opt/osg-1.2.16/globus/TRUSTED_CA > > > > > > > > > > > > > > > > > > > > > > > > > > > > [communicado:swiftgrid]$ echo $X509_CERT_DIR > > > > > > > /opt/osg-1.2.16/globus/TRUSTED_CA > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Aug 25, 2011 at 1:29 PM, David Kelly < > > > > > > > davidk at ci.uchicago.edu > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > Thanks Jon, > > > > > > > > > > > > > > Here is what happens when I try this from communicado: > > > > > > > > > > > > > > [davidk at communicado ~]$ myproxy-logon -l dkelly -s > > > > > > > myproxy.teragrid.org > > > > > > > Enter MyProxy pass phrase: > > > > > > > A credential has been received for user dkelly in > > > > > > > /tmp/x509up_u1878. > > > > > > > > > > > > > > [davidk at communicado ranger]$ swift -sites.file sites.xml > > > > > > > -tc.file > > > > > > > tc.data 001-catsn-ranger.swift > > > > > > > Swift svn swift-r4987 (swift modified locally) cog-r3229 > > > > > > > > > > > > > > RunID: 20110825-1326-o3e38fe0 > > > > > > > Progress: time: Thu, 25 Aug 2011 13:26:43 -0500 > > > > > > > Progress: time: Thu, 25 Aug 2011 13:26:44 -0500 Selecting > > > > > > > site:8 > > > > > > > Initializing site shared directory:2 > > > > > > > Execution failed: > > > > > > > Authentication failed [Caused by: Failure unspecified at > > > > > > > GSS-API > > > > > > > level > > > > > > > [Caused by: Unknown CA]] > > > > > > > > > > > > > > Any ideas? 
> > > > > > > > > > > > > > Thanks, > > > > > > > David > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > Swift-devel mailing list > > > > > > > Swift-devel at ci.uchicago.edu > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > Ketan > > > > > > _______________________________________________ > > > > > > Swift-devel mailing list > > > > > > Swift-devel at ci.uchicago.edu > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > -- > Sarah Kenny > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > University of California Irvine, Dept. of Neurology ~ 773-818-8300 From turam at mcs.anl.gov Fri Aug 26 13:38:50 2011 From: turam at mcs.anl.gov (Thomas Uram) Date: Fri, 26 Aug 2011 13:38:50 -0500 Subject: [Swift-devel] auth.defaults In-Reply-To: <1281574720.27901.12.camel@blabla2.none> References: <1281557936.27901.3.camel@blabla2.none> <1281574720.27901.12.camel@blabla2.none> Message-ID: <6BD56C67-9AC3-4242-99AE-3C3D4D3EA0E7@mcs.anl.gov> In a portal scenario, where the portal is executing jobs on behalf of users and, therefore, shuffling identities on the backend, this functionality would be useful. Has work on this progressed at all? Tom On Aug 11, 2010, at 7:58 PM, Mihael Hategan wrote: > On Wed, 2010-08-11 at 17:08 -0400, David Kelly wrote: >> On Wed, Aug 11, 2010 at 4:18 PM, Mihael Hategan >> wrote: >> >> I don't think I understand what you mean by "per-host >> auth.defaults". >> >> What I mean by that is, each configuration generated by swiftconfig >> could have it's own unique auth.defaults file like it has it's own >> sites.xml. 
Then you could run something like "swift >> -auth.file /mypath/auth.defaults" rather than requiring it to be >> stored in ~/.ssh. > > It's possible, but I don't think it's worth the effort. That file, like > authorized_keys or known_hosts is not meant to contain frequently > changing information, since passwords and usernames are not generally > a moving target (nor are host keys). > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From davidk at ci.uchicago.edu Fri Aug 26 13:38:56 2011 From: davidk at ci.uchicago.edu (David Kelly) Date: Fri, 26 Aug 2011 13:38:56 -0500 (CDT) Subject: [Swift-devel] Notes from 0.93 meeting In-Reply-To: <1314311494.14675.2.camel@blabla> Message-ID: <1956829718.80203.1314383936947.JavaMail.root@zimbra-mb2.anl.gov> When I try to run the script now, Swift does not seem to be submitting the jobs correctly. Nothing is showing up in qstat. I noticed that a gram log gets created in my home directory that says: ts=2011-08-26T17:30:03.910618Z id=27215 event=gram.job.end level=ERROR gramid=/16145868447994515851/17606392074284884670/ job_status=4 status=-73 reason="the job manager failed to open stdout" I'm guessing this is the cause of the problem. Bugs #153 and #215 were related to similar problems with stdout and gt2/sge. The full logs are at http://www.ci.uchicago.edu/~davidk/ranger-gt2-logs.tar.gz Thanks, David ----- Original Message ----- > From: "Mihael Hategan" > To: "Jonathan Monette" > Cc: "swift-devel Devel" > Sent: Thursday, August 25, 2011 5:31:34 PM > Subject: Re: [Swift-devel] Notes from 0.93 meeting > On Thu, 2011-08-25 at 17:18 -0500, Jonathan Monette wrote: > > I can send mail to ci support and cc mike to it and ask what they > > can > > do. > > > > Mihael, is there any way for Swift to give a little more feedback > > besides unknown CA or is that a jglobus problem? > > It's a jglobus problem.
> > That in itself may not be a big issue, but jglobus is now being > heavily > re-organized by the globus team, so I'm not sure what the best > long-term > strategy is here. > > > > ----- Reply message ----- > > From: "Sarah Kenny" > > Date: Thu, Aug 25, 2011 5:11 pm > > Subject: [Swift-devel] Notes from 0.93 meeting > > To: "Jonathan Monette" > > Cc: "Mihael Hategan" , "swift-devel Devel" > > > > > > > > > > if i had a nickel for every time i dealt with this i'd be rich! :) > > actually, now that i'm looking at our uci machines i actually have > > them updating hourly...so, maybe you want to ask the admins to do > > that > > to avoid a full day of confusion whenever they expire :P > > > > *usually* i can't gsissh either if the certs have expired but, yeah, > > they must be using different CA's now for that on ranger as mihael > > suggests... > > > > On Thu, Aug 25, 2011 at 2:46 PM, Jonathan Monette > > > > wrote: > > True. I did not think that each mechanism would use > > different > > CAs. We might want to ask ci support to update the grid > > certs > > more frequently then to avoid this situation. > > > > > > On Aug 25, 2011, at 4:42 PM, Mihael Hategan wrote: > > > > > On Thu, 2011-08-25 at 16:40 -0500, Jonathan Monette wrote: > > >> That is weird. If you were able to gsissh to ranger I > > would assume > > >> that you are able to globus-url-copy to ranger. > > > > > > Not if the two use different CAs. Or if a password was > > > typed > > at the ssh > > > login. > > > > > >> Anyways, what Sarah said should work. I would assume > > >> that > > ci would > > >> update more frequently to avoid this problem. 
> > >> On Aug 25, 2011, at 4:38 PM, Sarah Kenny wrote: > > >> > > >>> communicado's certs (/etc/grid-security/certificates) > > >>> are > > >>> out-of-date...if you copy > > ranger's /etc/grid-security/certificates > > >>> directory to communicado and point yr X509_CERT_DIR to > > >>> it > > you can > > >>> get a job thru (a simple globus-job-run with my vaild > > >>> cert > > fails > > >>> from communicado at the moment if i don't do this). > > >>> > > >>> i set our machines at uci to update daily...i think it's > > less > > >>> frequently at ci... > > >>> > > >>> On Thu, Aug 25, 2011 at 2:17 PM, Mihael Hategan > > >>> wrote: > > >>> Can you try a globus-url-copy to gridftp.ranger? > > >>> > > >>> gridftp.ranger seems to have the NCSA myproxy CA. > > You say > > >>> you have the > > >>> proper certificates dir in your X509_CERT_DIR, > > >>> and > > that > > >>> directory > > >>> contains the TACC root cert. So it should work. > > >>> And > > so > > >>> should swift. > > >>> > > >>> Though I think that jglobus should be more clear > > about > > >>> "Unknown ca" > > >>> errors. At least the name of the unknown CA > > >>> should > > be part > > >>> of the error > > >>> message. 
> > >>> > > >>> > > >>> On Thu, 2011-08-25 at 15:55 -0500, David Kelly > > wrote: > > >>>> $ grid-proxy-info -all > > >>>> subject : /C=US/O=National Center for Supercomputing > > >>> Applications/CN=David Kelly > > >>>> issuer : /C=US/O=National Center for Supercomputing > > >>> Applications/OU=Certificate > > >>> Authorities/CN=MyProxy > > >>>> identity : /C=US/O=National Center for Supercomputing > > >>> Applications/CN=David Kelly > > >>>> type : end entity credential > > >>>> strength : 1024 bits > > >>>> path : /tmp/x509up_u1878 > > >>>> timeleft : 9:56:53 > > >>>> > > >>>> > > >>>> ----- Original Message ----- > > >>>>> From: "Mihael Hategan" > > >>>>> To: "David Kelly" > > >>>>> Cc: "Ketan Maheshwari" , > > >>> "swift-devel Devel" > > >>>>> Sent: Thursday, August 25, 2011 3:42:57 PM > > >>>>> Subject: Re: [Swift-devel] Notes from 0.93 meeting > > >>>>> Odd. Can you paste the output of 'grid-proxy-info > > >>>>> -all'? > > >>>>> > > >>>>> On Thu, 2011-08-25 at 15:18 -0500, David Kelly wrote: > > >>>>>> Sure, here is the full log: > > >>>>>> > > >>>>>> > > >>> > > http://www.ci.uchicago.edu/~davidk/001-catsn-ranger-20110825-1515-5tydro91.log > > >>>>>> > > >>>>>> ----- Original Message ----- > > >>>>>>> From: "Mihael Hategan" > > >>>>>>> To: "David Kelly" > > >>>>>>> Cc: "Ketan Maheshwari" , > > >>> "swift-devel > > >>>>>>> Devel" > > >>>>>>> Sent: Thursday, August 25, 2011 2:43:31 PM > > >>>>>>> Subject: Re: [Swift-devel] Notes from 0.93 meeting > > >>>>>>> It's possible that the CA dir on Ranger is not > > >>> properly set up. > > >>>>>>> Can > > >>>>>>> you > > >>>>>>> post the full log? > > >>>>>>> > > >>>>>>> On Thu, 2011-08-25 at 13:56 -0500, David Kelly > > >>> wrote: > > >>>>>>>> Those environment variables were not set up. I > > >>> have them defined > > >>>>>>>> now, but I'm still getting the same error. 
> > >>>>>>>> > > >>>>>>>> [davidk at communicado ranger]$ env |grep 509 > > >>>>>>>> X509_CERT_DIR=/opt/osg-1.2.16/globus/TRUSTED_CA > > >>>>>>>> X509_CADIR=/opt/osg-1.2.16/globus/TRUSTED_CA > > >>>>>>>> > > >>>>>>>> [davidk at communicado ranger]$ swift -sites.file > > >>> sites.xml > > >>>>>>>> -tc.file > > >>>>>>>> tc.data 001-catsn-ranger.swift > > >>>>>>>> Swift svn swift-r4987 (swift modified locally) > > >>> cog-r3229 > > >>>>>>>> > > >>>>>>>> RunID: 20110825-1352-f1v940b4 > > >>>>>>>> Progress: time: Thu, 25 Aug 2011 13:52:59 -0500 > > >>>>>>>> Progress: time: Thu, 25 Aug 2011 13:53:00 -0500 > > >>> Selecting site:7 > > >>>>>>>> Initializing site shared directory:3 > > >>>>>>>> Execution failed: > > >>>>>>>> Authentication failed [Caused by: Failure > > >>> unspecified at > > >>>>>>>> GSS-API > > >>>>>>>> level [Caused by: Unknown CA]] > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> ----- Original Message ----- > > >>>>>>>>> From: "Ketan Maheshwari" > > >>> > > >>>>>>>>> To: "David Kelly" > > >>>>>>>>> Cc: "Jonathan Monette" , > > >>> "swift-devel > > >>>>>>>>> Devel" > > >>>>>>>>> > > >>>>>>>>> Sent: Thursday, August 25, 2011 1:32:50 PM > > >>>>>>>>> Subject: Re: [Swift-devel] Notes from 0.93 > > >>> meeting > > >>>>>>>>> Hi, > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> Are your CADIR and CACERT env vars set up? 
> > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> [communicado:swiftgrid]$ echo $X509_CADIR > > >>>>>>>>> /opt/osg-1.2.16/globus/TRUSTED_CA > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> [communicado:swiftgrid]$ echo $X509_CERT_DIR > > >>>>>>>>> /opt/osg-1.2.16/globus/TRUSTED_CA > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> On Thu, Aug 25, 2011 at 1:29 PM, David Kelly < > > >>>>>>>>> davidk at ci.uchicago.edu > > >>>>>>>>>> wrote: > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> Thanks Jon, > > >>>>>>>>> > > >>>>>>>>> Here is what happens when I try this from > > >>> communicado: > > >>>>>>>>> > > >>>>>>>>> [davidk at communicado ~]$ myproxy-logon -l dkelly > > >>> -s > > >>>>>>>>> myproxy.teragrid.org > > >>>>>>>>> Enter MyProxy pass phrase: > > >>>>>>>>> A credential has been received for user dkelly > > >>> in > > >>>>>>>>> /tmp/x509up_u1878. > > >>>>>>>>> > > >>>>>>>>> [davidk at communicado ranger]$ swift -sites.file > > >>> sites.xml > > >>>>>>>>> -tc.file > > >>>>>>>>> tc.data 001-catsn-ranger.swift > > >>>>>>>>> Swift svn swift-r4987 (swift modified locally) > > >>> cog-r3229 > > >>>>>>>>> > > >>>>>>>>> RunID: 20110825-1326-o3e38fe0 > > >>>>>>>>> Progress: time: Thu, 25 Aug 2011 13:26:43 -0500 > > >>>>>>>>> Progress: time: Thu, 25 Aug 2011 13:26:44 -0500 > > >>> Selecting > > >>>>>>>>> site:8 > > >>>>>>>>> Initializing site shared directory:2 > > >>>>>>>>> Execution failed: > > >>>>>>>>> Authentication failed [Caused by: Failure > > >>> unspecified at > > >>>>>>>>> GSS-API > > >>>>>>>>> level > > >>>>>>>>> [Caused by: Unknown CA]] > > >>>>>>>>> > > >>>>>>>>> Any ideas? 
> > >>>>>>>>> > > >>>>>>>>> Thanks, > > >>>>>>>>> David > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> _______________________________________________ > > >>>>>>>>> Swift-devel mailing list > > >>>>>>>>> Swift-devel at ci.uchicago.edu > > >>>>>>>>> > > >>> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> -- > > >>>>>>>>> Ketan > > >>>>>>>> _______________________________________________ > > >>>>>>>> Swift-devel mailing list > > >>>>>>>> Swift-devel at ci.uchicago.edu > > >>>>>>>> > > >>> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > >>> > > >>> > > >>> _______________________________________________ > > >>> Swift-devel mailing list > > >>> Swift-devel at ci.uchicago.edu > > >>> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > >>> > > >>> > > >>> > > >>> > > >>> -- > > >>> Sarah Kenny > > >>> Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci > > III > > >>> University of California Irvine, Dept. of Neurology ~ > > 773-818-8300 > > >>> > > >>> _______________________________________________ > > >>> Swift-devel mailing list > > >>> Swift-devel at ci.uchicago.edu > > >>> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > >> > > > > > > > > > > > > > > > > > > -- > > Sarah Kenny > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > > University of California Irvine, Dept. 
of Neurology ~ 773-818-8300 > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Fri Aug 26 13:42:13 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 26 Aug 2011 11:42:13 -0700 Subject: [Swift-devel] Notes from 0.93 meeting In-Reply-To: <1956829718.80203.1314383936947.JavaMail.root@zimbra-mb2.anl.gov> References: <1956829718.80203.1314383936947.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <1314384133.19481.0.camel@blabla> "The job manager failed to open stderr" tends to happen when you have GLOBUS_HOSTNAME set incorrectly. On Fri, 2011-08-26 at 13:38 -0500, David Kelly wrote: > When I try to run the script now, Swift does not seem to be submitting the jobs correctly. Nothing is showing up in qstat. > > I noticed that a gram log gets created in my home directory that says: > ts=2011-08-26T17:30:03.910618Z id=27215 event=gram.job.end level=ERROR gramid=/16145868447994515851/17606392074284884670/ job_status=4 status=-73 reason="the job manager failed to open stdout" > > I'm guessing this is the cause of the problem. Bugs #153 and #215 were related to similar problems with stdout and gt2/sge. > > The full logs are at http://www.ci.uchicago.edu/~davidk/ranger-gt2-logs.tar.gz > > Thanks, > David > > > ----- Original Message ----- > > From: "Mihael Hategan" > > To: "Jonathan Monette" > > Cc: "swift-devel Devel" > > Sent: Thursday, August 25, 2011 5:31:34 PM > > Subject: Re: [Swift-devel] Notes from 0.93 meeting > > On Thu, 2011-08-25 at 17:18 -0500, Jonathan Monette wrote: > > > I can send mail to ci support and cc mike to it and ask what they > > > can > > > do. > > > > > > Mihael, is there any way for Swift to give a little more feedback > > > besides unknown CA or is that a jglobus problem? > > > > It's a jglobus problem.
> > > > That in itself may not be a big issue, but jglobus is now being > > heavily > > re-organized by the globus team, so I'm not sure what the best > > long-term > > strategy is here. > > > > > > ----- Reply message ----- > > > From: "Sarah Kenny" > > > Date: Thu, Aug 25, 2011 5:11 pm > > > Subject: [Swift-devel] Notes from 0.93 meeting > > > To: "Jonathan Monette" > > > Cc: "Mihael Hategan" , "swift-devel Devel" > > > > > > > > > > > > > > > if i had a nickel for every time i dealt with this i'd be rich! :) > > > actually, now that i'm looking at our uci machines i actually have > > > them updating hourly...so, maybe you want to ask the admins to do > > > that > > > to avoid a full day of confusion whenever they expire :P > > > > > > *usually* i can't gsissh either if the certs have expired but, yeah, > > > they must be using different CA's now for that on ranger as mihael > > > suggests... > > > > > > On Thu, Aug 25, 2011 at 2:46 PM, Jonathan Monette > > > > > > wrote: > > > True. I did not think that each mechanism would use > > > different > > > CAs. We might want to ask ci support to update the grid > > > certs > > > more frequently then to avoid this situation. > > > > > > > > > On Aug 25, 2011, at 4:42 PM, Mihael Hategan wrote: > > > > > > > On Thu, 2011-08-25 at 16:40 -0500, Jonathan Monette wrote: > > > >> That is weird. If you were able to gsissh to ranger I > > > would assume > > > >> that you are able to globus-url-copy to ranger. > > > > > > > > Not if the two use different CAs. Or if a password was > > > > typed > > > at the ssh > > > > login. > > > > > > > >> Anyways, what Sarah said should work. I would assume > > > >> that > > > ci would > > > >> update more frequently to avoid this problem. 
> > > >> On Aug 25, 2011, at 4:38 PM, Sarah Kenny wrote: > > > >> > > > >>> communicado's certs (/etc/grid-security/certificates) > > > >>> are > > > >>> out-of-date...if you copy > > > ranger's /etc/grid-security/certificates > > > >>> directory to communicado and point yr X509_CERT_DIR to > > > >>> it > > > you can > > > >>> get a job thru (a simple globus-job-run with my vaild > > > >>> cert > > > fails > > > >>> from communicado at the moment if i don't do this). > > > >>> > > > >>> i set our machines at uci to update daily...i think it's > > > less > > > >>> frequently at ci... > > > >>> > > > >>> On Thu, Aug 25, 2011 at 2:17 PM, Mihael Hategan > > > >>> wrote: > > > >>> Can you try a globus-url-copy to gridftp.ranger? > > > >>> > > > >>> gridftp.ranger seems to have the NCSA myproxy CA. > > > You say > > > >>> you have the > > > >>> proper certificates dir in your X509_CERT_DIR, > > > >>> and > > > that > > > >>> directory > > > >>> contains the TACC root cert. So it should work. > > > >>> And > > > so > > > >>> should swift. > > > >>> > > > >>> Though I think that jglobus should be more clear > > > about > > > >>> "Unknown ca" > > > >>> errors. At least the name of the unknown CA > > > >>> should > > > be part > > > >>> of the error > > > >>> message. 
> > > >>> > > > >>> > > > >>> On Thu, 2011-08-25 at 15:55 -0500, David Kelly > > > wrote: > > > >>>> $ grid-proxy-info -all > > > >>>> subject : /C=US/O=National Center for Supercomputing > > > >>> Applications/CN=David Kelly > > > >>>> issuer : /C=US/O=National Center for Supercomputing > > > >>> Applications/OU=Certificate > > > >>> Authorities/CN=MyProxy > > > >>>> identity : /C=US/O=National Center for Supercomputing > > > >>> Applications/CN=David Kelly > > > >>>> type : end entity credential > > > >>>> strength : 1024 bits > > > >>>> path : /tmp/x509up_u1878 > > > >>>> timeleft : 9:56:53 > > > >>>> > > > >>>> > > > >>>> ----- Original Message ----- > > > >>>>> From: "Mihael Hategan" > > > >>>>> To: "David Kelly" > > > >>>>> Cc: "Ketan Maheshwari" , > > > >>> "swift-devel Devel" > > > >>>>> Sent: Thursday, August 25, 2011 3:42:57 PM > > > >>>>> Subject: Re: [Swift-devel] Notes from 0.93 meeting > > > >>>>> Odd. Can you paste the output of 'grid-proxy-info > > > >>>>> -all'? > > > >>>>> > > > >>>>> On Thu, 2011-08-25 at 15:18 -0500, David Kelly wrote: > > > >>>>>> Sure, here is the full log: > > > >>>>>> > > > >>>>>> > > > >>> > > > http://www.ci.uchicago.edu/~davidk/001-catsn-ranger-20110825-1515-5tydro91.log > > > >>>>>> > > > >>>>>> ----- Original Message ----- > > > >>>>>>> From: "Mihael Hategan" > > > >>>>>>> To: "David Kelly" > > > >>>>>>> Cc: "Ketan Maheshwari" , > > > >>> "swift-devel > > > >>>>>>> Devel" > > > >>>>>>> Sent: Thursday, August 25, 2011 2:43:31 PM > > > >>>>>>> Subject: Re: [Swift-devel] Notes from 0.93 meeting > > > >>>>>>> It's possible that the CA dir on Ranger is not > > > >>> properly set up. > > > >>>>>>> Can > > > >>>>>>> you > > > >>>>>>> post the full log? > > > >>>>>>> > > > >>>>>>> On Thu, 2011-08-25 at 13:56 -0500, David Kelly > > > >>> wrote: > > > >>>>>>>> Those environment variables were not set up. I > > > >>> have them defined > > > >>>>>>>> now, but I'm still getting the same error. 
> > > >>>>>>>> > > > >>>>>>>> [davidk at communicado ranger]$ env |grep 509 > > > >>>>>>>> X509_CERT_DIR=/opt/osg-1.2.16/globus/TRUSTED_CA > > > >>>>>>>> X509_CADIR=/opt/osg-1.2.16/globus/TRUSTED_CA > > > >>>>>>>> > > > >>>>>>>> [davidk at communicado ranger]$ swift -sites.file > > > >>> sites.xml > > > >>>>>>>> -tc.file > > > >>>>>>>> tc.data 001-catsn-ranger.swift > > > >>>>>>>> Swift svn swift-r4987 (swift modified locally) > > > >>> cog-r3229 > > > >>>>>>>> > > > >>>>>>>> RunID: 20110825-1352-f1v940b4 > > > >>>>>>>> Progress: time: Thu, 25 Aug 2011 13:52:59 -0500 > > > >>>>>>>> Progress: time: Thu, 25 Aug 2011 13:53:00 -0500 > > > >>> Selecting site:7 > > > >>>>>>>> Initializing site shared directory:3 > > > >>>>>>>> Execution failed: > > > >>>>>>>> Authentication failed [Caused by: Failure > > > >>> unspecified at > > > >>>>>>>> GSS-API > > > >>>>>>>> level [Caused by: Unknown CA]] > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> ----- Original Message ----- > > > >>>>>>>>> From: "Ketan Maheshwari" > > > >>> > > > >>>>>>>>> To: "David Kelly" > > > >>>>>>>>> Cc: "Jonathan Monette" , > > > >>> "swift-devel > > > >>>>>>>>> Devel" > > > >>>>>>>>> > > > >>>>>>>>> Sent: Thursday, August 25, 2011 1:32:50 PM > > > >>>>>>>>> Subject: Re: [Swift-devel] Notes from 0.93 > > > >>> meeting > > > >>>>>>>>> Hi, > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> Are your CADIR and CACERT env vars set up? 
> > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> [communicado:swiftgrid]$ echo $X509_CADIR > > > >>>>>>>>> /opt/osg-1.2.16/globus/TRUSTED_CA > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> [communicado:swiftgrid]$ echo $X509_CERT_DIR > > > >>>>>>>>> /opt/osg-1.2.16/globus/TRUSTED_CA > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> On Thu, Aug 25, 2011 at 1:29 PM, David Kelly < > > > >>>>>>>>> davidk at ci.uchicago.edu > > > >>>>>>>>>> wrote: > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> Thanks Jon, > > > >>>>>>>>> > > > >>>>>>>>> Here is what happens when I try this from > > > >>> communicado: > > > >>>>>>>>> > > > >>>>>>>>> [davidk at communicado ~]$ myproxy-logon -l dkelly > > > >>> -s > > > >>>>>>>>> myproxy.teragrid.org > > > >>>>>>>>> Enter MyProxy pass phrase: > > > >>>>>>>>> A credential has been received for user dkelly > > > >>> in > > > >>>>>>>>> /tmp/x509up_u1878. > > > >>>>>>>>> > > > >>>>>>>>> [davidk at communicado ranger]$ swift -sites.file > > > >>> sites.xml > > > >>>>>>>>> -tc.file > > > >>>>>>>>> tc.data 001-catsn-ranger.swift > > > >>>>>>>>> Swift svn swift-r4987 (swift modified locally) > > > >>> cog-r3229 > > > >>>>>>>>> > > > >>>>>>>>> RunID: 20110825-1326-o3e38fe0 > > > >>>>>>>>> Progress: time: Thu, 25 Aug 2011 13:26:43 -0500 > > > >>>>>>>>> Progress: time: Thu, 25 Aug 2011 13:26:44 -0500 > > > >>> Selecting > > > >>>>>>>>> site:8 > > > >>>>>>>>> Initializing site shared directory:2 > > > >>>>>>>>> Execution failed: > > > >>>>>>>>> Authentication failed [Caused by: Failure > > > >>> unspecified at > > > >>>>>>>>> GSS-API > > > >>>>>>>>> level > > > >>>>>>>>> [Caused by: Unknown CA]] > > > >>>>>>>>> > > > >>>>>>>>> Any ideas? 
> > > >>>>>>>>> > > > >>>>>>>>> Thanks, > > > >>>>>>>>> David > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> _______________________________________________ > > > >>>>>>>>> Swift-devel mailing list > > > >>>>>>>>> Swift-devel at ci.uchicago.edu > > > >>>>>>>>> > > > >>> > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> -- > > > >>>>>>>>> Ketan > > > >>>>>>>> _______________________________________________ > > > >>>>>>>> Swift-devel mailing list > > > >>>>>>>> Swift-devel at ci.uchicago.edu > > > >>>>>>>> > > > >>> > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > >>> > > > >>> > > > >>> _______________________________________________ > > > >>> Swift-devel mailing list > > > >>> Swift-devel at ci.uchicago.edu > > > >>> > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > >>> > > > >>> > > > >>> > > > >>> > > > >>> -- > > > >>> Sarah Kenny > > > >>> Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci > > > III > > > >>> University of California Irvine, Dept. of Neurology ~ > > > 773-818-8300 > > > >>> > > > >>> _______________________________________________ > > > >>> Swift-devel mailing list > > > >>> Swift-devel at ci.uchicago.edu > > > >>> > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > Sarah Kenny > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > > > University of California Irvine, Dept. 
of Neurology ~ 773-818-8300 > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

From jonmon at mcs.anl.gov Fri Aug 26 14:48:59 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Fri, 26 Aug 2011 14:48:59 -0500
Subject: [Swift-devel] auth.defaults
Message-ID: <20110826194843.D624F12400@zimbra.anl.gov>

I do not think so. Mike and I have brainstormed a couple of workarounds for auth.defaults before, such as using ssh master channels. Is this so the user could upload an auth.defaults on a per-site basis during a run?

----- Reply message -----
From: "Thomas Uram"
Date: Fri, Aug 26, 2011 1:38 pm
Subject: [Swift-devel] auth.defaults
To: "Mihael Hategan"
Cc: "swift-devel", "David Kelly"

In a portal scenario, where the portal is executing jobs on behalf of users and, therefore, shuffling identities on the backend, this functionality would be useful. Has work on this progressed at all?

Tom

On Aug 11, 2010, at 7:58 PM, Mihael Hategan wrote:

> On Wed, 2010-08-11 at 17:08 -0400, David Kelly wrote:
>> On Wed, Aug 11, 2010 at 4:18 PM, Mihael Hategan wrote:
>>
>> I don't think I understand what you mean by "per-host auth.defaults".
>>
>> What I mean by that is, each configuration generated by swiftconfig could have its own unique auth.defaults file like it has its own sites.xml. Then you could run something like "swift -auth.file /mypath/auth.defaults" rather than requiring it to be stored in ~/.ssh.
>
> It's possible, but I don't think it's worth the effort. That file, like authorized_keys or known_hosts, is not meant to contain frequently changing information, since passwords and usernames are not generally a moving target (nor are host keys).
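
In the portal scenario above, a single backend account would have to swap the contents of one auth.defaults file in ~/.ssh as it switches between users. One possible workaround, until something like a per-run auth file exists, is to serialize access to that file with a lock. The Python sketch below is only an illustration and not a Swift feature: the helper name exclusive_auth_defaults, the ".lock" companion file, and the placeholder contents are assumptions of this note, and it relies on fcntl, so it is Unix-only.

```python
import contextlib
import fcntl
import os


@contextlib.contextmanager
def exclusive_auth_defaults(path, contents):
    """Hold an exclusive lock for as long as `path` contains `contents`.

    Hypothetical helper: `path` would be the auth.defaults location and
    `contents` whatever text the portal already generates for one user.
    """
    lock_path = path + ".lock"
    with open(lock_path, "w") as lock_file:
        # Blocks until any other run holding the lock has finished.
        fcntl.flock(lock_file, fcntl.LOCK_EX)

        # Remember what was there so it can be put back afterwards.
        previous = None
        if os.path.exists(path):
            with open(path) as existing:
                previous = existing.read()

        # Create the file readable by the owner only, then write the new text.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        with os.fdopen(fd, "w") as out:
            out.write(contents)

        try:
            yield path
        finally:
            if previous is None:
                os.remove(path)
            else:
                with open(path, "w") as out:
                    out.write(previous)
        # The lock is released when lock_file is closed on leaving the block.

# Intended use (sketch): keep the lock for the whole run, for example
#   with exclusive_auth_defaults(auth_path, text_for_this_user):
#       ...launch the swift run for this user and wait for it...
```

The obvious cost is that runs for different users no longer overlap, so this would only suit a portal with light load; it also assumes that the file is read only while the lock is held.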

From turam at mcs.anl.gov Fri Aug 26 15:26:36 2011
From: turam at mcs.anl.gov (Thomas Uram)
Date: Fri, 26 Aug 2011 15:26:36 -0500
Subject: [Swift-devel] auth.defaults
In-Reply-To: <20110826194843.D624F12400@zimbra.anl.gov>
References: <20110826194843.D624F12400@zimbra.anl.gov>
Message-ID: <44D8CB68-DBC7-427C-8702-3097E6F8B836@mcs.anl.gov>

No. GPSI generates a keypair and builds a related auth.defaults file on the fly. Because it will inevitably run jobs for multiple users, auth.defaults collisions will occur. I'm not doing anything to prevent this now, but need to.

On Aug 26, 2011, at 2:48 PM, Jonathan Monette wrote:

> I do not think so. Mike and I have brainstormed a couple of workarounds for auth.defaults before, such as using ssh master channels. Is this so the user could upload an auth.defaults on a per-site basis during a run?

From davidk at ci.uchicago.edu Fri Aug 26 18:52:16 2011
From: davidk at ci.uchicago.edu (David Kelly)
Date: Fri, 26 Aug 2011 18:52:16 -0500 (CDT)
Subject: [Swift-devel] Notes from 0.93 meeting
In-Reply-To: <1314384133.19481.0.camel@blabla>
Message-ID: <492042179.80666.1314402736159.JavaMail.root@zimbra-mb2.anl.gov>

I tried setting GLOBUS_HOSTNAME on communicado. The gram log file is no longer created, but I still don't see any jobs being submitted.

There is a new set of logs at www.ci.uchicago.edu/~davidk/ranger-gt2-logs2.tar.gz

David

----- Original Message -----
> From: "Mihael Hategan"
> To: "David Kelly"
> Cc: "swift-devel Devel", "Jonathan Monette"
> Sent: Friday, August 26, 2011 1:42:13 PM
> Subject: Re: [Swift-devel] Notes from 0.93 meeting
> "The job manager failed to open stderr" tends to happen when you have
> GLOBUS_HOSTNAME set incorrectly.
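
A quick way to rule out the GLOBUS_HOSTNAME problem mentioned above is to check, on the submit host, that the variable is set and that it resolves to an address a remote job manager could plausibly connect back to. The sketch below is only an illustration in Python: the function name check_globus_hostname and the loopback test are assumptions of this note, not part of Swift or Globus, and passing the check does not prove that the remote site can actually reach the submit host.

```python
import os
import socket


def check_globus_hostname(env, resolve=socket.gethostbyname):
    """Return (ok, message) describing whether GLOBUS_HOSTNAME looks usable.

    `env` is a mapping such as os.environ. `resolve` is injectable so the
    check can be exercised without touching the network.
    """
    value = env.get("GLOBUS_HOSTNAME", "").strip()
    if not value:
        return False, "GLOBUS_HOSTNAME is not set"
    try:
        address = resolve(value)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return False, f"GLOBUS_HOSTNAME={value!r} does not resolve: {exc}"
    # Assumption: a loopback address cannot be reached from the remote site.
    if address.startswith("127."):
        return False, f"GLOBUS_HOSTNAME={value!r} resolves to loopback {address}"
    return True, f"GLOBUS_HOSTNAME={value!r} resolves to {address}"


if __name__ == "__main__":
    ok, message = check_globus_hostname(os.environ)
    print(("OK: " if ok else "PROBLEM: ") + message)
```

Running it in the same shell that launches swift shows the value the run would inherit.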

From jonmon at mcs.anl.gov Fri Aug 26 18:54:29 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Fri, 26 Aug 2011 18:54:29 -0500
Subject: [Swift-devel] Notes from 0.93 meeting
In-Reply-To: <492042179.80666.1314402736159.JavaMail.root@zimbra-mb2.anl.gov>
References: <492042179.80666.1314402736159.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID: <8C445C78-4C72-4307-91F1-464CEEBA6808@mcs.anl.gov>

Did you set GLOBUS_HOSTNAME to communicado.ci.uchicago.edu or, probably better, to the IP address of communicado?

On Aug 26, 2011, at 6:52 PM, David Kelly wrote:

> I tried setting GLOBUS_HOSTNAME on communicado. The gram log file is no longer created, but I still don't see any jobs being submitted.
>
> There is a new set of logs at www.ci.uchicago.edu/~davidk/ranger-gt2-logs2.tar.gz
>
> David
>
> ----- Original Message -----
>> From: "Mihael Hategan"
>> To: "David Kelly"
>> Cc: "swift-devel Devel", "Jonathan Monette"
>> Sent: Friday, August 26, 2011 1:42:13 PM
>> Subject: Re: [Swift-devel] Notes from 0.93 meeting
>> "The job manager failed to open stderr" tends to happen when you have
>> GLOBUS_HOSTNAME set incorrectly.
>>
>> On Fri, 2011-08-26 at 13:38 -0500, David Kelly wrote:
>>> When I am trying to run the script now, Swift does not seem to be
>>> submitting the jobs correctly. Nothing is showing up in qstat.
>>>
>>> I noticed that a gram log gets created in my home directory that
>>> says:
>>> ts=2011-08-26T17:30:03.910618Z id=27215 event=gram.job.end level=ERROR gramid=/16145868447994515851/17606392074284884670/ job_status=4 status=-73 reason="the job manager failed to open stdout"
>>>
>>> I'm guessing this is the cause of the problem.
>>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>> David >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>> Swift-devel mailing list >>>>>>>>>>>>>> Swift-devel at ci.uchicago.edu >>>>>>>>>>>>>> >>>>>>>> >>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> Ketan >>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>> Swift-devel mailing list >>>>>>>>>>>>> Swift-devel at ci.uchicago.edu >>>>>>>>>>>>> >>>>>>>> >>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>>>>>>> >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Swift-devel mailing list >>>>>>>> Swift-devel at ci.uchicago.edu >>>>>>>> >>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Sarah Kenny >>>>>>>> Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio >>>>>>>> Sci >>>>> III >>>>>>>> University of California Irvine, Dept. of Neurology >>>>>>>> ~ >>>>> 773-818-8300 >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Swift-devel mailing list >>>>>>>> Swift-devel at ci.uchicago.edu >>>>>>>> >>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> Sarah Kenny >>>>> Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III >>>>> University of California Irvine, Dept. 
of Neurology ~
>>>>> 773-818-8300
>>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Swift-devel mailing list
>>>> Swift-devel at ci.uchicago.edu
>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

From davidk at ci.uchicago.edu Fri Aug 26 19:13:58 2011
From: davidk at ci.uchicago.edu (David Kelly)
Date: Fri, 26 Aug 2011 19:13:58 -0500 (CDT)
Subject: [Swift-devel] Notes from 0.93 meeting
In-Reply-To: <8C445C78-4C72-4307-91F1-464CEEBA6808@mcs.anl.gov>
Message-ID: <44646787.80670.1314404038440.JavaMail.root@zimbra-mb2.anl.gov>

I set it to communicado.ci.uchicago.edu. I'll try again with IP address.

----- Original Message -----
> From: "Jonathan Monette"
> To: "David Kelly"
> Cc: "Mihael Hategan" , "swift-devel Devel"
> Sent: Friday, August 26, 2011 6:54:29 PM
> Subject: Re: [Swift-devel] Notes from 0.93 meeting
> Did you set GLOBUS_HOSTNAME to communicado.ci.uchicago.edu or probably
> better the ip-address of communicado?
> On Aug 26, 2011, at 6:52 PM, David Kelly wrote:
>
> > I tried setting GLOBUS_HOSTNAME on communicado. The gram log file is
> > no longer created, but I still don't see any jobs being submitted?
> >
> > There is a new set of logs at
> > www.ci.uchicago.edu/~davidk/ranger-gt2-logs2.tar.gz
> >
> > David
> >
> > ----- Original Message -----
> >> From: "Mihael Hategan"
> >> To: "David Kelly"
> >> Cc: "swift-devel Devel" , "Jonathan
> >> Monette"
> >> Sent: Friday, August 26, 2011 1:42:13 PM
> >> Subject: Re: [Swift-devel] Notes from 0.93 meeting
> >> "The job manager failed to open stderr" tends to happen when you
> >> have
> >> GLOBUS_HOSTNAME set incorrectly.
> >>
> >> On Fri, 2011-08-26 at 13:38 -0500, David Kelly wrote:
> >>> When I am trying to run the script now, Swift does not seem to be
> >>> submitting the jobs correctly. Nothing it showing up in qstat.
> >>> > >>> I noticed that a gram log gets created in my home directory that > >>> says: > >>> ts=2011-08-26T17:30:03.910618Z id=27215 event=gram.job.end > >>> level=ERROR gramid=/16145868447994515851/17606392074284884670/ > >>> job_status=4 status=-73 reason="the job manager failed to open > >>> stdout" > >>> > >>> I'm guessing this is the cause of the problem. Bugs #153 and #215 > >>> were related to similar problems with stdout and gt2/sge. > >>> > >>> The full logs are at > >>> http://www.ci.uchicago.edu/~davidk/ranger-gt2-logs.tar.gz > >>> > >>> Thanks, > >>> David > >>> > >>> > >>> ----- Original Message ----- > >>>> From: "Mihael Hategan" > >>>> To: "Jonathan Monette" > >>>> Cc: "swift-devel Devel" > >>>> Sent: Thursday, August 25, 2011 5:31:34 PM > >>>> Subject: Re: [Swift-devel] Notes from 0.93 meeting > >>>> On Thu, 2011-08-25 at 17:18 -0500, Jonathan Monette wrote: > >>>>> I can send mail to ci support and cc mike to it and ask what > >>>>> they > >>>>> can > >>>>> do. > >>>>> > >>>>> Mihael, is there anyway for Swift to give a little more feedback > >>>>> besides unknown CA or is that a jglobus problem? > >>>> > >>>> It's a jglobus problem. > >>>> > >>>> That in itself may not be a big issue, but jglobus is now being > >>>> heavily > >>>> re-organized by the globus team, so I'm not sure what the best > >>>> long-term > >>>> strategy is here. > >>>>> > >>>>> ----- Reply message ----- > >>>>> From: "Sarah Kenny" > >>>>> Date: Thu, Aug 25, 2011 5:11 pm > >>>>> Subject: [Swift-devel] Notes from 0.93 meeting > >>>>> To: "Jonathan Monette" > >>>>> Cc: "Mihael Hategan" , "swift-devel Devel" > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> if i had a nickel for every time i dealt with this i'd be rich! 
> >>>>> :) > >>>>> actually, now that i'm looking at our uci machines i actually > >>>>> have > >>>>> them updating hourly...so, maybe you want to ask the admins to > >>>>> do > >>>>> that > >>>>> to avoid a full day of confusion whenever they expire :P > >>>>> > >>>>> *usually* i can't gsissh either if the certs have expired but, > >>>>> yeah, > >>>>> they must be using different CA's now for that on ranger as > >>>>> mihael > >>>>> suggests... > >>>>> > >>>>> On Thu, Aug 25, 2011 at 2:46 PM, Jonathan Monette > >>>>> > >>>>> wrote: > >>>>> True. I did not think that each mechanism would use > >>>>> different > >>>>> CAs. We might want to ask ci support to update the grid > >>>>> certs > >>>>> more frequently then to avoid this situation. > >>>>> > >>>>> > >>>>> On Aug 25, 2011, at 4:42 PM, Mihael Hategan wrote: > >>>>> > >>>>>> On Thu, 2011-08-25 at 16:40 -0500, Jonathan Monette > >>>>>> wrote: > >>>>>>> That is weird. If you were able to gsissh to ranger I > >>>>> would assume > >>>>>>> that you are able to globus-url-copy to ranger. > >>>>>> > >>>>>> Not if the two use different CAs. Or if a password was > >>>>>> typed > >>>>> at the ssh > >>>>>> login. > >>>>>> > >>>>>>> Anyways, what Sarah said should work. I would assume > >>>>>>> that > >>>>> ci would > >>>>>>> update more frequently to avoid this problem. > >>>>>>> On Aug 25, 2011, at 4:38 PM, Sarah Kenny wrote: > >>>>>>> > >>>>>>>> communicado's certs > >>>>>>>> (/etc/grid-security/certificates) > >>>>>>>> are > >>>>>>>> out-of-date...if you copy > >>>>> ranger's /etc/grid-security/certificates > >>>>>>>> directory to communicado and point yr X509_CERT_DIR > >>>>>>>> to > >>>>>>>> it > >>>>> you can > >>>>>>>> get a job thru (a simple globus-job-run with my > >>>>>>>> vaild > >>>>>>>> cert > >>>>> fails > >>>>>>>> from communicado at the moment if i don't do this). > >>>>>>>> > >>>>>>>> i set our machines at uci to update daily...i think > >>>>>>>> it's > >>>>> less > >>>>>>>> frequently at ci... 

From jonmon at mcs.anl.gov Fri Aug 26 20:02:30 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Fri, 26 Aug 2011 20:02:30 -0500
Subject: [Swift-devel] Notes from 0.93 meeting
Message-ID: <20110827010215.1318E124A8@zimbra.anl.gov>

I don't think it matters but it was a thought. When I was working with swift and globus online I had to set it to an IP address and not the dns name. But that might have been for a different reason.

----- Reply message -----
From: "David Kelly"
Date: Fri, Aug 26, 2011 7:13 pm
Subject: [Swift-devel] Notes from 0.93 meeting
To: "Jonathan Monette"
Cc: "Mihael Hategan" , "swift-devel Devel"

I set it to communicado.ci.uchicago.edu. I'll try again with IP address.

----- Original Message -----
> From: "Jonathan Monette"
> To: "David Kelly"
> Cc: "Mihael Hategan" , "swift-devel Devel"
> Sent: Friday, August 26, 2011 6:54:29 PM
> Subject: Re: [Swift-devel] Notes from 0.93 meeting
> Did you set GLOBUS_HOSTNAME to communicado.ci.uchicago.edu or probably
> better the ip-address of communicado?
> On Aug 26, 2011, at 6:52 PM, David Kelly wrote:
>
> > I tried setting GLOBUS_HOSTNAME on communicado. The gram log file is
> > no longer created, but I still don't see any jobs being submitted?
> >
> > There is a new set of logs at
> > www.ci.uchicago.edu/~davidk/ranger-gt2-logs2.tar.gz
> >
> > David
> >
> > ----- Original Message -----
> >> From: "Mihael Hategan"
> >> To: "David Kelly"
> >> Cc: "swift-devel Devel" , "Jonathan
> >> Monette"
> >> Sent: Friday, August 26, 2011 1:42:13 PM
> >> Subject: Re: [Swift-devel] Notes from 0.93 meeting
> >> "The job manager failed to open stderr" tends to happen when you
> >> have
> >> GLOBUS_HOSTNAME set incorrectly.

From wilde at mcs.anl.gov Fri Aug 26 20:09:59 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 26 Aug 2011 20:09:59 -0500 (CDT)
Subject: [Swift-devel] Notes from 0.93 meeting
In-Reply-To: <20110827010215.1318E124A8@zimbra.anl.gov>
Message-ID: <1108908741.254125.1314407399688.JavaMail.root@zimbra.anl.gov>

David, Jon - see also this thread for possible pointers:

http://lists.ci.uchicago.edu/pipermail/swift-devel/2011-July/008541.html

- Mike

----- Original Message -----
From: "Jonathan Monette"
To: "David Kelly"
Cc: "swift-devel Devel"
Sent: Friday, August 26, 2011 8:02:30 PM
Subject: Re: [Swift-devel] Notes from 0.93 meeting

I don't think it matters but it was a thought. When I was working with swift and globus online I had to set it to an IP address and not the dns name. But that might have been for a different reason.

----- Reply message -----
From: "David Kelly"
Date: Fri, Aug 26, 2011 7:13 pm
Subject: [Swift-devel] Notes from 0.93 meeting
To: "Jonathan Monette"
Cc: "Mihael Hategan" , "swift-devel Devel"

I set it to communicado.ci.uchicago.edu. I'll try again with IP address.

----- Original Message -----
> From: "Jonathan Monette"
> To: "David Kelly"
> Cc: "Mihael Hategan" , "swift-devel Devel"
> Sent: Friday, August 26, 2011 6:54:29 PM
> Subject: Re: [Swift-devel] Notes from 0.93 meeting
> Did you set GLOBUS_HOSTNAME to communicado.ci.uchicago.edu or probably
> better the ip-address of communicado?
> On Aug 26, 2011, at 6:52 PM, David Kelly wrote:
>
> > I tried setting GLOBUS_HOSTNAME on communicado. The gram log file is
> > no longer created, but I still don't see any jobs being submitted?
> > > > There is a new set of logs at > > www.ci.uchicago.edu/~davidk/ranger-gt2-logs2.tar.gz > > > > David > > > > ----- Original Message ----- > >> From: "Mihael Hategan" > >> To: "David Kelly" > >> Cc: "swift-devel Devel" , "Jonathan > >> Monette" > >> Sent: Friday, August 26, 2011 1:42:13 PM > >> Subject: Re: [Swift-devel] Notes from 0.93 meeting > >> "The job manager failed to open stderr" tends to happen when you > >> have > >> GLOBUS_HOSTNAME set incorrectly. > >> > >> On Fri, 2011-08-26 at 13:38 -0500, David Kelly wrote: > >>> When I am trying to run the script now, Swift does not seem to be > >>> submitting the jobs correctly. Nothing it showing up in qstat. > >>> > >>> I noticed that a gram log gets created in my home directory that > >>> says: > >>> ts=2011-08-26T17:30:03.910618Z id=27215 event=gram.job.end > >>> level=ERROR gramid=/16145868447994515851/17606392074284884670/ > >>> job_status=4 status=-73 reason="the job manager failed to open > >>> stdout" > >>> > >>> I'm guessing this is the cause of the problem. Bugs #153 and #215 > >>> were related to similar problems with stdout and gt2/sge. > >>> > >>> The full logs are at > >>> http://www.ci.uchicago.edu/~davidk/ranger-gt2-logs.tar.gz > >>> > >>> Thanks, > >>> David > >>> > >>> > >>> ----- Original Message ----- > >>>> From: "Mihael Hategan" > >>>> To: "Jonathan Monette" > >>>> Cc: "swift-devel Devel" > >>>> Sent: Thursday, August 25, 2011 5:31:34 PM > >>>> Subject: Re: [Swift-devel] Notes from 0.93 meeting > >>>> On Thu, 2011-08-25 at 17:18 -0500, Jonathan Monette wrote: > >>>>> I can send mail to ci support and cc mike to it and ask what > >>>>> they > >>>>> can > >>>>> do. > >>>>> > >>>>> Mihael, is there anyway for Swift to give a little more feedback > >>>>> besides unknown CA or is that a jglobus problem? > >>>> > >>>> It's a jglobus problem. 
> >>>> > >>>> That in itself may not be a big issue, but jglobus is now being > >>>> heavily > >>>> re-organized by the globus team, so I'm not sure what the best > >>>> long-term > >>>> strategy is here. > >>>>> > >>>>> ----- Reply message ----- > >>>>> From: "Sarah Kenny" > >>>>> Date: Thu, Aug 25, 2011 5:11 pm > >>>>> Subject: [Swift-devel] Notes from 0.93 meeting > >>>>> To: "Jonathan Monette" > >>>>> Cc: "Mihael Hategan" , "swift-devel Devel" > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> if i had a nickel for every time i dealt with this i'd be rich! > >>>>> :) > >>>>> actually, now that i'm looking at our uci machines i actually > >>>>> have > >>>>> them updating hourly...so, maybe you want to ask the admins to > >>>>> do > >>>>> that > >>>>> to avoid a full day of confusion whenever they expire :P > >>>>> > >>>>> *usually* i can't gsissh either if the certs have expired but, > >>>>> yeah, > >>>>> they must be using different CA's now for that on ranger as > >>>>> mihael > >>>>> suggests... > >>>>> > >>>>> On Thu, Aug 25, 2011 at 2:46 PM, Jonathan Monette > >>>>> > >>>>> wrote: > >>>>> True. I did not think that each mechanism would use > >>>>> different > >>>>> CAs. We might want to ask ci support to update the grid > >>>>> certs > >>>>> more frequently then to avoid this situation. > >>>>> > >>>>> > >>>>> On Aug 25, 2011, at 4:42 PM, Mihael Hategan wrote: > >>>>> > >>>>>> On Thu, 2011-08-25 at 16:40 -0500, Jonathan Monette > >>>>>> wrote: > >>>>>>> That is weird. If you were able to gsissh to ranger I > >>>>> would assume > >>>>>>> that you are able to globus-url-copy to ranger. > >>>>>> > >>>>>> Not if the two use different CAs. Or if a password was > >>>>>> typed > >>>>> at the ssh > >>>>>> login. > >>>>>> > >>>>>>> Anyways, what Sarah said should work. I would assume > >>>>>>> that > >>>>> ci would > >>>>>>> update more frequently to avoid this problem. 
> >>>>>>> On Aug 25, 2011, at 4:38 PM, Sarah Kenny wrote: > >>>>>>> > >>>>>>>> communicado's certs > >>>>>>>> (/etc/grid-security/certificates) > >>>>>>>> are > >>>>>>>> out-of-date...if you copy > >>>>> ranger's /etc/grid-security/certificates > >>>>>>>> directory to communicado and point yr X509_CERT_DIR > >>>>>>>> to > >>>>>>>> it > >>>>> you can > >>>>>>>> get a job thru (a simple globus-job-run with my > >>>>>>>> vaild > >>>>>>>> cert > >>>>> fails > >>>>>>>> from communicado at the moment if i don't do this). > >>>>>>>> > >>>>>>>> i set our machines at uci to update daily...i think > >>>>>>>> it's > >>>>> less > >>>>>>>> frequently at ci... > >>>>>>>> > >>>>>>>> On Thu, Aug 25, 2011 at 2:17 PM, Mihael Hategan > >>>>>>>> wrote: > >>>>>>>> Can you try a globus-url-copy to > >>>>>>>> gridftp.ranger? > >>>>>>>> > >>>>>>>> gridftp.ranger seems to have the NCSA myproxy > >>>>>>>> CA. > >>>>> You say > >>>>>>>> you have the > >>>>>>>> proper certificates dir in your > >>>>>>>> X509_CERT_DIR, > >>>>>>>> and > >>>>> that > >>>>>>>> directory > >>>>>>>> contains the TACC root cert. So it should > >>>>>>>> work. > >>>>>>>> And > >>>>> so > >>>>>>>> should swift. > >>>>>>>> > >>>>>>>> Though I think that jglobus should be more > >>>>>>>> clear > >>>>> about > >>>>>>>> "Unknown ca" > >>>>>>>> errors. At least the name of the unknown CA > >>>>>>>> should > >>>>> be part > >>>>>>>> of the error > >>>>>>>> message. 
> >>>>>>>> > >>>>>>>> > >>>>>>>> On Thu, 2011-08-25 at 15:55 -0500, David > >>>>>>>> Kelly > >>>>> wrote: > >>>>>>>>> $ grid-proxy-info -all > >>>>>>>>> subject : /C=US/O=National Center for > >>>>>>>>> Supercomputing > >>>>>>>> Applications/CN=David Kelly > >>>>>>>>> issuer : /C=US/O=National Center for Supercomputing > >>>>>>>> Applications/OU=Certificate > >>>>>>>> Authorities/CN=MyProxy > >>>>>>>>> identity : /C=US/O=National Center for > >>>>>>>>> Supercomputing > >>>>>>>> Applications/CN=David Kelly > >>>>>>>>> type : end entity credential > >>>>>>>>> strength : 1024 bits > >>>>>>>>> path : /tmp/x509up_u1878 > >>>>>>>>> timeleft : 9:56:53 > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> ----- Original Message ----- > >>>>>>>>>> From: "Mihael Hategan" > >>>>>>>>>> To: "David Kelly" > >>>>>>>>>> Cc: "Ketan Maheshwari" > >>>>>>>>>> , > >>>>>>>> "swift-devel Devel" > >>>>>>>> > >>>>>>>>>> Sent: Thursday, August 25, 2011 3:42:57 PM > >>>>>>>>>> Subject: Re: [Swift-devel] Notes from 0.93 meeting > >>>>>>>>>> Odd. Can you paste the output of 'grid-proxy-info > >>>>>>>>>> -all'? > >>>>>>>>>> > >>>>>>>>>> On Thu, 2011-08-25 at 15:18 -0500, David Kelly > >>>>>>>>>> wrote: > >>>>>>>>>>> Sure, here is the full log: > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>> > >>>>> http://www.ci.uchicago.edu/~davidk/001-catsn-ranger-20110825-1515-5tydro91.log > >>>>>>>>>>> > >>>>>>>>>>> ----- Original Message ----- > >>>>>>>>>>>> From: "Mihael Hategan" > >>>>>>>>>>>> To: "David Kelly" > >>>>>>>>>>>> Cc: "Ketan Maheshwari" > >>>>>>>>>>>> , > >>>>>>>> "swift-devel > >>>>>>>>>>>> Devel" > >>>>>>>>>>>> Sent: Thursday, August 25, 2011 2:43:31 PM > >>>>>>>>>>>> Subject: Re: [Swift-devel] Notes from 0.93 > >>>>>>>>>>>> meeting > >>>>>>>>>>>> It's possible that the CA dir on Ranger is not > >>>>>>>> properly set up. > >>>>>>>>>>>> Can > >>>>>>>>>>>> you > >>>>>>>>>>>> post the full log? 
> >>>>>>>>>>>> > >>>>>>>>>>>> On Thu, 2011-08-25 at 13:56 -0500, David Kelly > >>>>>>>> wrote: > >>>>>>>>>>>>> Those environment variables were not set up. I > >>>>>>>> have them defined > >>>>>>>>>>>>> now, but I'm still getting the same error. > >>>>>>>>>>>>> > >>>>>>>>>>>>> [davidk at communicado ranger]$ env |grep 509 > >>>>>>>>>>>>> X509_CERT_DIR=/opt/osg-1.2.16/globus/TRUSTED_CA > >>>>>>>>>>>>> X509_CADIR=/opt/osg-1.2.16/globus/TRUSTED_CA > >>>>>>>>>>>>> > >>>>>>>>>>>>> [davidk at communicado ranger]$ swift -sites.file > >>>>>>>> sites.xml > >>>>>>>>>>>>> -tc.file > >>>>>>>>>>>>> tc.data 001-catsn-ranger.swift > >>>>>>>>>>>>> Swift svn swift-r4987 (swift modified locally) > >>>>>>>> cog-r3229 > >>>>>>>>>>>>> > >>>>>>>>>>>>> RunID: 20110825-1352-f1v940b4 > >>>>>>>>>>>>> Progress: time: Thu, 25 Aug 2011 13:52:59 -0500 > >>>>>>>>>>>>> Progress: time: Thu, 25 Aug 2011 13:53:00 -0500 > >>>>>>>> Selecting site:7 > >>>>>>>>>>>>> Initializing site shared directory:3 > >>>>>>>>>>>>> Execution failed: > >>>>>>>>>>>>> Authentication failed [Caused by: Failure > >>>>>>>> unspecified at > >>>>>>>>>>>>> GSS-API > >>>>>>>>>>>>> level [Caused by: Unknown CA]] > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> ----- Original Message ----- > >>>>>>>>>>>>>> From: "Ketan Maheshwari" > >>>>>>>> > >>>>>>>>>>>>>> To: "David Kelly" > >>>>>>>>>>>>>> Cc: "Jonathan Monette" , > >>>>>>>> "swift-devel > >>>>>>>>>>>>>> Devel" > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Sent: Thursday, August 25, 2011 1:32:50 PM > >>>>>>>>>>>>>> Subject: Re: [Swift-devel] Notes from 0.93 > >>>>>>>> meeting > >>>>>>>>>>>>>> Hi, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Are your CADIR and CACERT env vars set up? 
> >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> [communicado:swiftgrid]$ echo $X509_CADIR > >>>>>>>>>>>>>> /opt/osg-1.2.16/globus/TRUSTED_CA > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> [communicado:swiftgrid]$ echo $X509_CERT_DIR > >>>>>>>>>>>>>> /opt/osg-1.2.16/globus/TRUSTED_CA > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Thu, Aug 25, 2011 at 1:29 PM, David Kelly < > >>>>>>>>>>>>>> davidk at ci.uchicago.edu > >>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Thanks Jon, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Here is what happens when I try this from > >>>>>>>> communicado: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> [davidk at communicado ~]$ myproxy-logon -l > >>>>>>>>>>>>>> dkelly > >>>>>>>> -s > >>>>>>>>>>>>>> myproxy.teragrid.org > >>>>>>>>>>>>>> Enter MyProxy pass phrase: > >>>>>>>>>>>>>> A credential has been received for user dkelly > >>>>>>>> in > >>>>>>>>>>>>>> /tmp/x509up_u1878. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> [davidk at communicado ranger]$ swift -sites.file > >>>>>>>> sites.xml > >>>>>>>>>>>>>> -tc.file > >>>>>>>>>>>>>> tc.data 001-catsn-ranger.swift > >>>>>>>>>>>>>> Swift svn swift-r4987 (swift modified locally) > >>>>>>>> cog-r3229 > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> RunID: 20110825-1326-o3e38fe0 > >>>>>>>>>>>>>> Progress: time: Thu, 25 Aug 2011 13:26:43 > >>>>>>>>>>>>>> -0500 > >>>>>>>>>>>>>> Progress: time: Thu, 25 Aug 2011 13:26:44 > >>>>>>>>>>>>>> -0500 > >>>>>>>> Selecting > >>>>>>>>>>>>>> site:8 > >>>>>>>>>>>>>> Initializing site shared directory:2 > >>>>>>>>>>>>>> Execution failed: > >>>>>>>>>>>>>> Authentication failed [Caused by: Failure > >>>>>>>> unspecified at > >>>>>>>>>>>>>> GSS-API > >>>>>>>>>>>>>> level > >>>>>>>>>>>>>> [Caused by: Unknown CA]] > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Any ideas? 
> >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>> David > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> _______________________________________________ > >>>>>>>>>>>>>> Swift-devel mailing list > >>>>>>>>>>>>>> Swift-devel at ci.uchicago.edu > >>>>>>>>>>>>>> > >>>>>>>> > >>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> -- > >>>>>>>>>>>>>> Ketan > >>>>>>>>>>>>> _______________________________________________ > >>>>>>>>>>>>> Swift-devel mailing list > >>>>>>>>>>>>> Swift-devel at ci.uchicago.edu > >>>>>>>>>>>>> > >>>>>>>> > >>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > >>>>>>>> > >>>>>>>> > >>>>>>>> _______________________________________________ > >>>>>>>> Swift-devel mailing list > >>>>>>>> Swift-devel at ci.uchicago.edu > >>>>>>>> > >>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> -- > >>>>>>>> Sarah Kenny > >>>>>>>> Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio > >>>>>>>> Sci > >>>>> III > >>>>>>>> University of California Irvine, Dept. of Neurology > >>>>>>>> ~ > >>>>> 773-818-8300 > >>>>>>>> > >>>>>>>> _______________________________________________ > >>>>>>>> Swift-devel mailing list > >>>>>>>> Swift-devel at ci.uchicago.edu > >>>>>>>> > >>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > >>>>>>> > >>>>>> > >>>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> -- > >>>>> Sarah Kenny > >>>>> Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > >>>>> University of California Irvine, Dept. 
of Neurology ~
> >>>>> 773-818-8300
> >>>>>
> >>>>
> >>>>
> >>>> _______________________________________________
> >>>> Swift-devel mailing list
> >>>> Swift-devel at ci.uchicago.edu
> >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov Fri Aug 26 20:23:14 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 26 Aug 2011 18:23:14 -0700
Subject: [Swift-devel] Notes from 0.93 meeting
In-Reply-To: <44646787.80670.1314404038440.JavaMail.root@zimbra-mb2.anl.gov>
References: <44646787.80670.1314404038440.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID: <1314408194.21762.0.camel@blabla>

Can you try GT2:SGE instead of GT2:GT2:SGE?

On Fri, 2011-08-26 at 19:13 -0500, David Kelly wrote:
> I set it to communicado.ci.uchicago.edu. I'll try again with the IP address.
>
> ----- Original Message -----
> > From: "Jonathan Monette"
> > To: "David Kelly"
> > Cc: "Mihael Hategan", "swift-devel Devel"
> > Sent: Friday, August 26, 2011 6:54:29 PM
> > Subject: Re: [Swift-devel] Notes from 0.93 meeting
> > Did you set GLOBUS_HOSTNAME to communicado.ci.uchicago.edu or, probably
> > better, the IP address of communicado?
> > On Aug 26, 2011, at 6:52 PM, David Kelly wrote:
> >
> > > I tried setting GLOBUS_HOSTNAME on communicado. The GRAM log file is
> > > no longer created, but I still don't see any jobs being submitted.
> > > > > > There is a new set of logs at > > > www.ci.uchicago.edu/~davidk/ranger-gt2-logs2.tar.gz > > > > > > David > > > > > > ----- Original Message ----- > > >> From: "Mihael Hategan" > > >> To: "David Kelly" > > >> Cc: "swift-devel Devel" , "Jonathan > > >> Monette" > > >> Sent: Friday, August 26, 2011 1:42:13 PM > > >> Subject: Re: [Swift-devel] Notes from 0.93 meeting > > >> "The job manager failed to open stderr" tends to happen when you > > >> have > > >> GLOBUS_HOSTNAME set incorrectly. > > >> > > >> On Fri, 2011-08-26 at 13:38 -0500, David Kelly wrote: > > >>> When I am trying to run the script now, Swift does not seem to be > > >>> submitting the jobs correctly. Nothing it showing up in qstat. > > >>> > > >>> I noticed that a gram log gets created in my home directory that > > >>> says: > > >>> ts=2011-08-26T17:30:03.910618Z id=27215 event=gram.job.end > > >>> level=ERROR gramid=/16145868447994515851/17606392074284884670/ > > >>> job_status=4 status=-73 reason="the job manager failed to open > > >>> stdout" > > >>> > > >>> I'm guessing this is the cause of the problem. Bugs #153 and #215 > > >>> were related to similar problems with stdout and gt2/sge. > > >>> > > >>> The full logs are at > > >>> http://www.ci.uchicago.edu/~davidk/ranger-gt2-logs.tar.gz > > >>> > > >>> Thanks, > > >>> David > > >>> > > >>> > > >>> ----- Original Message ----- > > >>>> From: "Mihael Hategan" > > >>>> To: "Jonathan Monette" > > >>>> Cc: "swift-devel Devel" > > >>>> Sent: Thursday, August 25, 2011 5:31:34 PM > > >>>> Subject: Re: [Swift-devel] Notes from 0.93 meeting > > >>>> On Thu, 2011-08-25 at 17:18 -0500, Jonathan Monette wrote: > > >>>>> I can send mail to ci support and cc mike to it and ask what > > >>>>> they > > >>>>> can > > >>>>> do. > > >>>>> > > >>>>> Mihael, is there anyway for Swift to give a little more feedback > > >>>>> besides unknown CA or is that a jglobus problem? > > >>>> > > >>>> It's a jglobus problem. 
> > >>>> > > >>>> That in itself may not be a big issue, but jglobus is now being > > >>>> heavily > > >>>> re-organized by the globus team, so I'm not sure what the best > > >>>> long-term > > >>>> strategy is here. > > >>>>> > > >>>>> ----- Reply message ----- > > >>>>> From: "Sarah Kenny" > > >>>>> Date: Thu, Aug 25, 2011 5:11 pm > > >>>>> Subject: [Swift-devel] Notes from 0.93 meeting > > >>>>> To: "Jonathan Monette" > > >>>>> Cc: "Mihael Hategan" , "swift-devel Devel" > > >>>>> > > >>>>> > > >>>>> > > >>>>> > > >>>>> if i had a nickel for every time i dealt with this i'd be rich! > > >>>>> :) > > >>>>> actually, now that i'm looking at our uci machines i actually > > >>>>> have > > >>>>> them updating hourly...so, maybe you want to ask the admins to > > >>>>> do > > >>>>> that > > >>>>> to avoid a full day of confusion whenever they expire :P > > >>>>> > > >>>>> *usually* i can't gsissh either if the certs have expired but, > > >>>>> yeah, > > >>>>> they must be using different CA's now for that on ranger as > > >>>>> mihael > > >>>>> suggests... > > >>>>> > > >>>>> On Thu, Aug 25, 2011 at 2:46 PM, Jonathan Monette > > >>>>> > > >>>>> wrote: > > >>>>> True. I did not think that each mechanism would use > > >>>>> different > > >>>>> CAs. We might want to ask ci support to update the grid > > >>>>> certs > > >>>>> more frequently then to avoid this situation. > > >>>>> > > >>>>> > > >>>>> On Aug 25, 2011, at 4:42 PM, Mihael Hategan wrote: > > >>>>> > > >>>>>> On Thu, 2011-08-25 at 16:40 -0500, Jonathan Monette > > >>>>>> wrote: > > >>>>>>> That is weird. If you were able to gsissh to ranger I > > >>>>> would assume > > >>>>>>> that you are able to globus-url-copy to ranger. > > >>>>>> > > >>>>>> Not if the two use different CAs. Or if a password was > > >>>>>> typed > > >>>>> at the ssh > > >>>>>> login. > > >>>>>> > > >>>>>>> Anyways, what Sarah said should work. 
I would assume > > >>>>>>> that > > >>>>> ci would > > >>>>>>> update more frequently to avoid this problem. > > >>>>>>> On Aug 25, 2011, at 4:38 PM, Sarah Kenny wrote: > > >>>>>>> > > >>>>>>>> communicado's certs > > >>>>>>>> (/etc/grid-security/certificates) > > >>>>>>>> are > > >>>>>>>> out-of-date...if you copy > > >>>>> ranger's /etc/grid-security/certificates > > >>>>>>>> directory to communicado and point yr X509_CERT_DIR > > >>>>>>>> to > > >>>>>>>> it > > >>>>> you can > > >>>>>>>> get a job thru (a simple globus-job-run with my > > >>>>>>>> vaild > > >>>>>>>> cert > > >>>>> fails > > >>>>>>>> from communicado at the moment if i don't do this). > > >>>>>>>> > > >>>>>>>> i set our machines at uci to update daily...i think > > >>>>>>>> it's > > >>>>> less > > >>>>>>>> frequently at ci... > > >>>>>>>> > > >>>>>>>> On Thu, Aug 25, 2011 at 2:17 PM, Mihael Hategan > > >>>>>>>> wrote: > > >>>>>>>> Can you try a globus-url-copy to > > >>>>>>>> gridftp.ranger? > > >>>>>>>> > > >>>>>>>> gridftp.ranger seems to have the NCSA myproxy > > >>>>>>>> CA. > > >>>>> You say > > >>>>>>>> you have the > > >>>>>>>> proper certificates dir in your > > >>>>>>>> X509_CERT_DIR, > > >>>>>>>> and > > >>>>> that > > >>>>>>>> directory > > >>>>>>>> contains the TACC root cert. So it should > > >>>>>>>> work. > > >>>>>>>> And > > >>>>> so > > >>>>>>>> should swift. > > >>>>>>>> > > >>>>>>>> Though I think that jglobus should be more > > >>>>>>>> clear > > >>>>> about > > >>>>>>>> "Unknown ca" > > >>>>>>>> errors. At least the name of the unknown CA > > >>>>>>>> should > > >>>>> be part > > >>>>>>>> of the error > > >>>>>>>> message. 
> > >>>>>>>> > > >>>>>>>> > > >>>>>>>> On Thu, 2011-08-25 at 15:55 -0500, David > > >>>>>>>> Kelly > > >>>>> wrote: > > >>>>>>>>> $ grid-proxy-info -all > > >>>>>>>>> subject : /C=US/O=National Center for > > >>>>>>>>> Supercomputing > > >>>>>>>> Applications/CN=David Kelly > > >>>>>>>>> issuer : /C=US/O=National Center for Supercomputing > > >>>>>>>> Applications/OU=Certificate > > >>>>>>>> Authorities/CN=MyProxy > > >>>>>>>>> identity : /C=US/O=National Center for > > >>>>>>>>> Supercomputing > > >>>>>>>> Applications/CN=David Kelly > > >>>>>>>>> type : end entity credential > > >>>>>>>>> strength : 1024 bits > > >>>>>>>>> path : /tmp/x509up_u1878 > > >>>>>>>>> timeleft : 9:56:53 > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> ----- Original Message ----- > > >>>>>>>>>> From: "Mihael Hategan" > > >>>>>>>>>> To: "David Kelly" > > >>>>>>>>>> Cc: "Ketan Maheshwari" > > >>>>>>>>>> , > > >>>>>>>> "swift-devel Devel" > > >>>>>>>> > > >>>>>>>>>> Sent: Thursday, August 25, 2011 3:42:57 PM > > >>>>>>>>>> Subject: Re: [Swift-devel] Notes from 0.93 meeting > > >>>>>>>>>> Odd. Can you paste the output of 'grid-proxy-info > > >>>>>>>>>> -all'? > > >>>>>>>>>> > > >>>>>>>>>> On Thu, 2011-08-25 at 15:18 -0500, David Kelly > > >>>>>>>>>> wrote: > > >>>>>>>>>>> Sure, here is the full log: > > >>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>> > > >>>>> http://www.ci.uchicago.edu/~davidk/001-catsn-ranger-20110825-1515-5tydro91.log > > >>>>>>>>>>> > > >>>>>>>>>>> ----- Original Message ----- > > >>>>>>>>>>>> From: "Mihael Hategan" > > >>>>>>>>>>>> To: "David Kelly" > > >>>>>>>>>>>> Cc: "Ketan Maheshwari" > > >>>>>>>>>>>> , > > >>>>>>>> "swift-devel > > >>>>>>>>>>>> Devel" > > >>>>>>>>>>>> Sent: Thursday, August 25, 2011 2:43:31 PM > > >>>>>>>>>>>> Subject: Re: [Swift-devel] Notes from 0.93 > > >>>>>>>>>>>> meeting > > >>>>>>>>>>>> It's possible that the CA dir on Ranger is not > > >>>>>>>> properly set up. > > >>>>>>>>>>>> Can > > >>>>>>>>>>>> you > > >>>>>>>>>>>> post the full log? 
> > >>>>>>>>>>>> > > >>>>>>>>>>>> On Thu, 2011-08-25 at 13:56 -0500, David Kelly > > >>>>>>>> wrote: > > >>>>>>>>>>>>> Those environment variables were not set up. I > > >>>>>>>> have them defined > > >>>>>>>>>>>>> now, but I'm still getting the same error. > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> [davidk at communicado ranger]$ env |grep 509 > > >>>>>>>>>>>>> X509_CERT_DIR=/opt/osg-1.2.16/globus/TRUSTED_CA > > >>>>>>>>>>>>> X509_CADIR=/opt/osg-1.2.16/globus/TRUSTED_CA > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> [davidk at communicado ranger]$ swift -sites.file > > >>>>>>>> sites.xml > > >>>>>>>>>>>>> -tc.file > > >>>>>>>>>>>>> tc.data 001-catsn-ranger.swift > > >>>>>>>>>>>>> Swift svn swift-r4987 (swift modified locally) > > >>>>>>>> cog-r3229 > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> RunID: 20110825-1352-f1v940b4 > > >>>>>>>>>>>>> Progress: time: Thu, 25 Aug 2011 13:52:59 -0500 > > >>>>>>>>>>>>> Progress: time: Thu, 25 Aug 2011 13:53:00 -0500 > > >>>>>>>> Selecting site:7 > > >>>>>>>>>>>>> Initializing site shared directory:3 > > >>>>>>>>>>>>> Execution failed: > > >>>>>>>>>>>>> Authentication failed [Caused by: Failure > > >>>>>>>> unspecified at > > >>>>>>>>>>>>> GSS-API > > >>>>>>>>>>>>> level [Caused by: Unknown CA]] > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> ----- Original Message ----- > > >>>>>>>>>>>>>> From: "Ketan Maheshwari" > > >>>>>>>> > > >>>>>>>>>>>>>> To: "David Kelly" > > >>>>>>>>>>>>>> Cc: "Jonathan Monette" , > > >>>>>>>> "swift-devel > > >>>>>>>>>>>>>> Devel" > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> Sent: Thursday, August 25, 2011 1:32:50 PM > > >>>>>>>>>>>>>> Subject: Re: [Swift-devel] Notes from 0.93 > > >>>>>>>> meeting > > >>>>>>>>>>>>>> Hi, > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> Are your CADIR and CACERT env vars set up? 
> > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> [communicado:swiftgrid]$ echo $X509_CADIR > > >>>>>>>>>>>>>> /opt/osg-1.2.16/globus/TRUSTED_CA > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> [communicado:swiftgrid]$ echo $X509_CERT_DIR > > >>>>>>>>>>>>>> /opt/osg-1.2.16/globus/TRUSTED_CA > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> On Thu, Aug 25, 2011 at 1:29 PM, David Kelly < > > >>>>>>>>>>>>>> davidk at ci.uchicago.edu > > >>>>>>>>>>>>>>> wrote: > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> Thanks Jon, > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> Here is what happens when I try this from > > >>>>>>>> communicado: > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> [davidk at communicado ~]$ myproxy-logon -l > > >>>>>>>>>>>>>> dkelly > > >>>>>>>> -s > > >>>>>>>>>>>>>> myproxy.teragrid.org > > >>>>>>>>>>>>>> Enter MyProxy pass phrase: > > >>>>>>>>>>>>>> A credential has been received for user dkelly > > >>>>>>>> in > > >>>>>>>>>>>>>> /tmp/x509up_u1878. > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> [davidk at communicado ranger]$ swift -sites.file > > >>>>>>>> sites.xml > > >>>>>>>>>>>>>> -tc.file > > >>>>>>>>>>>>>> tc.data 001-catsn-ranger.swift > > >>>>>>>>>>>>>> Swift svn swift-r4987 (swift modified locally) > > >>>>>>>> cog-r3229 > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> RunID: 20110825-1326-o3e38fe0 > > >>>>>>>>>>>>>> Progress: time: Thu, 25 Aug 2011 13:26:43 > > >>>>>>>>>>>>>> -0500 > > >>>>>>>>>>>>>> Progress: time: Thu, 25 Aug 2011 13:26:44 > > >>>>>>>>>>>>>> -0500 > > >>>>>>>> Selecting > > >>>>>>>>>>>>>> site:8 > > >>>>>>>>>>>>>> Initializing site shared directory:2 > > >>>>>>>>>>>>>> Execution failed: > > >>>>>>>>>>>>>> Authentication failed [Caused by: Failure > > >>>>>>>> unspecified at > > >>>>>>>>>>>>>> GSS-API > > >>>>>>>>>>>>>> level > > >>>>>>>>>>>>>> [Caused by: Unknown CA]] > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> Any ideas? 
> > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> Thanks, > > >>>>>>>>>>>>>> David > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> _______________________________________________ > > >>>>>>>>>>>>>> Swift-devel mailing list > > >>>>>>>>>>>>>> Swift-devel at ci.uchicago.edu > > >>>>>>>>>>>>>> > > >>>>>>>> > > >>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> -- > > >>>>>>>>>>>>>> Ketan > > >>>>>>>>>>>>> _______________________________________________ > > >>>>>>>>>>>>> Swift-devel mailing list > > >>>>>>>>>>>>> Swift-devel at ci.uchicago.edu > > >>>>>>>>>>>>> > > >>>>>>>> > > >>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> _______________________________________________ > > >>>>>>>> Swift-devel mailing list > > >>>>>>>> Swift-devel at ci.uchicago.edu > > >>>>>>>> > > >>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> -- > > >>>>>>>> Sarah Kenny > > >>>>>>>> Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio > > >>>>>>>> Sci > > >>>>> III > > >>>>>>>> University of California Irvine, Dept. of Neurology > > >>>>>>>> ~ > > >>>>> 773-818-8300 > > >>>>>>>> > > >>>>>>>> _______________________________________________ > > >>>>>>>> Swift-devel mailing list > > >>>>>>>> Swift-devel at ci.uchicago.edu > > >>>>>>>> > > >>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > >>>>>>> > > >>>>>> > > >>>>>> > > >>>>> > > >>>>> > > >>>>> > > >>>>> > > >>>>> > > >>>>> -- > > >>>>> Sarah Kenny > > >>>>> Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > > >>>>> University of California Irvine, Dept. 
of Neurology ~
> > >>>>> 773-818-8300
> > >>>>>
> > >>>>
> > >>>>
> > >>>> _______________________________________________
> > >>>> Swift-devel mailing list
> > >>>> Swift-devel at ci.uchicago.edu
> > >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

From davidk at ci.uchicago.edu Fri Aug 26 21:19:02 2011
From: davidk at ci.uchicago.edu (David Kelly)
Date: Fri, 26 Aug 2011 21:19:02 -0500 (CDT)
Subject: [Swift-devel] Notes from 0.93 meeting
In-Reply-To: <1314408194.21762.0.camel@blabla>
Message-ID: <1877618323.80685.1314411542200.JavaMail.root@zimbra-mb2.anl.gov>

Getting closer now. When I changed it from gt2:gt2:sge to gt2:sge, there was more detail in the coaster log. I saw an error saying that a valid pe (parallel environment) was not specified, so I defined a pe in my sites.xml and tried again. An SGE submit file is now being created, but it sets the pe to 'threaded' rather than my value, 16way.

http://www.ci.uchicago.edu/~davidk/ranger-gt2-sge3.tar.gz

David

----- Original Message -----
> From: "Mihael Hategan"
> To: "David Kelly"
> Cc: "Jonathan Monette", "swift-devel Devel"
> Sent: Friday, August 26, 2011 8:23:14 PM
> Subject: Re: [Swift-devel] Notes from 0.93 meeting
> Can you try GT2:SGE instead of GT2:GT2:SGE?
>
> On Fri, 2011-08-26 at 19:13 -0500, David Kelly wrote:
> > I set it to communicado.ci.uchicago.edu. I'll try again with the IP
> > address.
> >
> > ----- Original Message -----
> > > From: "Jonathan Monette"
> > > To: "David Kelly"
> > > Cc: "Mihael Hategan", "swift-devel Devel"
> > > Sent: Friday, August 26, 2011 6:54:29 PM
> > > Subject: Re: [Swift-devel] Notes from 0.93 meeting
> > > Did you set GLOBUS_HOSTNAME to communicado.ci.uchicago.edu or,
> > > probably better, the IP address of communicado?
> > > On Aug 26, 2011, at 6:52 PM, David Kelly wrote:
> > >
> > > > I tried setting GLOBUS_HOSTNAME on communicado. The GRAM log
> > > > file is no longer created, but I still don't see any jobs being
> > > > submitted.
> > > > > > > > There is a new set of logs at > > > > www.ci.uchicago.edu/~davidk/ranger-gt2-logs2.tar.gz > > > > > > > > David > > > > > > > > ----- Original Message ----- > > > >> From: "Mihael Hategan" > > > >> To: "David Kelly" > > > >> Cc: "swift-devel Devel" , > > > >> "Jonathan > > > >> Monette" > > > >> Sent: Friday, August 26, 2011 1:42:13 PM > > > >> Subject: Re: [Swift-devel] Notes from 0.93 meeting > > > >> "The job manager failed to open stderr" tends to happen when > > > >> you > > > >> have > > > >> GLOBUS_HOSTNAME set incorrectly. > > > >> > > > >> On Fri, 2011-08-26 at 13:38 -0500, David Kelly wrote: > > > >>> When I am trying to run the script now, Swift does not seem to > > > >>> be > > > >>> submitting the jobs correctly. Nothing it showing up in qstat. > > > >>> > > > >>> I noticed that a gram log gets created in my home directory > > > >>> that > > > >>> says: > > > >>> ts=2011-08-26T17:30:03.910618Z id=27215 event=gram.job.end > > > >>> level=ERROR gramid=/16145868447994515851/17606392074284884670/ > > > >>> job_status=4 status=-73 reason="the job manager failed to open > > > >>> stdout" > > > >>> > > > >>> I'm guessing this is the cause of the problem. Bugs #153 and > > > >>> #215 > > > >>> were related to similar problems with stdout and gt2/sge. > > > >>> > > > >>> The full logs are at > > > >>> http://www.ci.uchicago.edu/~davidk/ranger-gt2-logs.tar.gz > > > >>> > > > >>> Thanks, > > > >>> David > > > >>> > > > >>> > > > >>> ----- Original Message ----- > > > >>>> From: "Mihael Hategan" > > > >>>> To: "Jonathan Monette" > > > >>>> Cc: "swift-devel Devel" > > > >>>> Sent: Thursday, August 25, 2011 5:31:34 PM > > > >>>> Subject: Re: [Swift-devel] Notes from 0.93 meeting > > > >>>> On Thu, 2011-08-25 at 17:18 -0500, Jonathan Monette wrote: > > > >>>>> I can send mail to ci support and cc mike to it and ask what > > > >>>>> they > > > >>>>> can > > > >>>>> do. 
> > > >>>>> > > > >>>>> Mihael, is there anyway for Swift to give a little more > > > >>>>> feedback > > > >>>>> besides unknown CA or is that a jglobus problem? > > > >>>> > > > >>>> It's a jglobus problem. > > > >>>> > > > >>>> That in itself may not be a big issue, but jglobus is now > > > >>>> being > > > >>>> heavily > > > >>>> re-organized by the globus team, so I'm not sure what the > > > >>>> best > > > >>>> long-term > > > >>>> strategy is here. > > > >>>>> > > > >>>>> ----- Reply message ----- > > > >>>>> From: "Sarah Kenny" > > > >>>>> Date: Thu, Aug 25, 2011 5:11 pm > > > >>>>> Subject: [Swift-devel] Notes from 0.93 meeting > > > >>>>> To: "Jonathan Monette" > > > >>>>> Cc: "Mihael Hategan" , "swift-devel > > > >>>>> Devel" > > > >>>>> > > > >>>>> > > > >>>>> > > > >>>>> > > > >>>>> if i had a nickel for every time i dealt with this i'd be > > > >>>>> rich! > > > >>>>> :) > > > >>>>> actually, now that i'm looking at our uci machines i > > > >>>>> actually > > > >>>>> have > > > >>>>> them updating hourly...so, maybe you want to ask the admins > > > >>>>> to > > > >>>>> do > > > >>>>> that > > > >>>>> to avoid a full day of confusion whenever they expire :P > > > >>>>> > > > >>>>> *usually* i can't gsissh either if the certs have expired > > > >>>>> but, > > > >>>>> yeah, > > > >>>>> they must be using different CA's now for that on ranger as > > > >>>>> mihael > > > >>>>> suggests... > > > >>>>> > > > >>>>> On Thu, Aug 25, 2011 at 2:46 PM, Jonathan Monette > > > >>>>> > > > >>>>> wrote: > > > >>>>> True. I did not think that each mechanism would use > > > >>>>> different > > > >>>>> CAs. We might want to ask ci support to update the > > > >>>>> grid > > > >>>>> certs > > > >>>>> more frequently then to avoid this situation. > > > >>>>> > > > >>>>> > > > >>>>> On Aug 25, 2011, at 4:42 PM, Mihael Hategan wrote: > > > >>>>> > > > >>>>>> On Thu, 2011-08-25 at 16:40 -0500, Jonathan Monette > > > >>>>>> wrote: > > > >>>>>>> That is weird. 
If you were able to gsissh to ranger I > > > >>>>> would assume > > > >>>>>>> that you are able to globus-url-copy to ranger. > > > >>>>>> > > > >>>>>> Not if the two use different CAs. Or if a password was > > > >>>>>> typed > > > >>>>> at the ssh > > > >>>>>> login. > > > >>>>>> > > > >>>>>>> Anyways, what Sarah said should work. I would assume > > > >>>>>>> that > > > >>>>> ci would > > > >>>>>>> update more frequently to avoid this problem. > > > >>>>>>> On Aug 25, 2011, at 4:38 PM, Sarah Kenny wrote: > > > >>>>>>> > > > >>>>>>>> communicado's certs > > > >>>>>>>> (/etc/grid-security/certificates) > > > >>>>>>>> are > > > >>>>>>>> out-of-date...if you copy > > > >>>>> ranger's /etc/grid-security/certificates > > > >>>>>>>> directory to communicado and point yr X509_CERT_DIR > > > >>>>>>>> to > > > >>>>>>>> it > > > >>>>> you can > > > >>>>>>>> get a job thru (a simple globus-job-run with my > > > >>>>>>>> vaild > > > >>>>>>>> cert > > > >>>>> fails > > > >>>>>>>> from communicado at the moment if i don't do this). > > > >>>>>>>> > > > >>>>>>>> i set our machines at uci to update daily...i think > > > >>>>>>>> it's > > > >>>>> less > > > >>>>>>>> frequently at ci... > > > >>>>>>>> > > > >>>>>>>> On Thu, Aug 25, 2011 at 2:17 PM, Mihael Hategan > > > >>>>>>>> wrote: > > > >>>>>>>> Can you try a globus-url-copy to > > > >>>>>>>> gridftp.ranger? > > > >>>>>>>> > > > >>>>>>>> gridftp.ranger seems to have the NCSA myproxy > > > >>>>>>>> CA. > > > >>>>> You say > > > >>>>>>>> you have the > > > >>>>>>>> proper certificates dir in your > > > >>>>>>>> X509_CERT_DIR, > > > >>>>>>>> and > > > >>>>> that > > > >>>>>>>> directory > > > >>>>>>>> contains the TACC root cert. So it should > > > >>>>>>>> work. > > > >>>>>>>> And > > > >>>>> so > > > >>>>>>>> should swift. > > > >>>>>>>> > > > >>>>>>>> Though I think that jglobus should be more > > > >>>>>>>> clear > > > >>>>> about > > > >>>>>>>> "Unknown ca" > > > >>>>>>>> errors. 
At least the name of the unknown CA > > > >>>>>>>> should > > > >>>>> be part > > > >>>>>>>> of the error > > > >>>>>>>> message. > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> On Thu, 2011-08-25 at 15:55 -0500, David > > > >>>>>>>> Kelly > > > >>>>> wrote: > > > >>>>>>>>> $ grid-proxy-info -all > > > >>>>>>>>> subject : /C=US/O=National Center for > > > >>>>>>>>> Supercomputing > > > >>>>>>>> Applications/CN=David Kelly > > > >>>>>>>>> issuer : /C=US/O=National Center for Supercomputing > > > >>>>>>>> Applications/OU=Certificate > > > >>>>>>>> Authorities/CN=MyProxy > > > >>>>>>>>> identity : /C=US/O=National Center for > > > >>>>>>>>> Supercomputing > > > >>>>>>>> Applications/CN=David Kelly > > > >>>>>>>>> type : end entity credential > > > >>>>>>>>> strength : 1024 bits > > > >>>>>>>>> path : /tmp/x509up_u1878 > > > >>>>>>>>> timeleft : 9:56:53 > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> ----- Original Message ----- > > > >>>>>>>>>> From: "Mihael Hategan" > > > >>>>>>>>>> To: "David Kelly" > > > >>>>>>>>>> Cc: "Ketan Maheshwari" > > > >>>>>>>>>> , > > > >>>>>>>> "swift-devel Devel" > > > >>>>>>>> > > > >>>>>>>>>> Sent: Thursday, August 25, 2011 3:42:57 PM > > > >>>>>>>>>> Subject: Re: [Swift-devel] Notes from 0.93 meeting > > > >>>>>>>>>> Odd. Can you paste the output of 'grid-proxy-info > > > >>>>>>>>>> -all'? 
> > > >>>>>>>>>> > > > >>>>>>>>>> On Thu, 2011-08-25 at 15:18 -0500, David Kelly > > > >>>>>>>>>> wrote: > > > >>>>>>>>>>> Sure, here is the full log: > > > >>>>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>> > > > >>>>> http://www.ci.uchicago.edu/~davidk/001-catsn-ranger-20110825-1515-5tydro91.log > > > >>>>>>>>>>> > > > >>>>>>>>>>> ----- Original Message ----- > > > >>>>>>>>>>>> From: "Mihael Hategan" > > > >>>>>>>>>>>> To: "David Kelly" > > > >>>>>>>>>>>> Cc: "Ketan Maheshwari" > > > >>>>>>>>>>>> , > > > >>>>>>>> "swift-devel > > > >>>>>>>>>>>> Devel" > > > >>>>>>>>>>>> Sent: Thursday, August 25, 2011 2:43:31 PM > > > >>>>>>>>>>>> Subject: Re: [Swift-devel] Notes from 0.93 > > > >>>>>>>>>>>> meeting > > > >>>>>>>>>>>> It's possible that the CA dir on Ranger is not > > > >>>>>>>> properly set up. > > > >>>>>>>>>>>> Can > > > >>>>>>>>>>>> you > > > >>>>>>>>>>>> post the full log? > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> On Thu, 2011-08-25 at 13:56 -0500, David Kelly > > > >>>>>>>> wrote: > > > >>>>>>>>>>>>> Those environment variables were not set up. I > > > >>>>>>>> have them defined > > > >>>>>>>>>>>>> now, but I'm still getting the same error. 
> > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> [davidk at communicado ranger]$ env |grep 509 > > > >>>>>>>>>>>>> X509_CERT_DIR=/opt/osg-1.2.16/globus/TRUSTED_CA > > > >>>>>>>>>>>>> X509_CADIR=/opt/osg-1.2.16/globus/TRUSTED_CA > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> [davidk at communicado ranger]$ swift -sites.file > > > >>>>>>>> sites.xml > > > >>>>>>>>>>>>> -tc.file > > > >>>>>>>>>>>>> tc.data 001-catsn-ranger.swift > > > >>>>>>>>>>>>> Swift svn swift-r4987 (swift modified locally) > > > >>>>>>>> cog-r3229 > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> RunID: 20110825-1352-f1v940b4 > > > >>>>>>>>>>>>> Progress: time: Thu, 25 Aug 2011 13:52:59 -0500 > > > >>>>>>>>>>>>> Progress: time: Thu, 25 Aug 2011 13:53:00 -0500 > > > >>>>>>>> Selecting site:7 > > > >>>>>>>>>>>>> Initializing site shared directory:3 > > > >>>>>>>>>>>>> Execution failed: > > > >>>>>>>>>>>>> Authentication failed [Caused by: Failure > > > >>>>>>>> unspecified at > > > >>>>>>>>>>>>> GSS-API > > > >>>>>>>>>>>>> level [Caused by: Unknown CA]] > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> ----- Original Message ----- > > > >>>>>>>>>>>>>> From: "Ketan Maheshwari" > > > >>>>>>>> > > > >>>>>>>>>>>>>> To: "David Kelly" > > > >>>>>>>>>>>>>> Cc: "Jonathan Monette" , > > > >>>>>>>> "swift-devel > > > >>>>>>>>>>>>>> Devel" > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> Sent: Thursday, August 25, 2011 1:32:50 PM > > > >>>>>>>>>>>>>> Subject: Re: [Swift-devel] Notes from 0.93 > > > >>>>>>>> meeting > > > >>>>>>>>>>>>>> Hi, > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> Are your CADIR and CACERT env vars set up? 
> > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> [communicado:swiftgrid]$ echo $X509_CADIR > > > >>>>>>>>>>>>>> /opt/osg-1.2.16/globus/TRUSTED_CA > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> [communicado:swiftgrid]$ echo $X509_CERT_DIR > > > >>>>>>>>>>>>>> /opt/osg-1.2.16/globus/TRUSTED_CA > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> On Thu, Aug 25, 2011 at 1:29 PM, David Kelly < > > > >>>>>>>>>>>>>> davidk at ci.uchicago.edu > > > >>>>>>>>>>>>>>> wrote: > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> Thanks Jon, > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> Here is what happens when I try this from > > > >>>>>>>> communicado: > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> [davidk at communicado ~]$ myproxy-logon -l > > > >>>>>>>>>>>>>> dkelly > > > >>>>>>>> -s > > > >>>>>>>>>>>>>> myproxy.teragrid.org > > > >>>>>>>>>>>>>> Enter MyProxy pass phrase: > > > >>>>>>>>>>>>>> A credential has been received for user dkelly > > > >>>>>>>> in > > > >>>>>>>>>>>>>> /tmp/x509up_u1878. 
> > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> [davidk at communicado ranger]$ swift -sites.file > > > >>>>>>>> sites.xml > > > >>>>>>>>>>>>>> -tc.file > > > >>>>>>>>>>>>>> tc.data 001-catsn-ranger.swift > > > >>>>>>>>>>>>>> Swift svn swift-r4987 (swift modified locally) > > > >>>>>>>> cog-r3229 > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> RunID: 20110825-1326-o3e38fe0 > > > >>>>>>>>>>>>>> Progress: time: Thu, 25 Aug 2011 13:26:43 > > > >>>>>>>>>>>>>> -0500 > > > >>>>>>>>>>>>>> Progress: time: Thu, 25 Aug 2011 13:26:44 > > > >>>>>>>>>>>>>> -0500 > > > >>>>>>>> Selecting > > > >>>>>>>>>>>>>> site:8 > > > >>>>>>>>>>>>>> Initializing site shared directory:2 > > > >>>>>>>>>>>>>> Execution failed: > > > >>>>>>>>>>>>>> Authentication failed [Caused by: Failure > > > >>>>>>>> unspecified at > > > >>>>>>>>>>>>>> GSS-API > > > >>>>>>>>>>>>>> level > > > >>>>>>>>>>>>>> [Caused by: Unknown CA]] > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> Any ideas? > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> Thanks, > > > >>>>>>>>>>>>>> David > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> _______________________________________________ > > > >>>>>>>>>>>>>> Swift-devel mailing list > > > >>>>>>>>>>>>>> Swift-devel at ci.uchicago.edu > > > >>>>>>>>>>>>>> > > > >>>>>>>> > > > >>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> -- > > > >>>>>>>>>>>>>> Ketan > > > >>>>>>>>>>>>> _______________________________________________ > > > >>>>>>>>>>>>> Swift-devel mailing list > > > >>>>>>>>>>>>> Swift-devel at ci.uchicago.edu > > > >>>>>>>>>>>>> > > > >>>>>>>> > > > >>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> _______________________________________________ > > > >>>>>>>> Swift-devel mailing list > > > >>>>>>>> Swift-devel at ci.uchicago.edu > > > >>>>>>>> > > > >>>>> 
https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> -- > > > >>>>>>>> Sarah Kenny > > > >>>>>>>> Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio > > > >>>>>>>> Sci > > > >>>>> III > > > >>>>>>>> University of California Irvine, Dept. of Neurology > > > >>>>>>>> ~ > > > >>>>> 773-818-8300 > > > >>>>>>>> > > > >>>>>>>> _______________________________________________ > > > >>>>>>>> Swift-devel mailing list > > > >>>>>>>> Swift-devel at ci.uchicago.edu > > > >>>>>>>> > > > >>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > >>>>>>> > > > >>>>>> > > > >>>>>> > > > >>>>> > > > >>>>> > > > >>>>> > > > >>>>> > > > >>>>> > > > >>>>> -- > > > >>>>> Sarah Kenny > > > >>>>> Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > > > >>>>> University of California Irvine, Dept. of Neurology ~ > > > >>>>> 773-818-8300 > > > >>>>> > > > >>>> > > > >>>> > > > >>>> _______________________________________________ > > > >>>> Swift-devel mailing list > > > >>>> Swift-devel at ci.uchicago.edu > > > >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From yadudoc1729 at gmail.com Sun Aug 28 08:03:29 2011 From: yadudoc1729 at gmail.com (Yadu Nand) Date: Sun, 28 Aug 2011 18:33:29 +0530 Subject: [Swift-devel] MapReduce, doubts Message-ID: Hi, I was going through some materials ([1], [2] , [3]) to understand Google's MapReduce system and I have a couple of queries : 1. How do we address the issue of data locality ? When we run a map job, it is a priority to run it such that least network overhead is incurred, so preferably on the same system holding the data (or one which is nearest , I don't know how this works). 2. Is it possible to somehow force the reduce tasks to wait till all map jobs are done ? The MapReduce uses a system which permits reduce to run only after all the map jobs are done executing. 
I'm not entirely sure why this is a requirement but this has its own issues, such as a single slow mapper. This is usually tackled by the main-controller noticing the slow one and running multiple instances of the map job to get results faster. Does swift at some level use the concept of a central controller ? How do we tackle this ? 3. How does swift handle failures ? Is there a facility for re-execution ? Is this documented somewhere ? Do we use any file-system that handles loss of a particular file /input-set ? I'm stopping here, there are more questions nagging me, but its probably best to not blurt it out all at once :) [1] http://code.google.com/edu/parallel/mapreduce-tutorial.html [2] http://www.youtube.com/watch?v=-vD6PUdf3Js [3] http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html -- Thanks and Regards, Yadu Nand B From yadudoc1729 at gmail.com Sun Aug 28 08:15:22 2011 From: yadudoc1729 at gmail.com (Yadu Nand) Date: Sun, 28 Aug 2011 18:45:22 +0530 Subject: [Swift-devel] MapReduce, doubts Message-ID: Hi, I was going through some materials ([1], [2] , [3]) to understand Google's MapReduce system and I have a couple of queries : 1. How do we address the issue of data locality ? When we run a map job, it is a priority to run it such that least network overhead is incurred, so preferably on the same system holding the data (or one which is nearest , I don't know how this works). 2. Is it possible to somehow force the reduce tasks to wait till all map jobs are done ? The MapReduce uses a system which permits reduce to run only after all the map jobs are done executing. I'm not entirely sure why this is a requirement but this has its own issues, such as a single slow mapper. This is usually tackled by the main-controller noticing the slow one and running multiple instances of the map job to get results faster. Does swift at some level use the concept of a central controller ? How do we tackle this ? 3. How does swift handle failures ? 
Is there a facility for re-execution ? Is this documented somewhere ? Do we use any file-system that handles loss of a particular file /input-set ? I'm stopping here, there are more questions nagging me, but its probably best to not blurt it out all at once :) [1] http://code.google.com/edu/parallel/mapreduce-tutorial.html [2] http://www.youtube.com/watch?v=-vD6PUdf3Js [3] http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html -- Thanks and Regards, Yadu Nand B From wilde at mcs.anl.gov Sun Aug 28 10:15:02 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 28 Aug 2011 10:15:02 -0500 (CDT) Subject: [Swift-devel] MapReduce, doubts In-Reply-To: Message-ID: <1560585796.255142.1314544502422.JavaMail.root@zimbra.anl.gov> ----- Original Message ----- > From: "Yadu Nand" > To: "swift-devel" , "Justin M Wozniak" , "Mihael Hategan" > , "Michael Wilde" > Sent: Sunday, August 28, 2011 8:03:29 AM > Subject: MapReduce, doubts > Hi, > > I was going through some materials ([1], [2] , [3]) to understand > Google's MapReduce system and I have a couple of queries : > > 1. How do we address the issue of data locality ? > When we run a map job, it is a priority to run it such that least > network overhead is incurred, so preferably on the same system > holding the data (or one which is nearest , I don't know how this > works). Currently, we don't. We have discussed a new feature to do this (it's listed as a GSoC project and I can probably locate a discussion with a 2010 GSoC candidate in which I detailed a possible strategy). We can currently implement a similar scheme using an external mapper to select input files from multiple sites and map them to gsiftp URIs. Then an enhancement in the scheduler could select a site based on the URI of some or all of the input files. > 2. Is it possible to somehow force the reduce tasks to wait till all > map jobs are done ? Isn't that just normal swift semantics?
If we coded a simple-minded reduce job whose input was the array of outputs from the map() stage, the reduce (assuming it's an app function) would wait for all the map() ops to finish, right? I would ask instead "do we want to?". Do the distributed reduce ops in map-reduce really wait? Doesn't MR do distributed reduction in batches, asynchronously to the completion of the map() operations? Isn't this a key property that is made possible by the name/value pair-based nature of the MR data model? I thought MR reduce ops take place at any location, in any input chunk size, in a tree-based manner, and that this is possible because the reduction operator is "distributed" in the mathematical sense. > The MapReduce uses a system which permits reduce to run only > after all the map jobs are done executing. I'm not entirely sure why > this is a requirement but this has its own issues, such as a single > slow mapper. This is usually tackled by the main-controller noticing > the slow one and running multiple instances of the map job to get > results faster. Does swift at some level use the concept of a central > controller ? How do we tackle this ? > > 3. How does swift handle failures ? Is there a facility for > re-execution ? Yes, Swift retries failing app invocations as controlled by the properties execution.retries and lazy.errors. You can read about these in the user guide and in the properties file. > Is this documented somewhere ? Do we use any file-system that > handles loss of a particular file /input-set ? No, we don't, but some of this would come with using a replication-based model for the input dataset where the mapper could supply a list of possible inputs instead of one, and the scheduler could pick a replica each time it selects a site for a (retried) job. Also, we might think of a "forMostOf" statement which could implement semantics that would be suitable for runs in which you don't need every single map() to complete. I.e.
the target array can be considered closed when "most of" (tbd) the input collection has been processed. The forMostOf() could complete when it enters the "tail" of the loop (see Tim Armstrong's paper on the tail phenomenon). > I'm stopping here, there are more questions nagging me, but its > probably best to not blurt it out all at once :) I think you are hitting the right issues here, and I encourage you to keep pushing towards something that you could readily experiment with. This is exactly where we need to go to provide a convenient method for expressing map-reduce as an elegant high-level script. I also encourage you to read about what Ed Walker did for map-reduce in his parallel shell. - Mike > [1] http://code.google.com/edu/parallel/mapreduce-tutorial.html > [2] http://www.youtube.com/watch?v=-vD6PUdf3Js > [3] > http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html > -- > Thanks and Regards, > Yadu Nand B -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From jonmon at mcs.anl.gov Sun Aug 28 11:32:44 2011 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Sun, 28 Aug 2011 11:32:44 -0500 Subject: [Swift-devel] MapReduce, doubts In-Reply-To: <1560585796.255142.1314544502422.JavaMail.root@zimbra.anl.gov> References: <1560585796.255142.1314544502422.JavaMail.root@zimbra.anl.gov> Message-ID: <6171FF36-1837-44BC-8600-F259D0D7C280@mcs.anl.gov> On Aug 28, 2011, at 10:15 AM, Michael Wilde wrote: > ----- Original Message ----- >> From: "Yadu Nand" >> To: "swift-devel" , "Justin M Wozniak" , "Mihael Hategan" >> , "Michael Wilde" >> Sent: Sunday, August 28, 2011 8:03:29 AM >> Subject: MapReduce, doubts >> Hi, >> >> I was going through some materials ([1], [2] , [3]) to understand >> Google's MapReduce system and I have a couple of queries : >> >> 1. How do we address the issue of data locality ?
>> When we run a map job, it is a priority to run it such that least >> network overhead is incurred, so preferably on the same system >> holding the data (or one which is nearest , I don't know how this >> works). > > Currently, we dont. We have discussed a new feature to do this (its listed as a GSoC project and I can probably locate a discussion with a 2010 GSoC candidate in which I detailed a possible strategy). > > We can current implement a similar scheme using an external mapper to select input files from multiple sites and map them to gsiftp URIs. Then an enhancement in the scheduler could select a site based on the URI of some or all or the input files. In my work with GOSwift I found that mappers will follow the GSIURI path. For a single file: file input1 ; The above will map the file input1.txt that resides on PADS. For a group of files: file inputs[] "gsiftp://gridftp.pads.ci.uchicago.edu//gpfs/pads/swift/jonmon/data", suffix=".txt">; The above will map all files with a ".txt" extension in the directory data on PADS. I think this is what you were talking about having the external mapper do. >> 2. Is it possible to somehow force the reduce tasks to wait till all >> map jobs are done ? > > Isn't that just normal swift semantics? If we coded a simple-minded reduce job whose input was the array of outputs from the map() stage, the reduce (assuming its an app function) would wait for all the map() ops to finish, right? > > I would ask instead "do we want to?". Do the distributed reduce ops in map-reduce really wait? Doesn't MR do distributed reduction in batches, asynchronously to the completion of the map() operations? Isnt this a key property that is made possible by the name/value pair-based nature of the MR data model? I thought MR reduce ops take place at any location, in any input chunk size, in a tree-based manner, and that this is possible because the reduction operator is "distributed" in the mathematical sense. > >> The MapReduce uses a system which permits reduce to run only >> after all the map jobs are done executing. I'm not entirely sure why >> this is a requirement but this has its own issues, such as a single >> slow mapper.
This is usually tackled by the main-controller noticing >> the slow one and running multiple instances of the map job to get >> results faster. Does swift at some level use the concept of a central >> controller ? How do we tackle this ? >> >> 3. How does swift handle failures ? Is there a facility for >> re-execution ? > > Yes, Swift retries failing app invocations as controlled by the properties execution.retries and lazy.errors. You can read on these in the users guide and in the properties file. > >> Is this documented somewhere ? Do we use any file-system that >> handles loss of a particular file /input-set ? > > No, we dont, but some of this would come with using a replication-based model for the input dataset where the mapper could supply a list of possible inputs instead of one, and the scheduler could pick a replica each time it selects a site for a (retried) job. > > Also, we might think of a "forMostOf" statement which could implement semantics that would be suitable for runs in which you dont need every single map() to complete. I.e. the target array can be considered closed when "most of" (tbd) the input collection had been processed. The formost() could complete when it enters the "tail" of the loop (see Tim Armstrong's paper on the tail phenomenon). I think this has been discussed before. In the Montage application there is a step where I map more filenames than will be created. So I don't need all the maps to complete for the workflow to keep progressing. I made a workaround but I think this "forMostOf" feature would be useful. I will locate the thread in which Mihael and I had this discussion. > >> I'm stopping here, there are more questions nagging me, but its >> probably best to not blurt it out all at once :) > > I think you are hitting the right issues here, and I encourage you to keep pushing towards something that you could readily experiment with. 
This si exactly where we need to go to provide a convenient method for expressing map-reduce as an elegant high-level script. > > I also encourage you to read on what Ed Walker did for map-reduce in his parallel shell. > > - Mike > >> [1] http://code.google.com/edu/parallel/mapreduce-tutorial.html >> [2] http://www.youtube.com/watch?v=-vD6PUdf3Js >> [3] >> http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html >> -- >> Thanks and Regards, >> Yadu Nand B > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Sun Aug 28 12:11:57 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 28 Aug 2011 12:11:57 -0500 (CDT) Subject: [Swift-devel] MapReduce, doubts In-Reply-To: <6171FF36-1837-44BC-8600-F259D0D7C280@mcs.anl.gov> Message-ID: <16807435.255223.1314551517654.JavaMail.root@zimbra.anl.gov> Jon, yes, thats right. Can you add the info below to the User Guide, or if time is pressing, to the cookbook? Thanks, - Mike ----- Original Message ----- > From: "Jonathan Monette" > To: "Michael Wilde" > Cc: "Yadu Nand" , "swift-devel" > Sent: Sunday, August 28, 2011 11:32:44 AM > Subject: Re: [Swift-devel] MapReduce, doubts > On Aug 28, 2011, at 10:15 AM, Michael Wilde wrote: > > > ----- Original Message ----- > >> From: "Yadu Nand" > >> To: "swift-devel" , "Justin M Wozniak" > >> , "Mihael Hategan" > >> , "Michael Wilde" > >> Sent: Sunday, August 28, 2011 8:03:29 AM > >> Subject: MapReduce, doubts > >> Hi, > >> > >> I was going through some materials ([1], [2] , [3]) to understand > >> Google's MapReduce system and I have a couple of queries : > >> > >> 1. How do we address the issue of data locality ? 
> >> When we run a map job, it is a priority to run it such that least > >> network overhead is incurred, so preferably on the same system > >> holding the data (or one which is nearest , I don't know how this > >> works). > > > > Currently, we dont. We have discussed a new feature to do this (its > > listed as a GSoC project and I can probably locate a discussion with > > a 2010 GSoC candidate in which I detailed a possible strategy). > > > > We can current implement a similar scheme using an external mapper > > to select input files from multiple sites and map them to gsiftp > > URIs. Then an enhancement in the scheduler could select a site based > > on the URI of some or all or the input files. > > In my work with GOSwift I found that mappers will follow the GSIURI > path. > > For a single file: > file input1 > ; > > The above will map the file input1.txt that resides on PADS. > > For a group of files: > file inputs[] "gsiftp://gridftp.pads.ci.uchicago.edu//gpfs/pads/swift/jonmon/data", > suffix=".txt">; > > The above will map all files with a ".txt" extension in the directory > data on PADS. > > I think this is what you were talking about having the external mapper > do. > > > > >> 2. Is it possible to somehow force the reduce tasks to wait till > >> all > >> map jobs are done ? > > > > Isn't that just normal swift semantics? If we coded a simple-minded > > reduce job whose input was the array of outputs from the map() > > stage, the reduce (assuming its an app function) would wait for all > > the map() ops to finish, right? > > > > I would ask instead "do we want to?". Do the distributed reduce ops > > in map-reduce really wait? Doesn't MR do distributed reduction in > > batches, asynchronously to the completion of the map() operations? > > Isnt this a key property that is made possible by the name/value > > pair-based nature of the MR data model?
I thought MR reduce ops take > > place at any location, in any input chunk size, in a tree-based > > manner, and that this is possible because the reduction operator is > > "distributed" in the mathematical sense. > > > >> The MapReduce uses a system which permits reduce to run only > >> after all the map jobs are done executing. I'm not entirely sure > >> why > >> this is a requirement but this has its own issues, such as a single > >> slow mapper. This is usually tackled by the main-controller > >> noticing > >> the slow one and running multiple instances of the map job to get > >> results faster. Does swift at some level use the concept of a > >> central > >> controller ? How do we tackle this ? > >> > >> 3. How does swift handle failures ? Is there a facility for > >> re-execution ? > > > > Yes, Swift retries failing app invocations as controlled by the > > properties execution.retries and lazy.errors. You can read on these > > in the users guide and in the properties file. > > > >> Is this documented somewhere ? Do we use any file-system that > >> handles loss of a particular file /input-set ? > > > > No, we dont, but some of this would come with using a > > replication-based model for the input dataset where the mapper could > > supply a list of possible inputs instead of one, and the scheduler > > could pick a replica each time it selects a site for a (retried) > > job. > > > > Also, we might think of a "forMostOf" statement which could > > implement semantics that would be suitable for runs in which you > > dont need every single map() to complete. I.e. the target array can > > be considered closed when "most of" (tbd) the input collection had > > been processed. The formost() could complete when it enters the > > "tail" of the loop (see Tim Armstrong's paper on the tail > > phenomenon). > > I think this has been discussed before. In the Montage application > there is a step where I map more filenames than will be created. 
So I > don't need all the maps to complete for the workflow to keep > progressing. I made a workaround but I think this "forMostOf" feature > would be useful. I will locate the thread in which Mihael and I had > this discussion. > > > >> I'm stopping here, there are more questions nagging me, but its > >> probably best to not blurt it out all at once :) > > > > I think you are hitting the right issues here, and I encourage you > > to keep pushing towards something that you could readily experiment > > with. This si exactly where we need to go to provide a convenient > > method for expressing map-reduce as an elegant high-level script. > > > > I also encourage you to read on what Ed Walker did for map-reduce in > > his parallel shell. > > > > - Mike > > > >> [1] http://code.google.com/edu/parallel/mapreduce-tutorial.html > >> [2] http://www.youtube.com/watch?v=-vD6PUdf3Js > >> [3] > >> http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html > >> -- > >> Thanks and Regards, > >> Yadu Nand B > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From jonmon at mcs.anl.gov Sun Aug 28 13:37:33 2011 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Sun, 28 Aug 2011 13:37:33 -0500 Subject: [Swift-devel] MapReduce, doubts Message-ID: <20110828183719.0E605124CC@zimbra.anl.gov> I'll add it to the cookbook because I am not sure what mappers it works for and what options work. I only know that it works for those two mappers so testing to see exactly what works is needed.
Is the cookbook in asciidoc or is it the ci swift wiki one?

----- Reply message -----
From: "Michael Wilde"
Date: Sun, Aug 28, 2011 12:11 pm
Subject: [Swift-devel] MapReduce, doubts
To: "Jonathan Monette"
Cc: "Yadu Nand" , "swift-devel"

Jon, yes, thats right. Can you add the info below to the User Guide, or if time is pressing, to the cookbook?

Thanks,

- Mike
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ketancmaheshwari at gmail.com Sun Aug 28 13:38:47 2011
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Sun, 28 Aug 2011 13:38:47 -0500
Subject: [Swift-devel] MapReduce, doubts
In-Reply-To: <20110828183719.0E605124CC@zimbra.anl.gov>
References: <20110828183719.0E605124CC@zimbra.anl.gov>
Message-ID:

Jon,

The cookbook is asciidoc located at:
http://www.ci.uchicago.edu/swift/cookbook/cookbook-asciidoc.html

On Sun, Aug 28, 2011 at 1:37 PM, Jonathan Monette wrote:
> I'll add it to the cookbook because I am not sure what mappers it works for
> and what options work. I only know that it works for those two mappers so
> testing to see exactly what works is needed. Is the cookbook in asciidoc or
> is it the ci swift wiki one?
>
> ----- Reply message -----
> From: "Michael Wilde"
> Date: Sun, Aug 28, 2011 12:11 pm
> Subject: [Swift-devel] MapReduce, doubts
> To: "Jonathan Monette"
> Cc: "Yadu Nand" , "swift-devel" <
> swift-devel at ci.uchicago.edu>
>
>
> Jon, yes, thats right. Can you add the info below to the User Guide, or if
> time is pressing, to the cookbook?
>
> Thanks,
>
> - Mike

--
Ketan
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jonmon at mcs.anl.gov Sun Aug 28 13:47:28 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Sun, 28 Aug 2011 13:47:28 -0500
Subject: [Swift-devel] MapReduce, doubts
Message-ID: <20110828184714.0D109124D3@zimbra.anl.gov>

Thanks Ketan.

----- Reply message -----
From: "Ketan Maheshwari"
Date: Sun, Aug 28, 2011 1:38 pm
Subject: [Swift-devel] MapReduce, doubts
To: "Jonathan Monette"
Cc: "Michael Wilde" , "swift-devel"
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From yadudoc1729 at gmail.com Sun Aug 28 13:55:51 2011
From: yadudoc1729 at gmail.com (Yadu Nand)
Date: Mon, 29 Aug 2011 00:25:51 +0530
Subject: [Swift-devel] MapReduce, doubts
In-Reply-To: <1560585796.255142.1314544502422.JavaMail.root@zimbra.anl.gov>
References: <1560585796.255142.1314544502422.JavaMail.root@zimbra.anl.gov>
Message-ID:

>> I was going through some materials ([1], [2] , [3]) to understand
>> Google's MapReduce system and I have a couple of queries :
>>
>> 1. How do we address the issue of data locality ?
>> When we run a map job, it is a priority to run it such that least
>> network overhead is incurred, so preferably on the same system
>> holding the data (or one which is nearest , I don't know how this
>> works).
>
> Currently, we don't. We have discussed a new feature to do this (it's listed as a GSoC project and I can probably locate a discussion with a 2010 GSoC candidate in which I detailed a possible strategy).
>
> We can currently implement a similar scheme using an external mapper to select input files from multiple sites and map them to gsiftp URIs. Then an enhancement in the scheduler could select a site based on the URI of some or all of the input files.

Okay, I will read more on this. Do you mean to say that we currently
can tune/tweak the scheduler to pick optimal sites ?

>> 2. Is it possible to somehow force the reduce tasks to wait till all
>> map jobs are done ?
>
> Isn't that just normal swift semantics? If we coded a simple-minded reduce job whose input was the array of outputs from the map() stage, the reduce (assuming it's an app function) would wait for all the map() ops to finish, right?
>
> I would ask instead "do we want to?". Do the distributed reduce ops in map-reduce really wait? Doesn't MR do distributed reduction in batches, asynchronously to the completion of the map() operations? Isn't this a key property that is made possible by the name/value pair-based nature of the MR data model? I thought MR reduce ops take place at any location, in any input chunk size, in a tree-based manner, and that this is possible because the reduction operator is "distributed" in the mathematical sense.

Google's MapReduce waits till all map jobs are complete. They list
some reasons for choosing this over running reduce in parallel.
* Difficulty when a site fails (both mappers and reducers will need
  to restart and will need to remember states. This adds unnecessary
  complexity)
* In the end, it's CPU cycles we are intelligently dealing with. We
  could just use it for map and then start the reduce stage.
* In the lecture ([2]) it is stated that keeping reduce towards the end
  led to lower bandwidth usage.

Again as mentioned earlier we can use a *combiner* at each site
to pre-reduce the intermediates to lessen the bandwidth needs if
required (provided the functions are associative and commutative).
The combiner is usually the same as the reducer function, but run
locally.

>> 3. How does swift handle failures ? Is there a facility for
>> re-execution ?
>
> Yes, Swift retries failing app invocations as controlled by the properties execution.retries and lazy.errors. You can read on these in the users guide and in the properties file.

Great, I went through the user-guide pages on Swift properties. I see
the replication.enabled option as well. With this I think a lot of plus points
of MapReduce will be covered :)

> No, we don't, but some of this would come with using a replication-based model for the input dataset where the mapper could supply a list of possible inputs instead of one, and the scheduler could pick a replica each time it selects a site for a (retried) job.
>
> Also, we might think of a "forMostOf" statement which could implement semantics that would be suitable for runs in which you don't need every single map() to complete. I.e. the target array can be considered closed when "most of" (tbd) the input collection had been processed. The formost() could complete when it enters the "tail" of the loop (see Tim Armstrong's paper on the tail phenomenon).
>

I haven't read the paper yet. With execution.retries, lazy.errors don't
we have the required behavior ? Which is, if a job fails retry a limited
number of times and if there is no progress ignore the job. I think
replication.enabled can also be useful here. MapReduce uses a similar
idea of spawning multiple-redundant jobs to handle cases where jobs
run too slowly. Can we expect similar behavior here as well ?

>> I'm stopping here, there are more questions nagging me, but it's
>> probably best to not blurt it out all at once :)
>
> I think you are hitting the right issues here, and I encourage you to keep pushing towards something that you could readily experiment with. This is exactly where we need to go to provide a convenient method for expressing map-reduce as an elegant high-level script.
> I also encourage you to read on what Ed Walker did for map-reduce in his parallel shell.

Okay, I will read this paper as well and post. Thanks :)

--
Thanks and Regards,
Yadu Nand B

From jonmon at mcs.anl.gov Sun Aug 28 14:27:37 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Sun, 28 Aug 2011 14:27:37 -0500
Subject: [Swift-devel] MapReduce, doubts
In-Reply-To:
References: <1560585796.255142.1314544502422.JavaMail.root@zimbra.anl.gov>
Message-ID: <1BD72DAA-57BA-4D29-ADCC-66929936D279@mcs.anl.gov>

On Aug 28, 2011, at 1:55 PM, Yadu Nand wrote:

>>> I was going through some materials ([1], [2] , [3]) to understand
>>> Google's MapReduce system and I have a couple of queries :
>>>
>>> 1. How do we address the issue of data locality ?
>>> When we run a map job, it is a priority to run it such that least
>>> network overhead is incurred, so preferably on the same system
>>> holding the data (or one which is nearest , I don't know how this
>>> works).
>>
>> Currently, we don't.
We have discussed a new feature to do this (it's listed as a GSoC project and I can probably locate a discussion with a 2010 GSoC candidate in which I detailed a possible strategy).
>>
>> We can currently implement a similar scheme using an external mapper to select input files from multiple sites and map them to gsiftp URIs. Then an enhancement in the scheduler could select a site based on the URI of some or all of the input files.
>
> Okay, I will read more on this. Do you mean to say that we currently
> can tune/tweak the scheduler to pick optimal sites ?

I think he was saying that the scheduler can choose where to run the app based on where the data it needs is mapped. If you say that data is mapped on PADS using the GSIURL for PADS then the scheduler will give preference to run on PADS. So saying:

file input1 ;
output1 = cat(input1)

would have the app defined for cat run on PADS since input1 is mapped to PADS.

>
>>> 2. Is it possible to somehow force the reduce tasks to wait till all
>>> map jobs are done ?
>>
>> Isn't that just normal swift semantics? If we coded a simple-minded reduce job whose input was the array of outputs from the map() stage, the reduce (assuming it's an app function) would wait for all the map() ops to finish, right?
>>
>> I would ask instead "do we want to?". Do the distributed reduce ops in map-reduce really wait? Doesn't MR do distributed reduction in batches, asynchronously to the completion of the map() operations? Isn't this a key property that is made possible by the name/value pair-based nature of the MR data model? I thought MR reduce ops take place at any location, in any input chunk size, in a tree-based manner, and that this is possible because the reduction operator is "distributed" in the mathematical sense.
>
> Google's MapReduce waits till all map jobs are complete. They list
> some reasons for choosing this over running reduce in parallel.
> * Difficulty when a site fails (both mappers and reducers will need
>   to restart and will need to remember states. This adds unnecessary
>   complexity)
> * In the end, it's CPU cycles we are intelligently dealing with. We
>   could just use it for map and then start the reduce stage.
> * In the lecture ([2]) it is stated that keeping reduce towards the end
>   led to lower bandwidth usage.
>
> Again as mentioned earlier we can use a *combiner* at each site
> to pre-reduce the intermediates to lessen the bandwidth needs if
> required (provided the functions are associative and commutative).
> The combiner is usually the same as the reducer function, but run
> locally.
>
>>> 3. How does swift handle failures ? Is there a facility for
>>> re-execution ?
>>
>> Yes, Swift retries failing app invocations as controlled by the properties execution.retries and lazy.errors. You can read on these in the users guide and in the properties file.
> Great, I went through the user-guide pages on Swift properties. I see
> the replication.enabled option as well. With this I think a lot of plus points
> of MapReduce will be covered :)
>
>> No, we don't, but some of this would come with using a replication-based model for the input dataset where the mapper could supply a list of possible inputs instead of one, and the scheduler could pick a replica each time it selects a site for a (retried) job.
>>
>> Also, we might think of a "forMostOf" statement which could implement semantics that would be suitable for runs in which you don't need every single map() to complete. I.e. the target array can be considered closed when "most of" (tbd) the input collection had been processed. The formost() could complete when it enters the "tail" of the loop (see Tim Armstrong's paper on the tail phenomenon).
>>
> I haven't read the paper yet. With execution.retries, lazy.errors don't
> we have the required behavior ? Which is, if a job fails retry a limited
> number of times and if there is no progress ignore the job. I think
> replication.enabled can also be useful here. MapReduce uses a similar
> idea of spawning multiple-redundant jobs to handle cases where jobs
> run too slowly. Can we expect similar behavior here as well ?
>
>>> I'm stopping here, there are more questions nagging me, but it's
>>> probably best to not blurt it out all at once :)
>>
>> I think you are hitting the right issues here, and I encourage you to keep pushing towards something that you could readily experiment with. This is exactly where we need to go to provide a convenient method for expressing map-reduce as an elegant high-level script.
>> I also encourage you to read on what Ed Walker did for map-reduce in his parallel shell.
>
> Okay, I will read this paper as well and post. Thanks :)
>
> --
> Thanks and Regards,
> Yadu Nand B
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From wilde at mcs.anl.gov Sun Aug 28 14:33:57 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 28 Aug 2011 14:33:57 -0500 (CDT)
Subject: [Swift-devel] MapReduce, doubts
In-Reply-To:
Message-ID: <1492913409.255287.1314560037805.JavaMail.root@zimbra.anl.gov>

----- Original Message -----
> From: "Yadu Nand"
...
> Okay, I will read more on this. Do you mean to say that we currently
> can tune/tweak the scheduler to pick optimal sites ?

Yes, in the sense that one can enhance the code to do this :)

> Google's MapReduce waits till all map jobs are complete. They list
> some reasons for choosing this over running reduce in parallel.

My understanding was wrong - thanks for the correction.

...
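
As an illustration of the combiner idea discussed in this thread, here is a minimal sketch in plain Python (not Swift script). It assumes a word-count workload; the site names, the input chunks, and the helper names map_words and combine are hypothetical and are not part of Swift or of Google's MapReduce. It only shows that when the reduction is associative and commutative, pre-reducing at each site and then reducing the per-site partials gives the same result as a single global reduce.

```python
from collections import Counter
from functools import reduce


def map_words(chunk):
    """Map step: count the words in one input chunk."""
    return Counter(chunk.split())


def combine(partials):
    """Combiner / reducer: merge partial counts.

    Counter addition is associative and commutative, which is the
    property the combiner approach relies on.
    """
    return reduce(lambda a, b: a + b, partials, Counter())


# Hypothetical sites, each holding some input chunks locally.
site_chunks = {
    "siteA": ["a b a", "b c"],
    "siteB": ["c c a"],
}

# Run every map job, then pre-reduce the results at each site.
per_site = {
    site: combine(map_words(chunk) for chunk in chunks)
    for site, chunks in site_chunks.items()
}

# Final reduce over the (much smaller) per-site partial results.
total = combine(per_site.values())

# Reference: one global reduce over all map outputs, with no combiner.
direct = combine(
    map_words(chunk)
    for chunks in site_chunks.values()
    for chunk in chunks
)
```

Only the per-site partials would need to cross the network in such a scheme, which is where the bandwidth saving would come from.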
> Again as mentioned earlier we can use a *combiner* at each site
> to pre-reduce the intermediates to lessen the bandwidth needs if
> required. (provided the functions are associative and commutative)
> The combiner is usually the same as the reducer function, but run
> locally.

Sounds like a good approach. Might need the new primitive "foreachsite" which would operate on all members of a collection cached at a site, and do so over all sites that hold members of the collection. That would be a "funny" operator in the sense that it's based on some physical aspect of the implementation, and the state of a run, unlike the rest of the language that has no physical site connections. But it seems useful and pragmatic, and to my mind worth at least exploring.

...

> I haven't read the paper yet. With execution.retries, lazy.errors don't
> we have the required behavior ? Which is, if a job fails retry a limited
> number of times and if there is no progress ignore the job. I think
> replication.enabled can also be useful here. MapReduce uses a similar
> idea of spawning multiple-redundant jobs to handle cases where jobs
> run too slowly. Can we expect similar behavior here as well ?

I think so, at least to a first approximation.

- Mike

From ketancmaheshwari at gmail.com Sun Aug 28 16:47:13 2011
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Sun, 28 Aug 2011 16:47:13 -0500
Subject: [Swift-devel] queuedsize > 0 but no job dequeued
Message-ID:

Hello,

I remember this error happened in the past with Glen's and Sheri's runs. I saw this today again on Beagle with 0.93 while running the DSSAT run.

The run stops with the following complete message:

queuedsize > 0 but no job dequeued. Queued: {}
java.lang.Throwable
        at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.requeueNonFitting(BlockQueueProcessor.java:269)
        at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:539)
        at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:110)
queuedsize > 0 but no job dequeued. Queued: {}
java.lang.Throwable
        at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.requeueNonFitting(BlockQueueProcessor.java:269)
        at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:539)
        at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:110)
Progress: time: Sun, 28 Aug 2011 13:34:26 -0600 Submitted:76 Active:23 Checking status:1 Finished successfully:597

The logs, properties and sources for this run are:
http://www.ci.uchicago.edu/~ketan/run23.tgz

Regards,
--
Ketan
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From hategan at mcs.anl.gov Mon Aug 29 19:16:09 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 29 Aug 2011 17:16:09 -0700
Subject: [Swift-devel] queuedsize > 0 but no job dequeued
In-Reply-To:
References:
Message-ID: <1314663369.31525.0.camel@blabla>

Can I have the coasters log please?

On Sun, 2011-08-28 at 16:47 -0500, Ketan Maheshwari wrote:
> Hello,
>
> I remember this error happened in the past with Glen's and Sheri's
> runs. I saw this today again on Beagle with 0.93 while running the
> DSSAT run.
>
> The run stops with the following complete message:
>
> queuedsize > 0 but no job dequeued. Queued: {}
> java.lang.Throwable
>         at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.requeueNonFitting(BlockQueueProcessor.java:269)
>         at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:539)
>         at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:110)
> queuedsize > 0 but no job dequeued. Queued: {}
> java.lang.Throwable
>         at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.requeueNonFitting(BlockQueueProcessor.java:269)
>         at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:539)
>         at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:110)
> Progress: time: Sun, 28 Aug 2011 13:34:26 -0600 Submitted:76
> Active:23 Checking status:1 Finished successfully:597
>
> The logs, properties and sources for this run are:
> http://www.ci.uchicago.edu/~ketan/run23.tgz
>
> Regards,
> --
> Ketan
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

From ketancmaheshwari at gmail.com Mon Aug 29 19:52:01 2011
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Mon, 29 Aug 2011 19:52:01 -0500
Subject: [Swift-devel] queuedsize > 0 but no job dequeued
In-Reply-To: <1314663369.31525.0.camel@blabla>
References: <1314663369.31525.0.camel@blabla>
Message-ID:

Mihael,

This run was with automatic coasters. I do not see any specific coasters.log file written during this run in .globus/coaster nor in the run's work dir.

Ketan

On Mon, Aug 29, 2011 at 7:16 PM, Mihael Hategan wrote:
> Can I have the coasters log please?

--
Ketan
-------------- next part --------------
An HTML attachment was scrubbed...
From hategan at mcs.anl.gov Mon Aug 29 20:30:37 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 29 Aug 2011 18:30:37 -0700
Subject: [Swift-devel] queuedsize > 0 but no job dequeued
In-Reply-To:
References: <1314663369.31525.0.camel@blabla>
Message-ID: <1314667837.919.0.camel@blabla>

On Mon, 2011-08-29 at 19:52 -0500, Ketan Maheshwari wrote:
> Mihael,
>
> This run was with automatic coasters. I do not see any specific
> coasters.log file written during this run in .globus/coaster nor in
> the run's work dir.

It's on the remote site in .globus/coasters.

> Ketan
>
> On Mon, Aug 29, 2011 at 7:16 PM, Mihael Hategan wrote:
> Can I have the coasters log please?

From ketancmaheshwari at gmail.com Mon Aug 29 20:59:53 2011
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Mon, 29 Aug 2011 20:59:53 -0500
Subject: [Swift-devel] queuedsize > 0 but no job dequeued
In-Reply-To: <1314667837.919.0.camel@blabla>
References: <1314663369.31525.0.camel@blabla> <1314667837.919.0.camel@blabla>
Message-ID:

This is on Beagle. I am running local:pbs from /lustre.

On Mon, Aug 29, 2011 at 8:30 PM, Mihael Hategan wrote:
> It's on the remote site in .globus/coasters.

--
Ketan
From hategan at mcs.anl.gov Mon Aug 29 22:55:59 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 29 Aug 2011 20:55:59 -0700
Subject: [Swift-devel] queuedsize > 0 but no job dequeued
In-Reply-To:
References: <1314663369.31525.0.camel@blabla> <1314667837.919.0.camel@blabla>
Message-ID: <1314676559.1750.0.camel@blabla>

My bad. The info is in the swift log.

On Mon, 2011-08-29 at 20:59 -0500, Ketan Maheshwari wrote:
> This is on Beagle. I am running local:pbs from /lustre.

From wilde at mcs.anl.gov Tue Aug 30 10:34:39 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 30 Aug 2011 10:34:39 -0500 (CDT)
Subject: [Swift-devel] Next steps on SwiftR - re Fwd: [Swift-user] Do you have any resource for learning about SwiftR?
In-Reply-To: <283375835.259628.1314717726557.JavaMail.root@zimbra.anl.gov>
Message-ID: <254735520.259702.1314718479384.JavaMail.root@zimbra.anl.gov>

David, can you schedule some time to meet with Tim, learn and try SwiftR, and work to publicize it on the Swift web?

Tim, do you think it's ready to submit to CRAN? Does it meet the CRAN install criteria wrt building all included packages from source? (I.e., Swift?)

How should we handle the docs issue? I think we need to consolidate docs from at least the 3 sources I know about, get them into the R help, and echo them into an asciidoc page on the Swift web?
(SwiftR R help; Tim's page; SWFT pages(2); also, do we still have docs on it on the OpenMx wiki? Last I looked the Swift help page there now refers back to SWFT pages).

Lastly, what do we need to do for additional SwiftR site support?
- config and test on beagle
- integrate with more swift configs
- test and support for TG/XSEDE and OSG

Thanks,

- Mike

----- Forwarded Message -----
From: "Michael Wilde"
To: "Lorenzo Pesce"
Cc: "Tim Armstrong"
Sent: Tuesday, August 30, 2011 10:22:06 AM
Subject: Re: [Swift-user] Do you have any resource for learning about SwiftR?

Lorenzo,

The SwiftR documentation is currently at:

http://www.ci.uchicago.edu/wiki/bin/view/SWFT/SwiftR

which also provides a quick start guide at:

http://www.ci.uchicago.edu/wiki/bin/view/SWFT/SwiftRQuickstart

Further examples and some performance measurements are at:

http://people.cs.uchicago.edu/~tga/swiftR/

And more examples are available with ?SwiftR help once you load the package:

> source("http://people.cs.uchicago.edu/~tga/swiftR/getSwift.R")

I just built an R-2.13.1 release on Beagle with plain gcc, which I think *should* be runnable in parallel on worker nodes. (Not yet tested though). This R should be capable of running SwiftR. I'm hoping that Tim can verify this soon.

We'll likely need an additional SwiftR server name and config for Beagle and other Cray systems.

We'll try to consolidate the SwiftR documentation in a user guide on the Swift web in the future.

Tim, can you do a quick check of the documentation to make sure it's still correct and that it points to the latest SwiftR package?

Thanks,

- Mike

----- Original Message -----
> From: "Lorenzo Pesce"
> To: swift-user at ci.uchicago.edu
> Sent: Tuesday, August 30, 2011 9:22:51 AM
> Subject: [Swift-user] Do you have any resource for learning about SwiftR?
> Hi -
>
> I want to run relatively small sized simulations (say at most 50 cores
> or so, probably mostly one or two) but many many times over. The
> simulations will be coded in R.
>
> Thanks a lot!
>
> Lorenzo
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov Tue Aug 30 15:05:54 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 30 Aug 2011 13:05:54 -0700
Subject: [Swift-devel] queuedsize > 0 but no job dequeued
In-Reply-To: <1314676559.1750.0.camel@blabla>
References: <1314663369.31525.0.camel@blabla> <1314667837.919.0.camel@blabla> <1314676559.1750.0.camel@blabla>
Message-ID: <1314734754.6888.0.camel@blabla>

Any chance you can re-run this with debug enabled on coasters (log4j.logger.org.globus.cog.abstraction.coaster=DEBUG)?

On Mon, 2011-08-29 at 20:55 -0700, Mihael Hategan wrote:
> My bad. The info is in the swift log.

From wilde at mcs.anl.gov Tue Aug 30 17:30:51 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 30 Aug 2011 17:30:51 -0500 (CDT)
Subject: [Swift-devel] How is hostCount handled by JETS Swift coaster interface?
Message-ID: <1813177206.261664.1314743451714.JavaMail.root@zimbra.anl.gov>

Does globus::hostCount=2 really mean 2 hosts, or nproc=2 to mpi?

How does it interact with jobsPerNode or anything similar to specify how many cores are to be used per node?

Thanks,

- Mike

From wozniak at mcs.anl.gov Tue Aug 30 17:40:00 2011
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Tue, 30 Aug 2011 17:40:00 -0500 (Central Daylight Time)
Subject: [Swift-devel] How is hostCount handled by JETS Swift coaster interface?
In-Reply-To: <1813177206.261664.1314743451714.JavaMail.root@zimbra.anl.gov>
References: <1813177206.261664.1314743451714.JavaMail.root@zimbra.anl.gov>
Message-ID:

It means two distinct hosts, I have not turned on the use of multi-core yet.

On Tue, 30 Aug 2011, Michael Wilde wrote:

> Does globus::hostCount=2 really mean 2 hosts, or nproc=2 to mpi?
>
> How does it interact with jobsPerNode or anything similar to specify how many cores are to be used per node?
>
> Thanks,
>
> - Mike

--
Justin M Wozniak

From wilde at mcs.anl.gov Tue Aug 30 18:16:05 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 30 Aug 2011 18:16:05 -0500 (CDT)
Subject: [Swift-devel] How is hostCount handled by JETS Swift coaster interface?
In-Reply-To:
Message-ID: <1365658832.261713.1314746165705.JavaMail.root@zimbra.anl.gov>

So hostCount=2 essentially does an mpirun with np=2 for an mpi_size of 2?

Even if you are not allocating by core, can you set the mpi_size to nodes x cores?
- Mike

----- Original Message -----
> From: "Justin M Wozniak"
> To: "Michael Wilde"
> Cc: "swift-devel Devel"
> Sent: Tuesday, August 30, 2011 5:40:00 PM
> Subject: Re: How is hostCount handled by JETS Swift coaster interface?
> It means two distinct hosts, I have not turned on the use of multi-core yet.
>
> On Tue, 30 Aug 2011, Michael Wilde wrote:
>
> > Does globus::hostCount=2 really mean 2 hosts, or nproc=2 to mpi?
> >
> > How does it interact with jobsPerNode or anything similar to specify
> > how many cores are to be used per node?
> >
> > Thanks,
> >
> > - Mike
>
> --
> Justin M Wozniak

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From wozniak at mcs.anl.gov Tue Aug 30 18:20:36 2011
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Tue, 30 Aug 2011 18:20:36 -0500 (Central Daylight Time)
Subject: [Swift-devel] How is hostCount handled by JETS Swift coaster interface?
In-Reply-To: <1365658832.261713.1314746165705.JavaMail.root@zimbra.anl.gov>
References: <1365658832.261713.1314746165705.JavaMail.root@zimbra.anl.gov>
Message-ID:

> So hostCount=2 essentially does an mpirun with np=2 for an mpi_size of 2?

Yes, but it also means to submit the job to 2 distinct hosts. If you give mpiexec -n 4 and 2 hosts you will get two processes per host. I just haven't implemented a way to do that yet from Swift.

Justin

On Tue, 30 Aug 2011, Michael Wilde wrote:

>
> Even if you are not allocating by core, can you set the mpi_size to nodes x cores?
>
> - Mike
>
> ----- Original Message -----
>> From: "Justin M Wozniak"
>> To: "Michael Wilde"
>> Cc: "swift-devel Devel"
>> Sent: Tuesday, August 30, 2011 5:40:00 PM
>> Subject: Re: How is hostCount handled by JETS Swift coaster interface?
>> It means two distinct hosts, I have not turned on the use of multi-core yet.
>>
>> On Tue, 30 Aug 2011, Michael Wilde wrote:
>>
>>> Does globus::hostCount=2 really mean 2 hosts, or nproc=2 to mpi?
>>>
>>> How does it interact with jobsPerNode or anything similar to specify
>>> how many cores are to be used per node?
>>>
>>> Thanks,
>>>
>>> - Mike
>>
>> --
>> Justin M Wozniak
>
>

--
Justin M Wozniak

From iraicu at cs.iit.edu Wed Aug 31 00:34:27 2011
From: iraicu at cs.iit.edu (Ioan Raicu)
Date: Wed, 31 Aug 2011 00:34:27 -0500
Subject: [Swift-devel] CFP: 2011 Chicago Colloquium on Digital Humanities and Computer Science
Message-ID: <4E5DC7E3.5080201@cs.iit.edu>

Call for Papers

2011 Chicago Colloquium on Digital Humanities and Computer Science
November 19-21, 2011
Loyola University Chicago -- Chicago, Illinois, USA
Submission Deadline: September 15, 2011
http://chicagocolloquium.org

The Chicago Colloquium on Digital Humanities and Computer Science (DHCS) brings together researchers and scholars in the humanities and computer science to examine the current state of digital humanities as a field of intellectual inquiry and to identify and explore new directions and perspectives for future research.

Here is a brief look at the three most recent conferences in the DHCS series, which celebrates its sixth year running in 2011.

* DHCS 2008 (University of Chicago) focused on "Making Sense" -- an exploration of how meaning is created and apprehended at the transition from the digital to the analog.
* DHCS 2009 (IIT) focused on computational methods in digital humanities, including computational stylistics, text analytics, and visualization.
* DHCS 2010 (Northwestern) focused on "Working with Digital Data: Collaborate, Curate, Analyze, Annotate."

With broad agency support for and continued cross-disciplinary interest in "digging into data" as well as cyberinfrastructure and collaboration, this year's DHCS will continue to focus on these and related topics of interest to the community, with a formal colloquium theme to be unveiled as the program is finalized.
We invite submissions from scholars, researchers, practitioners (independent scholars and industry), librarians, technologists, and students, on all topics that intersect current theory and practice in the humanities and computer science.

This year's DHCS is sponsored by Loyola University Chicago, The University of Chicago, Northwestern University, and the Illinois Institute of Technology.

Location and Venue Description

Loyola University Chicago
Water Tower Campus
820 N. Michigan Avenue
Chicago, IL 60640

The conference will be held at Loyola University Chicago at its Water Tower Campus. Located near the Magnificent Mile and the historic Water Tower, the venue offers convenient access to excellent hotels and restaurants, not to mention ample opportunities for sightseeing and shopping. The time frame for the conference coincides with the annual unveiling of the holiday lights and delightful walks on the Magnificent Mile--the last chance before Chicago's winter arrives in full force.

Keynote Speakers

The list of keynote speakers is still being determined and will be posted as the conference program is nearing completion.

Co-Chairs

George K. Thiruvathukal, Computer Science, Loyola University Chicago, http://www.thiruvathukal.com
Steven E. Jones, English, Loyola University Chicago, http://stevenejones.org/

Program Committee

* Shlomo Argamon, Computer Science, Illinois Institute of Technology, http://www.iit.edu/csl/cs/faculty/argamon_shlomo.shtml
* Arno Bosse, Comparative Literature, University of Chicago
* Helma Dik, Classics, University of Chicago, http://classics.uchicago.edu/faculty/dik
* Doug Downey, Computer Science, Northwestern University, http://www.cs.northwestern.edu/~ddowney/
* William L.
Honig, Computer Science, Loyola University Chicago, http://people.cs.luc.edu/whonig
* Konstantin Läufer, Computer Science, Loyola University Chicago, http://laufer.cs.luc.edu
* Peter Leonard, Humanities Research Computing, University of Chicago, http://home.uchicago.edu/psleonar/
* Catherine Mardikes, University Library, University of Chicago
* Mark Olsen, ARTFL Project, University of Chicago, http://artfl-project.uchicago.edu/
* Ioan Raicu, Computer Science, Illinois Institute of Technology, http://www.cs.iit.edu/~iraicu/
* Claire Stewart, University Library, Northwestern University, http://www.library.northwestern.edu/directory/claire-stewart

Journal of the Chicago DHCS Colloquium

Select papers and posters accepted at DHCS are published in the /Journal of the Chicago Colloquium on Digital Humanities and Computer Science (JDHCS)/. Please visit http://jdhcs.uchicago.edu to view the full text of presentations from these colloquia.

Preliminary Colloquium Schedule

The formal DHCS colloquium program runs Saturday, November 19 (afternoon), Sunday, November 20 (all day), and Monday, November 21 (ending mid-afternoon) and will consist of four 1-1/2 hour paper panels and two two-hour poster sessions as well as three keynotes. Pre-conference birds of a feather and tutorials will occur on Saturday, November 19, in the afternoon. Generous time has been set aside for questions and follow-up discussions after each panel and in the schedule breaks. There are no plans for parallel sessions. For further details, please see the conference website.

Registration Fee

Attendance for DHCS 2011 is free. All conference participants, however, will be required to register in advance. Details to follow as the conference program is finalized.

Submission Format

We welcome submissions that are either extended abstracts or full papers (8-page maximum, please) in PDF format.
We welcome submissions for:

* Paper presentations (15 and 30 minute presentations)
* Posters
* Software demonstrations
* Performances
* Pre-conference tutorials/workshops/seminars, and
* Pre-conference "birds of a feather" meetings

This year, we are using the EasyChair software to handle all submissions.

http://www.easychair.org/conferences/?conf=dhcs2011

The instructions are simple:

1. Register yourself (you will add co-authors later)
2. Confirm the registration e-mail.
3. Make sure you go back to the main link and sign in.
4. Create a "New Submission". Fill in all appropriate sections.
5. Don't forget to Upload Paper at the end of the form.

Submissions will only be accepted at the EasyChair URL above. Should you run into problems, please contact George K. Thiruvathukal at gkt+dhcs at cs.luc.edu. (The +dhcs is optional but will help to prioritize your e-mail.)

Graduate Student Travel Fund

A limited number of bursaries are available to assist graduate students who are presenting at the colloquium with their travel and accommodation expenses. More information about the application process will be available shortly at the Chicago Colloquium web site.

Important Dates

Deadline for Submissions: September 15
Notification of Acceptance: October 1
Full Program Announcement: October 15
Registration: October 1-November 15 (on-site will also be possible)
Colloquium: Sunday, November 20 -- Monday, November 21, 2011

Contact Info

Please email gkt+dhcs at cs.luc.edu

Conference Hash Tag: #dhcs2011

--
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor, Illinois Institute of Technology (IIT)
Guest Research Faculty, Argonne National Laboratory (ANL)
=================================================================
Data-Intensive Distributed Systems Laboratory, CS/IIT
Distributed Systems Laboratory, MCS/ANL
=================================================================
Cel: 1-847-722-0876
Office: 1-312-567-5704
Email: iraicu at cs.iit.edu
Web: http://www.cs.iit.edu/~iraicu/
Web: http://datasys.cs.iit.edu/
=================================================================

From iraicu at cs.iit.edu Wed Aug 31 02:37:08 2011
From: iraicu at cs.iit.edu (Ioan Raicu)
Date: Wed, 31 Aug 2011 02:37:08 -0500
Subject: [Swift-devel] CFP: 4th Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) 2011 -- co-located with IEEE/ACM Supercomputing 2011
Message-ID: <4E5DE4A4.4010807@cs.iit.edu>

4th Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) 2011
http://datasys.cs.iit.edu/events/MTAGS11/index.html

Co-located with Supercomputing/SC 2011
Seattle, Washington -- November 14th, 2011

News

* Keynote Speaker: Professor David Abramson from Monash University, Australia
* Special Issue on Data Intensive Computing in the Clouds in the Springer Journal of Grid Computing
* The Second International Workshop on Data Intensive Computing in the Clouds (DataCloud-SC11) 2011, co-located at Supercomputing/SC 2011, November 14th, 2011

Overview

The 4th workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) will provide the scientific community a dedicated forum for presenting new research, development, and deployment efforts of large-scale many-task computing (MTC) applications on large scale clusters, Grids, Supercomputers, and Cloud Computing infrastructure.
MTC, the theme of the workshop, encompasses loosely coupled applications, which are generally composed of many tasks (both independent and dependent tasks) to achieve some larger application goal. This workshop will cover challenges that can hamper efficiency and utilization in running applications on large-scale systems, such as local resource manager scalability and granularity, efficient utilization of raw hardware, parallel file system contention and scalability, data management, I/O management, reliability at scale, and application scalability. We welcome paper submissions on all topics related to MTC on large scale systems. Papers will be peer-reviewed, and accepted papers will be published in the workshop proceedings as part of the ACM digital library (pending approval). The workshop will be co-located with the IEEE/ACM Supercomputing 2011 Conference in Seattle, Washington on November 14th, 2011.

For more information, please see http://datasys.cs.iit.edu/events/MTAGS11/. For more information on past workshops, please see MTAGS10, MTAGS09, and MTAGS08. We also ran a Special Issue on Many-Task Computing in the IEEE Transactions on Parallel and Distributed Systems (TPDS), which appeared in June 2011; the proceedings can be found online at http://www.computer.org/portal/web/csdl/abs/trans/td/2011/06/ttd201106toc.htm. We, the workshop organizers, also published two papers that are highly relevant to this workshop. One paper is titled "Toward Loosely Coupled Programming on Petascale Systems", and was published in SC08; the second paper is titled "Many-Task Computing for Grids and Supercomputers", which was published in MTAGS08.

Topics

We invite the submission of original work that is related to the topics below. The papers can be either short (5 pages) position papers, or long (10 pages) research papers.
Topics of interest include (in the context of Many-Task Computing):

* Compute Resource Management
    o Scheduling
    o Job execution frameworks
    o Local resource manager extensions
    o Performance evaluation of resource managers in use on large scale systems
    o Dynamic resource provisioning
    o Techniques to manage many-core resources and/or GPUs
    o Challenges and opportunities in running many-task workloads on HPC systems
    o Challenges and opportunities in running many-task workloads on Cloud Computing infrastructure
* Storage architectures and implementations
    o Distributed file systems
    o Parallel file systems
    o Distributed meta-data management
    o Content distribution systems for large data
    o Data caching frameworks and techniques
    o Data management within and across data centers
    o Data-aware scheduling
    o Data-intensive computing applications
    o Eventual-consistency storage usage and management
* Programming models and tools
    o Map-reduce and its generalizations
    o Many-task computing middleware and applications
    o Parallel programming frameworks
    o Ensemble MPI techniques and frameworks
    o Service-oriented science applications
* Large-Scale Workflow Systems
    o Workflow system performance and scalability analysis
    o Scalability of workflow systems
    o Workflow infrastructure and e-Science middleware
    o Programming Paradigms and Models
* Large-Scale Many-Task Applications
    o High-throughput computing (HTC) applications
    o Data-intensive applications
    o Quasi-supercomputing applications, deployments, and experiences
    o Performance Evaluation
* Performance evaluation
    o Real systems
    o Simulations
    o Reliability of large systems

Important Dates

* Abstract submission: September 2, 2011
* Paper submission: September 9, 2011
* Acceptance notification: October 7, 2011
* Final papers due: October 28, 2011

Paper Submission

Authors are invited to submit papers with unpublished, original work of not more than 10 pages of double column text using single spaced 10 point size on 8.5 x 11 inch pages, as per ACM
8.5 x 11 manuscript guidelines (http://www.acm.org/publications/instructions_for_proceedings_volumes); document templates can be found at http://www.acm.org/sigs/publications/proceedings-templates. We are also seeking position papers of no more than 5 pages in length. A 250 word abstract (PDF format) must be submitted online at https://cmt.research.microsoft.com/MTAGS2011/ before the deadline of September 2nd, 2011 at 11:59PM PST; the final 5/10 page papers in PDF format will be due on September 9th, 2011 at 11:59PM PST. Papers will be peer-reviewed, and accepted papers will be published in the workshop proceedings as part of the ACM digital library (pending approval). Notifications of the paper decisions will be sent out by October 7th, 2011. Selected excellent work may be eligible for additional post-conference publication as journal articles or book chapters, such as the previous Special Issue on Many-Task Computing in the IEEE Transactions on Parallel and Distributed Systems (TPDS), which appeared in June 2011. Submission implies the willingness of at least one of the authors to register and present the paper. For more information, please see http://datasys.cs.iit.edu/events/MTAGS11/, or send email to mtags11-chairs at datasys.cs.iit.edu.
Organization

General Chairs (mtags11-chairs at datasys.cs.iit.edu)

* Ioan Raicu, Illinois Institute of Technology & Argonne National Laboratory, USA
* Ian Foster, University of Chicago & Argonne National Laboratory, USA
* Yong Zhao, University of Electronic Science and Technology of China, China

Steering Committee

* David Abramson, Monash University, Australia
* Jack Dongarra, University of Tennessee, USA
* Geoffrey Fox, Indiana University, USA
* Manish Parashar, Rutgers University, USA
* Marc Snir, University of Illinois at Urbana-Champaign, USA
* Xian-He Sun, Illinois Institute of Technology, USA
* Weimin Zheng, Tsinghua University, China

Program Committee

* Roger Barga, Microsoft Research, USA
* Mihai Budiu, Microsoft Research, USA
* Rajkumar Buyya, University of Melbourne, Australia
* Catalin Dumitrescu, Fermi National Labs, USA
* Alexandru Iosup, Delft University of Technology, Netherlands
* Florin Isaila, Universidad Carlos III de Madrid, Spain
* Kamil Iskra, Argonne National Laboratory, USA
* Hui Jin, Illinois Institute of Technology, USA
* Daniel S. Katz, University of Chicago, USA
* Tevfik Kosar, Louisiana State University, USA
* Zhiling Lan, Illinois Institute of Technology, USA
* Reagan Moore, University of North Carolina, Chapel Hill, USA
* Jose Moreira, IBM Research, USA
* Marlon Pierce, Indiana University, USA
* Judy Qiu, Indiana University, USA
* Lavanya Ramakrishnan, Lawrence Berkeley National Laboratory, USA
* Matei Ripeanu, University of British Columbia, Canada
* Alain Roy, University of Wisconsin, Madison, USA
* Edward Walker, Texas Advanced Computing Center, USA
* Mike Wilde, University of Chicago & Argonne National Laboratory, USA
* Matthew Woitaszek, The University Corporation for Atmospheric Research, USA
* Ken Yocum, University of California at San Diego, USA

--
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ ================================================================= ================================================================= From iraicu at cs.iit.edu Wed Aug 31 02:48:14 2011 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Wed, 31 Aug 2011 02:48:14 -0500 Subject: [Swift-devel] CFP: The Second International Workshop on Data Intensive Computing in the Clouds (DataCloud-SC11) 2011 -- co-located with IEEE/ACM Supercomputing 2011 Message-ID: <4E5DE73E.1070800@cs.iit.edu> The Second International Workshop on Data Intensive Computing in the Clouds (DataCloud-SC11) 2011 http://datasys.cs.iit.edu/events/DataCloud-SC11/ *Co-located with Supercomputing/SC 2011* *Seattle, Washington -- November 14th, 2011* News * *Special Issue on Data Intensive Computing in the Clouds in the Springer Journal of Grid Computing* * *4th Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) 2011, co-located at Supercomputing/SC 2011, November 14th, 2011* Overview Applications and experiments in all areas of science are becoming increasingly complex and more demanding in terms of their computational and data requirements. Some applications generate data volumes reaching hundreds of terabytes and even petabytes. As scientific applications become more data intensive, the management of data resources and dataflow between the storage and compute resources is becoming the main bottleneck. 
Analyzing, visualizing, and disseminating these large data sets has become a major challenge, and data intensive computing is now considered the "fourth paradigm" in scientific discovery after theoretical, experimental, and computational science. The second international workshop on Data-intensive Computing in the Clouds (DataCloud-SC11) will provide the scientific community a dedicated forum for discussing new research, development, and deployment efforts in running data-intensive computing workloads on Cloud Computing infrastructures. The DataCloud-SC11 workshop will focus on the use of cloud-based technologies to meet the new data intensive scientific challenges that are not well served by the current supercomputers, grids or compute-intensive clouds. We believe the workshop will be an excellent place to help the community define the current state, determine future goals, and present architectures and services for future clouds supporting data intensive computing. For more information about the workshop, please see http://datasys.cs.iit.edu/events/DataCloud-SC11/. To see the 1st workshop's program agenda, accepted papers, and presentations, please see http://www.cse.buffalo.edu/faculty/tkosar/datacloud2011/. We are also running a Special Issue on Data Intensive Computing in the Clouds in the Springer Journal of Grid Computing with a paper submission deadline of August 16th 2011, which will appear in print in June 2012. 
Topics * Data-intensive cloud computing applications, characteristics, challenges * Case studies of data intensive computing in the clouds * Performance evaluation of data clouds, data grids, and data centers * Energy-efficient data cloud design and management * Data placement, scheduling, and interoperability in the clouds * Accountability, QoS, and SLAs * Data privacy and protection in a public cloud environment * Distributed file systems for clouds * Data streaming and parallelization * New programming models for data-intensive cloud computing * Scalability issues in clouds * Social computing and massively social gaming * 3D Internet and implications * Future research challenges in data-intensive cloud computing Important Dates * Abstract submission: September 2, 2011 * Paper submission: September 9, 2011 * Acceptance notification: October 7, 2011 * Final papers due: October 28, 2011 Paper Submission Authors are invited to submit papers with unpublished, original work of not more than 10 pages of double column text using single spaced 10 point size on 8.5 x 11 inch pages, as per ACM 8.5 x 11 manuscript guidelines (http://www.acm.org/publications/instructions_for_proceedings_volumes); document templates can be found at http://www.acm.org/sigs/publications/proceedings-templates. We are also seeking position papers of no more than 5 pages in length. A 250 word abstract (PDF format) must be submitted online at https://cmt.research.microsoft.com/DataCloud_SC11/ before the deadline of September 2nd, 2011 at 11:59PM PST; the final 5/10 page papers in PDF format will be due on September 9th, 2011 at 11:59PM PST. Papers will be peer-reviewed, and accepted papers will be published in the workshop proceedings as part of the ACM digital library (pending approval). Notifications of the paper decisions will be sent out by October 7th, 2011. Selected excellent work may be eligible for additional post-conference publication as journal articles. 
We are currently running a Special Issue on Data Intensive Computing in the Clouds in the Springer Journal of Grid Computing . Submission implies the willingness of at least one of the authors to register and present the paper. For more information, please see http://datasys.cs.iit.edu/events/DataCloud-SC11/ or send email to datacloud-sc11-chairs at datasys.cs.iit.edu . Organization *General Chairs (datacloud-sc11-chairs at datasys.cs.iit.edu )* * Ioan Raicu, Illinois Institute of Technology & Argonne National Laboratory, USA * Tevfik Kosar, University at Buffalo, USA * Roger Barga, Microsoft Research, USA *Steering Committee* * Ian Foster, University of Chicago & Argonne National Laboratory, USA * Geoffrey Fox, Indiana University, USA * James Hamilton, Amazon, USA * Manish Parashar, Rutgers University, USA * Dan Reed, Microsoft Research, USA * Rich Wolski, University of California at Santa Barbara, USA * Rong Chang, IBM, USA *Program Committee* * David Abramson, Monash University, Australia * Abhishek Chandra, University of Minnesota, USA * Yong Chen, Texas Tech University, USA * Terence Critchlow, Pacific Northwest National Laboratory, USA * Murat Demirbas, SUNY Buffalo, USA * Jaliya Ekanayake, Microsoft Research, USA * Rob Gillen, Oak Ridge National Laboratory, USA * Maria Indrawan, Monash University, Australia * Alexandru Iosup, Delft University of Technology, Netherlands * Hui Jin, Illinois Institute of Technology, USA * Dan S. 
Katz, University of Chicago, USA * Gregor von Laszewski, Indiana University, USA * Erwin Laure, CERN, Switzerland * Reagan Moore, University of North Carolina at Chapel Hill, USA * Jim Myers, Rensselaer Polytechnic Institute, USA * Judy Qiu, Indiana University, USA * Lavanya Ramakrishnan, Lawrence Berkeley National Laboratory, USA * Florian Schintke, Zuse Institute Berlin, Germany * Borja Sotomayor, University of Chicago, USA * Ian Taylor, Cardiff University, UK * Bernard Traversat, Oracle Corporation, USA -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ ================================================================= ================================================================= From iraicu at cs.iit.edu Wed Aug 31 03:42:30 2011 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Wed, 31 Aug 2011 03:42:30 -0500 Subject: [Swift-devel] CiteSearcher: a Google Scholar front-end for iOS and Android mobile devices Message-ID: <4E5DF3F6.7060003@cs.iit.edu> CiteSearcher v1.2 Google Scholar on your iPod, iPhone, iPad, and Android based mobile devices *http://datasys.cs.iit.edu/projects/CiteSearcher/* CiteSearcher is a Google Scholar front-end for iOS and Android mobile devices. With it, you can easily search Google Scholar for an author's work, his/her Hirsch index (H-index, http://en.wikipedia.org/wiki/H-index), and G-Index (http://en.wikipedia.org/wiki/G-index). 
For a detailed list of features and screenshots, see http://datasys.cs.iit.edu/projects/CiteSearcher/details.html. For the free downloads, see iOS (http://itunes.apple.com/us/app/citesearcher/id453186643?mt=8) or Android (https://market.android.com/details?id=datasys.iit). We plan to maintain this software as long as there is demand from the community, and improve it with new features and by supporting additional mobile devices. The lead developer of these applications is Kevin Brandstatter from the DataSys Laboratory at Illinois Institute of Technology. If you would like to sign up for the CiteSearcher user mailing list to hear about future releases of CiteSearcher, please see http://datasys.cs.iit.edu/mailman/listinfo/citesearcher-user. For any comments or feedback, please write to citesearcher-devel at datasys.cs.iit.edu. Bugs can be reported to http://datasys.cs.iit.edu/projects/CiteSearcher/bugReport.php. -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ ================================================================= =================================================================
From ketancmaheshwari at gmail.com Wed Aug 31 09:04:26 2011 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Wed, 31 Aug 2011 09:04:26 -0500 Subject: [Swift-devel] queuedsize > 0 but no job dequeued In-Reply-To: <1314734754.6888.0.camel@blabla> References: <1314663369.31525.0.camel@blabla> <1314667837.919.0.camel@blabla> <1314676559.1750.0.camel@blabla> <1314734754.6888.0.camel@blabla> Message-ID: Mihael, I did the run with debug enabled on coasters. Please find the logs, etc., for this run here: http://www.ci.uchicago.edu/~ketan/run25.tgz Note that the run went well and ran up to 20k jobs without issues. After that I did not get nodes, so I stopped it and resumed it this morning. It ran for about 1000+ jobs and crashed with the same error message. Regards, Ketan On Tue, Aug 30, 2011 at 3:05 PM, Mihael Hategan wrote: > Any chance you can re-run this with debug enabled on coasters > (log4j.logger.org.globus.cog.abstraction.coaster=DEBUG)? > > On Mon, 2011-08-29 at 20:55 -0700, Mihael Hategan wrote: > > My bad. The info is in the swift log. > > > > On Mon, 2011-08-29 at 20:59 -0500, Ketan Maheshwari wrote: > > > This is on Beagle. I am running local:pbs from /lustre. > > > > > > On Mon, Aug 29, 2011 at 8:30 PM, Mihael Hategan > > > wrote: > > > On Mon, 2011-08-29 at 19:52 -0500, Ketan Maheshwari wrote: > > > > Mihael, > > > > > > > > > > > > This run was with automatic coasters. I do not see any > > > specific > > > > coasters.log file written during this run in .globus/coaster > > > nor in > > > > the run's work dir. > > > > > > > > > It's on the remote site in .globus/coasters. > > > > > > > > > > > > > > > Ketan > > > > > > > > On Mon, Aug 29, 2011 at 7:16 PM, Mihael Hategan > > > > > > > wrote: > > > > Can I have the coasters log please? 
> > > > > > > > > > > > On Sun, 2011-08-28 at 16:47 -0500, Ketan Maheshwari > > > wrote: > > > > > Hello, > > > > > > > > > > > > > > > I remember this error happened in the past with > > > Glen's and > > > > Sheri's > > > > > runs. I saw this today again on Beagle with 0.93 > > > while > > > > running the > > > > > DSSAT run. > > > > > > > > > > > > > > > The run stops with the following complete message: > > > > > > > > > > > > > > > queuedsize > 0 but no job dequeued. Queued: {} > > > > > java.lang.Throwable > > > > > at > > > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.requeueNonFitting(BlockQueueProcessor.java:269) > > > > > at > > > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:539) > > > > > at > > > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:110) > > > > > queuedsize > 0 but no job dequeued. 
Queued: {} > > > > > java.lang.Throwable > > > > > at > > > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.requeueNonFitting(BlockQueueProcessor.java:269) > > > > > at > > > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:539) > > > > > at > > > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:110) > > > > > Progress: time: Sun, 28 Aug 2011 13:34:26 -0600 > > > > Submitted:76 > > > > > Active:23 Checking status:1 Finished > > > successfully:597 > > > > > > > > > > > > > > > > > > > > > > > > > The logs, properties and sources for this run are: > > > > > http://www.ci.uchicago.edu/~ketan/run23.tgz > > > > > > > > > > > > > > > Regards, > > > > > -- > > > > > Ketan > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > Swift-devel mailing list > > > > > Swift-devel at ci.uchicago.edu > > > > > > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > Ketan > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > Ketan > > > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > -- Ketan
From davidk at ci.uchicago.edu Wed Aug 31 17:56:11 2011 From: davidk at ci.uchicago.edu (David Kelly) Date: Wed, 31 Aug 2011 17:56:11 -0500 (CDT) Subject: [Swift-devel] Swift 0.93 RC1 available In-Reply-To: <526100767.87201.1314830495411.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <233145295.87296.1314831371473.JavaMail.root@zimbra-mb2.anl.gov> Hello all, I just wanted to let you know that Swift 0.93 release candidate 1 is now available at http://www.ci.uchicago.edu/swift/packages/swift-0.93RC1.tar.gz Please download this and test it out a bit. If you notice any problems, please send an email to the list or create a new bugzilla ticket for it. Thanks! Regards, David