From ketan at mcs.anl.gov Mon Mar 2 15:47:48 2015
From: ketan at mcs.anl.gov (Ketan Maheshwari)
Date: Mon, 2 Mar 2015 15:47:48 -0600
Subject: [Swift-user] trunk-cobalt block task ended prematurely
Message-ID:

I am trying to run on BG/Q with local:cobalt with trunk, but Swift crashes with the following error:

Caused by: Exception in bgsh:
Arguments: [/home/ketan/SwiftApps/subjobs/mpicatsnsleep/mpicatnap, /gpfs/mira-home/ketan/SwiftApps/subjobs/mpicatsnsleep/./data.txt, /gpfs/mira-home/ketan/SwiftApps/subjobs/mpicatsnsleep/./outdir/f.0002.out, 1]
Host: cluster
Directory: catsnsleepmpi-run001/jobs/b/bgsh-3nq3uc5m
exception @ swift-int-staging.k, line: 165
Caused by:
exception @ swift-int-staging.k, line: 160
Caused by: Block task failed: 0302-2109420-000000 Block task ended prematurely

In the log, I see the qsub call being made and a job ID being returned. However, I could not figure out what causes the task to fail.

One more thing I noticed when translating from the old sites conf to the new one is that the new conf did not accept the property "globus:mode = script".

A full run log is attached. Thanks for any suggestions.

Thanks,
Ketan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: run001.tgz
Type: application/x-gzip
Size: 16530 bytes
Desc: not available
URL:

From wilde at anl.gov Mon Mar 2 16:02:08 2015
From: wilde at anl.gov (Michael Wilde)
Date: Mon, 2 Mar 2015 16:02:08 -0600
Subject: [Swift-user] trunk-cobalt block task ended prematurely
In-Reply-To:
References:
Message-ID: <54F4DDE0.2020006@anl.gov>

Is this the first time you've tried running 0.96 on the BG/Q in subjob mode? (I.e., has this ever worked before?)

Did you get a submission script in the run directory (or a log of the cobalt qsub or cqsub command) which you could test manually?

If 0.96 is rejecting the "script" property, it seems possible that 0.96 is generating an invalid qsub command and/or submission script.

- Mike

--
Michael Wilde
Mathematics and Computer Science          Computation Institute
Argonne National Laboratory               The University of Chicago

From hategan at mcs.anl.gov Mon Mar 2 16:07:38 2015
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 2 Mar 2015 14:07:38 -0800
Subject: [Swift-user] trunk-cobalt block task ended prematurely
In-Reply-To:
References:
Message-ID: <1425334058.14234.5.camel@echo>

I would recommend enabling worker logging to see if we get any info from the worker process. Could be some simple thing, like the wrong IP address.

Mihael

From ketan at mcs.anl.gov Mon Mar 2 16:14:23 2015
From: ketan at mcs.anl.gov (Ketan Maheshwari)
Date: Mon, 2 Mar 2015 16:14:23 -0600
Subject: [Swift-user] trunk-cobalt block task ended prematurely
In-Reply-To: <54F4DDE0.2020006@anl.gov>
References: <54F4DDE0.2020006@anl.gov>
Message-ID:

This is the first time I am trying with 0.96.

The generated qsub command indeed does not have "--mode script" which seems to be causing the issue.

Thanks,
Ketan

On Mon, Mar 2, 2015 at 4:02 PM, Michael Wilde wrote:

> Is this the first time you've tried running 0.96 on the BG/Q in subjob
> mode? (I.e., has this ever worked before?)
>
> Did you get a submission script in the run directory (or a log of the
> cobalt qsub or cqsub command) which you could test manually?

From hategan at mcs.anl.gov Mon Mar 2 16:27:06 2015
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 2 Mar 2015 14:27:06 -0800
Subject: [Swift-user] trunk-cobalt block task ended prematurely
In-Reply-To:
References: <54F4DDE0.2020006@anl.gov>
Message-ID: <1425335226.14234.8.camel@echo>

I looked at the Cobalt script in the run directory you sent and it was empty. When I look at the trunk code for the Cobalt provider, I don't see anything that would generate a script, so I'm not sure what's happening there. I vaguely remember that somebody wrote a patch for the Cobalt provider to use script mode, but I'm not sure where that ended up.

Mihael

On Mon, 2015-03-02 at 16:14 -0600, Ketan Maheshwari wrote:
> This is the first time I am trying with 0.96.
>
> The generated qsub command indeed does not have "--mode script" which seems
> to be causing the issue.

From ketan at mcs.anl.gov Mon Mar 2 16:27:27 2015
From: ketan at mcs.anl.gov (Ketan Maheshwari)
Date: Mon, 2 Mar 2015 16:27:27 -0600
Subject: [Swift-user] trunk-cobalt block task ended prematurely
In-Reply-To: <40501f76358841f2a925e87b0db50dc9@NAGURSKI.anl.gov>
References: <40501f76358841f2a925e87b0db50dc9@NAGURSKI.anl.gov>
Message-ID:

For workerlogs, I am trying:

app.bgsh {
    executable: "/home/ketan/SwiftApps/subjobs/bg.sh"
    maxWallTime: "00:04:00"
    env.ENABLE_WORKER_LOGGING="TRUE"
    env.WORKER_LOGGING_LEVEL="DEBUG"
    env.WORKER_LOG_DIR="/home/ketan/workerlogs"
}

Does not seem to trigger logging.

Thanks,
Ketan

On Mon, Mar 2, 2015 at 4:07 PM, Hategan-Marandiuc, Philip M. <hategan at mcs.anl.gov> wrote:

> I would recommend enabling worker logging to see if we get any info from
> the worker process. Could be some simple thing, like the wrong IP
> address.

From ketan at mcs.anl.gov Mon Mar 2 16:30:17 2015
From: ketan at mcs.anl.gov (Ketan Maheshwari)
Date: Mon, 2 Mar 2015 16:30:17 -0600
Subject: [Swift-user] trunk-cobalt block task ended prematurely
In-Reply-To: <32e63864a40f40b99c18779246bdc548@GEORGE.anl.gov>
References: <54F4DDE0.2020006@anl.gov> <32e63864a40f40b99c18779246bdc548@GEORGE.anl.gov>
Message-ID:

BG/Q accepts the job description in the form of a qsub command only. This is why no script gets generated. I updated the provider from the old cqsub and its related options to the new qsub and its options. However, I am not sure why the script mode is not working.

--Ketan

On Mon, Mar 2, 2015 at 4:27 PM, Hategan-Marandiuc, Philip M. <hategan at mcs.anl.gov> wrote:

> I looked at the Cobalt script in the run directory you sent and it was
> empty. When I look at the trunk code for the Cobalt provider, I don't
> see anything that would generate a script, so I'm not sure what's
> happening there. I vaguely remember that somebody wrote a patch for the
> Cobalt provider to use script mode, but I'm not sure where that ended
> up.

From hategan at mcs.anl.gov Mon Mar 2 16:33:09 2015
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 2 Mar 2015 14:33:09 -0800
Subject: [Swift-user] trunk-cobalt block task ended prematurely
In-Reply-To:
References: <40501f76358841f2a925e87b0db50dc9@NAGURSKI.anl.gov>
Message-ID: <1425335589.17672.1.camel@echo>

Well, we need to figure out why. Since the qsub command line is in the swift log, and the qsub command line should reflect the setting, it would be useful if you posted the swift log.

Mihael

On Mon, 2015-03-02 at 16:27 -0600, Ketan Maheshwari wrote:
> For workerlogs, I am trying:
>
> app.bgsh {
> executable: "/home/ketan/SwiftApps/subjobs/bg.sh"
> maxWallTime: "00:04:00"
> env.ENABLE_WORKER_LOGGING="TRUE"
> env.WORKER_LOGGING_LEVEL="DEBUG"
> env.WORKER_LOG_DIR="/home/ketan/workerlogs"
> }
>
> Does not seem to trigger logging.

From ketan at mcs.anl.gov Mon Mar 2 16:37:32 2015
From: ketan at mcs.anl.gov (Ketan Maheshwari)
Date: Mon, 2 Mar 2015 16:37:32 -0600
Subject: [Swift-user] trunk-cobalt block task ended prematurely
In-Reply-To: <73ab26c4260a4caf878a642c963dd98b@NAGURSKI.anl.gov>
References: <40501f76358841f2a925e87b0db50dc9@NAGURSKI.anl.gov> <73ab26c4260a4caf878a642c963dd98b@NAGURSKI.anl.gov>
Message-ID:

The qsub command from the log says:

qsub -e WORKER_LOGGING_LEVEL=NONE --proccount 32 -n 32 -t 40 --cwd ...

So, the env variable on swift.conf does not seem to take effect.
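
To repeat the check described above, the logging level that actually reached qsub can be pulled out of the Swift log with a small shell helper. This is only a sketch: the function name worker_logging_levels and the example log file name are made up here, and it assumes each qsub command line sits on a single log line, as in the excerpt above.

```shell
#!/bin/sh
# Sketch only: list the distinct WORKER_LOGGING_LEVEL values that appear on
# qsub command lines in a Swift log. The function name is hypothetical.
worker_logging_levels() {
    log_file="$1"
    # Keep only lines mentioning qsub, then extract the NAME=VALUE token.
    grep 'qsub' "$log_file" \
        | grep -o 'WORKER_LOGGING_LEVEL=[A-Za-z]*' \
        | sort -u
}

# Example call (hypothetical log file name):
#   worker_logging_levels swift-run.log
```

If the only value printed is WORKER_LOGGING_LEVEL=NONE, the setting in swift.conf did not reach the worker's environment, which matches what the log excerpt shows.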

On Mon, Mar 2, 2015 at 4:33 PM, Hategan-Marandiuc, Philip M. <hategan at mcs.anl.gov> wrote:

> Well, we need to figure out why. Since the qsub command line is in the
> swift log, and the qsub command line should reflect the setting, it
> would be useful if you posted the swift log.

From hategan at mcs.anl.gov Mon Mar 2 17:27:09 2015
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 2 Mar 2015 15:27:09 -0800
Subject: [Swift-user] trunk-cobalt block task ended prematurely
In-Reply-To:
References: <40501f76358841f2a925e87b0db50dc9@NAGURSKI.anl.gov> <73ab26c4260a4caf878a642c963dd98b@NAGURSKI.anl.gov>
Message-ID: <1425338829.21807.4.camel@echo>

It would really be much more useful if you posted the full log.

Anyway, I believe that what you need to do is:

site.cluster.execution.options.workerLoggingLevel = "DEBUG"

Mihael

On Mon, 2015-03-02 at 16:37 -0600, Ketan Maheshwari wrote:
> The qsub command from the log says:
>
> qsub -e WORKER_LOGGING_LEVEL=NONE --proccount 32 -n 32 -t 40 --cwd ...
>
> So, the env variable on swift.conf does not seem to take effect.

From ketan at mcs.anl.gov Mon Mar 2 18:11:02 2015
From: ketan at mcs.anl.gov (Ketan Maheshwari)
Date: Mon, 2 Mar 2015 18:11:02 -0600
Subject: [Swift-user] trunk-cobalt block task ended prematurely
In-Reply-To: <1425338829.21807.4.camel@echo>
References: <40501f76358841f2a925e87b0db50dc9@NAGURSKI.anl.gov> <73ab26c4260a4caf878a642c963dd98b@NAGURSKI.anl.gov> <1425338829.21807.4.camel@echo>
Message-ID:

I tried this option, but it did not seem to work. Attached is the log.

On Mon, Mar 2, 2015 at 5:27 PM, Mihael Hategan wrote:

> It would really be much more useful if you posted the full log.
>
> Anyway, I believe that what you need to do is:
> site.cluster.execution.options.workerLoggingLevel = "DEBUG"
>
> Mihael
-------------- next part --------------
A non-text attachment was scrubbed...
Name: run003.tgz
Type: application/x-gzip
Size: 9567 bytes
Desc: not available
URL:

From hategan at mcs.anl.gov Mon Mar 2 18:25:42 2015
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 2 Mar 2015 16:25:42 -0800
Subject: [Swift-user] trunk-cobalt block task ended prematurely
In-Reply-To:
References: <40501f76358841f2a925e87b0db50dc9@NAGURSKI.anl.gov> <73ab26c4260a4caf878a642c963dd98b@NAGURSKI.anl.gov> <1425338829.21807.4.camel@echo>
Message-ID: <1425342342.29195.3.camel@echo>

On Mon, 2015-03-02 at 18:11 -0600, Ketan Maheshwari wrote:
> I tried this option but did not seem to work. Attached is the log.

Check /home/ketan/.globus/coasters for worker logs. If there aren't any, it means that worker.pl isn't being started (I'm assuming that /home is mounted on compute/service nodes).

If that's the case, I would suggest troubleshooting by manually running the qsub command and seeing why the worker doesn't start.

Mihael

> On Mon, Mar 2, 2015 at 5:27 PM, Mihael Hategan wrote:
>
> > It would really be much more useful if you posted the full log.

From ketan at mcs.anl.gov Mon Mar 2 18:55:22 2015
From: ketan at mcs.anl.gov (Ketan Maheshwari)
Date: Mon, 2 Mar 2015 18:55:22 -0600
Subject: [Swift-user] trunk-cobalt block task ended prematurely
In-Reply-To: <1425342342.29195.3.camel@echo>
References: <40501f76358841f2a925e87b0db50dc9@NAGURSKI.anl.gov> <73ab26c4260a4caf878a642c963dd98b@NAGURSKI.anl.gov> <1425338829.21807.4.camel@echo> <1425342342.29195.3.camel@echo>
Message-ID:

I do not see any logs in ~/.globus/coasters; yes, /home is mounted on service nodes and is writable from there.

I added "--mode script" as a default arg to qsub in the provider code, but I am still getting the same error. Attached is the new log.

About the manual option, would we also need the coaster service to be running, or would just invoking the worker suffice (for troubleshooting purposes)?

--Ketan

On Mon, Mar 2, 2015 at 6:25 PM, Mihael Hategan wrote:

> On Mon, 2015-03-02 at 18:11 -0600, Ketan Maheshwari wrote:
> > I tried this option but did not seem to work. Attached is the log.
>
> Check /home/ketan/.globus/coasters for worker logs. If there aren't any,
> it means that worker.pl isn't being started (I'm assuming that /home is
> mounted on compute/service nodes).
>
> If that's the case, I would suggest troubleshooting by manually running
> the qsub command and seeing why the worker doesn't start.
> > Mihael > > > > > On Mon, Mar 2, 2015 at 5:27 PM, Mihael Hategan > wrote: > > > > > It would really be much more useful if you posted the full log. > > > > > > Anyway, I believe that what you need to do is: > > > site.cluster.execution.options.workerLoggingLevel = "DEBUG" > > > > > > Mihael > > > > > > On Mon, 2015-03-02 at 16:37 -0600, Ketan Maheshwari wrote: > > > > The qsub command from the log says: > > > > > > > > qsub -e WORKER_LOGGING_LEVEL=NONE --proccount 32 -n 32 -t 40 --cwd > ... > > > > > > > > So, the env variable on swift.conf does not seem to take effect. > > > > > > > > On Mon, Mar 2, 2015 at 4:33 PM, Hategan-Marandiuc, Philip M. < > > > > hategan at mcs.anl.gov> wrote: > > > > > > > > > Well, we need to figure out why. Since the qsub command line is in > the > > > > > swift log, and the qsub command line should reflect the setting, it > > > > > would be useful if you posted the swift log. > > > > > > > > > > Mihael > > > > > > > > > > On Mon, 2015-03-02 at 16:27 -0600, Ketan Maheshwari wrote: > > > > > > For workerlogs, I am trying: > > > > > > > > > > > > app.bgsh { > > > > > > executable: "/home/ketan/SwiftApps/subjobs/bg.sh" > > > > > > maxWallTime: "00:04:00" > > > > > > env.ENABLE_WORKER_LOGGING="TRUE" > > > > > > env.WORKER_LOGGING_LEVEL="DEBUG" > > > > > > env.WORKER_LOG_DIR="/home/ketan/workerlogs" > > > > > > } > > > > > > > > > > > > Does not seem to trigger logging. > > > > > > > > > > > > Thanks, > > > > > > Ketan > > > > > > > > > > > > On Mon, Mar 2, 2015 at 4:07 PM, Hategan-Marandiuc, Philip M. < > > > > > > hategan at mcs.anl.gov> wrote: > > > > > > > > > > > > > I would recommend enabling worker logging to see if we get any > info > > > > > from > > > > > > > the worker process. Could be some simple thing, like the wrong > IP > > > > > > > address. 
> > > > > > > > > > > > > > Mihael > > > > > > > > > > > > > > On Mon, 2015-03-02 at 15:47 -0600, Ketan Maheshwari wrote: > > > > > > > > I trying to run on BG/Q with local:cobalt with trunk but > Swift > > > > > crashes > > > > > > > with > > > > > > > > the following error: > > > > > > > > > > > > > > > > Caused by: Exception in bgsh: > > > > > > > > Arguments: > > > > > [/home/ketan/SwiftApps/subjobs/mpicatsnsleep/mpicatnap, > > > > > > > > > /gpfs/mira-home/ketan/SwiftApps/subjobs/mpicatsnsleep/./data.txt, > > > > > > > > > > > > > > > > > > > > > > > > /gpfs/mira-home/ketan/SwiftApps/subjobs/mpicatsnsleep/./outdir/f.0002.out, > > > > > > > > 1] > > > > > > > > Host: cluster > > > > > > > > Directory: catsnsleepmpi-run001/jobs/b/bgsh-3nq3uc5m > > > > > > > > exception @ swift-int-staging.k, line: 165 > > > > > > > > Caused by: > > > > > > > > exception @ swift-int-staging.k, line: 160 > > > > > > > > Caused by: Block task failed: 0302-2109420-000000 Block task > > > ended > > > > > > > > prematurely > > > > > > > > > > > > > > > > In the log, I see the qsub call being made and a jobid is > > > returned. > > > > > > > > However, I could not figure what is the cause for the task to > > > fail. > > > > > > > > > > > > > > > > One more thing I noticed when translating from old sites > conf to > > > new > > > > > is > > > > > > > > that the new conf did not accept the property "globus:mode = > > > script". > > > > > > > > > > > > > > > > A full run log is attached. Thanks for any suggestions. 
> > > > > > > > > > > > > > > > Thanks, > > > > > > > > Ketan > > > > > > > > _______________________________________________ > > > > > > > > Swift-user mailing list > > > > > > > > Swift-user at ci.uchicago.edu > > > > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > Swift-user mailing list > > > Swift-user at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: run007.tgz Type: application/x-gzip Size: 9545 bytes Desc: not available URL: From hategan at mcs.anl.gov Mon Mar 2 19:35:56 2015 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 2 Mar 2015 17:35:56 -0800 Subject: [Swift-user] trunk-cobalt block task ended prematurely In-Reply-To: References: <40501f76358841f2a925e87b0db50dc9@NAGURSKI.anl.gov> <73ab26c4260a4caf878a642c963dd98b@NAGURSKI.anl.gov> <1425338829.21807.4.camel@echo> <1425342342.29195.3.camel@echo> Message-ID: <1425346556.2015.1.camel@echo> On Mon, 2015-03-02 at 18:55 -0600, Ketan Maheshwari wrote: > I do not see any logs in ~/.globus/coasters; yes, /home is mounted on > service nodes and is writable from there. > > I added "--mode script" as a default arg to qsub in provider code, but > still getting the same error. Attached is the new log. > > About the manual option, would we also need coaster service to be running? > Or just invoking worker would suffice (for troubleshooting purposes)? Just invoking worker.pl. 
You should eventually get a log file from the worker that indicates that the perl process has started. It will fail, unable to connect to the service, but that's secondary. I'm surprised that you are not getting any stdout/stderr from the process. Maybe the secret is somewhere around that. Mihael > > --Ketan > > On Mon, Mar 2, 2015 at 6:25 PM, Mihael Hategan wrote: > > > On Mon, 2015-03-02 at 18:11 -0600, Ketan Maheshwari wrote: > > > I tried this option but did not seem to work. Attached is the log. > > > > Check /home/ketan/.globus/coasters for worker logs. If there aren't any, > > it means that worker.pl isn't being started (I'm assuming that /home is > > mounted on compute/service nodes). > > > > If that's the case, I would suggest troubleshooting by manually running > > the qsub command and seeing why the worker doesn't start. > > > > Mihael > > > > > > > > On Mon, Mar 2, 2015 at 5:27 PM, Mihael Hategan > > wrote: > > > > > > > It would really be much more useful if you posted the full log. > > > > > > > > Anyway, I believe that what you need to do is: > > > > site.cluster.execution.options.workerLoggingLevel = "DEBUG" > > > > > > > > Mihael > > > > > > > > On Mon, 2015-03-02 at 16:37 -0600, Ketan Maheshwari wrote: > > > > > The qsub command from the log says: > > > > > > > > > > qsub -e WORKER_LOGGING_LEVEL=NONE --proccount 32 -n 32 -t 40 --cwd > > ... > > > > > > > > > > So, the env variable on swift.conf does not seem to take effect. > > > > > > > > > > On Mon, Mar 2, 2015 at 4:33 PM, Hategan-Marandiuc, Philip M. < > > > > > hategan at mcs.anl.gov> wrote: > > > > > > > > > > > Well, we need to figure out why. Since the qsub command line is in > > the > > > > > > swift log, and the qsub command line should reflect the setting, it > > > > > > would be useful if you posted the swift log. 
> > > > > > > > > > > > Mihael > > > > > > > > > > > > On Mon, 2015-03-02 at 16:27 -0600, Ketan Maheshwari wrote: > > > > > > > For workerlogs, I am trying: > > > > > > > > > > > > > > app.bgsh { > > > > > > > executable: "/home/ketan/SwiftApps/subjobs/bg.sh" > > > > > > > maxWallTime: "00:04:00" > > > > > > > env.ENABLE_WORKER_LOGGING="TRUE" > > > > > > > env.WORKER_LOGGING_LEVEL="DEBUG" > > > > > > > env.WORKER_LOG_DIR="/home/ketan/workerlogs" > > > > > > > } > > > > > > > > > > > > > > Does not seem to trigger logging. > > > > > > > > > > > > > > Thanks, > > > > > > > Ketan > > > > > > > > > > > > > > On Mon, Mar 2, 2015 at 4:07 PM, Hategan-Marandiuc, Philip M. < > > > > > > > hategan at mcs.anl.gov> wrote: > > > > > > > > > > > > > > > I would recommend enabling worker logging to see if we get any > > info > > > > > > from > > > > > > > > the worker process. Could be some simple thing, like the wrong > > IP > > > > > > > > address. > > > > > > > > > > > > > > > > Mihael > > > > > > > > > > > > > > > > On Mon, 2015-03-02 at 15:47 -0600, Ketan Maheshwari wrote: > > > > > > > > > I trying to run on BG/Q with local:cobalt with trunk but > > Swift > > > > > > crashes > > > > > > > > with > > > > > > > > > the following error: > > > > > > > > > > > > > > > > > > Caused by: Exception in bgsh: > > > > > > > > > Arguments: > > > > > > [/home/ketan/SwiftApps/subjobs/mpicatsnsleep/mpicatnap, > > > > > > > > > > > /gpfs/mira-home/ketan/SwiftApps/subjobs/mpicatsnsleep/./data.txt, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > /gpfs/mira-home/ketan/SwiftApps/subjobs/mpicatsnsleep/./outdir/f.0002.out, > > > > > > > > > 1] > > > > > > > > > Host: cluster > > > > > > > > > Directory: catsnsleepmpi-run001/jobs/b/bgsh-3nq3uc5m > > > > > > > > > exception @ swift-int-staging.k, line: 165 > > > > > > > > > Caused by: > > > > > > > > > exception @ swift-int-staging.k, line: 160 > > > > > > > > > Caused by: Block task failed: 0302-2109420-000000 Block task > > > > ended > 
> > > > > > > > prematurely > > > > > > > > > > > > > > > > > > In the log, I see the qsub call being made and a jobid is > > > > returned. > > > > > > > > > However, I could not figure what is the cause for the task to > > > > fail. > > > > > > > > > > > > > > > > > > One more thing I noticed when translating from old sites > > conf to > > > > new > > > > > > is > > > > > > > > > that the new conf did not accept the property "globus:mode = > > > > script". > > > > > > > > > > > > > > > > > > A full run log is attached. Thanks for any suggestions. > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > Ketan > > > > > > > > > _______________________________________________ > > > > > > > > > Swift-user mailing list > > > > > > > > > Swift-user at ci.uchicago.edu > > > > > > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > Swift-user mailing list > > > > Swift-user at ci.uchicago.edu > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > > > > > > _______________________________________________ > > Swift-user mailing list > > Swift-user at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > From ketan at mcs.anl.gov Mon Mar 2 20:22:28 2015 From: ketan at mcs.anl.gov (Ketan Maheshwari) Date: Mon, 2 Mar 2015 20:22:28 -0600 Subject: [Swift-user] trunk-cobalt block task ended prematurely In-Reply-To: References: <40501f76358841f2a925e87b0db50dc9@NAGURSKI.anl.gov> <73ab26c4260a4caf878a642c963dd98b@NAGURSKI.anl.gov> <1425338829.21807.4.camel@echo> <1425342342.29195.3.camel@echo> Message-ID: OK, I found that worker.pl was crashing because of my subjob related mods. I forgot to declare a variable using "my". After this change, it runs. 

However, jobs that complete are not reported to be completed; they stay in
"active" state as seen from the progress log till the job times out.

I also see the following lines in stderr:

Use of uninitialized value $SOFT_IMAGE_JOB_ID in numeric eq (==) at /home/ketan/.globus/coasters/cscript225276003254762418.pl line 2235.
Use of uninitialized value in concatenation (.) or string at /home/ketan/.globus/coasters/cscript225276003254762418.pl line 387.
Use of uninitialized value $SOFT_IMAGE_JOB_ID in numeric eq (==) at /home/ketan/.globus/coasters/cscript225276003254762418.pl line 2235.
Use of uninitialized value $SOFT_IMAGE_JOB_ID in numeric eq (==) at /home/ketan/.globus/coasters/cscript225276003254762418.pl line 2235.
Use of uninitialized value in concatenation (.) or string at /home/ketan/.globus/coasters/cscript225276003254762418.pl line 387.
Use of uninitialized value $SOFT_IMAGE_JOB_ID in numeric eq (==) at /home/ketan/.globus/coasters/cscript225276003254762418.pl line 2235.

Not sure if these are errors or warnings and relevant. Attached is the
complete log.

Thanks,
Ketan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: run001.tgz
Type: application/x-gzip
Size: 19894 bytes
Desc: not available

From hategan at mcs.anl.gov Mon Mar 2 23:41:51 2015
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 2 Mar 2015 21:41:51 -0800
Subject: [Swift-user] trunk-cobalt block task ended prematurely
In-Reply-To:
References: <40501f76358841f2a925e87b0db50dc9@NAGURSKI.anl.gov> <73ab26c4260a4caf878a642c963dd98b@NAGURSKI.anl.gov> <1425338829.21807.4.camel@echo> <1425342342.29195.3.camel@echo>
Message-ID: <1425361311.28006.4.camel@echo>

On Mon, 2015-03-02 at 20:22 -0600, Ketan Maheshwari wrote:
> OK, I found that worker.pl was crashing because of my subjob related mods.
> I forgot to declare a variable using "my". After this change, it runs.
>
> However, jobs that complete are not reported to be completed; they stay in
> "active" state as seen from the progress log till the job times out.

The last state I see in the info files is of this form:

Progress 2015-03-03 02:08:20.687250858+0000 EXECUTE

From that I would say that the apps really don't terminate within their
walltime. This is unrelated to worker.pl, since it happens at the
_swiftwrap level.

Mihael

From overhaeg at vub.ac.be Tue Mar 3 03:53:22 2015
From: overhaeg at vub.ac.be (Olivier Verhaegen)
Date: Tue, 3 Mar 2015 10:53:22 +0100
Subject: [Swift-user] Public Swift repositories and projects
Message-ID:

Hi,

I'm looking for Swift source code but I unfortunately couldn't find any,
mostly because of interference of Apple's Swift in the search results. Do
any of you know locations of public Swift repositories or are willing to
share their Swift projects?

Thanks,

Olivier.

From pittjj at uchicago.edu Tue Mar 3 12:17:36 2015
From: pittjj at uchicago.edu (Jason James Pitt)
Date: Tue, 3 Mar 2015 18:17:36 +0000
Subject: [Swift-user] File header does not match type
Message-ID:

Hi All,

I was hoping someone could help shed some light on an issue I'm having
running a Swift script (with Swift 0.95-RC6). When running my code I get
the following failure (execution stops immediately after).

Progress: Fri, 27 Feb 2015 04:59:20-0600

Execution failed:
File header does not match type. Expected 4 whitespace separated items. Got 2 instead.

I've been comparing this Swift script and the configuration files to other
Swift scripts that have run successfully. I can't seem to identify the
problem. Is the above a general failure for all kinds of Swift
configuration files? Perhaps there is a specific file I should be looking at?

Thanks, and I'd be grateful for any insight.
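
[Note: the error text above counts whitespace-separated items on a header line, so one quick check is to count the header items in each candidate data file and compare that with the number of fields the script expects. What follows is a minimal sketch in Python, not part of Swift. The function names are hypothetical, and it assumes the header is simply the first line of the file, split on whitespace.]

```python
def count_header_items(header_line):
    """Count the whitespace-separated items on a header line."""
    # str.split() with no arguments splits on any run of whitespace
    # and ignores leading and trailing whitespace.
    return len(header_line.split())


def check_header(header_line, expected_items):
    """Return None if the header has the expected number of items,
    otherwise a message shaped like the error quoted in the report."""
    found = count_header_items(header_line)
    if found == expected_items:
        return None
    return (f"Expected {expected_items} whitespace separated items. "
            f"Got {found} instead.")


def first_line(path):
    """Read only the first line of a file (assumed here to be the header)."""
    with open(path) as handle:
        return handle.readline()
```

[For instance, check_header(first_line("data.txt"), 4), with a hypothetical file name, would flag a header that holds two names where four fields are expected. A header that looks right by eye can still miscount if a field name contains a space or the file is comma-separated.]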

Jason

From wilde at anl.gov Tue Mar 3 12:24:48 2015
From: wilde at anl.gov (Michael Wilde)
Date: Tue, 3 Mar 2015 12:24:48 -0600
Subject: [Swift-user] File header does not match type
In-Reply-To:
References:
Message-ID: <54F5FC70.7060105@anl.gov>

Jason,

This sounds like an error from readData( ). Do you have a readData
statement that's reading a file into a structure of four fields, but is
only finding 2 fields on the header line?

It's remotely possible that an ext_mapper could give a similar error.

Does that help narrow it down?

- Mike

-- 
Michael Wilde
Mathematics and Computer Science    Computation Institute
Argonne National Laboratory         The University of Chicago

From yadunand at uchicago.edu Thu Mar 5 10:59:32 2015
From: yadunand at uchicago.edu (Yadu Nand Babuji)
Date: Thu, 05 Mar 2015 10:59:32 -0600
Subject: [Swift-user] Public Swift repositories and projects
In-Reply-To:
References:
Message-ID: <54F88B74.8030104@uchicago.edu>

Hi Olivier,

The Swift repository is available here: https://github.com/swift-lang/swift-k

I'm not aware of any public repositories, but we do have some science
apps that we use for demos. If that is of interest, please let me know.

Thanks,
Yadu

From yadunand at uchicago.edu Thu Mar 5 11:13:50 2015
From: yadunand at uchicago.edu (Yadu Nand Babuji)
Date: Thu, 05 Mar 2015 11:13:50 -0600
Subject: [Swift-user] File header does not match type
In-Reply-To: <54F5FC70.7060105@anl.gov>
References: <54F5FC70.7060105@anl.gov>
Message-ID: <54F88ECE.6020909@uchicago.edu>

Update: As Mike pointed out, this was indeed an error from a mismatch
between the file header readData( ) was reading and the target struct.

-Yadu

From karthikeyanb at uchicago.edu Fri Mar 6 11:27:36 2015
From: karthikeyanb at uchicago.edu (Karthikeyan Balasubramanian)
Date: Fri, 6 Mar 2015 17:27:36 +0000
Subject: [Swift-user] Swift 0.96 - Connection to worker lost exception - reg.
Message-ID: <8CEB97C36B499F4CB2FA1E00DD06E343449E1523@xm-mbx-07-prod.ad.uchicago.edu>

Hi,

The following exception was thrown by Swift 0.96 while the process was
running. Attached, please find the runxxx folder.

------- Application STDERR --------

-----------------------------------

exception @ swift-int-staging.k, line: 160
Caused by: Block task failed: Connection to worker lost
org.globus.cog.coaster.TimeoutException: Channel timed out.
lastTime=150306-040431.246, now=150306-040632.030, channel=0304-5404390-000001:000000
    at org.globus.cog.coaster.channels.AbstractCoasterChannel.checkTimeouts(AbstractCoasterChannel.java:163)
    at org.globus.cog.coaster.channels.AbstractCoasterChannel$1.run(AbstractCoasterChannel.java:154)
    at java.util.TimerThread.mainLoop(Timer.java:555)
    at java.util.TimerThread.run(Timer.java:505)

k:assign @ swift.k, line: 174
Caused by: Exception in rungenerateGLMC:
Arguments: [/project/rossc/BMI/glmmodel/././data/input/z20130227_SPK.mat, 0, /project/rossc/BMI/glmmodel/./data/glmaicout/z20130227_SPK_0_AIC.mat, /project/rossc/BMI/glmmodel/./data/glmcausalout/z20130227_SPK_0_CNA.mat]
Host: midway
Directory: swift_glm-run016/jobs/r/rungenerateGLMC-eucktf5m
exception @ swift-int-staging.k, line: 165
Caused by:

Any thoughts or suggestions are appreciated.

Thanks.
B.K.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: run016.zip
Type: application/x-zip-compressed
Size: 1288084 bytes
Desc: run016.zip

From iraicu at cs.iit.edu Sat Mar 7 13:02:50 2015
From: iraicu at cs.iit.edu (Ioan Raicu)
Date: Sat, 07 Mar 2015 13:02:50 -0600
Subject: [Swift-user] CFP: IEEE Cluster 2015 -- Submission deadline extended to March 25th AOE! (hard deadline)
Message-ID: <54FB4B5A.7000501@cs.iit.edu>

March 7, 2015 Release

IEEE International Conference on Cluster Computing
September 8-11, 2015
Chicago, IL, USA
http://www.mcs.anl.gov/ieeecluster2015/

CLUSTER 2015 CALL FOR PAPERS

Following the successes of the series of Cluster conferences, for 2015 we
solicit high-quality original papers presenting work that advances the
state-of-the-art in clusters and closely related fields. All papers will
be rigorously peer-reviewed for their originality, technical depth and
correctness, potential impact, relevance to the conference, and quality
of presentation.
Research papers must clearly demonstrate research contributions and
novelty, while papers reporting experience must clearly describe lessons
learned and impact, along with the utility of the approach compared to
the ones in the past.

*** Paper Tracks ***

Area 1: Application, Algorithms, and Libraries
* HPC Applications on Clusters
* Performance Modeling and Measurement
* Novel Algorithms on Clusters
* Hybrid programming techniques (MPI+OpenMP, MPI+OpenCL, etc.)
* Cluster Benchmarks
* Application-level libraries on clusters
* Effective use of clusters in novel applications
* Performance evaluation tools

Area 2: Architecture, Network/Communications, and Management
* Energy-efficient cluster architectures
* Node and system architecture
* Packaging, power and cooling
* GPU/ManyCore and heterogeneous clusters
* Interconnect/memory architectures
* Single system image clusters
* Administration and maintenance tools

Area 3: Programming and System Software
* Cluster System Software/Operating Systems
* Cloud-enabling cluster technologies and virtualization
* Energy-efficient middleware
* Cluster system-level Protocols and APIs
* Cluster Security
* Resource and job management
* Programming and Software Development Environment on Clusters
* Fault tolerance and high-availability

Area 4: Data, Storage, and Visualization
* Cluster Architecture for Big Data storage and processing
* Middleware for Big Data management
* Cluster-based Cloud Architecture for Big Data
* File systems and I/O libraries
* Support and Integration of Non-Volatile Memory
* Visualization clusters and tiled displays
* Big Data visualization tools
* Programming models for Big Data processing
* Big Data Application studies on cluster architectures

*** Submission Guidelines ***

Authors are invited to submit papers electronically in PDF format.
Submitted manuscripts should be structured as technical papers and may not exceed 10 letter-size (8.5 x 11) pages including figures, tables and references using the IEEE format for conference proceedings. Submissions not conforming to these guidelines may be returned without review. Authors should make sure that their file will print on a printer that uses letter-size (8.5 x 11) paper. The official language of the conference is English. All manuscripts will be reviewed and will be judged on correctness, originality, technical strength, significance, quality of presentation, and interest and relevance to the conference attendees.

Paper submissions are limited to 10 pages in 2-column IEEE format including all figures and references. Submitted manuscripts exceeding this limit will be returned without review. For the final camera-ready version, authors with accepted papers may purchase additional pages at the following rates: 200 USD for each of two additional pages. See formatting templates for details:

LaTeX Package ZIP (http://datasys.cs.iit.edu/events/CCGrid2014/IEEECS_confs_LaTeX.zip)
Word Template DOC (http://datasys.cs.iit.edu/events/CCGrid2014/instruct8.5x11x2.doc) and PDF (http://datasys.cs.iit.edu/events/CCGrid2014/instruct8.5x11x2.pdf)

Submitted papers must represent original unpublished research that is not currently under review for any other conference or journal. Papers not following these guidelines will be rejected without review and further action may be taken, including (but not limited to) notifications sent to the heads of the institutions of the authors and sponsors of the conference. Submissions received after the due date, exceeding the page limit, or not appropriately structured may not be considered. Authors may contact the conference chairs for more information.

The proceedings will be published through the IEEE Computer Society Conference Publishing Services.
Please submit your paper via the EasyChair submission system: https://easychair.org/conferences/?conf=ieeecluster2015

*** Journal Special Issue ***

The best papers of Cluster 2015 will be included in a Special Issue on advances in topics related to cluster computing of the Elsevier International Journal of Parallel Computing (PARCO), edited by Pavan Balaji, Satoshi Matsuoka, and Michela Taufer. This special issue is dedicated to the papers accepted in the Cluster 2015 conference. Submission to this special issue is by invitation only.

*** Important Dates ***

***March 25, 2015*** Papers Submission Deadline (hard deadline)
May 9, 2015 Papers Acceptance Notification

See other deadlines in the Important Dates page (http://www.mcs.anl.gov/ieeecluster2015/author-information/important-dates)

*** Cluster 2015 Program Chair ***

Satoshi Matsuoka, Tokyo Institute of Technology (matsu AT is.titech.ac.jp).

----------------------------------------------
...Follow us on Facebook at https://www.facebook.com/ieee.cluster
...Follow us on Twitter at https://twitter.com/IEEECluster
...Follow us on Linkedin at https://www.linkedin.com/groups/IEEE-International-Conference-on-Cluster-7428925
...Follow us on RenRen at http://page.renren.com/601871401
----------------------------------------------

--
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor, Illinois Institute of Technology (IIT)
Guest Research Faculty, Argonne National Laboratory (ANL)
=================================================================
Data-Intensive Distributed Systems Laboratory, CS/IIT
Distributed Systems Laboratory, MCS/ANL
=================================================================
Editor: IEEE TCC, Springer Cluster, Springer JoCCASA
Chair: IEEE/ACM MTAGS, ACM ScienceCloud
=================================================================
Cel: 1-847-722-0876
Office: 1-312-567-5704
Email: iraicu at cs.iit.edu
Web: http://www.cs.iit.edu/~iraicu/
Web: http://datasys.cs.iit.edu/
LinkedIn: http://www.linkedin.com/in/ioanraicu
Google: http://scholar.google.com/citations?user=jE73HYAAAAAJ
=================================================================
=================================================================

From pittjj at uchicago.edu Tue Mar 10 08:47:13 2015
From: pittjj at uchicago.edu (Jason James Pitt)
Date: Tue, 10 Mar 2015 13:47:13 +0000
Subject: [Swift-user] CPU was not in the block (Swift 0.96)
Message-ID:

Hi Everyone,

I've sporadically seen the following exception (or similar) in some of the runs I've been performing recently. Any sense of what this may mean, and is there something I can do on my end to prevent it? I can pass along the logs if that'd be helpful (though this particular run is still active). Thanks!
Jason

CoasterService fatal error:
CPU was not in the block
java.lang.Throwable
    at org.globus.cog.abstraction.coaster.service.job.manager.Block.remove(Block.java:209)
    at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.jobTerminated(Cpu.java:115)
    at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.statusChanged(Cpu.java:433)
    at org.globus.cog.abstraction.impl.execution.coaster.NotificationManager.notificationReceived(NotificationManager.java:117)
    at org.globus.cog.abstraction.coaster.service.local.JobStatusHandler.requestComplete(JobStatusHandler.java:81)
    at org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:112)
    at org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:590)
    at org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.stepNIO(AbstractStreamCoasterChannel.java:240)
    at org.globus.cog.coaster.channels.NIOMultiplexer.loop(NIOMultiplexer.java:116)
    at org.globus.cog.coaster.channels.NIOMultiplexer.run(NIOMultiplexer.java:75)
CPU was not in the block
java.lang.Throwable
    at org.globus.cog.abstraction.coaster.service.job.manager.Block.remove(Block.java:209)
    at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.jobTerminated(Cpu.java:115)
    at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.statusChanged(Cpu.java:433)
    at org.globus.cog.abstraction.impl.execution.coaster.NotificationManager.notificationReceived(NotificationManager.java:117)
    at org.globus.cog.abstraction.coaster.service.local.JobStatusHandler.requestComplete(JobStatusHandler.java:81)
    at org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:112)
    at org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:590)
    at org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.stepNIO(AbstractStreamCoasterChannel.java:240)
    at org.globus.cog.coaster.channels.NIOMultiplexer.loop(NIOMultiplexer.java:116)
    at org.globus.cog.coaster.channels.NIOMultiplexer.run(NIOMultiplexer.java:75)

From wilde at anl.gov Tue Mar 10 08:56:16 2015
From: wilde at anl.gov (Michael Wilde)
Date: Tue, 10 Mar 2015 08:56:16 -0500
Subject: [Swift-user] CPU was not in the block (Swift 0.96)
In-Reply-To:
References:
Message-ID: <54FEF800.5060009@anl.gov>

Jason, we'll almost certainly need a log for this; feel free to send a pointer, off-list. Ideally for the run that produced the traceback below.

What happens when this occurs? Does the run continue? Any observable ill effects?

Thanks,

- Mike

On 3/10/15 8:47 AM, Jason James Pitt wrote:
> Hi Everyone,
>
> I've sporadically seen the following exception (or similar) in some of the runs I've been performing recently. Any sense of what this may mean, and is there something I can do on my end to prevent it? I can pass along the logs if that'd be helpful (though this particular run is still active). Thanks!
>
> Jason
>
> CoasterService fatal error:
> CPU was not in the block
> java.lang.Throwable
>     at org.globus.cog.abstraction.coaster.service.job.manager.Block.remove(Block.java:209)
>     at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.jobTerminated(Cpu.java:115)
>     at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.statusChanged(Cpu.java:433)
>     at org.globus.cog.abstraction.impl.execution.coaster.NotificationManager.notificationReceived(NotificationManager.java:117)
>     at org.globus.cog.abstraction.coaster.service.local.JobStatusHandler.requestComplete(JobStatusHandler.java:81)
>     at org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:112)
>     at org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:590)
>     at org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.stepNIO(AbstractStreamCoasterChannel.java:240)
>     at org.globus.cog.coaster.channels.NIOMultiplexer.loop(NIOMultiplexer.java:116)
>     at org.globus.cog.coaster.channels.NIOMultiplexer.run(NIOMultiplexer.java:75)
> CPU was not in the block
> java.lang.Throwable
>     at org.globus.cog.abstraction.coaster.service.job.manager.Block.remove(Block.java:209)
>     at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.jobTerminated(Cpu.java:115)
>     at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.statusChanged(Cpu.java:433)
>     at org.globus.cog.abstraction.impl.execution.coaster.NotificationManager.notificationReceived(NotificationManager.java:117)
>     at org.globus.cog.abstraction.coaster.service.local.JobStatusHandler.requestComplete(JobStatusHandler.java:81)
>     at org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:112)
>     at org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:590)
>     at org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.stepNIO(AbstractStreamCoasterChannel.java:240)
>     at org.globus.cog.coaster.channels.NIOMultiplexer.loop(NIOMultiplexer.java:116)
>     at org.globus.cog.coaster.channels.NIOMultiplexer.run(NIOMultiplexer.java:75)
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user

--
Michael Wilde
Mathematics and Computer Science Computation Institute
Argonne National Laboratory The University of Chicago

From pittjj at uchicago.edu Tue Mar 10 09:08:39 2015
From: pittjj at uchicago.edu (Jason James Pitt)
Date: Tue, 10 Mar 2015 14:08:39 +0000
Subject: [Swift-user] CPU was not in the block (Swift 0.96)
In-Reply-To: <54FEF800.5060009@anl.gov>
References: , <54FEF800.5060009@anl.gov>
Message-ID:

Hi Mike,

Thanks for a quick reply! I am preparing the files now, and will point you guys to them once everything is ready. Regarding ill effects, I can't confirm for sure, but it looks like tasks are remaining in the submitted state and are not becoming active.
In addition, I think the process itself is in a zombie state. The few jobs that are active should be finishing, but they are not. qstat is showing I have 10 workers active, but according to the screen log by swift I only have 6 active tasks (the jobs are one node jobs). Control^c isn't effective so I'm going to have to use the pid to kill the swift/java process.

Best,

Jason

From karthikeyanb at uchicago.edu Fri Mar 13 11:02:56 2015
From: karthikeyanb at uchicago.edu (Karthikeyan Balasubramanian)
Date: Fri, 13 Mar 2015 16:02:56 +0000
Subject: [Swift-user] Matlab mcr path - reg.
Message-ID: <8CEB97C36B499F4CB2FA1E00DD06E343449E24CA@xm-mbx-07-prod.ad.uchicago.edu>

Hi,

I am looking for the mcr path needed to run compiled Matlab codes on midway. Do we have the option of running the compiled code on mcr or should we invoke matlab environment to run the code?

Thanks.
B.K.

From iraicu at cs.iit.edu Sat Mar 21 09:12:04 2015
From: iraicu at cs.iit.edu (Ioan Raicu)
Date: Sat, 21 Mar 2015 09:12:04 -0500
Subject: [Swift-user] IEEE Cluster 2015 hard deadline approaching - March 25, 2015 (AOE)
Message-ID: <550D7C34.1060405@cs.iit.edu>

March 20, 2015 Release

IEEE International Conference on Cluster Computing
September 8-11, 2015
Chicago, IL, USA
http://www.mcs.anl.gov/ieeecluster2015/

*** NEW!!!!
Important Dates ***

March 25, 2015 (AOE) Papers Submission Deadline
May 9, 2015 Papers Acceptance Notification

See other deadlines in the Important Dates page (http://www.mcs.anl.gov/ieeecluster2015/author-information/important-dates)

From jscherer at anl.gov Mon Mar 30 16:10:54 2015
From: jscherer at anl.gov (Scherer, Justin C.)
Date: Mon, 30 Mar 2015 21:10:54 +0000
Subject: [Swift-user] Error with SSH and SSH-CL
Message-ID: <0C1D9134D7E7C042A9EBEE22F29A4CACAE1042F7@HALAS.anl.gov>

I have been trying to get the simple hello.swift script to run on a remote host that I have set up with another researcher. I will post my sites.xml, config file, and tc.data files along with this message.
After going through all of the steps to get ssh to work, I get the following error for running it in ssh mode:

RunID: 20150330-1550-ut0jr8rf
Progress: time: Mon, 30 Mar 2015 15:50:36 -0500
Execution failed:
Could not initialize shared directory on remotehost
Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: Error while communicating with the SSH server on 146.137.66.189:22
Caused by: Invalid private key or passphrase
Caused by: Can't read key due to cryptography problems: java.security.NoSuchAlgorithmException: Unsupported passphrase algorithm: AES-128-CBC
greeting, hello.swift, line 9

I tried both the dsa algorithm and putting in the rsa algorithm for this. Here is my sites.xml for this:

0 10000 /home/swift-user/temp

also, here is the tc.data file:

#This is the transformation catalog.
#
#It comes pre-configured with a number of simple transformations with
#paths that are likely to work on a linux box. However, on some systems,
#the paths to these executables will be different (for example, sometimes
#some of these programs are found in /usr/bin rather than in /bin)
#
#NOTE WELL: fields in this file must be separated by tabs, not spaces; and
#there must be no trailing whitespace at the end of each line.
#
# sitename transformation path INSTALLED platform profiles
localhost echo /bin/echo INSTALLED INTEL32::LINUX null
localhost cat /bin/cat INSTALLED INTEL32::LINUX null
localhost ls /bin/ls INSTALLED INTEL32::LINUX null
localhost grep /bin/grep INSTALLED INTEL32::LINUX null
localhost sort /bin/sort INSTALLED INTEL32::LINUX null
localhost paste /bin/paste INSTALLED INTEL32::LINUX null
localhost cp /bin/cp INSTALLED INTEL32::LINUX null
localhost touch /bin/touch INSTALLED INTEL32::LINUX null
localhost wc /usr/bin/wc INSTALLED INTEL32::LINUX null
localhost sleep /bin/sleep null null null
remotehost echo /bin/echo INSTALLED INTEL32::LINUX null

and the configuration file is exactly like what is shown in the documentation.
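
The NOTE WELL comment in the tc.data header (tab-separated fields, no trailing whitespace) is easy to violate when a file is edited by hand or pasted through a mail client. A minimal sketch of a line checker follows. The function name `check_tc_line` is hypothetical, and the six-field count is an assumption taken from the `sitename transformation path INSTALLED platform profiles` header comment; it is not a Swift utility.

```python
def check_tc_line(line):
    """Return a list of problems found in one tc.data line (empty if none).

    Assumes six tab-separated fields, following the header comment:
    sitename, transformation, path, INSTALLED, platform, profiles.
    """
    line = line.rstrip("\n")
    # Comment lines and blank lines carry no fields to validate.
    if not line.strip() or line.startswith("#"):
        return []
    problems = []
    if line != line.rstrip():
        problems.append("trailing whitespace")
    fields = line.rstrip().split("\t")
    if len(fields) != 6:
        problems.append(
            "expected 6 tab-separated fields, found %d" % len(fields)
        )
    return problems


def check_tc_text(text):
    """Check every line of a tc.data file; return (line_number, problem) pairs."""
    report = []
    for number, line in enumerate(text.splitlines(), start=1):
        for problem in check_tc_line(line):
            report.append((number, problem))
    return report
```

Reading the real file and passing its contents to `check_tc_text` would list any line where spaces replaced tabs or whitespace trails the last field. An empty report only says the layout matches the header comment; it does not show that the entries themselves are correct for a given site.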
If I try to do the ssh-cl interface, I get the following error:

RunID: 20150330-1529-4b3sr1u4
Progress: time: Mon, 30 Mar 2015 15:29:20 -0500
Execution failed:
Exception in echo:
    Arguments: [Hello, world!]
    Host: remotehost
    Directory: hello-20150330-1529-4b3sr1u4/jobs/c/echo-c3z4un6m
Caused by: Could not submit job
Caused by: Could not start coaster service
Caused by: java.lang.NullPointerException
    at org.globus.cog.abstraction.impl.execution.coaster.AutoCA.ensureCACertsExist(AutoCA.java:143)
    at org.globus.cog.abstraction.impl.execution.coaster.AutoCA.createProxy(AutoCA.java:128)
    at org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.setupGSIProxy(ServiceManager.java:238)
    at org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.startService(ServiceManager.java:194)
    at org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.reserveService(ServiceManager.java:132)
    at org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.reserveService(ServiceManager.java:151)
    at org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.getChannel(JobSubmissionTaskHandler.java:119)
    at org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:105)
    at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:45)
    at org.globus.cog.karajan.scheduler.submitQueue.NonBlockingSubmit.run(NonBlockingSubmit.java:97)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
greeting, hello.swift, line 9

There is no shared folder, so I have to use scheduling. I am using Swift version 0.94.1 from the main website. Any help is appreciated.

Thank you!

Justin Scherer
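
The `Unsupported passphrase algorithm: AES-128-CBC` message from the ssh-mode run names the cipher protecting the private key. In the traditional PEM key format, a passphrase-protected key records that cipher in a `DEK-Info:` header near the top of the file. A minimal sketch for reading that header, so the cipher can be confirmed before anything is changed, is given here. The function name `pem_cipher` and the sample key text are hypothetical, and which ciphers this Swift release accepts is not established anywhere in this thread.

```python
def pem_cipher(pem_text):
    """Return the cipher named in a PEM key's DEK-Info header, or None.

    None means no DEK-Info header was found, for example because the key
    is not passphrase-protected or is stored in a different format.
    """
    for line in pem_text.splitlines():
        line = line.strip()
        if line.startswith("DEK-Info:"):
            # The header has the form "DEK-Info: <cipher>,<hex IV>".
            value = line[len("DEK-Info:"):].strip()
            return value.split(",")[0].strip()
    return None


# Hypothetical key text for illustration only; the body is not real key data.
sample = """-----BEGIN RSA PRIVATE KEY-----
Proc-Type: 4,ENCRYPTED
DEK-Info: AES-128-CBC,0123456789ABCDEF0123456789ABCDEF

(base64 body omitted)
-----END RSA PRIVATE KEY-----
"""

print(pem_cipher(sample))
```

Applied to the contents of the key file that the ssh provider is configured to use, this reports the cipher the error message is complaining about. It only inspects the header text and never touches the passphrase or the key material.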