From yadunand at uchicago.edu Fri May 1 11:47:13 2015 From: yadunand at uchicago.edu (Yadu Nand Babuji) Date: Fri, 01 May 2015 11:47:13 -0500 Subject: [Swift-user] Remote SGE cluster In-Reply-To: References: Message-ID: <5543AE11.7030204@uchicago.edu> Hi Igor, Swift does support SGE clusters, and you can refer to the swift-tutorial for sample code and configurations from this link: https://github.com/swift-lang/swift-tutorial Here's a sample config from our test-suite for Godzilla, an SGE cluster at UChicago: https://github.com/swift-lang/swift-k/blob/master/tests/sites/godzilla/swift.conf You could modify and add this config to the swift.conf file in the swift-tutorial to run Swift on any machine and execute on a remote SGE cluster. SGE is a widely used resource manager and most sites have differences in their setups that make each site unique. If you run into issues with the default swift package, and could provide help in figuring out specifics of your cluster, we will help you adapt the Swift SGE provider to support your cluster. Thanks, Yadu On 04/28/2015 05:09 PM, Igor Russo wrote: > Hi All, > > It is possible to use Swift with a remote SGE/OGE cluster? > > Regards, > Igor > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -------------- next part -------------- An HTML attachment was scrubbed... URL: From igor.souza.russo at gmail.com Fri May 1 14:29:42 2015 From: igor.souza.russo at gmail.com (Igor Russo) Date: Fri, 1 May 2015 16:29:42 -0300 Subject: [Swift-user] Remote SGE cluster In-Reply-To: <5543AE11.7030204@uchicago.edu> References: <5543AE11.7030204@uchicago.edu> Message-ID: Hi Yadu, Thank you very much! I changed the config file with the data from my cluster. When executing the 4th part of Swift-tutorial, i'm getting the following error: "Failed to download bootstrap jar from ..." 
-------------------------------------------------------------------------------- RunID: run031 Progress: Sex, 01 Mai 2015 15:40:42-0300 Progress: Sex, 01 Mai 2015 15:40:43-0300 Submitting:1 Execution failed: Exception in sort: Arguments: [-n, unsorted.txt] Host: mmc Directory: p4-run031/jobs/s/sort-go28d68m exception @ swift-int-staging.k, line: 165 Caused by: exception @ swift-int-staging.k, line: 160 Caused by: null Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not submit job Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not start coaster service Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Task ended before registration was received. Failed to download bootstrap jar from http://igor-ubuntu:51251 k:assign @ swift.k, line: 174 Caused by: Exception in sort: Arguments: [-n, unsorted.txt] Host: mmc Directory: p4-run031/jobs/s/sort-go28d68m exception @ swift-int-staging.k, line: 165 Caused by: exception @ swift-int-staging.k, line: 160 Caused by: null Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not submit job Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not start coaster service Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Task ended before registration was received. 
Failed to download bootstrap jar from http://igor-ubuntu:51251

--------------------------------------------------------------------------------

Thanks,
Igor

2015-05-01 13:47 GMT-03:00 Yadu Nand Babuji:

> Hi Igor,
>
> Swift does support SGE clusters, and you can refer to the swift-tutorial
> for sample code and configurations from this link:
> https://github.com/swift-lang/swift-tutorial
>
> Here's a sample config from our test-suite for Godzilla, an SGE cluster at
> UChicago:
> https://github.com/swift-lang/swift-k/blob/master/tests/sites/godzilla/swift.conf
> You could modify and add this config to the swift.conf file in the
> swift-tutorial to run Swift on any machine and execute on a remote SGE
> cluster.
>
> SGE is a widely used resource manager and most sites have differences in
> their setups that make each site unique. If you run into issues with the
> default swift package, and could provide help in figuring out specifics of
> your cluster, we will help you adapt the Swift SGE provider to support
> your cluster.
>
> Thanks,
> Yadu
>
> On 04/28/2015 05:09 PM, Igor Russo wrote:
>
> Hi All,
>
> It is possible to use Swift with a remote SGE/OGE cluster?
>
> Regards,
> Igor
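The error above names http://igor-ubuntu:51251 as the address the bootstrap jar was to be fetched from. As a minimal sketch (not part of Swift), the helpers below pull the host and port out of such a URL so that they can be checked from the cluster side. The names url_host, url_port and resolves are hypothetical, and getent is assumed to be available, which may not hold on every system.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for inspecting the callback URL shown in the error.

# url_host URL: print the host part, e.g. "igor-ubuntu".
url_host() {
    local rest="${1#*://}"      # drop the scheme, if any
    rest="${rest%%/*}"          # drop any path
    printf '%s\n' "${rest%%:*}" # drop the :port suffix
}

# url_port URL: print the port part, or an empty line if none is given.
url_port() {
    local rest="${1#*://}"
    rest="${rest%%/*}"
    case "$rest" in
        *:*) printf '%s\n' "${rest##*:}" ;;
        *)   printf '\n' ;;
    esac
}

# resolves HOST: succeed if the name resolves on the machine where this runs.
# Assumes getent exists; intended to be run on the cluster, not the client.
resolves() {
    getent hosts "$1" >/dev/null 2>&1
}
```

Run on the cluster login node, something like `resolves "$(url_host http://igor-ubuntu:51251)" || echo "host name not resolvable here"` would show whether the name in the error can be reached by name from that side at all.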
From m.altaweel at ucl.ac.uk Fri May 1 14:31:12 2015
From: m.altaweel at ucl.ac.uk (Altaweel, Mark)
Date: Fri, 1 May 2015 19:31:12 +0000
Subject: [Swift-user] SGE swift job
In-Reply-To: <3F8C8656-3F8B-4BC4-87A5-F94E8506FD56@live.ucl.ac.uk>
References: <3F8C8656-3F8B-4BC4-87A5-F94E8506FD56@live.ucl.ac.uk>
Message-ID:

Hi,

Further to the message below, I received this from our cluster
administrator:

There isn't a parallel environment called 1way on Legion.

Ian spent some time modifying the source code for Swift 0.94.1 to get it to
submit jobs on Legion, but then found that those modifications no longer
work for the later versions. Here's what he had to do for the earlier
version:

Modify the code in
swift-0.94/cog/modules/provider-localscheduler/src/org/globus/cog/abstraction/impl/scheduler/sge/SGEExecutor.java:

* find the function private void verifyQueueInformation() and delete the
  contents, replacing with just return;
* delete the line: writeAttr("queue", "-q ", wr);

and then, because it expects to be able to inspect the queue settings itself
to determine cores per node and jobs per node, I ham-fistedly changed those
to be hard-coded to 1 in the same file, changing:

    String queue = (String)spec.getAttribute("queue");
    int coresPerNode = Integer.valueOf(getAttribute(spec, "coresPerNode",
        String.valueOf(poller.getQueueInformation(queue).getSlots())));
    int jobsPerNode = Integer.valueOf(getAttribute(spec, "jobsPerNode",
        String.valueOf(coresPerNode)));
    int coresToRequest = ( count * jobsPerNode + coresPerNode - 1)
        / coresPerNode * coresPerNode;

to

    //String queue = (String)spec.getAttribute("queue");
    int coresPerNode = 1;
    int jobsPerNode = 1;
    int coresToRequest = 1;

In
cog/modules/provider-localscheduler/src/org/globus/cog/abstraction/impl/scheduler/sge/SGEExecutor.java:

change:

    writeWallTime(wr);
    writeSoftWallTime(wr);

    if (spec.getStdInput() != null) {
        wr.write("#$ -i " + quote(spec.getStdInput()) + '\n');
    }
    wr.write("#$ -o " + quote(stdout) + '\n');
    wr.write("#$ -e " + quote(stderr) + '\n');

    if (!spec.getEnvironmentVariableNames().isEmpty()) {

to:

    writeWallTime(wr);
    writeSoftWallTime(wr);

    if (spec.getStdInput() != null) {
        wr.write("#$ -i " + quote(spec.getStdInput()) + '\n');
    }
    wr.write("#$ -o " + quote(stdout) + '\n');
    wr.write("#$ -e " + quote(stderr) + '\n');
    wr.write("#$ -jsv /shared/ucl/apps/sge_support/clean_variables_from_jobenv.jsv\n");

    if (!spec.getEnvironmentVariableNames().isEmpty()) {

As the above suggests, I've put that JSV script in
/shared/ucl/apps/sge_support in case it's needed for anything else. The bug
which makes it useful has been fixed in the version of SoGE we're using
after the upgrade, so it shouldn't be necessary for too long. (That part is
to do with stripping out variables containing % characters).

On May 1, 2015, at 4:27 PM, Altaweel, Mark wrote:

Hi,

I am trying to use Swift (swift-0.96-sge-mod) and trying to run a script on
SGE for the local cluster. Is there any clear reason for the error (below).
My swift.conf setup is:

    site.Legion {
        execution {
            type: "coaster"
            jobManager: "local:sge"
            URL : "localhost"
            options {
                maxJobs: 2
                nodeGranularity: 1
                maxNodesPerJob: 2
                tasksPerNode: 1
                jobProject: "AllUsers"
                jobQueue: "Tarvek"
                maxJobTime: "1800"
            }
        }
        maxParallelTasks : 3
        initialParallelTasks : 2
        staging: local
        workDirectory: "/tmp/"${env.USER}
        app.ALL {
            executable: "*"
            maxWallTime: "00:05:00"
        }
    }

I get the following error:

RunID: run009
Warning: The @ syntax for function invocation is deprecated
Warning: Variable spans, defined on line 52, might have multiple conflicting writers
Progress: Fri, 01 May 2015 16:19:29+0100
Number of parameter combinations: 2
Stride: 1
Begin: 1, End: 1
Begin: 2, End: 2
Progress: Fri, 01 May 2015 16:19:30+0100 Submitted:2
Error: No parallel environment specified
Could not submit job (qsub reported an exit code of 1).
Unable to run job: job rejected: the requested parallel environment "1way" does not exist.
Exiting.

Execution failed:
Exception in sh:
    Arguments: [repast_instance.sh, /imports/home1/tcrnma3/Scratch/UrbanModel/, 1, 1, 1, urf_2.txt]
    Host: Legion
    Directory: repast-run009/jobs/9/sh-9xg6568m
    exception @ swift-int-staging.k, line: 174
Caused by:
    exception @ swift-int-staging.k, line: 170
Caused by: Block task failed: Error submitting block task
org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not submit job (qsub reported an exit code of 1).
Unable to run job: job rejected: the requested parallel environment "1way" does not exist.
Exiting.
    at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:62)
    at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:45)
    at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:61)
    at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.run(BlockTaskSubmitter.java:70)
Caused by: org.globus.cog.abstraction.impl.scheduler.common.ProcessException: Could not submit job (qsub reported an exit code of 1).
Unable to run job: job rejected: the requested parallel environment "1way" does not exist.
Exiting.
    at org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.start(AbstractExecutor.java:116)
    at org.globus.cog.abstraction.impl.scheduler.sge.SGEExecutor.start(SGEExecutor.java:192)
    at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:52)
    ... 3 more
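The unmodified coresToRequest expression quoted earlier in this message rounds the requested core count up to a whole multiple of coresPerNode using integer division. A small sketch of the same arithmetic in shell, purely for illustration: the function name cores_to_request is hypothetical and the sample values are made up, not taken from Legion.

```shell
#!/usr/bin/env bash
# cores_to_request COUNT JOBS_PER_NODE CORES_PER_NODE
# Mirrors the integer arithmetic of the quoted Java expression:
#   (count * jobsPerNode + coresPerNode - 1) / coresPerNode * coresPerNode
# i.e. count * jobsPerNode rounded up to a multiple of coresPerNode.
cores_to_request() {
    local count="$1" jobs_per_node="$2" cores_per_node="$3"
    echo $(( (count * jobs_per_node + cores_per_node - 1) / cores_per_node * cores_per_node ))
}

cores_to_request 3 2 4    # 6 cores needed, rounded up to 8
cores_to_request 2 1 16   # 2 cores needed, rounded up to 16
cores_to_request 1 1 1    # the hard-coded case from the patch: 1
```

With every input forced to 1, as in the patched version, the expression always yields 1, which is consistent with the administrator hard-coding coresToRequest to 1.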
From yadunand at uchicago.edu Fri May 1 15:47:58 2015
From: yadunand at uchicago.edu (Yadu Nand Babuji)
Date: Fri, 01 May 2015 15:47:58 -0500
Subject: [Swift-user] Remote SGE cluster
In-Reply-To:
References: <5543AE11.7030204@uchicago.edu>
Message-ID: <5543E67E.8050205@uchicago.edu>

Hi Igor,

The remote connection system requires that the local machine you run the
Swift client on has a public IP address. It looks like Swift was not able to
guess it and set it to http://igor-ubuntu:51251

Could you retry running part04 after doing the next step? Please also make
sure your environment has these variables set whenever you run Swift against
remote systems:

export GLOBUS_HOSTNAME=
export GLOBUS_TCP_PORT_RANGE=50000,51000

Thanks,
Yadu

On 05/01/2015 02:29 PM, Igor Russo wrote:
> Hi Yadu,
>
> Thank you very much!
>
> I changed the config file with the data from my cluster.
>
> When executing the 4th part of Swift-tutorial, i'm getting the
> following error:
> "Failed to download bootstrap jar from ..."
>
> --------------------------------------------------------------------------------
>
> RunID: run031
> Progress: Sex, 01 Mai 2015 15:40:42-0300
> Progress: Sex, 01 Mai 2015 15:40:43-0300 Submitting:1
>
> Execution failed:
> Exception in sort:
> Arguments: [-n, unsorted.txt]
> Host: mmc
> Directory: p4-run031/jobs/s/sort-go28d68m
> exception @ swift-int-staging.k, line: 165
> Caused by:
> exception @ swift-int-staging.k, line: 160
> Caused by: null
> Caused by:
> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> Could not submit job
> Caused by:
> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> Could not start coaster service
> Caused by:
> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> Task ended before registration was received.
> Failed to download bootstrap jar from http://igor-ubuntu:51251 > > > k:assign @ swift.k, line: 174 > Caused by: Exception in sort: > Arguments: [-n, unsorted.txt] > Host: mmc > Directory: p4-run031/jobs/s/sort-go28d68m > exception @ swift-int-staging.k, line: 165 > Caused by: > exception @ swift-int-staging.k, line: 160 > Caused by: null > Caused by: > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Could not submit job > Caused by: > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Could not start coaster service > Caused by: > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Task ended before registration was received. > Failed to download bootstrap jar from http://igor-ubuntu:51251 > > > -------------------------------------------------------------------------------- > > Thanks, > Igor > > 2015-05-01 13:47 GMT-03:00 Yadu Nand Babuji >: > > Hi Igor, > > Swift does support SGE clusters, and you can refer to the > swift-tutorial > for sample code and configurations from this link: > https://github.com/swift-lang/swift-tutorial > > Here's a sample config from our test-suite for Godzilla, an SGE > cluster at UChicago: > https://github.com/swift-lang/swift-k/blob/master/tests/sites/godzilla/swift.conf > You could modify and add this config to the swift.conf file in the > swift-tutorial to run > Swift on any machine and execute on a remote SGE cluster. > > SGE is a widely used resource manager and most sites have > differences in > their setups that make each site unique. If you run into issues > with the default > swift package, and could provide help in figuring out specifics of > your cluster, we > will help you adapt the Swift SGE provider to support your cluster. > > Thanks, > Yadu > > > > On 04/28/2015 05:09 PM, Igor Russo wrote: >> Hi All, >> >> It is possible to use Swift with a remote SGE/OGE cluster? 
>>
>> Regards,
>> Igor

From m.altaweel at ucl.ac.uk Sun May 3 01:11:14 2015
From: m.altaweel at ucl.ac.uk (Altaweel, Mark)
Date: Sun, 3 May 2015 06:11:14 +0000
Subject: [Swift-user] hung submission
Message-ID: <20A806EE-399C-47ED-84EE-39F240C6CFFD@live.ucl.ac.uk>

Hi,

I tried executing Swift on our institution's SGE-based cluster and the
submission seems hung or not executing properly.
It has the following message: Swift 0.96-RC1 git-rev: c7a1dc478a40865f5639f186284697d53978bd48 heads/release-0.96-swift 6274 (modified locally) RunID: run002 Progress: Sun, 03 May 2015 07:00:29+0100 Number of parameter combinations: 2 Stride: 1 Begin: 1, End: 1 Begin: 2, End: 2 Progress: Sun, 03 May 2015 07:00:30+0100 Submitted:2 Error: No parallel environment specified Progress: Sun, 03 May 2015 07:01:00+0100 Submitted:2 Progress: Sun, 03 May 2015 07:01:30+0100 Submitted:2 Progress: Sun, 03 May 2015 07:02:00+0100 Submitted:2 Progress: Sun, 03 May 2015 07:02:30+0100 Submitted:2 Progress: Sun, 03 May 2015 07:03:00+0100 Submitted:2 Progress: Sun, 03 May 2015 07:03:30+0100 Submitted:2 Progress: Sun, 03 May 2015 07:04:00+0100 Submitted:2 Progress: Sun, 03 May 2015 07:04:30+0100 Submitted:2 This is just repeated and does not seem to stop The log file has the following messages, which also repeat: 2015-05-03 07:08:22,401+0100 INFO RuntimeStats$ProgressTicker HeapMax: 954728448, CrtHeap: 378535936, UsedHeap: 64559392, JVMThreads: 52 2015-05-03 07:08:23,401+0100 INFO RuntimeStats$ProgressTicker HeapMax: 954728448, CrtHeap: 378535936, UsedHeap: 64559432, JVMThreads: 52 2015-05-03 07:08:23,709+0100 INFO AbstractQueuePoller Actively monitored: 1, New: 0, Done: 0 2015-05-03 07:08:24,401+0100 INFO RuntimeStats$ProgressTicker HeapMax: 954728448, CrtHeap: 378535936, UsedHeap: 64584080, JVMThreads: 52 2015-05-03 07:08:25,401+0100 INFO RuntimeStats$ProgressTicker HeapMax: 954728448, CrtHeap: 378535936, UsedHeap: 64584120, JVMThreads: 52 2015-05-03 07:08:26,401+0100 INFO RuntimeStats$ProgressTicker HeapMax: 954728448, CrtHeap: 378535936, UsedHeap: 64584160, JVMThreads: 52 2015-05-03 07:08:27,401+0100 INFO RuntimeStats$ProgressTicker HeapMax: 954728448, CrtHeap: 378535936, UsedHeap: 64584200, JVMThreads: 52 2015-05-03 07:08:28,401+0100 INFO RuntimeStats$ProgressTicker HeapMax: 954728448, CrtHeap: 378535936, UsedHeap: 64584240, JVMThreads: 52 2015-05-03 07:08:29,401+0100 INFO 
RuntimeStats$ProgressTicker HeapMax: 954728448, CrtHeap: 378535936, UsedHeap: 64584280, JVMThreads: 52

I did run this locally to see if anything is wrong with the submission and
it worked fine with proper output.

Thank you.

Mark

From yadunand at uchicago.edu Sun May 3 08:32:46 2015
From: yadunand at uchicago.edu (Yadu Nand Babuji)
Date: Sun, 03 May 2015 08:32:46 -0500
Subject: [Swift-user] hung submission
In-Reply-To: <20A806EE-399C-47ED-84EE-39F240C6CFFD@live.ucl.ac.uk>
References: <20A806EE-399C-47ED-84EE-39F240C6CFFD@live.ucl.ac.uk>
Message-ID: <5546237E.4090309@uchicago.edu>

Hi Mark,

What you are seeing is progress reports from Swift at an interval of 30s,
and all this indicates is that your jobs were submitted to the queue for
execution. Until the local resource manager, in this case the SGE scheduler,
starts executing the jobs, Swift will have to wait.

From your description all I can gather is that you are seeing long wait
times, with no indication of any failure.

Could you check if you can spot the jobs submitted by Swift to the queue?
For this, open a separate terminal on the login node while your Swift run is
waiting in the submitted state, and use qstat to see your jobs.

[coursa1 at login06 part05]$ qstat
job-ID  prior   name       user     state submit/start at     queue  slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
6593408 0.00000 B0503-2802 coursa1  qw    05/03/2015 14:28:40        1
6593409 0.00000 B0503-2802 coursa1  qw    05/03/2015 14:28:41        1

The qw state indicates that your jobs are waiting in the queue.

Thanks,
Yadu

On 05/03/2015 01:11 AM, Altaweel, Mark wrote:
> Hi,
>
> I tried executing Swift on our institution's SGE-based cluster and
> the submission seems hung or not executing properly.
It has the > following message: > > Swift 0.96-RC1 git-rev: c7a1dc478a40865f5639f186284697d53978bd48 > heads/release-0.96-swift 6274 (modified locally) > RunID: run002 > Progress: Sun, 03 May 2015 07:00:29+0100 > Number of parameter combinations: 2 > Stride: 1 > Begin: 1, End: 1 > Begin: 2, End: 2 > Progress: Sun, 03 May 2015 07:00:30+0100 Submitted:2 > Error: No parallel environment specified > Progress: Sun, 03 May 2015 07:01:00+0100 Submitted:2 > Progress: Sun, 03 May 2015 07:01:30+0100 Submitted:2 > Progress: Sun, 03 May 2015 07:02:00+0100 Submitted:2 > Progress: Sun, 03 May 2015 07:02:30+0100 Submitted:2 > Progress: Sun, 03 May 2015 07:03:00+0100 Submitted:2 > Progress: Sun, 03 May 2015 07:03:30+0100 Submitted:2 > Progress: Sun, 03 May 2015 07:04:00+0100 Submitted:2 > Progress: Sun, 03 May 2015 07:04:30+0100 Submitted:2 > > This is just repeated and does not seem to stop > > The log file has the following messages, which also repeat: > > 2015-05-03 07:08:22,401+0100 INFO RuntimeStats$ProgressTicker HeapMax: > 954728448, CrtHeap: 378535936, UsedHeap: 64559392, JVMThreads: 52 > 2015-05-03 07:08:23,401+0100 INFO RuntimeStats$ProgressTicker HeapMax: > 954728448, CrtHeap: 378535936, UsedHeap: 64559432, JVMThreads: 52 > 2015-05-03 07:08:23,709+0100 INFO AbstractQueuePoller Actively > monitored: 1, New: 0, Done: 0 > 2015-05-03 07:08:24,401+0100 INFO RuntimeStats$ProgressTicker HeapMax: > 954728448, CrtHeap: 378535936, UsedHeap: 64584080, JVMThreads: 52 > 2015-05-03 07:08:25,401+0100 INFO RuntimeStats$ProgressTicker HeapMax: > 954728448, CrtHeap: 378535936, UsedHeap: 64584120, JVMThreads: 52 > 2015-05-03 07:08:26,401+0100 INFO RuntimeStats$ProgressTicker HeapMax: > 954728448, CrtHeap: 378535936, UsedHeap: 64584160, JVMThreads: 52 > 2015-05-03 07:08:27,401+0100 INFO RuntimeStats$ProgressTicker HeapMax: > 954728448, CrtHeap: 378535936, UsedHeap: 64584200, JVMThreads: 52 > 2015-05-03 07:08:28,401+0100 INFO RuntimeStats$ProgressTicker HeapMax: > 954728448, CrtHeap: 
378535936, UsedHeap: 64584240, JVMThreads: 52 > 2015-05-03 07:08:29,401+0100 INFO RuntimeStats$ProgressTicker HeapMax: > 954728448, CrtHeap: 378535936, UsedHeap: 64584280, JVMThreads: 52 > > > I did run this locally to see if anything is wrong with the submission > and it worked fine with proper output. > > Thank you. > > Mark > > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.altaweel at ucl.ac.uk Sun May 3 13:43:31 2015 From: m.altaweel at ucl.ac.uk (Altaweel, Mark) Date: Sun, 3 May 2015 18:43:31 +0000 Subject: [Swift-user] hung submission In-Reply-To: <5546237E.4090309@uchicago.edu> References: <20A806EE-399C-47ED-84EE-39F240C6CFFD@live.ucl.ac.uk> <5546237E.4090309@uchicago.edu> Message-ID: <40B314A7-324C-423F-92E7-974CAC9512D2@live.ucl.ac.uk> Thanks Yadu. Yes I did check and digging in it seems to fail : 6596817 2.69388 B0503-3707 tcrnma3 Eqw 05/03/2015 19:37:48 And then if I look at the reason (qstat -j) I get this (basically the error reason shows a truncated version of my file submitted): Seems odd that it shortens the path or at least indicates that it does this. 

Mark

job_number:                 6596817
exec_file:                  job_scripts/6596817
submission_time:            Sun May 3 19:37:48 2015
owner:                      tcrnma3
uid:                        147447
group:                      users
gid:                        1002
sge_o_home:                 /home/tcrnma3/
sge_o_log_name:             tcrnma3
sge_o_path:                 /shared/ucl/apps/Java/64/jdk1.7.0_45/bin:/shared/ucl/apps/mrxvt/0.5.4/bin:/shared/ucl/apps/nedit/5.6/bin:/shared/ucl/apps/gerun/i:/usr/mpi/qlogic//sbin:/usr/mpi/qlogic//bin:/usr/lib64/qt-3.3/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/sbin:/usr/sbin:/shared/ucl/apps/bin:/cm/shared/apps/intel/toolkit/Compiler/11.1/072//bin:/cm/shared/apps/intel/toolkit/Compiler/11.1/072//bin/intel64:/cm/shared/apps/sge/6.2u3/bin/lx26-amd64:/home/tcrnma3//bin:/home/tcrnma3//Scratch/swift-0.96-sge-mod/bin:/sbin
sge_o_shell:                /bin/bash
sge_o_workdir:              /imports/home1/tcrnma3/Scratch/UrbanModel
sge_o_host:                 login08
account:                    ucl_jsv4h;S=0;T=1.0;W=1.0;X=1.0;Y=1.0;V=0;Z=1.0;U=1.0
stderr_path_list:           NONE:NONE:/imports/home1/tcrnma3/Scratch/UrbanModel/run005/scripts/SGE7948718974736431209.submit.stderr
hard resource_list:         batch=true,bonus=0,h_rt=540,jcs=0,jct=1,jcu=1,jcv=0,jcw=1,jcx=1,jcy=1,jcz=1,maxversion=2,memory=1M,penalty=604801,s_rt=530
mail_list:                  tcrnma3 at login08.data.legion.ucl.ac.uk
notify:                     FALSE
job_name:                   B0503-3707460-0
stdout_path_list:           NONE:NONE:/imports/home1/tcrnma3/Scratch/UrbanModel/run005/scripts/SGE7948718974736431209.submit.stdout
jobshare:                   0
restart:                    n
shell_list:                 NONE:/bin/ksh
env_list:                   WORKER_LOGGING_LEVEL=NONE,XAUTHORITY=/scratch/scratch/tcrnma3/.Xauthority,PAID=0,GPU=0,OMP_NUM_THREADS=1,MICCOUNT=0,SCRATCH_SPACE=10737418240,MEMPERSLOT=1048576,SGE_SHARENODE=1,IFS=
script_file:                /imports/home1/tcrnma3/Scratch/UrbanModel/run005/scripts/SGE7948718974736431209.submit
project:                    AllUsers
error reason 1:             05/03/2015 19:38:15 [147447:22761]: error: can't open output file "/imports/home1/tcrnma3/Scratch/Ur
scheduling info:            (Collecting of scheduler job information is turned off)

On May 3, 2015, at 2:32 PM, Yadu Nand Babuji wrote:

Hi Mark,

What you are seeing is
progress reports from swift at an interval of 30s, and all this indicates is that your jobs were submitted to the queue for execution. Until the local resource manager, in this case the SGE scheduler starts the execution of jobs swift will have to wait. >From you description all I can gather is that you are seeing long wait times, with no indications of a any failure. Could you check if you can spot the jobs submitted by swift to the queue ? For this, open a separate terminal on the login node while your swift run is waiting in submitted state, and use qstat to see your jobs. [coursa1 at login06 part05]$ qstat job-ID prior name user state submit/start at queue slots ja-task-ID ----------------------------------------------------------------------------------------------------------------- 6593408 0.00000 B0503-2802 coursa1 qw 05/03/2015 14:28:40 1 6593409 0.00000 B0503-2802 coursa1 qw 05/03/2015 14:28:41 1 The qw state indicates that your jobs are waiting in the queue. Thanks, Yadu On 05/03/2015 01:11 AM, Altaweel, Mark wrote: Hi, I tried executing Swift on our institutions?s sge-based cluster and the submission seems hung or not executing properly. 
It has the following message: Swift 0.96-RC1 git-rev: c7a1dc478a40865f5639f186284697d53978bd48 heads/release-0.96-swift 6274 (modified locally) RunID: run002 Progress: Sun, 03 May 2015 07:00:29+0100 Number of parameter combinations: 2 Stride: 1 Begin: 1, End: 1 Begin: 2, End: 2 Progress: Sun, 03 May 2015 07:00:30+0100 Submitted:2 Error: No parallel environment specified Progress: Sun, 03 May 2015 07:01:00+0100 Submitted:2 Progress: Sun, 03 May 2015 07:01:30+0100 Submitted:2 Progress: Sun, 03 May 2015 07:02:00+0100 Submitted:2 Progress: Sun, 03 May 2015 07:02:30+0100 Submitted:2 Progress: Sun, 03 May 2015 07:03:00+0100 Submitted:2 Progress: Sun, 03 May 2015 07:03:30+0100 Submitted:2 Progress: Sun, 03 May 2015 07:04:00+0100 Submitted:2 Progress: Sun, 03 May 2015 07:04:30+0100 Submitted:2 This is just repeated and does not seem to stop The log file has the following messages, which also repeat: 2015-05-03 07:08:22,401+0100 INFO RuntimeStats$ProgressTicker HeapMax: 954728448, CrtHeap: 378535936, UsedHeap: 64559392, JVMThreads: 52 2015-05-03 07:08:23,401+0100 INFO RuntimeStats$ProgressTicker HeapMax: 954728448, CrtHeap: 378535936, UsedHeap: 64559432, JVMThreads: 52 2015-05-03 07:08:23,709+0100 INFO AbstractQueuePoller Actively monitored: 1, New: 0, Done: 0 2015-05-03 07:08:24,401+0100 INFO RuntimeStats$ProgressTicker HeapMax: 954728448, CrtHeap: 378535936, UsedHeap: 64584080, JVMThreads: 52 2015-05-03 07:08:25,401+0100 INFO RuntimeStats$ProgressTicker HeapMax: 954728448, CrtHeap: 378535936, UsedHeap: 64584120, JVMThreads: 52 2015-05-03 07:08:26,401+0100 INFO RuntimeStats$ProgressTicker HeapMax: 954728448, CrtHeap: 378535936, UsedHeap: 64584160, JVMThreads: 52 2015-05-03 07:08:27,401+0100 INFO RuntimeStats$ProgressTicker HeapMax: 954728448, CrtHeap: 378535936, UsedHeap: 64584200, JVMThreads: 52 2015-05-03 07:08:28,401+0100 INFO RuntimeStats$ProgressTicker HeapMax: 954728448, CrtHeap: 378535936, UsedHeap: 64584240, JVMThreads: 52 2015-05-03 07:08:29,401+0100 INFO 
RuntimeStats$ProgressTicker HeapMax: 954728448, CrtHeap: 378535936, UsedHeap: 64584280, JVMThreads: 52

I did run this locally to see if anything is wrong with the submission and
it worked fine with proper output.

Thank you.

Mark

From hategan at mcs.anl.gov Sun May 3 13:49:16 2015
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 3 May 2015 11:49:16 -0700
Subject: [Swift-user] hung submission
In-Reply-To: <20A806EE-399C-47ED-84EE-39F240C6CFFD@live.ucl.ac.uk>
References: <20A806EE-399C-47ED-84EE-39F240C6CFFD@live.ucl.ac.uk>
Message-ID: <1430678956.15193.2.camel@echo>

Hi,

On Sun, 2015-05-03 at 06:11 +0000, Altaweel, Mark wrote:
> Error: No parallel environment specified

It looks like something is complaining about a missing PE. Do you have one
specified in your config file?

Mihael

From hategan at mcs.anl.gov Sun May 3 13:59:57 2015
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 3 May 2015 11:59:57 -0700
Subject: [Swift-user] hung submission
In-Reply-To: <1430678956.15193.2.camel@echo>
References: <20A806EE-399C-47ED-84EE-39F240C6CFFD@live.ucl.ac.uk> <1430678956.15193.2.camel@echo>
Message-ID: <1430679597.15193.8.camel@echo>

On Sun, 2015-05-03 at 11:49 -0700, Mihael Hategan wrote:
> Hi,
>
> On Sun, 2015-05-03 at 06:11 +0000, Altaweel, Mark wrote:
> > Error: No parallel environment specified
>
> It looks like something is complaining about a missing PE. Do you have
> one specified in your config file?

Sorry, I now saw your config file in an earlier email.

I believe that the relevant incantation would be:

    site.Legion {
        execution {
            ...
            options {
                ...
                jobOptions {
                    pe:
                }
            }
            ...
        }
        ...
    }

Mihael

From hategan at mcs.anl.gov Sun May 3 14:07:06 2015
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 3 May 2015 12:07:06 -0700
Subject: [Swift-user] hung submission
In-Reply-To: <40B314A7-324C-423F-92E7-974CAC9512D2@live.ucl.ac.uk>
References: <20A806EE-399C-47ED-84EE-39F240C6CFFD@live.ucl.ac.uk> <5546237E.4090309@uchicago.edu> <40B314A7-324C-423F-92E7-974CAC9512D2@live.ucl.ac.uk>
Message-ID: <1430680026.15193.11.camel@echo>

On Sun, 2015-05-03 at 18:43 +0000, Altaweel, Mark wrote:
> error reason 1: 05/03/2015 19:38:15 [147447:22761]: error: can't open output file "/imports/home1/tcrnma3/Scratch/Ur

... aaand my PE suggestion had little to do with the problem.

Is /imports mounted on compute nodes?

Mihael

From yadunand at uchicago.edu Sun May 3 14:13:35 2015
From: yadunand at uchicago.edu (Yadu Nand Babuji)
Date: Sun, 03 May 2015 14:13:35 -0500
Subject: [Swift-user] hung submission
In-Reply-To: <1430679597.15193.8.camel@echo>
References: <20A806EE-399C-47ED-84EE-39F240C6CFFD@live.ucl.ac.uk> <1430678956.15193.2.camel@echo> <1430679597.15193.8.camel@echo>
Message-ID: <5546735F.20404@uchicago.edu>

Hi Mihael,

Mark is running on Legion at UCL using a patched version of Swift 0.96.
Though it complains about pe, the run should be unaffected, and I've tested
this and so has Jonathan.

It does look like Mark's scripts are trying to access filesystems not
mounted on the worker nodes. On Legion the home directories are not
accessible on the workers and all data should be placed on the Scratch
filesystem.

Thanks,
Yadu

On 05/03/2015 01:59 PM, Mihael Hategan wrote:
> On Sun, 2015-05-03 at 11:49 -0700, Mihael Hategan wrote:
>> Hi,
>>
>> On Sun, 2015-05-03 at 06:11 +0000, Altaweel, Mark wrote:
>>> Error: No parallel environment specified
>> It looks like something is complaining about a missing PE. Do you have
>> one specified in your config file?
> Sorry, I now saw your config file in an earlier email.
>
> I believe that the relevant incantation would be:
>
> site.Legion {
>     execution {
>         ...
>         options {
>             ...
>             jobOptions {
>                 pe:
>             }
>         }
>         ...
>     }
>     ...
> }
>
> Mihael

From m.altaweel at ucl.ac.uk Sun May 3 14:20:34 2015
From: m.altaweel at ucl.ac.uk (Altaweel, Mark)
Date: Sun, 3 May 2015 19:20:34 +0000
Subject: [Swift-user] hung submission
In-Reply-To: <1430680026.15193.11.camel@echo>
References: <20A806EE-399C-47ED-84EE-39F240C6CFFD@live.ucl.ac.uk> <5546237E.4090309@uchicago.edu> <40B314A7-324C-423F-92E7-974CAC9512D2@live.ucl.ac.uk> <1430680026.15193.11.camel@echo>
Message-ID:

Yes, so I do import Swift in the shell script that gets distributed.
However, same conclusion it seems. I don't understand why it truncates the
path, unless the full path is there and only a certain number of its
characters get written.

This is added to the script:

export PATH=$PATH:~/Scratch/swift-0.96-sge-mod/bin
module load java/1.7.0_45

So java is included. If I remove it the same thing happens though.

Mark

On May 3, 2015, at 8:07 PM, Mihael Hategan wrote:

On Sun, 2015-05-03 at 18:43 +0000, Altaweel, Mark wrote:
error reason 1: 05/03/2015 19:38:15 [147447:22761]: error: can't open output file "/imports/home1/tcrnma3/Scratch/Ur

... aaand my PE suggestion had little to do with the problem.

Is /imports mounted on compute nodes?

Mihael
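The open question quoted above is whether the directory that SGE cannot write to is visible on the compute nodes at all. A minimal sketch of a probe for that, assuming a bash-like shell and mktemp on the node; the function name check_writable is hypothetical and nothing here is specific to Swift or Legion.

```shell
#!/usr/bin/env bash
# check_writable DIR
# Returns 0 if DIR exists and a file can be created in it,
# 1 if DIR is missing, 2 if it exists but is not writable.
check_writable() {
    local dir="$1" probe
    if [ ! -d "$dir" ]; then
        echo "missing: $dir" >&2
        return 1
    fi
    # Create and immediately remove a throwaway file inside DIR.
    if ! probe="$(mktemp "$dir/.probe.XXXXXX" 2>/dev/null)"; then
        echo "not writable: $dir" >&2
        return 2
    fi
    rm -f "$probe"
    echo "ok: $dir on $(hostname)"
}
```

The result only means something when the function runs on a compute node rather than the login node, for example from a small test job submitted with qsub (or an interactive qrsh session, if the site permits one), passing the run's scripts directory as the argument.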
From hategan at mcs.anl.gov Sun May 3 15:06:21 2015
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 3 May 2015 13:06:21 -0700
Subject: [Swift-user] hung submission
In-Reply-To:
References: <20A806EE-399C-47ED-84EE-39F240C6CFFD@live.ucl.ac.uk> <5546237E.4090309@uchicago.edu> <40B314A7-324C-423F-92E7-974CAC9512D2@live.ucl.ac.uk> <1430680026.15193.11.camel@echo>
Message-ID: <1430683581.15943.8.camel@echo>

It seems that it is more likely that the error message gets truncated
rather than the path itself. After all, stdout_path_list does contain
what seems to be the correct path.

There should be a script:
/imports/home1/tcrnma3/Scratch/UrbanModel/run005/scripts/SGE7948718974736431209.submit
(or similar) that should be available while a swift run is in progress.

I think one way to troubleshoot things would be to copy that script and
submit it manually.

Mihael

From m.altaweel at ucl.ac.uk Sun May 3 15:18:10 2015
From: m.altaweel at ucl.ac.uk (Altaweel, Mark)
Date: Sun, 3 May 2015 20:18:10 +0000
Subject: [Swift-user] hung submission
In-Reply-To: <1430683581.15943.8.camel@echo>
References: <20A806EE-399C-47ED-84EE-39F240C6CFFD@live.ucl.ac.uk> <5546237E.4090309@uchicago.edu> <40B314A7-324C-423F-92E7-974CAC9512D2@live.ucl.ac.uk> <1430680026.15193.11.camel@echo> <1430683581.15943.8.camel@echo>
Message-ID:

If I do a qsub on the script I get the same error message:

job_number: 6597054
exec_file: job_scripts/6597054
submission_time: Sun May 3 21:15:23 2015
owner: tcrnma3
uid: 147447
group: users
gid: 1002
sge_o_home: /home/tcrnma3/
sge_o_log_name: tcrnma3
sge_o_path: /shared/ucl/apps/mrxvt/0.5.4/bin:/shared/ucl/apps/nedit/5.6/bin:/shared/ucl/apps/gerun/i:/usr/mpi/qlogic//sbin:/usr/mpi/qlogic//bin:/usr/lib64/qt-3.3/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/sbin:/usr/sbin:/shared/ucl/apps/bin:/cm/shared/apps/intel/toolkit/Compiler/11.1/072//bin:/cm/shared/apps/intel/toolkit/Compiler/11.1/072//bin/intel64:/cm/shared/apps/sge/6.2u3/bin/lx26-amd64:/home/tcrnma3//bin
sge_o_shell: /bin/bash
sge_o_workdir: /imports/home1/tcrnma3/Scratch/UrbanModel/run005/scripts
sge_o_host: login06
account: ucl_jsv4h;S=0;T=1.0;W=1.0;X=1.0;Y=1.0;V=0;Z=1.0;U=1.0
stderr_path_list: NONE:NONE:/imports/home1/tcrnma3/Scratch/UrbanModel/run005/scripts/SGE7948718974736431209.submit.stderr
hard resource_list: batch=true,bonus=0,h_rt=540,jcs=0,jct=1,jcu=1,jcv=0,jcw=1,jcx=1,jcy=1,jcz=1,maxversion=2,memory=1M,penalty=604801,s_rt=530
mail_list: tcrnma3 at login06.data.legion.ucl.ac.uk
notify: FALSE
job_name: B0503-3707460-0
stdout_path_list: NONE:NONE:/imports/home1/tcrnma3/Scratch/UrbanModel/run005/scripts/SGE7948718974736431209.submit.stdout
jobshare: 0
restart: n
shell_list: NONE:/bin/ksh
env_list: WORKER_LOGGING_LEVEL=NONE,XAUTHORITY=/scratch/scratch/tcrnma3/.Xauthority,PAID=0,GPU=0,OMP_NUM_THREADS=1,MICCOUNT=0,SCRATCH_SPACE=10737418240,MEMPERSLOT=1048576,SGE_SHARENODE=1,IFS=
script_file: SGE7948718974736431209.submit
project: AllUsers
error reason 1: 05/03/2015 21:15:57 [147447:18805]: error: can't open output file "/imports/home1/tcrnma3/Scratch/Ur
scheduling info: (Collecting of scheduler job information is turned off)

Mark

From hategan at mcs.anl.gov Sun May 3 15:30:37 2015
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 3 May 2015 13:30:37 -0700
Subject: [Swift-user] hung submission
In-Reply-To:
References: <20A806EE-399C-47ED-84EE-39F240C6CFFD@live.ucl.ac.uk> <5546237E.4090309@uchicago.edu> <40B314A7-324C-423F-92E7-974CAC9512D2@live.ucl.ac.uk> <1430680026.15193.11.camel@echo> <1430683581.15943.8.camel@echo>
Message-ID: <1430685037.16927.4.camel@echo>

Hi,

That's actually good since we eliminated lots of moving parts.
~/Scratch seems to be the right spot according to
https://wiki.rc.ucl.ac.uk/wiki/Managing_Data_on_Legion

What I suspect might be happening is that the mountpoints are different
between login nodes and compute nodes. Can you try running these on
both the login node and a compute node:

mount (or df)
ls -al $HOME/Scratch

and then pasting the outputs back in an email.
Mihael

From m.altaweel at ucl.ac.uk Sun May 3 16:41:59 2015
From: m.altaweel at ucl.ac.uk (Altaweel, Mark)
Date: Sun, 3 May 2015 21:41:59 +0000
Subject: [Swift-user] hung submission
In-Reply-To: <1430685037.16927.4.camel@echo>
References: <20A806EE-399C-47ED-84EE-39F240C6CFFD@live.ucl.ac.uk> <5546237E.4090309@uchicago.edu> <40B314A7-324C-423F-92E7-974CAC9512D2@live.ucl.ac.uk> <1430680026.15193.11.camel@echo> <1430683581.15943.8.camel@echo> <1430685037.16927.4.camel@echo>
Message-ID: <86DCEB38-AFEA-4D4B-AED9-4DB3CE808759@live.ucl.ac.uk>

Hi again,

I get this on the local:

/dev/sda1 on / type ext3 (rw,noatime,nodiratime)
none on /proc type proc (rw,nosuid)
none on /sys type sysfs (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/sda2 on /var type ext3 (rw,noatime,nodiratime)
none on /dev/shm type tmpfs (rw)
none on /tmp type tmpfs (rw,nodev,noatime,nodiratime,size=32g)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
nfs:/exports/cmshared on /cm/shared type nfs (rw,noatime,rsize=32768,wsize=32768,hard,intr,addr=10.143.0.14)
nfs:/exports/home on /home type nfs (rw,noatime,rsize=32768,wsize=32768,hard,intr,acdirmin=0,acdirmax=0,acregmax=10,noac,addr=10.143.0.14)
nfs:/exports/home0 on /imports/home0 type nfs (rw,noatime,rsize=32768,wsize=32768,hard,intr,acdirmin=0,acdirmax=0,acregmax=10,noac,addr=10.143.0.14)
nfs:/exports/home1 on /imports/home1 type nfs (rw,noatime,rsize=32768,wsize=32768,hard,intr,acdirmin=0,acdirmax=0,acregmax=10,noac,addr=10.143.0.14)
nfs:/exports/homeL on /imports/homeL type nfs (rw,noatime,rsize=32768,wsize=32768,hard,intr,acdirmin=0,acdirmax=0,acregmax=10,noac,addr=10.143.0.14)
nfs:/exports/software on /shared type nfs (ro,rsize=32768,wsize=32768,hard,intr,addr=10.143.0.14)
nfs:/exports/sge on /cm/shared/apps/sge type nfs (rw,noatime,rsize=32768,wsize=32768,hard,intr,addr=10.143.0.14)
nfs:/exports/lcgsoft on /imports/lcgsoft type nfs (rw,noatime,rsize=32768,wsize=32768,hard,intr,addr=10.143.0.14)
nfs:/exports/deptapps on /imports/deptapps type nfs (rw,noatime,rsize=32768,wsize=32768,hard,intr,acdirmin=0,acdirmax=0,acregmax=10,noac,addr=10.143.0.14)
nfsd on /proc/fs/nfsd type nfsd (rw)
none on /dev/cpuset type cpuset (rw)
10.143.0.127@tcp:10.143.0.126@tcp:/scratch on /scratch type lustre (rw,_netdev,noatime,nodiratime,flock)

Don't really see the output on the compute node.

Mark
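Once `mount` output has been captured on both a login node and a compute node, the two listings can be compared mechanically instead of by eye. This is a rough sketch only, assuming a POSIX shell with awk and assuming the listings were saved to two files; the function name missing_mounts and the file names in the comments are hypothetical.

```shell
#!/bin/sh
# missing_mounts A B: print mount directories that appear in listing A but not
# in listing B. Both files hold saved `mount` output, where each line looks
# like "<device> on <dir> type <fs> (<options>)".
missing_mounts() {
    awk '
        # Pick out the word that follows "on" as the mount directory.
        { dir = ""; for (i = 1; i < NF; i++) if ($i == "on") { dir = $(i + 1); break } }
        dir == "" { next }
        # First file on the command line is listing B: remember its directories.
        FILENAME == ARGV[1] { seen[dir] = 1; next }
        # Second file is listing A: print each directory B lacks, once.
        !(dir in seen) && !printed[dir]++ { print dir }
    ' "$2" "$1"
}

# Example (file names are placeholders):
#   mount > login.mounts        # on the login node
#   mount > compute.mounts      # inside a job on a compute node
#   missing_mounts login.mounts compute.mounts
```

Any directory printed would be a candidate explanation for a path that works on the login node but cannot be opened from a job.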
From hategan at mcs.anl.gov Sun May 3 17:25:32 2015
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 3 May 2015 15:25:32 -0700
Subject: [Swift-user] hung submission
In-Reply-To: <86DCEB38-AFEA-4D4B-AED9-4DB3CE808759@live.ucl.ac.uk>
References: <20A806EE-399C-47ED-84EE-39F240C6CFFD@live.ucl.ac.uk> <5546237E.4090309@uchicago.edu> <40B314A7-324C-423F-92E7-974CAC9512D2@live.ucl.ac.uk> <1430680026.15193.11.camel@echo> <1430683581.15943.8.camel@echo> <1430685037.16927.4.camel@echo> <86DCEB38-AFEA-4D4B-AED9-4DB3CE808759@live.ucl.ac.uk>
Message-ID: <1430691932.17945.8.camel@echo>

On Sun, 2015-05-03 at 21:41 +0000, Altaweel, Mark wrote:
> Hi again,
>
> I get this on the local:
> [...]
>
> Don't really see the output on the compute node.

Hi,

I'm not sure what you mean. Did you try it through an interactive job
(qlogin)? If not, the following script, adapted from
https://wiki.rc.ucl.ac.uk/wiki/Legion_Scripts should work. Otherwise I
would email the system administrator.

#!/bin/bash -l
#$ -S /bin/bash
#$ -l h_rt=0:10:0
#$ -l mem=1G
#$ -l tmpfs=1G
#$ -N test
#$ -wd /home/tcrnma3/Scratch/output
cd $TMPDIR
mount > $HOME/Scratch/test.out
echo "HOME is: $HOME" >> $HOME/Scratch/test.out
ls -al $HOME/Scratch >> $HOME/Scratch/test.out

Mihael

From igor.souza.russo at gmail.com Mon May 4 07:51:18 2015
From: igor.souza.russo at gmail.com (Igor Russo)
Date: Mon, 4 May 2015 09:51:18 -0300
Subject: [Swift-user] Remote SGE cluster
In-Reply-To: <5543E67E.8050205@uchicago.edu>
References: <5543AE11.7030204@uchicago.edu> <5543E67E.8050205@uchicago.edu>
Message-ID:

Hi Yadu,

Thanks again.

I tried your suggestion. Now I'm not getting the previous error, but
the jobs aren't being submitted:

RunID: run001
Progress: Seg, 04 Mai 2015 09:32:54-0300
Progress: Seg, 04 Mai 2015 09:32:55-0300 Submitting:1
Progress: Seg, 04 Mai 2015 09:33:25-0300 Submitting:1
Progress: Seg, 04 Mai 2015 09:33:55-0300 Submitting:1
Progress: Seg, 04 Mai 2015 09:34:25-0300 Submitting:1
Progress: Seg, 04 Mai 2015 09:34:55-0300 Submitting:1
Progress: Seg, 04 Mai 2015 09:35:25-0300 Submitting:1
Progress: Seg, 04 Mai 2015 09:35:55-0300 Submitting:1
Progress: Seg, 04 Mai 2015 09:36:25-0300 Submitting:1

In the log file, I notice the following errors:

2015-05-04 09:24:06,223-0300 INFO ServiceManager Service does not appear to be registered with this manager
2015-05-04 09:24:06,223-0300 INFO ServiceManager Coaster service ended. Reason: null

Thanks,
Igor

2015-05-01 17:47 GMT-03:00 Yadu Nand Babuji:
> Hi Igor,
>
> The remote connection system requires that the local machine you run the
> swift client on has a public ip address. It looks like swift was not
> able to guess it and set it to http://igor-ubuntu:51251
>
> Could you retry running part04 after doing the next step, and please
> make sure your environment has these variables set whenever you run
> swift to remote systems:
> export GLOBUS_HOSTNAME=
> export GLOBUS_TCP_PORT_RANGE=50000,51000
>
> Thanks,
> Yadu

From m.altaweel at ucl.ac.uk Fri May 1 10:27:35 2015
From: m.altaweel at ucl.ac.uk (Altaweel, Mark)
Date: Fri, 1 May 2015 15:27:35 +0000
Subject: [Swift-user] SGE swift job
Message-ID: <3F8C8656-3F8B-4BC4-87A5-F94E8506FD56@live.ucl.ac.uk>

Hi,

I am trying to use Swift (swift-0.96-sge-mod) to run a script on SGE
for the local cluster. Is there any clear reason for the error (below)?
My swift.conf setup is:

site.Legion {
    execution {
        type: "coaster"
        jobManager: "local:sge"
        URL : "localhost"
        options {
            maxJobs: 2
            nodeGranularity: 1
            maxNodesPerJob: 2
            tasksPerNode: 1
            jobProject: "AllUsers"
            jobQueue: "Tarvek"
            maxJobTime: "1800"
        }
    }
    maxParallelTasks : 3
    initialParallelTasks : 2
    staging: local
    workDirectory: "/tmp/"${env.USER}
    app.ALL {
        executable: "*"
        maxWallTime: "00:05:00"
    }
}

I get the following error:

RunID: run009
Warning: The @ syntax for function invocation is deprecated
Warning: Variable spans, defined on line 52, might have multiple conflicting writers
Progress: Fri, 01 May 2015 16:19:29+0100
Number of parameter combinations: 2
Stride: 1
Begin: 1, End: 1
Begin: 2, End: 2
Progress: Fri, 01 May 2015 16:19:30+0100 Submitted:2
Error: No parallel environment specified

Could not submit job (qsub reported an exit code of 1).
Unable to run job: job rejected: the requested parallel environment "1way" does not exist.Exiting.

Execution failed:
Exception in sh:
Arguments: [repast_instance.sh, /imports/home1/tcrnma3/Scratch/UrbanModel/, 1, 1, 1, urf_2.txt]
Host: Legion
Directory: repast-run009/jobs/9/sh-9xg6568m
exception @ swift-int-staging.k, line: 174
Caused by:
exception @ swift-int-staging.k, line: 170
Caused by: Block task failed: Error submitting block task
org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not submit job (qsub reported an exit code of 1).
Unable to run job: job rejected: the requested parallel environment "1way" does not exist.Exiting.

    at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:62)
    at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:45)
    at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:61)
    at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.run(BlockTaskSubmitter.java:70)
Caused by: org.globus.cog.abstraction.impl.scheduler.common.ProcessException: Could not submit job (qsub reported an exit code of 1).
Unable to run job: job rejected: the requested parallel environment "1way" does not exist.Exiting.

    at org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.start(AbstractExecutor.java:116)
    at org.globus.cog.abstraction.impl.scheduler.sge.SGEExecutor.start(SGEExecutor.java:192)
    at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:52)
    ... 3 more

From hategan at mcs.anl.gov Mon May 4 14:26:09 2015
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 4 May 2015 12:26:09 -0700
Subject: [Swift-user] SGE swift job
In-Reply-To: <3F8C8656-3F8B-4BC4-87A5-F94E8506FD56@live.ucl.ac.uk>
References: <3F8C8656-3F8B-4BC4-87A5-F94E8506FD56@live.ucl.ac.uk>
Message-ID: <1430767569.7228.0.camel@echo>

For reference, this is an email that got held by the mailing list
software. The issue has since been resolved.

Mihael

From yadunand at uchicago.edu Mon May 4 14:57:49 2015
From: yadunand at uchicago.edu (Yadu Nand Babuji)
Date: Mon, 04 May 2015 14:57:49 -0500
Subject: [Swift-user] Remote SGE cluster
In-Reply-To:
References: <5543AE11.7030204@uchicago.edu> <5543E67E.8050205@uchicago.edu>
Message-ID: <5547CF3D.5010908@uchicago.edu>

Hi Igor,

Are you able to ssh from your machine to legion directly without
entering passwords? Could you please send us a tarball of the runNNN
directories for a failing run?

I've put the following settings in my ~/.ssh/config on my laptop and
set up ssh keys on both socrates and legion. This allows me to use
"ssh legion.rc.ucl.ac.uk" and connect.
Host legion.rc.ucl.ac.uk
    User YOUR_USERNAME
    Hostname legion.rc.ucl.ac.uk
    ProxyCommand ssh socrates -W %h:%p

Host socrates
    Hostname socrates.ucl.ac.uk
    User YOUR_USERNAME
    ForwardAgent yes

Thanks,
Yadu

On 05/04/2015 07:51 AM, Igor Russo wrote:
> Hi Yadu,
>
> Thanks again.
>
> I tried your suggestion. Now I'm not getting the previous error, but
> the jobs aren't being submitted:
>
> RunID: run001
> Progress: Seg, 04 Mai 2015 09:32:54-0300
> Progress: Seg, 04 Mai 2015 09:32:55-0300 Submitting:1
> Progress: Seg, 04 Mai 2015 09:33:25-0300 Submitting:1
> Progress: Seg, 04 Mai 2015 09:33:55-0300 Submitting:1
> Progress: Seg, 04 Mai 2015 09:34:25-0300 Submitting:1
> Progress: Seg, 04 Mai 2015 09:34:55-0300 Submitting:1
> Progress: Seg, 04 Mai 2015 09:35:25-0300 Submitting:1
> Progress: Seg, 04 Mai 2015 09:35:55-0300 Submitting:1
> Progress: Seg, 04 Mai 2015 09:36:25-0300 Submitting:1
>
> In the log file, I notice the following errors:
>
> 2015-05-04 09:24:06,223-0300 INFO ServiceManager Service does not appear to be registered with this manager
> 2015-05-04 09:24:06,223-0300 INFO ServiceManager Coaster service ended. Reason: null
>
> Thanks,
> Igor
>
>
> 2015-05-01 17:47 GMT-03:00 Yadu Nand Babuji:
>
> Hi Igor,
>
> The remote connection system requires that the local machine you
> run the swift client on has a public ip address. It looks like swift
> was not able to guess it and set it to http://igor-ubuntu:51251
>
> Could you retry running part04 after doing the next step, and
> please make sure your environment has these variables set whenever
> you run swift to remote systems:
> export GLOBUS_HOSTNAME=
> export GLOBUS_TCP_PORT_RANGE=50000,51000
>
> Thanks,
> Yadu
>
>
> On 05/01/2015 02:29 PM, Igor Russo wrote:
>> Hi Yadu,
>>
>> Thank you very much!
>>
>> I changed the config file with the data from my cluster.
>>
>> When executing the 4th part of Swift-tutorial, I'm getting the
>> following error:
>> "Failed to download bootstrap jar from ..."
>>
>>
>> --------------------------------------------------------------------------------
>>
>> RunID: run031
>> Progress: Sex, 01 Mai 2015 15:40:42-0300
>> Progress: Sex, 01 Mai 2015 15:40:43-0300 Submitting:1
>>
>> Execution failed:
>> Exception in sort:
>> Arguments: [-n, unsorted.txt]
>> Host: mmc
>> Directory: p4-run031/jobs/s/sort-go28d68m
>> exception @ swift-int-staging.k, line: 165
>> Caused by:
>> exception @ swift-int-staging.k, line: 160
>> Caused by: null
>> Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not submit job
>> Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not start coaster service
>> Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Task ended before registration was received.
>> Failed to download bootstrap jar from http://igor-ubuntu:51251
>>
>>
>> k:assign @ swift.k, line: 174
>> Caused by: Exception in sort:
>> Arguments: [-n, unsorted.txt]
>> Host: mmc
>> Directory: p4-run031/jobs/s/sort-go28d68m
>> exception @ swift-int-staging.k, line: 165
>> Caused by:
>> exception @ swift-int-staging.k, line: 160
>> Caused by: null
>> Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not submit job
>> Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not start coaster service
>> Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Task ended before registration was received.
>> Failed to download bootstrap jar from http://igor-ubuntu:51251
>>
>>
>> --------------------------------------------------------------------------------
>>
>> Thanks,
>> Igor
>>
>> 2015-05-01 13:47 GMT-03:00 Yadu Nand Babuji:
>>
>> Hi Igor,
>>
>> Swift does support SGE clusters, and you can refer to the swift-tutorial
>> for sample code and configurations from this link:
>> https://github.com/swift-lang/swift-tutorial
>>
>> Here's a sample config from our test-suite for Godzilla, an
>> SGE cluster at UChicago:
>> https://github.com/swift-lang/swift-k/blob/master/tests/sites/godzilla/swift.conf
>> You could modify and add this config to the swift.conf file
>> in the swift-tutorial to run
>> Swift on any machine and execute on a remote SGE cluster.
>>
>> SGE is a widely used resource manager and most sites have differences in
>> their setups that make each site unique. If you run into
>> issues with the default swift package, and could provide help in
>> figuring out specifics of your cluster, we will help you adapt the
>> Swift SGE provider to support your cluster.
>>
>> Thanks,
>> Yadu
>>
>>
>>
>> On 04/28/2015 05:09 PM, Igor Russo wrote:
>>> Hi All,
>>>
>>> It is possible to use Swift with a remote SGE/OGE cluster?
>>>
>>> Regards,
>>> Igor

From igor.souza.russo at gmail.com Mon May 4 16:27:56 2015
From: igor.souza.russo at gmail.com (Igor Russo)
Date: Mon, 4 May 2015 18:27:56 -0300
Subject: [Swift-user] Remote SGE cluster
In-Reply-To: <5547CF3D.5010908@uchicago.edu>
References: <5543AE11.7030204@uchicago.edu> <5543E67E.8050205@uchicago.edu> <5547CF3D.5010908@uchicago.edu>
Message-ID:

Hi Yadu,

Yes, i can ssh from my laptop to the cluster directly.

The coaster-bootstrap-*.log files are created in the remote system.

I'm sending the log file attached.

Thanks,
Igor
-------------- next part --------------
A non-text attachment was scrubbed...
Name: run001.tar.gz
Type: application/x-gzip
Size: 5680 bytes
Desc: not available

From hategan at mcs.anl.gov Mon May 4 16:52:46 2015
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 4 May 2015 14:52:46 -0700
Subject: [Swift-user] Remote SGE cluster
In-Reply-To:
References: <5543AE11.7030204@uchicago.edu> <5543E67E.8050205@uchicago.edu> <5547CF3D.5010908@uchicago.edu>
Message-ID: <1430776366.8782.10.camel@echo>

Hi,

In most cases (globus, coasters), the service side (legion in this case) needs the ability to connect back to the client (your home connection).

Correct me if I'm wrong, but you are on a DSL line, behind a router with NAT. If so, you must configure the router to forward some incoming connections to the actual machine from which you are running swift. Typically this is done by configuring forwarding for a certain port range on the router (Yadu suggested GLOBUS_TCP_PORT_RANGE=50000,51000, so that port range should be matched on the router).

The gist of it is that swift starts a simple shell script on legion that downloads a small java app from the client side and launches it. Said shell script logs things into ~/coaster-bootstrap-xxx.log files. The contents of the bootstrap logs are probably very useful here.

If all of that goes well, the aforementioned small java app downloads the full coaster service from the client and starts it. Once started, the coaster service connects back to Swift. The last two parts log their doings in ~/.globus/coasters/*.log. Those can be useful, too, if they exist.

Mihael
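
A minimal sketch of the client-side setup discussed in this thread, assuming a POSIX shell. The address 203.0.113.10 is a placeholder from the documentation range, not a value from this thread, and port_in_range and check_port are hypothetical helpers, not part of Swift. They only check whether a callback port, such as the one in the bootstrap URL, falls inside the range that is forwarded on the router.

```shell
#!/bin/sh
# Sketch only. GLOBUS_HOSTNAME should be an address the cluster can reach
# (for a NAT setup, the router's public IP). 203.0.113.10 is a placeholder.
export GLOBUS_HOSTNAME="203.0.113.10"
# Port range suggested in the thread; forward the same range on the router.
export GLOBUS_TCP_PORT_RANGE="50000,51000"

# port_in_range PORT "LOW,HIGH" (hypothetical helper, not part of Swift):
# succeeds when PORT lies inside the inclusive range.
port_in_range() {
    port=$1
    low=${2%,*}     # text before the comma
    high=${2#*,}    # text after the comma
    [ "$port" -ge "$low" ] && [ "$port" -le "$high" ]
}

# check_port PORT: report whether PORT is covered by GLOBUS_TCP_PORT_RANGE.
check_port() {
    if port_in_range "$1" "$GLOBUS_TCP_PORT_RANGE"; then
        echo "port $1: inside $GLOBUS_TCP_PORT_RANGE"
    else
        echo "port $1: outside $GLOBUS_TCP_PORT_RANGE"
    fi
}
```

For example, `check_port 51251` (the port in the first failure, http://igor-ubuntu:51251) reports the port as outside the range, so a router that forwards only 50000-51000 would not pass that connection through.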
From igor.souza.russo at gmail.com Tue May 5 09:01:50 2015
From: igor.souza.russo at gmail.com (Igor Russo)
Date: Tue, 5 May 2015 11:01:50 -0300
Subject: [Swift-user] Remote SGE cluster
In-Reply-To: <1430776366.8782.10.camel@echo>
References: <5543AE11.7030204@uchicago.edu> <5543E67E.8050205@uchicago.edu> <5547CF3D.5010908@uchicago.edu> <1430776366.8782.10.camel@echo>
Message-ID:

Hi Mihael,

Sorry to bother you again.

You were right, after configuring the port forwarding the script is able to connect.

But i still get an error "Checksum does not match".

Here goes the content of the ~/coaster-bootstrap-xxx.log file:

using plain mode
BS: http://189.12.232.9:50006
which: no gmd5sum in (/opt/openmpi/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/bio/ncbi/bin:/opt/bio/mpiblast/bin:/opt/bio/EMBOSS/bin:/opt/bio/clustalw/bin:/opt/bio/tcoffee/bin:/opt/bio/hmmer/bin:/opt/bio/phylip/exe:/opt/bio/mrbayes:/opt/bio/fasta:/opt/bio/glimmer/bin:/opt/bio/glimmer/scripts:/opt/bio/gromacs/bin:/opt/bio/gmap/bin:/opt/bio/tigr/bin:/opt/bio/autodocksuite/bin:/opt/bio/wgs/bin:/opt/eclipse:/opt/ganglia/bin:/opt/ganglia/sbin:/usr/java/latest/bin:/opt/maven/bin:/opt/pdsh/bin:/opt/rocks/bin:/opt/rocks/sbin:/opt/condor/bin:/opt/condor/sbin:/opt/gridengine/bin/linux-x64)
Expected checksum: 9b7bd5a96a2912cf8d06d1a2fd891620
Computed checksum: 9b7bd5a96a2912cf8d06d1a2fd891620
JAVA=/usr/java/latest/bin/java
plain /usr/java/latest/bin/java -Djava=/usr/java/latest/bin/java -Xmx64M -DGLOBUS_TCP_PORT_RANGE= -DX509_USER_PROXY=/home/igor/.globus/sshproxy-1344874142-1432003400 -DX509_CERT_DIR=/home/igor/.globus/sshCAcert-1344874142-1432003400.pem -DGLOBUS_HOSTNAME=cluster.mmc.ufjf.br -Duser.home=/home/igor -jar /tmp/bootstrap.xTzo3v http://189.12.232.9:50006 https://189.12.232.9:50005 11100954039
Failed to download cog-provider-coaster-0.3.jar: java.lang.RuntimeException: Checksum does not match.

Thanks,
Igor
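
The bootstrap log in this thread prints an expected and a computed checksum (32 hex digits, and the script looks for gmd5sum, so presumably MD5). A rough way to repeat that kind of comparison by hand for a single file, assuming GNU md5sum is installed; verify_md5 is a hypothetical helper, not part of Swift, and this thread does not say where Swift keeps its expected values.

```shell
#!/bin/sh
# verify_md5 FILE EXPECTED (hypothetical helper, not part of Swift):
# prints both digests and succeeds only when they are identical.
verify_md5() {
    # md5sum prints "<digest>  <file>"; keep only the digest field
    computed=$(md5sum "$1" | cut -d' ' -f1)
    echo "Expected checksum: $2"
    echo "Computed checksum: $computed"
    [ "$computed" = "$2" ]
}
```

Running it against the client-side copy of a jar and against a copy from another Swift package would show whether the two files differ, which is one possible cause of a "Checksum does not match" failure.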
From hategan at mcs.anl.gov Tue May 5 14:27:46 2015
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 5 May 2015 12:27:46 -0700
Subject: [Swift-user] Remote SGE cluster
In-Reply-To:
References: <5543AE11.7030204@uchicago.edu> <5543E67E.8050205@uchicago.edu> <5547CF3D.5010908@uchicago.edu> <1430776366.8782.10.camel@echo>
Message-ID: <1430854066.18863.3.camel@echo>

Hi,

Have you modified any jar files or copied them from another swift package?

The coaster bootstrap stores checksums of the jar files that it needs (calculated at swift compile time) and checks all jar files that come over an unsecured network against them.

Maybe there should be a tool to update these checksums when needed, not just at compile time.

Mihael

On Tue, 2015-05-05 at 11:01 -0300, Igor Russo wrote:
> Hi Mihael,
>
> Sorry to bother you again.
>
> You were right, after configuring the port forwarding the script is able to
> connect.
>
> But i still get an error "Checksum does not match".
>
> Here goes the content of the ~/coaster-bootstrap-xxx.log file:
>
> using plain mode
> BS: http://189.12.232.9:50006
> which: no gmd5sum in
> (/opt/openmpi/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/bio/ncbi/bin:/opt/bio/mpiblast/bin:/opt/bio/EMBOSS/bin:/opt/bio/clustalw/bin:/opt/bio/tcoffee/bin:/opt/bio/hmmer/bin:/opt/bio/phylip/exe:/opt/bio/mrbayes:/opt/bio/fasta:/opt/bio/glimmer/bin:/opt/bio/glimmer/scripts:/opt/bio/gromacs/bin:/opt/bio/gmap/bin:/opt/bio/tigr/bin:/opt/bio/autodocksuite/bin:/opt/bio/wgs/bin:/opt/eclipse:/opt/ganglia/bin:/opt/ganglia/sbin:/usr/java/latest/bin:/opt/maven/bin:/opt/pdsh/bin:/opt/rocks/bin:/opt/rocks/sbin:/opt/condor/bin:/opt/condor/sbin:/opt/gridengine/bin/linux-x64)
> Expected checksum: 9b7bd5a96a2912cf8d06d1a2fd891620
> Computed checksum: 9b7bd5a96a2912cf8d06d1a2fd891620
> JAVA=/usr/java/latest/bin/java
> plain /usr/java/latest/bin/java -Djava=/usr/java/latest/bin/java -Xmx64M
> -DGLOBUS_TCP_PORT_RANGE=
>
-DX509_USER_PROXY=/home/igor/.globus/sshproxy-1344874142-1432003400 > -DX509_CERT_DIR=/home/igor/.globus/sshCAcert-1344874142-1432003400.pem > -DGLOBUS_HOSTNAME=cluster.mmc.ufjf.br -Duser.home=/home/igor -jar > /tmp/bootstrap.xTzo3v http://189.12.232.9:50006 https://189.12.232.9:50005 > 11100954039 > Failed to download cog-provider-coaster-0.3.jar: > java.lang.RuntimeException: Checksum does not match. > > > Thanks, > Igor > > 2015-05-04 18:52 GMT-03:00 Mihael Hategan : > > > > > Hi, > > > > In most cases (globus, coasters), the service side (legion in this case) > > needs the ability to connect back to the client (your home connection). > > > > Correct me if I'm wrong, but you are on a DSL line, behind a router with > > NAT. If so, you must configure the router to forward some incoming > > connections to the actual machine from which you are running swift from. > > Typically this is done by configuring a certain port range forwarding on > > the router (Yadu suggested GLOBUS_TCP_PORT_RANGE=50000,51000, so that > > port range should be matched on the router). > > > > The gist of it is that swift starts a simple shell script on legion that > > downloads a small java app from the client side and launches it. Said > > shell script logs things into ~/coaster-bootstrap-xxx.log files. The > > contents of the bootstrap logs is probably very useful here. > > > > If all of that goes well, the aforementioned small java app downloads > > the full coaster service from the client and starts it. Once started, > > the coaster service connects back to Swift. The last two parts log their > > doings in ~/.globus/coasters/*.log. Those can be useful, too, if they > > exist. > > > > Mihael > > > > On Mon, 2015-05-04 at 18:27 -0300, Igor Russo wrote: > > > Hi Yadu, > > > > > > Yes, i can ssh from my laptop to the cluster directly. > > > > > > The coaster-bootstrap-*.log files are created in the remote system. > > > > > > I'm sending the log file attached. 
> > > > > > Thanks, > > > Igor > > > > > > 2015-05-04 16:57 GMT-03:00 Yadu Nand Babuji : > > > > > > > Hi Igor, > > > > > > > > Are you able to ssh from your machine to legion directly without > > entering > > > > passwords ? > > > > Could you please send us a tarball of the runNNN directories for a > > failing > > > > run ? > > > > > > > > I've put the following settings in my ~/.ssh/config on my laptop and > > setup > > > > ssh keys on > > > > both socrates and legion. This allows me to use "ssh > > legion.rc.ucl.ac.uk" > > > > and connect. > > > > > > > > Host legion.rc.ucl.ac.uk > > > > User YOUR_USERNAME > > > > Hostname legion.rc.ucl.ac.uk > > > > ProxyCommand ssh socrates -W %h:%p > > > > > > > > Host socrates > > > > Hostname socrates.ucl.ac.uk > > > > User YOUR_USERNAME > > > > ForwardAgent yes > > > > > > > > Thanks, > > > > Yadu > > > > > > > > > > > > > > > > On 05/04/2015 07:51 AM, Igor Russo wrote: > > > > > > > > Hi Yadu, > > > > > > > > Thanks again. > > > > > > > > I tried your suggestion. Now i'm not getting the previous error, but > > the > > > > jobs aren't being submitted: > > > > > > > > RunID: run001 > > > > Progress: Seg, 04 Mai 2015 09:32:54-0300 > > > > Progress: Seg, 04 Mai 2015 09:32:55-0300 Submitting:1 > > > > Progress: Seg, 04 Mai 2015 09:33:25-0300 Submitting:1 > > > > Progress: Seg, 04 Mai 2015 09:33:55-0300 Submitting:1 > > > > Progress: Seg, 04 Mai 2015 09:34:25-0300 Submitting:1 > > > > Progress: Seg, 04 Mai 2015 09:34:55-0300 Submitting:1 > > > > Progress: Seg, 04 Mai 2015 09:35:25-0300 Submitting:1 > > > > Progress: Seg, 04 Mai 2015 09:35:55-0300 Submitting:1 > > > > Progress: Seg, 04 Mai 2015 09:36:25-0300 Submitting:1 > > > > > > > > In the the log file, i notice the following errors: > > > > > > > > 2015-05-04 09:24:06,223-0300 INFO ServiceManager Service does not > > > > appear to be registered with this manager > > > > 2015-05-04 09:24:06,223-0300 INFO ServiceManager Coaster service > > ended. 
> > > > Reason: null > > > > > > > > Thanks, > > > > Igor > > > > > > > > > > > > 2015-05-01 17:47 GMT-03:00 Yadu Nand Babuji : > > > > > > > >> Hi Igor, > > > >> > > > >> The remote connection system requires that the local machine you run > > > >> the swift client on has > > > >> a public ip address. It looks like swift was not able to guess it and > > set > > > >> it to http://igor-ubuntu:51251 > > > >> > > > >> Could you retry running part04 after doing the next step, and please > > > >> make sure your environment has > > > >> these variables set whenever you run swift to remote systems : > > > >> export GLOBUS_HOSTNAME= > > > >> export GLOBUS_TCP_PORT_RANGE=50000,51000 > > > >> > > > >> Thanks, > > > >> Yadu > > > >> > > > >> > > > >> On 05/01/2015 02:29 PM, Igor Russo wrote: > > > >> > > > >> Hi Yadu, > > > >> > > > >> Thank you very much! > > > >> > > > >> I changed the config file with the data from my cluster. > > > >> > > > >> When executing the 4th part of Swift-tutorial, i'm getting the > > > >> following error: > > > >> "Failed to download bootstrap jar from ..." 
> > > >> > > > >> > > > >> > > > >> > > -------------------------------------------------------------------------------- > > > >> > > > >> RunID: run031 > > > >> Progress: Sex, 01 Mai 2015 15:40:42-0300 > > > >> Progress: Sex, 01 Mai 2015 15:40:43-0300 Submitting:1 > > > >> > > > >> Execution failed: > > > >> Exception in sort: > > > >> Arguments: [-n, unsorted.txt] > > > >> Host: mmc > > > >> Directory: p4-run031/jobs/s/sort-go28d68m > > > >> exception @ swift-int-staging.k, line: 165 > > > >> Caused by: > > > >> exception @ swift-int-staging.k, line: 160 > > > >> Caused by: null > > > >> Caused by: > > > >> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > Could > > > >> not submit job > > > >> Caused by: > > > >> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > Could > > > >> not start coaster service > > > >> Caused by: > > > >> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > Task > > > >> ended before registration was received. > > > >> Failed to download bootstrap jar from http://igor-ubuntu:51251 > > > >> > > > >> k:assign @ swift.k, line: 174 > > > >> Caused by: Exception in sort: > > > >> Arguments: [-n, unsorted.txt] > > > >> Host: mmc > > > >> Directory: p4-run031/jobs/s/sort-go28d68m > > > >> exception @ swift-int-staging.k, line: 165 > > > >> Caused by: > > > >> exception @ swift-int-staging.k, line: 160 > > > >> Caused by: null > > > >> Caused by: > > > >> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > Could > > > >> not submit job > > > >> Caused by: > > > >> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > Could > > > >> not start coaster service > > > >> Caused by: > > > >> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > Task > > > >> ended before registration was received. 
> > > >> Failed to download bootstrap jar from http://igor-ubuntu:51251 > > > >> > > > >> > > > >> > > -------------------------------------------------------------------------------- > > > >> > > > >> Thanks, > > > >> Igor > > > >> > > > >> 2015-05-01 13:47 GMT-03:00 Yadu Nand Babuji : > > > >> > > > >>> Hi Igor, > > > >>> > > > >>> Swift does support SGE clusters, and you can refer to the > > swift-tutorial > > > >>> for sample code and configurations from this link: > > > >>> https://github.com/swift-lang/swift-tutorial > > > >>> > > > >>> Here's a sample config from our test-suite for Godzilla, an SGE > > cluster > > > >>> at UChicago: > > > >>> > > > >>> > > https://github.com/swift-lang/swift-k/blob/master/tests/sites/godzilla/swift.conf > > > >>> You could modify and add this config to the swift.conf file in the > > > >>> swift-tutorial to run > > > >>> Swift on any machine and execute on a remote SGE cluster. > > > >>> > > > >>> SGE is a widely used resource manager and most sites have > > differences in > > > >>> their setups that make each site unique. If you run into issues with > > the > > > >>> default > > > >>> swift package, and could provide help in figuring out specifics of > > your > > > >>> cluster, we > > > >>> will help you adapt the Swift SGE provider to support your cluster. > > > >>> > > > >>> Thanks, > > > >>> Yadu > > > >>> > > > >>> > > > >>> > > > >>> On 04/28/2015 05:09 PM, Igor Russo wrote: > > > >>> > > > >>> Hi All, > > > >>> > > > >>> It is possible to use Swift with a remote SGE/OGE cluster? 
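
The checksum check Mihael describes can be illustrated with a small shell sketch. It assumes the checksums are MD5 digests, which is only inferred from the bootstrap log in this thread (it searches for gmd5sum and prints 32-character hex values). The function names md5_of and verify_jar_md5 are hypothetical, and this is not the actual coaster bootstrap code.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only; not the coaster bootstrap code.

# md5_of FILE: print the MD5 digest of FILE using whichever tool is installed
# (GNU coreutils is sometimes installed with a "g" prefix, hence gmd5sum).
md5_of() {
    if command -v gmd5sum >/dev/null 2>&1; then
        gmd5sum "$1" | cut -d ' ' -f 1
    elif command -v md5sum >/dev/null 2>&1; then
        md5sum "$1" | cut -d ' ' -f 1
    else
        echo "no md5 tool found" >&2
        return 1
    fi
}

# verify_jar_md5 FILE EXPECTED: exit status 0 only when the digests match.
verify_jar_md5() {
    computed=$(md5_of "$1") || return 2
    echo "Expected checksum: $2"
    echo "Computed checksum: $computed"
    [ "$computed" = "$2" ]
}
```

Run against a jar that was modified or copied from a different build, verify_jar_md5 would return a non-zero status, which may be one way to spot a mismatch on the client before starting a run.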

From igor.souza.russo at gmail.com Wed May 6 07:41:24 2015
From: igor.souza.russo at gmail.com (Igor Russo)
Date: Wed, 6 May 2015 09:41:24 -0300
Subject: [Swift-user] Remote SGE cluster
In-Reply-To:
 <1430854066.18863.3.camel@echo>
References: <5543AE11.7030204@uchicago.edu> <5543E67E.8050205@uchicago.edu> <5547CF3D.5010908@uchicago.edu> <1430776366.8782.10.camel@echo> <1430854066.18863.3.camel@echo>
Message-ID:

Hi,

I've downloaded the package again and it worked just fine.

Thank you very much, Yadu and Mihael!

Igor
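
Earlier in this thread, Yadu suggested exporting GLOBUS_HOSTNAME and GLOBUS_TCP_PORT_RANGE=50000,51000 before running swift against a remote site, and Mihael noted that a NAT router has to forward the same port range. A minimal pre-flight sketch along those lines might look like the following. The function name check_swift_remote_env is hypothetical and not part of Swift, and it only checks that the variables look sane, not that the ports are actually reachable from the cluster.

```shell
#!/bin/sh
# Hypothetical pre-flight check, for illustration only; not part of Swift.

# check_swift_remote_env: succeed only if GLOBUS_HOSTNAME is set and
# GLOBUS_TCP_PORT_RANGE has the form LOW,HIGH with LOW <= HIGH.
check_swift_remote_env() {
    if [ -z "${GLOBUS_HOSTNAME:-}" ]; then
        echo "GLOBUS_HOSTNAME is not set" >&2
        return 1
    fi
    case "${GLOBUS_TCP_PORT_RANGE:-}" in
        ""|*[!0-9,]*|*,*,*|,*|*,)
            # empty, non-numeric, more than one comma, or a missing bound
            echo "GLOBUS_TCP_PORT_RANGE should look like 50000,51000" >&2
            return 1 ;;
        *,*)
            ;;  # exactly one comma with digits on both sides
        *)
            echo "GLOBUS_TCP_PORT_RANGE should look like 50000,51000" >&2
            return 1 ;;
    esac
    low=${GLOBUS_TCP_PORT_RANGE%,*}
    high=${GLOBUS_TCP_PORT_RANGE#*,}
    if [ "$low" -gt "$high" ]; then
        echo "GLOBUS_TCP_PORT_RANGE lower bound exceeds upper bound" >&2
        return 1
    fi
    echo "Callback address $GLOBUS_HOSTNAME, ports $low-$high"
}
```

One possible use is to call check_swift_remote_env in the same shell session right before invoking swift, after exporting the public address of the client machine as GLOBUS_HOSTNAME.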

From iraicu at cs.iit.edu Wed May 6 12:22:45 2015
From: iraicu at cs.iit.edu (Ioan Raicu)
Date: Wed, 06 May 2015 12:22:45 -0500
Subject: [Swift-user] Call for Participation and Posters: ACM HPDC'15 in Portland, Oregon, June 15-19, 2015
Message-ID: <554A4DE5.8010203@cs.iit.edu>

HPDC'15 - Call for Participation and Posters
http://www.hpdc.org/2015/

NEWS:
- Poster/Abstract submission deadline: May 8th, 2015
- Early Registration deadline: May 18th, 2015
- Conference dates: June 15th-19th, 2015
  - Keynote: Allen D. Malony
  - HPDC Achievement Award winner: Ewa Deelman

HPDC'15, the 24th International ACM Symposium on High Performance Parallel and Distributed Computing, will be held in Portland, Oregon, June 15-19, 2015. HPDC'15 will be part of FCRC'15, ACM's Federated Computing Research Conference, hosting no fewer than 13 top-level conferences and symposia, along with an exciting program of joint keynote lectures, opening on the evening of June 14th with the A.M. Turing Award Lecture, delivered by the 2014 award winner, Michael Stonebraker.

HPDC is hosting a total of seven workshops on June 15 and 16. The main HPDC conference will run from June 17 to 19, featuring 30 presentations of papers and short papers, two keynotes, and a poster session.
This year's HPDC keynote presentations will be given by Allen D. Malony and this year's HPDC Achievement Award winner, Ewa Deelman.

Please see http://www.hpdc.org/2015/ for details.

Registration (via FCRC) is open now. Deadline for discounted, early registration is May 18, 2015. Travel grants for students attending a US institution are available, sponsored by NSF.

Call for posters:

HPDC'15 will feature a poster session that will provide the right environment for lively and informal discussions on various high performance parallel and distributed computing topics. There will be a "poster session and reception" on Wednesday, 6-8 pm. Posters can remain on display during the three days of the HPDC main conference (Wednesday to Friday).

We invite all potential authors to submit their contribution to this poster session in the form of a two-page PDF abstract (we recommend using the ACM Proceedings style, and fonts not smaller than 10 point). Posters may be accompanied by practical demonstrations. Abstracts must be submitted by May 8th 2015 through https://www.easychair.org/conferences/?conf=hpdc15posters.

For any questions about the submission, selection, and presentation of the accepted posters, please contact the Posters Chair - Ana-Maria Oprescu, Vrije Universiteit Amsterdam, The Netherlands.

--
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor, Illinois Institute of Technology (IIT)
Guest Research Faculty, Argonne National Laboratory (ANL)
=================================================================
Data-Intensive Distributed Systems Laboratory, CS/IIT
Distributed Systems Laboratory, MCS/ANL
=================================================================
Editor: IEEE TCC, Springer Cluster, Springer JoCCASA
Chair: IEEE/ACM MTAGS, ACM ScienceCloud
=================================================================
Cel: 1-847-722-0876
Office: 1-312-567-5704
Email: iraicu at cs.iit.edu
Web: http://www.cs.iit.edu/~iraicu/
Web: http://datasys.cs.iit.edu/
LinkedIn: http://www.linkedin.com/in/ioanraicu
Google: http://scholar.google.com/citations?user=jE73HYAAAAAJ
=================================================================
=================================================================

From parkjs at aps.anl.gov Fri May 8 02:20:49 2015
From: parkjs at aps.anl.gov (Park, Jun-Sang)
Date: Fri, 8 May 2015 07:20:49 +0000
Subject: [Swift-user] swift with matlab
Message-ID: <6780FAAD7D3A8A4FB23E2530CBF02FB2255A169D@BUTKUS.anl.gov>

Hello,

I am tinkering with the matlab + swift capability and am having problems.

As a trial, here is what I am trying to do with the "hello world" problem (hwsq, or the magic square problem): instead of giving a csv input file path as in the vanilla "hello world" problem, I am giving a mat-file path as an input and loading it in the "hello world" problem.

The problem I am having is that when I change the size of the problem (increase the number of iterations in the foreach statement), it no longer works; smaller problems run, bigger problems fail with the following message. It seems like some jobs go through but some don't. Is it possible that there are just too many pings to the mat file, and swift or the file system somehow doesn't like it?
Anyways, if there is someone I can talk to or if we can talk about how to go about it, I'd appreciate it.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Swift 0.94 swift-r6637 cog-r3742

RunID: 20150507-1148-9kb4by0e
Progress: time: Thu, 07 May 2015 11:48:18 -0500
Progress: time: Thu, 07 May 2015 11:48:21 -0500 Selecting site:25 Submitted:15 Active:1
Progress: time: Thu, 07 May 2015 11:48:35 -0500 Selecting site:25 Active:15 Checking status:1
Progress: time: Thu, 07 May 2015 11:48:36 -0500 Selecting site:18 Stage in:1 Active:14 Finished successfully:8
Progress: time: Thu, 07 May 2015 11:48:37 -0500 Selecting site:10 Stage in:1 Active:15 Finished successfully:15
Progress: time: Thu, 07 May 2015 11:48:48 -0500 Selecting site:9 Active:16 Finished successfully:16
Progress: time: Thu, 07 May 2015 11:48:49 -0500 Selecting site:9 Active:15 Checking status:1 Finished successfully:16
Exception in run_hwsq_matload:
    Arguments: [/clhome/TOMO1/tomo/MATLAB/MATLAB_Compiler_Runtime/v84/, ../mat.mat, sqmat.0017.dat, 17]
    Host: cluster
    Directory: hwsq_matload-20150507-1148-9kb4by0e/jobs/c/run_hwsq_matload-csn64g8m
Caused by: Failed to link input file ../mat.mat
Progress: time: Thu, 07 May 2015 11:48:50 -0500 Submitted:1 Active:14 Failed:2 Finished successfully:24
Exception in run_hwsq_matload:
    Arguments: [/clhome/TOMO1/tomo/MATLAB/MATLAB_Compiler_Runtime/v84/, ../mat.mat, sqmat.0000.dat, 0]
    Host: cluster
    Directory: hwsq_matload-20150507-1148-9kb4by0e/jobs/d/run_hwsq_matload-dsn64g8m
Caused by: Failed to link input file ../mat.mat
Exception in run_hwsq_matload:
    Arguments: [/clhome/TOMO1/tomo/MATLAB/MATLAB_Compiler_Runtime/v84/, ../mat.mat, sqmat.0001.dat, 1]
    Host: cluster
    Directory: hwsq_matload-20150507-1148-9kb4by0e/jobs/e/run_hwsq_matload-esn64g8m
Caused by: Failed to link input file ../mat.mat
Exception in run_hwsq_matload:
    Arguments: [/clhome/TOMO1/tomo/MATLAB/MATLAB_Compiler_Runtime/v84/, ../mat.mat, sqmat.0011.dat, 11]
    Host: cluster
    Directory: hwsq_matload-20150507-1148-9kb4by0e/jobs/f/run_hwsq_matload-fsn64g8m
Caused by: Failed to link input file ../mat.mat
Exception in run_hwsq_matload:
    Arguments: [/clhome/TOMO1/tomo/MATLAB/MATLAB_Compiler_Runtime/v84/, ../mat.mat, sqmat.0028.dat, 28]
    Host: cluster
    Directory: hwsq_matload-20150507-1148-9kb4by0e/jobs/g/run_hwsq_matload-gsn64g8m
Caused by: Failed to link input file ../mat.mat
Exception in run_hwsq_matload:
    Arguments: [/clhome/TOMO1/tomo/MATLAB/MATLAB_Compiler_Runtime/v84/, ../mat.mat, sqmat.0005.dat, 5]
    Host: cluster
    Directory: hwsq_matload-20150507-1148-9kb4by0e/jobs/h/run_hwsq_matload-hsn64g8m
Caused by: Failed to link input file ../mat.mat
Progress: time: Thu, 07 May 2015 11:48:52 -0500 Active:3 Checking status:1 Failed:6 Finished successfully:31
Progress: time: Thu, 07 May 2015 11:49:02 -0500 Active:2 Checking status:1 Failed:6 Finished successfully:32
Final status: Thu, 07 May 2015 11:49:02 -0500 Failed:6 Finished successfully:35

The following errors have occurred:
1. Exception in run_hwsq_matload:
    Arguments: [/clhome/TOMO1/tomo/MATLAB/MATLAB_Compiler_Runtime/v84/, ../mat.mat, sqmat.0017.dat, 17]
    Host: cluster
    Directory: hwsq_matload-20150507-1148-9kb4by0e/jobs/c/run_hwsq_matload-csn64g8m
Caused by: Failed to link input file ../mat.mat
2. Exception in run_hwsq_matload:
    Arguments: [/clhome/TOMO1/tomo/MATLAB/MATLAB_Compiler_Runtime/v84/, ../mat.mat, sqmat.0011.dat, 11]
    Host: cluster
    Directory: hwsq_matload-20150507-1148-9kb4by0e/jobs/f/run_hwsq_matload-fsn64g8m
Caused by: Failed to link input file ../mat.mat
3. Exception in run_hwsq_matload:
    Arguments: [/clhome/TOMO1/tomo/MATLAB/MATLAB_Compiler_Runtime/v84/, ../mat.mat, sqmat.0001.dat, 1]
    Host: cluster
    Directory: hwsq_matload-20150507-1148-9kb4by0e/jobs/e/run_hwsq_matload-esn64g8m
Caused by: Failed to link input file ../mat.mat
4. Exception in run_hwsq_matload:
    Arguments: [/clhome/TOMO1/tomo/MATLAB/MATLAB_Compiler_Runtime/v84/, ../mat.mat, sqmat.0000.dat, 0]
    Host: cluster
    Directory: hwsq_matload-20150507-1148-9kb4by0e/jobs/d/run_hwsq_matload-dsn64g8m
Caused by: Failed to link input file ../mat.mat
5. Exception in run_hwsq_matload:
    Arguments: [/clhome/TOMO1/tomo/MATLAB/MATLAB_Compiler_Runtime/v84/, ../mat.mat, sqmat.0028.dat, 28]
    Host: cluster
    Directory: hwsq_matload-20150507-1148-9kb4by0e/jobs/g/run_hwsq_matload-gsn64g8m
Caused by: Failed to link input file ../mat.mat
6. Exception in run_hwsq_matload:
    Arguments: [/clhome/TOMO1/tomo/MATLAB/MATLAB_Compiler_Runtime/v84/, ../mat.mat, sqmat.0005.dat, 5]
    Host: cluster
    Directory: hwsq_matload-20150507-1148-9kb4by0e/jobs/h/run_hwsq_matload-hsn64g8m
Caused by: Failed to link input file ../mat.mat

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% run.sh script
#! /bin/bash

PATH=~wilde/swift/rev/swift-0.94.1/bin:$PATH
swift -sites.file sites.xml -tc.file tc -config cf hwsq_matload.swift

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% run_hwsq_matload.sh script (autogenerated by matlab and I edited for environment)
#!/bin/sh
# script for execution of deployed applications
#
# Sets up the MCR environment for the current $ARCH and executes
# the specified command.
# exe_name=$0 exe_dir=`dirname "$0"` echo "------------------------------------------" if [ "x$1" = "x" ]; then echo Usage: echo $0 \ args else echo Setting up environment variables MCRROOT="$1" echo --- LD_LIBRARY_PATH=.:${MCRROOT}/runtime/glnxa64 ; LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MCRROOT}/bin/glnxa64 ; LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MCRROOT}/sys/os/glnxa64; # LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/clhome/TOMO1/tomo/MATLAB_Compiler_Runtime/v82/runtime/glnxa64:/clhome/TOMO1/tomo/MATLAB_Compiler_Runtime/v82/bin/glnxa64:/clhome/TOMO1/tomo/MATLAB_Compiler_Runtime/v82/sys/os/glnxa64:/clhome/TOMO1/tomo/MATLAB_Compiler_Runtime/v82/sys/java/jre/glnxa64/jre/lib/amd64/native_threads:/clhome/TOMO1/tomo/MATLAB_Compiler_Runtime/v82/sys/java/jre/glnxa64/jre/lib/amd64/server:/clhome/TOMO1/tomo/MATLAB_Compiler_Runtime/v82/sys/java/jre/glnxa64/jre/lib/amd64; XAPPLRESDIR=${MCRROOT}/X11/app-defaults ; # XAPPLRESDIR=XAPPLRESDIR=${MCRROOT}/X11/app-defaults ; export XAPPLRESDIR; export LD_LIBRARY_PATH; echo LD_LIBRARY_PATH is ${LD_LIBRARY_PATH}; shift 1 args= while [ $# -gt 0 ]; do token=$1 args="${args} \"${token}\"" shift done eval "/clhome/TOMO1/tomo/park_swift/bin_hwsq_matload/hwsq_matload" $args fi exit %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % hwsq_matload.swift type file; app (file outdata) hwsq_matload (file indata, int factor) { run_hwsq_matload "/clhome/TOMO1/tomo/MATLAB/MATLAB_Compiler_Runtime/v84/" @indata @outdata factor; } file degreeData<"../mat.mat">; int factors[] = [0:40]; file squareMats[] ; foreach f, i in factors { squareMats[i] = hwsq_matload (degreeData, f); } %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % tc cluster run_hwsq_matload /clhome/TOMO1/tomo/park_swift/bin_hwsq_matload/run_hwsq_matload.sh %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % sites.xml {env.HOME}/swiftwork sec1bigmem --> sec1_bigmem --> 32 1 1 1 16 .15 00:15:00 1800 {env.HOME}/swiftwork 100 100 10000 From m.altaweel at ucl.ac.uk Mon May 11 10:59:26 2015 From: m.altaweel at ucl.ac.uk (Altaweel, Mark) Date: 
Mon, 11 May 2015 15:59:26 +0000
Subject: [Swift-user] hung process
Message-ID: <61312D06-A9BC-4EC3-A194-61D1CAC37795@live.ucl.ac.uk>

Hi,

I was running a Swift job on Midway with the attached Swift (0.96) conf. The process seemed to hang (the out file just keeps saying "submitted"), and I was not sure whether the problem was in the queuing or within Swift. The log is attached.

When I ran this with a small dataset, everything ran fine and finished with the correct output, but this run uses a larger dataset (which did run on a single node). I set the wall time to the length the run should take.

Thanks again!

Mark
-------------- next part --------------
A non-text attachment was scrubbed...
Name: out.zip
Type: application/zip
Size: 481799 bytes
Desc: out.zip

From fullc0de at gmail.com Sun May 24 22:34:07 2015
From: fullc0de at gmail.com (fullc0de)
Date: Mon, 25 May 2015 12:34:07 +0900
Subject: [Swift-user] Lazy Initializer doesn't defend against calling twice.
Message-ID:

Hello, folks.

I have found a case where a lazy initializer is called twice when the initializer is placed in a recursion. I don't know whether it is allowed to be called twice in this case. Until now, I understood that the lazy keyword guarantees the initializer runs only once. Is that a misunderstanding?

The test code is as follows:

lazy var testLabel: UILabel = {
    println("testLabel self = \(self)")
    let label = UILabel()
    label.text = "hello"
    self.testLabel.text = "world"
    return label
}()

I know this code is nonsense, but I wanted to test whether lazy guarantees a single run even under recursion. With this code, I ran into an infinite recursion.

Isn't this case covered by lazy's run-once guarantee?

Best regards.

Kyokook Hwang.

From hategan at mcs.anl.gov Mon May 25 12:13:01 2015
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 25 May 2015 10:13:01 -0700
Subject: [Swift-user] Lazy Initializer doesn't defend against calling twice.
In-Reply-To:
References:
Message-ID: <1432573981.32542.15.camel@echo3>

Hi,

This is a mailing list for swift-lang.org rather than developer.apple.com/swift/, so an entirely different language. I would suggest posting your question in a more appropriate place.

I'll take a stab at answering your question though: you have multiple instances of the testLabel field (one for each instance of UILabel). Each of them is initialized at most once (although, due to the loop, none of them gets to be initialized).

Mihael

From Matthew.Shaxted at som.com Wed May 27 09:50:14 2015
From: Matthew.Shaxted at som.com (Matthew Shaxted)
Date: Wed, 27 May 2015 10:50:14 -0400
Subject: [Swift-user] Channel Timeout on Beagle?
Message-ID:

Hi Swift Users:

I am running some studies on Beagle using Swift and am experiencing a strange error. The Swift scripts run great on cloud and on the Beagle login node, but runs submitted to the Beagle queue seem to time out for some reason.

Does anyone have insight into the cause of this? Thanks for any help.

Below is the error I am getting:

Host: cluster
Directory: epsweep-run004/jobs/a/RunEP-ai2mic9m
exception @ swift-int-staging.k, line: 181
Caused by: exception @ swift-int-staging.k, line: 177
Caused by: Block task failed: Connection to worker lost
org.globus.cog.coaster.TimeoutException: Channel timed out. lastTime=150526-142313.128, 50526-142514.107, channel=TCPChannel [type: server, contact: 0526-0802460-000014-000456
    at org.globus.cog.coaster.channels.AbstractCoasterChannel.checkTimeouts(AbstractCoasterChannel.java:133)
    at org.globus.cog.coaster.channels.AbstractCoasterChannel$1.run(AbstractCoasterChannel.java:124)
    at java.util.TimerThread.mainLoop(Timer.java:566)
    at java.util.TimerThread.run(Timer.java:516)

Below is my sites.xml file:

CI-SES000178
24
100
100
pbs.aprun;pbs.mpp;depth=24
10800
01:25:00
/lustre/beagle2/mattshax/epsweep/swifthome
20
600
1
180
10000
/dev/shm/mattshax/swiftapp

MATTHEW SHAXTED
SKIDMORE, OWINGS & MERRILL LLP
224 SOUTH MICHIGAN AVENUE
CHICAGO, IL 60604
T (312) 360-4368
MATTHEW.SHAXTED at SOM.COM

The information contained in this communication may be confidential, is intended only for the use of the recipient(s) named above, and may be legally privileged. If the reader of this message is not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication, or any of its contents, is strictly prohibited and may be unlawful. If you have received this communication in error, please return it to the sender immediately and delete the original message and any copy of it from your computer system. If you have any questions concerning this message, please contact the sender.

From Matthew.Shaxted at som.com Wed May 27 14:03:38 2015
From: Matthew.Shaxted at som.com (Matthew Shaxted)
Date: Wed, 27 May 2015 15:03:38 -0400
Subject: [Swift-user] Channel Timeout on Beagle?
In-Reply-To: <13DD345F45E0EA4BBAC56317EAC0F026972B813375@CCRD007.mail.lan>
References: <13DD345F45E0EA4BBAC56317EAC0F026972B813375@CCRD007.mail.lan>
Message-ID:

Hi All: I was able to get the runs working successfully by changing the maxtime flag in the sites file.

Thanks

From Matthew.Shaxted at som.com Fri May 29 10:39:12 2015
From: Matthew.Shaxted at som.com (Matthew Shaxted)
Date: Fri, 29 May 2015 11:39:12 -0400
Subject: [Swift-user] Channel Timeout on Beagle?
In-Reply-To: <13DD345F45E0EA4BBAC56317EAC0F026972B813380@CCRD007.mail.lan>
References: <13DD345F45E0EA4BBAC56317EAC0F026972B813375@CCRD007.mail.lan> <13DD345F45E0EA4BBAC56317EAC0F026972B813380@CCRD007.mail.lan>
Message-ID:

It looks like the timeout problem is not actually solved. For some reason I am having a lot of difficulty running on Beagle, and I have a feeling it is due to slow read/write.

For example, I finished ~1,200 of 12,000 runs before the failure (see the paragraph below), and moving those results (the result files are not very large) to public_html is taking an hour or so. I'm hoping to scale up to 100-300k runs or so, so this will become a significant bottleneck. I have just emailed beagle-support about this issue.

In all test environments my Swift workflow works well, but when I submit jobs to the Beagle queue, it completes some number of simulations before the timeout error occurs and all jobs stop. I'm using Swift-0.95-RC7 (and am in the process of updating to the latest 0.95), but I think these errors may also be due to the slow read/write.

Any suggestions?

Below is the error I see, after which the job stops completely:

Host: cluster
Directory: epsweep-run004/jobs/a/RunEP-ai2mic9m
exception @ swift-int-staging.k, line: 181
Caused by: exception @ swift-int-staging.k, line: 177
Caused by: Block task failed: Connection to worker lost
org.globus.cog.coaster.TimeoutException: Channel timed out.
lastTime=150526-142313.128, 50526-142514.107, channel=TCPChannel [type: server, contact: 0526-0802460-000014-000456
    at org.globus.cog.coaster.channels.AbstractCoasterChannel.checkTimeouts(AbstractCoasterChannel.java:133)
    at org.globus.cog.coaster.channels.AbstractCoasterChannel$1.run(AbstractCoasterChannel.java:124)
    at java.util.TimerThread.mainLoop(Timer.java:566)
    at java.util.TimerThread.run(Timer.java:516)

From wilde at anl.gov Fri May 29 11:41:00 2015
From: wilde at anl.gov (Michael Wilde)
Date: Fri, 29 May 2015 11:41:00 -0500
Subject: [Swift-user] Channel Timeout on Beagle?
In-Reply-To:
References: <13DD345F45E0EA4BBAC56317EAC0F026972B813375@CCRD007.mail.lan> <13DD345F45E0EA4BBAC56317EAC0F026972B813380@CCRD007.mail.lan>
Message-ID: <5568969C.7040103@anl.gov>

Matthew,

You should consider using Swift 0.96.0 and, to the extent possible, use local filesystems instead of the shared filesystem, which is often under excessive load. We can discuss how to do this in follow-up messages as needed.

Basically, try provider-staging: put the input data on the login node's local filesystem, and put the site work directory under /dev/shm or /tmp. (You may need to probe the compute node to see which of these is writable and has sufficient space.)

- Mike

--
Michael Wilde
Mathematics and Computer Science          Computation Institute
Argonne National Laboratory               The University of Chicago

From hategan at mcs.anl.gov Fri May 29 12:47:42 2015
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 29 May 2015 10:47:42 -0700
Subject: [Swift-user] Channel Timeout on Beagle?
In-Reply-To:
References: <13DD345F45E0EA4BBAC56317EAC0F026972B813375@CCRD007.mail.lan> <13DD345F45E0EA4BBAC56317EAC0F026972B813380@CCRD007.mail.lan>
Message-ID: <1432921662.8367.0.camel@echo3>

Hi Matthew,

Can you send me the full swift log?

Mihael

From Matthew.Shaxted at som.com Fri May 29 12:59:21 2015
From: Matthew.Shaxted at som.com (Matthew Shaxted)
Date: Fri, 29 May 2015 13:59:21 -0400
Subject: [Swift-user] Channel Timeout on Beagle?
In-Reply-To: <1432921662.8367.0.camel@echo3>
References: <13DD345F45E0EA4BBAC56317EAC0F026972B813375@CCRD007.mail.lan> <13DD345F45E0EA4BBAC56317EAC0F026972B813380@CCRD007.mail.lan> <1432921662.8367.0.camel@echo3>
Message-ID:

Mihael, please see the Swift run001 folder at the link below:

http://web.ci.uchicago.edu/~mattshax/epsweep-run001.tar.gz

-----Original Message-----
From: swift-user-bounces at ci.uchicago.edu [mailto:swift-user-bounces at ci.uchicago.edu] On Behalf Of Mihael Hategan
Sent: Friday, May 29, 2015 12:48 PM
To: Matthew Shaxted
Cc: 'Swift User'
Subject: Re: [Swift-user] Channel Timeout on Beagle?

Hi Matthew,

Can you send me the full swift log?

Mihael

On Fri, 2015-05-29 at 11:39 -0400, Matthew Shaxted wrote:
> It looks like the timeout problem is not solved actually. For some reason I am having much difficulty running on Beagle, and I have a feeling it is due to slow read/write.
>
> For example, I finished ~1,200 / 12,000 runs before failure (see below paragraph) and moving these results (of not very large result files) to the public_html is taking an hour or so. I'm hoping to scale up to 100-300k runs or so, thus this will become a significant bottleneck.
I have emailed beagle-support about this issue just now. > > In all test environments my Swift workflow is working well, but when submitting jobs to Beagle queue, it completes some number of simulations before the timeout error occurs and all jobs stop. I'm using Swift-0.95-RC7 (and am in process of updating to 0.95 latest), but think these errors may also be due to this slow read/write. > > Any suggestions? > > Below is the error I see and the job completely stops: > > Host: cluster > Directory: epsweep-run004/jobs/a/RunEP-ai2mic9m exception @ > swift-int-staging.k, line: 181 Caused by: exception @ > swift-int-staging.k, line: 177 Caused by: Block task failed: > Connection to worker lost > org.globus.cog.coaster.TimeoutException: Channel timed out. > lastTime=150526-142313.128, 50526-142514.107, channel=TCPChannel > [type: server, contact: 0526-0802460-000014-000456 at org.globus.cog.coaster.channels.AbstractCoasterChannel.checkTimeouts(AbstractCoasterChannel.java:133) > at org.globus.cog.coaster.channels.AbstractCoasterChannel$1.run(AbstractCoasterChannel.java:124) > at java.util.TimerThread.mainLoop(Timer.java:566) > at java.util.TimerThread.run(Timer.java:516) > > > MATTHEW SHAXTED > SKIDMORE, OWINGS & MERRILL LLP > 224 SOUTH MICHIGAN AVENUE > CHICAGO, IL 60604 > T (312) 360-4368 > MATTHEW.SHAXTED at SOM.COM > > [cid:image001.png at 01D099FB.B26BE1C0] > The information contained in this communication may be confidential, is intended only for the use of the recipient(s) named above, and may be legally privileged. If the reader of this message is not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication, or any of its contents, is strictly prohibited and may be unlawful. If you have received this communication in error, please return it to the sender immediately and delete the original message and any copy of it from your computer system. 
If you have any questions concerning this message, please contact the sender. > > [cid:image005.gif at 01D099F2.1E9A2BE0] > > From: Matthew Shaxted > Sent: Wednesday, May 27, 2015 2:04 PM > To: 'Swift User' > Subject: RE: Channel Timeout on Beagle? > > Hi All: I was able to get the runs working successfully by changing the maxtime flag in the sites file. > > Thanks > > > From: Matthew Shaxted > Sent: Wednesday, May 27, 2015 9:50 AM > To: Swift User > Subject: Channel Timeout on Beagle? > > Hi Swift Users: > > I am running some studies on Beagle using Swift, and experiencing a strange error. The Swift scripts run great on cloud and on the Beagle login node, but seems to be timing out for some reason. > > Does anyone have insight into the cause of this? Thanks for any help. > > Below is the error I am getting: > > Host: cluster > Directory: epsweep-run004/jobs/a/RunEP-ai2mic9m exception @ > swift-int-staging.k, line: 181 Caused by: exception @ > swift-int-staging.k, line: 177 Caused by: Block task failed: > Connection to worker lost > org.globus.cog.coaster.TimeoutException: Channel timed out. 
> lastTime=150526-142313.128, 50526-142514.107, channel=TCPChannel > [type: server, contact: 0526-0802460-000014-000456 at org.globus.cog.coaster.channels.AbstractCoasterChannel.checkTimeouts(AbstractCoasterChannel.java:133) > at org.globus.cog.coaster.channels.AbstractCoasterChannel$1.run(AbstractCoasterChannel.java:124) > at java.util.TimerThread.mainLoop(Timer.java:566) > at java.util.TimerThread.run(Timer.java:516) > > Below is my sites.xml file: > > > > CI-SES000178 > 24 > 100 > 100 > pbs.aprun;pbs.mpp;depth=24 > 10800 > 01:25:00 > /lustre/beagle2/mattshax/epsweep/swifthome > 20 > 600 > 1 > 180 > 10000 > > /dev/shm/mattshax/swiftapp > > > > MATTHEW SHAXTED > SKIDMORE, OWINGS & MERRILL LLP > 224 SOUTH MICHIGAN AVENUE > CHICAGO, IL 60604 > T (312) 360-4368 > MATTHEW.SHAXTED at SOM.COM > > [cid:image006.png at 01D099F2.1E9A2BE0] > The information contained in this communication may be confidential, is intended only for the use of the recipient(s) named above, and may be legally privileged. If the reader of this message is not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication, or any of its contents, is strictly prohibited and may be unlawful. If you have received this communication in error, please return it to the sender immediately and delete the original message and any copy of it from your computer system. If you have any questions concerning this message, please contact the sender. 
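The two timestamps in the quoted "Channel timed out" message can be compared to see how long the channel had been silent. The following is only a rough sketch: it assumes the stamps use a yyMMdd-HHmmss.SSS layout, that both fall on the same day, and it uses only the time-of-day part because the second value appears truncated in the archive. The names LAST_TIME, SECOND_VALUE, time_of_day and gap_seconds are made up for this illustration and are not part of Swift.

```python
from datetime import datetime

# Values copied from the quoted error message. The second one looks
# truncated in the archive, so only the part after the dash is parsed.
# ASSUMPTION: both stamps are on the same day, in HHMMSS.mmm form.
LAST_TIME = "150526-142313.128"
SECOND_VALUE = "50526-142514.107"


def time_of_day(stamp):
    """Parse the HHMMSS.mmm part that follows the dash."""
    return datetime.strptime(stamp.split("-", 1)[1], "%H%M%S.%f")


def gap_seconds(earlier, later):
    """Seconds between two stamps assumed to be on the same day."""
    return (time_of_day(later) - time_of_day(earlier)).total_seconds()


if __name__ == "__main__":
    print(f"gap: {gap_seconds(LAST_TIME, SECOND_VALUE):.3f} s")
```

Under those assumptions the gap comes out at about 121 seconds, so the channel had seen no traffic for roughly two minutes when it was reported as timed out.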
From hategan at mcs.anl.gov Fri May 29 13:11:34 2015
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 29 May 2015 11:11:34 -0700
Subject: [Swift-user] Channel Timeout on Beagle?
In-Reply-To:
References: <13DD345F45E0EA4BBAC56317EAC0F026972B813375@CCRD007.mail.lan> <13DD345F45E0EA4BBAC56317EAC0F026972B813380@CCRD007.mail.lan> <1432921662.8367.0.camel@echo3>
Message-ID: <1432923094.8367.6.camel@echo3>

My initial suspicion would be running out of space on /dev/shm. But I think we need more information.

0.96 has worker health probes that periodically check the status of various things, such as filesystem usage. That's choice #1.

Choice #2 is worker logging, which is supported in 0.95. You enable it by saying DEBUG inside the relevant site. This should give you a bunch of logs in $userHomeOverride/.globus/coasters (I think; Yadu, correct me if I'm wrong). They may provide more details about why coaster workers are misbehaving.

Mihael

On Fri, 2015-05-29 at 13:59 -0400, Matthew Shaxted wrote:
> Mihael, please see the Swift run001 folder at the link below:
>
> http://web.ci.uchicago.edu/~mattshax/epsweep-run001.tar.gz

From Matthew.Shaxted at som.com Fri May 29 13:16:24 2015
From: Matthew.Shaxted at som.com (Matthew Shaxted)
Date: Fri, 29 May 2015 14:16:24 -0400
Subject: [Swift-user] Channel Timeout on Beagle?
In-Reply-To: <1432923094.8367.6.camel@echo3>
References: <13DD345F45E0EA4BBAC56317EAC0F026972B813375@CCRD007.mail.lan> <13DD345F45E0EA4BBAC56317EAC0F026972B813380@CCRD007.mail.lan> <1432921662.8367.0.camel@echo3> <1432923094.8367.6.camel@echo3>
Message-ID:

Okay thanks - I will rerun now with workerLogging DEBUG enabled, and in the meantime will update to 0.96.

-----Original Message-----
From: Mihael Hategan [mailto:hategan at mcs.anl.gov]
Sent: Friday, May 29, 2015 1:12 PM
To: Matthew Shaxted
Cc: 'Swift User'
Subject: Re: [Swift-user] Channel Timeout on Beagle?

My initial suspicion would be running out of space on /dev/shm. But I think we need more information.

0.96 has worker health probes that periodically check the status of various things, such as filesystem usage. That's choice #1.

Choice #2 is worker logging, which is supported in 0.95. You enable it by saying DEBUG inside the relevant site.
This should give you a bunch of logs in $userHomeOverride/.globus/coasters (I think; Yadu, correct me if I'm wrong). They may provide more details about why coaster workers are misbehaving.

Mihael

On Fri, 2015-05-29 at 13:59 -0400, Matthew Shaxted wrote:
> Mihael, please see the Swift run001 folder at the link below:
>
> http://web.ci.uchicago.edu/~mattshax/epsweep-run001.tar.gz

From Matthew.Shaxted at som.com Fri May 29 13:19:04 2015
From: Matthew.Shaxted at som.com (Matthew Shaxted)
Date: Fri, 29 May 2015 14:19:04 -0400
Subject: [Swift-user] Channel Timeout on Beagle?
In-Reply-To: <5568969C.7040103@anl.gov>
References: <13DD345F45E0EA4BBAC56317EAC0F026972B813375@CCRD007.mail.lan> <13DD345F45E0EA4BBAC56317EAC0F026972B813380@CCRD007.mail.lan> <5568969C.7040103@anl.gov>
Message-ID:

Thanks Mike - running on the local filesystem makes a lot of sense. Will give this a try as well.

MATTHEW SHAXTED
SKIDMORE, OWINGS & MERRILL LLP
224 SOUTH MICHIGAN AVENUE
CHICAGO, IL 60604
T (312) 360-4368
MATTHEW.SHAXTED at SOM.COM

From: Michael Wilde [mailto:wilde at anl.gov]
Sent: Friday, May 29, 2015 11:41 AM
To: Matthew Shaxted; Swift User
Subject: Re: [Swift-user] Channel Timeout on Beagle?
Matthew,

You should consider using Swift 0.96.0 and, to the extent possible, using local filesystems instead of the shared filesystem, which is often under excessive load. We can discuss how to do this in follow-up messages as needed.

Basically, try provider-staging, and put the input data on the login node's local filesystem and the site workdirectory under /dev/shm or /tmp. (You may need to probe the compute node to see which of these is writable and has sufficient space.)

- Mike

--
Michael Wilde
Mathematics and Computer Science          Computation Institute
Argonne National Laboratory               The University of Chicago
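Mike's suggestion to probe the compute node for a writable directory with enough space could be sketched roughly as below. This is only an illustration and not something from the Swift distribution: the function name probe_dirs and the 1 GiB threshold are assumptions made for the example, and the script would have to be run on a compute node (for instance as a small test job) for the result to be meaningful.

```python
import os
import shutil


def probe_dirs(candidates, min_free_bytes):
    """Return (path, free_bytes) for every candidate directory that
    exists, is writable by the current user, and has at least
    min_free_bytes of free space."""
    usable = []
    for path in candidates:
        if not os.path.isdir(path):
            continue  # directory does not exist on this node
        if not os.access(path, os.W_OK | os.X_OK):
            continue  # cannot create files in it
        free = shutil.disk_usage(path).free
        if free >= min_free_bytes:
            usable.append((path, free))
    return usable


if __name__ == "__main__":
    # Hypothetical threshold of 1 GiB; adjust to the size of one run.
    one_gib = 1024 ** 3
    for path, free in probe_dirs(["/dev/shm", "/tmp"], one_gib):
        print(f"{path}: {free / one_gib:.1f} GiB free")
```

Running the same check before and during a large batch would also show whether /dev/shm is filling up, which is the first suspicion raised earlier in the thread.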