From zhaozhang at uchicago.edu Tue Jun 2 15:42:44 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Tue, 02 Jun 2009 15:42:44 -0500
Subject: [Swift-devel] coaster test failed on SDSC IA-64 Cluster
Message-ID: <4A258EC4.3010307@uchicago.edu>
Hi, Mihael
The language behavior test failed on the SDSC cluster. All gram and coaster
log files can be found on the CI network at:
/home/zzhang/swift_coaster/cog/modules/swift/tests/sites/sdsc/
I am attaching the stdout and the sites.xml definition of the SDSC cluster
site. Could you help find out where things
went wrong? Thanks.
best
zhao
[zzhang at communicado sites]$ cat coaster_new/tgsdsc-pbs-gram2.xml
TG-DBS080005N
/users/zzhang/work
4
2
5
1
2
false
[zzhang at communicado sites]$ ./run-site coaster_new/tgsdsc-pbs-gram2.xml
testing site configuration: coaster_new/tgsdsc-pbs-gram2.xml
Removing files from previous runs
Running test 061-cattwo at Tue Jun 2 15:27:48 CDT 2009
Swift svn swift-r2949 cog-r2406
RunID: 20090602-1527-q2au81s2
Progress:
Progress: Stage in:1
Progress: Submitting:1
Progress: Submitting:1
Progress: Submitting:1
Progress: Submitting:1
Progress: Submitted:1
Progress: Active:1
Progress: Failed but can retry:1
Failed to transfer wrapper log from
061-cattwo-20090602-1527-q2au81s2/info/c on tgsdsc
Progress: Active:1
Failed to transfer wrapper log from
061-cattwo-20090602-1527-q2au81s2/info/e on tgsdsc
Progress: Stage in:1
Progress: Active:1
Failed to transfer wrapper log from
061-cattwo-20090602-1527-q2au81s2/info/g on tgsdsc
Progress: Failed:1
Execution failed:
Exception in cat:
Arguments: [061-cattwo.1.in, 061-cattwo.2.in]
Host: tgsdsc
Directory: 061-cattwo-20090602-1527-q2au81s2/jobs/g/cat-gvrvuobj
stderr.txt:
stdout.txt:
----
Caused by:
Block task failed:
org.globus.gram.GramException: The job failed when the job manager
attempted to run it
at
org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:531)
at org.globus.gram.GramJob.setStatus(GramJob.java:184)
at org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176)
at java.lang.Thread.run(Thread.java:534)
Cleaning up...
Shutting down service at https://198.202.112.33:45214
Got channel MetaChannel: 25326891 -> GSSSChannel-null(1)
- Done
SWIFT RETURN CODE NON-ZERO - test 061-cattwo
From hategan at mcs.anl.gov Tue Jun 2 17:19:02 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 02 Jun 2009 17:19:02 -0500
Subject: [Swift-devel] coaster test failed on SDSC IA-64 Cluster
In-Reply-To: <4A258EC4.3010307@uchicago.edu>
References: <4A258EC4.3010307@uchicago.edu>
Message-ID: <1243981142.31356.0.camel@localhost>
Gram is failing, so you should try a plain gram job to confirm that it
is indeed the issue.
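A plain GRAM probe, with no coasters involved, could look like the sketch below. It checks both the default jobmanager and the PBS one, since the failing path submits via gt2:gt2:pbs; the contact strings are assumptions based on the SDSC login host, and the commands only actually run where a Globus client and a valid proxy are present:

```shell
#!/bin/sh
# Probe GRAM directly, without coasters: first through the default (fork)
# jobmanager, then through PBS, which is the path gt2:gt2:pbs submits to.
# Contact strings are assumptions; adapt them to the actual site.
host="tg-login1.sdsc.teragrid.org"
for contact in "$host" "$host/jobmanager-pbs"; do
    if command -v globus-job-run >/dev/null 2>&1; then
        globus-job-run "$contact" /usr/bin/id   # needs a valid grid proxy
    else
        echo "would run: globus-job-run $contact /usr/bin/id"
    fi
done
```

If the fork probe succeeds but the PBS one fails, that points at the jobmanager-pbs side rather than GRAM itself.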
On Tue, 2009-06-02 at 15:42 -0500, Zhao Zhang wrote:
> Hi, Mihael
>
> The language behavior test failed on the SDSC cluster. All gram and coaster
> log files can be found on the CI network at:
> /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/sdsc/
>
> I am attaching the stdout and the sites.xml definition of SDSC cluster
> site. Could you help find out where things
> go wrong? Thanks.
>
> best
> zhao
>
> [zzhang at communicado sites]$ cat coaster_new/tgsdsc-pbs-gram2.xml
>
>
>
> jobManager="gt2:gt2:pbs"/>
> TG-DBS080005N
> /users/zzhang/work
> 4
> 2
> 5
> 1
> 2
> false
>
>
>
>
> [zzhang at communicado sites]$ ./run-site coaster_new/tgsdsc-pbs-gram2.xml
> testing site configuration: coaster_new/tgsdsc-pbs-gram2.xml
> Removing files from previous runs
> Running test 061-cattwo at Tue Jun 2 15:27:48 CDT 2009
> Swift svn swift-r2949 cog-r2406
>
> RunID: 20090602-1527-q2au81s2
> Progress:
> Progress: Stage in:1
> Progress: Submitting:1
> Progress: Submitting:1
> Progress: Submitting:1
> Progress: Submitting:1
> Progress: Submitted:1
> Progress: Active:1
> Progress: Failed but can retry:1
> Failed to transfer wrapper log from
> 061-cattwo-20090602-1527-q2au81s2/info/c on tgsdsc
> Progress: Active:1
> Failed to transfer wrapper log from
> 061-cattwo-20090602-1527-q2au81s2/info/e on tgsdsc
> Progress: Stage in:1
> Progress: Active:1
> Failed to transfer wrapper log from
> 061-cattwo-20090602-1527-q2au81s2/info/g on tgsdsc
> Progress: Failed:1
> Execution failed:
> Exception in cat:
> Arguments: [061-cattwo.1.in, 061-cattwo.2.in]
> Host: tgsdsc
> Directory: 061-cattwo-20090602-1527-q2au81s2/jobs/g/cat-gvrvuobj
> stderr.txt:
>
> stdout.txt:
>
> ----
>
> Caused by:
> Block task failed:
> org.globus.gram.GramException: The job failed when the job manager
> attempted to run it
> at
> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:531)
> at org.globus.gram.GramJob.setStatus(GramJob.java:184)
> at org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176)
> at java.lang.Thread.run(Thread.java:534)
>
> Cleaning up...
> Shutting down service at https://198.202.112.33:45214
> Got channel MetaChannel: 25326891 -> GSSSChannel-null(1)
> - Done
> SWIFT RETURN CODE NON-ZERO - test 061-cattwo
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From zhaozhang at uchicago.edu Tue Jun 2 18:13:59 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Tue, 02 Jun 2009 18:13:59 -0500
Subject: [Swift-devel] coaster test failed on SDSC IA-64 Cluster
In-Reply-To: <1243981142.31356.0.camel@localhost>
References: <4A258EC4.3010307@uchicago.edu>
<1243981142.31356.0.camel@localhost>
Message-ID: <4A25B237.2080000@uchicago.edu>
Do you mean a globus-job-run, or a test with a jobmanager other than coaster?
I did a globus-job-run, and it was fine.
[zzhang at communicado coaster_new]$ globus-job-run tg-login1.sdsc.teragrid.org /usr/bin/id
uid=501593(zzhang) gid=5387(anl101) groups=5387(anl101)
zhao
Mihael Hategan wrote:
> Gram is failing, so you should try a plain gram job to confirm that it
> is indeed the issue.
>
> On Tue, 2009-06-02 at 15:42 -0500, Zhao Zhang wrote:
>
>> Hi, Mihael
>>
>> The language behavior test failed on the SDSC cluster. All gram and coaster
>> log files can be found on the CI network at:
>> /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/sdsc/
>>
>> I am attaching the stdout and the sites.xml definition of SDSC cluster
>> site. Could you help find out where things
>> go wrong? Thanks.
>>
>> best
>> zhao
>>
>> [zzhang at communicado sites]$ cat coaster_new/tgsdsc-pbs-gram2.xml
>>
>>
>>
>> > jobManager="gt2:gt2:pbs"/>
>> TG-DBS080005N
>> /users/zzhang/work
>> 4
>> 2
>> 5
>> 1
>> 2
>> false
>>
>>
>>
>>
>> [zzhang at communicado sites]$ ./run-site coaster_new/tgsdsc-pbs-gram2.xml
>> testing site configuration: coaster_new/tgsdsc-pbs-gram2.xml
>> Removing files from previous runs
>> Running test 061-cattwo at Tue Jun 2 15:27:48 CDT 2009
>> Swift svn swift-r2949 cog-r2406
>>
>> RunID: 20090602-1527-q2au81s2
>> Progress:
>> Progress: Stage in:1
>> Progress: Submitting:1
>> Progress: Submitting:1
>> Progress: Submitting:1
>> Progress: Submitting:1
>> Progress: Submitted:1
>> Progress: Active:1
>> Progress: Failed but can retry:1
>> Failed to transfer wrapper log from
>> 061-cattwo-20090602-1527-q2au81s2/info/c on tgsdsc
>> Progress: Active:1
>> Failed to transfer wrapper log from
>> 061-cattwo-20090602-1527-q2au81s2/info/e on tgsdsc
>> Progress: Stage in:1
>> Progress: Active:1
>> Failed to transfer wrapper log from
>> 061-cattwo-20090602-1527-q2au81s2/info/g on tgsdsc
>> Progress: Failed:1
>> Execution failed:
>> Exception in cat:
>> Arguments: [061-cattwo.1.in, 061-cattwo.2.in]
>> Host: tgsdsc
>> Directory: 061-cattwo-20090602-1527-q2au81s2/jobs/g/cat-gvrvuobj
>> stderr.txt:
>>
>> stdout.txt:
>>
>> ----
>>
>> Caused by:
>> Block task failed:
>> org.globus.gram.GramException: The job failed when the job manager
>> attempted to run it
>> at
>> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:531)
>> at org.globus.gram.GramJob.setStatus(GramJob.java:184)
>> at org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176)
>> at java.lang.Thread.run(Thread.java:534)
>>
>> Cleaning up...
>> Shutting down service at https://198.202.112.33:45214
>> Got channel MetaChannel: 25326891 -> GSSSChannel-null(1)
>> - Done
>> SWIFT RETURN CODE NON-ZERO - test 061-cattwo
>>
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>
>
>
>
From wilde at mcs.anl.gov Tue Jun 2 19:18:56 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 02 Jun 2009 19:18:56 -0500
Subject: [Swift-devel] coaster test failed on SDSC IA-64 Cluster
In-Reply-To: <4A25B237.2080000@uchicago.edu>
References: <4A258EC4.3010307@uchicago.edu> <1243981142.31356.0.camel@localhost>
<4A25B237.2080000@uchicago.edu>
Message-ID: <4A25C170.5050606@mcs.anl.gov>
Try doing a globus-job-run to the PBS job manager, specifying your project.
Also try a simple swift job to the PBS job manager, with the same project.
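In sketch form, that check could be (the -x RSL-extension flag is from memory of the GT2 client tools, so verify it against your globus-job-run before relying on it):

```shell
#!/bin/sh
# GRAM probe through the PBS jobmanager with the allocation project from
# the sites file attached as an RSL extension (-x is assumed to be the
# right flag for that; double-check against your globus-job-run).
contact="tg-login1.sdsc.teragrid.org/jobmanager-pbs"
project="TG-DBS080005N"
if command -v globus-job-run >/dev/null 2>&1; then
    globus-job-run "$contact" -x "(project=$project)" /usr/bin/id
else
    echo "would run: globus-job-run $contact -x '(project=$project)' /usr/bin/id"
fi
```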
- Mike
On 6/2/09 6:13 PM, Zhao Zhang wrote:
> Do you mean a globus-job-run, or a test with a jobmanager other than
> coaster?
> I did a globus-job-run, and it was fine.
> [zzhang at communicado coaster_new]$ globus-job-run tg-login1.sdsc.teragrid.org /usr/bin/id
> uid=501593(zzhang) gid=5387(anl101) groups=5387(anl101)
>
> zhao
>
> Mihael Hategan wrote:
>> Gram is failing, so you should try a plain gram job to confirm that it
>> is indeed the issue.
>>
>> On Tue, 2009-06-02 at 15:42 -0500, Zhao Zhang wrote:
>>
>>> Hi, Mihael
>>>
>>> The language behavior test failed on the SDSC cluster. All gram and
>>> coaster log files can be found on the CI network at:
>>> /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/sdsc/
>>>
>>> I am attaching the stdout and the sites.xml definition of SDSC
>>> cluster site. Could you help find out where things
>>> go wrong? Thanks.
>>>
>>> best
>>> zhao
>>>
>>> [zzhang at communicado sites]$ cat coaster_new/tgsdsc-pbs-gram2.xml
>>>
>>>
>>>
>>> >> jobManager="gt2:gt2:pbs"/>
>>> TG-DBS080005N
>>> /users/zzhang/work
>>> 4
>>> 2
>>> 5
>>> 1
>>> 2
>>> >> key="remoteMonitorEnabled">false
>>>
>>>
>>>
>>>
>>> [zzhang at communicado sites]$ ./run-site coaster_new/tgsdsc-pbs-gram2.xml
>>> testing site configuration: coaster_new/tgsdsc-pbs-gram2.xml
>>> Removing files from previous runs
>>> Running test 061-cattwo at Tue Jun 2 15:27:48 CDT 2009
>>> Swift svn swift-r2949 cog-r2406
>>>
>>> RunID: 20090602-1527-q2au81s2
>>> Progress:
>>> Progress: Stage in:1
>>> Progress: Submitting:1
>>> Progress: Submitting:1
>>> Progress: Submitting:1
>>> Progress: Submitting:1
>>> Progress: Submitted:1
>>> Progress: Active:1
>>> Progress: Failed but can retry:1
>>> Failed to transfer wrapper log from
>>> 061-cattwo-20090602-1527-q2au81s2/info/c on tgsdsc
>>> Progress: Active:1
>>> Failed to transfer wrapper log from
>>> 061-cattwo-20090602-1527-q2au81s2/info/e on tgsdsc
>>> Progress: Stage in:1
>>> Progress: Active:1
>>> Failed to transfer wrapper log from
>>> 061-cattwo-20090602-1527-q2au81s2/info/g on tgsdsc
>>> Progress: Failed:1
>>> Execution failed:
>>> Exception in cat:
>>> Arguments: [061-cattwo.1.in, 061-cattwo.2.in]
>>> Host: tgsdsc
>>> Directory: 061-cattwo-20090602-1527-q2au81s2/jobs/g/cat-gvrvuobj
>>> stderr.txt:
>>>
>>> stdout.txt:
>>>
>>> ----
>>>
>>> Caused by:
>>> Block task failed:
>>> org.globus.gram.GramException: The job failed when the job manager
>>> attempted to run it
>>> at
>>> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:531)
>>>
>>> at org.globus.gram.GramJob.setStatus(GramJob.java:184)
>>> at
>>> org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176)
>>> at java.lang.Thread.run(Thread.java:534)
>>>
>>> Cleaning up...
>>> Shutting down service at https://198.202.112.33:45214
>>> Got channel MetaChannel: 25326891 -> GSSSChannel-null(1)
>>> - Done
>>> SWIFT RETURN CODE NON-ZERO - test 061-cattwo
>>>
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>
>>
>>
>>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From benc at hawaga.org.uk Wed Jun 3 10:10:52 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 3 Jun 2009 15:10:52 +0000 (GMT)
Subject: [Swift-devel] small-scale swift pseudopublications
Message-ID:
There's a bunch of stuff floating around that isn't published material but
is still interesting. What mostly brings this to mind is the material to be
presented at the DSL Workshop this year, but I suspect there is other
material around too.
It might be interesting to keep a list of this online on the Swift
website, linking to whatever random material is available.
--
From wilde at mcs.anl.gov Wed Jun 3 16:24:08 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 03 Jun 2009 16:24:08 -0500
Subject: [Swift-devel] [Fwd: [teraport-notify] tp-login2 job submission
failures]
Message-ID: <4A26E9F8.7060609@mcs.anl.gov>
just fyi
-------- Original Message --------
Subject: [teraport-notify] tp-login2 job submission failures
Date: Wed, 3 Jun 2009 14:56:34 -0500
From: Greg Cross
To: teraport-notify at ci.uchicago.edu
Users have reported failures when attempting job submissions to
Teraport's scheduler using the "qsub" command. This problem has been
isolated to the node tp-login2.ci.uchicago.edu. The cause of these
failures stems from recent misconfigurations in the DNS service, which is
out of the CI's scope of control. You will receive notification when
the responsible authority has corrected the misconfigurations.
In the meantime, all users should use tp-login1.ci.uchicago.edu
exclusively for job submission purposes.
_______________________________________________
teraport-notify mailing list
teraport-notify at ci.uchicago.edu
http://mail.ci.uchicago.edu/mailman/listinfo/teraport-notify
From zhaozhang at uchicago.edu Thu Jun 4 14:08:09 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 04 Jun 2009 14:08:09 -0500
Subject: [Swift-devel] Test Cases for Swift Test
Message-ID: <4A281B99.7080901@uchicago.edu>
Hi, Ben
Here is a list of the tests I put in the regular Swift test with coasters
on TeraGrid for now. Soon we are going to test
Swift with Condor-G on OSG, too. Are there any other tests you want to
put in the regular test for Swift: more sites, more applications, or
more Swift features?
best
zhao
1. Sanity Test (Swift language behavior test)
061-cattwo
130-fmri
103-quote.swift
1032-singlequote.swift
1031-quote.swift
1033-singlequote.swift
141-space-in-filename
142-space-and-quotes
2. Data Movement Test
foreach data in {1KB, 1MB, "10MB, 100MB, 1GB"}
foreach i in {1, 10, 100, 1000}
copy data to site
redirect data to output_file
copy output_file back to submit host
done
done
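The loop above could be sketched in shell like this, with plain cp standing in for the real stage-in/stage-out transfer and only the two smallest sizes and counts exercised so it stays cheap; the larger cases follow the same pattern:

```shell
#!/bin/sh
# Sketch of the data-movement matrix: generate a payload of each size,
# "stage" it, redirect it through cat, and copy the result back.
# cp stands in for the real grid transfer mechanism.
set -e
workdir=$(mktemp -d)
for size in 1K 1M; do              # the plan also lists 10M, 100M, 1G
    for i in 1 10; do              # the plan also lists 100, 1000
        src="$workdir/data_${size}_$i"
        dd if=/dev/zero of="$src" bs="$size" count=1 2>/dev/null
        cp "$src" "$workdir/staged_${size}_$i"                 # copy data to site
        cat "$workdir/staged_${size}_$i" > "$workdir/out_${size}_$i"  # redirect
        cp "$workdir/out_${size}_$i" "$workdir/back_${size}_$i"       # copy back
    done
done
echo "round-trips completed: $(ls "$workdir"/back_* | wc -l | tr -d ' ')"
```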
3. Application Test
SCIP - 1000 jobs on Ranger; I am building SCIP on UC TeraGrid and
Abe now and hopefully can add those sites today.
DOCK6 - 1000 jobs on UC TeraGrid.
From aespinosa at cs.uchicago.edu Mon Jun 8 16:24:37 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Mon, 8 Jun 2009 16:24:37 -0500
Subject: [Swift-devel] block coasters not registering on proper queue
Message-ID: <50b07b4b0906081424jeab39a3kdb7b22aec84c1328@mail.gmail.com>
Is there a default maxwalltime being submitted to the LRM if nothing
is specified? I set up this configuration to use the "fast" queue in
sites.xml, but I keep getting placed in the "extended" queue.
sites.xml
fast
/home/aespinosa/work
50
10
20
gram log snippet:
...
...
Mon Jun 8 16:19:21 2009 JM_SCRIPT: New Perl JobManager created.
Mon Jun 8 16:19:21 2009 JM_SCRIPT: Using jm supplied job dir:
/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/18222.1244495960
Mon Jun 8 16:19:21 2009 JM_SCRIPT: Entering pbs submit
Mon Jun 8 16:19:21 2009 JM_SCRIPT: NFS sync for /usr/bin/perl failed
(may be harmless): Operation not permitted
Mon Jun 8 16:19:21 2009 JM_SCRIPT: Sent NFS sync for /usr/bin/perl
Mon Jun 8 16:19:21 2009 JM_SCRIPT: NFS sync for /dev/null failed (may
be harmless): Operation not permitted
Mon Jun 8 16:19:21 2009 JM_SCRIPT: Sent NFS sync for /dev/null
Mon Jun 8 16:19:21 2009 JM_SCRIPT: Determining job max time cpu from
job description
Mon Jun 8 16:19:21 2009 JM_SCRIPT: using queue default
Mon Jun 8 16:19:21 2009 JM_SCRIPT: Determining job max wall time
limit from job description
Mon Jun 8 16:19:21 2009 JM_SCRIPT: using maxwalltime of 60
Mon Jun 8 16:19:21 2009 JM_SCRIPT: Building job script
Mon Jun 8 16:19:21 2009 JM_SCRIPT: Using jm supplied job dir:
/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/18222.1244495960
Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument
"/home/aespinosa/.globus/coasters/cscript6820117662705473060.pl"
Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to
"/home/aespinosa/.globus/coasters/cscript6820117662705473060.pl"
Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument
"http://128.135.125.118:56015"
Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to
"http://128.135.125.118:56015"
Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument "0608-190415-000000"
Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to "0608-190415-000000"
...
...
$ grep fast gram*.log:
gram_job_mgr_18021.log:6/8 16:19:15 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18021.log:6/8 16:19:25 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18021.log:6/8 16:19:36 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:19:22 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:19:32 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:19:43 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:19:53 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:20:03 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:20:14 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:20:24 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:20:34 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:20:45 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:20:55 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:21:05 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:21:16 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:21:26 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:21:36 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:21:47 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:21:57 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:22:08 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:22:18 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:22:28 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:22:39 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:22:49 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:23:00 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:23:10 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:23:20 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
Swift version: Swift svn swift-r2949 cog-r2406
-Allan
From hategan at mcs.anl.gov Mon Jun 8 17:49:25 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 08 Jun 2009 17:49:25 -0500
Subject: [Swift-devel] block coasters not registering on proper queue
In-Reply-To: <50b07b4b0906081424jeab39a3kdb7b22aec84c1328@mail.gmail.com>
References: <50b07b4b0906081424jeab39a3kdb7b22aec84c1328@mail.gmail.com>
Message-ID: <1244501365.5919.3.camel@localhost>
On Mon, 2009-06-08 at 16:24 -0500, Allan Espinosa wrote:
> Is there a default maxwalltime being submitted to the LRM if nothing
> is specified?
The block maxwalltime varies depending on the job maxwalltimes and the
overallocation parameters. So in a sense, yes.
> I set up this configuration to use the "fast" queue in
> sites.xml, but I keep getting placed in the "extended" queue.
My gut feeling tells me that the LRM would not change the queue if the
walltime didn't fit, but would instead complain that the maxwalltime is
larger than what the queue accepts. So it looks more like the queue
parameter doesn't get passed to the LRM properly.
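For reference, the queue would normally be specified as a globus-namespace profile entry inside the pool; a minimal sketch (the handle name is invented, and only elements visible in the posted configuration are included):

```xml
<pool handle="tguc">
  <execution provider="coaster" url="tp-grid1.ci.uchicago.edu"
             jobmanager="gt2:gt2:pbs" />
  <profile namespace="globus" key="queue">fast</profile>
  <workdirectory>/home/aespinosa/work</workdirectory>
</pool>
```

If the profile looks like this and the gram log still reports "using queue default", then the coaster block submission is dropping the queue attribute on its way to GRAM.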
>
> sites.xml
>
>
>
> jobmanager="gt2:gt2:pbs" />
> fast
> /home/aespinosa/work
> 50
> 10
> 20
>
>
>
> gram log snippet:
> ...
> ...
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: New Perl JobManager created.
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Using jm supplied job dir:
> /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/18222.1244495960
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Entering pbs submit
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: NFS sync for /usr/bin/perl failed
> (may be harmless): Operation not permitted
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Sent NFS sync for /usr/bin/perl
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: NFS sync for /dev/null failed (may
> be harmless): Operation not permitted
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Sent NFS sync for /dev/null
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Determining job max time cpu from
> job description
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: using queue default
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Determining job max wall time
> limit from job description
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: using maxwalltime of 60
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Building job script
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Using jm supplied job dir:
> /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/18222.1244495960
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument
> "/home/aespinosa/.globus/coasters/cscript6820117662705473060.pl"
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to
> "/home/aespinosa/.globus/coasters/cscript6820117662705473060.pl"
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument
> "http://128.135.125.118:56015"
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to
> "http://128.135.125.118:56015"
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument "0608-190415-000000"
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to "0608-190415-000000"
> ...
> ...
>
> $grep fast gram*.log:
> gram_job_mgr_18021.log:6/8 16:19:15 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18021.log:6/8 16:19:25 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18021.log:6/8 16:19:36 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:19:22 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:19:32 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:19:43 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:19:53 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:20:03 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:20:14 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:20:24 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:20:34 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:20:45 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:20:55 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:21:05 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:21:16 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:21:26 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:21:36 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:21:47 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:21:57 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:22:08 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:22:18 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:22:28 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:22:39 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:22:49 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:23:00 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:23:10 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:23:20 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
>
>
> Swift version: Swift svn swift-r2949 cog-r2406
>
> -Allan
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From benc at hawaga.org.uk Tue Jun 9 03:37:13 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 9 Jun 2009 08:37:13 +0000 (GMT)
Subject: [Swift-devel] block coasters not registering on proper queue
In-Reply-To: <1244501365.5919.3.camel@localhost>
References: <50b07b4b0906081424jeab39a3kdb7b22aec84c1328@mail.gmail.com>
<1244501365.5919.3.camel@localhost>
Message-ID:
> My gut feeling tells me that the LRM would not change the queue if the
> walltime didn't fit, but would instead complain that the maxwalltime is
> larger than what the queue accepts.
That's also my understanding of how Teraport behaves.
--
From hockyg at uchicago.edu Tue Jun 9 14:33:29 2009
From: hockyg at uchicago.edu (Glen Hocky)
Date: Tue, 09 Jun 2009 14:33:29 -0500
Subject: [Swift-devel] Error conditional on name of sites.file
Message-ID: <4A2EB909.9020106@uchicago.edu>
Hi everyone,
When I use this file:
>
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.5">
>
>
>
> /home/hockyg/swiftwork
>
>
>
I get this error
> swift sites.local -prot=T1af7 -indir=/home/hockyg/swoops/test_swift/input
> -nsims=1 -kill=1
> Execution failed:
> Could not load file sites.local:
> org.globus.cog.karajan.translator.TranslationException:
> org.globus.cog.karajan.parser.ParsingException: Line 8:
>
> Expected '[' or '(' or DIGITS() or '+' or '-' or '"' or IDENTIFIER but
> got '/'
When I rename it to something else
> swift sites.local.xml -prot=T1af7
> -indir=/home/hockyg/swoops/test_swift/input -nsims=1 -kill=1
> Swift svn swift-r2953 cog-r2406
>
> RunID: 20090609-1430-851tnjye
> Progress:
> Progress: Active:1
> Progress: Active:1
> Progress: Checking status:1
> Final status: Finished successfully:1
From wilde at mcs.anl.gov Tue Jun 9 14:56:34 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 09 Jun 2009 14:56:34 -0500
Subject: [Swift-devel] "multiple writers-in-iterate" problem
Message-ID: <4A2EBE72.1020105@mcs.anl.gov>
Ben, can you fix this in the next release?
-------- Original Message --------
Subject: Re: [Swift-devel] continued questions on iterate
Date: Sun, 01 Mar 2009 23:39:42 -0600
From: Michael Wilde
To: swift-devel
References: <49AB6653.6010306 at mcs.anl.gov>
I'm able to work around this by moving the s[0] assignment inside the
iterate block, in an if(i==0) {} else {} construct.
Still, it seems the restriction is not intended.
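In sketch form the workaround looks roughly like this (untested here; the point is that every write to s now happens inside the iterate body, so Swift sees a single writer):

```
string s[];
iterate i {
    if (i == 0) {
        s[0] = "hi ";
    } else {
        s[i] = @strcat(s[i-1], "hi ");
    }
    trace(s[i]);
} until (i == 5);
```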
- Mike
On 3/1/09 10:53 PM, Michael Wilde wrote:
> This program:
>
> string s[];
> s[0]="hi ";
> iterate i {
> s[i+1] = @strcat(s[i],"hi ");
> trace(s[i]);
> } until(i==5);
>
> Gives:
>
> com$ swift it4.swift
> Could not start execution.
> variable s has multiple writers.
>
> --
> Its similar to the tutorial example:
>
> counterfile a[] ;
>
> a[0] = echo("793578934574893");
>
> iterate v {
> a[v+1] = countstep(a[v]);
> print("extract int value ", at extractint(a[v+1]));
> } until (@extractint(a[v+1]) <= 1);
>
> --
>
> ...which I reported earlier as having problems (I think in addition to
> the one above?)
>
> This is using the latest swift, rev 2631, and latest cog.
>
> I thought I had issues like this licked, but then updated the code to
> get closer to what the user needs.
>
> In this example, I don't see any violation of single-assignment, but
> apparently swift does.
>
> The full example that the test case above is for is at:
> www.ci.uchicago.edu/~wilde/oops8.swift, which encounters the same
> multiple-writer problem.
>
> I start with an initial "secondary structure" string of all A's, same
> length as the protein sequence. After each folding round, a new
> structure is derived for analysis and used as the starting point for the
> next round. This has the same data access pattern as array s[] above:
>
> foreach p, pn in protein {
> OOPSOut result[][] ;
> SecSeq secseq[] prefix=@strcat("seqseq/",p,"/"),suffix=".secseq">;
> OOPSIn oopsin ;
> secseq[0] = sedfasta(oopsin.fasta, ["-e","s/./A/g"]);
> boolean converged[];
> iterate i {
> SecSeq s;
> result[i] = doRound(p,oopsin,secseq[i],i);
> (converged[i],s) = analyzeResult(result[i], p, i, secseq[i]);
> secseq[i+1] = s;
> } until (converged[i] || (i==3));
> }
>
> In this case, I get the same message for array secseq (varable has
> multiple writers).
>
> I
>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
_______________________________________________
Swift-devel mailing list
Swift-devel at ci.uchicago.edu
http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From wilde at mcs.anl.gov Tue Jun 9 15:00:09 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 09 Jun 2009 15:00:09 -0500
Subject: [Swift-devel] Error conditional on name of sites.file
In-Reply-To: <4A2EB909.9020106@uchicago.edu>
References: <4A2EB909.9020106@uchicago.edu>
Message-ID: <4A2EBF49.1070706@mcs.anl.gov>
Looks like the sites file needs to end in .xml?
On 6/9/09 2:33 PM, Glen Hocky wrote:
> Hi everyone,
> When i use this file:
>
>>
>> > xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.5">
>>
>>
>>
>> /home/hockyg/swiftwork
>>
>>
>>
>
> I get this error
>> swift sites.local -prot=T1af7 -indir=/home/hockyg/swoops/test_swift/input
>> -nsims=1 -kill=1
>> Execution failed:
>> Could not load file sites.local:
>> org.globus.cog.karajan.translator.TranslationException:
>> org.globus.cog.karajan.parser.ParsingException: Line 8:
>>
>> Expected '[' or '(' or DIGITS() or '+' or '-' or '"' or IDENTIFIER but
>> got '/'
> When I rename it to something else
>
>> swift sites.local.xml -prot=T1af7
>> -indir=/home/hockyg/swoops/test_swift/input -nsims=1 -kill=1
>> Swift svn swift-r2953 cog-r2406
>>
>> RunID: 20090609-1430-851tnjye
>> Progress:
>> Progress: Active:1
>> Progress: Active:1
>> Progress: Checking status:1
>> Final status: Finished successfully:1
>
From aespinosa at cs.uchicago.edu Tue Jun 9 15:10:13 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Tue, 9 Jun 2009 15:10:13 -0500
Subject: [Swift-devel] active jobs vs available processors on submitted
coaster queues
Message-ID: <50b07b4b0906091310j502935b2jc144e126aeb4edf8@mail.gmail.com>
I was expecting to have 2 active jobs at a time from the swift log but
instead got only one at a time:
Swift svn swift-r2949 cog-r2406
RunID: out.run_000
Progress:
Progress: Selecting site:4 Initializing site shared directory:1 Stage in:1
Progress: Stage in:6
Progress: Stage in:6
Progress: Stage in:6
Progress: Stage in:6
Progress: Stage in:6
Progress: Stage in:6
Progress: Stage in:5 Submitting:1
Progress: Submitting:5 Submitted:1
Progress: Submitted:6
Progress: Submitted:5 Active:1
Progress: Submitted:5 Active:1
Progress: Submitted:5 Active:1
Progress: Submitted:5 Active:1
Progress: Submitted:5 Active:1
Progress: Submitted:5 Checking status:1
Progress: Submitted:4 Active:1 Finished successfully:1
Progress: Submitted:4 Active:1 Finished successfully:1
Progress: Submitted:4 Active:1 Finished successfully:1
Progress: Submitted:4 Active:1 Finished successfully:1
Progress: Submitted:4 Active:1 Finished successfully:1
Progress: Submitted:4 Active:1 Finished successfully:1
Progress: Submitted:4 Checking status:1 Finished successfully:1
Progress: Submitted:3 Active:1 Finished successfully:2
Progress: Submitted:3 Active:1 Finished successfully:2
Progress: Submitted:3 Active:1 Finished successfully:2
Progress: Submitted:3 Checking status:1 Finished successfully:2
Progress: Submitted:2 Active:1 Finished successfully:3
...
...
uc-teragrid queue status: $showq -u $USER
[aespinosa at tg-login1 ~]$ showq -u $USER
active jobs------------------------
JOBID USERNAME STATE PROCS REMAINING STARTTIME
2015982 aespinos Running 2 00:55:41 Tue Jun 9 15:02:18
1 active job 2 of 116 processors in use by local jobs (1.72%)
42 of 58 nodes active (72.41%)
eligible jobs----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
0 eligible jobs
blocked jobs-----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
0 blocked jobs
Total job: 1
sites.xml:
/home/aespinosa/blast-runs
5
1.26
ia64-compute
4
16
--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago
From aespinosa at cs.uchicago.edu Tue Jun 9 15:31:06 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Tue, 9 Jun 2009 15:31:06 -0500
Subject: [Swift-devel] block coasters not registering on proper queue
In-Reply-To: <1244501365.5919.3.camel@localhost>
References: <50b07b4b0906081424jeab39a3kdb7b22aec84c1328@mail.gmail.com>
<1244501365.5919.3.camel@localhost>
Message-ID: <50b07b4b0906091331y598949cfi353155ae646b7826@mail.gmail.com>
OK, so I guess I should file this on Bugzilla...
2009/6/8 Mihael Hategan :
> On Mon, 2009-06-08 at 16:24 -0500, Allan Espinosa wrote:
>> Is there a default maxwalltime being submitted to the LRM if nothing
>> is specified?
>
> The block maxwalltime varies depending on the job maxwalltimes and the
> overallocation parameters. So in a sense, yes.
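The idea Mihael describes can be sketched roughly as follows. This is a simplified illustration only, not the actual coaster allocator code; the 1.26 overallocation factor is borrowed from the sites.xml examples elsewhere in this thread, and the formula itself is an assumption:

```python
# Simplified sketch: a coaster block's walltime is derived from the
# walltimes of the jobs it will host, padded by an overallocation
# factor. Illustration of the concept only -- the real coaster
# allocator uses a more elaborate scheme.

def block_walltime(job_walltimes, overallocation=1.26):
    """Return a walltime for a block that must be able to run the
    longest queued job, with headroom for packing in more jobs."""
    if not job_walltimes:
        raise ValueError("no jobs to allocate a block for")
    return max(job_walltimes) * overallocation

# Even with no explicit maxwalltime in sites.xml, each job carries a
# default walltime, so the block submitted to the LRM always has one.
```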
>
>> I made this configuration to use the "fast" queue in
>> sites.xml but I keep getting placed in the "extended" queue.
>
> My gut feeling tells me that the LRM would not change the queue if the
> walltime didn't fit, but would instead complain that the maxwalltime is
> larger than what the queue accepts. So it looks more like the queue
> parameter doesn't get passed to the LRM properly.
>
>>
>> sites.xml
>>
>>
>>
>>     > jobmanager="gt2:gt2:pbs" />
>>     fast
>>     /home/aespinosa/work
>>     50
>>     10
>>     20
>>
>>
>>
>> gram log snippet:
>> ...
>> ...
>> Mon Jun  8 16:19:21 2009 JM_SCRIPT: New Perl JobManager created.
>> Mon Jun  8 16:19:21 2009 JM_SCRIPT: Using jm supplied job dir:
>> /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/18222.1244495960
>> Mon Jun  8 16:19:21 2009 JM_SCRIPT: Entering pbs submit
>> Mon Jun  8 16:19:21 2009 JM_SCRIPT: NFS sync for /usr/bin/perl failed
>> (may be harmless): Operation not permitted
>> Mon Jun  8 16:19:21 2009 JM_SCRIPT: Sent NFS sync for /usr/bin/perl
>> Mon Jun  8 16:19:21 2009 JM_SCRIPT: NFS sync for /dev/null failed (may
>> be harmless): Operation not permitted
>> Mon Jun  8 16:19:21 2009 JM_SCRIPT: Sent NFS sync for /dev/null
>> Mon Jun  8 16:19:21 2009 JM_SCRIPT: Determining job max time cpu from
>> job description
>> Mon Jun  8 16:19:21 2009 JM_SCRIPT:    using queue default
>> Mon Jun  8 16:19:21 2009 JM_SCRIPT: Determining job max wall time
>> limit from job description
>> Mon Jun  8 16:19:21 2009 JM_SCRIPT:    using maxwalltime of 60
>> Mon Jun  8 16:19:21 2009 JM_SCRIPT: Building job script
>> Mon Jun  8 16:19:21 2009 JM_SCRIPT: Using jm supplied job dir:
>> /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/18222.1244495960
>> Mon Jun  8 16:19:21 2009 JM_SCRIPT: Transforming argument
>> "/home/aespinosa/.globus/coasters/cscript6820117662705473060.pl"
>> Mon Jun  8 16:19:21 2009 JM_SCRIPT: Transformed to
>> "/home/aespinosa/.globus/coasters/cscript6820117662705473060.pl"
>> Mon Jun  8 16:19:21 2009 JM_SCRIPT: Transforming argument
>> "http://128.135.125.118:56015"
>> Mon Jun  8 16:19:21 2009 JM_SCRIPT: Transformed to
>> "http://128.135.125.118:56015"
>> Mon Jun  8 16:19:21 2009 JM_SCRIPT: Transforming argument "0608-190415-000000"
>> Mon Jun  8 16:19:21 2009 JM_SCRIPT: Transformed to "0608-190415-000000"
>> ...
>> ...
>>
>> $grep fast gram*.log:
>> gram_job_mgr_18021.log:6/8 16:19:15 JMI: poll_fast: returning -1 =
>> GLOBUS_FAILURE (try Perl scripts)
>> gram_job_mgr_18021.log:6/8 16:19:25 JMI: poll_fast: returning -1 =
>> GLOBUS_FAILURE (try Perl scripts)
>> gram_job_mgr_18021.log:6/8 16:19:36 JMI: poll_fast: returning -1 =
>> GLOBUS_FAILURE (try Perl scripts)
>> gram_job_mgr_18222.log:6/8 16:19:22 JMI: poll_fast: returning -1 =
>> GLOBUS_FAILURE (try Perl scripts)
>> gram_job_mgr_18222.log:6/8 16:19:32 JMI: poll_fast: returning -1 =
>> GLOBUS_FAILURE (try Perl scripts)
>> gram_job_mgr_18222.log:6/8 16:19:43 JMI: poll_fast: returning -1 =
>> GLOBUS_FAILURE (try Perl scripts)
>> gram_job_mgr_18222.log:6/8 16:19:53 JMI: poll_fast: returning -1 =
>> GLOBUS_FAILURE (try Perl scripts)
>> gram_job_mgr_18222.log:6/8 16:20:03 JMI: poll_fast: returning -1 =
>> GLOBUS_FAILURE (try Perl scripts)
>> gram_job_mgr_18222.log:6/8 16:20:14 JMI: poll_fast: returning -1 =
>> GLOBUS_FAILURE (try Perl scripts)
>> gram_job_mgr_18222.log:6/8 16:20:24 JMI: poll_fast: returning -1 =
>> GLOBUS_FAILURE (try Perl scripts)
>> gram_job_mgr_18222.log:6/8 16:20:34 JMI: poll_fast: returning -1 =
>> GLOBUS_FAILURE (try Perl scripts)
>> gram_job_mgr_18222.log:6/8 16:20:45 JMI: poll_fast: returning -1 =
>> GLOBUS_FAILURE (try Perl scripts)
>> gram_job_mgr_18222.log:6/8 16:20:55 JMI: poll_fast: returning -1 =
>> GLOBUS_FAILURE (try Perl scripts)
>> gram_job_mgr_18222.log:6/8 16:21:05 JMI: poll_fast: returning -1 =
>> GLOBUS_FAILURE (try Perl scripts)
>> gram_job_mgr_18222.log:6/8 16:21:16 JMI: poll_fast: returning -1 =
>> GLOBUS_FAILURE (try Perl scripts)
>> gram_job_mgr_18222.log:6/8 16:21:26 JMI: poll_fast: returning -1 =
>> GLOBUS_FAILURE (try Perl scripts)
>> gram_job_mgr_18222.log:6/8 16:21:36 JMI: poll_fast: returning -1 =
>> GLOBUS_FAILURE (try Perl scripts)
>> gram_job_mgr_18222.log:6/8 16:21:47 JMI: poll_fast: returning -1 =
>> GLOBUS_FAILURE (try Perl scripts)
>> gram_job_mgr_18222.log:6/8 16:21:57 JMI: poll_fast: returning -1 =
>> GLOBUS_FAILURE (try Perl scripts)
>> gram_job_mgr_18222.log:6/8 16:22:08 JMI: poll_fast: returning -1 =
>> GLOBUS_FAILURE (try Perl scripts)
>> gram_job_mgr_18222.log:6/8 16:22:18 JMI: poll_fast: returning -1 =
>> GLOBUS_FAILURE (try Perl scripts)
>> gram_job_mgr_18222.log:6/8 16:22:28 JMI: poll_fast: returning -1 =
>> GLOBUS_FAILURE (try Perl scripts)
>> gram_job_mgr_18222.log:6/8 16:22:39 JMI: poll_fast: returning -1 =
>> GLOBUS_FAILURE (try Perl scripts)
>> gram_job_mgr_18222.log:6/8 16:22:49 JMI: poll_fast: returning -1 =
>> GLOBUS_FAILURE (try Perl scripts)
>> gram_job_mgr_18222.log:6/8 16:23:00 JMI: poll_fast: returning -1 =
>> GLOBUS_FAILURE (try Perl scripts)
>> gram_job_mgr_18222.log:6/8 16:23:10 JMI: poll_fast: returning -1 =
>> GLOBUS_FAILURE (try Perl scripts)
>> gram_job_mgr_18222.log:6/8 16:23:20 JMI: poll_fast: returning -1 =
>> GLOBUS_FAILURE (try Perl scripts)
>>
>>
>> Swift version: Swift svn swift-r2949 cog-r2406
>>
>> -Allan
--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago
From hockyg at uchicago.edu Tue Jun 9 15:34:43 2009
From: hockyg at uchicago.edu (Glen Hocky)
Date: Tue, 09 Jun 2009 15:34:43 -0500
Subject: [Swift-devel] Error conditional on name of sites.file
In-Reply-To: <4A2EBF49.1070706@mcs.anl.gov>
References: <4A2EB909.9020106@uchicago.edu> <4A2EBF49.1070706@mcs.anl.gov>
Message-ID: <4A2EC763.9030303@uchicago.edu>
To clarify, I don't mind if the filename must end in .xml or whatever; I
just wish the error message had told me that, rather than leaving me to
figure it out by trial and error.
Michael Wilde wrote:
> Looks like the sites file needs to end in .xml?
>
> On 6/9/09 2:33 PM, Glen Hocky wrote:
>> Hi everyone,
>> When I use this file:
>>
>>>
>>> >> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.5">
>>>
>>>
>>>
>>> /home/hockyg/swiftwork
>>>
>>>
>>>
>>
>> I get this error
>>> swift >> sites.local -prot=T1af7 -indir=/home/hockyg/swoops/test_swift/input
>>> -nsims=1 -kill=1
>>> Execution failed:
>>> Could not load file sites.local:
>>> org.globus.cog.karajan.translator.TranslationException:
>>> org.globus.cog.karajan.parser.ParsingException: Line 8:
>>>
>>> Expected '[' or '(' or DIGITS() or '+' or '-' or '"' or IDENTIFIER
>>> but got '/'
>> When I rename it to something else
>>
>>> swift >> sites.local.xml -prot=T1af7
>>> -indir=/home/hockyg/swoops/test_swift/input -nsims=1 -kill=1
>>> Swift svn swift-r2953 cog-r2406
>>>
>>> RunID: 20090609-1430-851tnjye
>>> Progress:
>>> Progress: Active:1
>>> Progress: Active:1
>>> Progress: Checking status:1
>>> Final status: Finished successfully:1
>>
From bugzilla-daemon at mcs.anl.gov Tue Jun 9 15:34:25 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue, 9 Jun 2009 15:34:25 -0500 (CDT)
Subject: [Swift-devel] [Bug 211] New: block coasters not registering on
proper queue
Message-ID:
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=211
Summary: block coasters not registering on proper queue
Product: Swift
Version: unspecified
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Specific site issues
AssignedTo: hategan at mcs.anl.gov
ReportedBy: aespinosa at cs.uchicago.edu
Is there a default maxwalltime being submitted to the LRM if nothing
is specified? I made this configuration to use the "fast" queue in
sites.xml but I keep getting placed in the "extended" queue.
sites.xml
fast
/home/aespinosa/work
50
10
20
gram log snippet:
...
...
Mon Jun 8 16:19:21 2009 JM_SCRIPT: New Perl JobManager created.
Mon Jun 8 16:19:21 2009 JM_SCRIPT: Using jm supplied job dir:
/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/18222.1244495960
Mon Jun 8 16:19:21 2009 JM_SCRIPT: Entering pbs submit
Mon Jun 8 16:19:21 2009 JM_SCRIPT: NFS sync for /usr/bin/perl failed
(may be harmless): Operation not permitted
Mon Jun 8 16:19:21 2009 JM_SCRIPT: Sent NFS sync for /usr/bin/perl
Mon Jun 8 16:19:21 2009 JM_SCRIPT: NFS sync for /dev/null failed (may
be harmless): Operation not permitted
Mon Jun 8 16:19:21 2009 JM_SCRIPT: Sent NFS sync for /dev/null
Mon Jun 8 16:19:21 2009 JM_SCRIPT: Determining job max time cpu from
job description
Mon Jun 8 16:19:21 2009 JM_SCRIPT: using queue default
Mon Jun 8 16:19:21 2009 JM_SCRIPT: Determining job max wall time
limit from job description
Mon Jun 8 16:19:21 2009 JM_SCRIPT: using maxwalltime of 60
Mon Jun 8 16:19:21 2009 JM_SCRIPT: Building job script
Mon Jun 8 16:19:21 2009 JM_SCRIPT: Using jm supplied job dir:
/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/18222.1244495960
Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument
"/home/aespinosa/.globus/coasters/cscript6820117662705473060.pl"
Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to
"/home/aespinosa/.globus/coasters/cscript6820117662705473060.pl"
Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument
"http://128.135.125.118:56015"
Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to
"http://128.135.125.118:56015"
Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument "0608-190415-000000"
Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to "0608-190415-000000"
...
...
$grep fast gram*.log:
gram_job_mgr_18021.log:6/8 16:19:15 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18021.log:6/8 16:19:25 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18021.log:6/8 16:19:36 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:19:22 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:19:32 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:19:43 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:19:53 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:20:03 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:20:14 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:20:24 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:20:34 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:20:45 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:20:55 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:21:05 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:21:16 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:21:26 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:21:36 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:21:47 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:21:57 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:22:08 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:22:18 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:22:28 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:22:39 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:22:49 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:23:00 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:23:10 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
gram_job_mgr_18222.log:6/8 16:23:20 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts)
Swift version: Swift svn swift-r2949 cog-r2406
--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
From wilde at mcs.anl.gov Tue Jun 9 16:09:48 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 09 Jun 2009 16:09:48 -0500
Subject: [Swift-devel] Error conditional on name of sites.file
In-Reply-To: <4A2EC763.9030303@uchicago.edu>
References: <4A2EB909.9020106@uchicago.edu> <4A2EBF49.1070706@mcs.anl.gov>
<4A2EC763.9030303@uchicago.edu>
Message-ID: <4A2ECF9C.9030902@mcs.anl.gov>
Indeed. We should file it as a bug.
On 6/9/09 3:34 PM, Glen Hocky wrote:
> To clarify, I don't mind if the filename must end in .xml or whatever; I
> just wish the error message had told me that, rather than leaving me to
> figure it out by trial and error.
>
> Michael Wilde wrote:
>> Looks like the sites file needs to end in .xml?
>>
>> On 6/9/09 2:33 PM, Glen Hocky wrote:
>>> Hi everyone,
>>> When I use this file:
>>>
>>>>
>>>> >>> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.5">
>>>>
>>>>
>>>>
>>>> /home/hockyg/swiftwork
>>>>
>>>>
>>>>
>>>
>>> I get this error
>>>> swift >>> sites.local -prot=T1af7 -indir=/home/hockyg/swoops/test_swift/input
>>>> -nsims=1 -kill=1
>>>> Execution failed:
>>>> Could not load file sites.local:
>>>> org.globus.cog.karajan.translator.TranslationException:
>>>> org.globus.cog.karajan.parser.ParsingException: Line 8:
>>>>
>>>> Expected '[' or '(' or DIGITS() or '+' or '-' or '"' or IDENTIFIER
>>>> but got '/'
>>> When I rename it to something else
>>>
>>>> swift >>> sites.local.xml -prot=T1af7
>>>> -indir=/home/hockyg/swoops/test_swift/input -nsims=1 -kill=1
>>>> Swift svn swift-r2953 cog-r2406
>>>>
>>>> RunID: 20090609-1430-851tnjye
>>>> Progress:
>>>> Progress: Active:1
>>>> Progress: Active:1
>>>> Progress: Checking status:1
>>>> Final status: Finished successfully:1
>>>
From hategan at mcs.anl.gov Wed Jun 10 02:26:51 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 10 Jun 2009 02:26:51 -0500
Subject: [Swift-devel] Error conditional on name of sites.file
In-Reply-To: <4A2EBF49.1070706@mcs.anl.gov>
References: <4A2EB909.9020106@uchicago.edu> <4A2EBF49.1070706@mcs.anl.gov>
Message-ID: <1244618811.16077.0.camel@localhost>
On Tue, 2009-06-09 at 15:00 -0500, Michael Wilde wrote:
> Looks like the sites file needs to end in .xml?
Yes.
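For context, the failure mode suggests Swift dispatches on the file extension and hands anything that is not *.xml to the Karajan script parser, which then dies with the unhelpful parse error Glen saw. A friendlier loader might validate the name up front. A hypothetical sketch (this is not the actual Swift loader code):

```python
import os

def check_sites_file(path):
    # Hypothetical pre-check: reject a sites file whose name does not
    # end in .xml, instead of letting it fall through to the Karajan
    # parser and fail with "Expected '[' or '(' ... but got '/'".
    _root, ext = os.path.splitext(path)
    if ext.lower() != ".xml":
        raise ValueError(
            "sites file '%s' must have a .xml extension (got '%s')"
            % (path, ext or "none"))
    return path
```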
>
> On 6/9/09 2:33 PM, Glen Hocky wrote:
> > Hi everyone,
> > When I use this file:
> >
> >>
> >> >> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.5">
> >>
> >>
> >>
> >> /home/hockyg/swiftwork
> >>
> >>
> >>
> >
> > I get this error
> >> swift >> sites.local -prot=T1af7 -indir=/home/hockyg/swoops/test_swift/input
> >> -nsims=1 -kill=1
> >> Execution failed:
> >> Could not load file sites.local:
> >> org.globus.cog.karajan.translator.TranslationException:
> >> org.globus.cog.karajan.parser.ParsingException: Line 8:
> >>
> >> Expected '[' or '(' or DIGITS() or '+' or '-' or '"' or IDENTIFIER but
> >> got '/'
> > When I rename it to something else
> >
> >> swift >> sites.local.xml -prot=T1af7
> >> -indir=/home/hockyg/swoops/test_swift/input -nsims=1 -kill=1
> >> Swift svn swift-r2953 cog-r2406
> >>
> >> RunID: 20090609-1430-851tnjye
> >> Progress:
> >> Progress: Active:1
> >> Progress: Active:1
> >> Progress: Checking status:1
> >> Final status: Finished successfully:1
> >
From hategan at mcs.anl.gov Wed Jun 10 02:29:16 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 10 Jun 2009 02:29:16 -0500
Subject: [Swift-devel] active jobs vs available processors on submitted
coaster queues
In-Reply-To: <50b07b4b0906091310j502935b2jc144e126aeb4edf8@mail.gmail.com>
References: <50b07b4b0906091310j502935b2jc144e126aeb4edf8@mail.gmail.com>
Message-ID: <1244618956.16077.2.camel@localhost>
I need to look at the coaster log.
On Tue, 2009-06-09 at 15:10 -0500, Allan Espinosa wrote:
> I was expecting to have 2 active jobs at a time from the swift log but
> instead got only one at a time:
> Swift svn swift-r2949 cog-r2406
>
> RunID: out.run_000
> Progress:
> Progress: Selecting site:4 Initializing site shared directory:1 Stage in:1
> Progress: Stage in:6
> Progress: Stage in:6
>
>
>
> Progress: Stage in:6
> Progress: Stage in:6
> Progress: Stage in:6
> Progress: Stage in:6
> Progress: Stage in:5 Submitting:1
> Progress: Submitting:5 Submitted:1
> Progress: Submitted:6
> Progress: Submitted:5 Active:1
> Progress: Submitted:5 Active:1
> Progress: Submitted:5 Active:1
> Progress: Submitted:5 Active:1
> Progress: Submitted:5 Active:1
> Progress: Submitted:5 Checking status:1
> Progress: Submitted:4 Active:1 Finished successfully:1
> Progress: Submitted:4 Active:1 Finished successfully:1
> Progress: Submitted:4 Active:1 Finished successfully:1
> Progress: Submitted:4 Active:1 Finished successfully:1
> Progress: Submitted:4 Active:1 Finished successfully:1
> Progress: Submitted:4 Active:1 Finished successfully:1
> Progress: Submitted:4 Checking status:1 Finished successfully:1
> Progress: Submitted:3 Active:1 Finished successfully:2
> Progress: Submitted:3 Active:1 Finished successfully:2
> Progress: Submitted:3 Active:1 Finished successfully:2
> Progress: Submitted:3 Checking status:1 Finished successfully:2
> Progress: Submitted:2 Active:1 Finished successfully:3
> ...
> ...
>
>
> uc-teragrid queue status: $showq -u $USER
> [aespinosa at tg-login1 ~]$ showq -u $USER
>
> active jobs------------------------
> JOBID USERNAME STATE PROCS REMAINING STARTTIME
>
> 2015982 aespinos Running 2 00:55:41 Tue Jun 9 15:02:18
>
> 1 active job 2 of 116 processors in use by local jobs (1.72%)
> 42 of 58 nodes active (72.41%)
>
> eligible jobs----------------------
> JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
>
>
> 0 eligible jobs
>
> blocked jobs-----------------------
> JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
>
>
> 0 blocked jobs
>
> Total job: 1
>
>
> sites.xml:
>
>
> url="tg-grid.uc.teragrid.org" jobmanager="gt2:gt2:pbs" />
>
> /home/aespinosa/blast-runs
>
> 5
> 1.26
>
> key="host_types">ia64-compute
> 4
> 16
>
>
>
>
From benc at hawaga.org.uk Wed Jun 10 06:10:02 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 10 Jun 2009 11:10:02 +0000 (GMT)
Subject: [Swift-devel] someone to own swift NMI build and test
Message-ID:
The Swift NMI daily and per-commit builds run in my user account in the
NMI build and test system.
I'm going to turn those off before 17th of July 2009.
Who wants to own them now?
--
From wilde at mcs.anl.gov Wed Jun 10 06:40:53 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 10 Jun 2009 06:40:53 -0500
Subject: [Swift-devel] someone to own swift NMI build and test
In-Reply-To:
References:
Message-ID: <4A2F9BC5.4010405@mcs.anl.gov>
What's involved? Does someone need to establish a Metronome login?
Is there an automated way to push tests from Swift svn to Metronome?
Do errors get emailed, or does one have to check the logs via the web?
On 6/10/09 6:10 AM, Ben Clifford wrote:
> The Swift NMI daily and per-commit builds run in my user account in the
> NMI build and test system.
>
> I'm going to turn those off before 17th of July 2009.
>
> Who wants to own them now?
>
From benc at hawaga.org.uk Wed Jun 10 07:04:08 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 10 Jun 2009 12:04:08 +0000 (GMT)
Subject: [Swift-devel] someone to own swift NMI build and test
In-Reply-To: <4A2F9BC5.4010405@mcs.anl.gov>
References:
<4A2F9BC5.4010405@mcs.anl.gov>
Message-ID:
On Wed, 10 Jun 2009, Michael Wilde wrote:
> What's involved? Does someone need to establish a Metronome login?
yes - contact nmi-support at ci.uchicago.edu
> Is there an automated way to push tests from swift svn to metronome?
The tests are checked out every time they run.
> Do errors get emailed or does one have to check the logs via the web?
they get emailed to me when they finish.
--
From benc at hawaga.org.uk Wed Jun 10 07:11:39 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 10 Jun 2009 12:11:39 +0000 (GMT)
Subject: [Swift-devel] someone to own swift NMI build and test
In-Reply-To:
References:
<4A2F9BC5.4010405@mcs.anl.gov>
Message-ID:
> yes - contact nmi-support at ci.uchicago.edu
lies!
nmi-support at cs.wisc.edu
--
From wilde at mcs.anl.gov Wed Jun 10 09:46:24 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 10 Jun 2009 09:46:24 -0500
Subject: [Swift-devel] Application engagements
Message-ID: <4A2FC740.9030901@mcs.anl.gov>
Here's an update on 7 application engagements we have going at the
moment. This doesn't include the ongoing CNARI work Sarah is supporting.
The exact things to do next for each varies based on where each user is
and how well they are making progress:
1. scip - Chris Henry
Zhao turned over prototype script to them and did initial tests;
Chris needs to provide run definitions;
we need to give him a Swift starter release and a tailored README
to get him started. He (or we) needs to create a TG startup account.
2. oops - Glen and Aashish
They are making progress on their own
3. dock - Andrew Binkowski
Andrew is running DOCK on Falkon on his own;
can try to convert to Swift; for now, bigger Falkon runs
are needed for INCITE app
4. oops - Mike Kubal
Mike wants to run OOPS for other studies; needs startup help;
waiting on data and on a new oops version
5. ptmap - Yue Chen
Yue is focusing elsewhere; will get back to running at some point
6. see - Joshua Elliot and Todd Munson, ampl runs
This just became "ready to swift";
need to do initial scripts and runs
7. PIR BLAST - Baris Suszek
Allan doing this as a demo for them;
Based on this, Zhao, Allan, I think the next step is to write the Swift
script for SEE, #6; help Chris, #1; prepare to help MikeK, #4. I will
send more details.
- Mike
From wilde at mcs.anl.gov Wed Jun 10 10:21:42 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 10 Jun 2009 10:21:42 -0500
Subject: [Swift-devel] Application engagements
In-Reply-To: <4A2FC740.9030901@mcs.anl.gov>
References: <4A2FC740.9030901@mcs.anl.gov>
Message-ID: <4A2FCF86.1070601@mcs.anl.gov>
additions:
8. Matlab workflows for David Biron's "worm lab" (C.elegans)
9. Matlab workflows for Andrew Jamieison / Giger lab
Alex Moore, a student, is making progress on #8 with some help from me.
#9 is on hold, waiting for time and interest on their part.
- Mike
On 6/10/09 9:46 AM, Michael Wilde wrote:
> Here's an update on 7 application engagements we have going at the
> moment. This doesn't include the ongoing CNARI work Sarah is supporting.
>
> The exact things to do next for each varies based on where each user is
> and how well they are making progress:
>
> 1. scip - Chris Henry
> Zhao turned over prototype script to them and did initial tests;
> Chris needs to provide run definitions;
> we need to give him a Swift starter release and a tailored README
> to get him started. He (or we) needs to create a TG startup account.
> 2. oops - Glen and Aashish
> They are making progress on their own
> 3. dock - Andrew Binkowski
> Andrew is running DOCK on Falkon on his own;
> can try to convert to Swift; for now, bigger Falkon runs
> are needed for INCITE app
> 4. oops - Mike Kubal
> Mike wants to run OOPS for other studies; needs startup help;
> waiting on data and on a new oops version
> 5. ptmap - Yue Chen
> Yue is focusing elsewhere; will get back to running at some point
> 6. see - Joshua Elliot and Todd Munson, ampl runs
> This just became "ready to swift";
> need to do initial scripts and runs
> 7. PIR BLAST - Baris Suszek
> Allan doing this as a demo for them;
>
> Based on this, Zhao, Allan, I think the next step is to write the Swift
> script for SEE, #6; help Chris, #1; prepare to help MikeK, #4. I will
> send more details.
>
> - Mike
>
>
>
From aespinosa at cs.uchicago.edu Wed Jun 10 13:45:45 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 10 Jun 2009 13:45:45 -0500
Subject: [Swift-devel] coaster job "completed" but swift reports
check-status-failed
Message-ID: <50b07b4b0906101145q6831e076oc5050efef8a82f5e@mail.gmail.com>
attached are the corresponding swift logs, coaster logs and gram logs.
sites.xml:
/home/aespinosa/blast-runs
1
1.26
ia64-compute
4
2
swift session stdout:
RunID: out.run_000
Progress:
Progress: uninitialized:1
Progress: Initializing:1000 Selecting site:1
Progress: Selecting site:1000 Initializing site shared directory:1
Progress: Selecting site:999 Initializing site shared directory:1 Stage in:1
Progress: Selecting site:996 Stage in:5
Progress: Selecting site:996 Stage in:5
Progress: Selecting site:995 Stage in:6
Progress: Selecting site:994 Stage in:7
Progress: Selecting site:993 Stage in:8
Progress: Selecting site:993 Stage in:8
Progress: Selecting site:993 Stage in:8
Progress: Selecting site:993 Stage in:8
Progress: Selecting site:993 Stage in:8
Progress: Selecting site:992 Stage in:9
Progress: Selecting site:992 Stage in:9
Progress: Selecting site:992 Stage in:8 Submitting:1
Progress: Selecting site:991 Stage in:1 Submitting:9
Progress: Selecting site:991 Stage in:1 Submitting:8 Submitted:1
Progress: Selecting site:991 Submitted:9 Active:1
Progress: Selecting site:991 Submitted:8 Active:2
Progress: Selecting site:991 Active:7 Checking status:2 Failed but
can retry:1
Progress: Selecting site:991 Active:2 Checking status:3 Failed but
can retry:5
Progress: Selecting site:991 Active:2 Failed but can retry:8
Progress: Selecting site:991 Active:2 Failed but can retry:8
Progress: Selecting site:991 Active:2 Failed but can retry:8
Progress: Selecting site:991 Active:2 Failed but can retry:8
Progress: Selecting site:991 Active:2 Failed but can retry:8
Progress: Selecting site:991 Active:2 Failed but can retry:8
Progress: Selecting site:991 Active:1 Checking status:1 Failed but
can retry:8
Progress: Selecting site:989 Active:2 Checking status:1 Finished
successfully:1 Failed but can retry:8
Progress: Selecting site:988 Stage in:1 Active:1 Finished
successfully:1 Failed but can retry:10
...
...
--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tarball.tar.gz
Type: application/x-gzip
Size: 172173 bytes
Desc: not available
URL:
From aespinosa at cs.uchicago.edu Wed Jun 10 16:42:35 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 10 Jun 2009 16:42:35 -0500
Subject: [Swift-devel] active jobs vs available processors on submitted
coaster queues
In-Reply-To: <50b07b4b0906101213g34368050re6e6b7b2b0992d9a@mail.gmail.com>
References: <50b07b4b0906091310j502935b2jc144e126aeb4edf8@mail.gmail.com>
<1244618956.16077.2.camel@localhost>
<50b07b4b0906101213g34368050re6e6b7b2b0992d9a@mail.gmail.com>
Message-ID: <50b07b4b0906101442t14695d3ei808adaa740b7ac1d@mail.gmail.com>
Here's a run on 1k jobs: only 2 jobs were active. The 18 procs here
in the LRM, I think, are the 2nd block request:
[aespinosa at tg-login1 ~]$ showq -u $USER
active jobs------------------------
JOBID USERNAME STATE PROCS REMAINING STARTTIME
2016757 aespinos Running 18 00:15:09 Wed Jun 10 16:29:31
1 active job 18 of 114 processors in use by local jobs (15.79%)
50 of 57 nodes active (87.72%)
eligible jobs----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
0 eligible jobs
blocked jobs-----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
swift session:
Swift svn swift-r2949 cog-r2406
RunID: out.run_000
Progress:
Progress: uninitialized:1
Progress: Initializing:1000 Selecting site:1
Progress: Selecting site:1000 Initializing site shared directory:1
Progress: Selecting site:999 Initializing site shared directory:1 Stage in:1
Progress: Selecting site:996 Stage in:5
Progress: Selecting site:996 Stage in:5
Progress: Selecting site:995 Stage in:6
Progress: Selecting site:994 Stage in:7
Progress: Selecting site:994 Stage in:7
Progress: Selecting site:993 Stage in:8
Progress: Selecting site:993 Stage in:8
Progress: Selecting site:993 Stage in:8
Progress: Selecting site:993 Stage in:8
Progress: Selecting site:992 Stage in:9
Progress: Selecting site:992 Stage in:9
Progress: Selecting site:992 Stage in:8 Submitting:1
Progress: Selecting site:991 Stage in:1 Submitting:8 Submitted:1
Progress: Selecting site:991 Submitted:9 Active:1
Progress: Selecting site:991 Submitted:9 Active:1
Progress: Selecting site:991 Submitted:8 Active:2
Progress: Selecting site:991 Submitted:1 Active:2 Checking
status:6 Failed but can retry:1
Progress: Selecting site:991 Active:1 Checking status:4 Failed but
can retry:5
Progress: Selecting site:990 Stage in:1 Active:1 Failed but can retry:9
Progress: Selecting site:990 Active:1 Checking status:1 Failed but
can retry:9
Progress: Selecting site:989 Submitting:1 Active:1 Failed but can retry:10
Progress: Selecting site:989 Active:1 Checking status:1 Failed but
can retry:10
Progress: Selecting site:988 Submitting:1 Active:1 Failed but can retry:11
Progress: Selecting site:988 Active:1 Checking status:1 Failed but
can retry:11
Progress: Selecting site:987 Submitting:1 Active:1 Failed but can retry:12
Progress: Selecting site:987 Active:1 Checking status:1 Failed but
can retry:12
Progress: Selecting site:986 Stage in:1 Active:1 Failed but can retry:13
Progress: Selecting site:986 Active:1 Checking status:1 Failed but
can retry:13
Progress: Selecting site:985 Stage in:1 Active:1 Failed but can retry:14
Progress: Selecting site:985 Active:1 Checking status:1 Failed but
can retry:14
Progress: Selecting site:984 Stage in:1 Active:1 Failed but can retry:15
Progress: Selecting site:984 Active:1 Checking status:1 Failed but
can retry:15
Progress: Selecting site:983 Stage in:1 Active:1 Failed but can retry:16
Progress: Selecting site:983 Active:2 Failed but can retry:16
Progress: Selecting site:983 Active:2 Failed but can retry:16
Progress: Selecting site:983 Active:1 Checking status:1 Failed but
can retry:16
Progress: Selecting site:982 Stage in:1 Active:1 Finished
successfully:1 Failed but can retry:16
Progress: Selecting site:982 Active:1 Checking status:1 Finished
successfully:1 Failed but can retry:16
Progress: Selecting site:981 Submitting:1 Active:1 Finished
successfully:1 Failed but can retry:17
Progress: Selecting site:981 Active:1 Finished successfully:1
Failed but can retry:18
Progress: Selecting site:980 Submitting:1 Active:1 Finished
successfully:1 Failed but can retry:18
Progress: Selecting site:980 Active:1 Checking status:1 Finished
successfully:1 Failed but can retry:18
Progress: Selecting site:979 Stage in:1 Active:1 Finished
successfully:1 Failed but can retry:19
Progress: Selecting site:979 Active:1 Checking status:1 Finished
successfully:1 Failed but can retry:19
Progress: Selecting site:979 Active:1 Finished successfully:1
Failed but can retry:20
Progress: Selecting site:978 Stage in:1 Active:1 Finished
successfully:1 Failed but can retry:20
Progress: Selecting site:978 Active:1 Checking status:1 Finished
successfully:1 Failed but can retry:20
Progress: Selecting site:977 Stage in:1 Active:1 Finished
successfully:1 Failed but can retry:21
Progress: Selecting site:977 Active:1 Checking status:1 Finished
successfully:1 Failed but can retry:21
Progress: Selecting site:976 Stage in:1 Active:1 Finished
successfully:1 Failed but can retry:22
Progress: Selecting site:976 Submitted:1 Active:1 Finished
successfully:1 Failed but can retry:22
Progress: Selecting site:976 Submitted:1 Active:1 Finished
successfully:1 Failed but can retry:22
Progress: Selecting site:976 Submitted:1 Active:1 Finished
successfully:1 Failed but can retry:22
Progress: Selecting site:976 Submitted:1 Active:1 Finished
successfully:1 Failed but can retry:22
Progress: Selecting site:976 Submitted:1 Finished successfully:1
Failed but can retry:23
Progress: Selecting site:975 Stage in:1 Submitted:1 Finished
successfully:1 Failed but can retry:23
Progress: Selecting site:975 Submitted:2 Finished successfully:1
Failed but can retry:23
Progress: Selecting site:975 Submitted:2 Finished successfully:1
Failed but can retry:23
Progress: Selecting site:975 Submitted:2 Finished successfully:1
Failed but can retry:23
Progress: Selecting site:975 Submitted:2 Finished successfully:1
Failed but can retry:23
Progress: Selecting site:975 Submitted:2 Finished successfully:1
Failed but can retry:23
Progress: Selecting site:975 Submitted:2 Finished successfully:1
Failed but can retry:23
Progress: Selecting site:975 Submitted:2 Finished successfully:1
Failed but can retry:23
Progress: Selecting site:975 Submitted:2 Finished successfully:1
Failed but can retry:23
2009/6/10 Allan Espinosa :
> hi mihael,
>
> I reran the job and attached the log files (coaster log, swift-log, gram logs).
>
> swift session:
> Progress: Submitted:1 Active:1 Finished successfully:4
> Progress: Submitted:1 Active:1 Finished successfully:4
> Progress: Submitted:1 Active:1 Finished successfully:4
> Progress: Submitted:1 Checking status:1 Finished successfully:4
> Progress: Active:1 Finished successfully:5
> Progress: Active:1 Finished successfully:5
> Progress: Active:1 Finished successfully:5
> Progress: Active:1 Finished successfully:5
> Progress: Active:1 Finished successfully:5
> Progress: Active:1 Finished successfully:5
> Progress: Active:1 Finished successfully:5
> Progress: Active:1 Finished successfully:5
> Progress: Checking status:1 Finished successfully:5
> Progress: Stage out:1 Finished successfully:5
> Progress: Submitted:1 Finished successfully:6
> Progress: Submitted:1 Finished successfully:6
> Progress: Submitted:1 Finished successfully:6
> Progress: Submitted:1 Finished successfully:6
> Progress: Submitted:1 Finished successfully:6
> ...
>
> sites.xml (i may have changed it during this run):
>
>
> url="tg-grid.uc.teragrid.org" jobmanager="gt2:gt2:pbs" />
>
> /home/aespinosa/blast-runs
>
> 1
> 1.26
>
> key="host_types">ia64-compute
> 4
> 2
>
>
>
> it looks like the last job was submitted but has not yet registered
> with the gram service on the ucanl remote site. At this point the
> coasters for the previous 5 jobs had already ended.
> -Allan
>
> 2009/6/10 Mihael Hategan :
>> I need to look at the coaster log.
>>
>> On Tue, 2009-06-09 at 15:10 -0500, Allan Espinosa wrote:
>>> I was expecting to have 2 active jobs at a time from the swift log but
>>> instead got only one at a time:
>>> Swift svn swift-r2949 cog-r2406
>>>
>>> RunID: out.run_000
>>> Progress:
>>> Progress: Selecting site:4 Initializing site shared directory:1 Stage in:1
>>> Progress: Stage in:6
>>> Progress: Stage in:6
>>>
>>>
>>>
>>> Progress: Stage in:6
>>> Progress: Stage in:6
>>> Progress: Stage in:6
>>> Progress: Stage in:6
>>> Progress: Stage in:5 Submitting:1
>>> Progress: Submitting:5 Submitted:1
>>> Progress: Submitted:6
>>> Progress: Submitted:5 Active:1
>>> Progress: Submitted:5 Active:1
>>> Progress: Submitted:5 Active:1
>>> Progress: Submitted:5 Active:1
>>> Progress: Submitted:5 Active:1
>>> Progress: Submitted:5 Checking status:1
>>> Progress: Submitted:4 Active:1 Finished successfully:1
>>> Progress: Submitted:4 Active:1 Finished successfully:1
>>> Progress: Submitted:4 Active:1 Finished successfully:1
>>> Progress: Submitted:4 Active:1 Finished successfully:1
>>> Progress: Submitted:4 Active:1 Finished successfully:1
>>> Progress: Submitted:4 Active:1 Finished successfully:1
>>> Progress: Submitted:4 Checking status:1 Finished successfully:1
>>> Progress: Submitted:3 Active:1 Finished successfully:2
>>> Progress: Submitted:3 Active:1 Finished successfully:2
>>> Progress: Submitted:3 Active:1 Finished successfully:2
>>> Progress: Submitted:3 Checking status:1 Finished successfully:2
>>> Progress: Submitted:2 Active:1 Finished successfully:3
>>> ...
>>> ...
>>>
>>>
>>> uc-teragrid queue status: $showq -u $USER
>>> [aespinosa at tg-login1 ~]$ showq -u $USER
>>>
>>> active jobs------------------------
>>> JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME
>>>
>>> 2015982            aespinos    Running     2    00:55:41  Tue Jun  9 15:02:18
>>>
>>> 1 active job              2 of 116 processors in use by local jobs (1.72%)
>>>                           42 of 58 nodes active      (72.41%)
>>>
>>> eligible jobs----------------------
>>> JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME
>>>
>>>
>>> 0 eligible jobs
>>>
>>> blocked jobs-----------------------
>>> JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME
>>>
>>>
>>> 0 blocked jobs
>>>
>>> Total job: ?1
>>>
>>>
>>> sites.xml:
>>>
>>>
>>> url="tg-grid.uc.teragrid.org" jobmanager="gt2:gt2:pbs" />
>>>
>>> /home/aespinosa/blast-runs
>>>
>>> 5
>>> 1.26
>>>
>>> key="host_types">ia64-compute
>>> 4
>>> 16
>>>
>>>
>
--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tarball.tar.gz
Type: application/x-gzip
Size: 182455 bytes
Desc: not available
URL:
From zhaozhang at uchicago.edu Thu Jun 11 09:24:57 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 11 Jun 2009 09:24:57 -0500
Subject: [Swift-devel] coaster error on ranger
In-Reply-To: <4A30EDBD.50108@mcs.anl.gov>
References:
<4A300069.6030103@anl.gov>
<4A3005CF.2090107@mcs.anl.gov>
<4A301544.4030806@uchicago.edu>
<4A3074B6.3070003@uchicago.edu> <4A3083E9.2040603@mcs.anl.gov>
<4A30A404.7050004@uchicago.edu> <4A30EDBD.50108@mcs.anl.gov>
Message-ID: <4A3113B9.9080303@uchicago.edu>
Hi, Mike and Mihael
Here is the error; I think it is related to the job walltime in the
coaster settings.
Mihael, could you give me some suggestions on how to set the parameters
for coasters on ranger?
For now I am running 100 jobs, each job could take 2~3 hours. Thanks.
best
zhao
Execution failed:
Exception in run_ampl:
Arguments: [run70, template, armington.mod, armington_process.cmd,
armington_output.cmd, subproblems/producer_tree.mod, ces.so]
Host: tgtacc
Directory: ampl-20090611-0122-hzktisu5/jobs/h/run_ampl-h92ap3cj
stderr.txt:
stdout.txt:
----
Caused by:
Shutting down worker
Cleaning up...
Shutting down service at https://129.114.50.163:58556
And here is my sites.xml
bash-3.00$ cat tgranger-sge-gram2.xml
TG-CCR080022N
/work/00946/zzhang/work
/tmp/zzhang/jobdir
16
development
100
10
20
5
1
5
From hategan at mcs.anl.gov Thu Jun 11 10:22:23 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 11 Jun 2009 10:22:23 -0500
Subject: [Swift-devel] Re: coaster error on ranger
In-Reply-To: <4A3113B9.9080303@uchicago.edu>
References:
<4A300069.6030103@anl.gov>
<4A3005CF.2090107@mcs.anl.gov>
<4A301544.4030806@uchicago.edu>
<4A3074B6.3070003@uchicago.edu> <4A3083E9.2040603@mcs.anl.gov>
<4A30A404.7050004@uchicago.edu> <4A30EDBD.50108@mcs.anl.gov>
<4A3113B9.9080303@uchicago.edu>
Message-ID: <1244733743.18728.1.camel@localhost>
On Thu, 2009-06-11 at 09:24 -0500, Zhao Zhang wrote:
> Hi, Mike and Mihael
>
> Here is the error, I think this is related to the job wall time of
> coaster settings.
>
> Mihael, could you give me some suggestions on how to set the parameters
> for coasters on ranger?
I need to know what the problem is first. And for that I need to take a
look at the coaster log (and possibly gram logs). So if you could copy
that to some shared space in the CI, that would be good.
> For now I am running 100 jobs, each job could take 2~3 hours. Thanks.
>
> best
> zhao
>
> Execution failed:
> Exception in run_ampl:
> Arguments: [run70, template, armington.mod, armington_process.cmd,
> armington_ou\
> tput.cmd, subproblems/producer_tree.mod, ces.so]
> Host: tgtacc
> Directory: ampl-20090611-0122-hzktisu5/jobs/h/run_ampl-h92ap3cj
> stderr.txt:
>
> stdout.txt:
> ----
>
> Caused by:
> Shutting down worker
> Cleaning up...
> Shutting down service at https://129.114.50.163:58556
>
> And here is my sites.xml
> bash-3.00$ cat tgranger-sge-gram2.xml
>
>
>
> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
>
> TG-CCR080022N
> /work/00946/zzhang/work
> key="SWIFT_JOBDIR_PATH">/tmp/zzhang/jobdir
> 16
> development
> 100
> 10
> 20
> 5
> 1
> 5
>
>
>
>
>
>
From wilde at mcs.anl.gov Thu Jun 11 10:29:35 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 11 Jun 2009 10:29:35 -0500
Subject: [Swift-devel] Re: coaster error on ranger
In-Reply-To: <1244733743.18728.1.camel@localhost>
References:
<4A300069.6030103@anl.gov>
<4A3005CF.2090107@mcs.anl.gov>
<4A301544.4030806@uchicago.edu>
<4A3074B6.3070003@uchicago.edu> <4A3083E9.2040603@mcs.anl.gov>
<4A30A404.7050004@uchicago.edu> <4A30EDBD.50108@mcs.anl.gov>
<4A3113B9.9080303@uchicago.edu>
<1244733743.18728.1.camel@localhost>
Message-ID: <4A3122DF.2090301@mcs.anl.gov>
There is some likelihood that ampl itself is exiting with a non-zero
exit code (12 I suspect) due to a subscript error at the near-correct
termination of the model (i.e. it runs usefully to the end, then dies when
it runs off the end of an array). We know the fix for this.
But I wonder, in the case below, Zhao: is this happening when ampl gets
one of these errors, or is it running one job OK on a coaster, and then
running into a timeout on the next job?
What was the mapping of the number of jobs in this script (100 I think)
to the number of coasters started? Did the error occur when it tried to
start a second long job on a coaster after a prior (long) job had
already completed?
- Mike
On 6/11/09 10:22 AM, Mihael Hategan wrote:
> On Thu, 2009-06-11 at 09:24 -0500, Zhao Zhang wrote:
>> Hi, Mike and Mihael
>>
>> Here is the error, I think this is related to the job wall time of
>> coaster settings.
>>
>> Mihael, could you give me some suggestions on how to set the parameters
>> for coasters on ranger?
>
> I need to know what the problem is first. And for that I need to take a
> look at the coaster log (and possibly gram logs). So if you could copy
> that to some shared space in the CI, that would be good.
>
>> For now I am running 100 jobs, each job could take 2~3 hours. Thanks.
>>
>> best
>> zhao
>>
>> Execution failed:
>> Exception in run_ampl:
>> Arguments: [run70, template, armington.mod, armington_process.cmd,
>> armington_ou\
>> tput.cmd, subproblems/producer_tree.mod, ces.so]
>> Host: tgtacc
>> Directory: ampl-20090611-0122-hzktisu5/jobs/h/run_ampl-h92ap3cj
>> stderr.txt:
>>
>> stdout.txt:
>> ----
>>
>> Caused by:
>> Shutting down worker
>> Cleaning up...
>> Shutting down service at https://129.114.50.163:58556
>>
>> And here is my sites.xml
>> bash-3.00$ cat tgranger-sge-gram2.xml
>>
>>
>>
>> > url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
>>
>> TG-CCR080022N
>> /work/00946/zzhang/work
>> > key="SWIFT_JOBDIR_PATH">/tmp/zzhang/jobdir
>> 16
>> development
>> 100
>> 10
>> 20
>> 5
>> 1
>> 5
>>
>>
>>
>>
>>
>>
>
From zhaozhang at uchicago.edu Thu Jun 11 10:37:02 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 11 Jun 2009 10:37:02 -0500
Subject: [Swift-devel] Re: coaster error on ranger
In-Reply-To: <1244733743.18728.1.camel@localhost>
References:
<4A300069.6030103@anl.gov>
<4A3005CF.2090107@mcs.anl.gov>
<4A301544.4030806@uchicago.edu>
<4A3074B6.3070003@uchicago.edu> <4A3083E9.2040603@mcs.anl.gov>
<4A30A404.7050004@uchicago.edu> <4A30EDBD.50108@mcs.anl.gov>
<4A3113B9.9080303@uchicago.edu>
<1244733743.18728.1.camel@localhost>
Message-ID: <4A31249E.4060806@uchicago.edu>
Hi, Mihael
The coaster log is at /home/zzhang/see/logs/coasters.log. The latest
record should be the run that failed last night.
best
zhao
Mihael Hategan wrote:
> On Thu, 2009-06-11 at 09:24 -0500, Zhao Zhang wrote:
>
>> Hi, Mike and Mihael
>>
>> Here is the error, I think this is related to the job wall time of
>> coaster settings.
>>
>> Mihael, could you give me some suggestions on how to set the parameters
>> for coasters on ranger?
>>
>
> I need to know what the problem is first. And for that I need to take a
> look at the coaster log (and possibly gram logs). So if you could copy
> that to some shared space in the CI, that would be good.
>
>
>> For now I am running 100 jobs, each job could take 2~3 hours. Thanks.
>>
>> best
>> zhao
>>
>> Execution failed:
>> Exception in run_ampl:
>> Arguments: [run70, template, armington.mod, armington_process.cmd,
>> armington_ou\
>> tput.cmd, subproblems/producer_tree.mod, ces.so]
>> Host: tgtacc
>> Directory: ampl-20090611-0122-hzktisu5/jobs/h/run_ampl-h92ap3cj
>> stderr.txt:
>>
>> stdout.txt:
>> ----
>>
>> Caused by:
>> Shutting down worker
>> Cleaning up...
>> Shutting down service at https://129.114.50.163:58556
>>
>> And here is my sites.xml
>> bash-3.00$ cat tgranger-sge-gram2.xml
>>
>>
>>
>> > url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
>>
>> TG-CCR080022N
>> /work/00946/zzhang/work
>> > key="SWIFT_JOBDIR_PATH">/tmp/zzhang/jobdir
>> 16
>> development
>> 100
>> 10
>> 20
>> 5
>> 1
>> 5
>>
>>
>>
>>
>>
>>
>>
>
>
>
From zhaozhang at uchicago.edu Thu Jun 11 11:06:48 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 11 Jun 2009 11:06:48 -0500
Subject: [Swift-devel] Re: coaster error on ranger
In-Reply-To: <4A3122DF.2090301@mcs.anl.gov>
References:
<4A300069.6030103@anl.gov>
<4A3005CF.2090107@mcs.anl.gov>
<4A301544.4030806@uchicago.edu>
<4A3074B6.3070003@uchicago.edu> <4A3083E9.2040603@mcs.anl.gov>
<4A30A404.7050004@uchicago.edu> <4A30EDBD.50108@mcs.anl.gov>
<4A3113B9.9080303@uchicago.edu>
<1244733743.18728.1.camel@localhost> <4A3122DF.2090301@mcs.anl.gov>
Message-ID: <4A312B98.2090702@uchicago.edu>
Hi, Mike
I am attaching the whole log at the end.
From the log, we can tell that no job had succeeded at the point when
the workflow exited, and the workflow had been running for only 13 minutes.
I also copied the swift work dir back to the CI network; it is at
/home/zzhang/see/logs/ampl-20090611-0122-hzktisu5.
Although no job in the workflow returned successfully, I did find 22
result files in
/home/zzhang/see/logs/ampl-20090611-0122-hzktisu5/shared/result
You can take a look at run14 as an example. I echo the exit code of
the ampl script at the end of run_ampl:
** EXIT - solution found.
Major Iterations. . . . 4
Minor Iterations. . . . 36
Restarts. . . . . . . . 0
Crash Iterations. . . . 0
Gradient Steps. . . . . 0
Function Evaluations. . 5
Gradient Evaluations. . 5
Basis Time. . . . . . . 25.713607
Total Time. . . . . . . 27.701732
Residual. . . . . . . . 2.998933e-07
Postsolved residual: 2.9989e-07
Path 4.7.01: Solution found.
4 iterations (0 for crash); 36 pivots.
5 function, 5 gradient evaluations.
exitcode 2
See here? The exit code is 2, which means the ampl script has an error
itself. I know you said Todd has a fix for this,
but I didn't find it. The code I was running is the latest from svn. Any
ideas about this?
best wishes
zhao
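[As a side note, the tail of a wrapper that produces that trailing "exitcode" line can be sketched as below; the function name and the use of "true" as a stand-in for the actual ampl invocation are illustrative assumptions, not the real run_ampl script:]

```shell
#!/bin/sh
# Minimal sketch: run a payload command, report its exit code,
# and propagate it so the caller (e.g. Swift's wrapper) sees the
# true job status rather than the status of the echo.
run_with_exitcode() {
  "$@"
  rc=$?
  echo "exitcode $rc"
  return $rc
}

# "true" stands in for the real ampl invocation
run_with_exitcode true
```

Under this reading, an "exitcode 2" comes from ampl itself, not from the coaster machinery.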
Swift svn swift-r2953 cog-r2406
RunID: 20090611-0122-hzktisu5
Progress:
Progress: uninitialized:1
Progress: Selecting site:98 Initializing site shared directory:1
Stage in:1
Progress: Stage in:99 Submitting:1
Progress: Submitting:99 Submitted:1
Progress: Submitted:100
Progress: Submitted:100
Progress: Submitted:100
Progress: Submitted:100
Progress: Submitted:99 Active:1
Progress: Submitted:82 Active:18
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:22 Failed but can retry:1
Failed to transfer wrapper log from ampl-20090611-0122-hzktisu5/info/h
on tgtacc
Failed to transfer wrapper log from ampl-20090611-0122-hzktisu5/info/s
on tgtacc
Progress: Submitted:75 Active:21 Failed:1 Failed but can retry:3
Execution failed:
Exception in run_ampl:
Arguments: [run70, template, armington.mod, armington_process.cmd,
armington_output.cmd, subproblems/producer_tree.mod, ces.so]
Host: tgtacc
Directory: ampl-20090611-0122-hzktisu5/jobs/h/run_ampl-h92ap3cj
stderr.txt:
stdout.txt:
----
Caused by:
Shutting down worker
Cleaning up...
Shutting down service at https://129.114.50.163:58556
Got channel MetaChannel: 6217586 -> GSSSChannel-null(1)
Failed to transfer wrapper log from ampl-20090611-0122-hzktisu5/info/o
on tgtacc
- Done
Michael Wilde wrote:
> There is some likelihood that ampl itself is exitting with a non-zero
> exit code (12 I suspect) due ot a subscript error at the near-correct
> termination of the model (ie it runs usefully to the end, then dies
> when it runs off the end of an array). We know the fix for this.
>
> But I wonder, in the case below, Zhao: is this happening when ampl
> gets one of these errors, or is it running one job OK on a coaster,
> and then running into a timeout on the next job?
>
> What was the mapping of the number of jobs in this script (100 I
> think) to the number of coasters started? Did the error occur when it
> tried to start a second long job on a coaster after a prior (long) job
> had already completed?
>
> - Mike
>
>
> On 6/11/09 10:22 AM, Mihael Hategan wrote:
>> On Thu, 2009-06-11 at 09:24 -0500, Zhao Zhang wrote:
>>> Hi, Mike and Mihael
>>>
>>> Here is the error, I think this is related to the job wall time of
>>> coaster settings.
>>>
>>> Mihael, could you give me some suggestions on how to set the
>>> parameters for coasters on ranger?
>>
>> I need to know what the problem is first. And for that I need to take a
>> look at the coaster log (and possibly gram logs). So if you could copy
>> that to some shared space in the CI, that would be good.
>>
>>> For now I am running 100 jobs, each job could take 2~3 hours. Thanks.
>>>
>>> best
>>> zhao
>>>
>>> Execution failed:
>>> Exception in run_ampl:
>>> Arguments: [run70, template, armington.mod, armington_process.cmd,
>>> armington_ou\
>>> tput.cmd, subproblems/producer_tree.mod, ces.so]
>>> Host: tgtacc
>>> Directory: ampl-20090611-0122-hzktisu5/jobs/h/run_ampl-h92ap3cj
>>> stderr.txt:
>>>
>>> stdout.txt:
>>> ----
>>>
>>> Caused by:
>>> Shutting down worker
>>> Cleaning up...
>>> Shutting down service at https://129.114.50.163:58556
>>>
>>> And here is my sites.xml
>>> bash-3.00$ cat tgranger-sge-gram2.xml
>>>
>>>
>>>
>>> >> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
>>>
>>> TG-CCR080022N
>>> /work/00946/zzhang/work
>>> >> key="SWIFT_JOBDIR_PATH">/tmp/zzhang/jobdir
>>> 16
>>> development
>>> 100
>>> 10
>>> 20
>>> 5
>>> 1
>>> 5
>>>
>>>
>>>
>>>
>>>
>>>
>>
>
From hategan at mcs.anl.gov Thu Jun 11 12:32:34 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 11 Jun 2009 12:32:34 -0500
Subject: [Swift-devel] Re: coaster error on ranger
In-Reply-To: <4A31249E.4060806@uchicago.edu>
References:
<4A300069.6030103@anl.gov>
<4A3005CF.2090107@mcs.anl.gov>
<4A301544.4030806@uchicago.edu>
<4A3074B6.3070003@uchicago.edu> <4A3083E9.2040603@mcs.anl.gov>
<4A30A404.7050004@uchicago.edu> <4A30EDBD.50108@mcs.anl.gov>
<4A3113B9.9080303@uchicago.edu> <1244733743.18728.1.camel@localhost>
<4A31249E.4060806@uchicago.edu>
Message-ID: <1244741554.23235.0.camel@localhost>
Your jobs seem to not have a walltime specified. Can you post your
tc.data?
On Thu, 2009-06-11 at 10:37 -0500, Zhao Zhang wrote:
> Hi, Mihael
>
> The coaster log is at /home/zzhang/see/logs/coasters.log. The latest
> record should be the run that failed last night.
>
> best
> zhao
>
> Mihael Hategan wrote:
> > On Thu, 2009-06-11 at 09:24 -0500, Zhao Zhang wrote:
> >
> >> Hi, Mike and Mihael
> >>
> >> Here is the error, I think this is related to the job wall time of
> >> coaster settings.
> >>
> >> Mihael, could you give me some suggestions on how to set the parameters
> >> for coasters on ranger?
> >>
> >
> > I need to know what the problem is first. And for that I need to take a
> > look at the coaster log (and possibly gram logs). So if you could copy
> > that to some shared space in the CI, that would be good.
> >
> >
> >> For now I am running 100 jobs, each job could take 2~3 hours. Thanks.
> >>
> >> best
> >> zhao
> >>
> >> Execution failed:
> >> Exception in run_ampl:
> >> Arguments: [run70, template, armington.mod, armington_process.cmd,
> >> armington_ou\
> >> tput.cmd, subproblems/producer_tree.mod, ces.so]
> >> Host: tgtacc
> >> Directory: ampl-20090611-0122-hzktisu5/jobs/h/run_ampl-h92ap3cj
> >> stderr.txt:
> >>
> >> stdout.txt:
> >> ----
> >>
> >> Caused by:
> >> Shutting down worker
> >> Cleaning up...
> >> Shutting down service at https://129.114.50.163:58556
> >>
> >> And here is my sites.xml
> >> bash-3.00$ cat tgranger-sge-gram2.xml
> >>
> >>
> >>
> >> >> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
> >>
> >> TG-CCR080022N
> >> /work/00946/zzhang/work
> >> >> key="SWIFT_JOBDIR_PATH">/tmp/zzhang/jobdir
> >> 16
> >> development
> >> 100
> >> 10
> >> 20
> >> 5
> >> 1
> >> 5
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >
> >
> >
From zhaozhang at uchicago.edu Thu Jun 11 13:04:42 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 11 Jun 2009 13:04:42 -0500
Subject: [Swift-devel] Re: coaster error on ranger
In-Reply-To: <1244741554.23235.0.camel@localhost>
References:
<4A300069.6030103@anl.gov>
<4A3005CF.2090107@mcs.anl.gov>
<4A301544.4030806@uchicago.edu>
<4A3074B6.3070003@uchicago.edu> <4A3083E9.2040603@mcs.anl.gov>
<4A30A404.7050004@uchicago.edu> <4A30EDBD.50108@mcs.anl.gov>
<4A3113B9.9080303@uchicago.edu>
<1244733743.18728.1.camel@localhost>
<4A31249E.4060806@uchicago.edu>
<1244741554.23235.0.camel@localhost>
Message-ID: <4A31473A.1090006@uchicago.edu>
No, I don't specify any wall time.
The last entry is for the run_ampl script.
zhao
login3% cat tc.data
#This is the transformation catalog.
#
#It comes pre-configured with a number of simple transformations with
#paths that are likely to work on a linux box. However, on some systems,
#the paths to these executables will be different (for example, sometimes
#some of these programs are found in /usr/bin rather than in /bin)
#
#NOTE WELL: fields in this file must be separated by tabs, not spaces; and
#there must be no trailing whitespace at the end of each line.
#
# sitename transformation path INSTALLED platform profiles
bgps echo /bin/echo INSTALLED INTEL32::LINUX null
bgp000 cat /bin/cat INSTALLED INTEL32::LINUX null
localhost	sleep	/bin/sleep	INSTALLED	INTEL32::LINUX	null
localhost	echo	/bin/echo	INSTALLED	INTEL32::LINUX	null
localhost	ls	/bin/ls	INSTALLED	INTEL32::LINUX	null
localhost	wc	/bin/wc	INSTALLED	INTEL32::LINUX	null
localhost	grep	/bin/grep	INSTALLED	INTEL32::LINUX	null
localhost	sort	/bin/sort	INSTALLED	INTEL32::LINUX	null
localhost	paste	/bin/paste	INSTALLED	INTEL32::LINUX	null
localhost	date	/bin/date	INSTALLED	INTEL32::LINUX	null
localhost	db	/home/wilde/angle/data/db	INSTALLED	INTEL32::LINUX	null
localhost	set1	/home/wilde/angle/data/set1	INSTALLED	INTEL32::LINUX	null
localhost	set3	/home/wilde/angle/data/set3	INSTALLED	INTEL32::LINUX	null
localhost	run_ampl	/share/home/00946/zzhang/SEE-work/static/run_ampl	INSTALLED	INTEL32::LINUX	null
tgtacc	run_ampl	/share/home/00946/zzhang/SEE-work/static/run_ampl	INSTALLED	INTEL32::LINUX	null
Mihael Hategan wrote:
> Your jobs seem to not have a walltime specified. Can you post your
> tc.data?
>
> On Thu, 2009-06-11 at 10:37 -0500, Zhao Zhang wrote:
>
>> Hi, Mihael
>>
>> The coaster log is at /home/zzhang/see/logs/coasters.log. The latest
>> record should be the run that failed last night.
>>
>> best
>> zhao
>>
>> Mihael Hategan wrote:
>>
>>> On Thu, 2009-06-11 at 09:24 -0500, Zhao Zhang wrote:
>>>
>>>
>>>> Hi, Mike and Mihael
>>>>
>>>> Here is the error, I think this is related to the job wall time of
>>>> coaster settings.
>>>>
>>>> Mihael, could you give me some suggestions on how to set the parameters
>>>> for coasters on ranger?
>>>>
>>>>
>>> I need to know what the problem is first. And for that I need to take a
>>> look at the coaster log (and possibly gram logs). So if you could copy
>>> that to some shared space in the CI, that would be good.
>>>
>>>
>>>
>>>> For now I am running 100 jobs, each job could take 2~3 hours. Thanks.
>>>>
>>>> best
>>>> zhao
>>>>
>>>> Execution failed:
>>>> Exception in run_ampl:
>>>> Arguments: [run70, template, armington.mod, armington_process.cmd,
>>>> armington_ou\
>>>> tput.cmd, subproblems/producer_tree.mod, ces.so]
>>>> Host: tgtacc
>>>> Directory: ampl-20090611-0122-hzktisu5/jobs/h/run_ampl-h92ap3cj
>>>> stderr.txt:
>>>>
>>>> stdout.txt:
>>>> ----
>>>>
>>>> Caused by:
>>>> Shutting down worker
>>>> Cleaning up...
>>>> Shutting down service at https://129.114.50.163:58556
>>>>
>>>> And here is my sites.xml
>>>> bash-3.00$ cat tgranger-sge-gram2.xml
>>>>
>>>>
>>>>
>>>> >>> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
>>>>
>>>> TG-CCR080022N
>>>> /work/00946/zzhang/work
>>>> >>> key="SWIFT_JOBDIR_PATH">/tmp/zzhang/jobdir
>>>> 16
>>>> development
>>>> 100
>>>> 10
>>>> 20
>>>> 5
>>>> 1
>>>> 5
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>
>
>
From hategan at mcs.anl.gov Thu Jun 11 13:09:20 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 11 Jun 2009 13:09:20 -0500
Subject: [Swift-devel] Re: coaster error on ranger
In-Reply-To: <4A31473A.1090006@uchicago.edu>
References:
<4A300069.6030103@anl.gov>
<4A3005CF.2090107@mcs.anl.gov>
<4A301544.4030806@uchicago.edu>
<4A3074B6.3070003@uchicago.edu> <4A3083E9.2040603@mcs.anl.gov>
<4A30A404.7050004@uchicago.edu> <4A30EDBD.50108@mcs.anl.gov>
<4A3113B9.9080303@uchicago.edu> <1244733743.18728.1.camel@localhost>
<4A31249E.4060806@uchicago.edu> <1244741554.23235.0.camel@localhost>
<4A31473A.1090006@uchicago.edu>
Message-ID: <1244743761.24254.1.camel@localhost>
On Thu, 2009-06-11 at 13:04 -0500, Zhao Zhang wrote:
> No, I don't specify any wall time.
Well, you need to specify one.
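[For reference, a sketch of what that could look like. In Swift, a per-application walltime is typically attached as a profile in the last column of tc.data (replacing "null"), or as a globus profile in sites.xml. The exact keys and the minutes-based value below are recollections of the Swift conventions and are untested here, so treat them as assumptions:]

```
# tc.data: profile column carries the walltime hint
tgtacc	run_ampl	/share/home/00946/zzhang/SEE-work/static/run_ampl	INSTALLED	INTEL32::LINUX	globus::maxwalltime="180"

# sites.xml alternative:
# <profile namespace="globus" key="maxwalltime">180</profile>
```

With jobs that take 2-3 hours, something on the order of 180 minutes plus headroom would keep the coaster workers from being shut down mid-job.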
> The last entry is for the run_ampl script.
>
> zhao
>
> login3% cat tc.data
> #This is the transformation catalog.
> #
> #It comes pre-configured with a number of simple transformations with
> #paths that are likely to work on a linux box. However, on some systems,
> #the paths to these executables will be different (for example, sometimes
> #some of these programs are found in /usr/bin rather than in /bin)
> #
> #NOTE WELL: fields in this file must be separated by tabs, not spaces; and
> #there must be no trailing whitespace at the end of each line.
> #
> # sitename transformation path INSTALLED platform profiles
> bgps echo /bin/echo INSTALLED INTEL32::LINUX null
> bgp000 cat /bin/cat INSTALLED INTEL32::LINUX null
> localhost sleep /bin/sleep INSTALLED
> INTEL32::LINUX null
> localhost echo /bin/echo INSTALLED
> INTEL32::LINUX null
> localhost ls /bin/ls INSTALLED
> INTEL32::LINUX null
> localhost wc /bin/wc INSTALLED
> INTEL32::LINUX null
> localhost grep /bin/grep INSTALLED
> INTEL32::LINUX null
> localhost sort /bin/sort INSTALLED
> INTEL32::LINUX null
> localhost paste /bin/paste INSTALLED
> INTEL32::LINUX null
> localhost date /bin/date INSTALLED
> INTEL32::LINUX null
> localhost db /home/wilde/angle/data/db
> INSTALLED INTEL32::LINUX null
> localhost set1 /home/wilde/angle/data/set1
> INSTALLED INTEL32::LINUX null
> localhost set3 /home/wilde/angle/data/set3
> INSTALLED INTEL32::LINUX null
> localhost run_ampl
> /share/home/00946/zzhang/SEE-work/static/run_ampl INSTALLED
> INTEL32::LINUX null
> tgtacc run_ampl
> /share/home/00946/zzhang/SEE-work/static/run_ampl INSTALLED
> INTEL32::LINUX null
>
>
> Mihael Hategan wrote:
> > Your jobs seem to not have a walltime specified. Can you post your
> > tc.data?
> >
> > On Thu, 2009-06-11 at 10:37 -0500, Zhao Zhang wrote:
> >
> >> Hi, Mihael
> >>
> >> The coaster log is at /home/zzhang/see/logs/coasters.log. The latest
> >> record should be the run that failed last night.
> >>
> >> best
> >> zhao
> >>
> >> Mihael Hategan wrote:
> >>
> >>> On Thu, 2009-06-11 at 09:24 -0500, Zhao Zhang wrote:
> >>>
> >>>
> >>>> Hi, Mike and Mihael
> >>>>
> >>>> Here is the error, I think this is related to the job wall time of
> >>>> coaster settings.
> >>>>
> >>>> Mihael, could you give me some suggestions on how to set the parameters
> >>>> for coasters on ranger?
> >>>>
> >>>>
> >>> I need to know what the problem is first. And for that I need to take a
> >>> look at the coaster log (and possibly gram logs). So if you could copy
> >>> that to some shared space in the CI, that would be good.
> >>>
> >>>
> >>>
> >>>> For now I am running 100 jobs, each job could take 2~3 hours. Thanks.
> >>>>
> >>>> best
> >>>> zhao
> >>>>
> >>>> Execution failed:
> >>>> Exception in run_ampl:
> >>>> Arguments: [run70, template, armington.mod, armington_process.cmd,
> >>>> armington_ou\
> >>>> tput.cmd, subproblems/producer_tree.mod, ces.so]
> >>>> Host: tgtacc
> >>>> Directory: ampl-20090611-0122-hzktisu5/jobs/h/run_ampl-h92ap3cj
> >>>> stderr.txt:
> >>>>
> >>>> stdout.txt:
> >>>> ----
> >>>>
> >>>> Caused by:
> >>>> Shutting down worker
> >>>> Cleaning up...
> >>>> Shutting down service at https://129.114.50.163:58556
> >>>>
> >>>> And here is my sites.xml
> >>>> bash-3.00$ cat tgranger-sge-gram2.xml
> >>>>
> >>>>
> >>>>
> >>>> >>>> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
> >>>>
> >>>> TG-CCR080022N
> >>>> /work/00946/zzhang/work
> >>>> >>>> key="SWIFT_JOBDIR_PATH">/tmp/zzhang/jobdir
> >>>> 16
> >>>> development
> >>>> 100
> >>>> 10
> >>>> 20
> >>>> 5
> >>>> 1
> >>>> 5
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >
> >
> >
From zhaozhang at uchicago.edu Thu Jun 11 13:13:34 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 11 Jun 2009 13:13:34 -0500
Subject: [Swift-devel] Re: coaster error on ranger
In-Reply-To: <1244743761.24254.1.camel@localhost>
References:
<4A300069.6030103@anl.gov>
<4A3005CF.2090107@mcs.anl.gov>
<4A301544.4030806@uchicago.edu>
<4A3074B6.3070003@uchicago.edu> <4A3083E9.2040603@mcs.anl.gov>
<4A30A404.7050004@uchicago.edu> <4A30EDBD.50108@mcs.anl.gov>
<4A3113B9.9080303@uchicago.edu>
<1244733743.18728.1.camel@localhost>
<4A31249E.4060806@uchicago.edu>
<1244741554.23235.0.camel@localhost>
<4A31473A.1090006@uchicago.edu>
<1244743761.24254.1.camel@localhost>
Message-ID: <4A31494E.8060209@uchicago.edu>
Hi, Mihael
Actually, I have no idea how long these jobs would run. Some of them
took only ~10 minutes, and some ran far longer than that.
If I set the walltime to 120 minutes, what will happen when the wall
time is up but the job hasn't finished?
120
zhao
Mihael Hategan wrote:
> On Thu, 2009-06-11 at 13:04 -0500, Zhao Zhang wrote:
>
>> No, I don't specify any wall time.
>>
>
> Well, you need to specify one.
>
>
>> The last entry is for the run_ampl script.
>>
>> zhao
>>
>> login3% cat tc.data
>> #This is the transformation catalog.
>> #
>> #It comes pre-configured with a number of simple transformations with
>> #paths that are likely to work on a linux box. However, on some systems,
>> #the paths to these executables will be different (for example, sometimes
>> #some of these programs are found in /usr/bin rather than in /bin)
>> #
>> #NOTE WELL: fields in this file must be separated by tabs, not spaces; and
>> #there must be no trailing whitespace at the end of each line.
>> #
>> # sitename transformation path INSTALLED platform profiles
>> bgps echo /bin/echo INSTALLED INTEL32::LINUX null
>> bgp000 cat /bin/cat INSTALLED INTEL32::LINUX null
>> localhost sleep /bin/sleep INSTALLED
>> INTEL32::LINUX null
>> localhost echo /bin/echo INSTALLED
>> INTEL32::LINUX null
>> localhost ls /bin/ls INSTALLED
>> INTEL32::LINUX null
>> localhost wc /bin/wc INSTALLED
>> INTEL32::LINUX null
>> localhost grep /bin/grep INSTALLED
>> INTEL32::LINUX null
>> localhost sort /bin/sort INSTALLED
>> INTEL32::LINUX null
>> localhost paste /bin/paste INSTALLED
>> INTEL32::LINUX null
>> localhost date /bin/date INSTALLED
>> INTEL32::LINUX null
>> localhost db /home/wilde/angle/data/db
>> INSTALLED INTEL32::LINUX null
>> localhost set1 /home/wilde/angle/data/set1
>> INSTALLED INTEL32::LINUX null
>> localhost set3 /home/wilde/angle/data/set3
>> INSTALLED INTEL32::LINUX null
>> localhost run_ampl
>> /share/home/00946/zzhang/SEE-work/static/run_ampl INSTALLED
>> INTEL32::LINUX null
>> tgtacc run_ampl
>> /share/home/00946/zzhang/SEE-work/static/run_ampl INSTALLED
>> INTEL32::LINUX null
>>
>>
>> Mihael Hategan wrote:
>>
>>> Your jobs seem to not have a walltime specified. Can you post your
>>> tc.data?
>>>
>>> On Thu, 2009-06-11 at 10:37 -0500, Zhao Zhang wrote:
>>>
>>>
>>>> Hi, Mihael
>>>>
>>>> The coaster log is at /home/zzhang/see/logs/coasters.log. The latest
>>>> record should be the run that failed last night.
>>>>
>>>> best
>>>> zhao
>>>>
>>>> Mihael Hategan wrote:
>>>>
>>>>
>>>>> On Thu, 2009-06-11 at 09:24 -0500, Zhao Zhang wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Hi, Mike and Mihael
>>>>>>
>>>>>> Here is the error, I think this is related to the job wall time of
>>>>>> coaster settings.
>>>>>>
>>>>>> Mihael, could you give me some suggestions on how to set the parameters
>>>>>> for coasters on ranger?
>>>>>>
>>>>>>
>>>>>>
>>>>> I need to know what the problem is first. And for that I need to take a
>>>>> look at the coaster log (and possibly gram logs). So if you could copy
>>>>> that to some shared space in the CI, that would be good.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> For now I am running 100 jobs, each job could take 2~3 hours. Thanks.
>>>>>>
>>>>>> best
>>>>>> zhao
>>>>>>
>>>>>> Execution failed:
>>>>>> Exception in run_ampl:
>>>>>> Arguments: [run70, template, armington.mod, armington_process.cmd,
>>>>>> armington_ou\
>>>>>> tput.cmd, subproblems/producer_tree.mod, ces.so]
>>>>>> Host: tgtacc
>>>>>> Directory: ampl-20090611-0122-hzktisu5/jobs/h/run_ampl-h92ap3cj
>>>>>> stderr.txt:
>>>>>>
>>>>>> stdout.txt:
>>>>>> ----
>>>>>>
>>>>>> Caused by:
>>>>>> Shutting down worker
>>>>>> Cleaning up...
>>>>>> Shutting down service at https://129.114.50.163:58556
>>>>>>
>>>>>> And here is my sites.xml
>>>>>> bash-3.00$ cat tgranger-sge-gram2.xml
>>>>>>
>>>>>>
>>>>>>
>>>>>> >>>>> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
>>>>>>
>>>>>> TG-CCR080022N
>>>>>> /work/00946/zzhang/work
>>>>>> >>>>> key="SWIFT_JOBDIR_PATH">/tmp/zzhang/jobdir
>>>>>> 16
>>>>>> development
>>>>>> 100
>>>>>> 10
>>>>>> 20
>>>>>> 5
>>>>>> 1
>>>>>> 5
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>
>
>
>
From hategan at mcs.anl.gov Thu Jun 11 13:24:58 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 11 Jun 2009 13:24:58 -0500
Subject: [Swift-devel] Re: coaster error on ranger
In-Reply-To: <4A31494E.8060209@uchicago.edu>
References:
<4A300069.6030103@anl.gov>
<4A3005CF.2090107@mcs.anl.gov>
<4A301544.4030806@uchicago.edu>
<4A3074B6.3070003@uchicago.edu> <4A3083E9.2040603@mcs.anl.gov>
<4A30A404.7050004@uchicago.edu> <4A30EDBD.50108@mcs.anl.gov>
<4A3113B9.9080303@uchicago.edu> <1244733743.18728.1.camel@localhost>
<4A31249E.4060806@uchicago.edu> <1244741554.23235.0.camel@localhost>
<4A31473A.1090006@uchicago.edu> <1244743761.24254.1.camel@localhost>
<4A31494E.8060209@uchicago.edu>
Message-ID: <1244744698.24446.7.camel@localhost>
On Thu, 2009-06-11 at 13:13 -0500, Zhao Zhang wrote:
> Hi, Mihael
>
> Actually, I have no idea how long would these jobs run. Some of them
> just took ~10 minutes, and some of them went far more than this.
> What if I set the wall to 120 minutes, what will happen when the wall
> time is up but the job doesn't finish?
You'll probably send another message to the mailing list saying that
things don't work properly. And I'll ask you again to gather all logs.
Then I'll tell you the same thing, which is that you should set a proper
maxwalltime for the job.
> 120
Read the swift documentation, in particular
http://www.ci.uchicago.edu/swift/guides/userguide.php#profile.globus
(how to specify the maxwalltime and what it means, and the meaning of
"maxtime" which has nothing to do with your job's maximum walltime).
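For concreteness, a hedged sketch of what that looks like in tc.data: the
trailing profile column (currently "null" in the entries earlier in this
thread) can carry per-transformation Globus profile keys. The value 120 here
is purely illustrative; check the userguide section linked above for the
exact syntax:

```
tgtacc	run_ampl	/share/home/00946/zzhang/SEE-work/static/run_ampl	INSTALLED	INTEL32::LINUX	GLOBUS::maxwalltime=120
```

(Fields must be tab-separated, per the NOTE WELL comment in the catalog
header itself.)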
From bugzilla-daemon at mcs.anl.gov Thu Jun 11 15:02:57 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu, 11 Jun 2009 15:02:57 -0500 (CDT)
Subject: [Swift-devel] [Bug 212] New: support for multiple arguments
Message-ID:
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=212
Summary: support for multiple arguments
Product: Swift
Version: unspecified
Platform: PC
OS/Version: Linux
Status: NEW
Severity: enhancement
Priority: P2
Component: SwiftScript language
AssignedTo: benc at hawaga.org.uk
ReportedBy: aespinosa at cs.uchicago.edu
Created an attachment (id=287)
--> (http://bugzilla.mcs.anl.gov/swift/attachment.cgi?id=287)
job with 10k arguments
The attached job is an invocation with a large number of arguments (10k). We can
generalize these kinds of jobs as "summarizers".
This most likely occurs because of the shell's limit on the number of arguments
when a job is invoked by _swiftwrap. A MapReduce-like approach that reduces the
data a few chunks at a time would be one possible solution.
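For concreteness, a small shell sketch of the limit being described and the
usual chunking workaround (xargs stands in here for the MapReduce-like
approach; the file-N names are made up):

```shell
# The per-process argument-space limit that a 10k-argument invocation
# can run into when the shell builds the command line:
getconf ARG_MAX

# xargs splits one logical call into several real invocations, each
# under the limit; here 10000 made-up names become 5 calls to echo.
seq 1 10000 | sed 's/^/file-/' | xargs -n 2000 echo | wc -l
```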
--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
From bugzilla-daemon at mcs.anl.gov Thu Jun 11 15:10:50 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu, 11 Jun 2009 15:10:50 -0500 (CDT)
Subject: [Swift-devel] [Bug 212] support for lots of arguments
In-Reply-To:
References:
Message-ID: <20090611201050.A973B2B886@wind.mcs.anl.gov>
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=212
Allan Espinosa changed:
What |Removed |Added
----------------------------------------------------------------------------
Summary|support for multiple |support for lots of
|arguments |arguments
--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
From benc at hawaga.org.uk Fri Jun 12 13:21:08 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 12 Jun 2009 18:21:08 +0000 (GMT)
Subject: [Swift-devel] swift in softenv at CI
Message-ID:
For people using CI maintained login machines, Swift is now available in
softenv. To get Swift 0.9 in the same way that you get other CI software,
add the line:
@swift
to the start of your ~/.soft file.
The full commentary from CI support is:
> There's a @swift macro which is recommended to use before the @default
> macro and that will pull in the Sun Java for that OS and swift. And
> there's also a +swift if someone wants to try out a different Java.
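The CI note above boils down to a one-line edit; a sketch in shell (assuming
a GNU userland; resoft is SoftEnv's command to re-read ~/.soft, or you can
just log in again):

```shell
# Put @swift at the top of ~/.soft, ahead of @default, if not already there.
soft="$HOME/.soft"
touch "$soft"
grep -qx '@swift' "$soft" || { printf '@swift\n' | cat - "$soft" > "$soft.new" && mv "$soft.new" "$soft"; }
# resoft   # re-read the environment (or log out and back in)
```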
--
From aespinosa at cs.uchicago.edu Fri Jun 12 15:33:24 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Fri, 12 Jun 2009 15:33:24 -0500
Subject: [Swift-devel] Re: block coasters not registering on proper queue
In-Reply-To: <50b07b4b0906081424jeab39a3kdb7b22aec84c1328@mail.gmail.com>
References: <50b07b4b0906081424jeab39a3kdb7b22aec84c1328@mail.gmail.com>
Message-ID: <50b07b4b0906121333s62ea38f6hbfc8835c534af4b6@mail.gmail.com>
I did a rebuild (ant redist) today and everything looks to be working
fine. It seems some files from my previous build were not updated
properly. I was using rsync to copy files from swift-svn to another
directory; I guess that was a bad idea.
./runtest.sh
Swift svn swift-r2949 cog-r2406
RunID: coasterrun
Progress: uninitialized:1
Progress: Submitted:1
Progress: Active:1
Final status: Finished successfully:1
Cleaning up...
Shutting down service at https://128.135.125.117:43627
Got channel MetaChannel: 1910518671 -> GSSSChannel-null(1)
- Done
qstat:
1095930.tp-mgt null aespinosa 0 R fast
-Allan
2009/6/8 Allan Espinosa :
> Is there a default maxwalltime being submitted to the LRM if nothing
> is specified? I made this configuration to use the "fast" queue in
> sites.xml but I keep getting placed in the "extended" queue.
>
> sites.xml
>
>
>
> jobmanager="gt2:gt2:pbs" />
> fast
> /home/aespinosa/work
> 50
> 10
> 20
>
>
>
> gram log snippet:
> ...
> ...
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: New Perl JobManager created.
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Using jm supplied job dir:
> /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/18222.1244495960
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Entering pbs submit
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: NFS sync for /usr/bin/perl failed
> (may be harmless): Operation not permitted
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Sent NFS sync for /usr/bin/perl
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: NFS sync for /dev/null failed (may
> be harmless): Operation not permitted
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Sent NFS sync for /dev/null
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Determining job max time cpu from
> job description
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: using queue default
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Determining job max wall time
> limit from job description
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: using maxwalltime of 60
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Building job script
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Using jm supplied job dir:
> /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/18222.1244495960
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument
> "/home/aespinosa/.globus/coasters/cscript6820117662705473060.pl"
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to
> "/home/aespinosa/.globus/coasters/cscript6820117662705473060.pl"
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument
> "http://128.135.125.118:56015"
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to
> "http://128.135.125.118:56015"
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument "0608-190415-000000"
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to "0608-190415-000000"
> ...
> ...
>
> $ grep fast gram*.log:
> gram_job_mgr_18021.log:6/8 16:19:15 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18021.log:6/8 16:19:25 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18021.log:6/8 16:19:36 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:19:22 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:19:32 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:19:43 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:19:53 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:20:03 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:20:14 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:20:24 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:20:34 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:20:45 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:20:55 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:21:05 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:21:16 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:21:26 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:21:36 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:21:47 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:21:57 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:22:08 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:22:18 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:22:28 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:22:39 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:22:49 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:23:00 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:23:10 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:23:20 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
>
>
> Swift version: Swift svn swift-r2949 cog-r2406
>
> -Allan
>
--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago
From bugzilla-daemon at mcs.anl.gov Fri Jun 12 15:35:06 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Fri, 12 Jun 2009 15:35:06 -0500 (CDT)
Subject: [Swift-devel] [Bug 211] block coasters not registering on proper
queue
In-Reply-To:
References:
Message-ID: <20090612203506.C209C2CB39@wind.mcs.anl.gov>
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=211
Allan Espinosa changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |INVALID
--- Comment #1 from Allan Espinosa 2009-06-12 15:35:06 ---
I made a mistake in my build scripts. Apparently rsync is not the way to copy
builds from swift-svn to another directory.
--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
From hategan at mcs.anl.gov Fri Jun 12 19:20:17 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 12 Jun 2009 19:20:17 -0500
Subject: [Swift-devel] Re: block coasters not registering on proper queue
In-Reply-To: <50b07b4b0906121333s62ea38f6hbfc8835c534af4b6@mail.gmail.com>
References: <50b07b4b0906081424jeab39a3kdb7b22aec84c1328@mail.gmail.com>
<50b07b4b0906121333s62ea38f6hbfc8835c534af4b6@mail.gmail.com>
Message-ID: <1244852417.10588.1.camel@localhost>
There's a chance an intermittent coaster bug still exists on this issue.
So it would be useful to test the same configuration some more.
On Fri, 2009-06-12 at 15:33 -0500, Allan Espinosa wrote:
> I did a rebuild (ant redist) today and it looks like everything is
> working fine. It looks like some files during my previous build were
> not updated properly. I was using rsync to copy files from swift-svn
> to another directory. i guess that was a bad idea.
>
> ./runtest.sh
> Swift svn swift-r2949 cog-r2406
>
> RunID: coasterrun
> Progress: uninitialized:1
> Progress: Submitted:1
> Progress: Active:1
> Final status: Finished successfully:1
> Cleaning up...
> Shutting down service at https://128.135.125.117:43627
> Got channel MetaChannel: 1910518671 -> GSSSChannel-null(1)
> - Done
>
> qstat:
> 1095930.tp-mgt null aespinosa 0 R fast
>
> -Allan
>
>
> 2009/6/8 Allan Espinosa :
> > Is there a default maxwalltime being submitted to the LRM if nothing
> > is specified? I made this configuration to use the "fast" queue in
> > sites.xml but I keep getting placed in the "extended" queue.
> >
> > sites.xml
> >
> >
> >
> > > jobmanager="gt2:gt2:pbs" />
> > fast
> > /home/aespinosa/work
> > 50
> > 10
> > 20
> >
> >
> >
> > gram log snippet:
> > ...
> > ...
> > Mon Jun 8 16:19:21 2009 JM_SCRIPT: New Perl JobManager created.
> > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Using jm supplied job dir:
> > /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/18222.1244495960
> > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Entering pbs submit
> > Mon Jun 8 16:19:21 2009 JM_SCRIPT: NFS sync for /usr/bin/perl failed
> > (may be harmless): Operation not permitted
> > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Sent NFS sync for /usr/bin/perl
> > Mon Jun 8 16:19:21 2009 JM_SCRIPT: NFS sync for /dev/null failed (may
> > be harmless): Operation not permitted
> > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Sent NFS sync for /dev/null
> > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Determining job max time cpu from
> > job description
> > Mon Jun 8 16:19:21 2009 JM_SCRIPT: using queue default
> > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Determining job max wall time
> > limit from job description
> > Mon Jun 8 16:19:21 2009 JM_SCRIPT: using maxwalltime of 60
> > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Building job script
> > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Using jm supplied job dir:
> > /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/18222.1244495960
> > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument
> > "/home/aespinosa/.globus/coasters/cscript6820117662705473060.pl"
> > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to
> > "/home/aespinosa/.globus/coasters/cscript6820117662705473060.pl"
> > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument
> > "http://128.135.125.118:56015"
> > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to
> > "http://128.135.125.118:56015"
> > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument "0608-190415-000000"
> > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to "0608-190415-000000"
> > ...
> > ...
> >
> > $ grep fast gram*.log:
> > gram_job_mgr_18021.log:6/8 16:19:15 JMI: poll_fast: returning -1 =
> > GLOBUS_FAILURE (try Perl scripts)
> > gram_job_mgr_18021.log:6/8 16:19:25 JMI: poll_fast: returning -1 =
> > GLOBUS_FAILURE (try Perl scripts)
> > gram_job_mgr_18021.log:6/8 16:19:36 JMI: poll_fast: returning -1 =
> > GLOBUS_FAILURE (try Perl scripts)
> > gram_job_mgr_18222.log:6/8 16:19:22 JMI: poll_fast: returning -1 =
> > GLOBUS_FAILURE (try Perl scripts)
> > gram_job_mgr_18222.log:6/8 16:19:32 JMI: poll_fast: returning -1 =
> > GLOBUS_FAILURE (try Perl scripts)
> > gram_job_mgr_18222.log:6/8 16:19:43 JMI: poll_fast: returning -1 =
> > GLOBUS_FAILURE (try Perl scripts)
> > gram_job_mgr_18222.log:6/8 16:19:53 JMI: poll_fast: returning -1 =
> > GLOBUS_FAILURE (try Perl scripts)
> > gram_job_mgr_18222.log:6/8 16:20:03 JMI: poll_fast: returning -1 =
> > GLOBUS_FAILURE (try Perl scripts)
> > gram_job_mgr_18222.log:6/8 16:20:14 JMI: poll_fast: returning -1 =
> > GLOBUS_FAILURE (try Perl scripts)
> > gram_job_mgr_18222.log:6/8 16:20:24 JMI: poll_fast: returning -1 =
> > GLOBUS_FAILURE (try Perl scripts)
> > gram_job_mgr_18222.log:6/8 16:20:34 JMI: poll_fast: returning -1 =
> > GLOBUS_FAILURE (try Perl scripts)
> > gram_job_mgr_18222.log:6/8 16:20:45 JMI: poll_fast: returning -1 =
> > GLOBUS_FAILURE (try Perl scripts)
> > gram_job_mgr_18222.log:6/8 16:20:55 JMI: poll_fast: returning -1 =
> > GLOBUS_FAILURE (try Perl scripts)
> > gram_job_mgr_18222.log:6/8 16:21:05 JMI: poll_fast: returning -1 =
> > GLOBUS_FAILURE (try Perl scripts)
> > gram_job_mgr_18222.log:6/8 16:21:16 JMI: poll_fast: returning -1 =
> > GLOBUS_FAILURE (try Perl scripts)
> > gram_job_mgr_18222.log:6/8 16:21:26 JMI: poll_fast: returning -1 =
> > GLOBUS_FAILURE (try Perl scripts)
> > gram_job_mgr_18222.log:6/8 16:21:36 JMI: poll_fast: returning -1 =
> > GLOBUS_FAILURE (try Perl scripts)
> > gram_job_mgr_18222.log:6/8 16:21:47 JMI: poll_fast: returning -1 =
> > GLOBUS_FAILURE (try Perl scripts)
> > gram_job_mgr_18222.log:6/8 16:21:57 JMI: poll_fast: returning -1 =
> > GLOBUS_FAILURE (try Perl scripts)
> > gram_job_mgr_18222.log:6/8 16:22:08 JMI: poll_fast: returning -1 =
> > GLOBUS_FAILURE (try Perl scripts)
> > gram_job_mgr_18222.log:6/8 16:22:18 JMI: poll_fast: returning -1 =
> > GLOBUS_FAILURE (try Perl scripts)
> > gram_job_mgr_18222.log:6/8 16:22:28 JMI: poll_fast: returning -1 =
> > GLOBUS_FAILURE (try Perl scripts)
> > gram_job_mgr_18222.log:6/8 16:22:39 JMI: poll_fast: returning -1 =
> > GLOBUS_FAILURE (try Perl scripts)
> > gram_job_mgr_18222.log:6/8 16:22:49 JMI: poll_fast: returning -1 =
> > GLOBUS_FAILURE (try Perl scripts)
> > gram_job_mgr_18222.log:6/8 16:23:00 JMI: poll_fast: returning -1 =
> > GLOBUS_FAILURE (try Perl scripts)
> > gram_job_mgr_18222.log:6/8 16:23:10 JMI: poll_fast: returning -1 =
> > GLOBUS_FAILURE (try Perl scripts)
> > gram_job_mgr_18222.log:6/8 16:23:20 JMI: poll_fast: returning -1 =
> > GLOBUS_FAILURE (try Perl scripts)
> >
> >
> > Swift version: Swift svn swift-r2949 cog-r2406
> >
> > -Allan
> >
>
>
>
From benc at hawaga.org.uk Mon Jun 15 05:37:42 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 15 Jun 2009 10:37:42 +0000 (GMT)
Subject: [Swift-devel] pc3 swift slides
Message-ID:
Although I think they are of minimal interest to most, here are the slides
I presented in the Swift slot at PC3 last week at the Universiteit van
Amsterdam.
http://www.ci.uchicago.edu/~benc/pc3-swift-slides.pdf
A substantially meatier technical report on Swift vs PC3 should
appear later.
--
From zhaozhang at uchicago.edu Mon Jun 15 16:48:01 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Mon, 15 Jun 2009 16:48:01 -0500
Subject: [Swift-devel] Two problems regarding coaster
Message-ID: <4A36C191.9050701@uchicago.edu>
Hi, Mihael
I encountered two problems with coasters.
1. The log file coasters.log grows too fast. For a two-hour run,
the log file can reach 5 GB,
and Swift will fail if there is no space left for the coaster service to
write logs.
2. On Ranger there are two file systems that I am using now: one is
$HOME, the other is $WORK,
with quotas of 6 GB and 350 GB respectively.
By default coasters.log is produced at $HOME/.globus/coaster; I set
up a symbolic link there that actually
points to a place in $WORK.
This fixes the problem I saw on Sunday, but a new one has come up. I
am not sure why Swift failed;
could you help find out whether it is because the job reached the
maxwalltime, or something else? My
sites.xml is at the end of the email.
The worker logs, gram logs, swift logs, and standard output are on the CI
network, /home/zzhang/ranger-logs/2009-06-15
The coasters.log is too big; if you need it, I will try to trim
it down. Let me know.
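The symlink setup in item 2 can be sketched like this (paths as written in
this mail; the $WORK default exists only so the sketch is self-contained, and
the rm -rf assumes the old log directory is disposable):

```shell
# Host the coaster log directory on the large-quota $WORK file system,
# while keeping the path the service expects under $HOME.
: "${WORK:=$HOME/work-fs}"        # Ranger defines $WORK; default for testing
big="$WORK/coaster-logs"
small="$HOME/.globus/coaster"     # path as written above
mkdir -p "$big" "$(dirname "$small")"
[ -L "$small" ] || { rm -rf "$small"; ln -s "$big" "$small"; }
```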
best
zhao
From aespinosa at cs.uchicago.edu Mon Jun 15 18:37:54 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Mon, 15 Jun 2009 18:37:54 -0500
Subject: [Swift-devel] coaster error on ranger
In-Reply-To: <4A3113B9.9080303@uchicago.edu>
References:
<4A301544.4030806@uchicago.edu>
<4A3074B6.3070003@uchicago.edu> <4A3083E9.2040603@mcs.anl.gov>
<4A30A404.7050004@uchicago.edu> <4A30EDBD.50108@mcs.anl.gov>
<4A3113B9.9080303@uchicago.edu>
Message-ID: <50b07b4b0906151637t7dd7653eu29a2ab0d17f76153@mail.gmail.com>
Isn't coastersPerNode already deprecated as a configuration parameter?
2009/6/11 Zhao Zhang
> Hi, Mike and Mihael
>
> Here is the error, I think this is related to the job wall time of coaster
> settings.
>
> Mihael, could you give me some suggestions on how to set the parameters for
> coasters on ranger?
> For now I am running 100 jobs, each job could take 2~3 hours. Thanks.
>
> best
> zhao
>
> Execution failed:
> Exception in run_ampl:
> Arguments: [run70, template, armington.mod, armington_process.cmd,
> armington_ou\
> tput.cmd, subproblems/producer_tree.mod, ces.so]
> Host: tgtacc
> Directory: ampl-20090611-0122-hzktisu5/jobs/h/run_ampl-h92ap3cj
> stderr.txt:
>
> stdout.txt:
> ----
>
> Caused by:
> Shutting down worker
> Cleaning up...
> Shutting down service at https://129.114.50.163:58556
>
> And here is my sites.xml
> bash-3.00$ cat tgranger-sge-gram2.xml
>
>
>
> jobManager="gt2:gt2:SGE"/>
>
> TG-CCR080022N
> /work/00946/zzhang/work
> key="SWIFT_JOBDIR_PATH">/tmp/zzhang/jobdir
> 16
> development
> 100
> 10
> 20
> 5
> 1
> 5
>
>
>
>
--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago
From benc at hawaga.org.uk Tue Jun 16 03:53:53 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 16 Jun 2009 08:53:53 +0000 (GMT)
Subject: [Swift-devel] passing very long lists of files to applications
Message-ID:
Some applications have a problem where a component program is to process a
very large number of files.
Swift can deal with files in two basic ways at the moment:
i) explicitly pass the filenames on the commandline
ii) stage the files into the job input directory without explicitly
naming the files on the commandline, with the component program
inspecting the working directory on the execution side to decide
which files to process.
i) has the disadvantage that the commandline limits the number of
filenames that can be passed
ii) has the disadvantage that the component program must be able to
distinguish which of its working-directory files are the relevant input
files.
A further option which could be implemented is to provide the ability to
write out a list of filenames into a file, and have that file staged as
input. This needs the component program to be able to take a list of files
from a file rather than from the command line (for example, the -T option
of tar).
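Outside Swift, the same pattern done by hand looks like this (a sketch; the
file names and filelist.txt are illustrative):

```shell
# "writeData" step: put the file names, one per line, into a list file.
mkdir -p listdemo && cd listdemo
for i in 1 2 3; do echo "data $i" > "input-$i.dat"; done
printf '%s\n' input-*.dat > filelist.txt

# Consumer reads names from the file instead of the command line.
tar -cf archive.tar -T filelist.txt
tar -tf archive.tar        # lists input-1.dat input-2.dat input-3.dat
```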
This could be implemented, I think, by providing a writeData procedure
which is the inverse of readData, and writing something like this:

file l = writeData(@filenames(f));
p(l, f);

app p(file l, file f[]) {
    myproc "-T" @l;
}

comments?
--
From benc at hawaga.org.uk Tue Jun 16 05:10:59 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 16 Jun 2009 10:10:59 +0000 (GMT)
Subject: [Swift-devel] passing very long lists of files to applications
In-Reply-To:
References:
Message-ID:
a related idea to this when you are using some component program that is
summarising data and is associative (and maybe other properties) in its
work is that the associativity could be indicated to Swift, and Swift
could then make use of that to generate an arbitrary number of app calls.
For example, the numerical operations max or sum fit this, but mean
does not.
max (100,8,1,1,33,8,7,423,46,2,222) = max( max(100,8,1,1), max(33,8),
max(7,423,46), max(2,222) )
so it's possible to evaluate the max without any individual invocation
having more than 4 parameters.
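A shell sketch of that chunked decomposition (sort -n | tail -n 1 plays the
role of max, and the batch size of 4 matches the example; the chunk
boundaries differ from the grouping above, which is exactly what
associativity permits):

```shell
# Reduce each 4-element chunk to its max, then reduce the partial
# maxima once more; associativity guarantees the same answer.
vals='100 8 1 1 33 8 7 423 46 2 222'
printf '%s\n' $vals \
  | xargs -n 4 sh -c 'printf "%s\n" "$@" | sort -n | tail -n 1' _ \
  | sort -n | tail -n 1        # prints 423
```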
This fits in quite nicely with ideas of having Swift programs expressed
more functionally, and having Swift able to make its own decisions about
exactly how things are run.
I don't think this is going to be something that goes in the language
soon, but if anyone happens to pursue the functional direction further,
this is a case that should be kept in mind.
--
From aespinosa at cs.uchicago.edu Tue Jun 16 17:30:03 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Tue, 16 Jun 2009 17:30:03 -0500
Subject: [Swift-devel] more active processes than requested cores
Message-ID: <50b07b4b0906161530g9c37926i610c83901dc9f914@mail.gmail.com>
Given the throttling parameters below, I do expect to have a thousand
jobs active at a time. But shouldn't the coaster service request larger
blocks to accommodate the 277 active jobs?
sge snapshot:
ACTIVE JOBS--------------------------
JOBID JOBNAME USERNAME STATE CORE REMAINING STARTTIME
================================================================================
779616 data tg802895 Running 16 00:36:01 Tue Jun 16 15:59:41
779723 data tg802895 Running 16 01:44:01 Tue Jun 16 17:07:41
779724 data tg802895 Running 16 01:44:01 Tue Jun 16 17:07:41
779727 data tg802895 Running 16 01:45:58 Tue Jun 16 17:09:38
swift session snippet
Progress: Selecting site:38 Submitted:707 Active:278 Finished
successfully:1861
Progress: Selecting site:38 Submitted:707 Active:277 Checking
status:1 Finished successfully:1861
sites.xml
TG-CCR080022N
/work/01035/tg802895/blast-runs
16
development
4
00:30:00
2
2
10
I'll send the swift and coaster logs once the run finishes.
-Allan
--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago
From benc at hawaga.org.uk Tue Jun 16 17:33:54 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 16 Jun 2009 22:33:54 +0000 (GMT)
Subject: [Swift-devel] more active processes than requested cores
In-Reply-To: <50b07b4b0906161530g9c37926i610c83901dc9f914@mail.gmail.com>
References: <50b07b4b0906161530g9c37926i610c83901dc9f914@mail.gmail.com>
Message-ID:
Can you compare with the post-processed logs (especially the info/worker
logs, not the execution-layer stats) rather than the runtime counter?
The runtime counter necessarily relies on real-time delivery of status
changes; the post-processed wrapper logs do not.
So maybe this is too many jobs running at once; maybe it is delayed
statistics updates (as has been discussed here).
You need to turn on the always-transfer-wrapper-logs option in the
config file to get all the wrapper logs back, if you don't already have
that.
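For reference, the option Ben mentions is a swift.properties setting; if
I recall the name correctly from this era of Swift (worth verifying
against your release's user guide before relying on it), it is:

```
# swift.properties -- property name assumed from memory; verify it
# against your Swift version's documentation
wrapperlog.always.transfer=true
```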
On Tue, 16 Jun 2009, Allan Espinosa wrote:
> By the throttling parameters below, i do expect to have a thousand
> jobs active at a time. But shouldn't the coaster request larger
> blocks to accommodate the 277 active jobs?
>
> sge snapshot:
> ACTIVE JOBS--------------------------
> JOBID JOBNAME USERNAME STATE CORE REMAINING STARTTIME
> ================================================================================
> 779616 data tg802895 Running 16 00:36:01 Tue Jun 16 15:59:41
> 779723 data tg802895 Running 16 01:44:01 Tue Jun 16 17:07:41
> 779724 data tg802895 Running 16 01:44:01 Tue Jun 16 17:07:41
> 779727 data tg802895 Running 16 01:45:58 Tue Jun 16 17:09:38
>
>
> swift session snipper
> Progress: Selecting site:38 Submitted:707 Active:278 Finished
> successfully:1861
> Progress: Selecting site:38 Submitted:707 Active:277 Checking
> status:1 Finished successfully:1861
>
>
> sites.xml
>
>
>
> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
> TG-CCR080022N
> /work/01035/tg802895/blast-runs
> 16
> development
> 4
> 00:30:00
> 2
> 2
> 10
>
>
>
> i'll send the swift and coaster logs once the run finishes.
>
> -Allan
>
>
>
From hategan at mcs.anl.gov Wed Jun 17 07:12:55 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 17 Jun 2009 07:12:55 -0500
Subject: [Swift-devel] Two problems regarding coaster
In-Reply-To: <4A36C191.9050701@uchicago.edu>
References: <4A36C191.9050701@uchicago.edu>
Message-ID: <1245240775.8776.3.camel@localhost>
On Mon, 2009-06-15 at 16:48 -0500, Zhao Zhang wrote:
> Hi, Mihael
>
> I encountered two problems on coasters.
> 1. The log file coasters.log is increasing too fast. For a two hour run,
> the log file could be 5GB.
> And swift would fail if there is no space for coaster to produce logs.
That's the temporary price we pay while the code hasn't been tested
much, so that we can find bugs quickly.
From hategan at mcs.anl.gov Wed Jun 17 07:17:37 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 17 Jun 2009 07:17:37 -0500
Subject: [Swift-devel] more active processes than requested cores
In-Reply-To: <50b07b4b0906161530g9c37926i610c83901dc9f914@mail.gmail.com>
References: <50b07b4b0906161530g9c37926i610c83901dc9f914@mail.gmail.com>
Message-ID: <1245241057.8776.6.camel@localhost>
On Tue, 2009-06-16 at 17:30 -0500, Allan Espinosa wrote:
> By the throttling parameters below, i do expect to have a thousand
> jobs active at a time. But shouldn't the coaster request larger
> blocks to accommodate the 277 active jobs?
Not if they fit in existing blocks (either vertically or horizontally).
This is something that needs more thought, but for short jobs it seems
OK.
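As an illustration only (my own toy model, not the actual coaster
allocator code), "fitting" can be thought of roughly like this, where
"horizontally" means a free worker slot in a block and "vertically"
means enough remaining walltime:

```python
class Block:
    """Toy model of a coaster block: 'width' worker slots that live
    for 'walltime' seconds."""
    def __init__(self, width, walltime):
        self.width = width        # horizontal capacity (slots)
        self.walltime = walltime  # vertical capacity (seconds remaining)
        self.used_slots = 0

    def fits(self, job_walltime):
        # horizontally: a free slot exists; vertically: the job's
        # requested walltime fits in the block's remaining time
        return self.used_slots < self.width and job_walltime <= self.walltime

def needs_new_block(blocks, job_walltime):
    """A new block is requested only when no existing block can take the job."""
    return not any(b.fits(job_walltime) for b in blocks)

blocks = [Block(width=16, walltime=1800)]
assert not needs_new_block(blocks, 600)   # fits the existing block
assert needs_new_block(blocks, 3600)      # too long: request a new block
```

Under this model, 277 active jobs would not trigger larger block
requests as long as each job still finds a slot and enough time in the
blocks already allocated.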
>
> sge snapshot:
> ACTIVE JOBS--------------------------
> JOBID JOBNAME USERNAME STATE CORE REMAINING STARTTIME
> ================================================================================
> 779616 data tg802895 Running 16 00:36:01 Tue Jun 16 15:59:41
> 779723 data tg802895 Running 16 01:44:01 Tue Jun 16 17:07:41
> 779724 data tg802895 Running 16 01:44:01 Tue Jun 16 17:07:41
> 779727 data tg802895 Running 16 01:45:58 Tue Jun 16 17:09:38
>
>
> swift session snipper
> Progress: Selecting site:38 Submitted:707 Active:278 Finished
> successfully:1861
> Progress: Selecting site:38 Submitted:707 Active:277 Checking
> status:1 Finished successfully:1861
>
>
> sites.xml
>
>
>
> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
> TG-CCR080022N
> /work/01035/tg802895/blast-runs
> 16
> development
> 4
> 00:30:00
> 2
> 2
> 10
>
>
>
> i'll send the swift and coaster logs once the run finishes.
>
> -Allan
>
>
From zhaozhang at uchicago.edu Wed Jun 17 10:11:22 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Wed, 17 Jun 2009 10:11:22 -0500
Subject: [Swift-devel] more active processes than requested cores
In-Reply-To: <1245241057.8776.6.camel@localhost>
References: <50b07b4b0906161530g9c37926i610c83901dc9f914@mail.gmail.com>
<1245241057.8776.6.camel@localhost>
Message-ID: <4A39079A.3050001@uchicago.edu>
Hi, All
Here is something in my test case:
Swift says:
Progress: Selecting site:80 Submitted:828 Active:115 Finished in
previous run:487 Finished successfully:295
Progress: Selecting site:80 Submitted:828 Active:115 Finished in
previous run:487 Finished successfully:295
Progress: Selecting site:80 Submitted:828 Active:115 Finished in
previous run:487 Finished successfully:295
Progress: Selecting site:80 Submitted:828 Active:115 Finished in
previous run:487 Finished successfully:295
Progress: Selecting site:80 Submitted:828 Active:115 Finished in
previous run:487 Finished successfully:295
And showq -u says
login3% showq -u
ACTIVE JOBS--------------------------
JOBID JOBNAME USERNAME STATE CORE REMAINING STARTTIME
================================================================================
0 active jobs : 0 of 3828 hosts ( 0.00 %)
Why are there no active SGE jobs when swift says there are 115 active jobs?
zhao
Mihael Hategan wrote:
> On Tue, 2009-06-16 at 17:30 -0500, Allan Espinosa wrote:
>
>> By the throttling parameters below, i do expect to have a thousand
>> jobs active at a time. But shouldn't the coaster request larger
>> blocks to accommodate the 277 active jobs?
>>
>
> Not if they fit in existing blocks (either vertically or horizontally).
> This is something that should be thought of some more, but for short
> jobs it seems ok.
>
>
>> sge snapshot:
>> ACTIVE JOBS--------------------------
>> JOBID JOBNAME USERNAME STATE CORE REMAINING STARTTIME
>> ================================================================================
>> 779616 data tg802895 Running 16 00:36:01 Tue Jun 16 15:59:41
>> 779723 data tg802895 Running 16 01:44:01 Tue Jun 16 17:07:41
>> 779724 data tg802895 Running 16 01:44:01 Tue Jun 16 17:07:41
>> 779727 data tg802895 Running 16 01:45:58 Tue Jun 16 17:09:38
>>
>>
>> swift session snipper
>> Progress: Selecting site:38 Submitted:707 Active:278 Finished
>> successfully:1861
>> Progress: Selecting site:38 Submitted:707 Active:277 Checking
>> status:1 Finished successfully:1861
>>
>>
>> sites.xml
>>
>>
>>
>> > url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
>> TG-CCR080022N
>> /work/01035/tg802895/blast-runs
>> 16
>> development
>> 4
>> 00:30:00
>> 2
>> 2
>> 10
>>
>>
>>
>> i'll send the swift and coaster logs once the run finishes.
>>
>> -Allan
>>
>>
>>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
>
From aespinosa at cs.uchicago.edu Wed Jun 17 14:08:50 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 17 Jun 2009 14:08:50 -0500
Subject: [Swift-devel] more active processes than requested cores
In-Reply-To: <4A39079A.3050001@uchicago.edu>
References: <50b07b4b0906161530g9c37926i610c83901dc9f914@mail.gmail.com>
<1245241057.8776.6.camel@localhost> <4A39079A.3050001@uchicago.edu>
Message-ID: <50b07b4b0906171208h24d38d9ey64c8e6cd9708b5ce@mail.gmail.com>
I also get this after a while.
Attached are the logs from when the workflow finished. Actually, it did
not finish, because the coaster service got an out-of-memory error. This
does not happen when coasters are not used.
2009/6/17 Zhao Zhang :
> Hi, All
>
> Here is something in my test case:
>
> Swift says:
> Progress:  Selecting site:80  Submitted:828  Active:115  Finished in
> previous run:487  Finished successfully:295
> Progress:  Selecting site:80  Submitted:828  Active:115  Finished in
> previous run:487  Finished successfully:295
> Progress:  Selecting site:80  Submitted:828  Active:115  Finished in
> previous run:487  Finished successfully:295
> Progress:  Selecting site:80  Submitted:828  Active:115  Finished in
> previous run:487  Finished successfully:295
> Progress:  Selecting site:80  Submitted:828  Active:115  Finished in
> previous run:487  Finished successfully:295
>
> And showq -u says
> login3% showq -u
> ACTIVE JOBS--------------------------
> JOBID     JOBNAME    USERNAME      STATE   CORE  REMAINING  STARTTIME
> ================================================================================
>
>    0 active jobs :    0 of 3828 hosts (  0.00 %)
>
> Why there are no active SGE jobs, but swift says there are 115 active jobs?
>
> zhao
>
> Mihael Hategan wrote:
>>
>> On Tue, 2009-06-16 at 17:30 -0500, Allan Espinosa wrote:
>>
>>>
>>> By the throttling parameters below, i do expect to have a thousand
>>> jobs active at a time.  But shouldn't the coaster request larger
>>> blocks to accommodate the 277 active jobs?
>>>
>>
>> Not if they fit in existing blocks (either vertically or horizontally).
>> This is something that should be thought of some more, but for short
>> jobs it seems ok.
>>
>>
>>>
>>> sge snapshot:
>>> ACTIVE JOBS--------------------------
>>> JOBID     JOBNAME    USERNAME      STATE   CORE  REMAINING  STARTTIME
>>>
>>> ================================================================================
>>> 779616    data       tg802895      Running 16     00:36:01  Tue Jun 16
>>> 15:59:41
>>> 779723    data       tg802895      Running 16     01:44:01  Tue Jun 16
>>> 17:07:41
>>> 779724    data       tg802895      Running 16     01:44:01  Tue Jun 16
>>> 17:07:41
>>> 779727    data       tg802895      Running 16     01:45:58  Tue Jun 16
>>> 17:09:38
>>>
>>>
>>> swift session snipper
>>> Progress:  Selecting site:38  Submitted:707  Active:278  Finished
>>> successfully:1861
>>> Progress:  Selecting site:38  Submitted:707  Active:277  Checking
>>> status:1  Finished successfully:1861
>>>
>>>
>>> sites.xml
>>>
>>>
>>>
>>>    >> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
>>>    TG-CCR080022N
>>>    /work/01035/tg802895/blast-runs
>>>    16
>>>    development
>>>    4
>>>    00:30:00
>>>    2
>>>    2
>>>    10
>>>
>>>
>>>
>>> i'll send the swift and coaster logs once the run finishes.
>>>
>>> -Allan
>>>
--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bug05.tar.gz
Type: application/x-gzip
Size: 5444754 bytes
Desc: not available
URL:
From HodgessE at uhd.edu Wed Jun 17 14:16:07 2009
From: HodgessE at uhd.edu (Hodgess, Erin)
Date: Wed, 17 Jun 2009 14:16:07 -0500
Subject: [Swift-devel] updated files for tutorial
Message-ID: <70A5AC06FDB5E54482D19E1C04CDFCF307C37048@BALI.uhd.campus>
Dear Swift Development:
Please find the locations of the appropriate updated files for the tutorial (on home machine).
/home/erin/cog/modules/swift/docs
The files are tutorial.php and tutorial.html respectively.
Please let me know if I need to do further changes.
Thanks,
Erin
Erin M. Hodgess, PhD
Associate Professor
Department of Computer and Mathematical Sciences
University of Houston - Downtown
mailto: hodgesse at uhd.edu
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From aespinosa at cs.uchicago.edu Wed Jun 17 14:22:55 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 17 Jun 2009 14:22:55 -0500
Subject: [Swift-devel] more active processes than requested cores
In-Reply-To: <50b07b4b0906171208h24d38d9ey64c8e6cd9708b5ce@mail.gmail.com>
References: <50b07b4b0906161530g9c37926i610c83901dc9f914@mail.gmail.com>
<1245241057.8776.6.camel@localhost> <4A39079A.3050001@uchicago.edu>
<50b07b4b0906171208h24d38d9ey64c8e6cd9708b5ce@mail.gmail.com>
Message-ID: <50b07b4b0906171222k172bf7a1y9e9d44841c7c8b3d@mail.gmail.com>
Oops, I forgot all the wrapper logs.
This next attachment should have them.
2009/6/17 Allan Espinosa :
> I also get this after a while.
>
> Attached are the logs when the workflow finished. ?Actually it did not
> finish because the coaster got an out of memory error. ?This does not
> happen if coasters were not used.
>
> 2009/6/17 Zhao Zhang :
>> Hi, All
>>
>> Here is something in my test case:
>>
>> Swift says:
>> Progress:  Selecting site:80  Submitted:828  Active:115  Finished in
>> previous run:487  Finished successfully:295
>> Progress:  Selecting site:80  Submitted:828  Active:115  Finished in
>> previous run:487  Finished successfully:295
>> Progress:  Selecting site:80  Submitted:828  Active:115  Finished in
>> previous run:487  Finished successfully:295
>> Progress:  Selecting site:80  Submitted:828  Active:115  Finished in
>> previous run:487  Finished successfully:295
>> Progress:  Selecting site:80  Submitted:828  Active:115  Finished in
>> previous run:487  Finished successfully:295
>>
>> And showq -u says
>> login3% showq -u
>> ACTIVE JOBS--------------------------
>> JOBID     JOBNAME    USERNAME      STATE   CORE  REMAINING  STARTTIME
>> ================================================================================
>>
>>    0 active jobs :    0 of 3828 hosts (  0.00 %)
>>
>> Why there are no active SGE jobs, but swift says there are 115 active jobs?
>>
>> zhao
>>
>> Mihael Hategan wrote:
>>>
>>> On Tue, 2009-06-16 at 17:30 -0500, Allan Espinosa wrote:
>>>
>>>>
>>>> By the throttling parameters below, i do expect to have a thousand
>>>> jobs active at a time.  But shouldn't the coaster request larger
>>>> blocks to accommodate the 277 active jobs?
>>>>
>>>
>>> Not if they fit in existing blocks (either vertically or horizontally).
>>> This is something that should be thought of some more, but for short
>>> jobs it seems ok.
>>>
>>>
>>>>
>>>> sge snapshot:
>>>> ACTIVE JOBS--------------------------
>>>> JOBID     JOBNAME    USERNAME      STATE   CORE  REMAINING  STARTTIME
>>>>
>>>> ================================================================================
>>>> 779616    data       tg802895      Running 16     00:36:01  Tue Jun 16
>>>> 15:59:41
>>>> 779723    data       tg802895      Running 16     01:44:01  Tue Jun 16
>>>> 17:07:41
>>>> 779724    data       tg802895      Running 16     01:44:01  Tue Jun 16
>>>> 17:07:41
>>>> 779727    data       tg802895      Running 16     01:45:58  Tue Jun 16
>>>> 17:09:38
>>>>
>>>>
>>>> swift session snipper
>>>> Progress:  Selecting site:38  Submitted:707  Active:278  Finished
>>>> successfully:1861
>>>> Progress:  Selecting site:38  Submitted:707  Active:277  Checking
>>>> status:1  Finished successfully:1861
>>>>
>>>>
>>>> sites.xml
>>>>
>>>>
>>>>
>>>>    >>> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
>>>>    TG-CCR080022N
>>>>    /work/01035/tg802895/blast-runs
>>>>    16
>>>>    development
>>>>    4
>>>>    00:30:00
>>>>    2
>>>>    2
>>>>    10
>>>>
>>>>
>>>>
>>>> i'll send the swift and coaster logs once the run finishes.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: buf04.tar.gz
Type: application/x-gzip
Size: 5654046 bytes
Desc: not available
URL:
From benc at hawaga.org.uk Thu Jun 18 03:35:44 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 18 Jun 2009 08:35:44 +0000 (GMT)
Subject: [Swift-devel] updated files for tutorial
In-Reply-To: <70A5AC06FDB5E54482D19E1C04CDFCF307C37048@BALI.uhd.campus>
References: <70A5AC06FDB5E54482D19E1C04CDFCF307C37048@BALI.uhd.campus>
Message-ID:
You need to submit the changes to the .xml files, not the generated ones.
Do this: in your docs directory, type:
svn diff > whatever.diff
and then make that whatever.diff available here.
That gives specific information about the changes you made to the XML
files, rather than the generated PHP and HTML files - the PHP and HTML
files on the website are regenerated from the latest SVN version every
night.
To contribute to Swift, you need to have gone through the dev.globus
contributor licensing paperwork - basically, you and your employer need to
fill out a licence and give it to Gigi at Argonne (who sits in C101).
On Wed, 17 Jun 2009, Hodgess, Erin wrote:
> Dear Swift Development:
>
> Please find the locations of the appropriate updated files for the tutorial (on home machine).
> /home/erin/cog/modules/swift/docs
>
> The files are tutorial.php and tutorial.html respectively.
>
>
>
> Please let me know if I need to do further changes.
>
> Thanks,
> Erin
>
>
> Erin M. Hodgess, PhD
> Associate Professor
> Department of Computer and Mathematical Sciences
> University of Houston - Downtown
> mailto: hodgesse at uhd.edu
>
>
>
From benc at hawaga.org.uk Thu Jun 18 03:59:02 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 18 Jun 2009 08:59:02 +0000 (GMT)
Subject: [Swift-devel] Re: [metrics-dev] feasability of collecting swift
usage stats through the globus usage stats mechanism (fwd)
Message-ID:
It's been mooted a few times over the last year or so, so I enquired with
metrics-dev about using the Globus usage stats mechanism for very basic
Swift usage info. Here's the response below, in case anyone is interested
in following up.
---------- Forwarded message ----------
Date: Wed, 17 Jun 2009 09:31:19 -0600
From: Lee Liming
To: Ben Clifford
Cc: metrics-dev at globus.org
Subject: Re: [metrics-dev] feasability of collecting swift usage stats through
the globus usage stats mechanism
Absolutely yes. See http://dev.globus.org/wiki/Incubator/Metrics#FAQ for the
information you ask about here. I think all of the topics are covered in the
FAQ and linked docs. In summary, the code and mechanism needed to do this is
totally open and available. Using the CDIGS listener service itself requires
coordination with the person who operates it (currently Joe Bester) but it is
do-able. Running your own listener requires no coordination and is a good
option.
You may want to consider operating your own listener service. The "global"
CDIGS listener service is experiencing growing pains at the moment, and is not
currently (this week, maybe next) available for you to experiment with because
it's being serviced. It's also pretty heavily loaded, so your performance
(e.g., report generation) will not be stellar. It's quite easy to bring up a
listener service, and if you have control of your code deployment (the code
being reported on), you can easily configure where it sends reports. You could
even have it report to multiple listeners, such as a Swift-specific listener
*and* the CDIGS listener.
The largest challenge in running your own listener would be sustaining its
operation over time, and you will have to think a bit about what your
requirements are in that area. (How badly you want to have *every* usage
report.) If it's not vital that you have each and every usage report (but get a
good sampling, for example, and keep track of when you were vs. weren't
listening), then this should be a pretty lightweight thing to do. CDIGS has
tried to be meticulous about high availability and not losing any data, and our
record over several years is quite good, but it can be a high-stress enterprise
and requires significant attention for short (mostly unpredictable) times. I am
not 100% sure we know the return we're getting for such effort.
--- Lee
On Jun 17, 2009, at 2:39 AM, Ben Clifford wrote:
>
> I would like to investigate the feasibility of collecting basic usage
> stats for Swift through the globus usage stats mechanism. Specifically:
>
> 1) is the usage stats mechanism even open to other dev.globus projects
> (sociopolitically, not technically)
>
> 2) what actual code needs to be added to the client to send usage packets
> (is there a packaged library?)
>
> 3) what needs to happen at the receiving end to get useful reports.
>
> The sort of information I suspect being logged would be (for each run
> where Swift ends naturally):
>
> i) svn revision number
> ii) number of tasks executed
>
> --
From hategan at mcs.anl.gov Thu Jun 18 07:14:47 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 18 Jun 2009 07:14:47 -0500
Subject: [Swift-devel] more active processes than requested cores
In-Reply-To: <50b07b4b0906171208h24d38d9ey64c8e6cd9708b5ce@mail.gmail.com>
References: <50b07b4b0906161530g9c37926i610c83901dc9f914@mail.gmail.com>
<1245241057.8776.6.camel@localhost> <4A39079A.3050001@uchicago.edu>
<50b07b4b0906171208h24d38d9ey64c8e6cd9708b5ce@mail.gmail.com>
Message-ID: <1245327287.25261.3.camel@localhost>
Ok. This is getting messy, and I need to be able to reproduce it.
I suggest testing with one of the existing workflows, such as
066-many.swift, and if that does not trigger the problem, a custom
version of it with /bin/sleep instead. If that fails too, I'll need
access to your blast installation.
I also need to know if this is an intermittent issue or not, so testing
more than once would be desirable.
On Wed, 2009-06-17 at 14:08 -0500, Allan Espinosa wrote:
> I also get this after a while.
>
> Attached are the logs when the workflow finished. Actually it did not
> finish because the coaster got an out of memory error. This does not
> happen if coasters were not used.
>
> 2009/6/17 Zhao Zhang :
> > Hi, All
> >
> > Here is something in my test case:
> >
> > Swift says:
> > Progress: Selecting site:80 Submitted:828 Active:115 Finished in
> > previous run:487 Finished successfully:295
> > Progress: Selecting site:80 Submitted:828 Active:115 Finished in
> > previous run:487 Finished successfully:295
> > Progress: Selecting site:80 Submitted:828 Active:115 Finished in
> > previous run:487 Finished successfully:295
> > Progress: Selecting site:80 Submitted:828 Active:115 Finished in
> > previous run:487 Finished successfully:295
> > Progress: Selecting site:80 Submitted:828 Active:115 Finished in
> > previous run:487 Finished successfully:295
> >
> > And showq -u says
> > login3% showq -u
> > ACTIVE JOBS--------------------------
> > JOBID JOBNAME USERNAME STATE CORE REMAINING STARTTIME
> > ================================================================================
> >
> > 0 active jobs : 0 of 3828 hosts ( 0.00 %)
> >
> > Why there are no active SGE jobs, but swift says there are 115 active jobs?
> >
> > zhao
> >
> > Mihael Hategan wrote:
> >>
> >> On Tue, 2009-06-16 at 17:30 -0500, Allan Espinosa wrote:
> >>
> >>>
> >>> By the throttling parameters below, i do expect to have a thousand
> >>> jobs active at a time. But shouldn't the coaster request larger
> >>> blocks to accommodate the 277 active jobs?
> >>>
> >>
> >> Not if they fit in existing blocks (either vertically or horizontally).
> >> This is something that should be thought of some more, but for short
> >> jobs it seems ok.
> >>
> >>
> >>>
> >>> sge snapshot:
> >>> ACTIVE JOBS--------------------------
> >>> JOBID JOBNAME USERNAME STATE CORE REMAINING STARTTIME
> >>>
> >>> ================================================================================
> >>> 779616 data tg802895 Running 16 00:36:01 Tue Jun 16
> >>> 15:59:41
> >>> 779723 data tg802895 Running 16 01:44:01 Tue Jun 16
> >>> 17:07:41
> >>> 779724 data tg802895 Running 16 01:44:01 Tue Jun 16
> >>> 17:07:41
> >>> 779727 data tg802895 Running 16 01:45:58 Tue Jun 16
> >>> 17:09:38
> >>>
> >>>
> >>> swift session snipper
> >>> Progress: Selecting site:38 Submitted:707 Active:278 Finished
> >>> successfully:1861
> >>> Progress: Selecting site:38 Submitted:707 Active:277 Checking
> >>> status:1 Finished successfully:1861
> >>>
> >>>
> >>> sites.xml
> >>>
> >>>
> >>>
> >>> >>> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
> >>> TG-CCR080022N
> >>> /work/01035/tg802895/blast-runs
> >>> 16
> >>> development
> >>> 4
> >>> 00:30:00
> >>> 2
> >>> 2
> >>> 10
> >>>
> >>>
> >>>
> >>> i'll send the swift and coaster logs once the run finishes.
> >>>
> >>> -Allan
> >>>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From hategan at mcs.anl.gov Thu Jun 18 07:19:17 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 18 Jun 2009 07:19:17 -0500
Subject: [Swift-devel] Re: [metrics-dev] feasability of collecting
swift usage stats through the globus usage stats mechanism (fwd)
In-Reply-To:
References:
Message-ID: <1245327557.25261.8.camel@localhost>
On Thu, 2009-06-18 at 08:59 +0000, Ben Clifford wrote:
> You may want to consider operating your own listener service. The "global"
> CDIGS listener service is experiencing growing pains at the moment,
The "global CDIGS listener service" was experiencing growing pains from
the start. Table-top software is a different business from scalable
software.
From aespinosa at cs.uchicago.edu Thu Jun 18 13:04:18 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Thu, 18 Jun 2009 13:04:18 -0500
Subject: [Swift-devel] scheduler scoring with file transfer
Message-ID: <50b07b4b0906181104g1834a4c0n60c863e1b9e4aad@mail.gmail.com>
I observed in the swift logs that there are scheduler score updates after
FILE_OPERATIONs. As we can see below, the no-stagein workflow has
fewer submitted jobs than the one with stage-ins.
Does this mean I have to match my file transfer throttles with my job
submission throttles?
same score and throttling parameters in sites.xml file:
/home/aespinosa/workflows/activelog/workdir
2.02
1.98
fast
4
2
64
00:06:00
3600
[aespinosa at tp-login1 blast]$ ./demoblast.sh (blast.swift with coasters)
Swift svn swift-r2949 cog-r2406
RunID: out.run_000
Progress:
Progress:
Progress:
Progress:
Progress:
Progress: uninitialized:1
Progress: Initializing:1022 Selecting site:1
Progress: Selecting site:1022 Initializing site shared directory:1
Progress: Selecting site:1011 Stage in:12
Progress: Selecting site:1010 Stage in:13
Progress: Selecting site:1005 Stage in:18
Progress: Selecting site:998 Stage in:25
Progress: Selecting site:989 Stage in:34
Progress: Selecting site:988 Stage in:35
Progress: Selecting site:984 Stage in:39
Progress: Selecting site:983 Stage in:40
Progress: Selecting site:974 Stage in:49
Progress: Selecting site:973 Submitting:49 Submitted:1
Progress: Selecting site:973 Submitted:50
Progress: Selecting site:973 Submitted:50
ACTIVE JOBS--------------------
JOBNAME USERNAME STATE PROC REMAINING STARTTIME
0 Active Jobs 174 of 200 Processors Active (87.00%)
94 of 100 Nodes Active (94.00%)
IDLE JOBS----------------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
1101631 aespinosa Idle 23 00:54:00 Thu Jun 18 12:47:56
066-many.swift (no stageins)
[aespinosa at tp-login1 activelog]$ ./runtest.sh
Swift svn swift-r2949 cog-r2406
RunID: activelog
Progress:
Progress: uninitialized:1
Progress: Initializing:1022 Selecting site:1
Progress: Selecting site:1022 Initializing site shared directory:1
Progress: Selecting site:1013 Submitting:9 Submitted:1
Progress: Selecting site:1013 Submitted:10
...
--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago
From hategan at mcs.anl.gov Thu Jun 18 13:12:32 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 18 Jun 2009 13:12:32 -0500
Subject: [Swift-devel] scheduler scoring with file transfer
In-Reply-To: <50b07b4b0906181104g1834a4c0n60c863e1b9e4aad@mail.gmail.com>
References: <50b07b4b0906181104g1834a4c0n60c863e1b9e4aad@mail.gmail.com>
Message-ID: <1245348752.1601.2.camel@localhost>
On Thu, 2009-06-18 at 13:04 -0500, Allan Espinosa wrote:
> i observed in swift logs that there are scheduler score updates after
> FILE_OPERATIONs. As we can see below in the nostagein workflow, there
> are less submitted jobs than the one with stageins.
Yes. There is more load on the site when there are files to transfer
than when there are no files to transfer.
>
> Does this mean i have to match my file transfer throttles with job
> submission throttles?
I don't know what that means.
From benc at hawaga.org.uk Thu Jun 18 13:19:26 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 18 Jun 2009 18:19:26 +0000 (GMT)
Subject: [Swift-devel] scheduler scoring with file transfer
In-Reply-To: <50b07b4b0906181104g1834a4c0n60c863e1b9e4aad@mail.gmail.com>
References: <50b07b4b0906181104g1834a4c0n60c863e1b9e4aad@mail.gmail.com>
Message-ID:
On Thu, 18 Jun 2009, Allan Espinosa wrote:
> Does this mean i have to match my file transfer throttles with job
> submission throttles?
No.
While a site's score capacity is being used up dealing with files, that
same capacity won't be used to submit jobs - the adaptive rate limiting
attempts to restrict the load put on a site, not the number of jobs
submitted to it.
File transfer and operation load is still load; although it is
qualitatively different from job submission load, the scheduler doesn't
make that distinction.
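To illustrate the point with a toy model (entirely hypothetical; Swift's
real scheduler computes and adapts scores quite differently), jobs and
file operations can be thought of as drawing from one score-derived
capacity pool:

```python
class SiteScore:
    """Toy model of score-based rate limiting: one capacity pool
    shared by job submissions and file operations alike."""
    def __init__(self, score, units_per_score=4):
        # the score is mapped to a number of concurrent load units
        self.capacity = int(score * units_per_score)
        self.in_flight = 0

    def try_start(self, kind):
        # 'kind' is informational only: a job and a transfer cost the
        # same, because the scheduler does not distinguish load types
        if self.in_flight < self.capacity:
            self.in_flight += 1
            return True
        return False

site = SiteScore(score=2.0)   # capacity of 8 load units
started = [site.try_start("transfer") for _ in range(6)]
assert all(started)
# transfers in flight leave only 2 units for job submission
jobs = [site.try_start("job") for _ in range(4)]
assert jobs == [True, True, False, False]
```

This is why a workflow with stage-ins shows fewer simultaneously
submitted jobs than one without: the transfers have already consumed
part of the site's capacity.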
--
From aespinosa at cs.uchicago.edu Thu Jun 18 13:22:35 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Thu, 18 Jun 2009 13:22:35 -0500
Subject: [Swift-devel] scheduler scoring with file transfer
In-Reply-To: <1245348752.1601.2.camel@localhost>
References: <50b07b4b0906181104g1834a4c0n60c863e1b9e4aad@mail.gmail.com>
<1245348752.1601.2.camel@localhost>
Message-ID: <50b07b4b0906181122s19eaa8c2j5adf2b5e27b93bcf@mail.gmail.com>
2009/6/18 Mihael Hategan :
> On Thu, 2009-06-18 at 13:04 -0500, Allan Espinosa wrote:
>> i observed in swift logs that there are scheduler score updates after
>> FILE_OPERATIONs.  As we can see below in the nostagein workflow, there
>> are less submitted jobs than the one with stageins.
>
> Yes. There is more load on the site when there are files to transfer
> then when there are no files to transfer.
>
>>
I want to have the same number of jobs at the point of job submission
to replicate some bugs. I guess I'll just add file transfers to my
066-many.swift workflow.
>> Does this mean i have to match my file transfer throttles with job
>> submission throttles?
>
> I don't know what that means.
>
>
>
From wilde at mcs.anl.gov Thu Jun 18 14:22:10 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 18 Jun 2009 14:22:10 -0500
Subject: [Swift-devel] Cant run condor-g on TeraPort
Message-ID: <4A3A93E2.2080805@mcs.anl.gov>
As far as I can tell, the condor client code is broken on TeraPort.
I've tried this on tp-login and tp-osg; I am using +osg-client and @osg
in my .soft, and I source $VDT_LOCATION/setup.sh.
Zhao, Glen, can you cross-check and see if you are now seeing the same
thing?
My suspicion is that the condor client config broke in the last month,
through OSG changes, CI Support work, etc etc.
- Mike
I get this from condor_q:
tp$ condor_q
Error:
Extra Info: You probably saw this error because the condor_schedd is not
running on the machine you are trying to query. If the condor_schedd is not
running, the Condor system will not be able to find an address and port to
connect to and satisfy this request. Please make sure the Condor daemons
are
running and try again.
Extra Info: If the condor_schedd is running on the machine you are
trying to
query and you still see the error, the most likely cause is that you have
setup a personal Condor, you have not defined SCHEDD_NAME in your
condor_config file, and something is wrong with your SCHEDD_ADDRESS_FILE
setting. You must define either or both of those settings in your config
file, or you must use the -name option to condor_q. Please see the Condor
manual for details on SCHEDD_NAME and SCHEDD_ADDRESS_FILE.
tp$
and this from swift:
tp-grid1$ swift -tc.file tc.data -sites.file sites.condorg.xml cat.swift
Swift svn swift-r2890 cog-r2392
RunID: 20090618-1404-mo0thjj4
Progress:
Progress: Stage in:1
Progress: Submitted:1
Failed to transfer wrapper log from cat-20090618-1404-mo0thjj4/info/h on
firefly
Progress: Failed:1
Execution failed:
Exception in cat:
Arguments: [data.txt]
Host: firefly
Directory: cat-20090618-1404-mo0thjj4/jobs/h/cat-hv5s3gcj
stderr.txt:
stdout.txt:
----
Caused by:
Cannot submit job: Could not submit job (condor_submit reported an exit
code of 1). no error output
tp-grid1$ ls
--
Using this sites file:
grid
gt2
ff-grid.unl.edu/jobmanager-pbs
/panfs/panasas/CMS/data/oops/wilde/swiftwork
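The sites file above lost its XML tags in the archive; a hedged reconstruction of what a condor-g pool entry of that era typically looked like (the handle, profile keys, and structure are assumptions inferred from the surviving values, not Mike's original sites.condorg.xml):

```xml
<!-- Hypothetical reconstruction; for illustration only -->
<pool handle="firefly">
  <execution provider="condor" url="none"/>
  <profile namespace="globus" key="jobType">grid</profile>
  <profile namespace="globus" key="gridResource">gt2 ff-grid.unl.edu/jobmanager-pbs</profile>
  <workdirectory>/panfs/panasas/CMS/data/oops/wilde/swiftwork</workdirectory>
</pool>
```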
From hategan at mcs.anl.gov Thu Jun 18 14:25:16 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 18 Jun 2009 14:25:16 -0500
Subject: [Swift-devel] Cant run condor-g on TeraPort
In-Reply-To: <4A3A93E2.2080805@mcs.anl.gov>
References: <4A3A93E2.2080805@mcs.anl.gov>
Message-ID: <1245353116.2875.0.camel@localhost>
Send mail to Ti to restart the daemon (or fix whatever configuration
problems prevent it from starting).
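For reference, the two settings named in the condor_q error are ordinary condor_config macros; a hedged sketch of what a personal-Condor config might define (the names are real Condor macros, but the values and paths are assumptions, not TeraPort's actual config):

```ini
# condor_config (fragment, hypothetical values)
SCHEDD_NAME         = schedd@$(FULL_HOSTNAME)
SCHEDD_ADDRESS_FILE = $(SPOOL)/.schedd_address
```

Alternatively, condor_q -name <schedd-name> bypasses the address-file lookup.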
On Thu, 2009-06-18 at 14:22 -0500, Michael Wilde wrote:
> As far as I can tell, the condor client code is broken on TeraPort.
>
> Ive tried this on tp-login and tp-osg; I am using +osg-client and @osg
> in my .soft. I source $VDT_LOCATION/setup.sh
>
> Zhao, Glen, can you cross-check and see if you are now seeing the same
> thing?
>
> My suspicion is that the condor client config broke in the last month,
> through OSG changes, CI Support work, etc etc.
>
> - Mike
>
>
> I get this from condor_q:
>
> tp$ condor_q
> Error:
>
> Extra Info: You probably saw this error because the condor_schedd is not
> running on the machine you are trying to query. If the condor_schedd is not
> running, the Condor system will not be able to find an address and port to
> connect to and satisfy this request. Please make sure the Condor daemons
> are
> running and try again.
>
> Extra Info: If the condor_schedd is running on the machine you are
> trying to
> query and you still see the error, the most likely cause is that you have
> setup a personal Condor, you have not defined SCHEDD_NAME in your
> condor_config file, and something is wrong with your SCHEDD_ADDRESS_FILE
> setting. You must define either or both of those settings in your config
> file, or you must use the -name option to condor_q. Please see the Condor
> manual for details on SCHEDD_NAME and SCHEDD_ADDRESS_FILE.
> tp$
>
> and this from swift:
>
> tp-grid1$ swift -tc.file tc.data -sites.file sites.condorg.xml cat.swift
> Swift svn swift-r2890 cog-r2392
>
> RunID: 20090618-1404-mo0thjj4
> Progress:
> Progress: Stage in:1
> Progress: Submitted:1
> Failed to transfer wrapper log from cat-20090618-1404-mo0thjj4/info/h on
> firefly
> Progress: Failed:1
> Execution failed:
> Exception in cat:
> Arguments: [data.txt]
> Host: firefly
> Directory: cat-20090618-1404-mo0thjj4/jobs/h/cat-hv5s3gcj
> stderr.txt:
>
> stdout.txt:
>
> ----
>
> Caused by:
> Cannot submit job: Could not submit job (condor_submit reported an exit
> code of 1). no error output
> tp-grid1$ ls
>
> --
>
> Using this sites file:
>
>
>
>
>
> grid
> gt2
> ff-grid.unl.edu/jobmanager-pbs
> >/panfs/panasas/CMS/data/oops/wilde/swiftwork
>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From zhaozhang at uchicago.edu Thu Jun 18 14:29:07 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 18 Jun 2009 14:29:07 -0500
Subject: [Swift-devel] condor-g test on ff-grid site
Message-ID: <4A3A9583.8010005@uchicago.edu>
Dear All
I am trying to run a workflow on ff-grid site with condor-g feature. My
submit host is tp-osg.ci.uchicago.edu.
I have a question about the remote site requirements. Does the remote site
require a condor jobmanager in order for us to run swift with condor-g
there? ff-grid only has a pbs job manager. Here is my sites.xml:
[zzhang at tp-grid1 sites]$ cat condor-g_new/ff-grid.xml
/mnt/panasas/CMS/grid_users/osg/
grid
gt2
ff-grid.unl.edu/jobmanager-pbs
The reason I am asking this is because my test failed on ff-grid site.
All related logs are at CI network
/home/zzhang/swift_coaster/cog/modules/swift/tests/sites/logs/ff-grid/
Execution failed:
Exception in cat:
Arguments: [061-cattwo.1.in, 061-cattwo.2.in]
Host: ff-grid
Directory: 061-cattwo-20090618-1407-gfg03g57/jobs/v/cat-v66x3gcj
stderr.txt:
stdout.txt:
----
Caused by:
No status file was found. Check the shared filesystem on ff-grid
SWIFT RETURN CODE NON-ZERO - test 061-cattwo
On the remote site, the shared dir was created, but the jobs dir wasn't.
[zzhang at tp-grid1 ~]$ globus-job-run ff-grid.unl.edu /bin/ls
061-cattwo-20090618-1407-gfg03g57/
info
kickstart
shared
status
Any idea on the job failure? Also, to make sure it is not the test
workflow's problem, I tested exactly the same suite
on the GLOW site.
best
zhao
From zhaozhang at uchicago.edu Thu Jun 18 14:32:19 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 18 Jun 2009 14:32:19 -0500
Subject: [Swift-devel] Cant run condor-g on TeraPort
In-Reply-To: <4A3A93E2.2080805@mcs.anl.gov>
References: <4A3A93E2.2080805@mcs.anl.gov>
Message-ID: <4A3A9643.6050908@uchicago.edu>
Hi, Mike
Michael Wilde wrote:
> As far as I can tell, the condor client code is broken on TeraPort.
>
> Ive tried this on tp-login and tp-osg; I am using +osg-client and @osg
> in my .soft. I source $VDT_LOCATION/setup.sh
>
> Zhao, Glen, can you cross-check and see if you are now seeing the same
> thing?
>
> My suspicion is that the condor client config broke in the last month,
> through OSG changes, CI Support work, etc etc.
>
> - Mike
>
>
> I get this from condor_q:
condor_q is working for me
[zzhang at tp-grid1 sites]$ condor_q
-- Submitter: tp-grid1.ci.uchicago.edu : <128.135.125.118:43109> : tp-grid1.ci.uchicago.edu
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
101.0   zzhang   4/28 16:15   0+00:02:06 X  0   1.0  bash /nfs/home/osg
137.0   zzhang   4/29 12:25   0+00:00:00 X  0   1.0  bash /nfs/osg-data
138.0   zzhang   4/29 13:02   0+00:00:00 X  0   1.0  bash /scratch/ufhp
139.0   zzhang   4/29 16:15   0+00:00:00 X  0   1.0  bash /opt/osg/data
140.0   zzhang   5/5  14:12   0+00:00:43 X  0   1.0  bash /nfs/osg-data
157.0   zzhang   5/5  14:49   0+00:00:00 X  0   1.0  bash /atlas/data08
158.0   zzhang   5/5  14:59   0+00:00:00 X  0   1.0  bash /raid2/osg-da
159.0   zzhang   5/5  15:03   0+00:00:00 X  0   1.0  bash /raid2/osg-da
The source file in my .bashrc is "source /opt/osg/setup.sh" not
"/opt/osg-ce-1.0.0-r2/setup.sh".
[zzhang at tp-grid1 sites]$ echo $VDT_LOCATION
/opt/osg-ce-1.0.0-r2
zhao
>
> tp$ condor_q
> Error:
>
> Extra Info: You probably saw this error because the condor_schedd is not
> running on the machine you are trying to query. If the condor_schedd
> is not
> running, the Condor system will not be able to find an address and
> port to
> connect to and satisfy this request. Please make sure the Condor
> daemons are
> running and try again.
>
> Extra Info: If the condor_schedd is running on the machine you are
> trying to
> query and you still see the error, the most likely cause is that you have
> setup a personal Condor, you have not defined SCHEDD_NAME in your
> condor_config file, and something is wrong with your SCHEDD_ADDRESS_FILE
> setting. You must define either or both of those settings in your config
> file, or you must use the -name option to condor_q. Please see the Condor
> manual for details on SCHEDD_NAME and SCHEDD_ADDRESS_FILE.
> tp$
>
> and this from swift:
>
> tp-grid1$ swift -tc.file tc.data -sites.file sites.condorg.xml cat.swift
> Swift svn swift-r2890 cog-r2392
>
> RunID: 20090618-1404-mo0thjj4
> Progress:
> Progress: Stage in:1
> Progress: Submitted:1
> Failed to transfer wrapper log from cat-20090618-1404-mo0thjj4/info/h
> on firefly
> Progress: Failed:1
> Execution failed:
> Exception in cat:
> Arguments: [data.txt]
> Host: firefly
> Directory: cat-20090618-1404-mo0thjj4/jobs/h/cat-hv5s3gcj
> stderr.txt:
>
> stdout.txt:
>
> ----
>
> Caused by:
> Cannot submit job: Could not submit job (condor_submit reported an
> exit code of 1). no error output
> tp-grid1$ ls
>
> --
>
> Using this sites file:
>
>
>
>
>
> grid
> gt2
> ff-grid.unl.edu/jobmanager-pbs
> >/panfs/panasas/CMS/data/oops/wilde/swiftwork
>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
From benc at hawaga.org.uk Thu Jun 18 14:31:26 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 18 Jun 2009 19:31:26 +0000 (GMT)
Subject: [Swift-devel] Cant run condor-g on TeraPort
In-Reply-To: <4A3A93E2.2080805@mcs.anl.gov>
References: <4A3A93E2.2080805@mcs.anl.gov>
Message-ID:
condor_q works for me on tp-osg if I source /opt/osg/setenv.sh rather than
use softenv. it doesn't work for me if I use @osg in softenv, with the
error you report.
On Thu, 18 Jun 2009, Michael Wilde wrote:
> As far as I can tell, the condor client code is broken on TeraPort.
>
> Ive tried this on tp-login and tp-osg; I am using +osg-client and @osg in my
> .soft. I source $VDT_LOCATION/setup.sh
>
> Zhao, Glen, can you cross-check and see if you are now seeing the same thing?
>
> My suspicion is that the condor client config broke in the last month, through
> OSG changes, CI Support work, etc etc.
>
> - Mike
>
>
> I get this from condor_q:
>
> tp$ condor_q
> Error:
>
> Extra Info: You probably saw this error because the condor_schedd is not
> running on the machine you are trying to query. If the condor_schedd is not
> running, the Condor system will not be able to find an address and port to
> connect to and satisfy this request. Please make sure the Condor daemons are
> running and try again.
>
> Extra Info: If the condor_schedd is running on the machine you are trying to
> query and you still see the error, the most likely cause is that you have
> setup a personal Condor, you have not defined SCHEDD_NAME in your
> condor_config file, and something is wrong with your SCHEDD_ADDRESS_FILE
> setting. You must define either or both of those settings in your config
> file, or you must use the -name option to condor_q. Please see the Condor
> manual for details on SCHEDD_NAME and SCHEDD_ADDRESS_FILE.
> tp$
>
> and this from swift:
>
> tp-grid1$ swift -tc.file tc.data -sites.file sites.condorg.xml cat.swift
> Swift svn swift-r2890 cog-r2392
>
> RunID: 20090618-1404-mo0thjj4
> Progress:
> Progress: Stage in:1
> Progress: Submitted:1
> Failed to transfer wrapper log from cat-20090618-1404-mo0thjj4/info/h on
> firefly
> Progress: Failed:1
> Execution failed:
> Exception in cat:
> Arguments: [data.txt]
> Host: firefly
> Directory: cat-20090618-1404-mo0thjj4/jobs/h/cat-hv5s3gcj
> stderr.txt:
>
> stdout.txt:
>
> ----
>
> Caused by:
> Cannot submit job: Could not submit job (condor_submit reported an
> exit code of 1). no error output
> tp-grid1$ ls
>
> --
>
> Using this sites file:
>
>
>
>
>
> grid
> gt2
> ff-grid.unl.edu/jobmanager-pbs
> /panfs/panasas/CMS/data/oops/wilde/swiftwork
>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
>
From benc at hawaga.org.uk Thu Jun 18 14:38:32 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 18 Jun 2009 19:38:32 +0000 (GMT)
Subject: [Swift-devel] condor-g test on ff-grid site
In-Reply-To: <4A3A9583.8010005@uchicago.edu>
References: <4A3A9583.8010005@uchicago.edu>
Message-ID:
On Thu, 18 Jun 2009, Zhao Zhang wrote:
> I have a question about the remote site requirements. Does remote site require
> a condor jobmanager in order
> for us to run swift with condor-g on there?
no. condor-g is a submit-side only requirement.
Does the site work using swift+plain gram2 instead of swift+condor-g?
--
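Ben's suggestion to try plain gram2 can be illustrated with a sketch of a gt2 pool entry for the same site (the workdirectory follows the path in Zhao's mail; the handle and tag structure reflect Swift sites.xml of that era and are assumptions):

```xml
<!-- Hypothetical plain-GRAM2 entry for ff-grid; for illustration only -->
<pool handle="ff-grid">
  <gridftp url="gsiftp://ff-grid.unl.edu"/>
  <jobmanager universe="vanilla" url="ff-grid.unl.edu/jobmanager-pbs" major="2"/>
  <workdirectory>/mnt/panasas/CMS/grid_users/osg</workdirectory>
</pool>
```

If the job still fails this way without condor-g, the problem is on the site side rather than in the condor submit path.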
From hockyg at uchicago.edu Thu Jun 18 15:15:03 2009
From: hockyg at uchicago.edu (Glen Hocky)
Date: Thu, 18 Jun 2009 15:15:03 -0500
Subject: [Swift-devel] condor-g test on ff-grid site
In-Reply-To:
References: <4A3A9583.8010005@uchicago.edu>
Message-ID:
Hey Zhao,
I couldn't get it to work from TeraPort, but from the Engage login host,
engage-submit, I can with this:
default
grid
gt2
ff-grid.unl.edu/jobmanager-pbs
/panfs/panasas/CMS/data/oops/swiftwork
On Thu, Jun 18, 2009 at 2:38 PM, Ben Clifford wrote:
>
> On Thu, 18 Jun 2009, Zhao Zhang wrote:
>
> > I have a question about the remote site requirements. Does remote site
> require
> > a condor jobmanager in order
> > for us to run swift with condor-g on there?
>
> no. condor-g is a submit-side only requirement.
>
> Does the site work using swift+plain gram2 instead of swift+condor-g?
>
> --
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
From wilde at mcs.anl.gov Thu Jun 18 16:44:03 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 18 Jun 2009 16:44:03 -0500
Subject: [Swift-devel] How to set .soft and env to run condor on TeraPort?
Message-ID: <4A3AB523.2060205@mcs.anl.gov>
Hi,
Swift users need to run the condor-g client in order to send jobs to OSG
sites from a Swift script.
Can you tell us how to set .soft and env so that condor_submit to "grid"
universe works?
We've had all sorts of problems in getting this to work well:
- the version of condor client code on communicado is too new to run
with Swift.
- On TeraPort, it seems difficult to get the right settings of .soft
entries and setup.sh scripts to work correctly together.
- I still don't know if what worked for Zhao on tp-osg a month ago still
works. It seems not to, and I can't tell if it's because of a change in
.soft or env settings, or some other software issue.
- We would like to run from Teraport compute nodes with qsub -I, and
hope that whatever we determine to be the right settings for login nodes
work on interactive compute nodes as well.
- It would be good *not* to run on tp-osg.
Suchandra, Ti, or Greg, can you help us sort out how to set things
correctly?
Thanks,
Mike
-------- Original Message --------
Subject: Re: [Swift-devel] Cant run condor-g on TeraPort
Date: Thu, 18 Jun 2009 19:31:26 +0000 (GMT)
From: Ben Clifford
To: Michael Wilde
CC: swift-devel
References: <4A3A93E2.2080805 at mcs.anl.gov>
condor_q works for me on tp-osg if I source /opt/osg/setenv.sh rather than
use softenv. it doesn't work for me if I use @osg in softenv, with the
error you report.
On Thu, 18 Jun 2009, Michael Wilde wrote:
> As far as I can tell, the condor client code is broken on TeraPort.
>
> Ive tried this on tp-login and tp-osg; I am using +osg-client and @osg in my
> .soft. I source $VDT_LOCATION/setup.sh
>
> Zhao, Glen, can you cross-check and see if you are now seeing the same thing?
>
> My suspicion is that the condor client config broke in the last month, through
> OSG changes, CI Support work, etc etc.
>
> - Mike
>
>
> I get this from condor_q:
>
> tp$ condor_q
> Error:
>
> Extra Info: You probably saw this error because the condor_schedd is not
> running on the machine you are trying to query. If the condor_schedd is not
> running, the Condor system will not be able to find an address and port to
> connect to and satisfy this request. Please make sure the Condor daemons are
> running and try again.
>
> Extra Info: If the condor_schedd is running on the machine you are trying to
> query and you still see the error, the most likely cause is that you have
> setup a personal Condor, you have not defined SCHEDD_NAME in your
> condor_config file, and something is wrong with your SCHEDD_ADDRESS_FILE
> setting. You must define either or both of those settings in your config
> file, or you must use the -name option to condor_q. Please see the Condor
> manual for details on SCHEDD_NAME and SCHEDD_ADDRESS_FILE.
> tp$
>
> and this from swift:
>
> tp-grid1$ swift -tc.file tc.data -sites.file sites.condorg.xml cat.swift
> Swift svn swift-r2890 cog-r2392
>
> RunID: 20090618-1404-mo0thjj4
> Progress:
> Progress: Stage in:1
> Progress: Submitted:1
> Failed to transfer wrapper log from cat-20090618-1404-mo0thjj4/info/h on
> firefly
> Progress: Failed:1
> Execution failed:
> Exception in cat:
> Arguments: [data.txt]
> Host: firefly
> Directory: cat-20090618-1404-mo0thjj4/jobs/h/cat-hv5s3gcj
> stderr.txt:
>
> stdout.txt:
>
> ----
>
> Caused by:
> Cannot submit job: Could not submit job (condor_submit reported an
> exit code of 1). no error output
> tp-grid1$ ls
>
> --
>
> Using this sites file:
>
>
>
>
>
> grid
> gt2
> ff-grid.unl.edu/jobmanager-pbs
> /panfs/panasas/CMS/data/oops/wilde/swiftwork
>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
>
From wilde at mcs.anl.gov Thu Jun 18 16:59:02 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 18 Jun 2009 16:59:02 -0500
Subject: [Swift-devel] Overview of coaster block-allocation-version issues
Message-ID: <4A3AB8A6.9070706@mcs.anl.gov>
Zhao and Allan have been testing the new coaster block-allocation
version on Ranger.
They have reported some issues, and need to work with Mihael to better
characterize the errors, and try to reproduce them in a way that Mihael
can also do.
From working with them, I see two more issues that should be discussed
and resolved, which I think they have not yet mentioned on the list.
Zhao will discuss at least one of these, but is swamped getting a
science run completed for the SEE project.
The issues:
1) It's hard to configure the time dimensions of the allocator, and to
make it work well with Swift retry parameters. The properties listed in
the table in the User Guide coaster section need more explanation and
examples. I think Zhao in his latest run got these working OK for the
"ampl" SEE model he's running (2000 jobs, 2 hours each). I'll work with
him on this, but help from others is welcome.
2) Allan and Zhao got kicked off of Ranger because the Coaster service
was consuming too much time on the head node, which is also "login3". We
were impacting other users, and got a "cease and desist" order from the
Ranger sysadmin. They have at least one anecdotal "top" snapshot from
the host that indicates the service was indeed using a lot of time (on
his 2000 job x 2 hour script). At the same time, Zhao sees a huge
coaster (service?) log. Maybe related?
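For issue (1), the time-related coaster settings go in the sites.xml profile; a hedged sketch for a mix like 2000 two-hour jobs (the key names are the coaster properties from the User Guide; the values are guesses for illustration, not a validated configuration):

```xml
<!-- Illustrative coaster profile entries; values are assumptions -->
<profile namespace="globus" key="maxtime">14400</profile>        <!-- block wall time cap, seconds -->
<profile namespace="globus" key="workersPerNode">4</profile>
<profile namespace="globus" key="slots">20</profile>             <!-- max concurrent blocks -->
<profile namespace="globus" key="nodeGranularity">16</profile>
<profile namespace="globus" key="lowOverallocation">10</profile>
<profile namespace="globus" key="highOverallocation">2</profile>
```

The interaction to watch is between maxtime (block lifetime) and the per-job walltime: blocks must be long enough that retried jobs still fit.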
Allan and Zhao, please keep updates flowing to swift-devel with the list
and status of coaster issues (ideally bugzilla'ed when appropriate), and
work with Mihael to capture the logs and test cases he needs to see for
each problem. Can you both work together to make a list, and with
Mihael to decide which items need to be tracked as bugs?
Thanks,
Mike
From iraicu at cs.uchicago.edu Thu Jun 18 21:33:29 2009
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Thu, 18 Jun 2009 21:33:29 -0500
Subject: [Swift-devel] [Fwd: Updated Call for participation in eScience09,
please distribute]
Message-ID: <4A3AF8F9.9030302@cs.uchicago.edu>
Hi,
Here is an interesting conference on e-Science.
Cheers,
Ioan
-------- Original Message --------
Subject: Updated Call for participation in eScience09, please distribute
Date: Wed, 10 Jun 2009 05:00:15 -0700
From:
Reply-To: david.wallom at oerc.ox.ac.uk
To: iraicu at cs.uchicago.edu
Dear Ioan Raicu
Following a note about the submission URLs being missing, please circulate through your network of contacts the updated call for papers for eScience09 below.
Regards
David
+++++++++++++++++++++++++++++++
e-Science 2009, call for papers
+++++++++++++++++++++++++++++++
About
-----
Scientific research is increasingly carried out by communities of researchers that span disciplines, laboratories, organizations, and national boundaries.
The e-Science 2009 conference is designed to bring together leading international and interdisciplinary research communities, developers, and users of e-Science applications and enabling technologies. The conference serves as a forum to present the results of the latest research and product/tool developments and to highlight related activities from around the world.
The sixth IEEE e-Science conference will be held in Oxford, UK from Dec 9-11. The meeting coincides with the UK e-Science All Hands Meeting that will be held from Dec 7-9th, 2009.
Building on the successes of previous meetings, we would like to develop some themes at the conference. These include
1. Arts and Humanities and e-Social Science
2. Bioinformatics and Health
3. Climate and Earth Sciences
4. Digital Repositories and Data Management
5. eScience Practice and Education
6. Physical Sciences and Engineering
7. Research Tools, Workflow and Systems
There is also the opportunity to submit a workshop programme that is focussed on newer and less well-developed areas of research. e-Science 2009 will also feature exhibits.
As well as the vibrant research agenda, the meeting will offer the opportunity to meet socially with colleagues in some of the UK's most spectacular university venues.
We look forward to seeing you Oxford,
Anne Trefethen (co-chair)
Dave De Roure (co-chair)
Paul Roe (Programme co-chair)
David Wallom (Programme co-chair)
Mark Baker (Workshop chair)
Instructions
------------
Authors are invited to submit papers with unpublished, original work of not more than 8 pages of double column text using single spaced 10 point size on 8.5 x 11 inch pages, as per IEEE 8.5 x 11 manuscript guidelines (see website list below). Authors should submit a PDF or PostScript (level 2) file that will print on a PostScript printer. Papers conforming to the above guidelines can be submitted through the e-Science 2009 paper submission system (see URL below). It is expected that the proceedings will be published by the IEEE CS Press, USA and will be made available online through the IEEE Digital Library.
The following topics concerning e-Science are of interest, but not restrictive:
1. Arts and Humanities and e-Social Science
2. Bioinformatics and Health
3. Climate and Earth Sciences
4. Digital Repositories and Data Management
5. eScience Practice and Education
6. Physical Sciences and Engineering
7. Research Tools, Workflow and Systems
Important Dates
Papers Due: Friday 31st July, 2009
Notification of Acceptance: Tuesday 1st September, 2009
Camera Ready Papers Due: Friday 18th September, 2009
Publication Policy
All papers will be peer-reviewed. Accepted papers from both the main track and the workshops will be published in pre-conference proceedings published by IEEE. Selected excellent work may be eligible for additional post-conference publication as extended papers in selected journals, such as FGCS ( http://www.elsevier.com/locate/fgcs )
Websites
http://www.escience-meeting.org/eScience2006/instructions.html
https://cmt.research.microsoft.com/ESCIENCE2009/Default.aspx
--
===================================================
Ioan Raicu, Ph.D.
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================
From wilde at mcs.anl.gov Fri Jun 19 07:47:47 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 19 Jun 2009 07:47:47 -0500
Subject: [Swift-devel] Overview of coaster block-allocation-version issues
In-Reply-To: <4A3AB8A6.9070706@mcs.anl.gov>
References: <4A3AB8A6.9070706@mcs.anl.gov>
Message-ID: <4A3B88F3.7020206@mcs.anl.gov>
More thoughts on this:
(2) is a showstopper on Ranger (and possibly elsewhere) and hence a much
more important issue than (1).
It seems like this problem merits a 2-pronged attack:
a) reduce the overhead. Is it logging, or intrinsic to the protocol?
-- is it obvious from the log what's causing the high overhead?
-- is it a situation where the overhead is incurred even when
jobs are not running, just queued?
b) see if the service can be moved to a worker node
Mike
On 6/18/09 4:59 PM, Michael Wilde wrote:
> Zhao and Allan have been testing the new coaster block-allocation
> version on Ranger.
>
> They have reported some issues, and need to work with Mihael to better
> characterize the errors, and try to reproduce them in a way that Mihael
> can also do.
>
> From working with them, I see two more issues that should be discussed
> and resolved, which I think they have not yet mentioned on the list.
> Zhao will discuss at least one of these, but is swamped getting a
> science run completed for the SEE project.
>
> The issues:
>
> 1) Its hard to configure the time dimensions of the allocator, and to
> make it work well with Swift retry parameters. The properties listed in
> the table in the User Guide coaster section need more explanation and
> examples. I think Zhao in his latest run got these working OK for the
> "ampl" SEE model he's running (2000 jobs, 2 hours each). I'll work with
> him on this, but help from others is welcome.
>
> 2) Allan and Zhao got kicked off of Ranger because the Coaster service
> was consuming too much time on the head node, which is also "login3". We
> were impacting other users, and got a "cease and desist" order from the
> Ranger sysadmin. They have at least one anecdotal "top" snapshot from
> the host that indicates the service was indeed using a lot of time (on
> his 2000 job x 2 hour script). At the same time, Zhao sees a huge
> coaster (service?) log. Maybe related?
>
> Allan and Zhao, please keep updates flowing to swift-devel with the list
> and status of coaster issues (ideally bugzilla'ed when appropriate), and
> work with Mihael to capture the logs and test cases he needs to see for
> each problem. Can you both work together to make a list, and with
> Mihael to decide which items need to be tracked as bugs?
>
> Thanks,
>
> Mike
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From hategan at mcs.anl.gov Fri Jun 19 08:23:28 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 19 Jun 2009 08:23:28 -0500
Subject: [Swift-devel] Overview of coaster block-allocation-version issues
In-Reply-To: <4A3B88F3.7020206@mcs.anl.gov>
References: <4A3AB8A6.9070706@mcs.anl.gov> <4A3B88F3.7020206@mcs.anl.gov>
Message-ID: <1245417808.18736.2.camel@localhost>
On Fri, 2009-06-19 at 07:47 -0500, Michael Wilde wrote:
> More thoughts on this:
>
> (2) is a showstopper on Ranger (and possibly elsewhere) and hence a much
> more important issue than (1).
>
> It seems like this problem merits a 2-pronged attack:
>
> a) reduce the overhead. Is it logging, or intrinsic to the protocol?
> -- is it obvious from the log what's causing the high overhead?
> -- is it a situation where the overhead is incurred even when
> jobs are not running, just queued?
Some profiling needs to be done.
> b) see if the service can be moved to a worker node
>
> Mike
>
>
> On 6/18/09 4:59 PM, Michael Wilde wrote:
> > Zhao and Allan have been testing the new coaster block-allocation
> > version on Ranger.
> >
> > They have reported some issues, and need to work with Mihael to better
> > characterize the errors, and try to reproduce them in a way that Mihael
> > can also do.
> >
> > From working with them, I see two more issues that should be discussed
> > and resolved, which I think they have not yet mentioned on the list.
> > Zhao will discuss at least one of these, but is swamped getting a
> > science run completed for the SEE project.
> >
> > The issues:
> >
> > 1) Its hard to configure the time dimensions of the allocator, and to
> > make it work well with Swift retry parameters. The properties listed in
> > the table in the User Guide coaster section need more explanation and
> > examples. I think Zhao in his latest run got these working OK for the
> > "ampl" SEE model he's running (2000 jobs, 2 hours each). I'll work with
> > him on this, but help from others is welcome.
> >
> > 2) Allan and Zhao got kicked off of Ranger because the Coaster service
> > was consuming too much time on the head node, which is also "login3". We
> > were impacting other users, and got a "cease and desist" order from the
> > Ranger sysadmin. They have at least one anecdotal "top" snapshot from
> > the host that indicates the service was indeed using a lot of time (on
> > his 2000 job x 2 hour script). At the same time, Zhao sees a huge
> > coaster (service?) log. Maybe related?
> >
> > Allan and Zhao, please keep updates flowing to swift-devel with the list
> > and status of coaster issues (ideally bugzilla'ed when appropriate), and
> > work with Mihael to capture the logs and test cases he needs to see for
> > each problem. Can you both work together to make a list, and with
> > Mihael to decide which items need to be tracked as bugs?
> >
> > Thanks,
> >
> > Mike
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From wilde at mcs.anl.gov Fri Jun 19 08:31:32 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 19 Jun 2009 08:31:32 -0500
Subject: [Swift-devel] Overview of coaster block-allocation-version issues
In-Reply-To: <1245417808.18736.2.camel@localhost>
References: <4A3AB8A6.9070706@mcs.anl.gov> <4A3B88F3.7020206@mcs.anl.gov>
<1245417808.18736.2.camel@localhost>
Message-ID: <4A3B9334.80200@mcs.anl.gov>
On 6/19/09 8:23 AM, Mihael Hategan wrote:
> On Fri, 2009-06-19 at 07:47 -0500, Michael Wilde wrote:
>> More thoughts on this:
>>
>> (2) is a showstopper on Ranger (and possibly elsewhere) and hence a much
>> more important issue than (1).
>>
>> It seems like this problem merits a 2-pronged attack:
>>
>> a) reduce the overhead. Is it logging, or intrinsic to the protocol?
>> -- is it obvious from the log what's causing the high overhead?
>> -- is it a situation where the overhead is incurred even when
>> jobs are not running, just queued?
>
> Some profiling needs to be done.
Zhao or Allan, can you reproduce the problem on TeraPort or UC Teragrid,
using a simple script and dummy app so that Mihael can readily reproduce?
Mihael, do you want them to run with profiling and post results?
- Mike
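One cheap first step before full JVM profiling: see whether logging volume itself tracks the overhead by counting service-log lines per component. A self-contained sketch follows; the log lines and component names here are made up for illustration, so adapt the awk field number to the real coasters.log layout:

```shell
# Build a tiny fake log so the pipeline is self-contained, then count
# log lines per component; a heavily skewed count hints at what is chattiest.
cat > /tmp/coasters.sample.log <<'EOF'
2009-06-19 08:00:01 INFO Block submitting block id=0
2009-06-19 08:00:02 DEBUG Channel heartbeat sent
2009-06-19 08:00:03 DEBUG Channel heartbeat sent
EOF
# Field 4 is the component name in this fake layout.
awk '{print $4}' /tmp/coasters.sample.log | sort | uniq -c | sort -rn
```

On a real multi-gigabyte service log the same pipeline shows at a glance whether one subsystem accounts for most of the output.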
>
>> b) see if the service can be moved to a worker node
>>
>> Mike
>>
>>
>> On 6/18/09 4:59 PM, Michael Wilde wrote:
>>> Zhao and Allan have been testing the new coaster block-allocation
>>> version on Ranger.
>>>
>>> They have reported some issues, and need to work with Mihael to better
>>> characterize the errors, and try to reproduce them in a way that Mihael
>>> can also do.
>>>
>>> From working with them, I see two more issues that should be discussed
>>> and resolved, which I think they have not yet mentioned on the list.
>>> Zhao will discuss at least one of these, but is swamped getting a
>>> science run completed for the SEE project.
>>>
>>> The issues:
>>>
>>> 1) Its hard to configure the time dimensions of the allocator, and to
>>> make it work well with Swift retry parameters. The properties listed in
>>> the table in the User Guide coaster section need more explanation and
>>> examples. I think Zhao in his latest run got these working OK for the
>>> "ampl" SEE model he's running (2000 jobs, 2 hours each). I'll work with
>>> him on this, but help from others is welcome.
>>>
>>> 2) Allan and Zhao got kicked off of Ranger because the Coaster service
>>> was consuming too much time on the head node, which is also "login3". We
>>> were impacting other users, and got a "cease and desist" order from the
>>> Ranger sysadmin. They have at least one anecdotal "top" snapshot from
>>> the host that indicates the service was indeed using a lot of time (on
>>> his 2000 job x 2 hour script). At the same time, Zhao sees a huge
>>> coaster (service?) log. Maybe related?
>>>
>>> Allan and Zhao, please keep updates flowing to swift-devel with the list
>>> and status of coaster issues (ideally bugzilla'ed when appropriate), and
>>> work with Mihael to capture the logs and test cases he needs to see for
>>> each problem. Can you both work together to make a list, and with
>>> Mihael to decide which items need to be tracked as bugs?
>>>
>>> Thanks,
>>>
>>> Mike
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
From hategan at mcs.anl.gov Fri Jun 19 08:35:25 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 19 Jun 2009 08:35:25 -0500
Subject: [Swift-devel] Overview of coaster block-allocation-version issues
In-Reply-To: <4A3B9334.80200@mcs.anl.gov>
References: <4A3AB8A6.9070706@mcs.anl.gov> <4A3B88F3.7020206@mcs.anl.gov>
<1245417808.18736.2.camel@localhost> <4A3B9334.80200@mcs.anl.gov>
Message-ID: <1245418525.19007.1.camel@localhost>
On Fri, 2009-06-19 at 08:31 -0500, Michael Wilde wrote:
>
> On 6/19/09 8:23 AM, Mihael Hategan wrote:
> > On Fri, 2009-06-19 at 07:47 -0500, Michael Wilde wrote:
> >> More thoughts on this:
> >>
> >> (2) is a showstopper on Ranger (and possible elsewhere) and hence a much
> >> more important issue than (1).
> >>
> >> It seems like this problem merits a 2-pronged attack:
> >>
> >> a) reduce the overhead. Is it logging, or intrinsic to the protocol?
> >> -- is it obvious from the log whats causing the high overhead?
> >> -- its it a situation where the overhead is incurred even when
> >> jobs are not running, just queued?
> >
> > Some profiling needs to be done.
>
> Zhao or Allan, can you reproduce the problem on TeraPort or UC Teragrid,
> using a simple script and dummy app so that Mihael can readily reproduce?
>
> Mihael, do you want them to run with profiling and post results?
That would be great. Get an hprof dump with CPU tracing enabled. See
http://java.sun.com/developer/technicalArticles/Programming/HPROF.html
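As a minimal sketch, the HPROF invocation being asked for would look roughly like the following, assuming a Sun JVM of this era that accepts -agentlib:hprof (the output file name, sampling interval, and "coaster-service.jar" are illustrative placeholders, not the real service launch line):

```shell
# HPROF CPU-sampling flags; cpu=samples enables the sampling profiler,
# interval is the sampling period in ms, depth the stack depth recorded.
HPROF_OPTS="-agentlib:hprof=cpu=samples,interval=10,depth=10,file=coasters.hprof.txt"

# The coaster service JVM would be started with these options prepended;
# the jar name below is a placeholder for the real service entry point.
echo java "$HPROF_OPTS" -jar coaster-service.jar
```

On exit, the JVM writes the sampled CPU profile to the named file, which can then be posted for analysis.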
From hategan at mcs.anl.gov Fri Jun 19 08:47:23 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 19 Jun 2009 08:47:23 -0500
Subject: [Swift-devel] Overview of coaster block-allocation-version issues
In-Reply-To: <1245418525.19007.1.camel@localhost>
References: <4A3AB8A6.9070706@mcs.anl.gov> <4A3B88F3.7020206@mcs.anl.gov>
<1245417808.18736.2.camel@localhost> <4A3B9334.80200@mcs.anl.gov>
<1245418525.19007.1.camel@localhost>
Message-ID: <1245419243.19245.1.camel@localhost>
On Fri, 2009-06-19 at 08:35 -0500, Mihael Hategan wrote:
> On Fri, 2009-06-19 at 08:31 -0500, Michael Wilde wrote:
> >
> > On 6/19/09 8:23 AM, Mihael Hategan wrote:
> > > On Fri, 2009-06-19 at 07:47 -0500, Michael Wilde wrote:
> > >> More thoughts on this:
> > >>
> > >> (2) is a showstopper on Ranger (and possible elsewhere) and hence a much
> > >> more important issue than (1).
> > >>
> > >> It seems like this problem merits a 2-pronged attack:
> > >>
> > >> a) reduce the overhead. Is it logging, or intrinsic to the protocol?
> > >> -- is it obvious from the log whats causing the high overhead?
> > >> -- its it a situation where the overhead is incurred even when
> > >> jobs are not running, just queued?
> > >
> > > Some profiling needs to be done.
> >
> > Zhao or Allan, can you reproduce the problem on TeraPort or UC Teragrid,
> > using a simple script and dummy app so that Mihael can readily reproduce?
> >
> > Mihael, do you want them to run with profiling and post results?
>
> That would be great. Get a hprof dump with cpu tracing enabled. See
> http://java.sun.com/developer/technicalArticles/Programming/HPROF.html
Bootstrap.java will also need to be modified for the relevant profiling
parameters to be passed to the coaster service JVM.
addDebuggingOptions() may be the right place to do so.
From support at ci.uchicago.edu Fri Jun 19 08:49:17 2009
From: support at ci.uchicago.edu (Ti Leggett)
Date: Fri, 19 Jun 2009 08:49:17 -0500
Subject: [Swift-devel] [CI Ticketing System #1074] How to set .soft and env
to run condor on TeraPort?
In-Reply-To: <4A3AB523.2060205@mcs.anl.gov>
References: <4A3AB523.2060205@mcs.anl.gov>
Message-ID:
There were some misconfigurations in the @globus-4 macro for rhel-5 and condor
that I've just fixed. Can you set your ~/.soft to look like below and then run
resoft:
@globus-4
@default
You should be using /soft/condor-7.0.5-r1 and /soft/globus-4.2.1-r2 after that.
Let me know if that works for you, or if anything changes.
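Spelled out as a sketch, the suggested ~/.soft contents are just the two macro lines above (written here to a local example file rather than anyone's real ~/.soft):

```shell
# The two-line ~/.soft Ti describes; a local example file is used
# so this sketch doesn't overwrite a real configuration.
SOFT_EXAMPLE=./dot-soft-example
cat > "$SOFT_EXAMPLE" <<'EOF'
@globus-4
@default
EOF
cat "$SOFT_EXAMPLE"
# With the real ~/.soft updated, apply it by running: resoft
```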
On Thu Jun 18 16:44:15 2009, wilde at mcs.anl.gov wrote:
> Hi,
>
> Swift users need to run the condor-g client in order to send jobs to
> OSG
> sites from a Swift script.
>
> Can you tell us how to set .soft and env so that condor_submit to
> "grid"
> universe works?
>
> We've had all sorts of problems in getting this to work well:
>
> - the version of condor client code on communicado is too new to run
> with Swift.
>
> - On teraport, it seems difficult to get the right settings of .soft
> entries and setup.sh scripts to work correctly together
>
> - I still dont know if what worked for Zhao on tp-osg a month ago
> still
> works. It seems not to, and I cant tell if its because of a change in
> .soft or env settings, or some other software issue
>
> - We would like to run from Teraport compute nodes with qsub -I, and
> hope that whatever we determine to be the right settings for login
> nodes
> work on interactive compute nodes as well.
>
> - It would be good *not* to run on tp-osg.
>
> Suchandra, Ti, or Greg, can you help us sort out how to set things
> correctly?
>
> Thanks,
>
> Mike
>
>
> -------- Original Message --------
> Subject: Re: [Swift-devel] Cant run condor-g on TeraPort
> Date: Thu, 18 Jun 2009 19:31:26 +0000 (GMT)
> From: Ben Clifford
> To: Michael Wilde
> CC: swift-devel
> References: <4A3A93E2.2080805 at mcs.anl.gov>
>
>
> condor_q works for me on tp-osg if I source /opt/osg/setenv.sh rather
> than
> use softenv. it doesn't work for me if I use @osg in softenv, with the
> error you report.
>
> On Thu, 18 Jun 2009, Michael Wilde wrote:
>
> > As far as I can tell, the condor client code is broken on TeraPort.
> >
> > Ive tried this on tp-login and tp-osg; I am using +osg-client and
> @osg in my
> > .soft. I source $VDT_LOCATION/setup.sh
> >
> > Zhao, Glen, can you cross-check and see if you are now seeing the
> same thing?
> >
> > My suspicion is that the condor client config broke in the last
> month, through
> > OSG changes, CI Support work, etc etc.
> >
> > - Mike
> >
> >
> > I get this from condor_q:
> >
> > tp$ condor_q
> > Error:
> >
> > Extra Info: You probably saw this error because the condor_schedd is
> not
> > running on the machine you are trying to query. If the condor_schedd
> is not
> > running, the Condor system will not be able to find an address and
> port to
> > connect to and satisfy this request. Please make sure the Condor
> daemons are
> > running and try again.
> >
> > Extra Info: If the condor_schedd is running on the machine you are
> trying to
> > query and you still see the error, the most likely cause is that you
> have
> > setup a personal Condor, you have not defined SCHEDD_NAME in your
> > condor_config file, and something is wrong with your
> SCHEDD_ADDRESS_FILE
> > setting. You must define either or both of those settings in your
> config
> > file, or you must use the -name option to condor_q. Please see the
> Condor
> > manual for details on SCHEDD_NAME and SCHEDD_ADDRESS_FILE.
> > tp$
> >
> > and this from swift:
> >
> > tp-grid1$ swift -tc.file tc.data -sites.file sites.condorg.xml
> cat.swift
> > Swift svn swift-r2890 cog-r2392
> >
> > RunID: 20090618-1404-mo0thjj4
> > Progress:
> > Progress: Stage in:1
> > Progress: Submitted:1
> > Failed to transfer wrapper log from cat-20090618-1404-
> mo0thjj4/info/h on
> > firefly
> > Progress: Failed:1
> > Execution failed:
> > Exception in cat:
> > Arguments: [data.txt]
> > Host: firefly
> > Directory: cat-20090618-1404-mo0thjj4/jobs/h/cat-hv5s3gcj
> > stderr.txt:
> >
> > stdout.txt:
> >
> > ----
> >
> > Caused by:
> > Cannot submit job: Could not submit job (condor_submit reported an
> > exit code of 1). no error output
> > tp-grid1$ ls
> >
> > --
> >
> > Using this sites file:
> >
> >
> >
> >
> >
> > grid
> > gt2
> > ff-grid.unl.edu/jobmanager-pbs
> > >/panfs/panasas/CMS/data/oops/wilde/swiftwork
> >
> >
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >
> >
From wilde at mcs.anl.gov Fri Jun 19 09:40:00 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 19 Jun 2009 09:40:00 -0500
Subject: [Swift-devel] Overview of coaster block-allocation-version issues
In-Reply-To: <1245419243.19245.1.camel@localhost>
References: <4A3AB8A6.9070706@mcs.anl.gov> <4A3B88F3.7020206@mcs.anl.gov>
<1245417808.18736.2.camel@localhost> <4A3B9334.80200@mcs.anl.gov>
<1245418525.19007.1.camel@localhost>
<1245419243.19245.1.camel@localhost>
Message-ID: <4A3BA340.7000004@mcs.anl.gov>
It might also be good to start with empty logs and summarize the
record-type counts in the log. That 5GB log size is a bit of a concern;
it might be something simple like n^2 debug logging.
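A quick way to get such record-type counts is a sort/uniq pipeline, sketched here against a made-up log4j-style layout where the fourth whitespace-separated field names the record type (the real coaster log format may put it elsewhere; adjust the field number accordingly):

```shell
# A tiny sample log standing in for the real multi-GB coaster log.
cat > sample.log <<'EOF'
2009-06-19 09:40:00,123 DEBUG Heartbeat ping from worker 1
2009-06-19 09:40:00,456 DEBUG Heartbeat ping from worker 2
2009-06-19 09:40:01,789 INFO JobStatus job-1 active
EOF

# Count each record type (field 4 in this layout), most frequent
# first; an n^2 logging pattern would dominate the top of this list.
awk '{print $4}' sample.log | sort | uniq -c | sort -rn
```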
On 6/19/09 8:47 AM, Mihael Hategan wrote:
> On Fri, 2009-06-19 at 08:35 -0500, Mihael Hategan wrote:
>> On Fri, 2009-06-19 at 08:31 -0500, Michael Wilde wrote:
>>> On 6/19/09 8:23 AM, Mihael Hategan wrote:
>>>> On Fri, 2009-06-19 at 07:47 -0500, Michael Wilde wrote:
>>>>> More thoughts on this:
>>>>>
>>>>> (2) is a showstopper on Ranger (and possible elsewhere) and hence a much
>>>>> more important issue than (1).
>>>>>
>>>>> It seems like this problem merits a 2-pronged attack:
>>>>>
>>>>> a) reduce the overhead. Is it logging, or intrinsic to the protocol?
>>>>> -- is it obvious from the log whats causing the high overhead?
>>>>> -- its it a situation where the overhead is incurred even when
>>>>> jobs are not running, just queued?
>>>> Some profiling needs to be done.
>>> Zhao or Allan, can you reproduce the problem on TeraPort or UC Teragrid,
>>> using a simple script and dummy app so that Mihael can readily reproduce?
>>>
>>> Mihael, do you want them to run with profiling and post results?
>> That would be great. Get a hprof dump with cpu tracing enabled. See
>> http://java.sun.com/developer/technicalArticles/Programming/HPROF.html
>
> Bootstrap.java will also need to be modified for the relevant profiling
> parameters to be passed to the coaster service JVM.
>
> addDebuggingOptions() may be the right place to do so.
>
>
From smartin at mcs.anl.gov Fri Jun 19 10:21:57 2009
From: smartin at mcs.anl.gov (Stuart Martin)
Date: Fri, 19 Jun 2009 10:21:57 -0500
Subject: [Swift-devel] swift testing of gram5 on teraport
Message-ID: <3AED3D0B-C847-47BA-AE00-092CC3239754@mcs.anl.gov>
Hi Mike,
Ben was planning on testing GRAM5 on teraport for Swift. Now that Ben
is moving on, I am wondering what the plan is for that. Do you still
plan to do that? Is there someone else that will do the testing?
Ti was going to install GRAM5 for Ben to try out, but he has been
delayed dealing with other issues. GRAM5 has not yet been installed
on teraport. I was going to ask him again to install it, but I don't
know who will now drive this testing.
-Stu
From hategan at mcs.anl.gov Fri Jun 19 10:26:48 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 19 Jun 2009 10:26:48 -0500
Subject: [Swift-devel] Overview of coaster block-allocation-version issues
In-Reply-To: <4A3BA340.7000004@mcs.anl.gov>
References: <4A3AB8A6.9070706@mcs.anl.gov> <4A3B88F3.7020206@mcs.anl.gov>
<1245417808.18736.2.camel@localhost> <4A3B9334.80200@mcs.anl.gov>
<1245418525.19007.1.camel@localhost>
<1245419243.19245.1.camel@localhost> <4A3BA340.7000004@mcs.anl.gov>
Message-ID: <1245425208.20840.0.camel@localhost>
On Fri, 2009-06-19 at 09:40 -0500, Michael Wilde wrote:
> might in addition be good to start with empty logs, and summarize the
> record type counts in the log. That 5GB log size is a bit of a concern.
> Might be something simple like n^2 debug logging.
:)
No. It's verbose because the software is new and at this point it's
better to have more information than less.
>
> On 6/19/09 8:47 AM, Mihael Hategan wrote:
> > On Fri, 2009-06-19 at 08:35 -0500, Mihael Hategan wrote:
> >> On Fri, 2009-06-19 at 08:31 -0500, Michael Wilde wrote:
> >>> On 6/19/09 8:23 AM, Mihael Hategan wrote:
> >>>> On Fri, 2009-06-19 at 07:47 -0500, Michael Wilde wrote:
> >>>>> More thoughts on this:
> >>>>>
> >>>>> (2) is a showstopper on Ranger (and possible elsewhere) and hence a much
> >>>>> more important issue than (1).
> >>>>>
> >>>>> It seems like this problem merits a 2-pronged attack:
> >>>>>
> >>>>> a) reduce the overhead. Is it logging, or intrinsic to the protocol?
> >>>>> -- is it obvious from the log whats causing the high overhead?
> >>>>> -- its it a situation where the overhead is incurred even when
> >>>>> jobs are not running, just queued?
> >>>> Some profiling needs to be done.
> >>> Zhao or Allan, can you reproduce the problem on TeraPort or UC Teragrid,
> >>> using a simple script and dummy app so that Mihael can readily reproduce?
> >>>
> >>> Mihael, do you want them to run with profiling and post results?
> >> That would be great. Get a hprof dump with cpu tracing enabled. See
> >> http://java.sun.com/developer/technicalArticles/Programming/HPROF.html
> >
> > Bootstrap.java will also need to be modified for the relevant profiling
> > parameters to be passed to the coaster service JVM.
> >
> > addDebuggingOptions() may be the right place to do so.
> >
> >
From wilde at mcs.anl.gov Fri Jun 19 10:54:15 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 19 Jun 2009 10:54:15 -0500
Subject: [Swift-devel] Re: swift testing of gram5 on teraport
In-Reply-To: <3AED3D0B-C847-47BA-AE00-092CC3239754@mcs.anl.gov>
References: <3AED3D0B-C847-47BA-AE00-092CC3239754@mcs.anl.gov>
Message-ID: <4A3BB4A7.2070708@mcs.anl.gov>
We'll find a way to do this, Stu, but it may go a little slower than
desired due to heavy multi-tasking in the group.
So you should push forward to get it testable; that's step zero, I think.
In parallel, we should discuss on the list what, if any, Swift changes
are needed to use it. I don't have my head around the issue at the
moment. Where can we read the specs for how it affects the user?
We have a pretty swamped schedule through July, so I'd expect to slot
this for late July or early August.
Thanks,
Mike
On 6/19/09 10:21 AM, Stuart Martin wrote:
> Hi Mike,
>
> Ben was planning on testing GRAM5 on teraport for Swift. Now that Ben
> is moving on, I am wondering what the plan is for that. Do you still
> plan to do that? Is there someone else that will do the testing?
>
> Ti was going to install GRAM5 for Ben to try out, but he has been
> delayed dealing with other issues. GRAM5 has not yet been installed on
> teraport. I was going to ask him again to install it, but I don't know
> who will now drive this testing.
>
> -Stu
From hockyg at uchicago.edu Fri Jun 19 10:56:50 2009
From: hockyg at uchicago.edu (Glen Hocky)
Date: Fri, 19 Jun 2009 10:56:50 -0500
Subject: [Swift-devel] Re: [CI Ticketing System #1074] How to set .soft and
env to run condor on TeraPort?
In-Reply-To:
References: <4A3AB523.2060205@mcs.anl.gov>
Message-ID:
This did update my Condor and Globus locations, but did not fix the
problem. Hopefully Zhao can tell me what to do next.
[hockyg at tp-grid1 swift]$ which condor_q
/soft/condor-7.0.5-r1/bin/condor_q
[hockyg at tp-grid1 swift]$ condor_q
Neither the environment variable CONDOR_CONFIG,
/etc/condor/, nor ~condor/ contain a condor_config source.
Either set CONDOR_CONFIG to point to a valid config source,
or put a "condor_config" file in /etc/condor or ~condor/
Exiting.
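That error means the client tools cannot locate a condor_config file at all. A common workaround is to point CONDOR_CONFIG at the install's config explicitly; the path below is only a guess derived from the /soft/condor-7.0.5-r1 prefix and may not match where this install actually keeps its config:

```shell
# Point the Condor client tools at a config file explicitly.
# The etc/ location under the install prefix is an assumption.
export CONDOR_CONFIG=/soft/condor-7.0.5-r1/etc/condor_config
echo "CONDOR_CONFIG=$CONDOR_CONFIG"
```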
On Fri, Jun 19, 2009 at 8:49 AM, Ti Leggett wrote:
> There were some misconfigurations in the @globus-4 macro for rhel-5 and
> condor
> that I've just fixed. Can you set your ~/.soft to look like below and then
> run
> resoft:
>
> @globus-4
>
> @default
>
> You should be using /soft/condor-7.0.5-r1 and /soft/globus-4.2.1-r2 after
> that.
> Let me know if that works for you, or if anything changes.
>
> On Thu Jun 18 16:44:15 2009, wilde at mcs.anl.gov wrote:
> > Hi,
> >
> > Swift users need to run the condor-g client in order to send jobs to
> > OSG
> > sites from a Swift script.
> >
> > Can you tell us how to set .soft and env so that condor_submit to
> > "grid"
> > universe works?
> >
> > We've had all sorts of problems in getting this to work well:
> >
> > - the version of condor client code on communicado is too new to run
> > with Swift.
> >
> > - On teraport, it seems difficult to get the right settings of .soft
> > entries and setup.sh scripts to work corrcetly together
> >
> > - I still dont know if what worked for Zhao on tp-osg a month ago
> > still
> > works. It seems not to, and I cant tell if its because of a change in
> > .soft or env settings, or some other software issue
> >
> > - We would like to run from Teraport compute nodes with qsub -I, and
> > hope that whatever we determine to be the right settings for login
> > nodes
> > work on interactive compute nodes as well.
> >
> > - It would be good *not* to run on tp-osg.
> >
> > Suchandra, Ti, or Greg, can you help us sort out how to set things
> > correctly?
> >
> > Tanks,
> >
> > Mike
> >
> >
> > -------- Original Message --------
> > Subject: Re: [Swift-devel] Cant run condor-g on TeraPort
> > Date: Thu, 18 Jun 2009 19:31:26 +0000 (GMT)
> > From: Ben Clifford
> > To: Michael Wilde
> > CC: swift-devel
> > References: <4A3A93E2.2080805 at mcs.anl.gov>
> >
> >
> > condor_q works for me on tp-osg if I source /opt/osg/setenv.sh rather
> > than
> > use softenv. it doesn't work for me if I use @osg in softenv, with the
> > error you report.
> >
> > On Thu, 18 Jun 2009, Michael Wilde wrote:
> >
> > > As far as I can tell, the condor client code is broken on TeraPort.
> > >
> > > Ive tried this on tp-login and tp-osg; I am using +osg-client and
> > @osg in my
> > > .soft. I source $VDT_LOCATION/setup.sh
> > >
> > > Zhao, Glen, can you cross-check and see if you are now seeing the
> > same thing?
> > >
> > > My suspicion is that the condor client config broke in the last
> > month, through
> > > OSG changes, CI Support work, etc etc.
> > >
> > > - Mike
> > >
> > >
> > > I get this from condor_q:
> > >
> > > tp$ condor_q
> > > Error:
> > >
> > > Extra Info: You probably saw this error because the condor_schedd is
> > not
> > > running on the machine you are trying to query. If the condor_schedd
> > is not
> > > running, the Condor system will not be able to find an address and
> > port to
> > > connect to and satisfy this request. Please make sure the Condor
> > daemons are
> > > running and try again.
> > >
> > > Extra Info: If the condor_schedd is running on the machine you are
> > trying to
> > > query and you still see the error, the most likely cause is that you
> > have
> > > setup a personal Condor, you have not defined SCHEDD_NAME in your
> > > condor_config file, and something is wrong with your
> > SCHEDD_ADDRESS_FILE
> > > setting. You must define either or both of those settings in your
> > config
> > > file, or you must use the -name option to condor_q. Please see the
> > Condor
> > > manual for details on SCHEDD_NAME and SCHEDD_ADDRESS_FILE.
> > > tp$
> > >
> > > and this from swift:
> > >
> > > tp-grid1$ swift -tc.file tc.data -sites.file sites.condorg.xml
> > cat.swift
> > > Swift svn swift-r2890 cog-r2392
> > >
> > > RunID: 20090618-1404-mo0thjj4
> > > Progress:
> > > Progress: Stage in:1
> > > Progress: Submitted:1
> > > Failed to transfer wrapper log from cat-20090618-1404-
> > mo0thjj4/info/h on
> > > firefly
> > > Progress: Failed:1
> > > Execution failed:
> > > Exception in cat:
> > > Arguments: [data.txt]
> > > Host: firefly
> > > Directory: cat-20090618-1404-mo0thjj4/jobs/h/cat-hv5s3gcj
> > > stderr.txt:
> > >
> > > stdout.txt:
> > >
> > > ----
> > >
> > > Caused by:
> > > Cannot submit job: Could not submit job (condor_submit reported an
> > > exit code of 1). no error output
> > > tp-grid1$ ls
> > >
> > > --
> > >
> > > Using this sites file:
> > >
> > >
> > >
> > >
> > >
> > > grid
> > > gt2
> > > ff-grid.unl.edu/jobmanager-pbs
> > > > >/panfs/panasas/CMS/data/oops/wilde/swiftwork
> > >
> > >
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > >
> > >
>
>
From benc at hawaga.org.uk Fri Jun 19 10:58:17 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 19 Jun 2009 15:58:17 +0000 (GMT)
Subject: [Swift-devel] Re: swift testing of gram5 on teraport
In-Reply-To: <4A3BB4A7.2070708@mcs.anl.gov>
References: <3AED3D0B-C847-47BA-AE00-092CC3239754@mcs.anl.gov>
<4A3BB4A7.2070708@mcs.anl.gov>
Message-ID:
On Fri, 19 Jun 2009, Michael Wilde wrote:
> In parallel, we should discuss on the list what ifany Swift changes are needed
> to use it. It dont have my head around the issue at the moment. Where can we
> read the specs of how it affects the user?
Theoretically it will Just Work with the GRAM2 provider. Evidence thus far
suggests this might be true (for example, apparently the gram2 cog stuff
can submit to gram5 ok) but there hasn't been any swift-level testing to
see how it all fits together.
--
From zhaozhang at uchicago.edu Fri Jun 19 11:00:11 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Fri, 19 Jun 2009 11:00:11 -0500
Subject: [Swift-devel] Re: [CI Ticketing System #1074] How to set .soft and
env to run condor on TeraPort?
In-Reply-To:
References: <4A3AB523.2060205@mcs.anl.gov>
Message-ID: <4A3BB60B.3040008@uchicago.edu>
Here is my .soft
[zzhang at tp-grid1 ~]$ cat .soft
#
# This is your SoftEnv configuration run control file.
#
# It is used to tell SoftEnv how to customize your environment by
# setting up variables such as PATH and MANPATH. To learn more
# about this file, do a "man softenv".
#
+java-sun
+osg-client
+maui
+torque
@python-2.5
@osg
@default
@globus-4
And the source file is
source /opt/osg/setup.sh
zhao
Glen Hocky wrote:
> This did update my condor and globus locations, but did not fix the
> problem. Hopefully Zhao can tell me what to do next
>
> [hockyg at tp-grid1 swift]$ which condor_q
> /soft/condor-7.0.5-r1/bin/condor_q
> [hockyg at tp-grid1 swift]$ condor_q
>
> Neither the environment variable CONDOR_CONFIG,
> /etc/condor/, nor ~condor/ contain a condor_config source.
> Either set CONDOR_CONFIG to point to a valid config source,
> or put a "condor_config" file in /etc/condor or ~condor/
> Exiting.
>
>
> On Fri, Jun 19, 2009 at 8:49 AM, Ti Leggett > wrote:
>
> There were some misconfigurations in the @globus-4 macro for
> rhel-5 and condor
> that I've just fixed. Can you set your ~/.soft to look like below
> and then run
> resoft:
>
> @globus-4
>
> @default
>
> You should be using /soft/condor-7.0.5-r1 and
> /soft/globus-4.2.1-r2 after that.
> Let me know if that works for you, or if anything changes.
>
> On Thu Jun 18 16:44:15 2009, wilde at mcs.anl.gov
> wrote:
> > Hi,
> >
> > Swift users need to run the condor-g client in order to send jobs to
> > OSG
> > sites from a Swift script.
> >
> > Can you tell us how to set .soft and env so that condor_submit to
> > "grid"
> > universe works?
> >
> > We've had all sorts of problems in getting this to work well:
> >
> > - the version of condor client code on communicado is too new to run
> > with Swift.
> >
> > - On teraport, it seems difficult to get the right settings of .soft
> > entries and setup.sh scripts to work corrcetly together
> >
> > - I still dont know if what worked for Zhao on tp-osg a month ago
> > still
> > works. It seems not to, and I cant tell if its because of a
> change in
> > .soft or env settings, or some other software issue
> >
> > - We would like to run from Teraport compute nodes with qsub -I, and
> > hope that whatever we determine to be the right settings for login
> > nodes
> > work on interactive compute nodes as well.
> >
> > - It would be good *not* to run on tp-osg.
> >
> > Suchandra, Ti, or Greg, can you help us sort out how to set things
> > correctly?
> >
> > Tanks,
> >
> > Mike
> >
> >
> > -------- Original Message --------
> > Subject: Re: [Swift-devel] Cant run condor-g on TeraPort
> > Date: Thu, 18 Jun 2009 19:31:26 +0000 (GMT)
> > From: Ben Clifford >
> > To: Michael Wilde >
> > CC: swift-devel >
> > References: <4A3A93E2.2080805 at mcs.anl.gov
> >
> >
> >
> > condor_q works for me on tp-osg if I source /opt/osg/setenv.sh
> rather
> > than
> > use softenv. it doesn't work for me if I use @osg in softenv,
> with the
> > error you report.
> >
> > On Thu, 18 Jun 2009, Michael Wilde wrote:
> >
> > > As far as I can tell, the condor client code is broken on
> TeraPort.
> > >
> > > Ive tried this on tp-login and tp-osg; I am using +osg-client and
> > @osg in my
> > > .soft. I source $VDT_LOCATION/setup.sh
> > >
> > > Zhao, Glen, can you cross-check and see if you are now seeing the
> > same thing?
> > >
> > > My suspicion is that the condor client config broke in the last
> > month, through
> > > OSG changes, CI Support work, etc etc.
> > >
> > > - Mike
> > >
> > >
> > > I get this from condor_q:
> > >
> > > tp$ condor_q
> > > Error:
> > >
> > > Extra Info: You probably saw this error because the
> condor_schedd is
> > not
> > > running on the machine you are trying to query. If the
> condor_schedd
> > is not
> > > running, the Condor system will not be able to find an address and
> > port to
> > > connect to and satisfy this request. Please make sure the Condor
> > daemons are
> > > running and try again.
> > >
> > > Extra Info: If the condor_schedd is running on the machine you are
> > trying to
> > > query and you still see the error, the most likely cause is
> that you
> > have
> > > setup a personal Condor, you have not defined SCHEDD_NAME in your
> > > condor_config file, and something is wrong with your
> > SCHEDD_ADDRESS_FILE
> > > setting. You must define either or both of those settings in your
> > config
> > > file, or you must use the -name option to condor_q. Please see the
> > Condor
> > > manual for details on SCHEDD_NAME and SCHEDD_ADDRESS_FILE.
> > > tp$
> > >
> > > and this from swift:
> > >
> > > tp-grid1$ swift -tc.file tc.data -sites.file sites.condorg.xml
> > cat.swift
> > > Swift svn swift-r2890 cog-r2392
> > >
> > > RunID: 20090618-1404-mo0thjj4
> > > Progress:
> > > Progress: Stage in:1
> > > Progress: Submitted:1
> > > Failed to transfer wrapper log from cat-20090618-1404-
> > mo0thjj4/info/h on
> > > firefly
> > > Progress: Failed:1
> > > Execution failed:
> > > Exception in cat:
> > > Arguments: [data.txt]
> > > Host: firefly
> > > Directory: cat-20090618-1404-mo0thjj4/jobs/h/cat-hv5s3gcj
> > > stderr.txt:
> > >
> > > stdout.txt:
> > >
> > > ----
> > >
> > > Caused by:
> > > Cannot submit job: Could not submit job (condor_submit reported an
> > > exit code of 1). no error output
> > > tp-grid1$ ls
> > >
> > > --
> > >
> > > Using this sites file:
> > >
> > >
> > >
> > >
> > >
> > > grid
> > > gt2
> > > ff-grid.unl.edu/jobmanager-pbs
> > > /panfs/panasas/CMS/data/oops/wilde/swiftwork
> > >
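[Editor's note: the sites file above lost its XML markup in the archive. A rough reconstruction, assuming the contemporary Swift condor-provider syntax (element and attribute names, and the pool handle, are assumptions; only the surviving values come from the message):]

```xml
<!-- Hypothetical reconstruction; handle and element names are assumed -->
<pool handle="firefly">
  <execution provider="condor"/>
  <profile namespace="globus" key="jobType">grid</profile>
  <profile namespace="globus" key="gridResource">gt2 ff-grid.unl.edu/jobmanager-pbs</profile>
  <workdirectory>/panfs/panasas/CMS/data/oops/wilde/swiftwork</workdirectory>
</pool>
```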
> > >
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > >
> > >
>
>
From benc at hawaga.org.uk Fri Jun 19 11:00:13 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 19 Jun 2009 16:00:13 +0000 (GMT)
Subject: [Swift-devel] Re: [CI Ticketing System #1074] How to set .soft and
env to run condor on TeraPort?
In-Reply-To:
References: <4A3AB523.2060205@mcs.anl.gov>
Message-ID:
my experience on tp-osg is that sourcing /opt/osg/setup.sh makes condor
work. this suggests perhaps that in a working environment, condor should
be coming from that OSG stack and not from a specific condor softenv key.
--
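[Editor's note: the workaround Ben describes can be sketched as a small login-shell snippet. The path is the one reported for tp-osg; whether the script is setup.sh or setenv.sh on a given host is an open question in this thread, so treat the filename as an assumption.]

```shell
#!/bin/sh
# Sketch of the reported workaround: take Condor from the OSG/VDT stack
# by sourcing its setup script, instead of relying on the @osg softenv key.
OSG_SETUP="${OSG_SETUP:-/opt/osg/setup.sh}"   # path reported on tp-osg
if [ -r "$OSG_SETUP" ]; then
    . "$OSG_SETUP"                            # sets VDT_LOCATION, PATH, etc.
else
    echo "OSG setup script not found at $OSG_SETUP" >&2
fi
# Sanity check: condor_q should now resolve under the OSG stack,
# not under /soft/condor-*.
if command -v condor_q >/dev/null 2>&1; then
    printf 'condor_q -> %s\n' "$(command -v condor_q)"
else
    echo 'condor_q not on PATH'
fi
```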
From smartin at mcs.anl.gov Fri Jun 19 11:02:33 2009
From: smartin at mcs.anl.gov (Stuart Martin)
Date: Fri, 19 Jun 2009 11:02:33 -0500
Subject: [Swift-devel] Re: swift testing of gram5 on teraport
In-Reply-To: <4A3BB4A7.2070708@mcs.anl.gov>
References: <3AED3D0B-C847-47BA-AE00-092CC3239754@mcs.anl.gov>
<4A3BB4A7.2070708@mcs.anl.gov>
Message-ID:
On Jun 19, 2009, at Jun 19, 10:54 AM, Michael Wilde wrote:
> We'll find a way to do this, STu, but it may go a little slower than
> desired due to heavy multi-tasking in the group.
>
> So you should push forward to get it testable, that's step zero I
> think.
I am pushing forward with groups where there is someone to drive the
testing. For example, Jaime Frey is testing gram5 with condor-g. CMS
will be doing some testing in early July. Then there is the swift
testing...
>
>
> In parallel, we should discuss on the list what, if any, Swift changes
> are needed to use it. I don't have my head around the issue at the
> moment. Where can we read the specs of how it affects the user?
>
> We have a pretty swamped schedule through July, so I'd expect to
> slot this for late July or early August.
>
> Thanks,
>
> Mike
>
>
> On 6/19/09 10:21 AM, Stuart Martin wrote:
>> Hi Mike,
>> Ben was planning on testing GRAM5 on teraport for Swift. Now that
>> Ben is moving on, I am wondering what the plan is for that. Do you
>> still plan to do that? Is there someone else that will do the
>> testing?
>> Ti was going to install GRAM5 for Ben to try out, but he has been
>> delayed dealing with other issues. GRAM5 has not yet been
>> installed on teraport. I was going to ask him again to install it,
>> but I don't know who will now drive this testing.
>> -Stu
From hockyg at uchicago.edu Fri Jun 19 11:05:52 2009
From: hockyg at uchicago.edu (Glen Hocky)
Date: Fri, 19 Jun 2009 11:05:52 -0500
Subject: [Swift-devel] Re: [CI Ticketing System #1074] How to set .soft and
env to run condor on TeraPort?
In-Reply-To: <4A3BB60B.3040008@uchicago.edu>
References: <4A3AB523.2060205@mcs.anl.gov>
<4A3BB60B.3040008@uchicago.edu>
Message-ID:
That did it for me! Thanks Zhao
On Fri, Jun 19, 2009 at 11:00 AM, Zhao Zhang wrote:
> Here is my .soft
>
> [zzhang at tp-grid1 ~]$ cat .soft
> #
> # This is your SoftEnv configuration run control file.
> #
> # It is used to tell SoftEnv how to customize your environment by
> # setting up variables such as PATH and MANPATH. To learn more
> # about this file, do a "man softenv".
> #
> +java-sun
> +osg-client
> +maui
> +torque
> @python-2.5
> @osg
> @default
> @globus-4
>
> And the source file is
> source /opt/osg/setup.sh
>
> zhao
>
> Glen Hocky wrote:
>
>> This did update my condor and globus locations, but did not fix the
>> problem. Hopefully Zhao can tell me what to do next
>>
>> [hockyg at tp-grid1 swift]$ which condor_q
>> /soft/condor-7.0.5-r1/bin/condor_q
>> [hockyg at tp-grid1 swift]$ condor_q
>>
>> Neither the environment variable CONDOR_CONFIG,
>> /etc/condor/, nor ~condor/ contain a condor_config source.
>> Either set CONDOR_CONFIG to point to a valid config source,
>> or put a "condor_config" file in /etc/condor or ~condor/
>> Exiting.
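[Editor's note: the lookup order in that error message can be emulated with a small sketch. This is a hypothetical helper, not Condor's actual code; ~condor is approximated by $HOME here, and the commented export shows the likely fix with an assumed file location.]

```shell
#!/bin/sh
# Emulates the config search order condor_q reports: $CONDOR_CONFIG first,
# then /etc/condor/condor_config, then the condor user's home directory
# (approximated here by $HOME). Returns non-zero if none is readable.
find_condor_config() {
    for cand in "${CONDOR_CONFIG:-}" /etc/condor/condor_config \
                "$HOME/condor_config"; do
        if [ -n "$cand" ] && [ -r "$cand" ]; then
            printf '%s\n' "$cand"
            return 0
        fi
    done
    return 1
}
# Likely fix after the softenv change (the exact path under the condor
# tree is an assumption):
#   export CONDOR_CONFIG=/soft/condor-7.0.5-r1/etc/condor_config
find_condor_config || echo 'no condor_config found; set CONDOR_CONFIG' >&2
```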
>>
>>
>> On Fri, Jun 19, 2009 at 8:49 AM, Ti Leggett > support at ci.uchicago.edu>> wrote:
>>
>> There were some misconfigurations in the @globus-4 macro for
>> rhel-5 and condor
>> that I've just fixed. Can you set your ~/.soft to look like below
>> and then run
>> resoft:
>>
>> @globus-4
>>
>> @default
>>
>> You should be using /soft/condor-7.0.5-r1 and
>> /soft/globus-4.2.1-r2 after that.
>> Let me know if that works for you, or if anything changes.
>>
>> On Thu Jun 18 16:44:15 2009, wilde at mcs.anl.gov
>> wrote:
>> > Hi,
>> >
>> > Swift users need to run the condor-g client in order to send jobs to
>> > OSG
>> > sites from a Swift script.
>> >
>> > Can you tell us how to set .soft and env so that condor_submit to
>> > "grid"
>> > universe works?
>> >
>> > We've had all sorts of problems in getting this to work well:
>> >
>> > - the version of condor client code on communicado is too new to run
>> > with Swift.
>> >
>> > - On teraport, it seems difficult to get the right settings of .soft
>> > entries and setup.sh scripts to work correctly together
>> >
>> > - I still don't know if what worked for Zhao on tp-osg a month ago
>> > still
>> > works. It seems not to, and I can't tell if it's because of a
>> change in
>> > .soft or env settings, or some other software issue
>> >
>> > - We would like to run from Teraport compute nodes with qsub -I, and
>> > hope that whatever we determine to be the right settings for login
>> > nodes
>> > work on interactive compute nodes as well.
>> >
>> > - It would be good *not* to run on tp-osg.
>> >
>> > Suchandra, Ti, or Greg, can you help us sort out how to set things
>> > correctly?
>> >
>> > Thanks,
>> >
>> > Mike
>> >
>> >
>> > -------- Original Message --------
>> > Subject: Re: [Swift-devel] Cant run condor-g on TeraPort
>> > Date: Thu, 18 Jun 2009 19:31:26 +0000 (GMT)
>> > From: Ben Clifford >
>> > To: Michael Wilde >
>> > CC: swift-devel > >
>> > References: <4A3A93E2.2080805 at mcs.anl.gov
>> >
>> >
>> >
>> > condor_q works for me on tp-osg if I source /opt/osg/setenv.sh
>> rather
>> > than
>> > use softenv. it doesn't work for me if I use @osg in softenv,
>> with the
>> > error you report.
>> >
>> > On Thu, 18 Jun 2009, Michael Wilde wrote:
>> >
>> > > As far as I can tell, the condor client code is broken on
>> TeraPort.
>> > >
>> > > I've tried this on tp-login and tp-osg; I am using +osg-client and
>> > @osg in my
>> > > .soft. I source $VDT_LOCATION/setup.sh
>> > >
>> > > Zhao, Glen, can you cross-check and see if you are now seeing the
>> > same thing?
>> > >
>> > > My suspicion is that the condor client config broke in the last
>> > month, through
>> > > OSG changes, CI Support work, etc etc.
>> > >
>> > > - Mike
>> > >
>> > >
>> > > I get this from condor_q:
>> > >
>> > > tp$ condor_q
>> > > Error:
>> > >
>> > > Extra Info: You probably saw this error because the
>> condor_schedd is
>> > not
>> > > running on the machine you are trying to query. If the
>> condor_schedd
>> > is not
>> > > running, the Condor system will not be able to find an address and
>> > port to
>> > > connect to and satisfy this request. Please make sure the Condor
>> > daemons are
>> > > running and try again.
>> > >
>> > > Extra Info: If the condor_schedd is running on the machine you are
>> > trying to
>> > > query and you still see the error, the most likely cause is
>> that you
>> > have
>> > > setup a personal Condor, you have not defined SCHEDD_NAME in your
>> > > condor_config file, and something is wrong with your
>> > SCHEDD_ADDRESS_FILE
>> > > setting. You must define either or both of those settings in your
>> > config
>> > > file, or you must use the -name option to condor_q. Please see the
>> > Condor
>> > > manual for details on SCHEDD_NAME and SCHEDD_ADDRESS_FILE.
>> > > tp$
>> > >
>> > > and this from swift:
>> > >
>> > > tp-grid1$ swift -tc.file tc.data -sites.file sites.condorg.xml
>> > cat.swift
>> > > Swift svn swift-r2890 cog-r2392
>> > >
>> > > RunID: 20090618-1404-mo0thjj4
>> > > Progress:
>> > > Progress: Stage in:1
>> > > Progress: Submitted:1
>> > > Failed to transfer wrapper log from cat-20090618-1404-
>> > mo0thjj4/info/h on
>> > > firefly
>> > > Progress: Failed:1
>> > > Execution failed:
>> > > Exception in cat:
>> > > Arguments: [data.txt]
>> > > Host: firefly
>> > > Directory: cat-20090618-1404-mo0thjj4/jobs/h/cat-hv5s3gcj
>> > > stderr.txt:
>> > >
>> > > stdout.txt:
>> > >
>> > > ----
>> > >
>> > > Caused by:
>> > > Cannot submit job: Could not submit job (condor_submit reported an
>> > > exit code of 1). no error output
>> > > tp-grid1$ ls
>> > >
>> > > --
>> > >
>> > > Using this sites file:
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > grid
>> > > gt2
>> > > ff-grid.unl.edu/jobmanager-pbs
>> > > /panfs/panasas/CMS/data/oops/wilde/swiftwork
>> > >
>> > >
>> > > _______________________________________________
>> > > Swift-devel mailing list
>> > > Swift-devel at ci.uchicago.edu
>> > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>> > >
>> > >
>>
>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From hockyg at uchicago.edu Fri Jun 19 11:06:24 2009
From: hockyg at uchicago.edu (Glen Hocky)
Date: Fri, 19 Jun 2009 11:06:24 -0500
Subject: [Swift-devel] Re: [CI Ticketing System #1074] How to set .soft and
env to run condor on TeraPort?
In-Reply-To:
References: <4A3AB523.2060205@mcs.anl.gov>
<4A3BB60B.3040008@uchicago.edu>
Message-ID:
(and ben)
On Fri, Jun 19, 2009 at 11:05 AM, Glen Hocky wrote:
> That did it for me! Thanks Zhao
>
>
> On Fri, Jun 19, 2009 at 11:00 AM, Zhao Zhang wrote:
>
>> Here is my .soft
>>
>> [zzhang at tp-grid1 ~]$ cat .soft
>> #
>> # This is your SoftEnv configuration run control file.
>> #
>> # It is used to tell SoftEnv how to customize your environment by
>> # setting up variables such as PATH and MANPATH. To learn more
>> # about this file, do a "man softenv".
>> #
>> +java-sun
>> +osg-client
>> +maui
>> +torque
>> @python-2.5
>> @osg
>> @default
>> @globus-4
>>
>> And the source file is
>> source /opt/osg/setup.sh
>>
>> zhao
>>
>> Glen Hocky wrote:
>>
>>> This did update my condor and globus locations, but did not fix the
>>> problem. Hopefully Zhao can tell me what to do next
>>>
>>> [hockyg at tp-grid1 swift]$ which condor_q
>>> /soft/condor-7.0.5-r1/bin/condor_q
>>> [hockyg at tp-grid1 swift]$ condor_q
>>>
>>> Neither the environment variable CONDOR_CONFIG,
>>> /etc/condor/, nor ~condor/ contain a condor_config source.
>>> Either set CONDOR_CONFIG to point to a valid config source,
>>> or put a "condor_config" file in /etc/condor or ~condor/
>>> Exiting.
>>>
>>>
>>> On Fri, Jun 19, 2009 at 8:49 AM, Ti Leggett >> support at ci.uchicago.edu>> wrote:
>>>
>>> There were some misconfigurations in the @globus-4 macro for
>>> rhel-5 and condor
>>> that I've just fixed. Can you set your ~/.soft to look like below
>>> and then run
>>> resoft:
>>>
>>> @globus-4
>>>
>>> @default
>>>
>>> You should be using /soft/condor-7.0.5-r1 and
>>> /soft/globus-4.2.1-r2 after that.
>>> Let me know if that works for you, or if anything changes.
>>>
>>> On Thu Jun 18 16:44:15 2009, wilde at mcs.anl.gov
>>> wrote:
>>> > Hi,
>>> >
>>> > Swift users need to run the condor-g client in order to send jobs to
>>> > OSG
>>> > sites from a Swift script.
>>> >
>>> > Can you tell us how to set .soft and env so that condor_submit to
>>> > "grid"
>>> > universe works?
>>> >
>>> > We've had all sorts of problems in getting this to work well:
>>> >
>>> > - the version of condor client code on communicado is too new to run
>>> > with Swift.
>>> >
>>> > - On teraport, it seems difficult to get the right settings of .soft
>>> > entries and setup.sh scripts to work correctly together
>>> >
>>> > - I still don't know if what worked for Zhao on tp-osg a month ago
>>> > still
>>> > works. It seems not to, and I can't tell if it's because of a
>>> change in
>>> > .soft or env settings, or some other software issue
>>> >
>>> > - We would like to run from Teraport compute nodes with qsub -I, and
>>> > hope that whatever we determine to be the right settings for login
>>> > nodes
>>> > work on interactive compute nodes as well.
>>> >
>>> > - It would be good *not* to run on tp-osg.
>>> >
>>> > Suchandra, Ti, or Greg, can you help us sort out how to set things
>>> > correctly?
>>> >
>>> > Thanks,
>>> >
>>> > Mike
>>> >
>>> >
>>> > -------- Original Message --------
>>> > Subject: Re: [Swift-devel] Cant run condor-g on TeraPort
>>> > Date: Thu, 18 Jun 2009 19:31:26 +0000 (GMT)
>>> > From: Ben Clifford >
>>> > To: Michael Wilde >
>>> > CC: swift-devel >> >
>>> > References: <4A3A93E2.2080805 at mcs.anl.gov
>>> >
>>> >
>>> >
>>> > condor_q works for me on tp-osg if I source /opt/osg/setenv.sh
>>> rather
>>> > than
>>> > use softenv. it doesn't work for me if I use @osg in softenv,
>>> with the
>>> > error you report.
>>> >
>>> > On Thu, 18 Jun 2009, Michael Wilde wrote:
>>> >
>>> > > As far as I can tell, the condor client code is broken on
>>> TeraPort.
>>> > >
>>> > > I've tried this on tp-login and tp-osg; I am using +osg-client and
>>> > @osg in my
>>> > > .soft. I source $VDT_LOCATION/setup.sh
>>> > >
>>> > > Zhao, Glen, can you cross-check and see if you are now seeing the
>>> > same thing?
>>> > >
>>> > > My suspicion is that the condor client config broke in the last
>>> > month, through
>>> > > OSG changes, CI Support work, etc etc.
>>> > >
>>> > > - Mike
>>> > >
>>> > >
>>> > > I get this from condor_q:
>>> > >
>>> > > tp$ condor_q
>>> > > Error:
>>> > >
>>> > > Extra Info: You probably saw this error because the
>>> condor_schedd is
>>> > not
>>> > > running on the machine you are trying to query. If the
>>> condor_schedd
>>> > is not
>>> > > running, the Condor system will not be able to find an address and
>>> > port to
>>> > > connect to and satisfy this request. Please make sure the Condor
>>> > daemons are
>>> > > running and try again.
>>> > >
>>> > > Extra Info: If the condor_schedd is running on the machine you are
>>> > trying to
>>> > > query and you still see the error, the most likely cause is
>>> that you
>>> > have
>>> > > setup a personal Condor, you have not defined SCHEDD_NAME in your
>>> > > condor_config file, and something is wrong with your
>>> > SCHEDD_ADDRESS_FILE
>>> > > setting. You must define either or both of those settings in your
>>> > config
>>> > > file, or you must use the -name option to condor_q. Please see the
>>> > Condor
>>> > > manual for details on SCHEDD_NAME and SCHEDD_ADDRESS_FILE.
>>> > > tp$
>>> > >
>>> > > and this from swift:
>>> > >
>>> > > tp-grid1$ swift -tc.file tc.data -sites.file sites.condorg.xml
>>> > cat.swift
>>> > > Swift svn swift-r2890 cog-r2392
>>> > >
>>> > > RunID: 20090618-1404-mo0thjj4
>>> > > Progress:
>>> > > Progress: Stage in:1
>>> > > Progress: Submitted:1
>>> > > Failed to transfer wrapper log from cat-20090618-1404-
>>> > mo0thjj4/info/h on
>>> > > firefly
>>> > > Progress: Failed:1
>>> > > Execution failed:
>>> > > Exception in cat:
>>> > > Arguments: [data.txt]
>>> > > Host: firefly
>>> > > Directory: cat-20090618-1404-mo0thjj4/jobs/h/cat-hv5s3gcj
>>> > > stderr.txt:
>>> > >
>>> > > stdout.txt:
>>> > >
>>> > > ----
>>> > >
>>> > > Caused by:
>>> > > Cannot submit job: Could not submit job (condor_submit reported an
>>> > > exit code of 1). no error output
>>> > > tp-grid1$ ls
>>> > >
>>> > > --
>>> > >
>>> > > Using this sites file:
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > > grid
>>> > > gt2
>>> > > ff-grid.unl.edu/jobmanager-pbs
>>> > > /panfs/panasas/CMS/data/oops/wilde/swiftwork
>>> > >
>>> > >
>>> > > _______________________________________________
>>> > > Swift-devel mailing list
>>> > > Swift-devel at ci.uchicago.edu
>>> > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>> > >
>>> > >
>>>
>>>
>>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From support at ci.uchicago.edu Fri Jun 19 11:07:02 2009
From: support at ci.uchicago.edu (Ben Clifford)
Date: Fri, 19 Jun 2009 11:07:02 -0500
Subject: [Swift-devel] Re: [CI Ticketing System #1074] How to set .soft and
env to run condor on TeraPort? (fwd)
In-Reply-To:
References:
Message-ID:
ci support got removed from this thread but I believe this is relevant.
Zhao also reports the same way of getting it working, in another
non-ci-support message.
---------- Forwarded message ----------
Date: Fri, 19 Jun 2009 16:00:13 +0000 (GMT)
From: Ben Clifford
To: Glen Hocky
Cc: wilde at mcs.anl.gov, sthapa at ci.uchicago.edu, swift-devel at ci.uchicago.edu,
zhaozhang at uchicago.edu, papka at ci.uchicago.edu
Subject: Re: [CI Ticketing System #1074] How to set .soft and env to run condor
on TeraPort?
my experience on tp-osg is that sourcing /opt/osg/setup.sh makes condor
work. this suggests perhaps that in a working environment, condor should
be coming from that OSG stack and not from a specific condor softenv key.
--
From wilde at mcs.anl.gov Fri Jun 19 11:08:02 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 19 Jun 2009 11:08:02 -0500
Subject: [Swift-devel] Re: swift testing of gram5 on teraport
In-Reply-To:
References: <3AED3D0B-C847-47BA-AE00-092CC3239754@mcs.anl.gov>
<4A3BB4A7.2070708@mcs.anl.gov>
Message-ID: <4A3BB7E2.3020503@mcs.anl.gov>
Is what we're looking to see here the ability to run Swift with a full
or wide throttle to Gram5, directly, without Condor-G, and the ability
to have (a) lots of jobs in the queue and (b) many more jobs running at
once, while watching the gatekeeper host for CPU stress and memory pressure?
Where say (a) is a few thousand jobs and (b) is the full cluster busy?
I wonder if we can get a full-system reservation on TeraPort to test this?
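[Editor's note: mechanically, "full or wide throttle" is a swift.properties change. A sketch, assuming the throttle property names of contemporary Swift releases; the values are illustrative for a stress test, not recommendations:]

```properties
# Illustrative swift.properties fragment for a GRAM5 stress test
throttle.submit=off
throttle.host.submit=off
throttle.score.job.factor=off
throttle.transfers=16
throttle.file.operations=16
```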
We're also testing Swift via Condor-G at the moment on UNL's new cluster
"Firefly" which has 6000 cores of which 3000 are accessible to OSG. As
it's a new and lightly loaded cluster, perhaps Brian Bockelman would be
willing to test GRAM5 on it? (it's a PBS cluster)
So, now that I think about it, as long as there's a GRAM5 gatekeeper we can
use, since it should Just Work, I'm sure we can give it some informal
usage as soon as it's available.
Stu, do you have plans for testing beyond Teraport on larger clusters?
I wonder, maybe we could test it in AWS at large scales too on a Nimbus
workspace?
- Mike
On 6/19/09 10:58 AM, Ben Clifford wrote:
> On Fri, 19 Jun 2009, Michael Wilde wrote:
>
>> In parallel, we should discuss on the list what, if any, Swift changes are needed
>> to use it. I don't have my head around the issue at the moment. Where can we
>> read the specs of how it affects the user?
>
> Theoretically it will Just Work with the GRAM2 provider. Evidence thus far
> suggests this might be true (for example, apparently the gram2 cog stuff
> can submit to gram5 ok) but there hasn't been any swift-level testing to
> see how it all fits together.
>
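[Editor's note: concretely, "Just Work" here would mean the sites.xml entry is unchanged apart from pointing at the GRAM5 service, since GRAM5 remains compatible with GRAM2 clients. A sketch with hypothetical host and paths, syntax assumed from contemporary Swift site catalogs:]

```xml
<!-- Hypothetical entry: host and workdirectory are placeholders.
     GRAM5 accepts GRAM2-protocol clients, so the gt2 provider is reused. -->
<pool handle="tp-gram5">
  <execution provider="gt2" url="tp-grid1.uchicago.edu/jobmanager-pbs"/>
  <gridftp url="gsiftp://tp-grid1.uchicago.edu"/>
  <workdirectory>/home/testuser/swiftwork</workdirectory>
</pool>
```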
From wilde at mcs.anl.gov Fri Jun 19 11:13:16 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 19 Jun 2009 11:13:16 -0500
Subject: [Swift-devel] Re: [CI Ticketing System #1074] How to set .soft and
env to run condor on TeraPort? (fwd)
In-Reply-To:
References:
Message-ID: <4A3BB91C.9060906@mcs.anl.gov>
Zhao, when the best way for regular users to do this is determined
(sounds like it's close) please put instructions for how to do it on:
http://www.ci.uchicago.edu/wiki/bin/view/SWFT
as page SwiftQuickStartForCondorG
(including all the Swift config issues, eg sites.xml, etc)
Thanks,
Mike
--
ps. I think you have a few other pages that should go there, eg, BGP,
Ranger/Coasters, etc.
How to run on your own local PBS cluster
On 6/19/09 11:07 AM, Ben Clifford wrote:
> ci support got removed from this thread but I believe this is relevant.
> Zhao also reports the same way of getting it working, in another
> non-ci-support message.
>
> ---------- Forwarded message ----------
> Date: Fri, 19 Jun 2009 16:00:13 +0000 (GMT)
> From: Ben Clifford
> To: Glen Hocky
> Cc: wilde at mcs.anl.gov, sthapa at ci.uchicago.edu, swift-devel at ci.uchicago.edu,
> zhaozhang at uchicago.edu, papka at ci.uchicago.edu
> Subject: Re: [CI Ticketing System #1074] How to set .soft and env to run condor
> on TeraPort?
>
> my experience on tp-osg is that sourcing /opt/osg/setup.sh makes condor
> work. this suggests perhaps that in a working environment, condor should
> be coming from that OSG stack and not from a specific condor softenv key.
From wilde at mcs.anl.gov Fri Jun 19 11:17:04 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 19 Jun 2009 11:17:04 -0500
Subject: [Swift-devel] Re: swift testing of gram5 on teraport
In-Reply-To:
References: <3AED3D0B-C847-47BA-AE00-092CC3239754@mcs.anl.gov>
<4A3BB4A7.2070708@mcs.anl.gov>
Message-ID: <4A3BBA00.2070805@mcs.anl.gov>
On 6/19/09 11:02 AM, Stuart Martin wrote:
> On Jun 19, 2009, at Jun 19, 10:54 AM, Michael Wilde wrote:
>
>> We'll find a way to do this, STu, but it may go a little slower than
>> desired due to heavy multi-tasking in the group.
>>
>> So you should push forward to get it testable, that's step zero I think.
>
> I am pushing forward with groups where there is someone to drive the
> testing. For example, Jaime Frey is testing gram5 with condor-g. CMS
> will be doing some testing in early July. Then there is the swift
> testing...
Stu,
I would suggest not to delay on getting it installed where we can test
with Swift. My prior comment was based on a poor initial guess of what's
involved. But that's your call; when it's installed where we can run
Swift jobs, we'll test it.
E.g.: we are running two apps on Firefly at the moment. If you can get it
installed there, we can test on it even more simply than we are testing over
Condor-G.
Do you have a way to capture gatekeeper stress during such tests?
- Mike
>>
>>
>> In parallel, we should discuss on the list what, if any, Swift changes
>> are needed to use it. I don't have my head around the issue at the
>> moment. Where can we read the specs of how it affects the user?
>>
>> We have a pretty swamped schedule through July, so I'd expect to slot
>> this for late July or early August.
>>
>> Thanks,
>>
>> Mike
>>
>>
>> On 6/19/09 10:21 AM, Stuart Martin wrote:
>>> Hi Mike,
>>> Ben was planning on testing GRAM5 on teraport for Swift. Now that
>>> Ben is moving on, I am wondering what the plan is for that. Do you
>>> still plan to do that? Is there someone else that will do the testing?
>>> Ti was going to install GRAM5 for Ben to try out, but he has been
>>> delayed dealing with other issues. GRAM5 has not yet been installed
>>> on teraport. I was going to ask him again to install it, but I don't
>>> know who will now drive this testing.
>>> -Stu
>
From benc at hawaga.org.uk Fri Jun 19 11:17:47 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 19 Jun 2009 16:17:47 +0000 (GMT)
Subject: [Swift-devel] Re: [CI Ticketing System #1074] How to set .soft and
env to run condor on TeraPort? (fwd)
In-Reply-To:
References: