[Swift-devel] Small issues in coasters on local:pbs
Michael Wilde
wilde at mcs.anl.gov
Thu Mar 19 07:13:53 CDT 2009
Regarding: [Swift-user] Swift/PBS Scheduler Slow to Report "Finished"?
I'm retesting coasters on local:pbs (on teraport), as I think this may
partially alleviate Andrew's problem.
A simple foreach() works nice and fast, but I see two things:
1) I first tested without a valid proxy. I forgot that coasters requires
a proxy (presumably for its secure channels) even when it's not using
GRAM to reach its "RRM". The error returned if you don't have a proxy is
cryptic and buried in the coaster bootstrap log. So three things: (a)
check for a proxy early on and print a clear message if there's no valid
proxy; (b) bring the errors from the bootstrap log back to the user or,
if that's not possible, point the user to that log; (c) document that
you need a proxy.
2) When the script finishes you get this message on stdout/err which
looks like a leftover debugging message:
--
Swift svn swift-r2701 cog-r2332
RunID: 20090319-0658-3ejpl9xc
Progress:
Progress: Submitting:9 Submitted:1
Progress: Submitted:9 Active:1
Progress: Submitted:4 Active:3 Stage out:1 Finished successfully:2
Final status: Finished successfully:10
Cleaning up...
Shutting down service at https://128.135.125.117:50002
Got channel MetaChannel: 101224864 -> GSSSChannel-null(1)
- Done
--
- Mike
The errors you get when you don't have a proxy are:
tp$ swift hellos.swift -sites.file sites.xml -tc.file tc.data
Swift svn swift-r2701 cog-r2332
RunID: 20090319-0655-9ufl1r2g
Progress:
Progress: Submitting:9 Submitted:1
Failed to transfer wrapper log from hellos-20090319-0655-9ufl1r2g/info/l
on teraport
Execution failed:
Exception in echo:
Arguments: [Output of run, 6]
Host: teraport
Directory: hellos-20090319-0655-9ufl1r2g/jobs/l/echo-lde8x58j
stderr.txt:
stdout.txt:
----
Caused by:
Could not submit job
Caused by:
Could not start coaster service
Caused by:
Task ended before registration was received.
STDOUT:
STDERR:
Caused by:
Job failed with an exit code of 1
Cleaning up...
Done
tp$
tp$ cat /home/wilde/coaster-bootstrap-01709350024.log
BS: http://tp-login2.ci.uchicago.edu:50001
find wget = /usr/bin/wget
-->/usr/bin/wget -c -q
http://tp-login2.ci.uchicago.edu:50001/coaster-bootstrap.jar -O
/tmp/bootstrap.YJ4129 >>/home/wilde/coaster-bootstrap-01709350024.log
2>&1<--
which: no gmd5sum in
(/autonfs/home/wilde/tutorials/osgedu/build/docbook-xsl/tools/bin:/home/wilde/bin:/soft/java-1.6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/apache-ant-1.7.1-r1/bin:/software/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/globus-4.2.1-r1/bin:/soft/globus-4.2.1-r1/sbin:/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/soft/torque-2.3.6-r1/bin:/soft/maui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/soft/matlab-7.7-r1/bin:/soft/osg-client-1.0.0-r1/lcg/bin:/soft/osg-client-1.0.0-r1/srm-client-lbnl/bin:/soft/osg-client-1.0.0-r1/srm-client-fermi/sbin:/soft/osg-client-1.0.0-r1/srm-client-fermi/bin:/soft/osg-client-1.0.0-r1/curl/bin:/soft/osg-client-1.0.0-r1/wget/bin:/soft/osg-client-1.0.0-r1/cert-scripts/bin:/soft/osg-client-1.0.0-r1/glite/sbin:/soft/osg-client-1.0.0-r1/glite/bin:/soft/osg-client-1.0.0-r1/pyglobus-url-copy/bin:/soft/osg-client-1.0.0-r1/peg
asus/bin:/soft/osg-client-1.0.0-r1/ant/bin:/soft/osg-client-1.0.0-r1/gpt/sbin:/soft/osg-client-1.0.0-r1/globus/bin:/soft/osg-client-1.0.0-r1/globus/sbin:/soft/osg-client-1.0.0-r1/jdk1.5/bin:/soft/osg-client-1.0.0-r1/condor/sbin:/soft/osg-client-1.0.0-r1/condor/bin:/soft/osg-client-1.0.0-r1/logrotate/sbin:/software/common/pacman-3.26-r1/bin:/soft/osg-client-1.0.0-r1/vdt/sbin:/soft/osg-client-1.0.0-r1/vdt/bin:/home/grog/bin/linux-rhel4-ia32:/home/grog/bin:/sw/bin:/sbin:/usr/sbin:/home/wilde/swift/tools:/home/wilde/swift/rev/latest/bin:/home/wilde/blast/ncbi/bin:/home/wilde/adem/adem-osg/bin)
which: no gmd5sum in
(/autonfs/home/wilde/tutorials/osgedu/build/docbook-xsl/tools/bin:/home/wilde/bin:/soft/java-1.6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/apache-ant-1.7.1-r1/bin:/software/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/globus-4.2.1-r1/bin:/soft/globus-4.2.1-r1/sbin:/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/soft/torque-2.3.6-r1/bin:/soft/maui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/soft/matlab-7.7-r1/bin:/soft/osg-client-1.0.0-r1/lcg/bin:/soft/osg-client-1.0.0-r1/srm-client-lbnl/bin:/soft/osg-client-1.0.0-r1/srm-client-fermi/sbin:/soft/osg-client-1.0.0-r1/srm-client-fermi/bin:/soft/osg-client-1.0.0-r1/curl/bin:/soft/osg-client-1.0.0-r1/wget/bin:/soft/osg-client-1.0.0-r1/cert-scripts/bin:/soft/osg-client-1.0.0-r1/glite/sbin:/soft/osg-client-1.0.0-r1/glite/bin:/soft/osg-client-1.0.0-r1/pyglobus-url-copy/bin:/soft/osg-client-1.0.0-r1/peg
asus/bin:/soft/osg-client-1.0.0-r1/ant/bin:/soft/osg-client-1.0.0-r1/gpt/sbin:/soft/osg-client-1.0.0-r1/globus/bin:/soft/osg-client-1.0.0-r1/globus/sbin:/soft/osg-client-1.0.0-r1/jdk1.5/bin:/soft/osg-client-1.0.0-r1/condor/sbin:/soft/osg-client-1.0.0-r1/condor/bin:/soft/osg-client-1.0.0-r1/logrotate/sbin:/software/common/pacman-3.26-r1/bin:/soft/osg-client-1.0.0-r1/vdt/sbin:/soft/osg-client-1.0.0-r1/vdt/bin:/home/grog/bin/linux-rhel4-ia32:/home/grog/bin:/sw/bin:/sbin:/usr/sbin:/home/wilde/swift/tools:/home/wilde/swift/rev/latest/bin:/home/wilde/blast/ncbi/bin:/home/wilde/adem/adem-osg/bin)
find gmd5sum =
find md5sum = /usr/bin/md5sum
Expected checksum: 33170989491a2e007a1c7c68eb907832
Computed checksum: 33170989491a2e007a1c7c68eb907832
find java = /soft/java-1.6.0_11-sun-r1/bin/java
JAVA=/soft/java-1.6.0_11-sun-r1/bin/java
/soft/java-1.6.0_11-sun-r1/bin/java
-Djava=/soft/java-1.6.0_11-sun-r1/bin/java
-DGLOBUS_TCP_PORT_RANGE=50000,51000 -DX509_USER_PROXY=
-DX509_CERT_DIR=/soft/osg-client-1.0.0-r1/globus/TRUSTED_CA
-DGLOBUS_HOSTNAME=none -jar /tmp/bootstrap.YJ4129
http://tp-login2.ci.uchicago.edu:50001 https://128.135.125.117:50000
01709350024
java.lang.RuntimeException: Failed to register service
at
org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:111)
at
org.globus.cog.abstraction.coaster.service.CoasterService.main(CoasterService.java:226)
Caused by:
org.globus.cog.karajan.workflow.service.channels.ChannelException:
Failed to start channel GSSCChannel-https://128.135.125.117:50000(1)
at
org.globus.cog.karajan.workflow.service.channels.GSSChannel.reconnect(GSSChannel.java:104)
at
org.globus.cog.karajan.workflow.service.channels.GSSChannel.start(GSSChannel.java:63)
at
org.globus.cog.karajan.workflow.service.ChannelFactory.newChannel(ChannelFactory.java:43)
at org.globus.cog.karajan.workflow.service.Client.connect(Client.java:115)
at org.globus.cog.karajan.workflow.service.Client.newClient(Client.java:72)
at
org.globus.cog.karajan.workflow.service.channels.ChannelManager.connect(ChannelManager.java:211)
at
org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:230)
at
org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:186)
at
org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:100)
... 1 more
Caused by: org.globus.gsi.GlobusCredentialException: [JGLOBUS-5] Proxy
file (/tmp/x509up_u1031) not found.
at org.globus.gsi.GlobusCredential.<init>(GlobusCredential.java:114)
at
org.globus.gsi.GlobusCredential.reloadDefaultCredential(GlobusCredential.java:590)
at
org.globus.gsi.GlobusCredential.getDefaultCredential(GlobusCredential.java:575)
at
org.globus.cog.karajan.workflow.service.GSSService.initializeCredentials(GSSService.java:99)
at
org.globus.cog.karajan.workflow.service.channels.GSSChannel.reconnect(GSSChannel.java:77)
... 9 more
EC: 1
BS: http://tp-login2.ci.uchicago.edu:50001
find wget = /usr/bin/wget
-->/usr/bin/wget -c -q
http://tp-login2.ci.uchicago.edu:50001/coaster-bootstrap.jar -O
/tmp/bootstrap.DS4363 >>/home/wilde/coaster-bootstrap-01709350024.log
2>&1<--
which: no gmd5sum in
(/autonfs/home/wilde/tutorials/osgedu/build/docbook-xsl/tools/bin:/home/wilde/bin:/soft/java-1.6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/apache-ant-1.7.1-r1/bin:/software/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/globus-4.2.1-r1/bin:/soft/globus-4.2.1-r1/sbin:/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/soft/torque-2.3.6-r1/bin:/soft/maui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/soft/matlab-7.7-r1/bin:/soft/osg-client-1.0.0-r1/lcg/bin:/soft/osg-client-1.0.0-r1/srm-client-lbnl/bin:/soft/osg-client-1.0.0-r1/srm-client-fermi/sbin:/soft/osg-client-1.0.0-r1/srm-client-fermi/bin:/soft/osg-client-1.0.0-r1/curl/bin:/soft/osg-client-1.0.0-r1/wget/bin:/soft/osg-client-1.0.0-r1/cert-scripts/bin:/soft/osg-client-1.0.0-r1/glite/sbin:/soft/osg-client-1.0.0-r1/glite/bin:/soft/osg-client-1.0.0-r1/pyglobus-url-copy/bin:/soft/osg-client-1.0.0-r1/peg
asus/bin:/soft/osg-client-1.0.0-r1/ant/bin:/soft/osg-client-1.0.0-r1/gpt/sbin:/soft/osg-client-1.0.0-r1/globus/bin:/soft/osg-client-1.0.0-r1/globus/sbin:/soft/osg-client-1.0.0-r1/jdk1.5/bin:/soft/osg-client-1.0.0-r1/condor/sbin:/soft/osg-client-1.0.0-r1/condor/bin:/soft/osg-client-1.0.0-r1/logrotate/sbin:/software/common/pacman-3.26-r1/bin:/soft/osg-client-1.0.0-r1/vdt/sbin:/soft/osg-client-1.0.0-r1/vdt/bin:/home/grog/bin/linux-rhel4-ia32:/home/grog/bin:/sw/bin:/sbin:/usr/sbin:/home/wilde/swift/tools:/home/wilde/swift/rev/latest/bin:/home/wilde/blast/ncbi/bin:/home/wilde/adem/adem-osg/bin)
which: no gmd5sum in
(/autonfs/home/wilde/tutorials/osgedu/build/docbook-xsl/tools/bin:/home/wilde/bin:/soft/java-1.6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/apache-ant-1.7.1-r1/bin:/software/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/globus-4.2.1-r1/bin:/soft/globus-4.2.1-r1/sbin:/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/soft/torque-2.3.6-r1/bin:/soft/maui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/soft/matlab-7.7-r1/bin:/soft/osg-client-1.0.0-r1/lcg/bin:/soft/osg-client-1.0.0-r1/srm-client-lbnl/bin:/soft/osg-client-1.0.0-r1/srm-client-fermi/sbin:/soft/osg-client-1.0.0-r1/srm-client-fermi/bin:/soft/osg-client-1.0.0-r1/curl/bin:/soft/osg-client-1.0.0-r1/wget/bin:/soft/osg-client-1.0.0-r1/cert-scripts/bin:/soft/osg-client-1.0.0-r1/glite/sbin:/soft/osg-client-1.0.0-r1/glite/bin:/soft/osg-client-1.0.0-r1/pyglobus-url-copy/bin:/soft/osg-client-1.0.0-r1/peg
asus/bin:/soft/osg-client-1.0.0-r1/ant/bin:/soft/osg-client-1.0.0-r1/gpt/sbin:/soft/osg-client-1.0.0-r1/globus/bin:/soft/osg-client-1.0.0-r1/globus/sbin:/soft/osg-client-1.0.0-r1/jdk1.5/bin:/soft/osg-client-1.0.0-r1/condor/sbin:/soft/osg-client-1.0.0-r1/condor/bin:/soft/osg-client-1.0.0-r1/logrotate/sbin:/software/common/pacman-3.26-r1/bin:/soft/osg-client-1.0.0-r1/vdt/sbin:/soft/osg-client-1.0.0-r1/vdt/bin:/home/grog/bin/linux-rhel4-ia32:/home/grog/bin:/sw/bin:/sbin:/usr/sbin:/home/wilde/swift/tools:/home/wilde/swift/rev/latest/bin:/home/wilde/blast/ncbi/bin:/home/wilde/adem/adem-osg/bin)
find gmd5sum =
find md5sum = /usr/bin/md5sum
Expected checksum: 33170989491a2e007a1c7c68eb907832
Computed checksum: 33170989491a2e007a1c7c68eb907832
find java = /soft/java-1.6.0_11-sun-r1/bin/java
JAVA=/soft/java-1.6.0_11-sun-r1/bin/java
/soft/java-1.6.0_11-sun-r1/bin/java
-Djava=/soft/java-1.6.0_11-sun-r1/bin/java
-DGLOBUS_TCP_PORT_RANGE=50000,51000 -DX509_USER_PROXY=
-DX509_CERT_DIR=/soft/osg-client-1.0.0-r1/globus/TRUSTED_CA
-DGLOBUS_HOSTNAME=none -jar /tmp/bootstrap.DS4363
http://tp-login2.ci.uchicago.edu:50001 https://128.135.125.117:50000
01709350024
java.lang.RuntimeException: Failed to register service
at
org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:111)
at
org.globus.cog.abstraction.coaster.service.CoasterService.main(CoasterService.java:226)
Caused by:
org.globus.cog.karajan.workflow.service.channels.ChannelException:
Failed to start channel GSSCChannel-https://128.135.125.117:50000(1)
at
org.globus.cog.karajan.workflow.service.channels.GSSChannel.reconnect(GSSChannel.java:104)
at
org.globus.cog.karajan.workflow.service.channels.GSSChannel.start(GSSChannel.java:63)
at
org.globus.cog.karajan.workflow.service.ChannelFactory.newChannel(ChannelFactory.java:43)
at org.globus.cog.karajan.workflow.service.Client.connect(Client.java:115)
at org.globus.cog.karajan.workflow.service.Client.newClient(Client.java:72)
at
org.globus.cog.karajan.workflow.service.channels.ChannelManager.connect(ChannelManager.java:211)
at
org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:230)
at
org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:186)
at
org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:100)
... 1 more
Caused by: org.globus.gsi.GlobusCredentialException: [JGLOBUS-5] Proxy
file (/tmp/x509up_u1031) not found.
at org.globus.gsi.GlobusCredential.<init>(GlobusCredential.java:114)
at
org.globus.gsi.GlobusCredential.reloadDefaultCredential(GlobusCredential.java:590)
at
org.globus.gsi.GlobusCredential.getDefaultCredential(GlobusCredential.java:575)
at
org.globus.cog.karajan.workflow.service.GSSService.initializeCredentials(GSSService.java:99)
at
org.globus.cog.karajan.workflow.service.channels.GSSChannel.reconnect(GSSChannel.java:77)
... 9 more
EC: 1
tp$
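The early check suggested in (a) could look roughly like the sketch below.
It is only an illustration, not how Swift or the coaster code does
anything: the function names are made up, and it assumes the proxy lives
either at the path in X509_USER_PROXY or at the default /tmp/x509up_u<uid>
location that appears in the stack trace above. It only tests that the
file exists and is non-empty; it does not verify that the proxy is still
valid.

```python
import os
from typing import Optional


def default_proxy_path(uid: Optional[int] = None) -> str:
    """Return X509_USER_PROXY if set and non-empty, else /tmp/x509up_u<uid>."""
    env = os.environ.get("X509_USER_PROXY", "").strip()
    if env:
        return env
    if uid is None:
        uid = os.getuid()  # POSIX only
    return f"/tmp/x509up_u{uid}"


def check_proxy(path: Optional[str] = None) -> Optional[str]:
    """Return None if a proxy file is present, else a readable message.

    Hypothetical helper: checks presence only, not expiry or signature.
    """
    path = path or default_proxy_path()
    if not os.path.isfile(path):
        return (f"No proxy file found at {path}. Coasters need a proxy even "
                "on local:pbs; create one (e.g. with grid-proxy-init) and rerun.")
    if os.path.getsize(path) == 0:
        return f"Proxy file {path} is empty."
    return None


if __name__ == "__main__":
    problem = check_proxy()
    print(problem if problem else "Proxy file found.")
```

Running something like this before the first job is submitted would turn
the "Could not start coaster service" chain into a one-line message.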
-------- Original Message --------
Subject: [Swift-user] Swift/PBS Scheduler Slow to Report "Finished"?
Date: Thu, 19 Mar 2009 02:42:25 -0500
From: Andrew Boyce <ajboyce at jacks.sdstate.edu>
To: swift-user at ci.uchicago.edu
Hello,
I am currently running Swift in conjunction with the PBS scheduler. My
annoyance at the moment is this:
When running any script, even a simple script such as first.swift
(which normally finishes almost instantaneously), Swift always takes
precisely five minutes to tell me that my job Finished successfully
and copy the files back to the appropriate folder. It is always almost
exactly five minutes; I've checked many logs - it polls the scheduler
for five minutes. When I run a script (like first.swift) without using
the PBS scheduler, everything happens as normal; execution and
"Finished successfully" are nearly immediate.
I think I know what the problem is: even after the scheduler says that
the job is 'completed,' (which is generally right away) the scheduler
keeps the job up on qstat and such for 5 minutes after (this setting
is a PBS server attribute known as 'keep_completed', and I have
checked that it is indeed set to 300 seconds; unfortunately I don't
have permissions to change it). So when Swift polls the scheduler, the
job is still up on qstat, and Swift must think that the task has not
yet "Finished successfully."
My question is this:
Am I indeed right that Swift does not "understand" that when the PBS
scheduler says a job is 'completed', the job really has "Finished
successfully"?
Can this be changed so that Swift does "understand" that a 'completed'
job has "Finished successfully"?
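To make the question concrete, here is a minimal sketch of the behaviour
being asked for. It is not Swift's actual PBS provider code: the function
names and the sample text are invented, and it assumes the poller has the
text of "qstat -f <jobid>" available, where Torque reports a line such as
"job_state = C" while keep_completed is holding the finished job in the
queue listing.

```python
import re
from typing import Optional

# Matches a line such as "    job_state = C" in `qstat -f` output.
_STATE_RE = re.compile(r"^\s*job_state\s*=\s*(\S+)\s*$", re.MULTILINE)


def job_state(qstat_full_output: str) -> Optional[str]:
    """Extract the job_state value, or None if the field is absent."""
    match = _STATE_RE.search(qstat_full_output)
    return match.group(1) if match else None


def is_finished(qstat_full_output: Optional[str]) -> bool:
    """Treat a job as done if qstat no longer lists it (None) or reports 'C'."""
    if qstat_full_output is None:
        return True
    return job_state(qstat_full_output) == "C"


if __name__ == "__main__":
    # Invented sample text, shaped like Torque's `qstat -f` output.
    sample = "Job Id: 123.example\n    Job_Name = first\n    job_state = C\n"
    print(is_finished(sample))
```

With a check like this, a job held by keep_completed would be reported as
done on the first poll that sees state C rather than when it finally
disappears from qstat. Whether the job succeeded would still have to be
decided separately.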
I have not included any files because I think I have narrowed the
problem down to a question that does not require the ones I would
usually provide, but if I am wrong, I can provide them.
Thank you and sorry for the length.
Regards,
Andrew Boyce