[Swift-devel] Small issues in coasters on local:pbs

Mihael Hategan hategan at mcs.anl.gov
Thu Mar 19 16:49:40 CDT 2009


On Thu, 2009-03-19 at 07:13 -0500, Michael Wilde wrote:
> Regarding: [Swift-user] Swift/PBS Scheduler Slow to Report "Finished"?
> 
> I'm retesting coasters on local:pbs (on teraport), as I think this may 
> partially alleviate Andrew's problem.
> 
> A simple foreach() works nice and fast, but I see two things:
> 
> 1) I first tested without a valid proxy. I forgot that coasters requires 
> a proxy (presumably for its secure channels)

Yes. For its secure channels.

>  even when it's not using 
> GRAM to reach its "RRM". The error returned if you don't have a proxy is 
> cryptic and buried in the coaster bootstrap log. So 3 things: (a) do a 
> check for proxy early on and print a nice message if there's not a valid 
> proxy; (b) bring the errors from the bootstrap log back to the user 
> (unless that's not possible, in which case point the user to look for 
> that log).

I would favor that. Currently Swift seems to report too little of the
underlying errors, which often contain essential information for solving
the problem.
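
A minimal sketch of the early check suggested in (a), in plain Java with no Globus classes. It assumes the conventional proxy location seen in the log below (/tmp/x509up_u<uid>) and the usual X509_USER_PROXY override; the class and method names are hypothetical, and it only tests that the file is readable, not that the proxy is still valid.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ProxyPreflight {
    /** Path to check: the override if one is set, else /tmp/x509up_u<uid>. */
    static Path proxyPath(String override, String uid) {
        if (override != null && !override.isEmpty()) {
            return Paths.get(override);
        }
        return Paths.get("/tmp", "x509up_u" + uid);
    }

    /** Returns null if the proxy file is readable, else a user-facing message. */
    static String check(Path proxy) {
        if (Files.isReadable(proxy)) {
            return null;
        }
        return "No readable proxy at " + proxy
            + ". Coasters need a proxy for their secure channels, even on local:pbs.";
    }

    public static void main(String[] args) {
        // Java has no portable getuid(); pass the output of `id -u` as the first argument.
        // The fallback value is just the uid that appears in the log (assumption for the demo).
        String uid = args.length > 0 ? args[0] : "1031";
        String problem = check(proxyPath(System.getenv("X509_USER_PROXY"), uid));
        System.out.println(problem == null ? "Proxy file found." : problem);
    }
}
```

Running such a check before the coaster service is started would turn the buried JGLOBUS-5 error into a one-line message up front.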

But in this particular instance, it's mostly a matter of nicely
propagating error messages through a handful of layers. I'll see what I
can do, though I suspect that for now the conflict between too little
and too much output will persist.
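
One way to sketch that propagation: walk the "Caused by" chain down to the innermost exception and attach its message to the top-level one. This is only an illustration with hypothetical names, using plain java.lang exceptions; it is not the existing Swift or CoG code.

```java
public class RootCause {
    /** Follows getCause() until the innermost exception is reached. */
    static Throwable rootCause(Throwable t) {
        Throwable cur = t;
        while (cur.getCause() != null && cur.getCause() != cur) {
            cur = cur.getCause();
        }
        return cur;
    }

    /** Top-level message, plus the root cause message when there is one. */
    static String summarize(Throwable t) {
        Throwable root = rootCause(t);
        if (root == t) {
            return String.valueOf(t.getMessage());
        }
        return t.getMessage() + " (root cause: " + root.getMessage() + ")";
    }

    public static void main(String[] args) {
        // Mimics the three-level chain from the bootstrap log.
        Exception proxy = new Exception("Proxy file (/tmp/x509up_u1031) not found.");
        Exception channel = new Exception("Failed to start channel", proxy);
        Exception top = new RuntimeException("Failed to register service", channel);
        System.out.println(summarize(top));
        // prints: Failed to register service (root cause: Proxy file (/tmp/x509up_u1031) not found.)
    }
}
```

Printing only the first and last message keeps the output short while still showing the part that tells the user what to fix.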

>   (c) document that you need a proxy.
> 
> 2) When the script finishes you get this message on stdout/err which 
> looks like a leftover debugging message:

It is. I will remove that.

> 
> --
> Swift svn swift-r2701 cog-r2332
> 
> RunID: 20090319-0658-3ejpl9xc
> Progress:
> Progress:  Submitting:9 Submitted:1
> Progress:  Submitted:9 Active:1
> Progress:  Submitted:4 Active:3 Stage out:1 Finished successfully:2
> Final status:  Finished successfully:10
> Cleaning up...
> Shutting down service at https://128.135.125.117:50002
> Got channel MetaChannel: 101224864 -> GSSSChannel-null(1)
> - Done
> --
> 
> - Mike
> 
> The errors you get when you don't have a proxy are:
> 
> tp$ swift hellos.swift -sites.file sites.xml -tc.file tc.data
> Swift svn swift-r2701 cog-r2332
> 
> RunID: 20090319-0655-9ufl1r2g
> Progress:
> Progress:  Submitting:9 Submitted:1
> Failed to transfer wrapper log from hellos-20090319-0655-9ufl1r2g/info/l 
> on teraport
> Execution failed:
> 	Exception in echo:
> Arguments: [Output of run, 6]
> Host: teraport
> Directory: hellos-20090319-0655-9ufl1r2g/jobs/l/echo-lde8x58j
> stderr.txt:
> 
> stdout.txt:
> 
> ----
> 
> Caused by:
> 	Could not submit job
> Caused by:
> 	Could not start coaster service
> Caused by:
> 	Task ended before registration was received.
> STDOUT:
> STDERR:
> 
> Caused by:
> 	Job failed with an exit code of 1
> Cleaning up...
>   Done
> tp$
> 
> tp$ cat /home/wilde/coaster-bootstrap-01709350024.log
> BS: http://tp-login2.ci.uchicago.edu:50001
> find wget = /usr/bin/wget
> -->/usr/bin/wget -c -q 
> http://tp-login2.ci.uchicago.edu:50001/coaster-bootstrap.jar -O 
> /tmp/bootstrap.YJ4129 >>/home/wilde/coaster-bootstrap-01709350024.log 
> 2>&1<--
> which: no gmd5sum in 
> (/autonfs/home/wilde/tutorials/osgedu/build/docbook-xsl/tools/bin:/home/wilde/bin:/soft/java-1.6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/apache-ant-1.7.1-r1/bin:/software/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/globus-4.2.1-r1/bin:/soft/globus-4.2.1-r1/sbin:/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/soft/torque-2.3.6-r1/bin:/soft/maui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/soft/matlab-7.7-r1/bin:/soft/osg-client-1.0.0-r1/lcg/bin:/soft/osg-client-1.0.0-r1/srm-client-lbnl/bin:/soft/osg-client-1.0.0-r1/srm-client-fermi/sbin:/soft/osg-client-1.0.0-r1/srm-client-fermi/bin:/soft/osg-client-1.0.0-r1/curl/bin:/soft/osg-client-1.0.0-r1/wget/bin:/soft/osg-client-1.0.0-r1/cert-scripts/bin:/soft/osg-client-1.0.0-r1/glite/sbin:/soft/osg-client-1.0.0-r1/glite/bin:/soft/osg-client-1.0.0-r1/pyglobus-url-copy/bin:/soft/osg-client-1.0.0-r1/pegasus/bin:/soft/osg-client-1.0.0-r1/ant/bin:/soft/osg-client-1.0.0-r1/gpt/sbin:/soft/osg-client-1.0.0-r1/globus/bin:/soft/osg-client-1.0.0-r1/globus/sbin:/soft/osg-client-1.0.0-r1/jdk1.5/bin:/soft/osg-client-1.0.0-r1/condor/sbin:/soft/osg-client-1.0.0-r1/condor/bin:/soft/osg-client-1.0.0-r1/logrotate/sbin:/software/common/pacman-3.26-r1/bin:/soft/osg-client-1.0.0-r1/vdt/sbin:/soft/osg-client-1.0.0-r1/vdt/bin:/home/grog/bin/linux-rhel4-ia32:/home/grog/bin:/sw/bin:/sbin:/usr/sbin:/home/wilde/swift/tools:/home/wilde/swift/rev/latest/bin:/home/wilde/blast/ncbi/bin:/home/wilde/adem/adem-osg/bin)
> which: no gmd5sum in 
> (/autonfs/home/wilde/tutorials/osgedu/build/docbook-xsl/tools/bin:/home/wilde/bin:/soft/java-1.6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/apache-ant-1.7.1-r1/bin:/software/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/globus-4.2.1-r1/bin:/soft/globus-4.2.1-r1/sbin:/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/soft/torque-2.3.6-r1/bin:/soft/maui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/soft/matlab-7.7-r1/bin:/soft/osg-client-1.0.0-r1/lcg/bin:/soft/osg-client-1.0.0-r1/srm-client-lbnl/bin:/soft/osg-client-1.0.0-r1/srm-client-fermi/sbin:/soft/osg-client-1.0.0-r1/srm-client-fermi/bin:/soft/osg-client-1.0.0-r1/curl/bin:/soft/osg-client-1.0.0-r1/wget/bin:/soft/osg-client-1.0.0-r1/cert-scripts/bin:/soft/osg-client-1.0.0-r1/glite/sbin:/soft/osg-client-1.0.0-r1/glite/bin:/soft/osg-client-1.0.0-r1/pyglobus-url-copy/bin:/soft/osg-client-1.0.0-r1/pegasus/bin:/soft/osg-client-1.0.0-r1/ant/bin:/soft/osg-client-1.0.0-r1/gpt/sbin:/soft/osg-client-1.0.0-r1/globus/bin:/soft/osg-client-1.0.0-r1/globus/sbin:/soft/osg-client-1.0.0-r1/jdk1.5/bin:/soft/osg-client-1.0.0-r1/condor/sbin:/soft/osg-client-1.0.0-r1/condor/bin:/soft/osg-client-1.0.0-r1/logrotate/sbin:/software/common/pacman-3.26-r1/bin:/soft/osg-client-1.0.0-r1/vdt/sbin:/soft/osg-client-1.0.0-r1/vdt/bin:/home/grog/bin/linux-rhel4-ia32:/home/grog/bin:/sw/bin:/sbin:/usr/sbin:/home/wilde/swift/tools:/home/wilde/swift/rev/latest/bin:/home/wilde/blast/ncbi/bin:/home/wilde/adem/adem-osg/bin)
> find gmd5sum =
> find md5sum = /usr/bin/md5sum
> Expected checksum: 33170989491a2e007a1c7c68eb907832
> Computed checksum: 33170989491a2e007a1c7c68eb907832
> find java = /soft/java-1.6.0_11-sun-r1/bin/java
> JAVA=/soft/java-1.6.0_11-sun-r1/bin/java
> /soft/java-1.6.0_11-sun-r1/bin/java 
> -Djava=/soft/java-1.6.0_11-sun-r1/bin/java 
> -DGLOBUS_TCP_PORT_RANGE=50000,51000 -DX509_USER_PROXY= 
> -DX509_CERT_DIR=/soft/osg-client-1.0.0-r1/globus/TRUSTED_CA 
> -DGLOBUS_HOSTNAME=none -jar /tmp/bootstrap.YJ4129 
> http://tp-login2.ci.uchicago.edu:50001 https://128.135.125.117:50000 
> 01709350024
> java.lang.RuntimeException: Failed to register service
> 	at 
> org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:111)
> 	at 
> org.globus.cog.abstraction.coaster.service.CoasterService.main(CoasterService.java:226)
> Caused by: 
> org.globus.cog.karajan.workflow.service.channels.ChannelException: 
> Failed to start channel GSSCChannel-https://128.135.125.117:50000(1)
> 	at 
> org.globus.cog.karajan.workflow.service.channels.GSSChannel.reconnect(GSSChannel.java:104)
> 	at 
> org.globus.cog.karajan.workflow.service.channels.GSSChannel.start(GSSChannel.java:63)
> 	at 
> org.globus.cog.karajan.workflow.service.ChannelFactory.newChannel(ChannelFactory.java:43)
> 	at org.globus.cog.karajan.workflow.service.Client.connect(Client.java:115)
> 	at org.globus.cog.karajan.workflow.service.Client.newClient(Client.java:72)
> 	at 
> org.globus.cog.karajan.workflow.service.channels.ChannelManager.connect(ChannelManager.java:211)
> 	at 
> org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:230)
> 	at 
> org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:186)
> 	at 
> org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:100)
> 	... 1 more
> Caused by: org.globus.gsi.GlobusCredentialException: [JGLOBUS-5] Proxy 
> file (/tmp/x509up_u1031) not found.
> 	at org.globus.gsi.GlobusCredential.<init>(GlobusCredential.java:114)
> 	at 
> org.globus.gsi.GlobusCredential.reloadDefaultCredential(GlobusCredential.java:590)
> 	at 
> org.globus.gsi.GlobusCredential.getDefaultCredential(GlobusCredential.java:575)
> 	at 
> org.globus.cog.karajan.workflow.service.GSSService.initializeCredentials(GSSService.java:99)
> 	at 
> org.globus.cog.karajan.workflow.service.channels.GSSChannel.reconnect(GSSChannel.java:77)
> 	... 9 more
> 
> EC: 1
> BS: http://tp-login2.ci.uchicago.edu:50001
> find wget = /usr/bin/wget
> -->/usr/bin/wget -c -q 
> http://tp-login2.ci.uchicago.edu:50001/coaster-bootstrap.jar -O 
> /tmp/bootstrap.DS4363 >>/home/wilde/coaster-bootstrap-01709350024.log 
> 2>&1<--
> which: no gmd5sum in 
> (/autonfs/home/wilde/tutorials/osgedu/build/docbook-xsl/tools/bin:/home/wilde/bin:/soft/java-1.6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/apache-ant-1.7.1-r1/bin:/software/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/globus-4.2.1-r1/bin:/soft/globus-4.2.1-r1/sbin:/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/soft/torque-2.3.6-r1/bin:/soft/maui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/soft/matlab-7.7-r1/bin:/soft/osg-client-1.0.0-r1/lcg/bin:/soft/osg-client-1.0.0-r1/srm-client-lbnl/bin:/soft/osg-client-1.0.0-r1/srm-client-fermi/sbin:/soft/osg-client-1.0.0-r1/srm-client-fermi/bin:/soft/osg-client-1.0.0-r1/curl/bin:/soft/osg-client-1.0.0-r1/wget/bin:/soft/osg-client-1.0.0-r1/cert-scripts/bin:/soft/osg-client-1.0.0-r1/glite/sbin:/soft/osg-client-1.0.0-r1/glite/bin:/soft/osg-client-1.0.0-r1/pyglobus-url-copy/bin:/soft/osg-client-1.0.0-r1/pegasus/bin:/soft/osg-client-1.0.0-r1/ant/bin:/soft/osg-client-1.0.0-r1/gpt/sbin:/soft/osg-client-1.0.0-r1/globus/bin:/soft/osg-client-1.0.0-r1/globus/sbin:/soft/osg-client-1.0.0-r1/jdk1.5/bin:/soft/osg-client-1.0.0-r1/condor/sbin:/soft/osg-client-1.0.0-r1/condor/bin:/soft/osg-client-1.0.0-r1/logrotate/sbin:/software/common/pacman-3.26-r1/bin:/soft/osg-client-1.0.0-r1/vdt/sbin:/soft/osg-client-1.0.0-r1/vdt/bin:/home/grog/bin/linux-rhel4-ia32:/home/grog/bin:/sw/bin:/sbin:/usr/sbin:/home/wilde/swift/tools:/home/wilde/swift/rev/latest/bin:/home/wilde/blast/ncbi/bin:/home/wilde/adem/adem-osg/bin)
> which: no gmd5sum in 
> (/autonfs/home/wilde/tutorials/osgedu/build/docbook-xsl/tools/bin:/home/wilde/bin:/soft/java-1.6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/apache-ant-1.7.1-r1/bin:/software/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/globus-4.2.1-r1/bin:/soft/globus-4.2.1-r1/sbin:/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/soft/torque-2.3.6-r1/bin:/soft/maui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/soft/matlab-7.7-r1/bin:/soft/osg-client-1.0.0-r1/lcg/bin:/soft/osg-client-1.0.0-r1/srm-client-lbnl/bin:/soft/osg-client-1.0.0-r1/srm-client-fermi/sbin:/soft/osg-client-1.0.0-r1/srm-client-fermi/bin:/soft/osg-client-1.0.0-r1/curl/bin:/soft/osg-client-1.0.0-r1/wget/bin:/soft/osg-client-1.0.0-r1/cert-scripts/bin:/soft/osg-client-1.0.0-r1/glite/sbin:/soft/osg-client-1.0.0-r1/glite/bin:/soft/osg-client-1.0.0-r1/pyglobus-url-copy/bin:/soft/osg-client-1.0.0-r1/pegasus/bin:/soft/osg-client-1.0.0-r1/ant/bin:/soft/osg-client-1.0.0-r1/gpt/sbin:/soft/osg-client-1.0.0-r1/globus/bin:/soft/osg-client-1.0.0-r1/globus/sbin:/soft/osg-client-1.0.0-r1/jdk1.5/bin:/soft/osg-client-1.0.0-r1/condor/sbin:/soft/osg-client-1.0.0-r1/condor/bin:/soft/osg-client-1.0.0-r1/logrotate/sbin:/software/common/pacman-3.26-r1/bin:/soft/osg-client-1.0.0-r1/vdt/sbin:/soft/osg-client-1.0.0-r1/vdt/bin:/home/grog/bin/linux-rhel4-ia32:/home/grog/bin:/sw/bin:/sbin:/usr/sbin:/home/wilde/swift/tools:/home/wilde/swift/rev/latest/bin:/home/wilde/blast/ncbi/bin:/home/wilde/adem/adem-osg/bin)
> find gmd5sum =
> find md5sum = /usr/bin/md5sum
> Expected checksum: 33170989491a2e007a1c7c68eb907832
> Computed checksum: 33170989491a2e007a1c7c68eb907832
> find java = /soft/java-1.6.0_11-sun-r1/bin/java
> JAVA=/soft/java-1.6.0_11-sun-r1/bin/java
> /soft/java-1.6.0_11-sun-r1/bin/java 
> -Djava=/soft/java-1.6.0_11-sun-r1/bin/java 
> -DGLOBUS_TCP_PORT_RANGE=50000,51000 -DX509_USER_PROXY= 
> -DX509_CERT_DIR=/soft/osg-client-1.0.0-r1/globus/TRUSTED_CA 
> -DGLOBUS_HOSTNAME=none -jar /tmp/bootstrap.DS4363 
> http://tp-login2.ci.uchicago.edu:50001 https://128.135.125.117:50000 
> 01709350024
> java.lang.RuntimeException: Failed to register service
> 	at 
> org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:111)
> 	at 
> org.globus.cog.abstraction.coaster.service.CoasterService.main(CoasterService.java:226)
> Caused by: 
> org.globus.cog.karajan.workflow.service.channels.ChannelException: 
> Failed to start channel GSSCChannel-https://128.135.125.117:50000(1)
> 	at 
> org.globus.cog.karajan.workflow.service.channels.GSSChannel.reconnect(GSSChannel.java:104)
> 	at 
> org.globus.cog.karajan.workflow.service.channels.GSSChannel.start(GSSChannel.java:63)
> 	at 
> org.globus.cog.karajan.workflow.service.ChannelFactory.newChannel(ChannelFactory.java:43)
> 	at org.globus.cog.karajan.workflow.service.Client.connect(Client.java:115)
> 	at org.globus.cog.karajan.workflow.service.Client.newClient(Client.java:72)
> 	at 
> org.globus.cog.karajan.workflow.service.channels.ChannelManager.connect(ChannelManager.java:211)
> 	at 
> org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:230)
> 	at 
> org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:186)
> 	at 
> org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:100)
> 	... 1 more
> Caused by: org.globus.gsi.GlobusCredentialException: [JGLOBUS-5] Proxy 
> file (/tmp/x509up_u1031) not found.
> 	at org.globus.gsi.GlobusCredential.<init>(GlobusCredential.java:114)
> 	at 
> org.globus.gsi.GlobusCredential.reloadDefaultCredential(GlobusCredential.java:590)
> 	at 
> org.globus.gsi.GlobusCredential.getDefaultCredential(GlobusCredential.java:575)
> 	at 
> org.globus.cog.karajan.workflow.service.GSSService.initializeCredentials(GSSService.java:99)
> 	at 
> org.globus.cog.karajan.workflow.service.channels.GSSChannel.reconnect(GSSChannel.java:77)
> 	... 9 more
> 
> EC: 1
> tp$
> 
> 
> -------- Original Message --------
> Subject: [Swift-user] Swift/PBS Scheduler Slow to Report "Finished"?
> Date: Thu, 19 Mar 2009 02:42:25 -0500
> From: Andrew Boyce <ajboyce at jacks.sdstate.edu>
> To: swift-user at ci.uchicago.edu
> 
> Hello,
> 
> I am currently running Swift in conjunction with the PBS scheduler. My
> annoyance at the moment is this:
> 
> When running any script, even a simple script such as first.swift
> (which normally finishes almost instantaneously), Swift always takes
> precisely five minutes to tell me that my job Finished successfully
> and copy the files back to the appropriate folder. It is always almost
> exactly five minutes; I've checked many logs - it polls the scheduler
> for five minutes. When I run a script (like first.swift) without using
> the PBS scheduler, everything happens as normal; execution and
> "Finished successfully" are nearly immediate.
> 
> I think I know what the problem is: even after the scheduler says that
> the job is 'completed,' (which is generally right away) the scheduler
> keeps the job up on qstat and such for 5 minutes after (this setting
> is a PBS server attribute known as 'keep_completed', and I have
> checked that it is indeed set to 300 seconds; unfortunately I don't
> have permissions to change it). So when Swift polls the scheduler, the
> job is still up on qstat, and Swift must think that the task has not
> yet "Finished successfully."
> 
> My question is this:
> Am I indeed right that Swift does not "understand" that when the PBS
> scheduler says a job is 'completed', the job really has "Finished
> successfully"?
> Can this be changed so that Swift does "understand" that a 'completed'
> job has "Finished successfully"?
> 
> I have not included any files because I think I have narrowed the
> problem down to a question that does not require those that I would
> usually provide, but if I am wrong, then I can provide.
> 
> Thank you and sorry for the length.
> 
> Regards,
> 
> Andrew Boyce
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
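
On Andrew's question quoted above: the idea of treating a job that qstat still lists as 'completed' as finished can be sketched as below. This assumes Torque's default qstat layout (Job id, Name, User, Time Use, S, Queue) with the one-letter state in the fifth column and 'C' meaning completed. The class, enum, and sample lines are hypothetical, and this is not a description of how Swift's PBS provider actually polls.

```java
public class PbsJobState {
    enum Status { QUEUED, ACTIVE, DONE, UNKNOWN }

    /** Maps a Torque one-letter job state to a coarse status. */
    static Status fromStateLetter(char state) {
        switch (state) {
            case 'Q': case 'H': case 'W':
                return Status.QUEUED;   // queued, held, waiting
            case 'R': case 'E':
                return Status.ACTIVE;   // running, exiting
            case 'C':
                return Status.DONE;     // completed but still listed (keep_completed)
            default:
                return Status.UNKNOWN;
        }
    }

    /** Reads the state column from one data line of default qstat output. */
    static Status fromQstatLine(String line) {
        String[] cols = line.trim().split("\\s+");
        if (cols.length < 6 || cols[4].length() != 1) {
            return Status.UNKNOWN;
        }
        return fromStateLetter(cols[4].charAt(0));
    }

    public static void main(String[] args) {
        // Made-up sample line in the default qstat layout.
        System.out.println(fromQstatLine("12345.host  echo  user  00:00:00 C batch"));
    }
}
```

With a mapping like this, a job in state 'C' would be reported as finished right away instead of only once it drops out of the qstat listing.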



