[Swift-devel] gram2 coaster workers terminating prematurely

Mihael Hategan hategan at mcs.anl.gov
Thu Aug 19 15:55:57 CDT 2010


Why was I under the impression that gram supports a "name" argument to
name your job?
I'll remove this from the gt2 provider.

On Thu, 2010-08-19 at 15:45 -0500, Allan Espinosa wrote:
> Hi,
> 
> It looks like the service cannot start the workers successfully when I
> run on OSG resources.  I get these messages from the coasters.log:
> 
> 2010-08-19 14:53:33,274-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> identity=urn:1282247428606-1282247568724-1282247568725) setting status
> to Failed Task failed: Error submitting block task
> 2010-08-19 14:53:34,532-0500 WARN  Block Failed to shut down block
> 2010-08-19 14:53:39,054-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> identity=urn:1282247428746-1282247568932-1282247568933) setting status
> to Failed Task failed: Error submitting block task
> 2010-08-19 14:53:39,054-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> identity=urn:1282247428749-1282247568935-1282247568936) setting status
> to Failed Task failed: Error submitting block task
> 2010-08-19 14:53:41,148-0500 WARN  Block Failed to shut down block
> 2010-08-19 14:53:41,953-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> identity=urn:1282247428534-1282247568613-1282247568614) setting status
> to Failed Task failed: Error submitting block task
> 2010-08-19 14:53:43,352-0500 WARN  Block Failed to shut down block
> 2010-08-19 14:53:48,904-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> identity=urn:1282247428602-1282247568718-1282247568719) setting status
> to Failed Task failed: Error submitting block task
> 2010-08-19 14:53:49,969-0500 WARN  Block Failed to shut down block
> 
> pool entry:
>   <pool handle="Firefly">
>     <execution provider="coaster" url="ff-grid2.unl.edu"
>       jobmanager="gt2:gt2:pbs" />
>     <profile namespace="globus" key="internalhostname">129.93.227.78</profile>
> 
>     <profile namespace="globus" key="maxTime">86400</profile>
>     <profile namespace="globus" key="maxNodes">1290</profile>
>     <profile namespace="globus" key="spread">0.8</profile>
>     <profile namespace="globus" key="slots">10</profile>
>     <profile namespace="globus" key="lowOverallocation">20</profile>
>     <!--<profile namespace="globus" key="remoteMonitorEnabled">true</profile>-->
> 
>     <profile namespace="karajan" key="initialScore">1500.0</profile>
>     <profile namespace="karajan" key="jobThrottle">51.54</profile>
> 
>     <gridftp  url="gsiftp://ff-grid3.unl.edu"/>
>     <workdirectory>/panfs/panasas/CMS/data/engage-scec/swift_scratch</workdirectory>
>   </pool>
> 
> I traced a submission for a block and here are the related log entries:
> 
> corresponding entry in coasterlog and additional context lines:
> $ grep 000000 -C 2 coasters.log
> 2010-08-19 14:52:50,200-0500 INFO  JobQueue Adding task
> Task(type=JOB_SUBMISSION,
> identity=urn:1282247429268-1282247569718-1282247569719) to coaster
> queue
> 2010-08-19 14:52:50,229-0500 INFO  BlockTaskSubmitter Queuing block
> Block 0819-520247-000000 (322x5100.000s) for submission
> 2010-08-19 14:52:50,232-0500 INFO  BlockQueueProcessor Added 350 jobs
> to new blocks
> --
> 2010-08-19 14:52:51,004-0500 INFO  JobQueue Adding task
> Task(type=JOB_SUBMISSION,
> identity=urn:1282247430113-1282247570981-1282247570982) to coaster
> queue
> 2010-08-19 14:52:51,005-0500 INFO  BlockTaskSubmitter Submitting block
> Block 0819-520247-000000 (322x5100.000s)
> 2010-08-19 14:52:51,026-0500 INFO  JobQueue Adding task
> Task(type=JOB_SUBMISSION,
> identity=urn:1282247430114-1282247570984-1282247570985) to coaster
> queue
> --
> 2010-08-19 14:53:00,436-0500 INFO  BlockQueueProcessor Committed 415 new jobs
> 2010-08-19 14:53:00,437-0500 INFO  Block Shutting down block Block
> 0819-520247-000000 (322x5100.000s)
> 2010-08-19 14:53:00,437-0500 INFO  Block Block Block
> 0819-520247-000000 (322x5100.000s) not running. Cancelling job.
> 2010-08-19 14:53:00,438-0500 WARN  JobSubmissionTaskHandler Job cleaned before
> --
>         ... 9 more
> 2010-08-19 14:53:00,438-0500 INFO  BlockQueueProcessor Removing block
> Block 0819-520247-000000 (322x5100.000s)
> 2010-08-19 14:53:00,438-0500 INFO  BlockQueueProcessor Cleaned 1 done blocks
> 
> 
> gram log:
> 
> /19 14:52:55 JM: Security context imported
> 8/19 14:52:55 JM: Adding new callback contact
> (url=https://129.93.227.78:57277/1282247571979, mask=65535)
> 8/19 14:52:55 JM: Added successfully
> 8/19 14:52:55 Pre-parsed RSL string: &( rsl_substitution =
> (GLOBUSRUN_GASS_URL "https://129.93.227.78:41794") )( directory = "/"
> )( arguments = "/grid_home/engage/.globus/coasters/cscript758567006055368697.pl"
> "http://129.93.227.78:38194" "0819-520247-000000"
> "/grid_home/engage/.globus/coasters" )( hostcount = "322" )(
> executable = "/usr/bin/perl" )( maxwalltime = "85" )( stderr =
> $(GLOBUSRUN_GASS_URL) # "/dev/stderr-urn:cog-1282247568900" )( stdout
> = $(GLOBUSRUN_GASS_URL) # "/dev/stdout-urn:cog-1282247568900" )( name
> = "Block-0819-520247-000000" )( jobtype = "multiple" )( count = "322"
> )
> 8/19 14:52:55
> <<<<<Job Request RSL
> &("rsl_substitution" = ("GLOBUSRUN_GASS_URL"
> "https://129.93.227.78:41794" ) )("directory" = "/" )("arguments" =
> "/grid_home/engage/.globus/coasters/cscript758567006055368697.pl"
> "http://129.93.227.78:38194" "0819-520247-000000"
> "/grid_home/engage/.globus/coasters" )("hostcount" = "322"
> )("executable" = "/usr/bin/perl" )("maxwalltime" = "85" )("stderr" =
> $("GLOBUSRUN_GASS_URL") # "/dev/stderr-urn:cog-1282247568900"
> )("stdout" = $("GLOBUSRUN_GASS_URL") #
> "/dev/stdout-urn:cog-1282247568900" )("name" =
> "Block-0819-520247-000000" )("jobtype" = "multiple" )("count" = "322"
> )
> >>>>>Job Request RSL
> 8/19 14:52:55
> <<<<<Job Request RSL (canonical)
> &("rslsubstitution" = ("GLOBUSRUN_GASS_URL"
> "https://129.93.227.78:41794" ) )("directory" = "/" )("arguments" =
> "/grid_home/engage/.globus/coasters/cscript758567006055368697.pl"
> "http://129.93.227.78:38194" "0819-520247-000000"
> "/grid_home/engage/.globus/coasters" )("hostcount" = "322"
> )("executable" = "/usr/bin/perl" )("maxwalltime" = "85" )("stderr" =
> $("GLOBUSRUN_GASS_URL") # "/dev/stderr-urn:cog-1282247568900"
> )("stdout" = $("GLOBUSRUN_GASS_URL") #
> "/dev/stdout-urn:cog-1282247568900" )("name" =
> "Block-0819-520247-000000" )("jobtype" = "multiple" )("count" = "322"
> )
> >>>>>Job Request RSL (canonical)
> 8/19 14:52:55 JM: Evaluating RSL Value8/19 14:52:55 JM: Evaluated RSL
> Value to GLOBUSRUN_GASS_URL8/19 14:52:55 JM: Evaluating RSL Value8/19
> 14:52:55 JM: Evaluated RSL Value to https://129.93.227.78:417948/19
> 14:52:55 Appending extra env.var
> LD_LIBRARY_PATH=/opt/osg/osg-1.2/subversion/lib:/opt/osg/osg-1.2/apache/lib:/opt/osg/osg-1.2/MonaLisa/Service/VDTFarm/pgsql/lib:/opt/osg/osg-1.2/glite/lib64:/opt/osg/osg-1.2/glite/lib:/opt/osg/osg-1.2/prima/lib:/opt/osg/osg-1.2/mysql5/lib/mysql:/opt/osg/osg-1.2/globus/lib:/opt/osg/osg-1.2/berkeley-db/lib:/opt/osg/osg-1.2/expat/lib:
> 8/19 14:52:55 Job Manager State Machine (entering):
> GLOBUS_GRAM_JOB_MANAGER_STATE_MAKE_SCRATCHDIR
> 8/19 14:52:55
> <<<<<Job RSL
> &("environment" = ("LD_LIBRARY_PATH"
> "/opt/osg/osg-1.2/subversion/lib:/opt/osg/osg-1.2/apache/lib:/opt/osg/osg-1.2/MonaLisa/Service/VDTFarm/pgsql/lib:/opt/osg/osg-1.2/glite/lib64:/opt/osg/osg-1.2/glite/lib:/opt/osg/osg-1.2/prima/lib:/opt/osg/osg-1.2/mysql5/lib/mysql:/opt/osg/osg-1.2/globus/lib:/opt/osg/osg-1.2/berkeley-db/lib:/opt/osg/osg-1.2/expat/lib:"
> ) ("HOME" "/grid_home/engage" ) ("LOGNAME" "engage" )
> )("rslsubstitution" = ("GLOBUSRUN_GASS_URL"
> "https://129.93.227.78:41794" ) )("directory" = "/" )("arguments" =
> "/grid_home/engage/.globus/coasters/cscript758567006055368697.pl"
> "http://129.93.227.78:38194" "0819-520247-000000"
> "/grid_home/engage/.globus/coasters" )("hostcount" = "322"
> )("executable" = "/usr/bin/perl" )("maxwalltime" = "85" )("stderr" =
> $("GLOBUSRUN_GASS_URL") # "/dev/stderr-urn:cog-1282247568900"
> )("stdout" = $("GLOBUSRUN_GASS_URL") #
> "/dev/stdout-urn:cog-1282247568900" )("name" =
> "Block-0819-520247-000000" )("jobtype" = "multiple" )("count" = "322"
> )
> >>>>>Job RSL
> 8/19 14:52:55
> <<<<<Job RSL (post-eval)
> &("environment" = ("LD_LIBRARY_PATH"
> "/opt/osg/osg-1.2/subversion/lib:/opt/osg/osg-1.2/apache/lib:/opt/osg/osg-1.2/MonaLisa/Service/VDTFarm/pgsql/lib:/opt/osg/osg-1.2/glite/lib64:/opt/osg/osg-1.2/glite/lib:/opt/osg/osg-1.2/prima/lib:/opt/osg/osg-1.2/mysql5/lib/mysql:/opt/osg/osg-1.2/globus/lib:/opt/osg/osg-1.2/berkeley-db/lib:/opt/osg/osg-1.2/expat/lib:"
> ) ("HOME" "/grid_home/engage" ) ("LOGNAME" "engage" )
> )("rslsubstitution" = ("GLOBUSRUN_GASS_URL"
> "https://129.93.227.78:41794" ) )("directory" = "/" )("arguments" =
> "/grid_home/engage/.globus/coasters/cscript758567006055368697.pl"
> "http://129.93.227.78:38194" "0819-520247-000000"
> "/grid_home/engage/.globus/coasters" )("hostcount" = "322"
> )("executable" = "/usr/bin/perl" )("maxwalltime" = "85" )("stderr" =
> "https://129.93.227.78:41794/dev/stderr-urn:cog-1282247568900"
> )("stdout" = "https://129.93.227.78:41794/dev/stdout-urn:cog-1282247568900"
> )("name" = "Block-0819-520247-000000" )("jobtype" = "multiple"
> )("count" = "322" )
> >>>>>Job RSL (post-eval)
> RSL attribute 'name' is not in the validation file!
> 8/19 14:52:55 Job Manager State Machine (entering):
> GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED
> 8/19 14:52:55 Job Manager State Machine (entering):
> GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_CLOSE_OUTPUT
> 8/19 14:52:55 Job Manager State Machine (entering):
> GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_PRE_FILE_CLEAN_UP
> 8/19 14:52:55 Job Manager State Machine (entering):
> GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_FILE_CLEAN_UP
> 8/19 14:52:55 Job Manager State Machine (entering):
> GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_SCRATCH_CLEAN_UP
> 8/19 14:52:55 JMI: testing job manager scripts for type pbs exist and
> permissions are ok.
> 8/19 14:52:55 JMI: completed script validation: job manager type is pbs.
> 8/19 14:52:55 JMI: cmd = cache_cleanup
> Thu Aug 19 14:52:57 2010 JM_SCRIPT: New Perl JobManager created.
> Thu Aug 19 14:52:57 2010 JM_SCRIPT: Using jm supplied job dir:
> /grid_home/engage/.globus/job/ff-grid2.unl.edu/23986.1282247575
> Thu Aug 19 14:52:57 2010 JM_SCRIPT: Using jm supplied job dir:
> /grid_home/engage/.globus/job/ff-grid2.unl.edu/23986.1282247575
> Thu Aug 19 14:52:57 2010 JM_SCRIPT: cache_cleanup(enter)
> Thu Aug 19 14:52:57 2010 JM_SCRIPT: Cleaning files in job dir
> /grid_home/engage/.globus/job/ff-grid2.unl.edu/23986.1282247575
> Thu Aug 19 14:52:57 2010 JM_SCRIPT: Removed 1 files from
> /grid_home/engage/.globus/job/ff-grid2.unl.edu/23986.1282247575
> Thu Aug 19 14:52:57 2010 JM_SCRIPT: cache_cleanup(exit)
> 8/19 14:52:57 Job Manager State Machine (entering):
> GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_CACHE_CLEAN_UP
> 8/19 14:52:57 Job Manager State Machine (entering):
> GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_RESPONSE
> 8/19 14:52:57 JM: before sending to client: rc=0 (Success)
> 8/19 14:52:57 Job Manager State Machine (exiting):
> GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_DONE
> 8/19 14:52:57 JM: in globus_gram_job_manager_reporting_file_remove()
> 8/19 14:52:57 Job Manager State Machine (entering):
> GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_DONE
> 8/19 14:52:57 JM: in globus_gram_job_manager_reporting_file_remove()
> 8/19 14:52:57 JM: exiting globus_gram_job_manager
> 
> >From the gram log above, it looks like that the workers are finishing
> nicely from the cancel signal made by the coaster service.
> 
> -Allan





More information about the Swift-devel mailing list