[Swift-devel] gram2 coaster workers terminating prematurely
Mihael Hategan
hategan at mcs.anl.gov
Thu Aug 19 16:08:11 CDT 2010
On Thu, 2010-08-19 at 15:55 -0500, Mihael Hategan wrote:
> Why was I under the impression that gram supports a "name" argument to
> name your job?
> I'll remove this from the gt2 provider.
cog r2867.
>
> On Thu, 2010-08-19 at 15:45 -0500, Allan Espinosa wrote:
> > Hi,
> >
> > It looks like the service cannot start the workers successfully when I
> > run on OSG resources. I get these messages from the coasters.log:
> >
> > 2010-08-19 14:53:33,274-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> > identity=urn:1282247428606-1282247568724-1282247568725) setting status
> > to Failed Task failed: Error submitting block task
> > 2010-08-19 14:53:34,532-0500 WARN Block Failed to shut down block
> > 2010-08-19 14:53:39,054-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> > identity=urn:1282247428746-1282247568932-1282247568933) setting status
> > to Failed Task failed: Error submitting block task
> > 2010-08-19 14:53:39,054-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> > identity=urn:1282247428749-1282247568935-1282247568936) setting status
> > to Failed Task failed: Error submitting block task
> > 2010-08-19 14:53:41,148-0500 WARN Block Failed to shut down block
> > 2010-08-19 14:53:41,953-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> > identity=urn:1282247428534-1282247568613-1282247568614) setting status
> > to Failed Task failed: Error submitting block task
> > 2010-08-19 14:53:43,352-0500 WARN Block Failed to shut down block
> > 2010-08-19 14:53:48,904-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> > identity=urn:1282247428602-1282247568718-1282247568719) setting status
> > to Failed Task failed: Error submitting block task
> > 2010-08-19 14:53:49,969-0500 WARN Block Failed to shut down block
> >
> > pool entry:
> > <pool handle="Firefly">
> > <execution provider="coaster" url="ff-grid2.unl.edu"
> > jobmanager="gt2:gt2:pbs" />
> > <profile namespace="globus" key="internalhostname">129.93.227.78</profile>
> >
> > <profile namespace="globus" key="maxTime">86400</profile>
> > <profile namespace="globus" key="maxNodes">1290</profile>
> > <profile namespace="globus" key="spread">0.8</profile>
> > <profile namespace="globus" key="slots">10</profile>
> > <profile namespace="globus" key="lowOverallocation">20</profile>
> > <!--<profile namespace="globus" key="remoteMonitorEnabled">true</profile>-->
> >
> > <profile namespace="karajan" key="initialScore">1500.0</profile>
> > <profile namespace="karajan" key="jobThrottle">51.54</profile>
> >
> > <gridftp url="gsiftp://ff-grid3.unl.edu"/>
> > <workdirectory>/panfs/panasas/CMS/data/engage-scec/swift_scratch</workdirectory>
> > </pool>
> >
> > I traced a submission for a block and here are the related log entries:
> >
> > corresponding entry in coasterlog and additional context lines:
> > $ grep 000000 -C 2 coasters.log
> > 2010-08-19 14:52:50,200-0500 INFO JobQueue Adding task
> > Task(type=JOB_SUBMISSION,
> > identity=urn:1282247429268-1282247569718-1282247569719) to coaster
> > queue
> > 2010-08-19 14:52:50,229-0500 INFO BlockTaskSubmitter Queuing block
> > Block 0819-520247-000000 (322x5100.000s) for submission
> > 2010-08-19 14:52:50,232-0500 INFO BlockQueueProcessor Added 350 jobs
> > to new blocks
> > --
> > 2010-08-19 14:52:51,004-0500 INFO JobQueue Adding task
> > Task(type=JOB_SUBMISSION,
> > identity=urn:1282247430113-1282247570981-1282247570982) to coaster
> > queue
> > 2010-08-19 14:52:51,005-0500 INFO BlockTaskSubmitter Submitting block
> > Block 0819-520247-000000 (322x5100.000s)
> > 2010-08-19 14:52:51,026-0500 INFO JobQueue Adding task
> > Task(type=JOB_SUBMISSION,
> > identity=urn:1282247430114-1282247570984-1282247570985) to coaster
> > queue
> > --
> > 2010-08-19 14:53:00,436-0500 INFO BlockQueueProcessor Committed 415 new jobs
> > 2010-08-19 14:53:00,437-0500 INFO Block Shutting down block Block
> > 0819-520247-000000 (322x5100.000s)
> > 2010-08-19 14:53:00,437-0500 INFO Block Block Block
> > 0819-520247-000000 (322x5100.000s) not running. Cancelling job.
> > 2010-08-19 14:53:00,438-0500 WARN JobSubmissionTaskHandler Job cleaned before
> > --
> > ... 9 more
> > 2010-08-19 14:53:00,438-0500 INFO BlockQueueProcessor Removing block
> > Block 0819-520247-000000 (322x5100.000s)
> > 2010-08-19 14:53:00,438-0500 INFO BlockQueueProcessor Cleaned 1 done blocks
> >
> >
> > gram log:
> >
> > /19 14:52:55 JM: Security context imported
> > 8/19 14:52:55 JM: Adding new callback contact
> > (url=https://129.93.227.78:57277/1282247571979, mask=65535)
> > 8/19 14:52:55 JM: Added successfully
> > 8/19 14:52:55 Pre-parsed RSL string: &( rsl_substitution =
> > (GLOBUSRUN_GASS_URL "https://129.93.227.78:41794") )( directory = "/"
> > )( arguments = "/grid_home/engage/.globus/coasters/cscript758567006055368697.pl"
> > "http://129.93.227.78:38194" "0819-520247-000000"
> > "/grid_home/engage/.globus/coasters" )( hostcount = "322" )(
> > executable = "/usr/bin/perl" )( maxwalltime = "85" )( stderr =
> > $(GLOBUSRUN_GASS_URL) # "/dev/stderr-urn:cog-1282247568900" )( stdout
> > = $(GLOBUSRUN_GASS_URL) # "/dev/stdout-urn:cog-1282247568900" )( name
> > = "Block-0819-520247-000000" )( jobtype = "multiple" )( count = "322"
> > )
> > 8/19 14:52:55
> > <<<<<Job Request RSL
> > &("rsl_substitution" = ("GLOBUSRUN_GASS_URL"
> > "https://129.93.227.78:41794" ) )("directory" = "/" )("arguments" =
> > "/grid_home/engage/.globus/coasters/cscript758567006055368697.pl"
> > "http://129.93.227.78:38194" "0819-520247-000000"
> > "/grid_home/engage/.globus/coasters" )("hostcount" = "322"
> > )("executable" = "/usr/bin/perl" )("maxwalltime" = "85" )("stderr" =
> > $("GLOBUSRUN_GASS_URL") # "/dev/stderr-urn:cog-1282247568900"
> > )("stdout" = $("GLOBUSRUN_GASS_URL") #
> > "/dev/stdout-urn:cog-1282247568900" )("name" =
> > "Block-0819-520247-000000" )("jobtype" = "multiple" )("count" = "322"
> > )
> > >>>>>Job Request RSL
> > 8/19 14:52:55
> > <<<<<Job Request RSL (canonical)
> > &("rslsubstitution" = ("GLOBUSRUN_GASS_URL"
> > "https://129.93.227.78:41794" ) )("directory" = "/" )("arguments" =
> > "/grid_home/engage/.globus/coasters/cscript758567006055368697.pl"
> > "http://129.93.227.78:38194" "0819-520247-000000"
> > "/grid_home/engage/.globus/coasters" )("hostcount" = "322"
> > )("executable" = "/usr/bin/perl" )("maxwalltime" = "85" )("stderr" =
> > $("GLOBUSRUN_GASS_URL") # "/dev/stderr-urn:cog-1282247568900"
> > )("stdout" = $("GLOBUSRUN_GASS_URL") #
> > "/dev/stdout-urn:cog-1282247568900" )("name" =
> > "Block-0819-520247-000000" )("jobtype" = "multiple" )("count" = "322"
> > )
> > >>>>>Job Request RSL (canonical)
> > 8/19 14:52:55 JM: Evaluating RSL Value8/19 14:52:55 JM: Evaluated RSL
> > Value to GLOBUSRUN_GASS_URL8/19 14:52:55 JM: Evaluating RSL Value8/19
> > 14:52:55 JM: Evaluated RSL Value to https://129.93.227.78:417948/19
> > 14:52:55 Appending extra env.var
> > LD_LIBRARY_PATH=/opt/osg/osg-1.2/subversion/lib:/opt/osg/osg-1.2/apache/lib:/opt/osg/osg-1.2/MonaLisa/Service/VDTFarm/pgsql/lib:/opt/osg/osg-1.2/glite/lib64:/opt/osg/osg-1.2/glite/lib:/opt/osg/osg-1.2/prima/lib:/opt/osg/osg-1.2/mysql5/lib/mysql:/opt/osg/osg-1.2/globus/lib:/opt/osg/osg-1.2/berkeley-db/lib:/opt/osg/osg-1.2/expat/lib:
> > 8/19 14:52:55 Job Manager State Machine (entering):
> > GLOBUS_GRAM_JOB_MANAGER_STATE_MAKE_SCRATCHDIR
> > 8/19 14:52:55
> > <<<<<Job RSL
> > &("environment" = ("LD_LIBRARY_PATH"
> > "/opt/osg/osg-1.2/subversion/lib:/opt/osg/osg-1.2/apache/lib:/opt/osg/osg-1.2/MonaLisa/Service/VDTFarm/pgsql/lib:/opt/osg/osg-1.2/glite/lib64:/opt/osg/osg-1.2/glite/lib:/opt/osg/osg-1.2/prima/lib:/opt/osg/osg-1.2/mysql5/lib/mysql:/opt/osg/osg-1.2/globus/lib:/opt/osg/osg-1.2/berkeley-db/lib:/opt/osg/osg-1.2/expat/lib:"
> > ) ("HOME" "/grid_home/engage" ) ("LOGNAME" "engage" )
> > )("rslsubstitution" = ("GLOBUSRUN_GASS_URL"
> > "https://129.93.227.78:41794" ) )("directory" = "/" )("arguments" =
> > "/grid_home/engage/.globus/coasters/cscript758567006055368697.pl"
> > "http://129.93.227.78:38194" "0819-520247-000000"
> > "/grid_home/engage/.globus/coasters" )("hostcount" = "322"
> > )("executable" = "/usr/bin/perl" )("maxwalltime" = "85" )("stderr" =
> > $("GLOBUSRUN_GASS_URL") # "/dev/stderr-urn:cog-1282247568900"
> > )("stdout" = $("GLOBUSRUN_GASS_URL") #
> > "/dev/stdout-urn:cog-1282247568900" )("name" =
> > "Block-0819-520247-000000" )("jobtype" = "multiple" )("count" = "322"
> > )
> > >>>>>Job RSL
> > 8/19 14:52:55
> > <<<<<Job RSL (post-eval)
> > &("environment" = ("LD_LIBRARY_PATH"
> > "/opt/osg/osg-1.2/subversion/lib:/opt/osg/osg-1.2/apache/lib:/opt/osg/osg-1.2/MonaLisa/Service/VDTFarm/pgsql/lib:/opt/osg/osg-1.2/glite/lib64:/opt/osg/osg-1.2/glite/lib:/opt/osg/osg-1.2/prima/lib:/opt/osg/osg-1.2/mysql5/lib/mysql:/opt/osg/osg-1.2/globus/lib:/opt/osg/osg-1.2/berkeley-db/lib:/opt/osg/osg-1.2/expat/lib:"
> > ) ("HOME" "/grid_home/engage" ) ("LOGNAME" "engage" )
> > )("rslsubstitution" = ("GLOBUSRUN_GASS_URL"
> > "https://129.93.227.78:41794" ) )("directory" = "/" )("arguments" =
> > "/grid_home/engage/.globus/coasters/cscript758567006055368697.pl"
> > "http://129.93.227.78:38194" "0819-520247-000000"
> > "/grid_home/engage/.globus/coasters" )("hostcount" = "322"
> > )("executable" = "/usr/bin/perl" )("maxwalltime" = "85" )("stderr" =
> > "https://129.93.227.78:41794/dev/stderr-urn:cog-1282247568900"
> > )("stdout" = "https://129.93.227.78:41794/dev/stdout-urn:cog-1282247568900"
> > )("name" = "Block-0819-520247-000000" )("jobtype" = "multiple"
> > )("count" = "322" )
> > >>>>>Job RSL (post-eval)
> > RSL attribute 'name' is not in the validation file!
> > 8/19 14:52:55 Job Manager State Machine (entering):
> > GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED
> > 8/19 14:52:55 Job Manager State Machine (entering):
> > GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_CLOSE_OUTPUT
> > 8/19 14:52:55 Job Manager State Machine (entering):
> > GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_PRE_FILE_CLEAN_UP
> > 8/19 14:52:55 Job Manager State Machine (entering):
> > GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_FILE_CLEAN_UP
> > 8/19 14:52:55 Job Manager State Machine (entering):
> > GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_SCRATCH_CLEAN_UP
> > 8/19 14:52:55 JMI: testing job manager scripts for type pbs exist and
> > permissions are ok.
> > 8/19 14:52:55 JMI: completed script validation: job manager type is pbs.
> > 8/19 14:52:55 JMI: cmd = cache_cleanup
> > Thu Aug 19 14:52:57 2010 JM_SCRIPT: New Perl JobManager created.
> > Thu Aug 19 14:52:57 2010 JM_SCRIPT: Using jm supplied job dir:
> > /grid_home/engage/.globus/job/ff-grid2.unl.edu/23986.1282247575
> > Thu Aug 19 14:52:57 2010 JM_SCRIPT: Using jm supplied job dir:
> > /grid_home/engage/.globus/job/ff-grid2.unl.edu/23986.1282247575
> > Thu Aug 19 14:52:57 2010 JM_SCRIPT: cache_cleanup(enter)
> > Thu Aug 19 14:52:57 2010 JM_SCRIPT: Cleaning files in job dir
> > /grid_home/engage/.globus/job/ff-grid2.unl.edu/23986.1282247575
> > Thu Aug 19 14:52:57 2010 JM_SCRIPT: Removed 1 files from
> > /grid_home/engage/.globus/job/ff-grid2.unl.edu/23986.1282247575
> > Thu Aug 19 14:52:57 2010 JM_SCRIPT: cache_cleanup(exit)
> > 8/19 14:52:57 Job Manager State Machine (entering):
> > GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_CACHE_CLEAN_UP
> > 8/19 14:52:57 Job Manager State Machine (entering):
> > GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_RESPONSE
> > 8/19 14:52:57 JM: before sending to client: rc=0 (Success)
> > 8/19 14:52:57 Job Manager State Machine (exiting):
> > GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_DONE
> > 8/19 14:52:57 JM: in globus_gram_job_manager_reporting_file_remove()
> > 8/19 14:52:57 Job Manager State Machine (entering):
> > GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_DONE
> > 8/19 14:52:57 JM: in globus_gram_job_manager_reporting_file_remove()
> > 8/19 14:52:57 JM: exiting globus_gram_job_manager
> >
> > >From the gram log above, it looks like that the workers are finishing
> > nicely from the cancel signal made by the coaster service.
> >
> > -Allan
>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
More information about the Swift-devel
mailing list