[Swift-devel] gram2 coaster workers terminating prematurely

Allan Espinosa aespinosa at cs.uchicago.edu
Thu Aug 19 15:45:10 CDT 2010


Hi,

It looks like the service cannot start the workers successfully when I
run on OSG resources.  I get these messages from the coasters.log:

2010-08-19 14:53:33,274-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
identity=urn:1282247428606-1282247568724-1282247568725) setting status
to Failed Task failed: Error submitting block task
2010-08-19 14:53:34,532-0500 WARN  Block Failed to shut down block
2010-08-19 14:53:39,054-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
identity=urn:1282247428746-1282247568932-1282247568933) setting status
to Failed Task failed: Error submitting block task
2010-08-19 14:53:39,054-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
identity=urn:1282247428749-1282247568935-1282247568936) setting status
to Failed Task failed: Error submitting block task
2010-08-19 14:53:41,148-0500 WARN  Block Failed to shut down block
2010-08-19 14:53:41,953-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
identity=urn:1282247428534-1282247568613-1282247568614) setting status
to Failed Task failed: Error submitting block task
2010-08-19 14:53:43,352-0500 WARN  Block Failed to shut down block
2010-08-19 14:53:48,904-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
identity=urn:1282247428602-1282247568718-1282247568719) setting status
to Failed Task failed: Error submitting block task
2010-08-19 14:53:49,969-0500 WARN  Block Failed to shut down block

pool entry:
  <pool handle="Firefly">
    <execution provider="coaster" url="ff-grid2.unl.edu"
      jobmanager="gt2:gt2:pbs" />
    <profile namespace="globus" key="internalhostname">129.93.227.78</profile>

    <profile namespace="globus" key="maxTime">86400</profile>
    <profile namespace="globus" key="maxNodes">1290</profile>
    <profile namespace="globus" key="spread">0.8</profile>
    <profile namespace="globus" key="slots">10</profile>
    <profile namespace="globus" key="lowOverallocation">20</profile>
    <!--<profile namespace="globus" key="remoteMonitorEnabled">true</profile>-->

    <profile namespace="karajan" key="initialScore">1500.0</profile>
    <profile namespace="karajan" key="jobThrottle">51.54</profile>

    <gridftp  url="gsiftp://ff-grid3.unl.edu"/>
    <workdirectory>/panfs/panasas/CMS/data/engage-scec/swift_scratch</workdirectory>
  </pool>

I traced a submission for a block and here are the related log entries:

corresponding entry in coasterlog and additional context lines:
$ grep 000000 -C 2 coasters.log
2010-08-19 14:52:50,200-0500 INFO  JobQueue Adding task
Task(type=JOB_SUBMISSION,
identity=urn:1282247429268-1282247569718-1282247569719) to coaster
queue
2010-08-19 14:52:50,229-0500 INFO  BlockTaskSubmitter Queuing block
Block 0819-520247-000000 (322x5100.000s) for submission
2010-08-19 14:52:50,232-0500 INFO  BlockQueueProcessor Added 350 jobs
to new blocks
--
2010-08-19 14:52:51,004-0500 INFO  JobQueue Adding task
Task(type=JOB_SUBMISSION,
identity=urn:1282247430113-1282247570981-1282247570982) to coaster
queue
2010-08-19 14:52:51,005-0500 INFO  BlockTaskSubmitter Submitting block
Block 0819-520247-000000 (322x5100.000s)
2010-08-19 14:52:51,026-0500 INFO  JobQueue Adding task
Task(type=JOB_SUBMISSION,
identity=urn:1282247430114-1282247570984-1282247570985) to coaster
queue
--
2010-08-19 14:53:00,436-0500 INFO  BlockQueueProcessor Committed 415 new jobs
2010-08-19 14:53:00,437-0500 INFO  Block Shutting down block Block
0819-520247-000000 (322x5100.000s)
2010-08-19 14:53:00,437-0500 INFO  Block Block Block
0819-520247-000000 (322x5100.000s) not running. Cancelling job.
2010-08-19 14:53:00,438-0500 WARN  JobSubmissionTaskHandler Job cleaned before
--
        ... 9 more
2010-08-19 14:53:00,438-0500 INFO  BlockQueueProcessor Removing block
Block 0819-520247-000000 (322x5100.000s)
2010-08-19 14:53:00,438-0500 INFO  BlockQueueProcessor Cleaned 1 done blocks


gram log:

/19 14:52:55 JM: Security context imported
8/19 14:52:55 JM: Adding new callback contact
(url=https://129.93.227.78:57277/1282247571979, mask=65535)
8/19 14:52:55 JM: Added successfully
8/19 14:52:55 Pre-parsed RSL string: &( rsl_substitution =
(GLOBUSRUN_GASS_URL "https://129.93.227.78:41794") )( directory = "/"
)( arguments = "/grid_home/engage/.globus/coasters/cscript758567006055368697.pl"
"http://129.93.227.78:38194" "0819-520247-000000"
"/grid_home/engage/.globus/coasters" )( hostcount = "322" )(
executable = "/usr/bin/perl" )( maxwalltime = "85" )( stderr =
$(GLOBUSRUN_GASS_URL) # "/dev/stderr-urn:cog-1282247568900" )( stdout
= $(GLOBUSRUN_GASS_URL) # "/dev/stdout-urn:cog-1282247568900" )( name
= "Block-0819-520247-000000" )( jobtype = "multiple" )( count = "322"
)
8/19 14:52:55
<<<<<Job Request RSL
&("rsl_substitution" = ("GLOBUSRUN_GASS_URL"
"https://129.93.227.78:41794" ) )("directory" = "/" )("arguments" =
"/grid_home/engage/.globus/coasters/cscript758567006055368697.pl"
"http://129.93.227.78:38194" "0819-520247-000000"
"/grid_home/engage/.globus/coasters" )("hostcount" = "322"
)("executable" = "/usr/bin/perl" )("maxwalltime" = "85" )("stderr" =
$("GLOBUSRUN_GASS_URL") # "/dev/stderr-urn:cog-1282247568900"
)("stdout" = $("GLOBUSRUN_GASS_URL") #
"/dev/stdout-urn:cog-1282247568900" )("name" =
"Block-0819-520247-000000" )("jobtype" = "multiple" )("count" = "322"
)
>>>>>Job Request RSL
8/19 14:52:55
<<<<<Job Request RSL (canonical)
&("rslsubstitution" = ("GLOBUSRUN_GASS_URL"
"https://129.93.227.78:41794" ) )("directory" = "/" )("arguments" =
"/grid_home/engage/.globus/coasters/cscript758567006055368697.pl"
"http://129.93.227.78:38194" "0819-520247-000000"
"/grid_home/engage/.globus/coasters" )("hostcount" = "322"
)("executable" = "/usr/bin/perl" )("maxwalltime" = "85" )("stderr" =
$("GLOBUSRUN_GASS_URL") # "/dev/stderr-urn:cog-1282247568900"
)("stdout" = $("GLOBUSRUN_GASS_URL") #
"/dev/stdout-urn:cog-1282247568900" )("name" =
"Block-0819-520247-000000" )("jobtype" = "multiple" )("count" = "322"
)
>>>>>Job Request RSL (canonical)
8/19 14:52:55 JM: Evaluating RSL Value8/19 14:52:55 JM: Evaluated RSL
Value to GLOBUSRUN_GASS_URL8/19 14:52:55 JM: Evaluating RSL Value8/19
14:52:55 JM: Evaluated RSL Value to https://129.93.227.78:417948/19
14:52:55 Appending extra env.var
LD_LIBRARY_PATH=/opt/osg/osg-1.2/subversion/lib:/opt/osg/osg-1.2/apache/lib:/opt/osg/osg-1.2/MonaLisa/Service/VDTFarm/pgsql/lib:/opt/osg/osg-1.2/glite/lib64:/opt/osg/osg-1.2/glite/lib:/opt/osg/osg-1.2/prima/lib:/opt/osg/osg-1.2/mysql5/lib/mysql:/opt/osg/osg-1.2/globus/lib:/opt/osg/osg-1.2/berkeley-db/lib:/opt/osg/osg-1.2/expat/lib:
8/19 14:52:55 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_MAKE_SCRATCHDIR
8/19 14:52:55
<<<<<Job RSL
&("environment" = ("LD_LIBRARY_PATH"
"/opt/osg/osg-1.2/subversion/lib:/opt/osg/osg-1.2/apache/lib:/opt/osg/osg-1.2/MonaLisa/Service/VDTFarm/pgsql/lib:/opt/osg/osg-1.2/glite/lib64:/opt/osg/osg-1.2/glite/lib:/opt/osg/osg-1.2/prima/lib:/opt/osg/osg-1.2/mysql5/lib/mysql:/opt/osg/osg-1.2/globus/lib:/opt/osg/osg-1.2/berkeley-db/lib:/opt/osg/osg-1.2/expat/lib:"
) ("HOME" "/grid_home/engage" ) ("LOGNAME" "engage" )
)("rslsubstitution" = ("GLOBUSRUN_GASS_URL"
"https://129.93.227.78:41794" ) )("directory" = "/" )("arguments" =
"/grid_home/engage/.globus/coasters/cscript758567006055368697.pl"
"http://129.93.227.78:38194" "0819-520247-000000"
"/grid_home/engage/.globus/coasters" )("hostcount" = "322"
)("executable" = "/usr/bin/perl" )("maxwalltime" = "85" )("stderr" =
$("GLOBUSRUN_GASS_URL") # "/dev/stderr-urn:cog-1282247568900"
)("stdout" = $("GLOBUSRUN_GASS_URL") #
"/dev/stdout-urn:cog-1282247568900" )("name" =
"Block-0819-520247-000000" )("jobtype" = "multiple" )("count" = "322"
)
>>>>>Job RSL
8/19 14:52:55
<<<<<Job RSL (post-eval)
&("environment" = ("LD_LIBRARY_PATH"
"/opt/osg/osg-1.2/subversion/lib:/opt/osg/osg-1.2/apache/lib:/opt/osg/osg-1.2/MonaLisa/Service/VDTFarm/pgsql/lib:/opt/osg/osg-1.2/glite/lib64:/opt/osg/osg-1.2/glite/lib:/opt/osg/osg-1.2/prima/lib:/opt/osg/osg-1.2/mysql5/lib/mysql:/opt/osg/osg-1.2/globus/lib:/opt/osg/osg-1.2/berkeley-db/lib:/opt/osg/osg-1.2/expat/lib:"
) ("HOME" "/grid_home/engage" ) ("LOGNAME" "engage" )
)("rslsubstitution" = ("GLOBUSRUN_GASS_URL"
"https://129.93.227.78:41794" ) )("directory" = "/" )("arguments" =
"/grid_home/engage/.globus/coasters/cscript758567006055368697.pl"
"http://129.93.227.78:38194" "0819-520247-000000"
"/grid_home/engage/.globus/coasters" )("hostcount" = "322"
)("executable" = "/usr/bin/perl" )("maxwalltime" = "85" )("stderr" =
"https://129.93.227.78:41794/dev/stderr-urn:cog-1282247568900"
)("stdout" = "https://129.93.227.78:41794/dev/stdout-urn:cog-1282247568900"
)("name" = "Block-0819-520247-000000" )("jobtype" = "multiple"
)("count" = "322" )
>>>>>Job RSL (post-eval)
RSL attribute 'name' is not in the validation file!
8/19 14:52:55 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED
8/19 14:52:55 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_CLOSE_OUTPUT
8/19 14:52:55 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_PRE_FILE_CLEAN_UP
8/19 14:52:55 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_FILE_CLEAN_UP
8/19 14:52:55 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_SCRATCH_CLEAN_UP
8/19 14:52:55 JMI: testing job manager scripts for type pbs exist and
permissions are ok.
8/19 14:52:55 JMI: completed script validation: job manager type is pbs.
8/19 14:52:55 JMI: cmd = cache_cleanup
Thu Aug 19 14:52:57 2010 JM_SCRIPT: New Perl JobManager created.
Thu Aug 19 14:52:57 2010 JM_SCRIPT: Using jm supplied job dir:
/grid_home/engage/.globus/job/ff-grid2.unl.edu/23986.1282247575
Thu Aug 19 14:52:57 2010 JM_SCRIPT: Using jm supplied job dir:
/grid_home/engage/.globus/job/ff-grid2.unl.edu/23986.1282247575
Thu Aug 19 14:52:57 2010 JM_SCRIPT: cache_cleanup(enter)
Thu Aug 19 14:52:57 2010 JM_SCRIPT: Cleaning files in job dir
/grid_home/engage/.globus/job/ff-grid2.unl.edu/23986.1282247575
Thu Aug 19 14:52:57 2010 JM_SCRIPT: Removed 1 files from
/grid_home/engage/.globus/job/ff-grid2.unl.edu/23986.1282247575
Thu Aug 19 14:52:57 2010 JM_SCRIPT: cache_cleanup(exit)
8/19 14:52:57 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_CACHE_CLEAN_UP
8/19 14:52:57 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_RESPONSE
8/19 14:52:57 JM: before sending to client: rc=0 (Success)
8/19 14:52:57 Job Manager State Machine (exiting):
GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_DONE
8/19 14:52:57 JM: in globus_gram_job_manager_reporting_file_remove()
8/19 14:52:57 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_DONE
8/19 14:52:57 JM: in globus_gram_job_manager_reporting_file_remove()
8/19 14:52:57 JM: exiting globus_gram_job_manager

>From the gram log above, it looks like that the workers are finishing
nicely from the cancel signal made by the coaster service.

-Allan
-- 
Allan M. Espinosa <http://amespinosa.wordpress.com>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>



More information about the Swift-devel mailing list