[Swift-devel] [Bug 218] New: Coasters failure in shutdown processing

bugzilla-daemon at mcs.anl.gov bugzilla-daemon at mcs.anl.gov
Tue Aug 25 10:35:53 CDT 2009


https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=218

           Summary: Coasters failure in shutdown processing
           Product: Swift
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: General
        AssignedTo: hategan at mcs.anl.gov
        ReportedBy: wilde at mcs.anl.gov


Hi,

I have a processing step that takes roughly 2-5 minutes. It takes two
~5 MB files as input and produces a small text file, which I need to
store. I need to run a large number of such jobs with different
parameters. It seems to me that "coaster" is the best execution
provider for my application.

Trying to start simple, I am running the first.swift (echo) example
that comes with Swift using different providers: GT2, GT4, GT2/coaster,
and GT4/coaster. All of this is done on the Abe cluster at NCSA.

Here's my sites.xml:

<pool handle="Abe-GT4">
 <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
 <execution provider="gt4" jobmanager="PBS"
 url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/>
 <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
</pool>

<pool handle="Abe-GT4-coasters">
 <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
 <execution provider="coaster" jobmanager="gt4:gt4:pbs"
 url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/>
 <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
</pool>

<pool handle="Abe-GT2">
 <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
 <execution provider="gt2" jobmanager="PBS"
 url="grid-abe.ncsa.teragrid.org:2119/jobmanager-pbs"/>
 <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
</pool>

<pool handle="Abe-GT2-coasters">
 <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
 <execution provider="coaster" jobmanager="gt2:gt2:pbs"
 url="grid-abe.ncsa.teragrid.org"/>
 <filesystem provider="coaster" url="gt2://grid-abe.ncsa.teragrid.org" />
 <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
</pool>
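
To double-check which execution provider and jobmanager each pool
actually uses, the entries can be summarized with a short script. This
is only a sketch: it copies the two GT2 pools from above and wraps them
in a <config> root element, which is an assumption since only the
<pool> entries are shown here. The function name summarize_pools is
made up for illustration.

```python
import xml.etree.ElementTree as ET

# Two pool entries copied from the sites.xml above. The <config> wrapper
# is an assumption; the report shows only the <pool> elements.
SITES_XML = """
<config>
<pool handle="Abe-GT2">
 <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
 <execution provider="gt2" jobmanager="PBS"
 url="grid-abe.ncsa.teragrid.org:2119/jobmanager-pbs"/>
 <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
</pool>
<pool handle="Abe-GT2-coasters">
 <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
 <execution provider="coaster" jobmanager="gt2:gt2:pbs"
 url="grid-abe.ncsa.teragrid.org"/>
 <filesystem provider="coaster" url="gt2://grid-abe.ncsa.teragrid.org" />
 <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
</pool>
</config>
"""


def summarize_pools(xml_text):
    """Return (handle, provider, jobmanager) for every <pool> entry."""
    root = ET.fromstring(xml_text)
    rows = []
    for pool in root.iter("pool"):
        execution = pool.find("execution")
        rows.append((
            pool.get("handle"),
            execution.get("provider"),
            execution.get("jobmanager"),
        ))
    return rows


pools = summarize_pools(SITES_XML)
for handle, provider, jobmanager in pools:
    print(f"{handle:20s} provider={provider:8s} jobmanager={jobmanager}")
```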

And tc.data is simply

Abe-GT4-coasters echo /bin/echo INSTALLED INTEL32::LINUX null

and I change the site to test different providers.
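
Switching the site by hand is easy to get wrong, so the edit can be
scripted. A minimal sketch, assuming the first whitespace-separated
field of the tc.data line is the site handle (it matches the pool
handle in sites.xml); the helper name retarget is hypothetical.

```python
# The tc.data entry from above.
TC_LINE = "Abe-GT4-coasters echo /bin/echo INSTALLED INTEL32::LINUX null"


def retarget(tc_line, new_site):
    """Replace the first field (assumed to be the site handle)."""
    fields = tc_line.split()
    fields[0] = new_site
    return " ".join(fields)


new_line = retarget(TC_LINE, "Abe-GT2-coasters")
print(new_line)
```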

Now, results:

1) Both the GT2 and GT4 providers work fine; the script completes.

2) With the GT2+coaster provider, I can see the job in the PBS queue
(the requested time is 01:41; I guess this comes from the default
coaster parameters, which I didn't change). The job appears to finish
successfully, and the output file seems to be fetched back, but then I
get this error:

Final status:  Finished successfully:1
START cleanups=[[first-20090825-0925-emkt2qt0, Abe-GT2-coasters]]
START dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters
Sending Command(21, SUBMITJOB) on GSSSChannel-null(1)
Command(21, SUBMITJOB) CMD: Command(21, SUBMITJOB)
GSSSChannel-null(1) REPL: Command(21, SUBMITJOB)
Submitted task Task(type=JOB_SUBMISSION,
identity=urn:0-1-1251210343871). Job id:
urn:1251210343871-1251210376098-1251210376099
Unregistering Command(21, SUBMITJOB)
GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
Task(type=JOB_SUBMISSION, identity=urn:0-1-1251210343871) Completed.
Waiting: 0, Running: 0. Heap size: 65M, Heap free: 42M, Max heap: 227M
END dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters
Cleaning up...
Shutting down service at https://141.142.68.180:45552
Got channel MetaChannel: 500265006 -> GSSSChannel-null(1)
Sending Command(22, SHUTDOWNSERVICE) on GSSSChannel-null(1)
Command(22, SHUTDOWNSERVICE) CMD: Command(22, SHUTDOWNSERVICE)
Command(22, SHUTDOWNSERVICE): handling reply timeout
Command(22, SHUTDOWNSERVICE): failed too many times
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
       at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:241)
       at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:246)
       at java.util.TimerThread.mainLoop(Timer.java:512)
       at java.util.TimerThread.run(Timer.java:462)
- Done
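
To pick out which coaster command failed in a longer log, the output
can be scanned for the "failed too many times" message. The sketch
below assumes the line formats seen in the excerpt above and nothing
else about the coaster logging; the function name failed_commands is
made up for illustration.

```python
import re

# A few lines copied from the log output above.
LOG = """\
Sending Command(21, SUBMITJOB) on GSSSChannel-null(1)
Unregistering Command(21, SUBMITJOB)
Sending Command(22, SHUTDOWNSERVICE) on GSSSChannel-null(1)
Command(22, SHUTDOWNSERVICE) CMD: Command(22, SHUTDOWNSERVICE)
Command(22, SHUTDOWNSERVICE): handling reply timeout
Command(22, SHUTDOWNSERVICE): failed too many times
"""

# Patterns are inferred from the excerpt, not from any documented format.
SENT = re.compile(r"^Sending Command\((\d+), (\w+)\)")
FAILED = re.compile(r"^Command\((\d+), (\w+)\): failed too many times")


def failed_commands(log_text):
    """Return (sent, failed): commands sent by id, and those that gave up."""
    sent = {}
    failed = []
    for line in log_text.splitlines():
        match = SENT.match(line)
        if match:
            sent[int(match.group(1))] = match.group(2)
        match = FAILED.match(line)
        if match:
            failed.append((int(match.group(1)), match.group(2)))
    return sent, failed


sent, failed = failed_commands(LOG)
for command_id, name in failed:
    print(f"Command {command_id} ({name}) failed after repeated reply timeouts")
```

On the excerpt above this reports only the SHUTDOWNSERVICE command; the
SUBMITJOB command was sent and unregistered without a timeout.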
