[Swift-devel] Fwd: raptor-loop model problem
Michael Wilde
wilde at mcs.anl.gov
Fri May 28 13:13:45 CDT 2010
Wenjun, in the attached log, I see 5 boost-threader jobs starting but not finishing.
Then *I think* the coasters start timing out with nothing else to do.
Mihael, can you take a look at this log and work with Wenjun to pinpoint the problem?
Thanks,
- Mike
----- Forwarded Message -----
From: "wenjun wu" <wwjag at mcs.anl.gov>
To: "Michael Wilde" <wilde at mcs.anl.gov>
Cc: "Thomas D. Uram" <turam at mcs.anl.gov>
Sent: Friday, May 28, 2010 10:48:40 AM GMT -06:00 US/Canada Central
Subject: Re: raptor-loop model problem
Hi Mike,
After I fixed the file uploader in the portal, the old problem
"cannot open seq file PREPROCESSED/SEQ/T0411D1.seq for read!" is gone.
But the portal still can't get the result back from the BoostThreader
job.
The error in the workflow run log is:
2010-05-27 13:25:52,842-0500 WARN RequestHandler org.globus.cog.karajan.workflow.service.channels.IrrecoverableException: Coaster service ended. Reason: null
stdout:
stderr:
    at org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.statusChanged(ServiceManager.java:230)
    at org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:236)
    at org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:224)
    at org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:253)
    at org.globus.cog.abstraction.impl.ssh.execution.JobSubmissionTaskHandler.SSHTaskStatusChanged(JobSubmissionTaskHandler.java:193)
    at org.globus.cog.abstraction.impl.ssh.SSHRunner.notifyListeners(SSHRunner.java:84)
    at org.globus.cog.abstraction.impl.ssh.SSHRunner.run(SSHRunner.java:43)
    at java.lang.Thread.run(Thread.java:595)
2010-05-27 13:25:52,843-0500 INFO AbstractStreamKarajanChannel 1427072207: Channel shut down
java.lang.Throwable
    at org.globus.cog.karajan.workflow.service.channels.AbstractTCPChannel.close(AbstractTCPChannel.java:97)
    at org.globus.cog.karajan.workflow.service.channels.MetaChannel.close(MetaChannel.java:87)
    at org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.statusChanged(ServiceManager.java:232)
    at org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:236)
    at org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:224)
    at org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:253)
    at org.globus.cog.abstraction.impl.ssh.execution.JobSubmissionTaskHandler.SSHTaskStatusChanged(JobSubmissionTaskHandler.java:193)
    at org.globus.cog.abstraction.impl.ssh.SSHRunner.notifyListeners(SSHRunner.java:84)
    at org.globus.cog.abstraction.impl.ssh.SSHRunner.run(SSHRunner.java:43)
    at java.lang.Thread.run(Thread.java:595)
2010-05-27 13:25:52,843-0500 INFO ConnectionProtocol Freeing channel 4 [Unnamed Channel]
2010-05-27 13:25:58,866-0500 INFO AbstractStreamKarajanChannel$Multiplexer No streams
2010-05-27 13:26:08,883-0500 INFO AbstractStreamKarajanChannel$Multiplexer No streams
2010-05-27 13:26:18,905-0500 INFO AbstractStreamKarajanChannel$Multiplexer No streams
2010-05-27 13:26:28,920-0500 INFO AbstractStreamKarajanChannel$Multiplexer No streams
2010-05-27 13:26:38,924-0500 INFO AbstractStreamKarajanChannel$Multiplexer No streams
2010-05-27 13:26:48,934-0500 INFO AbstractStreamKarajanChannel$Multiplexer No streams
2010-05-27 13:26:56,456-0500 INFO TransportProtocolCommon Sending SSH_MSG_DISCONNECT
2010-05-27 13:26:56,456-0500 INFO Service ssh-connection thread is exiting
It seems something went wrong after the BoostThreader job finished,
so the Swift engine never gets the result back and waits endlessly.
The coaster.log also contains some error messages:
2010-05-27 13:25:46,741-0500 WARN Command Command(38, JOBSTATUS): handling reply timeout; sendReqTime=100527-132346.736, sendTime=100527-132346.737, now=100527-132546.741
2010-05-27 13:25:46,741-0500 WARN Command Command(38, JOBSTATUS) fault was: Reply timeout
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
    at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
    at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
    at java.util.TimerThread.mainLoop(Timer.java:512)
    at java.util.TimerThread.run(Timer.java:462)
2010-05-27 13:25:46,851-0500 WARN Command Command(39, JOBSTATUS): handling reply timeout; sendReqTime=100527-132346.847, sendTime=100527-132346.848, now=100527-132546.851
2010-05-27 13:25:46,851-0500 WARN Command Command(39, JOBSTATUS) fault was: Reply timeout
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
    at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
    at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
    at java.util.TimerThread.mainLoop(Timer.java:512)
    at java.util.TimerThread.run(Timer.java:462)
2010-05-27 13:25:47,335-0500 WARN Command Command(40, JOBSTATUS): handling reply timeout; sendReqTime=100527-132347.331, sendTime=100527-132347.332, now=100527-132547.335
2010-05-27 13:25:47,335-0500 WARN Command Command(40, JOBSTATUS) fault was: Reply timeout
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
    at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
    at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
    at java.util.TimerThread.mainLoop(Timer.java:512)
    at java.util.TimerThread.run(Timer.java:462)
2010-05-27 13:25:47,478-0500 INFO Cpu 0527-011146-000000:3 pull
2010-05-27 13:25:47,804-0500 INFO BlockQueueProcessor Updated allocsize: 6.519897787948322
2010-05-27 13:25:47,805-0500 INFO BlockQueueProcessor allocsize = 6.519897787948322, queuedsize = 0.0, qsz = 0
2010-05-27 13:25:47,805-0500 INFO BlockQueueProcessor Plan time: 1
2010-05-27 13:25:48,084-0500 WARN Command Command(41, JOBSTATUS): handling reply timeout; sendReqTime=100527-132348.080, sendTime=100527-132348.081, now=100527-132548.084
2010-05-27 13:25:48,084-0500 WARN Command Command(41, JOBSTATUS) fault was: Reply timeout
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
    at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
    at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
    at java.util.TimerThread.mainLoop(Timer.java:512)
    at java.util.TimerThread.run(Timer.java:462)
2010-05-27 13:25:48,480-0500 INFO Cpu 0527-011146-000000:0 pull
2010-05-27 13:25:49,482-0500 INFO Cpu 0527-011146-000000:5 pull
2010-05-27 13:25:49,734-0500 WARN Command Command(42, JOBSTATUS): handling reply timeout; sendReqTime=100527-132349.730, sendTime=100527-132349.731, now=100527-132549.734
2010-05-27 13:25:49,734-0500 WARN Command Command(42, JOBSTATUS) fault was: Reply timeout
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
    at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
    at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
    at java.util.TimerThread.mainLoop(Timer.java:512)
    at java.util.TimerThread.run(Timer.java:462)
2010-05-27 13:25:50,007-0500 INFO Block Shutting down block Block 0527-011146-000000 (6x4200.000s)
2010-05-27 13:25:50,009-0500 INFO BlockQueueProcessor Cleaned 1 done blocks
2010-05-27 13:25:50,009-0500 INFO BlockQueueProcessor Updated allocsize: 6.519849641186395
The complete log is attached.
Wenjun
> Hi Mike,
> My raptorloop script doesn't work from the portal but works when I run
> it from the attached shell script.
> After digging through the swift and raptor log files,
> I found the error message: "cannot open seq file
> PREPROCESSED/SEQ/T0411D1.seq for read!"
> BoostThreader is supposed to untar the prepared tarball to PREPROCESSED.
> I have a feeling it was caused by the input arguments. From the
> portal, the input arguments look like:
>
> -target=T0411D1 -maxLoopModel=1 -minLoopModelScore=1.0 -minLoopSize=3
> -maxLoopsPerModel=10 -templateList=20
> -prepTar=/gpfs/pads/oops/scienceportal/apache-tomcat-5.5.27/webapps/SIDGridPortal//temp/AE00A497C18DB8885C24D04862A0909A/t0411d1.prep.tar.gz
>
> -seqFile=/gpfs/pads/oops/scienceportal/apache-tomcat-5.5.27/webapps/SIDGridPortal//temp/1D8389428752F99E3C0A14789C07F55C/t0411d1.fasta
> -templatesPerJob=4
> -nModels=10
>
>
> And the successful raptor run has the following arguments:
> -target=T0411D1 \
> -seqFile=/home/aashish/testPrep/T0411D1.fasta \
> -prepTar=/gpfs/pads/oops/scienceportal/apache-tomcat-5.5.27/webapps/SIDGridPortal/temp/AE00A497C18DB8885C24D04862A0909A/t0411d1.prep.tar.gz
> \
> -templatesPerJob=4 -templateList=20 -nModels=10 -nSim=4 \
> -loopRunParamFile=$(pwd)/loopmodels.param \
> -maxLoopModels=1 \
> -minLoopModelScore=1.0 \
> -minLoopSize=3 \
> -maxLoopsPerModel=10
>
> Can you see any reason for this problem?
>
> Wenjun
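[Editorial note: the two argument lists quoted above can be diffed mechanically. A minimal sketch (paths elided as "elided" since only the flag names matter) that compares just the flag names of the two invocations:]

```python
# Compare flag names (ignoring values) between the failing portal run
# and the successful shell run quoted above.
portal_args = """
-target=T0411D1 -maxLoopModel=1 -minLoopModelScore=1.0 -minLoopSize=3
-maxLoopsPerModel=10 -templateList=20 -prepTar=elided -seqFile=elided
-templatesPerJob=4 -nModels=10
"""

shell_args = """
-target=T0411D1 -seqFile=elided -prepTar=elided
-templatesPerJob=4 -templateList=20 -nModels=10 -nSim=4
-loopRunParamFile=elided -maxLoopModels=1 -minLoopModelScore=1.0
-minLoopSize=3 -maxLoopsPerModel=10
"""

def flag_names(args):
    """Return the set of '-flag' names appearing in an argument string."""
    return {tok.split("=", 1)[0] for tok in args.split() if tok.startswith("-")}

only_portal = flag_names(portal_args) - flag_names(shell_args)
only_shell = flag_names(shell_args) - flag_names(portal_args)
print("only in portal run:", sorted(only_portal))
print("only in shell run: ", sorted(only_shell))
```

This shows the portal run passes -maxLoopModel (singular) where the working run uses -maxLoopModels, and omits -nSim and -loopRunParamFile, which may account for the different behavior.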
--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: oops-20100527-1051-78q7905g.log
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20100528/90a1e72d/attachment.ksh>