[Swift-devel] Fwd: raptor-loop model problem
Mihael Hategan
hategan at mcs.anl.gov
Fri May 28 13:37:48 CDT 2010
What version of swift/coasters is this and can you post the coaster log
(on the remote site in ~/.globus/coasters)?
Mihael
On Fri, 2010-05-28 at 13:13 -0500, Michael Wilde wrote:
> Wenjun, in the attached log, I see 5 boost-threader jobs starting but not finishing.
>
> Then *I think* the coasters start timing out with nothing else to do.
>
> Mihael, can you take a look at this log and work with Wenjun to pinpoint the problem?
>
> Thanks,
>
> - Mike
>
> ----- Forwarded Message -----
> From: "wenjun wu" <wwjag at mcs.anl.gov>
> To: "Michael Wilde" <wilde at mcs.anl.gov>
> Cc: "Thomas D. Uram" <turam at mcs.anl.gov>
> Sent: Friday, May 28, 2010 10:48:40 AM GMT -06:00 US/Canada Central
> Subject: Re: raptor-loop model problem
>
> Hi Mike,
> After I fixed the file uploader in the portal, the old problem
> "cannot open seq file PREPROCESSED/SEQ/T0411D1.seq for read!" is gone.
> But the portal still can't get the results from the "BoostThreader"
> job.
> The error in the workflow run log is:
> 2010-05-27 13:25:52,842-0500 WARN RequestHandler
> org.globus.cog.karajan.workflow.service.channels.IrrecoverableException: Coaster service ended. Reason: null
> stdout:
> stderr:
> at org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.statusChanged(ServiceManager.java:230)
> at org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:236)
> at org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:224)
> at org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:253)
> at org.globus.cog.abstraction.impl.ssh.execution.JobSubmissionTaskHandler.SSHTaskStatusChanged(JobSubmissionTaskHandler.java:193)
> at org.globus.cog.abstraction.impl.ssh.SSHRunner.notifyListeners(SSHRunner.java:84)
> at org.globus.cog.abstraction.impl.ssh.SSHRunner.run(SSHRunner.java:43)
> at java.lang.Thread.run(Thread.java:595)
> 2010-05-27 13:25:52,843-0500 INFO AbstractStreamKarajanChannel 1427072207: Channel shut down
> java.lang.Throwable
> at org.globus.cog.karajan.workflow.service.channels.AbstractTCPChannel.close(AbstractTCPChannel.java:97)
> at org.globus.cog.karajan.workflow.service.channels.MetaChannel.close(MetaChannel.java:87)
> at org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.statusChanged(ServiceManager.java:232)
> at org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:236)
> at org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:224)
> at org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:253)
> at org.globus.cog.abstraction.impl.ssh.execution.JobSubmissionTaskHandler.SSHTaskStatusChanged(JobSubmissionTaskHandler.java:193)
> at org.globus.cog.abstraction.impl.ssh.SSHRunner.notifyListeners(SSHRunner.java:84)
> at org.globus.cog.abstraction.impl.ssh.SSHRunner.run(SSHRunner.java:43)
> at java.lang.Thread.run(Thread.java:595)
> 2010-05-27 13:25:52,843-0500 INFO ConnectionProtocol Freeing channel 4 [Unnamed Channel]
> 2010-05-27 13:25:58,866-0500 INFO AbstractStreamKarajanChannel$Multiplexer No streams
> 2010-05-27 13:26:08,883-0500 INFO AbstractStreamKarajanChannel$Multiplexer No streams
> 2010-05-27 13:26:18,905-0500 INFO AbstractStreamKarajanChannel$Multiplexer No streams
> 2010-05-27 13:26:28,920-0500 INFO AbstractStreamKarajanChannel$Multiplexer No streams
> 2010-05-27 13:26:38,924-0500 INFO AbstractStreamKarajanChannel$Multiplexer No streams
> 2010-05-27 13:26:48,934-0500 INFO AbstractStreamKarajanChannel$Multiplexer No streams
> 2010-05-27 13:26:56,456-0500 INFO TransportProtocolCommon Sending SSH_MSG_DISCONNECT
> 2010-05-27 13:26:56,456-0500 INFO Service ssh-connection thread is exiting
>
> It seems something went wrong after the BoostThreader job finished,
> so the Swift engine never gets the result back and waits endlessly.
> The coaster.log also contains some error messages:
>
> 2010-05-27 13:25:46,741-0500 WARN Command Command(38, JOBSTATUS): handling reply timeout; sendReqTime=100527-132346.736, sendTime=100527-132346.737, now=100527-132546.741
> 2010-05-27 13:25:46,741-0500 WARN Command Command(38, JOBSTATUS)fault was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
> at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
> at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
> at java.util.TimerThread.mainLoop(Timer.java:512)
> at java.util.TimerThread.run(Timer.java:462)
> 2010-05-27 13:25:46,851-0500 WARN Command Command(39, JOBSTATUS): handling reply timeout; sendReqTime=100527-132346.847, sendTime=100527-132346.848, now=100527-132546.851
> 2010-05-27 13:25:46,851-0500 WARN Command Command(39, JOBSTATUS)fault was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
> at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
> at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
> at java.util.TimerThread.mainLoop(Timer.java:512)
> at java.util.TimerThread.run(Timer.java:462)
> 2010-05-27 13:25:47,335-0500 WARN Command Command(40, JOBSTATUS): handling reply timeout; sendReqTime=100527-132347.331, sendTime=100527-132347.332, now=100527-132547.335
> 2010-05-27 13:25:47,335-0500 WARN Command Command(40, JOBSTATUS)fault was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
> at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
> at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
> at java.util.TimerThread.mainLoop(Timer.java:512)
> at java.util.TimerThread.run(Timer.java:462)
> 2010-05-27 13:25:47,478-0500 INFO Cpu 0527-011146-000000:3 pull
> 2010-05-27 13:25:47,804-0500 INFO BlockQueueProcessor Updated allocsize: 6.519897787948322
> 2010-05-27 13:25:47,805-0500 INFO BlockQueueProcessor allocsize = 6.519897787948322, queuedsize = 0.0, qsz = 0
> 2010-05-27 13:25:47,805-0500 INFO BlockQueueProcessor Plan time: 1
> 2010-05-27 13:25:48,084-0500 WARN Command Command(41, JOBSTATUS): handling reply timeout; sendReqTime=100527-132348.080, sendTime=100527-132348.081, now=100527-132548.084
> 2010-05-27 13:25:48,084-0500 WARN Command Command(41, JOBSTATUS)fault was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
> at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
> at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
> at java.util.TimerThread.mainLoop(Timer.java:512)
> at java.util.TimerThread.run(Timer.java:462)
> 2010-05-27 13:25:48,480-0500 INFO Cpu 0527-011146-000000:0 pull
> 2010-05-27 13:25:49,482-0500 INFO Cpu 0527-011146-000000:5 pull
> 2010-05-27 13:25:49,734-0500 WARN Command Command(42, JOBSTATUS): handling reply timeout; sendReqTime=100527-132349.730, sendTime=100527-132349.731, now=100527-132549.734
> 2010-05-27 13:25:49,734-0500 WARN Command Command(42, JOBSTATUS)fault was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
> at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
> at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
> at java.util.TimerThread.mainLoop(Timer.java:512)
> at java.util.TimerThread.run(Timer.java:462)
> 2010-05-27 13:25:50,007-0500 INFO Block Shutting down block Block 0527-011146-000000 (6x4200.000s)
> 2010-05-27 13:25:50,009-0500 INFO BlockQueueProcessor Cleaned 1 done blocks
> 2010-05-27 13:25:50,009-0500 INFO BlockQueueProcessor Updated allocsize: 6.519849641186395
>
> The complete log is attached.
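
Each JOBSTATUS warning in the coaster log excerpt carries a sendReqTime and a now stamp. Below is a minimal sketch for measuring how long each command waited before the timeout fired. It assumes the stamps are in yymmdd-HHMMSS.mmm form, which is inferred from the log lines and not checked against the coaster source. The name reply_timeouts is made up for illustration, and the sample is copied from the first two warnings.

```python
import re
from datetime import datetime

# Assumed stamp layout (inferred from the excerpt): yymmdd-HHMMSS.mmm
STAMP = "%y%m%d-%H%M%S.%f"

PATTERN = re.compile(
    r"Command\((?P<id>\d+), (?P<name>\w+)\):\s+handling reply timeout;\s+"
    r"sendReqTime=(?P<req>[\d.-]+),\s+sendTime=(?P<sent>[\d.-]+),\s+"
    r"now=(?P<now>[\d.-]+)"
)

SAMPLE = """\
> 2010-05-27 13:25:46,741-0500 WARN Command Command(38, JOBSTATUS):
> handling reply timeout; sendReqTime=100527-132346.736,
> sendTime=100527-132346.737, now=100527-132546.741
> 2010-05-27 13:25:46,851-0500 WARN Command Command(39, JOBSTATUS):
> handling reply timeout; sendReqTime=100527-132346.847,
> sendTime=100527-132346.848, now=100527-132546.851
"""


def reply_timeouts(log_text):
    """Return (command id, command name, seconds waited) per timeout entry."""
    # Drop mail quote markers and undo line wrapping before matching.
    lines = [re.sub(r"^(?:>\s?)+", "", line) for line in log_text.splitlines()]
    flat = " ".join(lines)
    results = []
    for m in PATTERN.finditer(flat):
        requested = datetime.strptime(m["req"], STAMP)
        now = datetime.strptime(m["now"], STAMP)
        waited = (now - requested).total_seconds()
        results.append((int(m["id"]), m["name"], waited))
    return results


if __name__ == "__main__":
    for cmd_id, name, waited in reply_timeouts(SAMPLE):
        print(f"Command({cmd_id}, {name}) waited {waited:.3f}s")
```

On the two entries used here, both commands waited about 120 seconds. That would be consistent with a fixed two-minute reply timeout expiring, rather than with replies that merely arrived late.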
>
> Wenjun
> > Hi Mike,
> > My raptorloop script doesn't work from the portal, but it works when I run
> > it from the attached shell script.
> > After digging through the Swift and raptor log files,
> > I found the error message: "cannot open seq file
> > PREPROCESSED/SEQ/T0411D1.seq for read!"
> > BoostThreader is supposed to untar the prepared tarball to PREPROCESSED.
> > I have a feeling that it is caused by the input arguments. From the
> > portal, the input arguments look like:
> >
> > -target=T0411D1 -maxLoopModel=1 -minLoopModelScore=1.0 -minLoopSize=3
> > -maxLoopsPerModel=10 -templateList=20
> > -prepTar=/gpfs/pads/oops/scienceportal/apache-tomcat-5.5.27/webapps/SIDGridPortal//temp/AE00A497C18DB8885C24D04862A0909A/t0411d1.prep.tar.gz
> >
> > -seqFile=/gpfs/pads/oops/scienceportal/apache-tomcat-5.5.27/webapps/SIDGridPortal//temp/1D8389428752F99E3C0A14789C07F55C/t0411d1.fasta
> > -templatesPerJob=4
> > -nModels=10
> >
> >
> > And the successful raptor run has the following arguments:
> > -target=T0411D1 \
> > -seqFile=/home/aashish/testPrep/T0411D1.fasta \
> > -prepTar=/gpfs/pads/oops/scienceportal/apache-tomcat-5.5.27/webapps/SIDGridPortal/temp/AE00A497C18DB8885C24D04862A0909A/t0411d1.prep.tar.gz \
> > -templatesPerJob=4 -templateList=20 -nModels=10 -nSim=4 \
> > -loopRunParamFile=$(pwd)/loopmodels.param \
> > -maxLoopModels=1 \
> > -minLoopModelScore=1.0 \
> > -minLoopSize=3 \
> > -maxLoopsPerModel=10
> >
> > Could you figure out any reason for this problem?
> >
> > Wenjun
>
>
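
The two argument lists quoted above (portal run and successful shell-script run) are easy to misread by eye. Below is a minimal sketch that compares them mechanically. The helper names parse_args and compare are made up for illustration, and the argument strings are copied from the quoted message. It only reports which options differ; whether any difference explains the failure is not something the sketch can tell.

```python
PORTAL = """
-target=T0411D1 -maxLoopModel=1 -minLoopModelScore=1.0 -minLoopSize=3
-maxLoopsPerModel=10 -templateList=20
-prepTar=/gpfs/pads/oops/scienceportal/apache-tomcat-5.5.27/webapps/SIDGridPortal//temp/AE00A497C18DB8885C24D04862A0909A/t0411d1.prep.tar.gz
-seqFile=/gpfs/pads/oops/scienceportal/apache-tomcat-5.5.27/webapps/SIDGridPortal//temp/1D8389428752F99E3C0A14789C07F55C/t0411d1.fasta
-templatesPerJob=4
-nModels=10
"""

WORKING = r"""
-target=T0411D1 \
-seqFile=/home/aashish/testPrep/T0411D1.fasta \
-prepTar=/gpfs/pads/oops/scienceportal/apache-tomcat-5.5.27/webapps/SIDGridPortal/temp/AE00A497C18DB8885C24D04862A0909A/t0411d1.prep.tar.gz \
-templatesPerJob=4 -templateList=20 -nModels=10 -nSim=4 \
-loopRunParamFile=$(pwd)/loopmodels.param \
-maxLoopModels=1 \
-minLoopModelScore=1.0 \
-minLoopSize=3 \
-maxLoopsPerModel=10
"""


def parse_args(text):
    """Map each -name=value token to its value.

    Shell line-continuation backslashes and mail quote markers are ignored.
    """
    args = {}
    for token in text.replace("\\", " ").split():
        if token.startswith("-") and "=" in token:
            name, value = token[1:].split("=", 1)
            args[name] = value
    return args


def compare(portal_text, working_text):
    """Report options present on one side only, or with differing values."""
    portal, working = parse_args(portal_text), parse_args(working_text)
    shared = portal.keys() & working.keys()
    return {
        "only_in_portal": sorted(portal.keys() - working.keys()),
        "only_in_working": sorted(working.keys() - portal.keys()),
        "different_values": sorted(k for k in shared if portal[k] != working[k]),
    }


if __name__ == "__main__":
    for kind, names in compare(PORTAL, WORKING).items():
        print(f"{kind}: {', '.join(names) or '(none)'}")
```

For these two lists, the comparison flags maxLoopModel on the portal side against maxLoopModels on the working side, nSim and loopRunParamFile missing from the portal run, and different prepTar and seqFile values (the portal prepTar path contains a doubled slash).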