[Swift-devel] Fwd: raptor-loop model problem

Mihael Hategan hategan at mcs.anl.gov
Fri May 28 13:37:48 CDT 2010


What version of swift/coasters is this and can you post the coaster log
(on the remote site in ~/.globus/coasters)?

Mihael

On Fri, 2010-05-28 at 13:13 -0500, Michael Wilde wrote:
> Wenjun, in the attached log, I see 5 boost-threader jobs starting but not finishing.
> 
> Then *I think* the coasters start timing out with nothing else to do.
> 
> Mihael, can you take a look at this log and work with Wenjun to pinpoint the problem?
> 
> Thanks,
> 
> - Mike
> 
> ----- Forwarded Message -----
> From: "wenjun wu" <wwjag at mcs.anl.gov>
> To: "Michael Wilde" <wilde at mcs.anl.gov>
> Cc: "Thomas D. Uram" <turam at mcs.anl.gov>
> Sent: Friday, May 28, 2010 10:48:40 AM GMT -06:00 US/Canada Central
> Subject: Re: raptor-loop model problem
> 
> Hi Mike,
>      After I fixed the file uploader in the portal, the old problem
> "cannot open seq file PREPROCESSED/SEQ/T0411D1.seq for read!" is gone.
>     But the portal still can't get the results from the "BoostThreader"
> job.
>     The error in the workflow run log is:
>        2010-05-27 13:25:52,842-0500 WARN  RequestHandler
> org.globus.cog.karajan.workflow.service.channels.IrrecoverableException: 
> Coaster service ended. Reason: null
>      stdout:
>      stderr:
>      at 
> org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.statusChanged(ServiceManager.java:230)
>      at 
> org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:236)
>      at 
> org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:224)
>      at 
> org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:253)
>      at 
> org.globus.cog.abstraction.impl.ssh.execution.JobSubmissionTaskHandler.SSHTaskStatusChanged(JobSubmissionTaskHandler.java:193)
>      at 
> org.globus.cog.abstraction.impl.ssh.SSHRunner.notifyListeners(SSHRunner.java:84)
>      at org.globus.cog.abstraction.impl.ssh.SSHRunner.run(SSHRunner.java:43)
>      at java.lang.Thread.run(Thread.java:595)
> 2010-05-27 13:25:52,843-0500 INFO  AbstractStreamKarajanChannel 
> 1427072207: Channel shut down
> java.lang.Throwable
>      at 
> org.globus.cog.karajan.workflow.service.channels.AbstractTCPChannel.close(AbstractTCPChannel.java:97)
>      at 
> org.globus.cog.karajan.workflow.service.channels.MetaChannel.close(MetaChannel.java:87)
>      at 
> org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.statusChanged(ServiceManager.java:232)
>      at 
> org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:236)
>      at 
> org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:224)
>      at 
> org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:253)
>      at 
> org.globus.cog.abstraction.impl.ssh.execution.JobSubmissionTaskHandler.SSHTaskStatusChanged(JobSubmissionTaskHandler.java:193)
>      at 
> org.globus.cog.abstraction.impl.ssh.SSHRunner.notifyListeners(SSHRunner.java:84)
>      at org.globus.cog.abstraction.impl.ssh.SSHRunner.run(SSHRunner.java:43)
>      at java.lang.Thread.run(Thread.java:595)
> 2010-05-27 13:25:52,843-0500 INFO  ConnectionProtocol Freeing channel 4 
> [Unnamed Channel]
> 2010-05-27 13:25:58,866-0500 INFO  
> AbstractStreamKarajanChannel$Multiplexer No streams
> 2010-05-27 13:26:08,883-0500 INFO  
> AbstractStreamKarajanChannel$Multiplexer No streams
> 2010-05-27 13:26:18,905-0500 INFO  
> AbstractStreamKarajanChannel$Multiplexer No streams
> 2010-05-27 13:26:28,920-0500 INFO  
> AbstractStreamKarajanChannel$Multiplexer No streams
> 2010-05-27 13:26:38,924-0500 INFO  
> AbstractStreamKarajanChannel$Multiplexer No streams
> 2010-05-27 13:26:48,934-0500 INFO  
> AbstractStreamKarajanChannel$Multiplexer No streams
> 2010-05-27 13:26:56,456-0500 INFO  TransportProtocolCommon Sending 
> SSH_MSG_DISCONNECT
> 2010-05-27 13:26:56,456-0500 INFO  Service ssh-connection thread is exiting
> 
> It seems something went wrong after the BoostThreader job finished,
> so the Swift engine never gets the result back and waits forever.
> The coaster.log also contains some error messages:
> 
> 2010-05-27 13:25:46,741-0500 WARN  Command Command(38, JOBSTATUS): 
> handling reply timeout; sendReqTime=100527-132346.736, 
> sendTime=100527-132346.737, now=100527-132546.741
> 2010-05-27 13:25:46,741-0500 WARN  Command Command(38, JOBSTATUS)fault 
> was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>          at 
> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
>          at 
> org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
>          at java.util.TimerThread.mainLoop(Timer.java:512)
>          at java.util.TimerThread.run(Timer.java:462)
> 2010-05-27 13:25:46,851-0500 WARN  Command Command(39, JOBSTATUS): 
> handling reply timeout; sendReqTime=100527-132346.847, 
> sendTime=100527-132346.848, now=100527-132546.851
> 2010-05-27 13:25:46,851-0500 WARN  Command Command(39, JOBSTATUS)fault 
> was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>          at 
> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
>          at 
> org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
>          at java.util.TimerThread.mainLoop(Timer.java:512)
>          at java.util.TimerThread.run(Timer.java:462)
> 2010-05-27 13:25:47,335-0500 WARN  Command Command(40, JOBSTATUS): 
> handling reply timeout; sendReqTime=100527-132347.331, 
> sendTime=100527-132347.332, now=100527-132547.335
> 2010-05-27 13:25:47,335-0500 WARN  Command Command(40, JOBSTATUS)fault 
> was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>          at 
> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
>          at 
> org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
>          at java.util.TimerThread.mainLoop(Timer.java:512)
>          at java.util.TimerThread.run(Timer.java:462)
> 2010-05-27 13:25:47,478-0500 INFO  Cpu 0527-011146-000000:3 pull
> 2010-05-27 13:25:47,804-0500 INFO  BlockQueueProcessor Updated 
> allocsize: 6.519897787948322
> 2010-05-27 13:25:47,805-0500 INFO  BlockQueueProcessor allocsize = 
> 6.519897787948322, queuedsize = 0.0, qsz = 0
> 2010-05-27 13:25:47,805-0500 INFO  BlockQueueProcessor Plan time: 1
> 2010-05-27 13:25:48,084-0500 WARN  Command Command(41, JOBSTATUS): 
> handling reply timeout; sendReqTime=100527-132348.080, 
> sendTime=100527-132348.081, now=100527-132548.084
> 2010-05-27 13:25:48,084-0500 WARN  Command Command(41, JOBSTATUS)fault 
> was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>          at 
> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
>          at 
> org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
>          at java.util.TimerThread.mainLoop(Timer.java:512)
>          at java.util.TimerThread.run(Timer.java:462)
> 2010-05-27 13:25:48,480-0500 INFO  Cpu 0527-011146-000000:0 pull
> 2010-05-27 13:25:49,482-0500 INFO  Cpu 0527-011146-000000:5 pull
> 2010-05-27 13:25:49,734-0500 WARN  Command Command(42, JOBSTATUS): 
> handling reply timeout; sendReqTime=100527-132349.730, 
> sendTime=100527-132349.731, now=100527-132549.734
> 2010-05-27 13:25:49,734-0500 WARN  Command Command(42, JOBSTATUS)fault 
> was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>          at 
> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
>          at 
> org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
>          at java.util.TimerThread.mainLoop(Timer.java:512)
>          at java.util.TimerThread.run(Timer.java:462)
> 2010-05-27 13:25:50,007-0500 INFO  Block Shutting down block Block 
> 0527-011146-000000 (6x4200.000s)
> 2010-05-27 13:25:50,009-0500 INFO  BlockQueueProcessor Cleaned 1 done blocks
> 2010-05-27 13:25:50,009-0500 INFO  BlockQueueProcessor Updated 
> allocsize: 6.519849641186395
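
Each JOBSTATUS warning in the coaster.log excerpt above carries sendReqTime, sendTime and now fields, so the wait before each reply timeout can be read straight off the log. The sketch below is only an illustration: the function name is made up, the timestamp layout (yyMMdd-HHmmss.SSS) is inferred from the quoted lines, and the wrapped log entry is assumed to have been re-joined into one string first.

```python
import re
from datetime import datetime

# Assumed layout of sendReqTime/sendTime/now, inferred from the quoted log:
# two-digit year, month, day, a dash, then HHMMSS.milliseconds
STAMP_FORMAT = "%y%m%d-%H%M%S.%f"
FIELD_RE = re.compile(r"(sendReqTime|sendTime|now)=(\d{6}-\d{6}\.\d{3})")


def reply_wait_seconds(entry: str) -> float:
    """Seconds between sendTime and now in one 'handling reply timeout' entry."""
    fields = {
        name: datetime.strptime(value, STAMP_FORMAT)
        for name, value in FIELD_RE.findall(entry)
    }
    return (fields["now"] - fields["sendTime"]).total_seconds()


# First warning from the excerpt, with the mail line wrapping removed
entry = (
    "Command(38, JOBSTATUS): handling reply timeout; "
    "sendReqTime=100527-132346.736, sendTime=100527-132346.737, "
    "now=100527-132546.741"
)
wait = reply_wait_seconds(entry)
print(f"{wait:.3f} s")  # about two minutes between send and timeout
```

Commands 39 to 42 show the same gap of roughly two minutes. That is an observation from the log, not a statement about how the timeout is configured.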
> 
> The complete log is attached.
> 
> Wenjun
> > Hi Mike,
> >   My raptorloop script doesn't work from the portal but works when I run
> > it from the attached shell script.
> > After digging through the swift and raptor log files,
> > I found the error message: "cannot open seq file
> > PREPROCESSED/SEQ/T0411D1.seq for read!"
> > BoostThreader is supposed to untar the prepared tarball to PREPROCESSED.
> >   I have a feeling that it was caused by the input arguments. From the
> > portal, the input arguments look like:
> >
> > -target=T0411D1 -maxLoopModel=1 -minLoopModelScore=1.0 -minLoopSize=3 
> > -maxLoopsPerModel=10 -templateList=20
> >  -prepTar=/gpfs/pads/oops/scienceportal/apache-tomcat-5.5.27/webapps/SIDGridPortal//temp/AE00A497C18DB8885C24D04862A0909A/t0411d1.prep.tar.gz 
> >
> > -seqFile=/gpfs/pads/oops/scienceportal/apache-tomcat-5.5.27/webapps/SIDGridPortal//temp/1D8389428752F99E3C0A14789C07F55C/t0411d1.fasta 
> > -templatesPerJob=4
> > -nModels=10
> >
> >
> >   And the successful raptor run has the following arguments:
> >   -target=T0411D1 \
> > -seqFile=/home/aashish/testPrep/T0411D1.fasta \
> > -prepTar=/gpfs/pads/oops/scienceportal/apache-tomcat-5.5.27/webapps/SIDGridPortal/temp/AE00A497C18DB8885C24D04862A0909A/t0411d1.prep.tar.gz 
> > \
> > -templatesPerJob=4 -templateList=20 -nModels=10 -nSim=4 \
> > -loopRunParamFile=$(pwd)/loopmodels.param \
> > -maxLoopModels=1 \
> > -minLoopModelScore=1.0 \
> > -minLoopSize=3 \
> > -maxLoopsPerModel=10
> >
> >  Can you see any reason for this problem?
> >
> > Wenjun
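
The two argument lists quoted above are easier to compare mechanically than by eye. The following is a minimal sketch, not part of the portal or of Swift: parse_args is a hypothetical helper, the argument strings are copied from the message with the shell line continuations dropped, and $(pwd) is kept as literal text rather than expanded.

```python
import posixpath


def parse_args(text: str) -> dict:
    """Turn '-name=value' tokens into a dict; other tokens are ignored."""
    args = {}
    for token in text.split():
        if token.startswith("-") and "=" in token:
            name, value = token[1:].split("=", 1)
            args[name] = value
    return args


# Arguments as sent from the portal (copied from the message above)
portal = parse_args("""
-target=T0411D1 -maxLoopModel=1 -minLoopModelScore=1.0 -minLoopSize=3
-maxLoopsPerModel=10 -templateList=20
-prepTar=/gpfs/pads/oops/scienceportal/apache-tomcat-5.5.27/webapps/SIDGridPortal//temp/AE00A497C18DB8885C24D04862A0909A/t0411d1.prep.tar.gz
-seqFile=/gpfs/pads/oops/scienceportal/apache-tomcat-5.5.27/webapps/SIDGridPortal//temp/1D8389428752F99E3C0A14789C07F55C/t0411d1.fasta
-templatesPerJob=4 -nModels=10
""")

# Arguments of the successful run from the shell script
shell = parse_args("""
-target=T0411D1 -seqFile=/home/aashish/testPrep/T0411D1.fasta
-prepTar=/gpfs/pads/oops/scienceportal/apache-tomcat-5.5.27/webapps/SIDGridPortal/temp/AE00A497C18DB8885C24D04862A0909A/t0411d1.prep.tar.gz
-templatesPerJob=4 -templateList=20 -nModels=10 -nSim=4
-loopRunParamFile=$(pwd)/loopmodels.param
-maxLoopModels=1 -minLoopModelScore=1.0 -minLoopSize=3 -maxLoopsPerModel=10
""")

only_portal = sorted(set(portal) - set(shell))
only_shell = sorted(set(shell) - set(portal))
changed = sorted(k for k in set(portal) & set(shell) if portal[k] != shell[k])
# The prepTar values differ only by a doubled slash, which normpath collapses
same_prep_tar = posixpath.normpath(portal["prepTar"]) == posixpath.normpath(shell["prepTar"])

print("only in portal run:", only_portal)
print("only in shell run: ", only_shell)
print("different values:  ", changed)
print("prepTar equal after normalizing:", same_prep_tar)
```

This only lists the differences (a flag spelled maxLoopModel in one run and maxLoopModels in the other, nSim and loopRunParamFile present only in the shell run, and a different seqFile path). It does not show which of them, if any, matters. The follow-up message in this thread attributes the seq file error to the portal's file uploader.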
> 
> 